FoodFight Agent

4/2/2025

Problem Statement

Internal operations like reading active contests, creating test bets, and querying platform state all require navigating the dev panel. This is slow for our team and confusing for teammates who are less familiar with the panel. There is currently no conversational interface for interacting with the FoodFight platform programmatically.

To address this, we can introduce an agentic workflow that understands the platform domain and executes actions on behalf of a user through natural language. The two immediate surfaces are the dev panel and a Mattermost bot for the broader team. The initial scope is contest reading and creation, with a clear path to expanding to other actions (querying user state, triggering notifications, user analytics, etc.) as new tools are added. Longer term, letting players trigger their own contests directly from the FoodFight platform could expand the agent beyond internal operations into a player-facing feature.

Proposed Solution

Deploy a FoodFight Agent as an async Lambda handler backed by the OpenAI Agents SDK. The agent can be given a set of Python function tools that call our existing microservices (e.g. bet_service, user_service) and utilize our libs. Conversation history is persisted per-session in Postgres as a JSONB column so the Lambda remains stateless and each invocation picks up exactly where the last left off.

Two clients invoke the same Lambda endpoint:

  • Dev Panel — a chat UI embedded in the existing internal dashboard, calling the Lambda via API Gateway
  • Mattermost Bot — a webhook integration that forwards messages to the same endpoint and posts replies back to the channel

Both surfaces pass a session_id and message and receive a response. The agent logic, tool definitions, and history management are identical for both. For simplicity and reliability, this follows a request/response pattern: the agent processes the message, calls tools as needed, and returns a final response in a single invocation. If processing regularly exceeds API Gateway's 29-second limit, we can switch to an async invocation model with WebSockets or polling; that seems excessive for the expected complexity of the agent's tasks and its anticipated usage, so we start synchronous.
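The shared contract between both surfaces is small enough to sketch directly. The field names (session_id, message, response) come from the design above; the dataclass wrappers themselves are illustrative, not a committed API:

```python
from dataclasses import dataclass

@dataclass
class AgentRequest:
    session_id: str   # Cognito user ID (dev panel) or channel/user ID (Mattermost)
    message: str      # raw natural-language input from the user

@dataclass
class AgentResponse:
    response: str     # the agent's final output for this turn

# Example: a dev-panel request (the "dev-panel:42" ID format is hypothetical)
req = AgentRequest(session_id="dev-panel:42", message="List active promo fights")
resp = AgentResponse(response="There are 3 active promo fights.")
```

Keeping the contract this narrow is what lets both clients share one Lambda with no surface-specific branching in the agent logic.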

Architectural & Technical Details

  Dev Panel  ──REST──▶  API Gateway (Cognito Auth)
  Mattermost ──Webhook─▶        │
                                ▼
                          Agent Lambda
                          ├─ load history (Postgres / RDS Proxy)
                          ├─ Runner.run(agent)
                          │    ├─ get_promo_fights   ──▶ backend/libs (SQLAlchemy)
                          │    ├─ create_promo_fight ──▶ backend/libs (SQLAlchemy)
                          │    └─ accept_promo_fight ──▶ backend/libs (SQLAlchemy)
                          └─ save history (Postgres / RDS Proxy)

Request lifecycle: API Gateway authenticates the request and forwards { session_id, message } to the Lambda. The Lambda loads the session’s JSONB history, appends the user message, and calls Runner.run(). The SDK handles the internal loop — it calls the model, invokes any requested tools, and feeds results back until a final response is produced. The updated history (result.to_input_list()) is upserted back to Postgres and the response returned to the caller.

Tool implementation — Tools are @function_tool decorated Python functions. The agent treats them as black boxes — it only sees the name, docstring, and parameters. The implementation detail of how a tool fetches or writes data is irrelevant to the agent.

Since this is a monorepo and bet_service already runs as a Lambda backed by backend/libs, the agent Lambda can import the same shared libs directly — using the SQLAlchemy models and service layer from libs/db and libs/schemas to query and write to the DB without any inter-service call. This is the simplest approach: no HTTP overhead, no Lambda-to-Lambda invocation, no API Gateway cost per tool call.

Promo fights are the FoodFight concept for contests. The relevant schemas are LiveBetBase (creation, with bet_type="promotion") and PromoBase (accept/delete).

  • get_promo_fights(restaurant_id: int | None) — queries DB via libs/db models, filtered by restaurant or all venues
  • create_promo_fight(restaurant_id: int, maker_outcome: int, restaurant_items: list[MenuItemOrder], takeout_type: int, maker_address: str, maker_payment_intent: str) — writes to DB using LiveBetBase with bet_type="promotion"
  • accept_promo_fight(restaurant_id: int, user_preferred_outcome: int, bet_id: int | None, menu_item_ids: list[int] | None) — writes to DB using PromoBase

An alternative is to have tools invoke bet_service directly over HTTP rather than importing libs. This keeps a cleaner service boundary but comes with meaningful downsides in an agentic context: a single agent turn can trigger multiple tool calls, so each HTTP round-trip compounds — adding latency on top of the LLM call latency already present. Beyond latency, it also means managing internal service auth (the agent Lambda would need a valid token or IAM-based service-to-service auth), handling service availability/retries as a separate failure mode, and paying API Gateway invocation costs per tool call. Since the agent treats the tool implementation as a black box regardless, there’s no benefit to the extra indirection at this stage.

Session storage

CREATE TABLE agent_sessions (
    session_id  UUID PRIMARY KEY,
    history     JSONB NOT NULL DEFAULT '[]',
    created_at  TIMESTAMP DEFAULT NOW(),
    updated_at  TIMESTAMP DEFAULT NOW()
);

Use SELECT ... FOR UPDATE inside a transaction when loading history to prevent race conditions on concurrent messages in the same session (the row lock is only held for the duration of the transaction). History grows with each turn; we can introduce truncation or summarization if it becomes unwieldy.
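One hypothetical truncation strategy: keep the first item (e.g. system context) plus the most recent N items and drop the middle. Note that naive truncation can orphan a tool result from its tool call, so a real implementation would trim on turn boundaries:

```python
def truncate_history(history: list[dict], max_items: int = 50) -> list[dict]:
    """Keep the first item plus the newest (max_items - 1) items."""
    if len(history) <= max_items:
        return history
    return history[:1] + history[-(max_items - 1):]

history = [{"role": "system", "content": "ctx"}] + [
    {"role": "user", "content": f"msg {i}"} for i in range(100)
]
trimmed = truncate_history(history, max_items=10)
# First item survives, the oldest user messages are dropped,
# and the newest messages are intact.
```

Summarization (replacing the dropped middle with a model-written summary) preserves more context but adds an extra LLM call per compaction, so truncation is the cheaper starting point.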

Infrastructure notes

  • New Lambda endpoint with a 30–60s timeout to allow for multiple LLM round-trips in a single turn. API Gateway enforces a 29s integration timeout by default; if turns regularly exceed that, we may need async invocation with WebSockets or polling, or can request an integration-timeout quota increase from AWS.
  • Mattermost auth via shared secret in Secrets Manager instead of Cognito

    Surface          Auth            Session ID
    Dev Panel        Cognito JWT     User ID
    Mattermost Bot   Shared secret   Channel or user ID
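Mattermost outgoing webhooks include a token field in each payload, so the shared-secret check reduces to a constant-time comparison. A minimal sketch (fetching the expected token from Secrets Manager is omitted):

```python
import hmac

def verify_mattermost_token(received_token: str, expected_token: str) -> bool:
    """Constant-time comparison of the webhook token against our secret."""
    # compare_digest avoids leaking match length via timing side channels
    return hmac.compare_digest(received_token, expected_token)

ok = verify_mattermost_token("s3cret", "s3cret")
bad = verify_mattermost_token("wrong", "s3cret")
```

Requests failing this check should be rejected before any history is loaded or the agent is run.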

Code Snippet

The agent runs inside an async Lambda handler. Conversation history is persisted as a JSONB column in Postgres so each invocation is stateless — the Lambda loads history, runs the agent, and writes the updated history back.

# pseudocode — not production ready

import json

async def handler(event, context):
    session_id = event["session_id"]
    user_message = event["message"]

    # FOR UPDATE only holds the row lock while a transaction is open,
    # so the load/run/save cycle is wrapped in one. Note: FOR UPDATE
    # cannot lock a row that doesn't exist yet; the upsert's ON CONFLICT
    # handles the first-message race.
    async with db.transaction():
        raw = await db.fetchval(
            "SELECT history FROM agent_sessions WHERE session_id = $1 FOR UPDATE",
            session_id,
        )
        history = json.loads(raw) if raw else []

        history.append({"role": "user", "content": user_message})

        result = await Runner.run(agent, input=history)

        await db.execute(
            """
            INSERT INTO agent_sessions (session_id, history)
            VALUES ($1, $2)
            ON CONFLICT (session_id) DO UPDATE
                SET history = $2, updated_at = NOW()
            """,
            session_id,
            json.dumps(result.to_input_list()),
        )

    return {"response": result.final_output}

result.to_input_list() returns the full updated history including tool call requests and results — this is what gives the model complete context on prior tool invocations in subsequent turns.
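The exact item schema is the SDK's concern, but the persisted history is roughly a flat list where tool requests and results sit inline between messages. An illustrative shape after one turn with one tool call (field names approximate):

```python
# Illustrative only: the precise fields are defined by the Agents SDK.
history = [
    {"role": "user", "content": "Any promo fights at restaurant 7?"},
    {"type": "function_call", "call_id": "call_1",
     "name": "get_promo_fights", "arguments": '{"restaurant_id": 7}'},
    {"type": "function_call_output", "call_id": "call_1",
     "output": "[]"},
    {"role": "assistant", "content": "No active promo fights at restaurant 7."},
]
```

Because the call and its output are stored inline, a follow-up like "create one then" lets the model see exactly which restaurant was already queried and what came back.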

Alternatives

OpenAI Agents SDK vs. Anthropic Claude Agent SDK

We are using the OpenAI Agents SDK and plan to stay there. We have free credits and the cost profile is significantly cheaper at our current scale. The SDK surface is also the right fit for this use case: @function_tool, RunHooks, Runner.run(), and to_input_list() cover everything we need with minimal boilerplate.

Switching SDKs is a non-trivial migration. Tools are not portable — the two SDKs have meaningfully different APIs at every layer:

  • Tool definition: OpenAI uses @function_tool with typed params and auto-schema from type hints; Anthropic uses @tool(name, desc, schema_dict) with explicit schemas
  • Tool arguments: typed function parameters (def get_promo_fights(restaurant_id: int)) vs. dict-based (def get_promo_fights(args: dict))
  • Tool return: plain string or object vs. wrapped in {"content": [{"type": "text", "text": ...}]}
  • Run loop: await Runner.run(agent, input=history) vs. async for msg in query(prompt=..., options=...)
  • Conversation history: manual via result.to_input_list() or Session backends vs. automatic session resumption (resume=session_id)
  • Hooks: subclass RunHooks and override on_tool_start / on_tool_end vs. callback functions registered with HookMatcher regex patterns

A migration would require rewriting every @function_tool decorator and function signature, all history management, the run loop call sites, and any hooks. Realistically a few days of work for this agent, with risk of subtle behavioural differences.

The Anthropic SDK would be worth revisiting if we need its built-in tools (Read, Write, Bash, Grep, etc.) or better performance on complex multi-tool tasks at higher volume, but it is not a drop-in swap.

Why MCP Is Not Necessary Here

MCP (Model Context Protocol) is useful when tools live in separate processes or external servers that need to be discovered over a network boundary — e.g. a third-party SaaS vendor or a shared tool server across many agents.

For the FoodFight Agent, tools are in-process Python functions in the same Lambda package calling our own services. MCP would add a server/client protocol layer with no benefit. Worth revisiting only if we want to expose FoodFight tools to external agents or third-party systems.

Why not the raw Chat Completions API (client.chat.completions.create())?

While we could implement a simple turn-based agent loop ourselves using the standard ChatCompletion API, the OpenAI Agents SDK provides a lot of value out of the box:

  • Automatic tool schema generation from Python function signatures and docstrings
  • Built-in support for multi-turn conversations with tool calls and results included in the context
  • Hooks for logging, analytics, or custom behavior on tool calls
  • A clean abstraction layer that keeps the agent logic focused on defining tools and handling results, rather than managing the conversation loop and context formatting manually
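To make the trade-off concrete, here is a sketch of the manual tool-call loop the SDK replaces. call_model stands in for a raw chat-completions call; a real loop would also translate tool schemas and parse the provider's response format:

```python
def run_manual_loop(call_model, tools: dict, history: list) -> str:
    """Repeatedly call the model, executing requested tools, until it finishes."""
    while True:
        reply = call_model(history)
        if reply["type"] == "final":
            history.append({"role": "assistant", "content": reply["content"]})
            return reply["content"]
        # Model asked for a tool: invoke it and feed the result back.
        result = tools[reply["name"]](**reply["arguments"])
        history.append({"type": "function_call", "name": reply["name"]})
        history.append({"type": "function_call_output", "output": result})

# Stubbed model: first requests a tool, then produces a final answer.
replies = iter([
    {"type": "tool", "name": "get_promo_fights",
     "arguments": {"restaurant_id": 7}},
    {"type": "final", "content": "No promo fights found."},
])
answer = run_manual_loop(
    lambda history: next(replies),
    {"get_promo_fights": lambda restaurant_id: "[]"},
    [{"role": "user", "content": "Any promo fights?"}],
)
```

Even this toy version has to track call/result pairing and loop termination; Runner.run() absorbs all of it, which is the main argument for the SDK over a hand-rolled loop.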

Next Steps

  • Finalize tool signatures for get_promo_fights, create_promo_fight, and accept_promo_fight
  • Wire up dev panel chat UI to the Lambda endpoint
  • Set up Mattermost webhook integration

Open Questions

  • Should session history be scoped per-user or per-channel for the Mattermost bot?
  • Do we want the agent to have write access to bet_service from day one, or start read-only?
  • History will grow unbounded — what is the truncation or summarization strategy for long sessions?

Approvals

Architectural approval is required from Trace Carrasco, and product approval from Filip Pacyna / Troy Lenihan.

  • Architecture: Trace Carrasco
  • Product: Filip Pacyna / Troy Lenihan