LLM Call Tracking
Waxell Observe records every LLM API call made by your agents, capturing the model, token counts, estimated cost, and optional previews of prompts and responses. This data powers dashboards, cost analysis, and governance enforcement.
What Data is Captured
Each LLM call record includes:
| Field | Type | Description |
|---|---|---|
| `model` | str | Model identifier (e.g., `"gpt-4o"`, `"claude-sonnet-4"`) |
| `tokens_in` | int | Number of input/prompt tokens |
| `tokens_out` | int | Number of output/completion tokens |
| `cost` | float | Cost in USD (auto-estimated if not provided) |
| `task` | str | Optional label describing the call's purpose |
| `prompt_preview` | str | Optional preview of the prompt text |
| `response_preview` | str | Optional preview of the response text |
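The table above can be sketched as a plain data class. `LLMCallRecord` is a hypothetical name for illustration, not the SDK's actual type; it simply mirrors the documented fields:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LLMCallRecord:
    """Hypothetical shape of one captured LLM call (mirrors the field table)."""
    model: str
    tokens_in: int
    tokens_out: int
    cost: float = 0.0               # auto-estimated if not provided
    task: Optional[str] = None
    prompt_preview: Optional[str] = None
    response_preview: Optional[str] = None

# Example record: 1,200 prompt tokens and 350 completion tokens on gpt-4o
record = LLMCallRecord(
    model="gpt-4o",
    tokens_in=1200,
    tokens_out=350,
    cost=0.0065,
    task="summarize",
)
```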
Two Ways to Capture
Automatic: LangChain Handler
The WaxellLangChainHandler intercepts LangChain's callback system and automatically extracts:
- Model name from the serialized LLM configuration
- Token counts from LangChain's `token_usage` in the LLM response
- Prompt preview (first 500 characters of the prompt)
- Response preview (first 500 characters of the generated text)
- Cost estimated automatically from token counts and model pricing
```python
from waxell_observe.integrations.langchain import WaxellLangChainHandler

handler = WaxellLangChainHandler(agent_name="my-agent")
result = chain.invoke(input, config={"callbacks": [handler]})
handler.flush_sync(result={"output": result})
```
No manual recording needed -- every LLM call in the chain is captured.
Manual: Context Methods
For non-LangChain frameworks, record LLM calls explicitly using `record_llm_call`:
```python
from waxell_observe import WaxellContext

async with WaxellContext(agent_name="my-agent") as ctx:
    # Call your LLM
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )

    # Record the call
    ctx.record_llm_call(
        model="gpt-4o",
        tokens_in=response.usage.prompt_tokens,
        tokens_out=response.usage.completion_tokens,
        task="answer_question",
        prompt_preview=query[:500],
        response_preview=response.choices[0].message.content[:500],
    )
```
Or with the decorator and context injection:
```python
from waxell_observe import waxell_agent

@waxell_agent(agent_name="my-agent")
async def answer(query: str, waxell_ctx=None) -> str:
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
    )
    if waxell_ctx:
        waxell_ctx.record_llm_call(
            model="gpt-4o",
            tokens_in=response.usage.prompt_tokens,
            tokens_out=response.usage.completion_tokens,
            task="answer_question",
        )
    return response.choices[0].message.content
```
Supported Models
Cost estimation is built in for the following models. For unlisted models, provide the `cost` parameter manually or configure a tenant override on the server (see Cost Management).
OpenAI
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| gpt-4o | $2.50 | $10.00 |
| gpt-4o-mini | $0.15 | $0.60 |
| gpt-4-turbo | $10.00 | $30.00 |
| gpt-4 | $30.00 | $60.00 |
| gpt-3.5-turbo | $0.50 | $1.50 |
| o1 | $15.00 | $60.00 |
| o1-mini | $3.00 | $12.00 |
| o3-mini | $1.10 | $4.40 |
Anthropic
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| claude-opus-4 | $15.00 | $75.00 |
| claude-sonnet-4 | $3.00 | $15.00 |
| claude-3-5-sonnet | $3.00 | $15.00 |
| claude-3-5-haiku | $0.80 | $4.00 |
| claude-3-haiku | $0.25 | $1.25 |
Google
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| gemini-2.0-flash | $0.10 | $0.40 |
| gemini-1.5-pro | $1.25 | $5.00 |
| gemini-1.5-flash | $0.075 | $0.30 |
Meta (via Groq, Together, etc.)
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| llama-3.3-70b | $0.59 | $0.79 |
| llama-3.1-8b | $0.05 | $0.08 |
Mistral
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| mistral-large | $2.00 | $6.00 |
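The arithmetic behind these tables is linear per-token pricing. As a minimal sketch (the helper name `estimate_cost` is an assumption, not the SDK's actual API):

```python
def estimate_cost(tokens_in: int, tokens_out: int,
                  input_per_m: float, output_per_m: float) -> float:
    """Cost in USD given per-1M-token prices from the tables above."""
    return (tokens_in / 1_000_000) * input_per_m + (tokens_out / 1_000_000) * output_per_m

# gpt-4o: $2.50 input / $10.00 output per 1M tokens
cost = estimate_cost(1_000, 500, 2.50, 10.00)  # 0.0025 + 0.0050 = $0.0075
```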
Model Name Matching
The cost estimator uses prefix matching, so versioned model names work automatically:
"gpt-4o-2024-08-06"matches"gpt-4o""claude-3-5-sonnet-20241022"matches"claude-3-5-sonnet"
Longer prefixes are matched first for specificity, so "gpt-4o-mini" matches before "gpt-4o".
If no match is found, cost is 0.0. You can provide the cost explicitly or set up a server-side tenant override.
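Longest-prefix-first lookup can be sketched as follows. The `MODEL_COSTS` excerpt and the `lookup_pricing` helper are illustrative assumptions, not the SDK's internals:

```python
# Excerpt of a pricing table: (input, output) USD per 1M tokens
MODEL_COSTS = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3-5-sonnet": (3.00, 15.00),
}

def lookup_pricing(model: str):
    # Sort prefixes longest-first so "gpt-4o-mini-..." never
    # falls through to the shorter "gpt-4o" entry.
    for prefix in sorted(MODEL_COSTS, key=len, reverse=True):
        if model.startswith(prefix):
            return MODEL_COSTS[prefix]
    return None  # unknown model: caller treats cost as 0.0
```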
How Data Flows
- Your agent records LLM calls (manually or via LangChain handler)
- Calls are buffered in memory during execution
- On flush/context exit, calls are sent to the control plane via `POST /api/v1/observe/runs/{run_id}/llm-calls/`
- The control plane stores the data, recalculates costs if server-side pricing is available, and updates dashboards
- Dashboards show per-agent, per-model, and per-run cost breakdowns
Server-side cost calculation takes precedence over client-side estimates. If you configure tenant-level model cost overrides on the control plane, those prices are used instead of the client-side `MODEL_COSTS` table.
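The flush step above uses the documented endpoint. As a rough sketch of the request the SDK builds for you (the `"llm_calls"` payload key and bearer-token auth are assumptions; only the URL path comes from the docs):

```python
import json
import urllib.request

def build_flush_request(base_url: str, run_id: str,
                        calls: list, token: str) -> urllib.request.Request:
    """Build the POST that ships buffered LLM calls to the control plane.

    The SDK normally constructs and sends this on flush/context exit;
    the payload shape here is a hypothetical illustration.
    """
    return urllib.request.Request(
        f"{base_url}/api/v1/observe/runs/{run_id}/llm-calls/",
        data=json.dumps({"llm_calls": calls}).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

req = build_flush_request(
    "https://observe.example.com", "run-123",
    [{"model": "gpt-4o", "tokens_in": 1200, "tokens_out": 350}],
    "my-api-token",
)
```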
Next Steps
- Cost Management -- Budget enforcement and tenant overrides
- Decorator Pattern -- Quick integration with `@waxell_agent`
- LangChain Integration -- Automatic capture with LangChain