Cost Optimization
Reduce LLM spending without sacrificing quality. This tutorial walks you through identifying your biggest cost drivers, setting custom pricing for negotiated rates, and establishing strategies to keep costs under control.
Prerequisites
- waxell-observe installed and sending data (pip install waxell-observe)
- At least a few hours of LLM call history in your Waxell instance
- A Waxell API key with dashboard access
What You'll Learn
- How to view model usage analytics and identify cost drivers
- How to set custom model costs for negotiated pricing
- How to analyze per-user and per-session costs
- How to set up budget policies via the governance system
- Strategies for optimizing LLM costs across your agents
Step 1: View Model Analytics
The model analytics endpoint gives you a breakdown of cost, tokens, and call volume per model over time.
curl -X GET "https://acme.waxell.dev/api/v1/observability/analytics/models/?period=7d" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json"
The response includes overall aggregates, per-model totals, and a daily time series:
{
"period": "7d",
"aggregates": {
"total_cost": 142.385012,
"total_tokens": 28450000,
"total_calls": 15230
},
"model_totals": [
{
"model": "gpt-4o",
"total_cost": 98.120000,
"total_tokens": 12800000,
"total_calls": 4200,
"percentage": 68.9
},
{
"model": "gpt-4o-mini",
"total_cost": 22.450000,
"total_tokens": 9200000,
"total_calls": 8100,
"percentage": 15.8
},
{
"model": "claude-sonnet-4-5-20250929",
"total_cost": 21.815012,
"total_tokens": 6450000,
"total_calls": 2930,
"percentage": 15.3
}
],
"time_series": [
{
"date": "2026-02-01",
"model": "gpt-4o",
"cost": 14.230000,
"tokens": 1850000,
"calls": 610
}
]
}
Use the agent query parameter to filter analytics to a specific agent: ?period=7d&agent=support-bot. This helps isolate which agents are driving cost.
Step 2: Identify Top Cost Drivers
From the model analytics response, sort model_totals by percentage to see which models consume the most budget. In the example above:
| Model | Cost | % of Total | Calls |
|---|---|---|---|
| gpt-4o | $98.12 | 68.9% | 4,200 |
| gpt-4o-mini | $22.45 | 15.8% | 8,100 |
| claude-sonnet-4-5-20250929 | $21.82 | 15.3% | 2,930 |
Key observations:
- gpt-4o accounts for nearly 70% of cost but only 28% of calls -- this is the primary optimization target
- gpt-4o-mini handles the most calls at a fraction of the cost -- a good candidate for routing simpler tasks
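The ranking above can be reproduced programmatically from the Step 1 response. A minimal sketch using the `model_totals` field shown earlier (the sample values are copied from that response; in practice you would parse the live analytics payload):

```python
# model_totals as returned by the analytics endpoint in Step 1.
model_totals = [
    {"model": "gpt-4o-mini", "total_cost": 22.45, "total_calls": 8100, "percentage": 15.8},
    {"model": "gpt-4o", "total_cost": 98.12, "total_calls": 4200, "percentage": 68.9},
    {"model": "claude-sonnet-4-5-20250929", "total_cost": 21.82, "total_calls": 2930, "percentage": 15.3},
]

# Rank models by share of total cost, descending.
ranked = sorted(model_totals, key=lambda m: m["percentage"], reverse=True)

for m in ranked:
    cost_per_call = m["total_cost"] / m["total_calls"]
    print(f'{m["model"]}: {m["percentage"]}% of spend, ${cost_per_call:.4f}/call')
```

Cost per call is often more revealing than total cost: it shows which model is expensive per invocation versus merely high-volume.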
Step 3: Set Custom Model Costs
If you have negotiated pricing with your LLM provider (volume discounts, enterprise agreements), set custom model costs so your dashboards reflect actual spend rather than list prices.
List current model costs:
curl -X GET "https://acme.waxell.dev/api/v1/observe/model-costs/" \
-H "X-Wax-Key: wax_sk_..." \
-H "Content-Type: application/json"
Response:
{
"models": [
{
"model_id": "gpt-4o",
"display_name": "gpt-4o",
"provider": "openai",
"input_cost_per_million": 2.5,
"output_cost_per_million": 10.0,
"source": "default"
}
]
}
Set a custom cost override for your negotiated rate:
curl -X PUT "https://acme.waxell.dev/api/v1/observe/model-costs/gpt-4o/" \
-H "X-Wax-Key: wax_sk_..." \
-H "Content-Type: application/json" \
-d '{
"input_cost_per_million": 2.0,
"output_cost_per_million": 8.0
}'
Response:
{
"model_id": "gpt-4o",
"input_cost_per_million": 2.0,
"output_cost_per_million": 8.0,
"source": "custom"
}
Reset to system defaults:
curl -X DELETE "https://acme.waxell.dev/api/v1/observe/model-costs/gpt-4o/" \
-H "X-Wax-Key: wax_sk_..."
Custom model costs are per-tenant. Each tenant can set their own pricing to match their provider agreements. Future LLM calls will use the custom cost for calculations.
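To see what an override changes, note that per-call cost under per-million-token pricing is just token counts times the rates. A hedged sketch of that arithmetic (the formula is standard per-million pricing; Waxell's internal rounding may differ slightly):

```python
def call_cost(tokens_in: int, tokens_out: int,
              input_cost_per_million: float,
              output_cost_per_million: float) -> float:
    """Cost of one LLM call under per-million-token pricing."""
    return (tokens_in / 1_000_000) * input_cost_per_million \
         + (tokens_out / 1_000_000) * output_cost_per_million

# gpt-4o at list price ($2.50 in / $10.00 out) vs. the
# negotiated override from the PUT above ($2.00 / $8.00):
list_cost = call_cost(500, 400, 2.5, 10.0)
custom_cost = call_cost(500, 400, 2.0, 8.0)
print(f"list ${list_cost:.6f} vs custom ${custom_cost:.6f}")
```

For this example call the override trims cost from $0.00525 to $0.00420, a 20% reduction that compounds across every future call.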
Step 4: View Per-User Costs
Identify which end users are driving the most cost. This is useful for usage-based billing, abuse detection, or capacity planning.
curl -X GET "https://acme.waxell.dev/api/v1/observability/users/?sort=-run_count" \
-H "Authorization: Bearer <your-session-token>"
The sort parameter on the users endpoint supports -last_seen, first_seen, -first_seen, run_count, and -run_count. For cost-based sorting, pull the results and sort client-side, or use the UI's built-in sort controls.
Response:
{
"results": [
{
"user_id": "user_8a3f",
"run_count": 342,
"first_seen": "2026-01-15T10:00:00Z",
"last_seen": "2026-02-07T14:30:00Z",
"total_duration": 1842.5,
"total_cost": 28.450012,
"total_tokens": 5200000,
"agents": ["support-bot", "research-assistant"]
}
],
"count": 156,
"next": "?offset=25&limit=25",
"previous": null
}
For detailed per-user breakdown by model, fetch the user detail:
curl -X GET "https://acme.waxell.dev/api/v1/observability/users/user_8a3f/" \
-H "Authorization: Bearer <your-session-token>"
This returns cost_by_model showing exactly which models each user is consuming.
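Since the users endpoint does not sort by cost server-side, rank the paginated results client-side. A minimal sketch over records shaped like the response above (the sample data is illustrative):

```python
def rank_by_cost(users: list[dict], limit: int = 10) -> list[dict]:
    """Rank user records from the users endpoint by total_cost, descending."""
    return sorted(users, key=lambda u: u["total_cost"], reverse=True)[:limit]

# Records as returned in the "results" array of the users endpoint.
sample = [
    {"user_id": "user_8a3f", "total_cost": 28.450012, "run_count": 342},
    {"user_id": "user_19c2", "total_cost": 3.12, "run_count": 980},
]
top = rank_by_cost(sample)
print(top[0]["user_id"])  # highest-cost user, despite fewer runs than user_19c2
```

When the result set spans multiple pages, follow the next links (offset/limit pagination) to collect all records before ranking.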
Step 5: View Per-Session Costs
Sessions group related runs together (e.g., a multi-turn conversation). Track session-level cost to understand the cost of complete user interactions.
curl -X GET "https://acme.waxell.dev/api/v1/observability/sessions/?sort=-last_activity" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"results": [
{
"session_id": "sess_a1b2c3d4e5f6g7h8",
"run_count": 8,
"first_run": "2026-02-07T10:00:00Z",
"last_activity": "2026-02-07T10:15:00Z",
"total_duration": 45.2,
"total_cost": 0.8450,
"total_tokens": 42000,
"agents": ["support-bot"]
}
],
"count": 2340
}
High per-session cost often indicates opportunities for caching or prompt optimization. If users are asking similar questions repeatedly, consider implementing a semantic cache layer.
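One way to act on this: compute cost per run for each session and flag outliers. A sketch over records shaped like the response above (the 3x-median threshold is an arbitrary choice, not a Waxell setting):

```python
from statistics import median

def flag_expensive_sessions(sessions: list[dict], factor: float = 3.0) -> list[str]:
    """Return session_ids whose cost-per-run exceeds factor x the median."""
    per_run = {s["session_id"]: s["total_cost"] / s["run_count"] for s in sessions}
    baseline = median(per_run.values())
    return [sid for sid, cost in per_run.items() if cost > factor * baseline]

# Records as returned in the "results" array of the sessions endpoint.
sessions = [
    {"session_id": "sess_a1b2", "total_cost": 0.845, "run_count": 8},
    {"session_id": "sess_c3d4", "total_cost": 0.120, "run_count": 6},
    {"session_id": "sess_e5f6", "total_cost": 2.400, "run_count": 4},
]
print(flag_expensive_sessions(sessions))
```

Flagged sessions are good candidates for a manual look in the LLM Calls explorer: long prompts, retry loops, and redundant context usually show up immediately.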
Step 6: Set Up Budget Policies
Waxell's governance system can enforce budget limits at the agent level. When an agent exceeds its budget, the policy can warn or block further execution.
Budget policies are configured through the Waxell control plane. Here is the general approach:
- Navigate to Policies in the control plane dashboard
- Create a new policy with type Budget
- Set the budget threshold (e.g., max cost per day, max tokens per hour)
- Choose the action: warn (log a warning but allow execution) or block (prevent the run from starting)
- Assign the policy to specific agents or apply it globally
# In your agent code, budget policies are enforced automatically
# via the @waxell_agent decorator or WaxellContext:
from waxell_observe import waxell_agent
from waxell_observe.errors import PolicyViolationError
@waxell_agent(agent_name="support-bot", enforce_policy=True)
async def handle_query(query: str, waxell_ctx=None) -> str:
# If a budget policy blocks this agent, a PolicyViolationError
# is raised before execution begins.
response = await call_llm(query)
return response
# You can also catch the error explicitly:
try:
result = await handle_query("How do I reset my password?")
except PolicyViolationError as e:
print(f"Agent blocked by policy: {e}")
# Fall back to a cached response or queue for later
For full governance documentation, see Policy & Governance.
Step 7: Track Cost Trends
Use the model analytics time series data to track cost trends over time. The time_series array in the analytics response provides daily aggregates per model.
import httpx
# Fetch 30-day trends
async def get_cost_trends():
async with httpx.AsyncClient() as client:
response = await client.get(
"https://acme.waxell.dev/api/v1/observability/analytics/models/",
params={"period": "30d"},
headers={"Authorization": "Bearer <token>"},
)
data = response.json()
# Calculate daily totals across all models
daily_costs = {}
for point in data["time_series"]:
date = point["date"]
daily_costs[date] = daily_costs.get(date, 0) + point["cost"]
# Check for cost spikes
costs = list(daily_costs.values())
if len(costs) >= 2:
avg_cost = sum(costs[:-1]) / len(costs[:-1])
latest_cost = costs[-1]
if latest_cost > avg_cost * 1.5:
print(f"Cost spike detected: ${latest_cost:.2f} vs avg ${avg_cost:.2f}")
return daily_costs
In the Waxell dashboard, the Analytics > Cost Analytics page provides interactive charts for:
- Daily cost by model (stacked area chart)
- Token usage trends
- Cost per agent over time
- Model usage distribution (pie chart)
Step 8: Optimization Strategies
Model Routing
Route simpler tasks to cheaper models. Not every LLM call needs the most expensive model.
from waxell_observe import waxell_agent
@waxell_agent(agent_name="smart-router")
async def handle_query(query: str, waxell_ctx=None) -> str:
# Classify the query complexity
complexity = classify_complexity(query) # your classifier
if complexity == "simple":
# FAQ, simple lookups -- use the cheapest model
response = await call_llm(query, model="gpt-4o-mini")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o-mini", tokens_in=50, tokens_out=30)
elif complexity == "moderate":
# Summaries, standard questions
response = await call_llm(query, model="claude-sonnet-4-5-20250929")
if waxell_ctx:
waxell_ctx.record_llm_call(model="claude-sonnet-4-5-20250929", tokens_in=200, tokens_out=150)
else:
# Complex reasoning, code generation
response = await call_llm(query, model="gpt-4o")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o", tokens_in=500, tokens_out=400)
return response
Prompt Optimization
Shorter prompts mean fewer input tokens. Review your highest-cost agents in the LLM Calls explorer and look for:
- Redundant instructions in system prompts that can be removed
- Verbose examples that can be condensed
- Unnecessary context being passed to every call
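To quantify what a trimmed system prompt saves, multiply the token reduction by the input rate and call volume. A rough sketch (the ~4-characters-per-token ratio is a common heuristic for English text, not an exact tokenizer; use your provider's tokenizer for real numbers):

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def monthly_prompt_savings(old_prompt: str, new_prompt: str,
                           calls_per_month: int,
                           input_cost_per_million: float) -> float:
    """Estimated monthly savings from shortening a prompt sent on every call."""
    saved_tokens = rough_tokens(old_prompt) - rough_tokens(new_prompt)
    return saved_tokens * calls_per_month * input_cost_per_million / 1_000_000

# Trimming 2,000 characters (~500 tokens) from a system prompt, at
# gpt-4o's $2.50/M input rate and 100k calls per month:
savings = monthly_prompt_savings("x" * 3000, "x" * 1000, 100_000, 2.5)
print(f"${savings:.2f}/month")
```

Even modest per-call reductions compound quickly at high call volume, which is why system-prompt trims are usually the cheapest optimization to ship.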
Caching
For agents that handle repeated or similar queries, add a caching layer:
import hashlib
# Simple exact-match cache (use an embedding-based semantic cache in production)
cache = {}
@waxell_agent(agent_name="cached-bot")
async def handle_query(query: str, waxell_ctx=None) -> str:
cache_key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
if cache_key in cache:
return cache[cache_key]
response = await call_llm(query, model="gpt-4o")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o", tokens_in=200, tokens_out=150)
cache[cache_key] = response
return response
Token Limit Guards
Set maximum token limits on your LLM calls to prevent runaway costs from unexpectedly long responses:
from openai import AsyncOpenAI

client = AsyncOpenAI()
response = await client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=500, # Cap output tokens
)
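A max_tokens cap also bounds the worst-case output spend of a single call: the cap times the output rate. A quick sketch of that bound (gpt-4o's $10/M output rate from Step 3):

```python
def worst_case_output_cost(max_tokens: int, output_cost_per_million: float) -> float:
    """Upper bound on output spend for one call, given a max_tokens cap."""
    return max_tokens * output_cost_per_million / 1_000_000

# A 500-token cap at $10/M output bounds each response at half a cent.
print(f"${worst_case_output_cost(500, 10.0):.4f}")
```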
Next Steps
- Cost Management -- Full feature reference for cost tracking and controls
- Policy & Governance -- Set up budget enforcement policies
- Session Analytics -- Analyze conversation-level costs
- Scoring -- Track quality alongside cost optimization