Cost Optimization
Reduce LLM spending without sacrificing quality. This tutorial walks you through identifying your biggest cost drivers, setting custom pricing for negotiated rates, and establishing strategies to keep costs under control.
Prerequisites
- waxell-observe installed and sending data (pip install waxell-observe)
- At least a few hours of LLM call history in your Waxell instance
- A Waxell API key with dashboard access
What You'll Learn
- How to view model usage analytics and identify cost drivers
- How to set custom model costs for negotiated pricing
- How to analyze per-user and per-session costs
- How to set up budget policies via the governance system
- Strategies for optimizing LLM costs across your agents
Step 1: View Model Analytics
The model analytics endpoint gives you a breakdown of cost, tokens, and call volume per model over time.
curl -X GET "https://acme.waxell.dev/api/v1/observability/analytics/models/?period=7d" \
-H "Authorization: Bearer <your-session-token>" \
-H "Content-Type: application/json"
The response includes overall aggregates, per-model totals, and a daily time series:
{
"period": "7d",
"aggregates": {
"total_cost": 142.385012,
"total_tokens": 28450000,
"total_calls": 15230
},
"model_totals": [
{
"model": "gpt-4o",
"total_cost": 98.120000,
"total_tokens": 12800000,
"total_calls": 4200,
"percentage": 68.9
},
{
"model": "gpt-4o-mini",
"total_cost": 22.450000,
"total_tokens": 9200000,
"total_calls": 8100,
"percentage": 15.8
},
{
"model": "claude-sonnet-4-5-20250929",
"total_cost": 21.815012,
"total_tokens": 6450000,
"total_calls": 2930,
"percentage": 15.3
}
],
"time_series": [
{
"date": "2026-02-01",
"model": "gpt-4o",
"cost": 14.230000,
"tokens": 1850000,
"calls": 610
}
]
}
Use the agent query parameter to filter analytics to a specific agent: ?period=7d&agent=support-bot. This helps isolate which agents are driving cost.
Step 2: Identify Top Cost Drivers
From the model analytics response, sort model_totals by percentage to see which models consume the most budget. In the example above:
| Model | Cost | % of Total | Calls |
|---|---|---|---|
| gpt-4o | $98.12 | 68.9% | 4,200 |
| gpt-4o-mini | $22.45 | 15.8% | 8,100 |
| claude-sonnet-4-5-20250929 | $21.82 | 15.3% | 2,930 |
Key observations:
- gpt-4o accounts for nearly 70% of cost but only 28% of calls -- this is the primary optimization target
- gpt-4o-mini handles the most calls at a fraction of the cost -- a good candidate for routing simpler tasks
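The ranking above can be reproduced programmatically from the Step 1 response. A minimal sketch using the `model_totals` field shown earlier (the sample values are copied from that response; in practice you would parse the live analytics payload):

```python
# model_totals as returned by the analytics endpoint in Step 1.
model_totals = [
    {"model": "gpt-4o-mini", "total_cost": 22.45, "total_calls": 8100, "percentage": 15.8},
    {"model": "gpt-4o", "total_cost": 98.12, "total_calls": 4200, "percentage": 68.9},
    {"model": "claude-sonnet-4-5-20250929", "total_cost": 21.82, "total_calls": 2930, "percentage": 15.3},
]

# Rank models by share of total cost, descending.
ranked = sorted(model_totals, key=lambda m: m["percentage"], reverse=True)

for m in ranked:
    cost_per_call = m["total_cost"] / m["total_calls"]
    print(f'{m["model"]}: {m["percentage"]}% of spend, ${cost_per_call:.4f}/call')
```

Cost per call is often more revealing than total cost: it shows which model is expensive per invocation versus merely high-volume.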
Step 3: Set Custom Model Costs
If you have negotiated pricing with your LLM provider (volume discounts, enterprise agreements), set custom model costs so your dashboards reflect actual spend rather than list prices.
List current model costs:
curl -X GET "https://acme.waxell.dev/api/v1/observe/model-costs/" \
-H "X-Wax-Key: wax_sk_..." \
-H "Content-Type: application/json"
Response:
{
"models": [
{
"model_id": "gpt-4o",
"display_name": "gpt-4o",
"provider": "openai",
"input_cost_per_million": 2.5,
"output_cost_per_million": 10.0,
"source": "default"
}
]
}
Set a custom cost override for your negotiated rate:
curl -X PUT "https://acme.waxell.dev/api/v1/observe/model-costs/gpt-4o/" \
-H "X-Wax-Key: wax_sk_..." \
-H "Content-Type: application/json" \
-d '{
"input_cost_per_million": 2.0,
"output_cost_per_million": 8.0
}'
Response:
{
"model_id": "gpt-4o",
"input_cost_per_million": 2.0,
"output_cost_per_million": 8.0,
"source": "custom"
}
Reset to system defaults:
curl -X DELETE "https://acme.waxell.dev/api/v1/observe/model-costs/gpt-4o/" \
-H "X-Wax-Key: wax_sk_..."
Custom model costs are per-tenant. Each tenant can set their own pricing to match their provider agreements. Future LLM calls will use the custom cost for calculations.
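To see what an override changes, note that per-call cost under per-million-token pricing is just token counts times the rates. A hedged sketch of that arithmetic (the formula is standard per-million pricing; Waxell's internal rounding may differ slightly):

```python
def call_cost(tokens_in: int, tokens_out: int,
              input_cost_per_million: float,
              output_cost_per_million: float) -> float:
    """Cost of one LLM call under per-million-token pricing."""
    return (tokens_in / 1_000_000) * input_cost_per_million \
         + (tokens_out / 1_000_000) * output_cost_per_million

# gpt-4o at list price ($2.50 in / $10.00 out) vs. the
# negotiated override from the PUT above ($2.00 / $8.00):
list_cost = call_cost(500, 400, 2.5, 10.0)
custom_cost = call_cost(500, 400, 2.0, 8.0)
print(f"list ${list_cost:.6f} vs custom ${custom_cost:.6f}")
```

For this example call the override trims cost from $0.00525 to $0.00420, a 20% reduction that compounds across every future call.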
Step 4: View Per-User Costs
Identify which end users are driving the most cost. This is useful for usage-based billing, abuse detection, or capacity planning.
curl -X GET "https://acme.waxell.dev/api/v1/observability/users/?sort=-run_count" \
-H "Authorization: Bearer <your-session-token>"
The sort parameter on the users endpoint supports -last_seen, first_seen, -first_seen, run_count, and -run_count. For cost-based sorting, pull the results and sort client-side, or use the UI's built-in sort controls.
Response:
{
"results": [
{
"user_id": "user_8a3f",
"run_count": 342,
"first_seen": "2026-01-15T10:00:00Z",
"last_seen": "2026-02-07T14:30:00Z",
"total_duration": 1842.5,
"total_cost": 28.450012,
"total_tokens": 5200000,
"agents": ["support-bot", "research-assistant"]
}
],
"count": 156,
"next": "?offset=25&limit=25",
"previous": null
}
For detailed per-user breakdown by model, fetch the user detail:
curl -X GET "https://acme.waxell.dev/api/v1/observability/users/user_8a3f/" \
-H "Authorization: Bearer <your-session-token>"
This returns cost_by_model showing exactly which models each user is consuming.
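Since the users endpoint does not sort by cost server-side, rank the paginated results client-side. A minimal sketch over records shaped like the response above (the sample data is illustrative):

```python
def rank_by_cost(users: list[dict], limit: int = 10) -> list[dict]:
    """Rank user records from the users endpoint by total_cost, descending."""
    return sorted(users, key=lambda u: u["total_cost"], reverse=True)[:limit]

# Records as returned in the "results" array of the users endpoint.
sample = [
    {"user_id": "user_8a3f", "total_cost": 28.450012, "run_count": 342},
    {"user_id": "user_19c2", "total_cost": 3.12, "run_count": 980},
]
top = rank_by_cost(sample)
print(top[0]["user_id"])  # highest-cost user, despite fewer runs than user_19c2
```

When the result set spans multiple pages, follow the next links (offset/limit pagination) to collect all records before ranking.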
Step 5: View Per-Session Costs
Sessions group related runs together (e.g., a multi-turn conversation). Track session-level cost to understand the cost of complete user interactions.
curl -X GET "https://acme.waxell.dev/api/v1/observability/sessions/?sort=-last_activity" \
-H "Authorization: Bearer <your-session-token>"
Response:
{
"results": [
{
"session_id": "sess_a1b2c3d4e5f6g7h8",
"run_count": 8,
"first_run": "2026-02-07T10:00:00Z",
"last_activity": "2026-02-07T10:15:00Z",
"total_duration": 45.2,
"total_cost": 0.8450,
"total_tokens": 42000,
"agents": ["support-bot"]
}
],
"count": 2340
}
High per-session cost often indicates opportunities for caching or prompt optimization. If users are asking similar questions repeatedly, consider implementing a semantic cache layer.
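One way to act on this: compute cost per run for each session and flag outliers. A sketch over records shaped like the response above (the 3x-median threshold is an arbitrary choice, not a Waxell setting):

```python
from statistics import median

def flag_expensive_sessions(sessions: list[dict], factor: float = 3.0) -> list[str]:
    """Return session_ids whose cost-per-run exceeds factor x the median."""
    per_run = {s["session_id"]: s["total_cost"] / s["run_count"] for s in sessions}
    baseline = median(per_run.values())
    return [sid for sid, cost in per_run.items() if cost > factor * baseline]

# Records as returned in the "results" array of the sessions endpoint.
sessions = [
    {"session_id": "sess_a1b2", "total_cost": 0.845, "run_count": 8},
    {"session_id": "sess_c3d4", "total_cost": 0.120, "run_count": 6},
    {"session_id": "sess_e5f6", "total_cost": 2.400, "run_count": 4},
]
print(flag_expensive_sessions(sessions))
```

Flagged sessions are good candidates for a manual look in the LLM Calls explorer: long prompts, retry loops, and redundant context usually show up immediately.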
Step 6: Set Up Budget Policies
Waxell's governance system can enforce budget limits at the agent level. When an agent exceeds its budget, the policy can warn or block further execution.
Budget policies are configured through the Waxell control plane. Here is the general approach:
- Navigate to Policies in the control plane dashboard
- Create a new policy with type Budget
- Set the budget threshold (e.g., max cost per day, max tokens per hour)
- Choose the action: warn (log a warning but allow execution) or block (prevent the run from starting)
- Assign the policy to specific agents or apply it globally
# In your agent code, budget policies are enforced automatically
# via the @waxell_agent decorator or WaxellContext:
from waxell_observe import waxell_agent
from waxell_observe.errors import PolicyViolationError
@waxell_agent(agent_name="support-bot", enforce_policy=True)
async def handle_query(query: str, waxell_ctx=None) -> str:
# If a budget policy blocks this agent, a PolicyViolationError
# is raised before execution begins.
response = await call_llm(query)
return response
# You can also catch the error explicitly:
try:
result = await handle_query("How do I reset my password?")
except PolicyViolationError as e:
print(f"Agent blocked by policy: {e}")
# Fall back to a cached response or queue for later
For full governance documentation, see Policy & Governance.
Step 7: Track Cost Trends
Use the model analytics time series data to track cost trends over time. The time_series array in the analytics response provides daily aggregates per model.
import httpx
# Fetch 30-day trends
async def get_cost_trends():
async with httpx.AsyncClient() as client:
response = await client.get(
"https://acme.waxell.dev/api/v1/observability/analytics/models/",
params={"period": "30d"},
headers={"Authorization": "Bearer <token>"},
)
data = response.json()
# Calculate daily totals across all models
daily_costs = {}
for point in data["time_series"]:
date = point["date"]
daily_costs[date] = daily_costs.get(date, 0) + point["cost"]
# Check for cost spikes
costs = list(daily_costs.values())
if len(costs) >= 2:
avg_cost = sum(costs[:-1]) / len(costs[:-1])
latest_cost = costs[-1]
if latest_cost > avg_cost * 1.5:
print(f"Cost spike detected: ${latest_cost:.2f} vs avg ${avg_cost:.2f}")
return daily_costs
In the Waxell dashboard, the Analytics > Cost Analytics page provides interactive charts for:
- Daily cost by model (stacked area chart)
- Token usage trends
- Cost per agent over time
- Model usage distribution (pie chart)
Step 8: Optimization Strategies
Model Routing
Route simpler tasks to cheaper models. Not every LLM call needs the most expensive model.
from waxell_observe import waxell_agent
@waxell_agent(agent_name="smart-router")
async def handle_query(query: str, waxell_ctx=None) -> str:
# Classify the query complexity
complexity = classify_complexity(query) # your classifier
if complexity == "simple":
# FAQ, simple lookups -- use the cheapest model
response = await call_llm(query, model="gpt-4o-mini")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o-mini", tokens_in=50, tokens_out=30)
elif complexity == "moderate":
# Summaries, standard questions
response = await call_llm(query, model="claude-sonnet-4-5-20250929")
if waxell_ctx:
waxell_ctx.record_llm_call(model="claude-sonnet-4-5-20250929", tokens_in=200, tokens_out=150)
else:
# Complex reasoning, code generation
response = await call_llm(query, model="gpt-4o")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o", tokens_in=500, tokens_out=400)
return response
Prompt Optimization
Shorter prompts mean fewer input tokens. Review your highest-cost agents in the LLM Calls explorer and look for:
- Redundant instructions in system prompts that can be removed
- Verbose examples that can be condensed
- Unnecessary context being passed to every call
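To quantify what a trimmed system prompt saves, multiply the token reduction by the input rate and call volume. A rough sketch (the ~4-characters-per-token ratio is a common heuristic for English text, not an exact tokenizer; use your provider's tokenizer for real numbers):

```python
def rough_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def monthly_prompt_savings(old_prompt: str, new_prompt: str,
                           calls_per_month: int,
                           input_cost_per_million: float) -> float:
    """Estimated monthly savings from shortening a prompt sent on every call."""
    saved_tokens = rough_tokens(old_prompt) - rough_tokens(new_prompt)
    return saved_tokens * calls_per_month * input_cost_per_million / 1_000_000

# Trimming 2,000 characters (~500 tokens) from a system prompt, at
# gpt-4o's $2.50/M input rate and 100k calls per month:
savings = monthly_prompt_savings("x" * 3000, "x" * 1000, 100_000, 2.5)
print(f"${savings:.2f}/month")
```

Even modest per-call reductions compound quickly at high call volume, which is why system-prompt trims are usually the cheapest optimization to ship.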
Caching
For agents that handle repeated or similar queries, add a caching layer:
import hashlib
# Simple exact-match cache (use an embedding-based semantic cache in production)
cache = {}
@waxell_agent(agent_name="cached-bot")
async def handle_query(query: str, waxell_ctx=None) -> str:
cache_key = hashlib.sha256(query.lower().strip().encode()).hexdigest()
if cache_key in cache:
return cache[cache_key]
response = await call_llm(query, model="gpt-4o")
if waxell_ctx:
waxell_ctx.record_llm_call(model="gpt-4o", tokens_in=200, tokens_out=150)
cache[cache_key] = response
return response
Token Limit Guards
Set maximum token limits on your LLM calls to prevent runaway costs from unexpectedly long responses:
from openai import AsyncOpenAI

client = AsyncOpenAI()
response = await client.chat.completions.create(
model="gpt-4o",
messages=messages,
max_tokens=500, # Cap output tokens
)
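A max_tokens cap also bounds the worst-case output spend of a single call: the cap times the output rate. A quick sketch of that bound (gpt-4o's $10/M output rate from Step 3):

```python
def worst_case_output_cost(max_tokens: int, output_cost_per_million: float) -> float:
    """Upper bound on output spend for one call, given a max_tokens cap."""
    return max_tokens * output_cost_per_million / 1_000_000

# A 500-token cap at $10/M output bounds each response at half a cent.
print(f"${worst_case_output_cost(500, 10.0):.4f}")
```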
Next Steps
- Cost Management -- Full feature reference for cost tracking and controls
- Policy & Governance -- Set up budget enforcement policies
- Session Analytics -- Analyze conversation-level costs
- Scoring -- Track quality alongside cost optimization