Waxell

Logan Kelly

Jun 24, 2026

AI Agent Cost Enforcement: What Actually Changes When You Add Hard Limits

Uber hit its 2026 AI budget by April. One company got a $500M Claude bill. Here's what changes when teams move from cost alerts to enforcement.

By April 2026, Uber had burned through its entire annual AI coding budget. A separate company reportedly ran up a $500 million Claude bill after forgetting to set usage limits for its employees, according to a report from Axios. At Priceline, Chris Reed, the company's senior director of IT finance, told TechCrunch that a routine Cursor contract renewal came back four to five times more expensive than expected. These aren't edge cases. J.R. Storment, executive director of the FinOps Foundation, said in early June that companies were calling him in April and May to report being "3x over our entire 2026 token budget and it's only April."

The problem isn't that teams lacked data. Most of them had dashboards. What they couldn't do, in almost every case, was stop the spending.

That is the gap between cost visibility and cost enforcement — and it is a structural gap, not a tooling gap. This post is about what actually changes when you close it.

The Visibility-Without-Control Trap

Cost visibility for AI agents has matured quickly. Platforms like LangSmith, Helicone, Arize, and Datadog's new AI monitoring layer all provide token-level dashboards, cost-per-request breakdowns, and spending trend charts. This tooling is genuinely useful. It tells you what happened.

None of it stops what is happening now.

An alert that fires when you've spent $10,000 is not enforcement. A dashboard showing you're trending toward a $500K monthly bill is not enforcement. These tools create awareness after cost is incurred. An agent that has entered a recursive loop, over-queried a connected data source, or re-sent its full conversation history with every turn doesn't consult your dashboard before placing the next LLM call. It keeps running.
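To make the history re-sending failure mode concrete, here is a rough sketch of how input tokens accumulate when every turn re-sends the whole conversation. It assumes each turn adds a fixed number of tokens and ignores output tokens and any prompt caching. The function name and the figures are illustrative, not measurements.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when every turn re-sends all prior turns.

    Simplifying assumption: each turn adds `tokens_per_turn` tokens, so
    turn k sends k * tokens_per_turn tokens (history plus the new message).
    """
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def linear_input_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens if each turn sent only its own new message."""
    return turns * tokens_per_turn


if __name__ == "__main__":
    print(cumulative_input_tokens(50, 500))  # 637500
    print(linear_input_tokens(50, 500))      # 25000
```

Under these assumptions, a 50-turn session at 500 tokens per turn sends 637,500 input tokens instead of 25,000, roughly 25 times more, and the gap widens with every additional turn.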

Jellyfish, an engineering management platform, tracked per-developer AI token consumption across its user base and found it had risen 18.6x in nine months through early 2026. Heavy users were roughly twice as productive as low users — but spent ten times the tokens to get there. "Whether extreme spend pays off comes down to the ultimate business value of shipped code," Nicholas Arcolano, head of research at Jellyfish, told TechCrunch. "Which most companies still can't measure."

That measurement gap is what makes observability-only tooling insufficient. Visibility tells you you're spending too much. It doesn't tell you what to cut, and it can't stop the agent that is doing the spending.

What the "Before" State Looks Like

Teams running AI agents without enforcement converge on roughly the same set of workarounds.

Manual budget reviews. An engineering lead or finance partner checks an AI spend report once a week. Anomalies get escalated. Agents that exceeded budget last week get reviewed this week. The feedback loop is measured in days, during which the overspend continues.

Alert-and-intervene. A monitoring platform sends a Slack message when daily spend crosses a threshold. Someone on call investigates. Depending on the time and their workload, this might happen in minutes or hours. The agent runs throughout.

API key rotation. The most common "hard stop" in actual use: when spend gets unacceptable, rotate or revoke the API key. This stops the agent — and everything else on that key. It is a circuit breaker with collateral damage.

None of these scale with agent proliferation. As agent deployments grow and each agent operates autonomously across longer time horizons, the window between "something is wrong" and "this already cost $40,000" collapses.

Vitaly Gordon, CEO of engineering intelligence platform Faros AI, described a CTO who called him in April with this: "One of my engineers spent $40,000 on tokens last month, and I genuinely don't know whether I should stop him or tell everyone else to be like him." That inability to distinguish productive spend from runaway spend in real time is what defines the before state. Alerting tools surface the number; they don't answer the question behind it.

What "After" Looks Like: The Structural Shift

Hard enforcement changes the control surface. Instead of monitoring what an agent has spent, you define what an agent is allowed to spend — and that ceiling is enforced before the next LLM call is placed, not after the bill arrives.

This is the key architectural shift: pre-execution versus post-execution control. With observability tooling, the policy lives downstream of the agent. Spend occurs, data lands in a monitoring sink, an alert fires. With enforcement, the policy lives upstream: the agent requests an LLM call, the governance layer checks the remaining token budget, and if the call would push spend past the limit, the call doesn't happen. No further tokens are consumed. The agent stops at the boundary.
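A minimal sketch of that upstream check might look like the following. It is a generic illustration, not Waxell Runtime's API: `TokenBudget`, `BudgetExceeded`, `guarded_call`, and the `call_llm` and `estimate_tokens` callables are all hypothetical names, and the sketch assumes the caller can estimate a call's token cost before placing it.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised before a call is placed when it would cross the ceiling."""


@dataclass
class TokenBudget:
    limit: int      # hard ceiling for this task, in tokens
    used: int = 0   # tokens consumed so far

    def reserve(self, estimated_tokens: int) -> None:
        # Pre-execution check: refuse the call if it would exceed the limit.
        if self.used + estimated_tokens > self.limit:
            raise BudgetExceeded(
                f"call needs ~{estimated_tokens} tokens, "
                f"{self.limit - self.used} remaining of {self.limit}"
            )

    def record(self, actual_tokens: int) -> None:
        # Post-call bookkeeping using the usage the provider reports.
        self.used += actual_tokens


def guarded_call(budget, prompt, call_llm, estimate_tokens):
    """Check the budget first; only then place the LLM call.

    `call_llm(prompt)` is assumed to return (text, tokens_used).
    """
    budget.reserve(estimate_tokens(prompt))
    text, tokens_used = call_llm(prompt)
    budget.record(tokens_used)
    return text
```

The point of the design is the ordering: `reserve` runs before `call_llm`, so a refused call costs nothing. A monitoring sink can only run the equivalent of `record`.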

The practical implications of this shift compound:

Budgets become a first-class agent property. Instead of managing AI spend at the account or team level, you manage it per-agent or per-task. A summarization agent gets a different ceiling than a multi-step research and code generation agent. The difference between "this task costs $0.40" and "this task costs $12" becomes visible and enforceable at the task level.

Runaway loops are terminated, not observed. A recursive loop that would have consumed 400,000 tokens before a human noticed it hits a configured ceiling — say, 10,000 tokens — and stops. The agent surfaces a structured exception. The operator gets a notification. The spending stops at the wall.

Cost forecasting becomes reliable. When each agent type has a hard ceiling, the maximum spend per deployment becomes calculable. Thirty agents running in parallel, each capped at a defined token limit per task: the worst-case scenario is arithmetic, not a guess. This is what finance teams mean when they ask for "guardrails" — not dashboards that show them the problem, but walls that prevent it.
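The worst-case arithmetic can be written down directly. In this sketch, the task volume and the blended price per million tokens are made-up inputs for illustration. Real pricing usually differs for input and output tokens, so treat the single rate as a simplifying assumption.

```python
def worst_case_cost_usd(agents: int,
                        tasks_per_agent: int,
                        token_cap_per_task: int,
                        usd_per_million_tokens: float) -> float:
    """Upper bound on spend when every task runs all the way to its cap."""
    max_tokens = agents * tasks_per_agent * token_cap_per_task
    return max_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # 30 agents, 100 tasks each per day, 10,000-token cap, $10 per 1M tokens
    # (volume and price are hypothetical).
    print(worst_case_cost_usd(30, 100, 10_000, 10.0))  # 300.0
```

With those inputs the ceiling is 30 million tokens, or $300 per day, no matter how any individual agent misbehaves. Without a per-task cap, the third argument has no upper bound and neither does the result.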

What Surprises Teams After Enforcement Goes Live

Three things consistently surface when teams implement hard limits for the first time.

Agents hit limits more often than expected. Tasks that look simple — one or two LLM calls — often aren't. Multi-step agents with large context windows and retrieval pipelines consume tokens at a scale that's easy to underestimate. The first few weeks of enforcement make this concrete immediately. This is useful data: it forces prompt optimization and architectural review that teams would have deferred indefinitely without a forcing function. The enforcement doesn't create the problem; it makes a pre-existing one visible before it becomes expensive.

Granularity matters more than people expect. A budget defined at the agent level gets consumed unevenly. Some tasks take ten times the tokens of others. A monthly budget set at the agent level can be exhausted by one expensive run, leaving the agent unusable. Task-level or session-level budgeting — ceilings applied per request or workflow, not per calendar period — is what actually solves the predictability problem.

Enforcement surfaces architectural problems. When an agent consistently hits limits on a particular task type, it's almost always a signal that something in the design is inefficient: too many round-trips to the LLM, too much irrelevant context being passed, a retrieval system returning documents that don't contribute to the answer. Enforcement doesn't fix these — but it flags them immediately, rather than when a quarterly spend report arrives on a CFO's desk.
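One simple way to turn enforcement events into a design-review list is to track how often each task type hits its ceiling. This is a sketch under assumptions: the `(task_type, hit_limit)` record shape and the 50% review threshold are hypothetical choices, not anything prescribed above.

```python
from collections import Counter


def limit_hit_rates(runs):
    """Map each task type to the fraction of its runs that hit the limit.

    `runs` is an iterable of (task_type, hit_limit) pairs.
    """
    totals, hits = Counter(), Counter()
    for task_type, hit_limit in runs:
        totals[task_type] += 1
        if hit_limit:
            hits[task_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


def flag_for_review(runs, threshold=0.5):
    """Task types whose hit rate meets or exceeds the review threshold."""
    rates = limit_hit_rates(runs)
    return sorted(t for t, rate in rates.items() if rate >= threshold)
```

A task type that hits its limit once in ten runs is probably an outlier input. One that hits it three times in four is more likely a context or retrieval design problem.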

Alexander Embiricos, OpenAI's head of enterprise, described the shift in customer conversations this way at an event in June: "Six months ago, I would have a conversation with a customer and it would be all about 'What can it do? Is it good enough?' Our conversations are never about that now. Now the conversations are about 'What visibility do you have? What auditability do you have? What token controls do you have?'"

How Waxell Runtime Handles This

Waxell Runtime enforces cost and token limits as pre-execution policies — the check happens before an LLM call is placed, not after the bill arrives. The policy layer sits above agents without requiring an SDK on the agent itself, and without code changes or rebuilds.

The enforcement model handles the precision that teams discover they need after going live: per-task budget ceilings, per-model cost caps, hard stops that surface structured exceptions rather than silent failures, and configurable responses (human escalation, graceful termination, or policy-defined retry logic). Waxell Runtime ships with 50+ policy categories that extend beyond cost enforcement: scope controls, output validation, data handling, and escalation triggers all sit under the same governance layer.

For real-time cost visibility, Waxell Observe instruments existing agents in two lines of code, covering 200+ libraries without requiring a rebuild. Teams use Observe to establish baseline token consumption per task type — then use Runtime to set enforcement ceilings derived from that baseline.

The combination is what closes the gap the before state leaves open: you can see what's happening (Observe) and stop what shouldn't be (Runtime). The two tools are complementary by design. Visibility tells you where to set the wall. Enforcement makes it hold.

FAQ

What's the difference between a cost alert and cost enforcement?
A cost alert notifies you after spending has occurred. Cost enforcement prevents spending from occurring past a defined threshold. The difference is pre-execution versus post-execution control — and it's the difference between learning you've overspent and preventing the overspend. Most AI observability platforms today provide alerts. Very few provide pre-execution hard stops.

Why don't LangSmith, Helicone, or Arize provide enforcement?
These are observability platforms: they track, visualize, and analyze what agents have done. Enforcement requires a governance layer that intercepts LLM calls before they're placed — architecturally upstream of where monitoring sinks operate. Some platforms are adding cost alerting features, but pre-execution hard stops require a different architectural position.

What happens to an agent when it hits a hard cost limit?
In a properly designed enforcement system, the agent stops cleanly and emits a structured event. The operator receives a notification with context: what task was running, how much was consumed, what the configured limit was. The agent doesn't retry silently; it waits for a policy-defined response — human approval, a budget increment, or graceful termination. Silent failures are a sign of an alerting system pretending to be enforcement.
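As an illustration of what a structured event could carry, here is a small sketch. The field names, the `LimitAction` values, and the `notify` callback are assumptions made for this example. They are not a documented event schema from any product.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class LimitAction(Enum):
    ESCALATE = "escalate_to_human"
    TERMINATE = "terminate_gracefully"


@dataclass
class LimitEvent:
    task_id: str
    tokens_consumed: int
    token_limit: int
    action: LimitAction

    def to_json(self) -> str:
        payload = asdict(self)
        payload["action"] = self.action.value  # enums are not JSON-serializable
        return json.dumps(payload, sort_keys=True)


def handle_limit(task_id, tokens_consumed, token_limit, action, notify):
    """Build the event, hand it to the operator channel, and return it.

    Nothing here retries the call: the next step is whatever `action` says.
    """
    event = LimitEvent(task_id, tokens_consumed, token_limit, action)
    notify(event.to_json())
    return event
```

The operator sees which task stopped, how much it consumed, and what the ceiling was, and the agent's next step is decided by policy rather than by a silent retry.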

Should I set limits per agent, per task, or per team?
Start with per-task limits. Task-level granularity catches runaway behavior early while giving productive tasks adequate headroom. Agent-level limits are useful as a backstop. Team-level limits work for rollup budget management but are too coarse-grained to catch individual agent failures before they've become expensive.
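The layering can be sketched as a chain of scopes checked from narrowest to widest. `Scope`, `first_blocking_scope`, and `charge` are hypothetical names, and the limits in the usage note are placeholders.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Scope:
    name: str
    limit: int      # token ceiling for this scope
    used: int = 0

    def remaining(self) -> int:
        return self.limit - self.used


def first_blocking_scope(scopes: Sequence[Scope], tokens: int) -> Optional[Scope]:
    """Return the narrowest scope that cannot absorb `tokens`, or None.

    `scopes` is ordered narrowest (task) to widest (team).
    """
    for scope in scopes:
        if tokens > scope.remaining():
            return scope
    return None


def charge(scopes: Sequence[Scope], tokens: int) -> bool:
    """Charge every scope only if all of them can absorb the tokens."""
    if first_blocking_scope(scopes, tokens) is not None:
        return False
    for scope in scopes:
        scope.used += tokens
    return True
```

With a 10,000-token task scope under a 200,000-token agent scope and a 1,000,000-token team scope, a runaway task is stopped by the task scope long before the agent or team budgets notice it.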

How do I determine where to set limits?
Run with instrumentation first: deploy Waxell Observe to capture baseline token consumption per task type across two to four weeks of production traffic. Use the 90th-percentile observation as your starting soft limit and the 99th percentile as your hard stop. Adjust after the first enforcement events surface outlier tasks.
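Once baseline numbers exist, deriving the two thresholds takes a few lines of standard-library Python. This sketch assumes you have exported per-task token counts for one task type as a list of integers. The function name and the choice to round up are illustrative.

```python
import math
import statistics


def limits_from_baseline(token_counts):
    """Return (soft_limit, hard_limit) for one task type.

    soft = 90th percentile, hard = 99th percentile of observed usage.
    Needs a reasonably large sample for the 99th percentile to mean much.
    """
    # quantiles(n=100) returns 99 cut points; index i-1 is the i-th percentile.
    cuts = statistics.quantiles(token_counts, n=100)
    soft = math.ceil(cuts[89])
    hard = math.ceil(cuts[98])
    return soft, hard
```

Compute the limits separately for each task type. Pooling a summarization task with a multi-step research task produces thresholds that fit neither.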

Does enforcing a token budget reduce agent quality?
Not in well-designed agents operating on typical tasks. A hard stop set at the 99th percentile of baseline consumption rarely constrains agents that are working as designed, and a soft limit at the 90th percentile only flags the heaviest runs for review. What enforcement does surface is tasks that were silently consuming five to ten times expected tokens, usually because of a design flaw in context handling or retrieval, not because the extra tokens produced better outputs.

Sources

Rebecca Bellan, "The token bill comes due: Inside the industry scramble to manage AI's runaway costs," TechCrunch, June 5, 2026. https://techcrunch.com/2026/06/05/the-token-bill-comes-due-inside-the-industry-scramble-to-manage-ais-runaway-costs/
Axios report cited in TechCrunch (above) — Source for the $500 million Claude bill.
"Show HN: I lost $200 from an agent loop, so I built per-tool AI budget controls," Hacker News. https://news.ycombinator.com/item?id=46991656
"Expensively Quadratic: The LLM Agent Cost Curve," Hacker News. https://news.ycombinator.com/item?id=47000034
