Logan Kelly

You've Shipped Agents. Now You Have to Run Them.

Shipping an agent and running one are different disciplines. Here's what actually breaks in production, what operational questions you need to answer, and what reliability looks like for AI agents.

Shipping an agent is an act of optimism. Running it is an act of engineering discipline.

Running AI agents in production means operating autonomous software systems that call LLMs, use tools, and take real-world actions continuously and under load. That is a fundamentally different discipline from building or deploying those systems. Unlike deploying a traditional service, production agent operations requires managing behavioral compliance, not just uptime: you need visibility into what the agent is doing, why, and whether it stays inside the boundaries it's supposed to operate within. The concept this is most commonly confused with is observability. Observability tells you what happened; governed operations gives you the controls to define what's allowed to happen and to block anything else in real time, before it executes.

These are different skills. The skills that get you from idea to working demo — prompt engineering, tool design, context management, iteration speed — are not the same skills you need when the thing is live and real users are depending on it and something is going wrong at 11pm and you need to figure out what.

Most engineering teams learned this the hard way with microservices. The same lesson is playing out right now with agents, and sometimes the consequences are severe: in April 2026, a Cursor coding agent running Claude Opus 4.6 deleted PocketOS's entire production database — including all volume-level backups — in a single API call to Railway. It took 9 seconds. The agent encountered a credential mismatch, went looking for a token to fix it, found a fully permissioned API token stored in an unrelated file, and used it to delete the production volume. No human approved the deletion. No governance layer blocked it.

Different architecture, same core problem: building something and operating something are not the same discipline. (See also: Why AI agent costs spiral → · What is agentic governance →)

What Breaks First

Not what breaks catastrophically — what breaks first, subtly, in the ways you don't catch until they've been broken for a while.

Latency. Your p95 looked fine in testing. In production, you have tail cases that never appeared in your test set — long context windows, tool calls that take longer under load, retry sequences. Your p99 is significantly worse than your median. Users in that tail are having a bad experience, and you're finding out from support tickets, not monitoring. Datadog's State of AI Engineering 2026 report found that in February 2026, 5% of all LLM call spans across their customer base reported an error — and 60% of those errors were caused by exceeded rate limits, not model failures. By March, rate limit errors alone totaled 8.4 million in a single month. These aren't edge cases; they're systematic operational gaps that most teams don't instrument for until they're already hurting.
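To make the gap between median and tail concrete, here is a minimal sketch that computes nearest-rank percentiles over per-session latencies. The `percentile` helper and the sample values are illustrative assumptions, not output from any particular monitoring tool.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical session latencies in seconds: mostly fast, with a slow tail.
latencies = [1.2] * 90 + [2.5] * 8 + [14.0, 31.0]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
print(f"p50={p50}s p95={p95}s p99={p99}s")
```

In this made-up sample the median and p95 look healthy while p99 is more than ten times the median, which is the pattern an average or a p95-only dashboard hides.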

Behavioral drift. This one's insidious. The model provider ships a new version. A tool your agent depends on changes its response schema. Someone modifies the system prompt for a product reason and the downstream effects on agent behavior weren't fully mapped. None of these show up as errors. The agent still runs. It just behaves differently — and you might not notice for days.
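One rough way to notice drift that never raises an error is to compare how often the agent reaches for each tool before and after a change. The sketch below uses total variation distance between the two tool-call mixes; the tool names, sample counts, and `DRIFT_THRESHOLD` are all assumptions for illustration, and a real check would also look at outputs, not just tool usage.

```python
from collections import Counter

def tool_mix(calls):
    """Turn a list of tool names into a {tool: share_of_calls} distribution."""
    counts = Counter(calls)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tool: n / total for tool, n in counts.items()}

def total_variation(before, after):
    """Total variation distance between two distributions (0 = identical, 1 = disjoint)."""
    tools = set(before) | set(after)
    return 0.5 * sum(abs(before.get(t, 0.0) - after.get(t, 0.0)) for t in tools)

# Hypothetical tool calls sampled before and after a system prompt change.
before = tool_mix(["search"] * 60 + ["lookup_order"] * 30 + ["escalate"] * 10)
after = tool_mix(["search"] * 40 + ["lookup_order"] * 30 + ["escalate"] * 30)

DRIFT_THRESHOLD = 0.1  # assumed alerting threshold; tune against your own baseline
drift = total_variation(before, after)
if drift > DRIFT_THRESHOLD:
    print(f"tool-call mix shifted by {drift:.2f}; review recent prompt, model, or tool changes")
```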

Context window edge cases are the gap between your test suite and reality. Production has long sessions, confused users who restart mid-conversation, inputs that contain unexpected content, tool responses three times longer than anticipated. Your context management wasn't built for any of this.

Concurrency breaks things that worked perfectly in sequence. Resource contention, rate limits on downstream tools, session isolation issues — problems that only exist when multiple sessions are running at once.
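A common mitigation for downstream rate limits is to cap how many sessions can call a tool at once. This is a minimal sketch using `asyncio.Semaphore` from the standard library; `call_tool`, `run_sessions`, and `MAX_IN_FLIGHT` are hypothetical names, and the sleep stands in for a real tool call.

```python
import asyncio

MAX_IN_FLIGHT = 3  # assumed limit; set it from the downstream tool's actual rate limit

async def call_tool(session_id, limiter, stats):
    """Stand-in for a downstream tool call; the semaphore caps concurrent callers."""
    async with limiter:
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        await asyncio.sleep(0.01)  # simulated tool latency
        stats["in_flight"] -= 1
    return session_id

async def run_sessions(n):
    """Run n concurrent sessions and report the peak number inside the tool call."""
    limiter = asyncio.Semaphore(MAX_IN_FLIGHT)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(*(call_tool(i, limiter, stats) for i in range(n)))
    return results, stats["peak"]

results, peak = asyncio.run(run_sessions(10))
print(f"{len(results)} sessions completed, peak concurrency {peak}")
```

Recording the peak alongside the cap gives you a number to watch in production: if peak sits at the cap for long stretches, sessions are queuing and latency is being traded for fewer rate-limit errors.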

And then there's cost variance. Your average case is fine. Your variance is not. Long sessions, retry chains, and aggressive tool use by certain user segments run up bills that your average-case projections never captured.
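The sketch below shows why an average hides this. It pulls the most expensive 1% of sessions and reports their share of total spend; the `cost_tail` helper and the dollar figures are invented for illustration.

```python
def cost_tail(costs, top_fraction=0.01):
    """Return the most expensive sessions and their share of total spend."""
    ordered = sorted(costs, reverse=True)
    k = max(1, int(len(ordered) * top_fraction))
    top = ordered[:k]
    return top, sum(top) / sum(ordered)

# Hypothetical per-session costs in USD: 198 cheap sessions and 2 runaway ones.
session_costs = [0.05] * 198 + [4.00, 6.10]

mean_cost = sum(session_costs) / len(session_costs)
top, share = cost_tail(session_costs)
print(f"mean=${mean_cost:.3f}, top {len(top)} sessions account for {share:.1%} of spend")
```

With these numbers the mean is about ten cents per session, yet two sessions account for roughly half the bill.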

The Operational Questions You Need to Be Able to Answer

The test of whether you're actually running an agent — versus just having deployed one — is whether you can answer the operational questions on demand. Not after a two-hour investigation. On demand.

What's my p99 latency right now? Which sessions today took the longest, and why? What fraction of sessions in the last 24 hours completed successfully vs. hit an error? Did the agent's behavior change after the system prompt update yesterday? What's my average cost per session today vs. last week — and which sessions are in the top 1% of cost? What PII has entered agent context in the last 7 days, how was it handled, and where did it end up?

If answering any of those requires digging through logs instead of pulling up a dashboard, you have observability. You don't have operational capability. Waxell Observe makes these questions answerable with production telemetry that captures every session, tool call, and cost event. With 200+ libraries auto-instrumented with no code changes, you can have it running before you finish your coffee. To install: pip install waxell-observe.

Building an SLA for Your Agent

Most teams haven't asked this question yet, and it shows: what does reliability actually mean for an AI agent?

For a traditional API, it's clean. Uptime percentage. Error rate. Latency percentiles. Done.

For an agent, reliability has a behavioral dimension that makes it harder. An agent that responds within your latency target but gives a confidently wrong answer isn't reliable in the way that matters. So you need to think about reliability across multiple axes: Does the agent respond? (Availability — table stakes.) Does it respond correctly? (Behavioral consistency — which means you've defined what "correct" looks like, and that's a product decision before it's an engineering one.) Does it stay inside its policy envelope — spend budgets, PII handling, tool constraints? (Governance compliance — measurable if you have the infrastructure, invisible if you don't.) And what happens when it can't handle a request? A good SLA defines acceptable fallback behavior, not just acceptable success behavior.
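The four axes can be written down as explicit targets and checked against a window of session data. This is a sketch under stated assumptions: `AgentSLA`, `WindowStats`, and every threshold below are hypothetical placeholders, and what counts as a "quality pass" is whatever your own evaluation criteria define.

```python
from dataclasses import dataclass

@dataclass
class AgentSLA:
    """Targets for the four axes; every number used here is a placeholder."""
    min_success_rate: float        # availability
    min_quality_pass_rate: float   # behavioral consistency, per your own eval criteria
    max_policy_violations: int     # governance compliance
    min_fallback_rate: float       # share of unhandled requests that degraded gracefully

@dataclass
class WindowStats:
    sessions: int
    succeeded: int
    quality_passed: int
    policy_violations: int
    unhandled: int
    fell_back_cleanly: int

def sla_breaches(sla, stats):
    """Return the names of the axes that missed their target in this window."""
    breaches = []
    if stats.succeeded / stats.sessions < sla.min_success_rate:
        breaches.append("availability")
    if stats.quality_passed / stats.sessions < sla.min_quality_pass_rate:
        breaches.append("behavioral_consistency")
    if stats.policy_violations > sla.max_policy_violations:
        breaches.append("governance_compliance")
    if stats.unhandled and stats.fell_back_cleanly / stats.unhandled < sla.min_fallback_rate:
        breaches.append("fallback_behavior")
    return breaches

sla = AgentSLA(0.99, 0.95, 0, 0.90)
today = WindowStats(sessions=1000, succeeded=995, quality_passed=930,
                    policy_violations=0, unhandled=20, fell_back_cleanly=19)
print(sla_breaches(sla, today))
```

In this example the agent is available and compliant but misses the behavioral target, a breach that a conventional uptime SLA would never report.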

For teams building out these SLAs, the AI agent health monitoring metrics framework is a useful starting point for what to measure and what good looks like in practice.

Having this conversation with your stakeholders explicitly turns "the agent sometimes does weird things" into a quantifiable problem instead of a vague worry.

Incident Response for AI Behavior

Traditional incident response has a clean shape: something breaks, you find the root cause, you fix it, you deploy. Bounded in time and scope.

Agent incidents don't work like that.

First, the incident may have been happening for days before anyone flagged it. Behavioral issues don't always surface as errors — they look like slightly worse retention, slightly higher support volume, slightly more escalations. By the time someone says "this is an incident," you're reconstructing what happened over a week, not debugging a single event.

Second, the fix often isn't a code deploy. Maybe it's reverting a system prompt change. Maybe it's adapting to a model behavior shift from the upstream provider. Maybe it's a policy update for a PII handling gap. The intervention options are fundamentally different.

And rollback? Rollback doesn't mean the same thing when behavior is distributed across the model, the prompt, the tools, and the governance policies. You need to figure out which layer the problem lives in before you know what to revert. Meanwhile, if the agent already produced bad outputs that users acted on or that got logged in downstream systems, those effects don't disappear when you fix the agent. Your incident response needs to account for cleanup and user communication, not just the technical fix.
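Finding the layer is easier if every session records which version of each layer was live. The sketch below stores a short fingerprint per layer and diffs a known-good session against a misbehaving one. The layer names follow the paragraph above; the `snapshot` and `changed_layers` helpers and all sample values are assumptions.

```python
import hashlib

LAYERS = ("model", "prompt", "tools", "policies")

def fingerprint(text):
    """Short, stable hash so large prompts and schemas can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def snapshot(model, prompt, tool_schemas, policies):
    """Record the version of each layer that was live for a session."""
    return {
        "model": model,
        "prompt": fingerprint(prompt),
        "tools": fingerprint(tool_schemas),
        "policies": fingerprint(policies),
    }

def changed_layers(good, bad):
    """Layers that differ between a known-good session and a misbehaving one."""
    return [layer for layer in LAYERS if good[layer] != bad[layer]]

# Hypothetical values; in practice these come from your deployment config.
last_week = snapshot("model-v1", "You are a support agent.",
                     '{"refund": {"max": 50}}', "pii: redact")
today = snapshot("model-v1", "You are a friendly support agent.",
                 '{"refund": {"max": 50}}', "pii: redact")
print(changed_layers(last_week, today))
```

A diff like this does not prove the changed layer caused the problem, but it narrows the first thing to revert. It cannot see changes a provider makes behind an unchanged model name.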

Think through this before the incident happens. Document your response procedures — the AI agent incident response runbook covers the specific questions production teams need to pre-answer. Define what "resolved" means. The difference between a managed incident and a chaotic one is whether you did the thinking in advance.

How Do You Move from Reactive to Governed Agent Operations?

The teams that operate agents well aren't the ones that get good at firefighting. They're the ones that systematically reduce the number of fires. That means moving from reactive — find it when it breaks — to governed — define what acceptable looks like, enforce it continuously, know immediately when it's violated.

What that actually takes: a policy layer that makes "acceptable behavior" explicit instead of hoping the model does the right thing. Waxell Runtime ships with 26 policy categories out of the box — covering spend limits, PII handling, tool call constraints, and destructive action patterns — enforced pre-execution, before the agent fires the tool call. An enforcement mechanism that applies those policies in real time, not after the fact. Instrumentation that makes the operational questions answerable without an investigation. And incident playbooks that treat agent behavior as its own category — because it is.
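To show what "enforced pre-execution" means in general terms, here is a generic sketch of a policy check that runs before a tool call. It is not the Waxell Runtime API; `Policy`, `PolicyViolation`, `guarded_call`, the tool names, and the limits are all hypothetical.

```python
from dataclasses import dataclass

class PolicyViolation(Exception):
    """Raised before a tool call runs when it falls outside policy."""

@dataclass
class Policy:
    spend_limit_usd: float
    blocked_tools: frozenset = frozenset()
    spent_usd: float = 0.0

    def check(self, tool, estimated_cost_usd):
        """Reject blocked tools and calls that would push spend past the limit."""
        if tool in self.blocked_tools:
            raise PolicyViolation(f"{tool} is blocked for this agent")
        if self.spent_usd + estimated_cost_usd > self.spend_limit_usd:
            raise PolicyViolation("session spend limit would be exceeded")

def guarded_call(policy, tool, fn, estimated_cost_usd=0.0):
    """Evaluate policy first; only run the tool if the check passes."""
    policy.check(tool, estimated_cost_usd)
    result = fn()
    policy.spent_usd += estimated_cost_usd
    return result

policy = Policy(spend_limit_usd=1.00, blocked_tools=frozenset({"delete_volume"}))
print(guarded_call(policy, "search", lambda: "3 results", estimated_cost_usd=0.02))

try:
    guarded_call(policy, "delete_volume", lambda: "deleted")
except PolicyViolation as err:
    print(f"blocked: {err}")
```

The design point is ordering: the check raises before `fn` is ever invoked, so a blocked action has no side effects to clean up afterward.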

None of this is exotic. It's ops discipline applied to a new kind of system. Teams that bring the same rigor to their agents that they'd bring to a database or a microservice find that agents are perfectly operable. Teams that treat agents as something too intelligent to need real ops are the ones with the war stories.

How Waxell handles this: Waxell provides the governance and operational layer that makes "running agents" different from "having deployed agents." Waxell Observe gives you real-time cost tracking, behavioral drift detection, and a queryable execution audit trail — so the operational questions (latency, cost per session, behavioral drift, data handling) become answerable on demand without engineering investigation. Waxell Runtime enforces behavioral policies pre-execution: spend limits, PII controls, tool call constraints, and hard stops before the agent takes a destructive action it shouldn't. No rewrites. Deploy over whatever you've already built. pip install waxell-observe. Request early access →

Frequently Asked Questions

What breaks first when you run AI agents in production?
In order of typical appearance: latency tail cases (p99 latency significantly worse than median, discovered through support tickets not monitoring), behavioral drift after upstream changes (model updates, tool schema changes, prompt modifications with unmapped downstream effects), context window edge cases (long sessions, unexpected tool response lengths), concurrency problems (resource contention, rate limits on downstream tools, session isolation issues), and cost variance (average-case costs in budget, but outlier sessions running up the tail). Datadog's State of AI Engineering 2026 found that 5% of all LLM call spans reported errors in February 2026, with 60% caused by rate limits. That is a systematic gap most teams don't instrument for until it's already hurting them.

How do you build an SLA for an AI agent?
An agent SLA needs to cover four dimensions: availability (error rate, latency at defined percentiles), behavioral consistency (responses meet defined quality criteria, evaluated against some benchmark), governance compliance (agent operates within its policy envelope — spend, PII, tool constraints), and degradation behavior (defined fallback when the agent can't handle a request). Each dimension requires having defined what "acceptable" looks like before you can measure it, which is a product decision before it's an engineering one.

How do I prevent my AI agent from taking destructive actions like deleting production data?
The PocketOS incident in April 2026 showed the issue isn't just the agent's reasoning — it's about what credentials and permissions the agent can access. The operational answer has two parts: first, scope your agent's credentials to the minimum necessary permissions (no fully permissioned API tokens that cover operations you'd never want an agent to trigger). Second, apply pre-execution policy enforcement that catches destructive operations before they execute — not after. A governance layer that evaluates the action against policy before the tool call fires is the only reliable way to stop this class of failure.

What does AI agent incident response look like?
Agent behavior incidents differ from traditional software incidents in four ways: the problem may have been happening for days before detection; the fix may not be a code deploy (it might be a policy update, a prompt revert, or a governance layer change); rollback has a different meaning because behavior is distributed across model, prompt, tools, and policies; and impact may not be fully reversible if bad outputs were acted on or logged downstream. Response procedures need to account for all of this before an incident, not during one.

How is operating AI agents different from traditional software operations?
Traditional systems are deterministic — you control the code, the code executes predictably. Agents are probabilistic — the same inputs can produce different outputs, and behavior is distributed across model, prompt, tools, and governance layer. This means traditional on-call runbooks don't translate directly. Agent operations requires understanding which layer a problem is at (model behavior? prompt? tools? policies?), what "rollback" means for each layer, and how to measure behavioral compliance, not just technical availability.

What operational questions should you be able to answer about your AI agents?
On demand, without investigation: current p50 and p99 session latency; fraction of sessions in the last 24 hours that completed successfully vs. hit errors; average cost per session today vs. last week; top 1% of sessions by cost and what made them expensive; any PII that entered context in the last 7 days and how it was handled; whether agent behavior shifted after the last system prompt change. If any of these require a manual investigation rather than a dashboard query, you have observability but not operational capability.

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.
