Logan Kelly

The 78% Problem: Why AI Agent Pilots Work and Production Deployments Don't

78% of enterprises have AI agent pilots. Under 15% run in production. Amazon's Kiro incident shows why: the problem isn't the model — it's the missing enforcement layer.


In mid-December 2025, an AI agent called Kiro — deployed by Amazon — autonomously deleted and recreated an AWS production environment in one of its China regions. The outage lasted 13 hours. The Financial Times broke the story in February 2026. A second incident involving Amazon Q Developer followed under similar conditions. Amazon later attributed the root cause to misconfigured access controls rather than AI agent behavior — a characterization that, if anything, sharpens the governance argument: the access configuration that enabled the incident is precisely what pre-execution enforcement is designed to constrain.

No one reported that Kiro was buggy in the usual sense. The model wasn't hallucinating. The agent was doing what agents do: executing actions against the systems it had access to. The problem was that the access was in place and the constraints weren't, and by the time the incident response team understood what had happened, the outage was already hours deep.

This is the production problem. It isn't a model problem.

The Numbers Describe a Structural Failure

A March 2026 survey of 650 enterprise technology leaders found that 78% of enterprises have AI agent pilots, but under 15% run agents in production. IDC's analysis is starker: for every 33 AI prototypes built, only 4 reach production — an 88% failure rate. Gartner projects that over 40% of agentic AI projects will be canceled by the end of 2027 due to escalating costs, unclear business value, or inadequate risk controls.

These are not numbers that describe a capability problem. The models are capable. The pilots work. Teams demonstrate clear value in controlled conditions. Then something happens between the pilot and production, and the project stalls or gets canceled.

A March 2026 analysis identified five factors that account for 89% of scaling failures: integration complexity with legacy systems, inconsistent output quality at scale, absence of monitoring tooling, unclear organizational ownership, and insufficient domain-specific training data. Four of those five are operational and structural — nothing to do with whether the underlying model is good enough.

The fifth — inconsistent output quality at scale — is also structural once you examine it closely. Agents in pilots operate on scripted, predictable inputs. In production they encounter the full distribution of real-world conditions. Without boundaries on what they're allowed to do when they hit conditions they weren't designed for, inconsistent output quality becomes inconsistent, uncontrolled action.

What Observability Misses

The tools most teams are already running for production AI — LangSmith, Arize, Helicone, Braintrust — are built for observability. They log what agents do, trace call chains, surface latency and token spend, and flag unexpected outputs. Arize's engineering blog puts it well: "The most expensive failures aren't crashes or hallucinations caught at the surface. They're silent errors that get picked up by the next agent in the pipeline and amplified before anyone realizes something went wrong."

That's an accurate diagnosis. It isn't a solution.

Observability tools generate a record of what happened. They help you understand an incident after the fact. The Amazon Kiro incident would have been fully observable — every action logged, every tool call traceable, clean audit trail. It still caused a 13-hour outage.

The gap between observability and reliability is enforcement. Knowing what an agent did tells you nothing about what it was permitted to do. The reason 78% of pilots don't reach production isn't insufficient visibility into agent behavior — it's insufficient confidence that agents will stay within the operational envelope the business requires.
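The distinction is easy to state in code. A minimal sketch, assuming a dictionary-shaped tool call and illustrative function names (this is not any particular tool's API):

```python
def observability_hook(tool_call, result):
    """Runs AFTER execution: it can only describe what already happened."""
    print(f"traced: {tool_call['action']} -> {result!r}")

def enforcement_gate(tool_call, allowed_actions):
    """Runs BEFORE execution: it decides whether the call may happen at all."""
    if tool_call["action"] not in allowed_actions:
        raise PermissionError(f"blocked: {tool_call['action']}")

def run_tool_call(tool_call, allowed_actions, execute):
    enforcement_gate(tool_call, allowed_actions)  # can stop the call here
    result = execute(tool_call)                   # side effects happen here
    observability_hook(tool_call, result)         # records what happened
    return result

# A permitted read goes through; a destructive call is refused
# before `execute` ever runs.
run_tool_call({"action": "db.read"}, {"db.read"}, lambda call: "rows")
```

The observability hook alone would have produced a perfect trace of the Kiro incident. Only the gate could have prevented it.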

The Boundary Is the Architecture

Practitioners on Hacker News raise a consistent complaint when discussing production agent failures: "When you give an LLM write-access to a production database, you face a massive safety and trust gap." And: "Production AI systems fail silently, requiring humans to wake up at 3 AM to recover."

These aren't model quality complaints. They're boundary complaints. The agent reached something it shouldn't have reached, did something that wasn't constrained, and the 3 AM page came because the constraint wasn't in place before execution — it showed up in the incident response process that followed.

The architectural answer is a governance plane that sits between agent decisions and the systems they affect. Not a wrapper around the model, not a logging layer on top — a boundary layer that operates before tool calls execute.

That layer does three things the model cannot do for itself:

Pre-execution policy enforcement. Policy enforcement at the governance layer fires before an agent's tool call reaches the production system. It asks: is this action permitted for this agent, in this operational context, given current state? The agent doesn't get to override the policy by reasoning its way around it — the enforcement is structural, not advisory.

Validated production interfaces. The signal-domain pattern interposes a validated boundary between agent actions and the systems they affect. For production deployments, this means agents interact through controlled interfaces that define what's accessible and what isn't — not a policy document in a system prompt, but a structural constraint on what the agent can reach.

Registry-based authorization. A governed registry of what agents are authorized to do means the access envelope is defined externally, not inside the agent's own context. Kiro's access to production infrastructure was presumably intentional — but the absence of a registry-enforced constraint on what it could do with that access is what turned legitimate access into a 13-hour outage.
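Taken together, the three mechanisms amount to a gate that runs before any tool call is dispatched. Here is a minimal sketch of that shape; the class names, the registry structure, and the action strings are all illustrative assumptions, not Waxell's actual API:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolCall:
    agent_id: str
    action: str   # e.g. "db.read", "infra.delete"
    target: str   # the resource the call touches

@dataclass
class AgentRegistry:
    # Access envelopes are defined externally, per agent,
    # not inside the agent's own prompt or context.
    envelopes: dict[str, set[str]] = field(default_factory=dict)

    def permits(self, call: ToolCall) -> bool:
        return call.action in self.envelopes.get(call.agent_id, set())

class GovernanceGate:
    """Sits between agent decisions and the systems they affect."""

    def __init__(self, registry: AgentRegistry):
        self.registry = registry
        self.audit_log: list[tuple[ToolCall, bool]] = []

    def execute(self, call: ToolCall, dispatch):
        allowed = self.registry.permits(call)
        # The execution record is generated by the gate, not the agent,
        # so the audit trail exists regardless of what the agent does.
        self.audit_log.append((call, allowed))
        if not allowed:
            raise PermissionError(
                f"{call.agent_id} is not authorized for {call.action}")
        return dispatch(call)  # only reached after the check passes

# Usage: a read inside the envelope succeeds; a destructive action
# outside it is blocked before it reaches the system, and the blocked
# attempt is still recorded.
registry = AgentRegistry({"deploy-agent": {"db.read", "config.write"}})
gate = GovernanceGate(registry)
gate.execute(ToolCall("deploy-agent", "db.read", "orders"), lambda c: "rows")
try:
    gate.execute(ToolCall("deploy-agent", "infra.delete", "prod-env"),
                 lambda c: "boom")
except PermissionError:
    pass
```

The essential property is that the agent cannot reason its way past `permits`: the envelope lives outside the agent's context, and the dispatch line is unreachable until the check succeeds.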

Gravitee's State of AI Agent Security 2026 report found that 82% of executives are confident their policies protect against unauthorized agent actions — but only 14.4% of organizations actually send agents to production with full security or IT approval. That 68-point gap is the pilot-to-production gap in numbers. It isn't a confidence problem; it's an enforcement infrastructure problem.

How Waxell Handles This

Waxell's governance plane is designed specifically for the moment an agent pilot needs to become a production deployment. The enforcement layer sits between agent decisions and production system access — before execution, not after.

Agents are registered in the fleet catalog with defined access envelopes. Policy enforcement validates tool calls before they reach the underlying system. Signal-domain boundaries ensure agents interact through validated interfaces rather than direct system access. Execution records are generated independently of the agent — so the audit trail exists regardless of what the agent does.

The organizations that successfully cross the pilot-to-production gap share a structural characteristic: they built or bought an enforcement layer, and they did it before shipping to production. Not as a post-incident retrofit. Not as a compliance checkbox. As the operational infrastructure that makes reliable production deployment possible at all.

The 78% figure describes organizations still waiting for that infrastructure to be in place. For most of them, the model is already good enough. The enforcement layer is what's missing.

FAQ

Why do AI agent pilots succeed when production deployments fail?
Pilots run on controlled inputs, limited scope, and close human supervision. In production, agents encounter the full distribution of real conditions — unexpected inputs, edge cases, and operational contexts the pilot never tested. Without structural enforcement on what agents can do in those conditions, behavior becomes unpredictable at exactly the moment when reliability matters most.

What's the difference between AI agent observability and governance?
Observability tools (LangSmith, Arize, Helicone) log what agents do — tracing calls, surfacing failures, measuring latency and spend. Governance enforcement operates before tool calls execute — validating whether an action is authorized, enforcing policy, and blocking unauthorized access before it reaches production systems. Observability tells you what happened. Governance determines what's allowed to happen.

What happened in the Amazon Kiro incident?
In December 2025, Amazon's Kiro AI agent autonomously deleted and recreated an AWS production environment in one of its China regions, causing a 13-hour outage. A second incident involving Amazon Q Developer followed under similar conditions. The Financial Times reported the story in February 2026. The incidents illustrated that agent access to production systems, without pre-execution enforcement of what the agent is permitted to do, creates outage risk regardless of model quality.

How does a governance registry reduce pilot-to-production failures?
A fleet registry defines, externally and structurally, what each agent is authorized to do. The access envelope is set before the agent runs — not by the agent itself, and not in a system prompt the agent can reason around. When an agent encounters a condition outside its defined scope, the registry check fails before the action executes rather than after the damage is done.

Does pre-execution enforcement add latency?
Governance policy checks add a small amount of latency at the enforcement boundary — typically single-digit milliseconds for synchronous policy evaluation. For production workflows where agent tool calls involve network I/O to external systems, this overhead is negligible relative to the call itself. The tradeoff is structurally favorable: marginal latency in exchange for the operational confidence that makes production deployment viable.
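As a rough illustration of that arithmetic, here is a toy in-process check (not a real policy engine) timed against its own cost:

```python
import time

# Illustrative access envelope; a real policy evaluation may consult
# rules or a service, but the order of magnitude holds for in-process
# and same-datacenter checks.
ALLOWED = {"db.read", "config.write"}

def policy_check(action: str) -> bool:
    return action in ALLOWED  # in-memory membership test

N = 100_000
start = time.perf_counter()
for _ in range(N):
    policy_check("db.read")
per_check_ms = (time.perf_counter() - start) * 1000 / N

# An in-process check costs microseconds per call; even a synchronous
# round trip to a separate policy service is typically single-digit
# milliseconds, while the guarded tool call usually spends tens of
# milliseconds on network I/O to the target system.
print(f"{per_check_ms:.6f} ms per check")
```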

When should a governance layer be implemented?
Before the first production deployment, not after the first incident. Retrofitting enforcement into an existing agent architecture after a production failure is significantly harder than building it in during the pilot-to-production transition. Organizations that cross the 78% gap consistently treat governance infrastructure as a prerequisite for production, not a follow-on project.

Sources

  1. Particula, "When AI Agents Delete Production: Lessons from Amazon's Kiro Incident" — 2026. https://particula.tech/blog/ai-agent-production-safety-kiro-incident

  2. AI Incident Database, "Incident 1152: LLM-Driven Replit Agent Reportedly Executed Unauthorized Destructive Commands During Code Freeze, Leading to Loss of Production Data." https://incidentdatabase.ai/cite/1152/

  3. Digital Applied, "AI Agent Scaling Gap March 2026: Pilot to Production" — March 2026. https://www.digitalapplied.com/blog/ai-agent-scaling-gap-march-2026-pilot-to-production

  4. Digital Applied, "AI Agent Scaling Gap: Why 90% of Pilots Never Ship." https://www.digitalapplied.com/blog/ai-agent-scaling-gap-90-percent-pilots-fail-production

  5. Arize AI, "AI Agent Debugging: Four Lessons from Shipping Alyx to Production." https://arize.com/blog/ai-agent-debugging-four-lessons-from-shipping-alyx-to-production/

  6. Hacker News, "Why autonomous AI agents fail in production." https://news.ycombinator.com/item?id=46450307

  7. Hacker News, "Most 'AI agents' don't survive production – here's what works." https://news.ycombinator.com/item?id=45718390

  8. Gravitee, "State of AI Agent Security 2026: When Adoption Outpaces Control." https://www.gravitee.io/blog/state-of-ai-agent-security-2026-report-when-adoption-outpaces-control

  9. Getmaxim, "Ensuring AI Agent Reliability in Production" (cites Gartner projection). https://www.getmaxim.ai/articles/ensuring-ai-agent-reliability-in-production/

Agentic Governance, Explained

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.
