Logan Kelly

Deploy Claude Agents to Production: The Six Hard Parts, and How Waxell Runtime Handles Them

Hosting Claude agents means subprocess supervision, session persistence, and tenant isolation. Waxell Runtime gives you that governed environment without building it yourself.

Anthropic's own production documentation for the Claude Agent SDK is the clearest argument that deploying a Claude agent is not deploying an API wrapper. Two guides — Hosting the Agent SDK and Securely deploying AI agents — read less like a quickstart and more like a checklist of everything that has to be right before a real user touches the system: how to supervise the subprocess, where session state lives and how to keep it from vanishing on a restart, how to stop one tenant's context from leaking into another's, how to inject credentials the agent should use but never see, and how to box the agent so a prompt injection can't exfiltrate data.

None of that complexity is incidental. An agent that runs code and reaches the network is exactly the kind of workload that demands supervision, isolation, and enforcement. The complexity is the point. The open question is whether every team shipping a Claude agent should rebuild the same governed execution layer from primitives — and that is the gap Waxell Runtime is built to close.

This piece walks the six hard parts the Anthropic guides describe, then maps each to how a governed runtime removes it.

Why is hosting a Claude agent structurally harder than hosting an API?

Because of how the SDK runs. When application code calls query(), the SDK spawns a separate claude CLI process and communicates with it over stdio. That subprocess owns the shell, the working directory, and the JSONL session transcripts on local disk. One agent session maps to one subprocess. Run N concurrent sessions and the host is supervising N process trees, each with its own filesystem state.

That single architectural fact is the root of most production difficulty, and it cascades into six concrete problems.

Hard part 1 — State lives on ephemeral local disk

Session transcripts, CLAUDE.md memory files, and working-directory artifacts all default to the container's filesystem. None survive a restart, a scale-down, or a move to another node. Resuming a session a user expects to continue requires mirroring transcripts to durable storage through a SessionStore adapter — and the hosting guide is explicit that the store mirrors transcripts only, so memory files and artifacts need a separate sync strategy on top.
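To make the transcript-mirroring idea concrete, here is a minimal sketch. It assumes transcripts are append-only `.jsonl` files in a local session directory. The function name `mirror_transcripts` and both directories are hypothetical, and this is not the SDK's SessionStore interface; a real store would more likely target object storage or a database than a second directory.

```python
import shutil
from pathlib import Path


def mirror_transcripts(local_dir: Path, durable_dir: Path) -> list[str]:
    """Copy new or grown .jsonl transcripts to durable storage.

    Returns the names of the files copied on this pass. Memory files and
    working-directory artifacts are deliberately ignored: they need their
    own sync strategy.
    """
    durable_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(local_dir.glob("*.jsonl")):
        dst = durable_dir / src.name
        # Assumption: transcripts only ever grow, so a size mismatch means new turns.
        if not dst.exists() or dst.stat().st_size != src.stat().st_size:
            shutil.copy2(src, dst)
            copied.append(src.name)
    return copied
```

Run on a timer or after each turn, a loop like this keeps a resumable copy of the transcript off the container, while a CLAUDE.md sitting in the same directory is left behind, which is exactly the gap the hosting guide warns about.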

Hard part 2 — Concurrency is bounded by RAM, not CPU

Each session holds a subprocess in memory. The guide's own sizing formula is agents per host = (host RAM − overhead) / per-session RAM ceiling, where the ceiling is measured by running a representative session to target length under real tool load and recording peak RSS. Horizontal scaling means pinning each session to one container via consistent hashing on sessionId, because the live subprocess only exists on the box that spawned it.
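The sizing formula and the session-pinning rule can both be sketched in a few lines. The numbers in the usage lines are hypothetical, and `SessionRing` is an illustrative name for a generic consistent-hash ring, not something taken from the hosting guide.

```python
import bisect
import hashlib
import math


def agents_per_host(host_ram_gb: float, overhead_gb: float,
                    per_session_ceiling_gb: float) -> int:
    """(host RAM - overhead) / per-session RAM ceiling, rounded down."""
    usable = host_ram_gb - overhead_gb
    if usable <= 0 or per_session_ceiling_gb <= 0:
        return 0
    return math.floor(usable / per_session_ceiling_gb)


def _point(key: str) -> int:
    # Stable 64-bit ring position (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class SessionRing:
    """Consistent-hash ring that pins each sessionId to one container."""

    def __init__(self, containers: list[str], vnodes: int = 64):
        # Several virtual nodes per container even out the distribution.
        self._ring = sorted(
            (_point(f"{name}#{i}"), name)
            for name in containers
            for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def container_for(self, session_id: str) -> str:
        # First ring point clockwise from the session's position, wrapping around.
        idx = bisect.bisect(self._points, _point(session_id)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical host: 64 GB RAM, 4 GB overhead, 1.5 GB measured peak RSS per session.
print(agents_per_host(64, 4, 1.5))  # 40
ring = SessionRing(["box-a", "box-b", "box-c"])
print(ring.container_for("session-123") == ring.container_for("session-123"))  # True
```

The reason to prefer a ring over `hash(sessionId) % N` is what happens on scale-out: adding a container to the ring moves only the sessions that land on the new container, while a plain modulo reshuffles most of them away from the box that holds their live subprocess.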

Hard part 3 — There is no built-in session timeout

A session does not stop on its own. The only native bound is maxTurns. Left unmanaged, a runaway loop runs until something external kills it. The hosting guide notes that a single long agent session can spend dollars in tokens while the container under it costs roughly $0.05 per hour — so a reconciliation loop expected to cost $0.10 can, in the wrong conditions, burn past $100 before anything intervenes. The expensive failure is the agent's behavior, not the infrastructure.
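One way to add the missing bounds yourself is to wrap the message stream in a guard that enforces both a wall-clock deadline and a spend ceiling. The sketch below is generic: `bounded`, `BudgetExceeded`, and the `cost_of` callback are hypothetical names, and it assumes only that the agent's output can be consumed as an async iterator and that you can estimate a dollar cost per message.

```python
import asyncio
import time


class BudgetExceeded(RuntimeError):
    """Raised when a session hits its time or spend ceiling."""


async def bounded(stream, *, max_seconds: float, max_cost_usd: float, cost_of):
    """Yield items from `stream` until a wall-clock or spend ceiling is hit."""
    deadline = time.monotonic() + max_seconds
    spent = 0.0
    it = stream.__aiter__()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise BudgetExceeded("wall-clock limit reached")
        try:
            # Also bounds a single hung turn, not just the gaps between turns.
            item = await asyncio.wait_for(it.__anext__(), timeout=remaining)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            raise BudgetExceeded("wall-clock limit reached") from None
        spent += cost_of(item)
        if spent > max_cost_usd:
            raise BudgetExceeded(
                f"spend ${spent:.2f} exceeded ${max_cost_usd:.2f}"
            )
        yield item
```

The caller still has to terminate the underlying subprocess when `BudgetExceeded` is raised; the guard only stops consuming output. That remaining step is the part a kill switch exists to cover.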

Hard part 4 — Multi-tenant context leaks by default

The secure-deployment guide is direct: the SDK reads settings and CLAUDE.md memory from the filesystem, so in a shared container one tenant's context can leak into another's session. Closing that gap means passing settingSources: [], setting CLAUDE_CODE_DISABLE_AUTO_MEMORY=1, pointing CLAUDE_CONFIG_DIR at a per-tenant path, giving every tenant its own working directory, and applying per-tenant egress rules at a proxy — five separate controls that all have to be remembered on every query() call.
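Because the risk is forgetting one control on one call, it helps to derive all of them from the tenant ID in a single place. The sketch below only builds a settings dictionary and does not call the SDK. `settingSources`, `CLAUDE_CODE_DISABLE_AUTO_MEMORY`, and `CLAUDE_CONFIG_DIR` come from the guide as quoted above; the helper name, the directory layout, and the `cwd` and `env` keys are assumptions to map onto whatever options object your SDK version expects. Egress rules live at the proxy and are out of scope here.

```python
import re
from pathlib import Path


def tenant_query_settings(tenant_id: str, root: Path) -> dict:
    """Build the per-tenant isolation settings for one query() call."""
    # Reject anything that could escape the tenant's directory (e.g. "../other").
    if not re.fullmatch(r"[A-Za-z0-9_-]+", tenant_id):
        raise ValueError(f"unsafe tenant id: {tenant_id!r}")
    base = root / tenant_id
    return {
        # Do not load settings from the shared filesystem.
        "settingSources": [],
        # Hypothetical key: a working directory private to this tenant.
        "cwd": str(base / "work"),
        # Hypothetical key: environment passed to the agent subprocess.
        "env": {
            "CLAUDE_CODE_DISABLE_AUTO_MEMORY": "1",
            "CLAUDE_CONFIG_DIR": str(base / "config"),
        },
    }
```

Routing every call through one helper turns four of the five controls into a single code path that can be reviewed and tested, instead of a checklist each call site has to get right.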

Hard part 5 — Credentials and network egress need a proxy outside the boundary

Because an agent's behavior can be steered by the content it processes — a README, a webpage, a tool result, the prompt-injection surface — the guide recommends treating it like semi-trusted code. The credential pattern runs a proxy outside the agent's security boundary that injects secrets into outgoing requests, so the agent makes the call but never holds the key. The hardened container example goes further, running with --network none and reaching the outside world only through a mounted Unix socket to a host proxy that enforces a domain allowlist and logs every request.
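The two proxy duties described here, enforcing a domain allowlist and injecting a credential the agent never holds, can be modeled without any networking. This is a toy sketch under stated assumptions: `EgressProxy` and its method are hypothetical names, the bearer-token scheme is illustrative, and the Unix-socket transport from the hardened container example is not modeled.

```python
from urllib.parse import urlsplit


class EgressProxy:
    """Toy model of a proxy that sits outside the agent's security boundary."""

    def __init__(self, secrets_by_host: dict[str, str]):
        # The allowlist is exactly the set of hosts we hold a credential for.
        self._secrets = dict(secrets_by_host)
        self.log: list[tuple[str, str]] = []  # (decision, url) for every request

    def prepare(self, url: str, agent_headers: dict[str, str]) -> dict[str, str]:
        """Return the headers to send upstream, or raise if the host is not allowed."""
        # .hostname is lowercased and excludes any user:pass@ prefix.
        host = urlsplit(url).hostname
        if host not in self._secrets:
            self.log.append(("deny", url))
            raise PermissionError(f"egress to {host!r} is not on the allowlist")
        # Drop whatever the agent put in Authorization, then inject the real key.
        headers = {k: v for k, v in agent_headers.items()
                   if k.lower() != "authorization"}
        headers["Authorization"] = f"Bearer {self._secrets[host]}"
        self.log.append(("allow", url))
        return headers
```

Matching on the exact parsed hostname matters: a substring or prefix check would accept `api.example.com.evil.net` or `api.example.com@evil.net`, both of which an injected instruction could produce.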

Hard part 6 — Isolation strength is a build-it-yourself decision

The guide lays out a spectrum — sandbox runtime, Docker with dropped capabilities and a seccomp profile, gVisor intercepting syscalls in userspace, Firecracker microVMs — each a different point on the curve between isolation strength, performance overhead, and operational complexity. Choosing, building, and maintaining that layer is a project in itself, and it sits entirely upstream of the agent doing anything useful.

The complexity is real. The question is who carries it.

The honest framing matters here, because pretending the problem is simple would be the wrong message. Everything in those two guides is necessary. An agent that can run code and reach the network should be supervised, isolated, credential-fenced, and audited. A team that skips those steps is not shipping faster — it is shipping a liability.

So the goal is not to make the problem disappear. It is to avoid having every serious agent team rebuild the same governed execution environment in parallel — the session store, the egress proxy, the per-tenant isolation, the kill switch, the audit trail — each from the same primitives, each slightly differently, each a fresh source of bugs. That undifferentiated platform layer is precisely what a runtime is supposed to absorb.

How Waxell handles this

Waxell Runtime is a governed execution environment for AI agents, and the answers to the six hard parts above are properties of that environment rather than infrastructure a team assembles. Policy enforcement and isolation are how the runtime works, not layers added on top.

  • Durable, resumable workflows answer hard parts 1 and 3. Workflows checkpoint automatically at every step. A network failure does not lose the run; a policy that requires human input pauses execution and resumes from the exact point it stopped, not from the beginning. Agents survive infrastructure restarts and model timeouts — the durable-session problem the hosting guide hands to a SessionStore a team would otherwise build and operate.

  • Isolated execution by default answers hard parts 4 and 6. Every run executes in an isolated environment with no shared state and no cross-contamination between workflows or tenants. The leakage surface the security guide warns about — shared memory files, shared config, shared working directories — is not a set of flags to remember. Isolation is the default behavior, documented in the Waxell isolation model.

  • Policy enforcement before each step answers hard part 5 and the runaway-cost half of hard part 3. The same 50+ policy categories available in Waxell Observe — cost, safety, PII, compliance, identity, rate limits — are enforced natively, gating what an agent is allowed to do before each step executes rather than after it is logged. A cost ceiling terminates the reconciliation loop at its threshold; a content policy blocks an outbound request carrying account numbers before the call leaves the boundary.

  • Kill switches at every level answer the missing-timeout problem outright. Stop any agent, any workflow, any session, immediately — no graceful shutdown, no waiting.

  • Audit trails as a byproduct of execution mean every decision is logged and policy-evaluated automatically, so the execution trace is also the compliance record. No separate logging layer patched on after the fact.
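For readers who have not worked with durable workflows, the checkpoint-and-resume behavior in the first bullet can be illustrated generically. The sketch below is not Waxell's implementation or API; `run_with_checkpoints` is a hypothetical helper that assumes each step's result is JSON-serializable and that a local file stands in for durable storage.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, checkpoint: Path, initial):
    """Run (name, fn) steps in order, persisting each result before moving on."""
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    value = initial
    for name, fn in steps:
        if name in done:
            # Completed in an earlier attempt: reuse the result, do not re-execute.
            value = done[name]
            continue
        value = fn(value)
        done[name] = value
        # Simplification: a real store would make this write atomic.
        checkpoint.write_text(json.dumps(done))
    return value
```

If the second of two steps fails on a network error, calling the function again with the same checkpoint file skips the first step and retries only the second, so the run resumes from where it stopped rather than from the beginning.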

Agents are defined with Waxell's Python SDK decorators, and the runtime owns isolation, checkpointing, enforcement, and audit trails from the first run — no rebuilds of the hosting stack required:

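from waxell_sdk import agent, workflow, decision

@agent(name="financial-reconciliation")
class ReconciliationAgent:
    @workflow
    def run(self, ctx):
        # Classify the transaction first, then pass the verdict to the execute step.
        result = ctx.call(self.validate)
        # Note: self.execute is not shown in this excerpt.
        return ctx.call(self.execute, result)

    @decision
    def validate(self, ctx):
        # Ask the model to sort the transaction into one of three categories.
        return ctx.llm.classify(
            ctx.input.transaction,
            categories=["approved", "flagged", "requires-review"]
        )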

Every @decision is logged, policy-evaluated, and audit-trailed. Every @workflow is durable and resumable. The subprocess supervision, per-tenant isolation, egress control, and kill switch are the runtime's responsibility, not the application's.

For Claude agents already running in production, the entry point is Waxell Observe, not a rewrite. Two lines of Python instrument an existing agent — Observe auto-instruments 200+ libraries, including the Anthropic SDK — and bring it under the same 50+ policy categories with no change to agent logic. And for agents a team did not build at all — vendor agents, third-party integrations, MCP-native tools — Waxell Connect governs them with no SDK and no code changes required. Runtime is where a workload lands when the stakes make governance-on-top insufficient and the environment itself has to enforce.

When is governed execution worth it? Three scenarios

A fintech runs a reconciliation agent that reads transactions and moves money. The secure-deployment guide is unambiguous that an agent like this needs credential fencing and network containment. In Runtime the agent never holds the wire-transfer credential, policy gates the execution step, the kill switch is one call away, and the audit trail SOX and SR 11-7 expect is produced by the run itself rather than reconstructed afterward.

A healthcare platform's intake agent enters an unexpected loop, re-querying a symptom database because of a parsing edge case that never appeared in testing. The hosting guide notes there is no native session timeout — maxTurns is the only built-in bound. In Runtime a cost policy terminates the session at its threshold and a HIPAA-profiled audit record captures every decision that touched PHI, with no post-hoc logging to patch on.

A platform team hosts a dozen Claude agents for different internal groups in one environment. The security guide's multi-tenant checklist — settingSources: [], disabled auto-memory, per-tenant config directories, per-tenant working directories, per-tenant egress — is exactly the leakage surface that isolated-by-default execution removes. A confused or compromised agent cannot read another tenant's context, because there is no shared context to read.

What Waxell Runtime is not

It is not a replacement for the security guide's principles. Isolation, least privilege, and defense in depth remain the right mental model. Runtime implements them as the environment's default behavior rather than as infrastructure to wire together — but the principles are the same ones Anthropic documents.

It is not a way to skip production discipline. Agents that move money or touch PHI still demand deliberate policy design, human-in-the-loop approval on the high-stakes steps, and real review. Runtime supplies the enforcement surface; it does not set an organization's risk tolerance.

It is not only for greenfield builds. Runtime is the right home for new agents and planned migrations. Claude agents already in production come under governance through Observe in two lines of code, and the workflows that need native execution governance migrate when the team is ready. Every step delivers standalone value.

Frequently Asked Questions

Is it hard to deploy Claude agents to production?
Done properly, yes — and Anthropic's own hosting and secure-deployment guides show why. The Agent SDK spawns a long-lived claude subprocess per session that owns a shell, a working directory, and session files on local disk. Production hosting means supervising those processes, persisting session state otherwise lost on restart, isolating tenants, injecting credentials the agent should never see, and sandboxing against prompt injection. None of it is optional for serious workloads.

What does Waxell Runtime do for a Claude agent deployment?
It provides the governed execution environment those guides tell teams to build: isolated-by-default execution, 50+ policy categories gating each step before it runs, kill switches at the agent, workflow, and session level, durable workflows that checkpoint and resume, and audit trails produced by execution itself. Instead of assembling a session store, an egress proxy, per-tenant isolation, and a kill switch from primitives, a team gets them as properties of the runtime.

Does Waxell Runtime use the Claude Agent SDK directly?
Runtime agents are defined with Waxell's Python SDK decorators (@agent, @workflow, @decision), which is what makes isolation, checkpointing, and policy enforcement native to every step. For agents already built on the Claude Agent SDK or another framework, the path is Waxell Observe, which instruments existing agents — including Anthropic SDK agents — in two lines of Python with no rewrite.

How does Runtime handle the session-persistence problem from the hosting guide?
The hosting guide hands durable sessions to a SessionStore adapter a team builds and operates. Runtime makes durability native: workflows checkpoint at every step, survive infrastructure restarts and model timeouts, and resume from the exact point they stopped — including pausing for human-in-the-loop approval and resuming on response.

How does Runtime address prompt injection and data exfiltration?
The same way the security guide recommends — isolation plus a controlled enforcement boundary — but as defaults. Execution is isolated with no shared state between tenants, and policies covering PII, content, identity, and rate limits gate each step before it executes. Enforcement sits between the agent's intent and the action, so an injected instruction cannot drive an action that policy forbids.

What compliance coverage does Runtime provide?
Runtime's policy engine ships with HIPAA, SOC 2, and PCI-DSS profiles, with data residency configurable in US East or EU West at onboarding. Because every decision is policy-evaluated and audit-trailed automatically, the execution trace is the compliance evidence — the chain of custody frameworks like SOX, MiFID II, SR 11-7, and HIPAA require.

Deploying a Claude agent to production should be hard, because the failure modes are real. It does not have to be hard for you. Building for a workflow where wrong is expensive? Get access to Waxell Runtime at https://waxell.ai/get-access.

Waxell

Waxell provides observability and governance for AI agents in production. Bring your own framework.

© 2026 Waxell. All rights reserved.

Patent Pending.
