Logan Kelly
Rolling back agent code is easy. Rolling back agent behavior is something else. Here's why agent versioning is a governance requirement, not just an ops task.

When your CI/CD pipeline rolls back the code, what rolls back the behavior?
Most teams discover the answer is "nothing." They discover it in production, while something is broken, and the git history they just reverted doesn't explain why the agent is still doing the thing it was doing before the rollback.
This is the gap that separates agent operations from service operations. A microservice rolled back to the previous commit behaves predictably like the previous commit. An agent rolled back to the previous commit might still carry the prompt that was updated directly in your prompt management UI last Thursday — the change that, combined with a tool schema update from a third-party API, produced the failure you're trying to undo. The code is the same. The behavior isn't.
According to an OutSystems survey of nearly 1,900 global IT leaders published in April 2026, 96% of enterprises now use AI agents in some capacity. Only 12% have implemented a centralized platform to manage them. With EU AI Act enforcement of Annex III high-risk systems arriving August 2, 2026 — covering AI used in employment decisions, credit scoring, healthcare, education, and essential services — "centralized control" is about to have a regulatory definition, and "we have a git repo" won't meet it.
That gap — between deployment and control — is what agent versioning, done correctly, starts to close.
AI agent versioning is the practice of managing the full behavioral identity of an agent across changes over time — including its code, its prompt, its policy set, its tool access scope, and its runtime authorization level. Unlike service versioning, which treats the codebase as the primary artifact, agent versioning must treat the behavioral envelope as the artifact. An agent at version 1.0 and an agent at version 1.1 may share identical code but exhibit meaningfully different behavior if their prompts, connected tools, or governance policies have changed. Behavioral versioning is the prerequisite for behavioral governance: you cannot enforce a governance plane against something you can't identify by version.
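A behavioral identity can be made concrete as a fingerprint over all five components. A minimal sketch (component names and values are illustrative, not any particular platform's schema) showing why two versions with identical code still get distinct identities:

```python
import hashlib
import json

def behavioral_fingerprint(code_sha: str, prompt_version: str,
                           tool_schemas: dict, model_pin: str,
                           policy_set: str) -> str:
    """Hash all five behavioral components into one comparable version id."""
    payload = json.dumps({
        "code": code_sha,
        "prompt": prompt_version,
        "tools": tool_schemas,          # tool name -> schema version
        "model": model_pin,
        "policies": policy_set,
    }, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

# Identical code, updated prompt: same commit, different behavioral identity.
v_a = behavioral_fingerprint("9f2c1e0", "prompt-v4", {"search": "1.2"},
                             "gpt-4o-2024-08-06", "policies-v7")
v_b = behavioral_fingerprint("9f2c1e0", "prompt-v5", {"search": "1.2"},
                             "gpt-4o-2024-08-06", "policies-v7")
```

Any component changing changes the fingerprint, which is exactly the property a git commit hash lacks for agents: it only moves when the code moves.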
Why does rolling back an AI agent work differently than rolling back a service?
The discipline of CI/CD was built for code-driven systems. Write code, test it, deploy it, revert it if something breaks. The mental model is: code = behavior. Revert the code, revert the behavior.
This model breaks for AI agents at three points.
Prompts are not code. Most teams manage prompts separately from application code — in a prompt management UI, a CMS, a database, or directly in a third-party platform like a prompt hub or the model provider's console. When something goes wrong in production, the git history shows you what the code was at each version. It does not show you what the system prompt was. If the prompt was changed outside the code repository, you have no rollback target.
Tool schemas change independently. Agents that call external APIs, internal services, or MCP servers depend on those tools behaving consistently. When a connected service changes its API schema — even a minor change, an added required field, a changed response format — the agent's behavior can shift even though the agent's own code never changed. You can revert the agent's code to last week; the tool it calls is still running today's schema.
Models drift. If your agent uses a hosted model from OpenAI, Anthropic, or Google, the model itself may change between your last deployment and today. Most providers offer version pins, but teams that don't pin model versions are running agents whose behavior can shift whenever the provider updates the underlying model — and no code rollback will undo that.
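For hosted models, the difference between a floating family name and a pinned snapshot is usually visible in the model identifier itself. A rough deploy-time gate (illustrative only; naming conventions vary by provider, so treat the pattern as an assumption):

```python
import re

def is_pinned(model_id: str) -> bool:
    """Heuristic: treat ids ending in a date stamp (e.g. 'gpt-4o-2024-08-06'
    or an eight-digit suffix) as pinned snapshots; bare family names as
    floating aliases the provider can repoint at a new model."""
    return bool(re.search(r"(\d{4}-\d{2}-\d{2}|\d{8})$", model_id))

def check_config(config: dict) -> None:
    """Refuse configs whose model behavior the provider can silently change."""
    if not is_pinned(config["model"]):
        raise ValueError(f"unpinned model id: {config['model']!r}")
```

A check like this belongs in CI, not in a runbook: by the time someone notices a floating alias in production, the model behind it may already have changed.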
The consequence is that code version is not a proxy for agent behavior version. A team that tracks only git commits has an incomplete version history. They know what the agent's code was. They don't know what the agent was — the complete configuration that produced the behavior they're trying to restore.
What are the three failure modes that unversioned agents create in production?
Failure mode 1: Silent behavioral drift. Prompt changes, model updates, and tool schema shifts accumulate across an agent's lifetime. None of them trigger a deployment. None of them appear in the deployment log. The agent's behavior changes gradually, through a series of small updates across different systems, until it reaches a state that's materially different from the state that passed evaluation — and there's no point-in-time record of how it got there.
Silent drift is the hardest failure mode to diagnose because nothing breaks cleanly. No error fires. The deployment log is quiet. What you notice first is usually something like: user escalation rate is up 15% this week, or the eval suite that passed three weeks ago now fails on 20% of cases. You diff the code — identical. You check the deployment log — nothing shipped. Then someone remembers that the prompt was updated in the LangSmith prompt hub on Tuesday, and the customer support tool it calls quietly added a required priority field to its schema last Wednesday. Neither change appears in your git history. Neither change triggered a deployment event. Together, they produced the behavior your eval is now flagging, and you have no rollback target for either.
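The diagnosis in that scenario becomes mechanical once you have point-in-time configuration snapshots to diff. A minimal sketch (snapshot keys mirror the behavioral components discussed here; the values are hypothetical):

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Return the behavioral components that changed between two
    point-in-time configuration snapshots."""
    return {key: (old.get(key), new.get(key))
            for key in set(old) | set(new)
            if old.get(key) != new.get(key)}

# The scenario above: code identical, but the prompt and one tool schema
# drifted through systems that never touched the deployment pipeline.
three_weeks_ago = {"code": "9f2c1e0", "prompt": "v4",
                   "tools": {"support": "1.0"}, "model": "gpt-4o-2024-08-06"}
today = {"code": "9f2c1e0", "prompt": "v5",
         "tools": {"support": "1.1"}, "model": "gpt-4o-2024-08-06"}
drift = diff_snapshots(three_weeks_ago, today)
```

Without the snapshots, the diff is impossible; that, not the diffing itself, is the hard part of drift detection.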
Failure mode 2: Policy mismatch. Governance policies — the rules that define what an agent is allowed to access, spend, output, and do — are typically scoped to a version of the agent's configuration. When the agent's configuration drifts without a corresponding policy update, the enforcement layer is no longer calibrated to what the agent is actually doing.
An agent that started as a read-only document summarizer, governed accordingly, gains write tool access in version 2. If the governance policies weren't updated alongside that change, the policies governing the agent still reflect the read-only access model. The agent is running with the wrong policy set for its actual capabilities. This isn't a theoretical risk — it's what happens when deployment and governance operate on different version clocks.
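The mismatch is detectable automatically if policies and agent configurations carry comparable capability sets. A sketch of the check (tool and policy names are hypothetical):

```python
def policy_covers(policy: dict, agent_config: dict) -> bool:
    """A policy set is calibrated only if it accounts for every tool the
    agent can actually call; anything extra ran ahead of governance."""
    return set(agent_config["tools"]) <= set(policy["allowed_tools"])

# The read-only summarizer scenario: version 2 gained write access, but
# the policy set still reflects the version 1 access model.
read_only_policy = {"allowed_tools": {"read_document", "summarize"}}
summarizer_v1 = {"tools": {"read_document", "summarize"}}
summarizer_v2 = {"tools": {"read_document", "summarize", "write_document"}}
```

Run at deploy time, a check like this turns the version-clock mismatch from an incident finding into a blocked promotion.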
Failure mode 3: Ungovernable rollback. When something goes wrong and an incident team needs to roll back, they need to know what they're rolling back to. If agent versioning only tracks code, a rollback to the previous code tag doesn't guarantee a rollback to the previous behavior. The prompt might still be wrong. The tool schema might still be changed. The model version might be different. And critically, the governance policies attached to the rolled-back code version might not match the behavior the agent will actually exhibit.
A rollback that can't be verified against a known-good behavioral state isn't a recovery — it's a guess. Real incident response for agents requires the ability to say: at version X, this agent had this prompt, called these tools with these schemas, ran under these governance policies, and produced this range of behavior. Everything else is archaeology.
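That "ability to say" can be enforced as a pre-promotion check: compare the live environment against the known-good record, component by component. A sketch, assuming both are captured as flat dicts with hypothetical values:

```python
COMPONENTS = ("code", "prompt", "tools", "model", "policies")

def rollback_gaps(target_record: dict, live_state: dict) -> list:
    """List every behavioral component where the live environment still
    differs from the known-good record. A non-empty result means the
    rollback restored code, not behavior."""
    return [c for c in COMPONENTS
            if live_state.get(c) != target_record.get(c)]

known_good = {"code": "9f2c1e0", "prompt": "v4", "tools": {"support": "1.0"},
              "model": "gpt-4o-2024-08-06", "policies": "pol-7"}
after_code_revert = {"code": "9f2c1e0",
                     "prompt": "v5",               # lives outside the repo
                     "tools": {"support": "1.1"},  # third-party schema moved
                     "model": "gpt-4o-2024-08-06", "policies": "pol-7"}
```

The gaps list is the archaeology the text describes, done up front instead of mid-incident.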
What does behavioral versioning actually require?
Behavioral versioning means treating the complete agent configuration as the artifact, not just the code. In practice, that requires four things.
A version record that includes all behavioral components. Each agent version should record: the code commit hash, the prompt version (and where the prompt is stored), the list of connected tools and their schema versions at time of deployment, the model identifier and version pin, and the governance policy set active for this deployment. When all five are captured together, a version represents a discrete behavioral identity — something you can compare, roll back to, and enforce against.
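As a data structure, such a record can be small. A sketch in Python (field names are illustrative, not a standard schema); making it frozen means versions are hashable and directly comparable:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentVersion:
    """One discrete behavioral identity: all five components together."""
    code_commit: str       # git SHA of the agent code
    prompt_version: str    # prompt id...
    prompt_store: str      # ...and where that prompt is stored
    tool_schemas: tuple    # ((tool_name, schema_version), ...) at deploy time
    model_pin: str         # explicit provider model version pin
    policy_set: str        # governance policy set active for this deployment

v10 = AgentVersion("9f2c1e0", "prompt-v4", "prompt-hub",
                   (("search", "1.2"),), "gpt-4o-2024-08-06", "pol-7")
v11 = AgentVersion("9f2c1e0", "prompt-v5", "prompt-hub",
                   (("search", "1.2"),), "gpt-4o-2024-08-06", "pol-7")
```

Equality here is behavioral equality: v10 and v11 share a commit but compare unequal, which is the property the section is arguing for.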
A registry of what's running. Before you can version agents, you need a system of record for what agents are running in production. In practice this means: the LangChain agent the backend team shipped in Q3, the CrewAI orchestrator the AI platform team deployed in January, and the LlamaIndex pipeline someone wired up for a proof-of-concept that is now, somehow, handling real traffic. All of them are running. Most of them are not catalogued anywhere. An agent registry is the prerequisite for behavioral versioning: you can't version what you haven't catalogued.
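The registry itself can start small: a mapping from agent name to version history already answers the question of what is running. A minimal in-memory sketch (a production registry would also persist this and record who deployed what, and when):

```python
class AgentRegistry:
    """Minimal system of record: which agents exist, which version is live."""

    def __init__(self):
        self._versions = {}   # agent name -> version ids, oldest first

    def register(self, name: str, version_id: str) -> None:
        self._versions.setdefault(name, []).append(version_id)

    def current(self, name: str) -> str:
        return self._versions[name][-1]

    def catalogued(self) -> set:
        return set(self._versions)

registry = AgentRegistry()
registry.register("support-summarizer", "v1.0")
registry.register("support-summarizer", "v1.1")
registry.register("crew-orchestrator", "v2.3")
```

The hard part is not the data structure; it is getting the proof-of-concept pipeline handling real traffic to show up in `catalogued()` at all.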
Policy linkage to version identity. Governance policies need to attach to agent versions, not to the agent name or the codebase. When an agent's capabilities change — new tools, expanded access scope, different prompt behavior — the policy evaluation must reflect the current version's actual configuration, not the configuration that was current when the policy was last written.
Shadow mode testing before promotion. Running a new agent version in shadow mode — processing real traffic but with the actual outputs suppressed — is the most reliable way to catch behavioral regressions before they reach production. You're not comparing against an eval dataset; you're comparing the new version's behavior against the current production version under real conditions. The delta between versions is observable before you promote. This comes with a real cost in compute and latency in the shadow layer, but for high-stakes agent deployments, it's the tradeoff that makes rollback unnecessary most of the time.
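In skeleton form, shadow mode is a loop: serve the production version, run the candidate on the same input with its output suppressed, and record divergences. A sketch with agents as plain callables (a real setup would run the shadow asynchronously to keep it off the latency path, and the divergence test would be a behavioral comparison, not string equality):

```python
def shadow_compare(production, candidate, traffic, diverges):
    """Run the candidate against real inputs without serving its output;
    return the cases where its behavior departs from production."""
    deltas = []
    for request in traffic:
        served = production(request)   # this response reaches the user
        shadow = candidate(request)    # recorded only, never served
        if diverges(served, shadow):
            deltas.append((request, served, shadow))
    return deltas

# Toy stand-ins: the candidate handles one request differently.
prod = lambda r: f"summary:{r}"
cand = lambda r: "ESCALATE" if r == "ticket-2" else f"summary:{r}"
deltas = shadow_compare(prod, cand, ["ticket-1", "ticket-2", "ticket-3"],
                        lambda a, b: a != b)
```

The deltas list is the observable "delta between versions" the paragraph describes: evidence gathered under real conditions before anything is promoted.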
Traditional CI/CD pipelines don't do this. They test code against unit tests and integration tests. They don't compare behavioral envelopes under production conditions. Building this into your agent deployment workflow means capturing per-version execution traces in production — full records of what the agent did, what tools it called, what policies evaluated, what it output — so that "version 1.4 in shadow mode" has a concrete behavioral fingerprint, not just a passing test suite.
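A per-version trace store can start as an append-only log keyed by version id. A sketch of the record shape (field names are illustrative):

```python
def record_trace(store: dict, version_id: str, *,
                 tool_calls, policy_decisions, output) -> None:
    """Append one session's behavioral record under its version id, so a
    version accumulates a concrete fingerprint, not just a passing suite."""
    store.setdefault(version_id, []).append({
        "tool_calls": tool_calls,              # (tool, args) pairs, in order
        "policy_decisions": policy_decisions,  # policy id -> allow/block
        "output": output,
    })

traces = {}
record_trace(traces, "v1.4-shadow",
             tool_calls=[("search", {"q": "refund policy"})],
             policy_decisions={"spend-cap": "allow", "pii-filter": "allow"},
             output="summary: refunds within 30 days")
```

Aggregating these records per version is what turns "version 1.4 in shadow mode" into a distribution of behavior you can compare against production.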
How Waxell handles this
Waxell's agent registry maintains a catalog of what agents are running in your environment — across frameworks, deployments, and versions — as the foundation for behavioral versioning. The registry gives you the system of record that makes versioning tractable: before you can capture behavioral snapshots, you need to know what agents exist. On top of that, governance policies operate at the infrastructure layer — defined once, enforced across every agent session regardless of which framework built the agent underneath — so that when capabilities expand, you update the policy set for the current configuration rather than discovering the mismatch during an incident. The execution trace for each session — captured across any framework in three lines of SDK code — becomes the behavioral record for that version: what the agent did, what policies evaluated, what was blocked, what was allowed. When something goes wrong, incident response starts from a complete behavioral snapshot, not a code hash.
Frequently Asked Questions
What is AI agent versioning?
AI agent versioning is the practice of tracking and managing the complete behavioral identity of an agent across changes over time — including its code, system prompt, connected tool schemas, model version, and active governance policies. Unlike service versioning, where code typically determines behavior, agents can behave differently at the same code version depending on which prompt, which tools, and which model version they're running against. Behavioral versioning captures all of these together as a single version artifact.
Why can't I use git to version my AI agents?
Git tracks code changes accurately. It doesn't track prompt changes stored in a prompt management system, schema changes in the external APIs your agent calls, model version changes in hosted LLM providers, or updates to governance policies in a separate control plane. An agent's behavior is determined by all of these together — not by the code alone. Teams that only use git for agent versioning have an incomplete record: they know what the code was, but they can't reconstruct what the agent actually was at any given point in time.
What should an AI agent version include?
A complete agent version record should include: the code commit hash, the system prompt version and storage location, the list of connected tool schemas and their versions at deployment time, the model identifier with an explicit version pin, and the active governance policy set. Any of these components changing without a corresponding version increment creates behavioral drift that the version history can't explain.
How do you roll back an AI agent in production?
Effective agent rollback requires a known-good behavioral state to roll back to — not just a code commit. This means having a version record that captures all behavioral components (code, prompt, tool schemas, model version, policies) at each deployment. When an incident occurs, the rollback target is the last version where all components were verified together, not the last code commit. Shadow mode testing — running the previous version in parallel against live traffic — is the most reliable way to verify that the rollback state actually restores expected behavior before promoting it back to production.
What is the connection between agent versioning and governance?
Governance policies — the rules that control what an agent is allowed to access, spend, output, and do — must be calibrated to the agent's actual behavioral capabilities at any given version. If an agent's capabilities change (new tools, expanded access, updated prompt behavior) without a corresponding policy update, the enforcement layer is misconfigured for the agent it's governing. Behavioral versioning makes this coordination possible: by tracking agent configuration and policy set as components of the same version record, you ensure that governance reflects current capabilities rather than the capabilities the agent had when the policy was last written.
If your agents are in production and you don't have a registry, behavioral snapshots, or versioned governance policies, you're a prompt change and a tool schema update away from the failure mode this post describes. Get early access to Waxell — the governance control plane that makes behavioral versioning tractable.
Sources
OutSystems, State of AI Development 2026 (April 2026) — https://www.outsystems.com/1/state-ai-development/
CIO, Why versioning AI agents is the CIO's next big challenge (2026) — https://www.cio.com/article/4056453/why-versioning-ai-agents-is-the-cios-next-big-challenge.html
Auxiliobits, Versioning & Rollbacks in Modern Agent Deployments (2026) — https://www.auxiliobits.com/blog/versioning-and-rollbacks-in-agent-deployments/
Decagon, Introducing Agent Versioning (2026) — https://decagon.ai/resources/decagon-agent-versioning
Hacker News, WIP – Version control for AI agents. Diffs, rollback, sandbox (2026) — https://news.ycombinator.com/item?id=46032163
NIST, Artificial Intelligence Risk Management Framework (AI RMF 1.0) (2023) — https://doi.org/10.6028/NIST.AI.100-1
Agentic Governance, Explained




