Kill Switch (Circuit Breaker) Policy

The kill policy category implements emergency stop controls with automatic circuit breaker functionality. It tracks success/failure rates over a configurable time window and automatically disables an agent when the error rate exceeds a threshold.

Use it when you need to:

  • Prevent cascading failures from a malfunctioning agent
  • Protect downstream services from repeated bad requests
  • Implement emergency stop controls for production agents
  • Auto-disable agents that start producing errors at an unacceptable rate

Rules

Rule                        Type         Default  Description
enabled                     bool         true     Enable kill switch monitoring
kill_on_error_rate          float (0-1)  0.5      Activate the kill switch when the error rate exceeds this threshold
error_window_minutes        int          5        Time window (minutes) for calculating the error rate; counters expire after this
min_samples                 int          10       Minimum total executions before the error rate is evaluated
auto_recover_after_minutes  int          30       Automatically deactivate the kill switch after this duration

How It Works

The kill switch follows the circuit breaker pattern with two states:

CLOSED (normal operation)
  |
  +-- Error rate exceeds threshold
  |   (and min_samples reached)
  |
  v
OPEN (kill switch active -- all executions blocked)
  |
  +-- auto_recover_after_minutes TTL expires
  |
  v
CLOSED (normal operation resumes)
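The CLOSED -> OPEN -> CLOSED cycle above can be sketched as a small transition function. This is illustrative only; the state names come from the diagram, but the function and parameter names are assumptions, not the SDK's API:

```python
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"  # normal operation
    OPEN = "open"      # kill switch active, executions blocked


def next_state(state: BreakerState, *, error_rate: float, total: int,
               threshold: float, min_samples: int,
               ttl_expired: bool) -> BreakerState:
    """Hypothetical sketch of the transitions in the diagram above."""
    if state is BreakerState.CLOSED:
        # Trip only when min_samples is reached AND the rate is at threshold
        if total >= min_samples and error_rate >= threshold:
            return BreakerState.OPEN
        return BreakerState.CLOSED
    # OPEN: only expiry of the auto_recover_after_minutes TTL re-closes it
    return BreakerState.CLOSED if ttl_expired else BreakerState.OPEN
```

Note that nothing the agent does while OPEN changes the state; only the recovery TTL does, which is why auto-recovery can be followed by an immediate re-trip (see Common Gotchas below).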

Error Rate Calculation

error_rate = errors / (successes + errors)

The error rate is calculated from Redis counters within the error_window_minutes window. Both success and error counters have a TTL equal to error_window_minutes * 60 seconds, so stale data automatically expires.

min_samples Guard

The error rate is only evaluated when the total number of executions (successes + errors) reaches min_samples. This prevents premature activation when a single early error would produce a 100% error rate.
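Putting the formula and the guard together, the evaluation reduces to a few lines. This is a sketch with a hypothetical function name, using the documented defaults:

```python
def should_trip(successes: int, errors: int,
                threshold: float = 0.5, min_samples: int = 10) -> bool:
    """Hypothetical sketch: evaluate the error rate only once
    min_samples total executions have been recorded."""
    total = successes + errors
    if total < min_samples:
        return False  # not enough data -- ALLOW
    return errors / total >= threshold


# A single early failure does not trip the breaker:
should_trip(successes=0, errors=1)   # False: 1 sample < min_samples
# Ten samples with six errors (60% >= 50%) does:
should_trip(successes=4, errors=6)   # True
```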

Auto-Recovery

When the kill switch activates, it sets a Redis key with a TTL of auto_recover_after_minutes * 60 seconds. Once this TTL expires, the key disappears and the next execution is allowed through. If errors continue after recovery, the kill switch re-activates immediately (assuming the error counters haven't expired yet).
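A minimal in-memory sketch of this mechanism (the real implementation sets a Redis key with a TTL; the class and method names here are illustrative):

```python
import time


class KillSwitchFlag:
    """In-memory stand-in for the killswitch:*:active Redis key."""

    def __init__(self) -> None:
        self._expires_at = 0.0

    def activate(self, auto_recover_after_minutes: int) -> None:
        # Equivalent to SET killswitch:...:active with EX = minutes * 60
        self._expires_at = time.monotonic() + auto_recover_after_minutes * 60

    def is_active(self) -> bool:
        # Equivalent to EXISTS on the Redis key: once the TTL lapses,
        # the key is gone and the next execution is allowed through
        return time.monotonic() < self._expires_at


flag = KillSwitchFlag()
flag.activate(auto_recover_after_minutes=30)
flag.is_active()  # True until the recovery TTL expires
```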

Redis Keys

Key Pattern                           Purpose                      TTL
killswitch:<agent>:<workflow>:active  Kill switch activation flag  auto_recover_after_minutes * 60
stats:<agent>:<workflow>:success      Success counter              error_window_minutes * 60
stats:<agent>:<workflow>:error        Error counter                error_window_minutes * 60
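The two stats counters behave like counters whose TTL is refreshed on each increment, so data older than the window decays away. A pure-Python sketch of that behavior (illustrative; the real policy uses Redis commands against the keys above):

```python
class WindowedCounter:
    """In-memory stand-in for a stats:* counter with a rolling TTL."""

    def __init__(self, window_minutes: int) -> None:
        self.ttl = window_minutes * 60  # seconds
        self.count = 0
        self._expires_at = 0.0

    def incr(self, now: float) -> None:
        if now >= self._expires_at:  # window lapsed: counter starts fresh
            self.count = 0
        self.count += 1
        self._expires_at = now + self.ttl  # refresh the TTL

    def get(self, now: float) -> int:
        return 0 if now >= self._expires_at else self.count


errors = WindowedCounter(window_minutes=5)
errors.incr(now=0.0)
errors.incr(now=10.0)
errors.get(now=60.0)    # 2 -- still inside the 300-second window
errors.get(now=400.0)   # 0 -- window expired, stale errors decayed
```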

Enforcement Phases

Phase            Behavior
before_workflow  Checks whether the kill switch is active (Redis key), then checks the error rate against the threshold. If tripped, activates the kill switch and returns BLOCK
mid_execution    Not implemented
after_workflow   Increments the success counter (with window TTL)
on_failure       Increments the error counter (with window TTL)

Example Policies

Conservative (Production Default)

High threshold, many samples, long recovery:

{
  "enabled": true,
  "kill_on_error_rate": 0.5,
  "error_window_minutes": 5,
  "min_samples": 10,
  "auto_recover_after_minutes": 30
}

Aggressive (Fast Detection)

Low threshold, few samples, quick recovery:

{
  "enabled": true,
  "kill_on_error_rate": 0.3,
  "error_window_minutes": 3,
  "min_samples": 5,
  "auto_recover_after_minutes": 5
}

Production-Grade (High Sensitivity)

Moderate threshold, large sample size, long recovery for critical agents:

{
  "enabled": true,
  "kill_on_error_rate": 0.4,
  "error_window_minutes": 10,
  "min_samples": 20,
  "auto_recover_after_minutes": 60
}

SDK Integration

Using the Context Manager

import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="processor",
        workflow_name="data-pipeline",
        enforce_policy=True,
    ) as ctx:
        # If the kill switch is active, PolicyViolationError
        # is raised here (before any agent work happens)

        result = await process_data(query)
        ctx.set_result(result)
        # after_workflow increments the success counter

except PolicyViolationError as e:
    print(f"Kill switch: {e}")
    # e.g. "Kill switch activated - error rate 83% exceeds 50%"
    # or "Kill switch active - workflow temporarily disabled"

except Exception:
    # on_failure increments the error counter
    # If the error rate now exceeds the threshold, the next execution will be blocked
    raise

Using the Decorator

@waxell.observe(
    agent_name="processor",
    workflow_name="data-pipeline",
    enforce_policy=True,
)
async def run_pipeline(query: str):
    # The kill switch check happens before this function body runs
    # Exceptions raised here trigger on_failure (error counter)
    # A normal return triggers after_workflow (success counter)
    return await process_data(query)

Enforcement Flow

Agent starts (WaxellContext.__aenter__ or decorator entry)
|
+-- before_workflow governance runs
|   |
|   +-- enabled=false? -> ALLOW (skip all checks)
|   |
|   +-- Kill switch Redis key exists?
|   |   +-- Yes -> BLOCK ("Kill switch active - workflow temporarily disabled")
|   |
|   +-- Read success + error counters from Redis
|   |   +-- total < min_samples? -> ALLOW (not enough data)
|   |
|   +-- Calculate error_rate = errors / total
|       +-- error_rate < threshold? -> ALLOW
|       +-- error_rate >= threshold?
|           -> Set kill switch Redis key (with auto_recover TTL)
|           -> BLOCK ("Kill switch activated - error rate X% exceeds Y%")
|
+-- Agent executes...
|
+-- Success path (after_workflow)
|   +-- Increment success counter (with error_window TTL)
|
+-- Failure path (on_failure)
    +-- Increment error counter (with error_window TTL)

Redis Required

Kill switch requires Redis for error rate tracking and activation state. When running with WAXELL_OBSERVE=false or without a live server connection, the kill switch is not enforced -- errors are not tracked and the circuit breaker never trips. This is by design for local development.

Manual Controls

The kill switch handler exposes methods for programmatic control:

# Activate kill switch manually (e.g., from an ops dashboard)
handler.activate_kill_switch(context, duration_minutes=30, reason="manual")

# Deactivate kill switch manually
handler.deactivate_kill_switch(context)

Or via Redis CLI:

# Activate for 2 minutes
redis-cli SET "killswitch:my-agent:my-workflow:active" "manual" EX 120

# Deactivate
redis-cli DEL "killswitch:my-agent:my-workflow:active"

# Check status
redis-cli EXISTS "killswitch:my-agent:my-workflow:active"
redis-cli TTL "killswitch:my-agent:my-workflow:active"

Creating via Dashboard

  1. Navigate to Governance > Policies
  2. Click New Policy
  3. Select category Kill
  4. Configure error rate threshold, sample size, and recovery time
  5. Set scope to target specific agents (e.g., kill-switch-agent)
  6. Enable

Creating via API

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Circuit Breaker",
    "category": "kill",
    "rules": {
      "enabled": true,
      "kill_on_error_rate": 0.5,
      "error_window_minutes": 5,
      "min_samples": 10,
      "auto_recover_after_minutes": 30
    },
    "scope": {
      "agents": ["processor"]
    },
    "enabled": true
  }'

Observability

Governance Tab

Kill switch evaluations appear with the following fields:

When the kill switch activates (error rate exceeded):

Field        Example
Policy name  Circuit Breaker
Action       block
Category     kill
Reason       "Kill switch activated - error rate 83% exceeds 50%"
Metadata     {"error_rate": 0.83, "threshold": 0.5, "auto_recover_minutes": 30}

When the kill switch is already active:

Field     Example
Reason    "Kill switch active - workflow temporarily disabled"
Metadata  {"kill_switch": true, "auto_recover_seconds": 1742}

When under threshold:

Field   Example
Action  allow
Reason  "Kill switch not activated"

Combining with Other Policies

Kill Switch + Rate Limit: Defense in depth. Rate limits prevent overuse under normal conditions. If errors spike despite rate limiting, the kill switch provides a hard stop.

Kill Switch + Safety: If a safety policy detects unsafe output but uses warn mode, errors from downstream failures can trigger the kill switch to stop the agent entirely.

Kill Switch + Compliance: A SOC 2 compliance policy can require that kill switch monitoring is configured as part of operational safety requirements.

Common Gotchas

  1. Error counters expire after error_window_minutes. Stale errors from hours ago do not count toward the current error rate. If errors stop, the counters naturally decay to zero.

  2. min_samples is total (success + error), not just errors. With min_samples=10, you need at least 10 total executions before the error rate is evaluated. A single error out of 1 total will not trigger the kill switch.

  3. Auto-recovery resets the kill switch but does NOT reset error counters. If the error counters have not expired (still within error_window_minutes), the error rate may still be above threshold after recovery. The next execution will immediately re-activate the kill switch.

  4. Error counters and kill switch key have independent TTLs. The kill switch key expires after auto_recover_after_minutes. The error counters expire after error_window_minutes. These are typically different values.

  5. Tenant-scoped. One tenant's kill switch does not affect other tenants. Each tenant has independent Redis key namespaces.

  6. Process crashes leave no error record. If a process crashes before on_failure runs, the error is not counted. The kill switch only tracks errors that are caught and reported through the governance hooks.

  7. enabled: false skips all checks. Setting enabled to false disables both the kill switch check and the success/error counting. No data is recorded while disabled.

  8. Kill switch scoping is per agent+workflow. A kill switch on processor:data-pipeline does not affect processor:report-generation. Each combination has independent counters and activation state.
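The per-agent+workflow scoping in the last gotcha falls directly out of the key pattern shown in the Redis Keys section. A tiny sketch (the helper name is hypothetical; the key format is from the table above):

```python
def kill_switch_key(agent: str, workflow: str) -> str:
    """Build the activation key for one agent+workflow combination.
    Each combination has its own key, so breakers trip independently."""
    return f"killswitch:{agent}:{workflow}:active"


kill_switch_key("processor", "data-pipeline")
# "killswitch:processor:data-pipeline:active"
kill_switch_key("processor", "report-generation")
# "killswitch:processor:report-generation:active" -- a different key,
# so tripping data-pipeline leaves report-generation unaffected
```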
