Kill Switch (Circuit Breaker) Policy

The kill policy category implements emergency stop controls with automatic circuit breaker functionality. It tracks success/failure rates over a configurable time window and automatically disables an agent when the error rate exceeds a threshold.

Use it when you need to:

  • Prevent cascading failures from a malfunctioning agent
  • Protect downstream services from repeated bad requests
  • Implement emergency stop controls for production agents
  • Auto-disable agents that start producing errors at an unacceptable rate

Rules

Rule                        Type         Default  Description
enabled                     bool         true     Enable kill switch monitoring
kill_on_error_rate          float (0-1)  0.5      Activate the kill switch when the error rate exceeds this threshold
error_window_minutes        int          5        Time window (minutes) for calculating the error rate; counters expire after this
min_samples                 int          10       Minimum total executions before the error rate is evaluated
auto_recover_after_minutes  int          30       Automatically deactivate the kill switch after this duration

How It Works

The kill switch follows the circuit breaker pattern with two states:

CLOSED (normal operation)
  |
  +-- Error rate exceeds threshold
  |   (and min_samples reached)
  |
  v
OPEN (kill switch active -- all executions blocked)
  |
  +-- auto_recover_after_minutes TTL expires
  |
  v
CLOSED (normal operation resumes)
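The CLOSED -> OPEN -> CLOSED cycle above can be sketched as a small transition function. This is illustrative only; the state names come from the diagram, but the function and parameter names are assumptions, not the SDK's API:

```python
from enum import Enum


class BreakerState(Enum):
    CLOSED = "closed"  # normal operation
    OPEN = "open"      # kill switch active, executions blocked


def next_state(state: BreakerState, *, error_rate: float, total: int,
               threshold: float, min_samples: int,
               ttl_expired: bool) -> BreakerState:
    """Hypothetical sketch of the transitions in the diagram above."""
    if state is BreakerState.CLOSED:
        # Trip only when min_samples is reached AND the rate is at threshold
        if total >= min_samples and error_rate >= threshold:
            return BreakerState.OPEN
        return BreakerState.CLOSED
    # OPEN: only expiry of the auto_recover_after_minutes TTL re-closes it
    return BreakerState.CLOSED if ttl_expired else BreakerState.OPEN
```

Note that nothing the agent does while OPEN changes the state; only the recovery TTL does, which is why auto-recovery can be followed by an immediate re-trip (see Common Gotchas below).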

Error Rate Calculation

error_rate = errors / (successes + errors)

The error rate is calculated from Redis counters within the error_window_minutes window. Both success and error counters have a TTL equal to error_window_minutes * 60 seconds, so stale data automatically expires.

min_samples Guard

The error rate is only evaluated when the total number of executions (successes + errors) reaches min_samples. This prevents premature activation when a single early error would produce a 100% error rate.
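Putting the formula and the guard together, the evaluation reduces to a few lines. This is a sketch with a hypothetical function name, using the documented defaults:

```python
def should_trip(successes: int, errors: int,
                threshold: float = 0.5, min_samples: int = 10) -> bool:
    """Hypothetical sketch: evaluate the error rate only once
    min_samples total executions have been recorded."""
    total = successes + errors
    if total < min_samples:
        return False  # not enough data -- ALLOW
    return errors / total >= threshold


# A single early failure does not trip the breaker:
should_trip(successes=0, errors=1)   # False: 1 sample < min_samples
# Ten samples with six errors (60% >= 50%) does:
should_trip(successes=4, errors=6)   # True
```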

Auto-Recovery

When the kill switch activates, it sets a Redis key with a TTL of auto_recover_after_minutes * 60 seconds. Once this TTL expires, the key disappears and the next execution is allowed through. If errors continue after recovery, the kill switch re-activates immediately (assuming the error counters haven't expired yet).
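A minimal in-memory sketch of this mechanism (the real implementation sets a Redis key with a TTL; the class and method names here are illustrative):

```python
import time


class KillSwitchFlag:
    """In-memory stand-in for the killswitch:*:active Redis key."""

    def __init__(self) -> None:
        self._expires_at = 0.0

    def activate(self, auto_recover_after_minutes: int) -> None:
        # Equivalent to SET killswitch:...:active with EX = minutes * 60
        self._expires_at = time.monotonic() + auto_recover_after_minutes * 60

    def is_active(self) -> bool:
        # Equivalent to EXISTS on the Redis key: once the TTL lapses,
        # the key is gone and the next execution is allowed through
        return time.monotonic() < self._expires_at


flag = KillSwitchFlag()
flag.activate(auto_recover_after_minutes=30)
flag.is_active()  # True until the recovery TTL expires
```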

Redis Keys

Key Pattern                           Purpose                      TTL
killswitch:<agent>:<workflow>:active  Kill switch activation flag  auto_recover_after_minutes * 60
stats:<agent>:<workflow>:success      Success counter              error_window_minutes * 60
stats:<agent>:<workflow>:error        Error counter                error_window_minutes * 60
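The two stats counters behave like counters whose TTL is refreshed on each increment, so data older than the window decays away. A pure-Python sketch of that behavior (illustrative; the real policy uses Redis commands against the keys above):

```python
class WindowedCounter:
    """In-memory stand-in for a stats:* counter with a rolling TTL."""

    def __init__(self, window_minutes: int) -> None:
        self.ttl = window_minutes * 60  # seconds
        self.count = 0
        self._expires_at = 0.0

    def incr(self, now: float) -> None:
        if now >= self._expires_at:  # window lapsed: counter starts fresh
            self.count = 0
        self.count += 1
        self._expires_at = now + self.ttl  # refresh the TTL

    def get(self, now: float) -> int:
        return 0 if now >= self._expires_at else self.count


errors = WindowedCounter(window_minutes=5)
errors.incr(now=0.0)
errors.incr(now=10.0)
errors.get(now=60.0)    # 2 -- still inside the 300-second window
errors.get(now=400.0)   # 0 -- window expired, stale errors decayed
```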

Enforcement Phases

Phase            Behavior
before_workflow  Checks whether the kill switch is active (Redis key), then checks the error rate against the threshold. If tripped, activates the kill switch and returns BLOCK
mid_execution    Not implemented
after_workflow   Increments the success counter (with window TTL)
on_failure       Increments the error counter (with window TTL)

Example Policies

Conservative (Production Default)

High threshold, many samples, long recovery:

{
  "enabled": true,
  "kill_on_error_rate": 0.5,
  "error_window_minutes": 5,
  "min_samples": 10,
  "auto_recover_after_minutes": 30
}

Aggressive (Fast Detection)

Low threshold, few samples, quick recovery:

{
  "enabled": true,
  "kill_on_error_rate": 0.3,
  "error_window_minutes": 3,
  "min_samples": 5,
  "auto_recover_after_minutes": 5
}

Production-Grade (High Sensitivity)

Moderate threshold, large sample size, long recovery for critical agents:

{
  "enabled": true,
  "kill_on_error_rate": 0.4,
  "error_window_minutes": 10,
  "min_samples": 20,
  "auto_recover_after_minutes": 60
}

SDK Integration

Using the Context Manager

import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="processor",
        workflow_name="data-pipeline",
        enforce_policy=True,
    ) as ctx:
        # If the kill switch is active, PolicyViolationError
        # is raised here (before any agent work happens)

        result = await process_data(query)
        ctx.set_result(result)
        # after_workflow increments the success counter

except PolicyViolationError as e:
    print(f"Kill switch: {e}")
    # e.g. "Kill switch activated - error rate 83% exceeds 50%"
    # or "Kill switch active - workflow temporarily disabled"

except Exception:
    # on_failure increments the error counter
    # If the error rate now exceeds the threshold, the next execution will be blocked
    raise

Using the Decorator

@waxell.observe(
    agent_name="processor",
    workflow_name="data-pipeline",
    enforce_policy=True,
)
async def run_pipeline(query: str):
    # The kill switch check happens before this function body runs
    # Exceptions raised here trigger on_failure (error counter)
    # A normal return triggers after_workflow (success counter)
    return await process_data(query)

Enforcement Flow

Agent starts (WaxellContext.__aenter__ or decorator entry)
|
+-- before_workflow governance runs
|   |
|   +-- enabled=false? -> ALLOW (skip all checks)
|   |
|   +-- Kill switch Redis key exists?
|   |   +-- Yes -> BLOCK ("Kill switch active - workflow temporarily disabled")
|   |
|   +-- Read success + error counters from Redis
|   |   +-- total < min_samples? -> ALLOW (not enough data)
|   |
|   +-- Calculate error_rate = errors / total
|       +-- error_rate < threshold? -> ALLOW
|       +-- error_rate >= threshold?
|           -> Set kill switch Redis key (with auto_recover TTL)
|           -> BLOCK ("Kill switch activated - error rate X% exceeds Y%")
|
+-- Agent executes...
|
+-- Success path (after_workflow)
|   +-- Increment success counter (with error_window TTL)
|
+-- Failure path (on_failure)
    +-- Increment error counter (with error_window TTL)

Redis Required

Kill switch requires Redis for error rate tracking and activation state. When running with WAXELL_OBSERVE=false or without a live server connection, the kill switch is not enforced -- errors are not tracked and the circuit breaker never trips. This is by design for local development.

Manual Controls

The kill switch handler exposes methods for programmatic control:

# Activate kill switch manually (e.g., from an ops dashboard)
handler.activate_kill_switch(context, duration_minutes=30, reason="manual")

# Deactivate kill switch manually
handler.deactivate_kill_switch(context)

Or via Redis CLI:

# Activate for 2 minutes
redis-cli SET "killswitch:my-agent:my-workflow:active" "manual" EX 120

# Deactivate
redis-cli DEL "killswitch:my-agent:my-workflow:active"

# Check status
redis-cli EXISTS "killswitch:my-agent:my-workflow:active"
redis-cli TTL "killswitch:my-agent:my-workflow:active"

Creating via Dashboard

  1. Navigate to Governance > Policies
  2. Click New Policy
  3. Select category Kill
  4. Configure error rate threshold, sample size, and recovery time
  5. Set scope to target specific agents (e.g., kill-switch-agent)
  6. Enable

Creating via API

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Circuit Breaker",
    "category": "kill",
    "rules": {
      "enabled": true,
      "kill_on_error_rate": 0.5,
      "error_window_minutes": 5,
      "min_samples": 10,
      "auto_recover_after_minutes": 30
    },
    "scope": {
      "agents": ["processor"]
    },
    "enabled": true
  }'

Observability

Governance Tab

Kill switch evaluations appear with the following fields:

When the kill switch activates (error rate exceeded):

Field        Example
Policy name  Circuit Breaker
Action       block
Category     kill
Reason       "Kill switch activated - error rate 83% exceeds 50%"
Metadata     {"error_rate": 0.83, "threshold": 0.5, "auto_recover_minutes": 30}

When the kill switch is already active:

Field     Example
Reason    "Kill switch active - workflow temporarily disabled"
Metadata  {"kill_switch": true, "auto_recover_seconds": 1742}

When under threshold:

Field   Example
Action  allow
Reason  "Kill switch not activated"

Combining with Other Policies

Kill Switch + Rate Limit: Defense in depth. Rate limits prevent overuse under normal conditions. If errors spike despite rate limiting, the kill switch provides a hard stop.

Kill Switch + Safety: If a safety policy detects unsafe output but uses warn mode, errors from downstream failures can trigger the kill switch to stop the agent entirely.

Kill Switch + Compliance: A SOC 2 compliance policy can require that kill switch monitoring is configured as part of operational safety requirements.

Common Gotchas

  1. Error counters expire after error_window_minutes. Stale errors from hours ago do not count toward the current error rate. If errors stop, the counters naturally decay to zero.

  2. min_samples is total (success + error), not just errors. With min_samples=10, you need at least 10 total executions before the error rate is evaluated. A single error out of 1 total will not trigger the kill switch.

  3. Auto-recovery resets the kill switch but does NOT reset error counters. If the error counters have not expired (still within error_window_minutes), the error rate may still be above threshold after recovery. The next execution will immediately re-activate the kill switch.

  4. Error counters and kill switch key have independent TTLs. The kill switch key expires after auto_recover_after_minutes. The error counters expire after error_window_minutes. These are typically different values.

  5. Tenant-scoped. One tenant's kill switch does not affect other tenants. Each tenant has independent Redis key namespaces.

  6. Process crashes leave no error record. If a process crashes before on_failure runs, the error is not counted. The kill switch only tracks errors that are caught and reported through the governance hooks.

  7. enabled: false skips all checks. Setting enabled to false disables both the kill switch check and the success/error counting. No data is recorded while disabled.

  8. Kill switch scoping is per agent+workflow. A kill switch on processor:data-pipeline does not affect processor:report-generation. Each combination has independent counters and activation state.
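The per-agent+workflow scoping in the last gotcha falls directly out of the key pattern shown in the Redis Keys section. A tiny sketch (the helper name is hypothetical; the key format is from the table above):

```python
def kill_switch_key(agent: str, workflow: str) -> str:
    """Build the activation key for one agent+workflow combination.
    Each combination has its own key, so breakers trip independently."""
    return f"killswitch:{agent}:{workflow}:active"


kill_switch_key("processor", "data-pipeline")
# "killswitch:processor:data-pipeline:active"
kill_switch_key("processor", "report-generation")
# "killswitch:processor:report-generation:active" -- a different key,
# so tripping data-pipeline leaves report-generation unaffected
```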
