Kill Switch (Circuit Breaker) Policy
The kill policy category implements emergency stop controls with automatic circuit breaker functionality. It tracks success/failure rates over a configurable time window and automatically disables an agent when the error rate exceeds a threshold.
Use it when you need to:
- Prevent cascading failures from a malfunctioning agent
- Protect downstream services from repeated bad requests
- Implement emergency stop controls for production agents
- Auto-disable agents that start producing errors at an unacceptable rate
Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `true` | Enable kill switch monitoring |
| `kill_on_error_rate` | float (0-1) | `0.5` | Activate the kill switch when the error rate exceeds this threshold |
| `error_window_minutes` | int | `5` | Time window (minutes) for calculating the error rate; counters expire after this |
| `min_samples` | int | `10` | Minimum total executions before the error rate is evaluated |
| `auto_recover_after_minutes` | int | `30` | Automatically deactivate the kill switch after this duration |
How It Works
The kill switch follows the circuit breaker pattern with three states:
```
CLOSED (normal operation)
  |
  +-- Error rate exceeds threshold
  |   (and min_samples reached)
  |
  v
OPEN (kill switch active -- all executions blocked)
  |
  +-- auto_recover_after_minutes TTL expires
  |
  v
CLOSED (normal operation resumes)
```
Error Rate Calculation
```
error_rate = errors / (successes + errors)
```
The error rate is calculated from Redis counters within the `error_window_minutes` window. Both the success and error counters have a TTL of `error_window_minutes * 60` seconds, so stale data automatically expires.
min_samples Guard
The error rate is only evaluated when the total number of executions (successes + errors) reaches min_samples. This prevents premature activation when a single early error would produce a 100% error rate.
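Putting the calculation and the guard together, the evaluation reduces to a small pure function. This is an illustrative sketch (the function name and signature are hypothetical, not the SDK's API):

```python
def evaluate_error_rate(successes: int, errors: int,
                        threshold: float = 0.5, min_samples: int = 10):
    """Return None below min_samples (no judgment possible), otherwise
    whether the windowed error rate meets or exceeds the threshold."""
    total = successes + errors
    if total < min_samples:
        return None  # min_samples guard: not enough data to judge
    error_rate = errors / total
    return error_rate >= threshold


print(evaluate_error_rate(successes=0, errors=1))   # None: 1 total < 10 samples
print(evaluate_error_rate(successes=8, errors=4))   # False: 4/12 ≈ 0.33 < 0.5
print(evaluate_error_rate(successes=2, errors=10))  # True: 10/12 ≈ 0.83 >= 0.5
```

Note the first case: a single early error is a 100% error rate, but the guard prevents it from tripping the switch.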
Auto-Recovery
When the kill switch activates, it sets a Redis key with a TTL of auto_recover_after_minutes * 60 seconds. Once this TTL expires, the key disappears and the next execution is allowed through. If errors continue after recovery, the kill switch re-activates immediately (assuming the error counters haven't expired yet).
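The activation key's lifecycle can be sketched with a tiny in-memory stand-in (the `KillSwitch` class here is illustrative; the real implementation sets a Redis key with `EX = auto_recover_after_minutes * 60`):

```python
import time


class KillSwitch:
    """In-memory sketch of the activation key and its auto-recovery TTL."""

    def __init__(self, auto_recover_minutes=30):
        self.ttl = auto_recover_minutes * 60
        self.expires_at = None  # no key -> switch inactive

    def activate(self, now=None):
        now = time.time() if now is None else now
        self.expires_at = now + self.ttl  # like SET killswitch:...:active ... EX ttl

    def is_active(self, now=None):
        now = time.time() if now is None else now
        if self.expires_at is not None and now >= self.expires_at:
            self.expires_at = None  # TTL expired -> auto-recovered
        return self.expires_at is not None


ks = KillSwitch(auto_recover_minutes=30)
ks.activate(now=0.0)
print(ks.is_active(now=60.0))       # True: still within the recovery TTL
print(ks.is_active(now=31 * 60.0))  # False: TTL expired, executions flow again
```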
Redis Keys
| Key Pattern | Purpose | TTL |
|---|---|---|
| `killswitch:<agent>:<workflow>:active` | Kill switch activation flag | `auto_recover_after_minutes * 60` |
| `stats:<agent>:<workflow>:success` | Success counter | `error_window_minutes * 60` |
| `stats:<agent>:<workflow>:error` | Error counter | `error_window_minutes * 60` |
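For reference, the key patterns above can be built from an agent/workflow pair; the helper function here is hypothetical, shown only to make the naming scheme concrete:

```python
def killswitch_keys(agent: str, workflow: str) -> dict:
    """Build the three key names from the table above for one agent+workflow."""
    prefix = f"{agent}:{workflow}"
    return {
        "active": f"killswitch:{prefix}:active",
        "success": f"stats:{prefix}:success",
        "error": f"stats:{prefix}:error",
    }


print(killswitch_keys("processor", "data-pipeline")["active"])
# killswitch:processor:data-pipeline:active
```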
Enforcement Phases
| Phase | Behavior |
|---|---|
| `before_workflow` | Checks whether the kill switch is active (Redis key), then checks the error rate against the threshold. If tripped, activates the kill switch and returns BLOCK |
| `mid_execution` | Not implemented |
| `after_workflow` | Increments the success counter (with window TTL) |
| `on_failure` | Increments the error counter (with window TTL) |
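The counter updates in `after_workflow` and `on_failure` amount to an increment plus a window TTL. A minimal in-memory sketch of that pattern (the real handler issues the equivalent Redis commands; whether the TTL is set once per window or refreshed on each increment is an implementation detail, and this sketch refreshes it):

```python
import time

store = {}  # key -> (count, expires_at): in-memory stand-in for Redis


def incr_with_ttl(key, ttl_seconds, now=None):
    """INCR followed by EXPIRE: count one outcome and arm the window TTL."""
    now = time.time() if now is None else now
    count, expires_at = store.get(key, (0, None))
    if expires_at is not None and now >= expires_at:
        count = 0  # counter expired -> a fresh window starts
    store[key] = (count + 1, now + ttl_seconds)
    return store[key][0]


window_ttl = 5 * 60  # error_window_minutes * 60

# after_workflow: success path
incr_with_ttl("stats:processor:data-pipeline:success", window_ttl)

# on_failure: error path
incr_with_ttl("stats:processor:data-pipeline:error", window_ttl)
```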
Example Policies
Conservative (Production Default)
High threshold, many samples, long recovery:
```json
{
  "enabled": true,
  "kill_on_error_rate": 0.5,
  "error_window_minutes": 5,
  "min_samples": 10,
  "auto_recover_after_minutes": 30
}
```
Aggressive (Fast Detection)
Low threshold, few samples, quick recovery:
```json
{
  "enabled": true,
  "kill_on_error_rate": 0.3,
  "error_window_minutes": 3,
  "min_samples": 5,
  "auto_recover_after_minutes": 5
}
```
Production-Grade (High Sensitivity)
Moderate threshold, large sample size, long recovery for critical agents:
```json
{
  "enabled": true,
  "kill_on_error_rate": 0.4,
  "error_window_minutes": 10,
  "min_samples": 20,
  "auto_recover_after_minutes": 60
}
```
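To sanity-check a rules dict like the ones above before submitting it, a small client-side validator can be sketched from the Rules table. The helper is hypothetical (the server performs its own validation):

```python
def validate_kill_rules(rules: dict) -> list[str]:
    """Return a list of problems with a kill-policy rules dict;
    checks are derived from the Rules table, not the server's validator."""
    problems = []
    if not isinstance(rules.get("enabled", True), bool):
        problems.append("enabled must be a bool")
    rate = rules.get("kill_on_error_rate", 0.5)
    if not isinstance(rate, (int, float)) or not 0 <= rate <= 1:
        problems.append("kill_on_error_rate must be a float in [0, 1]")
    for field in ("error_window_minutes", "min_samples", "auto_recover_after_minutes"):
        value = rules.get(field, 1)
        if not isinstance(value, int) or value < 1:
            problems.append(f"{field} must be a positive int")
    return problems


print(validate_kill_rules({"kill_on_error_rate": 1.5}))
# ['kill_on_error_rate must be a float in [0, 1]']
```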
SDK Integration
Using the Context Manager
```python
import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="processor",
        workflow_name="data-pipeline",
        enforce_policy=True,
    ) as ctx:
        # If the kill switch is active, PolicyViolationError
        # is raised here (before any agent work happens)
        result = await process_data(query)
        ctx.set_result(result)
        # after_workflow increments the success counter
except PolicyViolationError as e:
    print(f"Kill switch: {e}")
    # e.g. "Kill switch activated - error rate 83% exceeds 50%"
    # or "Kill switch active - workflow temporarily disabled"
except Exception:
    # on_failure increments the error counter
    # If the error rate now exceeds the threshold, the next execution is blocked
    raise
```
Using the Decorator
```python
@waxell.observe(
    agent_name="processor",
    workflow_name="data-pipeline",
    enforce_policy=True,
)
async def run_pipeline(query: str):
    # The kill switch check happens before this function body runs.
    # Exceptions raised here trigger on_failure (error counter);
    # a normal return triggers after_workflow (success counter).
    return await process_data(query)
```
Enforcement Flow
```
Agent starts (WaxellContext.__aenter__ or decorator entry)
  |
  +-- before_workflow governance runs
  |     |
  |     +-- enabled=false? -> ALLOW (skip all checks)
  |     |
  |     +-- Kill switch Redis key exists?
  |     |     +-- Yes -> BLOCK ("Kill switch active - workflow temporarily disabled")
  |     |
  |     +-- Read success + error counters from Redis
  |     |     +-- total < min_samples? -> ALLOW (not enough data)
  |     |
  |     +-- Calculate error_rate = errors / total
  |           +-- error_rate < threshold? -> ALLOW
  |           +-- error_rate >= threshold?
  |                 -> Set kill switch Redis key (with auto_recover TTL)
  |                 -> BLOCK ("Kill switch activated - error rate X% exceeds Y%")
  |
  +-- Agent executes...
  |
  +-- Success path (after_workflow)
  |     +-- Increment success counter (with error_window TTL)
  |
  +-- Failure path (on_failure)
        +-- Increment error counter (with error_window TTL)
```
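The decision flow above can be condensed into a pure function. This is an illustrative sketch (the real handler reads the counters from Redis and sets the activation key with the recovery TTL as a side effect):

```python
def before_workflow(enabled, kill_key_exists, successes, errors,
                    threshold=0.5, min_samples=10):
    """Return (action, reason) for one kill switch evaluation."""
    if not enabled:
        return ("ALLOW", "policy disabled")
    if kill_key_exists:
        return ("BLOCK", "Kill switch active - workflow temporarily disabled")
    total = successes + errors
    if total < min_samples:
        return ("ALLOW", "not enough data")
    error_rate = errors / total
    if error_rate < threshold:
        return ("ALLOW", "Kill switch not activated")
    # The real handler would also set the kill switch key with the recovery TTL here
    return ("BLOCK",
            f"Kill switch activated - error rate {error_rate:.0%} exceeds {threshold:.0%}")


print(before_workflow(True, False, successes=2, errors=10))
# ('BLOCK', 'Kill switch activated - error rate 83% exceeds 50%')
```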
The kill switch requires Redis for error-rate tracking and activation state. When running with `WAXELL_OBSERVE=false` or without a live server connection, the kill switch is not enforced: errors are not tracked and the circuit breaker never trips. This is by design for local development.
Manual Controls
The kill switch handler exposes methods for programmatic control:
```python
# Activate the kill switch manually (e.g., from an ops dashboard)
handler.activate_kill_switch(context, duration_minutes=30, reason="manual")

# Deactivate the kill switch manually
handler.deactivate_kill_switch(context)
```
Or via Redis CLI:
```shell
# Activate for 2 minutes
redis-cli SET "killswitch:my-agent:my-workflow:active" "manual" EX 120

# Deactivate
redis-cli DEL "killswitch:my-agent:my-workflow:active"

# Check status
redis-cli EXISTS "killswitch:my-agent:my-workflow:active"
redis-cli TTL "killswitch:my-agent:my-workflow:active"
```
Creating via Dashboard
- Navigate to Governance > Policies
- Click New Policy
- Select category Kill
- Configure error rate threshold, sample size, and recovery time
- Set scope to target specific agents (e.g., `kill-switch-agent`)
- Enable the policy
Creating via API
```shell
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Circuit Breaker",
    "category": "kill",
    "rules": {
      "enabled": true,
      "kill_on_error_rate": 0.5,
      "error_window_minutes": 5,
      "min_samples": 10,
      "auto_recover_after_minutes": 30
    },
    "scope": {
      "agents": ["processor"]
    },
    "enabled": true
  }'
```
Observability
Governance Tab
Kill switch evaluations appear with the following fields.

When the kill switch activates (error rate exceeded):
| Field | Example |
|---|---|
| Policy name | Circuit Breaker |
| Action | block |
| Category | kill |
| Reason | "Kill switch activated - error rate 83% exceeds 50%" |
| Metadata | {"error_rate": 0.83, "threshold": 0.5, "auto_recover_minutes": 30} |
When kill switch is already active:
| Field | Example |
|---|---|
| Reason | "Kill switch active - workflow temporarily disabled" |
| Metadata | {"kill_switch": true, "auto_recover_seconds": 1742} |
When under threshold:
| Field | Example |
|---|---|
| Action | allow |
| Reason | "Kill switch not activated" |
Combining with Other Policies
Kill Switch + Rate Limit: Defense in depth. Rate limits prevent overuse under normal conditions. If errors spike despite rate limiting, the kill switch provides a hard stop.
Kill Switch + Safety: If a safety policy detects unsafe output but uses warn mode, errors from downstream failures can trigger the kill switch to stop the agent entirely.
Kill Switch + Compliance: A SOC 2 compliance policy can require that kill switch monitoring is configured as part of operational safety requirements.
Common Gotchas
- Error counters expire after `error_window_minutes`. Stale errors from hours ago do not count toward the current error rate. If errors stop, the counters naturally decay to zero.
- `min_samples` is total (success + error), not just errors. With `min_samples=10`, you need at least 10 total executions before the error rate is evaluated. A single error out of 1 total will not trigger the kill switch.
- Auto-recovery resets the kill switch but does NOT reset error counters. If the error counters have not expired (still within `error_window_minutes`), the error rate may still be above the threshold after recovery. The next execution will immediately re-activate the kill switch.
- Error counters and the kill switch key have independent TTLs. The kill switch key expires after `auto_recover_after_minutes`; the error counters expire after `error_window_minutes`. These are typically different values.
- Tenant-scoped. One tenant's kill switch does not affect other tenants; each tenant has an independent Redis key namespace.
- Process crashes leave no error record. If a process crashes before `on_failure` runs, the error is not counted. The kill switch only tracks errors that are caught and reported through the governance hooks.
- `enabled: false` skips all checks. Setting `enabled` to `false` disables both the kill switch check and the success/error counting. No data is recorded while disabled.
- Kill switch scoping is per agent+workflow. A kill switch on `processor:data-pipeline` does not affect `processor:report-generation`. Each combination has independent counters and activation state.
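The auto-recovery gotcha above (recovery does not reset error counters) can be seen in a tiny timeline sketch, using the default thresholds:

```python
# Timeline: the kill switch key has expired (auto-recovery), but the
# error counters are still live inside error_window_minutes.
successes, errors = 2, 10     # counters have not expired yet
kill_key_exists = False       # auto_recover_after_minutes TTL ran out

total = successes + errors
error_rate = errors / total   # 10/12, roughly 0.83
if not kill_key_exists and total >= 10 and error_rate >= 0.5:
    print("re-activated")     # the very next check trips the switch again
```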
Next Steps
- Policy & Governance -- How policy enforcement works
- Rate Limit Policy -- Execution frequency limits
- Compliance Policy -- Meta-validator for regulatory frameworks
- Policy Categories & Templates -- All 26 categories