# Safety Policy

The safety policy category enforces content and behavior safety controls on workflow execution. It covers:
- Content filters -- scan inputs and outputs for PII, credentials, and profanity
- Execution limits -- cap the number of steps and tool calls an agent can make
- Tool restrictions -- block specific tools or require human approval
- Output limits -- enforce maximum output length
Use it to prevent data leakage, runaway agents, and dangerous tool invocations.
## Rules

| Rule | Type | Default | Description |
|---|---|---|---|
| `max_retries` | integer | 3 | Maximum retry attempts on failure |
| `max_steps` | integer | 50 | Maximum workflow steps allowed |
| `max_tool_calls` | integer | 100 | Maximum tool invocations allowed |
| `blocked_tools` | string[] | [] | Tools that cannot be used (exact name match) |
| `require_human_approval` | boolean | false | Require approval before execution starts |
| `approval_tools` | string[] | [] | Tools that need human approval before invocation |
| `content_filters` | string[] | [] | Content types to scan: pii, profanity, credentials |
| `max_output_length` | integer | (none) | Maximum characters in final output |
## How It Works

The safety handler runs at all three enforcement phases: `before_workflow`, `mid_execution`, and `after_workflow`.

### Phase Behavior

| Phase | What It Checks | Actions |
|---|---|---|
| before_workflow | require_human_approval, content filters on context.inputs | BLOCK (approval), WARN (content) |
| mid_execution | max_steps vs step count, max_tool_calls vs tool count, content filters on prompt_preview/response_preview | BLOCK (limits), WARN (content) |
| after_workflow | Step/tool limits, max_output_length, content filters on final result | WARN (all violations) |
### Context Attributes Read

| Attribute | Phase | Purpose |
|---|---|---|
| `context.inputs` | before_workflow | Scan input text for PII/credentials/profanity |
| `context.step_logs` | mid_execution, after_workflow | Count steps taken (`len(step_logs)`) |
| `context.tool_call_count` | mid_execution, after_workflow | Count tool invocations |
| `context.prompt_preview` | mid_execution | Scan LLM prompts for content violations |
| `context.response_preview` | mid_execution | Scan LLM responses for content violations |
| `result` (parameter) | after_workflow | Scan final output, check output length |
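The table above can be pictured as a small context object. This is an illustrative sketch only; the field names follow the attributes listed above, but the class itself and its defaults are assumptions, not the real SDK type:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowContext:
    """Illustrative shape of the context the safety handler reads.

    Field names follow the attribute table; the real SDK class may differ.
    """
    inputs: dict = field(default_factory=dict)     # scanned at before_workflow
    step_logs: list = field(default_factory=list)  # len() gives the step count
    tool_call_count: int = 0                       # compared to max_tool_calls
    prompt_preview: str = ""                       # scanned at mid_execution
    response_preview: str = ""                     # scanned at mid_execution

ctx = WorkflowContext(inputs={"query": "latest sales figures"})
ctx.step_logs.append({"step": 1, "action": "web_search"})
ctx.tool_call_count += 1
```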
## Content Filters

### PII Detection

The `pii` content filter uses regex patterns to detect personally identifiable information:

| PII Type | Pattern | Example Match |
|---|---|---|
| ssn | `\d{3}-\d{2}-\d{4}` | 123-45-6789 |
| email | Standard email regex | user@example.com |
| phone | US phone formats | (555) 123-4567, +1-555-123-4567 |
| credit_card | 16-digit grouped by 4 | 4111-1111-1111-1111 |
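A minimal sketch of this kind of scan. The SSN and credit card regexes come from the table above; the email and phone patterns are illustrative stand-ins, since the docs only describe them loosely ("standard email regex", "US phone formats"):

```python
import re

# SSN and card patterns are from the table above; the email and phone
# patterns are illustrative stand-ins for the SDK's actual regexes.
PII_PATTERNS = {
    "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
    "email": r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",
    "phone": r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}",
    "credit_card": r"\b\d{4}-\d{4}-\d{4}-\d{4}\b",
}

def scan_pii(text: str) -> list[str]:
    """Return the PII types whose pattern matches anywhere in text."""
    return [name for name, pat in PII_PATTERNS.items() if re.search(pat, text)]

print(scan_pii("Look up 123-45-6789"))  # ['ssn']
print(scan_pii("Call 555-1234"))        # [] -- only 7 digits, phone needs 10
```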
### Credential Detection

The `credentials` content filter detects secrets and API keys:
| Pattern | What It Matches | Example |
|---|---|---|
| Password assignments | password=, passwd=, pwd= | password=hunter2 |
| API key assignments | api_key=, apikey=, api_secret= | api_key=abc123 |
| Secret/access keys | secret_key=, access_key= | secret_key=xyz |
| AWS access keys | AKIA prefix + 16 chars | AKIAIOSFODNN7EXAMPLE |
| Generic API tokens | sk-, pk_live_, sk_live_, rk_live_ prefix + 20+ chars | sk-proj-abc123... |
| GitHub PATs | ghp_ prefix + 36 chars | ghp_abcdefghij... |
| Waxell secret keys | wax_sk_ prefix | wax_sk_abc123 |
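The same idea for credentials, as a hedged sketch. The prefixes and assignment keywords come from the table above, but the exact regexes are my approximation, not the SDK's:

```python
import re

# Illustrative regexes based on the pattern table above; the SDK's
# actual expressions may differ in detail.
CREDENTIAL_PATTERNS = [
    r"(?i)\b(?:password|passwd|pwd)\s*=\s*\S+",
    r"(?i)\b(?:api_key|apikey|api_secret|secret_key|access_key)\s*=\s*\S+",
    r"\bAKIA[0-9A-Z]{16}\b",                          # AWS access key IDs
    r"\b(?:sk-|pk_live_|sk_live_|rk_live_)[A-Za-z0-9_-]{20,}",
    r"\bghp_[A-Za-z0-9]{36}\b",                       # GitHub PATs
    r"\bwax_sk_[A-Za-z0-9]+\b",                       # Waxell secret keys
]

def has_credentials(text: str) -> bool:
    return any(re.search(p, text) for p in CREDENTIAL_PATTERNS)

print(has_credentials("password=hunter2"))      # True
print(has_credentials("Use the skeleton key"))  # False -- "sk" without "-"
```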
### Profanity Filter

The `profanity` content filter uses word-boundary matching against a hardcoded word set. Only whole words are matched (e.g., "class" does not trigger on "ass").
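Word-boundary matching can be sketched with `\b` anchors. The word set here is a two-word placeholder; the SDK ships its own hardcoded list:

```python
import re

# Illustrative word set; the SDK ships its own hardcoded list.
PROFANITY = {"damn", "ass"}
PROFANITY_RE = re.compile(
    r"\b(?:%s)\b" % "|".join(map(re.escape, sorted(PROFANITY))),
    re.IGNORECASE,
)

def has_profanity(text: str) -> bool:
    return bool(PROFANITY_RE.search(text))

print(has_profanity("This damn report"))  # True
print(has_profanity("The dam broke"))     # False -- "dam" is not "damn"
print(has_profanity("first class"))       # False -- \b blocks the "ass" in "class"
```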
### Matching Examples

| Input | Filter | Match? | Why |
|---|---|---|---|
| "Look up 123-45-6789" | pii | Yes | SSN pattern matches |
| "Send to user@co.com" | pii | Yes | Email pattern matches |
| "Call 555-1234" | pii | No | Only 7 digits (phone needs 10) |
| "api_key=sk-abc123456789012345678901" | credentials | Yes | sk- prefix + 20+ chars |
| "Use the skeleton key" | credentials | No | sk not followed by - with 20+ chars |
| "This damn report" | profanity | Yes | Whole word match |
| "The dam broke" | profanity | No | "dam" is not "damn" |
In the safety handler, content filter violations produce WARN actions, not BLOCK. The agent continues running. If you need content violations to block execution, use the dedicated Content Policy instead, which supports configurable actions (warn, redact, block) per detection type.
## Execution Limits

### Step Limit (max_steps)

Checked at mid_execution and after_workflow by counting `len(context.step_logs)`. Returns BLOCK at mid_execution if exceeded.

### Tool Call Limit (max_tool_calls)

Checked at mid_execution and after_workflow via `context.tool_call_count`. Returns BLOCK at mid_execution if exceeded.

### Output Length (max_output_length)

Checked at after_workflow by measuring `len(str(result))`. Returns WARN if exceeded.

Unlike content filters (which WARN), exceeding max_steps or max_tool_calls produces a BLOCK at mid_execution. This immediately halts the agent.
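The mid-execution limit checks can be sketched like this. The function name, return shape, and fake context are illustrative; only the counting logic and the reason string format (shown in the Observability section) come from the docs:

```python
# Hypothetical sketch of the mid-execution limit checks; the handler
# signature and (action, reason) return shape are illustrative.
def check_limits_mid_execution(context, rules: dict) -> tuple[str, str]:
    max_steps = rules.get("max_steps", 50)           # documented defaults
    max_tools = rules.get("max_tool_calls", 100)
    steps = len(context.step_logs)
    if steps > max_steps:
        return ("BLOCK", f"Mid-run: step limit exceeded ({steps}/{max_steps})")
    if context.tool_call_count > max_tools:
        return ("BLOCK",
                f"Mid-run: tool call limit exceeded "
                f"({context.tool_call_count}/{max_tools})")
    return ("ALLOW", "limits ok")

class FakeContext:
    step_logs = [{}] * 55   # 55 steps taken so far
    tool_call_count = 10

action, reason = check_limits_mid_execution(FakeContext(), {"max_steps": 50})
print(action, reason)  # BLOCK Mid-run: step limit exceeded (55/50)
```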
## Human Approval

### require_human_approval

When set to true, the handler returns BLOCK at before_workflow with reason "Human approval required before execution". The agent cannot run without external approval.
### approval_tools

Tools listed in `approval_tools` are blocked with reason "Tool '{name}' requires human approval" when checked via `check_tool_allowed()`. This is a standalone method meant to be called from the tool execution layer.

## Blocked Tools

Tools listed in `blocked_tools` are blocked with reason "Tool '{name}' is blocked by safety policy" when checked via `check_tool_allowed()`.

The `check_tool_allowed(rules, tool_name)` method is a standalone API. It is NOT automatically invoked by the before_workflow, mid_execution, or after_workflow phase hooks. Your tool execution layer must call it explicitly.
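A sketch of where your tool layer would make that call. The `check_tool_allowed` below is a local stand-in reproducing the documented reasons (only its `(rules, tool_name)` signature comes from the docs); `run_tool` is a hypothetical wrapper name:

```python
rules = {
    "blocked_tools": ["shell_exec"],
    "approval_tools": ["send_email"],
}

def check_tool_allowed(rules: dict, tool_name: str) -> tuple[bool, str]:
    # Stand-in for the SDK method, reproducing the documented reasons.
    if tool_name in rules.get("blocked_tools", []):
        return False, f"Tool '{tool_name}' is blocked by safety policy"
    if tool_name in rules.get("approval_tools", []):
        return False, f"Tool '{tool_name}' requires human approval"
    return True, ""

def run_tool(tool_name: str, **kwargs):
    # Your tool execution layer must gate every invocation itself --
    # the phase hooks will not do this for you.
    allowed, reason = check_tool_allowed(rules, tool_name)
    if not allowed:
        raise PermissionError(reason)
    ...  # dispatch to the real tool here

print(check_tool_allowed(rules, "shell_exec"))
# (False, "Tool 'shell_exec' is blocked by safety policy")
print(check_tool_allowed(rules, "web_search"))  # (True, '')
```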
## Example Policies

### PII-Only Content Filter

Scan for PII in inputs and outputs, warn on detection:

```json
{
  "content_filters": ["pii"],
  "max_steps": 50,
  "max_tool_calls": 100
}
```

### Full Safety Lockdown

All content filters, strict limits, blocked tools:

```json
{
  "max_retries": 2,
  "max_steps": 20,
  "max_tool_calls": 30,
  "blocked_tools": ["shell_exec", "file_write", "network_request"],
  "require_human_approval": false,
  "approval_tools": ["send_email", "make_purchase"],
  "content_filters": ["pii", "profanity", "credentials"],
  "max_output_length": 5000
}
```

### Approval-Required for Production

Require human approval before any execution:

```json
{
  "require_human_approval": true,
  "content_filters": ["pii", "credentials"],
  "max_steps": 100,
  "max_tool_calls": 200
}
```
## SDK Integration

### Using the Context Manager

```python
import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="research-agent",
        enforce_policy=True,
    ) as ctx:
        # before_workflow: safety checks inputs, require_human_approval
        #   - If a content filter triggers -> WARN (agent continues)
        #   - If require_human_approval   -> BLOCK (PolicyViolationError)
        result = await do_research(query)

        # Record tool calls (increments tool_call_count)
        ctx.record_tool_call(
            name="web_search",
            input={"query": query},
            output={"results": results},
        )

        ctx.set_result(result)
    # after_workflow: checks limits and output content
except PolicyViolationError as e:
    print(f"Safety block: {e}")
```
### Using the Decorator

```python
@waxell.observe(
    agent_name="research-agent",
    enforce_policy=True,
)
async def run_research(query: str):
    # Safety checks happen before and after this function
    return await do_research(query)
```
## Enforcement Flow

```text
Agent starts (WaxellContext.__aenter__)
 |
 +-- before_workflow
 |    |
 |    +-- require_human_approval? -> BLOCK
 |    |
 |    +-- content_filters on context.inputs
 |         +-- PII detected?        -> WARN
 |         +-- Credential detected? -> WARN
 |         +-- Profanity detected?  -> WARN
 |
 +-- Agent executes steps...
 |
 +-- mid_execution (per LLM call)
 |    |
 |    +-- step_logs > max_steps?            -> BLOCK
 |    +-- tool_call_count > max_tool_calls? -> BLOCK
 |    +-- content_filters on prompt_preview/response_preview
 |         +-- Violations? -> WARN
 |
 +-- Agent finishes
 |
 +-- after_workflow
      |
      +-- step_logs > max_steps?            -> WARN
      +-- tool_call_count > max_tool_calls? -> WARN
      +-- len(result) > max_output_length?  -> WARN
      +-- content_filters on result         -> WARN
```
## Creating via Dashboard

1. Navigate to Governance > Policies
2. Click New Policy
3. Select category Safety
4. Configure limits (max_steps, max_tool_calls)
5. Enable content filters (pii, profanity, credentials)
6. Optionally add blocked_tools and approval_tools
7. Set scope to target specific agents
8. Enable the policy
## Creating via API

```shell
curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Research Safety Policy",
    "category": "safety",
    "rules": {
      "max_retries": 3,
      "max_steps": 50,
      "max_tool_calls": 100,
      "blocked_tools": ["dangerous_tool", "shell_exec"],
      "content_filters": ["pii", "profanity", "credentials"],
      "max_output_length": 5000
    },
    "scope": {
      "agents": ["research-agent"]
    },
    "enabled": true
  }'
```
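The same request can be made from Python with only the standard library. The endpoint URL and payload mirror the curl example above; the `create_policy` helper is illustrative, and you should substitute your own host and token:

```python
import json
import urllib.request

# Payload mirrors the curl example above.
payload = {
    "name": "Research Safety Policy",
    "category": "safety",
    "rules": {
        "max_retries": 3,
        "max_steps": 50,
        "max_tool_calls": 100,
        "blocked_tools": ["dangerous_tool", "shell_exec"],
        "content_filters": ["pii", "profanity", "credentials"],
        "max_output_length": 5000,
    },
    "scope": {"agents": ["research-agent"]},
    "enabled": True,
}

def create_policy(token: str) -> bytes:
    """POST the policy; adjust the host for your deployment."""
    req = urllib.request.Request(
        "https://acme.waxell.dev/waxell/v1/policies/",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:  # network call; not run here
        return resp.read()
```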
## Observability

### Governance Tab

Safety evaluations appear with:
| Field | Example (ALLOW) |
|---|---|
| Policy name | Research Safety Policy |
| Action | allow |
| Category | safety |
| Reason | "Safety checks passed (content filters active: pii, profanity, credentials)" |
For content violations:
| Field | Example (WARN) |
|---|---|
| Action | warn |
| Reason | "Input content violations: PII detected: ssn" |
| Metadata | {"content_violations": ["PII detected: ssn"], "scan_target": "inputs"} |
For limit violations:
| Field | Example (BLOCK) |
|---|---|
| Action | block |
| Reason | "Mid-run: step limit exceeded (55/50)" |
| Metadata | {"steps": 55, "limit": 50} |
## Combining with Other Policies

- Safety + Compliance: HIPAA compliance often requires PII filtering. Use a compliance policy requiring `safety` as a sibling category, with `content_filters: ["pii"]` as a required rule
- Safety + Kill Switch: Use kill switch for emergency stop, safety for ongoing limits
- Safety + Content: Safety content filters return WARN. For BLOCK on content violations, add a dedicated content policy with `pii_detection.action: "block"`
## Common Gotchas

- **Content filters are regex-based.** They can false-positive on patterns that look like SSNs but aren't (e.g., formatted dates like `2024-01-2345`).
- **Content filters return WARN, not BLOCK.** Safety content violations produce WARN actions. The agent continues running. Use the dedicated Content Policy for configurable block/warn/redact actions.
- **`blocked_tools` requires exact name match.** `"shell"` does not block `"shell_exec"`. Use the full tool name.
- **`max_output_length` checks `str(result)`.** This includes Python repr overhead (quotes, braces for dicts). The actual content may be shorter than the measured length.
- **`check_tool_allowed` is not called automatically.** It's a standalone method for your tool execution layer. The phase hooks don't check `blocked_tools`.
- **Mid-execution requires the runtime to call the handler.** Observe-path agents may not trigger mid_execution checks between steps. Step/tool limits are also checked at after_workflow as a fallback.
- **`require_human_approval` blocks ALL queries.** It does not inspect the query. Every execution is blocked until approval is granted externally.
- **Profanity filter uses word boundaries.** "class" does not trigger on "ass". But compound words without separators may not match as expected.
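The repr-overhead gotcha is easy to verify: for container results, quotes, braces, and commas all count toward `max_output_length`:

```python
# len(str(result)) includes repr overhead for containers:
# braces, quotes, colons, and commas all count toward max_output_length.
result = {"summary": "ok"}

print(str(result))             # {'summary': 'ok'}
print(len(str(result)))        # 17 -- what max_output_length measures
print(len(result["summary"]))  # 2  -- the actual content is much shorter
```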
## Next Steps
- Content Policy -- Dedicated content scanning with block/warn/redact actions
- Policy & Governance -- How policy enforcement works
- Compliance Policy -- Meta-validator for regulatory frameworks
- Policy Categories & Templates -- All 26 categories