Safety Policy

The safety policy category enforces content and behavior safety controls on workflow execution. It covers:

  • Content filters -- scan inputs and outputs for PII, credentials, and profanity
  • Execution limits -- cap the number of steps and tool calls an agent can make
  • Tool restrictions -- block specific tools or require human approval
  • Output limits -- enforce maximum output length

Use it to prevent data leakage, runaway agents, and dangerous tool invocations.

Rules

| Rule | Type | Default | Description |
|---|---|---|---|
| max_retries | integer | 3 | Maximum retry attempts on failure |
| max_steps | integer | 50 | Maximum workflow steps allowed |
| max_tool_calls | integer | 100 | Maximum tool invocations allowed |
| blocked_tools | string[] | [] | Tools that cannot be used (exact name match) |
| require_human_approval | boolean | false | Require approval before execution starts |
| approval_tools | string[] | [] | Tools that need human approval before invocation |
| content_filters | string[] | [] | Content types to scan: pii, profanity, credentials |
| max_output_length | integer | (none) | Maximum characters in final output |

How It Works

The safety handler runs at all three enforcement phases: before_workflow, mid_execution, and after_workflow.

Phase Behavior

| Phase | What It Checks | Actions |
|---|---|---|
| before_workflow | require_human_approval, content filters on context.inputs | BLOCK (approval), WARN (content) |
| mid_execution | max_steps vs step count, max_tool_calls vs tool count, content filters on prompt_preview/response_preview | BLOCK (limits), WARN (content) |
| after_workflow | Step/tool limits, max_output_length, content filters on final result | WARN (all violations) |

Context Attributes Read

| Attribute | Phase | Purpose |
|---|---|---|
| context.inputs | before_workflow | Scan input text for PII/credentials/profanity |
| context.step_logs | mid_execution, after_workflow | Count steps taken (len(step_logs)) |
| context.tool_call_count | mid_execution, after_workflow | Count tool invocations |
| context.prompt_preview | mid_execution | Scan LLM prompts for content violations |
| context.response_preview | mid_execution | Scan LLM responses for content violations |
| result (parameter) | after_workflow | Scan final output, check output length |

Content Filters

PII Detection

The pii content filter uses regex patterns to detect personally identifiable information:

| PII Type | Pattern | Example Match |
|---|---|---|
| ssn | \d{3}-\d{2}-\d{4} | 123-45-6789 |
| email | Standard email regex | user@example.com |
| phone | US phone formats | (555) 123-4567, +1-555-123-4567 |
| credit_card | 16 digits grouped by 4 | 4111-1111-1111-1111 |
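
The table above can be sketched as a small regex scanner. These patterns are illustrative approximations of what a pii filter might use, not the handler's exact regexes:

```python
import re

# Illustrative approximations of the PII patterns in the table above.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "credit_card": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the PII types whose pattern matches anywhere in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(scan_pii("Look up 123-45-6789"))       # ['ssn']
print(scan_pii("Send to user@example.com"))  # ['email']
print(scan_pii("Nothing sensitive here"))    # []
```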

Credential Detection

The credentials content filter detects secrets and API keys:

| Pattern | What It Matches | Example |
|---|---|---|
| Password assignments | password=, passwd=, pwd= | password=hunter2 |
| API key assignments | api_key=, apikey=, api_secret= | api_key=abc123 |
| Secret/access keys | secret_key=, access_key= | secret_key=xyz |
| AWS access keys | AKIA prefix + 16 chars | AKIAIOSFODNN7EXAMPLE |
| Generic API tokens | sk-, pk_live_, sk_live_, rk_live_ prefix + 20+ chars | sk-proj-abc123... |
| GitHub PATs | ghp_ prefix + 36 chars | ghp_abcdefghij... |
| Waxell secret keys | wax_sk_ prefix | wax_sk_abc123 |
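
A minimal sketch of the patterns above, assuming regexes that approximate (but may not match exactly) what the handler ships with:

```python
import re

# Illustrative credential patterns based on the table above.
CREDENTIAL_PATTERNS = [
    re.compile(r"(?:password|passwd|pwd)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"(?:api_key|apikey|api_secret)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"(?:secret_key|access_key)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                        # AWS access key
    re.compile(r"\b(?:sk-|pk_live_|sk_live_|rk_live_)[A-Za-z0-9_-]{20,}"),
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                     # GitHub PAT
    re.compile(r"\bwax_sk_\w+"),                                # Waxell secret key
]

def has_credentials(text: str) -> bool:
    return any(p.search(text) for p in CREDENTIAL_PATTERNS)

print(has_credentials("api_key=sk-abc123456789012345678901"))  # True
print(has_credentials("AKIAIOSFODNN7EXAMPLE"))                 # True
print(has_credentials("Use the skeleton key"))                 # False
```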

Profanity Filter

The profanity content filter uses word-boundary matching against a hardcoded word set. Only whole words are matched (e.g., "class" does not trigger on "ass").
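
Word-boundary matching can be sketched with a regex tokenizer. The word set here is a tiny illustrative stand-in; the handler's actual list is hardcoded internally:

```python
import re

# Tiny illustrative stand-in for the handler's hardcoded word set.
PROFANITY = {"damn"}

def has_profanity(text: str) -> bool:
    # Split into whole words so substrings never match ("class" != "ass").
    words = re.findall(r"\b\w+\b", text.lower())
    return any(w in PROFANITY for w in words)

print(has_profanity("This damn report"))  # True
print(has_profanity("The dam broke"))     # False
```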

Matching Examples

| Input | Filter | Match? | Why |
|---|---|---|---|
| "Look up 123-45-6789" | pii | Yes | SSN pattern matches |
| "Send to user@co.com" | pii | Yes | Email pattern matches |
| "Call 555-1234" | pii | No | Only 7 digits (phone needs 10) |
| "api_key=sk-abc123456789012345678901" | credentials | Yes | sk- prefix + 20+ chars |
| "Use the skeleton key" | credentials | No | sk not followed by - with 20+ chars |
| "This damn report" | profanity | Yes | Whole-word match |
| "The dam broke" | profanity | No | "dam" is not "damn" |

Content Filters Return WARN, Not BLOCK

In the safety handler, content filter violations produce WARN actions, not BLOCK. The agent continues running. If you need content violations to block execution, use the dedicated Content Policy instead, which supports configurable actions (warn, redact, block) per detection type.

Execution Limits

Step Limit (max_steps)

Checked at mid_execution and after_workflow by counting len(context.step_logs). Returns BLOCK at mid_execution if exceeded.

Tool Call Limit (max_tool_calls)

Checked at mid_execution and after_workflow via context.tool_call_count. Returns BLOCK at mid_execution if exceeded.

Output Length (max_output_length)

Checked at after_workflow by measuring len(str(result)). Returns WARN if exceeded.

Step/Tool Limits BLOCK at Mid-Execution

Unlike content filters (which WARN), exceeding max_steps or max_tool_calls produces a BLOCK at mid_execution. This immediately halts the agent.
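
The limit comparison can be sketched as follows. The function shape and return convention are assumptions for illustration; only the rule names and defaults come from the schema above:

```python
# Illustrative sketch of the mid-execution limit check (hypothetical shape).
def check_limits(step_count: int, tool_call_count: int, rules: dict):
    max_steps = rules.get("max_steps", 50)
    max_tool_calls = rules.get("max_tool_calls", 100)
    if step_count > max_steps:
        return ("BLOCK", f"Mid-run: step limit exceeded ({step_count}/{max_steps})")
    if tool_call_count > max_tool_calls:
        return ("BLOCK", f"Mid-run: tool call limit exceeded ({tool_call_count}/{max_tool_calls})")
    return ("ALLOW", None)

print(check_limits(55, 10, {"max_steps": 50}))
# ('BLOCK', 'Mid-run: step limit exceeded (55/50)')
print(check_limits(10, 10, {"max_steps": 50}))
# ('ALLOW', None)
```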

Human Approval

require_human_approval

When set to true, the handler returns BLOCK at before_workflow with reason "Human approval required before execution". The agent cannot run without external approval.

approval_tools

Tools listed in approval_tools are blocked with reason "Tool '{name}' requires human approval" when checked via check_tool_allowed(). This is a standalone method meant to be called from the tool execution layer.

Blocked Tools

Tools listed in blocked_tools are blocked with reason "Tool '{name}' is blocked by safety policy" when checked via check_tool_allowed().

check_tool_allowed Is Not Called Automatically

The check_tool_allowed(rules, tool_name) method is a standalone API. It is NOT automatically invoked by the before_workflow, mid_execution, or after_workflow phase hooks. Your tool execution layer must call it explicitly.
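
A sketch of wiring this into a tool execution layer. The stub handler below mimics the documented behavior; its (allowed, reason) return shape and the execute_tool wrapper are assumptions, while the reason strings and check_tool_allowed(rules, tool_name) signature come from the docs above:

```python
# Stub that mimics the documented behavior of check_tool_allowed.
class SafetyHandlerStub:
    def check_tool_allowed(self, rules: dict, tool_name: str):
        if tool_name in rules.get("blocked_tools", []):
            return False, f"Tool '{tool_name}' is blocked by safety policy"
        if tool_name in rules.get("approval_tools", []):
            return False, f"Tool '{tool_name}' requires human approval"
        return True, None

# Hypothetical tool execution layer that calls the check explicitly.
def execute_tool(handler, rules, tool_name, tool_fn, *args, **kwargs):
    allowed, reason = handler.check_tool_allowed(rules, tool_name)
    if not allowed:
        raise PermissionError(reason)
    return tool_fn(*args, **kwargs)

rules = {"blocked_tools": ["shell_exec"], "approval_tools": ["send_email"]}
handler = SafetyHandlerStub()

print(execute_tool(handler, rules, "web_search", lambda q: f"results for {q}", "llms"))
# results for llms
try:
    execute_tool(handler, rules, "shell_exec", lambda: None)
except PermissionError as e:
    print(e)  # Tool 'shell_exec' is blocked by safety policy
```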

Example Policies

PII-Only Content Filter

Scan for PII in inputs and outputs, warn on detection:

{
  "content_filters": ["pii"],
  "max_steps": 50,
  "max_tool_calls": 100
}

Full Safety Lockdown

All content filters, strict limits, blocked tools:

{
  "max_retries": 2,
  "max_steps": 20,
  "max_tool_calls": 30,
  "blocked_tools": ["shell_exec", "file_write", "network_request"],
  "require_human_approval": false,
  "approval_tools": ["send_email", "make_purchase"],
  "content_filters": ["pii", "profanity", "credentials"],
  "max_output_length": 5000
}

Approval-Required for Production

Require human approval before any execution:

{
  "require_human_approval": true,
  "content_filters": ["pii", "credentials"],
  "max_steps": 100,
  "max_tool_calls": 200
}

SDK Integration

Using the Context Manager

import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="research-agent",
        enforce_policy=True,
    ) as ctx:
        # before_workflow: safety checks inputs and require_human_approval
        # If a content filter triggers -> WARN (agent continues)
        # If require_human_approval -> BLOCK (PolicyViolationError)

        result = await do_research(query)

        # Record tool calls (increments tool_call_count)
        ctx.record_tool_call(
            name="web_search",
            input={"query": query},
            output={"results": results},
        )

        ctx.set_result(result)
        # after_workflow: checks limits and output content

except PolicyViolationError as e:
    print(f"Safety block: {e}")

Using the Decorator

@waxell.observe(
    agent_name="research-agent",
    enforce_policy=True,
)
async def run_research(query: str):
    # Safety checks happen before and after this function
    return await do_research(query)

Enforcement Flow

Agent starts (WaxellContext.__aenter__)
|
+-- before_workflow
| |
| +-- require_human_approval? -> BLOCK
| |
| +-- content_filters on context.inputs
| +-- PII detected? -> WARN
| +-- Credential detected? -> WARN
| +-- Profanity detected? -> WARN
|
+-- Agent executes steps...
|
+-- mid_execution (per LLM call)
| |
| +-- step_logs > max_steps? -> BLOCK
| +-- tool_call_count > max_tool_calls? -> BLOCK
| +-- content_filters on prompt_preview/response_preview
| +-- Violations? -> WARN
|
+-- Agent finishes
|
+-- after_workflow
|
+-- step_logs > max_steps? -> WARN
+-- tool_call_count > max_tool_calls? -> WARN
+-- len(result) > max_output_length? -> WARN
+-- content_filters on result -> WARN

Creating via Dashboard

  1. Navigate to Governance > Policies
  2. Click New Policy
  3. Select category Safety
  4. Configure limits (max_steps, max_tool_calls)
  5. Enable content filters (pii, profanity, credentials)
  6. Optionally add blocked_tools and approval_tools
  7. Set scope to target specific agents
  8. Enable

Creating via API

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Research Safety Policy",
    "category": "safety",
    "rules": {
      "max_retries": 3,
      "max_steps": 50,
      "max_tool_calls": 100,
      "blocked_tools": ["dangerous_tool", "shell_exec"],
      "content_filters": ["pii", "profanity", "credentials"],
      "max_output_length": 5000
    },
    "scope": {
      "agents": ["research-agent"]
    },
    "enabled": true
  }'

Observability

Governance Tab

Safety evaluations appear with:

| Field | Example (ALLOW) |
|---|---|
| Policy name | Research Safety Policy |
| Action | allow |
| Category | safety |
| Reason | "Safety checks passed (content filters active: pii, profanity, credentials)" |

For content violations:

| Field | Example (WARN) |
|---|---|
| Action | warn |
| Reason | "Input content violations: PII detected: ssn" |
| Metadata | {"content_violations": ["PII detected: ssn"], "scan_target": "inputs"} |

For limit violations:

| Field | Example (BLOCK) |
|---|---|
| Action | block |
| Reason | "Mid-run: step limit exceeded (55/50)" |
| Metadata | {"steps": 55, "limit": 50} |

Combining with Other Policies

  • Safety + Compliance: HIPAA compliance often requires PII filtering. Use a compliance policy requiring safety as a sibling category, with content_filters: ["pii"] as a required rule
  • Safety + Kill Switch: Use kill switch for emergency stop, safety for ongoing limits
  • Safety + Content: Safety content filters return WARN. For BLOCK on content violations, add a dedicated content policy with pii_detection.action: "block"

Common Gotchas

  1. Content filters are regex-based. They can false-positive on strings that merely resemble PII (e.g., the date-like string 2024-01-2345 contains a substring that matches the SSN pattern).

  2. Content filters return WARN, not BLOCK. Safety content violations produce WARN actions. The agent continues running. Use the dedicated Content Policy for configurable block/warn/redact actions.

  3. blocked_tools requires exact name match. "shell" does not block "shell_exec". Use the full tool name.

  4. max_output_length checks str(result). This includes Python repr overhead (quotes, braces for dicts). The actual content may be shorter than the measured length.

  5. check_tool_allowed is not called automatically. It's a standalone method for your tool execution layer. The phase hooks don't check blocked_tools.

  6. Mid-execution requires the runtime to call the handler. Observe-path agents may not trigger mid_execution checks between steps. Step/tool limits are also checked at after_workflow as a fallback.

  7. require_human_approval blocks ALL queries. It does not inspect the query. Every execution is blocked until approval is granted externally.

  8. Profanity filter uses word boundaries. "class" does not trigger on "ass". But compound words without separators may not match as expected.
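
Gotcha 4 is easy to see in the REPL: for non-string results, len(str(result)) counts repr punctuation, not just the content.

```python
# max_output_length measures len(str(result)), so dict results include
# braces, quotes, colons, and spaces from the Python repr.
result = {"key": "value"}
print(str(result))       # {'key': 'value'}
print(len(str(result)))  # 16 -- repr punctuation counts toward the limit
print(len("keyvalue"))   # 8  -- the raw content is shorter
```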

Next Steps