Safety Policy

The safety policy category enforces content and behavior safety controls on workflow execution. It covers:

  • Content filters -- scan inputs and outputs for PII, credentials, and profanity
  • Execution limits -- cap the number of steps and tool calls an agent can make
  • Tool restrictions -- block specific tools or require human approval
  • Output limits -- enforce maximum output length

Use it to prevent data leakage, runaway agents, and dangerous tool invocations.

Rules

| Rule | Type | Default | Description |
|---|---|---|---|
| max_retries | integer | 3 | Maximum retry attempts on failure |
| max_steps | integer | 50 | Maximum workflow steps allowed |
| max_tool_calls | integer | 100 | Maximum tool invocations allowed |
| blocked_tools | string[] | [] | Tools that cannot be used (exact name match) |
| require_human_approval | boolean | false | Require approval before execution starts |
| approval_tools | string[] | [] | Tools that need human approval before invocation |
| content_filters | string[] | [] | Content types to scan: pii, profanity, credentials |
| max_output_length | integer | (none) | Maximum characters in final output |

How It Works

The safety handler runs at all three enforcement phases: before_workflow, mid_execution, and after_workflow.

Phase Behavior

| Phase | What It Checks | Actions |
|---|---|---|
| before_workflow | require_human_approval, content filters on context.inputs | BLOCK (approval), WARN (content) |
| mid_execution | max_steps vs step count, max_tool_calls vs tool count, content filters on prompt_preview/response_preview | BLOCK (limits), WARN (content) |
| after_workflow | Step/tool limits, max_output_length, content filters on final result | WARN (all violations) |

Context Attributes Read

| Attribute | Phase | Purpose |
|---|---|---|
| context.inputs | before_workflow | Scan input text for PII/credentials/profanity |
| context.step_logs | mid_execution, after_workflow | Count steps taken (len(step_logs)) |
| context.tool_call_count | mid_execution, after_workflow | Count tool invocations |
| context.prompt_preview | mid_execution | Scan LLM prompts for content violations |
| context.response_preview | mid_execution | Scan LLM responses for content violations |
| result (parameter) | after_workflow | Scan final output, check output length |

Content Filters

PII Detection

The pii content filter uses regex patterns to detect personally identifiable information:

| PII Type | Pattern | Example Match |
|---|---|---|
| ssn | \d{3}-\d{2}-\d{4} | 123-45-6789 |
| email | Standard email regex | user@example.com |
| phone | US phone formats | (555) 123-4567, +1-555-123-4567 |
| credit_card | 16 digits grouped by 4 | 4111-1111-1111-1111 |
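
The table above can be sketched as a small regex scanner. These patterns are illustrative approximations of what a pii filter might use, not the handler's exact regexes:

```python
import re

# Illustrative approximations of the PII patterns in the table above.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "credit_card": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
}

def scan_pii(text: str) -> list[str]:
    """Return the PII types whose pattern matches anywhere in the text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]

print(scan_pii("Look up 123-45-6789"))       # ['ssn']
print(scan_pii("Send to user@example.com"))  # ['email']
print(scan_pii("Nothing sensitive here"))    # []
```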

Credential Detection

The credentials content filter detects secrets and API keys:

| Pattern | What It Matches | Example |
|---|---|---|
| Password assignments | password=, passwd=, pwd= | password=hunter2 |
| API key assignments | api_key=, apikey=, api_secret= | api_key=abc123 |
| Secret/access keys | secret_key=, access_key= | secret_key=xyz |
| AWS access keys | AKIA prefix + 16 chars | AKIAIOSFODNN7EXAMPLE |
| Generic API tokens | sk-, pk_live_, sk_live_, rk_live_ prefix + 20+ chars | sk-proj-abc123... |
| GitHub PATs | ghp_ prefix + 36 chars | ghp_abcdefghij... |
| Waxell secret keys | wax_sk_ prefix | wax_sk_abc123 |
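
A minimal sketch of the patterns above, assuming regexes that approximate (but may not match exactly) what the handler ships with:

```python
import re

# Illustrative credential patterns based on the table above.
CREDENTIAL_PATTERNS = [
    re.compile(r"(?:password|passwd|pwd)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"(?:api_key|apikey|api_secret)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"(?:secret_key|access_key)\s*=\s*\S+", re.IGNORECASE),
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                        # AWS access key
    re.compile(r"\b(?:sk-|pk_live_|sk_live_|rk_live_)[A-Za-z0-9_-]{20,}"),
    re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),                     # GitHub PAT
    re.compile(r"\bwax_sk_\w+"),                                # Waxell secret key
]

def has_credentials(text: str) -> bool:
    return any(p.search(text) for p in CREDENTIAL_PATTERNS)

print(has_credentials("api_key=sk-abc123456789012345678901"))  # True
print(has_credentials("AKIAIOSFODNN7EXAMPLE"))                 # True
print(has_credentials("Use the skeleton key"))                 # False
```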

Profanity Filter

The profanity content filter uses word-boundary matching against a hardcoded word set. Only whole words are matched (e.g., "class" does not trigger on "ass").
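
Word-boundary matching can be sketched with a regex tokenizer. The word set here is a tiny illustrative stand-in; the handler's actual list is hardcoded internally:

```python
import re

# Tiny illustrative stand-in for the handler's hardcoded word set.
PROFANITY = {"damn"}

def has_profanity(text: str) -> bool:
    # Split into whole words so substrings never match ("class" != "ass").
    words = re.findall(r"\b\w+\b", text.lower())
    return any(w in PROFANITY for w in words)

print(has_profanity("This damn report"))  # True
print(has_profanity("The dam broke"))     # False
```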

Matching Examples

| Input | Filter | Match? | Why |
|---|---|---|---|
| "Look up 123-45-6789" | pii | Yes | SSN pattern matches |
| "Send to user@co.com" | pii | Yes | Email pattern matches |
| "Call 555-1234" | pii | No | Only 7 digits (phone needs 10) |
| "api_key=sk-abc123456789012345678901" | credentials | Yes | sk- prefix + 20+ chars |
| "Use the skeleton key" | credentials | No | sk not followed by - with 20+ chars |
| "This damn report" | profanity | Yes | Whole-word match |
| "The dam broke" | profanity | No | "dam" is not "damn" |

Content Filters Return WARN, Not BLOCK

In the safety handler, content filter violations produce WARN actions, not BLOCK. The agent continues running. If you need content violations to block execution, use the dedicated Content Policy instead, which supports configurable actions (warn, redact, block) per detection type.

Execution Limits

Step Limit (max_steps)

Checked at mid_execution and after_workflow by counting len(context.step_logs). Returns BLOCK at mid_execution if exceeded.

Tool Call Limit (max_tool_calls)

Checked at mid_execution and after_workflow via context.tool_call_count. Returns BLOCK at mid_execution if exceeded.

Output Length (max_output_length)

Checked at after_workflow by measuring len(str(result)). Returns WARN if exceeded.

Step/Tool Limits BLOCK at Mid-Execution

Unlike content filters (which WARN), exceeding max_steps or max_tool_calls produces a BLOCK at mid_execution. This immediately halts the agent.
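
The limit comparison can be sketched as follows. The function shape and return convention are assumptions for illustration; only the rule names and defaults come from the schema above:

```python
# Illustrative sketch of the mid-execution limit check (hypothetical shape).
def check_limits(step_count: int, tool_call_count: int, rules: dict):
    max_steps = rules.get("max_steps", 50)
    max_tool_calls = rules.get("max_tool_calls", 100)
    if step_count > max_steps:
        return ("BLOCK", f"Mid-run: step limit exceeded ({step_count}/{max_steps})")
    if tool_call_count > max_tool_calls:
        return ("BLOCK", f"Mid-run: tool call limit exceeded ({tool_call_count}/{max_tool_calls})")
    return ("ALLOW", None)

print(check_limits(55, 10, {"max_steps": 50}))
# ('BLOCK', 'Mid-run: step limit exceeded (55/50)')
print(check_limits(10, 10, {"max_steps": 50}))
# ('ALLOW', None)
```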

Human Approval

require_human_approval

When set to true, the handler returns BLOCK at before_workflow with reason "Human approval required before execution". The agent cannot run without external approval.

approval_tools

Tools listed in approval_tools are blocked with reason "Tool '{name}' requires human approval" when checked via check_tool_allowed(). This is a standalone method meant to be called from the tool execution layer.

Blocked Tools

Tools listed in blocked_tools are blocked with reason "Tool '{name}' is blocked by safety policy" when checked via check_tool_allowed().

check_tool_allowed Is Not Called Automatically

The check_tool_allowed(rules, tool_name) method is a standalone API. It is NOT automatically invoked by the before_workflow, mid_execution, or after_workflow phase hooks. Your tool execution layer must call it explicitly.
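
A sketch of wiring this into a tool execution layer. The stub handler below mimics the documented behavior; its (allowed, reason) return shape and the execute_tool wrapper are assumptions, while the reason strings and check_tool_allowed(rules, tool_name) signature come from the docs above:

```python
# Stub that mimics the documented behavior of check_tool_allowed.
class SafetyHandlerStub:
    def check_tool_allowed(self, rules: dict, tool_name: str):
        if tool_name in rules.get("blocked_tools", []):
            return False, f"Tool '{tool_name}' is blocked by safety policy"
        if tool_name in rules.get("approval_tools", []):
            return False, f"Tool '{tool_name}' requires human approval"
        return True, None

# Hypothetical tool execution layer that calls the check explicitly.
def execute_tool(handler, rules, tool_name, tool_fn, *args, **kwargs):
    allowed, reason = handler.check_tool_allowed(rules, tool_name)
    if not allowed:
        raise PermissionError(reason)
    return tool_fn(*args, **kwargs)

rules = {"blocked_tools": ["shell_exec"], "approval_tools": ["send_email"]}
handler = SafetyHandlerStub()

print(execute_tool(handler, rules, "web_search", lambda q: f"results for {q}", "llms"))
# results for llms
try:
    execute_tool(handler, rules, "shell_exec", lambda: None)
except PermissionError as e:
    print(e)  # Tool 'shell_exec' is blocked by safety policy
```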

Example Policies

PII-Only Content Filter

Scan for PII in inputs and outputs, warn on detection:

{
  "content_filters": ["pii"],
  "max_steps": 50,
  "max_tool_calls": 100
}

Full Safety Lockdown

All content filters, strict limits, blocked tools:

{
  "max_retries": 2,
  "max_steps": 20,
  "max_tool_calls": 30,
  "blocked_tools": ["shell_exec", "file_write", "network_request"],
  "require_human_approval": false,
  "approval_tools": ["send_email", "make_purchase"],
  "content_filters": ["pii", "profanity", "credentials"],
  "max_output_length": 5000
}

Approval-Required for Production

Require human approval before any execution:

{
  "require_human_approval": true,
  "content_filters": ["pii", "credentials"],
  "max_steps": 100,
  "max_tool_calls": 200
}

SDK Integration

Using the Context Manager

import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

waxell.init()

try:
    async with waxell.WaxellContext(
        agent_name="research-agent",
        enforce_policy=True,
    ) as ctx:
        # before_workflow: safety checks inputs and require_human_approval
        # If a content filter triggers -> WARN (agent continues)
        # If require_human_approval -> BLOCK (PolicyViolationError)

        result = await do_research(query)

        # Record tool calls (increments tool_call_count)
        ctx.record_tool_call(
            name="web_search",
            input={"query": query},
            output={"results": results},
        )

        ctx.set_result(result)
        # after_workflow: checks limits and output content

except PolicyViolationError as e:
    print(f"Safety block: {e}")

Using the Decorator

@waxell.observe(
    agent_name="research-agent",
    enforce_policy=True,
)
async def run_research(query: str):
    # Safety checks happen before and after this function
    return await do_research(query)

Enforcement Flow

Agent starts (WaxellContext.__aenter__)
|
+-- before_workflow
| |
| +-- require_human_approval? -> BLOCK
| |
| +-- content_filters on context.inputs
| +-- PII detected? -> WARN
| +-- Credential detected? -> WARN
| +-- Profanity detected? -> WARN
|
+-- Agent executes steps...
|
+-- mid_execution (per LLM call)
| |
| +-- step_logs > max_steps? -> BLOCK
| +-- tool_call_count > max_tool_calls? -> BLOCK
| +-- content_filters on prompt_preview/response_preview
| +-- Violations? -> WARN
|
+-- Agent finishes
|
+-- after_workflow
|
+-- step_logs > max_steps? -> WARN
+-- tool_call_count > max_tool_calls? -> WARN
+-- len(result) > max_output_length? -> WARN
+-- content_filters on result -> WARN

Creating via Dashboard

  1. Navigate to Governance > Policies
  2. Click New Policy
  3. Select category Safety
  4. Configure limits (max_steps, max_tool_calls)
  5. Enable content filters (pii, profanity, credentials)
  6. Optionally add blocked_tools and approval_tools
  7. Set scope to target specific agents
  8. Enable

Creating via API

curl -X POST \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  https://acme.waxell.dev/waxell/v1/policies/ \
  -d '{
    "name": "Research Safety Policy",
    "category": "safety",
    "rules": {
      "max_retries": 3,
      "max_steps": 50,
      "max_tool_calls": 100,
      "blocked_tools": ["dangerous_tool", "shell_exec"],
      "content_filters": ["pii", "profanity", "credentials"],
      "max_output_length": 5000
    },
    "scope": {
      "agents": ["research-agent"]
    },
    "enabled": true
  }'

Observability

Governance Tab

Safety evaluations appear with:

| Field | Example (ALLOW) |
|---|---|
| Policy name | Research Safety Policy |
| Action | allow |
| Category | safety |
| Reason | "Safety checks passed (content filters active: pii, profanity, credentials)" |

For content violations:

| Field | Example (WARN) |
|---|---|
| Action | warn |
| Reason | "Input content violations: PII detected: ssn" |
| Metadata | {"content_violations": ["PII detected: ssn"], "scan_target": "inputs"} |

For limit violations:

| Field | Example (BLOCK) |
|---|---|
| Action | block |
| Reason | "Mid-run: step limit exceeded (55/50)" |
| Metadata | {"steps": 55, "limit": 50} |

Combining with Other Policies

  • Safety + Compliance: HIPAA compliance often requires PII filtering. Use a compliance policy requiring safety as a sibling category, with content_filters: ["pii"] as a required rule
  • Safety + Kill Switch: Use kill switch for emergency stop, safety for ongoing limits
  • Safety + Content: Safety content filters return WARN. For BLOCK on content violations, add a dedicated content policy with pii_detection.action: "block"

Common Gotchas

  1. Content filters are regex-based. They can false-positive on strings that merely resemble PII (e.g., the date-like string 2024-01-2345 contains a substring that matches the SSN pattern).

  2. Content filters return WARN, not BLOCK. Safety content violations produce WARN actions. The agent continues running. Use the dedicated Content Policy for configurable block/warn/redact actions.

  3. blocked_tools requires exact name match. "shell" does not block "shell_exec". Use the full tool name.

  4. max_output_length checks str(result). This includes Python repr overhead (quotes, braces for dicts). The actual content may be shorter than the measured length.

  5. check_tool_allowed is not called automatically. It's a standalone method for your tool execution layer. The phase hooks don't check blocked_tools.

  6. Mid-execution requires the runtime to call the handler. Observe-path agents may not trigger mid_execution checks between steps. Step/tool limits are also checked at after_workflow as a fallback.

  7. require_human_approval blocks ALL queries. It does not inspect the query. Every execution is blocked until approval is granted externally.

  8. Profanity filter uses word boundaries. "class" does not trigger on "ass". But compound words without separators may not match as expected.
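
Gotcha 4 is easy to see in the REPL: for non-string results, len(str(result)) counts repr punctuation, not just the content.

```python
# max_output_length measures len(str(result)), so dict results include
# braces, quotes, colons, and spaces from the Python repr.
result = {"key": "value"}
print(str(result))       # {'key': 'value'}
print(len(str(result)))  # 16 -- repr punctuation counts toward the limit
print(len("keyvalue"))   # 8  -- the raw content is shorter
```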

Next Steps