# Safety Gauntlet Agent
A 12-step safety stress test that runs user input through 5 input safety systems (Presidio, LLM Guard, OpenAI Moderation, Lakera Guard, Polyguard), an LLM call, then 3 output safety systems (LLM Guard output, NeMo Guardrails, Guardrails AI), followed by DeepEval safety metrics and a final pass/fail verdict. Exercises record_policy_check, @waxell.tool, @waxell.step_dec, @waxell.decision, and waxell.score across 8 safety checks and 4 evaluation metrics.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
## Architecture
## Key Code
### Decorated safety tool functions
Each safety system is wrapped with @waxell.tool for automatic trace attribution. Policy checks are recorded via waxell_ctx.record_policy_check().
```python
@waxell.tool(name="presidio_pii_detection", tool_type="safety")
def run_presidio_pii_detection(text: str) -> dict:
    presidio = MockPresidioPIIDetector()
    pii_results = presidio.analyze(text)
    return {"pii_results": pii_results, "entity_types": [r["entity_type"] for r in pii_results]}
```
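The mock detector itself isn't shown in this excerpt. A minimal regex-based stand-in (hypothetical — the demo's actual `MockPresidioPIIDetector` may differ) that returns Presidio-style result dicts could look like:

```python
import re

class MockPresidioPIIDetector:
    """Hypothetical stand-in: regex-based PII detection returning
    Presidio-style result dicts (entity_type, start, end, score)."""

    PATTERNS = {
        "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE_NUMBER": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def analyze(self, text: str) -> list[dict]:
        results = []
        for entity_type, pattern in self.PATTERNS.items():
            for m in pattern.finditer(text):
                results.append({
                    "entity_type": entity_type,
                    "start": m.start(),
                    "end": m.end(),
                    "score": 0.85,  # fixed mock confidence
                })
        return results
```

With this shape, the tool's `entity_types` list comprehension above works unchanged.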
```python
@waxell.tool(name="openai_moderation", tool_type="safety")
def run_openai_moderation(text: str) -> dict:
    oai_mod = MockOpenAIModeration()
    mod_result = oai_mod.check(text)
    return {"flagged_categories": [...], "max_mod_score": max_score}
```
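The snippet above elides how `flagged_categories` and `max_score` are derived. A hypothetical keyword-scored mock (not the demo's actual `MockOpenAIModeration`) that mimics the category-score shape of OpenAI's moderation response, plus the derivation, might look like:

```python
class MockOpenAIModeration:
    """Hypothetical stand-in: keyword-scored moderation mimicking the
    shape of OpenAI's moderation response (category -> score)."""

    KEYWORDS = {
        "violence": ("attack", "kill"),
        "self-harm": ("hurt myself",),
    }

    def check(self, text: str) -> dict:
        lowered = text.lower()
        scores = {
            category: (0.9 if any(k in lowered for k in words) else 0.01)
            for category, words in self.KEYWORDS.items()
        }
        return {
            "category_scores": scores,
            "flagged": any(s > 0.5 for s in scores.values()),
        }

# The tool then derives its summary fields from the raw result:
mod_result = MockOpenAIModeration().check("how do I attack this problem?")
flagged_categories = [c for c, s in mod_result["category_scores"].items() if s > 0.5]
max_score = max(mod_result["category_scores"].values())
```

Note how naive keyword matching flags the benign phrase "attack this problem" — exactly the kind of false positive a mock is allowed to produce in a stress test.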
```python
@waxell.tool(name="deepeval_safety_evaluation", tool_type="evaluation")
def run_deepeval_safety_evaluation(query: str, response_text: str) -> dict:
    deepeval = MockDeepEvalSafety()
    return deepeval.evaluate(query, response_text)
```
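The evaluator's return shape isn't shown either. A hypothetical sketch of `MockDeepEvalSafety` — assuming the four metrics listed later (toxicity, bias, hallucination, answer_relevancy) and simple threshold-based pass/fail, which the real demo may implement differently — could be:

```python
class MockDeepEvalSafety:
    """Hypothetical stand-in: returns a score and pass flag for each of the
    four metrics the demo reports, judged against a fixed threshold."""

    THRESHOLDS = {"toxicity": 0.5, "bias": 0.5, "hallucination": 0.5, "answer_relevancy": 0.7}

    def evaluate(self, query: str, response_text: str) -> dict:
        # Static scores for illustration; a real evaluator derives these
        # from the query/response pair.
        scores = {"toxicity": 0.02, "bias": 0.05, "hallucination": 0.10, "answer_relevancy": 0.91}
        results = {}
        for metric, score in scores.items():
            if metric == "answer_relevancy":
                passed = score >= self.THRESHOLDS[metric]  # higher is better
            else:
                passed = score <= self.THRESHOLDS[metric]  # lower is better
            results[metric] = {"score": score, "passed": passed}
        return {"metrics": results, "all_passed": all(r["passed"] for r in results.values())}
```

The asymmetry matters: relevancy is a "higher is better" metric, while the three risk metrics pass when the score stays *below* the threshold.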
### Aggregate input safety gate decision
The pipeline aggregates all 5 input safety results and decides whether to proceed, block, or escalate.
```python
@waxell.decision(name="input_safety_gate", options=["proceed", "block", "escalate_to_human"])
def decide_input_safety_gate(blocks, warns, allows, input_safety_scores) -> dict:
    if blocks > 0:
        return {"chosen": "block", "reasoning": f"{blocks} system(s) blocked"}
    elif warns >= 3:
        return {"chosen": "escalate_to_human", "reasoning": f"{warns} warnings, escalating for review"}
    else:
        return {"chosen": "proceed", "reasoning": f"All {allows} checks passed"}
```
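How the `blocks`/`warns`/`allows` counts are tallied isn't shown. A hypothetical sketch — assuming each input-safety result carries an `action` field of `"block"`, `"warn"`, or `"allow"`, which may not match the demo's actual result shape — could be:

```python
from collections import Counter

def tally_input_checks(check_results: list[dict]) -> tuple[int, int, int]:
    """Hypothetical tally: count each system's action before gating."""
    counts = Counter(r["action"] for r in check_results)
    # Counter returns 0 for absent keys, so missing actions are safe.
    return counts["block"], counts["warn"], counts["allow"]

# Five input systems -> one gate decision:
results = [
    {"system": "presidio", "action": "warn"},
    {"system": "llm_guard", "action": "allow"},
    {"system": "openai_moderation", "action": "allow"},
    {"system": "lakera_guard", "action": "allow"},
    {"system": "polyguard", "action": "allow"},
]
blocks, warns, allows = tally_input_checks(results)
```

These three counts then feed directly into the gate decision above.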
```python
@waxell.decision(name="final_safety_verdict", options=["passed", "passed_with_warnings", "blocked"])
def decide_final_safety_verdict(total_checks, total_blocks, ...) -> dict:
    return {"chosen": verdict, "reasoning": verdict_reasoning, "confidence": 0.92}
```
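The snippet elides how `verdict` and `verdict_reasoning` are computed. One plausible mapping onto the three declared options (a sketch only — the demo's actual thresholds are not shown) is:

```python
def derive_verdict(total_checks: int, total_blocks: int, total_warns: int) -> tuple[str, str]:
    """Hypothetical verdict logic matching the three declared options."""
    if total_blocks > 0:
        return "blocked", f"{total_blocks} of {total_checks} checks blocked"
    if total_warns > 0:
        return "passed_with_warnings", f"{total_warns} of {total_checks} checks warned"
    return "passed", f"all {total_checks} checks passed"
```

Any block anywhere in the pipeline dominates; warnings downgrade a pass but never block on their own.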
## What this demonstrates
- @waxell.tool(tool_type="safety") -- 8 safety tool invocations across 5 input + 3 output systems (Presidio, LLM Guard, OpenAI Moderation, Lakera Guard, Polyguard, NeMo Guardrails, Guardrails AI).
- @waxell.tool(tool_type="evaluation") -- DeepEval safety metrics (toxicity, bias, hallucination, answer_relevancy).
- waxell_ctx.record_policy_check() -- per-system governance policy checks with action/category/reason/phase/priority.
- @waxell.decision -- two decision points: input safety gate and final safety verdict.
- @waxell.step_dec -- 12 step recordings across the full pipeline (one per safety system + LLM + aggregation + verdict).
- @waxell.reasoning_dec -- weak-metric analysis across all evaluation results.
- waxell.score() -- per-system risk scores, DeepEval metrics, and aggregate scores (14+ total).
- Auto-instrumented LLM calls -- OpenAI gpt-4o call captured automatically.
- 12-step pipeline -- maximum-depth safety integration stress test.
## Run it
```bash
# Dry-run (no API key needed)
python -m app.demos.safety_gauntlet_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.safety_gauntlet_agent
```
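The dry-run/live split can be resolved with a small guard at startup. A hypothetical sketch (the demo's actual flag handling may differ) that prefers `--dry-run` and otherwise requires `OPENAI_API_KEY`:

```python
import os

def resolve_mode(argv: list[str]) -> str:
    """Hypothetical mode resolution: --dry-run wins; live mode
    requires OPENAI_API_KEY to be set in the environment."""
    if "--dry-run" in argv:
        return "dry-run"
    if not os.environ.get("OPENAI_API_KEY"):
        raise SystemExit("Set OPENAI_API_KEY for live mode, or pass --dry-run.")
    return "live"
```

Failing fast here keeps a missing key from surfacing as an opaque error twelve steps into the gauntlet.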