# Safety Gauntlet Agent
A 12-step safety stress test that runs user input through 5 input safety systems (Presidio, LLM Guard, OpenAI Moderation, Lakera Guard, Polyguard), an LLM call, then 3 output safety systems (LLM Guard output, NeMo Guardrails, Guardrails AI), followed by DeepEval safety metrics and a final pass/fail verdict. Exercises record_policy_check, @waxell.tool, @waxell.step_dec, @waxell.decision, and waxell.score across 8 safety checks and 4 evaluation metrics.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
## Architecture
## Key Code
### Decorated safety tool functions
Each safety system is wrapped with @waxell.tool for automatic trace attribution. Policy checks are recorded via waxell_ctx.record_policy_check().
```python
@waxell.tool(name="presidio_pii_detection", tool_type="safety")
def run_presidio_pii_detection(text: str) -> dict:
    presidio = MockPresidioPIIDetector()
    pii_results = presidio.analyze(text)
    return {"pii_results": pii_results, "entity_types": [r["entity_type"] for r in pii_results]}
```
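The mock detector itself isn't shown in this excerpt. A minimal regex-based stand-in (hypothetical — the demo's actual `MockPresidioPIIDetector` may differ) that returns Presidio-style result dicts could look like:

```python
import re

class MockPresidioPIIDetector:
    """Hypothetical stand-in: regex-based PII detection returning
    Presidio-style result dicts (entity_type, start, end, score)."""

    PATTERNS = {
        "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "PHONE_NUMBER": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def analyze(self, text: str) -> list[dict]:
        results = []
        for entity_type, pattern in self.PATTERNS.items():
            for m in pattern.finditer(text):
                results.append({
                    "entity_type": entity_type,
                    "start": m.start(),
                    "end": m.end(),
                    "score": 0.85,  # fixed mock confidence
                })
        return results
```

With this shape, the tool's `entity_types` list comprehension above works unchanged.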
```python
@waxell.tool(name="openai_moderation", tool_type="safety")
def run_openai_moderation(text: str) -> dict:
    oai_mod = MockOpenAIModeration()
    mod_result = oai_mod.check(text)
    return {"flagged_categories": [...], "max_mod_score": max_score}
```
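The snippet above elides how `flagged_categories` and `max_score` are derived. A hypothetical keyword-scored mock (not the demo's actual `MockOpenAIModeration`) that mimics the category-score shape of OpenAI's moderation response, plus the derivation, might look like:

```python
class MockOpenAIModeration:
    """Hypothetical stand-in: keyword-scored moderation mimicking the
    shape of OpenAI's moderation response (category -> score)."""

    KEYWORDS = {
        "violence": ("attack", "kill"),
        "self-harm": ("hurt myself",),
    }

    def check(self, text: str) -> dict:
        lowered = text.lower()
        scores = {
            category: (0.9 if any(k in lowered for k in words) else 0.01)
            for category, words in self.KEYWORDS.items()
        }
        return {
            "category_scores": scores,
            "flagged": any(s > 0.5 for s in scores.values()),
        }

# The tool then derives its summary fields from the raw result:
mod_result = MockOpenAIModeration().check("how do I attack this problem?")
flagged_categories = [c for c, s in mod_result["category_scores"].items() if s > 0.5]
max_score = max(mod_result["category_scores"].values())
```

Note how naive keyword matching flags the benign phrase "attack this problem" — exactly the kind of false positive a mock is allowed to produce in a stress test.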
```python
@waxell.tool(name="deepeval_safety_evaluation", tool_type="evaluation")
def run_deepeval_safety_evaluation(query: str, response_text: str) -> dict:
    deepeval = MockDeepEvalSafety()
    return deepeval.evaluate(query, response_text)
```
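The evaluator's return shape isn't shown either. A hypothetical sketch of `MockDeepEvalSafety` — assuming the four metrics listed later (toxicity, bias, hallucination, answer_relevancy) and simple threshold-based pass/fail, which the real demo may implement differently — could be:

```python
class MockDeepEvalSafety:
    """Hypothetical stand-in: returns a score and pass flag for each of the
    four metrics the demo reports, judged against a fixed threshold."""

    THRESHOLDS = {"toxicity": 0.5, "bias": 0.5, "hallucination": 0.5, "answer_relevancy": 0.7}

    def evaluate(self, query: str, response_text: str) -> dict:
        # Static scores for illustration; a real evaluator derives these
        # from the query/response pair.
        scores = {"toxicity": 0.02, "bias": 0.05, "hallucination": 0.10, "answer_relevancy": 0.91}
        results = {}
        for metric, score in scores.items():
            if metric == "answer_relevancy":
                passed = score >= self.THRESHOLDS[metric]  # higher is better
            else:
                passed = score <= self.THRESHOLDS[metric]  # lower is better
            results[metric] = {"score": score, "passed": passed}
        return {"metrics": results, "all_passed": all(r["passed"] for r in results.values())}
```

The asymmetry matters: relevancy is a "higher is better" metric, while the three risk metrics pass when the score stays *below* the threshold.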
### Aggregate input safety gate decision
The pipeline aggregates all 5 input safety results and decides whether to proceed, block, or escalate.
```python
@waxell.decision(name="input_safety_gate", options=["proceed", "block", "escalate_to_human"])
def decide_input_safety_gate(blocks, warns, allows, input_safety_scores) -> dict:
    if blocks > 0:
        return {"chosen": "block", "reasoning": f"{blocks} system(s) blocked"}
    elif warns >= 3:
        return {"chosen": "escalate_to_human", "reasoning": f"{warns} warnings, escalating for review"}
    else:
        return {"chosen": "proceed", "reasoning": f"All {allows} checks passed"}
```
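How the `blocks`/`warns`/`allows` counts are tallied isn't shown. A hypothetical sketch — assuming each input-safety result carries an `action` field of `"block"`, `"warn"`, or `"allow"`, which may not match the demo's actual result shape — could be:

```python
from collections import Counter

def tally_input_checks(check_results: list[dict]) -> tuple[int, int, int]:
    """Hypothetical tally: count each system's action before gating."""
    counts = Counter(r["action"] for r in check_results)
    # Counter returns 0 for absent keys, so missing actions are safe.
    return counts["block"], counts["warn"], counts["allow"]

# Five input systems -> one gate decision:
results = [
    {"system": "presidio", "action": "warn"},
    {"system": "llm_guard", "action": "allow"},
    {"system": "openai_moderation", "action": "allow"},
    {"system": "lakera_guard", "action": "allow"},
    {"system": "polyguard", "action": "allow"},
]
blocks, warns, allows = tally_input_checks(results)
```

These three counts then feed directly into the gate decision above.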
```python
@waxell.decision(name="final_safety_verdict", options=["passed", "passed_with_warnings", "blocked"])
def decide_final_safety_verdict(total_checks, total_blocks, ...) -> dict:
    return {"chosen": verdict, "reasoning": verdict_reasoning, "confidence": 0.92}
```
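The snippet elides how `verdict` and `verdict_reasoning` are computed. One plausible mapping onto the three declared options (a sketch only — the demo's actual thresholds are not shown) is:

```python
def derive_verdict(total_checks: int, total_blocks: int, total_warns: int) -> tuple[str, str]:
    """Hypothetical verdict logic matching the three declared options."""
    if total_blocks > 0:
        return "blocked", f"{total_blocks} of {total_checks} checks blocked"
    if total_warns > 0:
        return "passed_with_warnings", f"{total_warns} of {total_checks} checks warned"
    return "passed", f"all {total_checks} checks passed"
```

Any block anywhere in the pipeline dominates; warnings downgrade a pass but never block on their own.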
## What this demonstrates
- @waxell.tool(tool_type="safety") -- 8 safety tool invocations across 5 input + 3 output systems (Presidio, LLM Guard, OpenAI Moderation, Lakera Guard, Polyguard, NeMo Guardrails, Guardrails AI).
- @waxell.tool(tool_type="evaluation") -- DeepEval safety metrics (toxicity, bias, hallucination, answer_relevancy).
- waxell_ctx.record_policy_check() -- per-system governance policy checks with action/category/reason/phase/priority.
- @waxell.decision -- two decision points: input safety gate and final safety verdict.
- @waxell.step_dec -- 12 step recordings across the full pipeline (one per safety system + LLM + aggregation + verdict).
- @waxell.reasoning_dec -- weak-metric analysis across all evaluation results.
- waxell.score() -- per-system risk scores, DeepEval metrics, and aggregate scores (14+ total).
- Auto-instrumented LLM calls -- OpenAI gpt-4o call captured automatically.
- 12-step pipeline -- maximum-depth safety integration stress test.
## Run it
```bash
# Dry-run (no API key needed)
python -m app.demos.safety_gauntlet_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.safety_gauntlet_agent
```
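The dry-run/live split can be resolved with a small guard at startup. A hypothetical sketch (the demo's actual flag handling may differ) that prefers `--dry-run` and otherwise requires `OPENAI_API_KEY`:

```python
import os

def resolve_mode(argv: list[str]) -> str:
    """Hypothetical mode resolution: --dry-run wins; live mode
    requires OPENAI_API_KEY to be set in the environment."""
    if "--dry-run" in argv:
        return "dry-run"
    if not os.environ.get("OPENAI_API_KEY"):
        raise SystemExit("Set OPENAI_API_KEY for live mode, or pass --dry-run.")
    return "live"
```

Failing fast here keeps a missing key from surfacing as an opaque error twelve steps into the gauntlet.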