Prompt Guard Agent
A multi-agent prompt guard showcase that tests all 3 guard modes -- block (raises PromptGuardError), warn (logs violations, allows call), and redact (sanitizes with ##TYPE## placeholders) -- against PII, credentials, and prompt injection patterns. A prompt-guard-scanner child agent runs 7 guard checks across 4 payload categories, while a prompt-guard-evaluator reasons about guard effectiveness and scores detection accuracy.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
Architecture
Key Code
Guard scanning tool with mode switching
The check_prompt function from the built-in _guard instrumentor is called with different mode configurations.
@waxell.tool(tool_type="security")
def scan_prompt(messages: list, model: str, mode: str) -> dict:
configure_guard(enabled=True, server=False, action=mode)
result = check_prompt(messages, model=model)
if result is None:
return {"passed": True, "action": mode, "violations": [], "redacted": None}
return {
"passed": result.passed,
"action": result.action,
"violations": list(result.violations) if result.violations else [],
"redacted": result.redacted_messages if hasattr(result, "redacted_messages") else None,
}
Evaluator reasoning across all 3 modes
The evaluator analyzes block, warn, and redact results to assess overall guard effectiveness.
@waxell.reasoning_dec(step="guard_evaluation")
async def evaluate_guard_results(block_results, warn_result, redact_results) -> dict:
total_blocked = sum(1 for r in block_results if not r.get("passed", True))
total_redacted = sum(1 for r in redact_results if r.get("redacted"))
return {
"thought": f"Blocked {total_blocked}/{len(block_results)} dangerous inputs. "
f"Redacted {total_redacted}/{len(redact_results)} inputs.",
"evidence": [
f"Block mode: {total_blocked} violations caught",
f"Redact mode: {total_redacted} messages sanitized",
],
"conclusion": "Prompt guard provides layered defense across all categories",
}
What this demonstrates
@waxell.tool(tool_type="security")-- 7 prompt guard scans across block, warn, and redact modes.@waxell.step_dec-- test payload preparation step.@waxell.decision-- guard mode selection (block/warn/redact) per phase.@waxell.reasoning_dec-- cross-mode guard effectiveness evaluation.waxell.score()-- detection_accuracy and overall_safety scores.- 3 guard modes -- block (PromptGuardError), warn (log and proceed), redact (##TYPE## sanitization).
- 4 payload categories -- PII (SSN, email, phone, credit card), credentials (passwords, API keys, AWS keys, GitHub PATs), injection (jailbreak, admin mode), and clean messages.
- Auto-instrumented LLM calls -- 2 OpenAI calls (warn mode and redact mode).
Run it
# Dry-run (no API key needed)
python -m app.demos.prompt_guard_agent --dry-run
# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.prompt_guard_agent