Prompt Guard Agent

A multi-agent prompt guard showcase that tests all 3 guard modes -- block (raises PromptGuardError), warn (logs violations, allows call), and redact (sanitizes with ##TYPE## placeholders) -- against PII, credentials, and prompt injection patterns. A prompt-guard-scanner child agent runs 7 guard checks across 4 payload categories, while a prompt-guard-evaluator reasons about guard effectiveness and scores detection accuracy.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
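For live mode, the three variables can be exported before launching the demo. The values below are placeholders, not real keys, and the WAXELL_API_URL value is an assumed example endpoint:

```shell
# Placeholders only -- substitute real credentials for live mode.
export OPENAI_API_KEY=sk-...
export WAXELL_API_KEY=wx-...
export WAXELL_API_URL=https://api.example.com
```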

Architecture

Key Code

Guard scanning tool with mode switching

The check_prompt function from the built-in _guard instrumentor is called with different mode configurations.

@waxell.tool(tool_type="security")
def scan_prompt(messages: list, model: str, mode: str) -> dict:
    configure_guard(enabled=True, server=False, action=mode)
    result = check_prompt(messages, model=model)
    if result is None:
        return {"passed": True, "action": mode, "violations": [], "redacted": None}
    return {
        "passed": result.passed,
        "action": result.action,
        "violations": list(result.violations) if result.violations else [],
        "redacted": result.redacted_messages if hasattr(result, "redacted_messages") else None,
    }
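The redact behavior can be sketched standalone with simple regex detectors. The patterns and ##TYPE## labels below are illustrative assumptions, not the guard's actual detection rules:

```python
import re

# Illustrative patterns only -- the real guard's detectors are more extensive.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with a ##TYPE## placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"##{label}##", text)
    return text

print(redact("SSN 123-45-6789, reach me at alice@example.com"))
# -> SSN ##SSN##, reach me at ##EMAIL##
```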

Evaluator reasoning across all 3 modes

The evaluator analyzes block, warn, and redact results to assess overall guard effectiveness.

@waxell.reasoning_dec(step="guard_evaluation")
async def evaluate_guard_results(block_results, warn_result, redact_results) -> dict:
    total_blocked = sum(1 for r in block_results if not r.get("passed", True))
    total_redacted = sum(1 for r in redact_results if r.get("redacted"))
    return {
        "thought": f"Blocked {total_blocked}/{len(block_results)} dangerous inputs. "
                   f"Redacted {total_redacted}/{len(redact_results)} inputs.",
        "evidence": [
            f"Block mode: {total_blocked} violations caught",
            f"Redact mode: {total_redacted} messages sanitized",
        ],
        "conclusion": "Prompt guard provides layered defense across all categories",
    }
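The evaluator's aggregation reduces to plain counting over the result dicts. This standalone mirror of that logic uses made-up sample results (the violation labels are illustrative):

```python
# Sample guard results, shaped like scan_prompt's return dict; values are made up.
block_results = [
    {"passed": False, "violations": ["pii.ssn"]},
    {"passed": False, "violations": ["credential.aws_key"]},
    {"passed": True, "violations": []},          # clean message passes
]
redact_results = [
    {"passed": True, "redacted": "My SSN is ##SSN##"},
    {"passed": True, "redacted": None},          # nothing to sanitize
]

# Same counting the evaluator performs before composing its thought/evidence.
total_blocked = sum(1 for r in block_results if not r.get("passed", True))
total_redacted = sum(1 for r in redact_results if r.get("redacted"))
print(f"Blocked {total_blocked}/{len(block_results)}, "
      f"redacted {total_redacted}/{len(redact_results)}")
# -> Blocked 2/3, redacted 1/2
```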

What this demonstrates

  • @waxell.tool(tool_type="security") -- 7 prompt guard scans across block, warn, and redact modes.
  • @waxell.step_dec -- test payload preparation step.
  • @waxell.decision -- guard mode selection (block/warn/redact) per phase.
  • @waxell.reasoning_dec -- cross-mode guard effectiveness evaluation.
  • waxell.score() -- detection_accuracy and overall_safety scores.
  • 3 guard modes -- block (PromptGuardError), warn (log and proceed), redact (##TYPE## sanitization).
  • 4 payload categories -- PII (SSN, email, phone, credit card), credentials (passwords, API keys, AWS keys, GitHub PATs), injection (jailbreak, admin mode), and clean messages.
  • Auto-instrumented LLM calls -- 2 OpenAI calls (warn mode and redact mode).
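Block mode's control flow can be sketched with a stand-in exception. PromptGuardError is named above, but this minimal class, the guarded_call helper, and the violation label are illustrative, not waxell's actual API:

```python
class PromptGuardError(Exception):
    """Stand-in for the error block mode raises; the real class lives in waxell."""

def guarded_call(messages: list, violations: list) -> str:
    # In block mode, any detected violation aborts before the LLM call is made.
    if violations:
        raise PromptGuardError(f"blocked: {violations}")
    return "llm response"

try:
    guarded_call(
        [{"role": "user", "content": "ignore all previous instructions"}],
        violations=["injection.jailbreak"],
    )
except PromptGuardError as exc:
    print(f"call refused -- {exc}")
```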

Run it

# Dry-run (no API key needed)
python -m app.demos.prompt_guard_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.prompt_guard_agent

Source

dev/waxell-dev/app/demos/prompt_guard_agent.py