Prompt Guard Agent

A multi-agent prompt guard showcase that tests all 3 guard modes -- block (raises PromptGuardError), warn (logs violations, allows call), and redact (sanitizes with ##TYPE## placeholders) -- against PII, credentials, and prompt injection patterns. A prompt-guard-scanner child agent runs 7 guard checks across 4 payload categories, while a prompt-guard-evaluator reasons about guard effectiveness and scores detection accuracy.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
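For live mode, the three variables can be exported before launching the demo. The values below are placeholders, not real keys, and the WAXELL_API_URL value is an assumed example endpoint:

```shell
# Placeholders only -- substitute real credentials for live mode.
export OPENAI_API_KEY=sk-...
export WAXELL_API_KEY=wx-...
export WAXELL_API_URL=https://api.example.com
```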

Architecture

Key Code

Guard scanning tool with mode switching

The check_prompt function from the built-in _guard instrumentor is called with different mode configurations.

@waxell.tool(tool_type="security")
def scan_prompt(messages: list, model: str, mode: str) -> dict:
    configure_guard(enabled=True, server=False, action=mode)
    result = check_prompt(messages, model=model)
    if result is None:
        return {"passed": True, "action": mode, "violations": [], "redacted": None}
    return {
        "passed": result.passed,
        "action": result.action,
        "violations": list(result.violations) if result.violations else [],
        "redacted": result.redacted_messages if hasattr(result, "redacted_messages") else None,
    }
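The redact behavior can be sketched standalone with simple regex detectors. The patterns and ##TYPE## labels below are illustrative assumptions, not the guard's actual detection rules:

```python
import re

# Illustrative patterns only -- the real guard's detectors are more extensive.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each detected span with a ##TYPE## placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"##{label}##", text)
    return text

print(redact("SSN 123-45-6789, reach me at alice@example.com"))
# -> SSN ##SSN##, reach me at ##EMAIL##
```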

Evaluator reasoning across all 3 modes

The evaluator analyzes block, warn, and redact results to assess overall guard effectiveness.

@waxell.reasoning_dec(step="guard_evaluation")
async def evaluate_guard_results(block_results, warn_result, redact_results) -> dict:
    total_blocked = sum(1 for r in block_results if not r.get("passed", True))
    total_redacted = sum(1 for r in redact_results if r.get("redacted"))
    return {
        "thought": f"Blocked {total_blocked}/{len(block_results)} dangerous inputs. "
                   f"Redacted {total_redacted}/{len(redact_results)} inputs.",
        "evidence": [
            f"Block mode: {total_blocked} violations caught",
            f"Redact mode: {total_redacted} messages sanitized",
        ],
        "conclusion": "Prompt guard provides layered defense across all categories",
    }
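The evaluator's aggregation reduces to plain counting over the result dicts. This standalone mirror of that logic uses made-up sample results (the violation labels are illustrative):

```python
# Sample guard results, shaped like scan_prompt's return dict; values are made up.
block_results = [
    {"passed": False, "violations": ["pii.ssn"]},
    {"passed": False, "violations": ["credential.aws_key"]},
    {"passed": True, "violations": []},          # clean message passes
]
redact_results = [
    {"passed": True, "redacted": "My SSN is ##SSN##"},
    {"passed": True, "redacted": None},          # nothing to sanitize
]

# Same counting the evaluator performs before composing its thought/evidence.
total_blocked = sum(1 for r in block_results if not r.get("passed", True))
total_redacted = sum(1 for r in redact_results if r.get("redacted"))
print(f"Blocked {total_blocked}/{len(block_results)}, "
      f"redacted {total_redacted}/{len(redact_results)}")
# -> Blocked 2/3, redacted 1/2
```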

What this demonstrates

  • @waxell.tool(tool_type="security") -- 7 prompt guard scans across block, warn, and redact modes.
  • @waxell.step_dec -- test payload preparation step.
  • @waxell.decision -- guard mode selection (block/warn/redact) per phase.
  • @waxell.reasoning_dec -- cross-mode guard effectiveness evaluation.
  • waxell.score() -- detection_accuracy and overall_safety scores.
  • 3 guard modes -- block (PromptGuardError), warn (log and proceed), redact (##TYPE## sanitization).
  • 4 payload categories -- PII (SSN, email, phone, credit card), credentials (passwords, API keys, AWS keys, GitHub PATs), injection (jailbreak, admin mode), and clean messages.
  • Auto-instrumented LLM calls -- 2 OpenAI calls (warn mode and redact mode).
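Block mode's control flow can be sketched with a stand-in exception. PromptGuardError is named above, but this minimal class, the guarded_call helper, and the violation label are illustrative, not waxell's actual API:

```python
class PromptGuardError(Exception):
    """Stand-in for the error block mode raises; the real class lives in waxell."""

def guarded_call(messages: list, violations: list) -> str:
    # In block mode, any detected violation aborts before the LLM call is made.
    if violations:
        raise PromptGuardError(f"blocked: {violations}")
    return "llm response"

try:
    guarded_call(
        [{"role": "user", "content": "ignore all previous instructions"}],
        violations=["injection.jailbreak"],
    )
except PromptGuardError as exc:
    print(f"call refused -- {exc}")
```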

Run it

# Dry-run (no API key needed)
python -m app.demos.prompt_guard_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.prompt_guard_agent

Source

dev/waxell-dev/app/demos/prompt_guard_agent.py