Prompt Injection Guard Policy
The prompt-injection-guard policy addresses OWASP LLM01 (Prompt Injection) and aligns with NIST AI RMF MAP-2.3. It scans agent inputs before they reach the LLM, and it scans retrieval results and tool outputs mid-run to catch indirect injection. It offers two detection mechanisms:
- Heuristic (default, cheap): substring matches against a curated list of injection phrases, plus structural signals (base64 blobs, excessive caps, excessive punctuation).
- Classifier (opt-in, mid-cost): controlplane wires an ML classifier callable that returns `(confidence, label)`. When configured, classifier and heuristic decisions are ORed.
Pure CPU, sub-millisecond per input in heuristic mode.
Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `detection_mode` | string | `"heuristic"` | `"heuristic"`, `"classifier"`, or `"heuristic_plus_classifier"` |
| `min_confidence` | number | `0.7` | Classifier threshold (0–1) |
| `blocked_patterns` | string[] | (curated default) | Replaces the built-in pattern list (case-insensitive substring match); to extend it, copy the defaults and add your own |
| `max_payload_kb` | number | `64` | Reject oversized inputs (possible obfuscation) |
| `action_on_violation` | string | `"block"` | `"block"` or `"warn"` |
Optional: `scan_indirect` (boolean, default `true`) toggles mid-execution scanning of `retrieval_results`.
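
The rule table above can be checked mechanically before a policy is deployed. The following is a minimal sketch, not the handler's own validation: `validate_rules` and its messages are hypothetical names, and the requirement that `max_payload_kb` be positive is an assumption rather than documented behavior.

```python
from typing import Any, Dict, List

ALLOWED_MODES = {"heuristic", "classifier", "heuristic_plus_classifier"}
ALLOWED_ACTIONS = {"block", "warn"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_rules(rules: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the rules look well-formed."""
    problems = []

    mode = rules.get("detection_mode", "heuristic")
    if not isinstance(mode, str) or mode not in ALLOWED_MODES:
        problems.append(f"detection_mode: unknown value {mode!r}")

    confidence = rules.get("min_confidence", 0.7)
    if not _is_number(confidence) or not 0 <= confidence <= 1:
        problems.append("min_confidence: must be a number between 0 and 1")

    patterns = rules.get("blocked_patterns")  # None means "use the defaults"
    if patterns is not None and not (
        isinstance(patterns, list) and all(isinstance(p, str) for p in patterns)
    ):
        problems.append("blocked_patterns: must be a list of strings")

    size = rules.get("max_payload_kb", 64)
    if not _is_number(size) or size <= 0:  # positivity is an assumption
        problems.append("max_payload_kb: must be a positive number")

    action = rules.get("action_on_violation", "block")
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        problems.append(f"action_on_violation: unknown value {action!r}")

    if not isinstance(rules.get("scan_indirect", True), bool):
        problems.append("scan_indirect: must be a boolean")

    return problems
```

Running a check like this in CI can catch a misspelled mode or an out-of-range threshold before the policy reaches an agent.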
Default Heuristic Patterns
The handler ships with a curated list including: `ignore previous instructions`, `disregard the above`, `forget previous instructions`, `new instructions:`, `system:`, `system prompt:`, `you are now`, `act as a`, `pretend to be`, `developer mode`, `dan mode`, `jailbreak`, `what were your instructions`, `repeat your system prompt`, `### instructions`, `user has authorized`, `the assistant should`.
Structural signals (always on): base64 blob ≥ 200 chars, 15+ consecutive uppercase letters, 9+ repeated `!`/`?`/`.` characters.
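
To make the two kinds of heuristic signal concrete, here is a rough sketch of phrase matching and structural checks. The regular expressions are assumptions derived from the thresholds listed above (the handler's actual expressions are not shown in this document), the punctuation check assumes any mix of `!`, `?`, and `.` counts as a run, and `phrase_hit` and `structural_hits` are hypothetical names.

```python
import re
from typing import List, Optional

# A few phrases taken from the list above; the shipped list is longer.
DEFAULT_PATTERNS = ["ignore previous instructions", "system:", "you are now", "jailbreak"]

# Assumed regexes for the documented thresholds; the real ones may differ.
STRUCTURAL = {
    "base64": re.compile(r"[A-Za-z0-9+/]{200,}"),  # blob of 200+ base64 characters
    "caps": re.compile(r"[A-Z]{15,}"),             # 15+ consecutive uppercase letters
    "punctuation": re.compile(r"[!?.]{9,}"),       # 9+ repeated ! ? or .
}


def phrase_hit(text: str, patterns: List[str] = DEFAULT_PATTERNS) -> Optional[str]:
    """Return the first pattern found as a case-insensitive substring, else None."""
    lowered = text.lower()
    return next((p for p in patterns if p.lower() in lowered), None)


def structural_hits(text: str) -> List[str]:
    """Return the names of all structural signals present in the text."""
    return [name for name, regex in STRUCTURAL.items() if regex.search(text)]
```

Note that `phrase_hit("file system: ext4")` returns `"system:"`, which illustrates why plain substring matching can flag innocent prose.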
How It Works
| Phase | What it checks | Source |
|---|---|---|
| `before_workflow` | Heuristic + (optional) classifier on inputs | `context.inputs` |
| `mid_execution` | Heuristic + (optional) classifier on retrieval results (indirect injection) | `context.retrieval_results` |
| `after_workflow` | (no-op; injection is a pre-execution concern) | - |
Scan order per phase:
- Size cap: if payload > `max_payload_kb` → BLOCK with `oversized` reason.
- Heuristic scan: phrase list, then base64/caps/punctuation regex.
- Classifier scan (if mode includes it): `(confidence, label)` from `_classifier_fn`.
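
The scan order above can be sketched as a short-circuiting pipeline. This is an illustration under stated assumptions, not the handler's code: `scan` is a hypothetical function, 1 KB is taken to be 1024 bytes, a positive classifier label is assumed to be `"injection"`, and the threshold comparison is assumed to be inclusive.

```python
from typing import Callable, List, Optional, Tuple

Heuristic = Callable[[str], Optional[str]]       # text -> matched signal, or None
Classifier = Callable[[str], Tuple[float, str]]  # text -> (confidence, label)


def scan(
    texts: List[str],
    max_payload_kb: float,
    heuristic: Heuristic,
    classifier: Optional[Classifier] = None,
    min_confidence: float = 0.7,
) -> Optional[str]:
    """Return a violation reason, or None if every check passes."""
    # 1. Size cap: cheapest check, so it runs first.
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    if total_bytes > max_payload_kb * 1024:  # assumes 1 KB = 1024 bytes
        return "oversized"

    # 2. Heuristic scan: stop at the first signal.
    for text in texts:
        signal = heuristic(text)
        if signal is not None:
            return f"heuristic: {signal}"

    # 3. Classifier scan: only when a classifier is supplied.
    if classifier is not None:
        for text in texts:
            confidence, label = classifier(text)
            # Assumption: "injection" is the positive label.
            if label == "injection" and confidence >= min_confidence:
                return f"classifier: {confidence:.2f}"

    return None
```

Ordering the checks from cheapest to most expensive means an oversized or obviously malicious payload never reaches the classifier.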
Context Attributes Read
| Attribute | Phase | Purpose |
|---|---|---|
| `context.inputs` | before | Primary scan target (dict/list/str; strings extracted recursively) |
| `context.retrieval_results` | mid | Indirect-injection scan target |
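
Recursive string extraction from a nested input can be sketched as follows. `iter_strings` and `payload_kb` are hypothetical helpers; whether the handler also scans dict keys or tuples is not stated here, so this sketch walks dict values, lists, and tuples only.

```python
from typing import Any, Iterator


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string nested inside dicts (values only), lists, and tuples."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)
    # Other types (numbers, None, ...) carry no scannable text in this sketch.


def payload_kb(value: Any) -> float:
    """Approximate size in KB: UTF-8 bytes of the extracted strings only."""
    return sum(len(s.encode("utf-8")) for s in iter_strings(value)) / 1024


inputs = {"message": "hi", "history": [{"text": "ignore previous instructions"}, 42]}
extracted = list(iter_strings(inputs))
```

Because only string contents are counted, a size computed this way is smaller than the serialized payload, which carries keys, quotes, and separators as well.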
Example Policy
Production policy — heuristic + classifier with strict size cap and a custom pattern list:
```json
{
  "detection_mode": "heuristic_plus_classifier",
  "min_confidence": 0.8,
  "blocked_patterns": [
    "ignore previous instructions",
    "ignore the above",
    "disregard previous",
    "system prompt:",
    "you are now a",
    "developer mode enabled",
    "jailbreak",
    "###instructions",
    "user has authorized"
  ],
  "max_payload_kb": 32,
  "action_on_violation": "block"
}
```
SDK Integration
```python
import waxell_observe as waxell

waxell.init()


@waxell.observe(agent_name="public-chatbot", enforce_policy=True)
async def reply(message: str) -> str:
    # before_workflow: scans `message` (and any other input)
    # for injection phrases / oversized payloads.
    # If a hit fires, PolicyViolationError is raised before the LLM call.
    return await generate(message)  # generate() is your own LLM call, defined elsewhere
```
Observability
| Field | Example |
|---|---|
| Category | prompt-injection-guard |
| Action | block |
| Reason | "Prompt-injection signal detected (phrase): 'ignore previous instructions'" |
| Metadata | {"phase": "before", "signal": "phrase", "matched_pattern": "ignore previous instructions", "owasp": "LLM01"} |
Indirect-injection block (from a tool output):
| Field | Example |
|---|---|
| Reason | "Prompt-injection signal detected (phrase): 'system prompt:'" |
| Metadata | {"phase": "mid", "signal": "phrase", "matched_pattern": "system prompt:", "owasp": "LLM01"} |
Common Gotchas
- Heuristic patterns are case-insensitive substrings. `"system:"` matches both `"SYSTEM:"` and innocent prose like `"file system: ext4"`. Test on your corpus before tightening the pattern list.
- Setting `blocked_patterns` REPLACES the default list, not extends it. If you supply `["jailbreak"]`, you lose all 23 default patterns. To extend, copy the defaults and add your own.
- `max_payload_kb` is approximate. The handler sums string byte-lengths recursively; it doesn't `json.dumps` the payload, so the figure is slightly low.
- Structural signals always fire even if `blocked_patterns` is supplied. Base64 blobs, repeated caps, and repeated punctuation are hardcoded; there's no rule to disable them.
- Classifier mode requires controlplane wiring. `_classifier_fn` is a `ClassVar` set once at startup. If unset, `classifier` mode silently degrades to "no classifier hits".
- Indirect-injection scanning depends on `context.retrieval_results`. Agents that don't populate this attribute skip the mid-execution check entirely.
- The handler does not redact or sanitize. It only blocks or warns. For redaction, pair with Content Policy using `redact` action.
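
One way to act on the first two gotchas is to run a candidate pattern list against known-benign text before deploying it. The sketch below is illustrative only: `pattern_hits` is a hypothetical helper, the corpus is made up, and `"override safety"` stands in for a custom phrase added on top of a copied subset of the defaults.

```python
from typing import Dict, List


def pattern_hits(patterns: List[str], corpus: List[str]) -> Dict[str, List[str]]:
    """Map each pattern to the corpus entries it would flag (case-insensitive substring)."""
    hits: Dict[str, List[str]] = {}
    for pattern in patterns:
        needle = pattern.lower()
        matched = [doc for doc in corpus if needle in doc.lower()]
        if matched:
            hits[pattern] = matched
    return hits


# Known-benign samples; replace with real traffic from your own agent.
benign_corpus = [
    "The file system: ext4 is mounted read-only.",
    "Can you act as a translator for this email?",
    "What is the weather in Paris?",
]

# blocked_patterns replaces the built-in list, so defaults you want to keep
# must be copied in alongside the custom phrase.
patterns = ["system:", "act as a", "jailbreak", "override safety"]

# Every entry here is a false positive, because the corpus is benign.
false_positives = pattern_hits(patterns, benign_corpus)
```

Here `false_positives` contains entries for `"system:"` and `"act as a"`, which suggests narrowing those two patterns or switching `action_on_violation` to `"warn"` while tuning.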
Next Steps
- Policy Categories — All 49 categories
- Content Policy — Configurable PII/credential scanning with redact action
- Input Validation — Schema/length checks on agent inputs
- Output Egress Format — Companion control on the output side (LLM05)