Prompt Injection Guard Policy

The prompt-injection-guard policy enforces OWASP LLM01 Prompt Injection and aligns with NIST AI RMF MAP-2.3. It scans agent inputs before they reach the LLM, and scans retrieval results and tool outputs mid-run to catch indirect injection. Two detection methods are available, and they can be combined:

Heuristic (default, cheap) — substring matches against a curated list of injection phrases, plus structural signals (base64 blobs, excessive caps, excessive punctuation).
Classifier (opt-in, mid-cost) — controlplane wires an ML classifier callable that returns (confidence, label). When configured, classifier and heuristic decisions are ORed.

Pure CPU, sub-millisecond per input in heuristic mode.
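
The OR-combination of the two detectors can be sketched as follows. This is a minimal illustration, not the handler's code: the function name `is_violation`, the label value `"injection"`, and the `>=` threshold comparison are all assumptions. The source only states that the classifier callable returns (confidence, label).

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumed shape of the classifier callable: text -> (confidence, label).
ClassifierFn = Callable[[str], Tuple[float, str]]

# Hypothetical label value; the real classifier's labels are not documented here.
INJECTION_LABEL = "injection"


def is_violation(
    text: str,
    patterns: Sequence[str],
    classifier_fn: Optional[ClassifierFn] = None,
    min_confidence: float = 0.7,
) -> bool:
    """Return True if either the heuristic or the classifier flags the text."""
    lowered = text.lower()
    # Heuristic: case-insensitive substring match against the pattern list.
    heuristic_hit = any(p.lower() in lowered for p in patterns)

    classifier_hit = False
    if classifier_fn is not None:
        # With no classifier wired in, this branch is skipped and only the
        # heuristic decides.
        confidence, label = classifier_fn(text)
        classifier_hit = label == INJECTION_LABEL and confidence >= min_confidence

    return heuristic_hit or classifier_hit
```

Because the decisions are ORed, adding a classifier can only increase the number of hits; it never suppresses a heuristic match.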

Rules

Rule	Type	Default	Description
`detection_mode`	string	`"heuristic"`	`"heuristic"`, `"classifier"`, or `"heuristic_plus_classifier"`
`min_confidence`	number	`0.7`	Classifier threshold (0–1)
`blocked_patterns`	string[]	(curated default)	Replaces the built-in pattern list when set (case-insensitive substring match)
`max_payload_kb`	number	`64`	Reject oversized inputs (possible obfuscation)
`action_on_violation`	string	`"block"`	`"block"` or `"warn"`

Optional: scan_indirect (boolean, default true) — toggles mid-execution scanning of retrieval_results.

Default Heuristic Patterns

The handler ships with a curated list including: ignore previous instructions, disregard the above, forget previous instructions, new instructions:, system:, system prompt:, you are now, act as a, pretend to be, developer mode, dan mode, jailbreak, what were your instructions, repeat your system prompt, ### instructions, user has authorized, the assistant should.

Structural signals (always on): base64 blob ≥ 200 chars, 15+ consecutive uppercase letters, 9+ repeated `!`, `?`, or `.` characters.
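
For intuition, the structural signals could be approximated with three regular expressions. The patterns below are assumptions built from the thresholds stated above; the handler's actual expressions are not shown in this document, and `structural_signals` is a hypothetical helper name.

```python
import re
from typing import List

# Assumed regexes derived from the documented thresholds.
BASE64_BLOB = re.compile(r"[A-Za-z0-9+/]{200,}={0,2}")  # base64 alphabet, 200+ chars
CAPS_RUN = re.compile(r"[A-Z]{15,}")                    # 15+ consecutive uppercase letters
PUNCT_RUN = re.compile(r"[!?.]{9,}")                    # 9+ consecutive !, ?, or . (mixed runs count)


def structural_signals(text: str) -> List[str]:
    """Return the names of all structural signals found in the text."""
    signals = []
    if BASE64_BLOB.search(text):
        signals.append("base64")
    if CAPS_RUN.search(text):
        signals.append("caps")
    if PUNCT_RUN.search(text):
        signals.append("punctuation")
    return signals
```

Note that with these assumed patterns a long all-caps run also satisfies the base64 alphabet, so a single input can raise more than one signal.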

How It Works

Phase	What it checks	Source
`before_workflow`	Heuristic + (optional) classifier on inputs	`context.inputs`
`mid_execution`	Heuristic + (optional) classifier on retrieval results (indirect injection)	`context.retrieval_results`
`after_workflow`	(no-op — injection is a pre-execution concern)	—

Scan order per phase:

1. Size cap: if payload > max_payload_kb → BLOCK with oversized reason.
2. Heuristic scan: phrase list, then base64/caps/punctuation regex.
3. Classifier scan (if mode includes it): (confidence, label) from _classifier_fn.
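
The ordering above can be sketched as a small pipeline that checks the size cap first and then runs each detector in turn. This is an illustration under stated assumptions: the names `scan` and `phrase_check` are hypothetical, 1 KB is taken as 1024 bytes, and stopping at the first hit is assumed rather than confirmed by the source.

```python
from typing import Callable, Optional, Sequence, Tuple

# A check takes the text and returns a detail string on a hit, else None.
Check = Tuple[str, Callable[[str], Optional[str]]]


def scan(text: str, max_payload_kb: int, checks: Sequence[Check]) -> Optional[str]:
    """Return a violation reason, or None if the text passes every step."""
    # Step 1: size cap, evaluated before any pattern matching.
    size_kb = len(text.encode("utf-8")) / 1024
    if size_kb > max_payload_kb:
        return f"oversized: {size_kb:.1f} KB > {max_payload_kb} KB"

    # Steps 2 and 3: run the remaining checks in the order given
    # (heuristic first, classifier last), stopping at the first hit.
    for name, check in checks:
        detail = check(text)
        if detail is not None:
            return f"{name}: {detail}"
    return None


def phrase_check(text: str) -> Optional[str]:
    """Toy heuristic check with a single hardcoded phrase."""
    phrase = "ignore previous instructions"
    return phrase if phrase in text.lower() else None
```

Running the size cap first means an oversized payload is rejected without paying for the phrase scan or the classifier call.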

Context Attributes Read

Attribute	Phase	Purpose
`context.inputs`	before	Primary scan target (dict/list/str — strings extracted recursively)
`context.retrieval_results`	mid	Indirect-injection scan target
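
Recursive string extraction from a nested input might look like the sketch below. The helper names are hypothetical, and two behaviours are assumptions: only dict values (not keys) are visited, and non-string leaves such as numbers are skipped.

```python
from typing import Any, Iterator


def iter_strings(value: Any) -> Iterator[str]:
    """Yield every string found in a nested dict/list/str structure."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        # Assumption: keys are not scanned, only values.
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from iter_strings(item)
    # Any other type (int, None, ...) contributes no strings.


def approx_payload_bytes(value: Any) -> int:
    """Sum the UTF-8 byte length of all extracted strings."""
    return sum(len(s.encode("utf-8")) for s in iter_strings(value))
```

Each extracted string can then be passed to the phrase and structural checks, and the byte sum compared against `max_payload_kb`.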

Example Policy

Production policy — heuristic + classifier with strict size cap and a custom pattern list:

{
  "detection_mode": "heuristic_plus_classifier",
  "min_confidence": 0.8,
  "blocked_patterns": [
    "ignore previous instructions",
    "ignore the above",
    "disregard previous",
    "system prompt:",
    "you are now a",
    "developer mode enabled",
    "jailbreak",
    "###instructions",
    "user has authorized"
  ],
  "max_payload_kb": 32,
  "action_on_violation": "block"
}

SDK Integration

import waxell_observe as waxell

waxell.init()

@waxell.observe(agent_name="public-chatbot", enforce_policy=True)
async def reply(message: str) -> str:
    # before_workflow: scans `message` (and any other input)
    # for injection phrases / oversized payloads.
    # If a hit fires, PolicyViolationError is raised before the LLM call.
    # generate() is not defined in this snippet; it stands in for your own LLM call.
    return await generate(message)

Observability

Field	Example
Category	`prompt-injection-guard`
Action	`block`
Reason	"Prompt-injection signal detected (phrase): 'ignore previous instructions'"
Metadata	`{"phase": "before", "signal": "phrase", "matched_pattern": "ignore previous instructions", "owasp": "LLM01"}`

Indirect-injection block (from a tool output):

Field	Example
Reason	"Prompt-injection signal detected (phrase): 'system prompt:'"
Metadata	`{"phase": "mid", "signal": "phrase", "matched_pattern": "system prompt:", "owasp": "LLM01"}`

Common Gotchas

Heuristic patterns are case-insensitive substrings. "system:" matches both "SYSTEM:" and innocent prose like "file system: ext4". Test on your corpus before tightening the pattern list.
Setting blocked_patterns REPLACES the default list rather than extending it. If you supply ["jailbreak"], you lose all 23 default patterns. To extend, copy the defaults and add your own.
max_payload_kb is approximate. The handler sums string byte-lengths recursively rather than calling json.dumps on the payload, so the measured size comes out slightly lower than the serialized size.
Structural signals always fire even if blocked_patterns is supplied. Base64 blobs, repeated caps, and repeated punctuation are hardcoded — there's no rule to disable them.
Classifier mode requires controlplane wiring. _classifier_fn is a ClassVar set once at startup. If unset, classifier mode silently degrades to "no classifier hits".
Indirect-injection scanning depends on context.retrieval_results. Agents that don't populate this attribute skip the mid-execution check entirely.
The handler does not redact or sanitize. It only blocks or warns. For redaction, pair with Content Policy using redact action.
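
Two of the gotchas above (replace-not-extend, and substring false positives) can be handled before deployment with a small script. This is a sketch under assumptions: `LISTED_DEFAULTS` contains only the patterns named on this page, which may be a subset of the shipped list, and both helper functions are hypothetical.

```python
from typing import Dict, Iterable, List

# Patterns as listed on this page; the shipped default list may be longer.
LISTED_DEFAULTS = [
    "ignore previous instructions", "disregard the above",
    "forget previous instructions", "new instructions:", "system:",
    "system prompt:", "you are now", "act as a", "pretend to be",
    "developer mode", "dan mode", "jailbreak", "what were your instructions",
    "repeat your system prompt", "### instructions", "user has authorized",
    "the assistant should",
]


def extended_patterns(extra: Iterable[str]) -> List[str]:
    """Copy the listed defaults and append custom patterns, skipping duplicates."""
    seen = set()
    merged = []
    for pattern in [*LISTED_DEFAULTS, *extra]:
        key = pattern.lower()
        if key not in seen:
            seen.add(key)
            merged.append(pattern)
    return merged


def false_positives(patterns: Iterable[str], benign_corpus: Iterable[str]) -> Dict[str, List[str]]:
    """Map each benign text to the patterns it would trigger."""
    patterns = list(patterns)
    hits = {}
    for text in benign_corpus:
        lowered = text.lower()
        matched = [p for p in patterns if p.lower() in lowered]
        if matched:
            hits[text] = matched
    return hits
```

Passing the result of `extended_patterns([...])` as `blocked_patterns` keeps the listed defaults in place, and running `false_positives` over known-good traffic shows which patterns would block legitimate inputs.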

Next Steps

Policy Categories — All 49 categories
Content Policy — Configurable PII/credential scanning with redact action
Input Validation — Schema/length checks on agent inputs
Output Egress Format — Companion control on the output side (LLM05)
