Skip to main content

Prompt Injection Guard Policy

The prompt-injection-guard policy enforces OWASP LLM01 Prompt Injection and aligns with NIST AI RMF MAP-2.3. It scans agent inputs before the LLM ever sees them, plus retrieval results / tool outputs mid-run (indirect injection). Two detection modes:

  • Heuristic (default, cheap) — substring matches against a curated list of injection phrases, plus structural signals (base64 blobs, excessive caps, excessive punctuation).
  • Classifier (opt-in, mid-cost) — controlplane wires an ML classifier callable that returns (confidence, label). When configured, classifier and heuristic decisions are ORed.

Pure CPU, sub-millisecond per input in heuristic mode.

Rules

RuleTypeDefaultDescription
detection_modestring"heuristic""heuristic", "classifier", or "heuristic_plus_classifier"
min_confidencenumber0.7Classifier threshold (0–1)
blocked_patternsstring[](curated default)Replace or extend the built-in pattern list (case-insensitive substring match)
max_payload_kbnumber64Reject oversized inputs (possible obfuscation)
action_on_violationstring"block""block" or "warn"

Optional: scan_indirect (boolean, default true) — toggles mid-execution scanning of retrieval_results.

Default Heuristic Patterns

The handler ships with a curated list including: ignore previous instructions, disregard the above, forget previous instructions, new instructions:, system:, system prompt:, you are now, act as a, pretend to be, developer mode, dan mode, jailbreak, what were your instructions, repeat your system prompt, ### instructions, user has authorized, the assistant should.

Structural signals (always on): base64 blob ≥ 200 chars, 15+ consecutive uppercase letters, 9+ repeated !/?/..

How It Works

PhaseWhat it checksSource
before_workflowHeuristic + (optional) classifier on inputscontext.inputs
mid_executionHeuristic + (optional) classifier on retrieval results (indirect injection)context.retrieval_results
after_workflow(no-op — injection is a pre-execution concern)

Scan order per phase:

  1. Size cap — if payload > max_payload_kb → BLOCK with oversized reason.
  2. Heuristic scan — phrase list, then base64/caps/punctuation regex.
  3. Classifier scan (if mode includes it) — (confidence, label) from _classifier_fn.

Context Attributes Read

AttributePhasePurpose
context.inputsbeforePrimary scan target (dict/list/str — strings extracted recursively)
context.retrieval_resultsmidIndirect-injection scan target

Example Policy

Production policy — heuristic + classifier with strict size cap and a custom pattern list:

{
"detection_mode": "heuristic_plus_classifier",
"min_confidence": 0.8,
"blocked_patterns": [
"ignore previous instructions",
"ignore the above",
"disregard previous",
"system prompt:",
"you are now a",
"developer mode enabled",
"jailbreak",
"###instructions",
"user has authorized"
],
"max_payload_kb": 32,
"action_on_violation": "block"
}

SDK Integration

import waxell_observe as waxell

waxell.init()

@waxell.observe(agent_name="public-chatbot", enforce_policy=True)
async def reply(message: str) -> str:
# before_workflow: scans `message` (and any other input)
# for injection phrases / oversized payloads.
# If a hit fires, PolicyViolationError is raised before the LLM call.
return await generate(message)

Observability

FieldExample
Categoryprompt-injection-guard
Actionblock
Reason"Prompt-injection signal detected (phrase): 'ignore previous instructions'"
Metadata{"phase": "before", "signal": "phrase", "matched_pattern": "ignore previous instructions", "owasp": "LLM01"}

Indirect-injection block (from a tool output):

FieldExample
Reason"Prompt-injection signal detected (phrase): 'system prompt:'"
Metadata{"phase": "mid", "signal": "phrase", "matched_pattern": "system prompt:", "owasp": "LLM01"}

Common Gotchas

  • Heuristic patterns are case-insensitive substrings. "system:" matches both "SYSTEM:" and innocent prose like "file system: ext4". Test on your corpus before tightening the pattern list.
  • Setting blocked_patterns REPLACES the default list, not extends it. If you supply ["jailbreak"], you lose all 23 default patterns. To extend, copy the defaults and add your own.
  • max_payload_kb is approximate. The handler sums string byte-lengths recursively; it doesn't json.dumps the payload, so the figure is slightly low.
  • Structural signals always fire even if blocked_patterns is supplied. Base64 blobs, repeated caps, and repeated punctuation are hardcoded — there's no rule to disable them.
  • Classifier mode requires controlplane wiring. _classifier_fn is a ClassVar set once at startup. If unset, classifier mode silently degrades to "no classifier hits".
  • Indirect-injection scanning depends on context.retrieval_results. Agents that don't populate this attribute skip the mid-execution check entirely.
  • The handler does not redact or sanitize. It only blocks or warns. For redaction, pair with Content Policy using redact action.

Next Steps