
Prompt Guard

Prompt Guard intercepts LLM calls before they are sent and scans prompts for PII, credentials, and prompt injection patterns. The default tier is pure regex matching that runs client-side with zero network overhead; an optional server-side ML tier provides deeper detection.

Quick Start

Enable prompt guard in init():

import waxell_observe

waxell_observe.init(
    api_key="wax_sk_...",
    api_url="https://acme.waxell.dev",
    prompt_guard=True,            # Enable client-side guard
    prompt_guard_action="block",  # "block", "warn", or "redact"
)

Now every auto-instrumented LLM call is scanned automatically:

from openai import OpenAI

client = OpenAI()

# This will raise PromptGuardError if the prompt contains PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
)

Actions

| Action | Behavior |
| --- | --- |
| "block" | Raises PromptGuardError. The LLM call is never made. |
| "warn" | Logs violations as warnings. The LLM call proceeds with the original prompt. |
| "redact" | Replaces sensitive data with ##TYPE## placeholders. The LLM call proceeds with the sanitized prompt. |

What Gets Detected

PII

| Type | Pattern | Example |
| --- | --- | --- |
| SSN | XXX-XX-XXXX | 123-45-6789 |
| Email | Standard email format | user@example.com |
| Phone | US phone numbers | (555) 123-4567 |
| Credit Card | 16-digit card numbers | 4111-1111-1111-1111 |
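To make the regex tier concrete, here is a minimal sketch of how such client-side PII checks might look. These are illustrative approximations, not the patterns the library actually ships:

```python
import re

# Illustrative approximations of client-side PII patterns --
# NOT the library's actual regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the PII types found in text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Because matching is pure regex, scanning happens in-process with no network round trip.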

Credentials

| Type | Pattern | Example |
| --- | --- | --- |
| Password | password=, pwd: | password=hunter2 |
| API Key | api_key=, apikey: | api_key=sk-abc123 |
| Secret | secret_key=, client_secret: | secret_key=mySecret |
| AWS Key | AKIA prefix | AKIAIOSFODNN7EXAMPLE |
| Generic Token | sk-, pk_live_, etc. | sk-abc123def456... |
| GitHub PAT | ghp_ prefix | ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
| Waxell Key | wax_sk_ prefix | wax_sk_abc123 |
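Credential detection is largely prefix- and assignment-based. A hypothetical sketch of that style of matching (again illustrative, not the shipped pattern set):

```python
import re

# Illustrative prefix/assignment credential patterns --
# NOT the library's actual regexes.
CREDENTIAL_PATTERNS = {
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GITHUB_PAT": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "WAXELL_KEY": re.compile(r"\bwax_sk_[A-Za-z0-9]+\b"),
    "PASSWORD_ASSIGNMENT": re.compile(r"(?:password|pwd)\s*[:=]\s*\S+", re.IGNORECASE),
}

def find_credentials(text: str) -> list[str]:
    """Return the credential types found in text."""
    return [name for name, p in CREDENTIAL_PATTERNS.items() if p.search(text)]
```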

Prompt Injection

Detects common prompt injection patterns including:

  • "Ignore previous instructions"
  • "You are now a..."
  • "Forget your instructions"
  • "New instructions:"
  • System prompt markers ([system]:, <|system|>)
  • Jailbreak patterns (DAN mode, developer mode)
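A case-insensitive matcher over patterns like those above might be sketched as follows (hypothetical heuristics; the real detector's pattern list is broader):

```python
import re

# Illustrative injection heuristics -- NOT the library's actual detector.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now a", re.IGNORECASE),
    re.compile(r"forget your instructions", re.IGNORECASE),
    re.compile(r"new instructions:", re.IGNORECASE),
    re.compile(r"\[system\]:|<\|system\|>"),           # system prompt markers
    re.compile(r"\bDAN mode\b|\bdeveloper mode\b", re.IGNORECASE),
]

def looks_like_injection(text: str) -> bool:
    """True if any known injection pattern appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```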

Configuration

Via init()

waxell_observe.init(
    prompt_guard=True,             # Enable local regex guard
    prompt_guard_server=True,      # Also check server-side ML
    prompt_guard_action="redact",  # "block", "warn", or "redact"
)

Via Environment Variables

export WAXELL_PROMPT_GUARD="true"
export WAXELL_PROMPT_GUARD_SERVER="true"
export WAXELL_PROMPT_GUARD_ACTION="block"

Handling Blocks

When prompt_guard_action="block", a PromptGuardError is raised:

from waxell_observe import PromptGuardError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
    )
except PromptGuardError as e:
    print(f"Blocked: {e}")
    print(f"Violations: {e.result.violations}")
    print(f"Action: {e.result.action}")

The PromptGuardError.result is a PromptGuardResult:

| Field | Type | Description |
| --- | --- | --- |
| passed | bool | True if no violations (or action is "warn") |
| action | str | The action taken: "allow", "block", "warn", or "redact" |
| violations | list[str] | List of violation descriptions |
| source | str | "local", "server", or "both" |
| redacted_messages | list \| None | Redacted messages (only when action is "redact") |

Redaction

When prompt_guard_action="redact", sensitive data is replaced with ##TYPE## placeholders before the LLM call:

Input:  "My email is user@example.com and SSN is 123-45-6789"
Output: "My email is ##EMAIL## and SSN is ##SSN##"

The LLM receives the redacted version. The original is never sent.
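Conceptually, redaction is a substitution pass over each message. A minimal sketch using hypothetical patterns (placeholder names follow the documented ##TYPE## convention, but the real pattern set is larger):

```python
import re

# Illustrative redaction pass -- NOT the library's actual implementation.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with its ##TYPE## placeholder."""
    for name, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"##{name}##", text)
    return text
```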

Manual Checking

You can also check prompts manually using check_prompt():

from waxell_observe.instrumentors._guard import check_prompt

result = check_prompt(
    messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
    model="gpt-4o",
)

if result and not result.passed:
    print(f"Violations found: {result.violations}")

Server-Side ML Detection

The optional server-side tier uses Presidio and HuggingFace models for deeper detection beyond regex patterns. Enable with prompt_guard_server=True.

Server-side detection adds:

  • Named entity recognition for PII
  • Contextual credential detection
  • ML-powered injection classification

Server results are merged with local regex results — violations from both sources are deduplicated and combined.
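The merge step can be pictured as an order-preserving union of the two violation lists. A sketch of that logic (illustrative; the library's merge may differ):

```python
def merge_violations(local: list[str], server: list[str]) -> list[str]:
    """Combine local and server violations, dropping duplicates
    while preserving first-seen order (illustrative logic only)."""
    seen: set[str] = set()
    merged: list[str] = []
    for violation in local + server:
        if violation not in seen:
            seen.add(violation)
            merged.append(violation)
    return merged
```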

Next Steps