
Prompt Guard

Prompt Guard intercepts LLM calls before they are sent and scans prompts for PII, credentials, and prompt injection patterns. The default tier is pure regex matching that runs client-side with zero network overhead; an optional server-side ML tier provides deeper detection.

Quick Start

Enable prompt guard in init():

import waxell_observe

waxell_observe.init(
    api_key="wax_sk_...",
    api_url="https://acme.waxell.dev",
    prompt_guard=True,            # Enable client-side guard
    prompt_guard_action="block",  # "block", "warn", or "redact"
)

Now every auto-instrumented LLM call is scanned automatically:

from openai import OpenAI

client = OpenAI()

# This will raise PromptGuardError if the prompt contains PII
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
)

Actions

| Action | Behavior |
| --- | --- |
| "block" | Raises PromptGuardError. The LLM call is never made. |
| "warn" | Logs violations as warnings. The LLM call proceeds with the original prompt. |
| "redact" | Replaces sensitive data with ##TYPE## placeholders. The LLM call proceeds with the sanitized prompt. |

What Gets Detected

PII

| Type | Pattern | Example |
| --- | --- | --- |
| SSN | XXX-XX-XXXX | 123-45-6789 |
| Email | Standard email format | user@example.com |
| Phone | US phone numbers | (555) 123-4567 |
| Credit Card | 16-digit card numbers | 4111-1111-1111-1111 |
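To make the regex tier concrete, here is a minimal sketch of how such client-side PII checks might look. These are illustrative approximations, not the patterns the library actually ships:

```python
import re

# Illustrative approximations of client-side PII patterns --
# NOT the library's actual regexes.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "PHONE": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b"),
}

def find_pii(text: str) -> list[str]:
    """Return the PII types found in text."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(text)]
```

Because matching is pure regex, scanning happens in-process with no network round trip.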

Credentials

| Type | Pattern | Example |
| --- | --- | --- |
| Password | password=, pwd: | password=hunter2 |
| API Key | api_key=, apikey: | api_key=sk-abc123 |
| Secret | secret_key=, client_secret: | secret_key=mySecret |
| AWS Key | AKIA prefix | AKIAIOSFODNN7EXAMPLE |
| Generic Token | sk-, pk_live_, etc. | sk-abc123def456... |
| GitHub PAT | ghp_ prefix | ghp_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx |
| Waxell Key | wax_sk_ prefix | wax_sk_abc123 |
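Credential detection is largely prefix- and assignment-based. A hypothetical sketch of that style of matching (again illustrative, not the shipped pattern set):

```python
import re

# Illustrative prefix/assignment credential patterns --
# NOT the library's actual regexes.
CREDENTIAL_PATTERNS = {
    "AWS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GITHUB_PAT": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "WAXELL_KEY": re.compile(r"\bwax_sk_[A-Za-z0-9]+\b"),
    "PASSWORD_ASSIGNMENT": re.compile(r"(?:password|pwd)\s*[:=]\s*\S+", re.IGNORECASE),
}

def find_credentials(text: str) -> list[str]:
    """Return the credential types found in text."""
    return [name for name, p in CREDENTIAL_PATTERNS.items() if p.search(text)]
```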

Prompt Injection

Detects common prompt injection patterns including:

  • "Ignore previous instructions"
  • "You are now a..."
  • "Forget your instructions"
  • "New instructions:"
  • System prompt markers ([system]:, <|system|>)
  • Jailbreak patterns (DAN mode, developer mode)
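A case-insensitive matcher over patterns like those above might be sketched as follows (hypothetical heuristics; the real detector's pattern list is broader):

```python
import re

# Illustrative injection heuristics -- NOT the library's actual detector.
INJECTION_PATTERNS = [
    re.compile(r"ignore (?:all )?previous instructions", re.IGNORECASE),
    re.compile(r"you are now a", re.IGNORECASE),
    re.compile(r"forget your instructions", re.IGNORECASE),
    re.compile(r"new instructions:", re.IGNORECASE),
    re.compile(r"\[system\]:|<\|system\|>"),           # system prompt markers
    re.compile(r"\bDAN mode\b|\bdeveloper mode\b", re.IGNORECASE),
]

def looks_like_injection(text: str) -> bool:
    """True if any known injection pattern appears in the text."""
    return any(p.search(text) for p in INJECTION_PATTERNS)
```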

Configuration

Via init()

waxell_observe.init(
    prompt_guard=True,             # Enable local regex guard
    prompt_guard_server=True,      # Also check server-side ML
    prompt_guard_action="redact",  # "block", "warn", or "redact"
)

Via Environment Variables

export WAXELL_PROMPT_GUARD="true"
export WAXELL_PROMPT_GUARD_SERVER="true"
export WAXELL_PROMPT_GUARD_ACTION="block"

Handling Blocks

When prompt_guard_action="block", a PromptGuardError is raised:

from waxell_observe import PromptGuardError

try:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
    )
except PromptGuardError as e:
    print(f"Blocked: {e}")
    print(f"Violations: {e.result.violations}")
    print(f"Action: {e.result.action}")

The PromptGuardError.result is a PromptGuardResult:

| Field | Type | Description |
| --- | --- | --- |
| passed | bool | True if no violations (or action is "warn") |
| action | str | The action taken: "allow", "block", "warn", or "redact" |
| violations | list[str] | List of violation descriptions |
| source | str | "local", "server", or "both" |
| redacted_messages | list \| None | Redacted messages (only when action is "redact") |

Redaction

When prompt_guard_action="redact", sensitive data is replaced with ##TYPE## placeholders before the LLM call:

Input:  "My email is user@example.com and SSN is 123-45-6789"
Output: "My email is ##EMAIL## and SSN is ##SSN##"

The LLM receives the redacted version. The original is never sent.
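Conceptually, redaction is a substitution pass over each message. A minimal sketch using hypothetical patterns (placeholder names follow the documented ##TYPE## convention, but the real pattern set is larger):

```python
import re

# Illustrative redaction pass -- NOT the library's actual implementation.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each match with its ##TYPE## placeholder."""
    for name, pattern in REDACTION_PATTERNS.items():
        text = pattern.sub(f"##{name}##", text)
    return text
```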

Manual Checking

You can also check prompts manually using check_prompt():

from waxell_observe.instrumentors._guard import check_prompt

result = check_prompt(
    messages=[{"role": "user", "content": "My SSN is 123-45-6789"}],
    model="gpt-4o",
)

if result and not result.passed:
    print(f"Violations found: {result.violations}")

Server-Side ML Detection

The optional server-side tier uses Presidio and HuggingFace models for deeper detection beyond regex patterns. Enable with prompt_guard_server=True.

Server-side detection adds:

  • Named entity recognition for PII
  • Contextual credential detection
  • ML-powered injection classification

Server results are merged with local regex results — violations from both sources are deduplicated and combined.
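The merge step can be pictured as an order-preserving union of the two violation lists. A sketch of that logic (illustrative; the library's merge may differ):

```python
def merge_violations(local: list[str], server: list[str]) -> list[str]:
    """Combine local and server violations, dropping duplicates
    while preserving first-seen order (illustrative logic only)."""
    seen: set[str] = set()
    merged: list[str] = []
    for violation in local + server:
        if violation not in seen:
            seen.add(violation)
            merged.append(violation)
    return merged
```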

Next Steps