Eval-Driven Governance
Most platforms make you choose between two separate worlds: measuring quality (evals, datasets, experiments) and enforcing policy (guardrails, governance). Waxell unifies them. The same evaluator that scores your agent for improvement can also guard it in production — warn, redact, or block a run when its score crosses a line you set.
That's the whole idea: evals become guardrails.
Define a check once (e.g. "is the answer grounded in the retrieved context?"), use it to improve your agent offline, then flip it on to enforce that same standard on live traffic — with a preview of exactly how often it would have fired.
The loop
              ┌──────────── improve ────────────┐
              │                                 │
run agents ──► Observe ──► Score ──► Datasets / Experiments
              │              │
              │              └──► Guardrail (warn / redact / block)
              └──────────── guard ──────────────┘
- Observe — every agent run and LLM call is captured.
- Score — an evaluator (LLM-as-judge) rates runs on a dimension you care about: groundedness, helpfulness, toxicity, PII leakage, format compliance.
- Improve — build datasets from real runs and run experiments to compare prompts or models before you ship.
- Guard — promote any evaluator to a guardrail that enforces a threshold on production traffic.
The pieces
| Piece | What it is | Where |
|---|---|---|
| Datasets | Test cases (input → expected output). Build from a CSV/JSONL upload or capture straight from production runs. | Datasets & Experiments |
| Evaluators | LLM-as-judge checks. Start from a template, pick a small/cheap judge model, score automatically on ingest. | Evaluators |
| Experiments | Run a dataset through two configs and get a "which won + confidence" verdict. | Experiments |
| Guardrails | An evaluator + a threshold = a governance policy. Warn, redact, or block. | Policy Categories (Evaluator category) |
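To make the Datasets row concrete: the import step described under "Build it: end to end" maps question to input and answer to expected_output, and keeps every other column as metadata. The sketch below mimics that mapping for a plain CSV with a header row. The function name rows_to_items and the item shape are hypothetical illustrations, not Waxell's API.

```python
import csv
import io

# Column aliases taken from the auto-mapping this guide describes.
COLUMN_MAP = {"question": "input", "answer": "expected_output"}


def rows_to_items(csv_text: str) -> list[dict]:
    """Turn CSV rows into dataset items; unmapped columns become metadata."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"input": None, "expected_output": None, "metadata": {}}
        for column, value in row.items():
            field = COLUMN_MAP.get(column.strip().lower())
            if field:
                item[field] = value
            else:
                item["metadata"][column] = value
        items.append(item)
    return items


sample = "question,answer,source\nWhat is 2+2?,4,arithmetic\n"
items = rows_to_items(sample)
print(items)
```

Keeping unmapped columns under metadata means nothing from the upload is silently dropped, which is the behavior the import step promises.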
Use cases
These are the patterns teams reach for first. Each is a single evaluator plus a threshold.
Block hallucinations in production
Create a Groundedness evaluator (does the answer stay within the retrieved context?), then enforce it:
Block a run when groundedness is below 0.7.
Ungrounded answers are caught at the source instead of reaching a customer. Pairs with Retrieval governance for RAG agents.
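A rule like this boils down to a score, a threshold, a direction, and an action. The sketch below is one way to model that; the ThresholdRule class and its field names are hypothetical, and treating a score exactly at the threshold as passing is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRule:
    """Hypothetical stand-in for an evaluator guardrail."""
    evaluator: str
    threshold: float
    direction: str  # "below" or "above"
    action: str     # "warn", "redact", or "block"

    def decide(self, score: float) -> str:
        """Return the action if the score crosses the threshold, else "allow"."""
        if self.direction == "below":
            triggered = score < self.threshold
        elif self.direction == "above":
            triggered = score > self.threshold
        else:
            raise ValueError(f"unknown direction: {self.direction!r}")
        return self.action if triggered else "allow"


groundedness_rule = ThresholdRule("groundedness", 0.7, "below", "block")
print(groundedness_rule.decide(0.55))
print(groundedness_rule.decide(0.90))
```

The same shape covers "above" rules, such as a leakage score that should stay low.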
Catch PII / secret leakage
A PII / secret leakage evaluator flags any output that exposes credentials, API keys, or personal data:
Block a run when leaked_sensitive is above 0.5.
Maps to OWASP LLM06 (Sensitive Information Disclosure in the 2023 OWASP LLM Top 10; renumbered LLM02 in the 2025 edition) and GDPR Art. 5. Because guardrails are control-tagged, that enforcement becomes audit evidence (see Continuous compliance).
Gate prompt or model regressions before you ship
Before promoting prompt v2, run it and v1 through the same dataset as an experiment. The compare view returns a verdict like:
Prompt v2 wins on relevance: +0.12 avg, 75% of 40 items (~95% confidence).
It's a sign test over per-item wins, so you don't over-read a noisy 0.02 average gap. Ship on evidence, not vibes.
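For readers unfamiliar with the sign test, here is a minimal sketch of the textbook two-sided version: count per-item wins and losses, drop ties, and ask how likely a split at least that lopsided would be if each variant were equally likely to win an item. The scores are made up, and this is not necessarily how Waxell derives its confidence figure.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided sign test p-value for a wins/losses split (ties excluded)."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of k or more wins out of n under a fair coin.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-item scores for two prompt versions on the same dataset.
v1 = [0.60, 0.70, 0.55, 0.80, 0.65, 0.50, 0.72, 0.68]
v2 = [0.75, 0.78, 0.50, 0.85, 0.70, 0.66, 0.80, 0.74]

wins = sum(b > a for a, b in zip(v1, v2))
losses = sum(b < a for a, b in zip(v1, v2))
p = sign_test_p_value(wins, losses)
print(f"v2 wins {wins}/{wins + losses} items, p = {p:.3f}")
```

Because the test only looks at which variant won each item, one outlier item with a huge score gap cannot swing the verdict the way it can swing an average.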
Enforce format / structured-output contracts
A Format compliance evaluator checks that the output is valid JSON and has the required fields. Enforce it to keep malformed output from breaking downstream systems. (For hard schema checks, also see Tool Argument Schema.)
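The contract being checked is simple enough to show deterministically. The sketch below is not the LLM-judge evaluator itself; it is a hypothetical check_format helper that scores 1.0 when the output is a JSON object with every required field and 0.0 otherwise, which is roughly the pass/fail signal a format guardrail would threshold on.

```python
import json


def check_format(output: str, required_fields: set[str]) -> tuple[float, str]:
    """Score 1.0 if output is a JSON object with all required fields, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return 0.0, f"invalid JSON: {exc.msg}"
    if not isinstance(parsed, dict):
        return 0.0, "top-level value is not a JSON object"
    missing = sorted(required_fields - set(parsed))
    if missing:
        return 0.0, f"missing fields: {', '.join(missing)}"
    return 1.0, "ok"


required = {"answer", "sources"}
print(check_format('{"answer": "42", "sources": []}', required))
print(check_format('{"answer": "42"}', required))
print(check_format("Sure! Here is the JSON you asked for", required))
```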
Continuous compliance evidence
Every guardrail can declare the controls it evidences — OWASP LLM Top 10, NIST AI RMF, ISO 42001, EU AI Act, GDPR, HIPAA. When a guardrail is enabled, each enforcement decision is a control-tagged audit record. A passing eval suite becomes living evidence that a control is in force, instead of a screenshot in a binder.
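As an illustration of what a control-tagged audit record might contain, the sketch below serializes one enforcement decision with the controls it evidences. The EnforcementRecord class, its field names, and the sample values are all hypothetical; the actual record schema is not specified in this guide.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class EnforcementRecord:
    """Hypothetical shape of one control-tagged enforcement decision."""
    run_id: str
    evaluator: str
    score: float
    threshold: float
    decision: str
    controls: list[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = EnforcementRecord(
    run_id="run_123",            # made-up identifier
    evaluator="pii_leakage",
    score=0.82,
    threshold=0.5,
    decision="block",
    controls=["OWASP LLM06", "GDPR Art. 5"],
)
serialized = json.dumps(asdict(record), indent=2)
print(serialized)
```

Storing the controls alongside the score and threshold is what lets an auditor filter decisions by control rather than by evaluator name.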
Build it: end to end
The whole loop is self-service — no JSON, no CLI required.
- Build a dataset. Go to Observability → Datasets → Create → Import Items and drop in a CSV/JSONL. Columns are auto-mapped (question → input, answer → expected_output); anything else is saved as metadata. Or click Save as dataset from a set of production runs.
- Create an evaluator. Observability → Evaluations → Create. Start from a template (Groundedness, Helpfulness, PII leakage, Toxicity, Format compliance, …); it pre-fills the judge prompt. Pick a judge model (see Choosing a judge) and turn on Run automatically on ingest so new runs get scored.
- Enforce it. From the evaluator, click Enforce as guardrail. You land in the policy wizard with the Evaluator category pre-selected. Set the threshold and direction ("trigger when the score is below 0.7") and choose warn / block / redact.
- Preview the blast radius. Before you enable it, the form shows a counterfactual:
  Based on 200 scores in the last 30 days, this rule would have flagged 3 (1.5%); those runs would have been blocked.
  So you never accidentally start blocking production. Enable when it looks right.
- Watch it fire. Enforcement decisions appear as governance events on the run's trace, tagged with the evaluator, the score, and any compliance controls.
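The blast-radius preview in the steps above is a counterfactual count: replay a rule over scores you already have and see how many would have tripped it. A minimal sketch, assuming the historical scores are available as a plain list (preview_rule and the sample data are hypothetical):

```python
def preview_rule(scores: list[float], threshold: float, direction: str = "below") -> dict:
    """Count how many past scores a threshold rule would have flagged."""
    if direction not in ("below", "above"):
        raise ValueError(f"unknown direction: {direction!r}")
    if direction == "below":
        flagged = sum(1 for s in scores if s < threshold)
    else:
        flagged = sum(1 for s in scores if s > threshold)
    total = len(scores)
    rate = flagged / total if total else 0.0
    return {"total": total, "flagged": flagged, "rate": rate}


# 200 made-up historical scores, 3 of which fall under the 0.7 threshold.
history = [0.9] * 197 + [0.4, 0.55, 0.65]
result = preview_rule(history, threshold=0.7)
print(
    f"Based on {result['total']} scores, this rule would have flagged "
    f"{result['flagged']} ({result['rate']:.1%})."
)
```

Running the same function with a few candidate thresholds is a cheap way to pick one before enabling the guardrail.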
Choosing a judge model
A judge runs on every scored run, so it should be small, fast, and cheap — the same insight behind dedicated eval models elsewhere. Waxell ships a curated picker:
| Judge | Provider | Good for | Key (in Settings → Secrets) |
|---|---|---|---|
| GPT-4o mini | OpenAI | Safe default — cheap, fast, reliable | OPENAI_API_KEY |
| Llama 3.1 8B | Groq | Cheapest + fastest, for high-volume scoring | GROQ_API_KEY |
| Claude Haiku | Anthropic | Nuanced checks at low cost | ANTHROPIC_API_KEY |
The picker shows whether the matching key is configured and links straight to Settings → Secrets to add it. A custom option lets you name any model your tenant can route to.
The judge resolves its key from a tenant secret by name (e.g. GROQ_API_KEY). Add the key once in Settings → Secrets and every evaluator using that provider can score immediately.
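The "is the key configured?" indicator in the picker amounts to a lookup from judge to secret name. The sketch below reuses the names from the table above and models tenant secrets as a plain dictionary, which is an assumption made for illustration; judge_key_status is a hypothetical helper.

```python
# Judge -> secret name, copied from the picker table.
JUDGE_SECRETS = {
    "GPT-4o mini": "OPENAI_API_KEY",
    "Llama 3.1 8B": "GROQ_API_KEY",
    "Claude Haiku": "ANTHROPIC_API_KEY",
}


def judge_key_status(secrets: dict[str, str]) -> dict[str, bool]:
    """Report, per judge, whether its named secret is present and non-empty."""
    return {judge: bool(secrets.get(name)) for judge, name in JUDGE_SECRETS.items()}


# Stand-in for a tenant that has only added a Groq key.
tenant_secrets = {"GROQ_API_KEY": "placeholder-value"}
status = judge_key_status(tenant_secrets)
print(status)
```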
How enforcement works
- Plane. Evaluator guardrails run on the observe plane, after the agent produces its output (the judge needs the output to score it). The decision (allow / warn / block / redact) is recorded as a governance event and signaled back; for fast, pre-output checks use the deterministic operational guardrails instead.
- Cost control. Judge calls cost money, so point guardrails at a small judge model and scope them with a target filter (per agent / per model).
- Fail-safe. If the judge can't produce a score to compare against the threshold (no key, provider error), the guardrail fails open: it allows the run rather than blocking it, so a judge outage never takes down production. You'll see "no score" rather than a block.
- Auditability. Each decision carries the evaluator, the score, the threshold, and the compliance controls it evidences.
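The fail-safe behavior in the list above can be sketched as a wrapper around the judge call: any failure to obtain a score results in "allow" with a "no score" reason. The enforce function, the judge callables, and the returned dictionary are hypothetical, and the rule is hard-coded to "trigger when below" for brevity.

```python
from typing import Callable, Optional


def enforce(
    judge: Callable[[str], Optional[float]],
    output: str,
    threshold: float,
    action: str = "block",
) -> dict:
    """Apply a 'below threshold' rule, allowing the run if no score is available."""
    try:
        score = judge(output)
    except Exception as exc:  # missing key, provider error, unparseable reply
        return {"decision": "allow", "score": None, "reason": f"no score: {exc}"}
    if score is None:
        return {"decision": "allow", "score": None, "reason": "no score"}
    decision = action if score < threshold else "allow"
    return {"decision": decision, "score": score, "reason": "scored"}


def failing_judge(_: str) -> float:
    """Simulates a judge whose provider key is missing."""
    raise RuntimeError("GROQ_API_KEY is not configured")


def low_score_judge(_: str) -> float:
    """Simulates a judge that finds the answer poorly grounded."""
    return 0.4


print(enforce(failing_judge, "some answer", threshold=0.7))
print(enforce(low_score_judge, "some answer", threshold=0.7))
```

The trade-off is deliberate: an unscored run slips through during a judge outage, but the "no score" reason keeps that gap visible in the decision record.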
Where to go next
- Evaluators (LLM-as-Judge): judge prompts, templates, scoring schemes, annotation queues
- Datasets & Experiments: building datasets, running experiments, comparing results
- Policy Categories & Templates: the full catalog of governance categories (including Evaluator)
- Scoring: how scores are stored and surfaced