Eval-Driven Governance

Most platforms make you choose between two separate worlds: measuring quality (evals, datasets, experiments) and enforcing policy (guardrails, governance). Waxell unifies them. The same evaluator that scores your agent for improvement can also guard it in production — warn, redact, or block a run when its score crosses a line you set.

That's the whole idea: evals become guardrails.

In one sentence

Define a check once (e.g. "is the answer grounded in the retrieved context?"), use it to improve your agent offline, then flip it on to enforce that same standard on live traffic — with a preview of exactly how often it would have fired.

The loop

                 ┌──────────── improve ────────────┐
                 │                                  │
   run agents ──►  Observe ──► Score ──► Datasets / Experiments
                 │                │
                 │                └──► Guardrail (warn / redact / block)
                 └──────────── guard ───────────────┘

Observe — every agent run and LLM call is captured.
Score — an evaluator (LLM-as-judge) rates runs on a dimension you care about: groundedness, helpfulness, toxicity, PII leakage, format compliance.
Improve — build datasets from real runs and run experiments to compare prompts or models before you ship.
Guard — promote any evaluator to a guardrail that enforces a threshold on production traffic.

The pieces

Piece	What it is	Where
Datasets	Test cases (input → expected output). Build from a CSV/JSONL upload or capture straight from production runs.	Datasets & Experiments
Evaluators	LLM-as-judge checks. Start from a template, pick a small/cheap judge model, score automatically on ingest.	Evaluators
Experiments	Run a dataset through two configs and get a "which won + confidence" verdict.	Experiments
Guardrails	An evaluator + a threshold = a governance policy. Warn, redact, or block.	Policy Categories (`Evaluator` category)
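To make the Datasets row concrete, here is a minimal sketch of the column auto-mapping that the import step in this guide describes (question → input, answer → expected_output, everything else kept as metadata). The function name `rows_to_items`, the `COLUMN_MAP` constant, and the item shape are illustrative assumptions, not Waxell's actual import code or schema.

```python
import csv
import io

# Assumed mapping, taken from the import behaviour described in this guide.
COLUMN_MAP = {"question": "input", "answer": "expected_output"}


def rows_to_items(csv_text: str) -> list[dict]:
    """Turn CSV text into dataset items; unmapped columns become metadata."""
    items = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = {"input": None, "expected_output": None, "metadata": {}}
        for column, value in row.items():
            target = COLUMN_MAP.get(column)
            if target is not None:
                item[target] = value
            else:
                item["metadata"][column] = value
        items.append(item)
    return items


sample = "question,answer,source\nWhat is 2+2?,4,math\n"
items = rows_to_items(sample)
```

Keeping unknown columns under `metadata` means an upload never loses information, even when its headers do not match the expected names.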

Use cases

These are the patterns teams reach for first. Each is a single evaluator plus a threshold.

Block hallucinations in production

Create a Groundedness evaluator (does the answer stay within the retrieved context?), then enforce it:

Block a run when groundedness is below 0.7.

Ungrounded answers are caught at the source instead of reaching a customer. Pairs with Retrieval governance for RAG agents.

Catch PII / secret leakage

A PII / secret leakage evaluator flags any output that exposes credentials, API keys, or personal data:

Block a run when leaked_sensitive is above 0.5.

Maps to OWASP LLM06 (Sensitive Information Disclosure, in the 2023 v1.1 numbering of the LLM Top 10) and GDPR Art. 5. Because guardrails are control-tagged, that enforcement becomes audit evidence (see Continuous compliance).
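The two rules above (groundedness below 0.7, leaked_sensitive above 0.5) share one shape: an evaluator name, a threshold, a direction, and an action. A minimal sketch of that shape follows. The `GuardrailRule` class and its field names are hypothetical and only illustrate the idea; they are not Waxell's policy schema. The sketch also assumes strict comparisons, which this guide does not confirm.

```python
from dataclasses import dataclass
from typing import Literal

Action = Literal["warn", "redact", "block"]
Direction = Literal["below", "above"]


@dataclass(frozen=True)
class GuardrailRule:
    """Hypothetical model of 'evaluator + threshold = policy'."""
    evaluator: str
    threshold: float
    direction: Direction
    action: Action

    def triggers(self, score: float) -> bool:
        # Assumption: the comparison is strict in both directions.
        if self.direction == "below":
            return score < self.threshold
        return score > self.threshold

    def decide(self, score: float) -> str:
        # Returns the configured action when the rule fires, else "allow".
        return self.action if self.triggers(score) else "allow"


groundedness = GuardrailRule("groundedness", 0.7, "below", "block")
leak = GuardrailRule("leaked_sensitive", 0.5, "above", "block")
```

The direction field matters because a high score is good for groundedness and bad for leakage, so the same threshold logic has to work both ways.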

Gate prompt or model regressions before you ship

Before promoting prompt v2, run it and v1 through the same dataset as an experiment. The compare view returns a verdict like:

Prompt v2 wins on relevance — +0.12 avg, 75% of 40 items (~95% confidence).

The verdict comes from a sign test over per-item wins: it counts how many items each config wins instead of trusting the average alone, so you don't over-read a noisy 0.02 average gap. Ship on evidence, not vibes.
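A sign test of this kind can be sketched in a few lines. This is a generic one-sided exact sign test that assumes ties are dropped before counting. How Waxell handles ties, or how it converts the result into the confidence figure shown in the verdict, is not specified here, so the function name and that mapping are assumptions.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """One-sided exact sign test.

    Returns P(X >= wins) for X ~ Binomial(wins + losses, 0.5), i.e. the
    chance of seeing at least this many wins if both configs were equally good.
    """
    n = wins + losses
    if n == 0:
        return 1.0  # no decisive items, so no evidence either way
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Hypothetical comparison: v2 wins 8 items, v1 wins 2, ties already removed.
p = sign_test_p_value(wins=8, losses=2)
```

Here `wins` is the number of items where the candidate scored higher and `losses` the number where the baseline did. A small p-value means the win count would be unlikely if the two configs were equally good.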

Enforce format / structured-output contracts

A Format compliance evaluator checks that the output is valid JSON and has the required fields. Enforce it to keep malformed output from breaking downstream systems. (For hard schema checks, also see Tool Argument Schema.)
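The evaluator described above is an LLM judge, but the property it checks can be illustrated with a deterministic stand-in. The sketch below scores an output as the fraction of required fields present in a valid JSON object. The function name `format_score` and the scoring scheme are assumptions made for illustration, not the Format compliance template itself.

```python
import json


def format_score(output: str, required_fields: list[str]) -> float:
    """Return 0.0 for invalid or non-object JSON, else the share of required fields present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(data, dict):
        return 0.0  # valid JSON, but not an object with named fields
    if not required_fields:
        return 1.0  # nothing required beyond being a JSON object
    present = sum(1 for name in required_fields if name in data)
    return present / len(required_fields)


score = format_score('{"answer": "42", "sources": []}', ["answer", "sources"])
```

A score like this can feed the same "trigger when the score is below a threshold" rule as any other evaluator.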

Continuous compliance evidence

Every guardrail can declare the controls it evidences — OWASP LLM Top 10, NIST AI RMF, ISO 42001, EU AI Act, GDPR, HIPAA. When a guardrail is enabled, each enforcement decision is a control-tagged audit record. A passing eval suite becomes living evidence that a control is in force, instead of a screenshot in a binder.
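As a rough picture of what a control-tagged audit record might contain, the sketch below serializes one enforcement decision to JSON and filters records by control. The `EnforcementRecord` class, its field names, and the control label strings are hypothetical; Waxell's actual event schema is not shown in this guide.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class EnforcementRecord:
    """Hypothetical shape of one control-tagged enforcement decision."""
    evaluator: str
    score: float
    threshold: float
    decision: str
    controls: tuple[str, ...]
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def evidence_for(records: list[EnforcementRecord], control: str) -> list[EnforcementRecord]:
    """Select the decisions that count as evidence for one control."""
    return [r for r in records if control in r.controls]


record = EnforcementRecord(
    evaluator="pii_leakage",
    score=0.82,
    threshold=0.5,
    decision="block",
    controls=("OWASP LLM06", "GDPR Art. 5"),
)
line = json.dumps(asdict(record))  # one JSON line per decision
```

Tagging each decision at write time is what lets an auditor ask "show me every enforcement of this control" without reconstructing it from logs later.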

Build it: end to end

The whole loop is self-service — no JSON, no CLI required.

Build a dataset. Go to Observability → Datasets → Create → Import Items and drop in a CSV/JSONL. Columns are auto-mapped (question → input, answer → expected_output); anything else is saved as metadata. Or click Save as dataset from a set of production runs.
Create an evaluator. Observability → Evaluations → Create. Start from a template (Groundedness, Helpfulness, PII leakage, Toxicity, Format compliance, …), which pre-fills the judge prompt. Pick a judge model (see Choosing a judge model) and turn on Run automatically on ingest so new runs get scored.
Enforce it. From the evaluator, click Enforce as guardrail. You land in the policy wizard with the Evaluator category pre-selected. Set the threshold and direction ("trigger when the score is below 0.7") and choose warn / block / redact.
Preview the blast radius. Before you enable it, the form shows a counterfactual:

Based on 200 scores in the last 30 days, this rule would have flagged 3 (1.5%) — those runs would have been blocked.

That way you never start blocking production traffic by accident. Enable the guardrail when the preview looks right.
Watch it fire. Enforcement decisions appear as governance events on the run's trace, tagged with the evaluator, the score, and any compliance controls.
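The blast-radius preview in the steps above amounts to replaying a candidate rule over historical scores. A minimal sketch, assuming the scores are available as a plain list of floats and that the comparison is strict; the function name `preview_rule` and the sample data are made up for illustration.

```python
def preview_rule(scores: list[float], threshold: float, direction: str = "below") -> tuple[int, int, float]:
    """Count how many historical scores a rule would have flagged.

    Returns (flagged, total, rate). The rule is only replayed, never enforced.
    """
    if direction == "below":
        flagged = sum(1 for s in scores if s < threshold)
    else:
        flagged = sum(1 for s in scores if s > threshold)
    total = len(scores)
    rate = flagged / total if total else 0.0
    return flagged, total, rate


# Made-up history: 200 scores, three of them under the 0.7 threshold.
history = [0.9] * 197 + [0.5, 0.6, 0.65]
flagged, total, rate = preview_rule(history, threshold=0.7)
summary = f"Based on {total} scores, this rule would have flagged {flagged} ({rate:.1%})."
```

Running the preview against a few candidate thresholds is a cheap way to find one that catches bad runs without flagging a large share of normal traffic.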

Choosing a judge model

A judge runs on every scored run, so it should be small, fast, and cheap — the same insight behind dedicated eval models elsewhere. Waxell ships a curated picker:

Judge	Provider	Good for	Key (in Settings → Secrets)
GPT-4o mini	OpenAI	Safe default — cheap, fast, reliable	`OPENAI_API_KEY`
Llama 3.1 8B	Groq	Cheapest + fastest, for high-volume scoring	`GROQ_API_KEY`
Claude Haiku	Anthropic	Nuanced checks at low cost	`ANTHROPIC_API_KEY`

The picker shows whether the matching key is configured and links straight to Settings → Secrets to add it. A custom option lets you name any model your tenant can route to.

Note: The judge resolves its key from a tenant secret by name (e.g. GROQ_API_KEY). Add the key once in Settings → Secrets and every evaluator using that provider can score immediately.
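Resolving a key by name can be pictured as a simple lookup. The sketch below takes the secret names from the table above and treats the tenant's secrets as a plain mapping; the function name `resolve_judge_key` and the in-memory secret store are assumptions, not Waxell's secret-management API.

```python
from typing import Mapping, Optional

# Provider-to-secret-name pairs, copied from the judge table in this guide.
JUDGE_SECRET_NAMES = {
    "OpenAI": "OPENAI_API_KEY",
    "Groq": "GROQ_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
}


def resolve_judge_key(provider: str, tenant_secrets: Mapping[str, str]) -> Optional[str]:
    """Return the provider's key if the tenant has configured it, else None."""
    secret_name = JUDGE_SECRET_NAMES.get(provider)
    if secret_name is None:
        return None  # provider not covered by this sketch
    return tenant_secrets.get(secret_name) or None


# Hypothetical tenant with only the Groq key configured.
secrets = {"GROQ_API_KEY": "example-key"}
groq_key = resolve_judge_key("Groq", secrets)
openai_key = resolve_judge_key("OpenAI", secrets)
```

A `None` result corresponds to the "key not configured" state the picker surfaces.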

How enforcement works

Plane. Evaluator guardrails run on the observe plane, after the agent produces its output, because the judge needs the output to score it. The decision (allow / warn / block / redact) is recorded as a governance event and signaled back. For fast, pre-output checks, use the deterministic operational guardrails instead.
Cost control. Judge calls cost money, so point guardrails at a small judge model and scope them with a target filter (per agent / per model).
Fail-safe. If the judge can't produce a comparable score (no key, provider error), the guardrail fails open: it allows the run rather than blocking it, so judge failures never take down production. You'll see "no score" rather than a block.
Auditability. Each decision carries the evaluator, the score, the threshold, and the compliance controls it evidences.
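The fail-safe behaviour described above can be sketched as a wrapper around the judge call. This is an illustration under stated assumptions: `enforce`, the judge callable, and the returned dictionary are hypothetical names and shapes, and the rule shown only covers the "trigger when the score is below the threshold" direction.

```python
from typing import Callable, Optional


def enforce(
    judge: Callable[[str], Optional[float]],
    output: str,
    threshold: float,
    action: str = "block",
) -> dict:
    """Score an output and decide, allowing the run whenever no score is available."""
    try:
        score = judge(output)
    except Exception:
        # Missing key, provider error, timeout: treat all of these as "no score".
        score = None
    if score is None:
        return {"decision": "allow", "score": None, "note": "no score"}
    decision = action if score < threshold else "allow"
    return {"decision": decision, "score": score, "threshold": threshold}


# Hypothetical judge that always returns a low groundedness score.
result = enforce(lambda text: 0.4, "some agent output", threshold=0.7)
```

Failing open trades a small window of unenforced traffic for availability, which is why the "no score" marker is worth monitoring on its own.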

Where to go next

Evaluators (LLM-as-Judge) — judge prompts, templates, scoring schemes, annotation queues
Datasets & Experiments — building datasets, running experiments, comparing results
Policy Categories & Templates — the full catalog of governance categories (including Evaluator)
Scoring — how scores are stored and surfaced
