
Eval-Driven Governance

Most platforms make you choose between two separate worlds: measuring quality (evals, datasets, experiments) and enforcing policy (guardrails, governance). Waxell unifies them. The same evaluator that scores your agent for improvement can also guard it in production — warn, redact, or block a run when its score crosses a line you set.

That's the whole idea: evals become guardrails.

In one sentence

Define a check once (e.g. "is the answer grounded in the retrieved context?"), use it to improve your agent offline, then flip it on to enforce that same standard on live traffic — with a preview of exactly how often it would have fired.

The loop

                 ┌──────────── improve ────────────┐
                 │                                 │
run agents ──► Observe ──► Score ──► Datasets / Experiments
                 │           │
                 │           └──► Guardrail (warn / redact / block)
                 └──────────── guard ───────────────┘

  • Observe — every agent run and LLM call is captured.
  • Score — an evaluator (LLM-as-judge) rates runs on a dimension you care about: groundedness, helpfulness, toxicity, PII leakage, format compliance.
  • Improve — build datasets from real runs and run experiments to compare prompts or models before you ship.
  • Guard — promote any evaluator to a guardrail that enforces a threshold on production traffic.

The pieces

| Piece | What it is | Where |
| --- | --- | --- |
| Datasets | Test cases (input → expected output). Build from a CSV/JSONL upload or capture straight from production runs. | Datasets & Experiments |
| Evaluators | LLM-as-judge checks. Start from a template, pick a small/cheap judge model, score automatically on ingest. | Evaluators |
| Experiments | Run a dataset through two configs and get a "which won + confidence" verdict. | Experiments |
| Guardrails | An evaluator + a threshold = a governance policy. Warn, redact, or block. | Policy Categories (Evaluator category) |

Use cases

These are the patterns teams reach for first. Each is a single evaluator plus a threshold.

Block hallucinations in production

Create a Groundedness evaluator (does the answer stay within the retrieved context?), then enforce it:

Block a run when groundedness is below 0.7.

Ungrounded answers are caught at the source instead of reaching a customer. Pairs with Retrieval governance for RAG agents.
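
A rule like this can be pictured as an evaluator name, a threshold, a direction, and an action. The sketch below is illustrative only: `GuardrailRule` and its fields are hypothetical names, not Waxell's schema, and it assumes scores in the range 0 to 1 with a strict comparison at the threshold.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class GuardrailRule:
    """Hypothetical model of 'an evaluator + a threshold'; not Waxell's schema."""

    evaluator: str                         # e.g. "groundedness"
    threshold: float                       # the line you set
    direction: Literal["below", "above"]   # which side of the line triggers the rule
    action: Literal["warn", "redact", "block"]

    def decide(self, score: float) -> str:
        """Return the rule's action if the score crosses the threshold, else 'allow'."""
        # Assumption: the comparison is strict, so a score exactly at the
        # threshold does not trigger the rule.
        if self.direction == "below":
            triggered = score < self.threshold
        else:
            triggered = score > self.threshold
        return self.action if triggered else "allow"


# "Block a run when groundedness is below 0.7."
grounded = GuardrailRule("groundedness", 0.7, "below", "block")
print(grounded.decide(0.55))  # low score: the rule fires
print(grounded.decide(0.90))  # high score: the run is allowed
```

The `direction` field is what lets one shape cover both "lower is worse" checks such as groundedness and "higher is worse" checks such as leakage.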

Catch PII / secret leakage

A PII / secret leakage evaluator flags any output that exposes credentials, API keys, or personal data:

Block a run when leaked_sensitive is above 0.5.

Maps to OWASP LLM06 (Sensitive Information Disclosure) and GDPR Article 5. Because guardrails are control-tagged, that enforcement becomes audit evidence (see Continuous compliance).

Gate prompt or model regressions before you ship

Before promoting prompt v2, run it and v1 through the same dataset as an experiment. The compare view returns a verdict like:

Prompt v2 wins on relevance — +0.12 avg, 75% of 40 items (~95% confidence).

It's a sign test over per-item wins, so you don't over-read a noisy 0.02 average gap. Ship on evidence, not vibes.
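
To make the sign-test idea concrete, here is a textbook two-sided exact sign test over per-item wins. How Waxell derives its own confidence figure is not specified here, so treat this as a generic sketch; the function names and the score lists are made up for illustration.

```python
from math import comb
from typing import Sequence, Tuple


def count_wins(a: Sequence[float], b: Sequence[float]) -> Tuple[int, int]:
    """Count items where b beats a, and where a beats b. Ties are dropped."""
    b_wins = sum(1 for x, y in zip(a, b) if y > x)
    a_wins = sum(1 for x, y in zip(a, b) if x > y)
    return b_wins, a_wins


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test: how surprising is this split under a fair coin?"""
    n = wins + losses
    if n == 0:
        return 1.0
    k = max(wins, losses)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-item relevance scores for two prompt versions.
v1 = [0.60, 0.70, 0.55, 0.80, 0.65, 0.50, 0.75, 0.62]
v2 = [0.72, 0.78, 0.61, 0.79, 0.70, 0.66, 0.81, 0.70]

wins, losses = count_wins(v1, v2)
p = sign_test_p_value(wins, losses)
print(f"v2 wins {wins} of {wins + losses} decided items, p = {p:.3f}")
```

Because the test only looks at which config won each item, a handful of large outliers cannot manufacture a verdict the way they can shift an average.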

Enforce format / structured-output contracts

A Format compliance evaluator checks that the output is valid JSON and has the required fields. Enforce it to keep malformed output from breaking downstream systems. (For hard schema checks, also see Tool Argument Schema.)
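
To show what such a check verifies, here is a deterministic stand-in written with the standard library. It is not the LLM-as-judge evaluator itself; `format_score` and the field names are hypothetical, and the 0.0 / 1.0 scoring is an assumption.

```python
import json
from typing import Set


def format_score(output: str, required_fields: Set[str]) -> float:
    """Return 1.0 if output is a JSON object with every required field, else 0.0."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not valid JSON at all
    if not isinstance(parsed, dict):
        return 0.0  # valid JSON, but not an object with named fields
    return 1.0 if required_fields.issubset(parsed.keys()) else 0.0


required = {"answer", "sources"}
print(format_score('{"answer": "30 days", "sources": ["kb-12"]}', required))
print(format_score('{"answer": "30 days"}', required))
print(format_score("Sure! Here is the JSON you asked for", required))
```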

Continuous compliance evidence

Every guardrail can declare the controls it evidences — OWASP LLM Top 10, NIST AI RMF, ISO 42001, EU AI Act, GDPR, HIPAA. When a guardrail is enabled, each enforcement decision is a control-tagged audit record. A passing eval suite becomes living evidence that a control is in force, instead of a screenshot in a binder.
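
As an illustration of what a control-tagged audit record might hold, the sketch below builds one and groups records by control. The `EnforcementRecord` class, its field names, and the sample values are hypothetical, not Waxell's event schema.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class EnforcementRecord:
    """Hypothetical shape of one control-tagged enforcement decision."""

    run_id: str
    evaluator: str
    score: float
    threshold: float
    decision: str                                   # allow / warn / redact / block
    controls: List[str] = field(default_factory=list)
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def evidence_by_control(records: List[EnforcementRecord]) -> Dict[str, List[str]]:
    """Group run ids under each control they provide evidence for."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        for control in record.controls:
            grouped[control].append(record.run_id)
    return dict(grouped)


record = EnforcementRecord(
    run_id="run-123",                # made-up identifier
    evaluator="pii_leakage",
    score=0.82,
    threshold=0.5,
    decision="block",
    controls=["OWASP LLM06", "GDPR Article 5"],
)
print(json.dumps(asdict(record), indent=2))
print(evidence_by_control([record]))
```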

Build it: end to end

The whole loop is self-service — no JSON, no CLI required.

  1. Build a dataset. Go to Observability → Datasets → Create → Import Items and drop in a CSV/JSONL. Columns are auto-mapped (question → input, answer → expected_output); anything else is saved as metadata. Or click Save as dataset from a set of production runs.

  2. Create an evaluator. Observability → Evaluations → Create. Start from a template (Groundedness, Helpfulness, PII leakage, Toxicity, Format compliance, …) — it pre-fills the judge prompt. Pick a judge model (see Choosing a judge) and turn on Run automatically on ingest so new runs get scored.

  3. Enforce it. From the evaluator, click Enforce as guardrail. You land in the policy wizard with the Evaluator category pre-selected. Set the threshold and direction ("trigger when the score is below 0.7") and choose warn / block / redact.

  4. Preview the blast radius. Before you enable it, the form shows a counterfactual:

    Based on 200 scores in the last 30 days, this rule would have flagged 3 (1.5%) — those runs would have been blocked.

    This way you never start blocking production traffic by accident. Enable the guardrail once the preview looks right.

  5. Watch it fire. Enforcement decisions appear as governance events on the run's trace, tagged with the evaluator, the score, and any compliance controls.
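
Two of the steps above lend themselves to a short sketch: the column auto-mapping in step 1 and the counterfactual preview in step 4. The UI does both for you. The code only illustrates the logic, with hypothetical function names, made-up sample data, and an assumed strict comparison at the threshold.

```python
import csv
import io

# Mapping named in step 1; any other column is kept as metadata.
COLUMN_MAP = {"question": "input", "answer": "expected_output"}


def to_dataset_item(row: dict) -> dict:
    """Turn one uploaded row into a dataset item with input, expected_output, metadata."""
    item = {"metadata": {}}
    for column, value in row.items():
        target = COLUMN_MAP.get(column.strip().lower())
        if target:
            item[target] = value
        else:
            item["metadata"][column] = value
    return item


def preview(scores, threshold, direction="below"):
    """Count how many historical scores the rule would have flagged."""
    if direction == "below":
        flagged = sum(1 for s in scores if s < threshold)
    else:
        flagged = sum(1 for s in scores if s > threshold)
    share = flagged / len(scores) if scores else 0.0
    return flagged, share


# Step 1: a tiny made-up CSV upload.
upload = "question,answer,topic\nWhat is the refund window?,30 days,billing\n"
items = [to_dataset_item(row) for row in csv.DictReader(io.StringIO(upload))]
print(items)

# Step 4: 200 made-up historical scores, three of them below 0.7.
history = [0.9] * 197 + [0.40, 0.60, 0.65]
flagged, share = preview(history, threshold=0.7)
print(f"Based on {len(history)} scores, this rule would have flagged {flagged} ({share:.1%}).")
```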

Choosing a judge model

A judge runs on every scored run, so it should be small, fast, and cheap — the same insight behind dedicated eval models elsewhere. Waxell ships a curated picker:

| Judge | Provider | Good for | Key (in Settings → Secrets) |
| --- | --- | --- | --- |
| GPT-4o mini | OpenAI | Safe default: cheap, fast, reliable | OPENAI_API_KEY |
| Llama 3.1 8B | Groq | Cheapest + fastest, for high-volume scoring | GROQ_API_KEY |
| Claude Haiku | Anthropic | Nuanced checks at low cost | ANTHROPIC_API_KEY |

The picker shows whether the matching key is configured and links straight to Settings → Secrets to add it. A custom option lets you name any model your tenant can route to.

Note: The judge resolves its key from a tenant secret by name (e.g. GROQ_API_KEY). Add the key once in Settings → Secrets and every evaluator using that provider can score immediately.
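
The lookup by secret name can be pictured as follows. This is a sketch under assumptions: the secret names come from the table above, but the `tenant_secrets` dictionary and both functions are hypothetical stand-ins for however the platform stores and reads secrets.

```python
from typing import Dict, Optional

# Judge -> provider and secret name, as listed in the picker table.
JUDGES = {
    "GPT-4o mini": {"provider": "OpenAI", "secret": "OPENAI_API_KEY"},
    "Llama 3.1 8B": {"provider": "Groq", "secret": "GROQ_API_KEY"},
    "Claude Haiku": {"provider": "Anthropic", "secret": "ANTHROPIC_API_KEY"},
}


def resolve_key(judge: str, tenant_secrets: Dict[str, str]) -> Optional[str]:
    """Look up the judge's key by secret name; None means it is not configured."""
    secret_name = JUDGES[judge]["secret"]
    return tenant_secrets.get(secret_name) or None


def picker_status(tenant_secrets: Dict[str, str]) -> Dict[str, bool]:
    """Report, per judge, whether its matching key is configured."""
    return {name: resolve_key(name, tenant_secrets) is not None for name in JUDGES}


# Hypothetical tenant with only a Groq key stored (placeholder value).
secrets = {"GROQ_API_KEY": "placeholder-value"}
print(picker_status(secrets))
```

Because the key is resolved by name, one stored secret serves every evaluator that uses the same provider.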

How enforcement works

  • Plane. Evaluator guardrails run on the observe plane, after the agent produces its output (the judge needs the output to score it). The decision (allow / warn / block / redact) is recorded as a governance event and signaled back. For fast, pre-output checks, use the deterministic operational guardrails instead.
  • Cost control. Judge calls cost money, so point guardrails at a small judge model and scope them with a target filter (per agent / per model).
  • Fail-safe. If the judge can't produce a comparable score (no key, provider error), the guardrail allows the run rather than blocking it — failures never take down production. You'll see "no score" rather than a block.
  • Auditability. Each decision carries the evaluator, the score, the threshold, and the compliance controls it evidences.
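
The fail-safe and target-filter behavior described above can be sketched as a small wrapper around the judge call. Everything here is an assumption made for illustration: the `enforce` function, its parameters, and the returned dictionary are hypothetical, and the rule is fixed to "trigger when the score is below the threshold".

```python
from typing import Callable, Optional, Set


def enforce(
    output: str,
    judge: Callable[[str], Optional[float]],
    threshold: float,
    action: str = "block",
    agent: Optional[str] = None,
    target_agents: Optional[Set[str]] = None,
) -> dict:
    """Apply a 'below threshold' guardrail that allows the run if no score exists."""
    # Cost control: skip the judge call entirely for agents outside the target filter.
    if target_agents is not None and agent not in target_agents:
        return {"decision": "allow", "score": None, "reason": "out of scope"}
    try:
        score = judge(output)
    except Exception:
        score = None  # missing key, provider error, and similar failures
    # Fail-safe: no usable score means the run is allowed, never blocked.
    if score is None:
        return {"decision": "allow", "score": None, "reason": "no score"}
    if score < threshold:
        return {"decision": action, "score": score, "reason": "below threshold"}
    return {"decision": "allow", "score": score, "reason": "passed"}


def failing_judge(_output: str) -> float:
    """Stand-in for a judge whose provider call fails."""
    raise RuntimeError("provider error")


print(enforce("some answer", lambda _o: 0.4, threshold=0.7))
print(enforce("some answer", failing_judge, threshold=0.7))
print(enforce("some answer", failing_judge, 0.7, agent="billing", target_agents={"support"}))
```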

Where to go next