Grounding Policy
The grounding policy category enforces factual accuracy standards — it requires agent outputs to be grounded in cited sources, with measurable confidence scores and limits on unsupported claims.
Use it for RAG (retrieval-augmented generation) agents, research assistants, medical or legal Q&A, and any agent where hallucinated or unsupported output carries real risk.
Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `require_source_grounding` | boolean | `false` | If true, block (or warn) when zero citations are provided |
| `min_grounding_score` | number | `0.7` | Minimum acceptable score per grounding claim (0.0-1.0). Checked at mid_execution |
| `min_citations` | integer | `1` | Minimum number of source citations required in the output |
| `max_unsupported_claims` | integer | `null` | Maximum number of claims without source support. `null` means no limit |
| `factual_consistency_check` | boolean | `false` | Reserved for future cross-claim consistency validation |
| `abstention_threshold` | number | `null` | If output_confidence falls below this threshold, block (the agent should abstain rather than guess) |
| `abstention_response` | string | `null` | Custom message to return when the abstention threshold is triggered |
| `action_on_violation` | string | `"warn"` | `"block"` raises PolicyViolationError; `"warn"` logs and continues |
| `score_relevance_floor` | number | `null` | Ignore retrieval scores below this floor. Only scores at or above this value are checked against min_grounding_score |
| `score_eval_mode` | string | `"all"` | How to evaluate filtered scores: `"all"` (each must pass), `"average"` (mean must pass), `"top_n"` (only top N checked) |
| `score_top_n` | integer | `3` | When score_eval_mode="top_n", how many top scores to evaluate |
The grounding handler defaults to `action_on_violation: "warn"`, unlike most other handlers, which default to `"block"`. This is intentional: grounding quality degrades gradually, and initial deployments often need to calibrate thresholds. Change to `"block"` once you have tuned your score thresholds against real traffic.
How It Works
The grounding handler runs at two enforcement phases:
before_workflow
Stores the grounding rules into `context._grounding_rules`. Always returns ALLOW — no blocking happens at before_workflow.
mid_execution
Checks per-claim grounding scores as they are reported:
- Reads `context.grounding_scores` (a list of float values 0.0-1.0)
- If the list is empty, returns ALLOW immediately (nothing to check yet)
- For each score, checks against `min_grounding_score`
- The first score below the threshold produces a BLOCK or WARN

This phase fires every time `ctx.record_grounding()` is called with new scores.
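The per-score check can be sketched as follows. This is a simplified illustration of the behavior described above, not the actual handler source; `check_grounding_scores` is a hypothetical helper name.

```python
def check_grounding_scores(scores: list, min_grounding_score: float = 0.7):
    """Mimic the mid_execution per-score check."""
    if not scores:
        return ("ALLOW", "nothing to check yet")
    for score in scores:
        if score < min_grounding_score:
            # The first score below threshold produces the violation
            return ("VIOLATION",
                    f"Grounding score ({score}) below threshold ({min_grounding_score})")
    return ("ALLOW", "all scores pass")
```

Whether a VIOLATION becomes a BLOCK or a WARN is then decided by `action_on_violation`.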
after_workflow
Audits the final output state against citation and confidence requirements:
- Citation count: `len(context.citations) < min_citations` → violation
- Require grounding: `require_source_grounding=true` and `len(context.citations) == 0` → violation
- Unsupported claims: `context.unsupported_claims > max_unsupported_claims` → violation
- Abstention threshold: `context.output_confidence < abstention_threshold` → violation
All violations are collected and returned together in the warnings list.
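The audit above can be modeled as a single function that accumulates every failed check before returning. This is a sketch under the rule semantics documented here, not the handler's real implementation; `audit_output` is a hypothetical name.

```python
def audit_output(citations, unsupported_claims, output_confidence, rules):
    """Collect all after_workflow violations; warnings are returned together."""
    warnings = []
    min_citations = rules.get("min_citations", 1)
    if len(citations) < min_citations:
        warnings.append(f"Citations ({len(citations)}) below minimum ({min_citations})")
    if rules.get("require_source_grounding") and len(citations) == 0:
        warnings.append("No source citations provided (grounding required)")
    max_claims = rules.get("max_unsupported_claims")  # null/None disables the check
    if max_claims is not None and len(unsupported_claims) > max_claims:
        warnings.append(f"Unsupported claims ({len(unsupported_claims)}) exceeds max ({max_claims})")
    threshold = rules.get("abstention_threshold")  # skipped if confidence not recorded
    if threshold is not None and output_confidence is not None and output_confidence < threshold:
        warnings.append(f"Output confidence ({output_confidence}) below abstention threshold ({threshold})")
    return warnings
```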
Filtering for RAG Pipelines
Real RAG pipelines with `top_k=5` almost always return tail results with low similarity scores. Without filtering, `min_grounding_score` would block on irrelevant tail results, making the policy unusable.
The Relevance Floor
Set score_relevance_floor to ignore retrieval results below a relevance threshold. Only scores at or above the floor are evaluated against min_grounding_score.
Example: A query about "2008 financial crisis" returns 5 results:
- 0.92, 0.87, 0.85 (relevant finance docs)
- 0.35, 0.22 (irrelevant tail results)
With score_relevance_floor: 0.5, only the three relevant scores are checked. All three pass min_grounding_score: 0.7, so the policy allows the response.
Without the floor, the policy would block on the 0.35 score — a false positive from an irrelevant retrieval result.
Edge Case: Nothing Relevant
If NO scores pass the floor, the policy triggers a violation: "No grounding scores above relevance floor — all retrieved results appear irrelevant." This correctly catches the case where retrieval returned nothing useful.
Evaluation Modes
| Mode | Behavior |
|---|---|
| `"all"` (default) | Every filtered score must pass min_grounding_score |
| `"average"` | The mean of filtered scores must pass the threshold |
| `"top_n"` | Only the top N filtered scores are checked (set N with score_top_n) |
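The floor filtering and the three modes compose as sketched below. This is a simplified model of the documented behavior, not the handler's actual code; `evaluate_scores` is a hypothetical helper.

```python
def evaluate_scores(scores, min_score, floor=None, mode="all", top_n=3):
    """Apply score_relevance_floor, then one of the three score_eval_modes."""
    filtered = [s for s in scores if floor is None or s >= floor]
    if floor is not None and not filtered:
        return False  # all retrieved results appear irrelevant -> violation
    if not filtered:
        return True   # nothing to check
    if mode == "all":
        return all(s >= min_score for s in filtered)
    if mode == "average":
        return sum(filtered) / len(filtered) >= min_score
    if mode == "top_n":
        return all(s >= min_score for s in sorted(filtered, reverse=True)[:top_n])
    raise ValueError(f"unknown score_eval_mode: {mode}")
```

On the earlier "2008 financial crisis" scores `[0.92, 0.87, 0.85, 0.35, 0.22]`, a floor of 0.5 passes `"all"` mode at threshold 0.7, while no floor fails it on the 0.35 tail result.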
Example: RAG Pipeline Policy
{
"min_grounding_score": 0.7,
"score_relevance_floor": 0.5,
"score_eval_mode": "all",
"min_citations": 1,
"action_on_violation": "block"
}
Example: Lenient Average-Based Policy
{
"min_grounding_score": 0.7,
"score_relevance_floor": 0.4,
"score_eval_mode": "average",
"action_on_violation": "warn"
}
Set score_relevance_floor whenever your grounding scores come from retrieval similarity (e.g., Pinecone, FAISS, Weaviate). Without filtering, top_k > 3 retrieval will almost always produce tail results below any useful threshold.
If your scores come from LLM self-evaluation or NLI classifiers (where every score is meaningful), leave score_relevance_floor unset to use the legacy per-score checking.
SDK Integration
Recording Grounding Data
import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

async with waxell.WaxellContext(
    agent_name="research-agent",
    enforce_policy=True,
) as ctx:
    # Retrieve sources and generate answer
    sources = await retrieve_sources(query)
    answer, scores = await synthesize_with_grounding(query, sources)

    # Report grounding data — triggers mid_execution check on scores
    ctx.record_grounding(
        grounding_scores=scores,   # per-claim scores [0.9, 0.85, 0.92]
        citations=sources,         # ["Source A", "Source B", "Source C"]
        unsupported_claims=[],     # any claims without source support
        output_confidence=0.88,    # overall confidence in the answer
    )

    ctx.set_result({"answer": answer, "citations": sources})
Grounding Data Fields
| Field | Type | What to put here |
|---|---|---|
| `grounding_scores` | list[float] | One score per claim or sentence in the output (0.0 = unsupported, 1.0 = fully grounded) |
| `citations` | list[str] | Source identifiers — URLs, document names, DOIs, database record IDs |
| `unsupported_claims` | list[str] | The actual text of claims you could not ground in any source |
| `output_confidence` | float | Your model's or pipeline's overall confidence that the full answer is accurate |
Handling Violations
try:
    async with waxell.WaxellContext(
        agent_name="research-agent",
        enforce_policy=True,
    ) as ctx:
        sources = await retrieve_sources(query)
        answer, scores = await synthesize_with_grounding(query, sources)
        ctx.record_grounding(
            grounding_scores=scores,
            citations=sources,
            unsupported_claims=[],
            output_confidence=0.88,
        )
        ctx.set_result({"answer": answer})
except PolicyViolationError as e:
    # e.g. "Grounding score (0.42) below threshold (0.7)"
    # e.g. "Citations (0) below minimum (1); Unsupported claims (2) exceeds max (0)"
    print(f"Grounding policy blocked: {e}")
    return abstention_message
Example Policies
Strict Research Assistant
Block any output with low grounding scores, missing citations, or low confidence:
{
"require_source_grounding": true,
"min_grounding_score": 0.8,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"abstention_response": "I don't have sufficient grounded evidence to answer this accurately.",
"action_on_violation": "block"
}
Medical Q&A
High accuracy standards — abstain rather than guess:
{
"require_source_grounding": true,
"min_grounding_score": 0.9,
"min_citations": 3,
"max_unsupported_claims": 0,
"abstention_threshold": 0.7,
"abstention_response": "I cannot provide a reliable answer based on available evidence. Please consult a qualified professional.",
"action_on_violation": "block"
}
Lenient RAG Agent
Warn on grounding issues but do not block execution during development:
{
"require_source_grounding": false,
"min_grounding_score": 0.6,
"min_citations": 1,
"max_unsupported_claims": 3,
"abstention_threshold": null,
"action_on_violation": "warn"
}
Citation-Only Enforcement
Require citations but do not enforce per-claim scores (useful when scores are not available):
{
"require_source_grounding": true,
"min_citations": 1,
"max_unsupported_claims": 2,
"action_on_violation": "block"
}
Enforcement Flow
Agent starts (WaxellContext.__aenter__)
│
└── before_workflow governance runs
└── Stores grounding rules → ALLOW
Agent runs — calls ctx.record_grounding(grounding_scores=[...], ...)
│
└── mid_execution governance runs
├── grounding_scores empty? → ALLOW (nothing to check)
├── For each score:
│ └── score < min_grounding_score? → BLOCK/WARN (first violation only)
└── All scores pass → ALLOW
Agent completes (WaxellContext.__aexit__)
│
└── after_workflow governance runs
├── len(citations) < min_citations? → violation
├── require_source_grounding AND len(citations) == 0? → violation
├── unsupported_claims > max_unsupported_claims? → violation
├── output_confidence < abstention_threshold? → violation
├── Any violations?
│ ├── action=block → BLOCK (PolicyViolationError)
│ └── action=warn → WARN (agent completed, warning recorded)
└── No violations → ALLOW with citation_count metadata
Creating via Dashboard
- Navigate to Governance > Policies
- Click New Policy
- Select category Grounding
- Set your grounding thresholds
- Set `action_on_violation` to `warn` initially to calibrate, then `block` for production
- Set scope to target specific agents (e.g., `research-agent`)
- Enable the policy
Creating via API
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://acme.waxell.dev/waxell/v1/policies/ \
-d '{
"name": "Strict Research Grounding",
"category": "grounding",
"rules": {
"require_source_grounding": true,
"min_grounding_score": 0.8,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"action_on_violation": "block"
},
"scope": {
"agents": ["research-agent"]
},
"enabled": true
}'
Observability
Governance Tab
Grounding evaluations appear at each enforcement phase:
mid_execution (per-score check):
| Field | Example |
|---|---|
| Policy name | Strict Research Grounding |
| Action | block |
| Phase | mid_execution |
| Reason | "Grounding score (0.42) below threshold (0.7)" |
| Metadata | {"score": 0.42, "threshold": 0.7} |
after_workflow (citation and confidence audit):
| Field | Example |
|---|---|
| Action | block |
| Phase | after_workflow |
| Reason | "Citations (0) below minimum (1); No source citations provided (grounding required)" |
| Metadata | {"warnings": [...], "citation_count": 0} |
For successful runs:
| Field | Example |
|---|---|
| Action | allow |
| Reason | "Grounding audit passed (3 citations)" |
| Metadata | {"citation_count": 3} |
Trace Tab
The source_retrieval tool span (if recorded with ctx.record_tool_call) shows the number of sources found. The grounding data (scores, citations, confidence) appears in the steps data alongside any LLM call records.
LLM Grounding Judge
In addition to checking self-reported scores and citation counts, the grounding handler supports LLM-based grounding evaluation — an independent AI judge that reads the output and its cited sources, then scores whether the output is genuinely grounded.
This is an admin-controlled feature: developers record grounding data as usual, and the platform automatically runs the LLM judge when the policy enables it. No developer code changes are needed.
LLM Judge Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `llm_grounding_check` | boolean | `false` | Enable LLM-based grounding evaluation |
| `llm_grounding_model` | string | `"gpt-4o-mini"` | Model to use for the judge (gpt-*, claude-*, llama-*, mixtral-*) |
| `llm_grounding_threshold` | number | `0.7` | Minimum LLM judge score (0.0-1.0) |
| `llm_grounding_criteria` | string | `""` | Custom evaluation criteria appended to the judge prompt |
| `llm_grounding_phase` | string | `"mid_execution"` | When to run the judge: `"mid_execution"`, `"after_workflow"`, or `"both"` |
How It Works
The LLM judge evaluates three dimensions:
- Source support — Is every factual claim in the output supported by at least one cited source?
- Citation correctness — Are the citations used correctly (not fabricated or misattributed)?
- Scope adherence — Does the output avoid making claims beyond what the sources support?
The judge returns a single score from 0.0 to 1.0:
| Score | Meaning |
|---|---|
| 0.0 | Completely ungrounded (fabricated facts) |
| 0.3 | Poorly grounded (many unsupported claims) |
| 0.5 | Partially grounded (some claims supported) |
| 0.7 | Mostly grounded (minor gaps) |
| 1.0 | Fully grounded (every claim traceable) |
Phase Behavior
- `mid_execution`: After each `record_grounding()` call, the judge evaluates the latest LLM response against the citations just recorded. A score below threshold produces an immediate BLOCK or WARN.
- `after_workflow`: After the agent completes, the judge evaluates the final output against all citations collected across all `record_grounding()` calls. A score below threshold adds to the violation warnings.
- `both`: Runs at both phases.
Supported Providers
The judge model uses your organization's API keys stored in Tenant Secrets:
| Model prefix | Provider | Key needed |
|---|---|---|
| `gpt-*` | OpenAI | `OPENAI_API_KEY` |
| `claude-*` | Anthropic | `ANTHROPIC_API_KEY` |
| `llama-*`, `mixtral-*` | Groq | `GROQ_API_KEY` |
Example: Policy with LLM Judge
{
"require_source_grounding": true,
"min_grounding_score": 0.7,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"action_on_violation": "block",
"llm_grounding_check": true,
"llm_grounding_model": "gpt-4o-mini",
"llm_grounding_threshold": 0.7,
"llm_grounding_phase": "after_workflow"
}
Graceful Degradation
The LLM judge never blocks your agents due to infrastructure failures:
- No API key configured — judge is skipped, threshold checks still run
- LLM API error — warning logged, judge skipped
- Unparseable response — defaults to 0.5 (neutral score)
- `llm_grounding_check: false` (default) — judge completely skipped, backward compatible
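The unparseable-response fallback can be illustrated with a small parsing helper. The real judge's parsing logic lives inside the platform; `parse_judge_score` is a hypothetical name used only to show the neutral-score default and clamping behavior.

```python
def parse_judge_score(raw) -> float:
    """Parse an LLM judge reply into a 0.0-1.0 score, defaulting to neutral."""
    try:
        score = float(str(raw).strip())
    except ValueError:
        return 0.5  # unparseable response defaults to the neutral score
    return min(1.0, max(0.0, score))  # clamp out-of-range replies into 0.0-1.0
```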
Cost
The judge uses minimal tokens (~500 input + 10 output per evaluation):
| Model | Approximate cost per evaluation |
|---|---|
| `gpt-4o-mini` | ~$0.0001 |
| `llama-3.1-8b-instant` (Groq) | Near-zero |
| `claude-sonnet-4-20250514` | ~$0.002 |
Accumulating Grounding Data
The SDK's record_grounding() method accumulates entries — each call appends to a list rather than overwriting. This supports multi-step workflows where different steps produce different grounding data:
async with waxell.WaxellContext(...) as ctx:
    # Step 1: Retrieve financial sources
    ctx.record_grounding(
        grounding_scores=[0.92, 0.87],
        citations=["Federal Reserve Report", "IMF Analysis"],
        output_confidence=0.89,
    )
    # → mid_execution checks these scores

    # Step 2: Retrieve legal sources
    ctx.record_grounding(
        grounding_scores=[0.85, 0.91],
        citations=["SEC Filing 2023", "Legal Review"],
        output_confidence=0.88,
    )
    # → mid_execution checks these scores too
    # → after_workflow evaluates ALL citations from both steps
Calibrating Thresholds
Grounding scores are computed by your pipeline — the grounding policy evaluates them but does not compute them. Common approaches:
| Approach | How to compute scores |
|---|---|
| Embedding similarity | Cosine similarity between each output sentence and its retrieved source chunk |
| LLM self-evaluation | Ask the LLM to score its own claim grounding (0-10 scale, normalize to 0.0-1.0) |
| Cross-encoder reranker | Use a cross-encoder model to score claim-source relevance |
| NLI classifier | Natural Language Inference model classifies each claim as entailed/neutral/contradicted |
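As an illustration of the embedding-similarity row above, here is a minimal sketch. It assumes you already have one embedding per output sentence and one per retrieved source chunk; the helper names are hypothetical and any real pipeline would use an embedding model's vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def grounding_scores(sentence_embs, source_embs):
    """One score per output sentence: best match against any retrieved chunk."""
    return [max(cosine(s, src) for src in source_embs) for s in sentence_embs]
```

The resulting list maps directly onto the `grounding_scores` field of `ctx.record_grounding()`.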
Start with action_on_violation: "warn" and observe the distribution of scores across real executions. Set min_grounding_score at the 5th-percentile score from good runs — this blocks genuinely weak outputs without over-triggering on edge cases.
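One way to compute that 5th-percentile cutoff from scores logged during a warn-mode calibration period; the sample values below are made up for illustration.

```python
import statistics

# Hypothetical per-claim scores collected from runs you judged to be good
good_run_scores = [0.91, 0.84, 0.88, 0.79, 0.95, 0.82, 0.90, 0.76, 0.87, 0.93]

# quantiles(n=20) returns 19 cut points; the first one is the 5th percentile
p5 = statistics.quantiles(good_run_scores, n=20, method="inclusive")[0]
min_grounding_score = round(p5, 2)
```

Recompute periodically as your retrieval corpus and prompts evolve, since the score distribution shifts with them.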
Common Gotchas
- `mid_execution` fires per `record_grounding()` call, not per-claim. If you report all scores in a single call and one is below threshold, the entire call is rejected. Split into separate calls only if you want per-claim mid-execution enforcement.
- `unsupported_claims` takes a list of strings. Pass the actual claim text in the list. The handler compares `len(unsupported_claims) > max_unsupported_claims`. An empty list `[]` means zero unsupported claims.
- `max_unsupported_claims: null` (the default) disables the check. Set an explicit integer (including `0`) to enable it. `0` means any unsupported claim blocks the output.
- `abstention_threshold` requires `output_confidence` to be set. If you record grounding but omit `output_confidence`, the abstention threshold is skipped even if configured.
- `require_source_grounding` and `min_citations` can double-fire. With `require_source_grounding=true` and `min_citations=1`, an empty citation list triggers both warnings. The result reason string concatenates them with `;`.
- Default `action_on_violation` is `"warn"`. Unlike most handlers, grounding defaults to warn mode. Explicitly set `"block"` for production enforcement.
- `factual_consistency_check` is a no-op. The rule is schema-valid and accepted, but the handler does not implement cross-claim consistency checking yet. It is reserved for a future release.
Combining with Other Policies
- retrieval: Use `retrieval` to govern the quality of source documents retrieved; use `grounding` to govern whether the output is actually grounded in those sources. They complement each other in RAG pipelines.
- content: Use `content` to detect PII or prohibited content in outputs; use `grounding` to ensure outputs are factually accurate. Both protect the quality of agent outputs from different angles.
- quality: Use `quality` for general output quality metrics (length, format, tone); use `grounding` specifically for factual accuracy and citation requirements.
- compliance: In regulated industries (medical, legal, financial), combine `grounding` with `compliance` to ensure both factual accuracy and regulatory adherence.
Next Steps
- Policy & Governance -- How policy enforcement works
- Scope Policy -- Govern blast radius of data operations
- Retrieval Policy -- Govern the quality of retrieved sources
- Quality Policy -- General output quality governance
- Policy Categories & Templates -- All 26 categories