Grounding Policy
The grounding policy category enforces factual accuracy standards — it requires agent outputs to be grounded in cited sources, with measurable confidence scores and limits on unsupported claims.
Use it for RAG (retrieval-augmented generation) agents, research assistants, medical or legal Q&A, and any agent where hallucinated or unsupported output carries real risk.
Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `require_source_grounding` | boolean | `false` | If true, block (or warn) when zero citations are provided |
| `min_grounding_score` | number | `0.7` | Minimum acceptable score per grounding claim (0.0-1.0). Checked at mid_execution |
| `min_citations` | integer | `1` | Minimum number of source citations required in the output |
| `max_unsupported_claims` | integer | `null` | Maximum number of claims without source support. `null` means no limit |
| `factual_consistency_check` | boolean | `false` | Reserved for future cross-claim consistency validation |
| `abstention_threshold` | number | `null` | If output_confidence falls below this threshold, block (the agent should abstain rather than guess) |
| `abstention_response` | string | `null` | Custom message to return when the abstention threshold is triggered |
| `action_on_violation` | string | `"warn"` | `"block"` raises PolicyViolationError; `"warn"` logs and continues |
| `score_relevance_floor` | number | `null` | Ignore retrieval scores below this floor. Only scores at or above this value are checked against min_grounding_score |
| `score_eval_mode` | string | `"all"` | How to evaluate filtered scores: `"all"` (each must pass), `"average"` (mean must pass), `"top_n"` (only top N checked) |
| `score_top_n` | integer | `3` | When score_eval_mode="top_n", how many top scores to evaluate |
The grounding handler defaults to `action_on_violation: "warn"`, unlike most other handlers, which default to `"block"`. This is intentional: grounding quality degrades gradually, and initial deployments often need to calibrate thresholds. Change to `"block"` once you have tuned your score thresholds against real traffic.
How It Works
The grounding handler runs at two enforcement phases:
before_workflow
Stores the grounding rules into `context._grounding_rules`. Always returns ALLOW — no blocking happens at before_workflow.
mid_execution
Checks per-claim grounding scores as they are reported:
- Reads `context.grounding_scores` (a list of float values 0.0-1.0)
- If the list is empty, returns ALLOW immediately (nothing to check yet)
- For each score, checks against `min_grounding_score`
- The first score below the threshold produces a BLOCK or WARN

This phase fires every time `ctx.record_grounding()` is called with new scores.
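The per-score check can be sketched as follows. This is a simplified illustration of the behavior described above, not the actual handler source; `check_grounding_scores` is a hypothetical helper name.

```python
def check_grounding_scores(scores: list, min_grounding_score: float = 0.7):
    """Mimic the mid_execution per-score check."""
    if not scores:
        return ("ALLOW", "nothing to check yet")
    for score in scores:
        if score < min_grounding_score:
            # The first score below threshold produces the violation
            return ("VIOLATION",
                    f"Grounding score ({score}) below threshold ({min_grounding_score})")
    return ("ALLOW", "all scores pass")
```

Whether a VIOLATION becomes a BLOCK or a WARN is then decided by `action_on_violation`.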
after_workflow
Audits the final output state against citation and confidence requirements:
- Citation count: `len(context.citations) < min_citations` → violation
- Require grounding: `require_source_grounding=true` and `len(context.citations) == 0` → violation
- Unsupported claims: `context.unsupported_claims > max_unsupported_claims` → violation
- Abstention threshold: `context.output_confidence < abstention_threshold` → violation
All violations are collected and returned together in the warnings list.
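The audit above can be modeled as a single function that accumulates every failed check before returning. This is a sketch under the rule semantics documented here, not the handler's real implementation; `audit_output` is a hypothetical name.

```python
def audit_output(citations, unsupported_claims, output_confidence, rules):
    """Collect all after_workflow violations; warnings are returned together."""
    warnings = []
    min_citations = rules.get("min_citations", 1)
    if len(citations) < min_citations:
        warnings.append(f"Citations ({len(citations)}) below minimum ({min_citations})")
    if rules.get("require_source_grounding") and len(citations) == 0:
        warnings.append("No source citations provided (grounding required)")
    max_claims = rules.get("max_unsupported_claims")  # null/None disables the check
    if max_claims is not None and len(unsupported_claims) > max_claims:
        warnings.append(f"Unsupported claims ({len(unsupported_claims)}) exceeds max ({max_claims})")
    threshold = rules.get("abstention_threshold")  # skipped if confidence not recorded
    if threshold is not None and output_confidence is not None and output_confidence < threshold:
        warnings.append(f"Output confidence ({output_confidence}) below abstention threshold ({threshold})")
    return warnings
```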
Filtering for RAG Pipelines
Real RAG pipelines with `top_k=5` almost always return tail results with low similarity scores. Without filtering, `min_grounding_score` would block on irrelevant tail results, making the policy unusable.
The Relevance Floor
Set score_relevance_floor to ignore retrieval results below a relevance threshold. Only scores at or above the floor are evaluated against min_grounding_score.
Example: A query about "2008 financial crisis" returns 5 results:
- 0.92, 0.87, 0.85 (relevant finance docs)
- 0.35, 0.22 (irrelevant tail results)
With score_relevance_floor: 0.5, only the three relevant scores are checked. All three pass min_grounding_score: 0.7, so the policy allows the response.
Without the floor, the policy would block on the 0.35 score — a false positive from an irrelevant retrieval result.
Edge Case: Nothing Relevant
If NO scores pass the floor, the policy triggers a violation: "No grounding scores above relevance floor — all retrieved results appear irrelevant." This correctly catches the case where retrieval returned nothing useful.
Evaluation Modes
| Mode | Behavior |
|---|---|
| `"all"` (default) | Every filtered score must pass min_grounding_score |
| `"average"` | The mean of filtered scores must pass the threshold |
| `"top_n"` | Only the top N filtered scores are checked (set N with score_top_n) |
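The floor filtering and the three modes compose as sketched below. This is a simplified model of the documented behavior, not the handler's actual code; `evaluate_scores` is a hypothetical helper.

```python
def evaluate_scores(scores, min_score, floor=None, mode="all", top_n=3):
    """Apply score_relevance_floor, then one of the three score_eval_modes."""
    filtered = [s for s in scores if floor is None or s >= floor]
    if floor is not None and not filtered:
        return False  # all retrieved results appear irrelevant -> violation
    if not filtered:
        return True   # nothing to check
    if mode == "all":
        return all(s >= min_score for s in filtered)
    if mode == "average":
        return sum(filtered) / len(filtered) >= min_score
    if mode == "top_n":
        return all(s >= min_score for s in sorted(filtered, reverse=True)[:top_n])
    raise ValueError(f"unknown score_eval_mode: {mode}")
```

On the earlier "2008 financial crisis" scores `[0.92, 0.87, 0.85, 0.35, 0.22]`, a floor of 0.5 passes `"all"` mode at threshold 0.7, while no floor fails it on the 0.35 tail result.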
Example: RAG Pipeline Policy
{
"min_grounding_score": 0.7,
"score_relevance_floor": 0.5,
"score_eval_mode": "all",
"min_citations": 1,
"action_on_violation": "block"
}
Example: Lenient Average-Based Policy
{
"min_grounding_score": 0.7,
"score_relevance_floor": 0.4,
"score_eval_mode": "average",
"action_on_violation": "warn"
}
Set score_relevance_floor whenever your grounding scores come from retrieval similarity (e.g., Pinecone, FAISS, Weaviate). Without filtering, top_k > 3 retrieval will almost always produce tail results below any useful threshold.
If your scores come from LLM self-evaluation or NLI classifiers (where every score is meaningful), leave score_relevance_floor unset to use the legacy per-score checking.
SDK Integration
Recording Grounding Data
import waxell_observe as waxell
from waxell_observe.errors import PolicyViolationError

async with waxell.WaxellContext(
    agent_name="research-agent",
    enforce_policy=True,
) as ctx:
    # Retrieve sources and generate answer
    sources = await retrieve_sources(query)
    answer, scores = await synthesize_with_grounding(query, sources)

    # Report grounding data — triggers mid_execution check on scores
    ctx.record_grounding(
        grounding_scores=scores,   # per-claim scores [0.9, 0.85, 0.92]
        citations=sources,         # ["Source A", "Source B", "Source C"]
        unsupported_claims=[],     # any claims without source support
        output_confidence=0.88,    # overall confidence in the answer
    )

    ctx.set_result({"answer": answer, "citations": sources})
Grounding Data Fields
| Field | Type | What to put here |
|---|---|---|
| `grounding_scores` | list[float] | One score per claim or sentence in the output (0.0 = unsupported, 1.0 = fully grounded) |
| `citations` | list[str] | Source identifiers — URLs, document names, DOIs, database record IDs |
| `unsupported_claims` | list[str] | The actual text of claims you could not ground in any source |
| `output_confidence` | float | Your model's or pipeline's overall confidence that the full answer is accurate |
Handling Violations
try:
    async with waxell.WaxellContext(
        agent_name="research-agent",
        enforce_policy=True,
    ) as ctx:
        sources = await retrieve_sources(query)
        answer, scores = await synthesize_with_grounding(query, sources)
        ctx.record_grounding(
            grounding_scores=scores,
            citations=sources,
            unsupported_claims=[],
            output_confidence=0.88,
        )
        ctx.set_result({"answer": answer})
except PolicyViolationError as e:
    # e.g. "Grounding score (0.42) below threshold (0.7)"
    # e.g. "Citations (0) below minimum (1); Unsupported claims (2) exceeds max (0)"
    print(f"Grounding policy blocked: {e}")
    return abstention_message
Example Policies
Strict Research Assistant
Block any output with low grounding scores, missing citations, or low confidence:
{
"require_source_grounding": true,
"min_grounding_score": 0.8,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"abstention_response": "I don't have sufficient grounded evidence to answer this accurately.",
"action_on_violation": "block"
}
Medical Q&A
High accuracy standards — abstain rather than guess:
{
"require_source_grounding": true,
"min_grounding_score": 0.9,
"min_citations": 3,
"max_unsupported_claims": 0,
"abstention_threshold": 0.7,
"abstention_response": "I cannot provide a reliable answer based on available evidence. Please consult a qualified professional.",
"action_on_violation": "block"
}
Lenient RAG Agent
Warn on grounding issues but do not block execution during development:
{
"require_source_grounding": false,
"min_grounding_score": 0.6,
"min_citations": 1,
"max_unsupported_claims": 3,
"abstention_threshold": null,
"action_on_violation": "warn"
}
Citation-Only Enforcement
Require citations but do not enforce per-claim scores (useful when scores are not available):
{
"require_source_grounding": true,
"min_citations": 1,
"max_unsupported_claims": 2,
"action_on_violation": "block"
}
Enforcement Flow
Agent starts (WaxellContext.__aenter__)
│
└── before_workflow governance runs
└── Stores grounding rules → ALLOW
Agent runs — calls ctx.record_grounding(grounding_scores=[...], ...)
│
└── mid_execution governance runs
├── grounding_scores empty? → ALLOW (nothing to check)
├── For each score:
│ └── score < min_grounding_score? → BLOCK/WARN (first violation only)
└── All scores pass → ALLOW
Agent completes (WaxellContext.__aexit__)
│
└── after_workflow governance runs
├── len(citations) < min_citations? → violation
├── require_source_grounding AND len(citations) == 0? → violation
├── unsupported_claims > max_unsupported_claims? → violation
├── output_confidence < abstention_threshold? → violation
├── Any violations?
│ ├── action=block → BLOCK (PolicyViolationError)
│ └── action=warn → WARN (agent completed, warning recorded)
└── No violations → ALLOW with citation_count metadata
Creating via Dashboard
- Navigate to Governance > Policies
- Click New Policy
- Select category Grounding
- Set your grounding thresholds
- Set `action_on_violation` to `warn` initially to calibrate, then `block` for production
- Set scope to target specific agents (e.g., `research-agent`)
- Enable the policy
Creating via API
curl -X POST \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
https://acme.waxell.dev/waxell/v1/policies/ \
-d '{
"name": "Strict Research Grounding",
"category": "grounding",
"rules": {
"require_source_grounding": true,
"min_grounding_score": 0.8,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"action_on_violation": "block"
},
"scope": {
"agents": ["research-agent"]
},
"enabled": true
}'
Observability
Governance Tab
Grounding evaluations appear at each enforcement phase:
mid_execution (per-score check):
| Field | Example |
|---|---|
| Policy name | Strict Research Grounding |
| Action | block |
| Phase | mid_execution |
| Reason | "Grounding score (0.42) below threshold (0.7)" |
| Metadata | {"score": 0.42, "threshold": 0.7} |
after_workflow (citation and confidence audit):
| Field | Example |
|---|---|
| Action | block |
| Phase | after_workflow |
| Reason | "Citations (0) below minimum (1); No source citations provided (grounding required)" |
| Metadata | {"warnings": [...], "citation_count": 0} |
For successful runs:
| Field | Example |
|---|---|
| Action | allow |
| Reason | "Grounding audit passed (3 citations)" |
| Metadata | {"citation_count": 3} |
Trace Tab
The source_retrieval tool span (if recorded with ctx.record_tool_call) shows the number of sources found. The grounding data (scores, citations, confidence) appears in the steps data alongside any LLM call records.
LLM Grounding Judge
In addition to checking self-reported scores and citation counts, the grounding handler supports LLM-based grounding evaluation — an independent AI judge that reads the output and its cited sources, then scores whether the output is genuinely grounded.
This is an admin-controlled feature: developers record grounding data as usual, and the platform automatically runs the LLM judge when the policy enables it. No developer code changes are needed.
LLM Judge Rules
| Rule | Type | Default | Description |
|---|---|---|---|
| `llm_grounding_check` | boolean | `false` | Enable LLM-based grounding evaluation |
| `llm_grounding_model` | string | `"gpt-4o-mini"` | Model to use for the judge (gpt-*, claude-*, llama-*, mixtral-*) |
| `llm_grounding_threshold` | number | `0.7` | Minimum LLM judge score (0.0-1.0) |
| `llm_grounding_criteria` | string | `""` | Custom evaluation criteria appended to the judge prompt |
| `llm_grounding_phase` | string | `"mid_execution"` | When to run the judge: `"mid_execution"`, `"after_workflow"`, or `"both"` |
How It Works
The LLM judge evaluates three dimensions:
- Source support — Is every factual claim in the output supported by at least one cited source?
- Citation correctness — Are the citations used correctly (not fabricated or misattributed)?
- Scope adherence — Does the output avoid making claims beyond what the sources support?
The judge returns a single score from 0.0 to 1.0:
| Score | Meaning |
|---|---|
| 0.0 | Completely ungrounded (fabricated facts) |
| 0.3 | Poorly grounded (many unsupported claims) |
| 0.5 | Partially grounded (some claims supported) |
| 0.7 | Mostly grounded (minor gaps) |
| 1.0 | Fully grounded (every claim traceable) |
Phase Behavior
- `mid_execution`: After each `record_grounding()` call, the judge evaluates the latest LLM response against the citations just recorded. A score below threshold produces an immediate BLOCK or WARN.
- `after_workflow`: After the agent completes, the judge evaluates the final output against all citations collected across all `record_grounding()` calls. A score below threshold adds to the violation warnings.
- `both`: Runs at both phases.
Supported Providers
The judge model uses your organization's API keys stored in Tenant Secrets:
| Model prefix | Provider | Key needed |
|---|---|---|
| `gpt-*` | OpenAI | `OPENAI_API_KEY` |
| `claude-*` | Anthropic | `ANTHROPIC_API_KEY` |
| `llama-*`, `mixtral-*` | Groq | `GROQ_API_KEY` |
Example: Policy with LLM Judge
{
"require_source_grounding": true,
"min_grounding_score": 0.7,
"min_citations": 2,
"max_unsupported_claims": 0,
"abstention_threshold": 0.5,
"action_on_violation": "block",
"llm_grounding_check": true,
"llm_grounding_model": "gpt-4o-mini",
"llm_grounding_threshold": 0.7,
"llm_grounding_phase": "after_workflow"
}
Graceful Degradation
The LLM judge never blocks your agents due to infrastructure failures:
- No API key configured — judge is skipped, threshold checks still run
- LLM API error — warning logged, judge skipped
- Unparseable response — defaults to 0.5 (neutral score)
- `llm_grounding_check: false` (default) — judge completely skipped, backward compatible
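The unparseable-response fallback can be illustrated with a small parsing helper. The real judge's parsing logic lives inside the platform; `parse_judge_score` is a hypothetical name used only to show the neutral-score default and clamping behavior.

```python
def parse_judge_score(raw) -> float:
    """Parse an LLM judge reply into a 0.0-1.0 score, defaulting to neutral."""
    try:
        score = float(str(raw).strip())
    except ValueError:
        return 0.5  # unparseable response defaults to the neutral score
    return min(1.0, max(0.0, score))  # clamp out-of-range replies into 0.0-1.0
```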
Cost
The judge uses minimal tokens (~500 input + 10 output per evaluation):
| Model | Approximate cost per evaluation |
|---|---|
| `gpt-4o-mini` | ~$0.0001 |
| `llama-3.1-8b-instant` (Groq) | Near-zero |
| `claude-sonnet-4-20250514` | ~$0.002 |
Accumulating Grounding Data
The SDK's record_grounding() method accumulates entries — each call appends to a list rather than overwriting. This supports multi-step workflows where different steps produce different grounding data:
async with waxell.WaxellContext(...) as ctx:
    # Step 1: Retrieve financial sources
    ctx.record_grounding(
        grounding_scores=[0.92, 0.87],
        citations=["Federal Reserve Report", "IMF Analysis"],
        output_confidence=0.89,
    )
    # → mid_execution checks these scores

    # Step 2: Retrieve legal sources
    ctx.record_grounding(
        grounding_scores=[0.85, 0.91],
        citations=["SEC Filing 2023", "Legal Review"],
        output_confidence=0.88,
    )
    # → mid_execution checks these scores too
    # → after_workflow evaluates ALL citations from both steps
Calibrating Thresholds
Grounding scores are computed by your pipeline — the grounding policy evaluates them but does not compute them. Common approaches:
| Approach | How to compute scores |
|---|---|
| Embedding similarity | Cosine similarity between each output sentence and its retrieved source chunk |
| LLM self-evaluation | Ask the LLM to score its own claim grounding (0-10 scale, normalize to 0.0-1.0) |
| Cross-encoder reranker | Use a cross-encoder model to score claim-source relevance |
| NLI classifier | Natural Language Inference model classifies each claim as entailed/neutral/contradicted |
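As an illustration of the embedding-similarity row above, here is a minimal sketch. It assumes you already have one embedding per output sentence and one per retrieved source chunk; the helper names are hypothetical and any real pipeline would use an embedding model's vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def grounding_scores(sentence_embs, source_embs):
    """One score per output sentence: best match against any retrieved chunk."""
    return [max(cosine(s, src) for src in source_embs) for s in sentence_embs]
```

The resulting list maps directly onto the `grounding_scores` field of `ctx.record_grounding()`.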
Start with action_on_violation: "warn" and observe the distribution of scores across real executions. Set min_grounding_score at the 5th-percentile score from good runs — this blocks genuinely weak outputs without over-triggering on edge cases.
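One way to compute that 5th-percentile cutoff from scores logged during a warn-mode calibration period; the sample values below are made up for illustration.

```python
import statistics

# Hypothetical per-claim scores collected from runs you judged to be good
good_run_scores = [0.91, 0.84, 0.88, 0.79, 0.95, 0.82, 0.90, 0.76, 0.87, 0.93]

# quantiles(n=20) returns 19 cut points; the first one is the 5th percentile
p5 = statistics.quantiles(good_run_scores, n=20, method="inclusive")[0]
min_grounding_score = round(p5, 2)
```

Recompute periodically as your retrieval corpus and prompts evolve, since the score distribution shifts with them.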
Common Gotchas
- `mid_execution` fires per `record_grounding()` call, not per-claim. If you report all scores in a single call and one is below threshold, the entire call is rejected. Split into separate calls only if you want per-claim mid-execution enforcement.
- `unsupported_claims` takes a list of strings. Pass the actual claim text in the list. The handler compares `len(unsupported_claims) > max_unsupported_claims`. An empty list `[]` means zero unsupported claims.
- `max_unsupported_claims: null` (the default) disables the check. Set an explicit integer (including `0`) to enable it. `0` means any unsupported claim blocks the output.
- `abstention_threshold` requires `output_confidence` to be set. If you record grounding but omit `output_confidence`, the abstention threshold is skipped even if configured.
- `require_source_grounding` and `min_citations` can double-fire. With `require_source_grounding=true` and `min_citations=1`, an empty citation list triggers both warnings. The result reason string concatenates them with `;`.
- Default `action_on_violation` is `"warn"`. Unlike most handlers, grounding defaults to warn mode. Explicitly set `"block"` for production enforcement.
- `factual_consistency_check` is a no-op. The rule is schema-valid and accepted, but the handler does not implement cross-claim consistency checking yet. It is reserved for a future release.
Combining with Other Policies
- retrieval: Use `retrieval` to govern the quality of source documents retrieved; use `grounding` to govern whether the output is actually grounded in those sources. They complement each other in RAG pipelines.
- content: Use `content` to detect PII or prohibited content in outputs; use `grounding` to ensure outputs are factually accurate. Both protect the quality of agent outputs from different angles.
- quality: Use `quality` for general output quality metrics (length, format, tone); use `grounding` specifically for factual accuracy and citation requirements.
- compliance: In regulated industries (medical, legal, financial), combine `grounding` with `compliance` to ensure both factual accuracy and regulatory adherence.
Next Steps
- Policy & Governance -- How policy enforcement works
- Scope Policy -- Govern blast radius of data operations
- Retrieval Policy -- Govern the quality of retrieved sources
- Quality Policy -- General output quality governance
- Policy Categories & Templates -- All 26 categories