# RAGAS Agent
A multi-agent RAGAS evaluation pipeline coordinating two agents: a ragas-runner, which generates an LLM response and evaluates it with four individual RAGAS metrics (faithfulness, answer_relevancy, context_precision, context_recall), and a ragas-evaluator, which runs batch evaluation across five test cases, assesses RAG quality with @reasoning, and makes a deployment recommendation via waxell.decide(). The example demonstrates the RAGAS instrumentor integration pattern with RAG-specific evaluation metrics.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
## Architecture

## Key Code

### Per-metric RAGAS evaluation
Each RAGAS metric is recorded as a step with score and reason, then as a numeric score for dashboard visibility.
```python
RAGAS_METRICS = [
    ("faithfulness", 0.88, "The answer is grounded in the provided context."),
    ("answer_relevancy", 0.92, "The answer directly addresses the question."),
    ("context_precision", 0.75, "Retrieved context is mostly relevant."),
    ("context_recall", 0.82, "Most ground truth claims are covered."),
]

for metric_name, score, reason in RAGAS_METRICS:
    waxell.step(f"eval:ragas.metric:{metric_name}", output={
        "framework": "ragas", "metric": metric_name, "score": score, "reason": reason,
    })
    waxell.score(f"ragas_{metric_name}", score, comment=f"RAGAS {metric_name}: {reason}")
```
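Alongside the per-metric scores, the evaluator also records `ragas_overall` and `ragas_pass_rate`. A minimal sketch of how those aggregates can be derived in plain Python — assuming an unweighted mean for the overall score and a 0.8 pass threshold, both illustrative assumptions rather than values taken from the demo:

```python
# Demo metric values from the loop above (reasons elided).
RAGAS_METRICS = [
    ("faithfulness", 0.88, "grounded"),
    ("answer_relevancy", 0.92, "on-topic"),
    ("context_precision", 0.75, "mostly relevant"),
    ("context_recall", 0.82, "mostly covered"),
]

scores = [score for _, score, _ in RAGAS_METRICS]

# Assumption: overall = unweighted mean; a metric "passes" at score >= 0.8.
overall = sum(scores) / len(scores)
pass_rate = sum(1 for s in scores if s >= 0.8) / len(scores)

print(round(overall, 4), pass_rate)  # 0.8425 0.75
```

These two numbers would then be reported the same way as the per-metric values, e.g. `waxell.score("ragas_overall", overall)`.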
### RAG quality assessment and deployment recommendation
The evaluator assesses overall quality and makes a deploy/review/retrain recommendation.
```python
@waxell.reasoning_dec(step="rag_quality_assessment")
async def assess_rag_quality(metric_scores, overall_score, test_cases) -> dict:
    weakest = min(metric_scores, key=metric_scores.get)
    below_threshold = [k for k, v in metric_scores.items() if v < 0.8]
    return {
        "thought": f"Evaluated {len(metric_scores)} metrics. Weakest: {weakest}.",
        "conclusion": "RAG pipeline meets quality bar" if not below_threshold
        else f"Needs improvement in: {', '.join(below_threshold)}",
    }
```
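Stripped of the decorator (so it runs without the waxell SDK), the assessment logic can be exercised directly; the sample scores below reuse the demo metric values:

```python
# Sample inputs matching the per-metric demo values above.
metric_scores = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.92,
    "context_precision": 0.75,
    "context_recall": 0.82,
}

# Same logic as assess_rag_quality, without the waxell decorator.
weakest = min(metric_scores, key=metric_scores.get)
below_threshold = [k for k, v in metric_scores.items() if v < 0.8]

conclusion = ("RAG pipeline meets quality bar" if not below_threshold
              else f"Needs improvement in: {', '.join(below_threshold)}")

print(weakest, "->", conclusion)
# context_precision -> Needs improvement in: context_precision
```

With these inputs, context_precision (0.75) is both the weakest metric and the only one below the 0.8 threshold, so the conclusion flags it for improvement.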
```python
waxell.decide(
    "deployment_recommendation",
    chosen="deploy" if overall_score >= 0.85 else "review",
    options=["deploy", "review", "retrain"],
    reasoning=f"Overall RAGAS score {overall_score:.3f}",
    confidence=0.90,
)
```
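The deploy/review split is a simple threshold on the overall score. A quick worked check, assuming the overall score is the mean of the four demo metric values (0.8425):

```python
overall_score = 0.8425  # assumed: mean of the four demo metric scores

# Same branch as the waxell.decide() call above. The "retrain" option is
# listed but its trigger condition is not shown in this excerpt.
chosen = "deploy" if overall_score >= 0.85 else "review"
print(chosen)  # review
```

The demo values just miss the 0.85 deployment bar, so the recommendation falls to "review".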
## What this demonstrates
- `waxell.step()` -- individual RAGAS metric results and batch evaluation recorded as named steps.
- `@waxell.step_dec` -- RAG context preparation.
- `@waxell.decision` -- evaluation strategy selection (single_metric/core_metrics/full_suite/custom) driven by an LLM call.
- `@waxell.reasoning_dec` -- RAG quality assessment identifying the weakest and strongest metrics.
- `waxell.decide()` -- deployment recommendation (deploy/review/retrain) based on the overall score.
- `waxell.score()` -- per-metric `ragas_*` scores plus `ragas_overall` and `ragas_pass_rate`.
- Auto-instrumented LLM calls -- OpenAI calls for strategy selection and response generation.
- RAGAS integration pattern -- shows how to wrap RAGAS metrics with waxell-observe for RAG evaluation observability.
## Run it

```bash
# Dry-run (no API key needed)
python -m app.demos.ragas_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.ragas_agent
```