# RAGAS Agent
A multi-agent RAGAS evaluation pipeline coordinating two agents: a ragas-runner, which generates an LLM response and evaluates it with four individual RAGAS metrics (faithfulness, answer_relevancy, context_precision, context_recall), and a ragas-evaluator, which runs batch evaluation across five test cases, assesses RAG quality with @reasoning, and makes a deployment recommendation via waxell.decide(). The example demonstrates the RAGAS instrumentor integration pattern with RAG-specific evaluation metrics.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
## Architecture

## Key Code

### Per-metric RAGAS evaluation
Each RAGAS metric is recorded as a step with score and reason, then as a numeric score for dashboard visibility.
```python
RAGAS_METRICS = [
    ("faithfulness", 0.88, "The answer is grounded in the provided context."),
    ("answer_relevancy", 0.92, "The answer directly addresses the question."),
    ("context_precision", 0.75, "Retrieved context is mostly relevant."),
    ("context_recall", 0.82, "Most ground truth claims are covered."),
]

for metric_name, score, reason in RAGAS_METRICS:
    waxell.step(f"eval:ragas.metric:{metric_name}", output={
        "framework": "ragas", "metric": metric_name, "score": score, "reason": reason,
    })
    waxell.score(f"ragas_{metric_name}", score, comment=f"RAGAS {metric_name}: {reason}")
```
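Alongside the per-metric scores, the evaluator also records `ragas_overall` and `ragas_pass_rate`. A minimal sketch of how those aggregates can be derived in plain Python — assuming an unweighted mean for the overall score and a 0.8 pass threshold, both illustrative assumptions rather than values taken from the demo:

```python
# Demo metric values from the loop above (reasons elided).
RAGAS_METRICS = [
    ("faithfulness", 0.88, "grounded"),
    ("answer_relevancy", 0.92, "on-topic"),
    ("context_precision", 0.75, "mostly relevant"),
    ("context_recall", 0.82, "mostly covered"),
]

scores = [score for _, score, _ in RAGAS_METRICS]

# Assumption: overall = unweighted mean; a metric "passes" at score >= 0.8.
overall = sum(scores) / len(scores)
pass_rate = sum(1 for s in scores if s >= 0.8) / len(scores)

print(round(overall, 4), pass_rate)  # 0.8425 0.75
```

These two numbers would then be reported the same way as the per-metric values, e.g. `waxell.score("ragas_overall", overall)`.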
### RAG quality assessment and deployment recommendation
The evaluator assesses overall quality and makes a deploy/review/retrain recommendation.
```python
@waxell.reasoning_dec(step="rag_quality_assessment")
async def assess_rag_quality(metric_scores, overall_score, test_cases) -> dict:
    weakest = min(metric_scores, key=metric_scores.get)
    below_threshold = [k for k, v in metric_scores.items() if v < 0.8]
    return {
        "thought": f"Evaluated {len(metric_scores)} metrics. Weakest: {weakest}.",
        "conclusion": "RAG pipeline meets quality bar" if not below_threshold
        else f"Needs improvement in: {', '.join(below_threshold)}",
    }
```
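Stripped of the decorator (so it runs without the waxell SDK), the assessment logic can be exercised directly; the sample scores below reuse the demo metric values:

```python
# Sample inputs matching the per-metric demo values above.
metric_scores = {
    "faithfulness": 0.88,
    "answer_relevancy": 0.92,
    "context_precision": 0.75,
    "context_recall": 0.82,
}

# Same logic as assess_rag_quality, without the waxell decorator.
weakest = min(metric_scores, key=metric_scores.get)
below_threshold = [k for k, v in metric_scores.items() if v < 0.8]

conclusion = ("RAG pipeline meets quality bar" if not below_threshold
              else f"Needs improvement in: {', '.join(below_threshold)}")

print(weakest, "->", conclusion)
# context_precision -> Needs improvement in: context_precision
```

With these inputs, context_precision (0.75) is both the weakest metric and the only one below the 0.8 threshold, so the conclusion flags it for improvement.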
```python
waxell.decide(
    "deployment_recommendation",
    chosen="deploy" if overall_score >= 0.85 else "review",
    options=["deploy", "review", "retrain"],
    reasoning=f"Overall RAGAS score {overall_score:.3f}",
    confidence=0.90,
)
```
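The deploy/review split is a simple threshold on the overall score. A quick worked check, assuming the overall score is the mean of the four demo metric values (0.8425):

```python
overall_score = 0.8425  # assumed: mean of the four demo metric scores

# Same branch as the waxell.decide() call above. The "retrain" option is
# listed but its trigger condition is not shown in this excerpt.
chosen = "deploy" if overall_score >= 0.85 else "review"
print(chosen)  # review
```

The demo values just miss the 0.85 deployment bar, so the recommendation falls to "review".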
## What this demonstrates
- `waxell.step()` -- individual RAGAS metric results and batch evaluation recorded as named steps.
- `@waxell.step_dec` -- RAG context preparation.
- `@waxell.decision` -- evaluation strategy selection (single_metric/core_metrics/full_suite/custom) driven by an LLM call.
- `@waxell.reasoning_dec` -- RAG quality assessment identifying the weakest and strongest metrics.
- `waxell.decide()` -- deployment recommendation (deploy/review/retrain) based on the overall score.
- `waxell.score()` -- per-metric `ragas_*` scores plus `ragas_overall` and `ragas_pass_rate`.
- Auto-instrumented LLM calls -- OpenAI calls for strategy selection and response generation.
- RAGAS integration pattern -- shows how to wrap RAGAS metrics with waxell-observe for RAG evaluation observability.
## Run it

```bash
# Dry-run (no API key needed)
python -m app.demos.ragas_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.ragas_agent
```