Eval Battery Agent
An evaluation stress test that generates an LLM response and runs it through 6 evaluation frameworks (DeepEval, RAGAS, Braintrust, TruLens, Giskard, PromptFoo), scoring 24 total metrics across faithfulness, relevancy, coherence, toxicity, bias, hallucination, robustness, and more. It produces per-framework pass rates and an aggregate verdict, and identifies the weakest metrics for improvement via @waxell.reasoning_dec and @waxell.decision.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
Architecture
Key Code
Evaluation framework tool with per-metric scoring
Each framework runs through a single @waxell.tool(tool_type="evaluation") wrapper, with individual metric scores recorded via waxell.score().
@waxell.tool(tool_type="evaluation")
def run_eval_framework(eval_client, framework_name, answer, query, context_texts) -> dict:
    result = eval_client.evaluate(answer, query, context_texts)
    return {
        "framework": framework_name,
        "pass_rate": result.pass_rate,
        "overall_passed": result.overall_passed,
        "scores": {m.metric: m.score for m in result.metrics},
    }

# Per-metric scores recorded for each framework
for m in result.metrics:
    waxell.score(
        name=f"{framework_name}.{m.metric}", value=m.score,
        data_type="numeric",
        comment=f"source={framework_name} | threshold={m.threshold} | {'PASS' if m.passed else 'FAIL'}",
    )
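The wrapper above assumes each framework client returns a result object exposing per-metric scores, a pass rate, and an overall flag. A minimal sketch of that shape (the class and field names here are illustrative, not the real client API):

```python
from dataclasses import dataclass, field

@dataclass
class MetricResult:
    metric: str       # e.g. "faithfulness"
    score: float      # normalized 0.0-1.0
    threshold: float  # pass cutoff for this metric

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

@dataclass
class EvalResult:
    metrics: list[MetricResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        # Fraction of metrics at or above their thresholds
        if not self.metrics:
            return 0.0
        return sum(m.passed for m in self.metrics) / len(self.metrics)

    @property
    def overall_passed(self) -> bool:
        return all(m.passed for m in self.metrics)
```

With this shape, a framework scoring faithfulness at 0.92 and contextual_recall at 0.45 against a 0.7 threshold yields a pass_rate of 0.5 and overall_passed of False, which is exactly what the wrapper forwards to waxell.score().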
Weak metric analysis and pass/fail verdict
The pipeline analyzes systematic weaknesses and renders a final verdict with confidence.
@waxell.reasoning_dec(step="analyze_weak_metrics")
def analyze_weak_metrics(verdict: str) -> dict:
    return {
        "thought": "The contextual_recall score of 0.45 in DeepEval is a clear outlier...",
        "evidence": ["deepeval.contextual_recall=0.45 (FAIL)", "deepeval.faithfulness=0.92 (strong)"],
        "conclusion": "Primary weakness: retrieval coverage. Strengths: faithfulness, safety.",
    }
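The narrative thought above stands in for a computation the pipeline can perform over the recorded scores: rank every framework.metric score and surface the lowest failures. A sketch (the score dict and the shared 0.7 threshold are illustrative):

```python
def weakest_metrics(scores: dict[str, float], threshold: float = 0.7, k: int = 3) -> list[tuple[str, float]]:
    # Keep only failing metrics, sorted ascending, and return the k weakest
    failing = [(name, s) for name, s in scores.items() if s < threshold]
    return sorted(failing, key=lambda item: item[1])[:k]

scores = {
    "deepeval.faithfulness": 0.92,
    "deepeval.contextual_recall": 0.45,
    "ragas.answer_relevancy": 0.81,
    "trulens.groundedness": 0.66,
}
print(weakest_metrics(scores))
# [('deepeval.contextual_recall', 0.45), ('trulens.groundedness', 0.66)]
```

Feeding the result into the "evidence" list keeps the reasoning step grounded in the same per-metric scores already recorded via waxell.score().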
@waxell.decision(name="eval_verdict", options=["pass", "fail"])
def make_eval_verdict(verdict, total_metrics, total_passed, ...) -> dict:
    return {"chosen": verdict, "reasoning": reasoning_text, "confidence": 0.91}
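One way the verdict could be derived from the aggregate counts; the 0.8 cutoff and the confidence formula below are assumptions for illustration, not taken from the demo:

```python
def make_verdict(total_passed: int, total_metrics: int, cutoff: float = 0.8) -> dict:
    # Aggregate pass rate across all frameworks' metrics
    rate = total_passed / total_metrics
    verdict = "pass" if rate >= cutoff else "fail"
    # Assumed heuristic: confidence grows with distance from the cutoff
    confidence = min(1.0, 0.5 + abs(rate - cutoff))
    return {"chosen": verdict, "pass_rate": round(rate, 2), "confidence": round(confidence, 2)}

print(make_verdict(21, 24))  # 21 of 24 metrics passed -> "pass"
```

The returned dict mirrors the shape @waxell.decision expects: a chosen option plus supporting metadata.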
What this demonstrates
- @waxell.tool(tool_type="evaluation") -- 6 evaluation framework invocations (DeepEval, RAGAS, Braintrust, TruLens, Giskard, PromptFoo).
- waxell.score() -- 24 individual metric scores + 7 aggregate scores (31+ total).
- @waxell.step_dec -- context setup, LLM generation, aggregation, and per-framework eval steps.
- @waxell.reasoning_dec -- systematic weakness analysis across all 24 metrics.
- @waxell.decision -- pass/fail verdict with confidence and metadata.
- waxell.step() -- per-framework evaluation summaries.
- Auto-instrumented LLM calls -- OpenAI gpt-4o call for RAG-style answer generation.
- 4 reference documents -- mock RAG context for grounded evaluation.
- 10-step pipeline -- maximum-depth evaluation integration stress test.
Run it
# Dry-run (no API key needed)
python -m app.demos.eval_battery_agent --dry-run
# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.eval_battery_agent