Eval Battery Agent
An evaluation stress test that generates an LLM response and runs it through 6 evaluation frameworks (DeepEval, RAGAS, Braintrust, TruLens, Giskard, PromptFoo), scoring 24 total metrics across faithfulness, relevancy, coherence, toxicity, bias, hallucination, robustness, and more. It produces per-framework pass rates and an aggregate verdict, and identifies the weakest metrics for improvement via @waxell.reasoning_dec and @waxell.decision.
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
Architecture
Key Code
Evaluation framework tool with per-metric scoring
Each framework runs through a single @waxell.tool(tool_type="evaluation") wrapper, with individual metric scores recorded via waxell.score().
@waxell.tool(tool_type="evaluation")
def run_eval_framework(eval_client, framework_name, answer, query, context_texts) -> dict:
    result = eval_client.evaluate(answer, query, context_texts)
    return {
        "framework": framework_name,
        "pass_rate": result.pass_rate,
        "overall_passed": result.overall_passed,
        "scores": {m.metric: m.score for m in result.metrics},
    }

# Per-metric scores recorded for each framework
for m in result.metrics:
    waxell.score(
        name=f"{framework_name}.{m.metric}", value=m.score,
        data_type="numeric",
        comment=f"source={framework_name} | threshold={m.threshold} | {'PASS' if m.passed else 'FAIL'}",
    )
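The wrapper above assumes each framework client returns a result object exposing per-metric scores, a pass rate, and an overall flag. A minimal sketch of that shape (the class and field names here are illustrative, not the real client API):

```python
from dataclasses import dataclass, field

@dataclass
class MetricResult:
    metric: str       # e.g. "faithfulness"
    score: float      # normalized 0.0-1.0
    threshold: float  # pass cutoff for this metric

    @property
    def passed(self) -> bool:
        return self.score >= self.threshold

@dataclass
class EvalResult:
    metrics: list[MetricResult] = field(default_factory=list)

    @property
    def pass_rate(self) -> float:
        # Fraction of metrics at or above their thresholds
        if not self.metrics:
            return 0.0
        return sum(m.passed for m in self.metrics) / len(self.metrics)

    @property
    def overall_passed(self) -> bool:
        return all(m.passed for m in self.metrics)
```

With this shape, a framework scoring faithfulness at 0.92 and contextual_recall at 0.45 against a 0.7 threshold yields a pass_rate of 0.5 and overall_passed of False, which is exactly what the wrapper forwards to waxell.score().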
Weak metric analysis and pass/fail verdict
The pipeline analyzes systematic weaknesses and renders a final verdict with confidence.
@waxell.reasoning_dec(step="analyze_weak_metrics")
def analyze_weak_metrics(verdict: str) -> dict:
    return {
        "thought": "The contextual_recall score of 0.45 in DeepEval is a clear outlier...",
        "evidence": ["deepeval.contextual_recall=0.45 (FAIL)", "deepeval.faithfulness=0.92 (strong)"],
        "conclusion": "Primary weakness: retrieval coverage. Strengths: faithfulness, safety.",
    }
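The narrative thought above stands in for a computation the pipeline can perform over the recorded scores: rank every framework.metric score and surface the lowest failures. A sketch (the score dict and the shared 0.7 threshold are illustrative):

```python
def weakest_metrics(scores: dict[str, float], threshold: float = 0.7, k: int = 3) -> list[tuple[str, float]]:
    # Keep only failing metrics, sorted ascending, and return the k weakest
    failing = [(name, s) for name, s in scores.items() if s < threshold]
    return sorted(failing, key=lambda item: item[1])[:k]

scores = {
    "deepeval.faithfulness": 0.92,
    "deepeval.contextual_recall": 0.45,
    "ragas.answer_relevancy": 0.81,
    "trulens.groundedness": 0.66,
}
print(weakest_metrics(scores))
# [('deepeval.contextual_recall', 0.45), ('trulens.groundedness', 0.66)]
```

Feeding the result into the "evidence" list keeps the reasoning step grounded in the same per-metric scores already recorded via waxell.score().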
@waxell.decision(name="eval_verdict", options=["pass", "fail"])
def make_eval_verdict(verdict, total_metrics, total_passed, ...) -> dict:
    return {"chosen": verdict, "reasoning": reasoning_text, "confidence": 0.91}
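One way the verdict could be derived from the aggregate counts; the 0.8 cutoff and the confidence formula below are assumptions for illustration, not taken from the demo:

```python
def make_verdict(total_passed: int, total_metrics: int, cutoff: float = 0.8) -> dict:
    # Aggregate pass rate across all frameworks' metrics
    rate = total_passed / total_metrics
    verdict = "pass" if rate >= cutoff else "fail"
    # Assumed heuristic: confidence grows with distance from the cutoff
    confidence = min(1.0, 0.5 + abs(rate - cutoff))
    return {"chosen": verdict, "pass_rate": round(rate, 2), "confidence": round(confidence, 2)}

print(make_verdict(21, 24))  # 21 of 24 metrics passed -> "pass"
```

The returned dict mirrors the shape @waxell.decision expects: a chosen option plus supporting metadata.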
What this demonstrates
- @waxell.tool(tool_type="evaluation") -- 6 evaluation framework invocations (DeepEval, RAGAS, Braintrust, TruLens, Giskard, PromptFoo).
- waxell.score() -- 24 individual metric scores + 7 aggregate scores (31+ total).
- @waxell.step_dec -- context setup, LLM generation, aggregation, and per-framework eval steps.
- @waxell.reasoning_dec -- systematic weakness analysis across all 24 metrics.
- @waxell.decision -- pass/fail verdict with confidence and metadata.
- waxell.step() -- per-framework evaluation summaries.
- Auto-instrumented LLM calls -- OpenAI gpt-4o call for RAG-style answer generation.
- 4 reference documents -- mock RAG context for grounded evaluation.
- 10-step pipeline -- maximum-depth evaluation integration stress test.
Run it
# Dry-run (no API key needed)
python -m app.demos.eval_battery_agent --dry-run
# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.eval_battery_agent