DeepEval Agent

A multi-agent DeepEval evaluation pipeline that coordinates a deepeval-runner (generates an LLM response and evaluates it with individual AnswerRelevancy and Faithfulness metrics) and a deepeval-evaluator (runs batch evaluation across 3 test cases, analyzes the pass rate with @reasoning, and scores overall quality). It demonstrates the DeepEval instrumentor integration pattern with metric-based LLM output evaluation.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.

Architecture

Key Code

Individual metric evaluation with step recording

Each DeepEval metric is recorded as a step carrying the framework, metric name, score, threshold, and pass/fail status.

@waxell.observe(agent_name="deepeval-runner", workflow_name="deepeval-generation")
async def run_deepeval_runner(query, openai_client, dry_run=False, waxell_ctx=None):
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=[...],
    )
    answer = response.choices[0].message.content

    # Record each DeepEval metric result as a named step
    waxell.step("eval:deepeval.metric:AnswerRelevancyMetric", output={
        "framework": "deepeval", "metric": "AnswerRelevancyMetric",
        "score": 0.92, "threshold": 0.7, "passed": True,
    })
    waxell.step("eval:deepeval.metric:FaithfulnessMetric", output={
        "framework": "deepeval", "metric": "FaithfulnessMetric",
        "score": 0.78, "threshold": 0.7, "passed": True,
    })

    # Mirror the metric results as first-class scores
    waxell.score("answer_relevancy", 0.92, comment="DeepEval AnswerRelevancyMetric")
    waxell.score("faithfulness", 0.78, comment="DeepEval FaithfulnessMetric")
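The step payload has a fixed shape. As a minimal sketch, a helper could build it from a metric result (the helper name is hypothetical, not part of the waxell API):

```python
def metric_step_payload(metric: str, score: float, threshold: float) -> dict:
    # Build the output dict recorded for each DeepEval metric step.
    # (Hypothetical helper for illustration; not part of the waxell API.)
    return {
        "framework": "deepeval",
        "metric": metric,
        "score": score,
        "threshold": threshold,
        "passed": score >= threshold,
    }

payload = metric_step_payload("AnswerRelevancyMetric", 0.92, 0.7)
# payload["passed"] is True, since 0.92 >= 0.7
```

Deriving `passed` from `score >= threshold` keeps the step record consistent with the threshold instead of hard-coding the boolean.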

Batch evaluation and pass rate analysis

The evaluator runs batch evaluation and uses @waxell.reasoning_dec to analyze overall quality.

@waxell.reasoning_dec(step="pass_rate_analysis")
async def analyze_pass_rate(scores: dict, threshold: float = 0.7) -> dict:
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})"
                     for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0
                      else f"{total - passed} need improvement",
    }
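To see the shape of the result, the same pass-rate logic can be run standalone, without the decorator or the async wrapper (the sample scores are illustrative):

```python
def pass_rate_analysis(scores: dict, threshold: float = 0.7) -> dict:
    # Undecorated version of the demo's pass-rate logic, for illustration.
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})"
                     for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0
                      else f"{total - passed} need improvement",
    }

result = pass_rate_analysis({"answer_relevancy": 0.92, "faithfulness": 0.78})
# Both scores clear the 0.7 threshold, so the conclusion is "All metrics pass"
```

With any score below the threshold, the conclusion instead reports how many metrics need improvement, and the evidence list marks each metric PASS or FAIL.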

What this demonstrates

  • waxell.step() -- individual DeepEval metric results recorded as named steps with framework/metric/score/threshold/passed metadata.
  • @waxell.step_dec -- evaluation context preparation.
  • @waxell.decision -- metric selection (relevancy_only/faithfulness_only/both/full_suite) driven by an LLM call.
  • @waxell.reasoning_dec -- pass rate analysis with thought/evidence/conclusion.
  • waxell.score() -- per-metric scores plus batch_pass_rate and overall_quality.
  • Auto-instrumented LLM calls -- OpenAI calls for both metric selection and response generation.
  • Nested @waxell.observe -- orchestrator is parent; deepeval-runner and deepeval-evaluator are child agents.
  • DeepEval integration pattern -- shows how to wrap DeepEval metrics with waxell-observe for evaluation observability.
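The metric-selection decision above can be sketched as a mapping from the decision label to the metrics to run. This is an illustrative assumption, not the demo's actual routing code, and the full_suite entry here simply reuses the two metrics shown; the demo's full suite may include more:

```python
# Maps the metric-selection decision to the DeepEval metrics to run.
# (Illustrative sketch; full_suite may include additional metrics in the demo.)
METRIC_SUITES = {
    "relevancy_only": ["AnswerRelevancyMetric"],
    "faithfulness_only": ["FaithfulnessMetric"],
    "both": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
    "full_suite": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
}

def select_metrics(decision: str) -> list:
    # Fall back to running both metrics on an unrecognized decision label.
    return METRIC_SUITES.get(decision, METRIC_SUITES["both"])
```

In the demo the decision label comes from an LLM call wrapped in @waxell.decision; here it is just a plain string.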

Run it

# Dry-run (no API key needed)
python -m app.demos.deepeval_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.deepeval_agent

Source

dev/waxell-dev/app/demos/deepeval_agent.py