DeepEval Agent

A multi-agent DeepEval evaluation pipeline that coordinates a deepeval-runner (generates an LLM response and evaluates it with individual AnswerRelevancy and Faithfulness metrics) and a deepeval-evaluator (runs batch evaluation across 3 test cases, analyzes the pass rate with @reasoning, and scores overall quality). It demonstrates the DeepEval instrumentor integration pattern with metric-based LLM output evaluation.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.

Architecture

Key Code

Individual metric evaluation with step recording

Each DeepEval metric is recorded as a step carrying the framework, metric name, score, threshold, and pass/fail status.

@waxell.observe(agent_name="deepeval-runner", workflow_name="deepeval-generation")
async def run_deepeval_runner(query, openai_client, dry_run=False, waxell_ctx=None):
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=[...],
    )
    answer = response.choices[0].message.content

    # Record each DeepEval metric result as a named step
    waxell.step("eval:deepeval.metric:AnswerRelevancyMetric", output={
        "framework": "deepeval", "metric": "AnswerRelevancyMetric",
        "score": 0.92, "threshold": 0.7, "passed": True,
    })
    waxell.step("eval:deepeval.metric:FaithfulnessMetric", output={
        "framework": "deepeval", "metric": "FaithfulnessMetric",
        "score": 0.78, "threshold": 0.7, "passed": True,
    })

    # Mirror the metric results as first-class scores
    waxell.score("answer_relevancy", 0.92, comment="DeepEval AnswerRelevancyMetric")
    waxell.score("faithfulness", 0.78, comment="DeepEval FaithfulnessMetric")
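The step payload has a fixed shape. As a minimal sketch, a helper could build it from a metric result (the helper name is hypothetical, not part of the waxell API):

```python
def metric_step_payload(metric: str, score: float, threshold: float) -> dict:
    # Build the output dict recorded for each DeepEval metric step.
    # (Hypothetical helper for illustration; not part of the waxell API.)
    return {
        "framework": "deepeval",
        "metric": metric,
        "score": score,
        "threshold": threshold,
        "passed": score >= threshold,
    }

payload = metric_step_payload("AnswerRelevancyMetric", 0.92, 0.7)
# payload["passed"] is True, since 0.92 >= 0.7
```

Deriving `passed` from `score >= threshold` keeps the step record consistent with the threshold instead of hard-coding the boolean.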

Batch evaluation and pass rate analysis

The evaluator runs batch evaluation and uses @waxell.reasoning_dec to analyze overall quality.

@waxell.reasoning_dec(step="pass_rate_analysis")
async def analyze_pass_rate(scores: dict, threshold: float = 0.7) -> dict:
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})"
                     for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0
                      else f"{total - passed} need improvement",
    }
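To see the shape of the result, the same pass-rate logic can be run standalone, without the decorator or the async wrapper (the sample scores are illustrative):

```python
def pass_rate_analysis(scores: dict, threshold: float = 0.7) -> dict:
    # Undecorated version of the demo's pass-rate logic, for illustration.
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})"
                     for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0
                      else f"{total - passed} need improvement",
    }

result = pass_rate_analysis({"answer_relevancy": 0.92, "faithfulness": 0.78})
# Both scores clear the 0.7 threshold, so the conclusion is "All metrics pass"
```

With any score below the threshold, the conclusion instead reports how many metrics need improvement, and the evidence list marks each metric PASS or FAIL.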

What this demonstrates

  • waxell.step() -- individual DeepEval metric results recorded as named steps with framework/metric/score/threshold/passed metadata.
  • @waxell.step_dec -- evaluation context preparation.
  • @waxell.decision -- metric selection (relevancy_only/faithfulness_only/both/full_suite) driven by an LLM call.
  • @waxell.reasoning_dec -- pass rate analysis with thought/evidence/conclusion.
  • waxell.score() -- per-metric scores plus batch_pass_rate and overall_quality.
  • Auto-instrumented LLM calls -- OpenAI calls for both metric selection and response generation.
  • Nested @waxell.observe -- orchestrator is parent; deepeval-runner and deepeval-evaluator are child agents.
  • DeepEval integration pattern -- shows how to wrap DeepEval metrics with waxell-observe for evaluation observability.
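The metric-selection decision above can be sketched as a mapping from the decision label to the metrics to run. This is an illustrative assumption, not the demo's actual routing code, and the full_suite entry here simply reuses the two metrics shown; the demo's full suite may include more:

```python
# Maps the metric-selection decision to the DeepEval metrics to run.
# (Illustrative sketch; full_suite may include additional metrics in the demo.)
METRIC_SUITES = {
    "relevancy_only": ["AnswerRelevancyMetric"],
    "faithfulness_only": ["FaithfulnessMetric"],
    "both": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
    "full_suite": ["AnswerRelevancyMetric", "FaithfulnessMetric"],
}

def select_metrics(decision: str) -> list:
    # Fall back to running both metrics on an unrecognized decision label.
    return METRIC_SUITES.get(decision, METRIC_SUITES["both"])
```

In the demo the decision label comes from an LLM call wrapped in @waxell.decision; here it is just a plain string.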

Run it

# Dry-run (no API key needed)
python -m app.demos.deepeval_agent --dry-run

# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.deepeval_agent

Source

dev/waxell-dev/app/demos/deepeval_agent.py