DeepEval Agent
A multi-agent DeepEval evaluation pipeline that coordinates a deepeval-runner (generates an LLM response, evaluates it with individual AnswerRelevancy and Faithfulness metrics) and a deepeval-evaluator (runs batch evaluation across 3 test cases, analyzes pass rate with @reasoning, scores overall quality). Demonstrates the DeepEval instrumentor integration pattern with metric-based LLM output evaluation.
Environment variables
This example runs in dry-run mode by default (no API key needed). For live mode, set OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL.
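A minimal live-mode environment might look like this (all values are placeholders; only the three variable names come from this README):

```shell
# Live-mode configuration (placeholder values -- substitute your own)
export OPENAI_API_KEY=sk-...
export WAXELL_API_KEY=...
export WAXELL_API_URL=...
```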
Architecture
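Based on the overview above, the agent topology is roughly:

```
orchestrator (@waxell.observe, parent)
├── deepeval-runner      generates an LLM response, records AnswerRelevancy / Faithfulness metric steps
└── deepeval-evaluator   batch-evaluates 3 test cases, analyzes pass rate with @reasoning, scores overall quality
```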
Key Code
Individual metric evaluation with step recording
Each DeepEval metric is recorded as a step with framework, metric name, score, threshold, and pass/fail.
@waxell.observe(agent_name="deepeval-runner", workflow_name="deepeval-generation")
async def run_deepeval_runner(query, openai_client, dry_run=False, waxell_ctx=None):
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini", messages=[...],
    )
    answer = response.choices[0].message.content
    waxell.step("eval:deepeval.metric:AnswerRelevancyMetric", output={
        "framework": "deepeval", "metric": "AnswerRelevancyMetric",
        "score": 0.92, "threshold": 0.7, "passed": True,
    })
    waxell.step("eval:deepeval.metric:FaithfulnessMetric", output={
        "framework": "deepeval", "metric": "FaithfulnessMetric",
        "score": 0.78, "threshold": 0.7, "passed": True,
    })
    waxell.score("answer_relevancy", 0.92, comment="DeepEval AnswerRelevancyMetric")
    waxell.score("faithfulness", 0.78, comment="DeepEval FaithfulnessMetric")
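The two metric steps share one payload shape. A hypothetical helper (`metric_step_payload` is not part of the example code, just a sketch of the recorded dict) could build it, deriving `passed` from the score and threshold instead of hard-coding it:

```python
def metric_step_payload(metric: str, score: float, threshold: float = 0.7) -> dict:
    """Build the output dict for an 'eval:deepeval.metric:<name>' step (hypothetical helper)."""
    return {
        "framework": "deepeval",
        "metric": metric,
        "score": score,
        "threshold": threshold,
        "passed": score >= threshold,  # pass/fail derived, not hard-coded
    }

payload = metric_step_payload("FaithfulnessMetric", 0.78)
```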
Batch evaluation and pass rate analysis
The evaluator runs batch evaluation and uses @reasoning to analyze overall quality.
@waxell.reasoning_dec(step="pass_rate_analysis")
async def analyze_pass_rate(scores: dict, threshold: float = 0.7) -> dict:
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})" for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0 else f"{total - passed} need improvement",
    }
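The coroutine above can be exercised with sample scores; a minimal sketch (stubbing `@waxell.reasoning_dec` as a no-op decorator so it runs without waxell installed) reproduces the pass-rate math:

```python
import asyncio

# Stand-in for @waxell.reasoning_dec -- a no-op decorator for this sketch only.
def reasoning_dec(step):
    def wrap(fn):
        return fn
    return wrap

@reasoning_dec(step="pass_rate_analysis")
async def analyze_pass_rate(scores: dict, threshold: float = 0.7) -> dict:
    passed = sum(1 for v in scores.values() if v >= threshold)
    total = len(scores)
    pass_rate = passed / total if total > 0 else 0.0
    return {
        "thought": f"Evaluated {total} metrics. {passed}/{total} passed.",
        "evidence": [f"{k}: {v:.2f} ({'PASS' if v >= threshold else 'FAIL'})" for k, v in scores.items()],
        "conclusion": "All metrics pass" if pass_rate == 1.0 else f"{total - passed} need improvement",
    }

# faithfulness (0.65) falls below the 0.7 threshold, so one metric needs improvement.
result = asyncio.run(analyze_pass_rate({"answer_relevancy": 0.92, "faithfulness": 0.65}))
print(result["conclusion"])  # → 1 need improvement
```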
What this demonstrates
- waxell.step() -- individual DeepEval metric results recorded as named steps with framework/metric/score/threshold/passed metadata.
- @waxell.step_dec -- evaluation context preparation.
- @waxell.decision -- metric selection (relevancy_only/faithfulness_only/both/full_suite) driven by an LLM call.
- @waxell.reasoning_dec -- pass rate analysis with thought/evidence/conclusion.
- waxell.score() -- per-metric scores plus batch_pass_rate and overall_quality.
- Auto-instrumented LLM calls -- OpenAI calls for both metric selection and response generation.
- Nested @waxell.observe -- orchestrator is parent; deepeval-runner and deepeval-evaluator are child agents.
- DeepEval integration pattern -- shows how to wrap DeepEval metrics with waxell-observe for evaluation observability.
Run it
# Dry-run (no API key needed)
python -m app.demos.deepeval_agent --dry-run
# Live mode with OpenAI
OPENAI_API_KEY=sk-... python -m app.demos.deepeval_agent