# Evaluation Frameworks Agent
A multi-agent pipeline that runs five LLM evaluation frameworks -- Braintrust, TruLens, Giskard, Inspect AI, and promptfoo -- compares their results, and recommends the best framework combination for production quality assurance.
## Environment variables

This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
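The dry-run switch can be honored with a simple gate in front of any network call. The sketch below is a hypothetical illustration of that pattern (the `parse_args`/`run_query` helpers are invented for this example; only the `--dry-run` flag comes from the demo):

```python
import argparse

def parse_args(argv=None):
    # Parse the demo's command-line flags; --dry-run skips real API calls.
    parser = argparse.ArgumentParser(description="Evaluation frameworks demo")
    parser.add_argument("--dry-run", action="store_true",
                        help="Return canned results instead of calling APIs")
    return parser.parse_args(argv)

def run_query(query: str, dry_run: bool) -> dict:
    if dry_run:
        # Canned response so the pipeline can be exercised offline.
        return {"query": query, "frameworks": ["braintrust"], "dry_run": True}
    raise RuntimeError("live mode requires OPENAI_API_KEY and WAXELL_* env vars")
```

This keeps the offline path cheap to test while live mode fails loudly when credentials are missing.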
## Architecture

## Key Code

### Orchestrator and Framework Selection
```python
@waxell.step_dec(name="prepare_comparison_context")
async def prepare_comparison_context(query: str) -> dict:
    """Prepare the multi-framework comparison context."""
    cleaned = query.strip()
    tokens = cleaned.lower().split()
    return {
        "original": query,
        "cleaned": cleaned,
        "token_count": len(tokens),
        "frameworks": ["braintrust", "trulens", "giskard", "inspect_ai", "promptfoo"],
    }
```
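Treating the decorator as a pass-through (a simplification; the real `@waxell.step_dec` also records a span), the step can be exercised directly:

```python
import asyncio

async def prepare_comparison_context(query: str) -> dict:
    """Prepare the multi-framework comparison context (decorator omitted)."""
    cleaned = query.strip()
    tokens = cleaned.lower().split()
    return {
        "original": query,
        "cleaned": cleaned,
        "token_count": len(tokens),
        "frameworks": ["braintrust", "trulens", "giskard", "inspect_ai", "promptfoo"],
    }

ctx = asyncio.run(prepare_comparison_context("  Compare LLM eval tools  "))
# ctx["cleaned"] == "Compare LLM eval tools"; ctx["token_count"] == 4
```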
```python
import json

@waxell.decision(
    name="select_frameworks",
    options=["all", "scoring_only", "vulnerability_only", "top_3"],
)
async def select_frameworks(query: str, openai_client) -> dict:
    """Decide which evaluation frameworks to run based on the query."""
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Given a query about LLM evaluation, decide which frameworks to run..."},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(response.choices[0].message.content)
```
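A bare `json.loads` on raw model output raises if the model wraps its JSON in prose. A defensive parse helper (hypothetical, not part of the demo) keeps the decision step from crashing:

```python
import json

def parse_decision(raw: str, default: str = "all") -> dict:
    """Best-effort parse of the model's JSON decision, falling back to a default."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Try to salvage a JSON object embedded in surrounding prose.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                pass
        return {"option": default}
```

Alternatively, structured-output options such as a JSON response format can reduce (though not eliminate) the need for this fallback.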
### Framework Tool Calls

Each evaluation framework is wrapped with `@waxell.tool(tool_type="evaluation")` to auto-record timing, inputs, and outputs as tool spans.
```python
@waxell.tool(tool_type="evaluation")
def run_braintrust_eval(braintrust, query: str) -> dict:
    """Run Braintrust evaluation pipeline: init + eval + log."""
    bt_experiment = braintrust.init(project="llm-quality-check")
    bt_result = braintrust.Eval(
        name="llm-quality-check",
        data=[{"input": query, "expected": "structured analysis"}],
        scores=[lambda x: 0.88],
    )
    return {"scores": bt_result.summary.scores, "avg_score": ...}

@waxell.tool(tool_type="evaluation")
def run_giskard_eval(giskard) -> dict:
    """Run Giskard vulnerability scan and test suite."""
    scan_result = giskard.scan()
    suite = giskard.Suite(name="quality-suite")
    suite_result = suite.run()
    return {"vulnerabilities": len(scan_result.issues), "suite_pass_rate": ...}
```
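The auto-recording that `@waxell.tool(tool_type="evaluation")` performs can be pictured as a thin timing wrapper. The sketch below is an illustration of the pattern only, not the waxell implementation:

```python
import functools
import time

def tool(tool_type: str):
    """Illustrative decorator: records timing, inputs, and outputs as a span dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            span = {
                "tool_type": tool_type,
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "duration_ms": (time.perf_counter() - start) * 1000,
            }
            # A real system would ship the span to a collector; here we just keep it.
            wrapper.last_span = span
            return output
        return wrapper
    return decorator

@tool(tool_type="evaluation")
def run_fake_eval(query: str) -> dict:
    return {"avg_score": 0.88}
```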
### Reasoning and Scoring

The evaluator agent uses `@reasoning` to document its comparison logic and `waxell.decide()` to record the final framework recommendation.
```python
@waxell.reasoning_dec(step="framework_comparison")
async def compare_frameworks(comparison: dict) -> dict:
    """Analyze and compare results across all evaluation frameworks."""
    scoring_frameworks = ["braintrust", "trulens", "inspect_ai"]
    avg_scores = {fw: comparison[fw]["avg_score"] for fw in scoring_frameworks if fw in comparison}
    best_framework = max(avg_scores, key=avg_scores.get)
    return {
        "thought": f"Compared 5 evaluation frameworks. Best: {best_framework}",
        "evidence": [f"{k}: avg={v:.2f}" for k, v in avg_scores.items()],
        "conclusion": f"Recommend combining {best_framework} with Giskard scanning.",
    }
```

```python
# Per-framework scoring
waxell.score("braintrust_avg", bt_result["avg_score"], comment="Braintrust average")
waxell.score("trulens_avg", tl_result["avg_score"], comment="TruLens average")
waxell.score("inspect_ai_avg", ia_result["avg_score"], comment="Inspect AI average")
waxell.score("promptfoo_pass_rate", pf_result["pass_rate"], comment="promptfoo pass rate")

# Final decision
waxell.decide(
    "framework_recommendation",
    chosen="braintrust+giskard",
    options=["braintrust+giskard", "trulens+promptfoo", "full_suite"],
    reasoning="Based on Giskard issues and scoring distributions",
    confidence=0.82,
)
```
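The comparison logic can be exercised on a stub result dict. The scores below are invented for illustration, and the decorator is again treated as a pass-through:

```python
import asyncio

async def compare_frameworks(comparison: dict) -> dict:
    """Compare per-framework average scores and pick the best (decorator omitted)."""
    scoring_frameworks = ["braintrust", "trulens", "inspect_ai"]
    avg_scores = {fw: comparison[fw]["avg_score"]
                  for fw in scoring_frameworks if fw in comparison}
    best_framework = max(avg_scores, key=avg_scores.get)
    return {
        "thought": f"Compared {len(avg_scores)} scoring frameworks. Best: {best_framework}",
        "evidence": [f"{k}: avg={v:.2f}" for k, v in avg_scores.items()],
        "conclusion": f"Recommend combining {best_framework} with Giskard scanning.",
    }

sample = {
    "braintrust": {"avg_score": 0.88},
    "trulens": {"avg_score": 0.81},
    "inspect_ai": {"avg_score": 0.85},
}
result = asyncio.run(compare_frameworks(sample))
# result["conclusion"] recommends braintrust, the highest-scoring framework
```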
## What this demonstrates
- Multi-agent orchestration -- an orchestrator spawns two child agents (`eval-runner` and `eval-evaluator`) with automatic parent-child lineage via `@observe`.
- Five `@tool` calls -- each evaluation framework (Braintrust, TruLens, Giskard, Inspect AI, promptfoo) is wrapped with `@waxell.tool(tool_type="evaluation")` for automatic span creation.
- `@step` and `@decision` -- preparation and framework selection are recorded as structured primitives.
- `@reasoning` decorator -- documents the comparison logic with thought/evidence/conclusion structure.
- Per-framework `waxell.score()` -- numeric scores for each framework enable dashboard comparison.
- `waxell.decide()` -- records the final framework recommendation with options, reasoning, and confidence.
- Auto-instrumented LLM calls -- the evaluator's OpenAI call is captured automatically.
## Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.eval_frameworks_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="your-waxell-api-key"
export WAXELL_API_URL="https://api.waxell.ai"
python -m app.demos.eval_frameworks_agent
```