# Evaluation Frameworks Agent
A multi-agent pipeline that runs five LLM evaluation frameworks -- Braintrust, TruLens, Giskard, Inspect AI, and promptfoo -- compares their results, and recommends the best framework combination for production quality assurance.
## Environment variables

This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
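The dry-run switch can be honored with a simple gate in front of any network call. The sketch below is a hypothetical illustration of that pattern (the `parse_args`/`run_query` helpers are invented for this example; only the `--dry-run` flag comes from the demo):

```python
import argparse

def parse_args(argv=None):
    # Parse the demo's command-line flags; --dry-run skips real API calls.
    parser = argparse.ArgumentParser(description="Evaluation frameworks demo")
    parser.add_argument("--dry-run", action="store_true",
                        help="Return canned results instead of calling APIs")
    return parser.parse_args(argv)

def run_query(query: str, dry_run: bool) -> dict:
    if dry_run:
        # Canned response so the pipeline can be exercised offline.
        return {"query": query, "frameworks": ["braintrust"], "dry_run": True}
    raise RuntimeError("live mode requires OPENAI_API_KEY and WAXELL_* env vars")
```

This keeps the offline path cheap to test while live mode fails loudly when credentials are missing.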
## Architecture

## Key Code

### Orchestrator and Framework Selection
```python
@waxell.step_dec(name="prepare_comparison_context")
async def prepare_comparison_context(query: str) -> dict:
    """Prepare the multi-framework comparison context."""
    cleaned = query.strip()
    tokens = cleaned.lower().split()
    return {
        "original": query,
        "cleaned": cleaned,
        "token_count": len(tokens),
        "frameworks": ["braintrust", "trulens", "giskard", "inspect_ai", "promptfoo"],
    }
```
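Treating the decorator as a pass-through (a simplification; the real `@waxell.step_dec` also records a span), the step can be exercised directly:

```python
import asyncio

async def prepare_comparison_context(query: str) -> dict:
    """Prepare the multi-framework comparison context (decorator omitted)."""
    cleaned = query.strip()
    tokens = cleaned.lower().split()
    return {
        "original": query,
        "cleaned": cleaned,
        "token_count": len(tokens),
        "frameworks": ["braintrust", "trulens", "giskard", "inspect_ai", "promptfoo"],
    }

ctx = asyncio.run(prepare_comparison_context("  Compare LLM eval tools  "))
# ctx["cleaned"] == "Compare LLM eval tools"; ctx["token_count"] == 4
```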
```python
import json

@waxell.decision(
    name="select_frameworks",
    options=["all", "scoring_only", "vulnerability_only", "top_3"],
)
async def select_frameworks(query: str, openai_client) -> dict:
    """Decide which evaluation frameworks to run based on the query."""
    response = await openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Given a query about LLM evaluation, decide which frameworks to run..."},
            {"role": "user", "content": query},
        ],
    )
    return json.loads(response.choices[0].message.content)
```
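A bare `json.loads` on raw model output raises if the model wraps its JSON in prose. A defensive parse helper (hypothetical, not part of the demo) keeps the decision step from crashing:

```python
import json

def parse_decision(raw: str, default: str = "all") -> dict:
    """Best-effort parse of the model's JSON decision, falling back to a default."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Try to salvage a JSON object embedded in surrounding prose.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            try:
                return json.loads(raw[start:end + 1])
            except json.JSONDecodeError:
                pass
        return {"option": default}
```

Alternatively, structured-output options such as a JSON response format can reduce (though not eliminate) the need for this fallback.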
### Framework Tool Calls

Each evaluation framework is wrapped with `@waxell.tool(tool_type="evaluation")` to auto-record timing, inputs, and outputs as tool spans.
```python
@waxell.tool(tool_type="evaluation")
def run_braintrust_eval(braintrust, query: str) -> dict:
    """Run Braintrust evaluation pipeline: init + eval + log."""
    bt_experiment = braintrust.init(project="llm-quality-check")
    bt_result = braintrust.Eval(
        name="llm-quality-check",
        data=[{"input": query, "expected": "structured analysis"}],
        scores=[lambda x: 0.88],
    )
    return {"scores": bt_result.summary.scores, "avg_score": ...}

@waxell.tool(tool_type="evaluation")
def run_giskard_eval(giskard) -> dict:
    """Run Giskard vulnerability scan and test suite."""
    scan_result = giskard.scan()
    suite = giskard.Suite(name="quality-suite")
    suite_result = suite.run()
    return {"vulnerabilities": len(scan_result.issues), "suite_pass_rate": ...}
```
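The auto-recording that `@waxell.tool(tool_type="evaluation")` performs can be pictured as a thin timing wrapper. The sketch below is an illustration of the pattern only, not the waxell implementation:

```python
import functools
import time

def tool(tool_type: str):
    """Illustrative decorator: records timing, inputs, and outputs as a span dict."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            output = fn(*args, **kwargs)
            span = {
                "tool_type": tool_type,
                "name": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "output": output,
                "duration_ms": (time.perf_counter() - start) * 1000,
            }
            # A real system would ship the span to a collector; here we just keep it.
            wrapper.last_span = span
            return output
        return wrapper
    return decorator

@tool(tool_type="evaluation")
def run_fake_eval(query: str) -> dict:
    return {"avg_score": 0.88}
```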
### Reasoning and Scoring

The evaluator agent uses `@reasoning` to document its comparison logic and `waxell.decide()` to record the final framework recommendation.
```python
@waxell.reasoning_dec(step="framework_comparison")
async def compare_frameworks(comparison: dict) -> dict:
    """Analyze and compare results across all evaluation frameworks."""
    scoring_frameworks = ["braintrust", "trulens", "inspect_ai"]
    avg_scores = {fw: comparison[fw]["avg_score"] for fw in scoring_frameworks if fw in comparison}
    best_framework = max(avg_scores, key=avg_scores.get)
    return {
        "thought": f"Compared 5 evaluation frameworks. Best: {best_framework}",
        "evidence": [f"{k}: avg={v:.2f}" for k, v in avg_scores.items()],
        "conclusion": f"Recommend combining {best_framework} with Giskard scanning.",
    }
```

```python
# Per-framework scoring
waxell.score("braintrust_avg", bt_result["avg_score"], comment="Braintrust average")
waxell.score("trulens_avg", tl_result["avg_score"], comment="TruLens average")
waxell.score("inspect_ai_avg", ia_result["avg_score"], comment="Inspect AI average")
waxell.score("promptfoo_pass_rate", pf_result["pass_rate"], comment="promptfoo pass rate")

# Final decision
waxell.decide(
    "framework_recommendation",
    chosen="braintrust+giskard",
    options=["braintrust+giskard", "trulens+promptfoo", "full_suite"],
    reasoning="Based on Giskard issues and scoring distributions",
    confidence=0.82,
)
```
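The comparison logic can be exercised on a stub result dict. The scores below are invented for illustration, and the decorator is again treated as a pass-through:

```python
import asyncio

async def compare_frameworks(comparison: dict) -> dict:
    """Compare per-framework average scores and pick the best (decorator omitted)."""
    scoring_frameworks = ["braintrust", "trulens", "inspect_ai"]
    avg_scores = {fw: comparison[fw]["avg_score"]
                  for fw in scoring_frameworks if fw in comparison}
    best_framework = max(avg_scores, key=avg_scores.get)
    return {
        "thought": f"Compared {len(avg_scores)} scoring frameworks. Best: {best_framework}",
        "evidence": [f"{k}: avg={v:.2f}" for k, v in avg_scores.items()],
        "conclusion": f"Recommend combining {best_framework} with Giskard scanning.",
    }

sample = {
    "braintrust": {"avg_score": 0.88},
    "trulens": {"avg_score": 0.81},
    "inspect_ai": {"avg_score": 0.85},
}
result = asyncio.run(compare_frameworks(sample))
# result["conclusion"] recommends braintrust, the highest-scoring framework
```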
## What this demonstrates
- Multi-agent orchestration -- an orchestrator spawns two child agents (`eval-runner` and `eval-evaluator`) with automatic parent-child lineage via `@observe`.
- Five `@tool` calls -- each evaluation framework (Braintrust, TruLens, Giskard, Inspect AI, promptfoo) is wrapped with `@waxell.tool(tool_type="evaluation")` for automatic span creation.
- `@step` and `@decision` -- preparation and framework selection are recorded as structured primitives.
- `@reasoning` decorator -- documents the comparison logic with thought/evidence/conclusion structure.
- Per-framework `waxell.score()` -- numeric scores for each framework enable dashboard comparison.
- `waxell.decide()` -- records the final framework recommendation with options, reasoning, and confidence.
- Auto-instrumented LLM calls -- the evaluator's OpenAI call is captured automatically.
## Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.eval_frameworks_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="your-waxell-api-key"
export WAXELL_API_URL="https://api.waxell.ai"
python -m app.demos.eval_frameworks_agent
```