Multi-Provider Shootout Agent

A stress-test that sends the same query to 6 LLM providers (OpenAI, Anthropic, Groq, Mistral, Cohere, Together), evaluates each response with 3 DeepEval metrics (faithfulness, relevancy, coherence), reranks all answers, and picks a winner. Exercises LLM + eval + reranker categories heavily -- 6 LLM calls, 18 eval scores, 1 rerank, 1 decision. Includes per-provider cost tracking with token counts and pricing.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set API keys for all 6 providers.
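The demo does not document its exact variable names, but the six providers' official SDKs conventionally read the following environment variables, so a live-mode setup likely looks like this (names assumed; check the source file to confirm):

```shell
# Conventional API key variables for the six providers' SDKs.
# The demo's exact names may differ -- verify against the source.
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GROQ_API_KEY=...
export MISTRAL_API_KEY=...
export COHERE_API_KEY=...
export TOGETHER_API_KEY=...
```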

Architecture

Key Code

Per-provider LLM calls with cost tracking

Each provider call records tokens, latency, and cost for comparison.

@waxell.tool(tool_type="llm")
def call_provider(provider_config, query) -> dict:
    return {
        "provider": provider_config["name"],
        "model": provider_config["model"],
        "prompt_tokens": provider_config["prompt_tokens"],
        "completion_tokens": provider_config["completion_tokens"],
        "latency_ms": provider_config["latency_ms"],
        "cost": calculate_cost(provider_config),
    }
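The `calculate_cost` helper isn't shown in the excerpt. A minimal sketch, assuming per-million-token pricing (the `PRICING` table, model names, and rates below are illustrative placeholders, not the demo's real values):

```python
# Illustrative per-million-token rates in USD -- placeholders, not real quotes.
PRICING = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "claude-sonnet": {"prompt": 3.00, "completion": 15.00},
}

def calculate_cost(provider_config: dict) -> float:
    """Return USD cost from token counts and per-million-token rates."""
    rates = PRICING[provider_config["model"]]
    prompt_cost = provider_config["prompt_tokens"] / 1_000_000 * rates["prompt"]
    completion_cost = provider_config["completion_tokens"] / 1_000_000 * rates["completion"]
    return round(prompt_cost + completion_cost, 6)
```

This gives the token-level cost attribution mentioned above: each provider call carries its own computed cost alongside raw token counts.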

# Per-provider evaluation scores
for provider in PROVIDERS:
    for metric in ["faithfulness", "relevancy", "coherence"]:
        waxell.score(
            f"{provider['name'].lower()}.{metric}", score,
            data_type="numeric",
            comment=f"source={provider['name']} | model={provider['model']}",
        )
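The composite score that drives the reranking isn't defined in the excerpt. One plausible sketch combines the three metric scores with a cost penalty (the equal weighting and the `cost_weight` factor are assumptions for illustration):

```python
def composite_score(metrics: dict, cost: float, cost_weight: float = 0.1) -> float:
    """Mean of the three eval metrics minus a small cost penalty.

    Weights are illustrative, not the demo's actual formula.
    """
    quality = sum(metrics.values()) / len(metrics)  # faithfulness, relevancy, coherence
    return round(quality - cost_weight * cost, 4)
```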

Reranking and winner selection

Answers are reranked by composite score, and the winner is selected via @waxell.decision.

@waxell.tool(tool_type="reranker")
def rerank_answers(provider_scores) -> dict:
    ranked = sorted(provider_scores.items(), key=lambda x: x[1]["composite"], reverse=True)
    return {"ranking": [p for p, _ in ranked], "top_score": ranked[0][1]["composite"]}

@waxell.decision(name="pick_winner", options=["OpenAI", "Anthropic", "Groq", "Mistral", "Cohere", "Together"])
def pick_winner(ranking, provider_scores) -> dict:
    winner = ranking[0]
    return {
        "chosen": winner,
        "reasoning": f"{winner} achieved highest composite score across quality and cost",
        "confidence": 0.88,
    }

What this demonstrates

  • @waxell.tool(tool_type="llm") -- 6 provider calls with per-provider token counts, latency, and cost.
  • @waxell.tool(tool_type="evaluation") -- 18 DeepEval metric scores (3 metrics x 6 providers).
  • @waxell.tool(tool_type="reranker") -- answer reranking by composite quality-cost score.
  • @waxell.reasoning_dec -- provider tradeoff analysis.
  • @waxell.decision -- winner selection across 6 providers.
  • waxell.score() -- 18 individual scores + per-provider costs + aggregate metrics.
  • @waxell.step_dec -- provider setup and aggregation steps.
  • 6 LLM providers compared -- OpenAI, Anthropic, Groq, Mistral, Cohere, Together.
  • Cost attribution -- per-provider cost tracking with token-level granularity.

Run it

# Dry-run (no API key needed)
python -m app.demos.multi_provider_shootout_agent --dry-run

# Live mode (requires all provider API keys)
python -m app.demos.multi_provider_shootout_agent

Source

dev/waxell-dev/app/demos/multi_provider_shootout_agent.py