Multi-Provider Shootout Agent
A stress test that sends the same query to 6 LLM providers (OpenAI, Anthropic, Groq, Mistral, Cohere, Together), evaluates each response with 3 DeepEval metrics (faithfulness, relevancy, coherence), reranks the answers, and picks a winner. It exercises the LLM, evaluation, and reranker categories heavily -- 6 LLM calls, 18 eval scores, 1 rerank, 1 decision -- and includes per-provider cost tracking with token counts and pricing.
Environment variables
This example runs in dry-run mode by default (no API key needed). For live mode, set API keys for all 6 providers.
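A minimal pre-flight check for live mode. The variable names below are assumptions based on each SDK's usual conventions, not confirmed by the demo; adjust them to whatever the config actually reads.
import os

# Assumed key names -- adjust to match the demo's configuration.
REQUIRED_KEYS = [
    "OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GROQ_API_KEY",
    "MISTRAL_API_KEY", "COHERE_API_KEY", "TOGETHER_API_KEY",
]
missing = [k for k in REQUIRED_KEYS if not os.environ.get(k)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")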
Architecture
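The pipeline fans one query out to six providers, scores every answer on three metrics, then reranks and decides. A minimal orchestration sketch -- not the demo's exact internals -- where score_answer is a hypothetical helper and the other names mirror the Key Code snippets below:
# Hypothetical top-level flow; score_answer is assumed, not from the demo.
def run_shootout(query: str) -> dict:
    answers = {p["name"]: call_provider(p, query) for p in PROVIDERS}   # 6 LLM calls
    provider_scores = {n: score_answer(a) for n, a in answers.items()}  # 18 eval scores
    reranked = rerank_answers(provider_scores)                          # 1 rerank
    return pick_winner(reranked["ranking"], provider_scores)           # 1 decision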
Key Code
Per-provider LLM calls with cost tracking
Each provider call records tokens, latency, and cost for comparison.
@waxell.tool(tool_type="llm")
def call_provider(provider_config, query) -> dict:
    return {
        "provider": provider_config["name"],
        "model": provider_config["model"],
        "prompt_tokens": provider_config["prompt_tokens"],
        "completion_tokens": provider_config["completion_tokens"],
        "latency_ms": provider_config["latency_ms"],
        "cost": calculate_cost(provider_config),
    }
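calculate_cost is not shown in the excerpt. A minimal sketch, assuming a static price table in USD per million tokens; the model names and figures below are placeholders, not real provider pricing:
# Placeholder (input, output) prices per 1M tokens -- illustrative only.
PRICING = {
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
}

def calculate_cost(provider_config: dict) -> float:
    price_in, price_out = PRICING[provider_config["model"]]
    return (
        provider_config["prompt_tokens"] * price_in
        + provider_config["completion_tokens"] * price_out
    ) / 1_000_000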
# Per-provider evaluation scores
for provider in PROVIDERS:
    for metric in ["faithfulness", "relevancy", "coherence"]:
        # `score` is the value produced by the corresponding DeepEval
        # metric for this provider's answer (computation elided here).
        waxell.score(
            f"{provider['name'].lower()}.{metric}", score,
            data_type="numeric",
            comment=f"source={provider['name']} | model={provider['model']}",
        )
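The reranker in the next section sorts on a per-provider composite value. The demo's exact formula isn't shown; since the winner is described as best across quality and cost, a plausible sketch blends the metric mean with a cost penalty (cost_weight is an assumed knob):
# Assumed composite: mean of the three metrics minus a small cost penalty.
def composite_score(metrics: dict, cost: float, cost_weight: float = 0.1) -> float:
    quality = sum(metrics[m] for m in ("faithfulness", "relevancy", "coherence")) / 3
    return quality - cost_weight * cost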
Reranking and winner selection
Answers are reranked by composite score, and the winner is selected via @waxell.decision.
@waxell.tool(tool_type="reranker")
def rerank_answers(provider_scores) -> dict:
    ranked = sorted(provider_scores.items(), key=lambda x: x[1]["composite"], reverse=True)
    return {"ranking": [p for p, _ in ranked], "top_score": ranked[0][1]["composite"]}
@waxell.decision(name="pick_winner", options=["OpenAI", "Anthropic", "Groq", "Mistral", "Cohere", "Together"])
def pick_winner(ranking, provider_scores) -> dict:
    winner = ranking[0]
    return {
        "chosen": winner,
        "reasoning": f"{winner} achieved highest composite score across quality and cost",
        "confidence": 0.88,
    }
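Wiring the two together, with illustrative stand-in scores (the real values come from the evaluation step):
# Stand-in composites for demonstration only.
provider_scores = {
    "OpenAI": {"composite": 0.91}, "Anthropic": {"composite": 0.89},
    "Groq": {"composite": 0.84}, "Mistral": {"composite": 0.82},
    "Cohere": {"composite": 0.80}, "Together": {"composite": 0.78},
}
reranked = rerank_answers(provider_scores)
result = pick_winner(reranked["ranking"], provider_scores)
# For these stand-in numbers, result["chosen"] == "OpenAI".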
What this demonstrates
- @waxell.tool(tool_type="llm") -- 6 provider calls with per-provider token counts, latency, and cost.
- @waxell.tool(tool_type="evaluation") -- 18 DeepEval metric scores (3 metrics x 6 providers).
- @waxell.tool(tool_type="reranker") -- answer reranking by composite quality-cost score.
- @waxell.reasoning_dec -- provider tradeoff analysis (see the sketch after this list).
- @waxell.decision -- winner selection across 6 providers.
- waxell.score() -- 18 individual scores + per-provider costs + aggregate metrics.
- @waxell.step_dec -- provider setup and aggregation steps.
- 6 LLM providers compared -- OpenAI, Anthropic, Groq, Mistral, Cohere, Together.
- Cost attribution -- per-provider cost tracking with token-level granularity.
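The reasoning decorator isn't shown in Key Code. A hedged sketch of the tradeoff analysis, assuming @waxell.reasoning_dec takes a name keyword the way @waxell.decision does; the function body is illustrative only:
@waxell.reasoning_dec(name="provider_tradeoffs")
def analyze_tradeoffs(provider_scores: dict) -> str:
    # Assumed shape: summarize quality-vs-cost tradeoffs before deciding.
    leader = max(provider_scores, key=lambda p: provider_scores[p]["composite"])
    return f"{leader} leads on composite score; cheaper providers may still close the gap on cost."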
Run it
# Dry-run (no API key needed)
python -m app.demos.multi_provider_shootout_agent --dry-run
# Live mode (requires all provider API keys)
python -m app.demos.multi_provider_shootout_agent