Multi-Provider Shootout Agent

A stress-test that sends the same query to 6 LLM providers (OpenAI, Anthropic, Groq, Mistral, Cohere, Together), evaluates each response with 3 DeepEval metrics (faithfulness, relevancy, coherence), reranks all answers, and picks a winner. Exercises LLM + eval + reranker categories heavily -- 6 LLM calls, 18 eval scores, 1 rerank, 1 decision. Includes per-provider cost tracking with token counts and pricing.

Environment variables

This example runs in dry-run mode by default (no API key needed). For live mode, set API keys for all 6 providers.
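The demo does not document its exact variable names, but the six providers' official SDKs conventionally read the following environment variables, so a live-mode setup likely looks like this (names assumed; check the source file to confirm):

```shell
# Conventional API key variables for the six providers' SDKs.
# The demo's exact names may differ -- verify against the source.
export OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GROQ_API_KEY=...
export MISTRAL_API_KEY=...
export COHERE_API_KEY=...
export TOGETHER_API_KEY=...
```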

Architecture

Key Code

Per-provider LLM calls with cost tracking

Each provider call records tokens, latency, and cost for comparison.

@waxell.tool(tool_type="llm")
def call_provider(provider_config, query) -> dict:
    return {
        "provider": provider_config["name"],
        "model": provider_config["model"],
        "prompt_tokens": provider_config["prompt_tokens"],
        "completion_tokens": provider_config["completion_tokens"],
        "latency_ms": provider_config["latency_ms"],
        "cost": calculate_cost(provider_config),
    }
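The `calculate_cost` helper isn't shown in the excerpt. A minimal sketch, assuming per-million-token pricing (the `PRICING` table, model names, and rates below are illustrative placeholders, not the demo's real values):

```python
# Illustrative per-million-token rates in USD -- placeholders, not real quotes.
PRICING = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
    "claude-sonnet": {"prompt": 3.00, "completion": 15.00},
}

def calculate_cost(provider_config: dict) -> float:
    """Return USD cost from token counts and per-million-token rates."""
    rates = PRICING[provider_config["model"]]
    prompt_cost = provider_config["prompt_tokens"] / 1_000_000 * rates["prompt"]
    completion_cost = provider_config["completion_tokens"] / 1_000_000 * rates["completion"]
    return round(prompt_cost + completion_cost, 6)
```

This gives the token-level cost attribution mentioned above: each provider call carries its own computed cost alongside raw token counts.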

# Per-provider evaluation scores
for provider in PROVIDERS:
    for metric in ["faithfulness", "relevancy", "coherence"]:
        waxell.score(
            f"{provider['name'].lower()}.{metric}", score,
            data_type="numeric",
            comment=f"source={provider['name']} | model={provider['model']}",
        )
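The composite score that drives the reranking isn't defined in the excerpt. One plausible sketch combines the three metric scores with a cost penalty (the equal weighting and the `cost_weight` factor are assumptions for illustration):

```python
def composite_score(metrics: dict, cost: float, cost_weight: float = 0.1) -> float:
    """Mean of the three eval metrics minus a small cost penalty.

    Weights are illustrative, not the demo's actual formula.
    """
    quality = sum(metrics.values()) / len(metrics)  # faithfulness, relevancy, coherence
    return round(quality - cost_weight * cost, 4)
```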

Reranking and winner selection

Answers are reranked by composite score, and the winner is selected via @waxell.decision.

@waxell.tool(tool_type="reranker")
def rerank_answers(provider_scores) -> dict:
    ranked = sorted(provider_scores.items(), key=lambda x: x[1]["composite"], reverse=True)
    return {"ranking": [p for p, _ in ranked], "top_score": ranked[0][1]["composite"]}

@waxell.decision(name="pick_winner", options=["OpenAI", "Anthropic", "Groq", "Mistral", "Cohere", "Together"])
def pick_winner(ranking, provider_scores) -> dict:
    winner = ranking[0]
    return {
        "chosen": winner,
        "reasoning": f"{winner} achieved highest composite score across quality and cost",
        "confidence": 0.88,
    }

What this demonstrates

  • @waxell.tool(tool_type="llm") -- 6 provider calls with per-provider token counts, latency, and cost.
  • @waxell.tool(tool_type="evaluation") -- 18 DeepEval metric scores (3 metrics x 6 providers).
  • @waxell.tool(tool_type="reranker") -- answer reranking by composite quality-cost score.
  • @waxell.reasoning_dec -- provider tradeoff analysis.
  • @waxell.decision -- winner selection across 6 providers.
  • waxell.score() -- 18 individual scores + per-provider costs + aggregate metrics.
  • @waxell.step_dec -- provider setup and aggregation steps.
  • 6 LLM providers compared -- OpenAI, Anthropic, Groq, Mistral, Cohere, Together.
  • Cost attribution -- per-provider cost tracking with token-level granularity.

Run it

# Dry-run (no API key needed)
python -m app.demos.multi_provider_shootout_agent --dry-run

# Live mode (requires all provider API keys)
python -m app.demos.multi_provider_shootout_agent

Source

dev/waxell-dev/app/demos/multi_provider_shootout_agent.py