Reranker Comparison Agent
A multi-agent reranking comparison demo across 4 reranker providers (CrossEncoder, Pinecone Rerank, FlashRank, ColBERT). A parent orchestrator coordinates 2 child agents -- a multi-reranker that runs all 4 rerankers on the same candidates and compares their score distributions, and an evaluator that picks the best reranker, assesses quality, and records metrics.
Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls; all reranker libraries are mocked.
Architecture
Key Code
Multi-Provider Reranking with `@waxell.tool(tool_type="reranking")`
The multi-reranker child agent runs all 4 rerankers on the same candidate documents.
```python
@waxell.tool(tool_type="reranking")
def cross_encoder_rerank(query: str, documents: list) -> dict:
    """Rerank using sentence-transformers CrossEncoder."""
    scores = model.predict([(query, doc) for doc in documents])
    return {"scores": sorted(scores, reverse=True), "model": "cross-encoder/ms-marco-MiniLM-L-6-v2"}

@waxell.tool(tool_type="reranking")
def pinecone_rerank(query: str, documents: list) -> dict:
    """Rerank using Pinecone Inference API."""
    result = pc.inference.rerank(model="bge-reranker-v2-m3", query=query, documents=documents)
    return {"scores": [...], "model": "bge-reranker-v2-m3"}

@waxell.tool(tool_type="reranking")
def flashrank_rerank(query: str, documents: list) -> dict:
    """Rerank using FlashRank (lightweight, no GPU)."""
    results = ranker.rerank(request=RerankRequest(query=query, passages=documents))
    return {"scores": [...], "model": "ms-marco-MiniLM-L-12-v2"}

@waxell.tool(tool_type="reranking")
def colbert_rerank(query: str, documents: list) -> dict:
    """Rerank using ColBERT late interaction model."""
    scores = model.rank(query, documents)
    return {"scores": [...], "model": "colbert-ir/colbertv2.0"}
```
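Because each provider scores on its own scale (CrossEncoder typically emits unbounded logits, while managed APIs usually return relevance in roughly [0, 1]), the distributions are only comparable after normalization. A minimal min-max sketch in plain Python (no Waxell decorators; `normalize_scores` is a hypothetical helper, not part of the demo):

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Min-max normalize one provider's reranker scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # degenerate case: every candidate got the same score
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# CrossEncoder-style logits vs. API-style relevance, same candidate order
cross_encoder_logits = [4.2, 1.1, -2.3]
api_relevance = [0.91, 0.55, 0.12]
print(normalize_scores(cross_encoder_logits))
print(normalize_scores(api_relevance))
```

After this step, "top score" and "spread" mean the same thing for every provider, which is what makes the cross-provider comparison below meaningful.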
Score Distribution Analysis and Evaluation
```python
@waxell.reasoning_dec(step="analyze_score_distribution")
async def analyze_score_distribution(provider_results: dict) -> dict:
    providers = list(provider_results.keys())
    return {
        "thought": f"Analyzing score distributions across {len(providers)} rerankers.",
        "evidence": [f"{p}: top={r['scores'][0]:.3f}, spread={r['scores'][0]-r['scores'][-1]:.3f}"
                     for p, r in provider_results.items()],
        "conclusion": "CrossEncoder and Pinecone show strongest score separation",
    }
```
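The spread metric in those evidence lines is simply the gap between the best and worst candidate after sorting; a wider spread suggests the reranker separates relevant from irrelevant documents more decisively. A standalone sketch, assuming each provider's `scores` list is already sorted descending (as the reranker tools return them) and using made-up numbers:

```python
def score_spread(scores: list[float]) -> float:
    """Gap between best and worst candidate; larger = stronger separation."""
    return scores[0] - scores[-1]

# Hypothetical normalized results from two providers
provider_results = {
    "cross_encoder": {"scores": [0.97, 0.61, 0.22]},
    "flashrank": {"scores": [0.88, 0.79, 0.74]},
}
evidence = [f"{p}: top={r['scores'][0]:.3f}, spread={score_spread(r['scores']):.3f}"
            for p, r in provider_results.items()]
print(evidence)
```

Here CrossEncoder's spread (0.750) dwarfs FlashRank's (0.140) even though their top scores are close, which is why the evaluator looks at distributions rather than top scores alone.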
```python
@waxell.retrieval(source="reranker")
def collect_reranked_results(provider_results: dict) -> list[dict]:
    """Collect and normalize results from all reranker providers."""
    collected = [{"provider": name, "top_score": r["scores"][0], "model": r["model"]}
                 for name, r in provider_results.items()]
    collected.sort(key=lambda x: x["top_score"], reverse=True)
    return collected
```
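The evaluator's "best reranker" pick falls out of that sorted list directly. A usage sketch with a hypothetical two-provider input shaped like `collect_reranked_results` expects (the decorator is omitted so this runs standalone):

```python
# Hypothetical per-provider results, scores sorted descending
provider_results = {
    "pinecone": {"scores": [0.91, 0.40], "model": "bge-reranker-v2-m3"},
    "colbert": {"scores": [0.84, 0.70], "model": "colbert-ir/colbertv2.0"},
}
collected = [{"provider": name, "top_score": r["scores"][0], "model": r["model"]}
             for name, r in provider_results.items()]
collected.sort(key=lambda x: x["top_score"], reverse=True)
best = collected[0]
print(best["provider"], best["top_score"])  # pinecone 0.91
```

Ranking by top score is a deliberately simple heuristic; the evaluator agent also weighs score spread and overall quality before committing to a provider.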
What this demonstrates
- `@waxell.observe` -- parent-child agent hierarchy (orchestrator + 2 child agents) with automatic lineage
- `@waxell.tool(tool_type="reranking")` -- 4 reranker tool spans: CrossEncoder, Pinecone, FlashRank, ColBERT
- `@waxell.retrieval(source="reranker")` -- cross-provider result collection and normalization
- `@waxell.decision` -- reranker strategy selection and best reranker pick
- `waxell.decide()` -- provider routing decision
- `@waxell.reasoning_dec` -- score distribution analysis and reranking quality evaluation
- `@waxell.step_dec` -- query preprocessing and score comparison
- `waxell.score()` -- quality metrics across all providers
- 4 reranker providers compared -- CrossEncoder (accuracy), Pinecone (managed), FlashRank (lightweight), ColBERT (late interaction)
Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.reranker_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.reranker_agent

# Custom query
python -m app.demos.reranker_agent --dry-run --query "Best practices for RAG retrieval"
```