Inference Servers Agent
Demonstrates four production inference server instrumentors side by side: SGLang, HuggingFace TGI, TensorRT-LLM, and Triton. A server-runner child agent exercises all four servers while a server-evaluator ranks them by throughput via a `@waxell.reasoning_dec` step and LLM synthesis.
Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
Architecture
Key Code
Four Inference Server Tool Calls
Each tool calls the exact method that the corresponding instrumentor wraps.
```python
@waxell.tool(tool_type="inference_server")
def run_sglang(engine, prompt: str, config: dict) -> dict:
    """Run SGLang Engine.generate with RadixAttention prefix caching."""
    result = engine.generate(prompt=prompt, **config)
    return {
        "server": "sglang",
        "model": engine.model_path,
        "tokens_out": result["usage"]["completion_tokens"],
    }


@waxell.tool(tool_type="inference_server")
def run_tgi(client, prompt: str, config: dict) -> dict:
    """Run HuggingFace TGI text_generation with continuous batching."""
    result = client.text_generation(prompt=prompt, **config)
    return {"server": "tgi", "tokens_out": result["usage"]["completion_tokens"]}


@waxell.tool(tool_type="inference_server")
def run_tensorrt(engine, prompt: str, config: dict) -> dict:
    """Run TensorRT-LLM with FP8 quantization and in-flight batching."""
    result = engine.generate(prompt=prompt, **config)
    return {"server": "tensorrt-llm", "tokens_out": result["usage"]["completion_tokens"]}


@waxell.tool(tool_type="inference_server")
def run_triton(client, prompt: str, config: dict) -> dict:
    """Run Triton Inference Server with dynamic batching."""
    result = client.infer(model_name="llama-3b", inputs={"prompt": prompt})
    return {"server": "triton", "tokens_out": result["outputs"]["completion_tokens"]}
```
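The tool-call pattern can be exercised without live servers. The sketch below is a self-contained approximation: `StubEngine`, `StubClient`, and the no-op `tool` decorator are assumptions standing in for real engines and for `@waxell.tool`, not part of the actual waxell API.

```python
def tool(tool_type):
    """No-op stand-in for @waxell.tool (assumption, for illustration only)."""
    def wrap(fn):
        return fn
    return wrap

class StubEngine:
    """Stands in for an SGLang or TensorRT-LLM engine."""
    model_path = "llama-3b"
    def generate(self, prompt, **config):
        # Fake a usage payload shaped like the responses the tools expect.
        return {"usage": {"completion_tokens": 32}}

class StubClient:
    """Stands in for a Triton client."""
    def infer(self, model_name, inputs):
        return {"outputs": {"completion_tokens": 32}}

@tool(tool_type="inference_server")
def run_sglang(engine, prompt, config):
    result = engine.generate(prompt=prompt, **config)
    return {"server": "sglang", "model": engine.model_path,
            "tokens_out": result["usage"]["completion_tokens"]}

@tool(tool_type="inference_server")
def run_triton(client, prompt, config):
    result = client.infer(model_name="llama-3b", inputs={"prompt": prompt})
    return {"server": "triton", "tokens_out": result["outputs"]["completion_tokens"]}

print(run_sglang(StubEngine(), "hello", {})["tokens_out"])  # -> 32
print(run_triton(StubClient(), "hello", {})["server"])      # -> triton
```

This is roughly what `--dry-run` mode has to do internally: substitute stub backends so every tool still returns a correctly shaped result dict.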
Throughput Assessment and Benchmark Retrieval
The evaluator ranks servers by throughput and gathers results.
```python
@waxell.reasoning_dec(step="throughput_assessment")
async def assess_throughput(results: list) -> dict:
    fastest = max(results, key=lambda r: r["tokens_per_second"])
    return {
        "thought": f"Compared {len(results)} inference servers by throughput.",
        "evidence": [f"{r['server']}: {r['tokens_per_second']} tok/s" for r in results],
        "conclusion": f"{fastest['server']} achieves highest throughput at {fastest['tokens_per_second']} tok/s",
    }


@waxell.retrieval(source="benchmark")
def gather_benchmark_results(results: list) -> list[dict]:
    return [{"server": r["server"], "throughput": r["tokens_per_second"]} for r in results]
```
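Stripped of the decorators, the ranking and shaping logic reduces to a few lines of plain Python. The throughput numbers below are made-up placeholders for illustration, not measured benchmarks:

```python
# Illustrative inputs only; a real run gets these from the four tool calls.
results = [
    {"server": "sglang", "tokens_per_second": 412},
    {"server": "tgi", "tokens_per_second": 298},
    {"server": "tensorrt-llm", "tokens_per_second": 505},
    {"server": "triton", "tokens_per_second": 341},
]

# assess_throughput's core: pick the fastest server.
fastest = max(results, key=lambda r: r["tokens_per_second"])

# gather_benchmark_results's core, plus a descending sort for a ranking.
ranked = sorted(
    ({"server": r["server"], "throughput": r["tokens_per_second"]} for r in results),
    key=lambda r: r["throughput"],
    reverse=True,
)

print(fastest["server"])                 # -> tensorrt-llm
print([r["server"] for r in ranked])     # -> ['tensorrt-llm', 'sglang', 'triton', 'tgi']
```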
What this demonstrates
- SGLang instrumentor -- `Engine.generate` with RadixAttention prefix caching metrics.
- HuggingFace TGI instrumentor -- `text_generation` with continuous batching and speculative decoding.
- TensorRT-LLM instrumentor -- engine generation with FP8 quantization and in-flight batching.
- Triton instrumentor -- `infer` with dynamic batching and multi-model serving.
- `@retrieval` for benchmark gathering -- collects and ranks server performance data.
- `waxell.decide()` for server routing -- manual routing decision with the benchmark strategy.
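The routing decision at the end of the list can be approximated in plain Python. `waxell.decide()`'s actual signature is not shown in this demo, so the function below is a hypothetical stand-in for the benchmark strategy only:

```python
def route_by_benchmark(benchmarks):
    """Hypothetical stand-in for the benchmark routing strategy:
    send future requests to the server with the highest measured throughput."""
    return max(benchmarks, key=lambda b: b["throughput"])["server"]

# Illustrative benchmark data (same shape as gather_benchmark_results output).
choice = route_by_benchmark([
    {"server": "sglang", "throughput": 412},
    {"server": "tensorrt-llm", "throughput": 505},
])
print(choice)  # -> tensorrt-llm
```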
Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.inference_server_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.inference_server_agent
```