Inference Servers Agent
Demonstrates four production inference server instrumentors side by side: SGLang, HuggingFace TGI, TensorRT-LLM, and Triton. A server-runner child agent exercises all four servers while a server-evaluator ranks them by throughput via a `@waxell.reasoning_dec` step and LLM synthesis.
Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
Architecture
Key Code
Four Inference Server Tool Calls
Each tool calls the exact method that the corresponding instrumentor wraps.
```python
@waxell.tool(tool_type="inference_server")
def run_sglang(engine, prompt: str, config: dict) -> dict:
    """Run SGLang Engine.generate with RadixAttention prefix caching."""
    result = engine.generate(prompt=prompt, **config)
    return {
        "server": "sglang",
        "model": engine.model_path,
        "tokens_out": result["usage"]["completion_tokens"],
    }


@waxell.tool(tool_type="inference_server")
def run_tgi(client, prompt: str, config: dict) -> dict:
    """Run HuggingFace TGI text_generation with continuous batching."""
    result = client.text_generation(prompt=prompt, **config)
    return {"server": "tgi", "tokens_out": result["usage"]["completion_tokens"]}


@waxell.tool(tool_type="inference_server")
def run_tensorrt(engine, prompt: str, config: dict) -> dict:
    """Run TensorRT-LLM with FP8 quantization and in-flight batching."""
    result = engine.generate(prompt=prompt, **config)
    return {"server": "tensorrt-llm", "tokens_out": result["usage"]["completion_tokens"]}


@waxell.tool(tool_type="inference_server")
def run_triton(client, prompt: str, config: dict) -> dict:
    """Run Triton Inference Server with dynamic batching."""
    result = client.infer(model_name="llama-3b", inputs={"prompt": prompt})
    return {"server": "triton", "tokens_out": result["outputs"]["completion_tokens"]}
```
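The tool-call pattern can be exercised without live servers. The sketch below is a self-contained approximation: `StubEngine`, `StubClient`, and the no-op `tool` decorator are assumptions standing in for real engines and for `@waxell.tool`, not part of the actual waxell API.

```python
def tool(tool_type):
    """No-op stand-in for @waxell.tool (assumption, for illustration only)."""
    def wrap(fn):
        return fn
    return wrap

class StubEngine:
    """Stands in for an SGLang or TensorRT-LLM engine."""
    model_path = "llama-3b"
    def generate(self, prompt, **config):
        # Fake a usage payload shaped like the responses the tools expect.
        return {"usage": {"completion_tokens": 32}}

class StubClient:
    """Stands in for a Triton client."""
    def infer(self, model_name, inputs):
        return {"outputs": {"completion_tokens": 32}}

@tool(tool_type="inference_server")
def run_sglang(engine, prompt, config):
    result = engine.generate(prompt=prompt, **config)
    return {"server": "sglang", "model": engine.model_path,
            "tokens_out": result["usage"]["completion_tokens"]}

@tool(tool_type="inference_server")
def run_triton(client, prompt, config):
    result = client.infer(model_name="llama-3b", inputs={"prompt": prompt})
    return {"server": "triton", "tokens_out": result["outputs"]["completion_tokens"]}

print(run_sglang(StubEngine(), "hello", {})["tokens_out"])  # -> 32
print(run_triton(StubClient(), "hello", {})["server"])      # -> triton
```

This is roughly what `--dry-run` mode has to do internally: substitute stub backends so every tool still returns a correctly shaped result dict.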
Throughput Assessment and Benchmark Retrieval
The evaluator ranks servers by throughput and gathers results.
```python
@waxell.reasoning_dec(step="throughput_assessment")
async def assess_throughput(results: list) -> dict:
    fastest = max(results, key=lambda r: r["tokens_per_second"])
    return {
        "thought": f"Compared {len(results)} inference servers by throughput.",
        "evidence": [f"{r['server']}: {r['tokens_per_second']} tok/s" for r in results],
        "conclusion": f"{fastest['server']} achieves highest throughput at {fastest['tokens_per_second']} tok/s",
    }


@waxell.retrieval(source="benchmark")
def gather_benchmark_results(results: list) -> list[dict]:
    return [{"server": r["server"], "throughput": r["tokens_per_second"]} for r in results]
```
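Stripped of the decorators, the ranking and shaping logic reduces to a few lines of plain Python. The throughput numbers below are made-up placeholders for illustration, not measured benchmarks:

```python
# Illustrative inputs only; a real run gets these from the four tool calls.
results = [
    {"server": "sglang", "tokens_per_second": 412},
    {"server": "tgi", "tokens_per_second": 298},
    {"server": "tensorrt-llm", "tokens_per_second": 505},
    {"server": "triton", "tokens_per_second": 341},
]

# assess_throughput's core: pick the fastest server.
fastest = max(results, key=lambda r: r["tokens_per_second"])

# gather_benchmark_results's core, plus a descending sort for a ranking.
ranked = sorted(
    ({"server": r["server"], "throughput": r["tokens_per_second"]} for r in results),
    key=lambda r: r["throughput"],
    reverse=True,
)

print(fastest["server"])                 # -> tensorrt-llm
print([r["server"] for r in ranked])     # -> ['tensorrt-llm', 'sglang', 'triton', 'tgi']
```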
What this demonstrates
- SGLang instrumentor -- `Engine.generate` with RadixAttention prefix caching metrics.
- HuggingFace TGI instrumentor -- `text_generation` with continuous batching and speculative decoding.
- TensorRT-LLM instrumentor -- engine generation with FP8 quantization and in-flight batching.
- Triton instrumentor -- `infer` with dynamic batching and multi-model serving.
- `@retrieval` for benchmark gathering -- collects and ranks server performance data.
- `waxell.decide()` for server routing -- manual routing decision with the benchmark strategy.
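The routing decision at the end of the list can be approximated in plain Python. `waxell.decide()`'s actual signature is not shown in this demo, so the function below is a hypothetical stand-in for the benchmark strategy only:

```python
def route_by_benchmark(benchmarks):
    """Hypothetical stand-in for the benchmark routing strategy:
    send future requests to the server with the highest measured throughput."""
    return max(benchmarks, key=lambda b: b["throughput"])["server"]

# Illustrative benchmark data (same shape as gather_benchmark_results output).
choice = route_by_benchmark([
    {"server": "sglang", "throughput": 412},
    {"server": "tensorrt-llm", "throughput": 505},
])
print(choice)  # -> tensorrt-llm
```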
Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.inference_server_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.inference_server_agent
```