vLLM Agent
Demonstrates the vLLM instrumentor for high-throughput local model serving with PagedAttention optimization tracking. A parent orchestrator coordinates a vllm-inference child agent that loads a model and runs batch generation.
Environment variables
This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.
Architecture
Key Code
vLLM Model Loading and Batch Generation
Two @tool-decorated functions exercise the vLLM engine lifecycle: model loading with GPU memory configuration and batch inference with PagedAttention.
    @waxell.tool(tool_type="ml_serving")
    def load_model(model_name: str, config: dict) -> dict:
        """Simulate loading a model with vLLM engine."""
        return {
            "model": model_name,
            "engine": "vllm",
            "tensor_parallel_size": config.get("tensor_parallel_size", 1),
            "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
            "status": "loaded",
        }

    @waxell.tool(tool_type="ml_serving")
    def batch_generate(prompts: list, model_name: str) -> dict:
        """Simulate batch inference with vLLM PagedAttention."""
        return {
            "batch_size": len(prompts),
            "total_tokens_out": 150 * len(prompts),
            "engine": "vllm",
            "optimization": "paged_attention",
        }
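The two tools compose naturally: load the engine once, then run a batch against it. A minimal standalone sketch, with the waxell decorators omitted so it runs without the library (the function bodies mirror the demo code above):

```python
def load_model(model_name: str, config: dict) -> dict:
    """Simulate loading a model with a vLLM engine (mirrors the demo tool)."""
    return {
        "model": model_name,
        "engine": "vllm",
        "tensor_parallel_size": config.get("tensor_parallel_size", 1),
        "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
        "status": "loaded",
    }

def batch_generate(prompts: list, model_name: str) -> dict:
    """Simulate batch inference with vLLM PagedAttention (mirrors the demo tool)."""
    return {
        "batch_size": len(prompts),
        "total_tokens_out": 150 * len(prompts),
        "engine": "vllm",
        "optimization": "paged_attention",
    }

# Load once, then batch: the orchestrator follows this same sequence.
engine = load_model("meta-llama/Llama-3.2-3B-Instruct", {"tensor_parallel_size": 1})
result = batch_generate(["Summarize vLLM.", "Explain PagedAttention."], engine["model"])
print(result["batch_size"], result["total_tokens_out"])  # 2 300
```

Note that the 150-tokens-per-prompt figure is the demo's fixed estimate, not a real vLLM measurement.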
Model Configuration Decision
The orchestrator decides on tensor parallelism and GPU memory settings.
    @waxell.step_dec(name="prepare_config")
    def prepare_config(query: str) -> dict:
        return {
            "model": "meta-llama/Llama-3.2-3B-Instruct",
            "tensor_parallel_size": 1,
            "gpu_memory_utilization": 0.9,
            "max_model_len": 4096,
        }

    @waxell.decision(name="choose_model_config", options=["standard", "high_throughput", "memory_efficient"])
    async def choose_model_config(query: str) -> dict:
        return {"chosen": "standard", "reasoning": "Single GPU with 90% memory utilization for balanced performance"}
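The three decision options could map to engine settings along these lines. This is an illustrative sketch only: the preset values and the `resolve_config` helper are assumptions for the example, not part of the demo.

```python
# Hypothetical mapping from decision options to vLLM-style engine settings.
# The specific numbers are illustrative assumptions, not taken from the demo.
CONFIG_PRESETS = {
    "standard":         {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.9,  "max_model_len": 4096},
    "high_throughput":  {"tensor_parallel_size": 2, "gpu_memory_utilization": 0.95, "max_model_len": 4096},
    "memory_efficient": {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.7,  "max_model_len": 2048},
}

def resolve_config(chosen: str) -> dict:
    """Merge the chosen preset into a base config (hypothetical helper)."""
    base = {"model": "meta-llama/Llama-3.2-3B-Instruct"}
    base.update(CONFIG_PRESETS[chosen])
    return base

print(resolve_config("standard"))
```

The "standard" preset reproduces the values returned by `prepare_config` above; the other two trade GPU memory headroom against throughput.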
What this demonstrates
- vLLM instrumentor -- model loading and batch generation with PagedAttention optimization.
- GPU memory tracking -- `gpu_memory_utilization` and `tensor_parallel_size` captured in tool metadata.
- Batch inference -- multiple prompts processed in a single `batch_generate` call with aggregate token counts.
- `@step` for configuration -- model config preparation captured as a pipeline stage.
- `@decision` for model config -- selects between standard, high-throughput, and memory-efficient configurations.
Run it
    # Dry-run mode (no API key needed)
    cd dev/waxell-dev
    python -m app.demos.vllm_agent --dry-run

    # Live mode
    export OPENAI_API_KEY="sk-..."
    python -m app.demos.vllm_agent