vLLM Agent

Demonstrates the vLLM instrumentor for high-throughput local model serving with PagedAttention optimization tracking. A parent orchestrator coordinates a vllm-inference child agent that loads a model and runs batch generation.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

vLLM Model Loading and Batch Generation

Two @tool-decorated functions exercise the vLLM engine lifecycle: model loading with GPU memory configuration and batch inference with PagedAttention.

@waxell.tool(tool_type="ml_serving")
def load_model(model_name: str, config: dict) -> dict:
    """Simulate loading a model with vLLM engine."""
    return {
        "model": model_name,
        "engine": "vllm",
        "tensor_parallel_size": config.get("tensor_parallel_size", 1),
        "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
        "status": "loaded",
    }

@waxell.tool(tool_type="ml_serving")
def batch_generate(prompts: list, model_name: str) -> dict:
    """Simulate batch inference with vLLM PagedAttention."""
    return {
        "batch_size": len(prompts),
        "total_tokens_out": 150 * len(prompts),
        "engine": "vllm",
        "optimization": "paged_attention",
    }
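Stripped of the waxell decorators so it runs standalone, the two simulated tools compose like this (a minimal sketch: the model name and the fixed 150-tokens-per-prompt figure come from the demo; the prompts are illustrative):

```python
# Plain-Python sketch of the demo's two simulated tools, with the
# @waxell.tool decorators removed so the snippet is self-contained.

def load_model(model_name: str, config: dict) -> dict:
    """Simulate loading a model with the vLLM engine."""
    return {
        "model": model_name,
        "engine": "vllm",
        "tensor_parallel_size": config.get("tensor_parallel_size", 1),
        "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
        "status": "loaded",
    }

def batch_generate(prompts: list, model_name: str) -> dict:
    """Simulate batch inference; the demo assumes 150 output tokens per prompt."""
    return {
        "batch_size": len(prompts),
        "total_tokens_out": 150 * len(prompts),
        "engine": "vllm",
        "optimization": "paged_attention",
    }

engine = load_model("meta-llama/Llama-3.2-3B-Instruct", {"gpu_memory_utilization": 0.9})
result = batch_generate(["Explain PagedAttention.", "Summarize vLLM."], engine["model"])
# Two prompts in, so batch_size is 2 and total_tokens_out is 2 * 150 = 300.
```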

Model Configuration Decision

The orchestrator decides on tensor parallelism and GPU memory settings.

@waxell.step_dec(name="prepare_config")
def prepare_config(query: str) -> dict:
    return {
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9,
        "max_model_len": 4096,
    }

@waxell.decision(name="choose_model_config", options=["standard", "high_throughput", "memory_efficient"])
async def choose_model_config(query: str) -> dict:
    return {"chosen": "standard", "reasoning": "Single GPU with 90% memory utilization for balanced performance"}
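The demo always chooses "standard", but the three options could map to concrete engine settings along these lines (a hypothetical mapping: the "standard" preset mirrors prepare_config's values, while the other two presets and the apply_choice helper are illustrative, not part of the demo):

```python
# Hypothetical presets mapping the decision's options to vLLM engine settings.
# "standard" mirrors the demo's prepare_config values; the other two are
# illustrative guesses at what high_throughput / memory_efficient might mean.
CONFIG_PRESETS = {
    "standard": {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.9},
    "high_throughput": {"tensor_parallel_size": 2, "gpu_memory_utilization": 0.95},
    "memory_efficient": {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.7},
}

def apply_choice(base_config: dict, chosen: str) -> dict:
    """Overlay the chosen preset onto the base model config."""
    merged = dict(base_config)
    merged.update(CONFIG_PRESETS[chosen])
    return merged

cfg = apply_choice(
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "max_model_len": 4096},
    "standard",
)
# cfg keeps model and max_model_len and gains the "standard" preset values.
```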

What this demonstrates

  • vLLM instrumentor -- model loading and batch generation with PagedAttention optimization.
  • GPU memory tracking -- gpu_memory_utilization and tensor_parallel_size captured in tool metadata.
  • Batch inference -- multiple prompts processed in a single batch_generate call with aggregate token counts.
  • @step_dec for configuration -- model config preparation captured as a pipeline stage.
  • @decision for model config -- selects between standard, high-throughput, and memory-efficient configurations.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.vllm_agent --dry-run

# Live mode (also requires WAXELL_API_KEY and WAXELL_API_URL; see Environment variables)
export OPENAI_API_KEY="sk-..."
python -m app.demos.vllm_agent

Source

dev/waxell-dev/app/demos/vllm_agent.py