vLLM Agent

Demonstrates the vLLM instrumentor for high-throughput local model serving with PagedAttention optimization tracking. A parent orchestrator coordinates a vllm-inference child agent that loads a model and runs batch generation.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

vLLM Model Loading and Batch Generation

Two @tool-decorated functions exercise the vLLM engine lifecycle: model loading with GPU memory configuration and batch inference with PagedAttention.

@waxell.tool(tool_type="ml_serving")
def load_model(model_name: str, config: dict) -> dict:
    """Simulate loading a model with vLLM engine."""
    return {
        "model": model_name,
        "engine": "vllm",
        "tensor_parallel_size": config.get("tensor_parallel_size", 1),
        "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
        "status": "loaded",
    }

@waxell.tool(tool_type="ml_serving")
def batch_generate(prompts: list, model_name: str) -> dict:
    """Simulate batch inference with vLLM PagedAttention."""
    return {
        "batch_size": len(prompts),
        "total_tokens_out": 150 * len(prompts),
        "engine": "vllm",
        "optimization": "paged_attention",
    }
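Stripped of the waxell decorators so it runs standalone, the two simulated tools compose like this (a minimal sketch: the model name and the fixed 150-tokens-per-prompt figure come from the demo; the prompts are illustrative):

```python
# Plain-Python sketch of the demo's two simulated tools, with the
# @waxell.tool decorators removed so the snippet is self-contained.

def load_model(model_name: str, config: dict) -> dict:
    """Simulate loading a model with the vLLM engine."""
    return {
        "model": model_name,
        "engine": "vllm",
        "tensor_parallel_size": config.get("tensor_parallel_size", 1),
        "gpu_memory_utilization": config.get("gpu_memory_utilization", 0.9),
        "status": "loaded",
    }

def batch_generate(prompts: list, model_name: str) -> dict:
    """Simulate batch inference; the demo assumes 150 output tokens per prompt."""
    return {
        "batch_size": len(prompts),
        "total_tokens_out": 150 * len(prompts),
        "engine": "vllm",
        "optimization": "paged_attention",
    }

engine = load_model("meta-llama/Llama-3.2-3B-Instruct", {"gpu_memory_utilization": 0.9})
result = batch_generate(["Explain PagedAttention.", "Summarize vLLM."], engine["model"])
# Two prompts in, so batch_size is 2 and total_tokens_out is 2 * 150 = 300.
```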

Model Configuration Decision

The orchestrator decides on tensor parallelism and GPU memory settings.

@waxell.step_dec(name="prepare_config")
def prepare_config(query: str) -> dict:
    return {
        "model": "meta-llama/Llama-3.2-3B-Instruct",
        "tensor_parallel_size": 1,
        "gpu_memory_utilization": 0.9,
        "max_model_len": 4096,
    }

@waxell.decision(name="choose_model_config", options=["standard", "high_throughput", "memory_efficient"])
async def choose_model_config(query: str) -> dict:
    return {"chosen": "standard", "reasoning": "Single GPU with 90% memory utilization for balanced performance"}
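The demo always chooses "standard", but the three options could map to concrete engine settings along these lines (a hypothetical mapping: the "standard" preset mirrors prepare_config's values, while the other two presets and the apply_choice helper are illustrative, not part of the demo):

```python
# Hypothetical presets mapping the decision's options to vLLM engine settings.
# "standard" mirrors the demo's prepare_config values; the other two are
# illustrative guesses at what high_throughput / memory_efficient might mean.
CONFIG_PRESETS = {
    "standard": {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.9},
    "high_throughput": {"tensor_parallel_size": 2, "gpu_memory_utilization": 0.95},
    "memory_efficient": {"tensor_parallel_size": 1, "gpu_memory_utilization": 0.7},
}

def apply_choice(base_config: dict, chosen: str) -> dict:
    """Overlay the chosen preset onto the base model config."""
    merged = dict(base_config)
    merged.update(CONFIG_PRESETS[chosen])
    return merged

cfg = apply_choice(
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "max_model_len": 4096},
    "standard",
)
# cfg keeps model and max_model_len and gains the "standard" preset values.
```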

What this demonstrates

  • vLLM instrumentor -- model loading and batch generation with PagedAttention optimization.
  • GPU memory tracking -- gpu_memory_utilization and tensor_parallel_size captured in tool metadata.
  • Batch inference -- multiple prompts processed in a single batch_generate call with aggregate token counts.
  • @step_dec for configuration -- model config preparation captured as a pipeline stage.
  • @decision for model config -- selects between standard, high-throughput, and memory-efficient configurations.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.vllm_agent --dry-run

# Live mode (also requires WAXELL_API_KEY and WAXELL_API_URL; see Environment variables)
export OPENAI_API_KEY="sk-..."
python -m app.demos.vllm_agent

Source

dev/waxell-dev/app/demos/vllm_agent.py