BentoML Agent

Demonstrates the BentoML instrumentor for model serving, runner prediction, batch inference, and service management. A parent orchestrator coordinates a bentoml-classifier (single and batch predictions) and a bentoml-generator (text generation) with runner metrics and model tag attribution.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Runner Prediction with Model Tags

Functions decorated with @waxell.tool wrap the BentoML runner's predict.run() call for single and batch inference.

```python
@waxell.tool(tool_type="ml_serving")
def run_prediction(runner, input_data: dict) -> dict:
    """Single prediction via BentoML runner (Runner.predict.run)."""
    result = runner.predict.run(input_data)
    return {
        "model_tag": str(runner.tag),
        "prediction": result["label"],
        "confidence": result["confidence"],
        "latency_ms": result["latency_ms"],
    }
```

```python
@waxell.tool(tool_type="ml_serving")
def run_batch_prediction(runner, batch: list) -> dict:
    """Batch prediction via BentoML runner."""
    results = runner.predict.run(batch)
    return {
        "model_tag": str(runner.tag),
        "batch_size": len(batch),
        "predictions": len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```
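Both tools depend only on two parts of the runner surface: `.tag` and `.predict.run(...)`. That makes them easy to exercise without BentoML or live serving, in the spirit of the demo's --dry-run mode. The sketch below is illustrative: `make_stub_runner` and its canned results are hypothetical and not part of the demo's code, and the waxell decorators are omitted.

```python
# Minimal stand-in for a BentoML runner: only .tag and .predict.run(...)
# mirror the real runner surface the tools above rely on.
from types import SimpleNamespace

def make_stub_runner(tag: str = "clf_model:v1"):
    def run(payload):
        if isinstance(payload, list):
            # Batch path: one canned result per input, with varying confidence.
            return [{"label": "positive", "confidence": 0.90 + 0.02 * i}
                    for i, _ in enumerate(payload)]
        # Single path: one canned result shaped like the demo expects.
        return {"label": "positive", "confidence": 0.97, "latency_ms": 4.2}
    return SimpleNamespace(tag=tag, predict=SimpleNamespace(run=run))

def run_prediction(runner, input_data: dict) -> dict:
    # Same body as the decorated tool above, minus @waxell.tool.
    result = runner.predict.run(input_data)
    return {
        "model_tag": str(runner.tag),
        "prediction": result["label"],
        "confidence": result["confidence"],
        "latency_ms": result["latency_ms"],
    }

def run_batch_prediction(runner, batch: list) -> dict:
    # Batch counterpart: aggregates confidence across all canned results.
    results = runner.predict.run(batch)
    return {
        "model_tag": str(runner.tag),
        "batch_size": len(batch),
        "predictions": len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }

stub = make_stub_runner()
single = run_prediction(stub, {"text": "great product"})
batch = run_batch_prediction(stub, [{"text": t} for t in ("a", "b", "c")])
```

Because only the two attributes are touched, the same stub works for both tools; the real demo swaps in an actual BentoML runner with the same surface.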

Service Configuration and Runner Selection

The orchestrator loads a BentoML service and decides which runner to use.

```python
@waxell.step_dec(name="load_service")
def load_service(service_name: str) -> dict:
    return {"service": service_name, "runners_loaded": 2, "status": "ready"}

@waxell.decision(name="choose_runner", options=["classifier", "generator", "both"])
async def choose_runner(query: str) -> dict:
    return {"chosen": "both", "reasoning": "Demo exercises both classification and generation runners"}
```
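The decision's "chosen" value can then drive which runners actually execute. A minimal sketch of that dispatch is below; the option names come from the @waxell.decision options above, but the routing table itself (`runners_for`) is hypothetical and not taken from the demo, and the decorator is omitted.

```python
import asyncio

async def choose_runner(query: str) -> dict:
    # Mirrors the demo's stubbed decision, decorator omitted.
    return {"chosen": "both", "reasoning": "Demo exercises both runners"}

def runners_for(chosen: str) -> list:
    # Map each decision option to the runner names to invoke.
    table = {
        "classifier": ["classifier"],
        "generator": ["generator"],
        "both": ["classifier", "generator"],
    }
    return table[chosen]

decision = asyncio.run(choose_runner("demo query"))
selected = runners_for(decision["chosen"])  # ["classifier", "generator"]
```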

What this demonstrates

  • BentoML instrumentor -- Runner.__init__, Runner.predict.run(), Service.__init__, and the bentoml.runner() factory are all traced.
  • Model tag attribution -- each prediction includes the BentoML model tag (name:version) for provenance.
  • Batch inference -- batch predictions traced with aggregate metrics (avg confidence, throughput).
  • @step_dec for service lifecycle -- service loading and metrics recording captured as pipeline stages.
  • @decision for runner routing -- chooses between classification, generation, or both runners.
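On the model tag attribution point: a BentoML model tag has the form name:version (e.g. "clf_model:v1"), so downstream consumers can split it for provenance reporting. The helper below is an illustrative sketch, not a function from the demo or from BentoML itself; treating a missing version as "latest" mirrors BentoML's alias convention.

```python
def split_model_tag(tag: str) -> tuple:
    # "name:version" -> ("name", "version"); partition splits on the first colon.
    name, _, version = tag.partition(":")
    # Assumption for this sketch: a bare name resolves to the "latest" alias.
    return name, version or "latest"
```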

Run it

```shell
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.bentoml_agent --dry-run

# Live mode (also set the Waxell variables listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.bentoml_agent
```

Source

dev/waxell-dev/app/demos/bentoml_agent.py