BentoML Agent

Demonstrates the BentoML instrumentor for model serving, runner prediction, batch inference, and service management. A parent orchestrator coordinates a bentoml-classifier (single and batch predictions) and a bentoml-generator (text generation) with runner metrics and model tag attribution.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Runner Prediction with Model Tags

Functions decorated with @waxell.tool wrap the BentoML runner's predict.run() call for single and batch inference.

```python
@waxell.tool(tool_type="ml_serving")
def run_prediction(runner, input_data: dict) -> dict:
    """Single prediction via BentoML runner (Runner.predict.run)."""
    result = runner.predict.run(input_data)
    return {
        "model_tag": str(runner.tag),
        "prediction": result["label"],
        "confidence": result["confidence"],
        "latency_ms": result["latency_ms"],
    }
```

```python
@waxell.tool(tool_type="ml_serving")
def run_batch_prediction(runner, batch: list) -> dict:
    """Batch prediction via BentoML runner."""
    results = runner.predict.run(batch)
    return {
        "model_tag": str(runner.tag),
        "batch_size": len(batch),
        "predictions": len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```
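Both tools depend only on two parts of the runner surface: `.tag` and `.predict.run(...)`. That makes them easy to exercise without BentoML or live serving, in the spirit of the demo's --dry-run mode. The sketch below is illustrative: `make_stub_runner` and its canned results are hypothetical and not part of the demo's code, and the waxell decorators are omitted.

```python
# Minimal stand-in for a BentoML runner: only .tag and .predict.run(...)
# mirror the real runner surface the tools above rely on.
from types import SimpleNamespace

def make_stub_runner(tag: str = "clf_model:v1"):
    def run(payload):
        if isinstance(payload, list):
            # Batch path: one canned result per input, with varying confidence.
            return [{"label": "positive", "confidence": 0.90 + 0.02 * i}
                    for i, _ in enumerate(payload)]
        # Single path: one canned result shaped like the demo expects.
        return {"label": "positive", "confidence": 0.97, "latency_ms": 4.2}
    return SimpleNamespace(tag=tag, predict=SimpleNamespace(run=run))

def run_prediction(runner, input_data: dict) -> dict:
    # Same body as the decorated tool above, minus @waxell.tool.
    result = runner.predict.run(input_data)
    return {
        "model_tag": str(runner.tag),
        "prediction": result["label"],
        "confidence": result["confidence"],
        "latency_ms": result["latency_ms"],
    }

def run_batch_prediction(runner, batch: list) -> dict:
    # Batch counterpart: aggregates confidence across all canned results.
    results = runner.predict.run(batch)
    return {
        "model_tag": str(runner.tag),
        "batch_size": len(batch),
        "predictions": len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }

stub = make_stub_runner()
single = run_prediction(stub, {"text": "great product"})
batch = run_batch_prediction(stub, [{"text": t} for t in ("a", "b", "c")])
```

Because only the two attributes are touched, the same stub works for both tools; the real demo swaps in an actual BentoML runner with the same surface.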

Service Configuration and Runner Selection

The orchestrator loads a BentoML service and decides which runner to use.

```python
@waxell.step_dec(name="load_service")
def load_service(service_name: str) -> dict:
    return {"service": service_name, "runners_loaded": 2, "status": "ready"}

@waxell.decision(name="choose_runner", options=["classifier", "generator", "both"])
async def choose_runner(query: str) -> dict:
    return {"chosen": "both", "reasoning": "Demo exercises both classification and generation runners"}
```
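The decision's "chosen" value can then drive which runners actually execute. A minimal sketch of that dispatch is below; the option names come from the @waxell.decision options above, but the routing table itself (`runners_for`) is hypothetical and not taken from the demo, and the decorator is omitted.

```python
import asyncio

async def choose_runner(query: str) -> dict:
    # Mirrors the demo's stubbed decision, decorator omitted.
    return {"chosen": "both", "reasoning": "Demo exercises both runners"}

def runners_for(chosen: str) -> list:
    # Map each decision option to the runner names to invoke.
    table = {
        "classifier": ["classifier"],
        "generator": ["generator"],
        "both": ["classifier", "generator"],
    }
    return table[chosen]

decision = asyncio.run(choose_runner("demo query"))
selected = runners_for(decision["chosen"])  # ["classifier", "generator"]
```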

What this demonstrates

  • BentoML instrumentor -- Runner.__init__, Runner.predict.run(), Service.__init__, and the bentoml.runner() factory are all traced.
  • Model tag attribution -- each prediction includes the BentoML model tag (name:version) for provenance.
  • Batch inference -- batch predictions traced with aggregate metrics (avg confidence, throughput).
  • @step_dec for service lifecycle -- service loading and metrics recording captured as pipeline stages.
  • @decision for runner routing -- chooses between classification, generation, or both runners.
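On the model tag attribution point: a BentoML model tag has the form name:version (e.g. "clf_model:v1"), so downstream consumers can split it for provenance reporting. The helper below is an illustrative sketch, not a function from the demo or from BentoML itself; treating a missing version as "latest" mirrors BentoML's alias convention.

```python
def split_model_tag(tag: str) -> tuple:
    # "name:version" -> ("name", "version"); partition splits on the first colon.
    name, _, version = tag.partition(":")
    # Assumption for this sketch: a bare name resolves to the "latest" alias.
    return name, version or "latest"
```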

Run it

```shell
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.bentoml_agent --dry-run

# Live mode (also set the Waxell variables listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.bentoml_agent
```

Source

dev/waxell-dev/app/demos/bentoml_agent.py