# BentoML Agent
Demonstrates the BentoML instrumentor for model serving, runner prediction, batch inference, and service management. A parent orchestrator coordinates a bentoml-classifier (single and batch predictions) and a bentoml-generator (text generation) with runner metrics and model tag attribution.
## Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
## Architecture
## Key Code
### Runner Prediction with Model Tags
`@tool`-decorated functions exercise the BentoML runner's `predict.run()` for single and batch inference.
```python
@waxell.tool(tool_type="ml_serving")
def run_prediction(runner, input_data: dict) -> dict:
    """Single prediction via BentoML runner (Runner.predict.run)."""
    result = runner.predict.run(input_data)
    return {
        "model_tag": str(runner.tag),
        "prediction": result["label"],
        "confidence": result["confidence"],
        "latency_ms": result["latency_ms"],
    }


@waxell.tool(tool_type="ml_serving")
def run_batch_prediction(runner, batch: list) -> dict:
    """Batch prediction via BentoML runner."""
    results = runner.predict.run(batch)
    return {
        "model_tag": str(runner.tag),
        "batch_size": len(batch),
        "predictions": len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```
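The tools above only require an object exposing a `tag` and a `predict.run()` method. A minimal dry-run sketch with a hypothetical stand-in runner (no BentoML or waxell imports; a real runner would come from BentoML's model store) illustrates the shapes they expect:

```python
# Hypothetical stand-in for a BentoML runner, for dry-run experiments only.
class FakeRunner:
    tag = "demo-classifier:v1"  # mimics a BentoML model tag (name:version)

    class predict:  # mimics the Runner.predict.run(...) call surface
        @staticmethod
        def run(data):
            # Batch calls return one result dict per input item.
            if isinstance(data, list):
                return [{"label": "positive", "confidence": 0.9, "latency_ms": 3}
                        for _ in data]
            return {"label": "positive", "confidence": 0.9, "latency_ms": 3}


runner = FakeRunner()
single = runner.predict.run({"text": "great service"})
batch = runner.predict.run([{"text": "a"}, {"text": "b"}])
print(single["label"], len(batch))  # positive 2
```

Passing this stand-in to `run_prediction` or `run_batch_prediction` exercises the tracing path without loading a model.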
### Service Configuration and Runner Selection
The orchestrator loads a BentoML service and decides which runner to use.
```python
@waxell.step_dec(name="load_service")
def load_service(service_name: str) -> dict:
    return {"service": service_name, "runners_loaded": 2, "status": "ready"}


@waxell.decision(name="choose_runner", options=["classifier", "generator", "both"])
async def choose_runner(query: str) -> dict:
    return {"chosen": "both", "reasoning": "Demo exercises both classification and generation runners"}
```
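The decision's `chosen` field can drive a simple dispatch table. A sketch of how the orchestrator might route on it (plain functions without waxell decorators; the runner names are the demo's two services):

```python
# Map a choose_runner-style decision to the runner(s) to invoke.
ROUTES = {
    "classifier": ["bentoml-classifier"],
    "generator": ["bentoml-generator"],
    "both": ["bentoml-classifier", "bentoml-generator"],
}


def route(decision: dict) -> list:
    """Return the runner names selected by the decision output."""
    return ROUTES[decision["chosen"]]


print(route({"chosen": "both", "reasoning": "demo"}))
# ['bentoml-classifier', 'bentoml-generator']
```

Keeping routing in a lookup table means adding a runner is a one-line change rather than a new branch.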
## What this demonstrates
- BentoML instrumentor -- `Runner.__init__`, `Runner.predict.run()`, `Service.__init__`, and the `bentoml.runner()` factory are traced.
- Model tag attribution -- each prediction includes the BentoML model tag (`name:version`) for provenance.
- Batch inference -- batch predictions are traced with aggregate metrics (avg confidence, throughput).
- `@step` for service lifecycle -- service loading and metrics recording are captured as pipeline stages.
- `@decision` for runner routing -- chooses between classification, generation, or both runners.
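Since provenance hinges on the `name:version` tag convention, a small sketch of splitting a tag into its parts (assumes only the string format; the "latest" fallback mirrors BentoML's default version):

```python
def split_tag(tag: str) -> dict:
    """Split a BentoML-style model tag ("name:version") into parts."""
    name, _, version = tag.partition(":")
    return {"name": name, "version": version or "latest"}


print(split_tag("bentoml-classifier:v2"))
# {'name': 'bentoml-classifier', 'version': 'v2'}
```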
## Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.bentoml_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.bentoml_agent
```