Local Inference Agent
A multi-agent comparison of four local inference engines: llama.cpp, llamafile, LocalAI, and ExLlamaV2. A local-model-runner child agent exercises all four engines, while a result-evaluator assesses the tradeoffs via @reasoning and recommends the best engine via @decision.
Environment variables
This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.
Architecture
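A parent orchestrator spawns a local-model-runner child that invokes each engine tool in turn, then hands the collected results to a result-evaluator child, which records the tradeoff analysis with @reasoning and the final pick with @decision.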
Key Code
Four Local Inference Engine Tool Calls
Each tool call exercises the exact method that the corresponding instrumentor wraps, running its model on-device.
@waxell.tool(tool_type="inference")
def run_llamacpp(model, prompt: str, config: dict) -> dict:
"""Run llama.cpp with GGUF quantized model (Llama.__call__)."""
result = model(prompt, **config)
return {"engine": "llama.cpp", "model": model.model_path, "tokens_out": result["usage"]["completion_tokens"]}
@waxell.tool(tool_type="inference")
def run_llamafile(server_url: str, prompt: str, config: dict) -> dict:
"""Run llamafile via OpenAI-compatible API."""
return {"engine": "llamafile", "tokens_out": config.get("max_tokens", 256)}
@waxell.tool(tool_type="inference")
def run_localai(api_url: str, prompt: str, config: dict) -> dict:
"""Run LocalAI with OpenAI-compatible endpoint."""
return {"engine": "localai", "backend": config.get("backend", "llama")}
@waxell.tool(tool_type="inference")
def run_exllamav2(model, prompt: str, config: dict) -> dict:
"""Run ExLlamaV2 with GPTQ/EXL2 quantized model (ExLlamaV2.generate)."""
result = model.generate(prompt, **config)
return {"engine": "exllamav2", "quantization": "exl2", "tokens_out": len(result.split())}
Engine Tradeoff Analysis
The evaluator assesses tradeoffs between quantization formats, speed, and compatibility.
@waxell.reasoning_dec(step="engine_tradeoffs")
async def evaluate_engine_tradeoffs(results: list) -> dict:
return {
"thought": "llama.cpp offers widest GGUF compatibility, ExLlamaV2 fastest with EXL2, "
"llamafile simplest deployment, LocalAI most flexible with multiple backends.",
"evidence": [f"{r['engine']}: {r.get('tokens_per_second', 'N/A')} tok/s" for r in results],
"conclusion": "Choice depends on deployment constraints: portability vs speed vs flexibility",
}
@waxell.decision(name="recommend_engine", options=["llamacpp", "llamafile", "localai", "exllamav2"])
async def recommend_engine(comparison: dict) -> dict:
return {"chosen": "llamacpp", "reasoning": "Best balance of compatibility and performance"}
What this demonstrates
- llama.cpp instrumentor -- Llama.__call__ with GGUF quantized models.
- llamafile instrumentor -- OpenAI-compatible API served from a single executable.
- LocalAI instrumentor -- multi-backend local inference with OpenAI-compatible endpoints.
- ExLlamaV2 instrumentor -- ExLlamaV2.generate with GPTQ/EXL2 quantization.
- @reasoning for tradeoff analysis -- documents engine-specific strengths and weaknesses.
- @retrieval for output collection -- gathers and ranks engine results for comparison.
Run it
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.local_inference_agent --dry-run
# Live mode
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="https://..."
python -m app.demos.local_inference_agent