
Local Inference Agent

A multi-agent comparison of four local inference engines: llama.cpp, llamafile, LocalAI, and ExLlamaV2. A local-model-runner child agent exercises all four engines, while a result-evaluator assesses tradeoffs via @reasoning and recommends the best engine via @decision.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Four Local Inference Engine Tool Calls

Each tool call exercises the exact method that the corresponding instrumentor wraps, running models fully on-device.

@waxell.tool(tool_type="inference")
def run_llamacpp(model, prompt: str, config: dict) -> dict:
    """Run llama.cpp with GGUF quantized model (Llama.__call__)."""
    result = model(prompt, **config)
    return {
        "engine": "llama.cpp",
        "model": model.model_path,
        "tokens_out": result["usage"]["completion_tokens"],
    }

@waxell.tool(tool_type="inference")
def run_llamafile(server_url: str, prompt: str, config: dict) -> dict:
    """Run llamafile via OpenAI-compatible API."""
    # Simplified stub: a real call would POST the prompt to server_url.
    return {"engine": "llamafile", "tokens_out": config.get("max_tokens", 256)}

@waxell.tool(tool_type="inference")
def run_localai(api_url: str, prompt: str, config: dict) -> dict:
    """Run LocalAI with OpenAI-compatible endpoint."""
    # Simplified stub: a real call would POST the prompt to api_url.
    return {"engine": "localai", "backend": config.get("backend", "llama")}

@waxell.tool(tool_type="inference")
def run_exllamav2(model, prompt: str, config: dict) -> dict:
    """Run ExLlamaV2 with GPTQ/EXL2 quantized model (ExLlamaV2.generate)."""
    result = model.generate(prompt, **config)
    return {"engine": "exllamav2", "quantization": "exl2", "tokens_out": len(result.split())}

Engine Tradeoff Analysis

The evaluator assesses tradeoffs between quantization formats, speed, and compatibility.

@waxell.reasoning_dec(step="engine_tradeoffs")
async def evaluate_engine_tradeoffs(results: list) -> dict:
    return {
        "thought": "llama.cpp offers widest GGUF compatibility, ExLlamaV2 fastest with EXL2, "
        "llamafile simplest deployment, LocalAI most flexible with multiple backends.",
        "evidence": [f"{r['engine']}: {r.get('tokens_per_second', 'N/A')} tok/s" for r in results],
        "conclusion": "Choice depends on deployment constraints: portability vs speed vs flexibility",
    }

@waxell.decision(name="recommend_engine", options=["llamacpp", "llamafile", "localai", "exllamav2"])
async def recommend_engine(comparison: dict) -> dict:
    return {"chosen": "llamacpp", "reasoning": "Best balance of compatibility and performance"}
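The evaluator's evidence list is built from the raw tool results, which may or may not carry a throughput measurement. A minimal sketch of assembling that input, assuming results without a `tokens_per_second` field (e.g. from dry-run mode) should sort last; the `rank_by_throughput` helper is illustrative, not part of the demo:

```python
def rank_by_throughput(results: list[dict]) -> list[dict]:
    """Sort engine results fastest-first; engines without a
    tokens_per_second measurement sink to the bottom."""
    return sorted(results, key=lambda r: r.get("tokens_per_second", 0.0), reverse=True)


results = [
    {"engine": "llama.cpp", "tokens_per_second": 42.0},
    {"engine": "exllamav2", "tokens_per_second": 110.0},
    {"engine": "llamafile"},  # no measurement in dry-run mode
]
ranking = rank_by_throughput(results)  # fastest engine first
```

Sorting before the @reasoning step keeps the evidence list in a stable, comparable order regardless of which engine finished first.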

What this demonstrates

  • llama.cpp instrumentor -- Llama.__call__ with GGUF quantized models.
  • llamafile instrumentor -- OpenAI-compatible API served from a single executable.
  • LocalAI instrumentor -- multi-backend local inference with OpenAI-compatible endpoints.
  • ExLlamaV2 instrumentor -- ExLlamaV2.generate with GPTQ/EXL2 quantization.
  • @reasoning for tradeoff analysis -- documents engine-specific strengths and weaknesses.
  • @retrieval for output collection -- gathers and ranks engine results for comparison.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.local_inference_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.local_inference_agent

Source

dev/waxell-dev/app/demos/local_inference_agent.py