
Local Inference Agent

A multi-agent comparison of four local inference engines: llama.cpp, llamafile, LocalAI, and ExLlamaV2. A local-model-runner child agent exercises all four engines, while a result-evaluator assesses tradeoffs via @reasoning and recommends the best engine via @decision.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Architecture

Key Code

Four Local Inference Engine Tool Calls

Each tool call exercises the exact method that the corresponding instrumentor wraps, running models fully on-device.

@waxell.tool(tool_type="inference")
def run_llamacpp(model, prompt: str, config: dict) -> dict:
    """Run llama.cpp with GGUF quantized model (Llama.__call__)."""
    result = model(prompt, **config)
    return {
        "engine": "llama.cpp",
        "model": model.model_path,
        "tokens_out": result["usage"]["completion_tokens"],
    }

@waxell.tool(tool_type="inference")
def run_llamafile(server_url: str, prompt: str, config: dict) -> dict:
    """Run llamafile via OpenAI-compatible API."""
    # Simplified stub: a real call would POST the prompt to server_url.
    return {"engine": "llamafile", "tokens_out": config.get("max_tokens", 256)}

@waxell.tool(tool_type="inference")
def run_localai(api_url: str, prompt: str, config: dict) -> dict:
    """Run LocalAI with OpenAI-compatible endpoint."""
    # Simplified stub: a real call would POST the prompt to api_url.
    return {"engine": "localai", "backend": config.get("backend", "llama")}

@waxell.tool(tool_type="inference")
def run_exllamav2(model, prompt: str, config: dict) -> dict:
    """Run ExLlamaV2 with GPTQ/EXL2 quantized model (ExLlamaV2.generate)."""
    result = model.generate(prompt, **config)
    return {"engine": "exllamav2", "quantization": "exl2", "tokens_out": len(result.split())}

Engine Tradeoff Analysis

The evaluator assesses tradeoffs between quantization formats, speed, and compatibility.

@waxell.reasoning_dec(step="engine_tradeoffs")
async def evaluate_engine_tradeoffs(results: list) -> dict:
    return {
        "thought": "llama.cpp offers widest GGUF compatibility, ExLlamaV2 fastest with EXL2, "
        "llamafile simplest deployment, LocalAI most flexible with multiple backends.",
        "evidence": [f"{r['engine']}: {r.get('tokens_per_second', 'N/A')} tok/s" for r in results],
        "conclusion": "Choice depends on deployment constraints: portability vs speed vs flexibility",
    }

@waxell.decision(name="recommend_engine", options=["llamacpp", "llamafile", "localai", "exllamav2"])
async def recommend_engine(comparison: dict) -> dict:
    return {"chosen": "llamacpp", "reasoning": "Best balance of compatibility and performance"}
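The evaluator's evidence list is built from the raw tool results, which may or may not carry a throughput measurement. A minimal sketch of assembling that input, assuming results without a `tokens_per_second` field (e.g. from dry-run mode) should sort last; the `rank_by_throughput` helper is illustrative, not part of the demo:

```python
def rank_by_throughput(results: list[dict]) -> list[dict]:
    """Sort engine results fastest-first; engines without a
    tokens_per_second measurement sink to the bottom."""
    return sorted(results, key=lambda r: r.get("tokens_per_second", 0.0), reverse=True)


results = [
    {"engine": "llama.cpp", "tokens_per_second": 42.0},
    {"engine": "exllamav2", "tokens_per_second": 110.0},
    {"engine": "llamafile"},  # no measurement in dry-run mode
]
ranking = rank_by_throughput(results)  # fastest engine first
```

Sorting before the @reasoning step keeps the evidence list in a stable, comparable order regardless of which engine finished first.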

What this demonstrates

  • llama.cpp instrumentor -- Llama.__call__ with GGUF quantized models.
  • llamafile instrumentor -- OpenAI-compatible API served from a single executable.
  • LocalAI instrumentor -- multi-backend local inference with OpenAI-compatible endpoints.
  • ExLlamaV2 instrumentor -- ExLlamaV2.generate with GPTQ/EXL2 quantization.
  • @reasoning for tradeoff analysis -- documents engine-specific strengths and weaknesses.
  • @retrieval for output collection -- gathers and ranks engine results for comparison.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.local_inference_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.local_inference_agent

Source

dev/waxell-dev/app/demos/local_inference_agent.py