Ollama

A local inference pipeline using Ollama with llama3.2. Demonstrates real local model inference with zero-cost attribution -- all LLM calls hit a locally-running Ollama instance. The agent classifies intent with @decision, plans response structure with @reasoning, and generates a detailed response, all using actual LLM calls to llama3.2. Uses manual record_llm_call() with cost=0.0 for local inference cost tracking.

Environment variables

This example requires a running Ollama instance with the llama3.2 model pulled, plus the WAXELL_API_KEY and WAXELL_API_URL environment variables. Use --dry-run to run without Ollama.
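Assuming a standard Ollama install, setup might look like the following sketch (the endpoint URL is Ollama's default; the WAXELL_* values are placeholders, not real credentials):

```shell
# Pull the model and confirm the local server is reachable (Ollama's default port is 11434)
ollama pull llama3.2
curl -s http://localhost:11434/api/tags >/dev/null && echo "ollama is up"

# Placeholder credentials -- substitute your real values
export WAXELL_API_KEY="wx-..."
export WAXELL_API_URL="https://api.example.com"

# Or skip Ollama entirely:
python -m app.demos.ollama_agent --dry-run
```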

Architecture

Key Code

LLM-powered decision with manual recording

The @decision decorator wraps a real Ollama LLM call. Since Ollama's auto-instrumentor may not be present, the agent manually records token usage with cost=0.0.

@waxell.decision(name="classify_intent", options=["technical", "conceptual", "comparison"])
async def classify_intent(query: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                "Classify the user query intent as exactly one of: technical, conceptual, comparison. "
                'Respond with JSON: {"chosen": "...", "reasoning": "..."}'
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    tokens_in = response.get("prompt_eval_count", 0)
    tokens_out = response.get("eval_count", 0)

    ctx = waxell.get_context()
    if ctx:
        ctx.record_llm_call(
            model="llama3.2", tokens_in=tokens_in, tokens_out=tokens_out,
            cost=0.0, task="classify_intent",
        )
    try:
        return json.loads(content)
    except Exception:
        return {"chosen": "conceptual", "reasoning": content[:200]}
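The parse-with-fallback step above can be exercised on its own; a minimal sketch (the parse_classification helper is illustrative, not part of the demo):

```python
import json

def parse_classification(content: str) -> dict:
    # Mirror the demo's fallback: accept well-formed JSON, otherwise
    # default the choice and keep a truncated excerpt as the reasoning.
    try:
        return json.loads(content)
    except Exception:
        return {"chosen": "conceptual", "reasoning": content[:200]}

# Well-formed model output parses directly:
ok = parse_classification('{"chosen": "technical", "reasoning": "mentions an API"}')

# Free-form output falls back to a safe default:
fallback = parse_classification("I think this is a conceptual question about tradeoffs.")
```

Keeping the fallback total means a malformed model reply degrades the classification rather than crashing the trace.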

Reasoning-guided response planning

The @reasoning decorator wraps another Ollama call that plans how to structure the final response based on classified intent.

@waxell.reasoning_dec(step="plan_response")
async def plan_response_structure(query: str, intent: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                f"The user's query intent is: {intent}. "
                "Think about how to best structure a response. "
                "Return a brief plan with 2-3 key points to cover."
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    return {
        "thought": f"Query classified as '{intent}'. Planning structured response.",
        "evidence": [f"Intent: {intent}", f"Query length: {len(query.split())} words"],
        "conclusion": content[:300],
    }

What this demonstrates

  • @waxell.observe -- single agent with local inference
  • @waxell.decision -- LLM-powered intent classification via Ollama
  • @waxell.reasoning_dec -- LLM-powered response planning via Ollama
  • record_llm_call() -- manual recording with cost=0.0 for local inference
  • waxell.get_context() -- accessing the current WaxellContext programmatically
  • waxell.tag() -- provider and model tagging
  • waxell.score() -- quality and confidence scores
  • waxell.metadata() -- local inference metadata with zero cost
  • Local inference -- no cloud API keys needed, zero cost attribution
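The zero-cost accounting pattern reduces to reading Ollama's token counters off each response and recording them with cost=0.0. A minimal sketch with a stand-in recorder (the LocalCallLog class and the sample response values are illustrative; prompt_eval_count and eval_count are the fields the Ollama chat API actually returns):

```python
class LocalCallLog:
    """Illustrative stand-in for a Waxell context: accumulates per-call usage."""
    def __init__(self):
        self.calls = []

    def record_llm_call(self, model, tokens_in, tokens_out, cost, task):
        self.calls.append({"model": model, "tokens_in": tokens_in,
                           "tokens_out": tokens_out, "cost": cost, "task": task})

# Shape of an Ollama chat response (token counts here are made up):
response = {
    "message": {"role": "assistant", "content": '{"chosen": "technical"}'},
    "prompt_eval_count": 57,   # input tokens
    "eval_count": 112,         # output tokens
}

log = LocalCallLog()
log.record_llm_call(
    model="llama3.2",
    tokens_in=response.get("prompt_eval_count", 0),
    tokens_out=response.get("eval_count", 0),
    cost=0.0,                  # local inference: tokens are tracked, dollars are not
    task="classify_intent",
)

total_cost = sum(c["cost"] for c in log.calls)
total_tokens = sum(c["tokens_in"] + c["tokens_out"] for c in log.calls)
```

Token usage stays visible in the trace while cost attribution correctly sums to zero across any number of local calls.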

Run it

cd dev/waxell-dev
python -m app.demos.ollama_agent --dry-run

Source

dev/waxell-dev/app/demos/ollama_agent.py