Ollama

A local inference pipeline using Ollama with llama3.2. Demonstrates real local model inference with zero-cost attribution -- all LLM calls hit a locally-running Ollama instance. The agent classifies intent with @decision, plans response structure with @reasoning, and generates a detailed response, all using actual LLM calls to llama3.2. Uses manual record_llm_call() with cost=0.0 for local inference cost tracking.

Environment variables

This example requires a running Ollama instance with the llama3.2 model pulled, plus the WAXELL_API_KEY and WAXELL_API_URL environment variables. Use --dry-run to run without Ollama.
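Assuming a standard Ollama install, setup might look like the following sketch (the endpoint URL is Ollama's default; the WAXELL_* values are placeholders, not real credentials):

```shell
# Pull the model and confirm the local server is reachable (Ollama's default port is 11434)
ollama pull llama3.2
curl -s http://localhost:11434/api/tags >/dev/null && echo "ollama is up"

# Placeholder credentials -- substitute your real values
export WAXELL_API_KEY="wx-..."
export WAXELL_API_URL="https://api.example.com"

# Or skip Ollama entirely:
python -m app.demos.ollama_agent --dry-run
```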

Architecture

Key Code

LLM-powered decision with manual recording

The @decision decorator wraps a real Ollama LLM call. Since Ollama's auto-instrumentor may not be present, the agent manually records token usage with cost=0.0.

@waxell.decision(name="classify_intent", options=["technical", "conceptual", "comparison"])
async def classify_intent(query: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                "Classify the user query intent as exactly one of: technical, conceptual, comparison. "
                'Respond with JSON: {"chosen": "...", "reasoning": "..."}'
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    tokens_in = response.get("prompt_eval_count", 0)
    tokens_out = response.get("eval_count", 0)

    ctx = waxell.get_context()
    if ctx:
        ctx.record_llm_call(
            model="llama3.2", tokens_in=tokens_in, tokens_out=tokens_out,
            cost=0.0, task="classify_intent",
        )
    try:
        return json.loads(content)
    except Exception:
        return {"chosen": "conceptual", "reasoning": content[:200]}
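The parse-with-fallback step above can be exercised on its own; a minimal sketch (the parse_classification helper is illustrative, not part of the demo):

```python
import json

def parse_classification(content: str) -> dict:
    # Mirror the demo's fallback: accept well-formed JSON, otherwise
    # default the choice and keep a truncated excerpt as the reasoning.
    try:
        return json.loads(content)
    except Exception:
        return {"chosen": "conceptual", "reasoning": content[:200]}

# Well-formed model output parses directly:
ok = parse_classification('{"chosen": "technical", "reasoning": "mentions an API"}')

# Free-form output falls back to a safe default:
fallback = parse_classification("I think this is a conceptual question about tradeoffs.")
```

Keeping the fallback total means a malformed model reply degrades the classification rather than crashing the trace.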

Reasoning-guided response planning

The @reasoning decorator wraps another Ollama call that plans how to structure the final response based on classified intent.

@waxell.reasoning_dec(step="plan_response")
async def plan_response_structure(query: str, intent: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                f"The user's query intent is: {intent}. "
                "Think about how to best structure a response. "
                "Return a brief plan with 2-3 key points to cover."
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    return {
        "thought": f"Query classified as '{intent}'. Planning structured response.",
        "evidence": [f"Intent: {intent}", f"Query length: {len(query.split())} words"],
        "conclusion": content[:300],
    }

What this demonstrates

  • @waxell.observe -- single agent with local inference
  • @waxell.decision -- LLM-powered intent classification via Ollama
  • @waxell.reasoning_dec -- LLM-powered response planning via Ollama
  • record_llm_call() -- manual recording with cost=0.0 for local inference
  • waxell.get_context() -- accessing the current WaxellContext programmatically
  • waxell.tag() -- provider and model tagging
  • waxell.score() -- quality and confidence scores
  • waxell.metadata() -- local inference metadata with zero cost
  • Local inference -- no cloud API keys needed, zero cost attribution
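The zero-cost accounting pattern reduces to reading Ollama's token counters off each response and recording them with cost=0.0. A minimal sketch with a stand-in recorder (the LocalCallLog class and the sample response values are illustrative; prompt_eval_count and eval_count are the fields the Ollama chat API actually returns):

```python
class LocalCallLog:
    """Illustrative stand-in for a Waxell context: accumulates per-call usage."""
    def __init__(self):
        self.calls = []

    def record_llm_call(self, model, tokens_in, tokens_out, cost, task):
        self.calls.append({"model": model, "tokens_in": tokens_in,
                           "tokens_out": tokens_out, "cost": cost, "task": task})

# Shape of an Ollama chat response (token counts here are made up):
response = {
    "message": {"role": "assistant", "content": '{"chosen": "technical"}'},
    "prompt_eval_count": 57,   # input tokens
    "eval_count": 112,         # output tokens
}

log = LocalCallLog()
log.record_llm_call(
    model="llama3.2",
    tokens_in=response.get("prompt_eval_count", 0),
    tokens_out=response.get("eval_count", 0),
    cost=0.0,                  # local inference: tokens are tracked, dollars are not
    task="classify_intent",
)

total_cost = sum(c["cost"] for c in log.calls)
total_tokens = sum(c["tokens_in"] + c["tokens_out"] for c in log.calls)
```

Token usage stays visible in the trace while cost attribution correctly sums to zero across any number of local calls.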

Run it

cd dev/waxell-dev
python -m app.demos.ollama_agent --dry-run

Source

dev/waxell-dev/app/demos/ollama_agent.py