Ollama
A local inference pipeline using Ollama with llama3.2. All LLM calls hit a locally running Ollama instance, demonstrating real local model inference with zero-cost attribution. The agent classifies intent with @decision, plans response structure with @reasoning, and generates a detailed response -- each step an actual LLM call to llama3.2. Because local inference is free, usage is recorded manually via record_llm_call() with cost=0.0.
This example requires a running Ollama instance with llama3.2 pulled, plus the WAXELL_API_KEY and WAXELL_API_URL environment variables. Pass --dry-run to run without Ollama.
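As a rough sketch of the setup (the model name comes from this example; the Waxell values are placeholders for your own credentials):

```shell
# Fetch the model into the local Ollama instance
ollama pull llama3.2

# Waxell credentials expected by the demo (placeholder values)
export WAXELL_API_KEY=...
export WAXELL_API_URL=...
```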
Architecture
Key Code
LLM-powered decision with manual recording
The @waxell.decision decorator wraps a real Ollama LLM call. Because an auto-instrumentor for Ollama may not be present, the agent records token usage manually with cost=0.0.
```python
import json

import waxell


@waxell.decision(name="classify_intent", options=["technical", "conceptual", "comparison"])
async def classify_intent(query: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                "Classify the user query intent as exactly one of: technical, conceptual, comparison. "
                'Respond with JSON: {"chosen": "...", "reasoning": "..."}'
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    tokens_in = response.get("prompt_eval_count", 0)
    tokens_out = response.get("eval_count", 0)

    # Record token usage manually; local inference is free, so cost=0.0.
    ctx = waxell.get_context()
    if ctx:
        ctx.record_llm_call(
            model="llama3.2", tokens_in=tokens_in, tokens_out=tokens_out,
            cost=0.0, task="classify_intent",
        )

    # The model may return non-JSON text; fall back to a safe default.
    try:
        return json.loads(content)
    except json.JSONDecodeError:
        return {"chosen": "conceptual", "reasoning": content[:200]}
```
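The parsing path can be exercised without a running Ollama instance by swapping in a stub. The sketch below is illustrative (StubClient and classify are hypothetical names, not part of the demo); it mimics the shape of Ollama's chat() response dict and shows the JSON-parse-with-fallback pattern:

```python
import asyncio
import json


class StubClient:
    """Stand-in for ollama.AsyncClient returning a canned, Ollama-shaped reply."""
    async def chat(self, model: str, messages: list) -> dict:
        return {
            "message": {"content": '{"chosen": "technical", "reasoning": "mentions an API"}'},
            "prompt_eval_count": 42,  # tokens_in
            "eval_count": 18,         # tokens_out
        }


async def classify(query: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": query}],
    )
    content = response["message"]["content"]
    try:
        return json.loads(content)  # happy path: the model returned valid JSON
    except json.JSONDecodeError:
        return {"chosen": "conceptual", "reasoning": content[:200]}  # fallback


result = asyncio.run(classify("How do I paginate the REST API?", StubClient()))
print(result["chosen"])  # technical
```

This makes the fallback branch easy to test too: feed the stub a non-JSON string and the function degrades to the "conceptual" default instead of raising.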
Reasoning-guided response planning
The @waxell.reasoning_dec decorator wraps a second Ollama call that plans how to structure the final response based on the classified intent.
```python
@waxell.reasoning_dec(step="plan_response")
async def plan_response_structure(query: str, intent: str, client) -> dict:
    response = await client.chat(
        model="llama3.2",
        messages=[
            {"role": "system", "content": (
                f"The user's query intent is: {intent}. "
                "Think about how to best structure a response. "
                "Return a brief plan with 2-3 key points to cover."
            )},
            {"role": "user", "content": query},
        ],
    )
    content = response["message"]["content"]
    return {
        "thought": f"Query classified as '{intent}'. Planning structured response.",
        "evidence": [f"Intent: {intent}", f"Query length: {len(query.split())} words"],
        "conclusion": content[:300],
    }
```
What this demonstrates
- `@waxell.observe` -- single agent with local inference
- `@waxell.decision` -- LLM-powered intent classification via Ollama
- `@waxell.reasoning_dec` -- LLM-powered response planning via Ollama
- `record_llm_call()` -- manual recording with `cost=0.0` for local inference
- `waxell.get_context()` -- accessing the current WaxellContext programmatically
- `waxell.tag()` -- provider and model tagging
- `waxell.score()` -- quality and confidence scores
- `waxell.metadata()` -- local inference metadata with zero cost
- Local inference -- no cloud API keys needed, zero-cost attribution
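The zero-cost accounting above amounts to summing token counts while every recorded cost stays 0.0. A minimal sketch of that bookkeeping (LocalContext and totals are illustrative stand-ins, not the real WaxellContext API):

```python
from dataclasses import dataclass, field


@dataclass
class LocalContext:
    """Toy per-run usage tracker mirroring the record_llm_call() pattern."""
    calls: list = field(default_factory=list)

    def record_llm_call(self, model: str, tokens_in: int, tokens_out: int,
                        cost: float, task: str) -> None:
        self.calls.append({"model": model, "tokens_in": tokens_in,
                           "tokens_out": tokens_out, "cost": cost, "task": task})

    def totals(self) -> dict:
        # Tokens accumulate normally; cost stays 0.0 for local inference.
        return {
            "tokens": sum(c["tokens_in"] + c["tokens_out"] for c in self.calls),
            "cost": sum(c["cost"] for c in self.calls),
        }


ctx = LocalContext()
ctx.record_llm_call("llama3.2", 42, 18, cost=0.0, task="classify_intent")
ctx.record_llm_call("llama3.2", 64, 120, cost=0.0, task="plan_response")
print(ctx.totals())  # {'tokens': 244, 'cost': 0.0}
```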
Run it
```shell
cd dev/waxell-dev
python -m app.demos.ollama_agent --dry-run
```