smolagents

A Hugging Face smolagents ToolCallingAgent pipeline with multi-step reasoning, tool use, and code execution across three agents. The orchestrator plans the task and selects tools; the runner executes web_search and python_executor; the evaluator synthesizes the final answer and scores computation accuracy.
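The three-agent flow can be sketched with plain functions, stripped of the waxell decorators and real tool calls. Everything here (function names, mock GDP figures) is illustrative, not taken from the demo source:

```python
import statistics

# Hypothetical mock data standing in for web_search results.
MOCK_GDP_DATA = {"USA": 2.5, "Germany": 1.0, "Japan": 1.0}

def orchestrator(query: str) -> list[str]:
    """Plan the task: decide which tools to call, in order."""
    return ["web_search", "python_executor"]

def runner(plan: list[str]) -> dict:
    """Execute the planned tools and collect their outputs."""
    outputs = {}
    if "web_search" in plan:
        outputs["data"] = MOCK_GDP_DATA
    if "python_executor" in plan:
        outputs["average"] = round(statistics.mean(MOCK_GDP_DATA.values()), 4)
    return outputs

def evaluator(outputs: dict) -> str:
    """Synthesize a final answer from the tool outputs."""
    return f"Average growth across {len(outputs['data'])} countries: {outputs['average']}%"

answer = evaluator(runner(orchestrator("What is the average GDP growth?")))
```

The real pipeline adds LLM calls at each stage; this only shows the data flow between the three roles.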

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys.
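A minimal sketch of the check this implies at startup. The environment variable names come from this example; the helper function itself is hypothetical:

```python
import os

def resolve_mode(dry_run: bool = False) -> str:
    """Pick 'dry-run' when requested; otherwise require all three API settings."""
    if dry_run:
        return "dry-run"
    missing = [k for k in ("OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL")
               if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return "live"
```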

Architecture

Key Code

Tool calling with web_search and python_executor

The runner executes tools in sequence, mimicking the smolagents ToolCallingAgent pattern.

@waxell.tool(tool_type="function")
def web_search(query: str) -> dict:
    """smolagents web_search tool: retrieve data from the web."""
    return {"results": MOCK_SEARCH_RESULTS, "data": MOCK_GDP_DATA}

@waxell.tool(tool_type="code_execution")
def python_executor(code: str) -> dict:
    """smolagents python_executor tool: execute a Python code snippet."""
    import statistics
    values = list(MOCK_GDP_DATA.values())
    avg = statistics.mean(values)
    return {"result": str(round(avg, 4)), "computed_average": round(avg, 4)}
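Stripped of the decorator, python_executor's core computation is just statistics.mean over the mocked series. For instance, with hypothetical sample values:

```python
import statistics

# Hypothetical mock values; the demo's MOCK_GDP_DATA is defined in its source.
MOCK_GDP_DATA = {"France": 1.5, "Italy": 0.5, "Spain": 2.0, "Poland": 0.0}

values = list(MOCK_GDP_DATA.values())
avg = statistics.mean(values)
result = {"result": str(round(avg, 4)), "computed_average": round(avg, 4)}
```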

Evaluator with @reasoning and manual decide()

The evaluator synthesizes the final answer and assesses computation quality.

@waxell.reasoning_dec(step="computation_quality_assessment")
async def assess_computation_quality(answer: str, computed_value: float, raw_data: dict) -> dict:
    value_str = str(round(computed_value, 2))
    mentions_value = value_str in answer
    countries_mentioned = sum(1 for c in raw_data if c.lower() in answer.lower())
    return {
        "thought": f"Answer mentions computed average ({value_str}): {mentions_value}.",
        "evidence": [f"Computed average: {computed_value}%"],
        "conclusion": "Accurately reflects computation" if mentions_value
                      else "May not reference computed value",
    }
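The assessment boils down to two heuristics: a substring check on the rounded value and a case-insensitive count of country names. Run standalone (sample inputs are hypothetical):

```python
import asyncio

async def assess(answer: str, computed_value: float, raw_data: dict) -> dict:
    """Reduced version of the two checks in assess_computation_quality."""
    value_str = str(round(computed_value, 2))   # e.g. 1.231 -> "1.23"
    return {
        "mentions_value": value_str in answer,
        "countries_mentioned": sum(1 for c in raw_data if c.lower() in answer.lower()),
    }

out = asyncio.run(assess("Germany and Japan averaged 1.23% growth",
                         1.231, {"Germany": 1.1, "Japan": 0.9}))
```

Because the check is a plain substring match, an answer that reports the value at a different precision would be scored as not mentioning it.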

waxell.decide("output_detail_level", chosen="detailed",
              options=["brief", "detailed", "tabular"],
              reasoning="Query involves computation -- detailed format best", confidence=0.85)

waxell.score("computation_accuracy", 0.95, comment="exact average from raw data")
waxell.score("answer_quality", 0.88, comment="covers all countries and average")
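The decide() call only records the choice; the selection logic lives in the evaluator. One way such a rule might look (entirely hypothetical, not the demo's implementation):

```python
def choose_detail_level(query: str) -> str:
    """Pick an output format: queries involving computation read best with full detail."""
    computation_words = ("average", "mean", "sum", "total", "compute")
    return "detailed" if any(w in query.lower() for w in computation_words) else "brief"
```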

What this demonstrates

  • @waxell.observe -- parent-child agent hierarchy with automatic lineage
  • @waxell.step_dec -- task planning recorded as execution step with LLM call
  • @waxell.tool -- web search and code execution with tool_type="code_execution"
  • @waxell.decision -- tool selection via OpenAI reasoning
  • waxell.decide() -- manual output detail level decision
  • @waxell.reasoning_dec -- computation quality assessment
  • waxell.score() -- computation accuracy and answer quality scores
  • Auto-instrumented LLM calls -- planning, tool selection, and synthesis calls captured
  • smolagents pattern -- ToolCallingAgent with web_search + python_executor tools

Run it

# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.smolagents_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.smolagents_agent

# Custom query
python -m app.demos.smolagents_agent --query "What is the average life expectancy in OECD countries?"

Source

dev/waxell-dev/app/demos/smolagents_agent.py