smolagents
A pipeline modeled on the HuggingFace smolagents ToolCallingAgent, with multi-step reasoning, tool use, and code execution across three agents. The orchestrator plans the task and selects tools, the runner executes web_search and python_executor, and the evaluator synthesizes the final answer and scores computation accuracy.
Environment variables
This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys.
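For a live run, export all three first (the key values and URL below are placeholders):

export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="https://..."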
Architecture
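In outline: the orchestrator calls the runner, which calls the two tools, and the evaluator synthesizes and scores the result. A minimal sketch of that hierarchy, assuming @waxell.observe records parent-child lineage when one observed agent awaits another (plan_task, runner, and evaluator are illustrative stand-ins, not the demo's actual code):

import waxell

@waxell.observe
async def orchestrator(query: str) -> str:
    plan = await plan_task(query)           # plan the task, select tools
    results = await runner(plan)            # child agent: web_search + python_executor
    return await evaluator(query, results)  # child agent: synthesis + scoring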
Key Code
Tool calling with web_search and python_executor
The runner executes tools in sequence, mimicking the smolagents ToolCallingAgent pattern.
@waxell.tool(tool_type="function")
def web_search(query: str) -> dict:
    """smolagents web_search tool: retrieve data from the web."""
    # Mocked for the demo: returns canned search results and the GDP table
    return {"results": MOCK_SEARCH_RESULTS, "data": MOCK_GDP_DATA}

@waxell.tool(tool_type="code_execution")
def python_executor(code: str) -> dict:
    """smolagents python_executor tool: execute a Python code snippet."""
    # Mock executor: ignores `code` and averages the mock data directly
    import statistics
    values = list(MOCK_GDP_DATA.values())
    avg = statistics.mean(values)
    return {"result": str(round(avg, 4)), "computed_average": round(avg, 4)}
Evaluator with @waxell.reasoning_dec and manual decide()
The evaluator synthesizes the final answer and assesses computation quality.
@waxell.reasoning_dec(step="computation_quality_assessment")
async def assess_computation_quality(answer: str, computed_value: float, raw_data: dict) -> dict:
    value_str = str(round(computed_value, 2))
    mentions_value = value_str in answer
    countries_mentioned = sum(1 for c in raw_data if c.lower() in answer.lower())
    return {
        "thought": f"Answer mentions computed average ({value_str}): {mentions_value}.",
        "evidence": [f"Computed average: {computed_value}%",
                     f"Countries referenced: {countries_mentioned}/{len(raw_data)}"],
        "conclusion": "Accurately reflects computation" if mentions_value
                      else "May not reference computed value",
    }
waxell.decide("output_detail_level", chosen="detailed",
              options=["brief", "detailed", "tabular"],
              reasoning="Query involves computation -- detailed format best",
              confidence=0.85)
waxell.score("computation_accuracy", 0.95, comment="exact average from raw data")
waxell.score("answer_quality", 0.88, comment="covers all countries and average")
What this demonstrates
- @waxell.observe -- parent-child agent hierarchy with automatic lineage
- @waxell.step_dec -- task planning recorded as execution step with LLM call
- @waxell.tool -- web search and code execution with tool_type="code_execution"
- @waxell.decision -- tool selection via OpenAI reasoning
- waxell.decide() -- manual output detail level decision
- @waxell.reasoning_dec -- computation quality assessment
- waxell.score() -- computation accuracy and answer quality scores
- Auto-instrumented LLM calls -- planning, tool selection, and synthesis calls captured
- smolagents pattern -- ToolCallingAgent with web_search + python_executor tools
Run it
# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.smolagents_agent --dry-run
# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.smolagents_agent
# Custom query
python -m app.demos.smolagents_agent --query "What is the average life expectancy in OECD countries?"