smolagents

A Hugging Face smolagents ToolCallingAgent pipeline with multi-step reasoning, tool use, and code execution across three agents. The orchestrator plans the task and selects tools; the runner executes web_search and python_executor; the evaluator synthesizes the final answer and scores computation accuracy.
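The three-agent flow can be sketched with plain functions, stripped of the waxell decorators and real tool calls. Everything here (function names, mock GDP figures) is illustrative, not taken from the demo source:

```python
import statistics

# Hypothetical mock data standing in for web_search results.
MOCK_GDP_DATA = {"USA": 2.5, "Germany": 1.0, "Japan": 1.0}

def orchestrator(query: str) -> list[str]:
    """Plan the task: decide which tools to call, in order."""
    return ["web_search", "python_executor"]

def runner(plan: list[str]) -> dict:
    """Execute the planned tools and collect their outputs."""
    outputs = {}
    if "web_search" in plan:
        outputs["data"] = MOCK_GDP_DATA
    if "python_executor" in plan:
        outputs["average"] = round(statistics.mean(MOCK_GDP_DATA.values()), 4)
    return outputs

def evaluator(outputs: dict) -> str:
    """Synthesize a final answer from the tool outputs."""
    return f"Average growth across {len(outputs['data'])} countries: {outputs['average']}%"

answer = evaluator(runner(orchestrator("What is the average GDP growth?")))
```

The real pipeline adds LLM calls at each stage; this only shows the data flow between the three roles.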

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys.
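A minimal sketch of the check this implies at startup. The environment variable names come from this example; the helper function itself is hypothetical:

```python
import os

def resolve_mode(dry_run: bool = False) -> str:
    """Pick 'dry-run' when requested; otherwise require all three API settings."""
    if dry_run:
        return "dry-run"
    missing = [k for k in ("OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL")
               if not os.environ.get(k)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return "live"
```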

Architecture

Key Code

Tool calling with web_search and python_executor

The runner executes tools in sequence, mimicking the smolagents ToolCallingAgent pattern.

@waxell.tool(tool_type="function")
def web_search(query: str) -> dict:
    """smolagents web_search tool: retrieve data from the web."""
    return {"results": MOCK_SEARCH_RESULTS, "data": MOCK_GDP_DATA}

@waxell.tool(tool_type="code_execution")
def python_executor(code: str) -> dict:
    """smolagents python_executor tool: execute a Python code snippet."""
    import statistics
    values = list(MOCK_GDP_DATA.values())
    avg = statistics.mean(values)
    return {"result": str(round(avg, 4)), "computed_average": round(avg, 4)}
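Stripped of the decorator, python_executor's core computation is just statistics.mean over the mocked series. For instance, with hypothetical sample values:

```python
import statistics

# Hypothetical mock values; the demo's MOCK_GDP_DATA is defined in its source.
MOCK_GDP_DATA = {"France": 1.5, "Italy": 0.5, "Spain": 2.0, "Poland": 0.0}

values = list(MOCK_GDP_DATA.values())
avg = statistics.mean(values)
result = {"result": str(round(avg, 4)), "computed_average": round(avg, 4)}
```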

Evaluator with @reasoning and manual decide()

The evaluator synthesizes the final answer and assesses computation quality.

@waxell.reasoning_dec(step="computation_quality_assessment")
async def assess_computation_quality(answer: str, computed_value: float, raw_data: dict) -> dict:
    value_str = str(round(computed_value, 2))
    mentions_value = value_str in answer
    countries_mentioned = sum(1 for c in raw_data if c.lower() in answer.lower())
    return {
        "thought": f"Answer mentions computed average ({value_str}): {mentions_value}.",
        "evidence": [f"Computed average: {computed_value}%"],
        "conclusion": "Accurately reflects computation" if mentions_value
                      else "May not reference computed value",
    }
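The assessment boils down to two heuristics: a substring check on the rounded value and a case-insensitive count of country names. Run standalone (sample inputs are hypothetical):

```python
import asyncio

async def assess(answer: str, computed_value: float, raw_data: dict) -> dict:
    """Reduced version of the two checks in assess_computation_quality."""
    value_str = str(round(computed_value, 2))   # e.g. 1.231 -> "1.23"
    return {
        "mentions_value": value_str in answer,
        "countries_mentioned": sum(1 for c in raw_data if c.lower() in answer.lower()),
    }

out = asyncio.run(assess("Germany and Japan averaged 1.23% growth",
                         1.231, {"Germany": 1.1, "Japan": 0.9}))
```

Because the check is a plain substring match, an answer that reports the value at a different precision would be scored as not mentioning it.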

waxell.decide("output_detail_level", chosen="detailed",
              options=["brief", "detailed", "tabular"],
              reasoning="Query involves computation -- detailed format best", confidence=0.85)

waxell.score("computation_accuracy", 0.95, comment="exact average from raw data")
waxell.score("answer_quality", 0.88, comment="covers all countries and average")
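The decide() call only records the choice; the selection logic lives in the evaluator. One way such a rule might look (entirely hypothetical, not the demo's implementation):

```python
def choose_detail_level(query: str) -> str:
    """Pick an output format: queries involving computation read best with full detail."""
    computation_words = ("average", "mean", "sum", "total", "compute")
    return "detailed" if any(w in query.lower() for w in computation_words) else "brief"
```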

What this demonstrates

  • @waxell.observe -- parent-child agent hierarchy with automatic lineage
  • @waxell.step_dec -- task planning recorded as execution step with LLM call
  • @waxell.tool -- web search and code execution with tool_type="code_execution"
  • @waxell.decision -- tool selection via OpenAI reasoning
  • waxell.decide() -- manual output detail level decision
  • @waxell.reasoning_dec -- computation quality assessment
  • waxell.score() -- computation accuracy and answer quality scores
  • Auto-instrumented LLM calls -- planning, tool selection, and synthesis calls captured
  • smolagents pattern -- ToolCallingAgent with web_search + python_executor tools

Run it

# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.smolagents_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.smolagents_agent

# Custom query
python -m app.demos.smolagents_agent --query "What is the average life expectancy in OECD countries?"

Source

dev/waxell-dev/app/demos/smolagents_agent.py