# DSPy

A DSPy-style program execution pipeline with a parent orchestrator coordinating two child agents -- a runner and an evaluator. The runner executes a three-module pipeline (Predict:classify, ChainOfThought, Predict:answer), while the evaluator assesses module output quality with reasoning-depth and answer-relevance scores.
## Environment variables

This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to run without any API keys.
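A minimal sketch of how the key handling could be checked up front. The `resolve_mode` helper and the fallback to dry-run when keys are missing are assumptions of this sketch, not part of the demo's actual CLI:

```python
import os
import sys

def resolve_mode(argv: list[str]) -> str:
    """Return 'dry-run' when --dry-run is passed or any required key is
    missing (an assumed fallback), else 'live'."""
    if "--dry-run" in argv:
        return "dry-run"
    required = ("OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL")
    missing = [k for k in required if not os.environ.get(k)]
    return "dry-run" if missing else "live"

mode = resolve_mode(sys.argv[1:])
```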
## Architecture
## Key Code

### Runner with DSPy module steps and `@decision` for module selection

Each DSPy module execution is recorded as a step, with module selection captured as a decision.
```python
@waxell.step_dec(name="module_predict_classify")
async def module_predict_classify(query: str, client) -> dict:
    """DSPy Predict module: classify the query topic."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the query topic."},
            {"role": "user", "content": query},
        ],
    )
    return {
        "topic": response.choices[0].message.content[:100],
        "module": "Predict",
        "signature": "query -> topic",
    }


@waxell.step_dec(name="module_chain_of_thought")
async def module_chain_of_thought(query: str, topic: str, client) -> dict:
    """DSPy ChainOfThought module: reason step-by-step."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Think step by step."},
            {"role": "user", "content": f"Topic: {topic}\nQuestion: {query}"},
        ],
    )
    return {"reasoning": response.choices[0].message.content, "module": "ChainOfThought"}


@waxell.decision(name="select_module", options=["Predict", "ChainOfThought", "ReAct", "ProgramOfThought"])
async def select_module(topic: str, complexity: str) -> dict:
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}
```
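Stripped of the waxell decorators and LLM calls, the runner's control flow can be sketched with stub modules. The stubs below are illustrative stand-ins (the real modules call OpenAI); only the selection rule is copied from the code above:

```python
import asyncio

# Hypothetical stand-ins for the decorated modules: same dict shapes, no LLM call.
async def predict_classify(query: str) -> dict:
    return {"topic": "prompt engineering", "module": "Predict"}

async def chain_of_thought(query: str, topic: str) -> dict:
    return {"reasoning": "Step 1: ... Step 2: ...", "module": "ChainOfThought"}

async def select_module(topic: str, complexity: str) -> dict:
    # Same selection rule as the @waxell.decision above.
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}

async def run_pipeline(query: str, complexity: str) -> dict:
    classified = await predict_classify(query)
    decision = await select_module(classified["topic"], complexity)
    if decision["chosen"] == "ChainOfThought":
        return await chain_of_thought(query, classified["topic"])
    return classified

result = asyncio.run(run_pipeline("What makes a good prompt strategy?", "complex"))
```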
### Evaluator with `@reasoning` and quality scores

The evaluator assesses pipeline quality across all three DSPy modules.
```python
@waxell.reasoning_dec(step="module_evaluation")
async def evaluate_modules(topic: str, reasoning: str, answer: str) -> dict:
    has_reasoning = len(reasoning) > 50
    has_answer = len(answer) > 20
    quality_score = sum([has_reasoning, has_answer, len(topic) > 5]) / 3.0
    return {
        "thought": f"ChainOfThought {'provided substantial' if has_reasoning else 'minimal'} reasoning.",
        "evidence": [f"Reasoning length: {len(reasoning)} chars"],
        "conclusion": f"Module pipeline quality: {quality_score:.0%}",
    }


waxell.score("module_quality", 0.88, comment="DSPy module pipeline quality")
waxell.score("reasoning_depth", 0.82, comment="ChainOfThought reasoning depth")
waxell.score("answer_relevance", 0.90, comment="Final answer relevance")
```
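The scoring heuristic inside the evaluator is pure Python, so it can be exercised without the decorator. This is the same thresholds as above, lifted into a hypothetical standalone helper:

```python
def module_quality(topic: str, reasoning: str, answer: str) -> float:
    """Fraction of three checks that pass: reasoning > 50 chars,
    answer > 20 chars, topic > 5 chars."""
    has_reasoning = len(reasoning) > 50
    has_answer = len(answer) > 20
    return sum([has_reasoning, has_answer, len(topic) > 5]) / 3.0
```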
## What this demonstrates

- `@waxell.observe` -- parent-child agent hierarchy with automatic lineage
- `@waxell.step_dec` -- program initialization and each DSPy module execution recorded as steps
- `@waxell.decision` -- module selection based on topic complexity
- `@waxell.reasoning_dec` -- chain-of-thought evaluation of module outputs
- `waxell.score()` -- module quality, reasoning depth, and answer relevance scores
- `waxell.tag()` / `waxell.metadata()` -- framework, program type, and module list metadata
- Auto-instrumented LLM calls -- three OpenAI gpt-4o-mini calls captured automatically
- DSPy module pattern -- Predict, ChainOfThought, and Predict in a compiled pipeline
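For intuition, a minimal sketch of what a step-recording decorator in this style could look like. This is an illustrative pattern only, not the actual waxell implementation; `step_dec` and `RECORDED_STEPS` are hypothetical names:

```python
import functools
from typing import Any, Callable

RECORDED_STEPS: list[dict] = []  # stand-in for an observability backend

def step_dec(name: str) -> Callable:
    """Record each call's step name and result, in the spirit of @waxell.step_dec."""
    def wrap(fn: Callable) -> Callable:
        @functools.wraps(fn)
        async def inner(*args: Any, **kwargs: Any) -> Any:
            result = await fn(*args, **kwargs)
            RECORDED_STEPS.append({"step": name, "result": result})
            return result
        return inner
    return wrap

@step_dec(name="module_predict_classify")
async def classify(query: str) -> dict:
    return {"topic": query[:20], "module": "Predict"}
```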
## Run it

```bash
# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.dspy_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.dspy_agent

# Custom query
python -m app.demos.dspy_agent --query "What makes a good prompt strategy?"
```