DSPy

A DSPy-style program execution pipeline with a parent orchestrator coordinating two child agents -- a runner and an evaluator. The runner executes a three-module pipeline (Predict:classify, ChainOfThought, Predict:answer), while the evaluator assesses module output quality with reasoning-depth and answer-relevance scores.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys.

Architecture

Key Code

Runner with DSPy module steps and @decision for module selection

Each DSPy module execution is recorded as a step, with module selection captured as a decision.

@waxell.step_dec(name="module_predict_classify")
async def module_predict_classify(query: str, client) -> dict:
    """DSPy Predict module: classify the query topic."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Classify the query topic."},
                  {"role": "user", "content": query}],
    )
    return {"topic": response.choices[0].message.content[:100],
            "module": "Predict", "signature": "query -> topic"}

@waxell.step_dec(name="module_chain_of_thought")
async def module_chain_of_thought(query: str, topic: str, client) -> dict:
    """DSPy ChainOfThought module: reason step-by-step."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Think step by step."},
                  {"role": "user", "content": f"Topic: {topic}\nQuestion: {query}"}],
    )
    return {"reasoning": response.choices[0].message.content, "module": "ChainOfThought"}

@waxell.decision(name="select_module", options=["Predict", "ChainOfThought", "ReAct", "ProgramOfThought"])
async def select_module(topic: str, complexity: str) -> dict:
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}
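The selection logic is plain Python, so it can be exercised without any API keys. A minimal sketch -- the same function body as select_module above, shown without the @waxell.decision decorator (which only records the decision):

```python
import asyncio

# Undecorated stand-in for select_module: same branching, no waxell recording.
async def select_module(topic: str, complexity: str) -> dict:
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}

result = asyncio.run(select_module("prompt engineering", "complex"))
print(result["chosen"])  # ChainOfThought
```

Only "complex" routes to ChainOfThought; every other complexity value falls back to Predict.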

Evaluator with @reasoning and quality scores

The evaluator assesses the pipeline quality across all three DSPy modules.

@waxell.reasoning_dec(step="module_evaluation")
async def evaluate_modules(topic: str, reasoning: str, answer: str) -> dict:
    has_reasoning = len(reasoning) > 50
    has_answer = len(answer) > 20
    quality_score = sum([has_reasoning, has_answer, len(topic) > 5]) / 3.0
    return {
        "thought": f"ChainOfThought {'provided substantial' if has_reasoning else 'minimal'} reasoning.",
        "evidence": [f"Reasoning length: {len(reasoning)} chars"],
        "conclusion": f"Module pipeline quality: {quality_score:.0%}",
    }

waxell.score("module_quality", 0.88, comment="DSPy module pipeline quality")
waxell.score("reasoning_depth", 0.82, comment="ChainOfThought reasoning depth")
waxell.score("answer_relevance", 0.90, comment="Final answer relevance")
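The quality score inside evaluate_modules is just the fraction of length checks that pass. Mirrored as a pure helper for clarity (the helper name is illustrative, not part of the SDK):

```python
def module_quality(topic: str, reasoning: str, answer: str) -> float:
    # One point per check: substantial reasoning, non-trivial answer, usable topic.
    checks = [len(reasoning) > 50, len(answer) > 20, len(topic) > 5]
    return sum(checks) / 3.0

# All three checks pass, so the score is 1.0 (reported as 100%).
score = module_quality("prompt design", "x" * 60, "y" * 30)
```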

What this demonstrates

  • @waxell.observe -- parent-child agent hierarchy with automatic lineage
  • @waxell.step_dec -- program initialization and each DSPy module execution recorded as steps
  • @waxell.decision -- module selection based on topic complexity
  • @waxell.reasoning_dec -- chain-of-thought evaluation of module outputs
  • waxell.score() -- module quality, reasoning depth, and answer relevance scores
  • waxell.tag() / waxell.metadata() -- framework, program type, and module list metadata
  • Auto-instrumented LLM calls -- three OpenAI gpt-4o-mini calls captured automatically
  • DSPy module pattern -- Predict, ChainOfThought, and Predict in a compiled pipeline
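In dry-run mode the same Predict -> ChainOfThought -> Predict control flow can be traced with stubbed module outputs. A rough sketch of the orchestration order under that assumption, not the demo's actual implementation:

```python
import asyncio

async def run_pipeline(query: str) -> dict:
    # Stubbed dry-run results standing in for the three LLM-backed modules.
    classify = {"topic": "prompting", "module": "Predict"}
    cot = {"reasoning": "step-by-step analysis " * 4, "module": "ChainOfThought"}
    answer = {"answer": "A good prompt is specific and testable.", "module": "Predict"}
    return {"query": query,
            "steps": [classify["module"], cot["module"], answer["module"]],
            "answer": answer["answer"]}

result = asyncio.run(run_pipeline("What makes a good prompt strategy?"))
print(result["steps"])  # ['Predict', 'ChainOfThought', 'Predict']
```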

Run it

# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.dspy_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.dspy_agent

# Custom query
python -m app.demos.dspy_agent --query "What makes a good prompt strategy?"

Source

dev/waxell-dev/app/demos/dspy_agent.py