# DSPy

A DSPy-style program execution pipeline with a parent orchestrator coordinating two child agents -- a runner and an evaluator. The runner executes a three-module pipeline (Predict:classify, ChainOfThought, Predict:answer), while the evaluator assesses module output quality with reasoning-depth and answer-relevance scores.
## Environment variables

This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to run without any API keys.
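A minimal sketch of how the key handling could be checked up front. The `resolve_mode` helper and the fallback to dry-run when keys are missing are assumptions of this sketch, not part of the demo's actual CLI:

```python
import os
import sys

def resolve_mode(argv: list[str]) -> str:
    """Return 'dry-run' when --dry-run is passed or any required key is
    missing (an assumed fallback), else 'live'."""
    if "--dry-run" in argv:
        return "dry-run"
    required = ("OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL")
    missing = [k for k in required if not os.environ.get(k)]
    return "dry-run" if missing else "live"

mode = resolve_mode(sys.argv[1:])
```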
## Architecture
## Key Code

### Runner with DSPy module steps and `@decision` for module selection

Each DSPy module execution is recorded as a step, with module selection captured as a decision.
```python
@waxell.step_dec(name="module_predict_classify")
async def module_predict_classify(query: str, client) -> dict:
    """DSPy Predict module: classify the query topic."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the query topic."},
            {"role": "user", "content": query},
        ],
    )
    return {
        "topic": response.choices[0].message.content[:100],
        "module": "Predict",
        "signature": "query -> topic",
    }


@waxell.step_dec(name="module_chain_of_thought")
async def module_chain_of_thought(query: str, topic: str, client) -> dict:
    """DSPy ChainOfThought module: reason step-by-step."""
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Think step by step."},
            {"role": "user", "content": f"Topic: {topic}\nQuestion: {query}"},
        ],
    )
    return {"reasoning": response.choices[0].message.content, "module": "ChainOfThought"}


@waxell.decision(name="select_module", options=["Predict", "ChainOfThought", "ReAct", "ProgramOfThought"])
async def select_module(topic: str, complexity: str) -> dict:
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}
```
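Stripped of the waxell decorators and LLM calls, the runner's control flow can be sketched with stub modules. The stubs below are illustrative stand-ins (the real modules call OpenAI); only the selection rule is copied from the code above:

```python
import asyncio

# Hypothetical stand-ins for the decorated modules: same dict shapes, no LLM call.
async def predict_classify(query: str) -> dict:
    return {"topic": "prompt engineering", "module": "Predict"}

async def chain_of_thought(query: str, topic: str) -> dict:
    return {"reasoning": "Step 1: ... Step 2: ...", "module": "ChainOfThought"}

async def select_module(topic: str, complexity: str) -> dict:
    # Same selection rule as the @waxell.decision above.
    chosen = "ChainOfThought" if complexity == "complex" else "Predict"
    return {"chosen": chosen, "reasoning": f"'{complexity}' complexity best served by {chosen}"}

async def run_pipeline(query: str, complexity: str) -> dict:
    classified = await predict_classify(query)
    decision = await select_module(classified["topic"], complexity)
    if decision["chosen"] == "ChainOfThought":
        return await chain_of_thought(query, classified["topic"])
    return classified

result = asyncio.run(run_pipeline("What makes a good prompt strategy?", "complex"))
```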
### Evaluator with `@reasoning` and quality scores

The evaluator assesses pipeline quality across all three DSPy modules.
```python
@waxell.reasoning_dec(step="module_evaluation")
async def evaluate_modules(topic: str, reasoning: str, answer: str) -> dict:
    has_reasoning = len(reasoning) > 50
    has_answer = len(answer) > 20
    quality_score = sum([has_reasoning, has_answer, len(topic) > 5]) / 3.0
    return {
        "thought": f"ChainOfThought {'provided substantial' if has_reasoning else 'minimal'} reasoning.",
        "evidence": [f"Reasoning length: {len(reasoning)} chars"],
        "conclusion": f"Module pipeline quality: {quality_score:.0%}",
    }


waxell.score("module_quality", 0.88, comment="DSPy module pipeline quality")
waxell.score("reasoning_depth", 0.82, comment="ChainOfThought reasoning depth")
waxell.score("answer_relevance", 0.90, comment="Final answer relevance")
```
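The scoring heuristic inside the evaluator is pure Python, so it can be exercised without the decorator. This is the same thresholds as above, lifted into a hypothetical standalone helper:

```python
def module_quality(topic: str, reasoning: str, answer: str) -> float:
    """Fraction of three checks that pass: reasoning > 50 chars,
    answer > 20 chars, topic > 5 chars."""
    has_reasoning = len(reasoning) > 50
    has_answer = len(answer) > 20
    return sum([has_reasoning, has_answer, len(topic) > 5]) / 3.0
```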
## What this demonstrates

- `@waxell.observe` -- parent-child agent hierarchy with automatic lineage
- `@waxell.step_dec` -- program initialization and each DSPy module execution recorded as steps
- `@waxell.decision` -- module selection based on topic complexity
- `@waxell.reasoning_dec` -- chain-of-thought evaluation of module outputs
- `waxell.score()` -- module quality, reasoning depth, and answer relevance scores
- `waxell.tag()` / `waxell.metadata()` -- framework, program type, and module list metadata
- Auto-instrumented LLM calls -- three OpenAI gpt-4o-mini calls captured automatically
- DSPy module pattern -- Predict, ChainOfThought, and Predict in a compiled pipeline
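For intuition, a minimal sketch of what a step-recording decorator in this style could look like. This is an illustrative pattern only, not the actual waxell implementation; `step_dec` and `RECORDED_STEPS` are hypothetical names:

```python
import functools
from typing import Any, Callable

RECORDED_STEPS: list[dict] = []  # stand-in for an observability backend

def step_dec(name: str) -> Callable:
    """Record each call's step name and result, in the spirit of @waxell.step_dec."""
    def wrap(fn: Callable) -> Callable:
        @functools.wraps(fn)
        async def inner(*args: Any, **kwargs: Any) -> Any:
            result = await fn(*args, **kwargs)
            RECORDED_STEPS.append({"step": name, "result": result})
            return result
        return inner
    return wrap

@step_dec(name="module_predict_classify")
async def classify(query: str) -> dict:
    return {"topic": query[:20], "module": "Predict"}
```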
## Run it

```bash
# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.dspy_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.dspy_agent

# Custom query
python -m app.demos.dspy_agent --query "What makes a good prompt strategy?"
```