AutoGen

An AutoGen-style multi-agent group chat with a parent orchestrator coordinating two child agents -- a runner and an evaluator. The runner executes planner, engineer, and reviewer agents in a round-robin group chat, while the evaluator reviews the conversation and scores collaboration quality.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to run without any API keys.

Architecture

Key Code

Orchestrator with group chat initialization

The parent agent initializes the group chat config and delegates execution to child agents.

@waxell.observe(agent_name="autogen-orchestrator", workflow_name="autogen-group-chat")
async def run_agent(query: str, dry_run: bool = False, waxell_ctx=None, **kwargs):
    waxell.tag("demo", "autogen")
    waxell.metadata("chat_type", "group_chat")

    # @step -- initialize group chat configuration
    chat_config = await init_groupchat(
        agents=["planner", "engineer", "reviewer"], max_rounds=5,
    )

    # Child agents execute sequentially
    runner_result = await run_agents(query=query, client=client)
    evaluator_result = await run_evaluator(
        query=query, plan=runner_result["plan"],
        implementation=runner_result["implementation"], client=client,
    )

Runner with @step and @decision for speaker selection

Each agent turn is recorded as a step, with round-robin speaker selection captured as a decision.

@waxell.decision(name="select_speaker", options=["planner", "engineer", "reviewer"])
async def select_speaker(round_num: int, agents: list[str]) -> dict:
    chosen = agents[round_num % len(agents)]
    return {
        "chosen": chosen,
        "reasoning": f"Round-robin: round {round_num} maps to '{chosen}'",
    }
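The round-robin mapping itself is plain modular arithmetic. A minimal standalone sketch of the same selection logic, without the @waxell.decision decorator (the agent list is the one from the demo):

```python
# Round-robin speaker selection: round N maps to agents[N % len(agents)].
AGENTS = ["planner", "engineer", "reviewer"]

def select_speaker(round_num: int, agents: list[str]) -> dict:
    chosen = agents[round_num % len(agents)]
    return {
        "chosen": chosen,
        "reasoning": f"Round-robin: round {round_num} maps to '{chosen}'",
    }

# Five rounds cycle through the three agents and wrap around.
order = [select_speaker(r, AGENTS)["chosen"] for r in range(5)]
# → ['planner', 'engineer', 'reviewer', 'planner', 'engineer']
```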

@waxell.step_dec(name="agent_planner")
async def run_planner_step(query: str, client) -> dict:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Break down the task into steps."},
                  {"role": "user", "content": query}],
    )
    return {"plan": response.choices[0].message.content}
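In --dry-run mode no API keys are needed, which implies the steps run against something other than a live client. One way such a stub can satisfy the same client.chat.completions.create call shape is sketched below; this is an illustrative assumption, not the demo's actual implementation, and DryRunClient is a hypothetical name:

```python
# Hypothetical dry-run stub: mimics the OpenAI client call shape used by
# run_planner_step with a canned response, so no API key is required.
import asyncio
from types import SimpleNamespace

class DryRunClient:
    def __init__(self):
        # Expose client.chat.completions.create like the real client.
        self.chat = SimpleNamespace(completions=SimpleNamespace(create=self._create))

    async def _create(self, model: str, messages: list[dict], **kwargs):
        # Echo the user message back in a canned "plan".
        user_msg = next(m["content"] for m in messages if m["role"] == "user")
        text = f"[dry-run:{model}] plan for: {user_msg}"
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=text))]
        )

async def main() -> str:
    client = DryRunClient()
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": "Break down the task into steps."},
                  {"role": "user", "content": "Design a monitoring strategy"}],
    )
    return response.choices[0].message.content

print(asyncio.run(main()))
# → [dry-run:gpt-4o-mini] plan for: Design a monitoring strategy
```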

Evaluator with @reasoning and score()

The evaluator reviews the conversation and assesses collaboration quality.

@waxell.reasoning_dec(step="conversation_evaluation")
async def evaluate_conversation(plan: str, implementation: str, review: str) -> dict:
    plan_has_steps = any(c.isdigit() for c in plan[:50])
    impl_has_detail = len(implementation) > 100
    quality_score = sum([plan_has_steps, impl_has_detail, len(review) > 50]) / 3.0
    return {
        "thought": f"Plan {'includes' if plan_has_steps else 'lacks'} structured steps.",
        "evidence": [f"Plan length: {len(plan)} chars"],
        "conclusion": f"Conversation quality: {quality_score:.0%}",
    }

waxell.score("conversation_quality", 0.85, comment="multi-agent collaboration")
waxell.score("review_approved", True, data_type="boolean")
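The quality heuristic averages three boolean checks into a 0-1 score. A standalone sketch of that arithmetic, without the waxell decorator:

```python
# The evaluator's heuristic: three boolean checks, averaged.
# bool is an int subclass in Python, so sum([...]) counts the passing checks.
def quality(plan: str, implementation: str, review: str) -> float:
    plan_has_steps = any(c.isdigit() for c in plan[:50])  # e.g. "1." numbering
    impl_has_detail = len(implementation) > 100
    review_has_detail = len(review) > 50
    return sum([plan_has_steps, impl_has_detail, review_has_detail]) / 3.0

score = quality("1. Parse input\n2. Transform", "x" * 150, "y" * 60)
# → 1.0  (all three checks pass)
```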

What this demonstrates

  • @waxell.observe -- parent-child agent hierarchy with automatic lineage
  • @waxell.step_dec -- group chat init, planner, and engineer agent turns recorded as steps
  • @waxell.decision -- round-robin speaker selection at each group chat round
  • @waxell.reasoning_dec -- chain-of-thought conversation quality evaluation
  • waxell.score() -- conversation quality and reviewer approval scores
  • waxell.tag() / waxell.metadata() -- framework, chat type, and agent role metadata
  • Auto-instrumented LLM calls -- three OpenAI gpt-4o-mini calls captured automatically
  • AutoGen group chat pattern -- planner, engineer, reviewer in sequential round-robin

Run it

# Dry-run (no API keys needed)
cd dev/waxell-dev
python -m app.demos.autogen_agent --dry-run

# Live (real OpenAI)
export OPENAI_API_KEY="sk-..."
python -m app.demos.autogen_agent

# Custom query
python -m app.demos.autogen_agent --query "Design a monitoring strategy"

Source

dev/waxell-dev/app/demos/autogen_agent.py