TTS Pipeline Agent

A multi-provider text-to-speech comparison pipeline that runs 7 TTS providers on the same input text and ranks them by latency. A tts-generator child agent exercises all 7 providers while a tts-evaluator compares results via @reasoning and an LLM synthesis call.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Key Code

Seven TTS Provider Tool Calls

Each provider exercises the exact methods that the instrumentors wrap, including both sync and streaming variants.

@waxell.tool(tool_type="tts")
def azure_tts_synthesize(text: str, ssml: str) -> dict:
    """Run Azure TTS speak_text + speak_text_async + speak_ssml + speak_ssml_async."""
    synth = MockAzureSpeechSynthesizer()
    result = synth.speak_text(text)
    future = synth.speak_text_async(text)
    future.get()
    synth.speak_ssml(ssml)
    ssml_future = synth.speak_ssml_async(ssml)
    ssml_future.get()
    return {"provider": "Azure TTS", "methods": ["speak_text", "speak_text_async", "speak_ssml", "speak_ssml_async"]}
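The Mock* classes stand in for the real SDK clients so the demo can run without credentials. A minimal sketch of one such stand-in is below; the class name, fake-audio behavior, and the `.get()` alias are illustrative assumptions, not the demo's actual mock implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class MockSpeechSynthesizer:
    """Illustrative stand-in that mimics an SDK exposing sync and
    async synthesis variants (the async variant returns a handle
    with a .get() method, as the Azure SDK does)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)

    def speak_text(self, text: str) -> bytes:
        time.sleep(0.01)          # simulate synthesis latency
        return b"\x00" * len(text)  # fake audio: one zero byte per character

    def speak_text_async(self, text: str):
        # Run synthesis on a worker thread; alias .get to .result so
        # callers can use the SDK-style future.get() idiom.
        future = self._pool.submit(self.speak_text, text)
        future.get = future.result
        return future
```

Because the mock sleeps instead of synthesizing, each tool call still produces a measurable, provider-distinct latency for the evaluator to rank.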

@waxell.tool(tool_type="tts")
def elevenlabs_tts_synthesize(text: str) -> dict:
    """Run ElevenLabs convert + convert_as_stream."""
    client = MockElevenLabsTTSClient()
    audio = client.convert(voice_id="21m00Tcm4TlvDq8ikWAM", text=text, model_id="eleven_multilingual_v2")
    stream_chunks = list(client.convert_as_stream(voice_id="21m00Tcm4TlvDq8ikWAM", text=text))
    return {"provider": "ElevenLabs", "audio_kb": len(audio) / 1024, "stream_chunks": len(stream_chunks)}

@waxell.tool(tool_type="tts")
def coqui_tts_synthesize(text: str) -> dict:
    """Run Coqui TTS tts + tts_to_file (local inference)."""
    model = MockCoquiTTS("tts_models/en/ljspeech/tacotron2-DDC")
    waveform = model.tts(text, language="en")
    model.tts_to_file(text, file_path="/tmp/coqui_output.wav")
    return {"provider": "Coqui TTS", "engine": "tacotron2 (local)", "samples": len(waveform)}
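The evaluator below expects each provider result to carry a `latency_ms` field. One way an orchestrator could attach it is sketched here; this is a hypothetical helper, and in the real demo the timing may instead come from the instrumentors themselves:

```python
import time

def run_with_latency(tools, text):
    """Call each provider tool on the same input and record wall-clock
    latency in milliseconds on its result dict."""
    results = []
    for tool in tools:
        start = time.perf_counter()
        result = tool(text)
        result["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        results.append(result)
    return results
```

Running every tool on the identical input text is what makes the latency comparison across the 7 providers meaningful.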

Latency-Based Quality Assessment

The evaluator ranks providers by latency and synthesizes a comparison.

@waxell.reasoning_dec(step="quality_assessment")
async def assess_quality(results: list) -> dict:
    fastest = min(results, key=lambda r: r["latency_ms"])
    return {
        "thought": f"Compared {len(results)} TTS providers. Fastest: {fastest['provider']}",
        "evidence": [f"{r['provider']}: latency={r['latency_ms']}ms" for r in results],
        "conclusion": f"{fastest['provider']} is the fastest at {fastest['latency_ms']}ms",
    }
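The same ranking logic can be driven directly. Here is a standalone sketch without the @waxell.reasoning_dec decorator (so it runs outside the pipeline) over made-up sample results:

```python
import asyncio

async def assess_quality(results: list) -> dict:
    # Identical ranking logic to the decorated coroutine above.
    fastest = min(results, key=lambda r: r["latency_ms"])
    return {
        "thought": f"Compared {len(results)} TTS providers. Fastest: {fastest['provider']}",
        "evidence": [f"{r['provider']}: latency={r['latency_ms']}ms" for r in results],
        "conclusion": f"{fastest['provider']} is the fastest at {fastest['latency_ms']}ms",
    }

# Hypothetical sample results, not real provider measurements.
sample = [
    {"provider": "ElevenLabs", "latency_ms": 420},
    {"provider": "Coqui TTS", "latency_ms": 95},
]
verdict = asyncio.run(assess_quality(sample))
print(verdict["conclusion"])  # Coqui TTS is the fastest at 95ms
```

In the real pipeline the decorator additionally records the thought/evidence/conclusion structure as a @reasoning trace for the LLM synthesis call.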

What this demonstrates

  • 7 TTS instrumentors -- Google Cloud TTS, Azure Speech, AWS Polly, Cartesia, Coqui TTS, ElevenLabs, and PlayHT each with tool_type="tts".
  • Sync + async + streaming -- Azure exercises 4 method variants; ElevenLabs and PlayHT exercise both non-streaming and streaming.
  • Local inference -- Coqui TTS runs as a local model with tts() and tts_to_file().
  • SSML support -- Azure TTS exercises both plain text and SSML synthesis.
  • @reasoning for quality assessment -- documents the provider ranking by latency.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.tts_pipeline_agent --dry-run

# Live mode (also requires the Waxell credentials listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.tts_pipeline_agent

Source

dev/waxell-dev/app/demos/tts_pipeline_agent.py