TTS Pipeline Agent

A multi-provider text-to-speech comparison pipeline that runs 7 TTS providers on the same input text and ranks them by latency. A tts-generator child agent exercises all 7 providers while a tts-evaluator compares results via @reasoning and an LLM synthesis call.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.

Key Code

Seven TTS Provider Tool Calls

Each provider exercises the exact methods that the instrumentors wrap, including both sync and streaming variants.

@waxell.tool(tool_type="tts")
def azure_tts_synthesize(text: str, ssml: str) -> dict:
    """Run Azure TTS speak_text + speak_text_async + speak_ssml + speak_ssml_async."""
    synth = MockAzureSpeechSynthesizer()
    result = synth.speak_text(text)
    future = synth.speak_text_async(text)
    future.get()
    synth.speak_ssml(ssml)
    ssml_future = synth.speak_ssml_async(ssml)
    ssml_future.get()
    return {"provider": "Azure TTS", "methods": ["speak_text", "speak_text_async", "speak_ssml", "speak_ssml_async"]}
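The Mock* classes stand in for the real SDK clients so the demo can run without credentials. A minimal sketch of one such stand-in is below; the class name, fake-audio behavior, and the `.get()` alias are illustrative assumptions, not the demo's actual mock implementation:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class MockSpeechSynthesizer:
    """Illustrative stand-in that mimics an SDK exposing sync and
    async synthesis variants (the async variant returns a handle
    with a .get() method, as the Azure SDK does)."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=2)

    def speak_text(self, text: str) -> bytes:
        time.sleep(0.01)          # simulate synthesis latency
        return b"\x00" * len(text)  # fake audio: one zero byte per character

    def speak_text_async(self, text: str):
        # Run synthesis on a worker thread; alias .get to .result so
        # callers can use the SDK-style future.get() idiom.
        future = self._pool.submit(self.speak_text, text)
        future.get = future.result
        return future
```

Because the mock sleeps instead of synthesizing, each tool call still produces a measurable, provider-distinct latency for the evaluator to rank.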

@waxell.tool(tool_type="tts")
def elevenlabs_tts_synthesize(text: str) -> dict:
    """Run ElevenLabs convert + convert_as_stream."""
    client = MockElevenLabsTTSClient()
    audio = client.convert(voice_id="21m00Tcm4TlvDq8ikWAM", text=text, model_id="eleven_multilingual_v2")
    stream_chunks = list(client.convert_as_stream(voice_id="21m00Tcm4TlvDq8ikWAM", text=text))
    return {"provider": "ElevenLabs", "audio_kb": len(audio) / 1024, "stream_chunks": len(stream_chunks)}

@waxell.tool(tool_type="tts")
def coqui_tts_synthesize(text: str) -> dict:
    """Run Coqui TTS tts + tts_to_file (local inference)."""
    model = MockCoquiTTS("tts_models/en/ljspeech/tacotron2-DDC")
    waveform = model.tts(text, language="en")
    model.tts_to_file(text, file_path="/tmp/coqui_output.wav")
    return {"provider": "Coqui TTS", "engine": "tacotron2 (local)", "samples": len(waveform)}
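The evaluator below expects each provider result to carry a `latency_ms` field. One way an orchestrator could attach it is sketched here; this is a hypothetical helper, and in the real demo the timing may instead come from the instrumentors themselves:

```python
import time

def run_with_latency(tools, text):
    """Call each provider tool on the same input and record wall-clock
    latency in milliseconds on its result dict."""
    results = []
    for tool in tools:
        start = time.perf_counter()
        result = tool(text)
        result["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        results.append(result)
    return results
```

Running every tool on the identical input text is what makes the latency comparison across the 7 providers meaningful.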

Latency-Based Quality Assessment

The evaluator ranks providers by latency and synthesizes a comparison.

@waxell.reasoning_dec(step="quality_assessment")
async def assess_quality(results: list) -> dict:
    fastest = min(results, key=lambda r: r["latency_ms"])
    return {
        "thought": f"Compared {len(results)} TTS providers. Fastest: {fastest['provider']}",
        "evidence": [f"{r['provider']}: latency={r['latency_ms']}ms" for r in results],
        "conclusion": f"{fastest['provider']} is the fastest at {fastest['latency_ms']}ms",
    }
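The same ranking logic can be driven directly. Here is a standalone sketch without the @waxell.reasoning_dec decorator (so it runs outside the pipeline) over made-up sample results:

```python
import asyncio

async def assess_quality(results: list) -> dict:
    # Identical ranking logic to the decorated coroutine above.
    fastest = min(results, key=lambda r: r["latency_ms"])
    return {
        "thought": f"Compared {len(results)} TTS providers. Fastest: {fastest['provider']}",
        "evidence": [f"{r['provider']}: latency={r['latency_ms']}ms" for r in results],
        "conclusion": f"{fastest['provider']} is the fastest at {fastest['latency_ms']}ms",
    }

# Hypothetical sample results, not real provider measurements.
sample = [
    {"provider": "ElevenLabs", "latency_ms": 420},
    {"provider": "Coqui TTS", "latency_ms": 95},
]
verdict = asyncio.run(assess_quality(sample))
print(verdict["conclusion"])  # Coqui TTS is the fastest at 95ms
```

In the real pipeline the decorator additionally records the thought/evidence/conclusion structure as a @reasoning trace for the LLM synthesis call.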

What this demonstrates

  • 7 TTS instrumentors -- Google Cloud TTS, Azure Speech, AWS Polly, Cartesia, Coqui TTS, ElevenLabs, and PlayHT each with tool_type="tts".
  • Sync + async + streaming -- Azure exercises 4 method variants; ElevenLabs and PlayHT exercise both non-streaming and streaming.
  • Local inference -- Coqui TTS runs as a local model with tts() and tts_to_file().
  • SSML support -- Azure TTS exercises both plain text and SSML synthesis.
  • @reasoning for quality assessment -- documents the provider ranking by latency.

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.tts_pipeline_agent --dry-run

# Live mode (also requires the Waxell credentials listed above)
export OPENAI_API_KEY="sk-..."
export WAXELL_API_KEY="..."
export WAXELL_API_URL="..."
python -m app.demos.tts_pipeline_agent

Source

dev/waxell-dev/app/demos/tts_pipeline_agent.py