# STT Pipeline Agent
A multi-provider speech-to-text comparison pipeline that runs 7 STT providers on the same audio sample and ranks them by Word Error Rate (WER). An `stt-transcriber` child agent exercises all 7 providers, while an `stt-analyzer` compares the results via `@reasoning` and an LLM synthesis call.
## Environment variables
This example requires `OPENAI_API_KEY`, `WAXELL_API_KEY`, and `WAXELL_API_URL`. Use `--dry-run` to skip real API calls.
## Architecture

## Key Code
### Seven STT Provider Tool Calls
Each provider exercises the exact methods that the instrumentors wrap.
```python
@waxell.tool(tool_type="stt")
def google_cloud_stt(audio_file: str) -> dict:
    """Run Google Cloud STT recognize."""
    client = MockGoogleCloudSpeechClient()
    resp = client.recognize(config=..., audio=...)
    return {"provider": "Google Cloud STT", "transcript": resp.results[0].alternatives[0].transcript}


@waxell.tool(tool_type="stt")
def deepgram_stt(audio_file: str) -> dict:
    """Run Deepgram transcribe_file + transcribe_url."""
    client = MockDeepgramListenClient()
    resp = client.transcribe_file(source={"buffer": b"audio", "mimetype": "audio/wav"})
    client.transcribe_url(source={"url": "https://example.com/audio.wav"})
    return {"provider": "Deepgram", "transcript": resp.results.channels[0].alternatives[0].transcript}


@waxell.tool(tool_type="stt")
def assemblyai_stt(audio_file: str) -> dict:
    """Run AssemblyAI transcribe + submit + wait_for_completion."""
    transcriber = MockAssemblyAITranscriber()
    result = transcriber.transcribe(audio_file)
    submitted = transcriber.submit(audio_file)
    submitted.wait_for_completion()
    return {"provider": "AssemblyAI", "transcript": result.text}
```
### WER-Based Accuracy Assessment
The analyzer ranks providers by Word Error Rate and generates a comparison.
```python
@waxell.reasoning_dec(step="accuracy_assessment")
async def assess_accuracy(results: list, reference: str) -> dict:
    best = min(results, key=lambda r: r["wer"])
    return {
        "thought": f"Compared {len(results)} providers. Best WER: {best['wer']} ({best['provider']})",
        "evidence": [f"{r['provider']}: WER={r['wer']}" for r in results],
        "conclusion": f"{best['provider']} is the most accurate provider",
    }
```
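The `wer` values compared above are standard Word Error Rate scores: word-level edit distance divided by the reference word count. A minimal, self-contained sketch of that calculation (the demo's actual helper may differ; `word_error_rate` here is illustrative):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as Levenshtein distance over whitespace-tokenized words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, `word_error_rate("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. 1/3.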
## What this demonstrates
- 7 STT instrumentors -- Google Cloud STT, Azure Speech, AWS Transcribe, Faster Whisper, Whisper.cpp, Deepgram, and AssemblyAI, each with `tool_type="stt"`.
- Exact method coverage -- exercises `recognize`, `recognize_once_async`, `_make_api_call`, `transcribe`, `transcribe_file`/`transcribe_url`, and `submit`/`wait_for_completion`.
- WER comparison -- simple Word Error Rate calculation against a reference transcript.
- `@reasoning` for accuracy assessment -- documents the provider ranking logic.
- `waxell.score()` -- best WER and analysis quality attached to the trace.
## Run it
```bash
# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.stt_pipeline_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.stt_pipeline_agent
```