STT Pipeline Agent

A multi-provider speech-to-text comparison pipeline that runs 7 STT providers on the same audio sample and ranks them by Word Error Rate (WER). An stt-transcriber child agent exercises all 7 providers, while an stt-analyzer child agent compares the results via @reasoning and an LLM synthesis call.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.
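One way the required-variable check behind --dry-run might look; a minimal sketch (the helper name check_env is hypothetical and not part of the demo):

```python
import os

# Variables the live mode needs, per the docs above.
REQUIRED = ["OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL"]

def check_env(dry_run: bool) -> list:
    """Return the names of required variables that are missing.

    In dry-run mode nothing is required, so the list is always empty.
    """
    if dry_run:
        return []
    return [name for name in REQUIRED if not os.environ.get(name)]
```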

Architecture

Key Code

Seven STT Provider Tool Calls

Each provider exercises the exact methods that the instrumentors wrap.

@waxell.tool(tool_type="stt")
def google_cloud_stt(audio_file: str) -> dict:
    """Run Google Cloud STT recognize."""
    client = MockGoogleCloudSpeechClient()
    resp = client.recognize(config=..., audio=...)
    return {"provider": "Google Cloud STT", "transcript": resp.results[0].alternatives[0].transcript}

@waxell.tool(tool_type="stt")
def deepgram_stt(audio_file: str) -> dict:
    """Run Deepgram transcribe_file + transcribe_url."""
    client = MockDeepgramListenClient()
    resp = client.transcribe_file(source={"buffer": b"audio", "mimetype": "audio/wav"})
    client.transcribe_url(source={"url": "https://example.com/audio.wav"})
    return {"provider": "Deepgram", "transcript": resp.results.channels[0].alternatives[0].transcript}

@waxell.tool(tool_type="stt")
def assemblyai_stt(audio_file: str) -> dict:
    """Run AssemblyAI transcribe + submit + wait_for_completion."""
    transcriber = MockAssemblyAITranscriber()
    result = transcriber.transcribe(audio_file)
    submitted = transcriber.submit(audio_file)
    submitted.wait_for_completion()
    return {"provider": "AssemblyAI", "transcript": result.text}

WER-Based Accuracy Assessment

The analyzer ranks providers by Word Error Rate and generates a structured comparison.

@waxell.reasoning_dec(step="accuracy_assessment")
async def assess_accuracy(results: list, reference: str) -> dict:
    best = min(results, key=lambda r: r["wer"])
    return {
        "thought": f"Compared {len(results)} providers. Best WER: {best['wer']} ({best['provider']})",
        "evidence": [f"{r['provider']}: WER={r['wer']}" for r in results],
        "conclusion": f"{best['provider']} is the most accurate provider",
    }
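The wer values consumed above come from a word-level edit distance against the reference transcript. A minimal, self-contained sketch of the metric (an illustration of standard WER, not necessarily the demo's exact implementation):

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    hyp, ref = hypothesis.split(), reference.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a hypothesis that drops one word of a three-word reference scores a WER of 1/3.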

What this demonstrates

  • 7 STT instrumentors -- Google Cloud STT, Azure Speech, AWS Transcribe, Faster Whisper, Whisper.cpp, Deepgram, and AssemblyAI, each with tool_type="stt".
  • Exact method coverage -- exercises recognize, recognize_once_async, _make_api_call, transcribe, transcribe_file/transcribe_url, submit/wait_for_completion.
  • WER comparison -- simple Word Error Rate calculation against a reference transcript.
  • @reasoning for accuracy assessment -- documents the provider ranking logic.
  • waxell.score() -- best WER and analysis quality attached to the trace.
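Given per-provider results of the shape returned by the tool calls above, the best-provider selection in assess_accuracy generalizes to a full ordering. A small sketch (the provider names and WER values below are illustrative, not measured):

```python
def rank_by_wer(results: list) -> list:
    """Return provider results sorted ascending by WER (lower is better)."""
    return sorted(results, key=lambda r: r["wer"])

# Illustrative per-provider results, matching the {"provider", "wer"} shape
# consumed by assess_accuracy.
sample = [
    {"provider": "Deepgram", "wer": 0.08},
    {"provider": "Google Cloud STT", "wer": 0.05},
    {"provider": "AssemblyAI", "wer": 0.12},
]
ranking = rank_by_wer(sample)
```

The first element of the ranking is the same provider that min(results, key=...) selects in assess_accuracy.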

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.stt_pipeline_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.stt_pipeline_agent

Source

dev/waxell-dev/app/demos/stt_pipeline_agent.py