STT Pipeline Agent

A multi-provider speech-to-text comparison pipeline that runs 7 STT providers on the same audio sample and ranks them by Word Error Rate (WER). An stt-transcriber child agent exercises all 7 providers, while an stt-analyzer child agent compares the results via @reasoning and an LLM synthesis call.

Environment variables

This example requires OPENAI_API_KEY, WAXELL_API_KEY, and WAXELL_API_URL. Use --dry-run to skip real API calls.
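One way the required-variable check behind --dry-run might look; a minimal sketch (the helper name check_env is hypothetical and not part of the demo):

```python
import os

# Variables the live mode needs, per the docs above.
REQUIRED = ["OPENAI_API_KEY", "WAXELL_API_KEY", "WAXELL_API_URL"]

def check_env(dry_run: bool) -> list:
    """Return the names of required variables that are missing.

    In dry-run mode nothing is required, so the list is always empty.
    """
    if dry_run:
        return []
    return [name for name in REQUIRED if not os.environ.get(name)]
```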

Architecture

Key Code

Seven STT Provider Tool Calls

Each provider exercises the exact methods that the instrumentors wrap.

@waxell.tool(tool_type="stt")
def google_cloud_stt(audio_file: str) -> dict:
    """Run Google Cloud STT recognize."""
    client = MockGoogleCloudSpeechClient()
    resp = client.recognize(config=..., audio=...)
    return {"provider": "Google Cloud STT", "transcript": resp.results[0].alternatives[0].transcript}

@waxell.tool(tool_type="stt")
def deepgram_stt(audio_file: str) -> dict:
    """Run Deepgram transcribe_file + transcribe_url."""
    client = MockDeepgramListenClient()
    resp = client.transcribe_file(source={"buffer": b"audio", "mimetype": "audio/wav"})
    client.transcribe_url(source={"url": "https://example.com/audio.wav"})
    return {"provider": "Deepgram", "transcript": resp.results.channels[0].alternatives[0].transcript}

@waxell.tool(tool_type="stt")
def assemblyai_stt(audio_file: str) -> dict:
    """Run AssemblyAI transcribe + submit + wait_for_completion."""
    transcriber = MockAssemblyAITranscriber()
    result = transcriber.transcribe(audio_file)
    submitted = transcriber.submit(audio_file)
    submitted.wait_for_completion()
    return {"provider": "AssemblyAI", "transcript": result.text}

WER-Based Accuracy Assessment

The analyzer ranks providers by Word Error Rate and generates a structured comparison.

@waxell.reasoning_dec(step="accuracy_assessment")
async def assess_accuracy(results: list, reference: str) -> dict:
    best = min(results, key=lambda r: r["wer"])
    return {
        "thought": f"Compared {len(results)} providers. Best WER: {best['wer']} ({best['provider']})",
        "evidence": [f"{r['provider']}: WER={r['wer']}" for r in results],
        "conclusion": f"{best['provider']} is the most accurate provider",
    }
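The wer values consumed above come from a word-level edit distance against the reference transcript. A minimal, self-contained sketch of the metric (an illustration of standard WER, not necessarily the demo's exact implementation):

```python
def word_error_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    hyp, ref = hypothesis.split(), reference.split()
    # DP table: d[i][j] = edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, a hypothesis that drops one word of a three-word reference scores a WER of 1/3.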

What this demonstrates

  • 7 STT instrumentors -- Google Cloud STT, Azure Speech, AWS Transcribe, Faster Whisper, Whisper.cpp, Deepgram, and AssemblyAI, each with tool_type="stt".
  • Exact method coverage -- exercises recognize, recognize_once_async, _make_api_call, transcribe, transcribe_file/transcribe_url, submit/wait_for_completion.
  • WER comparison -- simple Word Error Rate calculation against a reference transcript.
  • @reasoning for accuracy assessment -- documents the provider ranking logic.
  • waxell.score() -- best WER and analysis quality attached to the trace.
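Given per-provider results of the shape returned by the tool calls above, the best-provider selection in assess_accuracy generalizes to a full ordering. A small sketch (the provider names and WER values below are illustrative, not measured):

```python
def rank_by_wer(results: list) -> list:
    """Return provider results sorted ascending by WER (lower is better)."""
    return sorted(results, key=lambda r: r["wer"])

# Illustrative per-provider results, matching the {"provider", "wer"} shape
# consumed by assess_accuracy.
sample = [
    {"provider": "Deepgram", "wer": 0.08},
    {"provider": "Google Cloud STT", "wer": 0.05},
    {"provider": "AssemblyAI", "wer": 0.12},
]
ranking = rank_by_wer(sample)
```

The first element of the ranking is the same provider that min(results, key=...) selects in assess_accuracy.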

Run it

# Dry-run mode (no API key needed)
cd dev/waxell-dev
python -m app.demos.stt_pipeline_agent --dry-run

# Live mode
export OPENAI_API_KEY="sk-..."
python -m app.demos.stt_pipeline_agent

Source

dev/waxell-dev/app/demos/stt_pipeline_agent.py