diff --git a/BENCHMARK.md b/BENCHMARK.md new file mode 100644 index 0000000..9d27ab2 --- /dev/null +++ b/BENCHMARK.md @@ -0,0 +1,159 @@ +# WhisperLiveKit Benchmark Report + +Benchmark comparing all supported ASR backends and streaming policies on Apple Silicon, +using the full AudioProcessor pipeline (the same path audio takes in production via WebSocket). + +## Test Environment + +| Property | Value | +|----------|-------| +| Hardware | Apple M4, 32 GB RAM | +| OS | macOS 25.3.0 (arm64) | +| Python | 3.13 | +| faster-whisper | 1.2.1 | +| mlx-whisper | installed (via mlx) | +| Voxtral (HF) | transformers-based | +| Voxtral MLX | native MLX backend | +| Model size | `base` (default for whisper backends) | +| VAC (Silero VAD) | enabled unless noted | +| Chunk size | 100 ms | +| Pacing | no-realtime (as fast as possible) | + +## Audio Test Files + +| File | Duration | Language | Speakers | Description | +|------|----------|----------|----------|-------------| +| `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses | +| `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps | +| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation about transcription | + +All files have hand-verified ground truth transcripts (`.transcript.json`) with per-word timestamps. 
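The transcript files under `audio_tests/` are flat JSON arrays of `{word, start, end}` objects. A minimal sketch of reading one and rebuilding the reference text by joining the words with spaces, assuming that layout; the helper names `load_reference` and `reference_text` are illustrative and not part of the harness:

```python
import json
from pathlib import Path


def load_reference(path):
    """Read a .transcript.json file: a list of {"word", "start", "end"} dicts."""
    # The French transcript contains non-ASCII characters, so read as UTF-8.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def reference_text(words):
    """Join the per-word entries into a single reference string for WER."""
    return " ".join(w["word"] for w in words)


if __name__ == "__main__":
    # Assumed location: the short English fixture shipped in audio_tests/.
    path = Path("audio_tests") / "00_00_07_english_1_speaker.transcript.json"
    if path.exists():
        words = load_reference(path)
        print(reference_text(words))
        print(f"{len(words)} words, last word ends at {words[-1]['end']:.2f} s")
```

Keeping the per-word `start` values alongside the text is what lets the same file serve both metrics: the joined string for WER and the start times for timestamp MAE.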
+ +--- + +## Results Overview + +### English - Short (7.2 s, 1 speaker) + +| Backend | Policy | RTF | WER | Timestamp MAE | +|---------|--------|-----|-----|---------------| +| faster-whisper | LocalAgreement | 0.20x | 21.1% | 0.080 s | +| faster-whisper | SimulStreaming | 0.14x | 0.0% | 0.239 s | +| mlx-whisper | LocalAgreement | 0.05x | 21.1% | 0.080 s | +| mlx-whisper | SimulStreaming | 0.14x | 10.5% | 0.245 s | +| voxtral-mlx | voxtral | 0.32x | 0.0% | 0.254 s | +| voxtral (HF) | voxtral | 1.29x | 0.0% | 1.876 s | + +### French (16.3 s, 1 speaker) + +| Backend | Policy | RTF | WER | Timestamp MAE | +|---------|--------|-----|-----|---------------| +| faster-whisper | LocalAgreement | 0.20x | 120.0% | 0.540 s | +| faster-whisper | SimulStreaming | 0.10x | 100.0% | 0.120 s | +| mlx-whisper | LocalAgreement | 0.31x | 1737.1% | 0.060 s | +| mlx-whisper | SimulStreaming | 0.08x | 94.3% | 0.120 s | +| voxtral-mlx | voxtral | 0.18x | 37.1% | 3.422 s | +| voxtral (HF) | voxtral | 0.63x | 28.6% | 4.040 s | + +Note: The whisper-based backends were run with `--lan en`, so they attempted to transcribe French +audio in English. This is expected to produce high WER. For a fair comparison, the whisper backends +should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect language. + +### English - Multi-speaker (30.0 s, 3 speakers) + +| Backend | Policy | RTF | WER | Timestamp MAE | +|---------|--------|-----|-----|---------------| +| faster-whisper | LocalAgreement | 0.24x | 44.7% | 0.235 s | +| faster-whisper | SimulStreaming | 0.10x | 5.3% | 0.398 s | +| mlx-whisper | LocalAgreement | 0.06x | 23.7% | 0.237 s | +| mlx-whisper | SimulStreaming | 0.11x | 5.3% | 0.395 s | +| voxtral-mlx | voxtral | 0.31x | 9.2% | 0.176 s | +| voxtral (HF) | voxtral | 1.00x | 32.9% | 1.034 s | + +--- + +## Key Findings + +### Speed (RTF = processing time / audio duration, lower is better) + +1. 
**mlx-whisper + LocalAgreement** is the fastest combo on Apple Silicon, reaching 0.05-0.06x RTF + on English audio. This means 30 seconds of audio is processed in under 2 seconds. +2. **SimulStreaming** is consistently faster than LocalAgreement for faster-whisper (0.10-0.14x vs 0.20-0.24x RTF). + For mlx-whisper the picture is mixed: LocalAgreement is faster on English, SimulStreaming on the French file. +3. **voxtral-mlx** runs at 0.18-0.32x RTF, roughly 2-6x slower than the fastest mlx-whisper run on each file, but well within + real-time requirements. +4. **voxtral (HF transformers)** is the slowest, hitting 1.0-1.3x RTF on English audio. On longer audio, it risks + falling behind real-time. On Apple Silicon, the MLX variant is strongly preferred. + +### Accuracy (WER = Word Error Rate, lower is better) + +1. **SimulStreaming** produces substantially lower WER than LocalAgreement for whisper backends. + On the 30s English file: 5.3% vs 23.7-44.7%. +2. **voxtral-mlx** achieves strong accuracy (0% on short English, 9.2% on multi-speaker). Both Voxtral backends + auto-detect language; voxtral-mlx is the faster of the two, making it the best choice for multilingual use on Apple Silicon. +3. **LocalAgreement** tends to duplicate the last sentence, inflating WER. This is a known + artifact of the LCP (Longest Common Prefix) commit strategy at end-of-stream. +4. **Voxtral** backends handle French natively with 28-37% WER, while whisper backends + attempted English transcription of French audio (not a fair comparison for French). + +### Timestamp Accuracy (MAE = Mean Absolute Error on word start times, lower is better) + +1. **LocalAgreement** produces the most accurate timestamps on the short English file (0.08s MAE), since it + processes overlapping audio windows and validates via prefix matching. +2. **SimulStreaming** timestamps are slightly less precise (0.24-0.40s MAE on English) but still usable + for most applications. +3. **voxtral-mlx** achieves excellent timestamps on English (0.18-0.25s MAE) but can drift on + audio with long silence gaps (3.4s MAE on the French file with 4-second pauses). +4. 
**voxtral (HF)** has the worst timestamp accuracy (1.0-4.0s MAE), likely due to the + additional overhead of the transformers pipeline. + +### VAC (Voice Activity Controller) Impact + +| Backend | Policy | VAC | 7s English WER | 30s English WER | +|---------|--------|-----|----------------|-----------------| +| faster-whisper | LocalAgreement | on | 21.1% | 44.7% | +| faster-whisper | LocalAgreement | off | 100.0% | 100.0% | +| voxtral-mlx | voxtral | on | 0.0% | 9.2% | +| voxtral-mlx | voxtral | off | 0.0% | 9.2% | + +- **Whisper backends require VAC** to function in streaming mode. Without it, the entire audio + is buffered as a single chunk and the LocalAgreement/SimulStreaming buffer logic breaks down. +- **Voxtral backends are VAC-independent**: they handle their own internal chunking, and voxtral-mlx + produced identical WER with or without VAC. VAC still reduces wasted compute on silence. + +--- + +## Recommendations + +| Use Case | Recommended Backend | Policy | Notes | +|----------|-------------------|--------|-------| +| Fastest English transcription (Apple Silicon) | mlx-whisper | SimulStreaming | 0.11-0.14x RTF, 5-10% WER | +| Fastest English transcription (Linux/GPU) | faster-whisper | SimulStreaming | 0.10-0.14x RTF, 0-5% WER | +| Multilingual / auto-detect (Apple Silicon) | voxtral-mlx | voxtral | Handles 100+ languages, 0.18-0.32x RTF | +| Multilingual / auto-detect (Linux/GPU) | voxtral (HF) | voxtral | Same model, slower on CPU, needs GPU | +| Best timestamp accuracy | faster-whisper | LocalAgreement | 0.08s MAE, good for subtitle alignment | +| Low latency, low memory | mlx-whisper (tiny) | SimulStreaming | Smallest footprint, fastest response | + +--- + +## Reproducing These Benchmarks + +```bash +# Install test dependencies +pip install -e ".[test]" + +# Single backend test +python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime + +# Multi-backend auto-detect benchmark +python test_backend_offline.py 
--benchmark --no-realtime + +# Export to JSON for programmatic analysis +python test_backend_offline.py --benchmark --no-realtime --json results.json + +# Test with custom audio +python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime +``` + +The benchmark harness computes WER and timestamp accuracy automatically when ground truth +`.transcript.json` files exist alongside the audio files. See `audio_tests/` for the format. diff --git a/audio_tests/00_00_07_english_1_speaker.transcript.json b/audio_tests/00_00_07_english_1_speaker.transcript.json new file mode 100644 index 0000000..43ca785 --- /dev/null +++ b/audio_tests/00_00_07_english_1_speaker.transcript.json @@ -0,0 +1,97 @@ +[ + { + "word": "This", + "start": 0.0, + "end": 0.24 + }, + { + "word": "is", + "start": 0.24, + "end": 0.56 + }, + { + "word": "a", + "start": 0.56, + "end": 0.76 + }, + { + "word": "transcription", + "start": 0.76, + "end": 1.32 + }, + { + "word": "test.", + "start": 1.32, + "end": 2.0 + }, + { + "word": "We", + "start": 2.4, + "end": 2.5 + }, + { + "word": "want", + "start": 2.5, + "end": 2.66 + }, + { + "word": "to", + "start": 2.66, + "end": 2.84 + }, + { + "word": "see", + "start": 2.84, + "end": 3.1 + }, + { + "word": "if", + "start": 3.1, + "end": 3.34 + }, + { + "word": "we", + "start": 3.34, + "end": 3.5 + }, + { + "word": "can", + "start": 3.5, + "end": 3.68 + }, + { + "word": "use", + "start": 3.68, + "end": 4.04 + }, + { + "word": "smaller", + "start": 4.04, + "end": 4.76 + }, + { + "word": "chunks.", + "start": 4.76, + "end": 5.16 + }, + { + "word": "What", + "start": 6.06, + "end": 6.32 + }, + { + "word": "do", + "start": 6.32, + "end": 6.44 + }, + { + "word": "you", + "start": 6.44, + "end": 6.58 + }, + { + "word": "think?", + "start": 6.58, + "end": 6.84 + } +] \ No newline at end of file diff --git a/audio_tests/00_00_16_french_1_speaker.transcript.json b/audio_tests/00_00_16_french_1_speaker.transcript.json new file mode 100644 index 
0000000..07c0b31 --- /dev/null +++ b/audio_tests/00_00_16_french_1_speaker.transcript.json @@ -0,0 +1,177 @@ +[ + { + "word": "Ok,", + "start": 2.02, + "end": 2.38 + }, + { + "word": "là", + "start": 2.52, + "end": 2.58 + }, + { + "word": "c", + "start": 2.58, + "end": 2.74 + }, + { + "word": "'est", + "start": 2.74, + "end": 2.76 + }, + { + "word": "un", + "start": 2.76, + "end": 2.86 + }, + { + "word": "test,", + "start": 2.86, + "end": 3.2 + }, + { + "word": "on", + "start": 3.34, + "end": 3.34 + }, + { + "word": "veut", + "start": 3.34, + "end": 3.48 + }, + { + "word": "voir", + "start": 3.48, + "end": 3.86 + }, + { + "word": "si", + "start": 3.86, + "end": 4.14 + }, + { + "word": "ça", + "start": 4.14, + "end": 4.26 + }, + { + "word": "arrive", + "start": 4.26, + "end": 4.36 + }, + { + "word": "à", + "start": 4.36, + "end": 4.5 + }, + { + "word": "capté", + "start": 4.5, + "end": 4.78 + }, + { + "word": "le", + "start": 4.78, + "end": 4.9 + }, + { + "word": "silence.", + "start": 4.9, + "end": 5.44 + }, + { + "word": "Là", + "start": 9.24, + "end": 9.6 + }, + { + "word": "il", + "start": 9.6, + "end": 9.78 + }, + { + "word": "est", + "start": 9.78, + "end": 9.84 + }, + { + "word": "une", + "start": 9.84, + "end": 9.96 + }, + { + "word": "telle", + "start": 9.96, + "end": 10.12 + }, + { + "word": "seconde", + "start": 10.12, + "end": 10.38 + }, + { + "word": "de", + "start": 10.38, + "end": 10.48 + }, + { + "word": "silence", + "start": 10.48, + "end": 10.78 + }, + { + "word": "et", + "start": 10.78, + "end": 11.06 + }, + { + "word": "je", + "start": 11.06, + "end": 11.16 + }, + { + "word": "vous", + "start": 11.16, + "end": 11.32 + }, + { + "word": "parle.", + "start": 11.32, + "end": 11.68 + }, + { + "word": "Et", + "start": 13.28, + "end": 13.64 + }, + { + "word": "voilà,", + "start": 13.64, + "end": 13.96 + }, + { + "word": "allez", + "start": 14.36, + "end": 14.62 + }, + { + "word": "on", + "start": 14.62, + "end": 14.78 + }, + { + "word": "va", + "start": 
14.78, + "end": 14.88 + }, + { + "word": "tester", + "start": 14.88, + "end": 15.06 + }, + { + "word": "ça.", + "start": 15.06, + "end": 15.36 + } +] \ No newline at end of file diff --git a/audio_tests/00_00_30_english_3_speakers.transcript.json b/audio_tests/00_00_30_english_3_speakers.transcript.json new file mode 100644 index 0000000..bb9d097 --- /dev/null +++ b/audio_tests/00_00_30_english_3_speakers.transcript.json @@ -0,0 +1,382 @@ +[ + { + "word": "Transcription", + "start": 0.0, + "end": 0.6 + }, + { + "word": "technology", + "start": 0.6, + "end": 1.24 + }, + { + "word": "has", + "start": 1.24, + "end": 1.5 + }, + { + "word": "improved", + "start": 1.5, + "end": 1.96 + }, + { + "word": "so", + "start": 1.96, + "end": 2.32 + }, + { + "word": "much", + "start": 2.32, + "end": 2.68 + }, + { + "word": "in", + "start": 2.68, + "end": 2.94 + }, + { + "word": "the", + "start": 2.94, + "end": 3.02 + }, + { + "word": "past", + "start": 3.02, + "end": 3.24 + }, + { + "word": "few", + "start": 3.24, + "end": 3.5 + }, + { + "word": "years.", + "start": 3.5, + "end": 3.96 + }, + { + "word": "Have", + "start": 4.56, + "end": 4.74 + }, + { + "word": "you", + "start": 4.74, + "end": 4.9 + }, + { + "word": "noticed", + "start": 4.9, + "end": 5.26 + }, + { + "word": "how", + "start": 5.26, + "end": 5.52 + }, + { + "word": "accurate", + "start": 5.52, + "end": 6.08 + }, + { + "word": "real", + "start": 6.08, + "end": 6.42 + }, + { + "word": "-time", + "start": 6.42, + "end": 6.74 + }, + { + "word": "speech", + "start": 6.74, + "end": 7.24 + }, + { + "word": "to", + "start": 7.24, + "end": 7.46 + }, + { + "word": "text", + "start": 7.46, + "end": 7.78 + }, + { + "word": "is", + "start": 7.78, + "end": 8.0 + }, + { + "word": "now?", + "start": 8.0, + "end": 8.3 + }, + { + "word": "Absolutely.", + "start": 8.7, + "end": 9.16 + }, + { + "word": "I", + "start": 10.04, + "end": 10.38 + }, + { + "word": "use", + "start": 10.38, + "end": 10.56 + }, + { + "word": "it", + "start": 
10.56, + "end": 10.76 + }, + { + "word": "all", + "start": 10.76, + "end": 10.9 + }, + { + "word": "the", + "start": 10.9, + "end": 11.04 + }, + { + "word": "time", + "start": 11.04, + "end": 11.32 + }, + { + "word": "for", + "start": 11.32, + "end": 11.54 + }, + { + "word": "taking", + "start": 11.54, + "end": 11.86 + }, + { + "word": "notes", + "start": 11.86, + "end": 12.16 + }, + { + "word": "during", + "start": 12.16, + "end": 12.54 + }, + { + "word": "meetings.", + "start": 12.54, + "end": 12.94 + }, + { + "word": "It's", + "start": 13.6, + "end": 13.8 + }, + { + "word": "amazing", + "start": 13.8, + "end": 14.1 + }, + { + "word": "how", + "start": 14.1, + "end": 14.48 + }, + { + "word": "it", + "start": 14.48, + "end": 14.62 + }, + { + "word": "can", + "start": 14.62, + "end": 14.74 + }, + { + "word": "recognise", + "start": 14.74, + "end": 15.24 + }, + { + "word": "different", + "start": 15.24, + "end": 15.68 + }, + { + "word": "speakers", + "start": 15.68, + "end": 16.16 + }, + { + "word": "and", + "start": 16.16, + "end": 16.8 + }, + { + "word": "even", + "start": 16.8, + "end": 17.1 + }, + { + "word": "add", + "start": 17.1, + "end": 17.44 + }, + { + "word": "punctuation.", + "start": 17.44, + "end": 18.36 + }, + { + "word": "Yeah,", + "start": 18.88, + "end": 19.16 + }, + { + "word": "but", + "start": 19.36, + "end": 19.52 + }, + { + "word": "sometimes", + "start": 19.52, + "end": 20.16 + }, + { + "word": "noise", + "start": 20.16, + "end": 20.54 + }, + { + "word": "can", + "start": 20.54, + "end": 20.8 + }, + { + "word": "still", + "start": 20.8, + "end": 21.1 + }, + { + "word": "cause", + "start": 21.1, + "end": 21.44 + }, + { + "word": "mistakes.", + "start": 21.44, + "end": 21.94 + }, + { + "word": "Does", + "start": 22.68, + "end": 22.9 + }, + { + "word": "this", + "start": 22.9, + "end": 23.12 + }, + { + "word": "system", + "start": 23.12, + "end": 23.46 + }, + { + "word": "handle", + "start": 23.46, + "end": 23.88 + }, + { + "word": "that", + 
"start": 23.88, + "end": 24.12 + }, + { + "word": "well?", + "start": 24.12, + "end": 24.42 + }, + { + "word": "It", + "start": 24.42, + "end": 25.32 + }, + { + "word": "does", + "start": 25.32, + "end": 25.48 + }, + { + "word": "a", + "start": 25.48, + "end": 25.62 + }, + { + "word": "pretty", + "start": 25.62, + "end": 25.88 + }, + { + "word": "good", + "start": 25.88, + "end": 26.08 + }, + { + "word": "job", + "start": 26.08, + "end": 26.32 + }, + { + "word": "filtering", + "start": 26.32, + "end": 26.8 + }, + { + "word": "noise,", + "start": 26.8, + "end": 27.18 + }, + { + "word": "especially", + "start": 27.36, + "end": 28.0 + }, + { + "word": "with", + "start": 28.0, + "end": 28.28 + }, + { + "word": "models", + "start": 28.28, + "end": 28.62 + }, + { + "word": "that", + "start": 28.62, + "end": 28.94 + }, + { + "word": "use", + "start": 28.94, + "end": 29.22 + }, + { + "word": "voice", + "start": 29.22, + "end": 29.54 + }, + { + "word": "active.", + "start": 29.54, + "end": 29.9 + } +] \ No newline at end of file diff --git a/audio_tests/generate_transcripts.py b/audio_tests/generate_transcripts.py new file mode 100644 index 0000000..7eb180f --- /dev/null +++ b/audio_tests/generate_transcripts.py @@ -0,0 +1,57 @@ +#!/usr/bin/env python3 +"""Generate word-level timestamped transcripts using faster-whisper (offline). + +Produces one JSON file per audio with: [{word, start, end}, ...] 
+""" + +import json +import os +from faster_whisper import WhisperModel + +AUDIO_DIR = os.path.dirname(os.path.abspath(__file__)) + +FILES = [ + ("00_00_07_english_1_speaker.wav", "en"), + ("00_00_16_french_1_speaker.wav", "fr"), + ("00_00_30_english_3_speakers.wav", "en"), +] + +def main(): + print("Loading faster-whisper model (base, cpu, float32)...") + model = WhisperModel("base", device="cpu", compute_type="float32") + + for filename, lang in FILES: + audio_path = os.path.join(AUDIO_DIR, filename) + out_path = os.path.join( + AUDIO_DIR, filename.rsplit(".", 1)[0] + ".transcript.json" + ) + + print(f"\n{'='*60}") + print(f"Transcribing: {filename} (language={lang})") + print(f"{'='*60}") + + segments, info = model.transcribe( + audio_path, word_timestamps=True, language=lang + ) + + words = [] + for segment in segments: + if segment.words: + for w in segment.words: + words.append({ + "word": w.word.strip(), + "start": round(w.start, 3), + "end": round(w.end, 3), + }) + print(f" {w.start:6.2f} - {w.end:6.2f} {w.word.strip()}") + + with open(out_path, "w", encoding="utf-8") as f: + json.dump(words, f, indent=2, ensure_ascii=False) + + print(f"\n -> {len(words)} words written to {os.path.basename(out_path)}") + + print("\nDone.") + + +if __name__ == "__main__": + main() diff --git a/run_benchmark.py b/run_benchmark.py new file mode 100644 index 0000000..5a4e23b --- /dev/null +++ b/run_benchmark.py @@ -0,0 +1,291 @@ +#!/usr/bin/env python3 +""" +Comprehensive benchmark runner for WhisperLiveKit. + +Tests all available backend+policy combinations across multiple audio files, +model sizes, and VAC on/off configurations. Outputs structured JSON that +is consumed by the report generator. 
+ +Usage: + python run_benchmark.py # full benchmark + python run_benchmark.py --quick # subset (tiny models, fewer combos) + python run_benchmark.py --json results.json # custom output path +""" + +import argparse +import asyncio +import gc +import json +import logging +import platform +import subprocess +import sys +import time +from dataclasses import asdict +from pathlib import Path + +logging.basicConfig(level=logging.WARNING, format="%(asctime)s %(levelname)s %(name)s: %(message)s") +logger = logging.getLogger("benchmark") +logger.setLevel(logging.INFO) + +# Re-use harness functions +sys.path.insert(0, str(Path(__file__).parent)) +from test_backend_offline import ( + AUDIO_TESTS_DIR, + SAMPLE_RATE, + TestResult, + create_engine, + discover_audio_files, + download_sample_audio, + load_audio, + run_test, +) + +CACHE_DIR = Path(__file__).parent / ".test_cache" + + +def get_system_info() -> dict: + """Collect system metadata for the report.""" + info = { + "platform": platform.platform(), + "machine": platform.machine(), + "processor": platform.processor(), + "python_version": platform.python_version(), + } + + # macOS: get chip info + try: + chip = subprocess.check_output( + ["sysctl", "-n", "machdep.cpu.brand_string"], text=True + ).strip() + info["cpu"] = chip + except Exception: + info["cpu"] = platform.processor() + + # RAM + try: + mem_bytes = int( + subprocess.check_output(["sysctl", "-n", "hw.memsize"], text=True).strip() + ) + info["ram_gb"] = round(mem_bytes / (1024**3)) + except Exception: + info["ram_gb"] = None + + # Backend versions + versions = {} + try: + import faster_whisper + versions["faster-whisper"] = faster_whisper.__version__ + except ImportError: + pass + try: + import mlx_whisper # noqa: F401 + versions["mlx-whisper"] = "installed" + except ImportError: + pass + try: + import mlx.core as mx + versions["mlx"] = mx.__version__ + except ImportError: + pass + try: + import transformers + versions["transformers"] = transformers.__version__ + 
except ImportError: + pass + try: + import torch + versions["torch"] = torch.__version__ + except ImportError: + pass + + info["backend_versions"] = versions + return info + + +def detect_combos(quick: bool = False) -> list: + """Build list of (backend, policy, model_size) combos to test.""" + combos = [] + + # Model sizes to test + model_sizes = ["tiny", "base", "small"] if not quick else ["tiny", "base"] + + # faster-whisper + try: + import faster_whisper # noqa: F401 + for model in model_sizes: + combos.append({"backend": "faster-whisper", "policy": "localagreement", "model": model}) + combos.append({"backend": "faster-whisper", "policy": "simulstreaming", "model": model}) + except ImportError: + pass + + # mlx-whisper + try: + import mlx_whisper # noqa: F401 + for model in model_sizes: + combos.append({"backend": "mlx-whisper", "policy": "localagreement", "model": model}) + combos.append({"backend": "mlx-whisper", "policy": "simulstreaming", "model": model}) + except ImportError: + pass + + # voxtral-mlx (single model, single policy) + try: + from whisperlivekit.voxtral_mlx import VoxtralMLXModel # noqa: F401 + combos.append({"backend": "voxtral-mlx", "policy": "voxtral", "model": ""}) + except ImportError: + pass + + # voxtral HF (single model, single policy) + try: + from transformers import AutoModelForSpeechSeq2Seq # noqa: F401 + combos.append({"backend": "voxtral", "policy": "voxtral", "model": ""}) + except ImportError: + pass + + return combos + + +def collect_audio_files() -> list: + """Collect all benchmark audio files.""" + files = [] + + # audio_tests/ directory + if AUDIO_TESTS_DIR.is_dir(): + files.extend(discover_audio_files(str(AUDIO_TESTS_DIR))) + + # JFK sample + jfk = CACHE_DIR / "jfk.wav" + if not jfk.exists(): + jfk = download_sample_audio() + if jfk.exists(): + files.append(jfk) + + return files + + +async def run_single_combo( + combo: dict, audio_files: list, vac: bool, lan: str, max_duration: float, +) -> list: + """Run one 
backend+policy+model combo across all audio files.""" + backend = combo["backend"] + policy = combo["policy"] + model = combo["model"] + + results = [] + try: + engine = create_engine( + backend=backend, + model_size=model, + lan=lan, + vac=vac, + policy=policy, + ) + + # Quiet noisy loggers + for mod in ( + "whisperlivekit.audio_processor", + "whisperlivekit.simul_whisper", + "whisperlivekit.tokens_alignment", + "whisperlivekit.simul_whisper.align_att_base", + "whisperlivekit.simul_whisper.simul_whisper", + ): + logging.getLogger(mod).setLevel(logging.WARNING) + + for audio_path in audio_files: + audio = load_audio(str(audio_path)) # load once; reused for duration and run_test + duration = len(audio) / SAMPLE_RATE + if duration > max_duration: + logger.info(f" Skipping {audio_path.name} ({duration:.0f}s > {max_duration:.0f}s)") + continue + + file_lan = lan + if "french" in audio_path.name.lower() and lan == "en": + file_lan = "fr" # result label only; the engine above still uses `lan` + + result = await run_test( + engine, audio, chunk_ms=100, realtime=False, + audio_file=audio_path.name, backend=backend, + policy=policy, lan=file_lan, + ) + # Tag with extra metadata + result_dict = asdict(result) + result_dict["model_size"] = model + result_dict["vac"] = vac + results.append(result_dict) + + except Exception as e: + logger.error(f" FAILED: {e}") + import traceback + traceback.print_exc() + + return results + + +async def run_full_benchmark(combos, audio_files, max_duration=60.0): + """Run all combos with VAC on and off.""" + all_results = [] + total = len(combos) * 2 # x2 for VAC on/off + idx = 0 + + for combo in combos: + for vac in [True, False]: + idx += 1 + vac_str = "VAC=on" if vac else "VAC=off" + desc = f"{combo['backend']} / {combo['policy']}" + if combo["model"]: + desc += f" / {combo['model']}" + desc += f" / {vac_str}" + + print(f"\n{'='*70}") + print(f"[{idx}/{total}] {desc}") + print(f"{'='*70}") + + results = await run_single_combo( + combo, audio_files, vac=vac, lan="en", max_duration=max_duration, + ) + 
all_results.extend(results) + + # Free memory between combos + gc.collect() + + return all_results + + +def main(): + parser = argparse.ArgumentParser(description="Run comprehensive WhisperLiveKit benchmark") + parser.add_argument("--quick", action="store_true", help="Quick mode: fewer models and combos") + parser.add_argument("--json", default="benchmark_results.json", dest="json_output", help="Output JSON path") + parser.add_argument("--max-duration", type=float, default=60.0, help="Max audio duration in seconds") + args = parser.parse_args() + + system_info = get_system_info() + combos = detect_combos(quick=args.quick) + audio_files = collect_audio_files() + + print(f"System: {system_info.get('cpu', 'unknown')}, {system_info.get('ram_gb', '?')}GB RAM") + print(f"Backends: {list(system_info['backend_versions'].keys())}") + print(f"Combos to test: {len(combos)} x 2 (VAC on/off) = {len(combos)*2}") + print(f"Audio files: {[f.name for f in audio_files]}") + print() + + t0 = time.time() + all_results = asyncio.run( + run_full_benchmark(combos, audio_files, max_duration=args.max_duration) + ) + total_time = time.time() - t0 + + output = { + "system_info": system_info, + "benchmark_date": time.strftime("%Y-%m-%d %H:%M"), + "total_benchmark_time_s": round(total_time, 1), + "n_combos": len(combos) * 2, + "n_audio_files": len(audio_files), + "results": all_results, + } + + Path(args.json_output).write_text(json.dumps(output, indent=2, ensure_ascii=False)) + print(f"\nBenchmark complete in {total_time:.0f}s. Results: {args.json_output}") + + +if __name__ == "__main__": + main() diff --git a/test_backend_offline.py b/test_backend_offline.py new file mode 100644 index 0000000..486b715 --- /dev/null +++ b/test_backend_offline.py @@ -0,0 +1,783 @@ +#!/usr/bin/env python3 +""" +Offline test harness and benchmark suite for WhisperLiveKit backends. 
+ +Simulates a client-server session by feeding audio files as PCM bytes through +the full AudioProcessor pipeline (the same path used by the WebSocket server), +without needing a browser or microphone. + +Computes WER (Word Error Rate) and timestamp accuracy when ground truth +transcript files (.transcript.json) are available alongside audio files. + +Usage: + # Test with a single audio file: + python test_backend_offline.py --backend faster-whisper --audio audio_tests/00_00_07_english_1_speaker.wav + + # Test all files in audio_tests/: + python test_backend_offline.py --backend faster-whisper --no-realtime + + # Override streaming policy: + python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime + + # Multi-backend benchmark (auto-detects all installed backends): + python test_backend_offline.py --benchmark --no-realtime + + # Export results as JSON: + python test_backend_offline.py --benchmark --no-realtime --json results.json + + # Insert silence for testing silence handling: + python test_backend_offline.py --backend faster-whisper --insert-silence 3.0 2.0 +""" + +import argparse +import asyncio +import json +import logging +import sys +import time +import urllib.request +from pathlib import Path +from dataclasses import dataclass, asdict, field +from typing import List, Optional + +import numpy as np + +logging.basicConfig( + level=logging.WARNING, + format="%(asctime)s %(levelname)s %(name)s: %(message)s", +) +logger = logging.getLogger("test_offline") +logger.setLevel(logging.INFO) + +SAMPLE_RATE = 16000 +JFK_WAV_URL = "https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav" +CACHE_DIR = Path(__file__).parent / ".test_cache" +AUDIO_TESTS_DIR = Path(__file__).parent / "audio_tests" +AUDIO_EXTENSIONS = {".wav", ".mp3", ".flac", ".ogg", ".m4a"} + + +@dataclass +class WordTimestamp: + """Word with its start/end time.""" + word: str + start: float + end: float + + +@dataclass +class TestResult: + """Structured 
result from a single test run."""
+    audio_file: str
+    audio_duration_s: float
+    backend: str
+    policy: str
+    language: str
+    chunk_ms: int
+    realtime_pacing: bool
+    # Timing
+    processing_time_s: float
+    rtf: float  # real-time factor
+    # Transcription output
+    transcription: str
+    n_lines: int
+    n_responses: int
+    # WER metrics (None if no ground truth)
+    wer: Optional[float] = None
+    wer_details: Optional[dict] = None
+    # Timestamp accuracy (None if no ground truth)
+    timestamp_mae: Optional[float] = None
+    timestamp_max_delta: Optional[float] = None
+    timestamp_median_delta: Optional[float] = None
+    # Word-level timestamps
+    word_timestamps: List[WordTimestamp] = field(default_factory=list)
+    # Raw last response
+    last_response: Optional[dict] = None
+
+
+def download_sample_audio() -> Path:
+    """Download the jfk.wav sample if not cached."""
+    CACHE_DIR.mkdir(exist_ok=True)
+    path = CACHE_DIR / "jfk.wav"
+    if not path.exists():
+        logger.info(f"Downloading sample audio to {path} ...")
+        urllib.request.urlretrieve(JFK_WAV_URL, path)
+        logger.info("Done.")
+    return path
+
+
+def load_audio(path: str) -> np.ndarray:
+    """Load audio file as float32 mono 16kHz numpy array.
+
+    Supports WAV, FLAC (via soundfile) and MP3, OGG, M4A (via librosa).
+    """
+    ext = Path(path).suffix.lower()
+    if ext in (".mp3", ".ogg", ".m4a"):
+        import librosa
+        audio, _ = librosa.load(path, sr=SAMPLE_RATE, mono=True)
+        return audio.astype(np.float32)
+
+    import soundfile as sf
+    audio, sr = sf.read(path, dtype="float32")
+    if audio.ndim > 1:
+        audio = audio.mean(axis=1)
+    if sr != SAMPLE_RATE:
+        import librosa
+        audio = librosa.resample(audio, orig_sr=sr, target_sr=SAMPLE_RATE)
+    return audio
+
+
+def insert_silence(audio: np.ndarray, silence_sec: float, position_sec: float) -> np.ndarray:
+    """Insert silence into audio at a given position.
+
+    Args:
+        audio: Float32 mono audio array at SAMPLE_RATE.
+        silence_sec: Duration of silence to insert in seconds.
+        position_sec: Position in seconds where silence starts.
+    Returns:
+        New audio array with silence inserted.
+    """
+    pos_samples = int(position_sec * SAMPLE_RATE)
+    silence_samples = int(silence_sec * SAMPLE_RATE)
+    pos_samples = min(pos_samples, len(audio))
+    silence = np.zeros(silence_samples, dtype=np.float32)
+    return np.concatenate([audio[:pos_samples], silence, audio[pos_samples:]])
+
+
+def float32_to_s16le_bytes(audio: np.ndarray) -> bytes:
+    """Convert float32 audio to s16le PCM bytes (what the browser sends)."""
+    return (audio * 32768).clip(-32768, 32767).astype(np.int16).tobytes()
+
+
+def create_engine(
+    backend: str, model_size: str, lan: str,
+    diarization: bool = False, vac: bool = True, policy: str = "",
+):
+    """Create a TranscriptionEngine with the given backend config."""
+    import gc
+    from whisperlivekit.core import TranscriptionEngine
+
+    # Reset singleton so we get a fresh instance
+    TranscriptionEngine._instance = None
+    TranscriptionEngine._initialized = False
+    gc.collect()
+
+    kwargs = dict(
+        backend=backend,
+        lan=lan,
+        pcm_input=True,
+        vac=vac,
+        transcription=True,
+        diarization=diarization,
+    )
+    if model_size:
+        kwargs["model_size"] = model_size
+    if policy:
+        kwargs["backend_policy"] = policy
+
+    return TranscriptionEngine(**kwargs)
+
+
+def _extract_text_from_response(response_dict: dict) -> str:
+    """Extract full transcription text from a FrontData dict."""
+    segments = response_dict.get("lines", [])
+    full_text = " ".join(
+        seg.get("text", "").strip()
+        for seg in segments
+        if seg.get("text", "").strip()
+    )
+    buf = response_dict.get("buffer_transcription", "").strip()
+    if buf:
+        full_text = f"{full_text} {buf}".strip() if full_text else buf
+    return full_text
+
+
+async def run_test(
+    engine, audio: np.ndarray, chunk_ms: int, realtime: bool,
+    audio_file: str = "", backend: str = "", policy: str = "", lan: str = "",
+) -> TestResult:
+    """
+    Simulate a client session through the full AudioProcessor pipeline.
+
+    1. Create AudioProcessor (one per "client session")
+    2. Start async pipeline (transcription_processor, results_formatter, etc.)
+    3. Feed audio as PCM bytes in timed chunks
+    4. Collect and display FrontData responses
+    5. Signal EOF and cleanup
+    """
+    from whisperlivekit.audio_processor import AudioProcessor
+
+    chunk_samples = int(SAMPLE_RATE * chunk_ms / 1000)
+    total_samples = len(audio)
+    audio_duration = total_samples / SAMPLE_RATE
+
+    # Number of chunks the feed loop below will send (ceiling division)
+    n_steps = (total_samples + chunk_samples - 1) // chunk_samples
+    logger.info(
+        f"Audio: {audio_duration:.2f}s | "
+        f"Chunk: {chunk_ms}ms ({chunk_samples} samples) | "
+        f"Steps: {n_steps} | "
+        f"Realtime: {realtime}"
+    )
+
+    # --- Server side: create processor and start pipeline ---
+    processor = AudioProcessor(transcription_engine=engine)
+    results_generator = await processor.create_tasks()
+
+    # Collect results in background (like handle_websocket_results)
+    all_responses = []
+    response_count = 0
+    last_printed_text = ""
+
+    async def collect_results():
+        nonlocal response_count, last_printed_text
+        async for response in results_generator:
+            all_responses.append(response)
+            response_count += 1
+            d = response.to_dict()
+
+            # Only print when transcription text actually changes
+            current_text = _extract_text_from_response(d)
+            if current_text and current_text != last_printed_text:
+                buf = d.get("buffer_transcription", "").strip()
+                committed = current_text
+                if buf and committed.endswith(buf):
+                    committed = committed[:-len(buf)].strip()
+
+                # Show committed text + buffer separately
+                display = committed
+                if buf:
+                    display = f"{committed} \033[90m{buf}\033[0m" if committed else f"\033[90m{buf}\033[0m"
+                print(f" > {display}", flush=True)
+                last_printed_text = current_text
+
+    result_task = asyncio.create_task(collect_results())
+
+    # --- Client side: feed audio as PCM bytes ---
+    t_start = time.time()
+
+    for offset in range(0, total_samples, chunk_samples):
+        chunk = audio[offset : offset + chunk_samples]
+        pcm_bytes = float32_to_s16le_bytes(chunk)
+        await processor.process_audio(pcm_bytes)
+        if realtime:
+            await asyncio.sleep(chunk_ms / 1000)
+
+    feed_elapsed = time.time() - t_start
+
+    logger.info(f"Audio fed in {feed_elapsed:.2f}s. Signaling EOF...")
+
+    # Signal end of audio (like client disconnect / empty message)
+    await processor.process_audio(None)
+
+    # Wait for pipeline to drain completely
+    try:
+        await asyncio.wait_for(result_task, timeout=120.0)
+    except asyncio.TimeoutError:
+        logger.warning("Timed out waiting for results. Proceeding with cleanup.")
+        result_task.cancel()
+        try:
+            await result_task
+        except asyncio.CancelledError:
+            pass
+
+    # --- Capture word-level timestamps before cleanup ---
+    word_timestamps = []
+    try:
+        state = await processor.get_current_state()
+        for token in state.tokens:
+            if hasattr(token, 'start') and hasattr(token, 'text') and token.text:
+                word_timestamps.append(WordTimestamp(
+                    word=token.text.strip(),
+                    start=round(token.start, 3),
+                    end=round(token.end, 3),
+                ))
+    except Exception as e:
+        logger.warning(f"Could not capture word timestamps: {e}")
+
+    # Cleanup
+    await processor.cleanup()
+
+    total_elapsed = time.time() - t_start
+
+    # --- Build result ---
+    transcription = ""
+    n_lines = 0
+    last_response_dict = None
+
+    if all_responses:
+        last = all_responses[-1].to_dict()
+        last_response_dict = last
+        n_lines = len(last.get("lines", []))
+        transcription = _extract_text_from_response(last)
+
+    # --- Compute WER and timestamp accuracy against ground truth ---
+    from whisperlivekit.metrics import compute_wer, compute_timestamp_accuracy
+
+    wer_val = None
+    wer_details = None
+    ts_mae = None
+    ts_max_delta = None
+    ts_median_delta = None
+
+    gt_path = Path(audio_file).with_suffix(".transcript.json")
+    if not gt_path.exists():
+        gt_path = AUDIO_TESTS_DIR / gt_path
+    gt = None
+    if gt_path.exists():
+        with open(gt_path) as f:
+            gt = json.load(f)
+
+        # WER
+        gt_text = " ".join(w["word"] for w in gt)
+        wer_result = compute_wer(gt_text, transcription)
+        wer_val = round(wer_result["wer"], 4)
+        wer_details = wer_result
+
+        # Timestamp accuracy
+        if word_timestamps:
+            pred_dicts = [{"word": wt.word, "start": wt.start, "end": wt.end} for wt in word_timestamps]
+            ts_result = compute_timestamp_accuracy(pred_dicts, gt)
+            ts_mae = ts_result["mae_start"]
+            ts_max_delta = ts_result["max_delta_start"]
+            ts_median_delta = ts_result["median_delta_start"]
+
+    result = TestResult(
+        audio_file=audio_file,
+        audio_duration_s=round(audio_duration, 2),
+        backend=backend,
+        policy=policy,
+        language=lan,
+        chunk_ms=chunk_ms,
+        realtime_pacing=realtime,
+        processing_time_s=round(total_elapsed, 2),
+        rtf=round(total_elapsed / audio_duration, 2),
+        transcription=transcription,
+        n_lines=n_lines,
+        n_responses=response_count,
+        wer=wer_val,
+        wer_details=wer_details,
+        timestamp_mae=round(ts_mae, 3) if ts_mae is not None else None,
+        timestamp_max_delta=round(ts_max_delta, 3) if ts_max_delta is not None else None,
+        timestamp_median_delta=round(ts_median_delta, 3) if ts_median_delta is not None else None,
+        word_timestamps=word_timestamps,
+        last_response=last_response_dict,
+    )
+
+    # --- Print summary ---
+    print(f"\n{'=' * 60}")
+    print(f"RESULT: {audio_file}")
+    print(f"{'=' * 60}")
+    print(f"Transcription: {transcription}")
+    print(f"Lines: {n_lines} | Responses: {response_count}")
+    print(f"Audio: {audio_duration:.2f}s | Time: {total_elapsed:.2f}s | RTF: {result.rtf:.2f}x")
+
+    if wer_val is not None:
+        print(f"WER: {wer_val:.2%} (S={wer_details['substitutions']} I={wer_details['insertions']} D={wer_details['deletions']})")
+
+    # Print word timestamps if available
+    if word_timestamps:
+        print(f"\nWord timestamps ({len(word_timestamps)} words):")
+        for wt in word_timestamps:
+            print(f" [{wt.start:6.2f} - {wt.end:6.2f}] {wt.word}")
+
+        # Detailed comparison with ground truth
+        if gt:
+            print(f"\n vs Ground truth ({len(gt)} words):")
+            max_words = max(len(word_timestamps), len(gt))
+            for i in range(max_words):
+                pred = word_timestamps[i] if i < len(word_timestamps) else None
+                ref = gt[i] if i < len(gt) else None
+                p_str = f"[{pred.start:5.2f}-{pred.end:5.2f}] {pred.word:<15}" if pred else " " * 30
+                r_str = f"[{ref['start']:5.2f}-{ref['end']:5.2f}] {ref['word']:<15}" if ref else ""
+                delta = ""
+                if pred and ref:
+                    d = pred.start - ref['start']
+                    delta = f" Δstart={d:+.2f}"
+                print(f" {p_str} | {r_str}{delta}")
+
+            if ts_mae is not None:
+                print(f"\n Timestamp stats: MAE={ts_mae:.3f}s max|Δ|={ts_max_delta:.3f}s median|Δ|={ts_median_delta:.3f}s")
+
+    print(f"{'=' * 60}")
+
+    return result
+
+
+def discover_audio_files(directory: str) -> List[Path]:
+    """Find all supported audio files in directory."""
+    d = Path(directory)
+    files = sorted(
+        p for p in d.iterdir()
+        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
+    )
+    return files
+
+
+async def run_all_tests(
+    engine, audio_files: List[Path], chunk_ms: int, realtime: bool,
+    backend: str, policy: str, lan: str, max_duration: float = 60.0,
+    silence_insertions: Optional[List[List[float]]] = None,
+) -> List[TestResult]:
+    """Run tests on multiple audio files sequentially."""
+    results = []
+    for audio_path in audio_files:
+        # Detect language from filename if "french" in name.
+        # NOTE: this only changes the language recorded in the TestResult;
+        # the engine passed in was already created with `lan`.
+        file_lan = lan
+        if "french" in audio_path.name.lower() and lan == "en":
+            file_lan = "fr"
+            logger.info("Auto-detected language 'fr' from filename")
+
+        audio = load_audio(str(audio_path))
+
+        # Insert silence segments (applied in reverse position order to keep offsets valid)
+        if silence_insertions:
+            for secs, at_sec in sorted(silence_insertions, key=lambda x: x[1], reverse=True):
+                logger.info(f"Inserting {secs:.1f}s silence at {at_sec:.1f}s")
+                audio = insert_silence(audio, secs, at_sec)
+
+        duration = len(audio) / SAMPLE_RATE
+
+        if duration > max_duration:
+            logger.info(f"Skipping {audio_path.name} ({duration:.0f}s > {max_duration:.0f}s max)")
+            continue
+
+        print(f"\n{'#' * 60}")
+        print(f"# Testing: {audio_path.name} ({duration:.1f}s)")
+        print(f"{'#' * 60}")
+
+        result = await run_test(
+            engine, audio, chunk_ms, realtime,
+            audio_file=audio_path.name, backend=backend, policy=policy, lan=file_lan,
+        )
+        results.append(result)
+
+    return results
+
+
+def print_benchmark_summary(results: List[TestResult]):
+    """Print a tabular summary of all test results."""
+    print(f"\n{'=' * 110}")
+    print("BENCHMARK SUMMARY")
+    print(f"{'=' * 110}")
+    print(
+        f"{'File':<40} {'Duration':>8} {'Time':>8} {'RTF':>6} "
+        f"{'WER':>7} {'MAE(s)':>7} {'Lines':>5}"
+    )
+    print(f"{'-' * 110}")
+    for r in results:
+        wer_str = f"{r.wer:.2%}" if r.wer is not None else " -"
+        mae_str = f"{r.timestamp_mae:.3f}" if r.timestamp_mae is not None else " -"
+        print(
+            f"{r.audio_file:<40} {r.audio_duration_s:>7.1f}s {r.processing_time_s:>7.1f}s "
+            f"{r.rtf:>5.2f}x {wer_str:>7} {mae_str:>7} {r.n_lines:>5}"
+        )
+    print(f"{'-' * 110}")
+    total_audio = sum(r.audio_duration_s for r in results)
+    total_time = sum(r.processing_time_s for r in results)
+    avg_rtf = total_time / total_audio if total_audio > 0 else 0
+    wer_vals = [r.wer for r in results if r.wer is not None]
+    avg_wer_str = f"{sum(wer_vals)/len(wer_vals):.2%}" if wer_vals else " -"
+    mae_vals = [r.timestamp_mae for r in results if r.timestamp_mae is not None]
+    avg_mae_str = f"{sum(mae_vals)/len(mae_vals):.3f}" if mae_vals else " -"
+    print(
+        f"{'TOTAL/AVG':<40} {total_audio:>7.1f}s {total_time:>7.1f}s "
+        f"{avg_rtf:>5.2f}x {avg_wer_str:>7} {avg_mae_str:>7}"
+    )
+    print(f"{'=' * 110}")
+
+    # Print transcription excerpts
+    print("\nTRANSCRIPTIONS:")
+    print(f"{'-' * 110}")
+    for r in results:
+        excerpt = r.transcription[:120] + "..." if len(r.transcription) > 120 else r.transcription
+        print(f" {r.audio_file}:")
+        print(f" {excerpt}")
+    print(f"{'=' * 110}")
+
+
+def detect_available_backends() -> List[dict]:
+    """Probe which backends can be imported and return (backend, policy) combos.
+
+    Returns list of dicts with keys: backend, policy, description.
+    """
+    combos = []
+
+    # faster-whisper
+    try:
+        import faster_whisper  # noqa: F401
+        combos.append({"backend": "faster-whisper", "policy": "localagreement", "description": "faster-whisper + LocalAgreement"})
+        combos.append({"backend": "faster-whisper", "policy": "simulstreaming", "description": "faster-whisper + SimulStreaming"})
+    except ImportError:
+        pass
+
+    # mlx-whisper (macOS only)
+    try:
+        import mlx_whisper  # noqa: F401
+        combos.append({"backend": "mlx-whisper", "policy": "localagreement", "description": "mlx-whisper + LocalAgreement"})
+        combos.append({"backend": "mlx-whisper", "policy": "simulstreaming", "description": "mlx-whisper + SimulStreaming"})
+    except ImportError:
+        pass
+
+    # openai-whisper
+    try:
+        import whisper  # noqa: F401
+        combos.append({"backend": "whisper", "policy": "localagreement", "description": "openai-whisper + LocalAgreement"})
+        combos.append({"backend": "whisper", "policy": "simulstreaming", "description": "openai-whisper + SimulStreaming"})
+    except ImportError:
+        pass
+
+    # voxtral-mlx
+    try:
+        from whisperlivekit.voxtral_mlx import VoxtralMLXModel  # noqa: F401
+        combos.append({"backend": "voxtral-mlx", "policy": "voxtral", "description": "voxtral-mlx (MLX)"})
+    except ImportError:
+        pass
+
+    # voxtral (HuggingFace)
+    try:
+        from transformers import AutoModelForSpeechSeq2Seq  # noqa: F401
+        combos.append({"backend": "voxtral", "policy": "voxtral", "description": "voxtral (HuggingFace)"})
+    except ImportError:
+        pass
+
+    return combos
+
+
+def print_cross_backend_comparison(all_results: List[TestResult]):
+    """Print a comparison table across backends and policies."""
+    print(f"\n{'=' * 110}")
+    print("CROSS-BACKEND BENCHMARK COMPARISON")
+    print(f"{'=' * 110}")
+    print(
+        f"{'Backend':<18} {'Policy':<16} {'File':<30} "
+        f"{'WER':>7} {'RTF':>6} {'MAE(s)':>7} {'MaxΔ(s)':>8}"
+    )
+    print(f"{'-' * 110}")
+
+    for r in all_results:
+        wer_str = f"{r.wer:.2%}" if r.wer is not None else " -"
+        rtf_str = f"{r.rtf:.2f}x"
+        mae_str = f"{r.timestamp_mae:.3f}" if r.timestamp_mae is not None else " -"
+        max_str = f"{r.timestamp_max_delta:.3f}" if r.timestamp_max_delta is not None else " -"
+        # Truncate filename for readability
+        fname = r.audio_file[:28] + ".." if len(r.audio_file) > 30 else r.audio_file
+        print(
+            f"{r.backend:<18} {r.policy:<16} {fname:<30} "
+            f"{wer_str:>7} {rtf_str:>6} {mae_str:>7} {max_str:>8}"
+        )
+
+    print(f"{'-' * 110}")
+
+    # Per-backend averages
+    from collections import defaultdict
+    by_combo = defaultdict(list)
+    for r in all_results:
+        by_combo[(r.backend, r.policy)].append(r)
+
+    print(f"\n{'Backend':<18} {'Policy':<16} {'Avg WER':>8} {'Avg RTF':>8} {'Avg MAE':>8} {'Files':>6}")
+    print(f"{'-' * 80}")
+    for (backend, policy), group in sorted(by_combo.items()):
+        wer_vals = [r.wer for r in group if r.wer is not None]
+        rtf_vals = [r.rtf for r in group]
+        mae_vals = [r.timestamp_mae for r in group if r.timestamp_mae is not None]
+        avg_wer = f"{sum(wer_vals)/len(wer_vals):.2%}" if wer_vals else " -"
+        avg_rtf = f"{sum(rtf_vals)/len(rtf_vals):.2f}x"
+        avg_mae = f"{sum(mae_vals)/len(mae_vals):.3f}" if mae_vals else " -"
+        print(
+            f"{backend:<18} {policy:<16} {avg_wer:>8} {avg_rtf:>8} {avg_mae:>8} {len(group):>6}"
+        )
+    print(f"{'=' * 110}")
+
+
+def _quiet_loggers(verbose: bool):
+    """Set internal module log levels to reduce noise."""
+    if verbose:
+        logging.getLogger().setLevel(logging.DEBUG)
+    else:
+        for mod in (
+            "whisperlivekit.audio_processor", "whisperlivekit.simul_whisper",
+            "whisperlivekit.tokens_alignment", "whisperlivekit.simul_whisper.align_att_base",
+            "whisperlivekit.simul_whisper.simul_whisper",
+        ):
+            logging.getLogger(mod).setLevel(logging.WARNING)
+
+
+async def run_benchmark(
+    audio_files: List[Path], chunk_ms: int, realtime: bool,
+    model_size: str, lan: str, max_duration: float, vac: bool,
+    verbose: bool,
+) -> List[TestResult]:
+    """Run benchmark across all available backend+policy combinations."""
+    combos = detect_available_backends()
+    if not combos:
+        logger.error("No backends available. Install at least one ASR backend.")
+        return []
+
+    logger.info(f"Detected {len(combos)} backend+policy combinations:")
+    for c in combos:
+        logger.info(f" - {c['description']}")
+
+    all_results = []
+    for i, combo in enumerate(combos, 1):
+        backend = combo["backend"]
+        policy = combo["policy"]
+        desc = combo["description"]
+
+        print(f"\n{'*' * 70}")
+        print(f"* BENCHMARK {i}/{len(combos)}: {desc}")
+        print(f"{'*' * 70}")
+
+        try:
+            engine = create_engine(
+                backend, model_size, lan, vac=vac, policy=policy,
+            )
+            _quiet_loggers(verbose)
+
+            results = await run_all_tests(
+                engine, audio_files, chunk_ms, realtime,
+                backend=backend, policy=policy, lan=lan,
+                max_duration=max_duration,
+            )
+            all_results.extend(results)
+        except Exception as e:
+            logger.error(f"Failed to run {desc}: {e}")
+            import traceback
+            traceback.print_exc()
+
+    return all_results
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Offline backend test harness (AudioProcessor-level)"
+    )
+    parser.add_argument(
+        "--backend", default="faster-whisper",
+        help="Backend: voxtral, voxtral-mlx, auto, faster-whisper, mlx-whisper, whisper.",
+    )
+    parser.add_argument(
+        "--policy", default="",
+        help="Override backend policy: localagreement, simulstreaming, voxtral.",
+    )
+    parser.add_argument(
+        "--audio", default=None,
+        help="Path to a single audio file (WAV, MP3, FLAC, etc.).",
+    )
+    parser.add_argument(
+        "--audio-dir", default=None,
+        help="Directory of audio files to test. Defaults to audio_tests/ if neither --audio nor --audio-dir given.",
+    )
+    parser.add_argument(
+        "--chunk-ms", type=int, default=100,
+        help="Chunk size in milliseconds (simulates real-time interval).",
+    )
+    parser.add_argument(
+        "--model", default="", dest="model_size",
+        help="Model size or HF repo ID.",
+    )
+    parser.add_argument("--lan", default="en", help="Language code.")
+    parser.add_argument(
+        "--no-realtime", action="store_true",
+        help="Skip real-time pacing between chunks (faster but less realistic).",
+    )
+    parser.add_argument(
+        "--no-vac", action="store_true",
+        help="Disable Voice Activity Classification (send all audio without silence filtering).",
+    )
+    parser.add_argument(
+        "--diarization", action="store_true",
+        help="Enable speaker diarization.",
+    )
+    parser.add_argument(
+        "--benchmark", action="store_true",
+        help="Run benchmark across all detected backend+policy combinations.",
+    )
+    parser.add_argument(
+        "--json", default=None, dest="json_output",
+        help="Write structured JSON results to this file.",
+    )
+    parser.add_argument(
+        "--max-duration", type=float, default=60.0,
+        help="Skip audio files longer than this many seconds (default: 60).",
+    )
+    parser.add_argument(
+        "--insert-silence", nargs=2, type=float, metavar=("SECS", "AT_SEC"),
+        action="append", default=[],
+        help="Insert SECS of silence at AT_SEC position. Can be repeated. "
+             "E.g.: --insert-silence 3.0 2.0 --insert-silence 5.0 7.0",
+    )
+    parser.add_argument(
+        "-v", "--verbose", action="store_true",
+        help="Show debug-level logs from all components.",
+    )
+    args = parser.parse_args()
+
+    realtime = not args.no_realtime
+    vac = not args.no_vac
+
+    # Resolve audio file(s)
+    if args.audio:
+        audio_files = [Path(args.audio)]
+    elif args.audio_dir:
+        audio_files = discover_audio_files(args.audio_dir)
+    elif AUDIO_TESTS_DIR.is_dir():
+        audio_files = discover_audio_files(str(AUDIO_TESTS_DIR))
+    else:
+        # Fall back to jfk.wav download
+        audio_files = [download_sample_audio()]
+
+    if not audio_files:
+        logger.error("No audio files found.")
+        sys.exit(1)
+
+    logger.info(f"Audio files: {[f.name for f in audio_files]}")
+
+    if args.benchmark:
+        # --- Multi-backend benchmark mode ---
+        all_results = asyncio.run(
+            run_benchmark(
+                audio_files, args.chunk_ms, realtime,
+                args.model_size, args.lan, args.max_duration, vac,
+                args.verbose,
+            )
+        )
+        if all_results:
+            print_cross_backend_comparison(all_results)
+        results = all_results
+    else:
+        # --- Single-backend mode ---
+        policy = args.policy
+        logger.info(f"Creating {args.backend} engine...")
+        engine = create_engine(
+            args.backend, args.model_size, args.lan,
+            diarization=args.diarization, vac=vac, policy=policy,
+        )
+        logger.info("Engine ready.")
+
+        _quiet_loggers(args.verbose)
+
+        results = asyncio.run(
+            run_all_tests(
+                engine, audio_files, args.chunk_ms, realtime,
+                args.backend, policy, args.lan,
+                max_duration=args.max_duration,
+                silence_insertions=args.insert_silence or None,
+            )
+        )
+
+        if len(results) > 1:
+            print_benchmark_summary(results)
+
+    # JSON output
+    if args.json_output and results:
+        json_results = []
+        for r in results:
+            d = asdict(r)
+            d.pop("last_response", None)  # too verbose for summary
+            json_results.append(d)
+        Path(args.json_output).write_text(
+            json.dumps(json_results, indent=2, ensure_ascii=False)
+        )
+        logger.info(f"Results written to {args.json_output}")
+
+
+if __name__ == "__main__":
+    main()
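
Reviewer note on `--insert-silence`: `run_all_tests` applies the insertions in reverse position order so that every `AT_SEC` refers to the original, unshifted audio. The sketch below illustrates why the ordering matters. It is not part of the patch: it uses plain Python lists instead of NumPy arrays, a toy sample rate of 10 Hz, and the hypothetical helper names `insert_silence_list` and `apply_insertions`.

```python
SAMPLE_RATE = 10  # toy rate (assumption) so positions are easy to read


def insert_silence_list(samples, silence_sec, position_sec, rate=SAMPLE_RATE):
    """List-based stand-in for insert_silence: splice zeros in at position_sec."""
    pos = min(int(position_sec * rate), len(samples))
    return samples[:pos] + [0.0] * int(silence_sec * rate) + samples[pos:]


def apply_insertions(samples, insertions, reverse=True):
    """Apply (silence_sec, position_sec) pairs, latest position first by default."""
    for secs, at_sec in sorted(insertions, key=lambda x: x[1], reverse=reverse):
        samples = insert_silence_list(samples, secs, at_sec)
    return samples


audio = [float(i) for i in range(1, 31)]  # 3 s of non-zero "audio": 1.0 .. 30.0
insertions = [(0.5, 1.0), (0.2, 2.0)]     # 0.5 s gap at 1.0 s, 0.2 s gap at 2.0 s

good = apply_insertions(audio, insertions, reverse=True)
bad = apply_insertions(audio, insertions, reverse=False)

# Reverse order: the second gap follows original sample 20 (2.0 s of source audio).
print(good[24], good[25], good[27])
# Forward order: the first gap has shifted the audio, so the second gap lands
# after original sample 15 (1.5 s of source audio) instead.
print(bad[19], bad[20])
```

Sorting by position in descending order keeps each later insertion from shifting the positions of the ones still pending, which is what the inline comment "to keep offsets valid" refers to.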