mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git synced 2026-03-07 14:23:18 +00:00

Files

Quentin Fuxa 83d0fa3fac feat: benchmark suite with WER, timestamp accuracy, cross-backend comparison

- Extend test_backend_offline.py with WER and timestamp accuracy metrics
  computed via whisperlivekit.metrics against ground truth transcripts.
- Add --benchmark flag to auto-detect all installed backends and run
  each (backend, policy) combination in sequence.
- Add --policy flag to override the streaming policy.
- Add detect_available_backends() probing faster-whisper, mlx-whisper,
  voxtral-mlx, voxtral (HF), and openai-whisper.
- Add print_cross_backend_comparison() with per-combo averages.
- Add run_benchmark.py for comprehensive multi-model benchmarking.
- Add BENCHMARK.md with full results on Apple M4: speed, WER,
  timestamp accuracy, VAC impact, and recommendations.
- Add ground truth transcript JSON files for all audio test files.

2026-02-22 23:27:50 +01:00

7.2 KiB

Raw Blame History

WhisperLiveKit Benchmark Report

Benchmark comparing all supported ASR backends and streaming policies on Apple Silicon, using the full AudioProcessor pipeline (the same path audio takes in production via WebSocket).

Test Environment

Property	Value
Hardware	Apple M4, 32 GB RAM
OS	macOS 25.3.0 (arm64)
Python	3.13
faster-whisper	1.2.1
mlx-whisper	installed (via mlx)
Voxtral (HF)	transformers-based
Voxtral MLX	native MLX backend
Model size	`base` (default for whisper backends)
VAC (Silero VAD)	enabled unless noted
Chunk size	100 ms
Pacing	no-realtime (as fast as possible)

Audio Test Files

File	Duration	Language	Speakers	Description
`00_00_07_english_1_speaker.wav`	7.2 s	English	1	Short dictation with pauses
`00_00_16_french_1_speaker.wav`	16.3 s	French	1	French speech with intentional silence gaps
`00_00_30_english_3_speakers.wav`	30.0 s	English	3	Multi-speaker conversation about transcription

All files have hand-verified ground truth transcripts (.transcript.json) with per-word timestamps.

Results Overview

English - Short (7.2 s, 1 speaker)

Backend	Policy	RTF	WER	Timestamp MAE
faster-whisper	LocalAgreement	0.20x	21.1%	0.080 s
faster-whisper	SimulStreaming	0.14x	0.0%	0.239 s
mlx-whisper	LocalAgreement	0.05x	21.1%	0.080 s
mlx-whisper	SimulStreaming	0.14x	10.5%	0.245 s
voxtral-mlx	voxtral	0.32x	0.0%	0.254 s
voxtral (HF)	voxtral	1.29x	0.0%	1.876 s

French (16.3 s, 1 speaker)

Backend	Policy	RTF	WER	Timestamp MAE
faster-whisper	LocalAgreement	0.20x	120.0%	0.540 s
faster-whisper	SimulStreaming	0.10x	100.0%	0.120 s
mlx-whisper	LocalAgreement	0.31x	1737.1%	0.060 s
mlx-whisper	SimulStreaming	0.08x	94.3%	0.120 s
voxtral-mlx	voxtral	0.18x	37.1%	3.422 s
voxtral (HF)	voxtral	0.63x	28.6%	4.040 s

Note: The whisper-based backends were run with --lan en, so they attempted to transcribe French audio in English. This is expected to produce high WER. For a fair comparison, the whisper backends should be run with --lan fr or --lan auto. The Voxtral backends auto-detect language.

English - Multi-speaker (30.0 s, 3 speakers)

Backend	Policy	RTF	WER	Timestamp MAE
faster-whisper	LocalAgreement	0.24x	44.7%	0.235 s
faster-whisper	SimulStreaming	0.10x	5.3%	0.398 s
mlx-whisper	LocalAgreement	0.06x	23.7%	0.237 s
mlx-whisper	SimulStreaming	0.11x	5.3%	0.395 s
voxtral-mlx	voxtral	0.31x	9.2%	0.176 s
voxtral (HF)	voxtral	1.00x	32.9%	1.034 s

Key Findings

Speed (RTF = processing time / audio duration, lower is better)

mlx-whisper + LocalAgreement is the fastest combo on Apple Silicon, reaching 0.05-0.06x RTF on English audio. This means 30 seconds of audio is processed in under 2 seconds.
SimulStreaming is consistently faster than LocalAgreement for faster-whisper, but comparable for mlx-whisper.
voxtral-mlx runs at 0.18-0.32x RTF, roughly 3-5x slower than mlx-whisper but well within real-time requirements.
voxtral (HF transformers) is the slowest, hitting 1.0-1.3x RTF. On longer audio, it risks falling behind real-time. On Apple Silicon, the MLX variant is strongly preferred.

Accuracy (WER = Word Error Rate, lower is better)

SimulStreaming produces significantly better WER than LocalAgreement for whisper backends. On the 30s English file: 5.3% vs 23.7-44.7%.
voxtral-mlx achieves strong accuracy (0% on short English, 9.2% on multi-speaker) and is the only backend that auto-detects language, making it the best choice for multilingual use.
LocalAgreement tends to duplicate the last sentence, inflating WER. This is a known artifact of the LCP (Longest Common Prefix) commit strategy at end-of-stream.
Voxtral backends handle French natively with 28-37% WER, while whisper backends attempted English transcription of French audio (not a fair comparison for French).

Timestamp Accuracy (MAE = Mean Absolute Error on word start times, lower is better)

LocalAgreement produces the most accurate timestamps (0.08s MAE on English), since it processes overlapping audio windows and validates via prefix matching.
SimulStreaming timestamps are slightly less precise (0.24-0.40s MAE) but still usable for most applications.
voxtral-mlx achieves excellent timestamps on English (0.18-0.25s MAE) but can drift on audio with long silence gaps (3.4s MAE on the French file with 4-second pauses).
voxtral (HF) has the worst timestamp accuracy (1.0-4.0s MAE), likely due to the additional overhead of the transformers pipeline.

VAC (Voice Activity Classification) Impact

Backend	Policy	VAC	7s English WER	30s English WER
faster-whisper	LocalAgreement	on	21.1%	44.7%
faster-whisper	LocalAgreement	off	100.0%	100.0%
voxtral-mlx	voxtral	on	0.0%	9.2%
voxtral-mlx	voxtral	off	0.0%	9.2%

Whisper backends require VAC to function in streaming mode. Without it, the entire audio is buffered as a single chunk and the LocalAgreement/SimulStreaming buffer logic breaks down.
Voxtral backends are VAC-independent because they handle their own internal chunking and produce identical results with or without VAC. VAC still reduces wasted compute on silence.

Recommendations

Use Case	Recommended Backend	Policy	Notes
Fastest English transcription (Apple Silicon)	mlx-whisper	SimulStreaming	0.08-0.14x RTF, 5-10% WER
Fastest English transcription (Linux/GPU)	faster-whisper	SimulStreaming	0.10-0.14x RTF, 0-5% WER
Multilingual / auto-detect (Apple Silicon)	voxtral-mlx	voxtral	Handles 100+ languages, 0.18-0.32x RTF
Multilingual / auto-detect (Linux/GPU)	voxtral (HF)	voxtral	Same model, slower on CPU, needs GPU
Best timestamp accuracy	faster-whisper	LocalAgreement	0.08s MAE, good for subtitle alignment
Low latency, low memory	mlx-whisper (tiny)	SimulStreaming	Smallest footprint, fastest response

Reproducing These Benchmarks

# Install test dependencies
pip install -e ".[test]"

# Single backend test
python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime

# Multi-backend auto-detect benchmark
python test_backend_offline.py --benchmark --no-realtime

# Export to JSON for programmatic analysis
python test_backend_offline.py --benchmark --no-realtime --json results.json

# Test with custom audio
python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime

The benchmark harness computes WER and timestamp accuracy automatically when ground truth .transcript.json files exist alongside the audio files. See audio_tests/ for the format.

7.2 KiB Raw Blame History