# WhisperLiveKit Benchmark Report

Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon. All tests run through the full AudioProcessor pipeline, i.e. the same code path as the production WebSocket endpoint.

## Test Environment

| Property | Value |
|---|---|
| Hardware | Apple M4, 32 GB RAM |
| OS | macOS 25.3.0 (arm64) |
| Python | 3.13 |
| faster-whisper | 1.2.1 |
| mlx-whisper | installed (via mlx) |
| Voxtral MLX | native MLX backend |
| Voxtral (HF) | transformers-based |
| VAC (Silero VAD) | enabled unless noted |
| Chunk size | 100 ms |
| Pacing | no-realtime (as fast as possible) |
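The 100 ms chunk size means that at Whisper's 16 kHz input rate each chunk carries 1,600 samples. A minimal sketch of that chunking (the function name and constants are illustrative, not the actual AudioProcessor code):

```python
# Illustrative sketch of 100 ms chunking at 16 kHz. This is NOT the real
# AudioProcessor code; names and constants are assumptions for this example.
SAMPLE_RATE = 16_000                              # Hz, Whisper's input rate
CHUNK_MS = 100                                    # chunk duration in the benchmark
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000    # 1600 samples per chunk

def iter_chunks(samples):
    """Yield successive 100 ms chunks; the final chunk may be shorter."""
    for start in range(0, len(samples), CHUNK_SAMPLES):
        yield samples[start:start + CHUNK_SAMPLES]

# 7.2 s of silence as a stand-in waveform, matching the shortest test file
audio = [0.0] * int(7.2 * SAMPLE_RATE)
chunks = list(iter_chunks(audio))
print(len(chunks))  # 72 chunks of 100 ms each
```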

## Audio Test Files

| File | Duration | Language | Speakers | Description |
|---|---|---|---|---|
| 00_00_07_english_1_speaker.wav | 7.2 s | English | 1 | Short dictation with pauses |
| 00_00_16_french_1_speaker.wav | 16.3 s | French | 1 | French speech with intentional silence gaps |
| 00_00_30_english_3_speakers.wav | 30.0 s | English | 3 | Multi-speaker conversation |

Ground-truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified.


## Results

### English -- Short (7.2 s, 1 speaker)

| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---|---|---|---|---|---|
| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s |
| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s |
| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s |
| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s |
| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s |
| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s |
| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s |
| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s |
| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s |

### English -- Multi-speaker (30.0 s, 3 speakers)

| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---|---|---|---|---|---|
| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s |
| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s |
| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s |
| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s |
| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s |
| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s |
| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s |
| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s |
| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s |

*[Charts omitted: benchmark comparison on the 30 s English file; speed vs. accuracy tradeoff]*

### French (16.3 s, 1 speaker, `--language fr`)

| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---|---|---|---|---|---|
| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s |
| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s |
| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s |
| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s |
| mlx-whisper | LocalAgreement | base | 0.09x | ~45%\* | ~5.0 s\* |
| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s |
| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s |
| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s |
| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s |
| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s |

\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates between 34% and 1037% across runs due to hallucination loops). The small model does not have this problem.

**Timestamp note:** the base model produces very high timestamp MAE (3.4-3.7 s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22 s MAE). Voxtral also drifts on the silence gaps.


## Model Size Comparison (base vs small)

| Metric | base | small | Observation |
|---|---|---|---|
| RTF | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower |
| English WER (SimulStreaming) | 0-5.3% | 0-5.3% | no improvement: SimulStreaming already saturates on base |
| English WER (LocalAgreement) | 21-44.7% | 21-25% | small reduces LocalAgreement errors on longer audio |
| French WER | 25-40% | 25-43% | mixed: depends on backend/policy combo |
| French timestamp MAE | 3.4-5.0 s | 0.05-0.22 s | small is dramatically better for French timestamps |

In short: base + SimulStreaming gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages.


## Key Findings

### Speed (RTF = processing time / audio duration; lower is better)

  1. mlx-whisper + LocalAgreement + base is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds.
  2. For faster-whisper, SimulStreaming is faster than LocalAgreement. For mlx-whisper, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed.
  3. voxtral-mlx runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time.
  4. voxtral (HF transformers) runs at 1.0-1.3x RTF, right at the real-time boundary on Apple Silicon. Use the MLX variant instead.
  5. The small model is 2-3x slower than base across all backends.
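As a reminder, RTF here is simply wall-clock processing time divided by audio duration, so 30 s of audio processed in 1.8 s gives 0.06x. A one-liner sketch (the function name is illustrative):

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: processing time / audio duration (lower is better).
    Values below 1.0 mean faster than real time."""
    return processing_seconds / audio_seconds

# e.g. transcribing a 30.0 s file in 1.8 s of wall-clock time:
print(f"{rtf(1.8, 30.0):.2f}x")  # 0.06x
```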

### Accuracy (WER = Word Error Rate; lower is better)

  1. SimulStreaming gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%.
  2. voxtral-mlx hits 0% WER on short English and 9.2% on multi-speaker audio. It auto-detects language natively; Whisper also supports `--language auto` but tends to bias towards English on short segments.
  3. LocalAgreement tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run.
  4. On French with the correct --language fr, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER.
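For reference, WER is conventionally computed as word-level Levenshtein distance divided by reference length. The harness's exact implementation is not shown in this report; a standard formulation looks like:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + deletions + insertions) / reference
    word count, via word-level Levenshtein distance. Can exceed 100% when the
    hypothesis inserts many extra words (e.g. hallucination loops)."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(f"{wer('the cat sat on the mat', 'the cat sat on mat'):.1%}")  # 16.7%
```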

### Timestamps (MAE = Mean Absolute Error on word start times)

  1. LocalAgreement gives the best timestamps on English (0.08-0.09s MAE).
  2. SimulStreaming is less precise (0.22-0.40s MAE) but good enough for most applications.
  3. On French with silence gaps, base model timestamps are unreliable (3.4-5s MAE). The small model fixes this (0.05-0.22s MAE). This is the strongest argument for using small over base.
  4. voxtral-mlx has good timestamps on English (0.18-0.25s MAE) but drifts on audio with long silence gaps (3.4s MAE on the French file).
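Timestamp MAE as reported above is the mean absolute difference between ground-truth and predicted word start times. A minimal sketch, assuming the words have already been aligned one-to-one (the real harness must also cope with inserted and deleted words):

```python
def timestamp_mae(ref_starts, hyp_starts):
    """Mean absolute error between matched word start times, in seconds.
    Assumes a one-to-one alignment between ground-truth and predicted words;
    alignment itself is out of scope for this sketch."""
    errors = [abs(r - h) for r, h in zip(ref_starts, hyp_starts)]
    return sum(errors) / len(errors)

# ground-truth vs. predicted start times for four words
ref = [0.00, 0.42, 0.81, 1.30]
hyp = [0.05, 0.50, 0.78, 1.42]
print(f"{timestamp_mae(ref, hyp):.3f} s")  # 0.070 s
```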

## VAC (Voice Activity Classification) Impact

| Backend | Policy | VAC | 7s English WER | 30s English WER |
|---|---|---|---|---|
| faster-whisper | LocalAgreement | on | 21.1% | 44.7% |
| faster-whisper | LocalAgreement | off | 100.0% | 100.0% |
| voxtral-mlx | voxtral | on | 0.0% | 9.2% |
| voxtral-mlx | voxtral | off | 0.0% | 9.2% |
- Whisper backends need VAC to work in streaming mode; without it the buffer logic breaks down and you get empty or garbage output.
- Voxtral is unaffected by VAC since it handles its own internal chunking: results are identical with or without it. VAC still saves compute on silent segments.
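Conceptually, VAC sits in front of the ASR buffer and forwards only chunks classified as speech. The real pipeline uses the Silero VAD neural model; the sketch below substitutes a crude energy threshold purely to illustrate the gating, and all names (`is_speech`, `gate`) are illustrative, not WhisperLiveKit's API:

```python
# Conceptual illustration of VAC-style gating: only chunks classified as
# speech reach the ASR buffer. The real pipeline uses Silero VAD (a neural
# model); the energy threshold below is a simplified stand-in for it.
def is_speech(chunk, threshold=0.01):
    """Crude energy gate: mean squared amplitude above a fixed threshold."""
    energy = sum(s * s for s in chunk) / len(chunk)
    return energy > threshold

def gate(chunks):
    """Forward only speech chunks; silent chunks are dropped, saving compute."""
    return [c for c in chunks if is_speech(c)]

silence = [0.0] * 1600          # one 100 ms chunk of silence at 16 kHz
speech = [0.3] * 1600           # stand-in for a voiced chunk
kept = gate([silence, speech, silence])
print(len(kept))  # 1
```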

## Recommendations

| Use Case | Backend | Policy | Model | Notes |
|---|---|---|---|---|
| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER |
| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER |
| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast |
| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF |
| Best timestamps | any | LocalAgreement | small | 0.05-0.09 s MAE, good for subtitles |
| Low memory / embedded | mlx-whisper | SimulStreaming | base | smallest footprint, fastest response |

## Caveats

- Only 3 test files, ~53 seconds of audio in total. The results give relative rankings between backends but should not be taken as definitive WER numbers; run on your own data before making production decisions.
- RTF varies between runs (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine.
- Only base and small were tested. medium and large-v3 would likely improve WER at the cost of higher RTF; we did not test them here because they are slow on Apple Silicon without a dedicated GPU.

## Reproducing These Benchmarks

```bash
# Install test dependencies
pip install -e ".[test]"

# Single backend test
python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime

# With a specific language
python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime

# Multi-backend auto-detect benchmark
python test_backend_offline.py --benchmark --no-realtime

# Export to JSON
python test_backend_offline.py --benchmark --no-realtime --json results.json

# Test with your own audio
python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime
```

The benchmark harness computes WER and timestamp accuracy automatically when ground-truth `.transcript.json` files exist alongside the audio files. See `audio_tests/` for the format.
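Once exported, the JSON results can be post-processed with a few lines of Python. This is a hedged sketch: the actual schema of the harness's `--json` export is not documented in this report, so the keys below (`backend`, `policy`, `model`, `rtf`, `wer`) are assumptions for illustration.

```python
import json

# Hypothetical results.json entries -- the real export schema may differ;
# the keys used here are assumptions, not the harness's documented format.
results = json.loads("""[
  {"backend": "mlx-whisper", "policy": "simulstreaming", "model": "base",
   "rtf": 0.11, "wer": 0.053},
  {"backend": "faster-whisper", "policy": "simulstreaming", "model": "base",
   "rtf": 0.10, "wer": 0.053}
]""")

# Rank combinations by RTF (lower is better), breaking ties on WER
for row in sorted(results, key=lambda r: (r["rtf"], r["wer"])):
    print(f'{row["backend"]:15s} {row["policy"]:15s} {row["rtf"]:.2f}x  {row["wer"]:.1%}')
```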


## Help Us Benchmark on More Hardware

These results are from a single Apple M4 machine. We'd love to see numbers from other setups: Linux with CUDA GPUs, older Macs, different CPU architectures, cloud instances, etc.

If you run the benchmark on your hardware, please open an issue or PR with your results and we will add them here. The more data points we have, the better the recommendations get.

What we are especially interested in:

- NVIDIA GPUs (RTX 3090, 4090, A100, T4, etc.) with faster-whisper
- Older Apple Silicon (M1, M2, M3) with mlx-whisper and voxtral-mlx
- medium and large-v3 models (we only tested base and small so far)
- Longer audio files or domain-specific audio (medical, legal, call center)
- Other languages beyond English and French