docs: rewrite benchmark with base/small comparison, proper French results

- Re-ran all whisper benchmarks with --lan fr for the French file
  (previously ran with --lan en which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming saturates at 5.3% WER on base already,
  small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
Quentin Fuxa
2026-02-23 10:16:34 +01:00
parent 4b2377c243
commit c76b2ef2c6
2 changed files with 104 additions and 78 deletions


# WhisperLiveKit Benchmark Report
Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon.
All tests run through the full AudioProcessor pipeline (same code path as production WebSocket).
## Test Environment
| Python | 3.13 |
| faster-whisper | 1.2.1 |
| mlx-whisper | installed (via mlx) |
| Voxtral MLX | native MLX backend |
| Voxtral (HF) | transformers-based |
| VAC (Silero VAD) | enabled unless noted |
| Chunk size | 100 ms |
| Pacing | no-realtime (as fast as possible) |
|------|----------|----------|----------|-------------|
| `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses |
| `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps |
| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation |
Ground truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified.
---
## Results
### English -- Short (7.2 s, 1 speaker)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s |
| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s |
| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s |
| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s |
| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s |
| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s |
| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s |
| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s |
| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s |
### English -- Multi-speaker (30.0 s, 3 speakers)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s |
| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s |
| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s |
| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s |
| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s |
| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s |
| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s |
| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s |
| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s |
<p align="center">
<img src="benchmark_chart.png" alt="Benchmark comparison on 30s English" width="800">
</p>
### French (16.3 s, 1 speaker, `--language fr`)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s |
| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s |
| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s |
| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s |
| mlx-whisper | LocalAgreement | base | 0.09x | ~45%* | ~5.0 s* |
| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s |
| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s |
| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s |
| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s |
| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s |
\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates 34-1037% across runs due to hallucination loops). The `small` model does not have this problem.
**Timestamp note:** The base model produces very high timestamp MAE (3.4-3.7s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22s MAE). Voxtral also drifts on the silence gaps.
---
## Model Size Comparison (base vs small)
| | base | small | Observation |
|--|------|-------|-------------|
| **RTF** | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower |
| **English WER (SS)** | 0-5.3% | 0-5.3% | No improvement: SimulStreaming already saturates on base |
| **English WER (LA)** | 21-44.7% | 21-25% | small reduces LA errors on longer audio |
| **French WER** | 25-40% | 25-43% | Mixed: depends on backend/policy combo |
| **French timestamps** | 3.4-5.0s MAE | 0.05-0.22s MAE | small is dramatically better for French timestamps |
In short: **base + SimulStreaming** gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages.
---
### Speed (RTF = processing time / audio duration, lower is better)
1. **mlx-whisper + LocalAgreement + base** is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds.
2. For **faster-whisper**, SimulStreaming is faster than LocalAgreement. For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed.
3. **voxtral-mlx** runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time.
4. **voxtral (HF transformers)** hits 1.0-1.3x RTF, right at the real-time boundary on Apple Silicon; use the MLX variant instead.
5. The **small** model is 2-3x slower than base across all backends.
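RTF as defined in the heading above is a simple ratio of wall-clock processing time to audio duration. A minimal sketch of the measurement, where `run_pipeline` is a placeholder callable standing in for a full transcription run (not an actual WhisperLiveKit API):

```python
import time

def real_time_factor(run_pipeline, audio_duration_s: float) -> float:
    """RTF = processing time / audio duration. Lower is better;
    anything below 1.0x keeps up with real time."""
    start = time.perf_counter()
    run_pipeline()
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# Stand-in run that takes ~0.3 s of compute for a 30 s file
rtf = real_time_factor(lambda: time.sleep(0.3), 30.0)
print(f"{rtf:.2f}x")
```

Note that `time.perf_counter` is monotonic and high-resolution, which matters more than absolute accuracy when runs are short.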
### Accuracy (WER = Word Error Rate, lower is better)
1. **SimulStreaming** gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%.
2. **voxtral-mlx** hits 0% on short English and 9.2% on multi-speaker. It auto-detects language natively. Whisper also supports `--language auto`, but tends to bias towards English on short segments.
3. **LocalAgreement** tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run.
4. On **French** with the correct `--language fr`, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER.
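For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference word count. A minimal sketch of the standard computation (not necessarily the exact scorer this harness uses); insertions count against a fixed reference length, which is how hallucination loops can push WER far above 100%:

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# One inserted word over a 3-word reference -> 33.3%
print(f"{wer('the cat sat', 'the cat sat sat'):.1%}")
```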
### Timestamps (MAE = Mean Absolute Error on word start times)
1. **LocalAgreement** gives the best timestamps on English (0.08-0.09s MAE).
2. **SimulStreaming** is less precise (0.22-0.40s MAE) but good enough for most applications.
3. On French with silence gaps, **base model timestamps are unreliable** (3.4-5s MAE). The **small model fixes this** (0.05-0.22s MAE). This is the strongest argument for using `small` over `base`.
4. **voxtral-mlx** has good timestamps on English (0.18-0.25s MAE) but drifts on audio with long silence gaps (3.4s MAE on the French file).
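Timestamp MAE is the mean absolute difference between predicted and ground-truth word start times. A minimal sketch, assuming hypothesis words have already been aligned one-to-one with the reference:

```python
def timestamp_mae(ref_starts, hyp_starts):
    """Mean absolute error on word start times, in seconds."""
    assert len(ref_starts) == len(hyp_starts)
    return sum(abs(r - h) for r, h in zip(ref_starts, hyp_starts)) / len(ref_starts)

# Errors of 0.1 s, 0.1 s, and 0.2 s -> MAE of roughly 0.133 s
mae = timestamp_mae([0.0, 0.5, 1.1], [0.1, 0.4, 1.3])
print(f"{mae:.3f} s")
```

In practice the alignment step (matching hypothesis words to reference words before comparing times) is where most of the subtlety lives.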
### VAC (Voice Activity Classification) Impact
| Backend | Policy | VAC | WER (7.2 s EN) | WER (30 s EN) |
|---------|--------|-----|----------------|---------------|
| voxtral-mlx | voxtral | on | 0.0% | 9.2% |
| voxtral-mlx | voxtral | off | 0.0% | 9.2% |
- **Whisper backends need VAC** to work in streaming mode. Without it the buffer logic breaks down and you get empty or garbage output.
- **Voxtral is unaffected by VAC** since it handles its own internal chunking; results are identical with or without it. VAC still saves compute on silent segments.
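The compute-saving effect of VAC can be pictured as chunk gating. An illustrative sketch only, not the project's actual wiring; `vad_is_speech` stands in for per-chunk decisions from a real VAD such as Silero:

```python
def gate_chunks(chunks, vad_is_speech):
    """Forward only the chunks the VAD flags as speech; silent chunks
    are dropped instead of growing the ASR buffer."""
    return [c for c, speech in zip(chunks, vad_is_speech) if speech]

# 6 x 100 ms chunks, two of them silent -> only 4 reach the backend
chunks = ["c0", "c1", "c2", "c3", "c4", "c5"]
flags = [True, True, False, True, False, True]
print(len(gate_chunks(chunks, flags)))  # 4
```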
---
## Recommendations
| Use Case | Backend | Policy | Model | Notes |
|----------|---------|--------|-------|-------|
| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER |
| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER |
| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast |
| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF |
| Best timestamps | any | LocalAgreement | small | 0.05-0.09s MAE, good for subtitles |
| Low memory / embedded | mlx-whisper | SimulStreaming | base | Smallest footprint, fastest response |
---
## Caveats
- **3 test files, ~53 seconds total.** Results give relative rankings between backends but should not be taken as definitive WER numbers. Run on your own data for production decisions.
- **RTF varies between runs** (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine.
- **Only base and small tested.** Medium and large-v3 would likely improve WER at the cost of higher RTF. We did not test them here because they are slow on Apple Silicon without GPU.
---
```bash
pip install -e ".[test]"
# Single backend test
python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime
# With a specific language
python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime
# Multi-backend auto-detect benchmark
python test_backend_offline.py --benchmark --no-realtime
# Export to JSON
python test_backend_offline.py --benchmark --no-realtime --json results.json
# Test with your own audio
python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime
```
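Once exported, the JSON results can be ranked programmatically. The field names below are an assumption for illustration; inspect your own `results.json` to see the schema the harness actually emits:

```python
import json

def rank_runs(results):
    """Sort benchmark runs by WER, breaking ties on RTF (both lower
    is better). Field names ('wer', 'rtf', etc.) are assumed."""
    return sorted(results, key=lambda r: (r.get("wer", 1.0), r.get("rtf", 1.0)))

# e.g. data = json.load(open("results.json"))
data = [
    {"backend": "mlx-whisper", "policy": "simulstreaming", "wer": 0.053, "rtf": 0.11},
    {"backend": "mlx-whisper", "policy": "localagreement", "wer": 0.237, "rtf": 0.06},
]
best = rank_runs(data)[0]
print(best["backend"], best["policy"])
```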
