docs: rewrite benchmark with base/small comparison, proper French results

- Re-ran all whisper benchmarks with --lan fr for the French file
  (previously ran with --lan en which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming saturates at 5.3% WER on base already,
  small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
Quentin Fuxa
2026-02-23 10:16:34 +01:00
parent 4b2377c243
commit c76b2ef2c6
2 changed files with 104 additions and 78 deletions


# WhisperLiveKit Benchmark Report
Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon.
All tests run through the full AudioProcessor pipeline (same code path as production WebSocket).
## Test Environment
| Python | 3.13 |
| faster-whisper | 1.2.1 |
| mlx-whisper | installed (via mlx) |
| Voxtral MLX | native MLX backend |
| Voxtral (HF) | transformers-based |
| VAC (Silero VAD) | enabled unless noted |
| Chunk size | 100 ms |
| Pacing | no-realtime (as fast as possible) |
|------|----------|----------|----------|-------------|
| `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses |
| `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps |
| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation |
Ground truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified.
---
## Results
### English -- Short (7.2 s, 1 speaker)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s |
| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s |
| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s |
| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s |
| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s |
| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s |
| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s |
| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s |
| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s |
### English -- Multi-speaker (30.0 s, 3 speakers)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s |
| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s |
| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s |
| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s |
| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s |
| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s |
| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s |
| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s |
| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s |
| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s |
<p align="center">
<img src="benchmark_chart.png" alt="Benchmark comparison on 30s English" width="800">
</p>
### French (16.3 s, 1 speaker, `--language fr`)
| Backend | Policy | Model | RTF | WER | Timestamp MAE |
|---------|--------|-------|-----|-----|---------------|
| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s |
| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s |
| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s |
| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s |
| mlx-whisper | LocalAgreement | base | 0.09x | ~45%* | ~5.0 s* |
| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s |
| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s |
| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s |
| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s |
| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s |
\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates 34-1037% across runs due to hallucination loops). The `small` model does not have this problem.
**Timestamp note:** The base model produces very high timestamp MAE (3.4-3.7s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22s MAE). Voxtral also drifts on the silence gaps.
---
## Model Size Comparison (base vs small)
| | base | small | Observation |
|--|------|-------|-------------|
| **RTF** | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower |
| **English WER (SS)** | 0-5.3% | 0-5.3% | No improvement: SimulStreaming already saturates on base |
| **English WER (LA)** | 21-44.7% | 21-25% | small reduces LA errors on longer audio |
| **French WER** | 25-40% | 25-43% | Mixed: depends on backend/policy combo |
| **French timestamps** | 3.4-5.0s MAE | 0.05-0.22s MAE | small is dramatically better for French timestamps |
In short: **base + SimulStreaming** gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages.
---
### Speed (RTF = processing time / audio duration, lower is better)
1. **mlx-whisper + LocalAgreement + base** is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds.
2. For **faster-whisper**, SimulStreaming is faster than LocalAgreement. For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed.
3. **voxtral-mlx** runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time.
4. **voxtral (HF transformers)** hits 1.0-1.3x RTF, right at the real-time boundary on Apple Silicon; use the MLX variant instead.
5. The **small** model is 2-3x slower than base across all backends.
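RTF as defined in the heading above is a simple ratio of wall-clock processing time to audio duration. A minimal sketch of the measurement, where `run_pipeline` is a placeholder callable standing in for a full transcription run (not an actual WhisperLiveKit API):

```python
import time

def real_time_factor(run_pipeline, audio_duration_s: float) -> float:
    """RTF = processing time / audio duration. Lower is better;
    anything below 1.0x keeps up with real time."""
    start = time.perf_counter()
    run_pipeline()
    elapsed = time.perf_counter() - start
    return elapsed / audio_duration_s

# Stand-in run that takes ~0.3 s of compute for a 30 s file
rtf = real_time_factor(lambda: time.sleep(0.3), 30.0)
print(f"{rtf:.2f}x")
```

Note that `time.perf_counter` is monotonic and high-resolution, which matters more than absolute accuracy when runs are short.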
### Accuracy (WER = Word Error Rate, lower is better)
1. **SimulStreaming** gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%.
2. **voxtral-mlx** hits 0% on short English and 9.2% on multi-speaker. It auto-detects language natively. Whisper also supports `--language auto`, but tends to bias towards English on short segments.
3. **LocalAgreement** tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run.
4. On **French** with the correct `--language fr`, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER.
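For reference, WER is the word-level edit distance (substitutions + insertions + deletions) divided by the reference word count. A minimal sketch of the standard computation (not necessarily the exact scorer this harness uses); insertions count against a fixed reference length, which is how hallucination loops can push WER far above 100%:

```python
def wer(ref: str, hyp: str) -> float:
    """Word Error Rate via word-level Levenshtein distance."""
    r, h = ref.split(), hyp.split()
    # DP table: d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

# One inserted word over a 3-word reference -> 33.3%
print(f"{wer('the cat sat', 'the cat sat sat'):.1%}")
```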
### Timestamps (MAE = Mean Absolute Error on word start times)
1. **LocalAgreement** gives the best timestamps on English (0.08-0.09s MAE).
2. **SimulStreaming** is less precise (0.22-0.40s MAE) but good enough for most applications.
3. On French with silence gaps, **base model timestamps are unreliable** (3.4-5s MAE). The **small model fixes this** (0.05-0.22s MAE). This is the strongest argument for using `small` over `base`.
4. **voxtral-mlx** has good timestamps on English (0.18-0.25s MAE) but drifts on audio with long silence gaps (3.4s MAE on the French file).
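Timestamp MAE is the mean absolute difference between predicted and ground-truth word start times. A minimal sketch, assuming hypothesis words have already been aligned one-to-one with the reference:

```python
def timestamp_mae(ref_starts, hyp_starts):
    """Mean absolute error on word start times, in seconds."""
    assert len(ref_starts) == len(hyp_starts)
    return sum(abs(r - h) for r, h in zip(ref_starts, hyp_starts)) / len(ref_starts)

# Errors of 0.1 s, 0.1 s, and 0.2 s -> MAE of roughly 0.133 s
mae = timestamp_mae([0.0, 0.5, 1.1], [0.1, 0.4, 1.3])
print(f"{mae:.3f} s")
```

In practice the alignment step (matching hypothesis words to reference words before comparing times) is where most of the subtlety lives.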
### VAC (Voice Activity Classification) Impact
| Backend | Policy | VAC | WER (7.2 s EN) | WER (30 s EN) |
|---------|--------|-----|----------------|---------------|
| voxtral-mlx | voxtral | on | 0.0% | 9.2% |
| voxtral-mlx | voxtral | off | 0.0% | 9.2% |
- **Whisper backends need VAC** to work in streaming mode. Without it the buffer logic breaks down and you get empty or garbage output.
- **Voxtral is unaffected by VAC** since it handles its own internal chunking; results are identical with or without it. VAC still saves compute on silent segments.
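The compute-saving effect of VAC can be pictured as chunk gating. An illustrative sketch only, not the project's actual wiring; `vad_is_speech` stands in for per-chunk decisions from a real VAD such as Silero:

```python
def gate_chunks(chunks, vad_is_speech):
    """Forward only the chunks the VAD flags as speech; silent chunks
    are dropped instead of growing the ASR buffer."""
    return [c for c, speech in zip(chunks, vad_is_speech) if speech]

# 6 x 100 ms chunks, two of them silent -> only 4 reach the backend
chunks = ["c0", "c1", "c2", "c3", "c4", "c5"]
flags = [True, True, False, True, False, True]
print(len(gate_chunks(chunks, flags)))  # 4
```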
---
## Recommendations
| Use Case | Backend | Policy | Model | Notes |
|----------|---------|--------|-------|-------|
| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER |
| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER |
| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast |
| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF |
| Best timestamps | any | LocalAgreement | small | 0.05-0.09s MAE, good for subtitles |
| Low memory / embedded | mlx-whisper | SimulStreaming | base | Smallest footprint, fastest response |
---
## Caveats
- **3 test files, ~53 seconds total.** Results give relative rankings between backends but should not be taken as definitive WER numbers. Run on your own data for production decisions.
- **RTF varies between runs** (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine.
- **Only base and small tested.** Medium and large-v3 would likely improve WER at the cost of higher RTF. We did not test them here because they are slow on Apple Silicon without GPU.
---
```bash
pip install -e ".[test]"
# Single backend test
python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime
# With a specific language
python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime
# Multi-backend auto-detect benchmark
python test_backend_offline.py --benchmark --no-realtime
# Export to JSON
python test_backend_offline.py --benchmark --no-realtime --json results.json
# Test with your own audio
python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime
```
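Once exported, the JSON results can be ranked programmatically. The field names below are an assumption for illustration; inspect your own `results.json` to see the schema the harness actually emits:

```python
import json

def rank_runs(results):
    """Sort benchmark runs by WER, breaking ties on RTF (both lower
    is better). Field names ('wer', 'rtf', etc.) are assumed."""
    return sorted(results, key=lambda r: (r.get("wer", 1.0), r.get("rtf", 1.0)))

# e.g. data = json.load(open("results.json"))
data = [
    {"backend": "mlx-whisper", "policy": "simulstreaming", "wer": 0.053, "rtf": 0.11},
    {"backend": "mlx-whisper", "policy": "localagreement", "wer": 0.237, "rtf": 0.06},
]
best = rank_runs(data)[0]
print(best["backend"], best["policy"])
```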
