Mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git
Synced 2026-03-21 16:40:35 +00:00
docs: rewrite benchmark with base/small comparison, proper French results
- Re-ran all whisper benchmarks with `--lan fr` for the French file (previously ran with `--lan en`, which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming saturates at 5.3% WER on base already; the small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
This commit is contained in:

182 BENCHMARK.md
@@ -1,7 +1,7 @@
 # WhisperLiveKit Benchmark Report
 
-Benchmark comparing all supported ASR backends and streaming policies on Apple Silicon,
-using the full AudioProcessor pipeline (the same path audio takes in production via WebSocket).
+Benchmark comparing all supported ASR backends, streaming policies, and model sizes on Apple Silicon.
+All tests run through the full AudioProcessor pipeline (same code path as production WebSocket).
 
 ## Test Environment
 
@@ -12,9 +12,8 @@ using the full AudioProcessor pipeline (the same path audio takes in production
 | Python | 3.13 |
 | faster-whisper | 1.2.1 |
 | mlx-whisper | installed (via mlx) |
+| Voxtral (HF) | transformers-based |
 | Voxtral MLX | native MLX backend |
-| Model size | `base` (default for whisper backends) |
-| Voxtral (HF) | transformers-based |
 | VAC (Silero VAD) | enabled unless noted |
 | Chunk size | 100 ms |
 | Pacing | no-realtime (as fast as possible) |
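The 100 ms chunk size above maps directly to a fixed number of bytes per message. A quick sanity check, assuming 16 kHz mono 16-bit PCM input (an assumption; the table does not state the sample format):

```python
SAMPLE_RATE = 16_000      # Hz -- assumed ASR input rate, not stated in the table
CHUNK_MS = 100            # chunk size from the test environment
BYTES_PER_SAMPLE = 2      # 16-bit PCM

samples_per_chunk = SAMPLE_RATE * CHUNK_MS // 1000
chunk_bytes = samples_per_chunk * BYTES_PER_SAMPLE
print(samples_per_chunk, chunk_bytes)  # 1600 3200
```

So each 100 ms chunk is 3200 bytes under these assumptions; a different sample rate or bit depth scales this linearly.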
@@ -25,50 +24,80 @@ using the full AudioProcessor pipeline (the same path audio takes in production
 |------|----------|----------|----------|-------------|
 | `00_00_07_english_1_speaker.wav` | 7.2 s | English | 1 | Short dictation with pauses |
 | `00_00_16_french_1_speaker.wav` | 16.3 s | French | 1 | French speech with intentional silence gaps |
-| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation about transcription |
+| `00_00_30_english_3_speakers.wav` | 30.0 s | English | 3 | Multi-speaker conversation |
 
-All files have hand-verified ground truth transcripts (`.transcript.json`) with per-word timestamps.
+Ground truth transcripts (`.transcript.json`) with per-word timestamps are hand-verified.
 
 ---
 
-## Results Overview
+## Results
 
-### English - Short (7.2 s, 1 speaker)
+### English -- Short (7.2 s, 1 speaker)
 
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.20x | 21.1% | 0.080 s |
-| faster-whisper | SimulStreaming | 0.14x | 0.0% | 0.239 s |
-| mlx-whisper | LocalAgreement | 0.05x | 21.1% | 0.080 s |
-| mlx-whisper | SimulStreaming | 0.14x | 10.5% | 0.245 s |
-| voxtral-mlx | voxtral | 0.32x | 0.0% | 0.254 s |
-| voxtral (HF) | voxtral | 1.29x | 0.0% | 1.876 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.20x | 21.1% | 0.080 s |
+| faster-whisper | SimulStreaming | base | 0.14x | 0.0% | 0.239 s |
+| faster-whisper | LocalAgreement | small | 0.59x | 21.1% | 0.089 s |
+| faster-whisper | SimulStreaming | small | 0.39x | 0.0% | 0.221 s |
+| mlx-whisper | LocalAgreement | base | 0.05x | 21.1% | 0.080 s |
+| mlx-whisper | SimulStreaming | base | 0.14x | 10.5% | 0.245 s |
+| mlx-whisper | LocalAgreement | small | 0.16x | 21.1% | 0.089 s |
+| mlx-whisper | SimulStreaming | small | 0.20x | 10.5% | 0.226 s |
+| voxtral-mlx | voxtral | 4B | 0.32x | 0.0% | 0.254 s |
+| voxtral (HF) | voxtral | 4B | 1.29x | 0.0% | 1.876 s |
 
-### French (16.3 s, 1 speaker)
+### English -- Multi-speaker (30.0 s, 3 speakers)
 
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.20x | 120.0% | 0.540 s |
-| faster-whisper | SimulStreaming | 0.10x | 100.0% | 0.120 s |
-| mlx-whisper | LocalAgreement | 0.31x | 1737.1% | 0.060 s |
-| mlx-whisper | SimulStreaming | 0.08x | 94.3% | 0.120 s |
-| voxtral-mlx | voxtral | 0.18x | 37.1% | 3.422 s |
-| voxtral (HF) | voxtral | 0.63x | 28.6% | 4.040 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.24x | 44.7% | 0.235 s |
+| faster-whisper | SimulStreaming | base | 0.10x | 5.3% | 0.398 s |
+| faster-whisper | LocalAgreement | small | 0.59x | 25.0% | 0.226 s |
+| faster-whisper | SimulStreaming | small | 0.26x | 5.3% | 0.387 s |
+| mlx-whisper | LocalAgreement | base | 0.06x | 23.7% | 0.237 s |
+| mlx-whisper | SimulStreaming | base | 0.11x | 5.3% | 0.395 s |
+| mlx-whisper | LocalAgreement | small | 0.13x | 25.0% | 0.226 s |
+| mlx-whisper | SimulStreaming | small | 0.20x | 5.3% | 0.394 s |
+| voxtral-mlx | voxtral | 4B | 0.31x | 9.2% | 0.176 s |
+| voxtral (HF) | voxtral | 4B | 1.00x | 32.9% | 1.034 s |
 
-Note: The whisper-based backends were run with `--lan en`, so they attempted to transcribe French
-audio in English. This is expected to produce high WER. For a fair comparison, the whisper backends
-should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect language.
+<p align="center">
+  <img src="benchmark_chart.png" alt="Benchmark comparison on 30s English" width="800">
+</p>
 
-### English - Multi-speaker (30.0 s, 3 speakers)
+### French (16.3 s, 1 speaker, `--language fr`)
 
-| Backend | Policy | RTF | WER | Timestamp MAE |
-|---------|--------|-----|-----|---------------|
-| faster-whisper | LocalAgreement | 0.24x | 44.7% | 0.235 s |
-| faster-whisper | SimulStreaming | 0.10x | 5.3% | 0.398 s |
-| mlx-whisper | LocalAgreement | 0.06x | 23.7% | 0.237 s |
-| mlx-whisper | SimulStreaming | 0.11x | 5.3% | 0.395 s |
-| voxtral-mlx | voxtral | 0.31x | 9.2% | 0.176 s |
-| voxtral (HF) | voxtral | 1.00x | 32.9% | 1.034 s |
+| Backend | Policy | Model | RTF | WER | Timestamp MAE |
+|---------|--------|-------|-----|-----|---------------|
+| faster-whisper | LocalAgreement | base | 0.22x | 25.7% | 3.460 s |
+| faster-whisper | SimulStreaming | base | 0.10x | 31.4% | 3.660 s |
+| faster-whisper | LocalAgreement | small | 0.76x | 42.9% | 0.051 s |
+| faster-whisper | SimulStreaming | small | 0.29x | 25.7% | 0.219 s |
+| mlx-whisper | LocalAgreement | base | 0.09x | ~45%* | ~5.0 s* |
+| mlx-whisper | SimulStreaming | base | 0.09x | 40.0% | 3.540 s |
+| mlx-whisper | LocalAgreement | small | 0.14x | 25.7% | 0.083 s |
+| mlx-whisper | SimulStreaming | small | 0.17x | 31.4% | 0.203 s |
+| voxtral-mlx | voxtral | 4B | 0.18x | 37.1% | 3.422 s |
+| voxtral (HF) | voxtral | 4B | 0.63x | 28.6% | 4.040 s |
 
+\* mlx-whisper + LocalAgreement + base is unstable on this French file (WER fluctuates 34-1037% across runs due to hallucination loops). The `small` model does not have this problem.
 
+**Timestamp note:** The base model produces very high timestamp MAE (3.4-3.7 s) on this French file because it misaligns words around the silence gaps. The small model handles this much better (0.05-0.22 s MAE). Voxtral also drifts on the silence gaps.
 
 ---
 
+## Model Size Comparison (base vs small)
+
+| | base | small | Observation |
+|--|------|-------|-------------|
+| **RTF** | 0.05-0.24x | 0.13-0.76x | small is 2-3x slower |
+| **English WER (SS)** | 0-5.3% | 0-5.3% | No improvement: SimulStreaming already saturates on base |
+| **English WER (LA)** | 21-44.7% | 21-25% | small reduces LA errors on longer audio |
+| **French WER** | 25-40% | 25-43% | Mixed: depends on backend/policy combo |
+| **French timestamps** | 3.4-5.0 s MAE | 0.05-0.22 s MAE | small is dramatically better for French timestamps |
+
+In short: **base + SimulStreaming** gives the best speed/accuracy tradeoff for English. The small model only helps if you need LocalAgreement (for subtitle-grade timestamps) or non-English languages.
 
+---
 
@@ -76,37 +105,25 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
 
 ### Speed (RTF = processing time / audio duration, lower is better)
 
-1. **mlx-whisper + LocalAgreement** is the fastest combo on Apple Silicon, reaching 0.05-0.06x RTF
-   on English audio. 30 seconds of audio processed in under 2 seconds.
-2. For **faster-whisper**, SimulStreaming is consistently faster than LocalAgreement.
-   For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) is faster than SimulStreaming (0.11-0.14x).
-3. **voxtral-mlx** runs at 0.18-0.32x RTF, roughly 3-5x slower than mlx-whisper but well within
-   real-time requirements.
-4. **voxtral (HF transformers)** is the slowest at 1.0-1.3x RTF. On longer audio it risks
-   falling behind real-time. On Apple Silicon, the MLX variant is strongly preferred.
+1. **mlx-whisper + LocalAgreement + base** is the fastest combo on Apple Silicon: 0.05-0.06x RTF on English. 30 seconds of audio in under 2 seconds.
+2. For **faster-whisper**, SimulStreaming is faster than LocalAgreement. For **mlx-whisper**, it is the opposite: LocalAgreement (0.05-0.06x) outperforms SimulStreaming (0.11-0.14x) on speed.
+3. **voxtral-mlx** runs at 0.18-0.32x RTF -- 3-5x slower than mlx-whisper base, but well within real-time.
+4. **voxtral (HF transformers)** hits 1.0-1.3x RTF, right at the real-time boundary on Apple Silicon. Use the MLX variant instead.
+5. The **small** model is 2-3x slower than base across all backends.
 
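The RTF numbers above can be sketched as a two-line helper; anything below 1.0x keeps up with real time (illustrative values, not from the benchmark script):

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = wall-clock processing time / audio duration; < 1.0 means faster than real time."""
    return processing_s / audio_s

# e.g. 30 s of audio processed in 3.3 s of wall-clock time
rtf = real_time_factor(3.3, 30.0)
print(f"{rtf:.2f}x")  # 0.11x
```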
 ### Accuracy (WER = Word Error Rate, lower is better)
 
-1. **SimulStreaming** produces significantly better WER than LocalAgreement for whisper backends.
-   On the 30s English file: 5.3% vs 23.7-44.7%.
-2. **voxtral-mlx** has good accuracy (0% on short English, 9.2% on multi-speaker).
-   Whisper also supports `--language auto`, but Voxtral's language detection is more
-   reliable and does not bias towards English the way Whisper's auto mode tends to.
-3. **LocalAgreement** tends to duplicate the last sentence, inflating WER. This is a known
-   artifact of the LCP (Longest Common Prefix) commit strategy at end-of-stream.
-4. **Voxtral** backends handle French natively with 28-37% WER, while whisper backends
-   were run with `--lan en` here (not a fair comparison for French).
+1. **SimulStreaming** gives dramatically lower WER than LocalAgreement on the whisper backends. On the 30s English file: 5.3% vs 23-44%.
+2. **voxtral-mlx** hits 0% on short English and 9.2% on multi-speaker. It auto-detects language natively. Whisper also supports `--language auto`, but tends to bias towards English on short segments.
+3. **LocalAgreement** tends to repeat the last sentence at end-of-stream (a known LCP artifact), inflating WER. This is visible in the 21% WER on the 7s file -- the same 4 extra words appear in every LA run.
+4. On **French** with the correct `--language fr`, whisper base achieves 25-40% WER -- comparable to Voxtral's 28-37%. The small model does not consistently improve French WER.
 
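WER in these tables is word-level edit distance divided by the reference length. A self-contained sketch of the standard computation (not necessarily the exact normalization the benchmark script applies):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + sub)    # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# one substitution against a 6-word reference
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is exactly what the hallucination loops in the French footnote produce.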
-### Timestamp Accuracy (MAE = Mean Absolute Error on word start times, lower is better)
+### Timestamps (MAE = Mean Absolute Error on word start times)
 
-1. **LocalAgreement** produces the most accurate timestamps (0.08s MAE on English), since it
-   processes overlapping audio windows and validates via prefix matching.
-2. **SimulStreaming** timestamps are slightly less precise (0.24-0.40s MAE) but still usable
-   for most applications.
-3. **voxtral-mlx** has good timestamp accuracy on English (0.18-0.25s MAE) but drifts on
-   audio with long silence gaps (3.4s MAE on the French file with 4-second pauses).
-4. **voxtral (HF)** has the worst timestamp accuracy (1.0-4.0s MAE). This is likely related to
-   differences in the transformers-based decoding pipeline rather than model quality.
+1. **LocalAgreement** gives the best timestamps on English (0.08-0.09 s MAE).
+2. **SimulStreaming** is less precise (0.22-0.40 s MAE) but good enough for most applications.
+3. On French with silence gaps, **base model timestamps are unreliable** (3.4-5 s MAE). The **small model fixes this** (0.05-0.22 s MAE). This is the strongest argument for using `small` over `base`.
+4. **voxtral-mlx** has good timestamps on English (0.18-0.25 s MAE) but drifts on audio with long silence gaps (3.4 s MAE on the French file).
 
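Timestamp MAE is the mean absolute difference between predicted and ground-truth word start times. A minimal sketch, assuming the words have already been aligned 1:1 (the benchmark's alignment step is not shown here):

```python
def timestamp_mae(ref_starts: list[float], hyp_starts: list[float]) -> float:
    """Mean absolute error on word start times, in seconds."""
    assert len(ref_starts) == len(hyp_starts), "words must be aligned 1:1"
    errors = [abs(r - h) for r, h in zip(ref_starts, hyp_starts)]
    return sum(errors) / len(errors)

# hypothetical word starts: each predicted word is off by a small amount
print(timestamp_mae([0.0, 0.5, 1.2], [0.05, 0.55, 1.10]))  # ~0.067 s
```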
 ### VAC (Voice Activity Classification) Impact
 
@@ -117,23 +134,29 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
 | voxtral-mlx | voxtral | on | 0.0% | 9.2% |
 | voxtral-mlx | voxtral | off | 0.0% | 9.2% |
 
-- **Whisper backends require VAC** to function in streaming mode. Without it, the entire audio
-  is buffered as a single chunk and the LocalAgreement/SimulStreaming buffer logic breaks down.
-- **Voxtral backends are VAC-independent** because they handle their own internal chunking and
-  produce identical results with or without VAC. VAC still reduces wasted compute on silence.
+- **Whisper backends need VAC** to work in streaming mode. Without it, the buffer logic breaks down and you get empty or garbage output.
+- **Voxtral is unaffected by VAC** since it handles its own internal chunking: results are identical with or without it. VAC still saves compute on silent segments.
 
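Conceptually, the VAC gate sits in front of the ASR buffer: chunks classified as silence never reach the model. An illustrative sketch only, using a crude energy threshold as a stand-in for the actual Silero VAD model:

```python
def is_speech(chunk: list[float], threshold: float = 0.01) -> bool:
    """Crude energy gate; the real pipeline uses the Silero VAD model instead."""
    return sum(abs(s) for s in chunk) / len(chunk) > threshold

def gate(chunks: list[list[float]]) -> list[list[float]]:
    # Silence chunks are dropped before they reach the ASR buffer,
    # which is where the compute savings on silent segments come from.
    return [c for c in chunks if is_speech(c)]

audio = [[0.0] * 1600, [0.3] * 1600, [0.001] * 1600]  # silence, speech, near-silence
print(len(gate(audio)))  # 1
```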
 ---
 
 ## Recommendations
 
-| Use Case | Recommended Backend | Policy | Notes |
-|----------|-------------------|--------|-------|
-| Fastest English transcription (Apple Silicon) | mlx-whisper | SimulStreaming | 0.08-0.14x RTF, 5-10% WER |
-| Fastest English transcription (Linux/GPU) | faster-whisper | SimulStreaming | 0.10-0.14x RTF, 0-5% WER |
-| Multilingual / auto-detect (Apple Silicon) | voxtral-mlx | voxtral | Handles 100+ languages, 0.18-0.32x RTF |
-| Multilingual / auto-detect (Linux/GPU) | voxtral (HF) | voxtral | Same model, slower on CPU, needs GPU |
-| Best timestamp accuracy | faster-whisper | LocalAgreement | 0.08s MAE, good for subtitle alignment |
-| Low latency, low memory | mlx-whisper (tiny) | SimulStreaming | Smallest footprint, fastest response |
+| Use Case | Backend | Policy | Model | Notes |
+|----------|---------|--------|-------|-------|
+| Fastest English (Apple Silicon) | mlx-whisper | SimulStreaming | base | 0.11x RTF, 5.3% WER |
+| Fastest English (Linux/GPU) | faster-whisper | SimulStreaming | base | 0.10x RTF, 5.3% WER |
+| Best accuracy, English | faster-whisper | SimulStreaming | small | 0.26x RTF, 5.3% WER, still fast |
+| Multilingual / auto-detect | voxtral-mlx | voxtral | 4B | 100+ languages, 0.18-0.32x RTF |
+| Best timestamps | any | LocalAgreement | small | 0.05-0.09 s MAE, good for subtitles |
+| Low memory / embedded | mlx-whisper | SimulStreaming | base | Smallest footprint, fastest response |
 
 ---
 
+## Caveats
+
+- **3 test files, ~53 seconds total.** Results give relative rankings between backends but should not be taken as definitive WER numbers. Run on your own data for production decisions.
+- **RTF varies between runs** (up to +/-30%) depending on thermal state, background processes, and model caching. The numbers above are single sequential runs on a warm machine.
+- **Only base and small tested.** Medium and large-v3 would likely improve WER at the cost of higher RTF. We did not test them here because they are slow on Apple Silicon without GPU.
+
 ---
 
@@ -144,15 +167,18 @@ should be run with `--lan fr` or `--lan auto`. The Voxtral backends auto-detect
 pip install -e ".[test]"
 
 # Single backend test
-python test_backend_offline.py --backend faster-whisper --policy simulstreaming --no-realtime
+python test_backend_offline.py --backend faster-whisper --policy simulstreaming --model base --no-realtime
 
+# With a specific language
+python test_backend_offline.py --backend mlx-whisper --policy simulstreaming --model small --lan fr --no-realtime
+
 # Multi-backend auto-detect benchmark
 python test_backend_offline.py --benchmark --no-realtime
 
-# Export to JSON for programmatic analysis
+# Export to JSON
 python test_backend_offline.py --benchmark --no-realtime --json results.json
 
-# Test with custom audio
+# Test with your own audio
 python test_backend_offline.py --backend voxtral-mlx --audio your_file.wav --no-realtime
 ```
 
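The `--json` export can then be post-processed programmatically. The field names below are hypothetical (inspect your actual `results.json` for the real schema emitted by `test_backend_offline.py`):

```python
import json

# Hypothetical shape -- substitute the real structure from your results.json.
sample = '''[
  {"backend": "mlx-whisper", "policy": "simulstreaming", "rtf": 0.11, "wer": 5.3},
  {"backend": "faster-whisper", "policy": "simulstreaming", "rtf": 0.10, "wer": 5.3}
]'''

runs = json.loads(sample)  # for the real file: json.load(open("results.json"))
fastest = min(runs, key=lambda r: r["rtf"])
print(fastest["backend"])  # faster-whisper
```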
BIN  benchmark_chart.png (new file, 69 KiB, binary file not shown)