Commit Graph

5 Commits

Quentin Fuxa
b1fc23807a docs: add benchmark collaboration call, voxtral in powered-by section
2026-02-23 10:37:22 +01:00
Quentin Fuxa
10c4e5f730 docs: add speed vs accuracy scatter plot to benchmark and README
WER vs RTF scatter plot showing all backend/policy/model combos
on the 30s English file. A sweet-spot zone highlights the best
tradeoffs. Added to both BENCHMARK.md and README.md (a plotting
sketch follows this entry).
2026-02-23 10:27:53 +01:00
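
A plot like this takes only a few lines of matplotlib. A minimal sketch, assuming the results are available as (label, WER, RTF) tuples; the values below are placeholders, not the published benchmark numbers:

```python
import matplotlib.pyplot as plt

# Placeholder (label, WER %, RTF) tuples, NOT the published results.
results = [
    ("faster-whisper / SimulStreaming / base", 5.3, 0.20),
    ("mlx-whisper / LocalAgreement / small", 4.5, 0.35),
    ("openai-whisper / LocalAgreement / base", 7.0, 0.60),
]

fig, ax = plt.subplots()
for label, wer, rtf in results:
    ax.scatter(rtf, wer)
    ax.annotate(label, (rtf, wer), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("RTF (lower is faster)")
ax.set_ylabel("WER % (lower is better)")
ax.set_title("Speed vs accuracy, 30s English file")
fig.savefig("benchmark_scatter.png", dpi=150)
```
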
Quentin Fuxa
c76b2ef2c6 docs: rewrite benchmark with base/small comparison, proper French results
- Re-ran all whisper benchmarks with --lan fr for the French file
  (they previously ran with --lan en, which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming already saturates at 5.3% WER on base;
  the small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
2026-02-23 10:16:34 +01:00
Quentin Fuxa
4b2377c243 fix: correct false auto-detect claim, median bug, RTF inflation
- BENCHMARK.md: whisper also supports --language auto; voxtral is not
  the only backend that does. Fixed the mlx-whisper speed comparison
  (LA is actually faster than SS for mlx-whisper, not comparable as
  previously claimed).
- metrics.py: the median calculation was wrong for even-length lists
  (it took the upper middle element instead of averaging the two middle
  values; see the sketch after this entry).
- metrics_collector.py: RTF was inflated because log_summary() used
  wall-clock elapsed time instead of the sum of actual ASR call
  durations (also sketched after this entry).
- README.md: clarified that whisper also supports auto language
  detection; voxtral just does it better.
- Added 2 new median tests (even- and odd-length lists).
2026-02-22 23:38:04 +01:00
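
The median bug above is worth illustrating. A minimal sketch of the corrected logic, assuming the metric values arrive as a plain list; the function name `median` is illustrative, not necessarily what metrics.py exposes:

```python
def median(values: list[float]) -> float:
    """Middle element for odd-length lists; average of the two
    middle elements for even-length lists."""
    if not values:
        raise ValueError("median of an empty list")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # The old bug: returning ordered[mid] (the upper middle) here
    # instead of averaging the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Python's standard-library statistics.median implements exactly this behavior, which is one way to sidestep the bug entirely; the two new tests presumably cover both branches, e.g. median([1, 3]) == 2.0 and median([1, 2, 3]) == 2.
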
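For the RTF fix: real-time factor is processing time divided by audio duration, so only time actually spent inside ASR calls should count. A hedged sketch of the idea; the class and method names are illustrative, not metrics_collector.py's real API:

```python
import time

class TimedASR:
    """Illustrative wrapper: times each ASR call individually so RTF
    reflects compute time only, not real-time audio pacing."""

    def __init__(self, asr) -> None:
        self.asr = asr
        self.call_durations: list[float] = []

    def transcribe(self, chunk):
        start = time.perf_counter()
        result = self.asr.transcribe(chunk)
        self.call_durations.append(time.perf_counter() - start)
        return result

    def rtf(self, audio_seconds: float) -> float:
        # Correct: time spent inside ASR calls over audio length.
        # The old bug used wall-clock elapsed time, which also counted
        # the waits between streamed chunks and inflated RTF.
        return sum(self.call_durations) / audio_seconds
```
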
Quentin Fuxa
83d0fa3fac feat: benchmark suite with WER, timestamp accuracy, cross-backend comparison
- Extend test_backend_offline.py with WER and timestamp accuracy metrics
  computed via whisperlivekit.metrics against ground truth transcripts
  (a textbook WER sketch follows this entry).
- Add --benchmark flag to auto-detect all installed backends and run
  each (backend, policy) combination in sequence.
- Add --policy flag to override the streaming policy.
- Add detect_available_backends() probing faster-whisper, mlx-whisper,
  voxtral-mlx, voxtral (HF), and openai-whisper (a probing sketch
  follows this entry).
- Add print_cross_backend_comparison() with per-combo averages.
- Add run_benchmark.py for comprehensive multi-model benchmarking.
- Add BENCHMARK.md with full results on Apple M4: speed, WER,
  timestamp accuracy, VAC impact, and recommendations.
- Add ground truth transcript JSON files for all audio test files.
2026-02-22 23:27:50 +01:00
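
For reference, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length. A self-contained textbook sketch; whisperlivekit.metrics may normalize text differently:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + deletions + insertions) / ref words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over word tokens, classic DP table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("the cat sat", "the cat sat down") is 1/3: one insertion against a three-word reference.
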
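Backend probing can be done without importing heavy modules, by checking whether each package's spec resolves. A sketch of the idea; the mapping from backend names to import names is an assumption for illustration, not necessarily what detect_available_backends() uses:

```python
import importlib.util

# Backend name -> Python package to probe. These import names are
# assumptions; the real mapping lives in the repo.
BACKEND_IMPORTS = {
    "faster-whisper": "faster_whisper",
    "mlx-whisper": "mlx_whisper",
    "voxtral": "transformers",
    "openai-whisper": "whisper",
}

def detect_available_backends() -> list[str]:
    """Return the backends whose Python package can be imported."""
    return [name for name, module in BACKEND_IMPORTS.items()
            if importlib.util.find_spec(module) is not None]

print(detect_available_backends())
```
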