Commit Graph

5 Commits

Quentin Fuxa
b1fc23807a docs: add benchmark collaboration call, voxtral in powered-by section
2026-02-23 10:37:22 +01:00
Quentin Fuxa
10c4e5f730 docs: add speed vs accuracy scatter plot to benchmark and README
WER vs RTF scatter plot showing all backend/policy/model combos
on the 30s English file. A sweet-spot zone highlights the best
tradeoffs. Added to both BENCHMARK.md and README.md (a plotting
sketch follows this entry).
2026-02-23 10:27:53 +01:00
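
A plot like this takes only a few lines of matplotlib. A minimal sketch, assuming the results are available as (label, WER, RTF) tuples; the values below are placeholders, not the published benchmark numbers:

```python
import matplotlib.pyplot as plt

# Placeholder (label, WER %, RTF) tuples, NOT the published results.
results = [
    ("faster-whisper / SimulStreaming / base", 5.3, 0.20),
    ("mlx-whisper / LocalAgreement / small", 4.5, 0.35),
    ("openai-whisper / LocalAgreement / base", 7.0, 0.60),
]

fig, ax = plt.subplots()
for label, wer, rtf in results:
    ax.scatter(rtf, wer)
    ax.annotate(label, (rtf, wer), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("RTF (lower is faster)")
ax.set_ylabel("WER % (lower is better)")
ax.set_title("Speed vs accuracy, 30s English file")
fig.savefig("benchmark_scatter.png", dpi=150)
```
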
Quentin Fuxa
c76b2ef2c6 docs: rewrite benchmark with base/small comparison, proper French results
- Re-ran all whisper benchmarks with --lan fr for the French file
  (they previously ran with --lan en, which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming already saturates at 5.3% WER on base;
  the small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
2026-02-23 10:16:34 +01:00
Quentin Fuxa
4b2377c243 fix: correct false auto-detect claim, median bug, RTF inflation
- BENCHMARK.md: whisper also supports --language auto; voxtral is not
  the only backend that does. Fixed the mlx-whisper speed comparison
  (LA is actually faster than SS for mlx-whisper, not comparable as
  previously claimed).
- metrics.py: the median calculation was wrong for even-length lists
  (it took the upper middle element instead of averaging the two middle
  values; see the sketch after this entry).
- metrics_collector.py: RTF was inflated because log_summary() used
  wall-clock elapsed time instead of the sum of actual ASR call
  durations (also sketched after this entry).
- README.md: clarified that whisper also supports auto language
  detection; voxtral just does it better.
- Added 2 new median tests (even- and odd-length lists).
2026-02-22 23:38:04 +01:00
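
The median bug above is worth illustrating. A minimal sketch of the corrected logic, assuming the metric values arrive as a plain list; the function name `median` is illustrative, not necessarily what metrics.py exposes:

```python
def median(values: list[float]) -> float:
    """Middle element for odd-length lists; average of the two
    middle elements for even-length lists."""
    if not values:
        raise ValueError("median of an empty list")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # The old bug: returning ordered[mid] (the upper middle) here
    # instead of averaging the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2
```

Python's standard-library statistics.median implements exactly this behavior, which is one way to sidestep the bug entirely; the two new tests presumably cover both branches, e.g. median([1, 3]) == 2.0 and median([1, 2, 3]) == 2.
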
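For the RTF fix: real-time factor is processing time divided by audio duration, so only time actually spent inside ASR calls should count. A hedged sketch of the idea; the class and method names are illustrative, not metrics_collector.py's real API:

```python
import time

class TimedASR:
    """Illustrative wrapper: times each ASR call individually so RTF
    reflects compute time only, not real-time audio pacing."""

    def __init__(self, asr) -> None:
        self.asr = asr
        self.call_durations: list[float] = []

    def transcribe(self, chunk):
        start = time.perf_counter()
        result = self.asr.transcribe(chunk)
        self.call_durations.append(time.perf_counter() - start)
        return result

    def rtf(self, audio_seconds: float) -> float:
        # Correct: time spent inside ASR calls over audio length.
        # The old bug used wall-clock elapsed time, which also counted
        # the waits between streamed chunks and inflated RTF.
        return sum(self.call_durations) / audio_seconds
```
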
Quentin Fuxa
83d0fa3fac feat: benchmark suite with WER, timestamp accuracy, cross-backend comparison
- Extend test_backend_offline.py with WER and timestamp accuracy metrics
  computed via whisperlivekit.metrics against ground truth transcripts
  (a textbook WER sketch follows this entry).
- Add --benchmark flag to auto-detect all installed backends and run
  each (backend, policy) combination in sequence.
- Add --policy flag to override the streaming policy.
- Add detect_available_backends() probing faster-whisper, mlx-whisper,
  voxtral-mlx, voxtral (HF), and openai-whisper (a probing sketch
  follows this entry).
- Add print_cross_backend_comparison() with per-combo averages.
- Add run_benchmark.py for comprehensive multi-model benchmarking.
- Add BENCHMARK.md with full results on Apple M4: speed, WER,
  timestamp accuracy, VAC impact, and recommendations.
- Add ground truth transcript JSON files for all audio test files.
2026-02-22 23:27:50 +01:00
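
For reference, WER is the word-level Levenshtein distance between hypothesis and reference, normalized by reference length. A self-contained textbook sketch; whisperlivekit.metrics may normalize text differently:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (subs + deletions + insertions) / ref words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Levenshtein distance over word tokens, classic DP table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

For example, wer("the cat sat", "the cat sat down") is 1/3: one insertion against a three-word reference.
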
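Backend probing can be done without importing heavy modules, by checking whether each package's spec resolves. A sketch of the idea; the mapping from backend names to import names is an assumption for illustration, not necessarily what detect_available_backends() uses:

```python
import importlib.util

# Backend name -> Python package to probe. These import names are
# assumptions; the real mapping lives in the repo.
BACKEND_IMPORTS = {
    "faster-whisper": "faster_whisper",
    "mlx-whisper": "mlx_whisper",
    "voxtral": "transformers",
    "openai-whisper": "whisper",
}

def detect_available_backends() -> list[str]:
    """Return the backends whose Python package can be imported."""
    return [name for name, module in BACKEND_IMPORTS.items()
            if importlib.util.find_spec(module) is not None]

print(detect_available_backends())
```
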