This fix addresses a critical bug in the Whisper tokenizer that causes
the transcription server to crash with an `IndexError: string index out
of range` when streaming audio in languages utilizing multi-byte UTF-8
characters (e.g., Cantonese, Japanese, Mandarin).
When a 3-byte character is cut off at the boundary of an audio chunk,
incomplete bytes are decoded into a single Unicode replacement character
(`\ufffd`), artificially shortening the string and breaking the offset
mapping assumed by `split_tokens_on_unicode`.
This ports the upstream fix from SYSTRAN/faster-whisper (PR #111) to add
a strict bounds check before accessing the string index, allowing
incomplete bytes to be safely caught and handled in the next chunk.
- Changed the extras for cu129-diarization-sortformer from gpu-cu129 to cu129.
- This aligns the dependency with the correct naming convention for consistency.
- Introduced a new argument for selecting the diarization backend in the engine creation.
- Enhanced the `create_engine` function to accept and utilize the specified diarization backend.
- Updated the test runner to accommodate the new backend option for improved flexibility.
- Introduced a workflow to publish Docker images on tag push and manual triggers.
- Added a support matrix workflow to test across multiple OS and Python versions.
WER vs RTF scatter plot showing all backend/policy/model combos
on the 30s English file. Sweet spot zone highlights the best
tradeoffs. Added to both BENCHMARK.md and README.md.
- Re-ran all whisper benchmarks with --lan fr for the French file
(previously ran with --lan en which made the results meaningless)
- Added small model results alongside base for all backends
- Added model size comparison table (base vs small tradeoffs)
- Added benchmark chart (30s English, WER + RTF by backend)
- Added caveats section about dataset size and RTF variance
- Key findings: SimulStreaming saturates at 5.3% WER on base already,
small model mainly helps LocalAgreement and French timestamps
- mlx-whisper LA base is unstable on French (hallucination loops)
- BENCHMARK.md: whisper also supports --language auto, voxtral is not
the only one. Fixed mlx-whisper speed comparison (LA is actually
faster than SS for mlx-whisper, not comparable).
- metrics.py: median calculation was wrong for even-length lists
(took upper middle instead of averaging the two middle values).
- metrics_collector.py: RTF was inflated because log_summary() used
wall-clock elapsed time instead of sum of actual ASR call durations.
- README.md: clarified that whisper also supports auto language
detection, voxtral just does it better.
- Added 2 new median tests (even + odd length).
Pure-MLX implementation of Voxtral Mini 4B Realtime for low-latency
speech transcription on Apple Silicon. Avoids the transformers/torch
overhead and runs at 0.18-0.32x real-time factor.
- voxtral_mlx/model.py: MLX model with spectrogram, encoder, decoder
- voxtral_mlx/loader.py: model loading with 6-bit quantized weights
- voxtral_mlx/spectrogram.py: mel spectrogram computation in MLX
- voxtral_mlx_asr.py: VoxtralASR adapter for the AudioProcessor pipeline
- Add Voxtral Backend section explaining voxtral-mlx and voxtral (HF).
- Add Testing & Benchmarks section with commands to run tests/benchmarks.
- Update --backend parameter docs to include voxtral-mlx and voxtral.
- Update optional dependencies table with Voxtral entry.
- Link to BENCHMARK.md for detailed performance comparisons.
- Extend test_backend_offline.py with WER and timestamp accuracy metrics
computed via whisperlivekit.metrics against ground truth transcripts.
- Add --benchmark flag to auto-detect all installed backends and run
each (backend, policy) combination in sequence.
- Add --policy flag to override the streaming policy.
- Add detect_available_backends() probing faster-whisper, mlx-whisper,
voxtral-mlx, voxtral (HF), and openai-whisper.
- Add print_cross_backend_comparison() with per-combo averages.
- Add run_benchmark.py for comprehensive multi-model benchmarking.
- Add BENCHMARK.md with full results on Apple M4: speed, WER,
timestamp accuracy, VAC impact, and recommendations.
- Add ground truth transcript JSON files for all audio test files.