This fix addresses a critical bug in the Whisper tokenizer that causes
the transcription server to crash with an `IndexError: string index out
of range` when streaming audio in languages that use multi-byte UTF-8
characters (e.g., Cantonese, Japanese, Mandarin).
When a 3-byte character is cut off at the boundary of an audio chunk,
incomplete bytes are decoded into a single Unicode replacement character
(`\ufffd`), artificially shortening the string and breaking the offset
mapping assumed by `split_tokens_on_unicode`.
This ports the upstream fix from SYSTRAN/faster-whisper (PR #111) to add
a strict bounds check before accessing the string index, allowing
incomplete bytes to be safely caught and handled in the next chunk.
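The guard can be sketched as follows — a simplified, hypothetical version of the tokenizer-side logic (the real `split_tokens_on_unicode` operates on Whisper token IDs and the tokenizer's own decode; here `decode` is passed in and tokens are raw byte strings):

```python
REPLACEMENT_CHAR = "\ufffd"

def split_tokens_on_unicode(decode, tokens):
    """Group tokens into complete-character spans, deferring incomplete
    UTF-8 sequences (decoded as U+FFFD) to the next chunk.

    A partial multi-byte character collapses into a single replacement
    character, shortening the decoded string, so an unguarded index into
    decoded_full can run past the end and raise IndexError."""
    decoded_full = decode(tokens)
    words, word_tokens = [], []
    current_tokens = []
    unicode_offset = 0

    for token in tokens:
        current_tokens.append(token)
        decoded = decode(current_tokens)
        # Bounds check before indexing decoded_full: only commit a span if
        # it has no replacement char, or the replacement char at the same
        # offset is "real" (i.e. also present in the full decode).
        if REPLACEMENT_CHAR not in decoded or (
            unicode_offset + decoded.index(REPLACEMENT_CHAR) < len(decoded_full)
            and decoded_full[unicode_offset + decoded.index(REPLACEMENT_CHAR)]
            == REPLACEMENT_CHAR
        ):
            words.append(decoded)
            word_tokens.append(current_tokens)
            current_tokens = []
            unicode_offset += len(decoded)

    return words, word_tokens
```

With this guard, a 3-byte character split across tokens stays buffered in `current_tokens` until its remaining bytes arrive, instead of indexing past the end of the fully decoded string.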
- BENCHMARK.md: note that whisper also supports `--language auto`;
voxtral is not the only backend with it. Fixed the mlx-whisper speed
comparison (LA is actually faster than SS for mlx-whisper, not
comparable).
- metrics.py: median calculation was wrong for even-length lists
(took upper middle instead of averaging the two middle values).
- metrics_collector.py: RTF was inflated because log_summary() used
wall-clock elapsed time instead of sum of actual ASR call durations.
- README.md: clarified that whisper also supports auto language
detection; voxtral just does it better.
- Added 2 new median tests (even + odd length).
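The even-length median fix can be sketched with a minimal helper (hypothetical name; the actual function in metrics.py may differ):

```python
def median(values):
    """Median of a non-empty list: middle element for odd length,
    mean of the two middle elements for even length."""
    s = sorted(values)
    mid = len(s) // 2
    if len(s) % 2 == 1:
        return s[mid]
    # The buggy version returned s[mid] here (the upper middle value);
    # the fix averages the two middle values instead.
    return (s[mid - 1] + s[mid]) / 2
```

For `[1, 2, 3, 4]` the old code returned 3 (upper middle) where the correct median is 2.5.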
Pure-MLX implementation of Voxtral Mini 4B Realtime for low-latency
speech transcription on Apple Silicon. Avoids the transformers/torch
overhead and runs at 0.18-0.32x real-time factor.
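For reference, real-time factor (RTF) here follows the standard definition — processing time divided by audio duration, where values below 1.0 are faster than real time:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = processing time / audio duration.
    0.25 means the audio was transcribed 4x faster than real time."""
    return processing_seconds / audio_seconds
```

So an RTF of 0.18-0.32 means roughly 3-5x faster than real time.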
- voxtral_mlx/model.py: MLX model with spectrogram, encoder, decoder
- voxtral_mlx/loader.py: model loading with 6-bit quantized weights
- voxtral_mlx/spectrogram.py: mel spectrogram computation in MLX
- voxtral_mlx_asr.py: VoxtralASR adapter for the AudioProcessor pipeline
- Add Voxtral Backend section explaining voxtral-mlx and voxtral (HF).
- Add Testing & Benchmarks section with commands to run tests/benchmarks.
- Update --backend parameter docs to include voxtral-mlx and voxtral.
- Update optional dependencies table with Voxtral entry.
- Link to BENCHMARK.md for detailed performance comparisons.
- Fix _begin_silence pushing the same object reference as _end_silence,
which caused the consumer to process two ended events and double the
silence duration.
- Fix initial silence never cleared when VAC is disabled, causing
the no-VAC path to enqueue zero audio.
- Add sample-precise silence boundaries (at_sample parameter).
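The aliasing bug can be illustrated with a toy producer/consumer (hypothetical event shapes; the real silence events in AudioProcessor differ):

```python
import copy
from queue import Queue

def enqueue_silence_buggy():
    """Begin and end enqueue the same dict object, so mutating it for the
    end event retroactively changes the already-queued begin event."""
    q = Queue()
    event = {"type": "silence", "ended": False}
    q.put(event)           # begin: same object as below
    event["ended"] = True  # end path mutates the shared object
    q.put(event)
    return [e["ended"] for e in (q.get(), q.get())]

def enqueue_silence_fixed():
    """Begin enqueues an independent copy, so the consumer sees one
    begin event and one end event."""
    q = Queue()
    event = {"type": "silence", "ended": False}
    q.put(copy.deepcopy(event))  # begin: independent snapshot
    event["ended"] = True
    q.put(event)
    return [e["ended"] for e in (q.get(), q.get())]
```

In the buggy version the consumer dequeues two ended events, which is what doubled the measured silence duration.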
- Add whisperlivekit/metrics.py with WER computation (word-level
Levenshtein) and timestamp accuracy (greedy alignment). No
external dependencies.
- Add whisperlivekit/metrics_collector.py with SessionMetrics
dataclass for per-session runtime observability. Instrumented
at 6 points in AudioProcessor: init, process_audio,
transcription_processor, _end_silence, results_formatter, cleanup.
Emits SESSION_METRICS structured log line on session end.
fixes #283, fixes #275
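The word-level Levenshtein WER can be sketched as follows (a minimal version; the actual metrics.py implementation may tokenize or normalize differently):

```python
def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein distance:
    (substitutions + insertions + deletions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Being a plain dynamic program over split words, it needs no external dependencies, matching the constraint above.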
- accumulated_cross_attns grew unboundedly during the decoding loop,
using up to ~5GB for repetition loops; now capped to a rolling window of 16
- max_tokens_per_chunk was using TOKENS_PER_SECOND (mel frame rate = 50)
instead of actual text token rate (~15/s), allowing 10-40x too many
decoding steps
- removed unused torch.cat on early return path
- removed dead self.committed/last_result_tokens lists (never read)
- same fixes applied to mlx variant
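The rolling-window cap can be sketched with a bounded deque (hypothetical stand-in values; the real entries are cross-attention tensors):

```python
from collections import deque

CROSS_ATTN_WINDOW = 16  # window size from the fix

# Instead of an unbounded list (accumulated_cross_attns.append(attn)),
# a deque with maxlen drops the oldest entry once the cap is reached,
# so a repetition loop can no longer grow memory without bound.
accumulated_cross_attns = deque(maxlen=CROSS_ATTN_WINDOW)

for step in range(100):        # e.g. 100 decoding steps
    attn = {"step": step}      # stand-in for a cross-attention tensor
    accumulated_cross_attns.append(attn)
```

After 100 steps only the newest 16 entries survive, keeping memory constant regardless of how long decoding runs.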
The flag was only used for tokenizer language selection but never
actually passed to whisper/faster-whisper transcribe calls. Also init
OpenaiApiASR.task and read it from transcribe_kargs.
fixes #306