Chingning Chen b63f54e838 fix(whisper/tokenizer): prevent IndexError from crashing multilingual streams
This fix addresses a critical bug in the Whisper tokenizer that causes
the transcription server to crash with an `IndexError: string index out
of range` when streaming audio in languages utilizing multi-byte UTF-8
characters (e.g., Cantonese, Japanese, Mandarin).

When a 3-byte character is cut off at the boundary of an audio chunk,
the incomplete bytes decode to a single Unicode replacement character
(`\ufffd`). The decoded string is then shorter than expected, so an
offset computed by `split_tokens_on_unicode` can point past its end.
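
A minimal illustration of the shortening effect, using plain Python
string decoding rather than the Whisper tokenizer itself (the sample
text is an arbitrary choice):

```python
# Two CJK characters, each encoded as 3 bytes in UTF-8.
text = "你好"
raw = text.encode("utf-8")
assert len(raw) == 6

# Simulate a chunk boundary that cuts off the last byte of the second character.
truncated = raw[:-1]

# With errors="replace", the two leftover bytes collapse into one U+FFFD.
decoded = truncated.decode("utf-8", errors="replace")

print(len(raw), len(truncated), len(decoded))
print(decoded == "你\ufffd")
```

Five bytes become two characters, so any index derived from per-token
decoding can drift away from the fully decoded string.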

This ports the upstream fix from SYSTRAN/faster-whisper (PR #111) to add
a strict bounds check before accessing the string index, allowing
incomplete bytes to be safely caught and handled in the next chunk.
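
The guard can be sketched as follows. This is a simplified stand-in, not
the actual patch: tokens are modelled as raw `bytes` objects, `decode`
is a hypothetical helper, and the third return value (`pending`) is
added here only to make the leftover bytes visible.

```python
REPLACEMENT_CHAR = "\ufffd"


def decode(tokens):
    # Hypothetical stand-in for the tokenizer's decode step.
    return b"".join(tokens).decode("utf-8", errors="replace")


def split_tokens_on_unicode(tokens):
    decoded_full = decode(tokens)
    words, word_tokens, current = [], [], []
    unicode_offset = 0

    for token in tokens:
        current.append(token)
        decoded = decode(current)
        index = decoded.find(REPLACEMENT_CHAR)

        complete = index == -1
        if not complete:
            full_index = unicode_offset + index
            # Bounds check: without it, decoded_full[full_index] raises
            # IndexError when the full string is shorter than expected.
            complete = (
                full_index < len(decoded_full)
                and decoded_full[full_index] == REPLACEMENT_CHAR
            )

        if complete:
            words.append(decoded)
            word_tokens.append(current)
            current = []
            unicode_offset += len(decoded)

    # Tokens still in `current` are incomplete; a caller could carry them
    # over to the next chunk (an assumption of this sketch).
    return words, word_tokens, current


# "你好" with the last byte missing, split at awkward token boundaries.
tokens = [b"\xe4\xbd", b"\xa0\xe5", b"\xa5"]
words, word_tokens, pending = split_tokens_on_unicode(tokens)
```

In this input the final token yields an offset of 2 against a fully
decoded string of length 2, which is the out-of-range access the bounds
check prevents; the token is left in `pending` instead.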
2026-03-02 15:31:43 +08:00