WhisperLiveKit

mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git synced 2026-03-07 14:23:18 +00:00

Files

Chingning Chen b63f54e838 fix(whisper/tokenizer): prevent IndexError from crashing multilingual streams

This fix addresses a critical bug in the Whisper tokenizer that causes
the transcription server to crash with an `IndexError: string index out
of range` when streaming audio in languages utilizing multi-byte UTF-8
characters (e.g., Cantonese, Japanese, Mandarin).

When a 3-byte character is cut off at the boundary of an audio chunk,
its incomplete bytes are decoded into a single Unicode replacement
character (`\ufffd`). The decoded string is then shorter than expected,
which breaks the offset mapping assumed by `split_tokens_on_unicode`.

This ports the upstream fix from SYSTRAN/faster-whisper (PR #111) to add
a strict bounds check before accessing the string index, allowing
incomplete bytes to be safely caught and handled in the next chunk.
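
The failure mode and the bounds check described above can be illustrated
with a minimal sketch. It is not the project's tokenizer code: tokens are
modeled as raw byte fragments, and the names `decode`,
`split_fragments_on_unicode`, and the `bounds_check` flag are hypothetical,
introduced only to contrast the guarded and unguarded behavior.

```python
REPLACEMENT_CHAR = "\ufffd"


def decode(fragments):
    """Decode byte fragments; invalid or truncated UTF-8 becomes U+FFFD."""
    return b"".join(fragments).decode("utf-8", errors="replace")


def split_fragments_on_unicode(fragments, bounds_check=True):
    """Group fragments so that each emitted group decodes cleanly.

    Hypothetical stand-in for a split-on-unicode routine: real tokens are
    replaced by byte fragments to keep the example self-contained.
    """
    decoded_full = decode(fragments)
    words, word_fragments, current = [], [], []
    unicode_offset = 0  # characters of decoded_full already emitted

    for fragment in fragments:
        current.append(fragment)
        decoded = decode(current)

        if REPLACEMENT_CHAR not in decoded:
            complete = True
        else:
            index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
            if bounds_check and index >= len(decoded_full):
                # The full decode is shorter than the running offset,
                # so keep the fragments pending instead of indexing.
                complete = False
            else:
                # Without the guard, this lookup can raise IndexError.
                complete = decoded_full[index] == REPLACEMENT_CHAR

        if complete:
            words.append(decoded)
            word_fragments.append(current)
            current = []
            unicode_offset += len(decoded)

    return words, word_fragments
```

In this sketch, a character split across two fragments (for example
`[b"hi", b"\xe4\xbd", b"\xa0"]`) is held back until its last byte arrives.
A truncated tail such as `[b"\xe4", b"\xbd"]` is where the unguarded lookup
goes out of range; with the guard, the trailing fragment is simply left out
of the result, and a streaming caller would be expected to carry it over.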

2026-03-02 15:31:43 +08:00

assets

whisper core at root of wlk

2025-11-10 12:17:18 +01:00

normalizers

whisper core at root of wlk

2025-11-10 12:17:18 +01:00

__init__.py

fixes #299

2025-12-05 17:54:14 +01:00

__main__.py

whisper core at root of wlk