mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git, synced 2026-04-26 16:45:46 +00:00
155 lines · 6.2 KiB · Markdown

# whisper_streaming

Whisper realtime streaming for long speech-to-text transcription and translation

## Installation

This code works with two kinds of backends. Both require:

```
pip install librosa
pip install opus-fast-mosestokenizer
```

The recommended backend is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for the NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install it with `pip install faster-whisper`.

An alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`

The backend is loaded only when chosen. The unused one does not have to be installed.

## Usage: example entry point

```
usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
                         [--start_at START_AT] [--backend {faster-whisper,whisper_timestamped}] [--offline] [--vad]
                         audio_path

positional arguments:
  audio_path            Filename of 16kHz mono channel wav, on which live streaming is simulated.

options:
  -h, --help            show this help message and exit
  --min-chunk-size MIN_CHUNK_SIZE
                        Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
  --model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large}
                        Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
  --model_cache_dir MODEL_CACHE_DIR
                        Overriding the default model cache dir where models downloaded from the hub are saved
  --model_dir MODEL_DIR
                        Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
  --lan LAN, --language LAN
                        Language code for transcription, e.g. en,de,cs.
  --task {transcribe,translate}
                        Transcribe or translate.
  --start_at START_AT   Start processing audio at this time.
  --backend {faster-whisper,whisper_timestamped}
                        Load only this backend for Whisper processing.
  --offline             Offline mode.
  --vad                 Use VAD = voice activity detection, with the default parameters.
```


Example:

It simulates realtime processing from a pre-recorded mono 16k wav file.

```
python3 whisper_online.py en-demo16.wav --language en --min-chunk-size 1 > out.txt
```

### Output format

```
2691.4399 300 1380 Chairman, thank you.
6914.5501 1940 4940 If the debate today had a
9019.0277 5160 7160 the subject the situation in
10065.1274 7180 7480 Gaza
11058.3558 7480 9460 Strip, I might
12224.3731 9460 9760 have
13555.1929 9760 11060 joined Mrs.
14928.5479 11140 12240 De Kaiser and all the
16588.0787 12240 12560 other
18324.9285 12560 14420 colleagues across the
```

[See description here](https://github.com/ufal/whisper_streaming/blob/d915d790a62d7be4e7392dde1480e7981eb142ae/whisper_online.py#L361)
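
Each output line carries four fields: an emission time, then the segment's start and end timestamps, then the text. A minimal parser sketch (the field meanings are assumed from the sample above and the linked description, so check against your version of the script):

```python
def parse_line(line):
    """Split one output line into (emission_time, beg, end, text).

    Assumed layout: emission time, then the segment's start and end
    timestamps, then the transcribed text.
    """
    emission, beg, end, text = line.split(" ", 3)
    return float(emission), float(beg), float(end), text

parse_line("2691.4399 300 1380 Chairman, thank you.")
# (2691.4399, 300.0, 1380.0, 'Chairman, thank you.')
```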

## Usage as a module

TL;DR: use the `OnlineASRProcessor` object and its methods `insert_audio_chunk` and `process_iter`.

The code in whisper_online.py is well commented; read it as the full documentation.

This pseudocode describes the interface that we suggest for your implementation. You can implement e.g. audio from a microphone or stdin, a server-client architecture, etc.

```
from whisper_online import *

src_lan = "en"  # source language
tgt_lan = "en"  # target language -- same as source for ASR, "en" if translate task is used

asr = FasterWhisperASR(src_lan, "large-v2")  # loads and wraps Whisper model
# set options:
# asr.set_translate_task()  # it will translate from src_lan into English
# asr.use_vad()  # set using VAD

online = OnlineASRProcessor(tgt_lan, asr)  # create processing object

while audio_has_not_ended:  # processing loop:
    a = ...  # receive new audio chunk (and e.g. wait for min_chunk_size seconds first, ...)
    online.insert_audio_chunk(a)
    o = online.process_iter()
    print(o)  # do something with the current partial output

# at the end of this audio processing
o = online.finish()
print(o)  # do something with the last output

online.init()  # refresh if you're going to re-use the object for the next audio
```

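
As one concrete option for the chunk-receiving step, raw 16 kHz mono float32 audio could be read from stdin. This is only a sketch under that assumption; the chunker below is illustrative and not part of whisper_online.py:

```python
import sys
from array import array

SAMPLING_RATE = 16000

def read_chunks(stream, chunk_sec=1.0):
    """Yield chunk_sec-second audio chunks read from a binary stream
    of raw float32 samples (4 bytes each) at 16 kHz."""
    bytes_per_chunk = int(SAMPLING_RATE * chunk_sec) * 4
    while True:
        raw = stream.read(bytes_per_chunk)
        if not raw:
            return
        chunk = array("f")
        chunk.frombytes(raw)
        yield chunk

# the processing loop above then becomes:
# for a in read_chunks(sys.stdin.buffer):
#     online.insert_audio_chunk(a)
#     print(online.process_iter())
```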

## Background

Default Whisper is intended for audio chunks of at most 30 seconds that contain one full sentence. Longer audio files must be split into shorter chunks and merged with an "init prompt". In low-latency simultaneous streaming mode, simple and naive chunking into fixed-sized windows does not work well; it can split a word in the middle. It is also necessary to know when the transcript is stable, should be confirmed ("committed") and followed up, and when future content makes the transcript clearer.

For that, there is the LocalAgreement-n policy: if n consecutive updates, each with a newly available audio stream chunk, agree on a prefix of the transcript, it is confirmed. (Reference: CUNI-KIT at IWSLT 2022 etc.)

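
The policy with n = 2 can be sketched as follows. This is an illustrative toy with made-up names, not the project's implementation:

```python
def common_prefix(a, b):
    """Longest common prefix of two word lists."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

class LocalAgreement2:
    """Confirm words once two consecutive hypotheses agree on them."""

    def __init__(self):
        self.committed = []  # words confirmed so far
        self.previous = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis):
        # compare only the words beyond what is already committed
        new = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, new)
        self.committed.extend(agreed)
        self.previous = new[len(agreed):]
        return agreed  # words newly confirmed by this update

la = LocalAgreement2()
la.update("chairman thank".split())          # nothing confirmed yet
la.update("chairman thank you the".split())  # confirms: chairman thank
```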

In this project, we re-use the idea of Peter Polák from this demo:
https://github.com/pe-trik/transformers/blob/online_decode/examples/pytorch/online-decoding/whisper-online-demo.py
However, it doesn't do any sentence segmentation; Whisper, on the other hand, produces punctuation, and the libraries `faster-whisper` and `whisper-timestamped` provide word-level timestamps. In short: we consecutively process new audio chunks, emit the transcripts that are confirmed by 2 iterations, and scroll the audio processing buffer on the timestamp of a confirmed complete sentence. This way the processing audio buffer stays short and the processing is fast.

In more detail: we use the init prompt, we handle the inaccurate timestamps, we re-process confirmed sentence prefixes and skip them, making sure they don't overlap, and we limit the processing buffer window.

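
The buffer scrolling can be pictured like this (an illustrative sketch with made-up names, not the project's code): once a complete sentence ending at some timestamp is confirmed, the audio before that timestamp is dropped and the buffer's time offset moves forward.

```python
SAMPLING_RATE = 16000

def scroll_buffer(audio, buffer_offset_sec, confirmed_end_sec):
    """Drop buffered audio up to the end of a confirmed sentence.

    audio: samples whose first element is at time buffer_offset_sec.
    Returns the trimmed audio and the new buffer offset.
    """
    cut = int((confirmed_end_sec - buffer_offset_sec) * SAMPLING_RATE)
    return audio[cut:], confirmed_end_sec

buf = [0.0] * (5 * SAMPLING_RATE)  # 5 s of buffered audio starting at t = 10 s
buf, offset = scroll_buffer(buf, 10.0, 12.0)
# buf now holds the 3 s of audio from t = 12 s on
```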
Contributions are welcome.

### Tests

Rigorous quality and latency tests are pending.

## Contact

Dominik Macháček, machacek@ufal.mff.cuni.cz