Mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git, synced 2026-03-07 22:33:36 +00:00

Compare commits -- 3 commits (regularfry, branch seamless-s):

- 5b5805231e
- cdb2a0ba17
- 1703969432

README.md: 96 lines changed
@@ -3,55 +3,54 @@ Whisper realtime streaming for long speech-to-text transcription and translation

**Turning Whisper into Real-Time Transcription System**

Demonstration paper, by [Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek), [Raj Dabre](https://prajdabre.github.io/), [Ondřej Bojar](https://ufal.mff.cuni.cz/ondrej-bojar), 2023
Demonstration paper, by Dominik Macháček, Raj Dabre, Ondřej Bojar, 2023

Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real-time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.
Abstract: Whisper is one of the recent state-of-the-art multilingual speech recognition and translation models, however, it is not designed for real time transcription. In this paper, we build on top of Whisper and create Whisper-Streaming, an implementation of real-time speech transcription and translation of Whisper-like models. Whisper-Streaming uses local agreement policy with self-adaptive latency to enable streaming transcription. We show that Whisper-Streaming achieves high quality and 3.3 seconds latency on unsegmented long-form speech transcription test set, and we demonstrate its robustness and practical usability as a component in live transcription service at a multilingual conference.

[Paper PDF](https://aclanthology.org/2023.ijcnlp-demo.3.pdf), [Demo video](https://player.vimeo.com/video/840442741)
Paper in proceedings: http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf

Demo video: https://player.vimeo.com/video/840442741

[Slides](http://ufallab.ms.mff.cuni.cz/~machacek/pre-prints/AACL23-2.11.2023-Turning-Whisper-oral.pdf) -- 15 minutes oral presentation at IJCNLP-AACL 2023

Please, cite us. [ACL Anthology](https://aclanthology.org/2023.ijcnlp-demo.3/), [Bibtex citation](https://aclanthology.org/2023.ijcnlp-demo.3.bib):
Please, cite us. [Bibtex citation](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/bib/2023.ijcnlp-demo.3.bib):

```
@inproceedings{machacek-etal-2023-turning,
title = "Turning Whisper into Real-Time Transcription System",
author = "Mach{\'a}{\v{c}}ek, Dominik and
Dabre, Raj and
Bojar, Ond{\v{r}}ej",
editor = "Saha, Sriparna and
Sujaini, Herry",
booktitle = "Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations",
month = nov,
year = "2023",
address = "Bali, Indonesia",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.ijcnlp-demo.3",
pages = "17--24",
@InProceedings{machacek-dabre-bojar:2023:ijcnlp,
author = {Macháček, Dominik and Dabre, Raj and Bojar, Ondřej},
title = {Turning Whisper into Real-Time Transcription System},
booktitle = {System Demonstrations},
month = {November},
year = {2023},
address = {Bali, Indonesia},
publisher = {Asian Federation of Natural Language Processing},
pages = {17--24},
}
```

## Installation

1) ``pip install librosa soundfile`` -- audio processing library
1) ``pip install librosa`` -- audio processing library

2) Whisper backend.
2) **Whisper backend**.

Several alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.
Two alternative backends are integrated. The most recommended one is [faster-whisper](https://github.com/guillaumekln/faster-whisper) with GPU support. Follow their instructions for NVIDIA libraries -- we succeeded with CUDNN 8.5.0 and CUDA 11.7. Install with `pip install faster-whisper`.

Alternative, less restrictive, but slower backend is [whisper-timestamped](https://github.com/linto-ai/whisper-timestamped): `pip install git+https://github.com/linto-ai/whisper-timestamped`

Thirdly, it's also possible to run this software from the [OpenAI Whisper API](https://platform.openai.com/docs/api-reference/audio/createTranscription). This solution is fast and requires no GPU, just a small VM will suffice, but you will need to pay OpenAI for api access. Also note that, since each audio fragment is processed multiple times, the [price](https://openai.com/pricing) will be higher than obvious from the pricing page, so keep an eye on costs while using. Setting a higher chunk-size will reduce costs significantly.
Install with: `pip install openai`

For running with the openai-api backend, make sure that your [OpenAI api key](https://platform.openai.com/api-keys) is set in the `OPENAI_API_KEY` environment variable. For example, before running, do: `export OPENAI_API_KEY=sk-xxx` with *sk-xxx* replaced with your api key.

The backend is loaded only when chosen. The unused one does not have to be installed.
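The cost warning for the openai-api backend above (each audio fragment is processed multiple times, so billed seconds far exceed audio length) can be illustrated with a toy model. This is a rough, hypothetical sketch, not the repo's actual accounting; the function name and parameters are illustrative only:

```python
def estimated_api_seconds(audio_sec: float, chunk_sec: float, buffer_sec: float) -> float:
    """Hypothetical cost model: each new chunk triggers one API request that
    re-sends the whole current audio buffer (capped at buffer_sec by trimming),
    so billed seconds far exceed the audio's real length. Illustration only."""
    billed = 0.0
    t = 0.0
    while t < audio_sec:
        t += chunk_sec            # wait for the next min-chunk of audio
        billed += min(t, buffer_sec)  # the whole buffer is re-sent each time
    return billed

# Larger chunks mean fewer requests and fewer re-sent seconds:
small = estimated_api_seconds(60, chunk_sec=1, buffer_sec=15)  # many requests
big = estimated_api_seconds(60, chunk_sec=5, buffer_sec=15)    # fewer requests
```

Under this toy model, one minute of audio with 1-second chunks bills over ten times the audio length, which is why a higher chunk-size reduces costs.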
Or: **Seamless Streaming** -- alternative to Whisper, wrapped to enable the same operation modes and input/output format.

`pip install fairseq2 pydub sentencepiece git+https://github.com/facebookresearch/seamless_communication.git`

Installation suggested [here](https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb), for special torch version cases refer to [fairseq2](https://github.com/facebookresearch/fairseq2#variants).

3) Optional, not recommended: sentence segmenter (aka sentence tokenizer)

Two buffer trimming options are integrated and evaluated. They have impact on
Two buffer trimming options are integrated and evaluated for Whisper backends. They have impact on
the quality and latency. The default "segment" option performs better according
to our tests and does not require any sentence segmentation installed.
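The trimming rule described above can be sketched as: once the audio buffer holds more than `--buffer_trimming_sec` seconds, cut it at the end of the last completed sentence/segment. A minimal sketch, with illustrative names rather than the repo's exact API:

```python
def trim_buffer(audio, committed_end_times, buffer_time_offset,
                trimming_sec=15.0, sr=16000):
    """Sketch of segment-style buffer trimming: if the buffer exceeds
    trimming_sec of audio, drop everything up to the end time of the latest
    completed segment. committed_end_times are absolute times in seconds;
    names and signature are hypothetical."""
    if len(audio) / sr <= trimming_sec or not committed_end_times:
        return audio, buffer_time_offset      # nothing to trim yet
    cut_at = committed_end_times[-1]          # absolute time of last completed segment
    cut = int((cut_at - buffer_time_offset) * sr)
    return audio[cut:], cut_at                # shorter buffer, advanced offset
```

The real implementation (`chunk_at` in whisper_online.py) additionally keeps the hypothesis buffer in sync, but the length check and cut-point choice follow this shape.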
@@ -76,8 +75,9 @@ In case of installation issues of opus-fast-mosestokenizer, especially on Window

### Real-time simulation from audio file

```
usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR] [--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}]
[--backend {faster-whisper,whisper_timestamped,openai-api}] [--vad] [--buffer_trimming {sentence,segment}] [--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
usage: whisper_online.py [-h] [--min-chunk-size MIN_CHUNK_SIZE] [--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}] [--model_cache_dir MODEL_CACHE_DIR]
[--model_dir MODEL_DIR] [--lan LAN] [--task {transcribe,translate}] [--backend {faster-whisper,whisper_timestamped,seamless}] [--vad] [--buffer_trimming {sentence,segment}]
[--buffer_trimming_sec BUFFER_TRIMMING_SEC] [--start_at START_AT] [--offline] [--comp_unaware]
audio_path

positional arguments:
@@ -86,24 +86,26 @@ positional arguments:
options:
-h, --help show this help message and exit
--min-chunk-size MIN_CHUNK_SIZE
Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received
by this time. Applicable both to Whisper and seamless.
--model {tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large}
Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.
Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir. Not applicable to seamless.
--model_cache_dir MODEL_CACHE_DIR
Overriding the default model cache dir where models downloaded from the hub are saved
Overriding the default model cache dir where models downloaded from the hub are saved. Not applicable to seamless.
--model_dir MODEL_DIR
Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter. Not applicable to seamless.
--lan LAN, --language LAN
Source language code, e.g. en,de,cs, or 'auto' for language detection.
Language code for transcription, e.g. en,de,cs. Seamless backend has its own 3-letter language codes, e.g. eng, deu, ces.
--task {transcribe,translate}
Transcribe or translate.
--backend {faster-whisper,whisper_timestamped,openai-api}
Load only this backend for Whisper processing.
--vad Use VAD = voice activity detection, with the default parameters.
--backend {faster-whisper,whisper_timestamped,seamless}
Load only this backend for Whisper processing, or Seamless Streaming.
--vad Use VAD = voice activity detection, with the default parameters. Not applicable to seamless.
--buffer_trimming {sentence,segment}
Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter
must be installed for "sentence" option. Not applicable to seamless.
--buffer_trimming_sec BUFFER_TRIMMING_SEC
Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered. Not applicable to seamless.
--start_at START_AT Start processing audio at this time.
--offline Offline mode.
--comp_unaware Computationally unaware simulation.
@@ -155,7 +157,7 @@ The code whisper_online.py is nicely commented, read it as the full documentatio

This pseudocode describes the interface that we suggest for your implementation. You can implement any features that you need for your application.

```python
```
from whisper_online import *

src_lan = "en" # source language

@@ -183,7 +185,7 @@ online.init() # refresh if you're going to re-use the object for the next audio
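The suggested interface (init, insert_audio_chunk, process_iter, finish) can be exercised end-to-end with a stand-in processor. This is a dummy for illustration only, not a real backend; a real setup constructs `OnlineASRProcessor` with an ASR object as in the pseudocode above:

```python
class EchoProcessor:
    """Dummy stand-in implementing the suggested interface. It 'transcribes'
    audio into a note of its duration, so the control flow can be tested
    without loading any model. Illustrative only."""
    SAMPLING_RATE = 16000

    def init(self):
        self.buffer = []   # float samples; a real processor uses a numpy array
        self.t = 0.0       # running end time in seconds

    def insert_audio_chunk(self, audio):
        self.buffer.extend(audio)

    def process_iter(self):
        sec = len(self.buffer) / self.SAMPLING_RATE
        b, self.t = self.t, self.t + sec
        self.buffer = []
        return (b, self.t, f"[{sec:.1f}s of audio]")

    def finish(self):
        return self.process_iter()

online = EchoProcessor()
online.init()
online.insert_audio_chunk([0.0] * 16000)   # one second of silence
print(online.process_iter())               # -> (0.0, 1.0, '[1.0s of audio]')
```

The `(beg, end, text)` return shape matches what the real processors emit.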
### Server -- real-time from mic

`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection and the `--warmup-file`. See the help message (`-h` option).
`whisper_online_server.py` has the same model options as `whisper_online.py`, plus `--host` and `--port` of the TCP connection. See help message (`-h` option).

Client example:

@@ -224,20 +226,12 @@ In more detail: we use the init prompt, we handle the inaccurate timestamps, we
re-process confirmed sentence prefixes and skip them, making sure they don't
overlap, and we limit the processing buffer window.

Contributions are welcome.

### Performance evaluation

[See the paper.](http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main-demo/cdrom/pdf/2023.ijcnlp-demo.3.pdf)

### Contributions

Contributions are welcome. We acknowledge especially:

- [The GitHub contributors](https://github.com/ufal/whisper_streaming/graphs/contributors) for their pull requests with new features and bugfixes.
- [The translation of this repo into Chinese.](https://github.com/Gloridust/whisper_streaming_CN)
- [Ondřej Plátek](https://opla.cz/) for the paper pre-review.
- [Peter Polák](https://ufal.mff.cuni.cz/peter-polak) for the original idea.
- The UEDIN team of the [ELITR project](https://elitr.eu) for the original line_packet.py.

## Contact

@@ -2,6 +2,8 @@

"""Functions for sending and receiving individual lines of text over a socket.

Used by marian-server-server.py to communicate with the Marian worker.

A line is transmitted using one or more fixed-size packets of UTF-8 bytes
containing:

@@ -9,7 +11,6 @@ containing:

- Zero or more \0 bytes as required to pad the packet to PACKET_SIZE

Originally from the UEDIN team of the ELITR project.
"""

PACKET_SIZE = 65536
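The packet format described in the line_packet.py docstring above (fixed-size UTF-8 packets, padded with `\0` bytes to `PACKET_SIZE`) can be sketched with two pure helper functions. This is a simplified sketch assuming the line fits in one packet; the real line_packet.py may split a long line across several packets:

```python
PACKET_SIZE = 65536  # from line_packet.py

def pack_line(line: str) -> bytes:
    """Encode one text line as a single fixed-size packet,
    padded with zero bytes up to PACKET_SIZE (sketch)."""
    return line.encode("utf-8").ljust(PACKET_SIZE, b"\0")

def unpack_packet(packet: bytes) -> str:
    """Strip the zero-byte padding and decode the line back."""
    return packet.rstrip(b"\0").decode("utf-8")
```

On the wire, the sender would `sendall(pack_line(line))` and the receiver would read exactly `PACKET_SIZE` bytes before calling `unpack_packet`.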
seamless_integration.py -- new file, 172 lines

@@ -0,0 +1,172 @@
#!/usr/bin/env python3
import sys
import numpy as np

# code extracted from https://github.com/facebookresearch/seamless_communication/blob/main/Seamless_Tutorial.ipynb :

from simuleval.data.segments import SpeechSegment, EmptySegment
from simuleval.utils.arguments import cli_argument_list
from simuleval import options

from typing import Union, List
from simuleval.data.segments import Segment, TextSegment
from simuleval.agents.pipeline import TreeAgentPipeline
from simuleval.agents.states import AgentStates

SAMPLE_RATE = 16000

def reset_states(system, states):
    if isinstance(system, TreeAgentPipeline):
        states_iter = states.values()
    else:
        states_iter = states
    for state in states_iter:
        state.reset()

def get_states_root(system, states) -> AgentStates:
    if isinstance(system, TreeAgentPipeline):
        # self.states is a dict
        return states[system.source_module]
    else:
        # self.states is a list
        return system.states[0]

def build_streaming_system(model_configs, agent_class):
    parser = options.general_parser()
    parser.add_argument("-f", "--f", help="a dummy argument to fool ipython", default="1")

    agent_class.add_args(parser)
    args, _ = parser.parse_known_args(cli_argument_list(model_configs))
    system = agent_class.from_args(args)
    return system

class OutputSegments:
    def __init__(self, segments: Union[List[Segment], Segment]):
        if isinstance(segments, Segment):
            segments = [segments]
        self.segments: List[Segment] = [s for s in segments]

    @property
    def is_empty(self):
        return all(segment.is_empty for segment in self.segments)

    @property
    def finished(self):
        return all(segment.finished for segment in self.segments)


######################
# fixing DetokenizerAgent -- it strips the trailing space from segment.content, but sometimes a word is split across multiple segments, so simple joining with spaces would be wrong.
from seamless_communication.streaming.agents.detokenizer import DetokenizerAgent
from seamless_communication.streaming.agents.offline_w2v_bert_encoder import (
    OfflineWav2VecBertEncoderAgent,
)
from seamless_communication.streaming.agents.online_feature_extractor import (
    OnlineFeatureExtractorAgent,
)
from seamless_communication.streaming.agents.online_text_decoder import (
    MMASpeechToTextDecoderAgent,
)
from seamless_communication.streaming.agents.silero_vad import SileroVADAgent
from seamless_communication.streaming.agents.unity_pipeline import UnitYAgentPipeline

class FixDetokenizerAgent(DetokenizerAgent):
    def decode(self, x: str) -> str:
        return x.replace(" ", "").replace("\u2581", " ")  # .strip() is removed

class FixSeamlessStreamingS2TVADAgent(UnitYAgentPipeline):
    pipeline = [
        SileroVADAgent,
        OnlineFeatureExtractorAgent,
        OfflineWav2VecBertEncoderAgent,
        MMASpeechToTextDecoderAgent,
        FixDetokenizerAgent,
    ]
##################################

# the next pieces are copy-pasted from the tutorial and put into the corresponding methods

#class SeamlessProcessor(OnlineASRProcessorBase):  # TODO: there should be a common base class. But the code would not be simple anymore.
class SeamlessProcessor:
    '''
    Wrapping SeamlessStreaming for the same operation modes as
    Whisper-Streaming's OnlineASRProcessor.
    '''
    def __init__(self, tgt_lan, task, logfile=sys.stderr):
        '''
        tgt_lan: must be 3-letter language code that Seamless-Streaming supports for text output mode.
        task: see below
        logfile
        '''
        if task in ("transcribe","asr"):
            task_arg = "asr"
        elif task in ("translate","s2tt"):
            task_arg = "s2tt"
        else:
            raise ValueError("task argument must be 'transcribe' or 'translate', or 'asr' or 's2tt'")

        self.logfile = logfile

        agent_class = FixSeamlessStreamingS2TVADAgent

        model_configs = dict(
            source_segment_size=320,
            device="cuda:0",
            dtype="fp16",
            min_starting_wait_w2vbert=192,
            decision_threshold=0.5,
            min_unit_chunk_size=50,
            no_early_stop=True,
            max_len_a=0,
            max_len_b=100,
            task=task_arg,
            tgt_lang=tgt_lan,
            block_ngrams=True,
            detokenize_only=True,
        )
        self.tgt_lan = tgt_lan

        self.system = build_streaming_system(model_configs, agent_class)

        self.system_states = self.system.build_states()

        self.init()

    def init(self):
        reset_states(self.system, self.system_states)
        self.audio_buffer = np.array([],dtype=np.float32)
        self.beg, self.end = 0, 0

    def insert_audio_chunk(self, audio):
        self.audio_buffer = np.append(self.audio_buffer, audio)

    def process_segment(self, input_segment):
        output_segments = OutputSegments(self.system.pushpop(input_segment, self.system_states))
        out = []
        for segment in output_segments.segments:
            if not segment.is_empty:
                out.append(segment.content)
        if output_segments.finished:
            print("End of VAD segment",file=self.logfile)
            reset_states(self.system, self.system_states)
        if out:
            b = self.beg
            self.beg = self.end
            return (b, self.end, "".join(out))
        return (None, None, "")

    def process_iter(self, finished=False):
        input_segment = SpeechSegment(
            content=self.audio_buffer,
            sample_rate=SAMPLE_RATE,
            finished=finished,
        )
        self.audio_buffer = np.array([],dtype=np.float32)
        input_segment.tgt_lang = self.tgt_lan
        self.end += (len(input_segment.content)/SAMPLE_RATE)
        return self.process_segment(input_segment)

    def finish(self):
        return self.process_iter(finished=True)
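The beg/end bookkeeping that SeamlessProcessor above uses to assign timestamps to emitted text can be isolated and tested without the model. This is a sketch of just that logic (the class name is illustrative), mirroring process_iter/process_segment: the end time advances with every pushed chunk, and beg jumps forward only when text is actually emitted:

```python
class SegmentClock:
    """Sketch of SeamlessProcessor's timestamp bookkeeping: end advances by
    the duration of each pushed chunk; when text is emitted, the returned
    interval is (previous beg, current end) and beg catches up to end."""
    SAMPLE_RATE = 16000

    def __init__(self):
        self.beg = self.end = 0.0

    def push(self, n_samples: int, text: str):
        self.end += n_samples / self.SAMPLE_RATE
        if text:
            b, self.beg = self.beg, self.end
            return (b, self.end, text)
        return (None, None, "")
```

So silence widens the pending interval, and the next emitted text is stamped with the whole span since the last emission.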
@@ -4,18 +4,12 @@ import numpy as np
import librosa
from functools import lru_cache
import time
import logging

import io
import soundfile as sf
import math

logger = logging.getLogger(__name__)

@lru_cache
def load_audio(fname):
    a, _ = librosa.load(fname, sr=16000, dtype=np.float32)
    a, _ = librosa.load(fname, sr=16000)
    return a

def load_audio_chunk(fname, beg, end):

@@ -36,10 +30,7 @@ class ASRBase:
        self.logfile = logfile

        self.transcribe_kargs = {}
        if lan == "auto":
            self.original_language = None
        else:
            self.original_language = lan
        self.original_language = lan

        self.model = self.load_model(modelsize, cache_dir, model_dir)

@@ -63,11 +54,10 @@ class WhisperTimestampedASR(ASRBase):

    def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
        import whisper
        import whisper_timestamped
        from whisper_timestamped import transcribe_timestamped
        self.transcribe_timestamped = transcribe_timestamped
        if model_dir is not None:
            logger.debug("ignoring model_dir, not implemented")
            print("ignoring model_dir, not implemented",file=self.logfile)
        return whisper.load_model(modelsize, download_root=cache_dir)

    def transcribe(self, audio, init_prompt=""):

@@ -106,9 +96,8 @@ class FasterWhisperASR(ASRBase):

    def load_model(self, modelsize=None, cache_dir=None, model_dir=None):
        from faster_whisper import WhisperModel
        # logging.getLogger("faster_whisper").setLevel(logger.level)
        if model_dir is not None:
            logger.debug(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.")
            print(f"Loading whisper model from model_dir {model_dir}. modelsize and cache_dir parameters are not used.",file=self.logfile)
            model_size_or_path = model_dir
        elif modelsize is not None:
            model_size_or_path = modelsize

@@ -129,11 +118,8 @@ class FasterWhisperASR(ASRBase):
        return model

    def transcribe(self, audio, init_prompt=""):

        # tested: beam_size=5 is faster and better than 1 (on one 200 second document from En ESIC, min chunk 0.01)
        segments, info = self.model.transcribe(audio, language=self.original_language, initial_prompt=init_prompt, beam_size=5, word_timestamps=True, condition_on_previous_text=True, **self.transcribe_kargs)
        #print(info)  # info contains language detection result

        return list(segments)

    def ts_words(self, segments):

@@ -156,93 +142,6 @@ class FasterWhisperASR(ASRBase):
        self.transcribe_kargs["task"] = "translate"


class OpenaiApiASR(ASRBase):
    """Uses OpenAI's Whisper API for audio transcription."""

    def __init__(self, lan=None, temperature=0, logfile=sys.stderr):
        self.logfile = logfile

        self.modelname = "whisper-1"
        self.original_language = None if lan == "auto" else lan  # ISO-639-1 language code
        self.response_format = "verbose_json"
        self.temperature = temperature

        self.load_model()

        self.use_vad_opt = False

        # reset the task in set_translate_task
        self.task = "transcribe"

    def load_model(self, *args, **kwargs):
        from openai import OpenAI
        self.client = OpenAI()

        self.transcribed_seconds = 0  # for logging how many seconds were processed by API, to know the cost

    def ts_words(self, segments):
        no_speech_segments = []
        if self.use_vad_opt:
            for segment in segments.segments:
                # TODO: threshold can be set from outside
                if segment["no_speech_prob"] > 0.8:
                    no_speech_segments.append((segment.get("start"), segment.get("end")))

        o = []
        for word in segments.words:
            start = word.get("start")
            end = word.get("end")
            if any(s[0] <= start <= s[1] for s in no_speech_segments):
                # print("Skipping word", word.get("word"), "because it's in a no-speech segment")
                continue
            o.append((start, end, word.get("word")))
        return o

    def segments_end_ts(self, res):
        return [s["end"] for s in res.words]

    def transcribe(self, audio_data, prompt=None, *args, **kwargs):
        # Write the audio data to a buffer
        buffer = io.BytesIO()
        buffer.name = "temp.wav"
        sf.write(buffer, audio_data, samplerate=16000, format='WAV', subtype='PCM_16')
        buffer.seek(0)  # Reset buffer's position to the beginning

        self.transcribed_seconds += math.ceil(len(audio_data)/16000)  # it rounds up to the whole seconds

        params = {
            "model": self.modelname,
            "file": buffer,
            "response_format": self.response_format,
            "temperature": self.temperature,
            "timestamp_granularities": ["word", "segment"]
        }
        if self.task != "translate" and self.original_language:
            params["language"] = self.original_language
        if prompt:
            params["prompt"] = prompt

        if self.task == "translate":
            proc = self.client.audio.translations
        else:
            proc = self.client.audio.transcriptions

        # Process transcription/translation
        transcript = proc.create(**params)
        logger.debug(f"OpenAI API processed accumulated {self.transcribed_seconds} seconds")

        return transcript

    def use_vad(self):
        self.use_vad_opt = True

    def set_translate_task(self):
        self.task = "translate"



class HypothesisBuffer:

@@ -274,11 +173,9 @@ class HypothesisBuffer:
            c = " ".join([self.commited_in_buffer[-j][2] for j in range(1,i+1)][::-1])
            tail = " ".join(self.new[j-1][2] for j in range(1,i+1))
            if c == tail:
                words = []
                print("removing last",i,"words:",file=self.logfile)
                for j in range(i):
                    words.append(repr(self.new.pop(0)))
                words_msg = " ".join(words)
                logger.debug(f"removing last {i} words: {words_msg}")
                print("\t",self.new.pop(0),file=self.logfile)
                break

    def flush(self):

@@ -311,7 +208,18 @@ class HypothesisBuffer:
    def complete(self):
        return self.buffer

class OnlineASRProcessor:
class OnlineASRProcessorBase:
    '''Showing minimum common public interface for various specialized subclasses.'''
    def init(self):
        raise NotImplementedError()
    def insert_audio_chunk(self, audio):
        raise NotImplementedError()
    def process_iter(self):
        raise NotImplementedError()
    def finish(self):
        raise NotImplementedError()
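The local agreement policy that HypothesisBuffer implements (the paper's core streaming idea) can be sketched independently: commit the longest common prefix of two consecutive hypotheses and keep the rest tentative. A minimal sketch over word lists, ignoring the timestamps the real buffer tracks:

```python
def local_agreement(prev_hyp, new_hyp):
    """Sketch of the local-agreement policy: words are committed only once
    two consecutive hypotheses agree on them; the disagreeing tail stays
    tentative and may still change on the next update."""
    committed = []
    for a, b in zip(prev_hyp, new_hyp):
        if a != b:
            break
        committed.append(a)
    return committed, new_hyp[len(committed):]
```

For example, if the first pass hears "the cat sat" and the second "the cat sa down", only "the cat" is committed; "sa down" waits for the next hypothesis.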
class OnlineASRProcessor(OnlineASRProcessorBase):
|
||||
|
||||
SAMPLING_RATE = 16000
|
||||
|
||||
@@ -337,6 +245,9 @@ class OnlineASRProcessor:
|
||||
|
||||
self.transcript_buffer = HypothesisBuffer(logfile=self.logfile)
|
||||
self.commited = []
|
||||
self.last_chunked_at = 0
|
||||
|
||||
self.silence_iters = 0
|
||||
|
||||
def insert_audio_chunk(self, audio):
|
||||
self.audio_buffer = np.append(self.audio_buffer, audio)
|
||||
@@ -346,7 +257,7 @@ class OnlineASRProcessor:
|
||||
"context" is the commited text that is inside the audio buffer. It is transcribed again and skipped. It is returned only for debugging and logging reasons.
|
||||
"""
|
||||
k = max(0,len(self.commited)-1)
|
||||
while k > 0 and self.commited[k-1][1] > self.buffer_time_offset:
|
||||
while k > 0 and self.commited[k-1][1] > self.last_chunked_at:
|
||||
k -= 1
|
||||
|
||||
p = self.commited[:k]
|
||||
@@ -367,9 +278,9 @@ class OnlineASRProcessor:
|
||||
"""
|
||||
|
||||
prompt, non_prompt = self.prompt()
|
||||
logger.debug(f"PROMPT: {prompt}")
|
||||
logger.debug(f"CONTEXT: {non_prompt}")
|
||||
logger.debug(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}")
|
||||
print("PROMPT:", prompt, file=self.logfile)
|
||||
print("CONTEXT:", non_prompt, file=self.logfile)
|
||||
print(f"transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f} seconds from {self.buffer_time_offset:2.2f}",file=self.logfile)
|
||||
res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt)
|
||||
|
||||
# transform to [(beg,end,"word1"), ...]
|
||||
@@ -378,10 +289,8 @@ class OnlineASRProcessor:
|
||||
self.transcript_buffer.insert(tsw, self.buffer_time_offset)
|
||||
o = self.transcript_buffer.flush()
|
||||
self.commited.extend(o)
|
||||
completed = self.to_flush(o)
|
||||
logger.debug(f">>>>COMPLETE NOW: {completed}")
|
||||
the_rest = self.to_flush(self.transcript_buffer.complete())
|
||||
logger.debug(f"INCOMPLETE: {the_rest}")
|
||||
print(">>>>COMPLETE NOW:",self.to_flush(o),file=self.logfile,flush=True)
|
||||
print("INCOMPLETE:",self.to_flush(self.transcript_buffer.complete()),file=self.logfile,flush=True)
|
||||
|
||||
# there is a newly confirmed text
|
||||
|
||||
@@ -405,18 +314,18 @@ class OnlineASRProcessor:
|
||||
#while k>0 and self.commited[k][1] > l:
|
||||
# k -= 1
|
||||
#t = self.commited[k][1]
|
||||
logger.debug("chunking segment")
|
||||
print(f"chunking segment",file=self.logfile)
|
||||
#self.chunk_at(t)
|
||||
|
||||
logger.debug(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}")
|
||||
print(f"len of buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:2.2f}",file=self.logfile)
|
||||
return self.to_flush(o)
|
||||
|
||||
def chunk_completed_sentence(self):
|
||||
if self.commited == []: return
|
||||
logger.debug(self.commited)
|
||||
print(self.commited,file=self.logfile)
|
||||
sents = self.words_to_sentences(self.commited)
|
||||
for s in sents:
|
||||
logger.debug(f"\t\tSENT: {s}")
|
||||
print("\t\tSENT:",s,file=self.logfile)
|
||||
if len(sents) < 2:
|
||||
return
|
||||
while len(sents) > 2:
|
||||
@@ -424,7 +333,7 @@ class OnlineASRProcessor:
|
||||
# we will continue with audio processing at this timestamp
|
||||
chunk_at = sents[-2][1]
|
||||
|
||||
logger.debug(f"--- sentence chunked at {chunk_at:2.2f}")
|
||||
print(f"--- sentence chunked at {chunk_at:2.2f}",file=self.logfile)
|
||||
self.chunk_at(chunk_at)
|
||||
|
||||
def chunk_completed_segment(self, res):
|
||||
@@ -441,12 +350,12 @@ class OnlineASRProcessor:
|
||||
ends.pop(-1)
|
||||
e = ends[-2]+self.buffer_time_offset
|
||||
if e <= t:
|
||||
logger.debug(f"--- segment chunked at {e:2.2f}")
|
||||
print(f"--- segment chunked at {e:2.2f}",file=self.logfile)
|
||||
self.chunk_at(e)
|
||||
else:
|
||||
logger.debug(f"--- last segment not within commited area")
|
||||
print(f"--- last segment not within commited area",file=self.logfile)
|
||||
else:
|
||||
logger.debug(f"--- not enough segments to chunk")
|
||||
print(f"--- not enough segments to chunk",file=self.logfile)
|
||||
|
||||
|
||||
|
||||
@@ -459,6 +368,7 @@ class OnlineASRProcessor:
        cut_seconds = time - self.buffer_time_offset
        self.audio_buffer = self.audio_buffer[int(cut_seconds*self.SAMPLING_RATE):]
        self.buffer_time_offset = time
        self.last_chunked_at = time

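The buffer trim in `chunk_at` above can be exercised in isolation. A minimal sketch; `chunk_buffer` and the sample values are hypothetical, only the slicing arithmetic mirrors the code:

```python
import numpy as np

SAMPLING_RATE = 16000  # the rate the processor assumes

def chunk_buffer(audio_buffer, buffer_time_offset, time):
    # drop everything before the absolute timestamp `time`,
    # then make `time` the new start of the buffer
    cut_seconds = time - buffer_time_offset
    trimmed = audio_buffer[int(cut_seconds * SAMPLING_RATE):]
    return trimmed, time  # (new audio_buffer, new buffer_time_offset)

buf = np.zeros(5 * SAMPLING_RATE, dtype=np.float32)  # 5 s of audio starting at t=10 s
new_buf, new_offset = chunk_buffer(buf, 10.0, 12.0)  # trim everything before t=12 s
```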
    def words_to_sentences(self, words):
        """Uses self.tokenizer for sentence segmentation of words.
@@ -492,7 +402,7 @@ class OnlineASRProcessor:
        """
        o = self.transcript_buffer.complete()
        f = self.to_flush(o)
        logger.debug(f"last, noncommited: {f}")
        print("last, noncommited:",f,file=self.logfile)
        return f

@@ -511,6 +421,7 @@ class OnlineASRProcessor:
        e = offset + sents[-1][1]
        return (b,e,t)

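`finish()` returns whatever `to_flush` assembles. A standalone sketch of that assembly; the function body is reconstructed from the fragments above, and the default separator is an assumption:

```python
def to_flush(sents, sep=" ", offset=0):
    # join committed (beg, end, text) triples into one (beg, end, text) segment
    t = sep.join(s[2] for s in sents)
    if len(sents) == 0:
        return (None, None, "")
    b = offset + sents[0][0]   # start of the first word
    e = offset + sents[-1][1]  # end of the last word
    return (b, e, t)

seg = to_flush([(0.0, 0.4, "hello"), (0.5, 0.9, "world")])
```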
WHISPER_LANG_CODES = "af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,zh".split(",")

def create_tokenizer(lan):
@@ -532,7 +443,7 @@ def create_tokenizer(lan):

    # the following languages are in Whisper, but not in wtpsplit:
    if lan in "as ba bo br bs fo haw hr ht jw lb ln lo mi nn oc sa sd sn so su sw tk tl tt".split():
        logger.debug(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.")
        print(f"{lan} code is not supported by wtpsplit. Going to use None lang_code option.", file=sys.stderr)
        lan = None

    from wtpsplit import WtP

@@ -548,71 +459,18 @@ def add_shared_args(parser):
    """shared args for simulation (this entry point) and server
    parser: argparse.ArgumentParser object
    """
    parser.add_argument('--min-chunk-size', type=float, default=1.0, help='Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.')
    parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.")
    parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
    parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
    parser.add_argument('--lan', '--language', type=str, default='auto', help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
    parser.add_argument('--min-chunk-size', type=float, default=1.0, help='Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time. Applicable both to Whisper and seamless.')
    parser.add_argument('--model', type=str, default='large-v2', choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large".split(","),help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir. Not applicable to seamless.")
    parser.add_argument('--model_cache_dir', type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved. Not applicable to seamless.")
    parser.add_argument('--model_dir', type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter. Not applicable to seamless.")
    parser.add_argument('--lan', '--language', type=str, default='en', help="Language code for transcription, e.g. en,de,cs. Seamless backend has its own 3-letter language codes, e.g. eng, deu, ces.")
    parser.add_argument('--task', type=str, default='transcribe', choices=["transcribe","translate"],help="Transcribe or translate.")
    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "openai-api"],help='Load only this backend for Whisper processing.')
    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters.')
    parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
    parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.')
    parser.add_argument("-l", "--log-level", dest="log_level", choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'], help="Set the log level", default='DEBUG')
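A minimal sketch of how these shared flags parse, using a subset of the options and the seamless-side defaults shown above; the sample argv is hypothetical:

```python
import argparse

parser = argparse.ArgumentParser()
# subset of add_shared_args()
parser.add_argument('--min-chunk-size', type=float, default=1.0)
parser.add_argument('--lan', '--language', type=str, default='en')
parser.add_argument('--task', type=str, default='transcribe',
                    choices=["transcribe", "translate"])
parser.add_argument('--backend', type=str, default="faster-whisper",
                    choices=["faster-whisper", "whisper_timestamped", "seamless"])

# a seamless run uses 3-letter language codes such as "eng"
args = parser.parse_args(["--backend", "seamless", "--lan", "eng", "--task", "translate"])
```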

def asr_factory(args, logfile=sys.stderr):
    """
    Creates and configures an ASR and ASR Online instance based on the specified backend and arguments.
    """
    backend = args.backend
    if backend == "openai-api":
        logger.debug("Using OpenAI API.")
        asr = OpenaiApiASR(lan=args.lan)
    else:
        if backend == "faster-whisper":
            asr_cls = FasterWhisperASR
        else:
            asr_cls = WhisperTimestampedASR

        # Only for FasterWhisperASR and WhisperTimestampedASR
        size = args.model
        t = time.time()
        logger.info(f"Loading Whisper {size} model for {args.lan}...")
        asr = asr_cls(modelsize=size, lan=args.lan, cache_dir=args.model_cache_dir, model_dir=args.model_dir)
        e = time.time()
        logger.info(f"done. It took {round(e-t,2)} seconds.")

    # Apply common configurations
    if getattr(args, 'vad', False): # Checks if VAD argument is present and True
        logger.info("Setting VAD filter")
        asr.use_vad()

    language = args.lan
    if args.task == "translate":
        asr.set_translate_task()
        tgt_language = "en" # Whisper translates into English
    else:
        tgt_language = language # Whisper transcribes in this language

    # Create the tokenizer
    if args.buffer_trimming == "sentence":
        tokenizer = create_tokenizer(tgt_language)
    else:
        tokenizer = None

    # Create the OnlineASRProcessor
    online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))

    return asr, online

def set_logging(args,logger,other="_server"):
    logging.basicConfig(#format='%(name)s
            format='%(levelname)s\t%(message)s')
    logger.setLevel(args.log_level)
    logging.getLogger("whisper_online"+other).setLevel(args.log_level)
    # logging.getLogger("whisper_online_server").setLevel(args.log_level)

    parser.add_argument('--backend', type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "seamless"],help='Load only this backend for Whisper processing, or Seamless Streaming.')
    parser.add_argument('--vad', action="store_true", default=False, help='Use VAD = voice activity detection, with the default parameters. Not applicable to seamless.')
    parser.add_argument('--buffer_trimming', type=str, default="segment", choices=["sentence", "segment"],help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option. Not applicable to seamless.')
    parser.add_argument('--buffer_trimming_sec', type=float, default=15, help='Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered. Not applicable to seamless.')

## main:

if __name__ == "__main__":

@@ -630,29 +488,61 @@ if __name__ == "__main__":
    logfile = sys.stderr

    if args.offline and args.comp_unaware:
        logger.error("No or one option from --offline and --comp_unaware are available, not both. Exiting.")
        print("No or one option from --offline and --comp_unaware are available, not both. Exiting.",file=logfile)
        sys.exit(1)

    # if args.log_level:
    #     logging.basicConfig(format='whisper-%(levelname)s:%(name)s: %(message)s',
    #                         level=getattr(logging, args.log_level))

    set_logging(args,logger)

    audio_path = args.audio_path

    SAMPLING_RATE = 16000
    duration = len(load_audio(audio_path))/SAMPLING_RATE
    logger.info("Audio duration is: %2.2f seconds" % duration)
    print("Audio duration is: %2.2f seconds" % duration, file=logfile)

    size = args.model
    language = args.lan

    asr, online = asr_factory(args, logfile=logfile)
    min_chunk = args.min_chunk_size

    if args.backend != "seamless":
        # loading Whisper model
        t = time.time()
        print(f"Loading Whisper {size} model for {language}...",file=logfile,end=" ",flush=True)

        # load the audio into the LRU cache before we start the timer
        a = load_audio_chunk(audio_path,0,1)
        if args.backend == "faster-whisper":
            asr_cls = FasterWhisperASR
        elif args.backend == "whisper_timestamped":
            asr_cls = WhisperTimestampedASR

        # warm up the ASR because the very first transcribe takes much more time than the other
        asr.transcribe(a)
        asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)

        e = time.time()
        print(f"done. It took {round(e-t,2)} seconds.",file=logfile)

        if args.vad:
            print("setting VAD filter",file=logfile)
            asr.use_vad()
        if args.task == "translate":
            asr.set_translate_task()
            tgt_language = "en" # Whisper translates into English
        else:
            tgt_language = language # Whisper transcribes in this language

        if args.buffer_trimming == "sentence":
            tokenizer = create_tokenizer(tgt_language)
        else:
            tokenizer = None

        online = OnlineASRProcessor(asr,tokenizer,logfile=logfile,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
        # load the audio into the LRU cache before we start the timer
        a = load_audio_chunk(audio_path,0,1)

        # warm up the ASR, because the very first transcribe takes much more time than the other
        asr.transcribe(a)

    else:
        print(f"Loading Seamless Streaming backend model",file=logfile,flush=True)

        from seamless_integration import SeamlessProcessor
        online = SeamlessProcessor(language, args.task, logfile=logfile)

    beg = args.start_at
    start = time.time()-beg

@@ -670,16 +560,16 @@ if __name__ == "__main__":
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),file=logfile,flush=True)
            print("%1.4f %1.0f %1.0f %s" % (now*1000, o[0]*1000,o[1]*1000,o[2]),flush=True)
        else:
            # No text, so no output
            pass
            print(o,file=logfile,flush=True)

    if args.offline: ## offline mode processing (for testing/debugging)
        a = load_audio(audio_path)
        online.insert_audio_chunk(a)
        try:
            o = online.process_iter()
        except AssertionError as e:
            logger.error(f"assertion error: {repr(e)}")
        except AssertionError:
            print("assertion error",file=logfile)
            pass
        else:
            output_transcript(o)
        now = None
@@ -690,13 +580,13 @@ if __name__ == "__main__":
            online.insert_audio_chunk(a)
            try:
                o = online.process_iter()
            except AssertionError as e:
                logger.error(f"assertion error: {repr(e)}")
            except AssertionError:
                print("assertion error",file=logfile)
                pass
            else:
                output_transcript(o, now=end)

            logger.debug(f"## last processed {end:.2f}s")
            print(f"## last processed {end:.2f}s",file=logfile,flush=True)

            if end >= duration:
                break
@@ -722,13 +612,13 @@ if __name__ == "__main__":

            try:
                o = online.process_iter()
            except AssertionError as e:
                logger.error(f"assertion error: {e}")
            except AssertionError:
                print("assertion error",file=logfile)
                pass
            else:
                output_transcript(o)
            now = time.time() - start
            logger.debug(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}")
            print(f"## last processed {end:.2f} s, now is {now:.2f}, the latency is {now-end:.2f}",file=logfile,flush=True)

            if end >= duration:
                break

@@ -4,54 +4,85 @@ from whisper_online import *
import sys
import argparse
import os
import logging
import numpy as np

logger = logging.getLogger(__name__)
parser = argparse.ArgumentParser()

# server options
parser.add_argument("--host", type=str, default='localhost')
parser.add_argument("--port", type=int, default=43007)
parser.add_argument("--warmup-file", type=str, dest="warmup_file",
        help="The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. It can be e.g. https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav .")


# options from whisper_online
add_shared_args(parser)
args = parser.parse_args()

set_logging(args,logger,other="")

# setting whisper object by args

SAMPLING_RATE = 16000

size = args.model
language = args.lan
asr, online = asr_factory(args)
min_chunk = args.min_chunk_size

# warm up the ASR because the very first transcribe takes more time than the others.
# Test results in https://github.com/ufal/whisper_streaming/pull/81
msg = "Whisper is not warmed up. The first chunk processing may take longer."
if args.warmup_file:
    if os.path.isfile(args.warmup_file):
        a = load_audio_chunk(args.warmup_file,0,1)
        asr.transcribe(a)
        logger.info("Whisper is warmed up.")
    else:
        logger.critical("The warm up file is not available. "+msg)
        sys.exit(1)
else:
    logger.warning(msg)
if args.backend != "seamless": # loading Whisper backend
    size = args.model

    t = time.time()
    print(f"Loading Whisper {size} model for {language}...",file=sys.stderr,end=" ",flush=True)

    if args.backend == "faster-whisper":
        from faster_whisper import WhisperModel
        asr_cls = FasterWhisperASR
    else:
        import whisper
        import whisper_timestamped
        # from whisper_timestamped_model import WhisperTimestampedASR
        asr_cls = WhisperTimestampedASR

    asr = asr_cls(modelsize=size, lan=language, cache_dir=args.model_cache_dir, model_dir=args.model_dir)

    if args.task == "translate":
        asr.set_translate_task()
        tgt_language = "en"
    else:
        tgt_language = language

    e = time.time()
    print(f"done. It took {round(e-t,2)} seconds.",file=sys.stderr)

    if args.vad:
        print("setting VAD filter",file=sys.stderr)
        asr.use_vad()

    demo_audio_path = "cs-maji-2.16k.wav"
    if os.path.exists(demo_audio_path):
        # load the audio into the LRU cache before we start the timer
        a = load_audio_chunk(demo_audio_path,0,1)

        # TODO: it should be tested whether it's meaningful
        # warm up the ASR, because the very first transcribe takes much more time than the other
        asr.transcribe(a)
    else:
        print("Whisper is not warmed up",file=sys.stderr)

    if args.buffer_trimming == "sentence":
        tokenizer = create_tokenizer(tgt_language)
    else:
        tokenizer = None
    online = OnlineASRProcessor(asr,tokenizer,buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec))
else: # seamless backend:
    print(f"Loading Seamless Streaming backend model",file=sys.stderr,flush=True)

    from seamless_integration import SeamlessProcessor
    online = SeamlessProcessor(language, args.task, logfile=sys.stderr)

######### Server objects

import line_packet
import socket

import logging


class Connection:
    '''it wraps conn object'''
    PACKET_SIZE = 65536
@@ -99,10 +130,12 @@ class ServerProcessor:
        out = []
        while sum(len(x) for x in out) < self.min_chunk*SAMPLING_RATE:
            raw_bytes = self.connection.non_blocking_receive_audio()
            print(raw_bytes[:10])
            print(len(raw_bytes))
            if not raw_bytes:
                break
            sf = soundfile.SoundFile(io.BytesIO(raw_bytes), channels=1,endian="LITTLE",samplerate=SAMPLING_RATE, subtype="PCM_16",format="RAW")
            audio, _ = librosa.load(sf,sr=SAMPLING_RATE,dtype=np.float32)
            audio, _ = librosa.load(sf,sr=SAMPLING_RATE)
            out.append(audio)
        if not out:
            return None
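The soundfile/librosa pair above decodes raw 16-bit little-endian mono PCM into float32. The same conversion can be sketched with numpy alone; `pcm16_to_float32` is not part of the codebase, only an illustration of the decoding the server configures (PCM_16, LITTLE endian, mono):

```python
import numpy as np

def pcm16_to_float32(raw_bytes: bytes) -> np.ndarray:
    # interpret little-endian int16 samples and scale into [-1.0, 1.0),
    # matching the server's PCM_16 RAW decoding
    samples = np.frombuffer(raw_bytes, dtype="<i2")
    return samples.astype(np.float32) / 32768.0

raw = np.array([0, 16384, -32768], dtype="<i2").tobytes()
audio = pcm16_to_float32(raw)
```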
@@ -129,7 +162,7 @@ class ServerProcessor:
            print("%1.0f %1.0f %s" % (beg,end,o[2]),flush=True,file=sys.stderr)
            return "%1.0f %1.0f %s" % (beg,end,o[2])
        else:
            logger.debug("No text in this segment")
            print(o,file=sys.stderr,flush=True)
            return None

    def send_result(self, o):
@@ -143,13 +176,14 @@ class ServerProcessor:
        while True:
            a = self.receive_audio_chunk()
            if a is None:
                print("break here",file=sys.stderr)
                break
            self.online_asr_proc.insert_audio_chunk(a)
            o = online.process_iter()
            try:
                self.send_result(o)
            except BrokenPipeError:
                logger.info("broken pipe -- connection closed?")
                print("broken pipe -- connection closed?",file=sys.stderr)
                break

        # o = online.finish() # this should be working
@@ -157,19 +191,24 @@ class ServerProcessor:



# Start logging.
level = logging.INFO
logging.basicConfig(level=level, format='whisper-server-%(levelname)s: %(message)s')

# server loop

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((args.host, args.port))
    s.listen(1)
    logger.info('Listening on'+str((args.host, args.port)))
    logging.info('INFO: Listening on'+str((args.host, args.port)))
    while True:
        conn, addr = s.accept()
        logger.info('Connected to client on {}'.format(addr))
        logging.info('INFO: Connected to client on {}'.format(addr))
        connection = Connection(conn)
        proc = ServerProcessor(connection, online, min_chunk)
        proc.process()
        conn.close()
        logger.info('Connection to client closed')
        logger.info('Connection closed, terminating.')
        logging.info('INFO: Connection to client closed')
        logging.info('INFO: Connection closed, terminating.')