68 Commits
0.0.1 ... 0.1.9

Author SHA1 Message Date
Quentin Fuxa
e165916952 add diarization model list url 2025-06-19 16:43:23 +02:00
Quentin Fuxa
8532a91c7a add segmentation and embedding model options to configuration 2025-06-19 16:29:25 +02:00
Quentin Fuxa
b01b81bad0 improve diarization with lag diarization substraction 2025-06-19 16:18:49 +02:00
Quentin Fuxa
0f79d442ee improve diarization speed + Use punctuation to better align speakers and diarization 2025-06-19 13:03:29 +02:00
Quentin Fuxa
c9f60504e3 update with up to date example 2025-06-16 16:57:47 +02:00
Quentin Fuxa
993a83546a core refactoring 2025-06-16 16:13:57 +02:00
Quentin Fuxa
eabd1b199a to 0.1.7 2025-05-28 13:29:45 +02:00
Quentin Fuxa
f7644268c1 Message when launching transcription and no audio is detected 2025-05-28 13:25:49 +02:00
Quentin Fuxa
34e8fe260e lag information in real time even when no audio is detected 2025-05-28 12:25:47 +02:00
Quentin Fuxa
debfefaf3e Merge pull request #128 from QuentinFuxa/vac-update
Vac update
2025-05-28 11:51:37 +02:00
Quentin Fuxa
101ca9ef90 Update README.md 2025-05-28 11:50:44 +02:00
Quentin Fuxa
94bb05d53e Update README.md 2025-05-28 11:48:46 +02:00
Quentin Fuxa
6797b88176 Error handling for missing FFmpeg in start_ffmpeg_decoder 2025-05-28 11:43:30 +02:00
Quentin Fuxa
46770efd6c correct error when using VAC 2025-05-28 11:43:18 +02:00
Quentin Fuxa
b23ef3ec3e refactor license for correct shields.io detection 2025-05-28 11:42:26 +02:00
Quentin Fuxa
fa29a24abe Bump version to 0.1.6 2025-05-07 11:45:33 +02:00
Quentin Fuxa
fea3c3553c logging in ASR proc. includes internal buffer duration and transcription lag 2025-05-07 11:45:00 +02:00
Quentin Fuxa
d6d65a663b errors handling when end of transcription 2025-05-07 10:56:04 +02:00
Quentin Fuxa
083d5b2f44 uses sentinel object when end of transcription, to properly terminate tasks 2025-05-07 10:55:44 +02:00
Quentin Fuxa
8e4674b093 End of transcription : Properly sends signal back to the endpoint 2025-05-07 10:55:12 +02:00
Quentin Fuxa
bc7c32100f Mention third-party components 2025-04-14 00:21:43 +02:00
Quentin Fuxa
c4150894af Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web 2025-04-13 12:11:01 +02:00
Quentin Fuxa
25bf242ce1 bump version to 0.1.5 2025-04-13 12:10:53 +02:00
Quentin Fuxa
14cc601a5c Update README.md 2025-04-13 11:07:53 +02:00
Quentin Fuxa
34d5d513fa fix typo 2025-04-12 18:22:14 +02:00
Quentin Fuxa
2ab3dac948 remove whisper_fastapi_online_server.py 2025-04-12 18:21:04 +02:00
Quentin Fuxa
b56fcffde1 Solves stdin flushes blocking IOhttps://github.com/QuentinFuxa/WhisperLiveKit/issues/110
https://github.com/QuentinFuxa/WhisperLiveKit/issues/106
https://github.com/QuentinFuxa/WhisperLiveKit/issues/90
https://github.com/QuentinFuxa/WhisperLiveKit/issues/87
https://github.com/QuentinFuxa/WhisperLiveKit/issues/81
https://github.com/QuentinFuxa/WhisperLiveKit/issues/2
2025-04-12 15:25:46 +02:00
Quentin Fuxa
2def194893 add ssl certificate and key file arguments to parser 2025-04-11 12:20:22 +02:00
Quentin Fuxa
29978da301 adds ssl possibility in basic server 2025-04-11 12:20:08 +02:00
Quentin Fuxa
b708890788 protocol default to ws 2025-04-11 12:14:14 +02:00
Quentin Fuxa
3ac4c514cf remove temp_kit method to get args. uvicorn reload to False for better perfs 2025-04-11 12:02:52 +02:00
Chris Margach
3c58bfcfa2 update readme for package launch with SSL 2025-04-10 13:47:09 +09:00
Chris Margach
d53b7a323a update sample html to use wss in case of https 2025-04-10 13:46:52 +09:00
Chris Margach
02de5993e6 allow passing of cert and key locations to uvicorn via package 2025-04-10 13:42:30 +09:00
Quentin Fuxa
d94560ef37 Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web 2025-04-09 11:38:57 +02:00
Quentin Fuxa
f62baa80b7 to 0.1.4 2025-04-09 11:35:22 +02:00
Quentin Fuxa
0b43035701 enhance chunking to handle audio buffer time limits 2025-04-09 11:34:59 +02:00
Quentin Fuxa
704170ccf3 Logs for https://github.com/QuentinFuxa/WhisperLiveKit/issues/110 https://github.com/QuentinFuxa/WhisperLiveKit/issues/106
https://github.com/QuentinFuxa/WhisperLiveKit/issues/90
https://github.com/QuentinFuxa/WhisperLiveKit/issues/87
https://github.com/QuentinFuxa/WhisperLiveKit/issues/81
https://github.com/QuentinFuxa/WhisperLiveKit/issues/2
2025-04-09 11:34:27 +02:00
Quentin Fuxa
09279c572a Merge pull request #116 from needabetterusername/solve-115
Solve #115: VAC Broken import
2025-04-09 10:16:49 +02:00
Quentin Fuxa
23e41f993f typos 2025-04-09 10:14:23 +02:00
Quentin Fuxa
c791b1e125 Merge pull request #114 from needabetterusername/implement-69-clean
(Re-) Implement #69 (Dockerfile)
2025-04-09 10:10:08 +02:00
Quentin Fuxa
3de2990ec4 Update README to clarify Docker usage for non-GPU systems 2025-04-09 10:08:48 +02:00
Chris Margach
51e6a6f6f9 update import after moving target file 2025-04-09 13:33:39 +09:00
Chris Margach
f6e53b2fab return silero_vad_iterator.py to whisper_streaming(_custom) package. 2025-04-09 11:59:21 +09:00
Chris Margach
5d6f08ff7a Update readme for Dockerfile 2025-04-08 19:06:42 +09:00
Chris Margach
583a26da88 Add Dockerfile w/ GPU support. 2025-04-08 19:06:11 +09:00
Chris Margach
5b3d8969e8 Merge branch 'main' of https://github.com/QuentinFuxa/WhisperLiveKit 2025-04-08 09:44:20 +09:00
Quentin Fuxa
40cca184c1 Merge pull request #113 from needabetterusername/implement-107
Allow CTranslate2 backend to choose device and compute types.
2025-04-07 14:42:57 +02:00
Chris Margach
47ed345f9e Merge branch 'implement-107' 2025-04-07 17:40:08 +09:00
Chris Margach
9c9c179684 Allow CTranslate2 backend to choose device and compute types. 2025-04-07 14:47:29 +09:00
Quentin Fuxa
b870c12f62 Merge pull request #109 from QuentinFuxa/needabetterusername/implement-69
Needabetterusername/implement 69
2025-04-04 11:10:08 +02:00
Quentin Fuxa
cfd5905fd4 Improve WebSocket fallback logic
Use window.location.hostname and port if available,
otherwise fallback to localhost:8000.

Co-authored-by: Chris Margach <hcagramc@gmail.com>
2025-04-04 11:08:05 +02:00
Chris Margach
2399487e45 Implement #107 2025-04-04 10:54:15 +09:00
Quentin Fuxa
afd88310fd Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web 2025-04-02 11:56:25 +02:00
Quentin Fuxa
080f446b0d start implementing frontend part of https://github.com/QuentinFuxa/WhisperLiveKit/pull/80 2025-04-02 11:56:02 +02:00
Quentin Fuxa
8bd2b36488 Add files via upload 2025-04-01 11:03:22 +02:00
Quentin Fuxa
25fd924bf9 Merge pull request #103 from QuentinFuxa/readme
Update README.md
2025-03-28 14:30:35 +01:00
Quentin Fuxa
ff8fd0ec72 Update README.md 2025-03-28 14:30:14 +01:00
Quentin Fuxa
e99f53e649 Corrects 'TranscriptionSegment' object is not subscriptable 2025-03-24 21:16:08 +01:00
Quentin Fuxa
e9022894b2 solve #100 2025-03-24 20:38:47 +01:00
Quentin Fuxa
ccf99cecdf Solve #95 and #96 2025-03-24 17:55:52 +01:00
Quentin Fuxa
40e2814cd7 0.1.2 2025-03-20 11:08:40 +01:00
Quentin Fuxa
cd29eace3d Update README.md 2025-03-20 10:23:14 +01:00
Quentin Fuxa
38cb54640f Update README.md 2025-03-19 15:49:39 +01:00
Quentin Fuxa
81268a7ca3 update CLI launch 2025-03-19 15:40:54 +01:00
Quentin Fuxa
33cbd24964 Update README.md 2025-03-19 15:14:38 +01:00
Quentin Fuxa
e966e78584 Merge pull request #92 from QuentinFuxa/refacto_lib
script to lib
2025-03-19 15:13:42 +01:00
Quentin Fuxa
c13d36b5e7 Merge pull request #91 from QuentinFuxa/refacto_lib
move all audio processing out of /asr endpoint
2025-03-19 11:20:28 +01:00
22 changed files with 1547 additions and 534 deletions

82
Dockerfile Normal file
View File

@@ -0,0 +1,82 @@
FROM nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHONUNBUFFERED=1
WORKDIR /app
ARG EXTRAS
ARG HF_PRECACHE_DIR
ARG HF_TKN_FILE
# Install system dependencies
#RUN apt-get update && \
# apt-get install -y ffmpeg git && \
# apt-get clean && \
# rm -rf /var/lib/apt/lists/*
# 2) Install system dependencies + Python + pip
RUN apt-get update && \
apt-get install -y --no-install-recommends \
python3 \
python3-pip \
ffmpeg \
git && \
rm -rf /var/lib/apt/lists/*
RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
COPY . .
# Install WhisperLiveKit directly, allowing for optional dependencies
# Note: For gates modedls, need to add your HF toke. See README.md
# for more details.
RUN if [ -n "$EXTRAS" ]; then \
echo "Installing with extras: [$EXTRAS]"; \
pip install --no-cache-dir .[$EXTRAS]; \
else \
echo "Installing base package only"; \
pip install --no-cache-dir .; \
fi
# Enable in-container caching for Hugging Face models by:
# Note: If running multiple containers, better to map a shared
# bucket.
#
# A) Make the cache directory persistent via an anonymous volume.
# Note: This only persists for a single, named container. This is
# only for convenience at de/test stage.
# For prod, it is better to use a named volume via host mount/k8s.
VOLUME ["/root/.cache/huggingface/hub"]
# or
# B) Conditionally copy a local pre-cache from the build context to the
# container's cache via the HF_PRECACHE_DIR build-arg.
# WARNING: This will copy ALL files in the pre-cache location.
# Conditionally copy a cache directory if provided
RUN if [ -n "$HF_PRECACHE_DIR" ]; then \
echo "Copying Hugging Face cache from $HF_PRECACHE_DIR"; \
mkdir -p /root/.cache/huggingface/hub && \
cp -r $HF_PRECACHE_DIR/* /root/.cache/huggingface/hub; \
else \
echo "No local Hugging Face cache specified, skipping copy"; \
fi
# Conditionally copy a Hugging Face token if provided
RUN if [ -n "$HF_TKN_FILE" ]; then \
echo "Copying Hugging Face token from $HF_TKN_FILE"; \
mkdir -p /root/.cache/huggingface && \
cp $HF_TKN_FILE /root/.cache/huggingface/token; \
else \
echo "No Hugging Face token file specified, skipping token setup"; \
fi
# Expose port for the transcription server
EXPOSE 8000
ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]
# Default args
CMD ["--model", "tiny.en"]

33
LICENSE
View File

@@ -1,21 +1,28 @@
MIT License MIT License
Copyright (c) 2023 ÚFAL Copyright (c) 2025 Quentin Fuxa.
Permission is hereby granted, free of charge, to any person obtaining a copy Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions: furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software. copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE. SOFTWARE.
---
Based on:
- **whisper_streaming** by ÚFAL MIT License https://github.com/ufal/whisper_streaming. The original work by ÚFAL. License: https://github.com/ufal/whisper_streaming/blob/main/LICENSE
- **silero-vad** by Snakers4 MIT License https://github.com/snakers4/silero-vad. The work by Snakers4 (silero-vad). License: https://github.com/snakers4/silero-vad/blob/f6b1294cb27590fb2452899df98fb234dfef1134/LICENSE
- **Diart** by juanmc2005 MIT License https://github.com/juanmc2005/diart. The work in Diart by juanmc2005. License: https://github.com/juanmc2005/diart/blob/main/LICENSE

427
README.md
View File

@@ -1,158 +1,357 @@
<h1 align="center">WhisperLiveKit</h1> <h1 align="center">WhisperLiveKit</h1>
<p align="center"><b>Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization</b></p>
This project is based on [Whisper Streaming](https://github.com/ufal/whisper_streaming) and lets you transcribe audio directly from your browser. Simply launch the local server and grant microphone access. Everything runs locally on your machine ✨
<p align="center"> <p align="center">
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/demo.png" alt="Demo Screenshot" width="730"> <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
</p> </p>
### Differences from [Whisper Streaming](https://github.com/ufal/whisper_streaming) <p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
#### ⚙️ **Core Improvements** <p align="center">
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.13-dark_green"></a>
<a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-MIT-dark_green"></a>
</p>
## 🚀 Overview
This project is based on [Whisper Streaming](https://github.com/ufal/whisper_streaming) and lets you transcribe audio directly from your browser. WhisperLiveKit provides a complete backend solution for real-time speech transcription with a functional and simple frontend that you can customize for your own needs. Everything runs locally on your machine ✨
### 🔄 Architecture
WhisperLiveKit consists of three main components:
- **Frontend**: A basic HTML & JavaScript interface that captures microphone audio and streams it to the backend via WebSockets. You can use and adapt the provided template at [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html) for your specific use case.
- **Backend (Web Server)**: A FastAPI-based WebSocket server that receives streamed audio data, processes it in real time, and returns transcriptions to the frontend. This is where the WebSocket logic and routing live.
- **Core Backend (Library Logic)**: A server-agnostic core that handles audio processing, ASR, and diarization. It exposes reusable components that take in audio bytes and return transcriptions. This makes it easy to plug into any WebSocket or audio stream pipeline.
### ✨ Key Features
- **🎙️ Real-time Transcription** - Convert speech to text instantly as you speak
- **👥 Speaker Diarization** - Identify different speakers in real-time using [Diart](https://github.com/juanmc2005/diart)
- **🔒 Fully Local** - All processing happens on your machine - no data sent to external servers
- **📱 Multi-User Support** - Handle multiple users simultaneously with a single backend/server
- **📝 Punctuation-Based Speaker Splitting [BETA] ** - Align speaker changes with natural sentence boundaries for more readable transcripts
### ⚙️ Core differences from [Whisper Streaming](https://github.com/ufal/whisper_streaming)
- **Automatic Silence Chunking** Automatically chunks when no audio is detected to limit buffer size
- **Multi-User Support** Handles multiple users simultaneously by decoupling backend and online ASR
- **Confidence Validation** Immediately validate high-confidence tokens for faster inference
- **MLX Whisper Backend** Optimized for Apple Silicon for faster local processing
- **Buffering Preview** Displays unvalidated transcription segments - **Buffering Preview** Displays unvalidated transcription segments
- **Multi-User Support** Handles multiple users simultaneously by decoupling backend and online asr
- **MLX Whisper Backend** Optimized for Apple Silicon for faster local processing.
- **Confidence validation** Immediately validate high-confidence tokens for faster inference
#### 🎙️ **Speaker Identification** ## 📖 Quick Start
- **Real-Time Diarization** Identify different speakers in real time using [Diart](https://github.com/juanmc2005/diart)
#### 🌐 **Web & API** ```bash
- **Built-in Web UI** Simple raw html browser interface with no frontend setup required # Install the package
- **FastAPI WebSocket Server** Real-time speech-to-text processing with async FFmpeg streaming. pip install whisperlivekit
- **JavaScript Client** Ready-to-use MediaRecorder implementation for seamless client-side integration.
## Installation # Start the transcription server
whisperlivekit-server --model tiny.en
### Via pip # Open your browser at http://localhost:8000
```
### Quick Start with SSL
```bash
# You must provide a certificate and key
whisperlivekit-server -ssl-certfile public.crt --ssl-keyfile private.key
# Open your browser at https://localhost:8000
```
That's it! Start speaking and watch your words appear on screen.
## 🛠️ Installation Options
### Install from PyPI (Recommended)
```bash ```bash
pip install whisperlivekit pip install whisperlivekit
``` ```
### From source ### Install from Source
1. **Clone the Repository**: ```bash
git clone https://github.com/QuentinFuxa/WhisperLiveKit
```bash cd WhisperLiveKit
git clone https://github.com/QuentinFuxa/WhisperLiveKit pip install -e .
cd WhisperLiveKit ```
pip install -e .
```
### System Dependencies ### System Dependencies
You need to install FFmpeg on your system: FFmpeg is required:
- Install system dependencies: ```bash
```bash # Ubuntu/Debian
# Install FFmpeg on your system (required for audio processing) sudo apt install ffmpeg
# For Ubuntu/Debian:
sudo apt install ffmpeg
# For macOS:
brew install ffmpeg
# For Windows:
# Download from https://ffmpeg.org/download.html and add to PATH
```
- Install required Python dependencies: # macOS
brew install ffmpeg
```bash # Windows
# Whisper streaming required dependencies # Download from https://ffmpeg.org/download.html and add to PATH
pip install librosa soundfile ```
# Whisper streaming web required dependencies ### Optional Dependencies
pip install fastapi ffmpeg-python
```
- Install at least one whisper backend among:
``` ```bash
whisper # Voice Activity Controller (prevents hallucinations)
whisper-timestamped pip install torch
faster-whisper (faster backend on NVIDIA GPU)
mlx-whisper (faster backend on Apple Silicon) # Sentence-based buffer trimming
pip install mosestokenizer wtpsplit
pip install tokenize_uk # If you work with Ukrainian text
# Speaker diarization
pip install diart
# Alternative Whisper backends (default is faster-whisper)
pip install whisperlivekit[whisper] # Original Whisper
pip install whisperlivekit[whisper-timestamped] # Improved timestamps
pip install whisperlivekit[mlx-whisper] # Apple Silicon optimization
pip install whisperlivekit[openai] # OpenAI API
```
### 🎹 Pyannote Models Setup
For diarization, you need access to pyannote.audio models:
1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
4. Login with HuggingFace:
```bash
pip install huggingface_hub
huggingface-cli login
``` ```
- Optionnal dependencies
``` ## 💻 Usage Examples
# If you want to use VAC (Voice Activity Controller). Useful for preventing hallucinations
torch
# If you choose sentences as buffer trimming strategy
mosestokenizer
wtpsplit
tokenize_uk # If you work with Ukrainian text
# If you want to run the server using uvicorn (recommended) ### Command-line Interface
uvicorn
# If you want to use diarization Start the transcription server with various options:
diart
```
Diart uses by default [pyannote.audio](https://github.com/pyannote/pyannote-audio) models from the _huggingface hub_. To use them, please follow the steps described [here](https://github.com/juanmc2005/diart?tab=readme-ov-file#get-access-to--pyannote-models). ```bash
# Basic server with English model
whisperlivekit-server --model tiny.en
# Advanced configuration with diarization
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
```
3. **Run the FastAPI Server**: ### Python API Integration (Backend)
Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.
```bash ```python
python whisper_fastapi_online_server.py --host 0.0.0.0 --port 8000 from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
``` from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from contextlib import asynccontextmanager
import asyncio
**Parameters** # Global variable for the transcription engine
transcription_engine = None
The following parameters are supported:
- `--host` and `--port` let you specify the server's IP/port.
- `-min-chunk-size` sets the minimum chunk size for audio processing. Make sure this value aligns with the chunk size selected in the frontend. If not aligned, the system will work but may unnecessarily over-process audio data.
- `--transcription`: Enable/disable transcription (default: True)
- `--diarization`: Enable/disable speaker diarization (default: False)
- `--confidence-validation`: Use confidence scores for faster validation. Transcription will be faster but punctuation might be less accurate (default: True)
- `--warmup-file`: The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast. :
- If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
- If False, no warmup is performed.
- `--min-chunk-size` Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.
- `--model` {_tiny.en, tiny, base.en, base, small.en, small, medium.en, medium, large-v1, large-v2, large-v3, large, large-v3-turbo_}
Name size of the Whisper model to use (default: tiny). The model is automatically downloaded from the model hub if not present in model cache dir.
- `--model_cache_dir` Overriding the default model cache dir where models downloaded from the hub are saved
- `--model_dir` Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.
- `--lan`, --language Source language code, e.g. en,de,cs, or 'auto' for language detection.
- `--task` {_transcribe, translate_} Transcribe or translate. If translate is set, we recommend avoiding the _large-v3-turbo_ backend, as it [performs significantly worse](https://github.com/QuentinFuxa/whisper_streaming_web/issues/40#issuecomment-2652816533) than other models for translation.
- `--backend` {_faster-whisper, whisper_timestamped, openai-api, mlx-whisper_} Load only this backend for Whisper processing.
- `--vac` Use VAC = voice activity controller. Requires torch.
- `--vac-chunk-size` VAC sample size in seconds.
- `--vad` Use VAD = voice activity detection, with the default parameters.
- `--buffer_trimming` {_sentence, segment_} Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.
- `--buffer_trimming_sec` Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.
5. **Open the Provided HTML**: @asynccontextmanager
async def lifespan(app: FastAPI):
global transcription_engine
# Example: Initialize with specific parameters directly
# You can also load from command-line arguments using parse_args()
# args = parse_args()
# transcription_engine = TranscriptionEngine(**vars(args))
transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
yield
- By default, the server root endpoint `/` serves a simple `live_transcription.html` page. app = FastAPI(lifespan=lifespan)
- Open your browser at `http://localhost:8000` (or replace `localhost` and `8000` with whatever you specified).
- The page uses vanilla JavaScript and the WebSocket API to capture your microphone and stream audio to the server in real time.
### How the Live Interface Works # Serve the web interface
@app.get("/")
async def get():
return HTMLResponse(get_web_interface_html())
- Once you **allow microphone access**, the page records small chunks of audio using the **MediaRecorder** API in **webm/opus** format. # Process WebSocket connections
- These chunks are sent over a **WebSocket** to the FastAPI endpoint at `/asr`. async def handle_websocket_results(websocket: WebSocket, results_generator):
- The Python server decodes `.webm` chunks on the fly using **FFmpeg** and streams them into the **whisper streaming** implementation for transcription. try:
- **Partial transcription** appears as soon as enough audio is processed. The “unvalidated” text is shown in **lighter or grey color** (i.e., an aperçu) to indicate its still buffered partial output. Once Whisper finalizes that segment, its displayed in normal text. async for response in results_generator:
- You can watch the transcription update in near real time, ideal for demos, prototyping, or quick debugging. await websocket.send_json(response)
await websocket.send_json({"type": "ready_to_stop"})
except WebSocketDisconnect:
print("WebSocket disconnected during results handling.")
### Deploying to a Remote Server @app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
global transcription_engine
If you want to **deploy** this setup: # Create a new AudioProcessor for each connection, passing the shared engine
audio_processor = AudioProcessor(transcription_engine=transcription_engine)
results_generator = await audio_processor.create_tasks()
send_results_to_client = handle_websocket_results(websocket, results_generator)
results_task = asyncio.create_task(send_results_to_client)
await websocket.accept()
try:
while True:
message = await websocket.receive_bytes()
await audio_processor.process_audio(message)
except WebSocketDisconnect:
print(f"Client disconnected: {websocket.client}")
except Exception as e:
await websocket.close(code=1011, reason=f"Server error: {e}")
finally:
results_task.cancel()
try:
await results_task
except asyncio.CancelledError:
logger.info("Results task successfully cancelled.")
```
1. **Host the FastAPI app** behind a production-grade HTTP(S) server (like **Uvicorn + Nginx** or Docker). If you use HTTPS, use "wss" instead of "ws" in WebSocket URL. ### Frontend Implementation
2. The **HTML/JS page** can be served by the same FastAPI app or a separate static host.
3. Users open the page in **Chrome/Firefox** (any modern browser that supports MediaRecorder + WebSocket).
No additional front-end libraries or frameworks are required. The WebSocket logic in `live_transcription.html` is minimal enough to adapt for your own custom UI or embed in other pages. The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it in `whisperlivekit/web/live_transcription.html`, or load its content using the `get_web_interface_html()` function from `whisperlivekit`:
## Acknowledgments ```python
from whisperlivekit import get_web_interface_html
This project builds upon the foundational work of the Whisper Streaming project. We extend our gratitude to the original authors for their contributions. # ... later in your code where you need the HTML string ...
html_content = get_web_interface_html()
```
## ⚙️ Configuration Reference
WhisperLiveKit offers extensive configuration options:
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--host` | Server host address | `localhost` |
| `--port` | Server port | `8000` |
| `--model` | Whisper model size | `tiny` |
| `--language` | Source language code or `auto` | `en` |
| `--task` | `transcribe` or `translate` | `transcribe` |
| `--backend` | Processing backend | `faster-whisper` |
| `--diarization` | Enable speaker identification | `False` |
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
| `--vac` | Use Voice Activity Controller | `False` |
| `--no-vad` | Disable Voice Activity Detection | `False` |
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
## 🔧 How It Works
<p align="center">
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit in Action" width="500">
</p>
1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
2. **Streaming**: Audio chunks are sent to the server via WebSocket
3. **Processing**: Server decodes audio with FFmpeg and streams into Whisper for transcription
4. **Real-time Output**:
- Partial transcriptions appear immediately in light gray (the 'aperçu')
- Finalized text appears in normal color
- (When enabled) Different speakers are identified and highlighted
## 🚀 Deployment Guide
To deploy WhisperLiveKit in production:
1. **Server Setup** (Backend):
```bash
# Install production ASGI server
pip install uvicorn gunicorn
# Launch with multiple workers
gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
```
2. **Frontend Integration**:
- Host your customized version of the example HTML/JS in your web application
- Ensure WebSocket connection points to your server's address
3. **Nginx Configuration** (recommended for production):
```nginx
server {
listen 80;
server_name your-domain.com;
location / {
proxy_pass http://localhost:8000;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
}
}
```
4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL
### 🐋 Docker
A basic Dockerfile is provided which allows re-use of Python package installation options. See below usage examples:
**NOTE:** For **larger** models, ensure that your **docker runtime** has enough **memory** available.
#### All defaults
- Create a reusable image with only the basics and then run as a named container:
```bash
docker build -t whisperlivekit-defaults .
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
docker start -i whisperlivekit
```
> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
#### Customization
- Customize the container options:
```bash
docker build -t whisperlivekit-defaults .
docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
docker start -i whisperlivekit-base
```
- `--build-arg` Options:
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
- `HF_TOKEN="./token"` - Add your Hugging Face Hub access token to download gated models
## 🔮 Use Cases
- **Meeting Transcription**: Capture discussions in real-time
- **Accessibility Tools**: Help hearing-impaired users follow conversations
- **Content Creation**: Transcribe podcasts or videos automatically
- **Customer Service**: Transcribe support calls with speaker identification
## 🤝 Contributing
Contributions are welcome! Here's how to get started:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add amazing feature'`
4. Push to your branch: `git push origin feature/amazing-feature`
5. Open a Pull Request
## 🙏 Acknowledgments
This project builds upon the foundational work of:
- [Whisper Streaming](https://github.com/ufal/whisper_streaming)
- [Diart](https://github.com/juanmc2005/diart)
- [OpenAI Whisper](https://github.com/openai/whisper)
We extend our gratitude to the original authors for their contributions.
## 📄 License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🔗 Links
- [GitHub Repository](https://github.com/QuentinFuxa/WhisperLiveKit)
- [PyPI Package](https://pypi.org/project/whisperlivekit/)
- [Issue Tracker](https://github.com/QuentinFuxa/WhisperLiveKit/issues)

BIN
demo.png

Binary file not shown.

Before

Width:  |  Height:  |  Size: 469 KiB

After

Width:  |  Height:  |  Size: 438 KiB

View File

@@ -1,8 +1,7 @@
from setuptools import setup, find_packages from setuptools import setup, find_packages
setup( setup(
name="whisperlivekit", name="whisperlivekit",
version="0.1.0", version="0.1.9",
description="Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization", description="Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization",
long_description=open("README.md", "r", encoding="utf-8").read(), long_description=open("README.md", "r", encoding="utf-8").read(),
long_description_content_type="text/markdown", long_description_content_type="text/markdown",
@@ -22,13 +21,17 @@ setup(
"diarization": ["diart"], "diarization": ["diart"],
"vac": ["torch"], "vac": ["torch"],
"sentence": ["mosestokenizer", "wtpsplit"], "sentence": ["mosestokenizer", "wtpsplit"],
"whisper": ["whisper"],
"whisper-timestamped": ["whisper-timestamped"],
"mlx-whisper": ["mlx-whisper"],
"openai": ["openai"],
}, },
package_data={ package_data={
'whisperlivekit': ['web/*.html'], 'whisperlivekit': ['web/*.html'],
}, },
entry_points={ entry_points={
'console_scripts': [ 'console_scripts': [
'whisperlivekit-server=whisperlivekit.server:run_server', 'whisperlivekit-server=whisperlivekit.basic_server:main',
], ],
}, },
classifiers=[ classifiers=[

View File

@@ -1,82 +0,0 @@
from contextlib import asynccontextmanager
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from fastapi.middleware.cors import CORSMiddleware
from whisperlivekit import WhisperLiveKit
from whisperlivekit.audio_processor import AudioProcessor
import asyncio
import logging
import os
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logging.getLogger().setLevel(logging.WARNING)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
kit = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global kit
kit = WhisperLiveKit()
yield
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/")
async def get():
return HTMLResponse(kit.web_interface())
async def handle_websocket_results(websocket, results_generator):
"""Consumes results from the audio processor and sends them via WebSocket."""
try:
async for response in results_generator:
await websocket.send_json(response)
except Exception as e:
logger.warning(f"Error in WebSocket results handler: {e}")
@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
audio_processor = AudioProcessor()
await websocket.accept()
logger.info("WebSocket connection opened.")
results_generator = await audio_processor.create_tasks()
websocket_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
try:
while True:
message = await websocket.receive_bytes()
await audio_processor.process_audio(message)
except WebSocketDisconnect:
logger.warning("WebSocket disconnected.")
finally:
websocket_task.cancel()
await audio_processor.cleanup()
logger.info("WebSocket endpoint cleaned up.")
if __name__ == "__main__":
import uvicorn
temp_kit = WhisperLiveKit(transcription=False, diarization=False)
uvicorn.run(
"whisper_fastapi_online_server:app",
host=temp_kit.args.host,
port=temp_kit.args.port,
reload=True,
log_level="info"
)

View File

@@ -1,4 +1,5 @@
from .core import WhisperLiveKit, parse_args from .core import TranscriptionEngine
from .audio_processor import AudioProcessor from .audio_processor import AudioProcessor
from .web.web_interface import get_web_interface_html
__all__ = ['WhisperLiveKit', 'AudioProcessor', 'parse_args'] from .parse_args import parse_args
__all__ = ['TranscriptionEngine', 'AudioProcessor', 'get_web_interface_html', 'parse_args']

View File

@@ -6,16 +6,17 @@ import math
import logging import logging
import traceback import traceback
from datetime import timedelta from datetime import timedelta
from typing import List, Dict, Any
from whisperlivekit.timed_objects import ASRToken from whisperlivekit.timed_objects import ASRToken
from whisperlivekit.whisper_streaming_custom.whisper_online import online_factory from whisperlivekit.whisper_streaming_custom.whisper_online import online_factory
from whisperlivekit.core import WhisperLiveKit from whisperlivekit.core import TranscriptionEngine
# Set up logging once # Set up logging once
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s") logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG) logger.setLevel(logging.DEBUG)
SENTINEL = object() # unique sentinel object for end of stream marker
def format_time(seconds: float) -> str: def format_time(seconds: float) -> str:
"""Format seconds as HH:MM:SS.""" """Format seconds as HH:MM:SS."""
return str(timedelta(seconds=int(seconds))) return str(timedelta(seconds=int(seconds)))
@@ -26,10 +27,13 @@ class AudioProcessor:
Handles audio processing, state management, and result formatting. Handles audio processing, state management, and result formatting.
""" """
def __init__(self): def __init__(self, **kwargs):
"""Initialize the audio processor with configuration, models, and state.""" """Initialize the audio processor with configuration, models, and state."""
models = WhisperLiveKit() if 'transcription_engine' in kwargs and isinstance(kwargs['transcription_engine'], TranscriptionEngine):
models = kwargs['transcription_engine']
else:
models = TranscriptionEngine(**kwargs)
# Audio processing settings # Audio processing settings
self.args = models.args self.args = models.args
@@ -39,8 +43,12 @@ class AudioProcessor:
self.bytes_per_sample = 2 self.bytes_per_sample = 2
self.bytes_per_sec = self.samples_per_sec * self.bytes_per_sample self.bytes_per_sec = self.samples_per_sec * self.bytes_per_sample
self.max_bytes_per_sec = 32000 * 5 # 5 seconds of audio at 32 kHz self.max_bytes_per_sec = 32000 * 5 # 5 seconds of audio at 32 kHz
self.last_ffmpeg_activity = time()
self.ffmpeg_health_check_interval = 5
self.ffmpeg_max_idle_time = 10
# State management # State management
self.is_stopping = False
self.tokens = [] self.tokens = []
self.buffer_transcription = "" self.buffer_transcription = ""
self.buffer_diarization = "" self.buffer_diarization = ""
@@ -60,6 +68,13 @@ class AudioProcessor:
self.transcription_queue = asyncio.Queue() if self.args.transcription else None self.transcription_queue = asyncio.Queue() if self.args.transcription else None
self.diarization_queue = asyncio.Queue() if self.args.diarization else None self.diarization_queue = asyncio.Queue() if self.args.diarization else None
self.pcm_buffer = bytearray() self.pcm_buffer = bytearray()
# Task references
self.transcription_task = None
self.diarization_task = None
self.ffmpeg_reader_task = None
self.watchdog_task = None
self.all_tasks_for_cleanup = []
# Initialize transcription engine if enabled # Initialize transcription engine if enabled
if self.args.transcription: if self.args.transcription:
@@ -71,21 +86,80 @@ class AudioProcessor:
def start_ffmpeg_decoder(self): def start_ffmpeg_decoder(self):
"""Start FFmpeg process for WebM to PCM conversion.""" """Start FFmpeg process for WebM to PCM conversion."""
return (ffmpeg.input("pipe:0", format="webm") try:
.output("pipe:1", format="s16le", acodec="pcm_s16le", return (ffmpeg.input("pipe:0", format="webm")
ac=self.channels, ar=str(self.sample_rate)) .output("pipe:1", format="s16le", acodec="pcm_s16le",
.run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True)) ac=self.channels, ar=str(self.sample_rate))
.run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True))
except FileNotFoundError:
error = """
FFmpeg is not installed or not found in your system's PATH.
Please install FFmpeg to enable audio processing.
Installation instructions:
# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg
# macOS (using Homebrew):
brew install ffmpeg
# Windows:
# 1. Download the latest static build from https://ffmpeg.org/download.html
# 2. Extract the archive (e.g., to C:\\FFmpeg).
# 3. Add the 'bin' directory (e.g., C:\\FFmpeg\\bin) to your system's PATH environment variable.
After installation, please restart the application.
"""
logger.error(error)
raise FileNotFoundError(error)
async def restart_ffmpeg(self): async def restart_ffmpeg(self):
"""Restart the FFmpeg process after failure.""" """Restart the FFmpeg process after failure."""
logger.warning("Restarting FFmpeg process...")
if self.ffmpeg_process: if self.ffmpeg_process:
try: try:
self.ffmpeg_process.kill() # we check if process is still running
await asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait) if self.ffmpeg_process.poll() is None:
logger.info("Terminating existing FFmpeg process")
self.ffmpeg_process.stdin.close()
self.ffmpeg_process.terminate()
# wait for termination with timeout
try:
await asyncio.wait_for(
asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait),
timeout=5.0
)
except asyncio.TimeoutError:
logger.warning("FFmpeg process did not terminate, killing forcefully")
self.ffmpeg_process.kill()
await asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait)
except Exception as e: except Exception as e:
logger.warning(f"Error killing FFmpeg process: {e}") logger.error(f"Error during FFmpeg process termination: {e}")
logger.error(traceback.format_exc())
# we start new process
try:
logger.info("Starting new FFmpeg process")
self.ffmpeg_process = self.start_ffmpeg_decoder() self.ffmpeg_process = self.start_ffmpeg_decoder()
self.pcm_buffer = bytearray() self.pcm_buffer = bytearray()
self.last_ffmpeg_activity = time()
logger.info("FFmpeg process restarted successfully")
except Exception as e:
logger.error(f"Failed to restart FFmpeg process: {e}")
logger.error(traceback.format_exc())
# try again after 5s
await asyncio.sleep(5)
try:
self.ffmpeg_process = self.start_ffmpeg_decoder()
self.pcm_buffer = bytearray()
self.last_ffmpeg_activity = time()
logger.info("FFmpeg process restarted successfully on second attempt")
except Exception as e2:
logger.critical(f"Failed to restart FFmpeg process on second attempt: {e2}")
logger.critical(traceback.format_exc())
async def update_transcription(self, new_tokens, buffer, end_buffer, full_transcription, sep): async def update_transcription(self, new_tokens, buffer, end_buffer, full_transcription, sep):
"""Thread-safe update of transcription with new data.""" """Thread-safe update of transcription with new data."""
@@ -154,25 +228,25 @@ class AudioProcessor:
while True: while True:
try: try:
# Calculate buffer size based on elapsed time current_time = time()
elapsed_time = math.floor((time() - beg) * 10) / 10 # Round to 0.1 sec elapsed_time = math.floor((current_time - beg) * 10) / 10
buffer_size = max(int(32000 * elapsed_time), 4096) buffer_size = max(int(32000 * elapsed_time), 4096)
beg = time() beg = current_time
# Read chunk with timeout # Detect idle state much more quickly
try: if current_time - self.last_ffmpeg_activity > self.ffmpeg_max_idle_time:
chunk = await asyncio.wait_for( logger.warning(f"FFmpeg process idle for {current_time - self.last_ffmpeg_activity:.2f}s. Restarting...")
loop.run_in_executor(None, self.ffmpeg_process.stdout.read, buffer_size),
timeout=15.0
)
except asyncio.TimeoutError:
logger.warning("FFmpeg read timeout. Restarting...")
await self.restart_ffmpeg() await self.restart_ffmpeg()
beg = time() beg = time()
self.last_ffmpeg_activity = time()
continue continue
chunk = await loop.run_in_executor(None, self.ffmpeg_process.stdout.read, buffer_size)
if chunk:
self.last_ffmpeg_activity = time()
if not chunk: if not chunk:
logger.info("FFmpeg stdout closed.") logger.info("FFmpeg stdout closed, no more data to read.")
break break
self.pcm_buffer.extend(chunk) self.pcm_buffer.extend(chunk)
@@ -183,7 +257,7 @@ class AudioProcessor:
self.convert_pcm_to_float(self.pcm_buffer).copy() self.convert_pcm_to_float(self.pcm_buffer).copy()
) )
# Process when we have enough data # Process when enough data
if len(self.pcm_buffer) >= self.bytes_per_sec: if len(self.pcm_buffer) >= self.bytes_per_sec:
if len(self.pcm_buffer) > self.max_bytes_per_sec: if len(self.pcm_buffer) > self.max_bytes_per_sec:
logger.warning( logger.warning(
@@ -207,45 +281,86 @@ class AudioProcessor:
logger.warning(f"Exception in ffmpeg_stdout_reader: {e}") logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
logger.warning(f"Traceback: {traceback.format_exc()}") logger.warning(f"Traceback: {traceback.format_exc()}")
break break
logger.info("FFmpeg stdout processing finished. Signaling downstream processors.")
if self.args.transcription and self.transcription_queue:
await self.transcription_queue.put(SENTINEL)
logger.debug("Sentinel put into transcription_queue.")
if self.args.diarization and self.diarization_queue:
await self.diarization_queue.put(SENTINEL)
logger.debug("Sentinel put into diarization_queue.")
async def transcription_processor(self): async def transcription_processor(self):
"""Process audio chunks for transcription.""" """Process audio chunks for transcription."""
self.full_transcription = "" self.full_transcription = ""
self.sep = self.online.asr.sep self.sep = self.online.asr.sep
cumulative_pcm_duration_stream_time = 0.0
while True: while True:
try: try:
pcm_array = await self.transcription_queue.get() pcm_array = await self.transcription_queue.get()
if pcm_array is SENTINEL:
logger.debug("Transcription processor received sentinel. Finishing.")
self.transcription_queue.task_done()
break
logger.info(f"{len(self.online.audio_buffer) / self.online.SAMPLING_RATE} seconds of audio to process.") if not self.online: # Should not happen if queue is used
logger.warning("Transcription processor: self.online not initialized.")
self.transcription_queue.task_done()
continue
asr_internal_buffer_duration_s = len(self.online.audio_buffer) / self.online.SAMPLING_RATE
transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
logger.info(
f"ASR processing: internal_buffer={asr_internal_buffer_duration_s:.2f}s, "
f"lag={transcription_lag_s:.2f}s."
)
# Process transcription # Process transcription
self.online.insert_audio_chunk(pcm_array) duration_this_chunk = len(pcm_array) / self.sample_rate if isinstance(pcm_array, np.ndarray) else 0
new_tokens = self.online.process_iter() cumulative_pcm_duration_stream_time += duration_this_chunk
stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time
self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
new_tokens, current_audio_processed_upto = self.online.process_iter()
if new_tokens: if new_tokens:
self.full_transcription += self.sep.join([t.text for t in new_tokens]) self.full_transcription += self.sep.join([t.text for t in new_tokens])
# Get buffer information # Get buffer information
_buffer = self.online.get_buffer() _buffer_transcript_obj = self.online.get_buffer()
buffer = _buffer.text buffer_text = _buffer_transcript_obj.text
end_buffer = _buffer.end if _buffer.end else (
new_tokens[-1].end if new_tokens else 0 candidate_end_times = [self.end_buffer]
)
if new_tokens:
candidate_end_times.append(new_tokens[-1].end)
if _buffer_transcript_obj.end is not None:
candidate_end_times.append(_buffer_transcript_obj.end)
candidate_end_times.append(current_audio_processed_upto)
new_end_buffer = max(candidate_end_times)
# Avoid duplicating content # Avoid duplicating content
if buffer in self.full_transcription: if buffer_text in self.full_transcription:
buffer = "" buffer_text = ""
await self.update_transcription( await self.update_transcription(
new_tokens, buffer, end_buffer, self.full_transcription, self.sep new_tokens, buffer_text, new_end_buffer, self.full_transcription, self.sep
) )
self.transcription_queue.task_done()
except Exception as e: except Exception as e:
logger.warning(f"Exception in transcription_processor: {e}") logger.warning(f"Exception in transcription_processor: {e}")
logger.warning(f"Traceback: {traceback.format_exc()}") logger.warning(f"Traceback: {traceback.format_exc()}")
finally: if 'pcm_array' in locals() and pcm_array is not SENTINEL : # Check if pcm_array was assigned from queue
self.transcription_queue.task_done() self.transcription_queue.task_done()
logger.info("Transcription processor task finished.")
async def diarization_processor(self, diarization_obj): async def diarization_processor(self, diarization_obj):
"""Process audio chunks for speaker diarization.""" """Process audio chunks for speaker diarization."""
@@ -254,23 +369,33 @@ class AudioProcessor:
while True: while True:
try: try:
pcm_array = await self.diarization_queue.get() pcm_array = await self.diarization_queue.get()
if pcm_array is SENTINEL:
logger.debug("Diarization processor received sentinel. Finishing.")
self.diarization_queue.task_done()
break
# Process diarization # Process diarization
await diarization_obj.diarize(pcm_array) await diarization_obj.diarize(pcm_array)
# Get current state and update speakers async with self.lock:
state = await self.get_current_state() new_end = diarization_obj.assign_speakers_to_tokens(
new_end = diarization_obj.assign_speakers_to_tokens( self.end_attributed_speaker,
state["end_attributed_speaker"], state["tokens"] self.tokens,
) use_punctuation_split=self.args.punctuation_split
)
self.end_attributed_speaker = new_end
if buffer_diarization:
self.buffer_diarization = buffer_diarization
await self.update_diarization(new_end, buffer_diarization) self.diarization_queue.task_done()
except Exception as e: except Exception as e:
logger.warning(f"Exception in diarization_processor: {e}") logger.warning(f"Exception in diarization_processor: {e}")
logger.warning(f"Traceback: {traceback.format_exc()}") logger.warning(f"Traceback: {traceback.format_exc()}")
finally: if 'pcm_array' in locals() and pcm_array is not SENTINEL:
self.diarization_queue.task_done() self.diarization_queue.task_done()
logger.info("Diarization processor task finished.")
async def results_formatter(self): async def results_formatter(self):
"""Format processing results for output.""" """Format processing results for output."""
@@ -334,31 +459,51 @@ class AudioProcessor:
await self.update_diarization(end_attributed_speaker, combined) await self.update_diarization(end_attributed_speaker, combined)
buffer_diarization = combined buffer_diarization = combined
# Create response object response_status = "active_transcription"
if not lines: final_lines_for_response = lines.copy()
lines = [{
if not tokens and not buffer_transcription and not buffer_diarization:
response_status = "no_audio_detected"
final_lines_for_response = []
elif response_status == "active_transcription" and not final_lines_for_response:
final_lines_for_response = [{
"speaker": 1, "speaker": 1,
"text": "", "text": "",
"beg": format_time(0), "beg": format_time(state.get("end_buffer", 0)),
"end": format_time(tokens[-1].end if tokens else 0), "end": format_time(state.get("end_buffer", 0)),
"diff": 0 "diff": 0
}] }]
response = { response = {
"lines": lines, "status": response_status,
"lines": final_lines_for_response,
"buffer_transcription": buffer_transcription, "buffer_transcription": buffer_transcription,
"buffer_diarization": buffer_diarization, "buffer_diarization": buffer_diarization,
"remaining_time_transcription": state["remaining_time_transcription"], "remaining_time_transcription": state["remaining_time_transcription"],
"remaining_time_diarization": state["remaining_time_diarization"] "remaining_time_diarization": state["remaining_time_diarization"]
} }
# Only yield if content has changed current_response_signature = f"{response_status} | " + \
response_content = ' '.join([f"{line['speaker']} {line['text']}" for line in lines]) + \ ' '.join([f"{line['speaker']} {line['text']}" for line in final_lines_for_response]) + \
f" | {buffer_transcription} | {buffer_diarization}" f" | {buffer_transcription} | {buffer_diarization}"
if response_content != self.last_response_content and (lines or buffer_transcription or buffer_diarization): if current_response_signature != self.last_response_content and \
(final_lines_for_response or buffer_transcription or buffer_diarization or response_status == "no_audio_detected"):
yield response yield response
self.last_response_content = response_content self.last_response_content = current_response_signature
# Check for termination condition
if self.is_stopping:
all_processors_done = True
if self.args.transcription and self.transcription_task and not self.transcription_task.done():
all_processors_done = False
if self.args.diarization and self.diarization_task and not self.diarization_task.done():
all_processors_done = False
if all_processors_done:
logger.info("Results formatter: All upstream processors are done and in stopping state. Terminating.")
final_state = await self.get_current_state()
return
await asyncio.sleep(0.1) # Avoid overwhelming the client await asyncio.sleep(0.1) # Avoid overwhelming the client
@@ -366,44 +511,169 @@ class AudioProcessor:
logger.warning(f"Exception in results_formatter: {e}") logger.warning(f"Exception in results_formatter: {e}")
logger.warning(f"Traceback: {traceback.format_exc()}") logger.warning(f"Traceback: {traceback.format_exc()}")
await asyncio.sleep(0.5) # Back off on error await asyncio.sleep(0.5) # Back off on error
async def create_tasks(self): async def create_tasks(self):
"""Create and start processing tasks.""" """Create and start processing tasks."""
self.all_tasks_for_cleanup = []
tasks = [] processing_tasks_for_watchdog = []
if self.args.transcription and self.online: if self.args.transcription and self.online:
tasks.append(asyncio.create_task(self.transcription_processor())) self.transcription_task = asyncio.create_task(self.transcription_processor())
self.all_tasks_for_cleanup.append(self.transcription_task)
processing_tasks_for_watchdog.append(self.transcription_task)
if self.args.diarization and self.diarization: if self.args.diarization and self.diarization:
tasks.append(asyncio.create_task(self.diarization_processor(self.diarization))) self.diarization_task = asyncio.create_task(self.diarization_processor(self.diarization))
self.all_tasks_for_cleanup.append(self.diarization_task)
processing_tasks_for_watchdog.append(self.diarization_task)
tasks.append(asyncio.create_task(self.ffmpeg_stdout_reader())) self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
self.tasks = tasks self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)
# Monitor overall system health
self.watchdog_task = asyncio.create_task(self.watchdog(processing_tasks_for_watchdog))
self.all_tasks_for_cleanup.append(self.watchdog_task)
return self.results_formatter() return self.results_formatter()
async def watchdog(self, tasks_to_monitor):
"""Monitors the health of critical processing tasks."""
while True:
try:
await asyncio.sleep(10)
current_time = time()
for i, task in enumerate(tasks_to_monitor):
if task.done():
exc = task.exception()
task_name = task.get_name() if hasattr(task, 'get_name') else f"Monitored Task {i}"
if exc:
logger.error(f"{task_name} unexpectedly completed with exception: {exc}")
else:
logger.info(f"{task_name} completed normally.")
ffmpeg_idle_time = current_time - self.last_ffmpeg_activity
if ffmpeg_idle_time > 15:
logger.warning(f"FFmpeg idle for {ffmpeg_idle_time:.2f}s - may need attention.")
if ffmpeg_idle_time > 30 and not self.is_stopping:
logger.error("FFmpeg idle for too long and not in stopping phase, forcing restart.")
await self.restart_ffmpeg()
except asyncio.CancelledError:
logger.info("Watchdog task cancelled.")
break
except Exception as e:
logger.error(f"Error in watchdog task: {e}", exc_info=True)
async def cleanup(self): async def cleanup(self):
"""Clean up resources when processing is complete.""" """Clean up resources when processing is complete."""
for task in self.tasks: logger.info("Starting cleanup of AudioProcessor resources.")
task.cancel() for task in self.all_tasks_for_cleanup:
if task and not task.done():
task.cancel()
created_tasks = [t for t in self.all_tasks_for_cleanup if t]
if created_tasks:
await asyncio.gather(*created_tasks, return_exceptions=True)
logger.info("All processing tasks cancelled or finished.")
if self.ffmpeg_process:
if self.ffmpeg_process.stdin and not self.ffmpeg_process.stdin.closed:
try:
self.ffmpeg_process.stdin.close()
except Exception as e:
logger.warning(f"Error closing ffmpeg stdin during cleanup: {e}")
try: # Wait for ffmpeg process to terminate
await asyncio.gather(*self.tasks, return_exceptions=True) if self.ffmpeg_process.poll() is None: # Check if process is still running
self.ffmpeg_process.stdin.close() logger.info("Waiting for FFmpeg process to terminate...")
self.ffmpeg_process.wait() try:
except Exception as e: # Run wait in executor to avoid blocking async loop
logger.warning(f"Error during cleanup: {e}") await asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait, 5.0) # 5s timeout
except Exception as e: # subprocess.TimeoutExpired is not directly caught by asyncio.wait_for with run_in_executor
if self.args.diarization and hasattr(self, 'diarization'): logger.warning(f"FFmpeg did not terminate gracefully, killing. Error: {e}")
self.ffmpeg_process.kill()
await asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait) # Wait for kill
logger.info("FFmpeg process terminated.")
if self.args.diarization and hasattr(self, 'diarization') and hasattr(self.diarization, 'close'):
self.diarization.close() self.diarization.close()
logger.info("AudioProcessor cleanup complete.")
async def process_audio(self, message): async def process_audio(self, message):
"""Process incoming audio data.""" """Process incoming audio data."""
try: # If already stopping or stdin is closed, ignore further audio, especially residual chunks.
self.ffmpeg_process.stdin.write(message) if self.is_stopping or (self.ffmpeg_process and self.ffmpeg_process.stdin and self.ffmpeg_process.stdin.closed):
self.ffmpeg_process.stdin.flush() logger.warning(f"AudioProcessor is stopping or stdin is closed. Ignoring incoming audio message (length: {len(message)}).")
except (BrokenPipeError, AttributeError) as e: if not message and self.ffmpeg_process and self.ffmpeg_process.stdin and not self.ffmpeg_process.stdin.closed:
logger.warning(f"Error writing to FFmpeg: {e}. Restarting...") logger.info("Received empty message while already in stopping state; ensuring stdin is closed.")
await self.restart_ffmpeg() try:
self.ffmpeg_process.stdin.write(message) self.ffmpeg_process.stdin.close()
self.ffmpeg_process.stdin.flush() except Exception as e:
logger.warning(f"Error closing ffmpeg stdin on redundant stop signal during stopping state: {e}")
return
if not message: # primary signal to start stopping
logger.info("Empty audio message received, initiating stop sequence.")
self.is_stopping = True
if self.ffmpeg_process and self.ffmpeg_process.stdin and not self.ffmpeg_process.stdin.closed:
try:
self.ffmpeg_process.stdin.close()
logger.info("FFmpeg stdin closed due to primary stop signal.")
except Exception as e:
logger.warning(f"Error closing ffmpeg stdin on stop: {e}")
return
retry_count = 0
max_retries = 3
# Log periodic heartbeats showing ongoing audio proc
current_time = time()
if not hasattr(self, '_last_heartbeat') or current_time - self._last_heartbeat >= 10:
logger.debug(f"Processing audio chunk, last FFmpeg activity: {current_time - self.last_ffmpeg_activity:.2f}s ago")
self._last_heartbeat = current_time
while retry_count < max_retries:
try:
if not self.ffmpeg_process or not hasattr(self.ffmpeg_process, 'stdin') or self.ffmpeg_process.poll() is not None:
logger.warning("FFmpeg process not available, restarting...")
await self.restart_ffmpeg()
loop = asyncio.get_running_loop()
try:
await asyncio.wait_for(
loop.run_in_executor(None, lambda: self.ffmpeg_process.stdin.write(message)),
timeout=2.0
)
except asyncio.TimeoutError:
logger.warning("FFmpeg write operation timed out, restarting...")
await self.restart_ffmpeg()
retry_count += 1
continue
try:
await asyncio.wait_for(
loop.run_in_executor(None, self.ffmpeg_process.stdin.flush),
timeout=2.0
)
except asyncio.TimeoutError:
logger.warning("FFmpeg flush operation timed out, restarting...")
await self.restart_ffmpeg()
retry_count += 1
continue
self.last_ffmpeg_activity = time()
return
except (BrokenPipeError, AttributeError, OSError) as e:
retry_count += 1
logger.warning(f"Error writing to FFmpeg: {e}. Retry {retry_count}/{max_retries}...")
if retry_count < max_retries:
await self.restart_ffmpeg()
await asyncio.sleep(0.5)
else:
logger.error("Maximum retries reached for FFmpeg process")
await self.restart_ffmpeg()
return

View File

@@ -0,0 +1,120 @@
from contextlib import asynccontextmanager
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.responses import HTMLResponse
from fastapi.middleware.cors import CORSMiddleware
from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
import asyncio
import logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logging.getLogger().setLevel(logging.WARNING)
logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
args = parse_args()
transcription_engine = None
@asynccontextmanager
async def lifespan(app: FastAPI):
global transcription_engine
transcription_engine = TranscriptionEngine(
**vars(args),
)
yield
app = FastAPI(lifespan=lifespan)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"],
)
@app.get("/")
async def get():
return HTMLResponse(get_web_interface_html())
async def handle_websocket_results(websocket, results_generator):
"""Consumes results from the audio processor and sends them via WebSocket."""
try:
async for response in results_generator:
await websocket.send_json(response)
# when the results_generator finishes it means all audio has been processed
logger.info("Results generator finished. Sending 'ready_to_stop' to client.")
await websocket.send_json({"type": "ready_to_stop"})
except WebSocketDisconnect:
logger.info("WebSocket disconnected while handling results (client likely closed connection).")
except Exception as e:
logger.warning(f"Error in WebSocket results handler: {e}")
@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
global transcription_engine
audio_processor = AudioProcessor(
transcription_engine=transcription_engine,
)
await websocket.accept()
logger.info("WebSocket connection opened.")
results_generator = await audio_processor.create_tasks()
websocket_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
try:
while True:
message = await websocket.receive_bytes()
await audio_processor.process_audio(message)
except KeyError as e:
if 'bytes' in str(e):
logger.warning(f"Client has closed the connection.")
else:
logger.error(f"Unexpected KeyError in websocket_endpoint: {e}", exc_info=True)
except WebSocketDisconnect:
logger.info("WebSocket disconnected by client during message receiving loop.")
except Exception as e:
logger.error(f"Unexpected error in websocket_endpoint main loop: {e}", exc_info=True)
finally:
logger.info("Cleaning up WebSocket endpoint...")
if not websocket_task.done():
websocket_task.cancel()
try:
await websocket_task
except asyncio.CancelledError:
logger.info("WebSocket results handler task was cancelled.")
except Exception as e:
logger.warning(f"Exception while awaiting websocket_task completion: {e}")
await audio_processor.cleanup()
logger.info("WebSocket endpoint cleaned up successfully.")
def main():
"""Entry point for the CLI command."""
import uvicorn
uvicorn_kwargs = {
"app": "whisperlivekit.basic_server:app",
"host":args.host,
"port":args.port,
"reload": False,
"log_level": "info",
"lifespan": "on",
}
ssl_kwargs = {}
if args.ssl_certfile or args.ssl_keyfile:
if not (args.ssl_certfile and args.ssl_keyfile):
raise ValueError("Both --ssl-certfile and --ssl-keyfile must be specified together.")
ssl_kwargs = {
"ssl_certfile": args.ssl_certfile,
"ssl_keyfile": args.ssl_keyfile
}
if ssl_kwargs:
uvicorn_kwargs = {**uvicorn_kwargs, **ssl_kwargs}
uvicorn.run(**uvicorn_kwargs)
if __name__ == "__main__":
main()

View File

@@ -1,139 +1,11 @@
from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory, warmup_asr try:
from argparse import Namespace, ArgumentParser from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory, warmup_asr
except ImportError:
from .whisper_streaming_custom.whisper_online import backend_factory, warmup_asr
from argparse import Namespace
def parse_args():
parser = ArgumentParser(description="Whisper FastAPI Online Server")
parser.add_argument(
"--host",
type=str,
default="localhost",
help="The host address to bind the server to.",
)
parser.add_argument(
"--port", type=int, default=8000, help="The port number to bind the server to."
)
parser.add_argument(
"--warmup-file",
type=str,
default=None,
dest="warmup_file",
help="""
The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
If False, no warmup is performed.
""",
)
parser.add_argument( class TranscriptionEngine:
"--confidence-validation",
type=bool,
default=False,
help="Accelerates validation of tokens using confidence scores. Transcription will be faster but punctuation might be less accurate.",
)
parser.add_argument(
"--diarization",
type=bool,
default=True,
help="Whether to enable speaker diarization.",
)
parser.add_argument(
"--transcription",
type=bool,
default=True,
help="To disable to only see live diarization results.",
)
parser.add_argument(
"--min-chunk-size",
type=float,
default=0.5,
help="Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.",
)
parser.add_argument(
"--model",
type=str,
default="tiny",
choices="tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo".split(
","
),
help="Name size of the Whisper model to use (default: large-v2). The model is automatically downloaded from the model hub if not present in model cache dir.",
)
parser.add_argument(
"--model_cache_dir",
type=str,
default=None,
help="Overriding the default model cache dir where models downloaded from the hub are saved",
)
parser.add_argument(
"--model_dir",
type=str,
default=None,
help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.",
)
parser.add_argument(
"--lan",
"--language",
type=str,
default="auto",
help="Source language code, e.g. en,de,cs, or 'auto' for language detection.",
)
parser.add_argument(
"--task",
type=str,
default="transcribe",
choices=["transcribe", "translate"],
help="Transcribe or translate.",
)
parser.add_argument(
"--backend",
type=str,
default="faster-whisper",
choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api"],
help="Load only this backend for Whisper processing.",
)
parser.add_argument(
"--vac",
action="store_true",
default=False,
help="Use VAC = voice activity controller. Recommended. Requires torch.",
)
parser.add_argument(
"--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
)
parser.add_argument(
"--vad",
action="store_true",
default=True,
help="Use VAD = voice activity detection, with the default parameters.",
)
parser.add_argument(
"--buffer_trimming",
type=str,
default="segment",
choices=["sentence", "segment"],
help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.',
)
parser.add_argument(
"--buffer_trimming_sec",
type=float,
default=15,
help="Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.",
)
parser.add_argument(
"-l",
"--log-level",
dest="log_level",
choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
help="Set the log level",
default="DEBUG",
)
args = parser.parse_args()
return args
class WhisperLiveKit:
_instance = None _instance = None
_initialized = False _initialized = False
@@ -143,14 +15,51 @@ class WhisperLiveKit:
return cls._instance return cls._instance
def __init__(self, **kwargs): def __init__(self, **kwargs):
if WhisperLiveKit._initialized: if TranscriptionEngine._initialized:
return return
default_args = vars(parse_args()) defaults = {
"host": "localhost",
"port": 8000,
"warmup_file": None,
"confidence_validation": False,
"diarization": False,
"punctuation_split": False,
"min_chunk_size": 0.5,
"model": "tiny",
"model_cache_dir": None,
"model_dir": None,
"lan": "auto",
"task": "transcribe",
"backend": "faster-whisper",
"vac": False,
"vac_chunk_size": 0.04,
"buffer_trimming": "segment",
"buffer_trimming_sec": 15,
"log_level": "DEBUG",
"ssl_certfile": None,
"ssl_keyfile": None,
"transcription": True,
"vad": True,
"segmentation_model": "pyannote/segmentation-3.0",
"embedding_model": "pyannote/embedding",
}
config_dict = {**defaults, **kwargs}
if 'no_transcription' in kwargs:
config_dict['transcription'] = not kwargs['no_transcription']
if 'no_vad' in kwargs:
config_dict['vad'] = not kwargs['no_vad']
merged_args = {**default_args, **kwargs} config_dict.pop('no_transcription', None)
config_dict.pop('no_vad', None)
self.args = Namespace(**merged_args)
if 'language' in kwargs:
config_dict['lan'] = kwargs['language']
config_dict.pop('language', None)
self.args = Namespace(**config_dict)
self.asr = None self.asr = None
self.tokenizer = None self.tokenizer = None
@@ -162,13 +71,10 @@ class WhisperLiveKit:
if self.args.diarization: if self.args.diarization:
from whisperlivekit.diarization.diarization_online import DiartDiarization from whisperlivekit.diarization.diarization_online import DiartDiarization
self.diarization = DiartDiarization() self.diarization = DiartDiarization(
block_duration=self.args.min_chunk_size,
segmentation_model_name=self.args.segmentation_model,
embedding_model_name=self.args.embedding_model
)
WhisperLiveKit._initialized = True TranscriptionEngine._initialized = True
def web_interface(self):
import pkg_resources
html_path = pkg_resources.resource_filename('whisperlivekit', 'web/live_transcription.html')
with open(html_path, "r", encoding="utf-8") as f:
html = f.read()
return html

View File

View File

@@ -3,7 +3,8 @@ import re
import threading import threading
import numpy as np import numpy as np
import logging import logging
import time
from queue import SimpleQueue, Empty
from diart import SpeakerDiarization, SpeakerDiarizationConfig from diart import SpeakerDiarization, SpeakerDiarizationConfig
from diart.inference import StreamingInference from diart.inference import StreamingInference
@@ -13,6 +14,7 @@ from diart.sources import MicrophoneAudioSource
from rx.core import Observer from rx.core import Observer
from typing import Tuple, Any, List from typing import Tuple, Any, List
from pyannote.core import Annotation from pyannote.core import Annotation
import diart.models as m
logger = logging.getLogger(__name__) logger = logging.getLogger(__name__)
@@ -78,40 +80,114 @@ class DiarizationObserver(Observer):
class WebSocketAudioSource(AudioSource): class WebSocketAudioSource(AudioSource):
""" """
Custom AudioSource that blocks in read() until close() is called. Buffers incoming audio and releases it in fixed-size chunks at regular intervals.
Use push_audio() to inject PCM chunks.
""" """
def __init__(self, uri: str = "websocket", sample_rate: int = 16000): def __init__(self, uri: str = "websocket", sample_rate: int = 16000, block_duration: float = 0.5):
super().__init__(uri, sample_rate) super().__init__(uri, sample_rate)
self.block_duration = block_duration
self.block_size = int(np.rint(block_duration * sample_rate))
self._queue = SimpleQueue()
self._buffer = np.array([], dtype=np.float32)
self._buffer_lock = threading.Lock()
self._closed = False self._closed = False
self._close_event = threading.Event() self._close_event = threading.Event()
self._processing_thread = None
self._last_chunk_time = time.time()
def read(self): def read(self):
"""Start processing buffered audio and emit fixed-size chunks."""
self._processing_thread = threading.Thread(target=self._process_chunks)
self._processing_thread.daemon = True
self._processing_thread.start()
self._close_event.wait() self._close_event.wait()
if self._processing_thread:
self._processing_thread.join(timeout=2.0)
def _process_chunks(self):
"""Process audio from queue and emit fixed-size chunks at regular intervals."""
while not self._closed:
try:
audio_chunk = self._queue.get(timeout=0.1)
with self._buffer_lock:
self._buffer = np.concatenate([self._buffer, audio_chunk])
while len(self._buffer) >= self.block_size:
chunk = self._buffer[:self.block_size]
self._buffer = self._buffer[self.block_size:]
current_time = time.time()
time_since_last = current_time - self._last_chunk_time
if time_since_last < self.block_duration:
time.sleep(self.block_duration - time_since_last)
chunk_reshaped = chunk.reshape(1, -1)
self.stream.on_next(chunk_reshaped)
self._last_chunk_time = time.time()
except Empty:
with self._buffer_lock:
if len(self._buffer) > 0 and time.time() - self._last_chunk_time > self.block_duration:
padded_chunk = np.zeros(self.block_size, dtype=np.float32)
padded_chunk[:len(self._buffer)] = self._buffer
self._buffer = np.array([], dtype=np.float32)
chunk_reshaped = padded_chunk.reshape(1, -1)
self.stream.on_next(chunk_reshaped)
self._last_chunk_time = time.time()
except Exception as e:
logger.error(f"Error in audio processing thread: {e}")
self.stream.on_error(e)
break
with self._buffer_lock:
if len(self._buffer) > 0:
padded_chunk = np.zeros(self.block_size, dtype=np.float32)
padded_chunk[:len(self._buffer)] = self._buffer
chunk_reshaped = padded_chunk.reshape(1, -1)
self.stream.on_next(chunk_reshaped)
self.stream.on_completed()
def close(self): def close(self):
if not self._closed: if not self._closed:
self._closed = True self._closed = True
self.stream.on_completed()
self._close_event.set() self._close_event.set()
def push_audio(self, chunk: np.ndarray): def push_audio(self, chunk: np.ndarray):
"""Add audio chunk to the processing queue."""
if not self._closed: if not self._closed:
new_audio = np.expand_dims(chunk, axis=0) if chunk.ndim > 1:
logger.debug('Add new chunk with shape:', new_audio.shape) chunk = chunk.flatten()
self.stream.on_next(new_audio) self._queue.put(chunk)
logger.debug(f'Added chunk to queue with {len(chunk)} samples')
class DiartDiarization: class DiartDiarization:
def __init__(self, sample_rate: int = 16000, config : SpeakerDiarizationConfig = None, use_microphone: bool = False): def __init__(self, sample_rate: int = 16000, config : SpeakerDiarizationConfig = None, use_microphone: bool = False, block_duration: float = 0.5, segmentation_model_name: str = "pyannote/segmentation-3.0", embedding_model_name: str = "speechbrain/spkrec-ecapa-voxceleb"):
segmentation_model = m.SegmentationModel.from_pretrained(segmentation_model_name)
embedding_model = m.EmbeddingModel.from_pretrained(embedding_model_name)
if config is None:
config = SpeakerDiarizationConfig(
segmentation=segmentation_model,
embedding=embedding_model,
)
self.pipeline = SpeakerDiarization(config=config) self.pipeline = SpeakerDiarization(config=config)
self.observer = DiarizationObserver() self.observer = DiarizationObserver()
self.lag_diart = None
if use_microphone: if use_microphone:
self.source = MicrophoneAudioSource() self.source = MicrophoneAudioSource(block_duration=block_duration)
self.custom_source = None self.custom_source = None
else: else:
self.custom_source = WebSocketAudioSource(uri="websocket_source", sample_rate=sample_rate) self.custom_source = WebSocketAudioSource(
uri="websocket_source",
sample_rate=sample_rate,
block_duration=block_duration
)
self.source = self.custom_source self.source = self.custom_source
self.inference = StreamingInference( self.inference = StreamingInference(
@@ -138,16 +214,102 @@ class DiartDiarization:
if self.custom_source: if self.custom_source:
self.custom_source.close() self.custom_source.close()
def assign_speakers_to_tokens(self, end_attributed_speaker, tokens: list) -> float: def assign_speakers_to_tokens(self, end_attributed_speaker, tokens: list, use_punctuation_split: bool = False) -> float:
""" """
Assign speakers to tokens based on timing overlap with speaker segments. Assign speakers to tokens based on timing overlap with speaker segments.
Uses the segments collected by the observer. Uses the segments collected by the observer.
If use_punctuation_split is True, uses punctuation marks to refine speaker boundaries.
""" """
segments = self.observer.get_segments() segments = self.observer.get_segments()
# Debug logging
logger.debug(f"assign_speakers_to_tokens called with {len(tokens)} tokens")
logger.debug(f"Available segments: {len(segments)}")
for i, seg in enumerate(segments[:5]): # Show first 5 segments
logger.debug(f" Segment {i}: {seg.speaker} [{seg.start:.2f}-{seg.end:.2f}]")
if not self.lag_diart and segments and tokens:
self.lag_diart = segments[0].start - tokens[0].start
for token in tokens: for token in tokens:
for segment in segments: for segment in segments:
if not (segment.end <= token.start or segment.start >= token.end): if not (segment.end <= token.start + self.lag_diart or segment.start >= token.end + self.lag_diart):
token.speaker = extract_number(segment.speaker) + 1 token.speaker = extract_number(segment.speaker) + 1
end_attributed_speaker = max(token.end, end_attributed_speaker) end_attributed_speaker = max(token.end, end_attributed_speaker)
return end_attributed_speaker
if use_punctuation_split and len(tokens) > 1:
punctuation_marks = {'.', '!', '?'}
print("Here are the tokens:",
[(t.text, t.start, t.end, t.speaker) for t in tokens[:10]])
segment_map = []
for segment in segments:
speaker_num = extract_number(segment.speaker) + 1
segment_map.append((segment.start, segment.end, speaker_num))
segment_map.sort(key=lambda x: x[0])
i = 0
while i < len(tokens):
current_token = tokens[i]
is_sentence_end = False
if current_token.text and current_token.text.strip():
text = current_token.text.strip()
if text[-1] in punctuation_marks:
is_sentence_end = True
logger.debug(f"Token {i} ends sentence: '{current_token.text}' at {current_token.end:.2f}s")
if is_sentence_end and current_token.speaker != -1:
punctuation_time = current_token.end
current_speaker = current_token.speaker
j = i + 1
next_sentence_tokens = []
while j < len(tokens):
next_token = tokens[j]
next_sentence_tokens.append(j)
# Check if this token ends the next sentence
if next_token.text and next_token.text.strip():
if next_token.text.strip()[-1] in punctuation_marks:
break
j += 1
if next_sentence_tokens:
speaker_times = {}
for idx in next_sentence_tokens:
token = tokens[idx]
# Find which segments overlap with this token
for seg_start, seg_end, seg_speaker in segment_map:
if not (seg_end <= token.start or seg_start >= token.end):
# Calculate overlap duration
overlap_start = max(seg_start, token.start)
overlap_end = min(seg_end, token.end)
overlap_duration = overlap_end - overlap_start
if seg_speaker not in speaker_times:
speaker_times[seg_speaker] = 0
speaker_times[seg_speaker] += overlap_duration
if speaker_times:
dominant_speaker = max(speaker_times.items(), key=lambda x: x[1])[0]
if dominant_speaker != current_speaker:
logger.debug(f" Speaker change after punctuation: {current_speaker}{dominant_speaker}")
for idx in next_sentence_tokens:
if tokens[idx].speaker != dominant_speaker:
logger.debug(f" Reassigning token {idx} ('{tokens[idx].text}') to Speaker {dominant_speaker}")
tokens[idx].speaker = dominant_speaker
end_attributed_speaker = max(tokens[idx].end, end_attributed_speaker)
else:
for idx in next_sentence_tokens:
if tokens[idx].speaker == -1:
tokens[idx].speaker = current_speaker
end_attributed_speaker = max(tokens[idx].end, end_attributed_speaker)
i += 1
return end_attributed_speaker

View File

@@ -0,0 +1,162 @@
from argparse import ArgumentParser
def parse_args():
parser = ArgumentParser(description="Whisper FastAPI Online Server")
parser.add_argument(
"--host",
type=str,
default="localhost",
help="The host address to bind the server to.",
)
parser.add_argument(
"--port", type=int, default=8000, help="The port number to bind the server to."
)
parser.add_argument(
"--warmup-file",
type=str,
default=None,
dest="warmup_file",
help="""
The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
If False, no warmup is performed.
""",
)
parser.add_argument(
"--confidence-validation",
action="store_true",
help="Accelerates validation of tokens using confidence scores. Transcription will be faster but punctuation might be less accurate.",
)
parser.add_argument(
"--diarization",
action="store_true",
default=False,
help="Enable speaker diarization.",
)
parser.add_argument(
"--punctuation-split",
action="store_true",
default=False,
help="Use punctuation marks from transcription to improve speaker boundary detection. Requires both transcription and diarization to be enabled.",
)
parser.add_argument(
"--segmentation-model",
type=str,
default="pyannote/segmentation-3.0",
help="Hugging Face model ID for pyannote.audio segmentation model.",
)
parser.add_argument(
"--embedding-model",
type=str,
default="pyannote/embedding",
help="Hugging Face model ID for pyannote.audio embedding model.",
)
parser.add_argument(
"--no-transcription",
action="store_true",
help="Disable transcription to only see live diarization results.",
)
parser.add_argument(
"--min-chunk-size",
type=float,
default=0.5,
help="Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.",
)
parser.add_argument(
"--model",
type=str,
default="tiny",
help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.",
)
parser.add_argument(
"--model_cache_dir",
type=str,
default=None,
help="Overriding the default model cache dir where models downloaded from the hub are saved",
)
parser.add_argument(
"--model_dir",
type=str,
default=None,
help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.",
)
parser.add_argument(
"--lan",
"--language",
type=str,
default="auto",
help="Source language code, e.g. en,de,cs, or 'auto' for language detection.",
)
parser.add_argument(
"--task",
type=str,
default="transcribe",
choices=["transcribe", "translate"],
help="Transcribe or translate.",
)
parser.add_argument(
"--backend",
type=str,
default="faster-whisper",
choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api"],
help="Load only this backend for Whisper processing.",
)
parser.add_argument(
"--vac",
action="store_true",
default=False,
help="Use VAC = voice activity controller. Recommended. Requires torch.",
)
parser.add_argument(
"--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
)
parser.add_argument(
"--no-vad",
action="store_true",
help="Disable VAD (voice activity detection).",
)
parser.add_argument(
"--buffer_trimming",
type=str,
default="segment",
choices=["sentence", "segment"],
help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.',
)
parser.add_argument(
"--buffer_trimming_sec",
type=float,
default=15,
help="Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.",
)
parser.add_argument(
"-l",
"--log-level",
dest="log_level",
choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
help="Set the log level",
default="DEBUG",
)
parser.add_argument("--ssl-certfile", type=str, help="Path to the SSL certificate file.", default=None)
parser.add_argument("--ssl-keyfile", type=str, help="Path to the SSL private key file.", default=None)
args = parser.parse_args()
args.transcription = not args.no_transcription
args.vad = not args.no_vad
delattr(args, 'no_transcription')
delattr(args, 'no_vad')
return args

View File

@@ -26,4 +26,7 @@ class Transcript(TimedText):
@dataclass @dataclass
class SpeakerSegment(TimedText): class SpeakerSegment(TimedText):
"""Represents a segment of audio attributed to a specific speaker.
No text nor probability is associated with this segment.
"""
pass pass

View File

View File

@@ -38,7 +38,6 @@
transform: scale(0.95); transform: scale(0.95);
} }
/* Shape inside the button */
.shape-container { .shape-container {
width: 25px; width: 25px;
height: 25px; height: 25px;
@@ -56,6 +55,10 @@
transition: all 0.3s ease; transition: all 0.3s ease;
} }
#recordButton:disabled .shape {
background-color: #6e6d6d;
}
#recordButton.recording .shape { #recordButton.recording .shape {
border-radius: 5px; border-radius: 5px;
width: 25px; width: 25px;
@@ -279,7 +282,7 @@
</div> </div>
<div> <div>
<label for="websocketInput">WebSocket URL:</label> <label for="websocketInput">WebSocket URL:</label>
<input id="websocketInput" type="text" value="ws://localhost:8000/asr" /> <input id="websocketInput" type="text" />
</div> </div>
</div> </div>
</div> </div>
@@ -304,6 +307,8 @@
let waveCanvas = document.getElementById("waveCanvas"); let waveCanvas = document.getElementById("waveCanvas");
let waveCtx = waveCanvas.getContext("2d"); let waveCtx = waveCanvas.getContext("2d");
let animationFrame = null; let animationFrame = null;
let waitingForStop = false;
let lastReceivedData = null;
waveCanvas.width = 60 * (window.devicePixelRatio || 1); waveCanvas.width = 60 * (window.devicePixelRatio || 1);
waveCanvas.height = 30 * (window.devicePixelRatio || 1); waveCanvas.height = 30 * (window.devicePixelRatio || 1);
waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1); waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
@@ -315,6 +320,13 @@
const linesTranscriptDiv = document.getElementById("linesTranscript"); const linesTranscriptDiv = document.getElementById("linesTranscript");
const timerElement = document.querySelector(".timer"); const timerElement = document.querySelector(".timer");
const host = window.location.hostname || "localhost";
const port = window.location.port || "8000";
const protocol = window.location.protocol === "https:" ? "wss" : "ws";
const defaultWebSocketUrl = `${protocol}://${host}:${port}/asr`;
websocketInput.value = defaultWebSocketUrl;
websocketUrl = defaultWebSocketUrl;
chunkSelector.addEventListener("change", () => { chunkSelector.addEventListener("change", () => {
chunkDuration = parseInt(chunkSelector.value); chunkDuration = parseInt(chunkSelector.value);
}); });
@@ -346,12 +358,31 @@
websocket.onclose = () => { websocket.onclose = () => {
if (userClosing) { if (userClosing) {
statusText.textContent = "WebSocket closed by user."; if (waitingForStop) {
statusText.textContent = "Processing finalized or connection closed.";
if (lastReceivedData) {
renderLinesWithBuffer(
lastReceivedData.lines || [],
lastReceivedData.buffer_diarization || "",
lastReceivedData.buffer_transcription || "",
0, 0, true // isFinalizing = true
);
}
}
// If ready_to_stop was received, statusText is already "Finished processing..."
// and waitingForStop is false.
} else { } else {
statusText.textContent = statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
"Disconnected from the WebSocket server. (Check logs if model is loading.)"; if (isRecording) {
stopRecording();
}
} }
userClosing = false; isRecording = false;
waitingForStop = false;
userClosing = false;
lastReceivedData = null;
websocket = null;
updateUI();
}; };
websocket.onerror = () => { websocket.onerror = () => {
@@ -363,12 +394,41 @@
websocket.onmessage = (event) => { websocket.onmessage = (event) => {
const data = JSON.parse(event.data); const data = JSON.parse(event.data);
// Check for status messages
if (data.type === "ready_to_stop") {
console.log("Ready to stop received, finalizing display and closing WebSocket.");
waitingForStop = false;
if (lastReceivedData) {
renderLinesWithBuffer(
lastReceivedData.lines || [],
lastReceivedData.buffer_diarization || "",
lastReceivedData.buffer_transcription || "",
0, // No more lag
0, // No more lag
true // isFinalizing = true
);
}
statusText.textContent = "Finished processing audio! Ready to record again.";
recordButton.disabled = false;
if (websocket) {
websocket.close(); // will trigger onclose
// websocket = null; // onclose handle setting websocket to null
}
return;
}
lastReceivedData = data;
// Handle normal transcription updates
const { const {
lines = [], lines = [],
buffer_transcription = "", buffer_transcription = "",
buffer_diarization = "", buffer_diarization = "",
remaining_time_transcription = 0, remaining_time_transcription = 0,
remaining_time_diarization = 0 remaining_time_diarization = 0,
status = "active_transcription"
} = data; } = data;
renderLinesWithBuffer( renderLinesWithBuffer(
@@ -376,13 +436,20 @@
buffer_diarization, buffer_diarization,
buffer_transcription, buffer_transcription,
remaining_time_diarization, remaining_time_diarization,
remaining_time_transcription remaining_time_transcription,
false,
status
); );
}; };
}); });
} }
function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription) { function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription, isFinalizing = false, current_status = "active_transcription") {
if (current_status === "no_audio_detected") {
linesTranscriptDiv.innerHTML = "<p style='text-align: center; color: #666; margin-top: 20px;'><em>No audio detected...</em></p>";
return;
}
const linesHtml = lines.map((item, idx) => { const linesHtml = lines.map((item, idx) => {
let timeInfo = ""; let timeInfo = "";
if (item.beg !== undefined && item.end !== undefined) { if (item.beg !== undefined && item.end !== undefined) {
@@ -392,30 +459,46 @@
let speakerLabel = ""; let speakerLabel = "";
if (item.speaker === -2) { if (item.speaker === -2) {
speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`; speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
} else if (item.speaker == 0) { } else if (item.speaker == 0 && !isFinalizing) {
speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'>${remaining_time_diarization} second(s) of audio are undergoing diarization</span></span>`; speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'>${remaining_time_diarization} second(s) of audio are undergoing diarization</span></span>`;
} else if (item.speaker == -1) { } else if (item.speaker == -1) {
speakerLabel = `<span id="speaker"><span id='timeInfo'>${timeInfo}</span></span>`; speakerLabel = `<span id="speaker">Speaker 1<span id='timeInfo'>${timeInfo}</span></span>`;
} else if (item.speaker !== -1) { } else if (item.speaker !== -1 && item.speaker !== 0) {
speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`; speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
} }
let textContent = item.text;
if (idx === lines.length - 1) { let currentLineText = item.text || "";
speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'>${remaining_time_transcription}s</span></span>`
} if (idx === lines.length - 1) {
if (idx === lines.length - 1 && buffer_diarization) { if (!isFinalizing) {
speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'>${remaining_time_diarization}s</span></span>` if (remaining_time_transcription > 0) {
textContent += `<span class="buffer_diarization">${buffer_diarization}</span>`; speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'>${remaining_time_transcription}s</span></span>`;
} }
if (idx === lines.length - 1) { if (buffer_diarization && remaining_time_diarization > 0) {
textContent += `<span class="buffer_transcription">${buffer_transcription}</span>`; speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'>${remaining_time_diarization}s</span></span>`;
}
}
if (buffer_diarization) {
if (isFinalizing) {
currentLineText += (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
} else {
currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
}
}
if (buffer_transcription) {
if (isFinalizing) {
currentLineText += (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") + buffer_transcription.trim();
} else {
currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
}
}
} }
return currentLineText.trim().length > 0 || speakerLabel.length > 0
return textContent ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
? `<p>${speakerLabel}<br/><div class='textcontent'>${textContent}</div></p>` : `<p>${speakerLabel}<br/></p>`;
: `<p>${speakerLabel}<br/></p>`;
}).join(""); }).join("");
linesTranscriptDiv.innerHTML = linesHtml; linesTranscriptDiv.innerHTML = linesHtml;
@@ -494,8 +577,17 @@
} }
} }
function stopRecording() { async function stopRecording() {
userClosing = true; userClosing = true;
waitingForStop = true;
if (websocket && websocket.readyState === WebSocket.OPEN) {
// Send empty audio buffer as stop signal
const emptyBlob = new Blob([], { type: 'audio/webm' });
websocket.send(emptyBlob);
statusText.textContent = "Recording stopped. Processing final audio...";
}
if (recorder) { if (recorder) {
recorder.stop(); recorder.stop();
recorder = null; recorder = null;
@@ -531,38 +623,60 @@
timerElement.textContent = "00:00"; timerElement.textContent = "00:00";
startTime = null; startTime = null;
isRecording = false; isRecording = false;
if (websocket) {
websocket.close();
websocket = null;
}
updateUI(); updateUI();
} }
async function toggleRecording() { async function toggleRecording() {
if (!isRecording) { if (!isRecording) {
linesTranscriptDiv.innerHTML = ""; if (waitingForStop) {
console.log("Waiting for stop, early return");
return; // Early return, UI is already updated
}
console.log("Connecting to WebSocket");
try { try {
await setupWebSocket(); // If we have an active WebSocket that's still processing, just restart audio capture
await startRecording(); if (websocket && websocket.readyState === WebSocket.OPEN) {
await startRecording();
} else {
// If no active WebSocket or it's closed, create new one
await setupWebSocket();
await startRecording();
}
} catch (err) { } catch (err) {
statusText.textContent = "Could not connect to WebSocket or access mic. Aborted."; statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
console.error(err); console.error(err);
} }
} else { } else {
console.log("Stopping recording");
stopRecording(); stopRecording();
} }
} }
function updateUI() { function updateUI() {
recordButton.classList.toggle("recording", isRecording); recordButton.classList.toggle("recording", isRecording);
statusText.textContent = isRecording ? "Recording..." : "Click to start transcription"; recordButton.disabled = waitingForStop;
if (waitingForStop) {
if (statusText.textContent !== "Recording stopped. Processing final audio...") {
statusText.textContent = "Please wait for processing to complete...";
}
} else if (isRecording) {
statusText.textContent = "Recording...";
} else {
if (statusText.textContent !== "Finished processing audio! Ready to record again." &&
statusText.textContent !== "Processing finalized or connection closed.") {
statusText.textContent = "Click to start transcription";
}
}
if (!waitingForStop) {
recordButton.disabled = false;
}
} }
recordButton.addEventListener("click", toggleRecording); recordButton.addEventListener("click", toggleRecording);
</script> </script>
</body> </body>
</html> </html>

View File

@@ -0,0 +1,13 @@
import logging
import importlib.resources as resources
logger = logging.getLogger(__name__)
def get_web_interface_html():
"""Loads the HTML for the web interface using importlib.resources."""
try:
with resources.files('whisperlivekit.web').joinpath('live_transcription.html').open('r', encoding='utf-8') as f:
return f.read()
except Exception as e:
logger.error(f"Error loading web interface HTML: {e}")
return "<html><body><h1>Error loading interface</h1></body></html>"

View File

@@ -3,7 +3,10 @@ import logging
import io import io
import soundfile as sf import soundfile as sf
import math import math
import torch try:
import torch
except ImportError:
torch = None
from typing import List from typing import List
import numpy as np import numpy as np
from whisperlivekit.timed_objects import ASRToken from whisperlivekit.timed_objects import ASRToken
@@ -102,8 +105,9 @@ class FasterWhisperASR(ASRBase):
model_size_or_path = modelsize model_size_or_path = modelsize
else: else:
raise ValueError("Either modelsize or model_dir must be set") raise ValueError("Either modelsize or model_dir must be set")
device = "cuda" if torch.cuda.is_available() else "cpu" device = "auto" # Allow CTranslate2 to decide available device
compute_type = "float16" if device == "cuda" else "float32" compute_type = "auto" # Allow CTranslate2 to decide faster compute type
model = WhisperModel( model = WhisperModel(
model_size_or_path, model_size_or_path,
@@ -249,8 +253,8 @@ class OpenaiApiASR(ASRBase):
no_speech_segments = [] no_speech_segments = []
if self.use_vad_opt: if self.use_vad_opt:
for segment in segments.segments: for segment in segments.segments:
if segment["no_speech_prob"] > 0.8: if segment.no_speech_prob > 0.8:
no_speech_segments.append((segment.get("start"), segment.get("end"))) no_speech_segments.append((segment.start, segment.end))
tokens = [] tokens = []
for word in segments.words: for word in segments.words:
start = word.start start = word.start

View File

@@ -144,7 +144,11 @@ class OnlineASRProcessor:
self.transcript_buffer.last_committed_time = self.buffer_time_offset self.transcript_buffer.last_committed_time = self.buffer_time_offset
self.committed: List[ASRToken] = [] self.committed: List[ASRToken] = []
def insert_audio_chunk(self, audio: np.ndarray): def get_audio_buffer_end_time(self) -> float:
"""Returns the absolute end time of the current audio_buffer."""
return self.buffer_time_offset + (len(self.audio_buffer) / self.SAMPLING_RATE)
def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
"""Append an audio chunk (a numpy array) to the current audio buffer.""" """Append an audio chunk (a numpy array) to the current audio buffer."""
self.audio_buffer = np.append(self.audio_buffer, audio) self.audio_buffer = np.append(self.audio_buffer, audio)
@@ -179,18 +183,19 @@ class OnlineASRProcessor:
return self.concatenate_tokens(self.transcript_buffer.buffer) return self.concatenate_tokens(self.transcript_buffer.buffer)
def process_iter(self) -> Transcript: def process_iter(self) -> Tuple[List[ASRToken], float]:
""" """
Processes the current audio buffer. Processes the current audio buffer.
Returns a Transcript object representing the committed transcript. Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
""" """
current_audio_processed_upto = self.get_audio_buffer_end_time()
prompt_text, _ = self.prompt() prompt_text, _ = self.prompt()
logger.debug( logger.debug(
f"Transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds from {self.buffer_time_offset:.2f}" f"Transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds from {self.buffer_time_offset:.2f}"
) )
res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt_text) res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt_text)
tokens = self.asr.ts_words(res) # Expecting List[ASRToken] tokens = self.asr.ts_words(res)
self.transcript_buffer.insert(tokens, self.buffer_time_offset) self.transcript_buffer.insert(tokens, self.buffer_time_offset)
committed_tokens = self.transcript_buffer.flush() committed_tokens = self.transcript_buffer.flush()
self.committed.extend(committed_tokens) self.committed.extend(committed_tokens)
@@ -210,37 +215,60 @@ class OnlineASRProcessor:
logger.debug( logger.debug(
f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds" f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
) )
return committed_tokens return committed_tokens, current_audio_processed_upto
def chunk_completed_sentence(self): def chunk_completed_sentence(self):
""" """
If the committed tokens form at least two sentences, chunk the audio If the committed tokens form at least two sentences, chunk the audio
buffer at the end time of the penultimate sentence. buffer at the end time of the penultimate sentence.
Also ensures chunking happens if audio buffer exceeds a time limit.
""" """
buffer_duration = len(self.audio_buffer) / self.SAMPLING_RATE
if not self.committed: if not self.committed:
if buffer_duration > self.buffer_trimming_sec:
chunk_time = self.buffer_time_offset + (buffer_duration / 2)
logger.debug(f"--- No speech detected, forced chunking at {chunk_time:.2f}")
self.chunk_at(chunk_time)
return return
logger.debug("COMPLETED SENTENCE: " + " ".join(token.text for token in self.committed)) logger.debug("COMPLETED SENTENCE: " + " ".join(token.text for token in self.committed))
sentences = self.words_to_sentences(self.committed) sentences = self.words_to_sentences(self.committed)
for sentence in sentences: for sentence in sentences:
logger.debug(f"\tSentence: {sentence.text}") logger.debug(f"\tSentence: {sentence.text}")
if len(sentences) < 2:
return chunk_done = False
# Keep the last two sentences. if len(sentences) >= 2:
while len(sentences) > 2: while len(sentences) > 2:
sentences.pop(0) sentences.pop(0)
chunk_time = sentences[-2].end chunk_time = sentences[-2].end
logger.debug(f"--- Sentence chunked at {chunk_time:.2f}") logger.debug(f"--- Sentence chunked at {chunk_time:.2f}")
self.chunk_at(chunk_time) self.chunk_at(chunk_time)
chunk_done = True
if not chunk_done and buffer_duration > self.buffer_trimming_sec:
last_committed_time = self.committed[-1].end
logger.debug(f"--- Not enough sentences, chunking at last committed time {last_committed_time:.2f}")
self.chunk_at(last_committed_time)
def chunk_completed_segment(self, res): def chunk_completed_segment(self, res):
""" """
Chunk the audio buffer based on segment-end timestamps reported by the ASR. Chunk the audio buffer based on segment-end timestamps reported by the ASR.
Also ensures chunking happens if audio buffer exceeds a time limit.
""" """
buffer_duration = len(self.audio_buffer) / self.SAMPLING_RATE
if not self.committed: if not self.committed:
if buffer_duration > self.buffer_trimming_sec:
chunk_time = self.buffer_time_offset + (buffer_duration / 2)
logger.debug(f"--- No speech detected, forced chunking at {chunk_time:.2f}")
self.chunk_at(chunk_time)
return return
logger.debug("Processing committed tokens for segmenting")
ends = self.asr.segments_end_ts(res) ends = self.asr.segments_end_ts(res)
last_committed_time = self.committed[-1].end last_committed_time = self.committed[-1].end
chunk_done = False
if len(ends) > 1: if len(ends) > 1:
logger.debug("Multiple segments available for chunking")
e = ends[-2] + self.buffer_time_offset e = ends[-2] + self.buffer_time_offset
while len(ends) > 2 and e > last_committed_time: while len(ends) > 2 and e > last_committed_time:
ends.pop(-1) ends.pop(-1)
@@ -248,11 +276,18 @@ class OnlineASRProcessor:
if e <= last_committed_time: if e <= last_committed_time:
logger.debug(f"--- Segment chunked at {e:.2f}") logger.debug(f"--- Segment chunked at {e:.2f}")
self.chunk_at(e) self.chunk_at(e)
chunk_done = True
else: else:
logger.debug("--- Last segment not within committed area") logger.debug("--- Last segment not within committed area")
else: else:
logger.debug("--- Not enough segments to chunk") logger.debug("--- Not enough segments to chunk")
if not chunk_done and buffer_duration > self.buffer_trimming_sec:
logger.debug(f"--- Buffer too large, chunking at last committed time {last_committed_time:.2f}")
self.chunk_at(last_committed_time)
logger.debug("Segment chunking complete")
def chunk_at(self, time: float): def chunk_at(self, time: float):
""" """
Trim both the hypothesis and audio buffer at the given time. Trim both the hypothesis and audio buffer at the given time.
@@ -313,15 +348,17 @@ class OnlineASRProcessor:
) )
sentences.append(sentence) sentences.append(sentence)
return sentences return sentences
def finish(self) -> Transcript:
def finish(self) -> Tuple[List[ASRToken], float]:
""" """
Flush the remaining transcript when processing ends. Flush the remaining transcript when processing ends.
Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
""" """
remaining_tokens = self.transcript_buffer.buffer remaining_tokens = self.transcript_buffer.buffer
final_transcript = self.concatenate_tokens(remaining_tokens) logger.debug(f"Final non-committed tokens: {remaining_tokens}")
logger.debug(f"Final non-committed transcript: {final_transcript}") final_processed_upto = self.buffer_time_offset + (len(self.audio_buffer) / self.SAMPLING_RATE)
self.buffer_time_offset += len(self.audio_buffer) / self.SAMPLING_RATE self.buffer_time_offset = final_processed_upto
return final_transcript return remaining_tokens, final_processed_upto
def concatenate_tokens( def concatenate_tokens(
self, self,
@@ -354,36 +391,44 @@ class VACOnlineASRProcessor:
def __init__(self, online_chunk_size: float, *args, **kwargs): def __init__(self, online_chunk_size: float, *args, **kwargs):
self.online_chunk_size = online_chunk_size self.online_chunk_size = online_chunk_size
self.online = OnlineASRProcessor(*args, **kwargs) self.online = OnlineASRProcessor(*args, **kwargs)
self.asr = self.online.asr
# Load a VAD model (e.g. Silero VAD) # Load a VAD model (e.g. Silero VAD)
import torch import torch
model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad") model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
from silero_vad_iterator import FixedVADIterator from .silero_vad_iterator import FixedVADIterator
self.vac = FixedVADIterator(model) self.vac = FixedVADIterator(model)
self.logfile = self.online.logfile self.logfile = self.online.logfile
self.last_input_audio_stream_end_time: float = 0.0
self.init() self.init()
def init(self): def init(self):
self.online.init() self.online.init()
self.vac.reset_states() self.vac.reset_states()
self.current_online_chunk_buffer_size = 0 self.current_online_chunk_buffer_size = 0
self.last_input_audio_stream_end_time = self.online.buffer_time_offset
self.is_currently_final = False self.is_currently_final = False
self.status: Optional[str] = None # "voice" or "nonvoice" self.status: Optional[str] = None # "voice" or "nonvoice"
self.audio_buffer = np.array([], dtype=np.float32) self.audio_buffer = np.array([], dtype=np.float32)
self.buffer_offset = 0 # in frames self.buffer_offset = 0 # in frames
def get_audio_buffer_end_time(self) -> float:
"""Returns the absolute end time of the audio processed by the underlying OnlineASRProcessor."""
return self.online.get_audio_buffer_end_time()
def clear_buffer(self): def clear_buffer(self):
self.buffer_offset += len(self.audio_buffer) self.buffer_offset += len(self.audio_buffer)
self.audio_buffer = np.array([], dtype=np.float32) self.audio_buffer = np.array([], dtype=np.float32)
def insert_audio_chunk(self, audio: np.ndarray): def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: float):
""" """
Process an incoming small audio chunk: Process an incoming small audio chunk:
- run VAD on the chunk, - run VAD on the chunk,
- decide whether to send the audio to the online ASR processor immediately, - decide whether to send the audio to the online ASR processor immediately,
- and/or to mark the current utterance as finished. - and/or to mark the current utterance as finished.
""" """
self.last_input_audio_stream_end_time = audio_stream_end_time
res = self.vac(audio) res = self.vac(audio)
self.audio_buffer = np.append(self.audio_buffer, audio) self.audio_buffer = np.append(self.audio_buffer, audio)
@@ -425,10 +470,11 @@ class VACOnlineASRProcessor:
self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE) self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE)
self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:] self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:]
def process_iter(self) -> Transcript: def process_iter(self) -> Tuple[List[ASRToken], float]:
""" """
Depending on the VAD status and the amount of accumulated audio, Depending on the VAD status and the amount of accumulated audio,
process the current audio chunk. process the current audio chunk.
Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
""" """
if self.is_currently_final: if self.is_currently_final:
return self.finish() return self.finish()
@@ -437,17 +483,20 @@ class VACOnlineASRProcessor:
return self.online.process_iter() return self.online.process_iter()
else: else:
logger.debug("No online update, only VAD") logger.debug("No online update, only VAD")
return Transcript(None, None, "") return [], self.last_input_audio_stream_end_time
def finish(self) -> Transcript: def finish(self) -> Tuple[List[ASRToken], float]:
"""Finish processing by flushing any remaining text.""" """
result = self.online.finish() Finish processing by flushing any remaining text.
Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
"""
result_tokens, processed_upto = self.online.finish()
self.current_online_chunk_buffer_size = 0 self.current_online_chunk_buffer_size = 0
self.is_currently_final = False self.is_currently_final = False
return result return result_tokens, processed_upto
def get_buffer(self): def get_buffer(self):
""" """
Get the unvalidated buffer in string format. Get the unvalidated buffer in string format.
""" """
return self.online.concatenate_tokens(self.online.transcript_buffer.buffer).text return self.online.concatenate_tokens(self.online.transcript_buffer.buffer)

View File

@@ -179,7 +179,7 @@ def warmup_asr(asr, warmup_file=None, timeout=5):
logger.warning(f"Warmup file {warmup_file} invalid or missing.") logger.warning(f"Warmup file {warmup_file} invalid or missing.")
return False return False
print(f"Warmping up Whisper with {warmup_file}") print(f"Warming up Whisper with {warmup_file}")
try: try:
import librosa import librosa
audio, sr = librosa.load(warmup_file, sr=16000) audio, sr = librosa.load(warmup_file, sr=16000)