Compare commits
29 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 12973711f6 | | |
| | 909ac9dd41 | | |
| | d94a07d417 | | |
| | b32dd8bfc4 | | |
| | 9feb0e597b | | |
| | 9dab84a573 | | |
| | d089c7fce0 | | |
| | 253a080df5 | | |
| | 0c6e4b2aee | | |
| | e14bbde77d | | |
| | 7496163467 | | |
| | 696a94d1ce | | |
| | 2699b0974c | | |
| | 90c0250ba4 | | |
| | eb96153ffd | | |
| | 47e3eb9b5b | | |
| | b8b07adeef | | |
| | d0e9e37ef6 | | |
| | 820f92d8cb | | |
| | e42523af84 | | |
| | e2184d5e06 | | |
| | 7fe0353260 | | |
| | 0f2eba507e | | |
| | 55e08474f3 | | |
| | 28bdc52e1d | | |
| | e4221fa6c3 | | |
| | 1652db9a2d | | |
| | 601f17653a | | |
| | 7718190fcd | | |
@@ -15,7 +15,7 @@ Thank you for considering contributing ! We appreciate your time and effort to h
|
|||||||
|
|
||||||
## Opening Issues
|
## Opening Issues
|
||||||
|
|
||||||
If you encounter a problem with diart or want to suggest an improvement, please follow these guidelines when opening an issue:
|
If you encounter a problem with WhisperLiveKit or want to suggest an improvement, please follow these guidelines when opening an issue:
|
||||||
|
|
||||||
- **Bug Reports:**
|
- **Bug Reports:**
|
||||||
- Clearly describe the error. **Please indicate the parameters you use, especially the model(s)**
|
- Clearly describe the error. **Please indicate the parameters you use, especially the model(s)**
|
||||||
@@ -43,4 +43,4 @@ We welcome and appreciate contributions! To ensure a smooth review process, plea
|
|||||||
|
|
||||||
## Thank You
|
## Thank You
|
||||||
|
|
||||||
Your contributions make diart better for everyone. Thank you for your time and dedication!
|
Your contributions make WhisperLiveKit better for everyone. Thank you for your time and dedication!
|
||||||
|
|||||||
@@ -81,4 +81,4 @@ EXPOSE 8000
|
|||||||
ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]
|
ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]
|
||||||
|
|
||||||
# Default args
|
# Default args
|
||||||
CMD ["--model", "tiny.en"]
|
CMD ["--model", "base"]
|
||||||
242
README.md
@@ -4,7 +4,7 @@
|
|||||||
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
|
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
|
<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Identification</b></p>
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
|
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
|
||||||
@@ -14,121 +14,93 @@
|
|||||||
</p>
|
</p>
|
||||||
|
|
||||||
|
|
||||||
WhisperLiveKit brings real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨
|
Real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨
|
||||||
|
|
||||||
Built on [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) and [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) for transcription, plus [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) and [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) for diarization.
|
#### Powered by Leading Research:
|
||||||
|
|
||||||
|
- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy
|
||||||
|
- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription with LocalAgreement policy
|
||||||
|
- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
|
||||||
|
- [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - Real-time speaker diarization
|
||||||
|
- [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - Enterprise-grade Voice Activity Detection
|
||||||
|
|
||||||
|
|
||||||
### Key Features
|
> **Why not just run a simple Whisper model on every audio batch?** Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.
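To make the idea of incremental processing concrete, here is a minimal sketch of a LocalAgreement-style commit rule, the policy the list above attributes to WhisperStreaming: a word is confirmed only once two consecutive hypotheses over the growing audio buffer agree on it. This is a simplified illustration, not the project's actual implementation; `common_prefix` and `LocalAgreement` are hypothetical names, and the sketch assumes each new hypothesis starts with the words already committed.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Return the longest shared leading run of words in a and b."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class LocalAgreement:
    """Commit words only when two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.committed: List[str] = []  # words already confirmed
        self.previous: List[str] = []   # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest full hypothesis; return the newly committed words."""
        # Assumption: the hypothesis begins with the committed words, so skip them.
        tail = hypothesis[len(self.committed):]
        agreed = common_prefix(self.previous, tail)
        self.committed.extend(agreed)
        # Keep the still-unconfirmed remainder for the next comparison.
        self.previous = tail[len(agreed):]
        return agreed


policy = LocalAgreement()
for hyp in (["hello", "wor"], ["hello", "world", "how"], ["hello", "world", "how", "are"]):
    print(policy.update(hyp))
```

The unstable last word of each hypothesis ("wor", then "are") is never emitted until a later pass confirms it, which is why partially heard words do not appear in the committed transcript.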
|
||||||
|
|
||||||
- **Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
|
|
||||||
- **Speaker Diarization** - Identify different speakers in real-time. (⚠️ backend Streaming Sortformer in development)
|
|
||||||
- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
|
|
||||||
- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
|
|
||||||
- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
|
|
||||||
- **Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
|
|
||||||
- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
|
|
||||||
- **SimulStreaming Backend** - [Dual-licensed](https://github.com/ufal/SimulStreaming#-licence-and-contributions) - Ultra-low latency transcription using SOTA AlignAtt policy.
|
|
||||||
|
|
||||||
### Architecture
|
### Architecture
|
||||||
|
|
||||||
<img alt="Architecture" src="architecture.png" />
|
<img alt="Architecture" src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png" />
|
||||||
|
|
||||||
|
*The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.*
|
||||||
|
|
||||||
## Quick Start
|
### Installation & Quick Start
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Install the package
|
|
||||||
pip install whisperlivekit
|
pip install whisperlivekit
|
||||||
|
|
||||||
# Start the transcription server
|
|
||||||
whisperlivekit-server --model tiny.en
|
|
||||||
|
|
||||||
# Open your browser at http://localhost:8000 to see the interface.
|
|
||||||
# Use the --ssl-certfile public.crt --ssl-keyfile private.key parameters to use SSL
|
|
||||||
```
|
```
|
||||||
|
|
||||||
That's it! Start speaking and watch your words appear on screen.
|
> **FFmpeg is required** and must be installed before using WhisperLiveKit
|
||||||
|
>
|
||||||
|
> | OS | How to install |
|
||||||
|
> |-----------|-------------|
|
||||||
|
> | Ubuntu/Debian | `sudo apt install ffmpeg` |
|
||||||
|
> | MacOS | `brew install ffmpeg` |
|
||||||
|
> | Windows | Download .exe from https://ffmpeg.org/download.html and add to PATH |
|
||||||
|
|
||||||
## Installation
|
#### Quick Start
|
||||||
|
1. **Start the transcription server:**
|
||||||
|
```bash
|
||||||
|
whisperlivekit-server --model base --language en
|
||||||
|
```
|
||||||
|
|
||||||
```bash
|
2. **Open your browser** and navigate to `http://localhost:8000`. Start speaking and watch your words appear in real-time!
|
||||||
#Install from PyPI (Recommended)
|
|
||||||
pip install whisperlivekit
|
|
||||||
|
|
||||||
#Install from Source
|
|
||||||
git clone https://github.com/QuentinFuxa/WhisperLiveKit
|
|
||||||
cd WhisperLiveKit
|
|
||||||
pip install -e .
|
|
||||||
```
|
|
||||||
|
|
||||||
### FFmpeg Dependency
|
> - See [tokenizer.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py) for the list of all available languages.
|
||||||
|
> - For HTTPS requirements, see the **Parameters** section for SSL configuration options.
|
||||||
|
|
||||||
```bash
|
|
||||||
# Ubuntu/Debian
|
|
||||||
sudo apt install ffmpeg
|
|
||||||
|
|
||||||
# macOS
|
#### Optional Dependencies
|
||||||
brew install ffmpeg
|
|
||||||
|
|
||||||
# Windows
|
| Optional | `pip install` |
|
||||||
# Download from https://ffmpeg.org/download.html and add to PATH
|
|-----------|-------------|
|
||||||
```
|
| Speaker diarization | `whisperlivekit[diarization]` |
|
||||||
|
| Original Whisper backend | `whisperlivekit[whisper]` |
|
||||||
|
| Improved timestamps backend | `whisperlivekit[whisper-timestamped]` |
|
||||||
|
| Apple Silicon optimization backend | `whisperlivekit[mlx-whisper]` |
|
||||||
|
| OpenAI API backend | `whisperlivekit[openai]` |
|
||||||
|
|
||||||
### Optional Dependencies
|
See **Parameters & Configuration** below on how to use them.
|
||||||
|
|
||||||
```bash
|
|
||||||
# Voice Activity Controller (prevents hallucinations)
|
> **Pyannote Models Setup** For diarization, you need access to pyannote.audio models:
|
||||||
pip install torch
|
> 1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
|
||||||
|
> 2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
|
||||||
# Sentence-based buffer trimming
|
> 3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
|
||||||
pip install mosestokenizer wtpsplit
|
> 4. Login with HuggingFace:
|
||||||
pip install tokenize_uk # If you work with Ukrainian text
|
> ```bash
|
||||||
|
> huggingface-cli login
|
||||||
# Speaker diarization
|
> ```
|
||||||
pip install diart
|
|
||||||
|
|
||||||
# Alternative Whisper backends (default is faster-whisper)
|
|
||||||
pip install whisperlivekit[whisper] # Original Whisper
|
|
||||||
pip install whisperlivekit[whisper-timestamped] # Improved timestamps
|
|
||||||
pip install whisperlivekit[mlx-whisper] # Apple Silicon optimization
|
|
||||||
pip install whisperlivekit[openai] # OpenAI API
|
|
||||||
pip install whisperlivekit[simulstreaming]
|
|
||||||
```
|
|
||||||
|
|
||||||
### 🎹 Pyannote Models Setup
|
|
||||||
|
|
||||||
For diarization, you need access to pyannote.audio models:
|
|
||||||
|
|
||||||
1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
|
|
||||||
2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
|
|
||||||
3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
|
|
||||||
4. Login with HuggingFace:
|
|
||||||
```bash
|
|
||||||
pip install huggingface_hub
|
|
||||||
huggingface-cli login
|
|
||||||
```
|
|
||||||
|
|
||||||
## 💻 Usage Examples
|
## 💻 Usage Examples
|
||||||
|
|
||||||
### Command-line Interface
|
#### Command-line Interface
|
||||||
|
|
||||||
Start the transcription server with various options:
|
Start the transcription server with various options:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Basic server with English model
|
# SimulStreaming backend for ultra-low latency
|
||||||
whisperlivekit-server --model tiny.en
|
whisperlivekit-server --backend simulstreaming --model large-v3
|
||||||
|
|
||||||
# Advanced configuration with diarization
|
# Advanced configuration with diarization
|
||||||
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
|
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
|
||||||
|
|
||||||
# SimulStreaming backend for ultra-low latency
|
|
||||||
whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshold 20
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Python API Integration (Backend)
|
#### Python API Integration (Backend)
|
||||||
Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.
|
Check [basic_server](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a more complete example of how to use the functions and classes.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
|
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
|
||||||
@@ -143,14 +115,10 @@ transcription_engine = None
|
|||||||
async def lifespan(app: FastAPI):
|
async def lifespan(app: FastAPI):
|
||||||
global transcription_engine
|
global transcription_engine
|
||||||
transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
|
transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
|
||||||
# You can also load from command-line arguments using parse_args()
|
|
||||||
# args = parse_args()
|
|
||||||
# transcription_engine = TranscriptionEngine(**vars(args))
|
|
||||||
yield
|
yield
|
||||||
|
|
||||||
app = FastAPI(lifespan=lifespan)
|
app = FastAPI(lifespan=lifespan)
|
||||||
|
|
||||||
# Process WebSocket connections
|
|
||||||
async def handle_websocket_results(websocket: WebSocket, results_generator):
|
async def handle_websocket_results(websocket: WebSocket, results_generator):
|
||||||
async for response in results_generator:
|
async for response in results_generator:
|
||||||
await websocket.send_json(response)
|
await websocket.send_json(response)
|
||||||
@@ -170,43 +138,36 @@ async def websocket_endpoint(websocket: WebSocket):
|
|||||||
await audio_processor.process_audio(message)
|
await audio_processor.process_audio(message)
|
||||||
```
|
```
|
||||||
|
|
||||||
### Frontend Implementation
|
#### Frontend Implementation
|
||||||
|
|
||||||
The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()` :
|
The package includes an HTML/JavaScript implementation [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html). You can also import it using `from whisperlivekit import get_web_interface_html` & `page = get_web_interface_html()`
|
||||||
|
|
||||||
```python
|
|
||||||
from whisperlivekit import get_web_interface_html
|
|
||||||
html_content = get_web_interface_html()
|
|
||||||
```
|
|
||||||
|
|
||||||
## ⚙️ Configuration Reference
|
### ⚙️ Parameters & Configuration
|
||||||
|
|
||||||
WhisperLiveKit offers extensive configuration options:
|
|
||||||
|
|
||||||
| Parameter | Description | Default |
|
| Parameter | Description | Default |
|
||||||
|-----------|-------------|---------|
|
|-----------|-------------|---------|
|
||||||
| `--host` | Server host address | `localhost` |
|
| `--model` | Whisper model size. | `small` |
|
||||||
| `--port` | Server port | `8000` |
|
|
||||||
| `--model` | Whisper model size. Caution : '.en' models do not work with Simulstreaming | `tiny` |
|
|
||||||
| `--language` | Source language code or `auto` | `en` |
|
| `--language` | Source language code or `auto` | `en` |
|
||||||
| `--task` | `transcribe` or `translate` | `transcribe` |
|
| `--task` | `transcribe` or `translate` | `transcribe` |
|
||||||
| `--backend` | Processing backend | `faster-whisper` |
|
| `--backend` | Processing backend | `simulstreaming` |
|
||||||
| `--diarization` | Enable speaker identification | `False` |
|
|
||||||
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
|
|
||||||
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
|
|
||||||
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
|
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
|
||||||
| `--vac` | Use Voice Activity Controller | `False` |
|
| `--no-vac` | Disable Voice Activity Controller | `False` |
|
||||||
| `--no-vad` | Disable Voice Activity Detection | `False` |
|
| `--no-vad` | Disable Voice Activity Detection | `False` |
|
||||||
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
|
|
||||||
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
|
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
|
||||||
|
| `--host` | Server host address | `localhost` |
|
||||||
|
| `--port` | Server port | `8000` |
|
||||||
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
|
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
|
||||||
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
|
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
|
||||||
| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
|
|
||||||
| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
|
|
||||||
|
|
||||||
**SimulStreaming-specific Options:**
|
|
||||||
|
|
||||||
| Parameter | Description | Default |
|
| WhisperStreaming backend options | Description | Default |
|
||||||
|
|-----------|-------------|---------|
|
||||||
|
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
|
||||||
|
| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
|
||||||
|
|
||||||
|
|
||||||
|
| SimulStreaming backend options | Description | Default |
|
||||||
|-----------|-------------|---------|
|
|-----------|-------------|---------|
|
||||||
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
|
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
|
||||||
| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
|
| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
|
||||||
@@ -219,42 +180,37 @@ WhisperLiveKit offers extensive configuration options:
|
|||||||
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
|
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
|
||||||
| `--max-context-tokens` | Maximum context tokens | `None` |
|
| `--max-context-tokens` | Maximum context tokens | `None` |
|
||||||
| `--model-path` | Direct path to .pt model file. Download it if not found | `./base.pt` |
|
| `--model-path` | Direct path to .pt model file. Download it if not found | `./base.pt` |
|
||||||
|
| `--preloaded-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |
|
||||||
|
|
||||||
## 🔧 How It Works
|
| Diarization options | Description | Default |
|
||||||
|
|-----------|-------------|---------|
|
||||||
|
| `--diarization` | Enable speaker identification | `False` |
|
||||||
|
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
|
||||||
|
| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
|
||||||
|
| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
|
||||||
|
|
||||||
1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
|
### 🚀 Deployment Guide
|
||||||
2. **Streaming**: Audio chunks are sent to the server via WebSocket
|
|
||||||
3. **Processing**: Server decodes audio with FFmpeg and streams into the model for transcription
|
|
||||||
4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu') and finalized text appears in normal color
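The processing step can be sketched as slicing a decoded PCM byte buffer into fixed-size chunks and converting each chunk to floats before handing it to the model. This is a minimal sketch under stated assumptions: 16 kHz mono signed 16-bit little-endian PCM, and hypothetical helper names (`pcm_to_float`, `drain`) that are not part of the package's API.

```python
import struct
from typing import Iterator, List

SAMPLE_RATE = 16_000   # assumed sample rate in Hz
BYTES_PER_SAMPLE = 2   # assumed signed 16-bit PCM
BYTES_PER_SEC = SAMPLE_RATE * BYTES_PER_SAMPLE


def pcm_to_float(chunk: bytes) -> List[float]:
    """Convert little-endian s16 PCM bytes to floats in [-1.0, 1.0)."""
    count = len(chunk) // BYTES_PER_SAMPLE
    samples = struct.unpack(f"<{count}h", chunk[: count * BYTES_PER_SAMPLE])
    return [s / 32768.0 for s in samples]


def drain(buffer: bytearray, chunk_bytes: int = BYTES_PER_SEC) -> Iterator[List[float]]:
    """Yield full chunks from the front of the buffer, leaving any remainder."""
    while len(buffer) >= chunk_bytes:
        chunk = bytes(buffer[:chunk_bytes])
        del buffer[:chunk_bytes]  # consume the bytes that were just read
        yield pcm_to_float(chunk)


# Usage: 1.5 seconds of silence yields one full one-second chunk;
# the remaining half second stays buffered until more audio arrives.
pcm = bytearray(b"\x00\x00" * (SAMPLE_RATE * 3 // 2))
for floats in drain(pcm):
    print(len(floats), "samples ready for transcription")
print(len(pcm), "bytes still buffered")
```

Leaving the remainder in the buffer, rather than padding it, keeps chunk boundaries aligned with real audio so no artificial silence is fed to the model.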
|
|
||||||
|
|
||||||
## 🚀 Deployment Guide
|
|
||||||
|
|
||||||
To deploy WhisperLiveKit in production:
|
To deploy WhisperLiveKit in production:
|
||||||
|
|
||||||
1. **Server Setup** (Backend):
|
1. **Server Setup**: Install production ASGI server & launch with multiple workers
|
||||||
```bash
|
```bash
|
||||||
# Install production ASGI server
|
|
||||||
pip install uvicorn gunicorn
|
pip install uvicorn gunicorn
|
||||||
|
|
||||||
# Launch with multiple workers
|
|
||||||
gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
|
gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
|
||||||
```
|
```
|
||||||
|
|
||||||
2. **Frontend Integration**:
|
2. **Frontend**: Host your customized version of the `html` example & ensure WebSocket connection points correctly
|
||||||
- Host your customized version of the example HTML/JS in your web application
|
|
||||||
- Ensure WebSocket connection points to your server's address
|
|
||||||
|
|
||||||
3. **Nginx Configuration** (recommended for production):
|
3. **Nginx Configuration** (recommended for production):
|
||||||
```nginx
|
```nginx
|
||||||
server {
|
server {
|
||||||
listen 80;
|
listen 80;
|
||||||
server_name your-domain.com;
|
server_name your-domain.com;
|
||||||
|
location / {
|
||||||
location / {
|
proxy_pass http://localhost:8000;
|
||||||
proxy_pass http://localhost:8000;
|
proxy_set_header Upgrade $http_upgrade;
|
||||||
proxy_set_header Upgrade $http_upgrade;
|
proxy_set_header Connection "upgrade";
|
||||||
proxy_set_header Connection "upgrade";
|
proxy_set_header Host $host;
|
||||||
proxy_set_header Host $host;
|
|
||||||
}}
|
}}
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -262,26 +218,19 @@ To deploy WhisperLiveKit in production:
|
|||||||
|
|
||||||
### 🐋 Docker
|
### 🐋 Docker
|
||||||
|
|
||||||
A basic Dockerfile is provided which allows re-use of Python package installation options. ⚠️ For **large** models, ensure that your **docker runtime** has enough **memory** available. See below usage examples:
|
A Dockerfile is provided which allows re-use of Python package installation options. Create a reusable image with only the basics and then run as a named container:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker build -t whisperlivekit-defaults .
|
||||||
|
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults --model base
|
||||||
|
docker start -i whisperlivekit
|
||||||
|
```
|
||||||
|
|
||||||
#### All defaults
|
> **Note**: For **large** models, ensure that your **docker runtime** has enough **memory** available
|
||||||
- Create a reusable image with only the basics and then run as a named container:
|
|
||||||
```bash
|
|
||||||
docker build -t whisperlivekit-defaults .
|
|
||||||
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
|
|
||||||
docker start -i whisperlivekit
|
|
||||||
```
|
|
||||||
|
|
||||||
> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
|
> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
|
||||||
|
|
||||||
#### Customization
|
#### Customization
|
||||||
- Customize the container options:
|
|
||||||
```bash
|
|
||||||
docker build -t whisperlivekit-defaults .
|
|
||||||
docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
|
|
||||||
docker start -i whisperlivekit-base
|
|
||||||
```
|
|
||||||
|
|
||||||
- `--build-arg` Options:
|
- `--build-arg` Options:
|
||||||
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
|
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
|
||||||
@@ -290,10 +239,3 @@ A basic Dockerfile is provided which allows re-use of Python package installatio
|
|||||||
|
|
||||||
## 🔮 Use Cases
|
## 🔮 Use Cases
|
||||||
Capture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...
|
Capture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...
|
||||||
|
|
||||||
## 🙏 Acknowledgments
|
|
||||||
|
|
||||||
We extend our gratitude to the original authors of:
|
|
||||||
|
|
||||||
| [Whisper Streaming](https://github.com/ufal/whisper_streaming) | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
|
|
||||||
| -------- | ------- | -------- | ------- |
|
|
||||||
|
|||||||
BIN
architecture.png
|
Before Width: | Height: | Size: 382 KiB After Width: | Height: | Size: 388 KiB |
BIN
demo.png
|
Before Width: | Height: | Size: 438 KiB After Width: | Height: | Size: 423 KiB |
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "whisperlivekit"
|
name = "whisperlivekit"
|
||||||
version = "0.2.5"
|
version = "0.2.6"
|
||||||
description = "Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization"
|
description = "Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization"
|
||||||
readme = "README.md"
|
readme = "README.md"
|
||||||
authors = [
|
authors = [
|
||||||
@@ -27,24 +27,21 @@ dependencies = [
|
|||||||
"soundfile",
|
"soundfile",
|
||||||
"faster-whisper",
|
"faster-whisper",
|
||||||
"uvicorn",
|
"uvicorn",
|
||||||
"websockets"
|
"websockets",
|
||||||
]
|
|
||||||
|
|
||||||
[project.optional-dependencies]
|
|
||||||
diarization = ["diart"]
|
|
||||||
vac = ["torch"]
|
|
||||||
sentence = ["mosestokenizer", "wtpsplit"]
|
|
||||||
whisper = ["whisper"]
|
|
||||||
whisper-timestamped = ["whisper-timestamped"]
|
|
||||||
mlx-whisper = ["mlx-whisper"]
|
|
||||||
openai = ["openai"]
|
|
||||||
simulstreaming = [
|
|
||||||
"torch",
|
"torch",
|
||||||
"tqdm",
|
"tqdm",
|
||||||
"tiktoken",
|
"tiktoken",
|
||||||
'triton>=2.0.0,<3; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
|
'triton>=2.0.0,<3; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
|
||||||
]
|
]
|
||||||
|
|
||||||
|
[project.optional-dependencies]
|
||||||
|
diarization = ["diart"]
|
||||||
|
sentence = ["mosestokenizer", "wtpsplit"]
|
||||||
|
whisper = ["whisper"]
|
||||||
|
whisper-timestamped = ["whisper-timestamped"]
|
||||||
|
mlx-whisper = ["mlx-whisper"]
|
||||||
|
openai = ["openai"]
|
||||||
|
|
||||||
[project.urls]
|
[project.urls]
|
||||||
Homepage = "https://github.com/QuentinFuxa/WhisperLiveKit"
|
Homepage = "https://github.com/QuentinFuxa/WhisperLiveKit"
|
||||||
|
|
||||||
@@ -55,5 +52,5 @@ whisperlivekit-server = "whisperlivekit.basic_server:main"
|
|||||||
packages = ["whisperlivekit", "whisperlivekit.diarization", "whisperlivekit.simul_whisper", "whisperlivekit.simul_whisper.whisper", "whisperlivekit.simul_whisper.whisper.assets", "whisperlivekit.simul_whisper.whisper.normalizers", "whisperlivekit.web", "whisperlivekit.whisper_streaming_custom"]
|
packages = ["whisperlivekit", "whisperlivekit.diarization", "whisperlivekit.simul_whisper", "whisperlivekit.simul_whisper.whisper", "whisperlivekit.simul_whisper.whisper.assets", "whisperlivekit.simul_whisper.whisper.normalizers", "whisperlivekit.web", "whisperlivekit.whisper_streaming_custom"]
|
||||||
|
|
||||||
[tool.setuptools.package-data]
|
[tool.setuptools.package-data]
|
||||||
whisperlivekit = ["web/*.html"]
|
whisperlivekit = ["web/*.html", "web/*.css", "web/*.js", "web/src/*.svg"]
|
||||||
"whisperlivekit.simul_whisper.whisper.assets" = ["*.tiktoken", "*.npz"]
|
"whisperlivekit.simul_whisper.whisper.assets" = ["*.tiktoken", "*.npz"]
|
||||||
|
|||||||
@@ -5,10 +5,12 @@ import math
|
|||||||
import logging
|
import logging
|
||||||
import traceback
|
import traceback
|
||||||
from datetime import timedelta
|
from datetime import timedelta
|
||||||
from whisperlivekit.timed_objects import ASRToken
|
from whisperlivekit.timed_objects import ASRToken, Silence
|
||||||
from whisperlivekit.core import TranscriptionEngine, online_factory
|
from whisperlivekit.core import TranscriptionEngine, online_factory
|
||||||
from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
|
from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
|
||||||
from .remove_silences import handle_silences
|
from whisperlivekit.remove_silences import handle_silences
|
||||||
|
from whisperlivekit.trail_repetition import trim_tail_repetition
|
||||||
|
from whisperlivekit.silero_vad_iterator import FixedVADIterator
|
||||||
# Set up logging once
|
# Set up logging once
|
||||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
|
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
@@ -45,16 +47,19 @@ class AudioProcessor:
|
|||||||
self.last_ffmpeg_activity = time()
|
self.last_ffmpeg_activity = time()
|
||||||
self.ffmpeg_health_check_interval = 5
|
self.ffmpeg_health_check_interval = 5
|
||||||
self.ffmpeg_max_idle_time = 10
|
self.ffmpeg_max_idle_time = 10
|
||||||
|
self.debug = False
|
||||||
|
|
||||||
# State management
|
# State management
|
||||||
self.is_stopping = False
|
self.is_stopping = False
|
||||||
|
self.silence = False
|
||||||
|
self.silence_duration = 0.0
|
||||||
self.tokens = []
|
self.tokens = []
|
||||||
self.buffer_transcription = ""
|
self.buffer_transcription = ""
|
||||||
self.buffer_diarization = ""
|
self.buffer_diarization = ""
|
||||||
self.end_buffer = 0
|
self.end_buffer = 0
|
||||||
self.end_attributed_speaker = 0
|
self.end_attributed_speaker = 0
|
||||||
self.lock = asyncio.Lock()
|
self.lock = asyncio.Lock()
|
||||||
self.beg_loop = time()
|
self.beg_loop = None #to deal with a potential little lag at the websocket initialization, this is now set in process_audio
|
||||||
self.sep = " " # Default separator
|
self.sep = " " # Default separator
|
||||||
self.last_response_content = ""
|
self.last_response_content = ""
|
||||||
|
|
||||||
@@ -62,7 +67,12 @@ class AudioProcessor:
|
|||||||
self.asr = models.asr
|
self.asr = models.asr
|
||||||
self.tokenizer = models.tokenizer
|
self.tokenizer = models.tokenizer
|
||||||
self.diarization = models.diarization
|
self.diarization = models.diarization
|
||||||
|
self.vac_model = models.vac_model
|
||||||
|
if self.args.vac:
|
||||||
|
self.vac = FixedVADIterator(models.vac_model)
|
||||||
|
else:
|
||||||
|
self.vac = None
|
||||||
|
|
||||||
self.ffmpeg_manager = FFmpegManager(
|
self.ffmpeg_manager = FFmpegManager(
|
||||||
sample_rate=self.sample_rate,
|
sample_rate=self.sample_rate,
|
||||||
channels=self.channels
|
channels=self.channels
|
||||||
@@ -98,6 +108,17 @@ class AudioProcessor:
|
|||||||
"""Thread-safe update of transcription with new data."""
|
"""Thread-safe update of transcription with new data."""
|
||||||
async with self.lock:
|
async with self.lock:
|
||||||
self.tokens.extend(new_tokens)
|
self.tokens.extend(new_tokens)
|
||||||
|
|
||||||
|
# self.tokens, has_been_trimmed = trim_tail_repetition(
|
||||||
|
# self.tokens,
|
||||||
|
# key=lambda t: t.text.strip().lower(),
|
||||||
|
# min_block=2, # avoid trimming single '.' loops; set to 1 if you want to remove those too
|
||||||
|
# max_tail=200,
|
||||||
|
# prefer="longest", # prefer removing the longest repeated phrase
|
||||||
|
# keep=1
|
||||||
|
# )
|
||||||
|
# if has_been_trimmed:
|
||||||
|
# print('HAS BEEN TRIMMED !')
|
||||||
self.buffer_transcription = buffer
|
self.buffer_transcription = buffer
|
||||||
self.end_buffer = end_buffer
|
self.end_buffer = end_buffer
|
||||||
self.sep = sep
|
self.sep = sep
|
||||||
@@ -201,18 +222,44 @@ class AudioProcessor:
|
|||||||
pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
|
pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
|
||||||
self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
|
self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
|
||||||
|
|
||||||
# Send to transcription if enabled
|
res = None
|
||||||
if self.args.transcription and self.transcription_queue:
|
end_of_audio = False
|
||||||
await self.transcription_queue.put(pcm_array.copy())
|
silence_buffer = None
|
||||||
|
|
||||||
|
if self.args.vac:
|
||||||
|
res = self.vac(pcm_array)
|
||||||
|
|
||||||
|
if res is not None:
|
||||||
|
if res.get('end', 0) > res.get('start', 0):
|
||||||
|
end_of_audio = True
|
||||||
|
elif self.silence: #end of silence
|
||||||
|
self.silence = False
|
||||||
|
silence_buffer = Silence(duration=time() - self.start_silence)
|
||||||
|
|
||||||
|
if silence_buffer:
|
||||||
|
if self.args.transcription and self.transcription_queue:
|
||||||
|
await self.transcription_queue.put(silence_buffer)
|
||||||
|
if self.args.diarization and self.diarization_queue:
|
||||||
|
await self.diarization_queue.put(silence_buffer)
|
||||||
|
|
||||||
# Send to diarization if enabled
|
if not self.silence:
|
||||||
if self.args.diarization and self.diarization_queue:
|
if self.args.transcription and self.transcription_queue:
|
||||||
await self.diarization_queue.put(pcm_array.copy())
|
await self.transcription_queue.put(pcm_array.copy())
|
||||||
|
|
||||||
|
if self.args.diarization and self.diarization_queue:
|
||||||
|
await self.diarization_queue.put(pcm_array.copy())
|
||||||
|
|
||||||
|
self.silence_duration = 0.0
|
||||||
|
if end_of_audio:
|
||||||
|
self.silence = True
|
||||||
|
self.start_silence = time()
|
||||||
|
|
||||||
# Sleep if no processing is happening
|
# Sleep if no processing is happening
|
||||||
if not self.args.transcription and not self.args.diarization:
|
if not self.args.transcription and not self.args.diarization:
|
||||||
await asyncio.sleep(0.1)
|
await asyncio.sleep(0.1)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
|
logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
|
||||||
logger.warning(f"Traceback: {traceback.format_exc()}")
|
logger.warning(f"Traceback: {traceback.format_exc()}")
|
||||||
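The silence handling added in the hunk above can be read as a small state machine. The following is a minimal sketch of that reading, not the project's code: `SilenceGate` and its `step` method are hypothetical names, `Silence` is a stand-in for the marker object the diff puts on the queues, and the nesting of the `end_of_audio` check is an assumption because indentation is lost in this view.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Silence:
    """Stand-in for the Silence marker placed on the queues (assumed shape)."""
    duration: float


class SilenceGate:
    """Hypothetical helper mirroring the silence / end_of_audio bookkeeping."""

    def __init__(self):
        self.silence = False
        self.start_silence = 0.0

    def step(self, vad_event: Optional[dict], now: float) -> list:
        """Return what would be queued for one chunk: a Silence marker and/or "audio"."""
        queued = []
        end_of_audio = False
        if vad_event is not None:
            if vad_event.get("end", 0) > vad_event.get("start", 0):
                # The VAD reported the end of a speech segment.
                end_of_audio = True
            elif self.silence:
                # Speech resumed: emit one marker covering the silent gap.
                self.silence = False
                queued.append(Silence(duration=now - self.start_silence))
        if not self.silence:
            # Audio is forwarded only while we are not inside a silence.
            queued.append("audio")
            if end_of_audio:
                self.silence = True
                self.start_silence = now
        return queued
```

Under these assumptions, chunks that arrive during a silence are dropped, and the consumers receive a single `Silence(duration=...)` item when speech starts again, which is what lets them advance their clocks without processing the skipped audio.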
@@ -239,8 +286,8 @@ class AudioProcessor:
|
|||||||
|
|
||||||
while True:
|
while True:
|
||||||
try:
|
try:
|
||||||
pcm_array = await self.transcription_queue.get()
|
item = await self.transcription_queue.get()
|
||||||
if pcm_array is SENTINEL:
|
if item is SENTINEL:
|
||||||
logger.debug("Transcription processor received sentinel. Finishing.")
|
logger.debug("Transcription processor received sentinel. Finishing.")
|
||||||
self.transcription_queue.task_done()
|
self.transcription_queue.task_done()
|
||||||
break
|
break
|
||||||
@@ -252,17 +299,30 @@ class AudioProcessor:
|
|||||||
|
|
||||||
asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
|
asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
|
||||||
transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
|
transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
|
||||||
|
asr_processing_logs = f"internal_buffer={asr_internal_buffer_duration_s:.2f}s | lag={transcription_lag_s:.2f}s |"
|
||||||
logger.info(
|
if type(item) is Silence:
|
||||||
f"ASR processing: internal_buffer={asr_internal_buffer_duration_s:.2f}s, "
|
asr_processing_logs += f" + silence = {item.duration:.2f}s"
|
||||||
f"lag={transcription_lag_s:.2f}s."
|
if self.tokens:
|
||||||
)
|
asr_processing_logs += f" | last_end = {self.tokens[-1].end} |"
|
||||||
|
logger.info(asr_processing_logs)
|
||||||
|
|
||||||
# Process transcription
|
if type(item) is Silence:
|
||||||
duration_this_chunk = len(pcm_array) / self.sample_rate if isinstance(pcm_array, np.ndarray) else 0
|
cumulative_pcm_duration_stream_time += item.duration
|
||||||
|
self.online.insert_silence(item.duration, self.tokens[-1].end if self.tokens else 0.0)  # guard: self.tokens may still be empty; the 0.0 fallback is an assumption
|
||||||
|
continue
|
||||||
|
|
||||||
|
if isinstance(item, np.ndarray):
|
||||||
|
pcm_array = item
|
||||||
|
else:
|
||||||
|
raise TypeError(f"Expected np.ndarray or Silence on the transcription queue, got {type(item).__name__}")
|
||||||
|
|
||||||
|
duration_this_chunk = len(pcm_array) / self.sample_rate
|
||||||
cumulative_pcm_duration_stream_time += duration_this_chunk
|
cumulative_pcm_duration_stream_time += duration_this_chunk
|
||||||
stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time
|
stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
|
self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
|
||||||
new_tokens, current_audio_processed_upto = self.online.process_iter()
|
new_tokens, current_audio_processed_upto = self.online.process_iter()
|
||||||
|
|
||||||
@@ -303,15 +363,25 @@ class AudioProcessor:
|
|||||||
async def diarization_processor(self, diarization_obj):
|
async def diarization_processor(self, diarization_obj):
|
||||||
"""Process audio chunks for speaker diarization."""
|
"""Process audio chunks for speaker diarization."""
|
||||||
buffer_diarization = ""
|
buffer_diarization = ""
|
||||||
|
cumulative_pcm_duration_stream_time = 0.0
|
||||||
while True:
|
while True:
|
||||||
try:
|
try:
|
||||||
pcm_array = await self.diarization_queue.get()
|
item = await self.diarization_queue.get()
|
||||||
if pcm_array is SENTINEL:
|
if item is SENTINEL:
|
||||||
logger.debug("Diarization processor received sentinel. Finishing.")
|
logger.debug("Diarization processor received sentinel. Finishing.")
|
||||||
self.diarization_queue.task_done()
|
self.diarization_queue.task_done()
|
||||||
break
|
break
|
||||||
|
|
||||||
|
if type(item) is Silence:
|
||||||
|
cumulative_pcm_duration_stream_time += item.duration
|
||||||
|
diarization_obj.insert_silence(item.duration)
|
||||||
|
continue
|
||||||
|
|
||||||
|
if isinstance(item, np.ndarray):
|
||||||
|
pcm_array = item
|
||||||
|
else:
|
||||||
|
raise TypeError(f"Expected np.ndarray or Silence on the diarization queue, got {type(item).__name__}")
|
||||||
|
|
||||||
# Process diarization
|
# Process diarization
|
||||||
await diarization_obj.diarize(pcm_array)
|
await diarization_obj.diarize(pcm_array)
|
||||||
|
|
||||||
@@ -376,13 +446,16 @@ class AudioProcessor:
|
|||||||
lines = []
|
lines = []
|
||||||
last_end_diarized = 0
|
last_end_diarized = 0
|
||||||
undiarized_text = []
|
undiarized_text = []
|
||||||
current_time = time() - self.beg_loop
|
current_time = time() - self.beg_loop if self.beg_loop else None
|
||||||
tokens = handle_silences(tokens, current_time)
|
tokens, buffer_transcription, buffer_diarization = handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, self.silence)
|
||||||
for token in tokens:
|
for token in tokens:
|
||||||
speaker = token.speaker
|
speaker = token.speaker
|
||||||
|
|
||||||
|
if speaker == -1:  # Speaker -1 means not attributed by diarization; the frontend should show it under 'Speaker 1'
|
||||||
|
speaker = 1
|
||||||
|
|
||||||
# Handle diarization
|
# Handle diarization
|
||||||
if self.args.diarization:
|
if self.args.diarization and tokens[-1].speaker != -2:
|
||||||
if (speaker in [-1, 0]) and token.end >= end_attributed_speaker:
|
if (speaker in [-1, 0]) and token.end >= end_attributed_speaker:
|
||||||
undiarized_text.append(token.text)
|
undiarized_text.append(token.text)
|
||||||
continue
|
continue
|
||||||
@@ -391,21 +464,23 @@ class AudioProcessor:
|
|||||||
if speaker not in [-1, 0]:
|
if speaker not in [-1, 0]:
|
||||||
last_end_diarized = max(token.end, last_end_diarized)
|
last_end_diarized = max(token.end, last_end_diarized)
|
||||||
|
|
||||||
# Group by speaker
|
debug_info = ""
|
||||||
|
if self.debug:
|
||||||
|
debug_info = f"[{format_time(token.start)} : {format_time(token.end)}]"
|
||||||
if speaker != previous_speaker or not lines:
|
if speaker != previous_speaker or not lines:
|
||||||
lines.append({
|
lines.append({
|
||||||
"speaker": speaker,
|
"speaker": speaker,
|
||||||
"text": token.text,
|
"text": token.text + debug_info,
|
||||||
"beg": format_time(token.start),
|
"beg": format_time(token.start),
|
||||||
"end": format_time(token.end),
|
"end": format_time(token.end),
|
||||||
"diff": round(token.end - last_end_diarized, 2)
|
"diff": round(token.end - last_end_diarized, 2)
|
||||||
})
|
})
|
||||||
previous_speaker = speaker
|
previous_speaker = speaker
|
||||||
elif token.text: # Only append if text isn't empty
|
elif token.text: # Only append if text isn't empty
|
||||||
lines[-1]["text"] += sep + token.text
|
lines[-1]["text"] += sep + token.text + debug_info
|
||||||
lines[-1]["end"] = format_time(token.end)
|
lines[-1]["end"] = format_time(token.end)
|
||||||
lines[-1]["diff"] = round(token.end - last_end_diarized, 2)
|
lines[-1]["diff"] = round(token.end - last_end_diarized, 2)
|
||||||
|
|
||||||
# Handle undiarized text
|
# Handle undiarized text
|
||||||
if undiarized_text:
|
if undiarized_text:
|
||||||
combined = sep.join(undiarized_text)
|
combined = sep.join(undiarized_text)
|
||||||
@@ -566,6 +641,10 @@ class AudioProcessor:
|
|||||||
|
|
||||||
async def process_audio(self, message):
|
async def process_audio(self, message):
|
||||||
"""Process incoming audio data."""
|
"""Process incoming audio data."""
|
||||||
|
|
||||||
|
if not self.beg_loop:
|
||||||
|
self.beg_loop = time()
|
||||||
|
|
||||||
if not message:
|
if not message:
|
||||||
logger.info("Empty audio message received, initiating stop sequence.")
|
logger.info("Empty audio message received, initiating stop sequence.")
|
||||||
self.is_stopping = True
|
self.is_stopping = True
|
||||||
|
|||||||
@@ -5,6 +5,9 @@ from fastapi.middleware.cors import CORSMiddleware
|
|||||||
from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
|
from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
|
||||||
import asyncio
|
import asyncio
|
||||||
import logging
|
import logging
|
||||||
|
from starlette.staticfiles import StaticFiles
|
||||||
|
import pathlib
|
||||||
|
import whisperlivekit.web as webpkg
|
||||||
|
|
||||||
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
|
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
|
||||||
logging.getLogger().setLevel(logging.WARNING)
|
logging.getLogger().setLevel(logging.WARNING)
|
||||||
@@ -30,6 +33,8 @@ app.add_middleware(
|
|||||||
allow_methods=["*"],
|
allow_methods=["*"],
|
||||||
allow_headers=["*"],
|
allow_headers=["*"],
|
||||||
)
|
)
|
||||||
|
web_dir = pathlib.Path(webpkg.__file__).parent
|
||||||
|
app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")
|
||||||
|
|
||||||
@app.get("/")
|
@app.get("/")
|
||||||
async def get():
|
async def get():
|
||||||
@@ -47,7 +52,7 @@ async def handle_websocket_results(websocket, results_generator):
|
|||||||
except WebSocketDisconnect:
|
except WebSocketDisconnect:
|
||||||
logger.info("WebSocket disconnected while handling results (client likely closed connection).")
|
logger.info("WebSocket disconnected while handling results (client likely closed connection).")
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Error in WebSocket results handler: {e}")
|
logger.error(f"Error in WebSocket results handler: {e}")
|
||||||
|
|
||||||
|
|
||||||
@app.websocket("/asr")
|
@app.websocket("/asr")
|
||||||
|
|||||||
@@ -1,9 +1,9 @@
|
|||||||
try:
|
try:
|
||||||
from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory
|
from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory
|
||||||
from whisperlivekit.whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
|
from whisperlivekit.whisper_streaming_custom.online_asr import OnlineASRProcessor
|
||||||
except ImportError:
|
except ImportError:
|
||||||
from .whisper_streaming_custom.whisper_online import backend_factory
|
from .whisper_streaming_custom.whisper_online import backend_factory
|
||||||
from .whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
|
from .whisper_streaming_custom.online_asr import OnlineASRProcessor
|
||||||
from whisperlivekit.warmup import warmup_asr, warmup_online
|
from whisperlivekit.warmup import warmup_asr, warmup_online
|
||||||
from argparse import Namespace
|
from argparse import Namespace
|
||||||
import sys
|
import sys
|
||||||
@@ -34,7 +34,7 @@ class TranscriptionEngine:
|
|||||||
"lan": "auto",
|
"lan": "auto",
|
||||||
"task": "transcribe",
|
"task": "transcribe",
|
||||||
"backend": "faster-whisper",
|
"backend": "faster-whisper",
|
||||||
"vac": False,
|
"vac": True,
|
||||||
"vac_chunk_size": 0.04,
|
"vac_chunk_size": 0.04,
|
||||||
"log_level": "DEBUG",
|
"log_level": "DEBUG",
|
||||||
"ssl_certfile": None,
|
"ssl_certfile": None,
|
||||||
@@ -49,7 +49,7 @@ class TranscriptionEngine:
|
|||||||
"frame_threshold": 25,
|
"frame_threshold": 25,
|
||||||
"beams": 1,
|
"beams": 1,
|
||||||
"decoder_type": None,
|
"decoder_type": None,
|
||||||
"audio_max_len": 30.0,
|
"audio_max_len": 20.0,
|
||||||
"audio_min_len": 0.0,
|
"audio_min_len": 0.0,
|
||||||
"cif_ckpt_path": None,
|
"cif_ckpt_path": None,
|
||||||
"never_fire": False,
|
"never_fire": False,
|
||||||
@@ -57,10 +57,10 @@ class TranscriptionEngine:
|
|||||||
"static_init_prompt": None,
|
"static_init_prompt": None,
|
||||||
"max_context_tokens": None,
|
"max_context_tokens": None,
|
||||||
"model_path": './base.pt',
|
"model_path": './base.pt',
|
||||||
|
"diarization_backend": "diart",
|
||||||
# diart params:
|
# diart params:
|
||||||
"segmentation_model": "pyannote/segmentation-3.0",
|
"segmentation_model": "pyannote/segmentation-3.0",
|
||||||
"embedding_model": "pyannote/embedding",
|
"embedding_model": "pyannote/embedding",
|
||||||
|
|
||||||
}
|
}
|
||||||
|
|
||||||
config_dict = {**defaults, **kwargs}
|
config_dict = {**defaults, **kwargs}
|
||||||
@@ -69,6 +69,8 @@ class TranscriptionEngine:
|
|||||||
config_dict['transcription'] = not kwargs['no_transcription']
|
config_dict['transcription'] = not kwargs['no_transcription']
|
||||||
if 'no_vad' in kwargs:
|
if 'no_vad' in kwargs:
|
||||||
config_dict['vad'] = not kwargs['no_vad']
|
config_dict['vad'] = not kwargs['no_vad']
|
||||||
|
if 'no_vac' in kwargs:
|
||||||
|
config_dict['vac'] = not kwargs['no_vac']
|
||||||
|
|
||||||
config_dict.pop('no_transcription', None)
|
config_dict.pop('no_transcription', None)
|
||||||
config_dict.pop('no_vad', None)
|
config_dict.pop('no_vad', None)
|
||||||
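The hunk above extends the existing pattern where negated flags such as `no_vad` are folded into their positive counterparts after defaults and keyword arguments are merged. A minimal sketch of that pattern follows; `build_config` is a hypothetical helper, and popping `no_vac` is an assumption, since only the pops for `no_transcription` and `no_vad` are visible in this view.

```python
from argparse import Namespace


def build_config(defaults: dict, **kwargs) -> Namespace:
    """Merge kwargs over defaults, then translate negated flags (illustrative only)."""
    config = {**defaults, **kwargs}  # later keys win, so kwargs override defaults
    for negated, positive in (
        ("no_transcription", "transcription"),
        ("no_vad", "vad"),
        ("no_vac", "vac"),
    ):
        if negated in kwargs:
            config[positive] = not kwargs[negated]
        # Drop the negated key so only the positive flag remains on the namespace.
        config.pop(negated, None)
    return Namespace(**config)
```

With `vac` now defaulting to `True`, a caller would pass `no_vac=True` to switch it off, for example `build_config(defaults, no_vac=True).vac` evaluates to `False`.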
@@ -82,15 +84,20 @@ class TranscriptionEngine:
|
|||||||
self.asr = None
|
self.asr = None
|
||||||
self.tokenizer = None
|
self.tokenizer = None
|
||||||
self.diarization = None
|
self.diarization = None
|
||||||
|
self.vac_model = None
|
||||||
|
|
||||||
|
if self.args.vac:
|
||||||
|
import torch
|
||||||
|
self.vac_model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
|
||||||
|
|
||||||
if self.args.transcription:
|
if self.args.transcription:
|
||||||
if self.args.backend == "simulstreaming":
|
if self.args.backend == "simulstreaming":
|
||||||
from simul_whisper import SimulStreamingASR
|
from whisperlivekit.simul_whisper import SimulStreamingASR
|
||||||
self.tokenizer = None
|
self.tokenizer = None
|
||||||
simulstreaming_kwargs = {}
|
simulstreaming_kwargs = {}
|
||||||
for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len',
|
for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len',
|
||||||
'cif_ckpt_path', 'never_fire', 'init_prompt', 'static_init_prompt',
|
'cif_ckpt_path', 'never_fire', 'init_prompt', 'static_init_prompt',
|
||||||
'max_context_tokens', 'model_path']:
|
'max_context_tokens', 'model_path', 'warmup_file', 'preload_model_count']:
|
||||||
if hasattr(self.args, attr):
|
if hasattr(self.args, attr):
|
||||||
simulstreaming_kwargs[attr] = getattr(self.args, attr)
|
simulstreaming_kwargs[attr] = getattr(self.args, attr)
|
||||||
|
|
||||||
@@ -112,12 +119,17 @@ class TranscriptionEngine:
|
|||||||
warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here
|
warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here
|
||||||
|
|
||||||
if self.args.diarization:
|
if self.args.diarization:
|
||||||
from whisperlivekit.diarization.diarization_online import DiartDiarization
|
if self.args.diarization_backend == "diart":
|
||||||
self.diarization = DiartDiarization(
|
from whisperlivekit.diarization.diart_backend import DiartDiarization
|
||||||
block_duration=self.args.min_chunk_size,
|
self.diarization = DiartDiarization(
|
||||||
segmentation_model_name=self.args.segmentation_model,
|
block_duration=self.args.min_chunk_size,
|
||||||
embedding_model_name=self.args.embedding_model
|
segmentation_model_name=self.args.segmentation_model,
|
||||||
)
|
embedding_model_name=self.args.embedding_model
|
||||||
|
)
|
||||||
|
elif self.args.diarization_backend == "sortformer":
|
||||||
|
raise ValueError('Sortformer backend in development')
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unknown diarization backend: {self.args.diarization_backend}")
|
||||||
|
|
||||||
TranscriptionEngine._initialized = True
|
TranscriptionEngine._initialized = True
|
||||||
|
|
||||||
@@ -125,21 +137,12 @@ class TranscriptionEngine:
|
|||||||
|
|
||||||
def online_factory(args, asr, tokenizer, logfile=sys.stderr):
|
def online_factory(args, asr, tokenizer, logfile=sys.stderr):
|
||||||
if args.backend == "simulstreaming":
|
if args.backend == "simulstreaming":
|
||||||
from simul_whisper import SimulStreamingOnlineProcessor
|
from whisperlivekit.simul_whisper import SimulStreamingOnlineProcessor
|
||||||
online = SimulStreamingOnlineProcessor(
|
online = SimulStreamingOnlineProcessor(
|
||||||
asr,
|
asr,
|
||||||
logfile=logfile,
|
logfile=logfile,
|
||||||
)
|
)
|
||||||
# warmup_online(online, args.warmup_file)
|
# warmup_online(online, args.warmup_file)
|
||||||
elif args.vac:
|
|
||||||
online = VACOnlineASRProcessor(
|
|
||||||
args.min_chunk_size,
|
|
||||||
asr,
|
|
||||||
tokenizer,
|
|
||||||
logfile=logfile,
|
|
||||||
buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
|
|
||||||
confidence_validation = args.confidence_validation
|
|
||||||
)
|
|
||||||
else:
|
else:
|
||||||
online = OnlineASRProcessor(
|
online = OnlineASRProcessor(
|
||||||
asr,
|
asr,
|
||||||
|
|||||||
@@ -29,6 +29,7 @@ class DiarizationObserver(Observer):
|
|||||||
self.speaker_segments = []
|
self.speaker_segments = []
|
||||||
self.processed_time = 0
|
self.processed_time = 0
|
||||||
self.segment_lock = threading.Lock()
|
self.segment_lock = threading.Lock()
|
||||||
|
self.global_time_offset = 0.0
|
||||||
|
|
||||||
def on_next(self, value: Tuple[Annotation, Any]):
|
def on_next(self, value: Tuple[Annotation, Any]):
|
||||||
annotation, audio = value
|
annotation, audio = value
|
||||||
@@ -49,8 +50,8 @@ class DiarizationObserver(Observer):
|
|||||||
print(f" {speaker}: {start:.2f}s-{end:.2f}s")
|
print(f" {speaker}: {start:.2f}s-{end:.2f}s")
|
||||||
self.speaker_segments.append(SpeakerSegment(
|
self.speaker_segments.append(SpeakerSegment(
|
||||||
speaker=speaker,
|
speaker=speaker,
|
||||||
start=start,
|
start=start + self.global_time_offset,
|
||||||
end=end
|
end=end + self.global_time_offset
|
||||||
))
|
))
|
||||||
else:
|
else:
|
||||||
logger.debug("\nNo speakers detected in this segment")
|
logger.debug("\nNo speakers detected in this segment")
|
||||||
@@ -199,6 +200,9 @@ class DiartDiarization:
|
|||||||
self.inference.attach_observers(self.observer)
|
self.inference.attach_observers(self.observer)
|
||||||
asyncio.get_event_loop().run_in_executor(None, self.inference)
|
asyncio.get_event_loop().run_in_executor(None, self.inference)
|
||||||
|
|
||||||
|
def insert_silence(self, silence_duration):
|
||||||
|
self.observer.global_time_offset += silence_duration
|
||||||
|
|
||||||
async def diarize(self, pcm_array: np.ndarray):
|
async def diarize(self, pcm_array: np.ndarray):
|
||||||
"""
|
"""
|
||||||
Process audio data for diarization.
|
Process audio data for diarization.
|
||||||
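The `global_time_offset` change above keeps diarization timestamps aligned with the stream clock when silent audio is skipped rather than fed to the model. A minimal sketch of the idea, with `OffsetObserver` and `add_segment` as hypothetical names and `SpeakerSegment` reduced to an assumed three-field shape:

```python
from dataclasses import dataclass


@dataclass
class SpeakerSegment:
    """Assumed shape of the project's SpeakerSegment, for illustration."""
    speaker: int
    start: float
    end: float


class OffsetObserver:
    """Hypothetical observer that shifts segment times by the skipped silence."""

    def __init__(self):
        self.speaker_segments = []
        self.global_time_offset = 0.0

    def insert_silence(self, silence_duration: float) -> None:
        # Silence never reaches the diarizer, so its clock falls behind by this much.
        self.global_time_offset += silence_duration

    def add_segment(self, speaker: int, start: float, end: float) -> None:
        # Model-relative times are shifted onto the stream timeline.
        self.speaker_segments.append(SpeakerSegment(
            speaker=speaker,
            start=start + self.global_time_offset,
            end=end + self.global_time_offset,
        ))
```

Segments recorded before a silence keep their original times; only segments reported after `insert_silence` are shifted.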
145
whisperlivekit/diarization/sortformer_backend.py
Normal file
@@ -0,0 +1,145 @@
|
|||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import logging
|
||||||
|
from whisperlivekit.timed_objects import SpeakerSegment
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
try:
|
||||||
|
from nemo.collections.asr.models import SortformerEncLabelModel
|
||||||
|
except ImportError:
|
||||||
|
raise SystemExit("""Please use `pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"` to use the Sortformer diarization""")
|
||||||
|
|
||||||
|
class SortformerDiarization:
|
||||||
|
def __init__(self, model_name="nvidia/diar_streaming_sortformer_4spk-v2"):
|
||||||
|
self.diar_model = SortformerEncLabelModel.from_pretrained(model_name)
|
||||||
|
self.diar_model.eval()
|
||||||
|
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
self.diar_model.to(torch.device("cuda"))
|
||||||
|
|
||||||
|
# Streaming parameters for speed
|
||||||
|
self.diar_model.sortformer_modules.chunk_len = 12
|
||||||
|
self.diar_model.sortformer_modules.chunk_right_context = 1
|
||||||
|
self.diar_model.sortformer_modules.spkcache_len = 188
|
||||||
|
self.diar_model.sortformer_modules.fifo_len = 188
|
||||||
|
self.diar_model.sortformer_modules.spkcache_update_period = 144
|
||||||
|
self.diar_model.sortformer_modules.log = False
|
||||||
|
self.diar_model.sortformer_modules._check_streaming_parameters()
|
||||||
|
|
||||||
|
self.batch_size = 1
|
||||||
|
self.processed_signal_offset = torch.zeros((self.batch_size,), dtype=torch.long, device=self.diar_model.device)
|
||||||
|
|
||||||
|
self.audio_buffer = np.array([], dtype=np.float32)
|
||||||
|
self.sample_rate = 16000
|
||||||
|
self.speaker_segments = []
|
||||||
|
|
||||||
|
self.streaming_state = self.diar_model.sortformer_modules.init_streaming_state(
|
||||||
|
batch_size=self.batch_size,
|
||||||
|
async_streaming=True,
|
||||||
|
device=self.diar_model.device
|
||||||
|
)
|
||||||
|
self.total_preds = torch.zeros((self.batch_size, 0, self.diar_model.sortformer_modules.n_spk), device=self.diar_model.device)
|
||||||
|
|
||||||
|
|
||||||
|
def _prepare_audio_signal(self, signal):
|
||||||
|
audio_signal = torch.tensor(signal).unsqueeze(0).to(self.diar_model.device)
|
||||||
|
audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(self.diar_model.device)
|
||||||
|
processed_signal, processed_signal_length = self.diar_model.preprocessor(input_signal=audio_signal, length=audio_signal_length)
|
||||||
|
return processed_signal, processed_signal_length
|
||||||
|
|
||||||
|
def _create_streaming_loader(self, processed_signal, processed_signal_length):
|
||||||
|
streaming_loader = self.diar_model.sortformer_modules.streaming_feat_loader(
|
||||||
|
feat_seq=processed_signal,
|
||||||
|
feat_seq_length=processed_signal_length,
|
||||||
|
feat_seq_offset=self.processed_signal_offset,
|
||||||
|
)
|
||||||
|
return streaming_loader
|
||||||
|
|
||||||
|
async def diarize(self, pcm_array: np.ndarray):
|
||||||
|
"""
|
||||||
|
Process an incoming audio chunk for diarization.
|
||||||
|
"""
|
||||||
|
self.audio_buffer = np.concatenate([self.audio_buffer, pcm_array])
|
||||||
|
|
||||||
|
# Process in fixed-size chunks (e.g., 1 second)
|
||||||
|
chunk_size = self.sample_rate # 1 second of audio
|
||||||
|
|
||||||
|
while len(self.audio_buffer) >= chunk_size:
|
||||||
|
chunk_to_process = self.audio_buffer[:chunk_size]
|
||||||
|
self.audio_buffer = self.audio_buffer[chunk_size:]
|
||||||
|
|
||||||
|
processed_signal, processed_signal_length = self._prepare_audio_signal(chunk_to_process)
|
||||||
|
|
||||||
|
current_offset_seconds = self.processed_signal_offset.item() * self.diar_model.preprocessor._cfg.window_stride
|
||||||
|
|
||||||
|
streaming_loader = self._create_streaming_loader(processed_signal, processed_signal_length)
|
||||||
|
|
||||||
|
frame_duration_s = self.diar_model.sortformer_modules.subsampling_factor * self.diar_model.preprocessor._cfg.window_stride
|
||||||
|
chunk_duration_seconds = self.diar_model.sortformer_modules.chunk_len * frame_duration_s
|
||||||
|
|
||||||
|
for i, chunk_feat_seq_t, feat_lengths, left_offset, right_offset in streaming_loader:
|
||||||
|
with torch.inference_mode():
|
||||||
|
self.streaming_state, self.total_preds = self.diar_model.forward_streaming_step(
|
||||||
|
processed_signal=chunk_feat_seq_t,
|
||||||
|
processed_signal_length=feat_lengths,
|
||||||
|
streaming_state=self.streaming_state,
|
||||||
|
total_preds=self.total_preds,
|
||||||
|
left_offset=left_offset,
|
||||||
|
right_offset=right_offset,
|
||||||
|
)
|
||||||
|
|
||||||
|
num_new_frames = feat_lengths[0].item()
|
||||||
|
|
||||||
|
# Get predictions for the current chunk from the end of total_preds
|
||||||
|
preds_np = self.total_preds[0, -num_new_frames:].cpu().numpy()
|
||||||
|
active_speakers = np.argmax(preds_np, axis=1)
|
||||||
|
|
||||||
|
for idx, spk in enumerate(active_speakers):
|
||||||
|
start_time = current_offset_seconds + (i * chunk_duration_seconds) + (idx * frame_duration_s)
|
||||||
|
end_time = start_time + frame_duration_s
|
||||||
|
|
||||||
|
if self.speaker_segments and self.speaker_segments[-1].speaker == spk + 1:
|
||||||
|
self.speaker_segments[-1].end = end_time
|
||||||
|
else:
|
||||||
|
self.speaker_segments.append(SpeakerSegment(
|
||||||
|
speaker=int(spk + 1),
|
||||||
|
start=start_time,
|
||||||
|
end=end_time
|
||||||
|
))
|
||||||
|
|
||||||
|
self.processed_signal_offset += processed_signal_length
|
||||||
|
|
||||||
|
|
||||||
|
def assign_speakers_to_tokens(self, tokens: list, **kwargs) -> list:
|
||||||
|
"""
|
||||||
|
Assign speakers to tokens based on timing overlap with speaker segments.
|
||||||
|
"""
|
||||||
|
for token in tokens:
|
||||||
|
for segment in self.speaker_segments:
|
||||||
|
if not (segment.end <= token.start or segment.start >= token.end):
|
||||||
|
token.speaker = segment.speaker
|
||||||
|
return tokens
|
||||||
|
|
||||||
|
def close(self):
|
||||||
|
"""
|
||||||
|
Cleanup resources.
|
||||||
|
"""
|
||||||
|
logger.info("Closing SortformerDiarization.")
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
import librosa
|
||||||
|
an4_audio = 'new_audio_test.mp3'
|
||||||
|
signal, sr = librosa.load(an4_audio, sr=16000)
|
||||||
|
|
||||||
|
diarization_pipeline = SortformerDiarization()
|
||||||
|
|
||||||
|
# Simulate streaming
|
||||||
|
chunk_size = 16000 # 1 second
|
||||||
|
for i in range(0, len(signal), chunk_size):
|
||||||
|
chunk = signal[i:i+chunk_size]
|
||||||
|
import asyncio
|
||||||
|
asyncio.run(diarization_pipeline.diarize(chunk))
|
||||||
|
|
||||||
|
for segment in diarization_pipeline.speaker_segments:
|
||||||
|
print(f"Speaker {segment.speaker}: {segment.start:.2f}s - {segment.end:.2f}s")
|
||||||
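The `assign_speakers_to_tokens` method in the file above relies on a standard interval-overlap test. The sketch below isolates that test using stand-in `Segment` and `Token` dataclasses (assumed shapes, not the project's types). Because the inner loop has no `break`, the last overlapping segment wins when a token straddles a speaker change, and intervals that only touch at an endpoint do not count as overlapping.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: int
    start: float
    end: float


@dataclass
class Token:
    text: str
    start: float
    end: float
    speaker: int = -1  # -1 is used here to mean "not attributed"


def overlaps(a_start: float, a_end: float, b_start: float, b_end: float) -> bool:
    """True when the two half-open intervals share any time."""
    return not (a_end <= b_start or a_start >= b_end)


def assign_speakers(tokens: list, segments: list) -> list:
    """Label each token with the speaker of the last segment it overlaps."""
    for token in tokens:
        for segment in segments:
            if overlaps(segment.start, segment.end, token.start, token.end):
                token.speaker = segment.speaker
    return tokens
```

This scan is quadratic in the number of tokens and segments, which is acceptable for a sketch; a long-running stream would likely want to prune old segments or search by start time.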
257
whisperlivekit/diarization/sortformer_backend_2.py
Normal file
@@ -0,0 +1,257 @@
|
|||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import logging
|
||||||
|
import math
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
try:
|
||||||
|
from nemo.collections.asr.models import SortformerEncLabelModel
|
||||||
|
except ImportError:
|
||||||
|
raise SystemExit("""Please use `pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"` to use the Sortformer diarization""")
|
||||||
|
|
||||||
|
|
||||||
|
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
|
||||||
|
diar_model.eval()
|
||||||
|
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
diar_model.to(torch.device("cuda"))
|
||||||
|
|
||||||
|
# Set the streaming parameters corresponding to 1.04s latency setup. This will affect the streaming feat loader.
|
||||||
|
# diar_model.sortformer_modules.chunk_len = 6
|
||||||
|
# diar_model.sortformer_modules.spkcache_len = 188
|
||||||
|
# diar_model.sortformer_modules.chunk_right_context = 7
|
||||||
|
# diar_model.sortformer_modules.fifo_len = 188
|
||||||
|
# diar_model.sortformer_modules.spkcache_update_period = 144
|
||||||
|
# diar_model.sortformer_modules.log = False
|
||||||
|
|
||||||
|
|
||||||
|
# here we change the settings for our goal: speed!
|
||||||
|
# We want batches of about 1 second. One frame is 0.08 s, so 1 s is 12.5 frames; we round down to 12.
|
||||||
|
diar_model.sortformer_modules.chunk_len = 12
|
||||||
|
|
||||||
|
# For more speed, reduce the right context, i.e. look less far into the future.
|
||||||
|
diar_model.sortformer_modules.chunk_right_context = 1
|
||||||
|
|
||||||
|
# we keep the rest same for now
|
||||||
|
diar_model.sortformer_modules.spkcache_len = 188
|
||||||
|
diar_model.sortformer_modules.fifo_len = 188
|
||||||
|
diar_model.sortformer_modules.spkcache_update_period = 144
|
||||||
|
diar_model.sortformer_modules.log = False
|
||||||
|
diar_model.sortformer_modules._check_streaming_parameters()
|
||||||
|
|
||||||
|
batch_size = 1
|
||||||
|
processed_signal_offset = torch.zeros((batch_size,), dtype=torch.long, device=diar_model.device)
|
||||||
|
|
||||||
|
# from nemo.collections.asr.parts.preprocessing.features import FilterbankFeatures
|
||||||
|
# from nemo.collections.asr.modules.audio_preprocessing import get_features
|
||||||
|
from nemo.collections.asr.modules.audio_preprocessing import AudioToMelSpectrogramPreprocessor
|
||||||
|
|
||||||
|
|
||||||
|
def prepare_audio_signal(signal):
|
||||||
|
audio_signal = torch.tensor(signal).unsqueeze(0).to(diar_model.device)
|
||||||
|
audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(diar_model.device)
|
||||||
|
processed_signal, processed_signal_length = AudioToMelSpectrogramPreprocessor(
|
||||||
|
window_size= 0.025,
|
||||||
|
normalize="NA",
|
||||||
|
n_fft=512,
|
||||||
|
features=128).get_features(audio_signal, audio_signal_length)
|
||||||
|
return processed_signal, processed_signal_length
|
||||||
|
|
||||||
|
|
||||||
|
def streaming_feat_loader(
|
||||||
|
feat_seq, feat_seq_length, feat_seq_offset
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Load a chunk of feature sequence for streaming inference.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
feat_seq (torch.Tensor): Tensor containing feature sequence
|
||||||
|
Shape: (batch_size, feat_dim, feat frame count)
|
||||||
|
feat_seq_length (torch.Tensor): Tensor containing feature sequence lengths
|
||||||
|
Shape: (batch_size,)
|
||||||
|
feat_seq_offset (torch.Tensor): Tensor containing feature sequence offsets
|
||||||
|
Shape: (batch_size,)
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
chunk_idx (int): Index of the current chunk
|
||||||
|
chunk_feat_seq (torch.Tensor): Tensor containing the chunk of feature sequence
|
||||||
|
Shape: (batch_size, diar frame count, feat_dim)
|
||||||
|
feat_lengths (torch.Tensor): Tensor containing lengths of the chunk of feature sequence
|
||||||
|
Shape: (batch_size,)
|
||||||
|
"""
|
||||||
|
feat_len = feat_seq.shape[2]
|
||||||
|
num_chunks = math.ceil(feat_len / (diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor))
|
||||||
|
if False:
|
||||||
|
logging.info(
|
||||||
|
f"feat_len={feat_len}, num_chunks={num_chunks}, "
|
||||||
|
f"feat_seq_length={feat_seq_length}, feat_seq_offset={feat_seq_offset}"
|
||||||
|
)
|
||||||
|
|
||||||
|
stt_feat, end_feat, chunk_idx = 0, 0, 0
|
||||||
|
while end_feat < feat_len:
|
||||||
|
left_offset = min(diar_model.sortformer_modules.chunk_left_context * diar_model.sortformer_modules.subsampling_factor, stt_feat)
|
||||||
|
end_feat = min(stt_feat + diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor, feat_len)
|
||||||
|
right_offset = min(diar_model.sortformer_modules.chunk_right_context * diar_model.sortformer_modules.subsampling_factor, feat_len - end_feat)
|
||||||
|
chunk_feat_seq = feat_seq[:, :, stt_feat - left_offset : end_feat + right_offset]
|
||||||
|
feat_lengths = (feat_seq_length + feat_seq_offset - stt_feat + left_offset).clamp(
|
||||||
|
0, chunk_feat_seq.shape[2]
|
||||||
|
)
|
||||||
|
feat_lengths = feat_lengths * (feat_seq_offset < end_feat)
|
||||||
|
stt_feat = end_feat
|
||||||
|
chunk_feat_seq_t = torch.transpose(chunk_feat_seq, 1, 2)
|
||||||
|
if False:
|
||||||
|
logging.info(
|
||||||
|
f"chunk_idx: {chunk_idx}, "
|
||||||
|
f"chunk_feat_seq_t shape: {chunk_feat_seq_t.shape}, "
|
||||||
|
f"chunk_feat_lengths: {feat_lengths}"
|
||||||
|
)
|
||||||
|
yield chunk_idx, chunk_feat_seq_t, feat_lengths, left_offset, right_offset
|
||||||
|
chunk_idx += 1
|
||||||
|
|
||||||
|
|
||||||
|
class StreamingSortformerState:
|
||||||
|
"""
|
||||||
|
Stores the state of the
|
||||||
|
streaming Sortformer model.
|
||||||
|
|
||||||
|
Attributes:
|
||||||
|
spkcache (torch.Tensor): Speaker cache to store embeddings from start
|
||||||
|
spkcache_lengths (torch.Tensor): Lengths of the speaker cache
|
||||||
|
spkcache_preds (torch.Tensor): The speaker predictions for the speaker cache parts
|
||||||
|
fifo (torch.Tensor): FIFO queue to save the embedding from the latest chunks
|
||||||
|
fifo_lengths (torch.Tensor): Lengths of the FIFO queue
|
||||||
|
fifo_preds (torch.Tensor): The speaker predictions for the FIFO queue parts
|
||||||
|
spk_perm (torch.Tensor): Speaker permutation information for the speaker cache
|
||||||
|
mean_sil_emb (torch.Tensor): Mean silence embedding
|
||||||
|
n_sil_frames (torch.Tensor): Number of silence frames
|
||||||
|
"""
|
||||||
|
|
||||||
|
spkcache = None # Speaker cache to store embeddings from start
|
||||||
|
spkcache_lengths = None # lengths of the speaker cache
|
||||||
|
spkcache_preds = None # speaker cache predictions
|
||||||
|
fifo = None # to save the embedding from the latest chunks
|
||||||
|
fifo_lengths = None
|
||||||
|
fifo_preds = None
|
||||||
|
spk_perm = None
|
||||||
|
mean_sil_emb = None
|
||||||
|
n_sil_frames = None
|
||||||
|
|
||||||
|
|
||||||
|
def init_streaming_state(self, batch_size: int = 1, async_streaming: bool = False, device: torch.device = None):
|
||||||
|
"""
|
||||||
|
Initializes StreamingSortformerState with empty tensors or zero-valued tensors.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
batch_size (int): Batch size for tensors in streaming state
|
||||||
|
async_streaming (bool): True for asynchronous update, False for synchronous update
|
||||||
|
device (torch.device): Device for tensors in streaming state
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
streaming_state (StreamingSortformerState): initialized streaming state
|
||||||
|
"""
|
||||||
|
streaming_state = StreamingSortformerState()
|
||||||
|
if async_streaming:
|
||||||
|
streaming_state.spkcache = torch.zeros((batch_size, self.spkcache_len, self.fc_d_model), device=device)
|
||||||
|
streaming_state.spkcache_preds = torch.zeros((batch_size, self.spkcache_len, self.n_spk), device=device)
|
||||||
|
streaming_state.spkcache_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
|
||||||
|
streaming_state.fifo = torch.zeros((batch_size, self.fifo_len, self.fc_d_model), device=device)
|
||||||
|
streaming_state.fifo_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
|
||||||
|
else:
|
||||||
|
streaming_state.spkcache = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
|
||||||
|
streaming_state.fifo = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
|
||||||
|
streaming_state.mean_sil_emb = torch.zeros((batch_size, self.fc_d_model), device=device)
|
||||||
|
streaming_state.n_sil_frames = torch.zeros((batch_size,), dtype=torch.long, device=device)
|
||||||
|
return streaming_state
|
||||||
|
|
||||||
|
def process_diarization(signal, chunks):
|
||||||
|
|
||||||
|
audio_signal = torch.tensor(signal).unsqueeze(0).to(diar_model.device)
|
||||||
|
audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(diar_model.device)
|
||||||
|
processed_signal, processed_signal_length = AudioToMelSpectrogramPreprocessor(
|
||||||
|
window_size= 0.025,
|
||||||
|
normalize="NA",
|
||||||
|
n_fft=512,
|
||||||
|
features=128).get_features(audio_signal, audio_signal_length)
|
||||||
|
|
||||||
|
|
||||||
|
streaming_loader = streaming_feat_loader(processed_signal, processed_signal_length, processed_signal_offset)
|
||||||
|
|
||||||
|
|
||||||
|
streaming_state = init_streaming_state(diar_model.sortformer_modules,
|
||||||
|
batch_size = batch_size,
|
||||||
|
async_streaming = True,
|
||||||
|
device = diar_model.device
|
||||||
|
)
|
||||||
|
total_preds = torch.zeros((batch_size, 0, diar_model.sortformer_modules.n_spk), device=diar_model.device)
|
||||||
|
|
||||||
|
|
||||||
|
chunk_duration_seconds = diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor * diar_model.preprocessor._cfg.window_stride
|
||||||
|
print(f"Chunk duration: {chunk_duration_seconds} seconds")
|
||||||
|
|
||||||
|
l_speakers = [
|
||||||
|
{'start_time': 0,
|
||||||
|
'end_time': 0,
|
||||||
|
'speaker': 0
|
||||||
|
}
|
||||||
|
]
|
||||||
|
len_prediction = None
|
||||||
|
left_offset = 0
|
||||||
|
right_offset = 8
|
||||||
|
for i, chunk_feat_seq_t, _, _, _ in streaming_loader:
|
||||||
|
with torch.inference_mode():
|
||||||
|
streaming_state, total_preds = diar_model.forward_streaming_step(
|
||||||
|
processed_signal=chunk_feat_seq_t,
|
||||||
|
processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]], device=diar_model.device),
|
||||||
|
streaming_state=streaming_state,
|
||||||
|
total_preds=total_preds,
|
||||||
|
left_offset=left_offset,
|
||||||
|
right_offset=right_offset,
|
||||||
|
)
|
||||||
|
left_offset = 8
|
||||||
|
preds_np = total_preds[0].cpu().numpy()
|
||||||
|
active_speakers = np.argmax(preds_np, axis=1)
|
||||||
|
if len_prediction is None:
|
||||||
|
len_prediction = len(active_speakers) # number of prediction frames produced for one chunk
|
||||||
|
frame_duration = chunk_duration_seconds / len_prediction
|
||||||
|
active_speakers = active_speakers[-len_prediction:]
|
||||||
|
print(chunk_feat_seq_t.shape, total_preds.shape)
|
||||||
|
for idx, spk in enumerate(active_speakers):
|
||||||
|
if spk != l_speakers[-1]['speaker']:
|
||||||
|
l_speakers.append(
|
||||||
|
{'start_time': i * chunk_duration_seconds + idx * frame_duration,
|
||||||
|
'end_time': i * chunk_duration_seconds + (idx + 1) * frame_duration,
|
||||||
|
'speaker': spk
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
l_speakers[-1]['end_time'] = i * chunk_duration_seconds + (idx + 1) * frame_duration
|
||||||
|
|
||||||
|
print(l_speakers)
|
||||||
|
"""
|
||||||
|
Should print
|
||||||
|
[{'start_time': 0, 'end_time': 8.72, 'speaker': 0},
|
||||||
|
{'start_time': 8.72, 'end_time': 18.88, 'speaker': 1},
|
||||||
|
{'start_time': 18.88, 'end_time': 24.96, 'speaker': 2},
|
||||||
|
{'start_time': 24.96, 'end_time': 31.68, 'speaker': 0}]
|
||||||
|
"""
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
import librosa
|
||||||
|
an4_audio = 'new_audio_test.mp3'
|
||||||
|
signal, sr = librosa.load(an4_audio,sr=16000)
|
||||||
|
|
||||||
|
"""
|
||||||
|
ground truth:
|
||||||
|
speaker 0 : 0:00 - 0:09
|
||||||
|
speaker 1 : 0:09 - 0:19
|
||||||
|
speaker 2 : 0:19 - 0:25
|
||||||
|
speaker 0 : 0:25 - end
|
||||||
|
"""
|
||||||
|
|
||||||
|
# Simulate streaming
|
||||||
|
chunk_size = 16000 # 1 second
|
||||||
|
chunks = []
|
||||||
|
for i in range(0, len(signal), chunk_size):
|
||||||
|
chunk = signal[i:i+chunk_size]
|
||||||
|
chunks.append(chunk)
|
||||||
|
|
||||||
|
process_diarization(signal, chunks)
|
||||||
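The frame-merging loop at the end of `process_diarization` can be isolated into a small helper. The sketch below is a minimal, hypothetical version: the name `frames_to_segments` and its arguments are assumptions and not part of the script. It assumes one speaker index per frame and a fixed frame duration, and it opens the first segment at the first frame instead of starting from a placeholder entry.

```python
from typing import Any, Dict, List, Sequence


def frames_to_segments(
    active_speakers: Sequence[int],
    frame_duration: float,
    start_offset: float = 0.0,
) -> List[Dict[str, Any]]:
    """Merge consecutive frames that share a speaker into segments.

    Hypothetical helper: it mirrors the merging loop of the script
    but is not taken from it.
    """
    segments: List[Dict[str, Any]] = []
    for idx, spk in enumerate(active_speakers):
        start = start_offset + idx * frame_duration
        end = start_offset + (idx + 1) * frame_duration
        if segments and segments[-1]["speaker"] == spk:
            # Same speaker as the previous frame: extend the open segment.
            segments[-1]["end_time"] = end
        else:
            # Speaker change: open a new segment.
            segments.append({"start_time": start, "end_time": end, "speaker": int(spk)})
    return segments


# One frame is 0.08 s, as in the settings comment of the script.
for segment in frames_to_segments([0, 0, 1, 1, 1, 0], frame_duration=0.08):
    print(segment)
```

Passing `start_offset=i * chunk_duration_seconds` would place the frames of chunk `i` on the global timeline, which is what the loop in the script does inline.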
@@ -143,7 +143,7 @@ class FFmpegManager:
|
|||||||
try:
|
try:
|
||||||
data = await asyncio.wait_for(
|
data = await asyncio.wait_for(
|
||||||
self.process.stdout.read(size),
|
self.process.stdout.read(size),
|
||||||
timeout=5.0
|
timeout=20.0
|
||||||
)
|
)
|
||||||
return data
|
return data
|
||||||
except asyncio.TimeoutError:
|
except asyncio.TimeoutError:
|
||||||
|
|||||||
@@ -58,6 +58,14 @@ def parse_args():
|
|||||||
help="Hugging Face model ID for pyannote.audio embedding model.",
|
help="Hugging Face model ID for pyannote.audio embedding model.",
|
||||||
)
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--diarization-backend",
|
||||||
|
type=str,
|
||||||
|
default="diart",
|
||||||
|
choices=["sortformer", "diart"],
|
||||||
|
help="The diarization backend to use.",
|
||||||
|
)
|
||||||
|
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--no-transcription",
|
"--no-transcription",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
@@ -74,7 +82,7 @@ def parse_args():
|
|||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--model",
|
"--model",
|
||||||
type=str,
|
type=str,
|
||||||
default="tiny",
|
default="small",
|
||||||
help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.",
|
help="Name size of the Whisper model to use (default: small). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.",
|
||||||
)
|
)
|
||||||
|
|
||||||
@@ -107,15 +115,15 @@ def parse_args():
|
|||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--backend",
|
"--backend",
|
||||||
type=str,
|
type=str,
|
||||||
default="faster-whisper",
|
default="simulstreaming",
|
||||||
choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api", "simulstreaming"],
|
choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api", "simulstreaming"],
|
||||||
help="Load only this backend for Whisper processing.",
|
help="Load only this backend for Whisper processing.",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--vac",
|
"--no-vac",
|
||||||
action="store_true",
|
action="store_true",
|
||||||
default=False,
|
default=False,
|
||||||
help="Use VAC = voice activity controller. Recommended. Requires torch.",
|
help="Disable VAC = voice activity controller.",
|
||||||
)
|
)
|
||||||
parser.add_argument(
|
parser.add_argument(
|
||||||
"--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
|
"--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
|
||||||
@@ -242,6 +250,14 @@ def parse_args():
|
|||||||
dest="model_path",
|
dest="model_path",
|
||||||
help="Direct path to the SimulStreaming Whisper .pt model file. Overrides --model for SimulStreaming backend.",
|
help="Direct path to the SimulStreaming Whisper .pt model file. Overrides --model for SimulStreaming backend.",
|
||||||
)
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--preloaded_model_count",
|
||||||
|
type=int,
|
||||||
|
default=1,
|
||||||
|
dest="preloaded_model_count",
|
||||||
|
help="Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent instances).",
|
||||||
|
)
|
||||||
|
|
||||||
args = parser.parse_args()
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
|||||||
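With the switch from `--vac` to `--no-vac`, VAC is enabled unless it is explicitly turned off. The following minimal sketch shows how such an inverted flag and the new `--diarization-backend` choice behave under `argparse`. The `build_parser` helper and the derived `use_vac` variable are illustrative assumptions, not names from the repository.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser holding only the two options discussed here.
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--diarization-backend",
        type=str,
        default="diart",
        choices=["sortformer", "diart"],
        help="The diarization backend to use.",
    )
    parser.add_argument(
        "--no-vac",
        action="store_true",
        default=False,
        help="Disable VAC = voice activity controller.",
    )
    return parser


args = build_parser().parse_args(["--diarization-backend", "sortformer"])
# Hypothetical derived value: VAC stays on unless --no-vac is passed.
use_vac = not args.no_vac
print(args.diarization_backend, use_vac)
```

Because `choices` is set, an unknown backend name is rejected by `argparse` before any model is loaded.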
@@ -3,6 +3,7 @@ import re
|
|||||||
|
|
||||||
MIN_SILENCE_DURATION = 4 #in seconds
|
MIN_SILENCE_DURATION = 4 #in seconds
|
||||||
END_SILENCE_DURATION = 8 # in seconds. Keep it large to avoid false positives when the model lag is significant
|
END_SILENCE_DURATION = 8 # in seconds. Keep it large to avoid false positives when the model lag is significant
|
||||||
|
END_SILENCE_DURATION_VAC = 3 #VAC is good at detecting silences, but we want to skip the smallest silences
|
||||||
|
|
||||||
def blank_to_silence(tokens):
|
def blank_to_silence(tokens):
|
||||||
full_string = ''.join([t.text for t in tokens])
|
full_string = ''.join([t.text for t in tokens])
|
||||||
@@ -76,11 +77,15 @@ def no_token_to_silence(tokens):
|
|||||||
new_tokens.append(token)
|
new_tokens.append(token)
|
||||||
return new_tokens
|
return new_tokens
|
||||||
|
|
||||||
def ends_with_silence(tokens, current_time):
|
def ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
|
||||||
if not tokens:
|
if not tokens:
|
||||||
return []
|
return [], buffer_transcription, buffer_diarization
|
||||||
last_token = tokens[-1]
|
last_token = tokens[-1]
|
||||||
if tokens and current_time - last_token.end >= END_SILENCE_DURATION:
|
if tokens and (
|
||||||
|
current_time - last_token.end >= END_SILENCE_DURATION
|
||||||
|
or
|
||||||
|
(current_time - last_token.end >= END_SILENCE_DURATION_VAC and vac_detected_silence)
|
||||||
|
):
|
||||||
if last_token.speaker == -2:
|
if last_token.speaker == -2:
|
||||||
last_token.end = current_time
|
last_token.end = current_time
|
||||||
else:
|
else:
|
||||||
@@ -92,12 +97,14 @@ def ends_with_silence(tokens, current_time):
|
|||||||
probability=0.95
|
probability=0.95
|
||||||
)
|
)
|
||||||
)
|
)
|
||||||
return tokens
|
buffer_transcription = "" # for the whisperstreaming backend, the buffer should probably be validated (committed) because of the silence
|
||||||
|
buffer_diarization = ""
|
||||||
|
return tokens, buffer_transcription, buffer_diarization
|
||||||
|
|
||||||
|
|
||||||
def handle_silences(tokens, current_time):
|
def handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
|
||||||
tokens = blank_to_silence(tokens) #useful for simulstreaming backend which tends to generate [BLANK_AUDIO] text
|
tokens = blank_to_silence(tokens) #useful for simulstreaming backend which tends to generate [BLANK_AUDIO] text
|
||||||
tokens = no_token_to_silence(tokens)
|
tokens = no_token_to_silence(tokens)
|
||||||
tokens = ends_with_silence(tokens, current_time)
|
tokens, buffer_transcription, buffer_diarization = ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence)
|
||||||
return tokens
|
return tokens, buffer_transcription, buffer_diarization
|
||||||
|
|
||||||
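The new condition in `ends_with_silence` combines two thresholds: a long one that applies on its own, and a shorter one that only applies when VAC has reported silence. A minimal sketch of that decision, isolated from the token handling, is given below. The helper name `trailing_silence_detected` is hypothetical; the two constants reuse the values from the hunk above.

```python
END_SILENCE_DURATION = 8      # seconds, threshold without a VAC signal
END_SILENCE_DURATION_VAC = 3  # seconds, shorter threshold when VAC reports silence


def trailing_silence_detected(
    current_time: float,
    last_token_end: float,
    vac_detected_silence: bool,
) -> bool:
    """Hypothetical helper isolating the condition used in ends_with_silence."""
    gap = current_time - last_token_end
    if gap >= END_SILENCE_DURATION:
        # Long gap: treat as silence even without VAC confirmation.
        return True
    # Shorter gap: only trust it when VAC also detected silence.
    return vac_detected_silence and gap >= END_SILENCE_DURATION_VAC


print(trailing_silence_detected(10.0, 1.0, vac_detected_silence=False))
print(trailing_silence_detected(5.0, 1.0, vac_detected_silence=True))
```

Keeping the VAC threshold above zero skips the shortest pauses, which matches the comment on `END_SILENCE_DURATION_VAC`.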
@@ -4,9 +4,13 @@ import logging
|
|||||||
from typing import List, Tuple, Optional
|
from typing import List, Tuple, Optional
|
||||||
import logging
|
import logging
|
||||||
from whisperlivekit.timed_objects import ASRToken, Transcript
|
from whisperlivekit.timed_objects import ASRToken, Transcript
|
||||||
|
from whisperlivekit.warmup import load_file
|
||||||
from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
|
from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
|
||||||
from .whisper import load_model, tokenizer
|
from .whisper import load_model, tokenizer
|
||||||
|
from .whisper.audio import TOKENS_PER_SECOND
|
||||||
|
|
||||||
import os
|
import os
|
||||||
|
import gc
|
||||||
logger = logging.getLogger(__name__)
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
try:
|
try:
|
||||||
@@ -19,6 +23,8 @@ except ImportError as e:
|
|||||||
"""SimulStreaming dependencies are not available.
|
"""SimulStreaming dependencies are not available.
|
||||||
Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".""")
|
Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".""")
|
||||||
|
|
||||||
|
# TOO_MANY_REPETITIONS = 3
|
||||||
|
|
||||||
class SimulStreamingOnlineProcessor:
|
class SimulStreamingOnlineProcessor:
|
||||||
SAMPLING_RATE = 16000
|
SAMPLING_RATE = 16000
|
||||||
|
|
||||||
@@ -30,33 +36,42 @@ class SimulStreamingOnlineProcessor:
|
|||||||
):
|
):
|
||||||
self.asr = asr
|
self.asr = asr
|
||||||
self.logfile = logfile
|
self.logfile = logfile
|
||||||
self.is_last = False
|
|
||||||
self.beg = 0.0
|
|
||||||
self.end = 0.0
|
self.end = 0.0
|
||||||
self.cumulative_audio_duration = 0.0
|
self.global_time_offset = 0.0
|
||||||
|
|
||||||
self.committed: List[ASRToken] = []
|
self.committed: List[ASRToken] = []
|
||||||
self.last_result_tokens: List[ASRToken] = []
|
self.last_result_tokens: List[ASRToken] = []
|
||||||
self.model = PaddedAlignAttWhisper(
|
self.load_new_backend()
|
||||||
cfg=asr.cfg,
|
|
||||||
loaded_model=asr.whisper_model)
|
|
||||||
if asr.tokenizer:
|
if asr.tokenizer:
|
||||||
self.model.tokenizer = asr.tokenizer
|
self.model.tokenizer = asr.tokenizer
|
||||||
|
|
||||||
def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
|
def load_new_backend(self):
|
||||||
|
model = self.asr.get_new_model_instance()
|
||||||
|
self.model = PaddedAlignAttWhisper(
|
||||||
|
cfg=self.asr.cfg,
|
||||||
|
loaded_model=model)
|
||||||
|
|
||||||
|
def insert_silence(self, silence_duration, offset):
|
||||||
|
"""
|
||||||
|
If silences are > 5s, we do a complete context clear. Otherwise, we just insert a small silence and shift the last_attend_frame
|
||||||
|
"""
|
||||||
|
if silence_duration < 5:
|
||||||
|
gap_silence = torch.zeros(int(self.SAMPLING_RATE * silence_duration))
|
||||||
|
self.model.insert_audio(gap_silence)
|
||||||
|
# self.global_time_offset += silence_duration
|
||||||
|
else:
|
||||||
|
self.process_iter(is_last=True) #we want to totally process what remains in the buffer.
|
||||||
|
self.model.refresh_segment(complete=True)
|
||||||
|
self.global_time_offset += silence_duration + offset
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time):
|
||||||
"""Append an audio chunk to be processed by SimulStreaming."""
|
"""Append an audio chunk to be processed by SimulStreaming."""
|
||||||
|
|
||||||
# Convert numpy array to torch tensor
|
# Convert numpy array to torch tensor
|
||||||
audio_tensor = torch.from_numpy(audio).float()
|
audio_tensor = torch.from_numpy(audio).float()
|
||||||
|
self.end = audio_stream_end_time #Only to be aligned with what happens in whisperstreaming backend.
|
||||||
# Update timing
|
|
||||||
chunk_duration = len(audio) / self.SAMPLING_RATE
|
|
||||||
self.cumulative_audio_duration += chunk_duration
|
|
||||||
|
|
||||||
if audio_stream_end_time is not None:
|
|
||||||
self.end = audio_stream_end_time
|
|
||||||
else:
|
|
||||||
self.end = self.cumulative_audio_duration
|
|
||||||
self.model.insert_audio(audio_tensor)
|
self.model.insert_audio(audio_tensor)
|
||||||
|
|
||||||
def get_buffer(self):
|
def get_buffer(self):
|
||||||
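The two branches of `insert_silence` can be illustrated without a model. The sketch below is a toy stand-in under stated assumptions: `SilenceAwareBuffer` is a hypothetical class, the buffer is a plain list instead of a tensor, and the long-silence branch simply clears the buffer where the real method first processes the remaining audio and then refreshes the segment. It also assumes that the offset update belongs to the long-silence branch.

```python
from dataclasses import dataclass, field
from typing import List

SAMPLING_RATE = 16000
LONG_SILENCE_SECONDS = 5.0  # threshold taken from the insert_silence docstring


@dataclass
class SilenceAwareBuffer:
    """Toy stand-in for the processor: tracks samples and a global time offset."""

    samples: List[float] = field(default_factory=list)
    global_time_offset: float = 0.0

    def insert_silence(self, silence_duration: float, offset: float = 0.0) -> None:
        if silence_duration < LONG_SILENCE_SECONDS:
            # Short gap: pad with zeros so timestamps stay continuous.
            self.samples.extend([0.0] * int(SAMPLING_RATE * silence_duration))
        else:
            # Long gap: drop the context and shift later timestamps instead.
            self.samples.clear()
            self.global_time_offset += silence_duration + offset


buffer = SilenceAwareBuffer()
buffer.insert_silence(0.5)
buffer.insert_silence(6.0, offset=1.0)
print(len(buffer.samples), buffer.global_time_offset)
```

Shifting `global_time_offset` instead of feeding a long run of zeros keeps the audio context short while later tokens still land at the right absolute time.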
@@ -68,38 +83,63 @@ class SimulStreamingOnlineProcessor:
|
|||||||
)
|
)
|
||||||
|
|
||||||
def timestamped_text(self, tokens, generation):
|
def timestamped_text(self, tokens, generation):
|
||||||
# From the simulstreaming repo. self.model to self.asr.model
|
"""
|
||||||
pr = generation["progress"]
|
Generate timestamped text from tokens and generation data.
|
||||||
if "result" not in generation:
|
|
||||||
split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
|
Args:
|
||||||
|
tokens: List of tokens to process
|
||||||
|
generation: Dictionary containing generation progress and optionally results
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List of tuples containing (start_time, end_time, word) for each word
|
||||||
|
"""
|
||||||
|
FRAME_DURATION = 0.02
|
||||||
|
if "result" in generation:
|
||||||
|
split_words = generation["result"]["split_words"]
|
||||||
|
split_tokens = generation["result"]["split_tokens"]
|
||||||
else:
|
else:
|
||||||
split_words, split_tokens = generation["result"]["split_words"], generation["result"]["split_tokens"]
|
split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
|
||||||
|
progress = generation["progress"]
|
||||||
|
frames = [p["most_attended_frames"][0] for p in progress]
|
||||||
|
absolute_timestamps = [p["absolute_timestamps"][0] for p in progress]
|
||||||
|
tokens_queue = tokens.copy()
|
||||||
|
timestamped_words = []
|
||||||
|
|
||||||
|
for word, word_tokens in zip(split_words, split_tokens):
|
||||||
|
# start_frame = None
|
||||||
|
# end_frame = None
|
||||||
|
for expected_token in word_tokens:
|
||||||
|
if not tokens_queue or not frames:
|
||||||
|
raise ValueError(f"Insufficient tokens or frames for word '{word}'")
|
||||||
|
|
||||||
|
actual_token = tokens_queue.pop(0)
|
||||||
|
current_frame = frames.pop(0)
|
||||||
|
current_timestamp = absolute_timestamps.pop(0)
|
||||||
|
if actual_token != expected_token:
|
||||||
|
raise ValueError(
|
||||||
|
f"Token mismatch: expected '{expected_token}', "
|
||||||
|
f"got '{actual_token}' at frame {current_frame}"
|
||||||
|
)
|
||||||
|
# if start_frame is None:
|
||||||
|
# start_frame = current_frame
|
||||||
|
# end_frame = current_frame
|
||||||
|
# start_time = start_frame * FRAME_DURATION
|
||||||
|
# end_time = end_frame * FRAME_DURATION
|
||||||
|
start_time = current_timestamp
|
||||||
|
end_time = current_timestamp + 0.1
|
||||||
|
timestamp_entry = (start_time, end_time, word)
|
||||||
|
timestamped_words.append(timestamp_entry)
|
||||||
|
logger.debug(f"TS-WORD:\t{start_time:.2f}\t{end_time:.2f}\t{word}")
|
||||||
|
return timestamped_words
|
||||||
|
|
||||||
frames = [p["most_attended_frames"][0] for p in pr]
|
def process_iter(self, is_last=False) -> Tuple[List[ASRToken], float]:
|
||||||
tokens = tokens.copy()
|
|
||||||
ret = []
|
|
||||||
for sw,st in zip(split_words,split_tokens):
|
|
||||||
b = None
|
|
||||||
for stt in st:
|
|
||||||
t,f = tokens.pop(0), frames.pop(0)
|
|
||||||
if t != stt:
|
|
||||||
raise ValueError(f"Token mismatch: {t} != {stt} at frame {f}.")
|
|
||||||
if b is None:
|
|
||||||
b = f
|
|
||||||
e = f
|
|
||||||
out = (b*0.02, e*0.02, sw)
|
|
||||||
ret.append(out)
|
|
||||||
logger.debug(f"TS-WORD:\t{' '.join(map(str, out))}")
|
|
||||||
return ret
|
|
||||||
|
|
||||||
def process_iter(self) -> Tuple[List[ASRToken], float]:
|
|
||||||
"""
|
"""
|
||||||
Process accumulated audio chunks using SimulStreaming.
|
Process accumulated audio chunks using SimulStreaming.
|
||||||
|
|
||||||
Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
|
Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
|
||||||
"""
|
"""
|
||||||
try:
|
try:
|
||||||
tokens, generation_progress = self.model.infer(is_last=self.is_last)
|
tokens, generation_progress = self.model.infer(is_last=is_last)
|
||||||
ts_words = self.timestamped_text(tokens, generation_progress)
|
ts_words = self.timestamped_text(tokens, generation_progress)
|
||||||
|
|
||||||
new_tokens = []
|
new_tokens = []
|
||||||
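The rewritten `timestamped_text` walks the word/token split while consuming tokens and per-token timestamps in lockstep. A minimal, self-contained sketch of that alignment follows. It assumes that each word takes the timestamp of its last token and a fixed nominal duration of 0.1 s, which is one reading of the flattened hunk above; the function name `words_with_timestamps` and its parameters are hypothetical.

```python
from collections import deque
from typing import List, Sequence, Tuple


def words_with_timestamps(
    split_words: Sequence[str],
    split_tokens: Sequence[Sequence[int]],
    tokens: Sequence[int],
    timestamps: Sequence[float],
    nominal_duration: float = 0.1,
) -> List[Tuple[float, float, str]]:
    """Hypothetical helper: pair each word with a (start, end) time."""
    token_queue = deque(tokens)
    time_queue = deque(timestamps)
    result: List[Tuple[float, float, str]] = []
    for word, word_tokens in zip(split_words, split_tokens):
        last_timestamp = None
        for expected in word_tokens:
            if not token_queue or not time_queue:
                raise ValueError(f"Insufficient tokens or timestamps for word '{word}'")
            actual = token_queue.popleft()
            last_timestamp = time_queue.popleft()
            if actual != expected:
                raise ValueError(f"Token mismatch: expected {expected}, got {actual}")
        if last_timestamp is None:
            continue  # a word without tokens has no timestamp
        result.append((last_timestamp, last_timestamp + nominal_duration, word))
    return result


print(words_with_timestamps([" hello", " world"], [[1, 2], [3]], [1, 2, 3], [0.5, 1.0, 2.0]))
```

A `deque` is used here because `popleft()` is constant time, whereas `list.pop(0)` shifts the remaining elements on every call.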
@@ -111,9 +151,33 @@ class SimulStreamingOnlineProcessor:
|
|||||||
end=end,
|
end=end,
|
||||||
text=word,
|
text=word,
|
||||||
probability=0.95 # fake prob. Maybe we can extract it from the model?
|
probability=0.95 # fake prob. Maybe we can extract it from the model?
|
||||||
|
).with_offset(
|
||||||
|
self.global_time_offset
|
||||||
)
|
)
|
||||||
new_tokens.append(token)
|
new_tokens.append(token)
|
||||||
self.committed.extend(new_tokens)
|
|
||||||
|
# identical_tokens = 0
|
||||||
|
# n_new_tokens = len(new_tokens)
|
||||||
|
# if n_new_tokens:
|
||||||
|
|
||||||
|
self.committed.extend(new_tokens)
|
||||||
|
|
||||||
|
# if token in self.committed:
|
||||||
|
# pos = len(self.committed) - 1 - self.committed[::-1].index(token)
|
||||||
|
# if pos:
|
||||||
|
# for i in range(len(self.committed) - n_new_tokens, -1, -n_new_tokens):
|
||||||
|
# commited_segment = self.committed[i:i+n_new_tokens]
|
||||||
|
# if commited_segment == new_tokens:
|
||||||
|
# identical_segments +=1
|
||||||
|
# if identical_tokens >= TOO_MANY_REPETITIONS:
|
||||||
|
# logger.warning('Too many repetition, model is stuck. Load a new one')
|
||||||
|
# self.committed = self.committed[:i]
|
||||||
|
# self.load_new_backend()
|
||||||
|
# return [], self.end
|
||||||
|
|
||||||
|
# pos = self.committed.rindex(token)
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
return new_tokens, self.end
|
return new_tokens, self.end
|
||||||
|
|
||||||
@@ -132,6 +196,13 @@ class SimulStreamingOnlineProcessor:
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.exception(f"SimulStreaming warmup failed: {e}")
|
logger.exception(f"SimulStreaming warmup failed: {e}")
|
||||||
|
|
||||||
|
def __del__(self):
|
||||||
|
# free the model and add a new model to stack.
|
||||||
|
# del self.model
|
||||||
|
gc.collect()
|
||||||
|
torch.cuda.empty_cache()
|
||||||
|
# self.asr.new_model_to_stack()
|
||||||
|
self.model.remove_hooks()
|
||||||
|
|
||||||
class SimulStreamingASR():
|
class SimulStreamingASR():
|
||||||
"""SimulStreaming backend with AlignAtt policy."""
|
"""SimulStreaming backend with AlignAtt policy."""
|
||||||
@@ -145,7 +216,7 @@ class SimulStreamingASR():
|
|||||||
|
|
||||||
self.model_path = kwargs.get('model_path', './large-v3.pt')
|
self.model_path = kwargs.get('model_path', './large-v3.pt')
|
||||||
self.frame_threshold = kwargs.get('frame_threshold', 25)
|
self.frame_threshold = kwargs.get('frame_threshold', 25)
|
||||||
self.audio_max_len = kwargs.get('audio_max_len', 30.0)
|
self.audio_max_len = kwargs.get('audio_max_len', 20.0)
|
||||||
self.audio_min_len = kwargs.get('audio_min_len', 0.0)
|
self.audio_min_len = kwargs.get('audio_min_len', 0.0)
|
||||||
self.segment_length = kwargs.get('segment_length', 0.5)
|
self.segment_length = kwargs.get('segment_length', 0.5)
|
||||||
self.beams = kwargs.get('beams', 1)
|
self.beams = kwargs.get('beams', 1)
|
||||||
@@ -156,6 +227,8 @@ class SimulStreamingASR():
|
|||||||
self.init_prompt = kwargs.get('init_prompt', None)
|
self.init_prompt = kwargs.get('init_prompt', None)
|
||||||
self.static_init_prompt = kwargs.get('static_init_prompt', None)
|
self.static_init_prompt = kwargs.get('static_init_prompt', None)
|
||||||
self.max_context_tokens = kwargs.get('max_context_tokens', None)
|
self.max_context_tokens = kwargs.get('max_context_tokens', None)
|
||||||
|
self.warmup_file = kwargs.get('warmup_file', None)
|
||||||
|
self.preload_model_count = kwargs.get('preloaded_model_count', 1) # key matches dest="preloaded_model_count" from parse_args
|
||||||
|
|
||||||
if model_dir is not None:
|
if model_dir is not None:
|
||||||
self.model_path = model_dir
|
self.model_path = model_dir
|
||||||
@@ -176,16 +249,11 @@ class SimulStreamingASR():
|
|||||||
}
|
}
|
||||||
self.model_path = model_mapping.get(modelsize, f'./{modelsize}.pt')
|
self.model_path = model_mapping.get(modelsize, f'./{modelsize}.pt')
|
||||||
|
|
||||||
self.model = self.load_model(modelsize)
|
|
||||||
|
|
||||||
# Set up tokenizer for translation if needed
|
# Set up tokenizer for translation if needed
|
||||||
if self.task == "translate":
|
if self.task == "translate":
|
||||||
self.tokenizer = self.set_translate_task()
|
self.tokenizer = self.set_translate_task()
|
||||||
else:
|
else:
|
||||||
self.tokenizer = None
|
self.tokenizer = None
|
||||||
|
|
||||||
|
|
||||||
def load_model(self, modelsize):
|
|
||||||
self.cfg = AlignAttConfig(
|
self.cfg = AlignAttConfig(
|
||||||
model_path=self.model_path,
|
model_path=self.model_path,
|
||||||
segment_length=self.segment_length,
|
segment_length=self.segment_length,
|
||||||
@@ -201,10 +269,34 @@ class SimulStreamingASR():
|
|||||||
init_prompt=self.init_prompt,
|
init_prompt=self.init_prompt,
|
||||||
max_context_tokens=self.max_context_tokens,
|
max_context_tokens=self.max_context_tokens,
|
||||||
static_init_prompt=self.static_init_prompt,
|
static_init_prompt=self.static_init_prompt,
|
||||||
)
|
)
|
||||||
model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
|
|
||||||
model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
|
self.model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
|
||||||
self.whisper_model = load_model(name=model_name, download_root=model_path)
|
self.model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
|
||||||
|
self.models = [self.load_model() for i in range(self.preload_model_count)]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
def load_model(self):
|
||||||
|
whisper_model = load_model(name=self.model_name, download_root=self.model_path)
|
||||||
|
warmup_audio = load_file(self.warmup_file)
|
||||||
|
whisper_model.transcribe(warmup_audio, language=self.original_language)
|
||||||
|
return whisper_model
|
||||||
|
|
||||||
|
def get_new_model_instance(self):
|
||||||
|
"""
|
||||||
|
SimulStreaming cannot share the same backend because it uses global forward hooks on the attention layers.
|
||||||
|
Therefore, each user requires a separate model instance, which can be memory-intensive. To maintain speed, we preload the models into memory.
|
||||||
|
"""
|
||||||
|
if len(self.models) == 0:
|
||||||
|
self.models.append(self.load_model())
|
||||||
|
new_model = self.models.pop()
|
||||||
|
return new_model
|
||||||
|
# self.models[0]
|
||||||
|
|
||||||
|
def new_model_to_stack(self):
|
||||||
|
self.models.append(self.load_model())
|
||||||
|
|
||||||
|
|
||||||
def set_translate_task(self):
|
def set_translate_task(self):
|
||||||
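The `models` list together with `get_new_model_instance` and `new_model_to_stack` behaves like a small preloaded pool: instances are created ahead of time, handed out one per user, and created on demand when the pool runs dry. A generic sketch of that pattern is shown below. `PreloadedPool` and its method names are hypothetical, and the factory stands in for the expensive model load.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class PreloadedPool(Generic[T]):
    """Hypothetical pool that hands out exclusive, pre-built instances."""

    def __init__(self, factory: Callable[[], T], preload_count: int = 1) -> None:
        self._factory = factory
        # Pay the construction cost up front for the expected number of users.
        self._items: List[T] = [factory() for _ in range(preload_count)]

    def acquire(self) -> T:
        # Fall back to building a new instance when nothing is preloaded.
        if not self._items:
            self._items.append(self._factory())
        return self._items.pop()

    def refill(self) -> None:
        # Add one fresh instance, e.g. after a user disconnects.
        self._items.append(self._factory())

    def __len__(self) -> int:
        return len(self._items)


pool = PreloadedPool(factory=dict, preload_count=2)
first, second = pool.acquire(), pool.acquire()
third = pool.acquire()  # pool was empty, so this one is built on demand
print(first is second, len(pool))
```

Each acquired instance is removed from the pool, so two users never share one, which is the constraint the `get_new_model_instance` docstring attributes to the global forward hooks.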
@@ -218,6 +310,6 @@ class SimulStreamingASR():
|
|||||||
|
|
||||||
def transcribe(self, audio):
|
def transcribe(self, audio):
|
||||||
"""
|
"""
|
||||||
Only used for warmup. It's a direct whisper call, not a simulstreaming call
|
Warmup is done directly in load_model
|
||||||
"""
|
"""
|
||||||
self.whisper_model.transcribe(audio, language=self.original_language)
|
pass
|
||||||
@@ -24,6 +24,6 @@ class AlignAttConfig(SimulWhisperConfig):
|
|||||||
segment_length: float = field(default=1.0, metadata = {"help": "in second"})
|
segment_length: float = field(default=1.0, metadata = {"help": "in second"})
|
||||||
frame_threshold: int = 4
|
frame_threshold: int = 4
|
||||||
rewind_threshold: int = 200
|
rewind_threshold: int = 200
|
||||||
audio_max_len: float = 30.0
|
audio_max_len: float = 20.0
|
||||||
cif_ckpt_path: str = ""
|
cif_ckpt_path: str = ""
|
||||||
never_fire: bool = False
|
never_fire: bool = False
|
||||||
@@ -56,6 +56,7 @@ class PaddedAlignAttWhisper:
|
|||||||
self.max_text_len = self.model.dims.n_text_ctx
|
self.max_text_len = self.model.dims.n_text_ctx
|
||||||
self.num_decoder_layers = len(self.model.decoder.blocks)
|
self.num_decoder_layers = len(self.model.decoder.blocks)
|
||||||
self.cfg = cfg
|
self.cfg = cfg
|
||||||
|
self.l_hooks = []
|
||||||
|
|
||||||
# model to detect end-of-word boundary at the end of the segment
|
# model to detect end-of-word boundary at the end of the segment
|
||||||
self.CIFLinear, self.always_fire, self.never_fire = load_cif(cfg,
|
self.CIFLinear, self.always_fire, self.never_fire = load_cif(cfg,
|
||||||
@@ -69,7 +70,8 @@ class PaddedAlignAttWhisper:
|
|||||||
t = F.softmax(net_output[1], dim=-1)
|
t = F.softmax(net_output[1], dim=-1)
|
||||||
self.dec_attns.append(t.squeeze(0))
|
self.dec_attns.append(t.squeeze(0))
|
||||||
for b in self.model.decoder.blocks:
|
for b in self.model.decoder.blocks:
|
||||||
b.cross_attn.register_forward_hook(layer_hook)
|
hook = b.cross_attn.register_forward_hook(layer_hook)
|
||||||
|
self.l_hooks.append(hook)
|
||||||
|
|
||||||
self.kv_cache = {}
|
self.kv_cache = {}
|
||||||
def kv_hook(module: torch.nn.Linear, _, net_output: torch.Tensor):
|
def kv_hook(module: torch.nn.Linear, _, net_output: torch.Tensor):
|
||||||
@@ -82,10 +84,13 @@ class PaddedAlignAttWhisper:
|
|||||||
return self.kv_cache[module.cache_id]
|
return self.kv_cache[module.cache_id]
|
||||||
|
|
||||||
for i,b in enumerate(self.model.decoder.blocks):
|
for i,b in enumerate(self.model.decoder.blocks):
|
||||||
b.attn.key.register_forward_hook(kv_hook)
|
hooks = [
|
||||||
b.attn.value.register_forward_hook(kv_hook)
|
b.attn.key.register_forward_hook(kv_hook),
|
||||||
b.cross_attn.key.register_forward_hook(kv_hook)
|
b.attn.value.register_forward_hook(kv_hook),
|
||||||
b.cross_attn.value.register_forward_hook(kv_hook)
|
b.cross_attn.key.register_forward_hook(kv_hook),
|
||||||
|
b.cross_attn.value.register_forward_hook(kv_hook),
|
||||||
|
]
|
||||||
|
self.l_hooks.extend(hooks)
|
||||||
|
|
||||||
self.align_source = {}
|
self.align_source = {}
|
||||||
self.num_align_heads = 0
|
self.num_align_heads = 0
|
||||||
@@ -120,6 +125,7 @@ class PaddedAlignAttWhisper:
|
|||||||
self.init_tokens()
|
self.init_tokens()
|
||||||
|
|
||||||
self.last_attend_frame = -self.cfg.rewind_threshold
|
self.last_attend_frame = -self.cfg.rewind_threshold
|
||||||
|
self.cumulative_time_offset = 0.0
|
||||||
|
|
||||||
if self.cfg.max_context_tokens is None:
|
if self.cfg.max_context_tokens is None:
|
||||||
self.max_context_tokens = self.max_text_len
|
self.max_context_tokens = self.max_text_len
|
||||||
@@ -139,6 +145,11 @@ class PaddedAlignAttWhisper:
|
|||||||
self.inference.kv_cache = self.kv_cache
|
self.inference.kv_cache = self.kv_cache
|
||||||
|
|
||||||
self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
|
self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
|
||||||
|
|
||||||
|
def remove_hooks(self):
|
||||||
|
print('remove hook')
|
||||||
|
for hook in self.l_hooks:
|
||||||
|
hook.remove()
|
||||||
|
|
||||||
def create_tokenizer(self, language=None):
|
def create_tokenizer(self, language=None):
|
||||||
self.tokenizer = tokenizer.get_tokenizer(
|
self.tokenizer = tokenizer.get_tokenizer(
|
||||||
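The change above stores every handle returned by `register_forward_hook` in `self.l_hooks` so that `remove_hooks` can detach them later. The sketch below shows that bookkeeping pattern without depending on PyTorch: `Handle` and `Layer` are hypothetical stand-ins that only mimic the contract "registering a hook returns an object with a `remove()` method", and the doubling in `forward` is arbitrary.

```python
from typing import Callable, Dict, List


class Handle:
    """Stand-in for the removable handle returned when a hook is registered."""

    def __init__(self, registry: Dict[int, Callable], key: int) -> None:
        self._registry, self._key = registry, key

    def remove(self) -> None:
        # Safe to call more than once.
        self._registry.pop(self._key, None)


class Layer:
    """Hypothetical layer that calls its registered hooks after each forward pass."""

    def __init__(self) -> None:
        self._hooks: Dict[int, Callable] = {}
        self._next_id = 0

    def register_forward_hook(self, fn: Callable) -> Handle:
        key = self._next_id
        self._next_id += 1
        self._hooks[key] = fn
        return Handle(self._hooks, key)

    def forward(self, x: int) -> int:
        out = x * 2
        for fn in list(self._hooks.values()):
            fn(self, x, out)
        return out


seen: List[int] = []
layers = [Layer() for _ in range(3)]

# Keep every handle, as the diff does with self.l_hooks.
handles = [layer.register_forward_hook(lambda m, i, o: seen.append(o)) for layer in layers]

for layer in layers:
    layer.forward(1)
calls_with_hooks = len(seen)

# Equivalent of remove_hooks(): detach everything that was registered.
for handle in handles:
    handle.remove()
for layer in layers:
    layer.forward(1)
calls_after_removal = len(seen) - calls_with_hooks
```

Without the stored handles there would be no way to detach the hooks, so a reused model would keep feeding attention outputs and cache entries to an object that is no longer in use.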
@@ -210,6 +221,7 @@ class PaddedAlignAttWhisper:
|
|||||||
self.init_tokens()
|
self.init_tokens()
|
||||||
self.last_attend_frame = -self.cfg.rewind_threshold
|
self.last_attend_frame = -self.cfg.rewind_threshold
|
||||||
self.detected_language = None
|
self.detected_language = None
|
||||||
|
self.cumulative_time_offset = 0.0
|
||||||
self.init_context()
|
self.init_context()
|
||||||
logger.debug(f"Context: {self.context}")
|
logger.debug(f"Context: {self.context}")
|
||||||
if not complete and len(self.segments) > 2:
|
if not complete and len(self.segments) > 2:
|
||||||
@@ -277,8 +289,9 @@ class PaddedAlignAttWhisper:
|
|||||||
removed_len = self.segments[0].shape[0] / 16000
|
removed_len = self.segments[0].shape[0] / 16000
|
||||||
segments_len -= removed_len
|
segments_len -= removed_len
|
||||||
self.last_attend_frame -= int(TOKENS_PER_SECOND*removed_len)
|
self.last_attend_frame -= int(TOKENS_PER_SECOND*removed_len)
|
||||||
|
self.cumulative_time_offset += removed_len # Track cumulative time removed
|
||||||
self.segments = self.segments[1:]
|
self.segments = self.segments[1:]
|
||||||
logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}")
|
logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}, cumulative offset: {self.cumulative_time_offset:.2f}s")
|
||||||
if len(self.tokens) > 1:
|
if len(self.tokens) > 1:
|
||||||
self.context.append_token_ids(self.tokens[1][0,:])
|
self.context.append_token_ids(self.tokens[1][0,:])
|
||||||
self.tokens = [self.initial_tokens] + self.tokens[2:]
|
self.tokens = [self.initial_tokens] + self.tokens[2:]
|
||||||
@@ -494,7 +507,13 @@ class PaddedAlignAttWhisper:
|
|||||||
# for each beam, the most attended frame is:
|
# for each beam, the most attended frame is:
|
||||||
most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
|
most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
|
||||||
generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
|
generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
|
||||||
|
|
||||||
|
# Calculate absolute timestamps accounting for cumulative offset
|
||||||
|
absolute_timestamps = [(frame * 0.02 + self.cumulative_time_offset) for frame in most_attended_frames.tolist()]
|
||||||
|
generation_progress_loop.append(("absolute_timestamps", absolute_timestamps))
|
||||||
|
|
||||||
logger.debug(str(most_attended_frames.tolist()) + " most att frames")
|
logger.debug(str(most_attended_frames.tolist()) + " most att frames")
|
||||||
|
logger.debug(f"Absolute timestamps: {absolute_timestamps} (offset: {self.cumulative_time_offset:.2f}s)")
|
||||||
|
|
||||||
most_attended_frame = most_attended_frames[0].item()
|
most_attended_frame = most_attended_frames[0].item()
|
||||||
|
|
||||||
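The offset bookkeeping in the two hunks above (accumulating `removed_len` when a segment is dropped, then adding it to `frame * 0.02`) can be traced with a small standalone sketch. `OffsetTracker` and its method names are hypothetical; the 16000 Hz sample rate and the 0.02 s frame step are taken from the diff.

```python
from typing import List

SAMPLE_RATE = 16000       # samples per second, as in the removed_len computation
SECONDS_PER_FRAME = 0.02  # the diff multiplies frame indices by 0.02


class OffsetTracker:
    """Tracks how much audio has been trimmed from the front of the buffer."""

    def __init__(self) -> None:
        self.cumulative_time_offset = 0.0

    def drop_segment(self, num_samples: int) -> float:
        # Dropping a segment shifts every later frame index toward zero,
        # so remember the removed duration to recover absolute time.
        removed_len = num_samples / SAMPLE_RATE
        self.cumulative_time_offset += removed_len
        return removed_len

    def absolute_timestamps(self, frames: List[int]) -> List[float]:
        # Buffer-relative frame index -> seconds since the start of the stream.
        return [f * SECONDS_PER_FRAME + self.cumulative_time_offset for f in frames]


tracker = OffsetTracker()
tracker.drop_segment(16000)  # 1.0 s removed
tracker.drop_segment(8000)   # 0.5 s removed
stamps = tracker.absolute_timestamps([0, 50, 100])
```

After 1.5 s of audio has been dropped, frame 50 of the current buffer sits about 2.5 s into the stream rather than 1.0 s, which is the drift the added `cumulative_time_offset` field corrects.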
@@ -599,4 +618,4 @@ class PaddedAlignAttWhisper:
|
|||||||
|
|
||||||
self._clean_cache()
|
self._clean_cache()
|
||||||
|
|
||||||
return new_hypothesis, generation
|
return new_hypothesis, generation
|
||||||
|
|||||||
@@ -29,4 +29,8 @@ class SpeakerSegment(TimedText):
|
|||||||
"""Represents a segment of audio attributed to a specific speaker.
|
"""Represents a segment of audio attributed to a specific speaker.
|
||||||
Neither text nor probability is associated with this segment.
|
Neither text nor probability is associated with this segment.
|
||||||
"""
|
"""
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Silence():
|
||||||
|
duration: float
|
||||||
60
whisperlivekit/trail_repetition.py
Normal file
@@ -0,0 +1,60 @@
|
|||||||
|
from typing import Sequence, Callable, Any, Optional, Dict
|
||||||
|
|
||||||
|
def _detect_tail_repetition(
|
||||||
|
seq: Sequence[Any],
|
||||||
|
key: Callable[[Any], Any] = lambda x: x, # extract comparable value
|
||||||
|
min_block: int = 1, # set to 2 to ignore 1-token loops like "."
|
||||||
|
max_tail: int = 300, # search window from the end for speed
|
||||||
|
prefer: str = "longest", # "longest" coverage or "smallest" block
|
||||||
|
) -> Optional[Dict]:
|
||||||
|
vals = [key(x) for x in seq][-max_tail:]
|
||||||
|
n = len(vals)
|
||||||
|
best = None
|
||||||
|
|
||||||
|
# try every possible block length
|
||||||
|
for b in range(min_block, n // 2 + 1):
|
||||||
|
block = vals[-b:]
|
||||||
|
# count how many times this block repeats contiguously at the very end
|
||||||
|
count, i = 0, n
|
||||||
|
while i - b >= 0 and vals[i - b:i] == block:
|
||||||
|
count += 1
|
||||||
|
i -= b
|
||||||
|
|
||||||
|
if count >= 2:
|
||||||
|
cand = {
|
||||||
|
"block_size": b,
|
||||||
|
"count": count,
|
||||||
|
"start_index": len(seq) - count * b, # in original seq
|
||||||
|
"end_index": len(seq),
|
||||||
|
}
|
||||||
|
if (best is None or
|
||||||
|
(prefer == "longest" and count * b > best["count"] * best["block_size"]) or
|
||||||
|
(prefer == "smallest" and b < best["block_size"])):
|
||||||
|
best = cand
|
||||||
|
return best
|
||||||
|
|
||||||
|
def trim_tail_repetition(
|
||||||
|
seq: Sequence[Any],
|
||||||
|
key: Callable[[Any], Any] = lambda x: x,
|
||||||
|
min_block: int = 1,
|
||||||
|
max_tail: int = 300,
|
||||||
|
prefer: str = "longest",
|
||||||
|
keep: int = 1, # how many copies of the repeating block to keep at the end (0 or 1 are common)
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Returns a tuple (seq, trimmed): the sequence with its repeated tail trimmed, and a flag that is True only if something was removed.
|
||||||
|
keep=1 -> keep a single copy of the repeated block.
|
||||||
|
keep=0 -> remove all copies of the repeated block.
|
||||||
|
"""
|
||||||
|
rep = _detect_tail_repetition(seq, key, min_block, max_tail, prefer)
|
||||||
|
if not rep:
|
||||||
|
return seq, False # nothing to trim
|
||||||
|
|
||||||
|
b, c = rep["block_size"], rep["count"]
|
||||||
|
if keep < 0:
|
||||||
|
keep = 0
|
||||||
|
if keep >= c:
|
||||||
|
return seq, False # nothing to trim (already <= keep copies)
|
||||||
|
# new length = total - (copies_to_remove * block_size)
|
||||||
|
new_len = len(seq) - (c - keep) * b
|
||||||
|
return seq[:new_len], True
|
||||||
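To make the behaviour of the new module concrete, here is a condensed, standalone sketch of the same idea: find the block that repeats contiguously at the end of a sequence and keep only `keep` copies. `trim_repeated_tail` is a hypothetical simplification; it omits the `key`, `max_tail`, and `prefer` parameters of the file above and always prefers the repetition with the longest coverage.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def trim_repeated_tail(
    seq: Sequence[T], min_block: int = 1, keep: int = 1
) -> Tuple[List[T], bool]:
    """Return (trimmed_list, changed) with the repeated tail cut to `keep` copies."""
    vals = list(seq)
    n = len(vals)
    best_block, best_count = 0, 0

    # Try every block length that could repeat at least twice.
    for b in range(min_block, n // 2 + 1):
        block = vals[-b:]
        count, i = 0, n
        # Count contiguous copies of the block, walking back from the end.
        while i - b >= 0 and vals[i - b:i] == block:
            count += 1
            i -= b
        # Prefer the repetition covering the most elements.
        if count >= 2 and count * b > best_count * best_block:
            best_block, best_count = b, count

    keep = max(keep, 0)
    if best_count == 0 or keep >= best_count:
        return vals, False  # nothing to trim
    return vals[: n - (best_count - keep) * best_block], True


words = "thank you for watching for watching for watching".split()
trimmed, changed = trim_repeated_tail(words)
```

In this example the two-token block `for watching` repeats three times at the end, so with `keep=1` two copies are dropped and `trimmed` is `["thank", "you", "for", "watching"]`.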
402
whisperlivekit/web/live_transcription.css
Normal file
@@ -0,0 +1,402 @@
|
|||||||
|
:root {
|
||||||
|
--bg: #ffffff;
|
||||||
|
--text: #111111;
|
||||||
|
--muted: #666666;
|
||||||
|
--border: #e5e5e5;
|
||||||
|
--chip-bg: rgba(0, 0, 0, 0.04);
|
||||||
|
--chip-text: #000000;
|
||||||
|
--spinner-border: #8d8d8d5c;
|
||||||
|
--spinner-top: #b0b0b0;
|
||||||
|
--silence-bg: #f3f3f3;
|
||||||
|
--loading-bg: rgba(255, 77, 77, 0.06);
|
||||||
|
--button-bg: #ffffff;
|
||||||
|
--button-border: #e9e9e9;
|
||||||
|
--wave-stroke: #000000;
|
||||||
|
--label-dia-text: #868686;
|
||||||
|
--label-trans-text: #111111;
|
||||||
|
}
|
||||||
|
|
||||||
|
@media (prefers-color-scheme: dark) {
|
||||||
|
:root:not([data-theme="light"]) {
|
||||||
|
--bg: #0b0b0b;
|
||||||
|
--text: #e6e6e6;
|
||||||
|
--muted: #9aa0a6;
|
||||||
|
--border: #333333;
|
||||||
|
--chip-bg: rgba(255, 255, 255, 0.08);
|
||||||
|
--chip-text: #e6e6e6;
|
||||||
|
--spinner-border: #555555;
|
||||||
|
--spinner-top: #dddddd;
|
||||||
|
--silence-bg: #1a1a1a;
|
||||||
|
--loading-bg: rgba(255, 77, 77, 0.12);
|
||||||
|
--button-bg: #111111;
|
||||||
|
--button-border: #333333;
|
||||||
|
--wave-stroke: #e6e6e6;
|
||||||
|
--label-dia-text: #b3b3b3;
|
||||||
|
--label-trans-text: #ffffff;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
:root[data-theme="dark"] {
|
||||||
|
--bg: #0b0b0b;
|
||||||
|
--text: #e6e6e6;
|
||||||
|
--muted: #9aa0a6;
|
||||||
|
--border: #333333;
|
||||||
|
--chip-bg: rgba(255, 255, 255, 0.08);
|
||||||
|
--chip-text: #e6e6e6;
|
||||||
|
--spinner-border: #555555;
|
||||||
|
--spinner-top: #dddddd;
|
||||||
|
--silence-bg: #1a1a1a;
|
||||||
|
--loading-bg: rgba(255, 77, 77, 0.12);
|
||||||
|
--button-bg: #111111;
|
||||||
|
--button-border: #333333;
|
||||||
|
--wave-stroke: #e6e6e6;
|
||||||
|
--label-dia-text: #b3b3b3;
|
||||||
|
--label-trans-text: #ffffff;
|
||||||
|
}
|
||||||
|
|
||||||
|
:root[data-theme="light"] {
|
||||||
|
--bg: #ffffff;
|
||||||
|
--text: #111111;
|
||||||
|
--muted: #666666;
|
||||||
|
--border: #e5e5e5;
|
||||||
|
--chip-bg: rgba(0, 0, 0, 0.04);
|
||||||
|
--chip-text: #000000;
|
||||||
|
--spinner-border: #8d8d8d5c;
|
||||||
|
--spinner-top: #b0b0b0;
|
||||||
|
--silence-bg: #f3f3f3;
|
||||||
|
--loading-bg: rgba(255, 77, 77, 0.06);
|
||||||
|
--button-bg: #ffffff;
|
||||||
|
--button-border: #e9e9e9;
|
||||||
|
--wave-stroke: #000000;
|
||||||
|
--label-dia-text: #868686;
|
||||||
|
--label-trans-text: #111111;
|
||||||
|
}
|
||||||
|
|
||||||
|
body {
|
||||||
|
font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
|
||||||
|
margin: 20px;
|
||||||
|
text-align: center;
|
||||||
|
background-color: var(--bg);
|
||||||
|
color: var(--text);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Record button */
|
||||||
|
#recordButton {
|
||||||
|
width: 50px;
|
||||||
|
height: 50px;
|
||||||
|
border: none;
|
||||||
|
border-radius: 50%;
|
||||||
|
background-color: var(--button-bg);
|
||||||
|
cursor: pointer;
|
||||||
|
transition: all 0.3s ease;
|
||||||
|
border: 1px solid var(--button-border);
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
justify-content: center;
|
||||||
|
position: relative;
|
||||||
|
}
|
||||||
|
|
||||||
|
#recordButton.recording {
|
||||||
|
width: 180px;
|
||||||
|
border-radius: 40px;
|
||||||
|
justify-content: flex-start;
|
||||||
|
padding-left: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#recordButton:active {
|
||||||
|
transform: scale(0.95);
|
||||||
|
}
|
||||||
|
|
||||||
|
.shape-container {
|
||||||
|
width: 25px;
|
||||||
|
height: 25px;
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
justify-content: center;
|
||||||
|
flex-shrink: 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
.shape {
|
||||||
|
width: 25px;
|
||||||
|
height: 25px;
|
||||||
|
background-color: rgb(209, 61, 53);
|
||||||
|
border-radius: 50%;
|
||||||
|
transition: all 0.3s ease;
|
||||||
|
}
|
||||||
|
|
||||||
|
#recordButton:disabled .shape {
|
||||||
|
background-color: #6e6d6d;
|
||||||
|
}
|
||||||
|
|
||||||
|
#recordButton.recording .shape {
|
||||||
|
border-radius: 5px;
|
||||||
|
width: 25px;
|
||||||
|
height: 25px;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Recording elements */
|
||||||
|
.recording-info {
|
||||||
|
display: none;
|
||||||
|
align-items: center;
|
||||||
|
margin-left: 15px;
|
||||||
|
flex-grow: 1;
|
||||||
|
}
|
||||||
|
|
||||||
|
#recordButton.recording .recording-info {
|
||||||
|
display: flex;
|
||||||
|
}
|
||||||
|
|
||||||
|
.wave-container {
|
||||||
|
width: 60px;
|
||||||
|
height: 30px;
|
||||||
|
position: relative;
|
||||||
|
display: flex;
|
||||||
|
align-items: center;
|
||||||
|
justify-content: center;
|
||||||
|
}
|
||||||
|
|
||||||
|
#waveCanvas {
|
||||||
|
width: 100%;
|
||||||
|
height: 100%;
|
||||||
|
}
|
||||||
|
|
||||||
|
.timer {
|
||||||
|
font-size: 14px;
|
||||||
|
font-weight: 500;
|
||||||
|
color: var(--text);
|
||||||
|
margin-left: 10px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#status {
|
||||||
|
margin-top: 20px;
|
||||||
|
font-size: 16px;
|
||||||
|
color: var(--text);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Settings */
|
||||||
|
.settings-container {
|
||||||
|
display: flex;
|
||||||
|
justify-content: center;
|
||||||
|
align-items: center;
|
||||||
|
gap: 15px;
|
||||||
|
margin-top: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.settings {
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
align-items: flex-start;
|
||||||
|
gap: 12px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.field {
|
||||||
|
display: flex;
|
||||||
|
flex-direction: column;
|
||||||
|
align-items: flex-start;
|
||||||
|
gap: 3px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#chunkSelector,
|
||||||
|
#websocketInput,
|
||||||
|
#themeSelector {
|
||||||
|
font-size: 16px;
|
||||||
|
padding: 5px 8px;
|
||||||
|
border-radius: 8px;
|
||||||
|
border: 1px solid var(--border);
|
||||||
|
background-color: var(--button-bg);
|
||||||
|
color: var(--text);
|
||||||
|
max-height: 34px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#websocketInput {
|
||||||
|
width: 220px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#chunkSelector:focus,
|
||||||
|
#websocketInput:focus,
|
||||||
|
#themeSelector:focus {
|
||||||
|
outline: none;
|
||||||
|
border-color: #007bff;
|
||||||
|
box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.15);
|
||||||
|
}
|
||||||
|
|
||||||
|
label {
|
||||||
|
font-size: 13px;
|
||||||
|
color: var(--muted);
|
||||||
|
}
|
||||||
|
|
||||||
|
.ws-default {
|
||||||
|
font-size: 12px;
|
||||||
|
color: var(--muted);
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Segmented pill control for Theme */
|
||||||
|
.segmented {
|
||||||
|
display: inline-flex;
|
||||||
|
align-items: stretch;
|
||||||
|
border: 1px solid var(--button-border);
|
||||||
|
background-color: var(--button-bg);
|
||||||
|
border-radius: 999px;
|
||||||
|
overflow: hidden;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented input[type="radio"] {
|
||||||
|
position: absolute;
|
||||||
|
opacity: 0;
|
||||||
|
pointer-events: none;
|
||||||
|
}
|
||||||
|
|
||||||
|
.theme-selector-container {
|
||||||
|
position: absolute;
|
||||||
|
top: 20px;
|
||||||
|
right: 20px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented label {
|
||||||
|
display: inline-flex;
|
||||||
|
align-items: center;
|
||||||
|
gap: 6px;
|
||||||
|
padding: 6px 12px;
|
||||||
|
font-size: 14px;
|
||||||
|
color: var(--muted);
|
||||||
|
cursor: pointer;
|
||||||
|
user-select: none;
|
||||||
|
transition: background-color 0.2s ease, color 0.2s ease;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented label span {
|
||||||
|
display: none;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented label:hover span {
|
||||||
|
display: inline;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented label:hover {
|
||||||
|
background-color: var(--chip-bg);
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented img {
|
||||||
|
width: 16px;
|
||||||
|
height: 16px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented input[type="radio"]:checked + label {
|
||||||
|
background-color: var(--chip-bg);
|
||||||
|
color: var(--text);
|
||||||
|
}
|
||||||
|
|
||||||
|
.segmented input[type="radio"]:focus-visible + label,
|
||||||
|
.segmented input[type="radio"]:focus + label {
|
||||||
|
outline: 2px solid #007bff;
|
||||||
|
outline-offset: 2px;
|
||||||
|
border-radius: 999px;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* Transcript area */
|
||||||
|
#linesTranscript {
|
||||||
|
margin: 20px auto;
|
||||||
|
max-width: 700px;
|
||||||
|
text-align: left;
|
||||||
|
font-size: 16px;
|
||||||
|
}
|
||||||
|
|
||||||
|
#linesTranscript p {
|
||||||
|
margin: 0px 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
#linesTranscript strong {
|
||||||
|
color: var(--text);
|
||||||
|
}
|
||||||
|
|
||||||
|
#speaker {
|
||||||
|
border: 1px solid var(--border);
|
||||||
|
border-radius: 100px;
|
||||||
|
padding: 2px 10px;
|
||||||
|
font-size: 14px;
|
||||||
|
margin-bottom: 0px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.label_diarization {
|
||||||
|
background-color: var(--chip-bg);
|
||||||
|
border-radius: 8px 8px 8px 8px;
|
||||||
|
padding: 2px 10px;
|
||||||
|
margin-left: 10px;
|
||||||
|
display: inline-block;
|
||||||
|
white-space: nowrap;
|
||||||
|
font-size: 14px;
|
||||||
|
margin-bottom: 0px;
|
||||||
|
color: var(--label-dia-text);
|
||||||
|
}
|
||||||
|
|
||||||
|
.label_transcription {
|
||||||
|
background-color: var(--chip-bg);
|
||||||
|
border-radius: 8px 8px 8px 8px;
|
||||||
|
padding: 2px 10px;
|
||||||
|
display: inline-block;
|
||||||
|
white-space: nowrap;
|
||||||
|
margin-left: 10px;
|
||||||
|
font-size: 14px;
|
||||||
|
margin-bottom: 0px;
|
||||||
|
color: var(--label-trans-text);
|
||||||
|
}
|
||||||
|
|
||||||
|
#timeInfo {
|
||||||
|
color: var(--muted);
|
||||||
|
margin-left: 10px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.textcontent {
|
||||||
|
font-size: 16px;
|
||||||
|
padding-left: 10px;
|
||||||
|
margin-bottom: 10px;
|
||||||
|
margin-top: 1px;
|
||||||
|
padding-top: 5px;
|
||||||
|
border-radius: 0px 0px 0px 10px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.buffer_diarization {
|
||||||
|
color: var(--label-dia-text);
|
||||||
|
margin-left: 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.buffer_transcription {
|
||||||
|
color: #7474748c;
|
||||||
|
margin-left: 4px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.spinner {
|
||||||
|
display: inline-block;
|
||||||
|
width: 8px;
|
||||||
|
height: 8px;
|
||||||
|
border: 2px solid var(--spinner-border);
|
||||||
|
border-top: 2px solid var(--spinner-top);
|
||||||
|
border-radius: 50%;
|
||||||
|
animation: spin 0.7s linear infinite;
|
||||||
|
vertical-align: middle;
|
||||||
|
margin-bottom: 2px;
|
||||||
|
margin-right: 5px;
|
||||||
|
}
|
||||||
|
|
||||||
|
@keyframes spin {
|
||||||
|
to {
|
||||||
|
transform: rotate(360deg);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
.silence {
|
||||||
|
color: var(--muted);
|
||||||
|
background-color: var(--silence-bg);
|
||||||
|
font-size: 13px;
|
||||||
|
border-radius: 30px;
|
||||||
|
padding: 2px 10px;
|
||||||
|
}
|
||||||
|
|
||||||
|
.loading {
|
||||||
|
color: var(--muted);
|
||||||
|
background-color: var(--loading-bg);
|
||||||
|
border-radius: 8px 8px 8px 0px;
|
||||||
|
padding: 2px 10px;
|
||||||
|
font-size: 14px;
|
||||||
|
margin-bottom: 0px;
|
||||||
|
}
|
||||||
@@ -1,861 +1,61 @@
|
|||||||
<!DOCTYPE html>
|
<!DOCTYPE html>
|
||||||
<html lang="en">
|
<html lang="en">
|
||||||
|
|
||||||
<head>
|
<head>
|
||||||
<meta charset="UTF-8" />
|
<meta charset="UTF-8" />
|
||||||
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
|
||||||
<title>WhisperLiveKit</title>
|
<title>WhisperLiveKit</title>
|
||||||
<style>
|
<link rel="stylesheet" href="/web/live_transcription.css" />
|
||||||
:root {
|
|
||||||
--bg: #ffffff;
|
|
||||||
--text: #111111;
|
|
||||||
--muted: #666666;
|
|
||||||
--border: #e5e5e5;
|
|
||||||
--chip-bg: rgba(0, 0, 0, 0.04);
|
|
||||||
--chip-text: #000000;
|
|
||||||
--spinner-border: #8d8d8d5c;
|
|
||||||
--spinner-top: #b0b0b0;
|
|
||||||
--silence-bg: #f3f3f3;
|
|
||||||
--loading-bg: rgba(255, 77, 77, 0.06);
|
|
||||||
--button-bg: #ffffff;
|
|
||||||
--button-border: #e9e9e9;
|
|
||||||
--wave-stroke: #000000;
|
|
||||||
--label-dia-text: #868686;
|
|
||||||
--label-trans-text: #111111;
|
|
||||||
}
|
|
||||||
|
|
||||||
@media (prefers-color-scheme: dark) {
|
|
||||||
:root:not([data-theme="light"]) {
|
|
||||||
--bg: #0b0b0b;
|
|
||||||
--text: #e6e6e6;
|
|
||||||
--muted: #9aa0a6;
|
|
||||||
--border: #333333;
|
|
||||||
--chip-bg: rgba(255, 255, 255, 0.08);
|
|
||||||
--chip-text: #e6e6e6;
|
|
||||||
--spinner-border: #555555;
|
|
||||||
--spinner-top: #dddddd;
|
|
||||||
--silence-bg: #1a1a1a;
|
|
||||||
--loading-bg: rgba(255, 77, 77, 0.12);
|
|
||||||
--button-bg: #111111;
|
|
||||||
--button-border: #333333;
|
|
||||||
--wave-stroke: #e6e6e6;
|
|
||||||
--label-dia-text: #b3b3b3;
|
|
||||||
--label-trans-text: #ffffff;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
:root[data-theme="dark"] {
|
|
||||||
--bg: #0b0b0b;
|
|
||||||
--text: #e6e6e6;
|
|
||||||
--muted: #9aa0a6;
|
|
||||||
--border: #333333;
|
|
||||||
--chip-bg: rgba(255, 255, 255, 0.08);
|
|
||||||
--chip-text: #e6e6e6;
|
|
||||||
--spinner-border: #555555;
|
|
||||||
--spinner-top: #dddddd;
|
|
||||||
--silence-bg: #1a1a1a;
|
|
||||||
--loading-bg: rgba(255, 77, 77, 0.12);
|
|
||||||
--button-bg: #111111;
|
|
||||||
--button-border: #333333;
|
|
||||||
--wave-stroke: #e6e6e6;
|
|
||||||
--label-dia-text: #b3b3b3;
|
|
||||||
--label-trans-text: #ffffff;
|
|
||||||
}
|
|
||||||
|
|
||||||
:root[data-theme="light"] {
|
|
||||||
--bg: #ffffff;
|
|
||||||
--text: #111111;
|
|
||||||
--muted: #666666;
|
|
||||||
--border: #e5e5e5;
|
|
||||||
--chip-bg: rgba(0, 0, 0, 0.04);
|
|
||||||
--chip-text: #000000;
|
|
||||||
--spinner-border: #8d8d8d5c;
|
|
||||||
--spinner-top: #b0b0b0;
|
|
||||||
--silence-bg: #f3f3f3;
|
|
||||||
--loading-bg: rgba(255, 77, 77, 0.06);
|
|
||||||
--button-bg: #ffffff;
|
|
||||||
--button-border: #e9e9e9;
|
|
||||||
--wave-stroke: #000000;
|
|
||||||
--label-dia-text: #868686;
|
|
||||||
--label-trans-text: #111111;
|
|
||||||
}
|
|
||||||
body {
|
|
||||||
font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
|
|
||||||
margin: 20px;
|
|
||||||
text-align: center;
|
|
||||||
background-color: var(--bg);
|
|
||||||
color: var(--text);
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton {
|
|
||||||
width: 50px;
|
|
||||||
height: 50px;
|
|
||||||
border: none;
|
|
||||||
border-radius: 50%;
|
|
||||||
background-color: var(--button-bg);
|
|
||||||
cursor: pointer;
|
|
||||||
transition: all 0.3s ease;
|
|
||||||
border: 1px solid var(--button-border);
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: center;
|
|
||||||
position: relative;
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton.recording {
|
|
||||||
width: 180px;
|
|
||||||
border-radius: 40px;
|
|
||||||
justify-content: flex-start;
|
|
||||||
padding-left: 20px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton:active {
|
|
||||||
transform: scale(0.95);
|
|
||||||
}
|
|
||||||
|
|
||||||
.shape-container {
|
|
||||||
width: 25px;
|
|
||||||
height: 25px;
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: center;
|
|
||||||
flex-shrink: 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
.shape {
|
|
||||||
width: 25px;
|
|
||||||
height: 25px;
|
|
||||||
background-color: rgb(209, 61, 53);
|
|
||||||
border-radius: 50%;
|
|
||||||
transition: all 0.3s ease;
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton:disabled .shape {
|
|
||||||
background-color: #6e6d6d;
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton.recording .shape {
|
|
||||||
border-radius: 5px;
|
|
||||||
width: 25px;
|
|
||||||
height: 25px;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Recording elements */
|
|
||||||
.recording-info {
|
|
||||||
display: none;
|
|
||||||
align-items: center;
|
|
||||||
margin-left: 15px;
|
|
||||||
flex-grow: 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
#recordButton.recording .recording-info {
|
|
||||||
display: flex;
|
|
||||||
}
|
|
||||||
|
|
||||||
.wave-container {
|
|
||||||
width: 60px;
|
|
||||||
height: 30px;
|
|
||||||
position: relative;
|
|
||||||
display: flex;
|
|
||||||
align-items: center;
|
|
||||||
justify-content: center;
|
|
||||||
}
|
|
||||||
|
|
||||||
#waveCanvas {
|
|
||||||
width: 100%;
|
|
||||||
height: 100%;
|
|
||||||
}
|
|
||||||
|
|
||||||
.timer {
|
|
||||||
font-size: 14px;
|
|
||||||
font-weight: 500;
|
|
||||||
color: var(--text);
|
|
||||||
margin-left: 10px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#status {
|
|
||||||
margin-top: 20px;
|
|
||||||
font-size: 16px;
|
|
||||||
color: var(--text);
|
|
||||||
}
|
|
||||||
|
|
||||||
.settings-container {
|
|
||||||
display: flex;
|
|
||||||
justify-content: center;
|
|
||||||
align-items: center;
|
|
||||||
gap: 15px;
|
|
||||||
margin-top: 20px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.settings {
|
|
||||||
display: flex;
|
|
||||||
flex-direction: column;
|
|
||||||
align-items: flex-start;
|
|
||||||
gap: 5px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#chunkSelector,
|
|
||||||
#websocketInput,
|
|
||||||
#themeSelector {
|
|
||||||
font-size: 16px;
|
|
||||||
padding: 5px;
|
|
||||||
border-radius: 5px;
|
|
||||||
border: 1px solid var(--border);
|
|
||||||
background-color: var(--button-bg);
|
|
||||||
color: var(--text);
|
|
||||||
max-height: 30px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#websocketInput {
|
|
||||||
width: 200px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#chunkSelector:focus,
|
|
||||||
#websocketInput:focus,
|
|
||||||
#themeSelector:focus {
|
|
||||||
outline: none;
|
|
||||||
border-color: #007bff;
|
|
||||||
}
|
|
||||||
|
|
||||||
label {
|
|
||||||
font-size: 14px;
|
|
||||||
}
|
|
||||||
|
|
||||||
/* Speaker-labeled transcript area */
|
|
||||||
#linesTranscript {
|
|
||||||
margin: 20px auto;
|
|
||||||
max-width: 700px;
|
|
||||||
text-align: left;
|
|
||||||
font-size: 16px;
|
|
||||||
}
|
|
||||||
|
|
||||||
#linesTranscript p {
|
|
||||||
margin: 0px 0;
|
|
||||||
}
|
|
||||||
|
|
||||||
#linesTranscript strong {
|
|
||||||
color: var(--text);
|
|
||||||
}
|
|
||||||
|
|
||||||
#speaker {
|
|
||||||
border: 1px solid var(--border);
|
|
||||||
border-radius: 100px;
|
|
||||||
padding: 2px 10px;
|
|
||||||
font-size: 14px;
|
|
||||||
margin-bottom: 0px;
|
|
||||||
}
|
|
||||||
.label_diarization {
|
|
||||||
background-color: var(--chip-bg);
|
|
||||||
border-radius: 8px 8px 8px 8px;
|
|
||||||
padding: 2px 10px;
|
|
||||||
margin-left: 10px;
|
|
||||||
display: inline-block;
|
|
||||||
white-space: nowrap;
|
|
||||||
font-size: 14px;
|
|
||||||
margin-bottom: 0px;
|
|
||||||
color: var(--label-dia-text)
|
|
||||||
}
|
|
||||||
|
|
||||||
.label_transcription {
|
|
||||||
background-color: var(--chip-bg);
|
|
||||||
border-radius: 8px 8px 8px 8px;
|
|
||||||
padding: 2px 10px;
|
|
||||||
display: inline-block;
|
|
||||||
white-space: nowrap;
|
|
||||||
margin-left: 10px;
|
|
||||||
font-size: 14px;
|
|
||||||
margin-bottom: 0px;
|
|
||||||
color: var(--label-trans-text)
|
|
||||||
}
|
|
||||||
|
|
||||||
#timeInfo {
|
|
||||||
color: var(--muted);
|
|
||||||
margin-left: 10px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.textcontent {
|
|
||||||
font-size: 16px;
|
|
||||||
/* margin-left: 10px; */
|
|
||||||
padding-left: 10px;
|
|
||||||
margin-bottom: 10px;
|
|
||||||
margin-top: 1px;
|
|
||||||
padding-top: 5px;
|
|
||||||
border-radius: 0px 0px 0px 10px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.buffer_diarization {
|
|
||||||
color: var(--label-dia-text);
|
|
||||||
margin-left: 4px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.buffer_transcription {
|
|
||||||
color: #7474748c;
|
|
||||||
margin-left: 4px;
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
.spinner {
|
|
||||||
display: inline-block;
|
|
||||||
width: 8px;
|
|
||||||
height: 8px;
|
|
||||||
border: 2px solid var(--spinner-border);
|
|
||||||
border-top: 2px solid var(--spinner-top);
|
|
||||||
border-radius: 50%;
|
|
||||||
animation: spin 0.7s linear infinite;
|
|
||||||
vertical-align: middle;
|
|
||||||
margin-bottom: 2px;
|
|
||||||
margin-right: 5px;
|
|
||||||
}
|
|
||||||
|
|
||||||
@keyframes spin {
|
|
||||||
to {
|
|
||||||
transform: rotate(360deg);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
.silence {
|
|
||||||
color: var(--muted);
|
|
||||||
background-color: var(--silence-bg);
|
|
||||||
font-size: 13px;
|
|
||||||
border-radius: 30px;
|
|
||||||
padding: 2px 10px;
|
|
||||||
}
|
|
||||||
|
|
||||||
.loading {
|
|
||||||
color: var(--muted);
|
|
||||||
background-color: var(--loading-bg);
|
|
||||||
border-radius: 8px 8px 8px 0px;
|
|
||||||
padding: 2px 10px;
|
|
||||||
font-size: 14px;
|
|
||||||
margin-bottom: 0px;
|
|
||||||
}
|
|
||||||
</style>
|
|
||||||
</head>
|
</head>
|
||||||
|
|
||||||
<body>
|
<body>
|
||||||
|
<div class="settings-container">
|
||||||
<div class="settings-container">
|
<button id="recordButton">
|
||||||
<button id="recordButton">
|
<div class="shape-container">
|
||||||
<div class="shape-container">
|
<div class="shape"></div>
|
||||||
<div class="shape"></div>
|
</div>
|
||||||
</div>
|
<div class="recording-info">
|
||||||
<div class="recording-info">
|
<div class="wave-container">
|
||||||
<div class="wave-container">
|
<canvas id="waveCanvas"></canvas>
|
||||||
<canvas id="waveCanvas"></canvas>
|
|
||||||
</div>
|
|
||||||
<div class="timer">00:00</div>
|
|
||||||
</div>
|
|
||||||
</button>
|
|
||||||
<div class="settings">
|
|
||||||
<div>
|
|
||||||
<label for="chunkSelector">Chunk size (ms):</label>
|
|
||||||
<select id="chunkSelector">
|
|
||||||
<option value="500">500 ms</option>
|
|
||||||
<option value="1000" selected>1000 ms</option>
|
|
||||||
<option value="2000">2000 ms</option>
|
|
||||||
<option value="3000">3000 ms</option>
|
|
||||||
<option value="4000">4000 ms</option>
|
|
||||||
<option value="5000">5000 ms</option>
|
|
||||||
</select>
|
|
||||||
</div>
|
|
||||||
<div>
|
|
||||||
<label for="websocketInput">WebSocket URL:</label>
|
|
||||||
<input id="websocketInput" type="text" />
|
|
||||||
</div>
|
|
||||||
<div>
|
|
||||||
<label for="themeSelector">Theme:</label>
|
|
||||||
<select id="themeSelector">
|
|
||||||
<option value="system" selected>System</option>
|
|
||||||
<option value="light">Light</option>
|
|
||||||
<option value="dark">Dark</option>
|
|
||||||
</select>
|
|
||||||
</div>
|
|
||||||
</div>
|
</div>
|
||||||
|
<div class="timer">00:00</div>
|
||||||
|
</div>
|
||||||
|
</button>
|
||||||
|
|
||||||
|
<div class="settings">
|
||||||
|
<div class="field">
|
||||||
|
<label for="websocketInput">WebSocket URL</label>
|
||||||
|
<input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
|
||||||
|
</div>
|
||||||
|
|
||||||
|
</div>
|
||||||
</div>
|
</div>
|
||||||
|
</div>
|
||||||
|
|
||||||
<p id="status"></p>
|
<div class="theme-selector-container">
|
||||||
|
<div class="segmented" role="radiogroup" aria-label="Theme selector">
|
||||||
|
<input type="radio" id="theme-system" name="theme" value="system" />
|
||||||
|
<label for="theme-system" title="System">
|
||||||
|
<img src="/web/src/system_mode.svg" alt="" />
|
||||||
|
<span>System</span>
|
||||||
|
</label>
|
||||||
|
|
||||||
<!-- Speaker-labeled transcript -->
|
<input type="radio" id="theme-light" name="theme" value="light" />
|
||||||
<div id="linesTranscript"></div>
|
<label for="theme-light" title="Light">
|
||||||
|
<img src="/web/src/light_mode.svg" alt="" />
|
||||||
|
<span>Light</span>
|
||||||
|
</label>
|
||||||
|
|
||||||
<script>
|
<input type="radio" id="theme-dark" name="theme" value="dark" />
|
||||||
let isRecording = false;
|
<label for="theme-dark" title="Dark">
|
||||||
let websocket = null;
|
<img src="/web/src/dark_mode.svg" alt="" />
|
||||||
let recorder = null;
|
<span>Dark</span>
|
||||||
let chunkDuration = 1000;
|
</label>
|
||||||
let websocketUrl = "ws://localhost:8000/asr";
|
</div>
|
||||||
let userClosing = false;
|
</div>
|
||||||
let wakeLock = null;
|
|
||||||
let startTime = null;
|
|
||||||
let timerInterval = null;
|
|
||||||
let audioContext = null;
|
|
||||||
let analyser = null;
|
|
||||||
let microphone = null;
|
|
||||||
let waveCanvas = document.getElementById("waveCanvas");
|
|
||||||
let waveCtx = waveCanvas.getContext("2d");
|
|
||||||
let animationFrame = null;
|
|
||||||
let waitingForStop = false;
|
|
||||||
let lastReceivedData = null;
|
|
||||||
let lastSignature = null;
|
|
||||||
waveCanvas.width = 60 * (window.devicePixelRatio || 1);
|
|
||||||
waveCanvas.height = 30 * (window.devicePixelRatio || 1);
|
|
||||||
waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
|
|
||||||
|
|
||||||
const statusText = document.getElementById("status");
|
<p id="status"></p>
|
||||||
const recordButton = document.getElementById("recordButton");
|
|
||||||
const chunkSelector = document.getElementById("chunkSelector");
|
|
||||||
const websocketInput = document.getElementById("websocketInput");
|
|
||||||
const linesTranscriptDiv = document.getElementById("linesTranscript");
|
|
||||||
const timerElement = document.querySelector(".timer");
|
|
||||||
const themeSelector = document.getElementById("themeSelector");
|
|
||||||
|
|
||||||
function getWaveStroke() {
|
<div id="linesTranscript"></div>
|
||||||
const styles = getComputedStyle(document.documentElement);
|
|
||||||
const v = styles.getPropertyValue("--wave-stroke").trim();
|
|
||||||
return v || "#000";
|
|
||||||
}
|
|
||||||
|
|
||||||
let waveStroke = getWaveStroke();
|
<script src="/web/live_transcription.js"></script>
|
||||||
|
|
||||||
function updateWaveStroke() {
|
|
||||||
waveStroke = getWaveStroke();
|
|
||||||
}
|
|
||||||
|
|
||||||
function applyTheme(pref) {
|
|
||||||
if (pref === "light") {
|
|
||||||
document.documentElement.setAttribute("data-theme", "light");
|
|
||||||
} else if (pref === "dark") {
|
|
||||||
document.documentElement.setAttribute("data-theme", "dark");
|
|
||||||
} else {
|
|
||||||
document.documentElement.removeAttribute("data-theme");
|
|
||||||
}
|
|
||||||
updateWaveStroke();
|
|
||||||
}
|
|
||||||
|
|
||||||
const savedThemePref = localStorage.getItem("themePreference") || "system";
|
|
||||||
applyTheme(savedThemePref);
|
|
||||||
if (themeSelector) {
|
|
||||||
themeSelector.value = savedThemePref;
|
|
||||||
themeSelector.addEventListener("change", () => {
|
|
||||||
const val = themeSelector.value;
|
|
||||||
localStorage.setItem("themePreference", val);
|
|
||||||
applyTheme(val);
|
|
||||||
});
|
|
||||||
}
|
|
||||||
|
|
||||||
const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
|
|
||||||
const handleOsThemeChange = () => {
|
|
||||||
const pref = localStorage.getItem("themePreference") || "system";
|
|
||||||
if (pref === "system") updateWaveStroke();
|
|
||||||
};
|
|
||||||
if (darkMq && darkMq.addEventListener) {
|
|
||||||
darkMq.addEventListener("change", handleOsThemeChange);
|
|
||||||
} else if (darkMq && darkMq.addListener) {
|
|
||||||
darkMq.addListener(handleOsThemeChange);
|
|
||||||
}
|
|
||||||
|
|
||||||
function fmt1(x) {
|
|
||||||
const n = Number(x);
|
|
||||||
return Number.isFinite(n) ? n.toFixed(1) : x;
|
|
||||||
}
|
|
||||||
|
|
||||||
const host = window.location.hostname || "localhost";
|
|
||||||
const port = window.location.port;
|
|
||||||
const protocol = window.location.protocol === "https:" ? "wss" : "ws";
|
|
||||||
const defaultWebSocketUrl = `${protocol}://${host}:${port}/asr`;
|
|
||||||
websocketInput.value = defaultWebSocketUrl;
|
|
||||||
websocketUrl = defaultWebSocketUrl;
|
|
||||||
|
|
||||||
chunkSelector.addEventListener("change", () => {
|
|
||||||
chunkDuration = parseInt(chunkSelector.value);
|
|
||||||
});
|
|
||||||
|
|
||||||
websocketInput.addEventListener("change", () => {
|
|
||||||
const urlValue = websocketInput.value.trim();
|
|
||||||
if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
|
|
||||||
statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
websocketUrl = urlValue;
|
|
||||||
statusText.textContent = "WebSocket URL updated. Ready to connect.";
|
|
||||||
});
|
|
||||||
|
|
||||||
function setupWebSocket() {
|
|
||||||
return new Promise((resolve, reject) => {
|
|
||||||
try {
|
|
||||||
websocket = new WebSocket(websocketUrl);
|
|
||||||
} catch (error) {
|
|
||||||
statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
|
|
||||||
reject(error);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
websocket.onopen = () => {
|
|
||||||
statusText.textContent = "Connected to server.";
|
|
||||||
resolve();
|
|
||||||
};
|
|
||||||
|
|
||||||
websocket.onclose = () => {
|
|
||||||
if (userClosing) {
|
|
||||||
if (waitingForStop) {
|
|
||||||
statusText.textContent = "Processing finalized or connection closed.";
|
|
||||||
if (lastReceivedData) {
|
|
||||||
renderLinesWithBuffer(
|
|
||||||
lastReceivedData.lines || [],
|
|
||||||
lastReceivedData.buffer_diarization || "",
|
|
||||||
lastReceivedData.buffer_transcription || "",
|
|
||||||
0, 0, true // isFinalizing = true
|
|
||||||
);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
// If ready_to_stop was received, statusText is already "Finished processing..."
|
|
||||||
// and waitingForStop is false.
|
|
||||||
} else {
|
|
||||||
statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
|
|
||||||
if (isRecording) {
|
|
||||||
stopRecording();
|
|
||||||
}
|
|
||||||
}
|
|
||||||
isRecording = false;
|
|
||||||
waitingForStop = false;
|
|
||||||
userClosing = false;
|
|
||||||
lastReceivedData = null;
|
|
||||||
websocket = null;
|
|
||||||
updateUI();
|
|
||||||
};
|
|
||||||
|
|
||||||
websocket.onerror = () => {
|
|
||||||
statusText.textContent = "Error connecting to WebSocket.";
|
|
||||||
reject(new Error("Error connecting to WebSocket"));
|
|
||||||
};
|
|
||||||
|
|
||||||
// Handle messages from server
|
|
||||||
websocket.onmessage = (event) => {
|
|
||||||
const data = JSON.parse(event.data);
|
|
||||||
|
|
||||||
// Check for status messages
|
|
||||||
if (data.type === "ready_to_stop") {
|
|
||||||
console.log("Ready to stop received, finalizing display and closing WebSocket.");
|
|
||||||
waitingForStop = false;
|
|
||||||
|
|
||||||
if (lastReceivedData) {
|
|
||||||
renderLinesWithBuffer(
|
|
||||||
lastReceivedData.lines || [],
|
|
||||||
lastReceivedData.buffer_diarization || "",
|
|
||||||
lastReceivedData.buffer_transcription || "",
|
|
||||||
0, // No more lag
|
|
||||||
0, // No more lag
|
|
||||||
true // isFinalizing = true
|
|
||||||
);
|
|
||||||
}
|
|
||||||
statusText.textContent = "Finished processing audio! Ready to record again.";
|
|
||||||
recordButton.disabled = false;
|
|
||||||
|
|
||||||
if (websocket) {
|
|
||||||
websocket.close(); // will trigger onclose
|
|
||||||
            // websocket = null; // onclose handles setting websocket to null
|
|
||||||
}
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
lastReceivedData = data;
|
|
||||||
|
|
||||||
// Handle normal transcription updates
|
|
||||||
const {
|
|
||||||
lines = [],
|
|
||||||
buffer_transcription = "",
|
|
||||||
buffer_diarization = "",
|
|
||||||
remaining_time_transcription = 0,
|
|
||||||
remaining_time_diarization = 0,
|
|
||||||
status = "active_transcription"
|
|
||||||
} = data;
|
|
||||||
|
|
||||||
renderLinesWithBuffer(
|
|
||||||
lines,
|
|
||||||
buffer_diarization,
|
|
||||||
buffer_transcription,
|
|
||||||
remaining_time_diarization,
|
|
||||||
remaining_time_transcription,
|
|
||||||
false,
|
|
||||||
status
|
|
||||||
);
|
|
||||||
};
|
|
||||||
});
|
|
||||||
}
|
|
||||||
|
|
||||||
function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription, isFinalizing = false, current_status = "active_transcription") {
|
|
||||||
if (current_status === "no_audio_detected") {
|
|
||||||
linesTranscriptDiv.innerHTML = "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
|
|
||||||
// try to keep stable DOM despite having updates every 0.1s. only update numeric lag values if structure hasn't changed
|
|
||||||
const showLoading = (!isFinalizing) && (lines || []).some(it => it.speaker == 0);
|
|
||||||
const showTransLag = !isFinalizing && remaining_time_transcription > 0;
|
|
||||||
const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
|
|
||||||
const signature = JSON.stringify({
|
|
||||||
lines: (lines || []).map(it => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
|
|
||||||
buffer_transcription: buffer_transcription || "",
|
|
||||||
buffer_diarization: buffer_diarization || "",
|
|
||||||
status: current_status,
|
|
||||||
showLoading,
|
|
||||||
showTransLag,
|
|
||||||
showDiaLag,
|
|
||||||
isFinalizing: !!isFinalizing
|
|
||||||
});
|
|
||||||
if (lastSignature === signature) {
|
|
||||||
const t = document.querySelector(".lag-transcription-value");
|
|
||||||
if (t) t.textContent = fmt1(remaining_time_transcription);
|
|
||||||
const d = document.querySelector(".lag-diarization-value");
|
|
||||||
if (d) d.textContent = fmt1(remaining_time_diarization);
|
|
||||||
const ld = document.querySelector(".loading-diarization-value");
|
|
||||||
if (ld) ld.textContent = fmt1(remaining_time_diarization);
|
|
||||||
return;
|
|
||||||
}
|
|
||||||
lastSignature = signature;
|
|
||||||
|
|
||||||
const linesHtml = lines.map((item, idx) => {
|
|
||||||
let timeInfo = "";
|
|
||||||
if (item.beg !== undefined && item.end !== undefined) {
|
|
||||||
timeInfo = ` ${item.beg} - ${item.end}`;
|
|
||||||
}
|
|
||||||
|
|
||||||
let speakerLabel = "";
|
|
||||||
if (item.speaker === -2) {
|
|
||||||
speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
|
|
||||||
} else if (item.speaker == 0 && !isFinalizing) {
|
|
||||||
speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(remaining_time_diarization)}</span> second(s) of audio are undergoing diarization</span></span>`;
|
|
||||||
} else if (item.speaker == -1) {
|
|
||||||
speakerLabel = `<span id="speaker">Speaker 1<span id='timeInfo'>${timeInfo}</span></span>`;
|
|
||||||
} else if (item.speaker !== -1 && item.speaker !== 0) {
|
|
||||||
speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
|
|
||||||
}
|
|
||||||
|
|
||||||
|
|
||||||
let currentLineText = item.text || "";
|
|
||||||
|
|
||||||
if (idx === lines.length - 1) {
|
|
||||||
if (!isFinalizing && item.speaker !== -2) {
|
|
||||||
if (remaining_time_transcription > 0) {
|
|
||||||
speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(remaining_time_transcription)}</span>s</span></span>`;
|
|
||||||
}
|
|
||||||
if (buffer_diarization && remaining_time_diarization > 0) {
|
|
||||||
speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(remaining_time_diarization)}</span>s</span></span>`;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
if (buffer_diarization) {
|
|
||||||
if (isFinalizing) {
|
|
||||||
currentLineText += (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
|
|
||||||
} else {
|
|
||||||
currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if (buffer_transcription) {
|
|
||||||
if (isFinalizing) {
|
|
||||||
currentLineText += (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") + buffer_transcription.trim();
|
|
||||||
} else {
|
|
||||||
currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
return currentLineText.trim().length > 0 || speakerLabel.length > 0
|
|
||||||
? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
|
|
||||||
: `<p>${speakerLabel}<br/></p>`;
|
|
||||||
}).join("");
|
|
||||||
|
|
||||||
linesTranscriptDiv.innerHTML = linesHtml;
|
|
||||||
window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' });
|
|
||||||
}
|
|
||||||
|
|
||||||
function updateTimer() {
|
|
||||||
if (!startTime) return;
|
|
||||||
|
|
||||||
const elapsed = Math.floor((Date.now() - startTime) / 1000);
|
|
||||||
const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
|
|
||||||
const seconds = (elapsed % 60).toString().padStart(2, "0");
|
|
||||||
timerElement.textContent = `${minutes}:${seconds}`;
|
|
||||||
}
|
|
||||||
|
|
||||||
function drawWaveform() {
|
|
||||||
if (!analyser) return;
|
|
||||||
|
|
||||||
const bufferLength = analyser.frequencyBinCount;
|
|
||||||
const dataArray = new Uint8Array(bufferLength);
|
|
||||||
analyser.getByteTimeDomainData(dataArray);
|
|
||||||
|
|
||||||
waveCtx.clearRect(0, 0, waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1));
|
|
||||||
waveCtx.lineWidth = 1;
|
|
||||||
waveCtx.strokeStyle = waveStroke;
|
|
||||||
waveCtx.beginPath();
|
|
||||||
|
|
||||||
const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
|
|
||||||
let x = 0;
|
|
||||||
|
|
||||||
for (let i = 0; i < bufferLength; i++) {
|
|
||||||
const v = dataArray[i] / 128.0;
|
|
||||||
const y = v * (waveCanvas.height / (window.devicePixelRatio || 1)) / 2;
|
|
||||||
|
|
||||||
if (i === 0) {
|
|
||||||
waveCtx.moveTo(x, y);
|
|
||||||
} else {
|
|
||||||
waveCtx.lineTo(x, y);
|
|
||||||
}
|
|
||||||
|
|
||||||
x += sliceWidth;
|
|
||||||
}
|
|
||||||
|
|
||||||
waveCtx.lineTo(waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1) / 2);
|
|
||||||
waveCtx.stroke();
|
|
||||||
|
|
||||||
animationFrame = requestAnimationFrame(drawWaveform);
|
|
||||||
}
|
|
||||||
|
|
||||||
async function startRecording() {
|
|
||||||
try {
|
|
||||||
|
|
||||||
// https://developer.mozilla.org/en-US/docs/Web/API/Screen_Wake_Lock_API
|
|
||||||
// create an async function to request a wake lock
|
|
||||||
try {
|
|
||||||
wakeLock = await navigator.wakeLock.request("screen");
|
|
||||||
} catch (err) {
|
|
||||||
// The Wake Lock request has failed - usually system related, such as battery.
|
|
||||||
console.log("Error acquiring wake lock.")
|
|
||||||
}
|
|
||||||
|
|
||||||
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
|
|
||||||
|
|
||||||
audioContext = new (window.AudioContext || window.webkitAudioContext)();
|
|
||||||
analyser = audioContext.createAnalyser();
|
|
||||||
analyser.fftSize = 256;
|
|
||||||
microphone = audioContext.createMediaStreamSource(stream);
|
|
||||||
microphone.connect(analyser);
|
|
||||||
|
|
||||||
recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
|
|
||||||
recorder.ondataavailable = (e) => {
|
|
||||||
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
|
||||||
websocket.send(e.data);
|
|
||||||
}
|
|
||||||
};
|
|
||||||
recorder.start(chunkDuration);
|
|
||||||
|
|
||||||
startTime = Date.now();
|
|
||||||
timerInterval = setInterval(updateTimer, 1000);
|
|
||||||
drawWaveform();
|
|
||||||
|
|
||||||
isRecording = true;
|
|
||||||
updateUI();
|
|
||||||
} catch (err) {
|
|
||||||
statusText.textContent = "Error accessing microphone. Please allow microphone access.";
|
|
||||||
console.error(err);
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
async function stopRecording() {
|
|
||||||
wakeLock.release().then(() => {
|
|
||||||
wakeLock = null;
|
|
||||||
});
|
|
||||||
|
|
||||||
userClosing = true;
|
|
||||||
waitingForStop = true;
|
|
||||||
|
|
||||||
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
|
||||||
// Send empty audio buffer as stop signal
|
|
||||||
const emptyBlob = new Blob([], { type: 'audio/webm' });
|
|
||||||
websocket.send(emptyBlob);
|
|
||||||
statusText.textContent = "Recording stopped. Processing final audio...";
|
|
||||||
}
|
|
||||||
|
|
||||||
if (recorder) {
|
|
||||||
recorder.stop();
|
|
||||||
recorder = null;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (microphone) {
|
|
||||||
microphone.disconnect();
|
|
||||||
microphone = null;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (analyser) {
|
|
||||||
analyser = null;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (audioContext && audioContext.state !== 'closed') {
|
|
||||||
try {
|
|
||||||
audioContext.close();
|
|
||||||
} catch (e) {
|
|
||||||
console.warn("Could not close audio context:", e);
|
|
||||||
}
|
|
||||||
audioContext = null;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (animationFrame) {
|
|
||||||
cancelAnimationFrame(animationFrame);
|
|
||||||
animationFrame = null;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (timerInterval) {
|
|
||||||
clearInterval(timerInterval);
|
|
||||||
timerInterval = null;
|
|
||||||
}
|
|
||||||
timerElement.textContent = "00:00";
|
|
||||||
startTime = null;
|
|
||||||
|
|
||||||
|
|
||||||
isRecording = false;
|
|
||||||
updateUI();
|
|
||||||
}
|
|
||||||
|
|
||||||
async function toggleRecording() {
|
|
||||||
if (!isRecording) {
|
|
||||||
if (waitingForStop) {
|
|
||||||
console.log("Waiting for stop, early return");
|
|
||||||
return; // Early return, UI is already updated
|
|
||||||
}
|
|
||||||
console.log("Connecting to WebSocket");
|
|
||||||
try {
|
|
||||||
// If we have an active WebSocket that's still processing, just restart audio capture
|
|
||||||
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
|
||||||
await startRecording();
|
|
||||||
} else {
|
|
||||||
// If no active WebSocket or it's closed, create new one
|
|
||||||
await setupWebSocket();
|
|
||||||
await startRecording();
|
|
||||||
}
|
|
||||||
} catch (err) {
|
|
||||||
statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
|
|
||||||
console.error(err);
|
|
||||||
}
|
|
||||||
} else {
|
|
||||||
console.log("Stopping recording");
|
|
||||||
stopRecording();
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
function updateUI() {
|
|
||||||
recordButton.classList.toggle("recording", isRecording);
|
|
||||||
recordButton.disabled = waitingForStop;
|
|
||||||
|
|
||||||
if (waitingForStop) {
|
|
||||||
if (statusText.textContent !== "Recording stopped. Processing final audio...") {
|
|
||||||
statusText.textContent = "Please wait for processing to complete...";
|
|
||||||
}
|
|
||||||
} else if (isRecording) {
|
|
||||||
statusText.textContent = "Recording...";
|
|
||||||
} else {
|
|
||||||
if (statusText.textContent !== "Finished processing audio! Ready to record again." &&
|
|
||||||
statusText.textContent !== "Processing finalized or connection closed.") {
|
|
||||||
statusText.textContent = "Click to start transcription";
|
|
||||||
}
|
|
||||||
}
|
|
||||||
if (!waitingForStop) {
|
|
||||||
recordButton.disabled = false;
|
|
||||||
}
|
|
||||||
}
|
|
||||||
|
|
||||||
recordButton.addEventListener("click", toggleRecording);
|
|
||||||
</script>
|
|
||||||
</body>
|
</body>
|
||||||
|
|
||||||
</html>
|
</html>
|
||||||
|
|||||||
513
whisperlivekit/web/live_transcription.js
Normal file
@@ -0,0 +1,513 @@
|
|||||||
|
/* Theme, WebSocket, recording, rendering logic extracted from inline script and adapted for segmented theme control and WS caption */
|
||||||
|
|
||||||
|
let isRecording = false;
|
||||||
|
let websocket = null;
|
||||||
|
let recorder = null;
|
||||||
|
let chunkDuration = 100;
|
||||||
|
let websocketUrl = "ws://localhost:8000/asr";
|
||||||
|
let userClosing = false;
|
||||||
|
let wakeLock = null;
|
||||||
|
let startTime = null;
|
||||||
|
let timerInterval = null;
|
||||||
|
let audioContext = null;
|
||||||
|
let analyser = null;
|
||||||
|
let microphone = null;
|
||||||
|
let waveCanvas = document.getElementById("waveCanvas");
|
||||||
|
let waveCtx = waveCanvas.getContext("2d");
|
||||||
|
let animationFrame = null;
|
||||||
|
let waitingForStop = false;
|
||||||
|
let lastReceivedData = null;
|
||||||
|
let lastSignature = null;
|
||||||
|
|
||||||
|
waveCanvas.width = 60 * (window.devicePixelRatio || 1);
|
||||||
|
waveCanvas.height = 30 * (window.devicePixelRatio || 1);
|
||||||
|
waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
|
||||||
|
|
||||||
|
const statusText = document.getElementById("status");
|
||||||
|
const recordButton = document.getElementById("recordButton");
|
||||||
|
const chunkSelector = document.getElementById("chunkSelector");
|
||||||
|
const websocketInput = document.getElementById("websocketInput");
|
||||||
|
const websocketDefaultSpan = document.getElementById("wsDefaultUrl");
|
||||||
|
const linesTranscriptDiv = document.getElementById("linesTranscript");
|
||||||
|
const timerElement = document.querySelector(".timer");
|
||||||
|
const themeRadios = document.querySelectorAll('input[name="theme"]');
|
||||||
|
|
||||||
|
function getWaveStroke() {
|
||||||
|
const styles = getComputedStyle(document.documentElement);
|
||||||
|
const v = styles.getPropertyValue("--wave-stroke").trim();
|
||||||
|
return v || "#000";
|
||||||
|
}
|
||||||
|
|
||||||
|
let waveStroke = getWaveStroke();
|
||||||
|
function updateWaveStroke() {
|
||||||
|
waveStroke = getWaveStroke();
|
||||||
|
}
|
||||||
|
|
||||||
|
function applyTheme(pref) {
|
||||||
|
if (pref === "light") {
|
||||||
|
document.documentElement.setAttribute("data-theme", "light");
|
||||||
|
} else if (pref === "dark") {
|
||||||
|
document.documentElement.setAttribute("data-theme", "dark");
|
||||||
|
} else {
|
||||||
|
document.documentElement.removeAttribute("data-theme");
|
||||||
|
}
|
||||||
|
updateWaveStroke();
|
||||||
|
}
|
||||||
|
|
||||||
|
// Persisted theme preference
|
||||||
|
const savedThemePref = localStorage.getItem("themePreference") || "system";
|
||||||
|
applyTheme(savedThemePref);
|
||||||
|
if (themeRadios.length) {
|
||||||
|
themeRadios.forEach((r) => {
|
||||||
|
r.checked = r.value === savedThemePref;
|
||||||
|
r.addEventListener("change", () => {
|
||||||
|
if (r.checked) {
|
||||||
|
localStorage.setItem("themePreference", r.value);
|
||||||
|
applyTheme(r.value);
|
||||||
|
}
|
||||||
|
});
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// React to OS theme changes when in "system" mode
|
||||||
|
const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
|
||||||
|
const handleOsThemeChange = () => {
|
||||||
|
const pref = localStorage.getItem("themePreference") || "system";
|
||||||
|
if (pref === "system") updateWaveStroke();
|
||||||
|
};
|
||||||
|
if (darkMq && darkMq.addEventListener) {
|
||||||
|
darkMq.addEventListener("change", handleOsThemeChange);
|
||||||
|
} else if (darkMq && darkMq.addListener) {
|
||||||
|
// deprecated, but included for Safari compatibility
|
||||||
|
darkMq.addListener(handleOsThemeChange);
|
||||||
|
}
|
||||||
|
|
||||||
|
// Helpers
|
||||||
|
function fmt1(x) {
|
||||||
|
const n = Number(x);
|
||||||
|
return Number.isFinite(n) ? n.toFixed(1) : x;
|
||||||
|
}
|
||||||
|
|
||||||
|
// Default WebSocket URL computation
|
||||||
|
const host = window.location.hostname || "localhost";
|
||||||
|
const port = window.location.port;
|
||||||
|
const protocol = window.location.protocol === "https:" ? "wss" : "ws";
|
||||||
|
const defaultWebSocketUrl = `${protocol}://${host}${port ? ":" + port : ""}/asr`;
|
||||||
|
|
||||||
|
// Populate default caption and input
|
||||||
|
if (websocketDefaultSpan) websocketDefaultSpan.textContent = defaultWebSocketUrl;
|
||||||
|
websocketInput.value = defaultWebSocketUrl;
|
||||||
|
websocketUrl = defaultWebSocketUrl;
|
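The default-URL computation above can be pulled into a pure function so the port handling can be checked outside a browser. This is a minimal sketch, not part of the commit: `buildDefaultWsUrl` and its plain-object argument are hypothetical stand-ins for `window.location`.

```javascript
// Hypothetical helper (not in the commit): mirrors the default-URL logic
// above, but takes a location-like object so it can run outside a browser.
function buildDefaultWsUrl(loc) {
  const host = loc.hostname || "localhost";
  // A page served over HTTPS selects the secure WebSocket scheme.
  const protocol = loc.protocol === "https:" ? "wss" : "ws";
  // location.port is an empty string on the scheme's default port,
  // in which case the ":port" part is left out entirely.
  const portPart = loc.port ? ":" + loc.port : "";
  return `${protocol}://${host}${portPart}/asr`;
}

console.log(buildDefaultWsUrl({ protocol: "http:", hostname: "localhost", port: "8000" }));
console.log(buildDefaultWsUrl({ protocol: "https:", hostname: "example.org", port: "" }));
```

The removed inline script always emitted `:${port}`, so with an empty port it would have produced a URL of the form `ws://host:/asr`; the conditional port part avoids that.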
||||||
|
|
||||||
|
// Optional chunk selector (guard for presence)
|
||||||
|
if (chunkSelector) {
|
||||||
|
chunkSelector.addEventListener("change", () => {
|
||||||
|
    chunkDuration = parseInt(chunkSelector.value, 10);
|
||||||
|
});
|
||||||
|
}
|
||||||
|
|
||||||
|
// WebSocket input change handling
|
||||||
|
websocketInput.addEventListener("change", () => {
|
||||||
|
const urlValue = websocketInput.value.trim();
|
||||||
|
if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
|
||||||
|
statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
websocketUrl = urlValue;
|
||||||
|
statusText.textContent = "WebSocket URL updated. Ready to connect.";
|
||||||
|
});
|
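The handler above accepts any string that starts with `ws://` or `wss://`. A stricter check could hand the value to the standard `URL` parser instead. The sketch below shows that alternative under stated assumptions: `isWebSocketUrl` is a hypothetical name, and the commit itself uses the `startsWith` test.

```javascript
// Hypothetical alternative to the startsWith check: let the URL parser
// decide whether the input is well-formed, then compare the scheme.
function isWebSocketUrl(value) {
  try {
    const url = new URL(String(value).trim());
    return url.protocol === "ws:" || url.protocol === "wss:";
  } catch (err) {
    // new URL() throws a TypeError on input it cannot parse.
    return false;
  }
}

console.log(isWebSocketUrl("ws://localhost:8000/asr"));
console.log(isWebSocketUrl("http://localhost:8000/asr"));
```

This version also rejects inputs such as a bare `ws://` with no host, which the prefix test would let through to the `WebSocket` constructor.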
||||||
|
|
||||||
|
function setupWebSocket() {
|
||||||
|
return new Promise((resolve, reject) => {
|
||||||
|
try {
|
||||||
|
websocket = new WebSocket(websocketUrl);
|
||||||
|
} catch (error) {
|
||||||
|
statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
|
||||||
|
reject(error);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
websocket.onopen = () => {
|
||||||
|
statusText.textContent = "Connected to server.";
|
||||||
|
resolve();
|
||||||
|
};
|
||||||
|
|
||||||
|
websocket.onclose = () => {
|
||||||
|
if (userClosing) {
|
||||||
|
if (waitingForStop) {
|
||||||
|
statusText.textContent = "Processing finalized or connection closed.";
|
||||||
|
if (lastReceivedData) {
|
||||||
|
renderLinesWithBuffer(
|
||||||
|
lastReceivedData.lines || [],
|
||||||
|
lastReceivedData.buffer_diarization || "",
|
||||||
|
lastReceivedData.buffer_transcription || "",
|
||||||
|
0,
|
||||||
|
0,
|
||||||
|
true
|
||||||
|
);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
|
||||||
|
if (isRecording) {
|
||||||
|
stopRecording();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
isRecording = false;
|
||||||
|
waitingForStop = false;
|
||||||
|
userClosing = false;
|
||||||
|
lastReceivedData = null;
|
||||||
|
websocket = null;
|
||||||
|
updateUI();
|
||||||
|
};
|
||||||
|
|
||||||
|
websocket.onerror = () => {
|
||||||
|
statusText.textContent = "Error connecting to WebSocket.";
|
||||||
|
reject(new Error("Error connecting to WebSocket"));
|
||||||
|
};
|
||||||
|
|
||||||
|
websocket.onmessage = (event) => {
|
||||||
|
const data = JSON.parse(event.data);
|
||||||
|
|
||||||
|
if (data.type === "ready_to_stop") {
|
||||||
|
console.log("Ready to stop received, finalizing display and closing WebSocket.");
|
||||||
|
waitingForStop = false;
|
||||||
|
|
||||||
|
if (lastReceivedData) {
|
||||||
|
renderLinesWithBuffer(
|
||||||
|
lastReceivedData.lines || [],
|
||||||
|
lastReceivedData.buffer_diarization || "",
|
||||||
|
lastReceivedData.buffer_transcription || "",
|
||||||
|
0,
|
||||||
|
0,
|
||||||
|
true
|
||||||
|
);
|
||||||
|
}
|
||||||
|
statusText.textContent = "Finished processing audio! Ready to record again.";
|
||||||
|
recordButton.disabled = false;
|
||||||
|
|
||||||
|
if (websocket) {
|
||||||
|
websocket.close();
|
||||||
|
}
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
lastReceivedData = data;
|
||||||
|
|
||||||
|
const {
|
||||||
|
lines = [],
|
||||||
|
buffer_transcription = "",
|
||||||
|
buffer_diarization = "",
|
||||||
|
remaining_time_transcription = 0,
|
||||||
|
remaining_time_diarization = 0,
|
||||||
|
status = "active_transcription",
|
||||||
|
} = data;
|
||||||
|
|
||||||
|
renderLinesWithBuffer(
|
||||||
|
lines,
|
||||||
|
buffer_diarization,
|
||||||
|
buffer_transcription,
|
||||||
|
remaining_time_diarization,
|
||||||
|
remaining_time_transcription,
|
||||||
|
false,
|
||||||
|
status
|
||||||
|
);
|
||||||
|
};
|
||||||
|
});
|
||||||
|
}
|
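The `onmessage` handler above distinguishes two kinds of server frames: a `ready_to_stop` signal and a regular transcription update whose missing fields fall back to defaults. A minimal sketch of that classification follows. The field names are taken from the diff; `parseServerMessage` and the `{ kind, ... }` return shape are assumptions made for illustration only.

```javascript
// Hypothetical helper: classifies one raw server frame the way the
// onmessage handler does, without touching the DOM or the socket.
function parseServerMessage(raw) {
  const data = JSON.parse(raw);

  // Control frame: the server has finished processing the final audio.
  if (data.type === "ready_to_stop") {
    return { kind: "ready_to_stop" };
  }

  // Regular update: destructuring defaults fill in absent fields only
  // (a field explicitly set to null is passed through as null).
  const {
    lines = [],
    buffer_transcription = "",
    buffer_diarization = "",
    remaining_time_transcription = 0,
    remaining_time_diarization = 0,
    status = "active_transcription",
  } = data;

  return {
    kind: "update",
    lines,
    buffer_transcription,
    buffer_diarization,
    remaining_time_transcription,
    remaining_time_diarization,
    status,
  };
}

console.log(parseServerMessage('{"type":"ready_to_stop"}'));
console.log(parseServerMessage('{"lines":[{"speaker":1,"text":"hello"}]}'));
```

Because destructuring defaults do not apply to `null`, the `(lines || [])` guards in `renderLinesWithBuffer` still matter when a field arrives as `null`.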
||||||
|
|
||||||
|
function renderLinesWithBuffer(
|
||||||
|
lines,
|
||||||
|
buffer_diarization,
|
||||||
|
buffer_transcription,
|
||||||
|
remaining_time_diarization,
|
||||||
|
remaining_time_transcription,
|
||||||
|
isFinalizing = false,
|
||||||
|
current_status = "active_transcription"
|
||||||
|
) {
|
||||||
|
if (current_status === "no_audio_detected") {
|
||||||
|
linesTranscriptDiv.innerHTML =
|
||||||
|
"<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
|
||||||
|
const showLoading = !isFinalizing && (lines || []).some((it) => it.speaker == 0);
|
||||||
|
const showTransLag = !isFinalizing && remaining_time_transcription > 0;
|
||||||
|
const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
|
||||||
|
const signature = JSON.stringify({
|
||||||
|
lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
|
||||||
|
buffer_transcription: buffer_transcription || "",
|
||||||
|
buffer_diarization: buffer_diarization || "",
|
||||||
|
status: current_status,
|
||||||
|
showLoading,
|
||||||
|
showTransLag,
|
||||||
|
showDiaLag,
|
||||||
|
isFinalizing: !!isFinalizing,
|
||||||
|
});
|
||||||
|
if (lastSignature === signature) {
|
||||||
|
const t = document.querySelector(".lag-transcription-value");
|
||||||
|
if (t) t.textContent = fmt1(remaining_time_transcription);
|
||||||
|
const d = document.querySelector(".lag-diarization-value");
|
||||||
|
if (d) d.textContent = fmt1(remaining_time_diarization);
|
||||||
|
const ld = document.querySelector(".loading-diarization-value");
|
||||||
|
if (ld) ld.textContent = fmt1(remaining_time_diarization);
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
lastSignature = signature;
|
||||||
|
|
||||||
|
const linesHtml = (lines || [])
|
||||||
|
.map((item, idx) => {
|
||||||
|
let timeInfo = "";
|
||||||
|
if (item.beg !== undefined && item.end !== undefined) {
|
||||||
|
timeInfo = ` ${item.beg} - ${item.end}`;
|
||||||
|
}
|
||||||
|
|
||||||
|
let speakerLabel = "";
|
||||||
|
if (item.speaker === -2) {
|
||||||
|
speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
|
||||||
|
} else if (item.speaker == 0 && !isFinalizing) {
|
||||||
|
speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
|
||||||
|
remaining_time_diarization
|
||||||
|
)}</span> second(s) of audio are undergoing diarization</span></span>`;
|
||||||
|
} else if (item.speaker !== 0) {
|
||||||
|
speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
|
||||||
|
}
|
||||||
|
|
||||||
|
let currentLineText = item.text || "";
|
||||||
|
|
||||||
|
if (idx === lines.length - 1) {
|
||||||
|
if (!isFinalizing && item.speaker !== -2) {
|
||||||
|
if (remaining_time_transcription > 0) {
|
||||||
|
speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(
|
||||||
|
remaining_time_transcription
|
||||||
|
)}</span>s</span></span>`;
|
||||||
|
}
|
||||||
|
if (buffer_diarization && remaining_time_diarization > 0) {
|
||||||
|
speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(
|
||||||
|
remaining_time_diarization
|
||||||
|
)}</span>s</span></span>`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
if (buffer_diarization) {
|
||||||
|
if (isFinalizing) {
|
||||||
|
currentLineText +=
|
||||||
|
(currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
|
||||||
|
} else {
|
||||||
|
currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (buffer_transcription) {
|
||||||
|
if (isFinalizing) {
|
||||||
|
currentLineText +=
|
||||||
|
(currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") +
|
||||||
|
buffer_transcription.trim();
|
||||||
|
} else {
|
||||||
|
currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return currentLineText.trim().length > 0 || speakerLabel.length > 0
|
||||||
|
? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
|
||||||
|
: `<p>${speakerLabel}<br/></p>`;
|
||||||
|
})
|
||||||
|
.join("");
|
||||||
|
|
||||||
|
linesTranscriptDiv.innerHTML = linesHtml;
|
||||||
|
window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });
|
||||||
|
}
|
||||||
|
|
||||||
|
function updateTimer() {
|
||||||
|
if (!startTime) return;
|
||||||
|
|
||||||
|
const elapsed = Math.floor((Date.now() - startTime) / 1000);
|
||||||
|
const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
|
||||||
|
const seconds = (elapsed % 60).toString().padStart(2, "0");
|
||||||
|
timerElement.textContent = `${minutes}:${seconds}`;
|
||||||
|
}
|
||||||
|
|
||||||
|
function drawWaveform() {
|
||||||
|
if (!analyser) return;
|
||||||
|
|
||||||
|
const bufferLength = analyser.frequencyBinCount;
|
||||||
|
const dataArray = new Uint8Array(bufferLength);
|
||||||
|
analyser.getByteTimeDomainData(dataArray);
|
||||||
|
|
||||||
|
waveCtx.clearRect(
|
||||||
|
0,
|
||||||
|
0,
|
||||||
|
waveCanvas.width / (window.devicePixelRatio || 1),
|
||||||
|
waveCanvas.height / (window.devicePixelRatio || 1)
|
||||||
|
);
|
||||||
|
waveCtx.lineWidth = 1;
|
||||||
|
waveCtx.strokeStyle = waveStroke;
|
||||||
|
waveCtx.beginPath();
|
||||||
|
|
||||||
|
const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
|
||||||
|
let x = 0;
|
||||||
|
|
||||||
|
for (let i = 0; i < bufferLength; i++) {
|
||||||
|
const v = dataArray[i] / 128.0;
|
||||||
|
const y = (v * (waveCanvas.height / (window.devicePixelRatio || 1))) / 2;
|
||||||
|
|
||||||
|
if (i === 0) {
|
||||||
|
waveCtx.moveTo(x, y);
|
||||||
|
} else {
|
||||||
|
waveCtx.lineTo(x, y);
|
||||||
|
}
|
||||||
|
|
||||||
|
x += sliceWidth;
|
||||||
|
}
|
||||||
|
|
||||||
|
waveCtx.lineTo(
|
||||||
|
waveCanvas.width / (window.devicePixelRatio || 1),
|
||||||
|
(waveCanvas.height / (window.devicePixelRatio || 1)) / 2
|
||||||
|
);
|
||||||
|
waveCtx.stroke();
|
||||||
|
|
||||||
|
animationFrame = requestAnimationFrame(drawWaveform);
|
||||||
|
}
|
||||||
|
|
||||||
|
async function startRecording() {
|
||||||
|
try {
|
||||||
|
try {
|
||||||
|
wakeLock = await navigator.wakeLock.request("screen");
|
||||||
|
} catch (err) {
|
||||||
|
console.log("Error acquiring wake lock.");
|
||||||
|
}
|
||||||
|
|
||||||
|
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
|
||||||
|
|
||||||
|
audioContext = new (window.AudioContext || window.webkitAudioContext)();
|
||||||
|
analyser = audioContext.createAnalyser();
|
||||||
|
analyser.fftSize = 256;
|
||||||
|
microphone = audioContext.createMediaStreamSource(stream);
|
||||||
|
microphone.connect(analyser);
|
||||||
|
|
||||||
|
recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
|
||||||
|
recorder.ondataavailable = (e) => {
|
||||||
|
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
||||||
|
websocket.send(e.data);
|
||||||
|
}
|
||||||
|
};
|
||||||
|
recorder.start(chunkDuration);
|
||||||
|
|
||||||
|
startTime = Date.now();
|
||||||
|
timerInterval = setInterval(updateTimer, 1000);
|
||||||
|
drawWaveform();
|
||||||
|
|
||||||
|
isRecording = true;
|
||||||
|
updateUI();
|
||||||
|
} catch (err) {
|
||||||
|
statusText.textContent = "Error accessing microphone. Please allow microphone access.";
|
||||||
|
console.error(err);
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
async function stopRecording() {
|
||||||
|
if (wakeLock) {
|
||||||
|
try {
|
||||||
|
await wakeLock.release();
|
||||||
|
} catch (e) {
|
||||||
|
// ignore
|
||||||
|
}
|
||||||
|
wakeLock = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
userClosing = true;
|
||||||
|
waitingForStop = true;
|
||||||
|
|
||||||
|
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
||||||
|
const emptyBlob = new Blob([], { type: "audio/webm" });
|
||||||
|
websocket.send(emptyBlob);
|
||||||
|
statusText.textContent = "Recording stopped. Processing final audio...";
|
||||||
|
}
|
||||||
|
|
||||||
|
if (recorder) {
|
||||||
|
recorder.stop();
|
||||||
|
recorder = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (microphone) {
|
||||||
|
microphone.disconnect();
|
||||||
|
microphone = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (analyser) {
|
||||||
|
analyser = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (audioContext && audioContext.state !== "closed") {
|
||||||
|
try {
|
||||||
|
await audioContext.close();
|
||||||
|
} catch (e) {
|
||||||
|
console.warn("Could not close audio context:", e);
|
||||||
|
}
|
||||||
|
audioContext = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (animationFrame) {
|
||||||
|
cancelAnimationFrame(animationFrame);
|
||||||
|
animationFrame = null;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (timerInterval) {
|
||||||
|
clearInterval(timerInterval);
|
||||||
|
timerInterval = null;
|
||||||
|
}
|
||||||
|
timerElement.textContent = "00:00";
|
||||||
|
startTime = null;
|
||||||
|
|
||||||
|
isRecording = false;
|
||||||
|
updateUI();
|
||||||
|
}
|
||||||
|
|
||||||
|
async function toggleRecording() {
|
||||||
|
if (!isRecording) {
|
||||||
|
if (waitingForStop) {
|
||||||
|
console.log("Waiting for stop, early return");
|
||||||
|
return;
|
||||||
|
}
|
||||||
|
console.log("Connecting to WebSocket");
|
||||||
|
try {
|
||||||
|
if (websocket && websocket.readyState === WebSocket.OPEN) {
|
||||||
|
await startRecording();
|
||||||
|
} else {
|
||||||
|
await setupWebSocket();
|
||||||
|
await startRecording();
|
||||||
|
}
|
||||||
|
} catch (err) {
|
||||||
|
statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
|
||||||
|
console.error(err);
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
console.log("Stopping recording");
|
||||||
|
stopRecording();
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
function updateUI() {
|
||||||
|
recordButton.classList.toggle("recording", isRecording);
|
||||||
|
recordButton.disabled = waitingForStop;
|
||||||
|
|
||||||
|
if (waitingForStop) {
|
||||||
|
if (statusText.textContent !== "Recording stopped. Processing final audio...") {
|
||||||
|
statusText.textContent = "Please wait for processing to complete...";
|
||||||
|
}
|
||||||
|
} else if (isRecording) {
|
||||||
|
statusText.textContent = "Recording...";
|
||||||
|
} else {
|
||||||
|
if (
|
||||||
|
statusText.textContent !== "Finished processing audio! Ready to record again." &&
|
||||||
|
statusText.textContent !== "Processing finalized or connection closed."
|
||||||
|
) {
|
||||||
|
statusText.textContent = "Click to start transcription";
|
||||||
|
}
|
||||||
|
}
|
||||||
|
if (!waitingForStop) {
|
||||||
|
recordButton.disabled = false;
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
recordButton.addEventListener("click", toggleRecording);
|
||||||
1
whisperlivekit/web/src/dark_mode.svg
Normal file
@@ -0,0 +1 @@
|
|||||||
|
<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
|
||||||
|
1
whisperlivekit/web/src/light_mode.svg
Normal file
@@ -0,0 +1 @@
|
|||||||
|
<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>
|
||||||
|
1
whisperlivekit/web/src/system_mode.svg
Normal file
@@ -0,0 +1 @@
|
|||||||
|
<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M396-396q-32-32-58.5-67T289-537q-5 14-6.5 28.5T281-480q0 83 58 141t141 58q14 0 28.5-2t28.5-6q-39-22-74-48.5T396-396Zm85 196q-56 0-107-21t-91-61q-40-40-61-91t-21-107q0-51 17-97.5t50-84.5q13-14 32-9.5t27 24.5q21 55 52.5 104t73.5 91q42 42 91 73.5T648-326q20 8 24.5 27t-9.5 32q-38 33-84.5 50T481-200Zm223-192q-16-5-23-20.5t-4-32.5q9-48-6-94.5T621-621q-35-35-80.5-49.5T448-677q-17 3-32-4t-21-23q-6-16 1.5-31t23.5-19q69-15 138 4.5T679-678q51 51 71 120t5 138q-4 17-19 25t-32 3ZM480-840q-17 0-28.5-11.5T440-880v-40q0-17 11.5-28.5T480-960q17 0 28.5 11.5T520-920v40q0 17-11.5 28.5T480-840Zm0 840q-17 0-28.5-11.5T440-40v-40q0-17 11.5-28.5T480-120q17 0 28.5 11.5T520-80v40q0 17-11.5 28.5T480 0Zm255-734q-12-12-12-28.5t12-28.5l28-28q11-11 27.5-11t28.5 11q12 12 12 28.5T819-762l-28 28q-12 12-28 12t-28-12ZM141-141q-12-12-12-28.5t12-28.5l28-28q12-12 28-12t28 12q12 12 12 28.5T225-169l-28 28q-11 11-27.5 11T141-141Zm739-299q-17 0-28.5-11.5T840-480q0-17 11.5-28.5T880-520h40q17 0 28.5 11.5T960-480q0 17-11.5 28.5T920-440h-40Zm-840 0q-17 0-28.5-11.5T0-480q0-17 11.5-28.5T40-520h40q17 0 28.5 11.5T120-480q0 17-11.5 28.5T80-440H40Zm779 299q-12 12-28.5 12T762-141l-28-28q-12-12-12-28t12-28q12-12 28.5-12t28.5 12l28 28q11 11 11 27.5T819-141ZM226-735q-12 12-28.5 12T169-735l-28-28q-11-11-11-27.5t11-28.5q12-12 28.5-12t28.5 12l28 28q12 12 12 28t-12 28Zm170 339Z"/></svg>
|
||||||
|
@@ -10,4 +10,24 @@ def get_web_interface_html():
|
|||||||
return f.read()
|
return f.read()
|
||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Error loading web interface HTML: {e}")
|
logger.error(f"Error loading web interface HTML: {e}")
|
||||||
return "<html><body><h1>Error loading interface</h1></body></html>"
|
return "<html><body><h1>Error loading interface</h1></body></html>"
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
|
||||||
|
from fastapi import FastAPI
|
||||||
|
from fastapi.responses import HTMLResponse
|
||||||
|
import uvicorn
|
||||||
|
from starlette.staticfiles import StaticFiles
|
||||||
|
import pathlib
|
||||||
|
import whisperlivekit.web as webpkg
|
||||||
|
|
||||||
|
app = FastAPI()
|
||||||
|
web_dir = pathlib.Path(webpkg.__file__).parent
|
||||||
|
app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")
|
||||||
|
|
||||||
|
@app.get("/")
|
||||||
|
async def get():
|
||||||
|
return HTMLResponse(get_web_interface_html())
|
||||||
|
|
||||||
|
uvicorn.run(app=app)
|
||||||
@@ -122,6 +122,7 @@ class OnlineASRProcessor:
|
|||||||
self.tokenize = tokenize_method
|
self.tokenize = tokenize_method
|
||||||
self.logfile = logfile
|
self.logfile = logfile
|
||||||
self.confidence_validation = confidence_validation
|
self.confidence_validation = confidence_validation
|
||||||
|
self.global_time_offset = 0.0
|
||||||
self.init()
|
self.init()
|
||||||
|
|
||||||
self.buffer_trimming_way, self.buffer_trimming_sec = buffer_trimming
|
self.buffer_trimming_way, self.buffer_trimming_sec = buffer_trimming
|
||||||
@@ -152,6 +153,21 @@ class OnlineASRProcessor:
|
|||||||
"""Append an audio chunk (a numpy array) to the current audio buffer."""
|
"""Append an audio chunk (a numpy array) to the current audio buffer."""
|
||||||
self.audio_buffer = np.append(self.audio_buffer, audio)
|
self.audio_buffer = np.append(self.audio_buffer, audio)
|
||||||
|
|
||||||
|
def insert_silence(self, silence_duration, offset):
|
||||||
|
"""
|
||||||
|
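Insert `silence_duration` seconds of zero-valued samples into the audio buffer and add the same duration to `global_time_offset`. The full context reset for long silences (the `else` branch below) is currently disabled by the `if True` guard.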
|
||||||
|
"""
|
||||||
|
# if self.transcript_buffer.buffer:
|
||||||
|
# self.committed.extend(self.transcript_buffer.buffer)
|
||||||
|
# self.transcript_buffer.buffer = []
|
||||||
|
|
||||||
|
if True:  # previously: silence_duration < 3. Always pad so the last audio is processed without a gap; this could be handled in ends_with_silence in the future.
|
||||||
|
gap_silence = np.zeros(int(16000 * silence_duration), dtype=np.int16)
|
||||||
|
self.insert_audio_chunk(gap_silence)
|
||||||
|
else:
|
||||||
|
self.init(offset=silence_duration + offset)
|
||||||
|
self.global_time_offset += silence_duration
|
||||||
|
|
||||||
def prompt(self) -> Tuple[str, str]:
|
def prompt(self) -> Tuple[str, str]:
|
||||||
"""
|
"""
|
||||||
Returns a tuple: (prompt, context), where:
|
Returns a tuple: (prompt, context), where:
|
||||||
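A minimal sketch of the bookkeeping that `insert_silence` performs in the hunk above: pad the buffer with zeros for the length of the pause and accumulate the pause in a running offset. `SilenceTracker` is a hypothetical stand-in, not the project's `OnlineASRProcessor`, and plain Python lists stand in for the NumPy arrays so the example has no dependencies.

```python
SAMPLING_RATE = 16000  # assumed 16 kHz, matching the literal used in the hunk


class SilenceTracker:
    """Hypothetical stand-in for the buffer handling shown in the diff."""

    def __init__(self):
        self.audio_buffer = []
        self.global_time_offset = 0.0

    def insert_audio_chunk(self, audio):
        # Append new samples to the end of the buffer.
        self.audio_buffer.extend(audio)

    def insert_silence(self, silence_duration):
        # Pad with zeros so the audio before the pause is still processed.
        gap = [0.0] * int(SAMPLING_RATE * silence_duration)
        self.insert_audio_chunk(gap)
        # Accumulate the total silence inserted so far, in seconds.
        self.global_time_offset += silence_duration


tracker = SilenceTracker()
tracker.insert_audio_chunk([0.1] * 8000)  # 0.5 s of audio
tracker.insert_silence(1.5)               # 1.5 s pause
buffered_seconds = len(tracker.audio_buffer) / SAMPLING_RATE
```

Here the buffer ends up holding both the speech and the padded pause, while `global_time_offset` records only the silence.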
@@ -230,6 +246,9 @@ class OnlineASRProcessor:
|
|||||||
logger.debug(
|
logger.debug(
|
||||||
f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
|
f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
|
||||||
)
|
)
|
||||||
|
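if self.global_time_offset:
|
||||||
|
# with_offset is assumed to return a shifted copy, so rebuild the list instead of rebinding the loop variable
|
||||||
|
committed_tokens = [token.with_offset(self.global_time_offset) for token in committed_tokens]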
|
||||||
return committed_tokens, current_audio_processed_upto
|
return committed_tokens, current_audio_processed_upto
|
||||||
|
|
||||||
def chunk_completed_sentence(self):
|
def chunk_completed_sentence(self):
|
||||||
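The hunk above shifts committed tokens by `global_time_offset`. A minimal sketch of why the result of `with_offset` has to be stored back into the list, assuming `with_offset` returns a shifted copy rather than mutating the token (this diff does not show its definition). `Token` is a hypothetical, simplified stand-in for `ASRToken`.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical, simplified stand-in for ASRToken."""
    start: float
    end: float
    text: str

    def with_offset(self, offset: float) -> "Token":
        # Assumed behaviour: return a shifted copy and leave self untouched.
        return replace(self, start=self.start + offset, end=self.end + offset)


def shift_by_rebinding(tokens: List[Token], offset: float) -> List[Token]:
    for token in tokens:
        token = token.with_offset(offset)  # rebinds the local name only
    return tokens  # the list still holds the original tokens


def shift_by_rebuilding(tokens: List[Token], offset: float) -> List[Token]:
    # Collect the shifted copies into a new list.
    return [token.with_offset(offset) for token in tokens]


tokens = [Token(0.0, 0.5, "hello"), Token(0.5, 1.0, "world")]
unchanged = shift_by_rebinding(tokens, 2.0)
shifted = shift_by_rebuilding(tokens, 2.0)
```

Under that assumption, only the rebuilt list carries the offset; the rebinding loop returns the timestamps it was given.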
@@ -391,128 +410,3 @@ class OnlineASRProcessor:
|
|||||||
start = None
|
start = None
|
||||||
end = None
|
end = None
|
||||||
return Transcript(start, end, text, probability=probability)
|
return Transcript(start, end, text, probability=probability)
|
||||||
|
|
||||||
|
|
||||||
class VACOnlineASRProcessor:
|
|
||||||
"""
|
|
||||||
Wraps an OnlineASRProcessor with a Voice Activity Controller (VAC).
|
|
||||||
|
|
||||||
It receives small chunks of audio, applies VAD (e.g. with Silero),
|
|
||||||
and when the system detects a pause in speech (or end of an utterance)
|
|
||||||
it finalizes the utterance immediately.
|
|
||||||
"""
|
|
||||||
SAMPLING_RATE = 16000
|
|
||||||
|
|
||||||
def __init__(self, online_chunk_size: float, *args, **kwargs):
|
|
||||||
self.online_chunk_size = online_chunk_size
|
|
||||||
self.online = OnlineASRProcessor(*args, **kwargs)
|
|
||||||
self.asr = self.online.asr
|
|
||||||
|
|
||||||
# Load a VAD model (e.g. Silero VAD)
|
|
||||||
import torch
|
|
||||||
model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
|
|
||||||
from .silero_vad_iterator import FixedVADIterator
|
|
||||||
|
|
||||||
self.vac = FixedVADIterator(model)
|
|
||||||
self.logfile = self.online.logfile
|
|
||||||
self.last_input_audio_stream_end_time: float = 0.0
|
|
||||||
self.init()
|
|
||||||
|
|
||||||
def init(self):
|
|
||||||
self.online.init()
|
|
||||||
self.vac.reset_states()
|
|
||||||
self.current_online_chunk_buffer_size = 0
|
|
||||||
self.last_input_audio_stream_end_time = self.online.buffer_time_offset
|
|
||||||
self.is_currently_final = False
|
|
||||||
self.status: Optional[str] = None # "voice" or "nonvoice"
|
|
||||||
self.audio_buffer = np.array([], dtype=np.float32)
|
|
||||||
self.buffer_offset = 0 # in frames
|
|
||||||
|
|
||||||
def get_audio_buffer_end_time(self) -> float:
|
|
||||||
"""Returns the absolute end time of the audio processed by the underlying OnlineASRProcessor."""
|
|
||||||
return self.online.get_audio_buffer_end_time()
|
|
||||||
|
|
||||||
def clear_buffer(self):
|
|
||||||
self.buffer_offset += len(self.audio_buffer)
|
|
||||||
self.audio_buffer = np.array([], dtype=np.float32)
|
|
||||||
|
|
||||||
def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: float):
|
|
||||||
"""
|
|
||||||
Process an incoming small audio chunk:
|
|
||||||
- run VAD on the chunk,
|
|
||||||
- decide whether to send the audio to the online ASR processor immediately,
|
|
||||||
- and/or to mark the current utterance as finished.
|
|
||||||
"""
|
|
||||||
self.last_input_audio_stream_end_time = audio_stream_end_time
|
|
||||||
res = self.vac(audio)
|
|
||||||
self.audio_buffer = np.append(self.audio_buffer, audio)
|
|
||||||
|
|
||||||
if res is not None:
|
|
||||||
# VAD returned a result; adjust the frame number
|
|
||||||
frame = list(res.values())[0] - self.buffer_offset
|
|
||||||
if "start" in res and "end" not in res:
|
|
||||||
self.status = "voice"
|
|
||||||
send_audio = self.audio_buffer[frame:]
|
|
||||||
self.online.init(offset=(frame + self.buffer_offset) / self.SAMPLING_RATE)
|
|
||||||
self.online.insert_audio_chunk(send_audio)
|
|
||||||
self.current_online_chunk_buffer_size += len(send_audio)
|
|
||||||
self.clear_buffer()
|
|
||||||
elif "end" in res and "start" not in res:
|
|
||||||
self.status = "nonvoice"
|
|
||||||
send_audio = self.audio_buffer[:frame]
|
|
||||||
self.online.insert_audio_chunk(send_audio)
|
|
||||||
self.current_online_chunk_buffer_size += len(send_audio)
|
|
||||||
self.is_currently_final = True
|
|
||||||
self.clear_buffer()
|
|
||||||
else:
|
|
||||||
beg = res["start"] - self.buffer_offset
|
|
||||||
end = res["end"] - self.buffer_offset
|
|
||||||
self.status = "nonvoice"
|
|
||||||
send_audio = self.audio_buffer[beg:end]
|
|
||||||
self.online.init(offset=(beg + self.buffer_offset) / self.SAMPLING_RATE)
|
|
||||||
self.online.insert_audio_chunk(send_audio)
|
|
||||||
self.current_online_chunk_buffer_size += len(send_audio)
|
|
||||||
self.is_currently_final = True
|
|
||||||
self.clear_buffer()
|
|
||||||
else:
|
|
||||||
if self.status == "voice":
|
|
||||||
self.online.insert_audio_chunk(self.audio_buffer)
|
|
||||||
self.current_online_chunk_buffer_size += len(self.audio_buffer)
|
|
||||||
self.clear_buffer()
|
|
||||||
else:
|
|
||||||
# Keep 1 second worth of audio in case VAD later detects voice,
|
|
||||||
# but trim to avoid unbounded memory usage.
|
|
||||||
self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE)
|
|
||||||
self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:]
|
|
||||||
|
|
||||||
def process_iter(self) -> Tuple[List[ASRToken], float]:
|
|
||||||
"""
|
|
||||||
Depending on the VAD status and the amount of accumulated audio,
|
|
||||||
process the current audio chunk.
|
|
||||||
Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
|
|
||||||
"""
|
|
||||||
if self.is_currently_final:
|
|
||||||
return self.finish()
|
|
||||||
elif self.current_online_chunk_buffer_size > self.SAMPLING_RATE * self.online_chunk_size:
|
|
||||||
self.current_online_chunk_buffer_size = 0
|
|
||||||
return self.online.process_iter()
|
|
||||||
else:
|
|
||||||
logger.debug("No online update, only VAD")
|
|
||||||
return [], self.last_input_audio_stream_end_time
|
|
||||||
|
|
||||||
def finish(self) -> Tuple[List[ASRToken], float]:
|
|
||||||
"""
|
|
||||||
Finish processing by flushing any remaining text.
|
|
||||||
Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
|
|
||||||
"""
|
|
||||||
result_tokens, processed_upto = self.online.finish()
|
|
||||||
self.current_online_chunk_buffer_size = 0
|
|
||||||
self.is_currently_final = False
|
|
||||||
return result_tokens, processed_upto
|
|
||||||
|
|
||||||
def get_buffer(self):
|
|
||||||
"""
|
|
||||||
Get the unvalidated buffer in string format.
|
|
||||||
"""
|
|
||||||
return self.online.concatenate_tokens(self.online.transcript_buffer.buffer)
|
|
||||||
|
|
||||||
|
|||||||
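The removed `VACOnlineASRProcessor` keeps at most one second of audio while no voice is detected and advances `buffer_offset` by the number of frames it drops, so that absolute frame indices reported by the VAD can still be mapped into the trimmed buffer. A minimal sketch of that idea follows; `RollingBuffer` and `to_local` are hypothetical names, and plain lists stand in for the NumPy arrays.

```python
SAMPLING_RATE = 16000  # samples per second, as in the removed class


class RollingBuffer:
    """Hypothetical helper mirroring the non-voice branch of the removed class."""

    def __init__(self):
        self.samples = []
        self.buffer_offset = 0  # absolute index of samples[0], in frames

    def append_and_trim(self, chunk):
        self.samples.extend(chunk)
        # Count the leading frames that fall outside the one-second window.
        dropped = max(0, len(self.samples) - SAMPLING_RATE)
        self.buffer_offset += dropped
        # Keep only the most recent second of audio.
        self.samples = self.samples[-SAMPLING_RATE:]

    def to_local(self, absolute_frame):
        # Convert an absolute frame index into an index into self.samples.
        return absolute_frame - self.buffer_offset


buf = RollingBuffer()
for _ in range(3):
    buf.append_and_trim([0.0] * 8000)  # three 0.5 s chunks
```

After three half-second chunks the buffer holds one second of audio and `buffer_offset` equals the 8000 frames that were discarded.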