mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git (synced 2026-03-07 22:33:36 +00:00)
Compare commits
94 Commits
Commit SHA1s:
349c7dcb9e, 1c42b867cf, d4771e563e, b0a5fc0693, 3b96fb8776, 7f93c4b978, 15c3df1cba, 7fb8e66c01, 728e1f1290, 87b9ed6ecd, 38b4ebe8ba, d098af3185, 4e56130a40, 2bbdc70187, b678a55f63, 5491964e81, b05297a96d, 197293e25e, ba41c4ab56, bda72b8bc0, bb6b9f4cb1, e40b5a3ea0, 4cfed6e98e, 687e3dd5e2, e4140cd299, 8e056cbdf2, 9dcfb38967, 47b9235d70, f3cd53a4db, dbdb4ea66c, 00424d7ca3, 4b738d6f63, 8a5e2adb1e, f85329e112, 46efbdf1d9, 8885ade003, 2564928d83, 56114d3071, 5b9977c9af, 12a544164f, 2ca1156b7e, 3ad3683ca7, 1599bd87a0, 90623400a4, 64e44fb24f, 156b9a133f, df8cb23848, 9ff513093b, 17184e552c, aad2c55d8c, 2f177c4a3b, b362eccb23, 5daaf77258, 36cc4412c3, e1d4bf7e94, 62bf28949e, 25526b3aa2, 1e3fab9550, f25de6d8a4, 8a175e79d8, dc37b44486, 2d1df92aa7, 2c1a603e38, 774cee036b, d22916988e, 5b8ad94dde, f668570292, 7c0768e8f3, b42d8b2692, 0cd885247c, 8e30e8010a, bfec335a5f, 6867041254, e165916952, 8532a91c7a, b01b81bad0, 0f79d442ee, c9f60504e3, 993a83546a, eabd1b199a, f7644268c1, 34e8fe260e, debfefaf3e, 101ca9ef90, 94bb05d53e, 6797b88176, 46770efd6c, b23ef3ec3e, fa29a24abe, fea3c3553c, d6d65a663b, 083d5b2f44, 8e4674b093, bc7c32100f
12
.gitignore
vendored
@@ -54,7 +54,6 @@ coverage.xml
 # Translations
 *.mo
 *.pot
-
 # Django stuff:
 *.log
 local_settings.py
@@ -129,4 +128,13 @@ dmypy.json
 .pyre/
 
 *.wav
 run_*.sh
+
+# Downloaded models
+*.pt
+
+# Debug & testing
+test_*.py
+launch.json
+.DS_Store
+test/*
Dockerfile
@@ -21,15 +21,17 @@ RUN apt-get update && \
     python3 \
     python3-pip \
     ffmpeg \
-    git && \
+    git \
+    build-essential \
+    python3-dev && \
     rm -rf /var/lib/apt/lists/*
 
-RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
+RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
 
 COPY . .
 
 # Install WhisperLiveKit directly, allowing for optional dependencies
-# Note: For gates modedls, need to add your HF toke. See README.md
+# Note: For gated models, you need to add your HF token. See README.md
 # for more details.
 RUN if [ -n "$EXTRAS" ]; then \
     echo "Installing with extras: [$EXTRAS]"; \
57
LICENSE
@@ -1,21 +1,52 @@
+# License
+
+## Main Software License
+
 MIT License
 
-Copyright (c) 2023 ÚFAL
+Copyright (c) 2025 Quentin Fuxa.
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
 in the Software without restriction, including without limitation the rights
 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 copies of the Software, and to permit persons to whom the Software is
 furnished to do so, subject to the following conditions:
 
 The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
 
 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
+
+## SimulStreaming Backend License
+
+**When using the SimulStreaming backend (SimulWhisper), additional licensing terms apply:**
+
+SimulStreaming (https://github.com/ufal/SimulStreaming) is dual-licensed:
+
+### 🔹 Non-Commercial Use
+You may use SimulStreaming under the **PolyForm Noncommercial License 1.0.0** if you obtain the code through the GitHub repository. This license is **free of charge** and comes with **no obligations** for non-commercial users.
+
+### 🔸 Commercial Use
+Understanding who uses SimulStreaming commercially helps improve and prioritize development. Therefore, **registration is required** for those who acquire a commercial license.
+
+Commercial licenses are planned to be **affordable** for SMEs and individuals. The SimulStreaming authors are considering providing commercial licenses either for free or for a symbolic one-time fee, and may also provide additional support. You can share your preference via the [questionnaire](https://forms.cloud.microsoft.com/e/7tCxb4gJfB).
+
+You can also leave your contact [there](https://forms.cloud.microsoft.com/e/7tCxb4gJfB) to be notified when commercial licenses become available.
+
+**Contact for SimulStreaming licensing:**
+[Dominik Macháček](https://ufal.mff.cuni.cz/dominik-machacek/), machacek@ufal.mff.cuni.cz
+
+---
+
+## Based on:
+- **whisper_streaming** by ÚFAL – MIT License – https://github.com/ufal/whisper_streaming. License: https://github.com/ufal/whisper_streaming/blob/main/LICENSE
+- **silero-vad** by Snakers4 – MIT License – https://github.com/snakers4/silero-vad. License: https://github.com/snakers4/silero-vad/blob/f6b1294cb27590fb2452899df98fb234dfef1134/LICENSE
+- **Diart** by juanmc2005 – MIT License – https://github.com/juanmc2005/diart. License: https://github.com/juanmc2005/diart/blob/main/LICENSE
+- **SimulStreaming** by ÚFAL – Dual License (PolyForm Noncommercial License 1.0.0 / Commercial License) – https://github.com/ufal/SimulStreaming
253
README.md
@@ -1,47 +1,41 @@
|
|||||||
<h1 align="center">WhisperLiveKit</h1>
|
<h1 align="center">WhisperLiveKit</h1>
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
|
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
|
<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
|
||||||
|
|
||||||
<p align="center">
|
<p align="center">
|
||||||
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
|
<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
|
||||||
<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
|
<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
|
||||||
<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9%20%7C%203.10%20%7C%203.11%20%7C%203.12-dark_green"></a>
|
<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.13-dark_green"></a>
|
||||||
<a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/github/license/QuentinFuxa/WhisperLiveKit?color=blue"></a>
|
<a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-MIT/Dual Licensed-dark_green"></a>
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
## 🚀 Overview
|
|
||||||
|
|
||||||
This project is based on [Whisper Streaming](https://github.com/ufal/whisper_streaming) and lets you transcribe audio directly from your browser. WhisperLiveKit provides a complete backend solution for real-time speech transcription with a functional and simple frontend that you can customize for your own needs. Everything runs locally on your machine ✨
|
WhisperLiveKit brings real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨
|
||||||
|
|
||||||
### 🔄 Architecture
|
Built on [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) and [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) for transcription, plus [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) and [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) for diarization.
|
||||||
|
|
||||||
WhisperLiveKit consists of three main components:
|
|
||||||
|
|
||||||
- **Frontend**: A basic HTML & JavaScript interface that captures microphone audio and streams it to the backend via WebSockets. You can use and adapt the provided template at [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html) for your specific use case.
|
|
||||||
- **Backend (Web Server)**: A FastAPI-based WebSocket server that receives streamed audio data, processes it in real time, and returns transcriptions to the frontend. This is where the WebSocket logic and routing live.
|
|
||||||
- **Core Backend (Library Logic)**: A server-agnostic core that handles audio processing, ASR, and diarization. It exposes reusable components that take in audio bytes and return transcriptions. This makes it easy to plug into any WebSocket or audio stream pipeline.
|
|
||||||
|
|
||||||
|
|
||||||
### ✨ Key Features
|
### Key Features
|
||||||
|
|
||||||
- **🎙️ Real-time Transcription** - Convert speech to text instantly as you speak
|
|
||||||
- **👥 Speaker Diarization** - Identify different speakers in real-time using [Diart](https://github.com/juanmc2005/diart)
|
|
||||||
- **🔒 Fully Local** - All processing happens on your machine - no data sent to external servers
|
|
||||||
- **📱 Multi-User Support** - Handle multiple users simultaneously with a single backend/server
|
|
||||||
|
|
||||||
### ⚙️ Core differences from [Whisper Streaming](https://github.com/ufal/whisper_streaming)
|
|
||||||
|
|
||||||
|
- **Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
|
||||||
|
- **Speaker Diarization** - Identify different speakers in real-time. (⚠️ Streaming Sortformer backend in development)
|
||||||
|
- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
|
||||||
- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
|
- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
|
||||||
- **Multi-User Support** – Handles multiple users simultaneously by decoupling backend and online ASR
|
- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
|
||||||
- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference
|
- **Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
|
||||||
- **MLX Whisper Backend** – Optimized for Apple Silicon for faster local processing
|
- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
|
||||||
- **Buffering Preview** – Displays unvalidated transcription segments
|
- **SimulStreaming Backend** - [Dual-licensed](https://github.com/ufal/SimulStreaming#-licence-and-contributions) - Ultra-low latency transcription using SOTA AlignAtt policy.
|
||||||
|
|
||||||
## 📖 Quick Start
|
### Architecture
|
||||||
|
|
||||||
|
<img alt="Architecture" src="architecture.png" />
|
||||||
|
|
||||||
|
|
||||||
|
## Quick Start
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Install the package
|
# Install the package
|
||||||
@@ -50,38 +44,25 @@ pip install whisperlivekit
|
|||||||
# Start the transcription server
|
# Start the transcription server
|
||||||
whisperlivekit-server --model tiny.en
|
whisperlivekit-server --model tiny.en
|
||||||
|
|
||||||
# Open your browser at http://localhost:8000
|
# Open your browser at http://localhost:8000 to see the interface.
|
||||||
```
|
# Use the --ssl-certfile public.crt --ssl-keyfile private.key parameters to enable SSL
|
||||||
|
|
||||||
### Quick Start with SSL
|
|
||||||
```bash
|
|
||||||
# You must provide a certificate and key
|
|
||||||
whisperlivekit-server -ssl-certfile public.crt --ssl-keyfile private.key
|
|
||||||
|
|
||||||
# Open your browser at https://localhost:8000
|
|
||||||
```
|
```
|
||||||
|
|
||||||
That's it! Start speaking and watch your words appear on screen.
|
That's it! Start speaking and watch your words appear on screen.
|
||||||
|
|
||||||
## 🛠️ Installation Options
|
## Installation
|
||||||
|
|
||||||
### Install from PyPI (Recommended)
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
|
#Install from PyPI (Recommended)
|
||||||
pip install whisperlivekit
|
pip install whisperlivekit
|
||||||
```
|
|
||||||
|
|
||||||
### Install from Source
|
#Install from Source
|
||||||
|
|
||||||
```bash
|
|
||||||
git clone https://github.com/QuentinFuxa/WhisperLiveKit
|
git clone https://github.com/QuentinFuxa/WhisperLiveKit
|
||||||
cd WhisperLiveKit
|
cd WhisperLiveKit
|
||||||
pip install -e .
|
pip install -e .
|
||||||
```
|
```
|
||||||
|
|
||||||
### System Dependencies
|
### FFmpeg Dependency
|
||||||
|
|
||||||
FFmpeg is required:
|
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Ubuntu/Debian
|
# Ubuntu/Debian
|
||||||
@@ -112,6 +93,7 @@ pip install whisperlivekit[whisper] # Original Whisper
|
|||||||
pip install whisperlivekit[whisper-timestamped] # Improved timestamps
|
pip install whisperlivekit[whisper-timestamped] # Improved timestamps
|
||||||
pip install whisperlivekit[mlx-whisper] # Apple Silicon optimization
|
pip install whisperlivekit[mlx-whisper] # Apple Silicon optimization
|
||||||
pip install whisperlivekit[openai] # OpenAI API
|
pip install whisperlivekit[openai] # OpenAI API
|
||||||
|
pip install whisperlivekit[simulstreaming]
|
||||||
```
|
```
|
||||||
|
|
||||||
### 🎹 Pyannote Models Setup
|
### 🎹 Pyannote Models Setup
|
||||||
@@ -122,10 +104,10 @@ For diarization, you need access to pyannote.audio models:
|
|||||||
2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
|
2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
|
||||||
3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
|
3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
|
||||||
4. Login with HuggingFace:
|
4. Login with HuggingFace:
|
||||||
```bash
|
```bash
|
||||||
pip install huggingface_hub
|
pip install huggingface_hub
|
||||||
huggingface-cli login
|
huggingface-cli login
|
||||||
```
|
```
|
||||||
|
|
||||||
## 💻 Usage Examples
|
## 💻 Usage Examples
|
||||||
|
|
||||||
@@ -139,55 +121,62 @@ whisperlivekit-server --model tiny.en
|
|||||||
|
|
||||||
# Advanced configuration with diarization
|
# Advanced configuration with diarization
|
||||||
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
|
whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
|
||||||
|
|
||||||
|
# SimulStreaming backend for ultra-low latency
|
||||||
|
whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshold 20
|
||||||
```
|
```
|
||||||
|
|
||||||
|
|
||||||
### Python API Integration (Backend)
|
### Python API Integration (Backend)
|
||||||
|
Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.
|
||||||
|
|
||||||
```python
|
```python
|
||||||
from whisperlivekit import WhisperLiveKit
|
from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
|
||||||
from whisperlivekit.audio_processor import AudioProcessor
|
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
|
||||||
from fastapi import FastAPI, WebSocket
|
|
||||||
import asyncio
|
|
||||||
from fastapi.responses import HTMLResponse
|
from fastapi.responses import HTMLResponse
|
||||||
|
from contextlib import asynccontextmanager
|
||||||
|
import asyncio
|
||||||
|
|
||||||
# Initialize components
|
transcription_engine = None
|
||||||
app = FastAPI()
|
|
||||||
kit = WhisperLiveKit(model="medium", diarization=True)
|
|
||||||
|
|
||||||
# Serve the web interface
|
@asynccontextmanager
|
||||||
@app.get("/")
|
async def lifespan(app: FastAPI):
|
||||||
async def get():
|
global transcription_engine
|
||||||
return HTMLResponse(kit.web_interface()) # Use the built-in web interface
|
transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
|
||||||
|
# You can also load from command-line arguments using parse_args()
|
||||||
|
# args = parse_args()
|
||||||
|
# transcription_engine = TranscriptionEngine(**vars(args))
|
||||||
|
yield
|
||||||
|
|
||||||
|
app = FastAPI(lifespan=lifespan)
|
||||||
|
|
||||||
# Process WebSocket connections
|
# Process WebSocket connections
|
||||||
async def handle_websocket_results(websocket, results_generator):
|
async def handle_websocket_results(websocket: WebSocket, results_generator):
|
||||||
async for response in results_generator:
|
async for response in results_generator:
|
||||||
await websocket.send_json(response)
|
await websocket.send_json(response)
|
||||||
|
await websocket.send_json({"type": "ready_to_stop"})
|
||||||
|
|
||||||
@app.websocket("/asr")
|
@app.websocket("/asr")
|
||||||
async def websocket_endpoint(websocket: WebSocket):
|
async def websocket_endpoint(websocket: WebSocket):
|
||||||
audio_processor = AudioProcessor()
|
global transcription_engine
|
||||||
await websocket.accept()
|
|
||||||
results_generator = await audio_processor.create_tasks()
|
|
||||||
websocket_task = asyncio.create_task(
|
|
||||||
handle_websocket_results(websocket, results_generator)
|
|
||||||
)
|
|
||||||
|
|
||||||
try:
|
# Create a new AudioProcessor for each connection, passing the shared engine
|
||||||
while True:
|
audio_processor = AudioProcessor(transcription_engine=transcription_engine)
|
||||||
message = await websocket.receive_bytes()
|
results_generator = await audio_processor.create_tasks()
|
||||||
await audio_processor.process_audio(message)
|
results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
|
||||||
except Exception as e:
|
await websocket.accept()
|
||||||
print(f"WebSocket error: {e}")
|
while True:
|
||||||
websocket_task.cancel()
|
message = await websocket.receive_bytes()
|
||||||
|
await audio_processor.process_audio(message)
|
||||||
```
|
```
|
||||||
|
|
||||||
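The snippet above is the server side. To exercise it without the bundled web page, a throwaway client sketch like the one below can help. It is not part of the package: the `/asr` route and the JSON responses come from the example above, while the `sample.webm` file name, chunk size, and pacing are placeholder assumptions.

```python
# Hypothetical test client (a sketch, not shipped with WhisperLiveKit):
# streams a WebM/Opus recording to ws://localhost:8000/asr and prints the JSON updates.
import asyncio
import json
import websockets  # already a WhisperLiveKit dependency

async def stream_file(path: str = "sample.webm"):  # placeholder file name
    async with websockets.connect("ws://localhost:8000/asr") as ws:
        async def print_results():
            async for message in ws:
                print(json.loads(message))
        reader = asyncio.create_task(print_results())
        with open(path, "rb") as f:
            while chunk := f.read(4096):
                await ws.send(chunk)      # binary audio chunk, like the browser sends
                await asyncio.sleep(0.1)  # rough pacing to mimic a live microphone
        await asyncio.sleep(5)            # give the server time to flush final results
        reader.cancel()

asyncio.run(stream_file())
```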
### Frontend Implementation
|
### Frontend Implementation
|
||||||
|
|
||||||
The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can get in in [whisperlivekit/web/live_transcription.html](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or using :
|
The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()`:
|
||||||
|
|
||||||
```python
|
```python
|
||||||
kit.web_interface()
|
from whisperlivekit import get_web_interface_html
|
||||||
|
html_content = get_web_interface_html()
|
||||||
```
|
```
|
||||||
|
|
||||||
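If you build your own FastAPI app, a minimal sketch of serving that bundled page yourself (the `/` route is an assumption, not a requirement) could look like this:

```python
# Minimal sketch: serve the bundled demo page from your own FastAPI app.
from fastapi import FastAPI
from fastapi.responses import HTMLResponse
from whisperlivekit import get_web_interface_html

app = FastAPI()

@app.get("/")
async def serve_demo_page():
    # Returns the same HTML file linked above; replace with your own frontend as needed.
    return HTMLResponse(get_web_interface_html())
```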
## ⚙️ Configuration Reference
|
## ⚙️ Configuration Reference
|
||||||
@@ -198,11 +187,12 @@ WhisperLiveKit offers extensive configuration options:
|
|||||||
|-----------|-------------|---------|
|
|-----------|-------------|---------|
|
||||||
| `--host` | Server host address | `localhost` |
|
| `--host` | Server host address | `localhost` |
|
||||||
| `--port` | Server port | `8000` |
|
| `--port` | Server port | `8000` |
|
||||||
| `--model` | Whisper model size | `tiny` |
|
| `--model` | Whisper model size. Caution: `.en` models do not work with SimulStreaming | `tiny` |
|
||||||
| `--language` | Source language code or `auto` | `en` |
|
| `--language` | Source language code or `auto` | `en` |
|
||||||
| `--task` | `transcribe` or `translate` | `transcribe` |
|
| `--task` | `transcribe` or `translate` | `transcribe` |
|
||||||
| `--backend` | Processing backend | `faster-whisper` |
|
| `--backend` | Processing backend | `faster-whisper` |
|
||||||
| `--diarization` | Enable speaker identification | `False` |
|
| `--diarization` | Enable speaker identification | `False` |
|
||||||
|
| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
|
||||||
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
|
| `--confidence-validation` | Use confidence scores for faster validation | `False` |
|
||||||
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
|
| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
|
||||||
| `--vac` | Use Voice Activity Controller | `False` |
|
| `--vac` | Use Voice Activity Controller | `False` |
|
||||||
@@ -211,20 +201,31 @@ WhisperLiveKit offers extensive configuration options:
|
|||||||
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
|
| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
|
||||||
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
|
| `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
|
||||||
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
|
| `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
|
||||||
|
| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
|
||||||
|
| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
|
||||||
|
|
||||||
|
**SimulStreaming-specific Options:**
|
||||||
|
|
||||||
|
| Parameter | Description | Default |
|
||||||
|
|-----------|-------------|---------|
|
||||||
|
| `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
|
||||||
|
| `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
|
||||||
|
| `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
|
||||||
|
| `--audio-max-len` | Maximum audio buffer length (seconds) | `30.0` |
|
||||||
|
| `--audio-min-len` | Minimum audio length to process (seconds) | `0.0` |
|
||||||
|
| `--cif-ckpt-path` | Path to CIF model for word boundary detection | `None` |
|
||||||
|
| `--never-fire` | Never truncate incomplete words | `False` |
|
||||||
|
| `--init-prompt` | Initial prompt for the model | `None` |
|
||||||
|
| `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
|
||||||
|
| `--max-context-tokens` | Maximum context tokens | `None` |
|
||||||
|
| `--model-path` | Direct path to the .pt model file. Downloaded automatically if not found | `./base.pt` |
|
||||||
|
|
||||||
## 🔧 How It Works
|
## 🔧 How It Works
|
||||||
|
|
||||||
<p align="center">
|
|
||||||
<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit in Action" width="500">
|
|
||||||
</p>
|
|
||||||
|
|
||||||
1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
|
1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
|
||||||
2. **Streaming**: Audio chunks are sent to the server via WebSocket
|
2. **Streaming**: Audio chunks are sent to the server via WebSocket
|
||||||
3. **Processing**: Server decodes audio with FFmpeg and streams into Whisper for transcription
|
3. **Processing**: Server decodes audio with FFmpeg and streams into the model for transcription
|
||||||
4. **Real-time Output**:
|
4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu') and finalized text appears in normal color
|
||||||
- Partial transcriptions appear immediately in light gray (the 'aperçu')
|
|
||||||
- Finalized text appears in normal color
|
|
||||||
- (When enabled) Different speakers are identified and highlighted
|
|
||||||
|
|
||||||
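As a rough illustration of step 3, the sketch below performs the same decode-and-normalize operation in one shot. It is not the code WhisperLiveKit runs (the server streams through a long-lived FFmpeg process instead), and the 16 kHz mono target is an assumption based on Whisper's usual input format:

```python
# Sketch only: decode a browser-recorded WebM/Opus blob to mono s16le PCM with FFmpeg
# and normalize it to float32, the same s16le -> [-1, 1] scaling the server applies.
import subprocess
import numpy as np

def webm_to_float32(webm_bytes: bytes, sample_rate: int = 16000) -> np.ndarray:
    proc = subprocess.run(
        ["ffmpeg", "-i", "pipe:0", "-f", "s16le", "-acodec", "pcm_s16le",
         "-ac", "1", "-ar", str(sample_rate), "pipe:1"],
        input=webm_bytes, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, check=True,
    )
    pcm = np.frombuffer(proc.stdout, dtype=np.int16)
    return pcm.astype(np.float32) / 32768.0
```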
## 🚀 Deployment Guide
|
## 🚀 Deployment Guide
|
||||||
|
|
||||||
@@ -244,83 +245,55 @@ To deploy WhisperLiveKit in production:
|
|||||||
- Ensure WebSocket connection points to your server's address
|
- Ensure WebSocket connection points to your server's address
|
||||||
|
|
||||||
3. **Nginx Configuration** (recommended for production):
|
3. **Nginx Configuration** (recommended for production):
|
||||||
```nginx
|
```nginx
|
||||||
server {
|
server {
|
||||||
listen 80;
|
listen 80;
|
||||||
server_name your-domain.com;
|
server_name your-domain.com;
|
||||||
|
|
||||||
location / {
|
location / {
|
||||||
proxy_pass http://localhost:8000;
|
proxy_pass http://localhost:8000;
|
||||||
proxy_set_header Upgrade $http_upgrade;
|
proxy_set_header Upgrade $http_upgrade;
|
||||||
proxy_set_header Connection "upgrade";
|
proxy_set_header Connection "upgrade";
|
||||||
proxy_set_header Host $host;
|
proxy_set_header Host $host;
|
||||||
}
|
}}
|
||||||
}
|
```
|
||||||
```
|
|
||||||
|
|
||||||
4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL
|
4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL
|
||||||
|
|
||||||
### 🐋 Docker
|
### 🐋 Docker
|
||||||
|
|
||||||
A basic Dockerfile is provided which allows re-use of Python package installation options. See below usage examples:
|
A basic Dockerfile is provided which allows re-use of Python package installation options. ⚠️ For **large** models, ensure that your **docker runtime** has enough **memory** available. See the usage examples below:
|
||||||
|
|
||||||
**NOTE:** For **larger** models, ensure that your **docker runtime** has enough **memory** available.
|
|
||||||
|
|
||||||
#### All defaults
|
#### All defaults
|
||||||
- Create a reusable image with only the basics and then run as a named container:
|
- Create a reusable image with only the basics and then run as a named container:
|
||||||
```bash
|
```bash
|
||||||
docker build -t whisperlivekit-defaults .
|
docker build -t whisperlivekit-defaults .
|
||||||
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
|
docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
|
||||||
docker start -i whisperlivekit
|
docker start -i whisperlivekit
|
||||||
```
|
```
|
||||||
|
|
||||||
> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
|
> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
|
||||||
|
|
||||||
#### Customization
|
#### Customization
|
||||||
- Customize the container options:
|
- Customize the container options:
|
||||||
```bash
|
```bash
|
||||||
docker build -t whisperlivekit-defaults .
|
docker build -t whisperlivekit-defaults .
|
||||||
docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
|
docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
|
||||||
docker start -i whisperlivekit-base
|
docker start -i whisperlivekit-base
|
||||||
```
|
```
|
||||||
|
|
||||||
- `--build-arg` Options:
|
- `--build-arg` Options:
|
||||||
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
|
- `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
|
||||||
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
|
- `HF_PRECACHE_DIR="./.cache/"` - Pre-load a model cache for faster first-time start
|
||||||
- `HF_TOKEN="./token"` - Add your Hugging Face Hub access token to download gated models
|
- `HF_TKN_FILE="./token"` - Add your Hugging Face Hub access token to download gated models
|
||||||
|
|
||||||
## 🔮 Use Cases
|
## 🔮 Use Cases
|
||||||
|
Capture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...
|
||||||
- **Meeting Transcription**: Capture discussions in real-time
|
|
||||||
- **Accessibility Tools**: Help hearing-impaired users follow conversations
|
|
||||||
- **Content Creation**: Transcribe podcasts or videos automatically
|
|
||||||
- **Customer Service**: Transcribe support calls with speaker identification
|
|
||||||
|
|
||||||
## 🤝 Contributing
|
|
||||||
|
|
||||||
Contributions are welcome! Here's how to get started:
|
|
||||||
|
|
||||||
1. Fork the repository
|
|
||||||
2. Create a feature branch: `git checkout -b feature/amazing-feature`
|
|
||||||
3. Commit your changes: `git commit -m 'Add amazing feature'`
|
|
||||||
4. Push to your branch: `git push origin feature/amazing-feature`
|
|
||||||
5. Open a Pull Request
|
|
||||||
|
|
||||||
## 🙏 Acknowledgments
|
## 🙏 Acknowledgments
|
||||||
|
|
||||||
This project builds upon the foundational work of:
|
We extend our gratitude to the original authors of:
|
||||||
- [Whisper Streaming](https://github.com/ufal/whisper_streaming)
|
|
||||||
- [Diart](https://github.com/juanmc2005/diart)
|
|
||||||
- [OpenAI Whisper](https://github.com/openai/whisper)
|
|
||||||
|
|
||||||
We extend our gratitude to the original authors for their contributions.
|
| [Whisper Streaming](https://github.com/ufal/whisper_streaming) | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
|
||||||
|
| -------- | ------- | -------- | ------- |
|
||||||
## 📄 License
|
|
||||||
|
|
||||||
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
|
|
||||||
|
|
||||||
## 🔗 Links
|
|
||||||
|
|
||||||
- [GitHub Repository](https://github.com/QuentinFuxa/WhisperLiveKit)
|
|
||||||
- [PyPI Package](https://pypi.org/project/whisperlivekit/)
|
|
||||||
- [Issue Tracker](https://github.com/QuentinFuxa/WhisperLiveKit/issues)
|
|
||||||
|
|||||||
BIN
architecture.png
Normal file (binary file not shown; 382 KiB)
59
pyproject.toml
Normal file
@@ -0,0 +1,59 @@
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"

[project]
name = "whisperlivekit"
version = "0.2.5"
description = "Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization"
readme = "README.md"
authors = [
    { name = "Quentin Fuxa" }
]
license = { file = "LICENSE" }
requires-python = ">=3.9"
classifiers = [
    "Development Status :: 4 - Beta",
    "Intended Audience :: Developers",
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Multimedia :: Sound/Audio :: Speech"
]
dependencies = [
    "fastapi",
    "librosa",
    "soundfile",
    "faster-whisper",
    "uvicorn",
    "websockets"
]

[project.optional-dependencies]
diarization = ["diart"]
vac = ["torch"]
sentence = ["mosestokenizer", "wtpsplit"]
whisper = ["whisper"]
whisper-timestamped = ["whisper-timestamped"]
mlx-whisper = ["mlx-whisper"]
openai = ["openai"]
simulstreaming = [
    "torch",
    "tqdm",
    "tiktoken",
    'triton>=2.0.0,<3; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
]

[project.urls]
Homepage = "https://github.com/QuentinFuxa/WhisperLiveKit"

[project.scripts]
whisperlivekit-server = "whisperlivekit.basic_server:main"

[tool.setuptools]
packages = ["whisperlivekit", "whisperlivekit.diarization", "whisperlivekit.simul_whisper", "whisperlivekit.simul_whisper.whisper", "whisperlivekit.simul_whisper.whisper.assets", "whisperlivekit.simul_whisper.whisper.normalizers", "whisperlivekit.web", "whisperlivekit.whisper_streaming_custom"]

[tool.setuptools.package-data]
whisperlivekit = ["web/*.html"]
"whisperlivekit.simul_whisper.whisper.assets" = ["*.tiktoken", "*.npz"]
47
setup.py
@@ -1,47 +0,0 @@
from setuptools import setup, find_packages

setup(
    name="whisperlivekit",
    version="0.1.5",
    description="Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization",
    long_description=open("README.md", "r", encoding="utf-8").read(),
    long_description_content_type="text/markdown",
    author="Quentin Fuxa",
    url="https://github.com/QuentinFuxa/WhisperLiveKit",
    packages=find_packages(),
    install_requires=[
        "fastapi",
        "ffmpeg-python",
        "librosa",
        "soundfile",
        "faster-whisper",
        "uvicorn",
        "websockets",
    ],
    extras_require={
        "diarization": ["diart"],
        "vac": ["torch"],
        "sentence": ["mosestokenizer", "wtpsplit"],
        "whisper": ["whisper"],
        "whisper-timestamped": ["whisper-timestamped"],
        "mlx-whisper": ["mlx-whisper"],
        "openai": ["openai"],
    },
    package_data={
        'whisperlivekit': ['web/*.html'],
    },
    entry_points={
        'console_scripts': [
            'whisperlivekit-server=whisperlivekit.basic_server:main',
        ],
    },
    classifiers=[
        "Development Status :: 4 - Beta",
        "Intended Audience :: Developers",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "Topic :: Multimedia :: Sound/Audio :: Speech",
    ],
    python_requires=">=3.9",
)
whisperlivekit/__init__.py
@@ -1,4 +1,12 @@
-from .core import WhisperLiveKit, parse_args
 from .audio_processor import AudioProcessor
+from .core import TranscriptionEngine
+from .parse_args import parse_args
+from .web.web_interface import get_web_interface_html
 
-__all__ = ['WhisperLiveKit', 'AudioProcessor', 'parse_args']
+__all__ = [
+    "TranscriptionEngine",
+    "AudioProcessor",
+    "parse_args",
+    "get_web_interface_html",
+    "download_simulstreaming_backend",
+]
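A short usage sketch of the re-exported API (argument values are illustrative; the calls follow the README example):

```python
# Sketch of the new top-level API: one shared engine, one AudioProcessor per connection.
from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html

engine = TranscriptionEngine(model="tiny", lan="en")        # load models once
processor = AudioProcessor(transcription_engine=engine)     # per-connection state
html = get_web_interface_html()                             # bundled demo frontend
```

Note that `download_simulstreaming_backend` is listed in `__all__` without a corresponding import in this hunk.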
whisperlivekit/audio_processor.py
@@ -1,20 +1,21 @@
 import asyncio
 import numpy as np
-import ffmpeg
 from time import time, sleep
 import math
 import logging
 import traceback
 from datetime import timedelta
 from whisperlivekit.timed_objects import ASRToken
-from whisperlivekit.whisper_streaming_custom.whisper_online import online_factory
-from whisperlivekit.core import WhisperLiveKit
+from whisperlivekit.core import TranscriptionEngine, online_factory
+from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
+from .remove_silences import handle_silences
 # Set up logging once
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
 
+SENTINEL = object()  # unique sentinel object for end of stream marker
+
 def format_time(seconds: float) -> str:
     """Format seconds as HH:MM:SS."""
     return str(timedelta(seconds=int(seconds)))
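The new `SENTINEL` object is how the later hunks shut down the transcription and diarization queue consumers. Reduced to a self-contained sketch (names other than `SENTINEL` are illustrative):

```python
import asyncio

SENTINEL = object()  # identity-compared marker: cannot collide with real audio data

async def consumer(queue: asyncio.Queue) -> None:
    while True:
        item = await queue.get()
        if item is SENTINEL:           # producer put SENTINEL after the last real item
            queue.task_done()
            break
        print("processing", item)      # stand-in for transcription / diarization work
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(consumer(queue))
    for chunk in ("chunk-1", "chunk-2"):
        await queue.put(chunk)
    await queue.put(SENTINEL)          # signal end of stream; consumer exits its loop
    await task

asyncio.run(main())
```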
@@ -25,10 +26,13 @@ class AudioProcessor:
|
|||||||
Handles audio processing, state management, and result formatting.
|
Handles audio processing, state management, and result formatting.
|
||||||
"""
|
"""
|
||||||
|
|
||||||
def __init__(self):
|
def __init__(self, **kwargs):
|
||||||
"""Initialize the audio processor with configuration, models, and state."""
|
"""Initialize the audio processor with configuration, models, and state."""
|
||||||
|
|
||||||
models = WhisperLiveKit()
|
if 'transcription_engine' in kwargs and isinstance(kwargs['transcription_engine'], TranscriptionEngine):
|
||||||
|
models = kwargs['transcription_engine']
|
||||||
|
else:
|
||||||
|
models = TranscriptionEngine(**kwargs)
|
||||||
|
|
||||||
# Audio processing settings
|
# Audio processing settings
|
||||||
self.args = models.args
|
self.args = models.args
|
||||||
@@ -41,12 +45,12 @@ class AudioProcessor:
|
|||||||
self.last_ffmpeg_activity = time()
|
self.last_ffmpeg_activity = time()
|
||||||
self.ffmpeg_health_check_interval = 5
|
self.ffmpeg_health_check_interval = 5
|
||||||
self.ffmpeg_max_idle_time = 10
|
self.ffmpeg_max_idle_time = 10
|
||||||
|
|
||||||
# State management
|
# State management
|
||||||
|
self.is_stopping = False
|
||||||
self.tokens = []
|
self.tokens = []
|
||||||
self.buffer_transcription = ""
|
self.buffer_transcription = ""
|
||||||
self.buffer_diarization = ""
|
self.buffer_diarization = ""
|
||||||
self.full_transcription = ""
|
|
||||||
self.end_buffer = 0
|
self.end_buffer = 0
|
||||||
self.end_attributed_speaker = 0
|
self.end_attributed_speaker = 0
|
||||||
self.lock = asyncio.Lock()
|
self.lock = asyncio.Lock()
|
||||||
@@ -58,10 +62,29 @@ class AudioProcessor:
|
|||||||
self.asr = models.asr
|
self.asr = models.asr
|
||||||
self.tokenizer = models.tokenizer
|
self.tokenizer = models.tokenizer
|
||||||
self.diarization = models.diarization
|
self.diarization = models.diarization
|
||||||
self.ffmpeg_process = self.start_ffmpeg_decoder()
|
|
||||||
|
self.ffmpeg_manager = FFmpegManager(
|
||||||
|
sample_rate=self.sample_rate,
|
||||||
|
channels=self.channels
|
||||||
|
)
|
||||||
|
|
||||||
|
async def handle_ffmpeg_error(error_type: str):
|
||||||
|
logger.error(f"FFmpeg error: {error_type}")
|
||||||
|
self._ffmpeg_error = error_type
|
||||||
|
|
||||||
|
self.ffmpeg_manager.on_error_callback = handle_ffmpeg_error
|
||||||
|
self._ffmpeg_error = None
|
||||||
|
|
||||||
self.transcription_queue = asyncio.Queue() if self.args.transcription else None
|
self.transcription_queue = asyncio.Queue() if self.args.transcription else None
|
||||||
self.diarization_queue = asyncio.Queue() if self.args.diarization else None
|
self.diarization_queue = asyncio.Queue() if self.args.diarization else None
|
||||||
self.pcm_buffer = bytearray()
|
self.pcm_buffer = bytearray()
|
||||||
|
|
||||||
|
# Task references
|
||||||
|
self.transcription_task = None
|
||||||
|
self.diarization_task = None
|
||||||
|
self.ffmpeg_reader_task = None
|
||||||
|
self.watchdog_task = None
|
||||||
|
self.all_tasks_for_cleanup = []
|
||||||
|
|
||||||
# Initialize transcription engine if enabled
|
# Initialize transcription engine if enabled
|
||||||
if self.args.transcription:
|
if self.args.transcription:
|
||||||
@@ -71,67 +94,12 @@ class AudioProcessor:
|
|||||||
"""Convert PCM buffer in s16le format to normalized NumPy array."""
|
"""Convert PCM buffer in s16le format to normalized NumPy array."""
|
||||||
return np.frombuffer(pcm_buffer, dtype=np.int16).astype(np.float32) / 32768.0
|
return np.frombuffer(pcm_buffer, dtype=np.int16).astype(np.float32) / 32768.0
|
||||||
|
|
||||||
def start_ffmpeg_decoder(self):
|
async def update_transcription(self, new_tokens, buffer, end_buffer, sep):
|
||||||
"""Start FFmpeg process for WebM to PCM conversion."""
|
|
||||||
return (ffmpeg.input("pipe:0", format="webm")
|
|
||||||
.output("pipe:1", format="s16le", acodec="pcm_s16le",
|
|
||||||
ac=self.channels, ar=str(self.sample_rate))
|
|
||||||
.run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True))
|
|
||||||
|
|
||||||
async def restart_ffmpeg(self):
|
|
||||||
"""Restart the FFmpeg process after failure."""
|
|
||||||
logger.warning("Restarting FFmpeg process...")
|
|
||||||
|
|
||||||
if self.ffmpeg_process:
|
|
||||||
try:
|
|
||||||
# we check if process is still running
|
|
||||||
if self.ffmpeg_process.poll() is None:
|
|
||||||
logger.info("Terminating existing FFmpeg process")
|
|
||||||
self.ffmpeg_process.stdin.close()
|
|
||||||
self.ffmpeg_process.terminate()
|
|
||||||
|
|
||||||
# wait for termination with timeout
|
|
||||||
try:
|
|
||||||
await asyncio.wait_for(
|
|
||||||
asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait),
|
|
||||||
timeout=5.0
|
|
||||||
)
|
|
||||||
except asyncio.TimeoutError:
|
|
||||||
logger.warning("FFmpeg process did not terminate, killing forcefully")
|
|
||||||
self.ffmpeg_process.kill()
|
|
||||||
await asyncio.get_event_loop().run_in_executor(None, self.ffmpeg_process.wait)
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Error during FFmpeg process termination: {e}")
|
|
||||||
logger.error(traceback.format_exc())
|
|
||||||
|
|
||||||
# we start new process
|
|
||||||
try:
|
|
||||||
logger.info("Starting new FFmpeg process")
|
|
||||||
self.ffmpeg_process = self.start_ffmpeg_decoder()
|
|
||||||
self.pcm_buffer = bytearray()
|
|
||||||
self.last_ffmpeg_activity = time()
|
|
||||||
logger.info("FFmpeg process restarted successfully")
|
|
||||||
except Exception as e:
|
|
||||||
logger.error(f"Failed to restart FFmpeg process: {e}")
|
|
||||||
logger.error(traceback.format_exc())
|
|
||||||
# try again after 5s
|
|
||||||
await asyncio.sleep(5)
|
|
||||||
try:
|
|
||||||
self.ffmpeg_process = self.start_ffmpeg_decoder()
|
|
||||||
self.pcm_buffer = bytearray()
|
|
||||||
self.last_ffmpeg_activity = time()
|
|
||||||
logger.info("FFmpeg process restarted successfully on second attempt")
|
|
||||||
except Exception as e2:
|
|
||||||
logger.critical(f"Failed to restart FFmpeg process on second attempt: {e2}")
|
|
||||||
logger.critical(traceback.format_exc())
|
|
||||||
|
|
||||||
async def update_transcription(self, new_tokens, buffer, end_buffer, full_transcription, sep):
|
|
||||||
"""Thread-safe update of transcription with new data."""
|
"""Thread-safe update of transcription with new data."""
|
||||||
async with self.lock:
|
async with self.lock:
|
||||||
self.tokens.extend(new_tokens)
|
self.tokens.extend(new_tokens)
|
||||||
self.buffer_transcription = buffer
|
self.buffer_transcription = buffer
|
||||||
self.end_buffer = end_buffer
|
self.end_buffer = end_buffer
|
||||||
self.full_transcription = full_transcription
|
|
||||||
self.sep = sep
|
self.sep = sep
|
||||||
|
|
||||||
async def update_diarization(self, end_attributed_speaker, buffer_diarization=""):
|
async def update_diarization(self, end_attributed_speaker, buffer_diarization=""):
|
||||||
@@ -158,12 +126,12 @@ class AudioProcessor:
|
|||||||
# Calculate remaining times
|
# Calculate remaining times
|
||||||
remaining_transcription = 0
|
remaining_transcription = 0
|
||||||
if self.end_buffer > 0:
|
if self.end_buffer > 0:
|
||||||
remaining_transcription = max(0, round(current_time - self.beg_loop - self.end_buffer, 2))
|
remaining_transcription = max(0, round(current_time - self.beg_loop - self.end_buffer, 1))
|
||||||
|
|
||||||
remaining_diarization = 0
|
remaining_diarization = 0
|
||||||
if self.tokens:
|
if self.tokens:
|
||||||
latest_end = max(self.end_buffer, self.tokens[-1].end if self.tokens else 0)
|
latest_end = max(self.end_buffer, self.tokens[-1].end if self.tokens else 0)
|
||||||
remaining_diarization = max(0, round(latest_end - self.end_attributed_speaker, 2))
|
remaining_diarization = max(0, round(latest_end - self.end_attributed_speaker, 1))
|
||||||
|
|
||||||
return {
|
return {
|
||||||
"tokens": self.tokens.copy(),
|
"tokens": self.tokens.copy(),
|
||||||
@@ -182,44 +150,44 @@ class AudioProcessor:
|
|||||||
self.tokens = []
|
self.tokens = []
|
||||||
self.buffer_transcription = self.buffer_diarization = ""
|
self.buffer_transcription = self.buffer_diarization = ""
|
||||||
self.end_buffer = self.end_attributed_speaker = 0
|
self.end_buffer = self.end_attributed_speaker = 0
|
||||||
self.full_transcription = self.last_response_content = ""
|
|
||||||
self.beg_loop = time()
|
self.beg_loop = time()
|
||||||
|
|
||||||
async def ffmpeg_stdout_reader(self):
|
async def ffmpeg_stdout_reader(self):
|
||||||
"""Read audio data from FFmpeg stdout and process it."""
|
"""Read audio data from FFmpeg stdout and process it."""
|
||||||
loop = asyncio.get_event_loop()
|
|
||||||
beg = time()
|
beg = time()
|
||||||
|
|
||||||
while True:
|
while True:
|
||||||
try:
|
try:
|
||||||
|
# Check if FFmpeg is running
|
||||||
|
state = await self.ffmpeg_manager.get_state()
|
||||||
|
if state == FFmpegState.FAILED:
|
||||||
|
logger.error("FFmpeg is in FAILED state, cannot read data")
|
||||||
|
break
|
||||||
|
elif state == FFmpegState.STOPPED:
|
||||||
|
logger.info("FFmpeg is stopped")
|
||||||
|
break
|
||||||
|
elif state != FFmpegState.RUNNING:
|
||||||
|
logger.warning(f"FFmpeg is in {state} state, waiting...")
|
||||||
|
await asyncio.sleep(0.5)
|
||||||
|
continue
|
||||||
|
|
||||||
current_time = time()
|
current_time = time()
|
||||||
elapsed_time = math.floor((current_time - beg) * 10) / 10
|
elapsed_time = math.floor((current_time - beg) * 10) / 10
|
||||||
buffer_size = max(int(32000 * elapsed_time), 4096)
|
buffer_size = max(int(32000 * elapsed_time), 4096)
|
||||||
beg = current_time
|
beg = current_time
|
||||||
|
|
||||||
# Detect idle state much more quickly
|
chunk = await self.ffmpeg_manager.read_data(buffer_size)
|
||||||
if current_time - self.last_ffmpeg_activity > self.ffmpeg_max_idle_time:
|
|
||||||
logger.warning(f"FFmpeg process idle for {current_time - self.last_ffmpeg_activity:.2f}s. Restarting...")
|
|
||||||
await self.restart_ffmpeg()
|
|
||||||
beg = time()
|
|
||||||
self.last_ffmpeg_activity = time()
|
|
||||||
continue
|
|
||||||
|
|
||||||
chunk = await loop.run_in_executor(None, self.ffmpeg_process.stdout.read, buffer_size)
|
|
||||||
if chunk:
|
|
||||||
self.last_ffmpeg_activity = time()
|
|
||||||
|
|
||||||
if not chunk:
|
if not chunk:
|
||||||
logger.info("FFmpeg stdout closed.")
|
if self.is_stopping:
|
||||||
break
|
logger.info("FFmpeg stdout closed, stopping.")
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
# No data available, but not stopping - FFmpeg might be restarting
|
||||||
|
await asyncio.sleep(0.1)
|
||||||
|
continue
|
||||||
|
|
||||||
self.pcm_buffer.extend(chunk)
|
self.pcm_buffer.extend(chunk)
|
||||||
|
|
||||||
# Send to diarization if enabled
|
|
||||||
if self.args.diarization and self.diarization_queue:
|
|
||||||
await self.diarization_queue.put(
|
|
||||||
self.convert_pcm_to_float(self.pcm_buffer).copy()
|
|
||||||
)
|
|
||||||
|
|
||||||
# Process when enough data
|
# Process when enough data
|
||||||
if len(self.pcm_buffer) >= self.bytes_per_sec:
|
if len(self.pcm_buffer) >= self.bytes_per_sec:
|
||||||
@@ -236,7 +204,11 @@ class AudioProcessor:
|
|||||||
# Send to transcription if enabled
|
# Send to transcription if enabled
|
||||||
if self.args.transcription and self.transcription_queue:
|
if self.args.transcription and self.transcription_queue:
|
||||||
await self.transcription_queue.put(pcm_array.copy())
|
await self.transcription_queue.put(pcm_array.copy())
|
||||||
|
|
||||||
|
# Send to diarization if enabled
|
||||||
|
if self.args.diarization and self.diarization_queue:
|
||||||
|
await self.diarization_queue.put(pcm_array.copy())
|
||||||
|
|
||||||
# Sleep if no processing is happening
|
# Sleep if no processing is happening
|
||||||
if not self.args.transcription and not self.args.diarization:
|
if not self.args.transcription and not self.args.diarization:
|
||||||
await asyncio.sleep(0.1)
|
await asyncio.sleep(0.1)
|
||||||
@@ -244,46 +216,89 @@ class AudioProcessor:
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
|
logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
|
||||||
logger.warning(f"Traceback: {traceback.format_exc()}")
|
logger.warning(f"Traceback: {traceback.format_exc()}")
|
||||||
break
|
# Try to recover by waiting a bit
|
||||||
|
await asyncio.sleep(1)
|
||||||
|
|
||||||
|
# Check if we should exit
|
||||||
|
if self.is_stopping:
|
||||||
|
break
|
||||||
|
|
||||||
|
logger.info("FFmpeg stdout processing finished. Signaling downstream processors.")
|
||||||
|
if self.args.transcription and self.transcription_queue:
|
||||||
|
await self.transcription_queue.put(SENTINEL)
|
||||||
|
logger.debug("Sentinel put into transcription_queue.")
|
||||||
|
if self.args.diarization and self.diarization_queue:
|
||||||
|
await self.diarization_queue.put(SENTINEL)
|
||||||
|
logger.debug("Sentinel put into diarization_queue.")
|
||||||
|
|
||||||
|
|
||||||
     async def transcription_processor(self):
         """Process audio chunks for transcription."""
-        self.full_transcription = ""
         self.sep = self.online.asr.sep
+        cumulative_pcm_duration_stream_time = 0.0
 
         while True:
             try:
                 pcm_array = await self.transcription_queue.get()
+                if pcm_array is SENTINEL:
+                    logger.debug("Transcription processor received sentinel. Finishing.")
+                    self.transcription_queue.task_done()
+                    break
+
-                logger.info(f"{len(self.online.audio_buffer) / self.online.SAMPLING_RATE} seconds of audio to process.")
+                if not self.online:
+                    logger.warning("Transcription processor: self.online not initialized.")
+                    self.transcription_queue.task_done()
+                    continue
+
+                asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
+                transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
+
+                logger.info(
+                    f"ASR processing: internal_buffer={asr_internal_buffer_duration_s:.2f}s, "
+                    f"lag={transcription_lag_s:.2f}s."
+                )
+
                 # Process transcription
-                self.online.insert_audio_chunk(pcm_array)
-                new_tokens = self.online.process_iter()
+                duration_this_chunk = len(pcm_array) / self.sample_rate if isinstance(pcm_array, np.ndarray) else 0
+                cumulative_pcm_duration_stream_time += duration_this_chunk
+                stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time
+
+                self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
+                new_tokens, current_audio_processed_upto = self.online.process_iter()
 
-                if new_tokens:
-                    self.full_transcription += self.sep.join([t.text for t in new_tokens])
-
                 # Get buffer information
-                _buffer = self.online.get_buffer()
-                buffer = _buffer.text
-                end_buffer = _buffer.end if _buffer.end else (
-                    new_tokens[-1].end if new_tokens else 0
-                )
+                _buffer_transcript_obj = self.online.get_buffer()
+                buffer_text = _buffer_transcript_obj.text
+
+                if new_tokens:
+                    validated_text = self.sep.join([t.text for t in new_tokens])
+                    if buffer_text.startswith(validated_text):
+                        buffer_text = buffer_text[len(validated_text):].lstrip()
+
+                candidate_end_times = [self.end_buffer]
+
+                if new_tokens:
+                    candidate_end_times.append(new_tokens[-1].end)
+
+                if _buffer_transcript_obj.end is not None:
+                    candidate_end_times.append(_buffer_transcript_obj.end)
+
+                candidate_end_times.append(current_audio_processed_upto)
+
+                new_end_buffer = max(candidate_end_times)
 
-                # Avoid duplicating content
-                if buffer in self.full_transcription:
-                    buffer = ""
-
                 await self.update_transcription(
-                    new_tokens, buffer, end_buffer, self.full_transcription, self.sep
+                    new_tokens, buffer_text, new_end_buffer, self.sep
                 )
+                self.transcription_queue.task_done()
 
             except Exception as e:
                 logger.warning(f"Exception in transcription_processor: {e}")
                 logger.warning(f"Traceback: {traceback.format_exc()}")
-            finally:
-                self.transcription_queue.task_done()
+                if 'pcm_array' in locals() and pcm_array is not SENTINEL:  # Check if pcm_array was assigned from queue
+                    self.transcription_queue.task_done()
+        logger.info("Transcription processor task finished.")
     async def diarization_processor(self, diarization_obj):
         """Process audio chunks for speaker diarization."""
@@ -292,28 +307,55 @@ class AudioProcessor:
         while True:
             try:
                 pcm_array = await self.diarization_queue.get()
+                if pcm_array is SENTINEL:
+                    logger.debug("Diarization processor received sentinel. Finishing.")
+                    self.diarization_queue.task_done()
+                    break
+
                 # Process diarization
                 await diarization_obj.diarize(pcm_array)
 
-                # Get current state and update speakers
-                state = await self.get_current_state()
-                new_end = diarization_obj.assign_speakers_to_tokens(
-                    state["end_attributed_speaker"], state["tokens"]
-                )
+                async with self.lock:
+                    self.tokens = diarization_obj.assign_speakers_to_tokens(
+                        self.tokens,
+                        use_punctuation_split=self.args.punctuation_split
+                    )
+                    if len(self.tokens) > 0:
+                        self.end_attributed_speaker = max(self.tokens[-1].end, self.end_attributed_speaker)
+                    if buffer_diarization:
+                        self.buffer_diarization = buffer_diarization
 
-                await self.update_diarization(new_end, buffer_diarization)
+                self.diarization_queue.task_done()
 
             except Exception as e:
                 logger.warning(f"Exception in diarization_processor: {e}")
                 logger.warning(f"Traceback: {traceback.format_exc()}")
-            finally:
-                self.diarization_queue.task_done()
+                if 'pcm_array' in locals() and pcm_array is not SENTINEL:
+                    self.diarization_queue.task_done()
+        logger.info("Diarization processor task finished.")
     async def results_formatter(self):
         """Format processing results for output."""
+        last_sent_trans = None
+        last_sent_diar = None
         while True:
             try:
+                ffmpeg_state = await self.ffmpeg_manager.get_state()
+                if ffmpeg_state == FFmpegState.FAILED and self._ffmpeg_error:
+                    yield {
+                        "status": "error",
+                        "error": f"FFmpeg error: {self._ffmpeg_error}",
+                        "lines": [],
+                        "buffer_transcription": "",
+                        "buffer_diarization": "",
+                        "remaining_time_transcription": 0,
+                        "remaining_time_diarization": 0
+                    }
+                    self._ffmpeg_error = None
+                    await asyncio.sleep(1)
+                    continue
+
                 # Get current state
                 state = await self.get_current_state()
                 tokens = state["tokens"]
@@ -334,8 +376,8 @@ class AudioProcessor:
                 lines = []
                 last_end_diarized = 0
                 undiarized_text = []
+                current_time = time() - self.beg_loop
-                # Process each token
+                tokens = handle_silences(tokens, current_time)
                 for token in tokens:
                     speaker = token.speaker
 
@@ -372,31 +414,60 @@ class AudioProcessor:
                         await self.update_diarization(end_attributed_speaker, combined)
                         buffer_diarization = combined
 
-                # Create response object
-                if not lines:
-                    lines = [{
+                response_status = "active_transcription"
+                final_lines_for_response = lines.copy()
+
+                if not tokens and not buffer_transcription and not buffer_diarization:
+                    response_status = "no_audio_detected"
+                    final_lines_for_response = []
+                elif response_status == "active_transcription" and not final_lines_for_response:
+                    final_lines_for_response = [{
                         "speaker": 1,
                         "text": "",
-                        "beg": format_time(0),
-                        "end": format_time(tokens[-1].end if tokens else 0),
+                        "beg": format_time(state.get("end_buffer", 0)),
+                        "end": format_time(state.get("end_buffer", 0)),
                         "diff": 0
                     }]
 
                 response = {
-                    "lines": lines,
+                    "status": response_status,
+                    "lines": final_lines_for_response,
                     "buffer_transcription": buffer_transcription,
                     "buffer_diarization": buffer_diarization,
                     "remaining_time_transcription": state["remaining_time_transcription"],
                     "remaining_time_diarization": state["remaining_time_diarization"]
                 }
 
-                # Only yield if content has changed
-                response_content = ' '.join([f"{line['speaker']} {line['text']}" for line in lines]) + \
+                current_response_signature = f"{response_status} | " + \
+                    ' '.join([f"{line['speaker']} {line['text']}" for line in final_lines_for_response]) + \
                     f" | {buffer_transcription} | {buffer_diarization}"
 
-                if response_content != self.last_response_content and (lines or buffer_transcription or buffer_diarization):
+                trans = state["remaining_time_transcription"]
+                diar = state["remaining_time_diarization"]
+                should_push = (
+                    current_response_signature != self.last_response_content
+                    or last_sent_trans is None
+                    or round(trans, 1) != round(last_sent_trans, 1)
+                    or round(diar, 1) != round(last_sent_diar, 1)
+                )
+                if should_push and (final_lines_for_response or buffer_transcription or buffer_diarization or response_status == "no_audio_detected" or trans > 0 or diar > 0):
                     yield response
-                    self.last_response_content = response_content
+                    self.last_response_content = current_response_signature
+                    last_sent_trans = trans
+                    last_sent_diar = diar
+
+                # Check for termination condition
+                if self.is_stopping:
+                    all_processors_done = True
+                    if self.args.transcription and self.transcription_task and not self.transcription_task.done():
+                        all_processors_done = False
+                    if self.args.diarization and self.diarization_task and not self.diarization_task.done():
+                        all_processors_done = False
+
+                    if all_processors_done:
+                        logger.info("Results formatter: All upstream processors are done and in stopping state. Terminating.")
+                        final_state = await self.get_current_state()
+                        return
+
                 await asyncio.sleep(0.1)  # Avoid overwhelming the client
 
@@ -407,114 +478,109 @@ class AudioProcessor:
 
     async def create_tasks(self):
         """Create and start processing tasks."""
+        self.all_tasks_for_cleanup = []
-        tasks = []
+        processing_tasks_for_watchdog = []
 
+        success = await self.ffmpeg_manager.start()
+        if not success:
+            logger.error("Failed to start FFmpeg manager")
+            async def error_generator():
+                yield {
+                    "status": "error",
+                    "error": "FFmpeg failed to start. Please check that FFmpeg is installed.",
+                    "lines": [],
+                    "buffer_transcription": "",
+                    "buffer_diarization": "",
+                    "remaining_time_transcription": 0,
+                    "remaining_time_diarization": 0
+                }
+            return error_generator()
+
         if self.args.transcription and self.online:
-            tasks.append(asyncio.create_task(self.transcription_processor()))
+            self.transcription_task = asyncio.create_task(self.transcription_processor())
+            self.all_tasks_for_cleanup.append(self.transcription_task)
+            processing_tasks_for_watchdog.append(self.transcription_task)
 
         if self.args.diarization and self.diarization:
-            tasks.append(asyncio.create_task(self.diarization_processor(self.diarization)))
+            self.diarization_task = asyncio.create_task(self.diarization_processor(self.diarization))
+            self.all_tasks_for_cleanup.append(self.diarization_task)
+            processing_tasks_for_watchdog.append(self.diarization_task)
 
-        tasks.append(asyncio.create_task(self.ffmpeg_stdout_reader()))
+        self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
+        self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
+        processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)
 
-        # Monitor overall system health
-        async def watchdog():
-            while True:
-                try:
-                    await asyncio.sleep(10)  # Check every 10 seconds instead of 60
-
-                    current_time = time()
-                    # Check for stalled tasks
-                    for i, task in enumerate(tasks):
-                        if task.done():
-                            exc = task.exception() if task.done() else None
-                            task_name = task.get_name() if hasattr(task, 'get_name') else f"Task {i}"
-                            logger.error(f"{task_name} unexpectedly completed with exception: {exc}")
-
-                    # Check for FFmpeg process health with shorter thresholds
-                    ffmpeg_idle_time = current_time - self.last_ffmpeg_activity
-                    if ffmpeg_idle_time > 15:  # 15 seconds instead of 180
-                        logger.warning(f"FFmpeg idle for {ffmpeg_idle_time:.2f}s - may need attention")
-
-                        # Force restart after 30 seconds of inactivity (instead of 600)
-                        if ffmpeg_idle_time > 30:
-                            logger.error("FFmpeg idle for too long, forcing restart")
-                            await self.restart_ffmpeg()
-
-                except Exception as e:
-                    logger.error(f"Error in watchdog task: {e}")
-
-        tasks.append(asyncio.create_task(watchdog()))
-        self.tasks = tasks
+        # Monitor overall system health
+        self.watchdog_task = asyncio.create_task(self.watchdog(processing_tasks_for_watchdog))
+        self.all_tasks_for_cleanup.append(self.watchdog_task)
 
         return self.results_formatter()
 
+    async def watchdog(self, tasks_to_monitor):
+        """Monitors the health of critical processing tasks."""
+        while True:
+            try:
+                await asyncio.sleep(10)
+
+                for i, task in enumerate(tasks_to_monitor):
+                    if task.done():
+                        exc = task.exception()
+                        task_name = task.get_name() if hasattr(task, 'get_name') else f"Monitored Task {i}"
+                        if exc:
+                            logger.error(f"{task_name} unexpectedly completed with exception: {exc}")
+                        else:
+                            logger.info(f"{task_name} completed normally.")
+
+                # Check FFmpeg status through the manager
+                ffmpeg_state = await self.ffmpeg_manager.get_state()
+                if ffmpeg_state == FFmpegState.FAILED:
+                    logger.error("FFmpeg is in FAILED state, notifying results formatter")
+                    # FFmpeg manager will handle its own recovery
+                elif ffmpeg_state == FFmpegState.STOPPED and not self.is_stopping:
+                    logger.warning("FFmpeg unexpectedly stopped, attempting restart")
+                    await self.ffmpeg_manager.restart()
+
+            except asyncio.CancelledError:
+                logger.info("Watchdog task cancelled.")
+                break
+            except Exception as e:
+                logger.error(f"Error in watchdog task: {e}", exc_info=True)
 
     async def cleanup(self):
         """Clean up resources when processing is complete."""
-        for task in self.tasks:
-            task.cancel()
-        try:
-            await asyncio.gather(*self.tasks, return_exceptions=True)
-            self.ffmpeg_process.stdin.close()
-            self.ffmpeg_process.wait()
-        except Exception as e:
-            logger.warning(f"Error during cleanup: {e}")
-        if self.args.diarization and hasattr(self, 'diarization'):
+        logger.info("Starting cleanup of AudioProcessor resources.")
+        for task in self.all_tasks_for_cleanup:
+            if task and not task.done():
+                task.cancel()
+
+        created_tasks = [t for t in self.all_tasks_for_cleanup if t]
+        if created_tasks:
+            await asyncio.gather(*created_tasks, return_exceptions=True)
+        logger.info("All processing tasks cancelled or finished.")
+
+        await self.ffmpeg_manager.stop()
+        logger.info("FFmpeg manager stopped.")
+
+        if self.args.diarization and hasattr(self, 'diarization') and hasattr(self.diarization, 'close'):
             self.diarization.close()
+        logger.info("AudioProcessor cleanup complete.")
     async def process_audio(self, message):
         """Process incoming audio data."""
-        retry_count = 0
-        max_retries = 3
-
-        # Log periodic heartbeats showing ongoing audio proc
-        current_time = time()
-        if not hasattr(self, '_last_heartbeat') or current_time - self._last_heartbeat >= 10:
-            logger.debug(f"Processing audio chunk, last FFmpeg activity: {current_time - self.last_ffmpeg_activity:.2f}s ago")
-            self._last_heartbeat = current_time
-
-        while retry_count < max_retries:
-            try:
-                if not self.ffmpeg_process or not hasattr(self.ffmpeg_process, 'stdin') or self.ffmpeg_process.poll() is not None:
-                    logger.warning("FFmpeg process not available, restarting...")
-                    await self.restart_ffmpeg()
-
-                loop = asyncio.get_running_loop()
-                try:
-                    await asyncio.wait_for(
-                        loop.run_in_executor(None, lambda: self.ffmpeg_process.stdin.write(message)),
-                        timeout=2.0
-                    )
-                except asyncio.TimeoutError:
-                    logger.warning("FFmpeg write operation timed out, restarting...")
-                    await self.restart_ffmpeg()
-                    retry_count += 1
-                    continue
-
-                try:
-                    await asyncio.wait_for(
-                        loop.run_in_executor(None, self.ffmpeg_process.stdin.flush),
-                        timeout=2.0
-                    )
-                except asyncio.TimeoutError:
-                    logger.warning("FFmpeg flush operation timed out, restarting...")
-                    await self.restart_ffmpeg()
-                    retry_count += 1
-                    continue
-
-                self.last_ffmpeg_activity = time()
-                return
-
-            except (BrokenPipeError, AttributeError, OSError) as e:
-                retry_count += 1
-                logger.warning(f"Error writing to FFmpeg: {e}. Retry {retry_count}/{max_retries}...")
-
-                if retry_count < max_retries:
-                    await self.restart_ffmpeg()
-                    await asyncio.sleep(0.5)
-                else:
-                    logger.error("Maximum retries reached for FFmpeg process")
-                    await self.restart_ffmpeg()
-                    return
+        if not message:
+            logger.info("Empty audio message received, initiating stop sequence.")
+            self.is_stopping = True
+            # Signal FFmpeg manager to stop accepting data
+            await self.ffmpeg_manager.stop()
+            return
+
+        if self.is_stopping:
+            logger.warning("AudioProcessor is stopping. Ignoring incoming audio.")
+            return
+
+        success = await self.ffmpeg_manager.write_data(message)
+        if not success:
+            ffmpeg_state = await self.ffmpeg_manager.get_state()
+            if ffmpeg_state == FFmpegState.FAILED:
+                logger.error("FFmpeg is in FAILED state, cannot process audio")
+            else:
+                logger.warning("Failed to write audio data to FFmpeg")
@@ -2,26 +2,24 @@ from contextlib import asynccontextmanager
 from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.responses import HTMLResponse
 from fastapi.middleware.cors import CORSMiddleware
-from whisperlivekit import WhisperLiveKit, parse_args
-from whisperlivekit.audio_processor import AudioProcessor
+from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
 
 import asyncio
 import logging
-import os, sys
-import argparse
 
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logging.getLogger().setLevel(logging.WARNING)
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
 
-kit = None
+args = parse_args()
+transcription_engine = None
 
 @asynccontextmanager
 async def lifespan(app: FastAPI):
-    global kit
-    kit = WhisperLiveKit()
+    global transcription_engine
+    transcription_engine = TranscriptionEngine(
+        **vars(args),
+    )
     yield
 
 app = FastAPI(lifespan=lifespan)
@@ -33,10 +31,9 @@ app.add_middleware(
     allow_headers=["*"],
 )
 
 @app.get("/")
 async def get():
-    return HTMLResponse(kit.web_interface())
+    return HTMLResponse(get_web_interface_html())
 
 async def handle_websocket_results(websocket, results_generator):
@@ -44,14 +41,21 @@ async def handle_websocket_results(websocket, results_generator):
     try:
         async for response in results_generator:
             await websocket.send_json(response)
+        # when the results_generator finishes it means all audio has been processed
+        logger.info("Results generator finished. Sending 'ready_to_stop' to client.")
+        await websocket.send_json({"type": "ready_to_stop"})
+    except WebSocketDisconnect:
+        logger.info("WebSocket disconnected while handling results (client likely closed connection).")
     except Exception as e:
         logger.warning(f"Error in WebSocket results handler: {e}")
 
 @app.websocket("/asr")
 async def websocket_endpoint(websocket: WebSocket):
-    audio_processor = AudioProcessor()
+    global transcription_engine
+    audio_processor = AudioProcessor(
+        transcription_engine=transcription_engine,
+    )
     await websocket.accept()
     logger.info("WebSocket connection opened.")
 
@@ -62,19 +66,33 @@ async def websocket_endpoint(websocket: WebSocket):
         while True:
             message = await websocket.receive_bytes()
             await audio_processor.process_audio(message)
+    except KeyError as e:
+        if 'bytes' in str(e):
+            logger.warning(f"Client has closed the connection.")
+        else:
+            logger.error(f"Unexpected KeyError in websocket_endpoint: {e}", exc_info=True)
     except WebSocketDisconnect:
-        logger.warning("WebSocket disconnected.")
+        logger.info("WebSocket disconnected by client during message receiving loop.")
+    except Exception as e:
+        logger.error(f"Unexpected error in websocket_endpoint main loop: {e}", exc_info=True)
     finally:
-        websocket_task.cancel()
+        logger.info("Cleaning up WebSocket endpoint...")
+        if not websocket_task.done():
+            websocket_task.cancel()
+            try:
+                await websocket_task
+            except asyncio.CancelledError:
+                logger.info("WebSocket results handler task was cancelled.")
+            except Exception as e:
+                logger.warning(f"Exception while awaiting websocket_task completion: {e}")
 
         await audio_processor.cleanup()
-        logger.info("WebSocket endpoint cleaned up.")
+        logger.info("WebSocket endpoint cleaned up successfully.")
 
 def main():
     """Entry point for the CLI command."""
     import uvicorn
 
-    args = parse_args()
 
     uvicorn_kwargs = {
         "app": "whisperlivekit.basic_server:app",
         "host":args.host,
@@ -93,7 +111,6 @@ def main():
         "ssl_keyfile": args.ssl_keyfile
     }
 
     if ssl_kwargs:
        uvicorn_kwargs = {**uvicorn_kwargs, **ssl_kwargs}
 
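Note on the updated endpoint above: the server now ends a session by sending a final {"type": "ready_to_stop"} message, and an empty binary message from the client triggers the stop sequence in process_audio. A minimal client sketch follows (not part of the diff; the `websockets` dependency, the local URL, the chunk size, and the source audio format are assumptions):

# Hypothetical client sketch for the /asr WebSocket endpoint shown above.
import asyncio
import json
import websockets  # assumed third-party dependency, not part of this diff

async def stream_audio(path: str, url: str = "ws://localhost:8000/asr"):
    async with websockets.connect(url) as ws:
        async def receiver():
            async for raw in ws:
                msg = json.loads(raw)
                if msg.get("type") == "ready_to_stop":  # final message from the server
                    break
                print(msg.get("lines"), msg.get("buffer_transcription"))
        recv_task = asyncio.create_task(receiver())
        with open(path, "rb") as f:              # e.g. browser-recorded WebM/Opus bytes
            while chunk := f.read(4096):
                await ws.send(chunk)
                await asyncio.sleep(0.1)         # pace the upload roughly like live audio
        await ws.send(b"")                       # empty message initiates the stop sequence
        await recv_task

# asyncio.run(stream_audio("sample.webm"))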
@@ -1,149 +1,14 @@
 try:
-    from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory, warmup_asr
+    from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory
+    from whisperlivekit.whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
 except ImportError:
-    from .whisper_streaming_custom.whisper_online import backend_factory, warmup_asr
-from argparse import Namespace, ArgumentParser
+    from .whisper_streaming_custom.whisper_online import backend_factory
+    from .whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
+from whisperlivekit.warmup import warmup_asr, warmup_online
+from argparse import Namespace
+import sys
 
-def parse_args():
-    parser = ArgumentParser(description="Whisper FastAPI Online Server")
-    parser.add_argument("--host", type=str, default="localhost", help="The host address to bind the server to.")
-    parser.add_argument("--port", type=int, default=8000, help="The port number to bind the server to.")
-    parser.add_argument("--warmup-file", type=str, default=None, dest="warmup_file",
-        help="""The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
-        If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav. If False, no warmup is performed.""")
-    parser.add_argument("--confidence-validation", action="store_true", help="Accelerates validation of tokens using confidence scores. Transcription will be faster but punctuation might be less accurate.")
-    parser.add_argument("--diarization", action="store_true", default=False, help="Enable speaker diarization.")
-    parser.add_argument("--no-transcription", action="store_true", help="Disable transcription to only see live diarization results.")
-    parser.add_argument("--min-chunk-size", type=float, default=0.5, help="Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.")
-    parser.add_argument("--model", type=str, default="tiny", help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.")
-    parser.add_argument("--model_cache_dir", type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
-    parser.add_argument("--model_dir", type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
-    parser.add_argument("--lan", "--language", type=str, default="auto", help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
-    parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="Transcribe or translate.")
-    parser.add_argument("--backend", type=str, default="faster-whisper", choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api"], help="Load only this backend for Whisper processing.")
-    parser.add_argument("--vac", action="store_true", default=False, help="Use VAC = voice activity controller. Recommended. Requires torch.")
-    parser.add_argument("--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds.")
-    parser.add_argument("--no-vad", action="store_true", help="Disable VAD (voice activity detection).")
-    parser.add_argument("--buffer_trimming", type=str, default="segment", choices=["sentence", "segment"], help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.')
-    parser.add_argument("--buffer_trimming_sec", type=float, default=15, help="Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.")
-    parser.add_argument("-l", "--log-level", dest="log_level", choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"], help="Set the log level", default="DEBUG")
-    parser.add_argument("--ssl-certfile", type=str, help="Path to the SSL certificate file.", default=None)
-    parser.add_argument("--ssl-keyfile", type=str, help="Path to the SSL private key file.", default=None)
-
-    args = parser.parse_args()
-
-    args.transcription = not args.no_transcription
-    args.vad = not args.no_vad
-    delattr(args, 'no_transcription')
-    delattr(args, 'no_vad')
-
-    return args
-
-class WhisperLiveKit:
+class TranscriptionEngine:
     _instance = None
     _initialized = False
 
@@ -153,32 +18,135 @@ class WhisperLiveKit:
         return cls._instance
 
     def __init__(self, **kwargs):
-        if WhisperLiveKit._initialized:
+        if TranscriptionEngine._initialized:
             return
 
-        default_args = vars(parse_args())
+        defaults = {
+            "host": "localhost",
+            "port": 8000,
+            "warmup_file": None,
+            "diarization": False,
+            "punctuation_split": False,
+            "min_chunk_size": 0.5,
+            "model": "tiny",
+            "model_cache_dir": None,
+            "model_dir": None,
+            "lan": "auto",
+            "task": "transcribe",
+            "backend": "faster-whisper",
+            "vac": False,
+            "vac_chunk_size": 0.04,
+            "log_level": "DEBUG",
+            "ssl_certfile": None,
+            "ssl_keyfile": None,
+            "transcription": True,
+            "vad": True,
+            # whisperstreaming params:
+            "buffer_trimming": "segment",
+            "confidence_validation": False,
+            "buffer_trimming_sec": 15,
+            # simulstreaming params:
+            "frame_threshold": 25,
+            "beams": 1,
+            "decoder_type": None,
+            "audio_max_len": 30.0,
+            "audio_min_len": 0.0,
+            "cif_ckpt_path": None,
+            "never_fire": False,
+            "init_prompt": None,
+            "static_init_prompt": None,
+            "max_context_tokens": None,
+            "model_path": './base.pt',
+            # diart params:
+            "segmentation_model": "pyannote/segmentation-3.0",
+            "embedding_model": "pyannote/embedding",
+        }
+
+        config_dict = {**defaults, **kwargs}
+
+        if 'no_transcription' in kwargs:
+            config_dict['transcription'] = not kwargs['no_transcription']
+        if 'no_vad' in kwargs:
+            config_dict['vad'] = not kwargs['no_vad']
 
-        merged_args = {**default_args, **kwargs}
-        self.args = Namespace(**merged_args)
+        config_dict.pop('no_transcription', None)
+        config_dict.pop('no_vad', None)
+
+        if 'language' in kwargs:
+            config_dict['lan'] = kwargs['language']
+        config_dict.pop('language', None)
+
+        self.args = Namespace(**config_dict)
 
         self.asr = None
         self.tokenizer = None
         self.diarization = None
 
         if self.args.transcription:
-            self.asr, self.tokenizer = backend_factory(self.args)
-            warmup_asr(self.asr, self.args.warmup_file)
+            if self.args.backend == "simulstreaming":
+                from simul_whisper import SimulStreamingASR
+                self.tokenizer = None
+                simulstreaming_kwargs = {}
+                for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len',
+                             'cif_ckpt_path', 'never_fire', 'init_prompt', 'static_init_prompt',
+                             'max_context_tokens', 'model_path']:
+                    if hasattr(self.args, attr):
+                        simulstreaming_kwargs[attr] = getattr(self.args, attr)
+
+                # Add segment_length from min_chunk_size
+                simulstreaming_kwargs['segment_length'] = getattr(self.args, 'min_chunk_size', 0.5)
+                simulstreaming_kwargs['task'] = self.args.task
+
+                size = self.args.model
+                self.asr = SimulStreamingASR(
+                    modelsize=size,
+                    lan=self.args.lan,
+                    cache_dir=getattr(self.args, 'model_cache_dir', None),
+                    model_dir=getattr(self.args, 'model_dir', None),
+                    **simulstreaming_kwargs
+                )
+            else:
+                self.asr, self.tokenizer = backend_factory(self.args)
+                warmup_asr(self.asr, self.args.warmup_file)  # for simulstreaming, warmup should be done in the online class, not here
 
         if self.args.diarization:
             from whisperlivekit.diarization.diarization_online import DiartDiarization
-            self.diarization = DiartDiarization()
+            self.diarization = DiartDiarization(
+                block_duration=self.args.min_chunk_size,
+                segmentation_model_name=self.args.segmentation_model,
+                embedding_model_name=self.args.embedding_model
+            )
 
-        WhisperLiveKit._initialized = True
+        TranscriptionEngine._initialized = True
 
-    def web_interface(self):
-        import pkg_resources
-        html_path = pkg_resources.resource_filename('whisperlivekit', 'web/live_transcription.html')
-        with open(html_path, "r", encoding="utf-8") as f:
-            html = f.read()
-        return html
+
+def online_factory(args, asr, tokenizer, logfile=sys.stderr):
+    if args.backend == "simulstreaming":
+        from simul_whisper import SimulStreamingOnlineProcessor
+        online = SimulStreamingOnlineProcessor(
+            asr,
+            logfile=logfile,
+        )
+        # warmup_online(online, args.warmup_file)
+    elif args.vac:
+        online = VACOnlineASRProcessor(
+            args.min_chunk_size,
+            asr,
+            tokenizer,
+            logfile=logfile,
+            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
+            confidence_validation=args.confidence_validation
+        )
+    else:
+        online = OnlineASRProcessor(
+            asr,
+            tokenizer,
+            logfile=logfile,
+            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
+            confidence_validation=args.confidence_validation
+        )
+    return online
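For orientation, a rough sketch of how the refactored pieces above could be wired together outside the FastAPI server (illustration only; the import path and the exact per-chunk call sequence are assumptions inferred from this diff, not documented API):

# Sketch: constructing the shared engine and a per-connection online processor directly.
from whisperlivekit.core import TranscriptionEngine, online_factory  # assumed module path

engine = TranscriptionEngine(model="tiny", language="en", vac=True)  # 'language' is remapped to 'lan'
online = online_factory(engine.args, engine.asr, engine.tokenizer)   # simulstreaming / VAC / plain chosen by args

# Each float32 PCM chunk would then be fed as transcription_processor does above:
#   online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
#   new_tokens, current_audio_processed_upto = online.process_iter()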
@@ -3,7 +3,8 @@ import re
 import threading
 import numpy as np
 import logging
+import time
+from queue import SimpleQueue, Empty
 
 from diart import SpeakerDiarization, SpeakerDiarizationConfig
 from diart.inference import StreamingInference
@@ -13,6 +14,7 @@ from diart.sources import MicrophoneAudioSource
 from rx.core import Observer
 from typing import Tuple, Any, List
 from pyannote.core import Annotation
+import diart.models as m
 
 logger = logging.getLogger(__name__)
 
@@ -78,40 +80,114 @@ class DiarizationObserver(Observer):
 
 class WebSocketAudioSource(AudioSource):
     """
-    Custom AudioSource that blocks in read() until close() is called.
-    Use push_audio() to inject PCM chunks.
+    Buffers incoming audio and releases it in fixed-size chunks at regular intervals.
     """
-    def __init__(self, uri: str = "websocket", sample_rate: int = 16000):
+    def __init__(self, uri: str = "websocket", sample_rate: int = 16000, block_duration: float = 0.5):
         super().__init__(uri, sample_rate)
+        self.block_duration = block_duration
+        self.block_size = int(np.rint(block_duration * sample_rate))
+        self._queue = SimpleQueue()
+        self._buffer = np.array([], dtype=np.float32)
+        self._buffer_lock = threading.Lock()
         self._closed = False
         self._close_event = threading.Event()
+        self._processing_thread = None
+        self._last_chunk_time = time.time()
 
     def read(self):
+        """Start processing buffered audio and emit fixed-size chunks."""
+        self._processing_thread = threading.Thread(target=self._process_chunks)
+        self._processing_thread.daemon = True
+        self._processing_thread.start()
+
         self._close_event.wait()
+        if self._processing_thread:
+            self._processing_thread.join(timeout=2.0)
+
+    def _process_chunks(self):
+        """Process audio from queue and emit fixed-size chunks at regular intervals."""
+        while not self._closed:
+            try:
+                audio_chunk = self._queue.get(timeout=0.1)
+
+                with self._buffer_lock:
+                    self._buffer = np.concatenate([self._buffer, audio_chunk])
+
+                    while len(self._buffer) >= self.block_size:
+                        chunk = self._buffer[:self.block_size]
+                        self._buffer = self._buffer[self.block_size:]
+
+                        current_time = time.time()
+                        time_since_last = current_time - self._last_chunk_time
+                        if time_since_last < self.block_duration:
+                            time.sleep(self.block_duration - time_since_last)
+
+                        chunk_reshaped = chunk.reshape(1, -1)
+                        self.stream.on_next(chunk_reshaped)
+                        self._last_chunk_time = time.time()
+
+            except Empty:
+                with self._buffer_lock:
+                    if len(self._buffer) > 0 and time.time() - self._last_chunk_time > self.block_duration:
+                        padded_chunk = np.zeros(self.block_size, dtype=np.float32)
+                        padded_chunk[:len(self._buffer)] = self._buffer
+                        self._buffer = np.array([], dtype=np.float32)
+
+                        chunk_reshaped = padded_chunk.reshape(1, -1)
+                        self.stream.on_next(chunk_reshaped)
+                        self._last_chunk_time = time.time()
+            except Exception as e:
+                logger.error(f"Error in audio processing thread: {e}")
+                self.stream.on_error(e)
+                break
+
+        with self._buffer_lock:
+            if len(self._buffer) > 0:
+                padded_chunk = np.zeros(self.block_size, dtype=np.float32)
+                padded_chunk[:len(self._buffer)] = self._buffer
+                chunk_reshaped = padded_chunk.reshape(1, -1)
+                self.stream.on_next(chunk_reshaped)
+
+        self.stream.on_completed()
 
     def close(self):
         if not self._closed:
             self._closed = True
-            self.stream.on_completed()
             self._close_event.set()
 
     def push_audio(self, chunk: np.ndarray):
+        """Add audio chunk to the processing queue."""
         if not self._closed:
-            new_audio = np.expand_dims(chunk, axis=0)
-            logger.debug('Add new chunk with shape:', new_audio.shape)
-            self.stream.on_next(new_audio)
+            if chunk.ndim > 1:
+                chunk = chunk.flatten()
+            self._queue.put(chunk)
+            logger.debug(f'Added chunk to queue with {len(chunk)} samples')
 
 
 class DiartDiarization:
-    def __init__(self, sample_rate: int = 16000, config : SpeakerDiarizationConfig = None, use_microphone: bool = False):
+    def __init__(self, sample_rate: int = 16000, config : SpeakerDiarizationConfig = None, use_microphone: bool = False, block_duration: float = 1.5, segmentation_model_name: str = "pyannote/segmentation-3.0", embedding_model_name: str = "pyannote/embedding"):
+        segmentation_model = m.SegmentationModel.from_pretrained(segmentation_model_name)
+        embedding_model = m.EmbeddingModel.from_pretrained(embedding_model_name)
+
+        if config is None:
+            config = SpeakerDiarizationConfig(
+                segmentation=segmentation_model,
+                embedding=embedding_model,
+            )
+
         self.pipeline = SpeakerDiarization(config=config)
         self.observer = DiarizationObserver()
+        self.lag_diart = None
 
         if use_microphone:
-            self.source = MicrophoneAudioSource()
+            self.source = MicrophoneAudioSource(block_duration=block_duration)
             self.custom_source = None
         else:
-            self.custom_source = WebSocketAudioSource(uri="websocket_source", sample_rate=sample_rate)
+            self.custom_source = WebSocketAudioSource(
+                uri="websocket_source",
+                sample_rate=sample_rate,
+                block_duration=block_duration
+            )
             self.source = self.custom_source
 
         self.inference = StreamingInference(
@@ -130,24 +206,106 @@ class DiartDiarization:
         """
         if self.custom_source:
             self.custom_source.push_audio(pcm_array)
-        self.observer.clear_old_segments()
-        return self.observer.get_segments()
+        # self.observer.clear_old_segments()
 
     def close(self):
         """Close the audio source."""
         if self.custom_source:
             self.custom_source.close()
 
-    def assign_speakers_to_tokens(self, end_attributed_speaker, tokens: list) -> float:
+    def assign_speakers_to_tokens(self, tokens: list, use_punctuation_split: bool = False) -> float:
         """
         Assign speakers to tokens based on timing overlap with speaker segments.
         Uses the segments collected by the observer.
+
+        If use_punctuation_split is True, uses punctuation marks to refine speaker boundaries.
         """
         segments = self.observer.get_segments()
 
-        for token in tokens:
-            for segment in segments:
-                if not (segment.end <= token.start or segment.start >= token.end):
-                    token.speaker = extract_number(segment.speaker) + 1
-                    end_attributed_speaker = max(token.end, end_attributed_speaker)
-        return end_attributed_speaker
+        # Debug logging
+        logger.debug(f"assign_speakers_to_tokens called with {len(tokens)} tokens")
+        logger.debug(f"Available segments: {len(segments)}")
+        for i, seg in enumerate(segments[:5]):  # Show first 5 segments
+            logger.debug(f"  Segment {i}: {seg.speaker} [{seg.start:.2f}-{seg.end:.2f}]")
+
+        if not self.lag_diart and segments and tokens:
+            self.lag_diart = segments[0].start - tokens[0].start
+
+        if not use_punctuation_split:
+            for token in tokens:
+                for segment in segments:
+                    if not (segment.end <= token.start + self.lag_diart or segment.start >= token.end + self.lag_diart):
+                        token.speaker = extract_number(segment.speaker) + 1
+        else:
+            tokens = add_speaker_to_tokens(segments, tokens)
+        return tokens
+
+
+def concatenate_speakers(segments):
+    segments_concatenated = [{"speaker": 1, "begin": 0.0, "end": 0.0}]
+    for segment in segments:
+        speaker = extract_number(segment.speaker) + 1
+        if segments_concatenated[-1]['speaker'] != speaker:
+            segments_concatenated.append({"speaker": speaker, "begin": segment.start, "end": segment.end})
+        else:
+            segments_concatenated[-1]['end'] = segment.end
+    # print("Segments concatenated:")
+    # for entry in segments_concatenated:
+    #     print(f"Speaker {entry['speaker']}: {entry['begin']:.2f}s - {entry['end']:.2f}s")
+    return segments_concatenated
+
+
+def add_speaker_to_tokens(segments, tokens):
+    """
+    Assign speakers to tokens based on diarization segments, with punctuation-aware boundary adjustment.
+    """
+    punctuation_marks = {'.', '!', '?'}
+    punctuation_tokens = [token for token in tokens if token.text.strip() in punctuation_marks]
+    segments_concatenated = concatenate_speakers(segments)
+    for ind, segment in enumerate(segments_concatenated):
+        for i, punctuation_token in enumerate(punctuation_tokens):
+            if punctuation_token.start > segment['end']:
+                after_length = punctuation_token.start - segment['end']
+                before_length = segment['end'] - punctuation_tokens[i - 1].end
+                if before_length > after_length:
+                    segment['end'] = punctuation_token.start
+                    if i < len(punctuation_tokens) - 1 and ind + 1 < len(segments_concatenated):
+                        segments_concatenated[ind + 1]['begin'] = punctuation_token.start
+                else:
+                    segment['end'] = punctuation_tokens[i - 1].end
+                    if i < len(punctuation_tokens) - 1 and ind - 1 >= 0:
+                        segments_concatenated[ind - 1]['begin'] = punctuation_tokens[i - 1].end
+                break
+
+    last_end = 0.0
+    for token in tokens:
+        start = max(last_end + 0.01, token.start)
+        token.start = start
+        token.end = max(start, token.end)
+        last_end = token.end
+
+    ind_last_speaker = 0
+    for segment in segments_concatenated:
+        for i, token in enumerate(tokens[ind_last_speaker:]):
+            if token.end <= segment['end']:
+                token.speaker = segment['speaker']
+                ind_last_speaker = i + 1
+                # print(
+                #     f"Token '{token.text}' ('begin': {token.start:.2f}, 'end': {token.end:.2f}) "
+                #     f"assigned to Speaker {segment['speaker']} ('segment': {segment['begin']:.2f}-{segment['end']:.2f})"
+                # )
+            elif token.start > segment['end']:
+                break
+    return tokens
+
+
+def visualize_tokens(tokens):
+    conversation = [{"speaker": -1, "text": ""}]
+    for token in tokens:
+        speaker = conversation[-1]['speaker']
+        if token.speaker != speaker:
+            conversation.append({"speaker": token.speaker, "text": token.text})
+        else:
+            conversation[-1]['text'] += token.text
+    print("Conversation:")
+    for entry in conversation:
+        print(f"Speaker {entry['speaker']}: {entry['text']}")
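A toy illustration of the concatenate_speakers helper added above (a sketch only; the SimpleNamespace stand-ins for diart segments are invented here for demonstration and are not part of the diff):

# Consecutive segments from the same diart speaker are merged into one span,
# and speaker labels are reported 1-based via extract_number("speaker0") -> 0.
from types import SimpleNamespace

segments = [
    SimpleNamespace(speaker="speaker0", start=0.0, end=1.0),
    SimpleNamespace(speaker="speaker0", start=1.0, end=2.0),
    SimpleNamespace(speaker="speaker1", start=2.0, end=3.5),
]
# concatenate_speakers(segments) would return:
# [{'speaker': 1, 'begin': 0.0, 'end': 2.0}, {'speaker': 2, 'begin': 2.0, 'end': 3.5}]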
whisperlivekit/ffmpeg_manager.py (new file, 193 lines)
@@ -0,0 +1,193 @@
import asyncio
import logging
from enum import Enum
from typing import Optional, Callable
import contextlib

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

ERROR_INSTALL_INSTRUCTIONS = """
FFmpeg is not installed or not found in your system's PATH.
Please install FFmpeg to enable audio processing.

Installation instructions:

# Ubuntu/Debian:
sudo apt update && sudo apt install ffmpeg

# macOS (using Homebrew):
brew install ffmpeg

# Windows:
# 1. Download the latest static build from https://ffmpeg.org/download.html
# 2. Extract the archive (e.g., to C:\\FFmpeg).
# 3. Add the 'bin' directory (e.g., C:\\FFmpeg\\bin) to your system's PATH environment variable.

After installation, please restart the application.
"""

class FFmpegState(Enum):
    STOPPED = "stopped"
    STARTING = "starting"
    RUNNING = "running"
    RESTARTING = "restarting"
    FAILED = "failed"

class FFmpegManager:
    def __init__(self, sample_rate: int = 16000, channels: int = 1):
        self.sample_rate = sample_rate
        self.channels = channels

        self.process: Optional[asyncio.subprocess.Process] = None
        self._stderr_task: Optional[asyncio.Task] = None

        self.on_error_callback: Optional[Callable[[str], None]] = None

        self.state = FFmpegState.STOPPED
        self._state_lock = asyncio.Lock()

    async def start(self) -> bool:
        async with self._state_lock:
            if self.state != FFmpegState.STOPPED:
                logger.warning(f"FFmpeg already running in state: {self.state}")
                return False
            self.state = FFmpegState.STARTING

        try:
            cmd = [
                "ffmpeg",
                "-hide_banner",
                "-loglevel", "error",
                "-i", "pipe:0",
                "-f", "s16le",
                "-acodec", "pcm_s16le",
                "-ac", str(self.channels),
                "-ar", str(self.sample_rate),
                "pipe:1"
            ]

            self.process = await asyncio.create_subprocess_exec(
                *cmd,
                stdin=asyncio.subprocess.PIPE,
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE
            )

            self._stderr_task = asyncio.create_task(self._drain_stderr())

            async with self._state_lock:
                self.state = FFmpegState.RUNNING

            logger.info("FFmpeg started.")
            return True

        except FileNotFoundError:
            logger.error(ERROR_INSTALL_INSTRUCTIONS)
            async with self._state_lock:
                self.state = FFmpegState.FAILED
            if self.on_error_callback:
                await self.on_error_callback("ffmpeg_not_found")
            return False

        except Exception as e:
            logger.error(f"Error starting FFmpeg: {e}")
            async with self._state_lock:
                self.state = FFmpegState.FAILED
            if self.on_error_callback:
                await self.on_error_callback("start_failed")
            return False

    async def stop(self):
        async with self._state_lock:
            if self.state == FFmpegState.STOPPED:
                return
            self.state = FFmpegState.STOPPED

        if self.process:
            if self.process.stdin and not self.process.stdin.is_closing():
                self.process.stdin.close()
                await self.process.stdin.wait_closed()
            await self.process.wait()
            self.process = None

        if self._stderr_task:
            self._stderr_task.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._stderr_task

        logger.info("FFmpeg stopped.")

    async def write_data(self, data: bytes) -> bool:
        async with self._state_lock:
            if self.state != FFmpegState.RUNNING:
                logger.warning(f"Cannot write, FFmpeg state: {self.state}")
                return False

        try:
            self.process.stdin.write(data)
            await self.process.stdin.drain()
            return True
        except Exception as e:
            logger.error(f"Error writing to FFmpeg: {e}")
            if self.on_error_callback:
                await self.on_error_callback("write_error")
            return False

    async def read_data(self, size: int) -> Optional[bytes]:
        async with self._state_lock:
            if self.state != FFmpegState.RUNNING:
                logger.warning(f"Cannot read, FFmpeg state: {self.state}")
                return None

        try:
            data = await asyncio.wait_for(
                self.process.stdout.read(size),
                timeout=5.0
            )
            return data
        except asyncio.TimeoutError:
            logger.warning("FFmpeg read timeout.")
            return None
        except Exception as e:
            logger.error(f"Error reading from FFmpeg: {e}")
            if self.on_error_callback:
                await self.on_error_callback("read_error")
            return None

    async def get_state(self) -> FFmpegState:
        async with self._state_lock:
            return self.state

    async def restart(self) -> bool:
        async with self._state_lock:
            if self.state == FFmpegState.RESTARTING:
                logger.warning("Restart already in progress.")
                return False
            self.state = FFmpegState.RESTARTING

        logger.info("Restarting FFmpeg...")

        try:
            await self.stop()
            await asyncio.sleep(1)  # short delay before restarting
            return await self.start()
        except Exception as e:
            logger.error(f"Error during FFmpeg restart: {e}")
            async with self._state_lock:
                self.state = FFmpegState.FAILED
            if self.on_error_callback:
                await self.on_error_callback("restart_failed")
            return False

    async def _drain_stderr(self):
        try:
            while True:
                line = await self.process.stderr.readline()
                if not line:
                    break
                logger.debug(f"FFmpeg stderr: {line.decode(errors='ignore').strip()}")
        except asyncio.CancelledError:
            logger.info("FFmpeg stderr drain task cancelled.")
        except Exception as e:
            logger.error(f"Error draining FFmpeg stderr: {e}")
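A minimal usage sketch for the FFmpegManager added above (illustration only; the chunk size, the buffering behaviour, and the source of the compressed bytes are assumptions, and real callers like AudioProcessor interleave writes and reads instead of draining at the end):

import asyncio
from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState

async def decode(webm_bytes: bytes) -> bytes:
    mgr = FFmpegManager(sample_rate=16000, channels=1)
    if not await mgr.start():
        raise RuntimeError("FFmpeg could not be started")
    pcm = b""
    await mgr.write_data(webm_bytes)       # feed compressed audio to ffmpeg stdin
    while True:
        chunk = await mgr.read_data(4096)  # read decoded s16le PCM from stdout
        if not chunk:                      # None on timeout, b"" on EOF
            break
        pcm += chunk
    await mgr.stop()
    return pcm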
whisperlivekit/parse_args.py (new file, 253 lines; listing truncated below)
@@ -0,0 +1,253 @@
from argparse import ArgumentParser

def parse_args():
    parser = ArgumentParser(description="Whisper FastAPI Online Server")
    parser.add_argument("--host", type=str, default="localhost", help="The host address to bind the server to.")
    parser.add_argument("--port", type=int, default=8000, help="The port number to bind the server to.")
    parser.add_argument("--warmup-file", type=str, default=None, dest="warmup_file",
        help="""The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
        If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav. If False, no warmup is performed.""")
    parser.add_argument("--confidence-validation", action="store_true", help="Accelerates validation of tokens using confidence scores. Transcription will be faster but punctuation might be less accurate.")
    parser.add_argument("--diarization", action="store_true", default=False, help="Enable speaker diarization.")
    parser.add_argument("--punctuation-split", action="store_true", default=False, help="Use punctuation marks from transcription to improve speaker boundary detection. Requires both transcription and diarization to be enabled.")
    parser.add_argument("--segmentation-model", type=str, default="pyannote/segmentation-3.0", help="Hugging Face model ID for pyannote.audio segmentation model.")
    parser.add_argument("--embedding-model", type=str, default="pyannote/embedding", help="Hugging Face model ID for pyannote.audio embedding model.")
    parser.add_argument("--no-transcription", action="store_true", help="Disable transcription to only see live diarization results.")
    parser.add_argument("--min-chunk-size", type=float, default=0.5, help="Minimum audio chunk size in seconds. It waits up to this time to do processing. If the processing takes shorter time, it waits, otherwise it processes the whole segment that was received by this time.")
    parser.add_argument("--model", type=str, default="tiny", help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.")
    parser.add_argument("--model_cache_dir", type=str, default=None, help="Overriding the default model cache dir where models downloaded from the hub are saved")
    parser.add_argument("--model_dir", type=str, default=None, help="Dir where Whisper model.bin and other files are saved. This option overrides --model and --model_cache_dir parameter.")
    parser.add_argument("--lan", "--language", type=str, default="auto", help="Source language code, e.g. en,de,cs, or 'auto' for language detection.")
    parser.add_argument(
|
"--task",
|
||||||
|
type=str,
|
||||||
|
default="transcribe",
|
||||||
|
choices=["transcribe", "translate"],
|
||||||
|
help="Transcribe or translate.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--backend",
|
||||||
|
type=str,
|
||||||
|
default="faster-whisper",
|
||||||
|
choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api", "simulstreaming"],
|
||||||
|
help="Load only this backend for Whisper processing.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--vac",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
help="Use VAC = voice activity controller. Recommended. Requires torch.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--no-vad",
|
||||||
|
action="store_true",
|
||||||
|
help="Disable VAD (voice activity detection).",
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument(
|
||||||
|
"--buffer_trimming",
|
||||||
|
type=str,
|
||||||
|
default="segment",
|
||||||
|
choices=["sentence", "segment"],
|
||||||
|
help='Buffer trimming strategy -- trim completed sentences marked with punctuation mark and detected by sentence segmenter, or the completed segments returned by Whisper. Sentence segmenter must be installed for "sentence" option.',
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"--buffer_trimming_sec",
|
||||||
|
type=float,
|
||||||
|
default=15,
|
||||||
|
help="Buffer trimming length threshold in seconds. If buffer length is longer, trimming sentence/segment is triggered.",
|
||||||
|
)
|
||||||
|
parser.add_argument(
|
||||||
|
"-l",
|
||||||
|
"--log-level",
|
||||||
|
dest="log_level",
|
||||||
|
choices=["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"],
|
||||||
|
help="Set the log level",
|
||||||
|
default="DEBUG",
|
||||||
|
)
|
||||||
|
parser.add_argument("--ssl-certfile", type=str, help="Path to the SSL certificate file.", default=None)
|
||||||
|
parser.add_argument("--ssl-keyfile", type=str, help="Path to the SSL private key file.", default=None)
|
||||||
|
|
||||||
|
# SimulStreaming-specific arguments
|
||||||
|
simulstreaming_group = parser.add_argument_group('SimulStreaming arguments (only used with --backend simulstreaming)')
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--frame-threshold",
|
||||||
|
type=int,
|
||||||
|
default=25,
|
||||||
|
dest="frame_threshold",
|
||||||
|
help="Threshold for the attention-guided decoding. The AlignAtt policy will decode only until this number of frames from the end of audio. In frames: one frame is 0.02 seconds for large-v3 model.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--beams",
|
||||||
|
"-b",
|
||||||
|
type=int,
|
||||||
|
default=1,
|
||||||
|
help="Number of beams for beam search decoding. If 1, GreedyDecoder is used.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--decoder",
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
dest="decoder_type",
|
||||||
|
choices=["beam", "greedy"],
|
||||||
|
help="Override automatic selection of beam or greedy decoder. If beams > 1 and greedy: invalid.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--audio-max-len",
|
||||||
|
type=float,
|
||||||
|
default=30.0,
|
||||||
|
dest="audio_max_len",
|
||||||
|
help="Max length of the audio buffer, in seconds.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--audio-min-len",
|
||||||
|
type=float,
|
||||||
|
default=0.0,
|
||||||
|
dest="audio_min_len",
|
||||||
|
help="Skip processing if the audio buffer is shorter than this length, in seconds. Useful when the --min-chunk-size is small.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--cif-ckpt-path",
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
dest="cif_ckpt_path",
|
||||||
|
help="The file path to the Simul-Whisper's CIF model checkpoint that detects whether there is end of word at the end of the chunk. If not, the last decoded space-separated word is truncated because it is often wrong -- transcribing a word in the middle. The CIF model adapted for the Whisper model version should be used. Find the models in https://github.com/backspacetg/simul_whisper/tree/main/cif_models . Note that there is no model for large-v3.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--never-fire",
|
||||||
|
action="store_true",
|
||||||
|
default=False,
|
||||||
|
dest="never_fire",
|
||||||
|
help="Override the CIF model. If True, the last word is NEVER truncated, no matter what the CIF model detects. If False: if CIF model path is set, the last word is SOMETIMES truncated, depending on the CIF detection. Otherwise, if the CIF model path is not set, the last word is ALWAYS trimmed.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--init-prompt",
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
dest="init_prompt",
|
||||||
|
help="Init prompt for the model. It should be in the target language.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--static-init-prompt",
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
dest="static_init_prompt",
|
||||||
|
help="Do not scroll over this text. It can contain terminology that should be relevant over all document.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--max-context-tokens",
|
||||||
|
type=int,
|
||||||
|
default=None,
|
||||||
|
dest="max_context_tokens",
|
||||||
|
help="Max context tokens for the model. Default is 0.",
|
||||||
|
)
|
||||||
|
|
||||||
|
simulstreaming_group.add_argument(
|
||||||
|
"--model-path",
|
||||||
|
type=str,
|
||||||
|
default=None,
|
||||||
|
dest="model_path",
|
||||||
|
help="Direct path to the SimulStreaming Whisper .pt model file. Overrides --model for SimulStreaming backend.",
|
||||||
|
)
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
args.transcription = not args.no_transcription
|
||||||
|
args.vad = not args.no_vad
|
||||||
|
delattr(args, 'no_transcription')
|
||||||
|
delattr(args, 'no_vad')
|
||||||
|
|
||||||
|
return args
|
||||||
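
A small sketch of how the namespace produced by parse_args() above is typically consumed. The import path matches the new file, and the attribute checks simply mirror the defaults and the post-processing (--no-vad and --no-transcription folded into args.vad and args.transcription) visible at the end of the function:

import sys
from whisperlivekit.parse_args import parse_args

# simulate a command line; argv[0] is ignored by argparse
sys.argv = ["whisperlivekit-server", "--model", "small", "--language", "en", "--no-vad"]
args = parse_args()

assert args.model == "small"
assert args.lan == "en"            # --language is an alias of --lan
assert args.vad is False           # derived from --no-vad; the raw flag is removed
assert args.transcription is True  # derived from the absence of --no-transcription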
whisperlivekit/remove_silences.py (Normal file, 103 lines)
@@ -0,0 +1,103 @@
from whisperlivekit.timed_objects import ASRToken
import re

MIN_SILENCE_DURATION = 4  # in seconds
END_SILENCE_DURATION = 8  # in seconds. Keep it large enough to avoid false positives when the model lags a lot.


def blank_to_silence(tokens):
    full_string = ''.join([t.text for t in tokens])
    patterns = [re.compile(r'(?:\s*\[BLANK_AUDIO\]\s*)+'), re.compile(r'(?:\s*\[typing\]\s*)+')]
    matches = []
    for pattern in patterns:
        for m in pattern.finditer(full_string):
            matches.append({
                'start': m.start(),
                'end': m.end()
            })
    if matches:
        # cleaned = pattern.sub(' ', full_string).strip()
        # print("Cleaned:", cleaned)
        cumulated_len = 0
        silence_token = None
        cleaned_tokens = []
        for token in tokens:
            if matches:
                start = cumulated_len
                end = cumulated_len + len(token.text)
                cumulated_len = end
                if start >= matches[0]['start'] and end <= matches[0]['end']:
                    if silence_token:  # previous token was already silence
                        silence_token.start = min(silence_token.start, token.start)
                        silence_token.end = max(silence_token.end, token.end)
                    else:  # new silence
                        silence_token = ASRToken(
                            start=token.start,
                            end=token.end,
                            speaker=-2,
                            probability=0.95
                        )
                else:
                    if silence_token:  # there was silence, but no more
                        if silence_token.end - silence_token.start >= MIN_SILENCE_DURATION:
                            cleaned_tokens.append(
                                silence_token
                            )
                        silence_token = None
                        matches.pop(0)
                    cleaned_tokens.append(token)
        # print(cleaned_tokens)
        return cleaned_tokens
    return tokens


def no_token_to_silence(tokens):
    new_tokens = []
    silence_token = None
    for token in tokens:
        if token.speaker == -2:
            if new_tokens and new_tokens[-1].speaker == -2:  # token is silence and the previous one too
                new_tokens[-1].end = token.end
            else:
                new_tokens.append(token)

        last_end = new_tokens[-1].end if new_tokens else 0.0
        if token.start - last_end >= MIN_SILENCE_DURATION:  # token is not silence, but there is an important gap
            if new_tokens and new_tokens[-1].speaker == -2:
                new_tokens[-1].end = token.start
            else:
                silence_token = ASRToken(
                    start=last_end,
                    end=token.start,
                    speaker=-2,
                    probability=0.95
                )
                new_tokens.append(silence_token)

        if token.speaker != -2:
            new_tokens.append(token)
    return new_tokens


def ends_with_silence(tokens, current_time):
    if not tokens:
        return []
    last_token = tokens[-1]
    if tokens and current_time - last_token.end >= END_SILENCE_DURATION:
        if last_token.speaker == -2:
            last_token.end = current_time
        else:
            tokens.append(
                ASRToken(
                    start=tokens[-1].end,
                    end=current_time,
                    speaker=-2,
                    probability=0.95
                )
            )
    return tokens


def handle_silences(tokens, current_time):
    tokens = blank_to_silence(tokens)  # useful for the simulstreaming backend, which tends to generate [BLANK_AUDIO] text
    tokens = no_token_to_silence(tokens)
    tokens = ends_with_silence(tokens, current_time)
    return tokens
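
A short sketch of how the helpers above compose. The ASRToken fields used here (start, end, text, speaker, probability) follow the calls in this file and in backend.py; the exact constructor defaults of ASRToken (for example, the speaker field when it is not given) are an assumption:

from whisperlivekit.timed_objects import ASRToken
from whisperlivekit.remove_silences import handle_silences

tokens = [
    ASRToken(start=0.0, end=0.4, text=" Hello", probability=0.9),
    ASRToken(start=0.4, end=0.8, text=" there.", probability=0.9),
    # nothing was transcribed between 0.8 s and 7.0 s: no_token_to_silence()
    # inserts a speaker == -2 token covering that gap
    ASRToken(start=7.0, end=7.5, text=" Back.", probability=0.9),
]

cleaned = handle_silences(tokens, current_time=20.0)
# the trailing stretch (7.5 s .. 20.0 s) exceeds END_SILENCE_DURATION,
# so the list now also ends with an extra speaker == -2 token
for t in cleaned:
    print(t.start, t.end, getattr(t, "speaker", None), t.text)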
whisperlivekit/simul_whisper/__init__.py (Normal file, 6 lines)
@@ -0,0 +1,6 @@
from .backend import SimulStreamingASR, SimulStreamingOnlineProcessor

__all__ = [
    "SimulStreamingASR",
    "SimulStreamingOnlineProcessor",
]
whisperlivekit/simul_whisper/backend.py (Normal file, 223 lines)
@@ -0,0 +1,223 @@
import sys
import numpy as np
import logging
from typing import List, Tuple, Optional
from whisperlivekit.timed_objects import ASRToken, Transcript
from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
from .whisper import load_model, tokenizer
import os

logger = logging.getLogger(__name__)

try:
    import torch
    from whisperlivekit.simul_whisper.config import AlignAttConfig
    from whisperlivekit.simul_whisper.simul_whisper import PaddedAlignAttWhisper
    from whisperlivekit.simul_whisper.whisper import tokenizer
except ImportError as e:
    raise ImportError(
        """SimulStreaming dependencies are not available.
        Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".""")


class SimulStreamingOnlineProcessor:
    SAMPLING_RATE = 16000

    def __init__(
        self,
        asr,
        logfile=sys.stderr,
        warmup_file=None
    ):
        self.asr = asr
        self.logfile = logfile
        self.is_last = False
        self.beg = 0.0
        self.end = 0.0
        self.cumulative_audio_duration = 0.0

        self.committed: List[ASRToken] = []
        self.last_result_tokens: List[ASRToken] = []
        self.model = PaddedAlignAttWhisper(
            cfg=asr.cfg,
            loaded_model=asr.whisper_model)
        if asr.tokenizer:
            self.model.tokenizer = asr.tokenizer

    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
        """Append an audio chunk to be processed by SimulStreaming."""

        # Convert numpy array to torch tensor
        audio_tensor = torch.from_numpy(audio).float()

        # Update timing
        chunk_duration = len(audio) / self.SAMPLING_RATE
        self.cumulative_audio_duration += chunk_duration

        if audio_stream_end_time is not None:
            self.end = audio_stream_end_time
        else:
            self.end = self.cumulative_audio_duration
        self.model.insert_audio(audio_tensor)

    def get_buffer(self):
        return Transcript(
            start=None,
            end=None,
            text='',
            probability=None
        )

    def timestamped_text(self, tokens, generation):
        # From the simulstreaming repo. self.model to self.asr.model
        pr = generation["progress"]
        if "result" not in generation:
            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
        else:
            split_words, split_tokens = generation["result"]["split_words"], generation["result"]["split_tokens"]

        frames = [p["most_attended_frames"][0] for p in pr]
        tokens = tokens.copy()
        ret = []
        for sw, st in zip(split_words, split_tokens):
            b = None
            for stt in st:
                t, f = tokens.pop(0), frames.pop(0)
                if t != stt:
                    raise ValueError(f"Token mismatch: {t} != {stt} at frame {f}.")
                if b is None:
                    b = f
                e = f
            out = (b*0.02, e*0.02, sw)
            ret.append(out)
            logger.debug(f"TS-WORD:\t{' '.join(map(str, out))}")
        return ret

    def process_iter(self) -> Tuple[List[ASRToken], float]:
        """
        Process accumulated audio chunks using SimulStreaming.

        Returns a tuple: (list of committed ASRToken objects, time up to which audio has been processed).
        """
        try:
            tokens, generation_progress = self.model.infer(is_last=self.is_last)
            ts_words = self.timestamped_text(tokens, generation_progress)

            new_tokens = []
            for ts_word in ts_words:
                start, end, word = ts_word
                token = ASRToken(
                    start=start,
                    end=end,
                    text=word,
                    probability=0.95  # fake prob. Maybe we can extract it from the model?
                )
                new_tokens.append(token)
            self.committed.extend(new_tokens)

            return new_tokens, self.end

        except Exception as e:
            logger.exception(f"SimulStreaming processing error: {e}")
            return [], self.end

    def warmup(self, audio, init_prompt=""):
        """Warm up the SimulStreaming model."""
        try:
            self.model.insert_audio(audio)
            self.model.infer(True)
            self.model.refresh_segment(complete=True)
            logger.info("SimulStreaming model warmed up successfully")
        except Exception as e:
            logger.exception(f"SimulStreaming warmup failed: {e}")


class SimulStreamingASR:
    """SimulStreaming backend with AlignAtt policy."""
    sep = ""

    def __init__(self, lan, modelsize=None, cache_dir=None, model_dir=None, logfile=sys.stderr, **kwargs):
        logger.warning(SIMULSTREAMING_LICENSE)
        self.logfile = logfile
        self.transcribe_kargs = {}
        self.original_language = None if lan == "auto" else lan

        self.model_path = kwargs.get('model_path', './large-v3.pt')
        self.frame_threshold = kwargs.get('frame_threshold', 25)
        self.audio_max_len = kwargs.get('audio_max_len', 30.0)
        self.audio_min_len = kwargs.get('audio_min_len', 0.0)
        self.segment_length = kwargs.get('segment_length', 0.5)
        self.beams = kwargs.get('beams', 1)
        self.decoder_type = kwargs.get('decoder_type', 'greedy' if self.beams == 1 else 'beam')
        self.task = kwargs.get('task', 'transcribe')
        self.cif_ckpt_path = kwargs.get('cif_ckpt_path', None)
        self.never_fire = kwargs.get('never_fire', False)
        self.init_prompt = kwargs.get('init_prompt', None)
        self.static_init_prompt = kwargs.get('static_init_prompt', None)
        self.max_context_tokens = kwargs.get('max_context_tokens', None)

        if model_dir is not None:
            self.model_path = model_dir
        elif modelsize is not None:
            model_mapping = {
                'tiny': './tiny.pt',
                'base': './base.pt',
                'small': './small.pt',
                'medium': './medium.pt',
                'medium.en': './medium.en.pt',
                'large-v1': './large-v1.pt',
                'base.en': './base.en.pt',
                'small.en': './small.en.pt',
                'tiny.en': './tiny.en.pt',
                'large-v2': './large-v2.pt',
                'large-v3': './large-v3.pt',
                'large': './large-v3.pt'
            }
            self.model_path = model_mapping.get(modelsize, f'./{modelsize}.pt')

        self.model = self.load_model(modelsize)

        # Set up tokenizer for translation if needed
        if self.task == "translate":
            self.tokenizer = self.set_translate_task()
        else:
            self.tokenizer = None


    def load_model(self, modelsize):
        self.cfg = AlignAttConfig(
            model_path=self.model_path,
            segment_length=self.segment_length,
            frame_threshold=self.frame_threshold,
            language=self.original_language,
            audio_max_len=self.audio_max_len,
            audio_min_len=self.audio_min_len,
            cif_ckpt_path=self.cif_ckpt_path,
            decoder_type="beam",
            beam_size=self.beams,
            task=self.task,
            never_fire=self.never_fire,
            init_prompt=self.init_prompt,
            max_context_tokens=self.max_context_tokens,
            static_init_prompt=self.static_init_prompt,
        )
        model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
        model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
        self.whisper_model = load_model(name=model_name, download_root=model_path)


    def set_translate_task(self):
        """Set up the translation task."""
        return tokenizer.get_tokenizer(
            multilingual=True,
            language=self.model.cfg.language,
            num_languages=self.model.model.num_languages,
            task="translate"
        )

    def transcribe(self, audio):
        """
        Only used for warmup. It is a direct Whisper call, not a SimulStreaming call.
        """
        self.whisper_model.transcribe(audio, language=self.original_language)
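
A rough sketch of how the two classes above are wired together. It assumes a ./large-v3.pt checkpoint is already on disk (the default --model-path expects that location) and that 16 kHz mono float32 audio is fed in, which is what insert_audio_chunk() converts to a tensor; everything else follows the constructor signatures shown in the diff:

import numpy as np
from whisperlivekit.simul_whisper.backend import SimulStreamingASR, SimulStreamingOnlineProcessor

asr = SimulStreamingASR(lan="en", model_path="./large-v3.pt", frame_threshold=25, beams=1)
online = SimulStreamingOnlineProcessor(asr)

# feed 0.5 s of 16 kHz audio at a time (here: silence, just to show the call pattern)
chunk = np.zeros(8000, dtype=np.float32)
online.insert_audio_chunk(chunk, audio_stream_end_time=0.5)

new_tokens, processed_up_to = online.process_iter()
for tok in new_tokens:
    print(f"[{tok.start:.2f}-{tok.end:.2f}] {tok.text}")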
whisperlivekit/simul_whisper/beam.py (Normal file, 17 lines)
@@ -0,0 +1,17 @@
from torch import Tensor

from .whisper.decoding import PyTorchInference


# extension of PyTorchInference for beam search
class BeamPyTorchInference(PyTorchInference):

    def _kv_modules(self):
        key_modules = [block.attn.key.cache_id for block in self.model.decoder.blocks]
        value_modules = [block.attn.value.cache_id for block in self.model.decoder.blocks]
        return key_modules + value_modules

    def rearrange_kv_cache(self, source_indices):
        if source_indices != list(range(len(source_indices))):
            for module_cache_id in self._kv_modules():
                self.kv_cache[module_cache_id] = self.kv_cache[module_cache_id][source_indices].detach()

    def logits(self, tokens: Tensor, audio_features: Tensor) -> Tensor:
        return self.model.decoder(tokens, audio_features, kv_cache=self.kv_cache)
whisperlivekit/simul_whisper/config.py (Normal file, 29 lines)
@@ -0,0 +1,29 @@
# This code was originally in simul_whisper/transcriber/simul_whisper.py. It has been heavily adapted for SimulStreaming.

from dataclasses import dataclass, field
from typing import Literal


@dataclass
class SimulWhisperConfig:
    '''Options common to all simul policies that could be implemented in SimulWhisper.'''
    model_path: str
    language: str = field(default="zh")
    nonspeech_prob: float = 0.5
    audio_min_len: float = 1.0
    decoder_type: Literal["greedy", "beam"] = "greedy"
    beam_size: int = 5
    task: Literal["transcribe", "translate"] = "transcribe"
    init_prompt: str = field(default=None)
    static_init_prompt: str = field(default=None)
    max_context_tokens: int = field(default=None)


@dataclass
class AlignAttConfig(SimulWhisperConfig):
    '''Options specific to the AlignAtt policy.'''
    eval_data_path: str = "tmp"
    segment_length: float = field(default=1.0, metadata={"help": "in seconds"})
    frame_threshold: int = 4
    rewind_threshold: int = 200
    audio_max_len: float = 30.0
    cif_ckpt_path: str = ""
    never_fire: bool = False
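
Since AlignAttConfig inherits every SimulWhisperConfig field, a config for the backend above can be built in a single call; the only required argument is model_path, and everything else falls back to the defaults listed in the two dataclasses:

from whisperlivekit.simul_whisper.config import AlignAttConfig

cfg = AlignAttConfig(
    model_path="./large-v3.pt",   # required field from SimulWhisperConfig
    language="en",
    segment_length=0.5,           # seconds of audio per decoding step
    frame_threshold=25,           # AlignAtt: stop decoding this many frames from the end
    audio_max_len=30.0,
    decoder_type="greedy",
)
print(cfg.beam_size, cfg.rewind_threshold)  # defaults: 5 and 200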
whisperlivekit/simul_whisper/eow_detection.py (Normal file, 65 lines)
@@ -0,0 +1,65 @@
import torch

# code for the end-of-word detection based on the CIF model proposed in Simul-Whisper


def load_cif(cfg, n_audio_state, device):
    """cfg: AlignAttConfig, n_audio_state: int, device: torch.device"""
    cif_linear = torch.nn.Linear(n_audio_state, 1)
    if cfg.cif_ckpt_path is None or not cfg.cif_ckpt_path:
        if cfg.never_fire:
            never_fire = True
            always_fire = False
        else:
            always_fire = True
            never_fire = False
    else:
        always_fire = False
        never_fire = cfg.never_fire
        checkpoint = torch.load(cfg.cif_ckpt_path)
        cif_linear.load_state_dict(checkpoint)
        cif_linear.to(device)
    return cif_linear, always_fire, never_fire


# from https://github.com/dqqcasia/mosst/blob/master/fairseq/models/speech_to_text/convtransformer_wav2vec_cif.py
def resize(alphas, target_lengths, threshold=0.999):
    """
    alpha in thresh=1.0 | (0.0, +0.21)
    target_lengths: if None, apply round and resize, else apply scaling
    """
    # sum
    _num = alphas.sum(-1)
    num = target_lengths.float()
    # scaling
    _alphas = alphas * (num / _num)[:, None].repeat(1, alphas.size(1))
    # remove attention values that exceed the threshold
    count = 0
    while len(torch.where(_alphas > threshold)[0]):
        count += 1
        if count > 10:
            break
        xs, ys = torch.where(_alphas > threshold)
        for x, y in zip(xs, ys):
            if _alphas[x][y] >= threshold:
                mask = _alphas[x].ne(0).float()
                mean = 0.5 * _alphas[x].sum() / mask.sum()
                _alphas[x] = _alphas[x] * 0.5 + mean * mask

    return _alphas, _num


def fire_at_boundary(chunked_encoder_feature: torch.Tensor, cif_linear):
    content_mel_len = chunked_encoder_feature.shape[1]  # B, T, D
    alphas = cif_linear(chunked_encoder_feature).squeeze(dim=2)  # B, T
    alphas = torch.sigmoid(alphas)
    decode_length = torch.round(alphas.sum(-1)).int()
    alphas, _ = resize(alphas, decode_length)
    alphas = alphas.squeeze(0)  # (T, )
    threshold = 0.999
    integrate = torch.cumsum(alphas[:-1], dim=0)  # ignore the peak value at the end of the content chunk
    exceed_count = integrate[-1] // threshold
    integrate = integrate - exceed_count*1.0  # subtract 1 every time the integrated value exceeds the threshold
    important_positions = (integrate >= 0).nonzero(as_tuple=True)[0]
    if important_positions.numel() == 0:
        return False
    else:
        return important_positions[0] >= content_mel_len-2
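
A toy sketch of the CIF boundary check above, run on random encoder features rather than real Whisper output; the shapes follow the (B, T, D) convention noted in fire_at_boundary(), and the 1280 encoder width is just an example value (it matches the large models):

import torch
from whisperlivekit.simul_whisper.eow_detection import fire_at_boundary

n_audio_state = 1280                           # example encoder width
cif_linear = torch.nn.Linear(n_audio_state, 1)  # untrained stand-in for the real CIF checkpoint

features = torch.randn(1, 120, n_audio_state)   # B=1, T=120 frames (~2.4 s), D=n_audio_state
at_word_boundary = fire_at_boundary(features, cif_linear)
print(at_word_boundary)  # True when the integrated CIF weight last "fires" near the chunk end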
whisperlivekit/simul_whisper/generation_progress.py (Normal file, 43 lines)
@@ -0,0 +1,43 @@
class Tokens:
    def __init__(self, tokens):
        self.tokens = tokens

    # def clone(self):
    #     return Tokens(self.tokens.clone())

    def __str__(self):
        return str(self.tokens.tolist())

    def __repr__(self):
        return self.__str__()


class BeamTokens(Tokens):
    def __init__(self, tokens, beam_size):
        self.tokens = tokens
        self.beam_size = beam_size

    def clone(self):
        return BeamTokens(self.tokens.clone(), self.beam_size)

    def __str__(self):
        return f"BeamTokens({self.tokens.tolist()}, beam_size={self.beam_size})"

    def __repr__(self):
        return self.__str__()

    def as_text(self, tokenizer):
        return tokenizer.decode(self.tokens)


class Logits(Tokens):
    def __init__(self, logits):
        super().__init__(logits)

    # def clone(self):
    #     return Logits(self.tokens.clone(), self.beam_size)

    def __str__(self):
        return f"Logits({self.tokens.shape})"

    def __repr__(self):
        return self.__str__()
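
These wrappers only exist to keep debug logs compact: raw tensors print as full dumps, while the classes above print a short list or a shape. A minimal sketch (the token ids are arbitrary illustrative values):

import torch
from whisperlivekit.simul_whisper.generation_progress import Tokens, BeamTokens, Logits

ids = torch.tensor([50258, 50259, 50359])
print(Tokens(ids))                    # [50258, 50259, 50359]
print(BeamTokens(ids, beam_size=5))   # BeamTokens([50258, 50259, 50359], beam_size=5)
print(Logits(torch.randn(5, 51866)))  # Logits(torch.Size([5, 51866]))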
whisperlivekit/simul_whisper/license_simulstreaming.py (Normal file, 5 lines)
@@ -0,0 +1,5 @@
SIMULSTREAMING_LICENSE = """
SimulStreaming backend is dual-licensed:
• Non-Commercial Use: PolyForm Noncommercial License 1.0.0.
• Commercial Use: Check SimulStreaming README (github.com/ufal/SimulStreaming) for more details.
"""
whisperlivekit/simul_whisper/simul_whisper.py (Normal file, 602 lines)
@@ -0,0 +1,602 @@
|
|||||||
|
# This code was originally in simul_whisper/transcriber/simul_whisper.py . It is adapted a lot for SimulStreaming.
|
||||||
|
|
||||||
|
import os
|
||||||
|
import logging
|
||||||
|
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
|
||||||
|
from .whisper import load_model, DecodingOptions, tokenizer
|
||||||
|
from .config import AlignAttConfig
|
||||||
|
from .whisper.audio import log_mel_spectrogram, TOKENS_PER_SECOND, pad_or_trim, N_SAMPLES, N_FRAMES
|
||||||
|
from .whisper.timing import median_filter
|
||||||
|
from .whisper.decoding import GreedyDecoder, BeamSearchDecoder, SuppressTokens, detect_language
|
||||||
|
from .beam import BeamPyTorchInference
|
||||||
|
from .eow_detection import fire_at_boundary, load_cif
|
||||||
|
import os
|
||||||
|
|
||||||
|
from .token_buffer import TokenBuffer
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
from .generation_progress import *
|
||||||
|
|
||||||
|
DEC_PAD = 50257
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
import sys
|
||||||
|
import wave
|
||||||
|
|
||||||
|
# New features added to the original version of Simul-Whisper:
|
||||||
|
# - large-v3 model support
|
||||||
|
# - translation support
|
||||||
|
# - beam search
|
||||||
|
# - prompt -- static vs. non-static
|
||||||
|
# - context
|
||||||
|
class PaddedAlignAttWhisper:
|
||||||
|
def __init__(self, cfg: AlignAttConfig, loaded_model=None) -> None:
|
||||||
|
self.log_segments = 0
|
||||||
|
model_name = os.path.basename(cfg.model_path).replace(".pt", "")
|
||||||
|
model_path = os.path.dirname(os.path.abspath(cfg.model_path))
|
||||||
|
if loaded_model:
|
||||||
|
self.model = loaded_model
|
||||||
|
else:
|
||||||
|
self.model = load_model(name=model_name, download_root=model_path)
|
||||||
|
|
||||||
|
logger.info(f"Model dimensions: {self.model.dims}")
|
||||||
|
|
||||||
|
self.decode_options = DecodingOptions(
|
||||||
|
language = cfg.language,
|
||||||
|
without_timestamps = True,
|
||||||
|
task=cfg.task
|
||||||
|
)
|
||||||
|
self.tokenizer_is_multilingual = not model_name.endswith(".en")
|
||||||
|
self.create_tokenizer(cfg.language if cfg.language != "auto" else None)
|
||||||
|
self.detected_language = cfg.language if cfg.language != "auto" else None
|
||||||
|
|
||||||
|
self.max_text_len = self.model.dims.n_text_ctx
|
||||||
|
self.num_decoder_layers = len(self.model.decoder.blocks)
|
||||||
|
self.cfg = cfg
|
||||||
|
|
||||||
|
# model to detect end-of-word boundary at the end of the segment
|
||||||
|
self.CIFLinear, self.always_fire, self.never_fire = load_cif(cfg,
|
||||||
|
n_audio_state=self.model.dims.n_audio_state,
|
||||||
|
device=self.model.device)
|
||||||
|
|
||||||
|
# install hooks to access encoder-decoder attention
|
||||||
|
self.dec_attns = []
|
||||||
|
def layer_hook(module, net_input, net_output):
|
||||||
|
# net_output[1]: B*num_head*token_len*audio_len
|
||||||
|
t = F.softmax(net_output[1], dim=-1)
|
||||||
|
self.dec_attns.append(t.squeeze(0))
|
||||||
|
for b in self.model.decoder.blocks:
|
||||||
|
b.cross_attn.register_forward_hook(layer_hook)
|
||||||
|
|
||||||
|
self.kv_cache = {}
|
||||||
|
def kv_hook(module: torch.nn.Linear, _, net_output: torch.Tensor):
|
||||||
|
if module.cache_id not in self.kv_cache or net_output.shape[1] > self.max_text_len:
|
||||||
|
# save as-is, for the first token or cross attention
|
||||||
|
self.kv_cache[module.cache_id] = net_output
|
||||||
|
else:
|
||||||
|
x = self.kv_cache[module.cache_id]
|
||||||
|
self.kv_cache[module.cache_id] = torch.cat([x, net_output], dim=1).detach()
|
||||||
|
return self.kv_cache[module.cache_id]
|
||||||
|
|
||||||
|
for i,b in enumerate(self.model.decoder.blocks):
|
||||||
|
b.attn.key.register_forward_hook(kv_hook)
|
||||||
|
b.attn.value.register_forward_hook(kv_hook)
|
||||||
|
b.cross_attn.key.register_forward_hook(kv_hook)
|
||||||
|
b.cross_attn.value.register_forward_hook(kv_hook)
|
||||||
|
|
||||||
|
self.align_source = {}
|
||||||
|
self.num_align_heads = 0
|
||||||
|
for layer_rank, head_id in self.model.alignment_heads.indices().T:
|
||||||
|
layer_rank = layer_rank.item()
|
||||||
|
heads = self.align_source.get(layer_rank, [])
|
||||||
|
heads.append((self.num_align_heads, head_id.item()))
|
||||||
|
self.align_source[layer_rank] = heads
|
||||||
|
self.num_align_heads += 1
|
||||||
|
|
||||||
|
|
||||||
|
# tokens to be suppressed from decoding, to prevent hallucinations
|
||||||
|
suppress_tokens = [
|
||||||
|
self.tokenizer.transcribe,
|
||||||
|
self.tokenizer.translate,
|
||||||
|
self.tokenizer.sot,
|
||||||
|
self.tokenizer.sot_prev,
|
||||||
|
self.tokenizer.sot_lm,
|
||||||
|
# self.tokenizer.eot
|
||||||
|
self.tokenizer.no_timestamps, # added by DM
|
||||||
|
] + list(self.tokenizer.all_language_tokens) # added by DM
|
||||||
|
if self.tokenizer.no_speech is not None:
|
||||||
|
suppress_tokens.append(self.tokenizer.no_speech)
|
||||||
|
suppress_tokens = tuple(sorted(set(suppress_tokens)))
|
||||||
|
logger.debug(f"Suppress tokens: {suppress_tokens}")
|
||||||
|
sup_tokens = SuppressTokens(suppress_tokens)
|
||||||
|
self.suppress_tokens = lambda logits: sup_tokens.apply(logits, None)
|
||||||
|
# blank tokens are suppresed for new segments near the line 334
|
||||||
|
|
||||||
|
# it's going to be regenerated after lang id
|
||||||
|
self.segments = []
|
||||||
|
self.init_tokens()
|
||||||
|
|
||||||
|
self.last_attend_frame = -self.cfg.rewind_threshold
|
||||||
|
|
||||||
|
if self.cfg.max_context_tokens is None:
|
||||||
|
self.max_context_tokens = self.max_text_len
|
||||||
|
else:
|
||||||
|
self.max_context_tokens = self.cfg.max_context_tokens
|
||||||
|
self.init_context()
|
||||||
|
|
||||||
|
# decoder type: greedy or beam
|
||||||
|
if cfg.decoder_type == "greedy":
|
||||||
|
logger.info("Using greedy decoder")
|
||||||
|
self.token_decoder = GreedyDecoder(0.0, self.tokenizer.eot)
|
||||||
|
self.decoder_type = "greedy"
|
||||||
|
|
||||||
|
elif cfg.decoder_type == "beam":
|
||||||
|
self.decoder_type = "beam"
|
||||||
|
self.inference = BeamPyTorchInference(self.model, self.initial_token_length)
|
||||||
|
self.inference.kv_cache = self.kv_cache
|
||||||
|
|
||||||
|
self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
|
||||||
|
|
||||||
|
def create_tokenizer(self, language=None):
|
||||||
|
self.tokenizer = tokenizer.get_tokenizer(
|
||||||
|
multilingual=self.tokenizer_is_multilingual,
|
||||||
|
language=language,
|
||||||
|
num_languages=self.model.num_languages,
|
||||||
|
task=self.decode_options.task
|
||||||
|
)
|
||||||
|
|
||||||
|
def init_context(self):
|
||||||
|
kw = {'tokenizer': self.tokenizer,
|
||||||
|
'device': self.model.device,
|
||||||
|
'prefix_token_ids': [self.tokenizer.sot_prev]}
|
||||||
|
self.context = TokenBuffer.empty(**kw)
|
||||||
|
if self.cfg.static_init_prompt is not None:
|
||||||
|
self.context = TokenBuffer.from_text(self.cfg.static_init_prompt, **kw)
|
||||||
|
if self.cfg.init_prompt is not None:
|
||||||
|
self.context.text += self.cfg.init_prompt
|
||||||
|
|
||||||
|
def init_tokens(self):
|
||||||
|
logger.debug(f"init tokens, {len(self.segments)}")
|
||||||
|
# init tokens (mandatory prompt)
|
||||||
|
self.initial_tokens = torch.tensor(
|
||||||
|
self.tokenizer.sot_sequence_including_notimestamps,
|
||||||
|
dtype=torch.long,
|
||||||
|
device=self.model.device).unsqueeze(0)
|
||||||
|
self.initial_token_length = self.initial_tokens.shape[1]
|
||||||
|
self.sot_index = self.tokenizer.sot_sequence.index(self.tokenizer.sot)
|
||||||
|
# self.segments = []
|
||||||
|
logger.debug(f"init tokens after, {len(self.segments)}")
|
||||||
|
self.tokens = [self.initial_tokens]
|
||||||
|
|
||||||
|
def trim_context(self):
|
||||||
|
logger.info("Trimming context")
|
||||||
|
c = len(self.context.as_token_ids()) - len(self.context.prefix_token_ids)
|
||||||
|
# logger.debug(f"c= {len(self.context.as_token_ids())}, {len(self.context.prefix_token_ids)}")
|
||||||
|
logger.info(f"Context text: {self.context.as_text()}")
|
||||||
|
# logger.debug(f"Context tensor: {self.context.as_tensor()}")
|
||||||
|
l = sum(t.shape[1] for t in self.tokens) + c
|
||||||
|
# logger.debug(f"len {l}, c {c}, max_context_tokens {self.max_context_tokens}")
|
||||||
|
if self.cfg.static_init_prompt is None:
|
||||||
|
after = 0
|
||||||
|
else:
|
||||||
|
after = len(self.cfg.static_init_prompt)
|
||||||
|
# logger.debug(f"len {l}, c {c}, max_context_tokens {self.max_context_tokens}")
|
||||||
|
while c > self.max_context_tokens or l > self.max_text_len - 20:
|
||||||
|
t = self.context.trim_words(after=after)
|
||||||
|
l -= t
|
||||||
|
c -= t
|
||||||
|
logger.debug(f"len {l}, c {c}, max_context_tokens {self.max_context_tokens}")
|
||||||
|
if t == 0:
|
||||||
|
break
|
||||||
|
# logger.debug(f"len {l}, c {c}, max_context_tokens {self.max_context_tokens}")
|
||||||
|
logger.info(f"Context after trim: {self.context.text} (len: {l})")
|
||||||
|
|
||||||
|
|
||||||
|
def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor) -> torch.Tensor:
|
||||||
|
if self.cfg.decoder_type == "greedy":
|
||||||
|
logit = self.model.decoder(tokens, audio_features, kv_cache=self.kv_cache)
|
||||||
|
else:
|
||||||
|
logger.debug(f"Logits shape: {tokens.shape}")
|
||||||
|
logit = self.inference.logits(tokens, audio_features)
|
||||||
|
return logit
|
||||||
|
|
||||||
|
|
||||||
|
def refresh_segment(self, complete=False):
|
||||||
|
|
||||||
|
logger.debug("Refreshing segment:")
|
||||||
|
self.init_tokens()
|
||||||
|
self.last_attend_frame = -self.cfg.rewind_threshold
|
||||||
|
self.detected_language = None
|
||||||
|
self.init_context()
|
||||||
|
logger.debug(f"Context: {self.context}")
|
||||||
|
if not complete and len(self.segments) > 2:
|
||||||
|
logger.debug("keeping last two segments because they are and it is not complete.")
|
||||||
|
self.segments = self.segments[-2:]
|
||||||
|
else:
|
||||||
|
logger.debug("removing all segments.")
|
||||||
|
self.segments = []
|
||||||
|
self.log_segments += 1
|
||||||
|
|
||||||
|
|
||||||
|
def fire_at_boundary(self, chunked_encoder_feature: torch.Tensor):
|
||||||
|
if self.always_fire: return True
|
||||||
|
if self.never_fire: return False
|
||||||
|
return fire_at_boundary(chunked_encoder_feature, self.CIFLinear)
|
||||||
|
|
||||||
|
|
||||||
|
def _current_tokens(self):
|
||||||
|
|
||||||
|
toks = self.tokens
|
||||||
|
# very first infer: duplicate start of seq to beam_size
|
||||||
|
if toks[0].shape[0] == 1:
|
||||||
|
toks[0] = toks[0].repeat_interleave(self.cfg.beam_size,dim=0)
|
||||||
|
|
||||||
|
if not self.context.is_empty():
|
||||||
|
context_toks = self.context.as_tensor_beam(self.cfg.beam_size, device=self.model.device)
|
||||||
|
toks = [context_toks] + toks
|
||||||
|
|
||||||
|
# make it one tensor
|
||||||
|
if len(toks) > 1:
|
||||||
|
current_tokens = torch.cat(toks, dim=1)
|
||||||
|
else:
|
||||||
|
current_tokens = toks[0]
|
||||||
|
logger.debug("debug print current_tokens:")
|
||||||
|
self.debug_print_tokens(current_tokens)
|
||||||
|
return current_tokens
|
||||||
|
|
||||||
|
|
||||||
|
def debug_print_tokens(self, tokens):
|
||||||
|
for i in range(self.cfg.beam_size):
|
||||||
|
logger.debug(self.tokenizer.decode_with_timestamps(tokens[i].tolist()))
|
||||||
|
|
||||||
|
### audio buffer
|
||||||
|
|
||||||
|
def segments_len(self):
|
||||||
|
segments_len = sum(s.shape[0] for s in self.segments) / 16000
|
||||||
|
return segments_len
|
||||||
|
|
||||||
|
def _apply_minseglen(self):
|
||||||
|
segments_len = self.segments_len()
|
||||||
|
# wait for long enough audio to start
|
||||||
|
if segments_len < self.cfg.audio_min_len:
|
||||||
|
logger.debug("waiting for next segment")
|
||||||
|
return False
|
||||||
|
return True
|
||||||
|
|
||||||
|
def insert_audio(self, segment=None):
|
||||||
|
if segment is not None:
|
||||||
|
self.segments.append(segment)
|
||||||
|
|
||||||
|
removed_len = 0
|
||||||
|
# len of audio is bigger than buffer_len. Going to remove the first segment
|
||||||
|
segments_len = self.segments_len()
|
||||||
|
while len(self.segments) > 1 and segments_len > self.cfg.audio_max_len:
|
||||||
|
removed_len = self.segments[0].shape[0] / 16000
|
||||||
|
segments_len -= removed_len
|
||||||
|
self.last_attend_frame -= int(TOKENS_PER_SECOND*removed_len)
|
||||||
|
self.segments = self.segments[1:]
|
||||||
|
logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}")
|
||||||
|
if len(self.tokens) > 1:
|
||||||
|
self.context.append_token_ids(self.tokens[1][0,:])
|
||||||
|
self.tokens = [self.initial_tokens] + self.tokens[2:]
|
||||||
|
return removed_len
|
||||||
|
|
||||||
|
def _clean_cache(self):
|
||||||
|
'''clean the cache that stores the attention matrices and kv_cache.
|
||||||
|
It must be called every time after generation with the model.'''
|
||||||
|
# cleaning cache
|
||||||
|
self.dec_attns = []
|
||||||
|
self.kv_cache = {}
|
||||||
|
if self.decoder_type == "beam":
|
||||||
|
self.inference.kv_cache = self.kv_cache
|
||||||
|
self.token_decoder.reset()
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def lang_id(self, encoder_features):
|
||||||
|
"""Language detection from encoder features.
|
||||||
|
This code is trimmed and copy-pasted from whisper.decoding.detect_language .
|
||||||
|
"""
|
||||||
|
|
||||||
|
# forward pass using a single token, startoftranscript
|
||||||
|
n_audio = encoder_features.shape[0]
|
||||||
|
x = torch.tensor([[self.tokenizer.sot]] * n_audio).to(self.model.device) # [n_audio, 1]
|
||||||
|
logits = self.model.logits(x, encoder_features)[:, 0]
|
||||||
|
|
||||||
|
# collect detected languages; suppress all non-language tokens
|
||||||
|
mask = torch.ones(logits.shape[-1], dtype=torch.bool)
|
||||||
|
mask[list(self.tokenizer.all_language_tokens)] = False
|
||||||
|
logits[:, mask] = -np.inf
|
||||||
|
language_tokens = logits.argmax(dim=-1)
|
||||||
|
language_token_probs = logits.softmax(dim=-1).cpu()
|
||||||
|
language_probs = [
|
||||||
|
{
|
||||||
|
c: language_token_probs[i, j].item()
|
||||||
|
for j, c in zip(self.tokenizer.all_language_tokens, self.tokenizer.all_language_codes)
|
||||||
|
}
|
||||||
|
for i in range(n_audio)
|
||||||
|
]
|
||||||
|
|
||||||
|
single = encoder_features.ndim == 2
|
||||||
|
if single:
|
||||||
|
language_tokens = language_tokens[0]
|
||||||
|
language_probs = language_probs[0]
|
||||||
|
|
||||||
|
self._clean_cache()
|
||||||
|
return language_tokens, language_probs
|
||||||
|
|
||||||
|
### transcription / translation
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def infer(self, is_last=False):
|
||||||
|
new_segment = True
|
||||||
|
if len(self.segments) == 0:
|
||||||
|
logger.debug("No segments, nothing to do")
|
||||||
|
return [], {}
|
||||||
|
if not self._apply_minseglen():
|
||||||
|
logger.debug(f"applied minseglen {self.cfg.audio_min_len} > {self.segments_len()}.")
|
||||||
|
input_segments = torch.cat(self.segments, dim=0)
|
||||||
|
return [], {}
|
||||||
|
|
||||||
|
# input_segments is concatenation of audio, it's one array
|
||||||
|
if len(self.segments) > 1:
|
||||||
|
input_segments = torch.cat(self.segments, dim=0)
|
||||||
|
else:
|
||||||
|
input_segments = self.segments[0]
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
# mel + padding to 30s
|
||||||
|
mel_padded = log_mel_spectrogram(input_segments, n_mels=self.model.dims.n_mels, padding=N_SAMPLES,
|
||||||
|
device=self.model.device).unsqueeze(0)
|
||||||
|
# trim to 3000
|
||||||
|
mel = pad_or_trim(mel_padded, N_FRAMES)
|
||||||
|
|
||||||
|
# the len of actual audio
|
||||||
|
content_mel_len = int((mel_padded.shape[2] - mel.shape[2])/2)
|
||||||
|
|
||||||
|
# encode
|
||||||
|
encoder_feature = self.model.encoder(mel)
|
||||||
|
|
||||||
|
# logger.debug(f"Encoder feature shape: {encoder_feature.shape}")
|
||||||
|
# if mel.shape[-2:] != (self.model.dims.n_audio_ctx, self.model.dims.n_audio_state):
|
||||||
|
# logger.debug("mel ")
|
||||||
|
if self.cfg.language == "auto" and self.detected_language is None:
|
||||||
|
language_tokens, language_probs = self.lang_id(encoder_feature)
|
||||||
|
logger.debug(f"Language tokens: {language_tokens}, probs: {language_probs}")
|
||||||
|
top_lan, p = max(language_probs[0].items(), key=lambda x: x[1])
|
||||||
|
logger.info(f"Detected language: {top_lan} with p={p:.4f}")
|
||||||
|
#self.tokenizer.language = top_lan
|
||||||
|
#self.tokenizer.__post_init__()
|
||||||
|
self.create_tokenizer(top_lan)
|
||||||
|
self.detected_language = top_lan
|
||||||
|
self.init_tokens()
|
||||||
|
logger.info(f"Tokenizer language: {self.tokenizer.language}, {self.tokenizer.sot_sequence_including_notimestamps}")
|
||||||
|
|
||||||
|
self.trim_context()
|
||||||
|
current_tokens = self._current_tokens()
|
||||||
|
#
|
||||||
|
fire_detected = self.fire_at_boundary(encoder_feature[:, :content_mel_len, :])
|
||||||
|
|
||||||
|
|
||||||
|
####################### Decoding loop
|
||||||
|
logger.info("Decoding loop starts\n")
|
||||||
|
|
||||||
|
sum_logprobs = torch.zeros(self.cfg.beam_size, device=mel.device)
|
||||||
|
completed = False
|
||||||
|
|
||||||
|
attn_of_alignment_heads = None
|
||||||
|
most_attended_frame = None
|
||||||
|
|
||||||
|
token_len_before_decoding = current_tokens.shape[1]
|
||||||
|
|
||||||
|
generation_progress = []
|
||||||
|
generation = {
|
||||||
|
"starting_tokens": BeamTokens(current_tokens[0,:].clone(), self.cfg.beam_size),
|
||||||
|
"token_len_before_decoding": token_len_before_decoding,
|
||||||
|
#"fire_detected": fire_detected,
|
||||||
|
"frames_len": content_mel_len,
|
||||||
|
"frames_threshold": 4 if is_last else self.cfg.frame_threshold,
|
||||||
|
|
||||||
|
# to be filled later
|
||||||
|
"logits_starting": None,
|
||||||
|
|
||||||
|
# to be filled later
|
||||||
|
"no_speech_prob": None,
|
||||||
|
"no_speech": False,
|
||||||
|
|
||||||
|
# to be filled in the loop
|
||||||
|
"progress": generation_progress,
|
||||||
|
}
|
||||||
|
while not completed and current_tokens.shape[1] < self.max_text_len: # bos is 3 tokens
|
||||||
|
generation_progress_loop = []
|
||||||
|
|
||||||
|
if new_segment:
|
||||||
|
tokens_for_logits = current_tokens
|
||||||
|
else:
|
||||||
|
# only need to use the last token except in the first forward pass
|
||||||
|
tokens_for_logits = current_tokens[:,-1:]
|
||||||
|
|
||||||
|
logits = self.logits(tokens_for_logits, encoder_feature) # B, len(tokens), token dict size
|
||||||
|
if new_segment:
|
||||||
|
generation["logits_starting"] = Logits(logits[:,:,:])
|
||||||
|
|
||||||
|
if new_segment and self.tokenizer.no_speech is not None:
|
||||||
|
probs_at_sot = logits[:, self.sot_index, :].float().softmax(dim=-1)
|
||||||
|
no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
|
||||||
|
generation["no_speech_prob"] = no_speech_probs[0]
|
||||||
|
if no_speech_probs[0] > self.cfg.nonspeech_prob:
|
||||||
|
generation["no_speech"] = True
|
||||||
|
logger.info("no speech, stop")
|
||||||
|
break
|
||||||
|
|
||||||
|
logits = logits[:, -1, :] # logits for the last token
|
||||||
|
generation_progress_loop.append(("logits_before_suppress",Logits(logits)))
|
||||||
|
|
||||||
|
# supress blank tokens only at the beginning of the segment
|
||||||
|
if new_segment:
|
||||||
|
logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
|
||||||
|
new_segment = False
|
||||||
|
self.suppress_tokens(logits)
|
||||||
|
#generation_progress_loop.append(("logits_after_suppres",BeamLogits(logits[0,:].clone(), self.cfg.beam_size)))
|
||||||
|
generation_progress_loop.append(("logits_after_suppress",Logits(logits)))
|
||||||
|
|
||||||
|
current_tokens, completed = self.token_decoder.update(current_tokens, logits, sum_logprobs)
|
||||||
|
generation_progress_loop.append(("beam_tokens",Tokens(current_tokens[:,-1].clone())))
|
||||||
|
generation_progress_loop.append(("sum_logprobs",sum_logprobs.tolist()))
|
||||||
|
generation_progress_loop.append(("completed",completed))
|
||||||
|
|
||||||
|
logger.debug(f"Decoding completed: {completed}, sum_logprobs: {sum_logprobs.tolist()}, tokens: ")
|
||||||
|
self.debug_print_tokens(current_tokens)
|
||||||
|
|
||||||
|
|
||||||
|
# if self.decoder_type == "beam":
|
||||||
|
# logger.debug(f"Finished sequences: {self.token_decoder.finished_sequences}")
|
||||||
|
|
||||||
|
# logprobs = F.log_softmax(logits.float(), dim=-1)
|
||||||
|
# idx = 0
|
||||||
|
# logger.debug(f"Beam search topk: {logprobs[idx].topk(self.cfg.beam_size + 1)}")
|
||||||
|
# logger.debug(f"Greedy search argmax: {logits.argmax(dim=-1)}")
|
||||||
|
# if completed:
|
||||||
|
# self.debug_print_tokens(current_tokens)
|
||||||
|
|
||||||
|
# logger.debug("decode stopped because decoder completed")
|
||||||
|
|
||||||
|
attn_of_alignment_heads = [[] for _ in range(self.num_align_heads)]
|
||||||
|
for i, attn_mat in enumerate(self.dec_attns):
|
||||||
|
layer_rank = int(i % len(self.model.decoder.blocks))
|
||||||
|
align_heads_in_layer = self.align_source.get(layer_rank, [])
|
||||||
|
if len(align_heads_in_layer) == 0:
|
||||||
|
continue
|
||||||
|
for align_head_rank, head_id in align_heads_in_layer:
|
||||||
|
if self.cfg.beam_size == 1:
|
||||||
|
a = attn_mat[head_id, :, :]
|
||||||
|
a = a.unsqueeze(0)
|
||||||
|
else:
|
||||||
|
a = attn_mat[:, head_id, :, :]
|
||||||
|
attn_of_alignment_heads[align_head_rank].append(a)
|
||||||
|
tmp = []
|
||||||
|
for mat in attn_of_alignment_heads:
|
||||||
|
t = torch.cat(mat, dim=1)
|
||||||
|
tmp.append(t)
|
||||||
|
attn_of_alignment_heads = torch.stack(tmp, dim=1)
|
||||||
|
# logger.debug(str(attn_of_alignment_heads.shape) + " tttady")
|
||||||
|
std, mean = torch.std_mean(attn_of_alignment_heads, dim=-2, keepdim=True, unbiased=False)
|
||||||
|
attn_of_alignment_heads = (attn_of_alignment_heads - mean) / std
|
||||||
|
attn_of_alignment_heads = median_filter(attn_of_alignment_heads, 7) # from whisper.timing
|
||||||
|
attn_of_alignment_heads = attn_of_alignment_heads.mean(dim=1)
|
||||||
|
# logger.debug(str(attn_of_alignment_heads.shape) + " po mean")
|
||||||
|
attn_of_alignment_heads = attn_of_alignment_heads[:,:, :content_mel_len]
|
||||||
|
# logger.debug(str(attn_of_alignment_heads.shape) + " pak ")
|
||||||
|
|
||||||
|
# for each beam, the most attended frame is:
|
||||||
|
most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
|
||||||
|
generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
|
||||||
|
logger.debug(str(most_attended_frames.tolist()) + " most att frames")
|
||||||
|
|
||||||
|
most_attended_frame = most_attended_frames[0].item()
|
||||||
|
|
||||||
|
|
||||||
|
generation_progress.append(dict(generation_progress_loop))
|
||||||
|
logger.debug("current tokens" + str(current_tokens.shape))
|
||||||
|
if completed:
|
||||||
|
# # stripping the last token, the eot
|
||||||
|
current_tokens = current_tokens[:, :-1]
|
||||||
|
break
|
||||||
|
|
||||||
|
# for some rare cases where the attention fails
|
||||||
|
if not is_last and self.last_attend_frame - most_attended_frame > self.cfg.rewind_threshold:
|
||||||
|
# TODO: check this
|
||||||
|
if current_tokens.shape[1] > 1 and current_tokens[0, -2] >= DEC_PAD:
|
||||||
|
logger.debug("ommit rewinding from special tokens")
|
||||||
|
self.last_attend_frame = most_attended_frame
|
||||||
|
else:
|
||||||
|
logger.debug(
|
||||||
|
f"[rewind detected] current attention pos: {most_attended_frame}, "
|
||||||
|
f"last attention pos: {self.last_attend_frame}; omit this segment")
|
||||||
|
self.last_attend_frame = -self.cfg.rewind_threshold
|
||||||
|
current_tokens = torch.cat(self.tokens, dim=1) if len(self.tokens) > 0 else self.tokens[0]
|
||||||
|
break
|
||||||
|
else:
|
||||||
|
self.last_attend_frame = most_attended_frame
|
||||||
|
|
||||||
|
if content_mel_len - most_attended_frame <= (4 if is_last else self.cfg.frame_threshold):
|
||||||
|
logger.debug(f"attention reaches the end: {most_attended_frame}/{content_mel_len}")
|
||||||
|
# stripping the last token, the one that is attended too close to the end
|
||||||
|
current_tokens = current_tokens[:, :-1]
|
||||||
|
break
|
||||||
|
|
||||||
|
# debug print
|
||||||
|
for i in range(self.cfg.beam_size):
|
||||||
|
logger.debug("attn: {}, current pos: {}, current token: {}({})".format(
|
||||||
|
attn_of_alignment_heads.shape if attn_of_alignment_heads is not None else None,
|
||||||
|
most_attended_frames[i],
|
||||||
|
current_tokens[i, -1].item(),
|
||||||
|
self.tokenizer.decode([current_tokens[i, -1].item()])
|
||||||
|
))
|
||||||
|
|
||||||
|
# for k,v in generation.items():
|
||||||
|
# print(k,v,file=sys.stderr)
|
||||||
|
# for x in generation_progress:
|
||||||
|
# for y in x.items():
|
||||||
|
# print("\t\t",*y,file=sys.stderr)
|
||||||
|
# print("\t","----", file=sys.stderr)
|
||||||
|
# print("\t", "end of generation_progress_loop", file=sys.stderr)
|
||||||
|
# sys.exit(1)
|
||||||
|
####################### End of decoding loop
|
||||||
|
|
||||||
|
logger.info("End of decoding loop")
|
||||||
|
|
||||||
|
# if attn_of_alignment_heads is not None:
|
||||||
|
# seg_len = int(segment.shape[0] / 16000 * TOKENS_PER_SECOND)
|
||||||
|
|
||||||
|
# # Lets' now consider only the top hypothesis in the beam search
|
||||||
|
# top_beam_attn_of_alignment_heads = attn_of_alignment_heads[0]
|
||||||
|
|
||||||
|
# # debug print: how is the new token attended?
|
||||||
|
# new_token_attn = top_beam_attn_of_alignment_heads[token_len_before_decoding:, -seg_len:]
|
||||||
|
# logger.debug(f"New token attention shape: {new_token_attn.shape}")
|
||||||
|
# if new_token_attn.shape[0] == 0: # it's not attended in the current audio segment
|
||||||
|
# logger.debug("no token generated")
|
||||||
|
# else: # it is, and the max attention is:
|
||||||
|
# new_token_max_attn, _ = new_token_attn.max(dim=-1)
|
||||||
|
# logger.debug(f"segment max attention: {new_token_max_attn.mean().item()/len(self.segments)}")
|
||||||
|
|
||||||
|
|
||||||
|
# let's now operate only with the top beam hypothesis
|
||||||
|
tokens_to_split = current_tokens[0, token_len_before_decoding:]
|
||||||
|
if fire_detected or is_last:
|
||||||
|
new_hypothesis = tokens_to_split.flatten().tolist()
|
||||||
|
else:
|
||||||
|
# going to truncate the tokens after the last space
|
||||||
|
split_words, split_tokens = self.tokenizer.split_to_word_tokens(tokens_to_split.tolist())
|
||||||
|
generation["result"] = {"split_words": split_words[:-1], "split_tokens": split_tokens[:-1]}
|
||||||
|
generation["result_truncated"] = {"split_words": split_words[-1:], "split_tokens": split_tokens[-1:]}
|
||||||
|
|
||||||
|
# text_to_split = self.tokenizer.decode(tokens_to_split)
|
||||||
|
# logger.debug(f"text_to_split: {text_to_split}")
|
||||||
|
# logger.debug("text at current step: {}".format(text_to_split.replace(" ", "<space>")))
|
||||||
|
# text_before_space = " ".join(text_to_split.split(" ")[:-1])
|
||||||
|
# logger.debug("before the last space: {}".format(text_before_space.replace(" ", "<space>")))
|
||||||
|
if len(split_words) > 1:
|
||||||
|
new_hypothesis = [i for sublist in split_tokens[:-1] for i in sublist]
|
||||||
|
else:
|
||||||
|
new_hypothesis = []
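# Illustration of the truncation above (hypothetical values, not produced by this code):
# if the newly decoded tokens split into the words [" the", " qui"], only " the" is kept
# as the committed hypothesis and " qui" is dropped, because the trailing word may still
# be incomplete and will be re-decoded once more audio arrives.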
|
||||||
|
|
||||||
|
|
||||||
|
### new hypothesis
|
||||||
|
logger.debug(f"new_hypothesis: {new_hypothesis}")
|
||||||
|
new_tokens = torch.tensor([new_hypothesis], dtype=torch.long).repeat_interleave(self.cfg.beam_size, dim=0).to(
|
||||||
|
device=self.model.device,
|
||||||
|
)
|
||||||
|
self.tokens.append(new_tokens)
|
||||||
|
# TODO: test if this is redundant or not
|
||||||
|
# ret = ret[ret<DEC_PAD]
|
||||||
|
|
||||||
|
logger.info(f"Output: {self.tokenizer.decode(new_hypothesis)}")
|
||||||
|
|
||||||
|
self._clean_cache()
|
||||||
|
|
||||||
|
return new_hypothesis, generation
|
||||||
73
whisperlivekit/simul_whisper/token_buffer.py
Normal file
@@ -0,0 +1,73 @@
|
|||||||
|
import torch
|
||||||
|
import sys
|
||||||
|
class TokenBuffer:
|
||||||
|
|
||||||
|
def __init__(self, text="", tokenizer=None, device=None, prefix_token_ids=[]):
|
||||||
|
self.text = text
|
||||||
|
self.prefix_token_ids = prefix_token_ids
|
||||||
|
self.tokenizer = tokenizer
|
||||||
|
self.device = device
|
||||||
|
|
||||||
|
def as_token_ids(self, tokenizer=None):
|
||||||
|
|
||||||
|
if tokenizer is None:
|
||||||
|
tokenizer = self.tokenizer
|
||||||
|
if tokenizer is None:
|
||||||
|
raise ValueError("Tokenizer is not set.")
|
||||||
|
return self.prefix_token_ids + tokenizer.encode(self.text)
|
||||||
|
|
||||||
|
def as_tensor(self, device=None):
|
||||||
|
if device is None:
|
||||||
|
device = self.device
|
||||||
|
if device is None:
|
||||||
|
raise ValueError("Device is not set.")
|
||||||
|
tok_ids = self.as_token_ids()
|
||||||
|
return torch.tensor(tok_ids,
|
||||||
|
dtype=torch.long, device=device).unsqueeze(0)
|
||||||
|
|
||||||
|
def as_tensor_beam(self, beam, device=None):
|
||||||
|
t = self.as_tensor(device=device)
|
||||||
|
return t.repeat_interleave(beam, dim=0)
|
||||||
|
|
||||||
|
|
||||||
|
def as_text(self):
|
||||||
|
return self.text
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def empty(*a, **kw):
|
||||||
|
return TokenBuffer(*a,**kw)
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def from_text(text, *a, **kw):
|
||||||
|
return TokenBuffer(*a, text=text, **kw)
|
||||||
|
|
||||||
|
def is_empty(self):
|
||||||
|
return self.text is None or self.text == ""
|
||||||
|
|
||||||
|
def trim_words(self, num=1, after=0):
|
||||||
|
'''
|
||||||
|
num: how many words to trim from the beginning
|
||||||
|
after: how many characters to skip (length of the static prompt)
|
||||||
|
'''
|
||||||
|
tokenizer = self.tokenizer
|
||||||
|
assert tokenizer is not None, "Tokenizer is not set."
|
||||||
|
|
||||||
|
ids = tokenizer.encode(self.text[after:])
|
||||||
|
words, wids = self.tokenizer.split_to_word_tokens(ids)
|
||||||
|
# print(words, file=sys.stderr)
|
||||||
|
# print(wids, file=sys.stderr)
|
||||||
|
if not words:
|
||||||
|
return 0
|
||||||
|
self.text = self.text[:after] + "".join(words[num:])
|
||||||
|
return sum(len(wi) for wi in wids[:num])
|
||||||
|
|
||||||
|
def append_token_ids(self, token_ids):
|
||||||
|
tokenizer = self.tokenizer
|
||||||
|
assert tokenizer is not None, "Tokenizer is not set."
|
||||||
|
self.text += self.tokenizer.decode(token_ids)
|
||||||
|
|
||||||
|
def as_split_word_tokens(self):
|
||||||
|
tokenizer = self.tokenizer
|
||||||
|
assert tokenizer is not None, "Tokenizer is not set."
|
||||||
|
ids = tokenizer.encode(self.text)
|
||||||
|
return tokenizer.split_to_word_tokens(ids)
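# Minimal usage sketch (assumes a loaded Whisper tokenizer; the names and values below
# are illustrative only, not part of this module):
#
#   buf = TokenBuffer.from_text(" hello world", tokenizer=tokenizer, device="cpu")
#   ids = buf.as_token_ids()              # prefix_token_ids + encoded text
#   beams = buf.as_tensor_beam(5)         # shape (5, len(ids)), one row per beam
#   buf.append_token_ids(new_token_ids)   # decode and append newly committed tokens
#   buf.trim_words(num=1)                 # drop the oldest word to bound the context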
|
||||||
160
whisperlivekit/simul_whisper/whisper/__init__.py
Normal file
@@ -0,0 +1,160 @@
|
|||||||
|
import hashlib
|
||||||
|
import io
|
||||||
|
import os
|
||||||
|
import urllib
|
||||||
|
import warnings
|
||||||
|
from typing import List, Optional, Union
|
||||||
|
|
||||||
|
import torch
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
from .audio import load_audio, log_mel_spectrogram, pad_or_trim
|
||||||
|
from .decoding import DecodingOptions, DecodingResult, decode, detect_language
|
||||||
|
from .model import ModelDimensions, Whisper
|
||||||
|
from .transcribe import transcribe
|
||||||
|
from .version import __version__
|
||||||
|
|
||||||
|
_MODELS = {
|
||||||
|
"tiny.en": "https://openaipublic.azureedge.net/main/whisper/models/d3dd57d32accea0b295c96e26691aa14d8822fac7d9d27d5dc00b4ca2826dd03/tiny.en.pt",
|
||||||
|
"tiny": "https://openaipublic.azureedge.net/main/whisper/models/65147644a518d12f04e32d6f3b26facc3f8dd46e5390956a9424a650c0ce22b9/tiny.pt",
|
||||||
|
"base.en": "https://openaipublic.azureedge.net/main/whisper/models/25a8566e1d0c1e2231d1c762132cd20e0f96a85d16145c3a00adf5d1ac670ead/base.en.pt",
|
||||||
|
"base": "https://openaipublic.azureedge.net/main/whisper/models/ed3a0b6b1c0edf879ad9b11b1af5a0e6ab5db9205f891f668f8b0e6c6326e34e/base.pt",
|
||||||
|
"small.en": "https://openaipublic.azureedge.net/main/whisper/models/f953ad0fd29cacd07d5a9eda5624af0f6bcf2258be67c92b79389873d91e0872/small.en.pt",
|
||||||
|
"small": "https://openaipublic.azureedge.net/main/whisper/models/9ecf779972d90ba49c06d968637d720dd632c55bbf19d441fb42bf17a411e794/small.pt",
|
||||||
|
"medium.en": "https://openaipublic.azureedge.net/main/whisper/models/d7440d1dc186f76616474e0ff0b3b6b879abc9d1a4926b7adfa41db2d497ab4f/medium.en.pt",
|
||||||
|
"medium": "https://openaipublic.azureedge.net/main/whisper/models/345ae4da62f9b3d59415adc60127b97c714f32e89e936602e85993674d08dcb1/medium.pt",
|
||||||
|
"large-v1": "https://openaipublic.azureedge.net/main/whisper/models/e4b87e7e0bf463eb8e6956e646f1e277e901512310def2c24bf0e11bd3c28e9a/large-v1.pt",
|
||||||
|
"large-v2": "https://openaipublic.azureedge.net/main/whisper/models/81f7c96c852ee8fc832187b0132e569d6c3065a3252ed18e56effd0b6a73e524/large-v2.pt",
|
||||||
|
"large-v3": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
|
||||||
|
"large": "https://openaipublic.azureedge.net/main/whisper/models/e5b1a55b89c1367dacf97e3e19bfd829a01529dbfdeefa8caeb59b3f1b81dadb/large-v3.pt",
|
||||||
|
"large-v3-turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
|
||||||
|
"turbo": "https://openaipublic.azureedge.net/main/whisper/models/aff26ae408abcba5fbf8813c21e62b0941638c5f6eebfb145be0c9839262a19a/large-v3-turbo.pt",
|
||||||
|
}
|
||||||
|
|
||||||
|
# base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
|
||||||
|
# highly correlated to the word-level timing, i.e. the alignment between audio and text tokens.
|
||||||
|
_ALIGNMENT_HEADS = {
|
||||||
|
"tiny.en": b"ABzY8J1N>@0{>%R00Bk>$p{7v037`oCl~+#00",
|
||||||
|
"tiny": b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO",
|
||||||
|
"base.en": b"ABzY8;40c<0{>%RzzG;p*o+Vo09|#PsxSZm00",
|
||||||
|
"base": b"ABzY8KQ!870{>%RzyTQH3`Q^yNP!>##QT-<FaQ7m",
|
||||||
|
"small.en": b"ABzY8>?_)10{>%RpeA61k&I|OI3I$65C{;;pbCHh0B{qLQ;+}v00",
|
||||||
|
"small": b"ABzY8DmU6=0{>%Rpa?J`kvJ6qF(V^F86#Xh7JUGMK}P<N0000",
|
||||||
|
"medium.en": b"ABzY8usPae0{>%R7<zz_OvQ{)4kMa0BMw6u5rT}kRKX;$NfYBv00*Hl@qhsU00",
|
||||||
|
"medium": b"ABzY8B0Jh+0{>%R7}kK1fFL7w6%<-Pf*t^=N)Qr&0RR9",
|
||||||
|
"large-v1": b"ABzY8r9j$a0{>%R7#4sLmoOs{s)o3~84-RPdcFk!JR<kSfC2yj",
|
||||||
|
"large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
|
||||||
|
"large-v3": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
|
||||||
|
"large": b"ABzY8gWO1E0{>%R7(9S+Kn!D~%ngiGaR?*L!iJG9p-nab0JQ=-{D1-g00",
|
||||||
|
"large-v3-turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
|
||||||
|
"turbo": b"ABzY8j^C+e0{>%RARaKHP%t(lGR*)0g!tONPyhe`",
|
||||||
|
}
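# Sketch of how one of these dumps expands into a boolean mask (this mirrors what
# Whisper.set_alignment_heads in model.py does): the bytes are base85-decoded,
# gunzipped, and reshaped to (n_text_layer, n_text_head). Shown for the "base"
# checkpoint, which has 6 text layers and 8 heads.
#
#   import base64, gzip
#   import numpy as np
#   dump = _ALIGNMENT_HEADS["base"]
#   heads = np.frombuffer(gzip.decompress(base64.b85decode(dump)), dtype=bool).copy()
#   mask = heads.reshape(6, 8)   # True marks an alignment head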
|
||||||
|
|
||||||
|
|
||||||
|
def _download(url: str, root: str, in_memory: bool) -> Union[bytes, str]:
|
||||||
|
os.makedirs(root, exist_ok=True)
|
||||||
|
|
||||||
|
expected_sha256 = url.split("/")[-2]
|
||||||
|
download_target = os.path.join(root, os.path.basename(url))
|
||||||
|
|
||||||
|
if os.path.exists(download_target) and not os.path.isfile(download_target):
|
||||||
|
raise RuntimeError(f"{download_target} exists and is not a regular file")
|
||||||
|
|
||||||
|
if os.path.isfile(download_target):
|
||||||
|
with open(download_target, "rb") as f:
|
||||||
|
model_bytes = f.read()
|
||||||
|
if hashlib.sha256(model_bytes).hexdigest() == expected_sha256:
|
||||||
|
return model_bytes if in_memory else download_target
|
||||||
|
else:
|
||||||
|
warnings.warn(
|
||||||
|
f"{download_target} exists, but the SHA256 checksum does not match; re-downloading the file"
|
||||||
|
)
|
||||||
|
|
||||||
|
with urllib.request.urlopen(url) as source, open(download_target, "wb") as output:
|
||||||
|
with tqdm(
|
||||||
|
total=int(source.info().get("Content-Length")),
|
||||||
|
ncols=80,
|
||||||
|
unit="iB",
|
||||||
|
unit_scale=True,
|
||||||
|
unit_divisor=1024,
|
||||||
|
) as loop:
|
||||||
|
while True:
|
||||||
|
buffer = source.read(8192)
|
||||||
|
if not buffer:
|
||||||
|
break
|
||||||
|
|
||||||
|
output.write(buffer)
|
||||||
|
loop.update(len(buffer))
|
||||||
|
|
||||||
|
model_bytes = open(download_target, "rb").read()
|
||||||
|
if hashlib.sha256(model_bytes).hexdigest() != expected_sha256:
|
||||||
|
raise RuntimeError(
|
||||||
|
"Model has been downloaded but the SHA256 checksum does not not match. Please retry loading the model."
|
||||||
|
)
|
||||||
|
|
||||||
|
return model_bytes if in_memory else download_target
|
||||||
|
|
||||||
|
|
||||||
|
def available_models() -> List[str]:
|
||||||
|
"""Returns the names of available models"""
|
||||||
|
return list(_MODELS.keys())
|
||||||
|
|
||||||
|
|
||||||
|
def load_model(
|
||||||
|
name: str,
|
||||||
|
device: Optional[Union[str, torch.device]] = None,
|
||||||
|
download_root: str = None,
|
||||||
|
in_memory: bool = False,
|
||||||
|
) -> Whisper:
|
||||||
|
"""
|
||||||
|
Load a Whisper ASR model
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
name : str
|
||||||
|
one of the official model names listed by `whisper.available_models()`, or
|
||||||
|
path to a model checkpoint containing the model dimensions and the model state_dict.
|
||||||
|
device : Union[str, torch.device]
|
||||||
|
the PyTorch device to put the model into
|
||||||
|
download_root: str
|
||||||
|
path to download the model files; by default, it uses "~/.cache/whisper"
|
||||||
|
in_memory: bool
|
||||||
|
whether to preload the model weights into host memory
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
model : Whisper
|
||||||
|
The Whisper ASR model instance
|
||||||
|
"""
|
||||||
|
|
||||||
|
if device is None:
|
||||||
|
device = "cuda" if torch.cuda.is_available() else "cpu"
|
||||||
|
if download_root is None:
|
||||||
|
default = os.path.join(os.path.expanduser("~"), ".cache")
|
||||||
|
download_root = os.path.join(os.getenv("XDG_CACHE_HOME", default), "whisper")
|
||||||
|
|
||||||
|
if name in _MODELS:
|
||||||
|
checkpoint_file = _download(_MODELS[name], download_root, in_memory)
|
||||||
|
alignment_heads = _ALIGNMENT_HEADS[name]
|
||||||
|
elif os.path.isfile(name):
|
||||||
|
checkpoint_file = open(name, "rb").read() if in_memory else name
|
||||||
|
alignment_heads = None
|
||||||
|
else:
|
||||||
|
raise RuntimeError(
|
||||||
|
f"Model {name} not found; available models = {available_models()}"
|
||||||
|
)
|
||||||
|
|
||||||
|
with (
|
||||||
|
io.BytesIO(checkpoint_file) if in_memory else open(checkpoint_file, "rb")
|
||||||
|
) as fp:
|
||||||
|
checkpoint = torch.load(fp, map_location=device)
|
||||||
|
del checkpoint_file
|
||||||
|
|
||||||
|
dims = ModelDimensions(**checkpoint["dims"])
|
||||||
|
model = Whisper(dims)
|
||||||
|
model.load_state_dict(checkpoint["model_state_dict"])
|
||||||
|
|
||||||
|
if alignment_heads is not None:
|
||||||
|
model.set_alignment_heads(alignment_heads)
|
||||||
|
|
||||||
|
return model.to(device)
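# Example call (a sketch; the checkpoint is downloaded to ~/.cache/whisper on first use
# unless download_root is given):
#
#   model = load_model("base", device="cpu")
#   print(model.dims.n_mels, model.is_multilingual)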
|
||||||
3
whisperlivekit/simul_whisper/whisper/__main__.py
Normal file
@@ -0,0 +1,3 @@
|
|||||||
|
from .transcribe import cli
|
||||||
|
|
||||||
|
cli()
|
||||||
50256
whisperlivekit/simul_whisper/whisper/assets/gpt2.tiktoken
Normal file
File diff suppressed because it is too large
BIN
whisperlivekit/simul_whisper/whisper/assets/mel_filters.npz
Normal file
Binary file not shown.
50257
whisperlivekit/simul_whisper/whisper/assets/multilingual.tiktoken
Normal file
File diff suppressed because it is too large
157
whisperlivekit/simul_whisper/whisper/audio.py
Normal file
@@ -0,0 +1,157 @@
|
|||||||
|
import os
|
||||||
|
from functools import lru_cache
|
||||||
|
from subprocess import CalledProcessError, run
|
||||||
|
from typing import Optional, Union
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
|
||||||
|
from .utils import exact_div
|
||||||
|
|
||||||
|
# hard-coded audio hyperparameters
|
||||||
|
SAMPLE_RATE = 16000
|
||||||
|
N_FFT = 400
|
||||||
|
HOP_LENGTH = 160
|
||||||
|
CHUNK_LENGTH = 30
|
||||||
|
N_SAMPLES = CHUNK_LENGTH * SAMPLE_RATE # 480000 samples in a 30-second chunk
|
||||||
|
N_FRAMES = exact_div(N_SAMPLES, HOP_LENGTH) # 3000 frames in a mel spectrogram input
|
||||||
|
|
||||||
|
N_SAMPLES_PER_TOKEN = HOP_LENGTH * 2 # the initial convolutions have stride 2
|
||||||
|
FRAMES_PER_SECOND = exact_div(SAMPLE_RATE, HOP_LENGTH) # 10ms per audio frame
|
||||||
|
TOKENS_PER_SECOND = exact_div(SAMPLE_RATE, N_SAMPLES_PER_TOKEN) # 20ms per audio token
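# Putting the constants together: 16000 / 160 = 100 mel frames per second (10 ms each),
# 16000 / 320 = 50 audio tokens per second (20 ms each), and a 30-second chunk is
# 30 * 16000 = 480000 samples, i.e. 3000 mel frames and 1500 encoder positions.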
|
||||||
|
|
||||||
|
|
||||||
|
def load_audio(file: str, sr: int = SAMPLE_RATE):
|
||||||
|
"""
|
||||||
|
Open an audio file and read as mono waveform, resampling as necessary
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
file: str
|
||||||
|
The audio file to open
|
||||||
|
|
||||||
|
sr: int
|
||||||
|
The sample rate to resample the audio if necessary
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
A NumPy array containing the audio waveform, in float32 dtype.
|
||||||
|
"""
|
||||||
|
|
||||||
|
# This launches a subprocess to decode audio while down-mixing
|
||||||
|
# and resampling as necessary. Requires the ffmpeg CLI in PATH.
|
||||||
|
# fmt: off
|
||||||
|
cmd = [
|
||||||
|
"ffmpeg",
|
||||||
|
"-nostdin",
|
||||||
|
"-threads", "0",
|
||||||
|
"-i", file,
|
||||||
|
"-f", "s16le",
|
||||||
|
"-ac", "1",
|
||||||
|
"-acodec", "pcm_s16le",
|
||||||
|
"-ar", str(sr),
|
||||||
|
"-"
|
||||||
|
]
|
||||||
|
# fmt: on
|
||||||
|
try:
|
||||||
|
out = run(cmd, capture_output=True, check=True).stdout
|
||||||
|
except CalledProcessError as e:
|
||||||
|
raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e
|
||||||
|
|
||||||
|
return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0
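# Example (hypothetical file name; requires the ffmpeg CLI on PATH):
#
#   audio = load_audio("speech.wav")   # mono float32 waveform at 16 kHz
#   assert audio.dtype == np.float32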
|
||||||
|
|
||||||
|
|
||||||
|
def pad_or_trim(array, length: int = N_SAMPLES, *, axis: int = -1):
|
||||||
|
"""
|
||||||
|
Pad or trim the audio array to N_SAMPLES, as expected by the encoder.
|
||||||
|
"""
|
||||||
|
if torch.is_tensor(array):
|
||||||
|
if array.shape[axis] > length:
|
||||||
|
array = array.index_select(
|
||||||
|
dim=axis, index=torch.arange(length, device=array.device)
|
||||||
|
)
|
||||||
|
|
||||||
|
if array.shape[axis] < length:
|
||||||
|
pad_widths = [(0, 0)] * array.ndim
|
||||||
|
pad_widths[axis] = (0, length - array.shape[axis])
|
||||||
|
array = F.pad(array, [pad for sizes in pad_widths[::-1] for pad in sizes])
|
||||||
|
else:
|
||||||
|
if array.shape[axis] > length:
|
||||||
|
array = array.take(indices=range(length), axis=axis)
|
||||||
|
|
||||||
|
if array.shape[axis] < length:
|
||||||
|
pad_widths = [(0, 0)] * array.ndim
|
||||||
|
pad_widths[axis] = (0, length - array.shape[axis])
|
||||||
|
array = np.pad(array, pad_widths)
|
||||||
|
|
||||||
|
return array
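# Example: a 10-second clip (160000 samples) is right-padded with zeros to the
# 30-second window the encoder expects.
#
#   padded = pad_or_trim(np.zeros(160000, dtype=np.float32))
#   assert padded.shape == (480000,)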
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=None)
|
||||||
|
def mel_filters(device, n_mels: int) -> torch.Tensor:
|
||||||
|
"""
|
||||||
|
load the mel filterbank matrix for projecting STFT into a Mel spectrogram.
|
||||||
|
Allows decoupling librosa dependency; saved using:
|
||||||
|
|
||||||
|
np.savez_compressed(
|
||||||
|
"mel_filters.npz",
|
||||||
|
mel_80=librosa.filters.mel(sr=16000, n_fft=400, n_mels=80),
|
||||||
|
mel_128=librosa.filters.mel(sr=16000, n_fft=400, n_mels=128),
|
||||||
|
)
|
||||||
|
"""
|
||||||
|
assert n_mels in {80, 128}, f"Unsupported n_mels: {n_mels}"
|
||||||
|
|
||||||
|
filters_path = os.path.join(os.path.dirname(__file__), "assets", "mel_filters.npz")
|
||||||
|
with np.load(filters_path, allow_pickle=False) as f:
|
||||||
|
return torch.from_numpy(f[f"mel_{n_mels}"]).to(device)
|
||||||
|
|
||||||
|
|
||||||
|
def log_mel_spectrogram(
|
||||||
|
audio: Union[str, np.ndarray, torch.Tensor],
|
||||||
|
n_mels: int = 80,
|
||||||
|
padding: int = 0,
|
||||||
|
device: Optional[Union[str, torch.device]] = None,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Compute the log-Mel spectrogram of the given audio.
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
audio: Union[str, np.ndarray, torch.Tensor], shape = (*)
|
||||||
|
The path to audio or either a NumPy array or Tensor containing the audio waveform in 16 kHz
|
||||||
|
|
||||||
|
n_mels: int
|
||||||
|
The number of Mel-frequency filters, only 80 and 128 are supported
|
||||||
|
|
||||||
|
padding: int
|
||||||
|
Number of zero samples to pad to the right
|
||||||
|
|
||||||
|
device: Optional[Union[str, torch.device]]
|
||||||
|
If given, the audio tensor is moved to this device before STFT
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
torch.Tensor, shape = (n_mels, n_frames)
|
||||||
|
A Tensor that contains the Mel spectrogram
|
||||||
|
"""
|
||||||
|
if not torch.is_tensor(audio):
|
||||||
|
if isinstance(audio, str):
|
||||||
|
audio = load_audio(audio)
|
||||||
|
audio = torch.from_numpy(audio)
|
||||||
|
|
||||||
|
if device is not None:
|
||||||
|
audio = audio.to(device)
|
||||||
|
if padding > 0:
|
||||||
|
audio = F.pad(audio, (0, padding))
|
||||||
|
window = torch.hann_window(N_FFT).to(audio.device)
|
||||||
|
stft = torch.stft(audio, N_FFT, HOP_LENGTH, window=window, return_complex=True)
|
||||||
|
magnitudes = stft[..., :-1].abs() ** 2
|
||||||
|
|
||||||
|
filters = mel_filters(audio.device, n_mels)
|
||||||
|
mel_spec = filters @ magnitudes
|
||||||
|
|
||||||
|
log_spec = torch.clamp(mel_spec, min=1e-10).log10()
|
||||||
|
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
|
||||||
|
log_spec = (log_spec + 4.0) / 4.0
|
||||||
|
return log_spec
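# Typical preprocessing pipeline (hypothetical file name): waveform -> padded 30-second
# window -> log-Mel features of shape (n_mels, 3000).
#
#   mel = log_mel_spectrogram(pad_or_trim(load_audio("speech.wav")), n_mels=80)
#   assert mel.shape == (80, 3000)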
|
||||||
826
whisperlivekit/simul_whisper/whisper/decoding.py
Normal file
@@ -0,0 +1,826 @@
|
|||||||
|
from dataclasses import dataclass, field, replace
|
||||||
|
from typing import TYPE_CHECKING, Dict, Iterable, List, Optional, Sequence, Tuple, Union
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
from torch import Tensor
|
||||||
|
from torch.distributions import Categorical
|
||||||
|
|
||||||
|
from .audio import CHUNK_LENGTH
|
||||||
|
from .tokenizer import Tokenizer, get_tokenizer
|
||||||
|
from .utils import compression_ratio
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .model import Whisper
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def detect_language(
|
||||||
|
model: "Whisper", mel: Tensor, tokenizer: Tokenizer = None
|
||||||
|
) -> Tuple[Tensor, List[dict]]:
|
||||||
|
"""
|
||||||
|
Detect the spoken language in the audio, and return them as list of strings, along with the ids
|
||||||
|
of the most probable language tokens and the probability distribution over all language tokens.
|
||||||
|
This is performed outside the main decode loop in order to not interfere with kv-caching.
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
language_tokens : Tensor, shape = (n_audio,)
|
||||||
|
ids of the most probable language tokens, which appear after the startoftranscript token.
|
||||||
|
language_probs : List[Dict[str, float]], length = n_audio
|
||||||
|
list of dictionaries containing the probability distribution over all languages.
|
||||||
|
"""
|
||||||
|
if tokenizer is None:
|
||||||
|
tokenizer = get_tokenizer(
|
||||||
|
model.is_multilingual, num_languages=model.num_languages
|
||||||
|
)
|
||||||
|
if (
|
||||||
|
tokenizer.language is None
|
||||||
|
or tokenizer.language_token not in tokenizer.sot_sequence
|
||||||
|
):
|
||||||
|
raise ValueError(
|
||||||
|
"This model doesn't have language tokens so it can't perform lang id"
|
||||||
|
)
|
||||||
|
|
||||||
|
single = mel.ndim == 2
|
||||||
|
if single:
|
||||||
|
mel = mel.unsqueeze(0)
|
||||||
|
|
||||||
|
# skip encoder forward pass if already-encoded audio features were given
|
||||||
|
if mel.shape[-2:] != (model.dims.n_audio_ctx, model.dims.n_audio_state):
|
||||||
|
mel = model.encoder(mel)
|
||||||
|
|
||||||
|
# forward pass using a single token, startoftranscript
|
||||||
|
n_audio = mel.shape[0]
|
||||||
|
x = torch.tensor([[tokenizer.sot]] * n_audio).to(mel.device) # [n_audio, 1]
|
||||||
|
logits = model.logits(x, mel)[:, 0]
|
||||||
|
|
||||||
|
# collect detected languages; suppress all non-language tokens
|
||||||
|
mask = torch.ones(logits.shape[-1], dtype=torch.bool)
|
||||||
|
mask[list(tokenizer.all_language_tokens)] = False
|
||||||
|
logits[:, mask] = -np.inf
|
||||||
|
language_tokens = logits.argmax(dim=-1)
|
||||||
|
language_token_probs = logits.softmax(dim=-1).cpu()
|
||||||
|
language_probs = [
|
||||||
|
{
|
||||||
|
c: language_token_probs[i, j].item()
|
||||||
|
for j, c in zip(tokenizer.all_language_tokens, tokenizer.all_language_codes)
|
||||||
|
}
|
||||||
|
for i in range(n_audio)
|
||||||
|
]
|
||||||
|
|
||||||
|
if single:
|
||||||
|
language_tokens = language_tokens[0]
|
||||||
|
language_probs = language_probs[0]
|
||||||
|
|
||||||
|
return language_tokens, language_probs
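# Example for a single segment (mel is an (80, 3000) log-Mel tensor): the returned dict
# maps language codes to probabilities, so the detected language is its argmax.
#
#   lang_token, lang_probs = detect_language(model, mel)
#   detected = max(lang_probs, key=lang_probs.get)   # e.g. "en"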
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class DecodingOptions:
|
||||||
|
# whether to perform X->X "transcribe" or X->English "translate"
|
||||||
|
task: str = "transcribe"
|
||||||
|
|
||||||
|
# language that the audio is in; uses detected language if None
|
||||||
|
language: Optional[str] = None
|
||||||
|
|
||||||
|
# sampling-related options
|
||||||
|
temperature: float = 0.0
|
||||||
|
sample_len: Optional[int] = None # maximum number of tokens to sample
|
||||||
|
best_of: Optional[int] = None # number of independent sample trajectories, if t > 0
|
||||||
|
beam_size: Optional[int] = None # number of beams in beam search, if t == 0
|
||||||
|
patience: Optional[float] = None # patience in beam search (arxiv:2204.05424)
|
||||||
|
|
||||||
|
# "alpha" in Google NMT, or None for length norm, when ranking generations
|
||||||
|
# to select which to return among the beams or best-of-N samples
|
||||||
|
length_penalty: Optional[float] = None
|
||||||
|
|
||||||
|
# text or tokens to feed as the prompt or the prefix; for more info:
|
||||||
|
# https://github.com/openai/whisper/discussions/117#discussioncomment-3727051
|
||||||
|
prompt: Optional[Union[str, List[int]]] = None # for the previous context
|
||||||
|
prefix: Optional[Union[str, List[int]]] = None # to prefix the current context
|
||||||
|
|
||||||
|
# list of tokens ids (or comma-separated token ids) to suppress
|
||||||
|
# "-1" will suppress a set of symbols as defined in `tokenizer.non_speech_tokens()`
|
||||||
|
suppress_tokens: Optional[Union[str, Iterable[int]]] = "-1"
|
||||||
|
suppress_blank: bool = True # this will suppress blank outputs
|
||||||
|
|
||||||
|
# timestamp sampling options
|
||||||
|
without_timestamps: bool = False # use <|notimestamps|> to sample text tokens only
|
||||||
|
max_initial_timestamp: Optional[float] = 1.0
|
||||||
|
|
||||||
|
# implementation details
|
||||||
|
fp16: bool = True # use fp16 for most of the calculation
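# Example configuration (illustrative values): greedy decoding is used when beam_size is
# None and temperature is 0; setting beam_size switches to beam search below.
#
#   options = DecodingOptions(language="en", beam_size=5, without_timestamps=True)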
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass(frozen=True)
|
||||||
|
class DecodingResult:
|
||||||
|
audio_features: Tensor
|
||||||
|
language: str
|
||||||
|
language_probs: Optional[Dict[str, float]] = None
|
||||||
|
tokens: List[int] = field(default_factory=list)
|
||||||
|
text: str = ""
|
||||||
|
avg_logprob: float = np.nan
|
||||||
|
no_speech_prob: float = np.nan
|
||||||
|
temperature: float = np.nan
|
||||||
|
compression_ratio: float = np.nan
|
||||||
|
|
||||||
|
|
||||||
|
class Inference:
|
||||||
|
def logits(self, tokens: Tensor, audio_features: Tensor) -> Tensor:
|
||||||
|
"""Perform a forward pass on the decoder and return per-token logits"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def rearrange_kv_cache(self, source_indices) -> None:
|
||||||
|
"""Update the key-value cache according to the updated beams"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def cleanup_caching(self) -> None:
|
||||||
|
"""Clean up any resources or hooks after decoding is finished"""
|
||||||
|
pass
|
||||||
|
|
||||||
|
|
||||||
|
class PyTorchInference(Inference):
|
||||||
|
def __init__(self, model: "Whisper", initial_token_length: int):
|
||||||
|
self.model: "Whisper" = model
|
||||||
|
self.initial_token_length = initial_token_length
|
||||||
|
self.kv_cache = {}
|
||||||
|
self.hooks = []
|
||||||
|
|
||||||
|
key_modules = [block.attn.key for block in self.model.decoder.blocks]
|
||||||
|
value_modules = [block.attn.value for block in self.model.decoder.blocks]
|
||||||
|
self.kv_modules = key_modules + value_modules
|
||||||
|
|
||||||
|
def logits(self, tokens: Tensor, audio_features: Tensor) -> Tensor:
|
||||||
|
if not self.kv_cache:
|
||||||
|
self.kv_cache, self.hooks = self.model.install_kv_cache_hooks()
|
||||||
|
|
||||||
|
if tokens.shape[-1] > self.initial_token_length:
|
||||||
|
# only need to use the last token except in the first forward pass
|
||||||
|
tokens = tokens[:, -1:]
|
||||||
|
|
||||||
|
return self.model.decoder(tokens, audio_features, kv_cache=self.kv_cache)
|
||||||
|
|
||||||
|
def cleanup_caching(self):
|
||||||
|
for hook in self.hooks:
|
||||||
|
hook.remove()
|
||||||
|
|
||||||
|
self.kv_cache = {}
|
||||||
|
self.hooks = []
|
||||||
|
|
||||||
|
def rearrange_kv_cache(self, source_indices):
|
||||||
|
if source_indices != list(range(len(source_indices))):
|
||||||
|
for module in self.kv_modules:
|
||||||
|
# update the key/value cache to contain the selected sequences
|
||||||
|
self.kv_cache[module] = self.kv_cache[module][source_indices].detach()
|
||||||
|
|
||||||
|
|
||||||
|
class SequenceRanker:
|
||||||
|
def rank(
|
||||||
|
self, tokens: List[List[Tensor]], sum_logprobs: List[List[float]]
|
||||||
|
) -> List[int]:
|
||||||
|
"""
|
||||||
|
Given a list of groups of samples and their cumulative log probabilities,
|
||||||
|
return the indices of the samples in each group to select as the final result
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
class MaximumLikelihoodRanker(SequenceRanker):
|
||||||
|
"""
|
||||||
|
Select the sample with the highest log probabilities, penalized using either
|
||||||
|
a simple length normalization or Google NMT paper's length penalty
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self, length_penalty: Optional[float]):
|
||||||
|
self.length_penalty = length_penalty
|
||||||
|
|
||||||
|
def rank(self, tokens: List[List[Tensor]], sum_logprobs: List[List[float]]):
|
||||||
|
def scores(logprobs, lengths):
|
||||||
|
result = []
|
||||||
|
for logprob, length in zip(logprobs, lengths):
|
||||||
|
if self.length_penalty is None:
|
||||||
|
penalty = length
|
||||||
|
else:
|
||||||
|
# from the Google NMT paper
|
||||||
|
penalty = ((5 + length) / 6) ** self.length_penalty
|
||||||
|
result.append(logprob / penalty)
|
||||||
|
return result
|
||||||
|
|
||||||
|
# get the sequence with the highest score
|
||||||
|
lengths = [[len(t) for t in s] for s in tokens]
|
||||||
|
return [np.argmax(scores(p, l)) for p, l in zip(sum_logprobs, lengths)]
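# Worked example of the ranking score: for a 19-token candidate with length_penalty
# alpha = 0.6, the Google NMT penalty is ((5 + 19) / 6) ** 0.6 = 4 ** 0.6 ≈ 2.30, so the
# candidate's summed log-probability is divided by 2.30 rather than by its raw length.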
|
||||||
|
|
||||||
|
|
||||||
|
class TokenDecoder:
|
||||||
|
def reset(self):
|
||||||
|
"""Initialize any stateful variables for decoding a new sequence"""
|
||||||
|
|
||||||
|
def update(
|
||||||
|
self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor
|
||||||
|
) -> Tuple[Tensor, bool]:
|
||||||
|
"""Specify how to select the next token, based on the current trace and logits
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
tokens : Tensor, shape = (n_batch, current_sequence_length)
|
||||||
|
all tokens in the context so far, including the prefix and sot_sequence tokens
|
||||||
|
|
||||||
|
logits : Tensor, shape = (n_batch, vocab_size)
|
||||||
|
per-token logits of the probability distribution at the current step
|
||||||
|
|
||||||
|
sum_logprobs : Tensor, shape = (n_batch)
|
||||||
|
cumulative log probabilities for each sequence
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
tokens : Tensor, shape = (n_batch, current_sequence_length + 1)
|
||||||
|
the tokens, appended with the selected next token
|
||||||
|
|
||||||
|
completed : bool
|
||||||
|
True if all sequences have reached the end of text
|
||||||
|
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
def finalize(
|
||||||
|
self, tokens: Tensor, sum_logprobs: Tensor
|
||||||
|
) -> Tuple[Sequence[Sequence[Tensor]], List[List[float]]]:
|
||||||
|
"""Finalize search and return the final candidate sequences
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
tokens : Tensor, shape = (n_audio, n_group, current_sequence_length)
|
||||||
|
all tokens in the context so far, including the prefix and sot_sequence
|
||||||
|
|
||||||
|
sum_logprobs : Tensor, shape = (n_audio, n_group)
|
||||||
|
cumulative log probabilities for each sequence
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
tokens : Sequence[Sequence[Tensor]], length = n_audio
|
||||||
|
sequence of Tensors containing candidate token sequences, for each audio input
|
||||||
|
|
||||||
|
sum_logprobs : List[List[float]], length = n_audio
|
||||||
|
sequence of cumulative log probabilities corresponding to the above
|
||||||
|
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
class GreedyDecoder(TokenDecoder):
|
||||||
|
def __init__(self, temperature: float, eot: int):
|
||||||
|
self.temperature = temperature
|
||||||
|
self.eot = eot
|
||||||
|
|
||||||
|
def update(
|
||||||
|
self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor
|
||||||
|
) -> Tuple[Tensor, bool]:
|
||||||
|
if self.temperature == 0:
|
||||||
|
next_tokens = logits.argmax(dim=-1)
|
||||||
|
else:
|
||||||
|
next_tokens = Categorical(logits=logits / self.temperature).sample()
|
||||||
|
|
||||||
|
logprobs = F.log_softmax(logits.float(), dim=-1)
|
||||||
|
current_logprobs = logprobs[torch.arange(logprobs.shape[0]), next_tokens]
|
||||||
|
sum_logprobs += current_logprobs * (tokens[:, -1] != self.eot)
|
||||||
|
|
||||||
|
next_tokens[tokens[:, -1] == self.eot] = self.eot
|
||||||
|
tokens = torch.cat([tokens, next_tokens[:, None]], dim=-1)
|
||||||
|
|
||||||
|
completed = (tokens[:, -1] == self.eot).all()
|
||||||
|
return tokens, completed
|
||||||
|
|
||||||
|
def finalize(self, tokens: Tensor, sum_logprobs: Tensor):
|
||||||
|
# make sure each sequence has at least one EOT token at the end
|
||||||
|
tokens = F.pad(tokens, (0, 1), value=self.eot)
|
||||||
|
return tokens, sum_logprobs.tolist()
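# Toy illustration of the two branches in update() (values are illustrative): with
# temperature == 0 the next token is the argmax of the logits; otherwise it is sampled
# from softmax(logits / temperature), so higher temperatures flatten the distribution.
#
#   logits = torch.tensor([[2.0, 1.0, 0.1]])
#   logits.argmax(dim=-1)                        # tensor([0]) -- deterministic
#   Categorical(logits=logits / 0.7).sample()    # stochastic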
|
||||||
|
|
||||||
|
|
||||||
|
class BeamSearchDecoder(TokenDecoder):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
beam_size: int,
|
||||||
|
eot: int,
|
||||||
|
inference: Inference,
|
||||||
|
patience: Optional[float] = None,
|
||||||
|
):
|
||||||
|
self.beam_size = beam_size
|
||||||
|
self.eot = eot
|
||||||
|
self.inference = inference
|
||||||
|
self.patience = patience or 1.0
|
||||||
|
self.max_candidates: int = round(beam_size * self.patience)
|
||||||
|
self.finished_sequences = None
|
||||||
|
|
||||||
|
assert (
|
||||||
|
self.max_candidates > 0
|
||||||
|
), f"Invalid beam size ({beam_size}) or patience ({patience})"
|
||||||
|
|
||||||
|
def reset(self):
|
||||||
|
self.finished_sequences = None
|
||||||
|
|
||||||
|
def update(
|
||||||
|
self, tokens: Tensor, logits: Tensor, sum_logprobs: Tensor
|
||||||
|
) -> Tuple[Tensor, bool]:
|
||||||
|
if tokens.shape[0] % self.beam_size != 0:
|
||||||
|
raise ValueError(f"{tokens.shape}[0] % {self.beam_size} != 0")
|
||||||
|
|
||||||
|
n_audio = tokens.shape[0] // self.beam_size
|
||||||
|
if self.finished_sequences is None: # for the first update
|
||||||
|
self.finished_sequences = [{} for _ in range(n_audio)]
|
||||||
|
|
||||||
|
logprobs = F.log_softmax(logits.float(), dim=-1)
|
||||||
|
next_tokens, source_indices, finished_sequences = [], [], []
|
||||||
|
for i in range(n_audio):
|
||||||
|
scores, sources, finished = {}, {}, {}
|
||||||
|
|
||||||
|
# STEP 1: calculate the cumulative log probabilities for possible candidates
|
||||||
|
for j in range(self.beam_size):
|
||||||
|
idx = i * self.beam_size + j
|
||||||
|
prefix = tokens[idx].tolist()
|
||||||
|
for logprob, token in zip(*logprobs[idx].topk(self.beam_size + 1)):
|
||||||
|
new_logprob = (sum_logprobs[idx] + logprob).item()
|
||||||
|
sequence = tuple(prefix + [token.item()])
|
||||||
|
scores[sequence] = new_logprob
|
||||||
|
sources[sequence] = idx
|
||||||
|
|
||||||
|
# STEP 2: rank the candidates and keep the top beam_size sequences for each audio
|
||||||
|
saved = 0
|
||||||
|
for sequence in sorted(scores, key=scores.get, reverse=True):
|
||||||
|
if sequence[-1] == self.eot:
|
||||||
|
finished[sequence] = scores[sequence]
|
||||||
|
else:
|
||||||
|
sum_logprobs[len(next_tokens)] = scores[sequence]
|
||||||
|
next_tokens.append(sequence)
|
||||||
|
source_indices.append(sources[sequence])
|
||||||
|
|
||||||
|
saved += 1
|
||||||
|
if saved == self.beam_size:
|
||||||
|
break
|
||||||
|
|
||||||
|
finished_sequences.append(finished)
|
||||||
|
|
||||||
|
tokens = torch.tensor(next_tokens, device=tokens.device)
|
||||||
|
self.inference.rearrange_kv_cache(source_indices)
|
||||||
|
|
||||||
|
# add newly finished sequences to self.finished_sequences
|
||||||
|
assert len(self.finished_sequences) == len(finished_sequences)
|
||||||
|
for previously_finished, newly_finished in zip(
|
||||||
|
self.finished_sequences, finished_sequences
|
||||||
|
):
|
||||||
|
for seq in sorted(newly_finished, key=newly_finished.get, reverse=True):
|
||||||
|
if len(previously_finished) >= self.max_candidates:
|
||||||
|
break # the candidate list is full
|
||||||
|
previously_finished[seq] = newly_finished[seq]
|
||||||
|
|
||||||
|
# mark as completed if all audio has enough number of samples
|
||||||
|
completed = all(
|
||||||
|
len(sequences) >= self.max_candidates
|
||||||
|
for sequences in self.finished_sequences
|
||||||
|
)
|
||||||
|
return tokens, completed
|
||||||
|
|
||||||
|
def finalize(self, preceding_tokens: Tensor, sum_logprobs: Tensor):
|
||||||
|
# collect all finished sequences, including patience, and add unfinished ones if not enough
|
||||||
|
sum_logprobs = sum_logprobs.cpu()
|
||||||
|
for i, sequences in enumerate(self.finished_sequences):
|
||||||
|
if (
|
||||||
|
len(sequences) < self.beam_size
|
||||||
|
): # when not enough sequences are finished
|
||||||
|
for j in list(np.argsort(sum_logprobs[i]))[::-1]:
|
||||||
|
sequence = preceding_tokens[i, j].tolist() + [self.eot]
|
||||||
|
sequences[tuple(sequence)] = sum_logprobs[i][j].item()
|
||||||
|
if len(sequences) >= self.beam_size:
|
||||||
|
break
|
||||||
|
|
||||||
|
tokens: List[List[Tensor]] = [
|
||||||
|
[torch.tensor(seq) for seq in sequences.keys()]
|
||||||
|
for sequences in self.finished_sequences
|
||||||
|
]
|
||||||
|
sum_logprobs: List[List[float]] = [
|
||||||
|
list(sequences.values()) for sequences in self.finished_sequences
|
||||||
|
]
|
||||||
|
return tokens, sum_logprobs
|
||||||
|
|
||||||
|
|
||||||
|
class LogitFilter:
|
||||||
|
def apply(self, logits: Tensor, tokens: Tensor) -> None:
|
||||||
|
"""Apply any filtering or masking to logits in-place
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
logits : Tensor, shape = (n_batch, vocab_size)
|
||||||
|
per-token logits of the probability distribution at the current step
|
||||||
|
|
||||||
|
tokens : Tensor, shape = (n_batch, current_sequence_length)
|
||||||
|
all tokens in the context so far, including the prefix and sot_sequence tokens
|
||||||
|
|
||||||
|
"""
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
class SuppressBlank(LogitFilter):
|
||||||
|
def __init__(self, tokenizer: Tokenizer, sample_begin: int):
|
||||||
|
self.tokenizer = tokenizer
|
||||||
|
self.sample_begin = sample_begin
|
||||||
|
|
||||||
|
def apply(self, logits: Tensor, tokens: Tensor):
|
||||||
|
if tokens.shape[1] == self.sample_begin:
|
||||||
|
logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
|
||||||
|
|
||||||
|
|
||||||
|
class SuppressTokens(LogitFilter):
|
||||||
|
def __init__(self, suppress_tokens: Sequence[int]):
|
||||||
|
self.suppress_tokens = list(suppress_tokens)
|
||||||
|
|
||||||
|
def apply(self, logits: Tensor, tokens: Tensor):
|
||||||
|
logits[:, self.suppress_tokens] = -np.inf
|
||||||
|
|
||||||
|
|
||||||
|
class ApplyTimestampRules(LogitFilter):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
sample_begin: int,
|
||||||
|
max_initial_timestamp_index: Optional[int],
|
||||||
|
):
|
||||||
|
self.tokenizer = tokenizer
|
||||||
|
self.sample_begin = sample_begin
|
||||||
|
self.max_initial_timestamp_index = max_initial_timestamp_index
|
||||||
|
|
||||||
|
def apply(self, logits: Tensor, tokens: Tensor):
|
||||||
|
# suppress <|notimestamps|> which is handled by without_timestamps
|
||||||
|
if self.tokenizer.no_timestamps is not None:
|
||||||
|
logits[:, self.tokenizer.no_timestamps] = -np.inf
|
||||||
|
|
||||||
|
# timestamps have to appear in pairs, except directly before EOT; mask logits accordingly
|
||||||
|
for k in range(tokens.shape[0]):
|
||||||
|
sampled_tokens = tokens[k, self.sample_begin :]
|
||||||
|
seq = [t for t in sampled_tokens.tolist()]
|
||||||
|
last_was_timestamp = (
|
||||||
|
len(seq) >= 1 and seq[-1] >= self.tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
penultimate_was_timestamp = (
|
||||||
|
len(seq) < 2 or seq[-2] >= self.tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
|
||||||
|
if last_was_timestamp:
|
||||||
|
if penultimate_was_timestamp: # has to be non-timestamp
|
||||||
|
logits[k, self.tokenizer.timestamp_begin :] = -np.inf
|
||||||
|
else: # cannot be normal text tokens
|
||||||
|
logits[k, : self.tokenizer.eot] = -np.inf
|
||||||
|
|
||||||
|
timestamps = sampled_tokens[
|
||||||
|
sampled_tokens.ge(self.tokenizer.timestamp_begin)
|
||||||
|
]
|
||||||
|
if timestamps.numel() > 0:
|
||||||
|
# timestamps shouldn't decrease; forbid timestamp tokens smaller than the last
|
||||||
|
# also force each segment to have a nonzero length, to prevent infinite looping
|
||||||
|
if last_was_timestamp and not penultimate_was_timestamp:
|
||||||
|
timestamp_last = timestamps[-1]
|
||||||
|
else:
|
||||||
|
timestamp_last = timestamps[-1] + 1
|
||||||
|
logits[k, self.tokenizer.timestamp_begin : timestamp_last] = -np.inf
|
||||||
|
|
||||||
|
if tokens.shape[1] == self.sample_begin:
|
||||||
|
# suppress generating non-timestamp tokens at the beginning
|
||||||
|
logits[:, : self.tokenizer.timestamp_begin] = -np.inf
|
||||||
|
|
||||||
|
# apply the `max_initial_timestamp` option
|
||||||
|
if self.max_initial_timestamp_index is not None:
|
||||||
|
last_allowed = (
|
||||||
|
self.tokenizer.timestamp_begin + self.max_initial_timestamp_index
|
||||||
|
)
|
||||||
|
logits[:, last_allowed + 1 :] = -np.inf
|
||||||
|
|
||||||
|
# if sum of probability over timestamps is above any other token, sample timestamp
|
||||||
|
logprobs = F.log_softmax(logits.float(), dim=-1)
|
||||||
|
for k in range(tokens.shape[0]):
|
||||||
|
timestamp_logprob = logprobs[k, self.tokenizer.timestamp_begin :].logsumexp(
|
||||||
|
dim=-1
|
||||||
|
)
|
||||||
|
max_text_token_logprob = logprobs[k, : self.tokenizer.timestamp_begin].max()
|
||||||
|
if timestamp_logprob > max_text_token_logprob:
|
||||||
|
logits[k, : self.tokenizer.timestamp_begin] = -np.inf
|
||||||
|
|
||||||
|
|
||||||
|
class DecodingTask:
|
||||||
|
inference: Inference
|
||||||
|
sequence_ranker: SequenceRanker
|
||||||
|
decoder: TokenDecoder
|
||||||
|
logit_filters: List[LogitFilter]
|
||||||
|
|
||||||
|
def __init__(self, model: "Whisper", options: DecodingOptions):
|
||||||
|
self.model = model
|
||||||
|
|
||||||
|
language = options.language or "en"
|
||||||
|
tokenizer = get_tokenizer(
|
||||||
|
model.is_multilingual,
|
||||||
|
num_languages=model.num_languages,
|
||||||
|
language=language,
|
||||||
|
task=options.task,
|
||||||
|
)
|
||||||
|
self.tokenizer: Tokenizer = tokenizer
|
||||||
|
self.options: DecodingOptions = self._verify_options(options)
|
||||||
|
|
||||||
|
self.n_group: int = options.beam_size or options.best_of or 1
|
||||||
|
self.n_ctx: int = model.dims.n_text_ctx
|
||||||
|
self.sample_len: int = options.sample_len or model.dims.n_text_ctx // 2
|
||||||
|
|
||||||
|
self.sot_sequence: Tuple[int] = tokenizer.sot_sequence
|
||||||
|
if self.options.without_timestamps:
|
||||||
|
self.sot_sequence = tokenizer.sot_sequence_including_notimestamps
|
||||||
|
|
||||||
|
self.initial_tokens: Tuple[int] = self._get_initial_tokens()
|
||||||
|
self.sample_begin: int = len(self.initial_tokens)
|
||||||
|
self.sot_index: int = self.initial_tokens.index(tokenizer.sot)
|
||||||
|
|
||||||
|
# inference: implements the forward pass through the decoder, including kv caching
|
||||||
|
self.inference = PyTorchInference(model, len(self.initial_tokens))
|
||||||
|
|
||||||
|
# sequence ranker: implements how to rank a group of sampled sequences
|
||||||
|
self.sequence_ranker = MaximumLikelihoodRanker(options.length_penalty)
|
||||||
|
|
||||||
|
# decoder: implements how to select the next tokens, given the autoregressive distribution
|
||||||
|
if options.beam_size is not None:
|
||||||
|
self.decoder = BeamSearchDecoder(
|
||||||
|
options.beam_size, tokenizer.eot, self.inference, options.patience
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
self.decoder = GreedyDecoder(options.temperature, tokenizer.eot)
|
||||||
|
|
||||||
|
# logit filters: applies various rules to suppress or penalize certain tokens
|
||||||
|
self.logit_filters = []
|
||||||
|
if self.options.suppress_blank:
|
||||||
|
self.logit_filters.append(SuppressBlank(self.tokenizer, self.sample_begin))
|
||||||
|
if self.options.suppress_tokens:
|
||||||
|
self.logit_filters.append(SuppressTokens(self._get_suppress_tokens()))
|
||||||
|
if not options.without_timestamps:
|
||||||
|
precision = CHUNK_LENGTH / model.dims.n_audio_ctx # usually 0.02 seconds
|
||||||
|
max_initial_timestamp_index = None
|
||||||
|
if options.max_initial_timestamp:
|
||||||
|
max_initial_timestamp_index = round(
|
||||||
|
self.options.max_initial_timestamp / precision
|
||||||
|
)
|
||||||
|
self.logit_filters.append(
|
||||||
|
ApplyTimestampRules(
|
||||||
|
tokenizer, self.sample_begin, max_initial_timestamp_index
|
||||||
|
)
|
||||||
|
)
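# Worked example of the precision arithmetic above: with CHUNK_LENGTH = 30 and
# n_audio_ctx = 1500, precision = 30 / 1500 = 0.02 s per position, so the default
# max_initial_timestamp of 1.0 s maps to max_initial_timestamp_index = round(1.0 / 0.02) = 50.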
|
||||||
|
|
||||||
|
def _verify_options(self, options: DecodingOptions) -> DecodingOptions:
|
||||||
|
if options.beam_size is not None and options.best_of is not None:
|
||||||
|
raise ValueError("beam_size and best_of can't be given together")
|
||||||
|
if options.temperature == 0:
|
||||||
|
if options.best_of is not None:
|
||||||
|
raise ValueError("best_of with greedy sampling (T=0) is not compatible")
|
||||||
|
if options.patience is not None and options.beam_size is None:
|
||||||
|
raise ValueError("patience requires beam_size to be given")
|
||||||
|
if options.length_penalty is not None and not (
|
||||||
|
0 <= options.length_penalty <= 1
|
||||||
|
):
|
||||||
|
raise ValueError("length_penalty (alpha) should be a value between 0 and 1")
|
||||||
|
|
||||||
|
return options
|
||||||
|
|
||||||
|
def _get_initial_tokens(self) -> Tuple[int]:
|
||||||
|
tokens = list(self.sot_sequence)
|
||||||
|
|
||||||
|
if prefix := self.options.prefix:
|
||||||
|
prefix_tokens = (
|
||||||
|
self.tokenizer.encode(" " + prefix.strip())
|
||||||
|
if isinstance(prefix, str)
|
||||||
|
else prefix
|
||||||
|
)
|
||||||
|
if self.sample_len is not None:
|
||||||
|
max_prefix_len = self.n_ctx // 2 - self.sample_len
|
||||||
|
prefix_tokens = prefix_tokens[-max_prefix_len:]
|
||||||
|
tokens = tokens + prefix_tokens
|
||||||
|
|
||||||
|
if prompt := self.options.prompt:
|
||||||
|
prompt_tokens = (
|
||||||
|
self.tokenizer.encode(" " + prompt.strip())
|
||||||
|
if isinstance(prompt, str)
|
||||||
|
else prompt
|
||||||
|
)
|
||||||
|
tokens = (
|
||||||
|
[self.tokenizer.sot_prev]
|
||||||
|
+ prompt_tokens[-(self.n_ctx // 2 - 1) :]
|
||||||
|
+ tokens
|
||||||
|
)
|
||||||
|
|
||||||
|
return tuple(tokens)
|
||||||
|
|
||||||
|
def _get_suppress_tokens(self) -> Tuple[int]:
|
||||||
|
suppress_tokens = self.options.suppress_tokens
|
||||||
|
|
||||||
|
if isinstance(suppress_tokens, str):
|
||||||
|
suppress_tokens = [int(t) for t in suppress_tokens.split(",")]
|
||||||
|
|
||||||
|
if -1 in suppress_tokens:
|
||||||
|
suppress_tokens = [t for t in suppress_tokens if t >= 0]
|
||||||
|
suppress_tokens.extend(self.tokenizer.non_speech_tokens)
|
||||||
|
elif suppress_tokens is None or len(suppress_tokens) == 0:
|
||||||
|
suppress_tokens = [] # interpret empty string as an empty list
|
||||||
|
else:
|
||||||
|
assert isinstance(suppress_tokens, list), "suppress_tokens must be a list"
|
||||||
|
|
||||||
|
suppress_tokens.extend(
|
||||||
|
[
|
||||||
|
self.tokenizer.transcribe,
|
||||||
|
self.tokenizer.translate,
|
||||||
|
self.tokenizer.sot,
|
||||||
|
self.tokenizer.sot_prev,
|
||||||
|
self.tokenizer.sot_lm,
|
||||||
|
]
|
||||||
|
)
|
||||||
|
if self.tokenizer.no_speech is not None:
|
||||||
|
# no-speech probability is collected separately
|
||||||
|
suppress_tokens.append(self.tokenizer.no_speech)
|
||||||
|
|
||||||
|
return tuple(sorted(set(suppress_tokens)))
|
||||||
|
|
||||||
|
def _get_audio_features(self, mel: Tensor):
|
||||||
|
if self.options.fp16:
|
||||||
|
mel = mel.half()
|
||||||
|
|
||||||
|
if mel.shape[-2:] == (
|
||||||
|
self.model.dims.n_audio_ctx,
|
||||||
|
self.model.dims.n_audio_state,
|
||||||
|
):
|
||||||
|
# encoded audio features are given; skip audio encoding
|
||||||
|
audio_features = mel
|
||||||
|
else:
|
||||||
|
audio_features = self.model.encoder(mel)
|
||||||
|
|
||||||
|
if audio_features.dtype != (
|
||||||
|
torch.float16 if self.options.fp16 else torch.float32
|
||||||
|
):
|
||||||
|
raise TypeError(
|
||||||
|
f"audio_features has an incorrect dtype: {audio_features.dtype}"
|
||||||
|
)
|
||||||
|
|
||||||
|
return audio_features
|
||||||
|
|
||||||
|
def _detect_language(self, audio_features: Tensor, tokens: Tensor):
|
||||||
|
languages = [self.options.language] * audio_features.shape[0]
|
||||||
|
lang_probs = None
|
||||||
|
|
||||||
|
if self.options.language is None or self.options.task == "lang_id":
|
||||||
|
lang_tokens, lang_probs = self.model.detect_language(
|
||||||
|
audio_features, self.tokenizer
|
||||||
|
)
|
||||||
|
languages = [max(probs, key=probs.get) for probs in lang_probs]
|
||||||
|
if self.options.language is None:
|
||||||
|
tokens[:, self.sot_index + 1] = lang_tokens # write language tokens
|
||||||
|
|
||||||
|
return languages, lang_probs
|
||||||
|
|
||||||
|
def _main_loop(self, audio_features: Tensor, tokens: Tensor):
|
||||||
|
n_batch = tokens.shape[0]
|
||||||
|
sum_logprobs: Tensor = torch.zeros(n_batch, device=audio_features.device)
|
||||||
|
no_speech_probs = [np.nan] * n_batch
|
||||||
|
|
||||||
|
try:
|
||||||
|
for i in range(self.sample_len):
|
||||||
|
logits = self.inference.logits(tokens, audio_features)
|
||||||
|
|
||||||
|
if (
|
||||||
|
i == 0 and self.tokenizer.no_speech is not None
|
||||||
|
): # save no_speech_probs
|
||||||
|
probs_at_sot = logits[:, self.sot_index].float().softmax(dim=-1)
|
||||||
|
no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
|
||||||
|
|
||||||
|
# now we need to consider the logits at the last token only
|
||||||
|
logits = logits[:, -1]
|
||||||
|
|
||||||
|
# apply the logit filters, e.g. for suppressing or penalizing certain tokens
|
||||||
|
for logit_filter in self.logit_filters:
|
||||||
|
logit_filter.apply(logits, tokens)
|
||||||
|
|
||||||
|
# expand the tokens tensor with the selected next tokens
|
||||||
|
tokens, completed = self.decoder.update(tokens, logits, sum_logprobs)
|
||||||
|
|
||||||
|
if completed or tokens.shape[-1] > self.n_ctx:
|
||||||
|
break
|
||||||
|
finally:
|
||||||
|
self.inference.cleanup_caching()
|
||||||
|
|
||||||
|
return tokens, sum_logprobs, no_speech_probs
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def run(self, mel: Tensor) -> List[DecodingResult]:
|
||||||
|
self.decoder.reset()
|
||||||
|
tokenizer: Tokenizer = self.tokenizer
|
||||||
|
n_audio: int = mel.shape[0]
|
||||||
|
|
||||||
|
audio_features: Tensor = self._get_audio_features(mel) # encoder forward pass
|
||||||
|
tokens: Tensor = torch.tensor([self.initial_tokens]).repeat(n_audio, 1)
|
||||||
|
|
||||||
|
# detect language if requested, overwriting the language token
|
||||||
|
languages, language_probs = self._detect_language(audio_features, tokens)
|
||||||
|
if self.options.task == "lang_id":
|
||||||
|
return [
|
||||||
|
DecodingResult(
|
||||||
|
audio_features=features, language=language, language_probs=probs
|
||||||
|
)
|
||||||
|
for features, language, probs in zip(
|
||||||
|
audio_features, languages, language_probs
|
||||||
|
)
|
||||||
|
]
|
||||||
|
|
||||||
|
# repeat text tensors by the group size, for beam search or best-of-n sampling
|
||||||
|
tokens = tokens.repeat_interleave(self.n_group, dim=0).to(audio_features.device)
|
||||||
|
|
||||||
|
# call the main sampling loop
|
||||||
|
tokens, sum_logprobs, no_speech_probs = self._main_loop(audio_features, tokens)
|
||||||
|
|
||||||
|
# reshape the tensors to have (n_audio, n_group) as the first two dimensions
|
||||||
|
audio_features = audio_features[:: self.n_group]
|
||||||
|
no_speech_probs = no_speech_probs[:: self.n_group]
|
||||||
|
assert audio_features.shape[0] == len(no_speech_probs) == n_audio
|
||||||
|
|
||||||
|
tokens = tokens.reshape(n_audio, self.n_group, -1)
|
||||||
|
sum_logprobs = sum_logprobs.reshape(n_audio, self.n_group)
|
||||||
|
|
||||||
|
# get the final candidates for each group, and slice between the first sampled token and EOT
|
||||||
|
tokens, sum_logprobs = self.decoder.finalize(tokens, sum_logprobs)
|
||||||
|
tokens: List[List[Tensor]] = [
|
||||||
|
[t[self.sample_begin : (t == tokenizer.eot).nonzero()[0, 0]] for t in s]
|
||||||
|
for s in tokens
|
||||||
|
]
|
||||||
|
|
||||||
|
# select the top-ranked sample in each group
|
||||||
|
selected = self.sequence_ranker.rank(tokens, sum_logprobs)
|
||||||
|
tokens: List[List[int]] = [t[i].tolist() for i, t in zip(selected, tokens)]
|
||||||
|
texts: List[str] = [tokenizer.decode(t).strip() for t in tokens]
|
||||||
|
|
||||||
|
sum_logprobs: List[float] = [lp[i] for i, lp in zip(selected, sum_logprobs)]
|
||||||
|
avg_logprobs: List[float] = [
|
||||||
|
lp / (len(t) + 1) for t, lp in zip(tokens, sum_logprobs)
|
||||||
|
]
|
||||||
|
|
||||||
|
fields = (
|
||||||
|
texts,
|
||||||
|
languages,
|
||||||
|
tokens,
|
||||||
|
audio_features,
|
||||||
|
avg_logprobs,
|
||||||
|
no_speech_probs,
|
||||||
|
)
|
||||||
|
if len(set(map(len, fields))) != 1:
|
||||||
|
raise RuntimeError(f"inconsistent result lengths: {list(map(len, fields))}")
|
||||||
|
|
||||||
|
return [
|
||||||
|
DecodingResult(
|
||||||
|
audio_features=features,
|
||||||
|
language=language,
|
||||||
|
tokens=tokens,
|
||||||
|
text=text,
|
||||||
|
avg_logprob=avg_logprob,
|
||||||
|
no_speech_prob=no_speech_prob,
|
||||||
|
temperature=self.options.temperature,
|
||||||
|
compression_ratio=compression_ratio(text),
|
||||||
|
)
|
||||||
|
for text, language, tokens, features, avg_logprob, no_speech_prob in zip(
|
||||||
|
*fields
|
||||||
|
)
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
@torch.no_grad()
|
||||||
|
def decode(
|
||||||
|
model: "Whisper",
|
||||||
|
mel: Tensor,
|
||||||
|
options: DecodingOptions = DecodingOptions(),
|
||||||
|
**kwargs,
|
||||||
|
) -> Union[DecodingResult, List[DecodingResult]]:
|
||||||
|
"""
|
||||||
|
Performs decoding of 30-second audio segment(s), provided as Mel spectrogram(s).
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
model: Whisper
|
||||||
|
the Whisper model instance
|
||||||
|
|
||||||
|
mel: torch.Tensor, shape = (80, 3000) or (*, 80, 3000)
|
||||||
|
A tensor containing the Mel spectrogram(s)
|
||||||
|
|
||||||
|
options: DecodingOptions
|
||||||
|
A dataclass that contains all necessary options for decoding 30-second segments
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
result: Union[DecodingResult, List[DecodingResult]]
|
||||||
|
The result(s) of decoding contained in `DecodingResult` dataclass instance(s)
|
||||||
|
"""
|
||||||
|
if single := mel.ndim == 2:
|
||||||
|
mel = mel.unsqueeze(0)
|
||||||
|
|
||||||
|
if kwargs:
|
||||||
|
options = replace(options, **kwargs)
|
||||||
|
|
||||||
|
result = DecodingTask(model, options).run(mel)
|
||||||
|
|
||||||
|
return result[0] if single else result
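# End-to-end sketch combining the audio helpers with decode() (hypothetical file name):
#
#   mel = log_mel_spectrogram(pad_or_trim(load_audio("speech.wav")), n_mels=model.dims.n_mels)
#   result = decode(model, mel.to(model.device), DecodingOptions(language="en"))
#   print(result.text, result.avg_logprob)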
|
||||||
348
whisperlivekit/simul_whisper/whisper/model.py
Normal file
@@ -0,0 +1,348 @@
|
|||||||
|
import base64
|
||||||
|
import gzip
|
||||||
|
from contextlib import contextmanager
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import Dict, Iterable, Optional, Tuple
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
from torch import Tensor, nn
|
||||||
|
|
||||||
|
from .decoding import decode as decode_function
|
||||||
|
from .decoding import detect_language as detect_language_function
|
||||||
|
from .transcribe import transcribe as transcribe_function
|
||||||
|
|
||||||
|
try:
|
||||||
|
from torch.nn.functional import scaled_dot_product_attention
|
||||||
|
|
||||||
|
SDPA_AVAILABLE = True
|
||||||
|
except (ImportError, RuntimeError, OSError):
|
||||||
|
scaled_dot_product_attention = None
|
||||||
|
SDPA_AVAILABLE = False
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class ModelDimensions:
|
||||||
|
n_mels: int
|
||||||
|
n_audio_ctx: int
|
||||||
|
n_audio_state: int
|
||||||
|
n_audio_head: int
|
||||||
|
n_audio_layer: int
|
||||||
|
n_vocab: int
|
||||||
|
n_text_ctx: int
|
||||||
|
n_text_state: int
|
||||||
|
n_text_head: int
|
||||||
|
n_text_layer: int
|
||||||
|
|
||||||
|
|
||||||
|
class LayerNorm(nn.LayerNorm):
|
||||||
|
def forward(self, x: Tensor) -> Tensor:
|
||||||
|
return super().forward(x.float()).type(x.dtype)
|
||||||
|
|
||||||
|
|
||||||
|
class Linear(nn.Linear):
|
||||||
|
def forward(self, x: Tensor) -> Tensor:
|
||||||
|
return F.linear(
|
||||||
|
x,
|
||||||
|
self.weight.to(x.dtype),
|
||||||
|
None if self.bias is None else self.bias.to(x.dtype),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class Conv1d(nn.Conv1d):
|
||||||
|
def _conv_forward(
|
||||||
|
self, x: Tensor, weight: Tensor, bias: Optional[Tensor]
|
||||||
|
) -> Tensor:
|
||||||
|
return super()._conv_forward(
|
||||||
|
x, weight.to(x.dtype), None if bias is None else bias.to(x.dtype)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def sinusoids(length, channels, max_timescale=10000):
|
||||||
|
"""Returns sinusoids for positional embedding"""
|
||||||
|
assert channels % 2 == 0
|
||||||
|
log_timescale_increment = np.log(max_timescale) / (channels // 2 - 1)
|
||||||
|
inv_timescales = torch.exp(-log_timescale_increment * torch.arange(channels // 2))
|
||||||
|
scaled_time = torch.arange(length)[:, np.newaxis] * inv_timescales[np.newaxis, :]
|
||||||
|
return torch.cat([torch.sin(scaled_time), torch.cos(scaled_time)], dim=1)
|
||||||
|
|
||||||
|
|
||||||
|
@contextmanager
|
||||||
|
def disable_sdpa():
|
||||||
|
prev_state = MultiHeadAttention.use_sdpa
|
||||||
|
try:
|
||||||
|
MultiHeadAttention.use_sdpa = False
|
||||||
|
yield
|
||||||
|
finally:
|
||||||
|
MultiHeadAttention.use_sdpa = prev_state
|
||||||
|
|
||||||
|
|
||||||
|
class MultiHeadAttention(nn.Module):
|
||||||
|
use_sdpa = False # Disable SDPA to ensure qk is always computed for hooks
|
||||||
|
|
||||||
|
def __init__(self, n_state: int, n_head: int, cache_id: str = ""):
|
||||||
|
super().__init__()
|
||||||
|
self.n_head = n_head
|
||||||
|
self.query = Linear(n_state, n_state)
|
||||||
|
self.key = Linear(n_state, n_state, bias=False)
|
||||||
|
self.value = Linear(n_state, n_state)
|
||||||
|
self.out = Linear(n_state, n_state)
|
||||||
|
self.cache_id = cache_id
|
||||||
|
self.key.cache_id = f"{cache_id}_key"
|
||||||
|
self.value.cache_id = f"{cache_id}_value"
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
x: Tensor,
|
||||||
|
xa: Optional[Tensor] = None,
|
||||||
|
mask: Optional[Tensor] = None,
|
||||||
|
kv_cache: Optional[dict] = None,
|
||||||
|
):
|
||||||
|
q = self.query(x)
|
||||||
|
|
||||||
|
if kv_cache is None or xa is None or self.key not in kv_cache:
|
||||||
|
# hooks, if installed (i.e. kv_cache is not None), will prepend the cached kv tensors;
|
||||||
|
# otherwise, perform key/value projections for self- or cross-attention as usual.
|
||||||
|
k = self.key(x if xa is None else xa)
|
||||||
|
v = self.value(x if xa is None else xa)
|
||||||
|
else:
|
||||||
|
# for cross-attention, calculate keys and values once and reuse in subsequent calls.
|
||||||
|
k = kv_cache[self.key]
|
||||||
|
v = kv_cache[self.value]
|
||||||
|
|
||||||
|
wv, qk = self.qkv_attention(q, k, v, mask)
|
||||||
|
return self.out(wv), qk
|
||||||
|
|
||||||
|
def qkv_attention(
|
||||||
|
self, q: Tensor, k: Tensor, v: Tensor, mask: Optional[Tensor] = None
|
||||||
|
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
|
||||||
|
n_batch, n_ctx, n_state = q.shape
|
||||||
|
scale = (n_state // self.n_head) ** -0.25
|
||||||
|
q = q.view(*q.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
|
||||||
|
k = k.view(*k.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
|
||||||
|
v = v.view(*v.shape[:2], self.n_head, -1).permute(0, 2, 1, 3)
|
||||||
|
|
||||||
|
if SDPA_AVAILABLE and MultiHeadAttention.use_sdpa:
|
||||||
|
a = scaled_dot_product_attention(
|
||||||
|
q, k, v, is_causal=mask is not None and n_ctx > 1
|
||||||
|
)
|
||||||
|
out = a.permute(0, 2, 1, 3).flatten(start_dim=2)
|
||||||
|
qk = None
|
||||||
|
else:
|
||||||
|
qk = (q * scale) @ (k * scale).transpose(-1, -2)
|
||||||
|
if mask is not None:
|
||||||
|
qk = qk + mask[:n_ctx, :n_ctx]
|
||||||
|
qk = qk.float()
|
||||||
|
|
||||||
|
w = F.softmax(qk, dim=-1).to(q.dtype)
|
||||||
|
out = (w @ v).permute(0, 2, 1, 3).flatten(start_dim=2)
|
||||||
|
qk = qk.detach()
|
||||||
|
|
||||||
|
return out, qk
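A quick shape sanity check for the attention block above (a toy sketch with made-up sizes; `use_sdpa` is disabled at class level, so the manual path that also returns the per-head `qk` logits is exercised).

import torch

torch.manual_seed(0)
attn = MultiHeadAttention(n_state=8, n_head=2, cache_id="demo")
x = torch.randn(1, 4, 8)            # (batch, n_ctx, n_state)

out, qk = attn(x)                   # self-attention: no xa, no mask, no kv_cache
print(out.shape)                    # torch.Size([1, 4, 8])
print(qk.shape)                     # torch.Size([1, 2, 4, 4]) -- per-head attention logits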
|
||||||
|
|
||||||
|
|
||||||
|
class ResidualAttentionBlock(nn.Module):
|
||||||
|
def __init__(self, n_state: int, n_head: int, cross_attention: bool = False, cache_id: str = ""):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.attn = MultiHeadAttention(n_state, n_head, cache_id=f"{cache_id}_self_attn")
|
||||||
|
self.attn_ln = LayerNorm(n_state)
|
||||||
|
|
||||||
|
self.cross_attn = (
|
||||||
|
MultiHeadAttention(n_state, n_head, cache_id=f"{cache_id}_cross_attn") if cross_attention else None
|
||||||
|
)
|
||||||
|
self.cross_attn_ln = LayerNorm(n_state) if cross_attention else None
|
||||||
|
|
||||||
|
n_mlp = n_state * 4
|
||||||
|
self.mlp = nn.Sequential(
|
||||||
|
Linear(n_state, n_mlp), nn.GELU(), Linear(n_mlp, n_state)
|
||||||
|
)
|
||||||
|
self.mlp_ln = LayerNorm(n_state)
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self,
|
||||||
|
x: Tensor,
|
||||||
|
xa: Optional[Tensor] = None,
|
||||||
|
mask: Optional[Tensor] = None,
|
||||||
|
kv_cache: Optional[dict] = None,
|
||||||
|
):
|
||||||
|
x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)[0]
|
||||||
|
if self.cross_attn:
|
||||||
|
x = x + self.cross_attn(self.cross_attn_ln(x), xa, kv_cache=kv_cache)[0]
|
||||||
|
x = x + self.mlp(self.mlp_ln(x))
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class AudioEncoder(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self, n_mels: int, n_ctx: int, n_state: int, n_head: int, n_layer: int
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
self.conv1 = Conv1d(n_mels, n_state, kernel_size=3, padding=1)
|
||||||
|
self.conv2 = Conv1d(n_state, n_state, kernel_size=3, stride=2, padding=1)
|
||||||
|
self.register_buffer("positional_embedding", sinusoids(n_ctx, n_state))
|
||||||
|
|
||||||
|
self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList(
|
||||||
|
[ResidualAttentionBlock(n_state, n_head, cache_id=f"enc_layer{i}") for i in range(n_layer)]
|
||||||
|
)
|
||||||
|
self.ln_post = LayerNorm(n_state)
|
||||||
|
|
||||||
|
def forward(self, x: Tensor):
|
||||||
|
"""
|
||||||
|
x : torch.Tensor, shape = (batch_size, n_mels, n_ctx)
|
||||||
|
the mel spectrogram of the audio
|
||||||
|
"""
|
||||||
|
x = F.gelu(self.conv1(x))
|
||||||
|
x = F.gelu(self.conv2(x))
|
||||||
|
x = x.permute(0, 2, 1)
|
||||||
|
|
||||||
|
assert x.shape[1:] == self.positional_embedding.shape, "incorrect audio shape"
|
||||||
|
x = (x + self.positional_embedding).to(x.dtype)
|
||||||
|
|
||||||
|
for block in self.blocks:
|
||||||
|
x = block(x)
|
||||||
|
|
||||||
|
x = self.ln_post(x)
|
||||||
|
return x
|
||||||
|
|
||||||
|
|
||||||
|
class TextDecoder(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self, n_vocab: int, n_ctx: int, n_state: int, n_head: int, n_layer: int
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.token_embedding = nn.Embedding(n_vocab, n_state)
|
||||||
|
self.positional_embedding = nn.Parameter(torch.empty(n_ctx, n_state))
|
||||||
|
|
||||||
|
self.blocks: Iterable[ResidualAttentionBlock] = nn.ModuleList(
|
||||||
|
[
|
||||||
|
ResidualAttentionBlock(n_state, n_head, cross_attention=True, cache_id=f"dec_layer{i}")
|
||||||
|
for i in range(n_layer)
|
||||||
|
]
|
||||||
|
)
|
||||||
|
self.ln = LayerNorm(n_state)
|
||||||
|
|
||||||
|
mask = torch.empty(n_ctx, n_ctx).fill_(-np.inf).triu_(1)
|
||||||
|
self.register_buffer("mask", mask, persistent=False)
|
||||||
|
|
||||||
|
def forward(self, x: Tensor, xa: Tensor, kv_cache: Optional[dict] = None):
|
||||||
|
"""
|
||||||
|
x : torch.LongTensor, shape = (batch_size, <= n_ctx)
|
||||||
|
the text tokens
|
||||||
|
xa : torch.Tensor, shape = (batch_size, n_audio_ctx, n_audio_state)
|
||||||
|
the encoded audio features to be attended on
|
||||||
|
"""
|
||||||
|
offset = next(iter(kv_cache.values())).shape[1] if kv_cache else 0
|
||||||
|
x = (
|
||||||
|
self.token_embedding(x)
|
||||||
|
+ self.positional_embedding[offset : offset + x.shape[-1]]
|
||||||
|
)
|
||||||
|
x = x.to(xa.dtype)
|
||||||
|
|
||||||
|
for block in self.blocks:
|
||||||
|
x = block(x, xa, mask=self.mask, kv_cache=kv_cache)
|
||||||
|
|
||||||
|
x = self.ln(x)
|
||||||
|
logits = (
|
||||||
|
x @ torch.transpose(self.token_embedding.weight.to(x.dtype), 0, 1)
|
||||||
|
).float()
|
||||||
|
|
||||||
|
return logits
|
||||||
|
|
||||||
|
|
||||||
|
class Whisper(nn.Module):
|
||||||
|
def __init__(self, dims: ModelDimensions):
|
||||||
|
super().__init__()
|
||||||
|
self.dims = dims
|
||||||
|
self.encoder = AudioEncoder(
|
||||||
|
self.dims.n_mels,
|
||||||
|
self.dims.n_audio_ctx,
|
||||||
|
self.dims.n_audio_state,
|
||||||
|
self.dims.n_audio_head,
|
||||||
|
self.dims.n_audio_layer,
|
||||||
|
)
|
||||||
|
self.decoder = TextDecoder(
|
||||||
|
self.dims.n_vocab,
|
||||||
|
self.dims.n_text_ctx,
|
||||||
|
self.dims.n_text_state,
|
||||||
|
self.dims.n_text_head,
|
||||||
|
self.dims.n_text_layer,
|
||||||
|
)
|
||||||
|
# use the last half among the decoder layers for time alignment by default;
|
||||||
|
# to use a specific set of heads, see `set_alignment_heads()` below.
|
||||||
|
all_heads = torch.zeros(
|
||||||
|
self.dims.n_text_layer, self.dims.n_text_head, dtype=torch.bool
|
||||||
|
)
|
||||||
|
all_heads[self.dims.n_text_layer // 2 :] = True
|
||||||
|
self.register_buffer("alignment_heads", all_heads.to_sparse(), persistent=False)
|
||||||
|
|
||||||
|
def set_alignment_heads(self, dump: bytes):
|
||||||
|
array = np.frombuffer(
|
||||||
|
gzip.decompress(base64.b85decode(dump)), dtype=bool
|
||||||
|
).copy()
|
||||||
|
mask = torch.from_numpy(array).reshape(
|
||||||
|
self.dims.n_text_layer, self.dims.n_text_head
|
||||||
|
)
|
||||||
|
self.register_buffer("alignment_heads", mask.to_sparse(), persistent=False)
|
||||||
|
|
||||||
|
def embed_audio(self, mel: torch.Tensor):
|
||||||
|
return self.encoder(mel)
|
||||||
|
|
||||||
|
def logits(self, tokens: torch.Tensor, audio_features: torch.Tensor):
|
||||||
|
return self.decoder(tokens, audio_features)
|
||||||
|
|
||||||
|
def forward(
|
||||||
|
self, mel: torch.Tensor, tokens: torch.Tensor
|
||||||
|
) -> Dict[str, torch.Tensor]:
|
||||||
|
return self.decoder(tokens, self.encoder(mel))
|
||||||
|
|
||||||
|
@property
|
||||||
|
def device(self):
|
||||||
|
return next(self.parameters()).device
|
||||||
|
|
||||||
|
@property
|
||||||
|
def is_multilingual(self):
|
||||||
|
return self.dims.n_vocab >= 51865
|
||||||
|
|
||||||
|
@property
|
||||||
|
def num_languages(self):
|
||||||
|
return self.dims.n_vocab - 51765 - int(self.is_multilingual)
|
||||||
|
|
||||||
|
def install_kv_cache_hooks(self, cache: Optional[dict] = None):
|
||||||
|
"""
|
||||||
|
The `MultiHeadAttention` module optionally accepts `kv_cache` which stores the key and value
|
||||||
|
tensors calculated for the previous positions. This method returns a dictionary that stores
|
||||||
|
all caches, and the necessary hooks for the key and value projection modules that save the
|
||||||
|
intermediate tensors to be reused during later calculations.
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
cache : Dict[nn.Module, torch.Tensor]
|
||||||
|
A dictionary object mapping the key/value projection modules to its cache
|
||||||
|
hooks : List[RemovableHandle]
|
||||||
|
List of PyTorch RemovableHandle objects to stop the hooks to be called
|
||||||
|
"""
|
||||||
|
cache = {**cache} if cache is not None else {}
|
||||||
|
hooks = []
|
||||||
|
|
||||||
|
def save_to_cache(module, _, output):
|
||||||
|
if module not in cache or output.shape[1] > self.dims.n_text_ctx:
|
||||||
|
# save as-is, for the first token or cross attention
|
||||||
|
cache[module] = output
|
||||||
|
else:
|
||||||
|
cache[module] = torch.cat([cache[module], output], dim=1).detach()
|
||||||
|
return cache[module]
|
||||||
|
|
||||||
|
def install_hooks(layer: nn.Module):
|
||||||
|
if isinstance(layer, MultiHeadAttention):
|
||||||
|
hooks.append(layer.key.register_forward_hook(save_to_cache))
|
||||||
|
hooks.append(layer.value.register_forward_hook(save_to_cache))
|
||||||
|
|
||||||
|
self.decoder.apply(install_hooks)
|
||||||
|
return cache, hooks
|
||||||
|
|
||||||
|
detect_language = detect_language_function
|
||||||
|
transcribe = transcribe_function
|
||||||
|
decode = decode_function
|
||||||
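A small sketch of assembling `ModelDimensions` into a `Whisper` instance (both defined above) and checking tensor shapes end to end. The dimension values are illustrative (roughly tiny-sized), the token ids are illustrative too, and the weights are untrained, so only the shapes are meaningful.

import torch

dims = ModelDimensions(
    n_mels=80, n_audio_ctx=1500, n_audio_state=384, n_audio_head=6, n_audio_layer=4,
    n_vocab=51865, n_text_ctx=448, n_text_state=384, n_text_head=6, n_text_layer=4,
)
model = Whisper(dims).eval()

mel = torch.randn(1, dims.n_mels, 3000)           # 30 s worth of (random) log-mel frames
tokens = torch.tensor([[50258, 50259, 50359]])    # illustrative SOT / language / task ids

with torch.no_grad():
    logits = model(mel, tokens)                   # encoder + decoder in one call
print(logits.shape)                               # torch.Size([1, 3, 51865])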
2  whisperlivekit/simul_whisper/whisper/normalizers/__init__.py  Normal file
@@ -0,0 +1,2 @@
from .basic import BasicTextNormalizer as BasicTextNormalizer
from .english import EnglishTextNormalizer as EnglishTextNormalizer
80  whisperlivekit/simul_whisper/whisper/normalizers/basic.py  Normal file
@@ -0,0 +1,80 @@
|
|||||||
|
import re
|
||||||
|
import unicodedata
|
||||||
|
|
||||||
|
import regex
|
||||||
|
|
||||||
|
# non-ASCII letters that are not separated by "NFKD" normalization
|
||||||
|
ADDITIONAL_DIACRITICS = {
|
||||||
|
"œ": "oe",
|
||||||
|
"Œ": "OE",
|
||||||
|
"ø": "o",
|
||||||
|
"Ø": "O",
|
||||||
|
"æ": "ae",
|
||||||
|
"Æ": "AE",
|
||||||
|
"ß": "ss",
|
||||||
|
"ẞ": "SS",
|
||||||
|
"đ": "d",
|
||||||
|
"Đ": "D",
|
||||||
|
"ð": "d",
|
||||||
|
"Ð": "D",
|
||||||
|
"þ": "th",
|
||||||
|
"Þ": "th",
|
||||||
|
"ł": "l",
|
||||||
|
"Ł": "L",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def remove_symbols_and_diacritics(s: str, keep=""):
|
||||||
|
"""
|
||||||
|
Replace any other markers, symbols, and punctuations with a space,
|
||||||
|
and drop any diacritics (category 'Mn' and some manual mappings)
|
||||||
|
"""
|
||||||
|
return "".join(
|
||||||
|
(
|
||||||
|
c
|
||||||
|
if c in keep
|
||||||
|
else (
|
||||||
|
ADDITIONAL_DIACRITICS[c]
|
||||||
|
if c in ADDITIONAL_DIACRITICS
|
||||||
|
else (
|
||||||
|
""
|
||||||
|
if unicodedata.category(c) == "Mn"
|
||||||
|
else " " if unicodedata.category(c)[0] in "MSP" else c
|
||||||
|
)
|
||||||
|
)
|
||||||
|
)
|
||||||
|
for c in unicodedata.normalize("NFKD", s)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def remove_symbols(s: str):
|
||||||
|
"""
|
||||||
|
Replace any other markers, symbols, punctuations with a space, keeping diacritics
|
||||||
|
"""
|
||||||
|
return "".join(
|
||||||
|
" " if unicodedata.category(c)[0] in "MSP" else c
|
||||||
|
for c in unicodedata.normalize("NFKC", s)
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class BasicTextNormalizer:
|
||||||
|
def __init__(self, remove_diacritics: bool = False, split_letters: bool = False):
|
||||||
|
self.clean = (
|
||||||
|
remove_symbols_and_diacritics if remove_diacritics else remove_symbols
|
||||||
|
)
|
||||||
|
self.split_letters = split_letters
|
||||||
|
|
||||||
|
def __call__(self, s: str):
|
||||||
|
s = s.lower()
|
||||||
|
s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
|
||||||
|
s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis
|
||||||
|
s = self.clean(s).lower()
|
||||||
|
|
||||||
|
if self.split_letters:
|
||||||
|
s = " ".join(regex.findall(r"\X", s, regex.U))
|
||||||
|
|
||||||
|
s = re.sub(
|
||||||
|
r"\s+", " ", s
|
||||||
|
) # replace any successive whitespace characters with a space
|
||||||
|
|
||||||
|
return s
|
||||||
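A short sketch of what `BasicTextNormalizer` (defined above) does to a mixed string: bracketed annotations are dropped, diacritics are stripped when requested, and punctuation collapses to whitespace. The output in the comment is approximate and only indicative.

normalizer = BasicTextNormalizer(remove_diacritics=True)
print(normalizer("Crème brûlée — c'est délicieux! [laughter]"))
# -> roughly "creme brulee c est delicieux"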
1741  whisperlivekit/simul_whisper/whisper/normalizers/english.json  Normal file
File diff suppressed because it is too large
550  whisperlivekit/simul_whisper/whisper/normalizers/english.py  Normal file
@@ -0,0 +1,550 @@
|
|||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
from fractions import Fraction
|
||||||
|
from typing import Iterator, List, Match, Optional, Union
|
||||||
|
|
||||||
|
from more_itertools import windowed
|
||||||
|
|
||||||
|
from .basic import remove_symbols_and_diacritics
|
||||||
|
|
||||||
|
|
||||||
|
class EnglishNumberNormalizer:
|
||||||
|
"""
|
||||||
|
Convert any spelled-out numbers into arabic numbers, while handling:
|
||||||
|
|
||||||
|
- remove any commas
|
||||||
|
- keep the suffixes such as: `1960s`, `274th`, `32nd`, etc.
|
||||||
|
- spell out currency symbols after the number. e.g. `$20 million` -> `20000000 dollars`
|
||||||
|
- spell out `one` and `ones`
|
||||||
|
- interpret successive single-digit numbers as nominal: `one oh one` -> `101`
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.zeros = {"o", "oh", "zero"}
|
||||||
|
self.ones = {
|
||||||
|
name: i
|
||||||
|
for i, name in enumerate(
|
||||||
|
[
|
||||||
|
"one",
|
||||||
|
"two",
|
||||||
|
"three",
|
||||||
|
"four",
|
||||||
|
"five",
|
||||||
|
"six",
|
||||||
|
"seven",
|
||||||
|
"eight",
|
||||||
|
"nine",
|
||||||
|
"ten",
|
||||||
|
"eleven",
|
||||||
|
"twelve",
|
||||||
|
"thirteen",
|
||||||
|
"fourteen",
|
||||||
|
"fifteen",
|
||||||
|
"sixteen",
|
||||||
|
"seventeen",
|
||||||
|
"eighteen",
|
||||||
|
"nineteen",
|
||||||
|
],
|
||||||
|
start=1,
|
||||||
|
)
|
||||||
|
}
|
||||||
|
self.ones_plural = {
|
||||||
|
"sixes" if name == "six" else name + "s": (value, "s")
|
||||||
|
for name, value in self.ones.items()
|
||||||
|
}
|
||||||
|
self.ones_ordinal = {
|
||||||
|
"zeroth": (0, "th"),
|
||||||
|
"first": (1, "st"),
|
||||||
|
"second": (2, "nd"),
|
||||||
|
"third": (3, "rd"),
|
||||||
|
"fifth": (5, "th"),
|
||||||
|
"twelfth": (12, "th"),
|
||||||
|
**{
|
||||||
|
name + ("h" if name.endswith("t") else "th"): (value, "th")
|
||||||
|
for name, value in self.ones.items()
|
||||||
|
if value > 3 and value != 5 and value != 12
|
||||||
|
},
|
||||||
|
}
|
||||||
|
self.ones_suffixed = {**self.ones_plural, **self.ones_ordinal}
|
||||||
|
|
||||||
|
self.tens = {
|
||||||
|
"twenty": 20,
|
||||||
|
"thirty": 30,
|
||||||
|
"forty": 40,
|
||||||
|
"fifty": 50,
|
||||||
|
"sixty": 60,
|
||||||
|
"seventy": 70,
|
||||||
|
"eighty": 80,
|
||||||
|
"ninety": 90,
|
||||||
|
}
|
||||||
|
self.tens_plural = {
|
||||||
|
name.replace("y", "ies"): (value, "s") for name, value in self.tens.items()
|
||||||
|
}
|
||||||
|
self.tens_ordinal = {
|
||||||
|
name.replace("y", "ieth"): (value, "th")
|
||||||
|
for name, value in self.tens.items()
|
||||||
|
}
|
||||||
|
self.tens_suffixed = {**self.tens_plural, **self.tens_ordinal}
|
||||||
|
|
||||||
|
self.multipliers = {
|
||||||
|
"hundred": 100,
|
||||||
|
"thousand": 1_000,
|
||||||
|
"million": 1_000_000,
|
||||||
|
"billion": 1_000_000_000,
|
||||||
|
"trillion": 1_000_000_000_000,
|
||||||
|
"quadrillion": 1_000_000_000_000_000,
|
||||||
|
"quintillion": 1_000_000_000_000_000_000,
|
||||||
|
"sextillion": 1_000_000_000_000_000_000_000,
|
||||||
|
"septillion": 1_000_000_000_000_000_000_000_000,
|
||||||
|
"octillion": 1_000_000_000_000_000_000_000_000_000,
|
||||||
|
"nonillion": 1_000_000_000_000_000_000_000_000_000_000,
|
||||||
|
"decillion": 1_000_000_000_000_000_000_000_000_000_000_000,
|
||||||
|
}
|
||||||
|
self.multipliers_plural = {
|
||||||
|
name + "s": (value, "s") for name, value in self.multipliers.items()
|
||||||
|
}
|
||||||
|
self.multipliers_ordinal = {
|
||||||
|
name + "th": (value, "th") for name, value in self.multipliers.items()
|
||||||
|
}
|
||||||
|
self.multipliers_suffixed = {
|
||||||
|
**self.multipliers_plural,
|
||||||
|
**self.multipliers_ordinal,
|
||||||
|
}
|
||||||
|
self.decimals = {*self.ones, *self.tens, *self.zeros}
|
||||||
|
|
||||||
|
self.preceding_prefixers = {
|
||||||
|
"minus": "-",
|
||||||
|
"negative": "-",
|
||||||
|
"plus": "+",
|
||||||
|
"positive": "+",
|
||||||
|
}
|
||||||
|
self.following_prefixers = {
|
||||||
|
"pound": "£",
|
||||||
|
"pounds": "£",
|
||||||
|
"euro": "€",
|
||||||
|
"euros": "€",
|
||||||
|
"dollar": "$",
|
||||||
|
"dollars": "$",
|
||||||
|
"cent": "¢",
|
||||||
|
"cents": "¢",
|
||||||
|
}
|
||||||
|
self.prefixes = set(
|
||||||
|
list(self.preceding_prefixers.values())
|
||||||
|
+ list(self.following_prefixers.values())
|
||||||
|
)
|
||||||
|
self.suffixers = {
|
||||||
|
"per": {"cent": "%"},
|
||||||
|
"percent": "%",
|
||||||
|
}
|
||||||
|
self.specials = {"and", "double", "triple", "point"}
|
||||||
|
|
||||||
|
self.words = set(
|
||||||
|
[
|
||||||
|
key
|
||||||
|
for mapping in [
|
||||||
|
self.zeros,
|
||||||
|
self.ones,
|
||||||
|
self.ones_suffixed,
|
||||||
|
self.tens,
|
||||||
|
self.tens_suffixed,
|
||||||
|
self.multipliers,
|
||||||
|
self.multipliers_suffixed,
|
||||||
|
self.preceding_prefixers,
|
||||||
|
self.following_prefixers,
|
||||||
|
self.suffixers,
|
||||||
|
self.specials,
|
||||||
|
]
|
||||||
|
for key in mapping
|
||||||
|
]
|
||||||
|
)
|
||||||
|
self.literal_words = {"one", "ones"}
|
||||||
|
|
||||||
|
def process_words(self, words: List[str]) -> Iterator[str]:
|
||||||
|
prefix: Optional[str] = None
|
||||||
|
value: Optional[Union[str, int]] = None
|
||||||
|
skip = False
|
||||||
|
|
||||||
|
def to_fraction(s: str):
|
||||||
|
try:
|
||||||
|
return Fraction(s)
|
||||||
|
except ValueError:
|
||||||
|
return None
|
||||||
|
|
||||||
|
def output(result: Union[str, int]):
|
||||||
|
nonlocal prefix, value
|
||||||
|
result = str(result)
|
||||||
|
if prefix is not None:
|
||||||
|
result = prefix + result
|
||||||
|
value = None
|
||||||
|
prefix = None
|
||||||
|
return result
|
||||||
|
|
||||||
|
if len(words) == 0:
|
||||||
|
return
|
||||||
|
|
||||||
|
for prev, current, next in windowed([None] + words + [None], 3):
|
||||||
|
if skip:
|
||||||
|
skip = False
|
||||||
|
continue
|
||||||
|
|
||||||
|
next_is_numeric = next is not None and re.match(r"^\d+(\.\d+)?$", next)
|
||||||
|
has_prefix = current[0] in self.prefixes
|
||||||
|
current_without_prefix = current[1:] if has_prefix else current
|
||||||
|
if re.match(r"^\d+(\.\d+)?$", current_without_prefix):
|
||||||
|
# arabic numbers (potentially with signs and fractions)
|
||||||
|
f = to_fraction(current_without_prefix)
|
||||||
|
assert f is not None
|
||||||
|
if value is not None:
|
||||||
|
if isinstance(value, str) and value.endswith("."):
|
||||||
|
# concatenate decimals / ip address components
|
||||||
|
value = str(value) + str(current)
|
||||||
|
continue
|
||||||
|
else:
|
||||||
|
yield output(value)
|
||||||
|
|
||||||
|
prefix = current[0] if has_prefix else prefix
|
||||||
|
if f.denominator == 1:
|
||||||
|
value = f.numerator # store integers as int
|
||||||
|
else:
|
||||||
|
value = current_without_prefix
|
||||||
|
elif current not in self.words:
|
||||||
|
# non-numeric words
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
yield output(current)
|
||||||
|
elif current in self.zeros:
|
||||||
|
value = str(value or "") + "0"
|
||||||
|
elif current in self.ones:
|
||||||
|
ones = self.ones[current]
|
||||||
|
|
||||||
|
if value is None:
|
||||||
|
value = ones
|
||||||
|
elif isinstance(value, str) or prev in self.ones:
|
||||||
|
if (
|
||||||
|
prev in self.tens and ones < 10
|
||||||
|
): # replace the last zero with the digit
|
||||||
|
assert value[-1] == "0"
|
||||||
|
value = value[:-1] + str(ones)
|
||||||
|
else:
|
||||||
|
value = str(value) + str(ones)
|
||||||
|
elif ones < 10:
|
||||||
|
if value % 10 == 0:
|
||||||
|
value += ones
|
||||||
|
else:
|
||||||
|
value = str(value) + str(ones)
|
||||||
|
else: # eleven to nineteen
|
||||||
|
if value % 100 == 0:
|
||||||
|
value += ones
|
||||||
|
else:
|
||||||
|
value = str(value) + str(ones)
|
||||||
|
elif current in self.ones_suffixed:
|
||||||
|
# ordinal or cardinal; yield the number right away
|
||||||
|
ones, suffix = self.ones_suffixed[current]
|
||||||
|
if value is None:
|
||||||
|
yield output(str(ones) + suffix)
|
||||||
|
elif isinstance(value, str) or prev in self.ones:
|
||||||
|
if prev in self.tens and ones < 10:
|
||||||
|
assert value[-1] == "0"
|
||||||
|
yield output(value[:-1] + str(ones) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(str(value) + str(ones) + suffix)
|
||||||
|
elif ones < 10:
|
||||||
|
if value % 10 == 0:
|
||||||
|
yield output(str(value + ones) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(str(value) + str(ones) + suffix)
|
||||||
|
else: # eleven to nineteen
|
||||||
|
if value % 100 == 0:
|
||||||
|
yield output(str(value + ones) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(str(value) + str(ones) + suffix)
|
||||||
|
value = None
|
||||||
|
elif current in self.tens:
|
||||||
|
tens = self.tens[current]
|
||||||
|
if value is None:
|
||||||
|
value = tens
|
||||||
|
elif isinstance(value, str):
|
||||||
|
value = str(value) + str(tens)
|
||||||
|
else:
|
||||||
|
if value % 100 == 0:
|
||||||
|
value += tens
|
||||||
|
else:
|
||||||
|
value = str(value) + str(tens)
|
||||||
|
elif current in self.tens_suffixed:
|
||||||
|
# ordinal or cardinal; yield the number right away
|
||||||
|
tens, suffix = self.tens_suffixed[current]
|
||||||
|
if value is None:
|
||||||
|
yield output(str(tens) + suffix)
|
||||||
|
elif isinstance(value, str):
|
||||||
|
yield output(str(value) + str(tens) + suffix)
|
||||||
|
else:
|
||||||
|
if value % 100 == 0:
|
||||||
|
yield output(str(value + tens) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(str(value) + str(tens) + suffix)
|
||||||
|
elif current in self.multipliers:
|
||||||
|
multiplier = self.multipliers[current]
|
||||||
|
if value is None:
|
||||||
|
value = multiplier
|
||||||
|
elif isinstance(value, str) or value == 0:
|
||||||
|
f = to_fraction(value)
|
||||||
|
p = f * multiplier if f is not None else None
|
||||||
|
if f is not None and p.denominator == 1:
|
||||||
|
value = p.numerator
|
||||||
|
else:
|
||||||
|
yield output(value)
|
||||||
|
value = multiplier
|
||||||
|
else:
|
||||||
|
before = value // 1000 * 1000
|
||||||
|
residual = value % 1000
|
||||||
|
value = before + residual * multiplier
|
||||||
|
elif current in self.multipliers_suffixed:
|
||||||
|
multiplier, suffix = self.multipliers_suffixed[current]
|
||||||
|
if value is None:
|
||||||
|
yield output(str(multiplier) + suffix)
|
||||||
|
elif isinstance(value, str):
|
||||||
|
f = to_fraction(value)
|
||||||
|
p = f * multiplier if f is not None else None
|
||||||
|
if f is not None and p.denominator == 1:
|
||||||
|
yield output(str(p.numerator) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(value)
|
||||||
|
yield output(str(multiplier) + suffix)
|
||||||
|
else: # int
|
||||||
|
before = value // 1000 * 1000
|
||||||
|
residual = value % 1000
|
||||||
|
value = before + residual * multiplier
|
||||||
|
yield output(str(value) + suffix)
|
||||||
|
value = None
|
||||||
|
elif current in self.preceding_prefixers:
|
||||||
|
# apply prefix (positive, minus, etc.) if it precedes a number
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
|
||||||
|
if next in self.words or next_is_numeric:
|
||||||
|
prefix = self.preceding_prefixers[current]
|
||||||
|
else:
|
||||||
|
yield output(current)
|
||||||
|
elif current in self.following_prefixers:
|
||||||
|
# apply prefix (dollars, cents, etc.) only after a number
|
||||||
|
if value is not None:
|
||||||
|
prefix = self.following_prefixers[current]
|
||||||
|
yield output(value)
|
||||||
|
else:
|
||||||
|
yield output(current)
|
||||||
|
elif current in self.suffixers:
|
||||||
|
# apply suffix symbols (percent -> '%')
|
||||||
|
if value is not None:
|
||||||
|
suffix = self.suffixers[current]
|
||||||
|
if isinstance(suffix, dict):
|
||||||
|
if next in suffix:
|
||||||
|
yield output(str(value) + suffix[next])
|
||||||
|
skip = True
|
||||||
|
else:
|
||||||
|
yield output(value)
|
||||||
|
yield output(current)
|
||||||
|
else:
|
||||||
|
yield output(str(value) + suffix)
|
||||||
|
else:
|
||||||
|
yield output(current)
|
||||||
|
elif current in self.specials:
|
||||||
|
if next not in self.words and not next_is_numeric:
|
||||||
|
# apply special handling only if the next word can be numeric
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
yield output(current)
|
||||||
|
elif current == "and":
|
||||||
|
# ignore "and" after hundreds, thousands, etc.
|
||||||
|
if prev not in self.multipliers:
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
yield output(current)
|
||||||
|
elif current == "double" or current == "triple":
|
||||||
|
if next in self.ones or next in self.zeros:
|
||||||
|
repeats = 2 if current == "double" else 3
|
||||||
|
ones = self.ones.get(next, 0)
|
||||||
|
value = str(value or "") + str(ones) * repeats
|
||||||
|
skip = True
|
||||||
|
else:
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
yield output(current)
|
||||||
|
elif current == "point":
|
||||||
|
if next in self.decimals or next_is_numeric:
|
||||||
|
value = str(value or "") + "."
|
||||||
|
else:
|
||||||
|
# should all have been covered at this point
|
||||||
|
raise ValueError(f"Unexpected token: {current}")
|
||||||
|
else:
|
||||||
|
# all should have been covered at this point
|
||||||
|
raise ValueError(f"Unexpected token: {current}")
|
||||||
|
|
||||||
|
if value is not None:
|
||||||
|
yield output(value)
|
||||||
|
|
||||||
|
def preprocess(self, s: str):
|
||||||
|
# replace "<number> and a half" with "<number> point five"
|
||||||
|
results = []
|
||||||
|
|
||||||
|
segments = re.split(r"\band\s+a\s+half\b", s)
|
||||||
|
for i, segment in enumerate(segments):
|
||||||
|
if len(segment.strip()) == 0:
|
||||||
|
continue
|
||||||
|
if i == len(segments) - 1:
|
||||||
|
results.append(segment)
|
||||||
|
else:
|
||||||
|
results.append(segment)
|
||||||
|
last_word = segment.rsplit(maxsplit=2)[-1]
|
||||||
|
if last_word in self.decimals or last_word in self.multipliers:
|
||||||
|
results.append("point five")
|
||||||
|
else:
|
||||||
|
results.append("and a half")
|
||||||
|
|
||||||
|
s = " ".join(results)
|
||||||
|
|
||||||
|
# put a space at number/letter boundary
|
||||||
|
s = re.sub(r"([a-z])([0-9])", r"\1 \2", s)
|
||||||
|
s = re.sub(r"([0-9])([a-z])", r"\1 \2", s)
|
||||||
|
|
||||||
|
# but remove spaces which could be a suffix
|
||||||
|
s = re.sub(r"([0-9])\s+(st|nd|rd|th|s)\b", r"\1\2", s)
|
||||||
|
|
||||||
|
return s
|
||||||
|
|
||||||
|
def postprocess(self, s: str):
|
||||||
|
def combine_cents(m: Match):
|
||||||
|
try:
|
||||||
|
currency = m.group(1)
|
||||||
|
integer = m.group(2)
|
||||||
|
cents = int(m.group(3))
|
||||||
|
return f"{currency}{integer}.{cents:02d}"
|
||||||
|
except ValueError:
|
||||||
|
return m.string
|
||||||
|
|
||||||
|
def extract_cents(m: Match):
|
||||||
|
try:
|
||||||
|
return f"¢{int(m.group(1))}"
|
||||||
|
except ValueError:
|
||||||
|
return m.string
|
||||||
|
|
||||||
|
# apply currency postprocessing; "$2 and ¢7" -> "$2.07"
|
||||||
|
s = re.sub(r"([€£$])([0-9]+) (?:and )?¢([0-9]{1,2})\b", combine_cents, s)
|
||||||
|
s = re.sub(r"[€£$]0.([0-9]{1,2})\b", extract_cents, s)
|
||||||
|
|
||||||
|
# write "one(s)" instead of "1(s)", just for the readability
|
||||||
|
s = re.sub(r"\b1(s?)\b", r"one\1", s)
|
||||||
|
|
||||||
|
return s
|
||||||
|
|
||||||
|
def __call__(self, s: str):
|
||||||
|
s = self.preprocess(s)
|
||||||
|
s = " ".join(word for word in self.process_words(s.split()) if word is not None)
|
||||||
|
s = self.postprocess(s)
|
||||||
|
|
||||||
|
return s
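Some indicative inputs and outputs for the number normalizer above (a toy sketch; the expected strings in the comments follow from the rules implemented in `process_words` and `postprocess`).

norm = EnglishNumberNormalizer()
print(norm("twenty three dollars and fifty cents"))   # -> "$23.50"
print(norm("one hundred and seventy sixth"))          # -> "176th"
print(norm("double o seven"))                         # -> "007"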
|
||||||
|
|
||||||
|
|
||||||
|
class EnglishSpellingNormalizer:
|
||||||
|
"""
|
||||||
|
Applies British-American spelling mappings as listed in [1].
|
||||||
|
|
||||||
|
[1] https://www.tysto.com/uk-us-spelling-list.html
|
||||||
|
"""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
mapping_path = os.path.join(os.path.dirname(__file__), "english.json")
|
||||||
|
self.mapping = json.load(open(mapping_path))
|
||||||
|
|
||||||
|
def __call__(self, s: str):
|
||||||
|
return " ".join(self.mapping.get(word, word) for word in s.split())
|
||||||
|
|
||||||
|
|
||||||
|
class EnglishTextNormalizer:
|
||||||
|
def __init__(self):
|
||||||
|
self.ignore_patterns = r"\b(hmm|mm|mhm|mmm|uh|um)\b"
|
||||||
|
self.replacers = {
|
||||||
|
# common contractions
|
||||||
|
r"\bwon't\b": "will not",
|
||||||
|
r"\bcan't\b": "can not",
|
||||||
|
r"\blet's\b": "let us",
|
||||||
|
r"\bain't\b": "aint",
|
||||||
|
r"\by'all\b": "you all",
|
||||||
|
r"\bwanna\b": "want to",
|
||||||
|
r"\bgotta\b": "got to",
|
||||||
|
r"\bgonna\b": "going to",
|
||||||
|
r"\bi'ma\b": "i am going to",
|
||||||
|
r"\bimma\b": "i am going to",
|
||||||
|
r"\bwoulda\b": "would have",
|
||||||
|
r"\bcoulda\b": "could have",
|
||||||
|
r"\bshoulda\b": "should have",
|
||||||
|
r"\bma'am\b": "madam",
|
||||||
|
# contractions in titles/prefixes
|
||||||
|
r"\bmr\b": "mister ",
|
||||||
|
r"\bmrs\b": "missus ",
|
||||||
|
r"\bst\b": "saint ",
|
||||||
|
r"\bdr\b": "doctor ",
|
||||||
|
r"\bprof\b": "professor ",
|
||||||
|
r"\bcapt\b": "captain ",
|
||||||
|
r"\bgov\b": "governor ",
|
||||||
|
r"\bald\b": "alderman ",
|
||||||
|
r"\bgen\b": "general ",
|
||||||
|
r"\bsen\b": "senator ",
|
||||||
|
r"\brep\b": "representative ",
|
||||||
|
r"\bpres\b": "president ",
|
||||||
|
r"\brev\b": "reverend ",
|
||||||
|
r"\bhon\b": "honorable ",
|
||||||
|
r"\basst\b": "assistant ",
|
||||||
|
r"\bassoc\b": "associate ",
|
||||||
|
r"\blt\b": "lieutenant ",
|
||||||
|
r"\bcol\b": "colonel ",
|
||||||
|
r"\bjr\b": "junior ",
|
||||||
|
r"\bsr\b": "senior ",
|
||||||
|
r"\besq\b": "esquire ",
|
||||||
|
# perfect tenses, ideally it should be any past participles, but it's harder..
|
||||||
|
r"'d been\b": " had been",
|
||||||
|
r"'s been\b": " has been",
|
||||||
|
r"'d gone\b": " had gone",
|
||||||
|
r"'s gone\b": " has gone",
|
||||||
|
r"'d done\b": " had done", # "'s done" is ambiguous
|
||||||
|
r"'s got\b": " has got",
|
||||||
|
# general contractions
|
||||||
|
r"n't\b": " not",
|
||||||
|
r"'re\b": " are",
|
||||||
|
r"'s\b": " is",
|
||||||
|
r"'d\b": " would",
|
||||||
|
r"'ll\b": " will",
|
||||||
|
r"'t\b": " not",
|
||||||
|
r"'ve\b": " have",
|
||||||
|
r"'m\b": " am",
|
||||||
|
}
|
||||||
|
self.standardize_numbers = EnglishNumberNormalizer()
|
||||||
|
self.standardize_spellings = EnglishSpellingNormalizer()
|
||||||
|
|
||||||
|
def __call__(self, s: str):
|
||||||
|
s = s.lower()
|
||||||
|
|
||||||
|
s = re.sub(r"[<\[][^>\]]*[>\]]", "", s) # remove words between brackets
|
||||||
|
s = re.sub(r"\(([^)]+?)\)", "", s) # remove words between parenthesis
|
||||||
|
s = re.sub(self.ignore_patterns, "", s)
|
||||||
|
s = re.sub(r"\s+'", "'", s) # when there's a space before an apostrophe
|
||||||
|
|
||||||
|
for pattern, replacement in self.replacers.items():
|
||||||
|
s = re.sub(pattern, replacement, s)
|
||||||
|
|
||||||
|
s = re.sub(r"(\d),(\d)", r"\1\2", s) # remove commas between digits
|
||||||
|
s = re.sub(r"\.([^0-9]|$)", r" \1", s) # remove periods not followed by numbers
|
||||||
|
s = remove_symbols_and_diacritics(s, keep=".%$¢€£") # keep numeric symbols
|
||||||
|
|
||||||
|
s = self.standardize_numbers(s)
|
||||||
|
s = self.standardize_spellings(s)
|
||||||
|
|
||||||
|
# now remove prefix/suffix symbols that are not preceded/followed by numbers
|
||||||
|
s = re.sub(r"[.$¢€£]([^0-9])", r" \1", s)
|
||||||
|
s = re.sub(r"([^0-9])%", r"\1 ", s)
|
||||||
|
|
||||||
|
s = re.sub(r"\s+", " ", s) # replace any successive whitespaces with a space
|
||||||
|
|
||||||
|
return s
|
||||||
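An end-to-end sketch of the full English normalizer above: filler words, abbreviations, contractions, and digit grouping are all handled by the steps shown in `__call__`. The output in the comment is indicative.

normalizer = EnglishTextNormalizer()
print(normalizer("Mrs. Smith couldn't pay the $1,250 bill, um, by Friday."))
# -> roughly "missus smith could not pay the $1250 bill by friday"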
388  whisperlivekit/simul_whisper/whisper/timing.py  Normal file
@@ -0,0 +1,388 @@
|
|||||||
|
import itertools
|
||||||
|
import subprocess
|
||||||
|
import warnings
|
||||||
|
from dataclasses import dataclass
|
||||||
|
from typing import TYPE_CHECKING, List
|
||||||
|
|
||||||
|
import numba
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import torch.nn.functional as F
|
||||||
|
|
||||||
|
from .audio import HOP_LENGTH, SAMPLE_RATE, TOKENS_PER_SECOND
|
||||||
|
from .tokenizer import Tokenizer
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .model import Whisper
|
||||||
|
|
||||||
|
|
||||||
|
def median_filter(x: torch.Tensor, filter_width: int):
|
||||||
|
"""Apply a median filter of width `filter_width` along the last dimension of `x`"""
|
||||||
|
pad_width = filter_width // 2
|
||||||
|
if x.shape[-1] <= pad_width:
|
||||||
|
# F.pad requires the padding width to be smaller than the input dimension
|
||||||
|
return x
|
||||||
|
|
||||||
|
if (ndim := x.ndim) <= 2:
|
||||||
|
# `F.pad` does not support 1D or 2D inputs for reflect padding but supports 3D and 4D
|
||||||
|
x = x[None, None, :]
|
||||||
|
|
||||||
|
assert (
|
||||||
|
filter_width > 0 and filter_width % 2 == 1
|
||||||
|
), "`filter_width` should be an odd number"
|
||||||
|
|
||||||
|
result = None
|
||||||
|
x = F.pad(x, (filter_width // 2, filter_width // 2, 0, 0), mode="reflect")
|
||||||
|
if x.is_cuda:
|
||||||
|
try:
|
||||||
|
from .triton_ops import median_filter_cuda
|
||||||
|
|
||||||
|
result = median_filter_cuda(x, filter_width)
|
||||||
|
except (RuntimeError, subprocess.CalledProcessError):
|
||||||
|
warnings.warn(
|
||||||
|
"Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
|
||||||
|
"falling back to a slower median kernel implementation..."
|
||||||
|
)
|
||||||
|
|
||||||
|
if result is None:
|
||||||
|
# sort() is faster than torch.median (https://github.com/pytorch/pytorch/issues/51450)
|
||||||
|
result = x.unfold(-1, filter_width, 1).sort()[0][..., filter_width // 2]
|
||||||
|
|
||||||
|
if ndim <= 2:
|
||||||
|
result = result[0, 0]
|
||||||
|
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@numba.jit(nopython=True)
|
||||||
|
def backtrace(trace: np.ndarray):
|
||||||
|
i = trace.shape[0] - 1
|
||||||
|
j = trace.shape[1] - 1
|
||||||
|
trace[0, :] = 2
|
||||||
|
trace[:, 0] = 1
|
||||||
|
|
||||||
|
result = []
|
||||||
|
while i > 0 or j > 0:
|
||||||
|
result.append((i - 1, j - 1))
|
||||||
|
|
||||||
|
if trace[i, j] == 0:
|
||||||
|
i -= 1
|
||||||
|
j -= 1
|
||||||
|
elif trace[i, j] == 1:
|
||||||
|
i -= 1
|
||||||
|
elif trace[i, j] == 2:
|
||||||
|
j -= 1
|
||||||
|
else:
|
||||||
|
raise ValueError("Unexpected trace[i, j]")
|
||||||
|
|
||||||
|
result = np.array(result)
|
||||||
|
return result[::-1, :].T
|
||||||
|
|
||||||
|
|
||||||
|
@numba.jit(nopython=True, parallel=True)
|
||||||
|
def dtw_cpu(x: np.ndarray):
|
||||||
|
N, M = x.shape
|
||||||
|
cost = np.ones((N + 1, M + 1), dtype=np.float32) * np.inf
|
||||||
|
trace = -np.ones((N + 1, M + 1), dtype=np.float32)
|
||||||
|
|
||||||
|
cost[0, 0] = 0
|
||||||
|
for j in range(1, M + 1):
|
||||||
|
for i in range(1, N + 1):
|
||||||
|
c0 = cost[i - 1, j - 1]
|
||||||
|
c1 = cost[i - 1, j]
|
||||||
|
c2 = cost[i, j - 1]
|
||||||
|
|
||||||
|
if c0 < c1 and c0 < c2:
|
||||||
|
c, t = c0, 0
|
||||||
|
elif c1 < c0 and c1 < c2:
|
||||||
|
c, t = c1, 1
|
||||||
|
else:
|
||||||
|
c, t = c2, 2
|
||||||
|
|
||||||
|
cost[i, j] = x[i - 1, j - 1] + c
|
||||||
|
trace[i, j] = t
|
||||||
|
|
||||||
|
return backtrace(trace)
|
||||||
|
|
||||||
|
|
||||||
|
def dtw_cuda(x, BLOCK_SIZE=1024):
|
||||||
|
from .triton_ops import dtw_kernel
|
||||||
|
|
||||||
|
M, N = x.shape
|
||||||
|
assert M < BLOCK_SIZE, f"M should be smaller than {BLOCK_SIZE=}"
|
||||||
|
|
||||||
|
x_skew = (
|
||||||
|
F.pad(x, (0, M + 1), value=np.inf).flatten()[: M * (N + M)].reshape(M, N + M)
|
||||||
|
)
|
||||||
|
x_skew = x_skew.T.contiguous()
|
||||||
|
cost = torch.ones(N + M + 2, M + 2) * np.inf
|
||||||
|
cost[0, 0] = 0
|
||||||
|
cost = cost.to(x.device)
|
||||||
|
trace = torch.zeros_like(cost, dtype=torch.int32)
|
||||||
|
|
||||||
|
dtw_kernel[(1,)](
|
||||||
|
cost,
|
||||||
|
trace,
|
||||||
|
x_skew,
|
||||||
|
x_skew.stride(0),
|
||||||
|
cost.stride(0),
|
||||||
|
trace.stride(0),
|
||||||
|
N,
|
||||||
|
M,
|
||||||
|
BLOCK_SIZE=BLOCK_SIZE,
|
||||||
|
)
|
||||||
|
|
||||||
|
trace = trace.T.flatten()[: (M + 1) * (M + N + 3)].reshape(M + 1, M + N + 3)[
|
||||||
|
:, : N + 1
|
||||||
|
]
|
||||||
|
return backtrace(trace.cpu().numpy())
|
||||||
|
|
||||||
|
|
||||||
|
def dtw(x: torch.Tensor) -> np.ndarray:
|
||||||
|
if x.is_cuda:
|
||||||
|
try:
|
||||||
|
return dtw_cuda(x)
|
||||||
|
except (RuntimeError, subprocess.CalledProcessError):
|
||||||
|
warnings.warn(
|
||||||
|
"Failed to launch Triton kernels, likely due to missing CUDA toolkit; "
|
||||||
|
"falling back to a slower DTW implementation..."
|
||||||
|
)
|
||||||
|
|
||||||
|
return dtw_cpu(x.double().cpu().numpy())
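A toy sanity check for the DTW routines above: on a CPU tensor, `dtw` dispatches to the numba implementation and returns the monotonic alignment path as (text index, time index) pairs. The cost matrix below is made up so that the intended path is obvious.

import torch

# 3 text tokens x 5 audio frames; low values mark good matches.
cost = torch.tensor([
    [0.1, 0.9, 0.9, 0.9, 0.9],
    [0.9, 0.1, 0.1, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.1, 0.1],
])
text_indices, time_indices = dtw(cost)
print(text_indices)   # -> [0 1 1 2 2]
print(time_indices)   # -> [0 1 2 3 4]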
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class WordTiming:
|
||||||
|
word: str
|
||||||
|
tokens: List[int]
|
||||||
|
start: float
|
||||||
|
end: float
|
||||||
|
probability: float
|
||||||
|
|
||||||
|
|
||||||
|
def find_alignment(
|
||||||
|
model: "Whisper",
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
text_tokens: List[int],
|
||||||
|
mel: torch.Tensor,
|
||||||
|
num_frames: int,
|
||||||
|
*,
|
||||||
|
medfilt_width: int = 7,
|
||||||
|
qk_scale: float = 1.0,
|
||||||
|
) -> List[WordTiming]:
|
||||||
|
if len(text_tokens) == 0:
|
||||||
|
return []
|
||||||
|
|
||||||
|
tokens = torch.tensor(
|
||||||
|
[
|
||||||
|
*tokenizer.sot_sequence,
|
||||||
|
tokenizer.no_timestamps,
|
||||||
|
*text_tokens,
|
||||||
|
tokenizer.eot,
|
||||||
|
]
|
||||||
|
).to(model.device)
|
||||||
|
|
||||||
|
# install hooks on the cross attention layers to retrieve the attention weights
|
||||||
|
QKs = [None] * model.dims.n_text_layer
|
||||||
|
hooks = [
|
||||||
|
block.cross_attn.register_forward_hook(
|
||||||
|
lambda _, ins, outs, index=i: QKs.__setitem__(index, outs[-1][0])
|
||||||
|
)
|
||||||
|
for i, block in enumerate(model.decoder.blocks)
|
||||||
|
]
|
||||||
|
|
||||||
|
from .model import disable_sdpa
|
||||||
|
|
||||||
|
with torch.no_grad(), disable_sdpa():
|
||||||
|
logits = model(mel.unsqueeze(0), tokens.unsqueeze(0))[0]
|
||||||
|
sampled_logits = logits[len(tokenizer.sot_sequence) :, : tokenizer.eot]
|
||||||
|
token_probs = sampled_logits.softmax(dim=-1)
|
||||||
|
text_token_probs = token_probs[np.arange(len(text_tokens)), text_tokens]
|
||||||
|
text_token_probs = text_token_probs.tolist()
|
||||||
|
|
||||||
|
for hook in hooks:
|
||||||
|
hook.remove()
|
||||||
|
|
||||||
|
# heads * tokens * frames
|
||||||
|
weights = torch.stack([QKs[_l][_h] for _l, _h in model.alignment_heads.indices().T])
|
||||||
|
weights = weights[:, :, : num_frames // 2]
|
||||||
|
weights = (weights * qk_scale).softmax(dim=-1)
|
||||||
|
std, mean = torch.std_mean(weights, dim=-2, keepdim=True, unbiased=False)
|
||||||
|
weights = (weights - mean) / std
|
||||||
|
weights = median_filter(weights, medfilt_width)
|
||||||
|
|
||||||
|
matrix = weights.mean(axis=0)
|
||||||
|
matrix = matrix[len(tokenizer.sot_sequence) : -1]
|
||||||
|
text_indices, time_indices = dtw(-matrix)
|
||||||
|
|
||||||
|
words, word_tokens = tokenizer.split_to_word_tokens(text_tokens + [tokenizer.eot])
|
||||||
|
if len(word_tokens) <= 1:
|
||||||
|
# return on eot only
|
||||||
|
# >>> np.pad([], (1, 0))
|
||||||
|
# array([0.])
|
||||||
|
# This results in crashes when we lookup jump_times with float, like
|
||||||
|
# IndexError: arrays used as indices must be of integer (or boolean) type
|
||||||
|
return []
|
||||||
|
word_boundaries = np.pad(np.cumsum([len(t) for t in word_tokens[:-1]]), (1, 0))
|
||||||
|
|
||||||
|
jumps = np.pad(np.diff(text_indices), (1, 0), constant_values=1).astype(bool)
|
||||||
|
jump_times = time_indices[jumps] / TOKENS_PER_SECOND
|
||||||
|
start_times = jump_times[word_boundaries[:-1]]
|
||||||
|
end_times = jump_times[word_boundaries[1:]]
|
||||||
|
word_probabilities = [
|
||||||
|
np.mean(text_token_probs[i:j])
|
||||||
|
for i, j in zip(word_boundaries[:-1], word_boundaries[1:])
|
||||||
|
]
|
||||||
|
|
||||||
|
return [
|
||||||
|
WordTiming(word, tokens, start, end, probability)
|
||||||
|
for word, tokens, start, end, probability in zip(
|
||||||
|
words, word_tokens, start_times, end_times, word_probabilities
|
||||||
|
)
|
||||||
|
]
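The index bookkeeping at the end of `find_alignment` is easiest to see on toy data. This sketch (made-up numbers, outside of any model) mirrors the same computation: the last "word" stands in for the appended EOT token, and `TOKENS_PER_SECOND` is the 50 tokens-per-second constant from `.audio`.

import numpy as np

TOKENS_PER_SECOND = 50
word_tokens = [[11], [12, 13], [50257]]            # two real words + the EOT pseudo-word
text_indices = np.array([0, 1, 1, 2, 3])           # token index visited along the DTW path
time_indices = np.array([0, 5, 9, 14, 20])         # frame index visited along the DTW path

word_boundaries = np.pad(np.cumsum([len(t) for t in word_tokens[:-1]]), (1, 0))
jumps = np.pad(np.diff(text_indices), (1, 0), constant_values=1).astype(bool)
jump_times = time_indices[jumps] / TOKENS_PER_SECOND

print(word_boundaries)                   # -> [0 1 3]
print(jump_times)                        # -> [0. 0.1 0.28 0.4]
print(jump_times[word_boundaries[:-1]])  # word start times -> [0. 0.1]
print(jump_times[word_boundaries[1:]])   # word end times   -> [0.1 0.4]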
|
||||||
|
|
||||||
|
|
||||||
|
def merge_punctuations(alignment: List[WordTiming], prepended: str, appended: str):
|
||||||
|
# merge prepended punctuations
|
||||||
|
i = len(alignment) - 2
|
||||||
|
j = len(alignment) - 1
|
||||||
|
while i >= 0:
|
||||||
|
previous = alignment[i]
|
||||||
|
following = alignment[j]
|
||||||
|
if previous.word.startswith(" ") and previous.word.strip() in prepended:
|
||||||
|
# prepend it to the following word
|
||||||
|
following.word = previous.word + following.word
|
||||||
|
following.tokens = previous.tokens + following.tokens
|
||||||
|
previous.word = ""
|
||||||
|
previous.tokens = []
|
||||||
|
else:
|
||||||
|
j = i
|
||||||
|
i -= 1
|
||||||
|
|
||||||
|
# merge appended punctuations
|
||||||
|
i = 0
|
||||||
|
j = 1
|
||||||
|
while j < len(alignment):
|
||||||
|
previous = alignment[i]
|
||||||
|
following = alignment[j]
|
||||||
|
if not previous.word.endswith(" ") and following.word in appended:
|
||||||
|
# append it to the previous word
|
||||||
|
previous.word = previous.word + following.word
|
||||||
|
previous.tokens = previous.tokens + following.tokens
|
||||||
|
following.word = ""
|
||||||
|
following.tokens = []
|
||||||
|
else:
|
||||||
|
i = j
|
||||||
|
j += 1
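A toy check of the merging behaviour above, using the `WordTiming` dataclass defined earlier (the timing values are made up; merged neighbours are emptied in place rather than removed).

words = [
    WordTiming(" “", [1], 0.00, 0.10, 1.0),
    WordTiming("hello", [2], 0.10, 0.50, 0.9),
    WordTiming("!", [3], 0.50, 0.60, 0.8),
]
merge_punctuations(words, prepended="\"'“¿([{-", appended="\"'.。,,!!??::”)]}、")
print([w.word for w in words])   # -> ['', ' “hello!', '']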
|
||||||
|
|
||||||
|
|
||||||
|
def add_word_timestamps(
|
||||||
|
*,
|
||||||
|
segments: List[dict],
|
||||||
|
model: "Whisper",
|
||||||
|
tokenizer: Tokenizer,
|
||||||
|
mel: torch.Tensor,
|
||||||
|
num_frames: int,
|
||||||
|
prepend_punctuations: str = "\"'“¿([{-",
|
||||||
|
append_punctuations: str = "\"'.。,,!!??::”)]}、",
|
||||||
|
last_speech_timestamp: float,
|
||||||
|
**kwargs,
|
||||||
|
):
|
||||||
|
if len(segments) == 0:
|
||||||
|
return
|
||||||
|
|
||||||
|
text_tokens_per_segment = [
|
||||||
|
[token for token in segment["tokens"] if token < tokenizer.eot]
|
||||||
|
for segment in segments
|
||||||
|
]
|
||||||
|
|
||||||
|
text_tokens = list(itertools.chain.from_iterable(text_tokens_per_segment))
|
||||||
|
alignment = find_alignment(model, tokenizer, text_tokens, mel, num_frames, **kwargs)
|
||||||
|
word_durations = np.array([t.end - t.start for t in alignment])
|
||||||
|
word_durations = word_durations[word_durations.nonzero()]
|
||||||
|
median_duration = np.median(word_durations) if len(word_durations) > 0 else 0.0
|
||||||
|
median_duration = min(0.7, float(median_duration))
|
||||||
|
max_duration = median_duration * 2
|
||||||
|
|
||||||
|
# hack: truncate long words at sentence boundaries.
|
||||||
|
# a better segmentation algorithm based on VAD should be able to replace this.
|
||||||
|
if len(word_durations) > 0:
|
||||||
|
sentence_end_marks = ".。!!??"
|
||||||
|
# ensure words at sentence boundaries are not longer than twice the median word duration.
|
||||||
|
for i in range(1, len(alignment)):
|
||||||
|
if alignment[i].end - alignment[i].start > max_duration:
|
||||||
|
if alignment[i].word in sentence_end_marks:
|
||||||
|
alignment[i].end = alignment[i].start + max_duration
|
||||||
|
elif alignment[i - 1].word in sentence_end_marks:
|
||||||
|
alignment[i].start = alignment[i].end - max_duration
|
||||||
|
|
||||||
|
merge_punctuations(alignment, prepend_punctuations, append_punctuations)
|
||||||
|
|
||||||
|
time_offset = segments[0]["seek"] * HOP_LENGTH / SAMPLE_RATE
|
||||||
|
word_index = 0
|
||||||
|
|
||||||
|
for segment, text_tokens in zip(segments, text_tokens_per_segment):
|
||||||
|
saved_tokens = 0
|
||||||
|
words = []
|
||||||
|
|
||||||
|
while word_index < len(alignment) and saved_tokens < len(text_tokens):
|
||||||
|
timing = alignment[word_index]
|
||||||
|
|
||||||
|
if timing.word:
|
||||||
|
words.append(
|
||||||
|
dict(
|
||||||
|
word=timing.word,
|
||||||
|
start=round(time_offset + timing.start, 2),
|
||||||
|
end=round(time_offset + timing.end, 2),
|
||||||
|
probability=timing.probability,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
|
||||||
|
saved_tokens += len(timing.tokens)
|
||||||
|
word_index += 1
|
||||||
|
|
||||||
|
# hack: truncate long words at segment boundaries.
|
||||||
|
# a better segmentation algorithm based on VAD should be able to replace this.
|
||||||
|
if len(words) > 0:
|
||||||
|
# ensure the first and second word after a pause is not longer than
|
||||||
|
# twice the median word duration.
|
||||||
|
if words[0]["end"] - last_speech_timestamp > median_duration * 4 and (
|
||||||
|
words[0]["end"] - words[0]["start"] > max_duration
|
||||||
|
or (
|
||||||
|
len(words) > 1
|
||||||
|
and words[1]["end"] - words[0]["start"] > max_duration * 2
|
||||||
|
)
|
||||||
|
):
|
||||||
|
if (
|
||||||
|
len(words) > 1
|
||||||
|
and words[1]["end"] - words[1]["start"] > max_duration
|
||||||
|
):
|
||||||
|
boundary = max(words[1]["end"] / 2, words[1]["end"] - max_duration)
|
||||||
|
words[0]["end"] = words[1]["start"] = boundary
|
||||||
|
words[0]["start"] = max(0, words[0]["end"] - max_duration)
|
||||||
|
|
||||||
|
# prefer the segment-level start timestamp if the first word is too long.
|
||||||
|
if (
|
||||||
|
segment["start"] < words[0]["end"]
|
||||||
|
and segment["start"] - 0.5 > words[0]["start"]
|
||||||
|
):
|
||||||
|
words[0]["start"] = max(
|
||||||
|
0, min(words[0]["end"] - median_duration, segment["start"])
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
segment["start"] = words[0]["start"]
|
||||||
|
|
||||||
|
# prefer the segment-level end timestamp if the last word is too long.
|
||||||
|
if (
|
||||||
|
segment["end"] > words[-1]["start"]
|
||||||
|
and segment["end"] + 0.5 < words[-1]["end"]
|
||||||
|
):
|
||||||
|
words[-1]["end"] = max(
|
||||||
|
words[-1]["start"] + median_duration, segment["end"]
|
||||||
|
)
|
||||||
|
else:
|
||||||
|
segment["end"] = words[-1]["end"]
|
||||||
|
|
||||||
|
last_speech_timestamp = segment["end"]
|
||||||
|
|
||||||
|
segment["words"] = words
|
||||||
395  whisperlivekit/simul_whisper/whisper/tokenizer.py  Normal file
@@ -0,0 +1,395 @@
|
|||||||
|
import base64
|
||||||
|
import os
|
||||||
|
import string
|
||||||
|
from dataclasses import dataclass, field
|
||||||
|
from functools import cached_property, lru_cache
|
||||||
|
from typing import Dict, List, Optional, Tuple
|
||||||
|
|
||||||
|
import tiktoken
|
||||||
|
|
||||||
|
LANGUAGES = {
|
||||||
|
"en": "english",
|
||||||
|
"zh": "chinese",
|
||||||
|
"de": "german",
|
||||||
|
"es": "spanish",
|
||||||
|
"ru": "russian",
|
||||||
|
"ko": "korean",
|
||||||
|
"fr": "french",
|
||||||
|
"ja": "japanese",
|
||||||
|
"pt": "portuguese",
|
||||||
|
"tr": "turkish",
|
||||||
|
"pl": "polish",
|
||||||
|
"ca": "catalan",
|
||||||
|
"nl": "dutch",
|
||||||
|
"ar": "arabic",
|
||||||
|
"sv": "swedish",
|
||||||
|
"it": "italian",
|
||||||
|
"id": "indonesian",
|
||||||
|
"hi": "hindi",
|
||||||
|
"fi": "finnish",
|
||||||
|
"vi": "vietnamese",
|
||||||
|
"he": "hebrew",
|
||||||
|
"uk": "ukrainian",
|
||||||
|
"el": "greek",
|
||||||
|
"ms": "malay",
|
||||||
|
"cs": "czech",
|
||||||
|
"ro": "romanian",
|
||||||
|
"da": "danish",
|
||||||
|
"hu": "hungarian",
|
||||||
|
"ta": "tamil",
|
||||||
|
"no": "norwegian",
|
||||||
|
"th": "thai",
|
||||||
|
"ur": "urdu",
|
||||||
|
"hr": "croatian",
|
||||||
|
"bg": "bulgarian",
|
||||||
|
"lt": "lithuanian",
|
||||||
|
"la": "latin",
|
||||||
|
"mi": "maori",
|
||||||
|
"ml": "malayalam",
|
||||||
|
"cy": "welsh",
|
||||||
|
"sk": "slovak",
|
||||||
|
"te": "telugu",
|
||||||
|
"fa": "persian",
|
||||||
|
"lv": "latvian",
|
||||||
|
"bn": "bengali",
|
||||||
|
"sr": "serbian",
|
||||||
|
"az": "azerbaijani",
|
||||||
|
"sl": "slovenian",
|
||||||
|
"kn": "kannada",
|
||||||
|
"et": "estonian",
|
||||||
|
"mk": "macedonian",
|
||||||
|
"br": "breton",
|
||||||
|
"eu": "basque",
|
||||||
|
"is": "icelandic",
|
||||||
|
"hy": "armenian",
|
||||||
|
"ne": "nepali",
|
||||||
|
"mn": "mongolian",
|
||||||
|
"bs": "bosnian",
|
||||||
|
"kk": "kazakh",
|
||||||
|
"sq": "albanian",
|
||||||
|
"sw": "swahili",
|
||||||
|
"gl": "galician",
|
||||||
|
"mr": "marathi",
|
||||||
|
"pa": "punjabi",
|
||||||
|
"si": "sinhala",
|
||||||
|
"km": "khmer",
|
||||||
|
"sn": "shona",
|
||||||
|
"yo": "yoruba",
|
||||||
|
"so": "somali",
|
||||||
|
"af": "afrikaans",
|
||||||
|
"oc": "occitan",
|
||||||
|
"ka": "georgian",
|
||||||
|
"be": "belarusian",
|
||||||
|
"tg": "tajik",
|
||||||
|
"sd": "sindhi",
|
||||||
|
"gu": "gujarati",
|
||||||
|
"am": "amharic",
|
||||||
|
"yi": "yiddish",
|
||||||
|
"lo": "lao",
|
||||||
|
"uz": "uzbek",
|
||||||
|
"fo": "faroese",
|
||||||
|
"ht": "haitian creole",
|
||||||
|
"ps": "pashto",
|
||||||
|
"tk": "turkmen",
|
||||||
|
"nn": "nynorsk",
|
||||||
|
"mt": "maltese",
|
||||||
|
"sa": "sanskrit",
|
||||||
|
"lb": "luxembourgish",
|
||||||
|
"my": "myanmar",
|
||||||
|
"bo": "tibetan",
|
||||||
|
"tl": "tagalog",
|
||||||
|
"mg": "malagasy",
|
||||||
|
"as": "assamese",
|
||||||
|
"tt": "tatar",
|
||||||
|
"haw": "hawaiian",
|
||||||
|
"ln": "lingala",
|
||||||
|
"ha": "hausa",
|
||||||
|
"ba": "bashkir",
|
||||||
|
"jw": "javanese",
|
||||||
|
"su": "sundanese",
|
||||||
|
"yue": "cantonese",
|
||||||
|
}
|
||||||
|
|
||||||
|
# language code lookup by name, with a few language aliases
|
||||||
|
TO_LANGUAGE_CODE = {
|
||||||
|
**{language: code for code, language in LANGUAGES.items()},
|
||||||
|
"burmese": "my",
|
||||||
|
"valencian": "ca",
|
||||||
|
"flemish": "nl",
|
||||||
|
"haitian": "ht",
|
||||||
|
"letzeburgesch": "lb",
|
||||||
|
"pushto": "ps",
|
||||||
|
"panjabi": "pa",
|
||||||
|
"moldavian": "ro",
|
||||||
|
"moldovan": "ro",
|
||||||
|
"sinhalese": "si",
|
||||||
|
"castilian": "es",
|
||||||
|
"mandarin": "zh",
|
||||||
|
}
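A quick sketch of how a user-supplied language name or alias resolves to a Whisper language code using the two dictionaries above (outputs in the comment are indicative).

for name in ("english", "mandarin", "castilian", "nl"):
    code = name if name in LANGUAGES else TO_LANGUAGE_CODE[name]
    print(f"{name} -> {code} ({LANGUAGES[code]})")
# english -> en (english), mandarin -> zh (chinese),
# castilian -> es (spanish), nl -> nl (dutch)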
|
||||||
|
|
||||||
|
|
||||||
|
@dataclass
|
||||||
|
class Tokenizer:
|
||||||
|
"""A thin wrapper around `tiktoken` providing quick access to special tokens"""
|
||||||
|
|
||||||
|
encoding: tiktoken.Encoding
|
||||||
|
num_languages: int
|
||||||
|
language: Optional[str] = None
|
||||||
|
task: Optional[str] = None
|
||||||
|
sot_sequence: Tuple[int] = ()
|
||||||
|
special_tokens: Dict[str, int] = field(default_factory=dict)
|
||||||
|
|
||||||
|
def __post_init__(self):
|
||||||
|
for special in self.encoding.special_tokens_set:
|
||||||
|
special_token = self.encoding.encode_single_token(special)
|
||||||
|
self.special_tokens[special] = special_token
|
||||||
|
|
||||||
|
sot: int = self.special_tokens["<|startoftranscript|>"]
|
||||||
|
translate: int = self.special_tokens["<|translate|>"]
|
||||||
|
transcribe: int = self.special_tokens["<|transcribe|>"]
|
||||||
|
|
||||||
|
langs = tuple(LANGUAGES.keys())[: self.num_languages]
|
||||||
|
sot_sequence = [sot]
|
||||||
|
if self.language is not None:
|
||||||
|
sot_sequence.append(sot + 1 + langs.index(self.language))
|
||||||
|
if self.task is not None:
|
||||||
|
task_token: int = transcribe if self.task == "transcribe" else translate
|
||||||
|
sot_sequence.append(task_token)
|
||||||
|
|
||||||
|
self.sot_sequence = tuple(sot_sequence)
|
||||||
|
|
||||||
|
def encode(self, text, **kwargs):
|
||||||
|
return self.encoding.encode(text, **kwargs)
|
||||||
|
|
||||||
|
def decode(self, token_ids: List[int], **kwargs) -> str:
|
||||||
|
token_ids = [t for t in token_ids if t < self.timestamp_begin]
|
||||||
|
return self.encoding.decode(token_ids, **kwargs)
|
||||||
|
|
||||||
|
def decode_with_timestamps(self, token_ids: List[int], **kwargs) -> str:
|
||||||
|
"""
|
||||||
|
Timestamp tokens are above other special tokens' id range and are ignored by `decode()`.
|
||||||
|
This method decodes given tokens with timestamps tokens annotated, e.g. "<|1.08|>".
|
||||||
|
"""
|
||||||
|
return self.encoding.decode(token_ids, **kwargs)
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def eot(self) -> int:
|
||||||
|
return self.encoding.eot_token
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def transcribe(self) -> int:
|
||||||
|
return self.special_tokens["<|transcribe|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def translate(self) -> int:
|
||||||
|
return self.special_tokens["<|translate|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def sot(self) -> int:
|
||||||
|
return self.special_tokens["<|startoftranscript|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def sot_lm(self) -> int:
|
||||||
|
return self.special_tokens["<|startoflm|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def sot_prev(self) -> int:
|
||||||
|
return self.special_tokens["<|startofprev|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def no_speech(self) -> int:
|
||||||
|
return self.special_tokens["<|nospeech|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def no_timestamps(self) -> int:
|
||||||
|
return self.special_tokens["<|notimestamps|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def timestamp_begin(self) -> int:
|
||||||
|
return self.special_tokens["<|0.00|>"]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def language_token(self) -> int:
|
||||||
|
"""Returns the token id corresponding to the value of the `language` field"""
|
||||||
|
if self.language is None:
|
||||||
|
raise ValueError("This tokenizer does not have language token configured")
|
||||||
|
|
||||||
|
return self.to_language_token(self.language)
|
||||||
|
|
||||||
|
def to_language_token(self, language):
|
||||||
|
if token := self.special_tokens.get(f"<|{language}|>", None):
|
||||||
|
return token
|
||||||
|
|
||||||
|
raise KeyError(f"Language {language} not found in tokenizer.")
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def all_language_tokens(self) -> Tuple[int]:
|
||||||
|
result = []
|
||||||
|
for token, token_id in self.special_tokens.items():
|
||||||
|
if token.strip("<|>") in LANGUAGES:
|
||||||
|
result.append(token_id)
|
||||||
|
return tuple(result)[: self.num_languages]
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def all_language_codes(self) -> Tuple[str]:
|
||||||
|
return tuple(self.decode([_l]).strip("<|>") for _l in self.all_language_tokens)
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def sot_sequence_including_notimestamps(self) -> Tuple[int]:
|
||||||
|
return tuple(list(self.sot_sequence) + [self.no_timestamps])
|
||||||
|
|
||||||
|
@cached_property
|
||||||
|
def non_speech_tokens(self) -> Tuple[int]:
|
||||||
|
"""
|
||||||
|
Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech
|
||||||
|
annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.
|
||||||
|
|
||||||
|
- ♪♪♪
|
||||||
|
- ( SPEAKING FOREIGN LANGUAGE )
|
||||||
|
- [DAVID] Hey there,
|
||||||
|
|
||||||
|
keeping basic punctuations like commas, periods, question marks, exclamation points, etc.
|
||||||
|
"""
|
||||||
|
symbols = list('"#()*+/:;<=>@[\\]^_`{|}~「」『』')
|
||||||
|
symbols += (
|
||||||
|
"<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split()
|
||||||
|
)
|
||||||
|
|
||||||
|
# symbols that may be a single token or multiple tokens depending on the tokenizer.
|
||||||
|
# In case they're multiple tokens, suppress the first token, which is safe because:
|
||||||
|
# These are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
|
||||||
|
# in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
|
||||||
|
miscellaneous = set("♩♪♫♬♭♮♯")
|
||||||
|
assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)
|
||||||
|
|
||||||
|
# allow hyphens "-" and single quotes "'" between words, but not at the beginning of a word
|
||||||
|
result = {self.encoding.encode(" -")[0], self.encoding.encode(" '")[0]}
|
||||||
|
for symbol in symbols + list(miscellaneous):
|
||||||
|
for tokens in [
|
||||||
|
self.encoding.encode(symbol),
|
||||||
|
self.encoding.encode(" " + symbol),
|
||||||
|
]:
|
||||||
|
if len(tokens) == 1 or symbol in miscellaneous:
|
||||||
|
result.add(tokens[0])
|
||||||
|
|
||||||
|
return tuple(sorted(result))
|
||||||
|
|
||||||
|
def split_to_word_tokens(self, tokens: List[int]):
|
||||||
|
if self.language in {"zh", "ja", "th", "lo", "my", "yue"}:
|
||||||
|
# These languages don't typically use spaces, so it is difficult to split words
|
||||||
|
# without morpheme analysis. Here, we instead split words at any
|
||||||
|
# position where the tokens are decoded as valid unicode points
|
||||||
|
return self.split_tokens_on_unicode(tokens)
|
||||||
|
|
||||||
|
return self.split_tokens_on_spaces(tokens)
|
||||||
|
|
||||||
|
def split_tokens_on_unicode(self, tokens: List[int]):
|
||||||
|
decoded_full = self.decode_with_timestamps(tokens)
|
||||||
|
replacement_char = "\ufffd"
|
||||||
|
|
||||||
|
words = []
|
||||||
|
word_tokens = []
|
||||||
|
current_tokens = []
|
||||||
|
unicode_offset = 0
|
||||||
|
|
||||||
|
for token in tokens:
|
||||||
|
current_tokens.append(token)
|
||||||
|
decoded = self.decode_with_timestamps(current_tokens)
|
||||||
|
|
||||||
|
if (
|
||||||
|
replacement_char not in decoded
|
||||||
|
or decoded_full[unicode_offset + decoded.index(replacement_char)]
|
||||||
|
== replacement_char
|
||||||
|
):
|
||||||
|
words.append(decoded)
|
||||||
|
word_tokens.append(current_tokens)
|
||||||
|
current_tokens = []
|
||||||
|
unicode_offset += len(decoded)
|
||||||
|
|
||||||
|
return words, word_tokens
|
||||||
|
|
||||||
|
def split_tokens_on_spaces(self, tokens: List[int]):
|
||||||
|
subwords, subword_tokens_list = self.split_tokens_on_unicode(tokens)
|
||||||
|
words = []
|
||||||
|
word_tokens = []
|
||||||
|
|
||||||
|
for subword, subword_tokens in zip(subwords, subword_tokens_list):
|
||||||
|
special = subword_tokens[0] >= self.eot
|
||||||
|
with_space = subword.startswith(" ")
|
||||||
|
punctuation = subword.strip() in string.punctuation
|
||||||
|
if special or with_space or punctuation or len(words) == 0:
|
||||||
|
words.append(subword)
|
||||||
|
word_tokens.append(subword_tokens)
|
||||||
|
else:
|
||||||
|
words[-1] = words[-1] + subword
|
||||||
|
word_tokens[-1].extend(subword_tokens)
|
||||||
|
|
||||||
|
return words, word_tokens
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=None)
|
||||||
|
def get_encoding(name: str = "gpt2", num_languages: int = 99):
|
||||||
|
vocab_path = os.path.join(os.path.dirname(__file__), "assets", f"{name}.tiktoken")
|
||||||
|
ranks = {
|
||||||
|
base64.b64decode(token): int(rank)
|
||||||
|
for token, rank in (line.split() for line in open(vocab_path) if line)
|
||||||
|
}
|
||||||
|
n_vocab = len(ranks)
|
||||||
|
special_tokens = {}
|
||||||
|
|
||||||
|
specials = [
|
||||||
|
"<|endoftext|>",
|
||||||
|
"<|startoftranscript|>",
|
||||||
|
*[f"<|{lang}|>" for lang in list(LANGUAGES.keys())[:num_languages]],
|
||||||
|
"<|translate|>",
|
||||||
|
"<|transcribe|>",
|
||||||
|
"<|startoflm|>",
|
||||||
|
"<|startofprev|>",
|
||||||
|
"<|nospeech|>",
|
||||||
|
"<|notimestamps|>",
|
||||||
|
*[f"<|{i * 0.02:.2f}|>" for i in range(1501)],
|
||||||
|
]
|
||||||
|
|
||||||
|
for token in specials:
|
||||||
|
special_tokens[token] = n_vocab
|
||||||
|
n_vocab += 1
|
||||||
|
|
||||||
|
return tiktoken.Encoding(
|
||||||
|
name=os.path.basename(vocab_path),
|
||||||
|
explicit_n_vocab=n_vocab,
|
||||||
|
pat_str=r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""",
|
||||||
|
mergeable_ranks=ranks,
|
||||||
|
special_tokens=special_tokens,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=None)
|
||||||
|
def get_tokenizer(
|
||||||
|
multilingual: bool,
|
||||||
|
*,
|
||||||
|
num_languages: int = 99,
|
||||||
|
language: Optional[str] = None,
|
||||||
|
task: Optional[str] = None, # Literal["transcribe", "translate", None]
|
||||||
|
) -> Tokenizer:
|
||||||
|
if language is not None:
|
||||||
|
language = language.lower()
|
||||||
|
if language not in LANGUAGES:
|
||||||
|
if language in TO_LANGUAGE_CODE:
|
||||||
|
language = TO_LANGUAGE_CODE[language]
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unsupported language: {language}")
|
||||||
|
|
||||||
|
if multilingual:
|
||||||
|
encoding_name = "multilingual"
|
||||||
|
language = language or "en"
|
||||||
|
task = task or "transcribe"
|
||||||
|
else:
|
||||||
|
encoding_name = "gpt2"
|
||||||
|
language = None
|
||||||
|
task = None
|
||||||
|
|
||||||
|
encoding = get_encoding(name=encoding_name, num_languages=num_languages)
|
||||||
|
|
||||||
|
return Tokenizer(
|
||||||
|
encoding=encoding, num_languages=num_languages, language=language, task=task
|
||||||
|
)
|
||||||
623
whisperlivekit/simul_whisper/whisper/transcribe.py
Normal file
623
whisperlivekit/simul_whisper/whisper/transcribe.py
Normal file
@@ -0,0 +1,623 @@
|
|||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
import traceback
|
||||||
|
import warnings
|
||||||
|
from typing import TYPE_CHECKING, List, Optional, Tuple, Union
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
import tqdm
|
||||||
|
|
||||||
|
from .audio import (
|
||||||
|
FRAMES_PER_SECOND,
|
||||||
|
HOP_LENGTH,
|
||||||
|
N_FRAMES,
|
||||||
|
N_SAMPLES,
|
||||||
|
SAMPLE_RATE,
|
||||||
|
log_mel_spectrogram,
|
||||||
|
pad_or_trim,
|
||||||
|
)
|
||||||
|
from .decoding import DecodingOptions, DecodingResult
|
||||||
|
from .timing import add_word_timestamps
|
||||||
|
from .tokenizer import LANGUAGES, TO_LANGUAGE_CODE, get_tokenizer
|
||||||
|
from .utils import (
|
||||||
|
exact_div,
|
||||||
|
format_timestamp,
|
||||||
|
get_end,
|
||||||
|
get_writer,
|
||||||
|
make_safe,
|
||||||
|
optional_float,
|
||||||
|
optional_int,
|
||||||
|
str2bool,
|
||||||
|
)
|
||||||
|
|
||||||
|
if TYPE_CHECKING:
|
||||||
|
from .model import Whisper
|
||||||
|
|
||||||
|
|
||||||
|
def transcribe(
|
||||||
|
model: "Whisper",
|
||||||
|
audio: Union[str, np.ndarray, torch.Tensor],
|
||||||
|
*,
|
||||||
|
verbose: Optional[bool] = None,
|
||||||
|
temperature: Union[float, Tuple[float, ...]] = (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
|
||||||
|
compression_ratio_threshold: Optional[float] = 2.4,
|
||||||
|
logprob_threshold: Optional[float] = -1.0,
|
||||||
|
no_speech_threshold: Optional[float] = 0.6,
|
||||||
|
condition_on_previous_text: bool = True,
|
||||||
|
initial_prompt: Optional[str] = None,
|
||||||
|
carry_initial_prompt: bool = False,
|
||||||
|
word_timestamps: bool = False,
|
||||||
|
prepend_punctuations: str = "\"'“¿([{-",
|
||||||
|
append_punctuations: str = "\"'.。,,!!??::”)]}、",
|
||||||
|
clip_timestamps: Union[str, List[float]] = "0",
|
||||||
|
hallucination_silence_threshold: Optional[float] = None,
|
||||||
|
**decode_options,
|
||||||
|
):
|
||||||
|
"""
|
||||||
|
Transcribe an audio file using Whisper
|
||||||
|
|
||||||
|
Parameters
|
||||||
|
----------
|
||||||
|
model: Whisper
|
||||||
|
The Whisper model instance
|
||||||
|
|
||||||
|
audio: Union[str, np.ndarray, torch.Tensor]
|
||||||
|
The path to the audio file to open, or the audio waveform
|
||||||
|
|
||||||
|
verbose: bool
|
||||||
|
Whether to display the text being decoded to the console. If True, displays all the details,
|
||||||
|
If False, displays minimal details. If None, does not display anything
|
||||||
|
|
||||||
|
temperature: Union[float, Tuple[float, ...]]
|
||||||
|
Temperature for sampling. It can be a tuple of temperatures, which will be successively used
|
||||||
|
upon failures according to either `compression_ratio_threshold` or `logprob_threshold`.
|
||||||
|
|
||||||
|
compression_ratio_threshold: float
|
||||||
|
If the gzip compression ratio is above this value, treat as failed
|
||||||
|
|
||||||
|
logprob_threshold: float
|
||||||
|
If the average log probability over sampled tokens is below this value, treat as failed
|
||||||
|
|
||||||
|
no_speech_threshold: float
|
||||||
|
If the no_speech probability is higher than this value AND the average log probability
|
||||||
|
over sampled tokens is below `logprob_threshold`, consider the segment as silent
|
||||||
|
|
||||||
|
condition_on_previous_text: bool
|
||||||
|
if True, the previous output of the model is provided as a prompt for the next window;
|
||||||
|
disabling may make the text inconsistent across windows, but the model becomes less prone to
|
||||||
|
getting stuck in a failure loop, such as repetition looping or timestamps going out of sync.
|
||||||
|
|
||||||
|
word_timestamps: bool
|
||||||
|
Extract word-level timestamps using the cross-attention pattern and dynamic time warping,
|
||||||
|
and include the timestamps for each word in each segment.
|
||||||
|
|
||||||
|
prepend_punctuations: str
|
||||||
|
If word_timestamps is True, merge these punctuation symbols with the next word
|
||||||
|
|
||||||
|
append_punctuations: str
|
||||||
|
If word_timestamps is True, merge these punctuation symbols with the previous word
|
||||||
|
|
||||||
|
initial_prompt: Optional[str]
|
||||||
|
Optional text to provide as a prompt for the first window. This can be used to provide, or
|
||||||
|
"prompt-engineer" a context for transcription, e.g. custom vocabularies or proper nouns
|
||||||
|
to make it more likely to predict those word correctly.
|
||||||
|
|
||||||
|
carry_initial_prompt: bool
|
||||||
|
If carry_initial_prompt is True, `initial_prompt` is prepended to the prompt of each internal
|
||||||
|
`decode()` call. If there is not enough context space at the start of the prompt, it is
|
||||||
|
left-sliced to make space.
|
||||||
|
|
||||||
|
decode_options: dict
|
||||||
|
Keyword arguments to construct `DecodingOptions` instances
|
||||||
|
|
||||||
|
clip_timestamps: Union[str, List[float]]
|
||||||
|
Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process.
|
||||||
|
The last end timestamp defaults to the end of the file.
|
||||||
|
|
||||||
|
hallucination_silence_threshold: Optional[float]
|
||||||
|
When word_timestamps is True, skip silent periods longer than this threshold (in seconds)
|
||||||
|
when a possible hallucination is detected
|
||||||
|
|
||||||
|
Returns
|
||||||
|
-------
|
||||||
|
A dictionary containing the resulting text ("text") and segment-level details ("segments"), and
|
||||||
|
the spoken language ("language"), which is detected when `decode_options["language"]` is None.
|
||||||
|
"""
|
||||||
|
dtype = torch.float16 if decode_options.get("fp16", True) else torch.float32
|
||||||
|
if model.device == torch.device("cpu"):
|
||||||
|
if torch.cuda.is_available():
|
||||||
|
warnings.warn("Performing inference on CPU when CUDA is available")
|
||||||
|
if dtype == torch.float16:
|
||||||
|
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
|
||||||
|
dtype = torch.float32
|
||||||
|
|
||||||
|
if dtype == torch.float32:
|
||||||
|
decode_options["fp16"] = False
|
||||||
|
|
||||||
|
# Pad 30-seconds of silence to the input audio, for slicing
|
||||||
|
mel = log_mel_spectrogram(audio, model.dims.n_mels, padding=N_SAMPLES)
|
||||||
|
content_frames = mel.shape[-1] - N_FRAMES
|
||||||
|
content_duration = float(content_frames * HOP_LENGTH / SAMPLE_RATE)
|
||||||
|
|
||||||
|
if decode_options.get("language", None) is None:
|
||||||
|
if not model.is_multilingual:
|
||||||
|
decode_options["language"] = "en"
|
||||||
|
else:
|
||||||
|
if verbose:
|
||||||
|
print(
|
||||||
|
"Detecting language using up to the first 30 seconds. Use `--language` to specify the language"
|
||||||
|
)
|
||||||
|
mel_segment = pad_or_trim(mel, N_FRAMES).to(model.device).to(dtype)
|
||||||
|
_, probs = model.detect_language(mel_segment)
|
||||||
|
decode_options["language"] = max(probs, key=probs.get)
|
||||||
|
if verbose is not None:
|
||||||
|
print(
|
||||||
|
f"Detected language: {LANGUAGES[decode_options['language']].title()}"
|
||||||
|
)
|
||||||
|
|
||||||
|
language: str = decode_options["language"]
|
||||||
|
task: str = decode_options.get("task", "transcribe")
|
||||||
|
tokenizer = get_tokenizer(
|
||||||
|
model.is_multilingual,
|
||||||
|
num_languages=model.num_languages,
|
||||||
|
language=language,
|
||||||
|
task=task,
|
||||||
|
)
|
||||||
|
|
||||||
|
if isinstance(clip_timestamps, str):
|
||||||
|
clip_timestamps = [
|
||||||
|
float(ts) for ts in (clip_timestamps.split(",") if clip_timestamps else [])
|
||||||
|
]
|
||||||
|
seek_points: List[int] = [round(ts * FRAMES_PER_SECOND) for ts in clip_timestamps]
|
||||||
|
if len(seek_points) == 0:
|
||||||
|
seek_points.append(0)
|
||||||
|
if len(seek_points) % 2 == 1:
|
||||||
|
seek_points.append(content_frames)
|
||||||
|
seek_clips: List[Tuple[int, int]] = list(zip(seek_points[::2], seek_points[1::2]))
|
||||||
|
|
||||||
|
punctuation = "\"'“¿([{-\"'.。,,!!??::”)]}、"
|
||||||
|
|
||||||
|
if word_timestamps and task == "translate":
|
||||||
|
warnings.warn("Word-level timestamps on translations may not be reliable.")
|
||||||
|
|
||||||
|
def decode_with_fallback(segment: torch.Tensor) -> DecodingResult:
|
||||||
|
temperatures = (
|
||||||
|
[temperature] if isinstance(temperature, (int, float)) else temperature
|
||||||
|
)
|
||||||
|
decode_result = None
|
||||||
|
|
||||||
|
for t in temperatures:
|
||||||
|
kwargs = {**decode_options}
|
||||||
|
if t > 0:
|
||||||
|
# disable beam_size and patience when t > 0
|
||||||
|
kwargs.pop("beam_size", None)
|
||||||
|
kwargs.pop("patience", None)
|
||||||
|
else:
|
||||||
|
# disable best_of when t == 0
|
||||||
|
kwargs.pop("best_of", None)
|
||||||
|
|
||||||
|
options = DecodingOptions(**kwargs, temperature=t)
|
||||||
|
decode_result = model.decode(segment, options)
|
||||||
|
|
||||||
|
needs_fallback = False
|
||||||
|
if (
|
||||||
|
compression_ratio_threshold is not None
|
||||||
|
and decode_result.compression_ratio > compression_ratio_threshold
|
||||||
|
):
|
||||||
|
needs_fallback = True # too repetitive
|
||||||
|
if (
|
||||||
|
logprob_threshold is not None
|
||||||
|
and decode_result.avg_logprob < logprob_threshold
|
||||||
|
):
|
||||||
|
needs_fallback = True # average log probability is too low
|
||||||
|
if (
|
||||||
|
no_speech_threshold is not None
|
||||||
|
and decode_result.no_speech_prob > no_speech_threshold
|
||||||
|
and logprob_threshold is not None
|
||||||
|
and decode_result.avg_logprob < logprob_threshold
|
||||||
|
):
|
||||||
|
needs_fallback = False # silence
|
||||||
|
if not needs_fallback:
|
||||||
|
break
|
||||||
|
|
||||||
|
return decode_result
|
||||||
|
|
||||||
|
clip_idx = 0
|
||||||
|
seek = seek_clips[clip_idx][0]
|
||||||
|
input_stride = exact_div(
|
||||||
|
N_FRAMES, model.dims.n_audio_ctx
|
||||||
|
) # mel frames per output token: 2
|
||||||
|
time_precision = (
|
||||||
|
input_stride * HOP_LENGTH / SAMPLE_RATE
|
||||||
|
) # time per output token: 0.02 (seconds)
|
||||||
|
all_tokens = []
|
||||||
|
all_segments = []
|
||||||
|
prompt_reset_since = 0
|
||||||
|
|
||||||
|
remaining_prompt_length = model.dims.n_text_ctx // 2 - 1
|
||||||
|
if initial_prompt is not None:
|
||||||
|
initial_prompt_tokens = tokenizer.encode(" " + initial_prompt.strip())
|
||||||
|
all_tokens.extend(initial_prompt_tokens)
|
||||||
|
remaining_prompt_length -= len(initial_prompt_tokens)
|
||||||
|
else:
|
||||||
|
initial_prompt_tokens = []
|
||||||
|
|
||||||
|
def new_segment(
|
||||||
|
*, start: float, end: float, tokens: torch.Tensor, result: DecodingResult
|
||||||
|
):
|
||||||
|
tokens = tokens.tolist()
|
||||||
|
text_tokens = [token for token in tokens if token < tokenizer.eot]
|
||||||
|
return {
|
||||||
|
"seek": seek,
|
||||||
|
"start": start,
|
||||||
|
"end": end,
|
||||||
|
"text": tokenizer.decode(text_tokens),
|
||||||
|
"tokens": tokens,
|
||||||
|
"temperature": result.temperature,
|
||||||
|
"avg_logprob": result.avg_logprob,
|
||||||
|
"compression_ratio": result.compression_ratio,
|
||||||
|
"no_speech_prob": result.no_speech_prob,
|
||||||
|
}
|
||||||
|
|
||||||
|
# show the progress bar when verbose is False (if True, transcribed text will be printed)
|
||||||
|
with tqdm.tqdm(
|
||||||
|
total=content_frames, unit="frames", disable=verbose is not False
|
||||||
|
) as pbar:
|
||||||
|
last_speech_timestamp = 0.0
|
||||||
|
# NOTE: This loop is obscurely flattened to make the diff readable.
|
||||||
|
# A later commit should turn this into a simpler nested loop.
|
||||||
|
# for seek_clip_start, seek_clip_end in seek_clips:
|
||||||
|
# while seek < seek_clip_end
|
||||||
|
while clip_idx < len(seek_clips):
|
||||||
|
seek_clip_start, seek_clip_end = seek_clips[clip_idx]
|
||||||
|
if seek < seek_clip_start:
|
||||||
|
seek = seek_clip_start
|
||||||
|
if seek >= seek_clip_end:
|
||||||
|
clip_idx += 1
|
||||||
|
if clip_idx < len(seek_clips):
|
||||||
|
seek = seek_clips[clip_idx][0]
|
||||||
|
continue
|
||||||
|
time_offset = float(seek * HOP_LENGTH / SAMPLE_RATE)
|
||||||
|
window_end_time = float((seek + N_FRAMES) * HOP_LENGTH / SAMPLE_RATE)
|
||||||
|
segment_size = min(N_FRAMES, content_frames - seek, seek_clip_end - seek)
|
||||||
|
mel_segment = mel[:, seek : seek + segment_size]
|
||||||
|
segment_duration = segment_size * HOP_LENGTH / SAMPLE_RATE
|
||||||
|
mel_segment = pad_or_trim(mel_segment, N_FRAMES).to(model.device).to(dtype)
|
||||||
|
|
||||||
|
if carry_initial_prompt:
|
||||||
|
nignored = max(len(initial_prompt_tokens), prompt_reset_since)
|
||||||
|
remaining_prompt = all_tokens[nignored:][-remaining_prompt_length:]
|
||||||
|
decode_options["prompt"] = initial_prompt_tokens + remaining_prompt
|
||||||
|
else:
|
||||||
|
decode_options["prompt"] = all_tokens[prompt_reset_since:]
|
||||||
|
|
||||||
|
result: DecodingResult = decode_with_fallback(mel_segment)
|
||||||
|
tokens = torch.tensor(result.tokens)
|
||||||
|
|
||||||
|
if no_speech_threshold is not None:
|
||||||
|
# no voice activity check
|
||||||
|
should_skip = result.no_speech_prob > no_speech_threshold
|
||||||
|
if (
|
||||||
|
logprob_threshold is not None
|
||||||
|
and result.avg_logprob > logprob_threshold
|
||||||
|
):
|
||||||
|
# don't skip if the logprob is high enough, despite the no_speech_prob
|
||||||
|
should_skip = False
|
||||||
|
|
||||||
|
if should_skip:
|
||||||
|
seek += segment_size # fast-forward to the next segment boundary
|
||||||
|
continue
|
||||||
|
|
||||||
|
previous_seek = seek
|
||||||
|
current_segments = []
|
||||||
|
|
||||||
|
# anomalous words are very long/short/improbable
|
||||||
|
def word_anomaly_score(word: dict) -> float:
|
||||||
|
probability = word.get("probability", 0.0)
|
||||||
|
duration = word["end"] - word["start"]
|
||||||
|
score = 0.0
|
||||||
|
if probability < 0.15:
|
||||||
|
score += 1.0
|
||||||
|
if duration < 0.133:
|
||||||
|
score += (0.133 - duration) * 15
|
||||||
|
if duration > 2.0:
|
||||||
|
score += duration - 2.0
|
||||||
|
return score
|
||||||
|
|
||||||
|
def is_segment_anomaly(segment: Optional[dict]) -> bool:
|
||||||
|
if segment is None or not segment["words"]:
|
||||||
|
return False
|
||||||
|
words = [w for w in segment["words"] if w["word"] not in punctuation]
|
||||||
|
words = words[:8]
|
||||||
|
score = sum(word_anomaly_score(w) for w in words)
|
||||||
|
return score >= 3 or score + 0.01 >= len(words)
|
||||||
|
|
||||||
|
def next_words_segment(segments: List[dict]) -> Optional[dict]:
|
||||||
|
return next((s for s in segments if s["words"]), None)
|
||||||
|
|
||||||
|
timestamp_tokens: torch.Tensor = tokens.ge(tokenizer.timestamp_begin)
|
||||||
|
single_timestamp_ending = timestamp_tokens[-2:].tolist() == [False, True]
|
||||||
|
|
||||||
|
consecutive = torch.where(timestamp_tokens[:-1] & timestamp_tokens[1:])[0]
|
||||||
|
consecutive.add_(1)
|
||||||
|
if len(consecutive) > 0:
|
||||||
|
# if the output contains two consecutive timestamp tokens
|
||||||
|
slices = consecutive.tolist()
|
||||||
|
if single_timestamp_ending:
|
||||||
|
slices.append(len(tokens))
|
||||||
|
|
||||||
|
last_slice = 0
|
||||||
|
for current_slice in slices:
|
||||||
|
sliced_tokens = tokens[last_slice:current_slice]
|
||||||
|
start_timestamp_pos = (
|
||||||
|
sliced_tokens[0].item() - tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
end_timestamp_pos = (
|
||||||
|
sliced_tokens[-1].item() - tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
current_segments.append(
|
||||||
|
new_segment(
|
||||||
|
start=time_offset + start_timestamp_pos * time_precision,
|
||||||
|
end=time_offset + end_timestamp_pos * time_precision,
|
||||||
|
tokens=sliced_tokens,
|
||||||
|
result=result,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
last_slice = current_slice
|
||||||
|
|
||||||
|
if single_timestamp_ending:
|
||||||
|
# single timestamp at the end means no speech after the last timestamp.
|
||||||
|
seek += segment_size
|
||||||
|
else:
|
||||||
|
# otherwise, ignore the unfinished segment and seek to the last timestamp
|
||||||
|
last_timestamp_pos = (
|
||||||
|
tokens[last_slice - 1].item() - tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
seek += last_timestamp_pos * input_stride
|
||||||
|
else:
|
||||||
|
duration = segment_duration
|
||||||
|
timestamps = tokens[timestamp_tokens.nonzero().flatten()]
|
||||||
|
if (
|
||||||
|
len(timestamps) > 0
|
||||||
|
and timestamps[-1].item() != tokenizer.timestamp_begin
|
||||||
|
):
|
||||||
|
# no consecutive timestamps but it has a timestamp; use the last one.
|
||||||
|
last_timestamp_pos = (
|
||||||
|
timestamps[-1].item() - tokenizer.timestamp_begin
|
||||||
|
)
|
||||||
|
duration = last_timestamp_pos * time_precision
|
||||||
|
|
||||||
|
current_segments.append(
|
||||||
|
new_segment(
|
||||||
|
start=time_offset,
|
||||||
|
end=time_offset + duration,
|
||||||
|
tokens=tokens,
|
||||||
|
result=result,
|
||||||
|
)
|
||||||
|
)
|
||||||
|
seek += segment_size
|
||||||
|
|
||||||
|
if word_timestamps:
|
||||||
|
add_word_timestamps(
|
||||||
|
segments=current_segments,
|
||||||
|
model=model,
|
||||||
|
tokenizer=tokenizer,
|
||||||
|
mel=mel_segment,
|
||||||
|
num_frames=segment_size,
|
||||||
|
prepend_punctuations=prepend_punctuations,
|
||||||
|
append_punctuations=append_punctuations,
|
||||||
|
last_speech_timestamp=last_speech_timestamp,
|
||||||
|
)
|
||||||
|
|
||||||
|
if not single_timestamp_ending:
|
||||||
|
last_word_end = get_end(current_segments)
|
||||||
|
if last_word_end is not None and last_word_end > time_offset:
|
||||||
|
seek = round(last_word_end * FRAMES_PER_SECOND)
|
||||||
|
|
||||||
|
# skip silence before possible hallucinations
|
||||||
|
if hallucination_silence_threshold is not None:
|
||||||
|
threshold = hallucination_silence_threshold
|
||||||
|
if not single_timestamp_ending:
|
||||||
|
last_word_end = get_end(current_segments)
|
||||||
|
if last_word_end is not None and last_word_end > time_offset:
|
||||||
|
remaining_duration = window_end_time - last_word_end
|
||||||
|
if remaining_duration > threshold:
|
||||||
|
seek = round(last_word_end * FRAMES_PER_SECOND)
|
||||||
|
else:
|
||||||
|
seek = previous_seek + segment_size
|
||||||
|
|
||||||
|
# if first segment might be a hallucination, skip leading silence
|
||||||
|
first_segment = next_words_segment(current_segments)
|
||||||
|
if first_segment is not None and is_segment_anomaly(first_segment):
|
||||||
|
gap = first_segment["start"] - time_offset
|
||||||
|
if gap > threshold:
|
||||||
|
seek = previous_seek + round(gap * FRAMES_PER_SECOND)
|
||||||
|
continue
|
||||||
|
|
||||||
|
# skip silence before any possible hallucination that is surrounded
|
||||||
|
# by silence or more hallucinations
|
||||||
|
hal_last_end = last_speech_timestamp
|
||||||
|
for si in range(len(current_segments)):
|
||||||
|
segment = current_segments[si]
|
||||||
|
if not segment["words"]:
|
||||||
|
continue
|
||||||
|
if is_segment_anomaly(segment):
|
||||||
|
next_segment = next_words_segment(
|
||||||
|
current_segments[si + 1 :]
|
||||||
|
)
|
||||||
|
if next_segment is not None:
|
||||||
|
hal_next_start = next_segment["words"][0]["start"]
|
||||||
|
else:
|
||||||
|
hal_next_start = time_offset + segment_duration
|
||||||
|
silence_before = (
|
||||||
|
segment["start"] - hal_last_end > threshold
|
||||||
|
or segment["start"] < threshold
|
||||||
|
or segment["start"] - time_offset < 2.0
|
||||||
|
)
|
||||||
|
silence_after = (
|
||||||
|
hal_next_start - segment["end"] > threshold
|
||||||
|
or is_segment_anomaly(next_segment)
|
||||||
|
or window_end_time - segment["end"] < 2.0
|
||||||
|
)
|
||||||
|
if silence_before and silence_after:
|
||||||
|
seek = round(
|
||||||
|
max(time_offset + 1, segment["start"])
|
||||||
|
* FRAMES_PER_SECOND
|
||||||
|
)
|
||||||
|
if content_duration - segment["end"] < threshold:
|
||||||
|
seek = content_frames
|
||||||
|
current_segments[si:] = []
|
||||||
|
break
|
||||||
|
hal_last_end = segment["end"]
|
||||||
|
|
||||||
|
last_word_end = get_end(current_segments)
|
||||||
|
if last_word_end is not None:
|
||||||
|
last_speech_timestamp = last_word_end
|
||||||
|
|
||||||
|
if verbose:
|
||||||
|
for segment in current_segments:
|
||||||
|
start, end, text = segment["start"], segment["end"], segment["text"]
|
||||||
|
line = f"[{format_timestamp(start)} --> {format_timestamp(end)}] {text}"
|
||||||
|
print(make_safe(line))
|
||||||
|
|
||||||
|
# if a segment is instantaneous or does not contain text, clear it
|
||||||
|
for i, segment in enumerate(current_segments):
|
||||||
|
if segment["start"] == segment["end"] or segment["text"].strip() == "":
|
||||||
|
segment["text"] = ""
|
||||||
|
segment["tokens"] = []
|
||||||
|
segment["words"] = []
|
||||||
|
|
||||||
|
all_segments.extend(
|
||||||
|
[
|
||||||
|
{"id": i, **segment}
|
||||||
|
for i, segment in enumerate(
|
||||||
|
current_segments, start=len(all_segments)
|
||||||
|
)
|
||||||
|
]
|
||||||
|
)
|
||||||
|
all_tokens.extend(
|
||||||
|
[token for segment in current_segments for token in segment["tokens"]]
|
||||||
|
)
|
||||||
|
|
||||||
|
if not condition_on_previous_text or result.temperature > 0.5:
|
||||||
|
# do not feed the prompt tokens if a high temperature was used
|
||||||
|
prompt_reset_since = len(all_tokens)
|
||||||
|
|
||||||
|
# update progress bar
|
||||||
|
pbar.update(min(content_frames, seek) - previous_seek)
|
||||||
|
|
||||||
|
return dict(
|
||||||
|
text=tokenizer.decode(all_tokens[len(initial_prompt_tokens) :]),
|
||||||
|
segments=all_segments,
|
||||||
|
language=language,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def cli():
|
||||||
|
from . import available_models
|
||||||
|
|
||||||
|
def valid_model_name(name):
|
||||||
|
if name in available_models() or os.path.exists(name):
|
||||||
|
return name
|
||||||
|
raise ValueError(
|
||||||
|
f"model should be one of {available_models()} or path to a model checkpoint"
|
||||||
|
)
|
||||||
|
|
||||||
|
# fmt: off
|
||||||
|
parser = argparse.ArgumentParser(formatter_class=argparse.ArgumentDefaultsHelpFormatter)
|
||||||
|
parser.add_argument("audio", nargs="+", type=str, help="audio file(s) to transcribe")
|
||||||
|
parser.add_argument("--model", default="turbo", type=valid_model_name, help="name of the Whisper model to use")
|
||||||
|
parser.add_argument("--model_dir", type=str, default=None, help="the path to save model files; uses ~/.cache/whisper by default")
|
||||||
|
parser.add_argument("--device", default="cuda" if torch.cuda.is_available() else "cpu", help="device to use for PyTorch inference")
|
||||||
|
parser.add_argument("--output_dir", "-o", type=str, default=".", help="directory to save the outputs")
|
||||||
|
parser.add_argument("--output_format", "-f", type=str, default="all", choices=["txt", "vtt", "srt", "tsv", "json", "all"], help="format of the output file; if not specified, all available formats will be produced")
|
||||||
|
parser.add_argument("--verbose", type=str2bool, default=True, help="whether to print out the progress and debug messages")
|
||||||
|
|
||||||
|
parser.add_argument("--task", type=str, default="transcribe", choices=["transcribe", "translate"], help="whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate')")
|
||||||
|
parser.add_argument("--language", type=str, default=None, choices=sorted(LANGUAGES.keys()) + sorted([k.title() for k in TO_LANGUAGE_CODE.keys()]), help="language spoken in the audio, specify None to perform language detection")
|
||||||
|
|
||||||
|
parser.add_argument("--temperature", type=float, default=0, help="temperature to use for sampling")
|
||||||
|
parser.add_argument("--best_of", type=optional_int, default=5, help="number of candidates when sampling with non-zero temperature")
|
||||||
|
parser.add_argument("--beam_size", type=optional_int, default=5, help="number of beams in beam search, only applicable when temperature is zero")
|
||||||
|
parser.add_argument("--patience", type=float, default=None, help="optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search")
|
||||||
|
parser.add_argument("--length_penalty", type=float, default=None, help="optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default")
|
||||||
|
|
||||||
|
parser.add_argument("--suppress_tokens", type=str, default="-1", help="comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations")
|
||||||
|
parser.add_argument("--initial_prompt", type=str, default=None, help="optional text to provide as a prompt for the first window.")
|
||||||
|
parser.add_argument("--carry_initial_prompt", type=str2bool, default=False, help="if True, prepend initial_prompt to every internal decode() call. May reduce the effectiveness of condition_on_previous_text")
|
||||||
|
|
||||||
|
parser.add_argument("--condition_on_previous_text", type=str2bool, default=True, help="if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop")
|
||||||
|
parser.add_argument("--fp16", type=str2bool, default=True, help="whether to perform inference in fp16; True by default")
|
||||||
|
|
||||||
|
parser.add_argument("--temperature_increment_on_fallback", type=optional_float, default=0.2, help="temperature to increase when falling back when the decoding fails to meet either of the thresholds below")
|
||||||
|
parser.add_argument("--compression_ratio_threshold", type=optional_float, default=2.4, help="if the gzip compression ratio is higher than this value, treat the decoding as failed")
|
||||||
|
parser.add_argument("--logprob_threshold", type=optional_float, default=-1.0, help="if the average log probability is lower than this value, treat the decoding as failed")
|
||||||
|
parser.add_argument("--no_speech_threshold", type=optional_float, default=0.6, help="if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to `logprob_threshold`, consider the segment as silence")
|
||||||
|
parser.add_argument("--word_timestamps", type=str2bool, default=False, help="(experimental) extract word-level timestamps and refine the results based on them")
|
||||||
|
parser.add_argument("--prepend_punctuations", type=str, default="\"\'“¿([{-", help="if word_timestamps is True, merge these punctuation symbols with the next word")
|
||||||
|
parser.add_argument("--append_punctuations", type=str, default="\"\'.。,,!!??::”)]}、", help="if word_timestamps is True, merge these punctuation symbols with the previous word")
|
||||||
|
parser.add_argument("--highlight_words", type=str2bool, default=False, help="(requires --word_timestamps True) underline each word as it is spoken in srt and vtt")
|
||||||
|
parser.add_argument("--max_line_width", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of characters in a line before breaking the line")
|
||||||
|
parser.add_argument("--max_line_count", type=optional_int, default=None, help="(requires --word_timestamps True) the maximum number of lines in a segment")
|
||||||
|
parser.add_argument("--max_words_per_line", type=optional_int, default=None, help="(requires --word_timestamps True, no effect with --max_line_width) the maximum number of words in a segment")
|
||||||
|
parser.add_argument("--threads", type=optional_int, default=0, help="number of threads used by torch for CPU inference; supercedes MKL_NUM_THREADS/OMP_NUM_THREADS")
|
||||||
|
parser.add_argument("--clip_timestamps", type=str, default="0", help="comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process, where the last end timestamp defaults to the end of the file")
|
||||||
|
parser.add_argument("--hallucination_silence_threshold", type=optional_float, help="(requires --word_timestamps True) skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected")
|
||||||
|
# fmt: on
|
||||||
|
|
||||||
|
args = parser.parse_args().__dict__
|
||||||
|
model_name: str = args.pop("model")
|
||||||
|
model_dir: str = args.pop("model_dir")
|
||||||
|
output_dir: str = args.pop("output_dir")
|
||||||
|
output_format: str = args.pop("output_format")
|
||||||
|
device: str = args.pop("device")
|
||||||
|
os.makedirs(output_dir, exist_ok=True)
|
||||||
|
|
||||||
|
if model_name.endswith(".en") and args["language"] not in {"en", "English"}:
|
||||||
|
if args["language"] is not None:
|
||||||
|
warnings.warn(
|
||||||
|
f"{model_name} is an English-only model but receipted '{args['language']}'; using English instead."
|
||||||
|
)
|
||||||
|
args["language"] = "en"
|
||||||
|
|
||||||
|
temperature = args.pop("temperature")
|
||||||
|
if (increment := args.pop("temperature_increment_on_fallback")) is not None:
|
||||||
|
temperature = tuple(np.arange(temperature, 1.0 + 1e-6, increment))
|
||||||
|
else:
|
||||||
|
temperature = [temperature]
|
||||||
|
|
||||||
|
if (threads := args.pop("threads")) > 0:
|
||||||
|
torch.set_num_threads(threads)
|
||||||
|
|
||||||
|
from . import load_model
|
||||||
|
|
||||||
|
model = load_model(model_name, device=device, download_root=model_dir)
|
||||||
|
|
||||||
|
writer = get_writer(output_format, output_dir)
|
||||||
|
word_options = [
|
||||||
|
"highlight_words",
|
||||||
|
"max_line_count",
|
||||||
|
"max_line_width",
|
||||||
|
"max_words_per_line",
|
||||||
|
]
|
||||||
|
if not args["word_timestamps"]:
|
||||||
|
for option in word_options:
|
||||||
|
if args[option]:
|
||||||
|
parser.error(f"--{option} requires --word_timestamps True")
|
||||||
|
if args["max_line_count"] and not args["max_line_width"]:
|
||||||
|
warnings.warn("--max_line_count has no effect without --max_line_width")
|
||||||
|
if args["max_words_per_line"] and args["max_line_width"]:
|
||||||
|
warnings.warn("--max_words_per_line has no effect with --max_line_width")
|
||||||
|
writer_args = {arg: args.pop(arg) for arg in word_options}
|
||||||
|
for audio_path in args.pop("audio"):
|
||||||
|
try:
|
||||||
|
result = transcribe(model, audio_path, temperature=temperature, **args)
|
||||||
|
writer(result, audio_path, **writer_args)
|
||||||
|
except Exception as e:
|
||||||
|
traceback.print_exc()
|
||||||
|
print(f"Skipping {audio_path} due to {type(e).__name__}: {str(e)}")
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
cli()
|
||||||
117
whisperlivekit/simul_whisper/whisper/triton_ops.py
Normal file
117
whisperlivekit/simul_whisper/whisper/triton_ops.py
Normal file
@@ -0,0 +1,117 @@
|
|||||||
|
from functools import lru_cache
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import torch
|
||||||
|
|
||||||
|
try:
|
||||||
|
import triton
|
||||||
|
import triton.language as tl
|
||||||
|
except ImportError:
|
||||||
|
raise RuntimeError("triton import failed; try `pip install --pre triton`")
|
||||||
|
|
||||||
|
|
||||||
|
@triton.jit
|
||||||
|
def dtw_kernel(
|
||||||
|
cost, trace, x, x_stride, cost_stride, trace_stride, N, M, BLOCK_SIZE: tl.constexpr
|
||||||
|
):
|
||||||
|
offsets = tl.arange(0, BLOCK_SIZE)
|
||||||
|
mask = offsets < M
|
||||||
|
|
||||||
|
for k in range(1, N + M + 1): # k = i + j
|
||||||
|
tl.debug_barrier()
|
||||||
|
|
||||||
|
p0 = cost + (k - 1) * cost_stride
|
||||||
|
p1 = cost + k * cost_stride
|
||||||
|
p2 = cost + k * cost_stride + 1
|
||||||
|
|
||||||
|
c0 = tl.load(p0 + offsets, mask=mask)
|
||||||
|
c1 = tl.load(p1 + offsets, mask=mask)
|
||||||
|
c2 = tl.load(p2 + offsets, mask=mask)
|
||||||
|
|
||||||
|
x_row = tl.load(x + (k - 1) * x_stride + offsets, mask=mask, other=0)
|
||||||
|
cost_row = x_row + tl.minimum(tl.minimum(c0, c1), c2)
|
||||||
|
|
||||||
|
cost_ptr = cost + (k + 1) * cost_stride + 1
|
||||||
|
tl.store(cost_ptr + offsets, cost_row, mask=mask)
|
||||||
|
|
||||||
|
trace_ptr = trace + (k + 1) * trace_stride + 1
|
||||||
|
tl.store(trace_ptr + offsets, 2, mask=mask & (c2 <= c0) & (c2 <= c1))
|
||||||
|
tl.store(trace_ptr + offsets, 1, mask=mask & (c1 <= c0) & (c1 <= c2))
|
||||||
|
tl.store(trace_ptr + offsets, 0, mask=mask & (c0 <= c1) & (c0 <= c2))
|
||||||
|
|
||||||
|
|
||||||
|
@lru_cache(maxsize=None)
|
||||||
|
def median_kernel(filter_width: int):
|
||||||
|
@triton.jit
|
||||||
|
def kernel(
|
||||||
|
y, x, x_stride, y_stride, BLOCK_SIZE: tl.constexpr
|
||||||
|
): # x.shape[-1] == filter_width
|
||||||
|
row_idx = tl.program_id(0)
|
||||||
|
offsets = tl.arange(0, BLOCK_SIZE)
|
||||||
|
mask = offsets < y_stride
|
||||||
|
|
||||||
|
x_ptr = x + row_idx * x_stride # noqa: F841
|
||||||
|
y_ptr = y + row_idx * y_stride
|
||||||
|
|
||||||
|
LOAD_ALL_ROWS_HERE # noqa: F821
|
||||||
|
|
||||||
|
BUBBLESORT_HERE # noqa: F821
|
||||||
|
|
||||||
|
tl.store(y_ptr + offsets, MIDDLE_ROW_HERE, mask=mask) # noqa: F821
|
||||||
|
|
||||||
|
kernel = triton.JITFunction(kernel.fn)
|
||||||
|
new_kernel = kernel.src.replace(
|
||||||
|
" LOAD_ALL_ROWS_HERE",
|
||||||
|
"\n".join(
|
||||||
|
[
|
||||||
|
f" row{i} = tl.load(x_ptr + offsets + {i}, mask=mask)"
|
||||||
|
for i in range(filter_width)
|
||||||
|
]
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
new_kernel = new_kernel.replace(
|
||||||
|
" BUBBLESORT_HERE",
|
||||||
|
"\n\n".join(
|
||||||
|
[
|
||||||
|
"\n\n".join(
|
||||||
|
[
|
||||||
|
"\n".join(
|
||||||
|
[
|
||||||
|
f" smaller = tl.where(row{j} < row{j + 1}, row{j}, row{j + 1})",
|
||||||
|
f" larger = tl.where(row{j} > row{j + 1}, row{j}, row{j + 1})",
|
||||||
|
f" row{j} = smaller",
|
||||||
|
f" row{j + 1} = larger",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
for j in range(filter_width - i - 1)
|
||||||
|
]
|
||||||
|
)
|
||||||
|
for i in range(filter_width // 2 + 1)
|
||||||
|
]
|
||||||
|
),
|
||||||
|
)
|
||||||
|
|
||||||
|
new_kernel = new_kernel.replace("MIDDLE_ROW_HERE", f"row{filter_width // 2}")
|
||||||
|
|
||||||
|
if hasattr(kernel, "_unsafe_update_src") is True:
|
||||||
|
kernel._unsafe_update_src(new_kernel)
|
||||||
|
kernel.hash = None
|
||||||
|
else:
|
||||||
|
kernel.src = new_kernel
|
||||||
|
|
||||||
|
return kernel
|
||||||
|
|
||||||
|
|
||||||
|
def median_filter_cuda(x: torch.Tensor, filter_width: int):
|
||||||
|
"""Apply a median filter of given width along the last dimension of x"""
|
||||||
|
slices = x.contiguous().unfold(-1, filter_width, 1)
|
||||||
|
grid = np.prod(slices.shape[:-2])
|
||||||
|
|
||||||
|
kernel = median_kernel(filter_width)
|
||||||
|
y = torch.empty_like(slices[..., 0])
|
||||||
|
|
||||||
|
BLOCK_SIZE = 1 << (y.stride(-2) - 1).bit_length()
|
||||||
|
kernel[(grid,)](y, x, x.stride(-2), y.stride(-2), BLOCK_SIZE=BLOCK_SIZE)
|
||||||
|
|
||||||
|
return y
|
||||||
318
whisperlivekit/simul_whisper/whisper/utils.py
Normal file
318
whisperlivekit/simul_whisper/whisper/utils.py
Normal file
@@ -0,0 +1,318 @@
|
|||||||
|
import json
|
||||||
|
import os
|
||||||
|
import re
|
||||||
|
import sys
|
||||||
|
import zlib
|
||||||
|
from typing import Callable, List, Optional, TextIO
|
||||||
|
|
||||||
|
system_encoding = sys.getdefaultencoding()
|
||||||
|
|
||||||
|
if system_encoding != "utf-8":
|
||||||
|
|
||||||
|
def make_safe(string):
|
||||||
|
# replaces any character not representable using the system default encoding with an '?',
|
||||||
|
# avoiding UnicodeEncodeError (https://github.com/openai/whisper/discussions/729).
|
||||||
|
return string.encode(system_encoding, errors="replace").decode(system_encoding)
|
||||||
|
|
||||||
|
else:
|
||||||
|
|
||||||
|
def make_safe(string):
|
||||||
|
# utf-8 can encode any Unicode code point, so no need to do the round-trip encoding
|
||||||
|
return string
|
||||||
|
|
||||||
|
|
||||||
|
def exact_div(x, y):
|
||||||
|
assert x % y == 0
|
||||||
|
return x // y
|
||||||
|
|
||||||
|
|
||||||
|
def str2bool(string):
|
||||||
|
str2val = {"True": True, "False": False}
|
||||||
|
if string in str2val:
|
||||||
|
return str2val[string]
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Expected one of {set(str2val.keys())}, got {string}")
|
||||||
|
|
||||||
|
|
||||||
|
def optional_int(string):
|
||||||
|
return None if string == "None" else int(string)
|
||||||
|
|
||||||
|
|
||||||
|
def optional_float(string):
|
||||||
|
return None if string == "None" else float(string)
|
||||||
|
|
||||||
|
|
||||||
|
def compression_ratio(text) -> float:
|
||||||
|
text_bytes = text.encode("utf-8")
|
||||||
|
return len(text_bytes) / len(zlib.compress(text_bytes))
|
||||||
|
|
||||||
|
|
||||||
|
def format_timestamp(
|
||||||
|
seconds: float, always_include_hours: bool = False, decimal_marker: str = "."
|
||||||
|
):
|
||||||
|
assert seconds >= 0, "non-negative timestamp expected"
|
||||||
|
milliseconds = round(seconds * 1000.0)
|
||||||
|
|
||||||
|
hours = milliseconds // 3_600_000
|
||||||
|
milliseconds -= hours * 3_600_000
|
||||||
|
|
||||||
|
minutes = milliseconds // 60_000
|
||||||
|
milliseconds -= minutes * 60_000
|
||||||
|
|
||||||
|
seconds = milliseconds // 1_000
|
||||||
|
milliseconds -= seconds * 1_000
|
||||||
|
|
||||||
|
hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
|
||||||
|
return (
|
||||||
|
f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_start(segments: List[dict]) -> Optional[float]:
|
||||||
|
return next(
|
||||||
|
(w["start"] for s in segments for w in s["words"]),
|
||||||
|
segments[0]["start"] if segments else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def get_end(segments: List[dict]) -> Optional[float]:
|
||||||
|
return next(
|
||||||
|
(w["end"] for s in reversed(segments) for w in reversed(s["words"])),
|
||||||
|
segments[-1]["end"] if segments else None,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class ResultWriter:
|
||||||
|
extension: str
|
||||||
|
|
||||||
|
def __init__(self, output_dir: str):
|
||||||
|
self.output_dir = output_dir
|
||||||
|
|
||||||
|
def __call__(
|
||||||
|
self, result: dict, audio_path: str, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
audio_basename = os.path.basename(audio_path)
|
||||||
|
audio_basename = os.path.splitext(audio_basename)[0]
|
||||||
|
output_path = os.path.join(
|
||||||
|
self.output_dir, audio_basename + "." + self.extension
|
||||||
|
)
|
||||||
|
|
||||||
|
with open(output_path, "w", encoding="utf-8") as f:
|
||||||
|
self.write_result(result, file=f, options=options, **kwargs)
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
raise NotImplementedError
|
||||||
|
|
||||||
|
|
||||||
|
class WriteTXT(ResultWriter):
|
||||||
|
extension: str = "txt"
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
for segment in result["segments"]:
|
||||||
|
print(segment["text"].strip(), file=file, flush=True)
|
||||||
|
|
||||||
|
|
||||||
|
class SubtitlesWriter(ResultWriter):
|
||||||
|
always_include_hours: bool
|
||||||
|
decimal_marker: str
|
||||||
|
|
||||||
|
def iterate_result(
|
||||||
|
self,
|
||||||
|
result: dict,
|
||||||
|
options: Optional[dict] = None,
|
||||||
|
*,
|
||||||
|
max_line_width: Optional[int] = None,
|
||||||
|
max_line_count: Optional[int] = None,
|
||||||
|
highlight_words: bool = False,
|
||||||
|
max_words_per_line: Optional[int] = None,
|
||||||
|
):
|
||||||
|
options = options or {}
|
||||||
|
max_line_width = max_line_width or options.get("max_line_width")
|
||||||
|
max_line_count = max_line_count or options.get("max_line_count")
|
||||||
|
highlight_words = highlight_words or options.get("highlight_words", False)
|
||||||
|
max_words_per_line = max_words_per_line or options.get("max_words_per_line")
|
||||||
|
preserve_segments = max_line_count is None or max_line_width is None
|
||||||
|
max_line_width = max_line_width or 1000
|
||||||
|
max_words_per_line = max_words_per_line or 1000
|
||||||
|
|
||||||
|
def iterate_subtitles():
|
||||||
|
line_len = 0
|
||||||
|
line_count = 1
|
||||||
|
# the next subtitle to yield (a list of word timings with whitespace)
|
||||||
|
subtitle: List[dict] = []
|
||||||
|
last: float = get_start(result["segments"]) or 0.0
|
||||||
|
for segment in result["segments"]:
|
||||||
|
chunk_index = 0
|
||||||
|
words_count = max_words_per_line
|
||||||
|
while chunk_index < len(segment["words"]):
|
||||||
|
remaining_words = len(segment["words"]) - chunk_index
|
||||||
|
if max_words_per_line > len(segment["words"]) - chunk_index:
|
||||||
|
words_count = remaining_words
|
||||||
|
for i, original_timing in enumerate(
|
||||||
|
segment["words"][chunk_index : chunk_index + words_count]
|
||||||
|
):
|
||||||
|
timing = original_timing.copy()
|
||||||
|
long_pause = (
|
||||||
|
not preserve_segments and timing["start"] - last > 3.0
|
||||||
|
)
|
||||||
|
has_room = line_len + len(timing["word"]) <= max_line_width
|
||||||
|
seg_break = i == 0 and len(subtitle) > 0 and preserve_segments
|
||||||
|
if (
|
||||||
|
line_len > 0
|
||||||
|
and has_room
|
||||||
|
and not long_pause
|
||||||
|
and not seg_break
|
||||||
|
):
|
||||||
|
# line continuation
|
||||||
|
line_len += len(timing["word"])
|
||||||
|
else:
|
||||||
|
# new line
|
||||||
|
timing["word"] = timing["word"].strip()
|
||||||
|
if (
|
||||||
|
len(subtitle) > 0
|
||||||
|
and max_line_count is not None
|
||||||
|
and (long_pause or line_count >= max_line_count)
|
||||||
|
or seg_break
|
||||||
|
):
|
||||||
|
# subtitle break
|
||||||
|
yield subtitle
|
||||||
|
subtitle = []
|
||||||
|
line_count = 1
|
||||||
|
elif line_len > 0:
|
||||||
|
# line break
|
||||||
|
line_count += 1
|
||||||
|
timing["word"] = "\n" + timing["word"]
|
||||||
|
line_len = len(timing["word"].strip())
|
||||||
|
subtitle.append(timing)
|
||||||
|
last = timing["start"]
|
||||||
|
chunk_index += max_words_per_line
|
||||||
|
if len(subtitle) > 0:
|
||||||
|
yield subtitle
|
||||||
|
|
||||||
|
if len(result["segments"]) > 0 and "words" in result["segments"][0]:
|
||||||
|
for subtitle in iterate_subtitles():
|
||||||
|
subtitle_start = self.format_timestamp(subtitle[0]["start"])
|
||||||
|
subtitle_end = self.format_timestamp(subtitle[-1]["end"])
|
||||||
|
subtitle_text = "".join([word["word"] for word in subtitle])
|
||||||
|
if highlight_words:
|
||||||
|
last = subtitle_start
|
||||||
|
all_words = [timing["word"] for timing in subtitle]
|
||||||
|
for i, this_word in enumerate(subtitle):
|
||||||
|
start = self.format_timestamp(this_word["start"])
|
||||||
|
end = self.format_timestamp(this_word["end"])
|
||||||
|
if last != start:
|
||||||
|
yield last, start, subtitle_text
|
||||||
|
|
||||||
|
yield start, end, "".join(
|
||||||
|
[
|
||||||
|
(
|
||||||
|
re.sub(r"^(\s*)(.*)$", r"\1<u>\2</u>", word)
|
||||||
|
if j == i
|
||||||
|
else word
|
||||||
|
)
|
||||||
|
for j, word in enumerate(all_words)
|
||||||
|
]
|
||||||
|
)
|
||||||
|
last = end
|
||||||
|
else:
|
||||||
|
yield subtitle_start, subtitle_end, subtitle_text
|
||||||
|
else:
|
||||||
|
for segment in result["segments"]:
|
||||||
|
segment_start = self.format_timestamp(segment["start"])
|
||||||
|
segment_end = self.format_timestamp(segment["end"])
|
||||||
|
segment_text = segment["text"].strip().replace("-->", "->")
|
||||||
|
yield segment_start, segment_end, segment_text
|
||||||
|
|
||||||
|
def format_timestamp(self, seconds: float):
|
||||||
|
return format_timestamp(
|
||||||
|
seconds=seconds,
|
||||||
|
always_include_hours=self.always_include_hours,
|
||||||
|
decimal_marker=self.decimal_marker,
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class WriteVTT(SubtitlesWriter):
|
||||||
|
extension: str = "vtt"
|
||||||
|
always_include_hours: bool = False
|
||||||
|
decimal_marker: str = "."
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
print("WEBVTT\n", file=file)
|
||||||
|
for start, end, text in self.iterate_result(result, options, **kwargs):
|
||||||
|
print(f"{start} --> {end}\n{text}\n", file=file, flush=True)
|
||||||
|
|
||||||
|
|
||||||
|
class WriteSRT(SubtitlesWriter):
|
||||||
|
extension: str = "srt"
|
||||||
|
always_include_hours: bool = True
|
||||||
|
decimal_marker: str = ","
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
for i, (start, end, text) in enumerate(
|
||||||
|
self.iterate_result(result, options, **kwargs), start=1
|
||||||
|
):
|
||||||
|
print(f"{i}\n{start} --> {end}\n{text}\n", file=file, flush=True)
|
||||||
|
|
||||||
|
|
||||||
|
class WriteTSV(ResultWriter):
|
||||||
|
"""
|
||||||
|
Write a transcript to a file in TSV (tab-separated values) format containing lines like:
|
||||||
|
<start time in integer milliseconds>\t<end time in integer milliseconds>\t<transcript text>
|
||||||
|
|
||||||
|
Using integer milliseconds as start and end times means there's no chance of interference from
|
||||||
|
an environment setting a language encoding that causes the decimal in a floating point number
|
||||||
|
to appear as a comma; also is faster and more efficient to parse & store, e.g., in C++.
|
||||||
|
"""
|
||||||
|
|
||||||
|
extension: str = "tsv"
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
print("start", "end", "text", sep="\t", file=file)
|
||||||
|
for segment in result["segments"]:
|
||||||
|
print(round(1000 * segment["start"]), file=file, end="\t")
|
||||||
|
print(round(1000 * segment["end"]), file=file, end="\t")
|
||||||
|
print(segment["text"].strip().replace("\t", " "), file=file, flush=True)
|
||||||
|
|
||||||
|
|
||||||
|
class WriteJSON(ResultWriter):
|
||||||
|
extension: str = "json"
|
||||||
|
|
||||||
|
def write_result(
|
||||||
|
self, result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
json.dump(result, file)
|
||||||
|
|
||||||
|
|
||||||
|
def get_writer(
|
||||||
|
output_format: str, output_dir: str
|
||||||
|
) -> Callable[[dict, TextIO, dict], None]:
|
||||||
|
writers = {
|
||||||
|
"txt": WriteTXT,
|
||||||
|
"vtt": WriteVTT,
|
||||||
|
"srt": WriteSRT,
|
||||||
|
"tsv": WriteTSV,
|
||||||
|
"json": WriteJSON,
|
||||||
|
}
|
||||||
|
|
||||||
|
if output_format == "all":
|
||||||
|
all_writers = [writer(output_dir) for writer in writers.values()]
|
||||||
|
|
||||||
|
def write_all(
|
||||||
|
result: dict, file: TextIO, options: Optional[dict] = None, **kwargs
|
||||||
|
):
|
||||||
|
for writer in all_writers:
|
||||||
|
writer(result, file, options, **kwargs)
|
||||||
|
|
||||||
|
return write_all
|
||||||
|
|
||||||
|
return writers[output_format](output_dir)
|
||||||
1
whisperlivekit/simul_whisper/whisper/version.py
Normal file
1
whisperlivekit/simul_whisper/whisper/version.py
Normal file
@@ -0,0 +1 @@
|
|||||||
|
__version__ = "20250625"
|
||||||
@@ -26,4 +26,7 @@ class Transcript(TimedText):
|
|||||||
|
|
||||||
@dataclass
|
@dataclass
|
||||||
class SpeakerSegment(TimedText):
|
class SpeakerSegment(TimedText):
|
||||||
|
"""Represents a segment of audio attributed to a specific speaker.
|
||||||
|
No text nor probability is associated with this segment.
|
||||||
|
"""
|
||||||
pass
|
pass
|
||||||
whisperlivekit/warmup.py (new file, 62 lines)
@@ -0,0 +1,62 @@
import logging

logger = logging.getLogger(__name__)


def load_file(warmup_file=None, timeout=5):
    import os
    import tempfile
    import librosa

    if warmup_file is None:
        # Download JFK sample if not already present
        jfk_url = "https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav"
        temp_dir = tempfile.gettempdir()
        warmup_file = os.path.join(temp_dir, "whisper_warmup_jfk.wav")

        if not os.path.exists(warmup_file):
            logger.debug(f"Downloading warmup file from {jfk_url}")
            print(f"Downloading warmup file from {jfk_url}")
            import time
            import urllib.request
            import urllib.error
            import socket

            original_timeout = socket.getdefaulttimeout()
            socket.setdefaulttimeout(timeout)

            start_time = time.time()
            try:
                urllib.request.urlretrieve(jfk_url, warmup_file)
                logger.debug(f"Download successful in {time.time() - start_time:.2f}s")
            except (urllib.error.URLError, socket.timeout) as e:
                logger.warning(f"Download failed: {e}. Proceeding without warmup.")
                return False
            finally:
                socket.setdefaulttimeout(original_timeout)
    elif not warmup_file:
        return False

    if not warmup_file or not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
        logger.warning(f"Warmup file {warmup_file} invalid or missing.")
        return False

    try:
        audio, sr = librosa.load(warmup_file, sr=16000)
    except Exception as e:
        logger.warning(f"Failed to load audio file: {e}")
        return False
    return audio


def warmup_asr(asr, warmup_file=None, timeout=5):
    """
    Warmup the ASR model by transcribing a short audio file.
    """
    # Forward the caller's warmup_file and timeout instead of hard-coded defaults.
    audio = load_file(warmup_file=warmup_file, timeout=timeout)
    if audio is False:
        logger.warning("Skipping warmup: no audio available.")
        return
    asr.transcribe(audio)
    logger.info("ASR model is warmed up")


def warmup_online(online, warmup_file=None, timeout=5):
    audio = load_file(warmup_file=warmup_file, timeout=timeout)
    if audio is False:
        logger.warning("Skipping warmup: no audio available.")
        return
    online.warmup(audio)
    logger.info("Online ASR processor is warmed up")
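For orientation, a minimal sketch of how these warmup helpers are typically wired in at startup (backend_factory appears later in this diff; the args namespace and the call order are assumptions):

    # Hypothetical startup wiring; warmup_asr comes from whisperlivekit/warmup.py above.
    from whisperlivekit.warmup import warmup_asr

    asr, tokenizer = backend_factory(args)   # args: parsed CLI options (assumed)
    warmup_asr(asr)                          # downloads the JFK sample to the temp dir once, then transcribes it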
@@ -4,12 +4,87 @@
 <head>
     <meta charset="UTF-8" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <title>Audio Transcription</title>
+    <title>WhisperLiveKit</title>
     <style>
+        :root {
+            --bg: #ffffff;
+            --text: #111111;
+            --muted: #666666;
+            --border: #e5e5e5;
+            --chip-bg: rgba(0, 0, 0, 0.04);
+            --chip-text: #000000;
+            --spinner-border: #8d8d8d5c;
+            --spinner-top: #b0b0b0;
+            --silence-bg: #f3f3f3;
+            --loading-bg: rgba(255, 77, 77, 0.06);
+            --button-bg: #ffffff;
+            --button-border: #e9e9e9;
+            --wave-stroke: #000000;
+            --label-dia-text: #868686;
+            --label-trans-text: #111111;
+        }
+
+        @media (prefers-color-scheme: dark) {
+            :root:not([data-theme="light"]) {
+                --bg: #0b0b0b;
+                --text: #e6e6e6;
+                --muted: #9aa0a6;
+                --border: #333333;
+                --chip-bg: rgba(255, 255, 255, 0.08);
+                --chip-text: #e6e6e6;
+                --spinner-border: #555555;
+                --spinner-top: #dddddd;
+                --silence-bg: #1a1a1a;
+                --loading-bg: rgba(255, 77, 77, 0.12);
+                --button-bg: #111111;
+                --button-border: #333333;
+                --wave-stroke: #e6e6e6;
+                --label-dia-text: #b3b3b3;
+                --label-trans-text: #ffffff;
+            }
+        }
+
+        :root[data-theme="dark"] {
+            /* same variable values as the dark-scheme block above */
+        }
+
+        :root[data-theme="light"] {
+            /* same variable values as the default :root block above */
+        }
+
         body {
             font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
             margin: 20px;
             text-align: center;
+            background-color: var(--bg);
+            color: var(--text);
         }

         #recordButton {
@@ -17,10 +92,10 @@
             height: 50px;
             border: none;
             border-radius: 50%;
-            background-color: white;
+            background-color: var(--button-bg);
             cursor: pointer;
             transition: all 0.3s ease;
-            border: 1px solid rgb(233, 233, 233);
+            border: 1px solid var(--button-border);
             display: flex;
             align-items: center;
             justify-content: center;
@@ -94,14 +169,14 @@
         .timer {
             font-size: 14px;
             font-weight: 500;
-            color: #333;
+            color: var(--text);
             margin-left: 10px;
         }

         #status {
             margin-top: 20px;
             font-size: 16px;
-            color: #333;
+            color: var(--text);
         }

         .settings-container {
@@ -120,12 +195,14 @@
         }

         #chunkSelector,
-        #websocketInput {
+        #websocketInput,
+        #themeSelector {
             font-size: 16px;
             padding: 5px;
             border-radius: 5px;
-            border: 1px solid #ddd;
-            background-color: #ffffff;
+            border: 1px solid var(--border);
+            background-color: var(--button-bg);
+            color: var(--text);
             max-height: 30px;
         }

@@ -134,7 +211,8 @@
         }

         #chunkSelector:focus,
-        #websocketInput:focus {
+        #websocketInput:focus,
+        #themeSelector:focus {
             outline: none;
             border-color: #007bff;
         }
@@ -156,18 +234,18 @@
         }

         #linesTranscript strong {
-            color: #333;
+            color: var(--text);
         }

         #speaker {
-            border: 1px solid rgb(229, 229, 229);
+            border: 1px solid var(--border);
             border-radius: 100px;
             padding: 2px 10px;
             font-size: 14px;
             margin-bottom: 0px;
         }
         .label_diarization {
-            background-color: #ffffff66;
+            background-color: var(--chip-bg);
             border-radius: 8px 8px 8px 8px;
             padding: 2px 10px;
             margin-left: 10px;
@@ -175,11 +253,11 @@
             white-space: nowrap;
             font-size: 14px;
             margin-bottom: 0px;
-            color: rgb(134, 134, 134)
+            color: var(--label-dia-text)
         }

         .label_transcription {
-            background-color: #ffffff66;
+            background-color: var(--chip-bg);
             border-radius: 8px 8px 8px 8px;
             padding: 2px 10px;
             display: inline-block;
@@ -187,11 +265,11 @@
             margin-left: 10px;
             font-size: 14px;
             margin-bottom: 0px;
-            color: #000000
+            color: var(--label-trans-text)
         }

         #timeInfo {
-            color: #666;
+            color: var(--muted);
             margin-left: 10px;
         }

@@ -206,7 +284,7 @@
         }

         .buffer_diarization {
-            color: rgb(134, 134, 134);
+            color: var(--label-dia-text);
             margin-left: 4px;
         }

@@ -220,10 +298,10 @@
             display: inline-block;
             width: 8px;
             height: 8px;
-            border: 2px solid #8d8d8d5c;
-            border-top: 2px solid #6c6c6ce5;
+            border: 2px solid var(--spinner-border);
+            border-top: 2px solid var(--spinner-top);
             border-radius: 50%;
-            animation: spin 0.6s linear infinite;
+            animation: spin 0.7s linear infinite;
             vertical-align: middle;
             margin-bottom: 2px;
             margin-right: 5px;
@@ -236,16 +314,16 @@
         }

         .silence {
-            color: #666;
-            background-color: #f3f3f3;
+            color: var(--muted);
+            background-color: var(--silence-bg);
             font-size: 13px;
             border-radius: 30px;
             padding: 2px 10px;
         }

         .loading {
-            color: #666;
-            background-color: #ff4d4d0f;
+            color: var(--muted);
+            background-color: var(--loading-bg);
             border-radius: 8px 8px 8px 0px;
             padding: 2px 10px;
             font-size: 14px;
@@ -284,6 +362,14 @@
         <label for="websocketInput">WebSocket URL:</label>
         <input id="websocketInput" type="text" />
     </div>
+    <div>
+        <label for="themeSelector">Theme:</label>
+        <select id="themeSelector">
+            <option value="system" selected>System</option>
+            <option value="light">Light</option>
+            <option value="dark">Dark</option>
+        </select>
+    </div>
 </div>
 </div>
@@ -299,6 +385,7 @@
 let chunkDuration = 1000;
 let websocketUrl = "ws://localhost:8000/asr";
 let userClosing = false;
+let wakeLock = null;
 let startTime = null;
 let timerInterval = null;
 let audioContext = null;
@@ -308,6 +395,8 @@
 let waveCtx = waveCanvas.getContext("2d");
 let animationFrame = null;
 let waitingForStop = false;
+let lastReceivedData = null;
+let lastSignature = null;
 waveCanvas.width = 60 * (window.devicePixelRatio || 1);
 waveCanvas.height = 30 * (window.devicePixelRatio || 1);
 waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
@@ -318,9 +407,60 @@
 const websocketInput = document.getElementById("websocketInput");
 const linesTranscriptDiv = document.getElementById("linesTranscript");
 const timerElement = document.querySelector(".timer");
+const themeSelector = document.getElementById("themeSelector");
+
+function getWaveStroke() {
+    const styles = getComputedStyle(document.documentElement);
+    const v = styles.getPropertyValue("--wave-stroke").trim();
+    return v || "#000";
+}
+
+let waveStroke = getWaveStroke();
+
+function updateWaveStroke() {
+    waveStroke = getWaveStroke();
+}
+
+function applyTheme(pref) {
+    if (pref === "light") {
+        document.documentElement.setAttribute("data-theme", "light");
+    } else if (pref === "dark") {
+        document.documentElement.setAttribute("data-theme", "dark");
+    } else {
+        document.documentElement.removeAttribute("data-theme");
+    }
+    updateWaveStroke();
+}
+
+const savedThemePref = localStorage.getItem("themePreference") || "system";
+applyTheme(savedThemePref);
+if (themeSelector) {
+    themeSelector.value = savedThemePref;
+    themeSelector.addEventListener("change", () => {
+        const val = themeSelector.value;
+        localStorage.setItem("themePreference", val);
+        applyTheme(val);
+    });
+}
+
+const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
+const handleOsThemeChange = () => {
+    const pref = localStorage.getItem("themePreference") || "system";
+    if (pref === "system") updateWaveStroke();
+};
+if (darkMq && darkMq.addEventListener) {
+    darkMq.addEventListener("change", handleOsThemeChange);
+} else if (darkMq && darkMq.addListener) {
+    darkMq.addListener(handleOsThemeChange);
+}
+
+function fmt1(x) {
+    const n = Number(x);
+    return Number.isFinite(n) ? n.toFixed(1) : x;
+}
+
 const host = window.location.hostname || "localhost";
-const port = window.location.port || "8000";
+const port = window.location.port;
 const protocol = window.location.protocol === "https:" ? "wss" : "ws";
 const defaultWebSocketUrl = `${protocol}://${host}:${port}/asr`;
 websocketInput.value = defaultWebSocketUrl;
@@ -357,18 +497,31 @@

 websocket.onclose = () => {
     if (userClosing) {
-        if (!statusText.textContent.includes("Recording stopped. Processing final audio")) { // This is a bit of a hack. We should have a better way to handle this. eg. using a status code.
-            statusText.textContent = "Finished processing audio! Ready to record again.";
+        if (waitingForStop) {
+            statusText.textContent = "Processing finalized or connection closed.";
+            if (lastReceivedData) {
+                renderLinesWithBuffer(
+                    lastReceivedData.lines || [],
+                    lastReceivedData.buffer_diarization || "",
+                    lastReceivedData.buffer_transcription || "",
+                    0, 0, true // isFinalizing = true
+                );
+            }
         }
-        waitingForStop = false;
+        // If ready_to_stop was received, statusText is already "Finished processing..."
+        // and waitingForStop is false.
     } else {
-        statusText.textContent =
-            "Disconnected from the WebSocket server. (Check logs if model is loading.)";
+        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
         if (isRecording) {
             stopRecording();
         }
     }
-    userClosing = false;
+    isRecording = false;
+    waitingForStop = false;
+    userClosing = false;
+    lastReceivedData = null;
+    websocket = null;
+    updateUI();
 };

 websocket.onerror = () => {
@@ -382,31 +535,39 @@

     // Check for status messages
     if (data.type === "ready_to_stop") {
-        console.log("Ready to stop, closing WebSocket");
-        // signal that we are not waiting for stop anymore
+        console.log("Ready to stop received, finalizing display and closing WebSocket.");
         waitingForStop = false;
-        recordButton.disabled = false; // this should be elsewhere
-        console.log("Record button enabled");
-
-        //Now we can close the WebSocket
-        if (websocket) {
-            websocket.close();
-            websocket = null;
+        if (lastReceivedData) {
+            renderLinesWithBuffer(
+                lastReceivedData.lines || [],
+                lastReceivedData.buffer_diarization || "",
+                lastReceivedData.buffer_transcription || "",
+                0, // No more lag
+                0, // No more lag
+                true // isFinalizing = true
+            );
         }
+        statusText.textContent = "Finished processing audio! Ready to record again.";
+        recordButton.disabled = false;
+
+        if (websocket) {
+            websocket.close(); // will trigger onclose
+            // websocket = null; // onclose handles setting websocket to null
+        }
         return;
     }

+    lastReceivedData = data;
+
     // Handle normal transcription updates
     const {
         lines = [],
         buffer_transcription = "",
         buffer_diarization = "",
         remaining_time_transcription = 0,
-        remaining_time_diarization = 0
+        remaining_time_diarization = 0,
+        status = "active_transcription"
     } = data;

     renderLinesWithBuffer(
@@ -414,13 +575,45 @@
         buffer_diarization,
         buffer_transcription,
         remaining_time_diarization,
-        remaining_time_transcription
+        remaining_time_transcription,
+        false,
+        status
     );
 };
 });
 }

-function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription) {
+function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription, isFinalizing = false, current_status = "active_transcription") {
+    if (current_status === "no_audio_detected") {
+        linesTranscriptDiv.innerHTML = "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
+        return;
+    }
+
+    // try to keep stable DOM despite having updates every 0.1s. only update numeric lag values if structure hasn't changed
+    const showLoading = (!isFinalizing) && (lines || []).some(it => it.speaker == 0);
+    const showTransLag = !isFinalizing && remaining_time_transcription > 0;
+    const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
+    const signature = JSON.stringify({
+        lines: (lines || []).map(it => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
+        buffer_transcription: buffer_transcription || "",
+        buffer_diarization: buffer_diarization || "",
+        status: current_status,
+        showLoading,
+        showTransLag,
+        showDiaLag,
+        isFinalizing: !!isFinalizing
+    });
+    if (lastSignature === signature) {
+        const t = document.querySelector(".lag-transcription-value");
+        if (t) t.textContent = fmt1(remaining_time_transcription);
+        const d = document.querySelector(".lag-diarization-value");
+        if (d) d.textContent = fmt1(remaining_time_diarization);
+        const ld = document.querySelector(".loading-diarization-value");
+        if (ld) ld.textContent = fmt1(remaining_time_diarization);
+        return;
+    }
+    lastSignature = signature;
+
     const linesHtml = lines.map((item, idx) => {
         let timeInfo = "";
         if (item.beg !== undefined && item.end !== undefined) {
@@ -430,33 +623,50 @@
         let speakerLabel = "";
         if (item.speaker === -2) {
             speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
-        } else if (item.speaker == 0) {
-            speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'>${remaining_time_diarization} second(s) of audio are undergoing diarization</span></span>`;
+        } else if (item.speaker == 0 && !isFinalizing) {
+            speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(remaining_time_diarization)}</span> second(s) of audio are undergoing diarization</span></span>`;
         } else if (item.speaker == -1) {
-            speakerLabel = `<span id="speaker"><span id='timeInfo'>${timeInfo}</span></span>`;
-        } else if (item.speaker !== -1) {
+            speakerLabel = `<span id="speaker">Speaker 1<span id='timeInfo'>${timeInfo}</span></span>`;
+        } else if (item.speaker !== -1 && item.speaker !== 0) {
             speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
         }

-        let textContent = item.text;
-        if (idx === lines.length - 1) {
-            speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'>${remaining_time_transcription}s</span></span>`
-        }
-        if (idx === lines.length - 1 && buffer_diarization) {
-            speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'>${remaining_time_diarization}s</span></span>`
-            textContent += `<span class="buffer_diarization">${buffer_diarization}</span>`;
-        }
-        if (idx === lines.length - 1) {
-            textContent += `<span class="buffer_transcription">${buffer_transcription}</span>`;
-        }
+        let currentLineText = item.text || "";
+
+        if (idx === lines.length - 1) {
+            if (!isFinalizing && item.speaker !== -2) {
+                if (remaining_time_transcription > 0) {
+                    speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(remaining_time_transcription)}</span>s</span></span>`;
+                }
+                if (buffer_diarization && remaining_time_diarization > 0) {
+                    speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(remaining_time_diarization)}</span>s</span></span>`;
+                }
+            }
+
+            if (buffer_diarization) {
+                if (isFinalizing) {
+                    currentLineText += (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
+                } else {
+                    currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
+                }
+            }
+            if (buffer_transcription) {
+                if (isFinalizing) {
+                    currentLineText += (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") + buffer_transcription.trim();
+                } else {
+                    currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
+                }
+            }
+        }

-        return textContent
-            ? `<p>${speakerLabel}<br/><div class='textcontent'>${textContent}</div></p>`
-            : `<p>${speakerLabel}<br/></p>`;
+        return currentLineText.trim().length > 0 || speakerLabel.length > 0
+            ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
+            : `<p>${speakerLabel}<br/></p>`;
     }).join("");

     linesTranscriptDiv.innerHTML = linesHtml;
+    window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' });
 }

 function updateTimer() {
@@ -477,7 +687,7 @@

     waveCtx.clearRect(0, 0, waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1));
     waveCtx.lineWidth = 1;
-    waveCtx.strokeStyle = 'rgb(0, 0, 0)';
+    waveCtx.strokeStyle = waveStroke;
     waveCtx.beginPath();

     const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
@@ -504,6 +714,16 @@

 async function startRecording() {
     try {
+        // https://developer.mozilla.org/en-US/docs/Web/API/Screen_Wake_Lock_API
+        // Request a screen wake lock so the display stays on while recording.
+        try {
+            wakeLock = await navigator.wakeLock.request("screen");
+        } catch (err) {
+            // The wake lock request has failed - usually system related, such as battery.
+            console.log("Error acquiring wake lock.")
+        }
+
         const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

         audioContext = new (window.AudioContext || window.webkitAudioContext)();
@@ -533,6 +753,10 @@
 }

 async function stopRecording() {
+    if (wakeLock) {               // release only if it was actually acquired
+        wakeLock.release().then(() => {
+            wakeLock = null;
+        });
+    }
+
     userClosing = true;
     waitingForStop = true;
@@ -578,20 +802,6 @@
     timerElement.textContent = "00:00";
     startTime = null;
-
-    if (websocket && websocket.readyState === WebSocket.OPEN) {
-        try {
-            await websocket.send(JSON.stringify({
-                type: "stop",
-                message: "User stopped recording"
-            }));
-            statusText.textContent = "Recording stopped. Processing final audio...";
-        } catch (e) {
-            console.error("Could not send stop message:", e);
-            statusText.textContent = "Recording stopped. Error during final audio processing.";
-            websocket.close();
-            websocket = null;
-        }
-    }
-
     isRecording = false;
     updateUI();
@@ -625,19 +835,22 @@

 function updateUI() {
     recordButton.classList.toggle("recording", isRecording);
+    recordButton.disabled = waitingForStop;
+
     if (waitingForStop) {
-        statusText.textContent = "Please wait for processing to complete...";
-        recordButton.disabled = true; // Optionally disable the button while waiting
-        console.log("Record button disabled");
+        if (statusText.textContent !== "Recording stopped. Processing final audio...") {
+            statusText.textContent = "Please wait for processing to complete...";
+        }
     } else if (isRecording) {
         statusText.textContent = "Recording...";
-        recordButton.disabled = false;
-        console.log("Record button enabled");
     } else {
-        statusText.textContent = "Click to start transcription";
+        if (statusText.textContent !== "Finished processing audio! Ready to record again." &&
+            statusText.textContent !== "Processing finalized or connection closed.") {
+            statusText.textContent = "Click to start transcription";
+        }
+    }
+    if (!waitingForStop) {
         recordButton.disabled = false;
-        console.log("Record button enabled");
     }
 }
@@ -645,4 +858,4 @@
 </script>
 </body>

 </html>
whisperlivekit/web/web_interface.py (new file, 13 lines)
@@ -0,0 +1,13 @@
import logging
import importlib.resources as resources

logger = logging.getLogger(__name__)


def get_web_interface_html():
    """Loads the HTML for the web interface using importlib.resources."""
    try:
        with resources.files('whisperlivekit.web').joinpath('live_transcription.html').open('r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        logger.error(f"Error loading web interface HTML: {e}")
        return "<html><body><h1>Error loading interface</h1></body></html>"
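For orientation, a minimal sketch of serving this HTML from a web framework (the FastAPI route is an assumption; only get_web_interface_html comes from the file above):

    # Hypothetical FastAPI wiring around get_web_interface_html().
    from fastapi import FastAPI
    from fastapi.responses import HTMLResponse
    from whisperlivekit.web.web_interface import get_web_interface_html

    app = FastAPI()

    @app.get("/")
    async def index() -> HTMLResponse:
        # Serve the bundled live_transcription.html page as-is.
        return HTMLResponse(get_web_interface_html())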
@@ -3,16 +3,10 @@ import logging
 import io
 import soundfile as sf
 import math
-try:
-    import torch
-except ImportError:
-    torch = None
 from typing import List
 import numpy as np
 from whisperlivekit.timed_objects import ASRToken

 logger = logging.getLogger(__name__)

 class ASRBase:
     sep = " "  # join transcribe words with this character (" " for whisper_timestamped,
                # "" for faster-whisper because it emits the spaces when needed)
@@ -6,7 +6,6 @@ from whisperlivekit.timed_objects import ASRToken, Sentence, Transcript

 logger = logging.getLogger(__name__)

-
 class HypothesisBuffer:
     """
     Buffer to store and process ASR hypothesis tokens.
@@ -143,8 +142,13 @@ class OnlineASRProcessor:
         self.buffer_time_offset = offset if offset is not None else 0.0
         self.transcript_buffer.last_committed_time = self.buffer_time_offset
         self.committed: List[ASRToken] = []
+        self.time_of_last_asr_output = 0.0

-    def insert_audio_chunk(self, audio: np.ndarray):
+    def get_audio_buffer_end_time(self) -> float:
+        """Returns the absolute end time of the current audio_buffer."""
+        return self.buffer_time_offset + (len(self.audio_buffer) / self.SAMPLING_RATE)
+
+    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
         """Append an audio chunk (a numpy array) to the current audio buffer."""
         self.audio_buffer = np.append(self.audio_buffer, audio)

@@ -179,26 +183,42 @@ class OnlineASRProcessor:
         return self.concatenate_tokens(self.transcript_buffer.buffer)

-    def process_iter(self) -> Transcript:
+    def process_iter(self) -> Tuple[List[ASRToken], float]:
         """
         Processes the current audio buffer.

-        Returns a Transcript object representing the committed transcript.
+        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
         """
+        current_audio_processed_upto = self.get_audio_buffer_end_time()
         prompt_text, _ = self.prompt()
         logger.debug(
             f"Transcribing {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds from {self.buffer_time_offset:.2f}"
         )
         res = self.asr.transcribe(self.audio_buffer, init_prompt=prompt_text)
-        tokens = self.asr.ts_words(res)  # Expecting List[ASRToken]
+        tokens = self.asr.ts_words(res)
         self.transcript_buffer.insert(tokens, self.buffer_time_offset)
         committed_tokens = self.transcript_buffer.flush()
         self.committed.extend(committed_tokens)
+
+        if committed_tokens:
+            self.time_of_last_asr_output = self.committed[-1].end
+
         completed = self.concatenate_tokens(committed_tokens)
         logger.debug(f">>>> COMPLETE NOW: {completed.text}")
         incomp = self.concatenate_tokens(self.transcript_buffer.buffer)
         logger.debug(f"INCOMPLETE: {incomp.text}")
+
+        buffer_duration = len(self.audio_buffer) / self.SAMPLING_RATE
+        if not committed_tokens and buffer_duration > self.buffer_trimming_sec:
+            time_since_last_output = self.get_audio_buffer_end_time() - self.time_of_last_asr_output
+            if time_since_last_output > self.buffer_trimming_sec:
+                logger.warning(
+                    f"No ASR output for {time_since_last_output:.2f}s. "
+                    f"Resetting buffer to prevent freezing."
+                )
+                self.init(offset=self.get_audio_buffer_end_time())
+                return [], current_audio_processed_upto
+
         if committed_tokens and self.buffer_trimming_way == "sentence":
             if len(self.audio_buffer) / self.SAMPLING_RATE > self.buffer_trimming_sec:
                 self.chunk_completed_sentence()
@@ -210,7 +230,7 @@ class OnlineASRProcessor:
         logger.debug(
             f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
         )
-        return committed_tokens
+        return committed_tokens, current_audio_processed_upto

     def chunk_completed_sentence(self):
         """
@@ -343,15 +363,17 @@ class OnlineASRProcessor:
             )
             sentences.append(sentence)
         return sentences
-    def finish(self) -> Transcript:
+
+    def finish(self) -> Tuple[List[ASRToken], float]:
         """
         Flush the remaining transcript when processing ends.
+        Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
         """
         remaining_tokens = self.transcript_buffer.buffer
-        final_transcript = self.concatenate_tokens(remaining_tokens)
-        logger.debug(f"Final non-committed transcript: {final_transcript}")
-        self.buffer_time_offset += len(self.audio_buffer) / self.SAMPLING_RATE
-        return final_transcript
+        logger.debug(f"Final non-committed tokens: {remaining_tokens}")
+        final_processed_upto = self.buffer_time_offset + (len(self.audio_buffer) / self.SAMPLING_RATE)
+        self.buffer_time_offset = final_processed_upto
+        return remaining_tokens, final_processed_upto

     def concatenate_tokens(
         self,
@@ -384,7 +406,8 @@ class VACOnlineASRProcessor:
     def __init__(self, online_chunk_size: float, *args, **kwargs):
         self.online_chunk_size = online_chunk_size
         self.online = OnlineASRProcessor(*args, **kwargs)
+        self.asr = self.online.asr

         # Load a VAD model (e.g. Silero VAD)
         import torch
         model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
@@ -392,28 +415,35 @@ class VACOnlineASRProcessor:

         self.vac = FixedVADIterator(model)
         self.logfile = self.online.logfile
+        self.last_input_audio_stream_end_time: float = 0.0
         self.init()

     def init(self):
         self.online.init()
         self.vac.reset_states()
         self.current_online_chunk_buffer_size = 0
+        self.last_input_audio_stream_end_time = self.online.buffer_time_offset
         self.is_currently_final = False
         self.status: Optional[str] = None  # "voice" or "nonvoice"
         self.audio_buffer = np.array([], dtype=np.float32)
         self.buffer_offset = 0  # in frames

+    def get_audio_buffer_end_time(self) -> float:
+        """Returns the absolute end time of the audio processed by the underlying OnlineASRProcessor."""
+        return self.online.get_audio_buffer_end_time()
+
     def clear_buffer(self):
         self.buffer_offset += len(self.audio_buffer)
         self.audio_buffer = np.array([], dtype=np.float32)

-    def insert_audio_chunk(self, audio: np.ndarray):
+    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: float):
         """
         Process an incoming small audio chunk:
           - run VAD on the chunk,
           - decide whether to send the audio to the online ASR processor immediately,
           - and/or to mark the current utterance as finished.
         """
+        self.last_input_audio_stream_end_time = audio_stream_end_time
         res = self.vac(audio)
         self.audio_buffer = np.append(self.audio_buffer, audio)

@@ -455,10 +485,11 @@ class VACOnlineASRProcessor:
         self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE)
         self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:]

-    def process_iter(self) -> Transcript:
+    def process_iter(self) -> Tuple[List[ASRToken], float]:
         """
         Depending on the VAD status and the amount of accumulated audio,
         process the current audio chunk.
+        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
         """
         if self.is_currently_final:
             return self.finish()
@@ -467,17 +498,21 @@ class VACOnlineASRProcessor:
             return self.online.process_iter()
         else:
             logger.debug("No online update, only VAD")
-            return Transcript(None, None, "")
+            return [], self.last_input_audio_stream_end_time

-    def finish(self) -> Transcript:
-        """Finish processing by flushing any remaining text."""
-        result = self.online.finish()
+    def finish(self) -> Tuple[List[ASRToken], float]:
+        """
+        Finish processing by flushing any remaining text.
+        Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
+        """
+        result_tokens, processed_upto = self.online.finish()
         self.current_online_chunk_buffer_size = 0
         self.is_currently_final = False
-        return result
+        return result_tokens, processed_upto

     def get_buffer(self):
         """
         Get the unvalidated buffer in string format.
         """
-        return self.online.concatenate_tokens(self.online.transcript_buffer.buffer).text
+        return self.online.concatenate_tokens(self.online.transcript_buffer.buffer)
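Since process_iter() and finish() now return (tokens, processed_upto) tuples instead of Transcript objects, a caller loop can be sketched as follows (illustrative only; the audio source and the ASRToken field names start/end/text are assumptions based on the surrounding code):

    # Hypothetical consumer of the new return type.
    online = OnlineASRProcessor(asr, tokenizer)      # constructed elsewhere (see whisper_online.py below)
    for chunk in audio_chunks:                       # 16 kHz float32 numpy arrays (assumed)
        online.insert_audio_chunk(chunk)
        committed, processed_upto = online.process_iter()
        for token in committed:
            print(f"[{token.start:.2f}-{token.end:.2f}] {token.text}")
        print(f"audio processed up to {processed_upto:.2f}s")
    remaining, final_upto = online.finish()          # flush whatever is still buffered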
@@ -6,7 +6,6 @@ from functools import lru_cache
 import time
 import logging
 from .backends import FasterWhisperASR, MLXWhisper, WhisperTimestampedASR, OpenaiApiASR
-from .online_asr import OnlineASRProcessor, VACOnlineASRProcessor

 logger = logging.getLogger(__name__)

@@ -68,7 +67,7 @@ def backend_factory(args):
     backend = args.backend
     if backend == "openai-api":
         logger.debug("Using OpenAI API.")
         asr = OpenaiApiASR(lan=args.lan)
     else:
         if backend == "faster-whisper":
             asr_cls = FasterWhisperASR
@@ -84,8 +83,8 @@ def backend_factory(args):
         asr = asr_cls(
             modelsize=size,
             lan=args.lan,
-            cache_dir=args.model_cache_dir,
-            model_dir=args.model_dir,
+            cache_dir=getattr(args, 'model_cache_dir', None),
+            model_dir=getattr(args, 'model_dir', None),
         )
         e = time.time()
         logger.info(f"done. It took {round(e-t,2)} seconds.")
@@ -97,98 +96,15 @@ def backend_factory(args):

     language = args.lan
     if args.task == "translate":
-        asr.set_translate_task()
+        if backend != "simulstreaming":
+            asr.set_translate_task()
         tgt_language = "en"  # Whisper translates into English
     else:
         tgt_language = language  # Whisper transcribes in this language

     # Create the tokenizer
     if args.buffer_trimming == "sentence":
         tokenizer = create_tokenizer(tgt_language)
     else:
         tokenizer = None
     return asr, tokenizer

-def online_factory(args, asr, tokenizer, logfile=sys.stderr):
-    if args.vac:
-        online = VACOnlineASRProcessor(
-            args.min_chunk_size,
-            asr,
-            tokenizer,
-            logfile=logfile,
-            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
-            confidence_validation = args.confidence_validation
-        )
-    else:
-        online = OnlineASRProcessor(
-            asr,
-            tokenizer,
-            logfile=logfile,
-            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
-            confidence_validation = args.confidence_validation
-        )
-    return online
-
-def asr_factory(args, logfile=sys.stderr):
-    """
-    Creates and configures an ASR and ASR Online instance based on the specified backend and arguments.
-    """
-    asr, tokenizer = backend_factory(args)
-    online = online_factory(args, asr, tokenizer, logfile=logfile)
-    return asr, online
-
-def warmup_asr(asr, warmup_file=None, timeout=5):
-    """
-    Warmup the ASR model by transcribing a short audio file.
-    """
-    ...  # removed: the JFK-sample download/transcribe body now lives in whisperlivekit/warmup.py, shown earlier in this diff
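With online_factory and asr_factory removed here (and warmup moved to whisperlivekit/warmup.py), callers now assemble the pieces themselves. A hedged sketch mirroring the removed code; the argparse field values below are illustrative assumptions:

    # Hypothetical wiring after this refactor; constructor arguments mirror the removed online_factory above.
    from argparse import Namespace
    from whisperlivekit.warmup import warmup_asr

    args = Namespace(
        backend="faster-whisper", lan="en", task="transcribe",
        buffer_trimming="segment", buffer_trimming_sec=15.0,
        min_chunk_size=1.0, confidence_validation=False,
    )
    # backend_factory may read further fields (e.g. the model size); see its body above.

    asr, tokenizer = backend_factory(args)
    warmup_asr(asr)  # optional one-shot warmup on the bundled sample

    online = VACOnlineASRProcessor(
        args.min_chunk_size, asr, tokenizer,
        buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
        confidence_validation=args.confidence_validation,
    )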