0.2.6

speaker -1 is no longer sent over the websocket - no buffer is sent when there is a silence
default model is now base. default backend simulstreaming
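
The silence handling mentioned above can be sketched in isolation. This is a minimal illustration modeled on the `ffmpeg_stdout_reader` changes in this commit, not the project's actual code. `SilenceGate` and its `push` method are hypothetical names, the `Silence` dataclass here only assumes a `duration` field, and the VAD event is assumed to be a dict with optional `start`/`end` keys (or `None` when nothing changed).

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Silence:
    """Marker queued in place of audio (only a `duration` field is assumed)."""
    duration: float


class SilenceGate:
    """Hypothetical helper: forwards chunks while speech is active and
    emits one Silence marker when speech resumes after a pause."""

    def __init__(self) -> None:
        self.silence = False        # True while we are inside a pause
        self.start_silence = 0.0    # timestamp at which the pause began

    def push(self, chunk: Any, vad_event: Optional[dict], now: float) -> List[Any]:
        out: List[Any] = []
        end_of_audio = False

        if vad_event is not None:
            if vad_event.get("end", 0) > vad_event.get("start", 0):
                # The VAD reported the end of a speech segment.
                end_of_audio = True
            elif self.silence:
                # Speech resumed: report how long the pause lasted.
                self.silence = False
                out.append(Silence(duration=now - self.start_silence))

        if not self.silence:
            # Forward the chunk; the chunk that closes a speech segment
            # is still forwarded before the gate shuts.
            out.append(chunk)
            if end_of_audio:
                self.silence = True
                self.start_silence = now

        return out


gate = SilenceGate()
first = gate.push("a", {"start": 0}, now=0.0)      # speech starts
second = gate.push("b", {"end": 100}, now=1.0)     # speech ends, gate closes
third = gate.push("c", None, now=2.0)              # dropped: inside the pause
fourth = gate.push("d", {"start": 200}, now=3.5)   # Silence marker, then chunk
```

In this sketch nothing is queued while the gate is closed, so downstream consumers receive a single `Silence` item carrying the pause length instead of a stream of empty audio.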
2026-03-08 23:04:50 +00:00 · 2025-08-21 14:34:46 +02:00 · 2025-08-21 14:09:02 +02:00 · 2025-08-21 11:55:36 +02:00 · 2025-08-21 10:33:15 +02:00 · 2025-08-20 20:57:43 +02:00
29 changed files with 1932 additions and 1270 deletions
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -15,7 +15,7 @@ Thank you for considering contributing ! We appreciate your time and effort to h

 ## Opening Issues

-If you encounter a problem with diart or want to suggest an improvement, please follow these guidelines when opening an issue:
+If you encounter a problem with WhisperLiveKit or want to suggest an improvement, please follow these guidelines when opening an issue:

 - **Bug Reports:**
  - Clearly describe the error. **Please indicate the parameters you use, especially the model(s)**
@@ -43,4 +43,4 @@ We welcome and appreciate contributions! To ensure a smooth review process, plea

 ## Thank You

-Your contributions make diart better for everyone. Thank you for your time and dedication!
+Your contributions make WhisperLiveKit better for everyone. Thank you for your time and dedication!
--- a/2
+++ b/2
@@ -81,4 +81,4 @@ EXPOSE 8000
 ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]

 # Default args
-CMD ["--model", "tiny.en"]
+CMD ["--model", "base"]
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
 </p>

-<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
+<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Identification</b></p>

 <p align="center">
 <a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
@@ -14,121 +14,93 @@
 </p>


-WhisperLiveKit brings real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨
+Real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨

-Built on [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) and [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) for transcription, plus [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) and [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) for diarization.
+#### Powered by Leading Research:
+
+- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy
+- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription with LocalAgreement policy
+- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
+- [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - Real-time speaker diarization
+- [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - Enterprise-grade Voice Activity Detection


-### Key Features
+> **Why not just run a simple Whisper model on every audio batch?** Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.

- **Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
- **Speaker Diarization** - Identify different speakers in real-time. (⚠️ backend Streaming Sortformer in development)
- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
- **Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
- **SimulStreaming Backend** - [Dual-licensed](https://github.com/ufal/SimulStreaming#-licence-and-contributions) - Ultra-low latency transcription using SOTA AlignAtt policy. 

 ### Architecture

-<img alt="Architecture" src="architecture.png" />
+<img alt="Architecture" src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png" />

+*The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.*

-## Quick Start
+### Installation & Quick Start

 ```bash
-# Install the package
 pip install whisperlivekit
-
-# Start the transcription server
-whisperlivekit-server --model tiny.en
-
-# Open your browser at http://localhost:8000 to see the interface.
-# Use  -ssl-certfile public.crt --ssl-keyfile private.key parameters to use SSL
 ```

-That's it! Start speaking and watch your words appear on screen.
+>  **FFmpeg is required** and must be installed before using WhisperLiveKit
+> 
+> | OS | How to install |
+> |-----------|-------------|
+>  | Ubuntu/Debian | `sudo apt install ffmpeg` |
+> | MacOS | `brew install ffmpeg` |
+> | Windows | Download .exe from https://ffmpeg.org/download.html and add to PATH |

-## Installation
+#### Quick Start
+1. **Start the transcription server:**
+   ```bash
+   whisperlivekit-server --model base --language en
+   ```

-```bash
-#Install from PyPI (Recommended)
-pip install whisperlivekit
+2. **Open your browser** and navigate to `http://localhost:8000`. Start speaking and watch your words appear in real-time!

-#Install from Source
-git clone https://github.com/QuentinFuxa/WhisperLiveKit
-cd WhisperLiveKit
-pip install -e .
-```

-### FFmpeg Dependency
+> - See [tokenizer.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py) for the list of all available languages.
+> - For HTTPS requirements, see the **Parameters** section for SSL configuration options.

-```bash
-# Ubuntu/Debian
-sudo apt install ffmpeg
+ 

-# macOS
-brew install ffmpeg
+#### Optional Dependencies

-# Windows
-# Download from https://ffmpeg.org/download.html and add to PATH
-```
+| Optional | `pip install` |
+|-----------|-------------|
+| Speaker diarization | `whisperlivekit[diarization]` |
+| Original Whisper backend | `whisperlivekit[whisper]` |
+| Improved timestamps backend | `whisperlivekit[whisper-timestamped]` |
+| Apple Silicon optimization backend | `whisperlivekit[mlx-whisper]` |
+| OpenAI API backend | `whisperlivekit[openai]` |

-### Optional Dependencies
+See  **Parameters & Configuration** below on how to use them.

-```bash
-# Voice Activity Controller (prevents hallucinations)
-pip install torch
-
-# Sentence-based buffer trimming
-pip install mosestokenizer wtpsplit
-pip install tokenize_uk  # If you work with Ukrainian text
-
-# Speaker diarization
-pip install diart
-
-# Alternative Whisper backends (default is faster-whisper)
-pip install whisperlivekit[whisper]              # Original Whisper
-pip install whisperlivekit[whisper-timestamped]  # Improved timestamps
-pip install whisperlivekit[mlx-whisper]          # Apple Silicon optimization
-pip install whisperlivekit[openai]               # OpenAI API
-pip install whisperlivekit[simulstreaming]
-```
-
-### 🎹 Pyannote Models Setup
-
-For diarization, you need access to pyannote.audio models:
-
-1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
-2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
-3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
-4. Login with HuggingFace:
-```bash
-pip install huggingface_hub
-huggingface-cli login
-```
+ 
+> **Pyannote Models Setup** For diarization, you need access to pyannote.audio models:
+> 1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
+> 2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
+> 3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
+> 4. Log in with Hugging Face:
+> ```bash
+> huggingface-cli login
+> ```

 ## 💻 Usage Examples

-### Command-line Interface
+#### Command-line Interface

 Start the transcription server with various options:

 ```bash
-# Basic server with English model
-whisperlivekit-server --model tiny.en
+# SimulStreaming backend for ultra-low latency
+whisperlivekit-server --backend simulstreaming --model large-v3

 # Advanced configuration with diarization
-whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
-
-# SimulStreaming backend for ultra-low latency
-whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshold 20
+whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
 ```


-### Python API Integration (Backend)
-Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.
+#### Python API Integration (Backend)
+Check [basic_server](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a more complete example of how to use the functions and classes.

 ```python
 from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
@@ -143,14 +115,10 @@ transcription_engine = None
 async def lifespan(app: FastAPI):
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
-    # You can also load from command-line arguments using parse_args()
-    # args = parse_args()
-    # transcription_engine = TranscriptionEngine(**vars(args))
    yield

 app = FastAPI(lifespan=lifespan)

-# Process WebSocket connections
 async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
@@ -170,43 +138,36 @@ async def websocket_endpoint(websocket: WebSocket):
        await audio_processor.process_audio(message)        
 ```

-### Frontend Implementation
+#### Frontend Implementation

-The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()` :
+The package includes an HTML/JavaScript implementation [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html). You can also import it using `from whisperlivekit import get_web_interface_html` & `page = get_web_interface_html()`

-```python
-from whisperlivekit import get_web_interface_html
-html_content = get_web_interface_html()
-```

-## ⚙️ Configuration Reference
-
-WhisperLiveKit offers extensive configuration options:
+### ⚙️ Parameters & Configuration

 | Parameter | Description | Default |
 |-----------|-------------|---------|
-| `--host` | Server host address | `localhost` |
-| `--port` | Server port | `8000` |
-| `--model` | Whisper model size. Caution : '.en' models do not work with Simulstreaming | `tiny` |
+| `--model` | Whisper model size. | `small` |
 | `--language` | Source language code or `auto` | `en` |
 | `--task` | `transcribe` or `translate` | `transcribe` |
-| `--backend` | Processing backend | `faster-whisper` |
-| `--diarization` | Enable speaker identification | `False` |
-| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
-| `--confidence-validation` | Use confidence scores for faster validation | `False` |
+| `--backend` | Processing backend | `simulstreaming` |
 | `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
-| `--vac` | Use Voice Activity Controller | `False` |
+| `--no-vac` | Disable Voice Activity Controller | `False` |
 | `--no-vad` | Disable Voice Activity Detection | `False` |
-| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
 | `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
+| `--host` | Server host address | `localhost` |
+| `--port` | Server port | `8000` |
 | `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
 | `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
-| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
-| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |

-**SimulStreaming-specific Options:**

-| Parameter | Description | Default |
+| WhisperStreaming backend options | Description | Default |
+|-----------|-------------|---------|
+| `--confidence-validation` | Use confidence scores for faster validation | `False` |
+| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
+
+
+| SimulStreaming backend options | Description | Default |
 |-----------|-------------|---------|
 | `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
 | `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
@@ -219,42 +180,37 @@ WhisperLiveKit offers extensive configuration options:
 | `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
 | `--max-context-tokens` | Maximum context tokens | `None` |
 | `--model-path` | Direct path to .pt model file. Download it if not found | `./base.pt` |
+| `--preloaded-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |

-## 🔧 How It Works
+| Diarization options | Description | Default |
+|-----------|-------------|---------|
+| `--diarization` | Enable speaker identification | `False` |
+| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
+| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
+| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |

-1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
-2. **Streaming**: Audio chunks are sent to the server via WebSocket
-3. **Processing**: Server decodes audio with FFmpeg and streams into the model for transcription
-4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu') and finalized text appears in normal color
-
-## 🚀 Deployment Guide
+### 🚀 Deployment Guide

 To deploy WhisperLiveKit in production:
-
-1. **Server Setup** (Backend):
+ 
+1. **Server Setup**: Install production ASGI server & launch with multiple workers
   ```bash
-   # Install production ASGI server
   pip install uvicorn gunicorn
-
-   # Launch with multiple workers
   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
   ```

-2. **Frontend Integration**:
-   - Host your customized version of the example HTML/JS in your web application
-   - Ensure WebSocket connection points to your server's address
+2. **Frontend**: Host your customized version of the `html` example & ensure WebSocket connection points correctly

 3. **Nginx Configuration** (recommended for production):
    ```nginx    
   server {
       listen 80;
       server_name your-domain.com;
-
-    location / {
-        proxy_pass http://localhost:8000;
-        proxy_set_header Upgrade $http_upgrade;
-        proxy_set_header Connection "upgrade";
-        proxy_set_header Host $host;
+        location / {
+            proxy_pass http://localhost:8000;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+            proxy_set_header Host $host;
    }}
    ```

@@ -262,26 +218,19 @@ To deploy WhisperLiveKit in production:

 ### 🐋 Docker

-A basic Dockerfile is provided which allows re-use of Python package installation options. ⚠️ For **large** models, ensure that your **docker runtime** has enough **memory** available. See below usage examples:
+A Dockerfile is provided which allows re-use of Python package installation options. Create a reusable image with only the basics and then run as a named container:

+```bash
+docker build -t whisperlivekit-defaults .
+docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults --model base
+docker start -i whisperlivekit
+```

-#### All defaults
- Create a reusable image with only the basics and then run as a named container:
-    ```bash
-    docker build -t whisperlivekit-defaults .
-    docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
-    docker start -i whisperlivekit
-    ```
+> **Note**: For **large** models, ensure that your **docker runtime** has enough **memory** available

-    > **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.
+> **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.

 #### Customization
- Customize the container options:
-    ```bash
-    docker build -t whisperlivekit-defaults .
-    docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
-    docker start -i whisperlivekit-base
-    ```

 - `--build-arg` Options:
  - `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
@@ -290,10 +239,3 @@ A basic Dockerfile is provided which allows re-use of Python package installatio

 ## 🔮 Use Cases
 Capture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...
-
-## 🙏 Acknowledgments
-
-We extend our gratitude to the original authors of:
-
-| [Whisper Streaming](https://github.com/ufal/whisper_streaming)  | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
-| -------- | ------- | -------- | ------- |
--- a/architecture.png
+++ b/architecture.png
--- a/demo.png
+++ b/demo.png
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "whisperlivekit"
-version = "0.2.5"
+version = "0.2.6"
 description = "Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization"
 readme = "README.md"
 authors = [
@@ -27,24 +27,21 @@ dependencies = [
    "soundfile",
    "faster-whisper",
    "uvicorn",
-    "websockets"
-]
-
-[project.optional-dependencies]
-diarization = ["diart"]
-vac = ["torch"]
-sentence = ["mosestokenizer", "wtpsplit"]
-whisper = ["whisper"]
-whisper-timestamped = ["whisper-timestamped"]
-mlx-whisper = ["mlx-whisper"]
-openai = ["openai"]
-simulstreaming = [
+    "websockets",
    "torch",
    "tqdm",
    "tiktoken",
    'triton>=2.0.0,<3; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
 ]

+[project.optional-dependencies]
+diarization = ["diart"]
+sentence = ["mosestokenizer", "wtpsplit"]
+whisper = ["whisper"]
+whisper-timestamped = ["whisper-timestamped"]
+mlx-whisper = ["mlx-whisper"]
+openai = ["openai"]
+
 [project.urls]
 Homepage = "https://github.com/QuentinFuxa/WhisperLiveKit"

@@ -55,5 +52,5 @@ whisperlivekit-server = "whisperlivekit.basic_server:main"
 packages = ["whisperlivekit", "whisperlivekit.diarization", "whisperlivekit.simul_whisper", "whisperlivekit.simul_whisper.whisper", "whisperlivekit.simul_whisper.whisper.assets", "whisperlivekit.simul_whisper.whisper.normalizers", "whisperlivekit.web", "whisperlivekit.whisper_streaming_custom"]

 [tool.setuptools.package-data]
-whisperlivekit = ["web/*.html"]
+whisperlivekit = ["web/*.html", "web/*.css", "web/*.js", "web/src/*.svg"]
 "whisperlivekit.simul_whisper.whisper.assets" = ["*.tiktoken", "*.npz"]
--- a/whisperlivekit/audio_processor.py
+++ b/whisperlivekit/audio_processor.py
@@ -5,10 +5,12 @@ import math
 import logging
 import traceback
 from datetime import timedelta
-from whisperlivekit.timed_objects import ASRToken
+from whisperlivekit.timed_objects import ASRToken, Silence
 from whisperlivekit.core import TranscriptionEngine, online_factory
 from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
-from .remove_silences import handle_silences
+from whisperlivekit.remove_silences import handle_silences
+from whisperlivekit.trail_repetition import trim_tail_repetition
+from whisperlivekit.silero_vad_iterator import FixedVADIterator
 # Set up logging once
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
@@ -45,16 +47,19 @@ class AudioProcessor:
        self.last_ffmpeg_activity = time()
        self.ffmpeg_health_check_interval = 5
        self.ffmpeg_max_idle_time = 10
+        self.debug = False

        # State management
        self.is_stopping = False
+        self.silence = False
+        self.silence_duration = 0.0
        self.tokens = []
        self.buffer_transcription = ""
        self.buffer_diarization = ""
        self.end_buffer = 0
        self.end_attributed_speaker = 0
        self.lock = asyncio.Lock()
-        self.beg_loop = time()
+        self.beg_loop = None #to deal with a potential little lag at the websocket initialization, this is now set in process_audio
        self.sep = " "  # Default separator
        self.last_response_content = ""
        
@@ -62,7 +67,12 @@ class AudioProcessor:
        self.asr = models.asr
        self.tokenizer = models.tokenizer
        self.diarization = models.diarization
-        
+        self.vac_model = models.vac_model
+        if self.args.vac:
+            self.vac = FixedVADIterator(models.vac_model)
+        else:
+            self.vac = None
+            
        self.ffmpeg_manager = FFmpegManager(
            sample_rate=self.sample_rate,
            channels=self.channels
@@ -98,6 +108,17 @@ class AudioProcessor:
        """Thread-safe update of transcription with new data."""
        async with self.lock:
            self.tokens.extend(new_tokens)
+            
+            # self.tokens, has_been_trimmed = trim_tail_repetition(
+            #     self.tokens,
+            #     key=lambda t: t.text.strip().lower(),
+            #     min_block=2,        # avoid trimming single '.' loops; set to 1 if you want to remove those too
+            #     max_tail=200,
+            #     prefer="longest",   # prefer removing the longest repeated phrase
+            #     keep=1
+            # )
+            # if has_been_trimmed:
+            #     print('HAS BEEN TRIMMED !')
            self.buffer_transcription = buffer
            self.end_buffer = end_buffer
            self.sep = sep
@@ -201,18 +222,44 @@ class AudioProcessor:
                    pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
                    self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
                    
-                    # Send to transcription if enabled
-                    if self.args.transcription and self.transcription_queue:
-                        await self.transcription_queue.put(pcm_array.copy())
+                    res = None
+                    end_of_audio = False
+                    silence_buffer = None
+                    
+                    if self.args.vac:
+                        res = self.vac(pcm_array)
+                    
+                    if res is not None:
+                        if res.get('end', 0) > res.get('start', 0):
+                            end_of_audio = True
+                        elif self.silence: #end of silence
+                            self.silence = False
+                            silence_buffer = Silence(duration=time() - self.start_silence)
+                            
+                    if silence_buffer:
+                        if self.args.transcription and self.transcription_queue:
+                            await self.transcription_queue.put(silence_buffer)
+                        if self.args.diarization and self.diarization_queue:
+                            await self.diarization_queue.put(silence_buffer)

-                    # Send to diarization if enabled
-                    if self.args.diarization and self.diarization_queue:
-                        await self.diarization_queue.put(pcm_array.copy())
+                    if not self.silence:                            
+                        if self.args.transcription and self.transcription_queue:
+                            await self.transcription_queue.put(pcm_array.copy())
+
+                        if self.args.diarization and self.diarization_queue:
+                            await self.diarization_queue.put(pcm_array.copy())
+                        
+                        self.silence_duration = 0.0
+                        if end_of_audio:
+                            self.silence = True
+                            self.start_silence = time()

                    # Sleep if no processing is happening
                    if not self.args.transcription and not self.args.diarization:
                        await asyncio.sleep(0.1)
                    
+                    
+                    
            except Exception as e:
                logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
                logger.warning(f"Traceback: {traceback.format_exc()}")
@@ -239,8 +286,8 @@ class AudioProcessor:
        
        while True:
            try:
-                pcm_array = await self.transcription_queue.get()
-                if pcm_array is SENTINEL:
+                item = await self.transcription_queue.get()
+                if item is SENTINEL:
                    logger.debug("Transcription processor received sentinel. Finishing.")
                    self.transcription_queue.task_done()
                    break
@@ -252,17 +299,30 @@ class AudioProcessor:

                asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
                transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
-
-                logger.info(
-                    f"ASR processing: internal_buffer={asr_internal_buffer_duration_s:.2f}s, "
-                    f"lag={transcription_lag_s:.2f}s."
-                )
+                asr_processing_logs = f"internal_buffer={asr_internal_buffer_duration_s:.2f}s | lag={transcription_lag_s:.2f}s |"
+                if type(item) is Silence:
+                    asr_processing_logs += f" + Silence of = {item.duration:.2f}s"
+                    if self.tokens:
+                        asr_processing_logs += f" | last_end = {self.tokens[-1].end} |"
+                logger.info(asr_processing_logs)
                
-                # Process transcription
-                duration_this_chunk = len(pcm_array) / self.sample_rate if isinstance(pcm_array, np.ndarray) else 0
+                if type(item) is Silence:
+                    cumulative_pcm_duration_stream_time += item.duration
+                    self.online.insert_silence(item.duration, self.tokens[-1].end)
+                    continue
+                
+                if isinstance(item, np.ndarray):
+                    pcm_array = item
+                else:
+                    raise Exception('item should be pcm_array')
+                
+                duration_this_chunk = len(pcm_array) / self.sample_rate
                cumulative_pcm_duration_stream_time += duration_this_chunk
                stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time

+                
+                    
+
                self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
                new_tokens, current_audio_processed_upto = self.online.process_iter()
                
@@ -303,15 +363,25 @@ class AudioProcessor:
    async def diarization_processor(self, diarization_obj):
        """Process audio chunks for speaker diarization."""
        buffer_diarization = ""
-        
+        cumulative_pcm_duration_stream_time = 0.0
        while True:
            try:
-                pcm_array = await self.diarization_queue.get()
-                if pcm_array is SENTINEL:
+                item = await self.diarization_queue.get()
+                if item is SENTINEL:
                    logger.debug("Diarization processor received sentinel. Finishing.")
                    self.diarization_queue.task_done()
                    break
                
+                if type(item) is Silence:
+                    cumulative_pcm_duration_stream_time += item.duration
+                    diarization_obj.insert_silence(item.duration)
+                    continue
+    
+                if isinstance(item, np.ndarray):
+                    pcm_array = item
+                else:
+                    raise Exception('item should be pcm_array') 
+                
                # Process diarization
                await diarization_obj.diarize(pcm_array)
                
@@ -376,13 +446,16 @@ class AudioProcessor:
                lines = []
                last_end_diarized = 0
                undiarized_text = []
-                current_time = time() - self.beg_loop
-                tokens = handle_silences(tokens, current_time)
+                current_time = time() - self.beg_loop if self.beg_loop else None
+                tokens, buffer_transcription, buffer_diarization = handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, self.silence)
                for token in tokens:
                    speaker = token.speaker
                    
+                    if speaker == -1: # Speaker -1 means not attributed by diarization. In the frontend, it should appear under 'Speaker 1'
+                        speaker = 1
+                    
                    # Handle diarization
-                    if self.args.diarization:
+                    if self.args.diarization and not tokens[-1].speaker == -2:
                        if (speaker in [-1, 0]) and token.end >= end_attributed_speaker:
                            undiarized_text.append(token.text)
                            continue
@@ -391,21 +464,23 @@ class AudioProcessor:
                        if speaker not in [-1, 0]:
                            last_end_diarized = max(token.end, last_end_diarized)

-                    # Group by speaker
+                    debug_info = ""
+                    if self.debug:
+                        debug_info = f"[{format_time(token.start)} : {format_time(token.end)}]"
                    if speaker != previous_speaker or not lines:
                        lines.append({
                            "speaker": speaker,
-                            "text": token.text,
+                            "text": token.text + debug_info,
                            "beg": format_time(token.start),
                            "end": format_time(token.end),
                            "diff": round(token.end - last_end_diarized, 2)
                        })
                        previous_speaker = speaker
                    elif token.text:  # Only append if text isn't empty
-                        lines[-1]["text"] += sep + token.text
+                        lines[-1]["text"] += sep + token.text + debug_info
                        lines[-1]["end"] = format_time(token.end)
                        lines[-1]["diff"] = round(token.end - last_end_diarized, 2)
-                
+
                # Handle undiarized text
                if undiarized_text:
                    combined = sep.join(undiarized_text)
@@ -566,6 +641,10 @@ class AudioProcessor:

    async def process_audio(self, message):
        """Process incoming audio data."""
+
+        if not self.beg_loop:
+            self.beg_loop = time()
+
        if not message:
            logger.info("Empty audio message received, initiating stop sequence.")
            self.is_stopping = True
--- a/whisperlivekit/basic_server.py
+++ b/whisperlivekit/basic_server.py
@@ -5,6 +5,9 @@ from fastapi.middleware.cors import CORSMiddleware
 from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
 import asyncio
 import logging
+from starlette.staticfiles import StaticFiles
+import pathlib
+import whisperlivekit.web as webpkg

 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logging.getLogger().setLevel(logging.WARNING)
@@ -30,6 +33,8 @@ app.add_middleware(
    allow_methods=["*"],
    allow_headers=["*"],
 )
+web_dir = pathlib.Path(webpkg.__file__).parent
+app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")

@app.get("/")
 async def get():
@@ -47,7 +52,7 @@ async def handle_websocket_results(websocket, results_generator):
    except WebSocketDisconnect:
        logger.info("WebSocket disconnected while handling results (client likely closed connection).")
    except Exception as e:
-        logger.warning(f"Error in WebSocket results handler: {e}")
+        logger.error(f"Error in WebSocket results handler: {e}")


@app.websocket("/asr")
--- a/whisperlivekit/core.py
+++ b/whisperlivekit/core.py
@@ -1,9 +1,9 @@
 try:
    from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory
-    from whisperlivekit.whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
+    from whisperlivekit.whisper_streaming_custom.online_asr import OnlineASRProcessor
 except ImportError:
    from .whisper_streaming_custom.whisper_online import backend_factory
-    from .whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
+    from .whisper_streaming_custom.online_asr import OnlineASRProcessor
 from whisperlivekit.warmup import warmup_asr, warmup_online
 from argparse import Namespace
 import sys
@@ -34,7 +34,7 @@ class TranscriptionEngine:
            "lan": "auto",
            "task": "transcribe",
            "backend": "faster-whisper",
-            "vac": False,
+            "vac": True,
            "vac_chunk_size": 0.04,
            "log_level": "DEBUG",
            "ssl_certfile": None,
@@ -49,7 +49,7 @@ class TranscriptionEngine:
            "frame_threshold": 25,
            "beams": 1,
            "decoder_type": None,
-            "audio_max_len": 30.0,
+            "audio_max_len": 20.0,
            "audio_min_len": 0.0,
            "cif_ckpt_path": None,
            "never_fire": False,
@@ -57,10 +57,10 @@ class TranscriptionEngine:
            "static_init_prompt": None,
            "max_context_tokens": None,
            "model_path": './base.pt',
+            "diarization_backend": "diart",
            # diart params:
            "segmentation_model": "pyannote/segmentation-3.0",
            "embedding_model": "pyannote/embedding",
-
        }

        config_dict = {**defaults, **kwargs}
@@ -69,6 +69,8 @@ class TranscriptionEngine:
            config_dict['transcription'] = not kwargs['no_transcription']
        if 'no_vad' in kwargs:
            config_dict['vad'] = not kwargs['no_vad']
+        if 'no_vac' in kwargs:
+            config_dict['vac'] = not kwargs['no_vac']
        
        config_dict.pop('no_transcription', None)
        config_dict.pop('no_vad', None)
@@ -82,15 +84,20 @@ class TranscriptionEngine:
        self.asr = None
        self.tokenizer = None
        self.diarization = None
+        self.vac_model = None
+        
+        if self.args.vac:
+            import torch
+            self.vac_model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")            
        
        if self.args.transcription:
            if self.args.backend == "simulstreaming": 
-                from simul_whisper import SimulStreamingASR
+                from whisperlivekit.simul_whisper import SimulStreamingASR
                self.tokenizer = None
                simulstreaming_kwargs = {}
                for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len', 
                            'cif_ckpt_path', 'never_fire', 'init_prompt', 'static_init_prompt', 
-                            'max_context_tokens', 'model_path']:
+                            'max_context_tokens', 'model_path', 'warmup_file', 'preload_model_count']:
                    if hasattr(self.args, attr):
                        simulstreaming_kwargs[attr] = getattr(self.args, attr)
        
@@ -112,12 +119,17 @@ class TranscriptionEngine:
            warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here

        if self.args.diarization:
-            from whisperlivekit.diarization.diarization_online import DiartDiarization
-            self.diarization = DiartDiarization(
-                block_duration=self.args.min_chunk_size,
-                segmentation_model_name=self.args.segmentation_model,
-                embedding_model_name=self.args.embedding_model
-            )
+            if self.args.diarization_backend == "diart":
+                from whisperlivekit.diarization.diart_backend import DiartDiarization
+                self.diarization = DiartDiarization(
+                    block_duration=self.args.min_chunk_size,
+                    segmentation_model_name=self.args.segmentation_model,
+                    embedding_model_name=self.args.embedding_model
+                )
+            elif self.args.diarization_backend == "sortformer":
+                raise ValueError('Sortformer backend is still in development')
+            else:
+                raise ValueError(f"Unknown diarization backend: {self.args.diarization_backend}")
            
        TranscriptionEngine._initialized = True

@@ -125,21 +137,12 @@ class TranscriptionEngine:

 def online_factory(args, asr, tokenizer, logfile=sys.stderr):
    if args.backend == "simulstreaming":    
-        from simul_whisper import SimulStreamingOnlineProcessor
+        from whisperlivekit.simul_whisper import SimulStreamingOnlineProcessor
        online = SimulStreamingOnlineProcessor(
            asr,
            logfile=logfile,
        )
        # warmup_online(online, args.warmup_file)
-    elif args.vac:
-        online = VACOnlineASRProcessor(
-            args.min_chunk_size,
-            asr,
-            tokenizer,
-            logfile=logfile,
-            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
-            confidence_validation = args.confidence_validation
-        )
    else:
        online = OnlineASRProcessor(
            asr,
--- a/whisperlivekit/diarization/diarization_online.py
+++ b/whisperlivekit/diarization/diarization_online.py
@@ -29,6 +29,7 @@ class DiarizationObserver(Observer):
        self.speaker_segments = []
        self.processed_time = 0
        self.segment_lock = threading.Lock()
+        self.global_time_offset = 0.0
    
    def on_next(self, value: Tuple[Annotation, Any]):
        annotation, audio = value
@@ -49,8 +50,8 @@ class DiarizationObserver(Observer):
                        print(f"  {speaker}: {start:.2f}s-{end:.2f}s")
                        self.speaker_segments.append(SpeakerSegment(
                            speaker=speaker,
-                            start=start,
-                            end=end
+                            start=start + self.global_time_offset,
+                            end=end + self.global_time_offset
                        ))
            else:
                logger.debug("\nNo speakers detected in this segment")
@@ -199,6 +200,9 @@ class DiartDiarization:
        self.inference.attach_observers(self.observer)
        asyncio.get_event_loop().run_in_executor(None, self.inference)

+    def insert_silence(self, silence_duration):
+        self.observer.global_time_offset += silence_duration
+
    async def diarize(self, pcm_array: np.ndarray):
        """
        Process audio data for diarization.
--- a/whisperlivekit/diarization/sortformer_backend.py
+++ b/whisperlivekit/diarization/sortformer_backend.py
@@ -0,0 +1,145 @@
+import numpy as np
+import torch
+import logging
+from whisperlivekit.timed_objects import SpeakerSegment
+
+logger = logging.getLogger(__name__)
+
+try:
+    from nemo.collections.asr.models import SortformerEncLabelModel
+except ImportError:
+    raise SystemExit("""Please use `pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"` to use the Sortformer diarization""")
+
+class SortformerDiarization:
+    def __init__(self, model_name="nvidia/diar_streaming_sortformer_4spk-v2"):
+        self.diar_model = SortformerEncLabelModel.from_pretrained(model_name)
+        self.diar_model.eval()
+
+        if torch.cuda.is_available():
+            self.diar_model.to(torch.device("cuda"))
+
+        # Streaming parameters for speed
+        self.diar_model.sortformer_modules.chunk_len = 12
+        self.diar_model.sortformer_modules.chunk_right_context = 1
+        self.diar_model.sortformer_modules.spkcache_len = 188
+        self.diar_model.sortformer_modules.fifo_len = 188
+        self.diar_model.sortformer_modules.spkcache_update_period = 144
+        self.diar_model.sortformer_modules.log = False
+        self.diar_model.sortformer_modules._check_streaming_parameters()
+
+        self.batch_size = 1
+        self.processed_signal_offset = torch.zeros((self.batch_size,), dtype=torch.long, device=self.diar_model.device)
+        
+        self.audio_buffer = np.array([], dtype=np.float32)
+        self.sample_rate = 16000
+        self.speaker_segments = []
+
+        self.streaming_state = self.diar_model.sortformer_modules.init_streaming_state(
+            batch_size=self.batch_size,
+            async_streaming=True,
+            device=self.diar_model.device
+        )
+        self.total_preds = torch.zeros((self.batch_size, 0, self.diar_model.sortformer_modules.n_spk), device=self.diar_model.device)
+
+
+    def _prepare_audio_signal(self, signal):
+        audio_signal = torch.tensor(signal).unsqueeze(0).to(self.diar_model.device)
+        audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(self.diar_model.device)
+        processed_signal, processed_signal_length = self.diar_model.preprocessor(input_signal=audio_signal, length=audio_signal_length)
+        return processed_signal, processed_signal_length
+
+    def _create_streaming_loader(self, processed_signal, processed_signal_length):
+        streaming_loader = self.diar_model.sortformer_modules.streaming_feat_loader(
+            feat_seq=processed_signal,
+            feat_seq_length=processed_signal_length,
+            feat_seq_offset=self.processed_signal_offset,
+        )
+        return streaming_loader
+
+    async def diarize(self, pcm_array: np.ndarray):
+        """
+        Process an incoming audio chunk for diarization.
+        """
+        self.audio_buffer = np.concatenate([self.audio_buffer, pcm_array])
+        
+        # Process in fixed-size chunks (e.g., 1 second)
+        chunk_size = self.sample_rate # 1 second of audio
+        
+        while len(self.audio_buffer) >= chunk_size:
+            chunk_to_process = self.audio_buffer[:chunk_size]
+            self.audio_buffer = self.audio_buffer[chunk_size:]
+
+            processed_signal, processed_signal_length = self._prepare_audio_signal(chunk_to_process)
+            
+            current_offset_seconds = self.processed_signal_offset.item() * self.diar_model.preprocessor._cfg.window_stride
+
+            streaming_loader = self._create_streaming_loader(processed_signal, processed_signal_length)
+            
+            frame_duration_s = self.diar_model.sortformer_modules.subsampling_factor * self.diar_model.preprocessor._cfg.window_stride
+            chunk_duration_seconds = self.diar_model.sortformer_modules.chunk_len * frame_duration_s
+
+            for i, chunk_feat_seq_t, feat_lengths, left_offset, right_offset in streaming_loader:
+                with torch.inference_mode():
+                    self.streaming_state, self.total_preds = self.diar_model.forward_streaming_step(
+                        processed_signal=chunk_feat_seq_t,
+                        processed_signal_length=feat_lengths,
+                        streaming_state=self.streaming_state,
+                        total_preds=self.total_preds,
+                        left_offset=left_offset,
+                        right_offset=right_offset,
+                    )
+                    
+                    num_new_frames = feat_lengths[0].item()
+                    
+                    # Get predictions for the current chunk from the end of total_preds
+                    preds_np = self.total_preds[0, -num_new_frames:].cpu().numpy()
+                    active_speakers = np.argmax(preds_np, axis=1)
+
+                    for idx, spk in enumerate(active_speakers):
+                        start_time = current_offset_seconds + (i * chunk_duration_seconds) + (idx * frame_duration_s)
+                        end_time = start_time + frame_duration_s
+                        
+                        if self.speaker_segments and self.speaker_segments[-1].speaker == spk + 1:
+                            self.speaker_segments[-1].end = end_time
+                        else:
+                            self.speaker_segments.append(SpeakerSegment(
+                                speaker=int(spk + 1),
+                                start=start_time,
+                                end=end_time
+                            ))
+            
+            self.processed_signal_offset += processed_signal_length
+
+
+    def assign_speakers_to_tokens(self, tokens: list, **kwargs) -> list:
+        """
+        Assign speakers to tokens based on timing overlap with speaker segments.
+        """
+        for token in tokens:
+            for segment in self.speaker_segments:
+                if not (segment.end <= token.start or segment.start >= token.end):
+                    token.speaker = segment.speaker
+        return tokens
+
+    def close(self):
+        """
+        Cleanup resources.
+        """
+        logger.info("Closing SortformerDiarization.")
+
+if __name__ == '__main__':
+    import asyncio
+    import librosa
+    an4_audio = 'new_audio_test.mp3'
+    signal, sr = librosa.load(an4_audio, sr=16000)
+
+    diarization_pipeline = SortformerDiarization()
+
+    # Simulate streaming
+    chunk_size = 16000  # 1 second
+    for i in range(0, len(signal), chunk_size):
+        chunk = signal[i:i+chunk_size]
+        asyncio.run(diarization_pipeline.diarize(chunk))
+
+    for segment in diarization_pipeline.speaker_segments:
+        print(f"Speaker {segment.speaker}: {segment.start:.2f}s - {segment.end:.2f}s")
--- a/whisperlivekit/diarization/sortformer_backend_2.py
+++ b/whisperlivekit/diarization/sortformer_backend_2.py
@@ -0,0 +1,257 @@
+import numpy as np
+import torch
+import logging
+import math
+logger = logging.getLogger(__name__)
+
+try:
+    from nemo.collections.asr.models import SortformerEncLabelModel
+except ImportError:
+    raise SystemExit("""Please use `pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"` to use the Sortformer diarization""")
+    
+
+diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
+diar_model.eval()
+
+if torch.cuda.is_available():
+    diar_model.to(torch.device("cuda"))
+    
+# Set the streaming parameters corresponding to 1.04s latency setup. This will affect the streaming feat loader.
+# diar_model.sortformer_modules.chunk_len = 6
+# diar_model.sortformer_modules.spkcache_len = 188
+# diar_model.sortformer_modules.chunk_right_context = 7
+# diar_model.sortformer_modules.fifo_len = 188
+# diar_model.sortformer_modules.spkcache_update_period = 144
+# diar_model.sortformer_modules.log = False
+
+
+# Here we change the settings for our goal: speed.
+# We want batches of around 1 second. One frame is 0.08 s, so 1 s is 12.5 frames; we take 12.
+diar_model.sortformer_modules.chunk_len = 12
+
+# For more speed, we reduce the right context, i.e. how far the model looks into the future.
+diar_model.sortformer_modules.chunk_right_context = 1
+
+# We keep the remaining parameters unchanged for now.
+diar_model.sortformer_modules.spkcache_len = 188
+diar_model.sortformer_modules.fifo_len = 188
+diar_model.sortformer_modules.spkcache_update_period = 144
+diar_model.sortformer_modules.log = False
+diar_model.sortformer_modules._check_streaming_parameters()
+
+batch_size = 1
+processed_signal_offset = torch.zeros((batch_size,), dtype=torch.long, device=diar_model.device)
+
+# from nemo.collections.asr.parts.preprocessing.features import FilterbankFeatures
+# from nemo.collections.asr.modules.audio_preprocessing import get_features
+from nemo.collections.asr.modules.audio_preprocessing import AudioToMelSpectrogramPreprocessor
+
+
+def prepare_audio_signal(signal):
+    audio_signal = torch.tensor(signal).unsqueeze(0).to(diar_model.device)
+    audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(diar_model.device)
+    processed_signal, processed_signal_length = AudioToMelSpectrogramPreprocessor(
+            window_size= 0.025, 
+            normalize="NA",
+            n_fft=512,
+            features=128).get_features(audio_signal, audio_signal_length)
+    return processed_signal, processed_signal_length
+
+
+def streaming_feat_loader(
+    feat_seq, feat_seq_length, feat_seq_offset
+):
+    """
+    Load a chunk of feature sequence for streaming inference.
+
+    Args:
+        feat_seq (torch.Tensor): Tensor containing feature sequence
+            Shape: (batch_size, feat_dim, feat frame count)
+        feat_seq_length (torch.Tensor): Tensor containing feature sequence lengths
+            Shape: (batch_size,)
+        feat_seq_offset (torch.Tensor): Tensor containing feature sequence offsets
+            Shape: (batch_size,)
+
+    Returns:
+        chunk_idx (int): Index of the current chunk
+        chunk_feat_seq (torch.Tensor): Tensor containing the chunk of feature sequence
+            Shape: (batch_size, diar frame count, feat_dim)
+        feat_lengths (torch.Tensor): Tensor containing lengths of the chunk of feature sequence
+            Shape: (batch_size,)
+    """
+    feat_len = feat_seq.shape[2]
+    num_chunks = math.ceil(feat_len / (diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor))
+    if False:
+        logging.info(
+            f"feat_len={feat_len}, num_chunks={num_chunks}, "
+            f"feat_seq_length={feat_seq_length}, feat_seq_offset={feat_seq_offset}"
+        )
+
+    stt_feat, end_feat, chunk_idx = 0, 0, 0
+    while end_feat < feat_len:
+        left_offset = min(diar_model.sortformer_modules.chunk_left_context * diar_model.sortformer_modules.subsampling_factor, stt_feat)
+        end_feat = min(stt_feat + diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor, feat_len)
+        right_offset = min(diar_model.sortformer_modules.chunk_right_context * diar_model.sortformer_modules.subsampling_factor, feat_len - end_feat)
+        chunk_feat_seq = feat_seq[:, :, stt_feat - left_offset : end_feat + right_offset]
+        feat_lengths = (feat_seq_length + feat_seq_offset - stt_feat + left_offset).clamp(
+            0, chunk_feat_seq.shape[2]
+        )
+        feat_lengths = feat_lengths * (feat_seq_offset < end_feat)
+        stt_feat = end_feat
+        chunk_feat_seq_t = torch.transpose(chunk_feat_seq, 1, 2)
+        if False:
+            logging.info(
+                f"chunk_idx: {chunk_idx}, "
+                f"chunk_feat_seq_t shape: {chunk_feat_seq_t.shape}, "
+                f"chunk_feat_lengths: {feat_lengths}"
+            )
+        yield chunk_idx, chunk_feat_seq_t, feat_lengths, left_offset, right_offset
+        chunk_idx += 1
+
+
+class StreamingSortformerState:
+    """
+    This class creates a class instance that will be used to store the state of the
+    streaming Sortformer model.
+
+    Attributes:
+        spkcache (torch.Tensor): Speaker cache to store embeddings from start
+        spkcache_lengths (torch.Tensor): Lengths of the speaker cache
+        spkcache_preds (torch.Tensor): The speaker predictions for the speaker cache parts
+        fifo (torch.Tensor): FIFO queue to save the embedding from the latest chunks
+        fifo_lengths (torch.Tensor): Lengths of the FIFO queue
+        fifo_preds (torch.Tensor): The speaker predictions for the FIFO queue parts
+        spk_perm (torch.Tensor): Speaker permutation information for the speaker cache
+        mean_sil_emb (torch.Tensor): Mean silence embedding
+        n_sil_frames (torch.Tensor): Number of silence frames
+    """
+
+    spkcache = None  # Speaker cache to store embeddings from start
+    spkcache_lengths = None  #
+    spkcache_preds = None  # speaker cache predictions
+    fifo = None  # to save the embedding from the latest chunks
+    fifo_lengths = None
+    fifo_preds = None
+    spk_perm = None
+    mean_sil_emb = None
+    n_sil_frames = None
+
+
+def init_streaming_state(self, batch_size: int = 1, async_streaming: bool = False, device: torch.device = None):
+    """
+    Initializes StreamingSortformerState with empty tensors or zero-valued tensors.
+
+    Args:
+        batch_size (int): Batch size for tensors in streaming state
+        async_streaming (bool): True for asynchronous update, False for synchronous update
+        device (torch.device): Device for tensors in streaming state
+
+    Returns:
+        streaming_state (StreamingSortformerState): initialized streaming state
+    """
+    streaming_state = StreamingSortformerState()
+    if async_streaming:
+        streaming_state.spkcache = torch.zeros((batch_size, self.spkcache_len, self.fc_d_model), device=device)
+        streaming_state.spkcache_preds = torch.zeros((batch_size, self.spkcache_len, self.n_spk), device=device)
+        streaming_state.spkcache_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+        streaming_state.fifo = torch.zeros((batch_size, self.fifo_len, self.fc_d_model), device=device)
+        streaming_state.fifo_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+    else:
+        streaming_state.spkcache = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
+        streaming_state.fifo = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
+    streaming_state.mean_sil_emb = torch.zeros((batch_size, self.fc_d_model), device=device)
+    streaming_state.n_sil_frames = torch.zeros((batch_size,), dtype=torch.long, device=device)
+    return streaming_state
+
+def process_diarization(signal, chunks):
+    
+    audio_signal = torch.tensor(signal).unsqueeze(0).to(diar_model.device)
+    audio_signal_length = torch.tensor([audio_signal.shape[1]]).to(diar_model.device)
+    processed_signal, processed_signal_length = AudioToMelSpectrogramPreprocessor(
+            window_size= 0.025, 
+            normalize="NA",
+            n_fft=512,
+            features=128).get_features(audio_signal, audio_signal_length)
+
+    
+    streaming_loader = streaming_feat_loader(processed_signal, processed_signal_length, processed_signal_offset)
+
+    
+    streaming_state = init_streaming_state(diar_model.sortformer_modules,
+        batch_size = batch_size,
+        async_streaming = True,
+        device = diar_model.device
+    )
+    total_preds = torch.zeros((batch_size, 0, diar_model.sortformer_modules.n_spk), device=diar_model.device)
+
+    
+    chunk_duration_seconds = diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor * diar_model.preprocessor._cfg.window_stride
+    print(f"Chunk duration: {chunk_duration_seconds} seconds")
+
+    l_speakers = [
+        {'start_time': 0,
+        'end_time': 0,
+        'speaker': 0
+        }
+    ]
+    len_prediction = None
+    left_offset = 0
+    right_offset = 8
+    for i, chunk_feat_seq_t, _, _, _ in streaming_loader:
+        with torch.inference_mode():
+                streaming_state, total_preds = diar_model.forward_streaming_step(
+                    processed_signal=chunk_feat_seq_t,
+                    processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]], device=diar_model.device),
+                    streaming_state=streaming_state,
+                    total_preds=total_preds,
+                    left_offset=left_offset,
+                    right_offset=right_offset,
+                )
+                left_offset = 8
+                preds_np = total_preds[0].cpu().numpy()
+                active_speakers = np.argmax(preds_np, axis=1)
+                if len_prediction is None:
+                    len_prediction = len(active_speakers) # we want to get the len of 1 prediction
+                frame_duration = chunk_duration_seconds / len_prediction
+                active_speakers = active_speakers[-len_prediction:]
+                print(chunk_feat_seq_t.shape, total_preds.shape)
+                for idx, spk in enumerate(active_speakers):
+                    if spk != l_speakers[-1]['speaker']:
+                        l_speakers.append(
+                            {'start_time': i * chunk_duration_seconds + idx * frame_duration,
+                            'end_time': i * chunk_duration_seconds + (idx + 1) * frame_duration,
+                            'speaker': spk
+                        })                    
+                    else:
+                        l_speakers[-1]['end_time'] = i * chunk_duration_seconds + (idx + 1) * frame_duration
+                    
+        print(l_speakers)
+        """
+        Should print
+        [{'start_time': 0, 'end_time': 8.72, 'speaker': 0}, 
+        {'start_time': 8.72, 'end_time': 18.88, 'speaker': 1},
+        {'start_time': 18.88, 'end_time': 24.96, 'speaker': 2},
+        {'start_time': 24.96, 'end_time': 31.68, 'speaker': 0}]
+        """
+
+if __name__ == '__main__':
+    import librosa
+    an4_audio = 'new_audio_test.mp3'
+    signal, sr = librosa.load(an4_audio, sr=16000)
+
+    """
+    ground truth:
+    speaker 0 : 0:00 - 0:09
+    speaker 1 : 0:09 - 0:19
+    speaker 2 : 0:19 - 0:25
+    speaker 0 : 0:25 - end
+    """
+
+    # Simulate streaming
+    chunk_size = 16000  # 1 second
+    chunks = []
+    for i in range(0, len(signal), chunk_size):
+        chunk = signal[i:i+chunk_size]
+        chunks.append(chunk)
+
+    process_diarization(signal, chunks)
--- a/whisperlivekit/ffmpeg_manager.py
+++ b/whisperlivekit/ffmpeg_manager.py
@@ -143,7 +143,7 @@ class FFmpegManager:
        try:
            data = await asyncio.wait_for(
                self.process.stdout.read(size),
-                timeout=5.0
+                timeout=20.0
            )
            return data
        except asyncio.TimeoutError:
--- a/whisperlivekit/parse_args.py
+++ b/whisperlivekit/parse_args.py
@@ -58,6 +58,14 @@ def parse_args():
        help="Hugging Face model ID for pyannote.audio embedding model.",
    )

+    parser.add_argument(
+        "--diarization-backend",
+        type=str,
+        default="diart",
+        choices=["sortformer", "diart"],
+        help="The diarization backend to use.",
+    )
+
    parser.add_argument(
        "--no-transcription",
        action="store_true",
@@ -74,7 +82,7 @@ def parse_args():
    parser.add_argument(
        "--model",
        type=str,
-        default="tiny",
+        default="small",
        help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.",
    )
    
@@ -107,15 +115,15 @@ def parse_args():
    parser.add_argument(
        "--backend",
        type=str,
-        default="faster-whisper",
+        default="simulstreaming",
        choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api", "simulstreaming"],
        help="Load only this backend for Whisper processing.",
    )
    parser.add_argument(
-        "--vac",
+        "--no-vac",
        action="store_true",
        default=False,
-        help="Use VAC = voice activity controller. Recommended. Requires torch.",
+        help="Disable VAC = voice activity controller.",
    )
    parser.add_argument(
        "--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
@@ -242,6 +250,14 @@ def parse_args():
        dest="model_path",
        help="Direct path to the SimulStreaming Whisper .pt model file. Overrides --model for SimulStreaming backend.",
    )
+    
+    simulstreaming_group.add_argument(
+        "--preloaded_model_count",
+        type=int,
+        default=1,
+        dest="preload_model_count",
+        help="Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent instances).",
+    )

    args = parser.parse_args()
    
--- a/whisperlivekit/remove_silences.py
+++ b/whisperlivekit/remove_silences.py
@@ -3,6 +3,7 @@ import re

 MIN_SILENCE_DURATION = 4 #in seconds
 END_SILENCE_DURATION = 8 #in seconds. you should keep it important to not have false positive when the model lag is important
+END_SILENCE_DURATION_VAC = 3 #VAC is good at detecting silences, but we want to skip the smallest silences

 def blank_to_silence(tokens):
    full_string = ''.join([t.text for t in tokens])
@@ -76,11 +77,15 @@ def no_token_to_silence(tokens):
            new_tokens.append(token)
    return new_tokens
            
-def ends_with_silence(tokens, current_time):
+def ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
    if not tokens:
-        return []
+        return [], buffer_transcription, buffer_diarization
    last_token = tokens[-1]
-    if tokens and current_time - last_token.end >= END_SILENCE_DURATION:
+    if tokens and (
+        current_time - last_token.end >= END_SILENCE_DURATION 
+        or 
+        (current_time - last_token.end >= END_SILENCE_DURATION_VAC and vac_detected_silence)
+        ):
        if last_token.speaker == -2:
            last_token.end = current_time
        else:
@@ -92,12 +97,14 @@ def ends_with_silence(tokens, current_time):
                    probability=0.95
                )
            )
-    return tokens
+        buffer_transcription = "" # for the whisperstreaming backend, we should probably validate the buffer because of the silence
+        buffer_diarization  = ""
+    return tokens, buffer_transcription, buffer_diarization
    

-def handle_silences(tokens, current_time):
+def handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
    tokens = blank_to_silence(tokens) #useful for simulstreaming backend which tends to generate [BLANK_AUDIO] text
    tokens = no_token_to_silence(tokens)
-    tokens = ends_with_silence(tokens, current_time)
-    return tokens
+    tokens, buffer_transcription, buffer_diarization = ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence)
+    return tokens, buffer_transcription, buffer_diarization
     
--- a/whisperlivekit/whisper_streaming_custom/silero_vad_iterator.py
+++ b/whisperlivekit/whisper_streaming_custom/silero_vad_iterator.py
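Review note on the remove_silences.py change above: the new condition in ends_with_silence combines two thresholds. The sketch below restates that decision in isolation, as a minimal illustration only. The helper name trailing_gap_is_silence is hypothetical and is not part of the diff; the two constants are copied from remove_silences.py.

```python
END_SILENCE_DURATION = 8      # seconds, value taken from remove_silences.py
END_SILENCE_DURATION_VAC = 3  # seconds, value taken from remove_silences.py


def trailing_gap_is_silence(last_token_end: float,
                            current_time: float,
                            vac_detected_silence: bool) -> bool:
    """Hypothetical helper mirroring the condition in ends_with_silence."""
    gap = current_time - last_token_end
    if gap >= END_SILENCE_DURATION:
        # A long gap counts as silence even without confirmation from the VAC.
        return True
    # A shorter gap only counts when the VAC also reports silence.
    return vac_detected_silence and gap >= END_SILENCE_DURATION_VAC


# A 4 s gap is ignored without VAC confirmation and accepted with it.
print(trailing_gap_is_silence(10.0, 14.0, vac_detected_silence=False))
print(trailing_gap_is_silence(10.0, 14.0, vac_detected_silence=True))
```

The lower VAC threshold keeps very short pauses from being reported while still reacting faster than the 8 s fallback.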
--- a/whisperlivekit/simul_whisper/backend.py
+++ b/whisperlivekit/simul_whisper/backend.py
@@ -4,9 +4,13 @@ import logging
 from typing import List, Tuple, Optional
 import logging
 from whisperlivekit.timed_objects import ASRToken, Transcript
+from whisperlivekit.warmup import load_file
 from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
 from .whisper import load_model, tokenizer
+from .whisper.audio import TOKENS_PER_SECOND
+
 import os
+import gc
 logger = logging.getLogger(__name__)

 try:
@@ -19,6 +23,8 @@ except ImportError as e:
        """SimulStreaming dependencies are not available.
        Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".""")

+# TOO_MANY_REPETITIONS = 3
+
 class SimulStreamingOnlineProcessor:
    SAMPLING_RATE = 16000

@@ -30,33 +36,42 @@ class SimulStreamingOnlineProcessor:
    ):        
        self.asr = asr
        self.logfile = logfile
-        self.is_last = False
-        self.beg = 0.0
        self.end = 0.0
-        self.cumulative_audio_duration = 0.0
+        self.global_time_offset = 0.0
        
        self.committed: List[ASRToken] = []
-        self.last_result_tokens: List[ASRToken] = []        
-        self.model = PaddedAlignAttWhisper(
-            cfg=asr.cfg,
-            loaded_model=asr.whisper_model)
+        self.last_result_tokens: List[ASRToken] = []
+        self.load_new_backend()
        if asr.tokenizer:
            self.model.tokenizer = asr.tokenizer

-    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
+    def load_new_backend(self):
+        model = self.asr.get_new_model_instance()
+        self.model = PaddedAlignAttWhisper(
+            cfg=self.asr.cfg,
+            loaded_model=model)
+
+    def insert_silence(self, silence_duration, offset):
+        """
+        If the silence lasts 5 s or more, we do a complete context clear. Otherwise, we just insert a short silence and shift the last_attend_frame
+        """
+        if silence_duration < 5:
+            gap_silence = torch.zeros(int(self.SAMPLING_RATE*silence_duration))
+            self.model.insert_audio(gap_silence)
+            # self.global_time_offset += silence_duration
+        else:
+            self.process_iter(is_last=True) #we want to totally process what remains in the buffer.
+            self.model.refresh_segment(complete=True)
+            self.global_time_offset += silence_duration + offset
+
+
+        
+    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time):
        """Append an audio chunk to be processed by SimulStreaming."""
            
        # Convert numpy array to torch tensor
        audio_tensor = torch.from_numpy(audio).float()
-        
-        # Update timing
-        chunk_duration = len(audio) / self.SAMPLING_RATE
-        self.cumulative_audio_duration += chunk_duration
-        
-        if audio_stream_end_time is not None:
-            self.end = audio_stream_end_time
-        else:
-            self.end = self.cumulative_audio_duration            
+        self.end = audio_stream_end_time #Only to be aligned with what happens in whisperstreaming backend.
        self.model.insert_audio(audio_tensor)

    def get_buffer(self):
@@ -68,38 +83,63 @@ class SimulStreamingOnlineProcessor:
        )

    def timestamped_text(self, tokens, generation):
-        # From the simulstreaming repo. self.model to self.asr.model
-        pr = generation["progress"]
-        if "result" not in generation:
-            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
+        """
+        Generate timestamped text from tokens and generation data.
+        
+        Args:
+            tokens: List of tokens to process
+            generation: Dictionary containing generation progress and optionally results
+            
+        Returns:
+            List of tuples containing (start_time, end_time, word) for each word
+        """
+        FRAME_DURATION = 0.02    
+        if "result" in generation:
+            split_words = generation["result"]["split_words"]
+            split_tokens = generation["result"]["split_tokens"]
        else:
-            split_words, split_tokens = generation["result"]["split_words"], generation["result"]["split_tokens"]
+            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
+        progress = generation["progress"]
+        frames = [p["most_attended_frames"][0] for p in progress]
+        absolute_timestamps = [p["absolute_timestamps"][0] for p in progress]
+        tokens_queue = tokens.copy()
+        timestamped_words = []
+        
+        for word, word_tokens in zip(split_words, split_tokens):
+            # start_frame = None
+            # end_frame = None
+            for expected_token in word_tokens:
+                if not tokens_queue or not frames:
+                    raise ValueError(f"Insufficient tokens or frames for word '{word}'")
+                    
+                actual_token = tokens_queue.pop(0)
+                current_frame = frames.pop(0)
+                current_timestamp = absolute_timestamps.pop(0)
+                if actual_token != expected_token:
+                    raise ValueError(
+                        f"Token mismatch: expected '{expected_token}', "
+                        f"got '{actual_token}' at frame {current_frame}"
+                    )
+                # if start_frame is None:
+                #     start_frame = current_frame
+                # end_frame = current_frame
+            # start_time = start_frame * FRAME_DURATION
+            # end_time = end_frame * FRAME_DURATION
+            start_time = current_timestamp
+            end_time = current_timestamp + 0.1
+            timestamp_entry = (start_time, end_time, word)
+            timestamped_words.append(timestamp_entry)
+            logger.debug(f"TS-WORD:\t{start_time:.2f}\t{end_time:.2f}\t{word}")
+        return timestamped_words

-        frames = [p["most_attended_frames"][0] for p in pr]
-        tokens = tokens.copy()
-        ret = []
-        for sw,st in zip(split_words,split_tokens):
-            b = None
-            for stt in st:
-                t,f = tokens.pop(0), frames.pop(0)
-                if t != stt:
-                    raise ValueError(f"Token mismatch: {t} != {stt} at frame {f}.")
-                if b is None:
-                    b = f
-            e = f
-            out = (b*0.02, e*0.02, sw)
-            ret.append(out)
-            logger.debug(f"TS-WORD:\t{' '.join(map(str, out))}")
-        return ret
-
-    def process_iter(self) -> Tuple[List[ASRToken], float]:
+    def process_iter(self, is_last=False) -> Tuple[List[ASRToken], float]:
        """
        Process accumulated audio chunks using SimulStreaming.
        
        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
        """
-        try:            
-            tokens, generation_progress = self.model.infer(is_last=self.is_last)
+        try:
+            tokens, generation_progress = self.model.infer(is_last=is_last)
            ts_words = self.timestamped_text(tokens, generation_progress)
            
            new_tokens = []
@@ -111,9 +151,33 @@ class SimulStreamingOnlineProcessor:
                    end=end,
                    text=word,
                    probability=0.95  # fake prob. Maybe we can extract it from the model?
+                ).with_offset(
+                    self.global_time_offset
                )
                new_tokens.append(token)
-                self.committed.extend(new_tokens)
+                
+            # identical_tokens = 0
+            # n_new_tokens = len(new_tokens)
+            # if n_new_tokens:
+            
+            self.committed.extend(new_tokens)
+            
+            # if token in self.committed:
+            #     pos = len(self.committed) - 1 - self.committed[::-1].index(token)
+            # if pos:
+            #     for i in range(len(self.committed) - n_new_tokens, -1, -n_new_tokens):
+            #         commited_segment = self.committed[i:i+n_new_tokens]
+            #         if commited_segment == new_tokens:
+            #             identical_segments +=1
+            #             if identical_tokens >= TOO_MANY_REPETITIONS:
+            #                 logger.warning('Too many repetition, model is stuck. Load a new one')
+            #                 self.committed = self.committed[:i]
+            #                 self.load_new_backend()
+            #                 return [], self.end
+
+            # pos = self.committed.rindex(token)
+
+            
            
            return new_tokens, self.end

@@ -132,6 +196,13 @@ class SimulStreamingOnlineProcessor:
        except Exception as e:
            logger.exception(f"SimulStreaming warmup failed: {e}")

+    def __del__(self):
+        # free the model and add a new model to stack.
+        # del self.model
+        gc.collect()
+        torch.cuda.empty_cache()
+        # self.asr.new_model_to_stack()
+        self.model.remove_hooks()

 class SimulStreamingASR():
    """SimulStreaming backend with AlignAtt policy."""
@@ -145,7 +216,7 @@ class SimulStreamingASR():
        
        self.model_path = kwargs.get('model_path', './large-v3.pt')
        self.frame_threshold = kwargs.get('frame_threshold', 25)
-        self.audio_max_len = kwargs.get('audio_max_len', 30.0)
+        self.audio_max_len = kwargs.get('audio_max_len', 20.0)
        self.audio_min_len = kwargs.get('audio_min_len', 0.0)
        self.segment_length = kwargs.get('segment_length', 0.5)
        self.beams = kwargs.get('beams', 1)
@@ -156,6 +227,8 @@ class SimulStreamingASR():
        self.init_prompt = kwargs.get('init_prompt', None)
        self.static_init_prompt = kwargs.get('static_init_prompt', None)
        self.max_context_tokens = kwargs.get('max_context_tokens', None)
+        self.warmup_file = kwargs.get('warmup_file', None)
+        self.preload_model_count = kwargs.get('preload_model_count', 1)
        
        if model_dir is not None:
            self.model_path = model_dir
@@ -176,16 +249,11 @@ class SimulStreamingASR():
            }
            self.model_path = model_mapping.get(modelsize, f'./{modelsize}.pt')
        
-        self.model = self.load_model(modelsize)
-        
        # Set up tokenizer for translation if needed
        if self.task == "translate":
            self.tokenizer = self.set_translate_task()
        else:
            self.tokenizer = None
-
-
-    def load_model(self, modelsize):
        self.cfg = AlignAttConfig(
                model_path=self.model_path,
                segment_length=self.segment_length,
@@ -201,10 +269,34 @@ class SimulStreamingASR():
                init_prompt=self.init_prompt,
                max_context_tokens=self.max_context_tokens,
                static_init_prompt=self.static_init_prompt,
-        )   
-        model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
-        model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
-        self.whisper_model = load_model(name=model_name, download_root=model_path)
+        )  
+        
+        self.model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
+        self.model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
+        self.models = [self.load_model() for i in range(self.preload_model_count)]
+    
+
+
+
+    def load_model(self):
+        whisper_model = load_model(name=self.model_name, download_root=self.model_path)
+        warmup_audio = load_file(self.warmup_file)
+        whisper_model.transcribe(warmup_audio, language=self.original_language)
+        return whisper_model
+    
+    def get_new_model_instance(self):
+        """
+        SimulStreaming cannot share the same backend because it uses global forward hooks on the attention layers.
+        Therefore, each user requires a separate model instance, which can be memory-intensive. To maintain speed, we preload the models into memory.
+        """
+        if len(self.models) == 0:
+            self.models.append(self.load_model())
+        new_model = self.models.pop()
+        return new_model
+        # self.models[0]
+
+    def new_model_to_stack(self):
+        self.models.append(self.load_model())
        

    def set_translate_task(self):
@@ -218,6 +310,6 @@ class SimulStreamingASR():

    def transcribe(self, audio):
        """
-        Only used for warmup. It's a direct whisper call, not a simulstreaming call
+        Warmup is done directly in load_model
        """
-        self.whisper_model.transcribe(audio, language=self.original_language)
+        pass
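Review note on the backend.py change above: get_new_model_instance and new_model_to_stack together behave like a small pool of preloaded instances, because each session needs its own model. The sketch below shows that pattern on its own, assuming a generic factory callable in place of the real Whisper loader. InstancePool, acquire and refill are hypothetical names used for illustration and do not exist in the diff.

```python
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")


class InstancePool(Generic[T]):
    """Hypothetical pool that hands out one preloaded instance per session."""

    def __init__(self, factory: Callable[[], T], preload: int = 1) -> None:
        self._factory = factory
        # Pay the loading cost up front for the expected number of sessions.
        self._ready: List[T] = [factory() for _ in range(preload)]

    def acquire(self) -> T:
        # Fall back to loading on demand when the pool has been drained.
        if not self._ready:
            return self._factory()
        return self._ready.pop()

    def refill(self) -> None:
        # Load one more instance so the next session starts quickly.
        self._ready.append(self._factory())

    def __len__(self) -> int:
        return len(self._ready)


loads = []


def fake_loader() -> dict:
    # Stand-in for an expensive model load; records how often it runs.
    loads.append(1)
    return {"id": len(loads)}


pool = InstancePool(fake_loader, preload=2)
first = pool.acquire()
second = pool.acquire()
third = pool.acquire()  # pool is empty here, so this one is loaded on demand
print(len(loads), len(pool))
```

Memory use grows with the preload count, so the value is a trade-off between start-up latency for a new session and resident memory.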
--- a/whisperlivekit/simul_whisper/config.py
+++ b/whisperlivekit/simul_whisper/config.py
@@ -24,6 +24,6 @@ class AlignAttConfig(SimulWhisperConfig):
    segment_length: float = field(default=1.0, metadata = {"help": "in second"})
    frame_threshold: int = 4
    rewind_threshold: int = 200
-    audio_max_len: float = 30.0
+    audio_max_len: float = 20.0
    cif_ckpt_path: str = ""
    never_fire: bool = False
--- a/whisperlivekit/simul_whisper/simul_whisper.py
+++ b/whisperlivekit/simul_whisper/simul_whisper.py
@@ -56,6 +56,7 @@ class PaddedAlignAttWhisper:
        self.max_text_len = self.model.dims.n_text_ctx
        self.num_decoder_layers = len(self.model.decoder.blocks)
        self.cfg = cfg
+        self.l_hooks = []

        # model to detect end-of-word boundary at the end of the segment
        self.CIFLinear, self.always_fire, self.never_fire = load_cif(cfg,
@@ -69,7 +70,8 @@ class PaddedAlignAttWhisper:
            t = F.softmax(net_output[1], dim=-1)
            self.dec_attns.append(t.squeeze(0))
        for b in self.model.decoder.blocks:
-            b.cross_attn.register_forward_hook(layer_hook)
+            hook = b.cross_attn.register_forward_hook(layer_hook)
+            self.l_hooks.append(hook)
        
        self.kv_cache = {}
        def kv_hook(module: torch.nn.Linear, _, net_output: torch.Tensor):
@@ -82,10 +84,13 @@ class PaddedAlignAttWhisper:
            return self.kv_cache[module.cache_id] 

        for i,b in enumerate(self.model.decoder.blocks):
-            b.attn.key.register_forward_hook(kv_hook)
-            b.attn.value.register_forward_hook(kv_hook)
-            b.cross_attn.key.register_forward_hook(kv_hook)
-            b.cross_attn.value.register_forward_hook(kv_hook)
+            hooks = [
+                b.attn.key.register_forward_hook(kv_hook),
+                b.attn.value.register_forward_hook(kv_hook),
+                b.cross_attn.key.register_forward_hook(kv_hook),
+                b.cross_attn.value.register_forward_hook(kv_hook),
+            ]
+            self.l_hooks.extend(hooks)

        self.align_source = {}
        self.num_align_heads = 0
@@ -120,6 +125,7 @@ class PaddedAlignAttWhisper:
        self.init_tokens()
        
        self.last_attend_frame = -self.cfg.rewind_threshold
+        self.cumulative_time_offset = 0.0

        if self.cfg.max_context_tokens is None:
            self.max_context_tokens = self.max_text_len
@@ -139,6 +145,11 @@ class PaddedAlignAttWhisper:
            self.inference.kv_cache = self.kv_cache

            self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
+            
+    def remove_hooks(self):
+        logger.debug('Removing forward hooks')
+        for hook in self.l_hooks:
+            hook.remove()

    def create_tokenizer(self, language=None):
        self.tokenizer = tokenizer.get_tokenizer(
@@ -210,6 +221,7 @@ class PaddedAlignAttWhisper:
        self.init_tokens()
        self.last_attend_frame = -self.cfg.rewind_threshold       
        self.detected_language = None
+        self.cumulative_time_offset = 0.0
        self.init_context()
        logger.debug(f"Context: {self.context}")
        if not complete and len(self.segments) > 2:
@@ -277,8 +289,9 @@ class PaddedAlignAttWhisper:
            removed_len = self.segments[0].shape[0] / 16000
            segments_len -= removed_len
            self.last_attend_frame -= int(TOKENS_PER_SECOND*removed_len)
+            self.cumulative_time_offset += removed_len  # Track cumulative time removed
            self.segments = self.segments[1:]
-            logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}")
+            logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}, cumulative offset: {self.cumulative_time_offset:.2f}s")
            if len(self.tokens) > 1:
                self.context.append_token_ids(self.tokens[1][0,:])
                self.tokens = [self.initial_tokens] + self.tokens[2:]
@@ -494,7 +507,13 @@ class PaddedAlignAttWhisper:
            # for each beam, the most attended frame is:
            most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
            generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
+            
+            # Calculate absolute timestamps accounting for cumulative offset
+            absolute_timestamps = [(frame * 0.02 + self.cumulative_time_offset) for frame in most_attended_frames.tolist()]
+            generation_progress_loop.append(("absolute_timestamps", absolute_timestamps))
+            
            logger.debug(str(most_attended_frames.tolist()) + " most att frames")
+            logger.debug(f"Absolute timestamps: {absolute_timestamps} (offset: {self.cumulative_time_offset:.2f}s)")

            most_attended_frame = most_attended_frames[0].item()

@@ -599,4 +618,4 @@ class PaddedAlignAttWhisper:
        
        self._clean_cache()

-        return new_hypothesis, generation
+        return new_hypothesis, generation
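Review note on the simul_whisper.py change above: word timestamps are now derived from the most attended frame plus the audio that was already trimmed from the buffer. The sketch below restates that arithmetic, assuming that the global_time_offset applied in backend.py through with_offset is simply added on top. The function name frame_to_stream_time is hypothetical.

```python
FRAME_DURATION = 0.02  # seconds per attention frame, the factor used in the diff


def frame_to_stream_time(frame: int,
                         cumulative_time_offset: float,
                         global_time_offset: float = 0.0) -> float:
    """Hypothetical helper: convert a frame index to a stream-level time.

    cumulative_time_offset is the audio already removed from the model buffer.
    global_time_offset is assumed to be the time skipped during long silences.
    """
    return frame * FRAME_DURATION + cumulative_time_offset + global_time_offset


# Frame 150 of a buffer whose first 12.5 s were trimmed, after a 6 s silence.
print(round(frame_to_stream_time(150, 12.5, 6.0), 2))
```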
--- a/whisperlivekit/timed_objects.py
+++ b/whisperlivekit/timed_objects.py
@@ -29,4 +29,8 @@ class SpeakerSegment(TimedText):
    """Represents a segment of audio attributed to a specific speaker.
    No text nor probability is associated with this segment.
    """
-    pass
+    pass
+
+@dataclass
+class Silence():
+    duration: float
--- a/whisperlivekit/trail_repetition.py
+++ b/whisperlivekit/trail_repetition.py
@@ -0,0 +1,60 @@
+from typing import Sequence, Callable, Any, Optional, Dict
+
+def _detect_tail_repetition(
+    seq: Sequence[Any],
+    key: Callable[[Any], Any] = lambda x: x,  # extract comparable value
+    min_block: int = 1,                       # set to 2 to ignore 1-token loops like "."
+    max_tail: int = 300,                      # search window from the end for speed
+    prefer: str = "longest",                  # "longest" coverage or "smallest" block
+) -> Optional[Dict]:
+    vals = [key(x) for x in seq][-max_tail:]
+    n = len(vals)
+    best = None
+
+    # try every possible block length
+    for b in range(min_block, n // 2 + 1):
+        block = vals[-b:]
+        # count how many times this block repeats contiguously at the very end
+        count, i = 0, n
+        while i - b >= 0 and vals[i - b:i] == block:
+            count += 1
+            i -= b
+
+        if count >= 2:
+            cand = {
+                "block_size": b,
+                "count": count,
+                "start_index": len(seq) - count * b,  # in original seq
+                "end_index": len(seq),
+            }
+            if (best is None or
+                (prefer == "longest" and count * b > best["count"] * best["block_size"]) or
+                (prefer == "smallest" and b < best["block_size"])):
+                best = cand
+    return best
+
+def trim_tail_repetition(
+    seq: Sequence[Any],
+    key: Callable[[Any], Any] = lambda x: x,
+    min_block: int = 1,
+    max_tail: int = 300,
+    prefer: str = "longest",
+    keep: int = 1,  # how many copies of the repeating block to keep at the end (0 or 1 are common)
+):
+    """
+    Returns a new sequence with repeated tail trimmed.
+    keep=1 -> keep a single copy of the repeated block.
+    keep=0 -> remove all copies of the repeated block.
+    """
+    rep = _detect_tail_repetition(seq, key, min_block, max_tail, prefer)
+    if not rep:
+        return seq, False  # nothing to trim
+
+    b, c = rep["block_size"], rep["count"]
+    if keep < 0:
+        keep = 0
+    if keep >= c:
+        return seq, False  # nothing to trim (already <= keep copies)
+    # new length = total - (copies_to_remove * block_size)
+    new_len = len(seq) - (c - keep) * b
+    return seq[:new_len], True
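Review note on trail_repetition.py above: the new module detects a block that repeats contiguously at the end of a sequence and trims the extra copies. The sketch below is a simplified, self-contained restatement of that idea for plain word lists. It always prefers the candidate covering the most tokens and omits the key, min_block and max_tail options. The name trim_repeated_tail is hypothetical and is not the function from the diff.

```python
from typing import List, Sequence, Tuple


def trim_repeated_tail(words: Sequence[str], keep: int = 1) -> Tuple[List[str], bool]:
    """Hypothetical, simplified variant of trim_tail_repetition."""
    keep = max(keep, 0)
    n = len(words)
    best_block, best_count = 0, 0
    for block_size in range(1, n // 2 + 1):
        block = list(words[n - block_size:])
        count, end = 0, n
        # Walk backwards while the same block keeps repeating contiguously.
        while end - block_size >= 0 and list(words[end - block_size:end]) == block:
            count += 1
            end -= block_size
        # Keep the candidate whose repetitions cover the most tokens.
        if count >= 2 and count * block_size > best_count * best_block:
            best_block, best_count = block_size, count
    if best_count <= keep:
        return list(words), False  # nothing to trim
    new_len = n - (best_count - keep) * best_block
    return list(words[:new_len]), True


stuck = ["so", "the", "plan", "is", "thank", "you", "thank", "you", "thank", "you"]
trimmed, changed = trim_repeated_tail(stuck)
print(trimmed, changed)
```

With keep=1 a single copy of the repeated block survives, which preserves the phrase when the speaker really said it once.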
--- a/whisperlivekit/web/live_transcription.css
+++ b/whisperlivekit/web/live_transcription.css
@@ -0,0 +1,402 @@
+:root {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+@media (prefers-color-scheme: dark) {
+  :root:not([data-theme="light"]) {
+    --bg: #0b0b0b;
+    --text: #e6e6e6;
+    --muted: #9aa0a6;
+    --border: #333333;
+    --chip-bg: rgba(255, 255, 255, 0.08);
+    --chip-text: #e6e6e6;
+    --spinner-border: #555555;
+    --spinner-top: #dddddd;
+    --silence-bg: #1a1a1a;
+    --loading-bg: rgba(255, 77, 77, 0.12);
+    --button-bg: #111111;
+    --button-border: #333333;
+    --wave-stroke: #e6e6e6;
+    --label-dia-text: #b3b3b3;
+    --label-trans-text: #ffffff;
+  }
+}
+
+:root[data-theme="dark"] {
+  --bg: #0b0b0b;
+  --text: #e6e6e6;
+  --muted: #9aa0a6;
+  --border: #333333;
+  --chip-bg: rgba(255, 255, 255, 0.08);
+  --chip-text: #e6e6e6;
+  --spinner-border: #555555;
+  --spinner-top: #dddddd;
+  --silence-bg: #1a1a1a;
+  --loading-bg: rgba(255, 77, 77, 0.12);
+  --button-bg: #111111;
+  --button-border: #333333;
+  --wave-stroke: #e6e6e6;
+  --label-dia-text: #b3b3b3;
+  --label-trans-text: #ffffff;
+}
+
+:root[data-theme="light"] {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+body {
+  font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
+  margin: 20px;
+  text-align: center;
+  background-color: var(--bg);
+  color: var(--text);
+}
+
+/* Record button */
+#recordButton {
+  width: 50px;
+  height: 50px;
+  border: none;
+  border-radius: 50%;
+  background-color: var(--button-bg);
+  cursor: pointer;
+  transition: all 0.3s ease;
+  border: 1px solid var(--button-border);
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  position: relative;
+}
+
+#recordButton.recording {
+  width: 180px;
+  border-radius: 40px;
+  justify-content: flex-start;
+  padding-left: 20px;
+}
+
+#recordButton:active {
+  transform: scale(0.95);
+}
+
+.shape-container {
+  width: 25px;
+  height: 25px;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  flex-shrink: 0;
+}
+
+.shape {
+  width: 25px;
+  height: 25px;
+  background-color: rgb(209, 61, 53);
+  border-radius: 50%;
+  transition: all 0.3s ease;
+}
+
+#recordButton:disabled .shape {
+  background-color: #6e6d6d;
+}
+
+#recordButton.recording .shape {
+  border-radius: 5px;
+  width: 25px;
+  height: 25px;
+}
+
+/* Recording elements */
+.recording-info {
+  display: none;
+  align-items: center;
+  margin-left: 15px;
+  flex-grow: 1;
+}
+
+#recordButton.recording .recording-info {
+  display: flex;
+}
+
+.wave-container {
+  width: 60px;
+  height: 30px;
+  position: relative;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+}
+
+#waveCanvas {
+  width: 100%;
+  height: 100%;
+}
+
+.timer {
+  font-size: 14px;
+  font-weight: 500;
+  color: var(--text);
+  margin-left: 10px;
+}
+
+#status {
+  margin-top: 20px;
+  font-size: 16px;
+  color: var(--text);
+}
+
+/* Settings */
+.settings-container {
+  display: flex;
+  justify-content: center;
+  align-items: center;
+  gap: 15px;
+  margin-top: 20px;
+}
+
+.settings {
+  display: flex;
+  flex-direction: column;
+  align-items: flex-start;
+  gap: 12px;
+}
+
+.field {
+  display: flex;
+  flex-direction: column;
+  align-items: flex-start;
+  gap: 3px;
+}
+
+#chunkSelector,
+#websocketInput,
+#themeSelector {
+  font-size: 16px;
+  padding: 5px 8px;
+  border-radius: 8px;
+  border: 1px solid var(--border);
+  background-color: var(--button-bg);
+  color: var(--text);
+  max-height: 34px;
+}
+
+#websocketInput {
+  width: 220px;
+}
+
+#chunkSelector:focus,
+#websocketInput:focus,
+#themeSelector:focus {
+  outline: none;
+  border-color: #007bff;
+  box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.15);
+}
+
+label {
+  font-size: 13px;
+  color: var(--muted);
+}
+
+.ws-default {
+  font-size: 12px;
+  color: var(--muted);
+}
+
+/* Segmented pill control for Theme */
+.segmented {
+  display: inline-flex;
+  align-items: stretch;
+  border: 1px solid var(--button-border);
+  background-color: var(--button-bg);
+  border-radius: 999px;
+  overflow: hidden;
+}
+
+.segmented input[type="radio"] {
+  position: absolute;
+  opacity: 0;
+  pointer-events: none;
+}
+
+.theme-selector-container {
+  position: absolute;
+  top: 20px;
+  right: 20px;
+}
+
+.segmented label {
+  display: inline-flex;
+  align-items: center;
+  gap: 6px;
+  padding: 6px 12px;
+  font-size: 14px;
+  color: var(--muted);
+  cursor: pointer;
+  user-select: none;
+  transition: background-color 0.2s ease, color 0.2s ease;
+}
+
+.segmented label span {
+  display: none;
+}
+
+.segmented label:hover span {
+  display: inline;
+}
+
+.segmented label:hover {
+  background-color: var(--chip-bg);
+}
+
+.segmented img {
+  width: 16px;
+  height: 16px;
+}
+
+.segmented input[type="radio"]:checked + label {
+  background-color: var(--chip-bg);
+  color: var(--text);
+}
+
+.segmented input[type="radio"]:focus-visible + label,
+.segmented input[type="radio"]:focus + label {
+  outline: 2px solid #007bff;
+  outline-offset: 2px;
+  border-radius: 999px;
+}
+
+/* Transcript area */
+#linesTranscript {
+  margin: 20px auto;
+  max-width: 700px;
+  text-align: left;
+  font-size: 16px;
+}
+
+#linesTranscript p {
+  margin: 0px 0;
+}
+
+#linesTranscript strong {
+  color: var(--text);
+}
+
+#speaker {
+  border: 1px solid var(--border);
+  border-radius: 100px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
+
+.label_diarization {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  margin-left: 10px;
+  display: inline-block;
+  white-space: nowrap;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-dia-text);
+}
+
+.label_transcription {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  display: inline-block;
+  white-space: nowrap;
+  margin-left: 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-trans-text);
+}
+
+#timeInfo {
+  color: var(--muted);
+  margin-left: 10px;
+}
+
+.textcontent {
+  font-size: 16px;
+  padding-left: 10px;
+  margin-bottom: 10px;
+  margin-top: 1px;
+  padding-top: 5px;
+  border-radius: 0px 0px 0px 10px;
+}
+
+.buffer_diarization {
+  color: var(--label-dia-text);
+  margin-left: 4px;
+}
+
+.buffer_transcription {
+  color: #7474748c;
+  margin-left: 4px;
+}
+
+.spinner {
+  display: inline-block;
+  width: 8px;
+  height: 8px;
+  border: 2px solid var(--spinner-border);
+  border-top: 2px solid var(--spinner-top);
+  border-radius: 50%;
+  animation: spin 0.7s linear infinite;
+  vertical-align: middle;
+  margin-bottom: 2px;
+  margin-right: 5px;
+}
+
+@keyframes spin {
+  to {
+    transform: rotate(360deg);
+  }
+}
+
+.silence {
+  color: var(--muted);
+  background-color: var(--silence-bg);
+  font-size: 13px;
+  border-radius: 30px;
+  padding: 2px 10px;
+}
+
+.loading {
+  color: var(--muted);
+  background-color: var(--loading-bg);
+  border-radius: 8px 8px 8px 0px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
--- a/whisperlivekit/web/live_transcription.html
+++ b/whisperlivekit/web/live_transcription.html
@@ -1,861 +1,61 @@
 <!DOCTYPE html>
 <html lang="en">
-
 <head>
-    <meta charset="UTF-8" />
-    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <title>WhisperLiveKit</title>
-    <style>
-        :root {
-            --bg: #ffffff;
-            --text: #111111;
-            --muted: #666666;
-            --border: #e5e5e5;
-            --chip-bg: rgba(0, 0, 0, 0.04);
-            --chip-text: #000000;
-            --spinner-border: #8d8d8d5c;
-            --spinner-top: #b0b0b0;
-            --silence-bg: #f3f3f3;
-            --loading-bg: rgba(255, 77, 77, 0.06);
-            --button-bg: #ffffff;
-            --button-border: #e9e9e9;
-            --wave-stroke: #000000;
-            --label-dia-text: #868686;
-            --label-trans-text: #111111;
-        }
-
-        @media (prefers-color-scheme: dark) {
-            :root:not([data-theme="light"]) {
-                --bg: #0b0b0b;
-                --text: #e6e6e6;
-                --muted: #9aa0a6;
-                --border: #333333;
-                --chip-bg: rgba(255, 255, 255, 0.08);
-                --chip-text: #e6e6e6;
-                --spinner-border: #555555;
-                --spinner-top: #dddddd;
-                --silence-bg: #1a1a1a;
-                --loading-bg: rgba(255, 77, 77, 0.12);
-                --button-bg: #111111;
-                --button-border: #333333;
-                --wave-stroke: #e6e6e6;
-                --label-dia-text: #b3b3b3;
-                --label-trans-text: #ffffff;
-            }
-        }
-
-        :root[data-theme="dark"] {
-            --bg: #0b0b0b;
-            --text: #e6e6e6;
-            --muted: #9aa0a6;
-            --border: #333333;
-            --chip-bg: rgba(255, 255, 255, 0.08);
-            --chip-text: #e6e6e6;
-            --spinner-border: #555555;
-            --spinner-top: #dddddd;
-            --silence-bg: #1a1a1a;
-            --loading-bg: rgba(255, 77, 77, 0.12);
-            --button-bg: #111111;
-            --button-border: #333333;
-            --wave-stroke: #e6e6e6;
-            --label-dia-text: #b3b3b3;
-            --label-trans-text: #ffffff;
-        }
-
-        :root[data-theme="light"] {
-            --bg: #ffffff;
-            --text: #111111;
-            --muted: #666666;
-            --border: #e5e5e5;
-            --chip-bg: rgba(0, 0, 0, 0.04);
-            --chip-text: #000000;
-            --spinner-border: #8d8d8d5c;
-            --spinner-top: #b0b0b0;
-            --silence-bg: #f3f3f3;
-            --loading-bg: rgba(255, 77, 77, 0.06);
-            --button-bg: #ffffff;
-            --button-border: #e9e9e9;
-            --wave-stroke: #000000;
-            --label-dia-text: #868686;
-            --label-trans-text: #111111;
-        }
-        body {
-            font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
-            margin: 20px;
-            text-align: center;
-            background-color: var(--bg);
-            color: var(--text);
-        }
-
-        #recordButton {
-            width: 50px;
-            height: 50px;
-            border: none;
-            border-radius: 50%;
-            background-color: var(--button-bg);
-            cursor: pointer;
-            transition: all 0.3s ease;
-            border: 1px solid var(--button-border);
-            display: flex;
-            align-items: center;
-            justify-content: center;
-            position: relative;
-        }
-
-        #recordButton.recording {
-            width: 180px;
-            border-radius: 40px;
-            justify-content: flex-start;
-            padding-left: 20px;
-        }
-
-        #recordButton:active {
-            transform: scale(0.95);
-        }
-
-        .shape-container {
-            width: 25px;
-            height: 25px;
-            display: flex;
-            align-items: center;
-            justify-content: center;
-            flex-shrink: 0;
-        }
-
-        .shape {
-            width: 25px;
-            height: 25px;
-            background-color: rgb(209, 61, 53);
-            border-radius: 50%;
-            transition: all 0.3s ease;
-        }
-
-        #recordButton:disabled .shape {
-            background-color: #6e6d6d;
-        }
-
-        #recordButton.recording .shape {
-            border-radius: 5px;
-            width: 25px;
-            height: 25px;
-        }
-
-        /* Recording elements */
-        .recording-info {
-            display: none;
-            align-items: center;
-            margin-left: 15px;
-            flex-grow: 1;
-        }
-
-        #recordButton.recording .recording-info {
-            display: flex;
-        }
-
-        .wave-container {
-            width: 60px;
-            height: 30px;
-            position: relative;
-            display: flex;
-            align-items: center;
-            justify-content: center;
-        }
-
-        #waveCanvas {
-            width: 100%;
-            height: 100%;
-        }
-
-        .timer {
-            font-size: 14px;
-            font-weight: 500;
-            color: var(--text);
-            margin-left: 10px;
-        }
-
-        #status {
-            margin-top: 20px;
-            font-size: 16px;
-            color: var(--text);
-        }
-
-        .settings-container {
-            display: flex;
-            justify-content: center;
-            align-items: center;
-            gap: 15px;
-            margin-top: 20px;
-        }
-
-        .settings {
-            display: flex;
-            flex-direction: column;
-            align-items: flex-start;
-            gap: 5px;
-        }
-
-        #chunkSelector,
-        #websocketInput,
-        #themeSelector {
-            font-size: 16px;
-            padding: 5px;
-            border-radius: 5px;
-            border: 1px solid var(--border);
-            background-color: var(--button-bg);
-            color: var(--text);
-            max-height: 30px;
-        }
-
-        #websocketInput {
-            width: 200px;
-        }
-
-        #chunkSelector:focus,
-        #websocketInput:focus,
-        #themeSelector:focus {
-            outline: none;
-            border-color: #007bff;
-        }
-
-        label {
-            font-size: 14px;
-        }
-
-        /* Speaker-labeled transcript area */
-        #linesTranscript {
-            margin: 20px auto;
-            max-width: 700px;
-            text-align: left;
-            font-size: 16px;
-        }
-
-        #linesTranscript p {
-            margin: 0px 0;
-        }
-
-        #linesTranscript strong {
-            color: var(--text);
-        }
-
-        #speaker {
-            border: 1px solid var(--border);
-            border-radius: 100px;
-            padding: 2px 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-        }
-        .label_diarization {
-            background-color: var(--chip-bg);
-            border-radius: 8px 8px 8px 8px;
-            padding: 2px 10px;
-            margin-left: 10px;
-            display: inline-block;
-            white-space: nowrap;
-            font-size: 14px;
-            margin-bottom: 0px;
-            color: var(--label-dia-text)
-        }
-
-        .label_transcription {
-            background-color: var(--chip-bg);
-            border-radius: 8px 8px 8px 8px;
-            padding: 2px 10px;
-            display: inline-block;
-            white-space: nowrap;
-            margin-left: 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-            color: var(--label-trans-text)
-        }
-
-        #timeInfo {
-            color: var(--muted);
-            margin-left: 10px;
-        }
-
-        .textcontent {
-            font-size: 16px;
-            /* margin-left: 10px; */
-            padding-left: 10px;
-            margin-bottom: 10px;
-            margin-top: 1px;
-            padding-top: 5px;
-            border-radius: 0px 0px 0px 10px;
-        }
-
-        .buffer_diarization {
-            color: var(--label-dia-text);
-            margin-left: 4px;
-        }
-
-        .buffer_transcription {
-            color: #7474748c;
-            margin-left: 4px;
-        }
-
-
-        .spinner {
-            display: inline-block;
-            width: 8px;
-            height: 8px;
-            border: 2px solid var(--spinner-border);
-            border-top: 2px solid var(--spinner-top);
-            border-radius: 50%;
-            animation: spin 0.7s linear infinite;
-            vertical-align: middle;
-            margin-bottom: 2px;
-            margin-right: 5px;
-        }
-
-        @keyframes spin {
-            to {
-                transform: rotate(360deg);
-            }
-        }
-
-        .silence {
-            color: var(--muted);
-            background-color: var(--silence-bg);
-            font-size: 13px;
-            border-radius: 30px;
-            padding: 2px 10px;
-        }
-
-        .loading {
-            color: var(--muted);
-            background-color: var(--loading-bg);
-            border-radius: 8px 8px 8px 0px;
-            padding: 2px 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-        }
-    </style>
+  <meta charset="UTF-8" />
+  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+  <title>WhisperLiveKit</title>
+  <link rel="stylesheet" href="/web/live_transcription.css" />
 </head>
-
 <body>
-
-    <div class="settings-container">
-        <button id="recordButton">
-            <div class="shape-container">
-                <div class="shape"></div>
-            </div>
-            <div class="recording-info">
-                <div class="wave-container">
-                    <canvas id="waveCanvas"></canvas>
-                </div>
-                <div class="timer">00:00</div>
-            </div>
-        </button>
-        <div class="settings">
-            <div>
-                <label for="chunkSelector">Chunk size (ms):</label>
-                <select id="chunkSelector">
-                    <option value="500">500 ms</option>
-                    <option value="1000" selected>1000 ms</option>
-                    <option value="2000">2000 ms</option>
-                    <option value="3000">3000 ms</option>
-                    <option value="4000">4000 ms</option>
-                    <option value="5000">5000 ms</option>
-                </select>
-            </div>
-            <div>
-                <label for="websocketInput">WebSocket URL:</label>
-                <input id="websocketInput" type="text" />
-            </div>
-            <div>
-                <label for="themeSelector">Theme:</label>
-                <select id="themeSelector">
-                    <option value="system" selected>System</option>
-                    <option value="light">Light</option>
-                    <option value="dark">Dark</option>
-                </select>
-            </div>
+  <div class="settings-container">
+    <button id="recordButton">
+      <div class="shape-container">
+        <div class="shape"></div>
+      </div>
+      <div class="recording-info">
+        <div class="wave-container">
+          <canvas id="waveCanvas"></canvas>
        </div>
+        <div class="timer">00:00</div>
+      </div>
+    </button>
+
+    <div class="settings">
+      <div class="field">
+        <label for="websocketInput">WebSocket URL</label>
+        <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
+      </div>
+
+      </div>
    </div>
+  </div>

-    <p id="status"></p>
+  <div class="theme-selector-container">
+    <div class="segmented" role="radiogroup" aria-label="Theme selector">
+      <input type="radio" id="theme-system" name="theme" value="system" />
+      <label for="theme-system" title="System">
+        <img src="/web/src/system_mode.svg" alt="" />
+        <span>System</span>
+      </label>

-    <!-- Speaker-labeled transcript -->
-    <div id="linesTranscript"></div>
+      <input type="radio" id="theme-light" name="theme" value="light" />
+      <label for="theme-light" title="Light">
+        <img src="/web/src/light_mode.svg" alt="" />
+        <span>Light</span>
+      </label>

-    <script>
-        let isRecording = false;
-        let websocket = null;
-        let recorder = null;
-        let chunkDuration = 1000;
-        let websocketUrl = "ws://localhost:8000/asr";
-        let userClosing = false;
-        let wakeLock = null;
-        let startTime = null;
-        let timerInterval = null;
-        let audioContext = null;
-        let analyser = null;
-        let microphone = null;
-        let waveCanvas = document.getElementById("waveCanvas");
-        let waveCtx = waveCanvas.getContext("2d");
-        let animationFrame = null;
-        let waitingForStop = false;
-        let lastReceivedData = null;
-        let lastSignature = null;
-        waveCanvas.width = 60 * (window.devicePixelRatio || 1);
-        waveCanvas.height = 30 * (window.devicePixelRatio || 1);
-        waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
+      <input type="radio" id="theme-dark" name="theme" value="dark" />
+      <label for="theme-dark" title="Dark">
+        <img src="/web/src/dark_mode.svg" alt="" />
+        <span>Dark</span>
+      </label>
+    </div>
+  </div>

-        const statusText = document.getElementById("status");
-        const recordButton = document.getElementById("recordButton");
-        const chunkSelector = document.getElementById("chunkSelector");
-        const websocketInput = document.getElementById("websocketInput");
-        const linesTranscriptDiv = document.getElementById("linesTranscript");
-        const timerElement = document.querySelector(".timer");
-        const themeSelector = document.getElementById("themeSelector");
+  <p id="status"></p>

-        function getWaveStroke() {
-            const styles = getComputedStyle(document.documentElement);
-            const v = styles.getPropertyValue("--wave-stroke").trim();
-            return v || "#000";
-        }
+  <div id="linesTranscript"></div>

-        let waveStroke = getWaveStroke();
-
-        function updateWaveStroke() {
-            waveStroke = getWaveStroke();
-        }
-
-        function applyTheme(pref) {
-            if (pref === "light") {
-                document.documentElement.setAttribute("data-theme", "light");
-            } else if (pref === "dark") {
-                document.documentElement.setAttribute("data-theme", "dark");
-            } else {
-                document.documentElement.removeAttribute("data-theme");
-            }
-            updateWaveStroke();
-        }
-
-        const savedThemePref = localStorage.getItem("themePreference") || "system";
-        applyTheme(savedThemePref);
-        if (themeSelector) {
-            themeSelector.value = savedThemePref;
-            themeSelector.addEventListener("change", () => {
-                const val = themeSelector.value;
-                localStorage.setItem("themePreference", val);
-                applyTheme(val);
-            });
-        }
-
-        const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
-        const handleOsThemeChange = () => {
-            const pref = localStorage.getItem("themePreference") || "system";
-            if (pref === "system") updateWaveStroke();
-        };
-        if (darkMq && darkMq.addEventListener) {
-            darkMq.addEventListener("change", handleOsThemeChange);
-        } else if (darkMq && darkMq.addListener) {
-            darkMq.addListener(handleOsThemeChange);
-        }
-
-        function fmt1(x) {
-            const n = Number(x);
-            return Number.isFinite(n) ? n.toFixed(1) : x;
-        }
-
-        const host = window.location.hostname || "localhost";
-        const port = window.location.port;
-        const protocol = window.location.protocol === "https:" ? "wss" : "ws";
-        const defaultWebSocketUrl = `${protocol}://${host}:${port}/asr`;
-        websocketInput.value = defaultWebSocketUrl;
-        websocketUrl = defaultWebSocketUrl;
-
-        chunkSelector.addEventListener("change", () => {
-            chunkDuration = parseInt(chunkSelector.value);
-        });
-
-        websocketInput.addEventListener("change", () => {
-            const urlValue = websocketInput.value.trim();
-            if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
-                statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
-                return;
-            }
-            websocketUrl = urlValue;
-            statusText.textContent = "WebSocket URL updated. Ready to connect.";
-        });
-
-        function setupWebSocket() {
-            return new Promise((resolve, reject) => {
-                try {
-                    websocket = new WebSocket(websocketUrl);
-                } catch (error) {
-                    statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
-                    reject(error);
-                    return;
-                }
-
-                websocket.onopen = () => {
-                    statusText.textContent = "Connected to server.";
-                    resolve();
-                };
-
-                websocket.onclose = () => {
-                    if (userClosing) {
-                        if (waitingForStop) {
-                            statusText.textContent = "Processing finalized or connection closed.";
-                            if (lastReceivedData) {
-                                renderLinesWithBuffer(
-                                    lastReceivedData.lines || [],
-                                    lastReceivedData.buffer_diarization || "",
-                                    lastReceivedData.buffer_transcription || "",
-                                    0, 0, true // isFinalizing = true
-                                );
-                            }
-                        }
-                        // If ready_to_stop was received, statusText is already "Finished processing..."
-                        // and waitingForStop is false.
-                    } else {
-                        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
-                        if (isRecording) {
-                            stopRecording(); 
-                        }
-                    }
-                    isRecording = false;  
-                    waitingForStop = false; 
-                    userClosing = false;  
-                    lastReceivedData = null;  
-                    websocket = null;    
-                    updateUI();  
-                };
-
-                websocket.onerror = () => {
-                    statusText.textContent = "Error connecting to WebSocket.";
-                    reject(new Error("Error connecting to WebSocket"));
-                };
-
-                // Handle messages from server
-                websocket.onmessage = (event) => {
-                    const data = JSON.parse(event.data);
-                    
-                    // Check for status messages
-                    if (data.type === "ready_to_stop") {
-                        console.log("Ready to stop received, finalizing display and closing WebSocket.");
-                        waitingForStop = false;
-
-                        if (lastReceivedData) {
-                            renderLinesWithBuffer(
-                                lastReceivedData.lines || [],
-                                lastReceivedData.buffer_diarization || "",
-                                lastReceivedData.buffer_transcription || "",
-                                0, // No more lag
-                                0, // No more lag
-                                true // isFinalizing = true
-                            );
-                        }
-                        statusText.textContent = "Finished processing audio! Ready to record again.";
-                        recordButton.disabled = false;
-                        
-                        if (websocket) {
-                            websocket.close(); // will trigger onclose
-                            // websocket = null; // onclose handle setting websocket to null
-                        }
-                        return;
-                    }
-                    
-                    lastReceivedData = data; 
-                    
-                    // Handle normal transcription updates
-                    const { 
-                        lines = [], 
-                        buffer_transcription = "", 
-                        buffer_diarization = "",
-                        remaining_time_transcription = 0,
-                        remaining_time_diarization = 0,
-                        status = "active_transcription"
-                    } = data;
-                    
-                    renderLinesWithBuffer(
-                        lines, 
-                        buffer_diarization, 
-                        buffer_transcription, 
-                        remaining_time_diarization,
-                        remaining_time_transcription,
-                        false,
-                        status
-                    );
-                };
-            });
-        }
-
-        function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription, isFinalizing = false, current_status = "active_transcription") {
-            if (current_status === "no_audio_detected") {
-                linesTranscriptDiv.innerHTML = "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
-                return; 
-            }
-
-            // try to keep stable DOM despite having updates every 0.1s. only update numeric lag values if structure hasn't changed
-            const showLoading = (!isFinalizing) && (lines || []).some(it => it.speaker == 0);
-            const showTransLag = !isFinalizing && remaining_time_transcription > 0;
-            const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
-            const signature = JSON.stringify({
-                lines: (lines || []).map(it => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
-                buffer_transcription: buffer_transcription || "",
-                buffer_diarization: buffer_diarization || "",
-                status: current_status,
-                showLoading,
-                showTransLag,
-                showDiaLag,
-                isFinalizing: !!isFinalizing
-            });
-            if (lastSignature === signature) {
-                const t = document.querySelector(".lag-transcription-value");
-                if (t) t.textContent = fmt1(remaining_time_transcription);
-                const d = document.querySelector(".lag-diarization-value");
-                if (d) d.textContent = fmt1(remaining_time_diarization);
-                const ld = document.querySelector(".loading-diarization-value");
-                if (ld) ld.textContent = fmt1(remaining_time_diarization);
-                return;
-            }
-            lastSignature = signature;
-
-            const linesHtml = lines.map((item, idx) => {
-                let timeInfo = "";
-                if (item.beg !== undefined && item.end !== undefined) {
-                    timeInfo = ` ${item.beg} - ${item.end}`;
-                }
-
-                let speakerLabel = "";
-                if (item.speaker === -2) {
-                    speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
-                } else if (item.speaker == 0 && !isFinalizing) {
-                    speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(remaining_time_diarization)}</span> second(s) of audio are undergoing diarization</span></span>`;
-                } else if (item.speaker == -1) {
-                    speakerLabel = `<span id="speaker">Speaker 1<span id='timeInfo'>${timeInfo}</span></span>`;
-                } else if (item.speaker !== -1 && item.speaker !== 0) {
-                    speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
-                }
-
-
-                let currentLineText = item.text || "";
-
-                if (idx === lines.length - 1) { 
-                    if (!isFinalizing && item.speaker !== -2) {
-                        if (remaining_time_transcription > 0) {
-                             speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(remaining_time_transcription)}</span>s</span></span>`;
-                        }
-                        if (buffer_diarization && remaining_time_diarization > 0) {
-                             speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(remaining_time_diarization)}</span>s</span></span>`;
-                        }
-                    }
-
-                    if (buffer_diarization) {
-                        if (isFinalizing) {
-                            currentLineText += (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
-                        } else {
-                            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
-                        }
-                    }
-                    if (buffer_transcription) {
-                        if (isFinalizing) {
-                            currentLineText += (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") + buffer_transcription.trim();
-                        } else {
-                            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
-                        }
-                    }
-                }
-                
-                return currentLineText.trim().length > 0 || speakerLabel.length > 0
-                    ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
-                    : `<p>${speakerLabel}<br/></p>`; 
-            }).join("");
-
-            linesTranscriptDiv.innerHTML = linesHtml;
-            window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' });
-        }
-
-        function updateTimer() {
-            if (!startTime) return;
-            
-            const elapsed = Math.floor((Date.now() - startTime) / 1000);
-            const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
-            const seconds = (elapsed % 60).toString().padStart(2, "0");
-            timerElement.textContent = `${minutes}:${seconds}`;
-        }
-
-        function drawWaveform() {
-            if (!analyser) return;
-            
-            const bufferLength = analyser.frequencyBinCount;
-            const dataArray = new Uint8Array(bufferLength);
-            analyser.getByteTimeDomainData(dataArray);
-            
-            waveCtx.clearRect(0, 0, waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1));
-            waveCtx.lineWidth = 1;
-            waveCtx.strokeStyle = waveStroke;
-            waveCtx.beginPath();
-            
-            const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
-            let x = 0;
-            
-            for (let i = 0; i < bufferLength; i++) {
-                const v = dataArray[i] / 128.0;
-                const y = v * (waveCanvas.height / (window.devicePixelRatio || 1)) / 2;
-                
-                if (i === 0) {
-                    waveCtx.moveTo(x, y);
-                } else {
-                    waveCtx.lineTo(x, y);
-                }
-                
-                x += sliceWidth;
-            }
-            
-            waveCtx.lineTo(waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1) / 2);
-            waveCtx.stroke();
-            
-            animationFrame = requestAnimationFrame(drawWaveform);
-        }
-
-        async function startRecording() {
-            try {
-
-                // https://developer.mozilla.org/en-US/docs/Web/API/Screen_Wake_Lock_API
-                // create an async function to request a wake lock
-                try {
-                  wakeLock = await navigator.wakeLock.request("screen");
-                } catch (err) {
-                  // The Wake Lock request has failed - usually system related, such as battery.
-                  console.log("Error acquiring wake lock.")
-                }
-
-                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
-                
-                audioContext = new (window.AudioContext || window.webkitAudioContext)();
-                analyser = audioContext.createAnalyser();
-                analyser.fftSize = 256;
-                microphone = audioContext.createMediaStreamSource(stream);
-                microphone.connect(analyser);
-                
-                recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
-                recorder.ondataavailable = (e) => {
-                    if (websocket && websocket.readyState === WebSocket.OPEN) {
-                        websocket.send(e.data);
-                    }
-                };
-                recorder.start(chunkDuration);
-                
-                startTime = Date.now();
-                timerInterval = setInterval(updateTimer, 1000);
-                drawWaveform();
-                
-                isRecording = true;
-                updateUI();
-            } catch (err) {
-                statusText.textContent = "Error accessing microphone. Please allow microphone access.";
-                console.error(err);
-            }
-        }
-
-        async function stopRecording() {
-            wakeLock.release().then(() => {
-              wakeLock = null;
-            });
-  
-            userClosing = true;
-            waitingForStop = true;
-            
-            if (websocket && websocket.readyState === WebSocket.OPEN) {
-                // Send empty audio buffer as stop signal
-                const emptyBlob = new Blob([], { type: 'audio/webm' });
-                websocket.send(emptyBlob);
-                statusText.textContent = "Recording stopped. Processing final audio...";
-            }
-            
-            if (recorder) {
-                recorder.stop();
-                recorder = null;
-            }
-            
-            if (microphone) {
-                microphone.disconnect();
-                microphone = null;
-            }
-            
-            if (analyser) {
-                analyser = null;
-            }
-            
-            if (audioContext && audioContext.state !== 'closed') {
-                try {
-                    audioContext.close();
-                } catch (e) {
-                    console.warn("Could not close audio context:", e);
-                }
-                audioContext = null;
-            }
-            
-            if (animationFrame) {
-                cancelAnimationFrame(animationFrame);
-                animationFrame = null;
-            }
-            
-            if (timerInterval) {
-                clearInterval(timerInterval);
-                timerInterval = null;
-            }            
-            timerElement.textContent = "00:00";
-            startTime = null;
-            
-            
-            isRecording = false;
-            updateUI();	
-        }
-
-        async function toggleRecording() {
-            if (!isRecording) {
-                if (waitingForStop) {
-                    console.log("Waiting for stop, early return");
-                    return;  // Early return, UI is already updated
-                }
-                console.log("Connecting to WebSocket");
-                try {
-                    // If we have an active WebSocket that's still processing, just restart audio capture
-                    if (websocket && websocket.readyState === WebSocket.OPEN) {
-                        await startRecording();
-                    } else {
-                        // If no active WebSocket or it's closed, create new one
-                        await setupWebSocket();
-                        await startRecording();
-                    }
-                } catch (err) {
-                    statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
-                    console.error(err);
-                }
-            } else {
-                console.log("Stopping recording");
-                stopRecording();
-            }
-        }
-
-        function updateUI() {
-            recordButton.classList.toggle("recording", isRecording);
-            recordButton.disabled = waitingForStop;
-
-            if (waitingForStop) {
-                if (statusText.textContent !== "Recording stopped. Processing final audio...") {
-                     statusText.textContent = "Please wait for processing to complete...";
-                }
-            } else if (isRecording) {
-                statusText.textContent = "Recording...";
-            } else {
-                if (statusText.textContent !== "Finished processing audio! Ready to record again." &&
-                    statusText.textContent !== "Processing finalized or connection closed.") {
-                    statusText.textContent = "Click to start transcription";
-                }
-            }
-            if (!waitingForStop) {
-                recordButton.disabled = false;
-            }
-        }
-
-        recordButton.addEventListener("click", toggleRecording);
-    </script>
+  <script src="/web/live_transcription.js"></script>
 </body>
-
 </html>
--- a/whisperlivekit/web/live_transcription.js
+++ b/whisperlivekit/web/live_transcription.js
@@ -0,0 +1,513 @@
+/* Theme, WebSocket, recording, rendering logic extracted from inline script and adapted for segmented theme control and WS caption */
+
+let isRecording = false;
+let websocket = null;
+let recorder = null;
+let chunkDuration = 100;
+let websocketUrl = "ws://localhost:8000/asr";
+let userClosing = false;
+let wakeLock = null;
+let startTime = null;
+let timerInterval = null;
+let audioContext = null;
+let analyser = null;
+let microphone = null;
+let waveCanvas = document.getElementById("waveCanvas");
+let waveCtx = waveCanvas.getContext("2d");
+let animationFrame = null;
+let waitingForStop = false;
+let lastReceivedData = null;
+let lastSignature = null;
+
+waveCanvas.width = 60 * (window.devicePixelRatio || 1);
+waveCanvas.height = 30 * (window.devicePixelRatio || 1);
+waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
+
+const statusText = document.getElementById("status");
+const recordButton = document.getElementById("recordButton");
+const chunkSelector = document.getElementById("chunkSelector");
+const websocketInput = document.getElementById("websocketInput");
+const websocketDefaultSpan = document.getElementById("wsDefaultUrl");
+const linesTranscriptDiv = document.getElementById("linesTranscript");
+const timerElement = document.querySelector(".timer");
+const themeRadios = document.querySelectorAll('input[name="theme"]');
+
+function getWaveStroke() {
+  const styles = getComputedStyle(document.documentElement);
+  const v = styles.getPropertyValue("--wave-stroke").trim();
+  return v || "#000";
+}
+
+let waveStroke = getWaveStroke();
+function updateWaveStroke() {
+  waveStroke = getWaveStroke();
+}
+
+function applyTheme(pref) {
+  if (pref === "light") {
+    document.documentElement.setAttribute("data-theme", "light");
+  } else if (pref === "dark") {
+    document.documentElement.setAttribute("data-theme", "dark");
+  } else {
+    document.documentElement.removeAttribute("data-theme");
+  }
+  updateWaveStroke();
+}
+
+// Persisted theme preference
+const savedThemePref = localStorage.getItem("themePreference") || "system";
+applyTheme(savedThemePref);
+if (themeRadios.length) {
+  themeRadios.forEach((r) => {
+    r.checked = r.value === savedThemePref;
+    r.addEventListener("change", () => {
+      if (r.checked) {
+        localStorage.setItem("themePreference", r.value);
+        applyTheme(r.value);
+      }
+    });
+  });
+}
+
+// React to OS theme changes when in "system" mode
+const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
+const handleOsThemeChange = () => {
+  const pref = localStorage.getItem("themePreference") || "system";
+  if (pref === "system") updateWaveStroke();
+};
+if (darkMq && darkMq.addEventListener) {
+  darkMq.addEventListener("change", handleOsThemeChange);
+} else if (darkMq && darkMq.addListener) {
+  // deprecated, but included for Safari compatibility
+  darkMq.addListener(handleOsThemeChange);
+}
+
+// Helpers
+function fmt1(x) {
+  const n = Number(x);
+  return Number.isFinite(n) ? n.toFixed(1) : x;
+}
+
+// Default WebSocket URL computation
+const host = window.location.hostname || "localhost";
+const port = window.location.port;
+const protocol = window.location.protocol === "https:" ? "wss" : "ws";
+const defaultWebSocketUrl = `${protocol}://${host}${port ? ":" + port : ""}/asr`;
+
+// Populate default caption and input
+if (websocketDefaultSpan) websocketDefaultSpan.textContent = defaultWebSocketUrl;
+websocketInput.value = defaultWebSocketUrl;
+websocketUrl = defaultWebSocketUrl;
+
+// Optional chunk selector (guard for presence)
+if (chunkSelector) {
+  chunkSelector.addEventListener("change", () => {
+    chunkDuration = parseInt(chunkSelector.value, 10);
+  });
+}
+
+// WebSocket input change handling
+websocketInput.addEventListener("change", () => {
+  const urlValue = websocketInput.value.trim();
+  if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
+    statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
+    return;
+  }
+  websocketUrl = urlValue;
+  statusText.textContent = "WebSocket URL updated. Ready to connect.";
+});
+
+function setupWebSocket() {
+  return new Promise((resolve, reject) => {
+    try {
+      websocket = new WebSocket(websocketUrl);
+    } catch (error) {
+      statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
+      reject(error);
+      return;
+    }
+
+    websocket.onopen = () => {
+      statusText.textContent = "Connected to server.";
+      resolve();
+    };
+
+    websocket.onclose = () => {
+      if (userClosing) {
+        if (waitingForStop) {
+          statusText.textContent = "Processing finalized or connection closed.";
+          if (lastReceivedData) {
+            renderLinesWithBuffer(
+              lastReceivedData.lines || [],
+              lastReceivedData.buffer_diarization || "",
+              lastReceivedData.buffer_transcription || "",
+              0,
+              0,
+              true
+            );
+          }
+        }
+      } else {
+        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
+        if (isRecording) {
+          stopRecording();
+        }
+      }
+      isRecording = false;
+      waitingForStop = false;
+      userClosing = false;
+      lastReceivedData = null;
+      websocket = null;
+      updateUI();
+    };
+
+    websocket.onerror = () => {
+      statusText.textContent = "Error connecting to WebSocket.";
+      reject(new Error("Error connecting to WebSocket"));
+    };
+
+    websocket.onmessage = (event) => {
+      const data = JSON.parse(event.data);
+
+      if (data.type === "ready_to_stop") {
+        console.log("Ready to stop received, finalizing display and closing WebSocket.");
+        waitingForStop = false;
+
+        if (lastReceivedData) {
+          renderLinesWithBuffer(
+            lastReceivedData.lines || [],
+            lastReceivedData.buffer_diarization || "",
+            lastReceivedData.buffer_transcription || "",
+            0,
+            0,
+            true
+          );
+        }
+        statusText.textContent = "Finished processing audio! Ready to record again.";
+        recordButton.disabled = false;
+
+        if (websocket) {
+          websocket.close();
+        }
+        return;
+      }
+
+      lastReceivedData = data;
+
+      const {
+        lines = [],
+        buffer_transcription = "",
+        buffer_diarization = "",
+        remaining_time_transcription = 0,
+        remaining_time_diarization = 0,
+        status = "active_transcription",
+      } = data;
+
+      renderLinesWithBuffer(
+        lines,
+        buffer_diarization,
+        buffer_transcription,
+        remaining_time_diarization,
+        remaining_time_transcription,
+        false,
+        status
+      );
+    };
+  });
+}
+
+function renderLinesWithBuffer(
+  lines,
+  buffer_diarization,
+  buffer_transcription,
+  remaining_time_diarization,
+  remaining_time_transcription,
+  isFinalizing = false,
+  current_status = "active_transcription"
+) {
+  if (current_status === "no_audio_detected") {
+    linesTranscriptDiv.innerHTML =
+      "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
+    return;
+  }
+
+  const showLoading = !isFinalizing && (lines || []).some((it) => it.speaker == 0);
+  const showTransLag = !isFinalizing && remaining_time_transcription > 0;
+  const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
+  const signature = JSON.stringify({
+    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
+    buffer_transcription: buffer_transcription || "",
+    buffer_diarization: buffer_diarization || "",
+    status: current_status,
+    showLoading,
+    showTransLag,
+    showDiaLag,
+    isFinalizing: !!isFinalizing,
+  });
+  if (lastSignature === signature) {
+    const t = document.querySelector(".lag-transcription-value");
+    if (t) t.textContent = fmt1(remaining_time_transcription);
+    const d = document.querySelector(".lag-diarization-value");
+    if (d) d.textContent = fmt1(remaining_time_diarization);
+    const ld = document.querySelector(".loading-diarization-value");
+    if (ld) ld.textContent = fmt1(remaining_time_diarization);
+    return;
+  }
+  lastSignature = signature;
+
+  const linesHtml = (lines || [])
+    .map((item, idx) => {
+      let timeInfo = "";
+      if (item.beg !== undefined && item.end !== undefined) {
+        timeInfo = ` ${item.beg} - ${item.end}`;
+      }
+
+      let speakerLabel = "";
+      if (item.speaker === -2) {
+        speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
+      } else if (item.speaker == 0 && !isFinalizing) {
+        speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
+          remaining_time_diarization
+        )}</span> second(s) of audio are undergoing diarization</span></span>`;
+      } else if (item.speaker !== 0) {
+        speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
+      }
+
+      let currentLineText = item.text || "";
+
+      if (idx === lines.length - 1) {
+        if (!isFinalizing && item.speaker !== -2) {
+          if (remaining_time_transcription > 0) {
+            speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(
+              remaining_time_transcription
+            )}</span>s</span></span>`;
+          }
+          if (buffer_diarization && remaining_time_diarization > 0) {
+            speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(
+              remaining_time_diarization
+            )}</span>s</span></span>`;
+          }
+        }
+
+        if (buffer_diarization) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
+          } else {
+            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
+          }
+        }
+        if (buffer_transcription) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") +
+              buffer_transcription.trim();
+          } else {
+            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
+          }
+        }
+      }
+
+      return currentLineText.trim().length > 0 || speakerLabel.length > 0
+        ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
+        : `<p>${speakerLabel}<br/></p>`;
+    })
+    .join("");
+
+  linesTranscriptDiv.innerHTML = linesHtml;
+  window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });
+}
+
+function updateTimer() {
+  if (!startTime) return;
+
+  const elapsed = Math.floor((Date.now() - startTime) / 1000);
+  const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
+  const seconds = (elapsed % 60).toString().padStart(2, "0");
+  timerElement.textContent = `${minutes}:${seconds}`;
+}
+
+function drawWaveform() {
+  if (!analyser) return;
+
+  const bufferLength = analyser.frequencyBinCount;
+  const dataArray = new Uint8Array(bufferLength);
+  analyser.getByteTimeDomainData(dataArray);
+
+  waveCtx.clearRect(
+    0,
+    0,
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    waveCanvas.height / (window.devicePixelRatio || 1)
+  );
+  waveCtx.lineWidth = 1;
+  waveCtx.strokeStyle = waveStroke;
+  waveCtx.beginPath();
+
+  const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
+  let x = 0;
+
+  for (let i = 0; i < bufferLength; i++) {
+    const v = dataArray[i] / 128.0;
+    const y = (v * (waveCanvas.height / (window.devicePixelRatio || 1))) / 2;
+
+    if (i === 0) {
+      waveCtx.moveTo(x, y);
+    } else {
+      waveCtx.lineTo(x, y);
+    }
+
+    x += sliceWidth;
+  }
+
+  waveCtx.lineTo(
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    (waveCanvas.height / (window.devicePixelRatio || 1)) / 2
+  );
+  waveCtx.stroke();
+
+  animationFrame = requestAnimationFrame(drawWaveform);
+}
+
+async function startRecording() {
+  try {
+    try {
+      wakeLock = await navigator.wakeLock.request("screen");
+    } catch (err) {
+      console.log("Error acquiring wake lock.");
+    }
+
+    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
+
+    audioContext = new (window.AudioContext || window.webkitAudioContext)();
+    analyser = audioContext.createAnalyser();
+    analyser.fftSize = 256;
+    microphone = audioContext.createMediaStreamSource(stream);
+    microphone.connect(analyser);
+
+    recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
+    recorder.ondataavailable = (e) => {
+      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        websocket.send(e.data);
+      }
+    };
+    recorder.start(chunkDuration);
+
+    startTime = Date.now();
+    timerInterval = setInterval(updateTimer, 1000);
+    drawWaveform();
+
+    isRecording = true;
+    updateUI();
+  } catch (err) {
+    statusText.textContent = "Error accessing microphone. Please allow microphone access.";
+    console.error(err);
+  }
+}
+
+async function stopRecording() {
+  if (wakeLock) {
+    try {
+      await wakeLock.release();
+    } catch (e) {
+      // ignore
+    }
+    wakeLock = null;
+  }
+
+  userClosing = true;
+  waitingForStop = true;
+
+  if (websocket && websocket.readyState === WebSocket.OPEN) {
+    const emptyBlob = new Blob([], { type: "audio/webm" });
+    websocket.send(emptyBlob);
+    statusText.textContent = "Recording stopped. Processing final audio...";
+  }
+
+  if (recorder) {
+    recorder.stop();
+    recorder = null;
+  }
+
+  if (microphone) {
+    microphone.disconnect();
+    microphone = null;
+  }
+
+  if (analyser) {
+    analyser = null;
+  }
+
+  if (audioContext && audioContext.state !== "closed") {
+    try {
+      await audioContext.close();
+    } catch (e) {
+      console.warn("Could not close audio context:", e);
+    }
+    audioContext = null;
+  }
+
+  if (animationFrame) {
+    cancelAnimationFrame(animationFrame);
+    animationFrame = null;
+  }
+
+  if (timerInterval) {
+    clearInterval(timerInterval);
+    timerInterval = null;
+  }
+  timerElement.textContent = "00:00";
+  startTime = null;
+
+  isRecording = false;
+  updateUI();
+}
+
+async function toggleRecording() {
+  if (!isRecording) {
+    if (waitingForStop) {
+      console.log("Waiting for stop, early return");
+      return;
+    }
+    console.log("Connecting to WebSocket");
+    try {
+      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        await startRecording();
+      } else {
+        await setupWebSocket();
+        await startRecording();
+      }
+    } catch (err) {
+      statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
+      console.error(err);
+    }
+  } else {
+    console.log("Stopping recording");
+    stopRecording();
+  }
+}
+
+function updateUI() {
+  recordButton.classList.toggle("recording", isRecording);
+  recordButton.disabled = waitingForStop;
+
+  if (waitingForStop) {
+    if (statusText.textContent !== "Recording stopped. Processing final audio...") {
+      statusText.textContent = "Please wait for processing to complete...";
+    }
+  } else if (isRecording) {
+    statusText.textContent = "Recording...";
+  } else {
+    if (
+      statusText.textContent !== "Finished processing audio! Ready to record again." &&
+      statusText.textContent !== "Processing finalized or connection closed."
+    ) {
+      statusText.textContent = "Click to start transcription";
+    }
+  }
+  if (!waitingForStop) {
+    recordButton.disabled = false;
+  }
+}
+
+recordButton.addEventListener("click", toggleRecording);
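
Review note: the `onmessage` handler above distinguishes a `ready_to_stop` control message from a regular transcript update and fills in defaults for any missing fields. A minimal Python sketch of the same parsing, for a non-browser client, might look like the following. The function name `parse_server_message` and the `"update"` label are hypothetical and not part of this change; only the field names and default values are taken from the JavaScript above.

```python
import json
from typing import Any, Dict


def parse_server_message(raw: str) -> Dict[str, Any]:
    """Parse one JSON message, mirroring the defaults used by the web client."""
    data = json.loads(raw)

    # Control message: the server has finished processing the final audio.
    if data.get("type") == "ready_to_stop":
        return {"type": "ready_to_stop"}

    # Regular update: fall back to the same defaults as the JS destructuring.
    return {
        "type": "update",  # hypothetical label, used only in this sketch
        "lines": data.get("lines", []),
        "buffer_transcription": data.get("buffer_transcription", ""),
        "buffer_diarization": data.get("buffer_diarization", ""),
        "remaining_time_transcription": data.get("remaining_time_transcription", 0),
        "remaining_time_diarization": data.get("remaining_time_diarization", 0),
        "status": data.get("status", "active_transcription"),
    }


if __name__ == "__main__":
    print(parse_server_message('{"lines": [{"speaker": 1, "text": "hi"}]}'))
    print(parse_server_message('{"type": "ready_to_stop"}'))
```

As in the JavaScript, a default applies only when a key is absent; an explicit `null` in the payload is passed through unchanged.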
--- a/whisperlivekit/web/src/dark_mode.svg
+++ b/whisperlivekit/web/src/dark_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
--- a/whisperlivekit/web/src/light_mode.svg
+++ b/whisperlivekit/web/src/light_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>
--- a/whisperlivekit/web/src/system_mode.svg
+++ b/whisperlivekit/web/src/system_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M396-396q-32-32-58.5-67T289-537q-5 14-6.5 28.5T281-480q0 83 58 141t141 58q14 0 28.5-2t28.5-6q-39-22-74-48.5T396-396Zm85 196q-56 0-107-21t-91-61q-40-40-61-91t-21-107q0-51 17-97.5t50-84.5q13-14 32-9.5t27 24.5q21 55 52.5 104t73.5 91q42 42 91 73.5T648-326q20 8 24.5 27t-9.5 32q-38 33-84.5 50T481-200Zm223-192q-16-5-23-20.5t-4-32.5q9-48-6-94.5T621-621q-35-35-80.5-49.5T448-677q-17 3-32-4t-21-23q-6-16 1.5-31t23.5-19q69-15 138 4.5T679-678q51 51 71 120t5 138q-4 17-19 25t-32 3ZM480-840q-17 0-28.5-11.5T440-880v-40q0-17 11.5-28.5T480-960q17 0 28.5 11.5T520-920v40q0 17-11.5 28.5T480-840Zm0 840q-17 0-28.5-11.5T440-40v-40q0-17 11.5-28.5T480-120q17 0 28.5 11.5T520-80v40q0 17-11.5 28.5T480 0Zm255-734q-12-12-12-28.5t12-28.5l28-28q11-11 27.5-11t28.5 11q12 12 12 28.5T819-762l-28 28q-12 12-28 12t-28-12ZM141-141q-12-12-12-28.5t12-28.5l28-28q12-12 28-12t28 12q12 12 12 28.5T225-169l-28 28q-11 11-27.5 11T141-141Zm739-299q-17 0-28.5-11.5T840-480q0-17 11.5-28.5T880-520h40q17 0 28.5 11.5T960-480q0 17-11.5 28.5T920-440h-40Zm-840 0q-17 0-28.5-11.5T0-480q0-17 11.5-28.5T40-520h40q17 0 28.5 11.5T120-480q0 17-11.5 28.5T80-440H40Zm779 299q-12 12-28.5 12T762-141l-28-28q-12-12-12-28t12-28q12-12 28.5-12t28.5 12l28 28q11 11 11 27.5T819-141ZM226-735q-12 12-28.5 12T169-735l-28-28q-11-11-11-27.5t11-28.5q12-12 28.5-12t28.5 12l28 28q12 12 12 28t-12 28Zm170 339Z"/></svg>
--- a/whisperlivekit/web/web_interface.py
+++ b/whisperlivekit/web/web_interface.py
@@ -10,4 +10,24 @@ def get_web_interface_html():
            return f.read()
    except Exception as e:
        logger.error(f"Error loading web interface HTML: {e}")
-        return "<html><body><h1>Error loading interface</h1></body></html>"
+        return "<html><body><h1>Error loading interface</h1></body></html>"
+
+
+if __name__ == '__main__':
+    
+    from fastapi import FastAPI
+    from fastapi.responses import HTMLResponse
+    import uvicorn
+    from starlette.staticfiles import StaticFiles
+    import pathlib
+    import whisperlivekit.web as webpkg
+    
+    app = FastAPI()    
+    web_dir = pathlib.Path(webpkg.__file__).parent
+    app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")
+    
+    @app.get("/")
+    async def get():
+        return HTMLResponse(get_web_interface_html())
+
+    uvicorn.run(app=app)
--- a/whisperlivekit/whisper_streaming_custom/online_asr.py
+++ b/whisperlivekit/whisper_streaming_custom/online_asr.py
@@ -122,6 +122,7 @@ class OnlineASRProcessor:
        self.tokenize = tokenize_method
        self.logfile = logfile
        self.confidence_validation = confidence_validation
+        self.global_time_offset = 0.0
        self.init()

        self.buffer_trimming_way, self.buffer_trimming_sec = buffer_trimming
@@ -152,6 +153,21 @@ class OnlineASRProcessor:
        """Append an audio chunk (a numpy array) to the current audio buffer."""
        self.audio_buffer = np.append(self.audio_buffer, audio)

+    def insert_silence(self, silence_duration, offset):
+        """
+        Insert `silence_duration` seconds of zero-valued samples into the audio buffer so the timeline has no gap, then advance global_time_offset. The complete context clear for long silences (the else branch) is currently disabled.
+        """
+        # if self.transcript_buffer.buffer:
+        #     self.committed.extend(self.transcript_buffer.buffer)
+        #     self.transcript_buffer.buffer = []
+            
+        if True:  # was: silence_duration < 3. Always insert the silence so the last audio is processed without a gap; could later be handled in ends_with_silence.
+            gap_silence = np.zeros(int(16000 * silence_duration), dtype=np.int16)
+            self.insert_audio_chunk(gap_silence)
+        else:
+            self.init(offset=silence_duration + offset)
+        self.global_time_offset += silence_duration
+
    def prompt(self) -> Tuple[str, str]:
        """
        Returns a tuple: (prompt, context), where:
@@ -230,6 +246,9 @@ class OnlineASRProcessor:
        logger.debug(
            f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
        )
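+        if self.global_time_offset:
+            # with_offset returns a shifted token, so rebuild the list; rebinding the loop variable would discard the result
+            committed_tokens = [token.with_offset(self.global_time_offset) for token in committed_tokens]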
        return committed_tokens, current_audio_processed_upto

    def chunk_completed_sentence(self):
@@ -391,128 +410,3 @@ class OnlineASRProcessor:
            start = None
            end = None
        return Transcript(start, end, text, probability=probability)
-
-
-class VACOnlineASRProcessor:
-    """
-    Wraps an OnlineASRProcessor with a Voice Activity Controller (VAC).
-    
-    It receives small chunks of audio, applies VAD (e.g. with Silero),
-    and when the system detects a pause in speech (or end of an utterance)
-    it finalizes the utterance immediately.
-    """
-    SAMPLING_RATE = 16000
-
-    def __init__(self, online_chunk_size: float, *args, **kwargs):
-        self.online_chunk_size = online_chunk_size
-        self.online = OnlineASRProcessor(*args, **kwargs)
-        self.asr = self.online.asr
-        
-        # Load a VAD model (e.g. Silero VAD)
-        import torch
-        model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
-        from .silero_vad_iterator import FixedVADIterator
-
-        self.vac = FixedVADIterator(model)
-        self.logfile = self.online.logfile
-        self.last_input_audio_stream_end_time: float = 0.0
-        self.init()
-
-    def init(self):
-        self.online.init()
-        self.vac.reset_states()
-        self.current_online_chunk_buffer_size = 0
-        self.last_input_audio_stream_end_time = self.online.buffer_time_offset
-        self.is_currently_final = False
-        self.status: Optional[str] = None  # "voice" or "nonvoice"
-        self.audio_buffer = np.array([], dtype=np.float32)
-        self.buffer_offset = 0  # in frames
-
-    def get_audio_buffer_end_time(self) -> float:
-        """Returns the absolute end time of the audio processed by the underlying OnlineASRProcessor."""
-        return self.online.get_audio_buffer_end_time()
-
-    def clear_buffer(self):
-        self.buffer_offset += len(self.audio_buffer)
-        self.audio_buffer = np.array([], dtype=np.float32)
-
-    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: float):
-        """
-        Process an incoming small audio chunk:
-          - run VAD on the chunk,
-          - decide whether to send the audio to the online ASR processor immediately,
-          - and/or to mark the current utterance as finished.
-        """
-        self.last_input_audio_stream_end_time = audio_stream_end_time
-        res = self.vac(audio)
-        self.audio_buffer = np.append(self.audio_buffer, audio)
-
-        if res is not None:
-            # VAD returned a result; adjust the frame number
-            frame = list(res.values())[0] - self.buffer_offset
-            if "start" in res and "end" not in res:
-                self.status = "voice"
-                send_audio = self.audio_buffer[frame:]
-                self.online.init(offset=(frame + self.buffer_offset) / self.SAMPLING_RATE)
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.clear_buffer()
-            elif "end" in res and "start" not in res:
-                self.status = "nonvoice"
-                send_audio = self.audio_buffer[:frame]
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.is_currently_final = True
-                self.clear_buffer()
-            else:
-                beg = res["start"] - self.buffer_offset
-                end = res["end"] - self.buffer_offset
-                self.status = "nonvoice"
-                send_audio = self.audio_buffer[beg:end]
-                self.online.init(offset=(beg + self.buffer_offset) / self.SAMPLING_RATE)
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.is_currently_final = True
-                self.clear_buffer()
-        else:
-            if self.status == "voice":
-                self.online.insert_audio_chunk(self.audio_buffer)
-                self.current_online_chunk_buffer_size += len(self.audio_buffer)
-                self.clear_buffer()
-            else:
-                # Keep 1 second worth of audio in case VAD later detects voice,
-                # but trim to avoid unbounded memory usage.
-                self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE)
-                self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:]
-
-    def process_iter(self) -> Tuple[List[ASRToken], float]:
-        """
-        Depending on the VAD status and the amount of accumulated audio,
-        process the current audio chunk.
-        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
-        """
-        if self.is_currently_final:
-            return self.finish()
-        elif self.current_online_chunk_buffer_size > self.SAMPLING_RATE * self.online_chunk_size:
-            self.current_online_chunk_buffer_size = 0
-            return self.online.process_iter()
-        else:
-            logger.debug("No online update, only VAD")
-            return [], self.last_input_audio_stream_end_time
-
-    def finish(self) -> Tuple[List[ASRToken], float]:
-        """
-        Finish processing by flushing any remaining text.
-        Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
-        """
-        result_tokens, processed_upto = self.online.finish()
-        self.current_online_chunk_buffer_size = 0
-        self.is_currently_final = False
-        return result_tokens, processed_upto
-    
-    def get_buffer(self):
-        """
-        Get the unvalidated buffer in string format.
-        """
-        return self.online.concatenate_tokens(self.online.transcript_buffer.buffer)
-
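
Review note: a minimal sketch of shifting token timestamps by a global offset, assuming `with_offset` returns a shifted copy rather than mutating the token in place (the diff shown here does not confirm this). Under that assumption the shifted copies have to be collected into a new list, because reassigning the loop variable alone would leave the list unchanged. `Token` and `apply_offset` are hypothetical stand-ins for illustration, not the project's `ASRToken` API.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Token:
    """Hypothetical stand-in for ASRToken with only the fields needed here."""
    start: float
    end: float
    text: str

    def with_offset(self, offset: float) -> "Token":
        # Return a shifted copy; the original token is left untouched.
        return replace(self, start=self.start + offset, end=self.end + offset)


def apply_offset(tokens: List[Token], offset: float) -> List[Token]:
    """Return tokens shifted by `offset` seconds (same list if offset is zero)."""
    if not offset:
        return tokens
    return [token.with_offset(offset) for token in tokens]


if __name__ == "__main__":
    tokens = [Token(0.0, 0.5, "hello"), Token(0.5, 1.0, "world")]
    for token in apply_offset(tokens, 2.0):
        print(token)
```

If `with_offset` did mutate in place, a plain loop calling it without assignment would be enough; the list comprehension is the safer form when that is unknown.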
Author	SHA1	Message	Date
Quentin Fuxa	12973711f6	0.2.6	2025-08-21 14:34:46 +02:00
Quentin Fuxa	909ac9dd41	speaker -1 are no more sent in websocket - no buffer when their is a silence	2025-08-21 14:09:02 +02:00
Quentin Fuxa	d94a07d417	default model is now base. default backend simulstreaming	2025-08-21 11:55:36 +02:00
Quentin Fuxa	b32dd8bfc4	Align backend and frontend time handling	2025-08-21 10:33:15 +02:00
Quentin Fuxa	9feb0e597b	remove VACOnlineASRProcessor backend possibility	2025-08-20 20:57:43 +02:00
Quentin Fuxa	9dab84a573	update front	2025-08-20 20:15:38 +02:00
Quentin Fuxa	d089c7fce0	.html to .html + .css + .js	2025-08-20 20:00:31 +02:00
Quentin Fuxa	253a080df5	diart diarization handles pauses/silences thanks to offset	2025-08-19 21:12:55 +02:00
Quentin Fuxa	0c6e4b2aee	sortformer diar implementation v0.1	2025-08-19 19:48:51 +02:00
Quentin Fuxa	e14bbde77d	sortformer diar implementation v0	2025-08-19 17:02:55 +02:00
Quentin Fuxa	7496163467	rename diart backend	2025-08-19 15:02:27 +02:00
Quentin Fuxa	696a94d1ce	1rst sortformer backend implementation	2025-08-19 15:02:17 +02:00
Quentin Fuxa	2699b0974c	Fix simulstreaming imports	2025-08-19 14:43:54 +02:00
Quentin Fuxa	90c0250ba4	update optional dependencies	2025-08-19 09:36:59 +02:00
Quentin Fuxa	eb96153ffd	new vac parameters	2025-08-17 22:26:28 +02:00
Quentin Fuxa	47e3eb9b5b	Update README.md	2025-08-17 09:55:03 +02:00
Quentin Fuxa	b8b07adeef	--vac to --no-vac	2025-08-17 09:44:26 +02:00
Quentin Fuxa	d0e9e37ef6	simulstreaming: cumulative_time_offset to keep timestamps correct when audio > 30s	2025-08-17 09:33:47 +02:00
Quentin Fuxa	820f92d8cb	audio_max_len to 30 -> 20, ffmpeg timeout 5 -> 20	2025-08-17 09:32:08 +02:00
Quentin Fuxa	e42523af84	VAC activated by default	2025-08-17 01:29:34 +02:00
Quentin Fuxa	e2184d5e06	better handle silences when VAC + correct offset issue with whisperstreaming backend	2025-08-17 01:27:07 +02:00
Quentin Fuxa	7fe0353260	vac model is loaded in TranscriptionEngine, and by default	2025-08-17 00:34:25 +02:00
Quentin Fuxa	0f2eba507e	use with_offset to add no audio offset to tokens	2025-08-17 00:33:24 +02:00
Quentin Fuxa	55e08474f3	recycle backend in simulstreaming thanks to new remove hooks function	2025-08-16 23:06:16 +02:00
Quentin Fuxa	28bdc52e1d	VAC before doing transcription and diarization. V0	2025-08-16 23:04:21 +02:00
Quentin Fuxa	e4221fa6c3	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-08-15 23:04:05 +02:00
Quentin Fuxa	1652db9a2d	Use distinct backend models for simulstreaming and add --preloaded_model_count to preload them	2025-08-15 23:03:55 +02:00
Quentin Fuxa	601f17653a	Update CONTRIBUTING.md	2025-08-13 21:59:32 +02:00
Quentin Fuxa	7718190fcd	Update CONTRIBUTING.md	2025-08-13 21:59:00 +02:00