0.2.10

--pcm-input option information
ffmpeg install instruction error indicates --pcm-input alternative
2026-03-07 22:33:36 +00:00 · 2025-09-16 23:45:00 +02:00 · 2025-09-17 16:06:28 +02:00 · 2025-09-17 16:04:17 +02:00 · 2025-09-17 14:55:57 +02:00 · 2025-09-17 11:28:36 +02:00
66 changed files with 5809 additions and 1648 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -137,4 +137,5 @@ run_*.sh
 test_*.py
 launch.json
 .DS_Store
-test/*
+test/*
+nllb-200-distilled-600M-ctranslate2/*
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -15,7 +15,7 @@ Thank you for considering contributing ! We appreciate your time and effort to h

 ## Opening Issues

-If you encounter a problem with diart or want to suggest an improvement, please follow these guidelines when opening an issue:
+If you encounter a problem with WhisperLiveKit or want to suggest an improvement, please follow these guidelines when opening an issue:

 - **Bug Reports:**
  - Clearly describe the error. **Please indicate the parameters you use, especially the model(s)**
@@ -43,4 +43,4 @@ We welcome and appreciate contributions! To ensure a smooth review process, plea

 ## Thank You

-Your contributions make diart better for everyone. Thank you for your time and dedication!
+Your contributions make WhisperLiveKit better for everyone. Thank you for your time and dedication!
--- a/DEV_NOTES.md
+++ b/DEV_NOTES.md
@@ -0,0 +1,91 @@
+# 1. Simulstreaming: Decouple the encoder for faster inference
+
+Simulstreaming encoder time (whisperlivekit/simul_whisper/simul_whisper.py l. 397) experimentations :
+
+On macOS Apple Silicon M4 :
+
+| Encoder | base.en | small |
+|--------|---------|-------|
+| WHISPER (no modification) | 0.35s | 1.09s |
+| FASTER_WHISPER | 0.4s | 1.20s |
+| MLX_WHISPER | 0.07s | 0.20s |
+
+Memory saved by only loading encoder for optimized framework:
+
+For tiny.en, mlx whisper:
+Sizes MLX whisper:
+Decoder weights: 59110771 bytes
+Encoder weights: 15268874 bytes
+
+
+# 2. Translation: Faster model for each system
+
+## Benchmark Results
+
+Testing on MacBook M3 with NLLB-200-distilled-600M model:
+
+### Standard Transformers vs CTranslate2
+
+| Test Text | Standard Inference Time | CTranslate2 Inference Time | Speedup |
+|-----------|-------------------------|---------------------------|---------|
+| UN Chief says there is no military solution in Syria | 0.9395s | 2.0472s | 0.5x |
+| The rapid advancement of AI technology is transforming various industries | 0.7171s | 1.7516s | 0.4x |
+| Climate change poses a significant threat to global ecosystems | 0.8533s | 1.8323s | 0.5x |
+| International cooperation is essential for addressing global challenges | 0.7209s | 1.3575s | 0.5x |
+| The development of renewable energy sources is crucial for a sustainable future | 0.8760s | 1.5589s | 0.6x |
+
+**Results:**
+- Total Standard time: 4.1068s
+- Total CTranslate2 time: 8.5476s
+- CTranslate2 is slower on this system --> Use Transformers, and ideally we would have an mlx implementation.
+
+
+# 3. SortFormer Diarization: 4-to-2 Speaker Constraint Algorithm
+
+Transform a diarization model that predicts up to 4 speakers into one that predicts up to 2 speakers by mapping the output predictions.
+
+## Problem Statement
+- Input: `self.total_preds` with shape `(x, x, 4)` - predictions for 4 speakers
+- Output: Constrained predictions with shape `(x, x, 2)` - predictions for 2 speakers
+
+#
+### Initial Setup
+For each time step `i`, we have a ranking of 4 speaker predictions (1-4). When only 2 speakers are present, the model will have close predictions for the 2 active speaker positions.
+
+Instead of `np.argmax(preds_np, axis=1)`, we take the top 2 predictions and build a dynamic 4→2 mapping that can evolve over time.
+
+### Algorithm
+
+```python
+top_2_speakers = np.argsort(preds_np, axis=1)[:, -2:]
+```
+
+- `DS_a_{i}`: Top detected speaker for prediction i
+- `DS_b_{i}`: Second detected speaker for prediction i  
+- `AS_{i}`: Attributed speaker for prediction i
+- `GTS_A`: Ground truth speaker A
+- `GTS_B`: Ground truth speaker B
+- `DIST(a, b)`: Distance between detected speakers a and b
+
+3. **Attribution Logic**
+
+```
+AS_0 ← A
+
+AS_1 ← B
+
+IF DIST(DS_a_0, DS_a_1) < DIST(DS_a_0, DS_a_2) AND 
+    DIST(DS_a_0, DS_a_1) < DIST(DS_a_1, DS_a_2):
+    # Likely that DS_a_0 = DS_a_1 (same speaker)
+    AS_1 ← A
+    AS_2 ← B
+
+ELIF DIST(DS_a_0, DS_a_2) < DIST(DS_a_0, DS_a_1) AND 
+    DIST(DS_a_0, DS_a_2) < DIST(DS_a_1, DS_a_2):
+    AS_2 ← A
+
+ELSE:
+    AS_2 ← B
+
+to finish
+```
--- a/47
+++ b/47
@@ -1,4 +1,4 @@
-FROM nvidia/cuda:12.8.1-cudnn-runtime-ubuntu22.04
+FROM nvidia/cuda:12.9.1-cudnn-devel-ubuntu24.04

 ENV DEBIAN_FRONTEND=noninteractive
 ENV PYTHONUNBUFFERED=1
@@ -9,48 +9,50 @@ ARG EXTRAS
 ARG HF_PRECACHE_DIR
 ARG HF_TKN_FILE

-# Install system dependencies
-#RUN apt-get update && \
-#    apt-get install -y ffmpeg git && \
-#    apt-get clean && \
-#    rm -rf /var/lib/apt/lists/*
-
-# 2) Install system dependencies + Python + pip
 RUN apt-get update && \
    apt-get install -y --no-install-recommends \
        python3 \
        python3-pip \
+        python3-venv \
        ffmpeg \
        git \
        build-essential \
-        python3-dev && \
+        python3-dev \
+        ca-certificates && \
    rm -rf /var/lib/apt/lists/*

-RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
+RUN python3 -m venv /opt/venv
+ENV PATH="/opt/venv/bin:$PATH"
+
+# timeout/retries for large torch wheels
+RUN pip3 install --upgrade pip setuptools wheel && \
+    pip3 --disable-pip-version-check install --timeout=120 --retries=5 \
+        --index-url https://download.pytorch.org/whl/cu129 \
+        torch torchaudio \
+    || (echo "Initial install failed — retrying with extended timeout..." && \
+        pip3 --disable-pip-version-check install --timeout=300 --retries=3 \
+            --index-url https://download.pytorch.org/whl/cu129 \
+            torch torchvision torchaudio)

 COPY . .

 # Install WhisperLiveKit directly, allowing for optional dependencies
-#   Note: For gates models, need to add your HF toke. See README.md
-#         for more details.
 RUN if [ -n "$EXTRAS" ]; then \
      echo "Installing with extras: [$EXTRAS]"; \
-      pip install --no-cache-dir .[$EXTRAS]; \
+      pip install --no-cache-dir whisperlivekit[$EXTRAS]; \
    else \
      echo "Installing base package only"; \
-      pip install --no-cache-dir .; \
+      pip install --no-cache-dir whisperlivekit; \
    fi

-# Enable in-container caching for Hugging Face models by: 
-# Note: If running multiple containers, better to map a shared
-# bucket. 
-#
+# In-container caching for Hugging Face models by: 
 # A) Make the cache directory persistent via an anonymous volume.
 #    Note: This only persists for a single, named container. This is 
 #          only for convenience at de/test stage. 
 #          For prod, it is better to use a named volume via host mount/k8s.
 VOLUME ["/root/.cache/huggingface/hub"]

+
 # or
 # B) Conditionally copy a local pre-cache from the build context to the 
 #    container's cache via the HF_PRECACHE_DIR build-arg.
@@ -65,8 +67,7 @@ RUN if [ -n "$HF_PRECACHE_DIR" ]; then \
      echo "No local Hugging Face cache specified, skipping copy"; \
    fi

-# Conditionally copy a Hugging Face token if provided
-
+# Conditionally copy a Hugging Face token if provided. Useful for Diart backend (pyannote audio models)
 RUN if [ -n "$HF_TKN_FILE" ]; then \
      echo "Copying Hugging Face token from $HF_TKN_FILE"; \
      mkdir -p /root/.cache/huggingface && \
@@ -74,11 +75,9 @@ RUN if [ -n "$HF_TKN_FILE" ]; then \
    else \
      echo "No Hugging Face token file specified, skipping token setup"; \
    fi
-    
-# Expose port for the transcription server
+
 EXPOSE 8000

 ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]

-# Default args
-CMD ["--model", "tiny.en"]
+CMD ["--model", "medium"]
--- a/Dockerfile.cpu
+++ b/Dockerfile.cpu
@@ -0,0 +1,61 @@
+FROM python:3.13-slim
+
+ENV DEBIAN_FRONTEND=noninteractive
+ENV PYTHONUNBUFFERED=1
+
+WORKDIR /app
+
+ARG EXTRAS
+ARG HF_PRECACHE_DIR
+ARG HF_TKN_FILE
+
+RUN apt-get update && \
+    apt-get install -y --no-install-recommends \
+        ffmpeg \
+        git \
+        build-essential \
+        python3-dev && \
+    rm -rf /var/lib/apt/lists/*
+
+# Install CPU-only PyTorch
+RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
+
+COPY . .
+
+# Install WhisperLiveKit directly, allowing for optional dependencies
+RUN if [ -n "$EXTRAS" ]; then \
+      echo "Installing with extras: [$EXTRAS]"; \
+      pip install --no-cache-dir whisperlivekit[$EXTRAS]; \
+    else \
+      echo "Installing base package only"; \
+      pip install --no-cache-dir whisperlivekit; \
+    fi
+
+# Enable in-container caching for Hugging Face models
+VOLUME ["/root/.cache/huggingface/hub"]
+
+# Conditionally copy a local pre-cache from the build context
+RUN if [ -n "$HF_PRECACHE_DIR" ]; then \
+      echo "Copying Hugging Face cache from $HF_PRECACHE_DIR"; \
+      mkdir -p /root/.cache/huggingface/hub && \
+      cp -r $HF_PRECACHE_DIR/* /root/.cache/huggingface/hub; \
+    else \
+      echo "No local Hugging Face cache specified, skipping copy"; \
+    fi
+
+# Conditionally copy a Hugging Face token if provided
+RUN if [ -n "$HF_TKN_FILE" ]; then \
+      echo "Copying Hugging Face token from $HF_TKN_FILE"; \
+      mkdir -p /root/.cache/huggingface && \
+      cp $HF_TKN_FILE /root/.cache/huggingface/token; \
+    else \
+      echo "No Hugging Face token file specified, skipping token setup"; \
+    fi
+    
+# Expose port for the transcription server
+EXPOSE 8000
+
+ENTRYPOINT ["whisperlivekit-server", "--host", "0.0.0.0"]
+
+# Default args - you might want to use a smaller model for CPU
+CMD ["--model", "tiny"]
--- a/README.md
+++ b/README.md
@@ -4,131 +4,88 @@
 <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
 </p>

-<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Diarization</b></p>
+<p align="center"><b>Real-time, Fully Local Speech-to-Text with Speaker Identification</b></p>

 <p align="center">
 <a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
-<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=downloads"></a>
-<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.13-dark_green"></a>
+<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=installations"></a>
+<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.15-dark_green"></a>
 <a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-MIT/Dual Licensed-dark_green"></a>
 </p>


-WhisperLiveKit brings real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨
+Real-time speech transcription directly to your browser, with a ready-to-use backend+server and a simple frontend. ✨

-Built on [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) and [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) for transcription, plus [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) and [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) for diarization.
+#### Powered by Leading Research:
+
+- [SimulStreaming](https://github.com/ufalSimulStreaming) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
+- [NLLB](https://arxiv.org/abs/2207.04672), ([distilled](https://huggingface.co/entai2965/nllb-200-distilled-600M-ctranslate2)) (2024) - Translation to more than 100 languages.
+- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription using [LocalAgreement policy](https://www.isca-archive.org/interspeech_2020/liu20s_interspeech.pdf)
+- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
+- [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - Real-time speaker diarization
+- [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - Enterprise-grade Voice Activity Detection


-### Key Features
+> **Why not just run a simple Whisper model on every audio batch?** Whisper is designed for complete utterances, not real-time chunks. Processing small segments loses context, cuts off words mid-syllable, and produces poor transcription. WhisperLiveKit uses state-of-the-art simultaneous speech research for intelligent buffering and incremental processing.

- **Real-time Transcription** - Locally (or on-prem) convert speech to text instantly as you speak
- **Speaker Diarization** - Identify different speakers in real-time. (⚠️ backend Streaming Sortformer in developement)
- **Multi-User Support** - Handle multiple users simultaneously with a single backend/server
- **Automatic Silence Chunking** – Automatically chunks when no audio is detected to limit buffer size
- **Confidence Validation** – Immediately validate high-confidence tokens for faster inference (WhisperStreaming only)
- **Buffering Preview** – Displays unvalidated transcription segments (not compatible with SimulStreaming yet)
- **Punctuation-Based Speaker Splitting [BETA]** - Align speaker changes with natural sentence boundaries for more readable transcripts
- **SimulStreaming Backend** - [Dual-licensed](https://github.com/ufal/SimulStreaming#-licence-and-contributions) - Ultra-low latency transcription using SOTA AlignAtt policy. 

 ### Architecture

-<img alt="Architecture" src="architecture.png" />
+<img alt="Architecture" src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png" />

+*The backend supports multiple concurrent users. Voice Activity Detection reduces overhead when no voice is detected.*

-## Quick Start
+### Installation & Quick Start

 ```bash
-# Install the package
 pip install whisperlivekit
-
-# Start the transcription server
-whisperlivekit-server --model tiny.en
-
-# Open your browser at http://localhost:8000 to see the interface.
-# Use  -ssl-certfile public.crt --ssl-keyfile private.key parameters to use SSL
 ```
+> You can also clone the repo and `pip install -e .` for the latest version.

-That's it! Start speaking and watch your words appear on screen.
+#### Quick Start
+1. **Start the transcription server:**
+   ```bash
+   whisperlivekit-server --model base --language en
+   ```

-## Installation
+2. **Open your browser** and navigate to `http://localhost:8000`. Start speaking and watch your words appear in real-time!
+
+
+> - See [tokenizer.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py) for the list of all available languages.
+> - For HTTPS requirements, see the **Parameters** section for SSL configuration options.
+
+ 
+
+#### Optional Dependencies
+
+| Optional | `pip install` |
+|-----------|-------------|
+| **Speaker diarization with Sortformer** | `git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]` |
+| **Apple Silicon optimized backend** | `mlx-whisper` |
+| **NLLB Translation** | `huggingface_hub` & `transformers` |
+| *[Not recommanded]*  Speaker diarization with Diart | `diart` |
+| *[Not recommanded]*  Original Whisper backend | `whisper` |
+| *[Not recommanded]*  Improved timestamps backend | `whisper-timestamped` |
+| OpenAI API backend | `openai` |
+
+See  **Parameters & Configuration** below on how to use them.
+
+
+
+### Usage Examples
+
+**Command-line Interface**: Start the transcription server with various options:

 ```bash
-#Install from PyPI (Recommended)
-pip install whisperlivekit
+# Large model and translate from french to danish
+whisperlivekit-server --model large-v3 --language fr --target-language da

-#Install from Source
-git clone https://github.com/QuentinFuxa/WhisperLiveKit
-cd WhisperLiveKit
-pip install -e .
-```
-
-### FFmpeg Dependency
-
-```bash
-# Ubuntu/Debian
-sudo apt install ffmpeg
-
-# macOS
-brew install ffmpeg
-
-# Windows
-# Download from https://ffmpeg.org/download.html and add to PATH
-```
-
-### Optional Dependencies
-
-```bash
-# Voice Activity Controller (prevents hallucinations)
-pip install torch
-
-# Sentence-based buffer trimming
-pip install mosestokenizer wtpsplit
-pip install tokenize_uk  # If you work with Ukrainian text
-
-# Speaker diarization
-pip install diart
-
-# Alternative Whisper backends (default is faster-whisper)
-pip install whisperlivekit[whisper]              # Original Whisper
-pip install whisperlivekit[whisper-timestamped]  # Improved timestamps
-pip install whisperlivekit[mlx-whisper]          # Apple Silicon optimization
-pip install whisperlivekit[openai]               # OpenAI API
-pip install whisperlivekit[simulstreaming]
-```
-
-### 🎹 Pyannote Models Setup
-
-For diarization, you need access to pyannote.audio models:
-
-1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
-2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
-3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
-4. Login with HuggingFace:
-```bash
-pip install huggingface_hub
-huggingface-cli login
-```
-
-## 💻 Usage Examples
-
-### Command-line Interface
-
-Start the transcription server with various options:
-
-```bash
-# Basic server with English model
-whisperlivekit-server --model tiny.en
-
-# Advanced configuration with diarization
-whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language auto
-
-# SimulStreaming backend for ultra-low latency
-whisperlivekit-server --backend simulstreaming --model large-v3 --frame-threshold 20
+# Diarization and server listening on */80 
+whisperlivekit-server --host 0.0.0.0 --port 80 --model medium --diarization --language fr
 ```


-### Python API Integration (Backend)
-Check [basic_server.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a complete example.
+**Python API Integration**: Check [basic_server](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) for a more complete example of how to use the functions and classes.

 ```python
 from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
@@ -143,14 +100,10 @@ transcription_engine = None
 async def lifespan(app: FastAPI):
    global transcription_engine
    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
-    # You can also load from command-line arguments using parse_args()
-    # args = parse_args()
-    # transcription_engine = TranscriptionEngine(**vars(args))
    yield

 app = FastAPI(lifespan=lifespan)

-# Process WebSocket connections
 async def handle_websocket_results(websocket: WebSocket, results_generator):
    async for response in results_generator:
        await websocket.send_json(response)
@@ -170,44 +123,44 @@ async def websocket_endpoint(websocket: WebSocket):
        await audio_processor.process_audio(message)        
 ```

-### Frontend Implementation
+**Frontend Implementation**: The package includes an HTML/JavaScript implementation [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html). You can also import it using `from whisperlivekit import get_inline_ui_html` & `page = get_inline_ui_html()`

-The package includes a simple HTML/JavaScript implementation that you can adapt for your project. You can find it [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html), or load its content using `get_web_interface_html()` :

-```python
-from whisperlivekit import get_web_interface_html
-html_content = get_web_interface_html()
-```
+## Parameters & Configuration

-## ⚙️ Configuration Reference
-
-WhisperLiveKit offers extensive configuration options:

 | Parameter | Description | Default |
 |-----------|-------------|---------|
+| `--model` | Whisper model size. List and recommandations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md) | `small` |
+| `--language` | List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
+| `--target-language` | If sets, activates translation using NLLB. Ex: `fr`. [118 languages available](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/translation/mapping_languages.py). If you want to translate to english, you should rather use `--task translate`, since Whisper can do it directly. | `None` |
+| `--task` | Set to `translate` to translate *only* to english, using Whisper translation. | `transcribe` |
+| `--diarization` | Enable speaker identification | `False` |
+| `--backend` | Processing backend. You can switch to `faster-whisper` if  `simulstreaming` does not work correctly | `simulstreaming` |
+| `--no-vac` | Disable Voice Activity Controller | `False` |
+| `--no-vad` | Disable Voice Activity Detection | `False` |
+| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
 | `--host` | Server host address | `localhost` |
 | `--port` | Server port | `8000` |
-| `--model` | Whisper model size. Caution : '.en' models do not work with Simulstreaming | `tiny` |
-| `--language` | Source language code or `auto` | `en` |
-| `--task` | `transcribe` or `translate` | `transcribe` |
-| `--backend` | Processing backend | `faster-whisper` |
-| `--diarization` | Enable speaker identification | `False` |
-| `--punctuation-split` | Use punctuation to improve speaker boundaries | `True` |
-| `--confidence-validation` | Use confidence scores for faster validation | `False` |
-| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
-| `--vac` | Use Voice Activity Controller | `False` |
-| `--no-vad` | Disable Voice Activity Detection | `False` |
-| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
-| `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
 | `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
 | `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
-| `--segmentation-model` | Hugging Face model ID for pyannote.audio segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
-| `--embedding-model` | Hugging Face model ID for pyannote.audio embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
+| `--pcm-input` | raw PCM (s16le) data is expected as input and FFmpeg will be bypassed. Frontend will use AudioWorklet instead of MediaRecorder | `False` |

-**SimulStreaming-specific Options:**
-
-| Parameter | Description | Default |
+| Translation options | Description | Default |
 |-----------|-------------|---------|
+| `--nllb-backend` | `transformers` or `ctranslate2` | `ctranslate2` |
+| `--nllb-size` | `600M` or `1.3B` | `600M` |
+
+| Diarization options | Description | Default |
+|-----------|-------------|---------|
+| `--diarization-backend` |  `diart` or `sortformer` | `sortformer` |
+| `--disable-punctuation-split` |  Disable punctuation based splits. See #214 | `False` |
+| `--segmentation-model` | Hugging Face model ID for Diart segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
+| `--embedding-model` | Hugging Face model ID for Diart embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
+
+| SimulStreaming backend options | Description | Default |
+|-----------|-------------|---------|
+| `--disable-fast-encoder` | Disable Faster Whisper or MLX Whisper backends for the encoder (if installed). Inference can be slower but helpful when GPU memory is limited | `False` |
 | `--frame-threshold` | AlignAtt frame threshold (lower = faster, higher = more accurate) | `25` |
 | `--beams` | Number of beams for beam search (1 = greedy decoding) | `1` |
 | `--decoder` | Force decoder type (`beam` or `greedy`) | `auto` |
@@ -219,69 +172,82 @@ WhisperLiveKit offers extensive configuration options:
 | `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
 | `--max-context-tokens` | Maximum context tokens | `None` |
 | `--model-path` | Direct path to .pt model file. Download it if not found | `./base.pt` |
+| `--preload-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |

-## 🔧 How It Works

-1. **Audio Capture**: Browser's MediaRecorder API captures audio in webm/opus format
-2. **Streaming**: Audio chunks are sent to the server via WebSocket
-3. **Processing**: Server decodes audio with FFmpeg and streams into the model for transcription
-4. **Real-time Output**: Partial transcriptions appear immediately in light gray (the 'aperçu') and finalized text appears in normal color

-## 🚀 Deployment Guide
+| WhisperStreaming backend options | Description | Default |
+|-----------|-------------|---------|
+| `--confidence-validation` | Use confidence scores for faster validation | `False` |
+| `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
+
+
+
+
+> For diarization using Diart, you need to accept user conditions [here](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model, [here](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model and [here](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model. **Then**, login to HuggingFace: `huggingface-cli login`
+
+### 🚀 Deployment Guide

 To deploy WhisperLiveKit in production:
-
-1. **Server Setup** (Backend):
+ 
+1. **Server Setup**: Install production ASGI server & launch with multiple workers
   ```bash
-   # Install production ASGI server
   pip install uvicorn gunicorn
-
-   # Launch with multiple workers
   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
   ```

-2. **Frontend Integration**:
-   - Host your customized version of the example HTML/JS in your web application
-   - Ensure WebSocket connection points to your server's address
+2. **Frontend**: Host your customized version of the `html` example & ensure WebSocket connection points correctly

 3. **Nginx Configuration** (recommended for production):
    ```nginx    
   server {
       listen 80;
       server_name your-domain.com;
-
-    location / {
-        proxy_pass http://localhost:8000;
-        proxy_set_header Upgrade $http_upgrade;
-        proxy_set_header Connection "upgrade";
-        proxy_set_header Host $host;
+        location / {
+            proxy_pass http://localhost:8000;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+            proxy_set_header Host $host;
    }}
    ```

 4. **HTTPS Support**: For secure deployments, use "wss://" instead of "ws://" in WebSocket URL

-### 🐋 Docker
+## 🐋 Docker

-A basic Dockerfile is provided which allows re-use of Python package installation options. ⚠️ For **large** models, ensure that your **docker runtime** has enough **memory** available. See below usage examples:
+Deploy the application easily using Docker with GPU or CPU support.

+### Prerequisites
+- Docker installed on your system
+- For GPU support: NVIDIA Docker runtime installed

-#### All defaults
- Create a reusable image with only the basics and then run as a named container:
-    ```bash
-    docker build -t whisperlivekit-defaults .
-    docker create --gpus all --name whisperlivekit -p 8000:8000 whisperlivekit-defaults
-    docker start -i whisperlivekit
-    ```
+### Quick Start
+
+**With GPU acceleration (recommended):**
+```bash
+docker build -t wlk .
+docker run --gpus all -p 8000:8000 --name wlk wlk
+```
+
+**CPU only:**
+```bash
+docker build -f Dockerfile.cpu -t wlk .
+docker run -p 8000:8000 --name wlk wlk
+```
+
+### Advanced Usage
+
+**Custom configuration:**
+```bash
+# Example with custom model and language
+docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
+```
+
+### Memory Requirements
+- **Large models**: Ensure your Docker runtime has sufficient memory allocated

-    > **Note**: If you're running on a system without NVIDIA GPU support (such as Mac with Apple Silicon or any system without CUDA capabilities), you need to **remove the `--gpus all` flag** from the `docker create` command. Without GPU acceleration, transcription will use CPU only, which may be significantly slower. Consider using small models for better performance on CPU-only systems.

 #### Customization
- Customize the container options:
-    ```bash
-    docker build -t whisperlivekit-defaults .
-    docker create --gpus all --name whisperlivekit-base -p 8000:8000 whisperlivekit-defaults --model base
-    docker start -i whisperlivekit-base
-    ```

 - `--build-arg` Options:
  - `EXTRAS="whisper-timestamped"` - Add extras to the image's installation (no spaces). Remember to set necessary container options!
@@ -290,10 +256,3 @@ A basic Dockerfile is provided which allows re-use of Python package installatio

 ## 🔮 Use Cases
 Capture discussions in real-time for meeting transcription, help hearing-impaired users follow conversations through accessibility tools, transcribe podcasts or videos automatically for content creation, transcribe support calls with speaker identification for customer service...
-
-## 🙏 Acknowledgments
-
-We extend our gratitude to the original authors of:
-
-| [Whisper Streaming](https://github.com/ufal/whisper_streaming)  | [SimulStreaming](https://github.com/ufal/SimulStreaming) | [Diart](https://github.com/juanmc2005/diart) | [OpenAI Whisper](https://github.com/openai/whisper) |
-| -------- | ------- | -------- | ------- |
--- a/ReadmeJP.md
+++ b/ReadmeJP.md
@@ -0,0 +1,258 @@
+<h1 align="center">WhisperLiveKit</h1>
+
+<p align="center">
+<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/demo.png" alt="WhisperLiveKit Demo" width="730">
+</p>
+
+<p align="center"><b>話者識別機能付き、リアルタイム、完全ローカルな音声テキスト変換</b></p>
+
+<p align="center">
+<a href="https://pypi.org/project/whisperlivekit/"><img alt="PyPI Version" src="https://img.shields.io/pypi/v/whisperlivekit?color=g"></a>
+<a href="https://pepy.tech/project/whisperlivekit"><img alt="PyPI Downloads" src="https://static.pepy.tech/personalized-badge/whisperlivekit?period=total&units=international_system&left_color=grey&right_color=brightgreen&left_text=installations"></a>
+<a href="https://pypi.org/project/whisperlivekit/"><img alt="Python Versions" src="https://img.shields.io/badge/python-3.9--3.13-dark_green"></a>
+<a href="https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/LICENSE"><img alt="License" src="https://img.shields.io/badge/License-MIT/Dual Licensed-dark_green"></a>
+</p>
+
+すぐに使えるバックエンド+サーバーとシンプルなフロントエンドで、リアルタイムの音声文字起こしをブラウザに直接提供します。✨
+
+#### 主要な研究による技術：
+
+- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - AlignAttポリシーによる超低遅延文字起こし
+- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - LocalAgreementポリシーによる低遅延文字起こし
+- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - 高度なリアルタイム話者ダイアライゼーション
+- [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - リアルタイム話者ダイアライゼーション
+- [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - エンタープライズグレードの音声区間検出
+
+> **なぜ各音声バッチで単純なWhisperモデルを実行しないのか？** Whisperは完全な発話向けに設計されており、リアルタイムのチャンク向けではありません。小さなセグメントを処理するとコンテキストが失われ、単語が音節の途中で途切れ、質の悪い文字起こしになります。WhisperLiveKitは、インテリジェントなバッファリングとインクリメンタルな処理のために、最先端の同時音声研究を利用しています。
+
+### アーキテクチャ
+
+<img alt="Architecture" src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/architecture.png" />
+
+*バックエンドは複数の同時ユーザーをサポートします。音声が検出されない場合、音声区間検出がオーバーヘッドを削減します。*
+
+### インストールとクイックスタート
+
+```bash
+pip install whisperlivekit
+```
+
+>  **FFmpegが必要です** WhisperLiveKitを使用する前にインストールする必要があります。
+>
+> | OS | インストール方法 |
+> |-----------|-------------|
+>  | Ubuntu/Debian | `sudo apt install ffmpeg` |
+> | MacOS | `brew install ffmpeg` |
+> | Windows | https://ffmpeg.org/download.html から.exeをダウンロードし、PATHに追加 |
+
+#### クイックスタート
+1. **文字起こしサーバーを起動します:**
+   ```bash
+   whisperlivekit-server --model base --language en
+   ```
+
+2. **ブラウザを開き** `http://localhost:8000` にアクセスします。話し始めると、あなたの言葉がリアルタイムで表示されます！
+
+
+> - 利用可能なすべての言語のリストについては、[tokenizer.py](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py) を参照してください。
+> - HTTPSの要件については、**パラメータ**セクションのSSL設定オプションを参照してください。
+
+#### オプションの依存関係
+
+| オプション | `pip install` |
+|-----------|-------------|
+| **Sortformerによる話者ダイアライゼーション** | `git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]` |
+| Diartによる話者ダイアライゼーション | `diart` |
+| オリジナルのWhisperバックエンド | `whisper` |
+| タイムスタンプ改善バックエンド | `whisper-timestamped` |
+| Apple Silicon最適化バックエンド | `mlx-whisper` |
+| OpenAI APIバックエンド | `openai` |
+
+それらの使用方法については、以下の**パラメータと設定**を参照してください。
+
+### 使用例
+
+**コマンドラインインターフェース**: 様々なオプションで文字起こしサーバーを起動します:
+
+```bash
+# デフォルト(small)より良いモデルを使用
+whisperlivekit-server --model large-v3
+
+# ダイアライゼーションと言語を指定した高度な設定
+whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
+```
+
+**Python API連携**: 関数やクラスの使用方法のより完全な例については、[basic_server](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/basic_server.py) を確認してください。
+
+```python
+from whisperlivekit import TranscriptionEngine, AudioProcessor, parse_args
+from fastapi import FastAPI, WebSocket, WebSocketDisconnect
+from fastapi.responses import HTMLResponse
+from contextlib import asynccontextmanager
+import asyncio
+
+transcription_engine = None
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    global transcription_engine
+    transcription_engine = TranscriptionEngine(model="medium", diarization=True, lan="en")
+    yield
+
+app = FastAPI(lifespan=lifespan)
+
+async def handle_websocket_results(websocket: WebSocket, results_generator):
+    async for response in results_generator:
+        await websocket.send_json(response)
+    await websocket.send_json({"type": "ready_to_stop"})
+
+@app.websocket("/asr")
+async def websocket_endpoint(websocket: WebSocket):
+    global transcription_engine
+
+    # 接続ごとに新しいAudioProcessorを作成し、共有エンジンを渡す
+    audio_processor = AudioProcessor(transcription_engine=transcription_engine)
+    results_generator = await audio_processor.create_tasks()
+    results_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
+    await websocket.accept()
+    while True:
+        message = await websocket.receive_bytes()
+        await audio_processor.process_audio(message)
+```
+
+**フロントエンド実装**: パッケージにはHTML/JavaScript実装が[ここ](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/web/live_transcription.html)に含まれています。`from whisperlivekit import get_web_interface_html` & `page = get_web_interface_html()` を使ってインポートすることもできます。
+
+
+## パラメータと設定
+
+重要なパラメータのリストを変更できます。しかし、何を*変更すべき*でしょうか？
+- `--model` サイズ。リストと推奨事項は[こちら](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md)
+- `--language`。リストは[こちら](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py)。`auto`を使用すると、モデルは自動的に言語を検出しようとしますが、英語に偏る傾向があります。
+- `--backend`？ `simulstreaming`が正しく動作しない場合や、デュアルライセンス要件を避けたい場合は`--backend faster-whisper`に切り替えることができます。
+- `--warmup-file`、もしあれば
+- `--host`, `--port`, `--ssl-certfile`, `--ssl-keyfile`、サーバーをセットアップする場合
+- `--diarization`、使用したい場合。
+
+残りは推奨しません。しかし、以下があなたのオプションです。
+
+| パラメータ | 説明 | デフォルト |
+|-----------|-------------|---------|
+| `--model` | Whisperモデルのサイズ。 | `small` |
+| `--language` | ソース言語コードまたは`auto` | `auto` |
+| `--task` | `transcribe`または`translate` | `transcribe` |
+| `--backend` | 処理バックエンド | `simulstreaming` |
+| `--min-chunk-size` | 最小音声チャンクサイズ（秒） | `1.0` |
+| `--no-vac` | 音声アクティビティコントローラーを無効化 | `False` |
+| `--no-vad` | 音声区間検出を無効化 | `False` |
+| `--warmup-file` | モデルのウォームアップ用音声ファイルパス | `jfk.wav` |
+| `--host` | サーバーホストアドレス | `localhost` |
+| `--port` | サーバーポート | `8000` |
+| `--ssl-certfile` | SSL証明書ファイルへのパス（HTTPSサポート用） | `None` |
+| `--ssl-keyfile` | SSL秘密鍵ファイルへのパス（HTTPSサポート用） | `None` |
+
+
+| WhisperStreamingバックエンドオプション | 説明 | デフォルト |
+|-----------|-------------|---------|
+| `--confidence-validation` | 高速な検証のために信頼スコアを使用 | `False` |
+| `--buffer_trimming` | バッファトリミング戦略（`sentence`または`segment`） | `segment` |
+
+
+| SimulStreamingバックエンドオプション | 説明 | デフォルト |
+|-----------|-------------|---------|
+| `--frame-threshold` | AlignAttフレームしきい値（低いほど速く、高いほど正確） | `25` |
+| `--beams` | ビームサーチのビーム数（1 = 貪欲デコーディング） | `1` |
+| `--decoder` | デコーダタイプを強制（`beam`または`greedy`） | `auto` |
+| `--audio-max-len` | 最大音声バッファ長（秒） | `30.0` |
+| `--audio-min-len` | 処理する最小音声長（秒） | `0.0` |
+| `--cif-ckpt-path` | 単語境界検出用CIFモデルへのパス | `None` |
+| `--never-fire` | 未完了の単語を決して切り捨てない | `False` |
+| `--init-prompt` | モデルの初期プロンプト | `None` |
+| `--static-init-prompt` | スクロールしない静的プロンプト | `None` |
+| `--max-context-tokens` | 最大コンテキストトークン数 | `None` |
+| `--model-path` | .ptモデルファイルへの直接パス。見つからない場合はダウンロード | `./base.pt` |
+| `--preloaded-model-count` | オプション。メモリにプリロードするモデルの数（予想される同時ユーザー数まで設定） | `1` |
+
+| ダイアライゼーションオプション | 説明 | デフォルト |
+|-----------|-------------|---------|
+| `--diarization` | 話者識別を有効化 | `False` |
+| `--diarization-backend` | `diart`または`sortformer` | `sortformer` |
+| `--segmentation-model` | DiartセグメンテーションモデルのHugging FaceモデルID。[利用可能なモデル](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
+| `--embedding-model` | Diart埋め込みモデルのHugging FaceモデルID。[利用可能なモデル](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
+
+
+> Diartを使用したダイアライゼーションには、pyannote.audioモデルへのアクセスが必要です：
+> 1. `pyannote/segmentation`モデルの[ユーザー条件に同意](https://huggingface.co/pyannote/segmentation)
+> 2. `pyannote/segmentation-3.0`モデルの[ユーザー条件に同意](https://huggingface.co/pyannote/segmentation-3.0)
+> 3. `pyannote/embedding`モデルの[ユーザー条件に同意](https://huggingface.co/pyannote/embedding)
+>4. HuggingFaceでログイン: `huggingface-cli login`
+
+### 🚀 デプロイガイド
+
+WhisperLiveKitを本番環境にデプロイするには：
+
+1. **サーバーセットアップ**: 本番用ASGIサーバーをインストールし、複数のワーカーで起動します
+   ```bash
+   pip install uvicorn gunicorn
+   gunicorn -k uvicorn.workers.UvicornWorker -w 4 your_app:app
+   ```
+
+2. **フロントエンド**: カスタマイズした`html`のバージョンをホストし、WebSocket接続が正しくポイントするようにします
+
+3. **Nginx設定** (本番環境で推奨):
+    ```nginx
+   server {
+       listen 80;
+       server_name your-domain.com;
+        location / {
+            proxy_pass http://localhost:8000;
+            proxy_set_header Upgrade $http_upgrade;
+            proxy_set_header Connection "upgrade";
+            proxy_set_header Host $host;
+    }}
+    ```
+
+4. **HTTPSサポート**: 安全なデプロイメントのために、WebSocket URLで "ws://" の代わりに "wss://" を使用します
+
+## 🐋 Docker
+
+GPUまたはCPUサポート付きでDockerを使用してアプリケーションを簡単にデプロイします。
+
+### 前提条件
+- Dockerがシステムにインストールされていること
+- GPUサポートの場合: NVIDIA Dockerランタイムがインストールされていること
+
+### クイックスタート
+
+**GPUアクセラレーション付き (推奨):**
+```bash
+docker build -t wlk .
+docker run --gpus all -p 8000:8000 --name wlk wlk
+```
+
+**CPUのみ:**
+```bash
+docker build -f Dockerfile.cpu -t wlk .
+docker run -p 8000:8000 --name wlk wlk
+```
+
+### 高度な使用法
+
+**カスタム設定:**
+```bash
+# カスタムモデルと言語の例
+docker run --gpus all -p 8000:8000 --name wlk wlk --model large-v3 --language fr
+```
+
+### メモリ要件
+- **大規模モデル**: Dockerランタイムに十分なメモリが割り当てられていることを確認してください
+
+
+#### カスタマイズ
+
+- `--build-arg` オプション:
+  - `EXTRAS="whisper-timestamped"` - イメージのインストールにエクストラを追加します（スペースなし）。必要なコンテナオプションを設定することを忘れないでください！
+  - `HF_PRECACHE_DIR="./.cache/"` - 初回起動を高速化するためにモデルキャッシュをプリロードします
+  - `HF_TKN_FILE="./token"` - ゲート付きモデルをダウンロードするためにHugging Face Hubアクセストークンを追加します
+
+## 🔮 ユースケース
+会議の文字起こしのためにリアルタイムで議論をキャプチャする、聴覚障害のあるユーザーがアクセシビリティツールを通じて会話を追うのを助ける、コンテンツ作成のためにポッドキャストやビデオを自動的に文字起こしする、カスタマーサービスのために話者識別付きでサポートコールを文字起こしする...
--- a/architecture.png
+++ b/architecture.png
--- a/available_models.md
+++ b/available_models.md
@@ -0,0 +1,109 @@
+# Available Whisper model sizes:
+
+- tiny.en (english only)
+- tiny
+- base.en (english only)
+- base
+- small.en (english only)
+- small
+- medium.en (english only)
+- medium
+- large-v1
+- large-v2
+- large-v3
+- large-v3-turbo
+
+## How to choose?
+
+### Language Support
+- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription
+- **Multilingual**: Do not use `.en` models.
+
+### Resource Constraints
+- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models
+  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
+  - `base`: Good balance of speed and accuracy for basic use cases
+  - `small`: Better accuracy while still being resource-efficient
+- **Good resources available**: Use `large` models for best accuracy
+  - `large-v2`: Excellent accuracy, good multilingual support
+  - `large-v3`: Best overall accuracy and language support
+
+### Special Cases
+- **No translation needed**: Use `large-v3-turbo`
+  - Same transcription quality as `large-v2` but significantly faster
+  - **Important**: Does not translate correctly, only transcribes
+
+### Model Comparison Table
+
+| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
+|-------|--------|----------|--------------|-------------|---------------|
+| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
+| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
+| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
+| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
+| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
+| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
+| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |
+
+### Additional Considerations
+
+**Model Performance**:
+- Accuracy improves significantly from tiny to large models
+- English-only models are ~10-15% more accurate for English audio
+- Newer versions (v2, v3) have better punctuation and formatting
+
+**Hardware Requirements**:
+- `tiny`: ~1GB VRAM
+- `base`: ~1GB VRAM  
+- `small`: ~2GB VRAM
+- `medium`: ~5GB VRAM
+- `large`: ~10GB VRAM
+- `large‑v3‑turbo`: ~6GB VRAM
+
+**Audio Quality Impact**:
+- Clean, clear audio: smaller models may suffice
+- Noisy, accented, or technical audio: larger models recommended
+- Phone/low-quality audio: use at least `small` model
+
+### Quick Decision Tree
+1. English only? → Add `.en` to your choice
+2. Limited resources or need speed? → `small` or smaller
+3. Good hardware and want best quality? → `large-v3`
+4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
+5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
+
+
+_______________________
+
+# Translation Models and Backend
+
+**Language Support**: ~200 languages
+
+## Distilled Model Sizes Available
+
+| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
+|-------|------|------------|-------------|-------------|---------|
+| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
+| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |
+
+**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
+
+## Backend Performance
+
+| Backend | Speed vs Base | Memory Usage | Quality Loss |
+|---------|---------------|--------------|--------------|
+| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
+| Transformers | Baseline | High | None |
+| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |
+
+**Metrics**:
+- CTranslate2: 50-100+ tokens/sec
+- Transformers: 10-30 tokens/sec
+- Apple Silicon with MPS: Up to 2x faster than CTranslate2
+
+## Quick Decision Matrix
+
+**Choose 600M**: Limited resources, close to 0 lag
+**Choose 1.3B**: Quality matters
+**Choose Transformers**: On Apple Silicon
+
--- a/chrome-extension/README.md
+++ b/chrome-extension/README.md
@@ -0,0 +1,17 @@
+## WhisperLiveKit Chrome Extension v0.1.0
+Capture the audio of your current tab, transcribe or translate it using WhisperliveKit. **Still unstable**
+
+<img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/chrome-extension/demo-extension.png" alt="WhisperLiveKit Demo" width="730">
+
+## Running this extension
+1. Clone this repository.
+2. Load this directory in Chrome as an unpacked extension.
+
+
+## Devs:
+- Impossible to capture audio from tabs if extension is a pannel, unfortunately: 
+- https://issues.chromium.org/issues/40926394
+- https://groups.google.com/a/chromium.org/g/chromium-extensions/c/DET2SXCFnDg
+- https://issues.chromium.org/issues/40916430
+
+- To capture microphone in an extension, there are tricks: https://github.com/justinmann/sidepanel-audio-issue , https://medium.com/@lynchee.owo/how-to-enable-microphone-access-in-chrome-extensions-by-code-924295170080 (comments)
--- a/chrome-extension/background.js
+++ b/chrome-extension/background.js
@@ -0,0 +1,9 @@
+chrome.runtime.onInstalled.addListener((details) => {
+    if (details.reason.search(/install/g) === -1) {
+        return
+    }
+    chrome.tabs.create({
+        url: chrome.runtime.getURL("welcome.html"),
+        active: true
+    })
+})
--- a/chrome-extension/demo-extension.png
+++ b/chrome-extension/demo-extension.png
--- a/chrome-extension/icons/icon128.png
+++ b/chrome-extension/icons/icon128.png
--- a/chrome-extension/icons/icon16.png
+++ b/chrome-extension/icons/icon16.png
--- a/chrome-extension/icons/icon32.png
+++ b/chrome-extension/icons/icon32.png
--- a/chrome-extension/icons/icon48.png
+++ b/chrome-extension/icons/icon48.png
--- a/chrome-extension/live_transcription.js
+++ b/chrome-extension/live_transcription.js
@@ -0,0 +1,669 @@
+/* Theme, WebSocket, recording, rendering logic extracted from inline script and adapted for segmented theme control and WS caption */
+let isRecording = false;
+let websocket = null;
+let recorder = null;
+let chunkDuration = 100;
+let websocketUrl = "ws://localhost:8000/asr";
+let userClosing = false;
+let wakeLock = null;
+let startTime = null;
+let timerInterval = null;
+let audioContext = null;
+let analyser = null;
+let microphone = null;
+let waveCanvas = document.getElementById("waveCanvas");
+let waveCtx = waveCanvas.getContext("2d");
+let animationFrame = null;
+let waitingForStop = false;
+let lastReceivedData = null;
+let lastSignature = null;
+let availableMicrophones = [];
+let selectedMicrophoneId = null;
+
+waveCanvas.width = 60 * (window.devicePixelRatio || 1);
+waveCanvas.height = 30 * (window.devicePixelRatio || 1);
+waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
+
+const statusText = document.getElementById("status");
+const recordButton = document.getElementById("recordButton");
+const chunkSelector = document.getElementById("chunkSelector");
+const websocketInput = document.getElementById("websocketInput");
+const websocketDefaultSpan = document.getElementById("wsDefaultUrl");
+const linesTranscriptDiv = document.getElementById("linesTranscript");
+const timerElement = document.querySelector(".timer");
+const themeRadios = document.querySelectorAll('input[name="theme"]');
+const microphoneSelect = document.getElementById("microphoneSelect");
+const settingsToggle = document.getElementById("settingsToggle");
+const settingsDiv = document.querySelector(".settings");
+
+
+
+chrome.runtime.onInstalled.addListener((details) => {
+    if (details.reason.search(/install/g) === -1) {
+        return
+    }
+    chrome.tabs.create({
+        url: chrome.runtime.getURL("welcome.html"),
+        active: true
+    })
+})
+
+function getWaveStroke() {
+  const styles = getComputedStyle(document.documentElement);
+  const v = styles.getPropertyValue("--wave-stroke").trim();
+  return v || "#000";
+}
+
+let waveStroke = getWaveStroke();
+function updateWaveStroke() {
+  waveStroke = getWaveStroke();
+}
+
+function applyTheme(pref) {
+  if (pref === "light") {
+    document.documentElement.setAttribute("data-theme", "light");
+  } else if (pref === "dark") {
+    document.documentElement.setAttribute("data-theme", "dark");
+  } else {
+    document.documentElement.removeAttribute("data-theme");
+  }
+  updateWaveStroke();
+}
+
+// Persisted theme preference
+const savedThemePref = localStorage.getItem("themePreference") || "system";
+applyTheme(savedThemePref);
+if (themeRadios.length) {
+  themeRadios.forEach((r) => {
+    r.checked = r.value === savedThemePref;
+    r.addEventListener("change", () => {
+      if (r.checked) {
+        localStorage.setItem("themePreference", r.value);
+        applyTheme(r.value);
+      }
+    });
+  });
+}
+
+// React to OS theme changes when in "system" mode
+const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
+const handleOsThemeChange = () => {
+  const pref = localStorage.getItem("themePreference") || "system";
+  if (pref === "system") updateWaveStroke();
+};
+if (darkMq && darkMq.addEventListener) {
+  darkMq.addEventListener("change", handleOsThemeChange);
+} else if (darkMq && darkMq.addListener) {
+  // deprecated, but included for Safari compatibility
+  darkMq.addListener(handleOsThemeChange);
+}
+
+async function enumerateMicrophones() {
+  try {
+      const micPermission = await navigator.permissions.query({
+    name: "microphone",
+  });
+  
+    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
+    stream.getTracks().forEach(track => track.stop());
+
+    const devices = await navigator.mediaDevices.enumerateDevices();
+    availableMicrophones = devices.filter(device => device.kind === 'audioinput');
+
+    populateMicrophoneSelect();
+    console.log(`Found ${availableMicrophones.length} microphone(s)`);
+  } catch (error) {
+    console.error('Error enumerating microphones:', error);
+    statusText.textContent = "Error accessing microphones. Please grant permission.";
+  }
+}
+
+function populateMicrophoneSelect() {
+  if (!microphoneSelect) return;
+
+  microphoneSelect.innerHTML = '<option value="">Default Microphone</option>';
+
+  availableMicrophones.forEach((device, index) => {
+    const option = document.createElement('option');
+    option.value = device.deviceId;
+    option.textContent = device.label || `Microphone ${index + 1}`;
+    microphoneSelect.appendChild(option);
+  });
+
+  const savedMicId = localStorage.getItem('selectedMicrophone');
+  if (savedMicId && availableMicrophones.some(mic => mic.deviceId === savedMicId)) {
+    microphoneSelect.value = savedMicId;
+    selectedMicrophoneId = savedMicId;
+  }
+}
+
+function handleMicrophoneChange() {
+  selectedMicrophoneId = microphoneSelect.value || null;
+  localStorage.setItem('selectedMicrophone', selectedMicrophoneId || '');
+
+  const selectedDevice = availableMicrophones.find(mic => mic.deviceId === selectedMicrophoneId);
+  const deviceName = selectedDevice ? selectedDevice.label : 'Default Microphone';
+
+  console.log(`Selected microphone: ${deviceName}`);
+  statusText.textContent = `Microphone changed to: ${deviceName}`;
+
+  if (isRecording) {
+    statusText.textContent = "Switching microphone... Please wait.";
+    stopRecording().then(() => {
+      setTimeout(() => {
+        toggleRecording();
+      }, 1000);
+    });
+  }
+}
+
+// Helpers
+function fmt1(x) {
+  const n = Number(x);
+  return Number.isFinite(n) ? n.toFixed(1) : x;
+}
+
+// Default WebSocket URL computation
+const host = window.location.hostname || "localhost";
+const port = window.location.port;
+const protocol = window.location.protocol === "https:" ? "wss" : "ws";
+const defaultWebSocketUrl = websocketUrl;
+
+// Populate default caption and input
+if (websocketDefaultSpan) websocketDefaultSpan.textContent = defaultWebSocketUrl;
+websocketInput.value = defaultWebSocketUrl;
+websocketUrl = defaultWebSocketUrl;
+
+// Optional chunk selector (guard for presence)
+if (chunkSelector) {
+  chunkSelector.addEventListener("change", () => {
+    chunkDuration = parseInt(chunkSelector.value);
+  });
+}
+
+// WebSocket input change handling
+websocketInput.addEventListener("change", () => {
+  const urlValue = websocketInput.value.trim();
+  if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
+    statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
+    return;
+  }
+  websocketUrl = urlValue;
+  statusText.textContent = "WebSocket URL updated. Ready to connect.";
+});
+
+function setupWebSocket() {
+  return new Promise((resolve, reject) => {
+    try {
+      websocket = new WebSocket(websocketUrl);
+    } catch (error) {
+      statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
+      reject(error);
+      return;
+    }
+
+    websocket.onopen = () => {
+      statusText.textContent = "Connected to server.";
+      resolve();
+    };
+
+    websocket.onclose = () => {
+      if (userClosing) {
+        if (waitingForStop) {
+          statusText.textContent = "Processing finalized or connection closed.";
+          if (lastReceivedData) {
+            renderLinesWithBuffer(
+              lastReceivedData.lines || [],
+              lastReceivedData.buffer_diarization || "",
+              lastReceivedData.buffer_transcription || "",
+              0,
+              0,
+              true
+            );
+          }
+        }
+      } else {
+        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
+        if (isRecording) {
+          stopRecording();
+        }
+      }
+      isRecording = false;
+      waitingForStop = false;
+      userClosing = false;
+      lastReceivedData = null;
+      websocket = null;
+      updateUI();
+    };
+
+    websocket.onerror = () => {
+      statusText.textContent = "Error connecting to WebSocket.";
+      reject(new Error("Error connecting to WebSocket"));
+    };
+
+    websocket.onmessage = (event) => {
+      const data = JSON.parse(event.data);
+
+      if (data.type === "ready_to_stop") {
+        console.log("Ready to stop received, finalizing display and closing WebSocket.");
+        waitingForStop = false;
+
+        if (lastReceivedData) {
+          renderLinesWithBuffer(
+            lastReceivedData.lines || [],
+            lastReceivedData.buffer_diarization || "",
+            lastReceivedData.buffer_transcription || "",
+            0,
+            0,
+            true
+          );
+        }
+        statusText.textContent = "Finished processing audio! Ready to record again.";
+        recordButton.disabled = false;
+
+        if (websocket) {
+          websocket.close();
+        }
+        return;
+      }
+
+      lastReceivedData = data;
+
+      const {
+        lines = [],
+        buffer_transcription = "",
+        buffer_diarization = "",
+        remaining_time_transcription = 0,
+        remaining_time_diarization = 0,
+        status = "active_transcription",
+      } = data;
+
+      renderLinesWithBuffer(
+        lines,
+        buffer_diarization,
+        buffer_transcription,
+        remaining_time_diarization,
+        remaining_time_transcription,
+        false,
+        status
+      );
+    };
+  });
+}
+
+function renderLinesWithBuffer(
+  lines,
+  buffer_diarization,
+  buffer_transcription,
+  remaining_time_diarization,
+  remaining_time_transcription,
+  isFinalizing = false,
+  current_status = "active_transcription"
+) {
+  if (current_status === "no_audio_detected") {
+    linesTranscriptDiv.innerHTML =
+      "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
+    return;
+  }
+
+  const showLoading = !isFinalizing && (lines || []).some((it) => it.speaker == 0);
+  const showTransLag = !isFinalizing && remaining_time_transcription > 0;
+  const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
+  const signature = JSON.stringify({
+    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, start: it.start, end: it.end })),
+    buffer_transcription: buffer_transcription || "",
+    buffer_diarization: buffer_diarization || "",
+    status: current_status,
+    showLoading,
+    showTransLag,
+    showDiaLag,
+    isFinalizing: !!isFinalizing,
+  });
+  if (lastSignature === signature) {
+    const t = document.querySelector(".lag-transcription-value");
+    if (t) t.textContent = fmt1(remaining_time_transcription);
+    const d = document.querySelector(".lag-diarization-value");
+    if (d) d.textContent = fmt1(remaining_time_diarization);
+    const ld = document.querySelector(".loading-diarization-value");
+    if (ld) ld.textContent = fmt1(remaining_time_diarization);
+    return;
+  }
+  lastSignature = signature;
+
+  const linesHtml = (lines || [])
+    .map((item, idx) => {
+      let timeInfo = "";
+      if (item.start !== undefined && item.end !== undefined) {
+        timeInfo = ` ${item.start} - ${item.end}`;
+      }
+
+      let speakerLabel = "";
+      if (item.speaker === -2) {
+        speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
+      } else if (item.speaker == 0 && !isFinalizing) {
+        speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
+          remaining_time_diarization
+        )}</span> second(s) of audio are undergoing diarization</span></span>`;
+      } else if (item.speaker !== 0) {
+        speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
+      }
+
+      let currentLineText = item.text || "";
+
+      if (idx === lines.length - 1) {
+        if (!isFinalizing && item.speaker !== -2) {
+          if (remaining_time_transcription > 0) {
+            speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(
+              remaining_time_transcription
+            )}</span>s</span></span>`;
+          }
+          if (buffer_diarization && remaining_time_diarization > 0) {
+            speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(
+              remaining_time_diarization
+            )}</span>s</span></span>`;
+          }
+        }
+
+        if (buffer_diarization) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
+          } else {
+            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
+          }
+        }
+        if (buffer_transcription) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") +
+              buffer_transcription.trim();
+          } else {
+            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
+          }
+        }
+      }
+
+      return currentLineText.trim().length > 0 || speakerLabel.length > 0
+        ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
+        : `<p>${speakerLabel}<br/></p>`;
+    })
+    .join("");
+
+  linesTranscriptDiv.innerHTML = linesHtml;
+  window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });
+}
+
+function updateTimer() {
+  if (!startTime) return;
+
+  const elapsed = Math.floor((Date.now() - startTime) / 1000);
+  const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
+  const seconds = (elapsed % 60).toString().padStart(2, "0");
+  timerElement.textContent = `${minutes}:${seconds}`;
+}
+
+function drawWaveform() {
+  if (!analyser) return;
+
+  const bufferLength = analyser.frequencyBinCount;
+  const dataArray = new Uint8Array(bufferLength);
+  analyser.getByteTimeDomainData(dataArray);
+
+  waveCtx.clearRect(
+    0,
+    0,
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    waveCanvas.height / (window.devicePixelRatio || 1)
+  );
+  waveCtx.lineWidth = 1;
+  waveCtx.strokeStyle = waveStroke;
+  waveCtx.beginPath();
+
+  const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
+  let x = 0;
+
+  for (let i = 0; i < bufferLength; i++) {
+    const v = dataArray[i] / 128.0;
+    const y = (v * (waveCanvas.height / (window.devicePixelRatio || 1))) / 2;
+
+    if (i === 0) {
+      waveCtx.moveTo(x, y);
+    } else {
+      waveCtx.lineTo(x, y);
+    }
+
+    x += sliceWidth;
+  }
+
+  waveCtx.lineTo(
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    (waveCanvas.height / (window.devicePixelRatio || 1)) / 2
+  );
+  waveCtx.stroke();
+
+  animationFrame = requestAnimationFrame(drawWaveform);
+}
+
+async function startRecording() {
+  try {
+    try {
+      wakeLock = await navigator.wakeLock.request("screen");
+    } catch (err) {
+      console.log("Error acquiring wake lock.");
+    }
+
+    let stream;
+    try {
+      // Try tab capture first
+      stream = await new Promise((resolve, reject) => {
+        chrome.tabCapture.capture({audio: true}, (s) => {
+          if (s) {
+            resolve(s);
+          } else {
+            reject(new Error('Tab capture failed or not available'));
+          }
+        });
+      });
+      statusText.textContent = "Using tab audio capture.";
+    } catch (tabError) {
+      console.log('Tab capture not available, falling back to microphone', tabError);
+      // Fallback to microphone
+      const audioConstraints = selectedMicrophoneId
+        ? { audio: { deviceId: { exact: selectedMicrophoneId } } }
+        : { audio: true };
+      stream = await navigator.mediaDevices.getUserMedia(audioConstraints);
+      statusText.textContent = "Using microphone audio.";
+    }
+
+    audioContext = new (window.AudioContext || window.webkitAudioContext)();
+    analyser = audioContext.createAnalyser();
+    analyser.fftSize = 256;
+    microphone = audioContext.createMediaStreamSource(stream);
+    microphone.connect(analyser);
+
+    recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
+    recorder.ondataavailable = (e) => {
+      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        websocket.send(e.data);
+      }
+    };
+    recorder.start(chunkDuration);
+
+    startTime = Date.now();
+    timerInterval = setInterval(updateTimer, 1000);
+    drawWaveform();
+
+    isRecording = true;
+    updateUI();
+  } catch (err) {
+    if (window.location.hostname === "0.0.0.0") {
+      statusText.textContent =
+        "Error accessing audio input. Browsers may block audio access on 0.0.0.0. Try using localhost:8000 instead.";
+    } else {
+      statusText.textContent = "Error accessing audio input. Please check permissions.";
+    }
+    console.error(err);
+  }
+}
+
+async function stopRecording() {
+  if (wakeLock) {
+    try {
+      await wakeLock.release();
+    } catch (e) {
+      // ignore
+    }
+    wakeLock = null;
+  }
+
+  userClosing = true;
+  waitingForStop = true;
+
+  if (websocket && websocket.readyState === WebSocket.OPEN) {
+    const emptyBlob = new Blob([], { type: "audio/webm" });
+    websocket.send(emptyBlob);
+    statusText.textContent = "Recording stopped. Processing final audio...";
+  }
+
+  if (recorder) {
+    recorder.stop();
+    recorder = null;
+  }
+
+  if (microphone) {
+    microphone.disconnect();
+    microphone = null;
+  }
+
+  if (analyser) {
+    analyser = null;
+  }
+
+  if (audioContext && audioContext.state !== "closed") {
+    try {
+      await audioContext.close();
+    } catch (e) {
+      console.warn("Could not close audio context:", e);
+    }
+    audioContext = null;
+  }
+
+  if (animationFrame) {
+    cancelAnimationFrame(animationFrame);
+    animationFrame = null;
+  }
+
+  if (timerInterval) {
+    clearInterval(timerInterval);
+    timerInterval = null;
+  }
+  timerElement.textContent = "00:00";
+  startTime = null;
+
+  isRecording = false;
+  updateUI();
+}
+
+async function toggleRecording() {
+  if (!isRecording) {
+    if (waitingForStop) {
+      console.log("Waiting for stop, early return");
+      return;
+    }
+    console.log("Connecting to WebSocket");
+    try {
+      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        await startRecording();
+      } else {
+        await setupWebSocket();
+        await startRecording();
+      }
+    } catch (err) {
+      statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
+      console.error(err);
+    }
+  } else {
+    console.log("Stopping recording");
+    stopRecording();
+  }
+}
+
+function updateUI() {
+  recordButton.classList.toggle("recording", isRecording);
+  recordButton.disabled = waitingForStop;
+
+  if (waitingForStop) {
+    if (statusText.textContent !== "Recording stopped. Processing final audio...") {
+      statusText.textContent = "Please wait for processing to complete...";
+    }
+  } else if (isRecording) {
+    statusText.textContent = "Recording...";
+  } else {
+    if (
+      statusText.textContent !== "Finished processing audio! Ready to record again." &&
+      statusText.textContent !== "Processing finalized or connection closed."
+    ) {
+      statusText.textContent = "Click to start transcription";
+    }
+  }
+  if (!waitingForStop) {
+    recordButton.disabled = false;
+  }
+}
+
+recordButton.addEventListener("click", toggleRecording);
+
+if (microphoneSelect) {
+  microphoneSelect.addEventListener("change", handleMicrophoneChange);
+}
+
+// Settings toggle functionality
+settingsToggle.addEventListener("click", () => {
+  settingsDiv.classList.toggle("visible");
+  settingsToggle.classList.toggle("active");
+});
+
+document.addEventListener('DOMContentLoaded', async () => {
+  try {
+    await enumerateMicrophones();
+  } catch (error) {
+    console.log("Could not enumerate microphones on load:", error);
+  }
+});
+navigator.mediaDevices.addEventListener('devicechange', async () => {
+  console.log('Device change detected, re-enumerating microphones');
+  try {
+    await enumerateMicrophones();
+  } catch (error) {
+    console.log("Error re-enumerating microphones:", error);
+  }
+});
+
+
+async function run() {
+  const micPermission = await navigator.permissions.query({
+    name: "microphone",
+  });
+
+  document.getElementById(
+    "audioPermission"
+  ).innerText = `MICROPHONE: ${micPermission.state}`;
+
+  if (micPermission.state !== "granted") {
+    chrome.tabs.create({ url: "welcome.html" });
+  }
+
+  const intervalId = setInterval(async () => {
+    const micPermission = await navigator.permissions.query({
+      name: "microphone",
+    });
+    if (micPermission.state === "granted") {
+      document.getElementById(
+        "audioPermission"
+      ).innerText = `MICROPHONE: ${micPermission.state}`;
+      clearInterval(intervalId);
+    }
+  }, 100);
+}
+
+void run();
--- a/chrome-extension/manifest.json
+++ b/chrome-extension/manifest.json
@@ -0,0 +1,37 @@
+{
+    "manifest_version": 3,
+    "name": "WhisperLiveKit Tab Capture",
+    "version": "1.0",
+    "description": "Capture and transcribe audio from browser tabs using WhisperLiveKit.",
+    "background": {
+        "service_worker": "background.js"
+    },
+    "icons": {
+        "16": "icons/icon16.png",
+        "32": "icons/icon32.png",
+        "48": "icons/icon48.png",
+        "128": "icons/icon128.png"
+    },
+    "action": {
+        "default_title": "WhisperLiveKit Tab Capture",
+        "default_popup": "popup.html"
+    },
+    "permissions": [
+        "scripting",
+        "tabCapture",
+        "offscreen",
+        "activeTab",
+        "storage"
+    ],
+    "web_accessible_resources": [
+        {
+            "resources": [
+                "requestPermissions.html",
+                "requestPermissions.js"
+            ],
+            "matches": [
+                "<all_urls>"
+            ]
+        }
+    ]
+}
--- a/chrome-extension/popup.html
+++ b/chrome-extension/popup.html
@@ -0,0 +1,78 @@
+<!DOCTYPE html>
+<html lang="en">
+
+<head>
+    <meta charset="UTF-8" />
+    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>WhisperLiveKit</title>
+    <link rel="stylesheet" href="/web/live_transcription.css" />
+</head>
+
+<body>
+    <div class="settings-container">
+        <button id="recordButton">
+            <div class="shape-container">
+                <div class="shape"></div>
+            </div>
+            <div class="recording-info">
+                <div class="wave-container">
+                    <canvas id="waveCanvas"></canvas>
+                </div>
+                <div class="timer">00:00</div>
+            </div>
+        </button>
+
+        <button id="settingsToggle" class="settings-toggle" title="Show/hide settings">
+            <img src="/web/src/settings.svg" alt="Settings" />
+        </button>
+
+        <div class="settings">
+            <div class="field">
+                <label for="websocketInput">Websocket URL</label>
+                <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
+            </div>
+
+            <div class="field">
+                <label id="microphoneSelectLabel" for="microphoneSelect">Select Microphone</label>
+                <select id="microphoneSelect">
+                    <option value="">Default Microphone</option>
+                </select>
+                        <div id="audioPermission"></div>
+
+            </div>
+
+            <div class="theme-selector-container">
+                <div class="segmented" role="radiogroup" aria-label="Theme selector">
+                    <input type="radio" id="theme-system" name="theme" value="system" />
+                    <label for="theme-system" title="System">
+                        <img src="/web/src/system_mode.svg" alt="" />
+                        <!-- <span>System</span> -->
+                    </label>
+
+                    <input type="radio" id="theme-light" name="theme" value="light" />
+                    <label for="theme-light" title="Light">
+                        <img src="/web/src/light_mode.svg" alt="" />
+                        <!-- <span>Light</span> -->
+                    </label>
+
+                    <input type="radio" id="theme-dark" name="theme" value="dark" />
+                    <label for="theme-dark" title="Dark">
+                        <img src="/web/src/dark_mode.svg" alt="" />
+                        <!-- <span>Dark</span> -->
+                    </label>
+                </div>
+            </div>
+
+        </div>
+    </div>
+
+
+
+    <p id="status"></p>
+
+    <div id="linesTranscript"></div>
+
+    <script src="live_transcription.js"></script>
+</body>
+
+</html>
--- a/chrome-extension/requestPermissions.html
+++ b/chrome-extension/requestPermissions.html
@@ -0,0 +1,12 @@
+<!DOCTYPE html>
+<html>
+  <head>
+    <title>Request Permissions</title>
+    <script src="requestPermissions.js"></script>
+  </head>
+  <body>
+    This page exists to workaround an issue with Chrome that blocks permission
+    requests from chrome extensions
+    <button id="requestMicrophone">Request Microphone</button>
+  </body>
+</html>
--- a/chrome-extension/requestPermissions.js
+++ b/chrome-extension/requestPermissions.js
@@ -0,0 +1,17 @@
+/**
+ * Requests user permission for microphone access.
+ * @returns {Promise<void>} A Promise that resolves when permission is granted or rejects with an error.
+ */
+async function getUserPermission() {
+  console.log("Getting user permission for microphone access...");
+  await navigator.mediaDevices.getUserMedia({ audio: true });
+  const micPermission = await navigator.permissions.query({
+    name: "microphone",
+  });
+  if (micPermission.state == "granted") {
+    window.close();
+  }
+}
+
+// Call the function to request microphone permission
+getUserPermission();
--- a/chrome-extension/sidepanel.js
+++ b/chrome-extension/sidepanel.js
@@ -0,0 +1,29 @@
+console.log("sidepanel.js");
+
+async function run() {
+  const micPermission = await navigator.permissions.query({
+    name: "microphone",
+  });
+
+  document.getElementById(
+    "audioPermission"
+  ).innerText = `MICROPHONE: ${micPermission.state}`;
+
+  if (micPermission.state !== "granted") {
+    chrome.tabs.create({ url: "requestPermissions.html" });
+  }
+
+  const intervalId = setInterval(async () => {
+    const micPermission = await navigator.permissions.query({
+      name: "microphone",
+    });
+    if (micPermission.state === "granted") {
+      document.getElementById(
+        "audioPermission"
+      ).innerText = `MICROPHONE: ${micPermission.state}`;
+      clearInterval(intervalId);
+    }
+  }, 100);
+}
+
+void run();
--- a/chrome-extension/web/live_transcription.css
+++ b/chrome-extension/web/live_transcription.css
@@ -0,0 +1,539 @@
+:root {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+@media (prefers-color-scheme: dark) {
+  :root:not([data-theme="light"]) {
+    --bg: #0b0b0b;
+    --text: #e6e6e6;
+    --muted: #9aa0a6;
+    --border: #333333;
+    --chip-bg: rgba(255, 255, 255, 0.08);
+    --chip-text: #e6e6e6;
+    --spinner-border: #555555;
+    --spinner-top: #dddddd;
+    --silence-bg: #1a1a1a;
+    --loading-bg: rgba(255, 77, 77, 0.12);
+    --button-bg: #111111;
+    --button-border: #333333;
+    --wave-stroke: #e6e6e6;
+    --label-dia-text: #b3b3b3;
+    --label-trans-text: #ffffff;
+  }
+}
+
+:root[data-theme="dark"] {
+  --bg: #0b0b0b;
+  --text: #e6e6e6;
+  --muted: #9aa0a6;
+  --border: #333333;
+  --chip-bg: rgba(255, 255, 255, 0.08);
+  --chip-text: #e6e6e6;
+  --spinner-border: #555555;
+  --spinner-top: #dddddd;
+  --silence-bg: #1a1a1a;
+  --loading-bg: rgba(255, 77, 77, 0.12);
+  --button-bg: #111111;
+  --button-border: #333333;
+  --wave-stroke: #e6e6e6;
+  --label-dia-text: #b3b3b3;
+  --label-trans-text: #ffffff;
+}
+
+:root[data-theme="light"] {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+body {
+  font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
+  margin: 20px;
+  text-align: center;
+  background-color: var(--bg);
+  color: var(--text);
+}
+
+.settings-toggle {
+  margin-top: 4px;
+  width: 40px;
+  height: 40px;
+  border: none;
+  border-radius: 50%;
+  background-color: var(--button-bg);
+  cursor: pointer;
+  transition: all 0.3s ease;
+  /* border: 1px solid var(--button-border); */
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  position: relative;
+}
+
+.settings-toggle:hover {
+  background-color: var(--chip-bg);
+}
+
+.settings-toggle img {
+  width: 24px;
+  height: 24px;
+  opacity: 0.7;
+  transition: opacity 0.2s ease, transform 0.3s ease;
+}
+
+.settings-toggle:hover img {
+  opacity: 1;
+}
+
+.settings-toggle.active img {
+  transform: rotate(80deg);
+}
+
+/* Record button */
+#recordButton {
+  width: 50px;
+  height: 50px;
+  border: none;
+  border-radius: 50%;
+  background-color: var(--button-bg);
+  cursor: pointer;
+  transition: all 0.3s ease;
+  border: 1px solid var(--button-border);
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  position: relative;
+}
+
+#recordButton.recording {
+  width: 180px;
+  border-radius: 40px;
+  justify-content: flex-start;
+  padding-left: 20px;
+}
+
+#recordButton:active {
+  transform: scale(0.95);
+}
+
+.shape-container {
+  width: 25px;
+  height: 25px;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  flex-shrink: 0;
+}
+
+.shape {
+  width: 25px;
+  height: 25px;
+  background-color: rgb(209, 61, 53);
+  border-radius: 50%;
+  transition: all 0.3s ease;
+}
+
+#recordButton:disabled .shape {
+  background-color: #6e6d6d;
+}
+
+#recordButton.recording .shape {
+  border-radius: 5px;
+  width: 25px;
+  height: 25px;
+}
+
+/* Recording elements */
+.recording-info {
+  display: none;
+  align-items: center;
+  margin-left: 15px;
+  flex-grow: 1;
+}
+
+#recordButton.recording .recording-info {
+  display: flex;
+}
+
+.wave-container {
+  width: 60px;
+  height: 30px;
+  position: relative;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+}
+
+#waveCanvas {
+  width: 100%;
+  height: 100%;
+}
+
+.timer {
+  font-size: 14px;
+  font-weight: 500;
+  color: var(--text);
+  margin-left: 10px;
+}
+
+#status {
+  margin-top: 20px;
+  font-size: 16px;
+  color: var(--text);
+}
+
+/* Settings */
+.settings-container {
+  display: flex;
+  justify-content: center;
+  align-items: flex-start;
+  gap: 15px;
+  margin-top: 20px;
+  flex-wrap: wrap;
+}
+
+.settings {
+  display: none;
+  flex-wrap: wrap;
+  align-items: flex-start;
+  gap: 12px;
+  transition: opacity 0.3s ease;
+}
+
+.settings.visible {
+  display: flex;
+}
+
+.field {
+  display: flex;
+  flex-direction: column;
+  align-items: flex-start;
+  gap: 3px;
+}
+
+#chunkSelector,
+#websocketInput,
+#themeSelector,
+#microphoneSelect {
+  font-size: 16px;
+  padding: 5px 8px;
+  border-radius: 8px;
+  border: 1px solid var(--border);
+  background-color: var(--button-bg);
+  color: var(--text);
+  max-height: 30px;
+}
+
+#microphoneSelect {
+  width: 100%;
+  max-width: 190px;
+  min-width: 120px;
+}
+
+#chunkSelector:focus,
+#websocketInput:focus,
+#themeSelector:focus,
+#microphoneSelect:focus {
+  outline: none;
+  border-color: #007bff;
+  box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.15);
+}
+
+label {
+  font-size: 13px;
+  color: var(--muted);
+}
+
+.ws-default {
+  font-size: 12px;
+  color: var(--muted);
+}
+
+/* Segmented pill control for Theme */
+.segmented {
+  display: inline-flex;
+  align-items: stretch;
+  border: 1px solid var(--button-border);
+  background-color: var(--button-bg);
+  border-radius: 999px;
+  overflow: hidden;
+}
+
+.segmented input[type="radio"] {
+  position: absolute;
+  opacity: 0;
+  pointer-events: none;
+}
+
+.theme-selector-container {
+  display: flex;
+  align-items: center;
+  margin-top: 17px;
+}
+
+.segmented label {
+  display: inline-flex;
+  align-items: center;
+  gap: 6px;
+  padding: 6px 12px;
+  font-size: 14px;
+  color: var(--muted);
+  cursor: pointer;
+  user-select: none;
+  transition: background-color 0.2s ease, color 0.2s ease;
+}
+
+.segmented label span {
+  display: none;
+}
+
+.segmented label:hover span {
+  display: inline;
+}
+
+.segmented label:hover {
+  background-color: var(--chip-bg);
+}
+
+.segmented img {
+  width: 16px;
+  height: 16px;
+}
+
+.segmented input[type="radio"]:checked + label {
+  background-color: var(--chip-bg);
+  color: var(--text);
+}
+
+.segmented input[type="radio"]:focus-visible + label,
+.segmented input[type="radio"]:focus + label {
+  outline: 2px solid #007bff;
+  outline-offset: 2px;
+  border-radius: 999px;
+}
+
+/* Transcript area */
+#linesTranscript {
+  margin: 20px auto;
+  max-width: 700px;
+  text-align: left;
+  font-size: 16px;
+}
+
+#linesTranscript p {
+  margin: 0px 0;
+}
+
+#linesTranscript strong {
+  color: var(--text);
+}
+
+#speaker {
+  border: 1px solid var(--border);
+  border-radius: 100px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
+
+.label_diarization {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  margin-left: 10px;
+  display: inline-block;
+  white-space: nowrap;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-dia-text);
+}
+
+.label_transcription {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  display: inline-block;
+  white-space: nowrap;
+  margin-left: 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-trans-text);
+}
+
+#timeInfo {
+  color: var(--muted);
+  margin-left: 10px;
+}
+
+.textcontent {
+  font-size: 16px;
+  padding-left: 10px;
+  margin-bottom: 10px;
+  margin-top: 1px;
+  padding-top: 5px;
+  border-radius: 0px 0px 0px 10px;
+}
+
+.buffer_diarization {
+  color: var(--label-dia-text);
+  margin-left: 4px;
+}
+
+.buffer_transcription {
+  color: #7474748c;
+  margin-left: 4px;
+}
+
+.spinner {
+  display: inline-block;
+  width: 8px;
+  height: 8px;
+  border: 2px solid var(--spinner-border);
+  border-top: 2px solid var(--spinner-top);
+  border-radius: 50%;
+  animation: spin 0.7s linear infinite;
+  vertical-align: middle;
+  margin-bottom: 2px;
+  margin-right: 5px;
+}
+
+@keyframes spin {
+  to {
+    transform: rotate(360deg);
+  }
+}
+
+.silence {
+  color: var(--muted);
+  background-color: var(--silence-bg);
+  font-size: 13px;
+  border-radius: 30px;
+  padding: 2px 10px;
+}
+
+.loading {
+  color: var(--muted);
+  background-color: var(--loading-bg);
+  border-radius: 8px 8px 8px 0px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
+
+/* for smaller screens */
+/* @media (max-width: 450px) {
+  .settings-container {
+    flex-direction: column;
+    gap: 10px;
+    align-items: center;
+  }
+
+  .settings {
+    justify-content: center;
+    gap: 8px;
+    width: 100%;
+  }
+
+  .field {
+    align-items: center;
+    width: 100%;
+  }
+
+  #websocketInput,
+  #microphoneSelect {
+    min-width: 200px;
+    max-width: 100%;
+  }
+
+  .theme-selector-container {
+    margin-top: 10px;
+  }
+} */
+
+/* @media (max-width: 768px) and (min-width: 451px) {
+  .settings-container {
+    gap: 10px;
+  }
+
+  .settings {
+    gap: 8px;
+  }
+
+  #websocketInput,
+  #microphoneSelect {
+    min-width: 150px;
+    max-width: 300px;
+  }
+} */
+
+/* @media (max-width: 480px) {
+  body {
+    margin: 10px;
+  }
+
+  .settings-toggle {
+    width: 35px;
+    height: 35px;
+  }
+
+  .settings-toggle img {
+    width: 20px;
+    height: 20px;
+  }
+
+  .settings {
+    flex-direction: column;
+    align-items: center;
+    gap: 6px;
+  }
+
+  #websocketInput,
+  #microphoneSelect {
+    max-width: 400px;
+  }
+
+  .segmented label {
+    padding: 4px 8px;
+    font-size: 12px;
+  }
+
+  .segmented img {
+    width: 14px;
+    height: 14px;
+  }
+} */
+
+
+html
+{
+    width: 400px;  /* max: 800px */
+    height: 600px; /* max: 600px */
+    border-radius: 10px;
+
+}
--- a/chrome-extension/web/src/dark_mode.svg
+++ b/chrome-extension/web/src/dark_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
--- a/chrome-extension/web/src/light_mode.svg
+++ b/chrome-extension/web/src/light_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>
--- a/chrome-extension/web/src/settings.svg
+++ b/chrome-extension/web/src/settings.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M433-80q-27 0-46.5-18T363-142l-9-66q-13-5-24.5-12T307-235l-62 26q-25 11-50 2t-39-32l-47-82q-14-23-8-49t27-43l53-40q-1-7-1-13.5v-27q0-6.5 1-13.5l-53-40q-21-17-27-43t8-49l47-82q14-23 39-32t50 2l62 26q11-8 23-15t24-12l9-66q4-26 23.5-44t46.5-18h94q27 0 46.5 18t23.5 44l9 66q13 5 24.5 12t22.5 15l62-26q25-11 50-2t39 32l47 82q14 23 8 49t-27 43l-53 40q1 7 1 13.5v27q0 6.5-2 13.5l53 40q21 17 27 43t-8 49l-48 82q-14 23-39 32t-50-2l-60-26q-11 8-23 15t-24 12l-9 66q-4 26-23.5 44T527-80h-94Zm7-80h79l14-106q31-8 57.5-23.5T639-327l99 41 39-68-86-65q5-14 7-29.5t2-31.5q0-16-2-31.5t-7-29.5l86-65-39-68-99 42q-22-23-48.5-38.5T533-694l-13-106h-79l-14 106q-31 8-57.5 23.5T321-633l-99-41-39 68 86 64q-5 15-7 30t-2 32q0 16 2 31t7 30l-86 65 39 68 99-42q22 23 48.5 38.5T427-266l13 106Zm42-180q58 0 99-41t41-99q0-58-41-99t-99-41q-59 0-99.5 41T342-480q0 58 40.5 99t99.5 41Zm-2-140Z"/></svg>
--- a/chrome-extension/web/src/system_mode.svg
+++ b/chrome-extension/web/src/system_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M396-396q-32-32-58.5-67T289-537q-5 14-6.5 28.5T281-480q0 83 58 141t141 58q14 0 28.5-2t28.5-6q-39-22-74-48.5T396-396Zm85 196q-56 0-107-21t-91-61q-40-40-61-91t-21-107q0-51 17-97.5t50-84.5q13-14 32-9.5t27 24.5q21 55 52.5 104t73.5 91q42 42 91 73.5T648-326q20 8 24.5 27t-9.5 32q-38 33-84.5 50T481-200Zm223-192q-16-5-23-20.5t-4-32.5q9-48-6-94.5T621-621q-35-35-80.5-49.5T448-677q-17 3-32-4t-21-23q-6-16 1.5-31t23.5-19q69-15 138 4.5T679-678q51 51 71 120t5 138q-4 17-19 25t-32 3ZM480-840q-17 0-28.5-11.5T440-880v-40q0-17 11.5-28.5T480-960q17 0 28.5 11.5T520-920v40q0 17-11.5 28.5T480-840Zm0 840q-17 0-28.5-11.5T440-40v-40q0-17 11.5-28.5T480-120q17 0 28.5 11.5T520-80v40q0 17-11.5 28.5T480 0Zm255-734q-12-12-12-28.5t12-28.5l28-28q11-11 27.5-11t28.5 11q12 12 12 28.5T819-762l-28 28q-12 12-28 12t-28-12ZM141-141q-12-12-12-28.5t12-28.5l28-28q12-12 28-12t28 12q12 12 12 28.5T225-169l-28 28q-11 11-27.5 11T141-141Zm739-299q-17 0-28.5-11.5T840-480q0-17 11.5-28.5T880-520h40q17 0 28.5 11.5T960-480q0 17-11.5 28.5T920-440h-40Zm-840 0q-17 0-28.5-11.5T0-480q0-17 11.5-28.5T40-520h40q17 0 28.5 11.5T120-480q0 17-11.5 28.5T80-440H40Zm779 299q-12 12-28.5 12T762-141l-28-28q-12-12-12-28t12-28q12-12 28.5-12t28.5 12l28 28q11 11 11 27.5T819-141ZM226-735q-12 12-28.5 12T169-735l-28-28q-11-11-11-27.5t11-28.5q12-12 28.5-12t28.5 12l28 28q12 12 12 28t-12 28Zm170 339Z"/></svg>
--- a/chrome-extension/welcome.html
+++ b/chrome-extension/welcome.html
@@ -0,0 +1,12 @@
+<!DOCTYPE html>
+<html>
+  <head>
+    <title>Welcome</title>
+    <script src="welcome.js"></script>
+  </head>
+  <body>
+    This page exists to workaround an issue with Chrome that blocks permission
+    requests from chrome extensions
+    <!-- <button id="requestMicrophone">Request Microphone</button> -->
+  </body>
+</html>
--- a/demo.png
+++ b/demo.png
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,8 +4,8 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "whisperlivekit"
-version = "0.2.5"
-description = "Real-time, Fully Local Whisper's Speech-to-Text and Speaker Diarization"
+version = "0.2.10"
+description = "Real-time speech-to-text with speaker diarization using Whisper"
 readme = "README.md"
 authors = [
    { name = "Quentin Fuxa" }
@@ -18,6 +18,11 @@ classifiers = [
    "License :: OSI Approved :: MIT License",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
+    "Programming Language :: Python :: 3.11",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Programming Language :: Python :: 3.14",
+    "Programming Language :: Python :: 3.15",
    "Topic :: Scientific/Engineering :: Artificial Intelligence",
    "Topic :: Multimedia :: Sound/Audio :: Speech"
 ]
@@ -27,23 +32,16 @@ dependencies = [
    "soundfile",
    "faster-whisper",
    "uvicorn",
-    "websockets"
+    "websockets",
+    "torchaudio>=2.0.0",
+    "torch>=2.0.0",
+    "tqdm",
+    "tiktoken",
+    'triton>=2.0.0; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
 ]

 [project.optional-dependencies]
-diarization = ["diart"]
-vac = ["torch"]
 sentence = ["mosestokenizer", "wtpsplit"]
-whisper = ["whisper"]
-whisper-timestamped = ["whisper-timestamped"]
-mlx-whisper = ["mlx-whisper"]
-openai = ["openai"]
-simulstreaming = [
-    "torch",
-    "tqdm",
-    "tiktoken",
-    'triton>=2.0.0,<3; platform_machine == "x86_64" and (sys_platform == "linux" or sys_platform == "linux2")'
-]

 [project.urls]
 Homepage = "https://github.com/QuentinFuxa/WhisperLiveKit"
@@ -55,5 +53,5 @@ whisperlivekit-server = "whisperlivekit.basic_server:main"
 packages = ["whisperlivekit", "whisperlivekit.diarization", "whisperlivekit.simul_whisper", "whisperlivekit.simul_whisper.whisper", "whisperlivekit.simul_whisper.whisper.assets", "whisperlivekit.simul_whisper.whisper.normalizers", "whisperlivekit.web", "whisperlivekit.whisper_streaming_custom"]

 [tool.setuptools.package-data]
-whisperlivekit = ["web/*.html"]
+whisperlivekit = ["web/*.html", "web/*.css", "web/*.js", "web/src/*.svg"]
 "whisperlivekit.simul_whisper.whisper.assets" = ["*.tiktoken", "*.npz"]
--- a/whisperlivekit/init.py
+++ b/whisperlivekit/init.py
@@ -1,12 +1,13 @@
 from .audio_processor import AudioProcessor
 from .core import TranscriptionEngine
 from .parse_args import parse_args
-from .web.web_interface import get_web_interface_html
+from .web.web_interface import get_web_interface_html, get_inline_ui_html

 __all__ = [
    "TranscriptionEngine",
    "AudioProcessor",
    "parse_args",
    "get_web_interface_html",
+    "get_inline_ui_html",
    "download_simulstreaming_backend",
 ]
--- a/whisperlivekit/audio_processor.py
+++ b/whisperlivekit/audio_processor.py
@@ -4,11 +4,11 @@ from time import time, sleep
 import math
 import logging
 import traceback
-from datetime import timedelta
-from whisperlivekit.timed_objects import ASRToken
-from whisperlivekit.core import TranscriptionEngine, online_factory
+from whisperlivekit.timed_objects import ASRToken, Silence, Line, FrontData, State
+from whisperlivekit.core import TranscriptionEngine, online_factory, online_diarization_factory, online_translation_factory
+from whisperlivekit.silero_vad_iterator import FixedVADIterator
+from whisperlivekit.results_formater import format_output
 from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
-from .remove_silences import handle_silences
 # Set up logging once
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
@@ -16,9 +16,16 @@ logger.setLevel(logging.DEBUG)

 SENTINEL = object() # unique sentinel object for end of stream marker

-def format_time(seconds: float) -> str:
-    """Format seconds as HH:MM:SS."""
-    return str(timedelta(seconds=int(seconds)))
+
+async def get_all_from_queue(queue):
+    items = []
+    try:
+        while True:
+            item = queue.get_nowait()
+            items.append(item)
+    except asyncio.QueueEmpty:
+        pass
+    return items

 class AudioProcessor:
    """
@@ -42,65 +49,75 @@ class AudioProcessor:
        self.bytes_per_sample = 2
        self.bytes_per_sec = self.samples_per_sec * self.bytes_per_sample
        self.max_bytes_per_sec = 32000 * 5  # 5 seconds of audio at 32 kHz
-        self.last_ffmpeg_activity = time()
-        self.ffmpeg_health_check_interval = 5
-        self.ffmpeg_max_idle_time = 10
+        self.is_pcm_input = self.args.pcm_input
+        self.debug = False

        # State management
        self.is_stopping = False
+        self.silence = False
+        self.silence_duration = 0.0
        self.tokens = []
+        self.translated_segments = []
        self.buffer_transcription = ""
        self.buffer_diarization = ""
        self.end_buffer = 0
        self.end_attributed_speaker = 0
        self.lock = asyncio.Lock()
-        self.beg_loop = time()
+        self.beg_loop = None #to deal with a potential little lag at the websocket initialization, this is now set in process_audio
        self.sep = " "  # Default separator
-        self.last_response_content = ""
+        self.last_response_content = FrontData()
        
        # Models and processing
        self.asr = models.asr
        self.tokenizer = models.tokenizer
-        self.diarization = models.diarization
-        
-        self.ffmpeg_manager = FFmpegManager(
-            sample_rate=self.sample_rate,
-            channels=self.channels
-        )
-        
-        async def handle_ffmpeg_error(error_type: str):
-            logger.error(f"FFmpeg error: {error_type}")
-            self._ffmpeg_error = error_type
-        
-        self.ffmpeg_manager.on_error_callback = handle_ffmpeg_error
+        self.vac_model = models.vac_model
+        if self.args.vac:
+            self.vac = FixedVADIterator(models.vac_model)
+        else:
+            self.vac = None
+                         
+        self.ffmpeg_manager = None
+        self.ffmpeg_reader_task = None
        self._ffmpeg_error = None
-        
+
+        if not self.is_pcm_input:
+            self.ffmpeg_manager = FFmpegManager(
+                sample_rate=self.sample_rate,
+                channels=self.channels
+            )
+            async def handle_ffmpeg_error(error_type: str):
+                logger.error(f"FFmpeg error: {error_type}")
+                self._ffmpeg_error = error_type
+            self.ffmpeg_manager.on_error_callback = handle_ffmpeg_error
+             
        self.transcription_queue = asyncio.Queue() if self.args.transcription else None
        self.diarization_queue = asyncio.Queue() if self.args.diarization else None
+        self.translation_queue = asyncio.Queue() if self.args.target_language else None
        self.pcm_buffer = bytearray()

-        # Task references
        self.transcription_task = None
        self.diarization_task = None
-        self.ffmpeg_reader_task = None
        self.watchdog_task = None
        self.all_tasks_for_cleanup = []
        
-        # Initialize transcription engine if enabled
        if self.args.transcription:
-            self.online = online_factory(self.args, models.asr, models.tokenizer)
+            self.online = online_factory(self.args, models.asr, models.tokenizer)        
+            self.sep = self.online.asr.sep   
+        if self.args.diarization:
+            self.diarization = online_diarization_factory(self.args, models.diarization_model)
+        if self.args.target_language:
+            self.online_translation = online_translation_factory(self.args, models.translation_model)

    def convert_pcm_to_float(self, pcm_buffer):
        """Convert PCM buffer in s16le format to normalized NumPy array."""
        return np.frombuffer(pcm_buffer, dtype=np.int16).astype(np.float32) / 32768.0

-    async def update_transcription(self, new_tokens, buffer, end_buffer, sep):
+    async def update_transcription(self, new_tokens, buffer, end_buffer):
        """Thread-safe update of transcription with new data."""
        async with self.lock:
            self.tokens.extend(new_tokens)
            self.buffer_transcription = buffer
            self.end_buffer = end_buffer
-            self.sep = sep
            
    async def update_diarization(self, end_attributed_speaker, buffer_diarization=""):
        """Thread-safe update of diarization with new data."""
@@ -112,7 +129,7 @@ class AudioProcessor:
    async def add_dummy_token(self):
        """Placeholder token when no transcription is available."""
        async with self.lock:
-            current_time = time() - self.beg_loop
+            current_time = time() - self.beg_loop if self.beg_loop else 0
            self.tokens.append(ASRToken(
                start=current_time, end=current_time + 1,
                text=".", speaker=-1, is_dummy=True
@@ -133,33 +150,36 @@ class AudioProcessor:
                latest_end = max(self.end_buffer, self.tokens[-1].end if self.tokens else 0)
                remaining_diarization = max(0, round(latest_end - self.end_attributed_speaker, 1))
                
-            return {
-                "tokens": self.tokens.copy(),
-                "buffer_transcription": self.buffer_transcription,
-                "buffer_diarization": self.buffer_diarization,
-                "end_buffer": self.end_buffer,
-                "end_attributed_speaker": self.end_attributed_speaker,
-                "sep": self.sep,
-                "remaining_time_transcription": remaining_transcription,
-                "remaining_time_diarization": remaining_diarization
-            }
+            return State(
+                tokens=self.tokens.copy(),
+                translated_segments=self.translated_segments.copy(),
+                buffer_transcription=self.buffer_transcription,
+                buffer_diarization=self.buffer_diarization,
+                end_buffer=self.end_buffer,
+                end_attributed_speaker=self.end_attributed_speaker,
+                remaining_time_transcription=remaining_transcription,
+                remaining_time_diarization=remaining_diarization
+            )
            
    async def reset(self):
        """Reset all state variables to initial values."""
        async with self.lock:
            self.tokens = []
+            self.translated_segments = []
            self.buffer_transcription = self.buffer_diarization = ""
            self.end_buffer = self.end_attributed_speaker = 0
            self.beg_loop = time()

    async def ffmpeg_stdout_reader(self):
-        """Read audio data from FFmpeg stdout and process it."""
+        """Read audio data from FFmpeg stdout and process it into the PCM pipeline."""
        beg = time()
-        
        while True:
            try:
-                # Check if FFmpeg is running
-                state = await self.ffmpeg_manager.get_state()
+                if self.is_stopping:
+                    logger.info("Stopping ffmpeg_stdout_reader due to stopping flag.")
+                    break
+
+                state = await self.ffmpeg_manager.get_state() if self.ffmpeg_manager else FFmpegState.STOPPED
                if state == FFmpegState.FAILED:
                    logger.error("FFmpeg is in FAILED state, cannot read data")
                    break
@@ -167,80 +187,47 @@ class AudioProcessor:
                    logger.info("FFmpeg is stopped")
                    break
                elif state != FFmpegState.RUNNING:
-                    logger.warning(f"FFmpeg is in {state} state, waiting...")
-                    await asyncio.sleep(0.5)
+                    await asyncio.sleep(0.1)
                    continue
-                
+
                current_time = time()
-                elapsed_time = math.floor((current_time - beg) * 10) / 10
-                buffer_size = max(int(32000 * elapsed_time), 4096)
+                elapsed_time = max(0.0, current_time - beg)
+                buffer_size = max(int(32000 * elapsed_time), 4096)  # dynamic read
                beg = current_time

                chunk = await self.ffmpeg_manager.read_data(buffer_size)
-                        
                if not chunk:
-                    if self.is_stopping:
-                        logger.info("FFmpeg stdout closed, stopping.")
-                        break
-                    else:
-                        # No data available, but not stopping - FFmpeg might be restarting
-                        await asyncio.sleep(0.1)
-                        continue
-                    
+                    # No data currently available
+                    await asyncio.sleep(0.05)
+                    continue
+
                self.pcm_buffer.extend(chunk)
+                await self.handle_pcm_data()

-                # Process when enough data
-                if len(self.pcm_buffer) >= self.bytes_per_sec:
-                    if len(self.pcm_buffer) > self.max_bytes_per_sec:
-                        logger.warning(
-                            f"Audio buffer too large: {len(self.pcm_buffer) / self.bytes_per_sec:.2f}s. "
-                            f"Consider using a smaller model."
-                        )
-
-                    # Process audio chunk
-                    pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
-                    self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
-                    
-                    # Send to transcription if enabled
-                    if self.args.transcription and self.transcription_queue:
-                        await self.transcription_queue.put(pcm_array.copy())
-
-                    # Send to diarization if enabled
-                    if self.args.diarization and self.diarization_queue:
-                        await self.diarization_queue.put(pcm_array.copy())
-
-                    # Sleep if no processing is happening
-                    if not self.args.transcription and not self.args.diarization:
-                        await asyncio.sleep(0.1)
-                    
+            except asyncio.CancelledError:
+                logger.info("ffmpeg_stdout_reader cancelled.")
+                break
            except Exception as e:
                logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
-                logger.warning(f"Traceback: {traceback.format_exc()}")
-                # Try to recover by waiting a bit
-                await asyncio.sleep(1)
-                
-                # Check if we should exit
-                if self.is_stopping:
-                    break
-        
-        logger.info("FFmpeg stdout processing finished. Signaling downstream processors.")
+                logger.debug(f"Traceback: {traceback.format_exc()}")
+                await asyncio.sleep(0.2)
+
+        logger.info("FFmpeg stdout processing finished. Signaling downstream processors if needed.")
        if self.args.transcription and self.transcription_queue:
            await self.transcription_queue.put(SENTINEL)
-            logger.debug("Sentinel put into transcription_queue.")
        if self.args.diarization and self.diarization_queue:
            await self.diarization_queue.put(SENTINEL)
-            logger.debug("Sentinel put into diarization_queue.")
-
+        if self.args.target_language and self.translation_queue:
+            await self.translation_queue.put(SENTINEL)

    async def transcription_processor(self):
        """Process audio chunks for transcription."""
-        self.sep = self.online.asr.sep
        cumulative_pcm_duration_stream_time = 0.0
        
        while True:
            try:
-                pcm_array = await self.transcription_queue.get()
-                if pcm_array is SENTINEL:
+                item = await self.transcription_queue.get()
+                if item is SENTINEL:
                    logger.debug("Transcription processor received sentinel. Finishing.")
                    self.transcription_queue.task_done()
                    break
@@ -252,19 +239,28 @@ class AudioProcessor:

                asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
                transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
-
-                logger.info(
-                    f"ASR processing: internal_buffer={asr_internal_buffer_duration_s:.2f}s, "
-                    f"lag={transcription_lag_s:.2f}s."
-                )
+                asr_processing_logs = f"internal_buffer={asr_internal_buffer_duration_s:.2f}s | lag={transcription_lag_s:.2f}s |"
+                if type(item) is Silence:
+                    asr_processing_logs += f" + Silence of = {item.duration:.2f}s"
+                    if self.tokens:
+                        asr_processing_logs += f" | last_end = {self.tokens[-1].end} |"
+                    logger.info(asr_processing_logs)
+                    cumulative_pcm_duration_stream_time += item.duration
+                    self.online.insert_silence(item.duration, self.tokens[-1].end if self.tokens else 0)
+                    continue
+                logger.info(asr_processing_logs)
                
-                # Process transcription
-                duration_this_chunk = len(pcm_array) / self.sample_rate if isinstance(pcm_array, np.ndarray) else 0
+                if isinstance(item, np.ndarray):
+                    pcm_array = item
+                else:
+                    raise Exception('item should be pcm_array')
+                
+                duration_this_chunk = len(pcm_array) / self.sample_rate
                cumulative_pcm_duration_stream_time += duration_this_chunk
                stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time

                self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
-                new_tokens, current_audio_processed_upto = self.online.process_iter()
+                new_tokens, current_audio_processed_upto = await asyncio.to_thread(self.online.process_iter)
                
                # Get buffer information
                _buffer_transcript_obj = self.online.get_buffer()
@@ -288,8 +284,13 @@ class AudioProcessor:
                new_end_buffer = max(candidate_end_times)
                
                await self.update_transcription(
-                    new_tokens, buffer_text, new_end_buffer, self.sep
+                    new_tokens, buffer_text, new_end_buffer
                )
+                
+                if self.translation_queue:
+                    for token in new_tokens:
+                        await self.translation_queue.put(token)
+                        
                self.transcription_queue.task_done()
                
            except Exception as e:
@@ -297,20 +298,36 @@ class AudioProcessor:
                logger.warning(f"Traceback: {traceback.format_exc()}")
                if 'pcm_array' in locals() and pcm_array is not SENTINEL : # Check if pcm_array was assigned from queue
                    self.transcription_queue.task_done()
+        
+        if self.is_stopping:
+            logger.info("Transcription processor finishing due to stopping flag.")
+            if self.diarization_queue:
+                await self.diarization_queue.put(SENTINEL)
+            if self.translation_queue:
+                await self.translation_queue.put(SENTINEL)
+
        logger.info("Transcription processor task finished.")


    async def diarization_processor(self, diarization_obj):
        """Process audio chunks for speaker diarization."""
        buffer_diarization = ""
-        
+        cumulative_pcm_duration_stream_time = 0.0
        while True:
            try:
-                pcm_array = await self.diarization_queue.get()
-                if pcm_array is SENTINEL:
+                item = await self.diarization_queue.get()
+                if item is SENTINEL:
                    logger.debug("Diarization processor received sentinel. Finishing.")
                    self.diarization_queue.task_done()
                    break
+                elif type(item) is Silence:
+                    cumulative_pcm_duration_stream_time += item.duration
+                    diarization_obj.insert_silence(item.duration)
+                    continue
+                elif isinstance(item, np.ndarray):
+                    pcm_array = item
+                else:
+                    raise Exception('item should be pcm_array') 
                
                # Process diarization
                await diarization_obj.diarize(pcm_array)
@@ -334,127 +351,117 @@ class AudioProcessor:
                    self.diarization_queue.task_done()
        logger.info("Diarization processor task finished.")

+    async def translation_processor(self, online_translation):
+        # the idea is to ignore diarization for the moment. We use only transcription tokens. 
+        # And the speaker is attributed given the segments used for the translation
+        # in the future we want to have different languages for each speaker etc, so it will be more complex.
+        while True:
+            try:
+                item = await self.translation_queue.get() #block until at least 1 token
+                if item is SENTINEL:
+                    logger.debug("Translation processor received sentinel. Finishing.")
+                    self.translation_queue.task_done()
+                    break
+                elif type(item) is Silence:
+                    online_translation.insert_silence(item.duration)
+                    continue
+                
+                # get all the available tokens for translation. The more words, the more precise
+                tokens_to_process = [item]
+                additional_tokens = await get_all_from_queue(self.translation_queue)
+                
+                sentinel_found = False
+                for additional_token in additional_tokens:
+                    if additional_token is SENTINEL:
+                        sentinel_found = True
+                        break
+                    tokens_to_process.append(additional_token)                
+                if tokens_to_process:
+                    online_translation.insert_tokens(tokens_to_process)
+                    self.translated_segments = await asyncio.to_thread(online_translation.process)
+                
+                self.translation_queue.task_done()
+                for _ in additional_tokens:
+                    self.translation_queue.task_done()
+                
+                if sentinel_found:
+                    logger.debug("Translation processor received sentinel in batch. Finishing.")
+                    break
+                
+            except Exception as e:
+                logger.warning(f"Exception in translation_processor: {e}")
+                logger.warning(f"Traceback: {traceback.format_exc()}")
+                if 'token' in locals() and item is not SENTINEL:
+                    self.translation_queue.task_done()
+                if 'additional_tokens' in locals():
+                    for _ in additional_tokens:
+                        self.translation_queue.task_done()
+        logger.info("Translation processor task finished.")

    async def results_formatter(self):
        """Format processing results for output."""
-        last_sent_trans = None
-        last_sent_diar = None
        while True:
            try:
-                ffmpeg_state = await self.ffmpeg_manager.get_state()
-                if ffmpeg_state == FFmpegState.FAILED and self._ffmpeg_error:
-                    yield {
-                        "status": "error",
-                        "error": f"FFmpeg error: {self._ffmpeg_error}",
-                        "lines": [],
-                        "buffer_transcription": "",
-                        "buffer_diarization": "",
-                        "remaining_time_transcription": 0,
-                        "remaining_time_diarization": 0
-                    }
+                # If FFmpeg error occurred, notify front-end
+                if self._ffmpeg_error:
+                    yield FrontData(
+                        status="error",
+                        error=f"FFmpeg error: {self._ffmpeg_error}"
+                    )
                    self._ffmpeg_error = None
                    await asyncio.sleep(1)
                    continue
-                
+
                # Get current state
                state = await self.get_current_state()
-                tokens = state["tokens"]
-                buffer_transcription = state["buffer_transcription"]
-                buffer_diarization = state["buffer_diarization"]
-                end_attributed_speaker = state["end_attributed_speaker"]
-                sep = state["sep"]
-                
+                                
                # Add dummy tokens if needed
-                if (not tokens or tokens[-1].is_dummy) and not self.args.transcription and self.args.diarization:
+                if (not state.tokens or state.tokens[-1].is_dummy) and not self.args.transcription and self.args.diarization:
                    await self.add_dummy_token()
                    sleep(0.5)
                    state = await self.get_current_state()
-                    tokens = state["tokens"]
                
                # Format output
-                previous_speaker = -1
-                lines = []
-                last_end_diarized = 0
-                undiarized_text = []
-                current_time = time() - self.beg_loop
-                tokens = handle_silences(tokens, current_time)
-                for token in tokens:
-                    speaker = token.speaker
-                    
-                    # Handle diarization
-                    if self.args.diarization:
-                        if (speaker in [-1, 0]) and token.end >= end_attributed_speaker:
-                            undiarized_text.append(token.text)
-                            continue
-                        elif (speaker in [-1, 0]) and token.end < end_attributed_speaker:
-                            speaker = previous_speaker
-                        if speaker not in [-1, 0]:
-                            last_end_diarized = max(token.end, last_end_diarized)
-
-                    # Group by speaker
-                    if speaker != previous_speaker or not lines:
-                        lines.append({
-                            "speaker": speaker,
-                            "text": token.text,
-                            "beg": format_time(token.start),
-                            "end": format_time(token.end),
-                            "diff": round(token.end - last_end_diarized, 2)
-                        })
-                        previous_speaker = speaker
-                    elif token.text:  # Only append if text isn't empty
-                        lines[-1]["text"] += sep + token.text
-                        lines[-1]["end"] = format_time(token.end)
-                        lines[-1]["diff"] = round(token.end - last_end_diarized, 2)
-                
+                lines, undiarized_text, buffer_transcription, buffer_diarization = format_output(
+                    state,
+                    self.silence,
+                    current_time = time() - self.beg_loop if self.beg_loop else None,
+                    args = self.args,
+                    debug = self.debug,
+                    sep=self.sep
+                )
                # Handle undiarized text
                if undiarized_text:
-                    combined = sep.join(undiarized_text)
+                    combined = self.sep.join(undiarized_text)
                    if buffer_transcription:
-                        combined += sep
-                    await self.update_diarization(end_attributed_speaker, combined)
+                        combined += self.sep
+                    await self.update_diarization(state.end_attributed_speaker, combined)
                    buffer_diarization = combined
                
                response_status = "active_transcription"
-                final_lines_for_response = lines.copy()
-
-                if not tokens and not buffer_transcription and not buffer_diarization:
+                if not state.tokens and not buffer_transcription and not buffer_diarization:
                    response_status = "no_audio_detected"
-                    final_lines_for_response = []
-                elif response_status == "active_transcription" and not final_lines_for_response:
-                    final_lines_for_response = [{
-                        "speaker": 1,
-                        "text": "",
-                        "beg": format_time(state.get("end_buffer", 0)),
-                        "end": format_time(state.get("end_buffer", 0)),
-                        "diff": 0
-                    }]
+                    lines = []
+                elif not lines:
+                    lines = [Line(
+                        speaker=1,
+                        start=state.end_buffer,
+                        end=state.end_buffer
+                    )]
                
-                response = {
-                    "status": response_status,
-                    "lines": final_lines_for_response,
-                    "buffer_transcription": buffer_transcription,
-                    "buffer_diarization": buffer_diarization,
-                    "remaining_time_transcription": state["remaining_time_transcription"],
-                    "remaining_time_diarization": state["remaining_time_diarization"]
-                }
-                
-                current_response_signature = f"{response_status} | " + \
-                                           ' '.join([f"{line['speaker']} {line['text']}" for line in final_lines_for_response]) + \
-                                           f" | {buffer_transcription} | {buffer_diarization}"
-                
-                trans = state["remaining_time_transcription"]
-                diar = state["remaining_time_diarization"]
-                should_push = (
-                    current_response_signature != self.last_response_content
-                    or last_sent_trans is None
-                    or round(trans, 1) != round(last_sent_trans, 1)
-                    or round(diar, 1) != round(last_sent_diar, 1)
+                response = FrontData(
+                    status=response_status,
+                    lines=lines,
+                    buffer_transcription=buffer_transcription,
+                    buffer_diarization=buffer_diarization,
+                    remaining_time_transcription=state.remaining_time_transcription,
+                    remaining_time_diarization=state.remaining_time_diarization if self.args.diarization else 0
                )
-                if should_push and (final_lines_for_response or buffer_transcription or buffer_diarization or response_status == "no_audio_detected" or trans > 0 or diar > 0):
+                                
+                should_push = (response != self.last_response_content)
+                if should_push and (lines or buffer_transcription or buffer_diarization or response_status == "no_audio_detected"):
                    yield response
-                    self.last_response_content = current_response_signature
-                    last_sent_trans = trans
-                    last_sent_diar = diar
+                    self.last_response_content = response
                
                # Check for termination condition
                if self.is_stopping:
@@ -466,35 +473,34 @@ class AudioProcessor:
                    
                    if all_processors_done:
                        logger.info("Results formatter: All upstream processors are done and in stopping state. Terminating.")
-                        final_state = await self.get_current_state()
                        return
                
-                await asyncio.sleep(0.1)  # Avoid overwhelming the client
+                await asyncio.sleep(0.05)
                
            except Exception as e:
                logger.warning(f"Exception in results_formatter: {e}")
                logger.warning(f"Traceback: {traceback.format_exc()}")
-                await asyncio.sleep(0.5)  # Back off on error
+                await asyncio.sleep(0.5)
        
    async def create_tasks(self):
        """Create and start processing tasks."""
        self.all_tasks_for_cleanup = []
        processing_tasks_for_watchdog = []

-        success = await self.ffmpeg_manager.start()
-        if not success:
-            logger.error("Failed to start FFmpeg manager")
-            async def error_generator():
-                yield {
-                    "status": "error", 
-                    "error": "FFmpeg failed to start. Please check that FFmpeg is installed.",
-                    "lines": [],
-                    "buffer_transcription": "",
-                    "buffer_diarization": "",
-                    "remaining_time_transcription": 0,
-                    "remaining_time_diarization": 0
-                }
-            return error_generator()
+        # If using FFmpeg (non-PCM input), start it and spawn stdout reader
+        if not self.is_pcm_input:
+            success = await self.ffmpeg_manager.start()
+            if not success:
+                logger.error("Failed to start FFmpeg manager")
+                async def error_generator():
+                    yield FrontData(
+                        status="error",
+                        error="FFmpeg failed to start. Please check that FFmpeg is installed."
+                    )
+                return error_generator()
+            self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
+            self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
+            processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)

        if self.args.transcription and self.online:
            self.transcription_task = asyncio.create_task(self.transcription_processor())
@@ -506,10 +512,11 @@ class AudioProcessor:
            self.all_tasks_for_cleanup.append(self.diarization_task)
            processing_tasks_for_watchdog.append(self.diarization_task)
        
-        self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
-        self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
-        processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)
-
+        if self.args.target_language and self.args.lan != 'auto':
+            self.translation_task = asyncio.create_task(self.translation_processor(self.online_translation))
+            self.all_tasks_for_cleanup.append(self.translation_task)
+            processing_tasks_for_watchdog.append(self.translation_task)
+        
        # Monitor overall system health
        self.watchdog_task = asyncio.create_task(self.watchdog(processing_tasks_for_watchdog))
        self.all_tasks_for_cleanup.append(self.watchdog_task)
@@ -530,15 +537,6 @@ class AudioProcessor:
                            logger.error(f"{task_name} unexpectedly completed with exception: {exc}")
                        else:
                            logger.info(f"{task_name} completed normally.")
-                
-                # Check FFmpeg status through the manager
-                ffmpeg_state = await self.ffmpeg_manager.get_state()
-                if ffmpeg_state == FFmpegState.FAILED:
-                    logger.error("FFmpeg is in FAILED state, notifying results formatter")
-                    # FFmpeg manager will handle its own recovery
-                elif ffmpeg_state == FFmpegState.STOPPED and not self.is_stopping:
-                    logger.warning("FFmpeg unexpectedly stopped, attempting restart")
-                    await self.ffmpeg_manager.restart()
                    
            except asyncio.CancelledError:
                logger.info("Watchdog task cancelled.")
@@ -548,39 +546,114 @@ class AudioProcessor:
        
    async def cleanup(self):
        """Clean up resources when processing is complete."""
-        logger.info("Starting cleanup of AudioProcessor resources.")        
+        logger.info("Starting cleanup of AudioProcessor resources.")
+        self.is_stopping = True
        for task in self.all_tasks_for_cleanup:
            if task and not task.done():
                task.cancel()
-        
-        created_tasks = [t for t in self.all_tasks_for_cleanup if t]
-        if created_tasks:
-            await asyncio.gather(*created_tasks, return_exceptions=True)
-        logger.info("All processing tasks cancelled or finished.")
-        await self.ffmpeg_manager.stop()
-        logger.info("FFmpeg manager stopped.")
-        if self.args.diarization and hasattr(self, 'diarization') and hasattr(self.diarization, 'close'):
-            self.diarization.close()
-        logger.info("AudioProcessor cleanup complete.")
+            
+            created_tasks = [t for t in self.all_tasks_for_cleanup if t]
+            if created_tasks:
+                await asyncio.gather(*created_tasks, return_exceptions=True)
+            logger.info("All processing tasks cancelled or finished.")
+
+            if not self.is_pcm_input and self.ffmpeg_manager:
+                try:
+                    await self.ffmpeg_manager.stop()
+                    logger.info("FFmpeg manager stopped.")
+                except Exception as e:
+                    logger.warning(f"Error stopping FFmpeg manager: {e}")
+            if self.args.diarization and hasattr(self, 'dianization') and hasattr(self.diarization, 'close'):
+                self.diarization.close()
+            logger.info("AudioProcessor cleanup complete.")


    async def process_audio(self, message):
        """Process incoming audio data."""
+
+        if not self.beg_loop:
+            self.beg_loop = time()
+
        if not message:
            logger.info("Empty audio message received, initiating stop sequence.")
            self.is_stopping = True
-            # Signal FFmpeg manager to stop accepting data
-            await self.ffmpeg_manager.stop()
+             
+            if self.transcription_queue:
+                await self.transcription_queue.put(SENTINEL)
+
+            if not self.is_pcm_input and self.ffmpeg_manager:
+                await self.ffmpeg_manager.stop()
+
            return

        if self.is_stopping:
            logger.warning("AudioProcessor is stopping. Ignoring incoming audio.")
            return

-        success = await self.ffmpeg_manager.write_data(message)
-        if not success:
-            ffmpeg_state = await self.ffmpeg_manager.get_state()
-            if ffmpeg_state == FFmpegState.FAILED:
-                logger.error("FFmpeg is in FAILED state, cannot process audio")
-            else:
-                logger.warning("Failed to write audio data to FFmpeg")
+        if self.is_pcm_input:
+            self.pcm_buffer.extend(message)
+            await self.handle_pcm_data()
+        else:
+            if not self.ffmpeg_manager:
+                logger.error("FFmpeg manager not initialized for non-PCM input.")
+                return
+            success = await self.ffmpeg_manager.write_data(message)
+            if not success:
+                ffmpeg_state = await self.ffmpeg_manager.get_state()
+                if ffmpeg_state == FFmpegState.FAILED:
+                    logger.error("FFmpeg is in FAILED state, cannot process audio")
+                else:
+                    logger.warning("Failed to write audio data to FFmpeg")
+
+    async def handle_pcm_data(self):
+        # Process when enough data
+        if len(self.pcm_buffer) < self.bytes_per_sec:
+            return
+
+        if len(self.pcm_buffer) > self.max_bytes_per_sec:
+            logger.warning(
+                f"Audio buffer too large: {len(self.pcm_buffer) / self.bytes_per_sec:.2f}s. "
+                f"Consider using a smaller model."
+            )
+
+        # Process audio chunk
+        pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
+        self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
+
+        res = None
+        end_of_audio = False
+        silence_buffer = None
+
+        if self.args.vac:
+            res = self.vac(pcm_array)
+
+        if res is not None:
+            if res.get("end", 0) > res.get("start", 0):
+                end_of_audio = True
+            elif self.silence: #end of silence
+                self.silence = False
+                silence_buffer = Silence(duration=time() - self.start_silence)
+
+        if silence_buffer:
+            if self.args.transcription and self.transcription_queue:
+                await self.transcription_queue.put(silence_buffer)
+            if self.args.diarization and self.diarization_queue:
+                await self.diarization_queue.put(silence_buffer)
+            if self.translation_queue:
+                await self.translation_queue.put(silence_buffer)
+
+        if not self.silence:
+            if self.args.transcription and self.transcription_queue:
+                await self.transcription_queue.put(pcm_array.copy())
+
+            if self.args.diarization and self.diarization_queue:
+                await self.diarization_queue.put(pcm_array.copy())
+
+            self.silence_duration = 0.0
+
+            if end_of_audio:
+                self.silence = True
+                self.start_silence = time()
+
+        if not self.args.transcription and not self.args.diarization:
+            await asyncio.sleep(0.1)
--- a/whisperlivekit/basic_server.py
+++ b/whisperlivekit/basic_server.py
@@ -2,9 +2,12 @@ from contextlib import asynccontextmanager
 from fastapi import FastAPI, WebSocket, WebSocketDisconnect
 from fastapi.responses import HTMLResponse
 from fastapi.middleware.cors import CORSMiddleware
-from whisperlivekit import TranscriptionEngine, AudioProcessor, get_web_interface_html, parse_args
+from whisperlivekit import TranscriptionEngine, AudioProcessor, get_inline_ui_html, parse_args
 import asyncio
 import logging
+from starlette.staticfiles import StaticFiles
+import pathlib
+import whisperlivekit.web as webpkg

 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logging.getLogger().setLevel(logging.WARNING)
@@ -15,7 +18,7 @@ args = parse_args()
 transcription_engine = None

@asynccontextmanager
-async def lifespan(app: FastAPI):
+async def lifespan(app: FastAPI):    
    global transcription_engine
    transcription_engine = TranscriptionEngine(
        **vars(args),
@@ -30,24 +33,26 @@ app.add_middleware(
    allow_methods=["*"],
    allow_headers=["*"],
 )
+web_dir = pathlib.Path(webpkg.__file__).parent
+app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")

@app.get("/")
 async def get():
-    return HTMLResponse(get_web_interface_html())
+    return HTMLResponse(get_inline_ui_html())


 async def handle_websocket_results(websocket, results_generator):
    """Consumes results from the audio processor and sends them via WebSocket."""
    try:
        async for response in results_generator:
-            await websocket.send_json(response)
+            await websocket.send_json(response.to_dict())
        # when the results_generator finishes it means all audio has been processed
        logger.info("Results generator finished. Sending 'ready_to_stop' to client.")
        await websocket.send_json({"type": "ready_to_stop"})
    except WebSocketDisconnect:
        logger.info("WebSocket disconnected while handling results (client likely closed connection).")
    except Exception as e:
-        logger.warning(f"Error in WebSocket results handler: {e}")
+        logger.exception(f"Error in WebSocket results handler: {e}")


@app.websocket("/asr")
@@ -58,6 +63,11 @@ async def websocket_endpoint(websocket: WebSocket):
    )
    await websocket.accept()
    logger.info("WebSocket connection opened.")
+
+    try:
+        await websocket.send_json({"type": "config", "useAudioWorklet": bool(args.pcm_input)})
+    except Exception as e:
+        logger.warning(f"Failed to send config to client: {e}")
            
    results_generator = await audio_processor.create_tasks()
    websocket_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
--- a/whisperlivekit/core.py
+++ b/whisperlivekit/core.py
@@ -1,10 +1,10 @@
 try:
    from whisperlivekit.whisper_streaming_custom.whisper_online import backend_factory
-    from whisperlivekit.whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
+    from whisperlivekit.whisper_streaming_custom.online_asr import OnlineASRProcessor
 except ImportError:
    from .whisper_streaming_custom.whisper_online import backend_factory
-    from .whisper_streaming_custom.online_asr import VACOnlineASRProcessor, OnlineASRProcessor
-from whisperlivekit.warmup import warmup_asr, warmup_online
+    from .whisper_streaming_custom.online_asr import OnlineASRProcessor
+from whisperlivekit.warmup import warmup_asr
 from argparse import Namespace
 import sys

@@ -33,23 +33,28 @@ class TranscriptionEngine:
            "model_dir": None,
            "lan": "auto",
            "task": "transcribe",
+            "target_language": "",
            "backend": "faster-whisper",
-            "vac": False,
+            "vac": True,
            "vac_chunk_size": 0.04,
            "log_level": "DEBUG",
            "ssl_certfile": None,
            "ssl_keyfile": None,
            "transcription": True,
            "vad": True,
+            "pcm_input": False,
+            
            # whisperstreaming params:
            "buffer_trimming": "segment",
            "confidence_validation": False,
            "buffer_trimming_sec": 15,
+            
            # simulstreaming params:
+            "disable_fast_encoder": False,
            "frame_threshold": 25,
            "beams": 1,
            "decoder_type": None,
-            "audio_max_len": 30.0,
+            "audio_max_len": 20.0,
            "audio_min_len": 0.0,
            "cif_ckpt_path": None,
            "never_fire": False,
@@ -57,10 +62,16 @@ class TranscriptionEngine:
            "static_init_prompt": None,
            "max_context_tokens": None,
            "model_path": './base.pt',
-            # diart params:
+            "diarization_backend": "sortformer",
+            
+            # diarization params:
+            "disable_punctuation_split" : False,
            "segmentation_model": "pyannote/segmentation-3.0",
-            "embedding_model": "pyannote/embedding",
-
+            "embedding_model": "pyannote/embedding",  
+            
+            # translation params:
+            "nllb_backend": "ctranslate2",
+            "nllb_size": "600M"
        }

        config_dict = {**defaults, **kwargs}
@@ -69,6 +80,8 @@ class TranscriptionEngine:
            config_dict['transcription'] = not kwargs['no_transcription']
        if 'no_vad' in kwargs:
            config_dict['vad'] = not kwargs['no_vad']
+        if 'no_vac' in kwargs:
+            config_dict['vac'] = not kwargs['no_vac']
        
        config_dict.pop('no_transcription', None)
        config_dict.pop('no_vad', None)
@@ -82,15 +95,20 @@ class TranscriptionEngine:
        self.asr = None
        self.tokenizer = None
        self.diarization = None
+        self.vac_model = None
+        
+        if self.args.vac:
+            import torch
+            self.vac_model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")            
        
        if self.args.transcription:
            if self.args.backend == "simulstreaming": 
-                from simul_whisper import SimulStreamingASR
+                from whisperlivekit.simul_whisper import SimulStreamingASR
                self.tokenizer = None
                simulstreaming_kwargs = {}
                for attr in ['frame_threshold', 'beams', 'decoder_type', 'audio_max_len', 'audio_min_len', 
                            'cif_ckpt_path', 'never_fire', 'init_prompt', 'static_init_prompt', 
-                            'max_context_tokens', 'model_path']:
+                            'max_context_tokens', 'model_path', 'warmup_file', 'preload_model_count', 'disable_fast_encoder']:
                    if hasattr(self.args, attr):
                        simulstreaming_kwargs[attr] = getattr(self.args, attr)
        
@@ -109,37 +127,40 @@ class TranscriptionEngine:

            else:
                self.asr, self.tokenizer = backend_factory(self.args)
-            warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here
+                warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here

        if self.args.diarization:
-            from whisperlivekit.diarization.diarization_online import DiartDiarization
-            self.diarization = DiartDiarization(
-                block_duration=self.args.min_chunk_size,
-                segmentation_model_name=self.args.segmentation_model,
-                embedding_model_name=self.args.embedding_model
-            )
-            
+            if self.args.diarization_backend == "diart":
+                from whisperlivekit.diarization.diart_backend import DiartDiarization
+                self.diarization_model = DiartDiarization(
+                    block_duration=self.args.min_chunk_size,
+                    segmentation_model_name=self.args.segmentation_model,
+                    embedding_model_name=self.args.embedding_model
+                )
+            elif self.args.diarization_backend == "sortformer":
+                from whisperlivekit.diarization.sortformer_backend import SortformerDiarization
+                self.diarization_model = SortformerDiarization()
+            else:
+                raise ValueError(f"Unknown diarization backend: {self.args.diarization_backend}")
+        
+        self.translation_model = None
+        if self.args.target_language:
+            if self.args.lan == 'auto':
+                raise Exception('Translation cannot be set with language auto')
+            else:
+                from whisperlivekit.translation.translation import load_model
+                self.translation_model = load_model([self.args.lan], backend=self.args.nllb_backend, model_size=self.args.nllb_size) #in the future we want to handle different languages for different speakers
        TranscriptionEngine._initialized = True



 def online_factory(args, asr, tokenizer, logfile=sys.stderr):
    if args.backend == "simulstreaming":    
-        from simul_whisper import SimulStreamingOnlineProcessor
+        from whisperlivekit.simul_whisper import SimulStreamingOnlineProcessor
        online = SimulStreamingOnlineProcessor(
            asr,
            logfile=logfile,
        )
-        # warmup_online(online, args.warmup_file)
-    elif args.vac:
-        online = VACOnlineASRProcessor(
-            args.min_chunk_size,
-            asr,
-            tokenizer,
-            logfile=logfile,
-            buffer_trimming=(args.buffer_trimming, args.buffer_trimming_sec),
-            confidence_validation = args.confidence_validation
-        )
    else:
        online = OnlineASRProcessor(
            asr,
@@ -149,4 +170,22 @@ def online_factory(args, asr, tokenizer, logfile=sys.stderr):
            confidence_validation = args.confidence_validation
        )
    return online
-  
+  
+  
+def online_diarization_factory(args, diarization_backend):
+    if args.diarization_backend == "diart":
+        online = diarization_backend
+        # Not the best here, since several user/instances will share the same backend, but diart is not SOTA anymore and sortformer is recommended
+    
+    if args.diarization_backend == "sortformer":
+        from whisperlivekit.diarization.sortformer_backend import SortformerDiarizationOnline
+        online = SortformerDiarizationOnline(shared_model=diarization_backend)
+    return online
+
+
+def online_translation_factory(args, translation_model):
+    #should be at speaker level in the future:
+    #one shared nllb model for all speaker
+    #one tokenizer per speaker/language
+    from whisperlivekit.translation.translation import OnlineTranslation
+    return OnlineTranslation(translation_model, [args.lan], [args.target_language])
--- a/whisperlivekit/diarization/diarization_online.py
+++ b/whisperlivekit/diarization/diarization_online.py
@@ -29,6 +29,7 @@ class DiarizationObserver(Observer):
        self.speaker_segments = []
        self.processed_time = 0
        self.segment_lock = threading.Lock()
+        self.global_time_offset = 0.0
    
    def on_next(self, value: Tuple[Annotation, Any]):
        annotation, audio = value
@@ -49,8 +50,8 @@ class DiarizationObserver(Observer):
                        print(f"  {speaker}: {start:.2f}s-{end:.2f}s")
                        self.speaker_segments.append(SpeakerSegment(
                            speaker=speaker,
-                            start=start,
-                            end=end
+                            start=start + self.global_time_offset,
+                            end=end + self.global_time_offset
                        ))
            else:
                logger.debug("\nNo speakers detected in this segment")
@@ -199,6 +200,9 @@ class DiartDiarization:
        self.inference.attach_observers(self.observer)
        asyncio.get_event_loop().run_in_executor(None, self.inference)

+    def insert_silence(self, silence_duration):
+        self.observer.global_time_offset += silence_duration
+
    async def diarize(self, pcm_array: np.ndarray):
        """
        Process audio data for diarization.
--- a/whisperlivekit/diarization/sortformer_backend.py
+++ b/whisperlivekit/diarization/sortformer_backend.py
@@ -0,0 +1,465 @@
+import numpy as np
+import torch
+import logging
+import threading
+import time
+import wave
+from typing import List, Optional
+from queue import SimpleQueue, Empty
+
+from whisperlivekit.timed_objects import SpeakerSegment
+
+logger = logging.getLogger(__name__)
+
+try:
+    from nemo.collections.asr.models import SortformerEncLabelModel
+    from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor
+except ImportError:
+    raise SystemExit("""Please use `pip install "git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]"` to use the Sortformer diarization""")
+
+
+class StreamingSortformerState:
+    """
+    This class creates a class instance that will be used to store the state of the
+    streaming Sortformer model.
+
+    Attributes:
+        spkcache (torch.Tensor): Speaker cache to store embeddings from start
+        spkcache_lengths (torch.Tensor): Lengths of the speaker cache
+        spkcache_preds (torch.Tensor): The speaker predictions for the speaker cache parts
+        fifo (torch.Tensor): FIFO queue to save the embedding from the latest chunks
+        fifo_lengths (torch.Tensor): Lengths of the FIFO queue
+        fifo_preds (torch.Tensor): The speaker predictions for the FIFO queue parts
+        spk_perm (torch.Tensor): Speaker permutation information for the speaker cache
+        mean_sil_emb (torch.Tensor): Mean silence embedding
+        n_sil_frames (torch.Tensor): Number of silence frames
+    """
+
+    def __init__(self):
+        self.spkcache = None  # Speaker cache to store embeddings from start
+        self.spkcache_lengths = None
+        self.spkcache_preds = None  # speaker cache predictions
+        self.fifo = None  # to save the embedding from the latest chunks
+        self.fifo_lengths = None
+        self.fifo_preds = None
+        self.spk_perm = None
+        self.mean_sil_emb = None
+        self.n_sil_frames = None
+
+
+class SortformerDiarization:
+    def __init__(self, model_name: str = "nvidia/diar_streaming_sortformer_4spk-v2"):
+        """
+        Stores the shared streaming Sortformer diarization model. Used when a new online_diarization is initialized.
+        """
+        self._load_model(model_name)
+    
+    def _load_model(self, model_name: str):
+        """Load and configure the Sortformer model for streaming."""
+        try:
+            self.diar_model = SortformerEncLabelModel.from_pretrained(model_name)
+            self.diar_model.eval()
+
+            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+            self.diar_model.to(device)
+            
+            ## to test
+            # for name, param in self.diar_model.named_parameters():
+            #     if param.device != device:
+            #         raise RuntimeError(f"Parameter {name} is on {param.device} but should be on {device}")
+            
+            logger.info(f"Using {device.type.upper()} for Sortformer model")
+
+            self.diar_model.sortformer_modules.chunk_len = 10
+            self.diar_model.sortformer_modules.subsampling_factor = 10
+            self.diar_model.sortformer_modules.chunk_right_context = 0
+            self.diar_model.sortformer_modules.chunk_left_context = 10
+            self.diar_model.sortformer_modules.spkcache_len = 188
+            self.diar_model.sortformer_modules.fifo_len = 188
+            self.diar_model.sortformer_modules.spkcache_update_period = 144
+            self.diar_model.sortformer_modules.log = False
+            self.diar_model.sortformer_modules._check_streaming_parameters()
+                        
+        except Exception as e:
+            logger.error(f"Failed to load Sortformer model: {e}")
+            raise
+ 
+class SortformerDiarizationOnline:
+    def __init__(self, shared_model, sample_rate: int = 16000):
+        """
+        Initialize the streaming Sortformer diarization system.
+        
+        Args:
+            sample_rate: Audio sample rate (default: 16000)
+            model_name: Pre-trained model name (default: "nvidia/diar_streaming_sortformer_4spk-v2")
+        """
+        self.sample_rate = sample_rate
+        self.speaker_segments = []
+        self.buffer_audio = np.array([], dtype=np.float32)
+        self.segment_lock = threading.Lock()
+        self.global_time_offset = 0.0
+        self.processed_time = 0.0
+        self.debug = False
+                
+        self.diar_model = shared_model.diar_model
+             
+        self.audio2mel = AudioToMelSpectrogramPreprocessor(
+            window_size=0.025,
+            normalize="NA",
+            n_fft=512,
+            features=128,
+            pad_to=0
+        )
+        self.audio2mel.to(self.diar_model.device)
+        
+        self.chunk_duration_seconds = (
+            self.diar_model.sortformer_modules.chunk_len * 
+            self.diar_model.sortformer_modules.subsampling_factor * 
+            self.diar_model.preprocessor._cfg.window_stride
+        )
+        
+        self._init_streaming_state()
+        
+        self._previous_chunk_features = None
+        self._chunk_index = 0
+        self._len_prediction = None
+        
+        # Audio buffer to store PCM chunks for debugging
+        self.audio_buffer = []
+        
+        # Buffer for accumulating audio chunks until reaching chunk_duration_seconds
+        self.audio_chunk_buffer = []
+        self.accumulated_duration = 0.0
+        
+        logger.info("SortformerDiarization initialized successfully")
+
+
+    def _init_streaming_state(self):
+        """Initialize the streaming state for the model."""
+        batch_size = 1
+        device = self.diar_model.device
+        
+        self.streaming_state = StreamingSortformerState()
+        self.streaming_state.spkcache = torch.zeros(
+            (batch_size, self.diar_model.sortformer_modules.spkcache_len, self.diar_model.sortformer_modules.fc_d_model), 
+            device=device
+        )
+        self.streaming_state.spkcache_preds = torch.zeros(
+            (batch_size, self.diar_model.sortformer_modules.spkcache_len, self.diar_model.sortformer_modules.n_spk), 
+            device=device
+        )
+        self.streaming_state.spkcache_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+        self.streaming_state.fifo = torch.zeros(
+            (batch_size, self.diar_model.sortformer_modules.fifo_len, self.diar_model.sortformer_modules.fc_d_model), 
+            device=device
+        )
+        self.streaming_state.fifo_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+        self.streaming_state.mean_sil_emb = torch.zeros((batch_size, self.diar_model.sortformer_modules.fc_d_model), device=device)
+        self.streaming_state.n_sil_frames = torch.zeros((batch_size,), dtype=torch.long, device=device)
+        
+        # Initialize total predictions tensor
+        self.total_preds = torch.zeros((batch_size, 0, self.diar_model.sortformer_modules.n_spk), device=device)
+
+    def insert_silence(self, silence_duration: float):
+        """
+        Insert silence period by adjusting the global time offset.
+        
+        Args:
+            silence_duration: Duration of silence in seconds
+        """
+        with self.segment_lock:
+            self.global_time_offset += silence_duration
+        logger.debug(f"Inserted silence of {silence_duration:.2f}s, new offset: {self.global_time_offset:.2f}s")
+
+    async def diarize(self, pcm_array: np.ndarray):
+        """
+        Process audio data for diarization in streaming fashion.
+        
+        Args:
+            pcm_array: Audio data as numpy array
+        """
+        try:
+            if self.debug:
+                self.audio_buffer.append(pcm_array.copy())
+
+            threshold = int(self.chunk_duration_seconds * self.sample_rate)
+            
+            self.buffer_audio = np.concatenate([self.buffer_audio, pcm_array.copy()])
+            if not len(self.buffer_audio) >= threshold:
+                return
+            
+            audio = self.buffer_audio[:threshold]
+            self.buffer_audio = self.buffer_audio[threshold:]
+            
+            device = self.diar_model.device
+            audio_signal_chunk = torch.tensor(audio, device=device).unsqueeze(0)
+            audio_signal_length_chunk = torch.tensor([audio_signal_chunk.shape[1]], device=device)
+            
+            processed_signal_chunk, processed_signal_length_chunk = self.audio2mel.get_features(
+                audio_signal_chunk, audio_signal_length_chunk
+            )
+            processed_signal_chunk = processed_signal_chunk.to(device)
+            processed_signal_length_chunk = processed_signal_length_chunk.to(device)
+            
+            if self._previous_chunk_features is not None:
+                to_add = self._previous_chunk_features[:, :, -99:].to(device)
+                total_features = torch.concat([to_add, processed_signal_chunk], dim=2).to(device)
+            else:
+                total_features = processed_signal_chunk.to(device)
+            
+            self._previous_chunk_features = processed_signal_chunk.to(device)
+            
+            chunk_feat_seq_t = torch.transpose(total_features, 1, 2).to(device)
+            
+            with torch.inference_mode():
+                left_offset = 8 if self._chunk_index > 0 else 0
+                right_offset = 8
+                
+                self.streaming_state, self.total_preds = self.diar_model.forward_streaming_step(
+                    processed_signal=chunk_feat_seq_t,
+                    processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]]).to(device),
+                    streaming_state=self.streaming_state,
+                    total_preds=self.total_preds,
+                    left_offset=left_offset,
+                    right_offset=right_offset,
+                )
+                
+            # Convert predictions to speaker segments
+            self._process_predictions()
+            
+            self._chunk_index += 1
+            
+        except Exception as e:
+            logger.error(f"Error in diarize: {e}")
+            raise
+            
+        # TODO: Handle case when stream ends with partial buffer (accumulated_duration > 0 but < chunk_duration_seconds)
+
+    def _process_predictions(self):
+        """Process model predictions and convert to speaker segments."""
+        try:
+            preds_np = self.total_preds[0].cpu().numpy()
+            active_speakers = np.argmax(preds_np, axis=1)
+            
+            if self._len_prediction is None:
+                self._len_prediction = len(active_speakers)
+            
+            # Get predictions for current chunk
+            frame_duration = self.chunk_duration_seconds / self._len_prediction
+            current_chunk_preds = active_speakers[-self._len_prediction:]
+            
+            with self.segment_lock:
+                # Process predictions into segments
+                base_time = self._chunk_index * self.chunk_duration_seconds + self.global_time_offset
+                
+                for idx, spk in enumerate(current_chunk_preds):
+                    start_time = base_time + idx * frame_duration
+                    end_time = base_time + (idx + 1) * frame_duration
+                    
+                    # Check if this continues the last segment or starts a new one
+                    if (self.speaker_segments and 
+                        self.speaker_segments[-1].speaker == spk and 
+                        abs(self.speaker_segments[-1].end - start_time) < frame_duration * 0.5):
+                        # Continue existing segment
+                        self.speaker_segments[-1].end = end_time
+                    else:
+                        
+                        # Create new segment
+                        self.speaker_segments.append(SpeakerSegment(
+                            speaker=spk,
+                            start=start_time,
+                            end=end_time
+                        ))
+                
+                # Update processed time
+                self.processed_time = max(self.processed_time, base_time + self.chunk_duration_seconds)
+                
+                logger.debug(f"Processed chunk {self._chunk_index}, total segments: {len(self.speaker_segments)}")
+                
+        except Exception as e:
+            logger.error(f"Error processing predictions: {e}")
+
+    def assign_speakers_to_tokens(self, tokens: list, use_punctuation_split: bool = False) -> list:
+        """
+        Assign speakers to tokens based on timing overlap with speaker segments.
+        
+        Args:
+            tokens: List of tokens with timing information
+            use_punctuation_split: Whether to use punctuation for boundary refinement
+            
+        Returns:
+            List of tokens with speaker assignments
+        """
+        with self.segment_lock:
+            segments = self.speaker_segments.copy()
+        
+        if not segments or not tokens:
+            logger.debug("No segments or tokens available for speaker assignment")
+            return tokens
+        
+        logger.debug(f"Assigning speakers to {len(tokens)} tokens using {len(segments)} segments")
+        use_punctuation_split = False
+        if not use_punctuation_split:
+            # Simple overlap-based assignment
+            for token in tokens:
+                token.speaker = -1  # Default to no speaker
+                for segment in segments:
+                    # Check for timing overlap
+                    if not (segment.end <= token.start or segment.start >= token.end):
+                        token.speaker = segment.speaker + 1  # Convert to 1-based indexing
+                        break
+        else:
+            # Use punctuation-aware assignment (similar to diart_backend)
+            tokens = self._add_speaker_to_tokens_with_punctuation(segments, tokens)
+        
+        return tokens
+
+    def _add_speaker_to_tokens_with_punctuation(self, segments: List[SpeakerSegment], tokens: list) -> list:
+        """
+        Assign speakers to tokens with punctuation-aware boundary adjustment.
+        
+        Args:
+            segments: List of speaker segments
+            tokens: List of tokens to assign speakers to
+            
+        Returns:
+            List of tokens with speaker assignments
+        """
+        punctuation_marks = {'.', '!', '?'}
+        punctuation_tokens = [token for token in tokens if token.text.strip() in punctuation_marks]
+        
+        # Convert segments to concatenated format
+        segments_concatenated = self._concatenate_speakers(segments)
+        
+        # Adjust segment boundaries based on punctuation
+        for ind, segment in enumerate(segments_concatenated):
+            for i, punctuation_token in enumerate(punctuation_tokens):
+                if punctuation_token.start > segment['end']:
+                    after_length = punctuation_token.start - segment['end']
+                    before_length = segment['end'] - punctuation_tokens[i - 1].end if i > 0 else float('inf')
+                    
+                    if before_length > after_length:
+                        segment['end'] = punctuation_token.start
+                        if i < len(punctuation_tokens) - 1 and ind + 1 < len(segments_concatenated):
+                            segments_concatenated[ind + 1]['begin'] = punctuation_token.start
+                    else:
+                        segment['end'] = punctuation_tokens[i - 1].end if i > 0 else segment['end']
+                        if i < len(punctuation_tokens) - 1 and ind - 1 >= 0:
+                            segments_concatenated[ind - 1]['begin'] = punctuation_tokens[i - 1].end
+                    break
+        
+        # Ensure non-overlapping tokens
+        last_end = 0.0
+        for token in tokens:
+            start = max(last_end + 0.01, token.start)
+            token.start = start
+            token.end = max(start, token.end)
+            last_end = token.end
+        
+        # Assign speakers based on adjusted segments
+        ind_last_speaker = 0
+        for segment in segments_concatenated:
+            for i, token in enumerate(tokens[ind_last_speaker:]):
+                if token.end <= segment['end']:
+                    token.speaker = segment['speaker']
+                    ind_last_speaker = i + 1
+                elif token.start > segment['end']:
+                    break
+        
+        return tokens
+
+    def _concatenate_speakers(self, segments: List[SpeakerSegment]) -> List[dict]:
+        """
+        Concatenate consecutive segments from the same speaker.
+        
+        Args:
+            segments: List of speaker segments
+            
+        Returns:
+            List of concatenated speaker segments
+        """
+        if not segments:
+            return []
+            
+        segments_concatenated = [{"speaker": segments[0].speaker + 1, "begin": segments[0].start, "end": segments[0].end}]
+        
+        for segment in segments[1:]:
+            speaker = segment.speaker + 1
+            if segments_concatenated[-1]['speaker'] != speaker:
+                segments_concatenated.append({"speaker": speaker, "begin": segment.start, "end": segment.end})
+            else:
+                segments_concatenated[-1]['end'] = segment.end
+                
+        return segments_concatenated
+
+    def get_segments(self) -> List[SpeakerSegment]:
+        """Get a copy of the current speaker segments."""
+        with self.segment_lock:
+            return self.speaker_segments.copy()
+
+    def clear_old_segments(self, older_than: float = 30.0):
+        """Clear segments older than the specified time."""
+        with self.segment_lock:
+            current_time = self.processed_time
+            self.speaker_segments = [
+                segment for segment in self.speaker_segments 
+                if current_time - segment.end < older_than
+            ]
+            logger.debug(f"Cleared old segments, remaining: {len(self.speaker_segments)}")
+
+    def close(self):
+        """Close the diarization system and clean up resources."""
+        logger.info("Closing SortformerDiarization")
+        with self.segment_lock:
+            self.speaker_segments.clear()
+        
+        if self.debug:
+            concatenated_audio = np.concatenate(self.audio_buffer)
+            audio_data_int16 = (concatenated_audio * 32767).astype(np.int16)                
+            with wave.open("diarization_audio.wav", "wb") as wav_file:
+                wav_file.setnchannels(1)  # mono audio
+                wav_file.setsampwidth(2)   # 2 bytes per sample (int16)
+                wav_file.setframerate(self.sample_rate)
+                wav_file.writeframes(audio_data_int16.tobytes())
+            logger.info(f"Saved {len(concatenated_audio)} samples to diarization_audio.wav")
+
+
+def extract_number(s: str) -> int:
+    """Extract number from speaker string (compatibility function)."""
+    import re
+    m = re.search(r'\d+', s)
+    return int(m.group()) if m else 0
+
+
+if __name__ == '__main__':
+    import asyncio
+    import librosa
+    
+    async def main():
+        """TEST ONLY."""
+        an4_audio = 'audio_test.mp3'
+        signal, sr = librosa.load(an4_audio, sr=16000)
+        signal = signal[:16000*30]
+
+        print("\n" + "=" * 50)
+        print("ground truth:")
+        print("Speaker 0: 0:00 - 0:09")
+        print("Speaker 1: 0:09 - 0:19") 
+        print("Speaker 2: 0:19 - 0:25")
+        print("Speaker 0: 0:25 - 0:30")
+        print("=" * 50)
+        
+        diarization = SortformerDiarization(sample_rate=16000)        
+        chunk_size = 1600
+        
+        for i in range(0, len(signal), chunk_size):
+            chunk = signal[i:i+chunk_size]
+            await diarization.diarize(chunk)
+            print(f"Processed chunk {i // chunk_size + 1}")
+        
+        segments = diarization.get_segments()
+        print("\nDiarization results:")
+        for segment in segments:
+            print(f"Speaker {segment.speaker}: {segment.start:.2f}s - {segment.end:.2f}s")
+    
+    asyncio.run(main())
--- a/whisperlivekit/diarization/sortformer_backend_offline.py
+++ b/whisperlivekit/diarization/sortformer_backend_offline.py
@@ -0,0 +1,205 @@
+import numpy as np
+import torch
+import logging
+
+from nemo.collections.asr.models import SortformerEncLabelModel
+from nemo.collections.asr.modules import AudioToMelSpectrogramPreprocessor
+import librosa
+
+logger = logging.getLogger(__name__)
+
+def load_model():
+
+    diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")
+    diar_model.eval()
+
+    if torch.cuda.is_available():
+        diar_model.to(torch.device("cuda"))
+
+    #we target 1 second lag for the moment. chunk_len could be reduced.
+    diar_model.sortformer_modules.chunk_len = 10
+    diar_model.sortformer_modules.subsampling_factor = 10 #8 would be better ideally
+
+    diar_model.sortformer_modules.chunk_right_context = 0 #no.
+    diar_model.sortformer_modules.chunk_left_context = 10 #big so it compensiate the problem with no padding later.
+
+    diar_model.sortformer_modules.spkcache_len = 188
+    diar_model.sortformer_modules.fifo_len = 188
+    diar_model.sortformer_modules.spkcache_update_period = 144
+    diar_model.sortformer_modules.log = False
+    diar_model.sortformer_modules._check_streaming_parameters()
+
+
+    audio2mel = AudioToMelSpectrogramPreprocessor(
+            window_size= 0.025, 
+            normalize="NA",
+            n_fft=512,
+            features=128,
+            pad_to=0) #pad_to 16 works better than 0. On test audio, we detect a third speaker for 1 second with pad_to=0. To solve that : increase left context to 10.
+
+    return diar_model, audio2mel
+
+diar_model, audio2mel = load_model()
+
+class StreamingSortformerState:
+    """
+    This class creates a class instance that will be used to store the state of the
+    streaming Sortformer model.
+
+    Attributes:
+        spkcache (torch.Tensor): Speaker cache to store embeddings from start
+        spkcache_lengths (torch.Tensor): Lengths of the speaker cache
+        spkcache_preds (torch.Tensor): The speaker predictions for the speaker cache parts
+        fifo (torch.Tensor): FIFO queue to save the embedding from the latest chunks
+        fifo_lengths (torch.Tensor): Lengths of the FIFO queue
+        fifo_preds (torch.Tensor): The speaker predictions for the FIFO queue parts
+        spk_perm (torch.Tensor): Speaker permutation information for the speaker cache
+        mean_sil_emb (torch.Tensor): Mean silence embedding
+        n_sil_frames (torch.Tensor): Number of silence frames
+    """
+
+    spkcache = None  # Speaker cache to store embeddings from start
+    spkcache_lengths = None  #
+    spkcache_preds = None  # speaker cache predictions
+    fifo = None  # to save the embedding from the latest chunks
+    fifo_lengths = None
+    fifo_preds = None
+    spk_perm = None
+    mean_sil_emb = None
+    n_sil_frames = None
+
+
+def init_streaming_state(self, batch_size: int = 1, async_streaming: bool = False, device: torch.device = None):
+    """
+    Initializes StreamingSortformerState with empty tensors or zero-valued tensors.
+
+    Args:
+        batch_size (int): Batch size for tensors in streaming state
+        async_streaming (bool): True for asynchronous update, False for synchronous update
+        device (torch.device): Device for tensors in streaming state
+
+    Returns:
+        streaming_state (SortformerStreamingState): initialized streaming state
+    """
+    streaming_state = StreamingSortformerState()
+    if async_streaming:
+        streaming_state.spkcache = torch.zeros((batch_size, self.spkcache_len, self.fc_d_model), device=device)
+        streaming_state.spkcache_preds = torch.zeros((batch_size, self.spkcache_len, self.n_spk), device=device)
+        streaming_state.spkcache_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+        streaming_state.fifo = torch.zeros((batch_size, self.fifo_len, self.fc_d_model), device=device)
+        streaming_state.fifo_lengths = torch.zeros((batch_size,), dtype=torch.long, device=device)
+    else:
+        streaming_state.spkcache = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
+        streaming_state.fifo = torch.zeros((batch_size, 0, self.fc_d_model), device=device)
+    streaming_state.mean_sil_emb = torch.zeros((batch_size, self.fc_d_model), device=device)
+    streaming_state.n_sil_frames = torch.zeros((batch_size,), dtype=torch.long, device=device)
+    return streaming_state
+
+
+def process_diarization(chunks):
+    """ 
+    what it does:
+    1. Preprocessing: Applies dithering and pre-emphasis (high-pass filter) if enabled
+    2. STFT: Computes the Short-Time Fourier Transform using:
+        - the window of window_size=0.025 --> size of a window : 400 samples
+        - the hop parameter : n_window_stride = 0.01 -> every 160 samples, a new window
+    3. Magnitude Calculation: Converts complex STFT output to magnitude spectrogram
+    4. Mel Conversion: Applies Mel filterbanks (128 filters in this case) to get Mel spectrogram
+    5. Logarithm: Takes the log of the Mel spectrogram (if `log=True`)
+    6. Normalization: Skips normalization since `normalize="NA"`
+    7. Padding: Pads the time dimension to a multiple of `pad_to` (default 16)    
+    """
+    previous_chunk = None
+    l_chunk_feat_seq_t = []
+    for chunk in chunks:
+        audio_signal_chunk = torch.tensor(chunk).unsqueeze(0).to(diar_model.device)
+        audio_signal_length_chunk = torch.tensor([audio_signal_chunk.shape[1]]).to(diar_model.device)
+        processed_signal_chunk, processed_signal_length_chunk = audio2mel.get_features(audio_signal_chunk, audio_signal_length_chunk)
+        if previous_chunk is not None:
+            to_add = previous_chunk[:, :, -99:]
+            total = torch.concat([to_add, processed_signal_chunk], dim=2)
+        else:
+            total = processed_signal_chunk
+        previous_chunk = processed_signal_chunk
+        l_chunk_feat_seq_t.append(torch.transpose(total, 1, 2))
+
+    batch_size = 1
+    streaming_state = init_streaming_state(diar_model.sortformer_modules,
+        batch_size = batch_size,
+        async_streaming = True,
+        device = diar_model.device
+    )
+    total_preds = torch.zeros((batch_size, 0, diar_model.sortformer_modules.n_spk), device=diar_model.device)
+
+    chunk_duration_seconds = diar_model.sortformer_modules.chunk_len * diar_model.sortformer_modules.subsampling_factor * diar_model.preprocessor._cfg.window_stride
+
+    l_speakers = [
+        {'start_time': 0,
+        'end_time': 0,
+        'speaker': 0
+        }
+    ]
+    len_prediction = None
+    left_offset = 0
+    right_offset = 8
+    for i, chunk_feat_seq_t in enumerate(l_chunk_feat_seq_t):
+        with torch.inference_mode():
+                streaming_state, total_preds = diar_model.forward_streaming_step(
+                    processed_signal=chunk_feat_seq_t,
+                    processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]]),
+                    streaming_state=streaming_state,
+                    total_preds=total_preds,
+                    left_offset=left_offset,
+                    right_offset=right_offset,
+                )
+                left_offset = 8
+                preds_np = total_preds[0].cpu().numpy()
+                active_speakers = np.argmax(preds_np, axis=1)
+                if len_prediction is None:
+                    len_prediction = len(active_speakers) # we want to get the len of 1 prediction
+                frame_duration = chunk_duration_seconds / len_prediction
+                active_speakers = active_speakers[-len_prediction:]
+                for idx, spk in enumerate(active_speakers):
+                    if spk != l_speakers[-1]['speaker']:
+                        l_speakers.append(
+                            {'start_time': (i * chunk_duration_seconds + idx * frame_duration),
+                            'end_time': (i * chunk_duration_seconds + (idx + 1) * frame_duration),
+                            'speaker': spk
+                        })                    
+                    else:
+                        l_speakers[-1]['end_time'] = i * chunk_duration_seconds + (idx + 1) * frame_duration
+                    
+        
+        """
+        Should print
+        [{'start_time': 0, 'end_time': 8.72, 'speaker': 0}, 
+        {'start_time': 8.72, 'end_time': 18.88, 'speaker': 1},
+        {'start_time': 18.88, 'end_time': 24.96, 'speaker': 2},
+        {'start_time': 24.96, 'end_time': 31.68, 'speaker': 0}]
+        """
+    for speaker in l_speakers:
+        print(f"Speaker {speaker['speaker']}: {speaker['start_time']:.2f}s - {speaker['end_time']:.2f}s")    
+    
+
+if __name__ == '__main__':
+    
+    an4_audio = 'audio_test.mp3'
+    signal, sr = librosa.load(an4_audio, sr=16000)
+    signal = signal[:16000*30]
+    # signal = signal[:-(len(signal)%16000)]
+
+    print("\n" + "=" * 50)
+    print("Expected ground truth:")
+    print("Speaker 0: 0:00 - 0:09")
+    print("Speaker 1: 0:09 - 0:19") 
+    print("Speaker 2: 0:19 - 0:25")
+    print("Speaker 0: 0:25 - 0:30")
+    print("=" * 50)
+
+    chunk_size = 16000  # 1 second
+    chunks = []
+    for i in range(0, len(signal), chunk_size):
+        chunk = signal[i:i+chunk_size]
+        chunks.append(chunk)
+    
+    process_diarization(chunks)
--- a/whisperlivekit/ffmpeg_manager.py
+++ b/whisperlivekit/ffmpeg_manager.py
@@ -7,11 +7,12 @@ import contextlib
 logger = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)

-ERROR_INSTALL_INSTRUCTIONS = """
+ERROR_INSTALL_INSTRUCTIONS = f"""
+{'='*50}
 FFmpeg is not installed or not found in your system's PATH.
-Please install FFmpeg to enable audio processing.
+Alternative Solution: You can still use WhisperLiveKit without FFmpeg by adding the --pcm-input parameter. Note that when using this option, audio will not be compressed between the frontend and backend, which may result in higher bandwidth usage.

-Installation instructions:
+If you want to install FFmpeg:

 # Ubuntu/Debian:
 sudo apt update && sudo apt install ffmpeg
@@ -25,6 +26,7 @@ brew install ffmpeg
 # 3. Add the 'bin' directory (e.g., C:\\FFmpeg\\bin) to your system's PATH environment variable.

 After installation, please restart the application.
+{'='*50}
 """

 class FFmpegState(Enum):
@@ -143,7 +145,7 @@ class FFmpegManager:
        try:
            data = await asyncio.wait_for(
                self.process.stdout.read(size),
-                timeout=5.0
+                timeout=20.0
            )
            return data
        except asyncio.TimeoutError:
@@ -183,6 +185,8 @@ class FFmpegManager:
    async def _drain_stderr(self):
        try:
            while True:
+                if not self.process or not self.process.stderr:
+                    break
                line = await self.process.stderr.readline()
                if not line:
                    break
@@ -190,4 +194,4 @@ class FFmpegManager:
        except asyncio.CancelledError:
            logger.info("FFmpeg stderr drain task cancelled.")
        except Exception as e:
-            logger.error(f"Error draining FFmpeg stderr: {e}")
+            logger.error(f"Error draining FFmpeg stderr: {e}")
--- a/whisperlivekit/parse_args.py
+++ b/whisperlivekit/parse_args.py
@@ -20,7 +20,7 @@ def parse_args():
        help="""
        The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
        If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
-        If False, no warmup is performed.
+        If empty, no warmup is performed.
        """,
    )

@@ -58,12 +58,26 @@ def parse_args():
        help="Hugging Face model ID for pyannote.audio embedding model.",
    )

+    parser.add_argument(
+        "--diarization-backend",
+        type=str,
+        default="sortformer",
+        choices=["sortformer", "diart"],
+        help="The diarization backend to use.",
+    )
+
    parser.add_argument(
        "--no-transcription",
        action="store_true",
        help="Disable transcription to only see live diarization results.",
    )
    
+    parser.add_argument(
+        "--disable-punctuation-split",
+        action="store_true",
+        help="Disable the split parameter.",
+    )
+    
    parser.add_argument(
        "--min-chunk-size",
        type=float,
@@ -74,7 +88,7 @@ def parse_args():
    parser.add_argument(
        "--model",
        type=str,
-        default="tiny",
+        default="small",
        help="Name size of the Whisper model to use (default: tiny). Suggested values: tiny.en,tiny,base.en,base,small.en,small,medium.en,medium,large-v1,large-v2,large-v3,large,large-v3-turbo. The model is automatically downloaded from the model hub if not present in model cache dir.",
    )
    
@@ -104,18 +118,27 @@ def parse_args():
        choices=["transcribe", "translate"],
        help="Transcribe or translate.",
    )
+    
+    parser.add_argument(
+        "--target-language",
+        type=str,
+        default="",
+        dest="target_language",
+        help="Target language for translation. Not functional yet.",
+    )    
+
    parser.add_argument(
        "--backend",
        type=str,
-        default="faster-whisper",
+        default="simulstreaming",
        choices=["faster-whisper", "whisper_timestamped", "mlx-whisper", "openai-api", "simulstreaming"],
        help="Load only this backend for Whisper processing.",
    )
    parser.add_argument(
-        "--vac",
+        "--no-vac",
        action="store_true",
        default=False,
-        help="Use VAC = voice activity controller. Recommended. Requires torch.",
+        help="Disable VAC = voice activity controller.",
    )
    parser.add_argument(
        "--vac-chunk-size", type=float, default=0.04, help="VAC sample size in seconds."
@@ -150,9 +173,22 @@ def parse_args():
    )
    parser.add_argument("--ssl-certfile", type=str, help="Path to the SSL certificate file.", default=None)
    parser.add_argument("--ssl-keyfile", type=str, help="Path to the SSL private key file.", default=None)
-
+    parser.add_argument(
+        "--pcm-input",
+        action="store_true",
+        default=False,
+        help="If set, raw PCM (s16le) data is expected as input and FFmpeg will be bypassed. Frontend will use AudioWorklet instead of MediaRecorder."
+    )
    # SimulStreaming-specific arguments
    simulstreaming_group = parser.add_argument_group('SimulStreaming arguments (only used with --backend simulstreaming)')
+
+    simulstreaming_group.add_argument(
+        "--disable-fast-encoder",
+        action="store_true",
+        default=False,
+        dest="disable_fast_encoder",
+        help="Disable Faster Whisper or MLX Whisper backends for encoding (if installed). Slower but helpful when GPU memory is limited",
+    )
    
    simulstreaming_group.add_argument(
        "--frame-threshold",
@@ -242,6 +278,28 @@ def parse_args():
        dest="model_path",
        help="Direct path to the SimulStreaming Whisper .pt model file. Overrides --model for SimulStreaming backend.",
    )
+    
+    simulstreaming_group.add_argument(
+        "--preload-model-count",
+        type=int,
+        default=1,
+        dest="preload_model_count",
+        help="Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent instances).",
+    )
+
+    simulstreaming_group.add_argument(
+        "--nllb-backend",
+        type=str,
+        default="ctranslate2",
+        help="transformers or ctranslate2",
+    )
+    
+    simulstreaming_group.add_argument(
+        "--nllb-size",
+        type=str,
+        default="600M",
+        help="600M or 1.3B",
+    )

    args = parser.parse_args()
    
--- a/whisperlivekit/remove_silences.py
+++ b/whisperlivekit/remove_silences.py
@@ -3,6 +3,7 @@ import re

 MIN_SILENCE_DURATION = 4 #in seconds
 END_SILENCE_DURATION = 8 #in seconds. you should keep it important to not have false positive when the model lag is important
+END_SILENCE_DURATION_VAC = 3 #VAC is good at detecting silences, but we want to skip the smallest silences

 def blank_to_silence(tokens):
    full_string = ''.join([t.text for t in tokens])
@@ -38,7 +39,7 @@ def blank_to_silence(tokens):
                        )
                else:
                    if silence_token: #there was silence but no more
-                        if silence_token.end - silence_token.start >= MIN_SILENCE_DURATION:
+                        if silence_token.duration() >= MIN_SILENCE_DURATION:
                            cleaned_tokens.append(
                                silence_token
                            )
@@ -76,11 +77,15 @@ def no_token_to_silence(tokens):
            new_tokens.append(token)
    return new_tokens
            
-def ends_with_silence(tokens, current_time):
+def ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
    if not tokens:
-        return []
+        return [], buffer_transcription, buffer_diarization
    last_token = tokens[-1]
-    if tokens and current_time - last_token.end >= END_SILENCE_DURATION:
+    if tokens and current_time and (
+        current_time - last_token.end >= END_SILENCE_DURATION 
+        or 
+        (current_time - last_token.end >= 3 and vac_detected_silence)
+        ):
        if last_token.speaker == -2:
            last_token.end = current_time
        else:
@@ -92,12 +97,14 @@ def ends_with_silence(tokens, current_time):
                    probability=0.95
                )
            )
-    return tokens
+        buffer_transcription = "" # for whisperstreaming backend, we should probably validate the buffer has because of the silence
+        buffer_diarization  = ""
+    return tokens, buffer_transcription, buffer_diarization
    

-def handle_silences(tokens, current_time):
+def handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
    tokens = blank_to_silence(tokens) #useful for simulstreaming backend which tends to generate [BLANK_AUDIO] text
    tokens = no_token_to_silence(tokens)
-    tokens = ends_with_silence(tokens, current_time)
-    return tokens
+    tokens, buffer_transcription, buffer_diarization = ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence)
+    return tokens, buffer_transcription, buffer_diarization
     
--- a/whisperlivekit/results_formater.py
+++ b/whisperlivekit/results_formater.py
@@ -0,0 +1,155 @@
+
+import logging
+from whisperlivekit.remove_silences import handle_silences
+from whisperlivekit.timed_objects import Line, format_time
+
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
+
+PUNCTUATION_MARKS = {'.', '!', '?', '。', '！', '？'}
+CHECK_AROUND = 4
+
+def is_punctuation(token):
+    if token.text.strip() in PUNCTUATION_MARKS:
+        return True
+    return False
+
+def next_punctuation_change(i, tokens):
+    for ind in range(i+1, min(len(tokens), i+CHECK_AROUND+1)):
+        if is_punctuation(tokens[ind]):
+            return ind        
+    return None
+
+def next_speaker_change(i, tokens, speaker):
+    for ind in range(i-1, max(0, i-CHECK_AROUND)-1, -1):
+        token = tokens[ind]
+        if is_punctuation(token):
+            break
+        if token.speaker != speaker:
+            return ind, token.speaker
+    return None, speaker
+    
+def new_line(
+    token,
+    speaker,
+    debug_info = ""
+):
+    return Line(
+        speaker = speaker,
+        text = token.text + debug_info,
+        start = token.start,
+        end = token.end,
+    )
+
+def append_token_to_last_line(lines, sep, token, debug_info):
+    if token.text:
+        lines[-1].text += sep + token.text + debug_info
+        lines[-1].end = token.end
+
+def format_output(state, silence, current_time, args, debug, sep):
+    diarization = args.diarization
+    disable_punctuation_split = args.disable_punctuation_split
+    tokens = state.tokens
+    translated_segments = state.translated_segments # Here we will attribute the speakers only based on the timestamps of the segments
+    buffer_transcription = state.buffer_transcription
+    buffer_diarization = state.buffer_diarization
+    end_attributed_speaker = state.end_attributed_speaker
+    
+    previous_speaker = -1
+    lines = []
+    undiarized_text = []
+    tokens, buffer_transcription, buffer_diarization = handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, silence)
+    last_punctuation = None
+    for i, token in enumerate(tokens):
+        speaker = token.speaker
+        if not diarization and speaker == -1: #Speaker -1 means no attributed by diarization. In the frontend, it should appear under 'Speaker 1'
+            speaker = 1
+        if diarization and not tokens[-1].speaker == -2:
+            if (speaker in [-1, 0]) and token.end >= end_attributed_speaker:
+                undiarized_text.append(token.text)
+                continue
+            elif (speaker in [-1, 0]) and token.end < end_attributed_speaker:
+                speaker = previous_speaker
+        debug_info = ""
+        if debug:
+            debug_info = f"[{format_time(token.start)} : {format_time(token.end)}]"
+            
+        if not lines:
+            lines.append(new_line(token, speaker, debug_info = ""))
+            continue
+        else:
+            previous_speaker = lines[-1].speaker
+        
+        if is_punctuation(token):
+            last_punctuation = i
+            
+        
+        if last_punctuation == i-1:
+            if speaker != previous_speaker:
+                # perfect, diarization perfectly aligned
+                lines.append(new_line(token, speaker, debug_info = ""))
+                last_punctuation, next_punctuation = None, None
+                continue
+            
+            speaker_change_pos, new_speaker = next_speaker_change(i, tokens, speaker)
+            if speaker_change_pos:
+                # Corrects delay:
+                # That was the idea. Okay haha |SPLIT SPEAKER| that's a good one 
+                # should become:
+                # That was the idea. |SPLIT SPEAKER| Okay haha that's a good one 
+                lines.append(new_line(token, new_speaker, debug_info = ""))
+            else:
+                # No speaker change to come
+                append_token_to_last_line(lines, sep, token, debug_info)
+            continue
+        
+
+        if speaker != previous_speaker:
+            if speaker == -2 or previous_speaker == -2: #silences can happen anytime
+                lines.append(new_line(token, speaker, debug_info = ""))
+                continue
+            elif next_punctuation_change(i, tokens):
+                # Corrects advance:
+                # Are you |SPLIT SPEAKER| okay? yeah, sure. Absolutely 
+                # should become:
+                # Are you okay? |SPLIT SPEAKER| yeah, sure. Absolutely 
+                append_token_to_last_line(lines, sep, token, debug_info)
+                continue
+            else: #we create a new speaker, but that's no ideal. We are not sure about the split. We prefer to append to previous line
+                if disable_punctuation_split:
+                    lines.append(new_line(token, speaker, debug_info = ""))
+                    continue
+                pass
+            
+        append_token_to_last_line(lines, sep, token, debug_info)
+    if lines and translated_segments:
+        unassigned_translated_segments = []
+        for ts in translated_segments:
+            assigned = False
+            for line in lines:
+                if ts and ts.overlaps_with(line):
+                    if ts.is_within(line):
+                        line.translation += ts.text + ' '
+                        assigned = True
+                        break
+                    else:
+                        ts0, ts1 = ts.approximate_cut_at(line.end)
+                        if ts0 and line.overlaps_with(ts0):
+                            line.translation += ts0.text + ' '
+                        if ts1:
+                            unassigned_translated_segments.append(ts1)
+                        assigned = True
+                        break
+            if not assigned:
+                unassigned_translated_segments.append(ts)
+        
+        if unassigned_translated_segments:
+            for line in lines:
+                remaining_segments = []
+                for ts in unassigned_translated_segments:
+                    if ts and ts.overlaps_with(line):
+                        line.translation += ts.text + ' '
+                    else:
+                        remaining_segments.append(ts)
+                unassigned_translated_segments = remaining_segments #maybe do smth in the future about that
+    return lines, undiarized_text, buffer_transcription, ''
--- a/whisperlivekit/whisper_streaming_custom/silero_vad_iterator.py
+++ b/whisperlivekit/whisper_streaming_custom/silero_vad_iterator.py
--- a/whisperlivekit/simul_whisper/backend.py
+++ b/whisperlivekit/simul_whisper/backend.py
@@ -3,21 +3,39 @@ import numpy as np
 import logging
 from typing import List, Tuple, Optional
 import logging
+import platform
 from whisperlivekit.timed_objects import ASRToken, Transcript
+from whisperlivekit.warmup import load_file
 from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
 from .whisper import load_model, tokenizer
+from .whisper.audio import TOKENS_PER_SECOND
 import os
+import gc
 logger = logging.getLogger(__name__)

+import torch
+from whisperlivekit.simul_whisper.config import AlignAttConfig
+from whisperlivekit.simul_whisper.simul_whisper import PaddedAlignAttWhisper
+from whisperlivekit.simul_whisper.whisper import tokenizer
+
 try:
-    import torch
-    from whisperlivekit.simul_whisper.config import AlignAttConfig
-    from whisperlivekit.simul_whisper.simul_whisper import PaddedAlignAttWhisper
-    from whisperlivekit.simul_whisper.whisper import tokenizer
-except ImportError as e:
-    raise ImportError(
-        """SimulStreaming dependencies are not available.
-        Please install WhisperLiveKit using pip install "whisperlivekit[simulstreaming]".""")
+    from .mlx_encoder import mlx_model_mapping, load_mlx_encoder
+    HAS_MLX_WHISPER = True
+except ImportError:
+    if platform.system() == "Darwin" and platform.machine() == "arm64":
+        print('MLX Whisper not found but you are on Apple Silicon. Consider installing mlx-whisper for better performance: pip install mlx-whisper')
+    HAS_MLX_WHISPER = False
+if HAS_MLX_WHISPER:
+    HAS_FASTER_WHISPER = False
+else:
+    try:
+        from faster_whisper import WhisperModel
+        HAS_FASTER_WHISPER = True
+    except ImportError:
+        HAS_FASTER_WHISPER = False
+
+
+# TOO_MANY_REPETITIONS = 3

 class SimulStreamingOnlineProcessor:
    SAMPLING_RATE = 16000
@@ -30,33 +48,47 @@ class SimulStreamingOnlineProcessor:
    ):        
        self.asr = asr
        self.logfile = logfile
-        self.is_last = False
-        self.beg = 0.0
        self.end = 0.0
-        self.cumulative_audio_duration = 0.0
+        self.global_time_offset = 0.0
        
        self.committed: List[ASRToken] = []
-        self.last_result_tokens: List[ASRToken] = []        
-        self.model = PaddedAlignAttWhisper(
-            cfg=asr.cfg,
-            loaded_model=asr.whisper_model)
+        self.last_result_tokens: List[ASRToken] = []
+        self.load_new_backend()
+        
+        #can be moved
        if asr.tokenizer:
            self.model.tokenizer = asr.tokenizer

-    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: Optional[float] = None):
+    def load_new_backend(self):
+        model = self.asr.get_new_model_instance()
+        self.model = PaddedAlignAttWhisper(
+            cfg=self.asr.cfg,
+            loaded_model=model,
+            mlx_encoder=self.asr.mlx_encoder,
+            fw_encoder=self.asr.fw_encoder,
+            )
+
+    def insert_silence(self, silence_duration, offset):
+        """
+        If silences are > 5s, we do a complete context clear. Otherwise, we just insert a small silence and shift the last_attend_frame
+        """
+        if silence_duration < 5:
+            gap_silence = torch.zeros(int(16000*silence_duration))
+            self.model.insert_audio(gap_silence)
+            # self.global_time_offset += silence_duration
+        else:
+            self.process_iter(is_last=True) #we want to totally process what remains in the buffer.
+            self.model.refresh_segment(complete=True)
+            self.global_time_offset = silence_duration + offset
+
+
+        
+    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time):
        """Append an audio chunk to be processed by SimulStreaming."""
            
        # Convert numpy array to torch tensor
        audio_tensor = torch.from_numpy(audio).float()
-        
-        # Update timing
-        chunk_duration = len(audio) / self.SAMPLING_RATE
-        self.cumulative_audio_duration += chunk_duration
-        
-        if audio_stream_end_time is not None:
-            self.end = audio_stream_end_time
-        else:
-            self.end = self.cumulative_audio_duration            
+        self.end = audio_stream_end_time #Only to be aligned with what happens in whisperstreaming backend.
        self.model.insert_audio(audio_tensor)

    def get_buffer(self):
@@ -68,38 +100,63 @@ class SimulStreamingOnlineProcessor:
        )

    def timestamped_text(self, tokens, generation):
-        # From the simulstreaming repo. self.model to self.asr.model
-        pr = generation["progress"]
-        if "result" not in generation:
-            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
+        """
+        generate timestamped text from tokens and generation data.
+        
+        args:
+            tokens: List of tokens to process
+            generation: Dictionary containing generation progress and optionally results
+            
+        returns:
+            List of tuples containing (start_time, end_time, word) for each word
+        """
+        FRAME_DURATION = 0.02    
+        if "result" in generation:
+            split_words = generation["result"]["split_words"]
+            split_tokens = generation["result"]["split_tokens"]
        else:
-            split_words, split_tokens = generation["result"]["split_words"], generation["result"]["split_tokens"]
+            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
+        progress = generation["progress"]
+        frames = [p["most_attended_frames"][0] for p in progress]
+        absolute_timestamps = [p["absolute_timestamps"][0] for p in progress]
+        tokens_queue = tokens.copy()
+        timestamped_words = []
+        
+        for word, word_tokens in zip(split_words, split_tokens):
+            # start_frame = None
+            # end_frame = None
+            for expected_token in word_tokens:
+                if not tokens_queue or not frames:
+                    raise ValueError(f"Insufficient tokens or frames for word '{word}'")
+                    
+                actual_token = tokens_queue.pop(0)
+                current_frame = frames.pop(0)
+                current_timestamp = absolute_timestamps.pop(0)
+                if actual_token != expected_token:
+                    raise ValueError(
+                        f"Token mismatch: expected '{expected_token}', "
+                        f"got '{actual_token}' at frame {current_frame}"
+                    )
+                # if start_frame is None:
+                #     start_frame = current_frame
+                # end_frame = current_frame
+            # start_time = start_frame * FRAME_DURATION
+            # end_time = end_frame * FRAME_DURATION
+            start_time = current_timestamp
+            end_time = current_timestamp + 0.1
+            timestamp_entry = (start_time, end_time, word)
+            timestamped_words.append(timestamp_entry)
+            logger.debug(f"TS-WORD:\t{start_time:.2f}\t{end_time:.2f}\t{word}")
+        return timestamped_words

-        frames = [p["most_attended_frames"][0] for p in pr]
-        tokens = tokens.copy()
-        ret = []
-        for sw,st in zip(split_words,split_tokens):
-            b = None
-            for stt in st:
-                t,f = tokens.pop(0), frames.pop(0)
-                if t != stt:
-                    raise ValueError(f"Token mismatch: {t} != {stt} at frame {f}.")
-                if b is None:
-                    b = f
-            e = f
-            out = (b*0.02, e*0.02, sw)
-            ret.append(out)
-            logger.debug(f"TS-WORD:\t{' '.join(map(str, out))}")
-        return ret
-
-    def process_iter(self) -> Tuple[List[ASRToken], float]:
+    def process_iter(self, is_last=False) -> Tuple[List[ASRToken], float]:
        """
        Process accumulated audio chunks using SimulStreaming.
        
        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
        """
-        try:            
-            tokens, generation_progress = self.model.infer(is_last=self.is_last)
+        try:
+            tokens, generation_progress = self.model.infer(is_last=is_last)
            ts_words = self.timestamped_text(tokens, generation_progress)
            
            new_tokens = []
@@ -111,9 +168,33 @@ class SimulStreamingOnlineProcessor:
                    end=end,
                    text=word,
                    probability=0.95  # fake prob. Maybe we can extract it from the model?
+                ).with_offset(
+                    self.global_time_offset
                )
                new_tokens.append(token)
-                self.committed.extend(new_tokens)
+                
+            # identical_tokens = 0
+            # n_new_tokens = len(new_tokens)
+            # if n_new_tokens:
+            
+            self.committed.extend(new_tokens)
+            
+            # if token in self.committed:
+            #     pos = len(self.committed) - 1 - self.committed[::-1].index(token)
+            # if pos:
+            #     for i in range(len(self.committed) - n_new_tokens, -1, -n_new_tokens):
+            #         commited_segment = self.committed[i:i+n_new_tokens]
+            #         if commited_segment == new_tokens:
+            #             identical_segments +=1
+            #             if identical_tokens >= TOO_MANY_REPETITIONS:
+            #                 logger.warning('Too many repetition, model is stuck. Load a new one')
+            #                 self.committed = self.committed[:i]
+            #                 self.load_new_backend()
+            #                 return [], self.end
+
+            # pos = self.committed.rindex(token)
+
+            
            
            return new_tokens, self.end

@@ -132,6 +213,13 @@ class SimulStreamingOnlineProcessor:
        except Exception as e:
            logger.exception(f"SimulStreaming warmup failed: {e}")

+    def __del__(self):
+        # free the model and add a new model to stack.
+        # del self.model
+        gc.collect()
+        torch.cuda.empty_cache()
+        # self.asr.new_model_to_stack()
+        self.model.remove_hooks()

 class SimulStreamingASR():
    """SimulStreaming backend with AlignAtt policy."""
@@ -141,11 +229,11 @@ class SimulStreamingASR():
        logger.warning(SIMULSTREAMING_LICENSE)
        self.logfile = logfile
        self.transcribe_kargs = {}
-        self.original_language = None if lan == "auto" else lan
+        self.original_language = lan
        
        self.model_path = kwargs.get('model_path', './large-v3.pt')
        self.frame_threshold = kwargs.get('frame_threshold', 25)
-        self.audio_max_len = kwargs.get('audio_max_len', 30.0)
+        self.audio_max_len = kwargs.get('audio_max_len', 20.0)
        self.audio_min_len = kwargs.get('audio_min_len', 0.0)
        self.segment_length = kwargs.get('segment_length', 0.5)
        self.beams = kwargs.get('beams', 1)
@@ -156,7 +244,10 @@ class SimulStreamingASR():
        self.init_prompt = kwargs.get('init_prompt', None)
        self.static_init_prompt = kwargs.get('static_init_prompt', None)
        self.max_context_tokens = kwargs.get('max_context_tokens', None)
-        
+        self.warmup_file = kwargs.get('warmup_file', None)
+        self.preload_model_count = kwargs.get('preload_model_count', 1)
+        self.disable_fast_encoder = kwargs.get('disable_fast_encoder', False)
+        self.fast_encoder = False
        if model_dir is not None:
            self.model_path = model_dir
        elif modelsize is not None:
@@ -176,16 +267,6 @@ class SimulStreamingASR():
            }
            self.model_path = model_mapping.get(modelsize, f'./{modelsize}.pt')
        
-        self.model = self.load_model(modelsize)
-        
-        # Set up tokenizer for translation if needed
-        if self.task == "translate":
-            self.tokenizer = self.set_translate_task()
-        else:
-            self.tokenizer = None
-
-
-    def load_model(self, modelsize):
        self.cfg = AlignAttConfig(
                model_path=self.model_path,
                segment_length=self.segment_length,
@@ -201,23 +282,84 @@ class SimulStreamingASR():
                init_prompt=self.init_prompt,
                max_context_tokens=self.max_context_tokens,
                static_init_prompt=self.static_init_prompt,
-        )   
-        model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
-        model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
-        self.whisper_model = load_model(name=model_name, download_root=model_path)
+        )  
+        
+        # Set up tokenizer for translation if needed
+        if self.task == "translate":
+            self.tokenizer = self.set_translate_task()
+        else:
+            self.tokenizer = None
+        
+        self.model_name = os.path.basename(self.cfg.model_path).replace(".pt", "")
+        self.model_path = os.path.dirname(os.path.abspath(self.cfg.model_path))
+    
+        self.mlx_encoder, self.fw_encoder = None, None
+        if not self.disable_fast_encoder:
+            if HAS_MLX_WHISPER:
+                print('Simulstreaming will use MLX whisper for a faster encoder.')
+                mlx_model_name = mlx_model_mapping[self.model_name]
+                self.mlx_encoder = load_mlx_encoder(path_or_hf_repo=mlx_model_name)
+                self.fast_encoder = True
+            elif HAS_FASTER_WHISPER:
+                print('Simulstreaming will use Faster Whisper for the encoder.')
+                self.fw_encoder = WhisperModel(
+                    self.model_name,
+                    device='auto',
+                    compute_type='auto',
+                )
+                self.fast_encoder = True
+
+        self.models = [self.load_model() for i in range(self.preload_model_count)]
+
+
+    def load_model(self):
+        whisper_model = load_model(name=self.model_name, download_root=self.model_path, decoder_only=self.fast_encoder)
+        warmup_audio = load_file(self.warmup_file)
+        if warmup_audio is not None:
+            warmup_audio = torch.from_numpy(warmup_audio).float()
+            if self.fast_encoder:                
+                temp_model = PaddedAlignAttWhisper(
+                    cfg=self.cfg,
+                    loaded_model=whisper_model,
+                    mlx_encoder=self.mlx_encoder,
+                    fw_encoder=self.fw_encoder,
+                )
+                temp_model.warmup(warmup_audio)
+                temp_model.remove_hooks()
+            else:
+                # For standard encoder, use the original transcribe warmup
+                warmup_audio = load_file(self.warmup_file)
+                whisper_model.transcribe(warmup_audio, language=self.original_language if self.original_language != 'auto' else None)
+        return whisper_model
+    
+    def get_new_model_instance(self):
+        """
+        SimulStreaming cannot share the same backend because it uses global forward hooks on the attention layers.
+        Therefore, each user requires a separate model instance, which can be memory-intensive. To maintain speed, we preload the models into memory.
+        """
+        if len(self.models) == 0:
+            self.models.append(self.load_model())
+        new_model = self.models.pop()
+        return new_model
+        # self.models[0]
+
+    def new_model_to_stack(self):
+        self.models.append(self.load_model())
        

    def set_translate_task(self):
        """Set up translation task."""
+        if self.cfg.language == 'auto':
+            raise Exception('Translation cannot be done with language = auto')
        return tokenizer.get_tokenizer(
            multilingual=True,
-            language=self.model.cfg.language,
-            num_languages=self.model.model.num_languages,
+            language=self.cfg.language,
+            num_languages=99,
            task="translate"
        )

    def transcribe(self, audio):
        """
-        Only used for warmup. It's a direct whisper call, not a simulstreaming call
+        Warmup is done directly in load_model
        """
-        self.whisper_model.transcribe(audio, language=self.original_language)
+        pass
--- a/whisperlivekit/simul_whisper/config.py
+++ b/whisperlivekit/simul_whisper/config.py
@@ -24,6 +24,6 @@ class AlignAttConfig(SimulWhisperConfig):
    segment_length: float = field(default=1.0, metadata = {"help": "in second"})
    frame_threshold: int = 4
    rewind_threshold: int = 200
-    audio_max_len: float = 30.0
+    audio_max_len: float = 20.0
    cif_ckpt_path: str = ""
    never_fire: bool = False
--- a/whisperlivekit/simul_whisper/mlx_encoder.py
+++ b/whisperlivekit/simul_whisper/mlx_encoder.py
@@ -0,0 +1,72 @@
+import json
+from pathlib import Path
+
+import mlx.core as mx
+import mlx.nn as nn
+from huggingface_hub import snapshot_download
+from mlx.utils import tree_unflatten
+
+from mlx_whisper import whisper
+
+mlx_model_mapping = {
+    "tiny.en": "mlx-community/whisper-tiny.en-mlx",
+    "tiny": "mlx-community/whisper-tiny-mlx",
+    "base.en": "mlx-community/whisper-base.en-mlx",
+    "base": "mlx-community/whisper-base-mlx",
+    "small.en": "mlx-community/whisper-small.en-mlx",
+    "small": "mlx-community/whisper-small-mlx",
+    "medium.en": "mlx-community/whisper-medium.en-mlx",
+    "medium": "mlx-community/whisper-medium-mlx",
+    "large-v1": "mlx-community/whisper-large-v1-mlx",
+    "large-v2": "mlx-community/whisper-large-v2-mlx",
+    "large-v3": "mlx-community/whisper-large-v3-mlx",
+    "large-v3-turbo": "mlx-community/whisper-large-v3-turbo",
+    "large": "mlx-community/whisper-large-mlx",
+}
+
+def load_mlx_encoder(
+    path_or_hf_repo: str,
+    dtype: mx.Dtype = mx.float32,
+) -> whisper.Whisper:
+    model_path = Path(path_or_hf_repo)
+    if not model_path.exists():
+        model_path = Path(snapshot_download(repo_id=path_or_hf_repo))
+
+    with open(str(model_path / "config.json"), "r") as f:
+        config = json.loads(f.read())
+        config.pop("model_type", None)
+        quantization = config.pop("quantization", None)
+
+    model_args = whisper.ModelDimensions(**config)
+
+    wf = model_path / "weights.safetensors"
+    if not wf.exists():
+        wf = model_path / "weights.npz"
+    weights = mx.load(str(wf))
+
+    model = whisper.Whisper(model_args, dtype)
+
+    if quantization is not None:
+        class_predicate = (
+            lambda p, m: isinstance(m, (nn.Linear, nn.Embedding))
+            and f"{p}.scales" in weights
+        )
+        nn.quantize(model, **quantization, class_predicate=class_predicate)
+
+    weights = tree_unflatten(list(weights.items()))
+    
+    # we only want to load the encoder weights here.
+    # Size examples: for tiny.en, 
+    # Decoder weights: 59110771 bytes
+    # Encoder weights: 15268874 bytes
+
+    
+    encoder_weights = {}
+    encoder_weights['encoder'] = weights['encoder']
+    del(weights)
+    
+
+
+    model.update(encoder_weights)
+    mx.eval(model.parameters())
+    return model
--- a/whisperlivekit/simul_whisper/simul_whisper.py
+++ b/whisperlivekit/simul_whisper/simul_whisper.py
@@ -14,7 +14,7 @@ from .whisper.decoding import GreedyDecoder, BeamSearchDecoder, SuppressTokens,
 from .beam import BeamPyTorchInference
 from .eow_detection import fire_at_boundary, load_cif
 import os
-
+from time import time
 from .token_buffer import TokenBuffer

 import numpy as np
@@ -23,8 +23,22 @@ from .generation_progress import *
 DEC_PAD = 50257
 logger = logging.getLogger(__name__)

-import sys
-import wave
+
+try:
+    from mlx_whisper.audio import log_mel_spectrogram as mlx_log_mel_spectrogram
+    from mlx_whisper.transcribe import pad_or_trim as mlx_pad_or_trim
+    HAS_MLX_WHISPER = True
+except ImportError:
+    HAS_MLX_WHISPER = False
+if HAS_MLX_WHISPER:
+    HAS_FASTER_WHISPER = False
+else:
+    try:
+        from faster_whisper.audio import pad_or_trim as fw_pad_or_trim
+        from faster_whisper.feature_extractor import FeatureExtractor
+        HAS_FASTER_WHISPER = True
+    except ImportError:
+        HAS_FASTER_WHISPER = False

 # New features added to the original version of Simul-Whisper: 
 # - large-v3 model support
@@ -33,7 +47,13 @@ import wave
 # - prompt -- static vs. non-static
 # - context
 class PaddedAlignAttWhisper:
-    def __init__(self, cfg: AlignAttConfig, loaded_model=None) -> None:
+    def __init__(
+            self, 
+            cfg: AlignAttConfig,
+            loaded_model=None,
+            mlx_encoder=None,
+            fw_encoder=None,
+        ) -> None:
        self.log_segments = 0
        model_name = os.path.basename(cfg.model_path).replace(".pt", "")
        model_path = os.path.dirname(os.path.abspath(cfg.model_path))
@@ -41,7 +61,14 @@ class PaddedAlignAttWhisper:
            self.model = loaded_model
        else:
            self.model = load_model(name=model_name, download_root=model_path)
+        
+        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'

+        self.mlx_encoder = mlx_encoder
+        self.fw_encoder = fw_encoder
+        if fw_encoder:
+            self.fw_feature_extractor = FeatureExtractor(feature_size=self.model.dims.n_mels)
+            
        logger.info(f"Model dimensions: {self.model.dims}")

        self.decode_options = DecodingOptions(
@@ -56,6 +83,7 @@ class PaddedAlignAttWhisper:
        self.max_text_len = self.model.dims.n_text_ctx
        self.num_decoder_layers = len(self.model.decoder.blocks)
        self.cfg = cfg
+        self.l_hooks = []

        # model to detect end-of-word boundary at the end of the segment
        self.CIFLinear, self.always_fire, self.never_fire = load_cif(cfg,
@@ -69,7 +97,8 @@ class PaddedAlignAttWhisper:
            t = F.softmax(net_output[1], dim=-1)
            self.dec_attns.append(t.squeeze(0))
        for b in self.model.decoder.blocks:
-            b.cross_attn.register_forward_hook(layer_hook)
+            hook = b.cross_attn.register_forward_hook(layer_hook)
+            self.l_hooks.append(hook)
        
        self.kv_cache = {}
        def kv_hook(module: torch.nn.Linear, _, net_output: torch.Tensor):
@@ -82,10 +111,13 @@ class PaddedAlignAttWhisper:
            return self.kv_cache[module.cache_id] 

        for i,b in enumerate(self.model.decoder.blocks):
-            b.attn.key.register_forward_hook(kv_hook)
-            b.attn.value.register_forward_hook(kv_hook)
-            b.cross_attn.key.register_forward_hook(kv_hook)
-            b.cross_attn.value.register_forward_hook(kv_hook)
+            hooks = [
+                b.attn.key.register_forward_hook(kv_hook),
+                b.attn.value.register_forward_hook(kv_hook),
+                b.cross_attn.key.register_forward_hook(kv_hook),
+                b.cross_attn.value.register_forward_hook(kv_hook),
+            ]
+            self.l_hooks.extend(hooks)

        self.align_source = {}
        self.num_align_heads = 0
@@ -120,6 +152,7 @@ class PaddedAlignAttWhisper:
        self.init_tokens()
        
        self.last_attend_frame = -self.cfg.rewind_threshold
+        self.cumulative_time_offset = 0.0

        if self.cfg.max_context_tokens is None:
            self.max_context_tokens = self.max_text_len
@@ -139,6 +172,19 @@ class PaddedAlignAttWhisper:
            self.inference.kv_cache = self.kv_cache

            self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
+            
+    def remove_hooks(self):
+        for hook in self.l_hooks:
+            hook.remove()
+
+    def warmup(self, audio):
+        try:
+            self.insert_audio(audio)
+            self.infer(is_last=True)
+            self.refresh_segment(complete=True)
+            logger.info("Model warmed up successfully")
+        except Exception as e:
+            logger.exception(f"Model warmup failed: {e}")

    def create_tokenizer(self, language=None):
        self.tokenizer = tokenizer.get_tokenizer(
@@ -210,6 +256,7 @@ class PaddedAlignAttWhisper:
        self.init_tokens()
        self.last_attend_frame = -self.cfg.rewind_threshold       
        self.detected_language = None
+        self.cumulative_time_offset = 0.0
        self.init_context()
        logger.debug(f"Context: {self.context}")
        if not complete and len(self.segments) > 2:
@@ -277,8 +324,9 @@ class PaddedAlignAttWhisper:
            removed_len = self.segments[0].shape[0] / 16000
            segments_len -= removed_len
            self.last_attend_frame -= int(TOKENS_PER_SECOND*removed_len)
+            self.cumulative_time_offset += removed_len  # Track cumulative time removed
            self.segments = self.segments[1:]
-            logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}")
+            logger.debug(f"remove segments: {len(self.segments)} {len(self.tokens)}, cumulative offset: {self.cumulative_time_offset:.2f}s")
            if len(self.tokens) > 1:
                self.context.append_token_ids(self.tokens[1][0,:])
                self.tokens = [self.initial_tokens] + self.tokens[2:]
@@ -346,20 +394,38 @@ class PaddedAlignAttWhisper:
        else:
            input_segments = self.segments[0]

-
-        
-        # mel + padding to 30s
-        mel_padded = log_mel_spectrogram(input_segments, n_mels=self.model.dims.n_mels, padding=N_SAMPLES, 
-                                            device=self.model.device).unsqueeze(0)
-        # trim to 3000
-        mel = pad_or_trim(mel_padded, N_FRAMES)
-
-        # the len of actual audio
-        content_mel_len = int((mel_padded.shape[2] - mel.shape[2])/2)
-
-        # encode
-        encoder_feature = self.model.encoder(mel)
-
+        # NEW : we can use a different encoder, before using standart whisper for cross attention with the hooks on the decoder
+        beg_encode = time()
+        if self.mlx_encoder:
+            mlx_mel_padded = mlx_log_mel_spectrogram(audio=input_segments.detach(), n_mels=self.model.dims.n_mels, padding=N_SAMPLES)
+            mlx_mel = mlx_pad_or_trim(mlx_mel_padded, N_FRAMES, axis=-2)
+            mlx_encoder_feature = self.mlx_encoder.encoder(mlx_mel[None])
+            encoder_feature = torch.as_tensor(mlx_encoder_feature)
+            content_mel_len = int((mlx_mel_padded.shape[0] - mlx_mel.shape[0])/2)
+        elif self.fw_encoder:
+            audio_length_seconds = len(input_segments) / 16000   
+            content_mel_len = int(audio_length_seconds * 100)//2      
+            mel_padded_2 = self.fw_feature_extractor(waveform=input_segments.numpy(), padding=N_SAMPLES)[None, :]
+            mel = fw_pad_or_trim(mel_padded_2, N_FRAMES, axis=-1)
+            encoder_feature_ctranslate = self.fw_encoder.encode(mel)
+            if self.device == 'cpu': #it seems that on gpu, passing StorageView to torch.as_tensor fails and wrapping in the array works
+                encoder_feature_ctranslate = np.array(encoder_feature_ctranslate)
+            try:
+                encoder_feature = torch.as_tensor(encoder_feature_ctranslate, device=self.device)
+            except TypeError: # Normally the cpu condition should prevent having exceptions, but just in case:
+                encoder_feature = torch.as_tensor(np.array(encoder_feature_ctranslate), device=self.device)
+        else:
+            # mel + padding to 30s
+            mel_padded = log_mel_spectrogram(input_segments, n_mels=self.model.dims.n_mels, padding=N_SAMPLES, 
+                                                device=self.device).unsqueeze(0)
+            # trim to 3000
+            mel = pad_or_trim(mel_padded, N_FRAMES)
+            # the len of actual audio
+            content_mel_len = int((mel_padded.shape[2] - mel.shape[2])/2)
+            encoder_feature = self.model.encoder(mel)
+        end_encode = time()
+        # print('Encoder duration:', end_encode-beg_encode)
+                
 #        logger.debug(f"Encoder feature shape: {encoder_feature.shape}")
 #        if mel.shape[-2:] != (self.model.dims.n_audio_ctx, self.model.dims.n_audio_state):
 #            logger.debug("mel ")
@@ -384,7 +450,7 @@ class PaddedAlignAttWhisper:
        ####################### Decoding loop
        logger.info("Decoding loop starts\n")

-        sum_logprobs = torch.zeros(self.cfg.beam_size, device=mel.device)
+        sum_logprobs = torch.zeros(self.cfg.beam_size, device=self.device)
        completed = False

        attn_of_alignment_heads = None
@@ -494,7 +560,13 @@ class PaddedAlignAttWhisper:
            # for each beam, the most attended frame is:
            most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
            generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
+            
+            # Calculate absolute timestamps accounting for cumulative offset
+            absolute_timestamps = [(frame * 0.02 + self.cumulative_time_offset) for frame in most_attended_frames.tolist()]
+            generation_progress_loop.append(("absolute_timestamps", absolute_timestamps))
+            
            logger.debug(str(most_attended_frames.tolist()) + " most att frames")
+            logger.debug(f"Absolute timestamps: {absolute_timestamps} (offset: {self.cumulative_time_offset:.2f}s)")

            most_attended_frame = most_attended_frames[0].item()

@@ -589,7 +661,7 @@ class PaddedAlignAttWhisper:
        ### new hypothesis
        logger.debug(f"new_hypothesis: {new_hypothesis}")
        new_tokens = torch.tensor([new_hypothesis], dtype=torch.long).repeat_interleave(self.cfg.beam_size, dim=0).to(
-            device=self.model.device,
+            device=self.device,
        )
        self.tokens.append(new_tokens)
        # TODO: test if this is redundant or not
--- a/whisperlivekit/simul_whisper/whisper/init.py
+++ b/whisperlivekit/simul_whisper/whisper/init.py
@@ -105,6 +105,7 @@ def load_model(
    device: Optional[Union[str, torch.device]] = None,
    download_root: str = None,
    in_memory: bool = False,
+    decoder_only=False
 ) -> Whisper:
    """
    Load a Whisper ASR model
@@ -151,7 +152,14 @@ def load_model(
    del checkpoint_file

    dims = ModelDimensions(**checkpoint["dims"])
-    model = Whisper(dims)
+    model = Whisper(dims, decoder_only=decoder_only)
+    
+    if decoder_only:
+        checkpoint["model_state_dict"] = {
+            k: v for k, v in checkpoint["model_state_dict"].items() 
+            if 'encoder' not in k
+        }
+
    model.load_state_dict(checkpoint["model_state_dict"])

    if alignment_heads is not None:
--- a/whisperlivekit/simul_whisper/whisper/model.py
+++ b/whisperlivekit/simul_whisper/whisper/model.py
@@ -253,16 +253,18 @@ class TextDecoder(nn.Module):


 class Whisper(nn.Module):
-    def __init__(self, dims: ModelDimensions):
+    def __init__(self, dims: ModelDimensions, decoder_only: bool = False):
        super().__init__()
        self.dims = dims
-        self.encoder = AudioEncoder(
-            self.dims.n_mels,
-            self.dims.n_audio_ctx,
-            self.dims.n_audio_state,
-            self.dims.n_audio_head,
-            self.dims.n_audio_layer,
-        )
+        
+        if not decoder_only:
+            self.encoder = AudioEncoder(
+                self.dims.n_mels,
+                self.dims.n_audio_ctx,
+                self.dims.n_audio_state,
+                self.dims.n_audio_head,
+                self.dims.n_audio_layer,
+            )
        self.decoder = TextDecoder(
            self.dims.n_vocab,
            self.dims.n_text_ctx,
--- a/whisperlivekit/timed_objects.py
+++ b/whisperlivekit/timed_objects.py
@@ -1,14 +1,35 @@
-from dataclasses import dataclass
-from typing import Optional
+from dataclasses import dataclass, field
+from typing import Optional, Any
+from datetime import timedelta
+
+def format_time(seconds: float) -> str:
+    """Format seconds as HH:MM:SS."""
+    return str(timedelta(seconds=int(seconds)))
+

@dataclass
 class TimedText:
-    start: Optional[float]
-    end: Optional[float]
+    start: Optional[float] = 0
+    end: Optional[float] = 0
    text: Optional[str] = ''
    speaker: Optional[int] = -1
    probability: Optional[float] = None
    is_dummy: Optional[bool] = False
+    
+    def overlaps_with(self, other: 'TimedText') -> bool:
+        return not (self.end <= other.start or other.end <= self.start)
+    
+    def is_within(self, other: 'TimedText') -> bool:
+        return other.contains_timespan(self)
+
+    def duration(self) -> float:
+        return self.end - self.start
+
+    def contains_time(self, time: float) -> bool:
+        return self.start <= time <= self.end
+
+    def contains_timespan(self, other: 'TimedText') -> bool:
+        return self.start <= other.start and self.end >= other.end

@dataclass
 class ASRToken(TimedText):
@@ -29,4 +50,88 @@ class SpeakerSegment(TimedText):
    """Represents a segment of audio attributed to a specific speaker.
    No text nor probability is associated with this segment.
    """
-    pass
+    pass
+
+@dataclass
+class Translation(TimedText):
+    pass
+
+    def approximate_cut_at(self, cut_time):
+        """
+        Each word in text is considered to be of duration (end-start)/len(words in text)
+        """
+        if not self.text or not self.contains_time(cut_time):
+            return self, None
+
+        words = self.text.split()
+        num_words = len(words)
+        if num_words == 0:
+            return self, None
+
+        duration_per_word = self.duration() / num_words
+        
+        cut_word_index = int((cut_time - self.start) / duration_per_word)
+        
+        if cut_word_index >= num_words:
+            cut_word_index = num_words -1
+        
+        text0 = " ".join(words[:cut_word_index])
+        text1 = " ".join(words[cut_word_index:])
+
+        segment0 = Translation(start=self.start, end=cut_time, text=text0)
+        segment1 = Translation(start=cut_time, end=self.end, text=text1)
+
+        return segment0, segment1
+        
+
+@dataclass
+class Silence():
+    duration: float
+    
+    
+@dataclass
+class Line(TimedText):
+    translation: str = ''
+    
+    def to_dict(self):
+        return {
+            'speaker': int(self.speaker),
+            'text': self.text,
+            'translation': self.translation,
+            'start': format_time(self.start),
+            'end': format_time(self.end),
+        }
+        
+@dataclass  
+class FrontData():
+    status: str = ''
+    error: str = ''
+    lines: list[Line] = field(default_factory=list)
+    buffer_transcription: str = ''
+    buffer_diarization: str = ''
+    remaining_time_transcription: float = 0.
+    remaining_time_diarization: float = 0.
+    
+    def to_dict(self):
+        _dict = {
+            'status': self.status,
+            'lines': [line.to_dict() for line in self.lines],
+            'buffer_transcription': self.buffer_transcription,
+            'buffer_diarization': self.buffer_diarization,
+            'remaining_time_transcription': self.remaining_time_transcription,
+            'remaining_time_diarization': self.remaining_time_diarization,
+        }
+        if self.error:
+            _dict['error'] = self.error
+        return _dict
+    
+@dataclass  
+class State():
+    tokens: list
+    translated_segments: list
+    buffer_transcription: str
+    buffer_diarization: str
+    end_buffer: float
+    end_attributed_speaker: float
+    remaining_time_transcription: float
+    remaining_time_diarization: float
--- a/whisperlivekit/trail_repetition.py
+++ b/whisperlivekit/trail_repetition.py
@@ -0,0 +1,60 @@
+from typing import Sequence, Callable, Any, Optional, Dict
+
+def _detect_tail_repetition(
+    seq: Sequence[Any],
+    key: Callable[[Any], Any] = lambda x: x,  # extract comparable value
+    min_block: int = 1,                       # set to 2 to ignore 1-token loops like "."
+    max_tail: int = 300,                      # search window from the end for speed
+    prefer: str = "longest",                  # "longest" coverage or "smallest" block
+) -> Optional[Dict]:
+    vals = [key(x) for x in seq][-max_tail:]
+    n = len(vals)
+    best = None
+
+    # try every possible block length
+    for b in range(min_block, n // 2 + 1):
+        block = vals[-b:]
+        # count how many times this block repeats contiguously at the very end
+        count, i = 0, n
+        while i - b >= 0 and vals[i - b:i] == block:
+            count += 1
+            i -= b
+
+        if count >= 2:
+            cand = {
+                "block_size": b,
+                "count": count,
+                "start_index": len(seq) - count * b,  # in original seq
+                "end_index": len(seq),
+            }
+            if (best is None or
+                (prefer == "longest" and count * b > best["count"] * best["block_size"]) or
+                (prefer == "smallest" and b < best["block_size"])):
+                best = cand
+    return best
+
+def trim_tail_repetition(
+    seq: Sequence[Any],
+    key: Callable[[Any], Any] = lambda x: x,
+    min_block: int = 1,
+    max_tail: int = 300,
+    prefer: str = "longest",
+    keep: int = 1,  # how many copies of the repeating block to keep at the end (0 or 1 are common)
+):
+    """
+    Returns a new sequence with repeated tail trimmed.
+    keep=1 -> keep a single copy of the repeated block.
+    keep=0 -> remove all copies of the repeated block.
+    """
+    rep = _detect_tail_repetition(seq, key, min_block, max_tail, prefer)
+    if not rep:
+        return seq, False  # nothing to trim
+
+    b, c = rep["block_size"], rep["count"]
+    if keep < 0:
+        keep = 0
+    if keep >= c:
+        return seq, False  # nothing to trim (already <= keep copies)
+    # new length = total - (copies_to_remove * block_size)
+    new_len = len(seq) - (c - keep) * b
+    return seq[:new_len], True
--- a/whisperlivekit/translation/init.py
+++ b/whisperlivekit/translation/init.py
--- a/whisperlivekit/translation/mapping_languages.py
+++ b/whisperlivekit/translation/mapping_languages.py
@@ -0,0 +1,182 @@
+"""
+adapted from https://store.crowdin.com/custom-mt
+"""
+
+LANGUAGES = [
+    {"name": "Afrikaans", "nllb": "afr_Latn", "crowdin": "af"},
+    {"name": "Akan", "nllb": "aka_Latn", "crowdin": "ak"},
+    {"name": "Amharic", "nllb": "amh_Ethi", "crowdin": "am"},
+    {"name": "Assamese", "nllb": "asm_Beng", "crowdin": "as"},
+    {"name": "Asturian", "nllb": "ast_Latn", "crowdin": "ast"},
+    {"name": "Bashkir", "nllb": "bak_Cyrl", "crowdin": "ba"},
+    {"name": "Bambara", "nllb": "bam_Latn", "crowdin": "bm"},
+    {"name": "Balinese", "nllb": "ban_Latn", "crowdin": "ban"},
+    {"name": "Belarusian", "nllb": "bel_Cyrl", "crowdin": "be"},
+    {"name": "Bengali", "nllb": "ben_Beng", "crowdin": "bn"},
+    {"name": "Bosnian", "nllb": "bos_Latn", "crowdin": "bs"},
+    {"name": "Bulgarian", "nllb": "bul_Cyrl", "crowdin": "bg"},
+    {"name": "Catalan", "nllb": "cat_Latn", "crowdin": "ca"},
+    {"name": "Cebuano", "nllb": "ceb_Latn", "crowdin": "ceb"},
+    {"name": "Czech", "nllb": "ces_Latn", "crowdin": "cs"},
+    {"name": "Welsh", "nllb": "cym_Latn", "crowdin": "cy"},
+    {"name": "Danish", "nllb": "dan_Latn", "crowdin": "da"},
+    {"name": "German", "nllb": "deu_Latn", "crowdin": "de"},
+    {"name": "Dzongkha", "nllb": "dzo_Tibt", "crowdin": "dz"},
+    {"name": "Greek", "nllb": "ell_Grek", "crowdin": "el"},
+    {"name": "English", "nllb": "eng_Latn", "crowdin": "en"},
+    {"name": "Esperanto", "nllb": "epo_Latn", "crowdin": "eo"},
+    {"name": "Estonian", "nllb": "est_Latn", "crowdin": "et"},
+    {"name": "Basque", "nllb": "eus_Latn", "crowdin": "eu"},
+    {"name": "Ewe", "nllb": "ewe_Latn", "crowdin": "ee"},
+    {"name": "Faroese", "nllb": "fao_Latn", "crowdin": "fo"},
+    {"name": "Fijian", "nllb": "fij_Latn", "crowdin": "fj"},
+    {"name": "Finnish", "nllb": "fin_Latn", "crowdin": "fi"},
+    {"name": "French", "nllb": "fra_Latn", "crowdin": "fr"},
+    {"name": "Friulian", "nllb": "fur_Latn", "crowdin": "fur-IT"},
+    {"name": "Scottish Gaelic", "nllb": "gla_Latn", "crowdin": "gd"},
+    {"name": "Irish", "nllb": "gle_Latn", "crowdin": "ga-IE"},
+    {"name": "Galician", "nllb": "glg_Latn", "crowdin": "gl"},
+    {"name": "Guarani", "nllb": "grn_Latn", "crowdin": "gn"},
+    {"name": "Gujarati", "nllb": "guj_Gujr", "crowdin": "gu-IN"},
+    {"name": "Haitian Creole", "nllb": "hat_Latn", "crowdin": "ht"},
+    {"name": "Hausa", "nllb": "hau_Latn", "crowdin": "ha"},
+    {"name": "Hebrew", "nllb": "heb_Hebr", "crowdin": "he"},
+    {"name": "Hindi", "nllb": "hin_Deva", "crowdin": "hi"},
+    {"name": "Croatian", "nllb": "hrv_Latn", "crowdin": "hr"},
+    {"name": "Hungarian", "nllb": "hun_Latn", "crowdin": "hu"},
+    {"name": "Armenian", "nllb": "hye_Armn", "crowdin": "hy-AM"},
+    {"name": "Igbo", "nllb": "ibo_Latn", "crowdin": "ig"},
+    {"name": "Indonesian", "nllb": "ind_Latn", "crowdin": "id"},
+    {"name": "Icelandic", "nllb": "isl_Latn", "crowdin": "is"},
+    {"name": "Italian", "nllb": "ita_Latn", "crowdin": "it"},
+    {"name": "Javanese", "nllb": "jav_Latn", "crowdin": "jv"},
+    {"name": "Japanese", "nllb": "jpn_Jpan", "crowdin": "ja"},
+    {"name": "Kabyle", "nllb": "kab_Latn", "crowdin": "kab"},
+    {"name": "Kannada", "nllb": "kan_Knda", "crowdin": "kn"},
+    {"name": "Georgian", "nllb": "kat_Geor", "crowdin": "ka"},
+    {"name": "Kazakh", "nllb": "kaz_Cyrl", "crowdin": "kk"},
+    {"name": "Khmer", "nllb": "khm_Khmr", "crowdin": "km"},
+    {"name": "Kinyarwanda", "nllb": "kin_Latn", "crowdin": "rw"},
+    {"name": "Kyrgyz", "nllb": "kir_Cyrl", "crowdin": "ky"},
+    {"name": "Korean", "nllb": "kor_Hang", "crowdin": "ko"},
+    {"name": "Lao", "nllb": "lao_Laoo", "crowdin": "lo"},
+    {"name": "Ligurian", "nllb": "lij_Latn", "crowdin": "lij"},
+    {"name": "Limburgish", "nllb": "lim_Latn", "crowdin": "li"},
+    {"name": "Lingala", "nllb": "lin_Latn", "crowdin": "ln"},
+    {"name": "Lithuanian", "nllb": "lit_Latn", "crowdin": "lt"},
+    {"name": "Luxembourgish", "nllb": "ltz_Latn", "crowdin": "lb"},
+    {"name": "Maithili", "nllb": "mai_Deva", "crowdin": "mai"},
+    {"name": "Malayalam", "nllb": "mal_Mlym", "crowdin": "ml-IN"},
+    {"name": "Marathi", "nllb": "mar_Deva", "crowdin": "mr"},
+    {"name": "Macedonian", "nllb": "mkd_Cyrl", "crowdin": "mk"},
+    {"name": "Maltese", "nllb": "mlt_Latn", "crowdin": "mt"},
+    {"name": "Mossi", "nllb": "mos_Latn", "crowdin": "mos"},
+    {"name": "Maori", "nllb": "mri_Latn", "crowdin": "mi"},
+    {"name": "Burmese", "nllb": "mya_Mymr", "crowdin": "my"},
+    {"name": "Dutch", "nllb": "nld_Latn", "crowdin": "nl"},
+    {"name": "Norwegian Nynorsk", "nllb": "nno_Latn", "crowdin": "nn-NO"},
+    {"name": "Nepali", "nllb": "npi_Deva", "crowdin": "ne-NP"},
+    {"name": "Northern Sotho", "nllb": "nso_Latn", "crowdin": "nso"},
+    {"name": "Occitan", "nllb": "oci_Latn", "crowdin": "oc"},
+    {"name": "Odia", "nllb": "ory_Orya", "crowdin": "or"},
+    {"name": "Papiamento", "nllb": "pap_Latn", "crowdin": "pap"},
+    {"name": "Polish", "nllb": "pol_Latn", "crowdin": "pl"},
+    {"name": "Portuguese", "nllb": "por_Latn", "crowdin": "pt-PT"},
+    {"name": "Dari", "nllb": "prs_Arab", "crowdin": "fa-AF"},
+    {"name": "Romanian", "nllb": "ron_Latn", "crowdin": "ro"},
+    {"name": "Rundi", "nllb": "run_Latn", "crowdin": "rn"},
+    {"name": "Russian", "nllb": "rus_Cyrl", "crowdin": "ru"},
+    {"name": "Sango", "nllb": "sag_Latn", "crowdin": "sg"},
+    {"name": "Sanskrit", "nllb": "san_Deva", "crowdin": "sa"},
+    {"name": "Santali", "nllb": "sat_Olck", "crowdin": "sat"},
+    {"name": "Sinhala", "nllb": "sin_Sinh", "crowdin": "si-LK"},
+    {"name": "Slovak", "nllb": "slk_Latn", "crowdin": "sk"},
+    {"name": "Slovenian", "nllb": "slv_Latn", "crowdin": "sl"},
+    {"name": "Shona", "nllb": "sna_Latn", "crowdin": "sn"},
+    {"name": "Sindhi", "nllb": "snd_Arab", "crowdin": "sd"},
+    {"name": "Somali", "nllb": "som_Latn", "crowdin": "so"},
+    {"name": "Southern Sotho", "nllb": "sot_Latn", "crowdin": "st"},
+    {"name": "Spanish", "nllb": "spa_Latn", "crowdin": "es-ES"},
+    {"name": "Sardinian", "nllb": "srd_Latn", "crowdin": "sc"},
+    {"name": "Swati", "nllb": "ssw_Latn", "crowdin": "ss"},
+    {"name": "Sundanese", "nllb": "sun_Latn", "crowdin": "su"},
+    {"name": "Swedish", "nllb": "swe_Latn", "crowdin": "sv-SE"},
+    {"name": "Swahili", "nllb": "swh_Latn", "crowdin": "sw"},
+    {"name": "Tamil", "nllb": "tam_Taml", "crowdin": "ta"},
+    {"name": "Tatar", "nllb": "tat_Cyrl", "crowdin": "tt-RU"},
+    {"name": "Telugu", "nllb": "tel_Telu", "crowdin": "te"},
+    {"name": "Tajik", "nllb": "tgk_Cyrl", "crowdin": "tg"},
+    {"name": "Tagalog", "nllb": "tgl_Latn", "crowdin": "tl"},
+    {"name": "Thai", "nllb": "tha_Thai", "crowdin": "th"},
+    {"name": "Tigrinya", "nllb": "tir_Ethi", "crowdin": "ti"},
+    {"name": "Tswana", "nllb": "tsn_Latn", "crowdin": "tn"},
+    {"name": "Tsonga", "nllb": "tso_Latn", "crowdin": "ts"},
+    {"name": "Turkmen", "nllb": "tuk_Latn", "crowdin": "tk"},
+    {"name": "Turkish", "nllb": "tur_Latn", "crowdin": "tr"},
+    {"name": "Uyghur", "nllb": "uig_Arab", "crowdin": "ug"},
+    {"name": "Ukrainian", "nllb": "ukr_Cyrl", "crowdin": "uk"},
+    {"name": "Venetian", "nllb": "vec_Latn", "crowdin": "vec"},
+    {"name": "Vietnamese", "nllb": "vie_Latn", "crowdin": "vi"},
+    {"name": "Wolof", "nllb": "wol_Latn", "crowdin": "wo"},
+    {"name": "Xhosa", "nllb": "xho_Latn", "crowdin": "xh"},
+    {"name": "Yoruba", "nllb": "yor_Latn", "crowdin": "yo"},
+    {"name": "Zulu", "nllb": "zul_Latn", "crowdin": "zu"},
+]
+
+NAME_TO_NLLB = {lang["name"]: lang["nllb"] for lang in LANGUAGES}
+NAME_TO_CROWDIN = {lang["name"]: lang["crowdin"] for lang in LANGUAGES}
+CROWDIN_TO_NLLB = {lang["crowdin"]: lang["nllb"] for lang in LANGUAGES}
+NLLB_TO_CROWDIN = {lang["nllb"]: lang["crowdin"] for lang in LANGUAGES}
+CROWDIN_TO_NAME = {lang["crowdin"]: lang["name"] for lang in LANGUAGES}
+NLLB_TO_NAME = {lang["nllb"]: lang["name"] for lang in LANGUAGES}
+
+
+def get_nllb_code(crowdin_code):
+    return CROWDIN_TO_NLLB.get(crowdin_code, None)
+
+
+def get_crowdin_code(nllb_code):
+    return NLLB_TO_CROWDIN.get(nllb_code)
+
+
+def get_language_name_by_crowdin(crowdin_code):
+    return CROWDIN_TO_NAME.get(crowdin_code)
+
+
+def get_language_name_by_nllb(nllb_code):
+    return NLLB_TO_NAME.get(nllb_code)
+
+
+def get_language_info(identifier, identifier_type="auto"):
+    if identifier_type == "auto":
+        for lang in LANGUAGES:
+            if (lang["name"].lower() == identifier.lower() or 
+                lang["nllb"] == identifier or 
+                lang["crowdin"] == identifier):
+                return lang
+    elif identifier_type == "name":
+        for lang in LANGUAGES:
+            if lang["name"].lower() == identifier.lower():
+                return lang
+    elif identifier_type == "nllb":
+        for lang in LANGUAGES:
+            if lang["nllb"] == identifier:
+                return lang
+    elif identifier_type == "crowdin":
+        for lang in LANGUAGES:
+            if lang["crowdin"] == identifier:
+                return lang
+    
+    return None
+
+
+def list_all_languages():
+    return [lang["name"] for lang in LANGUAGES]
+
+
+def list_all_nllb_codes():
+    return [lang["nllb"] for lang in LANGUAGES]
+
+
+def list_all_crowdin_codes():
+    return [lang["crowdin"] for lang in LANGUAGES]
--- a/whisperlivekit/translation/translation.py
+++ b/whisperlivekit/translation/translation.py
@@ -0,0 +1,150 @@
+import logging
+import time
+import ctranslate2
+import torch
+import transformers
+from dataclasses import dataclass
+import huggingface_hub
+from whisperlivekit.translation.mapping_languages import get_nllb_code
+from whisperlivekit.timed_objects import Translation
+
+logger = logging.getLogger(__name__)
+
+#In diarization case, we may want to translate just one speaker, or at least start the sentences there
+
+PUNCTUATION_MARKS = {'.', '!', '?', '。', '！', '？'}
+
+MIN_SILENCE_DURATION_DEL_BUFFER = 3 #After a silence of x seconds, we consider the model should not use the buffer, even if the previous
+# sentence is not finished.
+
+@dataclass
+class TranslationModel():
+    translator: ctranslate2.Translator
+    tokenizer: dict
+    device: str
+    backend_type: str = 'ctranslate2'
+
+def load_model(src_langs, backend='ctranslate2', model_size='600M'):
+    device = "cuda" if torch.cuda.is_available() else "cpu"
+    MODEL = f'nllb-200-distilled-{model_size}-ctranslate2'
+    if backend=='ctranslate2':
+        MODEL_GUY = 'entai2965'
+        huggingface_hub.snapshot_download(MODEL_GUY + '/' + MODEL,local_dir=MODEL)
+        translator = ctranslate2.Translator(MODEL,device=device)
+    elif backend=='transformers':
+        translator = transformers.AutoModelForSeq2SeqLM.from_pretrained(f"facebook/nllb-200-distilled-{model_size}")
+    tokenizer = dict()
+    for src_lang in src_langs:
+        tokenizer[src_lang] = transformers.AutoTokenizer.from_pretrained(MODEL, src_lang=src_lang, clean_up_tokenization_spaces=True)
+
+    return TranslationModel(
+        translator=translator,
+        tokenizer=tokenizer,
+        backend_type=backend,
+        device = device
+    )
+
+class OnlineTranslation:
+    def __init__(self, translation_model: TranslationModel, input_languages: list, output_languages: list):
+        self.buffer = []
+        self.len_processed_buffer = 0
+        self.translation_remaining = Translation()
+        self.validated = []
+        self.translation_pending_validation = ''
+        self.translation_model = translation_model
+        self.input_languages = input_languages
+        self.output_languages = output_languages
+
+    def compute_common_prefix(self, results):
+        #we dont want want to prune the result for the moment. 
+        if not self.buffer:
+            self.buffer = results
+        else:
+            for i in range(min(len(self.buffer), len(results))):
+                if self.buffer[i] != results[i]:
+                    self.commited.extend(self.buffer[:i])
+                    self.buffer = results[i:]
+
+    def translate(self, input, input_lang=None, output_lang=None):
+        if not input:
+            return ""
+        if input_lang is None:
+            input_lang = self.input_languages[0]
+        if output_lang is None:
+            output_lang = self.output_languages[0]
+        nllb_output_lang = get_nllb_code(output_lang)
+            
+        tokenizer = self.translation_model.tokenizer[input_lang]
+        tokenizer_output = tokenizer(input, return_tensors="pt").to(self.translation_model.device)
+        
+        if self.translation_model.backend_type == 'ctranslate2':
+            source = tokenizer.convert_ids_to_tokens(tokenizer_output['input_ids'][0])    
+            results = self.translation_model.translator.translate_batch([source], target_prefix=[[nllb_output_lang]])
+            target = results[0].hypotheses[0][1:]
+            result = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
+        else:
+            translated_tokens = self.translation_model.translator.generate(**tokenizer_output, forced_bos_token_id=tokenizer.convert_tokens_to_ids(nllb_output_lang))
+            result = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+        return result
+    
+    def translate_tokens(self, tokens):
+        if tokens:
+            text = ' '.join([token.text for token in tokens])
+            start = tokens[0].start
+            end = tokens[-1].end
+            translated_text = self.translate(text)
+            translation = Translation(
+                text=translated_text,
+                start=start,
+                end=end,
+            )
+            return translation
+        return None
+            
+
+    def insert_tokens(self, tokens):
+        self.buffer.extend(tokens)
+        pass
+    
+    def process(self):
+        i = 0
+        if len(self.buffer) < self.len_processed_buffer + 3: #nothing new to process
+            return self.validated + [self.translation_remaining]
+        while i < len(self.buffer):
+            if self.buffer[i].text in PUNCTUATION_MARKS:
+                translation_sentence = self.translate_tokens(self.buffer[:i+1])
+                self.validated.append(translation_sentence)
+                self.buffer = self.buffer[i+1:]
+                i = 0
+            else:
+                i+=1
+        self.translation_remaining = self.translate_tokens(self.buffer)
+        self.len_processed_buffer = len(self.buffer)
+        return self.validated + [self.translation_remaining]
+
+    def insert_silence(self, silence_duration: float):
+        if silence_duration >= MIN_SILENCE_DURATION_DEL_BUFFER:
+            self.buffer = []
+            self.validated += [self.translation_remaining]
+
+if __name__ == '__main__':
+    output_lang = 'fr'
+    input_lang = "en"
+    
+    
+    test_string = """
+    Transcription technology has improved so much in the past few years. Have you noticed how accurate real-time speech-to-text is now?
+    """
+    test = test_string.split(' ')
+    step = len(test) // 3
+    
+    shared_model = load_model([input_lang], backend='ctranslate2')
+    online_translation = OnlineTranslation(shared_model, input_languages=[input_lang], output_languages=[output_lang])
+    
+    beg_inference = time.time()    
+    for id in range(5):
+        val = test[id*step : (id+1)*step]
+        val_str = ' '.join(val)
+        result = online_translation.translate(val_str)
+        print(result)
+    print('inference time:', time.time() - beg_inference)
--- a/whisperlivekit/warmup.py
+++ b/whisperlivekit/warmup.py
@@ -6,57 +6,46 @@ logger = logging.getLogger(__name__)
 def load_file(warmup_file=None, timeout=5):
    import os
    import tempfile
+    import urllib.request
    import librosa
-        
+
+    if warmup_file == "":
+        logger.info(f"Skipping warmup.")
+        return None
+
+    # Download JFK sample if not already present
    if warmup_file is None:
-        # Download JFK sample if not already present
        jfk_url = "https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav"
        temp_dir = tempfile.gettempdir()
        warmup_file = os.path.join(temp_dir, "whisper_warmup_jfk.wav")
-        
-        if not os.path.exists(warmup_file):
-            logger.debug(f"Downloading warmup file from {jfk_url}")
-            print(f"Downloading warmup file from {jfk_url}")
-            import time
-            import urllib.request
-            import urllib.error
-            import socket
-            
-            original_timeout = socket.getdefaulttimeout()
-            socket.setdefaulttimeout(timeout)
-            
-            start_time = time.time()
+        if not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
            try:
-                urllib.request.urlretrieve(jfk_url, warmup_file)
-                logger.debug(f"Download successful in {time.time() - start_time:.2f}s")
-            except (urllib.error.URLError, socket.timeout) as e:
-                logger.warning(f"Download failed: {e}. Proceeding without warmup.")
-                return False
-            finally:
-                socket.setdefaulttimeout(original_timeout)
-    elif not warmup_file:
-        return False 
-    
-    if not warmup_file or not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
-        logger.warning(f"Warmup file {warmup_file} invalid or missing.")
-        return False
-    
+                logger.debug(f"Downloading warmup file from {jfk_url}")
+                with urllib.request.urlopen(jfk_url, timeout=timeout) as r, open(warmup_file, "wb") as f:
+                    f.write(r.read())
+            except Exception as e:
+                logger.warning(f"Warmup file download failed: {e}.")
+                return None
+
+    # Validate file and load
+    if not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
+        logger.warning(f"Warmup file {warmup_file} is invalid or missing.")
+        return None
+
    try:
-        audio, sr = librosa.load(warmup_file, sr=16000)
+        audio, _ = librosa.load(warmup_file, sr=16000)
+        return audio
    except Exception as e:
-        logger.warning(f"Failed to load audio file: {e}")
-        return False
-    return audio
+        logger.warning(f"Failed to load warmup file: {e}")
+        return None

 def warmup_asr(asr, warmup_file=None, timeout=5):
    """
    Warmup the ASR model by transcribing a short audio file.
    """
-    audio = load_file(warmup_file=None, timeout=5)
+    audio = load_file(warmup_file=warmup_file, timeout=timeout)
+    if audio is None:
+        logger.warning("Warmup file unavailable. Skipping ASR warmup.")
+        return
    asr.transcribe(audio)
-    logger.info("ASR model is warmed up")
-    
-def warmup_online(online, warmup_file=None, timeout=5):
-    audio = load_file(warmup_file=None, timeout=5)
-    online.warmup(audio)
-    logger.warning("ASR is warmed up")
+    logger.info("ASR model is warmed up.")
--- a/whisperlivekit/web/live_transcription.css
+++ b/whisperlivekit/web/live_transcription.css
@@ -0,0 +1,516 @@
+:root {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+@media (prefers-color-scheme: dark) {
+  :root:not([data-theme="light"]) {
+    --bg: #0b0b0b;
+    --text: #e6e6e6;
+    --muted: #9aa0a6;
+    --border: #333333;
+    --chip-bg: rgba(255, 255, 255, 0.08);
+    --chip-text: #e6e6e6;
+    --spinner-border: #555555;
+    --spinner-top: #dddddd;
+    --silence-bg: #1a1a1a;
+    --loading-bg: rgba(255, 77, 77, 0.12);
+    --button-bg: #111111;
+    --button-border: #333333;
+    --wave-stroke: #e6e6e6;
+    --label-dia-text: #b3b3b3;
+    --label-trans-text: #ffffff;
+  }
+}
+
+:root[data-theme="dark"] {
+  --bg: #0b0b0b;
+  --text: #e6e6e6;
+  --muted: #9aa0a6;
+  --border: #333333;
+  --chip-bg: rgba(255, 255, 255, 0.08);
+  --chip-text: #e6e6e6;
+  --spinner-border: #555555;
+  --spinner-top: #dddddd;
+  --silence-bg: #1a1a1a;
+  --loading-bg: rgba(255, 77, 77, 0.12);
+  --button-bg: #111111;
+  --button-border: #333333;
+  --wave-stroke: #e6e6e6;
+  --label-dia-text: #b3b3b3;
+  --label-trans-text: #ffffff;
+}
+
+:root[data-theme="light"] {
+  --bg: #ffffff;
+  --text: #111111;
+  --muted: #666666;
+  --border: #e5e5e5;
+  --chip-bg: rgba(0, 0, 0, 0.04);
+  --chip-text: #000000;
+  --spinner-border: #8d8d8d5c;
+  --spinner-top: #b0b0b0;
+  --silence-bg: #f3f3f3;
+  --loading-bg: rgba(255, 77, 77, 0.06);
+  --button-bg: #ffffff;
+  --button-border: #e9e9e9;
+  --wave-stroke: #000000;
+  --label-dia-text: #868686;
+  --label-trans-text: #111111;
+}
+
+body {
+  font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
+  margin: 0;
+  text-align: center;
+  background-color: var(--bg);
+  color: var(--text);
+  height: 100vh;
+  display: flex;
+  flex-direction: column;
+}
+
+/* Record button */
+#recordButton {
+  width: 50px;
+  height: 50px;
+  border: none;
+  border-radius: 50%;
+  background-color: var(--button-bg);
+  cursor: pointer;
+  transition: all 0.3s ease;
+  border: 1px solid var(--button-border);
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  position: relative;
+}
+
+#recordButton.recording {
+  width: 180px;
+  border-radius: 40px;
+  justify-content: flex-start;
+  padding-left: 20px;
+}
+
+#recordButton:active {
+  transform: scale(0.95);
+}
+
+.shape-container {
+  width: 25px;
+  height: 25px;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+  flex-shrink: 0;
+}
+
+.shape {
+  width: 25px;
+  height: 25px;
+  background-color: rgb(209, 61, 53);
+  border-radius: 50%;
+  transition: all 0.3s ease;
+}
+
+#recordButton:disabled .shape {
+  background-color: #6e6d6d;
+}
+
+#recordButton.recording .shape {
+  border-radius: 5px;
+  width: 25px;
+  height: 25px;
+}
+
+/* Recording elements */
+.recording-info {
+  display: none;
+  align-items: center;
+  margin-left: 15px;
+  flex-grow: 1;
+}
+
+#recordButton.recording .recording-info {
+  display: flex;
+}
+
+.wave-container {
+  width: 60px;
+  height: 30px;
+  position: relative;
+  display: flex;
+  align-items: center;
+  justify-content: center;
+}
+
+#waveCanvas {
+  width: 100%;
+  height: 100%;
+}
+
+.timer {
+  font-size: 14px;
+  font-weight: 500;
+  color: var(--text);
+  margin-left: 10px;
+}
+
+#status {
+  margin-top: 15px;
+  font-size: 16px;
+  color: var(--text);
+  margin-bottom: 0;
+}
+
+.header-container {
+  position: sticky;
+  top: 0;
+  background-color: var(--bg);
+  z-index: 100;
+  padding: 20px;
+}
+
+/* Settings */
+.settings-container {
+  display: flex;
+  justify-content: center;
+  align-items: center;
+  gap: 15px;
+}
+
+.settings {
+  display: flex;
+  flex-wrap: wrap;
+  align-items: flex-start;
+  gap: 12px;
+}
+
+.field {
+  display: flex;
+  flex-direction: column;
+  align-items: flex-start;
+  gap: 3px;
+}
+
+#chunkSelector,
+#websocketInput,
+#themeSelector,
+#microphoneSelect {
+  font-size: 16px;
+  padding: 5px 8px;
+  border-radius: 8px;
+  border: 1px solid var(--border);
+  background-color: var(--button-bg);
+  color: var(--text);
+  max-height: 30px;
+}
+
+#microphoneSelect {
+  width: 100%;
+  max-width: 190px;
+  min-width: 120px;
+}
+
+#chunkSelector:focus,
+#websocketInput:focus,
+#themeSelector:focus,
+#microphoneSelect:focus {
+  outline: none;
+  border-color: #007bff;
+  box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.15);
+}
+
+label {
+  font-size: 13px;
+  color: var(--muted);
+}
+
+.ws-default {
+  font-size: 12px;
+  color: var(--muted);
+}
+
+/* Segmented pill control for Theme */
+.segmented {
+  display: inline-flex;
+  align-items: stretch;
+  border: 1px solid var(--button-border);
+  background-color: var(--button-bg);
+  border-radius: 999px;
+  overflow: hidden;
+}
+
+.segmented input[type="radio"] {
+  position: absolute;
+  opacity: 0;
+  pointer-events: none;
+}
+
+.theme-selector-container {
+  display: flex;
+  align-items: center;
+  margin-top: 17px;
+}
+
+.segmented label {
+  display: inline-flex;
+  align-items: center;
+  gap: 6px;
+  padding: 6px 12px;
+  font-size: 14px;
+  color: var(--muted);
+  cursor: pointer;
+  user-select: none;
+  transition: background-color 0.2s ease, color 0.2s ease;
+}
+
+.segmented label span {
+  display: none;
+}
+
+.segmented label:hover span {
+  display: inline;
+}
+
+.segmented label:hover {
+  background-color: var(--chip-bg);
+}
+
+.segmented img {
+  width: 16px;
+  height: 16px;
+}
+
+.segmented input[type="radio"]:checked + label {
+  background-color: var(--chip-bg);
+  color: var(--text);
+}
+
+.segmented input[type="radio"]:focus-visible + label,
+.segmented input[type="radio"]:focus + label {
+  outline: 2px solid #007bff;
+  outline-offset: 2px;
+  border-radius: 999px;
+}
+
+.transcript-container {
+  flex: 1;
+  overflow-y: auto;
+  padding: 20px;
+  scrollbar-width: none;
+  -ms-overflow-style: none;
+}
+
+.transcript-container::-webkit-scrollbar {
+  display: none;
+}
+
+/* Transcript area */
+#linesTranscript {
+  margin: 0 auto;
+  max-width: 700px;
+  text-align: left;
+  font-size: 16px;
+}
+
+#linesTranscript p {
+  margin: 0px 0;
+}
+
+#linesTranscript strong {
+  color: var(--text);
+}
+
+#speaker {
+  border: 1px solid var(--border);
+  border-radius: 100px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
+
+.label_diarization {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  margin-left: 10px;
+  display: inline-block;
+  white-space: nowrap;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-dia-text);
+}
+
+.label_transcription {
+  background-color: var(--chip-bg);
+  border-radius: 8px 8px 8px 8px;
+  padding: 2px 10px;
+  display: inline-block;
+  white-space: nowrap;
+  margin-left: 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+  color: var(--label-trans-text);
+}
+
+.label_translation {
+  background-color: var(--chip-bg);
+  border-radius: 10px;
+  padding: 4px 8px;
+  margin-top: 4px;
+  font-size: 14px;
+  color: var(--text);
+  display: flex;
+  align-items: flex-start;
+  gap: 4px;
+}
+
+.label_translation img {
+  margin-top: 2px;
+}
+
+.label_translation img {
+  width: 12px;
+  height: 12px;
+}
+
+#timeInfo {
+  color: var(--muted);
+  margin-left: 10px;
+}
+
+.textcontent {
+  font-size: 16px;
+  padding-left: 10px;
+  margin-bottom: 10px;
+  margin-top: 1px;
+  padding-top: 5px;
+  border-radius: 0px 0px 0px 10px;
+}
+
+.buffer_diarization {
+  color: var(--label-dia-text);
+  margin-left: 4px;
+}
+
+.buffer_transcription {
+  color: #7474748c;
+  margin-left: 4px;
+}
+
+.spinner {
+  display: inline-block;
+  width: 8px;
+  height: 8px;
+  border: 2px solid var(--spinner-border);
+  border-top: 2px solid var(--spinner-top);
+  border-radius: 50%;
+  animation: spin 0.7s linear infinite;
+  vertical-align: middle;
+  margin-bottom: 2px;
+  margin-right: 5px;
+}
+
+@keyframes spin {
+  to {
+    transform: rotate(360deg);
+  }
+}
+
+.silence {
+  color: var(--muted);
+  background-color: var(--silence-bg);
+  font-size: 13px;
+  border-radius: 30px;
+  padding: 2px 10px;
+}
+
+.loading {
+  color: var(--muted);
+  background-color: var(--loading-bg);
+  border-radius: 8px 8px 8px 0px;
+  padding: 2px 10px;
+  font-size: 14px;
+  margin-bottom: 0px;
+}
+
+/* for smaller screens */
+@media (max-width: 768px) {
+  .header-container {
+    padding: 15px;
+  }
+  
+  .settings-container {
+    flex-direction: column;
+    gap: 10px;
+  }
+  
+  .settings {
+    justify-content: center;
+    gap: 8px;
+  }
+  
+  .field {
+    align-items: center;
+  }
+  
+  #websocketInput,
+  #microphoneSelect {
+    min-width: 100px;
+    max-width: 160px;
+  }
+  
+  .theme-selector-container {
+    margin-top: 10px;
+  }
+  
+  .transcript-container {
+    padding: 15px;
+  }
+}
+
+@media (max-width: 480px) {
+  .header-container {
+    padding: 10px;
+  }
+  
+  .settings {
+    flex-direction: column;
+    align-items: center;
+    gap: 6px;
+  }
+  
+  #websocketInput,
+  #microphoneSelect {
+    max-width: 140px;
+  }
+  
+  .segmented label {
+    padding: 4px 8px;
+    font-size: 12px;
+  }
+  
+  .segmented img {
+    width: 14px;
+    height: 14px;
+  }
+  
+  .transcript-container {
+    padding: 10px;
+  }
+}
--- a/whisperlivekit/web/live_transcription.html
+++ b/whisperlivekit/web/live_transcription.html
@@ -5,857 +5,69 @@
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>WhisperLiveKit</title>
-    <style>
-        :root {
-            --bg: #ffffff;
-            --text: #111111;
-            --muted: #666666;
-            --border: #e5e5e5;
-            --chip-bg: rgba(0, 0, 0, 0.04);
-            --chip-text: #000000;
-            --spinner-border: #8d8d8d5c;
-            --spinner-top: #b0b0b0;
-            --silence-bg: #f3f3f3;
-            --loading-bg: rgba(255, 77, 77, 0.06);
-            --button-bg: #ffffff;
-            --button-border: #e9e9e9;
-            --wave-stroke: #000000;
-            --label-dia-text: #868686;
-            --label-trans-text: #111111;
-        }
-
-        @media (prefers-color-scheme: dark) {
-            :root:not([data-theme="light"]) {
-                --bg: #0b0b0b;
-                --text: #e6e6e6;
-                --muted: #9aa0a6;
-                --border: #333333;
-                --chip-bg: rgba(255, 255, 255, 0.08);
-                --chip-text: #e6e6e6;
-                --spinner-border: #555555;
-                --spinner-top: #dddddd;
-                --silence-bg: #1a1a1a;
-                --loading-bg: rgba(255, 77, 77, 0.12);
-                --button-bg: #111111;
-                --button-border: #333333;
-                --wave-stroke: #e6e6e6;
-                --label-dia-text: #b3b3b3;
-                --label-trans-text: #ffffff;
-            }
-        }
-
-        :root[data-theme="dark"] {
-            --bg: #0b0b0b;
-            --text: #e6e6e6;
-            --muted: #9aa0a6;
-            --border: #333333;
-            --chip-bg: rgba(255, 255, 255, 0.08);
-            --chip-text: #e6e6e6;
-            --spinner-border: #555555;
-            --spinner-top: #dddddd;
-            --silence-bg: #1a1a1a;
-            --loading-bg: rgba(255, 77, 77, 0.12);
-            --button-bg: #111111;
-            --button-border: #333333;
-            --wave-stroke: #e6e6e6;
-            --label-dia-text: #b3b3b3;
-            --label-trans-text: #ffffff;
-        }
-
-        :root[data-theme="light"] {
-            --bg: #ffffff;
-            --text: #111111;
-            --muted: #666666;
-            --border: #e5e5e5;
-            --chip-bg: rgba(0, 0, 0, 0.04);
-            --chip-text: #000000;
-            --spinner-border: #8d8d8d5c;
-            --spinner-top: #b0b0b0;
-            --silence-bg: #f3f3f3;
-            --loading-bg: rgba(255, 77, 77, 0.06);
-            --button-bg: #ffffff;
-            --button-border: #e9e9e9;
-            --wave-stroke: #000000;
-            --label-dia-text: #868686;
-            --label-trans-text: #111111;
-        }
-        body {
-            font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
-            margin: 20px;
-            text-align: center;
-            background-color: var(--bg);
-            color: var(--text);
-        }
-
-        #recordButton {
-            width: 50px;
-            height: 50px;
-            border: none;
-            border-radius: 50%;
-            background-color: var(--button-bg);
-            cursor: pointer;
-            transition: all 0.3s ease;
-            border: 1px solid var(--button-border);
-            display: flex;
-            align-items: center;
-            justify-content: center;
-            position: relative;
-        }
-
-        #recordButton.recording {
-            width: 180px;
-            border-radius: 40px;
-            justify-content: flex-start;
-            padding-left: 20px;
-        }
-
-        #recordButton:active {
-            transform: scale(0.95);
-        }
-
-        .shape-container {
-            width: 25px;
-            height: 25px;
-            display: flex;
-            align-items: center;
-            justify-content: center;
-            flex-shrink: 0;
-        }
-
-        .shape {
-            width: 25px;
-            height: 25px;
-            background-color: rgb(209, 61, 53);
-            border-radius: 50%;
-            transition: all 0.3s ease;
-        }
-
-        #recordButton:disabled .shape {
-            background-color: #6e6d6d;
-        }
-
-        #recordButton.recording .shape {
-            border-radius: 5px;
-            width: 25px;
-            height: 25px;
-        }
-
-        /* Recording elements */
-        .recording-info {
-            display: none;
-            align-items: center;
-            margin-left: 15px;
-            flex-grow: 1;
-        }
-
-        #recordButton.recording .recording-info {
-            display: flex;
-        }
-
-        .wave-container {
-            width: 60px;
-            height: 30px;
-            position: relative;
-            display: flex;
-            align-items: center;
-            justify-content: center;
-        }
-
-        #waveCanvas {
-            width: 100%;
-            height: 100%;
-        }
-
-        .timer {
-            font-size: 14px;
-            font-weight: 500;
-            color: var(--text);
-            margin-left: 10px;
-        }
-
-        #status {
-            margin-top: 20px;
-            font-size: 16px;
-            color: var(--text);
-        }
-
-        .settings-container {
-            display: flex;
-            justify-content: center;
-            align-items: center;
-            gap: 15px;
-            margin-top: 20px;
-        }
-
-        .settings {
-            display: flex;
-            flex-direction: column;
-            align-items: flex-start;
-            gap: 5px;
-        }
-
-        #chunkSelector,
-        #websocketInput,
-        #themeSelector {
-            font-size: 16px;
-            padding: 5px;
-            border-radius: 5px;
-            border: 1px solid var(--border);
-            background-color: var(--button-bg);
-            color: var(--text);
-            max-height: 30px;
-        }
-
-        #websocketInput {
-            width: 200px;
-        }
-
-        #chunkSelector:focus,
-        #websocketInput:focus,
-        #themeSelector:focus {
-            outline: none;
-            border-color: #007bff;
-        }
-
-        label {
-            font-size: 14px;
-        }
-
-        /* Speaker-labeled transcript area */
-        #linesTranscript {
-            margin: 20px auto;
-            max-width: 700px;
-            text-align: left;
-            font-size: 16px;
-        }
-
-        #linesTranscript p {
-            margin: 0px 0;
-        }
-
-        #linesTranscript strong {
-            color: var(--text);
-        }
-
-        #speaker {
-            border: 1px solid var(--border);
-            border-radius: 100px;
-            padding: 2px 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-        }
-        .label_diarization {
-            background-color: var(--chip-bg);
-            border-radius: 8px 8px 8px 8px;
-            padding: 2px 10px;
-            margin-left: 10px;
-            display: inline-block;
-            white-space: nowrap;
-            font-size: 14px;
-            margin-bottom: 0px;
-            color: var(--label-dia-text)
-        }
-
-        .label_transcription {
-            background-color: var(--chip-bg);
-            border-radius: 8px 8px 8px 8px;
-            padding: 2px 10px;
-            display: inline-block;
-            white-space: nowrap;
-            margin-left: 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-            color: var(--label-trans-text)
-        }
-
-        #timeInfo {
-            color: var(--muted);
-            margin-left: 10px;
-        }
-
-        .textcontent {
-            font-size: 16px;
-            /* margin-left: 10px; */
-            padding-left: 10px;
-            margin-bottom: 10px;
-            margin-top: 1px;
-            padding-top: 5px;
-            border-radius: 0px 0px 0px 10px;
-        }
-
-        .buffer_diarization {
-            color: var(--label-dia-text);
-            margin-left: 4px;
-        }
-
-        .buffer_transcription {
-            color: #7474748c;
-            margin-left: 4px;
-        }
-
-
-        .spinner {
-            display: inline-block;
-            width: 8px;
-            height: 8px;
-            border: 2px solid var(--spinner-border);
-            border-top: 2px solid var(--spinner-top);
-            border-radius: 50%;
-            animation: spin 0.7s linear infinite;
-            vertical-align: middle;
-            margin-bottom: 2px;
-            margin-right: 5px;
-        }
-
-        @keyframes spin {
-            to {
-                transform: rotate(360deg);
-            }
-        }
-
-        .silence {
-            color: var(--muted);
-            background-color: var(--silence-bg);
-            font-size: 13px;
-            border-radius: 30px;
-            padding: 2px 10px;
-        }
-
-        .loading {
-            color: var(--muted);
-            background-color: var(--loading-bg);
-            border-radius: 8px 8px 8px 0px;
-            padding: 2px 10px;
-            font-size: 14px;
-            margin-bottom: 0px;
-        }
-    </style>
+    <link rel="stylesheet" href="/web/live_transcription.css" />
 </head>

 <body>
-
-    <div class="settings-container">
-        <button id="recordButton">
-            <div class="shape-container">
-                <div class="shape"></div>
-            </div>
-            <div class="recording-info">
-                <div class="wave-container">
-                    <canvas id="waveCanvas"></canvas>
+    <div class="header-container">
+        <div class="settings-container">
+            <button id="recordButton">
+                <div class="shape-container">
+                    <div class="shape"></div>
+                </div>
+                <div class="recording-info">
+                    <div class="wave-container">
+                        <canvas id="waveCanvas"></canvas>
+                    </div>
+                    <div class="timer">00:00</div>
+                </div>
+            </button>
+
+            <div class="settings">
+                <div class="field">
+                    <label for="websocketInput">Websocket URL</label>
+                    <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
+                </div>
+
+                <div class="field">
+                    <label id="microphoneSelectLabel" for="microphoneSelect">Select Microphone</label>
+                    <select id="microphoneSelect">
+                        <option value="">Default Microphone</option>
+                    </select>
+                </div>
+
+                <div class="theme-selector-container">
+                    <div class="segmented" role="radiogroup" aria-label="Theme selector">
+                        <input type="radio" id="theme-system" name="theme" value="system" />
+                        <label for="theme-system" title="System">
+                            <img src="/web/src/system_mode.svg" alt="" />
+                            <span>System</span>
+                        </label>
+
+                        <input type="radio" id="theme-light" name="theme" value="light" />
+                        <label for="theme-light" title="Light">
+                            <img src="/web/src/light_mode.svg" alt="" />
+                            <span>Light</span>
+                        </label>
+
+                        <input type="radio" id="theme-dark" name="theme" value="dark" />
+                        <label for="theme-dark" title="Dark">
+                            <img src="/web/src/dark_mode.svg" alt="" />
+                            <span>Dark</span>
+                        </label>
+                    </div>
                </div>
-                <div class="timer">00:00</div>
-            </div>
-        </button>
-        <div class="settings">
-            <div>
-                <label for="chunkSelector">Chunk size (ms):</label>
-                <select id="chunkSelector">
-                    <option value="500">500 ms</option>
-                    <option value="1000" selected>1000 ms</option>
-                    <option value="2000">2000 ms</option>
-                    <option value="3000">3000 ms</option>
-                    <option value="4000">4000 ms</option>
-                    <option value="5000">5000 ms</option>
-                </select>
-            </div>
-            <div>
-                <label for="websocketInput">WebSocket URL:</label>
-                <input id="websocketInput" type="text" />
-            </div>
-            <div>
-                <label for="themeSelector">Theme:</label>
-                <select id="themeSelector">
-                    <option value="system" selected>System</option>
-                    <option value="light">Light</option>
-                    <option value="dark">Dark</option>
-                </select>
            </div>
        </div>
+        
+        <p id="status"></p>
    </div>

-    <p id="status"></p>
+    <div class="transcript-container">
+        <div id="linesTranscript"></div>
+    </div>

-    <!-- Speaker-labeled transcript -->
-    <div id="linesTranscript"></div>
-
-    <script>
-        let isRecording = false;
-        let websocket = null;
-        let recorder = null;
-        let chunkDuration = 1000;
-        let websocketUrl = "ws://localhost:8000/asr";
-        let userClosing = false;
-        let wakeLock = null;
-        let startTime = null;
-        let timerInterval = null;
-        let audioContext = null;
-        let analyser = null;
-        let microphone = null;
-        let waveCanvas = document.getElementById("waveCanvas");
-        let waveCtx = waveCanvas.getContext("2d");
-        let animationFrame = null;
-        let waitingForStop = false;
-        let lastReceivedData = null;
-        let lastSignature = null;
-        waveCanvas.width = 60 * (window.devicePixelRatio || 1);
-        waveCanvas.height = 30 * (window.devicePixelRatio || 1);
-        waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
-
-        const statusText = document.getElementById("status");
-        const recordButton = document.getElementById("recordButton");
-        const chunkSelector = document.getElementById("chunkSelector");
-        const websocketInput = document.getElementById("websocketInput");
-        const linesTranscriptDiv = document.getElementById("linesTranscript");
-        const timerElement = document.querySelector(".timer");
-        const themeSelector = document.getElementById("themeSelector");
-
-        function getWaveStroke() {
-            const styles = getComputedStyle(document.documentElement);
-            const v = styles.getPropertyValue("--wave-stroke").trim();
-            return v || "#000";
-        }
-
-        let waveStroke = getWaveStroke();
-
-        function updateWaveStroke() {
-            waveStroke = getWaveStroke();
-        }
-
-        function applyTheme(pref) {
-            if (pref === "light") {
-                document.documentElement.setAttribute("data-theme", "light");
-            } else if (pref === "dark") {
-                document.documentElement.setAttribute("data-theme", "dark");
-            } else {
-                document.documentElement.removeAttribute("data-theme");
-            }
-            updateWaveStroke();
-        }
-
-        const savedThemePref = localStorage.getItem("themePreference") || "system";
-        applyTheme(savedThemePref);
-        if (themeSelector) {
-            themeSelector.value = savedThemePref;
-            themeSelector.addEventListener("change", () => {
-                const val = themeSelector.value;
-                localStorage.setItem("themePreference", val);
-                applyTheme(val);
-            });
-        }
-
-        const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
-        const handleOsThemeChange = () => {
-            const pref = localStorage.getItem("themePreference") || "system";
-            if (pref === "system") updateWaveStroke();
-        };
-        if (darkMq && darkMq.addEventListener) {
-            darkMq.addEventListener("change", handleOsThemeChange);
-        } else if (darkMq && darkMq.addListener) {
-            darkMq.addListener(handleOsThemeChange);
-        }
-
-        function fmt1(x) {
-            const n = Number(x);
-            return Number.isFinite(n) ? n.toFixed(1) : x;
-        }
-
-        const host = window.location.hostname || "localhost";
-        const port = window.location.port;
-        const protocol = window.location.protocol === "https:" ? "wss" : "ws";
-        const defaultWebSocketUrl = `${protocol}://${host}:${port}/asr`;
-        websocketInput.value = defaultWebSocketUrl;
-        websocketUrl = defaultWebSocketUrl;
-
-        chunkSelector.addEventListener("change", () => {
-            chunkDuration = parseInt(chunkSelector.value);
-        });
-
-        websocketInput.addEventListener("change", () => {
-            const urlValue = websocketInput.value.trim();
-            if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
-                statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
-                return;
-            }
-            websocketUrl = urlValue;
-            statusText.textContent = "WebSocket URL updated. Ready to connect.";
-        });
-
-        function setupWebSocket() {
-            return new Promise((resolve, reject) => {
-                try {
-                    websocket = new WebSocket(websocketUrl);
-                } catch (error) {
-                    statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
-                    reject(error);
-                    return;
-                }
-
-                websocket.onopen = () => {
-                    statusText.textContent = "Connected to server.";
-                    resolve();
-                };
-
-                websocket.onclose = () => {
-                    if (userClosing) {
-                        if (waitingForStop) {
-                            statusText.textContent = "Processing finalized or connection closed.";
-                            if (lastReceivedData) {
-                                renderLinesWithBuffer(
-                                    lastReceivedData.lines || [],
-                                    lastReceivedData.buffer_diarization || "",
-                                    lastReceivedData.buffer_transcription || "",
-                                    0, 0, true // isFinalizing = true
-                                );
-                            }
-                        }
-                        // If ready_to_stop was received, statusText is already "Finished processing..."
-                        // and waitingForStop is false.
-                    } else {
-                        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
-                        if (isRecording) {
-                            stopRecording(); 
-                        }
-                    }
-                    isRecording = false;  
-                    waitingForStop = false; 
-                    userClosing = false;  
-                    lastReceivedData = null;  
-                    websocket = null;    
-                    updateUI();  
-                };
-
-                websocket.onerror = () => {
-                    statusText.textContent = "Error connecting to WebSocket.";
-                    reject(new Error("Error connecting to WebSocket"));
-                };
-
-                // Handle messages from server
-                websocket.onmessage = (event) => {
-                    const data = JSON.parse(event.data);
-                    
-                    // Check for status messages
-                    if (data.type === "ready_to_stop") {
-                        console.log("Ready to stop received, finalizing display and closing WebSocket.");
-                        waitingForStop = false;
-
-                        if (lastReceivedData) {
-                            renderLinesWithBuffer(
-                                lastReceivedData.lines || [],
-                                lastReceivedData.buffer_diarization || "",
-                                lastReceivedData.buffer_transcription || "",
-                                0, // No more lag
-                                0, // No more lag
-                                true // isFinalizing = true
-                            );
-                        }
-                        statusText.textContent = "Finished processing audio! Ready to record again.";
-                        recordButton.disabled = false;
-                        
-                        if (websocket) {
-                            websocket.close(); // will trigger onclose
-                            // websocket = null; // onclose handle setting websocket to null
-                        }
-                        return;
-                    }
-                    
-                    lastReceivedData = data; 
-                    
-                    // Handle normal transcription updates
-                    const { 
-                        lines = [], 
-                        buffer_transcription = "", 
-                        buffer_diarization = "",
-                        remaining_time_transcription = 0,
-                        remaining_time_diarization = 0,
-                        status = "active_transcription"
-                    } = data;
-                    
-                    renderLinesWithBuffer(
-                        lines, 
-                        buffer_diarization, 
-                        buffer_transcription, 
-                        remaining_time_diarization,
-                        remaining_time_transcription,
-                        false,
-                        status
-                    );
-                };
-            });
-        }
-
-        function renderLinesWithBuffer(lines, buffer_diarization, buffer_transcription, remaining_time_diarization, remaining_time_transcription, isFinalizing = false, current_status = "active_transcription") {
-            if (current_status === "no_audio_detected") {
-                linesTranscriptDiv.innerHTML = "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
-                return; 
-            }
-
-            // try to keep stable DOM despite having updates every 0.1s. only update numeric lag values if structure hasn't changed
-            const showLoading = (!isFinalizing) && (lines || []).some(it => it.speaker == 0);
-            const showTransLag = !isFinalizing && remaining_time_transcription > 0;
-            const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
-            const signature = JSON.stringify({
-                lines: (lines || []).map(it => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
-                buffer_transcription: buffer_transcription || "",
-                buffer_diarization: buffer_diarization || "",
-                status: current_status,
-                showLoading,
-                showTransLag,
-                showDiaLag,
-                isFinalizing: !!isFinalizing
-            });
-            if (lastSignature === signature) {
-                const t = document.querySelector(".lag-transcription-value");
-                if (t) t.textContent = fmt1(remaining_time_transcription);
-                const d = document.querySelector(".lag-diarization-value");
-                if (d) d.textContent = fmt1(remaining_time_diarization);
-                const ld = document.querySelector(".loading-diarization-value");
-                if (ld) ld.textContent = fmt1(remaining_time_diarization);
-                return;
-            }
-            lastSignature = signature;
-
-            const linesHtml = lines.map((item, idx) => {
-                let timeInfo = "";
-                if (item.beg !== undefined && item.end !== undefined) {
-                    timeInfo = ` ${item.beg} - ${item.end}`;
-                }
-
-                let speakerLabel = "";
-                if (item.speaker === -2) {
-                    speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
-                } else if (item.speaker == 0 && !isFinalizing) {
-                    speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(remaining_time_diarization)}</span> second(s) of audio are undergoing diarization</span></span>`;
-                } else if (item.speaker == -1) {
-                    speakerLabel = `<span id="speaker">Speaker 1<span id='timeInfo'>${timeInfo}</span></span>`;
-                } else if (item.speaker !== -1 && item.speaker !== 0) {
-                    speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
-                }
-
-
-                let currentLineText = item.text || "";
-
-                if (idx === lines.length - 1) { 
-                    if (!isFinalizing && item.speaker !== -2) {
-                        if (remaining_time_transcription > 0) {
-                             speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(remaining_time_transcription)}</span>s</span></span>`;
-                        }
-                        if (buffer_diarization && remaining_time_diarization > 0) {
-                             speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(remaining_time_diarization)}</span>s</span></span>`;
-                        }
-                    }
-
-                    if (buffer_diarization) {
-                        if (isFinalizing) {
-                            currentLineText += (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
-                        } else {
-                            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
-                        }
-                    }
-                    if (buffer_transcription) {
-                        if (isFinalizing) {
-                            currentLineText += (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") + buffer_transcription.trim();
-                        } else {
-                            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
-                        }
-                    }
-                }
-                
-                return currentLineText.trim().length > 0 || speakerLabel.length > 0
-                    ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
-                    : `<p>${speakerLabel}<br/></p>`; 
-            }).join("");
-
-            linesTranscriptDiv.innerHTML = linesHtml;
-            window.scrollTo({ top: document.body.scrollHeight, behavior: 'smooth' });
-        }
-
-        function updateTimer() {
-            if (!startTime) return;
-            
-            const elapsed = Math.floor((Date.now() - startTime) / 1000);
-            const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
-            const seconds = (elapsed % 60).toString().padStart(2, "0");
-            timerElement.textContent = `${minutes}:${seconds}`;
-        }
-
-        function drawWaveform() {
-            if (!analyser) return;
-            
-            const bufferLength = analyser.frequencyBinCount;
-            const dataArray = new Uint8Array(bufferLength);
-            analyser.getByteTimeDomainData(dataArray);
-            
-            waveCtx.clearRect(0, 0, waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1));
-            waveCtx.lineWidth = 1;
-            waveCtx.strokeStyle = waveStroke;
-            waveCtx.beginPath();
-            
-            const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
-            let x = 0;
-            
-            for (let i = 0; i < bufferLength; i++) {
-                const v = dataArray[i] / 128.0;
-                const y = v * (waveCanvas.height / (window.devicePixelRatio || 1)) / 2;
-                
-                if (i === 0) {
-                    waveCtx.moveTo(x, y);
-                } else {
-                    waveCtx.lineTo(x, y);
-                }
-                
-                x += sliceWidth;
-            }
-            
-            waveCtx.lineTo(waveCanvas.width / (window.devicePixelRatio || 1), waveCanvas.height / (window.devicePixelRatio || 1) / 2);
-            waveCtx.stroke();
-            
-            animationFrame = requestAnimationFrame(drawWaveform);
-        }
-
-        async function startRecording() {
-            try {
-
-                // https://developer.mozilla.org/en-US/docs/Web/API/Screen_Wake_Lock_API
-                // create an async function to request a wake lock
-                try {
-                  wakeLock = await navigator.wakeLock.request("screen");
-                } catch (err) {
-                  // The Wake Lock request has failed - usually system related, such as battery.
-                  console.log("Error acquiring wake lock.")
-                }
-
-                const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
-                
-                audioContext = new (window.AudioContext || window.webkitAudioContext)();
-                analyser = audioContext.createAnalyser();
-                analyser.fftSize = 256;
-                microphone = audioContext.createMediaStreamSource(stream);
-                microphone.connect(analyser);
-                
-                recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
-                recorder.ondataavailable = (e) => {
-                    if (websocket && websocket.readyState === WebSocket.OPEN) {
-                        websocket.send(e.data);
-                    }
-                };
-                recorder.start(chunkDuration);
-                
-                startTime = Date.now();
-                timerInterval = setInterval(updateTimer, 1000);
-                drawWaveform();
-                
-                isRecording = true;
-                updateUI();
-            } catch (err) {
-                statusText.textContent = "Error accessing microphone. Please allow microphone access.";
-                console.error(err);
-            }
-        }
-
-        async function stopRecording() {
-            wakeLock.release().then(() => {
-              wakeLock = null;
-            });
-  
-            userClosing = true;
-            waitingForStop = true;
-            
-            if (websocket && websocket.readyState === WebSocket.OPEN) {
-                // Send empty audio buffer as stop signal
-                const emptyBlob = new Blob([], { type: 'audio/webm' });
-                websocket.send(emptyBlob);
-                statusText.textContent = "Recording stopped. Processing final audio...";
-            }
-            
-            if (recorder) {
-                recorder.stop();
-                recorder = null;
-            }
-            
-            if (microphone) {
-                microphone.disconnect();
-                microphone = null;
-            }
-            
-            if (analyser) {
-                analyser = null;
-            }
-            
-            if (audioContext && audioContext.state !== 'closed') {
-                try {
-                    audioContext.close();
-                } catch (e) {
-                    console.warn("Could not close audio context:", e);
-                }
-                audioContext = null;
-            }
-            
-            if (animationFrame) {
-                cancelAnimationFrame(animationFrame);
-                animationFrame = null;
-            }
-            
-            if (timerInterval) {
-                clearInterval(timerInterval);
-                timerInterval = null;
-            }            
-            timerElement.textContent = "00:00";
-            startTime = null;
-            
-            
-            isRecording = false;
-            updateUI();	
-        }
-
-        async function toggleRecording() {
-            if (!isRecording) {
-                if (waitingForStop) {
-                    console.log("Waiting for stop, early return");
-                    return;  // Early return, UI is already updated
-                }
-                console.log("Connecting to WebSocket");
-                try {
-                    // If we have an active WebSocket that's still processing, just restart audio capture
-                    if (websocket && websocket.readyState === WebSocket.OPEN) {
-                        await startRecording();
-                    } else {
-                        // If no active WebSocket or it's closed, create new one
-                        await setupWebSocket();
-                        await startRecording();
-                    }
-                } catch (err) {
-                    statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
-                    console.error(err);
-                }
-            } else {
-                console.log("Stopping recording");
-                stopRecording();
-            }
-        }
-
-        function updateUI() {
-            recordButton.classList.toggle("recording", isRecording);
-            recordButton.disabled = waitingForStop;
-
-            if (waitingForStop) {
-                if (statusText.textContent !== "Recording stopped. Processing final audio...") {
-                     statusText.textContent = "Please wait for processing to complete...";
-                }
-            } else if (isRecording) {
-                statusText.textContent = "Recording...";
-            } else {
-                if (statusText.textContent !== "Finished processing audio! Ready to record again." &&
-                    statusText.textContent !== "Processing finalized or connection closed.") {
-                    statusText.textContent = "Click to start transcription";
-                }
-            }
-            if (!waitingForStop) {
-                recordButton.disabled = false;
-            }
-        }
-
-        recordButton.addEventListener("click", toggleRecording);
-    </script>
+    <script src="/web/live_transcription.js"></script>
 </body>

-</html>
+</html>
--- a/whisperlivekit/web/live_transcription.js
+++ b/whisperlivekit/web/live_transcription.js
@@ -0,0 +1,683 @@
+/* Theme, WebSocket, recording, rendering logic extracted from inline script and adapted for segmented theme control and WS caption */
+
+let isRecording = false;
+let websocket = null;
+let recorder = null;
+let chunkDuration = 100;
+let websocketUrl = "ws://localhost:8000/asr";
+let userClosing = false;
+let wakeLock = null;
+let startTime = null;
+let timerInterval = null;
+let audioContext = null;
+let analyser = null;
+let microphone = null;
+let workletNode = null;
+let recorderWorker = null;
+let waveCanvas = document.getElementById("waveCanvas");
+let waveCtx = waveCanvas.getContext("2d");
+let animationFrame = null;
+let waitingForStop = false;
+let lastReceivedData = null;
+let lastSignature = null;
+let availableMicrophones = [];
+let selectedMicrophoneId = null;
+let serverUseAudioWorklet = null;
+let configReadyResolve;
+const configReady = new Promise((r) => (configReadyResolve = r));
+
+waveCanvas.width = 60 * (window.devicePixelRatio || 1);
+waveCanvas.height = 30 * (window.devicePixelRatio || 1);
+waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
+
+const statusText = document.getElementById("status");
+const recordButton = document.getElementById("recordButton");
+const chunkSelector = document.getElementById("chunkSelector");
+const websocketInput = document.getElementById("websocketInput");
+const websocketDefaultSpan = document.getElementById("wsDefaultUrl");
+const linesTranscriptDiv = document.getElementById("linesTranscript");
+const timerElement = document.querySelector(".timer");
+const themeRadios = document.querySelectorAll('input[name="theme"]');
+const microphoneSelect = document.getElementById("microphoneSelect");
+
+function getWaveStroke() {
+  const styles = getComputedStyle(document.documentElement);
+  const v = styles.getPropertyValue("--wave-stroke").trim();
+  return v || "#000";
+}
+
+let waveStroke = getWaveStroke();
+function updateWaveStroke() {
+  waveStroke = getWaveStroke();
+}
+
+function applyTheme(pref) {
+  if (pref === "light") {
+    document.documentElement.setAttribute("data-theme", "light");
+  } else if (pref === "dark") {
+    document.documentElement.setAttribute("data-theme", "dark");
+  } else {
+    document.documentElement.removeAttribute("data-theme");
+  }
+  updateWaveStroke();
+}
+
+// Persisted theme preference
+const savedThemePref = localStorage.getItem("themePreference") || "system";
+applyTheme(savedThemePref);
+if (themeRadios.length) {
+  themeRadios.forEach((r) => {
+    r.checked = r.value === savedThemePref;
+    r.addEventListener("change", () => {
+      if (r.checked) {
+        localStorage.setItem("themePreference", r.value);
+        applyTheme(r.value);
+      }
+    });
+  });
+}
+
+// React to OS theme changes when in "system" mode
+const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
+const handleOsThemeChange = () => {
+  const pref = localStorage.getItem("themePreference") || "system";
+  if (pref === "system") updateWaveStroke();
+};
+if (darkMq && darkMq.addEventListener) {
+  darkMq.addEventListener("change", handleOsThemeChange);
+} else if (darkMq && darkMq.addListener) {
+  // deprecated, but included for Safari compatibility
+  darkMq.addListener(handleOsThemeChange);
+}
+
+async function enumerateMicrophones() {
+  try {
+    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
+    stream.getTracks().forEach(track => track.stop());
+
+    const devices = await navigator.mediaDevices.enumerateDevices();
+    availableMicrophones = devices.filter(device => device.kind === 'audioinput');
+
+    populateMicrophoneSelect();
+    console.log(`Found ${availableMicrophones.length} microphone(s)`);
+  } catch (error) {
+    console.error('Error enumerating microphones:', error);
+    statusText.textContent = "Error accessing microphones. Please grant permission.";
+  }
+}
+
+function populateMicrophoneSelect() {
+  if (!microphoneSelect) return;
+
+  microphoneSelect.innerHTML = '<option value="">Default Microphone</option>';
+
+  availableMicrophones.forEach((device, index) => {
+    const option = document.createElement('option');
+    option.value = device.deviceId;
+    option.textContent = device.label || `Microphone ${index + 1}`;
+    microphoneSelect.appendChild(option);
+  });
+
+  const savedMicId = localStorage.getItem('selectedMicrophone');
+  if (savedMicId && availableMicrophones.some(mic => mic.deviceId === savedMicId)) {
+    microphoneSelect.value = savedMicId;
+    selectedMicrophoneId = savedMicId;
+  }
+}
+
+function handleMicrophoneChange() {
+  selectedMicrophoneId = microphoneSelect.value || null;
+  localStorage.setItem('selectedMicrophone', selectedMicrophoneId || '');
+
+  const selectedDevice = availableMicrophones.find(mic => mic.deviceId === selectedMicrophoneId);
+  const deviceName = selectedDevice ? selectedDevice.label : 'Default Microphone';
+
+  console.log(`Selected microphone: ${deviceName}`);
+  statusText.textContent = `Microphone changed to: ${deviceName}`;
+
+  if (isRecording) {
+    statusText.textContent = "Switching microphone... Please wait.";
+    stopRecording().then(() => {
+      setTimeout(() => {
+        toggleRecording();
+      }, 1000);
+    });
+  }
+}
+
+// Helpers
+function fmt1(x) {
+  const n = Number(x);
+  return Number.isFinite(n) ? n.toFixed(1) : x;
+}
+
+// Default WebSocket URL computation
+const host = window.location.hostname || "localhost";
+const port = window.location.port;
+const protocol = window.location.protocol === "https:" ? "wss" : "ws";
+const defaultWebSocketUrl = `${protocol}://${host}${port ? ":" + port : ""}/asr`;
+
+// Populate default caption and input
+if (websocketDefaultSpan) websocketDefaultSpan.textContent = defaultWebSocketUrl;
+websocketInput.value = defaultWebSocketUrl;
+websocketUrl = defaultWebSocketUrl;
+
+// Optional chunk selector (guard for presence)
+if (chunkSelector) {
+  chunkSelector.addEventListener("change", () => {
+    chunkDuration = parseInt(chunkSelector.value);
+  });
+}
+
+// WebSocket input change handling
+websocketInput.addEventListener("change", () => {
+  const urlValue = websocketInput.value.trim();
+  if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
+    statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
+    return;
+  }
+  websocketUrl = urlValue;
+  statusText.textContent = "WebSocket URL updated. Ready to connect.";
+});
+
+function setupWebSocket() {
+  return new Promise((resolve, reject) => {
+    try {
+      websocket = new WebSocket(websocketUrl);
+    } catch (error) {
+      statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
+      reject(error);
+      return;
+    }
+
+    websocket.onopen = () => {
+      statusText.textContent = "Connected to server.";
+      resolve();
+    };
+
+    websocket.onclose = () => {
+      if (userClosing) {
+        if (waitingForStop) {
+          statusText.textContent = "Processing finalized or connection closed.";
+          if (lastReceivedData) {
+            renderLinesWithBuffer(
+              lastReceivedData.lines || [],
+              lastReceivedData.buffer_diarization || "",
+              lastReceivedData.buffer_transcription || "",
+              0,
+              0,
+              true
+            );
+          }
+        }
+      } else {
+        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
+        if (isRecording) {
+          stopRecording();
+        }
+      }
+      isRecording = false;
+      waitingForStop = false;
+      userClosing = false;
+      lastReceivedData = null;
+      websocket = null;
+      updateUI();
+    };
+
+    websocket.onerror = () => {
+      statusText.textContent = "Error connecting to WebSocket.";
+      reject(new Error("Error connecting to WebSocket"));
+    };
+
+    websocket.onmessage = (event) => {
+      const data = JSON.parse(event.data);
+      if (data.type === "config") {
+        serverUseAudioWorklet = !!data.useAudioWorklet;
+        statusText.textContent = serverUseAudioWorklet
+          ? "Connected. Using AudioWorklet (PCM)."
+          : "Connected. Using MediaRecorder (WebM).";
+        if (configReadyResolve) configReadyResolve();
+        return;
+      }
+
+      if (data.type === "ready_to_stop") {
+        console.log("Ready to stop received, finalizing display and closing WebSocket.");
+        waitingForStop = false;
+
+        if (lastReceivedData) {
+          renderLinesWithBuffer(
+            lastReceivedData.lines || [],
+            lastReceivedData.buffer_diarization || "",
+            lastReceivedData.buffer_transcription || "",
+            0,
+            0,
+            true
+          );
+        }
+        statusText.textContent = "Finished processing audio! Ready to record again.";
+        recordButton.disabled = false;
+
+        if (websocket) {
+          websocket.close();
+        }
+        return;
+      }
+
+      lastReceivedData = data;
+
+      const {
+        lines = [],
+        buffer_transcription = "",
+        buffer_diarization = "",
+        remaining_time_transcription = 0,
+        remaining_time_diarization = 0,
+        status = "active_transcription",
+      } = data;
+
+      renderLinesWithBuffer(
+        lines,
+        buffer_diarization,
+        buffer_transcription,
+        remaining_time_diarization,
+        remaining_time_transcription,
+        false,
+        status
+      );
+    };
+  });
+}
+
+function renderLinesWithBuffer(
+  lines,
+  buffer_diarization,
+  buffer_transcription,
+  remaining_time_diarization,
+  remaining_time_transcription,
+  isFinalizing = false,
+  current_status = "active_transcription"
+) {
+  if (current_status === "no_audio_detected") {
+    linesTranscriptDiv.innerHTML =
+      "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
+    return;
+  }
+
+  const showLoading = !isFinalizing && (lines || []).some((it) => it.speaker == 0);
+  const showTransLag = !isFinalizing && remaining_time_transcription > 0;
+  const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
+  const signature = JSON.stringify({
+    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, start: it.start, end: it.end })),
+    buffer_transcription: buffer_transcription || "",
+    buffer_diarization: buffer_diarization || "",
+    status: current_status,
+    showLoading,
+    showTransLag,
+    showDiaLag,
+    isFinalizing: !!isFinalizing,
+  });
+  if (lastSignature === signature) {
+    const t = document.querySelector(".lag-transcription-value");
+    if (t) t.textContent = fmt1(remaining_time_transcription);
+    const d = document.querySelector(".lag-diarization-value");
+    if (d) d.textContent = fmt1(remaining_time_diarization);
+    const ld = document.querySelector(".loading-diarization-value");
+    if (ld) ld.textContent = fmt1(remaining_time_diarization);
+    return;
+  }
+  lastSignature = signature;
+
+  const linesHtml = (lines || [])
+    .map((item, idx) => {
+      let timeInfo = "";
+      if (item.start !== undefined && item.end !== undefined) {
+        timeInfo = ` ${item.start} - ${item.end}`;
+      }
+
+      let speakerLabel = "";
+      if (item.speaker === -2) {
+        speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
+      } else if (item.speaker == 0 && !isFinalizing) {
+        speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
+          remaining_time_diarization
+        )}</span> second(s) of audio are undergoing diarization</span></span>`;
+      } else if (item.speaker !== 0) {
+        speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
+      }
+
+      let currentLineText = item.text || "";
+
+      if (idx === lines.length - 1) {
+        if (!isFinalizing && item.speaker !== -2) {
+          if (remaining_time_transcription > 0) {
+            speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Transcription lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(
+              remaining_time_transcription
+            )}</span>s</span></span>`;
+          }
+          if (buffer_diarization && remaining_time_diarization > 0) {
+            speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Diarization lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(
+              remaining_time_diarization
+            )}</span>s</span></span>`;
+          }
+        }
+
+        if (buffer_diarization) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
+          } else {
+            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
+          }
+        }
+        if (buffer_transcription) {
+          if (isFinalizing) {
+            currentLineText +=
+              (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") +
+              buffer_transcription.trim();
+          } else {
+            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
+          }
+        }
+      }
+      
+      if (item.translation) {
+        currentLineText += `<div class="label_translation">
+          <img src="/web/src/translate.svg" alt="Translation" width="12" height="12" />
+          <span>${item.translation}</span>
+        </div>`;
+      }
+
+      return currentLineText.trim().length > 0 || speakerLabel.length > 0
+        ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
+        : `<p>${speakerLabel}<br/></p>`;
+    })
+    .join("");
+
+  linesTranscriptDiv.innerHTML = linesHtml;
+  const transcriptContainer = document.querySelector('.transcript-container');
+  if (transcriptContainer) {
+    transcriptContainer.scrollTo({ top: transcriptContainer.scrollHeight, behavior: "smooth" });
+  }
+}
+
+function updateTimer() {
+  if (!startTime) return;
+
+  const elapsed = Math.floor((Date.now() - startTime) / 1000);
+  const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
+  const seconds = (elapsed % 60).toString().padStart(2, "0");
+  timerElement.textContent = `${minutes}:${seconds}`;
+}
+
+function drawWaveform() {
+  if (!analyser) return;
+
+  const bufferLength = analyser.frequencyBinCount;
+  const dataArray = new Uint8Array(bufferLength);
+  analyser.getByteTimeDomainData(dataArray);
+
+  waveCtx.clearRect(
+    0,
+    0,
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    waveCanvas.height / (window.devicePixelRatio || 1)
+  );
+  waveCtx.lineWidth = 1;
+  waveCtx.strokeStyle = waveStroke;
+  waveCtx.beginPath();
+
+  const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
+  let x = 0;
+
+  for (let i = 0; i < bufferLength; i++) {
+    const v = dataArray[i] / 128.0;
+    const y = (v * (waveCanvas.height / (window.devicePixelRatio || 1))) / 2;
+
+    if (i === 0) {
+      waveCtx.moveTo(x, y);
+    } else {
+      waveCtx.lineTo(x, y);
+    }
+
+    x += sliceWidth;
+  }
+
+  waveCtx.lineTo(
+    waveCanvas.width / (window.devicePixelRatio || 1),
+    (waveCanvas.height / (window.devicePixelRatio || 1)) / 2
+  );
+  waveCtx.stroke();
+
+  animationFrame = requestAnimationFrame(drawWaveform);
+}
+
+async function startRecording() {
+  try {
+    try {
+      wakeLock = await navigator.wakeLock.request("screen");
+    } catch (err) {
+      console.log("Error acquiring wake lock.");
+    }
+
+    const audioConstraints = selectedMicrophoneId 
+      ? { audio: { deviceId: { exact: selectedMicrophoneId } } }
+      : { audio: true };
+
+    const stream = await navigator.mediaDevices.getUserMedia(audioConstraints);
+
+    audioContext = new (window.AudioContext || window.webkitAudioContext)();
+    analyser = audioContext.createAnalyser();
+    analyser.fftSize = 256;
+    microphone = audioContext.createMediaStreamSource(stream);
+    microphone.connect(analyser);
+
+    if (serverUseAudioWorklet) {
+      if (!audioContext.audioWorklet) {
+        throw new Error("AudioWorklet is not supported in this browser");
+      }
+      await audioContext.audioWorklet.addModule("/web/pcm_worklet.js");
+      workletNode = new AudioWorkletNode(audioContext, "pcm-forwarder", { numberOfInputs: 1, numberOfOutputs: 0, channelCount: 1 });
+      microphone.connect(workletNode);
+
+      recorderWorker = new Worker("/web/recorder_worker.js");
+      recorderWorker.postMessage({
+        command: "init",
+        config: {
+          sampleRate: audioContext.sampleRate,
+        },
+      });
+
+      recorderWorker.onmessage = (e) => {
+        if (websocket && websocket.readyState === WebSocket.OPEN) {
+          websocket.send(e.data.buffer);
+        }
+      };
+
+      workletNode.port.onmessage = (e) => {
+        const data = e.data;
+        const ab = data instanceof ArrayBuffer ? data : data.buffer;
+        recorderWorker.postMessage(
+          {
+            command: "record",
+            buffer: ab,
+          },
+          [ab]
+        );
+      };
+    } else {
+      try {
+        recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
+      } catch (e) {
+        recorder = new MediaRecorder(stream);
+      }
+      recorder.ondataavailable = (e) => {
+        if (websocket && websocket.readyState === WebSocket.OPEN) {
+          if (e.data && e.data.size > 0) {
+            websocket.send(e.data);
+          }
+        }
+      };
+      recorder.start(chunkDuration);
+    }
+
+    startTime = Date.now();
+    timerInterval = setInterval(updateTimer, 1000);
+    drawWaveform();
+
+    isRecording = true;
+    updateUI();
+  } catch (err) {
+    if (window.location.hostname === "0.0.0.0") {
+      statusText.textContent =
+        "Error accessing microphone. Browsers may block microphone access on 0.0.0.0. Try using localhost:8000 instead.";
+    } else {
+      statusText.textContent = "Error accessing microphone. Please allow microphone access.";
+    }
+    console.error(err);
+  }
+}
+
+async function stopRecording() {
+  if (wakeLock) {
+    try {
+      await wakeLock.release();
+    } catch (e) {
+      // ignore
+    }
+    wakeLock = null;
+  }
+
+  userClosing = true;
+  waitingForStop = true;
+
+  if (websocket && websocket.readyState === WebSocket.OPEN) {
+    const emptyBlob = new Blob([], { type: "audio/webm" });
+    websocket.send(emptyBlob);
+    statusText.textContent = "Recording stopped. Processing final audio...";
+  }
+
+  if (recorder) {
+    try {
+      recorder.stop();
+    } catch (e) {
+    }
+    recorder = null;
+  }
+
+  if (recorderWorker) {
+    recorderWorker.terminate();
+    recorderWorker = null;
+  }
+  
+  if (workletNode) {
+    try {
+      workletNode.port.onmessage = null;
+    } catch (e) {}
+    try {
+      workletNode.disconnect();
+    } catch (e) {}
+    workletNode = null;
+  }
+
+  if (microphone) {
+    microphone.disconnect();
+    microphone = null;
+  }
+
+  if (analyser) {
+    analyser = null;
+  }
+
+  if (audioContext && audioContext.state !== "closed") {
+    try {
+      await audioContext.close();
+    } catch (e) {
+      console.warn("Could not close audio context:", e);
+    }
+    audioContext = null;
+  }
+
+  if (animationFrame) {
+    cancelAnimationFrame(animationFrame);
+    animationFrame = null;
+  }
+
+  if (timerInterval) {
+    clearInterval(timerInterval);
+    timerInterval = null;
+  }
+  timerElement.textContent = "00:00";
+  startTime = null;
+
+  isRecording = false;
+  updateUI();
+}
+
+async function toggleRecording() {
+  if (!isRecording) {
+    if (waitingForStop) {
+      console.log("Waiting for stop, early return");
+      return;
+    }
+    console.log("Connecting to WebSocket");
+    try {
+      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        await configReady;
+        await startRecording();
+      } else {
+        await setupWebSocket();
+        await configReady;
+        await startRecording();
+      }
+    } catch (err) {
+      statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
+      console.error(err);
+    }
+  } else {
+    console.log("Stopping recording");
+    stopRecording();
+  }
+}
+
+function updateUI() {
+  recordButton.classList.toggle("recording", isRecording);
+  recordButton.disabled = waitingForStop;
+
+  if (waitingForStop) {
+    if (statusText.textContent !== "Recording stopped. Processing final audio...") {
+      statusText.textContent = "Please wait for processing to complete...";
+    }
+  } else if (isRecording) {
+    statusText.textContent = "Recording...";
+  } else {
+    if (
+      statusText.textContent !== "Finished processing audio! Ready to record again." &&
+      statusText.textContent !== "Processing finalized or connection closed."
+    ) {
+      statusText.textContent = "Click to start transcription";
+    }
+  }
+  if (!waitingForStop) {
+    recordButton.disabled = false;
+  }
+}
+
+recordButton.addEventListener("click", toggleRecording);
+
+if (microphoneSelect) {
+  microphoneSelect.addEventListener("change", handleMicrophoneChange);
+}
+document.addEventListener('DOMContentLoaded', async () => {
+  try {
+    await enumerateMicrophones();
+  } catch (error) {
+    console.log("Could not enumerate microphones on load:", error);
+  }
+});
+navigator.mediaDevices.addEventListener('devicechange', async () => {
+  console.log('Device change detected, re-enumerating microphones');
+  try {
+    await enumerateMicrophones();
+  } catch (error) {
+    console.log("Error re-enumerating microphones:", error);
+  }
+});
--- a/whisperlivekit/web/pcm_worklet.js
+++ b/whisperlivekit/web/pcm_worklet.js
@@ -0,0 +1,16 @@
+class PCMForwarder extends AudioWorkletProcessor {
+  process(inputs) {
+    const input = inputs[0];
+    if (input && input[0] && input[0].length) {
+      // Forward mono channel (0). If multi-channel, downmixing can be added here.
+      const channelData = input[0];
+      const copy = new Float32Array(channelData.length);
+      copy.set(channelData);
+      this.port.postMessage(copy, [copy.buffer]);
+    }
+    // Keep processor alive
+    return true;
+  }
+}
+
+registerProcessor('pcm-forwarder', PCMForwarder);
--- a/whisperlivekit/web/recorder_worker.js
+++ b/whisperlivekit/web/recorder_worker.js
@@ -0,0 +1,58 @@
+let sampleRate = 48000;
+let targetSampleRate = 16000;
+
+self.onmessage = function (e) {
+  switch (e.data.command) {
+    case 'init':
+      init(e.data.config);
+      break;
+    case 'record':
+      record(e.data.buffer);
+      break;
+  }
+};
+
+function init(config) {
+  sampleRate = config.sampleRate;
+  targetSampleRate = config.targetSampleRate || 16000;
+}
+
+function record(inputBuffer) {
+  const buffer = new Float32Array(inputBuffer);
+  const resampledBuffer = resample(buffer, sampleRate, targetSampleRate);
+  const pcmBuffer = toPCM(resampledBuffer);
+  self.postMessage({ buffer: pcmBuffer }, [pcmBuffer]);
+}
+
+function resample(buffer, from, to) {
+    if (from === to) {
+        return buffer;
+    }
+    const ratio = from / to;
+    const newLength = Math.round(buffer.length / ratio);
+    const result = new Float32Array(newLength);
+    let offsetResult = 0;
+    let offsetBuffer = 0;
+    while (offsetResult < result.length) {
+        const nextOffsetBuffer = Math.round((offsetResult + 1) * ratio);
+        let accum = 0, count = 0;
+        for (let i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
+            accum += buffer[i];
+            count++;
+        }
+        result[offsetResult] = accum / count;
+        offsetResult++;
+        offsetBuffer = nextOffsetBuffer;
+    }
+    return result;
+}
+
+function toPCM(input) {
+  const buffer = new ArrayBuffer(input.length * 2);
+  const view = new DataView(buffer);
+  for (let i = 0; i < input.length; i++) {
+    const s = Math.max(-1, Math.min(1, input[i]));
+    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
+  }
+  return buffer;
+}
--- a/whisperlivekit/web/src/dark_mode.svg
+++ b/whisperlivekit/web/src/dark_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
--- a/whisperlivekit/web/src/light_mode.svg
+++ b/whisperlivekit/web/src/light_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>
--- a/whisperlivekit/web/src/settings.svg
+++ b/whisperlivekit/web/src/settings.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M433-80q-27 0-46.5-18T363-142l-9-66q-13-5-24.5-12T307-235l-62 26q-25 11-50 2t-39-32l-47-82q-14-23-8-49t27-43l53-40q-1-7-1-13.5v-27q0-6.5 1-13.5l-53-40q-21-17-27-43t8-49l47-82q14-23 39-32t50 2l62 26q11-8 23-15t24-12l9-66q4-26 23.5-44t46.5-18h94q27 0 46.5 18t23.5 44l9 66q13 5 24.5 12t22.5 15l62-26q25-11 50-2t39 32l47 82q14 23 8 49t-27 43l-53 40q1 7 1 13.5v27q0 6.5-2 13.5l53 40q21 17 27 43t-8 49l-48 82q-14 23-39 32t-50-2l-60-26q-11 8-23 15t-24 12l-9 66q-4 26-23.5 44T527-80h-94Zm7-80h79l14-106q31-8 57.5-23.5T639-327l99 41 39-68-86-65q5-14 7-29.5t2-31.5q0-16-2-31.5t-7-29.5l86-65-39-68-99 42q-22-23-48.5-38.5T533-694l-13-106h-79l-14 106q-31 8-57.5 23.5T321-633l-99-41-39 68 86 64q-5 15-7 30t-2 32q0 16 2 31t7 30l-86 65 39 68 99-42q22 23 48.5 38.5T427-266l13 106Zm42-180q58 0 99-41t41-99q0-58-41-99t-99-41q-59 0-99.5 41T342-480q0 58 40.5 99t99.5 41Zm-2-140Z"/></svg>
--- a/whisperlivekit/web/src/system_mode.svg
+++ b/whisperlivekit/web/src/system_mode.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M396-396q-32-32-58.5-67T289-537q-5 14-6.5 28.5T281-480q0 83 58 141t141 58q14 0 28.5-2t28.5-6q-39-22-74-48.5T396-396Zm85 196q-56 0-107-21t-91-61q-40-40-61-91t-21-107q0-51 17-97.5t50-84.5q13-14 32-9.5t27 24.5q21 55 52.5 104t73.5 91q42 42 91 73.5T648-326q20 8 24.5 27t-9.5 32q-38 33-84.5 50T481-200Zm223-192q-16-5-23-20.5t-4-32.5q9-48-6-94.5T621-621q-35-35-80.5-49.5T448-677q-17 3-32-4t-21-23q-6-16 1.5-31t23.5-19q69-15 138 4.5T679-678q51 51 71 120t5 138q-4 17-19 25t-32 3ZM480-840q-17 0-28.5-11.5T440-880v-40q0-17 11.5-28.5T480-960q17 0 28.5 11.5T520-920v40q0 17-11.5 28.5T480-840Zm0 840q-17 0-28.5-11.5T440-40v-40q0-17 11.5-28.5T480-120q17 0 28.5 11.5T520-80v40q0 17-11.5 28.5T480 0Zm255-734q-12-12-12-28.5t12-28.5l28-28q11-11 27.5-11t28.5 11q12 12 12 28.5T819-762l-28 28q-12 12-28 12t-28-12ZM141-141q-12-12-12-28.5t12-28.5l28-28q12-12 28-12t28 12q12 12 12 28.5T225-169l-28 28q-11 11-27.5 11T141-141Zm739-299q-17 0-28.5-11.5T840-480q0-17 11.5-28.5T880-520h40q17 0 28.5 11.5T960-480q0 17-11.5 28.5T920-440h-40Zm-840 0q-17 0-28.5-11.5T0-480q0-17 11.5-28.5T40-520h40q17 0 28.5 11.5T120-480q0 17-11.5 28.5T80-440H40Zm779 299q-12 12-28.5 12T762-141l-28-28q-12-12-12-28t12-28q12-12 28.5-12t28.5 12l28 28q11 11 11 27.5T819-141ZM226-735q-12 12-28.5 12T169-735l-28-28q-11-11-11-27.5t11-28.5q12-12 28.5-12t28.5 12l28 28q12 12 12 28t-12 28Zm170 339Z"/></svg>
--- a/whisperlivekit/web/src/translate.svg
+++ b/whisperlivekit/web/src/translate.svg
@@ -0,0 +1 @@
+<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="m603-202-34 97q-4 11-14 18t-22 7q-20 0-32.5-16.5T496-133l152-402q5-11 15-18t22-7h30q12 0 22 7t15 18l152 403q8 19-4 35.5T868-80q-13 0-22.5-7T831-106l-34-96H603ZM362-401 188-228q-11 11-27.5 11.5T132-228q-11-11-11-28t11-28l174-174q-35-35-63.5-80T190-640h84q20 39 40 68t48 58q33-33 68.5-92.5T484-720H80q-17 0-28.5-11.5T40-760q0-17 11.5-28.5T80-800h240v-40q0-17 11.5-28.5T360-880q17 0 28.5 11.5T400-840v40h240q17 0 28.5 11.5T680-760q0 17-11.5 28.5T640-720h-76q-21 72-63 148t-83 116l96 98-30 82-122-125Zm266 129h144l-72-204-72 204Z"/></svg>
--- a/whisperlivekit/web/web_interface.py
+++ b/whisperlivekit/web/web_interface.py
@@ -1,5 +1,6 @@
 import logging
 import importlib.resources as resources
+import base64

 logger = logging.getLogger(__name__)

@@ -10,4 +11,78 @@ def get_web_interface_html():
            return f.read()
    except Exception as e:
        logger.error(f"Error loading web interface HTML: {e}")
-        return "<html><body><h1>Error loading interface</h1></body></html>"
+        return "<html><body><h1>Error loading interface</h1></body></html>"
+
+def get_inline_ui_html():
+    """Returns the complete web interface HTML with all assets embedded in a single call."""
+    try:
+        with resources.files('whisperlivekit.web').joinpath('live_transcription.html').open('r', encoding='utf-8') as f:
+            html_content = f.read()        
+        with resources.files('whisperlivekit.web').joinpath('live_transcription.css').open('r', encoding='utf-8') as f:
+            css_content = f.read()
+        with resources.files('whisperlivekit.web').joinpath('live_transcription.js').open('r', encoding='utf-8') as f:
+            js_content = f.read()
+        
+        # SVG files
+        with resources.files('whisperlivekit.web').joinpath('src', 'system_mode.svg').open('r', encoding='utf-8') as f:
+            system_svg = f.read()
+            system_data_uri = f"data:image/svg+xml;base64,{base64.b64encode(system_svg.encode('utf-8')).decode('utf-8')}"
+        with resources.files('whisperlivekit.web').joinpath('src', 'light_mode.svg').open('r', encoding='utf-8') as f:
+            light_svg = f.read()
+            light_data_uri = f"data:image/svg+xml;base64,{base64.b64encode(light_svg.encode('utf-8')).decode('utf-8')}"
+        with resources.files('whisperlivekit.web').joinpath('src', 'dark_mode.svg').open('r', encoding='utf-8') as f:
+            dark_svg = f.read()
+            dark_data_uri = f"data:image/svg+xml;base64,{base64.b64encode(dark_svg.encode('utf-8')).decode('utf-8')}"
+        
+        # Replace external references
+        html_content = html_content.replace(
+            '<link rel="stylesheet" href="/web/live_transcription.css" />',
+            f'<style>\n{css_content}\n</style>'
+        )
+        
+        html_content = html_content.replace(
+            '<script src="/web/live_transcription.js"></script>',
+            f'<script>\n{js_content}\n</script>'
+        )
+        
+        # Replace SVG references
+        html_content = html_content.replace(
+            '<img src="/web/src/system_mode.svg" alt="" />',
+            f'<img src="{system_data_uri}" alt="" />'
+        )
+        
+        html_content = html_content.replace(
+            '<img src="/web/src/light_mode.svg" alt="" />',
+            f'<img src="{light_data_uri}" alt="" />'
+        )
+        
+        html_content = html_content.replace(
+            '<img src="/web/src/dark_mode.svg" alt="" />',
+            f'<img src="{dark_data_uri}" alt="" />'
+        )
+        
+        return html_content
+        
+    except Exception as e:
+        logger.error(f"Error creating embedded web interface: {e}")
+        return "<html><body><h1>Error loading embedded interface</h1></body></html>"
+
+
+if __name__ == '__main__':
+    
+    from fastapi import FastAPI
+    from fastapi.responses import HTMLResponse
+    import uvicorn
+    from starlette.staticfiles import StaticFiles
+    import pathlib
+    import whisperlivekit.web as webpkg
+    
+    app = FastAPI()    
+    web_dir = pathlib.Path(webpkg.__file__).parent
+    app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")
+    
+    @app.get("/")
+    async def get():
+        return HTMLResponse(get_inline_ui_html())
+
+    uvicorn.run(app=app)
--- a/whisperlivekit/whisper_streaming_custom/online_asr.py
+++ b/whisperlivekit/whisper_streaming_custom/online_asr.py
@@ -122,6 +122,7 @@ class OnlineASRProcessor:
        self.tokenize = tokenize_method
        self.logfile = logfile
        self.confidence_validation = confidence_validation
+        self.global_time_offset = 0.0
        self.init()

        self.buffer_trimming_way, self.buffer_trimming_sec = buffer_trimming
@@ -152,6 +153,21 @@ class OnlineASRProcessor:
        """Append an audio chunk (a numpy array) to the current audio buffer."""
        self.audio_buffer = np.append(self.audio_buffer, audio)

+    def insert_silence(self, silence_duration, offset):
+        """
+        If silences are > 5s, we do a complete context clear. Otherwise, we just insert a small silence and shift the last_attend_frame
+        """
+        # if self.transcript_buffer.buffer:
+        #     self.committed.extend(self.transcript_buffer.buffer)
+        #     self.transcript_buffer.buffer = []
+            
+        if True: #silence_duration < 3: #we want the last audio to be treated to not have a gap. could also be handled in the future in ends_with_silence.
+            gap_silence = np.zeros(int(16000 * silence_duration), dtype=np.int16)
+            self.insert_audio_chunk(gap_silence)
+        else:
+            self.init(offset=silence_duration + offset)
+        self.global_time_offset += silence_duration
+
    def prompt(self) -> Tuple[str, str]:
        """
        Returns a tuple: (prompt, context), where:
@@ -230,6 +246,9 @@ class OnlineASRProcessor:
        logger.debug(
            f"Length of audio buffer now: {len(self.audio_buffer)/self.SAMPLING_RATE:.2f} seconds"
        )
+        if self.global_time_offset:
+            for token in committed_tokens:
+                token = token.with_offset(self.global_time_offset)
        return committed_tokens, current_audio_processed_upto

    def chunk_completed_sentence(self):
@@ -391,128 +410,3 @@ class OnlineASRProcessor:
            start = None
            end = None
        return Transcript(start, end, text, probability=probability)
-
-
-class VACOnlineASRProcessor:
-    """
-    Wraps an OnlineASRProcessor with a Voice Activity Controller (VAC).
-    
-    It receives small chunks of audio, applies VAD (e.g. with Silero),
-    and when the system detects a pause in speech (or end of an utterance)
-    it finalizes the utterance immediately.
-    """
-    SAMPLING_RATE = 16000
-
-    def __init__(self, online_chunk_size: float, *args, **kwargs):
-        self.online_chunk_size = online_chunk_size
-        self.online = OnlineASRProcessor(*args, **kwargs)
-        self.asr = self.online.asr
-        
-        # Load a VAD model (e.g. Silero VAD)
-        import torch
-        model, _ = torch.hub.load(repo_or_dir="snakers4/silero-vad", model="silero_vad")
-        from .silero_vad_iterator import FixedVADIterator
-
-        self.vac = FixedVADIterator(model)
-        self.logfile = self.online.logfile
-        self.last_input_audio_stream_end_time: float = 0.0
-        self.init()
-
-    def init(self):
-        self.online.init()
-        self.vac.reset_states()
-        self.current_online_chunk_buffer_size = 0
-        self.last_input_audio_stream_end_time = self.online.buffer_time_offset
-        self.is_currently_final = False
-        self.status: Optional[str] = None  # "voice" or "nonvoice"
-        self.audio_buffer = np.array([], dtype=np.float32)
-        self.buffer_offset = 0  # in frames
-
-    def get_audio_buffer_end_time(self) -> float:
-        """Returns the absolute end time of the audio processed by the underlying OnlineASRProcessor."""
-        return self.online.get_audio_buffer_end_time()
-
-    def clear_buffer(self):
-        self.buffer_offset += len(self.audio_buffer)
-        self.audio_buffer = np.array([], dtype=np.float32)
-
-    def insert_audio_chunk(self, audio: np.ndarray, audio_stream_end_time: float):
-        """
-        Process an incoming small audio chunk:
-          - run VAD on the chunk,
-          - decide whether to send the audio to the online ASR processor immediately,
-          - and/or to mark the current utterance as finished.
-        """
-        self.last_input_audio_stream_end_time = audio_stream_end_time
-        res = self.vac(audio)
-        self.audio_buffer = np.append(self.audio_buffer, audio)
-
-        if res is not None:
-            # VAD returned a result; adjust the frame number
-            frame = list(res.values())[0] - self.buffer_offset
-            if "start" in res and "end" not in res:
-                self.status = "voice"
-                send_audio = self.audio_buffer[frame:]
-                self.online.init(offset=(frame + self.buffer_offset) / self.SAMPLING_RATE)
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.clear_buffer()
-            elif "end" in res and "start" not in res:
-                self.status = "nonvoice"
-                send_audio = self.audio_buffer[:frame]
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.is_currently_final = True
-                self.clear_buffer()
-            else:
-                beg = res["start"] - self.buffer_offset
-                end = res["end"] - self.buffer_offset
-                self.status = "nonvoice"
-                send_audio = self.audio_buffer[beg:end]
-                self.online.init(offset=(beg + self.buffer_offset) / self.SAMPLING_RATE)
-                self.online.insert_audio_chunk(send_audio)
-                self.current_online_chunk_buffer_size += len(send_audio)
-                self.is_currently_final = True
-                self.clear_buffer()
-        else:
-            if self.status == "voice":
-                self.online.insert_audio_chunk(self.audio_buffer)
-                self.current_online_chunk_buffer_size += len(self.audio_buffer)
-                self.clear_buffer()
-            else:
-                # Keep 1 second worth of audio in case VAD later detects voice,
-                # but trim to avoid unbounded memory usage.
-                self.buffer_offset += max(0, len(self.audio_buffer) - self.SAMPLING_RATE)
-                self.audio_buffer = self.audio_buffer[-self.SAMPLING_RATE:]
-
-    def process_iter(self) -> Tuple[List[ASRToken], float]:
-        """
-        Depending on the VAD status and the amount of accumulated audio,
-        process the current audio chunk.
-        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
-        """
-        if self.is_currently_final:
-            return self.finish()
-        elif self.current_online_chunk_buffer_size > self.SAMPLING_RATE * self.online_chunk_size:
-            self.current_online_chunk_buffer_size = 0
-            return self.online.process_iter()
-        else:
-            logger.debug("No online update, only VAD")
-            return [], self.last_input_audio_stream_end_time
-
-    def finish(self) -> Tuple[List[ASRToken], float]:
-        """
-        Finish processing by flushing any remaining text.
-        Returns a tuple: (list of remaining ASRToken objects, float representing the final audio processed up to time).
-        """
-        result_tokens, processed_upto = self.online.finish()
-        self.current_online_chunk_buffer_size = 0
-        self.is_currently_final = False
-        return result_tokens, processed_upto
-    
-    def get_buffer(self):
-        """
-        Get the unvalidated buffer in string format.
-        """
-        return self.online.concatenate_tokens(self.online.transcript_buffer.buffer)
-
Author	SHA1	Message	Date
Quentin Fuxa	1833e7c921	0.2.10	2025-09-16 23:45:00 +02:00
Quentin Fuxa	777ec63a71	--pcm-input option information	2025-09-17 16:06:28 +02:00
Quentin Fuxa	0a6e5ae9c1	ffmpeg install instruction error indicates --pcm-input alternative	2025-09-17 16:04:17 +02:00
Quentin Fuxa	ee448a37e9	when pcm-input is set, the frontend uses AudioWorklet	2025-09-17 14:55:57 +02:00
Quentin Fuxa	9c051052b0	Merge branch 'main' into ScriptProcessorNode-to-AudioWorklet	2025-09-17 11:28:36 +02:00
Quentin Fuxa	4d7c487614	replace deprecated ScriptProcessorNode with AudioWorklet	2025-09-17 10:53:53 +02:00
Quentin Fuxa	65025cc448	nllb backend can be transformers, and model size can be 1.3B	2025-09-17 10:20:31 +02:00
Quentin Fuxa	bbba1d9bb7	add nllb-backend and translation perf test in dev_notes	2025-09-16 20:45:01 +02:00
Quentin Fuxa	99dc96c644	fixes #224	2025-09-16 18:34:35 +02:00
GeorgeCaoJ	2a27d2030a	feat: support web audio 16kHz PCM input and remove ffmpeg dependency	2025-09-15 23:22:25 +08:00
Quentin Fuxa	cd160caaa1	asyncio.to_thread for transcription and translation	2025-09-15 15:23:22 +02:00
Quentin Fuxa	d27b5eb23e	Merge pull request #219 from notV3NOM/main Fix warmup file behavior	2025-09-15 10:19:26 +02:00
Quentin Fuxa	f9d704a900	Merge branch 'main' of https://github.com/notv3nom/whisperlivekit into pr/notV3NOM/219	2025-09-15 10:00:14 +02:00
Quentin Fuxa	2f6e00f512	simulstreaming warmup is done in whisperlivekit.simul_whisper.backend.load_model, not in warmup_online	2025-09-15 09:43:15 +02:00
Quentin Fuxa	5aa312e437	simulstreaming warmup is done in whisperlivekit.simul_whisper.backend.load_model, not in warmup_online	2025-09-13 20:19:19 +01:00
notV3NOM	ebaf36a8be	Fix warmup file behavior	2025-09-13 20:44:24 +05:30
Quentin Fuxa	babe93b99a	to 0.2.9	2025-09-11 21:36:32 +02:00
Quentin Fuxa	a4e9f3cab7	support for raw PCM input option by @YeonjunNotFR	2025-09-11 21:32:11 +02:00
Quentin Fuxa	b06866877a	add --disable-punctuation-split option	2025-09-11 21:03:00 +02:00
Quentin Fuxa	967cdfebc8	fix Translation imports	2025-09-11 21:03:00 +02:00
Quentin Fuxa	3c11c60126	fix by @treeaaa	2025-09-11 21:03:00 +02:00
Quentin Fuxa	2963e8a757	translate when at least 3 new tokens	2025-09-09 21:45:00 +02:00
Quentin Fuxa	cb2d4ea88a	audio processor lines use now Lines objects instead of dict	2025-09-09 21:45:00 +02:00
Quentin Fuxa	add7ea07ee	translator takes all the tokens from the queue	2025-09-09 19:55:39 +02:00
Quentin Fuxa	da8726b2cb	Merge pull request #211 from Alexander-ARTV/main Fix type error when setting encoder_feature in simul_whisper->infer for faster whisper encoder	2025-09-09 15:46:59 +02:00
Quentin Fuxa	3358877054	Fix StorageView conversion for CPU/GPU compatibility	2025-09-09 15:44:16 +02:00
Quentin Fuxa	1f7798c7c1	condition on encoder_feature_ctranslate type	2025-09-09 12:16:52 +02:00
Alexander Lindberg	c7b3bb5e58	Fix regression with faster-whisper encoder_feature	2025-09-09 11:18:55 +03:00
Quentin Fuxa	f661f21675	translation asyncio task	2025-09-08 18:34:31 +02:00
Quentin Fuxa	b6164aa59b	translation device determined with torch.device	2025-09-08 11:34:40 +02:00
Quentin Fuxa	4209d7f7c0	Place all tensors on the same device in sortformer diarization	2025-09-08 10:20:57 +02:00
Quentin Fuxa	334b338ab0	use platform to determine system and recommand mlx whisper	2025-09-07 15:49:11 +02:00
Quentin Fuxa	72f33be6f2	translation: use of get_nllb_code	2025-09-07 15:25:14 +02:00
Quentin Fuxa	84890b8e61	Merge pull request #201 from notV3NOM/main Fix: simulstreaming preload model count argument in cli	2025-09-07 15:18:54 +02:00
Quentin Fuxa	c6668adcf3	Merge pull request #200 from notV3NOM/misc docs: add vram usage for large-v3-turbo	2025-09-07 15:17:42 +02:00
notV3NOM	a178ed5c22	fix simulstreaming preload model count argument in cli	2025-09-06 18:18:09 +05:30
notV3NOM	7601c74c9c	add vram usage for large-v3-turbo	2025-09-06 17:56:39 +05:30
Quentin Fuxa	fad9ee4d21	Merge pull request #198 from notV3NOM/main Fix scrolling UX with sticky header controls	2025-09-05 20:46:36 +02:00
Quentin Fuxa	d1a9913c47	nllb v0	2025-09-05 18:02:42 +02:00
notV3NOM	e4ca2623cb	Fix scrolling UX with sticky header controls	2025-09-05 21:25:13 +05:30
Quentin Fuxa	9c1bf37960	fixes #197	2025-09-05 16:34:13 +02:00
Quentin Fuxa	f46528471b	revamp chromium extension settings	2025-09-05 16:19:48 +02:00
Quentin Fuxa	191680940b	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-09-04 23:58:51 +02:00
Quentin Fuxa	ee02afec56	workaround to get the list of microphones in the extension	2025-09-04 23:58:48 +02:00
Quentin Fuxa	a458028de2	Merge pull request #196 from notV3NOM/main Fix: Exponentially growing simulstreaming silence timer	2025-09-04 23:05:59 +02:00
notV3NOM	abd8f2c269	Fix exponentially growing simulstreaming silence timer	2025-09-04 21:49:07 +05:30
Quentin Fuxa	f3ad4e39e4	torch.Tensor to torch.as_tensor	2025-09-04 16:39:11 +02:00
Quentin Fuxa	e0a5cbf0e7	v0.1.0 chrome extension	2025-09-04 16:36:28 +02:00
Quentin Fuxa	953697cd86	torch.Tensor to torch.as_tensor	2025-09-04 15:25:39 +02:00
Quentin Fuxa	3bd2122eb4	0.2.8 : only the decoder of whisper is loaded in memory when a different encoder is used	2025-09-02 21:12:25 +02:00
Quentin Fuxa	50b0527858	update architecture	2025-09-01 21:24:12 +02:00
Quentin Fuxa	b044fcdec2	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-09-01 14:55:19 +02:00
Quentin Fuxa	b0508fcf2c	mlx/fasterWhisper encoders are loaded once and shared in simulstreaming	2025-09-01 14:55:11 +02:00
Quentin Fuxa	ce89b0aebc	Merge pull request #177 from komiyamma/translate-readme-to-japanese Translate README.md to Japanese	2025-09-01 13:54:50 +02:00
Quentin Fuxa	d5008ed828	mlx/fasterWhisper encoders are loaded once and shared in simulstreaming	2025-09-01 12:33:19 +02:00
Quentin Fuxa	d467716e26	add microphone picker	2025-08-31 10:12:52 +02:00
Quentin Fuxa	199e21b3ef	faster-whisper as an optional encoder alternative for simulstreaming	2025-08-30 23:50:16 +02:00
Quentin Fuxa	1d926f2e67	mlx-whisper used as simulstreaming encoder: improve speed for macos systems	2025-08-30 22:19:11 +02:00
Quentin Fuxa	4a71a391b8	get_web_interface_html to get_inline_ui_html for embedded web interface HTML	2025-08-30 13:44:06 +02:00
google-labs-jules[bot]	d3ed4e46e2	Translate README.md to Japanese Create a Japanese version of the README.md file named ReadmeJP.md. This makes the project more accessible to Japanese-speaking users.	2025-08-30 04:16:18 +00:00
Quentin Fuxa	057a1026d7	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-08-29 22:01:04 +02:00
Quentin Fuxa	1ba171a58d	add embedded web interface HTML (single-file version with inline CSS/JS/SVG) ### Added - `get_inline_ui_html()`: generates a self-contained version of the web interface, with CSS, JS, and SVG assets inlined directly into the HTML. useful for environments where serving static files is inconvenient or when a single-call UI delivery is preferred. (cherry picked from commit `aa44a92a67`)	2025-08-29 22:00:59 +02:00
Quentin Fuxa	1adac67155	explanations about model persistency in containers	2025-08-29 21:27:08 +02:00
Quentin Fuxa	42be1a3773	Merge pull request #173 from CoderRahul9904/chore/docker/pytorch-timeout-retries fix: increase pip timeout & retries for torch wheel install	2025-08-29 21:22:30 +02:00
Rahul Mourya	0a49fafa0d	Update Dockerfile fix(docker): increase pip timeout/retries for PyTorch wheel installs	2025-08-30 00:23:59 +05:30
Quentin Fuxa	4a5d5e1f3b	raise Exception when language == auto and task == translation	2025-08-29 17:44:46 +02:00
Quentin Fuxa	583a2ec2e4	highlight Sortformer optional installation	2025-08-27 21:02:25 +02:00
Quentin Fuxa	19765e89e9	remove triton <3 condition	2025-08-27 20:44:39 +02:00
Quentin Fuxa	9895bc83bf	auto detection of language for warmup if not indicated	2025-08-27 20:37:48 +02:00
Quentin Fuxa	ab98c31f16	trim will happen before audio processor	2025-08-27 18:17:11 +02:00
Quentin Fuxa	f9c9c4188a	optional dependencies removed, ask to direct alternative package installations	2025-08-27 18:15:32 +02:00
Quentin Fuxa	c21d2302e7	to 0.2.7	2024-08-24 19:28:00 +02:00
Quentin Fuxa	4ed62e181d	when silences are detected, speaker correction is no more applied	2024-08-24 19:24:00 +02:00
Quentin Fuxa	52a755a08c	indications on how to choose a model	2024-08-24 19:22:00 +02:00
Quentin Fuxa	9a8d3cbd90	improve diarization + silence handling	2024-08-24 19:20:00 +02:00
Quentin Fuxa	b101ce06bd	several users share the same sortformer model instance	2024-08-24 19:18:00 +02:00
Quentin Fuxa	c83fd179a8	improves phase shift correction between transcription and diarization	2024-08-24 19:15:00 +02:00
Quentin Fuxa	5258305745	default diarization backend in now sortformer	2025-08-24 18:32:01 +02:00
Quentin Fuxa	ce781831ee	punctuation is checked in audio-processor's result formatter	2025-08-24 18:32:01 +02:00
Quentin Fuxa	58297daf6d	sortformer diar implementation v0.3	2025-08-24 18:32:01 +02:00
Quentin Fuxa	3393a08f7e	sortformer diar implementation v0.2	2025-08-24 18:32:01 +02:00
Quentin Fuxa	5b2ddeccdb	correct pip installation error in image build	2025-08-22 15:37:46 +02:00
Quentin Fuxa	26cc1072dd	new dockerfile for cpu only. update dockerfile from cuda 12.8 to 12.9	2025-08-22 11:04:35 +02:00
Quentin Fuxa	12973711f6	0.2.6	2025-08-21 14:34:46 +02:00
Quentin Fuxa	909ac9dd41	speaker -1 are no more sent in websocket - no buffer when their is a silence	2025-08-21 14:09:02 +02:00
Quentin Fuxa	d94a07d417	default model is now base. default backend simulstreaming	2025-08-21 11:55:36 +02:00
Quentin Fuxa	b32dd8bfc4	Align backend and frontend time handling	2025-08-21 10:33:15 +02:00
Quentin Fuxa	9feb0e597b	remove VACOnlineASRProcessor backend possibility	2025-08-20 20:57:43 +02:00
Quentin Fuxa	9dab84a573	update front	2025-08-20 20:15:38 +02:00
Quentin Fuxa	d089c7fce0	.html to .html + .css + .js	2025-08-20 20:00:31 +02:00
Quentin Fuxa	253a080df5	diart diarization handles pauses/silences thanks to offset	2025-08-19 21:12:55 +02:00
Quentin Fuxa	0c6e4b2aee	sortformer diar implementation v0.1	2025-08-19 19:48:51 +02:00
Quentin Fuxa	e14bbde77d	sortformer diar implementation v0	2025-08-19 17:02:55 +02:00
Quentin Fuxa	7496163467	rename diart backend	2025-08-19 15:02:27 +02:00
Quentin Fuxa	696a94d1ce	1rst sortformer backend implementation	2025-08-19 15:02:17 +02:00
Quentin Fuxa	2699b0974c	Fix simulstreaming imports	2025-08-19 14:43:54 +02:00
Quentin Fuxa	90c0250ba4	update optional dependencies	2025-08-19 09:36:59 +02:00
Quentin Fuxa	eb96153ffd	new vac parameters	2025-08-17 22:26:28 +02:00
Quentin Fuxa	47e3eb9b5b	Update README.md	2025-08-17 09:55:03 +02:00
Quentin Fuxa	b8b07adeef	--vac to --no-vac	2025-08-17 09:44:26 +02:00
Quentin Fuxa	d0e9e37ef6	simulstreaming: cumulative_time_offset to keep timestamps correct when audio > 30s	2025-08-17 09:33:47 +02:00
Quentin Fuxa	820f92d8cb	audio_max_len to 30 -> 20, ffmpeg timeout 5 -> 20	2025-08-17 09:32:08 +02:00
Quentin Fuxa	e42523af84	VAC activated by default	2025-08-17 01:29:34 +02:00
Quentin Fuxa	e2184d5e06	better handle silences when VAC + correct offset issue with whisperstreaming backend	2025-08-17 01:27:07 +02:00
Quentin Fuxa	7fe0353260	vac model is loaded in TranscriptionEngine, and by default	2025-08-17 00:34:25 +02:00
Quentin Fuxa	0f2eba507e	use with_offset to add no audio offset to tokens	2025-08-17 00:33:24 +02:00
Quentin Fuxa	55e08474f3	recycle backend in simulstreaming thanks to new remove hooks function	2025-08-16 23:06:16 +02:00
Quentin Fuxa	28bdc52e1d	VAC before doing transcription and diarization. V0	2025-08-16 23:04:21 +02:00
Quentin Fuxa	e4221fa6c3	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-08-15 23:04:05 +02:00
Quentin Fuxa	1652db9a2d	Use distinct backend models for simulstreaming and add --preloaded_model_count to preload them	2025-08-15 23:03:55 +02:00
Quentin Fuxa	601f17653a	Update CONTRIBUTING.md	2025-08-13 21:59:32 +02:00
Quentin Fuxa	7718190fcd	Update CONTRIBUTING.md	2025-08-13 21:59:00 +02:00
				`@@ -0,0 +1 @@`
				`<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>`
				`@@ -0,0 +1 @@`
				<svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>