to 0.2.11

translation compatible with auto and detected language
language detection after few seconds working
2026-03-07 22:33:36 +00:00 · 2025-09-24 22:34:01 +02:00 · 2025-09-22 11:20:00 +02:00 · 2025-09-20 11:08:00 +02:00 · 2025-09-20 11:06:00 +02:00 · 2025-09-20 11:06:00 +02:00
51 changed files with 2982 additions and 784 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -137,4 +137,5 @@ run_*.sh
 test_*.py
 launch.json
 .DS_Store
-test/*
+test/*
 nllb-200-distilled-600M-ctranslate2/*
--- a/DEV_NOTES.md
+++ b/DEV_NOTES.md
@@ -18,8 +18,29 @@ Decoder weights: 59110771 bytes
 Encoder weights: 15268874 bytes
 # 2. Translation: Faster model for each system
-# 2. SortFormer Diarization: 4-to-2 Speaker Constraint Algorithm
+## Benchmark Results
 Testing on MacBook M3 with NLLB-200-distilled-600M model:
 ### Standard Transformers vs CTranslate2
 | Test Text | Standard Inference Time | CTranslate2 Inference Time | Speedup |
 |-----------|-------------------------|---------------------------|---------|
 | UN Chief says there is no military solution in Syria | 0.9395s | 2.0472s | 0.5x |
 | The rapid advancement of AI technology is transforming various industries | 0.7171s | 1.7516s | 0.4x |
 | Climate change poses a significant threat to global ecosystems | 0.8533s | 1.8323s | 0.5x |
 | International cooperation is essential for addressing global challenges | 0.7209s | 1.3575s | 0.5x |
 | The development of renewable energy sources is crucial for a sustainable future | 0.8760s | 1.5589s | 0.6x |
 **Results:**
 - Total Standard time: 4.1068s
 - Total CTranslate2 time: 8.5476s
 - CTranslate2 is roughly 2x slower than Transformers on this system, so use Transformers here; ideally we would have an MLX implementation.
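The Speedup column is the standard inference time divided by the CTranslate2 time, so a value below 1 means CTranslate2 was slower. A minimal sketch of that arithmetic, using the timings from the table (the `speedup` helper and the `rows` array are illustrative names, not part of the repository):

```javascript
// Timings in seconds, copied from the benchmark table: [standard, ctranslate2].
const rows = [
  [0.9395, 2.0472],
  [0.7171, 1.7516],
  [0.8533, 1.8323],
  [0.7209, 1.3575],
  [0.8760, 1.5589],
];

// Hypothetical helper: speedup of a candidate relative to a baseline.
// A result below 1 means the candidate is slower than the baseline.
function speedup(baselineSeconds, candidateSeconds) {
  return baselineSeconds / candidateSeconds;
}

// Per-sentence speedup, rounded to one decimal as in the table.
const perRow = rows.map(([std, ct2]) => speedup(std, ct2).toFixed(1) + "x");

// Overall speedup computed from the summed timings.
const totalStd = rows.reduce((sum, [std]) => sum + std, 0);
const totalCt2 = rows.reduce((sum, [, ct2]) => sum + ct2, 0);
const overall = speedup(totalStd, totalCt2);

console.log(perRow.join(", "));
console.log(`overall: ${overall.toFixed(2)}x`);
```

Computing the overall figure from the summed timings, rather than averaging the per-row ratios, weights each sentence by how long it actually took.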
 # 3. SortFormer Diarization: 4-to-2 Speaker Constraint Algorithm
 Transform a diarization model that predicts up to 4 speakers into one that predicts up to 2 speakers by mapping the output predictions.
@@ -67,4 +88,4 @@ ELSE:
    AS_2 ← B
 to finish
-```
+```
--- a/README.md
+++ b/README.md
@@ -18,8 +18,9 @@ Real-time speech transcription directly to your browser, with a ready-to-use bac
 #### Powered by Leading Research:
- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription with AlignAtt policy
+- [SimulStreaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription with LocalAgreement policy
+- [NLLB](https://arxiv.org/abs/2207.04672), ([distilled](https://huggingface.co/entai2965/nllb-200-distilled-600M-ctranslate2)) (2024) - Translation to more than 100 languages.
 - [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription using [LocalAgreement policy](https://www.isca-archive.org/interspeech_2020/liu20s_interspeech.pdf)
 - [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
 - [Diart](https://github.com/juanmc2005/diart) (SOTA 2021) - Real-time speaker diarization
 - [Silero VAD](https://github.com/snakers4/silero-vad) (2024) - Enterprise-grade Voice Activity Detection
@@ -39,14 +40,7 @@ Real-time speech transcription directly to your browser, with a ready-to-use bac
 ```bash
 pip install whisperlivekit
 ```
-
+> You can also clone the repo and `pip install -e .` for the latest version.
 >  **FFmpeg is required** and must be installed before using WhisperLiveKit
 > 
 > | OS | How to install |
 > |-----------|-------------|
 >  | Ubuntu/Debian | `sudo apt install ffmpeg` |
 > | MacOS | `brew install ffmpeg` |
 > | Windows | Download .exe from https://ffmpeg.org/download.html and add to PATH |
 #### Quick Start
 1. **Start the transcription server:**
@@ -68,6 +62,7 @@ pip install whisperlivekit
 |-----------|-------------|
 | **Speaker diarization with Sortformer** | `git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]` |
 | **Apple Silicon optimized backend** | `mlx-whisper` |
 | **NLLB Translation** | `huggingface_hub` & `transformers` |
 | *[Not recommended]*  Speaker diarization with Diart | `diart` |
 | *[Not recommended]*  Original Whisper backend | `whisper` |
 | *[Not recommended]*  Improved timestamps backend | `whisper-timestamped` |
@@ -82,11 +77,11 @@ See  **Parameters & Configuration** below on how to use them.
 **Command-line Interface**: Start the transcription server with various options:
 ```bash
-# Use better model than default (small)
+# Large model, translating from French to Danish
-whisperlivekit-server --model large-v3
+whisperlivekit-server --model large-v3 --language fr --target-language da
-# Advanced configuration with diarization and language
+# Diarization, with the server listening on all interfaces on port 80
-whisperlivekit-server --host 0.0.0.0 --port 8000 --model medium --diarization --language fr
+whisperlivekit-server --host 0.0.0.0 --port 80 --model medium --diarization --language fr
 ```
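The two translation paths described in the parameters table (`--task translate` for English through Whisper, `--target-language` for other languages through NLLB) can be captured in a small launcher helper. This is only a sketch: `buildServerArgs` and its option names are hypothetical, and only the flags and their defaults come from this README.

```javascript
// Hypothetical helper that assembles arguments for `whisperlivekit-server`.
// Defaults mirror the documented ones: model `small`, language `auto`.
function buildServerArgs({ model = "small", language = "auto", targetLanguage = null } = {}) {
  const args = ["--model", model, "--language", language];
  if (targetLanguage === "en") {
    // English output: Whisper can translate directly.
    args.push("--task", "translate");
  } else if (targetLanguage) {
    // Any other target language goes through NLLB.
    args.push("--target-language", targetLanguage);
  }
  return args;
}

// Rebuilds the first command shown above.
const command = ["whisperlivekit-server", ...buildServerArgs({
  model: "large-v3",
  language: "fr",
  targetLanguage: "da",
})].join(" ");
console.log(command);
```

The resulting array can be passed as-is to a process launcher such as Node's `child_process.spawn`.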
@@ -133,24 +128,15 @@ async def websocket_endpoint(websocket: WebSocket):
 ## Parameters & Configuration
 An important list of parameters can be changed. But what *should* you change?
 - the `--model` size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md)
 - the `--language`. List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English.
 - the `--backend`: you can switch to `--backend faster-whisper` if `simulstreaming` does not work correctly or if you prefer to avoid the dual-license requirements.
 - `--warmup-file`, if you have one
 - `--task translate`, to translate into English
 - `--host`, `--port`, `--ssl-certfile`, `--ssl-keyfile`, if you set up a server
 - `--diarization`, if you want to use it.
 The rest I don't recommend. But below are your options.
 | Parameter | Description | Default |
 |-----------|-------------|---------|
-| `--model` | Whisper model size. | `small` |
+| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/available_models.md) | `small` |
-| `--language` | Source language code or `auto` | `auto` |
+| `--language` | List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/simul_whisper/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
-| `--task` | `transcribe` or `translate` | `transcribe` |
+| `--target-language` | If set, activates translation using NLLB. Ex: `fr`. [118 languages available](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/translation/mapping_languages.py). To translate to English, prefer `--task translate`, since Whisper can do it directly. | `None` |
-| `--backend` | Processing backend | `simulstreaming` |
+| `--task` | Set to `translate` to translate *only* to English, using Whisper translation. | `transcribe` |
-| `--min-chunk-size` | Minimum audio chunk size (seconds) | `1.0` |
+| `--diarization` | Enable speaker identification | `False` |
 | `--backend` | Processing backend. You can switch to `faster-whisper` if  `simulstreaming` does not work correctly | `simulstreaming` |
 | `--no-vac` | Disable Voice Activity Controller | `False` |
 | `--no-vad` | Disable Voice Activity Detection | `False` |
 | `--warmup-file` | Audio file path for model warmup | `jfk.wav` |
@@ -158,7 +144,19 @@ The rest I don't recommend. But below are your options.
 | `--port` | Server port | `8000` |
 | `--ssl-certfile` | Path to the SSL certificate file (for HTTPS support) | `None` |
 | `--ssl-keyfile` | Path to the SSL private key file (for HTTPS support) | `None` |
 | `--pcm-input` | Expect raw PCM (s16le) data as input and bypass FFmpeg. The frontend will use AudioWorklet instead of MediaRecorder | `False` |
 | Translation options | Description | Default |
 |-----------|-------------|---------|
 | `--nllb-backend` | `transformers` or `ctranslate2` | `ctranslate2` |
 | `--nllb-size` | `600M` or `1.3B` | `600M` |
 | Diarization options | Description | Default |
 |-----------|-------------|---------|
 | `--diarization-backend` |  `diart` or `sortformer` | `sortformer` |
 | `--disable-punctuation-split` |  Disable punctuation-based splits. See #214 | `False` |
 | `--segmentation-model` | Hugging Face model ID for Diart segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
 | `--embedding-model` | Hugging Face model ID for Diart embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
 | SimulStreaming backend options | Description | Default |
 |-----------|-------------|---------|
@@ -174,7 +172,8 @@ The rest I don't recommend. But below are your options.
 | `--static-init-prompt` | Static prompt that doesn't scroll | `None` |
 | `--max-context-tokens` | Maximum context tokens | `None` |
 | `--model-path` | Direct path to .pt model file. Download it if not found | `./base.pt` |
-| `--preloaded-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |
+| `--preload-model-count` | Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent users) | `1` |
 | WhisperStreaming backend options | Description | Default |
@@ -182,19 +181,10 @@ The rest I don't recommend. But below are your options.
 | `--confidence-validation` | Use confidence scores for faster validation | `False` |
 | `--buffer_trimming` | Buffer trimming strategy (`sentence` or `segment`) | `segment` |
 | Diarization options | Description | Default |
 |-----------|-------------|---------|
 | `--diarization` | Enable speaker identification | `False` |
 | `--diarization-backend` |  `diart` or `sortformer` | `sortformer` |
 | `--segmentation-model` | Hugging Face model ID for Diart segmentation model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `pyannote/segmentation-3.0` |
 | `--embedding-model` | Hugging Face model ID for Diart embedding model. [Available models](https://github.com/juanmc2005/diart/tree/main?tab=readme-ov-file#pre-trained-models) | `speechbrain/spkrec-ecapa-voxceleb` |
-> For diarization using Diart, you need access to pyannote.audio models:
+
-> 1. [Accept user conditions](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model
+> For diarization using Diart, you need to accept the user conditions [here](https://huggingface.co/pyannote/segmentation) for the `pyannote/segmentation` model, [here](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model and [here](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model. **Then**, log in to Hugging Face: `huggingface-cli login`
 > 2. [Accept user conditions](https://huggingface.co/pyannote/segmentation-3.0) for the `pyannote/segmentation-3.0` model
 > 3. [Accept user conditions](https://huggingface.co/pyannote/embedding) for the `pyannote/embedding` model
 >4. Login with HuggingFace: `huggingface-cli login`
 ### 🚀 Deployment Guide
--- a/architecture.png
+++ b/architecture.png
--- a/available_models.md
+++ b/available_models.md
@@ -1,4 +1,4 @@
-# Available model sizes:
+# Available Whisper model sizes:
 - tiny.en (english only)
 - tiny
@@ -58,6 +58,7 @@
 - `small`: ~2GB VRAM
 - `medium`: ~5GB VRAM
 - `large`: ~10GB VRAM
 - `large‑v3‑turbo`: ~6GB VRAM
 **Audio Quality Impact**:
 - Clean, clear audio: smaller models may suffice
@@ -69,4 +70,40 @@
 2. Limited resources or need speed? → `small` or smaller
 3. Good hardware and want best quality? → `large-v3`
 4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
-5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
+5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
 _______________________
 # Translation Models and Backend
 **Language Support**: ~200 languages
 ## Distilled Model Sizes Available
 | Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
 |-------|------|------------|-------------|-------------|---------|
 | 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
 | 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |
 **Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
 ## Backend Performance
 | Backend | Speed vs Base | Memory Usage | Quality Loss |
 |---------|---------------|--------------|--------------|
 | CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
 | Transformers | Baseline | High | None |
 | Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |
 **Metrics**:
 - CTranslate2: 50-100+ tokens/sec
 - Transformers: 10-30 tokens/sec
 - Apple Silicon with MPS: Up to 2x faster than CTranslate2
 ## Quick Decision Matrix
 **Choose 600M**: Limited resources, close to 0 lag
 **Choose 1.3B**: Quality matters
 **Choose Transformers**: On Apple Silicon
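
The decision matrix can be read as a small selection function. The sketch below is illustrative only: `chooseTranslationSetup` and its option names are hypothetical, the returned values correspond to the `--nllb-size` and `--nllb-backend` options, and falling back to `ctranslate2` outside Apple Silicon is an assumption based on that option's documented default.

```javascript
// Hypothetical helper encoding the decision matrix.
function chooseTranslationSetup({ qualityMatters = false, appleSilicon = false } = {}) {
  return {
    // 1.3B when quality matters, otherwise 600M for limited resources and low lag.
    nllbSize: qualityMatters ? "1.3B" : "600M",
    // Transformers on Apple Silicon; otherwise assume the default backend.
    nllbBackend: appleSilicon ? "transformers" : "ctranslate2",
  };
}

const setup = chooseTranslationSetup({ qualityMatters: true, appleSilicon: true });
console.log(`--nllb-size ${setup.nllbSize} --nllb-backend ${setup.nllbBackend}`);
```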
--- a/chrome-extension/README.md
+++ b/chrome-extension/README.md
@@ -0,0 +1,17 @@
 ## WhisperLiveKit Chrome Extension v0.1.0
 Capture the audio of your current tab, transcribe or translate it using WhisperLiveKit. **Still unstable**
 <img src="https://raw.githubusercontent.com/QuentinFuxa/WhisperLiveKit/refs/heads/main/chrome-extension/demo-extension.png" alt="WhisperLiveKit Demo" width="730">
 ## Running this extension
 1. Clone this repository.
 2. Load this directory in Chrome as an unpacked extension.
 ## Devs:
 - Unfortunately, it is impossible to capture audio from tabs if the extension is a panel:
 - https://issues.chromium.org/issues/40926394
 - https://groups.google.com/a/chromium.org/g/chromium-extensions/c/DET2SXCFnDg
 - https://issues.chromium.org/issues/40916430
 - To capture microphone in an extension, there are tricks: https://github.com/justinmann/sidepanel-audio-issue , https://medium.com/@lynchee.owo/how-to-enable-microphone-access-in-chrome-extensions-by-code-924295170080 (comments)
--- a/chrome-extension/background.js
+++ b/chrome-extension/background.js
@@ -0,0 +1,9 @@
 chrome.runtime.onInstalled.addListener((details) => {
    if (details.reason.search(/install/g) === -1) {
        return
    }
    chrome.tabs.create({
        url: chrome.runtime.getURL("welcome.html"),
        active: true
    })
 })
--- a/chrome-extension/demo-extension.png
+++ b/chrome-extension/demo-extension.png
--- a/chrome-extension/icons/icon128.png
+++ b/chrome-extension/icons/icon128.png
--- a/chrome-extension/icons/icon16.png
+++ b/chrome-extension/icons/icon16.png
--- a/chrome-extension/icons/icon32.png
+++ b/chrome-extension/icons/icon32.png
--- a/chrome-extension/icons/icon48.png
+++ b/chrome-extension/icons/icon48.png
--- a/chrome-extension/live_transcription.js
+++ b/chrome-extension/live_transcription.js
@@ -0,0 +1,669 @@
 /* Theme, WebSocket, recording, rendering logic extracted from inline script and adapted for segmented theme control and WS caption */
 let isRecording = false;
 let websocket = null;
 let recorder = null;
 let chunkDuration = 100;
 let websocketUrl = "ws://localhost:8000/asr";
 let userClosing = false;
 let wakeLock = null;
 let startTime = null;
 let timerInterval = null;
 let audioContext = null;
 let analyser = null;
 let microphone = null;
 let waveCanvas = document.getElementById("waveCanvas");
 let waveCtx = waveCanvas.getContext("2d");
 let animationFrame = null;
 let waitingForStop = false;
 let lastReceivedData = null;
 let lastSignature = null;
 let availableMicrophones = [];
 let selectedMicrophoneId = null;
 waveCanvas.width = 60 * (window.devicePixelRatio || 1);
 waveCanvas.height = 30 * (window.devicePixelRatio || 1);
 waveCtx.scale(window.devicePixelRatio || 1, window.devicePixelRatio || 1);
 const statusText = document.getElementById("status");
 const recordButton = document.getElementById("recordButton");
 const chunkSelector = document.getElementById("chunkSelector");
 const websocketInput = document.getElementById("websocketInput");
 const websocketDefaultSpan = document.getElementById("wsDefaultUrl");
 const linesTranscriptDiv = document.getElementById("linesTranscript");
 const timerElement = document.querySelector(".timer");
 const themeRadios = document.querySelectorAll('input[name="theme"]');
 const microphoneSelect = document.getElementById("microphoneSelect");
 const settingsToggle = document.getElementById("settingsToggle");
 const settingsDiv = document.querySelector(".settings");
 chrome.runtime.onInstalled.addListener((details) => {
    if (details.reason.search(/install/g) === -1) {
        return
    }
    chrome.tabs.create({
        url: chrome.runtime.getURL("welcome.html"),
        active: true
    })
 })
 function getWaveStroke() {
  const styles = getComputedStyle(document.documentElement);
  const v = styles.getPropertyValue("--wave-stroke").trim();
  return v || "#000";
 }
 let waveStroke = getWaveStroke();
 function updateWaveStroke() {
  waveStroke = getWaveStroke();
 }
 function applyTheme(pref) {
  if (pref === "light") {
    document.documentElement.setAttribute("data-theme", "light");
  } else if (pref === "dark") {
    document.documentElement.setAttribute("data-theme", "dark");
  } else {
    document.documentElement.removeAttribute("data-theme");
  }
  updateWaveStroke();
 }
 // Persisted theme preference
 const savedThemePref = localStorage.getItem("themePreference") || "system";
 applyTheme(savedThemePref);
 if (themeRadios.length) {
  themeRadios.forEach((r) => {
    r.checked = r.value === savedThemePref;
    r.addEventListener("change", () => {
      if (r.checked) {
        localStorage.setItem("themePreference", r.value);
        applyTheme(r.value);
      }
    });
  });
 }
 // React to OS theme changes when in "system" mode
 const darkMq = window.matchMedia && window.matchMedia("(prefers-color-scheme: dark)");
 const handleOsThemeChange = () => {
  const pref = localStorage.getItem("themePreference") || "system";
  if (pref === "system") updateWaveStroke();
 };
 if (darkMq && darkMq.addEventListener) {
  darkMq.addEventListener("change", handleOsThemeChange);
 } else if (darkMq && darkMq.addListener) {
  // deprecated, but included for Safari compatibility
  darkMq.addListener(handleOsThemeChange);
 }
 async function enumerateMicrophones() {
  try {
    const micPermission = await navigator.permissions.query({
      name: "microphone",
    });
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach(track => track.stop());
    const devices = await navigator.mediaDevices.enumerateDevices();
    availableMicrophones = devices.filter(device => device.kind === 'audioinput');
    populateMicrophoneSelect();
    console.log(`Found ${availableMicrophones.length} microphone(s)`);
  } catch (error) {
    console.error('Error enumerating microphones:', error);
    statusText.textContent = "Error accessing microphones. Please grant permission.";
  }
 }
 function populateMicrophoneSelect() {
  if (!microphoneSelect) return;
  microphoneSelect.innerHTML = '<option value="">Default Microphone</option>';
  availableMicrophones.forEach((device, index) => {
    const option = document.createElement('option');
    option.value = device.deviceId;
    option.textContent = device.label || `Microphone ${index + 1}`;
    microphoneSelect.appendChild(option);
  });
  const savedMicId = localStorage.getItem('selectedMicrophone');
  if (savedMicId && availableMicrophones.some(mic => mic.deviceId === savedMicId)) {
    microphoneSelect.value = savedMicId;
    selectedMicrophoneId = savedMicId;
  }
 }
 function handleMicrophoneChange() {
  selectedMicrophoneId = microphoneSelect.value || null;
  localStorage.setItem('selectedMicrophone', selectedMicrophoneId || '');
  const selectedDevice = availableMicrophones.find(mic => mic.deviceId === selectedMicrophoneId);
  const deviceName = selectedDevice ? selectedDevice.label : 'Default Microphone';
  console.log(`Selected microphone: ${deviceName}`);
  statusText.textContent = `Microphone changed to: ${deviceName}`;
  if (isRecording) {
    statusText.textContent = "Switching microphone... Please wait.";
    stopRecording().then(() => {
      setTimeout(() => {
        toggleRecording();
      }, 1000);
    });
  }
 }
 // Helpers
 function fmt1(x) {
  const n = Number(x);
  return Number.isFinite(n) ? n.toFixed(1) : x;
 }
 // Default WebSocket URL computation
 const host = window.location.hostname || "localhost";
 const port = window.location.port;
 const protocol = window.location.protocol === "https:" ? "wss" : "ws";
 const defaultWebSocketUrl = websocketUrl;
 // Populate default caption and input
 if (websocketDefaultSpan) websocketDefaultSpan.textContent = defaultWebSocketUrl;
 websocketInput.value = defaultWebSocketUrl;
 websocketUrl = defaultWebSocketUrl;
 // Optional chunk selector (guard for presence)
 if (chunkSelector) {
  chunkSelector.addEventListener("change", () => {
    chunkDuration = parseInt(chunkSelector.value, 10);
  });
 }
 // WebSocket input change handling
 websocketInput.addEventListener("change", () => {
  const urlValue = websocketInput.value.trim();
  if (!urlValue.startsWith("ws://") && !urlValue.startsWith("wss://")) {
    statusText.textContent = "Invalid WebSocket URL (must start with ws:// or wss://)";
    return;
  }
  websocketUrl = urlValue;
  statusText.textContent = "WebSocket URL updated. Ready to connect.";
 });
 function setupWebSocket() {
  return new Promise((resolve, reject) => {
    try {
      websocket = new WebSocket(websocketUrl);
    } catch (error) {
      statusText.textContent = "Invalid WebSocket URL. Please check and try again.";
      reject(error);
      return;
    }
    websocket.onopen = () => {
      statusText.textContent = "Connected to server.";
      resolve();
    };
    websocket.onclose = () => {
      if (userClosing) {
        if (waitingForStop) {
          statusText.textContent = "Processing finalized or connection closed.";
          if (lastReceivedData) {
            renderLinesWithBuffer(
              lastReceivedData.lines || [],
              lastReceivedData.buffer_diarization || "",
              lastReceivedData.buffer_transcription || "",
              0,
              0,
              true
            );
          }
        }
      } else {
        statusText.textContent = "Disconnected from the WebSocket server. (Check logs if model is loading.)";
        if (isRecording) {
          stopRecording();
        }
      }
      isRecording = false;
      waitingForStop = false;
      userClosing = false;
      lastReceivedData = null;
      websocket = null;
      updateUI();
    };
    websocket.onerror = () => {
      statusText.textContent = "Error connecting to WebSocket.";
      reject(new Error("Error connecting to WebSocket"));
    };
    websocket.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === "ready_to_stop") {
        console.log("Ready to stop received, finalizing display and closing WebSocket.");
        waitingForStop = false;
        if (lastReceivedData) {
          renderLinesWithBuffer(
            lastReceivedData.lines || [],
            lastReceivedData.buffer_diarization || "",
            lastReceivedData.buffer_transcription || "",
            0,
            0,
            true
          );
        }
        statusText.textContent = "Finished processing audio! Ready to record again.";
        recordButton.disabled = false;
        if (websocket) {
          websocket.close();
        }
        return;
      }
      lastReceivedData = data;
      const {
        lines = [],
        buffer_transcription = "",
        buffer_diarization = "",
        remaining_time_transcription = 0,
        remaining_time_diarization = 0,
        status = "active_transcription",
      } = data;
      renderLinesWithBuffer(
        lines,
        buffer_diarization,
        buffer_transcription,
        remaining_time_diarization,
        remaining_time_transcription,
        false,
        status
      );
    };
  });
 }
 function renderLinesWithBuffer(
  lines,
  buffer_diarization,
  buffer_transcription,
  remaining_time_diarization,
  remaining_time_transcription,
  isFinalizing = false,
  current_status = "active_transcription"
 ) {
  if (current_status === "no_audio_detected") {
    linesTranscriptDiv.innerHTML =
      "<p style='text-align: center; color: var(--muted); margin-top: 20px;'><em>No audio detected...</em></p>";
    return;
  }
  const showLoading = !isFinalizing && (lines || []).some((it) => it.speaker == 0);
  const showTransLag = !isFinalizing && remaining_time_transcription > 0;
  const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
  const signature = JSON.stringify({
    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, start: it.start, end: it.end })),
    buffer_transcription: buffer_transcription || "",
    buffer_diarization: buffer_diarization || "",
    status: current_status,
    showLoading,
    showTransLag,
    showDiaLag,
    isFinalizing: !!isFinalizing,
  });
  if (lastSignature === signature) {
    const t = document.querySelector(".lag-transcription-value");
    if (t) t.textContent = fmt1(remaining_time_transcription);
    const d = document.querySelector(".lag-diarization-value");
    if (d) d.textContent = fmt1(remaining_time_diarization);
    const ld = document.querySelector(".loading-diarization-value");
    if (ld) ld.textContent = fmt1(remaining_time_diarization);
    return;
  }
  lastSignature = signature;
  const linesHtml = (lines || [])
    .map((item, idx) => {
      let timeInfo = "";
      if (item.start !== undefined && item.end !== undefined) {
        timeInfo = ` ${item.start} - ${item.end}`;
      }
      let speakerLabel = "";
      if (item.speaker === -2) {
        speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
      } else if (item.speaker == 0 && !isFinalizing) {
        speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
          remaining_time_diarization
        )}</span> second(s) of audio are undergoing diarization</span></span>`;
      } else if (item.speaker !== 0) {
        speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
      }
      let currentLineText = item.text || "";
      if (idx === lines.length - 1) {
        if (!isFinalizing && item.speaker !== -2) {
          if (remaining_time_transcription > 0) {
            speakerLabel += `<span class="label_transcription"><span class="spinner"></span>Lag <span id='timeInfo'><span class="lag-transcription-value">${fmt1(
              remaining_time_transcription
            )}</span>s</span></span>`;
          }
          if (buffer_diarization && remaining_time_diarization > 0) {
            speakerLabel += `<span class="label_diarization"><span class="spinner"></span>Lag<span id='timeInfo'><span class="lag-diarization-value">${fmt1(
              remaining_time_diarization
            )}</span>s</span></span>`;
          }
        }
        if (buffer_diarization) {
          if (isFinalizing) {
            currentLineText +=
              (currentLineText.length > 0 && buffer_diarization.trim().length > 0 ? " " : "") + buffer_diarization.trim();
          } else {
            currentLineText += `<span class="buffer_diarization">${buffer_diarization}</span>`;
          }
        }
        if (buffer_transcription) {
          if (isFinalizing) {
            currentLineText +=
              (currentLineText.length > 0 && buffer_transcription.trim().length > 0 ? " " : "") +
              buffer_transcription.trim();
          } else {
            currentLineText += `<span class="buffer_transcription">${buffer_transcription}</span>`;
          }
        }
      }
      return currentLineText.trim().length > 0 || speakerLabel.length > 0
        ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
        : `<p>${speakerLabel}<br/></p>`;
    })
    .join("");
  linesTranscriptDiv.innerHTML = linesHtml;
  window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });
 }
 function updateTimer() {
  if (!startTime) return;
  const elapsed = Math.floor((Date.now() - startTime) / 1000);
  const minutes = Math.floor(elapsed / 60).toString().padStart(2, "0");
  const seconds = (elapsed % 60).toString().padStart(2, "0");
  timerElement.textContent = `${minutes}:${seconds}`;
 }
 function drawWaveform() {
  if (!analyser) return;
  const bufferLength = analyser.frequencyBinCount;
  const dataArray = new Uint8Array(bufferLength);
  analyser.getByteTimeDomainData(dataArray);
  waveCtx.clearRect(
    0,
    0,
    waveCanvas.width / (window.devicePixelRatio || 1),
    waveCanvas.height / (window.devicePixelRatio || 1)
  );
  waveCtx.lineWidth = 1;
  waveCtx.strokeStyle = waveStroke;
  waveCtx.beginPath();
  const sliceWidth = (waveCanvas.width / (window.devicePixelRatio || 1)) / bufferLength;
  let x = 0;
  for (let i = 0; i < bufferLength; i++) {
    const v = dataArray[i] / 128.0;
    const y = (v * (waveCanvas.height / (window.devicePixelRatio || 1))) / 2;
    if (i === 0) {
      waveCtx.moveTo(x, y);
    } else {
      waveCtx.lineTo(x, y);
    }
    x += sliceWidth;
  }
  waveCtx.lineTo(
    waveCanvas.width / (window.devicePixelRatio || 1),
    (waveCanvas.height / (window.devicePixelRatio || 1)) / 2
  );
  waveCtx.stroke();
  animationFrame = requestAnimationFrame(drawWaveform);
 }
 async function startRecording() {
  try {
    try {
      wakeLock = await navigator.wakeLock.request("screen");
    } catch (err) {
      console.log("Error acquiring wake lock.");
    }
    let stream;
    try {
      // Try tab capture first
      stream = await new Promise((resolve, reject) => {
        chrome.tabCapture.capture({audio: true}, (s) => {
          if (s) {
            resolve(s);
          } else {
            reject(new Error('Tab capture failed or not available'));
          }
        });
      });
      statusText.textContent = "Using tab audio capture.";
    } catch (tabError) {
      console.log('Tab capture not available, falling back to microphone', tabError);
      // Fallback to microphone
      const audioConstraints = selectedMicrophoneId
        ? { audio: { deviceId: { exact: selectedMicrophoneId } } }
        : { audio: true };
      stream = await navigator.mediaDevices.getUserMedia(audioConstraints);
      statusText.textContent = "Using microphone audio.";
    }
    audioContext = new (window.AudioContext || window.webkitAudioContext)();
    analyser = audioContext.createAnalyser();
    analyser.fftSize = 256;
    microphone = audioContext.createMediaStreamSource(stream);
    microphone.connect(analyser);
    recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
    recorder.ondataavailable = (e) => {
      if (websocket && websocket.readyState === WebSocket.OPEN) {
        websocket.send(e.data);
      }
    };
    recorder.start(chunkDuration);
    startTime = Date.now();
    timerInterval = setInterval(updateTimer, 1000);
    drawWaveform();
    isRecording = true;
    updateUI();
  } catch (err) {
    if (window.location.hostname === "0.0.0.0") {
      statusText.textContent =
        "Error accessing audio input. Browsers may block audio access on 0.0.0.0. Try using localhost:8000 instead.";
    } else {
      statusText.textContent = "Error accessing audio input. Please check permissions.";
    }
    console.error(err);
  }
 }
 async function stopRecording() {
  if (wakeLock) {
    try {
      await wakeLock.release();
    } catch (e) {
      // ignore
    }
    wakeLock = null;
  }
  userClosing = true;
  waitingForStop = true;
  if (websocket && websocket.readyState === WebSocket.OPEN) {
    const emptyBlob = new Blob([], { type: "audio/webm" });
    websocket.send(emptyBlob);
    statusText.textContent = "Recording stopped. Processing final audio...";
  }
  if (recorder) {
    recorder.stop();
    recorder = null;
  }
  if (microphone) {
    microphone.disconnect();
    microphone = null;
  }
  if (analyser) {
    analyser = null;
  }
  if (audioContext && audioContext.state !== "closed") {
    try {
      await audioContext.close();
    } catch (e) {
      console.warn("Could not close audio context:", e);
    }
    audioContext = null;
  }
  if (animationFrame) {
    cancelAnimationFrame(animationFrame);
    animationFrame = null;
  }
  if (timerInterval) {
    clearInterval(timerInterval);
    timerInterval = null;
  }
  timerElement.textContent = "00:00";
  startTime = null;
  isRecording = false;
  updateUI();
 }
 async function toggleRecording() {
  if (!isRecording) {
    if (waitingForStop) {
      console.log("Waiting for stop, early return");
      return;
    }
    console.log("Connecting to WebSocket");
    try {
      if (websocket && websocket.readyState === WebSocket.OPEN) {
        await startRecording();
      } else {
        await setupWebSocket();
        await startRecording();
      }
    } catch (err) {
      statusText.textContent = "Could not connect to WebSocket or access mic. Aborted.";
      console.error(err);
    }
  } else {
    console.log("Stopping recording");
    await stopRecording();
  }
 }
 function updateUI() {
  recordButton.classList.toggle("recording", isRecording);
  recordButton.disabled = waitingForStop;
  if (waitingForStop) {
    if (statusText.textContent !== "Recording stopped. Processing final audio...") {
      statusText.textContent = "Please wait for processing to complete...";
    }
  } else if (isRecording) {
    statusText.textContent = "Recording...";
  } else {
    if (
      statusText.textContent !== "Finished processing audio! Ready to record again." &&
      statusText.textContent !== "Processing finalized or connection closed."
    ) {
      statusText.textContent = "Click to start transcription";
    }
  }
  if (!waitingForStop) {
    recordButton.disabled = false;
  }
 }
 recordButton.addEventListener("click", toggleRecording);
 if (microphoneSelect) {
  microphoneSelect.addEventListener("change", handleMicrophoneChange);
 }
 // Settings toggle functionality
 settingsToggle.addEventListener("click", () => {
  settingsDiv.classList.toggle("visible");
  settingsToggle.classList.toggle("active");
 });
 document.addEventListener('DOMContentLoaded', async () => {
  try {
    await enumerateMicrophones();
  } catch (error) {
    console.log("Could not enumerate microphones on load:", error);
  }
 });
 navigator.mediaDevices.addEventListener('devicechange', async () => {
  console.log('Device change detected, re-enumerating microphones');
  try {
    await enumerateMicrophones();
  } catch (error) {
    console.log("Error re-enumerating microphones:", error);
  }
 });
 async function run() {
  const micPermission = await navigator.permissions.query({
    name: "microphone",
  });
  document.getElementById(
    "audioPermission"
  ).innerText = `MICROPHONE: ${micPermission.state}`;
  if (micPermission.state !== "granted") {
    chrome.tabs.create({ url: "welcome.html" });
  }
  const intervalId = setInterval(async () => {
    const micPermission = await navigator.permissions.query({
      name: "microphone",
    });
    if (micPermission.state === "granted") {
      document.getElementById(
        "audioPermission"
      ).innerText = `MICROPHONE: ${micPermission.state}`;
      clearInterval(intervalId);
    }
  }, 100);
 }
 void run();
--- a/chrome-extension/manifest.json
+++ b/chrome-extension/manifest.json
@@ -0,0 +1,37 @@
 {
    "manifest_version": 3,
    "name": "WhisperLiveKit Tab Capture",
    "version": "1.0",
    "description": "Capture and transcribe audio from browser tabs using WhisperLiveKit.",
    "background": {
        "service_worker": "background.js"
    },
    "icons": {
        "16": "icons/icon16.png",
        "32": "icons/icon32.png",
        "48": "icons/icon48.png",
        "128": "icons/icon128.png"
    },
    "action": {
        "default_title": "WhisperLiveKit Tab Capture",
        "default_popup": "popup.html"
    },
    "permissions": [
        "scripting",
        "tabCapture",
        "offscreen",
        "activeTab",
        "storage"
    ],
    "web_accessible_resources": [
        {
            "resources": [
                "requestPermissions.html",
                "requestPermissions.js"
            ],
            "matches": [
                "<all_urls>"
            ]
        }
    ]
 }
--- a/chrome-extension/popup.html
+++ b/chrome-extension/popup.html
@@ -0,0 +1,78 @@
 <!DOCTYPE html>
 <html lang="en">
 <head>
    <meta charset="UTF-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <title>WhisperLiveKit</title>
    <link rel="stylesheet" href="/web/live_transcription.css" />
 </head>
 <body>
    <div class="settings-container">
        <button id="recordButton">
            <div class="shape-container">
                <div class="shape"></div>
            </div>
            <div class="recording-info">
                <div class="wave-container">
                    <canvas id="waveCanvas"></canvas>
                </div>
                <div class="timer">00:00</div>
            </div>
        </button>
        <button id="settingsToggle" class="settings-toggle" title="Show/hide settings">
            <img src="/web/src/settings.svg" alt="Settings" />
        </button>
        <div class="settings">
            <div class="field">
                <label for="websocketInput">Websocket URL</label>
                <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
            </div>
            <div class="field">
                <label id="microphoneSelectLabel" for="microphoneSelect">Select Microphone</label>
                <select id="microphoneSelect">
                    <option value="">Default Microphone</option>
                </select>
                <div id="audioPermission"></div>
            </div>
            <div class="theme-selector-container">
                <div class="segmented" role="radiogroup" aria-label="Theme selector">
                    <input type="radio" id="theme-system" name="theme" value="system" />
                    <label for="theme-system" title="System">
                        <img src="/web/src/system_mode.svg" alt="" />
                        <!-- <span>System</span> -->
                    </label>
                    <input type="radio" id="theme-light" name="theme" value="light" />
                    <label for="theme-light" title="Light">
                        <img src="/web/src/light_mode.svg" alt="" />
                        <!-- <span>Light</span> -->
                    </label>
                    <input type="radio" id="theme-dark" name="theme" value="dark" />
                    <label for="theme-dark" title="Dark">
                        <img src="/web/src/dark_mode.svg" alt="" />
                        <!-- <span>Dark</span> -->
                    </label>
                </div>
            </div>
        </div>
    </div>
    <p id="status"></p>
    <div id="linesTranscript"></div>
    <script src="live_transcription.js"></script>
 </body>
 </html>
--- a/chrome-extension/requestPermissions.html
+++ b/chrome-extension/requestPermissions.html
@@ -0,0 +1,12 @@
 <!DOCTYPE html>
 <html>
  <head>
    <title>Request Permissions</title>
    <script src="requestPermissions.js"></script>
  </head>
  <body>
    This page exists to work around a Chrome issue that blocks permission
    requests from Chrome extensions.
    <button id="requestMicrophone">Request Microphone</button>
  </body>
 </html>
--- a/chrome-extension/requestPermissions.js
+++ b/chrome-extension/requestPermissions.js
@@ -0,0 +1,17 @@
 /**
 * Requests user permission for microphone access.
 * @returns {Promise<void>} A Promise that resolves when permission is granted or rejects with an error.
 */
 async function getUserPermission() {
  console.log("Getting user permission for microphone access...");
  await navigator.mediaDevices.getUserMedia({ audio: true });
  const micPermission = await navigator.permissions.query({
    name: "microphone",
  });
  if (micPermission.state === "granted") {
    window.close();
  }
 }
 // Call the function to request microphone permission
 getUserPermission();
--- a/chrome-extension/sidepanel.js
+++ b/chrome-extension/sidepanel.js
@@ -0,0 +1,29 @@
 console.log("sidepanel.js");
 async function run() {
  const micPermission = await navigator.permissions.query({
    name: "microphone",
  });
  document.getElementById(
    "audioPermission"
  ).innerText = `MICROPHONE: ${micPermission.state}`;
  if (micPermission.state !== "granted") {
    chrome.tabs.create({ url: "requestPermissions.html" });
  }
  const intervalId = setInterval(async () => {
    const micPermission = await navigator.permissions.query({
      name: "microphone",
    });
    if (micPermission.state === "granted") {
      document.getElementById(
        "audioPermission"
      ).innerText = `MICROPHONE: ${micPermission.state}`;
      clearInterval(intervalId);
    }
  }, 100);
 }
 void run();
--- a/chrome-extension/web/live_transcription.css
+++ b/chrome-extension/web/live_transcription.css
@@ -0,0 +1,539 @@
 :root {
  --bg: #ffffff;
  --text: #111111;
  --muted: #666666;
  --border: #e5e5e5;
  --chip-bg: rgba(0, 0, 0, 0.04);
  --chip-text: #000000;
  --spinner-border: #8d8d8d5c;
  --spinner-top: #b0b0b0;
  --silence-bg: #f3f3f3;
  --loading-bg: rgba(255, 77, 77, 0.06);
  --button-bg: #ffffff;
  --button-border: #e9e9e9;
  --wave-stroke: #000000;
  --label-dia-text: #868686;
  --label-trans-text: #111111;
 }
 @media (prefers-color-scheme: dark) {
  :root:not([data-theme="light"]) {
    --bg: #0b0b0b;
    --text: #e6e6e6;
    --muted: #9aa0a6;
    --border: #333333;
    --chip-bg: rgba(255, 255, 255, 0.08);
    --chip-text: #e6e6e6;
    --spinner-border: #555555;
    --spinner-top: #dddddd;
    --silence-bg: #1a1a1a;
    --loading-bg: rgba(255, 77, 77, 0.12);
    --button-bg: #111111;
    --button-border: #333333;
    --wave-stroke: #e6e6e6;
    --label-dia-text: #b3b3b3;
    --label-trans-text: #ffffff;
  }
 }
 :root[data-theme="dark"] {
  --bg: #0b0b0b;
  --text: #e6e6e6;
  --muted: #9aa0a6;
  --border: #333333;
  --chip-bg: rgba(255, 255, 255, 0.08);
  --chip-text: #e6e6e6;
  --spinner-border: #555555;
  --spinner-top: #dddddd;
  --silence-bg: #1a1a1a;
  --loading-bg: rgba(255, 77, 77, 0.12);
  --button-bg: #111111;
  --button-border: #333333;
  --wave-stroke: #e6e6e6;
  --label-dia-text: #b3b3b3;
  --label-trans-text: #ffffff;
 }
 :root[data-theme="light"] {
  --bg: #ffffff;
  --text: #111111;
  --muted: #666666;
  --border: #e5e5e5;
  --chip-bg: rgba(0, 0, 0, 0.04);
  --chip-text: #000000;
  --spinner-border: #8d8d8d5c;
  --spinner-top: #b0b0b0;
  --silence-bg: #f3f3f3;
  --loading-bg: rgba(255, 77, 77, 0.06);
  --button-bg: #ffffff;
  --button-border: #e9e9e9;
  --wave-stroke: #000000;
  --label-dia-text: #868686;
  --label-trans-text: #111111;
 }
 body {
  font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
  margin: 20px;
  text-align: center;
  background-color: var(--bg);
  color: var(--text);
 }
 .settings-toggle {
  margin-top: 4px;
  width: 40px;
  height: 40px;
  border: none;
  border-radius: 50%;
  background-color: var(--button-bg);
  cursor: pointer;
  transition: all 0.3s ease;
  /* border: 1px solid var(--button-border); */
  display: flex;
  align-items: center;
  justify-content: center;
  position: relative;
 }
 .settings-toggle:hover {
  background-color: var(--chip-bg);
 }
 .settings-toggle img {
  width: 24px;
  height: 24px;
  opacity: 0.7;
  transition: opacity 0.2s ease, transform 0.3s ease;
 }
 .settings-toggle:hover img {
  opacity: 1;
 }
 .settings-toggle.active img {
  transform: rotate(80deg);
 }
 /* Record button */
 #recordButton {
  width: 50px;
  height: 50px;
  border: none;
  border-radius: 50%;
  background-color: var(--button-bg);
  cursor: pointer;
  transition: all 0.3s ease;
  border: 1px solid var(--button-border);
  display: flex;
  align-items: center;
  justify-content: center;
  position: relative;
 }
 #recordButton.recording {
  width: 180px;
  border-radius: 40px;
  justify-content: flex-start;
  padding-left: 20px;
 }
 #recordButton:active {
  transform: scale(0.95);
 }
 .shape-container {
  width: 25px;
  height: 25px;
  display: flex;
  align-items: center;
  justify-content: center;
  flex-shrink: 0;
 }
 .shape {
  width: 25px;
  height: 25px;
  background-color: rgb(209, 61, 53);
  border-radius: 50%;
  transition: all 0.3s ease;
 }
 #recordButton:disabled .shape {
  background-color: #6e6d6d;
 }
 #recordButton.recording .shape {
  border-radius: 5px;
  width: 25px;
  height: 25px;
 }
 /* Recording elements */
 .recording-info {
  display: none;
  align-items: center;
  margin-left: 15px;
  flex-grow: 1;
 }
 #recordButton.recording .recording-info {
  display: flex;
 }
 .wave-container {
  width: 60px;
  height: 30px;
  position: relative;
  display: flex;
  align-items: center;
  justify-content: center;
 }
 #waveCanvas {
  width: 100%;
  height: 100%;
 }
 .timer {
  font-size: 14px;
  font-weight: 500;
  color: var(--text);
  margin-left: 10px;
 }
 #status {
  margin-top: 20px;
  font-size: 16px;
  color: var(--text);
 }
 /* Settings */
 .settings-container {
  display: flex;
  justify-content: center;
  align-items: flex-start;
  gap: 15px;
  margin-top: 20px;
  flex-wrap: wrap;
 }
 .settings {
  display: none;
  flex-wrap: wrap;
  align-items: flex-start;
  gap: 12px;
  transition: opacity 0.3s ease;
 }
 .settings.visible {
  display: flex;
 }
 .field {
  display: flex;
  flex-direction: column;
  align-items: flex-start;
  gap: 3px;
 }
 #chunkSelector,
 #websocketInput,
 #themeSelector,
 #microphoneSelect {
  font-size: 16px;
  padding: 5px 8px;
  border-radius: 8px;
  border: 1px solid var(--border);
  background-color: var(--button-bg);
  color: var(--text);
  max-height: 30px;
 }
 #microphoneSelect {
  width: 100%;
  max-width: 190px;
  min-width: 120px;
 }
 #chunkSelector:focus,
 #websocketInput:focus,
 #themeSelector:focus,
 #microphoneSelect:focus {
  outline: none;
  border-color: #007bff;
  box-shadow: 0 0 0 3px rgba(0, 123, 255, 0.15);
 }
 label {
  font-size: 13px;
  color: var(--muted);
 }
 .ws-default {
  font-size: 12px;
  color: var(--muted);
 }
 /* Segmented pill control for Theme */
 .segmented {
  display: inline-flex;
  align-items: stretch;
  border: 1px solid var(--button-border);
  background-color: var(--button-bg);
  border-radius: 999px;
  overflow: hidden;
 }
 .segmented input[type="radio"] {
  position: absolute;
  opacity: 0;
  pointer-events: none;
 }
 .theme-selector-container {
  display: flex;
  align-items: center;
  margin-top: 17px;
 }
 .segmented label {
  display: inline-flex;
  align-items: center;
  gap: 6px;
  padding: 6px 12px;
  font-size: 14px;
  color: var(--muted);
  cursor: pointer;
  user-select: none;
  transition: background-color 0.2s ease, color 0.2s ease;
 }
 .segmented label span {
  display: none;
 }
 .segmented label:hover span {
  display: inline;
 }
 .segmented label:hover {
  background-color: var(--chip-bg);
 }
 .segmented img {
  width: 16px;
  height: 16px;
 }
 .segmented input[type="radio"]:checked + label {
  background-color: var(--chip-bg);
  color: var(--text);
 }
 .segmented input[type="radio"]:focus-visible + label,
 .segmented input[type="radio"]:focus + label {
  outline: 2px solid #007bff;
  outline-offset: 2px;
  border-radius: 999px;
 }
 /* Transcript area */
 #linesTranscript {
  margin: 20px auto;
  max-width: 700px;
  text-align: left;
  font-size: 16px;
 }
 #linesTranscript p {
  margin: 0px 0;
 }
 #linesTranscript strong {
  color: var(--text);
 }
 #speaker {
  border: 1px solid var(--border);
  border-radius: 100px;
  padding: 2px 10px;
  font-size: 14px;
  margin-bottom: 0px;
 }
 .label_diarization {
  background-color: var(--chip-bg);
  border-radius: 8px 8px 8px 8px;
  padding: 2px 10px;
  margin-left: 10px;
  display: inline-block;
  white-space: nowrap;
  font-size: 14px;
  margin-bottom: 0px;
  color: var(--label-dia-text);
 }
 .label_transcription {
  background-color: var(--chip-bg);
  border-radius: 8px 8px 8px 8px;
  padding: 2px 10px;
  display: inline-block;
  white-space: nowrap;
  margin-left: 10px;
  font-size: 14px;
  margin-bottom: 0px;
  color: var(--label-trans-text);
 }
 #timeInfo {
  color: var(--muted);
  margin-left: 10px;
 }
 .textcontent {
  font-size: 16px;
  padding-left: 10px;
  margin-bottom: 10px;
  margin-top: 1px;
  padding-top: 5px;
  border-radius: 0px 0px 0px 10px;
 }
 .buffer_diarization {
  color: var(--label-dia-text);
  margin-left: 4px;
 }
 .buffer_transcription {
  color: #7474748c;
  margin-left: 4px;
 }
 .spinner {
  display: inline-block;
  width: 8px;
  height: 8px;
  border: 2px solid var(--spinner-border);
  border-top: 2px solid var(--spinner-top);
  border-radius: 50%;
  animation: spin 0.7s linear infinite;
  vertical-align: middle;
  margin-bottom: 2px;
  margin-right: 5px;
 }
 @keyframes spin {
  to {
    transform: rotate(360deg);
  }
 }
 .silence {
  color: var(--muted);
  background-color: var(--silence-bg);
  font-size: 13px;
  border-radius: 30px;
  padding: 2px 10px;
 }
 .loading {
  color: var(--muted);
  background-color: var(--loading-bg);
  border-radius: 8px 8px 8px 0px;
  padding: 2px 10px;
  font-size: 14px;
  margin-bottom: 0px;
 }
 /* for smaller screens */
 /* @media (max-width: 450px) {
  .settings-container {
    flex-direction: column;
    gap: 10px;
    align-items: center;
  }
  .settings {
    justify-content: center;
    gap: 8px;
    width: 100%;
  }
  .field {
    align-items: center;
    width: 100%;
  }
  #websocketInput,
  #microphoneSelect {
    min-width: 200px;
    max-width: 100%;
  }
  .theme-selector-container {
    margin-top: 10px;
  }
 } */
 /* @media (max-width: 768px) and (min-width: 451px) {
  .settings-container {
    gap: 10px;
  }
  .settings {
    gap: 8px;
  }
  #websocketInput,
  #microphoneSelect {
    min-width: 150px;
    max-width: 300px;
  }
 } */
 /* @media (max-width: 480px) {
  body {
    margin: 10px;
  }
  .settings-toggle {
    width: 35px;
    height: 35px;
  }
  .settings-toggle img {
    width: 20px;
    height: 20px;
  }
  .settings {
    flex-direction: column;
    align-items: center;
    gap: 6px;
  }
  #websocketInput,
  #microphoneSelect {
    max-width: 400px;
  }
  .segmented label {
    padding: 4px 8px;
    font-size: 12px;
  }
  .segmented img {
    width: 14px;
    height: 14px;
  }
 } */
 html {
  width: 400px;  /* max: 800px */
  height: 600px; /* max: 600px */
  border-radius: 10px;
 }
--- a/chrome-extension/web/src/dark_mode.svg
+++ b/chrome-extension/web/src/dark_mode.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
--- a/chrome-extension/web/src/light_mode.svg
+++ b/chrome-extension/web/src/light_mode.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>
--- a/chrome-extension/web/src/settings.svg
+++ b/chrome-extension/web/src/settings.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M433-80q-27 0-46.5-18T363-142l-9-66q-13-5-24.5-12T307-235l-62 26q-25 11-50 2t-39-32l-47-82q-14-23-8-49t27-43l53-40q-1-7-1-13.5v-27q0-6.5 1-13.5l-53-40q-21-17-27-43t8-49l47-82q14-23 39-32t50 2l62 26q11-8 23-15t24-12l9-66q4-26 23.5-44t46.5-18h94q27 0 46.5 18t23.5 44l9 66q13 5 24.5 12t22.5 15l62-26q25-11 50-2t39 32l47 82q14 23 8 49t-27 43l-53 40q1 7 1 13.5v27q0 6.5-2 13.5l53 40q21 17 27 43t-8 49l-48 82q-14 23-39 32t-50-2l-60-26q-11 8-23 15t-24 12l-9 66q-4 26-23.5 44T527-80h-94Zm7-80h79l14-106q31-8 57.5-23.5T639-327l99 41 39-68-86-65q5-14 7-29.5t2-31.5q0-16-2-31.5t-7-29.5l86-65-39-68-99 42q-22-23-48.5-38.5T533-694l-13-106h-79l-14 106q-31 8-57.5 23.5T321-633l-99-41-39 68 86 64q-5 15-7 30t-2 32q0 16 2 31t7 30l-86 65 39 68 99-42q22 23 48.5 38.5T427-266l13 106Zm42-180q58 0 99-41t41-99q0-58-41-99t-99-41q-59 0-99.5 41T342-480q0 58 40.5 99t99.5 41Zm-2-140Z"/></svg>
--- a/chrome-extension/web/src/system_mode.svg
+++ b/chrome-extension/web/src/system_mode.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M396-396q-32-32-58.5-67T289-537q-5 14-6.5 28.5T281-480q0 83 58 141t141 58q14 0 28.5-2t28.5-6q-39-22-74-48.5T396-396Zm85 196q-56 0-107-21t-91-61q-40-40-61-91t-21-107q0-51 17-97.5t50-84.5q13-14 32-9.5t27 24.5q21 55 52.5 104t73.5 91q42 42 91 73.5T648-326q20 8 24.5 27t-9.5 32q-38 33-84.5 50T481-200Zm223-192q-16-5-23-20.5t-4-32.5q9-48-6-94.5T621-621q-35-35-80.5-49.5T448-677q-17 3-32-4t-21-23q-6-16 1.5-31t23.5-19q69-15 138 4.5T679-678q51 51 71 120t5 138q-4 17-19 25t-32 3ZM480-840q-17 0-28.5-11.5T440-880v-40q0-17 11.5-28.5T480-960q17 0 28.5 11.5T520-920v40q0 17-11.5 28.5T480-840Zm0 840q-17 0-28.5-11.5T440-40v-40q0-17 11.5-28.5T480-120q17 0 28.5 11.5T520-80v40q0 17-11.5 28.5T480 0Zm255-734q-12-12-12-28.5t12-28.5l28-28q11-11 27.5-11t28.5 11q12 12 12 28.5T819-762l-28 28q-12 12-28 12t-28-12ZM141-141q-12-12-12-28.5t12-28.5l28-28q12-12 28-12t28 12q12 12 12 28.5T225-169l-28 28q-11 11-27.5 11T141-141Zm739-299q-17 0-28.5-11.5T840-480q0-17 11.5-28.5T880-520h40q17 0 28.5 11.5T960-480q0 17-11.5 28.5T920-440h-40Zm-840 0q-17 0-28.5-11.5T0-480q0-17 11.5-28.5T40-520h40q17 0 28.5 11.5T120-480q0 17-11.5 28.5T80-440H40Zm779 299q-12 12-28.5 12T762-141l-28-28q-12-12-12-28t12-28q12-12 28.5-12t28.5 12l28 28q11 11 11 27.5T819-141ZM226-735q-12 12-28.5 12T169-735l-28-28q-11-11-11-27.5t11-28.5q12-12 28.5-12t28.5 12l28 28q12 12 12 28t-12 28Zm170 339Z"/></svg>
--- a/chrome-extension/welcome.html
+++ b/chrome-extension/welcome.html
@@ -0,0 +1,12 @@
 <!DOCTYPE html>
 <html>
  <head>
    <title>Welcome</title>
    <script src="welcome.js"></script>
  </head>
  <body>
    This page exists to work around a Chrome issue that blocks permission
    requests from Chrome extensions.
    <!-- <button id="requestMicrophone">Request Microphone</button> -->
  </body>
 </html>
--- a/demo.png
+++ b/demo.png
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
 [project]
 name = "whisperlivekit"
-version = "0.2.8"
+version = "0.2.11"
 description = "Real-time speech-to-text with speaker diarization using Whisper"
 readme = "README.md"
 authors = [
--- a/whisperlivekit/audio_processor.py
+++ b/whisperlivekit/audio_processor.py
@@ -4,11 +4,11 @@ from time import time, sleep
 import math
 import logging
 import traceback
-from whisperlivekit.timed_objects import ASRToken, Silence
+from whisperlivekit.timed_objects import ASRToken, Silence, Line, FrontData, State, Transcript, ChangeSpeaker
-from whisperlivekit.core import TranscriptionEngine, online_factory, online_diarization_factory
+from whisperlivekit.core import TranscriptionEngine, online_factory, online_diarization_factory, online_translation_factory
 from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
 from whisperlivekit.silero_vad_iterator import FixedVADIterator
-from whisperlivekit.results_formater import format_output, format_time
+from whisperlivekit.results_formater import format_output
 from whisperlivekit.ffmpeg_manager import FFmpegManager, FFmpegState
 # Set up logging once
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logger = logging.getLogger(__name__)
@@ -16,6 +16,17 @@ logger.setLevel(logging.DEBUG)
 SENTINEL = object() # unique sentinel object for end of stream marker
 async def get_all_from_queue(queue):
    items = []
    try:
        while True:
            item = queue.get_nowait()
            items.append(item)
    except asyncio.QueueEmpty:
        pass
    return items
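Review note: `get_all_from_queue` drains whatever is already queued and returns immediately, which suits a consumer that wants to batch pending work instead of awaiting items one by one. The sketch below restates the helper so that it runs on its own and exercises it with a few string payloads. The `demo` coroutine and the payload values are illustrative assumptions, not part of the commit.

```python
import asyncio


async def get_all_from_queue(queue):
    """Drain every item currently in the queue without waiting for new ones."""
    items = []
    try:
        while True:
            items.append(queue.get_nowait())
    except asyncio.QueueEmpty:
        # The queue is empty: stop draining and return what was collected.
        pass
    return items


async def demo():
    # Hypothetical payloads standing in for queued work items.
    queue = asyncio.Queue()
    for chunk in ("a", "b", "c"):
        queue.put_nowait(chunk)
    drained = await get_all_from_queue(queue)
    return drained, queue.empty()


drained_items, queue_is_empty = asyncio.run(demo())
```

Because the helper never awaits, it cannot block the event loop; a caller that needs at least one item would presumably await `queue.get()` first and then drain the rest.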
 class AudioProcessor:
    """
    Processes audio streams for transcription and diarization.
@@ -38,9 +49,7 @@ class AudioProcessor:
        self.bytes_per_sample = 2
        self.bytes_per_sec = self.samples_per_sec * self.bytes_per_sample
        self.max_bytes_per_sec = 32000 * 5  # 5 seconds of audio at 32,000 bytes per second
-        self.last_ffmpeg_activity = time()
+        self.is_pcm_input = self.args.pcm_input
        self.ffmpeg_health_check_interval = 5
        self.ffmpeg_max_idle_time = 10
        self.debug = False
        # State management
@@ -48,15 +57,19 @@ class AudioProcessor:
        self.silence = False
        self.silence_duration = 0.0
        self.tokens = []
-        self.buffer_transcription = ""
+        self.translated_segments = []
-        self.buffer_diarization = ""
+        self.buffer_transcription = Transcript()
        self.end_buffer = 0
        self.end_attributed_speaker = 0
        self.lock = asyncio.Lock()
        self.beg_loop = None  # set in process_audio, to absorb any small lag during WebSocket initialization
        self.sep = " "  # Default separator
-        self.last_response_content = ""
+        self.last_response_content = FrontData()
-        
+        self.last_detected_speaker = None
        self.speaker_languages = {}
        self.cumulative_pcm_len = 0
        self.diarization_before_transcription = False
        # Models and processing
        self.asr = models.asr
        self.tokenizer = models.tokenizer
@@ -65,58 +78,44 @@ class AudioProcessor:
            self.vac = FixedVADIterator(models.vac_model)
        else:
            self.vac = None
-            
+                         
-        self.ffmpeg_manager = FFmpegManager(
+        self.ffmpeg_manager = None
-            sample_rate=self.sample_rate,
+        self.ffmpeg_reader_task = None
            channels=self.channels
        )
        async def handle_ffmpeg_error(error_type: str):
            logger.error(f"FFmpeg error: {error_type}")
            self._ffmpeg_error = error_type
        self.ffmpeg_manager.on_error_callback = handle_ffmpeg_error
        self._ffmpeg_error = None
-        
+
        if not self.is_pcm_input:
            self.ffmpeg_manager = FFmpegManager(
                sample_rate=self.sample_rate,
                channels=self.channels
            )
            async def handle_ffmpeg_error(error_type: str):
                logger.error(f"FFmpeg error: {error_type}")
                self._ffmpeg_error = error_type
            self.ffmpeg_manager.on_error_callback = handle_ffmpeg_error
        self.transcription_queue = asyncio.Queue() if self.args.transcription else None
        self.diarization_queue = asyncio.Queue() if self.args.diarization else None
        self.translation_queue = asyncio.Queue() if self.args.target_language else None
        self.pcm_buffer = bytearray()
        # Task references
        self.transcription_task = None
        self.diarization_task = None
        self.ffmpeg_reader_task = None
        self.watchdog_task = None
        self.all_tasks_for_cleanup = []
        self.online_translation = None
        # Initialize transcription engine if enabled
        if self.args.transcription:
-            self.online = online_factory(self.args, models.asr, models.tokenizer)
+            self.online = online_factory(self.args, models.asr, models.tokenizer)        
-            
+            self.sep = self.online.asr.sep   
        # Initialize diarization engine if enabled
        if self.args.diarization:
            self.diarization = online_diarization_factory(self.args, models.diarization_model)
-
+        if models.translation_model:
            self.online_translation = online_translation_factory(self.args, models.translation_model)
    def convert_pcm_to_float(self, pcm_buffer):
        """Convert PCM buffer in s16le format to normalized NumPy array."""
        return np.frombuffer(pcm_buffer, dtype=np.int16).astype(np.float32) / 32768.0
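Review note: dividing by 32768.0 maps signed 16-bit samples onto the range [-1.0, 1.0), so the most negative sample reaches exactly -1.0 while the most positive one stops just short of 1.0. Below is a minimal standard-library sketch of the same scaling, using `struct` as a stand-in for the NumPy call above. The function name `pcm_s16le_to_floats` and the sample values are hypothetical, chosen only for illustration.

```python
import struct


def pcm_s16le_to_floats(pcm_buffer):
    """Decode little-endian signed 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    count = len(pcm_buffer) // 2  # two bytes per sample
    # "<" forces little-endian with standard sizes, so "h" is exactly 2 bytes.
    samples = struct.unpack(f"<{count}h", pcm_buffer[: count * 2])
    return [sample / 32768.0 for sample in samples]


# Hypothetical samples: silence, half scale, and both int16 extremes.
raw = struct.pack("<4h", 0, 16384, -32768, 32767)
floats = pcm_s16le_to_floats(raw)
```

The list comprehension is far slower than the vectorized NumPy version and is only meant to make the scaling explicit.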
    async def update_transcription(self, new_tokens, buffer, end_buffer, sep):
        """Thread-safe update of transcription with new data."""
        async with self.lock:
            self.tokens.extend(new_tokens)
            self.buffer_transcription = buffer
            self.end_buffer = end_buffer
            self.sep = sep
    async def update_diarization(self, end_attributed_speaker, buffer_diarization=""):
        """Thread-safe update of diarization with new data."""
        async with self.lock:
            self.end_attributed_speaker = end_attributed_speaker
            if buffer_diarization:
                self.buffer_diarization = buffer_diarization
    async def add_dummy_token(self):
        """Placeholder token when no transcription is available."""
        async with self.lock:
@@ -141,33 +140,35 @@ class AudioProcessor:
                latest_end = max(self.end_buffer, self.tokens[-1].end if self.tokens else 0)
                remaining_diarization = max(0, round(latest_end - self.end_attributed_speaker, 1))
-            return {
-                "tokens": self.tokens.copy(),
-                "buffer_transcription": self.buffer_transcription,
-                "buffer_diarization": self.buffer_diarization,
-                "end_buffer": self.end_buffer,
-                "end_attributed_speaker": self.end_attributed_speaker,
-                "sep": self.sep,
-                "remaining_time_transcription": remaining_transcription,
-                "remaining_time_diarization": remaining_diarization
-            }
+            return State(
+                tokens=self.tokens.copy(),
+                translated_segments=self.translated_segments.copy(),
+                buffer_transcription=self.buffer_transcription,
+                end_buffer=self.end_buffer,
+                end_attributed_speaker=self.end_attributed_speaker,
+                remaining_time_transcription=remaining_transcription,
+                remaining_time_diarization=remaining_diarization
+            )
    async def reset(self):
        """Reset all state variables to initial values."""
        async with self.lock:
            self.tokens = []
-            self.buffer_transcription = self.buffer_diarization = ""
+            self.translated_segments = []
            self.buffer_transcription = Transcript()
            self.end_buffer = self.end_attributed_speaker = 0
            self.beg_loop = time()
    async def ffmpeg_stdout_reader(self):
-        """Read audio data from FFmpeg stdout and process it."""
+        """Read audio data from FFmpeg stdout and process it into the PCM pipeline."""
        beg = time()
        while True:
            try:
-                # Check if FFmpeg is running
+                if self.is_stopping:
-                state = await self.ffmpeg_manager.get_state()
+                    logger.info("Stopping ffmpeg_stdout_reader due to stopping flag.")
                    break
                state = await self.ffmpeg_manager.get_state() if self.ffmpeg_manager else FFmpegState.STOPPED
                if state == FFmpegState.FAILED:
                    logger.error("FFmpeg is in FAILED state, cannot read data")
                    break
@@ -175,100 +176,41 @@ class AudioProcessor:
                    logger.info("FFmpeg is stopped")
                    break
                elif state != FFmpegState.RUNNING:
-                    logger.warning(f"FFmpeg is in {state} state, waiting...")
+                    await asyncio.sleep(0.1)
                    await asyncio.sleep(0.5)
                    continue
-                
+
                current_time = time()
-                elapsed_time = math.floor((current_time - beg) * 10) / 10
+                elapsed_time = max(0.0, current_time - beg)
-                buffer_size = max(int(32000 * elapsed_time), 4096)
+                buffer_size = max(int(32000 * elapsed_time), 4096)  # dynamic read
                beg = current_time
                chunk = await self.ffmpeg_manager.read_data(buffer_size)
                if not chunk:
-                    if self.is_stopping:
+                    # No data currently available
-                        logger.info("FFmpeg stdout closed, stopping.")
+                    await asyncio.sleep(0.05)
-                        break
+                    continue
-                    else:
+
                        # No data available, but not stopping - FFmpeg might be restarting
                        await asyncio.sleep(0.1)
                        continue
                self.pcm_buffer.extend(chunk)
                await self.handle_pcm_data()
-                # Process when enough data
+            except asyncio.CancelledError:
-                if len(self.pcm_buffer) >= self.bytes_per_sec:
+                logger.info("ffmpeg_stdout_reader cancelled.")
-                    if len(self.pcm_buffer) > self.max_bytes_per_sec:
+                break
                        logger.warning(
                            f"Audio buffer too large: {len(self.pcm_buffer) / self.bytes_per_sec:.2f}s. "
                            f"Consider using a smaller model."
                        )
                    # Process audio chunk
                    pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
                    self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
                    res = None
                    end_of_audio = False
                    silence_buffer = None
                    if self.args.vac:
                        res = self.vac(pcm_array)
                    if res is not None:
                        if res.get('end', 0) > res.get('start', 0):
                            end_of_audio = True
                        elif self.silence: #end of silence
                            self.silence = False
                            silence_buffer = Silence(duration=time() - self.start_silence)
                    if silence_buffer:
                        if self.args.transcription and self.transcription_queue:
                            await self.transcription_queue.put(silence_buffer)
                        if self.args.diarization and self.diarization_queue:
                            await self.diarization_queue.put(silence_buffer)
                    if not self.silence:                            
                        if self.args.transcription and self.transcription_queue:
                            await self.transcription_queue.put(pcm_array.copy())
                        if self.args.diarization and self.diarization_queue:
                            await self.diarization_queue.put(pcm_array.copy())
                        self.silence_duration = 0.0
                        if end_of_audio:
                            self.silence = True
                            self.start_silence = time()
                    # Sleep if no processing is happening
                    if not self.args.transcription and not self.args.diarization:
                        await asyncio.sleep(0.1)
            except Exception as e:
                logger.warning(f"Exception in ffmpeg_stdout_reader: {e}")
-                logger.warning(f"Traceback: {traceback.format_exc()}")
+                logger.debug(f"Traceback: {traceback.format_exc()}")
-                # Try to recover by waiting a bit
+                await asyncio.sleep(0.2)
-                await asyncio.sleep(1)
+
-                
+        logger.info("FFmpeg stdout processing finished. Signaling downstream processors if needed.")
-                # Check if we should exit
+        if not self.diarization_before_transcription and self.transcription_queue:
                if self.is_stopping:
                    break
        logger.info("FFmpeg stdout processing finished. Signaling downstream processors.")
        if self.args.transcription and self.transcription_queue:
            await self.transcription_queue.put(SENTINEL)
            logger.debug("Sentinel put into transcription_queue.")
        if self.args.diarization and self.diarization_queue:
            await self.diarization_queue.put(SENTINEL)
-            logger.debug("Sentinel put into diarization_queue.")
+        if self.online_translation:
-
+            await self.translation_queue.put(SENTINEL)
    async def transcription_processor(self):
        """Process audio chunks for transcription."""
        self.sep = self.online.asr.sep
        cumulative_pcm_duration_stream_time = 0.0
        while True:
@@ -278,11 +220,6 @@ class AudioProcessor:
                    logger.debug("Transcription processor received sentinel. Finishing.")
                    self.transcription_queue.task_done()
                    break
                if not self.online:
                    logger.warning("Transcription processor: self.online not initialized.")
                    self.transcription_queue.task_done()
                    continue
                asr_internal_buffer_duration_s = len(getattr(self.online, 'audio_buffer', [])) / self.online.SAMPLING_RATE
                transcription_lag_s = max(0.0, time() - self.beg_loop - self.end_buffer)
@@ -291,52 +228,51 @@ class AudioProcessor:
                    asr_processing_logs += f" + Silence of = {item.duration:.2f}s"
                    if self.tokens:
                        asr_processing_logs += f" | last_end = {self.tokens[-1].end} |"
-                logger.info(asr_processing_logs)
+                    logger.info(asr_processing_logs)
                if type(item) is Silence:
                    cumulative_pcm_duration_stream_time += item.duration
                    self.online.insert_silence(item.duration, self.tokens[-1].end if self.tokens else 0)
                    continue
-                
+                elif isinstance(item, ChangeSpeaker):
-                if isinstance(item, np.ndarray):
+                    self.online.new_speaker(item)
                elif isinstance(item, np.ndarray):
                    pcm_array = item
-                else:
+                
-                    raise Exception('item should be pcm_array')
+                logger.info(asr_processing_logs)
                duration_this_chunk = len(pcm_array) / self.sample_rate
                cumulative_pcm_duration_stream_time += duration_this_chunk
                stream_time_end_of_current_pcm = cumulative_pcm_duration_stream_time
                self.online.insert_audio_chunk(pcm_array, stream_time_end_of_current_pcm)
-                new_tokens, current_audio_processed_upto = self.online.process_iter()
+                new_tokens, current_audio_processed_upto = await asyncio.to_thread(self.online.process_iter)
-                # Get buffer information
+                _buffer_transcript = self.online.get_buffer()
-                _buffer_transcript_obj = self.online.get_buffer()
+                buffer_text = _buffer_transcript.text
                buffer_text = _buffer_transcript_obj.text
                if new_tokens:
                    validated_text = self.sep.join([t.text for t in new_tokens])
                    if buffer_text.startswith(validated_text):
-                        buffer_text = buffer_text[len(validated_text):].lstrip()
+                        _buffer_transcript.text = buffer_text[len(validated_text):].lstrip()
                candidate_end_times = [self.end_buffer]
                if new_tokens:
                    candidate_end_times.append(new_tokens[-1].end)
-                if _buffer_transcript_obj.end is not None:
+                if _buffer_transcript.end is not None:
-                    candidate_end_times.append(_buffer_transcript_obj.end)
+                    candidate_end_times.append(_buffer_transcript.end)
                candidate_end_times.append(current_audio_processed_upto)
-                new_end_buffer = max(candidate_end_times)
+                async with self.lock:
                    self.tokens.extend(new_tokens)
                    self.buffer_transcription = _buffer_transcript
                    self.end_buffer = max(candidate_end_times)
-                await self.update_transcription(
+                if self.translation_queue:
-                    new_tokens, buffer_text, new_end_buffer, self.sep
+                    for token in new_tokens:
-                )
+                        await self.translation_queue.put(token)
                self.transcription_queue.task_done()
            except Exception as e:
@@ -344,13 +280,20 @@ class AudioProcessor:
                logger.warning(f"Traceback: {traceback.format_exc()}")
                if 'pcm_array' in locals() and pcm_array is not SENTINEL : # Check if pcm_array was assigned from queue
                    self.transcription_queue.task_done()
        if self.is_stopping:
            logger.info("Transcription processor finishing due to stopping flag.")
            if self.diarization_queue:
                await self.diarization_queue.put(SENTINEL)
            if self.translation_queue:
                await self.translation_queue.put(SENTINEL)
        logger.info("Transcription processor task finished.")
    async def diarization_processor(self, diarization_obj):
        """Process audio chunks for speaker diarization."""
-        buffer_diarization = ""
+        self.current_speaker = 0
        cumulative_pcm_duration_stream_time = 0.0
        while True:
            try:
                item = await self.diarization_queue.get()
@@ -358,30 +301,36 @@ class AudioProcessor:
                    logger.debug("Diarization processor received sentinel. Finishing.")
                    self.diarization_queue.task_done()
                    break
-                
+                elif type(item) is Silence:
                if type(item) is Silence:
                    cumulative_pcm_duration_stream_time += item.duration
                    diarization_obj.insert_silence(item.duration)
                    continue
-    
+                elif isinstance(item, np.ndarray):
                if isinstance(item, np.ndarray):
                    pcm_array = item
                else:
                    raise Exception('item should be pcm_array') 
                # Process diarization
                await diarization_obj.diarize(pcm_array)
                segments = diarization_obj.get_segments()
-                async with self.lock:
+                if self.diarization_before_transcription:
-                    self.tokens = diarization_obj.assign_speakers_to_tokens(
+                    if segments and segments[-1].speaker != self.current_speaker:
-                        self.tokens,
+                        self.current_speaker = segments[-1].speaker
-                        use_punctuation_split=self.args.punctuation_split
+                        cut_at = int(segments[-1].start*16000 - (self.cumulative_pcm_len))
-                    )
+                        await self.transcription_queue.put(pcm_array[cut_at:])
-                    if len(self.tokens) > 0:
+                        await self.transcription_queue.put(ChangeSpeaker(speaker=self.current_speaker, start=cut_at))                        
-                        self.end_attributed_speaker = max(self.tokens[-1].end, self.end_attributed_speaker)
+                        await self.transcription_queue.put(pcm_array[:cut_at])
-                    if buffer_diarization:
+                    else:
-                        self.buffer_diarization = buffer_diarization
+                        await self.transcription_queue.put(pcm_array)
-                
+                else:           
                    async with self.lock:
                        self.tokens = diarization_obj.assign_speakers_to_tokens(
                            self.tokens,
                            use_punctuation_split=self.args.punctuation_split
                        )
                self.cumulative_pcm_len += len(pcm_array)            
                if len(self.tokens) > 0:
                    self.end_attributed_speaker = max(self.tokens[-1].end, self.end_attributed_speaker)
                self.diarization_queue.task_done()
            except Exception as e:
@@ -391,100 +340,120 @@ class AudioProcessor:
                    self.diarization_queue.task_done()
        logger.info("Diarization processor task finished.")
    async def translation_processor(self):
        # the idea is to ignore diarization for the moment. We use only transcription tokens. 
        # And the speaker is attributed given the segments used for the translation
        # in the future we want to have different languages for each speaker etc, so it will be more complex.
        while True:
            try:
                item = await self.translation_queue.get() #block until at least 1 token
                if item is SENTINEL:
                    logger.debug("Translation processor received sentinel. Finishing.")
                    self.translation_queue.task_done()
                    break
                elif type(item) is Silence:
                    self.online_translation.insert_silence(item.duration)
                    continue
                # get all the available tokens for translation. The more words, the more precise
                tokens_to_process = [item]
                additional_tokens = await get_all_from_queue(self.translation_queue)
                sentinel_found = False
                for additional_token in additional_tokens:
                    if additional_token is SENTINEL:
                        sentinel_found = True
                        break
                    tokens_to_process.append(additional_token)                
                if tokens_to_process:
                    self.online_translation.insert_tokens(tokens_to_process)
                    self.translated_segments = await asyncio.to_thread(self.online_translation.process)
                self.translation_queue.task_done()
                for _ in additional_tokens:
                    self.translation_queue.task_done()
                if sentinel_found:
                    logger.debug("Translation processor received sentinel in batch. Finishing.")
                    break
            except Exception as e:
                logger.warning(f"Exception in translation_processor: {e}")
                logger.warning(f"Traceback: {traceback.format_exc()}")
                if 'token' in locals() and item is not SENTINEL:
                    self.translation_queue.task_done()
                if 'additional_tokens' in locals():
                    for _ in additional_tokens:
                        self.translation_queue.task_done()
        logger.info("Translation processor task finished.")
    async def results_formatter(self):
        """Format processing results for output."""
        last_sent_trans = None
        last_sent_diar = None
        while True:
            try:
-                ffmpeg_state = await self.ffmpeg_manager.get_state()
+                # If FFmpeg error occurred, notify front-end
-                if ffmpeg_state == FFmpegState.FAILED and self._ffmpeg_error:
+                if self._ffmpeg_error:
-                    yield {
+                    yield FrontData(
-                        "status": "error",
+                        status="error",
-                        "error": f"FFmpeg error: {self._ffmpeg_error}",
+                        error=f"FFmpeg error: {self._ffmpeg_error}"
-                        "lines": [],
+                    )
                        "buffer_transcription": "",
                        "buffer_diarization": "",
                        "remaining_time_transcription": 0,
                        "remaining_time_diarization": 0
                    }
                    self._ffmpeg_error = None
                    await asyncio.sleep(1)
                    continue
-                
+
                # Get current state
                state = await self.get_current_state()
                tokens = state["tokens"]
                buffer_transcription = state["buffer_transcription"]
                buffer_diarization = state["buffer_diarization"]
                end_attributed_speaker = state["end_attributed_speaker"]
                sep = state["sep"]
                # Add dummy tokens if needed
-                if (not tokens or tokens[-1].is_dummy) and not self.args.transcription and self.args.diarization:
+                if (not state.tokens or state.tokens[-1].is_dummy) and not self.args.transcription and self.args.diarization:
                    await self.add_dummy_token()
                    sleep(0.5)
                    state = await self.get_current_state()
                    tokens = state["tokens"]
                # Format output
-                lines, undiarized_text, buffer_transcription, buffer_diarization = format_output(
+                lines, undiarized_text, end_w_silence = format_output(
                    state,
                    self.silence,
                    current_time = time() - self.beg_loop if self.beg_loop else None,
-                    diarization = self.args.diarization,
+                    args = self.args,
-                    debug = self.debug
+                    debug = self.debug,
                    sep=self.sep
                )
-                # Handle undiarized text
+                if end_w_silence:
                    buffer_transcription = Transcript()
                else:
                    buffer_transcription = state.buffer_transcription
                buffer_diarization = ''
                if undiarized_text:
-                    combined = sep.join(undiarized_text)
+                    buffer_diarization = self.sep.join(undiarized_text)
-                    if buffer_transcription:
+
-                        combined += sep
+                    async with self.lock:
-                    await self.update_diarization(end_attributed_speaker, combined)
+                        self.end_attributed_speaker = state.end_attributed_speaker
                    buffer_diarization = combined
                response_status = "active_transcription"
-                final_lines_for_response = lines.copy()
+                if not state.tokens and not buffer_transcription and not buffer_diarization:
                if not tokens and not buffer_transcription and not buffer_diarization:
                    response_status = "no_audio_detected"
-                    final_lines_for_response = []
+                    lines = []
-                elif response_status == "active_transcription" and not final_lines_for_response:
+                elif not lines:
-                    final_lines_for_response = [{
+                    lines = [Line(
-                        "speaker": 1,
+                        speaker=1,
-                        "text": "",
+                        start=state.end_buffer,
-                        "beg": format_time(state.get("end_buffer", 0)),
+                        end=state.end_buffer
-                        "end": format_time(state.get("end_buffer", 0)),
+                    )]
                        "diff": 0
                    }]
-                response = {
+                response = FrontData(
-                    "status": response_status,
+                    status=response_status,
-                    "lines": final_lines_for_response,
+                    lines=lines,
-                    "buffer_transcription": buffer_transcription,
+                    buffer_transcription=buffer_transcription.text.strip(),
-                    "buffer_diarization": buffer_diarization,
+                    buffer_diarization=buffer_diarization.strip(),
-                    "remaining_time_transcription": state["remaining_time_transcription"],
+                    remaining_time_transcription=state.remaining_time_transcription,
-                    "remaining_time_diarization": state["remaining_time_diarization"] if self.args.diarization else 0
+                    remaining_time_diarization=state.remaining_time_diarization if self.args.diarization else 0
                }
                current_response_signature = f"{response_status} | " + \
                                           ' '.join([f"{line['speaker']} {line['text']}" for line in final_lines_for_response]) + \
                                           f" | {buffer_transcription} | {buffer_diarization}"
                trans = state["remaining_time_transcription"]
                diar = state["remaining_time_diarization"]
                should_push = (
                    current_response_signature != self.last_response_content
                    or last_sent_trans is None
                    or round(trans, 1) != round(last_sent_trans, 1)
                    or round(diar, 1) != round(last_sent_diar, 1)
                )
-                if should_push and (final_lines_for_response or buffer_transcription or buffer_diarization or response_status == "no_audio_detected" or trans > 0 or diar > 0):
+                                
                should_push = (response != self.last_response_content)
                if should_push and (lines or buffer_transcription or buffer_diarization or response_status == "no_audio_detected"):
                    yield response
-                    self.last_response_content = current_response_signature
+                    self.last_response_content = response
                    last_sent_trans = trans
                    last_sent_diar = diar
                # Check for termination condition
                if self.is_stopping:
@@ -496,35 +465,34 @@ class AudioProcessor:
                    if all_processors_done:
                        logger.info("Results formatter: All upstream processors are done and in stopping state. Terminating.")
                        final_state = await self.get_current_state()
                        return
-                await asyncio.sleep(0.1)  # Avoid overwhelming the client
+                await asyncio.sleep(0.05)
            except Exception as e:
                logger.warning(f"Exception in results_formatter: {e}")
                logger.warning(f"Traceback: {traceback.format_exc()}")
-                await asyncio.sleep(0.5)  # Back off on error
+                await asyncio.sleep(0.5)
    async def create_tasks(self):
        """Create and start processing tasks."""
        self.all_tasks_for_cleanup = []
        processing_tasks_for_watchdog = []
-        success = await self.ffmpeg_manager.start()
+        # If using FFmpeg (non-PCM input), start it and spawn stdout reader
-        if not success:
+        if not self.is_pcm_input:
-            logger.error("Failed to start FFmpeg manager")
+            success = await self.ffmpeg_manager.start()
-            async def error_generator():
+            if not success:
-                yield {
+                logger.error("Failed to start FFmpeg manager")
-                    "status": "error", 
+                async def error_generator():
-                    "error": "FFmpeg failed to start. Please check that FFmpeg is installed.",
+                    yield FrontData(
-                    "lines": [],
+                        status="error",
-                    "buffer_transcription": "",
+                        error="FFmpeg failed to start. Please check that FFmpeg is installed."
-                    "buffer_diarization": "",
+                    )
-                    "remaining_time_transcription": 0,
+                return error_generator()
-                    "remaining_time_diarization": 0
+            self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
-                }
+            self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
-            return error_generator()
+            processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)
        if self.args.transcription and self.online:
            self.transcription_task = asyncio.create_task(self.transcription_processor())
@@ -536,10 +504,11 @@ class AudioProcessor:
            self.all_tasks_for_cleanup.append(self.diarization_task)
            processing_tasks_for_watchdog.append(self.diarization_task)
-        self.ffmpeg_reader_task = asyncio.create_task(self.ffmpeg_stdout_reader())
+        if self.online_translation:
-        self.all_tasks_for_cleanup.append(self.ffmpeg_reader_task)
+            self.translation_task = asyncio.create_task(self.translation_processor())
-        processing_tasks_for_watchdog.append(self.ffmpeg_reader_task)
+            self.all_tasks_for_cleanup.append(self.translation_task)
-
+            processing_tasks_for_watchdog.append(self.translation_task)
        # Monitor overall system health
        self.watchdog_task = asyncio.create_task(self.watchdog(processing_tasks_for_watchdog))
        self.all_tasks_for_cleanup.append(self.watchdog_task)
@@ -560,15 +529,6 @@ class AudioProcessor:
                            logger.error(f"{task_name} unexpectedly completed with exception: {exc}")
                        else:
                            logger.info(f"{task_name} completed normally.")
                # Check FFmpeg status through the manager
                ffmpeg_state = await self.ffmpeg_manager.get_state()
                if ffmpeg_state == FFmpegState.FAILED:
                    logger.error("FFmpeg is in FAILED state, notifying results formatter")
                    # FFmpeg manager will handle its own recovery
                elif ffmpeg_state == FFmpegState.STOPPED and not self.is_stopping:
                    logger.warning("FFmpeg unexpectedly stopped, attempting restart")
                    await self.ffmpeg_manager.restart()
            except asyncio.CancelledError:
                logger.info("Watchdog task cancelled.")
@@ -578,18 +538,24 @@ class AudioProcessor:
    async def cleanup(self):
        """Clean up resources when processing is complete."""
-        logger.info("Starting cleanup of AudioProcessor resources.")        
+        logger.info("Starting cleanup of AudioProcessor resources.")
        self.is_stopping = True
        for task in self.all_tasks_for_cleanup:
            if task and not task.done():
                task.cancel()
-        
+            
        created_tasks = [t for t in self.all_tasks_for_cleanup if t]
        if created_tasks:
            await asyncio.gather(*created_tasks, return_exceptions=True)
        logger.info("All processing tasks cancelled or finished.")
-        await self.ffmpeg_manager.stop()
+
-        logger.info("FFmpeg manager stopped.")
+        if not self.is_pcm_input and self.ffmpeg_manager:
-        if self.args.diarization and hasattr(self, 'diarization') and hasattr(self.diarization, 'close'):
+            try:
                await self.ffmpeg_manager.stop()
                logger.info("FFmpeg manager stopped.")
            except Exception as e:
                logger.warning(f"Error stopping FFmpeg manager: {e}")
        if self.args.diarization and hasattr(self, 'diarization') and hasattr(self.diarization, 'close'):
            self.diarization.close()
        logger.info("AudioProcessor cleanup complete.")
@@ -603,18 +569,83 @@ class AudioProcessor:
        if not message:
            logger.info("Empty audio message received, initiating stop sequence.")
            self.is_stopping = True
-            # Signal FFmpeg manager to stop accepting data
+             
-            await self.ffmpeg_manager.stop()
+            if self.transcription_queue:
                await self.transcription_queue.put(SENTINEL)
            if not self.is_pcm_input and self.ffmpeg_manager:
                await self.ffmpeg_manager.stop()
            return
        if self.is_stopping:
            logger.warning("AudioProcessor is stopping. Ignoring incoming audio.")
            return
-        success = await self.ffmpeg_manager.write_data(message)
+        if self.is_pcm_input:
-        if not success:
+            self.pcm_buffer.extend(message)
-            ffmpeg_state = await self.ffmpeg_manager.get_state()
+            await self.handle_pcm_data()
-            if ffmpeg_state == FFmpegState.FAILED:
+        else:
-                logger.error("FFmpeg is in FAILED state, cannot process audio")
+            if not self.ffmpeg_manager:
-            else:
+                logger.error("FFmpeg manager not initialized for non-PCM input.")
-                logger.warning("Failed to write audio data to FFmpeg")
+                return
            success = await self.ffmpeg_manager.write_data(message)
            if not success:
                ffmpeg_state = await self.ffmpeg_manager.get_state()
                if ffmpeg_state == FFmpegState.FAILED:
                    logger.error("FFmpeg is in FAILED state, cannot process audio")
                else:
                    logger.warning("Failed to write audio data to FFmpeg")
    async def handle_pcm_data(self):
        # Process when enough data
        if len(self.pcm_buffer) < self.bytes_per_sec:
            return
        if len(self.pcm_buffer) > self.max_bytes_per_sec:
            logger.warning(
                f"Audio buffer too large: {len(self.pcm_buffer) / self.bytes_per_sec:.2f}s. "
                f"Consider using a smaller model."
            )
        # Process audio chunk
        pcm_array = self.convert_pcm_to_float(self.pcm_buffer[:self.max_bytes_per_sec])
        self.pcm_buffer = self.pcm_buffer[self.max_bytes_per_sec:]
        res = None
        end_of_audio = False
        silence_buffer = None
        if self.args.vac:
            res = self.vac(pcm_array)
        if res is not None:
            if res.get("end", 0) > res.get("start", 0):
                end_of_audio = True
            elif self.silence: #end of silence
                self.silence = False
                silence_buffer = Silence(duration=time() - self.start_silence)
        if silence_buffer:
            if not self.diarization_before_transcription and self.transcription_queue:
                await self.transcription_queue.put(silence_buffer)
            if self.args.diarization and self.diarization_queue:
                await self.diarization_queue.put(silence_buffer)
            if self.translation_queue:
                await self.translation_queue.put(silence_buffer)
        if not self.silence:
            if not self.diarization_before_transcription and self.transcription_queue:
                await self.transcription_queue.put(pcm_array.copy())
            if self.args.diarization and self.diarization_queue:
                await self.diarization_queue.put(pcm_array.copy())
            self.silence_duration = 0.0
            if end_of_audio:
                self.silence = True
                self.start_silence = time()
        if not self.args.transcription and not self.args.diarization:
            await asyncio.sleep(0.1)
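Review note: the buffering step in `handle_pcm_data` can be sketched in isolation as below. This is a minimal, standard-library-only illustration, not the project's implementation: the 16 kHz mono sample rate and the helper names `pcm_s16le_to_float` and `take_chunk` are assumptions introduced for the example, and the real `convert_pcm_to_float` is not shown in this diff.

```python
import array
import sys
from typing import List, Optional

SAMPLE_RATE = 16_000   # assumed sample rate; not stated in the diff
BYTES_PER_SAMPLE = 2   # s16le: signed 16-bit little-endian samples
BYTES_PER_SEC = SAMPLE_RATE * BYTES_PER_SAMPLE


def pcm_s16le_to_float(buf: bytes) -> List[float]:
    """Convert raw s16le bytes to floats in [-1.0, 1.0)."""
    # Ignore a trailing odd byte so the data splits into whole samples.
    usable = len(buf) - (len(buf) % BYTES_PER_SAMPLE)
    samples = array.array("h")
    samples.frombytes(bytes(buf[:usable]))
    if sys.byteorder == "big":
        samples.byteswap()  # the input is little-endian by definition
    return [s / 32768.0 for s in samples]


def take_chunk(pcm_buffer: bytearray, max_bytes: int) -> Optional[List[float]]:
    """Return up to max_bytes of audio once one second is buffered, else None."""
    if len(pcm_buffer) < BYTES_PER_SEC:
        return None
    chunk = bytes(pcm_buffer[:max_bytes])
    del pcm_buffer[:max_bytes]  # keep the remainder for the next call
    return pcm_s16le_to_float(chunk)
```

Calling `take_chunk` after every WebSocket message mirrors the hunk above: short messages only accumulate, and a chunk is emitted once a full second is available.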
--- a/whisperlivekit/basic_server.py
+++ b/whisperlivekit/basic_server.py
@@ -5,9 +5,6 @@ from fastapi.middleware.cors import CORSMiddleware
 from whisperlivekit import TranscriptionEngine, AudioProcessor, get_inline_ui_html, parse_args
 import asyncio
 import logging
 from starlette.staticfiles import StaticFiles
 import pathlib
 import whisperlivekit.web as webpkg
 logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
 logging.getLogger().setLevel(logging.WARNING)
@@ -18,16 +15,7 @@ args = parse_args()
 transcription_engine = None
@asynccontextmanager
-async def lifespan(app: FastAPI):
+async def lifespan(app: FastAPI):    
    #to remove after 0.2.8
    if args.backend == "simulstreaming" and not args.disable_fast_encoder:
        logger.warning(f"""
 {'='*50}
 WhisperLiveKit 0.2.8 has introduced a new fast encoder feature using MLX Whisper or Faster Whisper for improved speed. Use --disable-fast-encoder to disable if you encounter issues.
 {'='*50}
    """)
    global transcription_engine
    transcription_engine = TranscriptionEngine(
        **vars(args),
@@ -42,8 +30,6 @@ app.add_middleware(
    allow_methods=["*"],
    allow_headers=["*"],
 )
 web_dir = pathlib.Path(webpkg.__file__).parent
 app.mount("/web", StaticFiles(directory=str(web_dir)), name="web")
@app.get("/")
 async def get():
@@ -54,7 +40,7 @@ async def handle_websocket_results(websocket, results_generator):
    """Consumes results from the audio processor and sends them via WebSocket."""
    try:
        async for response in results_generator:
-            await websocket.send_json(response)
+            await websocket.send_json(response.to_dict())
        # when the results_generator finishes it means all audio has been processed
        logger.info("Results generator finished. Sending 'ready_to_stop' to client.")
        await websocket.send_json({"type": "ready_to_stop"})
@@ -72,6 +58,11 @@ async def websocket_endpoint(websocket: WebSocket):
    )
    await websocket.accept()
    logger.info("WebSocket connection opened.")
    try:
        await websocket.send_json({"type": "config", "useAudioWorklet": bool(args.pcm_input)})
    except Exception as e:
        logger.warning(f"Failed to send config to client: {e}")
    results_generator = await audio_processor.create_tasks()
    websocket_task = asyncio.create_task(handle_websocket_results(websocket, results_generator))
--- a/whisperlivekit/core.py
+++ b/whisperlivekit/core.py
@@ -4,7 +4,7 @@ try:
 except ImportError:
    from .whisper_streaming_custom.whisper_online import backend_factory
    from .whisper_streaming_custom.online_asr import OnlineASRProcessor
-from whisperlivekit.warmup import warmup_asr, warmup_online
+from whisperlivekit.warmup import warmup_asr
 from argparse import Namespace
 import sys
@@ -33,6 +33,7 @@ class TranscriptionEngine:
            "model_dir": None,
            "lan": "auto",
            "task": "transcribe",
            "target_language": "",
            "backend": "faster-whisper",
            "vac": True,
            "vac_chunk_size": 0.04,
@@ -41,10 +42,13 @@ class TranscriptionEngine:
            "ssl_keyfile": None,
            "transcription": True,
            "vad": True,
            "pcm_input": False,
            # whisperstreaming params:
            "buffer_trimming": "segment",
            "confidence_validation": False,
            "buffer_trimming_sec": 15,
            # simulstreaming params:
            "disable_fast_encoder": False,
            "frame_threshold": 25,
@@ -59,9 +63,15 @@ class TranscriptionEngine:
            "max_context_tokens": None,
            "model_path": './base.pt',
            "diarization_backend": "sortformer",
-            # diart params:
+            
            # diarization params:
            "disable_punctuation_split" : False,
            "segmentation_model": "pyannote/segmentation-3.0",
-            "embedding_model": "pyannote/embedding",         
+            "embedding_model": "pyannote/embedding",  
            # translation params:
            "nllb_backend": "ctranslate2",
            "nllb_size": "600M"
        }
        config_dict = {**defaults, **kwargs}
@@ -117,7 +127,7 @@ class TranscriptionEngine:
            else:
                self.asr, self.tokenizer = backend_factory(self.args)
-            warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here
+                warmup_asr(self.asr, self.args.warmup_file) #for simulstreaming, warmup should be done in the online class not here
        if self.args.diarization:
            if self.args.diarization_backend == "diart":
@@ -132,7 +142,14 @@ class TranscriptionEngine:
                self.diarization_model = SortformerDiarization()
            else:
                raise ValueError(f"Unknown diarization backend: {self.args.diarization_backend}")
-            
+        
        self.translation_model = None
        if self.args.target_language:
            if self.args.lan == 'auto' and self.args.backend != "simulstreaming":
                raise Exception('Translation cannot be set with language auto when transcription backend is not simulstreaming')
            else:
                from whisperlivekit.translation.translation import load_model
                self.translation_model = load_model([self.args.lan], backend=self.args.nllb_backend, model_size=self.args.nllb_size) #in the future we want to handle different languages for different speakers
        TranscriptionEngine._initialized = True
@@ -144,7 +161,6 @@ def online_factory(args, asr, tokenizer, logfile=sys.stderr):
            asr,
            logfile=logfile,
        )
        # warmup_online(online, args.warmup_file)
    else:
        online = OnlineASRProcessor(
            asr,
@@ -159,11 +175,17 @@ def online_factory(args, asr, tokenizer, logfile=sys.stderr):
 def online_diarization_factory(args, diarization_backend):
    if args.diarization_backend == "diart":
        online = diarization_backend
-        # Not the best here, since several user/instances will share the same backend, but diart is not SOTA anymore and sortformer is recommanded
+        # Not the best here, since several user/instances will share the same backend, but diart is not SOTA anymore and sortformer is recommended
    if args.diarization_backend == "sortformer":
        from whisperlivekit.diarization.sortformer_backend import SortformerDiarizationOnline
        online = SortformerDiarizationOnline(shared_model=diarization_backend)
    return online
-        
+
 def online_translation_factory(args, translation_model):
    # should be handled at speaker level in the future:
    # one shared NLLB model for all speakers
    # one tokenizer per speaker/language
    from whisperlivekit.translation.translation import OnlineTranslation
    return OnlineTranslation(translation_model, [args.lan], [args.target_language])
--- a/whisperlivekit/diarization/sortformer_backend.py
+++ b/whisperlivekit/diarization/sortformer_backend.py
@@ -60,11 +60,15 @@ class SortformerDiarization:
            self.diar_model = SortformerEncLabelModel.from_pretrained(model_name)
            self.diar_model.eval()
-            if torch.cuda.is_available():
-                self.diar_model.to(torch.device("cuda"))
-                logger.info("Using CUDA for Sortformer model")
-            else:
-                logger.info("Using CPU for Sortformer model")
+            device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+            self.diar_model.to(device)
+            
+            ## to test
+            # for name, param in self.diar_model.named_parameters():
+            #     if param.device != device:
+            #         raise RuntimeError(f"Parameter {name} is on {param.device} but should be on {device}")
+            logger.info(f"Using {device.type.upper()} for Sortformer model")
            self.diar_model.sortformer_modules.chunk_len = 10
            self.diar_model.sortformer_modules.subsampling_factor = 10
@@ -106,6 +110,7 @@ class SortformerDiarizationOnline:
            features=128,
            pad_to=0
        )
        self.audio2mel.to(self.diar_model.device)
        self.chunk_duration_seconds = (
            self.diar_model.sortformer_modules.chunk_len * 
@@ -186,22 +191,25 @@ class SortformerDiarizationOnline:
            audio = self.buffer_audio[:threshold]
            self.buffer_audio = self.buffer_audio[threshold:]
-            audio_signal_chunk = torch.tensor(audio).unsqueeze(0).to(self.diar_model.device)
-            audio_signal_length_chunk = torch.tensor([audio_signal_chunk.shape[1]]).to(self.diar_model.device)
+            device = self.diar_model.device
+            audio_signal_chunk = torch.tensor(audio, device=device).unsqueeze(0)
+            audio_signal_length_chunk = torch.tensor([audio_signal_chunk.shape[1]], device=device)
            processed_signal_chunk, processed_signal_length_chunk = self.audio2mel.get_features(
                audio_signal_chunk, audio_signal_length_chunk
            )
            processed_signal_chunk = processed_signal_chunk.to(device)
            processed_signal_length_chunk = processed_signal_length_chunk.to(device)
            if self._previous_chunk_features is not None:
-                to_add = self._previous_chunk_features[:, :, -99:]
+                to_add = self._previous_chunk_features[:, :, -99:].to(device)
-                total_features = torch.concat([to_add, processed_signal_chunk], dim=2)
+                total_features = torch.concat([to_add, processed_signal_chunk], dim=2).to(device)
            else:
-                total_features = processed_signal_chunk
+                total_features = processed_signal_chunk.to(device)
-            self._previous_chunk_features = processed_signal_chunk
+            self._previous_chunk_features = processed_signal_chunk.to(device)
-            chunk_feat_seq_t = torch.transpose(total_features, 1, 2)
+            chunk_feat_seq_t = torch.transpose(total_features, 1, 2).to(device)
            with torch.inference_mode():
                left_offset = 8 if self._chunk_index > 0 else 0
@@ -209,7 +217,7 @@ class SortformerDiarizationOnline:
                self.streaming_state, self.total_preds = self.diar_model.forward_streaming_step(
                    processed_signal=chunk_feat_seq_t,
-                    processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]]),
+                    processed_signal_length=torch.tensor([chunk_feat_seq_t.shape[1]]).to(device),
                    streaming_state=self.streaming_state,
                    total_preds=self.total_preds,
                    left_offset=left_offset,
@@ -281,6 +289,7 @@ class SortformerDiarizationOnline:
        Returns:
            List of tokens with speaker assignments
            Last speaker_segment
        """
        with self.segment_lock:
            segments = self.speaker_segments.copy()
--- a/whisperlivekit/ffmpeg_manager.py
+++ b/whisperlivekit/ffmpeg_manager.py
@@ -7,11 +7,12 @@ import contextlib
 logger = logging.getLogger(__name__)
 logging.basicConfig(level=logging.INFO)
-ERROR_INSTALL_INSTRUCTIONS = """
+ERROR_INSTALL_INSTRUCTIONS = f"""
 {'='*50}
 FFmpeg is not installed or not found in your system's PATH.
-Please install FFmpeg to enable audio processing.
+Alternative Solution: You can still use WhisperLiveKit without FFmpeg by adding the --pcm-input parameter. Note that when using this option, audio will not be compressed between the frontend and backend, which may result in higher bandwidth usage.
-Installation instructions:
+If you want to install FFmpeg:
 # Ubuntu/Debian:
 sudo apt update && sudo apt install ffmpeg
@@ -25,6 +26,7 @@ brew install ffmpeg
 # 3. Add the 'bin' directory (e.g., C:\\FFmpeg\\bin) to your system's PATH environment variable.
 After installation, please restart the application.
 {'='*50}
 """
 class FFmpegState(Enum):
@@ -183,6 +185,8 @@ class FFmpegManager:
    async def _drain_stderr(self):
        try:
            while True:
                if not self.process or not self.process.stderr:
                    break
                line = await self.process.stderr.readline()
                if not line:
                    break
@@ -190,4 +194,4 @@ class FFmpegManager:
        except asyncio.CancelledError:
            logger.info("FFmpeg stderr drain task cancelled.")
        except Exception as e:
-            logger.error(f"Error draining FFmpeg stderr: {e}")
+            logger.error(f"Error draining FFmpeg stderr: {e}")
--- a/whisperlivekit/parse_args.py
+++ b/whisperlivekit/parse_args.py
@@ -20,7 +20,7 @@ def parse_args():
        help="""
        The path to a speech audio wav file to warm up Whisper so that the very first chunk processing is fast.
        If not set, uses https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav.
-        If False, no warmup is performed.
+        If empty, no warmup is performed.
        """,
    )
@@ -72,6 +72,12 @@ def parse_args():
        help="Disable transcription to only see live diarization results.",
    )
    parser.add_argument(
        "--disable-punctuation-split",
        action="store_true",
        help="Disable punctuation-based splitting: start a new line on a speaker change even when no punctuation mark confirms the split.",
    )
    parser.add_argument(
        "--min-chunk-size",
        type=float,
@@ -112,6 +118,15 @@ def parse_args():
        choices=["transcribe", "translate"],
        help="Transcribe or translate.",
    )
    parser.add_argument(
        "--target-language",
        type=str,
        default="",
        dest="target_language",
        help="Target language for translation. Not functional yet.",
    )    
    parser.add_argument(
        "--backend",
        type=str,
@@ -158,7 +173,12 @@ def parse_args():
    )
    parser.add_argument("--ssl-certfile", type=str, help="Path to the SSL certificate file.", default=None)
    parser.add_argument("--ssl-keyfile", type=str, help="Path to the SSL private key file.", default=None)
-
+    parser.add_argument(
        "--pcm-input",
        action="store_true",
        default=False,
        help="If set, raw PCM (s16le) data is expected as input and FFmpeg will be bypassed. Frontend will use AudioWorklet instead of MediaRecorder."
    )
    # SimulStreaming-specific arguments
    simulstreaming_group = parser.add_argument_group('SimulStreaming arguments (only used with --backend simulstreaming)')
@@ -260,13 +280,27 @@ def parse_args():
    )
    simulstreaming_group.add_argument(
-        "--preloaded_model_count",
+        "--preload-model-count",
        type=int,
        default=1,
-        dest="preloaded_model_count",
+        dest="preload_model_count",
        help="Optional. Number of models to preload in memory to speed up loading (set up to the expected number of concurrent instances).",
    )
    simulstreaming_group.add_argument(
        "--nllb-backend",
        type=str,
        default="ctranslate2",
        help="transformers or ctranslate2",
    )
    simulstreaming_group.add_argument(
        "--nllb-size",
        type=str,
        default="600M",
        help="600M or 1.3B",
    )
    args = parser.parse_args()
    args.transcription = not args.no_transcription
--- a/whisperlivekit/remove_silences.py
+++ b/whisperlivekit/remove_silences.py
@@ -39,7 +39,7 @@ def blank_to_silence(tokens):
                        )
                else:
                    if silence_token: #there was silence but no more
-                        if silence_token.end - silence_token.start >= MIN_SILENCE_DURATION:
+                        if silence_token.duration() >= MIN_SILENCE_DURATION:
                            cleaned_tokens.append(
                                silence_token
                            )
@@ -77,15 +77,17 @@ def no_token_to_silence(tokens):
            new_tokens.append(token)
    return new_tokens
-def ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
+def ends_with_silence(tokens, current_time, vac_detected_silence):
    end_w_silence = False
    if not tokens:
-        return [], buffer_transcription, buffer_diarization
+        return [], end_w_silence
    last_token = tokens[-1]
    if tokens and current_time and (
        current_time - last_token.end >= END_SILENCE_DURATION 
-        or 
+        or
        (current_time - last_token.end >= 3 and vac_detected_silence)
        ):
        end_w_silence = True
        if last_token.speaker == -2:
            last_token.end = current_time
        else:
@@ -97,14 +99,12 @@ def ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_
                    probability=0.95
                )
            )
-        buffer_transcription = "" # for whisperstreaming backend, we should probably validate the buffer has because of the silence
-        buffer_diarization  = ""
-    return tokens, buffer_transcription, buffer_diarization
+    return tokens, end_w_silence
-def handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence):
+def handle_silences(tokens, current_time, vac_detected_silence):
    tokens = blank_to_silence(tokens) #useful for simulstreaming backend which tends to generate [BLANK_AUDIO] text
    tokens = no_token_to_silence(tokens)
-    tokens, buffer_transcription, buffer_diarization = ends_with_silence(tokens, buffer_transcription, buffer_diarization, current_time, vac_detected_silence)
+    tokens, end_w_silence = ends_with_silence(tokens, current_time, vac_detected_silence)
-    return tokens, buffer_transcription, buffer_diarization
+    return tokens, end_w_silence
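Review note: the condition that makes `ends_with_silence` append a trailing silence can be read as a small predicate. The sketch below is an illustration under assumptions: `END_SILENCE_DURATION` is defined outside this diff, so the value used here is a placeholder, and `trailing_silence` is a hypothetical helper name.

```python
END_SILENCE_DURATION = 5.0  # placeholder; the real constant is not shown in this diff
VAC_SILENCE_DURATION = 3.0  # the literal 3 used in the hunk above


def trailing_silence(last_token_end: float, current_time: float,
                     vac_detected_silence: bool) -> bool:
    """Return True when the gap after the last token counts as silence."""
    if not current_time:
        return False
    gap = current_time - last_token_end
    # A long gap is always silence; a shorter one only if the VAC agrees.
    return gap >= END_SILENCE_DURATION or (
        gap >= VAC_SILENCE_DURATION and vac_detected_silence
    )
```

With these placeholder values, a 4-second gap is reported as silence only when the voice-activity controller has also flagged silence.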
--- a/whisperlivekit/results_formater.py
+++ b/whisperlivekit/results_formater.py
@@ -1,21 +1,15 @@
 import logging
 from datetime import timedelta
 from whisperlivekit.remove_silences import handle_silences
 from whisperlivekit.timed_objects import Line, format_time
 logger = logging.getLogger(__name__)
 logger.setLevel(logging.DEBUG)
 PUNCTUATION_MARKS = {'.', '!', '?'}
 CHECK_AROUND = 4
 def format_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS."""
    return str(timedelta(seconds=int(seconds)))
 def is_punctuation(token):
-    if token.text.strip() in PUNCTUATION_MARKS:
+    if token.is_punctuation():
        return True
    return False
@@ -34,45 +28,41 @@ def next_speaker_change(i, tokens, speaker):
            return ind, token.speaker
    return None, speaker
 def new_line(
    token,
    speaker,
    last_end_diarized,
    debug_info = ""
 ):
-    return {
-            "speaker": int(speaker),
-            "text": token.text + debug_info,
-            "beg": format_time(token.start),
-            "end": format_time(token.end),
-            "diff": round(token.end - last_end_diarized, 2)
-    }
+    return Line(
+        speaker = speaker,
+        text = token.text + debug_info,
+        start = token.start,
+        end = token.end,
+        detected_language=token.detected_language
+    )
-
-def append_token_to_last_line(lines, sep, token, debug_info, last_end_diarized):
+def append_token_to_last_line(lines, sep, token, debug_info):
    if token.text:
-        lines[-1]["text"] += sep + token.text + debug_info
-        lines[-1]["end"] = format_time(token.end)
-        lines[-1]["diff"] = round(token.end - last_end_diarized, 2)
-            
+        lines[-1].text += sep + token.text + debug_info
+        lines[-1].end = token.end
+    if not lines[-1].detected_language and token.detected_language:
+        lines[-1].detected_language = token.detected_language
-def format_output(state, silence, current_time, diarization, debug):
-    tokens = state["tokens"]
-    buffer_transcription = state["buffer_transcription"]
-    buffer_diarization = state["buffer_diarization"]
-    end_attributed_speaker = state["end_attributed_speaker"]
-    sep = state["sep"]
+def format_output(state, silence, current_time, args, debug, sep):
+    diarization = args.diarization
+    disable_punctuation_split = args.disable_punctuation_split
+    tokens = state.tokens
+    translated_segments = state.translated_segments # Here we will attribute the speakers only based on the timestamps of the segments
+    end_attributed_speaker = state.end_attributed_speaker
    previous_speaker = -1
    lines = []
    last_end_diarized = 0
    undiarized_text = []
-    tokens, buffer_transcription, buffer_diarization = handle_silences(tokens, buffer_transcription, buffer_diarization, current_time, silence)
+    tokens, end_w_silence = handle_silences(tokens, current_time, silence)
    last_punctuation = None
    for i, token in enumerate(tokens):
        speaker = token.speaker
        if not diarization and speaker == -1: # Speaker -1 means not attributed by diarization. In the frontend, it should appear under 'Speaker 1'
            speaker = 1
        if diarization and not tokens[-1].speaker == -2:
@@ -81,18 +71,15 @@ def format_output(state, silence, current_time, diarization, debug):
                continue
            elif (speaker in [-1, 0]) and token.end < end_attributed_speaker:
                speaker = previous_speaker
            if speaker not in [-1, 0]:
                last_end_diarized = max(token.end, last_end_diarized)
        debug_info = ""
        if debug:
            debug_info = f"[{format_time(token.start)} : {format_time(token.end)}]"
        if not lines:
-            lines.append(new_line(token, speaker, last_end_diarized, debug_info = ""))
+            lines.append(new_line(token, speaker, debug_info = ""))
            continue
        else:
-            previous_speaker = lines[-1]['speaker']
+            previous_speaker = lines[-1].speaker
        if is_punctuation(token):
            last_punctuation = i
@@ -101,7 +88,7 @@ def format_output(state, silence, current_time, diarization, debug):
        if last_punctuation == i-1:
            if speaker != previous_speaker:
                # perfect, diarization perfectly aligned
-                lines.append(new_line(token, speaker, last_end_diarized, debug_info = ""))
+                lines.append(new_line(token, speaker, debug_info = ""))
                last_punctuation, next_punctuation = None, None
                continue
@@ -111,28 +98,64 @@ def format_output(state, silence, current_time, diarization, debug):
                # That was the idea. Okay haha |SPLIT SPEAKER| that's a good one 
                # should become:
                # That was the idea. |SPLIT SPEAKER| Okay haha that's a good one 
-                lines.append(new_line(token, new_speaker, last_end_diarized, debug_info = ""))
+                lines.append(new_line(token, new_speaker, debug_info = ""))
            else:
                # No speaker change to come
-                append_token_to_last_line(lines, sep, token, debug_info, last_end_diarized)
+                append_token_to_last_line(lines, sep, token, debug_info)
            continue
        if speaker != previous_speaker:
            if speaker == -2 or previous_speaker == -2: #silences can happen anytime
-                lines.append(new_line(token, speaker, last_end_diarized, debug_info = ""))
+                lines.append(new_line(token, speaker, debug_info = ""))
                continue
            elif next_punctuation_change(i, tokens):
                # Corrects advance:
                # Are you |SPLIT SPEAKER| okay? yeah, sure. Absolutely 
                # should become:
                # Are you okay? |SPLIT SPEAKER| yeah, sure. Absolutely 
-                append_token_to_last_line(lines, sep, token, debug_info, last_end_diarized)
+                append_token_to_last_line(lines, sep, token, debug_info)
                continue
            else: # we could create a new speaker here, but that's not ideal: we are not sure about the split, so we prefer to append to the previous line
-                # lines.append(new_line(token, speaker, last_end_diarized, debug_info = ""))
+                if disable_punctuation_split:
                    lines.append(new_line(token, speaker, debug_info = ""))
                    continue
                pass
-        append_token_to_last_line(lines, sep, token, debug_info, last_end_diarized)
+        append_token_to_last_line(lines, sep, token, debug_info)
-    return lines, undiarized_text, buffer_transcription, '' 
    if lines and translated_segments:
        unassigned_translated_segments = []
        for ts in translated_segments:
            assigned = False
            for line in lines:
                if ts and ts.overlaps_with(line):
                    if ts.is_within(line):
                        line.translation += ts.text + ' '
                        assigned = True
                        break
                    else:
                        ts0, ts1 = ts.approximate_cut_at(line.end)
                        if ts0 and line.overlaps_with(ts0):
                            line.translation += ts0.text + ' '
                        if ts1:
                            unassigned_translated_segments.append(ts1)
                        assigned = True
                        break
            if not assigned:
                unassigned_translated_segments.append(ts)
        if unassigned_translated_segments:
            for line in lines:
                remaining_segments = []
                for ts in unassigned_translated_segments:
                    if ts and ts.overlaps_with(line):
                        line.translation += ts.text + ' '
                    else:
                        remaining_segments.append(ts)
                unassigned_translated_segments = remaining_segments # segments still left over are dropped for now; maybe handle them in the future
    if state.buffer_transcription and lines:
        lines[-1].end = max(state.buffer_transcription.end, lines[-1].end)
    return lines, undiarized_text, end_w_silence
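Review note: the translated-segment attribution added to `format_output` matches segments to lines purely by timestamp overlap. The following is a simplified sketch of that idea, assuming half-open time intervals. The `Timed` dataclass and `assign_translations` function are hypothetical stand-ins for the project's `Line` and translated-segment objects, and the `approximate_cut_at` splitting step from the hunk above is deliberately left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Timed:
    """Hypothetical stand-in for a line or a translated segment."""
    start: float
    end: float
    text: str = ""
    translation: str = ""

    def overlaps_with(self, other: "Timed") -> bool:
        # Two intervals overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


def assign_translations(lines: List[Timed], segments: List[Timed]) -> List[Timed]:
    """Append each segment's text to the first overlapping line.

    Returns the segments that overlapped no line at all.
    """
    leftovers = []
    for seg in segments:
        for line in lines:
            if seg.overlaps_with(line):
                line.translation += seg.text + " "
                break
        else:
            leftovers.append(seg)
    return leftovers
```

Unlike the code in the diff, this sketch never cuts a segment that straddles two lines; it simply gives the whole segment to the first line it touches.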
--- a/whisperlivekit/simul_whisper/backend.py
+++ b/whisperlivekit/simul_whisper/backend.py
@@ -3,12 +3,11 @@ import numpy as np
 import logging
 from typing import List, Tuple, Optional
 import logging
-from whisperlivekit.timed_objects import ASRToken, Transcript
+import platform
 from whisperlivekit.timed_objects import ASRToken, Transcript, ChangeSpeaker
 from whisperlivekit.warmup import load_file
 from whisperlivekit.simul_whisper.license_simulstreaming import SIMULSTREAMING_LICENSE
 from .whisper import load_model, tokenizer
 from .whisper.audio import TOKENS_PER_SECOND
 import os
 import gc
 logger = logging.getLogger(__name__)
@@ -22,6 +21,12 @@ try:
    from .mlx_encoder import mlx_model_mapping, load_mlx_encoder
    HAS_MLX_WHISPER = True
 except ImportError:
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        print(f"""
            {"="*50}
            MLX Whisper not found but you are on Apple Silicon. Consider installing mlx-whisper for better performance: pip install mlx-whisper
            {"="*50}
            """)
    HAS_MLX_WHISPER = False
 if HAS_MLX_WHISPER:
    HAS_FASTER_WHISPER = False
@@ -47,8 +52,7 @@ class SimulStreamingOnlineProcessor:
        self.asr = asr
        self.logfile = logfile
        self.end = 0.0
-        self.global_time_offset = 0.0
+        self.buffer = []
        self.committed: List[ASRToken] = []
        self.last_result_tokens: List[ASRToken] = []
        self.load_new_backend()
@@ -77,7 +81,7 @@ class SimulStreamingOnlineProcessor:
        else:
            self.process_iter(is_last=True) #we want to totally process what remains in the buffer.
            self.model.refresh_segment(complete=True)
-            self.global_time_offset += silence_duration + offset
+            self.model.global_time_offset = silence_duration + offset
@@ -89,63 +93,15 @@ class SimulStreamingOnlineProcessor:
        self.end = audio_stream_end_time #Only to be aligned with what happens in whisperstreaming backend.
        self.model.insert_audio(audio_tensor)
-    def get_buffer(self):
-        return Transcript(
-            start=None, 
-            end=None, 
-            text='', 
-            probability=None
-        )
-    def timestamped_text(self, tokens, generation):
-        """
-        generate timestamped text from tokens and generation data.
-        args:
-            tokens: List of tokens to process
-            generation: Dictionary containing generation progress and optionally results
-        returns:
-            List of tuples containing (start_time, end_time, word) for each word
-        """
-        FRAME_DURATION = 0.02    
-        if "result" in generation:
-            split_words = generation["result"]["split_words"]
-            split_tokens = generation["result"]["split_tokens"]
-        else:
-            split_words, split_tokens = self.model.tokenizer.split_to_word_tokens(tokens)
-        progress = generation["progress"]
-        frames = [p["most_attended_frames"][0] for p in progress]
-        absolute_timestamps = [p["absolute_timestamps"][0] for p in progress]
-        tokens_queue = tokens.copy()
-        timestamped_words = []
-        for word, word_tokens in zip(split_words, split_tokens):
-            # start_frame = None
-            # end_frame = None
-            for expected_token in word_tokens:
-                if not tokens_queue or not frames:
-                    raise ValueError(f"Insufficient tokens or frames for word '{word}'")
-                actual_token = tokens_queue.pop(0)
-                current_frame = frames.pop(0)
-                current_timestamp = absolute_timestamps.pop(0)
-                if actual_token != expected_token:
-                    raise ValueError(
-                        f"Token mismatch: expected '{expected_token}', "
-                        f"got '{actual_token}' at frame {current_frame}"
-                    )
-                # if start_frame is None:
-                #     start_frame = current_frame
-                # end_frame = current_frame
-            # start_time = start_frame * FRAME_DURATION
-            # end_time = end_frame * FRAME_DURATION
-            start_time = current_timestamp
-            end_time = current_timestamp + 0.1
-            timestamp_entry = (start_time, end_time, word)
-            timestamped_words.append(timestamp_entry)
-            logger.debug(f"TS-WORD:\t{start_time:.2f}\t{end_time:.2f}\t{word}")
-        return timestamped_words
+    def new_speaker(self, change_speaker: ChangeSpeaker):
+            self.process_iter(is_last=True)
+            self.model.refresh_segment(complete=True)
+            self.model.speaker = change_speaker.speaker
+            self.global_time_offset = change_speaker.start
+    def get_buffer(self):
+        concat_buffer = Transcript.from_tokens(tokens= self.buffer, sep='')
+        return concat_buffer
    def process_iter(self, is_last=False) -> Tuple[List[ASRToken], float]:
        """
@@ -154,47 +110,14 @@ class SimulStreamingOnlineProcessor:
        Returns a tuple: (list of committed ASRToken objects, float representing the audio processed up to time).
        """
        try:
-            tokens, generation_progress = self.model.infer(is_last=is_last)
+            timestamped_words = self.model.infer(is_last=is_last)
-            ts_words = self.timestamped_text(tokens, generation_progress)
+            if timestamped_words and timestamped_words[0].detected_language is None:
                self.buffer.extend(timestamped_words)
                return [], self.end
-            new_tokens = []
+            self.committed.extend(timestamped_words)
-            for ts_word in ts_words:
+            self.buffer = []
-                
+            return timestamped_words, self.end
                start, end, word = ts_word
                token = ASRToken(
                    start=start,
                    end=end,
                    text=word,
                    probability=0.95  # fake prob. Maybe we can extract it from the model?
                ).with_offset(
                    self.global_time_offset
                )
                new_tokens.append(token)
            # identical_tokens = 0
            # n_new_tokens = len(new_tokens)
            # if n_new_tokens:
            self.committed.extend(new_tokens)
            # if token in self.committed:
            #     pos = len(self.committed) - 1 - self.committed[::-1].index(token)
            # if pos:
            #     for i in range(len(self.committed) - n_new_tokens, -1, -n_new_tokens):
            #         commited_segment = self.committed[i:i+n_new_tokens]
            #         if commited_segment == new_tokens:
            #             identical_segments +=1
            #             if identical_tokens >= TOO_MANY_REPETITIONS:
            #                 logger.warning('Too many repetition, model is stuck. Load a new one')
            #                 self.committed = self.committed[:i]
            #                 self.load_new_backend()
            #                 return [], self.end
            # pos = self.committed.rindex(token)
            return new_tokens, self.end
        except Exception as e:
@@ -224,7 +147,6 @@ class SimulStreamingASR():
    sep = ""
    def __init__(self, lan, modelsize=None, cache_dir=None, model_dir=None, logfile=sys.stderr, **kwargs):
        logger.warning(SIMULSTREAMING_LICENSE)
        self.logfile = logfile
        self.transcribe_kargs = {}
        self.original_language = lan
@@ -360,4 +282,4 @@ class SimulStreamingASR():
        """
        Warmup is done directly in load_model
        """
-        pass
+        pass
--- a/whisperlivekit/simul_whisper/simul_whisper.py
+++ b/whisperlivekit/simul_whisper/simul_whisper.py
@@ -8,6 +8,7 @@ import torch.nn.functional as F
 from .whisper import load_model, DecodingOptions, tokenizer
 from .config import AlignAttConfig
 from whisperlivekit.timed_objects import ASRToken
 from .whisper.audio import log_mel_spectrogram, TOKENS_PER_SECOND, pad_or_trim, N_SAMPLES, N_FRAMES
 from .whisper.timing import median_filter
 from .whisper.decoding import GreedyDecoder, BeamSearchDecoder, SuppressTokens, detect_language
@@ -18,6 +19,7 @@ from time import time
 from .token_buffer import TokenBuffer
 import numpy as np
 from ..timed_objects import PUNCTUATION_MARKS
 from .generation_progress import *
 DEC_PAD = 50257
@@ -40,12 +42,6 @@ else:
    except ImportError:
        HAS_FASTER_WHISPER = False
 # New features added to the original version of Simul-Whisper: 
 # - large-v3 model support
 # - translation support
 # - beam search
 # - prompt -- static vs. non-static
 # - context
 class PaddedAlignAttWhisper:
    def __init__(
            self, 
@@ -61,6 +57,8 @@ class PaddedAlignAttWhisper:
            self.model = loaded_model
        else:
            self.model = load_model(name=model_name, download_root=model_path)
        self.device = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.mlx_encoder = mlx_encoder
        self.fw_encoder = fw_encoder
@@ -68,7 +66,7 @@ class PaddedAlignAttWhisper:
            self.fw_feature_extractor = FeatureExtractor(feature_size=self.model.dims.n_mels)
        logger.info(f"Model dimensions: {self.model.dims}")
-
+        self.speaker = -1
        self.decode_options = DecodingOptions(
            language = cfg.language, 
            without_timestamps = True,
@@ -76,7 +74,10 @@ class PaddedAlignAttWhisper:
        )
        self.tokenizer_is_multilingual = not model_name.endswith(".en")
        self.create_tokenizer(cfg.language if cfg.language != "auto" else None)
        # self.create_tokenizer('en')
        self.detected_language = cfg.language if cfg.language != "auto" else None
        self.global_time_offset = 0.0
        self.reset_tokenizer_to_auto_next_call = False
        self.max_text_len = self.model.dims.n_text_ctx
        self.num_decoder_layers = len(self.model.decoder.blocks)
@@ -151,6 +152,7 @@ class PaddedAlignAttWhisper:
        self.last_attend_frame = -self.cfg.rewind_threshold
        self.cumulative_time_offset = 0.0
        self.first_timestamp = None
        if self.cfg.max_context_tokens is None:
            self.max_context_tokens = self.max_text_len
@@ -172,7 +174,6 @@ class PaddedAlignAttWhisper:
            self.token_decoder = BeamSearchDecoder(inference=self.inference, eot=self.tokenizer.eot, beam_size=cfg.beam_size)
    def remove_hooks(self):
        print('remove hook')
        for hook in self.l_hooks:
            hook.remove()
@@ -259,7 +260,6 @@ class PaddedAlignAttWhisper:
        self.init_context()
        logger.debug(f"Context: {self.context}")
        if not complete and len(self.segments) > 2:
            logger.debug("keeping the last two segments because it is not complete.")
            self.segments = self.segments[-2:]
        else:
            logger.debug("removing all segments.")
@@ -381,11 +381,11 @@ class PaddedAlignAttWhisper:
        new_segment = True
        if len(self.segments) == 0:
            logger.debug("No segments, nothing to do")
-            return [], {}
+            return []
        if not self._apply_minseglen():
            logger.debug(f"applied minseglen {self.cfg.audio_min_len} > {self.segments_len()}.")
            input_segments = torch.cat(self.segments, dim=0)
-            return [], {}
+            return []
        # input_segments is concatenation of audio, it's one array
        if len(self.segments) > 1:
@@ -393,88 +393,77 @@ class PaddedAlignAttWhisper:
        else:
            input_segments = self.segments[0]
        # if self.cfg.language == "auto" and self.reset_tokenizer_to_auto_next_call:
        #     logger.debug("Resetting tokenizer to auto for new sentence.")
        #     self.create_tokenizer(None)
        #     self.detected_language = None
        #     self.init_tokens()
        #     self.reset_tokenizer_to_auto_next_call = False
        # NEW : we can use a different encoder, before using standard whisper for cross attention with the hooks on the decoder
        beg_encode = time()
        if self.mlx_encoder:
            mlx_mel_padded = mlx_log_mel_spectrogram(audio=input_segments.detach(), n_mels=self.model.dims.n_mels, padding=N_SAMPLES)
            mlx_mel = mlx_pad_or_trim(mlx_mel_padded, N_FRAMES, axis=-2)
            mlx_encoder_feature = self.mlx_encoder.encoder(mlx_mel[None])
-            encoder_feature = torch.tensor(np.array(mlx_encoder_feature))
+            encoder_feature = torch.as_tensor(mlx_encoder_feature)
            content_mel_len = int((mlx_mel_padded.shape[0] - mlx_mel.shape[0])/2)
            device = 'cpu'
        elif self.fw_encoder:
            audio_length_seconds = len(input_segments) / 16000   
            content_mel_len = int(audio_length_seconds * 100)//2      
            mel_padded_2 = self.fw_feature_extractor(waveform=input_segments.numpy(), padding=N_SAMPLES)[None, :]
            mel = fw_pad_or_trim(mel_padded_2, N_FRAMES, axis=-1)
            encoder_feature_ctranslate = self.fw_encoder.encode(mel)
-            encoder_feature = torch.Tensor(np.array(encoder_feature_ctranslate))
+            if self.device == 'cpu': #it seems that on gpu, passing StorageView to torch.as_tensor fails and wrapping in the array works
-            device = 'cpu'
+                encoder_feature_ctranslate = np.array(encoder_feature_ctranslate)
            try:
                encoder_feature = torch.as_tensor(encoder_feature_ctranslate, device=self.device)
            except TypeError: # Normally the cpu condition should prevent having exceptions, but just in case:
                encoder_feature = torch.as_tensor(np.array(encoder_feature_ctranslate), device=self.device)
        else:
            # mel + padding to 30s
            mel_padded = log_mel_spectrogram(input_segments, n_mels=self.model.dims.n_mels, padding=N_SAMPLES, 
-                                                device=self.model.device).unsqueeze(0)
+                                                device=self.device).unsqueeze(0)
            # trim to 3000
            mel = pad_or_trim(mel_padded, N_FRAMES)
            # the len of actual audio
            content_mel_len = int((mel_padded.shape[2] - mel.shape[2])/2)
            encoder_feature = self.model.encoder(mel)
            device = mel.device
        end_encode = time()
        # print('Encoder duration:', end_encode-beg_encode)
-#        logger.debug(f"Encoder feature shape: {encoder_feature.shape}")
+        if self.cfg.language == "auto" and self.detected_language is None and self.first_timestamp:
-#        if mel.shape[-2:] != (self.model.dims.n_audio_ctx, self.model.dims.n_audio_state):
+            seconds_since_start = self.segments_len() - self.first_timestamp
-#            logger.debug("mel ")
+            if seconds_since_start >= 2.0:
-        if self.cfg.language == "auto" and self.detected_language is None:
+                language_tokens, language_probs = self.lang_id(encoder_feature) 
-            language_tokens, language_probs = self.lang_id(encoder_feature) 
+                top_lan, p = max(language_probs[0].items(), key=lambda x: x[1])
-            logger.debug(f"Language tokens: {language_tokens}, probs: {language_probs}")
+                print(f"Detected language: {top_lan} with p={p:.4f}")
-            top_lan, p = max(language_probs[0].items(), key=lambda x: x[1])
+                self.create_tokenizer(top_lan)
-            logger.info(f"Detected language: {top_lan} with p={p:.4f}")
+                self.last_attend_frame = -self.cfg.rewind_threshold
-            #self.tokenizer.language = top_lan
+                self.cumulative_time_offset = 0.0
-            #self.tokenizer.__post_init__()
+                self.init_tokens()
-            self.create_tokenizer(top_lan)
+                self.init_context()
-            self.detected_language = top_lan
+                self.detected_language = top_lan
-            self.init_tokens()
+                logger.info(f"Tokenizer language: {self.tokenizer.language}, {self.tokenizer.sot_sequence_including_notimestamps}")
            logger.info(f"Tokenizer language: {self.tokenizer.language}, {self.tokenizer.sot_sequence_including_notimestamps}")
        self.trim_context()
        current_tokens = self._current_tokens()
-#        
+   
        fire_detected = self.fire_at_boundary(encoder_feature[:, :content_mel_len, :])
-        ####################### Decoding loop
+        sum_logprobs = torch.zeros(self.cfg.beam_size, device=self.device)
        logger.info("Decoding loop starts\n")
        sum_logprobs = torch.zeros(self.cfg.beam_size, device=device)
        completed = False
        # punctuation_stop = False
        attn_of_alignment_heads = None
        most_attended_frame = None
        token_len_before_decoding = current_tokens.shape[1]
-        generation_progress = []
+        l_absolute_timestamps = []
-        generation = {
+        
            "starting_tokens": BeamTokens(current_tokens[0,:].clone(), self.cfg.beam_size),
            "token_len_before_decoding": token_len_before_decoding,
            #"fire_detected": fire_detected,
            "frames_len": content_mel_len,
            "frames_threshold": 4 if is_last else self.cfg.frame_threshold,
            # to be filled later
            "logits_starting": None,
            # to be filled later
            "no_speech_prob": None,
            "no_speech": False,
            # to be filled in the loop
            "progress": generation_progress,
        }
        while not completed and current_tokens.shape[1] < self.max_text_len: # bos is 3 tokens
            generation_progress_loop = []
            if new_segment:
                tokens_for_logits = current_tokens
@@ -483,50 +472,26 @@ class PaddedAlignAttWhisper:
                tokens_for_logits = current_tokens[:,-1:]
            logits = self.logits(tokens_for_logits, encoder_feature) # B, len(tokens), token dict size
            if new_segment:
                generation["logits_starting"] = Logits(logits[:,:,:])
            if new_segment and self.tokenizer.no_speech is not None:
                probs_at_sot = logits[:, self.sot_index, :].float().softmax(dim=-1)
                no_speech_probs = probs_at_sot[:, self.tokenizer.no_speech].tolist()
                generation["no_speech_prob"] = no_speech_probs[0]
                if no_speech_probs[0] > self.cfg.nonspeech_prob:
                    generation["no_speech"] = True
                    logger.info("no speech, stop")
                    break
            logits = logits[:, -1, :] # logits for the last token
            generation_progress_loop.append(("logits_before_suppress",Logits(logits)))
            # suppress blank tokens only at the beginning of the segment
            if new_segment:
                logits[:, self.tokenizer.encode(" ") + [self.tokenizer.eot]] = -np.inf
            new_segment = False
            self.suppress_tokens(logits)
            #generation_progress_loop.append(("logits_after_suppres",BeamLogits(logits[0,:].clone(), self.cfg.beam_size)))
            generation_progress_loop.append(("logits_after_suppress",Logits(logits)))
            current_tokens, completed = self.token_decoder.update(current_tokens, logits, sum_logprobs)
            generation_progress_loop.append(("beam_tokens",Tokens(current_tokens[:,-1].clone())))
            generation_progress_loop.append(("sum_logprobs",sum_logprobs.tolist()))
            generation_progress_loop.append(("completed",completed))
            logger.debug(f"Decoding completed: {completed}, sum_logprobs: {sum_logprobs.tolist()}, tokens: ")
            self.debug_print_tokens(current_tokens)
            # if self.decoder_type == "beam":
            #     logger.debug(f"Finished sequences: {self.token_decoder.finished_sequences}")
            #     logprobs = F.log_softmax(logits.float(), dim=-1)
            #     idx = 0
            #     logger.debug(f"Beam search topk: {logprobs[idx].topk(self.cfg.beam_size + 1)}")
            #     logger.debug(f"Greedy search argmax: {logits.argmax(dim=-1)}")
            # if completed:
            #     self.debug_print_tokens(current_tokens)
            #     logger.debug("decode stopped because decoder completed")
            attn_of_alignment_heads = [[] for _ in range(self.num_align_heads)]
            for i, attn_mat in enumerate(self.dec_attns):
                layer_rank = int(i % len(self.model.decoder.blocks))
@@ -545,30 +510,24 @@ class PaddedAlignAttWhisper:
                t = torch.cat(mat, dim=1)
                tmp.append(t) 
            attn_of_alignment_heads = torch.stack(tmp, dim=1)
 #            logger.debug(str(attn_of_alignment_heads.shape) + " tttady")
            std, mean = torch.std_mean(attn_of_alignment_heads, dim=-2, keepdim=True, unbiased=False)
            attn_of_alignment_heads = (attn_of_alignment_heads - mean) / std
            attn_of_alignment_heads = median_filter(attn_of_alignment_heads, 7) # from whisper.timing
            attn_of_alignment_heads = attn_of_alignment_heads.mean(dim=1)
 #            logger.debug(str(attn_of_alignment_heads.shape) + " po mean")
            attn_of_alignment_heads = attn_of_alignment_heads[:,:, :content_mel_len]
 #            logger.debug(str(attn_of_alignment_heads.shape) + " pak ")
            # for each beam, the most attended frame is:
            most_attended_frames = torch.argmax(attn_of_alignment_heads[:,-1,:], dim=-1)
            generation_progress_loop.append(("most_attended_frames",most_attended_frames.clone().tolist()))
            # Calculate absolute timestamps accounting for cumulative offset
            absolute_timestamps = [(frame * 0.02 + self.cumulative_time_offset) for frame in most_attended_frames.tolist()]
            generation_progress_loop.append(("absolute_timestamps", absolute_timestamps))
            logger.debug(str(most_attended_frames.tolist()) + " most att frames")
            logger.debug(f"Absolute timestamps: {absolute_timestamps} (offset: {self.cumulative_time_offset:.2f}s)")
            most_attended_frame = most_attended_frames[0].item()
            l_absolute_timestamps.append(absolute_timestamps[0])
            generation_progress.append(dict(generation_progress_loop))
            logger.debug("current tokens" + str(current_tokens.shape))
            if completed:
            #    # stripping the last token, the eot
@@ -606,66 +565,53 @@ class PaddedAlignAttWhisper:
                    self.tokenizer.decode([current_tokens[i, -1].item()])
                ))
 #        for k,v in generation.items():
 #            print(k,v,file=sys.stderr)
 #        for x in generation_progress:
 #            for y in x.items():
 #                print("\t\t",*y,file=sys.stderr)
 #            print("\t","----", file=sys.stderr)
 #        print("\t", "end of generation_progress_loop", file=sys.stderr)
        #    sys.exit(1)
        ####################### End of decoding loop
        logger.info("End of decoding loop")
        # if attn_of_alignment_heads is not None:
        #     seg_len = int(segment.shape[0] / 16000 * TOKENS_PER_SECOND)
        #     # Lets' now consider only the top hypothesis in the beam search
        #     top_beam_attn_of_alignment_heads = attn_of_alignment_heads[0]
        #     # debug print: how is the new token attended?
        #     new_token_attn = top_beam_attn_of_alignment_heads[token_len_before_decoding:, -seg_len:]
        #     logger.debug(f"New token attention shape: {new_token_attn.shape}")
        #     if new_token_attn.shape[0] == 0:  # it's not attended in the current audio segment
        #         logger.debug("no token generated")
        #     else:  # it is, and the max attention is:
        #         new_token_max_attn, _ = new_token_attn.max(dim=-1)
        #         logger.debug(f"segment max attention: {new_token_max_attn.mean().item()/len(self.segments)}")
        # let's now operate only with the top beam hypothesis
        tokens_to_split = current_tokens[0, token_len_before_decoding:]
-        if fire_detected or is_last:
+
        if fire_detected or is_last: #or punctuation_stop:
            new_hypothesis = tokens_to_split.flatten().tolist()
            split_words, split_tokens = self.tokenizer.split_to_word_tokens(new_hypothesis)
        else:
            # going to truncate the tokens after the last space
            split_words, split_tokens = self.tokenizer.split_to_word_tokens(tokens_to_split.tolist())
            generation["result"] = {"split_words": split_words[:-1], "split_tokens": split_tokens[:-1]}
            generation["result_truncated"] = {"split_words": split_words[-1:], "split_tokens": split_tokens[-1:]}
 #            text_to_split = self.tokenizer.decode(tokens_to_split)
 #            logger.debug(f"text_to_split: {text_to_split}")
 #            logger.debug("text at current step: {}".format(text_to_split.replace(" ", "<space>")))
 #            text_before_space = " ".join(text_to_split.split(" ")[:-1])
 #            logger.debug("before the last space: {}".format(text_before_space.replace(" ", "<space>")))
            if len(split_words) > 1:
                new_hypothesis = [i for sublist in split_tokens[:-1] for i in sublist]  
            else:
                new_hypothesis = []
        ### new hypothesis
        logger.debug(f"new_hypothesis: {new_hypothesis}")
        new_tokens = torch.tensor([new_hypothesis], dtype=torch.long).repeat_interleave(self.cfg.beam_size, dim=0).to(
-            device=self.model.device,
+            device=self.device,
        )
        self.tokens.append(new_tokens)
        # TODO: test if this is redundant or not
 #        ret = ret[ret<DEC_PAD]
        logger.info(f"Output: {self.tokenizer.decode(new_hypothesis)}")
        self._clean_cache()
-        return new_hypothesis, generation
+        if len(l_absolute_timestamps) >=2 and self.first_timestamp is None:
            self.first_timestamp = l_absolute_timestamps[0]
        timestamped_words = []
        timestamp_idx = 0
        for word, word_tokens in zip(split_words, split_tokens):
            try:
                current_timestamp = l_absolute_timestamps[timestamp_idx]
            except IndexError:
                # fewer timestamps than tokens: keep the previous word's timestamp
                pass
            timestamp_idx += len(word_tokens)
            timestamp_entry = ASRToken(
                    start=current_timestamp,
                    end=current_timestamp + 0.1,
                    text= word,
                    probability=0.95,
                    speaker=self.speaker,
                    detected_language=self.detected_language
                ).with_offset(
                    self.global_time_offset
            )
            timestamped_words.append(timestamp_entry)
        return timestamped_words
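Review note: the loop that closes `infer` gives each word the timestamp of its first token and reuses the previous value when the timestamp list runs out. A minimal standalone sketch of that mapping follows, under stated assumptions: `assign_word_timestamps` and `WordTiming` are hypothetical names, plain tuples stand in for `ASRToken`, and the `0.0` fallback for an empty timestamp list is an assumption of this sketch (in the diff, `current_timestamp` would be unbound in that case).

```python
from typing import List, Sequence, Tuple

# (start, end, word); a stand-in for ASRToken in this sketch
WordTiming = Tuple[float, float, str]


def assign_word_timestamps(
    split_words: Sequence[str],
    split_tokens: Sequence[Sequence[int]],
    token_timestamps: Sequence[float],
    word_duration: float = 0.1,
) -> List[WordTiming]:
    """Give each word the timestamp of its first token."""
    timed_words: List[WordTiming] = []
    token_idx = 0
    # Assumed fallback when no timestamp was recorded at all.
    current = token_timestamps[0] if token_timestamps else 0.0
    for word, word_tokens in zip(split_words, split_tokens):
        if token_idx < len(token_timestamps):
            current = token_timestamps[token_idx]
        # Otherwise keep the previous word's timestamp.
        token_idx += len(word_tokens)
        timed_words.append((current, current + word_duration, word))
    return timed_words


if __name__ == "__main__":
    for entry in assign_word_timestamps(
        [" Hello", " world", "!"], [[11, 12], [13], [14]], [0.50, 0.62, 0.90]
    ):
        print(entry)
```

Here the third word has no timestamp of its own, so it inherits the one used for `" world"`. Initialising the fallback before the loop avoids relying on an exception handler for control flow.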
--- a/whisperlivekit/timed_objects.py
+++ b/whisperlivekit/timed_objects.py
@@ -1,20 +1,51 @@
-from dataclasses import dataclass
+from dataclasses import dataclass, field
-from typing import Optional
+from typing import Optional, Any, List
 from datetime import timedelta
 PUNCTUATION_MARKS = {'.', '!', '?', '。', '！', '？'}
 def format_time(seconds: float) -> str:
    """Format seconds as HH:MM:SS."""
    return str(timedelta(seconds=int(seconds)))
@dataclass
 class TimedText:
-    start: Optional[float]
+    start: Optional[float] = 0
-    end: Optional[float]
+    end: Optional[float] = 0
    text: Optional[str] = ''
    speaker: Optional[int] = -1
    probability: Optional[float] = None
    is_dummy: Optional[bool] = False
    detected_language: Optional[str] = None
    def is_punctuation(self):
        return self.text.strip() in PUNCTUATION_MARKS
    def overlaps_with(self, other: 'TimedText') -> bool:
        return not (self.end <= other.start or other.end <= self.start)
    def is_within(self, other: 'TimedText') -> bool:
        return other.contains_timespan(self)
-@dataclass
+    def duration(self) -> float:
        return self.end - self.start
    def contains_time(self, time: float) -> bool:
        return self.start <= time <= self.end
    def contains_timespan(self, other: 'TimedText') -> bool:
        return self.start <= other.start and self.end >= other.end
    def __bool__(self):
        return bool(self.text)
@dataclass()
 class ASRToken(TimedText):
    def with_offset(self, offset: float) -> "ASRToken":
        """Return a new token with the time offset added."""
-        return ASRToken(self.start + offset, self.end + offset, self.text, self.speaker, self.probability)
+        return ASRToken(self.start + offset, self.end + offset, self.text, self.speaker, self.probability, detected_language=self.detected_language)
@dataclass
 class Sentence(TimedText):
@@ -22,7 +53,28 @@ class Sentence(TimedText):
@dataclass
 class Transcript(TimedText):
-    pass
+    """
    Represents a concatenation of several ASRToken objects.
    """
    @classmethod
    def from_tokens(
        cls,
        tokens: List[ASRToken],
        sep: Optional[str] = None,
        offset: float = 0
    ) -> "Transcript":
        sep = sep if sep is not None else ' '
        text = sep.join(token.text for token in tokens)
        probability = sum(token.probability for token in tokens if token.probability) / len(tokens) if tokens else None
        if tokens:
            start = offset + tokens[0].start
            end = offset + tokens[-1].end
        else:
            start = None
            end = None
        return cls(start, end, text, probability=probability)
@dataclass
 class SpeakerSegment(TimedText):
@@ -31,6 +83,95 @@ class SpeakerSegment(TimedText):
    """
    pass
@dataclass
 class Translation(TimedText):
    pass
    def approximate_cut_at(self, cut_time):
        """
        Each word in text is considered to be of duration (end-start)/len(words in text)
        """
        if not self.text or not self.contains_time(cut_time):
            return self, None
        words = self.text.split()
        num_words = len(words)
        if num_words == 0:
            return self, None
        duration_per_word = self.duration() / num_words
        cut_word_index = int((cut_time - self.start) / duration_per_word)
        if cut_word_index >= num_words:
            cut_word_index = num_words - 1
        text0 = " ".join(words[:cut_word_index])
        text1 = " ".join(words[cut_word_index:])
        segment0 = Translation(start=self.start, end=cut_time, text=text0)
        segment1 = Translation(start=cut_time, end=self.end, text=text1)
        return segment0, segment1
@dataclass
 class Silence():
-    duration: float
+    duration: float
@dataclass
 class Line(TimedText):
    translation: str = ''
    def to_dict(self):
        _dict = {
            'speaker': int(self.speaker),
            'text': self.text,
            'start': format_time(self.start),
            'end': format_time(self.end),
        }
        if self.translation:
            _dict['translation'] = self.translation
        if self.detected_language:
            _dict['detected_language'] = self.detected_language
        return _dict
@dataclass  
 class FrontData():
    status: str = ''
    error: str = ''
    lines: list[Line] = field(default_factory=list)
    buffer_transcription: str = ''
    buffer_diarization: str = ''
    remaining_time_transcription: float = 0.
    remaining_time_diarization: float = 0.
    def to_dict(self):
        _dict = {
            'status': self.status,
            'lines': [line.to_dict() for line in self.lines],
            'buffer_transcription': self.buffer_transcription,
            'buffer_diarization': self.buffer_diarization,
            'remaining_time_transcription': self.remaining_time_transcription,
            'remaining_time_diarization': self.remaining_time_diarization,
        }
        if self.error:
            _dict['error'] = self.error
        return _dict
@dataclass  
 class ChangeSpeaker:
    speaker: int
    start: int
@dataclass  
 class State():
    tokens: list
    translated_segments: list
    buffer_transcription: str
    end_buffer: float
    end_attributed_speaker: float
    remaining_time_transcription: float
    remaining_time_diarization: float
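Review note: `Translation.approximate_cut_at` splits a segment at a time by assuming every word lasts equally long. The sketch below restates that arithmetic as a standalone function so it can be checked in isolation. It is an illustration, not the code in this diff: `approximate_cut` and `Span` are hypothetical names, plain tuples replace the `Translation` dataclass, and the zero-duration guard is an addition of this sketch (the diffed method would divide by zero when `start == end == cut_time`).

```python
from typing import Optional, Tuple

# (start, end, text); a stand-in for Translation in this sketch
Span = Tuple[float, float, str]


def approximate_cut(
    text: str, start: float, end: float, cut_time: float
) -> Tuple[Span, Optional[Span]]:
    """Split a span at cut_time, treating all words as equally long."""
    words = text.split()
    duration = end - start
    # Nothing to cut: no words, empty span, or cut point outside the span.
    if not words or duration <= 0 or not (start <= cut_time <= end):
        return (start, end, text), None
    duration_per_word = duration / len(words)
    # Index of the first word that belongs to the second half, clamped
    # so that the second half always keeps at least one word.
    cut_index = min(int((cut_time - start) / duration_per_word), len(words) - 1)
    left = (start, cut_time, " ".join(words[:cut_index]))
    right = (cut_time, end, " ".join(words[cut_index:]))
    return left, right


if __name__ == "__main__":
    print(approximate_cut("a b c d", 0.0, 4.0, 2.0))
```

With four words over four seconds, each word is assumed to last one second, so a cut at 2.0 s puts two words on each side.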
--- a/whisperlivekit/translation/init.py
+++ b/whisperlivekit/translation/init.py
--- a/whisperlivekit/translation/mapping_languages.py
+++ b/whisperlivekit/translation/mapping_languages.py
@@ -0,0 +1,182 @@
 """
 adapted from https://store.crowdin.com/custom-mt
 """
 LANGUAGES = [
    {"name": "Afrikaans", "nllb": "afr_Latn", "crowdin": "af"},
    {"name": "Akan", "nllb": "aka_Latn", "crowdin": "ak"},
    {"name": "Amharic", "nllb": "amh_Ethi", "crowdin": "am"},
    {"name": "Assamese", "nllb": "asm_Beng", "crowdin": "as"},
    {"name": "Asturian", "nllb": "ast_Latn", "crowdin": "ast"},
    {"name": "Bashkir", "nllb": "bak_Cyrl", "crowdin": "ba"},
    {"name": "Bambara", "nllb": "bam_Latn", "crowdin": "bm"},
    {"name": "Balinese", "nllb": "ban_Latn", "crowdin": "ban"},
    {"name": "Belarusian", "nllb": "bel_Cyrl", "crowdin": "be"},
    {"name": "Bengali", "nllb": "ben_Beng", "crowdin": "bn"},
    {"name": "Bosnian", "nllb": "bos_Latn", "crowdin": "bs"},
    {"name": "Bulgarian", "nllb": "bul_Cyrl", "crowdin": "bg"},
    {"name": "Catalan", "nllb": "cat_Latn", "crowdin": "ca"},
    {"name": "Cebuano", "nllb": "ceb_Latn", "crowdin": "ceb"},
    {"name": "Czech", "nllb": "ces_Latn", "crowdin": "cs"},
    {"name": "Welsh", "nllb": "cym_Latn", "crowdin": "cy"},
    {"name": "Danish", "nllb": "dan_Latn", "crowdin": "da"},
    {"name": "German", "nllb": "deu_Latn", "crowdin": "de"},
    {"name": "Dzongkha", "nllb": "dzo_Tibt", "crowdin": "dz"},
    {"name": "Greek", "nllb": "ell_Grek", "crowdin": "el"},
    {"name": "English", "nllb": "eng_Latn", "crowdin": "en"},
    {"name": "Esperanto", "nllb": "epo_Latn", "crowdin": "eo"},
    {"name": "Estonian", "nllb": "est_Latn", "crowdin": "et"},
    {"name": "Basque", "nllb": "eus_Latn", "crowdin": "eu"},
    {"name": "Ewe", "nllb": "ewe_Latn", "crowdin": "ee"},
    {"name": "Faroese", "nllb": "fao_Latn", "crowdin": "fo"},
    {"name": "Fijian", "nllb": "fij_Latn", "crowdin": "fj"},
    {"name": "Finnish", "nllb": "fin_Latn", "crowdin": "fi"},
    {"name": "French", "nllb": "fra_Latn", "crowdin": "fr"},
    {"name": "Friulian", "nllb": "fur_Latn", "crowdin": "fur-IT"},
    {"name": "Scottish Gaelic", "nllb": "gla_Latn", "crowdin": "gd"},
    {"name": "Irish", "nllb": "gle_Latn", "crowdin": "ga-IE"},
    {"name": "Galician", "nllb": "glg_Latn", "crowdin": "gl"},
    {"name": "Guarani", "nllb": "grn_Latn", "crowdin": "gn"},
    {"name": "Gujarati", "nllb": "guj_Gujr", "crowdin": "gu-IN"},
    {"name": "Haitian Creole", "nllb": "hat_Latn", "crowdin": "ht"},
    {"name": "Hausa", "nllb": "hau_Latn", "crowdin": "ha"},
    {"name": "Hebrew", "nllb": "heb_Hebr", "crowdin": "he"},
    {"name": "Hindi", "nllb": "hin_Deva", "crowdin": "hi"},
    {"name": "Croatian", "nllb": "hrv_Latn", "crowdin": "hr"},
    {"name": "Hungarian", "nllb": "hun_Latn", "crowdin": "hu"},
    {"name": "Armenian", "nllb": "hye_Armn", "crowdin": "hy-AM"},
    {"name": "Igbo", "nllb": "ibo_Latn", "crowdin": "ig"},
    {"name": "Indonesian", "nllb": "ind_Latn", "crowdin": "id"},
    {"name": "Icelandic", "nllb": "isl_Latn", "crowdin": "is"},
    {"name": "Italian", "nllb": "ita_Latn", "crowdin": "it"},
    {"name": "Javanese", "nllb": "jav_Latn", "crowdin": "jv"},
    {"name": "Japanese", "nllb": "jpn_Jpan", "crowdin": "ja"},
    {"name": "Kabyle", "nllb": "kab_Latn", "crowdin": "kab"},
    {"name": "Kannada", "nllb": "kan_Knda", "crowdin": "kn"},
    {"name": "Georgian", "nllb": "kat_Geor", "crowdin": "ka"},
    {"name": "Kazakh", "nllb": "kaz_Cyrl", "crowdin": "kk"},
    {"name": "Khmer", "nllb": "khm_Khmr", "crowdin": "km"},
    {"name": "Kinyarwanda", "nllb": "kin_Latn", "crowdin": "rw"},
    {"name": "Kyrgyz", "nllb": "kir_Cyrl", "crowdin": "ky"},
    {"name": "Korean", "nllb": "kor_Hang", "crowdin": "ko"},
    {"name": "Lao", "nllb": "lao_Laoo", "crowdin": "lo"},
    {"name": "Ligurian", "nllb": "lij_Latn", "crowdin": "lij"},
    {"name": "Limburgish", "nllb": "lim_Latn", "crowdin": "li"},
    {"name": "Lingala", "nllb": "lin_Latn", "crowdin": "ln"},
    {"name": "Lithuanian", "nllb": "lit_Latn", "crowdin": "lt"},
    {"name": "Luxembourgish", "nllb": "ltz_Latn", "crowdin": "lb"},
    {"name": "Maithili", "nllb": "mai_Deva", "crowdin": "mai"},
    {"name": "Malayalam", "nllb": "mal_Mlym", "crowdin": "ml-IN"},
    {"name": "Marathi", "nllb": "mar_Deva", "crowdin": "mr"},
    {"name": "Macedonian", "nllb": "mkd_Cyrl", "crowdin": "mk"},
    {"name": "Maltese", "nllb": "mlt_Latn", "crowdin": "mt"},
    {"name": "Mossi", "nllb": "mos_Latn", "crowdin": "mos"},
    {"name": "Maori", "nllb": "mri_Latn", "crowdin": "mi"},
    {"name": "Burmese", "nllb": "mya_Mymr", "crowdin": "my"},
    {"name": "Dutch", "nllb": "nld_Latn", "crowdin": "nl"},
    {"name": "Norwegian Nynorsk", "nllb": "nno_Latn", "crowdin": "nn-NO"},
    {"name": "Nepali", "nllb": "npi_Deva", "crowdin": "ne-NP"},
    {"name": "Northern Sotho", "nllb": "nso_Latn", "crowdin": "nso"},
    {"name": "Occitan", "nllb": "oci_Latn", "crowdin": "oc"},
    {"name": "Odia", "nllb": "ory_Orya", "crowdin": "or"},
    {"name": "Papiamento", "nllb": "pap_Latn", "crowdin": "pap"},
    {"name": "Polish", "nllb": "pol_Latn", "crowdin": "pl"},
    {"name": "Portuguese", "nllb": "por_Latn", "crowdin": "pt-PT"},
    {"name": "Dari", "nllb": "prs_Arab", "crowdin": "fa-AF"},
    {"name": "Romanian", "nllb": "ron_Latn", "crowdin": "ro"},
    {"name": "Rundi", "nllb": "run_Latn", "crowdin": "rn"},
    {"name": "Russian", "nllb": "rus_Cyrl", "crowdin": "ru"},
    {"name": "Sango", "nllb": "sag_Latn", "crowdin": "sg"},
    {"name": "Sanskrit", "nllb": "san_Deva", "crowdin": "sa"},
    {"name": "Santali", "nllb": "sat_Olck", "crowdin": "sat"},
    {"name": "Sinhala", "nllb": "sin_Sinh", "crowdin": "si-LK"},
    {"name": "Slovak", "nllb": "slk_Latn", "crowdin": "sk"},
    {"name": "Slovenian", "nllb": "slv_Latn", "crowdin": "sl"},
    {"name": "Shona", "nllb": "sna_Latn", "crowdin": "sn"},
    {"name": "Sindhi", "nllb": "snd_Arab", "crowdin": "sd"},
    {"name": "Somali", "nllb": "som_Latn", "crowdin": "so"},
    {"name": "Southern Sotho", "nllb": "sot_Latn", "crowdin": "st"},
    {"name": "Spanish", "nllb": "spa_Latn", "crowdin": "es-ES"},
    {"name": "Sardinian", "nllb": "srd_Latn", "crowdin": "sc"},
    {"name": "Swati", "nllb": "ssw_Latn", "crowdin": "ss"},
    {"name": "Sundanese", "nllb": "sun_Latn", "crowdin": "su"},
    {"name": "Swedish", "nllb": "swe_Latn", "crowdin": "sv-SE"},
    {"name": "Swahili", "nllb": "swh_Latn", "crowdin": "sw"},
    {"name": "Tamil", "nllb": "tam_Taml", "crowdin": "ta"},
    {"name": "Tatar", "nllb": "tat_Cyrl", "crowdin": "tt-RU"},
    {"name": "Telugu", "nllb": "tel_Telu", "crowdin": "te"},
    {"name": "Tajik", "nllb": "tgk_Cyrl", "crowdin": "tg"},
    {"name": "Tagalog", "nllb": "tgl_Latn", "crowdin": "tl"},
    {"name": "Thai", "nllb": "tha_Thai", "crowdin": "th"},
    {"name": "Tigrinya", "nllb": "tir_Ethi", "crowdin": "ti"},
    {"name": "Tswana", "nllb": "tsn_Latn", "crowdin": "tn"},
    {"name": "Tsonga", "nllb": "tso_Latn", "crowdin": "ts"},
    {"name": "Turkmen", "nllb": "tuk_Latn", "crowdin": "tk"},
    {"name": "Turkish", "nllb": "tur_Latn", "crowdin": "tr"},
    {"name": "Uyghur", "nllb": "uig_Arab", "crowdin": "ug"},
    {"name": "Ukrainian", "nllb": "ukr_Cyrl", "crowdin": "uk"},
    {"name": "Venetian", "nllb": "vec_Latn", "crowdin": "vec"},
    {"name": "Vietnamese", "nllb": "vie_Latn", "crowdin": "vi"},
    {"name": "Wolof", "nllb": "wol_Latn", "crowdin": "wo"},
    {"name": "Xhosa", "nllb": "xho_Latn", "crowdin": "xh"},
    {"name": "Yoruba", "nllb": "yor_Latn", "crowdin": "yo"},
    {"name": "Zulu", "nllb": "zul_Latn", "crowdin": "zu"},
 ]
 NAME_TO_NLLB = {lang["name"]: lang["nllb"] for lang in LANGUAGES}
 NAME_TO_CROWDIN = {lang["name"]: lang["crowdin"] for lang in LANGUAGES}
 CROWDIN_TO_NLLB = {lang["crowdin"]: lang["nllb"] for lang in LANGUAGES}
 NLLB_TO_CROWDIN = {lang["nllb"]: lang["crowdin"] for lang in LANGUAGES}
 CROWDIN_TO_NAME = {lang["crowdin"]: lang["name"] for lang in LANGUAGES}
 NLLB_TO_NAME = {lang["nllb"]: lang["name"] for lang in LANGUAGES}
 def get_nllb_code(crowdin_code):
    return CROWDIN_TO_NLLB.get(crowdin_code, None)
 def get_crowdin_code(nllb_code):
    return NLLB_TO_CROWDIN.get(nllb_code)
 def get_language_name_by_crowdin(crowdin_code):
    return CROWDIN_TO_NAME.get(crowdin_code)
 def get_language_name_by_nllb(nllb_code):
    return NLLB_TO_NAME.get(nllb_code)
 def get_language_info(identifier, identifier_type="auto"):
    if identifier_type == "auto":
        for lang in LANGUAGES:
            if (lang["name"].lower() == identifier.lower() or 
                lang["nllb"] == identifier or 
                lang["crowdin"] == identifier):
                return lang
    elif identifier_type == "name":
        for lang in LANGUAGES:
            if lang["name"].lower() == identifier.lower():
                return lang
    elif identifier_type == "nllb":
        for lang in LANGUAGES:
            if lang["nllb"] == identifier:
                return lang
    elif identifier_type == "crowdin":
        for lang in LANGUAGES:
            if lang["crowdin"] == identifier:
                return lang
    return None
 def list_all_languages():
    return [lang["name"] for lang in LANGUAGES]
 def list_all_nllb_codes():
    return [lang["nllb"] for lang in LANGUAGES]
 def list_all_crowdin_codes():
    return [lang["crowdin"] for lang in LANGUAGES]
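
Reviewer note: a minimal sketch of how the lookup tables above behave, assuming a three-entry subset of `LANGUAGES` (the entries are copied from the list; the trimmed `get_language_info` keeps only the "auto" branch). It illustrates that Crowdin lookups are exact-match, so a bare `es` does not resolve because the table stores `es-ES`.

```python
# Illustrative subset of the LANGUAGES table; the three entries are copied from the list above.
LANGUAGES = [
    {"name": "Italian", "nllb": "ita_Latn", "crowdin": "it"},
    {"name": "Japanese", "nllb": "jpn_Jpan", "crowdin": "ja"},
    {"name": "Spanish", "nllb": "spa_Latn", "crowdin": "es-ES"},
]

CROWDIN_TO_NLLB = {lang["crowdin"]: lang["nllb"] for lang in LANGUAGES}


def get_nllb_code(crowdin_code):
    """Return the NLLB code for a Crowdin code, or None if unknown."""
    return CROWDIN_TO_NLLB.get(crowdin_code)


def get_language_info(identifier):
    """'auto' lookup: match by case-insensitive name, or by exact NLLB/Crowdin code."""
    for lang in LANGUAGES:
        if (lang["name"].lower() == identifier.lower()
                or lang["nllb"] == identifier
                or lang["crowdin"] == identifier):
            return lang
    return None


print(get_nllb_code("ja"))                      # jpn_Jpan
print(get_nllb_code("es"))                      # None: the table stores "es-ES", not "es"
print(get_language_info("ITALIAN")["crowdin"])  # it
```

Callers that receive short codes from a language detector may therefore need a normalization step before calling `get_nllb_code`; whether the project does this elsewhere is not visible in this diff.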
--- a/whisperlivekit/translation/translation.py
+++ b/whisperlivekit/translation/translation.py
@@ -0,0 +1,169 @@
 import logging
 import time
 import ctranslate2
 import torch
 import transformers
 from dataclasses import dataclass, field
 import huggingface_hub
 from whisperlivekit.translation.mapping_languages import get_nllb_code
 from whisperlivekit.timed_objects import Translation
 logger = logging.getLogger(__name__)
 # In the diarization case, we may want to translate just one speaker, or at least start the sentences there.
 # After a silence of this many seconds, the model should no longer use the buffer, even if the previous
 # sentence is not finished.
 MIN_SILENCE_DURATION_DEL_BUFFER = 3
@dataclass
 class TranslationModel():
    translator: ctranslate2.Translator
    device: str
    tokenizer: dict = field(default_factory=dict)
    backend_type: str = 'ctranslate2'
    model_size: str = '600M'
    def get_tokenizer(self, input_lang):
        if not self.tokenizer.get(input_lang, False):
            self.tokenizer[input_lang] = transformers.AutoTokenizer.from_pretrained(
                f"facebook/nllb-200-distilled-{self.model_size}",
                src_lang=input_lang,
                clean_up_tokenization_spaces=True
            )
        return self.tokenizer[input_lang]
 def load_model(src_langs, backend='ctranslate2', model_size='600M'):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    MODEL = f'nllb-200-distilled-{model_size}-ctranslate2'
    if backend=='ctranslate2':
        MODEL_GUY = 'entai2965'
        huggingface_hub.snapshot_download(MODEL_GUY + '/' + MODEL,local_dir=MODEL)
        translator = ctranslate2.Translator(MODEL,device=device)
    elif backend=='transformers':
        translator = transformers.AutoModelForSeq2SeqLM.from_pretrained(f"facebook/nllb-200-distilled-{model_size}")
    tokenizer = dict()
    for src_lang in src_langs:
        if src_lang != 'auto':
            tokenizer[src_lang] = transformers.AutoTokenizer.from_pretrained(MODEL, src_lang=src_lang, clean_up_tokenization_spaces=True)
    translation_model = TranslationModel(
        translator=translator,
        tokenizer=tokenizer,
        backend_type=backend,
        device = device,
        model_size = model_size
    )
    for src_lang in src_langs:
        if src_lang != 'auto':
            translation_model.get_tokenizer(src_lang)
    return translation_model
 class OnlineTranslation:
    def __init__(self, translation_model: TranslationModel, input_languages: list, output_languages: list):
        self.buffer = []
        self.len_processed_buffer = 0
        self.translation_remaining = Translation()
        self.validated = []
        self.translation_pending_validation = ''
        self.translation_model = translation_model
        self.input_languages = input_languages
        self.output_languages = output_languages
    def compute_common_prefix(self, results):
        # We don't want to prune the result for the moment.
        if not self.buffer:
            self.buffer = results
        else:
            for i in range(min(len(self.buffer), len(results))):
                if self.buffer[i] != results[i]:
                    self.commited.extend(self.buffer[:i])
                    self.buffer = results[i:]
    def translate(self, input, input_lang, output_lang):
        if not input:
            return ""
        nllb_output_lang = get_nllb_code(output_lang)
        tokenizer = self.translation_model.get_tokenizer(input_lang)
        tokenizer_output = tokenizer(input, return_tensors="pt").to(self.translation_model.device)
        if self.translation_model.backend_type == 'ctranslate2':
            source = tokenizer.convert_ids_to_tokens(tokenizer_output['input_ids'][0])    
            results = self.translation_model.translator.translate_batch([source], target_prefix=[[nllb_output_lang]])
            target = results[0].hypotheses[0][1:]
            result = tokenizer.decode(tokenizer.convert_tokens_to_ids(target))
        else:
            translated_tokens = self.translation_model.translator.generate(**tokenizer_output, forced_bos_token_id=tokenizer.convert_tokens_to_ids(nllb_output_lang))
            result = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
        return result
    def translate_tokens(self, tokens):
        if tokens:
            text = ' '.join([token.text for token in tokens])
            start = tokens[0].start
            end = tokens[-1].end
            if self.input_languages[0] == 'auto':
                input_lang = tokens[0].detected_language
            else:
                input_lang = self.input_languages[0]
            translated_text = self.translate(text,
                                            input_lang,
                                            self.output_languages[0]
                                            )
            translation = Translation(
                text=translated_text,
                start=start,
                end=end,
            )
            return translation
        return None
    def insert_tokens(self, tokens):
        self.buffer.extend(tokens)
        pass
    def process(self):
        i = 0
        if len(self.buffer) < self.len_processed_buffer + 3: #nothing new to process
            return self.validated + [self.translation_remaining]
        while i < len(self.buffer):
            if self.buffer[i].is_punctuation():
                translation_sentence = self.translate_tokens(self.buffer[:i+1])
                self.validated.append(translation_sentence)
                self.buffer = self.buffer[i+1:]
                i = 0
            else:
                i+=1
        self.translation_remaining = self.translate_tokens(self.buffer)
        self.len_processed_buffer = len(self.buffer)
        return self.validated + [self.translation_remaining]
    def insert_silence(self, silence_duration: float):
        if silence_duration >= MIN_SILENCE_DURATION_DEL_BUFFER:
            self.buffer = []
            self.validated += [self.translation_remaining]
 if __name__ == '__main__':
    output_lang = 'fr'
    input_lang = "en"
    test_string = """
    Transcription technology has improved so much in the past few years. Have you noticed how accurate real-time speech-to-text is now?
    """
    test = test_string.split(' ')
    step = len(test) // 3
    shared_model = load_model([input_lang], backend='ctranslate2')
    online_translation = OnlineTranslation(shared_model, input_languages=[input_lang], output_languages=[output_lang])
    beg_inference = time.time()    
    for id in range(5):
        val = test[id*step : (id+1)*step]
        val_str = ' '.join(val)
        result = online_translation.translate(val_str, input_lang, output_lang)
        print(result)
    print('inference time:', time.time() - beg_inference)
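
Reviewer note: the sentence-splitting idea in `OnlineTranslation.process` can be sketched in isolation as below. This is a simplified illustration, not the project's implementation: `Token`, `PUNCTUATION`, `split_on_punctuation`, and `fake_translate` are hypothetical stand-ins, and the real `is_punctuation()` may use a different rule.

```python
from dataclasses import dataclass

PUNCTUATION = {".", "!", "?"}  # assumption: the real is_punctuation() may use a different set


@dataclass
class Token:
    """Hypothetical stand-in for the project's timed token."""
    text: str
    start: float
    end: float

    def is_punctuation(self):
        return self.text.strip() in PUNCTUATION


def split_on_punctuation(buffer):
    """Return (complete_sentences, remaining_tokens), cutting after each punctuation token."""
    sentences, current = [], []
    for token in buffer:
        current.append(token)
        if token.is_punctuation():
            sentences.append(current)
            current = []
    return sentences, current


def fake_translate(tokens):
    """Placeholder for the NLLB call: upper-cases the joined text."""
    return " ".join(t.text for t in tokens).upper()


words = ["hello", "world", ".", "how", "are"]
buffer = [Token(w, float(i), float(i + 1)) for i, w in enumerate(words)]
sentences, remaining = split_on_punctuation(buffer)
validated = [fake_translate(s) for s in sentences]
print(validated)                  # ['HELLO WORLD .']
print(fake_translate(remaining))  # HOW ARE
```

Completed sentences are translated once and kept, while the unfinished tail is retranslated on each call, which matches the split between `validated` and `translation_remaining` in the class above.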
--- a/whisperlivekit/warmup.py
+++ b/whisperlivekit/warmup.py
@@ -6,57 +6,46 @@ logger = logging.getLogger(__name__)
 def load_file(warmup_file=None, timeout=5):
    import os
    import tempfile
    import urllib.request
    import librosa
-        
+
    if warmup_file == "":
        logger.info(f"Skipping warmup.")
        return None
    # Download JFK sample if not already present
    if warmup_file is None:
        # Download JFK sample if not already present
        jfk_url = "https://github.com/ggerganov/whisper.cpp/raw/master/samples/jfk.wav"
        temp_dir = tempfile.gettempdir()
        warmup_file = os.path.join(temp_dir, "whisper_warmup_jfk.wav")
-        
+        if not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
        if not os.path.exists(warmup_file):
            logger.debug(f"Downloading warmup file from {jfk_url}")
            print(f"Downloading warmup file from {jfk_url}")
            import time
            import urllib.request
            import urllib.error
            import socket
            original_timeout = socket.getdefaulttimeout()
            socket.setdefaulttimeout(timeout)
            start_time = time.time()
            try:
-                urllib.request.urlretrieve(jfk_url, warmup_file)
+                logger.debug(f"Downloading warmup file from {jfk_url}")
-                logger.debug(f"Download successful in {time.time() - start_time:.2f}s")
+                with urllib.request.urlopen(jfk_url, timeout=timeout) as r, open(warmup_file, "wb") as f:
-            except (urllib.error.URLError, socket.timeout) as e:
+                    f.write(r.read())
-                logger.warning(f"Download failed: {e}. Proceeding without warmup.")
+            except Exception as e:
                logger.warning(f"Warmup file download failed: {e}.")
                return None
-            finally:
+
-                socket.setdefaulttimeout(original_timeout)
+    # Validate file and load
-    elif not warmup_file:
+    if not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
-        return None 
+        logger.warning(f"Warmup file {warmup_file} is invalid or missing.")
    if not warmup_file or not os.path.exists(warmup_file) or os.path.getsize(warmup_file) == 0:
        logger.warning(f"Warmup file {warmup_file} invalid or missing.")
        return None
-    
+
    try:
-        audio, sr = librosa.load(warmup_file, sr=16000)
+        audio, _ = librosa.load(warmup_file, sr=16000)
        return audio
    except Exception as e:
-        logger.warning(f"Failed to load audio file: {e}")
+        logger.warning(f"Failed to load warmup file: {e}")
        return None
    return audio
 def warmup_asr(asr, warmup_file=None, timeout=5):
    """
    Warmup the ASR model by transcribing a short audio file.
    """
-    audio = load_file(warmup_file=None, timeout=5)
+    audio = load_file(warmup_file=warmup_file, timeout=timeout)
    if audio is None:
        logger.warning("Warmup file unavailable. Skipping ASR warmup.")
        return
    asr.transcribe(audio)
-    logger.info("ASR model is warmed up")
+    logger.info("ASR model is warmed up.")
 def warmup_online(online, warmup_file=None, timeout=5):
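    audio = load_file(warmup_file=warmup_file, timeout=timeout)
    if audio is None:
        logger.warning("Warmup file unavailable. Skipping online warmup.")
        return
    online.warmup(audio)
    logger.info("ASR is warmed up.")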
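
Reviewer note: the "exists and is non-empty" check that `load_file` now applies before loading can be exercised without network access. The helper name `is_usable_file` is hypothetical and only mirrors the condition used in the diff; the bytes written are placeholders, not a valid WAV file.

```python
import os
import tempfile


def is_usable_file(path):
    """True when path is set, exists on disk, and is not empty."""
    return bool(path) and os.path.exists(path) and os.path.getsize(path) > 0


with tempfile.TemporaryDirectory() as tmp:
    empty = os.path.join(tmp, "empty.wav")
    filled = os.path.join(tmp, "filled.wav")
    open(empty, "wb").close()          # zero-byte file, e.g. an interrupted download
    with open(filled, "wb") as f:
        f.write(b"RIFF")               # placeholder bytes, not a valid WAV file
    results = (is_usable_file(None), is_usable_file(empty), is_usable_file(filled))
    print(results)                     # (False, False, True)
```

A size check only catches empty files; a truncated but non-empty download would still reach `librosa.load`, where the existing `try`/`except` handles the failure.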
--- a/whisperlivekit/web/live_transcription.css
+++ b/whisperlivekit/web/live_transcription.css
@@ -74,10 +74,13 @@
 body {
  font-family: ui-sans-serif, system-ui, sans-serif, 'Apple Color Emoji', 'Segoe UI Emoji', 'Segoe UI Symbol', 'Noto Color Emoji';
-  margin: 20px;
+  margin: 0;
  text-align: center;
  background-color: var(--bg);
  color: var(--text);
  height: 100vh;
  display: flex;
  flex-direction: column;
 }
 /* Record button */
@@ -168,9 +171,18 @@ body {
 }
 #status {
-  margin-top: 20px;
+  margin-top: 15px;
  font-size: 16px;
  color: var(--text);
  margin-bottom: 0;
 }
 .header-container {
  position: sticky;
  top: 0;
  background-color: var(--bg);
  z-index: 100;
  padding: 20px;
 }
 /* Settings */
@@ -179,7 +191,6 @@ body {
  justify-content: center;
  align-items: center;
  gap: 15px;
  margin-top: 20px;
 }
 .settings {
@@ -297,9 +308,21 @@ label {
  border-radius: 999px;
 }
 .transcript-container {
  flex: 1;
  overflow-y: auto;
  padding: 20px;
  scrollbar-width: none;
  -ms-overflow-style: none;
 }
 .transcript-container::-webkit-scrollbar {
  display: none;
 }
 /* Transcript area */
 #linesTranscript {
-  margin: 20px auto;
+  margin: 0 auto;
  max-width: 700px;
  text-align: left;
  font-size: 16px;
@@ -323,7 +346,7 @@ label {
 .label_diarization {
  background-color: var(--chip-bg);
-  border-radius: 8px 8px 8px 8px;
+  border-radius: 100px;
  padding: 2px 10px;
  margin-left: 10px;
  display: inline-block;
@@ -335,7 +358,7 @@ label {
 .label_transcription {
  background-color: var(--chip-bg);
-  border-radius: 8px 8px 8px 8px;
+  border-radius: 100px;
  padding: 2px 10px;
  display: inline-block;
  white-space: nowrap;
@@ -345,9 +368,34 @@ label {
  color: var(--label-trans-text);
 }
 .label_translation {
  background-color: var(--chip-bg);
  display: inline-flex;
  border-radius: 10px;
  padding: 4px 8px;
  margin-top: 4px;
  font-size: 14px;
  color: var(--text);
  align-items: flex-start;
  gap: 4px;
 }
 .lag-diarization-value {
    margin-left: 10px;
 }
 .label_translation img {
  margin-top: 2px;
 }
 .label_translation img {
  width: 12px;
  height: 12px;
 }
 #timeInfo {
  color: var(--muted);
-  margin-left: 10px;
+  margin-left: 0px;
 }
 .textcontent {
@@ -407,6 +455,10 @@ label {
 /* for smaller screens */
@media (max-width: 768px) {
  .header-container {
    padding: 15px;
  }
  .settings-container {
    flex-direction: column;
    gap: 10px;
@@ -430,11 +482,15 @@ label {
  .theme-selector-container {
    margin-top: 10px;
  }
  .transcript-container {
    padding: 15px;
  }
 }
@media (max-width: 480px) {
-  body {
+  .header-container {
-    margin: 10px;
+    padding: 10px;
  }
  .settings {
@@ -457,4 +513,38 @@ label {
    width: 14px;
    height: 14px;
  }
  .transcript-container {
    padding: 10px;
  }
 }
 .label_language {
  background-color: var(--chip-bg);
  margin-bottom: 0px;
  margin-top: 5px;
  height: 18.5px;
  border-radius: 100px;
  padding: 2px 8px;
  margin-left: 10px;
  display: inline-flex;
  align-items: center;
  gap: 4px;
  font-size: 14px;
  color: var(--muted);
 }
 .speaker-badge {
  display: inline-flex;
  align-items: center;
  justify-content: center;
  width: 16px;
  height: 16px;
  margin-left: -5px;
  border-radius: 50%;
  font-size: 11px;
  line-height: 1;
  font-weight: 800;
  color: var(--muted);
 }
--- a/whisperlivekit/web/live_transcription.html
+++ b/whisperlivekit/web/live_transcription.html
@@ -9,63 +9,63 @@
 </head>
 <body>
-    <div class="settings-container">
+    <div class="header-container">
-        <button id="recordButton">
+        <div class="settings-container">
-            <div class="shape-container">
+            <button id="recordButton">
-                <div class="shape"></div>
+                <div class="shape-container">
-            </div>
+                    <div class="shape"></div>
            <div class="recording-info">
                <div class="wave-container">
                    <canvas id="waveCanvas"></canvas>
                </div>
-                <div class="timer">00:00</div>
+                <div class="recording-info">
-            </div>
+                    <div class="wave-container">
-        </button>
+                        <canvas id="waveCanvas"></canvas>
                    </div>
                    <div class="timer">00:00</div>
                </div>
            </button>
-        <div class="settings">
+            <div class="settings">
-            <div class="field">
+                <div class="field">
-                <label for="websocketInput">Websocket URL</label>
+                    <label for="websocketInput">Websocket URL</label>
-                <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
+                    <input id="websocketInput" type="text" placeholder="ws://host:port/asr" />
-            </div>
+                </div>
-            <div class="field">
+                <div class="field">
-                <label id="microphoneSelectLabel" for="microphoneSelect">Select Microphone</label>
+                    <label id="microphoneSelectLabel" for="microphoneSelect">Select Microphone</label>
-                <select id="microphoneSelect">
+                    <select id="microphoneSelect">
-                    <option value="">Default Microphone</option>
+                        <option value="">Default Microphone</option>
-                </select>
+                    </select>
-            </div>
+                </div>
-            <div class="theme-selector-container">
+                <div class="theme-selector-container">
-                <div class="segmented" role="radiogroup" aria-label="Theme selector">
+                    <div class="segmented" role="radiogroup" aria-label="Theme selector">
-                    <input type="radio" id="theme-system" name="theme" value="system" />
+                        <input type="radio" id="theme-system" name="theme" value="system" />
-                    <label for="theme-system" title="System">
+                        <label for="theme-system" title="System">
-                        <img src="/web/src/system_mode.svg" alt="" />
+                            <img src="/web/src/system_mode.svg" alt="" />
-                        <span>System</span>
+                            <span>System</span>
-                    </label>
+                        </label>
-                    <input type="radio" id="theme-light" name="theme" value="light" />
+                        <input type="radio" id="theme-light" name="theme" value="light" />
-                    <label for="theme-light" title="Light">
+                        <label for="theme-light" title="Light">
-                        <img src="/web/src/light_mode.svg" alt="" />
+                            <img src="/web/src/light_mode.svg" alt="" />
-                        <span>Light</span>
+                            <span>Light</span>
-                    </label>
+                        </label>
-                    <input type="radio" id="theme-dark" name="theme" value="dark" />
+                        <input type="radio" id="theme-dark" name="theme" value="dark" />
-                    <label for="theme-dark" title="Dark">
+                        <label for="theme-dark" title="Dark">
-                        <img src="/web/src/dark_mode.svg" alt="" />
+                            <img src="/web/src/dark_mode.svg" alt="" />
-                        <span>Dark</span>
+                            <span>Dark</span>
-                    </label>
+                        </label>
                    </div>
                </div>
            </div>
        </div>
-    </div>
+        
        <p id="status"></p>
    </div>
-
+    <div class="transcript-container">
-
+        <div id="linesTranscript"></div>
-    <p id="status"></p>
+    </div>
    <div id="linesTranscript"></div>
    <script src="/web/live_transcription.js"></script>
 </body>
--- a/whisperlivekit/web/live_transcription.js
+++ b/whisperlivekit/web/live_transcription.js
@@ -12,6 +12,8 @@ let timerInterval = null;
 let audioContext = null;
 let analyser = null;
 let microphone = null;
 let workletNode = null;
 let recorderWorker = null;
 let waveCanvas = document.getElementById("waveCanvas");
 let waveCtx = waveCanvas.getContext("2d");
 let animationFrame = null;
@@ -20,6 +22,9 @@ let lastReceivedData = null;
 let lastSignature = null;
 let availableMicrophones = [];
 let selectedMicrophoneId = null;
 let serverUseAudioWorklet = null;
 let configReadyResolve;
 const configReady = new Promise((r) => (configReadyResolve = r));
 waveCanvas.width = 60 * (window.devicePixelRatio || 1);
 waveCanvas.height = 30 * (window.devicePixelRatio || 1);
@@ -35,6 +40,11 @@ const timerElement = document.querySelector(".timer");
 const themeRadios = document.querySelectorAll('input[name="theme"]');
 const microphoneSelect = document.getElementById("microphoneSelect");
 const translationIcon = `<svg xmlns="http://www.w3.org/2000/svg" height="12px" viewBox="0 -960 960 960" width="12px" fill="#5f6368"><path d="m603-202-34 97q-4 11-14 18t-22 7q-20 0-32.5-16.5T496-133l152-402q5-11 15-18t22-7h30q12 0 22 7t15 18l152 403q8 19-4 35.5T868-80q-13 0-22.5-7T831-106l-34-96H603ZM362-401 188-228q-11 11-27.5 11.5T132-228q-11-11-11-28t11-28l174-174q-35-35-63.5-80T190-640h84q20 39 40 68t48 58q33-33 68.5-92.5T484-720H80q-17 0-28.5-11.5T40-760q0-17 11.5-28.5T80-800h240v-40q0-17 11.5-28.5T360-880q17 0 28.5 11.5T400-840v40h240q17 0 28.5 11.5T680-760q0 17-11.5 28.5T640-720h-76q-21 72-63 148t-83 116l96 98-30 82-122-125Zm266 129h144l-72-204-72 204Z"/></svg>`
 const silenceIcon = `<svg xmlns="http://www.w3.org/2000/svg" style="vertical-align: text-bottom;" height="14px" viewBox="0 -960 960 960" width="14px" fill="#5f6368"><path d="M514-556 320-752q9-3 19-5.5t21-2.5q66 0 113 47t47 113q0 11-1.5 22t-4.5 22ZM40-200v-32q0-33 17-62t47-44q51-26 115-44t141-18q26 0 49.5 2.5T456-392l-56-54q-9 3-19 4.5t-21 1.5q-66 0-113-47t-47-113q0-11 1.5-21t4.5-19L84-764q-11-11-11-28t11-28q12-12 28.5-12t27.5 12l675 685q11 11 11.5 27.5T816-80q-11 13-28 12.5T759-80L641-200h39q0 33-23.5 56.5T600-120H120q-33 0-56.5-23.5T40-200Zm80 0h480v-32q0-14-4.5-19.5T580-266q-36-18-92.5-36T360-320q-71 0-127.5 18T140-266q-9 5-14.5 14t-5.5 20v32Zm240 0Zm560-400q0 69-24.5 131.5T829-355q-12 14-30 15t-32-13q-13-13-12-31t12-33q30-38 46.5-85t16.5-98q0-51-16.5-97T767-781q-12-15-12.5-33t12.5-32q13-14 31.5-13.5T829-845q42 51 66.5 113.5T920-600Zm-182 0q0 32-10 61.5T700-484q-11 15-29.5 15.5T638-482q-13-13-13.5-31.5T633-549q6-11 9.5-24t3.5-27q0-14-3.5-27t-9.5-25q-9-17-8.5-35t13.5-31q14-14 32.5-13.5T700-716q18 25 28 54.5t10 61.5Z"/></svg>`;
 const languageIcon = `<svg xmlns="http://www.w3.org/2000/svg" height="12" viewBox="0 -960 960 960" width="12" fill="#5f6368"><path d="M480-80q-82 0-155-31.5t-127.5-86Q143-252 111.5-325T80-480q0-83 31.5-155.5t86-127Q252-817 325-848.5T480-880q83 0 155.5 31.5t127 86q54.5 54.5 86 127T880-480q0 82-31.5 155t-86 127.5q-54.5 54.5-127 86T480-80Zm0-82q26-36 45-75t31-83H404q12 44 31 83t45 75Zm-104-16q-18-33-31.5-68.5T322-320H204q29 50 72.5 87t99.5 55Zm208 0q56-18 99.5-55t72.5-87H638q-9 38-22.5 73.5T584-178ZM170-400h136q-3-20-4.5-39.5T300-480q0-21 1.5-40.5T306-560H170q-5 20-7.5 39.5T160-480q0 21 2.5 40.5T170-400Zm216 0h188q3-20 4.5-39.5T580-480q0-21-1.5-40.5T574-560H386q-3 20-4.5 39.5T380-480q0 21 1.5 40.5T386-400Zm268 0h136q5-20 7.5-39.5T800-480q0-21-2.5-40.5T790-560H654q3 20 4.5 39.5T660-480q0 21-1.5 40.5T654-400Zm-16-240h118q-29-50-72.5-87T584-782q18 33 31.5 68.5T638-640Zm-234 0h152q-12-44-31-83t-45-75q-26 36-45 75t-31 83Zm-200 0h118q9-38 22.5-73.5T376-782q-56 18-99.5 55T204-640Z"/></svg>`
 const speakerIcon = `<svg xmlns="http://www.w3.org/2000/svg" height="16px" style="vertical-align: text-bottom;" viewBox="0 -960 960 960" width="16px" fill="#5f6368"><path d="M480-480q-66 0-113-47t-47-113q0-66 47-113t113-47q66 0 113 47t47 113q0 66-47 113t-113 47ZM160-240v-32q0-34 17.5-62.5T224-378q62-31 126-46.5T480-440q66 0 130 15.5T736-378q29 15 46.5 43.5T800-272v32q0 33-23.5 56.5T720-160H240q-33 0-56.5-23.5T160-240Zm80 0h480v-32q0-11-5.5-20T700-306q-54-27-109-40.5T480-360q-56 0-111 13.5T260-306q-9 5-14.5 14t-5.5 20v32Zm240-320q33 0 56.5-23.5T560-640q0-33-23.5-56.5T480-720q-33 0-56.5 23.5T400-640q0 33 23.5 56.5T480-560Zm0-80Zm0 400Z"/></svg>`;
 function getWaveStroke() {
  const styles = getComputedStyle(document.documentElement);
  const v = styles.getPropertyValue("--wave-stroke").trim();
@@ -226,6 +236,14 @@ function setupWebSocket() {
    websocket.onmessage = (event) => {
      const data = JSON.parse(event.data);
      if (data.type === "config") {
        serverUseAudioWorklet = !!data.useAudioWorklet;
        statusText.textContent = serverUseAudioWorklet
          ? "Connected. Using AudioWorklet (PCM)."
          : "Connected. Using MediaRecorder (WebM).";
        if (configReadyResolve) configReadyResolve();
        return;
      }
      if (data.type === "ready_to_stop") {
        console.log("Ready to stop received, finalizing display and closing WebSocket.");
@@ -293,7 +311,7 @@ function renderLinesWithBuffer(
  const showTransLag = !isFinalizing && remaining_time_transcription > 0;
  const showDiaLag = !isFinalizing && !!buffer_diarization && remaining_time_diarization > 0;
  const signature = JSON.stringify({
-    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, beg: it.beg, end: it.end })),
+    lines: (lines || []).map((it) => ({ speaker: it.speaker, text: it.text, start: it.start, end: it.end, detected_language: it.detected_language })),
    buffer_transcription: buffer_transcription || "",
    buffer_diarization: buffer_diarization || "",
    status: current_status,
@@ -316,19 +334,24 @@ function renderLinesWithBuffer(
  const linesHtml = (lines || [])
    .map((item, idx) => {
      let timeInfo = "";
-      if (item.beg !== undefined && item.end !== undefined) {
+      if (item.start !== undefined && item.end !== undefined) {
-        timeInfo = ` ${item.beg} - ${item.end}`;
+        timeInfo = ` ${item.start} - ${item.end}`;
      }
      let speakerLabel = "";
      if (item.speaker === -2) {
-        speakerLabel = `<span class="silence">Silence<span id='timeInfo'>${timeInfo}</span></span>`;
+        speakerLabel = `<span class="silence">${silenceIcon}<span id='timeInfo'>${timeInfo}</span></span>`;
      } else if (item.speaker == 0 && !isFinalizing) {
        speakerLabel = `<span class='loading'><span class="spinner"></span><span id='timeInfo'><span class="loading-diarization-value">${fmt1(
          remaining_time_diarization
        )}</span> second(s) of audio are undergoing diarization</span></span>`;
      } else if (item.speaker !== 0) {
-        speakerLabel = `<span id="speaker">Speaker ${item.speaker}<span id='timeInfo'>${timeInfo}</span></span>`;
+        const speakerNum = `<span class="speaker-badge">${item.speaker}</span>`;
        speakerLabel = `<span id="speaker">${speakerIcon}${speakerNum}<span id='timeInfo'>${timeInfo}</span></span>`;
        if (item.detected_language) {
          speakerLabel += `<span class="label_language">${languageIcon}<span>${item.detected_language}</span></span>`;
        }
      }
      let currentLineText = item.text || "";
@@ -365,6 +388,13 @@ function renderLinesWithBuffer(
          }
        }
      }
      if (item.translation) {
        currentLineText += `<div class="label_translation">
          ${translationIcon}
          <span>${item.translation}</span>
        </div>`;
      }
      return currentLineText.trim().length > 0 || speakerLabel.length > 0
        ? `<p>${speakerLabel}<br/><div class='textcontent'>${currentLineText}</div></p>`
@@ -373,7 +403,10 @@ function renderLinesWithBuffer(
    .join("");
  linesTranscriptDiv.innerHTML = linesHtml;
-  window.scrollTo({ top: document.body.scrollHeight, behavior: "smooth" });
+  const transcriptContainer = document.querySelector('.transcript-container');
  if (transcriptContainer) {
    transcriptContainer.scrollTo({ top: transcriptContainer.scrollHeight, behavior: "smooth" });
  }
 }
 function updateTimer() {
@@ -447,13 +480,54 @@ async function startRecording() {
    microphone = audioContext.createMediaStreamSource(stream);
    microphone.connect(analyser);
-    recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
+    if (serverUseAudioWorklet) {
-    recorder.ondataavailable = (e) => {
+      if (!audioContext.audioWorklet) {
-      if (websocket && websocket.readyState === WebSocket.OPEN) {
+        throw new Error("AudioWorklet is not supported in this browser");
        websocket.send(e.data);
      }
-    };
+      await audioContext.audioWorklet.addModule("/web/pcm_worklet.js");
-    recorder.start(chunkDuration);
+      workletNode = new AudioWorkletNode(audioContext, "pcm-forwarder", { numberOfInputs: 1, numberOfOutputs: 0, channelCount: 1 });
      microphone.connect(workletNode);
      recorderWorker = new Worker("/web/recorder_worker.js");
      recorderWorker.postMessage({
        command: "init",
        config: {
          sampleRate: audioContext.sampleRate,
        },
      });
      recorderWorker.onmessage = (e) => {
        if (websocket && websocket.readyState === WebSocket.OPEN) {
          websocket.send(e.data.buffer);
        }
      };
      workletNode.port.onmessage = (e) => {
        const data = e.data;
        const ab = data instanceof ArrayBuffer ? data : data.buffer;
        recorderWorker.postMessage(
          {
            command: "record",
            buffer: ab,
          },
          [ab]
        );
      };
    } else {
      try {
        recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
      } catch (e) {
        recorder = new MediaRecorder(stream);
      }
      recorder.ondataavailable = (e) => {
        if (websocket && websocket.readyState === WebSocket.OPEN) {
          if (e.data && e.data.size > 0) {
            websocket.send(e.data);
          }
        }
      };
      recorder.start(chunkDuration);
    }
    startTime = Date.now();
    timerInterval = setInterval(updateTimer, 1000);
@@ -492,10 +566,28 @@ async function stopRecording() {
  }
  if (recorder) {
-    recorder.stop();
+    try {
      recorder.stop();
    } catch (e) {
    }
    recorder = null;
  }
  if (recorderWorker) {
    recorderWorker.terminate();
    recorderWorker = null;
  }
  if (workletNode) {
    try {
      workletNode.port.onmessage = null;
    } catch (e) {}
    try {
      workletNode.disconnect();
    } catch (e) {}
    workletNode = null;
  }
  if (microphone) {
    microphone.disconnect();
    microphone = null;
@@ -539,9 +631,11 @@ async function toggleRecording() {
    console.log("Connecting to WebSocket");
    try {
      if (websocket && websocket.readyState === WebSocket.OPEN) {
        await configReady;
        await startRecording();
      } else {
        await setupWebSocket();
        await configReady;
        await startRecording();
      }
    } catch (err) {
--- a/whisperlivekit/web/pcm_worklet.js
+++ b/whisperlivekit/web/pcm_worklet.js
@@ -0,0 +1,16 @@
 class PCMForwarder extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0];
    if (input && input[0] && input[0].length) {
      // Forward mono channel (0). If multi-channel, downmixing can be added here.
      const channelData = input[0];
      const copy = new Float32Array(channelData.length);
      copy.set(channelData);
      this.port.postMessage(copy, [copy.buffer]);
    }
    // Keep processor alive
    return true;
  }
 }
 registerProcessor('pcm-forwarder', PCMForwarder);
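The comment in `process()` notes that downmixing could be added for multi-channel input. A minimal sketch of what that might look like, assuming the input is an array of equal-length `Float32Array` channels (the shape of `inputs[0]` above). `downmixToMono` is a hypothetical helper and is not part of this diff:

```javascript
// Hypothetical helper (not in the diff): average every channel of one input into mono.
function downmixToMono(channels) {
  const frames = channels[0].length;
  const mono = new Float32Array(frames);
  for (const channel of channels) {
    for (let i = 0; i < frames; i++) {
      // Each channel contributes an equal share of the output sample.
      mono[i] += channel[i] / channels.length;
    }
  }
  return mono;
}

// Example: two 3-frame channels mixed into one mono buffer.
const mono = downmixToMono([
  new Float32Array([1, 0, -1]),
  new Float32Array([0, 0, 1]),
]);
```

Inside the processor, the result could be posted in place of `copy`. Averaging is only one possible choice; whether it is needed at all depends on how the node's channel configuration resolves in practice.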
--- a/whisperlivekit/web/recorder_worker.js
+++ b/whisperlivekit/web/recorder_worker.js
@@ -0,0 +1,58 @@
 let sampleRate = 48000;
 let targetSampleRate = 16000;
 self.onmessage = function (e) {
  switch (e.data.command) {
    case 'init':
      init(e.data.config);
      break;
    case 'record':
      record(e.data.buffer);
      break;
  }
 };
 function init(config) {
  sampleRate = config.sampleRate;
  targetSampleRate = config.targetSampleRate || 16000;
 }
 function record(inputBuffer) {
  const buffer = new Float32Array(inputBuffer);
  const resampledBuffer = resample(buffer, sampleRate, targetSampleRate);
  const pcmBuffer = toPCM(resampledBuffer);
  self.postMessage({ buffer: pcmBuffer }, [pcmBuffer]);
 }
 function resample(buffer, from, to) {
    if (from === to) {
        return buffer;
    }
    const ratio = from / to;
    const newLength = Math.round(buffer.length / ratio);
    const result = new Float32Array(newLength);
    let offsetResult = 0;
    let offsetBuffer = 0;
    while (offsetResult < result.length) {
        const nextOffsetBuffer = Math.round((offsetResult + 1) * ratio);
        let accum = 0, count = 0;
        for (let i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++) {
            accum += buffer[i];
            count++;
        }
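        // Guard against an empty window (possible when upsampling), which would yield NaN.
        result[offsetResult] = count > 0 ? accum / count : buffer[Math.min(offsetBuffer, buffer.length - 1)];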
        offsetResult++;
        offsetBuffer = nextOffsetBuffer;
    }
    return result;
 }
 function toPCM(input) {
  const buffer = new ArrayBuffer(input.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return buffer;
 }
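The worker above downsamples by averaging blocks of input samples and then packs the result as little-endian signed 16-bit PCM. A self-contained sketch of the same two steps, runnable outside a Worker, is given below. The function names `downsampleByAveraging` and `floatToInt16LE` are illustrative assumptions and do not appear in the repository:

```javascript
// Hypothetical standalone versions of the worker's resample/toPCM steps.
function downsampleByAveraging(samples, fromRate, toRate) {
  if (fromRate === toRate) return samples;
  const ratio = fromRate / toRate;
  const out = new Float32Array(Math.round(samples.length / ratio));
  let start = 0;
  for (let i = 0; i < out.length; i++) {
    const end = Math.min(samples.length, Math.round((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += samples[j];
    // Fall back to the nearest sample if the window is empty (upsampling case).
    out[i] = end > start
      ? sum / (end - start)
      : samples[Math.min(start, samples.length - 1)];
    start = end;
  }
  return out;
}

function floatToInt16LE(samples) {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale asymmetrically to the int16 range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

// Six samples at 48 kHz become two samples at 16 kHz (ratio 3).
const input = new Float32Array([0, 0.5, 1, -1, -0.5, 0]);
const resampled = downsampleByAveraging(input, 48000, 16000);
const pcmBuffer = floatToInt16LE(resampled);
const pcmView = new DataView(pcmBuffer);
```

Block averaging acts as a crude low-pass filter before decimation. It is cheap enough for a worker but is not a substitute for a proper anti-aliasing filter if audio quality matters.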
--- a/whisperlivekit/web/src/language.svg
+++ b/whisperlivekit/web/src/language.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-80q-82 0-155-31.5t-127.5-86Q143-252 111.5-325T80-480q0-83 31.5-155.5t86-127Q252-817 325-848.5T480-880q83 0 155.5 31.5t127 86q54.5 54.5 86 127T880-480q0 82-31.5 155t-86 127.5q-54.5 54.5-127 86T480-80Zm0-82q26-36 45-75t31-83H404q12 44 31 83t45 75Zm-104-16q-18-33-31.5-68.5T322-320H204q29 50 72.5 87t99.5 55Zm208 0q56-18 99.5-55t72.5-87H638q-9 38-22.5 73.5T584-178ZM170-400h136q-3-20-4.5-39.5T300-480q0-21 1.5-40.5T306-560H170q-5 20-7.5 39.5T160-480q0 21 2.5 40.5T170-400Zm216 0h188q3-20 4.5-39.5T580-480q0-21-1.5-40.5T574-560H386q-3 20-4.5 39.5T380-480q0 21 1.5 40.5T386-400Zm268 0h136q5-20 7.5-39.5T800-480q0-21-2.5-40.5T790-560H654q3 20 4.5 39.5T660-480q0 21-1.5 40.5T654-400Zm-16-240h118q-29-50-72.5-87T584-782q18 33 31.5 68.5T638-640Zm-234 0h152q-12-44-31-83t-45-75q-26 36-45 75t-31 83Zm-200 0h118q9-38 22.5-73.5T376-782q-56 18-99.5 55T204-640Z"/></svg>
--- a/whisperlivekit/web/src/settings.svg
+++ b/whisperlivekit/web/src/settings.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M433-80q-27 0-46.5-18T363-142l-9-66q-13-5-24.5-12T307-235l-62 26q-25 11-50 2t-39-32l-47-82q-14-23-8-49t27-43l53-40q-1-7-1-13.5v-27q0-6.5 1-13.5l-53-40q-21-17-27-43t8-49l47-82q14-23 39-32t50 2l62 26q11-8 23-15t24-12l9-66q4-26 23.5-44t46.5-18h94q27 0 46.5 18t23.5 44l9 66q13 5 24.5 12t22.5 15l62-26q25-11 50-2t39 32l47 82q14 23 8 49t-27 43l-53 40q1 7 1 13.5v27q0 6.5-2 13.5l53 40q21 17 27 43t-8 49l-48 82q-14 23-39 32t-50-2l-60-26q-11 8-23 15t-24 12l-9 66q-4 26-23.5 44T527-80h-94Zm7-80h79l14-106q31-8 57.5-23.5T639-327l99 41 39-68-86-65q5-14 7-29.5t2-31.5q0-16-2-31.5t-7-29.5l86-65-39-68-99 42q-22-23-48.5-38.5T533-694l-13-106h-79l-14 106q-31 8-57.5 23.5T321-633l-99-41-39 68 86 64q-5 15-7 30t-2 32q0 16 2 31t7 30l-86 65 39 68 99-42q22 23 48.5 38.5T427-266l13 106Zm42-180q58 0 99-41t41-99q0-58-41-99t-99-41q-59 0-99.5 41T342-480q0 58 40.5 99t99.5 41Zm-2-140Z"/></svg>
--- a/whisperlivekit/web/src/silence.svg
+++ b/whisperlivekit/web/src/silence.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M514-556 320-752q9-3 19-5.5t21-2.5q66 0 113 47t47 113q0 11-1.5 22t-4.5 22ZM40-200v-32q0-33 17-62t47-44q51-26 115-44t141-18q26 0 49.5 2.5T456-392l-56-54q-9 3-19 4.5t-21 1.5q-66 0-113-47t-47-113q0-11 1.5-21t4.5-19L84-764q-11-11-11-28t11-28q12-12 28.5-12t27.5 12l675 685q11 11 11.5 27.5T816-80q-11 13-28 12.5T759-80L641-200h39q0 33-23.5 56.5T600-120H120q-33 0-56.5-23.5T40-200Zm80 0h480v-32q0-14-4.5-19.5T580-266q-36-18-92.5-36T360-320q-71 0-127.5 18T140-266q-9 5-14.5 14t-5.5 20v32Zm240 0Zm560-400q0 69-24.5 131.5T829-355q-12 14-30 15t-32-13q-13-13-12-31t12-33q30-38 46.5-85t16.5-98q0-51-16.5-97T767-781q-12-15-12.5-33t12.5-32q13-14 31.5-13.5T829-845q42 51 66.5 113.5T920-600Zm-182 0q0 32-10 61.5T700-484q-11 15-29.5 15.5T638-482q-13-13-13.5-31.5T633-549q6-11 9.5-24t3.5-27q0-14-3.5-27t-9.5-25q-9-17-8.5-35t13.5-31q14-14 32.5-13.5T700-716q18 25 28 54.5t10 61.5Z"/></svg>
--- a/whisperlivekit/web/src/speaker.svg
+++ b/whisperlivekit/web/src/speaker.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-480q-66 0-113-47t-47-113q0-66 47-113t113-47q66 0 113 47t47 113q0 66-47 113t-113 47ZM160-240v-32q0-34 17.5-62.5T224-378q62-31 126-46.5T480-440q66 0 130 15.5T736-378q29 15 46.5 43.5T800-272v32q0 33-23.5 56.5T720-160H240q-33 0-56.5-23.5T160-240Zm80 0h480v-32q0-11-5.5-20T700-306q-54-27-109-40.5T480-360q-56 0-111 13.5T260-306q-9 5-14.5 14t-5.5 20v32Zm240-320q33 0 56.5-23.5T560-640q0-33-23.5-56.5T480-720q-33 0-56.5 23.5T400-640q0 33 23.5 56.5T480-560Zm0-80Zm0 400Z"/></svg>
--- a/whisperlivekit/web/src/translate.svg
+++ b/whisperlivekit/web/src/translate.svg
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="m603-202-34 97q-4 11-14 18t-22 7q-20 0-32.5-16.5T496-133l152-402q5-11 15-18t22-7h30q12 0 22 7t15 18l152 403q8 19-4 35.5T868-80q-13 0-22.5-7T831-106l-34-96H603ZM362-401 188-228q-11 11-27.5 11.5T132-228q-11-11-11-28t11-28l174-174q-35-35-63.5-80T190-640h84q20 39 40 68t48 58q33-33 68.5-92.5T484-720H80q-17 0-28.5-11.5T40-760q0-17 11.5-28.5T80-800h240v-40q0-17 11.5-28.5T360-880q17 0 28.5 11.5T400-840v40h240q17 0 28.5 11.5T680-760q0 17-11.5 28.5T640-720h-76q-21 72-63 148t-83 116l96 98-30 82-122-125Zm266 129h144l-72-204-72 204Z"/></svg>
Author	SHA1	Message	Date
Quentin Fuxa	d24c110d55	to 0.2.11	2025-09-24 22:34:01 +02:00
Quentin Fuxa	4dd5d8bf8a	translation compatible with auto and detected language	2025-09-22 11:20:00 +02:00
Quentin Fuxa	93f002cafb	language detection after few seconds working	2025-09-20 11:08:00 +02:00
Quentin Fuxa	c5e30c2c07	svg loaded once in javascript, no more need for StaticFiles	2025-09-20 11:06:00 +02:00
Quentin Fuxa	1c2afb8bd2	svg loaded once in javascript, no more need for StaticFiles	2025-09-20 11:06:00 +02:00
Quentin Fuxa	674b20d3af	in buffer while language not detected »	2025-09-21 11:05:00 +02:00
Quentin Fuxa	a5503308c5	O(n) to O(1) for simulstreaming timestamp determination	2025-09-21 11:04:00 +02:00
Quentin Fuxa	e61afdefa3	punctuation is now checked in timed_object	2025-09-22 22:40:39 +02:00
Quentin Fuxa	426d70a790	simulstreaming infer does not return a dictionary anymore	2025-09-21 11:03:00 +02:00
Quentin Fuxa	b03a212fbf	fixes #227 , auto language dectection v0.1 - simulstreaming only - when diarization and auto	2025-09-19 19:15:28 +02:00
Quentin Fuxa	1833e7c921	0.2.10	2025-09-16 23:45:00 +02:00
Quentin Fuxa	777ec63a71	--pcm-input option information	2025-09-17 16:06:28 +02:00
Quentin Fuxa	0a6e5ae9c1	ffmpeg install instruction error indicates --pcm-input alternative	2025-09-17 16:04:17 +02:00
Quentin Fuxa	ee448a37e9	when pcm-input is set, the frontend uses AudioWorklet	2025-09-17 14:55:57 +02:00
Quentin Fuxa	9c051052b0	Merge branch 'main' into ScriptProcessorNode-to-AudioWorklet	2025-09-17 11:28:36 +02:00
Quentin Fuxa	4d7c487614	replace deprecated ScriptProcessorNode with AudioWorklet	2025-09-17 10:53:53 +02:00
Quentin Fuxa	65025cc448	nllb backend can be transformers, and model size can be 1.3B	2025-09-17 10:20:31 +02:00
Quentin Fuxa	bbba1d9bb7	add nllb-backend and translation perf test in dev_notes	2025-09-16 20:45:01 +02:00
Quentin Fuxa	99dc96c644	fixes #224	2025-09-16 18:34:35 +02:00
GeorgeCaoJ	2a27d2030a	feat: support web audio 16kHz PCM input and remove ffmpeg dependency	2025-09-15 23:22:25 +08:00
Quentin Fuxa	cd160caaa1	asyncio.to_thread for transcription and translation	2025-09-15 15:23:22 +02:00
Quentin Fuxa	d27b5eb23e	Merge pull request #219 from notV3NOM/main Fix warmup file behavior	2025-09-15 10:19:26 +02:00
Quentin Fuxa	f9d704a900	Merge branch 'main' of https://github.com/notv3nom/whisperlivekit into pr/notV3NOM/219	2025-09-15 10:00:14 +02:00
Quentin Fuxa	2f6e00f512	simulstreaming warmup is done in whisperlivekit.simul_whisper.backend.load_model, not in warmup_online	2025-09-15 09:43:15 +02:00
Quentin Fuxa	5aa312e437	simulstreaming warmup is done in whisperlivekit.simul_whisper.backend.load_model, not in warmup_online	2025-09-13 20:19:19 +01:00
notV3NOM	ebaf36a8be	Fix warmup file behavior	2025-09-13 20:44:24 +05:30
Quentin Fuxa	babe93b99a	to 0.2.9	2025-09-11 21:36:32 +02:00
Quentin Fuxa	a4e9f3cab7	support for raw PCM input option by @YeonjunNotFR	2025-09-11 21:32:11 +02:00
Quentin Fuxa	b06866877a	add --disable-punctuation-split option	2025-09-11 21:03:00 +02:00
Quentin Fuxa	967cdfebc8	fix Translation imports	2025-09-11 21:03:00 +02:00
Quentin Fuxa	3c11c60126	fix by @treeaaa	2025-09-11 21:03:00 +02:00
Quentin Fuxa	2963e8a757	translate when at least 3 new tokens	2025-09-09 21:45:00 +02:00
Quentin Fuxa	cb2d4ea88a	audio processor lines use now Lines objects instead of dict	2025-09-09 21:45:00 +02:00
Quentin Fuxa	add7ea07ee	translator takes all the tokens from the queue	2025-09-09 19:55:39 +02:00
Quentin Fuxa	da8726b2cb	Merge pull request #211 from Alexander-ARTV/main Fix type error when setting encoder_feature in simul_whisper->infer for faster whisper encoder	2025-09-09 15:46:59 +02:00
Quentin Fuxa	3358877054	Fix StorageView conversion for CPU/GPU compatibility	2025-09-09 15:44:16 +02:00
Quentin Fuxa	1f7798c7c1	condition on encoder_feature_ctranslate type	2025-09-09 12:16:52 +02:00
Alexander Lindberg	c7b3bb5e58	Fix regression with faster-whisper encoder_feature	2025-09-09 11:18:55 +03:00
Quentin Fuxa	f661f21675	translation asyncio task	2025-09-08 18:34:31 +02:00
Quentin Fuxa	b6164aa59b	translation device determined with torch.device	2025-09-08 11:34:40 +02:00
Quentin Fuxa	4209d7f7c0	Place all tensors on the same device in sortformer diarization	2025-09-08 10:20:57 +02:00
Quentin Fuxa	334b338ab0	use platform to determine system and recommand mlx whisper	2025-09-07 15:49:11 +02:00
Quentin Fuxa	72f33be6f2	translation: use of get_nllb_code	2025-09-07 15:25:14 +02:00
Quentin Fuxa	84890b8e61	Merge pull request #201 from notV3NOM/main Fix: simulstreaming preload model count argument in cli	2025-09-07 15:18:54 +02:00
Quentin Fuxa	c6668adcf3	Merge pull request #200 from notV3NOM/misc docs: add vram usage for large-v3-turbo	2025-09-07 15:17:42 +02:00
notV3NOM	a178ed5c22	fix simulstreaming preload model count argument in cli	2025-09-06 18:18:09 +05:30
notV3NOM	7601c74c9c	add vram usage for large-v3-turbo	2025-09-06 17:56:39 +05:30
Quentin Fuxa	fad9ee4d21	Merge pull request #198 from notV3NOM/main Fix scrolling UX with sticky header controls	2025-09-05 20:46:36 +02:00
Quentin Fuxa	d1a9913c47	nllb v0	2025-09-05 18:02:42 +02:00
notV3NOM	e4ca2623cb	Fix scrolling UX with sticky header controls	2025-09-05 21:25:13 +05:30
Quentin Fuxa	9c1bf37960	fixes #197	2025-09-05 16:34:13 +02:00
Quentin Fuxa	f46528471b	revamp chromium extension settings	2025-09-05 16:19:48 +02:00
Quentin Fuxa	191680940b	Merge branch 'main' of https://github.com/QuentinFuxa/whisper_streaming_web	2025-09-04 23:58:51 +02:00
Quentin Fuxa	ee02afec56	workaround to get the list of microphones in the extension	2025-09-04 23:58:48 +02:00
Quentin Fuxa	a458028de2	Merge pull request #196 from notV3NOM/main Fix: Exponentially growing simulstreaming silence timer	2025-09-04 23:05:59 +02:00
notV3NOM	abd8f2c269	Fix exponentially growing simulstreaming silence timer	2025-09-04 21:49:07 +05:30
Quentin Fuxa	f3ad4e39e4	torch.Tensor to torch.as_tensor	2025-09-04 16:39:11 +02:00
Quentin Fuxa	e0a5cbf0e7	v0.1.0 chrome extension	2025-09-04 16:36:28 +02:00
Quentin Fuxa	953697cd86	torch.Tensor to torch.as_tensor	2025-09-04 15:25:39 +02:00
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-120q-151 0-255.5-104.5T120-480q0-138 90-239.5T440-838q13-2 23 3.5t16 14.5q6 9 6.5 21t-7.5 23q-17 26-25.5 55t-8.5 61q0 90 63 153t153 63q31 0 61.5-9t54.5-25q11-7 22.5-6.5T819-479q10 5 15.5 15t3.5 24q-14 138-117.5 229T480-120Zm0-80q88 0 158-48.5T740-375q-20 5-40 8t-40 3q-123 0-209.5-86.5T364-660q0-20 3-40t8-40q-78 32-126.5 102T200-480q0 116 82 198t198 82Zm-10-270Z"/></svg>
@@ -0,0 +1 @@
 <svg xmlns="http://www.w3.org/2000/svg" height="24px" viewBox="0 -960 960 960" width="24px" fill="#5f6368"><path d="M480-360q50 0 85-35t35-85q0-50-35-85t-85-35q-50 0-85 35t-35 85q0 50 35 85t85 35Zm0 80q-83 0-141.5-58.5T280-480q0-83 58.5-141.5T480-680q83 0 141.5 58.5T680-480q0 83-58.5 141.5T480-280ZM80-440q-17 0-28.5-11.5T40-480q0-17 11.5-28.5T80-520h80q17 0 28.5 11.5T200-480q0 17-11.5 28.5T160-440H80Zm720 0q-17 0-28.5-11.5T760-480q0-17 11.5-28.5T800-520h80q17 0 28.5 11.5T920-480q0 17-11.5 28.5T880-440h-80ZM480-760q-17 0-28.5-11.5T440-800v-80q0-17 11.5-28.5T480-920q17 0 28.5 11.5T520-880v80q0 17-11.5 28.5T480-760Zm0 720q-17 0-28.5-11.5T440-80v-80q0-17 11.5-28.5T480-200q17 0 28.5 11.5T520-160v80q0 17-11.5 28.5T480-40ZM226-678l-43-42q-12-11-11.5-28t11.5-29q12-12 29-12t28 12l42 43q11 12 11 28t-11 28q-11 12-27.5 11.5T226-678Zm494 495-42-43q-11-12-11-28.5t11-27.5q11-12 27.5-11.5T734-282l43 42q12 11 11.5 28T777-183q-12 12-29 12t-28-12Zm-42-495q-12-11-11.5-27.5T678-734l42-43q11-12 28-11.5t29 11.5q12 12 12 29t-12 28l-43 42q-12 11-28 11t-28-11ZM183-183q-12-12-12-29t12-28l43-42q12-11 28.5-11t27.5 11q12 11 11.5 27.5T282-226l-42 43q-11 12-28 11.5T183-183Zm297-297Z"/></svg>