mirror of
https://github.com/QuentinFuxa/WhisperLiveKit.git
synced 2026-03-07 14:23:18 +00:00
update doc
@@ -18,7 +18,7 @@ Real-time transcription directly to your browser, with a ready-to-use backend+se
#### Powered by Leading Research:

- Simul-[Whisper](https://github.com/backspacetg/simul_whisper)/[Streaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
- Simul-[Whisper](https://arxiv.org/pdf/2406.10052)/[Streaming](https://arxiv.org/abs/2506.17077) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
- [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) (2025), based on [distilled](https://huggingface.co/entai2965/nllb-200-distilled-600M-ctranslate2) [NLLB](https://arxiv.org/abs/2207.04672) (2022, 2024) - Simultaneous translation from & to 200 languages.
- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription using [LocalAgreement policy](https://www.isca-archive.org/interspeech_2020/liu20s_interspeech.pdf)
- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
@@ -143,8 +143,8 @@ async def websocket_endpoint(websocket: WebSocket):
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/available_models.md) | `small` |
| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/models_compatible_formats.md) | `None` |
| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `small` |
| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `None` |
| `--language` | List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If set, translates using [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting). [200 languages available](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/supported_languages.md). To translate to English, you can also use `--direct-english-translation`; the STT model will then try to output the translation directly. | `None` |
| `--diarization` | Enable speaker identification | `False` |
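Combined, the flags above form a typical launch command. A minimal sketch, assuming the server is started via the `whisperlivekit-server` entry point (adjust to your installation):

```bash
# Hypothetical invocation; all flags come from the parameter table above.
whisperlivekit-server --model small --language en --diarization
```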
@@ -1,109 +0,0 @@
# Available Whisper model sizes:

- tiny.en (english only)
- tiny
- base.en (english only)
- base
- small.en (english only)
- small
- medium.en (english only)
- medium
- large-v1
- large-v2
- large-v3
- large-v3-turbo
## How to choose?

### Language Support
- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription
- **Multilingual**: Do not use `.en` models.

### Resource Constraints
- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models
  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
  - `base`: Good balance of speed and accuracy for basic use cases
  - `small`: Better accuracy while still being resource-efficient
- **Good resources available**: Use `large` models for best accuracy
  - `large-v2`: Excellent accuracy, good multilingual support
  - `large-v3`: Best overall accuracy and language support

### Special Cases
- **No translation needed**: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - **Important**: Does not translate correctly, only transcribes
### Model Comparison Table

| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
|-------|--------|----------|--------------|-------------|---------------|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |
### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting

**Hardware Requirements**:
- `tiny`: ~1GB VRAM
- `base`: ~1GB VRAM
- `small`: ~2GB VRAM
- `medium`: ~5GB VRAM
- `large`: ~10GB VRAM
- `large-v3-turbo`: ~6GB VRAM
**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model

### Quick Decision Tree
1. English only? → Add `.en` to your choice
2. Limited resources or need speed? → `small` or smaller
3. Good hardware and want best quality? → `large-v3`
4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
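The decision tree above can be encoded as a small helper. This is an illustrative sketch, not a WhisperLiveKit API; the function name and arguments are hypothetical:

```python
# Hypothetical helper mirroring the decision tree above.
def pick_whisper_model(english_only: bool = False,
                       limited_resources: bool = False,
                       need_translation: bool = False) -> str:
    if limited_resources:
        model = "small"              # step 2: small or smaller
    elif need_translation:
        model = "large-v3"           # step 5: avoid turbo, it cannot translate
    else:
        model = "large-v3-turbo"     # step 4: fast, high-quality transcription
    if english_only and not model.startswith("large"):
        model += ".en"               # step 1: .en variants exist up to medium only
    return model
```

For example, `pick_whisper_model(english_only=True, limited_resources=True)` returns `"small.en"`.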
_______________________
# Translation Models and Backend

**Language Support**: ~200 languages

## Distilled Model Sizes Available

| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|-------|------|------------|-------------|-------------|---------|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
## Backend Performance

| Backend | Speed vs Base | Memory Usage | Quality Loss |
|---------|---------------|--------------|--------------|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |

**Metrics**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: Up to 2x faster than CTranslate2
## Quick Decision Matrix

**Choose 600M**: Limited resources, close to zero added lag
**Choose 1.3B**: Quality matters
**Choose Transformers**: On Apple Silicon
106
docs/default_and_custom_models.md
Normal file
@@ -0,0 +1,106 @@
# Models and Model Paths

## Defaults

**Default Whisper Model**: `base`
When no model is specified, WhisperLiveKit uses the `base` model, which provides a good balance of speed and accuracy for most use cases.

**Default Model Cache Directory**: `~/.cache/whisper`
Models are automatically downloaded from OpenAI's model hub and cached in this directory. You can override this with `--model_cache_dir`.

**Default Translation Model**: `600M` (NLLB-200-distilled)
When translation is enabled, the 600M distilled NLLB model is used by default. This provides good quality with minimal resource usage.

**Default Translation Backend**: `transformers`
The translation backend defaults to Transformers. On Apple Silicon, this automatically uses MPS acceleration for better performance.

---
## Available Whisper model sizes:

| Available Model | Speed | Accuracy | Multilingual | Translation | Hardware Requirements | Best Use Case |
|-----------------|-------|----------|--------------|-------------|-----------------------|---------------|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | ~1GB VRAM | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | ~1GB VRAM | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | ~2GB VRAM | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | ~5GB VRAM | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Good overall accuracy & language support |
| large-v3 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Best overall accuracy & language support |
| large-v3-turbo | Fast | Excellent | Yes | No | ~6GB VRAM | Fast, high-quality transcription |
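As a worked example of reading this table, the sketch below picks the largest model that fits a given VRAM budget. The helper is hypothetical (not part of WhisperLiveKit); the numbers are copied from the table:

```python
# VRAM needs (GB) copied from the table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5,
           "large-v3-turbo": 6, "large-v2": 10, "large-v3": 10}

def largest_fitting_model(vram_gb: float, need_translation: bool = False) -> str:
    candidates = [m for m in VRAM_GB if VRAM_GB[m] <= vram_gb]
    if need_translation:
        # large-v3-turbo only transcribes, so exclude it for translation.
        candidates = [m for m in candidates if m != "large-v3-turbo"]
    if not candidates:
        raise ValueError("no model fits the given VRAM budget")
    # Prefer the largest model; break VRAM ties toward the newer name.
    return max(candidates, key=lambda m: (VRAM_GB[m], m))
```

With an 8 GB budget this picks `large-v3-turbo`, but `medium` once translation is required.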
### How to choose?

#### Language Support
- **English only**: Use `.en` (ex: `base.en`) models for better accuracy and faster processing when you only need English transcription
- **Multilingual**: Do not use `.en` models.

#### Special Cases
- **No translation needed**: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - **Important**: Does not translate correctly, only transcribes
### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting

**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model
_______________________
# Custom Models:

The `--model-path` parameter accepts:

## File Path
- **`.pt` / `.bin` / `.safetensor` formats**: must be loadable by PyTorch/safetensors.

## Directory Path (recommended)
Must contain:
- **`.pt` / `.bin` / `.safetensor` file** (required for the decoder)

May optionally contain:
- **`.bin` file** - faster-whisper model for the encoder (requires faster-whisper)
- **`weights.npz`** or **`weights.safetensors`** - for the encoder (requires whisper-mlx)

## Hugging Face Repo ID
- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first.

To improve speed and reduce hallucinations, you may want to use `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, and pass them to WLK with `--custom-alignment-heads`. Otherwise, alignment heads default to all heads in the last half of the decoder layers.
_______________________

# Translation Models and Backend

**Language Support**: ~200 languages

## Distilled Model Sizes Available

| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|-------|------|------------|-------------|-------------|---------|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
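Read as a sizing rule, the table suggests: prefer 1.3B when it fits, else fall back to 600M. A hypothetical sketch using the VRAM figures above (the helper is illustrative, not a WhisperLiveKit API):

```python
# Approximate VRAM (GB) per model and precision, from the table above.
NLLB_VRAM_GB = {
    ("600M", "fp16"): 1.5, ("600M", "int8"): 0.8,
    ("1.3B", "fp16"): 3.0, ("1.3B", "int8"): 1.5,
}

def pick_nllb_model(vram_gb: float, precision: str = "fp16") -> str:
    # 1.3B gives ~15-25% better BLEU, so take it whenever it fits.
    if NLLB_VRAM_GB[("1.3B", precision)] <= vram_gb:
        return "1.3B"
    if NLLB_VRAM_GB[("600M", precision)] <= vram_gb:
        return "600M"
    raise ValueError("not enough VRAM even for the 600M model")
```

Note that INT8 quantization lets the 1.3B model fit where FP16 only allows 600M.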
## Backend Performance

| Backend | Speed vs Base | Memory Usage | Quality Loss |
|---------|---------------|--------------|--------------|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |

**Metrics**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: Up to 2x faster than CTranslate2
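For real-time use, throughput translates directly into added latency. A back-of-envelope check using the figures above (illustrative arithmetic only):

```python
def translation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    # Time to translate a segment at a given decoding throughput.
    return n_tokens / tokens_per_sec

# A 30-token sentence at the low end of each range:
# CTranslate2 at 50 tok/s takes 0.6 s; Transformers at 10 tok/s takes 3.0 s.
```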
@@ -1,19 +0,0 @@
# Model Path Formats

The `--model-path` parameter accepts:

## File Path
- **`.pt` / `.bin` / `.safetensor` formats**: must be loadable by PyTorch/safetensors.

## Directory Path (recommended)
Must contain:
- **`.pt` / `.bin` / `.safetensor` file** (required for the decoder)

May optionally contain:
- **`.bin` file** - faster-whisper model for the encoder (requires faster-whisper)
- **`weights.npz`** or **`weights.safetensors`** - for the encoder (requires whisper-mlx)

## Hugging Face Repo ID
- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first.

To improve speed and reduce hallucinations, you may want to use `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, and pass them to WLK with `--custom-alignment-heads`. Otherwise, alignment heads default to all heads in the last half of the decoder layers.
@@ -1,6 +1,114 @@
# Supported Languages
# Transcription: Supported Languages

WhisperLiveKit supports translation into **201 languages** from the FLORES-200 dataset through the NLLB (No Language Left Behind) translation system.
WLK supports transcription in the following languages:

| ISO Code | Language Name |
|----------|---------------|
| en | English |
| zh | Chinese |
| de | German |
| es | Spanish |
| ru | Russian |
| ko | Korean |
| fr | French |
| ja | Japanese |
| pt | Portuguese |
| tr | Turkish |
| pl | Polish |
| ca | Catalan |
| nl | Dutch |
| ar | Arabic |
| sv | Swedish |
| it | Italian |
| id | Indonesian |
| hi | Hindi |
| fi | Finnish |
| vi | Vietnamese |
| he | Hebrew |
| uk | Ukrainian |
| el | Greek |
| ms | Malay |
| cs | Czech |
| ro | Romanian |
| da | Danish |
| hu | Hungarian |
| ta | Tamil |
| no | Norwegian |
| th | Thai |
| ur | Urdu |
| hr | Croatian |
| bg | Bulgarian |
| lt | Lithuanian |
| la | Latin |
| mi | Maori |
| ml | Malayalam |
| cy | Welsh |
| sk | Slovak |
| te | Telugu |
| fa | Persian |
| lv | Latvian |
| bn | Bengali |
| sr | Serbian |
| az | Azerbaijani |
| sl | Slovenian |
| kn | Kannada |
| et | Estonian |
| mk | Macedonian |
| br | Breton |
| eu | Basque |
| is | Icelandic |
| hy | Armenian |
| ne | Nepali |
| mn | Mongolian |
| bs | Bosnian |
| kk | Kazakh |
| sq | Albanian |
| sw | Swahili |
| gl | Galician |
| mr | Marathi |
| pa | Punjabi |
| si | Sinhala |
| km | Khmer |
| sn | Shona |
| yo | Yoruba |
| so | Somali |
| af | Afrikaans |
| oc | Occitan |
| ka | Georgian |
| be | Belarusian |
| tg | Tajik |
| sd | Sindhi |
| gu | Gujarati |
| am | Amharic |
| yi | Yiddish |
| lo | Lao |
| uz | Uzbek |
| fo | Faroese |
| ht | Haitian Creole |
| ps | Pashto |
| tk | Turkmen |
| nn | Nynorsk |
| mt | Maltese |
| sa | Sanskrit |
| lb | Luxembourgish |
| my | Myanmar |
| bo | Tibetan |
| tl | Tagalog |
| mg | Malagasy |
| as | Assamese |
| tt | Tatar |
| haw | Hawaiian |
| ln | Lingala |
| ha | Hausa |
| ba | Bashkir |
| jw | Javanese |
| su | Sundanese |
| yue | Cantonese |
# Translation: Supported Languages

WLK supports translation into **201 languages** from the FLORES-200 dataset through the [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) translation system.

## How to Specify Languages
@@ -40,4 +40,4 @@ This document introduce how to reuse the core components when you do **not** wan
3. Call `create_tasks()` to get the async generator, feed incoming bytes to `process_audio()`, and ensure `cleanup()` runs when the client disconnects.

If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently—just ensure `ffmpeg` is available or be ready to handle the `"ffmpeg_not_found"` error in the streamed `FrontData`.
If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently. Just ensure `ffmpeg` is available.
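The steps above can be wired into a FastAPI endpoint roughly as follows. This is a sketch, not WhisperLiveKit's reference implementation: `AudioProcessor`, `create_tasks()`, `process_audio()`, and `cleanup()` come from this document, while the import path, endpoint wiring, and result serialization are assumptions.

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from whisperlivekit import AudioProcessor  # assumed import path

app = FastAPI()

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    processor = AudioProcessor(pcm_input=False)   # compressed audio via ffmpeg
    results = await processor.create_tasks()      # async generator of results

    async def forward_results():
        # Push transcription updates to the client as they are produced.
        async for front_data in results:
            await websocket.send_json(front_data)  # serialization is an assumption

    sender = asyncio.create_task(forward_results())
    try:
        while True:
            await processor.process_audio(await websocket.receive_bytes())
    except WebSocketDisconnect:
        pass
    finally:
        sender.cancel()
        await processor.cleanup()                 # always release resources on disconnect
```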