From 345d781e97ef1380a7bcfdae1e2d28390b16e7ea Mon Sep 17 00:00:00 2001 From: Quentin Fuxa Date: Tue, 25 Nov 2025 23:20:00 +0100 Subject: [PATCH] update doc --- README.md | 6 +- docs/available_models.md | 109 ----------------------------- docs/default_and_custom_models.md | 106 ++++++++++++++++++++++++++++ docs/models_compatible_formats.md | 19 ----- docs/supported_languages.md | 112 +++++++++++++++++++++++++++++- docs/technical_integration.md | 2 +- 6 files changed, 220 insertions(+), 134 deletions(-) delete mode 100644 docs/available_models.md create mode 100644 docs/default_and_custom_models.md delete mode 100644 docs/models_compatible_formats.md diff --git a/README.md b/README.md index a0115e6..fb7a7fe 100644 --- a/README.md +++ b/README.md @@ -18,7 +18,7 @@ Real-time transcription directly to your browser, with a ready-to-use backend+se #### Powered by Leading Research: -- Simul-[Whisper](https://github.com/backspacetg/simul_whisper)/[Streaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408) +- Simul-[Whisper](https://arxiv.org/pdf/2406.10052)/[Streaming](https://arxiv.org/abs/2506.17077) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408) - [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) (2025), based on [distilled](https://huggingface.co/entai2965/nllb-200-distilled-600M-ctranslate2) [NLLB](https://arxiv.org/abs/2207.04672) (2022, 2024) - Simulatenous translation from & to 200 languages. 
- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription using [LocalAgreement policy](https://www.isca-archive.org/interspeech_2020/liu20s_interspeech.pdf)
- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
@@ -143,8 +143,8 @@ async def websocket_endpoint(websocket: WebSocket):
| Parameter | Description | Default |
|-----------|-------------|---------|
-| `--model` | Whisper model size. List and recommandations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/available_models.md) | `small` |
-| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommandations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/models_compatible_formats.md) | `None` |
+| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `small` |
+| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `None` |
| `--language` | List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If sets, translates using [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting). [200 languages available](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/supported_languages.md). If you want to translate to english, you can also use `--direct-english-translation`. The STT model will try to directly output the translation. 
| `None` | | `--diarization` | Enable speaker identification | `False` | diff --git a/docs/available_models.md b/docs/available_models.md deleted file mode 100644 index 6495fc5..0000000 --- a/docs/available_models.md +++ /dev/null @@ -1,109 +0,0 @@ -# Available Whisper model sizes: - -- tiny.en (english only) -- tiny -- base.en (english only) -- base -- small.en (english only) -- small -- medium.en (english only) -- medium -- large-v1 -- large-v2 -- large-v3 -- large-v3-turbo - -## How to choose? - -### Language Support -- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription -- **Multilingual**: Do not use `.en` models. - -### Resource Constraints -- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models - - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio - - `base`: Good balance of speed and accuracy for basic use cases - - `small`: Better accuracy while still being resource-efficient -- **Good resources available**: Use `large` models for best accuracy - - `large-v2`: Excellent accuracy, good multilingual support - - `large-v3`: Best overall accuracy and language support - -### Special Cases -- **No translation needed**: Use `large-v3-turbo` - - Same transcription quality as `large-v2` but significantly faster - - **Important**: Does not translate correctly, only transcribes - -### Model Comparison Table - -| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case | -|-------|--------|----------|--------------|-------------|---------------| -| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources | -| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance | -| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware | -| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources | -| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality | -| large-v3 
| Slowest | Excellent | Yes | Yes | Maximum accuracy | -| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription | - -### Additional Considerations - -**Model Performance**: -- Accuracy improves significantly from tiny to large models -- English-only models are ~10-15% more accurate for English audio -- Newer versions (v2, v3) have better punctuation and formatting - -**Hardware Requirements**: -- `tiny`: ~1GB VRAM -- `base`: ~1GB VRAM -- `small`: ~2GB VRAM -- `medium`: ~5GB VRAM -- `large`: ~10GB VRAM -- `large‑v3‑turbo`: ~6GB VRAM - -**Audio Quality Impact**: -- Clean, clear audio: smaller models may suffice -- Noisy, accented, or technical audio: larger models recommended -- Phone/low-quality audio: use at least `small` model - -### Quick Decision Tree -1. English only? → Add `.en` to your choice -2. Limited resources or need speed? → `small` or smaller -3. Good hardware and want best quality? → `large-v3` -4. Need fast, high-quality transcription without translation? → `large-v3-turbo` -5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo) - - -_______________________ - -# Translation Models and Backend - -**Language Support**: ~200 languages - -## Distilled Model Sizes Available - -| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality | -|-------|------|------------|-------------|-------------|---------| -| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable | -| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context | - -**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs. 
- -## Backend Performance - -| Backend | Speed vs Base | Memory Usage | Quality Loss | -|---------|---------------|--------------|--------------| -| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop | -| Transformers | Baseline | High | None | -| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None | - -**Metrics**: -- CTranslate2: 50-100+ tokens/sec -- Transformers: 10-30 tokens/sec -- Apple Silicon with MPS: Up to 2x faster than CTranslate2 - -## Quick Decision Matrix - -**Choose 600M**: Limited resources, close to 0 lag -**Choose 1.3B**: Quality matters -**Choose Transformers**: On Apple Silicon - diff --git a/docs/default_and_custom_models.md b/docs/default_and_custom_models.md new file mode 100644 index 0000000..824df40 --- /dev/null +++ b/docs/default_and_custom_models.md @@ -0,0 +1,106 @@ +# Models and Model Paths + +## Defaults + +**Default Whisper Model**: `base` +When no model is specified, WhisperLiveKit uses the `base` model, which provides a good balance of speed and accuracy for most use cases. + +**Default Model Cache Directory**: `~/.cache/whisper` +Models are automatically downloaded from OpenAI's model hub and cached in this directory. You can override this with `--model_cache_dir`. + +**Default Translation Model**: `600M` (NLLB-200-distilled) +When translation is enabled, the 600M distilled NLLB model is used by default. This provides good quality with minimal resource usage. + +**Default Translation Backend**: `transformers` +The translation backend defaults to Transformers. On Apple Silicon, this automatically uses MPS acceleration for better performance. 
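The defaults above compose in a simple precedence order: an explicit flag wins, otherwise the documented default applies. A minimal sketch of that logic for the cache directory — `resolve_cache_dir` is a hypothetical helper for illustration, not WhisperLiveKit's actual implementation:

```python
import os
from pathlib import Path
from typing import Optional

def resolve_cache_dir(cli_value: Optional[str] = None) -> Path:
    """Pick the model cache directory: --model_cache_dir wins, else ~/.cache/whisper."""
    if cli_value is not None:
        return Path(cli_value).expanduser()
    return Path(os.path.expanduser("~/.cache/whisper"))

print(resolve_cache_dir("/tmp/wlk-models"))  # an explicit flag overrides the default
print(resolve_cache_dir())                   # falls back to ~/.cache/whisper
```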
+ +--- + + +## Available Whisper model sizes: + +| Available Model | Speed | Accuracy | Multilingual | Translation | Hardware Requirements | Best Use Case | +|--------------------|----------|-----------|--------------|-------------|----------------------|----------------------------------| +| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | ~1GB VRAM | Real-time, low resources | +| base(.en) | Fast | Good | Yes/No | Yes/No | ~1GB VRAM | Balanced performance | +| small(.en) | Medium | Better | Yes/No | Yes/No | ~2GB VRAM | Quality on limited hardware | +| medium(.en) | Slow | High | Yes/No | Yes/No | ~5GB VRAM | High quality, moderate resources | +| large-v2 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Good overall accuracy & language support | +| large-v3 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Best overall accuracy & language support | +| large-v3-turbo | Fast | Excellent | Yes | No | ~6GB VRAM | Fast, high-quality transcription | + + +### How to choose? + +#### Language Support +- **English only**: Use `.en` (ex: `base.en`) models for better accuracy and faster processing when you only need English transcription +- **Multilingual**: Do not use `.en` models. 
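The table and the `.en` rule above can be condensed into code. A minimal sketch, assuming a hypothetical `choose_model` helper (not part of WhisperLiveKit) that encodes the documented guidance:

```python
def choose_model(english_only: bool, low_resources: bool,
                 need_translation: bool, best_quality: bool) -> str:
    """Map the documented guidance to a --model value. Illustrative only."""
    if low_resources:
        base = "small"           # quality on limited hardware (~2GB VRAM)
    elif need_translation:
        base = "large-v3"        # turbo does not translate correctly
    elif best_quality:
        base = "large-v3"        # best overall accuracy & language support
    else:
        base = "large-v3-turbo"  # fast, high-quality transcription only
    # per the table, .en variants exist only for tiny through medium
    if english_only and base in {"tiny", "base", "small", "medium"}:
        base += ".en"
    return base

print(choose_model(english_only=True, low_resources=True,
                   need_translation=False, best_quality=False))  # small.en
```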
+
+#### Special Cases
+- **No translation needed**: Use `large-v3-turbo`
+  - Same transcription quality as `large-v2` but significantly faster
+  - **Important**: Does not translate correctly, only transcribes
+
+### Additional Considerations
+
+**Model Performance**:
+- Accuracy improves significantly from tiny to large models
+- English-only models are ~10-15% more accurate for English audio
+- Newer versions (v2, v3) have better punctuation and formatting
+
+**Audio Quality Impact**:
+- Clean, clear audio: smaller models may suffice
+- Noisy, accented, or technical audio: larger models recommended
+- Phone/low-quality audio: use at least `small` model
+
+_______________________
+
+
+# Custom Models:
+
+The `--model-path` parameter accepts:
+
+## File Path
+- **`.pt` / `.bin` / `.safetensor` formats**: must be loadable by PyTorch/safetensors.
+
+## Directory Path (recommended)
+Must contain:
+- **`.pt` / `.bin` / `.safetensor` file** (required for decoder)
+
+May optionally contain:
+- **`.bin` file** - faster-whisper model for encoder (requires faster-whisper)
+- **`weights.npz`** or **`weights.safetensors`** - for encoder (requires whisper-mlx)
+
+## Hugging Face Repo ID
+- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first.
+
+To improve speed and reduce hallucinations, you can use `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, then pass them to WLK with `--custom-alignment-heads`. Otherwise, alignment heads default to all the heads in the last half of the decoder layers.
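The directory rules above amount to a small file resolver. A hedged sketch of that lookup — `find_decoder_checkpoint` and `DECODER_SUFFIXES` are illustrative names, not WhisperLiveKit's actual loader, and `.safetensors` is accepted here alongside `.safetensor`:

```python
from pathlib import Path
import tempfile

# Suffixes the decoder checkpoint may use, per the rules above (illustrative)
DECODER_SUFFIXES = {".pt", ".bin", ".safetensor", ".safetensors"}

def find_decoder_checkpoint(model_dir: str) -> str:
    """Return the first decoder weight file found in a --model-path directory."""
    for p in sorted(Path(model_dir).iterdir()):
        if p.suffix in DECODER_SUFFIXES:
            return p.name
    raise FileNotFoundError("no .pt/.bin/.safetensor file in model directory")

# Demo against a throwaway directory laid out as documented above
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "model.pt").touch()       # required decoder checkpoint
    (Path(d) / "weights.npz").touch()    # optional encoder weights (whisper-mlx)
    print(find_decoder_checkpoint(d))    # model.pt
```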
+ + +_______________________ + +# Translation Models and Backend + +**Language Support**: ~200 languages + +## Distilled Model Sizes Available + +| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality | +|-------|------|------------|-------------|-------------|---------| +| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable | +| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context | + +**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs. + +## Backend Performance + +| Backend | Speed vs Base | Memory Usage | Quality Loss | +|---------|---------------|--------------|--------------| +| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop | +| Transformers | Baseline | High | None | +| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None | + +**Metrics**: +- CTranslate2: 50-100+ tokens/sec +- Transformers: 10-30 tokens/sec +- Apple Silicon with MPS: Up to 2x faster than CTranslate2 diff --git a/docs/models_compatible_formats.md b/docs/models_compatible_formats.md deleted file mode 100644 index 2559129..0000000 --- a/docs/models_compatible_formats.md +++ /dev/null @@ -1,19 +0,0 @@ -# Model Path Formats - -The `--model-path` parameter accepts: - -## File Path -- **`.pt` / `.bin` / `.safetensor` formats** Should be openable by pytorch/safetensor. - -## Directory Path (recommended) -Must contain: -- **`.pt` / `.bin` / `.safetensor` file** (required for decoder) - -May optionally contain: -- **`.bin` file** - faster-whisper model for encoder (requires faster-whisper) -- **`weights.npz`** or **`weights.safetensors`** - for encoder (requires whisper-mlx) - -## Hugging Face Repo ID -- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first. 
-
-To improve speed/reduce allucinations, you may want to use `scripts/determine_alignment_heads.py` to determine the alignment heads to use for your model, and use the `--custom-alignment-heads` to pass them to WLK. If not, alignement heads are set to be all the heads of the last half layer of decoder.
diff --git a/docs/supported_languages.md b/docs/supported_languages.md
index e6a26f9..f1257b2 100644
--- a/docs/supported_languages.md
+++ b/docs/supported_languages.md
@@ -1,6 +1,114 @@
-# Supported Languages
+# Transcription: Supported Languages

-WhisperLiveKit supports translation into **201 languages** from the FLORES-200 dataset through the NLLB (No Language Left Behind) translation system.
+WLK supports transcription in the following languages:
+
+| ISO Code | Language Name |
+|----------|---------------------|
+| en | English |
+| zh | Chinese |
+| de | German |
+| es | Spanish |
+| ru | Russian |
+| ko | Korean |
+| fr | French |
+| ja | Japanese |
+| pt | Portuguese |
+| tr | Turkish |
+| pl | Polish |
+| ca | Catalan |
+| nl | Dutch |
+| ar | Arabic |
+| sv | Swedish |
+| it | Italian |
+| id | Indonesian |
+| hi | Hindi |
+| fi | Finnish |
+| vi | Vietnamese |
+| he | Hebrew |
+| uk | Ukrainian |
+| el | Greek |
+| ms | Malay |
+| cs | Czech |
+| ro | Romanian |
+| da | Danish |
+| hu | Hungarian |
+| ta | Tamil |
+| no | Norwegian |
+| th | Thai |
+| ur | Urdu |
+| hr | Croatian |
+| bg | Bulgarian |
+| lt | Lithuanian |
+| la | Latin |
+| mi | Maori |
+| ml | Malayalam |
+| cy | Welsh |
+| sk | Slovak |
+| te | Telugu |
+| fa | Persian |
+| lv | Latvian |
+| bn | Bengali |
+| sr | Serbian |
+| az | Azerbaijani |
+| sl | Slovenian |
+| kn | Kannada |
+| et | Estonian |
+| mk | Macedonian |
+| br | Breton |
+| eu | Basque |
+| is | Icelandic |
+| hy | Armenian |
+| ne | Nepali |
+| mn | Mongolian |
+| bs | Bosnian |
+| kk | Kazakh |
+| sq | Albanian |
+| sw | Swahili |
+| gl | Galician |
+| mr | Marathi |
+| pa | Punjabi |
+| si | Sinhala |
+| km | Khmer |
+| sn | Shona | +| yo | Yoruba | +| so | Somali | +| af | Afrikaans | +| oc | Occitan | +| ka | Georgian | +| be | Belarusian | +| tg | Tajik | +| sd | Sindhi | +| gu | Gujarati | +| am | Amharic | +| yi | Yiddish | +| lo | Lao | +| uz | Uzbek | +| fo | Faroese | +| ht | Haitian Creole | +| ps | Pashto | +| tk | Turkmen | +| nn | Nynorsk | +| mt | Maltese | +| sa | Sanskrit | +| lb | Luxembourgish | +| my | Myanmar | +| bo | Tibetan | +| tl | Tagalog | +| mg | Malagasy | +| as | Assamese | +| tt | Tatar | +| haw | Hawaiian | +| ln | Lingala | +| ha | Hausa | +| ba | Bashkir | +| jw | Javanese | +| su | Sundanese | +| yue | Cantonese | + + +# Translation: Supported Languages + +WLK supports translation into **201 languages** from the FLORES-200 dataset through the [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) translation system. ## How to Specify Languages diff --git a/docs/technical_integration.md b/docs/technical_integration.md index c8083d2..6f0e6d6 100644 --- a/docs/technical_integration.md +++ b/docs/technical_integration.md @@ -40,4 +40,4 @@ This document introduce how to reuse the core components when you do **not** wan 3. Call `create_tasks()` to get the async generator, `process_audio()` with incoming bytes, and ensure `cleanup()` runs when the client disconnects. -If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently—just ensure `ffmpeg` is available or be ready to handle the `"ffmpeg_not_found"` error in the streamed `FrontData`. \ No newline at end of file +If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently. Just ensure `ffmpeg` is available. \ No newline at end of file