mirror of
https://github.com/QuentinFuxa/WhisperLiveKit.git
synced 2026-03-07 14:23:18 +00:00
update doc
@@ -18,7 +18,7 @@ Real-time transcription directly to your browser, with a ready-to-use backend+se
#### Powered by Leading Research:

- Simul-[Whisper](https://github.com/backspacetg/simul_whisper)/[Streaming](https://github.com/ufal/SimulStreaming) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
- Simul-[Whisper](https://arxiv.org/pdf/2406.10052)/[Streaming](https://arxiv.org/abs/2506.17077) (SOTA 2025) - Ultra-low latency transcription using [AlignAtt policy](https://arxiv.org/pdf/2305.11408)
- [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) (2025), based on [distilled](https://huggingface.co/entai2965/nllb-200-distilled-600M-ctranslate2) [NLLB](https://arxiv.org/abs/2207.04672) (2022, 2024) - Simultaneous translation from & to 200 languages.
- [WhisperStreaming](https://github.com/ufal/whisper_streaming) (SOTA 2023) - Low latency transcription using [LocalAgreement policy](https://www.isca-archive.org/interspeech_2020/liu20s_interspeech.pdf)
- [Streaming Sortformer](https://arxiv.org/abs/2507.18446) (SOTA 2025) - Advanced real-time speaker diarization
@@ -143,8 +143,8 @@ async def websocket_endpoint(websocket: WebSocket):
| Parameter | Description | Default |
|-----------|-------------|---------|
| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/available_models.md) | `small` |
| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/models_compatible_formats.md) | `None` |
| `--model` | Whisper model size. List and recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `small` |
| `--model-path` | Local .pt file/directory **or** Hugging Face repo ID containing the Whisper model. Overrides `--model`. Recommendations [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/default_and_custom_models.md) | `None` |
| `--language` | List [here](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/whisperlivekit/whisper/tokenizer.py). If you use `auto`, the model attempts to detect the language automatically, but it tends to bias towards English. | `auto` |
| `--target-language` | If set, translates using [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting). [200 languages available](https://github.com/QuentinFuxa/WhisperLiveKit/blob/main/docs/supported_languages.md). To translate to English, you can also use `--direct-english-translation`; the STT model will then try to output the translation directly. | `None` |
| `--diarization` | Enable speaker identification | `False` |
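Combined, the flags above form a typical launch command. A minimal sketch, assuming the server is started via the `whisperlivekit-server` entry point (adjust to your installation):

```bash
# Hypothetical invocation; all flags come from the parameter table above.
whisperlivekit-server --model small --language en --diarization
```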
@@ -1,109 +0,0 @@
# Available Whisper model sizes:

- tiny.en (english only)
- tiny
- base.en (english only)
- base
- small.en (english only)
- small
- medium.en (english only)
- medium
- large-v1
- large-v2
- large-v3
- large-v3-turbo
## How to choose?

### Language Support
- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription
- **Multilingual**: Do not use `.en` models.

### Resource Constraints
- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models
  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
  - `base`: Good balance of speed and accuracy for basic use cases
  - `small`: Better accuracy while still being resource-efficient
- **Good resources available**: Use `large` models for best accuracy
  - `large-v2`: Excellent accuracy, good multilingual support
  - `large-v3`: Best overall accuracy and language support

### Special Cases
- **No translation needed**: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - **Important**: Does not translate correctly, only transcribes
### Model Comparison Table

| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
|-------|--------|----------|--------------|-------------|---------------|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |
### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting

**Hardware Requirements**:
- `tiny`: ~1GB VRAM
- `base`: ~1GB VRAM
- `small`: ~2GB VRAM
- `medium`: ~5GB VRAM
- `large`: ~10GB VRAM
- `large-v3-turbo`: ~6GB VRAM
**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model

### Quick Decision Tree
1. English only? → Add `.en` to your choice
2. Limited resources or need speed? → `small` or smaller
3. Good hardware and want best quality? → `large-v3`
4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
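The decision tree above can be encoded as a small helper. This is an illustrative sketch, not a WhisperLiveKit API; the function name and arguments are hypothetical:

```python
# Hypothetical helper mirroring the decision tree above.
def pick_whisper_model(english_only: bool = False,
                       limited_resources: bool = False,
                       need_translation: bool = False) -> str:
    if limited_resources:
        model = "small"              # step 2: small or smaller
    elif need_translation:
        model = "large-v3"           # step 5: avoid turbo, it cannot translate
    else:
        model = "large-v3-turbo"     # step 4: fast, high-quality transcription
    if english_only and not model.startswith("large"):
        model += ".en"               # step 1: .en variants exist up to medium only
    return model
```

For example, `pick_whisper_model(english_only=True, limited_resources=True)` returns `"small.en"`.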
_______________________
# Translation Models and Backend

**Language Support**: ~200 languages

## Distilled Model Sizes Available

| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|-------|------|------------|-------------|-------------|---------|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
## Backend Performance

| Backend | Speed vs Base | Memory Usage | Quality Loss |
|---------|---------------|--------------|--------------|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |

**Metrics**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: Up to 2x faster than CTranslate2
## Quick Decision Matrix

**Choose 600M**: Limited resources, close to zero added lag
**Choose 1.3B**: Quality matters
**Choose Transformers**: On Apple Silicon
106
docs/default_and_custom_models.md
Normal file
@@ -0,0 +1,106 @@
# Models and Model Paths

## Defaults

**Default Whisper Model**: `base`
When no model is specified, WhisperLiveKit uses the `base` model, which provides a good balance of speed and accuracy for most use cases.

**Default Model Cache Directory**: `~/.cache/whisper`
Models are automatically downloaded from OpenAI's model hub and cached in this directory. You can override this with `--model_cache_dir`.

**Default Translation Model**: `600M` (NLLB-200-distilled)
When translation is enabled, the 600M distilled NLLB model is used by default. This provides good quality with minimal resource usage.

**Default Translation Backend**: `transformers`
The translation backend defaults to Transformers. On Apple Silicon, this automatically uses MPS acceleration for better performance.

---
## Available Whisper model sizes:

| Available Model | Speed | Accuracy | Multilingual | Translation | Hardware Requirements | Best Use Case |
|-----------------|-------|----------|--------------|-------------|-----------------------|---------------|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | ~1GB VRAM | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | ~1GB VRAM | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | ~2GB VRAM | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | ~5GB VRAM | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Good overall accuracy & language support |
| large-v3 | Slowest | Excellent | Yes | Yes | ~10GB VRAM | Best overall accuracy & language support |
| large-v3-turbo | Fast | Excellent | Yes | No | ~6GB VRAM | Fast, high-quality transcription |
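As a worked example of reading this table, the sketch below picks the largest model that fits a given VRAM budget. The helper is hypothetical (not part of WhisperLiveKit); the numbers are copied from the table:

```python
# VRAM needs (GB) copied from the table above.
VRAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5,
           "large-v3-turbo": 6, "large-v2": 10, "large-v3": 10}

def largest_fitting_model(vram_gb: float, need_translation: bool = False) -> str:
    candidates = [m for m in VRAM_GB if VRAM_GB[m] <= vram_gb]
    if need_translation:
        # large-v3-turbo only transcribes, so exclude it for translation.
        candidates = [m for m in candidates if m != "large-v3-turbo"]
    if not candidates:
        raise ValueError("no model fits the given VRAM budget")
    # Prefer the largest model; break VRAM ties toward the newer name.
    return max(candidates, key=lambda m: (VRAM_GB[m], m))
```

With an 8 GB budget this picks `large-v3-turbo`, but `medium` once translation is required.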
### How to choose?

#### Language Support
- **English only**: Use `.en` (ex: `base.en`) models for better accuracy and faster processing when you only need English transcription
- **Multilingual**: Do not use `.en` models.

#### Special Cases
- **No translation needed**: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - **Important**: Does not translate correctly, only transcribes
### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting

**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model
_______________________
# Custom Models:

The `--model-path` parameter accepts:

## File Path
- **`.pt` / `.bin` / `.safetensor` formats**: must be loadable by PyTorch/safetensors.

## Directory Path (recommended)
Must contain:
- **`.pt` / `.bin` / `.safetensor` file** (required for the decoder)

May optionally contain:
- **`.bin` file** - faster-whisper model for the encoder (requires faster-whisper)
- **`weights.npz`** or **`weights.safetensors`** - for the encoder (requires whisper-mlx)

## Hugging Face Repo ID
- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first.

To improve speed and reduce hallucinations, you may want to use `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, and pass them to WLK with `--custom-alignment-heads`. Otherwise, alignment heads default to all heads in the last half of the decoder layers.
_______________________

# Translation Models and Backend

**Language Support**: ~200 languages

## Distilled Model Sizes Available

| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|-------|------|------------|-------------|-------------|---------|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |

**Quality Impact**: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
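Read as a sizing rule, the table suggests: prefer 1.3B when it fits, else fall back to 600M. A hypothetical sketch using the VRAM figures above (the helper is illustrative, not a WhisperLiveKit API):

```python
# Approximate VRAM (GB) per model and precision, from the table above.
NLLB_VRAM_GB = {
    ("600M", "fp16"): 1.5, ("600M", "int8"): 0.8,
    ("1.3B", "fp16"): 3.0, ("1.3B", "int8"): 1.5,
}

def pick_nllb_model(vram_gb: float, precision: str = "fp16") -> str:
    # 1.3B gives ~15-25% better BLEU, so take it whenever it fits.
    if NLLB_VRAM_GB[("1.3B", precision)] <= vram_gb:
        return "1.3B"
    if NLLB_VRAM_GB[("600M", precision)] <= vram_gb:
        return "600M"
    raise ValueError("not enough VRAM even for the 600M model")
```

Note that INT8 quantization lets the 1.3B model fit where FP16 only allows 600M.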
## Backend Performance

| Backend | Speed vs Base | Memory Usage | Quality Loss |
|---------|---------------|--------------|--------------|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |

**Metrics**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: Up to 2x faster than CTranslate2
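For real-time use, throughput translates directly into added latency. A back-of-envelope check using the figures above (illustrative arithmetic only):

```python
def translation_seconds(n_tokens: int, tokens_per_sec: float) -> float:
    # Time to translate a segment at a given decoding throughput.
    return n_tokens / tokens_per_sec

# A 30-token sentence at the low end of each range:
# CTranslate2 at 50 tok/s takes 0.6 s; Transformers at 10 tok/s takes 3.0 s.
```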
@@ -1,19 +0,0 @@
# Model Path Formats

The `--model-path` parameter accepts:

## File Path
- **`.pt` / `.bin` / `.safetensor` formats**: must be loadable by PyTorch/safetensors.

## Directory Path (recommended)
Must contain:
- **`.pt` / `.bin` / `.safetensor` file** (required for the decoder)

May optionally contain:
- **`.bin` file** - faster-whisper model for the encoder (requires faster-whisper)
- **`weights.npz`** or **`weights.safetensors`** - for the encoder (requires whisper-mlx)

## Hugging Face Repo ID
- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate via `huggingface-cli login` first.

To improve speed and reduce hallucinations, you may want to use `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, and pass them to WLK with `--custom-alignment-heads`. Otherwise, alignment heads default to all heads in the last half of the decoder layers.
@@ -1,6 +1,114 @@
# Supported Languages
# Transcription: Supported Languages

WhisperLiveKit supports translation into **201 languages** from the FLORES-200 dataset through the NLLB (No Language Left Behind) translation system.
WLK supports transcription in the following languages:

| ISO Code | Language Name |
|----------|---------------|
| en | English |
| zh | Chinese |
| de | German |
| es | Spanish |
| ru | Russian |
| ko | Korean |
| fr | French |
| ja | Japanese |
| pt | Portuguese |
| tr | Turkish |
| pl | Polish |
| ca | Catalan |
| nl | Dutch |
| ar | Arabic |
| sv | Swedish |
| it | Italian |
| id | Indonesian |
| hi | Hindi |
| fi | Finnish |
| vi | Vietnamese |
| he | Hebrew |
| uk | Ukrainian |
| el | Greek |
| ms | Malay |
| cs | Czech |
| ro | Romanian |
| da | Danish |
| hu | Hungarian |
| ta | Tamil |
| no | Norwegian |
| th | Thai |
| ur | Urdu |
| hr | Croatian |
| bg | Bulgarian |
| lt | Lithuanian |
| la | Latin |
| mi | Maori |
| ml | Malayalam |
| cy | Welsh |
| sk | Slovak |
| te | Telugu |
| fa | Persian |
| lv | Latvian |
| bn | Bengali |
| sr | Serbian |
| az | Azerbaijani |
| sl | Slovenian |
| kn | Kannada |
| et | Estonian |
| mk | Macedonian |
| br | Breton |
| eu | Basque |
| is | Icelandic |
| hy | Armenian |
| ne | Nepali |
| mn | Mongolian |
| bs | Bosnian |
| kk | Kazakh |
| sq | Albanian |
| sw | Swahili |
| gl | Galician |
| mr | Marathi |
| pa | Punjabi |
| si | Sinhala |
| km | Khmer |
| sn | Shona |
| yo | Yoruba |
| so | Somali |
| af | Afrikaans |
| oc | Occitan |
| ka | Georgian |
| be | Belarusian |
| tg | Tajik |
| sd | Sindhi |
| gu | Gujarati |
| am | Amharic |
| yi | Yiddish |
| lo | Lao |
| uz | Uzbek |
| fo | Faroese |
| ht | Haitian Creole |
| ps | Pashto |
| tk | Turkmen |
| nn | Nynorsk |
| mt | Maltese |
| sa | Sanskrit |
| lb | Luxembourgish |
| my | Myanmar |
| bo | Tibetan |
| tl | Tagalog |
| mg | Malagasy |
| as | Assamese |
| tt | Tatar |
| haw | Hawaiian |
| ln | Lingala |
| ha | Hausa |
| ba | Bashkir |
| jw | Javanese |
| su | Sundanese |
| yue | Cantonese |
# Translation: Supported Languages

WLK supports translation into **201 languages** from the FLORES-200 dataset through the [NLLW](https://github.com/QuentinFuxa/NoLanguageLeftWaiting) translation system.

## How to Specify Languages
@@ -40,4 +40,4 @@ This document introduce how to reuse the core components when you do **not** wan
3. Call `create_tasks()` to get the async generator, feed incoming bytes to `process_audio()`, and ensure `cleanup()` runs when the client disconnects.

If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently—just ensure `ffmpeg` is available or be ready to handle the `"ffmpeg_not_found"` error in the streamed `FrontData`.
If you prefer to send compressed audio, instantiate `AudioProcessor(pcm_input=False)` and pipe encoded chunks through `FFmpegManager` transparently. Just ensure `ffmpeg` is available.
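The steps above can be wired into a FastAPI endpoint roughly as follows. This is a sketch, not WhisperLiveKit's reference implementation: `AudioProcessor`, `create_tasks()`, `process_audio()`, and `cleanup()` come from this document, while the import path, endpoint wiring, and result serialization are assumptions.

```python
import asyncio
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from whisperlivekit import AudioProcessor  # assumed import path

app = FastAPI()

@app.websocket("/asr")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    processor = AudioProcessor(pcm_input=False)   # compressed audio via ffmpeg
    results = await processor.create_tasks()      # async generator of results

    async def forward_results():
        # Push transcription updates to the client as they are produced.
        async for front_data in results:
            await websocket.send_json(front_data)  # serialization is an assumption

    sender = asyncio.create_task(forward_results())
    try:
        while True:
            await processor.process_audio(await websocket.receive_bytes())
    except WebSocketDisconnect:
        pass
    finally:
        sender.cancel()
        await processor.cleanup()                 # always release resources on disconnect
```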