# Models and Model Paths

## Defaults

**Default Whisper Model**: `base`

When no model is specified, WhisperLiveKit uses the `base` model, which provides a good balance of speed and accuracy for most use cases.

**Default Model Cache Directory**: `~/.cache/whisper`

Models are automatically downloaded from OpenAI's model hub and cached in this directory. You can override this with `--model_cache_dir`.

**Default Translation Model**: `600M` (NLLB-200-distilled)

When translation is enabled, the 600M distilled NLLB model is used by default. It provides good quality with minimal resource usage.

**Default Translation Backend**: `transformers`

The translation backend defaults to Transformers. On Apple Silicon, it automatically uses MPS acceleration for better performance.

---

## Available Whisper Model Sizes

| Model          | Speed   | Accuracy  | Multilingual | Translation | Hardware Requirements | Best Use Case                            |
|----------------|---------|-----------|--------------|-------------|-----------------------|------------------------------------------|
| tiny(.en)      | Fastest | Basic     | Yes/No       | Yes/No      | ~1 GB VRAM            | Real-time, low resources                 |
| base(.en)      | Fast    | Good      | Yes/No       | Yes/No      | ~1 GB VRAM            | Balanced performance                     |
| small(.en)     | Medium  | Better    | Yes/No       | Yes/No      | ~2 GB VRAM            | Quality on limited hardware              |
| medium(.en)    | Slow    | High      | Yes/No       | Yes/No      | ~5 GB VRAM            | High quality, moderate resources         |
| large-v2       | Slowest | Excellent | Yes          | Yes         | ~10 GB VRAM           | Good overall accuracy & language support |
| large-v3       | Slowest | Excellent | Yes          | Yes         | ~10 GB VRAM           | Best overall accuracy & language support |
| large-v3-turbo | Fast    | Excellent | Yes          | No          | ~6 GB VRAM            | Fast, high-quality transcription         |

For the `(.en)` rows, `Yes/No` means the multilingual model supports the feature and the English-only `.en` variant does not.

### How to choose?

#### Language Support

- **English only**: use `.en` models (e.g. `base.en`) for better accuracy and faster processing when you only need English transcription.
- **Multilingual**: use the standard models; `.en` variants cannot transcribe other languages.

#### Special Cases

- **No translation needed**: use `large-v3-turbo`. It offers the same transcription quality as `large-v2` but is significantly faster.
- **Important**: `large-v3-turbo` does not translate correctly, it only transcribes.

### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models.
- English-only models are ~10-15% more accurate for English audio.
- Newer versions (v2, v3) have better punctuation and formatting.

**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice.
- Noisy, accented, or technical audio: larger models are recommended.
- Phone or other low-quality audio: use at least the `small` model.

_______________________

# Custom Models

The `--model-path` parameter accepts:

## File Path

- **`.pt` / `.bin` / `.safetensors` file**: must be openable by PyTorch or safetensors.

## Directory Path (recommended)

Must contain:
- a **`.pt` / `.bin` / `.safetensors` file** (required, used for the decoder)

May optionally contain:
- a **`.bin` file**: a faster-whisper model used for the encoder (requires faster-whisper)
- **`weights.npz`** or **`weights.safetensors`**: used for the encoder (requires whisper-mlx)

## Hugging Face Repo ID

- Provide the repo ID (e.g. `openai/whisper-large-v3`) and WhisperLiveKit will download and cache the snapshot automatically. For gated repos, authenticate with `huggingface-cli login` first.

To improve speed and reduce hallucinations, you can run `scripts/determine_alignment_heads.py` to determine the alignment heads for your model, then pass them to WLK with `--custom-alignment-heads`. If you don't, the alignment heads default to all heads in the last half of the decoder layers.
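As a quick way to confirm that a weight file satisfies the "openable by PyTorch or safetensors" requirement before handing it to `--model-path`, a minimal sketch (the file path below is a hypothetical example):

```python
# Minimal sanity check: confirm a checkpoint loads with the matching library.
# The path passed at the bottom is a hypothetical example.
from pathlib import Path

import torch
from safetensors.torch import load_file

def check_weights(path: str) -> None:
    p = Path(path)
    if p.suffix == ".safetensors":
        tensors = load_file(str(p))                   # safetensors checkpoint
    else:                                             # .pt / .bin: PyTorch checkpoint
        tensors = torch.load(str(p), map_location="cpu")
    print(f"{p.name}: loaded {len(tensors)} entries")

check_weights("./models/whisper-base.pt")
```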
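If you would rather manage downloads yourself, you can fetch a Hugging Face snapshot ahead of time and point `--model-path` at the resulting directory, as described above. A sketch using `huggingface_hub` (the local directory name is an arbitrary choice):

```python
# Pre-download a model snapshot, then pass the directory via --model-path.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="openai/whisper-large-v3",      # any repo ID, as described above
    local_dir="./models/whisper-large-v3",  # arbitrary local destination
)
print(f"Pass this directory to --model-path: {local_dir}")
```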
_______________________

# Translation Models and Backend

**Language Support**: ~200 languages

## Available Distilled Model Sizes

| Model | Disk Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality                     |
|-------|-----------|------------|-------------|-------------|-----------------------------|
| 600M  | 2.46 GB   | 600M       | ~1.5 GB     | ~800 MB     | Good, understandable        |
| 1.3B  | 5.48 GB   | 1.3B       | ~3 GB       | ~1.5 GB     | Better accuracy and context |

**Quality Impact**: the 1.3B model scores ~15-25% higher BLEU than the 600M model across language pairs.

## Backend Performance

| Backend                               | Speed vs. Baseline | Memory Usage | Quality Loss  |
|---------------------------------------|--------------------|--------------|---------------|
| CTranslate2                           | 6-10x faster       | 40-60% less  | ~5% BLEU drop |
| Transformers                          | Baseline           | High         | None          |
| Transformers + MPS (on Apple Silicon) | 2x faster          | Medium       | None          |

**Throughput**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: up to 2x faster than CTranslate2
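For reference, translating with the default 600M distilled NLLB model through the Transformers backend looks roughly like the standalone sketch below (independent of WhisperLiveKit's internals; the input sentence and language codes are examples). Note how the device is chosen so that MPS is used on Apple Silicon when available:

```python
# Standalone sketch: NLLB-200-distilled-600M via Transformers, preferring MPS
# on Apple Silicon. Language codes and input text are examples.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="fra_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Prefer MPS on Apple Silicon, then CUDA, then CPU.
if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"
model = model.to(device)

inputs = tokenizer("Bonjour tout le monde.", return_tensors="pt").to(device)
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```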
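The CTranslate2 numbers above assume the NLLB model has been converted to CTranslate2 format beforehand (e.g. with `ct2-transformers-converter`). A rough sketch of the same translation through CTranslate2 with INT8 quantization, where the converted-model directory name is an assumption:

```python
# Sketch: NLLB via CTranslate2 with INT8 weights. Assumes the model was
# converted first, e.g.:
#   ct2-transformers-converter --model facebook/nllb-200-distilled-600M \
#       --output_dir nllb-200-distilled-600M-ct2 --quantization int8
import ctranslate2
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="fra_Latn"
)
translator = ctranslate2.Translator(
    "nllb-200-distilled-600M-ct2", device="cpu", compute_type="int8"
)

source = tokenizer.convert_ids_to_tokens(tokenizer.encode("Bonjour tout le monde."))
results = translator.translate_batch([source], target_prefix=[["eng_Latn"]])
target_tokens = results[0].hypotheses[0][1:]  # drop the leading language token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target_tokens)))
```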