mirror of https://github.com/QuentinFuxa/WhisperLiveKit.git (synced 2026-03-07)
Available Whisper model sizes:
- tiny.en (english only)
- tiny
- base.en (english only)
- base
- small.en (english only)
- small
- medium.en (english only)
- medium
- large-v1
- large-v2
- large-v3
- large-v3-turbo
How to choose?
Language Support
- English only: Use `.en` models for better accuracy and faster processing when you only need English transcription
- Multilingual: Do not use `.en` models
Resource Constraints
- Limited GPU/CPU or need for very low latency: Choose `small` or smaller models
  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
  - `base`: Good balance of speed and accuracy for basic use cases
  - `small`: Better accuracy while still being resource-efficient
- Good resources available: Use `large` models for best accuracy
  - `large-v2`: Excellent accuracy, good multilingual support
  - `large-v3`: Best overall accuracy and language support
Special Cases
- No translation needed: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - Important: Does not translate correctly, only transcribes
Model Comparison Table
| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
|---|---|---|---|---|---|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |
Additional Considerations
Model Performance:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting
Hardware Requirements:
- `tiny`: ~1GB VRAM
- `base`: ~1GB VRAM
- `small`: ~2GB VRAM
- `medium`: ~5GB VRAM
- `large`: ~10GB VRAM
- `large-v3-turbo`: ~6GB VRAM
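The VRAM figures above can drive automatic model selection. The helper below is an illustrative sketch (not part of WhisperLiveKit) that picks the most accurate model fitting a given VRAM budget, using the approximate numbers from this list:

```python
# Approximate VRAM needs in GB, taken from the list above (illustrative).
MODEL_VRAM_GB = {
    "tiny": 1, "base": 1, "small": 2,
    "medium": 5, "large-v3": 10, "large-v3-turbo": 6,
}

def largest_model_that_fits(available_vram_gb: float) -> str:
    """Return the most accurate model whose VRAM estimate fits the budget."""
    # Ordered from most to least accurate (turbo outranks medium per the table).
    preference = ["large-v3", "large-v3-turbo", "medium", "small", "base", "tiny"]
    for name in preference:
        if MODEL_VRAM_GB[name] <= available_vram_gb:
            return name
    return "tiny"  # fall back to the smallest model
```

For example, an 8GB GPU cannot hold `large-v3` (~10GB) but can run `large-v3-turbo` (~6GB), so the helper returns the turbo variant.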
Audio Quality Impact:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model
Quick Decision Tree
- English only? → Add `.en` to your choice
- Limited resources or need speed? → `small` or smaller
- Good hardware and want best quality? → `large-v3`
- Need fast, high-quality transcription without translation? → `large-v3-turbo`
- Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)
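The decision tree above can be expressed as a small function. This is a sketch of the selection logic only (the function name and boolean flags are illustrative, not a WhisperLiveKit API):

```python
def choose_whisper_model(english_only: bool,
                         limited_resources: bool,
                         need_translation: bool,
                         need_speed: bool = False) -> str:
    """Pick a Whisper model name following the decision tree above."""
    if limited_resources:
        model = "small"           # or smaller, per the tree
    elif need_translation:
        model = "large-v3"        # avoid turbo when translating
    elif need_speed:
        model = "large-v3-turbo"  # fast transcription, no translation
    else:
        model = "large-v3"        # best quality on good hardware
    # English-only variants exist only for tiny through medium.
    if english_only and model in ("tiny", "base", "small", "medium"):
        model += ".en"
    return model
```

So an English-only setup on limited hardware resolves to `small.en`, while a multilingual setup needing translation resolves to `large-v3`.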
Translation Models and Backend
Language Support: ~200 languages
Distilled Model Sizes Available
| Model | Size | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|---|---|---|---|---|---|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy, context |
Quality Impact: 1.3B has ~15-25% better BLEU scores vs 600M across language pairs.
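The VRAM columns above follow a common rule of thumb: weight memory is roughly parameter count times bytes per parameter (2 for FP16, 1 for INT8), plus runtime overhead. The estimator below is a sketch; the ~25% overhead factor is an assumption chosen to match the table, not a measured constant:

```python
def vram_estimate_gb(params_millions: float, bytes_per_param: float,
                     overhead: float = 1.25) -> float:
    """Rough VRAM estimate: weights * precision width * runtime overhead.

    overhead=1.25 is an assumed ~25% allowance for activations and buffers.
    """
    return params_millions * 1e6 * bytes_per_param * overhead / 1e9

# 600M at FP16 (2 bytes/param) -> ~1.5 GB, matching the table above.
# 1.3B at FP16 -> ~3.25 GB; 600M at INT8 (1 byte/param) -> ~0.75 GB.
```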
Backend Performance
| Backend | Speed vs Base | Memory Usage | Quality Loss |
|---|---|---|---|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (on Apple Silicon) | 2x faster | Medium | None |
Metrics:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: Up to 2x faster than CTranslate2
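The throughput figures above translate directly into per-caption latency, which is what matters for live use. A minimal sketch (the token counts and rates below are illustrative, taken from the ranges listed):

```python
def translation_latency_s(n_tokens: int, tokens_per_sec: float) -> float:
    """Seconds to decode a translated segment at a given throughput."""
    return n_tokens / tokens_per_sec

# A 30-token caption at ~75 tokens/sec (CTranslate2 range): 0.4 s.
# The same caption at ~20 tokens/sec (Transformers range): 1.5 s.
```

At live-captioning pace, that difference is the gap between subtitles that keep up and subtitles that lag behind the speaker.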
Quick Decision Matrix
- Choose 600M: Limited resources, close to zero lag
- Choose 1.3B: Quality matters
- Choose Transformers: On Apple Silicon