mirror of
https://github.com/QuentinFuxa/WhisperLiveKit.git
synced 2026-03-07 14:23:18 +00:00
# Available Whisper model sizes

- tiny.en (English only)
- tiny
- base.en (English only)
- base
- small.en (English only)
- small
- medium.en (English only)
- medium
- large-v1
- large-v2
- large-v3
- large-v3-turbo

## How to choose?

### Language Support

- **English only**: Use `.en` models for better accuracy and faster processing when you only need English transcription
- **Multilingual**: Use the standard models (no `.en` suffix); the English-only variants cannot transcribe other languages

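
The `.en` naming rule is easy to wrap in a helper. A minimal sketch (the function name and set are ours, not part of any Whisper API); note that the `large-*` models have no English-only variants:

```python
# English-only (".en") variants exist only for tiny through medium.
_EN_VARIANTS = {"tiny", "base", "small", "medium"}

def whisper_model_name(size: str, english_only: bool = False) -> str:
    """Return the model identifier, appending ".en" when an
    English-only variant exists and is requested."""
    if english_only and size in _EN_VARIANTS:
        return f"{size}.en"
    return size
```

For example, `whisper_model_name("base", english_only=True)` yields `base.en`, while requesting English-only for `large-v3` falls back to the multilingual name.
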
### Resource Constraints

- **Limited GPU/CPU or need for very low latency**: Choose `small` or smaller models
  - `tiny`: Fastest, lowest resource usage, acceptable quality for simple audio
  - `base`: Good balance of speed and accuracy for basic use cases
  - `small`: Better accuracy while still being resource-efficient
- **Good resources available**: Use `large` models for best accuracy
  - `large-v2`: Excellent accuracy, good multilingual support
  - `large-v3`: Best overall accuracy and language support

### Special Cases

- **No translation needed**: Use `large-v3-turbo`
  - Same transcription quality as `large-v2` but significantly faster
  - **Important**: `large-v3-turbo` only transcribes; it does not translate reliably

### Model Comparison Table

| Model | Speed | Accuracy | Multilingual | Translation | Best Use Case |
|-------|-------|----------|--------------|-------------|---------------|
| tiny(.en) | Fastest | Basic | Yes/No | Yes/No | Real-time, low resources |
| base(.en) | Fast | Good | Yes/No | Yes/No | Balanced performance |
| small(.en) | Medium | Better | Yes/No | Yes/No | Quality on limited hardware |
| medium(.en) | Slow | High | Yes/No | Yes/No | High quality, moderate resources |
| large-v2 | Slowest | Excellent | Yes | Yes | Best overall quality |
| large-v3 | Slowest | Excellent | Yes | Yes | Maximum accuracy |
| large-v3-turbo | Fast | Excellent | Yes | No | Fast, high-quality transcription |

`Yes/No` means yes for the multilingual model, no for its `.en` variant.

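
The table above can also be kept as data, which makes it easy to query programmatically. A sketch (the structure and field names are ours, not from any library):

```python
# Each entry mirrors a row of the comparison table above
# (multilingual variants only; ".en" variants never translate).
MODELS = [
    {"name": "tiny",           "speed": "fastest", "translation": True},
    {"name": "base",           "speed": "fast",    "translation": True},
    {"name": "small",          "speed": "medium",  "translation": True},
    {"name": "medium",         "speed": "slow",    "translation": True},
    {"name": "large-v2",       "speed": "slowest", "translation": True},
    {"name": "large-v3",       "speed": "slowest", "translation": True},
    {"name": "large-v3-turbo", "speed": "fast",    "translation": False},
]

def translation_capable() -> list[str]:
    """Names of models that support the translate task."""
    return [m["name"] for m in MODELS if m["translation"]]
```

This keeps `large-v3-turbo` out of any translation workflow automatically.
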
### Additional Considerations

**Model Performance**:
- Accuracy improves significantly from tiny to large models
- English-only models are ~10-15% more accurate for English audio
- Newer versions (v2, v3) have better punctuation and formatting

**Hardware Requirements**:
- `tiny`: ~1GB VRAM
- `base`: ~1GB VRAM
- `small`: ~2GB VRAM
- `medium`: ~5GB VRAM
- `large`: ~10GB VRAM
- `large-v3-turbo`: ~6GB VRAM

**Audio Quality Impact**:
- Clean, clear audio: smaller models may suffice
- Noisy, accented, or technical audio: larger models recommended
- Phone/low-quality audio: use at least the `small` model

### Quick Decision Tree

1. English only? → Add `.en` to your choice
2. Limited resources or need speed? → `small` or smaller
3. Good hardware and want best quality? → `large-v3`
4. Need fast, high-quality transcription without translation? → `large-v3-turbo`
5. Need translation capabilities? → `large-v2` or `large-v3` (avoid turbo)

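
The decision tree above can be sketched as one function. This is one possible encoding (names and flags are ours; steps 3-4 collapse into a default of `large-v3-turbo` when no translation is needed):

```python
def pick_whisper_model(
    english_only: bool = False,
    limited_resources: bool = False,
    need_translation: bool = False,
) -> str:
    """Walk the quick decision tree and return a model name."""
    if limited_resources:
        model = "small"
    elif need_translation:
        # large-v3-turbo does not translate; stay on large-v3.
        model = "large-v3"
    else:
        model = "large-v3-turbo"
    # ".en" variants exist only for tiny through medium.
    if english_only and model in {"tiny", "base", "small", "medium"}:
        model += ".en"
    return model
```

For example, English-only on limited hardware yields `small.en`, while the unconstrained default is `large-v3-turbo`.
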
_______________________
# Translation Models and Backend

**Language Support**: ~200 languages

## Distilled Model Sizes Available

| Model | Size on disk | Parameters | VRAM (FP16) | VRAM (INT8) | Quality |
|-------|--------------|------------|-------------|-------------|---------|
| 600M | 2.46 GB | 600M | ~1.5GB | ~800MB | Good, understandable |
| 1.3B | 5.48 GB | 1.3B | ~3GB | ~1.5GB | Better accuracy and context |

**Quality Impact**: The 1.3B model scores roughly 15-25% higher BLEU than the 600M model across language pairs.

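
The VRAM columns follow from simple parameter arithmetic: bytes per parameter (2 for FP16, 1 for INT8) times parameter count, plus runtime overhead. A back-of-envelope sketch (the 1.25x overhead factor is our assumption, not a measured value):

```python
def estimate_vram_gb(params: float, bytes_per_param: int,
                     overhead: float = 1.25) -> float:
    """Rough weights-based VRAM estimate in GB, scaled by an
    assumed overhead factor for activations and buffers."""
    return params * bytes_per_param * overhead / 1e9
```

With these numbers, 600M at FP16 comes to ~1.5 GB and 1.3B at INT8 to ~1.6 GB, close to the table above.
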
## Backend Performance

| Backend | Speed vs. baseline | Memory Usage | Quality Loss |
|---------|--------------------|--------------|--------------|
| CTranslate2 | 6-10x faster | 40-60% less | ~5% BLEU drop |
| Transformers | Baseline | High | None |
| Transformers + MPS (Apple Silicon) | 2x faster | Medium | None |

**Metrics**:
- CTranslate2: 50-100+ tokens/sec
- Transformers: 10-30 tokens/sec
- Apple Silicon with MPS: up to 2x faster than CTranslate2

## Quick Decision Matrix

- **Choose 600M**: Limited resources, or when near-zero latency matters most
- **Choose 1.3B**: When translation quality matters and resources allow
- **Choose Transformers (with MPS)**: On Apple Silicon
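
The matrix condenses into a small helper (function and parameter names are ours; backend labels are illustrative):

```python
def pick_translation_setup(limited_resources: bool,
                           apple_silicon: bool) -> tuple[str, str]:
    """Return (model_size, backend) per the decision matrix above:
    600M when resources are tight, 1.3B otherwise; Transformers
    with MPS on Apple Silicon, CTranslate2 elsewhere."""
    size = "600M" if limited_resources else "1.3B"
    backend = "transformers+mps" if apple_silicon else "ctranslate2"
    return size, backend
```

For example, a well-resourced Apple Silicon machine gets `("1.3B", "transformers+mps")`.
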