---
library_name: transformers
language:
- fa
license: mit
base_model: openai/whisper-large-v3-turbo
tags:
- whisper
- whisper-large-v3
- persian
- farsi
- speech-recognition
- asr
- automatic-speech-recognition
- audio
- transformers
- generated_from_trainer
- h100
- huggingface
- vhdm
datasets:
- vhdm/persian-voice-v1.1
metrics:
- wer
model-index:
- name: vhdm/whisper-large-fa-v1
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: vhdm/persian-voice-v1
      type: vhdm/persian-voice-v1.1
      args: 'config: fa, split: test'
    metrics:
    - name: Wer
      type: wer
      value: 14.065335753176045
---
# 📢 vhdm/whisper-large-fa-v1

🎧 **Fine-tuned Whisper Large V3 Turbo for Persian Speech Recognition**

This model is a fine-tuned version of [`openai/whisper-large-v3-turbo`](https://huggingface.co/openai/whisper-large-v3-turbo), trained specifically on high-quality Persian speech data from the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset.

---

## 🧪 Evaluation Results

| Metric | Value |
|--------|-------|
| **Final Validation Loss** | 0.1445 |
| **Word Error Rate (WER)** | **14.07%** |

On clean Persian speech data, WER drops from 22.07% at step 1000 to 14.07% at step 5000, with a brief regression at step 4000.
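
WER is conventionally the word-level edit distance between the model output and the reference transcript, divided by the number of reference words. The sketch below is a minimal pure-Python illustration of that definition. The card does not show the evaluation script, so the whitespace tokenization, the absence of text normalization, and the Persian example sentences are all assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, not taken from the dataset.
reference = "من امروز به مدرسه رفتم"
hypothesis = "من امروز مدرسه رفتم"
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

Here the hypothesis drops one of the five reference words, so the script prints a WER of 20.00%. Libraries such as `jiwer` or 🤗 `evaluate` implement the same metric and are likely a better fit for real evaluations.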

---

## 🧠 Model Description

This model aims to bring high-accuracy **automatic speech recognition (ASR)** to the Persian language using the Whisper architecture. By combining OpenAI's Whisper Large V3 Turbo backbone with carefully curated Persian data, it transcribes clean Persian audio at roughly 14% WER.

---

## ✅ Intended Use

This model is best suited for:

- 📱 Transcribing Persian voice notes
- 🗣️ Real-time or batch ASR for Persian podcasts, videos, and interviews
- 🔍 Creating searchable transcripts of Persian audio content
- 🧩 Fine-tuning or domain adaptation for Persian speech tasks

### 🚫 Limitations

- The model is fine-tuned on clean audio from specific sources and may perform poorly on noisy, accented, or dialectal speech.
- It is not optimized for real-time streaming ASR, although inference is fast.
- It may occasionally hallucinate, producing plausible but incorrect words, a common issue in Whisper models.

---

## 📚 Training Data

The model was trained on the [`vhdm/persian-voice-v1`](https://huggingface.co/datasets/vhdm/persian-voice-v1) dataset, a curated collection of Persian speech recordings with high-quality transcriptions.

---

## ⚙️ Training Procedure

- **Optimizer**: AdamW (`betas=(0.9, 0.999)`, `eps=1e-08`)
- **Learning Rate**: 1e-5
- **Batch Sizes**: 16 (train), 8 (eval)
- **Scheduler**: Linear with 500 warmup steps
- **Mixed Precision**: Native AMP (automatic mixed precision)
- **Seed**: 42
- **Training Steps**: 5000
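
To make the scheduler setting concrete, the following sketch computes the learning rate that a linear schedule with warmup would produce at a given step. It follows the shape of the `get_linear_schedule_with_warmup` helper in 🤗 Transformers, but it is an illustration under assumptions: the card does not say that the rate decays all the way to zero at step 5000, and the function name `linear_schedule_lr` is hypothetical.

```python
def linear_schedule_lr(step: int,
                       peak_lr: float = 1e-5,
                       warmup_steps: int = 500,
                       total_steps: int = 5000) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: ramp linearly from peak_lr down to 0 at total_steps
    # (assumed end point; the card does not state it).
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 1000, 3000, 5000):
    print(f"step {step:>4}: lr = {linear_schedule_lr(step):.2e}")
```

Under these assumptions the rate peaks at 1e-5 at step 500 and has fallen to half of that by step 2750, midway through the decay phase.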

---

## ⏱️ Training Time & Hardware

The model was trained on an **NVIDIA H100 GPU**, and the full fine-tuning run took approximately **20 hours**.

---

## 📈 Training Progress

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.2190 | 0.2093 | 22.07 |
| 2000 | 0.1191 | 0.1698 | 17.85 |
| 3000 | 0.1051 | 0.1485 | 15.79 |
| 4000 | 0.0644 | 0.1530 | 16.03 |
| 5000 | 0.0289 | 0.1445 | **14.07** |
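
As a quick way to read the table, the short script below picks the checkpoint with the lowest WER, computes the relative WER reduction since step 1000, and lists the steps where WER got worse. The numbers are copied from the table; the variable names are illustrative only and are not part of the training code.

```python
# (step, training loss, validation loss, WER %) copied from the table above.
checkpoints = [
    (1000, 0.2190, 0.2093, 22.07),
    (2000, 0.1191, 0.1698, 17.85),
    (3000, 0.1051, 0.1485, 15.79),
    (4000, 0.0644, 0.1530, 16.03),
    (5000, 0.0289, 0.1445, 14.07),
]

# Checkpoint with the lowest WER.
best_step, _, _, best_wer = min(checkpoints, key=lambda row: row[3])

# Relative WER reduction between the first logged checkpoint and the best one.
first_wer = checkpoints[0][3]
relative_reduction = (first_wer - best_wer) / first_wer

# Steps where WER was higher than at the previous logged checkpoint.
regressions = [
    curr[0]
    for prev, curr in zip(checkpoints, checkpoints[1:])
    if curr[3] > prev[3]
]

print(f"best checkpoint: step {best_step} ({best_wer:.2f}% WER)")
print(f"relative WER reduction since step 1000: {relative_reduction:.1%}")
print(f"WER regressions at steps: {regressions}")
```

The final checkpoint is also the best one, with a relative WER reduction of about 36% since step 1000, and step 4000 is the only logged regression.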

---

## 🧰 Framework Versions

- `transformers`: 4.52.4
- `torch`: 2.7.1+cu118
- `datasets`: 3.6.0
- `tokenizers`: 0.21.1

---

## 🚀 Try it out

You can load and test the model using 🤗 Transformers:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint into an ASR pipeline.
pipe = pipeline("automatic-speech-recognition", model="vhdm/whisper-large-fa-v1")

# Transcribe a local audio file and print the recognized text.
result = pipe("path_to_persian_audio.wav")
print(result["text"])
```