---
language:
- cs
- en
tags:
- audio
- automatic-speech-recognition
- ctc
- wav2vec2-bert
- czech
license: mit
datasets:
- mozilla-foundation/common_voice_17_0
metrics:
- wer
---

# mitkaj/w2v2BERT-CZ-CV-17.0

This is a fine-tuned Wav2Vec2BERT model for Czech Automatic Speech Recognition (ASR), trained with the CTC loss.

## Model Details

- **Base Model**: facebook/w2v-bert-2.0
- **Architecture**: Wav2Vec2BertForCTC
- **Training**: Fine-tuned on the Czech Common Voice dataset
- **Loss Function**: CTC (Connectionist Temporal Classification)
- **Vocab Size**: 51 tokens

## Training Summary

- **Training Epochs**: 19.97
- **Final Training Loss**: 0.0305
- **Final Evaluation Loss**: 0.1450
- **Final WER**: 0.0583 (5.83%)
- **Total Training Time**: 5.1 hours
- **Total FLOPs**: 79,819,834,495,052,513,280 (≈ 8.0 × 10^19)

## Usage

```python
from transformers import AutoProcessor, AutoModelForCTC
import torch

# Load model and processor
processor = AutoProcessor.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")
model = AutoModelForCTC.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")

# Process audio (`audio` is a 1-D float array sampled at 16 kHz)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Get logits
with torch.no_grad():
    logits = model(**inputs).logits

# Decode: greedy argmax over frames, then collapse repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
```
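
For a quick end-to-end check, `audio` can be taken straight from Common Voice. A minimal sketch, assuming access to the gated mozilla-foundation/common_voice_17_0 dataset on the Hub (its terms must be accepted first):

```python
from datasets import Audio, load_dataset

# Stream a single Czech test clip instead of downloading the full split
ds = load_dataset("mozilla-foundation/common_voice_17_0", "cs", split="test", streaming=True)
ds = ds.cast_column("audio", Audio(sampling_rate=16000))  # resample to the model's 16 kHz
sample = next(iter(ds))
audio = sample["audio"]["array"]  # 1-D float array for the snippet above
```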

## Training

This model was fine-tuned on Czech speech data with the CTC objective: the pretrained facebook/w2v-bert-2.0 encoder is topped with a linear CTC head over the 51-token vocabulary.
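
As a minimal sketch of how such a fine-tune is typically initialized with transformers (not the exact training script; the `ctc_loss_reduction` setting is an illustrative assumption):

```python
from transformers import Wav2Vec2BertForCTC, Wav2Vec2BertProcessor

# Reuse this repo's processor (feature extractor + 51-token CTC tokenizer)
processor = Wav2Vec2BertProcessor.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")

# Load the pretrained encoder and attach a freshly initialized CTC head
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    ctc_loss_reduction="mean",  # assumed; "sum" is the library default
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),  # 51 tokens per the model card
)
```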

## Performance

The model was evaluated on held-out Czech test data using the Word Error Rate (WER) metric, reaching a final WER of 5.83% (see Training Summary above).
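
WER can be computed with the evaluate library. A minimal sketch, assuming `predictions` and `references` are parallel lists of hypothesis and ground-truth strings:

```python
import evaluate

# WER = (substitutions + insertions + deletions) / reference word count
wer_metric = evaluate.load("wer")

predictions = ["dobrý den"]       # illustrative model output
references = ["dobrý den světe"]  # illustrative ground truth
print(wer_metric.compute(predictions=predictions, references=references))  # 0.333...
```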

## Citation

If you use this model, please cite the original Wav2Vec2BERT paper and this fine-tuned model.