---
language:
  - cs
  - en
tags:
  - audio
  - automatic-speech-recognition
  - ctc
  - wav2vec2-bert
  - czech
license: mit
datasets:
  - common-voice
metrics:
  - wer
---

# mitkaj/w2v2BERT-CZ-CV-17.0

This is a fine-tuned Wav2Vec2BERT model for Czech Automatic Speech Recognition (ASR) using CTC loss.

## Model Details

- **Base Model:** [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0)
- **Architecture:** `Wav2Vec2BertForCTC`
- **Training:** Fine-tuned on the Czech Common Voice 17.0 dataset
- **Loss Function:** CTC (Connectionist Temporal Classification)
- **Vocab Size:** 51 tokens (see the config check below)
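
To double-check the architecture and vocabulary size listed above, the public `AutoConfig` API can be used:

```python
from transformers import AutoConfig

# Fetch only the model configuration (no weights download)
config = AutoConfig.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")
print(config.model_type)  # "wav2vec2-bert"
print(config.vocab_size)  # 51
```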

## Training Summary

- **Training Epochs:** 19.97
- **Final Training Loss:** 0.0305
- **Final Evaluation Loss:** 0.1450
- **Final WER:** 0.0583 (5.83%)
- **Total Training Time:** 5.1 hours
- **Total Training FLOPs:** 79,819,834,495,052,513,280 (≈ 8.0 × 10¹⁹)

## Usage

```python
from transformers import AutoProcessor, AutoModelForCTC
import torch
import librosa

# Load model and processor
processor = AutoProcessor.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")
model = AutoModelForCTC.from_pretrained("mitkaj/w2v2BERT-CZ-CV-17.0")

# Load audio as a 1-D float array at 16 kHz (the sampling rate the model expects)
audio, _ = librosa.load("sample.wav", sr=16000)

# Extract input features
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Get logits
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy CTC decoding: take the most likely token at each frame
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
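
If the repository ships the processor and tokenizer files (as `AutoProcessor` above assumes), the generic ASR `pipeline` offers a shorter route; `sample.wav` is a placeholder path:

```python
from transformers import pipeline

# The pipeline handles feature extraction, inference, and CTC decoding internally
asr = pipeline("automatic-speech-recognition", model="mitkaj/w2v2BERT-CZ-CV-17.0")
result = asr("sample.wav")  # any audio file readable by ffmpeg; resampled to 16 kHz
print(result["text"])
```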

## Training

This model was fine-tuned with the CTC objective on Czech speech data from the Common Voice corpus; a minimal initialization sketch is shown below.
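
For reference, this is a typical way to initialize the base checkpoint with a fresh CTC head of the size reported above. It is only a sketch: the actual training hyperparameters and data pipeline are not documented on this card, and `ctc_loss_reduction="mean"` is an assumed (though common) choice.

```python
from transformers import Wav2Vec2BertForCTC

# Attach a randomly initialized 51-token CTC head to the pretrained encoder
model = Wav2Vec2BertForCTC.from_pretrained(
    "facebook/w2v-bert-2.0",
    vocab_size=51,
    ctc_loss_reduction="mean",
)
print(model.lm_head)  # Linear(in_features=1024, out_features=51, bias=True)
```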

## Performance

The model was evaluated on Czech test data using the word error rate (WER) metric.
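
The WER reported above can be reproduced for any set of transcripts with the `evaluate` library; the strings below are illustrative placeholders, not data from the actual test set:

```python
import evaluate

wer_metric = evaluate.load("wer")
predictions = ["dobrý den jak se máte"]  # model transcriptions
references = ["dobrý den jak se máte"]   # ground-truth transcripts

# WER = (substitutions + insertions + deletions) / reference word count
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.4f}")  # this card reports 0.0583 on the Czech test data
```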

## Citation

If you use this model, please cite the original Wav2Vec2BERT paper and this fine-tuned version.