Information on fine-tuning the model for new languages

#53
by gude - opened

Hi @nithinraok ,

I am planning to fine-tune the parakeet-tdt-0.6b-v2 model on a new language and would like to achieve WER < 5. Could you please share approximately how many hours of audio would be a good amount, and any suggestions for producing a high-quality fine-tuned model for new languages?

Thanks in advance

Hi,

I recently fine-tuned the parakeet-tdt-0.6b-v2 model on approximately 500 hours of Persian audio, using the training code from this repository (https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune). I experimented with two approaches:

1. Full fine-tuning of all 617M+ parameters
2. Freezing the encoder and training only the decoder (see the sketch below)
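
For reference, here is a minimal sketch of what the encoder-freezing setup looks like in NeMo. This is my own illustration rather than the exact script from the repository above; the tokenizer directory and manifest path are hypothetical placeholders:

```python
import pytorch_lightning as pl
from omegaconf import OmegaConf
import nemo.collections.asr as nemo_asr

# Load the released checkpoint.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# For a new language, the BPE vocabulary usually has to be rebuilt on
# target-language text first; this directory is a hypothetical placeholder.
model.change_vocabulary(
    new_tokenizer_dir="tokenizers/persian_spe_1024",
    new_tokenizer_type="bpe",
)

# Approach 2: freeze the encoder so only the decoder/joint are updated.
model.encoder.freeze()

# Point the model at the training manifest (hypothetical path); validation
# data setup is omitted here for brevity.
train_cfg = OmegaConf.create({
    "manifest_filepath": "manifests/train_persian.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_training_data(train_cfg)

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=50)
model.set_trainer(trainer)
trainer.fit(model)
```

Full fine-tuning (approach 1) is the same setup without the `freeze()` call.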

Surprisingly, the results in both cases have been quite poor: the transcriptions are barely intelligible and far below what I'd expect, especially given the scale of the dataset and training effort.

Since 500 hours is a substantial amount of data, I expected at least reasonable performance. I’m now wondering whether there might be issues in the training process, or perhaps limitations in the model when applied to non-English or low-resource languages.
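
One thing worth ruling out first (my own suggestion, not something covered in the repository above): the released checkpoint ships an English BPE tokenizer, so if the vocabulary was never rebuilt for Persian, the model cannot emit most target-language characters regardless of how much audio it sees. A quick round-trip test makes this visible; the sample sentence below is just a placeholder:

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Round-trip a target-language sentence through the shipped tokenizer.
sample = "این یک جملهٔ آزمایشی است"  # hypothetical Persian test sentence
ids = model.tokenizer.text_to_ids(sample)
print(ids)
print(model.tokenizer.ids_to_text(ids))
# If the decoded text comes back mostly as <unk> tokens or with characters
# dropped, the vocabulary never covered Persian and must be rebuilt
# (e.g. via change_vocabulary) before fine-tuning can converge.
```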

I’d greatly appreciate any guidance, suggestions, or insights from others who have worked with this model in similar settings.
