Information on fine-tuning the model for new languages
Hi @nithinraok,
I am planning to fine-tune the parakeet-tdt-0.6b-v2 model on a new language and would like to achieve a WER below 5. Could you please give me an idea of approximately how many hours of audio would be a good amount, along with any suggestions for producing a high-quality fine-tuned model for new languages?
Thanks in advance
Hi,
I recently fine-tuned the parakeet-tdt-0.6b-v2 model on approximately 500 hours of Persian audio, using the training code from this repository (https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune). I experimented with two approaches:
1. Full fine-tuning of all 617M+ parameters
2. Freezing the encoder and training only the decoder (roughly as sketched below)
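For reference, the frozen-encoder setup looked roughly like this. This is a minimal sketch assuming a recent NeMo release; the manifest paths, batch size, and trainer settings are placeholders, not my exact values:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Load the pretrained checkpoint from the Hugging Face Hub.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Freeze the encoder so only the decoder/joint layers receive gradients.
model.encoder.freeze()

model.setup_training_data(train_data_config={
    "manifest_filepath": "train_manifest.json",  # placeholder NeMo manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data(val_data_config={
    "manifest_filepath": "val_manifest.json",    # placeholder NeMo manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10, precision="bf16-mixed")
trainer.fit(model)
```

The full fine-tuning run was the same, just without the `model.encoder.freeze()` call.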
Surprisingly, the results in both cases have been quite poor: the transcriptions are barely intelligible and far below what I'd expect, especially given the scale of the dataset and training effort.
Since 500 hours is a substantial amount of data, I expected at least reasonable performance. I’m now wondering whether there might be issues in the training process, or perhaps limitations in the model when applied to non-English or low-resource languages.
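One thing I'm now double-checking is the tokenizer. As far as I can tell, the released checkpoint ships with an English SentencePiece vocabulary, so Persian script cannot even be represented unless the vocabulary is replaced before training. If I read the NeMo docs correctly, the swap looks roughly like this, where "persian_tokenizer_dir" is a placeholder for a tokenizer directory built from my Persian transcripts (e.g. with NeMo's scripts/tokenizers/process_asr_text_tokenizer.py):

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Replace the English BPE vocabulary with one built from Persian text.
# "persian_tokenizer_dir" is a placeholder; it should contain the
# SentencePiece tokenizer.model produced offline.
model.change_vocabulary(
    new_tokenizer_dir="persian_tokenizer_dir",
    new_tokenizer_type="bpe",
)
```

As I understand it, `change_vocabulary` reinitializes the prediction/joint networks to match the new vocabulary, so those layers need training in any case. If anyone can confirm whether skipping or misconfiguring this step would explain the unintelligible output, that would help a lot.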
I’d greatly appreciate any guidance, suggestions, or insights from others who have worked with this model in similar settings.