Information on fine-tuning the model for new languages
Hi @nithinraok,
I am planning to fine-tune the parakeet-tdt-0.6b-v2 model on a new language and would like to achieve a WER below 5. Could you please give me an idea of approximately how many hours of audio would be a good amount, along with any suggestions for producing a high-quality fine-tuned model for new languages?
Thanks in advance
Hi,
I recently fine-tuned the parakeet-tdt-0.6b-v2 model on approximately 500 hours of Persian audio, using the training code from this repository (https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune). I experimented with two approaches:
1. Full fine-tuning of all 617M+ parameters
2. Freezing the encoder and training only the decoder (roughly as sketched below)
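For reference, the frozen-encoder setup looked roughly like this. This is a minimal sketch assuming a recent NeMo release; the manifest paths, batch size, and trainer settings are placeholders, not my exact values:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Load the pretrained checkpoint from the Hugging Face Hub.
model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Freeze the encoder so only the decoder/joint layers receive gradients.
model.encoder.freeze()

model.setup_training_data(train_data_config={
    "manifest_filepath": "train_manifest.json",  # placeholder NeMo manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data(val_data_config={
    "manifest_filepath": "val_manifest.json",    # placeholder NeMo manifest
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=10, precision="bf16-mixed")
trainer.fit(model)
```

The full fine-tuning run was the same, just without the `model.encoder.freeze()` call.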
Surprisingly, the results in both cases have been quite poor: the transcriptions are barely intelligible and far below what I'd expect, especially given the scale of the dataset and training effort.
Since 500 hours is a substantial amount of data, I expected at least reasonable performance. I’m now wondering whether there might be issues in the training process, or perhaps limitations in the model when applied to non-English or low-resource languages.
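One thing I'm now double-checking is the tokenizer. As far as I can tell, the released checkpoint ships with an English SentencePiece vocabulary, so Persian script cannot even be represented unless the vocabulary is replaced before training. If I read the NeMo docs correctly, the swap looks roughly like this, where "persian_tokenizer_dir" is a placeholder for a tokenizer directory built from my Persian transcripts (e.g. with NeMo's scripts/tokenizers/process_asr_text_tokenizer.py):

```python
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Replace the English BPE vocabulary with one built from Persian text.
# "persian_tokenizer_dir" is a placeholder; it should contain the
# SentencePiece tokenizer.model produced offline.
model.change_vocabulary(
    new_tokenizer_dir="persian_tokenizer_dir",
    new_tokenizer_type="bpe",
)
```

As I understand it, `change_vocabulary` reinitializes the prediction/joint networks to match the new vocabulary, so those layers need training in any case. If anyone can confirm whether skipping or misconfiguring this step would explain the unintelligible output, that would help a lot.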
I’d greatly appreciate any guidance, suggestions, or insights from others who have worked with this model in similar settings.