Fine-tuning Kitten TTS Nano for another language
Hi and thanks for sharing this great model!
I'm interested in fine-tuning Kitten TTS for Persian (Farsi) audio tasks, and I’d appreciate your guidance on a few key points:
Is fine-tuning for a new language like Persian supported or practical with this model?
Given that it's trained on English, I’d like to know how transferable the learned representations are to a different language, especially a low-resource one.
Roughly how much audio data would be needed for meaningful fine-tuning on Persian?
I understand it depends on the task and setup, but a ballpark estimate would help a lot (e.g., hours of audio, number of samples, etc.).
Are there any recommended training settings or constraints (batch size, LR, augmentation, etc.) that you found important when fine-tuning this architecture?
Does the model architecture support freezing early layers, or is end-to-end fine-tuning preferable?
Finally, do you provide or suggest any starter scripts, notebooks, or best practices for fine-tuning?
I’d really appreciate any help or pointers. Thank you in advance for your work and time!
I would also be interested in this, please advise. I have a dataset.
Me too!
Me too.