Fine-tuning openaudio-s1-mini for Persian (Farsi) – Feasibility & Guidance

#4
by arshambz - opened

Hi and thanks for sharing this great model!

I'm interested in fine-tuning fishaudio/openaudio-s1-mini for Persian (Farsi) audio tasks, such as speech recognition or audio classification. I’d appreciate your guidance on a few key points:

Is fine-tuning for a new language like Persian supported or practical with this model?
Given that it's trained in English, I’d like to know how transferable the learned representations are to a different language, especially a low-resource one.

Roughly how much audio data would be needed for a meaningful fine-tuning on Persian?
I understand it depends on the task and setup, but a ballpark estimate would help a lot (e.g., hours of audio, number of samples, etc.).

Are there any recommended training settings or constraints (batch size, LR, augmentation, etc.) that you found important when fine-tuning this architecture?

Does the model architecture support freezing early layers, or is end-to-end fine-tuning preferable?

Finally, do you provide or suggest any starter scripts, notebooks, or best practices for fine-tuning?

I’d really appreciate any help or pointers. Thank you in advance for your work and time!

Fish Audio org
edited Jun 10

WIP, maybe we'll update the finetune part in July.

Any update on how to finetune the s1-mini model?
The previous docs on finetuning are no longer available. I tried updating the code to make it work (the finetuning code there was written for fish-speech 1.5) but couldn't get it working. It would be nice to be able to finetune it, especially for languages where the accent is incorrect (Russian, Japanese). It's a great model for its size, to be honest, and I'd love to push it further.
