Hyperparameters used in the training
Hello,
may I ask for how many epochs you trained the model, and what hyperparameters (learning rate, etc.) you used during training?
Thank you!
Hello viewegger 👋,
- Epochs: 1
- Batch size: 8 (with gradient accumulation of 4)
- Learning rate: 2e-5
- Warmup steps: 2000
- Evaluation: every 1000 steps (500 would be better, but the result was great, so no need to change it.)
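In case it helps, here's a rough sketch of how that schedule looks in plain PyTorch; `model`, `train_loader`, and `evaluate` are placeholders, and the actual fine-tuning script in the repo wires this up itself:

```python
# A minimal sketch of the settings above: 1 epoch, batch size 8,
# gradient accumulation 4 (effective batch 32), lr 2e-5,
# 2000 warmup steps, evaluation every 1000 optimizer steps.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

WARMUP_STEPS = 2000
ACCUM_STEPS = 4        # gradient accumulation
EVAL_EVERY = 1000      # optimizer steps between evaluations

def warmup(step: int) -> float:
    # Linear warm-up for the first 2000 optimizer steps, then constant lr.
    return min(1.0, (step + 1) / WARMUP_STEPS)

def train_one_epoch(model, train_loader, evaluate, device="cuda"):
    optimizer = AdamW(model.parameters(), lr=2e-5)
    scheduler = LambdaLR(optimizer, lr_lambda=warmup)
    model.train()
    optimizer.zero_grad()
    opt_step = 0
    for i, batch in enumerate(train_loader):          # DataLoader(batch_size=8)
        # Assumes a HF-style model whose forward output exposes .loss
        loss = model(**{k: v.to(device) for k, v in batch.items()}).loss
        (loss / ACCUM_STEPS).backward()                # scale for accumulation
        if (i + 1) % ACCUM_STEPS == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            opt_step += 1
            if opt_step % EVAL_EVERY == 0:
                evaluate(model)
```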
Hope this helps.
Thank you Kevin for such a quick answer!
May I ask you - did you use https://github.com/stlohrey/chatterbox-finetuning repo for training?
And are 200 hours really enough to get good model quality without word repetition/skipping? (That would be impressive considering only 200 hours of data.)
Also, is the model quality really good? (I don't speak Korean, so I can't really say...)
And seeing that you also trained the French model with 1200 hours of data, would you say the French one is better?
I would like to try training a Swedish version of the model :)
I don't speak any Korean, so I can't tell you, but the model sounded decent after a few listens.
200 hours is enough, given that it corresponds to 72,000 samples (roughly 10 seconds per clip), enough to teach the model what these 4,000 new tokens represent.
As for French (which I speak), I find it very decent and natural; the WER on about twenty samples was 0% (if we ignore Whisper errors...).
In terms of expressiveness, it was very decent, and I'd say it's on par with Fish-Speech 1.5.
1,400 hours is a lot and more than enough.
To train your Swedish model, here are my tips:
- Don't reinvent the base tokenizer; just merge it with one you've trained yourself (obviously BPE, otherwise it gets complicated, and for Latin scripts there's nothing better...). Otherwise it's as if the model were starting from scratch (and that's 100,000 hours, at least that's what was used to pre-train F5-TTS). See the sketch after this list.
- 500 hours (180,000 samples) seems about right if you want to be sure of the quality; ultimately it depends on the quality of your data.
- 1x RTX 6000 Ada seems to be the best hardware.
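A rough illustration of the merge idea, using the Hugging Face tokenizers library; the file names, corpus, and 4,000-token budget are placeholders, and the real merge also has to keep the model's embedding table in sync:

```python
# Sketch: train a small BPE tokenizer on target-language text, then append
# only the tokens the base tokenizer is missing, so the original token ids
# (and the model's learned embeddings) stay valid.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# 1. Train a language-specific BPE tokenizer (hypothetical Swedish corpus file).
sv_tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
sv_tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=4000, special_tokens=["[UNK]"])
sv_tokenizer.train(files=["swedish_corpus.txt"], trainer=trainer)

# 2. Load the base tokenizer and append only the new tokens.
base_tokenizer = Tokenizer.from_file("chatterbox_tokenizer.json")
existing = set(base_tokenizer.get_vocab().keys())
new_tokens = [t for t in sv_tokenizer.get_vocab() if t not in existing]
base_tokenizer.add_tokens(new_tokens)
base_tokenizer.save("chatterbox_tokenizer_sv.json")

# The model's text embedding table then needs to be resized to
# base_tokenizer.get_vocab_size() before fine-tuning.
```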
If all this seems complicated, I can take care of it for you; I'm working on it right now :)
Give me a contact if you're interested.
Really thank you for the answer!
I have around 1000 hours of data, so hopefully it will be enough to get decent quality...
Regarding the tokenizer, I will just replace some rarely used tokens with the missing Swedish letters, to preserve most of the existing knowledge.
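Roughly what I have in mind (file names and the list of expendable tokens are placeholders; I'd pick them based on frequency counts over a large corpus):

```python
# Sketch of swapping rarely used tokens for missing Swedish letters in a
# tokenizer.json-style vocab, so their ids (and embedding rows) are reused.
import json

with open("chatterbox_tokenizer.json", encoding="utf-8") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]                      # token -> id mapping
swedish = ["å", "ä", "ö", "Å", "Ä", "Ö"]           # letters missing from the vocab
expendable = ["<rare_1>", "<rare_2>", "<rare_3>",  # placeholder names for tokens
              "<rare_4>", "<rare_5>", "<rare_6>"]  # I consider safe to drop

for old, new in zip(expendable, swedish):
    vocab[new] = vocab.pop(old)                    # reuse the old token's id

# Caveat: any BPE merge rules that reference a removed token would also
# need to be dropped from tok["model"]["merges"].
tok["model"]["vocab"] = vocab
with open("chatterbox_tokenizer_sv.json", "w", encoding="utf-8") as f:
    json.dump(tok, f, ensure_ascii=False, indent=2)
```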
Just to be sure - did you use the https://github.com/stlohrey/chatterbox-finetuning repo for training? I am just asking whether there is some better implementation...
Thank you for your help once again.
Yes, I used that one for French.
David Browne's implementation also seems correct (it favors a GRPO loss): https://github.com/davidbrowne17/chatterbox-streaming
If you found my help valuable, don't hesitate to like the repo to give it visibility :)
Thank you I didn't catch that one!
So I will give it a try. I won't be able to share the dataset because it's based on company phone calls, meeting recordings, etc., and GDPR rules apply...
But hopefully they will allow me to share the model.
Thank you once again for your time.