How We Fine-Tuned a Kannada TTS Model - Step by Step

Authors: Sreeram Sridhar, Adithya L Bhat, Sasank Chilamkurthy

We wanted to fine-tune a Kannada TTS model using openly available data and open tools. Here's exactly how we did it, from Wikipedia dump to trained model.

Synthetic Data Generation

The code for this step lives in the synthetic_data folder. We document the process here.

1. Get Raw Kannada Text from Wikipedia

We downloaded the Kannada Wikipedia dump.
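If you want to reproduce this, the dump can be fetched from dumps.wikimedia.org. A minimal sketch in Python (the knwiki "latest" URL below follows the standard Wikimedia naming pattern and is our assumption; pin a dated snapshot for reproducibility):

import urllib.request

# Standard Wikimedia dumps URL pattern for the Kannada Wikipedia
# ("latest" is assumed here; use a dated snapshot for reproducible runs).
url = "https://dumps.wikimedia.org/knwiki/latest/knwiki-latest-pages-articles.xml.bz2"
urllib.request.urlretrieve(url, "knwiki-latest-pages-articles.xml.bz2")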

2. Convert XML to Plain Text

We used this GitHub repo to convert Wikipedia XML to plain text: https://github.com/daveshap/PlainTextWikipedia

This tool parses XML and removes wiki markup to produce readable plain text in JSON format.

3. Clean and Split the Text into Sentences

We used the Indic NLP Library's sentence tokenizer to split the large text corpus into clean Kannada sentences: https://github.com/anoopkunchukuttan/indic_nlp_library

Steps:

  • Load the text into a Python list.

  • Use the Indic NLP sentence tokenizer to break paragraphs into sentences; it handles common full stops, abbreviations like “Dr.”, and other punctuation reasonably well (see the sketch after this list).

  • After tokenization, we had a list of sentences like: [sentence_1, sentence_2, ...].
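
A minimal sketch of the tokenizer call (the Kannada paragraph is a placeholder; sentence_split is the library's sentence tokenizer):

from indicnlp.tokenize import sentence_tokenize

# Split a Kannada paragraph into sentences; "kn" selects Kannada rules.
paragraph = "ಕನ್ನಡ ಒಂದು ದ್ರಾವಿಡ ಭಾಷೆ. ಇದು ಕರ್ನಾಟಕದ ಅಧಿಕೃತ ಭಾಷೆ."
all_sentences = sentence_tokenize.sentence_split(paragraph, lang="kn")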

4. Random Sampling for Fairness

To avoid bias, we sampled sentences randomly from the list using Python:

import random

# Draw 1000 sentences uniformly at random from the corpus.
sampled_sentences = random.sample(all_sentences, k=1000)

We also created grouped variations:

  • 1 sentence
  • 2 sentences together
  • 3 sentences together

This helps the model learn from input sequences of varying lengths.
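
A minimal sketch of how such groupings can be built (the group sizes follow the list above, but joining consecutive sentences with a space is our assumption about the exact recipe):

# Build training texts of 1, 2, or 3 consecutive sentences each.
grouped_texts = []
for size in (1, 2, 3):
    for i in range(0, len(sampled_sentences) - size + 1, size):
        grouped_texts.append(" ".join(sampled_sentences[i:i + size]))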

5. Generate Audio with Google TTS

We used Google Cloud’s Text-to-Speech API to generate synthetic audio for each sentence.

We used:

  • Language: Kannada (India)
  • Voice: kn-IN-Chirp3-HD-Achernar
  • Type: Premium, Female

Each sentence was converted into an .mp3 audio file. We ensured that filenames matched:

001.json ↔ 001.mp3

002.json ↔ 002.mp3

...

Note: Google TTS accepts up to 5,000 bytes of input per request (Kannada characters take three bytes each in UTF-8), so make sure grouped sentence blocks stay within this limit.
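
For reference, here is a minimal sketch using the google-cloud-texttospeech Python client (the sentence and output filename are placeholders; authentication via GOOGLE_APPLICATION_CREDENTIALS is assumed):

from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()

# Synthesize one sentence to MP3 with the Chirp3 HD Kannada voice.
response = client.synthesize_speech(
    input=texttospeech.SynthesisInput(text="ಕನ್ನಡ ವಾಕ್ಯ ಇಲ್ಲಿ."),
    voice=texttospeech.VoiceSelectionParams(
        language_code="kn-IN", name="kn-IN-Chirp3-HD-Achernar"
    ),
    audio_config=texttospeech.AudioConfig(
        audio_encoding=texttospeech.AudioEncoding.MP3
    ),
)
with open("001.mp3", "wb") as f:
    f.write(response.audio_content)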

6. Create a Hugging Face Dataset

We structured the data in a format compatible with Hugging Face datasets.

Our dataset included:

  • audio: path to .mp3 file
  • prompt: corresponding text sentence(s)
  • description: optional metadata
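
A minimal sketch of building and uploading such a dataset (the file names, texts, and Hub repo name are placeholders):

from datasets import Audio, Dataset

# Pair each MP3 with its text, cast the audio column so files are
# decoded on access, then push everything to the Hugging Face Hub.
records = {
    "audio": ["001.mp3", "002.mp3"],
    "prompt": ["ಮೊದಲ ವಾಕ್ಯ.", "ಎರಡನೇ ವಾಕ್ಯ."],
    "description": ["A clear female voice."] * 2,
}
dataset = Dataset.from_dict(records).cast_column("audio", Audio())
dataset.push_to_hub("your-username/your-kn-tts-dataset")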

Model Training

1. Fine-Tune the Model

We fine-tuned the AI4Bharat Indic Parler-TTS model: https://huggingface.co/ai4bharat/indic-parler-tts

We followed the steps from the Parler-TTS notebook for single-speaker TTS fine-tuning.

Training details:

  • Dataset: our synthetic Kannada dataset
  • GPU: NVIDIA 4070 Ti Super (16GB VRAM)
  • Training time: roughly 25 hours for the full fine-tuning run

The exact command we used for our final run is below:

accelerate launch training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts" \
    --overwrite_output_dir true \
    --train_dataset_name "chsasank/tts_kn_large" \
    --train_metadata_dataset_name "chsasank/tts_kn_large" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "chsasank/tts_kn_large" \
    --eval_dataset_config_name "default" \
    --eval_metadata_dataset_name "chsasank/tts_kn_large" \
    --eval_split_name "test" \
    --per_device_eval_batch_size 1 \
    --per_device_train_batch_size 1 \
    --target_audio_column_name "audio" \
    --description_column_name "description" \
    --prompt_column_name "prompt" \
    --max_train_samples 1900 \
    --max_duration_in_seconds 30 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 700 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 10 \
    --gradient_accumulation_steps 36 \
    --gradient_checkpointing true \
    --learning_rate 0.0002 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "cosine" \
    --warmup_steps 30 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 5 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "/tmp/tts_data/training/model_outputs" \
    --temporary_save_to_disk "/tmp/tts_data/training/temp" \
    --save_to_disk "/tmp/tts_data/training/temp_audio" \
    --include_inputs_for_metrics \
    --do_eval \
    --group_by_length true | tee /tmp/tts_data/logs/exp_final_logs.txt

We used Hugging Face’s datasets and transformers libraries to load the dataset and train the model.

2. Test the Trained Model

After training, we tested the model by giving it new Kannada text inputs and generating speech. The Parler-TTS repo (the same one that provides the training script above) also includes sample scripts to run inference locally.

Inference code is coming up shortly.
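
In the meantime, here is a minimal sketch following the standard Parler-TTS inference pattern (the checkpoint path points at the output_dir from the command above; the prompt and description strings are placeholders):

import torch
import soundfile as sf
from parler_tts import ParlerTTSForConditionalGeneration
from transformers import AutoTokenizer

device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Load the fine-tuned checkpoint written by the training run above.
model = ParlerTTSForConditionalGeneration.from_pretrained(
    "/tmp/tts_data/training/model_outputs"
).to(device)
tokenizer = AutoTokenizer.from_pretrained("/tmp/tts_data/training/model_outputs")
# The voice description goes through the (frozen) text encoder's tokenizer.
description_tokenizer = AutoTokenizer.from_pretrained(
    model.config.text_encoder._name_or_path
)

prompt = "ನಮಸ್ಕಾರ, ಇದು ಒಂದು ಪರೀಕ್ಷಾ ವಾಕ್ಯ."
description = "A clear female voice with a neutral tone."

input_ids = description_tokenizer(description, return_tensors="pt").input_ids.to(device)
prompt_input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)
sf.write("test_output.wav", generation.cpu().numpy().squeeze(), model.config.sampling_rate)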

Final Outcome

We now have a fine-tuned Kannada TTS model capable of generating realistic speech for new Kannada text, whether synthetic or real. This project shows that:

  • You can start with openly available text.
  • You don’t need a large team or massive compute resources.
  • Just a few tools, some scripting, and one decent GPU are enough to get started.

If you want to build a TTS model for your own language, the path is open and doable.
