Commit 10d2312 by bharathkumar1922001 · verified · 1 Parent(s): 24628a7

Update README.md

Files changed (1)
  1. README.md +258 -1
README.md CHANGED
@@ -55,4 +55,261 @@ To use Veena, you need to install the `transformers`, `torch`, `torchaudio`, `sn
 
  ```bash
  pip install transformers torch torchaudio
- pip install snac bitsandbytes # For audio decoding and quantization
+ pip install snac bitsandbytes # For audio decoding and quantization
+ ```
+
+ ### Basic Usage
+
+ The following Python code generates speech from text with Veena, loading the model with 4-bit quantization to reduce memory use during inference.
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ from snac import SNAC
+ import soundfile as sf  # needs `pip install soundfile`; the install commands above do not include it
+
+ # Model configuration for 4-bit inference
+ quantization_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+     bnb_4bit_use_double_quant=True,
+ )
+
+ # Load model and tokenizer
+ model = AutoModelForCausalLM.from_pretrained(
+     "maya-research/veena-tts",
+     quantization_config=quantization_config,
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("maya-research/veena-tts", trust_remote_code=True)
+
+ # Initialize SNAC decoder (this line requires a CUDA device)
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()
+
+ # Control token IDs (fixed for Veena)
+ START_OF_SPEECH_TOKEN = 128257
+ END_OF_SPEECH_TOKEN = 128258
+ START_OF_HUMAN_TOKEN = 128259
+ END_OF_HUMAN_TOKEN = 128260
+ START_OF_AI_TOKEN = 128261
+ END_OF_AI_TOKEN = 128262
+ AUDIO_CODE_BASE_OFFSET = 128266
+
+ # Available speakers
+ speakers = ["kavya", "agastya", "maitri", "vinaya"]
+
+ def generate_speech(text, speaker="kavya", temperature=0.4, top_p=0.9):
+     """Generate speech from text using the specified speaker voice."""
+     if speaker not in speakers:
+         raise ValueError(f"Unknown speaker {speaker!r}; choose one of {speakers}")
+
+     # Prepare input with speaker token
+     prompt = f"<spk_{speaker}> {text}"
+     prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)
+
+     # Construct full sequence: [HUMAN] <spk_speaker> text [/HUMAN] [AI] [SPEECH]
+     input_tokens = [
+         START_OF_HUMAN_TOKEN,
+         *prompt_tokens,
+         END_OF_HUMAN_TOKEN,
+         START_OF_AI_TOKEN,
+         START_OF_SPEECH_TOKEN,
+     ]
+
+     input_ids = torch.tensor([input_tokens], device=model.device)
+
+     # Budget new tokens from the text length (7 tokens per audio frame), capped at 700
+     max_tokens = min(int(len(text) * 1.3) * 7 + 21, 700)
+
+     # Generate audio tokens
+     with torch.no_grad():
+         output = model.generate(
+             input_ids,
+             max_new_tokens=max_tokens,
+             do_sample=True,
+             temperature=temperature,
+             top_p=top_p,
+             repetition_penalty=1.05,
+             pad_token_id=tokenizer.pad_token_id,
+             eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN],
+         )
+
+     # Keep only the generated IDs that fall in the SNAC audio-code range
+     generated_ids = output[0][len(input_tokens):].tolist()
+     snac_tokens = [
+         token_id for token_id in generated_ids
+         if AUDIO_CODE_BASE_OFFSET <= token_id < (AUDIO_CODE_BASE_OFFSET + 7 * 4096)
+     ]
+
+     if not snac_tokens:
+         raise ValueError("No audio tokens generated")
+
+     # Decode audio
+     return decode_snac_tokens(snac_tokens, snac_model)
+
+ def decode_snac_tokens(snac_tokens, snac_model):
+     """De-interleave and decode SNAC tokens to audio."""
+     # Each audio frame is 7 tokens. Drop a trailing partial frame (for example
+     # when generation stops at max_new_tokens) instead of returning None,
+     # which would make sf.write fail later.
+     snac_tokens = snac_tokens[: len(snac_tokens) - len(snac_tokens) % 7]
+     if not snac_tokens:
+         raise ValueError("Fewer than 7 audio tokens generated; cannot decode a full frame")
+
+     # nn.Module has no built-in .device attribute, so read it from the parameters
+     device = next(snac_model.parameters()).device
+
+     # De-interleave tokens into 3 hierarchical levels
+     codes_lvl = [[] for _ in range(3)]
+     llm_codebook_offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]
+
+     for i in range(0, len(snac_tokens), 7):
+         # Level 0: Coarse (1 token)
+         codes_lvl[0].append(snac_tokens[i] - llm_codebook_offsets[0])
+         # Level 1: Medium (2 tokens)
+         codes_lvl[1].append(snac_tokens[i + 1] - llm_codebook_offsets[1])
+         codes_lvl[1].append(snac_tokens[i + 4] - llm_codebook_offsets[4])
+         # Level 2: Fine (4 tokens)
+         codes_lvl[2].append(snac_tokens[i + 2] - llm_codebook_offsets[2])
+         codes_lvl[2].append(snac_tokens[i + 3] - llm_codebook_offsets[3])
+         codes_lvl[2].append(snac_tokens[i + 5] - llm_codebook_offsets[5])
+         codes_lvl[2].append(snac_tokens[i + 6] - llm_codebook_offsets[6])
+
+     # Convert to tensors for SNAC decoder
+     hierarchical_codes = []
+     for lvl_codes in codes_lvl:
+         tensor = torch.tensor(lvl_codes, dtype=torch.int32, device=device).unsqueeze(0)
+         if torch.any((tensor < 0) | (tensor > 4095)):
+             raise ValueError("Invalid SNAC token values")
+         hierarchical_codes.append(tensor)
+
+     # Decode with SNAC
+     with torch.no_grad():
+         audio_hat = snac_model.decode(hierarchical_codes)
+
+     return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()
+
+ # --- Example Usage ---
+
+ # Hindi
+ text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"
+ audio = generate_speech(text_hindi, speaker="kavya")
+ sf.write("output_hindi_kavya.wav", audio, 24000)
+
+ # English
+ text_english = "Today I learned about a new technology that uses artificial intelligence to generate human-like voices."
+ audio = generate_speech(text_english, speaker="agastya")
+ sf.write("output_english_agastya.wav", audio, 24000)
+
+ # Code-mixed
+ text_mixed = "मैं तो पूरा presentation prepare कर चुका हूं! कल रात को ही मैंने पूरा code base चेक किया।"
+ audio = generate_speech(text_mixed, speaker="maitri")
+ sf.write("output_mixed_maitri.wav", audio, 24000)
+ ```
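The de-interleaving step in the example above can be tried without loading any model. The following is a minimal sketch that reuses the README's offset (128266), codebook size (4096), and 7-token frame layout; the names `LEVEL_OF_SLOT`, `interleave_frame`, and `deinterleave_frames` are hypothetical helpers introduced here for illustration and are not part of Veena or SNAC.

```python
AUDIO_CODE_BASE_OFFSET = 128266  # from the README example
CODEBOOK_SIZE = 4096             # from the README example
FRAME_SIZE = 7                   # tokens per audio frame

# Hypothetical lookup: slot position inside a frame -> SNAC level.
# Mirrors the README mapping: slot 0 -> level 0, slots 1 and 4 -> level 1,
# slots 2, 3, 5 and 6 -> level 2.
LEVEL_OF_SLOT = [0, 1, 2, 2, 1, 2, 2]


def interleave_frame(codes):
    """Turn 7 raw codes (each 0..4095) into LLM token IDs for one frame."""
    return [
        AUDIO_CODE_BASE_OFFSET + slot * CODEBOOK_SIZE + code
        for slot, code in enumerate(codes)
    ]


def deinterleave_frames(tokens):
    """Split a flat token list into three per-level code lists."""
    if len(tokens) % FRAME_SIZE != 0:
        raise ValueError("token count must be a multiple of 7")
    levels = [[], [], []]
    for start in range(0, len(tokens), FRAME_SIZE):
        for slot in range(FRAME_SIZE):
            offset = AUDIO_CODE_BASE_OFFSET + slot * CODEBOOK_SIZE
            code = tokens[start + slot] - offset
            if not 0 <= code < CODEBOOK_SIZE:
                raise ValueError(f"token in slot {slot} is outside its codebook")
            levels[LEVEL_OF_SLOT[slot]].append(code)
    return levels


# One synthetic frame with easily recognisable codes
tokens = interleave_frame([10, 20, 30, 40, 50, 60, 70])
print(deinterleave_frames(tokens))  # [[10], [20, 50], [30, 40, 60, 70]]
```

Each frame therefore contributes 1 coarse, 2 medium, and 4 fine codes, which is why the three tensors passed to the SNAC decoder differ in length by factors of two.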
+
+ ## Uses
+
+ Veena suits applications that need high-quality, low-latency speech synthesis for Indian languages, including:
+
+ * **Accessibility:** Screen readers and voice-enabled assistance for visually impaired users.
+ * **Customer Service:** IVR systems, voice bots, and automated announcements.
+ * **Content Creation:** Dubbing for videos, e-learning materials, and audiobooks.
+ * **Automotive:** In-car navigation and infotainment systems.
+ * **Edge Devices:** Voice-enabled smart devices and IoT applications.
+
+ ## Technical Specifications
+
+ ### Architecture
+
+ Veena uses a 3B-parameter transformer architecture with the following components:
+
+ * **Base Architecture:** Llama-style autoregressive transformer (3B parameters)
+ * **Audio Codec:** SNAC (24 kHz) for high-quality audio token generation
+ * **Speaker Conditioning:** Special speaker tokens (`<spk_kavya>`, `<spk_agastya>`, `<spk_maitri>`, `<spk_vinaya>`)
+ * **Parameter-Efficient Training:** LoRA adaptation with differentiated ranks for attention and FFN modules.
+ * **Context Length:** 2048 tokens
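To make "differentiated ranks" concrete: a LoRA adapter on a linear layer with input width `d_in` and output width `d_out` adds `r * (d_in + d_out)` trainable parameters at rank `r`. The sketch below applies that formula with the ranks listed under Training Configuration (192 for attention, 96 for FFN). The layer widths and layer count are illustrative assumptions, not values stated in this README, so the totals are only indicative.

```python
# Ranks from the README's Training Configuration
RANK_ATTENTION = 192
RANK_FFN = 96

# ASSUMED dimensions for illustration only (not stated in the README)
HIDDEN = 3072       # model width
KV_WIDTH = 1024     # output width of k_proj / v_proj
FFN_WIDTH = 8192    # FFN intermediate width
NUM_LAYERS = 28


def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


attention = (
    lora_params(HIDDEN, HIDDEN, RANK_ATTENTION)        # q_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK_ATTENTION)    # k_proj
    + lora_params(HIDDEN, KV_WIDTH, RANK_ATTENTION)    # v_proj
    + lora_params(HIDDEN, HIDDEN, RANK_ATTENTION)      # o_proj
)
ffn = (
    lora_params(HIDDEN, FFN_WIDTH, RANK_FFN)           # gate_proj
    + lora_params(HIDDEN, FFN_WIDTH, RANK_FFN)         # up_proj
    + lora_params(FFN_WIDTH, HIDDEN, RANK_FFN)         # down_proj
)

per_layer = attention + ffn
print(f"attention adapters per layer: {attention:,}")
print(f"FFN adapters per layer:       {ffn:,}")
print(f"all layers (assumed {NUM_LAYERS}):   {per_layer * NUM_LAYERS:,}")
```

Under these assumed widths, halving the FFN rank keeps the FFN adapters at roughly the same size as the attention adapters even though the FFN matrices are much wider. The count excludes the fully trained `embed_tokens` module.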
+
+ ### Training
+
+ #### Training Infrastructure
+
+ * **Hardware:** 8× NVIDIA H100 80GB GPUs
+ * **Distributed Training:** DDP with optimized communication
+ * **Precision:** BF16 mixed precision training with gradient checkpointing
+ * **Memory Optimization:** 4-bit quantization with NF4 + double quantization
+
+ #### Training Configuration
+
+ * **LoRA Configuration:**
+   * `lora_rank_attention`: 192
+   * `lora_rank_ffn`: 96
+   * `lora_alpha`: 2× rank (384 for attention, 192 for FFN)
+   * `lora_dropout`: 0.05
+   * `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]`
+   * `modules_to_save`: `["embed_tokens"]`
+ * **Optimizer Configuration:**
+   * `optimizer`: AdamW (8-bit)
+   * `optimizer_betas`: (0.9, 0.98)
+   * `optimizer_eps`: 1e-5
+   * `learning_rate_peak`: 1e-4
+   * `lr_scheduler`: cosine
+   * `warmup_ratio`: 0.02
+ * **Batch Configuration:**
+   * `micro_batch_size`: 8
+   * `gradient_accumulation_steps`: 4
+   * `effective_batch_size`: 256
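The batch and scheduler settings above can be checked with a few lines of arithmetic. This sketch assumes the effective batch size is the product of the micro batch size, the accumulation steps, and the 8 GPUs listed under Training Infrastructure, which is consistent with the stated 256. The schedule shape (linear warmup, cosine decay to zero) and the helper name `lr_at` are assumptions; the README only says "cosine" with a warmup ratio of 0.02.

```python
import math

MICRO_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 4
NUM_GPUS = 8  # inferred from "8x NVIDIA H100" in the README

effective_batch_size = MICRO_BATCH_SIZE * GRAD_ACCUM_STEPS * NUM_GPUS
print(effective_batch_size)  # 256

PEAK_LR = 1e-4
WARMUP_RATIO = 0.02


def lr_at(step, total_steps):
    """Learning rate at `step`, ASSUMING linear warmup then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * WARMUP_RATIO))
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


# total_steps is a hypothetical value chosen for illustration
for step in (0, 10, 20, 510, 1000):
    print(step, lr_at(step, total_steps=1000))
```

With a hypothetical run of 1000 optimizer steps, warmup covers the first 20 steps, after which the rate decays from the 1e-4 peak.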
+
+ #### Training Data
+
+ Veena was trained on **proprietary, high-quality datasets** specifically curated for Indian language TTS.
+
+ * **Data Volume:** 15,000+ utterances per speaker (60,000+ total)
+ * **Languages:** Native Hindi and English utterances with code-mixed support
+ * **Speaker Diversity:** 4 professional voice artists with distinct characteristics
+ * **Audio Quality:** Studio-grade recordings at 24 kHz sampling rate
+ * **Content Diversity:** Conversational, narrative, expressive, and informational styles
+
+ **Note:** The training datasets are proprietary and not publicly available.
+
+ ## Performance Benchmarks
+
+ | Metric                | Value                     |
+ | --------------------- | ------------------------- |
+ | Latency (H100-80GB)   | \<80 ms                   |
+ | Latency (A100-40GB)   | \~120 ms                  |
+ | Latency (RTX 4090)    | \~200 ms                  |
+ | Real-time Factor      | 0.05x                     |
+ | Throughput            | \~170k tokens/s (8×H100)  |
+ | Audio Quality (MOS)   | 4.2/5.0                   |
+ | Speaker Similarity    | 92%                       |
+ | Intelligibility       | 98%                       |
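As a rough guide to reading the real-time factor in the table: assuming the usual definition (synthesis time divided by the duration of the audio produced), a value of 0.05 means audio is generated about 20 times faster than it plays back. The helper names below are hypothetical and the README does not state which hardware the 0.05x figure was measured on.

```python
REAL_TIME_FACTOR = 0.05  # from the benchmark table


def synthesis_time_seconds(audio_seconds, rtf=REAL_TIME_FACTOR):
    """Estimated compute time, ASSUMING rtf = synthesis time / audio duration."""
    return audio_seconds * rtf


def speedup_over_real_time(rtf=REAL_TIME_FACTOR):
    """How many seconds of audio are produced per second of compute."""
    return 1.0 / rtf


for seconds in (1.0, 10.0, 60.0):
    print(f"{seconds:5.1f} s of audio -> ~{synthesis_time_seconds(seconds):.2f} s to synthesize")
print(f"speedup over real time: ~{speedup_over_real_time():.0f}x")
```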
+
+ ## Risks, Limitations and Biases
+
+ * **Language Support:** Currently supports only Hindi and English. Performance on other Indian languages is not guaranteed.
+ * **Speaker Diversity:** Limited to 4 speaker voices, which may not represent the full diversity of Indian accents and dialects.
+ * **Hardware Requirements:** Requires a GPU for real-time or near-real-time inference. CPU inference will be significantly slower.
+ * **Input Length:** The model's context length is 2048 tokens, which the input text and the generated audio tokens share.
+ * **Bias:** The model's performance and voice characteristics reflect the proprietary training data, and it may exhibit biases present in that data.
+
+ ## Future Updates
+
+ We are actively working on expanding Veena's capabilities:
+
+ * Support for Tamil, Telugu, Bengali, Marathi, and other Indian languages.
+ * Additional speaker voices with regional accents.
+ * Emotion and prosody control tokens.
+ * Streaming inference support.
+ * CPU optimization for edge deployment.
+
+ ## Citing
+
+ If you use Veena in your research or applications, please cite:
+
+ ```bibtex
+ @misc{veena2025,
+   title={Veena: Open Source Text-to-Speech for Indian Languages},
+   author={Maya Research Team},
+   year={2025},
+   publisher={HuggingFace},
+   url={https://huggingface.co/maya-research/veena-tts}
+ }
+ ```
+
+ ## Acknowledgments
+
+ We thank the open-source community and all contributors who made this project possible. Special thanks to the voice artists who provided high-quality recordings for training.