bharathkumar1922001 committed
Commit 24628a7 · verified · 1 Parent(s): f743be8

Update README.md

Files changed (1):
  1. README.md +34 -180
README.md CHANGED
@@ -1,29 +1,27 @@
 ---
 license: apache-2.0
 language:
-
- - en
- - hi
- library_name: transformers
- tags:
- - text-to-speech
- - tts
- - hindi
- - english
- - llama
- - audio
- - speech
- - india
- datasets:
- - proprietary
- pipeline_tag: text-to-speech
- co2_eq_emissions:
-   emissions: "unknown"
-   source: "Not specified"
-   training_type: "unknown"
-   geographical_location: "unknown"
-
- ---
 
 # Veena - Text to Speech for Indian Languages
 
@@ -33,21 +31,21 @@ Veena is a state-of-the-art neural text-to-speech (TTS) model specifically desig
 
 **Veena** is a 3B parameter autoregressive transformer model based on the Llama architecture. It is designed to synthesize high-quality speech from text in Hindi and English, including code-mixed scenarios. The model outputs audio at a 24kHz sampling rate using the SNAC neural codec.
 
- * **Model type:** Autoregressive Transformer
- * **Base Architecture:** Llama (3B parameters)
- * **Languages:** Hindi, English
- * **Audio Codec:** SNAC @ 24kHz
- * **License:** Apache 2.0
- * **Developed by:** Maya Research
- * **Model URL:** [https://huggingface.co/maya-research/veena-tts](https://huggingface.co/maya-research/veena-tts)
 
 ## Key Features
 
- * **4 Distinct Voices:** `kavya`, `agastya`, `maitri`, and `vinaya` - each with unique vocal characteristics.
- * **Multilingual Support:** Native Hindi and English capabilities with code-mixed support.
- * **Ultra-Fast Inference:** Sub-80ms latency on H100-80GB GPUs.
- * **High-Quality Audio:** 24kHz output with the SNAC neural codec.
- * **Production-Ready:** Optimized for real-world deployment with 4-bit quantization support.
 
 ## How to Get Started with the Model
 
@@ -57,148 +55,4 @@ To use Veena, you need to install the `transformers`, `torch`, `torchaudio`, `sn
 
 ```bash
 pip install transformers torch torchaudio
- pip install snac bitsandbytes # For audio decoding and quantization
- ```
-
- ### Basic Usage
-
- The following Python code demonstrates how to generate speech from text using Veena with 4-bit quantization for efficient inference.
-
- ```python
- import torch
- from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
- from snac import SNAC
- import soundfile as sf
-
- # Model configuration for 4-bit inference
- quantization_config = BitsAndBytesConfig(
-     load_in_4bit=True,
-     bnb_4bit_quant_type="nf4",
-     bnb_4bit_compute_dtype=torch.bfloat16,
-     bnb_4bit_use_double_quant=True,
- )
-
- # Load model and tokenizer
- model = AutoModelForCausalLM.from_pretrained(
-     "maya-research/veena-tts",
-     quantization_config=quantization_config,
-     device_map="auto",
-     trust_remote_code=True,
- )
- tokenizer = AutoTokenizer.from_pretrained("maya-research/veena-tts", trust_remote_code=True)
-
- # Initialize SNAC decoder
- snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval().cuda()
-
- # Control token IDs (fixed for Veena)
- START_OF_SPEECH_TOKEN = 128257
- END_OF_SPEECH_TOKEN = 128258
- START_OF_HUMAN_TOKEN = 128259
- END_OF_HUMAN_TOKEN = 128260
- START_OF_AI_TOKEN = 128261
- END_OF_AI_TOKEN = 128262
- AUDIO_CODE_BASE_OFFSET = 128266
-
- # Available speakers
- speakers = ["kavya", "agastya", "maitri", "vinaya"]
-
- def generate_speech(text, speaker="kavya", temperature=0.4, top_p=0.9):
-     """Generate speech from text using specified speaker voice"""
-
-     # Prepare input with speaker token
-     prompt = f"<spk_{speaker}> {text}"
-     prompt_tokens = tokenizer.encode(prompt, add_special_tokens=False)
-
-     # Construct full sequence: [HUMAN] <spk_speaker> text [/HUMAN] [AI] [SPEECH]
-     input_tokens = [
-         START_OF_HUMAN_TOKEN,
-         *prompt_tokens,
-         END_OF_HUMAN_TOKEN,
-         START_OF_AI_TOKEN,
-         START_OF_SPEECH_TOKEN
-     ]
-
-     input_ids = torch.tensor([input_tokens], device=model.device)
-
-     # Calculate max tokens based on text length
-     max_tokens = min(int(len(text) * 1.3) * 7 + 21, 700)
-
-     # Generate audio tokens
-     with torch.no_grad():
-         output = model.generate(
-             input_ids,
-             max_new_tokens=max_tokens,
-             do_sample=True,
-             temperature=temperature,
-             top_p=top_p,
-             repetition_penalty=1.05,
-             pad_token_id=tokenizer.pad_token_id,
-             eos_token_id=[END_OF_SPEECH_TOKEN, END_OF_AI_TOKEN]
-         )
-
-     # Extract SNAC tokens
-     generated_ids = output[0][len(input_tokens):].tolist()
-     snac_tokens = [
-         token_id for token_id in generated_ids
-         if AUDIO_CODE_BASE_OFFSET <= token_id < (AUDIO_CODE_BASE_OFFSET + 7 * 4096)
-     ]
-
-     if not snac_tokens:
-         raise ValueError("No audio tokens generated")
-
-     # Decode audio
-     audio = decode_snac_tokens(snac_tokens, snac_model)
-     return audio
-
- def decode_snac_tokens(snac_tokens, snac_model):
-     """De-interleave and decode SNAC tokens to audio"""
-     if not snac_tokens or len(snac_tokens) % 7 != 0:
-         return None
-
-     # De-interleave tokens into 3 hierarchical levels
-     codes_lvl = [[] for _ in range(3)]
-     llm_codebook_offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]
-
-     for i in range(0, len(snac_tokens), 7):
-         # Level 0: Coarse (1 token)
-         codes_lvl[0].append(snac_tokens[i] - llm_codebook_offsets[0])
-         # Level 1: Medium (2 tokens)
-         codes_lvl[1].append(snac_tokens[i+1] - llm_codebook_offsets[1])
-         codes_lvl[1].append(snac_tokens[i+4] - llm_codebook_offsets[4])
-         # Level 2: Fine (4 tokens)
-         codes_lvl[2].append(snac_tokens[i+2] - llm_codebook_offsets[2])
-         codes_lvl[2].append(snac_tokens[i+3] - llm_codebook_offsets[3])
-         codes_lvl[2].append(snac_tokens[i+5] - llm_codebook_offsets[5])
-         codes_lvl[2].append(snac_tokens[i+6] - llm_codebook_offsets[6])
-
-     # Convert to tensors for SNAC decoder
-     hierarchical_codes = []
-     for lvl_codes in codes_lvl:
-         tensor = torch.tensor(lvl_codes, dtype=torch.int32, device=snac_model.device).unsqueeze(0)
-         if torch.any((tensor < 0) | (tensor > 4095)):
-             raise ValueError("Invalid SNAC token values")
-         hierarchical_codes.append(tensor)
-
-     # Decode with SNAC
-     with torch.no_grad():
-         audio_hat = snac_model.decode(hierarchical_codes)
-
-     return audio_hat.squeeze().clamp(-1, 1).cpu().numpy()
-
- # --- Example Usage ---
-
- # Hindi
- text_hindi = "आज मैंने एक नई तकनीक के बारे में सीखा जो कृत्रिम बुद्धिमत्ता का उपयोग करके मानव जैसी आवाज़ उत्पन्न कर सकती है।"
- audio = generate_speech(text_hindi, speaker="kavya")
- sf.write("output_hindi_kavya.wav", audio, 24000)
-
- # English
- text_english = "Today I learned about a new technology that uses artificial intelligence to generate human-like voices."
- audio = generate_speech(text_english, speaker="agastya")
- sf.write("output_english_agastya.wav", audio, 24000)
-
- # Code-mixed
- text_mixed = "मैं तो पूरा presentation prepare कर चुका हूं! कल रात को ही मैंने पूरा code base चेक किया।"
- audio = generate_speech(text_mixed, speaker="maitri")
- sf.write("output_mixed_maitri.wav", audio, 24000)
- ```
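The de-interleaving in `decode_snac_tokens` above follows Veena's fixed frame layout: each audio frame is emitted as 7 consecutive LLM tokens (1 coarse, 2 medium, 4 fine), with slot `i` offset into its own 4096-entry codebook at `AUDIO_CODE_BASE_OFFSET + i * 4096`. Here is a minimal self-contained sketch of that mapping, using made-up code values (no model or GPU required):

```python
# Illustration only: de-interleave one synthetic 7-token SNAC frame into the
# three hierarchical levels, mirroring decode_snac_tokens() above.
# The code values are arbitrary placeholders, not real model output.
AUDIO_CODE_BASE_OFFSET = 128266  # same constant as in the example above

codes = [12, 345, 678, 901, 234, 567, 890]          # one frame, as raw codebook indices
frame = [AUDIO_CODE_BASE_OFFSET + slot * 4096 + c   # what the LLM actually emits
         for slot, c in enumerate(codes)]

offsets = [AUDIO_CODE_BASE_OFFSET + i * 4096 for i in range(7)]
level_0 = [frame[0] - offsets[0]]                         # coarse: 1 code per frame
level_1 = [frame[1] - offsets[1], frame[4] - offsets[4]]  # medium: 2 codes per frame
level_2 = [frame[i] - offsets[i] for i in (2, 3, 5, 6)]   # fine: 4 codes per frame

print(level_0, level_1, level_2)
# [12] [345, 234] [678, 901, 567, 890]
```

Concatenating many such frames per level and subtracting the per-slot offsets is exactly what the removed decoder does before handing the three tensors to `snac_model.decode`.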
 
 ---
 license: apache-2.0
 language:
+ - en
+ - hi
+ library_name: transformers
+ tags:
+ - text-to-speech
+ - tts
+ - hindi
+ - english
+ - llama
+ - audio
+ - speech
+ - india
+ datasets:
+ - proprietary
+ pipeline_tag: text-to-speech
+ co2_eq_emissions:
+   emissions: 0
+   source: "Not specified"
+   training_type: "unknown"
+   geographical_location: "unknown"
+ ---
 
 
 # Veena - Text to Speech for Indian Languages
 
 
 **Veena** is a 3B parameter autoregressive transformer model based on the Llama architecture. It is designed to synthesize high-quality speech from text in Hindi and English, including code-mixed scenarios. The model outputs audio at a 24kHz sampling rate using the SNAC neural codec.
 
+ * **Model type:** Autoregressive Transformer
+ * **Base Architecture:** Llama (3B parameters)
+ * **Languages:** Hindi, English
+ * **Audio Codec:** SNAC @ 24kHz
+ * **License:** Apache 2.0
+ * **Developed by:** Maya Research
+ * **Model URL:** [https://huggingface.co/maya-research/veena-tts](https://huggingface.co/maya-research/veena-tts)
 
 ## Key Features
 
+ * **4 Distinct Voices:** `kavya`, `agastya`, `maitri`, and `vinaya` - each with unique vocal characteristics.
+ * **Multilingual Support:** Native Hindi and English capabilities with code-mixed support.
+ * **Ultra-Fast Inference:** Sub-80ms latency on H100-80GB GPUs.
+ * **High-Quality Audio:** 24kHz output with the SNAC neural codec.
+ * **Production-Ready:** Optimized for real-world deployment with 4-bit quantization support.
 
 ## How to Get Started with the Model
 
 
 ```bash
 pip install transformers torch torchaudio
+ pip install snac bitsandbytes # For audio decoding and quantization
```
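The trimmed README now stops at the install step, so the generation walkthrough survives only in the removed lines above. For reference, here is a condensed sketch distilled from that removed example; the control-token IDs, `<spk_...>` prompt format, and sampling settings are copied from it and may not match later revisions of the model:

```python
# Condensed from the usage example removed in this commit. Token IDs, the
# <spk_...> prompt format, and the max-token heuristic all come from that
# removed code, not from separate documentation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "maya-research/veena-tts", device_map="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("maya-research/veena-tts", trust_remote_code=True)

# Control-token IDs as documented in the removed example
START_OF_HUMAN, END_OF_HUMAN = 128259, 128260
START_OF_AI, START_OF_SPEECH = 128261, 128257
END_OF_SPEECH, END_OF_AI = 128258, 128262
AUDIO_CODE_BASE_OFFSET = 128266

text = "Today I learned about a new technology."
prompt_ids = tokenizer.encode(f"<spk_kavya> {text}", add_special_tokens=False)
input_ids = torch.tensor(
    [[START_OF_HUMAN, *prompt_ids, END_OF_HUMAN, START_OF_AI, START_OF_SPEECH]],
    device=model.device,
)

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=min(int(len(text) * 1.3) * 7 + 21, 700),  # heuristic from the removed code
        do_sample=True,
        temperature=0.4,
        top_p=0.9,
        repetition_penalty=1.05,
        eos_token_id=[END_OF_SPEECH, END_OF_AI],
    )

# Everything generated after the prompt is SNAC codes, 7 per audio frame
snac_tokens = [
    t for t in output[0][input_ids.shape[1]:].tolist()
    if AUDIO_CODE_BASE_OFFSET <= t < AUDIO_CODE_BASE_OFFSET + 7 * 4096
]
```

To obtain actual audio, the filtered `snac_tokens` still need the de-interleave-and-decode step from the removed `decode_snac_tokens` (SNAC 24kHz codec).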