kadirnar committed · verified · Commit 0dfa563 · 1 Parent(s): 923f534

Update README.md

Files changed (1): README.md (+153 -1)
README.md CHANGED
@@ -8,4 +8,156 @@ base_model:
  - Qwen/Qwen3-0.6B
  pipeline_tag: text-to-speech
  library_name: transformers
- ---
+ ---
+ ## Overview
+ VyvoTTS-v0-Qwen3-0.6B is a Text-to-Speech model based on Qwen3-0.6B, trained to produce natural-sounding English speech.
+
+ - **Type:** Text-to-Speech
+ - **Language:** English
+ - **License:** MIT
+ - **Params:** ~810M
+
+ > **Note:** This model has a high Word Error Rate (WER) because it was trained on only a 10,000-hour dataset. To improve accuracy, use it as a pretrained base and continue pretraining on a larger corpus; we recommend the Emilia dataset for this purpose.
+ > Once pretraining is complete, fine-tune on single-speaker speech. A minimal sketch of this workflow is shown below.
+
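+ The sketch below illustrates this continued-pretraining workflow. It is an assumption, not the official recipe: the repo id comes from this card, but the `Trainer` setup, hyperparameters, and the placeholder dataset (which you would replace with token sequences built from e.g. Emilia) are only illustrative.
+
+ ```python
+ # Hypothetical sketch: continued pretraining of VyvoTTS-v0-Qwen3-0.6B as a causal LM
+ # over pre-built token sequences (text tokens followed by interleaved SNAC audio tokens).
+ # Dataset construction and hyperparameters are placeholders, not the official recipe.
+ from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
+                           TrainingArguments, DataCollatorForLanguageModeling)
+ from datasets import Dataset
+
+ repo = "Vyvo/VyvoTTS-v0-Qwen3-0.6B"
+ tokenizer = AutoTokenizer.from_pretrained(repo)
+ model = AutoModelForCausalLM.from_pretrained(repo)
+
+ # Placeholder dataset: each row is one utterance already rendered into the model's
+ # token format (see the Usage section below for the token layout).
+ train_ds = Dataset.from_dict({"input_ids": [[151643, 151645]]})
+
+ trainer = Trainer(
+     model=model,
+     args=TrainingArguments(
+         output_dir="vyvotts-continued-pretraining",
+         per_device_train_batch_size=1,
+         num_train_epochs=1,
+         learning_rate=5e-5,
+         report_to="none",
+     ),
+     train_dataset=train_ds,
+     data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
+ )
+ trainer.train()
+ ```
+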
+ ## Usage
+ Below is an example of using the model with `unsloth` and `SNAC` for speech generation:
+
+ ```python
+ from unsloth import FastLanguageModel
+ import torch
+ from snac import SNAC
+
+ # Load this model's checkpoint and tokenizer with Unsloth
+ model, tokenizer = FastLanguageModel.from_pretrained(
+     model_name = "Vyvo/VyvoTTS-v0-Qwen3-0.6B",
+     max_seq_length = 2048,
+     dtype = None,
+     load_in_4bit = False,
+ )
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
+
+ # Special token IDs used to frame the text prompt and the generated audio tokens
+ tokeniser_length = 151669
+ start_of_text = 151643
+ end_of_text = 151645
+
+ start_of_speech = tokeniser_length + 1
+ end_of_speech = tokeniser_length + 2
+ start_of_human = tokeniser_length + 3
+ end_of_human = tokeniser_length + 4
+ pad_token = tokeniser_length + 7
+
+ audio_tokens_start = tokeniser_length + 10
+ prompts = ["Hey there my name is Elise, and I'm a speech generation model that can sound like a person."]
+ chosen_voice = None  # None when not using a named voice prefix
+
+ FastLanguageModel.for_inference(model)
+ snac_model.to("cpu")
+ prompts_ = [(f"{chosen_voice}: " + p) if chosen_voice else p for p in prompts]
+
+ all_input_ids = []
+ for prompt in prompts_:
+     input_ids = tokenizer(prompt, return_tensors="pt").input_ids
+     all_input_ids.append(input_ids)
+
+ start_token = torch.tensor([[start_of_human]], dtype=torch.int64)
+ end_tokens = torch.tensor([[end_of_text, end_of_human]], dtype=torch.int64)
+
+ all_modified_input_ids = []
+ for input_ids in all_input_ids:
+     modified_input_ids = torch.cat([start_token, input_ids, end_tokens], dim=1)
+     all_modified_input_ids.append(modified_input_ids)
+
+ all_padded_tensors, all_attention_masks = [], []
+ max_length = max([m.shape[1] for m in all_modified_input_ids])
+ for m in all_modified_input_ids:
+     padding = max_length - m.shape[1]
+     padded_tensor = torch.cat([torch.full((1, padding), pad_token, dtype=torch.int64), m], dim=1)
+     attention_mask = torch.cat([torch.zeros((1, padding), dtype=torch.int64), torch.ones((1, m.shape[1]), dtype=torch.int64)], dim=1)
+     all_padded_tensors.append(padded_tensor)
+     all_attention_masks.append(attention_mask)
+
+ input_ids = torch.cat(all_padded_tensors, dim=0).to("cuda")
+ attention_mask = torch.cat(all_attention_masks, dim=0).to("cuda")
+
+ generated_ids = model.generate(
+     input_ids=input_ids,
+     attention_mask=attention_mask,
+     max_new_tokens=1200,
+     do_sample=True,
+     temperature=0.6,
+     top_p=0.95,
+     repetition_penalty=1.1,
+     num_return_sequences=1,
+     eos_token_id=end_of_speech,
+     use_cache=True
+ )
93
+
94
+ token_to_find = start_of_speech
95
+ token_to_remove = end_of_speech
96
+ token_indices = (generated_ids == token_to_find).nonzero(as_tuple=True)
97
+
98
+ if len(token_indices[1]) > 0:
99
+ last_occurrence_idx = token_indices[1][-1].item()
100
+ cropped_tensor = generated_ids[:, last_occurrence_idx+1:]
101
+ else:
102
+ cropped_tensor = generated_ids
103
+
104
+ processed_rows = []
105
+ for row in cropped_tensor:
106
+ masked_row = row[row != token_to_remove]
107
+ processed_rows.append(masked_row)
108
+
109
+ code_lists = []
110
+ for row in processed_rows:
111
+ row_length = row.size(0)
112
+ new_length = (row_length // 7) * 7
113
+ trimmed_row = row[:new_length]
114
+ trimmed_row = [t - audio_tokens_start for t in trimmed_row]
115
+ code_lists.append(trimmed_row)
116
+
+ def redistribute_codes(code_list):
+     # Each 7-token frame interleaves the three SNAC codebook levels
+     # (1 coarse, 2 mid, 4 fine codes), each level offset by a multiple of 4096.
+     layer_1, layer_2, layer_3 = [], [], []
+     for i in range((len(code_list)+1)//7):
+         layer_1.append(code_list[7*i])
+         layer_2.append(code_list[7*i+1]-4096)
+         layer_3.append(code_list[7*i+2]-(2*4096))
+         layer_3.append(code_list[7*i+3]-(3*4096))
+         layer_2.append(code_list[7*i+4]-(4*4096))
+         layer_3.append(code_list[7*i+5]-(5*4096))
+         layer_3.append(code_list[7*i+6]-(6*4096))
+     codes = [
+         torch.tensor(layer_1).unsqueeze(0),
+         torch.tensor(layer_2).unsqueeze(0),
+         torch.tensor(layer_3).unsqueeze(0)
+     ]
+     audio_hat = snac_model.decode(codes)
+     return audio_hat
+
+ my_samples = []
+ for code_list in code_lists:
+     samples = redistribute_codes(code_list)
+     my_samples.append(samples)
+
+ from IPython.display import display, Audio
+ if len(prompts) != len(my_samples):
+     raise Exception("Number of prompts and samples do not match")
+ else:
+     for i in range(len(my_samples)):
+         print(prompts[i])
+         samples = my_samples[i]
+         display(Audio(samples.detach().squeeze().to("cpu").numpy(), rate=24000))
+
+ del my_samples, samples
+ ```
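+
+ Outside a notebook you can write the generated audio to WAV files instead of using `IPython.display`. This is a small add-on sketch, not part of the original example; it assumes the `soundfile` package and should run before the final `del` statement above.
+
+ ```python
+ # Optional: save each generated sample as a 24 kHz WAV file (requires `pip install soundfile`).
+ import soundfile as sf
+
+ for i, sample in enumerate(my_samples):
+     sf.write(f"sample_{i}.wav", sample.detach().squeeze().to("cpu").numpy(), 24000)
+ ```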
+
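+ ## Preparing audio tokens for fine-tuning
+
+ The note above recommends continued pretraining and single-speaker fine-tuning, which requires training sequences in the same token layout that `redistribute_codes` decodes. The helper below is a best-effort inverse of that function and is an assumption, not an official data-preparation script: it encodes a clip with SNAC at 24 kHz and interleaves the three codebook levels into 7 tokens per frame, offset by `audio_tokens_start`.
+
+ ```python
+ # Hypothetical helper: encode an audio file into interleaved SNAC token IDs that
+ # mirror the decode order used by redistribute_codes in the Usage example.
+ import torch
+ import torchaudio
+ from snac import SNAC
+
+ snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")
+ audio_tokens_start = 151669 + 10  # tokeniser_length + 10, as in the Usage example
+
+ def encode_audio_to_tokens(path):
+     wav, sr = torchaudio.load(path)                        # (channels, samples)
+     wav = wav.mean(dim=0, keepdim=True)                    # mono
+     wav = torchaudio.functional.resample(wav, sr, 24000)
+     with torch.inference_mode():
+         c1, c2, c3 = snac_model.encode(wav.unsqueeze(0))   # shapes (1, T), (1, 2T), (1, 4T)
+     tokens = []
+     for i in range(c1.shape[1]):
+         tokens += [
+             c1[0, i].item(),
+             c2[0, 2 * i].item() + 4096,
+             c3[0, 4 * i].item() + 2 * 4096,
+             c3[0, 4 * i + 1].item() + 3 * 4096,
+             c2[0, 2 * i + 1].item() + 4 * 4096,
+             c3[0, 4 * i + 2].item() + 5 * 4096,
+             c3[0, 4 * i + 3].item() + 6 * 4096,
+         ]
+     return [t + audio_tokens_start for t in tokens]
+ ```
+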
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{VyvoTTS-v0-Qwen3-0.6B,
+   title={VyvoTTS-v0-Qwen3-0.6B},
+   author={Vyvo},
+   year={2025},
+   howpublished={\url{https://huggingface.co/Vyvo/VyvoTTS-v0-Qwen3-0.6B}}
+ }
+ ```