Update README.md
#3 by AntonV (HF Staff) - opened

README.md CHANGED
@@ -84,6 +84,64 @@ sf.write("simple.mp3", output, 44100)

A PyPI package and a working CLI tool will be available soon.

### As part of transformers

Install `transformers`:
```bash
# pip
pip install "transformers[torch]"

# uv
uv pip install "transformers[torch]"
```
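
If the import below fails, the installed `transformers` build is likely too old to include Dia; a quick sanity check before running the examples:

```python
# Sanity check: the generation examples below rely on these classes being
# available in the installed transformers build.
import transformers
from transformers import AutoProcessor, DiaForConditionalGeneration

print(transformers.__version__)
```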

#### Generation with Text

```python
from transformers import AutoProcessor, DiaForConditionalGeneration

torch_device = "cuda"
model_checkpoint = "nari-labs/Dia-1.6B-0626"

text = ["[S1] Dia is an open weights text to dialogue model."]
processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, padding=True, return_tensors="pt").to(torch_device)

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to ~2s

# save audio to a file
outputs = processor.batch_decode(outputs)
processor.save_audio(outputs, "example.wav")
```
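
Because the processor call above already takes a list of prompts with `padding=True`, the same pattern should extend to small batches. A hedged sketch, continuing from the setup above and assuming `batch_decode` returns one NumPy-compatible waveform per prompt that `soundfile` (used earlier in this README) can write at 44.1 kHz:

```python
# Hedged batched variant of the example above; reuses processor, model and
# torch_device from the previous snippet. File names are illustrative.
import soundfile as sf

texts = [
    "[S1] Dia is an open weights text to dialogue model.",
    "[S1] Hello there. [S2] Hi, nice to meet you.",
]
inputs = processor(text=texts, padding=True, return_tensors="pt").to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Assumes each decoded item is a mono waveform sampled at 44.1 kHz.
for i, waveform in enumerate(processor.batch_decode(outputs)):
    sf.write(f"batched_{i}.wav", waveform, 44100)
```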

#### Generation with Text and Audio (Voice Cloning)

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor, DiaForConditionalGeneration

torch_device = "cuda"
model_checkpoint = "nari-labs/Dia-1.6B-0626"

ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=44100))
audio = ds[-1]["audio"]["array"]
# text is a transcript of the audio + additional text you want as new audio
text = ["[S1] I know. It's going to save me a lot of money, I hope. [S2] I sure hope so for you."]

processor = AutoProcessor.from_pretrained(model_checkpoint)
inputs = processor(text=text, audio=audio, padding=True, return_tensors="pt").to(torch_device)
prompt_len = processor.get_audio_prompt_len(inputs["decoder_attention_mask"])

model = DiaForConditionalGeneration.from_pretrained(model_checkpoint).to(torch_device)
outputs = model.generate(**inputs, max_new_tokens=256)  # corresponds to ~2s

# retrieve only the newly generated audio and save it to a file
outputs = processor.batch_decode(outputs, audio_prompt_len=prompt_len)
processor.save_audio(outputs, "example_with_audio.wav")
```
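
To clone from your own recording instead of the dataset sample, the same processor call should accept a raw waveform. A hedged sketch, continuing from the setup above; the file name and transcript are placeholders, and the recording is assumed to already be mono at 44.1 kHz:

```python
# Hedged variant: use a local recording as the voice prompt. Reuses processor,
# model and torch_device from the previous snippet.
import soundfile as sf

audio, sampling_rate = sf.read("voice_prompt.wav")
assert sampling_rate == 44100, "resample the prompt to 44.1 kHz first"

# The text must contain a transcript of the prompt plus the new lines to generate.
text = ["[S1] Transcript of voice_prompt.wav. [S2] And the new dialogue to synthesize."]
inputs = processor(text=text, audio=audio, padding=True, return_tensors="pt").to(torch_device)
prompt_len = processor.get_audio_prompt_len(inputs["decoder_attention_mask"])

outputs = model.generate(**inputs, max_new_tokens=256)
outputs = processor.batch_decode(outputs, audio_prompt_len=prompt_len)
processor.save_audio(outputs, "cloned_from_local_prompt.wav")
```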

## 💻 Hardware and Inference Speed

Dia has been tested only on GPUs (PyTorch 2.0+, CUDA 12.6). CPU support will be added soon.