---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
pipeline_tag: feature-extraction
---

# NVIDIA nemo-nano-codec
<style>
img{
  display: inline-table;
  vertical-align: middle;
  margin: 0;
  padding: 0;
}
</style>
[![Model architecture](https://img.shields.io/badge/Model_Arch-NemoNanoCodec-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-62M-lightgrey#model-badge)](#model-architecture)
| [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets)

The [nemo-nano-codec]() is a neural audio codec that leverages finite scalar quantization and adversarial training with large speech language models to achieve state-of-the-art audio compression across a range of bitrates and frame rates.
Model variant details:

| Sample Rate (Hz) | Frame Rate (FPS) | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
|:----------------:|:----------------:|:--------:|:-----------:|:-------------:|:---------:|:------------:|
| 22050 | 21.5 | 1.89 kbps | 8 | 2016 | 32 | [8, 7, 6, 6] |

This model is ready for commercial/non-commercial use.

## nemo-nano-codec variants

| Model | Sample Rate (Hz) | Frame Rate (FPS) | Bit Rate | # Codebooks | Codebook Size | Embed Dim | FSQ Levels |
|:-----:|:----------------:|:----------------:|:--------:|:-----------:|:-------------:|:---------:|:------------:|
| [1.78kbps-12.5fps](https://huggingface.co/nvidia/nanocodec-22khz-1.78kbps-12.5fps) | 22050 | 21.5 | 1.78 kbps | 13 | 2016 | 52 | [8, 7, 6, 6] |
| [0.6kbps-12.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-0.6kbps-12.5fps) | 22050 | 21.5 | 0.6 kbps | 4 | 4032 | 16 | [9, 8, 8, 7] |
| [1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps) | 22050 | 21.5 | 1.89 kbps | 8 | 2016 | 32 | [8, 7, 6, 6] |

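As a rough sanity check on the numbers above, the bitrate of each variant can be estimated as frame rate × number of codebooks × log2(codebook size). The snippet below is an illustrative calculation only (not part of the NeMo API) and reproduces the ~1.89 kbps figure for the 8-codebook variant:

```
import math

def estimated_bitrate_kbps(frame_rate_fps, num_codebooks, codebook_size):
    # each codebook index carries log2(codebook_size) bits per codec frame
    bits_per_frame = num_codebooks * math.log2(codebook_size)
    return frame_rate_fps * bits_per_frame / 1000.0

# 8 codebooks of 2016 codes at 21.5 frames per second -> ~1.89 kbps
print(round(estimated_bitrate_kbps(21.5, 8, 2016), 2))
```
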
## License/Terms of Use
[NVIDIA Open Model License Agreement](https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf)

### Deployment Geography:
<br>Global<br>

### Use Case:
<br>This model can be used for audio compression and can also serve as a component in the training of speech generation models.<br>

### Release Date:
<br>Hugging Face [08/11/2025] via https://huggingface.co/nvidia/nanocodec-22khz-1.78kbps-12.5fps<br>

## Model Architecture
nemo-nano-codec is composed of a fully convolutional generator neural network and three discriminators. The generator comprises an encoder, followed by vector quantization, and a [HiFi-GAN-based](https://arxiv.org/abs/2010.05646) decoder.

The non-causal encoder consists of five residual blocks, each block containing three residual layers similar to the [multi-receptive field fusion (MRF) module](https://arxiv.org/abs/2010.05646). The causal decoder, based on the HiFi-GAN vocoder, uses upsampling rates that are the reverse of the encoder's One-Dimensional (1D) convolutional strides.

For the vector quantization, we use [Finite Scalar Quantization (FSQ)](https://arxiv.org/abs/2309.15505) with thirteen codebooks, four dimensions per code, and 2016 codes per codebook. For the discriminators, we utilize three neural networks, all employing a squared-GAN and feature-matching loss. We adopt the [multi-period discriminator](https://arxiv.org/abs/2010.05646), [multi-band multi-scale STFT discriminator](https://arxiv.org/abs/2306.06546), and [WavLM-based discriminator](https://arxiv.org/abs/2409.12117).

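To make the FSQ idea concrete, the toy sketch below bounds each latent dimension and rounds it to a fixed number of levels; with levels [8, 7, 6, 6], one codebook has 8 × 7 × 6 × 6 = 2016 possible codes. This is a simplified illustration only (the actual NeMo implementation, including the straight-through gradient used during training, differs):

```
import torch

def fsq_quantize(z, levels=(8, 7, 6, 6)):
    # z: a 4-dimensional latent vector, one entry per FSQ level
    # squash each dimension into (0, 1), then snap it to one of `L` evenly spaced values
    levels = torch.tensor(levels, dtype=z.dtype)
    unit = (torch.tanh(z) + 1.0) / 2.0           # bound each dimension to (0, 1)
    indices = torch.round(unit * (levels - 1))   # integer level index in [0, L-1]
    return indices / (levels - 1), indices

quantized, indices = fsq_quantize(torch.randn(4))
print(quantized, indices)
print(8 * 7 * 6 * 6)  # 2016 distinct codes per codebook
```
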
For more details, please check [our paper]().

**This model was developed based on [NVIDIA Low Frame-rate Speech Codec](https://huggingface.co/nvidia/low-frame-rate-speech-codec-22khz).**

**This model has 62M parameters.**

### Input
- **Input Type:** Audio
- **Input Format(s):** .wav files
- **Input Parameters:** One-Dimensional (1D)
- **Other Properties Related to Input:** 22050 Hz mono-channel audio

### Output
- **Output Type:** Audio
- **Output Format:** .wav files
- **Output Parameters:** One-Dimensional (1D)
- **Other Properties Related to Output:** 22050 Hz mono-channel audio

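If your source audio is not already 22050 Hz mono, it can be converted before encoding. A minimal sketch using librosa and soundfile (the file paths are placeholders):

```
import librosa
import soundfile as sf

src_path = "input_any_format.wav"   # placeholder: your original recording
dst_path = "input_22050_mono.wav"   # placeholder: codec-ready file

# librosa resamples to the requested rate and downmixes to mono by default
audio, sr = librosa.load(src_path, sr=22050, mono=True)
sf.write(dst_path, audio, sr)
```
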
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

## Software Integration

### Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Blackwell
- NVIDIA Jetson
- NVIDIA Hopper
- NVIDIA Lovelace
- NVIDIA Pascal
- NVIDIA Turing
- NVIDIA Volta

### Runtime Engine

- NeMo 2.0.0

### Preferred Operating System

- Linux

## Model Version(s):
<br>v12.5.1.78<br>

## How to Use this Model

The model is available in the [NVIDIA NeMo toolkit](https://github.com/NVIDIA/NeMo) and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Inference

For inference, you can refer to our [Audio Codec Inference Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Inference.ipynb), which automatically downloads the model checkpoint. Ensure that you set the `model_name` parameter to "nvidia/nanocodec-22khz-1.78kbps-12.5fps".

Alternatively, you can use the code below, which also handles the automatic checkpoint download:

```
import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel

path_to_input_audio = ??? # path of the input audio
path_to_output_audio = ??? # path of the reconstructed output audio

# load the audio codec model and move it to GPU if one is available
nemo_codec_model = AudioCodecModel.from_pretrained("nvidia/nanocodec-22khz-1.78kbps-12.5fps").eval()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = nemo_codec_model.to(device)

# load the input audio at the codec's sample rate
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# get discrete tokens from audio
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

# reconstruct audio from tokens
reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# save the reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
```
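The encoded tokens are plain integer indices, so they can be inspected directly. The optional check below continues from the snippet above; note that the exact tensor layout may differ between NeMo versions:

```
# number of valid codec frames for the single item in the batch
num_frames = int(encoded_len[0])
duration_sec = float(audio_len[0]) / nemo_codec_model.sample_rate

print("token tensor shape:", tuple(encoded_tokens.shape))
print("codec frames per second: %.2f" % (num_frames / duration_sec))
```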

If preferred, you can manually download the [checkpoint](https://huggingface.co/nvidia/nanocodec-22khz-1.78kbps-12.5fps/resolve/main/nanocodec-22khz-1.78kbps-12.5fps.nemo) and use the code below to run inference on the model:

```
import librosa
import torch
import soundfile as sf
from nemo.collections.tts.models import AudioCodecModel

codec_path = ??? # set here the model .nemo checkpoint path
path_to_input_audio = ??? # path of the input audio
path_to_output_audio = ??? # path of the reconstructed output audio

# restore the audio codec model from the local checkpoint and move it to GPU if one is available
nemo_codec_model = AudioCodecModel.restore_from(restore_path=codec_path, map_location="cpu").eval()
device = 'cuda' if torch.cuda.is_available() else 'cpu'
nemo_codec_model = nemo_codec_model.to(device)

# load the input audio at the codec's sample rate
audio, _ = librosa.load(path_to_input_audio, sr=nemo_codec_model.sample_rate)
audio_tensor = torch.from_numpy(audio).unsqueeze(dim=0).to(device)
audio_len = torch.tensor([audio_tensor[0].shape[0]]).to(device)

# get discrete tokens from audio
encoded_tokens, encoded_len = nemo_codec_model.encode(audio=audio_tensor, audio_len=audio_len)

# reconstruct audio from tokens
reconstructed_audio, _ = nemo_codec_model.decode(tokens=encoded_tokens, tokens_len=encoded_len)

# save the reconstructed audio
output_audio = reconstructed_audio.cpu().numpy().squeeze()
sf.write(path_to_output_audio, output_audio, nemo_codec_model.sample_rate)
```

### Training
For fine-tuning on another dataset, please follow the steps available in our [Audio Codec Training Tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/Audio_Codec_Training.ipynb). Note that you will need to set the `CONFIG_FILENAME` parameter to the "audio_codec_low_frame_rate_22050.yaml" config. You will also need to set `pretrained_model_name` to "audio_codec_low_frame_rate_22khz".

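As a rough illustration of the warm start (the tutorial handles this for you), the base checkpoint named above can be pulled through the same NeMo API used for inference, assuming that name is registered in NeMo's pretrained model registry:

```
from nemo.collections.tts.models import AudioCodecModel

# assumption: "audio_codec_low_frame_rate_22khz" is available via NeMo's pretrained model registry
base_model = AudioCodecModel.from_pretrained(model_name="audio_codec_low_frame_rate_22khz")
print(sum(p.numel() for p in base_model.parameters()), "parameters in the fine-tuning starting point")
```
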

## Training, Testing, and Evaluation Datasets:

The nemo-nano-codec was trained on 28.7k hours of speech data spanning 105 languages. The model was evaluated using multilingual audiobook-style data and high-quality English recordings. For further details, refer to [our paper]().

### Training Datasets
The nemo-nano-codec is trained on a total of 28.7k hours of speech data from 105 languages.

Link: [MLS English](https://www.openslr.org/94/) [25.5k hours]

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Automated

Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) [3.2k hours]

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Human

### Test Datasets

Link: [MLS](https://www.openslr.org/94/)

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Automated

- Properties: We randomly selected 200 samples from each of the eight languages in the 44kHz MLS dataset.

Link: [DAPS](https://zenodo.org/records/4660670)

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Automated

- Properties: To assess our models' performance on studio-quality audio, we utilized the F10 and M10 speakers from the DAPS Clear dataset. These speakers were also employed in the evaluation of the [DAC model](https://arxiv.org/abs/2306.06546).

### Evaluation Datasets

Link: [MLS English](https://www.openslr.org/94/)

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Automated

- Properties: We randomly selected 3,807 samples, including examples from multiple speakers.

Link: [Common Voice](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0)

- Data Collection Method by Dataset: Human

- Labeling Method by Dataset: Human

- Properties: We randomly selected 1,587 samples, including examples from multiple languages.

## Performance

We evaluated our codec using multiple objective audio quality metrics across two distinct test sets. Additionally, we compared our model's performance with state-of-the-art codecs. For further details, please refer to [our paper]().

Variant results:

| Dataset | Squim MOS (↑) | PESQ (↑) | Mel Dist. (↓) | SECS (↓) | CER (↓) |
|:-------:|:-------------:|:--------:|:-------------:|:--------:|:-------:|
| MLS | 4.441 | 2.760 | 0.143 | 0.862 | 2.423 |
| DAPS | 4.697 | 3.030 | 0.139 | 0.831 | 0.758 |

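If you want to reproduce a metric of this kind, the sketch below computes a log-mel-spectrogram L1 distance between a reference and a reconstructed waveform with librosa. This is an illustrative recipe only; the exact mel configuration and metric definitions behind the table above are described in the paper:

```
import librosa
import numpy as np

def mel_l1_distance(reference, reconstruction, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # log-mel spectrograms of the reference and reconstructed signals
    def log_mel(y):
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
        return np.log(mel + 1e-5)
    m_ref, m_rec = log_mel(reference), log_mel(reconstruction)
    frames = min(m_ref.shape[1], m_rec.shape[1])  # guard against off-by-one frame counts
    return float(np.mean(np.abs(m_ref[:, :frames] - m_rec[:, :frames])))

# example: compare the codec input and output from the inference snippet
# score = mel_l1_distance(audio, output_audio)
```
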
## Inference:
**Engine:** Transformers <br>
**Test Hardware:** <br>
- FP32:
  - 1x NVIDIA A100-80GB
  - 2x NVIDIA RTX 6000 Ada

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the [Model Card++ Explainability](https://gitlab-master.nvidia.com/ajukic/nemo-model-overview/-/blob/main/models/nanocodec_22khz/explainalability-subcard.md), [Bias](https://gitlab-master.nvidia.com/ajukic/nemo-model-overview/-/blob/main/models/nanocodec_22khz/bias-subcard.md), [Safety & Security](https://gitlab-master.nvidia.com/ajukic/nemo-model-overview/-/blob/main/models/nanocodec_22khz/safety-subcard.md), and [Privacy](https://gitlab-master.nvidia.com/ajukic/nemo-model-overview/-/blob/main/models/nanocodec_22khz/privacy-subcard.md) subcards.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).