Commit bd45601
Parent(s): 57c7d5b

initial commit

Files changed:
- .gitattributes (+3 -0)
- NVIDIA OneWay Noncommercial License.docx (+0 -0)
- README.md (+29 -1)
- modelcard.md (+114 -0)
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+assets/af2_arch.png filter=lfs diff=lfs merge=lfs -text
+assets/af2_radar.png filter=lfs diff=lfs merge=lfs -text
+assets/af2_table2.png filter=lfs diff=lfs merge=lfs -text
NVIDIA OneWay Noncommercial License.docx
ADDED
Binary file (20.5 kB)
README.md
CHANGED
@@ -1,5 +1,33 @@
 ---
 license: other
 license_name: nvidia-oneway-noncommercial-license
-license_link: LICENSE
 ---
+
+# PyTorch Implementation of Audio-to-Audio Schrodinger Bridges
+
+**Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro**
+
+[[paper]](https://arxiv.org/abs/2501.11311) [[GitHub]](https://github.com/NVIDIA/diffusion-audio-restoration) [[Demo]](https://research.nvidia.com/labs/adlr/A2SB/)
+
+This repo contains the PyTorch implementation of [A2SB: Audio-to-Audio Schrodinger Bridges](https://arxiv.org/abs/2501.11311). A2SB is an audio restoration model tailored to high-res music at 44.1kHz. It performs both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, requiring no vocoder to predict waveform outputs, and it can restore hour-long audio inputs. A2SB achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
+
+- We propose A2SB, a state-of-the-art, end-to-end, vocoder-free, multi-task diffusion Schrodinger Bridge model for 44.1kHz high-res music restoration, built on an effective factorized audio representation.
+- A2SB is the first long-audio restoration model that can restore hour-long audio without boundary artifacts.
+
+## License
+
+The model is provided under the NVIDIA OneWay NonCommercial License.
+
+## Citation
+
+```
+@article{kong2025a2sb,
+  title={A2SB: Audio-to-Audio Schrodinger Bridges},
+  author={Kong, Zhifeng and Shih, Kevin J and Nie, Weili and Vahdat, Arash and Lee, Sang-gil and Santos, Joao Felipe and Jukic, Ante and Valle, Rafael and Catanzaro, Bryan},
+  journal={arXiv preprint arXiv:2501.11311},
+  year={2025}
+}
+```
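The README states the input regime (single-channel 44.1kHz music, up to an hour long) but does not include a usage snippet. Below is a minimal sketch of conforming a file to that contract; it assumes torchaudio is installed, and `restore` is a hypothetical stand-in for the repo's actual inference entry point, not part of this commit.

```python
# Minimal sketch, not part of this commit: conform an input file to the
# contract the README states (single-channel audio at 44.1 kHz).
# `restore` is a hypothetical stand-in for the repo's inference entry point.
import torch
import torchaudio

TARGET_SR = 44_100  # A2SB operates on 44.1 kHz audio

def load_mono_44k(path: str) -> torch.Tensor:
    wav, sr = torchaudio.load(path)      # shape: (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to a single channel
    if sr != TARGET_SR:
        wav = torchaudio.functional.resample(wav, sr, TARGET_SR)
    return wav

# restored = restore(load_mono_44k("input.flac"))  # hypothetical call
```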
modelcard.md
ADDED
@@ -0,0 +1,114 @@
# Model Overview

## Description:
A2SB uses a UNet architecture to perform inpainting on an audio spectrogram. It can fill in missing frequency bands above 4kHz (bandwidth extension) or fill in short temporal gaps (currently, gaps shorter than 1 second). This model is for non-commercial use only.
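The description above fully determines what the model is asked to fill in: a frequency band above a cutoff, or a span of time frames. As a rough illustration (not the repo's actual code; the factorized representation and the STFT parameters below are assumptions), the two edit types can be expressed as binary masks on a spectrogram grid:

```python
# Illustrative only: the two edit types above expressed as binary masks on
# an STFT grid. A2SB's actual factorized representation and masking live in
# the repo; the STFT parameters here are arbitrary assumptions.
import torch

SR, N_FFT, HOP = 44_100, 2048, 512

def bwe_mask(n_freq: int, n_frames: int, cutoff_hz: float) -> torch.Tensor:
    """1 = keep, 0 = region the model must fill (bins above the cutoff)."""
    cutoff_bin = int(cutoff_hz / (SR / 2) * (n_freq - 1))
    mask = torch.ones(n_freq, n_frames)
    mask[cutoff_bin:, :] = 0.0
    return mask

def inpaint_mask(n_freq: int, n_frames: int, t0: float, t1: float) -> torch.Tensor:
    """1 = keep, 0 = frames inside the missing segment [t0, t1) in seconds."""
    f0, f1 = int(t0 * SR / HOP), int(t1 * SR / HOP)
    mask = torch.ones(n_freq, n_frames)
    mask[:, f0:f1] = 0.0
    return mask
```

With `cutoff_hz=4000.0`, `bwe_mask` marks everything above 4kHz for re-generation, matching the bandwidth-extension case described above.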
### License/Terms of Use:
The model is provided under the NVIDIA OneWay NonCommercial License.

The code is released under the [NVIDIA Source Code License - Non Commercial](https://github.com/NVlabs/I2SB/blob/master/LICENSE). Some components are adapted from other sources: the training code is adapted from [I2SB](https://github.com/NVlabs/I2SB) under the [NVIDIA Source Code License - Non Commercial](https://github.com/NVlabs/I2SB/blob/master/LICENSE), and the model architecture is adapted from [Improved Diffusion](https://github.com/openai/improved-diffusion/blob/main/LICENSE) under the MIT License.

### Deployment Geography:
Global

### Use Case:
Research on audio enhancement and generative modeling, as well as general creative use such as bandwidth extension and inpainting short segments of missing audio.

### Release Date:
GitHub, 06/27/2025, via github.com/NVIDIA/diffusion-audio-restoration

## Reference(s):
- [Project page](https://research.nvidia.com/labs/adlr/A2SB)
- [Technical report](https://arxiv.org/abs/2501.11311)
- [I2SB](https://github.com/NVlabs/I2SB)
- [Improved-Diffusion UNet architecture](https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py)

## Model Architecture:
**Architecture Type:** CNN with interleaved self-attention layers

**Network Architecture:** UNet

## Input:
**Input Type(s):** Audio

**Input Format(s):** WAV/MP3/FLAC

**Input Parameters:** One-Dimensional (1D)

**Other Properties Related to Input:** All audio is assumed to be single-channel, 44.1kHz. For editing, also provide a frequency cutoff for bandwidth extension (content above this frequency is re-generated), or start/end timestamps for segment inpainting.

## Output:
**Output Type(s):** Audio

**Output Format(s):** WAV

**Output Parameters:** One-Dimensional (1D)

**Other Properties Related to Output:** Single-channel 44.1kHz output file. Maximum audio output length is 1 hour.

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
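Neither the README nor this card spells out how hour-long inputs are restored without boundary artifacts. As a generic illustration of the problem (explicitly not A2SB's documented mechanism), long audio can be processed in overlapping chunks with a crossfade so no seam lands at a chunk boundary:

```python
# Generic illustration, not A2SB's documented mechanism: process long audio
# in overlapping chunks and crossfade the overlaps to avoid boundary clicks.
import torch

def process_long(wav: torch.Tensor, fn, chunk: int, overlap: int) -> torch.Tensor:
    """wav: (1, T); fn maps a chunk to a restored chunk of the same length.
    Assumes 0 < overlap < chunk."""
    T = wav.shape[-1]
    hop = chunk - overlap
    out = torch.zeros_like(wav)
    weight = torch.zeros(T)
    win = torch.ones(chunk)
    win[:overlap] = torch.linspace(0.0, 1.0, overlap)   # fade-in ramp
    win[-overlap:] = torch.linspace(1.0, 0.0, overlap)  # fade-out ramp
    for start in range(0, T, hop):
        end = min(start + chunk, T)
        w = win[: end - start]
        out[:, start:end] += fn(wav[:, start:end]) * w
        weight[start:end] += w
    return out / weight.clamp(min=1e-8)  # normalize overlapping weights
```

For example, `process_long(wav, restore, chunk=44_100 * 30, overlap=44_100)` would process 30-second windows with 1-second crossfades (`restore` again being a hypothetical inference function).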
## Software Integration:
**Runtime Engine(s):**
* PyTorch 2.2.2 + CUDA 12.1 + cuDNN 8

**Supported Hardware Microarchitecture Compatibility:**
* NVIDIA Ampere
* NVIDIA Blackwell
* NVIDIA Jetson
* NVIDIA Hopper
* NVIDIA Lovelace
* NVIDIA Pascal
* NVIDIA Turing
* NVIDIA Volta

**Supported Operating System(s):**
* Linux

## Model Versions:
v1

# Training and Evaluation Datasets:

## Training Datasets:

The Properties column below gives each dataset's total duration before license, quality, and sampling-rate filtering. Our training code ingests only raw audio samples; no additional labels provided in the datasets listed below are used for training.

| Dataset Name | Collection Method | Labeling Method | Properties |
| ------ | ------ | ------ | ------ |
| [FMA](https://github.com/mdeff/fma) | Human | N/A | 5257.0 hrs |
| [Medleys-solos-DB](https://medleydb.weebly.com/) | Human | N/A | 17.8 hrs |
| [MUSAN](https://www.openslr.org/17/) | Human | N/A | 42.6 hrs |
| [Musical Instrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset) | Human | N/A | 16.2 hrs |
| [MusicNet](https://zenodo.org/records/5120004) | Human | N/A | 34.5 hrs |
| [Slakh](https://github.com/ethman/slakh-utils) | Hybrid | N/A | 118.3 hrs |
| [FreeSound](https://freesound.org/) | Human | N/A | 4576.6 hrs |
| [FSD50K](https://zenodo.org/records/4060432) | Human | N/A | 75.6 hrs |
| [GTZAN](http://marsyas.info/index.html) | Human | N/A | 8.3 hrs |
| [NSynth](https://magenta.tensorflow.org/datasets/nsynth) | Human | N/A | 340.0 hrs |

## Evaluation Datasets:

| Dataset Name | Collection Method | Labeling Method | Properties |
| ------ | ------ | ------ | ------ |
| [AAM: Artificial Audio Multitracks Dataset](https://zenodo.org/records/5794629) | Automated | N/A | 4 hrs |
| [Maestro](https://magenta.tensorflow.org/datasets/maestro) | Human | N/A | 199.2 hrs |
| [MTD](https://www.audiolabs-erlangen.de/resources/MIR/MTD) | Human | N/A | 0.9 hrs |
| [CC-Mixter](https://members.loria.fr/ALiutkus/kam/) | Human | N/A | 3.2 hrs |

## Inference:
**Engine:** PyTorch

**Test Hardware:**
* NVIDIA Ampere

## Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).