ZhifengKong committed on
Commit bd45601 · 1 Parent(s): 57c7d5b

initial commit
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/af2_arch.png filter=lfs diff=lfs merge=lfs -text
+ assets/af2_radar.png filter=lfs diff=lfs merge=lfs -text
+ assets/af2_table2.png filter=lfs diff=lfs merge=lfs -text
NVIDIA OneWay Noncommercial License.docx ADDED
Binary file (20.5 kB)
 
README.md CHANGED
@@ -1,5 +1,33 @@
  ---
  license: other
  license_name: nvidia-oneway-noncommercial-license
- license_link: LICENSE
  ---
+
+ # PyTorch Implementation of Audio-to-Audio Schrodinger Bridges
+
+ **Zhifeng Kong, Kevin J Shih, Weili Nie, Arash Vahdat, Sang-gil Lee, Joao Felipe Santos, Ante Jukic, Rafael Valle, Bryan Catanzaro**
+
+ [[paper]](https://arxiv.org/abs/2501.11311) [[GitHub]](https://github.com/NVIDIA/diffusion-audio-restoration) [[Demo]](https://research.nvidia.com/labs/adlr/A2SB/)
+
+ This repo contains the PyTorch implementation of [A2SB: Audio-to-Audio Schrodinger Bridges](https://arxiv.org/abs/2501.11311). A2SB is an audio restoration model tailored for high-resolution 44.1kHz music. It performs both bandwidth extension (predicting high-frequency components) and inpainting (re-generating missing segments). Critically, A2SB is end-to-end, predicting waveform outputs without the need for a vocoder, and can restore hour-long audio inputs. It achieves state-of-the-art bandwidth extension and inpainting quality on several out-of-distribution music test sets.
+
+ - We propose A2SB, a state-of-the-art, end-to-end, vocoder-free, multi-task diffusion Schrodinger Bridge model for 44.1kHz high-resolution music restoration, built on an effective factorized audio representation.
+
+ - A2SB is the first long-audio restoration model that can restore hour-long audio without boundary artifacts.
+
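As a rough illustration of the bandwidth-extension setting above: the degraded input can be pictured as a spectrogram with everything above a cutoff frequency zeroed out, which the model then fills back in. This NumPy sketch is illustrative only; the FFT size and the toy spectrogram are assumptions, not values from the A2SB codebase.

```python
import numpy as np

SR = 44100        # A2SB operates on 44.1 kHz audio
N_FFT = 2048      # assumed STFT size, for illustration only
CUTOFF_HZ = 4000  # example cutoff: regenerate content above 4 kHz

# Center frequency of each STFT bin
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SR)

# Keep bins below the cutoff; zero the band the model must fill in
mask = (freqs < CUTOFF_HZ).astype(np.float32)

spec = np.ones((len(freqs), 100), dtype=np.float32)  # toy magnitude spectrogram
degraded = spec * mask[:, None]  # model input: low band only
```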
+ ## License
+
+ The model is provided under the NVIDIA OneWay NonCommercial License.
+
+ ## Citation
+
+ ```
+ @article{kong2025a2sb,
+   title={A2SB: Audio-to-Audio Schrodinger Bridges},
+   author={Kong, Zhifeng and Shih, Kevin J and Nie, Weili and Vahdat, Arash and Lee, Sang-gil and Santos, Joao Felipe and Jukic, Ante and Valle, Rafael and Catanzaro, Bryan},
+   journal={arXiv preprint arXiv:2501.11311},
+   year={2025}
+ }
+ ```
modelcard.md ADDED
@@ -0,0 +1,114 @@
+ # Model Overview
+
+ ## Description:
+ A2SB uses a UNet architecture to perform inpainting on an audio spectrogram. It can fill in missing frequency bands above 4kHz (bandwidth extension), or fill in short temporal slices (currently supporting gaps of less than 1 second). This model is for non-commercial use only.
+
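To make the sub-second inpainting constraint concrete, a start/end time-stamp pair can be mapped to a range of spectrogram frames to mask out and re-generate. The hop size and frame count below are illustrative assumptions, not A2SB's actual settings.

```python
import numpy as np

SR = 44100
HOP = 512  # assumed STFT hop size, for illustration only

# Gap to inpaint: 2.40 s to 2.95 s (0.55 s, within the < 1 s limit)
start_s, end_s = 2.40, 2.95
start_frame = int(start_s * SR / HOP)
end_frame = int(np.ceil(end_s * SR / HOP))

n_frames = 300  # toy spectrogram length in frames
time_mask = np.ones(n_frames, dtype=np.float32)
time_mask[start_frame:end_frame] = 0.0  # frames the model re-generates
```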
+ ### License/Terms of Use:
+ The model is provided under the NVIDIA OneWay NonCommercial License.
+
+ The code is released under the [NVIDIA Source Code License - Non Commercial](https://github.com/NVlabs/I2SB/blob/master/LICENSE). Some components are adapted from other sources: the training code is adapted from [I2SB](https://github.com/NVlabs/I2SB) under the same license, and the model architecture is adapted from [Improved Diffusion](https://github.com/openai/improved-diffusion/blob/main/LICENSE) under the MIT License.
+
+ ### Deployment Geography:
+ Global
+
+ ### Use Case:
+ Research on audio enhancement and generative modeling, as well as general creative use such as bandwidth extension and inpainting short segments of missing audio.
+
+ ### Release Date:
+ GitHub 06/27/2025 via github.com/NVIDIA/diffusion-audio-restoration
+
+ ## Reference(s):
+ - [project page](https://research.nvidia.com/labs/adlr/A2SB)
+ - [technical report](https://arxiv.org/abs/2501.11311)
+ - [I2SB](https://github.com/NVlabs/I2SB)
+ - [Improved-Diffusion UNet Architecture](https://github.com/openai/improved-diffusion/blob/main/improved_diffusion/unet.py)
+
+ ## Model Architecture:
+ **Architecture Type:** CNN with interleaved self-attention layers
+
+ **Network Architecture:** UNet
+
+ ## Input:
+ **Input Type(s):** Audio
+
+ **Input Format(s):** WAV/MP3/FLAC
+
+ **Input Parameters:** One-Dimensional (1D)
+
+ **Other Properties Related to Input:** All audio is assumed to be single-channel, 44.1kHz. For editing, also provide a frequency cutoff for bandwidth-extension sampling (content above this frequency is re-generated), or start/end time stamps for segment inpainting.
+
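Since inputs are assumed to be single-channel at 44.1kHz, other files need to be downmixed and resampled first. A minimal sketch under those assumptions: `to_model_input` is a hypothetical helper, and linear interpolation stands in for a proper polyphase resampler.

```python
import numpy as np

TARGET_SR = 44100  # the model assumes single-channel 44.1 kHz input

def to_model_input(audio: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and resample to 44.1 kHz (hypothetical helper;
    linear interpolation is a stand-in for a real resampler)."""
    if audio.ndim == 2:  # (channels, samples) -> mono
        audio = audio.mean(axis=0)
    if sr != TARGET_SR:
        n_out = int(round(len(audio) * TARGET_SR / sr))
        t_in = np.arange(len(audio)) / sr
        t_out = np.arange(n_out) / TARGET_SR
        audio = np.interp(t_out, t_in, audio)
    return audio.astype(np.float32)
```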
+ ## Output:
+ **Output Type(s):** Audio
+
+ **Output Format(s):** WAV
+
+ **Output Parameters:** One-Dimensional (1D)
+
+ **Other Properties Related to Output:** Single-channel 44.1kHz output file. Maximum audio output length is 1 hour.
+
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.
+
+ ## Software Integration:
+ **Runtime Engine(s):**
+ * PyTorch 2.2.2 (CUDA 12.1, cuDNN 8)
+
+ **Supported Hardware Microarchitecture Compatibility:**
+ * NVIDIA Ampere
+ * NVIDIA Blackwell
+ * NVIDIA Jetson
+ * NVIDIA Hopper
+ * NVIDIA Lovelace
+ * NVIDIA Pascal
+ * NVIDIA Turing
+ * NVIDIA Volta
+
+ **Supported Operating System(s):**
+ * Linux
+
+ ## Model Versions:
+ v1
+
+ # Training and Evaluation Datasets:
+
+ ## Training Datasets:
+
+ The Properties column below shows the total duration before license, quality, and sampling-rate filtering. Our model training code ingests only raw audio samples; no additional labels provided in the datasets listed below are used for training.
+
+ | Dataset Name | Collection Method | Labeling Method | Properties |
+ | ------ | ------ | ------ | ------ |
+ | [FMA](https://github.com/mdeff/fma) | Human | N/A | 5257.0 hrs |
+ | [Medleys-solos-DB](https://medleydb.weebly.com/) | Human | N/A | 17.8 hrs |
+ | [MUSAN](https://www.openslr.org/17/) | Human | N/A | 42.6 hrs |
+ | [Musical Instrument](https://www.kaggle.com/datasets/soumendraprasad/musical-instruments-sound-dataset) | Human | N/A | 16.2 hrs |
+ | [MusicNet](https://zenodo.org/records/5120004) | Human | N/A | 34.5 hrs |
+ | [Slakh](https://github.com/ethman/slakh-utils) | Hybrid | N/A | 118.3 hrs |
+ | [FreeSound](https://freesound.org/) | Human | N/A | 4576.6 hrs |
+ | [FSD50K](https://zenodo.org/records/4060432) | Human | N/A | 75.6 hrs |
+ | [GTZAN](http://marsyas.info/index.html) | Human | N/A | 8.3 hrs |
+ | [NSynth](https://magenta.tensorflow.org/datasets/nsynth) | Human | N/A | 340.0 hrs |
+
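As a quick sanity check on the Properties column, the listed durations sum to roughly 10,487 hours of raw audio before filtering:

```python
# Hours per training dataset, copied from the table above (pre-filtering)
hours = {
    "FMA": 5257.0, "Medleys-solos-DB": 17.8, "MUSAN": 42.6,
    "Musical Instrument": 16.2, "MusicNet": 34.5, "Slakh": 118.3,
    "FreeSound": 4576.6, "FSD50K": 75.6, "GTZAN": 8.3, "NSynth": 340.0,
}
total_hours = sum(hours.values())  # about 10486.9 hours in total
```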
+
+ ## Evaluation Datasets:
+ | Dataset Name | Collection Method | Labeling Method | Properties |
+ | ------ | ------ | ------ | ------ |
+ | [AAM: Artificial Audio Multitracks Dataset](https://zenodo.org/records/5794629) | Automated | N/A | 4 hrs |
+ | [Maestro](https://magenta.tensorflow.org/datasets/maestro) | Human | N/A | 199.2 hrs |
+ | [MTD](https://www.audiolabs-erlangen.de/resources/MIR/MTD) | Human | N/A | 0.9 hrs |
+ | [CC-Mixter](https://members.loria.fr/ALiutkus/kam/) | Human | N/A | 3.2 hrs |
+
+
+ ## Inference:
+ **Engine:** PyTorch
+
+ **Test Hardware:**
+ * NVIDIA Ampere
+
+ ## Ethical Considerations:
+ NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloading or using this model in accordance with our terms of service, developers should work with their internal model team to ensure it meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).