Audio-Text-to-Text
ZhifengKong committed on
Commit 1516992 · 1 Parent(s): 76fa399

initial upload

Files changed (2)
  1. .gitattributes +3 -0
  2. README.md +53 -5
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ assets/af2_arch.png filter=lfs diff=lfs merge=lfs -text
+ assets/af2_radar.png filter=lfs diff=lfs merge=lfs -text
+ assets/af2_table2.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,5 +1,53 @@
- ---
- license: other
- license_name: nvidia-oneway-noncommercial-license
- license_link: LICENSE
- ---
+ ---
+ pipeline_tag: audio-text-to-text
+ license: other
+ datasets:
+ - nvidia/AudioSkills
+ - nvidia/AF-Think
+ ---
+
+ # PyTorch Implementation of Audio Flamingo 2
+
+ **Zhifeng Kong, Arushi Goel, João Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro**
+
+ [[paper]]() [[GitHub]](https://github.com/NVIDIA/audio-flamingo/tree/soundCoT)
+
+ This repo contains the PyTorch implementation of [Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding](). Audio Flamingo 2 Sound-CoT (3B) has significantly improved chain-of-thought (CoT) reasoning abilities and is comparable to several 7B reasoning baselines on reasoning benchmarks. It is fine-tuned from our previous [Audio Flamingo 2](https://arxiv.org/abs/2503.03983).
+
+ - We introduce **AF-Reasoning-Eval**, a sound reasoning benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices.
+
+ - We introduce **AF-CoT-Train**, a dataset of 1.24M CoT reasoning traces, to advance the field of audio understanding.
+
+ - Audio Flamingo 2 Sound-CoT shows strong reasoning abilities on several sound reasoning benchmarks, despite being small (3B) and trained exclusively on public datasets.
+
+ ## License
+
+ - The code in this repo is under the MIT license.
+ - The checkpoints are for non-commercial use only (see the NVIDIA OneWay Noncommercial License). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and the original licenses accompanying each training dataset.
+ - Notice: Audio Flamingo 2 Sound-CoT is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved.
+
+
+ ## Citation
+ - Audio Flamingo 2
+ ```
+ @inproceedings{
+ ghosh2025audio,
+ title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities},
+ author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan},
+ booktitle={Forty-second International Conference on Machine Learning},
+ year={2025},
+ url={https://openreview.net/forum?id=xWu5qpDK6U}
+ }
+ ```
+
+ - Audio Flamingo
+ ```
+ @inproceedings{kong2024audio,
+ title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities},
+ author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan},
+ booktitle={International Conference on Machine Learning},
+ pages={25125--25148},
+ year={2024},
+ organization={PMLR}
+ }
+ ```