|
|
--- |
|
|
pipeline_tag: audio-text-to-text |
|
|
license: other |
|
|
datasets: |
|
|
- nvidia/AudioSkills |
|
|
- nvidia/AF-Think |
|
|
--- |
|
|
|
|
|
# PyTorch Implementation of Audio Flamingo Sound-CoT |
|
|
|
|
|
**Zhifeng Kong, Arushi Goel, João Felipe Santos, Sreyan Ghosh, Rafael Valle, Wei Ping, Bryan Catanzaro** |
|
|
|
|
|
[[paper]](https://arxiv.org/abs/2508.11818) [[GitHub]](https://github.com/NVIDIA/audio-flamingo/tree/soundCoT) |
|
|
|
|
|
This repo contains the PyTorch implementation of [Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding](https://arxiv.org/abs/2508.11818). Audio Flamingo 2 Sound-CoT (3B) significantly improves chain-of-thought (CoT) reasoning and is comparable to several 7B reasoning baselines on reasoning benchmarks. It is fine-tuned from our previous [Audio Flamingo 2](https://arxiv.org/abs/2503.03983).
|
|
|
|
|
- We introduce **AF-Reasoning-Eval**, a sound reasoning benchmark targeting common-sense reasoning and the ability to discriminate among closely related choices. |
|
|
|
|
|
- We introduce **AF-CoT-Train** with about 1M CoT reasoning traces to advance the field of audio understanding. |
|
|
|
|
|
- Audio Flamingo 2 Sound-CoT shows strong reasoning abilities on several sound reasoning benchmarks, despite being small (3B) and trained exclusively on public datasets. |
|
|
|
|
|
## Usage |
|
|
|
|
|
The inference script is almost the same as that of [Audio Flamingo 2](https://github.com/NVIDIA/audio-flamingo/tree/audio_flamingo_2/inference_HF_pretrained). The only difference is that a special prompt (`Output the answer with <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> tags.`) is appended after the input question. For instance, in Audio Flamingo 2, the input is
|
|
``` |
|
|
Based on the given audio, identify the source of the church bells. Choose the correct option from the following options:\n(A) Church\n(B) School\n(C) Clock Tower\n(D) Fire Station. |
|
|
``` |
|
|
In Audio Flamingo 2 Sound-CoT, the input is |
|
|
``` |
|
|
Based on the given audio, identify the source of the church bells. Choose the correct option from the following options:\n(A) Church\n(B) School\n(C) Clock Tower\n(D) Fire Station. Output the answer with <SUMMARY>, <CAPTION>, <REASONING>, and <CONCLUSION> tags. |
|
|
``` |
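The prompt construction above can be sketched in a few lines of Python. This is an illustrative helper, not part of the official inference script; the function names are our own, and the tag-extraction step assumes the model emits paired closing tags (e.g. `</CONCLUSION>`), which the examples above do not explicitly show:

```python
import re

# Sound-CoT trigger prompt, appended verbatim after the question.
COT_SUFFIX = (
    " Output the answer with <SUMMARY>, <CAPTION>, "
    "<REASONING>, and <CONCLUSION> tags."
)


def make_cot_prompt(question: str) -> str:
    """Turn an Audio Flamingo 2 question into a Sound-CoT question."""
    return question.rstrip() + COT_SUFFIX


def extract_tag(response: str, tag: str):
    """Pull the content of one CoT tag (e.g. CONCLUSION) from the model
    output, assuming paired <TAG>...</TAG> markers; returns None if absent."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", response, flags=re.DOTALL)
    return m.group(1).strip() if m else None


question = (
    "Based on the given audio, identify the source of the church bells. "
    "Choose the correct option from the following options:\n(A) Church\n"
    "(B) School\n(C) Clock Tower\n(D) Fire Station."
)
prompt = make_cot_prompt(question)
```

The resulting `prompt` string matches the second example above and can be passed to the Audio Flamingo 2 inference script unchanged.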
|
|
|
|
|
## License |
|
|
|
|
|
- The code in this repo is under the MIT license.
|
|
- The checkpoints are for non-commercial use only (see NVIDIA OneWay Noncommercial License). They are also subject to the [Qwen Research license](https://huggingface.co/Qwen/Qwen2.5-3B/blob/main/LICENSE), the [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI, and the original licenses accompanying each training dataset. |
|
|
- Notice: Audio Flamingo 2 Sound-CoT is built with Qwen-2.5. Qwen is licensed under the Qwen RESEARCH LICENSE AGREEMENT, Copyright (c) Alibaba Cloud. All Rights Reserved. |
|
|
|
|
|
|
|
|
## Citation |
|
|
|
|
|
- Audio Flamingo Sound-CoT |
|
|
``` |
|
|
@article{kong2025audio, |
|
|
title={Audio Flamingo Sound-CoT Technical Report: Improving Chain-of-Thought Reasoning in Sound Understanding}, |
|
|
author={Kong, Zhifeng and Goel, Arushi and Santos, Joao Felipe and Ghosh, Sreyan and Valle, Rafael and Ping, Wei and Catanzaro, Bryan}, |
|
|
journal={arXiv preprint arXiv:2508.11818}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
- Audio Flamingo 3 |
|
|
``` |
|
|
@article{goel2025audio, |
|
|
title={Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models}, |
|
|
author={Goel, Arushi and Ghosh, Sreyan and Kim, Jaehyeon and Kumar, Sonal and Kong, Zhifeng and Lee, Sang-gil and Yang, Chao-Han Huck and Duraiswami, Ramani and Manocha, Dinesh and Valle, Rafael and Catanzaro, Bryan}, |
|
|
journal={arXiv preprint arXiv:2507.08128}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
- Audio Flamingo 2 |
|
|
``` |
|
|
@inproceedings{ghosh2025audio,
|
|
title={Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities}, |
|
|
author={Ghosh, Sreyan and Kong, Zhifeng and Kumar, Sonal and Sakshi, S and Kim, Jaehyeon and Ping, Wei and Valle, Rafael and Manocha, Dinesh and Catanzaro, Bryan}, |
|
|
booktitle={Forty-second International Conference on Machine Learning}, |
|
|
year={2025}, |
|
|
url={https://openreview.net/forum?id=xWu5qpDK6U} |
|
|
} |
|
|
``` |
|
|
|
|
|
- Audio Flamingo |
|
|
``` |
|
|
@inproceedings{kong2024audio, |
|
|
title={Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities}, |
|
|
author={Kong, Zhifeng and Goel, Arushi and Badlani, Rohan and Ping, Wei and Valle, Rafael and Catanzaro, Bryan}, |
|
|
booktitle={International Conference on Machine Learning}, |
|
|
pages={25125--25148}, |
|
|
year={2024}, |
|
|
organization={PMLR} |
|
|
} |
|
|
``` |