---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---

<p align="center">
  <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
</p>

# 🥯 BAGEL: Unified Model for Multimodal Understanding and Generation
13
 
14
+ <p align="center">
 
 
 
 
15
  <a href="https://bagel-ai.org/">
16
+ <img src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" />
 
 
 
17
  </a>
18
  <a href="https://arxiv.org/abs/2505.14683">
19
+ <img src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" />
 
 
 
20
  </a>
21
+ <a href="https://github.com/bytedance-seed/BAGEL">
22
+ <img src="https://img.shields.io/badge/BAGEL-Codebase-536af5?logo=github" />
 
 
 
23
  </a>
24
  <a href="https://demo.bagel-ai.org/">
25
+ <img src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" />
 
 
 
26
  </a>
27
  <a href="https://discord.com/invite/Z836xxzy">
28
+ <img src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" />
 
 
 
29
  </a>
 
 
30
  </p>

---

We present **BAGEL**, an open‑source multimodal foundation model with **7B active parameters (14B total)**, trained on large‑scale interleaved multimodal data.

**BAGEL** outperforms leading open‑source VLMs such as **Qwen2.5-VL** and **InternVL-2.5** on standard multimodal understanding benchmarks, and delivers text‑to‑image quality competitive with strong specialist generators such as **SD3**.

It also extends to "world-modeling" tasks beyond the scope of previous image-editing models, supporting:
- Free-form **visual manipulation**
- **Multiview synthesis**
- **World navigation**
- Advanced **image editing**, with qualitative results that surpass leading open-source models in classical editing scenarios

---

### 🔧 Installation & Usage

This repository hosts the model weights for **BAGEL**. Please refer to our [GitHub Repository](https://github.com/bytedance-seed/BAGEL) for:
- Setup instructions
- Example scripts
- Demo usage
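
As a quick start, the weights can be fetched with the standard `huggingface_hub` download API. This is a minimal sketch under assumptions: the `repo_id` below is a placeholder for the official weights repo, so substitute the id shown at the top of this model page.

```python
# Minimal sketch: pull every file of the weights repo to a local folder.
# The repo_id is an assumed placeholder, not confirmed by this card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumption: replace with this card's repo id
    local_dir="./BAGEL-weights",
)
print(f"Weights downloaded to: {local_dir}")
```

Model loading, chat, and image-generation entry points are documented in the GitHub repository linked above.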

---

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"/>
</p>

---

## 🧠 Method

**BAGEL** adopts a **Mixture-of-Transformer-Experts (MoT)** architecture to maximize its capacity to learn from richly diverse multimodal information:
- Dual encoders capture **pixel-level** and **semantic-level** features of an image
- The framework follows a **Next Group of Token Prediction** paradigm: the model predicts the next group of language or visual tokens as a compression target
- Vision tokens are compressed via the [FLUX.1 VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell)

BAGEL scales MoT's capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities such as free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.
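
To make the MoT idea concrete, here is a toy PyTorch sketch, assuming two expert parameter sets that share self-attention over the full multimodal sequence while each token is hard-routed to a modality-specific feed-forward expert. All names (`MoTBlock`, `token_type`) and shapes are illustrative assumptions, not BAGEL's actual implementation.

```python
import torch
import torch.nn as nn

class MoTBlock(nn.Module):
    """Toy Mixture-of-Transformer-Experts block: shared attention, per-modality FFN experts."""

    def __init__(self, dim: int, n_heads: int, n_experts: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # One FFN "expert" per token modality (e.g., 0 = text, 1 = visual).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor, token_type: torch.Tensor) -> torch.Tensor:
        # Shared self-attention mixes information across all modalities.
        attn_out, _ = self.attn(x, x, x)
        h = x + attn_out
        # Hard-route each token through its own modality's expert (residual FFN).
        out = h.clone()
        for i, expert in enumerate(self.experts):
            mask = token_type == i
            out[mask] = h[mask] + expert(h[mask])
        return out

block = MoTBlock(dim=64, n_heads=4)
tokens = torch.randn(1, 10, 64)                         # (batch, seq, dim)
types = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 0, 0, 1]])  # per-token modality id
print(block(tokens, types).shape)                       # torch.Size([1, 10, 64])
```

The design point this illustrates is capacity maximization: experts add parameters without increasing per-token compute, while the shared attention keeps all modalities in one context.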

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"/>
</p>
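
The "Next Group of Token Prediction" objective named above can be caricatured as a group-shifted cross-entropy: positions in one group are supervised on the tokens of the next group, so a whole group of language or visual tokens is predicted in parallel. This toy rendering, with an assumed fixed group size and no attention masking, is a sketch of the idea rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def next_group_loss(logits: torch.Tensor, targets: torch.Tensor, group_size: int) -> torch.Tensor:
    """Toy group-shifted loss: positions in group g predict the tokens of group g+1.

    logits: (seq_len, vocab); targets: (seq_len,). Assumes seq_len % group_size == 0.
    """
    n_groups = targets.shape[0] // group_size
    pred = logits[: (n_groups - 1) * group_size]        # groups 0 .. G-2
    gold = targets[group_size : n_groups * group_size]  # groups 1 .. G-1
    return F.cross_entropy(pred, gold)

logits = torch.randn(12, 100)           # 3 groups of 4 tokens, vocab size 100
targets = torch.randint(0, 100, (12,))
print(next_group_loss(logits, targets, group_size=4))
```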

---

## 🌱 Emerging Properties

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"/>
</p>

Scaling up pretraining with more multimodal tokens yields consistent gains across understanding, generation, and editing tasks, with capabilities emerging in stages:
- Multimodal understanding and generation appear early
- Basic image editing follows
- Complex, intelligent editing emerges later
- Advanced multimodal reasoning and 3D/world modeling build on these foundations

This staged progression suggests an emergent pattern in which advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context for complex multimodal reasoning.

---

## 📊 Benchmarks

### 🖼️ Visual Understanding

| Model         |    MME ↑ | MMBench ↑ |   MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|---------------|---------:|----------:|---------:|---------:|------------:|
| Janus-Pro-7B  |        – |      79.2 |     41.0 |     50.0 |           – |
| Qwen2.5-VL-7B |     2347 |      83.5 | **58.6** |     67.1 |        68.2 |
| **BAGEL**     | **2388** |  **85.0** |     55.3 | **67.2** |    **73.1** |

---

### 🖌️ Text-to-Image Generation (GenEval)

| Model        | Overall ↑ |
|--------------|----------:|
| FLUX-1-dev   |      0.82 |
| SD3-Medium   |      0.74 |
| Janus-Pro-7B |      0.80 |
| **BAGEL**    |  **0.88** |

---

### 🪄 Image Editing

| Model         | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
|---------------|----------------------:|----------------------:|---------------------:|-------------------:|
| Step1X-Edit   |                  7.09 |                  6.76 |             **6.70** |               14.9 |
| Gemini-2-exp. |                  6.73 |                  6.61 |                 6.32 |           **57.6** |
| **BAGEL**     |              **7.36** |              **6.83** |                 6.52 |               44.0 |
| **BAGEL+CoT** |                     – |                     – |                    – |               55.3 |

---

## ⚖️ License

BAGEL is licensed under the **Apache 2.0 License**.

It is finetuned from:
- [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2)

and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell); all of these are released under Apache 2.0.

---

## 📚 Citation

```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```