---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: any-to-any
library_name: bagel-mot
---
<p align="center">
  <img src="https://lf3-static.bytednsdoc.com/obj/eden-cn/nuhojubrps/banner.png" alt="BAGEL" width="480"/>
</p>

# 🥯 BAGEL: Unified Model for Multimodal Understanding and Generation

<p align="center">
  <a href="https://bagel-ai.org/">
    <img src="https://img.shields.io/badge/BAGEL-Website-0A66C2?logo=safari&logoColor=white" alt="BAGEL Website" />
  </a>
  <a href="https://arxiv.org/abs/2505.14683">
    <img src="https://img.shields.io/badge/BAGEL-Paper-red?logo=arxiv&logoColor=red" alt="BAGEL Paper on arXiv" />
  </a>
  <a href="https://github.com/bytedance-seed/BAGEL">
    <img src="https://img.shields.io/badge/BAGEL-Codebase-536af5?logo=github" alt="BAGEL Codebase" />
  </a>
  <a href="https://demo.bagel-ai.org/">
    <img src="https://img.shields.io/badge/BAGEL-Demo-blue?logo=googleplay&logoColor=white" alt="BAGEL Demo" />
  </a>
  <a href="https://discord.com/invite/Z836xxzy">
    <img src="https://img.shields.io/badge/BAGEL-Discord-green?logo=discord&logoColor=white" alt="BAGEL Discord" />
  </a>
</p>

---

We present **BAGEL**, an open-source multimodal foundation model with **7B active parameters (14B total)**, trained on large-scale interleaved multimodal data.

**BAGEL** outperforms leading open-source VLMs such as **Qwen2.5-VL** and **InternVL-2.5** on standard understanding benchmarks, and delivers text-to-image quality competitive with specialist generators such as **SD3**.

It supports:

- Free-form **visual manipulation**
- **Multiview synthesis**
- **World navigation**
- Advanced **image editing** beyond the scope of traditional editing models

---

### 🔧 Installation & Usage

Please refer to our [GitHub repository](https://github.com/bytedance-seed/BAGEL) for:

- Setup instructions
- Example scripts
- Demo usage
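
The checkpoint hosted here can be fetched ahead of time with `huggingface_hub`. This is a minimal sketch, not an official script: the default `repo_id` below is an assumption, so confirm the exact name on this model card's page.

```python
def fetch_weights(repo_id: str = "ByteDance-Seed/BAGEL-7B-MoT",
                  local_dir: str = "./BAGEL-weights") -> str:
    """Download the full BAGEL checkpoint snapshot and return its local path.

    The default repo_id is an assumption; confirm it on the model card page.
    """
    # Deferred import: requires `pip install -U huggingface_hub`.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)

# Usage (downloads the full ~14B-parameter checkpoint, tens of GB):
#   path = fetch_weights()
```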

---

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/teaser.webp" width="80%"/>
</p>

---

## 🧠 Method

**BAGEL** adopts a **Mixture-of-Transformer-Experts (MoT)** architecture to maximize its capacity to learn from richly diverse multimodal information:

- Dual encoders capture **pixel-level** and **semantic-level** image features
- The training objective follows a **Next Group of Token Prediction** paradigm: the model predicts the next group of language or visual tokens as a compression target
- Vision tokens are compressed via the [FLUX.1 VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell)
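
To make the objective concrete, here is a toy, framework-free sketch (not BAGEL's actual training code) of how a token stream decomposes into next-group prediction pairs: the model conditions on all preceding tokens and must predict the entire next group, such as a block of visual tokens, rather than one token at a time.

```python
def next_group_pairs(tokens, group_size):
    """Split `tokens` into fixed-size groups and emit (context, target)
    training pairs: the context is every earlier token, and the target
    is the whole next group instead of a single next token."""
    groups = [tokens[i:i + group_size]
              for i in range(0, len(tokens), group_size)]
    pairs = []
    for k in range(1, len(groups)):
        context = [t for g in groups[:k] for t in g]  # flatten earlier groups
        pairs.append((context, groups[k]))
    return pairs

# A 6-token stream with groups of 2 yields two training pairs:
# predict group 2 from group 1, then group 3 from groups 1-2.
demo = next_group_pairs(["t1", "t2", "v1", "v2", "v3", "v4"], 2)
```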

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/arch.png" width="50%"/>
</p>

---

## 🌱 Emerging Properties

<p align="center">
  <img src="https://github.com/ByteDance-Seed/Bagel/raw/main/assets/emerging_curves.png" width="50%"/>
</p>

As pretraining scales, capabilities emerge in sequence:

- Multimodal understanding
- Generation
- Basic image editing
- Advanced multimodal reasoning and 3D/world modeling

---
## 📊 Benchmarks

### 🖼️ Visual Understanding

| Model | MME ↑ | MMBench ↑ | MMMU ↑ | MM-Vet ↑ | MathVista ↑ |
|---------------|---------:|----------:|---------:|---------:|------------:|
| Janus-Pro-7B  | –        | 79.2      | 41.0     | 50.0     | –           |
| Qwen2.5-VL-7B | 2347     | 83.5      | **58.6** | 67.1     | 68.2        |
| **BAGEL**     | **2388** | **85.0**  | 55.3     | **67.2** | **73.1**    |

---

### 🖌️ Text-to-Image Generation (GenEval)

| Model        | Overall ↑ |
|--------------|----------:|
| FLUX-1-dev   | 0.82      |
| SD3-Medium   | 0.74      |
| Janus-Pro-7B | 0.80      |
| **BAGEL**    | **0.88**  |

---

### 🪄 Image Editing

| Model         | GEdit-Bench-EN (SC) ↑ | GEdit-Bench-EN (PQ) ↑ | GEdit-Bench-EN (O) ↑ | IntelligentBench ↑ |
|---------------|----------------------:|----------------------:|---------------------:|-------------------:|
| Step1X-Edit   | 7.09                  | 6.76                  | **6.70**             | 14.9               |
| Gemini-2-exp. | 6.73                  | 6.61                  | 6.32                 | **57.6**           |
| **BAGEL**     | **7.36**              | **6.83**              | 6.52                 | 44.0               |
| **BAGEL+CoT** | –                     | –                     | –                    | 55.3               |

---

## ⚖️ License

BAGEL is licensed under the **Apache 2.0 License**. It is finetuned from:

- [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
- [siglip-so400m-14-384-flash-attn2](https://huggingface.co/HuggingFaceM4/siglip-so400m-14-384-flash-attn2)

and uses the [FLUX.1-schnell VAE](https://huggingface.co/black-forest-labs/FLUX.1-schnell), all under Apache 2.0.

---

## 📚 Citation

```bibtex
@article{deng2025bagel,
  title   = {Emerging Properties in Unified Multimodal Pretraining},
  journal = {arXiv preprint arXiv:2505.14683},
  year    = {2025}
}
```