update with LFS support

Files changed (7)

.gitattributes +2 -0
README.md +148 -3
assets/benchmark_results.png +3 -0
assets/throughput.png +3 -0
assets/training_recipe.png +3 -0
assets/visualization_animation.gif +3 -0
modeling.py +1 -4

.gitattributes CHANGED

@@ -34,3 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
+*.gif filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
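
The two added patterns route every `.png` and `.gif` file through Git LFS. A rough way to see which of the seven changed files they affect is sketched below. It approximates Git's attribute matching with `fnmatch` on the file's base name, which is a reasonable stand-in only for slash-free patterns like these; the helper name `tracked_by_lfs` is hypothetical and the `saved_model/**/*` pattern is left out.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Slash-free LFS patterns visible in this hunk, including the two new ones.
LFS_PATTERNS = ["*.zst", "*tfevents*", "tokenizer.json", "*.gif", "*.png"]


def tracked_by_lfs(path, patterns=LFS_PATTERNS):
    """Approximate check: does the file's base name match any LFS pattern?"""
    name = PurePosixPath(path).name
    return any(fnmatch(name, pattern) for pattern in patterns)


changed_files = [
    ".gitattributes",
    "README.md",
    "assets/benchmark_results.png",
    "assets/throughput.png",
    "assets/training_recipe.png",
    "assets/visualization_animation.gif",
    "modeling.py",
]

lfs_files = [path for path in changed_files if tracked_by_lfs(path)]
print(lfs_files)
```

Under this approximation only the four files under `assets/` are selected, which matches the four "Git LFS Details" entries in this change.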

README.md CHANGED

@@ -1,3 +1,148 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+language:
+- en
+base_model:
+- Qwen/Qwen2.5-7B-Instruct
+---
+# Fast-dLLM v2 (7B) — Efficient Block-Diffusion LLM
+## 📖 Introduction
+Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their **inherent sequential decoding limits inference efficiency**.
+We present **Fast-dLLM v2** — a carefully designed **block diffusion language model (dLLM)** that efficiently adapts a pretrained AR model (**Qwen2.5-7B-Instruct**) into a diffusion-style decoder for **parallel text generation**.
+### ✨ Key Innovations
+- **Block Diffusion Mechanism + Complementary Attention Mask**
+  Enables **blockwise bidirectional context modeling** without sacrificing AR objectives.
+- **Hierarchical Caching**
+  - **Block-level cache**: Stores historical context representations across blocks.
+  - **Sub-block cache**: Parallel decoding within partially generated blocks.
+- **Token Shift Mechanism**
+  Retains autoregressive characteristics while supporting bidirectional context within blocks.
+- **Parallel Decoding Pipeline**
+  Achieves up to **2.5× speedup** over standard AR decoding **without compromising quality**.
+> 🚀 Fast-dLLM v2 uses **only ~1B tokens** for fine-tuning — a **500× reduction** vs. full-attention diffusion LLMs (Dream: 580B tokens) — while **matching or surpassing AR baselines** in accuracy.
+![Generation Process](assets/visualization_animation.gif)
+---
+## 🛠 Model Overview
+- **Type**: Block Diffusion Language Model (dLLM)
+- **Base Model**: `Qwen/Qwen2.5-7B-Instruct`
+- **Architecture**: Transformer w/ RoPE, SwiGLU activation, RMSNorm, Attention QKV bias
+- **Params**: ~7B
+- **Layers**: 28
+- **Attention Heads**: 28 (Q), 4 (KV, GQA)
+- **Block Diffusion Size**: 32 tokens
+- **Key Feature**: Parallel **block-wise decoding** + **hierarchical caching (block-level & sub-block)**
+---
+## 📦 Installation
+You will need `transformers`, `torch`, and `numpy`. The **custom generation function** is loaded from the model repository when `trust_remote_code=True` is set:
+```bash
+pip install transformers torch numpy
+```
+---
+## 🚀 Quickstart
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+model_name = "Efficient-Large-Model/Fast_dLLM_7B"
+model = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype="auto",
+    device_map="auto",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+prompt = "Give me a short introduction to large language models."
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": prompt}
+]
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+inputs = tokenizer([text], return_tensors="pt").to(model.device)
+# Fast-dLLM v2 parallel decoding
+gen_ids = model.generate(
+    inputs["input_ids"],
+    tokenizer=tokenizer,
+    max_new_tokens=512,
+    small_block_size=8,
+    threshold=0.9,
+)
+response = tokenizer.decode(
+    gen_ids[0][inputs["input_ids"].shape[1]:],
+    skip_special_tokens=True
+)
+print(response)
+```
+---
+## 📊 Performance & Benchmarks
+### ▶ Real-time Throughput
+Fast-dLLM v2 offers **up to 2.54× higher throughput** than Qwen2.5-7B-Instruct, **without loss in quality**.
+![Throughput Comparison](assets/throughput.png)
+---
+### 🏆 Benchmark Results
+We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:
+HumanEval, MBPP (code), GSM8K, Math (reasoning), IFEval (instruction), MMLU, GPQA (knowledge QA).
+- **1B group**: Fast-dLLM v2 (1.5B) achieves **best average score: 45.0**.
+- **7B group**: Fast-dLLM v2 (7B) achieves **best average score: 60.3**, surpassing LLaDA and Dream models.
+![Benchmark Results](assets/benchmark_results.png)
+---
+## 📜 Citation
+If you use Fast-dLLM v2 in your research or products, please cite:
+```bibtex
+@misc{wu2025fastdllmv2efficientblockdiffusion,
+      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
+      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
+      year={2025},
+      eprint={2509.26328},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2509.26328},
+}
+```
+---
+## 📄 License
+Released under **Apache 2.0**, following the base Qwen2.5 license.
+---
+## 🔗 Resources
+- 📄 [Paper](https://arxiv.org/abs/2509.26328)
+- 💻 [Code](https://github.com/NVlabs/Fast-dLLM)
+- 🤗 [HuggingFace Model](https://huggingface.co/Efficient-Large-Model/Fast_dLLM_7B)
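
The Quickstart passes `threshold=0.9` to `generate`. Assuming this acts as a confidence cutoff for parallel unmasking (the README does not spell out the exact rule), one decoding step could look roughly like the toy sketch below. The mask id, the function names, and the "always fill the most confident position" fallback are illustrative assumptions, not the repository's implementation.

```python
import math

MASK = -1  # hypothetical mask id used only in this toy example


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def unmask_step(tokens, logits_per_position, threshold=0.9):
    """Fill every masked position whose top probability exceeds `threshold`.

    The single most confident masked position is always filled, so each
    step makes progress even when no position clears the threshold.
    """
    candidates = {}
    for i, token in enumerate(tokens):
        if token != MASK:
            continue
        probs = softmax(logits_per_position[i])
        best = max(range(len(probs)), key=probs.__getitem__)
        candidates[i] = (probs[best], best)
    if not candidates:
        return list(tokens)
    most_confident = max(candidates, key=lambda i: candidates[i][0])
    out = list(tokens)
    for i, (prob, token) in candidates.items():
        if prob > threshold or i == most_confident:
            out[i] = token
    return out


tokens = [MASK, MASK, MASK, 7]
logits = [[8.0, 0.0, 0.0], [1.0, 0.9, 0.8], [0.0, 0.0, 6.0], [0.0, 0.0, 0.0]]
step1 = unmask_step(tokens, logits)
step2 = unmask_step(step1, logits)
print(step1, step2)
```

In this toy run two confident positions are filled in the first step and the uncertain one in the second, which is the kind of saving over one-token-at-a-time decoding that the throughput claim relies on.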

assets/benchmark_results.png ADDED

Git LFS Details

SHA256: 9ef4dfb1d35ef1332f9dca4072c2e7727ed761f7f635f9ac891f9b81a54adee7
Pointer size: 131 Bytes
Size of remote file: 133 kB

assets/throughput.png ADDED

Git LFS Details

SHA256: 4f208427d0eeda6fc5e65316aa50b9d5e43ecd38a38fdf3929dc6691bad02079
Pointer size: 131 Bytes
Size of remote file: 125 kB

assets/training_recipe.png ADDED

Git LFS Details

SHA256: b2267f5d41fa4264816e870afa1353f4811b38865c04acdc9f3e4f04f5e3eb0c
Pointer size: 131 Bytes
Size of remote file: 180 kB

assets/visualization_animation.gif ADDED

Git LFS Details

SHA256: 2c4c7fb54af204ea8cc03a8dadc9dde8dc8fb5ac514ead026a5d2833ee3aad37
Pointer size: 132 Bytes
Size of remote file: 1.36 MB
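
The "Pointer size" values above (131 bytes for the PNGs, 132 bytes for the GIF) follow from the Git LFS pointer file layout: a version line, an `oid sha256:` line, and a `size` line. The sketch below rebuilds such a pointer. The exact byte sizes are not shown on this page, so `133_000` and `1_360_000` are hypothetical stand-ins that only share the digit count of the listed sizes.

```python
def lfs_pointer(sha256_hex, size_bytes):
    """Build the text of a Git LFS pointer file (spec v1 layout)."""
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{sha256_hex}\n"
        f"size {size_bytes}\n"
    )


png_oid = "9ef4dfb1d35ef1332f9dca4072c2e7727ed761f7f635f9ac891f9b81a54adee7"
gif_oid = "2c4c7fb54af204ea8cc03a8dadc9dde8dc8fb5ac514ead026a5d2833ee3aad37"

# Hypothetical byte counts: six digits for ~133 kB, seven digits for ~1.36 MB.
png_pointer = lfs_pointer(png_oid, 133_000)
gif_pointer = lfs_pointer(gif_oid, 1_360_000)

print(len(png_pointer.encode("utf-8")))  # 131
print(len(gif_pointer.encode("utf-8")))  # 132
```

The one-byte difference comes entirely from the extra digit in the `size` line, since the other two lines have a fixed length.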

modeling.py CHANGED

@@ -555,7 +555,6 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
         top_p=0.95,
         temperature=0,
         use_block_cache=False,
-        block_cache_refresh_interval=16,
         **kwargs
     ):
         num_blocks = max_new_tokens // block_size
@@ -581,7 +580,6 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
             x_init = torch.cat([input_ids, x_init], dim=1)
             x_t = x_init.clone()
-            step = 0
             block_past_key_values = None
             while True:
                 if stop_token in x_t[:, prompt_length:]:
@@ -612,7 +610,7 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
                                 break
                         if use_block_cache:
-                            if step % block_cache_refresh_interval == 0 or (x_t[:, -block_size+small_block_start_idx] == mask_id).any():
+                            if block_past_key_values is None or (x_t[:, -block_size+small_block_start_idx] == mask_id).any():
                                 output = self.forward(input_ids=x_t[:, -block_size:], use_cache=True, past_key_values=past_key_values, update_past_key_values=False, use_block_cache=True)
                                 logits, block_past_key_values = output.logits, output.block_past_key_values
                                 logits = torch.cat([logits[:, :1, :], logits[:, :-1, :]], dim=1)
@@ -638,7 +636,6 @@ class Fast_dLLM_QwenForCausalLM(Fast_dLLM_QwenPreTrainedModel, GenerationMixin):
                         x_t[:, start:end][unmask_idx] = x_1[unmask_idx]
-                        step += 1
             input_ids = x_t
         # Truncate stop_token
         if stop_token in input_ids[:, original_input_length:]:
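
The `modeling.py` change replaces the step-interval trigger for recomputing the block cache with a check for a missing cache. The toy comparison below counts how often each condition would fire over a run of decoding steps. It is a sketch under stated assumptions: `count_refreshes` is a hypothetical helper, the cache is modeled as a plain object that is set on every refresh, and the list of flags stands in for `(x_t[:, -block_size+small_block_start_idx] == mask_id).any()`.

```python
def count_refreshes(first_position_masked, policy, interval=16):
    """Count full-block forward passes under the old or new refresh rule.

    first_position_masked: one boolean per decoding step, True when the first
    position of the current sub-block still holds the mask id.
    """
    block_cache = None
    refreshes = 0
    for step, masked in enumerate(first_position_masked):
        if policy == "before":
            # Removed rule: refresh on a fixed step schedule or when masked.
            refresh = step % interval == 0 or masked
        else:
            # Added rule: refresh only when no cache exists yet or when masked.
            refresh = block_cache is None or masked
        if refresh:
            block_cache = object()  # stands in for output.block_past_key_values
            refreshes += 1
    return refreshes


# 32 steps; the first position is masked only during the first two steps.
flags = [True, True] + [False] * 30
before = count_refreshes(flags, "before")
after = count_refreshes(flags, "after")
print(before, after)
```

In this toy run the old rule triggers one extra recomputation at step 16, while the new rule reuses the cache once it exists. How the real loop resets `block_past_key_values` outside the lines shown in this diff is not visible here, so the counts are illustrative only.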