# Fast-dLLM v2 (7B) — Efficient Block-Diffusion LLM

## 📖 Introduction
Autoregressive (AR) large language models (LLMs) have achieved remarkable performance across a wide range of natural language tasks, yet their inherent sequential decoding limits inference efficiency.
We present Fast-dLLM v2 — a carefully designed block diffusion language model (dLLM) that efficiently adapts a pretrained AR model (Qwen2.5-7B-Instruct) into a diffusion-style decoder for parallel text generation.
## ✨ Key Innovations

- Block Diffusion Mechanism + Complementary Attention Mask
  Enables blockwise bidirectional context modeling without sacrificing the AR training objective (see the mask sketch below).
- Hierarchical Caching
  - Block-level cache: stores historical context representations across blocks.
  - Sub-block cache: enables parallel decoding within partially generated blocks.
- Token Shift Mechanism
  Retains autoregressive characteristics while supporting bidirectional context within blocks.
- Parallel Decoding Pipeline
  Achieves up to 2.5× speedup over standard AR decoding without compromising generation quality.
🚀 Fast-dLLM v2 uses only ~1B tokens for fine-tuning — a 500× reduction vs. full-attention diffusion LLMs (Dream: 580B tokens) — while matching or surpassing AR baselines in accuracy.
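For intuition, the sketch below shows the attention structure this implies at decoding time: queries attend causally to earlier blocks and bidirectionally within their own block. It is an illustrative reconstruction only (the function name and the reduction to a plain boolean mask are ours, not part of the released code), and it omits the complementary masking used during training.

```python
import torch

def block_diffusion_attention_mask(seq_len: int, block_size: int = 32) -> torch.Tensor:
    """Boolean mask (True = may attend): causal across blocks, bidirectional within a block."""
    block_id = torch.arange(seq_len) // block_size  # block index of each position
    q_block = block_id.unsqueeze(1)                 # (seq_len, 1) query-side blocks
    k_block = block_id.unsqueeze(0)                 # (1, seq_len) key-side blocks
    # Allow attention to any key in the same block or in an earlier block.
    return k_block <= q_block

# Tiny example: 8 positions, blocks of 4 -> two 4x4 bidirectional blocks on the
# diagonal, with the second block also attending to the whole first block.
print(block_diffusion_attention_mask(seq_len=8, block_size=4).int())
```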
## 🛠 Model Overview

- Type: Block Diffusion Language Model (dLLM)
- Base Model: Qwen/Qwen2.5-7B-Instruct
- Architecture: Transformer w/ RoPE, SwiGLU activation, RMSNorm, Attention QKV bias
- Params: ~7B
- Layers: 28
- Attention Heads: 28 (Q), 4 (KV, GQA)
- Block Diffusion Size: 32 tokens
- Key Feature: Parallel block-wise decoding + hierarchical caching (block-level & sub-block)
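If you want to sanity-check the architecture numbers above from the checkpoint itself, a quick config inspection like the following should work. The attribute names assume a standard Qwen2-style configuration; the block-diffusion size is likely stored under a model-specific field in the remote code, so it is not read here.

```python
from transformers import AutoConfig

# Inspect the released config (remote code is trusted so the custom config class loads).
config = AutoConfig.from_pretrained(
    "Efficient-Large-Model/Fast_dLLM_7B", trust_remote_code=True
)
print(config.num_hidden_layers)    # expected: 28 layers
print(config.num_attention_heads)  # expected: 28 query heads
print(config.num_key_value_heads)  # expected: 4 KV heads (GQA)
```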
## 📦 Installation

You will need `transformers` and `torch` (plus `numpy`); the custom generation function ships with the model and is loaded via `trust_remote_code=True`:

```bash
pip install transformers torch numpy
```
## 🚀 Quickstart
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Efficient-Large-Model/Fast_dLLM_7B"

# Load the model together with its custom block-diffusion generation code
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Build a chat-formatted prompt
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Fast-dLLM v2 parallel decoding: small_block_size sets the sub-block size
# decoded in parallel; threshold controls how confident the model must be
# before tokens are accepted in parallel.
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)

# Decode only the newly generated continuation (skip the prompt tokens)
response = tokenizer.decode(
    gen_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```
## 📊 Performance & Benchmarks

### ▶ Real-time Throughput
Fast-dLLM v2 offers up to 2.54× higher throughput than Qwen2.5-7B-Instruct, without loss in quality.
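As a rough way to reproduce a throughput number on your own hardware, you can time the same `generate` call from the Quickstart (this snippet reuses `model`, `tokenizer`, and `inputs` from above). It only illustrates how generated tokens per second can be measured; it is not the benchmark script behind the 2.54× figure.

```python
import time
import torch

# Synchronize before and after so GPU work is fully counted in the timing.
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.time()
gen_ids = model.generate(
    inputs["input_ids"],
    tokenizer=tokenizer,
    max_new_tokens=512,
    small_block_size=8,
    threshold=0.9,
)
if torch.cuda.is_available():
    torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = gen_ids.shape[1] - inputs["input_ids"].shape[1]
print(f"~{new_tokens / elapsed:.1f} generated tokens/s")
```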
### 🏆 Benchmark Results

We compare Fast-dLLM v2 against AR baselines and previous diffusion LLMs on diverse tasks:
HumanEval and MBPP (code), GSM8K and MATH (math reasoning), IFEval (instruction following), MMLU and GPQA (knowledge QA).

- 1B-scale group: the smaller Fast-dLLM v2 variant achieves the best average score: 45.0.
- 7B group: Fast-dLLM v2 (7B) achieves the best average score: 60.3, surpassing the LLaDA and Dream models.
## 📜 Citation
If you use Fast-dLLM v2 in your research or products, please cite:
```bibtex
@misc{wu2025fastdllmv2efficientblockdiffusion,
      title={Fast-dLLM v2: Efficient Block-Diffusion LLM},
      author={Chengyue Wu and Hao Zhang and Shuchen Xue and Shizhe Diao and Yonggan Fu and Zhijian Liu and Pavlo Molchanov and Ping Luo and Song Han and Enze Xie},
      year={2025},
      eprint={2509.26328},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.26328},
}
```
## 📄 License
Released under Apache 2.0, following the base Qwen2.5 license.
## 🔗 Resources

- 📄 [Paper](https://arxiv.org/abs/2509.26328)
- 💻 Code
- 🤗 Hugging Face Model