LLaDA-346M: Large Language Diffusion with Masking
Model Description
This is a 346-million-parameter Large Language Diffusion Model trained with a masked diffusion process. It demonstrates that diffusion-based approaches can be a viable alternative to autoregressive language models.
Key Features
- Architecture: Masked Diffusion Model (MDM) with Transformer encoder
- Parameters: 346M
- Sequence Length: 512 tokens
- Vocab Size: 50,257 (GPT-2)
- Training Data: 50,000 WikiText-2 samples
Model Architecture
Token Embeddings (50257 × 1024)
↓
Position Embeddings (512 × 1024)
↓
Time Embeddings (MLP)
↓
Transformer Encoder (12 layers, 16 heads)
├─ Self-Attention
└─ Feed-Forward (4096 dim)
↓
Output Projection (1024 × 50257)
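For orientation, here is a minimal PyTorch sketch of the architecture above. It is not the exact training code; the layer names, the shape of the time-embedding MLP, and the way the timestep is injected are assumptions.

import torch
import torch.nn as nn

class MaskedDiffusionModel(nn.Module):
    """Minimal sketch of the diagram above; internal details are assumptions."""
    def __init__(self, vocab_size=50257, hidden_dim=1024, num_layers=12,
                 num_heads=16, ff_dim=4096, dropout=0.1,
                 max_seq_length=512, num_timesteps=100):
        super().__init__()
        self.num_timesteps = num_timesteps                       # kept for API compatibility
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)    # 50257 x 1024
        self.pos_emb = nn.Embedding(max_seq_length, hidden_dim)  # 512 x 1024
        self.time_mlp = nn.Sequential(                           # timestep embedding (MLP)
            nn.Linear(1, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, hidden_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=ff_dim,
            dropout=dropout, batch_first=True, norm_first=True)  # pre-LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out_proj = nn.Linear(hidden_dim, vocab_size)        # 1024 x 50257

    def forward(self, input_ids, t):
        # input_ids: (batch, seq_len) token ids; t: (batch,) masking ratio in [0, 1]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        x = self.token_emb(input_ids) + self.pos_emb(positions)
        x = x + self.time_mlp(t.view(-1, 1, 1).float())          # broadcast over the sequence
        x = self.encoder(x)
        return self.out_proj(x)                                  # (batch, seq_len, vocab) logits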
Training Details
- Algorithm: Masked Diffusion Model (MDM)
- Loss Function: Cross-entropy on masked positions
- Optimizer: AdamW (lr=3e-5, betas=(0.9, 0.95))
- Batch Size: 16 (effective 32 with gradient accumulation over 2 steps)
- Gradient Checkpointing: Enabled
- Mixed Precision: AMP (FP32/FP16)
- Epochs: 4
- Training Samples: 50,000
- GPU: NVIDIA V100 (22GB VRAM)
- Training Time: ~20 hours
Performance
| Metric | Value |
|---|---|
| Initial Loss | 5.96 |
| Final Loss | 4.94 |
| Loss Reduction | 17.1% |
| Total Parameters | 346M |
| Model Size (FP32) | 1.38 GB |
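The size and loss-reduction figures in the table follow directly from the parameter count and the logged losses:

params = 346e6
print(f"FP32 size: {params * 4 / 1e9:.2f} GB")         # 4 bytes per parameter ≈ 1.38 GB
print(f"Loss reduction: {(5.96 - 4.94) / 5.96:.1%}")   # ≈ 17.1%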
Usage
Installation
pip install transformers torch
Loading the Model
import torch
from transformers import AutoTokenizer
from your_module import MaskedDiffusionModel
# Load model
model = MaskedDiffusionModel(
    vocab_size=50257,
    hidden_dim=1024,
    num_layers=12,
    num_heads=16,
    ff_dim=4096,
    dropout=0.1,
    max_seq_length=512,
    num_timesteps=100
)
# Load weights
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
model.load_state_dict(checkpoint)
model.eval()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
Text Generation
from diffusion_sampler import DiffusionSampler

device = "cuda" if torch.cuda.is_available() else "cpu"
# `config` is the model/sampling configuration object shipped with this repository
sampler = DiffusionSampler(model, tokenizer, config, device)

# Generate text via iterative denoising
text = sampler.generate(
    prompt="The future of AI",
    num_steps=40,       # number of denoising steps
    temperature=0.8,
    top_p=0.9
)
print(text)
Model Characteristics
Advantages
✅ Bidirectional Context: Attends to the full sequence in both directions, unlike left-to-right autoregressive models
✅ Parallel Generation: Can predict multiple tokens per denoising step
✅ Reversal Robustness: Comparable performance on forward and reversal tasks
✅ Global Coherence: Reduces left-to-right error accumulation
Limitations
❌ Slower generation (iterative denoising process)
❌ Requires more compute for inference
❌ Not fine-tuned for specific tasks
Training Process
Forward Process
- Gradually corrupts the input by masking tokens at random
- At timestep t ∈ [0, 1], each token is masked independently with probability t
- Produces a noisy, partially masked version of the input (see the sketch below)
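A minimal sketch of this masking step, assuming a reserved [MASK] token id (the actual implementation may differ):

import torch

def forward_mask(input_ids, t, mask_token_id):
    """Mask each token independently with probability t (t in [0, 1])."""
    # input_ids: (batch, seq_len) token ids; t: (batch,) masking ratio per sequence
    mask = torch.rand_like(input_ids, dtype=torch.float) < t.unsqueeze(1)
    noisy_ids = input_ids.clone()
    noisy_ids[mask] = mask_token_id
    return noisy_ids, mask  # `mask` marks the positions the model must recover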
Reverse Process
- Iteratively predicts and unmasks tokens, starting from a fully masked sequence
- Uses the Transformer encoder to predict the identity of every masked position
- Trained with cross-entropy loss on masked positions only (see the sampling sketch below)
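For intuition, here is a simplified version of the unmasking loop at inference time. The confidence-based remasking rule and the exact schedule used by DiffusionSampler are assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, seq_len, num_steps, mask_token_id, device):
    """Start fully masked, then unmask the most confident positions each step."""
    ids = torch.full((1, seq_len), mask_token_id, dtype=torch.long, device=device)
    for step in range(num_steps):
        still_masked = ids == mask_token_id
        if not still_masked.any():
            break
        t = torch.tensor([1.0 - step / num_steps], device=device)
        logits = model(ids, t)                        # (1, seq_len, vocab)
        conf, pred = F.softmax(logits, dim=-1).max(dim=-1)
        conf = conf.masked_fill(~still_masked, -1.0)  # only consider masked positions
        k = max(1, int(still_masked.sum()) // (num_steps - step))
        top = conf.topk(k, dim=-1).indices            # unmask the k most confident positions
        ids[0, top[0]] = pred[0, top[0]]
    return ids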
Optimization Techniques
- Gradient Checkpointing: Saves activation memory by recomputing activations during the backward pass
- Mixed Precision (AMP): Runs most operations in FP16 while keeping FP32 master weights
- Gradient Accumulation: Accumulates gradients over 2 micro-batches to reach the effective batch size of 32
- Layer Norm First (pre-LayerNorm): Normalizes before the attention and feed-forward blocks for improved training stability
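A condensed sketch of how these pieces fit together in one training step, reusing the forward_mask helper sketched above. Data loading and the free variables (device, dataloader, mask_token_id) are assumptions, and gradient checkpointing is assumed to be enabled inside the model, so it is not shown here.

import torch
import torch.nn.functional as F
from torch.cuda.amp import GradScaler, autocast

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, betas=(0.9, 0.95))
scaler = GradScaler()
accum_steps = 2  # micro-batch of 16 -> effective batch size 32

for step, input_ids in enumerate(dataloader):
    input_ids = input_ids.to(device)
    t = torch.rand(input_ids.size(0), device=device)       # random masking ratio per sample
    noisy_ids, mask = forward_mask(input_ids, t, mask_token_id)

    with autocast():                                        # FP16 where numerically safe
        logits = model(noisy_ids, t)
        # cross-entropy on masked positions only
        loss = F.cross_entropy(logits[mask], input_ids[mask]) / accum_steps

    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)                              # unscale and apply AdamW update
        scaler.update()
        optimizer.zero_grad()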
Citation
If you use this model, please cite:
@article{nie2025llada,
  title={Large Language Diffusion Models},
  author={Nie, Shen and others},
  journal={arXiv preprint arXiv:2502.09992},
  year={2025}
}
License
MIT License - Feel free to use for research and commercial purposes
Acknowledgments
- Based on "Large Language Diffusion Models" (Nie et al., 2025)
- Built with PyTorch and Transformers
- Trained on WikiText-2 dataset
- Inspired by diffusion models for vision (DiT, Genie)
Contact & Support
For issues, questions, or suggestions, please open an issue on GitHub or contact the model author.