---
license: mit
language: en
library_name: transformers
tags:
- text-generation
- mixture-of-experts
- moe
- from-scratch
- ag_news
---

# Mixture-of-Experts Foundation Model: AdbhutMOE

**AdbhutMOE** is a miniature, from-scratch Mixture-of-Experts (MoE) autoregressive language model based on the Mixtral architecture. It was pre-trained on a sample of the `ag_news` dataset as part of a learning exercise demonstrating the end-to-end pipeline for creating a sparse foundation model.

This model is intended for **educational purposes only**. It showcases how to configure and train an MoE model, which uses a sparse activation pattern to increase the parameter count while keeping the computational cost per token manageable.
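
For intuition, here is a toy sketch of the kind of top-2 routing a sparse MoE layer performs. It is illustrative only (not the exact Mixtral routing code) and simply reuses this model's sizes: a 256-dimensional hidden state, 8 experts, 2 active per token.

```python
import torch
import torch.nn.functional as F

# Toy top-2 router: each token is scored against every expert, but only the
# two highest-scoring experts are evaluated, so per-token compute scales with
# k (= 2) rather than with the total number of experts (= 8).
hidden_size, num_experts, top_k = 256, 8, 2
tokens = torch.randn(5, hidden_size)                     # 5 token embeddings
router = torch.nn.Linear(hidden_size, num_experts, bias=False)

scores = F.softmax(router(tokens), dim=-1)               # (5, 8) routing probabilities
weights, expert_ids = torch.topk(scores, top_k, dim=-1)  # keep the best 2 per token
weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalise the kept weights

print(expert_ids)  # which 2 of the 8 experts each token would be routed to
print(weights)     # mixing weights for those experts' outputs
```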

- **Developed by:** [rohitnagareddy](https://huggingface.co/rohitnagareddy)
- **Model type:** Mixture-of-Experts Causal Language Model
- **Language:** English
- **License:** MIT

## How to Use

The model can be loaded for text generation with the `transformers` pipeline:

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
generator = pipeline('text-generation', model='rohitnagareddy/AdbhutMOE')

# Generate text; do_sample=True is needed for temperature and top_k to take effect
prompt = "The latest discovery in space exploration is"
output = generator(
    prompt,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,
    no_repeat_ngram_size=2,
    temperature=0.7,
    top_k=50,
)

print(output[0]['generated_text'])
```
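
The tokenizer and model can also be loaded directly instead of through the pipeline; this assumes the repository ships its tokenizer files alongside the weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rohitnagareddy/AdbhutMOE")
model = AutoModelForCausalLM.from_pretrained("rohitnagareddy/AdbhutMOE")

inputs = tokenizer("The latest discovery in space exploration is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```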

## Model Architecture

**AdbhutMOE** is a small-scale MoE model with the following configuration:

- **Number of layers:** 4
- **Hidden dimension:** 256
- **Number of attention heads:** 4
- **Vocabulary size:** 8000
- **Maximum sequence length:** 256 positions
- **Total experts per layer:** 8
- **Activated experts per token:** 2

This architecture results in a significantly higher parameter count than a dense model of similar computational cost, demonstrating the core benefit of the MoE approach.
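
For reference, a model of this shape could be sketched with `MixtralConfig` from `transformers`. The values below mirror the list above; anything not listed there (for example `intermediate_size` and `num_key_value_heads`) is an illustrative assumption rather than the exact setting used. Setting `output_router_logits=True` is what allows the router's auxiliary load-balancing loss to be included during training.

```python
from transformers import MixtralConfig, MixtralForCausalLM

config = MixtralConfig(
    vocab_size=8000,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,        # assumption: no grouped-query attention
    intermediate_size=512,        # assumption: expert FFN width is not stated above
    max_position_embeddings=256,
    num_local_experts=8,          # total experts per layer
    num_experts_per_tok=2,        # experts activated per token
    output_router_logits=True,    # lets training add the router auxiliary loss
)

model = MixtralForCausalLM(config)
print(f"{model.num_parameters():,} parameters")
```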

---

## Training Details

### Training Data

The model was pre-trained on a shuffled sample of the **`ag_news`** dataset.

- **Dataset:** `ag_news`
- **Sample size:** 10,000 articles
- **Preprocessing:** The text of each article was extracted and used for training after filtering out empty examples.
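
As a rough sketch, a sample like this could be drawn with the `datasets` library; the shuffle seed is arbitrary and the exact filtering rule is an assumption based on the description above.

```python
from datasets import load_dataset

# Take a shuffled 10,000-article sample of ag_news and drop empty texts.
dataset = load_dataset("ag_news", split="train")
dataset = dataset.shuffle(seed=42).select(range(10_000))
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

print(dataset)             # Dataset with 'text' and 'label' columns, ~10,000 rows
print(dataset[0]["text"])  # a raw news article used for pre-training
```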

### Training Procedure

The model was pre-trained using the Hugging Face `Trainer` on a single GPU.

- **Framework:** PyTorch
- **Training steps:** 100
- **Batch size:** 4
- **Optimizer:** AdamW (default)
- **Objective:** Causal language modeling, including the router's auxiliary loss to encourage balanced load across experts.
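
A minimal sketch of such a run, continuing from the dataset and model sketches above; only standard `Trainer` options are used, and anything not stated in the list (output directory, logging cadence, the tokenization helper) is an assumption.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("rohitnagareddy/AdbhutMOE")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token

def tokenize(batch):
    # Truncate each article to the model's 256-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="adbhutmoe-pretrain",    # assumption: any scratch directory works
    max_steps=100,
    per_device_train_batch_size=4,
    logging_steps=10,
    report_to="none",
)

trainer = Trainer(
    model=model,  # the MixtralForCausalLM built from the config sketch above
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # uses AdamW by default; the router aux loss is added automatically
```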

---

## Limitations and Intended Use

**This model is a proof of concept and is not suitable for any real-world application.**

The primary goal of this project was to learn and demonstrate the MoE training pipeline. As a result, the model has significant limitations:

1. **Limited coherence:** While more capable than a dense model trained for the same number of steps, the output may still lack long-range coherence due to the limited training data and short training run.
2. **Confined knowledge:** The model's knowledge is restricted to the 10,000 news articles it was trained on.
3. **Bias:** The model will reflect the biases inherent in the `ag_news` dataset.
4. **No safety alignment:** This is a raw, pre-trained base model that has not undergone any instruction tuning or RLHF. It should not be used in a public-facing capacity.

The intended use is to study the configuration and training behavior of Mixture-of-Experts models.
|
|