---
license: mit
language: en
library_name: transformers
tags:
- text-generation
- mixture-of-experts
- moe
- from-scratch
- ag_news
---

# Mixture-of-Experts Foundation Model: AdbhutMOE

**AdbhutMOE** is a miniature, from-scratch Mixture-of-Experts (MoE) autoregressive language model based on the Mixtral architecture. It was pre-trained on a sample of the `ag_news` dataset as part of a learning exercise demonstrating the end-to-end pipeline for creating a sparse foundation model.

This model is intended for **educational purposes only**. It showcases how to configure and train an MoE model, which uses a sparse activation pattern to increase the parameter count while keeping the computational cost per token manageable.
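
For intuition, here is a toy sketch of the kind of top-2 routing a sparse MoE layer performs. It is illustrative only (not the exact Mixtral routing code) and simply reuses this model's sizes: a 256-dimensional hidden state, 8 experts, 2 active per token.

```python
import torch
import torch.nn.functional as F

# Toy top-2 router: each token is scored against every expert, but only the
# two highest-scoring experts are evaluated, so per-token compute scales with
# k (= 2) rather than with the total number of experts (= 8).
hidden_size, num_experts, top_k = 256, 8, 2
tokens = torch.randn(5, hidden_size)                     # 5 token embeddings
router = torch.nn.Linear(hidden_size, num_experts, bias=False)

scores = F.softmax(router(tokens), dim=-1)               # (5, 8) routing probabilities
weights, expert_ids = torch.topk(scores, top_k, dim=-1)  # keep the best 2 per token
weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalise the kept weights

print(expert_ids)  # which 2 of the 8 experts each token would be routed to
print(weights)     # mixing weights for those experts' outputs
```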

- **Developed by:** [rohitnagareddy](https://huggingface.co/rohitnagareddy)
- **Model type:** Mixture-of-Experts Causal Language Model
- **Language:** English
- **License:** MIT

## How to Use

The model can be loaded for text generation with the `transformers` pipeline:

```python
from transformers import pipeline

# Load the model from the Hugging Face Hub
generator = pipeline('text-generation', model='rohitnagareddy/AdbhutMOE')

# Generate text; do_sample=True is needed for temperature and top_k to take effect
prompt = "The latest discovery in space exploration is"
output = generator(
    prompt,
    max_length=50,
    num_return_sequences=1,
    do_sample=True,
    no_repeat_ngram_size=2,
    temperature=0.7,
    top_k=50,
)

print(output[0]['generated_text'])
```
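
The tokenizer and model can also be loaded directly instead of through the pipeline; this assumes the repository ships its tokenizer files alongside the weights.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rohitnagareddy/AdbhutMOE")
model = AutoModelForCausalLM.from_pretrained("rohitnagareddy/AdbhutMOE")

inputs = tokenizer("The latest discovery in space exploration is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    no_repeat_ngram_size=2,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```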

## Model Architecture

**AdbhutMOE** is a small-scale MoE model with the following configuration:

- **Number of layers:** 4
- **Hidden dimension:** 256
- **Number of attention heads:** 4
- **Vocabulary size:** 8000
- **Maximum sequence length:** 256 positions
- **Total experts per layer:** 8
- **Activated experts per token:** 2

This architecture results in a significantly higher parameter count than a dense model of similar computational cost, demonstrating the core benefit of the MoE approach.
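
For reference, a model of this shape could be sketched with `MixtralConfig` from `transformers`. The values below mirror the list above; anything not listed there (for example `intermediate_size` and `num_key_value_heads`) is an illustrative assumption rather than the exact setting used. Setting `output_router_logits=True` is what allows the router's auxiliary load-balancing loss to be included during training.

```python
from transformers import MixtralConfig, MixtralForCausalLM

config = MixtralConfig(
    vocab_size=8000,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,        # assumption: no grouped-query attention
    intermediate_size=512,        # assumption: expert FFN width is not stated above
    max_position_embeddings=256,
    num_local_experts=8,          # total experts per layer
    num_experts_per_tok=2,        # experts activated per token
    output_router_logits=True,    # lets training add the router auxiliary loss
)

model = MixtralForCausalLM(config)
print(f"{model.num_parameters():,} parameters")
```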

---

## Training Details

### Training Data

The model was pre-trained on a shuffled sample of the **`ag_news`** dataset.

- **Dataset:** `ag_news`
- **Sample size:** 10,000 articles
- **Preprocessing:** The text of each article was extracted and used for training after filtering out empty examples.
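
As a rough sketch, a sample like this could be drawn with the `datasets` library; the shuffle seed is arbitrary and the exact filtering rule is an assumption based on the description above.

```python
from datasets import load_dataset

# Take a shuffled 10,000-article sample of ag_news and drop empty texts.
dataset = load_dataset("ag_news", split="train")
dataset = dataset.shuffle(seed=42).select(range(10_000))
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)

print(dataset)             # Dataset with 'text' and 'label' columns, ~10,000 rows
print(dataset[0]["text"])  # a raw news article used for pre-training
```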

### Training Procedure

The model was pre-trained using the Hugging Face `Trainer` on a single GPU.

- **Framework:** PyTorch
- **Training steps:** 100
- **Batch size:** 4
- **Optimizer:** AdamW (default)
- **Objective:** Causal language modeling, including the router's auxiliary loss to encourage balanced load across experts.
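
A minimal sketch of such a run, continuing from the dataset and model sketches above; only standard `Trainer` options are used, and anything not stated in the list (output directory, logging cadence, the tokenization helper) is an assumption.

```python
from transformers import (
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("rohitnagareddy/AdbhutMOE")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # the collator needs a pad token

def tokenize(batch):
    # Truncate each article to the model's 256-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

args = TrainingArguments(
    output_dir="adbhutmoe-pretrain",    # assumption: any scratch directory works
    max_steps=100,
    per_device_train_batch_size=4,
    logging_steps=10,
    report_to="none",
)

trainer = Trainer(
    model=model,  # the MixtralForCausalLM built from the config sketch above
    args=args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # uses AdamW by default; the router aux loss is added automatically
```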

---

## Limitations and Intended Use

**This model is a proof of concept and is not suitable for any real-world application.**

The primary goal of this project was to learn and demonstrate the MoE training pipeline. As a result, the model has significant limitations:

1. **Limited coherence:** While more capable than a dense model trained for the same number of steps, the output may still lack long-range coherence due to the limited training data and short training run.
2. **Confined knowledge:** The model's knowledge is restricted to the 10,000 news articles it was trained on.
3. **Bias:** The model will reflect the biases inherent in the `ag_news` dataset.
4. **No safety alignment:** This is a raw, pre-trained base model that has not undergone any instruction tuning or RLHF. It should not be used in a public-facing capacity.

The intended use is to study the configuration and training behavior of Mixture-of-Experts models.
|
|