MoE-5L-Active-ArXiv-Code-SimpleStories

Model Description

This is a 5-layer Mixture of Experts (MoE) transformer trained on a combination of ArXiv papers, code repositories, and the SimpleStories dataset. Each MoE layer routes every token to 2 of its 8 experts, so only a fraction of the model's parameters is used per token.

Model Details

Architecture

Model Type: Mixture of Experts Transformer for Causal Language Modeling
Architecture: MoeTransformerForCausalLM
Parameters: ~140M total (8 experts × ~17.5M each)
Active Parameters: ~35M per token (top-2 expert routing)
Layers: 5 transformer layers with MoE feed-forward networks
Hidden Size: 768
Attention Heads: 12 (with 8 key-value heads for efficiency)
Vocabulary Size: 50,256 tokens
Max Sequence Length: 1024 tokens
Context Window: 512 tokens (with windowing support)
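The parameter figures above follow from simple arithmetic, which the short sketch below reproduces. It takes the card's approximation at face value that the parameter count is dominated by the experts; attention and embedding weights are shared and always active, so the true active fraction of a real checkpoint would differ somewhat.

```python
# Figures quoted in this card; the per-expert size is the card's own approximation.
NUM_EXPERTS = 8
TOP_K = 2
PARAMS_PER_EXPERT = 17.5e6

total_params = NUM_EXPERTS * PARAMS_PER_EXPERT   # every expert is stored
active_params = TOP_K * PARAMS_PER_EXPERT        # experts used for one token
active_fraction = active_params / total_params

print(f"total:  {total_params / 1e6:.0f}M")
print(f"active: {active_params / 1e6:.0f}M ({active_fraction:.0%})")
```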

MoE Configuration

Number of Experts: 8 experts per layer
Expert Selection: Top-2 routing (2 experts activated per token)
Router Type: Learned gating network with auxiliary loss
Load Balancing: Auxiliary loss with coefficient 0.01
Router Z-Loss: Coefficient 0.001
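To make top-2 routing concrete, here is a minimal pure-Python sketch of how a gating network's logits for one token could be turned into two expert choices and their mixing weights. The function names and the logit values are hypothetical, and renormalizing the two selected probabilities so they sum to 1 is one common convention, not necessarily the one this model uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Weights are the softmax probabilities of the selected experts,
    renormalized to sum to 1 (an assumed convention).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# One token's router logits over 8 experts (made-up values).
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
routes = top_k_route(logits, k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

The token's feed-forward output would then be the weighted sum of the two selected experts' outputs; the other six experts are not evaluated for that token.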

Training Details

Training Data: ArXiv papers, code repositories, and SimpleStories
Training Epochs: 2
Batch Size: 256
Learning Rate: 5e-4 (lower than dense model for stability)
Optimizer: AdamW (β1=0.9, β2=0.999)
Dropout: 0.1 (attention and hidden layers)
Normalization: RMSNorm (ε=1e-6)

Model Features

Mixture of Experts: Sparse activation with expert routing for efficiency
Load Balancing: Auxiliary loss to ensure balanced expert utilization
Rotary Position Embeddings: For better handling of positional information
Grouped-Query Attention: Efficient attention with 12 query heads and 8 key-value heads
SwiGLU Activation: SiLU-gated linear unit used in the expert feed-forward layers
RMSNorm: Root-mean-square normalization, a lighter alternative to LayerNorm, for improved training stability
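Two of the features listed above are compact enough to sketch in a few lines. The pure-Python functions below illustrate RMSNorm and the elementwise gating step of SwiGLU on plain lists; the function names are hypothetical and this is not the model's actual implementation, which would operate on tensors.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate, up):
    """Elementwise SwiGLU gating: SiLU(gate) * up.

    In a full feed-forward layer, `gate` and `up` would be two linear
    projections of the same input, followed by a down projection.
    """
    return [silu(g) * u for g, u in zip(gate, up)]

x = [3.0, -4.0]
normed = rms_norm(x, weight=[1.0, 1.0])
gated = swiglu_gate(gate=[0.0, 1.0], up=[5.0, 2.0])
print(normed, gated)
```

Unlike LayerNorm, RMSNorm does not subtract the mean or add a bias term, which is why it is slightly cheaper to compute.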

Usage

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "your-username/moe-5l-active-arxiv-code-simplestories"
tokenizer = AutoTokenizer.from_pretrained(model_name)
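# device_map="auto" requires the accelerate package. If the repository ships
# custom modeling code, trust_remote_code=True may also be needed.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"
)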

Text Generation

# Generate text with the MoE model
prompt = "The concept of mixture of experts in machine learning"
# Move the inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,  # total length, prompt tokens included
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Code Generation with Expert Routing

# Generate Python code and inspect expert usage
prompt = "def quicksort(arr):"
# Move the inputs to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=150,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        output_router_logits=True,
        return_dict_in_generate=True
    )

generated_code = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated Code:\n{generated_code}")

# Whether generate() exposes routing information depends on the model
# implementation, so check for the attribute before using it
if hasattr(outputs, 'router_logits'):
    print("Expert routing information available")

Advanced Usage: Expert Analysis

# Analyze expert specialization
def analyze_expert_usage(model, tokenizer, prompts):
    """Analyze which experts are activated for different types of prompts"""
    results = {}
    
    for prompt_type, prompt in prompts.items():
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model(
                **inputs,
                output_router_logits=True,
                return_dict=True
            )
        
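        # Prefer the raw router logits and fall back to the auxiliary losses.
        # Both attribute names depend on the model implementation.
        routing = getattr(outputs, 'router_logits', None)
        if routing is None:
            routing = getattr(outputs, 'router_aux_losses', None)
        if routing is not None:
            results[prompt_type] = routing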
    
    return results

# Example usage
prompts = {
    "math": "The derivative of x^2 is",
    "code": "def factorial(n):",
    "story": "Once upon a time in a distant galaxy",
    "science": "The theory of relativity explains"
}

expert_analysis = analyze_expert_usage(model, tokenizer, prompts)
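Raw router outputs are hard to read directly, so a small helper that turns logits into per-expert selection counts can help. The sketch below is pure Python and assumes that, for a single layer, the router logits can be converted (for example with `.tolist()`) into a list with one row of expert logits per token. That layout and the helper name are assumptions; the card does not document the output format.

```python
from collections import Counter

def top_k_expert_counts(layer_logits, k=2):
    """Count how often each expert is among the top-k choices.

    `layer_logits` is a list of per-token lists of router logits for one layer.
    """
    counts = Counter()
    for token_logits in layer_logits:
        ranked = sorted(range(len(token_logits)),
                        key=lambda i: token_logits[i], reverse=True)
        counts.update(ranked[:k])
    return counts

# Made-up logits for 3 tokens and 4 experts.
example = [
    [0.2, 1.5, -0.3, 0.9],
    [1.1, 0.4, 0.0, 2.0],
    [-1.0, 0.7, 0.6, 0.1],
]
counts = top_k_expert_counts(example, k=2)
print(dict(counts))
```

Comparing these counts across the math, code, story, and science prompts is one simple way to look for signs of specialization.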

Intended Use

Primary Use Cases

Research: Advanced research in mixture of experts and sparse models
Efficiency Studies: Investigating parameter-efficient language models
Domain Adaptation: Leveraging expert specialization for multi-domain tasks
Educational: Learning about MoE architectures and expert routing

Suitable Tasks

Multi-domain text generation (academic, code, narrative)
Efficient large-scale language modeling
Domain-specific content generation with expert routing
Research into expert specialization patterns

Advantages of MoE Architecture

Efficiency Benefits

Parameter Efficiency: Only ~25% of parameters active per forward pass
Scalability: Can increase model capacity without proportional compute increase
Specialization: Experts can specialize in different domains or patterns
Memory Efficiency: Lower activation memory than a dense model of the same total size, although all expert weights must still be held in memory

Performance Benefits

Quality: Often matches or exceeds dense models of similar active parameter count
Versatility: Better handling of diverse domains due to expert specialization
Adaptability: Can potentially learn domain-specific routing patterns

Limitations and Biases

MoE-Specific Limitations

Routing Instability: Expert routing can be unstable during training
Load Imbalance: Some experts may be underutilized despite load balancing
Complexity: More complex architecture with additional hyperparameters
Hardware Requirements: May require specialized hardware for optimal efficiency

General Limitations

Context Length: Limited to 1024 tokens maximum sequence length
Training Complexity: More complex training dynamics than dense models
Expert Collapse: Risk of experts becoming redundant
Inference Complexity: Routing overhead during inference

Potential Biases

Dataset Bias: Reflects biases present in training data across all experts
Expert Bias: Different experts may exhibit different biases
Routing Bias: Expert selection may be biased toward certain patterns
Domain Imbalance: Expert specialization may favor overrepresented domains

Training Data

The model was trained on a curated dataset combining:

ArXiv Papers: Academic papers for scientific and mathematical reasoning
Code Repositories: Programming code for software development tasks
SimpleStories: Narrative text for story generation and general language understanding

The MoE architecture allows the model to potentially develop specialized experts for each domain.

Expert Routing Analysis

Expected Expert Specializations

Based on the training data, experts may specialize in:

Mathematical/Scientific content (from ArXiv papers)
Programming languages and code patterns (from code repositories)
Narrative and storytelling (from SimpleStories)
General language patterns (cross-domain)

Load Balancing

The model uses two auxiliary losses to keep routing balanced and stable:

Router Auxiliary Loss: Encourages uniform expert selection
Z-Loss: Penalizes large router logits, which keeps routing numerically stable
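As an illustration of the two losses above, the following pure-Python sketch implements one widely used formulation: a Switch-Transformer-style balancing term and a mean squared log-sum-exp z-loss. Whether this model uses exactly these formulas is an assumption; only the coefficients 0.01 and 0.001 come from this card.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def load_balancing_loss(batch_logits, k=2):
    """num_experts * sum_i(f_i * p_i), where f_i is the fraction of top-k
    assignments going to expert i and p_i is its mean router probability."""
    n_tokens, n_experts = len(batch_logits), len(batch_logits[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for logits in batch_logits:
        probs = softmax(logits)
        top = sorted(range(n_experts), key=lambda i: probs[i], reverse=True)[:k]
        for i in top:
            f[i] += 1.0 / (n_tokens * k)
        for i in range(n_experts):
            p[i] += probs[i] / n_tokens
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

def router_z_loss(batch_logits):
    """Mean squared log-sum-exp of the router logits."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)

# Coefficients quoted in this card; the input batch is made up.
AUX_COEF, Z_COEF = 0.01, 0.001
batch = [[0.0] * 8 for _ in range(4)]  # 4 tokens, uniform router logits
extra_loss = AUX_COEF * load_balancing_loss(batch) + Z_COEF * router_z_loss(batch)
print(extra_loss)
```

With uniform router probabilities the balancing term evaluates to 1.0, and the z-loss grows as the logits grow in magnitude, which is what discourages the router from producing very large values.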

Evaluation

MoE-Specific Metrics

Expert Utilization: Measure of how evenly experts are used
Routing Entropy: Diversity of expert selection patterns
Expert Specialization: Domain-specific expert activation analysis
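The first two metrics can be computed from per-expert assignment counts alone. The sketch below is a minimal pure-Python version; the counts are hypothetical values chosen for illustration, not measurements from this model, and the function names are not part of any library.

```python
import math

def utilization(counts):
    """Fraction of all routing assignments received by each expert."""
    total = sum(counts)
    return [c / total for c in counts]

def routing_entropy(counts):
    """Shannon entropy (in nats) of the expert-usage distribution."""
    return -sum(p * math.log(p) for p in utilization(counts) if p > 0)

# Hypothetical assignment counts for 8 experts.
counts = [120, 95, 130, 110, 90, 105, 125, 25]
entropy = routing_entropy(counts)
max_entropy = math.log(len(counts))  # reached when usage is perfectly uniform
print(f"normalized entropy: {entropy / max_entropy:.3f}")
```

A normalized entropy close to 1 indicates even expert usage, while a value near 0 indicates that routing has collapsed onto very few experts.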

Performance Metrics

Perplexity: [Add your perplexity scores across domains]
FLOPS per Token: Computational efficiency compared to dense models
Domain-Specific Evaluation: Performance on ArXiv, code, and story tasks

Environmental Impact

Efficiency Gains

Reduced Active Parameters: ~75% parameter sparsity during inference
Computational Efficiency: Lower FLOPs per token compared to equivalent dense model
Training Efficiency: Sparse activation reduces the compute per training step relative to a dense model of the same total size

Technical Specifications

Hardware Requirements

Minimum RAM: 8GB for inference (due to expert parameters)
Recommended GPU: NVIDIA RTX 3080 or better
CPU: Modern multi-core processor
Storage: Additional space for expert parameters

Software Requirements

Python 3.8+
PyTorch 1.12+ (with MoE support)
Transformers 4.25+ (with MoE implementation)
CUDA 11.6+ (for GPU acceleration)

Citation

@misc{moe5lactive2024,
  title={MoE-5L-Active-ArXiv-Code-SimpleStories: An Efficient Mixture of Experts Transformer},
  author={[Your Name]},
  year={2024},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/your-username/moe-5l-active-arxiv-code-simplestories}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Model Card Authors

[Your Name] - [Your Affiliation]

Contact

For questions or issues regarding this model, please:

Open an issue on the model repository
Contact: [email protected]

Disclaimer: This model is provided for research and educational purposes. The MoE architecture adds complexity that users should understand when deploying in production environments.