MoE-5L-Active-ArXiv-Code-SimpleStories

Model Description

This is a 5-layer Mixture of Experts (MoE) transformer trained on a combination of ArXiv papers, code repositories, and the SimpleStories dataset. Each layer replaces the dense feed-forward network with 8 experts, and a learned router activates 2 of them per token, so only a fraction of the parameters is used in each forward pass.

Model Details

Architecture

  • Model Type: Mixture of Experts Transformer for Causal Language Modeling
  • Architecture: MoeTransformerForCausalLM
  • Parameters: ~140M total (8 experts × ~17.5M each)
  • Active Parameters: ~35M per forward pass (top-2 expert routing)
  • Layers: 5 transformer layers with MoE feed-forward networks
  • Hidden Size: 768
  • Attention Heads: 12 (with 8 key-value heads for efficiency)
  • Vocabulary Size: 50,256 tokens
  • Max Sequence Length: 1024 tokens
  • Context Window: 512 tokens (with windowing support)
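The total and active parameter figures above follow from simple arithmetic. A minimal sketch, assuming the counts are taken exactly as quoted in this card (shared embedding and attention weights are not broken out here, and `moe_param_summary` is a hypothetical helper, not part of the model code):

```python
# Back-of-the-envelope parameter accounting using the figures quoted in this card.
NUM_EXPERTS = 8
PARAMS_PER_EXPERT_M = 17.5  # millions of parameters per expert, as quoted
TOP_K = 2                   # experts activated per token


def moe_param_summary(num_experts, params_per_expert_m, top_k):
    """Return (total, active, active_fraction), with counts in millions."""
    total = num_experts * params_per_expert_m
    active = top_k * params_per_expert_m
    return total, active, active / total


total_m, active_m, fraction = moe_param_summary(NUM_EXPERTS, PARAMS_PER_EXPERT_M, TOP_K)
print(f"total ~{total_m:.0f}M, active ~{active_m:.0f}M, active fraction {fraction:.0%}")
```

Under these assumptions the active fraction is 2/8, which is where the ~25% figure comes from.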

MoE Configuration

  • Number of Experts: 8 experts per layer
  • Expert Selection: Top-2 routing (2 experts activated per token)
  • Router Type: Learned gating network with auxiliary loss
  • Load Balancing: Auxiliary loss with coefficient 0.01
  • Router Z-Loss: Coefficient 0.001
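Top-2 routing for a single token can be illustrated in plain Python. This is a sketch under stated assumptions rather than the model's actual router: the function names are hypothetical, and renormalizing the two selected weights so they sum to 1 is an assumption (some implementations keep the raw softmax probabilities).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Pick the k most probable experts and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    The renormalization step is an assumption, not confirmed by this card.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


# One token, 8 experts: experts 2 and 4 have the largest router logits.
example_logits = [0.0, 0.0, 5.0, 0.0, 3.0, 0.0, 0.0, 0.0]
for expert, weight in top_k_route(example_logits):
    print(f"expert {expert}: weight {weight:.3f}")
```

The token's output would then be the weighted sum of the two selected experts' feed-forward outputs.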

Training Details

  • Training Data: ArXiv papers, code repositories, and SimpleStories
  • Training Epochs: 2
  • Batch Size: 256
  • Learning Rate: 5e-4 (lower than dense model for stability)
  • Optimizer: AdamW (β1=0.9, β2=0.999)
  • Dropout: 0.1 (attention and hidden layers)
  • Normalization: RMSNorm (ε=1e-6)

Model Features

  • Mixture of Experts: Sparse activation with expert routing for efficiency
  • Load Balancing: Auxiliary loss to ensure balanced expert utilization
  • Rotary Position Embeddings: For better handling of positional information
  • Grouped-Query Attention: Efficient attention with 12 query heads and 8 key-value heads
  • SwiGLU Activation: Modern activation function in expert feed-forward layers
  • RMSNorm: Root-mean-square normalization, a simpler alternative to LayerNorm, for improved training stability
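Two of the features listed above, RMSNorm and SwiGLU, are small enough to sketch directly. The code below is illustrative only: it works on plain Python lists, the function names are hypothetical, and the linear projections of a real feed-forward layer are omitted. The default `eps` matches the ε=1e-6 quoted under Training Details.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * g for v, g in zip(x, gain)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_gate(gate_proj, up_proj):
    """Elementwise SwiGLU gating: silu(gate) * up.

    In a full expert layer, gate_proj and up_proj are two linear projections
    of the same input and the result passes through a third (down) projection.
    """
    return [silu(g) * u for g, u in zip(gate_proj, up_proj)]


normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
print(normed)                                # unit root mean square, up to eps
print(swiglu_gate([0.0, 2.0], [1.0, 1.0]))   # a gate value of 0 blocks the signal
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it needs only the gain parameter.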

Usage

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "your-username/moe-5l-active-arxiv-code-simplestories"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"
)

Text Generation

# Generate text with MoE model
prompt = "The concept of mixture of experts in machine learning"
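# Move the inputs to the model's device (needed when device_map="auto" places the model on a GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)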

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        output_router_logits=True  # Optional: get expert routing information
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)

Code Generation with Expert Routing

# Generate Python code and inspect expert usage
prompt = "def quicksort(arr):"
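# Move the inputs to the model's device (needed when device_map="auto" places the model on a GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)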

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=150,
        temperature=0.2,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
        output_router_logits=True,
        return_dict_in_generate=True
    )

generated_code = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(f"Generated Code:\n{generated_code}")

# Routing information may be exposed as outputs.router_logits, depending on the model implementation
if hasattr(outputs, 'router_logits'):
    print("Expert routing information available")

Advanced Usage: Expert Analysis

# Analyze expert specialization
def analyze_expert_usage(model, tokenizer, prompts):
    """Analyze which experts are activated for different types of prompts"""
    results = {}
    
    for prompt_type, prompt in prompts.items():
        # Move the inputs to the model's device before the forward pass
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        
        with torch.no_grad():
            outputs = model(
                **inputs,
                output_router_logits=True,
                return_dict=True
            )
        
        # Store the router auxiliary losses if this model's output exposes them;
        # the attribute name depends on the model implementation
        if hasattr(outputs, 'router_aux_losses'):
            results[prompt_type] = outputs.router_aux_losses
    
    return results

# Example usage
prompts = {
    "math": "The derivative of x^2 is",
    "code": "def factorial(n):",
    "story": "Once upon a time in a distant galaxy",
    "science": "The theory of relativity explains"
}

expert_analysis = analyze_expert_usage(model, tokenizer, prompts)

Intended Use

Primary Use Cases

  • Research: Advanced research in mixture of experts and sparse models
  • Efficiency Studies: Investigating parameter-efficient language models
  • Domain Adaptation: Leveraging expert specialization for multi-domain tasks
  • Educational: Learning about MoE architectures and expert routing

Suitable Tasks

  • Multi-domain text generation (academic, code, narrative)
  • Efficient large-scale language modeling
  • Domain-specific content generation with expert routing
  • Research into expert specialization patterns

Advantages of MoE Architecture

Efficiency Benefits

  • Parameter Efficiency: Only ~25% of parameters active per forward pass
  • Scalability: Can increase model capacity without proportional compute increase
  • Specialization: Experts can specialize in different domains or patterns
  • Memory Efficiency: Lower activation memory than a dense model of the same total size, although all expert weights must still be loaded

Performance Benefits

  • Quality: Often matches or exceeds dense models of similar active parameter count
  • Versatility: Better handling of diverse domains due to expert specialization
  • Adaptability: Can potentially learn domain-specific routing patterns

Limitations and Biases

MoE-Specific Limitations

  • Routing Instability: Expert routing can be unstable during training
  • Load Imbalance: Some experts may be underutilized despite load balancing
  • Complexity: More complex architecture with additional hyperparameters
  • Hardware Requirements: May require specialized hardware for optimal efficiency

General Limitations

  • Context Length: Limited to 1024 tokens maximum sequence length
  • Training Complexity: More complex training dynamics than dense models
  • Expert Collapse: Risk of experts becoming redundant
  • Inference Complexity: Routing overhead during inference

Potential Biases

  • Dataset Bias: Reflects biases present in training data across all experts
  • Expert Bias: Different experts may exhibit different biases
  • Routing Bias: Expert selection may be biased toward certain patterns
  • Domain Imbalance: Expert specialization may favor overrepresented domains

Training Data

The model was trained on a curated dataset combining:

  1. ArXiv Papers: Academic papers for scientific and mathematical reasoning
  2. Code Repositories: Programming code for software development tasks
  3. SimpleStories: Narrative text for story generation and general language understanding

The MoE architecture allows the model to potentially develop specialized experts for each domain.

Expert Routing Analysis

Expected Expert Specializations

Based on the training data, experts may specialize in:

  • Mathematical/Scientific content (from ArXiv papers)
  • Programming languages and code patterns (from code repositories)
  • Narrative and storytelling (from SimpleStories)
  • General language patterns (cross-domain)

Load Balancing

The model adds auxiliary losses during training to keep expert routing balanced and stable:

  • Router Auxiliary Loss: Encourages uniform expert selection
  • Z-Loss: Penalizes large router logits, which keeps the routing softmax numerically stable
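The two auxiliary terms can be sketched in plain Python. This card does not state the exact formulation used in training, so the code below assumes a Switch Transformer-style load-balancing loss and an ST-MoE-style router z-loss; the function names and the toy logits are hypothetical, and only the two coefficients come from this card.

```python
import math

AUX_LOSS_COEF = 0.01  # load-balancing coefficient quoted in this card
Z_LOSS_COEF = 0.001   # router z-loss coefficient quoted in this card


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(token_logits, k=2):
    """num_experts * sum_e(dispatch_fraction_e * mean_router_prob_e).

    Equals 1.0 when routing is perfectly uniform and grows as tokens
    concentrate on a few experts (assumed formulation).
    """
    num_tokens, num_experts = len(token_logits), len(token_logits[0])
    dispatch_frac = [0.0] * num_experts
    mean_prob = [0.0] * num_experts
    for logits in token_logits:
        probs = softmax(logits)
        top = sorted(range(num_experts), key=lambda i: probs[i], reverse=True)[:k]
        for e in top:
            dispatch_frac[e] += 1.0 / (num_tokens * k)
        for e, p in enumerate(probs):
            mean_prob[e] += p / num_tokens
    return num_experts * sum(f * p for f, p in zip(dispatch_frac, mean_prob))


def router_z_loss(token_logits):
    """Mean squared log-sum-exp of the router logits (assumed formulation)."""
    total = 0.0
    for logits in token_logits:
        peak = max(logits)
        lse = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += lse ** 2
    return total / len(token_logits)


# Toy router logits: token t prefers experts t and t+1 (balanced),
# versus every token preferring experts 0 and 1 (collapsed).
balanced = [[4.0 if e in (t, (t + 1) % 8) else 0.0 for e in range(8)] for t in range(8)]
collapsed = [[4.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(8)]

for name, logits in (("balanced", balanced), ("collapsed", collapsed)):
    aux = AUX_LOSS_COEF * load_balancing_loss(logits) + Z_LOSS_COEF * router_z_loss(logits)
    print(f"{name}: weighted auxiliary loss = {aux:.4f}")
```

In this toy setup the collapsed routing yields a larger load-balancing term than the balanced one, while the z-loss depends only on the magnitude of the logits.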

Evaluation

MoE-Specific Metrics

  • Expert Utilization: Measure of how evenly experts are used
  • Routing Entropy: Diversity of expert selection patterns
  • Expert Specialization: Domain-specific expert activation analysis
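Expert utilization and routing entropy can both be computed from the list of expert indices that the router selected. A minimal sketch, assuming the selections have already been flattened into one list of integers (two entries per token under top-2 routing); the helper names are hypothetical:

```python
import math
from collections import Counter


def expert_utilization(selected_experts, num_experts=8):
    """Fraction of routing assignments received by each expert."""
    counts = Counter(selected_experts)
    total = len(selected_experts)
    return [counts.get(e, 0) / total for e in range(num_experts)]


def routing_entropy(utilization):
    """Shannon entropy of the utilization, normalized to [0, 1].

    1.0 means every expert is used equally; 0.0 means a single expert
    receives every assignment.
    """
    entropy = sum(-p * math.log(p) for p in utilization if p > 0)
    return entropy / math.log(len(utilization))


# Illustrative selections for 4 tokens with top-2 routing (made-up values).
selected = [0, 1, 0, 2, 3, 1, 0, 4]
util = expert_utilization(selected)
print("utilization:", util)
print("normalized entropy:", round(routing_entropy(util), 3))
```

Running the same computation separately on ArXiv, code, and story prompts gives a simple view of expert specialization per domain.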

Performance Metrics

  • Perplexity: [Add your perplexity scores across domains]
  • FLOPS per Token: Computational efficiency compared to dense models
  • Domain-Specific Evaluation: Performance on ArXiv, code, and story tasks
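Perplexity is the exponential of the mean per-token negative log-likelihood. A small sketch of the computation; the loss values below are made-up placeholders used only to show the call, not measured results for this model:

```python
import math


def perplexity(token_nlls):
    """exp of the mean per-token negative log-likelihood (natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


# Placeholder per-token losses (hypothetical, for illustration only).
example_nlls = [2.0, 3.0, 2.5, 2.5]
print(f"perplexity = {perplexity(example_nlls):.2f}")
```

Computing this separately on held-out ArXiv, code, and story text gives the per-domain scores the placeholder above asks for.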

Environmental Impact

Efficiency Gains

  • Reduced Active Parameters: ~75% parameter sparsity during inference
  • Computational Efficiency: Lower FLOPs per token compared to equivalent dense model
  • Training Efficiency: May converge faster when experts specialize

Technical Specifications

Hardware Requirements

  • Minimum RAM: 8GB for inference (due to expert parameters)
  • Recommended GPU: NVIDIA RTX 3080 or better
  • CPU: Modern multi-core processor
  • Storage: Additional space for expert parameters

Software Requirements

  • Python 3.8+
  • PyTorch 1.12+ (with MoE support)
  • Transformers 4.25+ (with MoE implementation)
  • CUDA 11.6+ (for GPU acceleration)

Citation

@misc{moe5lactive2024,
  title={MoE-5L-Active-ArXiv-Code-SimpleStories: An Efficient Mixture of Experts Transformer},
  author={[Your Name]},
  year={2024},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/your-username/moe-5l-active-arxiv-code-simplestories}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Model Card Authors

[Your Name] - [Your Affiliation]

Contact

For questions or issues regarding this model, please:


Disclaimer: This model is provided for research and educational purposes. The MoE architecture adds complexity that users should understand when deploying in production environments.
