language:
  - en
license: apache-2.0
library_name: transformers
tags:
  - causal-lm
  - moe-transformer
  - mixture-of-experts
  - arxiv
  - code
  - simplestories
datasets:
  - arxiv
  - code
  - simplestories
pipeline_tag: text-generation

MoE-5L-Total-ArXiv-Code-SimpleStories

Model Description

This is a 5-layer Mixture of Experts (MoE) transformer trained on a combination of ArXiv papers, code repositories, and the SimpleStories dataset. Compared with the "active" version, this "total" variant uses an extended training schedule and may include architectural refinements.

Model Details

Architecture

  • Model Type: Mixture of Experts Transformer for Causal Language Modeling
  • Architecture: MoeTransformerForCausalLM
  • Parameters: ~140M parameters (8 experts × ~17.5M each)
  • Active Parameters: ~35M per forward pass (top-2 expert routing)
  • Layers: 5 transformer layers with MoE feed-forward networks
  • Hidden Size: 768
  • Attention Heads: 12 (with 8 key-value heads for efficiency)
  • Vocabulary Size: 50,256 tokens
  • Max Sequence Length: 1024 tokens
  • Context Window: 512 tokens (with windowing support)
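The total and active parameter figures above follow from simple arithmetic, sketched below. This is only a rough accounting that takes the card's rounded numbers at face value and counts expert weights alone; attention, embedding, and router weights are not broken out in this card, and the variable names are illustrative.

```python
# Rough parameter accounting using the rounded figures quoted above.
# Assumption: only expert weights are counted, mirroring "8 experts x ~17.5M".
NUM_EXPERTS = 8
TOP_K = 2                    # experts activated per token
PARAMS_PER_EXPERT_M = 17.5   # millions, approximate

total_m = NUM_EXPERTS * PARAMS_PER_EXPERT_M    # all experts stored in memory
active_m = TOP_K * PARAMS_PER_EXPERT_M         # experts actually used per token
active_fraction = active_m / total_m

print(f"total ~{total_m:.0f}M, active ~{active_m:.0f}M ({active_fraction:.0%} of total)")
```

With top-2 routing over 8 experts, roughly a quarter of the expert parameters take part in any single forward pass.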

MoE Configuration

  • Number of Experts: 8 experts per layer
  • Expert Selection: Top-2 routing (2 experts activated per token)
  • Router Type: Learned gating network with auxiliary loss
  • Load Balancing: Auxiliary loss coefficient: 0.01
  • Router Z-Loss: Coefficient: 0.001
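To make the routing configuration concrete, here is a minimal pure-Python sketch of top-2 routing together with the two auxiliary terms. The card does not state the exact formulas used in training, so the sketch assumes a Switch-Transformer-style load-balancing loss and an ST-MoE-style router z-loss; all function names are hypothetical and do not come from the model's code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_k_route(logits, k=2):
    """Pick the k most probable experts; renormalize their weights to sum to 1."""
    probs = softmax(logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def load_balancing_loss(batch_logits, k=2):
    """Assumed form: num_experts * sum_i(f_i * P_i), where f_i is the share of
    routing slots sent to expert i and P_i is its mean router probability."""
    n_tokens = len(batch_logits)
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    mean_prob = [0.0] * n_experts
    for logits in batch_logits:
        for i, _ in top_k_route(logits, k):
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            mean_prob[i] += p / n_tokens
    frac = [c / (n_tokens * k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_prob))

def router_z_loss(batch_logits):
    """Assumed form: mean over tokens of (log-sum-exp of router logits) squared."""
    total = 0.0
    for logits in batch_logits:
        m = max(logits)
        lse = m + math.log(sum(math.exp(x - m) for x in logits))
        total += lse ** 2
    return total / len(batch_logits)

# One token whose router prefers experts 0 and 3 out of 8.
routes = top_k_route([5.0, 0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0])
print(routes)
```

Under these assumed definitions, the load-balancing term penalizes routers that concentrate tokens on a few experts, while the z-loss discourages large router logits.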

Training Details

  • Training Data: ArXiv papers, code repositories, and SimpleStories
  • Training Epochs: 2 (comprehensive training schedule)
  • Batch Size: 256
  • Learning Rate: 5e-4 (optimized for stability)
  • Optimizer: AdamW (β1=0.9, β2=0.999)
  • Dropout: 0.1 (attention and hidden layers)
  • Normalization: RMSNorm (ε=1e-6)
  • Training Objective: Total loss (causal language modeling loss plus router auxiliary losses)

Model Features

  • Enhanced MoE Training: Comprehensive training with improved expert specialization
  • Load Balancing: Advanced auxiliary loss for optimal expert utilization
  • Rotary Position Embeddings: For better handling of positional information
  • Grouped-Query Attention: Efficient attention with 12 query heads and 8 key-value heads
  • SwiGLU Activation: Modern activation function in expert feed-forward layers
  • RMSNorm: Root-mean-square normalization (a lighter alternative to LayerNorm) for improved training stability
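As a small illustration of two of the components listed above, the following pure-Python sketch shows RMSNorm and the SwiGLU gating step on plain lists. It is not taken from the model's implementation: the function names are hypothetical, and the linear projections that would produce `gate_proj` and `up_proj` are left out for brevity.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal root-mean-square of its entries, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_gate(gate_proj, up_proj):
    """Elementwise SwiGLU gating: SiLU(gate) * up. The result would then go
    through a down-projection in a full feed-forward layer (omitted here)."""
    return [silu(g) * u for g, u in zip(gate_proj, up_proj)]

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
gated = swiglu_gate([0.0, 2.0], [5.0, 1.0])
print(normed, gated)
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so after normalization the mean of the squared entries is close to 1 rather than the entries being zero-centered.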

Differences from MoE-Active

Training Improvements

  • Extended Training: More comprehensive training schedule
  • Enhanced Expert Utilization: Improved load balancing and expert specialization
  • Optimized Hyperparameters: Fine-tuned for better performance
  • Advanced Routing: Enhanced expert routing mechanisms

Performance Characteristics

  • Better Convergence: More stable training dynamics
  • Improved Specialization: Clearer expert domain specialization
  • Enhanced Quality: Better overall generation quality across domains

Usage

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model_name = "your-username/moe-5l-total-arxiv-code-simplestories"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,
    device_map="auto"
)

Multi-Domain Text Generation

# Generate academic content
prompt = "The implications of quantum entanglement in modern physics"
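# Move inputs to the model's device, since device_map="auto" may place the model on a GPU
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)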

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=200,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

academic_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Academic: {academic_text}")

Advanced Code Generation

# Generate complex code with explanations
prompt = "# Implement a binary search tree with insertion and search methods\nclass BinarySearchTree:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=300,
        temperature=0.3,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

code_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Code: {code_text}")

Story Generation

# Generate creative narratives
prompt = "In a world where mathematics came alive, the number seven"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=250,
        temperature=0.8,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

story_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Story: {story_text}")

Expert Routing Analysis

# Comprehensive expert analysis
def comprehensive_expert_analysis(model, tokenizer):
    """Collect router auxiliary losses per domain as a rough proxy for expert usage."""
    
    test_prompts = {
        "mathematics": [
            "The derivative of x^3 + 2x^2 - 5x + 1 is",
            "Integration by parts formula states that",
            "The Pythagorean theorem in higher dimensions"
        ],
        "programming": [
            "def fibonacci(n):",
            "class LinkedList:",
            "# Sort an array using merge sort"
        ],
        "narrative": [
            "Once upon a time in a magical forest",
            "The old lighthouse keeper had seen many storms",
            "In the year 2150, humanity discovered"
        ],
        "science": [
            "The theory of relativity explains",
            "DNA replication involves several key enzymes",
            "Climate change affects ocean currents by"
        ]
    }
    
    expert_patterns = {}
    
    for domain, prompts in test_prompts.items():
        domain_patterns = []
        
        for prompt in prompts:
            inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
            
            with torch.no_grad():
                outputs = model(
                    **inputs,
                    output_router_logits=True,
                    return_dict=True
                )
            
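            # `router_aux_losses` is not a standard transformers output field;
            # it is recorded only if this model's implementation provides it.
            if hasattr(outputs, 'router_aux_losses'):
                domain_patterns.append(outputs.router_aux_losses)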
        
        expert_patterns[domain] = domain_patterns
    
    return expert_patterns

# Run comprehensive analysis
expert_analysis = comprehensive_expert_analysis(model, tokenizer)
print("Expert specialization analysis completed")

Intended Use

Primary Use Cases

  • Research: Advanced research in mixture of experts and efficient language models
  • Multi-Domain Applications: Applications requiring expertise across academic, code, and narrative domains
  • Efficiency Studies: Benchmarking sparse models against dense alternatives
  • Educational: Teaching advanced transformer architectures and expert routing

Suitable Tasks

  • Cross-domain text generation with high quality
  • Compute-efficient language modeling with sparse expert activation
  • Research into expert specialization and routing
  • Multi-domain content creation (narrative text, code, and academic writing)

Training Methodology

Total Loss Optimization

The "total" variant employs comprehensive loss optimization:

  • Primary Loss: Standard causal language modeling loss
  • Auxiliary Loss: Expert load balancing with enhanced coefficients
  • Routing Loss: Advanced router optimization for better expert utilization
  • Regularization: Enhanced regularization for improved generalization
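The way these terms might combine can be sketched as a weighted sum. The card does not give the exact formula, so this is an assumption: the default coefficients are the load-balancing (0.01) and router z-loss (0.001) values listed under MoE Configuration, the function name is hypothetical, and the regularization term is not modeled.

```python
def total_loss(lm_loss, load_balance_loss, z_loss, aux_coef=0.01, z_coef=0.001):
    """Assumed combination: language modeling loss plus weighted router terms."""
    return lm_loss + aux_coef * load_balance_loss + z_coef * z_loss

# Illustrative values only, not measurements from this model.
example = total_loss(lm_loss=3.0, load_balance_loss=1.0, z_loss=4.0)
print(example)
```

Because both coefficients are small, the language modeling loss dominates the total, and the router terms act as gentle regularizers.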

Expert Specialization Strategy

  • Domain-Aware Training: Training schedule optimized for expert specialization
  • Balanced Sampling: Careful data sampling to ensure expert development
  • Progressive Training: Gradual complexity increase to encourage specialization

Performance Characteristics

Expected Improvements over MoE-Active

  • Better Domain Separation: Clearer expert specialization patterns
  • Improved Quality: Higher quality generation across all domains
  • Enhanced Stability: More stable expert routing during inference
  • Better Generalization: Improved performance on unseen data patterns

Computational Efficiency

  • Optimized Routing: More efficient expert selection patterns
  • Reduced Overhead: Lower routing computational overhead
  • Better Load Balancing: More even expert utilization across tasks

Evaluation Metrics

Domain-Specific Performance

Academic Text Quality:
- Perplexity on ArXiv: [Add scores]
- Factual Accuracy: [Add scores]
- Coherence: [Add scores]

Code Generation Quality:
- HumanEval: [Add scores]
- MBPP: [Add scores]
- Syntax Correctness: [Add scores]

Narrative Quality:
- Story Coherence: [Add scores]
- Creativity Metrics: [Add scores]
- Readability: [Add scores]

MoE-Specific Metrics

  • Expert Utilization Variance: Lower is better (more balanced)
  • Routing Entropy: Higher indicates better expert diversity
  • Expert Specialization Index: Measure of domain-specific expert activation
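The first two metrics can be computed from routing statistics, as in the sketch below. The definitions are assumptions, since the card does not specify them: utilization variance is taken over per-expert shares of routed tokens, and routing entropy is the Shannon entropy (in nats) of a routing distribution. The Expert Specialization Index is left out because the card gives no definition for it.

```python
import math

def utilization_variance(counts):
    """Variance of per-expert token shares; 0.0 means perfectly balanced usage."""
    total = sum(counts)
    shares = [c / total for c in counts]
    mean = sum(shares) / len(shares)
    return sum((s - mean) ** 2 for s in shares) / len(shares)

def routing_entropy(probs):
    """Shannon entropy in nats; the maximum is log(num_experts) for uniform routing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

balanced = utilization_variance([10] * 8)
skewed = utilization_variance([73, 1, 1, 1, 1, 1, 1, 1])
uniform_entropy = routing_entropy([1 / 8] * 8)
print(balanced, skewed, uniform_entropy)
```

A router that collapses onto one expert yields a high utilization variance and an entropy near zero, which is the failure mode the auxiliary losses are meant to prevent.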

Environmental Impact

Enhanced Efficiency

  • Improved Training Efficiency: Better convergence properties
  • Optimized Inference: More efficient expert routing
  • Parameter Efficiency: Maintained sparsity with improved quality

Technical Specifications

Hardware Requirements

  • Minimum RAM: 8GB for inference
  • Recommended GPU: NVIDIA RTX 3080 or better
  • CPU: Modern multi-core processor
  • Storage: 2GB+ for model weights

Software Requirements

  • Python 3.8+
  • PyTorch 1.12+
  • Transformers 4.25+
  • CUDA 11.6+ (for GPU acceleration)

Comparison with Other Variants

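| Feature         | Dense-5L | MoE-Active | MoE-Total |
|-----------------|----------|------------|-----------|
| Parameters      | ~50M     | ~140M      | ~140M     |
| Active Params   | 50M      | ~35M       | ~35M      |
| Training Epochs | 1        | 2          | 2         |
| Expert Quality  | N/A      | Good       | Enhanced  |
| Specialization  | N/A      | Moderate   | Strong    |
| Stability       | High     | Good       | Enhanced  |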

Citation

@misc{moe5ltotal2024,
  title={MoE-5L-Total-ArXiv-Code-SimpleStories: A Comprehensive Mixture of Experts Transformer},
  author={[Your Name]},
  year={2024},
  howpublished={HuggingFace Model Hub},
  url={https://huggingface.co/your-username/moe-5l-total-arxiv-code-simplestories}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for more details.

Model Card Authors

[Your Name] - [Your Affiliation]

Contact

For questions or issues regarding this model, please:


Disclaimer: This model represents an advanced MoE implementation designed for research and educational purposes. The "total" variant provides enhanced capabilities but requires understanding of MoE architectures for optimal use.