---
license: mit
language: en
tags:
- mixture-of-experts
- moe
- coding
- code-generation
- fine-tuned
- lora
- instruction
- python
- adbhutmoe
datasets:
- TokenBender/code_instructions_122k_alpaca_style
model_type: mixtral
base_model: rohitnagareddy/AdbhutMOE
---

# AdbhutMOE-Coding-Finetuned - Fine-tuned Coding Assistant

This model is a fine-tuned version of the `rohitnagareddy/AdbhutMOE` Mixture-of-Experts (MoE) model, specialized for Python code generation and programming assistance tasks. It combines the efficiency of a sparse MoE architecture with domain-specific fine-tuning for coding applications.

## 💻 Model Description

- **Base Model**: `rohitnagareddy/AdbhutMOE` (custom MoE architecture)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: `TokenBender/code_instructions_122k_alpaca_style` - roughly 122k coding instructions with solutions in Alpaca format
- **Architecture**: Mixture-of-Experts with selective expert activation
- **Training**: Instruction-based code generation with memory-efficient techniques (LoRA adapters, 8-bit paged optimizer)

## 🏗️ Architecture Details

This model is based on a custom Mixture-of-Experts architecture:

- **Experts per Layer**: 8 experts, with 2 activated per token
- **Hidden Dimension**: 256
- **Attention Heads**: 4
- **Layers**: 4
- **Vocabulary**: Custom-trained tokenizer (~8K tokens)
- **Max Sequence Length**: 512 tokens

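As a point of reference, these hyperparameters roughly map onto a `transformers` `MixtralConfig` like the sketch below. Values not stated above (for example `intermediate_size` and the exact vocabulary size) are illustrative assumptions; the `config.json` shipped with the model is the authoritative source.

```python
# Illustrative only: a MixtralConfig approximating the architecture described above.
# intermediate_size and the exact vocab_size are assumptions, not confirmed values.
from transformers import MixtralConfig

config = MixtralConfig(
    vocab_size=8000,             # ~8K custom tokenizer (assumed exact size)
    hidden_size=256,             # hidden dimension
    intermediate_size=512,       # assumed expert FFN size
    num_hidden_layers=4,         # transformer layers
    num_attention_heads=4,       # attention heads
    num_key_value_heads=4,       # assumed: no grouped-query attention
    num_local_experts=8,         # experts per MoE layer
    num_experts_per_tok=2,       # experts activated per token
    max_position_embeddings=512, # max sequence length
)
print(config)
```
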
## ⚠️ Important Considerations

- **Verify All Code**: Generated code may contain errors or be suboptimal. Always test and review thoroughly.
- **Security**: Generated code has not been vetted for security vulnerabilities.
- **Educational Model**: This is a proof-of-concept model demonstrating MoE fine-tuning techniques.
- **Limited Training**: The model was trained with limited resources for demonstration purposes.

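As noted above, generated code should always be tested before use. The snippet below is one hypothetical, minimal check (the `sum_even` example is made up for illustration): it syntax-checks a generated snippet, runs it in a throwaway namespace, and asserts a known input/output pair. It is not a sandbox and does not replace a proper review.

```python
# Hypothetical minimal check for a generated snippet (sum_even is a made-up example).
# This catches syntax errors and obvious logic bugs only; it is NOT a security
# sandbox, so only run code you have read first.
import ast

generated_code = """
def sum_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)
"""

ast.parse(generated_code)        # raises SyntaxError if the snippet is malformed

namespace = {}
exec(generated_code, namespace)  # define the function in a scratch namespace
assert namespace["sum_even"]([1, 2, 3, 4]) == 6
print("basic check passed")
```
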
## 🚀 Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

# Generate code
prompt = '''### Instruction:
Write a Python function that takes a list of integers and returns the sum of all even numbers in the list.

### Response:'''

response = pipe(prompt, max_new_tokens=150, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```

### Direct Model Usage

```python
# For more control over generation (reuses the model and tokenizer loaded above)
prompt = '''### Instruction:
Create a Python class for a simple calculator with basic arithmetic operations.

### Response:'''

# Move the inputs to the model's device (important when device_map="auto" places it on GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## 📊 Training Details

### Fine-tuning Configuration

- **Training Steps**: 500 (limited for demonstration)
- **Batch Size**: 1 (with 8 gradient accumulation steps, i.e. an effective batch size of 8)
- **Learning Rate**: 1e-4
- **Optimizer**: Paged AdamW 8-bit
- **LoRA Rank**: 8
- **LoRA Alpha**: 16
- **Target Modules**: All linear layers, including the MoE experts and gates (see the configuration sketch below)

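For readers who want to set up a similar run, the snippet below shows one way this configuration could be expressed with `peft` and `transformers`. It is a sketch rather than the original training script: the target-module names assume standard Mixtral layer naming (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate`, `w1`, `w2`, `w3`), and the dropout, logging, and precision settings are illustrative defaults.

```python
# Sketch of the fine-tuning setup described above (not the original training script).
# Target module names assume standard Mixtral naming; adjust to the actual layer names.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                      # LoRA rank
    lora_alpha=16,            # LoRA alpha
    lora_dropout=0.05,        # assumed; not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
                    "gate", "w1", "w2", "w3"],               # MoE router + expert MLPs
)

training_args = TrainingArguments(
    output_dir="adbhutmoe-coding-finetuned",
    max_steps=500,                  # limited for demonstration
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    learning_rate=1e-4,
    optim="paged_adamw_8bit",       # memory-efficient 8-bit paged optimizer
    logging_steps=25,
    fp16=True,
)
```
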
### Base Model Training

- **Pre-training Data**: AG News dataset sample
- **Architecture**: Custom Mixtral-based MoE
- **Training Steps**: 100 (base model pre-training)

## 🎯 Performance Notes

- **Efficiency**: With only 2 of 8 experts active per token, each forward pass uses a fraction of the total parameter count
- **Memory**: Trained and served with memory-efficient techniques (LoRA adapters, 8-bit optimizer, optional 4-bit quantization)
- **Speed**: Sparse expert activation keeps per-token compute closer to that of a smaller dense model

## 🔄 Model Lineage

1. **Base Architecture**: Custom Mixtral MoE implementation
2. **Pre-training**: Trained on an AG News dataset sample
3. **Fine-tuning**: LoRA adaptation on a coding instruction dataset
4. **Optimization**: 4-bit quantization support for efficient deployment (see the loading sketch below)

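To try the 4-bit path, a `bitsandbytes` setup along the following lines is a reasonable starting point. This is a hedged sketch: it assumes a CUDA GPU and the `bitsandbytes` package, and the quantization settings shown are common defaults rather than values confirmed for this model.

```python
# Hedged sketch: loading the model in 4-bit with bitsandbytes (requires a CUDA GPU
# and the bitsandbytes package). Quantization settings are common defaults, not
# values confirmed by the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
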
## 📈 Intended Use Cases

- **Code Generation**: Creating Python functions and classes
- **Programming Education**: Demonstrating coding concepts
- **Research**: Studying MoE architectures for domain-specific tasks
- **Prototyping**: Quick code snippet generation

## 🚫 Limitations

- **Limited Scope**: Primarily trained on basic coding tasks
- **Language Focus**: Optimized for Python, with limited support for other languages
- **Scale**: Small model size limits complex reasoning capabilities
- **Training Budget**: Limited training iterations due to resource constraints

## 🤝 Contributing

This model serves as a foundation for further experimentation with MoE architectures in code generation. Contributions and improvements are welcome!

---

*Fine-tuned by rohitnagareddy using LoRA on the AdbhutMOE architecture.*

*This model demonstrates the application of parameter-efficient fine-tuning to Mixture-of-Experts models.*
|