---
license: mit
language: en
tags:
- mixture-of-experts
- moe
- coding
- code-generation
- fine-tuned
- lora
- instruction
- python
- adbhutmoe
datasets:
- TokenBender/code_instructions_122k_alpaca_style
model_type: mixtral
base_model: rohitnagareddy/AdbhutMOE
---
# AdbhutMOE-Coding-Finetuned - Fine-tuned Coding Assistant
This model is a fine-tuned version of the `rohitnagareddy/AdbhutMOE` Mixture-of-Experts (MoE) model, specialized for Python code generation and programming assistance tasks. It combines the efficiency of a sparse MoE architecture with domain-specific fine-tuning for coding applications.
## 💻 Model Description
- **Base Model**: `rohitnagareddy/AdbhutMOE` (Custom MoE Architecture)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: `TokenBender/code_instructions_122k_alpaca_style` - A comprehensive dataset of coding instructions and solutions
- **Architecture**: Mixture-of-Experts with selective expert activation
- **Training**: Optimized for instruction-based code generation with memory-efficient techniques
## 🏗️ Architecture Details
This model is based on a custom Mixture-of-Experts architecture:
- **Experts per Layer**: 8 experts with 2 activated per token
- **Hidden Dimension**: 256
- **Attention Heads**: 4
- **Layers**: 4
- **Vocabulary**: Custom-trained tokenizer (~8K tokens)
- **Max Sequence Length**: 512 tokens
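
For reference, a configuration with these dimensions could be sketched with the `transformers` `MixtralConfig` class roughly as follows. This is an illustrative approximation only; values not listed above (e.g. the intermediate size) are assumptions, and the actual AdbhutMOE configuration may differ.

```python
from transformers import MixtralConfig

# Illustrative sketch only -- the exact AdbhutMOE config may differ.
config = MixtralConfig(
    vocab_size=8000,              # custom tokenizer, ~8K tokens
    hidden_size=256,              # hidden dimension
    intermediate_size=512,        # assumption; not stated in this card
    num_hidden_layers=4,          # layers
    num_attention_heads=4,        # attention heads
    num_key_value_heads=4,        # assumption: equal to attention heads
    num_local_experts=8,          # experts per layer
    num_experts_per_tok=2,        # experts activated per token
    max_position_embeddings=512,  # max sequence length
)
```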
## ⚠️ Important Considerations
- **Verify All Code**: Generated code may contain errors or be suboptimal. Always test and review thoroughly.
- **Security**: Generated code has not been vetted for security vulnerabilities.
- **Educational Model**: This is a proof-of-concept model demonstrating MoE fine-tuning techniques.
- **Limited Training**: Model was trained with limited resources for demonstration purposes.
## 🚀 Usage
### Basic Text Generation
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)
# Generate code
prompt = '''### Instruction:
Write a Python function that takes a list of integers and returns the sum of all even numbers in the list.
### Response:'''
response = pipe(prompt, max_new_tokens=150, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```
### Direct Model Usage
```python
# For more control over generation
prompt = '''### Instruction:
Create a Python class for a simple calculator with basic arithmetic operations.
### Response:'''
# Tokenize and move inputs to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```
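Because the model follows the Alpaca-style `### Instruction:` / `### Response:` prompt format, the decoded output includes the prompt itself. A small (hypothetical) helper can strip it and return only the model's answer:

```python
def extract_response(generated_text: str) -> str:
    """Return only the text after the '### Response:' marker.

    Helper for the Alpaca-style prompt format used above; not part of the model API.
    """
    marker = "### Response:"
    return generated_text.split(marker, 1)[-1].strip()

print(extract_response(generated_text))
```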
## 📊 Training Details
### Fine-tuning Configuration
- **Training Steps**: 500 (limited for demonstration)
- **Batch Size**: 1 (with 8 gradient accumulation steps)
- **Learning Rate**: 1e-4
- **Optimizer**: Paged AdamW 8-bit
- **LoRA Rank**: 8
- **LoRA Alpha**: 16
- **Target Modules**: All linear layers including MoE experts and gates
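
As a rough sketch, the configuration above maps to a `peft`/`transformers` setup along these lines. The exact arguments of the actual training run are not published; in particular, targeting every linear layer via `"all-linear"` is an assumption.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# Sketch of the fine-tuning setup described above; the target-module
# selection for the MoE experts and gates is an assumption.
lora_config = LoraConfig(
    r=8,                          # LoRA rank
    lora_alpha=16,                # LoRA alpha
    target_modules="all-linear",  # all linear layers, incl. experts and gates
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="adbhutmoe-coding-finetuned",
    max_steps=500,                   # limited for demonstration
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # effective batch size of 8
    learning_rate=1e-4,
    optim="paged_adamw_8bit",        # paged AdamW 8-bit optimizer
    logging_steps=50,
)
```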
### Base Model Training
- **Pre-training Data**: AG News dataset sample
- **Architecture**: Custom Mixtral-based MoE
- **Training Steps**: 100 (base model pre-training)
## 🎯 Performance Notes
- **Efficiency**: Only 2 of the 8 experts in each layer are active per token, so the compute cost per token is well below what the total parameter count would suggest
- **Memory**: Optimized for memory-efficient inference and training
- **Speed**: Sparse expert activation enables faster inference than a dense model with the same total parameter count
## 🔄 Model Lineage
1. **Base Architecture**: Custom Mixtral MoE implementation
2. **Pre-training**: Trained on AG News dataset sample
3. **Fine-tuning**: LoRA adaptation on coding instruction dataset
4. **Optimization**: 4-bit quantization support for efficient deployment
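
For the 4-bit deployment path mentioned in step 4, the model can be loaded with a `BitsAndBytesConfig`. The NF4 settings below are common defaults rather than a published recommendation; adjust them to your hardware.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Common 4-bit NF4 settings (assumed defaults, not from the model card)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```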
## 📈 Intended Use Cases
- **Code Generation**: Creating Python functions and classes
- **Programming Education**: Demonstrating coding concepts
- **Research**: Studying MoE architectures for domain-specific tasks
- **Prototyping**: Quick code snippet generation
## 🚫 Limitations
- **Limited Scope**: Primarily trained on basic coding tasks
- **Language Focus**: Optimized for Python, with limited support for other languages
- **Scale**: Small model size limits complex reasoning capabilities
- **Training Data**: Limited training iterations due to resource constraints
## 🤝 Contributing
This model serves as a foundation for further experimentation with MoE architectures in code generation. Contributions and improvements are welcome!
---
*Fine-tuned by rohitnagareddy using LoRA on the AdbhutMOE architecture.*
*This model demonstrates the application of parameter-efficient fine-tuning to Mixture-of-Experts models.*