Add model card for fine-tuned AdbhutMOE
README.md
ADDED
@@ -0,0 +1,158 @@
---
license: mit
language: en
tags:
- mixture-of-experts
- moe
- coding
- code-generation
- fine-tuned
- lora
- instruction
- python
- adbhutmoe
datasets:
- TokenBender/code_instructions_122k_alpaca_style
model_type: mixtral
base_model: rohitnagareddy/AdbhutMOE
---

# AdbhutMOE-Coding-Finetuned - Fine-tuned Coding Assistant

This model is a fine-tuned version of the `rohitnagareddy/AdbhutMOE` Mixture-of-Experts (MoE) model, specialized for Python code generation and programming assistance tasks. It combines the efficiency of a sparse MoE architecture with domain-specific fine-tuning for coding applications.

## 💻 Model Description

- **Base Model**: `rohitnagareddy/AdbhutMOE` (custom MoE architecture)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: `TokenBender/code_instructions_122k_alpaca_style`, a dataset of coding instructions and solutions
- **Architecture**: Mixture-of-Experts with selective expert activation
- **Training**: Optimized for instruction-based code generation with memory-efficient techniques

## 🏗️ Architecture Details

This model is based on a custom Mixture-of-Experts architecture (a configuration sketch follows the list):
- **Experts per Layer**: 8 experts, with 2 activated per token
- **Hidden Dimension**: 256
- **Attention Heads**: 4
- **Layers**: 4
- **Vocabulary**: custom-trained tokenizer (~8K tokens)
- **Max Sequence Length**: 512 tokens

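As a rough orientation, the numbers above map onto a `transformers` MoE configuration along the following lines. This is a minimal sketch that assumes the architecture is expressed through `MixtralConfig` (suggested by `model_type: mixtral` in the metadata); the `intermediate_size`, exact `vocab_size`, and `num_key_value_heads` values are illustrative assumptions, so check the repository's `config.json` for the authoritative settings.

```python
from transformers import MixtralConfig

# Illustrative configuration mirroring the card's architecture list.
# vocab_size, intermediate_size and num_key_value_heads are assumptions;
# the repository's config.json is the source of truth.
config = MixtralConfig(
    vocab_size=8000,              # "~8K tokens" custom tokenizer
    hidden_size=256,              # hidden dimension
    intermediate_size=512,        # assumed expert FFN width
    num_hidden_layers=4,          # layers
    num_attention_heads=4,        # attention heads
    num_key_value_heads=4,        # assumes no grouped-query attention
    num_local_experts=8,          # experts per layer
    num_experts_per_tok=2,        # top-2 routing
    max_position_embeddings=512,  # max sequence length
)
print(config)
```

With top-2 routing, only 2 of the 8 expert feed-forward blocks run for each token, which is where the efficiency claims in the Performance Notes below come from.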
## ⚠️ Important Considerations

- **Verify All Code**: Generated code may contain errors or be suboptimal. Always test and review thoroughly.
- **Security**: Generated code has not been vetted for security vulnerabilities.
- **Educational Model**: This is a proof-of-concept model demonstrating MoE fine-tuning techniques.
- **Limited Training**: Model was trained with limited resources for demonstration purposes.

## 🚀 Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

# Generate code
prompt = '''### Instruction:
Write a Python function that takes a list of integers and returns the sum of all even numbers in the list.

### Response:'''

response = pipe(prompt, max_new_tokens=150, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```

### Direct Model Usage

```python
# For more control over generation (reuses the model and tokenizer loaded above)
prompt = '''### Instruction:
Create a Python class for a simple calculator with basic arithmetic operations.

### Response:'''

# Move inputs to the same device as the model (needed with device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

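Both examples hand-write the Alpaca-style prompt used by the fine-tuning dataset. A small helper can keep that template consistent; this is a sketch based on the common Alpaca format (instruction plus optional input). Whether the model saw an `### Input:` block during training is an assumption, so prefer instruction-only prompts if results look off.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Format a request in the Alpaca style used by the fine-tuning dataset.

    The optional ### Input: block follows the common Alpaca convention; its use
    during this model's training is an assumption.
    """
    if input_text:
        return (
            "### Instruction:\n"
            f"{instruction}\n\n"
            "### Input:\n"
            f"{input_text}\n\n"
            "### Response:"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:"


# Reuses `pipe` from the Basic Text Generation example above.
prompt = build_prompt("Write a Python function that reverses a string.")
response = pipe(prompt, max_new_tokens=100, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```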
## 📊 Training Details

### Fine-tuning Configuration

The fine-tuning run used the following settings (a configuration sketch follows the list):
- **Training Steps**: 500 (limited for demonstration)
- **Batch Size**: 1 (with 8 gradient accumulation steps)
- **Learning Rate**: 1e-4
- **Optimizer**: Paged AdamW 8-bit
- **LoRA Rank**: 8
- **LoRA Alpha**: 16
- **Target Modules**: All linear layers, including MoE experts and gates

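For readers who want to reproduce a similar run, the settings above translate roughly into the following `peft`/`transformers` configuration. This is a minimal sketch under stated assumptions: the exact `target_modules` names, LoRA dropout, precision flags, and output directory are not recorded in the card, so those values are illustrative.

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings mirroring the card: rank 8, alpha 16, adapters on the attention
# projections plus the MoE expert and gate linears. The concrete module names
# below are assumptions (Mixtral-style); confirm them via model.named_modules().
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,  # assumed; not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate", "w1", "w2", "w3"],
)

# Trainer settings mirroring the card: 500 steps, batch size 1 with 8
# gradient-accumulation steps, lr 1e-4, paged 8-bit AdamW.
training_args = TrainingArguments(
    output_dir="adbhutmoe-coding-finetuned",  # illustrative path
    max_steps=500,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    optim="paged_adamw_8bit",
    logging_steps=25,
    fp16=True,
)
```

In a typical setup these objects would be handed to `trl`'s `SFTTrainer` (or the plain `Trainer`) together with the quantized base model and the formatted instruction dataset.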
### Base Model Training

- **Pre-training Data**: AG News dataset sample (see the loading sketch below)
- **Architecture**: Custom Mixtral-based MoE
- **Training Steps**: 100 (base model pre-training)

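For context, such a sample can be pulled with the `datasets` library. The slice size below is an arbitrary illustrative choice, since the card does not state how large the pre-training sample actually was.

```python
from datasets import load_dataset

# Small AG News slice; the actual sample size used for pre-training is not
# stated in the card, so "train[:1%]" is an arbitrary choice.
ag_news_sample = load_dataset("ag_news", split="train[:1%]")
print(len(ag_news_sample), ag_news_sample[0]["text"][:100])
```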
## 🎯 Performance Notes

- **Efficiency**: Only 2 of the 8 experts run per token, so compute per token scales with the active experts rather than with the full parameter count
- **Memory**: Optimized for memory-efficient inference and training (fp16 inference and optional 4-bit loading)
- **Speed**: Sparse expert activation enables faster inference than a dense model with a comparable total parameter count

## 🔄 Model Lineage

1. **Base Architecture**: Custom Mixtral MoE implementation
2. **Pre-training**: Trained on an AG News dataset sample
3. **Fine-tuning**: LoRA adaptation on the coding instruction dataset
4. **Optimization**: 4-bit quantization support for efficient deployment (see the loading sketch below)

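Lineage step 4 mentions 4-bit quantization support. A minimal sketch of the usual `bitsandbytes` 4-bit loading path through `transformers` is shown below; whether this particular custom architecture loads cleanly in 4-bit is an assumption worth verifying.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Standard NF4 4-bit setup; requires a CUDA GPU and the bitsandbytes package.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```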
## 📈 Intended Use Cases

- **Code Generation**: Creating Python functions and classes
- **Programming Education**: Demonstrating coding concepts
- **Research**: Studying MoE architectures for domain-specific tasks
- **Prototyping**: Quick code snippet generation

## 🚫 Limitations

- **Limited Scope**: Primarily trained on basic coding tasks
- **Language Focus**: Optimized for Python; support for other languages is limited
- **Scale**: The small model size limits complex reasoning capabilities
- **Training Data**: Limited training iterations due to resource constraints

## 🤝 Contributing

This model serves as a foundation for further experimentation with MoE architectures in code generation. Contributions and improvements are welcome!

---
*Fine-tuned by rohitnagareddy using LoRA on the AdbhutMOE architecture.*
*This model demonstrates the application of parameter-efficient fine-tuning to Mixture-of-Experts models.*