---
license: mit
language: en
tags:
- mixture-of-experts
- moe
- coding
- code-generation
- fine-tuned
- lora
- instruction
- python
- adbhutmoe
datasets:
- TokenBender/code_instructions_122k_alpaca_style
model_type: mixtral
base_model: rohitnagareddy/AdbhutMOE
---

# AdbhutMOE-Coding-Finetuned - Fine-tuned Coding Assistant

This model is a fine-tuned version of the `rohitnagareddy/AdbhutMOE` Mixture-of-Experts (MoE) model, specialized for Python code generation and programming assistance tasks. It combines the efficiency of a sparse MoE architecture with domain-specific fine-tuning for coding applications.

## 💻 Model Description

- **Base Model**: `rohitnagareddy/AdbhutMOE` (Custom MoE Architecture)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: `TokenBender/code_instructions_122k_alpaca_style` - an Alpaca-style dataset of roughly 122k coding instructions and solutions
- **Architecture**: Mixture-of-Experts with selective expert activation
- **Training**: Optimized for instruction-based code generation with memory-efficient techniques

## 🏗️ Architecture Details

This model is based on a custom Mixture-of-Experts architecture:
- **Experts per Layer**: 8 experts with 2 activated per token
- **Hidden Dimension**: 256
- **Attention Heads**: 4
- **Layers**: 4
- **Vocabulary**: Custom-trained tokenizer (~8K tokens)
- **Max Sequence Length**: 512 tokens
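
For reference, the dimensions above map onto a Hugging Face `MixtralConfig` roughly as sketched below. This is illustrative only: the exact vocabulary size and any fields not listed in this card are assumptions, and the checkpoint's own `config.json` remains authoritative.

```python
from transformers import MixtralConfig

# Illustrative config mirroring the numbers listed above.
# Values not stated in the card (e.g. the exact vocab size) are assumed.
config = MixtralConfig(
    vocab_size=8000,              # ~8K custom tokenizer (approximate)
    hidden_size=256,              # hidden dimension
    num_hidden_layers=4,          # transformer layers
    num_attention_heads=4,        # attention heads
    num_local_experts=8,          # experts per layer
    num_experts_per_tok=2,        # experts activated per token
    max_position_embeddings=512,  # max sequence length
)
print(config)
```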

## ⚠️ Important Considerations

- **Verify All Code**: Generated code may contain errors or be suboptimal. Always test and review thoroughly.
- **Security**: Generated code has not been vetted for security vulnerabilities.
- **Educational Model**: This is a proof-of-concept model demonstrating MoE fine-tuning techniques.
- **Limited Training**: Model was trained with limited resources for demonstration purposes.

## 🚀 Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

# Generate code
prompt = '''### Instruction:
Write a Python function that takes a list of integers and returns the sum of all even numbers in the list.

### Response:'''

response = pipe(prompt, max_new_tokens=150, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```

### Direct Model Usage

```python
# For more control over generation
prompt = '''### Instruction:
Create a Python class for a simple calculator with basic arithmetic operations.

### Response:'''

# Reuses the model and tokenizer loaded above; move the inputs to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## 📊 Training Details

### Fine-tuning Configuration
- **Training Steps**: 500 (limited for demonstration)
- **Batch Size**: 1 (with 8 gradient accumulation steps)
- **Learning Rate**: 1e-4
- **Optimizer**: Paged AdamW 8-bit
- **LoRA Rank**: 8
- **LoRA Alpha**: 16
- **Target Modules**: All linear layers including MoE experts and gates
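
Expressed with the `peft` library, the LoRA settings above would look roughly like the sketch below. The `target_modules` names and the dropout value are assumptions (the card only states "all linear layers including MoE experts and gates"), so they may differ from the actual training script.

```python
from peft import LoraConfig

# Illustrative LoRA setup matching the hyperparameters listed above.
# Module names are assumed; adapt them to the AdbhutMOE implementation.
lora_config = LoraConfig(
    r=8,                      # LoRA rank
    lora_alpha=16,            # LoRA alpha
    lora_dropout=0.05,        # not stated in the card (assumed)
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate", "w1", "w2", "w3",                # MoE router and expert layers
    ],
)
```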

### Base Model Training
- **Pre-training Data**: AG News dataset sample
- **Architecture**: Custom Mixtral-based MoE
- **Training Steps**: 100 (base model pre-training)

## 🎯 Performance Notes

- **Efficiency**: Only 2 of the 8 experts are active per token, so per-token compute uses a fraction of the total parameter count
- **Memory**: LoRA fine-tuning and optional 4-bit quantization keep training and inference memory footprints small
- **Speed**: Sparse expert activation enables faster inference than a dense model with the same total parameter count

## 🔄 Model Lineage

1. **Base Architecture**: Custom Mixtral MoE implementation
2. **Pre-training**: Trained on AG News dataset sample
3. **Fine-tuning**: LoRA adaptation on coding instruction dataset
4. **Optimization**: 4-bit quantization support for efficient deployment
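
For memory-constrained deployment, the checkpoint can in principle be loaded in 4-bit precision via `bitsandbytes`. The snippet below is a sketch with typical NF4 settings; it assumes the model's custom code (`trust_remote_code=True`) is compatible with standard quantized loading.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Typical NF4 quantization settings; these are not values confirmed by the card.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```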

## 📈 Intended Use Cases

- **Code Generation**: Creating Python functions and classes
- **Programming Education**: Demonstrating coding concepts
- **Research**: Studying MoE architectures for domain-specific tasks
- **Prototyping**: Quick code snippet generation

## 🚫 Limitations

- **Limited Scope**: Primarily trained on basic coding tasks
- **Language Focus**: Optimized for Python, with limited support for other languages
- **Scale**: Small model size limits complex reasoning capabilities
- **Training Budget**: Only 500 fine-tuning steps due to resource constraints

## 🤝 Contributing

This model serves as a foundation for further experimentation with MoE architectures in code generation. Contributions and improvements are welcome!

---
*Fine-tuned by rohitnagareddy using LoRA on the AdbhutMOE architecture.*
*This model demonstrates the application of parameter-efficient fine-tuning to Mixture-of-Experts models.*