---
license: mit
language: en
tags:
- mixture-of-experts
- moe
- coding
- code-generation
- fine-tuned
- lora
- instruction
- python
- adbhutmoe
datasets:
- TokenBender/code_instructions_122k_alpaca_style
model_type: mixtral
base_model: rohitnagareddy/AdbhutMOE
---

# AdbhutMOE-Coding-Finetuned - Fine-tuned Coding Assistant

This model is a fine-tuned version of the `rohitnagareddy/AdbhutMOE` Mixture-of-Experts (MoE) model, specialized for Python code generation and programming assistance tasks. It combines the efficiency of a sparse MoE architecture with domain-specific fine-tuning for coding applications.

## 💻 Model Description

- **Base Model**: `rohitnagareddy/AdbhutMOE` (custom MoE architecture)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Dataset**: `TokenBender/code_instructions_122k_alpaca_style` - roughly 122k coding instructions with solutions in Alpaca format
- **Architecture**: Mixture-of-Experts with selective expert activation
- **Training**: Instruction-based code generation with memory-efficient techniques (LoRA adapters, 8-bit paged optimizer)

## 🏗️ Architecture Details

This model is based on a custom Mixture-of-Experts architecture:

- **Experts per Layer**: 8 experts, with 2 activated per token
- **Hidden Dimension**: 256
- **Attention Heads**: 4
- **Layers**: 4
- **Vocabulary**: Custom-trained tokenizer (~8K tokens)
- **Max Sequence Length**: 512 tokens

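As a point of reference, these hyperparameters roughly map onto a `transformers` `MixtralConfig` like the sketch below. Values not stated above (for example `intermediate_size` and the exact vocabulary size) are illustrative assumptions; the `config.json` shipped with the model is the authoritative source.

```python
# Illustrative only: a MixtralConfig approximating the architecture described above.
# intermediate_size and the exact vocab_size are assumptions, not confirmed values.
from transformers import MixtralConfig

config = MixtralConfig(
    vocab_size=8000,             # ~8K custom tokenizer (assumed exact size)
    hidden_size=256,             # hidden dimension
    intermediate_size=512,       # assumed expert FFN size
    num_hidden_layers=4,         # transformer layers
    num_attention_heads=4,       # attention heads
    num_key_value_heads=4,       # assumed: no grouped-query attention
    num_local_experts=8,         # experts per MoE layer
    num_experts_per_tok=2,       # experts activated per token
    max_position_embeddings=512, # max sequence length
)
print(config)
```
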
## ⚠️ Important Considerations

- **Verify All Code**: Generated code may contain errors or be suboptimal. Always test and review thoroughly.
- **Security**: Generated code has not been vetted for security vulnerabilities.
- **Educational Model**: This is a proof-of-concept model demonstrating MoE fine-tuning techniques.
- **Limited Training**: The model was trained with limited resources for demonstration purposes.

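As noted above, generated code should always be tested before use. The snippet below is one hypothetical, minimal check (the `sum_even` example is made up for illustration): it syntax-checks a generated snippet, runs it in a throwaway namespace, and asserts a known input/output pair. It is not a sandbox and does not replace a proper review.

```python
# Hypothetical minimal check for a generated snippet (sum_even is a made-up example).
# This catches syntax errors and obvious logic bugs only; it is NOT a security
# sandbox, so only run code you have read first.
import ast

generated_code = """
def sum_even(numbers):
    return sum(n for n in numbers if n % 2 == 0)
"""

ast.parse(generated_code)        # raises SyntaxError if the snippet is malformed

namespace = {}
exec(generated_code, namespace)  # define the function in a scratch namespace
assert namespace["sum_even"]([1, 2, 3, 4]) == 6
print("basic check passed")
```
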
## 🚀 Usage

### Basic Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

# Create a text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

# Generate code
prompt = '''### Instruction:
Write a Python function that takes a list of integers and returns the sum of all even numbers in the list.

### Response:'''

response = pipe(prompt, max_new_tokens=150, temperature=0.2, do_sample=True)
print(response[0]["generated_text"])
```

### Direct Model Usage

```python
# For more control over generation (reuses the model and tokenizer loaded above)
prompt = '''### Instruction:
Create a Python class for a simple calculator with basic arithmetic operations.

### Response:'''

# Move the inputs to the model's device (important when device_map="auto" places it on GPU)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.3,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## 📊 Training Details

### Fine-tuning Configuration

- **Training Steps**: 500 (limited for demonstration)
- **Batch Size**: 1 (with 8 gradient accumulation steps, i.e. an effective batch size of 8)
- **Learning Rate**: 1e-4
- **Optimizer**: Paged AdamW 8-bit
- **LoRA Rank**: 8
- **LoRA Alpha**: 16
- **Target Modules**: All linear layers, including the MoE experts and gates (see the configuration sketch below)

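For readers who want to set up a similar run, the snippet below shows one way this configuration could be expressed with `peft` and `transformers`. It is a sketch rather than the original training script: the target-module names assume standard Mixtral layer naming (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate`, `w1`, `w2`, `w3`), and the dropout, logging, and precision settings are illustrative defaults.

```python
# Sketch of the fine-tuning setup described above (not the original training script).
# Target module names assume standard Mixtral naming; adjust to the actual layer names.
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=8,                      # LoRA rank
    lora_alpha=16,            # LoRA alpha
    lora_dropout=0.05,        # assumed; not stated in the card
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
                    "gate", "w1", "w2", "w3"],               # MoE router + expert MLPs
)

training_args = TrainingArguments(
    output_dir="adbhutmoe-coding-finetuned",
    max_steps=500,                  # limited for demonstration
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    learning_rate=1e-4,
    optim="paged_adamw_8bit",       # memory-efficient 8-bit paged optimizer
    logging_steps=25,
    fp16=True,
)
```
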
### Base Model Training

- **Pre-training Data**: AG News dataset sample
- **Architecture**: Custom Mixtral-based MoE
- **Training Steps**: 100 (base model pre-training)

## 🎯 Performance Notes

- **Efficiency**: With only 2 of 8 experts active per token, each forward pass uses a fraction of the total parameter count
- **Memory**: Trained and served with memory-efficient techniques (LoRA adapters, 8-bit optimizer, optional 4-bit quantization)
- **Speed**: Sparse expert activation keeps per-token compute closer to that of a smaller dense model

## 🔄 Model Lineage

1. **Base Architecture**: Custom Mixtral MoE implementation
2. **Pre-training**: Trained on an AG News dataset sample
3. **Fine-tuning**: LoRA adaptation on a coding instruction dataset
4. **Optimization**: 4-bit quantization support for efficient deployment (see the loading sketch below)

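To try the 4-bit path, a `bitsandbytes` setup along the following lines is a reasonable starting point. This is a hedged sketch: it assumes a CUDA GPU and the `bitsandbytes` package, and the quantization settings shown are common defaults rather than values confirmed for this model.

```python
# Hedged sketch: loading the model in 4-bit with bitsandbytes (requires a CUDA GPU
# and the bitsandbytes package). Quantization settings are common defaults, not
# values confirmed by the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "rohitnagareddy/AdbhutMOE-Coding-Finetuned"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```
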
## 📈 Intended Use Cases

- **Code Generation**: Creating Python functions and classes
- **Programming Education**: Demonstrating coding concepts
- **Research**: Studying MoE architectures for domain-specific tasks
- **Prototyping**: Quick code snippet generation

## 🚫 Limitations

- **Limited Scope**: Primarily trained on basic coding tasks
- **Language Focus**: Optimized for Python, with limited support for other languages
- **Scale**: Small model size limits complex reasoning capabilities
- **Training Budget**: Limited training iterations due to resource constraints

## 🤝 Contributing

This model serves as a foundation for further experimentation with MoE architectures in code generation. Contributions and improvements are welcome!

---

*Fine-tuned by rohitnagareddy using LoRA on the AdbhutMOE architecture.*

*This model demonstrates the application of parameter-efficient fine-tuning to Mixture-of-Experts models.*
|