---
license: apache-2.0
language:
- en
base_model:
- Salesforce/codet5-small
tags:
- cpp
- complete
---
|
|
|
|
|
# 🚀 Codelander |
|
|
|
--- |
|
|
|
## 📖 Overview |
|
|
|
This specialized **CodeT5** model has been fine-tuned for **C++ code completion** tasks. |
|
It excels at understanding **C++ syntax** and **common programming patterns** to provide intelligent code suggestions as you type. |
|
|
|
--- |
|
|
|
## ✨ Key Features |
|
|
|
- 🔹 Context-aware completions for C++ functions, classes, and control structures |
|
- 🔹 Handles complex C++ syntax including **templates, STL, and modern C++ features** |
|
- 🔹 Trained on **competitive programming solutions** from high-quality Codeforces submissions |
|
- 🔹 Low latency suitable for **real-time editor integration** (a quick timing check is sketched below)
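
Actual latency depends on your hardware, so the claim above is worth verifying locally. A minimal sketch that times a short greedy generation (same model ID as in the Usage section):

```python
import time
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander")
tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander")

inputs = tokenizer("complete C++ code: int main() {", return_tensors="pt")
start = time.perf_counter()
# Greedy decoding; may stop before 32 tokens if EOS is produced
model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(f"Completion generated in {time.perf_counter() - start:.2f}s")
```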
|
|
|
--- |
|
|
|
## 📊 Model Performance |
|
|
|
| Metric | Value |
|---------------------------------|---------|
| Training loss | 1.2475 |
| Validation loss | 1.0016 |
| Training epochs | 3 |
| Training steps | 14010 |
| Training throughput (samples/s) | 6.275 |
|
|
|
--- |
|
|
|
## ⚙️ Installation & Usage |
|
|
|
### 🔧 Direct Integration with Hugging Face Transformers
|
|
|
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the fine-tuned model and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander")
tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander")

# Generate a completion for a partial C++ snippet
def get_completion(code_prefix, max_new_tokens=100):
    inputs = tokenizer(f"complete C++ code: {code_prefix}", return_tensors="pt")
    outputs = model.generate(
        **inputs,  # passes input_ids and attention_mask
        max_new_tokens=max_new_tokens,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
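
For example (the prefix below is illustrative; output varies between runs because sampling is enabled):

```python
prefix = "for (int i = 0; i < n; i++) {\n    "
print(get_completion(prefix, max_new_tokens=40))
```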
|
|
|
--- |
|
|
|
## 🏗️ Model Architecture |
|
|
|
- Base Model: **Salesforce/codet5-small**

- Parameters: **~60M**

- Context Window: **512 tokens**

- Fine-tuning: **Seq2Seq training on C++ code snippets** (a training sketch follows below)

- Training Time: **~5 hours**
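
The training script itself is not part of this repository; the snippet below is only a minimal sketch of the kind of Seq2Seq fine-tuning described above, using the Transformers `Seq2SeqTrainer`. The prompt format mirrors the Usage section, but the tiny in-memory dataset and the `prefix`/`suffix` field names are illustrative assumptions, not the actual training setup.

```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-small")
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-small")

def preprocess(example):
    # Encoder input: task prompt + code prefix; labels: the suffix to predict
    model_inputs = tokenizer(
        f"complete C++ code: {example['prefix']}", max_length=512, truncation=True
    )
    labels = tokenizer(text_target=example["suffix"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Toy stand-in for the real prefix-suffix dataset (see Training Data below)
pairs = Dataset.from_dict({
    "prefix": ["int add(int a, int b) {\n    "],
    "suffix": ["return a + b;\n}"],
})
train_ds = pairs.map(preprocess, remove_columns=pairs.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="codelander", num_train_epochs=3),
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```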
|
|
|
--- |
|
|
|
## 📂 Training Data |
|
|
|
- Dataset: **open-r1/codeforces-submissions** |
|
- Selection: **Accepted C++ solutions only** |
|
- Size: **50,000+ code samples** |
|
- Processing: **Prefix-suffix pairs with random splits** (a sketch of this step follows below)
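
The preprocessing code is likewise not published; the sketch below shows one plausible reading of "prefix-suffix pairs with random splits": each accepted solution is cut at a random position, with the text before the cut used as the model input and the rest as the target. The split-point bounds and helper name are assumptions.

```python
import random

def make_pair(source_code: str, rng: random.Random):
    """Cut one C++ solution at a random point into a (prefix, suffix) pair."""
    # Keep at least one character on each side of the cut
    cut = rng.randint(1, len(source_code) - 1)
    return source_code[:cut], source_code[cut:]

rng = random.Random(0)
solution = "int main() {\n    int n;\n    std::cin >> n;\n    std::cout << n * 2;\n}\n"
prefix, suffix = make_pair(solution, rng)
print(repr(prefix))
print(repr(suffix))
```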
|
|
|
--- |
|
|
|
## ⚠️ Limitations |
|
|
|
- ❌ May generate syntactically correct but semantically incorrect code |
|
- ❌ Limited knowledge of **domain-specific libraries** not present in training data |
|
- ❌ May occasionally produce **incomplete code fragments** |
|
|
|
--- |
|
|
|
## 💻 Example Completions |
|
|
|
### ✅ Example 1: Factorial Function |
|
|
|
**Input:** |
|
```cpp
int factorial(int n) {
    if (n <= 1) {
        return 1;
    } else {
```
|
|
|
**Completion:** |
|
```cpp
        return n * factorial(n - 1);
    }
}
```
|
|
|
---
|
|
|
## 📈 Training Details |
|
|
|
- Training completed on: **2025-08-28 12:51:09 UTC** |
|
- Training epochs: **3/3** |
|
- Total steps: **14010** |
|
- Training loss: **1.2475** |
|
|
|
### 📊 Epoch Performance |
|
|
|
| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 1.2638        | 1.1004          |
| 2     | 1.1551        | 1.0250          |
| 3     | 1.1081        | 1.0016          |
|
|
|
--- |
|
|
|
## 🖥️ Compatibility |
|
|
|
- ✅ Compatible with **Transformers 4.30.0+** |
|
- ✅ Optimized for **Python 3.8+** |
|
- ✅ Supports both **CPU and GPU inference** (see the device sketch below)
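
Moving inference between CPU and GPU follows standard Transformers/PyTorch usage; nothing in the sketch below is specific to this model:

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = AutoModelForSeq2SeqLM.from_pretrained("outlander23/codelander").to(device)
tokenizer = AutoTokenizer.from_pretrained("outlander23/codelander")

# Inputs must live on the same device as the model
inputs = tokenizer("complete C++ code: int main() {", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```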
|
|
|
--- |
|
|
|
## ❤️ Credits |
|
|
|
Made with ❤️ by **outlander23** |
|
|
|
> "Good code is its own best documentation." – *Steve McConnell* |
|
|
|
--- |