# CodeGenDetect-CodeBERT

**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human vs. Machine-Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali

---

## 📌 Model Overview

`CodeGenDetect-CodeBert` is a transformer-based classification model that distinguishes **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis.

Built on **CodeBERT**, the model leverages contextual and structural representations of source code to capture the subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.

---

## 🎯 Intended Use Cases

This model is well suited for:

- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**

---

## 🧠 Model Details

- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary sequence classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label

---

## 🌐 Supported Programming Languages

The model has been trained and evaluated on code written in:

- **C++**
- **Java**
- **Python**

It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.

---

## 🏋️ Training Summary

- **Training Objective:** Binary classification with cross-entropy loss
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Fine-tuned using modern deep learning best practices
- **Evaluation Metrics:** Accuracy, precision, recall, F1-score

The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.

---

## 🚀 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the snippet and run a forward pass without tracking gradients
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label)
```
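The model details above note that the output includes class probabilities as well as the predicted label. The sketch below shows one way to recover both with a softmax over the two logits; the `classify` helper is hypothetical and assumes the same label mapping listed above (`0` → human-written, `1` → machine-generated).

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(code: str) -> dict:
    # Hypothetical helper: tokenize one snippet and return probabilities + label
    inputs = tokenizer(code, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two logits gives the class probabilities
    probs = F.softmax(logits, dim=-1).squeeze(0)
    return {
        "human_prob": probs[0].item(),    # label 0 → human-written (per the model card)
        "machine_prob": probs[1].item(),  # label 1 → machine-generated
        "label": "Machine-generated" if probs[1] > probs[0] else "Human-written",
    }

print(classify("int main() { return 0; }"))
```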
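For auditing pipelines that scan many files at once, the same tokenizer can pad a whole batch to a fixed length, mirroring the fixed-length padding and truncation mentioned in the training summary. This is a minimal sketch; the `classify_batch` helper and the `max_length=512` value are assumptions (512 is CodeBERT's usual context size), not documented settings of this model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = {0: "Human-written", 1: "Machine-generated"}

def classify_batch(snippets: list[str]) -> list[str]:
    # Pad every snippet to the same length and truncate long ones.
    # max_length=512 is an assumption based on CodeBERT's typical context window.
    inputs = tokenizer(
        snippets,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return [LABELS[i] for i in logits.argmax(dim=1).tolist()]

snippets = [
    "def add(a, b):\n    return a + b",
    "public static int add(int a, int b) { return a + b; }",
]
print(classify_batch(snippets))
```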