CodeGenDetect-CodeBERT
Model Name: azherali/CodeGenDetect-CodeBert
Task: Code Generation Detection (Human- vs. Machine-Generated Code)
Languages Supported: C++, Java, Python
Base Model: CodeBERT
Author: Azher Ali
📌 Model Overview
CodeGenDetect-CodeBert is a transformer-based classification model designed to distinguish human-written code from machine-generated code produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning C++, Java, and Python, making it suitable for real-world, cross-language code analysis tasks.
Built on top of CodeBERT, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.
🎯 Intended Use Cases
This model is well-suited for:
- Academic integrity & plagiarism detection
- LLM-generated code identification
- Code authenticity verification
- Research on AI-generated programming artifacts
- Code forensics and auditing pipelines
🧠 Model Details
- Architecture: Transformer-based (CodeBERT)
- Task Type: Binary Sequence Classification
- Labels:
  - 0 → Human-written code
  - 1 → Machine-generated (LLM) code
- Input: Source code as plain text
- Output: Class probabilities and predicted label
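As a concrete illustration, the logits returned by a forward pass can be mapped to the class probabilities and predicted label described above. The helper below is a minimal sketch (decode_output is a hypothetical name, not part of the model's API); the 0/1 indices follow the label table:

import torch
import torch.nn.functional as F

def decode_output(logits: torch.Tensor) -> dict:
    """Map raw logits of shape (1, 2) to class probabilities and a label."""
    probs = F.softmax(logits, dim=-1)[0]
    pred = int(probs.argmax())
    return {
        "human_prob": probs[0].item(),    # index 0 = human-written
        "machine_prob": probs[1].item(),  # index 1 = machine-generated (LLM)
        "label": "Machine-generated" if pred == 1 else "Human-written",
    }

# Example with dummy logits:
print(decode_output(torch.tensor([[0.3, 1.2]])))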
🌐 Supported Programming Languages
The model has been trained and evaluated on code written in:
- C++
- Java
- Python
It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.
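As an illustration of cross-language use, the same checkpoint can score snippets in all three languages through the transformers pipeline API. This is a minimal sketch; the returned label strings (e.g. LABEL_0 for human-written, LABEL_1 for machine-generated) depend on the checkpoint's config and are an assumption here:

from transformers import pipeline

detector = pipeline("text-classification", model="azherali/CodeGenDetect-CodeBert")

snippets = {
    "python": "def add(a, b):\n    return a + b",
    "java": "public static int add(int a, int b) { return a + b; }",
    "cpp": "int add(int a, int b) { return a + b; }",
}

for lang, code in snippets.items():
    result = detector(code)[0]
    # result["label"] is the predicted class name, result["score"] its probability.
    print(lang, result["label"], round(result["score"], 3))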
🏋️ Training Summary
- Training Objective: Binary cross-entropy loss for classification
- Tokenization: CodeBERT tokenizer with fixed-length padding and truncation
- Optimization: End-to-end fine-tuning of the pretrained CodeBERT checkpoint
- Evaluation Metrics: Accuracy, Precision, Recall, F1-score
The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
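The following sketch shows how the tokenization and loss setup described above fit together. The base checkpoint microsoft/codebert-base and max_length=512 are assumptions; the card does not document the exact training hyperparameters:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Assumed base checkpoint; the card only states "CodeBERT".
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=2
)

# Fixed-length padding and truncation, as described above.
batch = tokenizer(
    ["int add(int a, int b) { return a + b; }"],
    padding="max_length",
    truncation=True,
    max_length=512,  # assumption: CodeBERT's maximum input length
    return_tensors="pt",
)
labels = torch.tensor([0])  # 0 = human-written

# With integer labels, the model computes cross-entropy over the two classes.
outputs = model(**batch, labels=labels)
print(outputs.loss)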
🚀 Example Usage
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the snippet and run a single forward pass without tracking gradients.
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Label 0 = human-written, label 1 = machine-generated (see Model Details above).
prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label)
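To score several files at once, the tokenizer also accepts a list of strings. The batched sketch below continues from the example above and additionally reports the class probabilities mentioned in Model Details:

snippets = [code_snippet, "int main() { return 0; }"]
inputs = tokenizer(snippets, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1)   # per-snippet class probabilities
preds = probs.argmax(dim=-1).tolist()   # 0 = human-written, 1 = machine-generated
print(list(zip(preds, probs.tolist())))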