CodeGenDetect-CodeBERT

Model Name: azherali/CodeGenDetect-CodeBert
Task: Code Generation Detection (Human vs Machine Generated Code)
Languages Supported: C++, Java, Python
Base Model: CodeBERT
Author: Azher Ali


📌 Model Overview

CodeGenDetect-CodeBert is a transformer-based classification model designed to distinguish human-written code from machine-generated code produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning C++, Java, and Python, making it suitable for real-world, cross-language code analysis tasks.

Built on top of CodeBERT, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.


🎯 Intended Use Cases

This model is well-suited for:

  • Academic integrity & plagiarism detection
  • LLM-generated code identification
  • Code authenticity verification
  • Research on AI-generated programming artifacts
  • Code forensics and auditing pipelines

🧠 Model Details

  • Architecture: Transformer-based (CodeBERT)
  • Task Type: Binary Sequence Classification
  • Labels (see the sketch after this list for verifying the mapping):
    • 0 → Human-written code
    • 1 → Machine-generated (LLM) code
  • Input: Source code as plain text
  • Output: Class probabilities and predicted label
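
The label mapping can be verified programmatically at load time. A minimal sketch using the checkpoint's configuration (if no custom label names were saved, Transformers reports the defaults LABEL_0/LABEL_1, which follow the 0/1 convention above):

from transformers import AutoConfig

# Inspect the id→label mapping stored in the checkpoint's config.
# Checkpoints without custom names report the defaults LABEL_0 / LABEL_1,
# which map to 0 → human-written and 1 → machine-generated as listed above.
config = AutoConfig.from_pretrained("azherali/CodeGenDetect-CodeBert")
print(config.id2label)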

🌐 Supported Programming Languages

The model has been trained and evaluated on code written in:

  • C++
  • Java
  • Python

It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.


🏋️ Training Summary

  • Training Objective: Binary cross-entropy loss for classification
  • Tokenization: CodeBERT tokenizer with fixed-length padding and truncation
  • Optimization: Standard supervised fine-tuning of the pretrained CodeBERT encoder with a classification head
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-score

The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
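
As a reference for reproducing the preprocessing, here is a minimal tokenization sketch matching the fixed-length setup described above; max_length=512 is an assumed value (512 is CodeBERT's maximum input length):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("azherali/CodeGenDetect-CodeBert")

# Fixed-length padding and truncation, as described in the training summary.
# max_length=512 is an assumption; 512 is the upper bound for CodeBERT inputs.
encoded = tokenizer(
    "int main() { return 0; }",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])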


🚀 Example Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the source code and run a single forward pass without gradients
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and take the most likely label
probs = torch.softmax(outputs.logits, dim=1)
prediction = torch.argmax(probs, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"

print(f"{label} (P = {probs[0, prediction]:.2f})")
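
For scoring several snippets at once, the same model handles batched inputs. A short sketch reusing the tokenizer and model loaded above (the snippets here are illustrative placeholders):

# Classify multiple code snippets in one batched forward pass.
snippets = [
    "def add(a, b):\n    return a + b",
    "public static int add(int a, int b) { return a + b; }",
]

batch = tokenizer(snippets, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Per-snippet class probabilities; index 1 is the machine-generated class.
probs = torch.softmax(logits, dim=1)
for snippet, p in zip(snippets, probs):
    label = "Machine-generated" if p[1] > p[0] else "Human-written"
    print(f"{label}  P(machine) = {p[1]:.2f}")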