CodeGenDetect-CodeBERT

Model Name: azherali/CodeGenDetect-CodeBert
Task: Code Generation Detection (Human vs Machine Generated Code)
Languages Supported: C++, Java, Python
Base Model: CodeBERT
Author: Azher Ali


📌 Model Overview

CodeGenDetect-CodeBert is a transformer-based classification model designed to distinguish human-written code from machine-generated code produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code data spanning C++, Java, and Python, making it suitable for real-world, cross-language code analysis tasks.

Built on top of CodeBERT, the model leverages contextual and structural representations of source code to capture subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.


🎯 Intended Use Cases

This model is well-suited for:

  • Academic integrity & plagiarism detection
  • LLM-generated code identification
  • Code authenticity verification
  • Research on AI-generated programming artifacts
  • Code forensics and auditing pipelines

🧠 Model Details

  • Architecture: Transformer-based (CodeBERT)
  • Task Type: Binary Sequence Classification
  • Labels (see the sketch after this list for verifying the mapping):
    • 0 → Human-written code
    • 1 → Machine-generated (LLM) code
  • Input: Source code as plain text
  • Output: Class probabilities and predicted label
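
The label mapping can be verified programmatically at load time. A minimal sketch using the checkpoint's configuration (if no custom label names were saved, Transformers reports the defaults LABEL_0/LABEL_1, which follow the 0/1 convention above):

from transformers import AutoConfig

# Inspect the id→label mapping stored in the checkpoint's config.
# Checkpoints without custom names report the defaults LABEL_0 / LABEL_1,
# which map to 0 → human-written and 1 → machine-generated as listed above.
config = AutoConfig.from_pretrained("azherali/CodeGenDetect-CodeBert")
print(config.id2label)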

🌐 Supported Programming Languages

The model has been trained and evaluated on code written in:

  • C++
  • Java
  • Python

It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.


🏋️ Training Summary

  • Training Objective: Binary cross-entropy loss for classification
  • Tokenization: CodeBERT tokenizer with fixed-length padding and truncation
  • Optimization: Standard supervised fine-tuning of the pretrained CodeBERT encoder with a classification head
  • Evaluation Metrics: Accuracy, Precision, Recall, F1-score

The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.
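
As a reference for reproducing the preprocessing, here is a minimal tokenization sketch matching the fixed-length setup described above; max_length=512 is an assumed value (512 is CodeBERT's maximum input length):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("azherali/CodeGenDetect-CodeBert")

# Fixed-length padding and truncation, as described in the training summary.
# max_length=512 is an assumption; 512 is the upper bound for CodeBERT inputs.
encoded = tokenizer(
    "int main() { return 0; }",
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 512])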


🚀 Example Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"

# Load the fine-tuned classifier and its tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the source code and run a single forward pass without gradients
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to class probabilities and take the most likely label
probs = torch.softmax(outputs.logits, dim=1)
prediction = torch.argmax(probs, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"

print(f"{label} (P = {probs[0, prediction]:.2f})")
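
For scoring several snippets at once, the same model handles batched inputs. A short sketch reusing the tokenizer and model loaded above (the snippets here are illustrative placeholders):

# Classify multiple code snippets in one batched forward pass.
snippets = [
    "def add(a, b):\n    return a + b",
    "public static int add(int a, int b) { return a + b; }",
]

batch = tokenizer(snippets, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    logits = model(**batch).logits

# Per-snippet class probabilities; index 1 is the machine-generated class.
probs = torch.softmax(logits, dim=1)
for snippet, p in zip(snippets, probs):
    label = "Machine-generated" if p[1] > p[0] else "Human-written"
    print(f"{label}  P(machine) = {p[1]:.2f}")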