# CodeGenDetect-CodeBERT

**Model Name:** `azherali/CodeGenDetect-CodeBert`
**Task:** Code Generation Detection (Human vs. Machine-Generated Code)
**Languages Supported:** C++, Java, Python
**Base Model:** CodeBERT
**Author:** Azher Ali

---

## 📌 Model Overview

`CodeGenDetect-CodeBert` is a transformer-based classification model that distinguishes **human-written code** from **machine-generated code** produced by Large Language Models (LLMs). The model is fine-tuned on multilingual source code spanning **C++**, **Java**, and **Python**, making it suitable for real-world, cross-language code analysis.

Built on **CodeBERT**, the model leverages contextual and structural representations of source code to capture the subtle stylistic, syntactic, and semantic patterns that differentiate human-authored code from AI-generated code.

---

## 🎯 Intended Use Cases

This model is well suited for:

- **Academic integrity & plagiarism detection**
- **LLM-generated code identification**
- **Code authenticity verification**
- **Research on AI-generated programming artifacts**
- **Code forensics and auditing pipelines**

---

## 🧠 Model Details

- **Architecture:** Transformer-based (CodeBERT)
- **Task Type:** Binary sequence classification
- **Labels:**
  - `0` → Human-written code
  - `1` → Machine-generated (LLM) code
- **Input:** Source code as plain text
- **Output:** Class probabilities and predicted label

---

## 🌐 Supported Programming Languages

The model has been trained and evaluated on code written in:

- **C++**
- **Java**
- **Python**

It generalizes across these languages by learning language-agnostic code patterns while still capturing language-specific constructs.

---

## 🏋️ Training Summary

- **Training Objective:** Binary classification with cross-entropy loss
- **Tokenization:** CodeBERT tokenizer with fixed-length padding and truncation
- **Optimization:** Fine-tuned using modern deep learning best practices
- **Evaluation Metrics:** Accuracy, precision, recall, F1-score

The training data includes both human-written code and code generated by modern LLMs to ensure realistic detection performance.

---

## 🚀 Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

code_snippet = """
def add(a, b):
    return a + b
"""

# Tokenize the snippet and run a forward pass without tracking gradients
inputs = tokenizer(code_snippet, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    outputs = model(**inputs)

prediction = torch.argmax(outputs.logits, dim=1).item()
label = "Machine-generated" if prediction == 1 else "Human-written"
print(label)
```
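The model details above note that the output includes class probabilities as well as the predicted label. The sketch below shows one way to recover both with a softmax over the two logits; the `classify` helper is hypothetical and assumes the same label mapping listed above (`0` → human-written, `1` → machine-generated).

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

def classify(code: str) -> dict:
    # Hypothetical helper: tokenize one snippet and return probabilities + label
    inputs = tokenizer(code, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the two logits gives the class probabilities
    probs = F.softmax(logits, dim=-1).squeeze(0)
    return {
        "human_prob": probs[0].item(),    # label 0 → human-written (per the model card)
        "machine_prob": probs[1].item(),  # label 1 → machine-generated
        "label": "Machine-generated" if probs[1] > probs[0] else "Human-written",
    }

print(classify("int main() { return 0; }"))
```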
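For auditing pipelines that scan many files at once, the same tokenizer can pad a whole batch to a fixed length, mirroring the fixed-length padding and truncation mentioned in the training summary. This is a minimal sketch; the `classify_batch` helper and the `max_length=512` value are assumptions (512 is CodeBERT's usual context size), not documented settings of this model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "azherali/CodeGenDetect-CodeBert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

LABELS = {0: "Human-written", 1: "Machine-generated"}

def classify_batch(snippets: list[str]) -> list[str]:
    # Pad every snippet to the same length and truncate long ones.
    # max_length=512 is an assumption based on CodeBERT's typical context window.
    inputs = tokenizer(
        snippets,
        return_tensors="pt",
        padding="max_length",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return [LABELS[i] for i in logits.argmax(dim=1).tolist()]

snippets = [
    "def add(a, b):\n    return a + b",
    "public static int add(int a, int b) { return a + b; }",
]
print(classify_batch(snippets))
```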