metadata

license: mit
datasets:
  - mahdin70/balanced_merged_bigvul_primevul
base_model:
  - microsoft/unixcoder-base
tags:
  - Code
  - Vulnerability
  - Detection
metrics:
  - accuracy
pipeline_tag: text-classification
library_name: transformers

UnixCoder-Primevul-BigVul Model Card

Model Overview

UnixCoder-Primevul-BigVul is a multi-task model based on Microsoft's unixcoder-base, fine-tuned to detect vulnerabilities (vul) and classify Common Weakness Enumeration (CWE) types in code snippets. It was developed by mahdin70 and trained on a balanced dataset that combines the BigVul and PrimeVul datasets. The model performs binary classification for vulnerability detection and multi-class classification for CWE identification.

Model Repository: mahdin70/UnixCoder-Primevul-BigVul
Base Model: microsoft/unixcoder-base
Tasks: Vulnerability Detection (Binary), CWE Classification (Multi-class)
License: MIT
Date: Trained and uploaded as of March 11, 2025

Model Architecture

The model extends unixcoder-base with two task-specific heads:

Vulnerability Head: A linear layer mapping 768-dimensional hidden states to 2 classes (vulnerable or not).
CWE Head: A linear layer mapping 768-dimensional hidden states to 135 classes (134 CWE types + 1 for "no CWE").

The architecture is implemented as a custom MultiTaskUnixCoder class in PyTorch, with the loss computed as the sum of cross-entropy losses for both tasks.
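The two-head design described above can be sketched as follows. This is a minimal illustration, not the author's MultiTaskUnixCoder source: the class name `MultiTaskHeads`, the first-token pooling, and the toy embedding used as a stand-in encoder are all assumptions made so the example runs without downloading weights.

```python
import torch
from torch import nn


class MultiTaskHeads(nn.Module):
    """Shared encoder with two linear heads (illustrative sketch only)."""

    def __init__(self, encoder: nn.Module, hidden_size: int = 768,
                 num_vul_classes: int = 2, num_cwe_classes: int = 135):
        super().__init__()
        self.encoder = encoder
        self.vul_head = nn.Linear(hidden_size, num_vul_classes)
        self.cwe_head = nn.Linear(hidden_size, num_cwe_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, inputs, vul_labels=None, cwe_labels=None):
        hidden = self.encoder(inputs)   # (batch, seq_len, hidden_size)
        pooled = hidden[:, 0, :]        # assumption: pool with the first token
        vul_logits = self.vul_head(pooled)
        cwe_logits = self.cwe_head(pooled)
        loss = None
        if vul_labels is not None and cwe_labels is not None:
            # Sum of the two cross-entropy losses, as the card describes
            loss = (self.loss_fn(vul_logits, vul_labels)
                    + self.loss_fn(cwe_logits, cwe_labels))
        return loss, vul_logits, cwe_logits


# Stand-in encoder: an embedding table that yields 768-dimensional states.
toy_encoder = nn.Embedding(1000, 768)
model = MultiTaskHeads(toy_encoder)

token_ids = torch.randint(0, 1000, (4, 16))   # 4 snippets, 16 tokens each
vul_labels = torch.tensor([0, 1, 0, 1])
cwe_labels = torch.tensor([0, 5, 0, 17])      # 0 stands for "no CWE"
loss, vul_logits, cwe_logits = model(token_ids, vul_labels, cwe_labels)
```

With the real base model, the stand-in would be replaced by a wrapper that returns the encoder's last hidden states; how the original class pools those states is not stated in this card.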

Training Dataset

The model was trained on the mahdin70/balanced_merged_bigvul_primevul dataset (configuration: 10_per_commit), which combines:

BigVul: A dataset of real-world vulnerabilities from open-source projects.
PrimeVul: A vulnerability-detection dataset that applies stricter labeling and de-duplication to functions drawn from existing vulnerability datasets.

Dataset Details

Splits:
- Train: 124,780 samples
- Validation: 26,740 samples
- Test: 26,738 samples
Features:
- func: Code snippet (text)
- vul: Binary label (0 = non-vulnerable, 1 = vulnerable)
- CWE ID: CWE identifier (e.g., CWE-89) or None for non-vulnerable samples
Preprocessing:
- CWE labels were encoded using a LabelEncoder with 134 unique CWE classes identified across the dataset.
- Non-vulnerable samples were assigned a CWE label of -1, which is mapped to class 0 in the model.
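One plausible reading of the CWE preprocessing is sketched below in plain Python. The sorted-class encoding mirrors how scikit-learn's LabelEncoder orders its classes; the shift by one (so that -1 becomes class 0 and the 134 CWE types occupy classes 1 to 134) is an assumption, as are the sample CWE IDs and the helper name `to_model_label`.

```python
from typing import Optional

# Hypothetical sample of CWE IDs; the real dataset has 134 distinct values.
observed_cwes = ["CWE-89", "CWE-119", "CWE-20", "CWE-119"]

# Sorted unique values, each assigned a consecutive integer index.
classes = sorted(set(observed_cwes))
cwe_to_index = {cwe: i for i, cwe in enumerate(classes)}


def to_model_label(cwe_id: Optional[str]) -> int:
    """Map a CWE ID (or None for non-vulnerable code) to a model class index."""
    encoded = -1 if cwe_id is None else cwe_to_index[cwe_id]
    return encoded + 1  # assumption: shift by one so that -1 becomes class 0


print(classes)
print([to_model_label(c) for c in ["CWE-89", None, "CWE-20"]])
```

Under this scheme the total class count is the number of CWE types plus one, which agrees with the 135-way CWE head.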

The dataset is balanced to ensure a fair representation of vulnerable and non-vulnerable samples, with a maximum of 10 samples per commit where applicable.

Training Details

Training Arguments

The model was trained using the Hugging Face Trainer API with the following arguments:

Output Directory: ./unixcoder_multitask
Evaluation Strategy: Per epoch
Save Strategy: Per epoch
Learning Rate: 2e-5
Batch Size: 8 (per device, train and eval)
Epochs: 3
Weight Decay: 0.01
Logging: Every 10 steps, logged to ./logs
WANDB: Disabled
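As a rough guide, the settings above map onto `transformers.TrainingArguments` keyword names as shown below. The values come from the list; `report_to="none"` is only one way to disable Weights & Biases logging and is an assumption about how the author did it.

```python
# Keyword names follow transformers.TrainingArguments; values are from the card.
training_kwargs = {
    "output_dir": "./unixcoder_multitask",
    "eval_strategy": "epoch",       # named `evaluation_strategy` in older releases
    "save_strategy": "epoch",
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "per_device_eval_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "logging_dir": "./logs",
    "logging_steps": 10,
    "report_to": "none",            # assumption: one way to turn off W&B
}

# With transformers installed, the dictionary can be unpacked directly:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```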

Training Environment

Hardware: NVIDIA Tesla T4 GPU
Framework: PyTorch 2.5.1+cu121, Transformers 4.47.0
Duration: 6 hours, 34 minutes, 53 seconds (23,397 steps)
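The reported step count can be cross-checked against the dataset size. The short calculation below is an inference, not something the card states: 23,397 steps over 3 epochs matches an effective batch size of 16 rather than 8, which would be consistent with, for example, two devices or a gradient accumulation of 2.

```python
import math

train_samples = 124_780
epochs = 3
reported_steps = 23_397


def total_steps(effective_batch: int) -> int:
    """Optimizer steps for the full run at a given effective batch size."""
    return math.ceil(train_samples / effective_batch) * epochs


for batch in (8, 16):
    steps = total_steps(batch)
    print(batch, steps, steps == reported_steps)
```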

Training Metrics

Validation metrics across epochs:

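| Epoch | Training Loss | Validation Loss | Vul Accuracy | Vul Precision | Vul Recall | Vul F1 | CWE Accuracy |
|-------|---------------|-----------------|--------------|---------------|------------|--------|--------------|
| 1 | 0.3038 | 0.4997 | 0.9570 | 0.8082 | 0.5379 | 0.6459 | 0.1887 |
| 2 | 0.6092 | 0.4859 | 0.9587 | 0.8118 | 0.5641 | 0.6657 | 0.2964 |
| 3 | 0.4261 | 0.5090 | 0.9585 | 0.8114 | 0.5605 | 0.6630 | 0.3323 |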

Final Training Loss: 0.4430 (average over all steps)

Evaluation

The model was evaluated on the test split (26,738 samples) with the following metrics:

Vulnerability Detection:
- Accuracy: 0.9571
- Precision: 0.7947
- Recall: 0.5437
- F1 Score: 0.6457
CWE Classification (on vulnerable samples):
- Accuracy: 0.3288

Overall accuracy is high (0.9571) and precision is reasonable (0.7947), but a recall of 0.5437 means the model misses roughly 46% of vulnerable samples. CWE classification accuracy on vulnerable samples is 0.3288, which leaves substantial room for improvement in CWE prediction.
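The test-set F1 score follows directly from the reported precision and recall, and the same two numbers give the share of vulnerable samples that go undetected. A small check, using only the figures listed above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


test_precision, test_recall = 0.7947, 0.5437
test_f1 = f1_score(test_precision, test_recall)
miss_rate = 1 - test_recall  # fraction of vulnerable samples not flagged

print(f"F1 = {test_f1:.4f}")
print(f"Missed vulnerable samples = {miss_rate:.1%}")
```

This reproduces the reported F1 of 0.6457 and shows that about 45.6% of vulnerable samples in the test split are not flagged.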

Usage

Installation

Install the required libraries:

pip install transformers torch datasets huggingface_hub
