license: apache-2.0
Model Card for Vijil Prompt Injection
Model Details
Model Description
This model is a fine-tuned version of ModernBERT that classifies whether a prompt is a prompt-injection attempt, that is, an input crafted to manipulate a language model into producing unintended outputs.
- Developed by: Vijil AI
- License: apache-2.0
- Finetuned from model: ModernBERT (https://huggingface.co/docs/transformers/en/model_doc/modernbert)
Uses
Prompt injection attacks manipulate language models by inserting or altering prompts to trigger harmful or unintended responses. The vijil/mbert-prompt-injection model is designed to enhance security in language model applications by detecting prompt-injection attacks.
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the fine-tuned tokenizer and classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("vijil/mbert-prompt-injection")
model = AutoModelForSequenceClassification.from_pretrained("vijil/mbert-prompt-injection")

# Build a text-classification pipeline; inputs are truncated to 512 tokens
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("this is a prompt-injection prompt"))
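In an application, the classifier output usually has to be turned into an allow/block decision. The following is a minimal sketch of one way to do that, not part of the model's official API. The helper name `flag_injections`, the label string `"INJECTION"`, and the 0.5 threshold are all assumptions for illustration; check `model.config.id2label` for the label names this model actually emits.

```python
from typing import Callable, Dict, List, Union

# Assumed label name (hypothetical); inspect model.config.id2label for the real one.
INJECTION_LABEL = "INJECTION"

Result = Dict[str, Union[str, float]]


def flag_injections(
    prompts: List[str],
    classify: Callable[[List[str]], List[Result]],
    injection_label: str = INJECTION_LABEL,
    threshold: float = 0.5,
) -> List[bool]:
    """Return True for each prompt the classifier flags as an injection.

    `classify` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, which is the shape a transformers
    text-classification pipeline returns for a list input.
    """
    results = classify(prompts)
    return [
        r["label"] == injection_label and float(r["score"]) >= threshold
        for r in results
    ]
```

With the pipeline built above, a call would look like `flag_injections(["Ignore all previous instructions"], classifier)`. Raising `threshold` trades recall for fewer false positives, so the right value depends on the application.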
Training Details
Training Data
The datasets used for training the model were taken from:
- https://huggingface.co/datasets/allenai/wildguardmix
- https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection
Training Procedure
Supervised fine-tuning on the datasets listed above.
Training Hyperparameters
learning_rate: 5e-05
train_batch_size: 32
eval_batch_size: 32
optimizer: adamw_torch_fused
lr_scheduler_type: cosine_with_restarts
warmup_ratio: 0.1
num_epochs: 3
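The hyperparameters above map fairly directly onto the keyword arguments of `transformers.TrainingArguments`. The sketch below shows one plausible mapping; it is not the authors' training script. It assumes a single device (so that `train_batch_size` equals `per_device_train_batch_size`), and the `output_dir` value and the helper name `build_training_args` are hypothetical. Argument names can differ between transformers versions.

```python
# Hyperparameters from the model card, keyed by the TrainingArguments
# argument names they are assumed to correspond to.
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 32,  # assumes single-device training
    "per_device_eval_batch_size": 32,
    "optim": "adamw_torch_fused",
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.1,
    "num_train_epochs": 3,
}


def build_training_args(output_dir: str = "mbert-prompt-injection-run"):
    """Build a TrainingArguments object from HYPERPARAMS.

    The import is deferred so the dict above can be inspected without
    transformers installed. `output_dir` is a placeholder path.
    """
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```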
Evaluation
Training Loss: 0.0036
Validation Loss: 0.209392
Accuracy: 0.961538
Precision: 0.958362
Recall: 0.957055
F1: 0.957708
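As a quick consistency check on the reported metrics, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the two values above; the function name `f1_from_precision_recall` is only illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the Evaluation section
precision = 0.958362
recall = 0.957055

f1 = f1_from_precision_recall(precision, recall)
print(f"F1 recomputed from precision and recall: {f1:.6f}")
```

The recomputed value should agree with the reported F1 up to rounding in the last digit.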
Testing Data
The datasets used for testing the model were taken from:
- https://huggingface.co/datasets/allenai/wildguardmix
- https://huggingface.co/datasets/xTRam1/safe-guard-prompt-injection