vijilpd's picture
Update README.md
8df72c8 verified
|
raw
history blame
2.31 kB
metadata
license: apache-2.0

Model Card for Vijil Prompt Injection

Model Details

Model Description

This model is a fine-tuned version of ModernBert to classify prompt-injection prompts which can manipulate language models into producing unintended outputs.

Uses

Prompt injection attacks manipulate language models by inserting or altering prompts to trigger harmful or unintended responses. The vijil/mbert-prompt-injection model is designed to enhance security in language model applications by detecting prompt-injection attacks.

How to Get Started with the Model

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("vijil/mbert-prompt-injection")
model = AutoModelForSequenceClassification.from_pretrained("vijil/mbert-prompt-injection")

classifier = pipeline(
  "text-classification",
  model=model,
  tokenizer=tokenizer,
  truncation=True,
  max_length=512,
  device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("this is a prompt-injection prompt"))

Training Details

Training Data

The dataset used for training the model was taken from

wildguardmix/train and safe-guard-prompt-injection/train

Training Procedure

Supervised finetuning with above dataset

Training Hyperparameters

  • learning_rate: 5e-05

  • train_batch_size: 32

  • eval_batch_size: 32

  • optimizer: adamw_torch_fused

  • lr_scheduler_type: cosine_with_restarts

  • warmup_ratio: 0.1

  • num_epochs: 3

Evaluation

  • Training Loss: 0.0036

  • Validation Loss: 0.209392

  • Accuracy: 0.961538

  • Precision: 0.958362

  • Recall: 0.957055

  • Fl: 0.957708

Testing Data

The dataset used for training the model was taken from

wildguardmix/test and safe-guard-prompt-injection/test

Results

Model Card Contact

https://vijil.ai