File size: 2,731 Bytes
b3aaf96
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c376ce7
 
b3aaf96
c376ce7
b3aaf96
c376ce7
b3aaf96
c376ce7
 
 
 
 
 
 
 
 
b3aaf96
c376ce7
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
---

license: mit
language: 
  - en
tags:
- random-forest
- binary-classification
- prompt-injection
- security
datasets:
- imoxto/prompt_injection_cleaned_dataset-v2
- reshabhs/SPML_Chatbot_Prompt_Injection
- Harelix/Prompt-Injection-Mixed-Techniques-2024
- JasperLS/prompt-injections
- fka/awesome-chatgpt-prompts
- rubend18/ChatGPT-Jailbreak-Prompts
metrics:
- recall
- precision
- f1
- auc
---


# Model Description

The purpose of our trained Random Forest models is to identify malicious prompts given the prompt embeddings derived from [OpenAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-openai-embeddings), [OctoAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-octoai-embeddings), and [MiniLM](https://huggingface.co/datasets/ahsanayub/malicious-prompts-minilm-embeddings). The models are trained with 373,598 benign and malicious prompts. We split this dataset into 80% training and 20% test sets. To ensure equal proportion of the malicious and benign labels across splits, we use stratified sampling.

Embeddings consist of fixed-length numerical representations. For example, OpenAI generates an embedding vector consisting of 1,536 floating-point numbers for each prompt. Similarly, the embedding datasets for OctoAI and MiniLM consist of 1,027 and 387 features, respectively.

## Model Evaluation

The binary classification performance of embedding-based random forest models is shared below:

| Embedding | Precision | Recall | F1-score | AUC   | 
|-----------|-----------|--------|----------|-------|
| OpenAI    | 0.867     | 0.867  | 0.867    | 0.764 |
| OctoAI    | 0.849     | 0.853  | 0.851    | 0.731 |
| MiniLM    | 0.849     | 0.853  | 0.851    | 0.730 |

## How To Use The Model

We have shared three versions of random forest models in this repository. We used the following embedding models: `text-embedding-3-small` from OpenAI, and the open-source models `gte-large` hosted on OctoAI, as well as the well-known `all-MiniLM-L6-v2`. Therefore, you need to covert the prompts to its respective embeddings before querying the model to obtain its  prediction: `0` for benign and `1` for malicous.

## Citing This Work
Our implementation, along with the curated datasets used for evaluation, is available on [GitHub](https://github.com/AhsanAyub/malicious-prompt-detection). Additionaly, if you use our implementation for scientific research, you are highly encouraged to cite [our paper](https://arxiv.org/abs/2410.22284).

```

@article{ayub2024embedding,

  title={Embedding-based classifiers can detect prompt injection attacks},

  author={Ayub, Md Ahsan and Majumdar, Subhabrata},

  booktitle={CAMLIS},

  year={2024}

}

```