ahsanayub/malicious-prompts-detection-random-forest

	---
	license: mit
	language:
	- en
	tags:
	- random-forest
	- binary-classification
	- prompt-injection
	- security
	datasets:
	- imoxto/prompt_injection_cleaned_dataset-v2
	- reshabhs/SPML_Chatbot_Prompt_Injection
	- Harelix/Prompt-Injection-Mixed-Techniques-2024
	- JasperLS/prompt-injections
	- fka/awesome-chatgpt-prompts
	- rubend18/ChatGPT-Jailbreak-Prompts
	metrics:
	- recall
	- precision
	- f1
	- auc
	---

	# Model Description

	Our Random Forest models identify malicious prompts from prompt embeddings derived from [OpenAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-openai-embeddings), [OctoAI](https://huggingface.co/datasets/ahsanayub/malicious-prompts-octoai-embeddings), and [MiniLM](https://huggingface.co/datasets/ahsanayub/malicious-prompts-minilm-embeddings). The underlying dataset contains 373,598 benign and malicious prompts, which we split into 80% training and 20% test sets. We use stratified sampling so that the proportion of malicious and benign labels is the same in both splits.
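
As an illustration of the stratified 80/20 split described above, here is a minimal pure-Python sketch. The function name `stratified_split`, the seed, and the toy label counts are assumptions made for this example; the card does not say which tool produced the actual split.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so each class keeps its share in both splits."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)  # shuffle within each class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: 70 benign (0) and 30 malicious (1); the ratio is illustrative only.
labels = [0] * 70 + [1] * 30
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In practice, scikit-learn's `train_test_split(X, y, test_size=0.2, stratify=y)` gives the same guarantee with less code.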

	Each embedding is a fixed-length numerical vector. For example, OpenAI generates an embedding vector of 1,536 floating-point numbers for each prompt. Similarly, the embedding datasets for OctoAI and MiniLM consist of 1,027 and 387 features, respectively.

	## Model Evaluation

	The binary classification performance of the embedding-based Random Forest models is shown below:

	| Embedding | Precision | Recall | F1-score | AUC   |
	|-----------|-----------|--------|----------|-------|
	| OpenAI    | 0.867     | 0.867  | 0.867    | 0.764 |
	| OctoAI    | 0.849     | 0.853  | 0.851    | 0.731 |
	| MiniLM    | 0.849     | 0.853  | 0.851    | 0.730 |
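
For readers who want to compute the same kind of metrics on their own predictions, the sketch below derives precision, recall, and F1 for the malicious class from raw labels. The helper name `binary_metrics` and the toy labels are assumptions for illustration. The card does not state which averaging scheme produced the table, so these numbers are not meant to reproduce it.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy example: 1 = malicious, 0 = benign (made-up labels, not real results).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall, f1 = binary_metrics(y_true, y_pred)
print(precision, recall, f1)
```

scikit-learn's `precision_score`, `recall_score`, `f1_score`, and `roc_auc_score` cover the same ground, including AUC, which needs predicted scores rather than hard labels.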

	## How To Use The Model

	We have shared three versions of the Random Forest model in this repository, one per embedding model: `text-embedding-3-small` from OpenAI, the open-source `gte-large` hosted on OctoAI, and the well-known `all-MiniLM-L6-v2`. You therefore need to convert each prompt to the corresponding embedding before querying the model, which returns `0` for benign and `1` for malicious.
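
The flow above (embed the prompt, then query the classifier) can be sketched as follows. This is an outline under assumptions, not the repository's documented API: it assumes the model file is a pickled scikit-learn classifier, and `load_model`, `classify`, and `StubModel` are hypothetical names introduced here. The stub only stands in for the real model so the example runs without downloading anything.

```python
import pickle

LABELS = {0: "benign", 1: "malicious"}  # label convention stated in this card


def load_model(path):
    """Load a pickled classifier (assumed format). Only unpickle trusted files."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def classify(model, embeddings):
    """Map each embedding row to 'benign' or 'malicious' via model.predict."""
    # scikit-learn estimators expose n_features_in_ after fitting.
    expected = getattr(model, "n_features_in_", None)
    for row in embeddings:
        if expected is not None and len(row) != expected:
            raise ValueError(f"expected {expected} features, got {len(row)}")
    return [LABELS[int(p)] for p in model.predict(embeddings)]


class StubModel:
    """Hypothetical stand-in so the sketch runs without the real model file."""

    n_features_in_ = 4

    def predict(self, rows):
        return [1 if sum(row) > 0 else 0 for row in rows]


demo = classify(StubModel(), [[0.1, 0.2, 0.3, 0.4], [-1.0, -1.0, -1.0, -1.0]])
print(demo)
```

With the real model, replace `StubModel()` with `load_model(...)` pointed at the downloaded file, and build `embeddings` with the matching encoder, for example `SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)` from the `sentence-transformers` library. The length check matters because the feature count a model expects may differ from the raw encoder output.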

	## Citing This Work

	Our implementation, along with the curated datasets used for evaluation, is available on [GitHub](https://github.com/AhsanAyub/malicious-prompt-detection). Additionally, if you use our implementation for scientific research, you are highly encouraged to cite [our paper](https://arxiv.org/abs/2410.22284).

	```
	@inproceedings{ayub2024embedding,
	title={Embedding-based classifiers can detect prompt injection attacks},
	author={Ayub, Md Ahsan and Majumdar, Subhabrata},
	booktitle={CAMLIS},
	year={2024}
	}
	```