---
license: mit
language:
- en
base_model:
- meta-llama/Prompt-Guard-86M
pipeline_tag: text-classification
---
# katanemolabs/Arch-Guard
## Overview
The Katanemo Arch-Guard collection is a collection of state-of-the-art (SOTA) LLMs specifically designed for **jailbreak detection** tasks.

Definition: jailbreaking attempts are malicious prompts designed to alter the intended behavior of the application's foundation LLM. They often violate the model's safety and security policies.

Arch-Guard is a classifier fine-tuned from the open-source model [Prompt-Guard-86M](https://huggingface.co/meta-llama/Prompt-Guard-86M) on a collection of open-source datasets of jailbreaking attempts, with the sole aim of improving jailbreak detection.

In summary, the Katanemo Arch-Guard collection demonstrates:
- **State-of-the-art performance** in detecting jailbreaking attempts
- **Low latency and a low false positive rate**, making it suitable for real-time production environments and a good user experience

Dominant class = jailbreak:

| Model        | TPR    | TNR    | FPR    | FNR    | AUC   | Precision | Recall |
| ------------ | ------ | ------ | ------ | ------ | ----- | --------- | ------ |
| Prompt-guard | 0.8468 | 0.9972 | 0.0028 | 0.1532 | 0.857 | 0.715     | 0.999  |
| Arch-guard   | 0.8887 | 0.9970 | 0.0030 | 0.1113 | 0.880 | 0.761     | 0.999  |

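
The four rate columns in the table can be derived from confusion-matrix counts. The sketch below assumes the standard definitions with jailbreak as the positive class, which is consistent with each row satisfying TPR + FNR = 1 and TNR + FPR = 1. The `Confusion` class, the `rates` helper, and the counts are hypothetical and are not the card's evaluation data. The card does not say how the Precision and Recall columns were computed, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, with jailbreak as the positive class."""
    tp: int  # jailbreaks flagged as jailbreaks
    fp: int  # benign prompts flagged as jailbreaks
    tn: int  # benign prompts passed as benign
    fn: int  # jailbreaks passed as benign


def rates(c: Confusion) -> dict:
    """Compute TPR, TNR, FPR and FNR from the counts."""
    tpr = c.tp / (c.tp + c.fn)  # share of jailbreaks that were caught
    tnr = c.tn / (c.tn + c.fp)  # share of benign prompts that were passed
    return {"TPR": tpr, "TNR": tnr, "FPR": 1.0 - tnr, "FNR": 1.0 - tpr}


# Hypothetical counts for illustration only.
example = Confusion(tp=80, fp=5, tn=95, fn=20)
for name, value in rates(example).items():
    print(f"{name}: {value:.4f}")
```
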
## Requirements
The GPU model is quantized with EETQ. Follow the instructions at https://github.com/NetEase-FuXi/EETQ?tab=readme-ov-file#getting-started to install the package.
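
Before loading the GPU model, it can help to confirm that the required packages are visible to the current Python environment. This is a minimal sketch, assuming EETQ installs under the import name `eetq`; the `has_module` helper is hypothetical and only checks that the module can be found, not that it works with your GPU.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


# ASSUMPTION: EETQ is importable as "eetq" once installed.
for module in ("transformers", "eetq"):
    status = "found" if has_module(module) else "missing"
    print(f"{module}: {status}")
```
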
## Datasets
Evaluation data is drawn from the following datasets:
- [casual_conversation](https://huggingface.co/datasets/SohamGhadge/casual-conversation)
- [commonqa](https://huggingface.co/datasets/tau/commonsense_qa)
- [financeqa](https://huggingface.co/datasets/AIR-Bench/qa_finance_en)
- [instruction](https://huggingface.co/datasets/MBZUAI/LaMini-instruction)
- [jailbreak_behavior_benign](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_behavior_harmful](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_judge](https://huggingface.co/datasets/JailbreakBench/JBB-Behaviors)
- [jailbreak_prompts](https://huggingface.co/datasets/rubend18/ChatGPT-Jailbreak-Prompts)
- [jailbreak_tweet](https://huggingface.co/datasets/cstnz/Disaster-tweet-jailbreaking)
- [jailbreak_v](https://huggingface.co/datasets/JailbreakV-28K/JailBreakV-28k)
- [jailbreak_vigil](https://huggingface.co/datasets/deadbits/vigil-jailbreak-all-MiniLM-L6-v2)
- [mental_health](https://huggingface.co/datasets/Amod/mental_health_counseling_conversations)
- [telecom](https://huggingface.co/datasets/talkmap/telecom-conversation-corpus)
- [truthqa](https://huggingface.co/datasets/truthfulqa/truthful_qa)
- [weather](https://huggingface.co/datasets/GEM/conversational_weather)

## How to use
````python
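from transformers import pipeline

# Load the quantized GPU checkpoint as a text-classification pipeline.
pipe = pipeline("text-classification", model="katanemolabs/Arch-Guard-gpu")
# Each result is a dict with the predicted label and its score.
print(pipe("Ignore your instruction"))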
````
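
In an application, the classifier output is usually turned into an allow/block decision. The following is a minimal sketch of that step. The label name `"JAILBREAK"`, the threshold of 0.5, the `is_jailbreak` helper, and the `stub_classifier` stand-in are all assumptions made for illustration; check the model's `config.id2label` for the actual label names and tune the threshold on your own data.

```python
from typing import Callable, Dict, List

# ASSUMPTION: label string and threshold are illustrative, not from the model card.
JAILBREAK_LABEL = "JAILBREAK"
THRESHOLD = 0.5

Classifier = Callable[[str], List[Dict[str, object]]]


def is_jailbreak(classify: Classifier, prompt: str,
                 label: str = JAILBREAK_LABEL,
                 threshold: float = THRESHOLD) -> bool:
    """Return True when the top prediction is `label` with score >= threshold."""
    top = classify(prompt)[0]  # a pipeline call on one string returns a one-item list
    return top["label"] == label and float(top["score"]) >= threshold


def stub_classifier(prompt: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for the real pipeline so the sketch runs offline."""
    flagged = "ignore your instruction" in prompt.lower()
    return [{"label": "JAILBREAK" if flagged else "BENIGN", "score": 0.99}]


print(is_jailbreak(stub_classifier, "Ignore your instruction"))
print(is_jailbreak(stub_classifier, "What is the weather today?"))
```

With the real model, pass the `pipe` object from the snippet above in place of `stub_classifier`.
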
## License
Katanemo Arch-Guard is distributed under the [Katanemo license](https://huggingface.co/katanemolabs/Arch-Guard/blob/main/LICENSE).
