---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card: distilroberta-base-rejection-v1  

This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.  

The goal of this model is to **detect LLM rejections** when a prompt does not pass content moderation. It classifies responses into two categories:  
- `0`: Normal output  
- `1`: Rejection detected  
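
As a rough illustration of how the two classes map onto raw classifier outputs, the sketch below turns a pair of logits into a class id and a probability. It is a minimal example under stated assumptions, not the model's own post-processing: the `ID2NAME` display names, the `decode` helper, the 0.5 threshold, and the sample logits are all hypothetical; only the ids `0` and `1` come from this card.

```python
import math

# Class ids as listed in this card; the display names are illustrative only
# and may differ from the label strings stored in the model config.
ID2NAME = {0: "normal", 1: "rejection"}


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, threshold=0.5):
    """Return (class id, display name, probability of the chosen class).

    `threshold` is a hypothetical knob: class 1 is chosen only when its
    probability reaches the threshold.
    """
    probs = softmax(logits)
    label_id = 1 if probs[1] >= threshold else 0
    return label_id, ID2NAME[label_id], probs[label_id]


# Made-up logits in which the second entry (class 1) dominates.
label_id, name, prob = decode([-2.0, 3.0])
print(label_id, name, round(prob, 4))
```

Raising `threshold` trades recall for precision: fewer responses are flagged as rejections, but those that are flagged are more likely to be real ones.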

On the evaluation set, the model achieves:  
- **Loss:** 0.0544  
- **Accuracy:** 0.9887  
- **Recall:** 0.9810  
- **Precision:** 0.9279  
- **F1 Score:** 0.9537  
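
As a quick consistency check, the reported F1 score can be recomputed from the precision and recall listed above, since F1 is their harmonic mean. The helper name `f1_score` below is just an illustrative choice.

```python
# Precision and recall as reported on the evaluation set in this card.
precision = 0.9279
recall = 0.9810


def f1_score(p, r):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


f1 = f1_score(precision, recall)
print(f"{f1:.4f}")  # 0.9537
```

The recomputed value agrees with the reported F1. Because precision is lower than recall, the model's errors lean toward false positives: roughly 7% of the responses it flags as rejections on the evaluation set are normal outputs.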

---

## Model Details  

- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)  
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)  
- **Language(s):** English  
- **License:** Apache 2.0  
- **Task:** Text classification (Rejection detection)  

---

## Intended Use & Limitations  

The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.  

**Limitations:**  
- Performance depends on the quality and domain of the training data.  
- May underperform on text styles or topics underrepresented in training.  
- Being based on `distilroberta-base`, it is **case-sensitive**.  

---

## Usage  

### With Hugging Face Transformers  

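```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; long inputs are truncated to at most 512 tokens.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Run on the GPU when one is available, otherwise fall back to the CPU.
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# The pipeline returns a list with one dict per input, each holding a `label` and a `score`.
print(classifier("Sorry, but I can't assist with that."))
```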