---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
emissions: 0.07987621556153969
source: code carbon
training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---
# Model Card: distilroberta-base-rejection-v1
This model was originally developed by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs as well as normal outputs from RLHF datasets.
The goal of this model is to **detect LLM rejections** produced when a prompt does not pass content moderation. It classifies responses into two categories (the corresponding label names can be inspected as sketched after this list):
- `0`: Normal output
- `1`: Rejection detected
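The numeric class ids above map to human-readable names stored in the checkpoint's configuration. A minimal way to inspect that mapping (the exact label strings are whatever the uploaded config contains):

```python
from transformers import AutoConfig

# Fetch only the configuration to see how the class ids map to label names.
config = AutoConfig.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Index 0 corresponds to a normal output and index 1 to a detected rejection;
# the printed strings are whatever id2label holds in this checkpoint.
print(config.id2label)
print(config.label2id)
```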
On the evaluation set, the model achieves:
- **Loss:** 0.0544
- **Accuracy:** 0.9887
- **Recall:** 0.9810
- **Precision:** 0.9279
- **F1 Score:** 0.9537
---
## Model Details
- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Text classification (Rejection detection)
---
## Intended Use & Limitations
The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.
**Limitations:**
- Performance depends on the quality and domain of the training data.
- May underperform on text styles or topics underrepresented in training.
- Being based on `distilroberta-base`, it is **case-sensitive** (see the tokenizer sketch below).
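To make the case-sensitivity point concrete, the byte-level BPE tokenizer distinguishes cased and lowercased text, so the same sentence can tokenize (and score) differently depending on casing; a minimal sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# The RoBERTa byte-level BPE vocabulary keeps case, so these two inputs
# produce different token sequences and may receive different scores.
print(tokenizer.tokenize("Sorry, but I can't assist with that."))
print(tokenizer.tokenize("sorry, but i can't assist with that."))
```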
---
## Usage
### With Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and fine-tuned classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

print(classifier("Sorry, but I can't assist with that."))
```
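If you prefer raw scores to the `pipeline` helper, a plain forward pass works as well. The softmax/argmax post-processing below is a common convention rather than something specified by this card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model.eval()

texts = [
    "Sorry, but I can't assist with that.",
    "Here is a simple pasta recipe you can try tonight.",
]

# Tokenize with the same 512-token truncation used in the pipeline example.
inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.softmax(logits, dim=-1)
preds = probs.argmax(dim=-1)

for text, pred, prob in zip(texts, preds, probs):
    # model.config.id2label maps class indices back to the names stored in the checkpoint.
    print(f"{model.config.id2label[pred.item()]} ({prob[pred].item():.3f}): {text}")
```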