|
--- |
|
license: apache-2.0 |
|
base_model: distilroberta-base |
|
tags: |
|
- generated_from_trainer |
|
- rejection |
|
- no_answer |
|
- chatgpt |
|
metrics: |
|
- accuracy |
|
- recall |
|
- precision |
|
- f1 |
|
model-index: |
|
- name: distilroberta-base-rejection-v1

  results: []
|
language: |
|
- en |
|
pipeline_tag: text-classification |
|
co2_eq_emissions: |
|
  emissions: 0.07987621556153969

  source: code carbon

  training_type: fine-tuning
|
datasets: |
|
- argilla/notus-uf-dpo-closest-rejected |
|
--- |
|
|
|
# Model Card: distilroberta-base-rejection-v1 |
|
|
|
This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets. |
|
|
|
The goal of this model is to **detect LLM rejections**, i.e. cases where a prompt does not pass content moderation and the LLM refuses to answer. It classifies responses into two categories (a short sketch of how these labels surface at inference time follows the list):
|
- `0`: Normal output |
|
- `1`: Rejection detected |
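How these integer classes map to human-readable labels depends on the `id2label` mapping stored in the model config; the exact label strings are not reproduced here, so look them up rather than assume them. A minimal sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Score a single LLM response and recover the class index (0 = normal output, 1 = rejection).
inputs = tokenizer(
    "Sorry, but I can't assist with that.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(pred, model.config.id2label[pred])  # label string comes from the model's own config
```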
|
|
|
On the evaluation set, the model achieves: |
|
- **Loss:** 0.0544 |
|
- **Accuracy:** 0.9887 |
|
- **Recall:** 0.9810 |
|
- **Precision:** 0.9279 |
|
- **F1 Score:** 0.9537 (the harmonic mean of precision and recall; see the quick check below)
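As a quick sanity check, the reported F1 score is consistent with the precision and recall above:

```python
# F1 = 2 * P * R / (P + R), using the evaluation numbers reported above
precision, recall = 0.9279, 0.9810
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9537
```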
|
|
|
--- |
|
|
|
## Model Details |
|
|
|
- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com) |
|
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base) |
|
- **Language(s):** English |
|
- **License:** Apache 2.0 |
|
- **Task:** Text classification (Rejection detection) |
|
|
|
--- |
|
|
|
## Intended Use & Limitations |
|
|
|
The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated. |
|
|
|
**Limitations:** |
|
- Performance depends on the quality and domain of the training data. |
|
- May underperform on text styles or topics underrepresented in training. |
|
- Because it is based on `distilroberta-base`, the model is **case-sensitive** (see the tokenizer check below).
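The underlying RoBERTa byte-level BPE tokenizer preserves case, so differently cased versions of the same response produce different token sequences and can receive different scores. A minimal way to see this (a sketch, not part of the original evaluation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# The two inputs differ only in casing but tokenize to different id sequences.
print(tokenizer("Sorry, I can't help with that.")["input_ids"])
print(tokenizer("sorry, i can't help with that.")["input_ids"])
```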
|
|
|
--- |
|
|
|
## Usage |
|
|
|
### With Hugging Face Transformers |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1") |
|
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1") |
|
|
|
classifier = pipeline( |
|
"text-classification", |
|
model=model, |
|
tokenizer=tokenizer, |
|
truncation=True, |
|
max_length=512, |
|
device=torch.device("cuda" if torch.cuda.is_available() else "cpu"), |
|
) |
|
|
|
print(classifier("Sorry, but I can't assist with that."))
```
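The pipeline returns a list of dictionaries containing a label and a confidence score, for example something like `[{'label': 'REJECTION', 'score': 0.99}]` (the label string and score here are illustrative; the real strings come from the model's `id2label` config, with `0` corresponding to normal output and `1` to a rejection). You can also pass a list of strings to `classifier(...)` to score multiple responses in one call.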