---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---
# Model Card: distilroberta-base-rejection-v1
This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.
The goal of this model is to **detect LLM rejections** when a prompt does not pass content moderation. It classifies responses into two categories:
- `0`: Normal output
- `1`: Rejection detected
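As a rough illustration of how the two classes might be consumed downstream, the sketch below turns pipeline-style predictions into a boolean flag. The label strings (`REJECTION`, `NORMAL`, `LABEL_1`), the `is_rejection` helper, and the `0.5` threshold are assumptions made for this example; check `model.config.id2label` for the strings the model actually emits.

```python
from typing import Dict

# Assumed names for class 1; verify against model.config.id2label.
REJECTION_LABELS = {"REJECTION", "LABEL_1"}


def is_rejection(prediction: Dict[str, object], threshold: float = 0.5) -> bool:
    """Return True when a prediction is class 1 with a score at or above the threshold."""
    label = str(prediction["label"]).upper()
    score = float(prediction["score"])
    return label in REJECTION_LABELS and score >= threshold


# Hand-written predictions in the shape a text-classification pipeline returns.
examples = [
    {"label": "REJECTION", "score": 0.99},
    {"label": "NORMAL", "score": 0.98},
    {"label": "REJECTION", "score": 0.40},  # below the threshold, so not flagged
]

flags = [is_rejection(p) for p in examples]
print(flags)
```

Raising the threshold trades recall for precision, so it is worth tuning on your own data rather than relying on the default used here.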
On the evaluation set, the model achieves:
- **Loss:** 0.0544
- **Accuracy:** 0.9887
- **Recall:** 0.9810
- **Precision:** 0.9279
- **F1 Score:** 0.9537
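As a quick consistency check, the reported F1 score can be recomputed from the precision and recall above using the standard harmonic-mean formula. This is only a sanity check on the published numbers, not a re-evaluation of the model.

```python
# Values copied from the evaluation results above.
precision = 0.9279
recall = 0.9810

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.9537
```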
---
## Model Details
- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Text classification (Rejection detection)
---
## Intended Use & Limitations
The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.
**Limitations:**
- Performance depends on the quality and domain of the training data.
- May underperform on text styles or topics underrepresented in training.
- Because it is based on `distilroberta-base`, the model is **case-sensitive**: the same text in different casing may be scored differently.
---
## Usage
### With Hugging Face Transformers
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build the classification pipeline; inputs longer than 512 tokens are truncated.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Classify a single LLM response; the result is a list with one label/score dict.
print(classifier("Sorry, but I can't assist with that."))