|
--- |
|
license: apache-2.0 |
|
base_model: distilroberta-base |
|
tags: |
|
- generated_from_trainer |
|
- rejection |
|
- no_answer |
|
- chatgpt |
|
metrics: |
|
- accuracy |
|
- recall |
|
- precision |
|
- f1 |
|
model-index: |
|
- name: distilroberta-base-rejection-v1

  results: []
|
language: |
|
- en |
|
pipeline_tag: text-classification |
|
co2_eq_emissions: |
|
  emissions: 0.07987621556153969

  source: code carbon

  training_type: fine-tuning
|
datasets: |
|
- argilla/notus-uf-dpo-closest-rejected |
|
--- |
|
|
|
# Model Card: distilroberta-base-rejection-v1 |
|
|
|
This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets. |
|
|
|
The goal of this model is to **detect LLM rejections**, i.e. cases where a prompt does not pass content moderation and the LLM refuses to answer. It classifies responses into two categories (a short sketch of how these labels surface at inference time follows the list):
|
- `0`: Normal output |
|
- `1`: Rejection detected |
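How these integer classes map to human-readable labels depends on the `id2label` mapping stored in the model config; the exact label strings are not reproduced here, so look them up rather than assume them. A minimal sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Score a single LLM response and recover the class index (0 = normal output, 1 = rejection).
inputs = tokenizer(
    "Sorry, but I can't assist with that.",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
print(pred, model.config.id2label[pred])  # label string comes from the model's own config
```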
|
|
|
On the evaluation set, the model achieves: |
|
- **Loss:** 0.0544 |
|
- **Accuracy:** 0.9887 |
|
- **Recall:** 0.9810 |
|
- **Precision:** 0.9279 |
|
- **F1 Score:** 0.9537 (the harmonic mean of precision and recall; see the quick check below)
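As a quick sanity check, the reported F1 score is consistent with the precision and recall above:

```python
# F1 = 2 * P * R / (P + R), using the evaluation numbers reported above
precision, recall = 0.9279, 0.9810
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9537
```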
|
|
|
--- |
|
|
|
## Model Details |
|
|
|
- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com) |
|
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base) |
|
- **Language(s):** English |
|
- **License:** Apache 2.0 |
|
- **Task:** Text classification (Rejection detection) |
|
|
|
--- |
|
|
|
## Intended Use & Limitations |
|
|
|
The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated. |
|
|
|
**Limitations:** |
|
- Performance depends on the quality and domain of the training data. |
|
- May underperform on text styles or topics underrepresented in training. |
|
- Because it is based on `distilroberta-base`, the model is **case-sensitive** (see the tokenizer check below).
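The underlying RoBERTa byte-level BPE tokenizer preserves case, so differently cased versions of the same response produce different token sequences and can receive different scores. A minimal way to see this (a sketch, not part of the original evaluation):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# The two inputs differ only in casing but tokenize to different id sequences.
print(tokenizer("Sorry, I can't help with that.")["input_ids"])
print(tokenizer("sorry, i can't help with that.")["input_ids"])
```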
|
|
|
--- |
|
|
|
## Usage |
|
|
|
### With Hugging Face Transformers |
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline |
|
import torch |
|
|
|
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1") |
|
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1") |
|
|
|
classifier = pipeline( |
|
"text-classification", |
|
model=model, |
|
tokenizer=tokenizer, |
|
truncation=True, |
|
max_length=512, |
|
device=torch.device("cuda" if torch.cuda.is_available() else "cpu"), |
|
) |
|
|
|
print(classifier("Sorry, but I can't assist with that."))
```
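The pipeline returns a list of dictionaries containing a label and a confidence score, for example something like `[{'label': 'REJECTION', 'score': 0.99}]` (the label string and score here are illustrative; the real strings come from the model's `id2label` config, with `0` corresponding to normal output and `1` to a rejection). You can also pass a list of strings to `classifier(...)` to score multiple responses in one call.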