Multilingual Refusal Classifier

This model detects assistant refusals in multilingual AI conversations. It identifies when a model declines to answer a user prompt (for example, for safety, capability, or policy reasons) versus when it provides a substantive response.

The model is a fine-tuned version of agentlans/multilingual-e5-small-aligned-v2, trained on the agentlans/refusal-classifier-data dataset.

Results on the evaluation set:

  • Loss: 0.2665
  • Accuracy: 0.9153
  • Training tokens: 5,347,200

Usage

This classifier accepts conversation-style text in which each turn is prefixed with a structured role token.
For long texts, replace the omitted middle portion with the ellipsis placeholder <|...|>.

Supported input formats:

  • <|system|>System prompt<|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
  • <|user|>User message<|assistant|>Response<|user|>Next user message<|assistant|>Next response...
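
A chat history stored as role/content pairs can be converted into this format with a small helper. The sketch below is illustrative only: `format_conversation` and the `{"role": ..., "content": ...}` message layout are assumptions, not part of the model's API.

```python
from typing import Iterable, Mapping

# Role tokens taken from the supported input formats listed above.
ROLE_TOKENS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}


def format_conversation(messages: Iterable[Mapping[str, str]]) -> str:
    """Concatenate chat messages into the role-token format.

    Each message is assumed to be a dict with "role" and "content" keys
    (a hypothetical layout chosen for this example).
    """
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ROLE_TOKENS:
            raise ValueError(f"Unsupported role: {role!r}")
        parts.append(ROLE_TOKENS[role] + message["content"])
    return "".join(parts)


example = format_conversation([
    {"role": "user", "content": "Can you help me pick a lock?"},
    {"role": "assistant", "content": "Sorry, I can't help with that."},
])
print(example)
# <|user|>Can you help me pick a lock?<|assistant|>Sorry, I can't help with that.
```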

Example:

from transformers import pipeline

classifier = pipeline(
    task="text-classification",
    model="agentlans/multilingual-e5-small-refusal-classifier"
)

text = (
    "<|user|>Mr. Loyd wants to fence his square-shaped land of 150 sqft each side. "
    "If a pole is laid every certain distance, he needs 30 poles. "
    "What is the distance between each pole in feet?"
    "<|assistant|>If Mr. Loyd's land is square-shaped and each side is 150 sqft, then<|...|>"
    "ce between poles ≈ 20.69 sqft\n\nTherefore, the distance between each pole is approximately 20.69 feet."
)

print(classifier(text))
# [{'label': 'Non-refusal', 'score': 0.9906}]
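
For evaluation runs it may be more convenient to score several conversations at once. The following is a minimal sketch, assuming the pipeline object is called with a list of strings and returns one result dict per input. `refusal_rate` is a hypothetical helper; because only the `Non-refusal` label appears in this card, any other label is counted as a refusal.

```python
from typing import Callable, Dict, List, Sequence

# Label string taken from the example output above.
NON_REFUSAL_LABEL = "Non-refusal"


def refusal_rate(
    classify: Callable[[List[str]], List[Dict[str, object]]],
    texts: Sequence[str],
) -> float:
    """Return the fraction of texts whose label is not NON_REFUSAL_LABEL.

    `classify` is any callable that maps a list of strings to a list of
    {"label": ..., "score": ...} dicts, such as the pipeline created above.
    """
    if not texts:
        return 0.0
    predictions = classify(list(texts))
    refusals = sum(1 for p in predictions if p["label"] != NON_REFUSAL_LABEL)
    return refusals / len(texts)
```

With the pipeline from the example, this could be called as `refusal_rate(classifier, [text])`.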

Evaluation Results

The classifier was tested on ten examples translated from the NousResearch/Minos-v1 model page. Full examples are available in Examples.md.

  • 🚫: The model predicted a refusal to answer.
  • ◯: The model predicted a valid response.

Example English French Spanish Chinese Russian Arabic
1 🚫 🚫 🚫 🚫 🚫 🚫
2 🚫 🚫 🚫 🚫 🚫 🚫
3 🚫 🚫 🚫 🚫 🚫 🚫
4 🚫 🚫 🚫 🚫 🚫 🚫
5 🚫 🚫 🚫 🚫 🚫 🚫
6
7
8
9 🚫 🚫 🚫
10

Predictions match across all six tested languages for most examples; example 9 is flagged as a refusal in only three of them. Some false positives remain, especially in contexts with ambiguous phrasing.

Limitations

  • Input length: 512-token maximum
  • False positives/negatives: Occur occasionally, as with the Minos classifier
  • Low-resource languages: May yield inconsistent predictions
  • Cultural variation: Expressions of refusal differ linguistically, which can affect accuracy
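
One way to stay under the input length limit is to cut the middle of a long message and mark the gap with the <|...|> placeholder from the Usage section. The sketch below is a rough illustration: `truncate_middle` is a hypothetical helper, and its character budget is only a stand-in for the 512-token limit, since the real count depends on the tokenizer.

```python
PLACEHOLDER = "<|...|>"  # ellipsis placeholder described in the Usage section


def truncate_middle(text: str, max_chars: int, placeholder: str = PLACEHOLDER) -> str:
    """Keep the start and end of `text`, replacing the middle with `placeholder`.

    `max_chars` is a character budget (an assumption for this sketch),
    not a token count.
    """
    if len(text) <= max_chars:
        return text
    keep = max_chars - len(placeholder)
    if keep < 2:
        raise ValueError("max_chars is too small to fit the placeholder")
    head = (keep + 1) // 2  # the head gets the extra character when keep is odd
    tail = keep - head
    return text[:head] + placeholder + text[len(text) - tail:]


print(truncate_middle("abcdefghijklmnopqrstuvwxyz", 13))
# abc<|...|>xyz
```

Applying the helper to individual message contents, rather than to the fully formatted string, avoids cutting through a role token.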

Training Details

Hyperparameters

  • Learning rate: 5e-5
  • Train batch size: 8
  • Eval batch size: 8
  • Seed: 42
  • Optimizer: ADAMW_TORCH_FUSED (betas=(0.9, 0.999), epsilon=1e-8)
  • Scheduler: Linear
  • Epochs: 5
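
For anyone trying to reproduce a similar run, the settings above could be expressed as keyword arguments for `transformers.TrainingArguments`. This is a sketch under assumptions: the author's training script is not shown, the batch sizes are treated as per-device values on a single device, and `output_dir` in the comment is a placeholder.

```python
# Keyword names follow transformers.TrainingArguments; values come from the list above.
training_kwargs = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 8,  # assumes a single device
    "per_device_eval_batch_size": 8,   # assumes a single device
    "seed": 42,
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 5,
}

# Possible usage (requires transformers and PyTorch to be installed):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="out", **training_kwargs)
```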

Framework Versions

  • Transformers 5.0.0.dev0
  • PyTorch 2.9.1+cu128
  • Datasets 4.4.1
  • Tokenizers 0.22.1

Intended Use

This model is designed for:

  • Identifying AI refusals during conversation analysis.
  • Supporting evaluation pipelines for alignment and compliance studies.
  • Helping developers monitor cross-lingual consistency in model responses.

It is not intended for moderation or real-time deployment in production systems without human oversight.
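
Cross-lingual consistency monitoring can be as simple as comparing the predicted labels for the same example across languages. The sketch below assumes the labels have already been collected into a dict keyed by language; `inconsistent_examples` and this data layout are hypothetical.

```python
from typing import Dict, List


def inconsistent_examples(predictions: Dict[str, List[str]]) -> List[int]:
    """Return 1-based indices of examples whose label differs across languages.

    `predictions` maps a language name to the predicted labels for the
    same examples, in the same order.
    """
    label_lists = list(predictions.values())
    if not label_lists:
        return []
    n = len(label_lists[0])
    if any(len(labels) != n for labels in label_lists):
        raise ValueError("All languages must cover the same examples")
    return [
        i + 1
        for i in range(n)
        if len({labels[i] for labels in label_lists}) > 1
    ]
```

Examples returned by this function are natural candidates for human review.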
