---
base_model: openai/gpt-oss-20b
datasets: AIGym/free-gpt-oss
library_name: transformers
model_name: oss-multi-lingual
tags:
- generated_from_trainer
- sft
- trl
licence: license
---

## Model Card: `AIGym/oss-adapter`

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f2b7bcbe95ed4c9a9e7669/xEeUCVtMcl8svYB5-aYeO.png)

### Model Overview

* **Base model**: Fine-tuned from `openai/gpt-oss-20b` using supervised fine-tuning (SFT) on the `AIGym/free-gpt-oss` dataset ([Hugging Face][1]).
* **Motivation**: Created to participate in the OpenAI GPT-OSS-20B Red-Teaming Challenge on Kaggle, which tasked participants with probing and uncovering previously undetected harmful behaviors and vulnerabilities in the open-weight GPT-OSS-20B model ([Kaggle][2]).

### Intended Use & Scope

* **Applications**: Designed primarily for red-teaming and safety evaluation, where its fine-tuning is used to explore and detect model vulnerabilities. It can also serve as a starting point for research into, or development of, safer LLM applications.
* **Limitations**: Not recommended for deployment in unmoderated settings or as a general-purpose chatbot. Outputs may include unsafe or adversarial behaviors due to its focus on red-teaming scenarios.

### Training Details

* **Fine-tuning method**: Supervised fine-tuning (SFT) using the TRL library ([Hugging Face][1]).
* **Tooling and versions**:

  * TRL: 0.21.0
  * Transformers: 4.55.2
  * PyTorch: 2.8.0.dev20250319+cu128
  * Datasets: 4.0.0
  * Tokenizers: 0.21.4 ([Hugging Face][1]).
* **Dataset**: `AIGym/free-gpt-oss`. This card does not describe the dataset's contents; it presumably includes examples crafted to expose harmful behaviors in the base GPT-OSS-20B model.
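
To reproduce the environment, it can help to compare locally installed packages against the versions listed above. The following is a minimal sketch, not part of the original training setup: the helper names (`installed_version`, `compare_versions`) are hypothetical, and the distribution names are assumed to be the usual PyPI names (`torch` for PyTorch).

```python
from importlib import metadata

# Versions as listed in this card. The keys are assumed PyPI distribution names.
EXPECTED = {
    "trl": "0.21.0",
    "transformers": "4.55.2",
    "torch": "2.8.0.dev20250319+cu128",
    "datasets": "4.0.0",
    "tokenizers": "0.21.4",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def compare_versions(expected, lookup=installed_version):
    """Map each package to (expected, found, status).

    status is "match", "differs", or "missing". The comparison is an exact
    string match, so a compatible but different build is reported as "differs".
    """
    report = {}
    for name, wanted in expected.items():
        found = lookup(name)
        if found is None:
            status = "missing"
        elif found == wanted:
            status = "match"
        else:
            status = "differs"
        report[name] = (wanted, found, status)
    return report


if __name__ == "__main__":
    for name, (wanted, found, status) in compare_versions(EXPECTED).items():
        print(f"{name}: expected {wanted}, found {found} ({status})")
```

A "differs" result does not necessarily mean the model will fail to load; it only flags a deviation from the environment recorded in this card.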

### Evaluation & Behavior

* **Challenge context**: The Kaggle Red-Teaming Challenge emphasized discovering hidden vulnerabilities in GPT-OSS-20B by adversarial prompting and probing ([Kaggle][2]).
* **Performance**: No metrics, success rates, or qualitative comparisons of adversarial robustness against the base model are reported here.

### Example Usage

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="AIGym/oss-multi-lingual",  # Or "AIGym/oss-adapter" depending on naming
    device="cuda"
)
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False
)[0]
print(output["generated_text"])
```

This snippet demonstrates how to query the model in an interactive pipeline, useful for both red-teaming experiments and exploratory analysis ([Hugging Face][1]).
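
For experiments that involve more than one prompt, the single call above can be wrapped in a small loop that records each prompt alongside its response. This is a sketch under assumptions: `run_prompts` and `load_generator` are hypothetical helpers, not part of the model repository, and the call convention mirrors the pipeline usage shown above (chat-style messages, `return_full_text=False`).

```python
def load_generator(model_id="AIGym/oss-multi-lingual", device="cuda"):
    """Build the text-generation pipeline used in the example above.

    The import is deferred so that the rest of this module can be used
    (for instance with a stub generator) without transformers installed.
    """
    from transformers import pipeline

    return pipeline("text-generation", model=model_id, device=device)


def run_prompts(generator, prompts, max_new_tokens=128):
    """Send each prompt as a single-turn chat and collect the responses.

    `generator` is any callable with the same call convention as the
    pipeline above: it takes a list of chat messages and returns a list
    whose first element is a dict with a "generated_text" key.
    """
    records = []
    for prompt in prompts:
        messages = [{"role": "user", "content": prompt}]
        result = generator(
            messages,
            max_new_tokens=max_new_tokens,
            return_full_text=False,
        )[0]
        records.append({"prompt": prompt, "response": result["generated_text"]})
    return records


if __name__ == "__main__":
    # Requires a GPU and downloads the model weights.
    generator = load_generator()
    for record in run_prompts(generator, ["Summarize what red-teaming means."]):
        print(record["prompt"], "->", record["response"])
```

Passing the generator in as an argument keeps the loop independent of how the model is loaded, so the same function works with a different model ID or with a stub during testing.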

### Caveats & Ethical Considerations

* **Potential risks**: The model is intentionally fine-tuned to surface vulnerabilities, so it may generate harmful or unsafe content more readily than standard models.
* **Recommended usage environment**: Restricted to controlled research and evaluation settings with proper moderation and oversight. Not intended for downstream production without robust safety measures.
* **Transparency & reproducibility**: Users are encouraged to report findings responsibly and to contribute to the community's understanding of safe LLM deployment.

### Summary Table

| Section              | Highlights                                                              |
| -------------------- | ----------------------------------------------------------------------- |
| **Overview**         | Fine-tuned GPT-OSS-20B adapter for red-teaming, using AIGym dataset     |
| **Motivation**       | Built for the Kaggle Red-Teaming Challenge targeting safety analysis    |
| **Tools & Versions** | TRL 0.21.0, Transformers 4.55.2, PyTorch 2.8.0 dev build, Datasets 4.0.0, Tokenizers 0.21.4 |
| **Usage Example**    | Provided pipeline snippet for quick start                               |
| **Caveats**          | Generates potentially harmful outputs; meant only for controlled eval   |
| **Citation**         | TRL GitHub repository                                                   |