---
base_model: openai/gpt-oss-20b
datasets: AIGym/free-gpt-oss
library_name: transformers
model_name: oss-multi-lingual
tags:
- generated_from_trainer
- sft
- trl
licence: license
---

## Model Card: `AIGym/oss-adapter`

### Model Overview

* **Base model**: Fine-tuned from `openai/gpt-oss-20b` using supervised fine-tuning (SFT) on the `AIGym/free-gpt-oss` dataset ([Hugging Face][1]); a loading sketch follows this list.
* **Motivation**: Created for the OpenAI GPT-OSS-20B Red-Teaming Challenge on Kaggle, which tasked participants with probing the open-weight GPT-OSS-20B model to uncover previously undetected harmful behaviors and vulnerabilities ([Kaggle][2]).
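
Because the card title refers to `AIGym/oss-adapter`, the repository presumably ships a PEFT-style adapter rather than full model weights. A minimal loading sketch under that assumption (the adapter format and repo id are unverified):

```python
# Hedged sketch: assumes the repo hosts a PEFT (e.g. LoRA) adapter for
# openai/gpt-oss-20b; adjust the repo id if the weights are packaged differently.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")

# Apply the fine-tuned adapter weights on top of the frozen base model.
model = PeftModel.from_pretrained(base, "AIGym/oss-adapter")
```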

### Intended Use & Scope

* **Applications**: Designed primarily for red-teaming and safety-evaluation tasks, leveraging its fine-tuning to explore and detect model vulnerabilities. It can also serve as a foundation for research on, or development of, safer LLM applications.
* **Limitations**: Not recommended for deployment in unmoderated settings or as a general-purpose chatbot; outputs may include unsafe or adversarial behavior due to the red-teaming focus.

### Training Details

* **Fine-tuning method**: Supervised fine-tuning (SFT) using the TRL library ([Hugging Face][1]); a configuration sketch follows this list.
* **Tooling and versions**:
  * TRL: 0.21.0
  * Transformers: 4.55.2
  * PyTorch: 2.8.0.dev20250319+cu128
  * Datasets: 4.0.0
  * Tokenizers: 0.21.4 ([Hugging Face][1])
* **Dataset**: `AIGym/free-gpt-oss`, which presumably includes examples crafted to expose harmful behaviors in the base GPT-OSS-20B model (describe the specific contents here if available).
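
For orientation, here is a hedged sketch of what the TRL SFT run may have looked like with these versions; every hyperparameter shown is an illustrative placeholder, not a recorded training value:

```python
# Illustrative SFT setup with TRL ~0.21; hyperparameters are placeholders,
# not the values actually used to train this model.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("AIGym/free-gpt-oss", split="train")

config = SFTConfig(
    output_dir="oss-multi-lingual",
    per_device_train_batch_size=1,   # placeholder
    gradient_accumulation_steps=8,   # placeholder
    learning_rate=2e-5,              # placeholder
    num_train_epochs=1,              # placeholder
)

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",  # TRL accepts a model id string or a loaded model
    args=config,
    train_dataset=dataset,
)
trainer.train()
```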

### Evaluation & Behavior

* **Challenge context**: The Kaggle Red-Teaming Challenge emphasized discovering hidden vulnerabilities in GPT-OSS-20B through adversarial prompting and probing ([Kaggle][2]).
* **Performance**: (Include metrics, success rates, or qualitative findings here if you evaluated the model's adversarial robustness against the base model; a probing-loop sketch follows this list.)
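
Since no results are reported, a minimal side-by-side probing loop is sketched below; the probe prompt and repo ids are illustrative assumptions, not part of the original card:

```python
# Hedged sketch: compare base and fine-tuned responses on the same probes.
from transformers import pipeline

probes = [
    "Describe a weakness in your own safety training.",  # placeholder probe
]

for model_id in ("openai/gpt-oss-20b", "AIGym/oss-multi-lingual"):
    generator = pipeline("text-generation", model=model_id, device_map="auto")
    for prompt in probes:
        out = generator(
            [{"role": "user", "content": prompt}],
            max_new_tokens=128,
            return_full_text=False,
        )[0]["generated_text"]
        print(f"[{model_id}] {prompt!r} -> {out[:200]}")
```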

### Example Usage

```python
from transformers import pipeline

# Load the fine-tuned model into a chat-capable text-generation pipeline.
generator = pipeline(
    "text-generation",
    model="AIGym/oss-multi-lingual",  # or "AIGym/oss-adapter", depending on naming
    device="cuda",
)

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# Pass the question in chat format and print only the newly generated text.
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])
```

This snippet demonstrates how to query the model in an interactive pipeline, useful for both red-teaming experiments and exploratory analysis ([Hugging Face][1]).

### Caveats & Ethical Considerations

* **Potential risks**: The model is intentionally fine-tuned to surface vulnerabilities, so it may generate harmful or unsafe content more readily than standard models.
* **Recommended usage environment**: Restricted to controlled research and evaluation settings with proper moderation and oversight; not intended for downstream production use without robust safety measures.
* **Transparency & reproducibility**: Users are encouraged to report findings responsibly and to contribute to community understanding of safe LLM deployment.

### Summary Table

| Section              | Highlights                                                                |
| -------------------- | ------------------------------------------------------------------------- |
| **Overview**         | Fine-tuned GPT-OSS-20B adapter for red-teaming, using the AIGym dataset   |
| **Motivation**       | Built for the Kaggle Red-Teaming Challenge targeting safety analysis      |
| **Tools & Versions** | TRL 0.21.0, Transformers 4.55.2, PyTorch dev build, Datasets 4.0.0, etc.  |
| **Usage Example**    | Provided pipeline snippet for quick start                                 |
| **Caveats**          | Generates potentially harmful outputs; meant only for controlled eval     |
| **Citation**         | TRL GitHub repository                                                     |
|