---
base_model: openai/gpt-oss-20b
datasets: AIGym/free-gpt-oss
library_name: transformers
model_name: oss-multi-lingual
tags:
- generated_from_trainer
- sft
- trl
licence: license
---
Model Card: AIGym/oss-adapter
Model Overview
- Base model: Fine-tuned from openai/gpt-oss-20b using supervised fine-tuning (SFT) on the AIGym/free-gpt-oss dataset ([Hugging Face][1]).
- Motivation: Created to participate in the OpenAI GPT-OSS-20B Red-Teaming Challenge on Kaggle, which tasked participants with probing for and uncovering previously undetected harmful behaviors and vulnerabilities in the open-weight GPT-OSS-20B model ([Kaggle][2]).
Intended Use & Scope
- Applications: Designed primarily for red-teaming and safety-evaluation tasks, where its fine-tuning is used to probe for and detect model vulnerabilities. It can also serve as a starting point for research on, or development of, safer LLM applications.
- Limitations: Not recommended for deployment in unmoderated settings or as a general-purpose chatbot. Outputs may include unsafe or adversarial behaviors due to its focus on red-teaming scenarios.
Training Details
Fine-tuning method: Supervised fine-tuning (SFT) using the TRL library ([Hugging Face][1]).
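As a rough illustration of what supervised fine-tuning optimizes: the model is trained with next-token cross-entropy on the fine-tuning examples, optionally ignoring prompt tokens so that only response tokens contribute to the loss. Whether this particular run masked prompt tokens depends on the TRL configuration and dataset format, which this card does not state. The following is a minimal pure-Python sketch of that masked objective; the function name `masked_nll` and its inputs are hypothetical, and this is not TRL's implementation.

```python
import math
from typing import Sequence


def masked_nll(target_probs: Sequence[float], loss_mask: Sequence[int]) -> float:
    """Mean negative log-likelihood over positions where loss_mask is 1.

    Hypothetical helper for illustration only (not part of TRL).
    target_probs[t] is the probability the model assigned to the correct
    next token at position t; loss_mask marks positions that count toward
    the loss (for example 0 for prompt tokens, 1 for response tokens).
    """
    if len(target_probs) != len(loss_mask):
        raise ValueError("target_probs and loss_mask must have the same length")
    total = 0.0
    count = 0
    for p, m in zip(target_probs, loss_mask):
        if m:
            # Cross-entropy for one token is -log p(correct token).
            total -= math.log(p)
            count += 1
    if count == 0:
        raise ValueError("loss_mask selects no tokens")
    return total / count


# One prompt token is masked out; two response tokens contribute.
print(masked_nll([0.5, 0.25, 1.0], [0, 1, 1]))  # ≈ 0.6931 (ln 2)
```

Averaging only over unmasked positions means a long prompt does not dilute the loss on the response.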
Tooling and versions:
- TRL: 0.21.0
- Transformers: 4.55.2
- PyTorch: 2.8.0.dev20250319+cu128
- Datasets: 4.0.0
- Tokenizers: 0.21.4 ([Hugging Face][1]).
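To reproduce results, it may help to confirm that a local environment matches the versions listed above. The sketch below assumes the PyPI distribution names `trl`, `transformers`, `torch`, `datasets`, and `tokenizers`, and uses only the standard library's `importlib.metadata`. The helper `compare_versions` is hypothetical and does exact string matching, so it flags any differing build, including a different PyTorch nightly.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed under "Tooling and versions" in this card.
# Keys are assumed PyPI distribution names ("torch" is PyTorch).
EXPECTED = {
    "trl": "0.21.0",
    "transformers": "4.55.2",
    "torch": "2.8.0.dev20250319+cu128",
    "datasets": "4.0.0",
    "tokenizers": "0.21.4",
}


def compare_versions(expected, installed_lookup=version):
    """Return {package: (expected, installed_or_None)} for every mismatch.

    Hypothetical helper: installed_lookup defaults to
    importlib.metadata.version and may be swapped out for testing.
    """
    mismatches = {}
    for package, wanted in expected.items():
        try:
            found = installed_lookup(package)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    diffs = compare_versions(EXPECTED)
    if not diffs:
        print("Environment matches the versions listed in this card.")
    for package, (wanted, found) in sorted(diffs.items()):
        print(f"{package}: card lists {wanted}, installed {found or 'not installed'}")
```

Minor version differences are not necessarily a problem; the report just makes them visible before comparing results.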
Dataset: AIGym/free-gpt-oss, which presumably includes examples crafted to expose harmful behaviors in the base GPT-OSS-20B model (specific content should be described here if available).
Evaluation & Behavior
- Challenge context: The Kaggle Red-Teaming Challenge emphasized discovering hidden vulnerabilities in GPT-OSS-20B by adversarial prompting and probing ([Kaggle][2]).
- Performance: (Include any metrics, success rates, or qualitative findings if you evaluated the model’s adversarial robustness compared to the base model.)
Example Usage
```python
from transformers import pipeline

# Load the fine-tuned model into a text-generation pipeline on the GPU.
generator = pipeline(
    "text-generation",
    model="AIGym/oss-multi-lingual",  # Or "AIGym/oss-adapter" depending on naming
    device="cuda",
)

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"

# Send the prompt as a chat message; return_full_text=False returns only the new text.
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])
```
This snippet sends a single chat-formatted prompt through the Transformers text-generation pipeline and prints only the model's reply. The same pattern works for both red-teaming experiments and exploratory analysis ([Hugging Face][1]).
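For red-teaming runs it is useful to send several probes with identical generation settings and keep a record of each prompt and response. The sketch below assumes `generate` is any callable with the same calling convention as the `generator` pipeline in the example (chat messages in, a list of dicts with a `generated_text` key out). `run_probes` and `to_jsonl` are hypothetical helpers, and JSON Lines is simply one convenient format for sharing findings.

```python
import json
from typing import Callable, Iterable


def run_probes(generate: Callable, prompts: Iterable[str], max_new_tokens: int = 128) -> list:
    """Send each prompt as a single-turn chat; record prompt, response, and settings."""
    records = []
    for prompt in prompts:
        # Same call shape as the pipeline example: a one-message chat,
        # asking for only the newly generated text.
        result = generate(
            [{"role": "user", "content": prompt}],
            max_new_tokens=max_new_tokens,
            return_full_text=False,
        )[0]
        records.append(
            {
                "prompt": prompt,
                "response": result["generated_text"],
                "max_new_tokens": max_new_tokens,
            }
        )
    return records


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines (one JSON object per line)."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)
```

With the pipeline from the example, `print(to_jsonl(run_probes(generator, [question])))` would print one JSON object per probe. Passing the generator in as an argument keeps the logging logic independent of how the model is loaded.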
Caveats & Ethical Considerations
- Potential risks: The model is intentionally fine-tuned to surface vulnerabilities, so it may generate harmful or unsafe content more readily than standard models.
- Recommended usage environment: Restricted to controlled research and evaluation settings with proper moderation and oversight. Not intended for downstream production without robust safety measures.
- Transparency & reproducibility: Users are encouraged to report findings responsibly and to contribute to community understanding of safe LLM deployment.
Summary Table
Section | Highlights |
---|---|
Overview | Fine-tuned GPT-OSS-20B adapter for red-teaming, using AIGym dataset |
Motivation | Built for the Kaggle Red-Teaming Challenge targeting safety analysis |
Tools & Versions | TRL 0.21.0, Transformers 4.55.2, PyTorch 2.8.0 dev build, Datasets 4.0.0, Tokenizers 0.21.4 |
Usage Example | Provided pipeline snippet for quick start |
Caveats | Generates potentially harmful outputs; meant only for controlled eval |
Citation | TRL GitHub repository |