---
base_model: openai/gpt-oss-20b
datasets: AIGym/free-gpt-oss
library_name: transformers
model_name: oss-multi-lingual
tags:
- generated_from_trainer
- sft
- trl
licence: license
---

# Model Card: AIGym/oss-adapter


## Model Overview

- **Base model:** openai/gpt-oss-20b, fine-tuned with supervised fine-tuning (SFT) on the AIGym/free-gpt-oss dataset ([Hugging Face][1]).
- **Motivation:** Created to participate in the OpenAI GPT-OSS-20B Red-Teaming Challenge on Kaggle, which tasked participants with probing and uncovering previously undetected harmful behaviors and vulnerabilities in the open-weight GPT-OSS-20B model ([Kaggle][2]).

## Intended Use & Scope

- **Applications:** Designed primarily for red-teaming and safety-evaluation tasks, leveraging its fine-tuning to explore and detect model vulnerabilities. It can also serve as a foundation for research into, or development of, safer LLM applications.
- **Limitations:** Not recommended for deployment in unmoderated settings or as a general-purpose chatbot. Outputs may include unsafe or adversarial behaviors due to its focus on red-teaming scenarios.

## Training Details

- **Fine-tuning method:** Supervised fine-tuning (SFT) using the TRL library ([Hugging Face][1]).
- **Tooling and versions:**
  - TRL: 0.21.0
  - Transformers: 4.55.2
  - PyTorch: 2.8.0.dev20250319+cu128
  - Datasets: 4.0.0
  - Tokenizers: 0.21.4 ([Hugging Face][1])
- **Dataset:** AIGym/free-gpt-oss, which presumably includes examples crafted to expose harmful behaviors in the base GPT-OSS-20B model (specific content should be described here if available).
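The exact schema of AIGym/free-gpt-oss is not documented here; conversational SFT datasets consumed by TRL's `SFTTrainer` commonly use a `messages` field of role/content turns. A minimal sketch of what one training record might look like, with a small validity check (the field names and example content are assumptions, not the dataset's confirmed schema):

```python
# Hypothetical record shape for a conversational SFT dataset. The actual
# AIGym/free-gpt-oss schema may differ; "messages" with role/content pairs
# is the chat format TRL's SFTTrainer commonly accepts.
example = {
    "messages": [
        {"role": "user", "content": "Describe a prompt that probes unsafe behavior."},
        {"role": "assistant", "content": "Here is a categorized analysis of the probe..."},
    ]
}

def is_valid_chat_record(record):
    """Check that the record has well-formed role/content turns."""
    msgs = record.get("messages", [])
    if not msgs:
        return False
    return all(
        isinstance(m, dict)
        and m.get("role") in {"system", "user", "assistant"}
        and isinstance(m.get("content"), str)
        for m in msgs
    )

print(is_valid_chat_record(example))  # True for the sketch above
```

A check like this is useful before training, since malformed turns cause chat-template errors only deep inside the training loop.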

## Evaluation & Behavior

- **Challenge context:** The Kaggle Red-Teaming Challenge emphasized discovering hidden vulnerabilities in GPT-OSS-20B through adversarial prompting and probing ([Kaggle][2]).
- **Performance:** (Include any metrics, success rates, or qualitative findings if you evaluated the model's adversarial robustness compared to the base model.)
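No metrics are reported above. As a hedged illustration of what such an evaluation could measure, the sketch below tallies how often a model's responses look like refusals across a set of adversarial probes; the `generate` stub, probe strings, and refusal markers are all placeholders, not part of this repository:

```python
# Sketch of a refusal-rate tally for red-teaming probes. The generate()
# stub stands in for a real text-generation pipeline call.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "unable to help")

def generate(prompt: str) -> str:
    # Placeholder: a real harness would call the model here.
    return "I cannot help with that request."

def refusal_rate(probes):
    """Fraction of probes answered with a refusal-style response."""
    refusals = sum(
        any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in probes
    )
    return refusals / len(probes)

probes = ["probe-1", "probe-2", "probe-3"]  # placeholder adversarial prompts
print(refusal_rate(probes))  # 1.0 with the stub above
```

Comparing this rate between the base model and the adapter would give a first, if coarse, measure of how fine-tuning shifted refusal behavior.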

## Example Usage

```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="AIGym/oss-multi-lingual",  # or "AIGym/oss-adapter", depending on naming
    device="cuda",
)
question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
output = generator(
    [{"role": "user", "content": question}],
    max_new_tokens=128,
    return_full_text=False,
)[0]
print(output["generated_text"])
```

This snippet demonstrates how to query the model in an interactive pipeline, useful for both red-teaming experiments and exploratory analysis ([Hugging Face][1]).

## Caveats & Ethical Considerations

- **Potential risks:** The model is intentionally fine-tuned to surface vulnerabilities; it may generate harmful or unsafe content more readily than standard models.
- **Recommended usage environment:** Restricted to controlled research and evaluation settings with proper moderation and oversight. Not intended for downstream production use without robust safety measures.
- **Transparency & reproducibility:** Users are encouraged to report findings responsibly and contribute to community understanding of safe LLM deployment.

## Summary Table

| Section | Highlights |
| --- | --- |
| Overview | Fine-tuned GPT-OSS-20B adapter for red-teaming, using the AIGym dataset |
| Motivation | Built for the Kaggle Red-Teaming Challenge targeting safety analysis |
| Tools & Versions | TRL 0.21.0, Transformers 4.55.2, PyTorch dev build, Datasets 4.0.0, etc. |
| Usage Example | Provided pipeline snippet for quick start |
| Caveats | Generates potentially harmful outputs; meant only for controlled eval |
| Citation | TRL GitHub repository |