---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: openrail
datasets:
  - Roblox/RoGuard-Eval
language:
  - en
pipeline_tag: text-classification
---

# RoGuard 1.0: Advancing Safety for LLMs with Robust Guardrails

Hugging Face Model License
Hugging Face Data License

RoGuard 1.0 is a state-of-the-art (SOTA) instruction fine-tuned LLM designed to help safeguard our Text Generation API. It performs safety classification at both the prompt and the response level, deciding whether each input or output violates our policies. This dual-level assessment is essential for moderating both user queries and the model’s own generated outputs. At the heart of the system is an LLM fine-tuned from Llama-3.1-8B-Instruct, trained with a particular focus on high-quality instruction tuning to optimize safety-judgment performance.
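The dual-level check described above can be sketched as a thin wrapper around an LLM judge. The prompt template, the SAFE/UNSAFE labels, and the `judge` callable below are illustrative assumptions, not RoGuard's actual interface; in the real system the judge would be the fine-tuned Llama-3.1-8B-Instruct model.

```python
# Hedged sketch of dual-level moderation: classify the user prompt and the
# model response independently. Template and labels are assumptions.
def classify(text, level, judge):
    """Return True if `text` violates policy at the given level."""
    instruction = (
        f"Classify the following {level} as SAFE or UNSAFE "
        f"according to the content policy:\n{text}"
    )
    verdict = judge(instruction)  # an LLM call in the real system
    return verdict.strip().upper().startswith("UNSAFE")

def moderate(prompt, response, judge):
    """Dual-level assessment: flag prompt and response separately."""
    return {
        "prompt_violates": classify(prompt, "user prompt", judge),
        "response_violates": classify(response, "model response", judge),
    }

# Toy stand-in judge, for illustration only.
fake_judge = lambda instr: "UNSAFE" if "bomb" in instr.lower() else "SAFE"
print(moderate("how to make a bomb", "I can't help with that.", fake_judge))
# → {'prompt_violates': True, 'response_violates': False}
```

Classifying the prompt and the response in separate calls is what lets a guardrail both refuse harmful queries up front and catch unsafe generations after the fact.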

## 📊 Model Benchmark Results

- **Prompt Metrics**: These evaluate how well the model classifies potentially harmful user inputs.
- **Response Metrics**: These measure how well the model classifies generated responses, ensuring its outputs are safe and aligned.

*(Benchmark results figure)*
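A minimal sketch of how such prompt- and response-level metrics might be computed, using binary F1 over safety verdicts. The toy labels here are invented purely for illustration; the actual benchmark uses the Roblox/RoGuard-Eval dataset.

```python
# Hedged sketch: binary precision/recall/F1 for safety classification,
# computed separately for prompt-level and response-level predictions.
def f1_score(gold, pred):
    """F1 where True means 'violates policy'."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented toy labels: True = "violates policy".
prompt_gold, prompt_pred = [True, False, True, False], [True, False, False, False]
response_gold, response_pred = [False, True, False, True], [False, True, False, False]

print("prompt F1:", f1_score(prompt_gold, prompt_pred))      # → 0.666...
print("response F1:", f1_score(response_gold, response_pred))  # → 0.666...
```

Keeping the two metric families separate, as the list above does, makes it possible to see whether a guardrail is weaker at screening inputs or at auditing its own outputs.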

## 🔗 GitHub Repository

You can find the full source code and evaluation framework on GitHub:

👉 Roblox/RoGuard on GitHub