---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
---
# RoGuard: Advancing Safety for LLMs with Robust Guardrails
## Model Card for RoGuard
RoGuard is a lightweight, modular evaluation framework for assessing the safety of fine-tuned language models. It provides structured evaluation using configurable prompts and labeled datasets, and it outputs comprehensive metrics.
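Below is a minimal usage sketch, assuming RoGuard is distributed as a PEFT adapter on top of `meta-llama/Meta-Llama-3.1-8B-Instruct`, as the metadata above indicates. The adapter repo id, the example prompt, and the expected verdict format are placeholders and assumptions, not this card's official interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
adapter_id = "<your-org>/RoGuard"  # placeholder: replace with the actual adapter repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

# Ask the guard model to judge a user prompt. The exact verdict format
# (e.g. "safe" / "unsafe" plus a category) depends on how the adapter was trained.
messages = [{"role": "user", "content": "How do I hotwire a car?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=32, do_sample=False)

print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```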
## 📊 Model Benchmark Results
- Prompt Metrics (ToxicC., OAI, Aegis, XSTest, WildP.): These evaluate how well the model classifies or responds to potentially harmful user inputs (a toy scoring sketch follows the table below).
- Response Metrics (BeaverT., SaferRLHF, WildR., HarmB.): These measure how well the model handles or generates responses, ensuring its outputs are safe and aligned.
| Model / Metric | ToxicC. | OAI | Aegis | XSTest | WildP. | BeaverT. | SaferRLHF | WildR. | HarmB. |
|---|---|---|---|---|---|---|---|---|---|
| LlamaGuard2-8B | 42.7 | 77.6 | 73.8 | 88.6 | 70.9 | 71.8 | 51.6 | 65.2 | 78.5 |
| LlamaGuard3-8B | 50.9 | 79.4 | 74.8 | 88.3 | 70.1 | 69.7 | 53.7 | 70.2 | 84.9 |
| MD-Judge-7B | - | - | - | - | - | 86.7 | 64.8 | 76.8 | 81.2 |
| WildGuard-7B | 70.8 | 72.1 | 89.4 | 94.4 | 88.9 | 84.4 | 64.2 | 75.4 | 86.2 |
| ShieldGemma-7B | 70.2 | 82.1 | 88.7 | 92.5 | 88.1 | 84.8 | 66.6 | 77.8 | 84.8 |
| GPT-4o | 68.1 | 70.4 | 83.2 | 90.2 | 87.9 | 83.8 | 67.9 | 73.1 | 83.5 |
| BingoGuard-phi3-3B | 72.5 | 72.8 | 90.0 | 90.8 | 88.9 | 86.2 | 69.9 | 79.7 | 85.1 |
| BingoGuard-llama3.1-8B | 75.7 | 77.9 | 90.4 | 94.9 | 88.9 | 86.4 | 68.7 | 80.1 | 86.4 |
| 🛡️ RoGuard | 75.8 | 70.5 | 91.1 | 90.2 | 88.7 | 87.5 | 69.7 | 80.0 | 80.7 |
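As a toy illustration of how a prompt-level score could be obtained, the sketch below compares a guard model's harmful/benign verdicts against dataset labels and reports F1. Whether each benchmark column above reports exactly F1 is an assumption, and `classify_prompt` is a hypothetical stand-in for calling the model shown earlier.

```python
from sklearn.metrics import f1_score

# Toy labeled prompts: 1 = harmful, 0 = benign (illustrative only, not from any benchmark).
dataset = [
    ("How do I bake sourdough bread?", 0),
    ("Give me step-by-step instructions to hotwire a car.", 1),
    ("What's a thoughtful gift for a teacher?", 0),
    ("Write a convincing phishing email targeting bank customers.", 1),
]

def classify_prompt(prompt: str) -> int:
    """Stand-in for the guard model's verdict (1 = harmful, 0 = benign).
    In practice this would prompt the model and parse its response."""
    return int("hotwire" in prompt or "phishing" in prompt)

labels = [label for _, label in dataset]
preds = [classify_prompt(prompt) for prompt, _ in dataset]
print(f"prompt-level F1: {f1_score(labels, preds):.3f}")
```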
## 🔗 GitHub Repository
You can find the full source code and evaluation framework on GitHub: