---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
---

# Model Card for RoGuard

RoGuard is a lightweight, modular evaluation framework for assessing the safety of fine-tuned language models. It provides structured evaluation using configurable prompts and labeled datasets, and outputs comprehensive metrics.

# 📊 Model Benchmark Results

- **Prompt Metrics**: These evaluate how well the model classifies potentially harmful **user inputs**.
- **Response Metrics**: These measure how well the model handles model-generated **responses**, ensuring outputs are safe and aligned.

| Model / Metric            | Prompt  |       |       |        |        | Response |           |        |        |
|---------------------------|--------:|------:|------:|-------:|-------:|---------:|----------:|-------:|-------:|
|                           | ToxicC. | OAI   | Aegis | XSTest | WildP. | BeaverT. | SaferRLHF | WildR. | HarmB. |
| LlamaGuard2-8B            | 42.7    | 77.6  | 73.8  | 88.6   | 70.9   | 71.8     | 51.6      | 65.2   | 78.5   |
| LlamaGuard3-8B            | 50.9    | 79.4  | 74.8  | 88.3   | 70.1   | 69.7     | 53.7      | 70.2   | 84.9   |
| MD-Judge-7B               | -       | -     | -     | -      | -      | 86.7     | 64.8      | 76.8   | 81.2   |
| WildGuard-7B              | 70.8    | 72.1  | 89.4  | 94.4   | 88.9   | 84.4     | 64.2      | 75.4   | 86.2   |
| ShieldGemma-7B            | 70.2    | 82.1  | 88.7  | 92.5   | 88.1   | 84.8     | 66.6      | 77.8   | 84.8   |
| GPT-4o                    | 68.1    | 70.4  | 83.2  | 90.2   | 87.9   | 83.8     | 67.9      | 73.1   | 83.5   |
| BingoGuard-phi3-3B        | 72.5    | 72.8  | 90.0  | 90.8   | 88.9   | 86.2     | 69.9      | 79.7   | 85.1   |
| BingoGuard-llama3.1-8B    | 75.7    | 77.9  | 90.4  | 94.9   | 88.9   | 86.4     | 68.7      | 80.1   | 86.4   |
| 🛡️ RoGuard                | 75.8    | 70.5  | 91.1  | 90.2   | 88.7   | 87.5     | 69.7      | 80.0   | 80.7   |

## 🔗 GitHub Repository

You can find the full source code and evaluation framework on GitHub:
👉 [Roblox/RoGuard on GitHub](https://github.com/Roblox/RoGuard)
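The benchmark numbers above are scores on binary safety-classification datasets. As a rough illustration of how such a score is produced (this sketch assumes the reported metric is the F1 of the "unsafe" class, which is common for guard-model benchmarks but not stated here; the gold labels and model verdicts below are toy data, not taken from any of these datasets):

```python
def f1_score(gold, pred, positive="unsafe"):
    """F1 of the positive ("unsafe") class, as a percentage,
    computed from parallel lists of gold labels and predictions."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)

# Toy example: gold labels vs. a guard model's verdicts on four prompts.
gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "safe", "safe", "safe"]
print(round(f1_score(gold, pred), 1))  # → 66.7 (precision 1.0, recall 0.5)
```

In a real run, `pred` would come from prompting the guard model on each benchmark example and parsing its safe/unsafe verdict; scores are then reported per dataset, as in the table above.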