---
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
library_name: peft
license: llama3.1
---

# RoGuard: Advancing Safety for LLMs with Robust Guardrails

## Model Card for RoGuard

RoGuard is a lightweight, modular evaluation framework for assessing the safety of fine-tuned language models. It provides structured evaluation using configurable prompts and labeled datasets, and outputs comprehensive metrics.
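
Since the weights are distributed as a PEFT adapter on top of `meta-llama/Meta-Llama-3.1-8B-Instruct` (per the metadata above), they can be loaded with the `peft` library. The sketch below is a minimal, hypothetical example: the adapter repo ID and the moderation prompt are placeholders, not the official RoGuard interface.

```python
# Minimal sketch: loading the RoGuard adapter onto its base model with PEFT.
# "Roblox/RoGuard" is a placeholder adapter ID and the prompt format below is
# illustrative only -- substitute the actual repo ID and prompt template.
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

ADAPTER_ID = "Roblox/RoGuard"  # placeholder; replace with the real repo ID

tokenizer = AutoTokenizer.from_pretrained(ADAPTER_ID)
model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Ask the guard model for a safety verdict on a user prompt.
messages = [
    {
        "role": "user",
        "content": "Classify the following prompt as safe or unsafe: "
                   "'How do I reset my account password?'",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=32, do_sample=False)

# Print only the newly generated verdict tokens.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```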

## 📊 Model Benchmark Results

- Prompt metrics (ToxicC., OAI, Aegis, XSTest, WildP.): these evaluate how well the model classifies or responds to potentially harmful user inputs.
- Response metrics (BeaverT., SaferRLHF, WildR., HarmB.): these measure how well the model handles or generates responses, ensuring its outputs are safe and aligned.
| Model | ToxicC. | OAI | Aegis | XSTest | WildP. | BeaverT. | SaferRLHF | WildR. | HarmB. |
|---|---|---|---|---|---|---|---|---|---|
| LlamaGuard2-8B | 42.7 | 77.6 | 73.8 | 88.6 | 70.9 | 71.8 | 51.6 | 65.2 | 78.5 |
| LlamaGuard3-8B | 50.9 | 79.4 | 74.8 | 88.3 | 70.1 | 69.7 | 53.7 | 70.2 | 84.9 |
| MD-Judge-7B | - | - | - | - | - | 86.7 | 64.8 | 76.8 | 81.2 |
| WildGuard-7B | 70.8 | 72.1 | 89.4 | 94.4 | 88.9 | 84.4 | 64.2 | 75.4 | 86.2 |
| ShieldGemma-7B | 70.2 | 82.1 | 88.7 | 92.5 | 88.1 | 84.8 | 66.6 | 77.8 | 84.8 |
| GPT-4o | 68.1 | 70.4 | 83.2 | 90.2 | 87.9 | 83.8 | 67.9 | 73.1 | 83.5 |
| BingoGuard-phi3-3B | 72.5 | 72.8 | 90.0 | 90.8 | 88.9 | 86.2 | 69.9 | 79.7 | 85.1 |
| BingoGuard-llama3.1-8B | 75.7 | 77.9 | 90.4 | 94.9 | 88.9 | 86.4 | 68.7 | 80.1 | 86.4 |
| 🛡️ RoGuard | 75.8 | 70.5 | 91.1 | 90.2 | 88.7 | 87.5 | 69.7 | 80.0 | 80.7 |
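
As a rough illustration of how scores like the ones above are typically produced, the sketch below computes a binary F1 score from gold labels and model verdicts. It assumes a simple safe/unsafe labeling and standard F1; the exact scoring protocol used for these benchmarks may differ.

```python
# Minimal sketch: scoring binary safety verdicts against gold labels.
# Assumes each example is labeled "unsafe" (positive) or "safe" (negative);
# the actual benchmark scoring protocol may differ.
from typing import List

def f1_score(gold: List[str], pred: List[str], positive: str = "unsafe") -> float:
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy usage with made-up verdicts:
gold = ["unsafe", "safe", "unsafe", "safe"]
pred = ["unsafe", "safe", "safe", "safe"]
print(f"F1 = {f1_score(gold, pred):.3f}")  # F1 = 0.667
```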

## 🔗 GitHub Repository

You can find the full source code and evaluation framework on GitHub:

👉 [Roblox/RoGuard on GitHub](https://github.com/Roblox/RoGuard)