HarmBench Classifiers
Classifiers for red teaming evaluation in HarmBench

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal • Paper • 2402.04249 • Published Feb 6, 2024
cais/HarmBench-Llama-2-13b-cls • Text Generation • 13B • Updated Mar 17, 2024
cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors • Text Generation • 13B • Updated Apr 11, 2024
cais/HarmBench-Mistral-7b-val-cls • Text Generation • 7B • Updated Mar 17, 2024

WMDP Benchmark

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning • Paper • 2403.03218 • Published Mar 5, 2024
cais/wmdp • Dataset • Updated Apr 27, 2024
cais/wmdp-bio-forget-corpus • Dataset • Updated May 29
cais/wmdp-cyber-forget-corpus • Dataset • Updated May 29