HarmBench-Llama-2-13b-cls
Property | Value |
---|---|
Parameter Count | 13B |
Tensor Type | BF16 |
License | MIT |
Paper | HarmBench Paper |
What is HarmBench-Llama-2-13b-cls?
HarmBench-Llama-2-13b-cls is a specialized classifier model designed for evaluating and detecting harmful behaviors in AI language models. Built on the Llama-2 13B architecture, it serves as the official classifier for the HarmBench framework, achieving remarkable 93.19% average agreement with human judgments on behavior classification tasks.
Implementation Details
The model implements a sophisticated classification system that can evaluate both standard text behaviors and contextual behaviors. It utilizes a specific prompt template structure and provides binary (yes/no) classifications for potentially harmful AI outputs.
- Supports both standard and contextual behavior classification
- Uses BF16 tensor type for efficient computation
- Implements specific prompt templates for different classification scenarios
- Achieves superior performance compared to other benchmarks like AdvBench and GPTFuzz
Core Capabilities
- 94.53% accuracy on standard behavior classification
- 90.5% accuracy on contextual behavior classification
- Comprehensive evaluation of harmful or undesirable AI outputs
- Support for code-related behavior analysis
- Multi-lingual output assessment capability
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its exceptional classification accuracy, surpassing even GPT-4's performance in identifying harmful AI behaviors. It's specifically designed for red teaming scenarios and provides robust classification across both standard and contextual behaviors.
Q: What are the recommended use cases?
The model is ideal for AI safety researchers, developers working on content moderation systems, and teams conducting red teaming exercises on language models. It's particularly useful for identifying and filtering potentially harmful or undesirable AI outputs in production environments.