HarmBench-Llama-2-13b-cls

Property	Value
Parameter Count	13B
Tensor Type	BF16
License	MIT
Paper	HarmBench Paper

What is HarmBench-Llama-2-13b-cls?

HarmBench-Llama-2-13b-cls is a specialized classifier model designed for evaluating and detecting harmful behaviors in AI language models. Built on the Llama-2 13B architecture, it serves as the official classifier for the HarmBench framework, achieving remarkable 93.19% average agreement with human judgments on behavior classification tasks.

Implementation Details

The model implements a sophisticated classification system that can evaluate both standard text behaviors and contextual behaviors. It utilizes a specific prompt template structure and provides binary (yes/no) classifications for potentially harmful AI outputs.

Supports both standard and contextual behavior classification
Uses BF16 tensor type for efficient computation
Implements specific prompt templates for different classification scenarios
Achieves superior performance compared to other benchmarks like AdvBench and GPTFuzz

Core Capabilities

94.53% accuracy on standard behavior classification
90.5% accuracy on contextual behavior classification
Comprehensive evaluation of harmful or undesirable AI outputs
Support for code-related behavior analysis
Multi-lingual output assessment capability

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its exceptional classification accuracy, surpassing even GPT-4's performance in identifying harmful AI behaviors. It's specifically designed for red teaming scenarios and provides robust classification across both standard and contextual behaviors.

Q: What are the recommended use cases?

The model is ideal for AI safety researchers, developers working on content moderation systems, and teams conducting red teaming exercises on language models. It's particularly useful for identifying and filtering potentially harmful or undesirable AI outputs in production environments.