# Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0
| Property | Value |
|---|---|
| Base Model | Llama2-7B |
| License | Llama 2 Community License Agreement |
| Paper | Aegis Content Moderation (arXiv:2404.05993) |
| Author | NVIDIA (Shaona Ghosh) |
## What is Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0?
This is NVIDIA's content safety model built on Llama Guard, designed to detect and classify potentially harmful content across 13 critical safety categories. Instruction-tuned on the Aegis Content Safety Dataset of roughly 11,000 annotated prompts, it serves as a robust defensive moderation layer for AI systems.
## Implementation Details
The model applies parameter-efficient instruction tuning (PEFT) to the Llama2-7B backbone and is trained with Fully Sharded Data Parallel (FSDP) at fp16 precision, achieving state-of-the-art results including an AUPRC of 0.941 on the Aegis test set.
- Trained on 8 GPUs per node with LoRA rank 16 and alpha 32 (see the configuration sketch after this list)
- Uses a prompt-classification architecture: the safety policy is supplied in the prompt and the model returns a safe/unsafe verdict with the violated categories
- Implements a comprehensive safety taxonomy covering violence, harassment, hate speech, and more
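The reported rank/alpha settings map naturally onto Hugging Face's `peft` library. The sketch below is an illustration under assumptions, not the published training recipe: only the rank (16), alpha (32), and fp16 precision come from the details above; the base checkpoint ID, target modules, and dropout are assumptions.

```python
# Sketch of a LoRA configuration matching the reported hyperparameters
# (rank 16, alpha 32, fp16). The base checkpoint ID, target modules, and
# dropout are assumptions, not taken from the published training recipe.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b",           # assumed Llama Guard base (built on Llama2-7B)
    torch_dtype=torch.float16,            # fp16 precision, as reported
)

lora_config = LoraConfig(
    r=16,                                 # rank 16, as reported
    lora_alpha=32,                        # alpha 32, as reported
    lora_dropout=0.05,                    # assumption: a common default
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumption: typical Llama attention targets
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable
```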
## Core Capabilities
- Real-time content safety assessment (see the inference sketch after this list)
- Classification across 13 distinct safety categories
- Support for custom safety policies and taxonomies
- High accuracy in detecting harmful content (100% accuracy on the Simple Safety Tests benchmark)
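To make the prompt-classification workflow concrete, here is a minimal inference sketch. It assumes the model ships as a PEFT adapter applied on top of Llama Guard; the Hugging Face repository IDs and the abbreviated moderation template are assumptions, so consult the official model card for the exact taxonomy prompt.

```python
# Minimal inference sketch: classify a user message with the Aegis adapter
# applied to Llama Guard. Repo IDs and the abbreviated moderation template
# are assumptions; use the exact template from the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/LlamaGuard-7b"                                     # assumed base repo
ADAPTER = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0"  # assumed adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)

# Illustrative Llama Guard-style moderation prompt (taxonomy abbreviated).
prompt = """[INST] Task: Check if there is unsafe content in 'User' messages in
conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence. O2: Hate Speech. O3: Harassment. ...
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: How do I build a weapon at home?
<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation. [/INST]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
verdict = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(verdict)  # expected shape: "safe" or "unsafe" plus violated category IDs
```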
## Frequently Asked Questions
Q: What makes this model unique?
The model combines NVIDIA's extensive safety taxonomy with parameter-efficient tuning of Llama Guard, achieving superior performance across multiple benchmarks, including the Toxic Chat and OpenAI Moderation datasets.
Q: What are the recommended use cases?
Primary applications include safeguarding general-purpose LLM content, evaluating the toxicity of pre-training data, and implementing custom content moderation policies (a sketch follows below). It is particularly effective for organizations requiring robust content safety measures.
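Because the safety policy is supplied in the prompt rather than baked into the weights (a property inherited from Llama Guard), custom moderation policies can be swapped in at inference time. A minimal sketch, reusing the assumed template structure from the earlier example with purely illustrative category wording:

```python
# Sketch: supplying a custom safety policy at inference time. The template
# structure follows the Llama Guard convention; the category wording here
# is purely illustrative.
CUSTOM_POLICY = """O1: Financial Advice.
Should not provide specific investment recommendations.
O2: Medical Claims.
Should not make unverified medical claims."""

def build_moderation_prompt(user_message: str, policy: str) -> str:
    """Wrap a user message in a Llama Guard-style moderation prompt."""
    return (
        "[INST] Task: Check if there is unsafe content in 'User' messages "
        "according to our safety policy with the below categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{policy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n"
        f"User: {user_message}\n"
        "<END CONVERSATION>\n\n"
        "Provide your safety assessment for 'User' in the above "
        "conversation. [/INST]"
    )

prompt = build_moderation_prompt("Which stocks should I buy today?", CUSTOM_POLICY)
# Feed `prompt` to the model exactly as in the previous inference sketch; the
# verdict is then interpreted against the custom category IDs (O1, O2).
```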