# Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0
| Property | Value |
|---|---|
| Base Model | Llama2-7B |
| License | Llama 2 Community License Agreement |
| Paper | Aegis Content Moderation (arXiv:2404.05993) |
| Author | NVIDIA (Shaona Ghosh) |
## What is Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0?
This is NVIDIA's content safety model built on Llama Guard, designed to detect and classify potentially harmful content across 13 critical safety categories. Instruction-tuned on the Aegis Content Safety Dataset of roughly 11,000 annotated prompts, it serves as a robust defensive moderation layer for AI systems.
## Implementation Details
The model applies parameter-efficient instruction tuning (PEFT) to the Llama2-7B backbone and is trained with Fully Sharded Data Parallel (FSDP) at fp16 precision, achieving state-of-the-art results including an AUPRC of 0.941 on the Aegis test set.
- Trained on 8 GPUs per node with LoRA rank 16 and alpha 32 (see the configuration sketch after this list)
- Uses a prompt-classification architecture: the safety policy is supplied in the prompt and the model returns a safe/unsafe verdict with the violated categories
- Implements a comprehensive safety taxonomy covering violence, harassment, hate speech, and more
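The reported rank/alpha settings map naturally onto Hugging Face's `peft` library. The sketch below is an illustration under assumptions, not the published training recipe: only the rank (16), alpha (32), and fp16 precision come from the details above; the base checkpoint ID, target modules, and dropout are assumptions.

```python
# Sketch of a LoRA configuration matching the reported hyperparameters
# (rank 16, alpha 32, fp16). The base checkpoint ID, target modules, and
# dropout are assumptions, not taken from the published training recipe.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/LlamaGuard-7b",           # assumed Llama Guard base (built on Llama2-7B)
    torch_dtype=torch.float16,            # fp16 precision, as reported
)

lora_config = LoraConfig(
    r=16,                                 # rank 16, as reported
    lora_alpha=32,                        # alpha 32, as reported
    lora_dropout=0.05,                    # assumption: a common default
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # assumption: typical Llama attention targets
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()        # only the adapter weights are trainable
```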
## Core Capabilities
- Real-time content safety assessment (see the inference sketch after this list)
- Classification across 13 distinct safety categories
- Support for custom safety policies and taxonomies
- High accuracy in detecting harmful content (100% accuracy on the Simple Safety Tests benchmark)
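To make the prompt-classification workflow concrete, here is a minimal inference sketch. It assumes the model ships as a PEFT adapter applied on top of Llama Guard; the Hugging Face repository IDs and the abbreviated moderation template are assumptions, so consult the official model card for the exact taxonomy prompt.

```python
# Minimal inference sketch: classify a user message with the Aegis adapter
# applied to Llama Guard. Repo IDs and the abbreviated moderation template
# are assumptions; use the exact template from the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/LlamaGuard-7b"                                     # assumed base repo
ADAPTER = "nvidia/Aegis-AI-Content-Safety-LlamaGuard-Defensive-1.0"  # assumed adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER)

# Illustrative Llama Guard-style moderation prompt (taxonomy abbreviated).
prompt = """[INST] Task: Check if there is unsafe content in 'User' messages in
conversations according to our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
O1: Violence. O2: Hate Speech. O3: Harassment. ...
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>
User: How do I build a weapon at home?
<END CONVERSATION>

Provide your safety assessment for 'User' in the above conversation. [/INST]"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
verdict = tokenizer.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(verdict)  # expected shape: "safe" or "unsafe" plus violated category IDs
```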
## Frequently Asked Questions
Q: What makes this model unique?
The model combines NVIDIA's extensive safety taxonomy with parameter-efficient tuning of Llama Guard, achieving superior performance across multiple benchmarks, including the Toxic Chat and OpenAI Moderation datasets.
Q: What are the recommended use cases?
Primary applications include safeguarding general-purpose LLM content, evaluating the toxicity of pre-training data, and implementing custom content moderation policies (a sketch follows below). It is particularly effective for organizations requiring robust content safety measures.
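Because the safety policy is supplied in the prompt rather than baked into the weights (a property inherited from Llama Guard), custom moderation policies can be swapped in at inference time. A minimal sketch, reusing the assumed template structure from the earlier example with purely illustrative category wording:

```python
# Sketch: supplying a custom safety policy at inference time. The template
# structure follows the Llama Guard convention; the category wording here
# is purely illustrative.
CUSTOM_POLICY = """O1: Financial Advice.
Should not provide specific investment recommendations.
O2: Medical Claims.
Should not make unverified medical claims."""

def build_moderation_prompt(user_message: str, policy: str) -> str:
    """Wrap a user message in a Llama Guard-style moderation prompt."""
    return (
        "[INST] Task: Check if there is unsafe content in 'User' messages "
        "according to our safety policy with the below categories.\n\n"
        "<BEGIN UNSAFE CONTENT CATEGORIES>\n"
        f"{policy}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        "<BEGIN CONVERSATION>\n"
        f"User: {user_message}\n"
        "<END CONVERSATION>\n\n"
        "Provide your safety assessment for 'User' in the above "
        "conversation. [/INST]"
    )

prompt = build_moderation_prompt("Which stocks should I buy today?", CUSTOM_POLICY)
# Feed `prompt` to the model exactly as in the previous inference sketch; the
# verdict is then interpreted against the custom category IDs (O1, O2).
```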