Skywork-Reward-Gemma-2-27B-v0.2
| Property | Value |
|---|---|
| Parameter Count | 27.2B |
| Model Type | Sequence Classification |
| Base Architecture | google/gemma-2-27b-it |
| Paper | Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs |
| License | Skywork Community License |
What is Skywork-Reward-Gemma-2-27B-v0.2?
Skywork-Reward-Gemma-2-27B-v0.2 is an advanced reward model that ranked first on the RewardBench leaderboard at the time of its release. Built on the Gemma-2-27b-it architecture, it was trained on a carefully curated dataset of 80K high-quality preference pairs drawn from public sources. The model demonstrates exceptional capability in handling complex preference scenarios across domains including mathematics, coding, and safety.
Implementation Details
The model should be loaded in BF16 precision with the attention implementation set to flash_attention_2 (or eager as a fallback). It was trained on the Skywork Reward Data Collection, which combines data from multiple sources including HelpSteer2, OffsetBias, WildGuard, and the Magpie DPO series.
- Achieves an overall score of 94.3 on RewardBench
- Implements sophisticated data curation techniques
- Leverages flash attention for enhanced performance
- Supports both commercial and community use under the Skywork Community License
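A minimal loading-and-scoring sketch using the Hugging Face transformers API is shown below. The helper-function structure is illustrative rather than official usage; the model ID follows the card's naming, and `"eager"` can be passed in place of `"flash_attention_2"` where flash attention is unavailable.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Skywork/Skywork-Reward-Gemma-2-27B-v0.2"


def load_reward_model(attn: str = "flash_attention_2"):
    """Load the reward model in BF16 with the given attention implementation."""
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation=attn,  # "flash_attention_2" or "eager"
        device_map="auto",
        num_labels=1,  # single scalar reward head
    )
    return tokenizer, model


def score_conversation(tokenizer, model, conversation):
    """Return a scalar reward for a [{"role": ..., "content": ...}] conversation."""
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        return model(input_ids).logits[0][0].item()
```

A higher returned score indicates that the model judges the assistant turn to be a better response to the prompt.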
Core Capabilities
- Superior performance in chat evaluation (96.1 score)
- Excellent reasoning capabilities (98.1 score)
- Strong safety alignment (93.0 score)
- Robust handling of complex preference pairs
- Efficient inference when a supported attention implementation (flash_attention_2 or eager) is used
Frequently Asked Questions
Q: What makes this model unique?
A: The model achieves state-of-the-art performance on RewardBench using only 80K carefully curated training pairs, demonstrating that high-quality data curation can be more important than dataset size. It's particularly notable for maintaining balanced performance across different domains.
Q: What are the recommended use cases?
A: The model is ideal for evaluating and ranking AI-generated responses, particularly in scenarios involving mathematical reasoning, coding tasks, and safety-critical applications. It's specifically designed to help determine which of two possible responses better addresses a given prompt.
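To make the pairwise use case concrete, the sketch below selects the higher-scoring of two candidate responses. The `score_fn` argument stands in for the actual reward-model scoring call; the length-based stub used here is purely a hypothetical placeholder so the comparison logic can be shown end to end.

```python
def pick_better(prompt, response_a, response_b, score_fn):
    """Return whichever response the scoring function rates higher for the prompt."""
    conv_a = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response_a},
    ]
    conv_b = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response_b},
    ]
    return response_a if score_fn(conv_a) >= score_fn(conv_b) else response_b


# Stub scorer for illustration only: rates the longer assistant reply higher.
# In practice, score_fn would wrap a call to the reward model.
def stub_score(conversation):
    return len(conversation[-1]["content"])


best = pick_better("What is 2+2?", "4", "2 + 2 equals 4.", stub_score)
# best == "2 + 2 equals 4." under the stub scorer
```

In a real pipeline, `score_fn` would tokenize each conversation with the model's chat template and return the scalar logit from the sequence-classification head.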