Gemma2 9B CPT Sahabat-AI v1 Instruct

Property	Value
Parameter Count	9.24B
Model Type	Decoder
Languages	English, Indonesian, Javanese, Sundanese
License	Gemma Community License
Context Length	8192 tokens

What is gemma2-9b-cpt-sahabatai-v1-instruct?

Sahabat-AI v1 Instruct is an advanced multilingual language model specifically designed for Indonesian languages and dialects. Co-initiated by GoTo Group and Indosat Ooredoo Hutchison, this model has been fine-tuned on an extensive dataset of 448,000 Indonesian instruction pairs, along with 96,000 Javanese and 98,000 Sundanese instruction pairs, plus 129,000 English instruction pairs.

Implementation Details

Built on the Gemma2 architecture, this model leverages advanced decoder technology and has been fine-tuned through a combination of full parameter tuning and on-policy alignment. The training process involved 4 hours of fine-tuning and 2 hours of alignment on 8x H100-80GB GPUs.

Context length of 8192 tokens
BF16 tensor type optimization
Comprehensive multilingual capabilities
State-of-the-art performance on regional language benchmarks

Core Capabilities

Achieves 61.169% overall score on SEA HELM benchmark
62.6% performance on IndoMMLU evaluation
Superior performance in Indonesian (64.154%), Javanese (64.439%), and Sundanese (54.913%) tasks
Maintains strong English language capabilities with 33.67% average score on standard benchmarks

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its specialized optimization for Indonesian languages and dialects, while maintaining strong performance across multiple languages. It's particularly notable for achieving state-of-the-art results on regional language benchmarks like SEA HELM and IndoMMLU.

Q: What are the recommended use cases?

The model is ideal for applications requiring multilingual capabilities in Southeast Asian languages, particularly Indonesian, Javanese, and Sundanese. It's suitable for tasks like question answering, sentiment analysis, translation, and abstractive summarization in these languages.