# Sarvam-2b-v0.5
| Property | Value |
|---|---|
| Parameter Count | 2.51B |
| Model Type | Text Generation, Transformers |
| License | Other |
| Tensor Type | BF16 |
## What is sarvam-2b-v0.5?
Sarvam-2b-v0.5 is an early checkpoint of a multilingual language model designed specifically for Indian languages. This 2.51B-parameter model was pre-trained from scratch on 2 trillion tokens, with a focus on supporting 10 Indic languages alongside English. The model represents a significant advancement in Indian-language AI; it was trained with the NVIDIA NeMo™ Framework on HGX H100 systems.
## Implementation Details
The model employs a custom tokenizer optimized for Indic languages, achieving an average fertility score of ~2 (tokens produced per word; lower is better), significantly below competitors like Llama-3.1 (9.34) and GPT-4 (3.00). This tokenization efficiency makes it particularly effective for processing Indian languages; a minimal way to measure it yourself is sketched after the list below.
- Trained on Yotta Shakti Cloud using HGX H100 systems
- Implements transformer architecture with BF16 precision
- Supports 10 Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu
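To check the tokenizer's efficiency in practice, the sketch below computes a simple fertility score (tokens per whitespace-separated word) for a sample sentence. The repo id `sarvamai/sarvam-2b-v0.5` is an assumption based on the model name; substitute the actual Hugging Face path if it differs.

```python
from transformers import AutoTokenizer

MODEL_ID = "sarvamai/sarvam-2b-v0.5"  # assumed repo id

def fertility(tokenizer, text: str) -> float:
    """Tokens produced per whitespace-delimited word (lower is better)."""
    words = text.split()
    tokens = tokenizer.tokenize(text)
    return len(tokens) / max(len(words), 1)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Hindi sample: "Hello, how are you?"
print(fertility(tokenizer, "नमस्ते, आप कैसे हैं?"))
```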
## Core Capabilities
- Efficient multilingual text generation
- Superior tokenization for Indic scripts
- Balanced performance across 11 languages
- Easy integration with the Hugging Face Transformers library (see the loading example below)
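Here is a minimal loading-and-generation sketch, again assuming the `sarvamai/sarvam-2b-v0.5` repo id and a GPU with BF16 support, to match the tensor type listed above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "sarvamai/sarvam-2b-v0.5"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # matches the BF16 tensor type
    device_map="auto",           # requires the accelerate package
)

# Hindi prompt: "The capital of India"
prompt = "भारत की राजधानी"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is an early pre-trained checkpoint rather than an instruction-tuned model, prompts work best as plain text to be continued.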
## Frequently Asked Questions
**Q: What makes this model unique?**
The model's distinctive feature is its optimized tokenization for Indic languages, achieving significantly better (lower) fertility scores than comparable models. It's specifically designed to handle multiple Indian scripts effectively while maintaining strong English language capabilities.
**Q: What are the recommended use cases?**
The model is well-suited for multilingual text generation tasks, particularly those involving Indian languages. It can be used for content generation, translation assistance, and general language understanding tasks across the supported languages.
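For quick experiments with such tasks, the high-level pipeline API also works. This is a sketch under the same repo-id assumption; because this is a base checkpoint, the prompt is phrased as a continuation rather than an instruction:

```python
from transformers import pipeline

# Assumed repo id; this is a pre-trained (base) checkpoint, so phrase
# prompts as text to continue rather than chat-style instructions.
generator = pipeline("text-generation", model="sarvamai/sarvam-2b-v0.5")

# Kannada prompt: "The history of the Kannada language"
result = generator("ಕನ್ನಡ ಭಾಷೆಯ ಇತಿಹಾಸ", max_new_tokens=40)
print(result[0]["generated_text"])
```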