IndoBERT Large P2
| Property | Value |
|---|---|
| Parameter Count | 335.2M |
| Architecture | BERT Large |
| Training Data | Indo4B (23.43 GB) |
| Model Type | Transformer-based Language Model |
| Hugging Face | indobenchmark/indobert-large-p2 |
What is indobert-large-p2?
IndoBERT Large P2 is a state-of-the-art language model designed specifically for Indonesian language processing. It is the large variant of the IndoBERT family, pre-trained on Indo4B, a 23.43 GB corpus of Indonesian text. The "P2" designates a second phase of pre-training that builds on the phase 1 model, using the standard BERT objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
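As a quick illustration of the MLM objective, the pre-trained model can fill in masked tokens through the Transformers fill-mask pipeline. This is a minimal sketch; the example sentence is our own, not taken from the model's documentation:

```python
from transformers import pipeline

# Load the pre-trained model in a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-large-p2")

# Indonesian: "The capital of Indonesia is [MASK]."
# [MASK] is BERT's standard mask token.
predictions = fill_mask("Ibu kota Indonesia adalah [MASK].")
for p in predictions:
    print(f"{p['token_str']}: {p['score']:.4f}")
```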
Implementation Details
The model uses the BERT Large architecture with 335.2 million parameters, making it one of the largest publicly available Indonesian language models. It is implemented with the Transformers library and integrates easily into existing NLP pipelines; a loading sketch follows the list below.
- Pre-trained using MLM and NSP objectives
- Uncased tokenization for better generalization
- Compatible with HuggingFace's Transformers library
- Trained on diverse Indonesian text corpus
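A minimal loading sketch using the Auto classes, assuming `transformers` and `torch` are installed (the model identifier comes from the table above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder-only base model.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

# Tokenize an Indonesian sentence and run a forward pass.
inputs = tokenizer("Saya suka membaca buku.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states: (batch_size, sequence_length, 1024) for BERT Large.
print(outputs.last_hidden_state.shape)
```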
Core Capabilities
- Contextual word embedding generation (see the sketch after this list)
- Text classification and sequence labeling
- Natural language understanding tasks
- Masked language modeling
- Next sentence prediction
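For example, sentence-level embeddings are often derived by mean-pooling the final hidden states over non-padding tokens. This is a common convention rather than an official recipe; the helper below is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

def embed(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding (1024-dim for BERT Large)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embed("Cuaca hari ini sangat cerah.").shape)  # torch.Size([1, 1024])
```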
Frequently Asked Questions
Q: What makes this model unique?
IndoBERT Large P2 stands out for its combination of scale (335.2M parameters) and pre-training devoted entirely to Indonesian text, which makes it particularly effective on Indonesian language tasks. The second pre-training phase further improves its performance on downstream tasks.
Q: What are the recommended use cases?
The model is well suited to Indonesian language processing tasks such as text classification, named entity recognition, and sentiment analysis, as well as other natural language understanding applications. It is a strong choice for tasks that require deep contextual understanding of Indonesian text; a fine-tuning sketch follows.
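Below is a hedged fine-tuning sketch for one such task, binary sentiment classification. The two-label setup, label scheme, and example sentences are illustrative assumptions, not part of the original model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Attach a randomly initialized 2-class head (assumed setup for sentiment analysis).
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-large-p2", num_labels=2
)

# Illustrative mini-batch; real fine-tuning requires a labeled dataset.
texts = ["Filmnya sangat bagus!", "Pelayanannya mengecewakan."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (assumed scheme)

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
print(outputs.loss)  # backpropagate this loss during fine-tuning
```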