IndoBERT Large P2
| Property | Value |
|---|---|
| Parameter Count | 335.2M |
| Architecture | BERT Large |
| Training Data | Indo4B (23.43 GB) |
| Model Type | Transformer-based Language Model |
| Hugging Face | indobenchmark/indobert-large-p2 |
What is indobert-large-p2?
IndoBERT Large P2 is a state-of-the-art language model designed specifically for Indonesian language processing. It is the large variant of the IndoBERT family, pre-trained on Indo4B, a 23.43 GB corpus of Indonesian text. The "P2" designates a second phase of pre-training that builds on the phase 1 model, using the standard BERT objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).
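As a quick illustration of the MLM objective, the pre-trained model can fill in masked tokens through the Transformers fill-mask pipeline. This is a minimal sketch; the example sentence is our own, not taken from the model's documentation:

```python
from transformers import pipeline

# Load the pre-trained model in a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="indobenchmark/indobert-large-p2")

# Indonesian: "The capital of Indonesia is [MASK]."
# [MASK] is BERT's standard mask token.
predictions = fill_mask("Ibu kota Indonesia adalah [MASK].")
for p in predictions:
    print(f"{p['token_str']}: {p['score']:.4f}")
```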
Implementation Details
The model uses the BERT Large architecture with 335.2 million parameters, making it one of the largest publicly available Indonesian language models. It is implemented with the Transformers library and integrates easily into existing NLP pipelines; a loading sketch follows the list below.
- Pre-trained using MLM and NSP objectives
- Uncased tokenization for better generalization
- Compatible with HuggingFace's Transformers library
- Trained on diverse Indonesian text corpus
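A minimal loading sketch using the Auto classes, assuming `transformers` and `torch` are installed (the model identifier comes from the table above):

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and the encoder-only base model.
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

# Tokenize an Indonesian sentence and run a forward pass.
inputs = tokenizer("Saya suka membaca buku.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states: (batch_size, sequence_length, 1024) for BERT Large.
print(outputs.last_hidden_state.shape)
```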
Core Capabilities
- Contextual word embedding generation (see the sketch after this list)
- Text classification and sequence labeling
- Natural language understanding tasks
- Masked language modeling
- Next sentence prediction
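For example, sentence-level embeddings are often derived by mean-pooling the final hidden states over non-padding tokens. This is a common convention rather than an official recipe; the helper below is illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

def embed(text: str) -> torch.Tensor:
    """Return a mean-pooled sentence embedding (1024-dim for BERT Large)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 1024)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embed("Cuaca hari ini sangat cerah.").shape)  # torch.Size([1, 1024])
```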
Frequently Asked Questions
Q: What makes this model unique?
IndoBERT Large P2 stands out for its combination of scale (335.2M parameters) and pre-training devoted entirely to Indonesian text, which makes it particularly effective on Indonesian language tasks. The second pre-training phase further improves its performance on downstream tasks.
Q: What are the recommended use cases?
The model is well suited to Indonesian language processing tasks such as text classification, named entity recognition, and sentiment analysis, as well as other natural language understanding applications. It is a strong choice for tasks that require deep contextual understanding of Indonesian text; a fine-tuning sketch follows.
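Below is a hedged fine-tuning sketch for one such task, binary sentiment classification. The two-label setup, label scheme, and example sentences are illustrative assumptions, not part of the original model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Attach a randomly initialized 2-class head (assumed setup for sentiment analysis).
tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModelForSequenceClassification.from_pretrained(
    "indobenchmark/indobert-large-p2", num_labels=2
)

# Illustrative mini-batch; real fine-tuning requires a labeled dataset.
texts = ["Filmnya sangat bagus!", "Pelayanannya mengecewakan."]
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative (assumed scheme)

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)
print(outputs.loss)  # backpropagate this loss during fine-tuning
```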