indobert-base-uncased

Maintained by: indolem

IndoBERT Base Uncased

  • License: MIT
  • Research Paper: arXiv:2011.00677
  • Training Data Size: 220M words
  • Training Steps: 2.4M (180 epochs)

What is indobert-base-uncased?

IndoBERT is a BERT model trained specifically for the Indonesian language, developed by the IndoLEM team. The model was trained on a corpus of 220M words comprising Indonesian Wikipedia (74M words), news articles from various sources (55M words), and an Indonesian Web Corpus (90M words). It reaches a development-set perplexity of 3.97, comparable to English BERT-base.
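
The model can be loaded through the Hugging Face transformers library. The sketch below assumes the hub ID `indolem/indobert-base-uncased` (consistent with the maintainer and model name above) and a PyTorch backend; the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModel

# Hub ID assumed from the maintainer/model name shown above.
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

# Encode an Indonesian sentence ("The capital of Indonesia is Jakarta.")
# and extract contextual embeddings from the final layer.
inputs = tokenizer("ibu kota indonesia adalah jakarta.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```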

Implementation Details

The model follows the BERT-base architecture and was trained for 2.4M steps (180 epochs). It performs strongly across multiple Indonesian language tasks, including POS tagging, NER, dependency parsing, and sentiment analysis:

  • Achieves 96.8% accuracy on POS tagging
  • Reaches 90.1% F1 score on NER UI task
  • Demonstrates 84.13% F1 score on sentiment analysis
  • Shows 93.7% accuracy on next tweet prediction
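
Because IndoBERT is pretrained as a masked language model, a quick way to sanity-check the checkpoint is the fill-mask pipeline. This is a minimal sketch: the example sentence is illustrative, and input is lowercased to match the uncased vocabulary.

```python
from transformers import pipeline

# Fill-mask pipeline over the pretrained checkpoint.
fill = pipeline("fill-mask", model="indolem/indobert-base-uncased")

# "The capital of Indonesia is [MASK]." — print the top 3 predictions.
for pred in fill("ibu kota indonesia adalah [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```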

Core Capabilities

  • Morpho-syntactic analysis
  • Named Entity Recognition
  • Dependency parsing
  • Sentiment analysis
  • Text summarization
  • Social media text analysis

Frequently Asked Questions

Q: What makes this model unique?

IndoBERT is optimized specifically for Indonesian language processing, trained on a comprehensive corpus of Indonesian text. It consistently outperforms multilingual BERT and other baselines across Indonesian NLP tasks.

Q: What are the recommended use cases?

The model is well suited to Indonesian language processing tasks including POS tagging, named entity recognition, sentiment analysis, and text summarization. It is effective on both formal text (news, Wikipedia) and social media content; a minimal fine-tuning sketch follows.
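
As a sketch of one downstream setup, the snippet below attaches a randomly initialized classification head for binary sentiment. The label count and any training data are assumptions for illustration, not part of this model card.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")

# Assumed binary sentiment setup (positive/negative); num_labels=2 is a
# hypothetical choice, and the classification head starts untrained.
model = AutoModelForSequenceClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=2
)

# Fine-tune from here with transformers' Trainer or a standard PyTorch
# training loop on a labeled Indonesian sentiment dataset.
```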
