indobert-base-uncased

Maintained by: indolem

IndoBERT Base Uncased

  • License: MIT
  • Research Paper: arXiv:2011.00677
  • Training Data Size: 220M words
  • Training Steps: 2.4M (180 epochs)

What is indobert-base-uncased?

IndoBERT is a BERT model trained specifically for the Indonesian language, developed by the IndoLEM team. The model was trained on a corpus of 220M words comprising Indonesian Wikipedia (74M words), news articles from various sources (55M words), and an Indonesian Web Corpus (90M words). It reaches a development-set perplexity of 3.97, comparable to English BERT-base.
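
The model can be loaded through the Hugging Face transformers library. The sketch below assumes the hub ID `indolem/indobert-base-uncased` (consistent with the maintainer and model name above) and a PyTorch backend; the example sentence is purely illustrative.

```python
from transformers import AutoTokenizer, AutoModel

# Hub ID assumed from the maintainer/model name shown above.
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModel.from_pretrained("indolem/indobert-base-uncased")

# Encode an Indonesian sentence ("The capital of Indonesia is Jakarta.")
# and extract contextual embeddings from the final layer.
inputs = tokenizer("ibu kota indonesia adalah jakarta.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```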

Implementation Details

The model follows the BERT-base architecture and was trained for 2.4M steps (180 epochs). It performs strongly across multiple Indonesian language tasks, including POS tagging, NER, dependency parsing, and sentiment analysis:

  • Achieves 96.8% accuracy on POS tagging
  • Reaches 90.1% F1 score on NER UI task
  • Demonstrates 84.13% F1 score on sentiment analysis
  • Shows 93.7% accuracy on next tweet prediction
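
Because IndoBERT is pretrained as a masked language model, a quick way to sanity-check the checkpoint is the fill-mask pipeline. This is a minimal sketch: the example sentence is illustrative, and input is lowercased to match the uncased vocabulary.

```python
from transformers import pipeline

# Fill-mask pipeline over the pretrained checkpoint.
fill = pipeline("fill-mask", model="indolem/indobert-base-uncased")

# "The capital of Indonesia is [MASK]." — print the top 3 predictions.
for pred in fill("ibu kota indonesia adalah [MASK].")[:3]:
    print(pred["token_str"], round(pred["score"], 3))
```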

Core Capabilities

  • Morpho-syntactic analysis
  • Named Entity Recognition
  • Dependency parsing
  • Sentiment analysis
  • Text summarization
  • Social media text analysis

Frequently Asked Questions

Q: What makes this model unique?

IndoBERT is optimized specifically for Indonesian language processing, trained on a comprehensive corpus of Indonesian text. It consistently outperforms multilingual BERT and other baselines across Indonesian NLP tasks.

Q: What are the recommended use cases?

The model is well suited to Indonesian language processing tasks including POS tagging, named entity recognition, sentiment analysis, and text summarization. It is effective on both formal text (news, Wikipedia) and social media content; a minimal fine-tuning sketch follows.
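
As a sketch of one downstream setup, the snippet below attaches a randomly initialized classification head for binary sentiment. The label count and any training data are assumptions for illustration, not part of this model card.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")

# Assumed binary sentiment setup (positive/negative); num_labels=2 is a
# hypothetical choice, and the classification head starts untrained.
model = AutoModelForSequenceClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=2
)

# Fine-tune from here with transformers' Trainer or a standard PyTorch
# training loop on a labeled Indonesian sentiment dataset.
```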
