bert-base-arabic

Parameter Count: 111M parameters
Training Data: 8.2B words (95GB of text)
Architecture: BERT Base
Author: asafaya
Downloads: 32,075

What is bert-base-arabic?

bert-base-arabic is an Arabic language model pretrained on a corpus of 8.2 billion words. Developed by researcher Ali Safaya, it covers both Modern Standard Arabic and dialectal Arabic, making it a general-purpose backbone for Arabic natural language processing.

Implementation Details

The model was trained with Google's BERT Base architecture on a TPU v3-8, using modified BERT training hyperparameters: 3M training steps with a batch size of 128. The training corpus combines Arabic text from OSCAR (a filtered Common Crawl corpus) and Wikipedia, totaling approximately 95GB.

  • Preserves inline non-Arabic words, keeping the model compatible with NER fine-tuning
  • No cased/uncased variants, since Arabic script has no letter case
  • Covers both Modern Standard Arabic and dialectal Arabic
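
The checkpoint is published on the Hugging Face Hub, so loading it follows the standard transformers pattern. A minimal sketch, assuming the hub ID asafaya/bert-base-arabic (built from the author and model names above):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hub ID assumed from the author/model names on this page.
model_id = "asafaya/bert-base-arabic"

# WordPiece tokenizer and the pretrained masked-LM weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```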

Core Capabilities

  • Masked Language Modeling
  • Text Classification
  • Named Entity Recognition
  • Sentiment Analysis
  • General Arabic text understanding
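
Because the model was pretrained with a masked-language-modeling objective, the quickest smoke test is a fill-mask pipeline. A short sketch, using the same assumed hub ID:

```python
from transformers import pipeline

# Fill-mask: the model ranks candidate tokens for the [MASK] position.
fill_mask = pipeline("fill-mask", model="asafaya/bert-base-arabic")

# Arabic prompt: "The capital of Egypt is [MASK]."
for pred in fill_mask("عاصمة مصر هي [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```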

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its coverage of both Modern Standard and dialectal Arabic and its substantial 8.2B-word training corpus, which together make it robust across Arabic NLP tasks. It also preserves inline non-Arabic text, which matters for real-world inputs that mix Arabic with other languages.

Q: What are the recommended use cases?

The model is well suited to a range of Arabic NLP tasks, including text classification, named entity recognition, and general language understanding. It is particularly useful for applications that must handle both Modern Standard Arabic and dialectal variants. For the classification-style tasks, the usual route is to fine-tune a task head on top of the pretrained encoder, as sketched below.
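
A minimal fine-tuning sketch, assuming a hypothetical three-label sentiment task; the classification head below is freshly initialized and must be trained on labeled data before use:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical 3-way sentiment task (positive / neutral / negative);
# the classification head is newly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "asafaya/bert-base-arabic", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

# Tokenize one example sentence ("This product is excellent").
inputs = tokenizer("هذا المنتج ممتاز", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); untrained head
```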
