# bert-base-arabic
| Property | Value |
|---|---|
| Parameter Count | 111M parameters |
| Training Data | 8.2B words (95GB text) |
| Architecture | BERT Base |
| Author | asafaya |
| Downloads | 32,075 |
## What is bert-base-arabic?

bert-base-arabic is an Arabic language model pretrained on a large corpus of 8.2 billion words. Developed by researcher Ali Safaya, the model covers both Modern Standard Arabic and dialectal Arabic, making it a general-purpose foundation for Arabic natural language processing.
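The checkpoint can be loaded with the Hugging Face transformers library; a minimal sketch, assuming the Hub model ID asafaya/bert-base-arabic (derived from the author handle in the table above):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hub model ID assumed from the author handle: asafaya/bert-base-arabic
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForMaskedLM.from_pretrained("asafaya/bert-base-arabic")
```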
## Implementation Details

The model was trained with Google's BERT architecture on a TPU v3-8, using a modified BERT training configuration of 3M training steps and a batch size of 128. The training corpus combines Arabic text from OSCAR (filtered Common Crawl) and Wikipedia, totaling approximately 95GB of text data.
- Preserves inline non-Arabic words for NER task compatibility (see the tokenization sketch after this list)
- No cased/uncased variants, since the Arabic script does not distinguish letter case
- Supports both Modern Standard Arabic and dialectal Arabic
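A quick way to observe the first point is to tokenize a sentence with an inline Latin-script word; a minimal sketch, where the sample sentence is illustrative and the exact subword split depends on the vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

# Illustrative mixed-script input: "I work at Google in Riyadh"
text = "أعمل في شركة Google في الرياض"
print(tokenizer.tokenize(text))
# The inline word "Google" is kept in the token stream rather than stripped,
# so entity spans stay aligned for NER-style tasks.
```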
## Core Capabilities

- Masked Language Modeling (see the fill-mask sketch after this list)
- Text Classification
- Named Entity Recognition
- Sentiment Analysis
- General Arabic text understanding
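Since the pretrained checkpoint ships with a masked-LM head, the quickest end-to-end check of the first capability is a fill-mask pipeline; a minimal sketch, where the Arabic prompt ("The capital of Egypt is [MASK].") is an illustrative example:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="asafaya/bert-base-arabic")

# Illustrative prompt: "The capital of Egypt is [MASK]."
prompt = f"عاصمة مصر هي {fill_mask.tokenizer.mask_token}."
for prediction in fill_mask(prompt):
    print(prediction["token_str"], round(prediction["score"], 3))
```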
## Frequently Asked Questions

### Q: What makes this model unique?

This model stands out for its coverage of Arabic language varieties and its substantial 8.2B-word training corpus, which make it robust across Arabic NLP tasks. It also preserves inline non-Arabic text, which is crucial for real-world applications such as NER over mixed-script documents.
### Q: What are the recommended use cases?

The model is well suited to a range of Arabic NLP tasks, including text classification, named entity recognition, and general language understanding. It is particularly appropriate for applications that must handle both Modern Standard Arabic and dialectal variation; a fine-tuning sketch follows below.
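As a concrete path for the classification use cases above, the encoder can be fine-tuned with a sequence-classification head; a minimal sketch, where the label count, example texts, and hyperparameters are placeholder assumptions rather than settings from the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# num_labels=3 is a placeholder (e.g. negative / neutral / positive sentiment)
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")
model = AutoModelForSequenceClassification.from_pretrained(
    "asafaya/bert-base-arabic", num_labels=3
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Toy batch; the texts ("The service is excellent" / "The product is very bad")
# and labels are illustrative only
texts = ["الخدمة ممتازة", "المنتج سيئ جدا"]
labels = torch.tensor([2, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()   # one step of a standard fine-tuning loop
optimizer.step()
```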