bert-base-arabic

Parameter Count: 111M parameters
Training Data: 8.2B words (95GB of text)
Architecture: BERT Base
Author: asafaya
Downloads: 32,075

What is bert-base-arabic?

bert-base-arabic is an Arabic language model pretrained on a corpus of 8.2 billion words. Developed by researcher Ali Safaya, it covers both Modern Standard Arabic and dialectal Arabic, making it a general-purpose backbone for Arabic natural language processing.

Implementation Details

The model was trained with Google's BERT Base architecture on a TPU v3-8, using modified BERT training hyperparameters: 3M training steps with a batch size of 128. The training corpus combines Arabic text from OSCAR (a filtered Common Crawl corpus) and Wikipedia, totaling approximately 95GB.

  • Preserves inline non-Arabic words, keeping the model compatible with NER fine-tuning
  • No cased/uncased variants, since Arabic script has no letter case
  • Covers both Modern Standard Arabic and dialectal Arabic
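
The checkpoint is published on the Hugging Face Hub, so loading it follows the standard transformers pattern. A minimal sketch, assuming the hub ID asafaya/bert-base-arabic (built from the author and model names above):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Hub ID assumed from the author/model names on this page.
model_id = "asafaya/bert-base-arabic"

# WordPiece tokenizer and the pretrained masked-LM weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```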

Core Capabilities

  • Masked Language Modeling
  • Text Classification
  • Named Entity Recognition
  • Sentiment Analysis
  • General Arabic text understanding
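
Because the model was pretrained with a masked-language-modeling objective, the quickest smoke test is a fill-mask pipeline. A short sketch, using the same assumed hub ID:

```python
from transformers import pipeline

# Fill-mask: the model ranks candidate tokens for the [MASK] position.
fill_mask = pipeline("fill-mask", model="asafaya/bert-base-arabic")

# Arabic prompt: "The capital of Egypt is [MASK]."
for pred in fill_mask("عاصمة مصر هي [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```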

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its coverage of both Modern Standard and dialectal Arabic and its substantial 8.2B-word training corpus, which together make it robust across Arabic NLP tasks. It also preserves inline non-Arabic text, which matters for real-world inputs that mix Arabic with other languages.

Q: What are the recommended use cases?

The model is well suited to a range of Arabic NLP tasks, including text classification, named entity recognition, and general language understanding. It is particularly useful for applications that must handle both Modern Standard Arabic and dialectal variants. For the classification-style tasks, the usual route is to fine-tune a task head on top of the pretrained encoder, as sketched below.
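
A minimal fine-tuning sketch, assuming a hypothetical three-label sentiment task; the classification head below is freshly initialized and must be trained on labeled data before use:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical 3-way sentiment task (positive / neutral / negative);
# the classification head is newly initialized and needs fine-tuning.
model = AutoModelForSequenceClassification.from_pretrained(
    "asafaya/bert-base-arabic", num_labels=3
)
tokenizer = AutoTokenizer.from_pretrained("asafaya/bert-base-arabic")

# Tokenize one example sentence ("This product is excellent").
inputs = tokenizer("هذا المنتج ممتاز", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); untrained head
```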
