bert-base-arabic-camelbert-mix

Maintained By
CAMeL-Lab

Property | Value
Training Data Size | 167GB (17.3B words)
Variants Covered | MSA, Dialectal Arabic, Classical Arabic
Paper | The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Developer | CAMeL-Lab

What is bert-base-arabic-camelbert-mix?

CAMeLBERT-Mix is a BERT-base model pre-trained for Arabic natural language processing. Its distinguishing feature is its coverage of Arabic language variants: the 167GB pre-training corpus (17.3B words) combines Modern Standard Arabic (MSA), dialectal Arabic (DA), and classical Arabic (CA). This breadth makes it a versatile starting point for Arabic NLP tasks that span more than one variant.

Implementation Details

The model was pre-trained on a single TPU v3-8 for one million steps, using a WordPiece tokenizer with a 30,000-token vocabulary and whole word masking. The first 90,000 steps used a batch size of 1,024; the remaining 910,000 steps used a batch size of 256.
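To make the batch-size schedule concrete, the short sketch below counts how many training sequences the schedule consumes in total. It uses only the step counts and batch sizes stated above; the constant and function names are illustrative, not taken from the CAMeL-Lab training code.

```python
# Batch-size schedule as described in the text (names are illustrative).
TOTAL_STEPS = 1_000_000
FIRST_PHASE_STEPS = 90_000   # steps run with the larger batch
LARGE_BATCH = 1_024
SMALL_BATCH = 256


def sequences_seen(total_steps=TOTAL_STEPS):
    """Return the number of training sequences consumed after `total_steps`."""
    large_steps = min(total_steps, FIRST_PHASE_STEPS)
    small_steps = max(total_steps - FIRST_PHASE_STEPS, 0)
    return large_steps * LARGE_BATCH + small_steps * SMALL_BATCH


print(f"{sequences_seen():,} sequences")  # 325,120,000 sequences
```

The first phase contributes 92,160,000 sequences and the second 232,960,000, so the longer small-batch phase still accounts for most of the sequences seen.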

  • Sequence length: 128 tokens (90% of steps) and 512 tokens (10% of steps)
  • Learning rate: 1e-4 with Adam optimizer
  • Pre-training approach: Masked Language Modeling and Next Sentence Prediction
  • Vocabulary size: 30,000 tokens
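Whole word masking, mentioned above, can be illustrated with a minimal sketch. It assumes the usual WordPiece convention that continuation pieces start with "##", and it uses a 15% masking probability, which is the common BERT default rather than a figure stated here. It is a simplification, not the actual pre-training code: it always substitutes the mask token and ignores special tokens. The toy token list is hypothetical.

```python
import random


def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Mask whole words in a list of WordPiece tokens.

    Pieces starting with "##" continue the previous word, so they are
    grouped with it and masked (or kept) together.
    """
    rng = random.Random(seed)

    # Group token indices into words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Decide once per word, then apply the decision to all of its pieces.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked


# Hypothetical toy input: "play" + "##ing" form one word.
print(whole_word_mask(["the", "play", "##ing", "field"], mask_prob=0.5, seed=0))
```

Because the masking decision is made per word rather than per piece, the model never sees half of a word as a hint for predicting the other half.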

Core Capabilities

  • Named Entity Recognition (80.8% F1 score on ANERcorp)
  • POS Tagging (98.1% accuracy on PATB)
  • Sentiment Analysis (92.7% on ArSAS)
  • Dialect Identification (92.5% on MADAR-6)
  • Poetry Classification (79.8% on APCD)

Frequently Asked Questions

Q: What makes this model unique?

This model's uniqueness lies in its comprehensive coverage of Arabic language variants and its large-scale training data (167GB), making it particularly effective for tasks across different Arabic dialects and styles.

Q: What are the recommended use cases?

The model is a good fit for named entity recognition, POS tagging, sentiment analysis, dialect identification, and poetry classification. It is particularly suitable for applications that must handle multiple Arabic variants.
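As a starting point for these use cases, the sketch below shows one way to load the model with the Hugging Face transformers library. The Hub identifier is assumed to be the developer name joined with the model name, and the two helper functions are hypothetical names introduced for illustration. Calling them requires transformers to be installed and downloads the weights.

```python
# Assumed Hub identifier: developer name + model name from this page.
MODEL_ID = "CAMeL-Lab/bert-base-arabic-camelbert-mix"


def load_fill_mask():
    """Return a fill-mask pipeline that uses the pre-trained MLM head."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=MODEL_ID)


def load_for_classification(num_labels):
    """Return (tokenizer, model) with a new sequence classification head."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_ID, num_labels=num_labels
    )
    return tokenizer, model
```

The classification head returned by load_for_classification is newly initialized, so the model needs fine-tuning on labeled data before it can be used for tasks such as sentiment analysis or dialect identification.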
