# heBERT
| Property | Value |
|---|---|
| Research Paper | HeBERT & HebEMO (arXiv:2102.01909) |
| Architecture | BERT-Base |
| Training Data | ~10.6GB (OSCAR, Wikipedia, UGC) |
| Primary Tasks | Sentiment Analysis, NER, Masked-LM |
## What is heBERT?
heBERT is a Hebrew language model based on Google's BERT architecture and designed specifically for Hebrew text analysis. It was trained on a corpus of roughly 1 billion words across 20.8 million sentences drawn from several sources, making it a robust foundation for Hebrew natural language processing tasks.
## Implementation Details
The model was pre-trained on three primary datasets: a 9.8GB Hebrew OSCAR corpus, a 650MB Hebrew Wikipedia dump, and 150MB of user-generated content from news sites. It uses the BERT-Base configuration and is released in several task-specific versions, including fine-tuned models for sentiment analysis and named entity recognition.
- Pre-trained on combined datasets totaling over 1 billion words
- Supports masked language modeling for transfer learning
- Includes fine-tuned versions for sentiment analysis and NER
- Available through the Transformers library and AWS (a minimal loading sketch follows this list)
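For reference, here is a minimal loading sketch using the Hugging Face Transformers library. The Hub ID `avichr/heBERT` is the checkpoint name published on the Hugging Face Hub; verify it against the current model card before use.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Checkpoint name on the Hugging Face Hub (verify against the model card).
MODEL_ID = "avichr/heBERT"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)
```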
## Core Capabilities
- Masked Language Modeling for general language understanding (see the fill-mask sketch after this list)
- Sentiment Analysis with three-way classification (positive, negative, neutral)
- Named Entity Recognition for Hebrew text
- Emotion recognition across 8 distinct emotions (upcoming feature)
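To illustrate the masked-LM capability, the sketch below fills in a masked token in a Hebrew sentence via the `fill-mask` pipeline. It assumes the `avichr/heBERT` Hub ID and the standard BERT `[MASK]` token; the sentence is an illustrative example, not one taken from the model card.

```python
from transformers import pipeline

# Fill-mask pipeline over the base checkpoint (Hub ID assumed: avichr/heBERT).
fill_mask = pipeline("fill-mask", model="avichr/heBERT")

# Hebrew for: "Israel is a [MASK] in the Middle East."
for prediction in fill_mask("ישראל היא [MASK] במזרח התיכון.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```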
## Frequently Asked Questions
Q: What makes this model unique?
heBERT is specifically optimized for Hebrew language processing, trained on a diverse and extensive Hebrew corpus. Its ability to handle both formal and user-generated content makes it particularly valuable for real-world applications.
Q: What are the recommended use cases?
The model excels in sentiment analysis of Hebrew text, named entity recognition, and can be fine-tuned for various downstream tasks. It's particularly suitable for applications requiring Hebrew text understanding, such as social media analysis, content moderation, and automated text classification.
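As an illustration of the sentiment-analysis use case, a minimal sketch follows. It assumes the fine-tuned checkpoint is published under the Hub ID `avichr/heBERT_sentiment_analysis`; the exact label names returned depend on the published model configuration.

```python
from transformers import pipeline

# Sentiment classifier fine-tuned from heBERT
# (Hub ID assumed: avichr/heBERT_sentiment_analysis).
sentiment = pipeline("sentiment-analysis", model="avichr/heBERT_sentiment_analysis")

# Hebrew for: "I really enjoyed this product."
print(sentiment("מאוד נהניתי מהמוצר הזה"))
# Expected shape: [{'label': ..., 'score': ...}]
```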