ParsBERT - Persian Language Understanding Model
| Property | Value |
|---|---|
| Research Paper | arXiv:2005.12515 |
| Developer | HooshvareLab |
| Framework Support | PyTorch, TensorFlow |
| Community Stats | 24,433 downloads, 31 likes |
What is bert-base-parsbert-uncased?
ParsBERT is a monolingual language model specifically designed for Persian language understanding, based on Google's BERT architecture. Trained on a diverse corpus of over 2M documents spanning scientific texts, novels, and news articles, it represents a significant advancement in Persian NLP capabilities.
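The checkpoint is distributed through the Hugging Face Hub, so a quick way to try it is the transformers library. The sketch below assumes the hub ID HooshvareLab/bert-base-parsbert-uncased (developer plus model name from the table above) and uses an illustrative Persian sentence, not one from the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "HooshvareLab/bert-base-parsbert-uncased"  # assumed hub ID: developer/model-name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an illustrative Persian sentence ("This is an example sentence.")
inputs = tokenizer("این یک جمله نمونه است.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings from the BERT-Base encoder: (batch, tokens, 768)
print(outputs.last_hidden_state.shape)
```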
Implementation Details
The training pipeline applies extensive pre-processing that combines POS tagging and WordPiece segmentation, yielding more than 40M true sentences. The model follows the BERT-Base configuration and is trained with whole-word masking (see the fill-mask sketch after the list below).
- Comprehensive pre-training on varied Persian texts
- State-of-the-art performance across multiple NLP tasks
- Uncased tokenization with whole word masking
- Compatible with both PyTorch and TensorFlow frameworks
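Because ParsBERT is pre-trained with a masked-language-modeling objective, its MLM head can be probed directly. A minimal sketch using the transformers fill-mask pipeline, assuming the same hub ID as above; the Persian prompt ("Tehran is the capital of [MASK].") is only an illustrative example:

```python
from transformers import pipeline

# The fill-mask pipeline loads the tokenizer and the masked-language-modeling head
fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# Illustrative prompt: "Tehran is the capital of [MASK]."
for prediction in fill_mask("تهران پایتخت [MASK] است."):
    print(prediction["token_str"], round(prediction["score"], 3))
```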
Core Capabilities
- Sentiment Analysis: Achieves 81.74% F1 on Digikala User Comments
- Text Classification: 93.59% accuracy on Digikala Magazine
- Named Entity Recognition: 98.79% F1 score on ARMAN dataset
- Outperforms multilingual BERT across all evaluated Persian NLP tasks (see the fine-tuning sketch after this list)
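The benchmark figures above come from task-specific fine-tuning; the base checkpoint ships only the encoder. Below is a minimal sketch of attaching a sequence-classification head for a sentiment task, assuming the transformers library and two hypothetical labeled Persian comments (1 = positive, 0 = negative). The head is randomly initialized and still needs fine-tuning on a real dataset such as Digikala user comments:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fresh 2-way classification head on top of the ParsBERT encoder (randomly initialized)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical toy batch: "The product quality was excellent" / "I regret this purchase"
texts = ["کیفیت محصول عالی بود", "از این خرید پشیمان شدم"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One forward pass returns the training loss and per-class logits
outputs = model(**batch, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, torch.Size([2, 2])
```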
Frequently Asked Questions
Q: What makes this model unique?
ParsBERT is the first comprehensive BERT model trained specifically for Persian language understanding. It combines extensive pre-processing with a large-scale Persian corpus, yielding superior performance compared to multilingual alternatives.
Q: What are the recommended use cases?
The model excels at sentiment analysis, text classification, and named entity recognition for Persian text. It is particularly suitable for applications involving user-comment analysis, news classification, and automated text understanding in Persian.
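For named entity recognition, the same encoder can be topped with a token-classification head. The sketch below is hypothetical: the label set is a reduced IOB scheme (not the full ARMAN tag set), and the head is untrained, so predictions are meaningless until the model is fine-tuned on an NER corpus:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "HooshvareLab/bert-base-parsbert-uncased"

# Illustrative IOB label subset; the real ARMAN tag set is larger
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Token-classification pipeline; outputs stay random until the head is fine-tuned
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("حافظ در شیراز متولد شد."))  # "Hafez was born in Shiraz."
```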