ParsBERT - Persian Language Understanding Model
| Property | Value |
|---|---|
| Research Paper | arXiv:2005.12515 |
| Developer | HooshvareLab |
| Framework Support | PyTorch, TensorFlow |
| Community Stats | 24,433 downloads, 31 likes |
What is bert-base-parsbert-uncased?
ParsBERT is a monolingual language model specifically designed for Persian language understanding, based on Google's BERT architecture. Trained on a diverse corpus of over 2M documents spanning scientific texts, novels, and news articles, it represents a significant advancement in Persian NLP capabilities.
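The checkpoint is distributed through the Hugging Face Hub, so a quick way to try it is the transformers library. The sketch below assumes the hub ID HooshvareLab/bert-base-parsbert-uncased (developer plus model name from the table above) and uses an illustrative Persian sentence, not one from the model card:

```python
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "HooshvareLab/bert-base-parsbert-uncased"  # assumed hub ID: developer/model-name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an illustrative Persian sentence ("This is an example sentence.")
inputs = tokenizer("این یک جمله نمونه است.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings from the BERT-Base encoder: (batch, tokens, 768)
print(outputs.last_hidden_state.shape)
```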
Implementation Details
The training pipeline applies extensive pre-processing that combines POS tagging and WordPiece segmentation, yielding more than 40M true sentences. The model follows the BERT-Base configuration and is trained with whole-word masking (see the fill-mask sketch after the list below).
- Comprehensive pre-training on varied Persian texts
- State-of-the-art performance across multiple NLP tasks
- Uncased tokenization with whole word masking
- Compatible with both PyTorch and TensorFlow frameworks
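Because ParsBERT is pre-trained with a masked-language-modeling objective, its MLM head can be probed directly. A minimal sketch using the transformers fill-mask pipeline, assuming the same hub ID as above; the Persian prompt ("Tehran is the capital of [MASK].") is only an illustrative example:

```python
from transformers import pipeline

# The fill-mask pipeline loads the tokenizer and the masked-language-modeling head
fill_mask = pipeline("fill-mask", model="HooshvareLab/bert-base-parsbert-uncased")

# Illustrative prompt: "Tehran is the capital of [MASK]."
for prediction in fill_mask("تهران پایتخت [MASK] است."):
    print(prediction["token_str"], round(prediction["score"], 3))
```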
Core Capabilities
- Sentiment Analysis: Achieves 81.74% F1 on Digikala User Comments
- Text Classification: 93.59% accuracy on Digikala Magazine
- Named Entity Recognition: 98.79% F1 score on ARMAN dataset
- Outperforms multilingual BERT across all evaluated Persian NLP tasks (see the fine-tuning sketch after this list)
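The benchmark figures above come from task-specific fine-tuning; the base checkpoint ships only the encoder. Below is a minimal sketch of attaching a sequence-classification head for a sentiment task, assuming the transformers library and two hypothetical labeled Persian comments (1 = positive, 0 = negative). The head is randomly initialized and still needs fine-tuning on a real dataset such as Digikala user comments:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "HooshvareLab/bert-base-parsbert-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Fresh 2-way classification head on top of the ParsBERT encoder (randomly initialized)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Hypothetical toy batch: "The product quality was excellent" / "I regret this purchase"
texts = ["کیفیت محصول عالی بود", "از این خرید پشیمان شدم"]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One forward pass returns the training loss and per-class logits
outputs = model(**batch, labels=labels)
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, torch.Size([2, 2])
```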
Frequently Asked Questions
Q: What makes this model unique?
ParsBERT is the first comprehensive BERT model trained specifically for Persian language understanding. It combines extensive pre-processing with a large-scale Persian corpus, yielding superior performance compared to multilingual alternatives.
Q: What are the recommended use cases?
The model excels at sentiment analysis, text classification, and named entity recognition for Persian text. It is particularly suitable for applications involving user-comment analysis, news classification, and automated text understanding in Persian.
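For named entity recognition, the same encoder can be topped with a token-classification head. The sketch below is hypothetical: the label set is a reduced IOB scheme (not the full ARMAN tag set), and the head is untrained, so predictions are meaningless until the model is fine-tuned on an NER corpus:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "HooshvareLab/bert-base-parsbert-uncased"

# Illustrative IOB label subset; the real ARMAN tag set is larger
labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Token-classification pipeline; outputs stay random until the head is fine-tuned
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("حافظ در شیراز متولد شد."))  # "Hafez was born in Shiraz."
```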