PhoBERT-base-v2
| Property | Value |
|---|---|
| Parameter Count | 135M |
| Architecture | RoBERTa Base |
| Maximum Length | 256 tokens |
| License | AGPL-3.0 |
| Training Data | 140GB (Wikipedia, news, OSCAR-2301) |
What is PhoBERT-base-v2?
PhoBERT-base-v2 is a pre-trained Vietnamese language model that builds on the original PhoBERT. It follows RoBERTa, which refines BERT's pre-training procedure, and is trained on a 140GB corpus combining 20GB of Wikipedia and news text with 120GB of Vietnamese text from OSCAR-2301, making it one of the most extensively pre-trained Vietnamese language models available.
Implementation Details
The model uses a RoBERTa-based architecture with 135M parameters, designed specifically for Vietnamese language understanding. It requires word-segmented input and loads directly through the Hugging Face transformers library, with support for both PyTorch and TensorFlow 2.0+; a minimal loading example is sketched after the feature list below.
- Pre-trained on a massive 140GB Vietnamese text corpus
- Implements RoBERTa's optimized training approach
- Supports maximum sequence length of 256 tokens
- Requires specialized Vietnamese word segmentation preprocessing
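As a concrete illustration, the sketch below follows the feature-extraction pattern from the official PhoBERT repository: it loads the model from the Hugging Face Hub under the `vinai/phobert-base-v2` model ID and encodes an already word-segmented sentence (multi-syllable Vietnamese words joined with underscores).

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load PhoBERT-base-v2 and its tokenizer from the Hugging Face Hub
phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# The input must already be word-segmented: multi-syllable Vietnamese
# words are joined with underscores (e.g. "nghiên_cứu_viên")
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

input_ids = torch.tensor([tokenizer.encode(sentence)])

with torch.no_grad():
    # Contextual embeddings are returned in last_hidden_state
    features = phobert(input_ids)
```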
Core Capabilities
- Part-of-speech tagging with state-of-the-art accuracy
- Dependency parsing for Vietnamese text
- Named-entity recognition
- Natural language inference
- Fill-mask prediction tasks
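As an illustration of the fill-mask capability listed above, here is a minimal sketch using the transformers fill-mask pipeline. The prompt sentence is an illustrative assumption; note that the prompt must be word-segmented and that PhoBERT's mask token is `<mask>`.

```python
from transformers import pipeline

# Build a fill-mask pipeline on top of PhoBERT-base-v2
fill_mask = pipeline("fill-mask", model="vinai/phobert-base-v2")

# Word-segmented prompt: "Hanoi is the capital of <mask>"
for pred in fill_mask("Hà_Nội là thủ_đô của <mask> ."):
    print(pred["token_str"], round(pred["score"], 4))
```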
Frequently Asked Questions
Q: What makes this model unique?
PhoBERT-base-v2 stands out for its pre-training on a large Vietnamese-specific corpus and its use of RoBERTa's improved pre-training procedure. It is designed specifically for Vietnamese language processing and achieves state-of-the-art performance across multiple Vietnamese NLP tasks.
Q: What are the recommended use cases?
The model is well suited to Vietnamese language processing tasks, including part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. Because it requires word-segmented input, raw text should first be segmented with the RDRSegmenter from VnCoreNLP, as sketched below.
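For that preprocessing step, a minimal sketch using the py_vncorenlp wrapper (the route recommended in the PhoBERT repository; assumes Java is available and the package is installed via `pip install py_vncorenlp`) might look like this. The save directory and example sentence are illustrative assumptions.

```python
import py_vncorenlp

# One-time download of the VnCoreNLP jar and models into a local folder
# (illustrative location; the wrapper expects an absolute path)
py_vncorenlp.download_model(save_dir="/tmp/vncorenlp")

# Load only the word-segmentation annotator (RDRSegmenter)
rdrsegmenter = py_vncorenlp.VnCoreNLP(annotators=["wseg"], save_dir="/tmp/vncorenlp")

# Example raw sentence; word_segment returns underscore-joined output,
# which is exactly the input format PhoBERT's tokenizer expects
text = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội."
print(rdrsegmenter.word_segment(text))
```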