Maintained By
vinai

PhoBERT-base-v2

Property          Value
Parameter Count   135M
Architecture      RoBERTa Base
Maximum Length    256 tokens
License           AGPL-3.0
Training Data     140GB (Wikipedia, News, OSCAR-2301)

What is PhoBERT-base-v2?

PhoBERT-base-v2 is an advanced Vietnamese language model that builds upon the success of the original PhoBERT architecture. Built on RoBERTa, which refines BERT's pre-training procedure, the model marks a significant step forward in Vietnamese natural language processing. It is trained on an extensive 140GB dataset, combining 20GB of Wikipedia and news texts with 120GB from OSCAR-2301, making it one of the most comprehensively trained Vietnamese language models available.

Implementation Details

The model implements a RoBERTa-based architecture with 135M parameters, designed specifically for Vietnamese language understanding. It requires word-segmented input and integrates seamlessly with the Hugging Face transformers library. The model supports both PyTorch and TensorFlow 2.0+ implementations.

  • Pre-trained on a massive 140GB Vietnamese text corpus
  • Implements RoBERTa's optimized training approach
  • Supports maximum sequence length of 256 tokens
  • Requires specialized Vietnamese word segmentation preprocessing
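The points above can be sketched as a minimal feature-extraction example with the transformers library; the Hub identifier `vinai/phobert-base-v2` comes from the model card, while the example sentence is illustrative and assumes the input has already been word-segmented (syllables of a multi-syllable word are joined with underscores):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the pre-trained model and its tokenizer from the Hugging Face Hub
phobert = AutoModel.from_pretrained("vinai/phobert-base-v2")
tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base-v2")

# Input must already be word-segmented: "nghiên_cứu_viên" is one word
sentence = "Chúng_tôi là những nghiên_cứu_viên ."

# Encode to subword IDs and extract contextual features
input_ids = torch.tensor([tokenizer.encode(sentence)])
with torch.no_grad():
    outputs = phobert(input_ids)

features = outputs.last_hidden_state  # shape: (1, seq_len, 768)
```

The last hidden states can then feed a task-specific head (tagging, parsing, classification) during fine-tuning.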

Core Capabilities

  • Part-of-speech tagging with state-of-the-art accuracy
  • Dependency parsing for Vietnamese text
  • Named-entity recognition
  • Natural language inference
  • Fill-mask prediction tasks
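Of these capabilities, fill-mask prediction can be tried directly with the transformers pipeline API; the prompt below is an illustrative, pre-segmented Vietnamese sentence, not taken from the model card:

```python
from transformers import pipeline

# Fill-mask pipeline backed by PhoBERT-base-v2
fill_mask = pipeline("fill-mask", model="vinai/phobert-base-v2")

# Pre-segmented input with the model's <mask> token
results = fill_mask("Hà_Nội là thủ_đô của <mask> .")

# Each candidate carries the predicted token and its score
for r in results:
    print(r["token_str"], round(r["score"], 4))
```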

Frequently Asked Questions

Q: What makes this model unique?

PhoBERT-base-v2 stands out for its extensive training on Vietnamese-specific data and its optimization using RoBERTa's approach. It's specifically designed for Vietnamese language processing and achieves state-of-the-art performance across multiple NLP tasks.

Q: What are the recommended use cases?

The model is ideal for Vietnamese language processing tasks including part-of-speech tagging, dependency parsing, named-entity recognition, and natural language inference. It requires word-segmented input, and it's recommended to use the RDRSegmenter from VnCoreNLP for preprocessing raw text.