bert-base-thai
| Property | Value |
|---|---|
| Parameter Count | 106M |
| Model Type | BERT |
| Papers | WordPiece Paper, BPE Paper |
| Training Data | Thai Wikipedia Corpus |
Training Data | Thai Wikipedia Corpus |
What is bert-base-thai?
bert-base-thai is a BERT model adapted specifically for Thai language processing. Built on the BERT-Base architecture, it was trained on approximately 2 million sentences and 40 million words from Thai Wikipedia. To handle the characteristics of Thai text, which is written without spaces between words, it uses a specialized tokenization pipeline based on SentencePiece and BPE (Byte-Pair Encoding).
Implementation Details
The model uses a custom tokenization pipeline that combines SentencePiece and BPE with a vocabulary of 25,000 tokens. Pre-training ran for 1 million steps on a Tesla K80 GPU, with the best-performing checkpoint reached at 0.8 million steps.
- Custom Thai tokenization implementation
- Trained on cleaned Thai Wikipedia corpus
- Incorporates sentence segmentation heuristics
- Uses combined BPE and SentencePiece tokenization (see the tokenization sketch below)
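The snippet below is a minimal sketch of that subword step, assuming the sentencepiece Python package and a Thai SentencePiece/BPE model file. The file name th_bpe_25000.model is a placeholder, not the actual artifact shipped with the checkpoint.

```python
# Minimal sketch of the SentencePiece/BPE subword step on raw Thai text.
# "th_bpe_25000.model" is a placeholder path, not the real artifact name.
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.load("th_bpe_25000.model")  # hypothetical 25,000-token SentencePiece model

# Thai is written without spaces between words, so subword segmentation
# must be learned rather than derived from whitespace.
text = "ภาษาไทยไม่มีการเว้นวรรคระหว่างคำ"

pieces = sp.encode_as_pieces(text)  # list of subword piece strings
ids = sp.encode_as_ids(text)        # integer ids into the 25,000-token vocabulary

print(pieces)
print(ids)
```

In the full pipeline, these ids are then wrapped in BERT's standard input format ([CLS]/[SEP] tokens and segment ids) before being fed to the encoder.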
Core Capabilities
- Thai text feature extraction and representation (see the feature-extraction sketch after this list)
- Superior performance on the XNLI task (68.9% accuracy)
- Effective for review classification tasks
- Handles Thai-specific linguistic characteristics
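As a feature extractor, the model can be used like any other BERT encoder. The sketch below assumes a transformers-compatible copy of the checkpoint is available at a local placeholder path ("path/to/bert-base-thai" is not an official model ID) with a matching tokenizer bundled alongside it.

```python
# Hedged sketch of Thai sentence feature extraction with Hugging Face transformers.
# "path/to/bert-base-thai" is a placeholder for a locally converted checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = "path/to/bert-base-thai"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
model.eval()

sentence = "ร้านนี้อาหารอร่อยมาก"  # "The food at this restaurant is very tasty."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Token-level contextual embeddings, shape (batch, seq_len, hidden_size=768)
token_embeddings = outputs.last_hidden_state
# One simple sentence representation: mean-pool over the token embeddings
sentence_embedding = token_embeddings.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 768])
```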
Frequently Asked Questions
Q: What makes this model unique?
This model addresses the specific challenges of Thai language processing, particularly word segmentation, which had been a limitation of multilingual BERT models. It achieves this through custom tokenization and Thai-specific pre-training.
Q: What are the recommended use cases?
The model excels in Thai language tasks such as natural language inference (XNLI) and sentiment classification. It's particularly effective for tasks requiring deep understanding of Thai text structure and semantics.
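For tasks like review classification, the usual approach is to fine-tune the encoder with a classification head. The sketch below is illustrative only: it assumes a transformers-compatible copy of the checkpoint at a placeholder path, uses a toy two-example batch, and picks arbitrary hyperparameters.

```python
# Illustrative fine-tuning sketch for Thai review classification.
# Placeholder checkpoint path, toy data, and arbitrary hyperparameters.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_path = "path/to/bert-base-thai"  # placeholder, not an official model ID
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path, num_labels=2)

# Toy labelled reviews: 1 = positive, 0 = negative
texts = [
    "อาหารอร่อยมาก บริการดีเยี่ยม",   # "The food is delicious, excellent service."
    "รอนานมาก อาหารเย็นชืด",          # "Waited a long time, the food was cold."
]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):  # a few passes over the toy batch, just to show the loop
    optimizer.zero_grad()
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**batch).logits.argmax(dim=-1)
print(preds)  # predicted class ids for the two reviews
```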