bert-base-thai

Maintained by: monsoon-nlp

Property          Value
Parameter Count   106M
Model Type        BERT
Papers            WordPiece Paper, BPE Paper
Training Data     Thai Wikipedia Corpus

What is bert-base-thai?

bert-base-thai is a BERT model adapted specifically for Thai language processing. Built on the BERT-Base architecture, it was trained on approximately 2 million sentences and 40 million words from Thai Wikipedia. Because Thai script is written without spaces between words, the model uses a tokenization pipeline based on SentencePiece and BPE (Byte-Pair Encoding) to segment text into subword units.
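
The checkpoint is distributed through the Hugging Face Hub. Below is a minimal loading sketch, assuming the monsoon-nlp/bert-base-thai repository works with the transformers Auto* classes; the original project relies on a custom SentencePiece-based Thai tokenizer, so verify the subword output before depending on it.

```python
# A minimal loading sketch, assuming the checkpoint is hosted on the
# Hugging Face Hub as "monsoon-nlp/bert-base-thai" and is compatible with
# the Auto* classes; the project's custom SentencePiece tokenizer may be
# required instead, so check the tokenization output first.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")
model = AutoModel.from_pretrained("monsoon-nlp/bert-base-thai")

# "The food at this restaurant is very tasty."
inputs = tokenizer("อาหารร้านนี้อร่อยมาก", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```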

Implementation Details

The model uses a custom tokenization pipeline that combines SentencePiece and BPE with a vocabulary of 25,000 tokens (a tokenization sketch follows the list below). Pre-training ran for 1 million steps on a Tesla K80 GPU, with the best-performing checkpoint reached at 0.8 million steps.

  • Custom Thai tokenization implementation
  • Trained on cleaned Thai Wikipedia corpus
  • Incorporates sentence segmentation heuristics
  • Uses combined BPE and SentencePiece tokenization
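
As a sketch of the subword step, the snippet below runs a SentencePiece model over Thai text; the model file name is hypothetical and stands in for the project's trained 25,000-token vocabulary.

```python
# A hedged tokenization sketch; "thai_wiki_bpe.model" is a hypothetical
# file name standing in for the project's trained SentencePiece/BPE model
# with its 25,000-token vocabulary.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="thai_wiki_bpe.model")

# "Our school is pleasant."
pieces = sp.encode("โรงเรียนของเราน่าอยู่", out_type=str)
print(pieces)              # subword pieces, e.g. ['▁โรงเรียน', 'ของ', ...]
print(sp.get_piece_size()) # vocabulary size; 25,000 for this model
```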

Core Capabilities

  • Thai text feature extraction and representation (see the embedding sketch after this list)
  • 68.9% accuracy on the Thai XNLI task, ahead of multilingual BERT baselines
  • Effective for review classification tasks
  • Handles Thai-specific linguistic characteristics
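
The sketch below shows one common way to turn the encoder's hidden states into fixed-size sentence embeddings for feature extraction; mean pooling is an illustrative choice here, not a method prescribed by the model card.

```python
# Mean-pooled sentence embeddings; an illustrative recipe, assuming the
# checkpoint loads through the transformers Auto* classes as above.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")
model = AutoModel.from_pretrained("monsoon-nlp/bert-base-thai")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)   # zero out padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embed("อาหารอร่อยมาก").shape)  # torch.Size([1, 768])
```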

Frequently Asked Questions

Q: What makes this model unique?

This model addresses the specific challenges of Thai language processing, particularly word segmentation, which limited earlier multilingual BERT models on Thai text. It achieves this through custom tokenization and Thai-specific pre-training.

Q: What are the recommended use cases?

The model excels at Thai language tasks such as natural language inference (XNLI) and sentiment or review classification. It is particularly effective for tasks that require a deep understanding of Thai text structure and semantics; a fine-tuning sketch follows.
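
As an illustration of the classification use case, here is a hedged fine-tuning sketch: the two-label setup, example sentences, and labels are hypothetical, and a real run would add an optimizer, a labeled dataset, and multiple epochs.

```python
# A hedged fine-tuning sketch for Thai review classification; the labels
# and example sentences are hypothetical, for illustration only.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")
model = AutoModelForSequenceClassification.from_pretrained(
    "monsoon-nlp/bert-base-thai", num_labels=2  # e.g. positive vs. negative
)

# "Very tasty" / "Very bad" with hypothetical sentiment labels.
batch = tokenizer(["อร่อยมาก", "แย่มาก"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative
loss = model(**batch, labels=labels).loss  # cross-entropy over the classes
loss.backward()  # an optimizer step would follow in a real training loop
print(float(loss))
```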
