KLUE BERT Base
| Property | Value |
|---|---|
| Parameter Count | 111M |
| License | cc-by-sa-4.0 |
| Research Paper | View Paper |
| Training Data Size | 62GB of Korean text |
| Primary Language | Korean |
What is KLUE BERT base?
KLUE BERT base is a Korean language model developed as part of the Korean Language Understanding Evaluation (KLUE) benchmark initiative. Built on the BERT architecture, it was pre-trained on a diverse 62GB Korean corpus spanning formal articles, web content, and colloquial text drawn from sources such as the MODU Corpus, CC-100-Kor, NAMUWIKI, and Korean news articles.
Implementation Details
The model employs a morpheme-based subword tokenization method that combines morphological analysis with byte pair encoding (BPE). It uses the Mecab-ko analyzer for morpheme segmentation and the Hugging Face Tokenizers library for BPE segmentation, yielding a vocabulary of 32k tokens.
- Custom tokenization pipeline combining morphological analysis and BPE
- Trained on TPU v3-8 infrastructure
- Implements advanced preprocessing techniques including noise filtering and sentence splitting
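The pretrained weights are published on the Hugging Face Hub, so the encoder can be loaded with the standard `transformers` API. Below is a minimal sketch assuming the checkpoint identifier `klue/bert-base`; the example sentence is purely illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the KLUE BERT base checkpoint from the Hugging Face Hub
# (model ID assumed to be "klue/bert-base").
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")

# The tokenizer uses the 32k morpheme-aware subword vocabulary described above.
text = "한국어 자연어 처리는 재미있다."  # illustrative example sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: one 768-dimensional vector per subword token.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, seq_len, 768])
```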
Core Capabilities
- Topic Classification (TC) with 85.73% F1 score
- Named Entity Recognition (NER) with 83.97% Entity F1 score
- Natural Language Inference (NLI) with 81.63% accuracy
- Semantic Textual Similarity (STS) with 90.85% Pearson correlation
- Machine Reading Comprehension (MRC)
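The scores above come from fine-tuning the pretrained encoder with a task-specific head on each KLUE task. The sketch below shows what such a setup might look like for topic classification using the generic `transformers` sequence classification head; the label count, example texts, and hyperparameters are illustrative assumptions, not the settings reported by KLUE:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_TOPICS = 7  # assumption: number of topic classes in the downstream dataset

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/bert-base", num_labels=NUM_TOPICS
)

# Illustrative mini-batch; in practice this would come from the topic
# classification training set.
texts = ["삼성전자, 새 반도체 공장 착공", "손흥민, 리그 10호 골 기록"]
labels = torch.tensor([0, 1])  # placeholder label ids

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```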
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its specialized Korean language understanding capabilities, achieved through a novel morpheme-based tokenization approach and comprehensive training on diverse Korean text sources.
Q: What are the recommended use cases?
The model excels in Korean language tasks including topic classification, semantic analysis, named entity recognition, and natural language inference. It's particularly suitable for academic research and production applications requiring deep Korean language understanding.
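For a quick sanity check of the pretrained model before any fine-tuning, the standard fill-mask pipeline can be pointed at the checkpoint, assuming the published weights include the masked-language-model head; the prompt below is illustrative:

```python
from transformers import pipeline

# Masked-token prediction with the pretrained MLM head (no fine-tuning).
fill_mask = pipeline("fill-mask", model="klue/bert-base")

# "[MASK]" is the mask token used by BERT-style tokenizers.
for prediction in fill_mask("대한민국의 수도는 [MASK]이다."):
    print(prediction["token_str"], round(prediction["score"], 3))
```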