KLUE BERT Base
| Property | Value |
|---|---|
| Parameter Count | 111M |
| License | cc-by-sa-4.0 |
| Research Paper | View Paper |
| Training Data Size | 62GB of Korean text |
| Primary Language | Korean |
What is KLUE BERT base?
KLUE BERT base is a Korean language model developed as part of the Korean Language Understanding Evaluation (KLUE) benchmark initiative. Built on the BERT architecture, it was pre-trained on a diverse 62GB Korean corpus spanning formal articles, web content, and colloquial text drawn from sources such as the MODU Corpus, CC-100-Kor, NAMUWIKI, and Korean news articles.
Implementation Details
The model employs a morpheme-based subword tokenization method that combines morphological analysis with byte pair encoding (BPE). It uses the Mecab-ko analyzer for morpheme segmentation and the Hugging Face Tokenizers library for BPE segmentation, yielding a vocabulary of 32k tokens.
- Custom tokenization pipeline combining morphological analysis and BPE
- Trained on TPU v3-8 infrastructure
- Implements advanced preprocessing techniques including noise filtering and sentence splitting
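The pretrained weights are published on the Hugging Face Hub, so the encoder can be loaded with the standard `transformers` API. Below is a minimal sketch assuming the checkpoint identifier `klue/bert-base`; the example sentence is purely illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load the KLUE BERT base checkpoint from the Hugging Face Hub
# (model ID assumed to be "klue/bert-base").
tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModel.from_pretrained("klue/bert-base")

# The tokenizer uses the 32k morpheme-aware subword vocabulary described above.
text = "한국어 자연어 처리는 재미있다."  # illustrative example sentence
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: one 768-dimensional vector per subword token.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)  # torch.Size([1, seq_len, 768])
```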
Core Capabilities
- Topic Classification (TC) with 85.73% F1 score
- Named Entity Recognition (NER) with 83.97% Entity F1 score
- Natural Language Inference (NLI) with 81.63% accuracy
- Semantic Textual Similarity (STS) with 90.85% Pearson correlation
- Machine Reading Comprehension (MRC)
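The scores above come from fine-tuning the pretrained encoder with a task-specific head on each KLUE task. The sketch below shows what such a setup might look like for topic classification using the generic `transformers` sequence classification head; the label count, example texts, and hyperparameters are illustrative assumptions, not the settings reported by KLUE:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_TOPICS = 7  # assumption: number of topic classes in the downstream dataset

tokenizer = AutoTokenizer.from_pretrained("klue/bert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "klue/bert-base", num_labels=NUM_TOPICS
)

# Illustrative mini-batch; in practice this would come from the topic
# classification training set.
texts = ["삼성전자, 새 반도체 공장 착공", "손흥민, 리그 10호 골 기록"]
labels = torch.tensor([0, 1])  # placeholder label ids

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```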
Frequently Asked Questions
Q: What makes this model unique?
The model's distinctive feature is its specialized Korean language understanding capabilities, achieved through a novel morpheme-based tokenization approach and comprehensive training on diverse Korean text sources.
Q: What are the recommended use cases?
The model excels in Korean language tasks including topic classification, semantic analysis, named entity recognition, and natural language inference. It's particularly suitable for academic research and production applications requiring deep Korean language understanding.
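For a quick sanity check of the pretrained model before any fine-tuning, the standard fill-mask pipeline can be pointed at the checkpoint, assuming the published weights include the masked-language-model head; the prompt below is illustrative:

```python
from transformers import pipeline

# Masked-token prediction with the pretrained MLM head (no fine-tuning).
fill_mask = pipeline("fill-mask", model="klue/bert-base")

# "[MASK]" is the mask token used by BERT-style tokenizers.
for prediction in fill_mask("대한민국의 수도는 [MASK]이다."):
    print(prediction["token_str"], round(prediction["score"], 3))
```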