bert-base

Maintained by: klue

KLUE BERT Base

Property              Value
Parameter Count       111M parameters
License               cc-by-sa-4.0
Research Paper        View Paper
Training Data Size    62GB Korean text
Primary Language      Korean
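
As a rough sanity check on the 111M figure, the sketch below counts the weights of a standard BERT-base encoder with the 32k-token vocabulary described under Implementation Details. The hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types) and the exact vocabulary size of 32,000 are assumptions taken from the usual BERT-base configuration, not values stated on this page, and the helper name is hypothetical.

```python
def bert_parameter_count(vocab_size=32_000, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count encoder weights for a BERT-style model (no pre-training heads)."""
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers of the feed-forward block plus LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer applied to the [CLS] position.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_parameter_count()
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
# 110,617,344 parameters (~111M)
```

Under these assumptions the embedding tables account for roughly 25M of the total, so the vocabulary size is the main lever separating this count from other BERT-base variants.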

What is KLUE BERT base?

KLUE BERT base is a Korean language model developed as part of the Korean Language Understanding Evaluation (KLUE) Benchmark initiative. Built on the BERT architecture, it was pre-trained on a 62GB Korean text corpus that covers formal articles, web content, and colloquial text, drawn from sources such as MODU Corpus, CC-100-Kor, NAMUWIKI, and Korean news articles.

Implementation Details

The model uses morpheme-based subword tokenization, which combines morphological analysis with byte pair encoding (BPE). Text is first segmented into morphemes with the Mecab-ko analyzer, and the Hugging Face Tokenizers library then applies BPE segmentation, yielding a vocabulary of 32k tokens.

  • Custom tokenization pipeline combining morphological analysis and BPE
  • Trained on TPU v3-8 infrastructure
  • Corpus preprocessing includes noise filtering and sentence splitting
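
The idea of splitting into morphemes first and applying BPE second can be illustrated with a toy sketch. This is not the KLUE pipeline: the `TOY_MORPHEMES` lookup is a hypothetical stand-in for the Mecab-ko analyzer, the BPE learner is a minimal hand-written version rather than the Hugging Face Tokenizers implementation, and details such as continuation markers and special tokens are left out. The point it shows is that merges are learned and applied inside morphemes only, so no subword spans a morpheme boundary.

```python
from collections import Counter

# Hypothetical stand-in for the Mecab-ko analyzer: a toy lookup table.
TOY_MORPHEMES = {
    "학교에": ["학교", "에"],
    "학교를": ["학교", "를"],
    "학생이": ["학생", "이"],
    "학생을": ["학생", "을"],
}


def split_morphemes(word):
    """Return the morphemes of a word; unknown words stay whole."""
    return TOY_MORPHEMES.get(word, [word])


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the joined symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(morphemes, num_merges):
    """Learn up to `num_merges` BPE merges, counting pairs within morphemes only."""
    vocab = Counter(tuple(m) for m in morphemes)  # character sequences with counts
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:  # every morpheme is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def encode(text, merges):
    """Tokenize text: morpheme split first, then learned merges in order."""
    tokens = []
    for word in text.split():
        for morpheme in split_morphemes(word):
            symbols = tuple(morpheme)
            for pair in merges:
                symbols = merge_pair(symbols, pair)
            tokens.extend(symbols)
    return tokens


corpus = "학교에 학생이 학교를 학생을"
morphemes = [m for w in corpus.split() for m in split_morphemes(w)]
merges = learn_bpe(morphemes, num_merges=10)
print(encode("학교에 학생이", merges))  # ['학교', '에', '학생', '이']
```

Plain BPE over whitespace-separated words could learn a merge such as "교에" that glues a noun ending to a particle. Counting pairs per morpheme rules that out by construction.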

Core Capabilities

  • Topic Classification (TC) with 85.73% F1 score
  • Named Entity Recognition (NER) with 83.97% Entity F1 score
  • Natural Language Inference (NLI) with 81.63% accuracy
  • Semantic Textual Similarity (STS) with 90.85% Pearson correlation
  • Machine Reading Comprehension (MRC)
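
For readers who want to reproduce this kind of number on their own predictions, the sketch below computes two of the metrics named above with the standard library only. The page does not say how the F1 score is averaged, so macro averaging over classes is an assumption here. Entity-level F1 for NER needs span matching and is not covered. Both function names are illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy data, not KLUE results.
print(pearson([1.0, 2.0, 3.0, 4.0], [1.5, 1.0, 3.5, 4.0]))
print(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]))
```

Multiplying either value by 100 gives the percentage form used in the list above.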

Frequently Asked Questions

Q: What makes this model unique?

The model is built specifically for Korean. It pairs morpheme-based subword tokenization (Mecab-ko segmentation followed by BPE) with pre-training on a 62GB corpus that spans formal articles, web content, and colloquial text.

Q: What are the recommended use cases?

The model is suited to Korean language tasks such as topic classification, semantic textual similarity, named entity recognition, and natural language inference. It can serve both academic research and production applications that need Korean language understanding.
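
As a starting point for these use cases, here is a minimal sketch of extracting sentence vectors with the Hugging Face Transformers library. The hub identifier `klue/bert-base` is an assumption inferred from the model and maintainer names at the top of this page, the helper name `embed_sentences` is hypothetical, and taking the [CLS] position as a sentence representation is one common choice, not something this page prescribes. Running it requires `torch` and `transformers` to be installed, and network access on the first call.

```python
def embed_sentences(sentences, model_name="klue/bert-base"):
    """Return one [CLS] vector per input sentence from the pre-trained encoder."""
    # Imported lazily so that defining this helper needs no heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # `model_name` is an assumed hub ID; swap in a local path if you have one.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    inputs = tokenizer(sentences, padding=True, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # Hidden state at position 0 is the [CLS] token of each sentence.
    return outputs.last_hidden_state[:, 0]
```

Calling `embed_sentences(["오늘 날씨가 좋다", "학교에 간다"])` should return a tensor with one row per sentence. For classification, NER, or NLI, the usual route is to fine-tune a task-specific head on top of the encoder rather than use these vectors directly.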
