# KcBERT-base
| Property | Value |
|---|---|
| Parameter Count | 110M |
| License | Apache 2.0 |
| Paper | Research Paper |
| Author | beomi |
## What is KcBERT-base?
KcBERT-base is a Korean language model trained on a large corpus of news comments and replies collected between 2019 and 2020. Unlike traditional Korean BERT models, which focus on formal text, KcBERT is designed specifically to handle informal language, including social media content, colloquialisms, and emojis.
## Implementation Details
The model was trained on 12.5GB of text data (89 million sentences) with a vocabulary of 30,000 tokens. It uses the standard BERT-base architecture: 12 hidden layers, a hidden size of 768, and 12 attention heads.
- Preprocessed to handle Korean text, English, special characters, and emojis
- Uses BertWordPieceTokenizer for tokenization (see the loading sketch after this list)
- Trained on TPU v3-8 hardware
- Achieves 89.62% accuracy on NSMC (movie review) classification
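
The snippet below is a minimal sketch of loading the model and tokenizing an informal comment with Hugging Face `transformers`. The Hub ID `beomi/kcbert-base` is an assumption inferred from the author name listed above, not something stated in this card; verify it before use.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hub ID assumed from the listed author; confirm on the Hugging Face Hub.
MODEL_ID = "beomi/kcbert-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Informal Korean comment with slang and an emoji, the kind of text KcBERT targets.
# ("This movie is so much fun lol 😂")
text = "이 영화 진짜 재밌다 ㅋㅋ 😂"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=300)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```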
## Core Capabilities
- Masked Language Modeling for Korean text (demonstrated in the sketch after this list)
- Handles informal language and social media content
- Supports both Korean and English characters
- Emoji-aware processing
- Maximum sequence length of 300 tokens
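
Because the model is trained with a masked-language-modeling objective, it can fill in a `[MASK]` token directly. A minimal sketch, again assuming the `beomi/kcbert-base` Hub ID:

```python
from transformers import pipeline

# fill-mask pipeline; the Hub ID is an assumption (see above).
fill = pipeline("fill-mask", model="beomi/kcbert-base")

# "The weather today is really [MASK] lol"
for pred in fill("오늘 날씨 진짜 [MASK] ㅋㅋ"):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```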
## Frequently Asked Questions
Q: What makes this model unique?
KcBERT-base is specifically trained on Korean social media content and comments, making it especially effective for informal language processing, unlike traditional models trained on formal text.
Q: What are the recommended use cases?
The model excels in tasks involving social media analysis, sentiment analysis, and informal text processing. It's particularly well-suited for applications involving user-generated content like comments, reviews, and social media posts.
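
For sentiment-style tasks such as NSMC, a common pattern is to put a classification head on top of the encoder and fine-tune it. The sketch below illustrates that setup; the two-label configuration and the Hub ID are illustrative assumptions, and the new head is randomly initialized until fine-tuned on labeled data.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "beomi/kcbert-base"  # assumed Hub ID, see above
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 is an illustrative binary-sentiment setup; the classification
# head is freshly initialized and must be fine-tuned (e.g. on NSMC) before use.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

batch = tokenizer(
    ["배송 빠르고 좋아요 👍",    # "Fast shipping, great 👍"
     "별로예요... 다신 안 삼"],  # "Not good... never buying again"
    padding=True, truncation=True, max_length=300, return_tensors="pt",
)
logits = model(**batch).logits
print(logits.shape)  # torch.Size([2, 2]): two comments, two classes
```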