kcbert-base

beomi

KcBERT-base is a Korean BERT model with 110M parameters, trained on news comment data and optimized for informal text and social media content.

Property         Value
Parameter Count  110M
License          Apache 2.0
Paper            Research Paper
Author           beomi

What is KcBERT-base?

KcBERT-base is a specialized Korean language model trained on a massive dataset of news comments and replies collected between 2019 and 2020. Unlike traditional Korean BERT models that focus on formal text, KcBERT is specifically designed to handle informal language, including social media content, colloquialisms, and emojis.

Implementation Details

The model was trained on 12.5GB of text data containing 89 million sentences, using a vocabulary of 30,000 tokens. It implements the standard BERT architecture with 12 hidden layers, a hidden size of 768, and 12 attention heads.
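The WordPiece tokenization used here (via BertWordPieceTokenizer, see below) can be sketched as a greedy longest-match-first split of each word against the vocabulary. The toy vocabulary below is hypothetical for illustration; KcBERT's real vocabulary has 30,000 entries:

```python
# Illustrative sketch of WordPiece tokenization (greedy longest-match-first).
# TOY_VOCAB is a made-up vocabulary, not KcBERT's actual one.
TOY_VOCAB = {"[UNK]", "한국", "##어", "모델", "bert", "##base"}

def wordpiece(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one whitespace-delimited word into WordPiece sub-tokens."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first; continuations get "##".
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return [unk]  # no piece matched: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("한국어"))  # ['한국', '##어']
print(wordpiece("없는말"))  # ['[UNK]']
```

Because unseen slang still decomposes into known sub-word pieces, this scheme handles the noisy comment vocabulary better than a fixed word-level lookup.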

  • Preprocessed to handle Korean text, English, special characters, and emojis
  • Uses BertWordPieceTokenizer for tokenization
  • Trained on TPU v3-8 hardware
  • Achieves 89.62% accuracy on NSMC (movie review) classification
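The preprocessing step listed above can be sketched as a single character-filtering pass that keeps Hangul, ASCII, and emoji and normalizes whitespace. The exact character classes are assumptions for illustration, not KcBERT's published cleaning rules:

```python
import re

# Hypothetical cleaning sketch in the spirit described above: keep Hangul
# (syllables and jamo), printable ASCII (English text, digits, punctuation),
# and emoji; drop everything else, then collapse whitespace runs.
# The retained ranges are an assumption, not the model's actual regex.
DROP = re.compile("[^가-힣ㄱ-ㅣ\x20-\x7E\U0001F300-\U0001FAFF]+")

def clean(text: str) -> str:
    text = DROP.sub(" ", text)                 # remove out-of-set characters
    return re.sub(r"\s+", " ", text).strip()   # normalize whitespace

print(clean("좋아요👍 ملاحظة test!!"))  # non-Hangul/ASCII/emoji script dropped
```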

Core Capabilities

  • Masked Language Modeling for Korean text
  • Handles informal language and social media content
  • Supports both Korean and English characters
  • Emoji-aware processing
  • Maximum sequence length of 300 tokens

Frequently Asked Questions

Q: What makes this model unique?

KcBERT-base is specifically trained on Korean social media content and comments, making it especially effective for informal language processing, unlike traditional models trained on formal text.

Q: What are the recommended use cases?

The model excels in tasks involving social media analysis, sentiment analysis, and informal text processing. It's particularly well-suited for applications involving user-generated content like comments, reviews, and social media posts.
