KcBERT-base

Maintained by beomi

  • Parameter Count: 110M
  • License: Apache 2.0
  • Paper: Research Paper

What is KcBERT-base?

KcBERT-base is a Korean language model trained on a large corpus of news comments and replies collected between 2019 and 2020. Unlike traditional Korean BERT models, which focus on formal text, KcBERT is designed to handle informal language, including social media content, colloquialisms, and emojis.

Implementation Details

The model was trained on 12.5GB of text data containing 89 million sentences, using a vocabulary of 30,000 tokens. It implements a BERT architecture with 12 hidden layers, a hidden size of 768, and 12 attention heads.

  • Preprocessed to handle Korean text, English, special characters, and emojis
  • Uses BertWordPieceTokenizer for tokenization
  • Trained on TPU v3-8 hardware
  • Achieves 89.62% accuracy on NSMC (movie review) classification
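The architecture details above are enough to roughly reproduce the 110M parameter figure from the property table. The sketch below uses the standard BERT-base parameter breakdown (embeddings, encoder layers, pooler); it is a back-of-the-envelope estimate, not an exact count from the released checkpoint.

```python
# Rough parameter-count check for the configuration quoted above:
# 12 layers, hidden size 768, 30,000-token vocabulary, 300-token max length.
VOCAB, HIDDEN, LAYERS, MAX_POS, FFN = 30_000, 768, 12, 300, 4 * 768

# Embeddings: token + position + segment tables, plus one LayerNorm.
embeddings = (VOCAB + MAX_POS + 2) * HIDDEN + 2 * HIDDEN

# One encoder layer: Q/K/V/output projections, feed-forward, two LayerNorms.
attention = 4 * (HIDDEN * HIDDEN + HIDDEN)
feed_forward = (HIDDEN * FFN + FFN) + (FFN * HIDDEN + HIDDEN)
layer_norms = 2 * 2 * HIDDEN
per_layer = attention + feed_forward + layer_norms

# Pooler head applied to the final [CLS] representation.
pooler = HIDDEN * HIDDEN + HIDDEN

total = embeddings + LAYERS * per_layer + pooler
print(f"{total / 1e6:.1f}M parameters")  # ~108.9M, consistent with the quoted 110M
```

The total lands just under 109M, which the model card rounds to 110M.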

Core Capabilities

  • Masked Language Modeling for Korean text
  • Handles informal language and social media content
  • Supports both Korean and English characters
  • Emoji-aware processing
  • Maximum sequence length of 300 tokens
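The masked language modeling capability above can be exercised through the Hugging Face `transformers` fill-mask pipeline. This is a usage sketch: the predicted tokens and scores will vary, and the model download happens on first load, so the pipeline is only built when the file is run as a script.

```python
# Minimal fill-mask sketch for KcBERT-base via transformers.
MODEL_ID = "beomi/kcbert-base"
MAX_LEN = 300  # the model's maximum sequence length

def build_fill_mask():
    # Deferred import: requires the transformers package to be installed.
    from transformers import pipeline
    return pipeline("fill-mask", model=MODEL_ID)

if __name__ == "__main__":
    fill_mask = build_fill_mask()
    # BERT-style models use [MASK] as the mask token.
    # "이 영화 정말 [MASK] 재밌다" ≈ "This movie is really [MASK] fun"
    for pred in fill_mask("이 영화 정말 [MASK] 재밌다"):
        print(pred["token_str"], round(pred["score"], 3))
```

Inputs longer than 300 tokens must be truncated before inference, since that is the positional-embedding limit noted above.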

Frequently Asked Questions

Q: What makes this model unique?

KcBERT-base is specifically trained on Korean social media content and comments, making it especially effective for informal language processing, unlike traditional models trained on formal text.

Q: What are the recommended use cases?

The model excels in tasks involving social media analysis, sentiment analysis, and informal text processing. It's particularly well-suited for applications involving user-generated content like comments, reviews, and social media posts.
