# KcBERT-base
| Property | Value |
|---|---|
| Parameter Count | 110M |
| License | Apache 2.0 |
| Paper | Research Paper |
| Author | beomi |
## What is KcBERT-base?
KcBERT-base is a Korean language model trained on a large corpus of news comments and replies collected between 2019 and 2020. Unlike traditional Korean BERT models, which focus on formal text, KcBERT is designed specifically to handle informal language, including social media content, colloquialisms, and emojis.
## Implementation Details
The model was trained on 12.5GB of text data (89 million sentences) with a vocabulary of 30,000 tokens. It uses the standard BERT-base architecture: 12 hidden layers, a hidden size of 768, and 12 attention heads.
- Preprocessed to handle Korean text, English, special characters, and emojis
- Uses BertWordPieceTokenizer for tokenization (see the loading sketch after this list)
- Trained on TPU v3-8 hardware
- Achieves 89.62% accuracy on NSMC (movie review) classification
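
The snippet below is a minimal sketch of loading the model and tokenizing an informal comment with Hugging Face `transformers`. The Hub ID `beomi/kcbert-base` is an assumption inferred from the author name listed above, not something stated in this card; verify it before use.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hub ID assumed from the listed author; confirm on the Hugging Face Hub.
MODEL_ID = "beomi/kcbert-base"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Informal Korean comment with slang and an emoji, the kind of text KcBERT targets.
# ("This movie is so much fun lol 😂")
text = "이 영화 진짜 재밌다 ㅋㅋ 😂"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=300)
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]))
```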
## Core Capabilities
- Masked Language Modeling for Korean text (demonstrated in the sketch after this list)
- Handles informal language and social media content
- Supports both Korean and English characters
- Emoji-aware processing
- Maximum sequence length of 300 tokens
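
Because the model is trained with a masked-language-modeling objective, it can fill in a `[MASK]` token directly. A minimal sketch, again assuming the `beomi/kcbert-base` Hub ID:

```python
from transformers import pipeline

# fill-mask pipeline; the Hub ID is an assumption (see above).
fill = pipeline("fill-mask", model="beomi/kcbert-base")

# "The weather today is really [MASK] lol"
for pred in fill("오늘 날씨 진짜 [MASK] ㅋㅋ"):
    print(f"{pred['token_str']}\t{pred['score']:.3f}")
```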
## Frequently Asked Questions
Q: What makes this model unique?
KcBERT-base is specifically trained on Korean social media content and comments, making it especially effective for informal language processing, unlike traditional models trained on formal text.
Q: What are the recommended use cases?
The model excels in tasks involving social media analysis, sentiment analysis, and informal text processing. It's particularly well-suited for applications involving user-generated content like comments, reviews, and social media posts.
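
For sentiment-style tasks such as NSMC, a common pattern is to put a classification head on top of the encoder and fine-tune it. The sketch below illustrates that setup; the two-label configuration and the Hub ID are illustrative assumptions, and the new head is randomly initialized until fine-tuned on labeled data.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "beomi/kcbert-base"  # assumed Hub ID, see above
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 is an illustrative binary-sentiment setup; the classification
# head is freshly initialized and must be fine-tuned (e.g. on NSMC) before use.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

batch = tokenizer(
    ["배송 빠르고 좋아요 👍",    # "Fast shipping, great 👍"
     "별로예요... 다신 안 삼"],  # "Not good... never buying again"
    padding=True, truncation=True, max_length=300, return_tensors="pt",
)
logits = model(**batch).logits
print(logits.shape)  # torch.Size([2, 2]): two comments, two classes
```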