KoBART-base-v2

Maintained by: gogamza

  • Parameter Count: 124M
  • License: MIT
  • Paper: BART Paper
  • Language: Korean
  • Training Data: 40GB+ Korean text

What is kobart-base-v2?

KoBART-base-v2 is a Korean language model based on the BART architecture and designed for Korean text processing tasks. It is an encoder-decoder model trained on over 40GB of Korean text, including Wikipedia, news, books, and public datasets. Pre-training uses text infilling as the primary noising function, following the BART denoising objective, which prepares the model for both Korean language understanding and generation tasks.
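As a quick orientation, the sketch below shows one way to load the checkpoint through the Hugging Face transformers library. The hub id "gogamza/kobart-base-v2" is assumed from the maintainer and model name above.

```python
# Minimal loading sketch (assumes the hub id "gogamza/kobart-base-v2" and that
# the transformers and tokenizers packages are installed).
from transformers import PreTrainedTokenizerFast, BartModel

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartModel.from_pretrained("gogamza/kobart-base-v2")

print(model.config.model_type)  # "bart" -- KoBART reuses the BART architecture
print(tokenizer.vocab_size)     # expected to be 30000 per the card above
```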

Implementation Details

The model follows the standard BART encoder-decoder layout, with 6 layers and 16 attention heads in both the encoder and the decoder. It uses a Character BPE tokenizer with a vocabulary of 30,000 tokens, including special tokens for emoticons and emojis to better handle conversational text. The key dimensions are listed below; the sketch after the list shows how to read them from the published configuration.

  • Hidden dimensions: 768
  • FFN dimensions: 3072
  • Total parameters: 124M
  • Training data includes Korean Wiki (5M sentences) and other corpora (0.27B sentences)
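The configuration sketch below reads these numbers from the model's BartConfig. Attribute names follow the standard transformers BART config; the hub id is assumed as above.

```python
# Sketch: check the architecture numbers quoted above against the model config.
from transformers import BartConfig

config = BartConfig.from_pretrained("gogamza/kobart-base-v2")
print(config.d_model)                                 # hidden dimension, expected 768
print(config.encoder_ffn_dim)                         # FFN dimension, expected 3072
print(config.encoder_layers, config.decoder_layers)   # expected 6 and 6
print(config.encoder_attention_heads)                 # expected 16
print(config.vocab_size)                              # expected 30000
```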

Core Capabilities

  • Text feature extraction (see the sketch after this list)
  • 90.24% accuracy on the NSMC sentiment classification task
  • 81.66% Spearman correlation on KorSTS
  • 94.34% accuracy on Question Pair classification
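For the feature-extraction use listed above, a minimal sketch follows. The hub id and the Korean example sentence are illustrative assumptions.

```python
# Sketch: pull hidden states for a Korean sentence from the encoder-decoder model.
import torch
from transformers import PreTrainedTokenizerFast, BartModel

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartModel.from_pretrained("gogamza/kobart-base-v2")
model.eval()

inputs = tokenizer("안녕하세요. 한국어 BART 모델입니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states of the last layer: (batch, sequence_length, 768)
print(outputs.last_hidden_state.shape)
```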

Frequently Asked Questions

Q: What makes this model unique?

KoBART-base-v2 stands out for its training on a large Korean-specific corpus and for including conversational elements such as emoticons in its vocabulary. While the architecture itself is standard BART, the tokenizer and training data are tailored to Korean language understanding and generation.

Q: What are the recommended use cases?

The model is well-suited to Korean language tasks including text classification, summarization (after task-specific fine-tuning), and feature extraction. Reported results are strongest on sentiment analysis (NSMC) and question pair classification.
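For generation-style tasks such as summarization, the base checkpoint is typically wrapped with a conditional-generation head and then fine-tuned on a Korean summarization dataset. The sketch below shows that setup only; the example sentence and generation settings are illustrative, and outputs from the base checkpoint will be poor until it is fine-tuned.

```python
# Sketch: load the base checkpoint with a generation head as a starting point
# for summarization fine-tuning. Not a ready-made summarizer.
import torch
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartForConditionalGeneration.from_pretrained("gogamza/kobart-base-v2")
model.eval()

text = "KoBART는 한국어 텍스트를 위한 인코더-디코더 모델입니다."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(inputs["input_ids"], max_length=32, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```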
