KoBART-base-v2

Maintained by: gogamza

  • Parameter Count: 124M
  • License: MIT
  • Paper: BART Paper
  • Language: Korean
  • Training Data: 40GB+ Korean text

What is kobart-base-v2?

KoBART-base-v2 is a Korean language model based on the BART architecture and designed for Korean text processing tasks. It is an encoder-decoder model trained on over 40GB of Korean text, including Wikipedia, news, books, and public datasets. Pre-training uses text infilling as the primary noising function, following the BART denoising objective, which prepares the model for both Korean language understanding and generation tasks.
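As a quick orientation, the sketch below shows one way to load the checkpoint through the Hugging Face transformers library. The hub id "gogamza/kobart-base-v2" is assumed from the maintainer and model name above.

```python
# Minimal loading sketch (assumes the hub id "gogamza/kobart-base-v2" and that
# the transformers and tokenizers packages are installed).
from transformers import PreTrainedTokenizerFast, BartModel

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartModel.from_pretrained("gogamza/kobart-base-v2")

print(model.config.model_type)  # "bart" -- KoBART reuses the BART architecture
print(tokenizer.vocab_size)     # expected to be 30000 per the card above
```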

Implementation Details

The model follows the standard BART encoder-decoder layout, with 6 layers and 16 attention heads in both the encoder and the decoder. It uses a Character BPE tokenizer with a vocabulary of 30,000 tokens, including special tokens for emoticons and emojis to better handle conversational text. The key dimensions are listed below; the sketch after the list shows how to read them from the published configuration.

  • Hidden dimensions: 768
  • FFN dimensions: 3072
  • Total parameters: 124M
  • Training data includes Korean Wiki (5M sentences) and other corpora (0.27B sentences)
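The configuration sketch below reads these numbers from the model's BartConfig. Attribute names follow the standard transformers BART config; the hub id is assumed as above.

```python
# Sketch: check the architecture numbers quoted above against the model config.
from transformers import BartConfig

config = BartConfig.from_pretrained("gogamza/kobart-base-v2")
print(config.d_model)                                 # hidden dimension, expected 768
print(config.encoder_ffn_dim)                         # FFN dimension, expected 3072
print(config.encoder_layers, config.decoder_layers)   # expected 6 and 6
print(config.encoder_attention_heads)                 # expected 16
print(config.vocab_size)                              # expected 30000
```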

Core Capabilities

  • Text feature extraction (see the sketch after this list)
  • 90.24% accuracy on the NSMC sentiment classification task
  • 81.66% Spearman correlation on KorSTS
  • 94.34% accuracy on Question Pair classification
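For the feature-extraction use listed above, a minimal sketch follows. The hub id and the Korean example sentence are illustrative assumptions.

```python
# Sketch: pull hidden states for a Korean sentence from the encoder-decoder model.
import torch
from transformers import PreTrainedTokenizerFast, BartModel

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartModel.from_pretrained("gogamza/kobart-base-v2")
model.eval()

inputs = tokenizer("안녕하세요. 한국어 BART 모델입니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Hidden states of the last layer: (batch, sequence_length, 768)
print(outputs.last_hidden_state.shape)
```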

Frequently Asked Questions

Q: What makes this model unique?

KoBART-base-v2 stands out for its training on a large Korean-specific corpus and for including conversational elements such as emoticons in its vocabulary. While the architecture itself is standard BART, the tokenizer and training data are tailored to Korean language understanding and generation.

Q: What are the recommended use cases?

The model is well-suited to Korean language tasks including text classification, summarization (after task-specific fine-tuning), and feature extraction. Reported results are strongest on sentiment analysis (NSMC) and question pair classification.
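For generation-style tasks such as summarization, the base checkpoint is typically wrapped with a conditional-generation head and then fine-tuned on a Korean summarization dataset. The sketch below shows that setup only; the example sentence and generation settings are illustrative, and outputs from the base checkpoint will be poor until it is fine-tuned.

```python
# Sketch: load the base checkpoint with a generation head as a starting point
# for summarization fine-tuning. Not a ready-made summarizer.
import torch
from transformers import PreTrainedTokenizerFast, BartForConditionalGeneration

tokenizer = PreTrainedTokenizerFast.from_pretrained("gogamza/kobart-base-v2")
model = BartForConditionalGeneration.from_pretrained("gogamza/kobart-base-v2")
model.eval()

text = "KoBART는 한국어 텍스트를 위한 인코더-디코더 모델입니다."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    generated = model.generate(inputs["input_ids"], max_length=32, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```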
