bert-base-swedish-cased
| Property | Value |
|---|---|
| Author | KB (National Library of Sweden) |
| Framework Support | PyTorch, TensorFlow, JAX |
| Training Data | ~15-20 GB of text (200M sentences, 3B tokens) |
| Vocabulary Size | ~50,000 tokens |
What is bert-base-swedish-cased?
bert-base-swedish-cased is a BERT base language model pretrained for Swedish text processing. Developed by KBLab at the National Library of Sweden (KB), it was trained on a large dataset comprising books, news, government publications, Swedish Wikipedia, and internet forums, giving it broad coverage of contemporary Swedish for language understanding tasks.
Implementation Details
The model follows the original BERT base architecture as published by Google and incorporates whole word masking for improved performance. It is case-sensitive and can be loaded in a few lines with the Hugging Face Transformers library, as shown in the sketch after the list below.
- Trained with the same hyperparameters as the original BERT
- Implements whole word masking technique
- Case-sensitive tokenization
- Compatible with Transformers 2.4.1+ and PyTorch 1.3.1+
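A minimal loading sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the `KB/bert-base-swedish-cased` identifier; the example sentence is purely illustrative.

```python
# Minimal sketch: load the tokenizer and model, assuming the checkpoint
# is available on the Hugging Face Hub as "KB/bert-base-swedish-cased".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

# Encode a Swedish sentence and extract contextual token representations.
inputs = tokenizer("Kungliga biblioteket ligger i Stockholm.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```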
Core Capabilities
- Text representation for the Swedish language
- Masked language modeling (see the example after this list)
- Foundation for downstream Swedish NLP tasks
- Easily adaptable for fine-tuning
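The masked language modeling capability can be exercised directly with the Transformers `fill-mask` pipeline. The sketch below assumes the `KB/bert-base-swedish-cased` checkpoint ships with its pretraining MLM head; the Swedish prompt is only an example.

```python
# Hedged sketch of masked language modeling with the fill-mask pipeline,
# assuming the checkpoint includes the pretrained MLM head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

# BERT-style models use the [MASK] token; the pipeline returns the
# highest-scoring candidates for the masked position.
for prediction in fill_mask("Stockholm är Sveriges [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```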
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Swedish language processing and was trained on a diverse Swedish text corpus of roughly 3 billion tokens, making it one of the most broadly trained Swedish language models publicly available.
Q: What are the recommended use cases?
The model is well suited to Swedish text processing tasks such as text classification, named entity recognition (when fine-tuned), and general language understanding. It serves as a solid foundation for fine-tuning on specific Swedish NLP tasks; a minimal fine-tuning sketch follows.
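The sketch below illustrates fine-tuning for binary text classification with the Transformers `Trainer`. The tiny inline dataset, label scheme, and output directory are illustrative placeholders, not part of the model card; a real task would substitute a proper labeled dataset.

```python
# Illustrative fine-tuning sketch for Swedish text classification.
# The toy dataset and label meanings (1 = positive, 0 = negative) are
# placeholders chosen for this example only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Build and tokenize a tiny in-memory dataset for demonstration purposes.
data = Dataset.from_dict({
    "text": ["Filmen var fantastisk!", "Maten var tyvärr riktigt dålig."],
    "label": [1, 0],
}).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="swedish-cls",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```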