bert-base-swedish-cased
| Property | Value |
|---|---|
| Author | KB (National Library of Sweden) |
| Framework Support | PyTorch, TensorFlow, JAX |
| Training Data | ~15-20 GB of text (200M sentences, 3B tokens) |
| Vocabulary Size | ~50,000 tokens |
What is bert-base-swedish-cased?
bert-base-swedish-cased is a BERT base language model pretrained for Swedish text processing. Developed by KBLab at the National Library of Sweden (KB), it was trained on a large dataset comprising books, news, government publications, Swedish Wikipedia, and internet forums, giving it broad coverage of contemporary Swedish for language understanding tasks.
Implementation Details
The model follows the original BERT base architecture as published by Google and incorporates whole word masking for improved performance. It is case-sensitive and can be loaded in a few lines with the Hugging Face Transformers library, as shown in the sketch after the list below.
- Trained with the same hyperparameters as the original BERT
- Implements whole word masking technique
- Case-sensitive tokenization
- Compatible with Transformers 2.4.1+ and PyTorch 1.3.1+
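A minimal loading sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the `KB/bert-base-swedish-cased` identifier; the example sentence is purely illustrative.

```python
# Minimal sketch: load the tokenizer and model, assuming the checkpoint
# is available on the Hugging Face Hub as "KB/bert-base-swedish-cased".
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KB/bert-base-swedish-cased")
model = AutoModel.from_pretrained("KB/bert-base-swedish-cased")

# Encode a Swedish sentence and extract contextual token representations.
inputs = tokenizer("Kungliga biblioteket ligger i Stockholm.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```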
Core Capabilities
- Text representation for the Swedish language
- Masked language modeling (see the example after this list)
- Foundation for downstream Swedish NLP tasks
- Easily adaptable for fine-tuning
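The masked language modeling capability can be exercised directly with the Transformers `fill-mask` pipeline. The sketch below assumes the `KB/bert-base-swedish-cased` checkpoint ships with its pretraining MLM head; the Swedish prompt is only an example.

```python
# Hedged sketch of masked language modeling with the fill-mask pipeline,
# assuming the checkpoint includes the pretrained MLM head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="KB/bert-base-swedish-cased")

# BERT-style models use the [MASK] token; the pipeline returns the
# highest-scoring candidates for the masked position.
for prediction in fill_mask("Stockholm är Sveriges [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```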
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Swedish language processing and was trained on a diverse Swedish text corpus of roughly 3 billion tokens, making it one of the most broadly trained Swedish language models publicly available.
Q: What are the recommended use cases?
The model is well suited to Swedish text processing tasks such as text classification, named entity recognition (when fine-tuned), and general language understanding. It serves as a solid foundation for fine-tuning on specific Swedish NLP tasks; a minimal fine-tuning sketch follows.
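The sketch below illustrates fine-tuning for binary text classification with the Transformers `Trainer`. The tiny inline dataset, label scheme, and output directory are illustrative placeholders, not part of the model card; a real task would substitute a proper labeled dataset.

```python
# Illustrative fine-tuning sketch for Swedish text classification.
# The toy dataset and label meanings (1 = positive, 0 = negative) are
# placeholders chosen for this example only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "KB/bert-base-swedish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Build and tokenize a tiny in-memory dataset for demonstration purposes.
data = Dataset.from_dict({
    "text": ["Filmen var fantastisk!", "Maten var tyvärr riktigt dålig."],
    "label": [1, 0],
}).map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=64),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="swedish-cls",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```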