bert-base-finnish-cased-v1
| Property | Value |
|---|---|
| Parameter Count | 125M |
| Model Type | BERT Base (Cased) |
| Research Paper | arXiv:1912.07076 |
| Author | TurkuNLP |
| Training Data | 3B tokens (24B characters) |
What is bert-base-finnish-cased-v1?
bert-base-finnish-cased-v1 is a BERT model that TurkuNLP trained specifically for Finnish. It uses a custom 50,000-wordpiece vocabulary built from Finnish text rather than a vocabulary shared across many languages. The training corpus covers Finnish news, online discussions, and internet crawl data, which gives the model broader exposure to Finnish than earlier multilingual models had.
Implementation Details
The model was trained for 1 million steps on over 3 billion tokens of Finnish text. Its custom vocabulary covers Finnish words much better than the multilingual BERT vocabulary does, so Finnish words are split into more natural wordpieces.
- Custom 50,000 wordpiece vocabulary
- Trained on 24B characters of Finnish text
- Cased model variant (recommended for use)
- Compatible with standard BERT architecture
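As a rough check on the 125M parameter count listed at the top, the size of the model can be estimated from its vocabulary size. The sketch below assumes the usual BERT-base hyperparameters (hidden size 768, 12 layers, feed-forward size 3072, 512 positions, 2 segment types), which this page does not spell out, and a vocabulary of exactly 50,000 entries. The function name `bert_param_count` is made up for illustration.

```python
def bert_param_count(vocab_size: int,
                     hidden: int = 768,
                     layers: int = 12,
                     intermediate: int = 3072,
                     max_positions: int = 512,
                     type_vocab: int = 2) -> int:
    """Count weights and biases of a BERT encoder plus pooler (no task head).

    The default hyperparameters are the usual BERT-base values and are an
    assumption here, not something stated on this page.
    """
    layer_norm = 2 * hidden  # scale and bias vectors

    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Self-attention: query, key, value, and output projections, then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Feed-forward block: expand, contract, then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Pooler: one dense layer applied to the first token's representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


print(f"{bert_param_count(50_000):,}")  # prints 124,441,344
```

Under these assumptions the estimate is about 124.4M, close to the 125M listed above. The 50,000 token embeddings account for 38.4M of that total, which is why a larger vocabulary makes this model bigger than a BERT-base with a 30,522-entry vocabulary (about 109.5M by the same formula).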
Core Capabilities
- Named Entity Recognition (92.40% accuracy on FiNER corpus)
- Part-of-Speech Tagging (accuracy of 98.23% on TDT, 98.39% on FTB, 98.08% on PUD)
- Document Classification, where it outperforms multilingual BERT
- Fill-Mask task support
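The fill-mask capability can be tried with the Hugging Face `transformers` library. The following is a minimal sketch, not an official usage recipe: it assumes the model is published on the Hugging Face Hub as `TurkuNLP/bert-base-finnish-cased-v1` and that `transformers` and a backend such as PyTorch are installed. The helper names `top_tokens` and `fill_mask` are hypothetical.

```python
from typing import Dict, List

# Assumed Hub identifier; this page does not state it explicitly.
MODEL_ID = "TurkuNLP/bert-base-finnish-cased-v1"


def top_tokens(predictions: List[Dict], k: int = 3) -> List[str]:
    """Return the k highest-scoring candidate tokens from fill-mask output.

    Each prediction is expected to be a dict with at least the keys
    "score" and "token_str", as produced by the fill-mask pipeline.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked[:k]]


def fill_mask(text: str, k: int = 3) -> List[str]:
    """Fill a single [MASK] in `text`; downloads the model on first use."""
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import pipeline

    unmasker = pipeline("fill-mask", model=MODEL_ID)
    return top_tokens(unmasker(text, top_k=k), k)
```

A call such as `fill_mask("Helsinki on Suomen [MASK].")` would return the model's top candidates for the masked word. The text must contain exactly one `[MASK]` token, since the pipeline returns a nested list when there are several.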
Frequently Asked Questions
Q: What makes this model unique?
The model is trained only on Finnish and uses a vocabulary built for Finnish, so it tokenizes Finnish words much better than multilingual alternatives. It outperforms multilingual BERT on Finnish tasks such as named entity recognition, part-of-speech tagging, and document classification.
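One way to see the tokenization difference is to count how many wordpieces each tokenizer needs per word. The sketch below is an illustration under assumptions: the Hub identifiers passed to `compare_tokenizers` (for example `TurkuNLP/bert-base-finnish-cased-v1` and `bert-base-multilingual-cased`) are not given on this page, and both helper names are hypothetical. It relies on the BERT convention that a wordpiece starting with `##` continues the previous word.

```python
from typing import Dict, Sequence


def pieces_per_word(tokens: Sequence[str]) -> float:
    """Average number of wordpieces per word.

    A token that starts with '##' continues the previous word, so the
    number of words equals the number of tokens without that prefix.
    Punctuation tokens count as words in this simple measure.
    """
    words = sum(1 for t in tokens if not t.startswith("##"))
    return len(tokens) / words if words else 0.0


def compare_tokenizers(text: str, model_ids: Sequence[str]) -> Dict[str, float]:
    """Tokenize `text` with each model's tokenizer and report pieces per word."""
    # Imported lazily because transformers is a heavy, optional dependency.
    from transformers import AutoTokenizer

    return {
        model_id: pieces_per_word(
            AutoTokenizer.from_pretrained(model_id).tokenize(text)
        )
        for model_id in model_ids
    }
```

A value close to 1.0 means most words survive as a single piece, while higher values indicate heavier fragmentation. Running `compare_tokenizers` on the same Finnish sentences with both tokenizers gives a direct, if rough, comparison.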
Q: What are the recommended use cases?
The model suits Finnish language processing tasks such as named entity recognition, part-of-speech tagging, and document classification. It is a natural choice over multilingual BERT when an application works mainly with Finnish text.
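For token-level tasks such as named entity recognition and part-of-speech tagging, a common preparation step is aligning word-level labels with the wordpieces the tokenizer produces. The sketch below shows one widely used convention (label the first wordpiece of each word, ignore the rest); it is not taken from the model's authors. It assumes `word_ids` comes from a fast Hugging Face tokenizer, for example `tokenizer(words, is_split_into_words=True).word_ids()`, and the name `align_labels` is hypothetical.

```python
from typing import List, Optional, Sequence

# Label id that PyTorch's cross-entropy loss skips by default.
IGNORE_INDEX = -100


def align_labels(word_ids: Sequence[Optional[int]],
                 word_labels: Sequence[int]) -> List[int]:
    """Map word-level labels onto wordpieces.

    `word_ids` holds, for each wordpiece, the index of the word it came
    from, or None for special tokens such as [CLS] and [SEP]. The first
    piece of each word receives the word's label; continuation pieces
    and special tokens receive IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Three words with label ids 3, 0, 5; words 0 and 2 are split into two pieces.
print(align_labels([None, 0, 0, 1, 2, 2, None], [3, 0, 5]))
# prints [-100, 3, -100, 0, 5, -100, -100]
```

Labeling only the first piece keeps one prediction per word, which matches how word-level benchmarks are usually scored. Labeling every piece of a word is an equally valid alternative.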