roberta-large-finnish
| Property | Value |
|---|---|
| Model Type | RoBERTa Large |
| Training Data | 78GB Finnish text |
| Tokenizer | BPE (50265 vocab size) |
| Developer | Finnish-NLP |
What is roberta-large-finnish?
roberta-large-finnish is a powerful Finnish language model based on the RoBERTa architecture, trained on a diverse corpus of Finnish text including news archives, Wikipedia, and web crawl data. The model specializes in masked language modeling (MLM) and is designed for downstream task fine-tuning.
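As a quick illustration of the MLM objective, the snippet below runs the standard transformers fill-mask pipeline. It assumes the model is published on the Hugging Face Hub under the Finnish-NLP/roberta-large-finnish identifier; the Finnish example sentence is only a placeholder.

```python
# Minimal masked-prediction sketch; assumes the Hub identifier
# "Finnish-NLP/roberta-large-finnish" for this model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Finnish-NLP/roberta-large-finnish")

# RoBERTa-style models use "<mask>" as the mask token.
predictions = fill_mask("Moikka olen <mask> kielimalli.")
for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```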
Implementation Details
The model was trained on a TPUv3-8 VM using the Adafactor optimizer with carefully tuned parameters. Training ran for two epochs at a sequence length of 128, followed by one additional epoch at a sequence length of 512. Like other RoBERTa models, it uses dynamic masking during pretraining (the masked positions are re-sampled on every pass over the data) rather than the single static masking of the original BERT, making pretraining more robust.
- Trained on a combined 78GB of cleaned Finnish text data
- Uses byte-level BPE tokenization with a 50265-token vocabulary
- Implements 15% token masking with varied replacement strategies (see the collator sketch after this list)
- Achieves 88.02% average accuracy on downstream tasks
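The masking scheme described above is RoBERTa's standard recipe. As a hedged sketch (not the original training script), the snippet below shows how the transformers data collator applies 15% dynamic masking with the usual mixed replacement (80% mask token, 10% random token, 10% unchanged) freshly on every batch; the example sentence is a placeholder.

```python
# Illustrative sketch of dynamic masking as used in RoBERTa-style pretraining.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/roberta-large-finnish")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of tokens are selected for masking
)

# Each call re-masks the same examples differently (dynamic masking).
examples = tokenizer(["Helsinki on Suomen pääkaupunki."])
features = [{"input_ids": ids} for ids in examples["input_ids"]]
masked = collator(features)
print(masked["input_ids"])
print(masked["labels"])  # -100 everywhere except the masked positions
```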
Core Capabilities
- Masked language modeling for Finnish text
- Sequence classification tasks
- Token classification
- Question answering
- Feature extraction for downstream tasks (a minimal usage sketch follows this list)
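The sketch below illustrates the feature-extraction capability. It assumes the Hub identifier Finnish-NLP/roberta-large-finnish and uses simple mean pooling as one possible way to turn token embeddings into a sentence vector; the input sentence is a placeholder.

```python
# Hedged sketch: using the encoder to extract features for downstream tasks.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Finnish-NLP/roberta-large-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Helsinki on Suomen pääkaupunki.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level features: [batch, seq_len, hidden_size] (1024 for a large model).
token_embeddings = outputs.last_hidden_state
# One simple sentence vector: mean pooling over non-padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 1024])
```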
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Finnish language processing, trained on a comprehensive dataset of Finnish text. It improves upon previous Finnish language models and approaches the performance of FinBERT while using the more advanced RoBERTa architecture.
Q: What are the recommended use cases?
The model is primarily designed for tasks that use whole-sentence context, such as sequence classification, token classification, and question answering. It is not recommended for text generation; for that, autoregressive models such as GPT-2 are more appropriate.
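To make the fine-tuning recommendation concrete, the sketch below loads the encoder with a freshly initialized sequence-classification head. The two-label setup and the example sentence are placeholders, not part of the original release; the head must be fine-tuned on labeled data before its outputs are meaningful.

```python
# Illustrative setup for a whole-sentence task (sequence classification).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Finnish-NLP/roberta-large-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a randomly initialized classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Tämä elokuva oli todella hyvä.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- untrained head, scores are arbitrary
```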