roberta-large-finnish
| Property | Value |
|---|---|
| Model Type | RoBERTa Large |
| Training Data | 78GB Finnish text |
| Tokenizer | BPE (50265 vocab size) |
| Developer | Finnish-NLP |
What is roberta-large-finnish?
roberta-large-finnish is a powerful Finnish language model based on the RoBERTa architecture, trained on a diverse corpus of Finnish text including news archives, Wikipedia, and web crawl data. The model specializes in masked language modeling (MLM) and is designed for downstream task fine-tuning.
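As a quick illustration of the MLM objective, the snippet below runs the standard transformers fill-mask pipeline. It assumes the model is published on the Hugging Face Hub under the Finnish-NLP/roberta-large-finnish identifier; the Finnish example sentence is only a placeholder.

```python
# Minimal masked-prediction sketch; assumes the Hub identifier
# "Finnish-NLP/roberta-large-finnish" for this model.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="Finnish-NLP/roberta-large-finnish")

# RoBERTa-style models use "<mask>" as the mask token.
predictions = fill_mask("Moikka olen <mask> kielimalli.")
for p in predictions:
    print(f"{p['token_str']!r}  score={p['score']:.3f}")
```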
Implementation Details
The model was trained on a TPUv3-8 VM using the Adafactor optimizer with carefully tuned parameters. Training ran for two epochs at a sequence length of 128, followed by one additional epoch at a sequence length of 512. Like other RoBERTa models, it uses dynamic masking during pretraining (the masked positions are re-sampled on every pass over the data) rather than the single static masking of the original BERT, making pretraining more robust.
- Trained on a combined 78GB of cleaned Finnish text data
- Uses byte-level BPE tokenization with a 50265-token vocabulary
- Implements 15% token masking with varied replacement strategies (see the collator sketch after this list)
- Achieves 88.02% average accuracy on downstream tasks
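The masking scheme described above is RoBERTa's standard recipe. As a hedged sketch (not the original training script), the snippet below shows how the transformers data collator applies 15% dynamic masking with the usual mixed replacement (80% mask token, 10% random token, 10% unchanged) freshly on every batch; the example sentence is a placeholder.

```python
# Illustrative sketch of dynamic masking as used in RoBERTa-style pretraining.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("Finnish-NLP/roberta-large-finnish")
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,  # 15% of tokens are selected for masking
)

# Each call re-masks the same examples differently (dynamic masking).
examples = tokenizer(["Helsinki on Suomen pääkaupunki."])
features = [{"input_ids": ids} for ids in examples["input_ids"]]
masked = collator(features)
print(masked["input_ids"])
print(masked["labels"])  # -100 everywhere except the masked positions
```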
Core Capabilities
- Masked language modeling for Finnish text
- Sequence classification tasks
- Token classification
- Question answering
- Feature extraction for downstream tasks (a minimal usage sketch follows this list)
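The sketch below illustrates the feature-extraction capability. It assumes the Hub identifier Finnish-NLP/roberta-large-finnish and uses simple mean pooling as one possible way to turn token embeddings into a sentence vector; the input sentence is a placeholder.

```python
# Hedged sketch: using the encoder to extract features for downstream tasks.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "Finnish-NLP/roberta-large-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

inputs = tokenizer("Helsinki on Suomen pääkaupunki.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Token-level features: [batch, seq_len, hidden_size] (1024 for a large model).
token_embeddings = outputs.last_hidden_state
# One simple sentence vector: mean pooling over non-padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
sentence_embedding = (token_embeddings * mask).sum(1) / mask.sum(1)
print(sentence_embedding.shape)  # torch.Size([1, 1024])
```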
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Finnish language processing, trained on a comprehensive dataset of Finnish text. It improves upon previous Finnish language models and approaches the performance of FinBERT while using the more advanced RoBERTa architecture.
Q: What are the recommended use cases?
The model is primarily designed for tasks that use whole-sentence context, such as sequence classification, token classification, and question answering. It is not recommended for text generation; for that, autoregressive models such as GPT-2 are more appropriate.
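To make the fine-tuning recommendation concrete, the sketch below loads the encoder with a freshly initialized sequence-classification head. The two-label setup and the example sentence are placeholders, not part of the original release; the head must be fine-tuned on labeled data before its outputs are meaningful.

```python
# Illustrative setup for a whole-sentence task (sequence classification).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "Finnish-NLP/roberta-large-finnish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Adds a randomly initialized classification head on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Tämä elokuva oli todella hyvä.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2]) -- untrained head, scores are arbitrary
```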