# BERTIN: RoBERTa Base Spanish
| Property | Value |
|---|---|
| Parameter Count | 125M |
| License | CC-BY-4.0 |
| Paper | BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling |
| Training Data | Sampled Spanish mC4 dataset |
## What is bertin-roberta-base-spanish?
BERTIN is a RoBERTa-based language model trained specifically for Spanish. Its distinguishing feature is perplexity sampling, a data-selection technique that let the team train a competitive model on roughly one-fifth of the usual data volume. The model achieves state-of-the-art performance on several Spanish language tasks despite being trained with significantly fewer resources.
## Implementation Details
The model was trained with Flax/JAX on TPUv3-8 hardware, using a novel perplexity sampling technique to select high-quality training documents from the Spanish portion of mC4. This made training efficient: only 50M samples were used out of the 416M available (a sketch of the idea follows the list below).
- Architecture: RoBERTa-base with 125M parameters
- Training Data: Carefully sampled subset of Spanish mC4 using perplexity-based selection
- Training Infrastructure: 3 TPUv3-8 units for approximately 10 days
- Sequence Length: Available in both 128 and 512 token versions
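The sampling idea can be illustrated with a short sketch. This is a hedged approximation of the Gaussian-shaped weighting over perplexity (one of the sampling functions the BERTIN team explored); in the real pipeline perplexities come from a language model trained on clean Spanish text, and the constants `mu` and `sigma` below are illustrative placeholders, not BERTIN's values.

```python
import math
import random

def keep_probability(ppl: float, mu: float = 7.0, sigma: float = 1.5) -> float:
    """Gaussian-shaped weight over log-perplexity: mid-range documents are
    favoured, while trivially repetitive text (very low perplexity) and
    noisy text (very high perplexity) are down-weighted."""
    return math.exp(-((math.log(ppl) - mu) ** 2) / (2 * sigma ** 2))

def perplexity_sample(scored_docs, rng=None):
    """Yield a subsample, keeping each (text, perplexity) pair with the
    probability given by keep_probability()."""
    rng = rng or random.Random(0)
    for text, ppl in scored_docs:
        if rng.random() < keep_probability(ppl):
            yield text

# Toy corpus: perplexities stand in for scores a Spanish LM would assign.
docs = [("texto limpio y natural", 900.0),      # kept almost always
        ("spam spam spam spam", 40.0),          # mostly dropped
        ("zx9 qq@@ ruido ilegible", 250000.0)]  # almost always dropped
print(list(perplexity_sample(docs)))
```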
## Core Capabilities
- Masked Language Modeling with high accuracy (0.65-0.69)
- State-of-the-art performance on MLDoc classification
- Competitive results on NER (F1 0.8792) and POS tagging (F1 0.9662)
- Efficient fine-tuning for downstream tasks
- Specialized for Spanish language understanding
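For masked language modeling, the model can be used directly with the Hugging Face fill-mask pipeline. A minimal usage sketch, assuming the model is published under the Hub ID `bertin-project/bertin-roberta-base-spanish` (not stated in this card):

```python
from transformers import pipeline

# Hub ID assumed; substitute a local path or the correct ID if it differs.
fill_mask = pipeline("fill-mask",
                     model="bertin-project/bertin-roberta-base-spanish")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for pred in fill_mask("Madrid es la <mask> de España."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```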
## Frequently Asked Questions
Q: What makes this model unique?
BERTIN's key innovation is its perplexity sampling approach, which allows it to achieve competitive performance while using only 20% of the traditional training data volume. This makes it particularly valuable for teams with limited computational resources.
Q: What are the recommended use cases?
The model excels at a range of Spanish NLP tasks, including document classification, named entity recognition, part-of-speech tagging, and masked language modeling. It is particularly suitable for applications that need deep Spanish language understanding under resource constraints. A minimal fine-tuning starting point is sketched below.
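For a downstream task such as document classification, the pretrained encoder can be loaded with a fresh classification head. A hedged sketch, again assuming the Hub ID above; `num_labels=4` is illustrative (MLDoc has four classes), and the new head is randomly initialized, so it must be fine-tuned before its outputs mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "bertin-project/bertin-roberta-base-spanish"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The classification head added here is untrained; fine-tune on labeled
# Spanish data (e.g. an MLDoc-style corpus) before relying on predictions.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

inputs = tokenizer("El banco central sube los tipos de interés.",
                   return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # near-uniform until the model is fine-tuned
```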