# BERTIN: RoBERTa Base Spanish
| Property | Value |
|---|---|
| Parameter Count | 125M |
| License | CC-BY-4.0 |
| Paper | BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling |
| Training Data | Sampled Spanish mC4 dataset |
## What is bertin-roberta-base-spanish?
BERTIN is a RoBERTa-based language model trained specifically for Spanish. Its distinguishing feature is perplexity sampling, a data-selection technique that let the team train a competitive model on roughly one-fifth of the usual data volume. The model achieves state-of-the-art performance on several Spanish language tasks despite being trained with significantly fewer resources.
## Implementation Details
The model was trained with Flax/JAX on TPUv3-8 hardware, using a novel perplexity sampling technique to select high-quality training documents from the Spanish portion of mC4. This made training efficient: only 50M samples were used out of the 416M available (a sketch of the idea follows the list below).
- Architecture: RoBERTa-base with 125M parameters
- Training Data: Carefully sampled subset of Spanish mC4 using perplexity-based selection
- Training Infrastructure: 3 TPUv3-8 units for approximately 10 days
- Sequence Length: Available in both 128 and 512 token versions
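The sampling idea can be illustrated with a short sketch. This is a hedged approximation of the Gaussian-shaped weighting over perplexity (one of the sampling functions the BERTIN team explored); in the real pipeline perplexities come from a language model trained on clean Spanish text, and the constants `mu` and `sigma` below are illustrative placeholders, not BERTIN's values.

```python
import math
import random

def keep_probability(ppl: float, mu: float = 7.0, sigma: float = 1.5) -> float:
    """Gaussian-shaped weight over log-perplexity: mid-range documents are
    favoured, while trivially repetitive text (very low perplexity) and
    noisy text (very high perplexity) are down-weighted."""
    return math.exp(-((math.log(ppl) - mu) ** 2) / (2 * sigma ** 2))

def perplexity_sample(scored_docs, rng=None):
    """Yield a subsample, keeping each (text, perplexity) pair with the
    probability given by keep_probability()."""
    rng = rng or random.Random(0)
    for text, ppl in scored_docs:
        if rng.random() < keep_probability(ppl):
            yield text

# Toy corpus: perplexities stand in for scores a Spanish LM would assign.
docs = [("texto limpio y natural", 900.0),      # kept almost always
        ("spam spam spam spam", 40.0),          # mostly dropped
        ("zx9 qq@@ ruido ilegible", 250000.0)]  # almost always dropped
print(list(perplexity_sample(docs)))
```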
## Core Capabilities
- Masked Language Modeling with high accuracy (0.65-0.69)
- State-of-the-art performance on MLDoc classification
- Competitive results on NER (F1 0.8792) and POS tagging (F1 0.9662)
- Efficient fine-tuning for downstream tasks
- Specialized for Spanish language understanding
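For masked language modeling, the model can be used directly with the Hugging Face fill-mask pipeline. A minimal usage sketch, assuming the model is published under the Hub ID `bertin-project/bertin-roberta-base-spanish` (not stated in this card):

```python
from transformers import pipeline

# Hub ID assumed; substitute a local path or the correct ID if it differs.
fill_mask = pipeline("fill-mask",
                     model="bertin-project/bertin-roberta-base-spanish")

# RoBERTa-style tokenizers use "<mask>" as the mask token.
for pred in fill_mask("Madrid es la <mask> de España."):
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
```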
## Frequently Asked Questions
Q: What makes this model unique?
BERTIN's key innovation is its perplexity sampling approach, which allows it to achieve competitive performance while using only 20% of the traditional training data volume. This makes it particularly valuable for teams with limited computational resources.
Q: What are the recommended use cases?
The model excels at a range of Spanish NLP tasks, including document classification, named entity recognition, part-of-speech tagging, and masked language modeling. It is particularly suitable for applications that need deep Spanish language understanding under resource constraints. A minimal fine-tuning starting point is sketched below.
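For a downstream task such as document classification, the pretrained encoder can be loaded with a fresh classification head. A hedged sketch, again assuming the Hub ID above; `num_labels=4` is illustrative (MLDoc has four classes), and the new head is randomly initialized, so it must be fine-tuned before its outputs mean anything:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "bertin-project/bertin-roberta-base-spanish"  # assumed Hub ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The classification head added here is untrained; fine-tune on labeled
# Spanish data (e.g. an MLDoc-style corpus) before relying on predictions.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=4)

inputs = tokenizer("El banco central sube los tipos de interés.",
                   return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # near-uniform until the model is fine-tuned
```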