BERTIN: RoBERTa Base Spanish (bertin-roberta-base-spanish)

Maintained by: bertin-project

  • Parameter Count: 125M
  • License: CC-BY-4.0
  • Paper: BERTIN: Efficient Pre-Training of a Spanish Language Model using Perplexity Sampling
  • Training Data: Sampled Spanish mC4 dataset

What is bertin-roberta-base-spanish?

BERTIN is a RoBERTa-based language model trained specifically for Spanish text processing. What sets it apart is its training approach, perplexity sampling, which allowed the team to train a competitive model on just one-fifth of the usual data volume. The model achieves competitive, and in some cases state-of-the-art, results on Spanish language tasks while requiring significantly fewer training resources.

Implementation Details

The model was trained using Flax/JAX on TPUv3-8 hardware, applying a novel perplexity sampling technique to select high-quality training data from the Spanish portion of mC4. This approach enabled efficient training on only 50M samples instead of the full 416M available; a minimal sketch of the sampling idea follows the list below.

  • Architecture: RoBERTa base architecture with 125M parameters
  • Training Data: Carefully sampled subset of Spanish mC4 using perplexity-based selection
  • Training Infrastructure: 3 TPUv3-8 units for approximately 10 days
  • Sequence Length: Available in both 128 and 512 token versions
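The paper's exact sampling functions (stepwise and Gaussian over perplexity quantiles) are specified in the publication; the snippet below is only a minimal sketch of the stepwise idea, not the authors' implementation. The perplexity callable and the keep probabilities are illustrative assumptions.

```python
import random

def sample_by_perplexity(docs, perplexity, q_low, q_high,
                         keep_mid=1.0, keep_tail=0.1):
    """Stepwise perplexity-sampling sketch.

    perplexity: hypothetical callable mapping a document to a perplexity
        score (in practice a pretrained language model provides this).
    q_low / q_high: perplexity values at chosen quantiles, estimated
        beforehand from a random sample of the corpus.
    Mid-perplexity documents are kept with high probability; the very
    easy and very hard tails are mostly discarded.
    """
    for doc in docs:
        p = perplexity(doc)
        keep_prob = keep_mid if q_low <= p <= q_high else keep_tail
        if random.random() < keep_prob:
            yield doc
```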

Core Capabilities

  • Masked Language Modeling with high accuracy (0.65-0.69; see the usage example after this list)
  • State-of-the-art performance on MLDoc classification
  • Competitive results on NER (F1 0.8792) and POS tagging (F1 0.9662)
  • Efficient fine-tuning for downstream tasks
  • Specialized for Spanish language understanding
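To exercise the masked-language-modeling capability, the checkpoint can be loaded through the standard Hugging Face transformers pipeline; this is a generic usage sketch, with the example sentence chosen arbitrarily.

```python
from transformers import pipeline

# Load the published checkpoint from the Hugging Face Hub.
fill_mask = pipeline("fill-mask",
                     model="bertin-project/bertin-roberta-base-spanish")

# RoBERTa-style models use <mask> as the mask token.
for prediction in fill_mask("Madrid es la <mask> de España."):
    print(prediction["token_str"], round(prediction["score"], 3))
```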

Frequently Asked Questions

Q: What makes this model unique?

BERTIN's key innovation is its perplexity sampling approach, which allows it to achieve competitive performance while using only 20% of the traditional training data volume. This makes it particularly valuable for teams with limited computational resources.

Q: What are the recommended use cases?

The model excels at a range of Spanish NLP tasks, including document classification, named entity recognition, part-of-speech tagging, and masked language modeling. It is particularly suitable for applications that need strong Spanish language understanding under resource constraints.
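As a sketch of fine-tuning for one of these tasks, the snippet below adapts the checkpoint to document classification with the transformers Trainer API. The CSV file names, column names, and the four-class label count are placeholders, not part of the original release.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bertin-project/bertin-roberta-base-spanish"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=4 is a placeholder for a four-class setup such as MLDoc.
model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                           num_labels=4)

# Placeholder corpus: any Spanish dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv",
                                          "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bertin-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```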
