# bert-base-italian-xxl-cased
| Property | Value |
|---|---|
| Parameter Count | 111M |
| License | MIT |
| Framework | PyTorch |
| Training Data | Italian Wikipedia + OSCAR corpus (81GB) |
## What is bert-base-italian-xxl-cased?
This is an advanced Italian language model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It's a BERT-based model trained on a massive 81GB corpus containing 13.1B tokens from Italian Wikipedia and the OSCAR corpus, making it one of the most comprehensive Italian language models available.
## Implementation Details
The model uses the BERT-base architecture with 111M parameters and is case-sensitive. It was trained for millions of steps with a maximum sequence length of 512 subwords. The vocabulary contains 31,102 tokens, though there is a known mismatch between this size and the vocab_size declared in config.json (the loading sketch after the list below makes the discrepancy visible).
- Training corpus: 81GB of text (13,138,379,147 tokens)
- Architecture: BERT-base with cased tokens
- Framework compatibility: PyTorch, Transformers library
- Sentence splitting: NLTK implementation
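A minimal loading sketch, assuming the Hugging Face model id `dbmdz/bert-base-italian-xxl-cased`; it also surfaces the vocabulary/config mismatch noted above:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "dbmdz/bert-base-italian-xxl-cased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# The tokenizer holds the actual 31,102-entry vocabulary; config.json may
# declare a different vocab_size, which is the documented mismatch.
print("tokenizer vocab:", len(tokenizer))
print("config vocab_size:", model.config.vocab_size)
```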
## Core Capabilities
- Fill-mask prediction for Italian text (see the example after this list)
- Language understanding and representation
- Support for downstream NLP tasks like NER and PoS tagging
- Handles cased text, preserving capitalization information
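As a quick illustration of the fill-mask capability, here is a short sketch using the Transformers `pipeline` API (the Italian prompt is illustrative, not taken from the model card):

```python
from transformers import pipeline

# BERT models use [MASK] as the mask token
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-italian-xxl-cased")

for prediction in fill_mask("Umberto Eco è stato un grande [MASK]."):
    print(f"{prediction['token_str']:>12}  {prediction['score']:.3f}")
```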
## Frequently Asked Questions
### Q: What makes this model unique?
This model stands out due to its extensive training on a massive Italian corpus (81GB), making it particularly robust for Italian language tasks. It's part of the 'XXL' series, which represents a significant upgrade over standard Italian BERT models in terms of training data volume.
### Q: What are the recommended use cases?
The model is ideal for Italian natural language processing tasks, including named entity recognition, part-of-speech tagging, and text classification. It is particularly suitable for applications that require understanding of formal Italian text, given its training on Italian Wikipedia and the OSCAR corpus.
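As a sketch of one such downstream use, the snippet below attaches a token-classification head to the pretrained encoder for NER-style fine-tuning. The label set and head are illustrative assumptions, not part of the released checkpoint:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Hypothetical IOB label set for a simple Italian NER scheme
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

model_name = "dbmdz/bert-base-italian-xxl-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# The classification head is randomly initialized; it must be fine-tuned
# on labeled Italian data before predictions are meaningful.
```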