BERTurk: Turkish BERT Model
| Property | Value |
|---|---|
| License | MIT |
| Author | dbmdz |
| Downloads | 82,946 |
| Training Corpus Size | 35GB (44.04B tokens) |
| Framework Support | PyTorch, TensorFlow |
What is bert-base-turkish-uncased?
BERTurk is a community-driven uncased BERT model specifically designed for Turkish language processing. Developed by the MDZ Digital Library team at the Bavarian State Library, this model represents a significant contribution to Turkish NLP resources. It was trained on a comprehensive dataset combining the Turkish OSCAR corpus, Wikipedia dumps, OPUS corpora, and additional data from Kemal Oflazer.
Implementation Details
The model was trained using Google's TensorFlow Research Cloud (TFRC) on a TPU v3-8 for 2 million steps. It is primarily distributed as PyTorch-Transformers compatible weights, though TensorFlow checkpoints can be requested. The implementation follows the standard BERT-base architecture with a vocabulary built from Turkish text.
- Trained on a filtered and sentence-segmented corpus of 35GB
- Processes uncased (lowercased) Turkish text
- Compatible with the Transformers library, version ≥ 2.3 (see the loading sketch below)
- Ships a tokenizer with a Turkish-specific vocabulary
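The snippet below is a minimal loading sketch, assuming the Hugging Face Hub model ID dbmdz/bert-base-turkish-uncased (author and model name from the table above) and a recent Transformers release; the example sentence is illustrative only:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-uncased"  # assumed Hub ID: author "dbmdz" + model name above

# Load the uncased Turkish tokenizer and the pretrained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Turkish sentence; the uncased tokenizer lowercases the input text.
inputs = tokenizer("Merhaba dünya, bu bir deneme cümlesidir.", return_tensors="pt")
outputs = model(**inputs)

# For a BERT-base encoder the hidden size is 768: (batch, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```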
Core Capabilities
- Turkish language understanding and processing
- Support for downstream PoS tagging and NER tasks
- Efficient tokenization of Turkish text
- Straightforward integration with Hugging Face's Transformers library (a fill-mask example follows this list)
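One way to exercise the Transformers integration above is masked-token prediction; the sketch below assumes the standard BERT [MASK] token and uses a made-up example sentence:

```python
from transformers import pipeline

# Fill-mask pipeline on top of the Turkish BERT checkpoint.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-uncased")

# Turkish: "The capital of Turkey is the city of [MASK]."
for prediction in fill_mask("Türkiye'nin başkenti [MASK] şehridir."):
    print(prediction["token_str"], round(prediction["score"], 3))
```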
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Turkish language processing, trained on an extensive and diverse dataset of Turkish text. It is one of the few dedicated Turkish BERT models available, developed with community support and intended for downstream tasks such as PoS tagging and NER.
Q: What are the recommended use cases?
The model is well-suited for Turkish natural language processing tasks including part-of-speech tagging, named entity recognition, and general language understanding tasks. It's particularly valuable for applications requiring Turkish text analysis and processing.
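For the PoS tagging and NER use cases mentioned above, a common pattern is to place a token-classification head on top of the pretrained encoder and fine-tune it on labeled Turkish data. The sketch below is illustrative; the label count is hypothetical:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-uncased"  # assumed Hub ID, as above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Token-classification head (randomly initialized) on top of the pretrained Turkish encoder.
# num_labels=7 is a placeholder, e.g. a BIO scheme with three entity types plus the "O" tag.
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=7)

# The model would then be fine-tuned on labeled Turkish tokens, e.g. with the Trainer API.
```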