BERTurk: Turkish BERT Model
| Property | Value |
|---|---|
| Parameter Count | 185M |
| License | MIT |
| Author | dbmdz |
| Vocabulary Size | 128k tokens |
| Training Data Size | 35GB (~4.4B tokens) |
What is bert-base-turkish-128k-uncased?
BERTurk is a community-driven, uncased BERT model for Turkish language processing. Developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library, it was trained on an extensive corpus comprising the Turkish portion of the OSCAR corpus, a Turkish Wikipedia dump, various OPUS corpora, and additional data provided by Kemal Oflazer.
Implementation Details
The model was trained for 2 million steps on a Google TPU v3-8, with compute provided through the TensorFlow Research Cloud (TFRC). PyTorch-compatible weights are provided, and the model works out of the box with the Hugging Face Transformers library (see the loading example after the list below). The architecture follows the standard BERT-base configuration, paired with a Turkish-specific vocabulary of 128,000 tokens.
- Trained on 35GB of curated Turkish text
- Roughly 4.4 billion tokens in the training corpus
- Vocabulary of 128,000 tokens
- Uncased (lowercased) tokenization
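As a quick illustration of the Transformers integration described above, the sketch below loads the checkpoint from the Hugging Face Hub and queries it through the fill-mask pipeline. The Hub ID `dbmdz/bert-base-turkish-128k-uncased` is assumed from the author and model name above, and the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load the uncased 128k BERTurk checkpoint and run masked-token prediction.
# The Hub ID below is an assumption based on the author (dbmdz) and model name.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-128k-uncased")

# The uncased tokenizer lowercases input, so prompt casing does not matter.
for prediction in fill_mask("Türkiye'nin başkenti [MASK] şehridir."):
    print(prediction["token_str"], round(prediction["score"], 3))
```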
Core Capabilities
- Turkish language understanding via masked language modeling and contextual representations
- Support for PoS tagging and NER through task-specific fine-tuning (see the sketch after this list)
- Seamless integration with Hugging Face Transformers
- Optimized for Turkish-specific NLP applications
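The PoS-tagging and NER support listed above comes from fine-tuning the encoder with a task head; the released checkpoint ships without one. Below is a minimal sketch of attaching a token-classification head. The label set, example sentence, and Hub ID are assumptions for illustration, and the head is randomly initialized until trained on labeled data.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "dbmdz/bert-base-turkish-128k-uncased"  # assumed Hub ID

# Hypothetical NER label set; the classification head below is randomly
# initialized and must be fine-tuned on labeled Turkish data before use.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Pre-tokenized input keeps word boundaries, which simplifies mapping
# subword predictions back to words after fine-tuning.
encoding = tokenizer(
    ["Ahmet", "yarın", "İstanbul'a", "gidiyor", "."],
    is_split_into_words=True,
    return_tensors="pt",
)
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, number_of_labels)
```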
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its extensive Turkish-specific training data and its large 128k-token vocabulary, which together make it particularly effective for Turkish language tasks. It was also built through community contributions and academic collaboration.
Q: What are the recommended use cases?
The model is well suited to Turkish natural language processing tasks such as part-of-speech tagging, named entity recognition, and general language understanding, and it can be integrated into existing NLP pipelines via the Hugging Face Transformers library (a feature-extraction sketch follows below).
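For pipelines that only need sentence representations, one common (but not prescribed) pattern is to use the encoder for feature extraction with mean pooling over the last hidden states. The sketch below follows that convention, again assuming the `dbmdz/bert-base-turkish-128k-uncased` Hub ID and arbitrary example sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-128k-uncased"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentences = ["Bugün hava çok güzel.", "Yarın toplantı var mı?"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens; one common convention, not the
# only way to derive sentence vectors from a BERT encoder.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```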