BERTurk: Turkish BERT Model
| Property | Value |
|---|---|
| Parameter Count | 185M |
| License | MIT |
| Author | dbmdz |
| Vocabulary Size | 128k tokens |
| Training Data Size | 35GB (~4.4B tokens) |
What is bert-base-turkish-128k-uncased?
BERTurk is a community-driven, uncased BERT model for Turkish language processing. Developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library, it was trained on an extensive corpus comprising the Turkish portion of the OSCAR corpus, a Turkish Wikipedia dump, various OPUS corpora, and additional data provided by Kemal Oflazer.
Implementation Details
The model was trained for 2 million steps on a Google TPU v3-8, with compute provided through the TensorFlow Research Cloud (TFRC). PyTorch-compatible weights are provided, and the model works out of the box with the Hugging Face Transformers library (see the loading example after the list below). The architecture follows the standard BERT-base configuration, paired with a Turkish-specific vocabulary of 128,000 tokens.
- Trained on 35GB of curated Turkish text
- Roughly 4.4 billion tokens in the training corpus
- Vocabulary of 128,000 tokens
- Uncased (lowercased) tokenization
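As a quick illustration of the Transformers integration described above, the sketch below loads the checkpoint from the Hugging Face Hub and queries it through the fill-mask pipeline. The Hub ID `dbmdz/bert-base-turkish-128k-uncased` is assumed from the author and model name above, and the example sentence is purely illustrative.

```python
from transformers import pipeline

# Load the uncased 128k BERTurk checkpoint and run masked-token prediction.
# The Hub ID below is an assumption based on the author (dbmdz) and model name.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-128k-uncased")

# The uncased tokenizer lowercases input, so prompt casing does not matter.
for prediction in fill_mask("Türkiye'nin başkenti [MASK] şehridir."):
    print(prediction["token_str"], round(prediction["score"], 3))
```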
Core Capabilities
- Turkish language understanding via masked language modeling and contextual representations
- Support for PoS tagging and NER through task-specific fine-tuning (see the sketch after this list)
- Seamless integration with Hugging Face Transformers
- Optimized for Turkish-specific NLP applications
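The PoS-tagging and NER support listed above comes from fine-tuning the encoder with a task head; the released checkpoint ships without one. Below is a minimal sketch of attaching a token-classification head. The label set, example sentence, and Hub ID are assumptions for illustration, and the head is randomly initialized until trained on labeled data.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_ID = "dbmdz/bert-base-turkish-128k-uncased"  # assumed Hub ID

# Hypothetical NER label set; the classification head below is randomly
# initialized and must be fine-tuned on labeled Turkish data before use.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# Pre-tokenized input keeps word boundaries, which simplifies mapping
# subword predictions back to words after fine-tuning.
encoding = tokenizer(
    ["Ahmet", "yarın", "İstanbul'a", "gidiyor", "."],
    is_split_into_words=True,
    return_tensors="pt",
)
outputs = model(**encoding)
print(outputs.logits.shape)  # (1, sequence_length, number_of_labels)
```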
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its extensive Turkish-specific training data and its large 128k-token vocabulary, which together make it particularly effective for Turkish language tasks. It was also built through community contributions and academic collaboration.
Q: What are the recommended use cases?
The model is well suited to Turkish natural language processing tasks such as part-of-speech tagging, named entity recognition, and general language understanding, and it can be integrated into existing NLP pipelines via the Hugging Face Transformers library (a feature-extraction sketch follows below).
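For pipelines that only need sentence representations, one common (but not prescribed) pattern is to use the encoder for feature extraction with mean pooling over the last hidden states. The sketch below follows that convention, again assuming the `dbmdz/bert-base-turkish-128k-uncased` Hub ID and arbitrary example sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-128k-uncased"  # assumed Hub ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

sentences = ["Bugün hava çok güzel.", "Yarın toplantı var mı?"]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

# Mean pooling over non-padding tokens; one common convention, not the
# only way to derive sentence vectors from a BERT encoder.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 768])
```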