BERTurk: Turkish BERT Model
| Property | Value |
|---|---|
| License | MIT |
| Author | dbmdz |
| Downloads | 82,946 |
| Training Corpus Size | 35GB (44.04B tokens) |
| Framework Support | PyTorch, TensorFlow |
What is bert-base-turkish-uncased?
BERTurk is a community-driven uncased BERT model specifically designed for Turkish language processing. Developed by the MDZ Digital Library team at the Bavarian State Library, this model represents a significant contribution to Turkish NLP resources. It was trained on a comprehensive dataset combining the Turkish OSCAR corpus, Wikipedia dumps, OPUS corpora, and additional data from Kemal Oflazer.
Implementation Details
The model was trained using Google's TensorFlow Research Cloud (TFRC) on a TPU v3-8 for 2 million steps. It is primarily distributed as PyTorch-Transformers compatible weights, though TensorFlow checkpoints can be requested. The implementation follows the standard BERT-base architecture with a vocabulary built from Turkish text.
- Trained on a filtered and sentence-segmented corpus of 35GB
- Processes uncased (lowercased) Turkish text
- Compatible with the Transformers library, version ≥ 2.3 (see the loading sketch below)
- Ships a tokenizer with a Turkish-specific vocabulary
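The snippet below is a minimal loading sketch, assuming the Hugging Face Hub model ID dbmdz/bert-base-turkish-uncased (author and model name from the table above) and a recent Transformers release; the example sentence is illustrative only:

```python
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-uncased"  # assumed Hub ID: author "dbmdz" + model name above

# Load the uncased Turkish tokenizer and the pretrained BERT encoder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Turkish sentence; the uncased tokenizer lowercases the input text.
inputs = tokenizer("Merhaba dünya, bu bir deneme cümlesidir.", return_tensors="pt")
outputs = model(**inputs)

# For a BERT-base encoder the hidden size is 768: (batch, sequence_length, 768).
print(outputs.last_hidden_state.shape)
```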
Core Capabilities
- Turkish language understanding and processing
- Support for downstream PoS tagging and NER tasks
- Efficient tokenization of Turkish text
- Straightforward integration with Hugging Face's Transformers library (a fill-mask example follows this list)
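One way to exercise the Transformers integration above is masked-token prediction; the sketch below assumes the standard BERT [MASK] token and uses a made-up example sentence:

```python
from transformers import pipeline

# Fill-mask pipeline on top of the Turkish BERT checkpoint.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-uncased")

# Turkish: "The capital of Turkey is the city of [MASK]."
for prediction in fill_mask("Türkiye'nin başkenti [MASK] şehridir."):
    print(prediction["token_str"], round(prediction["score"], 3))
```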
Frequently Asked Questions
Q: What makes this model unique?
This model is specifically optimized for Turkish language processing, trained on an extensive and diverse dataset of Turkish text. It is one of the few dedicated Turkish BERT models available, developed with community support and intended for downstream tasks such as PoS tagging and NER.
Q: What are the recommended use cases?
The model is well-suited for Turkish natural language processing tasks including part-of-speech tagging, named entity recognition, and general language understanding tasks. It's particularly valuable for applications requiring Turkish text analysis and processing.
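For the PoS tagging and NER use cases mentioned above, a common pattern is to place a token-classification head on top of the pretrained encoder and fine-tune it on labeled Turkish data. The sketch below is illustrative; the label count is hypothetical:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "dbmdz/bert-base-turkish-uncased"  # assumed Hub ID, as above

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Token-classification head (randomly initialized) on top of the pretrained Turkish encoder.
# num_labels=7 is a placeholder, e.g. a BIO scheme with three entity types plus the "O" tag.
model = AutoModelForTokenClassification.from_pretrained(MODEL_ID, num_labels=7)

# The model would then be fine-tuned on labeled Turkish tokens, e.g. with the Trainer API.
```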