bert-base-turkish-uncased

Maintained by: dbmdz

BERTurk: Turkish BERT Model

License: MIT
Author: dbmdz
Downloads: 82,946
Training Corpus Size: 35GB (~4.4B tokens)
Framework Support: PyTorch, TensorFlow

What is bert-base-turkish-uncased?

BERTurk is a community-driven uncased BERT model specifically designed for Turkish language processing. Developed by the MDZ Digital Library team at the Bavarian State Library, this model represents a significant contribution to Turkish NLP resources. It was trained on a comprehensive dataset combining the Turkish OSCAR corpus, Wikipedia dumps, OPUS corpora, and additional data from Kemal Oflazer.

Implementation Details

The model was trained using Google's TensorFlow Research Cloud (TFRC) on a TPU v3-8 for 2 million steps. It's primarily available in PyTorch-Transformers compatible format, though TensorFlow checkpoints can be requested. The implementation follows the base BERT architecture, optimized for Turkish language characteristics.

  • Trained on a filtered, sentence-segmented corpus of 35GB
  • Processes uncased (lowercase) Turkish text
  • Compatible with Transformers library version ≥ 2.3
  • Includes specialized tokenization for Turkish language
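Loading the checkpoint follows the standard Transformers pattern. The sketch below is a minimal masked-language-modeling example, assuming a recent Transformers release with the Auto* classes, PyTorch, and network access to the Hugging Face Hub; the example sentence is illustrative only.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_id = "dbmdz/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# The uncased tokenizer lowercases input before WordPiece segmentation.
text = "Türkiye'nin başkenti [MASK] şehridir."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and print the top 5 predicted tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```

Because the model is uncased, queries should not rely on capitalization to carry meaning (e.g. distinguishing proper nouns purely by case).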

Core Capabilities

  • Turkish language understanding and processing
  • Support for PoS tagging and NER tasks
  • Efficient tokenization of Turkish text
  • Seamless integration with Hugging Face's Transformers library

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Turkish language processing, trained on an extensive and diverse dataset of Turkish text. It's one of the few dedicated Turkish BERT models available with community support and extensive testing.

Q: What are the recommended use cases?

The model is well-suited for Turkish NLP tasks such as part-of-speech tagging, named entity recognition, and general language understanding. It is particularly valuable for applications that analyze or process Turkish text.
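For token-level tasks such as NER or PoS tagging, the base checkpoint is typically fine-tuned with a token-classification head. The sketch below shows the setup only, with a freshly initialized (untrained) head; the label set is an assumption standing in for whatever tag scheme your dataset uses.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "dbmdz/bert-base-turkish-uncased"

# Assumed example label set (IOB2 NER tags); replace with your dataset's tags.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

# Attaches a randomly initialized classification head on top of BERTurk;
# the head must be fine-tuned before predictions are meaningful.
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

enc = tokenizer("Ankara Türkiye'nin başkentidir.", return_tensors="pt")
logits = model(**enc).logits  # shape: (1, sequence_length, num_labels)
```

From here, fine-tuning proceeds as with any BERT-based token classifier (e.g. via the Transformers `Trainer` on a labeled Turkish corpus).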
