bert-base-turkish-128k-uncased

Maintained by dbmdz

BERTurk: Turkish BERT Model

  • Parameter Count: 185M
  • License: MIT
  • Author: dbmdz
  • Vocabulary Size: 128k tokens
  • Training Data Size: 35GB (44B tokens)

What is bert-base-turkish-128k-uncased?

BERTurk is a community-driven uncased BERT model specifically designed for Turkish language processing. Developed by the MDZ Digital Library team at the Bavarian State Library, this model represents a significant advancement in Turkish NLP capabilities. The model was trained on an extensive corpus including the Turkish OSCAR dataset, Wikipedia dumps, OPUS corpora, and additional data provided by Kemal Oflazer.

Implementation Details

The model was trained for 2 million steps on a Google TPU v3-8, with compute provided through the TensorFlow Research Cloud (TFRC). The released checkpoints load in PyTorch and are fully compatible with the Hugging Face Transformers library, as shown in the snippet after the list below. The architecture follows the standard BERT-base configuration; the main Turkish-specific adaptation is the enlarged 128k-token vocabulary.

  • Trained on 35GB of carefully curated Turkish text
  • 44 billion tokens in the training corpus
  • 128,000-token vocabulary
  • Uncased tokenization (input text is lowercased)
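
Loading the model follows the standard Transformers pattern. A minimal sketch, assuming the transformers library is installed; the model ID is taken directly from this card:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Model ID on the Hugging Face Hub (from this card's title and author)
model_id = "dbmdz/bert-base-turkish-128k-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
```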

Core Capabilities

  • Turkish language understanding via masked language modeling (fill-mask)
  • Strong base for fine-tuning on PoS tagging and NER tasks
  • Seamless integration with Hugging Face Transformers (see the fill-mask example below)
  • Optimized for Turkish-specific NLP applications
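
The masked-language-modeling capability can be exercised directly with the fill-mask pipeline. A minimal sketch; the Turkish input sentence is an illustrative example, not taken from the model card:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-128k-uncased")

# Illustrative input; [MASK] is BERT's standard mask token, and the
# uncased tokenizer lowercases the text before tokenization.
for prediction in fill_mask("türkiye'nin başkenti [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```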

Frequently Asked Questions

Q: What makes this model unique?

This model stands out for its extensive Turkish-specific training corpus and its large 128k-token vocabulary, which together make it particularly effective for Turkish language tasks. It was also developed as a community effort, combining public corpora with data contributed through academic collaboration.

Q: What are the recommended use cases?

The model is particularly suited to Turkish natural language processing tasks such as part-of-speech tagging, named entity recognition, and general language understanding. It can be integrated into existing NLP pipelines via the Hugging Face Transformers library, for example as a token-classification backbone (see the sketch below).
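
For NER-style fine-tuning, a common starting point is to attach a token-classification head to the pretrained encoder. A minimal sketch; the BIO label set below is hypothetical and only for illustration:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical BIO label set for a Turkish NER task (illustrative only)
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

model_id = "dbmdz/bert-base-turkish-128k-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```

The classification head is randomly initialized, so the model must be fine-tuned on labeled data before it produces meaningful entity predictions.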
