bert-base-turkish-cased

Maintained by dbmdz

  • Parameter count: 111M
  • License: MIT
  • Author: dbmdz
  • Downloads: 178,783
  • Training data size: 35GB (4,404,976,662 tokens)

What is bert-base-turkish-cased?

BERTurk is a community-driven cased BERT model specifically designed for Turkish language processing. Developed by the MDZ Digital Library team at the Bavarian State Library, it represents a significant contribution to Turkish NLP research and applications. The model was trained on a comprehensive dataset including the Turkish OSCAR corpus, Wikipedia dumps, OPUS corpora, and additional specialized content provided by Kemal Oflazer.

Implementation Details

The model was trained on a TPU v3-8 for 2 million steps using Google's TensorFlow Research Cloud (TFRC). The released checkpoints implement the standard BERT architecture and are usable from both TensorFlow and PyTorch through the Hugging Face Transformers library.

  • Cased vocabulary, preserving capitalization for better Turkish language representation
  • Trained on 35GB of carefully curated Turkish text
  • Compatible with Transformers library version 2.3 and above
  • Supports both inference and fine-tuning tasks
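Loading the model through the Transformers library is straightforward. The sketch below uses the public Hub identifier `dbmdz/bert-base-turkish-cased` with the `fill-mask` pipeline, which exercises the pretrained masked-language-modeling head directly (no fine-tuning required); the example sentence is an assumption for illustration.

```python
from transformers import pipeline

# Load the cased Turkish BERT with its pretrained masked-LM head
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-turkish-cased")

# Predict the masked word in a Turkish sentence
# ("The capital of Turkey is [MASK].")
for pred in fill_mask("Türkiye'nin başkenti [MASK]."):
    print(pred["token_str"], round(pred["score"], 3))
```

By default the pipeline returns the five highest-scoring candidate tokens, each with a softmax probability.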

Core Capabilities

  • Turkish language understanding (masked language modeling)
  • Part-of-Speech (PoS) tagging
  • Named Entity Recognition (NER)
  • Text classification and analysis
  • Sequence labeling tasks
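For sequence-labeling tasks such as PoS tagging or NER, the base checkpoint is used as a backbone with a token-classification head on top. A minimal sketch, assuming a hypothetical 9-label BIO tag set (the head below is randomly initialized and must be fine-tuned on labeled Turkish data before it produces meaningful tags):

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# num_labels=9 is an assumed example tag set (e.g. BIO tags for 4 entity
# types plus "O"); the classification head starts untrained.
model = AutoModelForTokenClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=9
)

inputs = tokenizer(
    "Mustafa Kemal Atatürk 1881 yılında Selanik'te doğdu.",
    return_tensors="pt",
)
outputs = model(**inputs)

# One logit vector per subword token: (batch, seq_len, num_labels)
print(outputs.logits.shape)
```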

Frequently Asked Questions

Q: What makes this model unique?

This model is specifically optimized for Turkish language processing, maintaining case sensitivity and trained on an extensive collection of Turkish texts. It's one of the few dedicated Turkish BERT models available with community support and extensive testing.

Q: What are the recommended use cases?

The model is ideal for Turkish natural language processing tasks including text classification, named entity recognition, and part-of-speech tagging. It's particularly suitable for applications requiring deep understanding of Turkish language nuances and grammar structures.
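For text classification, the same backbone is paired with a sequence-classification head. A minimal fine-tuning setup sketch, assuming a hypothetical binary task such as sentiment polarity (the head is randomly initialized, so the printed logits are untrained placeholders):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-turkish-cased")

# num_labels=2 is an assumption for a binary task; fine-tune on labeled
# Turkish data (e.g. with the Trainer API) before using the predictions.
model = AutoModelForSequenceClassification.from_pretrained(
    "dbmdz/bert-base-turkish-cased", num_labels=2
)

inputs = tokenizer("Bu film gerçekten harikaydı!", return_tensors="pt")
logits = model(**inputs).logits

# One score per class for the whole sequence: (1, 2)
print(logits.shape)
```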
