bert-base-german-uncased

Maintained by: dbmdz

BERT Base German Uncased

  • Parameter Count: 111M parameters
  • License: MIT
  • Framework: PyTorch
  • Dataset Size: 16GB (2.35B tokens)
  • Author: dbmdz (Bavarian State Library)

What is bert-base-german-uncased?

BERT Base German Uncased is a German language model developed by the MDZ Digital Library team (dbmdz) at the Bavarian State Library. It is a transformer-based BERT model trained on a diverse German-language corpus, providing robust language understanding for German text-processing tasks.

Implementation Details

The model was trained on an extensive dataset combining Wikipedia dumps, the EU Bookshop corpus, Open Subtitles, CommonCrawl, ParaCrawl, and News Crawl. The training process used spaCy for sentence splitting and followed SciBERT's preprocessing methodology. The model was trained for 1.5M steps with a sequence length of 512 subwords.

  • Comprehensive vocabulary based on German text corpus
  • PyTorch compatibility through Hugging Face Transformers (see the loading sketch after this list)
  • Standard BERT-base transformer architecture (12 layers, 768 hidden dimensions)
  • Optimized for German language understanding
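
As a rough sketch of the Hugging Face Transformers integration mentioned above, the snippet below loads the tokenizer and encoder and runs one German sentence through the model. The Hub id dbmdz/bert-base-german-uncased and the example sentence are assumptions for illustration, not details taken from this card.

```python
# Minimal loading sketch, assuming the model is published on the Hugging Face Hub
# under the id "dbmdz/bert-base-german-uncased" (verify the exact id on the Hub).
from transformers import AutoTokenizer, AutoModel

model_name = "dbmdz/bert-base-german-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# The uncased tokenizer lowercases the input before subword splitting.
inputs = tokenizer("Die Bayerische Staatsbibliothek ist in München.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size=768)
```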

Core Capabilities

  • Fill-mask task performance (see the pipeline example after this list)
  • Text classification and token classification
  • Sequence classification tasks
  • Named Entity Recognition (NER)
  • Part of Speech (PoS) tagging
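
A minimal sketch of the fill-mask capability listed above, using the Transformers pipeline API; the Hub id and the example sentence are assumptions for illustration.

```python
from transformers import pipeline

# Fill-mask sketch: BERT-style models expect the [MASK] placeholder token.
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-german-uncased")

# The pipeline returns the top candidate tokens for the masked position with scores.
for prediction in fill_mask("Berlin ist die [MASK] von Deutschland."):
    print(prediction["token_str"], round(prediction["score"], 3))
```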

Frequently Asked Questions

Q: What makes this model unique?

This model stands out due to its extensive training on diverse German language sources and its optimization for uncased text processing, making it particularly suitable for applications where case sensitivity isn't crucial.
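
A small illustration of the uncased behavior, assuming the published tokenizer config enables lowercasing, as the "uncased" name implies:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")

# Differently cased inputs are lowercased to the same subword sequence,
# so the model does not (and cannot) distinguish casing.
print(tokenizer.tokenize("München ist schön."))
print(tokenizer.tokenize("MÜNCHEN IST SCHÖN."))
```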

Q: What are the recommended use cases?

The model is ideal for German language processing tasks including text classification, named entity recognition, and general language understanding applications. It's particularly useful in scenarios where case-insensitive text processing is preferred.
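
For the named-entity-recognition use case, one common pattern is to put a token-classification head on top of the pretrained encoder and fine-tune it on labeled German data. The sketch below is illustrative only; the label set is a hypothetical placeholder, not part of this model card.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

model_name = "dbmdz/bert-base-german-uncased"
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]  # hypothetical tag set

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# The classification head is randomly initialized; it must be fine-tuned on
# labeled German NER data (e.g. with the Trainer API) before it produces
# meaningful predictions.
```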
