bert-large-uncased-whole-word-masking

Maintained by: google-bert

BERT Large Uncased Whole Word Masking

  • Parameter Count: 336M
  • License: Apache 2.0
  • Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805)
  • Training Data: BookCorpus and Wikipedia
  • Architecture: 24 layers, 1024 hidden dimensions, 16 attention heads
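
If the transformers library is installed, the architecture figures above can be checked directly from the published configuration. This is a minimal sketch, assuming the short model ID resolves on the Hugging Face Hub:

```python
from transformers import AutoConfig

# Load the published configuration and print the architecture fields listed above.
config = AutoConfig.from_pretrained("bert-large-uncased-whole-word-masking")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# Expected: 24 1024 16
```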

What is bert-large-uncased-whole-word-masking?

This is an advanced variant of BERT that implements whole word masking during pre-training: all of the wordpiece tokens that make up a word are masked together, rather than being masked independently. The model is uncased, meaning it treats "english" and "English" identically, and it was trained on BookCorpus and Wikipedia.
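
As a quick illustration of the pre-training objective, the model can be queried through the transformers fill-mask pipeline. A minimal sketch, assuming network access to the Hugging Face Hub:

```python
from transformers import pipeline

# Predict the masked token with the pre-trained MLM head.
unmasker = pipeline("fill-mask", model="bert-large-uncased-whole-word-masking")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```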

Implementation Details

The model is pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). During pre-training, 15% of the input tokens are masked, and the vocabulary contains 30,000 WordPiece tokens. Training was run on 4 cloud TPUs in Pod configuration for one million steps with a batch size of 256. A short tokenizer sketch after the list below illustrates the whole-word-masking idea.

  • Bidirectional contextual representation learning
  • Whole word masking technique for better semantic understanding
  • Pre-trained on extensive text corpora
  • Optimized with Adam optimizer (learning rate 1e-4)
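
To see why whole word masking matters, note that the WordPiece tokenizer often splits a single word into several pieces; under whole word masking, all of those pieces are masked together rather than independently. A minimal sketch of the tokenization step (the example word is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

# A rarer word typically splits into several wordpieces. Under whole word masking,
# every piece of such a word is replaced by [MASK] at once during pre-training.
print(tokenizer.tokenize("prophylactic measures were taken"))
```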

Core Capabilities

  • Masked language modeling with whole word masking
  • Next sentence prediction for context understanding
  • Feature extraction for downstream tasks (see the sketch after this list)
  • Token classification and sequence classification
  • Question answering (92.8 F1 / 86.7 EM on SQuAD 1.1 after fine-tuning)
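
For feature extraction, the base model can be used to obtain contextual embeddings for each token. A minimal sketch, assuming PyTorch and the transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "bert-large-uncased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Whole word masking improves BERT pre-training.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, one 1024-dimensional vector per input token:
# shape (batch_size, sequence_length, 1024)
print(outputs.last_hidden_state.shape)
```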

Frequently Asked Questions

Q: What makes this model unique?

This model's distinctive feature is its whole word masking approach, where entire words are masked during training rather than individual wordpieces, leading to better semantic understanding and performance on downstream tasks.

Q: What are the recommended use cases?

The model is best suited for tasks that require understanding of complete sentences, such as sequence classification, token classification, and question answering. It's not recommended for text generation tasks, where models like GPT-2 would be more appropriate.
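
As a sketch of a typical downstream setup, the pre-trained weights can be loaded beneath a freshly initialized sequence-classification head and then fine-tuned on labeled data; the two-label count here is only illustrative:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bert-large-uncased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The classification head is randomly initialized and must be fine-tuned
# (e.g. with the Trainer API) before its predictions are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```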
