bert-large-uncased-whole-word-masking

Maintained by: google-bert

BERT Large Uncased Whole Word Masking

  • Parameter Count: 336M
  • License: Apache 2.0
  • Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (arXiv:1810.04805)
  • Training Data: BookCorpus and Wikipedia
  • Architecture: 24 layers, 1024 hidden dimensions, 16 attention heads
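
If the transformers library is installed, the architecture figures above can be checked directly from the published configuration. This is a minimal sketch, assuming the short model ID resolves on the Hugging Face Hub:

```python
from transformers import AutoConfig

# Load the published configuration and print the architecture fields listed above.
config = AutoConfig.from_pretrained("bert-large-uncased-whole-word-masking")
print(config.num_hidden_layers, config.hidden_size, config.num_attention_heads)
# Expected: 24 1024 16
```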

What is bert-large-uncased-whole-word-masking?

This is an advanced variant of BERT that implements whole word masking during pre-training: all of the wordpiece tokens that make up a word are masked together, rather than being masked independently. The model is uncased, meaning it treats "english" and "English" identically, and it was trained on BookCorpus and Wikipedia.
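
As a quick illustration of the pre-training objective, the model can be queried through the transformers fill-mask pipeline. A minimal sketch, assuming network access to the Hugging Face Hub:

```python
from transformers import pipeline

# Predict the masked token with the pre-trained MLM head.
unmasker = pipeline("fill-mask", model="bert-large-uncased-whole-word-masking")
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```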

Implementation Details

The model is pre-trained with two objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). During pre-training, 15% of the input tokens are masked, and the vocabulary contains 30,000 WordPiece tokens. Training was run on 4 cloud TPUs in Pod configuration for one million steps with a batch size of 256. A short tokenizer sketch after the list below illustrates the whole-word-masking idea.

  • Bidirectional contextual representation learning
  • Whole word masking technique for better semantic understanding
  • Pre-trained on extensive text corpora
  • Optimized with Adam optimizer (learning rate 1e-4)
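
To see why whole word masking matters, note that the WordPiece tokenizer often splits a single word into several pieces; under whole word masking, all of those pieces are masked together rather than independently. A minimal sketch of the tokenization step (the example word is arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking")

# A rarer word typically splits into several wordpieces. Under whole word masking,
# every piece of such a word is replaced by [MASK] at once during pre-training.
print(tokenizer.tokenize("prophylactic measures were taken"))
```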

Core Capabilities

  • Masked language modeling with whole word masking
  • Next sentence prediction for context understanding
  • Feature extraction for downstream tasks (see the sketch after this list)
  • Token classification and sequence classification
  • Question answering (92.8 F1 / 86.7 EM on SQuAD 1.1 after fine-tuning)
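
For feature extraction, the base model can be used to obtain contextual embeddings for each token. A minimal sketch, assuming PyTorch and the transformers library:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "bert-large-uncased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Whole word masking improves BERT pre-training.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings, one 1024-dimensional vector per input token:
# shape (batch_size, sequence_length, 1024)
print(outputs.last_hidden_state.shape)
```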

Frequently Asked Questions

Q: What makes this model unique?

This model's distinctive feature is its whole word masking approach, where entire words are masked during training rather than individual wordpieces, leading to better semantic understanding and performance on downstream tasks.

Q: What are the recommended use cases?

The model is best suited for tasks that require understanding of complete sentences, such as sequence classification, token classification, and question answering. It's not recommended for text generation tasks, where models like GPT-2 would be more appropriate.
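
As a sketch of a typical downstream setup, the pre-trained weights can be loaded beneath a freshly initialized sequence-classification head and then fine-tuned on labeled data; the two-label count here is only illustrative:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "bert-large-uncased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The classification head is randomly initialized and must be fine-tuned
# (e.g. with the Trainer API) before its predictions are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
```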
