bert-large-cased-whole-word-masking

Maintained by: google-bert

BERT Large Cased Whole Word Masking

  • Parameters: 336M
  • Architecture: 24 layers, 1,024-dimensional hidden size, 16 attention heads
  • Training Data: BookCorpus (11,038 books) & English Wikipedia
  • Paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  • Performance: SQuAD 1.1: 92.9/86.7 (F1/EM), MultiNLI: 86.46%

What is bert-large-cased-whole-word-masking?

This is a variant of BERT Large pre-trained with whole word masking: when a word is selected for masking, all of its WordPiece tokens are masked simultaneously rather than each subword token being masked independently. The model is case-sensitive, distinguishing between uppercase and lowercase text, which makes it particularly suitable for tasks where case information is important.
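As a quick illustration, here is a minimal sketch of masked-word prediction with the Hugging Face transformers fill-mask pipeline; it assumes transformers and a PyTorch backend are installed, and the example sentence is illustrative.

```python
from transformers import pipeline

# Load the checkpoint through the fill-mask pipeline.
fill_mask = pipeline(
    "fill-mask",
    model="google-bert/bert-large-cased-whole-word-masking",
)

# The cased tokenizer preserves capitalization, so predictions respect case.
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 4))
```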

Implementation Details

The model was pre-trained on 4 cloud TPUs in Pod configuration (16 TPU chips total) for one million steps with a batch size of 256. It employs two pre-training objectives: Masked Language Modeling (MLM), in which 15% of the tokens are masked, and Next Sentence Prediction (NSP). The sequence length was limited to 128 tokens for 90% of the steps and 512 for the remaining 10%. Training used the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, β2 = 0.999, a weight decay of 0.01, learning rate warmup over 10,000 steps, and linear decay of the learning rate afterwards.

  • Case-sensitive tokenization with a WordPiece vocabulary (30,000 tokens)
  • Whole word masking for more coherent predictions (see the sketch after this list)
  • Bidirectional context understanding through the transformer architecture
  • Optimized with a batch size of 256, Adam, learning rate warmup, and linear decay
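The whole word masking behavior referenced above can be reproduced at pre-processing time with the DataCollatorForWholeWordMask utility from Hugging Face transformers. The sketch below is illustrative only (the example sentence and batch handling are assumptions) and assumes a PyTorch backend.

```python
from transformers import AutoTokenizer, DataCollatorForWholeWordMask

tokenizer = AutoTokenizer.from_pretrained(
    "google-bert/bert-large-cased-whole-word-masking"
)

# When a word splits into several WordPiece tokens, the collator masks all of
# its pieces together instead of masking pieces independently.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

encoding = tokenizer("Tokenization splits rarer words into WordPiece subwords.")
batch = collator([encoding])

# Inspect which positions were replaced by [MASK].
print(tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist()))
```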

Core Capabilities

  • Superior performance on sequence classification tasks
  • Token classification and question answering
  • Masked language modeling for text completion
  • Sentence pair classification tasks
  • Feature extraction for downstream NLP tasks (see the sketch after this list)
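For the feature-extraction use listed above, a minimal sketch is shown below; the mean-pooling step is an illustrative assumption rather than a recommendation from the model card, and PyTorch plus transformers are assumed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "google-bert/bert-large-cased-whole-word-masking"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

sentences = ["BERT produces contextual embeddings.", "Case matters: US vs. us."]
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, ignoring padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # torch.Size([2, 1024]) -- 1024-dimensional hidden size
```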

Frequently Asked Questions

Q: What makes this model unique?

The whole word masking approach sets this model apart from standard BERT implementations. By masking entire words rather than subword tokens, it develops a more natural understanding of language semantics. Combined with its case-sensitivity, it's particularly effective for tasks requiring precise language understanding.
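A small tokenizer sketch (assuming transformers is installed; the example words are arbitrary) makes both properties concrete: the cased vocabulary keeps capitalization, and rarer words split into several WordPiece tokens that whole word masking would mask together.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "google-bert/bert-large-cased-whole-word-masking"
)

# Case-sensitive: "Apple" and "apple" map to different vocabulary entries.
print(tokenizer.tokenize("Apple"))
print(tokenizer.tokenize("apple"))

# A rarer word is split into several WordPiece pieces; under whole word
# masking, all of these pieces are masked together during pre-training.
print(tokenizer.tokenize("electrocardiogram"))
```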

Q: What are the recommended use cases?

This model excels in tasks that require understanding of complete sentences, such as sequence classification, token classification, and question answering. It's particularly suitable for applications where case sensitivity matters and when working with formal text that requires precise understanding of word relationships.
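For extractive question answering specifically, a common path is the separately published SQuAD fine-tuned variant of this checkpoint; the sketch below assumes transformers with a PyTorch backend and uses an illustrative question/context pair.

```python
from transformers import pipeline

# SQuAD 1.1 fine-tuned variant of the whole-word-masking checkpoint.
qa = pipeline(
    "question-answering",
    model="google-bert/bert-large-cased-whole-word-masking-finetuned-squad",
)

result = qa(
    question="What data was the model pre-trained on?",
    context=(
        "bert-large-cased-whole-word-masking was pre-trained on BookCorpus "
        "and English Wikipedia, then fine-tuned on SQuAD 1.1 for question answering."
    ),
)
print(result["answer"], round(result["score"], 4))
```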
