FNet-Large
| Property | Value |
|---|---|
| Architecture | 24-layer Transformer with Fourier Transform |
| Hidden Dimension | 1024 |
| Training Data | C4 (Cleaned Common Crawl) |
| Paper | FNet: Mixing Tokens with Fourier Transforms |
| Developer | Google Research |
What is fnet-large?
FNet-Large is a transformer-style model that replaces the self-attention sublayer with Fourier transforms, a token-mixing approach that avoids attention's quadratic cost in sequence length. Developed by Google Research, it achieves strong performance while potentially reducing computational complexity. The model is pre-trained on the C4 dataset using masked language modeling (MLM) and next sentence prediction (NSP) objectives, reaching 0.58 accuracy on MLM and 0.80 accuracy on NSP.
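FNet is supported in the Hugging Face Transformers library. The snippet below is a minimal sketch of MLM inference; it assumes the checkpoint is published on the Hub as `google/fnet-large`.

```python
import torch
from transformers import FNetForMaskedLM, FNetTokenizer

# Assumed Hub id for this checkpoint.
model_id = "google/fnet-large"
tokenizer = FNetTokenizer.from_pretrained(model_id)
model = FNetForMaskedLM.from_pretrained(model_id)

# FNet uses no attention mask; the tokenizer returns only
# input_ids and token_type_ids.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the masked position and decode the top-scoring token.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```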
Implementation Details
The model employs a 24-layer architecture with a hidden dimension of 1024. Input text is processed with SentencePiece tokenization over a 32,000-token vocabulary. During pre-training, the model was optimized with Adam (learning rate 1e-4, β1 = 0.9, β2 = 0.999) and trained on 4 cloud TPUs in Pod configuration for one million steps.
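As a quick check of those tokenization details, the following sketch (again assuming the `google/fnet-large` Hub id) loads the tokenizer, confirms the vocabulary size, and truncates to the 512-token limit:

```python
from transformers import FNetTokenizer

tokenizer = FNetTokenizer.from_pretrained("google/fnet-large")
print(tokenizer.vocab_size)  # expected: 32000 SentencePiece tokens

# Inputs beyond the model's 512-token maximum must be truncated.
encoded = tokenizer(
    "FNet mixes tokens with Fourier transforms.",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(encoded["input_ids"].shape)
```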
- Uses Fourier transforms instead of attention mechanisms (see the sketch after this list)
- Maximum sequence length of 512 tokens
- Implements masked language modeling with 15% token masking
- Trained with weight decay of 0.01 and learning rate warmup for 10,000 steps
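The mixing operation itself is simple: per the FNet paper, each self-attention sublayer is replaced by a 2D discrete Fourier transform applied along the sequence and hidden dimensions, keeping only the real part of the result. The reference implementation is in JAX, so the PyTorch version below is an illustrative sketch rather than the exact production code.

```python
import torch

def fourier_mixing(x: torch.Tensor) -> torch.Tensor:
    """FNet token mixing: a 2D DFT over the sequence and hidden
    dimensions, keeping only the real part."""
    # x has shape (batch, seq_len, hidden_dim); fft2 transforms
    # the last two dimensions.
    return torch.fft.fft2(x, dim=(-2, -1)).real

# Shape check on random activations sized like FNet-Large inputs.
x = torch.randn(2, 512, 1024)  # batch of 2, 512 tokens, hidden size 1024
print(fourier_mixing(x).shape)  # torch.Size([2, 512, 1024])
```

Because the transform has no learned parameters, the mixing sublayer adds no weights; all learning happens in the feed-forward sublayers and embeddings.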
Core Capabilities
- Masked Language Modeling (MLM)
- Next Sentence Prediction (NSP)
- Strong performance on GLUE benchmark tasks (81.9 average score)
- Effective for sequence classification and token classification tasks
- Suitable for fine-tuning on downstream tasks
Frequently Asked Questions
Q: What makes this model unique?
FNet-Large's distinctive feature is its use of Fourier transforms in place of attention mechanisms, which potentially offers computational advantages while maintaining strong performance on NLP tasks. This makes it an interesting alternative to traditional transformer architectures.
Q: What are the recommended use cases?
The model is best suited for tasks that involve whole-sentence processing, such as sequence classification, token classification, and question answering. It's designed to be fine-tuned on downstream tasks rather than used directly for text generation. The model particularly excels in scenarios where bidirectional context is important.
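As an illustration of that fine-tuning workflow, here is a minimal sketch for binary sequence classification. It assumes the Transformers `FNetForSequenceClassification` head and the `google/fnet-large` Hub id; the texts and labels are placeholders, and the optimizer settings simply mirror the pre-training hyperparameters above.

```python
import torch
from torch.optim import AdamW
from transformers import FNetForSequenceClassification, FNetTokenizer

model_id = "google/fnet-large"  # assumed Hub id
tokenizer = FNetTokenizer.from_pretrained(model_id)
model = FNetForSequenceClassification.from_pretrained(model_id, num_labels=2)

texts = ["great movie", "terrible plot"]  # placeholder examples
labels = torch.tensor([1, 0])             # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

# Mirrors the pre-training settings: Adam-style optimizer, lr 1e-4, weight decay 0.01.
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

model.train()
outputs = model(**batch, labels=labels)  # the head computes cross-entropy loss
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```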