Flan-UL2
Property | Value |
---|---|
Model Size | 20B parameters |
Architecture | T5-based Encoder-Decoder |
License | Apache 2.0 |
Paper | UL2: Unifying Language Learning Paradigms |
What is Flan-UL2?
Flan-UL2 is a 20B-parameter encoder-decoder language model that combines the UL2 architecture with Flan instruction tuning. Compared with the original UL2 checkpoint, it extends the receptive field to 2048 tokens and removes the need for the mode-switching tokens that UL2 required. The model was pre-trained on the C4 corpus and then fine-tuned on the Flan instruction-tuning dataset collection.
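The snippet below is a minimal usage sketch (not part of the original card) showing how the publicly released `google/flan-ul2` checkpoint can be loaded with Hugging Face `transformers` and given an instruction-style prompt; memory requirements and generation settings will vary with your hardware.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and the 20B seq2seq model; bfloat16 plus device_map="auto"
# (which requires the accelerate package) keeps memory usage manageable.
tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Instruction-style prompt, as used for Flan-tuned models.
prompt = "Translate English to German: The house is wonderful."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```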
Implementation Details
The model architecture consists of 32 encoder layers and 32 decoder layers, with a model dimension of 4096 and a feed-forward dimension of 16384. It employs 16 attention heads, each with a dimension of 256. Pre-training covered roughly 1 trillion tokens over 2 million steps, using a batch size of 1024 and input/target sequence lengths of 512/512.
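For illustration only, these hyperparameters map onto the standard `T5Config` fields in `transformers` as sketched below; the values mirror the text above rather than being a dump of the released configuration file.

```python
from transformers import T5Config

# Hyperparameters from the description above expressed as T5Config arguments.
config = T5Config(
    num_layers=32,          # encoder layers
    num_decoder_layers=32,  # decoder layers
    d_model=4096,           # model (hidden) dimension
    d_ff=16384,             # feed-forward dimension
    num_heads=16,           # attention heads
    d_kv=256,               # dimension per attention head
)
print(config)
```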
- Utilizes Mixture-of-Denoisers (MoD) pre-training objective
- Implements three denoising strategies: R-Denoiser, S-Denoiser, and X-Denoiser
- Supports efficient 8-bit quantization for memory optimization (see the loading sketch after this list)
- Compatible with bfloat16 precision training
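As a hedged sketch of the 8-bit loading path mentioned above, the following uses the `bitsandbytes` integration in `transformers`; it assumes the `bitsandbytes` and `accelerate` packages are installed.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-ul2")
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-ul2",
    load_in_8bit=True,   # quantize linear layers to int8 at load time
    device_map="auto",   # place layers across available devices
)
```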
Core Capabilities
- Multi-language support (English, French, Romanian, German)
- Advanced text generation and translation
- Question answering and logical reasoning
- Mathematical problem solving
- Step-by-step reasoning tasks (see the prompting sketch below)
- Scientific knowledge processing
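The prompt below is an illustrative chain-of-thought example for the step-by-step reasoning capability listed above. It assumes `model` and `tokenizer` were loaded as in the earlier snippet, and the prompt wording is an assumption rather than an official template.

```python
# Ask the model to reason step by step before giving the final answer.
prompt = (
    "Answer the following question by reasoning step by step. "
    "The cafeteria had 23 apples. They used 20 for lunch and bought 6 more. "
    "How many apples do they have now?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(inputs.input_ids, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```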
Frequently Asked Questions
Q: What makes this model unique?
Flan-UL2 uniquely combines the UL2 architecture with Flan instruction tuning, offering improved few-shot learning capabilities and removing the need for mode tokens. It achieved significant improvements over its predecessor, showing gains of up to 7.4% on certain tasks.
Q: What are the recommended use cases?
The model excels in diverse NLP tasks including language generation, understanding, text classification, question answering, commonsense reasoning, long text reasoning, and information retrieval. It's particularly effective for tasks requiring few-shot learning and complex reasoning.