JetMoE-8B
| Property | Value |
|---|---|
| Parameter Count | 8.52B (2.2B active) |
| Model Type | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Training Cost | ~$0.08 million |
| Paper | Technical Report |
What is JetMoE-8B?
JetMoE-8B is an open language model that reaches LLaMA2-7B-level performance at a fraction of the usual training cost. Trained for under $0.1 million, it demonstrates that competitive LLMs can be built far more economically than commonly assumed. The model uses a sparse Mixture of Experts architecture that activates only 2.2B of its 8.52B parameters per token during inference, significantly reducing computational requirements.
Implementation Details
The model consists of 24 blocks, each containing two sparsely activated layers: a Mixture of Attention heads (MoA) layer and a Mixture of MLP Experts (MoE) layer. Each layer has 8 experts, of which 2 are activated per input token by a learned router (a minimal routing sketch follows the list below). The model was trained on 1.25T tokens from public datasets using a two-phase training approach.
- Architecture: 24 blocks with MoA and MoE layers
- Training Data: 1.25T tokens from public datasets
- Learning Rate: 5.0 x 10^-4
- Batch Size: 4M tokens
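To make the top-2 routing idea concrete, here is a minimal sketch of a sparse MoE MLP layer in PyTorch. It illustrates only the MLP-expert side (not the mixture of attention heads), and the class name, layer sizes, and expert MLP shape are illustrative assumptions, not JetMoE-8B's actual configuration.

```python
# Hypothetical top-2 expert routing sketch; dimensions are illustrative,
# not JetMoE-8B's real hidden sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top2MoELayer()
tokens = torch.randn(16, 1024)
print(layer(tokens).shape)  # torch.Size([16, 1024])
```

Because only 2 of the 8 experts run per token, the compute per token scales with the active parameters rather than the full parameter count, which is the source of the 2.2B-active / 8.52B-total distinction above.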
Core Capabilities
- Outperforms LLaMA2-7B on multiple benchmarks
- Achieves an average score of 53.0 on the Open LLM Leaderboard
- Excels at MMLU (49.2%) and TruthfulQA (41.7%)
- Strong performance on GSM8K (27.8%) and MBPP (34.2%)
Frequently Asked Questions
Q: What makes this model unique?
JetMoE-8B combines cost-efficient training (under $0.1M) with strong benchmark performance, using only publicly available datasets and outperforming LLaMA2-7B, a model developed with vastly larger training resources.
Q: What are the recommended use cases?
The model is particularly well-suited for academic research and applications requiring efficient inference, as it only uses 2.2B active parameters. It performs well across a wide range of tasks, including mathematical reasoning, code generation, and general language understanding.
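For reference, a minimal inference sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the repo id "jetmoe/jetmoe-8b" and loads through the standard transformers Auto classes; verify the repo id and any version requirements before use.

```python
# Minimal generation sketch, assuming a Hub checkpoint at "jetmoe/jetmoe-8b".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b", device_map="auto")

prompt = "Explain mixture-of-experts models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```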