JetMoE-8B
| Property | Value |
|---|---|
| Parameter Count | 8.52B (2.2B active) |
| Model Type | Mixture of Experts (MoE) |
| License | Apache 2.0 |
| Training Cost | ~$0.08 million |
| Paper | Technical Report |
What is JetMoE-8B?
JetMoE-8B is an open language model that reaches LLaMA2-7B-level performance at a fraction of the usual training cost. Trained for under $0.1 million, it demonstrates that competitive LLMs can be built far more economically than commonly assumed. The model uses a sparse Mixture of Experts architecture that activates only 2.2B of its 8.52B parameters per token during inference, significantly reducing computational requirements.
Implementation Details
The model consists of 24 blocks, each containing two sparsely activated layers: a Mixture of Attention heads (MoA) layer and a Mixture of MLP Experts (MoE) layer. Each layer has 8 experts, of which 2 are activated per input token by a learned router (a minimal routing sketch follows the list below). The model was trained on 1.25T tokens from public datasets using a two-phase training approach.
- Architecture: 24 blocks with MoA and MoE layers
- Training Data: 1.25T tokens from public datasets
- Learning Rate: 5.0 x 10^-4
- Batch Size: 4M tokens
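To make the top-2 routing idea concrete, here is a minimal sketch of a sparse MoE MLP layer in PyTorch. It illustrates only the MLP-expert side (not the mixture of attention heads), and the class name, layer sizes, and expert MLP shape are illustrative assumptions, not JetMoE-8B's actual configuration.

```python
# Hypothetical top-2 expert routing sketch; dimensions are illustrative,
# not JetMoE-8B's real hidden sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    def __init__(self, d_model=1024, d_hidden=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        scores = self.router(x)                # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the 2 chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = Top2MoELayer()
tokens = torch.randn(16, 1024)
print(layer(tokens).shape)  # torch.Size([16, 1024])
```

Because only 2 of the 8 experts run per token, the compute per token scales with the active parameters rather than the full parameter count, which is the source of the 2.2B-active / 8.52B-total distinction above.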
Core Capabilities
- Outperforms LLaMA2-7B on multiple benchmarks
- Achieves an average score of 53.0 on the Open LLM Leaderboard
- Excels at MMLU (49.2%) and TruthfulQA (41.7%)
- Strong performance on GSM8K (27.8%) and MBPP (34.2%)
Frequently Asked Questions
Q: What makes this model unique?
JetMoE-8B combines cost-efficient training (under $0.1M) with strong benchmark performance, using only publicly available datasets and outperforming LLaMA2-7B, a model developed with vastly larger training resources.
Q: What are the recommended use cases?
The model is particularly well-suited for academic research and applications requiring efficient inference, as it only uses 2.2B active parameters. It performs well across a wide range of tasks, including mathematical reasoning, code generation, and general language understanding.
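For reference, a minimal inference sketch is shown below. It assumes the checkpoint is published on the Hugging Face Hub under the repo id "jetmoe/jetmoe-8b" and loads through the standard transformers Auto classes; verify the repo id and any version requirements before use.

```python
# Minimal generation sketch, assuming a Hub checkpoint at "jetmoe/jetmoe-8b".
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jetmoe/jetmoe-8b")
model = AutoModelForCausalLM.from_pretrained("jetmoe/jetmoe-8b", device_map="auto")

prompt = "Explain mixture-of-experts models in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```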