AMD-Llama-135m
| Property | Value |
|---|---|
| Parameter Count | 135M |
| License | Apache 2.0 |
| Architecture | LLaMA-based |
| Training Data | SlimPajama + Project Gutenberg (670B tokens) |
| Research Paper | GPT-NeoX Paper |
What is AMD-Llama-135m?
AMD-Llama-135m is a lightweight language model trained on AMD Instinct MI250 accelerators and designed to be compatible with the LLaMA2 architecture. Despite its small scale, it serves two roles: a standalone text generator, and a draft model for speculative decoding with larger LLaMA2 and CodeLlama models.
Implementation Details
The model has 12 transformer layers, a hidden dimension of 768, and 12 attention heads. It uses RMSNorm layer normalization, rotary positional embeddings (RoPE), and the SwiGLU activation function. The context window is 2048 tokens and the vocabulary size is 32,000 (see the configuration sketch after the list below).
- Trained on SlimPajama and Project Gutenberg datasets (670B tokens)
- Implements multi-head attention with 64-dimensional heads
- Optimized using AdamW with cosine learning rate scheduling
- Supports speculative decoding for performance acceleration
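As an illustration, the hyperparameters above map onto a Hugging Face transformers `LlamaConfig` roughly as follows. This is a sketch based only on the values stated on this page, not the repository's actual config file; unstated fields are left at their library defaults and may differ from the official checkpoint.

```python
# Sketch of a LlamaConfig matching the hyperparameters described above.
# Only values stated on this page are set; everything else keeps the
# transformers defaults, so this may differ from the official config.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=32000,               # vocabulary size of 32,000
    hidden_size=768,                # hidden dimension
    num_hidden_layers=12,           # 12 transformer layers
    num_attention_heads=12,         # 12 heads x 64 dims per head = 768
    max_position_embeddings=2048,   # 2048-token context window
    hidden_act="silu",              # SwiGLU gating uses the SiLU activation
)
```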
Core Capabilities
- General text generation and completion (see the usage sketch after this list)
- Code completion when fine-tuned (AMD-Llama-135m-code variant)
- Speculative decoding acceleration for larger models
- Competitive performance on various NLP benchmarks
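For plain text generation, a minimal sketch with the transformers library might look like the following. The Hub repo id `amd/AMD-Llama-135m` and the prompt are assumptions for illustration.

```python
# Minimal text-generation sketch; the repo id below is an assumption.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amd/AMD-Llama-135m"  # assumed Hugging Face Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The largest city in Europe is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```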
Frequently Asked Questions
Q: What makes this model unique?
Its ability to act as an efficient draft model for speculative decoding while remaining competitive for its size is what sets it apart. When used as a draft model, it delivers up to a 3.88x throughput speedup.
Q: What are the recommended use cases?
The model is particularly well-suited for deployment scenarios requiring efficient text generation, code completion tasks (when using the code-finetuned variant), and as a draft model for speculative decoding with larger LLaMA2 or CodeLlama models.
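For the draft-model use case, the sketch below shows how assisted (speculative) generation could be wired up with transformers, using the code-finetuned variant to draft tokens for a larger CodeLlama target. The repo ids, prompt, and generation settings are illustrative assumptions; the draft and target must share the LLaMA2 tokenizer, which is the point of the architecture compatibility described above.

```python
# Sketch of assisted (speculative) generation: the small model drafts tokens,
# the larger target model verifies them. Repo ids and prompt are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "codellama/CodeLlama-7b-hf"   # assumed larger target model
draft_id = "amd/AMD-Llama-135m-code"      # assumed code-finetuned draft model

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id)
draft = AutoModelForCausalLM.from_pretrained(draft_id)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt")
# assistant_model switches generate() into assisted decoding; it requires the
# draft and target to share a tokenizer/vocabulary, as LLaMA2-compatible
# models do.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```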