Nemotron-Mini-4B-Instruct-GGUF

Maintained by: bartowski

Original Model: nvidia/Nemotron-Mini-4B-Instruct
Parameters: 4 billion
Format: GGUF (various quantizations)
Author: bartowski
Model URL: https://huggingface.co/bartowski/Nemotron-Mini-4B-Instruct-GGUF

What is Nemotron-Mini-4B-Instruct-GGUF?

Nemotron-Mini-4B-Instruct-GGUF is a comprehensive collection of GGUF quantized versions of NVIDIA's Nemotron-Mini-4B-Instruct model. This collection offers various quantization levels optimized for different hardware configurations and memory constraints, ranging from full F16 weights (8.39GB) to highly compressed IQ3_M format (2.18GB).
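As a minimal sketch of how one of these files might be fetched with huggingface_hub: the filename below assumes bartowski's usual "<model>-<quant>.gguf" naming convention and should be verified against the repository's file list.

```python
# Sketch: download a single quantization from the repo with huggingface_hub.
# The filename assumes bartowski's usual "<model>-<quant>.gguf" naming;
# verify it against the repository's file list before running.
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="bartowski/Nemotron-Mini-4B-Instruct-GGUF",
    filename="Nemotron-Mini-4B-Instruct-Q4_K_M.gguf",  # assumed filename
)
print(model_path)  # local path to the downloaded GGUF file
```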

Implementation Details

The quantizations were produced with llama.cpp's imatrix (importance matrix) option, and the model expects a specific prompt format built on special tokens (see the inference sketch after the list below). Multiple quantization options are available, each suited to different use cases and hardware configurations, from high-quality Q8_0 down to memory-efficient IQ3_M.

  • Supports various quantization formats including K-quants and I-quants
  • Special optimizations for ARM inference with Q4_0_X_X variants
  • Enhanced versions with Q8_0 embeddings and output weights
  • Compatible with different hardware acceleration options (cuBLAS, rocBLAS, Metal)
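The sketch below runs a downloaded quant with llama-cpp-python. The `<extra_id_*>` prompt template mirrors the format documented for the upstream nvidia/Nemotron-Mini-4B-Instruct model; confirm it against this repository's README before relying on it, and treat the local filename as an assumption.

```python
# Sketch: local inference with llama-cpp-python (pip install llama-cpp-python).
# The <extra_id_*> prompt template follows the format documented for the
# upstream nvidia/Nemotron-Mini-4B-Instruct model; confirm against the README.
from llama_cpp import Llama

llm = Llama(
    model_path="Nemotron-Mini-4B-Instruct-Q4_K_M.gguf",  # assumed local filename
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers if a GPU backend (cuBLAS/rocBLAS/Metal) is built in
)

prompt = (
    "<extra_id_0>System\n"
    "You are a helpful assistant.\n\n"
    "<extra_id_1>User\n"
    "Explain GGUF quantization in one sentence.\n"
    "<extra_id_1>Assistant\n"
)

out = llm(prompt, max_tokens=128, stop=["<extra_id_1>"])
print(out["choices"][0]["text"])
```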

Core Capabilities

  • Flexible deployment options for different hardware configurations
  • Memory-efficient inference with minimal quality loss
  • Optimized performance for various GPU architectures
  • Special ARM-optimized versions for mobile/embedded deployment

Frequently Asked Questions

Q: What makes this model unique?

The model offers an extensive range of quantization options, allowing users to balance model quality against resource requirements. It includes special optimizations for different hardware architectures, as well as variants that keep embedding and output weights at higher Q8_0 precision for improved quality.

Q: What are the recommended use cases?

For maximum speed, choose a quantization whose file size is 1-2GB smaller than your GPU's VRAM, so the entire model fits on the GPU. For maximum quality, size against your combined system RAM and GPU VRAM instead. K-quants are the straightforward choice for general use, while I-quants offer better quality at small sizes on specific hardware, especially Nvidia or AMD GPUs using cuBLAS/rocBLAS builds. A sketch of the sizing rule follows.
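To make the sizing rule concrete, here is a minimal helper sketch. Only the F16 (8.39GB) and IQ3_M (2.18GB) figures come from this card; the intermediate quant sizes are illustrative placeholders, not the repository's actual file sizes.

```python
# Sketch: pick the largest quant whose file size leaves ~1-2GB of headroom,
# per the sizing rule above. Only the F16 and IQ3_M sizes come from this card;
# the intermediate entries are illustrative placeholders.
QUANT_SIZES_GB = {
    "F16": 8.39,
    "Q8_0": 4.5,    # illustrative
    "Q4_K_M": 2.7,  # illustrative
    "IQ3_M": 2.18,
}

def pick_quant(budget_gb: float, headroom_gb: float = 1.5) -> str | None:
    """Return the largest quant that fits in budget_gb minus headroom."""
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget_gb - headroom_gb}
    return max(fitting, key=fitting.get) if fitting else None

# Max speed: budget = GPU VRAM. Max quality: budget = system RAM + GPU VRAM.
print(pick_quant(8.0))   # 8GB VRAM -> "Q8_0" under these placeholder sizes
print(pick_quant(24.0))  # ample memory -> "F16"
```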
