Nous-Hermes-13B-GGML
| Property | Value |
|---|---|
| License | Other |
| Format | GGML |
| Author | TheBloke |
| Base Model | Nous-Hermes-13B |
What is Nous-Hermes-13B-GGML?
Nous-Hermes-13B-GGML is a quantized version of the Nous-Hermes-13B model, packaged in the GGML format for CPU inference (with optional GPU layer offloading) via the llama.cpp framework. The model comes in multiple quantization variants ranging from 2-bit to 8-bit, each offering a different trade-off between file size, inference speed, and accuracy.
Implementation Details
The model is available in the original llama.cpp quantization methods (q4_0, q4_1, q5_0, q5_1, q8_0) as well as the newer k-quant methods (q2_K, q3_K_S, q3_K_M, q3_K_L, q4_K_S, q4_K_M, q5_K_S, q6_K). File sizes range from 5.43 GB to 13.83 GB depending on the quantization level.
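As a rough sanity check on those sizes, a nominal lower bound on a GGML file is simply parameter count times bits per weight. The sketch below is illustrative only: the 13-billion parameter count is taken from the model name, and the gap between the nominal bound and the actual files reflects per-block scale data and mixed-precision k-quant layouts.

```python
# Nominal lower bound on GGML file size: parameters x bits-per-weight / 8.
# Actual files run larger because each quantization block also stores scale
# (and sometimes min) values, and k-quant formats mix precisions across
# tensors -- hence the real range of 5.43 GB to 13.83 GB.
N_PARAMS = 13_000_000_000  # "13B" in the model name

for name, bits in [("q2_K", 2), ("q4_0", 4), ("q5_1", 5), ("q8_0", 8)]:
    approx_gb = N_PARAMS * bits / 8 / 1e9
    print(f"{name}: >= ~{approx_gb:.2f} GB nominal")
```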
- Supports the Alpaca prompt format
- Compatible with multiple UI frameworks, including text-generation-webui and KoboldCpp
- Offers GPU layer offloading (both the prompt format and offloading are demonstrated in the sketch below)
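Putting those pieces together, here is a minimal loading-and-generation sketch using the llama-cpp-python bindings. The file name, layer count, and sampling settings are illustrative assumptions; note also that current llama-cpp-python releases expect GGUF files, so a release from the GGML era is required for these .bin files.

```python
from llama_cpp import Llama

# Load a quantized GGML file; the exact file name is an assumption here.
# n_gpu_layers offloads that many transformer layers to the GPU (0 = CPU only).
llm = Llama(
    model_path="nous-hermes-13b.ggmlv3.q4_K_M.bin",
    n_ctx=2048,       # context window
    n_threads=8,      # CPU threads for the non-offloaded layers
    n_gpu_layers=32,  # tune to fit your VRAM; 0 for pure CPU inference
)

# Alpaca-style prompt, as the card recommends.
prompt = (
    "### Instruction:\n"
    "Explain what GGML quantization does in two sentences.\n\n"
    "### Response:\n"
)

out = llm(prompt, max_tokens=256, temperature=0.7, stop=["### Instruction:"])
print(out["choices"][0]["text"])
```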
Core Capabilities
- Long-form response generation
- Low hallucination rate
- Strong performance across various benchmarks (top rankings on ARC-c, ARC-e, and HellaSwag)
- Flexible deployment options for both CPU and GPU inference
Frequently Asked Questions
Q: What makes this model unique?
This model stands out for its variety of quantization options, which let users balance performance against resource usage. It offers broad compatibility with the llama.cpp ecosystem while preserving the strong performance of the base Nous-Hermes-13B model.
Q: What are the recommended use cases?
The model is ideal for local deployment scenarios where GPU resources are limited or unavailable. It is particularly well suited to applications requiring natural language understanding, creative writing, and instruction following, while letting users choose the quantization level that best matches their hardware (one hypothetical way to automate that choice is sketched below).
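To make that last point concrete, here is a hypothetical selection helper. The mid-range file size, the overhead figure, and the use of psutil are all assumptions for illustration; only the smallest and largest sizes come from this card.

```python
import psutil  # third-party dependency; used only to read available RAM

# Illustrative variant sizes in GB: the q2_K and q8_0 figures are from the
# card, the q4_K_M figure is an assumed mid-range value.
VARIANTS_GB = {
    "q2_K": 5.43,
    "q4_K_M": 8.0,
    "q8_0": 13.83,
}
OVERHEAD_GB = 3.0  # assumed headroom for KV cache and runtime buffers

def pick_variant(available_gb: float) -> str:
    """Return the largest variant whose file plus overhead fits in RAM."""
    fitting = [name for name, size in VARIANTS_GB.items()
               if size + OVERHEAD_GB <= available_gb]
    # Entries are ordered smallest to largest, so the last fit is the
    # highest-quality variant that still fits.
    return fitting[-1] if fitting else "none fit; consider a smaller model"

available = psutil.virtual_memory().available / 1e9
print(f"~{available:.1f} GB free -> {pick_variant(available)}")
```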