Quantization Explained: Q4 vs Q5 vs Q8

Quantization is the technique that allows a language model once assumed to require a data-centre GPU to run on an ordinary gaming card. Anyone browsing model downloads soon meets labels such as Q4_K_M, Q5_K_M or Q8_0 and finds them unexplained. This guide defines what quantization is, sets out what the Q-numbers mean, quantifies the trade-off between size and quality, and offers a rule for choosing among them.

The problem that quantization solves

A model consists of billions of numerical parameters known as weights. By default each weight is stored as a 16-bit floating-point number (FP16), which this guide treats as full precision. At 2 bytes per weight, models become large. A 13-billion-parameter model at full precision needs roughly 26 GB for the weights alone, which exceeds the VRAM of most consumer graphics cards. To fit such a model into available memory each weight must be made smaller, and that is precisely what quantization accomplishes.
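The arithmetic behind that figure can be sketched in a few lines. The function name model_size_gb is illustrative rather than taken from any library, and the estimate counts weights only, assuming 1 GB = 10^9 bytes and ignoring context and runtime overhead.

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate storage for the weights alone, in GB (1 GB = 10**9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9


# Weight storage at 16 bits for a few common model sizes.
for billions in (7, 13, 70):
    size = model_size_gb(billions * 1e9, 16)
    print(f"{billions}B parameters at 16 bits: about {size:.0f} GB")
```

The same function accepts any bit width, so it can be reused to compare the quantized formats discussed later.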

What quantization is

Quantization stores each weight using fewer bits, for instance reducing a 16-bit value to 4 bits. Fewer bits per weight yield a smaller file and a lower memory requirement. Reducing precision from 16-bit to 4-bit shrinks the model by roughly a factor of four. A pure 4-bit encoding would turn that 26 GB model into about 6.5 GB; practical 4-bit formats store a little extra scaling information alongside the weights, so the real file lands near 7 to 8 GB, well within the reach of a normal card.

The process is lossy by definition. Reducing the number of bits rounds each weight and discards some information. The notable finding, confirmed repeatedly in practice, is how well models tolerate this. A carefully constructed 4-bit version remains close to the original in ordinary use, and for most tasks the difference is difficult to detect. This robustness is the empirical basis for the entire local-AI ecosystem.
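The rounding step can be illustrated with a deliberately simplified scheme: one block of weights shares a single scale factor, and each weight is rounded to the nearest 4-bit signed integer. This is a sketch of the general idea only, not the actual format used by any particular quantized file; the function names and the randomly generated weights are hypothetical.

```python
import random


def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Each weight becomes a small integer code in the range [-qmax, qmax].
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(32)]   # made-up weights
codes, scale = quantize_block(block, bits=4)
restored = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f}, worst rounding error={max_err:.5f}")
```

The restored values differ from the originals by at most half a quantization step, which is the information discarded by the rounding.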

What do Q4, Q5 and Q8 mean?

The number following the Q indicates roughly how many bits each weight uses. Q4 denotes about 4 bits per weight and Q8 about 8. The additional letters, as in Q4_K_M, mark the so-called "K-quants": the K identifies the scheme and the final letter (S, M or L) its small, medium or large variant. K-quants are more sophisticated schemes that allocate a few extra bits to the weights that matter most and achieve better quality at nearly the same overall size. The practical hierarchy is as follows:

Q4_K_M is roughly 4 to 5 bits per weight. This is the most popular choice. It is between a quarter and a third of the full size, with quality most users cannot distinguish from the original in normal use.
Q5_K_M and Q6_K are somewhat larger and somewhat sharper. They are a reasonable step up when there is VRAM to spare.
Q8_0 is roughly 8 bits per weight. It is essentially indistinguishable from full precision but roughly twice the size of a 4-bit version.
FP16 is full precision with no quantization. It offers the highest quality and the largest footprint and is rarely necessary for local use.
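The hierarchy above can be turned into rough file sizes. The bits-per-weight figures in this sketch are approximate values assumed for illustration; real files vary with the model architecture and the tool that produced them.

```python
# Approximate bits per weight for each format (assumed, illustrative values).
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "FP16": 16.0,
}


def estimate_sizes_gb(n_params):
    """Estimated weight storage in GB (10**9 bytes) for each format."""
    return {
        name: n_params * bpw / 8 / 1e9
        for name, bpw in APPROX_BITS_PER_WEIGHT.items()
    }


for name, size in estimate_sizes_gb(13e9).items():
    print(f"{name:7s} ~{size:5.1f} GB")
```

Under these assumed figures a 13-billion-parameter model comes out a little under 8 GB at Q4_K_M and at 26 GB in FP16, in line with the numbers quoted earlier.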

The size-versus-quality trade-off

The relationship is straightforward. Fewer bits produce a smaller and faster model at a modest cost to quality, while more bits improve quality at the expense of size and speed. The decisive observation is that the quality gap at Q4_K_M is small enough that the majority of users adopt it without further deliberation. Quantizing below 4 bits should be reserved for cases where fitting a larger model is essential, because very low-bit quantization begins to degrade output noticeably: repetition, factual slips and weaker reasoning become more frequent.

Which quantization should you choose?

The available VRAM should make the decision. Select the highest-quality quantization that still fits the card with room for context. If a Q8_0 version of the desired model fits, use it. If not, step down to Q6_K, then Q5_K_M, then Q4_K_M. Most users running 7-to-14-billion-parameter models on consumer cards settle on Q4_K_M and are well served by it. Stepping down in this way is also how quantization unlocks larger models than a card could otherwise hold. See how much VRAM you need for the size-by-size breakdown. The WillMyGPURunIt calculator applies this same logic automatically and matches the best quantization to a given card so the choice need not be made by hand.
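The step-down rule can be written as a short function. This is a minimal sketch of the rule as described here, not the logic of any particular calculator; the function name pick_quant, the 2 GB allowance for context and the file sizes in the example are all assumptions chosen for illustration.

```python
# Highest quality first, following the step-down order in the text.
PREFERENCE = ["Q8_0", "Q6_K", "Q5_K_M", "Q4_K_M"]


def pick_quant(file_sizes_gb, vram_gb, context_headroom_gb=2.0):
    """Return the highest-quality format that fits, or None if none does."""
    budget = vram_gb - context_headroom_gb     # leave room for context
    for name in PREFERENCE:
        size = file_sizes_gb.get(name)
        if size is not None and size <= budget:
            return name
    return None


# Hypothetical file sizes for a 13-billion-parameter model, in GB.
sizes_13b = {"Q8_0": 13.8, "Q6_K": 10.7, "Q5_K_M": 9.3, "Q4_K_M": 7.9}

print(pick_quant(sizes_13b, vram_gb=16))   # Q8_0
print(pick_quant(sizes_13b, vram_gb=12))   # Q5_K_M
print(pick_quant(sizes_13b, vram_gb=8))    # None
```

A result of None means that no listed quantization of that model fits the card with the chosen headroom, which is the signal to consider a smaller model or a lower-bit format.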