GGUF vs EXL2 vs AWQ: Local AI Model Formats Explained

Anyone downloading a model from Hugging Face quickly meets three acronyms that look deceptively similar: GGUF, EXL2, and AWQ. All three are file formats for quantized language models, yet they belong to distinct ecosystems, run on different hardware, and suit different situations. Understanding what separates them is the prerequisite for choosing the right download, and for understanding why the same model may run comfortably in one tool but refuse to load in another.

Format vs. quantization level: an important distinction

Before comparing the formats themselves, it helps to separate two ideas that are frequently conflated. Quantization level refers to how aggressively a model's weights are compressed. These are the Q4, Q5, and Q8 labels discussed in the quantization guide. A format, by contrast, is the container: the file structure, the metadata layout, and the runtime that reads it. GGUF, EXL2, and AWQ are all formats. Each can store weights at various quantization levels, but they are incompatible with one another. A tool built to load GGUF cannot read an EXL2 file, and vice versa. The format determines which software can run the model. The quantization level determines how much memory the model requires and how closely it matches the original.
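The memory side of that distinction is simple arithmetic. As a rough sketch, assuming weight storage dominates and ignoring the KV cache, activations, and per-format metadata, the size of the weights is approximately the parameter count times the average bits per weight, divided by eight. The function name and the example figures are illustrative and do not come from any particular tool.

```python
def estimate_weight_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB.

    Assumption: every weight costs `bits_per_weight` bits on average.
    Real files add metadata, and inference needs extra memory for the KV cache.
    """
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 1024**3


if __name__ == "__main__":
    # Hypothetical 7-billion-parameter model at a few average bitrates.
    for bpw in (16, 8, 5, 4):
        print(f"{bpw:>2} bits/weight: {estimate_weight_gib(7e9, bpw):.1f} GiB")
```

The estimate is a lower bound on what the runtime will actually need, so it is best used to rule options out rather than to confirm a fit.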

What is GGUF?

GGUF (GPT-Generated Unified Format) is the successor to the older GGML format and is the creation of the llama.cpp project. It stores the model weights, the tokenizer, and all necessary metadata in a single self-contained file, which makes distribution and loading straightforward. The format is consumed by llama.cpp directly and by every major consumer-facing tool built on top of it. Ollama, LM Studio, GPT4All, Jan, and koboldcpp all read GGUF natively.
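Because everything lives in one file, a GGUF file can be recognised from its first few bytes. The sketch below parses the fixed-size header, assuming the little-endian layout used by GGUF version 2 and later: a 4-byte magic, a 32-bit version, then 64-bit counts of tensors and metadata entries. The function name and the example values are made up for illustration, and the sketch does not handle big-endian files or the metadata that follows the header.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header at the start of a GGUF file.

    Assumed layout (little-endian, GGUF v2+):
    4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count.
    """
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read a GGUF header")
    magic, version, n_tensors, n_kv = struct.unpack("<4sIQQ", data[:HEADER_SIZE])
    if magic != GGUF_MAGIC:
        raise ValueError("not a GGUF file (magic bytes do not match)")
    return {"version": version, "tensor_count": n_tensors, "metadata_kv_count": n_kv}


if __name__ == "__main__":
    # Synthetic header with made-up counts, standing in for the first bytes of a real file.
    fake = struct.pack("<4sIQQ", GGUF_MAGIC, 3, 291, 24)
    print(read_gguf_header(fake))
```

With a real file, the same function can be fed the result of reading the first 24 bytes in binary mode.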

The defining advantage of GGUF is its hardware compatibility. The same file runs on CPUs with no GPU present, on NVIDIA and AMD graphics cards, and on Apple Silicon via Metal. When a system has less VRAM than the model requires, llama.cpp can offload as many layers as fit to the GPU and handle the rest on the CPU. This graceful CPU fallback means a model can run on almost any modern machine, albeit more slowly when the GPU cannot hold the entire model. GGUF files also support a wide range of quantization levels, including K-quants such as Q4_K_M and Q5_K_M as well as Q8_0 and others, so users can choose the precision tier that best fits their hardware.
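The layer split can be estimated ahead of time. The following is a minimal sketch, assuming every layer is the same size and that a fixed amount of VRAM is held back for the KV cache and scratch buffers. Both are simplifications, and the function and parameter names are hypothetical.

```python
def layers_that_fit(n_layers: int, model_gib: float, free_vram_gib: float,
                    reserve_gib: float = 1.0) -> int:
    """Estimate how many layers can be placed on the GPU.

    Assumptions: all layers are equally large, and `reserve_gib` of VRAM
    is kept free for the KV cache and temporary buffers.
    """
    if n_layers <= 0 or model_gib <= 0:
        raise ValueError("n_layers and model_gib must be positive")
    per_layer_gib = model_gib / n_layers
    usable_gib = max(free_vram_gib - reserve_gib, 0.0)
    # Whole layers only; never report more layers than the model has.
    return min(n_layers, int(usable_gib // per_layer_gib))


if __name__ == "__main__":
    # Hypothetical 32-layer model of 16 GiB on a GPU with 9 GiB free.
    print(layers_that_fit(n_layers=32, model_gib=16.0, free_vram_gib=9.0))
```

In llama.cpp the resulting number corresponds to the GPU layer count option (`--n-gpu-layers`), and treating the estimate as a starting point to adjust downward is safer than trusting it outright.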

The trade-off is throughput. When a model does fit entirely in VRAM, GGUF is slower than purpose-built GPU formats because the llama.cpp runtime is optimised for broad compatibility rather than peak GPU performance.

What is EXL2?

EXL2 is the quantization format produced by the ExLlamaV2 library developed by turboderp. It was designed from the outset for maximum throughput on NVIDIA GPUs and achieves this through custom CUDA kernels tuned specifically for mixed-precision inference. The format is consumed by ExLlamaV2 directly and by serving tools built on top of it, most notably TabbyAPI.

The most distinctive property of EXL2 is its variable bits-per-weight system. Rather than applying a uniform quantization level across every layer, the format allows the quantizer to allocate more bits to layers that are sensitive to precision loss and fewer bits to layers that tolerate it. This lets a model be quantized to an arbitrary average bitrate, such as 4.65 bpw or 3.75 bpw, rather than being limited to whole-number steps. The result is finer VRAM control: you can squeeze a model into a specific memory budget while preserving as much quality as the budget allows.
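A fractional average such as 4.65 bpw is simply a weighted mean over layers stored at different bit widths. The sketch below shows only that arithmetic. The layer sizes and bit widths are invented for illustration, and it does not reproduce how the ExLlamaV2 quantizer decides which layer receives how many bits.

```python
def average_bpw(layers: list[tuple[int, float]]) -> float:
    """Weighted average bits per weight.

    `layers` holds (number_of_weights, bits_per_weight) pairs, one per layer.
    """
    total_weights = sum(n for n, _ in layers)
    if total_weights == 0:
        raise ValueError("layers must contain at least one weight")
    total_bits = sum(n * bits for n, bits in layers)
    return total_bits / total_weights


if __name__ == "__main__":
    # Hypothetical model: one sensitive layer kept at 6 bits, three tolerant ones at 4 bits.
    layers = [(1_000_000, 6.0), (1_000_000, 4.0), (1_000_000, 4.0), (1_000_000, 4.0)]
    print(f"{average_bpw(layers):.2f} bpw")
```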

The critical constraint is that EXL2 is GPU-only. It has no CPU offload path, so if the entire model does not fit in VRAM, it will not load. This is an acceptable trade-off for users with sufficient GPU memory, because ExLlamaV2's custom kernels produce meaningfully higher inference speed than GGUF at comparable quality. That is a significant advantage for tasks that demand fast responses or large amounts of generated text.

What is AWQ?

AWQ (Activation-Aware Weight Quantization) originated as a research method from MIT's Han Lab and won the MLSys 2024 Best Paper Award. Its key insight is that not all weights in a model are equally important. A small fraction of weights, identifiable by examining activation patterns on a calibration dataset, contributes disproportionately to output quality. AWQ protects those salient weights by scaling their channels before quantization so that they lose less precision, while the rest of the model is quantized aggressively. Every weight is still stored at the same low bit width, which achieves strong quality retention at 4-bit without the overhead of mixed-precision schemes.
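As a toy illustration of how a salient weight can be protected, the sketch below uses the per-channel scaling idea from the AWQ paper: the weight is multiplied by a factor before round-to-nearest quantization and divided by the same factor afterwards. The group values and the scale of 2.0 are hand-picked for the example. AWQ itself searches for scales using calibration activations and folds the inverse scale into the surrounding computation, and this is not the AutoAWQ implementation.

```python
def quantize_group(weights: list[float], bits: int = 4) -> list[float]:
    """Symmetric round-to-nearest quantization of one group, returned dequantized."""
    qmax = 2 ** (bits - 1) - 1
    step = max(abs(w) for w in weights) / qmax
    if step == 0:
        return list(weights)  # all-zero group: nothing to quantize
    return [round(w / step) * step for w in weights]


def salient_error(weights: list[float], idx: int, scale: float = 1.0, bits: int = 4) -> float:
    """Absolute error on weights[idx] when it is scaled up before quantization
    and scaled back down afterwards."""
    scaled = list(weights)
    scaled[idx] *= scale
    restored = quantize_group(scaled, bits)[idx] / scale
    return abs(restored - weights[idx])


if __name__ == "__main__":
    group = [0.9, -0.31, 0.2, 0.47]  # made-up weights; index 2 plays the salient one
    print("error without scaling:", salient_error(group, 2, scale=1.0))
    print("error with scale 2.0: ", salient_error(group, 2, scale=2.0))
```

The scaling helps here because the salient weight is not the largest in its group, so the quantization step stays the same while the weight's rounding error is divided by the scale. That is not guaranteed for every weight or every scale, which is why the real method searches for the factor.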

AWQ is a GPU-only format and integrates natively with vLLM, Hugging Face's Text Generation Inference, and the AutoAWQ library. It is the standard choice for server-side deployment and high-throughput GPU inference, particularly when serving multiple users concurrently. The AWQ format is also supported by some builds of LM Studio as an alternative backend.

Like EXL2, the AWQ format requires the full model to reside in GPU VRAM. CPU offloading is not supported. Unlike EXL2, the bitrate is typically fixed at 4-bit (INT4), and the format does not offer the same fine-grained per-layer control over average bit width. Its strength is production readiness. The toolchain around AWQ is mature, well documented, and heavily used in cloud inference pipelines.

What about safetensors and GPTQ?

Two other formats appear frequently in model repositories and deserve a brief mention. Safetensors is not a quantization format but a safe, fast serialisation format for tensors, most commonly seen holding unquantized (FP16 or BF16) weights. It is what most models are released in before quantization, and it is the format used for fine-tuning and research workflows. GPTQ is an older GPU quantization format that preceded both AWQ and EXL2. It is widely available on Hugging Face and supported by vLLM and ExLlamaV2, but it has largely been superseded in practice by the newer formats, which offer better quality at equivalent bitrates.
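Part of what makes safetensors safe is that its header is plain JSON rather than executable pickle data. As a small sketch, assuming the documented layout of an 8-byte little-endian header length followed by that many bytes of UTF-8 JSON, the tensor names, dtypes, and shapes can be listed without loading any weights. The function name and the sample tensor entry are illustrative.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the JSON header of a safetensors file.

    Assumed layout: 8-byte little-endian unsigned length N, then N bytes of UTF-8 JSON.
    """
    if len(data) < 8:
        raise ValueError("need at least 8 bytes")
    (header_len,) = struct.unpack("<Q", data[:8])
    if len(data) < 8 + header_len:
        raise ValueError("header is truncated")
    return json.loads(data[8:8 + header_len].decode("utf-8"))


if __name__ == "__main__":
    # Synthetic file with one made-up FP16 tensor of shape 2x2 (8 bytes of data).
    header = json.dumps(
        {"w": {"dtype": "F16", "shape": [2, 2], "data_offsets": [0, 8]}}
    ).encode("utf-8")
    blob = struct.pack("<Q", len(header)) + header + bytes(8)
    print(read_safetensors_header(blob))
```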

GGUF vs EXL2 vs AWQ: which should you use?

The right format follows from the hardware available and the software being used. The list below maps common situations to the appropriate choice:

Using Ollama, LM Studio, GPT4All, or koboldcpp. Use GGUF. These tools are built on llama.cpp and accept no other format natively.
Running on CPU only, or on AMD or Apple Silicon. Use GGUF. It is the only format with a CPU fallback path and broad non-NVIDIA support.
Model partially fits in VRAM. Use GGUF. Only llama.cpp can split layers across GPU and CPU. EXL2 and AWQ require the full model in VRAM.
Entire model fits in VRAM on an NVIDIA GPU and speed is the priority. Use EXL2 via ExLlamaV2 or TabbyAPI. The custom CUDA kernels produce the highest throughput of any format in this scenario.
Targeting a specific VRAM budget with the largest possible model. Use EXL2. The fractional bits-per-weight control allows finer sizing than the fixed steps available in GGUF or AWQ.
Running vLLM, Text Generation Inference, or a server deployment. Use AWQ, or GPTQ where AWQ is unavailable. These tools are built around GPU-server pipelines and integrate AWQ as a first-class quantization format.
Unsure, or wanting maximum compatibility. Use GGUF at Q4_K_M. It runs on the widest range of hardware and software, and the quality is sufficient for the vast majority of tasks. For help choosing the right quantization level, see the quantization guide.
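
The mapping above can be condensed into a small decision function. This is a sketch of the list's logic only: the function name, the parameter names, and the priority labels are invented for illustration, and the runtime names are matched as plain lowercase strings.

```python
from typing import Optional

LLAMA_CPP_RUNTIMES = {"llama.cpp", "ollama", "lm studio", "gpt4all", "jan", "koboldcpp"}
EXLLAMA_RUNTIMES = {"exllamav2", "tabbyapi"}
SERVER_RUNTIMES = {"vllm", "text generation inference"}


def choose_format(runtime: Optional[str] = None, nvidia_gpu: bool = False,
                  fits_in_vram: bool = False, priority: str = "compatibility") -> str:
    """Suggest a format. `priority` is one of 'compatibility', 'speed',
    'vram_budget', or 'serving' (labels invented for this sketch)."""
    name = runtime.strip().lower() if runtime else ""
    # An already chosen runtime decides the format.
    if name in LLAMA_CPP_RUNTIMES:
        return "GGUF"
    if name in EXLLAMA_RUNTIMES:
        return "EXL2"
    if name in SERVER_RUNTIMES:
        return "AWQ"
    # CPU-only, AMD, Apple Silicon, or a model that only partly fits: GGUF.
    if not (nvidia_gpu and fits_in_vram):
        return "GGUF"
    if priority == "serving":
        return "AWQ"
    if priority in ("speed", "vram_budget"):
        return "EXL2"
    return "GGUF"  # unsure, or wanting maximum compatibility


if __name__ == "__main__":
    print(choose_format(runtime="Ollama"))
    print(choose_format(nvidia_gpu=True, fits_in_vram=True, priority="speed"))
```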

Ultimately, the format decision is often made by the choice of software rather than the other way around. A user who already runs Ollama has their format determined for them. The practical choice arises when a user has a capable NVIDIA GPU, sufficient VRAM for the desired model, and the freedom to pick a runtime. At that point, EXL2 with TabbyAPI offers a measurable speed advantage over GGUF with Ollama. For everything else, VRAM capacity should guide the quantization level, and the calculator can confirm whether a given model fits at all.