"How much VRAM do I need?" is the first and most consequential question for anyone assembling a system for local AI. The answer is more tractable than it might appear. The memory a model requires can be estimated from a single property called the parameter count using one straightforward calculation. This guide presents that calculation and applies it to the model sizes most people run and explains the two factors that adjust the result.
The underlying formula
A model's memory requirement is dominated by its weights and the weights' size can be computed directly. The expression parameters × bits-per-weight ÷ 8 yields the figure in gigabytes. In practice almost everyone runs models quantized to 4-bit precision, which reduces the general formula to a convenient approximation:
- Approximately 0.6 GB of VRAM per billion parameters at 4-bit precision.
- An additional 1 to 2 GB of overheadfor the inference runtime and the operating system's own use of the GPU.
- A further allowance for long context. An extended conversation or a large document can require anywhere from one to several additional gigabytes.
Take a concrete case. An 8-billion-parameter model needs roughly 8 × 0.6 ≈ 5 GB for its weights plus overhead. That total fits comfortably on an 8 GB card. The same method scales predictably to any model size, which is what makes the requirement possible to plan around in advance.
Requirements by model size
The following figures assume 4-bit quantization and modest context. They represent the memory needed to run each model entirely on the GPU, which is the condition for acceptable speed:
- 7–8B models (~5 GB): the everyday workhorses for conversation and writing and light programming. They fit on 8 GB cards.
- 13–14B models (~9 GB): noticeably more capable. 12 GB is workable though 16 GB is preferable to leave room for context.
- 32B models (~20 GB): strong at reasoning and coding. They require 24 GB to run fully on the GPU.
- 70B models (~42 GB): the largest open-weight models in common use. They do not fit on a single consumer card and require two 24 GB GPUs or a 32 GB-class card or slow offload to system RAM.
The two factors that change the result
Quantization reduces the requirement
Every figure above assumes 4-bit precision. A higher-quality quantization such as Q6 or Q8 increases both the memory footprint and the output quality. Quantizing below 4-bit reduces the footprint further at a more pronounced cost to quality. The trade-off is rarely worth managing by hand, so the WillMyGPURunIt calculator automatically selects the highest-quality quantization that fits a given card and removes the need to estimate it manually.
Context increases it
The context window is the quantity of conversation or document the model holds in working memory at once. It adds to the requirement on top of the weights because the KV cache that stores it grows with length. If you intend to feed in long documents, reserve several additional gigabytes of headroom beyond the weight figure. For short conversational use the default overhead allowance is generally sufficient.
The practical conclusion
The guiding principle is to match VRAM to ambition. 8 GB supports capable 7-to-8-billion-parameter assistants. 12 to 16 GB covers the comfortable middle ground. 24 GB is required to run the large 32-billion-parameter models locally. If you are selecting a card, the best GPUs for local LLMs ranks the options by memory and value. For a card already owned, the calculator reports exactly which models it can run today, so the estimate above can be confirmed against real hardware in a few seconds.