How VRAM Affects Local AI · WillMyGPURunIt

Among the many specifications that describe a graphics card, one governs local AI more decisively than the rest. That spec is VRAM. It determines which models can run at all, how quickly they respond, and how much context they can retain. Before you consult any benchmark or compare processing speeds, the first figure to establish is how much video memory a card provides. This article explains why that single number carries such weight.

What VRAM is

VRAM (video random-access memory) is memory built directly onto the graphics card and distinct from the system RAM installed on the motherboard. It is considerably faster than system memory and sits immediately next to the GPU's processing cores. When an AI model runs, its weights (the billions of numerical parameters that make up the trained model) must occupy memory the GPU can access with minimal delay. That memory is VRAM.

System RAM is larger and far cheaper per gigabyte, but it is physically distant from the GPU and connected by a comparatively narrow channel. The GPU can draw on it when necessary, but only slowly. The governing principle for local AI follows directly: what fits within VRAM runs quickly, and what does not runs slowly. Almost every practical consequence below is an expression of that single fact.

Why a model must "fit"

Generating each word requires the GPU to read through the entire model. When the whole model resides in VRAM, that read completes quickly enough to produce a responsive assistant. Two distinct quantities compete for the available memory:

The weights. These account for the majority of the requirement. A serviceable rule of thumb is that at 4-bit quantization a model needs roughly 0.6 GB of VRAM per billion parameters. An 8-billion-parameter model therefore occupies about 5 GB, plus a small allowance for overhead.
The KV cache, which holds the context. As a conversation proceeds, the model maintains a running representation of everything said so far. The larger the context window, the more memory this cache consumes. Processing a lengthy document can add several gigabytes on top of the weights.
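
The two quantities above can be estimated with a few lines of arithmetic. The sketch below applies the 0.6 GB per billion parameters rule to the weights and uses the standard size formula for a transformer's KV cache (keys and values stored per layer, per KV head, per token). The architecture figures (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions chosen to resemble an 8B-class model, not figures from this article, and real inference software adds overhead on top.

```python
def weights_gb(params_billion: float, gb_per_billion: float = 0.6) -> float:
    """Rule of thumb from the text: about 0.6 GB per billion parameters at 4-bit."""
    return params_billion * gb_per_billion


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Keys and values (the factor 2) stored for every layer, KV head, and token."""
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_value)
    return total_bytes / 1e9


# Assumed, illustrative architecture for an 8B-class model.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for context in (2_048, 8_192, 32_768):
    w = weights_gb(8)
    kv = kv_cache_gb(context, **shape)
    print(f"context {context:>6}: weights {w:.1f} GB + KV cache {kv:.2f} GB "
          f"= {w + kv:.1f} GB")
```

Under these assumptions the cache grows linearly with context length, which is why a long document can add gigabytes while a short chat adds very little.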

What happens when a model does not fit

When a model exceeds available VRAM, inference software such as llama.cpp can offload the surplus into system RAM and execute those portions on the CPU. The model loads and produces answers, so it "works" in a narrow sense. But every word now depends on the slow path between the processor and main memory. Throughput can fall from dozens of words per second to a single-digit crawl. This is why a model that technically runs on a small card is frequently unpleasant to use in practice, and why fitting entirely within VRAM is the threshold that matters.

The degradation is not gradual in the way you might expect. A model that fits with a gigabyte to spare performs well. The same model pushed slightly beyond capacity can become several times slower, because even a small fraction of the work routed through the slow path dominates the total time. The practical lesson is to leave genuine headroom rather than aim to fit a model exactly.
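
A rough way to see why a small overflow hurts so much is to treat the time per token as the time to read each part of the model from wherever it lives. The sketch below is a deliberately simplified model, not a description of how llama.cpp schedules work, and the bandwidth figures (500 GB/s for VRAM, 25 GB/s for the system-memory path) are illustrative assumptions.

```python
def tokens_per_second(model_gb: float, offloaded_fraction: float,
                      vram_gb_s: float, system_gb_s: float) -> float:
    """Estimate speed when part of the model is read over the slow path."""
    in_vram_gb = model_gb * (1 - offloaded_fraction)
    in_system_gb = model_gb * offloaded_fraction
    # Each token needs one full pass over the weights, fast part plus slow part.
    seconds_per_token = in_vram_gb / vram_gb_s + in_system_gb / system_gb_s
    return 1 / seconds_per_token


# Assumed figures: a 5 GB model, 500 GB/s VRAM, 25 GB/s system-memory path.
for fraction in (0.0, 0.1, 0.2, 0.5):
    rate = tokens_per_second(5, fraction, vram_gb_s=500, system_gb_s=25)
    print(f"{fraction:>4.0%} offloaded: about {rate:.0f} tokens/s")
```

With these numbers, moving only a tenth of the model out of VRAM cuts the estimated rate to roughly a third, because the slow portion accounts for most of the time spent on each token.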

The relationship between VRAM and speed

Once a model fits, the rate at which it generates text is governed chiefly by the card's memory bandwidth, that is, the speed at which it can read the weights from VRAM. This explains an observation that otherwise seems paradoxical. Two cards with identical VRAM can feel markedly different in use, because the one with faster memory reads the weights more quickly and therefore produces words at a higher rate. As a point of reference, a token corresponds to roughly three-quarters of a word, and a rate of around 40 tokens per second already exceeds normal reading speed.
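
Because each token requires one pass over the weights, dividing memory bandwidth by model size gives a rough ceiling on generation speed. The sketch below applies that estimate to two hypothetical 16 GB cards; the bandwidth figures and the 9 GB model size are invented for illustration, and real throughput falls below this ceiling.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound: one full read of the weights per generated token."""
    return bandwidth_gb_s / model_gb


def words_per_second(tokens_per_s: float, words_per_token: float = 0.75) -> float:
    """Convert using the text's figure of about three-quarters of a word per token."""
    return tokens_per_s * words_per_token


# Two hypothetical cards with the same 16 GB of VRAM but different memory speeds.
cards = {"hypothetical card A": 288, "hypothetical card B": 576}  # GB/s, assumed
model_gb = 9  # assumed size of a model that fits on both cards

for name, bandwidth in cards.items():
    tps = max_tokens_per_second(bandwidth, model_gb)
    print(f"{name}: up to {tps:.0f} tokens/s, about {words_per_second(tps):.0f} words/s")
```

The same model fits on both cards, yet the estimate for the faster card is twice as high, which matches the difference in feel described above.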

How much is required

The answer depends entirely on the size of model you intend to run. As a broad guide, 8 GB handles small 7-to-8-billion-parameter assistants. The 12 to 16 GB range is the comfortable middle ground for 13-to-14-billion-parameter models. And 24 GB opens access to 32-billion-parameter models and beyond. The VRAM-by-model-size guide details each tier precisely, while the "best GPUs for local LLMs" guide ranks the cards offering the most memory for a given budget. When a specific decision is at hand, the WillMyGPURunIt calculator confirms exactly what a particular card can run.
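
The tiers above are consistent with the 0.6 GB per billion parameters rule once some memory is set aside for context and overhead. The sketch below inverts that rule to estimate the largest 4-bit model a card can hold; the 2 GB reserve is an assumption made for illustration, not a figure from this article or from the calculator.

```python
def largest_model_billion(vram_gb: float, reserve_gb: float = 2.0,
                          gb_per_billion: float = 0.6) -> float:
    """Largest 4-bit model (in billions of parameters) that leaves reserve_gb free."""
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return usable_gb / gb_per_billion


for vram in (8, 12, 16, 24):
    size = largest_model_billion(vram)
    print(f"{vram:>2} GB card: up to roughly {size:.0f}B parameters at 4-bit")
```

A larger reserve is appropriate for long contexts, since the KV cache grows with the amount of text the model must retain.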