Hardware · 7 min read

How Much System RAM Do You Need for Local AI? (RAM vs VRAM)

System RAM and VRAM play very different roles. Learn how much you need, and why RAM type matters more than RAM amount for speed.

System RAM is one of the most frequently misunderstood variables in local AI builds. Users who discover that a model runs slowly often reach first for a RAM upgrade and expect the extra gigabytes to accelerate inference. In reality, the role of system RAM in local AI is specific and constrained. It determines which models can load at all when VRAM runs short, but it does not by itself determine how fast those models run. This guide explains that distinction and how much RAM is actually warranted.

RAM vs VRAM: what is the difference?

A desktop or laptop computer contains two distinct pools of memory that both carry the label "RAM" in everyday conversation but serve entirely different purposes in the context of local AI.

VRAM (video RAM) is the memory built directly onto the graphics card. It is tightly coupled to the GPU's processing cores and operates at extremely high bandwidth: a modern mid-range card reaches several hundred gigabytes per second. When a language model runs, its weights, the billions of numerical parameters that make up the trained model, must sit in VRAM for the GPU to read them efficiently. This is the fast path. A model that fits entirely within VRAM runs quickly and responsively.

System RAM is the memory installed on the motherboard, the DDR4 or DDR5 sticks familiar from general PC building. It is connected to the GPU through the PCIe bus, which is far narrower than the internal VRAM bus. When a model exceeds the available VRAM, tools such as llama.cpp can keep the surplus layers in system RAM and execute those portions on the CPU instead. The model still loads and generates answers, but every token now depends on the CPU reading those weights from system RAM, which delivers only a fraction of VRAM's bandwidth. The governing principle is captured in the VRAM guide: what fits within VRAM runs quickly, and what overflows runs slowly.
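The layer split described above can be sketched in a few lines. This is a minimal illustration, not how llama.cpp itself decides: it assumes every layer is the same size, and the function name, the 40-layer count, and the 7 GB weight budget are all hypothetical values chosen for the example.

```python
def split_layers(n_layers: int, model_gb: float, vram_budget_gb: float) -> tuple[int, int]:
    """Return (gpu_layers, cpu_layers), assuming all layers are equally sized.

    vram_budget_gb is the VRAM left for weights after the context cache and
    other buffers are accounted for (an assumption the caller must supply).
    """
    per_layer_gb = model_gb / n_layers
    fits = int(max(0.0, vram_budget_gb) / per_layer_gb)  # whole layers only
    gpu_layers = min(n_layers, fits)
    return gpu_layers, n_layers - gpu_layers


# Hypothetical case: a 12 GB model with 40 layers and 7 GB of VRAM free for weights.
gpu_layers, cpu_layers = split_layers(n_layers=40, model_gb=12.0, vram_budget_gb=7.0)
print(f"GPU layers: {gpu_layers}, CPU layers (in system RAM): {cpu_layers}")
```

In llama.cpp the number of GPU-resident layers is set by the user with the `--n-gpu-layers` (`-ngl`) option; a calculation like this only gives a starting value to try.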

How much system RAM do you need for local AI?

System RAM enters the picture only once the VRAM budget is determined. Its role is to provide a landing zone for any layers the GPU cannot hold. Three practical tiers cover the vast majority of local AI users:

  • 16 GB is the sensible floor. On a typical desktop with a dedicated GPU, 16 GB of system RAM leaves sufficient headroom for the operating system, a browser, and any overflow from a small model. It is not generous, and a user running a very large model that offloads heavily will find the headroom limited, but it is the minimum for a local AI setup that does not feel constrained during normal use.
  • 32 GB is the comfortable middle. At 32 GB the system can hold a meaningfully larger offloaded segment without competition from background processes. A model that does not quite fit in an 8 GB GPU, such as a 13-billion-parameter model at 4-bit quantization, can spill its excess layers into system RAM and still load completely. This is the tier most serious local AI users settle on and it handles the majority of models currently in widespread use.
  • 64 GB and above is for large model offloading. A user attempting to run a 70-billion-parameter model on a GPU with limited VRAM will offload a substantial fraction of that model's layers into system RAM. At those sizes 32 GB may not provide enough headroom after the OS reserve, and 64 GB or more becomes necessary simply to allow the model to load at all. Most users do not need this tier; it is best reserved for specific large-model use cases.
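The arithmetic behind these tiers can be sketched as follows. This is a rough estimate under stated assumptions: it counts raw weight storage only (parameters times bits per weight), ignores the context cache and runtime buffers that make real footprints larger, and uses a hypothetical 8 GB reserve for the operating system and applications. The function names are illustrative.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Raw weight storage in GB: 1e9 parameters * (bits / 8) bytes each."""
    return params_billion * bits_per_weight / 8


def min_system_ram_gb(params_billion: float, bits_per_weight: float,
                      vram_gb: float, os_reserve_gb: float = 8.0) -> float:
    """System RAM needed to hold whatever does not fit in VRAM, plus a reserve.

    os_reserve_gb is an assumed allowance for the OS and applications.
    """
    overflow_gb = max(0.0, weights_gb(params_billion, bits_per_weight) - vram_gb)
    return overflow_gb + os_reserve_gb


# A 70B model at 4-bit on an 8 GB card: 35 GB of weights, 27 GB of overflow.
print(min_system_ram_gb(70, 4, vram_gb=8))  # exceeds 32 GB once the reserve is added
# A 7B model at 4-bit on the same card: no overflow, only the reserve remains.
print(min_system_ram_gb(7, 4, vram_gb=8))
```

Under these assumptions the 70B case needs about 35 GB of system RAM, which is why a 32 GB machine falls short and 64 GB is the next practical step.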

One important clarification. These figures describe how much RAM is needed for offloading to be possible, not for it to be fast. The amount of RAM changes whether a model fits. The type of RAM changes how quickly an offloaded model runs. That distinction is covered in the section below.

Does more RAM make local AI faster?

Adding more gigabytes of system RAM does not, in isolation, speed up inference. The reason lies in how offloading works. When a model spills into system RAM, each generated token requires the software to read the offloaded weight layers from main memory. The speed of those reads is governed by memory bandwidth, a figure measured in gigabytes per second that describes how quickly data can flow between the memory chips and the processor. Bandwidth is a physical property of the memory type and configuration, not of capacity.

A system with 64 GB of DDR4 RAM and a system with 32 GB of DDR4 RAM have identical bandwidth assuming the same channel configuration. The larger kit allows a bigger model to load but once loaded both systems run the offloaded portion at exactly the same speed. Adding gigabytes without changing the memory type provides no throughput benefit for offloaded inference.
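The point that capacity does not enter the bandwidth figure can be made concrete with the standard peak-bandwidth calculation: transfer rate times bus width times channel count. The sketch below gives theoretical peaks only; sustained bandwidth is lower in practice, and the DDR4-3200 and DDR5-5600 speeds are example kits assumed for illustration.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: int, channels: int = 2,
                       bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    Each conventional memory channel is 64 bits (8 bytes) wide for both DDR4
    and DDR5. Note that kit capacity is not a parameter of this function.
    """
    return mega_transfers_per_s * bus_bytes * channels / 1000


# Dual-channel DDR4-3200: the result is the same for a 32 GB or a 64 GB kit.
print(peak_bandwidth_gbs(3200))
# Dual-channel DDR5-5600.
print(peak_bandwidth_gbs(5600))
# Single-channel DDR4-3200, for example one stick installed instead of two.
print(peak_bandwidth_gbs(3200, channels=1))
```

The channel count does matter: a single stick halves the peak figure, which is the "same channel configuration" caveat in the comparison above.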

The practical consequence is straightforward. If the goal is to fit a model that currently cannot load, more RAM solves the problem. If the goal is to make an already-loading model run faster, more RAM of the same type will not help. Only moving more layers into VRAM or upgrading the RAM type will.

Does DDR4 vs DDR5 matter for local AI?

Yes, and meaningfully so, but only for offloaded models. DDR5 provides roughly 85 GB/s of peak bandwidth in a typical dual-channel desktop configuration, compared to roughly 50 GB/s for DDR4. Every offloaded weight must be read once per generated token, and those reads are sequential, so bandwidth translates almost directly into tokens per second for the CPU-handled layers.

A model that runs entirely within VRAM is unaffected by the DDR4 versus DDR5 distinction. The GPU never touches system RAM during inference in that case, so the difference is irrelevant. But a model that offloads a substantial fraction of its layers, perhaps because the GPU holds only 8 GB while the model requires 12 GB, will run noticeably faster on DDR5. The bandwidth advantage flows directly into the offloaded portion of each token generation step.
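A simple bandwidth-bound model shows the size of the effect for the 12 GB example. It assumes each token requires one full read of every weight, that the GPU and CPU portions run one after the other, and that compute time is negligible, so the results are upper bounds rather than predictions. The 400 GB/s VRAM figure is a hypothetical mid-range value; the 50 and 85 GB/s figures are the ones quoted above.

```python
def tokens_per_second(gpu_gb: float, cpu_gb: float,
                      vram_gbs: float, ram_gbs: float) -> float:
    """Upper-bound token rate when every weight is read once per token."""
    seconds_per_token = gpu_gb / vram_gbs + cpu_gb / ram_gbs
    return 1.0 / seconds_per_token


# 12 GB model: 8 GB resident in VRAM, 4 GB offloaded to system RAM.
ddr4 = tokens_per_second(gpu_gb=8, cpu_gb=4, vram_gbs=400, ram_gbs=50)
ddr5 = tokens_per_second(gpu_gb=8, cpu_gb=4, vram_gbs=400, ram_gbs=85)
fully_in_vram = tokens_per_second(gpu_gb=12, cpu_gb=0, vram_gbs=400, ram_gbs=50)

print(f"DDR4 offload: {ddr4:.1f} tok/s")
print(f"DDR5 offload: {ddr5:.1f} tok/s")
print(f"All in VRAM:  {fully_in_vram:.1f} tok/s")
```

With these numbers the offloaded third of the model accounts for most of the time per token: the estimate is about 10 tokens per second on DDR4 and about 15 on DDR5, against roughly 33 if the whole model fit in VRAM.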

For most users building a new system, DDR5 is the correct choice wherever the platform supports it, not because it allows larger models but because it closes some of the speed gap that offloading introduces. On an existing DDR4 system, replacing RAM with DDR5 would require a new CPU and motherboard as well, since the memory types are not interchangeable. In that scenario the investment is only warranted if offloading performance is a specific priority. Otherwise, spending the equivalent budget on a GPU with more VRAM will produce a larger overall improvement by reducing how much offloading occurs in the first place.

The relationship between system RAM and VRAM

The two memory pools are complementary rather than interchangeable. VRAM sets the performance ceiling. A model that fits entirely within VRAM receives the full benefit of the GPU's bandwidth and compute. System RAM sets the capacity ceiling. It determines whether models larger than the available VRAM can load at all. Neither pool can substitute for the other.

The ordering of priorities that follows from this is consistent. If your primary concern is running models quickly, maximising VRAM should come first through GPU choice or through quantization, which reduces the memory footprint of a given model and can make the difference between offloading and fitting cleanly. System RAM becomes relevant once VRAM is settled, as a buffer for models too large to fit on the card entirely. And within that buffer role the type of RAM chosen influences the speed at which offloaded inference proceeds, while the amount of RAM influences only the size of the models that can be attempted.

Practical guidance by use case

The following recommendations assume a dedicated GPU is present. Users on integrated graphics, where the GPU and CPU share a single pool of system RAM, face a different situation, since that shared memory functions as both system RAM and VRAM simultaneously.

  • Light daily use (7B to 8B models): These models fit entirely within 8 GB of VRAM at 4-bit quantization. System RAM plays almost no role in inference speed. 16 GB of system RAM is sufficient and 32 GB provides comfortable headroom for the OS and applications.
  • Heavier use (13B to 14B models): A 13-billion-parameter model at 4-bit quantization requires roughly 8 to 9 GB. On a card with 8 GB of VRAM a small number of layers will offload to system RAM. 32 GB of system RAM is the sensible choice and DDR5 will produce a modestly faster result for the offloaded portion.
  • Large model experiments (30B to 70B models): At these sizes substantial offloading is unavoidable on consumer hardware. System RAM amount and type both matter. 64 GB of RAM, ideally DDR5, provides meaningful headroom, though the offloaded inference will remain slow relative to a fully VRAM-resident configuration. The calculator can estimate throughput for a specific GPU and RAM pairing.

For any specific build, the interaction between VRAM, system RAM capacity, and system RAM bandwidth is complex enough that a dedicated tool provides more reliable guidance than general rules of thumb. The WillMyGPURunIt calculator takes GPU, RAM type, and RAM amount as inputs and reports which models fit cleanly, which offload partially, and what token-per-second rate to expect in each case.
