Best GPUs for Local LLMs · WillMyGPURunIt

When choosing the best GPU for local LLMs, one specification matters more than any other: VRAM, the card's dedicated memory. A model runs at acceptable speed only if it fits in VRAM, so the apparent question of which graphics card is best for running local AI resolves into a more precise one. How much VRAM does a given workload require and which card supplies the most of it within a given budget? The rankings below are organised on exactly that basis.

The picks are grouped by VRAM tier. The biggest model and ~tok/sfigures are not editorial estimates: they are computed from each card's actual VRAM and memory bandwidth using the same engine that drives the calculator. The tokens-per-second figure is standardised on an 8-billion-parameter model at 4-bit quantization so that cards may be compared on equal terms. (A token corresponds to roughly three-quarters of a word. A rate of about 40 tok/s already exceeds normal reading speed.)

Flagship — big models at home

24–32 GB VRAM

These run 32B-class models entirely on the GPU and offload 70B to RAM. Overkill for chat, ideal if you're set on the largest local models or image/video generation.

GPU	VRAM	Biggest on-GPU	8B speed
RTX 5090 The most VRAM on a consumer card — 32 GB runs the biggest models most people touch.	32 GB	47B model	~224 tok/s
RTX 4090 Still the local-AI darling: 24 GB and huge bandwidth make it fast on anything that fits.	24 GB	34B model	~126 tok/s
RTX 3090 The value king for local AI — 24 GB second-hand for far less than a 4090.	24 GB	34B model	~117 tok/s
Radeon RX 7900 XTX 24 GB on the AMD side; great if you're on Linux with ROCm, less plug-and-play than NVIDIA.	24 GB	34B model	~120 tok/s

High-end — the local-AI sweet spot

16 GB VRAM

16 GB comfortably runs 13–14B models fully on the GPU at good speed, the range most people actually use day to day.

GPU	VRAM	Biggest on-GPU	8B speed
RTX 5080 Fast 16 GB card; great for 14B models and image generation.	16 GB	21B model	~120 tok/s
RTX 4060 Ti 16GB The budget VRAM hero — slow-ish, but 16 GB for the price is unmatched for AI.	16 GB	21B model	~36 tok/s
Radeon RX 7800 XT Strong 16 GB AMD option with high bandwidth for the money.	16 GB	21B model	~78 tok/s
Arc A770 16 GB on a budget; Intel's local-AI support is improving but still rougher.	16 GB	21B model	~70 tok/s

Mid-range — great for 7–8B

12 GB VRAM

12 GB runs 7–8B chat and coding models fully on the GPU with room for context, and 13B at tight quantization. A sensible starting point.

GPU	VRAM	Biggest on-GPU	8B speed
RTX 5070 Fast 12 GB card that flies on 8B models.	12 GB	15B model	~84 tok/s
RTX 3060 Laptop GPU The classic entry pick — cheap, 12 GB, runs 8B models well.	6 GB	4B model	~42 tok/s
Arc B580 Newer 12 GB budget card with solid bandwidth.	12 GB	15B model	~57 tok/s

Entry — 8B with care

8 GB VRAM

8 GB is the floor: it runs 7–8B models at 4-bit and Stable Diffusion, but 13B+ spills into slow RAM offload. Fine to start, easy to outgrow.

GPU	VRAM	Biggest on-GPU	8B speed
RTX 4060 Common 8 GB card; runs 8B models, capped above that.	8 GB	8B model	~34 tok/s
Radeon RX 7600 AMD's 8 GB budget option, similar story.	8 GB	8B model	~36 tok/s

How to choose a GPU for local AI

Four principles account for most of the decision, and they hold across budgets:

NVIDIA is the safe default. Its CUDA platform is the target of nearly every local-AI tool, which makes NVIDIA cards the most reliably plug-and-play option and the one least likely to require troubleshooting.
Prioritise VRAM over raw speed. A slower card with more VRAM such as a 16 GB RTX 4060 Ti runs larger models than a faster card with less, because a model that does not fit in VRAM barely runs at all. Memory capacity is the binding constraint. Bandwidth only governs speed once a model already fits.
A used RTX 3090 remains the value choice. It offers 24 GB of VRAM at well below flagship pricing and continues to be the enthusiast favourite for that reason.
AMD and Intel are viable, with caveats. Both offer strong VRAM per dollar, but installation and tooling are smoother on NVIDIA at present, particularly on Windows.

The tier should be matched to the intended workload: 8 GB for 7-to-8-billion-parameter chat and coding assistants, 12 to 16 GB for the comfortable middle ground, and 24 GB or more for those pursuing the largest models. Whatever card is under consideration, the WillMyGPURunIt calculator confirms its exact capabilities before any purchase is made. For the reasoning behind these size requirements, see how much VRAM you need to run an LLM.