
Best GPUs for Local LLMs

Ranked GPU picks for running local AI by budget and VRAM tier — built from real benchmark data and the actual model each card can run.

When choosing the best GPU for local LLMs, one specification matters more than any other: VRAM, the card's dedicated memory. A model runs at acceptable speed only if it fits in VRAM, so the apparent question of which graphics card is best for running local AI resolves into a more precise one: how much VRAM does a given workload require, and which card supplies the most of it within a given budget? The rankings below are organised on exactly that basis.

The picks are grouped by VRAM tier. The biggest-model and ~tok/s figures are not editorial estimates: they are computed from each card's actual VRAM and memory bandwidth using the same engine that drives the calculator. The tokens-per-second figure is standardised on an 8-billion-parameter model at 4-bit quantization so that cards may be compared on equal terms. (A token corresponds to roughly three-quarters of a word, so a rate of about 40 tok/s already exceeds normal reading speed.)
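The calculator's engine is not reproduced here, but the general idea behind a bandwidth-derived speed figure can be sketched. During generation a GPU reads roughly the full set of weights for every token, so memory bandwidth divided by model size gives a ceiling on tokens per second. In the sketch below, the function names, the `efficiency` factor of 0.5, and the 1008 GB/s bandwidth (the published spec for an RTX 4090) are illustrative assumptions, not values taken from the calculator.

```python
def model_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return params_billion * bits_per_weight / 8


def estimate_tok_per_s(
    bandwidth_gb_s: float,
    params_billion: float = 8.0,   # the article's standard 8B model
    bits_per_weight: float = 4.0,  # 4-bit quantization
    efficiency: float = 0.5,       # ASSUMPTION: fraction of peak bandwidth achieved
) -> float:
    """Bandwidth-bound estimate: each token requires one pass over the weights."""
    weights_gb = model_size_gb(params_billion, bits_per_weight)
    return bandwidth_gb_s / weights_gb * efficiency


def words_per_minute(tok_per_s: float, words_per_token: float = 0.75) -> float:
    """Convert tokens per second to words per minute (about 0.75 words per token)."""
    return tok_per_s * words_per_token * 60


if __name__ == "__main__":
    # ASSUMPTION: 1008 GB/s is the published memory bandwidth of an RTX 4090.
    print(f"estimated speed: {estimate_tok_per_s(1008):.0f} tok/s")
    print(f"40 tok/s is about {words_per_minute(40):.0f} words per minute")
```

With these assumed inputs the estimate lands close to the ~126 tok/s listed for the RTX 4090, which suggests the figures are bandwidth-bound, though it does not confirm the calculator's actual formula.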

Flagship — big models at home

24–32 GB VRAM

These run 32B-class models entirely on the GPU and handle 70B only by offloading part of the model to system RAM, which is slower. Overkill for chat, ideal if you're set on the largest local models or image/video generation.

GPU | VRAM | 8B speed | Notes
RTX 5090 | 32 GB | ~224 tok/s | The most VRAM on a consumer card: 32 GB runs the biggest models most people touch.
RTX 4090 | 24 GB | ~126 tok/s | Still the local-AI darling: 24 GB and huge bandwidth make it fast on anything that fits.
RTX 3090 | 24 GB | ~117 tok/s | The value king for local AI: 24 GB second-hand for far less than a 4090.
Radeon RX 7900 XTX | 24 GB | ~120 tok/s | 24 GB on the AMD side; great if you're on Linux with ROCm, less plug-and-play than NVIDIA.

High-end — the local-AI sweet spot

16 GB VRAM

16 GB comfortably runs 13–14B models fully on the GPU at good speed, the range most people actually use day to day.

GPU | VRAM | 8B speed | Notes
RTX 5080 | 16 GB | ~120 tok/s | Fast 16 GB card; great for 14B models and image generation.
RTX 4060 Ti 16GB | 16 GB | ~36 tok/s | The budget VRAM hero: slow-ish, but 16 GB for the price is unmatched for AI.
Radeon RX 7800 XT | 16 GB | ~78 tok/s | Strong 16 GB AMD option with high bandwidth for the money.
Arc A770 | 16 GB | ~70 tok/s | 16 GB on a budget; Intel's local-AI support is improving but still rougher.

Mid-range — great for 7–8B

12 GB VRAM

12 GB runs 7–8B chat and coding models fully on the GPU with room for context, and 13B at tight quantization. A sensible starting point.

GPU | VRAM | 8B speed | Notes
RTX 5070 | 12 GB | ~84 tok/s | Fast 12 GB card that flies on 8B models.
RTX 3060 12GB | 12 GB | ~42 tok/s | The classic entry pick: cheap, 12 GB, runs 8B models well.
Arc B580 | 12 GB | ~57 tok/s | Newer 12 GB budget card with solid bandwidth.

Entry — 8B with care

8 GB VRAM

8 GB is the floor: it runs 7–8B models at 4-bit and Stable Diffusion, but 13B+ spills into slow RAM offload. Fine to start, easy to outgrow.

GPU | VRAM | 8B speed | Notes
RTX 4060 | 8 GB | ~34 tok/s | Common 8 GB card; runs 8B models, capped above that.
Radeon RX 7600 | 8 GB | ~36 tok/s | AMD's 8 GB budget option, similar story.
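Rough arithmetic shows why 8 GB handles 8B models at 4-bit but not 13B. The sketch below assumes that weights occupy parameters × bits / 8 bytes and that the KV cache and runtime buffers need a fixed allowance on top. The 2 GB allowance and the function names are illustrative assumptions; real usage varies with context length and the runtime in use.

```python
def estimate_vram_gb(
    params_billion: float,
    bits_per_weight: float = 4.0,
    overhead_gb: float = 2.0,  # ASSUMPTION: flat allowance for KV cache and buffers
) -> float:
    """Estimate the VRAM needed to hold a quantized model fully on the GPU."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, vram_gb: float, bits_per_weight: float = 4.0) -> bool:
    """True if the estimated footprint fits within the card's VRAM."""
    return estimate_vram_gb(params_billion, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    for params, vram in [(8, 8), (13, 8), (13, 12), (32, 24), (70, 32)]:
        need = estimate_vram_gb(params)
        verdict = "fits" if fits(params, vram) else "spills to system RAM"
        print(f"{params}B at 4-bit needs ~{need:.1f} GB; on a {vram} GB card it {verdict}")
```

Under these assumptions the tiers line up with the descriptions above: 8B fits in 8 GB, 13B needs the 12 GB tier, 32B fits in 24 GB, and 70B exceeds even 32 GB.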

How to choose a GPU for local AI

Four principles account for most of the decision, and they hold across budgets:

  • NVIDIA is the safe default. Its CUDA platform is the target of nearly every local-AI tool, which makes NVIDIA cards the most reliably plug-and-play option and the one least likely to require troubleshooting.
  • Prioritise VRAM over raw speed. A slower card with more VRAM, such as a 16 GB RTX 4060 Ti, runs larger models than a faster card with less, because a model that does not fit in VRAM spills into system RAM and slows dramatically. Memory capacity is the binding constraint; bandwidth governs speed only once a model already fits.
  • A used RTX 3090 remains the value choice. It offers 24 GB of VRAM at well below flagship pricing and continues to be the enthusiast favourite for that reason.
  • AMD and Intel are viable, with caveats. Both offer strong VRAM per dollar, but installation and tooling are smoother on NVIDIA at present, particularly on Windows.
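The "VRAM first, speed second" principle can be expressed as a two-step filter: discard every card that cannot hold the model, then rank the remainder by speed. The sketch below uses a subset of the VRAM and 8B-speed figures from the tables above. The `Card` type, the `shortlist` function, and the idea of passing in a required-VRAM number are illustrative assumptions, not part of the calculator.

```python
from typing import NamedTuple


class Card(NamedTuple):
    name: str
    vram_gb: int
    tok_per_s_8b: int  # the article's standardised 8B, 4-bit speed figure


# A subset of the rows from the tier tables in this guide.
CARDS = [
    Card("RTX 5090", 32, 224),
    Card("RTX 4090", 24, 126),
    Card("RTX 3090", 24, 117),
    Card("Radeon RX 7900 XTX", 24, 120),
    Card("RTX 5080", 16, 120),
    Card("RTX 4060 Ti 16GB", 16, 36),
    Card("Radeon RX 7800 XT", 16, 78),
    Card("Arc A770", 16, 70),
    Card("RTX 5070", 12, 84),
    Card("Arc B580", 12, 57),
    Card("RTX 4060", 8, 34),
    Card("Radeon RX 7600", 8, 36),
]


def shortlist(required_vram_gb: float, cards: list[Card] = CARDS) -> list[Card]:
    """Keep only cards with enough VRAM, then sort the survivors fastest first."""
    fitting = [c for c in cards if c.vram_gb >= required_vram_gb]
    return sorted(fitting, key=lambda c: c.tok_per_s_8b, reverse=True)


if __name__ == "__main__":
    # Hypothetical workload that needs about 14 GB of VRAM.
    for card in shortlist(14):
        print(f"{card.name}: {card.vram_gb} GB, ~{card.tok_per_s_8b} tok/s on 8B")
```

For a workload needing about 14 GB, the RTX 5070 drops out despite its faster 8B figure, while the slower RTX 4060 Ti 16GB stays on the list.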

The tier should be matched to the intended workload: 8 GB for 7-to-8-billion-parameter chat and coding assistants, 12 to 16 GB for the comfortable middle ground, and 24 GB or more for those pursuing the largest models. Whichever card is under consideration, check it in the WillMyGPURunIt calculator before buying to see which models it can run. For the reasoning behind these size requirements, see how much VRAM you need to run an LLM.
