All Local AI Guides
Hardware · 3 min read

The Best Local LLMs for 8 GB VRAM

8 GB is the most common VRAM size in the world. The models that genuinely shine on it, the ones to skip, and the settings that stretch the card.

8 GB is the most common VRAM size in the world. It is what an RTX 3060 Ti, RTX 3070, RTX 4060, RX 7600 and a long list of laptop GPUs all carry, which means more people ask "what can I run on 8 GB?" than any other version of the question. The honest answer: quite a lot, provided you pick models sized for the card instead of fighting it. This guide lists the models that genuinely work well in 8 GB, what to skip, and the settings that make the difference.

The golden rule for 8 GB

A model fits comfortably when its weights, its working memory (the KV cache) and a little overhead all sit inside VRAM. At 4-bit quantization, that puts the practical ceiling for an 8 GB card at models of roughly 7 to 9 billion parameters with a normal context window. Anything in that band runs entirely on the GPU at full speed. Push past it and layers spill into system RAM, where performance falls off a cliff.

The best models to run on 8 GB

  • Llama 3.1 8B: the default choice. Broadly capable, well supported by every tool, and its Q4_K_M build leaves room for a healthy context window. If you install one model, install this one.
  • Qwen3 8B: the strongest all-rounder in the class for reasoning and multilingual work, with an optional thinking mode that trades speed for better answers on hard problems.
  • Qwen2.5-Coder 7B: the pick for programming. For autocomplete and everyday coding questions it punches far above its size, and it fits with room to spare.
  • Mistral 7B: older now, but fast, permissively licensed and still a fine writing assistant on modest hardware.
  • Gemma 3 4B: when you want headroom. It understands images as well as text, and at 4B it leaves VRAM free for a long context window, which matters for summarising documents.
  • Phi-4 Mini (3.8B): remarkably strong for its size on math and structured reasoning, and small enough to keep loaded alongside other work.

What to skip on 8 GB

The 12 to 14 billion parameter tier (Gemma 3 12B, Qwen3 14B, Phi-4) is the awkward zone: a Q4 build of a 14B model wants 9 to 10 GB, so on an 8 GB card it only loads by offloading layers, and speed drops from "faster than you read" to "watching paint dry." The 32B and 70B classes are simply out of reach on a single 8 GB card. If those tiers are the goal, the realistic path is a used 24 GB card, and the best GPUs for local LLMs guide walks through the options by budget.

Settings that stretch the card

  • Stay at Q4_K_M. On 8 GB the quality gain from Q5 or Q6 is not worth the VRAM it eats. Save the higher quants for bigger cards.
  • Watch the context window. The KV cache grows with context length. An 8B model at 4K context fits easily; the same model at 32K context may not. Start small and raise it only when needed.
  • Close the other VRAM users. A browser with hardware acceleration and a game launcher can quietly hold a gigabyte hostage.

Check your exact card

Bandwidth differs between 8 GB cards, so the same model can run at very different speeds on an RTX 3070 versus an RTX 4060. Enter your CPU and GPU into the calculator to see every model your card runs fully on-GPU, with an estimated tokens per second for each, or read how those speed estimates work in what tokens per second means.

Keep reading