Yes, you can run local AI without a graphics card. Tools like Ollama and llama.cpp fall back to the CPU automatically, and a small model will genuinely work on an ordinary desktop or laptop with no GPU at all. The real question is not whether it runs but whether the speed is acceptable, and that depends almost entirely on which model you pick. This guide covers what CPU-only inference is actually like, which models make sense, and when a GPU becomes worth it.
Why the CPU is so much slower
Generating text is bound by memory bandwidth, not arithmetic: for every single token, the computer reads essentially the whole model out of memory. A graphics card's VRAM delivers several hundred gigabytes per second, while typical dual-channel desktop system RAM manages around 50 to 90 GB/s. That roughly five-to-tenfold bandwidth gap accounts for most of the slowdown. Core count matters far less: past six or so cores, adding more barely helps, because the memory bus is already saturated.
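The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below assumes every weight is read exactly once per token and ignores compute, caches and KV-cache traffic, so real speeds land below it. The function name and the 60 and 400 GB/s figures are illustrative assumptions, not measurements.

```python
def max_tokens_per_second(bandwidth_gb_s: float, params_billion: float,
                          bits_per_weight: float = 4.0) -> float:
    """Bandwidth-bound ceiling: one full pass over the weights per token."""
    # Billions of parameters times bytes per parameter gives gigabytes.
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# Illustrative figures only: ~60 GB/s system RAM versus ~400 GB/s VRAM.
cpu_ceiling = max_tokens_per_second(60, 8)    # 8B model at 4-bit
gpu_ceiling = max_tokens_per_second(400, 8)
print(f"CPU ceiling: {cpu_ceiling:.0f} tok/s, GPU ceiling: {gpu_ceiling:.0f} tok/s")
```

Because the model size sits in the denominator, halving the bytes per token or doubling the bandwidth both double the ceiling, and extra cores appear nowhere in the estimate.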
What speeds to expect
On a modern desktop CPU with DDR5 memory, rough expectations at 4-bit quantization look like this:
- 1 to 4B models: 10 to 30 tokens per second. Genuinely usable, faster than most people read.
- 7 to 9B models: 4 to 10 tokens per second. Usable for short answers if you are patient; sluggish for long output.
- 13B and up: 1 to 3 tokens per second. Technically runs, practically painful. This is where CPU-only stops being sensible.
Older machines with DDR4 memory land at roughly half those figures. What tokens per second feels like in practice is covered in this short explainer.
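To get a feel for these numbers, it helps to convert tokens per second into words per minute and waiting time. This is a rough sketch: the 0.75 words-per-token ratio is a common rule of thumb for English text, not a property of any particular model, and the function names are made up for illustration.

```python
def words_per_minute(tokens_per_second: float, words_per_token: float = 0.75) -> float:
    """Approximate output speed in words per minute (ratio is an assumption)."""
    return tokens_per_second * words_per_token * 60


def seconds_to_generate(num_tokens: int, tokens_per_second: float) -> float:
    """Time to produce a reply of num_tokens, ignoring prompt processing."""
    return num_tokens / tokens_per_second


for tps in (2, 5, 10, 30):
    print(f"{tps:>2} tok/s = {words_per_minute(tps):.0f} words/min, "
          f"500-token reply in {seconds_to_generate(500, tps):.0f} s")
```

Under those assumptions, 10 tokens per second is about 450 words per minute, comfortably above a typical reading pace of a few hundred words per minute, while 2 tokens per second is about 90.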
The models that work best on CPU
- Llama 3.2 3B: the best overall CPU-only choice. Capable enough for summaries, drafting and Q&A at a comfortable speed.
- Gemma 3 4B and Phi-4 Mini (3.8B): both deliver surprising quality per parameter, which is exactly what a bandwidth-starved machine needs.
- Qwen3 4B: strong multilingual and reasoning ability in a CPU-friendly size.
- Mixture-of-experts (MoE) models such as Qwen3 30B A3B are an interesting middle path: they store 30B parameters but read only about 3B of them per token, so they run at close to small-model speed, provided you have 24 GB or more of system RAM to hold them.
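The MoE trade-off is easier to see as arithmetic: memory footprint follows the total parameter count, while per-token reads follow the active count. The sketch below assumes a flat 4 bits per weight. Real quantized files are somewhat larger because some tensors are kept at higher precision, and the helper name is hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.0) -> float:
    """Approximate weight size in GB at a flat bit width (an assumption)."""
    return params_billion * bits_per_weight / 8


resident_gb = weights_gb(30)   # everything must sit in RAM
read_gb = weights_gb(3)        # only the active experts are read per token

print(f"Held in RAM: ~{resident_gb:.1f} GB, read per token: ~{read_gb:.1f} GB")
```

Roughly 15 GB of weights plus the operating system and context is why 24 GB of RAM is a sensible floor, while the per-token read is about the same as for a dense 3B model.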
Making CPU-only tolerable
- Use dual-channel RAM. Two matched sticks nearly double bandwidth versus one stick, which nearly doubles tokens per second. This is the single biggest CPU-only upgrade.
- Keep context modest. Long prompts take a long time to process on a CPU before the first word even appears.
- Prefer Q4 quantization. Smaller weights mean fewer bytes to read per token, which directly raises speed.
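The dual-channel tip follows from how peak memory bandwidth is calculated: transfers per second, times 8 bytes per transfer on a 64-bit channel, times the number of channels. The sketch below computes theoretical peaks only, and sustained bandwidth in practice is lower. DDR5-5600 and DDR4-3200 are example speeds chosen for illustration.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int, channels: int = 2,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s for 64-bit memory channels."""
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000


for name, speed in (("DDR4-3200", 3200), ("DDR5-5600", 5600)):
    single = peak_bandwidth_gb_s(speed, channels=1)
    dual = peak_bandwidth_gb_s(speed, channels=2)
    print(f"{name}: one stick {single:.1f} GB/s, two sticks {dual:.1f} GB/s")
```

Since token generation is limited by bandwidth, the jump from one channel to two carries through almost directly to tokens per second.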
When a GPU becomes worth it
If you settle into daily use, even a modest used GPU changes the experience completely: an 8 GB card runs 8B models roughly five to ten times faster than a CPU, in line with the bandwidth gap. Why VRAM dominates local AI is its own guide, and the best GPUs for local LLMs breaks down the sensible buys by budget. To see what your current machine can do before spending anything, enter your hardware into the calculator: it shows which models fit and how fast they should feel.
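As a rough idea of what a fit check involves, the sketch below compares approximate weight size plus a fixed allowance against available VRAM. It is not how the calculator works: the 1.5 GB allowance for context and runtime buffers is a placeholder assumption, and the function name is hypothetical.

```python
def fits_in_vram(params_billion: float, vram_gb: float,
                 bits_per_weight: float = 4.0, overhead_gb: float = 1.5) -> bool:
    """Crude fit check: weights plus an assumed overhead must not exceed VRAM."""
    weights = params_billion * bits_per_weight / 8  # approximate size in GB
    return weights + overhead_gb <= vram_gb


for size in (4, 8, 14):
    verdict = "fits" if fits_in_vram(size, vram_gb=8) else "does not fit"
    print(f"{size}B at 4-bit on an 8 GB card: {verdict}")
```

With these assumptions an 8B model at 4-bit needs about 5.5 GB, which leaves headroom on an 8 GB card, while a 14B model does not fit.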