The open-weight model landscape has matured to the point where a well-chosen local model can handle the majority of tasks that once required a commercial API. The challenge for most users is no longer whether a good model exists for their hardware. It is knowing which model to pick. This guide organises the strongest options available in 2026 by hardware tier, identifies the standout performers in each category, and notes the specialised models that lead their respective domains. Exact speed figures vary by GPU and configuration. The WillMyGPURunIt calculator can estimate performance for a specific build.
Understanding the tiers
The most important number when selecting a local model is not a benchmark score but a hardware constraint: the VRAM available on the graphics card. The model's weights must reside in GPU memory to achieve useful inference speed, so the VRAM ceiling determines which parameter counts are accessible. The VRAM requirements guide explains the underlying arithmetic in detail. In brief, modern quantization techniques (typically around 4 bits per weight) compress models enough that an 8-billion-parameter model fits on an 8 GB card and a 32-billion-parameter model fits on a 24 GB card. The sections below use those thresholds as tier boundaries.
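The arithmetic behind those thresholds can be sketched in a few lines. This is a rough illustration, not the calculator's actual method: the function names, the 4.5 bits-per-weight default (a 4-bit quantization plus scaling metadata), and the 1.5 GB allowance for context cache and runtime buffers are all assumptions chosen for the example.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead_gb: float = 1.5) -> float:
    """Rough VRAM estimate in GB for a quantized model.

    bits_per_weight and overhead_gb are illustrative assumptions,
    not measured values; real usage depends on the runtime and
    on the context length.
    """
    # One billion parameters at 8 bits per weight occupy about 1 GB.
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


def fits(params_billion: float, vram_gb: float, **kwargs) -> bool:
    """True if the estimated footprint is within the card's VRAM."""
    return estimate_vram_gb(params_billion, **kwargs) <= vram_gb


if __name__ == "__main__":
    for size, card in [(8, 8), (14, 12), (32, 24), (70, 24)]:
        need = estimate_vram_gb(size)
        verdict = "fits" if fits(size, card) else "does not fit"
        print(f"{size}B needs ~{need:.1f} GB: {verdict} on a {card} GB card")
```

Under these assumptions, the same 8B model at 16 bits per weight would need well over 8 GB, which is why quantization sets the tier boundaries rather than raw parameter count alone.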
Small GPUs: 8 GB VRAM
Eight gigabytes of VRAM is the most common configuration on mid-range consumer cards, and it supports a genuinely capable set of models. The three strongest options at this tier are Llama 3.1 8B, Qwen3 8B, and Gemma 3 4B.
- Llama 3.1 8B is Meta's workhorse at this size. It is a well-rounded general-purpose assistant, competent at conversation, summarisation, and light writing, and it benefits from the largest ecosystem of fine-tunes and tooling of any open model at this parameter count. It is an appropriate default for users running 8 GB cards for the first time.
- Qwen3 8B represents the current state of the art for the 8-billion-parameter class. Released by Alibaba's Qwen team, it competes with models a size tier above it in reasoning and instruction-following, and it supports a long context window. Users who want the maximum capability from an 8 GB card should start here.
- Gemma 3 4B is the right choice when headroom is limited. Its smaller footprint leaves VRAM to spare, which translates into faster responses and more room for long contexts. For simple chat and short writing tasks on constrained hardware, it can outperform heavier models that have to squeeze into the same limited memory.
Mid-range GPUs: 12 to 16 GB VRAM
The 12-to-16 GB bracket opens access to the fourteen-billion-parameter class, which is where the quality gap relative to cloud models begins to narrow. The leading options are Gemma 3 12B, Qwen3 14B, and Phi-4 14B.
- Gemma 3 12B is Google DeepMind's most balanced open release. It handles multi-turn conversation, structured writing, and document comprehension well, and its relative efficiency means it runs comfortably on a 12 GB card without requiring aggressive quantization.
- Qwen3 14B extends the family's lead in instruction-following and multilingual capability to the mid-range tier. If you work across languages or require strong adherence to complex instructions, it is the strongest choice in this bracket.
- Phi-4 14B from Microsoft achieves performance disproportionate to its parameter count by training on high-quality synthetic data. It excels at structured reasoning, mathematics, and science-adjacent tasks. If you find yourself prompting for logical analysis or technical explanations, consider it specifically for that workload.
High-end GPUs: 24 GB VRAM
A 24 GB card such as a GeForce RTX 3090 or 4090, or a professional equivalent, reaches the thirty-two-billion-parameter tier, which is near-frontier for many practical tasks. The strongest models here are Qwen2.5 32B, Qwen3 32B, Gemma 3 27B, QwQ 32B, and DeepSeek-R1 Distill 32B. For GPU recommendations at this tier, the best GPUs for local LLMs guide covers the options in detail.
- Qwen2.5 32B and Qwen3 32B are the most capable general-purpose models accessible on a single 24 GB consumer card. They handle long documents, complex multi-step instructions, and challenging coding tasks with a reliability that 8B and 14B models cannot match. If you require a capable daily-driver assistant and have the hardware, these are the clearest recommendations in 2026.
- Gemma 3 27B is Google's most capable open release and is competitive with Qwen3 32B on general benchmarks. It is a strong alternative for users who prefer the Gemma family or who find that its particular training characteristics suit their workload better.
- QwQ 32B is a reasoning-specialised model from the Qwen team that applies extended chain-of-thought processing before producing an answer. It is slower than the general Qwen models but produces substantially stronger results on problems that benefit from deliberate step-by-step reasoning, such as mathematics, logic puzzles, and complex analysis. It occupies the same VRAM footprint as Qwen3 32B.
- DeepSeek-R1 Distill 32B is a distilled version of DeepSeek's reasoning model: a 32-billion-parameter model trained on the full model's outputs, so it fits within the 24 GB bracket while retaining much of the reasoning capability that made the full model notable. It is a strong choice for structured reasoning tasks at this tier, and it is covered in depth in the DeepSeek-R1 local AI guide.
Multi-GPU and workstation: 48 GB and above
The seventy-billion-parameter class requires one of three things: two 24 GB cards working together (optionally bridged with NVLink where the cards support it), a single professional card with sufficient VRAM, or acceptance of partial CPU offload at a speed penalty. The models worth running at this tier are Llama 3.3 70B and Qwen2.5 72B, both of which approach frontier capability on a wide range of tasks. Beyond those, the massive mixture-of-experts (MoE) models require workstation-class hardware with very large VRAM pools or multi-node inference. That group includes DeepSeek-R1 and DeepSeek-V3 (671 billion parameters each), Qwen3 235B, and Kimi K2. They represent the leading edge of what is achievable locally, but they are realistic options only for users with substantial dedicated infrastructure.
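The trade-off between a second card and CPU offload can be made concrete with a simple layer-placement sketch. It assumes every layer is the same size (ignoring embeddings and the output head) and reserves a fixed 2 GB per card for cache and buffers. The function name, the reserve, and the 80-layer, 40 GB example model are illustrative figures roughly in the shape of a 4-bit 70B model, not measurements.

```python
def place_layers(total_layers: int,
                 weights_gb: float,
                 vram_gb_per_gpu: list[float],
                 reserve_gb: float = 2.0) -> tuple[list[int], int]:
    """Greedily assign layers to GPUs in order.

    Returns (layers placed on each GPU, layers left for the CPU).
    Assumes uniform layer size and a fixed per-GPU reserve; both
    are simplifications for illustration.
    """
    per_layer_gb = weights_gb / total_layers
    placement = []
    remaining = total_layers
    for vram in vram_gb_per_gpu:
        usable = max(vram - reserve_gb, 0.0)
        count = min(remaining, int(usable // per_layer_gb))
        placement.append(count)
        remaining -= count
    return placement, remaining


if __name__ == "__main__":
    # Hypothetical 80-layer model with 40 GB of quantized weights.
    print(place_layers(80, 40.0, [24.0]))        # one 24 GB card
    print(place_layers(80, 40.0, [24.0, 24.0]))  # two 24 GB cards
```

With a single card, a large share of the layers ends up on the CPU, which is where the speed penalty comes from. With two cards, the whole model stays in VRAM.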
Specialised models worth knowing
Beyond the general-purpose tiers, several models lead their respective domains and are worth selecting specifically for those use cases.
- Best for coding: Qwen2.5-Coder is the clear leader. Available in multiple sizes, it is trained specifically on code and produces reliable, well-structured output across a broad range of languages. If you use your local model primarily for programming assistance, prefer it over the general-purpose alternatives at the same parameter count.
- Best for reasoning: The strongest reasoning models available locally are DeepSeek-R1, QwQ 32B, and OpenAI's open-weight gpt-oss models where the hardware allows. All of them apply explicit reasoning steps before committing to an answer, which produces measurably better results on problems with a correct solution. The trade-off is speed. These models think before they speak, and response latency reflects that. Their performance can be estimated in advance for a given card using the quantization guide and the calculator.
- Best tiny model: Gemma 3 4B and Llama 3.2 3B are the most capable models at the sub-8-billion-parameter scale. They are appropriate for lightweight hardware, always-on background tasks, and use cases where response latency must be minimal. Gemma 3 4B additionally accepts image input at this size, whereas the small Llama 3.2 models are text-only.
Choosing the right model
The practical decision process is straightforward. Establish the VRAM available, select the largest model that fits comfortably within that ceiling, and then prefer a specialist model if the workload falls clearly into coding, reasoning, or minimal-latency use. For most users with 8 GB cards, Qwen3 8B is the current recommendation. With 24 GB, Qwen3 32B or DeepSeek-R1 Distill 32B covers the large majority of tasks. The WillMyGPURunIt calculator will confirm which specific quantizations fit a given card and estimate the inference speed for each, so the choice can be validated against real hardware numbers before committing to a download.
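The decision process above can be written as a short selection routine. This is a minimal sketch under stated assumptions: the catalogue is a hand-picked subset of the models named in this guide with nominal parameter counts, the footprint formula (4.5 bits per weight plus 1.5 GB overhead) is a rough illustrative estimate, and the minimal-latency case is left out for brevity. The `Model`, `CATALOGUE`, and `pick_model` names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    params_billion: float
    speciality: str = "general"  # "general", "coding", or "reasoning"


# Illustrative subset of the models discussed in this guide.
CATALOGUE = [
    Model("Gemma 3 4B", 4),
    Model("Qwen3 8B", 8),
    Model("Qwen3 14B", 14),
    Model("Qwen3 32B", 32),
    Model("Qwen2.5-Coder 7B", 7, "coding"),
    Model("Qwen2.5-Coder 32B", 32, "coding"),
    Model("DeepSeek-R1 Distill 32B", 32, "reasoning"),
]


def needed_gb(model: Model) -> float:
    # Assumed footprint: ~4.5 bits per weight plus 1.5 GB overhead.
    return model.params_billion * 4.5 / 8 + 1.5


def pick_model(vram_gb: float, workload: str = "general") -> Optional[Model]:
    """Largest fitting specialist for the workload, else largest general model."""
    fitting = [m for m in CATALOGUE if needed_gb(m) <= vram_gb]
    specialists = [m for m in fitting if m.speciality == workload]
    generalists = [m for m in fitting if m.speciality == "general"]
    pool = specialists or generalists
    if not pool:
        return None  # nothing in the catalogue fits this card
    return max(pool, key=lambda m: m.params_billion)


if __name__ == "__main__":
    for vram, workload in [(8, "general"), (24, "general"),
                           (8, "coding"), (24, "reasoning")]:
        choice = pick_model(vram, workload)
        print(vram, workload, "->", choice.name if choice else "no fit")
```

When no specialist fits the card, the routine falls back to the largest general-purpose model, which mirrors the guide's advice to size for VRAM first and specialise second.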