When DeepSeek released its R1 model in early 2025 it drew unusually wide attention. That was not merely because of its benchmark scores but because of the reasoning approach it employed and because the weights were made freely available under a permissive licence. For anyone interested in running a capable reasoning model on their own hardware, R1 represented a genuine step change. It was a family of models that could think through problems step by step, released openly, in sizes ranging from small enough for a modest laptop GPU to large enough to require a datacenter. This article explains what DeepSeek R1 is, how the distilled variants work, and what kinds of tasks warrant using a reasoning model locally.
What distinguishes a reasoning model
Most language models operate in a single pass. They receive a prompt and generate a response directly. A reasoning model, of which R1 is the most prominent open example, is trained to produce an extended internal monologue before arriving at a final answer. This monologue is sometimes called a chain of thought. It works through the problem step by step, decomposing it, checking intermediate conclusions, and revising where necessary. The final response is then drawn from the end of that reasoning trace rather than from the prompt alone.
The practical effect is most pronounced on problems where a direct answer is likely to be wrong. Mathematical proofs, multi-step logic puzzles, algorithmic challenges, and complex code tasks all benefit from the model having worked through sub-problems before committing to a conclusion. On straightforward factual questions or short conversational exchanges the advantage is smaller, and the additional tokens generated during the reasoning phase simply add latency without corresponding benefit. Knowing when to reach for a reasoning model and when a conventional one is sufficient is therefore part of using R1 effectively.
The reasoning trace is visible when running R1 locally. The model outputs its thinking enclosed in special tags before producing the final answer. This transparency is unusual. It allows you to inspect where the model allocated effort and to identify the step, if any, where reasoning went astray.
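As a rough illustration of inspecting that trace programmatically, the sketch below separates the thinking from the final answer. It assumes the trace is wrapped in <think> and </think> tags, which is how R1 output commonly appears in local runtimes; the function name split_reasoning and the sample string are hypothetical, and the tag name should be adjusted if your runtime formats the output differently.

```python
import re

# Assumption: the reasoning trace is wrapped in <think>...</think>.
# Change the tag name here if your runtime uses something else.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw_output: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from raw model output."""
    match = THINK_PATTERN.search(raw_output)
    if match is None:
        # No trace found: treat the whole output as the answer.
        return "", raw_output.strip()
    trace = match.group(1).strip()
    answer = raw_output[match.end():].strip()
    return trace, answer


# Hypothetical sample output, not produced by a real model run.
sample = "<think>2 + 2 is 4. Double-check: 4 - 2 = 2. Fine.</think>\nThe answer is 4."
trace, answer = split_reasoning(sample)
print("TRACE:", trace)
print("ANSWER:", answer)
```

Keeping the two parts separate makes it easy to log the trace for auditing while showing only the answer to an end user.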
Architecture: a 671-billion-parameter mixture of experts
The full DeepSeek R1 model contains 671 billion parameters and is implemented as a mixture of experts (MoE) architecture. In an MoE model the parameters are partitioned into groups called experts, and a routing mechanism selects only a small subset of those experts for each token generated. R1 activates roughly 37 billion parameters per token despite holding 671 billion in total. This means the compute cost per token is much closer to that of a 37B dense model than to a 671B one, but all 671B parameters must still reside in memory, which is an enormous storage requirement. Running the complete model demands workstation-class or multi-GPU hardware of a kind that falls outside the reach of consumer setups.
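The gap between compute cost and memory cost can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the two parameter counts quoted above; the bit widths are illustrative assumptions, and the estimate counts weights only, ignoring the KV cache, activations, and runtime overhead.

```python
TOTAL_PARAMS = 671e9   # total parameters, as quoted above
ACTIVE_PARAMS = 37e9   # parameters activated per token, as quoted above


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Share of parameters active per token: {active_fraction:.1%}")

# Illustrative precisions; real quantisation formats add some metadata.
for bits in (16, 8, 4):
    size = weight_memory_gb(TOTAL_PARAMS, bits)
    print(f"{bits}-bit weights: ~{size:,.0f} GB")
```

Only about one parameter in eighteen is used for any given token, yet even at an assumed 4 bits per weight the full set of weights runs to hundreds of gigabytes.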
DeepSeek also released DeepSeek-V3, a sibling 671B MoE model intended for general-purpose use rather than structured reasoning. V3 does not employ the chain-of-thought training that defines R1. It is faster per answer and better suited to tasks where extended deliberation adds little, such as summarisation, translation, or open-ended conversation. For users evaluating the DeepSeek family as a whole, V3 and R1 occupy complementary roles rather than competing directly. Both are MIT licensed and available through the same distribution channels.
Distillation: reasoning on consumer hardware
The development that made R1 practically useful for local deployment is the release of a set of distilled models. Distillation is a training technique in which a smaller model called the student is trained to reproduce the output behaviour of a much larger model called the teacher. Rather than learning solely from human-labelled data the student learns from the teacher's responses, including the reasoning traces that characterise R1. The result is a compact model that has internalised the teacher's problem-solving patterns even though it lacks the teacher's full parameter count.
DeepSeek released distilled versions at six sizes: 1.5B, 7B, 8B, 14B, 32B, and 70B parameters. These are not new architectures built from scratch. They are built on the base architectures of Qwen for most sizes and Llama for the 8B and 70B variants, fine-tuned to carry R1's reasoning capability. The distillation process preserves a meaningful portion of the original model's reasoning ability at each size. The 32B distill has attracted particular interest among users with high-end consumer GPUs, as it sits at a point where reasoning quality is strong and hardware requirements, while demanding, remain within reach of capable desktop setups. The 7B and 14B distills are more broadly accessible and remain genuinely useful for maths and coding tasks.
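To make "learning from the teacher's responses" concrete, the sketch below packs one teacher response, reasoning trace included, into a prompt and completion pair of the kind a student could be fine-tuned on. This is an illustration of the general idea, not DeepSeek's actual pipeline: the function build_sft_record, the field names, the <think> tag layout, and the example content are all assumptions.

```python
import json


def build_sft_record(prompt: str, teacher_trace: str, teacher_answer: str) -> dict:
    """Pack one teacher response into a prompt/completion pair for fine-tuning.

    The completion keeps the reasoning trace so the student is trained to
    reproduce the deliberation as well as the final answer.
    """
    completion = f"<think>\n{teacher_trace}\n</think>\n{teacher_answer}"
    return {"prompt": prompt, "completion": completion}


# Hypothetical example; a real dataset would hold many teacher-generated samples.
record = build_sft_record(
    prompt="What is 17 * 6?",
    teacher_trace="17 * 6 = 10 * 6 + 7 * 6 = 60 + 42 = 102.",
    teacher_answer="102",
)
print(json.dumps(record, indent=2))
```

Training the student on the completion text with an ordinary next-token objective is what lets it pick up the teacher's step-by-step habits rather than only its conclusions.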
Understanding which distill to run requires knowing how much VRAM is available and what performance to expect from it. The VRAM requirements guide covers the relationship between model size and memory in detail, and the quantisation guide explains how reducing weight precision can make a larger distill fit on a smaller card at a measured cost to output quality. The WillMyGPURunIt calculator accepts a specific GPU and model and returns an estimate of both fit and speed for the selected combination.
What R1 distills are well suited for
Reasoning models earn their place in workflows where intermediate steps matter as much as the final answer. Several categories stand out:
Mathematics and formal logic. R1 distills perform measurably better than equivalently sized conventional models on competition-style maths problems and formal proofs. The chain-of-thought training allows the model to attempt algebraic manipulation step by step and to catch arithmetic errors before they propagate. If you work through problem sets, verify derivations, or explore mathematical ideas, even the 7B distill often provides noticeably more reliable working than a same-sized conventional model.
Step-by-step coding. When asked to implement a non-trivial algorithm or debug a subtle logic error, R1 distills tend to narrate their approach, identifying edge cases, considering data structures, and checking assumptions before producing code. This makes the output easier to audit and the model's reasoning easier to redirect when it heads in the wrong direction. For general boilerplate or simple function generation, a conventional model from the best local AI models list may be faster and equally competent.
Analytical tasks with visible reasoning. For decisions or analyses where the quality of the reasoning process matters and not merely the conclusion, R1's externally visible thinking is useful in its own right. You can read the chain of thought, identify a faulty assumption, and intervene with a correction before the model has committed to an answer.
The trade-off: tokens and speed
The cost of reasoning is token volume. A reasoning model generates substantially more tokens per response than a conventional model answering the same question, because the chain of thought itself consumes tokens before the final answer begins. On a given hardware setup this means longer wall-clock wait times. A response that would take a few seconds from a conventional model may take considerably longer from R1, because the model is generating a paragraph or more of deliberation first.
This is not a deficiency. It is the mechanism by which the quality improvement is achieved. But it does mean that hardware capable of a comfortable speed on a conventional model may feel slow when running R1 distills on the same tasks. Understanding the relationship between hardware bandwidth, model size, and output rate is essential before choosing a distill size. The tokens per second guide explains the underlying mechanics and provides the thresholds for interactive versus batch use. For R1 distills especially, where responses are longer, a higher token rate makes a meaningful difference to the practical experience.
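The effect on wait time is simple arithmetic: total tokens generated divided by the generation rate. The sketch below compares a conventional response with a reasoning response at the same speed. All of the token counts and the 30 tokens per second rate are made-up illustrative figures, not measurements, and the estimate ignores prompt processing time.

```python
def response_seconds(reasoning_tokens: int, answer_tokens: int,
                     tokens_per_second: float) -> float:
    """Estimated generation time: every token, trace or answer, costs the same."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return (reasoning_tokens + answer_tokens) / tokens_per_second


# Illustrative assumptions: same hardware, same 150-token final answer.
RATE = 30.0
conventional = response_seconds(reasoning_tokens=0, answer_tokens=150,
                                tokens_per_second=RATE)
reasoning = response_seconds(reasoning_tokens=1200, answer_tokens=150,
                             tokens_per_second=RATE)

print(f"Conventional model: {conventional:.0f} s")
print(f"Reasoning model:    {reasoning:.0f} s")
```

With these assumed figures the answer itself is no longer, but the trace in front of it multiplies the wait several times over, which is why token rate matters more for R1 distills than for conventional models.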
Choosing between distill sizes
At the smaller end the 1.5B and 7B distills run on modest hardware and are reasonable for exploratory or educational use, such as quick maths checks, learning how a reasoning trace looks, or working through small coding exercises. The 14B distill represents a meaningful step up in reasoning depth and suits a wider range of problem types. The 32B distill approaches the quality ceiling reachable through distillation and is the preference among users with high-end consumer GPUs who prioritise reasoning accuracy over speed. The 70B distill pushes into territory that requires either large-VRAM professional cards or partial offload to system memory with corresponding speed penalties.
The right choice is governed by the VRAM available and the task at hand. For most users considering R1 for local deployment, the 14B or 32B distill at a standard quantisation level represents the range where reasoning quality and hardware accessibility intersect most usefully. Entering a specific GPU into the calculator will confirm whether a chosen distill fits within VRAM and what token rate to expect, which makes it straightforward to select a model that suits both the hardware and the workflow.
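A first-pass version of that fit check can be sketched in a few lines. The distill sizes come from the list released by DeepSeek; everything else is an assumption made for illustration: roughly 4.5 bits per weight to stand in for a typical 4-bit quantisation with its metadata, a flat 2 GB allowance for the KV cache and runtime overhead, and the function names themselves. It is a rough screen, not a replacement for a proper calculator, and it says nothing about speed.

```python
from typing import Optional

# Distill sizes in billions of parameters, as released.
DISTILL_SIZES_B = [1.5, 7, 8, 14, 32, 70]


def estimated_vram_gb(params_b: float, bits_per_weight: float,
                      overhead_gb: float = 2.0) -> float:
    """Weights plus a flat overhead allowance (both figures are assumptions)."""
    return params_b * bits_per_weight / 8 + overhead_gb


def largest_fitting_distill(vram_gb: float,
                            bits_per_weight: float = 4.5) -> Optional[float]:
    """Return the largest distill size (in billions) estimated to fit, or None."""
    fitting = [size for size in DISTILL_SIZES_B
               if estimated_vram_gb(size, bits_per_weight) <= vram_gb]
    return max(fitting) if fitting else None


for vram in (8, 12, 24):
    print(f"{vram} GB card -> largest distill that fits: "
          f"{largest_fitting_distill(vram)}B")
```

Under these assumptions a 24 GB card lands on the 32B distill and a 12 GB card on the 14B, which matches the range described above; longer contexts need more than the flat allowance, so treat borderline fits with caution.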