In 2025 OpenAI, the company behind ChatGPT, released its first open-weight models in several years: gpt-oss-20b and gpt-oss-120b, both published under the permissive Apache 2.0 licence. Open-weight models distribute the trained parameter files publicly, which allows anyone to download and run them on private hardware without an account or an API key or a subscription. For a company whose products had previously been accessible only through managed services, this release marked a notable shift, and it attracted significant attention from the local AI community.
What gpt-oss actually is
Both gpt-oss models are Mixture-of-Experts (MoE) architectures. A conventional dense language model activates every one of its parameters when generating each token. A MoE model instead partitions its parameters into groups called experts and routes each token through only a small subset of them. The remaining experts are loaded into memory but idle during that particular step.
gpt-oss-20b holds roughly 21 billion total parameters yet only around 3.6 billion are active when producing any given token. gpt-oss-120b carries roughly 117 billion total parameters with roughly 5.1 billion active per token. Inference speed is determined by how many parameter bytes must be read from memory per token rather than by the total number of parameters stored, so both models can generate text noticeably faster than their headline sizes would suggest. The tokens-per-second guide explains this relationship between active parameters and memory bandwidth and throughput in detail.
Both models are reasoning models. They support an adjustable reasoning effort setting, typically expressed as low or medium or high, which controls how many internal reasoning steps the model executes before producing a final response. Higher effort improves accuracy on difficult problems at the cost of additional latency and token consumption. This makes them unusually flexible. A quick factual question can be handled at low effort while a complex mathematical proof or a multi-step coding task can be directed toward high effort.
Why the release matters
The significance of the gpt-oss release is partly symbolic and partly practical. On the symbolic side OpenAI had built its commercial position around access-controlled models. Releasing capable weights under Apache 2.0, a licence that permits commercial use and modification and redistribution, represented a meaningful departure from that posture. It also validated the open-weight approach at a moment when competitors such as Meta (Llama) and Mistral had already demonstrated that openly distributed models could be competitive with much larger proprietary ones.
On the practical side the release made it possible to run a model trained by one of the most well-resourced AI laboratories in the world entirely offline. No prompt leaves the machine. No usage is logged. The model continues to function regardless of whether OpenAI's services are available. If you have privacy requirements or a preference for self-hosted tooling, this combination of provenance and availability was unusual. To explore how local AI compares to cloud services on those dimensions, running AI locally with Ollama is a practical starting point.
What the models are suited for
Both gpt-oss variants are general-purpose assistants with particular strengths in reasoning and mathematics and code generation and agentic tool-use workflows.
The reasoning capability means they handle problems that benefit from intermediate steps such as decomposing a question before answering it or working through a proof in stages or checking the logical consistency of a plan. This differs from models that generate plausible-sounding responses without genuinely elaborating a chain of thought, and it is particularly noticeable on tasks where the correct answer is not the most statistically likely one.
The tool-use and agentic capability means the models can be integrated into automated pipelines by calling external functions and parsing structured data returned from APIs and maintaining task state across multiple turns. This makes them suitable not just for interactive chat but for autonomous or semi-autonomous workflows where a model must plan and execute a sequence of actions. A coding assistant that runs tests and interprets failures, or a research agent that fetches and synthesises sources, are representative examples.
Hardware considerations
The two models target meaningfully different hardware profiles, and the distinction is worth understanding before choosing between them.
gpt-oss-20b is the consumer-accessible variant. Only a small fraction of its parameters are active per token, so its memory footprint during inference is more modest than the total parameter count implies. A high-end consumer GPU with sufficient VRAM can run it entirely in GPU memory and achieve interactive generation speeds. This makes it a practical choice if you already own capable gaming hardware and want a high-quality local model without additional investment. For guidance on how much VRAM is needed for models of this class, the VRAM requirements guide provides a tier-by-tier breakdown.
gpt-oss-120b is the workstation variant. Even with MoE efficiency the full model occupies a substantial amount of memory, and fitting it entirely within GPU VRAM requires a multi-GPU workstation or a card with very high capacity. It can be run on consumer hardware with layer offloading by distributing the model across GPU memory and system RAM, but generation speed degrades significantly when layers must be read from the slower RAM bus. For most consumer setups gpt-oss-120b is best treated as an aspirational target or an option for batch tasks where throughput matters less than response quality.
For an accurate estimate of how either model would perform on a specific machine, the WillMyGPURunIt calculator accounts for GPU memory bandwidth and VRAM capacity and active parameter count to produce a realistic tokens-per-second figure without requiring you to run a benchmark.
Where gpt-oss fits in the open-weight landscape
The open-weight space already contained strong models before gpt-oss arrived. Llama 3 from Meta and Qwen from Alibaba and Mistral from the French laboratory of the same name and DeepSeek-V3 from the Chinese research group DeepSeek had each demonstrated that capable models could be distributed openly. The guide to the best local AI models surveys that broader landscape. What gpt-oss adds is a data point from a laboratory that had previously chosen not to distribute weights, and a reasoning-oriented architecture with adjustable effort at a scale that consumer hardware can meaningfully engage with.
The Apache 2.0 licence is significant in context. Some open-weight releases carry usage restrictions that prohibit commercial use above a certain scale or require attribution in specific ways. Apache 2.0 imposes no such constraints, which simplifies deployment in commercial applications and reduces the legal overhead for organisations that need clarity about permitted uses.
Getting started
gpt-oss models are available through the standard model repositories used by the local AI ecosystem. Tools such as Ollama and llama.cpp can load and serve them using the same workflow as any other open-weight model. Quantised variants in GGUF format are the most common choice for consumer hardware. They trade a modest reduction in output quality for a proportional reduction in memory requirements and a meaningful gain in generation speed. If you are new to the process the Ollama setup guide walks through installation and first use.
The appropriate starting point for most users is gpt-oss-20b at a mid-range quantisation level. That is sufficient to demonstrate the model's reasoning capability on a wide range of tasks while remaining practical on consumer hardware. The 120b variant is worth considering once the 20b has been evaluated and a higher-capacity machine is available, or when the task at hand specifically warrants the additional quality headroom that the larger model provides.