
What Is Tokens Per Second? How Fast Is Fast Enough for Local AI?

Tokens per second explained: what it measures, what counts as interactive, and why MoE models can outrun smaller dense ones.

Tokens per second (abbreviated tok/s or t/s) is the single number most commonly used to describe how fast a large language model generates text. It counts how many tokens a model produces in one second of continuous output. Because it collapses hardware, software, and model size into a single figure, it has become the standard benchmark for comparing local AI setups. Understanding it is essential for anyone choosing a GPU or selecting a model to run at home.

What is a token?

A token is the fundamental unit of text that a language model reads and writes. Tokens are not words. They are fragments produced by a process called tokenisation that breaks text into the smallest pieces the model was trained to recognise. In practice, a token in English corresponds to roughly three-quarters of a word or about four characters. Common short words such as "the" or "is" are typically a single token. Longer or rarer words are split across two or more.

This distinction matters when interpreting speed claims. A model running at 30 tok/s is not delivering 30 words per second. It is closer to 22 words per second. At 40 tok/s the output arrives at around 30 words per second, which already exceeds the reading speed of a typical adult. Token counts also determine how much context a model can hold in memory. A 4,096-token context window fits roughly 3,000 words or about six pages of ordinary prose.
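The conversions above are simple enough to check in a few lines. The following is a minimal sketch in Python, assuming the rule-of-thumb ratio of 0.75 words per token. The function names are illustrative, and the real ratio varies with the tokenizer and the language.

```python
# Rule of thumb from the text: one English token is roughly 0.75 words.
# This is an approximation, not a property of any specific tokenizer.
WORDS_PER_TOKEN = 0.75


def words_per_second(tok_per_s: float) -> float:
    """Convert a generation speed in tokens/s to approximate words/s."""
    return tok_per_s * WORDS_PER_TOKEN


def context_words(context_tokens: int) -> int:
    """Approximate number of English words that fit in a context window."""
    return int(context_tokens * WORDS_PER_TOKEN)


for speed in (30, 40):
    print(f"{speed} tok/s is about {words_per_second(speed):.1f} words/s")

print(f"A 4,096-token window holds about {context_words(4096):,} words")
```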

What determines tokens per second?

During the generation phase, called decode, the model produces one token at a time. To produce each token it must read every weight that participates in that step from memory into the processor, which for a conventional dense model means the entire set of weights. This makes decode memory-bandwidth-bound. The speed at which tokens appear is limited not by the raw arithmetic power of the GPU but by how quickly the chip can transfer data between its memory and its compute cores.

The relationship is direct enough to express as an approximation. Tokens per second is roughly proportional to memory bandwidth divided by the number of bytes of weights the model must read per token. Formally, for a model with N billion active parameters stored at b bits per weight:

  • Bytes read per token = N × (b ÷ 8) gigabytes
  • Theoretical tok/s = GPU memory bandwidth (GB/s) ÷ bytes read per token (GB)
  • Realistic tok/s ≈ theoretical × 0.5–0.65. This accounts for KV-cache reads, attention and sampling overhead, and kernel launch costs.
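The three steps above can be sketched as a small estimator. This is an illustrative sketch rather than a benchmark: the function names are invented here, and the example figures (a 300 GB/s card, an 8 billion parameter model at 4.8 bits per weight) are assumptions chosen only to show the arithmetic.

```python
def bytes_per_token_gb(active_params_b: float, bits_per_weight: float) -> float:
    """Gigabytes of weights read to produce one token: N x (b / 8)."""
    return active_params_b * bits_per_weight / 8


def theoretical_tok_s(bandwidth_gb_s: float, active_params_b: float,
                      bits_per_weight: float) -> float:
    """Ceiling on decode speed if reading weights were the only cost."""
    return bandwidth_gb_s / bytes_per_token_gb(active_params_b, bits_per_weight)


def realistic_tok_s(bandwidth_gb_s: float, active_params_b: float,
                    bits_per_weight: float,
                    efficiency: tuple = (0.5, 0.65)) -> tuple:
    """Scale the ceiling by the 0.5-0.65 range from the text; returns (low, high)."""
    ceiling = theoretical_tok_s(bandwidth_gb_s, active_params_b, bits_per_weight)
    low, high = efficiency
    return ceiling * low, ceiling * high


# Assumed example: 300 GB/s of memory bandwidth, 8B parameters, 4.8 bits per weight.
ceiling = theoretical_tok_s(300, 8, 4.8)
low, high = realistic_tok_s(300, 8, 4.8)
print(f"Weights read per token: {bytes_per_token_gb(8, 4.8):.1f} GB")
print(f"Theoretical ceiling:    {ceiling:.1f} tok/s")
print(f"Realistic range:        {low:.1f} to {high:.1f} tok/s")
```

Under these assumptions the realistic range comes out at roughly 31 to 41 tok/s. Treat the result as an order-of-magnitude guide, since the efficiency factor varies with the runtime, context length, and driver.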

The practical consequence is that the two levers most directly under your control are the GPU's memory bandwidth and the quantisation level applied to the model. A higher-bandwidth card moves data faster and therefore produces tokens faster. A more aggressively quantised model stores each weight in fewer bits, which reduces the bytes read per token and raises throughput at some cost to output quality.

How many tokens per second is fast enough?

The answer depends on the use case but a small set of thresholds captures most practical situations:

  • Below ~5 tok/s. Output arrives noticeably slower than a person reads. Waiting for each sentence becomes uncomfortable and the experience feels more like polling than conversing. This range is generally below the threshold for interactive use.
  • ~10–20 tok/s. Readable and usable for most tasks. The model keeps pace with a careful reader and responses of several hundred tokens complete in under a minute. This is a common target for consumer GPU setups running mid-sized models.
  • ~30–40 tok/s. At this range the model generates text faster than most people read it. Waiting ceases to be a friction point for prose tasks. This is the typical experience on a mid-range GPU running a well-quantised 7–8 billion parameter model.
  • Above ~40 tok/s. The model outpaces comfortable reading speed. Additional throughput becomes relevant for batch tasks such as processing many documents or generating code completions in a tight loop rather than for single-user conversation.

The threshold that most practitioners describe as "interactive" sits around 10 tok/s, though 20 tok/s is more comfortable. For code generation, where you may be scanning output carefully, 15–25 tok/s generally feels responsive. For chat or summarisation, where reading proceeds quickly, 30 tok/s or above makes the wait essentially imperceptible.
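One way to make these thresholds concrete is to convert them into waiting time for a complete response. The sketch below assumes a 500-token reply as a stand-in for "several hundred tokens". Both that length and the function name are illustrative choices, not figures from any benchmark.

```python
def response_seconds(response_tokens: int, tok_per_s: float) -> float:
    """Seconds needed to generate a full response at a steady decode speed."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return response_tokens / tok_per_s


RESPONSE_TOKENS = 500  # assumed reply length, for illustration only

for speed in (5, 10, 20, 40):
    seconds = response_seconds(RESPONSE_TOKENS, speed)
    print(f"{speed:>2} tok/s: {seconds:.1f} s for {RESPONSE_TOKENS} tokens")
```

This ignores prompt processing time, which adds a delay before the first token appears.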

Why bigger MoE models can be faster than smaller ones

One of the less intuitive facts in local AI is that a very large model can sometimes run faster than a much smaller one. This is the consequence of a design called mixture-of-experts (abbreviated MoE).

In a conventional dense model every parameter participates in generating every token. A model with 13 billion parameters reads all 13 billion weights to produce each word. Tokens per second is therefore determined by the full parameter count.

A MoE model partitions its parameters into groups called experts. A routing mechanism selects only a small subset of those experts for each token, typically two out of many. The weights that do not belong to the selected experts are never read during that step. A MoE model may hold far more total parameters than a dense alternative, but the number of parameters active per token can be considerably smaller, and that figure is what actually determines decode speed.

The implication is that the nominal size of a MoE model is not the right number to use when estimating speed. What matters for throughput is how many parameters are active during decode. This is also why the WillMyGPURunIt calculator displays active parameters separately from total parameters for MoE models and uses only the active figure when estimating tokens per second. For VRAM requirements the total parameter count still governs how much memory the model occupies. All experts must be loaded even if only a few are used at once.
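The split between total and active parameters can be illustrated with a short comparison. The two models below are hypothetical, invented for this sketch, as are the 600 GB/s bandwidth and the 4 bits per weight. Speed is estimated from active parameters and memory footprint from total parameters, following the reasoning above.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    total_params_b: float   # billions of parameters that must sit in memory
    active_params_b: float  # billions of parameters read per generated token


def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size in gigabytes of params_b billion weights at the given precision."""
    return params_b * bits_per_weight / 8


def theoretical_tok_s(model: Model, bandwidth_gb_s: float,
                      bits_per_weight: float) -> float:
    """Decode ceiling, driven by active parameters only."""
    return bandwidth_gb_s / weights_gb(model.active_params_b, bits_per_weight)


# Hypothetical models, not real releases.
dense = Model("dense 30B", total_params_b=30, active_params_b=30)
moe = Model("MoE 100B (10B active)", total_params_b=100, active_params_b=10)

BANDWIDTH_GB_S = 600  # assumed GPU memory bandwidth
BITS = 4              # assumed quantisation level

for m in (dense, moe):
    speed = theoretical_tok_s(m, BANDWIDTH_GB_S, BITS)
    footprint = weights_gb(m.total_params_b, BITS)
    print(f"{m.name}: ~{speed:.0f} tok/s ceiling, {footprint:.0f} GB of weights")
```

In this sketch the larger MoE model has a ceiling of about 120 tok/s against about 40 tok/s for the dense one, yet it needs 50 GB for its weights against 15 GB. These are theoretical ceilings before the overhead factor is applied.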

The role of quantisation

Decode speed is inversely proportional to the number of bytes read per token, so reducing the precision at which weights are stored has a direct and predictable effect on throughput. Quantisation achieves this by representing each weight using fewer bits than the original training format. A model stored at 4 bits per weight occupies roughly one-quarter the memory of the same model stored in 16-bit floating point, so a GPU with fixed memory bandwidth can in principle push approximately four times as many tokens per second.

In practice the relationship is somewhat compressed by overhead that does not scale with weight size, but quantisation remains the most accessible lever for improving speed on a fixed hardware budget. The trade-off is quality. Heavier compression discards information from the weights and the model's outputs become progressively less coherent as precision falls. The common Q4_K_M format at roughly 4.8 effective bits per weight is a broadly accepted compromise. It offers sufficient quality for most conversational and coding tasks with a meaningful speed advantage over higher-precision formats.
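The effect of bit width on size and on the theoretical speed ceiling can be tabulated directly. The sketch below assumes an 8 billion parameter dense model and treats speed-up as the simple ratio of bits per weight, which is the idealised case. The helper names and the 8-bit row are illustrative.

```python
def weights_gb(params_b: float, bits_per_weight: float) -> float:
    """Size in gigabytes of params_b billion weights at the given precision."""
    return params_b * bits_per_weight / 8


def speedup_vs_fp16(bits_per_weight: float) -> float:
    """Idealised decode speed-up relative to 16-bit weights (ignores overhead)."""
    return 16 / bits_per_weight


PARAMS_B = 8  # assumed model size in billions of parameters

formats = [
    ("16-bit float", 16.0),
    ("8-bit", 8.0),
    ("Q4_K_M (~4.8 bits)", 4.8),
    ("4-bit", 4.0),
]

for name, bits in formats:
    size = weights_gb(PARAMS_B, bits)
    gain = speedup_vs_fp16(bits)
    print(f"{name:<20} {size:5.1f} GB   ~{gain:.2f}x vs 16-bit")
```

Note that at roughly 4.8 effective bits the idealised gain is closer to 3.3 times than 4 times, and real-world gains are lower still because of the fixed overhead described above.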

When the model spills out of VRAM

Tokens per second drops sharply when a model does not fit entirely within GPU memory. Tools such as llama.cpp can offload layers that exceed VRAM capacity to system RAM and allow a larger model to run on a smaller GPU. But system RAM bandwidth is typically 50–85 GB/s on a consumer desktop depending on generation, far lower than the memory bandwidth of a discrete GPU. Any layer that must be read from system RAM rather than VRAM contributes to the per-token transfer time at the slower rate and pulls the effective throughput down proportionally.

A model that mostly fits in VRAM and spills only a small number of layers to RAM can still achieve acceptable throughput. A model with most of its layers resident in system RAM may run at speeds well below the interactive threshold. This is why fitting a model entirely within VRAM is the primary target when selecting hardware, and why GPU memory capacity and bandwidth are both relevant when choosing a card for local AI.
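The penalty for spilling can be approximated by adding the time spent reading each portion of the weights at its own bandwidth. The sketch below is a deliberately simplified model: it assumes the two reads happen one after the other, ignores PCIe transfers and CPU compute, and uses invented figures (a 10 GB model, 500 GB/s of VRAM bandwidth, 60 GB/s of system RAM bandwidth). The function name is hypothetical.

```python
def offload_tok_s(model_gb: float, vram_fraction: float,
                  gpu_bw_gb_s: float, ram_bw_gb_s: float) -> float:
    """Theoretical tok/s when a fraction of the weights lives in VRAM
    and the remainder is read from system RAM on every token."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    if not 0.0 <= vram_fraction <= 1.0:
        raise ValueError("vram_fraction must be between 0 and 1")
    vram_gb = model_gb * vram_fraction
    ram_gb = model_gb - vram_gb
    # Per-token time is the sum of both reads, each at its own bandwidth.
    seconds_per_token = vram_gb / gpu_bw_gb_s + ram_gb / ram_bw_gb_s
    return 1.0 / seconds_per_token


# Assumed figures for illustration only.
MODEL_GB, GPU_BW, RAM_BW = 10, 500, 60

for fraction in (1.0, 0.9, 0.5, 0.3):
    speed = offload_tok_s(MODEL_GB, fraction, GPU_BW, RAM_BW)
    print(f"{fraction:.0%} in VRAM: ~{speed:.1f} tok/s ceiling")
```

Even in this idealised model, moving a tenth of the weights to system RAM removes a large share of the ceiling, because the slow portion dominates the per-token time.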

What this means when choosing hardware

The memory-bandwidth-bound nature of decode has a practical consequence that is frequently underweighted in purchasing decisions. Within a given GPU generation, a card with higher memory bandwidth will often outperform a card with more raw compute on language model inference, even if the two cards post similar gaming benchmarks. Gaming workloads lean mostly on compute, whereas LLM decode is bound by bandwidth. The two rankings do not align, and a card that appears modest by gaming measures can be surprisingly competitive for local AI.

Similarly, the amount of VRAM a card carries governs which models fit and therefore which speeds are reachable at all. A GPU with ample VRAM but modest bandwidth can hold large models, but it will generate tokens more slowly than a higher-bandwidth card running a model that both can fit. The optimal balance depends on the target model size and use case. For most users running 7–13 billion parameter models at standard quantisation levels, bandwidth matters more than capacity once the model fits entirely in VRAM. For users trying to fit very large models, capacity becomes the binding constraint. Consulting the calculator with a specific GPU and model in mind is the most direct way to resolve this trade-off for a given build.
