Why naive VRAM calculators are wrong on modern LLMs

Gemma 4 31B real KV = 20.78 GiB (262K context, bf16); naive calculators report ~11.1× more

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-07-16.

The shortcut that used to work

For years, "VRAM needed = parameters × bytes-per-param + a flat KV estimate" was close enough. On 2025–2026 models it is wrong by multiples, because the architectures broke the assumptions behind that formula. Five errors, with the receipts.

1. Sliding-window attention caps the KV cache

Gemma 4 interleaves 5 sliding-window (local) layers : 1 global layer. Local layers attend only to the last 1,024 tokens, so their KV cache stops growing once the context exceeds the window. Counting all layers at full context over-estimates KV by close to 6× at long context, before head-shape differences are even considered.
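A minimal sketch of the capping effect, assuming the 5:1 pattern and 1,024-token window described above and, for simplicity, an identical head shape on every layer. The names layer_kv_tokens and overestimate_ratio are illustrative and are not taken from the FitLLM engine.

```python
from typing import Optional

def layer_kv_tokens(context_len: int, window: Optional[int]) -> int:
    """Tokens one layer holds in its KV cache (window=None means global attention)."""
    return context_len if window is None else min(context_len, window)

# One repeating group: 5 sliding-window layers followed by 1 global layer.
PATTERN = [1024, 1024, 1024, 1024, 1024, None]

def overestimate_ratio(context_len: int) -> float:
    """Naive token count (every layer at full context) divided by the window-aware count."""
    real = sum(layer_kv_tokens(context_len, w) for w in PATTERN)
    naive = context_len * len(PATTERN)
    return naive / real

for ctx in (1_024, 32_768, 262_144):
    print(ctx, round(overestimate_ratio(ctx), 2))
```

The ratio is 1.0 while the context still fits inside the window and climbs toward 6 as the context grows.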

2. Hybrid / linear attention has no growing KV

Qwen 3.6 is hybrid: only a fraction of layers use full attention; the rest are linear and keep no per-token KV cache. A calculator that counts every layer at full context is counting cache that doesn't exist.
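A rough sketch of that difference. Every number below is a hypothetical placeholder chosen for illustration, not Qwen 3.6's real configuration; read the actual layer counts and head shape from the model's config.json. Linear-attention layers typically keep a small fixed-size state that does not grow with context, which this sketch ignores.

```python
def hybrid_kv_bytes(full_attn_layers: int, kv_heads: int, head_dim: int,
                    context_len: int, bytes_per_elem: int = 2) -> int:
    """Growing KV cache in bytes: only full-attention layers store K and V per token."""
    return full_attn_layers * 2 * kv_heads * head_dim * bytes_per_elem * context_len

# HYPOTHETICAL shape, for illustration only.
TOTAL_LAYERS = 40       # hypothetical
FULL_ATTN_LAYERS = 10   # hypothetical: 1 in 4 layers uses full attention
KV_HEADS, HEAD_DIM, CONTEXT = 2, 256, 131_072  # hypothetical

real = hybrid_kv_bytes(FULL_ATTN_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# The naive count treats every layer as a full-attention layer.
naive = hybrid_kv_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)

print(f"real {real / 1024**3:.2f} GiB, naive {naive / 1024**3:.2f} GiB, "
      f"ratio {naive / real:.1f}x")
```

With these placeholder values the naive count is 4× the real one; the ratio is simply total layers divided by full-attention layers.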

3. Global layers use a different head_dim

In Gemma 4 the global-attention layers use head_dim 512 with fewer KV heads (4), while sliding layers use head_dim 256 with 16 KV heads. A single uniform head_dim cannot be right for both kinds of layer at once.

4. MoE: total params resident, active params for compute

Qwen 3.6 35B-A3B activates ~3B params per token but all 35B must sit in memory. Size memory off active params and you'll think it fits when it doesn't; size KV off all layers and you'll think it won't when it does.
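The weight side of that mistake can be sketched in a few lines. This assumes bf16 weights (2 bytes per parameter), which is an assumption for illustration; quantized builds would scale both figures down by the same factor. The helper name weights_gib is hypothetical.

```python
GIB = 1024**3

def weights_gib(params: float, bytes_per_param: float) -> float:
    """Memory needed to hold `params` weights, in GiB."""
    return params * bytes_per_param / GIB

TOTAL_PARAMS = 35e9     # every expert must be resident in memory
ACTIVE_PARAMS = 3e9     # roughly what one token touches (compute, not residency)
BYTES_PER_PARAM = 2.0   # assumption: bf16 weights

resident = weights_gib(TOTAL_PARAMS, BYTES_PER_PARAM)
wrong = weights_gib(ACTIVE_PARAMS, BYTES_PER_PARAM)

print(f"sized off total params:  {resident:.2f} GiB")
print(f"sized off active params: {wrong:.2f} GiB (under-estimate)")
```

Active parameters are the right input for throughput estimates; total parameters are the right input for the memory fit.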

5. MLA caches one shared latent — not per-head K and V

GLM-5.2, GLM-4.7-Flash and the DeepSeek family use Multi-head Latent Attention (MLA): K and V are compressed into a single low-rank latent that is shared by all heads and cached once. That latent is kv_lora_rank 512 + 64 RoPE dims = 576 elements per token per layer (verified against the DeepSeek-V2 paper and official inference code). A per-head "2 × heads × head_dim" formula reports ~17.8× more KV than GLM-4.7-Flash actually uses (6.61 GiB real at 128K bf16).
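A sketch of the two formulas side by side. The 576-element latent comes from the text above. The layer count (47) and the per-head shape (20 heads × head_dim 256) are assumptions chosen because they reproduce the quoted 6.61 GiB and ~17.8× figures; check them against the model's config.json before relying on them. The function names are illustrative.

```python
GIB = 1024**3

def mla_kv_bytes(layers: int, context_len: int, kv_lora_rank: int = 512,
                 rope_dim: int = 64, bytes_per_elem: int = 2) -> int:
    """MLA caches one shared latent per token per layer, not per-head K and V."""
    return layers * (kv_lora_rank + rope_dim) * bytes_per_elem * context_len

def per_head_kv_bytes(layers: int, context_len: int, heads: int,
                      head_dim: int, bytes_per_elem: int = 2) -> int:
    """The naive per-head formula: separate K and V for every head."""
    return layers * 2 * heads * head_dim * bytes_per_elem * context_len

LAYERS = 47                 # assumption, back-derived from the 6.61 GiB figure
HEADS, HEAD_DIM = 20, 256   # assumption, back-derived from the ~17.8x figure
CONTEXT = 131_072           # 128K tokens, bf16 (2 bytes per element)

real = mla_kv_bytes(LAYERS, CONTEXT)
naive = per_head_kv_bytes(LAYERS, CONTEXT, HEADS, HEAD_DIM)

print(f"MLA latent cache: {real / GIB:.2f} GiB")
print(f"per-head formula: {naive / GIB:.2f} GiB ({naive / real:.1f}x)")
```

The ratio does not depend on context length, layer count or precision: it is 2 × heads × head_dim divided by 576.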

The receipts — reproduce the anchor by hand

Gemma 4 31B, from its official config.json: 60 layers = 10 global (4 KV-heads × head_dim 512) + 50 sliding (16 KV-heads × head_dim 256, window 1024). At 262,144-token context, bf16 (2 bytes/element):

global: 10 layers × 2 (K,V) × 4 heads × 512 dim × 2 B × 262,144 tokens = 21,474,836,480 B
local:  50 layers × 2 (K,V) × 16 heads × 256 dim × 2 B × 1,024 (window) =    838,860,800 B
total = 22,313,697,280 B ÷ 1024³ = 20.78 GiB
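The same arithmetic as a short script, for anyone who prefers to run it. It uses only the layer counts, head shapes, window and context length quoted above; the helper name kv_bytes is illustrative and is not FitLLM's API.

```python
GIB = 1024**3

def kv_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
             bytes_per_elem: int = 2) -> int:
    """KV-cache bytes for a group of identical layers (factor 2 = K and V)."""
    return layers * 2 * kv_heads * head_dim * bytes_per_elem * tokens

CONTEXT = 262_144
WINDOW = 1_024

# Global layers grow with the full context.
global_bytes = kv_bytes(layers=10, kv_heads=4, head_dim=512, tokens=CONTEXT)
# Sliding layers are capped at the window size.
local_bytes = kv_bytes(layers=50, kv_heads=16, head_dim=256,
                       tokens=min(CONTEXT, WINDOW))
total_bytes = global_bytes + local_bytes

print(f"global: {global_bytes:,} B")
print(f"local:  {local_bytes:,} B")
print(f"total:  {total_bytes:,} B = {total_bytes / GIB:.2f} GiB")
```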

FitLLM computes exactly 20.78 GiB for this case — the arithmetic above, nothing else. Our unit tests pin these bytes.
A naive "all layers × full context" formula reports ~11.1× more KV than reality for the same model at 131K context and 8-bit precision.
GLM-4.7-Flash (MLA): real KV at 128K = 6.61 GiB; per-head formulas report ~17.8× more.
Get the head shape and MoE wrong and a 27B dense model can appear to need more memory than a 35B MoE — backwards.
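The ~11.1× Gemma 4 figure can be checked the same way. The sketch below assumes the naive formula applies the sliding-layer head shape (16 KV heads × head_dim 256) to all 60 layers at full context; that reading of "naive" is an assumption, chosen because it reproduces the quoted ratio at 131K context with 1 byte per element.

```python
def kv_bytes(layers: int, kv_heads: int, head_dim: int, tokens: int,
             bytes_per_elem: int) -> int:
    """KV-cache bytes for a group of identical layers (factor 2 = K and V)."""
    return layers * 2 * kv_heads * head_dim * bytes_per_elem * tokens

CONTEXT = 131_072   # 131K tokens
WINDOW = 1_024
BYTES = 1           # 8-bit KV cache

# Per-layer-type accounting: 10 global layers plus 50 window-capped layers.
real = (kv_bytes(10, 4, 512, CONTEXT, BYTES)
        + kv_bytes(50, 16, 256, min(CONTEXT, WINDOW), BYTES))

# Assumed naive formula: one head shape, every layer at full context.
naive = kv_bytes(60, 16, 256, CONTEXT, BYTES)

print(f"real {real:,} B, naive {naive:,} B, ratio {naive / real:.1f}x")
```

Under the same assumption the ratio at the 262,144-token bf16 anchor comes out slightly higher, because the capped sliding layers make up an even smaller share of the real total.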

Audit it yourself

The whole engine is one readable MIT-licensed file: fitllm-engine. Every number on this site is computed from official model config.json values, with no guessing. See the per-model fit pages or run the FitLLM calculator.

These figures are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver and display.