Computed with the open FitLLM engine, which models the KV cache layer by layer instead of using a naive estimate. Updated 2026-06.
The real architecture of current open models, read from each model's official Hugging Face config.json. These are the values that actually determine memory, and the ones generic VRAM calculators get wrong. Reproduce any number with the open-source FitLLM engine (MIT).
| Model | Params | Layers | Attention | KV heads × head_dim | Sliding window (tokens) | Max context (tokens) | Full-context KV (bf16) |
|---|---|---|---|---|---|---|---|
| Qwen 3.6 27B | 27.2B | 64 | Hybrid — 16/64 full-attn | 4×256 | — | 262K | 16.00 GiB |
| Qwen 3.6 35B-A3B | 35B (3B active) | 40 | Hybrid — 10/40 full-attn | 2×256 | — | 262K | 5.00 GiB |
| Gemma 4 e2b | 5.1B (2.3B active) | 35 | Sliding-window 5:1 + global | 1×256 | 512 | 131K | 0.76 GiB |
| Gemma 4 e4b | 8B (4.5B active) | 42 | Sliding-window 5:1 + global | 2×256 | 512 | 131K | 1.78 GiB |
| Gemma 4 12b | 11.95B | 48 | Sliding-window 5:1 + global | 8×256 · global 1×512 | 1024 | 262K | 4.31 GiB |
| Gemma 4 26b A4B | 25.5B (4B active) | 30 | Sliding-window 5:1 + global | 8×256 · global 2×512 | 1024 | 262K | 5.20 GiB |
| Gemma 4 31b | 30.7B | 60 | Sliding-window 5:1 + global | 16×256 · global 4×512 | 1024 | 262K | 20.78 GiB |
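As a check on one row, the Qwen 3.6 27B figure can be reconstructed by hand. Only its 16 full-attention layers ($L_{\text{full}}$) hold a per-token KV cache; each stores a K and a V tensor (factor 2) with $n_{\text{kv}} = 4$ heads of $d_{\text{head}} = 256$, at $b = 2$ bytes per bf16 element, over $T = 262{,}144$ tokens (the "262K" maximum context). This is our own worked arithmetic, not output copied from the engine.

```latex
\text{KV}_{\text{full}} = L_{\text{full}} \cdot 2 \cdot n_{\text{kv}} \cdot d_{\text{head}} \cdot b \cdot T
  = 16 \cdot 2 \cdot 4 \cdot 256 \cdot 2\,\text{B} \cdot 262{,}144
  = 2^{34}\,\text{B} = 16.00\ \text{GiB}
```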
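Full-context KV cache = KV memory at the model's maximum context in bf16, computed per layer with the real head shape: sliding-window layers are capped at the window, the linear-attention layers of hybrid models are excluded (their state does not grow with context), and global layers use their own KV-head count and head_dim. This is why Gemma 4 31B's full-context KV is 20.78 GiB (10 global layers at 262,144 tokens account for 20 GiB, and 50 sliding layers capped at 1,024 tokens add about 0.78 GiB), not the ~240 GiB that a naive "all layers × full context" formula implies (60 layers × 16 KV heads × 256 dim × 2 for K and V × 2 bytes × 262,144 tokens).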
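The per-layer rule above can be sketched in a few lines of Python. This is a minimal illustration, not the fitllm-engine source: `LayerGroup`, `kv_cache_gib` and `MODELS` are hypothetical names, and the sliding/global split per Gemma model is our assumption. For Gemma 4 e2b, 29 sliding plus 6 global layers is the split that reproduces the table's 0.76 GiB, and the e2b/e4b global layers are assumed to share the single shape the table lists. Each model is described as groups of layers sharing one KV shape; linear-attention layers are simply left out.

```python
from dataclasses import dataclass
from typing import Optional

BF16_BYTES = 2        # bytes per element in bf16
GIB = 2**30           # bytes per GiB
CTX_131K = 131_072    # "131K" max context in the table
CTX_262K = 262_144    # "262K" max context in the table


@dataclass
class LayerGroup:
    """Hypothetical description of layers that share one KV-cache shape."""
    count: int                    # number of layers in the group
    kv_heads: int                 # KV heads per layer
    head_dim: int                 # dimension per head
    window: Optional[int] = None  # sliding window in tokens; None = full attention


def kv_cache_gib(groups, context):
    """Sum the bf16 K and V cache over all layer groups at `context` tokens."""
    total_bytes = 0
    for g in groups:
        # Sliding-window layers never cache more than `window` tokens.
        tokens = context if g.window is None else min(g.window, context)
        # Factor 2: each layer stores one K and one V tensor.
        total_bytes += g.count * 2 * g.kv_heads * g.head_dim * BF16_BYTES * tokens
    return total_bytes / GIB


# Layer splits are assumptions chosen to match the table. Linear-attention
# layers of the hybrid Qwen models are omitted: they keep no per-token KV cache.
# e2b/e4b global layers are assumed to share the listed 1x256 / 2x256 shape.
MODELS = {
    "Qwen 3.6 27B": ([LayerGroup(16, 4, 256)], CTX_262K),
    "Qwen 3.6 35B-A3B": ([LayerGroup(10, 2, 256)], CTX_262K),
    "Gemma 4 e2b": ([LayerGroup(29, 1, 256, 512), LayerGroup(6, 1, 256)], CTX_131K),
    "Gemma 4 e4b": ([LayerGroup(35, 2, 256, 512), LayerGroup(7, 2, 256)], CTX_131K),
    "Gemma 4 12b": ([LayerGroup(40, 8, 256, 1024), LayerGroup(8, 1, 512)], CTX_262K),
    "Gemma 4 26b A4B": ([LayerGroup(25, 8, 256, 1024), LayerGroup(5, 2, 512)], CTX_262K),
    "Gemma 4 31b": ([LayerGroup(50, 16, 256, 1024), LayerGroup(10, 4, 512)], CTX_262K),
}

# The naive formula: all 60 layers at full context with the 16x256 sliding shape.
NAIVE_31B = [LayerGroup(60, 16, 256)]

for name, (groups, ctx) in MODELS.items():
    print(f"{name}: {kv_cache_gib(groups, ctx):.2f} GiB")
print(f"Naive Gemma 4 31b: {kv_cache_gib(NAIVE_31B, CTX_262K):.2f} GiB")
```

Run as-is, it prints 16.00, 5.00, 0.76, 1.78, 4.31, 5.20 and 20.78 GiB, matching the table, followed by 240.00 GiB for the naive case. Grouping layers by shape keeps the model descriptions short; a real implementation would derive the groups from each config.json rather than hard-coding them.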
See the full explanation in "Why naive VRAM calculators are wrong", or check any model on specific hardware in the fit pages.
All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates: real usage varies with runtime (llama.cpp, MLX, Ollama), driver and display. Found a mismatch? Report it.