FitLLM

LLM memory & architecture reference

Real per-model architecture from official config.json — the numbers that decide memory

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

The real architecture of current open models, read from their official Hugging Face config.json — these are the values that actually determine memory, and the ones generic VRAM calculators get wrong. Reproduce any number with the open-source FitLLM engine (MIT).
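As a rough illustration of which config.json fields matter, the sketch below pulls the memory-relevant values out of a parsed config. The key names (`num_hidden_layers`, `num_key_value_heads`, `head_dim`, `sliding_window`, `max_position_embeddings`, `layer_types`) follow common Hugging Face conventions, but assuming that a given model exposes all of them is exactly that, an assumption. The sample config is made up for the example and is not taken from any model in the table; `memory_fields` is a hypothetical helper, not part of the FitLLM engine.

```python
import json

# Hypothetical config fragment for illustration only; not a real model's config.json.
SAMPLE_CONFIG = json.loads("""
{
  "num_hidden_layers": 6,
  "num_attention_heads": 8,
  "num_key_value_heads": 4,
  "hidden_size": 1024,
  "head_dim": 128,
  "sliding_window": 1024,
  "max_position_embeddings": 32768,
  "layer_types": ["sliding_attention", "sliding_attention", "sliding_attention",
                  "sliding_attention", "sliding_attention", "full_attention"]
}
""")


def memory_fields(config: dict) -> dict:
    """Collect the fields that drive KV-cache size from a parsed config dict."""
    n_layers = config["num_hidden_layers"]
    # Assumption: when no per-layer list is present, treat every layer as full attention.
    layer_types = config.get("layer_types") or ["full_attention"] * n_layers
    # Assumption: fall back to hidden_size / num_attention_heads if head_dim is absent.
    head_dim = config.get("head_dim") or (
        config["hidden_size"] // config["num_attention_heads"]
    )
    return {
        "layers": n_layers,
        "kv_heads": config["num_key_value_heads"],
        "head_dim": head_dim,
        "sliding_window": config.get("sliding_window"),
        "max_ctx": config["max_position_embeddings"],
        "full_attn_layers": sum(t == "full_attention" for t in layer_types),
        "sliding_layers": sum(t == "sliding_attention" for t in layer_types),
    }


fields = memory_fields(SAMPLE_CONFIG)
print(fields)
```

Counting layer types explicitly, rather than deriving them from a ratio, avoids off-by-one errors when the layer count is not a multiple of the pattern length.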

Per-model architecture & KV cache

| Model | Params | Layers | Attention | KV heads × dim | Sliding window | Max ctx | Full-ctx KV (bf16) |
|---|---|---|---|---|---|---|---|
| Qwen 3.6 27B | 27.2B | 64 | Hybrid, 16/64 full-attn | 4×256 | - | 262K | 16.00 GiB |
| Qwen 3.6 35B-A3B | 35B (3B active) | 40 | Hybrid, 10/40 full-attn | 2×256 | - | 262K | 5.00 GiB |
| Gemma 4 e2b | 5.1B (2.3B active) | 35 | Sliding-window 5:1 + global | 1×256 | 512 | 131K | 0.76 GiB |
| Gemma 4 e4b | 8B (4.5B active) | 42 | Sliding-window 5:1 + global | 2×256 | 512 | 131K | 1.78 GiB |
| Gemma 4 12b | 11.95B | 48 | Sliding-window 5:1 + global | 8×256 · global 1×512 | 1024 | 262K | 4.31 GiB |
| Gemma 4 26b A4B | 25.5B (4B active) | 30 | Sliding-window 5:1 + global | 8×256 · global 2×512 | 1024 | 262K | 5.20 GiB |
| Gemma 4 31b | 30.7B | 60 | Sliding-window 5:1 + global | 16×256 · global 4×512 | 1024 | 262K | 20.78 GiB |

Full-context KV cache is the KV memory at the model's maximum context in bf16, computed per layer with the real head shape: sliding-window layers are capped at the window, the non-full-attention (linear) layers of hybrid models are excluded, and global layers use their own head_dim. This is why Gemma 4 31b's full-context KV is 20.78 GiB, not the roughly 240 GiB that a naive "all layers × full context" formula implies (60 layers × 2 for K and V × 16 heads × 256 dims × 2 bytes × 262,144 tokens).
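The per-layer rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the FitLLM engine's actual code: `LayerGroup` and `kv_cache_bytes` are hypothetical names, 262K is taken to mean 262,144 tokens, and the 50 sliding / 10 global split for Gemma 4 31b is inferred from the 5:1 ratio in the table rather than read from a config file. Linear-attention layers are simply left out of the group list, since they hold no per-token KV cache.

```python
from dataclasses import dataclass
from typing import List, Optional

GIB = 1024 ** 3
BF16_BYTES = 2  # bytes per cached value in bf16


@dataclass
class LayerGroup:
    """A set of layers sharing one KV shape (hypothetical helper type)."""
    n_layers: int
    kv_heads: int
    head_dim: int
    window: Optional[int] = None  # None means full (global) attention


def kv_cache_bytes(groups: List[LayerGroup], context_tokens: int,
                   bytes_per_value: int = BF16_BYTES) -> int:
    """Sum KV memory per group; sliding-window layers cache at most `window` tokens."""
    total = 0
    for g in groups:
        cached = context_tokens if g.window is None else min(context_tokens, g.window)
        # Factor 2 accounts for storing both keys and values.
        total += g.n_layers * 2 * g.kv_heads * g.head_dim * bytes_per_value * cached
    return total


CTX = 262_144  # assumption: "262K" in the table means 2**18 tokens

# Assumed split: 60 layers at 5:1 gives 50 sliding (16x256, window 1024) + 10 global (4x512).
gemma4_31b = [LayerGroup(50, 16, 256, window=1024), LayerGroup(10, 4, 512)]
# Naive model: all 60 layers cache the full context with the sliding-layer head shape.
gemma4_31b_naive = [LayerGroup(60, 16, 256)]
# Qwen 3.6 27B: only the 16 full-attention layers contribute.
qwen36_27b = [LayerGroup(16, 4, 256)]

gemma_gib = kv_cache_bytes(gemma4_31b, CTX) / GIB
naive_gib = kv_cache_bytes(gemma4_31b_naive, CTX) / GIB
qwen_gib = kv_cache_bytes(qwen36_27b, CTX) / GIB

print(f"Gemma 4 31b per-layer: {gemma_gib:.2f} GiB")  # 20.78
print(f"Gemma 4 31b naive:     {naive_gib:.2f} GiB")  # 240.00
print(f"Qwen 3.6 27B:          {qwen_gib:.2f} GiB")   # 16.00
```

Under these assumptions the sketch reproduces the table's 20.78 GiB and 16.00 GiB entries; nearly all of the gap to the naive figure comes from capping the 50 sliding-window layers at 1,024 tokens.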

Why these fields decide memory

See the full explanation in why naive VRAM calculators are wrong, or check any model on hardware in the fit pages.


All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates: real usage varies with runtime (llama.cpp / MLX / Ollama), driver and display.