LLM memory & architecture reference

Real per-model architecture from official config.json — the numbers that decide memory

Computed with the open FitLLM engine, which models the KV cache layer by layer instead of applying a naive estimate. Updated 2026-07-16.

The table below lists the architecture of current open models as read from each model's official Hugging Face config.json. These are the values that determine memory, and the ones generic VRAM calculators get wrong. Any number can be reproduced with the open-source FitLLM engine (MIT).
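
As a rough illustration of which config.json fields feed the table, the sketch below parses a small excerpt and pulls out the KV-relevant values. The key names follow the Llama-family convention in Hugging Face configs and the values match the Llama-3.1-8B-Instruct row; other model families may use different keys, and `read_kv_shape` is a hypothetical helper, not part of the FitLLM engine.

```python
import json

# Excerpt of memory-relevant fields in the style of a Llama-family config.json.
# Values correspond to the Llama-3.1-8B-Instruct row of the table.
CONFIG_TEXT = """
{
  "num_hidden_layers": 32,
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 8,
  "max_position_embeddings": 131072
}
"""


def read_kv_shape(config: dict) -> dict:
    """Extract the fields that decide KV-cache size (hypothetical helper)."""
    n_heads = config["num_attention_heads"]
    # Some configs state head_dim explicitly; otherwise derive it from
    # hidden_size / num_attention_heads.
    head_dim = config.get("head_dim") or config["hidden_size"] // n_heads
    return {
        "layers": config["num_hidden_layers"],
        # Without a KV-head count, assume plain multi-head attention.
        "kv_heads": config.get("num_key_value_heads", n_heads),
        "head_dim": head_dim,
        "max_ctx": config["max_position_embeddings"],
    }


shape = read_kv_shape(json.loads(CONFIG_TEXT))
print(shape)
```

Note that the KV head count (8) differs from the attention head count (32); using the latter is one way a generic calculator can overshoot.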

Per-model architecture & KV cache

Model	Params	Layers	Attention	KV heads × head_dim	Sliding window (tokens)	Max context (tokens)	Full-context KV (bf16)
GLM-4.7-Flash	30B (3B active)	47	Full	20×256	—	203K	10.22 GiB
GLM-5.2	753B (40B active)	78	Full	64×256	—	1M	87.75 GiB
gpt-oss-20b	21B (3.6B active)	24	Sliding-window + global	8×64	128	131K	3.00 GiB
gpt-oss-120b	117B (5.1B active)	36	Sliding-window + global	8×64	128	131K	4.50 GiB
Qwen 3.6 27B	27.2B	64	Hybrid — 16/64 full-attn	4×256	—	262K	16.00 GiB
Qwen 3.6 35B-A3B	35B (3B active)	40	Hybrid — 10/40 full-attn	2×256	—	262K	5.00 GiB
Qwen-AgentWorld-35B-A3B	34.7B (3B active)	40	Hybrid — 10/40 full-attn	2×256	—	262K	5.00 GiB
Gemma 4 e2b	5.1B (2.3B active)	35	Sliding-window 4:1 + global	1×256	512	131K	0.89 GiB
Gemma 4 e4b	8B (4.5B active)	42	Sliding-window 5:1 + global	2×256	512	131K	1.78 GiB
Gemma 4 12b	11.95B	48	Sliding-window 5:1 + global	8×256 · global 1×512	1024	262K	4.31 GiB
Gemma 4 26b A4B	25.5B (4B active)	30	Sliding-window 5:1 + global	8×256 · global 2×512	1024	262K	5.20 GiB
Gemma 4 31b	30.7B	60	Sliding-window 5:1 + global	16×256 · global 4×512	1024	262K	20.78 GiB
Llama-3.2-3B-Instruct	3.2B	28	Full	8×128	—	131K	14.00 GiB
Llama-3.1-8B-Instruct	8B	32	Full	8×128	—	131K	16.00 GiB
MiniCPM5-1B	1.081B	24	Full	2×128	—	131K	3.00 GiB
Qwen3-0.6B	0.596B	28	Full	8×128	—	41K	4.38 GiB
Qwen3-1.7B	1.721B	28	Full	8×128	—	41K	4.38 GiB
Llama-3.2-1B-Instruct	1.236B	16	Full	8×64	—	131K	4.00 GiB
Gemma-3-1B-it	1B	26	Sliding-window + global	1×256	512	33K	0.14 GiB
Hy3	298.8B (21B active)	80	Full	8×128	—	262K	80.00 GiB

Full-context KV cache is the KV memory at the model's maximum context in bf16 (2 bytes per value), computed layer by layer with the real head shape: sliding-window layers are capped at the window, the non-full-attention (linear) layers of hybrid models are excluded, and global layers use their own head_dim. This is why Gemma 4 31B's full-context KV is 20.78 GiB, not the roughly 240 GiB that a naive "all layers × full context" formula implies.
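
The per-layer rule described above can be sketched in a few lines. This is a minimal illustration, not the FitLLM engine's actual code: `LayerGroup` and `kv_cache_bytes` are hypothetical names, the 50 sliding / 10 global split for Gemma 4 31b is inferred from the 5:1 ratio over 60 layers, and "262K" and "131K" are taken to mean 262,144 and 131,072 tokens.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_BF16 = 2
GIB = 1024 ** 3


@dataclass
class LayerGroup:
    """A set of layers sharing one KV head shape (hypothetical type)."""
    count: int
    kv_heads: int
    head_dim: int
    window: Optional[int] = None  # None means global / full attention


def kv_cache_bytes(groups, context_tokens, bytes_per_value=BYTES_BF16):
    """Sum K and V storage over layer groups, capping sliding layers."""
    total = 0
    for g in groups:
        # A sliding-window layer never caches more tokens than its window.
        cached = context_tokens if g.window is None else min(g.window, context_tokens)
        # Factor 2 accounts for storing both keys and values.
        total += 2 * g.count * g.kv_heads * g.head_dim * cached * bytes_per_value
    return total


# Gemma 4 31b row: 60 layers at 5:1, assumed to be 50 sliding + 10 global.
gemma4_31b = [
    LayerGroup(count=50, kv_heads=16, head_dim=256, window=1024),
    LayerGroup(count=10, kv_heads=4, head_dim=512),
]
per_layer_gib = kv_cache_bytes(gemma4_31b, 262_144) / GIB

# Naive formula: every layer treated as full attention with the sliding shape.
naive_gib = kv_cache_bytes([LayerGroup(60, 16, 256)], 262_144) / GIB

# Llama-3.1-8B-Instruct row: all 32 layers use full attention.
llama_gib = kv_cache_bytes([LayerGroup(32, 8, 128)], 131_072) / GIB

# Qwen 3.6 27B row: only the 16 full-attention layers are counted.
qwen_gib = kv_cache_bytes([LayerGroup(16, 4, 256)], 262_144) / GIB

print(f"{per_layer_gib:.2f} {naive_gib:.2f} {llama_gib:.2f} {qwen_gib:.2f}")
# prints: 20.78 240.00 16.00 16.00
```

The sketch covers only layers that store plain per-head keys and values; it is not meant to reproduce every row of the table.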

Why these fields decide memory

Attention type: sliding-window layers stop growing KV past the window, and the linear layers of hybrid models keep no per-token KV. Counting every layer at full context over-estimates the cache several times over (more than 11× for Gemma 4 31B).
KV heads × head_dim: KV cache scales with the number of grouped-query KV heads and head_dim, not with the full hidden size. Global and sliding layers can differ (Gemma 4: head_dim 512 for global layers, 256 for sliding).
Params (total vs active): for Mixture-of-Experts models, all parameters sit in memory even though only the active subset computes for each token.
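
Two of these points reduce to simple arithmetic, sketched below under stated assumptions. The hidden size of 4096 comes from the published Llama-3.1-8B config rather than from the table, the function names are hypothetical, and the 2 bytes per parameter (bf16) used for the weights is purely illustrative, since deployed checkpoints are often quantized.

```python
GIB = 1024 ** 3


def kv_bytes_per_token_per_layer(width, bytes_per_value=2):
    """K and V each store `width` values per token in one layer."""
    return 2 * width * bytes_per_value


# Llama-3.1-8B-Instruct: 8 KV heads × 128 head_dim versus hidden size 4096.
gqa_bytes = kv_bytes_per_token_per_layer(8 * 128)
hidden_bytes = kv_bytes_per_token_per_layer(4096)
overestimate = hidden_bytes / gqa_bytes  # sizing by hidden size overshoots 4x


def weight_gib(params_billion, bytes_per_param):
    """Approximate weight memory; bytes_per_param is an assumption."""
    return params_billion * 1e9 * bytes_per_param / GIB


# gpt-oss-120b row: 117B total parameters, 5.1B active per token.
resident_gib = weight_gib(117, bytes_per_param=2)  # what must fit in memory
active_gib = weight_gib(5.1, bytes_per_param=2)    # what computes per token

print(overestimate)
print(f"resident {resident_gib:.1f} GiB vs active {active_gib:.1f} GiB")
```

Sizing a Mixture-of-Experts model by its active parameters would understate the resident weights by the ratio of total to active parameters, roughly 23× for this row.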

See the full explanation in why naive VRAM calculators are wrong, or check any model on hardware in the fit pages.

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates: real usage varies with the runtime (llama.cpp, MLX, Ollama), driver, and display. Found a mismatch? Report it.