Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.
| Component | Memory |
|---|---|
| Model weights | 6.8 GB |
| KV cache | 0.8 GB |
| Runtime overhead + reserve | 3.9 GB |
| Total used | 11.5 / 16 GB |
| Free | 4.5 GB |
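
The breakdown is a plain sum, which a few lines of Python can check. This is only a sketch of the arithmetic using the figures from the table; the names `components_gb` and `budget` are illustrative and are not taken from fitllm-engine.

```python
# Figures copied from the breakdown table (GB, as reported there).
VRAM_GB = 16.0
components_gb = {
    "model_weights": 6.8,
    "kv_cache": 0.8,
    "runtime_overhead_and_reserve": 3.9,
}


def budget(components, vram_gb):
    """Return (total_used, free) in GB, rounded to one decimal like the table."""
    used = sum(components.values())
    return round(used, 1), round(vram_gb - used, 1)


used_gb, free_gb = budget(components_gb, VRAM_GB)
print(f"Total used: {used_gb} / {VRAM_GB:g} GB, free: {free_gb} GB")
```

The total agrees with the "Used @32K" figure listed for Q4_K_M in the quantization table.
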
Max context that fits at Q4_K_M: ~110K tokens · with Q8 KV cache → ~139K tokens.
| Weight quant | Weights | Fits (KV F16) | Used @32K |
|---|---|---|---|
| Q4_K_M | 6.8 GB | ✅ up to 110K ctx | 11.5 / 16.0 GB |
| Q5_K_M | 7.9 GB | ✅ up to 83K ctx | 12.8 / 16.0 GB |
| Q6_K | 9.1 GB | ✅ up to 55K ctx | 14.1 / 16.0 GB |
| Q8_0 | 11.8 GB | ❌ won't fit | 17.2 / 16.0 GB |
| FP16 | 22.3 GB | ❌ won't fit | 28.8 / 16.0 GB |
KV cache is F16 here (llama.cpp default). Drop it to Q8/Q4 (-ctk/-ctv) for more context.
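
To get a rough sense of what a quantized KV cache saves, the sketch below scales the F16 figure by bytes per stored element. It assumes the ggml block layouts for `q8_0` and `q4_0` (32 values plus one fp16 scale per block, so 34 and 18 bytes); the helper name `scaled_kv_gb` is hypothetical. It covers KV-cache bytes only and does not by itself reproduce the engine's max-context figures, which also depend on how the rest of the budget is modeled.

```python
# Bytes needed to store one block of 32 cache values.
# f16:  32 values * 2 bytes                = 64
# q8_0: 32 int8 values + one fp16 scale    = 34  (assumed ggml layout)
# q4_0: 32 4-bit values + one fp16 scale   = 18  (assumed ggml layout)
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q4_0": 18}


def scaled_kv_gb(kv_f16_gb, cache_type):
    """Scale an F16 KV-cache size to another cache type."""
    return kv_f16_gb * BLOCK_BYTES[cache_type] / BLOCK_BYTES["f16"]


KV_F16_GB = 0.8  # KV cache reported in the breakdown table
for cache_type in ("f16", "q8_0", "q4_0"):
    print(f"{cache_type}: {scaled_kv_gb(KV_F16_GB, cache_type):.3f} GB")
```
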
Gemma 4 12b interleaves sliding-window (local) and global attention layers at a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula overestimates the KV cache several times over.
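
The difference between the two approaches can be sketched as follows. This is not the fitllm-engine code. The layer count, 5:1 ratio, 1024-token window and head_dim values come from the text above; the KV head counts (4 for both layer types) and the exact layer ordering (five local layers, then one global) are placeholders assumed for illustration, so the absolute sizes will not match the table and only the structure of the calculation is meaningful.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerKV:
    n_kv_heads: int        # KV heads in this layer (placeholder values below)
    head_dim: int          # per-head dimension
    window: Optional[int]  # sliding-window size, or None for global attention


def build_layers(n_layers: int, local_per_global: int,
                 local: LayerKV, global_: LayerKV) -> List[LayerKV]:
    """Assumed pattern: N local layers followed by one global layer, repeated."""
    period = local_per_global + 1
    return [global_ if (i + 1) % period == 0 else local for i in range(n_layers)]


def kv_bytes(layers: List[LayerKV], n_ctx: int, bytes_per_elem: int = 2) -> int:
    """Per-layer KV size: local layers cache at most `window` tokens."""
    total = 0
    for layer in layers:
        cached = n_ctx if layer.window is None else min(n_ctx, layer.window)
        # Factor 2 accounts for storing both K and V.
        total += 2 * cached * layer.n_kv_heads * layer.head_dim * bytes_per_elem
    return total


def naive_kv_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Naive formula: every layer caches the full context with one head shape."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


# Placeholder head counts; read the real ones from the model's config.json.
LOCAL = LayerKV(n_kv_heads=4, head_dim=256, window=1024)
GLOBAL = LayerKV(n_kv_heads=4, head_dim=512, window=None)

layers = build_layers(48, 5, LOCAL, GLOBAL)
n_ctx = 32 * 1024
per_layer = kv_bytes(layers, n_ctx)
naive = naive_kv_bytes(48, n_ctx, LOCAL.n_kv_heads, LOCAL.head_dim)
print(f"per-layer: {per_layer / 1024**3:.2f} GiB, naive: {naive / 1024**3:.2f} GiB")
```

Because only the global layers grow with context length, the gap between the two formulas widens as the context gets longer.
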
Same GPU · Models that fit on the RTX 5080: Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b.
Same model · GPUs that run Gemma 4 12b: RTX 5090 (32GB), RTX 5080 (16GB), RTX 5070 Ti (16GB), RTX 5070 (12GB), RTX 4090 (24GB), RTX 4080 SUPER (16GB), RTX 4070 Ti SUPER (16GB), RTX 4070 (12GB), RTX 4060 Ti 16GB (16GB), RTX 3090 (24GB), RTX 3090 Ti (24GB), RTX 3080 12GB (12GB), RTX 3060 12GB (12GB).
Gemma 4 12b has 11.95B parameters and 48 layers. The RTX 5080 has 16GB of VRAM and 960GB/s of memory bandwidth. Same math, open source: fitllm-engine. GGUF bits-per-weight (bpw) values are taken from llama.cpp.
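
As a rough cross-check of the weights column, weight size is approximately parameters × bpw / 8. The sketch below assumes the approximate bpw values published in llama.cpp's quantization documentation (these vary slightly by model) and assumes the table's "GB" means binary gigabytes (GiB); neither assumption is confirmed by this page, and `weights_gib` is an illustrative name.

```python
GIB = 1024 ** 3
N_PARAMS = 11.95e9  # parameter count stated for Gemma 4 12b

# Approximate bits per weight per GGUF quant type (assumed values).
BPW = {
    "Q4_K_M": 4.89,
    "Q5_K_M": 5.70,
    "Q6_K": 6.56,
    "Q8_0": 8.50,
    "FP16": 16.00,
}


def weights_gib(n_params, bpw):
    """Estimated weight size in GiB: parameters * bits per weight / 8 bits per byte."""
    return n_params * bpw / 8 / GIB


for quant, bpw in BPW.items():
    print(f"{quant}: {weights_gib(N_PARAMS, bpw):.1f} GiB")
```
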
All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.