Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.
| Component | VRAM (Q4_K_M, 32K context) |
|---|---|
| Model weights | 17.5 GB |
| KV cache | 3.3 GB |
| Runtime overhead + reserve | 5.6 GB |
| Total used | 26.3 / 32 GB |
| Free | 5.7 GB |
Max context that fits at Q4_K_M: ~67K tokens with the F16 KV cache, or ~113K tokens with a Q8 KV cache.
| Weight quant | Weights | Fits (KV F16) | Used @32K |
|---|---|---|---|
| Q4_K_M | 17.5 GB | ✅ up to 67K ctx | 26.3 / 32.0 GB |
| Q5_K_M | 20.4 GB | ✅ up to 40K ctx | 29.6 / 32.0 GB |
| Q6_K | 23.5 GB | ❌ at 32K (fits up to 10K ctx) | 33.0 / 32.0 GB |
| Q8_0 | 30.4 GB | ❌ won't fit | 40.8 / 32.0 GB |
| FP16 | 57.2 GB | ❌ won't fit | 70.8 / 32.0 GB |
The KV cache is F16 here (the llama.cpp default). Quantizing it to Q8 or Q4 with llama.cpp's -ctk / -ctv flags leaves room for more context.
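As a rough illustration of the flags mentioned above, the sketch below builds a `llama-server` command line with a quantized KV cache. The model filename and context size are hypothetical placeholders, not values from this page, and some llama.cpp builds may require flash attention for a quantized V cache, so check `llama-server --help` for your version.

```python
def llama_server_args(model_path, ctx_tokens, kv_type="q8_0"):
    """Build an argv list for llama-server with a quantized KV cache.

    -m   : path to the GGUF model file
    -c   : context size in tokens
    -ctk : KV cache type for keys   (e.g. f16, q8_0, q4_0)
    -ctv : KV cache type for values (e.g. f16, q8_0, q4_0)
    """
    return [
        "llama-server",
        "-m", model_path,
        "-c", str(ctx_tokens),
        "-ctk", kv_type,
        "-ctv", kv_type,
    ]


# ASSUMPTION: hypothetical filename and context size, for illustration only.
args = llama_server_args("gemma-4-31b-Q4_K_M.gguf", 65536)
print(" ".join(args))
```

The list form can be passed directly to `subprocess.run(args)` without shell quoting concerns.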
Gemma 4 31b interleaves sliding-window (local) and global attention layers at a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256 for local layers). A naive "all layers × full context × one head_dim" formula therefore over-counts the KV cache several times over.
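The difference between the per-layer and naive formulas can be sketched as follows. This is a minimal illustration, not the fitllm-engine code: it takes the 60 layers, 5:1 ratio, 1024-token window and head_dim values from this page, but the KV-head counts (16 local, 4 global) are hypothetical placeholders, and the real values should be read from the model's config.json.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LayerGroup:
    n_layers: int
    n_kv_heads: int          # ASSUMPTION: placeholder, not from the page
    head_dim: int
    window: Optional[int]    # None means global (full-context) attention


def kv_bytes(groups, ctx_tokens, bytes_per_elem=2):
    """KV cache size in bytes; bytes_per_elem=2 corresponds to F16."""
    total = 0
    for g in groups:
        # Sliding-window layers never cache more than `window` tokens.
        cached = ctx_tokens if g.window is None else min(ctx_tokens, g.window)
        # Factor 2 covers the separate K and V tensors.
        total += 2 * g.n_layers * g.n_kv_heads * g.head_dim * cached * bytes_per_elem
    return total


# 60 layers split 5:1 into 50 local and 10 global layers.
local_layers = LayerGroup(n_layers=50, n_kv_heads=16, head_dim=256, window=1024)
global_layers = LayerGroup(n_layers=10, n_kv_heads=4, head_dim=512, window=None)
# Naive model: every layer treated as full-context with one head shape.
naive_layers = LayerGroup(n_layers=60, n_kv_heads=16, head_dim=256, window=None)

ctx = 32 * 1024
per_layer = kv_bytes([local_layers, global_layers], ctx)
naive = kv_bytes([naive_layers], ctx)
print(f"per-layer: {per_layer / 2**30:.2f} GiB")
print(f"naive:     {naive / 2**30:.2f} GiB")
print(f"over-count factor: {naive / per_layer:.1f}x")
```

The over-count factor depends on the assumed head counts and grows with context length, because the local layers' contribution stays fixed once the context exceeds the window.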
Same GPU: models that fit on the RTX 5090: Qwen 3.6 27B, Qwen 3.6 35B-A3B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Gemma 4 31b.
Same model: GPUs that run Gemma 4 31b: RTX 5090 (32GB).
Gemma 4 31b has 30.7B parameters and 60 layers. The RTX 5090 has 32 GB of VRAM and 1792 GB/s of memory bandwidth. Same math, open source: fitllm-engine. GGUF bits-per-weight (bpw) values come from llama.cpp.
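The weight sizes in the table can be approximated from the parameter count and bits per weight. The sketch below is an illustration under stated assumptions, not the engine's code: it uses the nominal bpw of the uniform formats (16 for FP16, 8.5 for Q8_0, 6.5625 for Q6_K) and reads the table's "GB" as GiB (2^30 bytes), which is an inference from the numbers rather than something the page states. Mixed formats such as Q4_K_M and Q5_K_M are left out because their effective bpw depends on the per-tensor mix.

```python
GIB = 2**30  # ASSUMPTION: the table's "GB" is treated as GiB


def weight_gib(n_params, bits_per_weight):
    """Approximate weight size in GiB from parameter count and bpw."""
    return n_params * bits_per_weight / 8 / GIB


PARAMS = 30.7e9  # parameter count stated on this page

# Nominal bits per weight for uniform GGUF formats.
BPW = {"FP16": 16.0, "Q8_0": 8.5, "Q6_K": 6.5625}

for name, bpw in BPW.items():
    print(f"{name}: {weight_gib(PARAMS, bpw):.1f} GiB")
# prints:
# FP16: 57.2 GiB
# Q8_0: 30.4 GiB
# Q6_K: 23.5 GiB
```

Under these assumptions the results line up with the FP16, Q8_0 and Q6_K rows of the table; real GGUF files can differ slightly because some tensors are stored in other types.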
All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), the driver, and display use. Found a mismatch? Report it.