FitLLM

Can I run Gemma 4 12b on an RTX 5080 (16GB)?

✅ Yes, it fits: up to ~110K tokens of context at Q4_K_M

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

Memory breakdown (Q4_K_M, F16 KV, 32K context)

Model weights: 6.8 GB
KV cache: 0.8 GB
Runtime overhead + reserve: 3.9 GB
Total used: 11.5 / 16 GB
Free: 4.5 GB

Max context that fits at Q4_K_M: ~110K tokens · with Q8 KV cache → ~139K tokens.

Every quantization on the RTX 5080

Weight quant | Weights | Fits (KV F16) | Used @32K
Q4_K_M | 6.8 GB | ✅ up to 110K ctx | 11.5 / 16.0 GB
Q5_K_M | 7.9 GB | ✅ up to 83K ctx | 12.8 / 16.0 GB
Q6_K | 9.1 GB | ✅ up to 55K ctx | 14.1 / 16.0 GB
Q8_0 | 11.8 GB | ❌ won't fit | 17.2 / 16.0 GB
FP16 | 22.3 GB | ❌ won't fit | 28.8 / 16.0 GB

The KV cache is F16 here (the llama.cpp default). Quantizing it to Q8 or Q4 (llama.cpp's -ctk/-ctv options) frees room for more context.


Why most VRAM calculators get this wrong

Gemma 4 12b interleaves sliding-window (local) and global attention layers at a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula therefore over-counts the KV cache several times over.
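The difference between the two formulas can be sketched in a few lines of Python. This is an illustration only, not the fitllm-engine code: the function names are hypothetical, the 5:1 ratio is read as five local layers per global layer, head_dim 512 is assumed to belong to the global layers and 256 to the local ones, and the KV head counts are placeholders because this page does not state them.

```python
def kv_bytes_per_layer(tokens, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by one layer's K and V tensors (hence the factor 2)."""
    return 2 * tokens * n_kv_heads * head_dim * bytes_per_elem


def interleaved_kv_bytes(ctx, *, n_kv_heads_local, n_kv_heads_global,
                         n_layers=48, locals_per_global=5, window=1024,
                         local_head_dim=256, global_head_dim=512,
                         bytes_per_elem=2):
    """Per-layer estimate: local layers stop growing at the window size."""
    # Assumption: one global layer per group of (locals_per_global + 1) layers.
    n_global = n_layers // (locals_per_global + 1)
    n_local = n_layers - n_global
    local = n_local * kv_bytes_per_layer(
        min(ctx, window), n_kv_heads_local, local_head_dim, bytes_per_elem)
    glob = n_global * kv_bytes_per_layer(
        ctx, n_kv_heads_global, global_head_dim, bytes_per_elem)
    return local + glob


def naive_kv_bytes(ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Naive estimate: every layer stores the full context with one head_dim."""
    return n_layers * kv_bytes_per_layer(ctx, n_kv_heads, head_dim, bytes_per_elem)


# Placeholder head counts (4 and 4) chosen only to make the example concrete.
ctx = 32768
aware = interleaved_kv_bytes(ctx, n_kv_heads_local=4, n_kv_heads_global=4)
naive = naive_kv_bytes(ctx, n_layers=48, n_kv_heads=4, head_dim=512)
print(f"per-layer: {aware / 2**30:.2f} GiB, naive: {naive / 2**30:.2f} GiB")
```

With real head counts from the model's config.json the absolute numbers will differ, but the shape of the result holds: only the global layers grow with context, so the per-layer total stays well below the naive one.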

Other options

Same GPU. Models that fit on the RTX 5080: Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b.

Same model. GPUs that run Gemma 4 12b: RTX 5090 (32GB), RTX 5080 (16GB), RTX 5070 Ti (16GB), RTX 5070 (12GB), RTX 4090 (24GB), RTX 4080 SUPER (16GB), RTX 4070 Ti SUPER (16GB), RTX 4070 (12GB), RTX 4060 Ti 16GB (16GB), RTX 3090 (24GB), RTX 3090 Ti (24GB), RTX 3080 12GB (12GB), RTX 3060 12GB (12GB).

Reproduce it

Gemma 4 12b has 11.95B parameters and 48 layers. The RTX 5080 has 16 GB of VRAM and 960 GB/s of memory bandwidth. The same math is open source in fitllm-engine. GGUF bits-per-weight (bpw) values come from llama.cpp.
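As a rough cross-check of the weight sizes, the parameter count can be multiplied by the bits per weight. The sketch below is not taken from fitllm-engine. It assumes the sizes on this page are in GiB (1024³ bytes) and uses the nominal bpw of the GGML block formats (Q8_0: 34 bytes per 32 weights; Q6_K: 210 bytes per 256 weights). The `weights_gib` helper is a hypothetical name. Mixed formats such as Q4_K_M and Q5_K_M are left out because their effective bpw depends on which tensors get which type.

```python
GIB = 1024 ** 3

# Nominal bits per weight for single-type formats (assumption: no per-tensor mixing).
BPW = {
    "Q6_K": 210 * 8 / 256,  # 6.5625
    "Q8_0": 34 * 8 / 32,    # 8.5
    "FP16": 16.0,
}


def weights_gib(n_params, bpw):
    """Approximate weight size in GiB for n_params weights at bpw bits each."""
    return n_params * bpw / 8 / GIB


N_PARAMS = 11.95e9  # parameter count quoted on this page

for name, bpw in BPW.items():
    print(f"{name}: {weights_gib(N_PARAMS, bpw):.1f} GiB")
```

Under these assumptions the three results round to 9.1, 11.8 and 22.3, in line with the Weights column above. Real GGUF files can deviate slightly because some tensors are stored in a different type.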

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.