FitLLM

Can I run Gemma 4 26b A4B on an RTX 5080 (16GB)?

❌ No — Gemma 4 26b A4B (Q4_K_M) needs 20.2 GB but the RTX 5080 has 16 GB

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

Memory breakdown (Q4_K_M, F16 KV, 32K context)

Model weights: 14.5 GB
KV cache: 0.8 GB
Runtime overhead + reserve: 4.8 GB
Total used: 20.2 / 16 GB
Short by: 4.2 GB

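Max context that fits at Q4_K_M: none. In the breakdown above, the weights (14.5 GB) plus runtime overhead and reserve (4.8 GB) come to 19.3 GB, which exceeds the 16 GB of VRAM before any KV cache is counted.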
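The fit check itself is plain addition and comparison. The following is a minimal sketch, not the fitllm-engine code; `fit_report` is a hypothetical helper name, and the inputs are the rounded figures from the breakdown. Because those figures are rounded to one decimal, they sum to 20.1 GB rather than the 20.2 GB shown on the page, presumably because the page rounds after summing unrounded values.

```python
def fit_report(weights_gb, kv_cache_gb, overhead_gb, vram_gb):
    """Sum the memory components and compare them against available VRAM.

    Hypothetical helper for illustration; not part of fitllm-engine.
    """
    total = weights_gb + kv_cache_gb + overhead_gb
    return {
        "total_gb": round(total, 1),
        "fits": total <= vram_gb,
        # How much memory is missing; 0.0 when the model fits.
        "short_by_gb": round(max(0.0, total - vram_gb), 1),
    }


# Rounded values from the Q4_K_M breakdown on an RTX 5080 (16 GB).
report = fit_report(weights_gb=14.5, kv_cache_gb=0.8, overhead_gb=4.8, vram_gb=16.0)
print(report)
```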

Every quantization on the RTX 5080

Weight quant   Weights   Fits (KV F16)   Used @32K
Q4_K_M         14.5 GB   ❌ won't fit    20.2 / 16.0 GB
Q5_K_M         16.9 GB   ❌ won't fit    22.9 / 16.0 GB
Q6_K           19.5 GB   ❌ won't fit    25.7 / 16.0 GB
Q8_0           25.2 GB   ❌ won't fit    32.2 / 16.0 GB
FP16           47.5 GB   ❌ won't fit    57.1 / 16.0 GB

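KV cache is F16 here (the llama.cpp default). Quantizing it to Q8 or Q4 (-ctk/-ctv) frees room for more context on GPUs where the model fits. On the RTX 5080 the whole cache is only 0.8 GB, so quantizing it cannot close the 4.2 GB gap.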
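To see roughly how much a quantized KV cache saves, the F16 figure can be scaled by bits per element. This is a rough sketch under stated assumptions: `scaled_kv_gb` is a hypothetical helper, and the bit widths are the ggml block layouts for q8_0 (34 bytes per 32 values, 8.5 bits) and q4_0 (18 bytes per 32 values, 4.5 bits). It ignores any runtime-specific padding.

```python
# Effective bits per cached element for each cache type.
# q8_0 and q4_0 store a 16-bit scale per block of 32 values.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def scaled_kv_gb(kv_f16_gb, cache_type):
    """Scale an F16 KV-cache size to another cache type (hypothetical helper)."""
    return kv_f16_gb * BITS_PER_ELEMENT[cache_type] / BITS_PER_ELEMENT["f16"]


KV_F16_GB = 0.8  # from the memory breakdown on this page
for cache_type in ("f16", "q8_0", "q4_0"):
    size = scaled_kv_gb(KV_F16_GB, cache_type)
    print(f"{cache_type}: {size:.3f} GB (saves {KV_F16_GB - size:.3f} GB)")
```

Even the q4_0 cache saves well under 1 GB here, far less than the 4.2 GB shortfall.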


Why most VRAM calculators get this wrong

Gemma 4 26b A4B interleaves sliding-window (local) and global attention 5:1. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula over-counts KV cache by several times.
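The difference between the two approaches can be sketched as follows. This is an illustrative sketch, not the fitllm-engine implementation. The 30 layers, 5:1 interleave, 1024-token window, and head_dim values (256 local, 512 global) come from this page; the KV head counts (8 local, 2 global) are placeholder assumptions that should be replaced with the values from the model's config.json.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttnLayer:
    n_kv_heads: int
    head_dim: int
    window: Optional[int]  # None means global attention (full context)


def kv_bytes(layers, context, bytes_per_elem=2.0):
    """Per-layer KV size: local layers cache at most `window` tokens."""
    total = 0.0
    for layer in layers:
        tokens = context if layer.window is None else min(context, layer.window)
        # Factor 2 covers the separate K and V tensors.
        total += 2 * tokens * layer.n_kv_heads * layer.head_dim * bytes_per_elem
    return total


def naive_kv_bytes(n_layers, context, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Naive formula: every layer caches the full context with one head shape."""
    return 2 * n_layers * context * n_kv_heads * head_dim * bytes_per_elem


# ASSUMPTION: KV head counts below are placeholders, not official values.
LOCAL = AttnLayer(n_kv_heads=8, head_dim=256, window=1024)
GLOBAL = AttnLayer(n_kv_heads=2, head_dim=512, window=None)

# 5 local layers followed by 1 global layer, repeated to reach 30 layers.
layers = ([LOCAL] * 5 + [GLOBAL]) * 5
CONTEXT = 32768

per_layer = kv_bytes(layers, CONTEXT)
naive = naive_kv_bytes(len(layers), CONTEXT, n_kv_heads=8, head_dim=256)
print(f"per-layer: {per_layer / 2**30:.2f} GiB, naive: {naive / 2**30:.2f} GiB")
```

With these placeholder head counts the naive formula comes out several times larger than the per-layer sum, because it charges the full context to the 25 local layers that only ever hold 1024 tokens.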

What fits on the RTX 5080 instead

Same GPU: models that fit on the RTX 5080 are Gemma 4 e2b, Gemma 4 e4b, and Gemma 4 12b.

Same model: GPUs that run Gemma 4 26b A4B are the RTX 5090 (32GB), RTX 4090 (24GB), RTX 3090 (24GB), and RTX 3090 Ti (24GB).

Reproduce it

Gemma 4 26b A4B has 25.5B parameters (4B active, MoE) across 30 layers. The RTX 5080 has 16 GB of VRAM and 960 GB/s of memory bandwidth. The same math is open source in fitllm-engine. GGUF bits-per-weight (bpw) values are taken from llama.cpp.
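The weight sizes in the quantization table can be approximated as parameters times bits per weight. The sketch below is a minimal check, assuming a uniform bpw across all tensors and that the page's "GB" are binary gigabytes (2^30 bytes). `weights_gib` is a hypothetical helper name. Only the two formats with fixed, well-known widths are used: FP16 at 16 bpw and Q8_0 at 8.5 bpw. The K-quants mix tensor types, so their effective bpw is left out here.

```python
def weights_gib(n_params, bits_per_weight):
    """Approximate weight size in GiB from parameter count and bpw.

    Hypothetical helper; assumes every tensor uses the same bpw.
    """
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 25.5e9  # total parameter count stated on this page

for name, bpw in (("FP16", 16.0), ("Q8_0", 8.5)):
    print(f"{name}: {weights_gib(N_PARAMS, bpw):.1f} GiB")
```

Under those assumptions the results round to 47.5 and 25.2, matching the FP16 and Q8_0 rows of the table.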

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.