Can I run Gemma 4 26b A4B on an RTX 5080 (16GB)?

Q: Can I run Gemma 4 26b A4B on an RTX 5080?

No. Gemma 4 26b A4B at Q4_K_M needs about 20.2 GB, but the RTX 5080 has only 16 GB, leaving it 4.2 GB short. The figure is computed from the model's official config.json by the open-source FitLLM engine.


Computed with the open FitLLM engine, which models the KV cache layer by layer instead of applying a single naive formula. Updated 2026-07-16.

Memory breakdown (Q4_K_M, F16 KV, 32K context)

Model weights	14.5 GB
KV cache	0.8 GB
Runtime overhead + reserve	4.8 GB
Total used	20.2 / 16 GB
Short by	4.2 GB

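Max context that fits at Q4_K_M: none. Weights plus runtime overhead and reserve (14.5 GB + 4.8 GB = 19.3 GB) already exceed 16 GB before any KV cache is allocated.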
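The verdict is plain addition checked against the card's capacity. A minimal Python sketch using the rounded figures from the breakdown; the function name `fit_verdict` is illustrative and is not part of the FitLLM engine. The rounded components sum to 20.1 GB, while the page reports 20.2 GB, presumably because the engine adds unrounded values.

```python
def fit_verdict(weights_gb, kv_cache_gb, overhead_gb, vram_gb):
    """Return (fits, total_gb, margin_gb); margin is negative when short."""
    total_gb = weights_gb + kv_cache_gb + overhead_gb
    margin_gb = vram_gb - total_gb
    return margin_gb >= 0, total_gb, margin_gb


# Rounded figures from the breakdown (Q4_K_M weights, F16 KV cache).
fits, total, margin = fit_verdict(14.5, 0.8, 4.8, vram_gb=16.0)
print(f"fits={fits} total={total:.1f} GB margin={margin:+.1f} GB")

# Same check with the KV cache set to zero: context length is not the blocker.
fits_no_kv, total_no_kv, _ = fit_verdict(14.5, 0.0, 4.8, vram_gb=16.0)
print(f"zero-KV total={total_no_kv:.1f} GB fits={fits_no_kv}")
```

Returning the signed margin rather than a bare boolean makes it easy to see how far a given setup is from fitting.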

Every quantization on the RTX 5080

Weight quant	Weights	Fits (KV F16)	Used @32K
Q4_K_M	14.5 GB	❌ won't fit	20.2 / 16.0 GB
Q5_K_M	16.9 GB	❌ won't fit	22.9 / 16.0 GB
Q6_K	19.5 GB	❌ won't fit	25.7 / 16.0 GB
Q8_0	25.2 GB	❌ won't fit	32.2 / 16.0 GB
FP16	47.5 GB	❌ won't fit	57.1 / 16.0 GB

Lower-bit weight quants free memory at some cost in output quality. Q4 is the common sweet spot; below it, quality drops faster.

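The KV cache is F16 here (the llama.cpp default). Quantizing it to Q8 or Q4 with llama.cpp's -ctk / -ctv flags frees room for more context, but at 0.8 GB the cache is too small for that alone to close the 4.2 GB gap.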


Embed this verdict

Live badge for your README or model card — recomputed by the engine, never stale:

[![fits: Gemma 4 26b A4B on RTX 5080](https://img.shields.io/endpoint?url=https%3A%2F%2Ffitllm.run%2Fapi%2Fbadge%3Fmodel%3DGemma%25204%252026b%2520A4B%26gpu%3DRTX%25205080)](https://fitllm.run/can-i-run/gemma-4-26b-a4b-on-rtx-5080)


Or from your terminal (exit 0/1 — works as a pre-download guard):

npx fitllm "Gemma 4 26b A4B" --gpu "RTX 5080"

Why most VRAM calculators get this wrong

Gemma 4 26b A4B interleaves sliding-window (local) and global attention 5:1. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula over-counts KV cache by several times.
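The difference between the two formulas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the FitLLM engine's code: the 30-layer count, 5:1 pattern, 1024-token window and the 256/512 head dimensions come from this page, while the KV head counts (8 local, 2 global), the choice of every sixth layer as the global one, and all names are placeholders to be checked against the model's config.json.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AttnLayer:
    kv_heads: int
    head_dim: int
    window: Optional[int]  # None means global (full-context) attention


def kv_cache_bytes(layers, context, bytes_per_elem=2):
    """Per-layer KV size: local layers store at most `window` tokens."""
    total = 0
    for layer in layers:
        tokens = context if layer.window is None else min(context, layer.window)
        # Factor 2 covers the separate K and V tensors.
        total += 2 * tokens * layer.kv_heads * layer.head_dim * bytes_per_elem
    return total


def naive_kv_cache_bytes(n_layers, context, kv_heads, head_dim, bytes_per_elem=2):
    """Naive estimate: every layer, full context, one head shape."""
    return 2 * n_layers * context * kv_heads * head_dim * bytes_per_elem


# kv_heads values are ASSUMED placeholders, not taken from this page.
LOCAL = AttnLayer(kv_heads=8, head_dim=256, window=1024)
GLOBAL = AttnLayer(kv_heads=2, head_dim=512, window=None)

# 5 local layers followed by 1 global layer, repeated over 30 layers (assumed order).
layers = [GLOBAL if (i + 1) % 6 == 0 else LOCAL for i in range(30)]

context = 32 * 1024  # 32K tokens, F16 cache (2 bytes per element)
per_layer = kv_cache_bytes(layers, context)
naive = naive_kv_cache_bytes(30, context, kv_heads=8, head_dim=256)

print(f"per-layer: {per_layer / 2**30:.2f} GiB")
print(f"naive:     {naive / 2**30:.2f} GiB ({naive / per_layer:.1f}x larger)")
```

With these placeholder head counts the naive formula comes out roughly nine times larger than the per-layer one. The exact ratio depends on the real head counts, but the gap comes from the same source: 25 of the 30 layers never store more than 1024 tokens.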

What fits on the RTX 5080 instead

Same GPU: models that fit on the RTX 5080 are Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, MiniCPM5-1B, Qwen3-0.6B, Qwen3-1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B-it.

Same model: GPUs that run Gemma 4 26b A4B are RTX 5090 (32GB), RTX 4090 (24GB), RTX 3090 (24GB), RTX 3090 Ti (24GB), RTX 6000 Ada (48GB), RTX PRO 6000 Blackwell (96GB), RX 7900 XTX (24GB), RX 7900 XT (20GB), Radeon PRO W7900 (48GB), 2× RTX 3090 (48GB), 2× RTX 4090 (48GB), 4× RTX 3090 (96GB), A100 40GB (40GB), A100 80GB (80GB), H100 80GB (80GB), H200 141GB (141GB), B200 (180GB).

Reproduce it

Gemma 4 26b A4B has 25.5B parameters (4B active, MoE) across 30 layers. The RTX 5080 has 16 GB of VRAM and 960 GB/s of memory bandwidth. The same math is open source in fitllm-engine. GGUF bits-per-weight (bpw) values are taken from llama.cpp.
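As a rough cross-check of the Weights column, weight size is approximately parameters × bpw / 8. The Python sketch below is an approximation under stated assumptions, not the engine's code: it uses the bpw values implied by the GGUF block layouts for Q8_0 (34 bytes per 32 weights) and Q6_K (210 bytes per 256 weights), treats every tensor as if it used that one type, and reads the page's "GB" as binary gigabytes (GiB), which is an inference. The K_M mixes are left out because their bpw varies per tensor.

```python
GIB = 2**30

# Bits per weight implied by the GGUF block layouts; FP16 is 16 by definition.
BPW = {"Q6_K": 6.5625, "Q8_0": 8.5, "FP16": 16.0}


def weight_gib(n_params, bpw):
    """Approximate weight size in GiB, assuming one quant type for all tensors."""
    return n_params * bpw / 8 / GIB


N_PARAMS = 25.5e9  # parameter count stated on this page

for name, bpw in BPW.items():
    print(f"{name}: {weight_gib(N_PARAMS, bpw):.1f} GiB")
```

For these three types the results round to the same one-decimal values as the table (19.5, 25.2 and 47.5), which is what suggests the GiB reading.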

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver, and display.