Can I run Gemma 4 26b A4B on an RTX 4090 (24GB)?

Q: Can I run Gemma 4 26b A4B on an RTX 4090?

Yes — Gemma 4 26b A4B fits on the RTX 4090 (24 GB) at Q4_K_M, with up to about 83K tokens of context. Computed from the model's official config.json by the open-source FitLLM engine.

✅ Yes — it fits — up to ~83K tokens at Q4_K_M

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-07-16.

Memory breakdown (Q4_K_M, F16 KV, 32K context)

Model weights	14.5 GB
KV cache	0.8 GB
Runtime overhead + reserve	4.8 GB
Total used	20.2 / 24 GB
Free	3.8 GB

Max context that fits at Q4_K_M: ~83K tokens with an F16 KV cache, or ~108K tokens with a Q8 KV cache.
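The breakdown above is plain addition, and a short sketch makes the headroom explicit. The figures are copied from the table; the variable and function names are illustrative and not part of FitLLM. Because each row is rounded to one decimal, the components sum to 20.1 GB while the table reports 20.2 GB.

```python
# Figures copied from the memory breakdown (Q4_K_M, F16 KV), in GB.
VRAM_GB = 24.0
components_gb = {
    "model_weights": 14.5,
    "kv_cache": 0.8,
    "runtime_overhead_and_reserve": 4.8,
}


def memory_budget(components, vram_gb):
    """Return (total_used, free) for a set of memory components."""
    total = sum(components.values())
    return total, vram_gb - total


total_gb, free_gb = memory_budget(components_gb, VRAM_GB)
print(f"Total used: {total_gb:.1f} / {VRAM_GB:.0f} GB, free: {free_gb:.1f} GB")
```

Only the KV cache term grows with context, so the free figure is what the longer contexts quoted above have to fit into.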

Every quantization on the RTX 4090

Weight quant	Weights	Fits (KV F16)	Used @32K
Q4_K_M	14.5 GB	✅ up to 83K ctx	20.2 / 24.0 GB
Q5_K_M	16.9 GB	⚠️ up to 31K ctx	22.9 / 24.0 GB
Q6_K	19.5 GB	❌ won't fit	25.7 / 24.0 GB
Q8_0	25.2 GB	❌ won't fit	32.2 / 24.0 GB
FP16	47.5 GB	❌ won't fit	57.1 / 24.0 GB

Lower-bit weight quantization frees memory at some cost in output quality. Q4 is the common sweet spot; below that, quality drops faster.

The KV cache is F16 here (the llama.cpp default). Quantize it to Q8 or Q4 with llama.cpp's -ctk and -ctv flags (--cache-type-k and --cache-type-v) to fit more context.

▶ Open the interactive calculator (this exact setup)

Embed this verdict

Live badge for your README or model card — recomputed by the engine, never stale:

[![fits: Gemma 4 26b A4B on RTX 4090](https://img.shields.io/endpoint?url=https%3A%2F%2Ffitllm.run%2Fapi%2Fbadge%3Fmodel%3DGemma%25204%252026b%2520A4B%26gpu%3DRTX%25204090)](https://fitllm.run/can-i-run/gemma-4-26b-a4b-on-rtx-4090)


Or from your terminal (exit 0/1 — works as a pre-download guard):

npx fitllm "Gemma 4 26b A4B" --gpu "RTX 4090"
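The exit code makes the command easy to wrap in a script. The following is a minimal sketch, assuming exit status 0 means "fits" and any other status means it does not; the helper name `fits_on_gpu` and its `run` parameter are hypothetical and exist only to keep the example testable.

```python
import subprocess


def fits_on_gpu(model, gpu, run=subprocess.run):
    """Return True when the fitllm CLI exits with status 0.

    Assumption: exit code 0 means the model fits; anything else means it does not.
    """
    result = run(["npx", "fitllm", model, "--gpu", gpu])
    return result.returncode == 0


# Example use as a pre-download guard (requires Node.js and npx on PATH):
#
#     if not fits_on_gpu("Gemma 4 26b A4B", "RTX 4090"):
#         raise SystemExit("Model will not fit; skipping download.")
```

Passing the arguments as a list avoids shell quoting problems with the spaces in the model and GPU names.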

Why most VRAM calculators get this wrong

Gemma 4 26b A4B interleaves sliding-window (local) and global attention 5:1. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula over-counts KV cache by several times.
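The difference is easiest to see side by side. The sketch below is not the FitLLM engine's code: it takes the layer count (30), the 5:1 ratio, the 1024-token window, and the two head_dim values from this page, while the KV head counts (8 for local layers, 2 for global layers) are placeholder assumptions that should be replaced with the values from the model's config.json.

```python
BYTES_F16 = 2  # bytes per element for an F16 KV cache
GIB = 1024 ** 3


def kv_bytes(n_layers, n_kv_heads, head_dim, cached_tokens):
    """Size of the K and V tensors for n_layers holding cached_tokens positions."""
    return 2 * n_layers * n_kv_heads * head_dim * cached_tokens * BYTES_F16


def interleaved_kv_bytes(context, n_layers=30, local_per_global=5, window=1024,
                         local_heads=8, local_dim=256,    # local_heads: assumed
                         global_heads=2, global_dim=512):  # global_heads: assumed
    """Per-layer-type estimate: local layers stop growing at the window size."""
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    local = kv_bytes(n_local, local_heads, local_dim, min(context, window))
    global_part = kv_bytes(n_global, global_heads, global_dim, context)
    return local + global_part


def naive_kv_bytes(context, n_layers=30, n_kv_heads=8, head_dim=256):
    """Naive estimate: every layer caches the full context with one head shape."""
    return kv_bytes(n_layers, n_kv_heads, head_dim, context)


context = 32 * 1024
print(f"interleaved: {interleaved_kv_bytes(context) / GIB:.2f} GiB")
print(f"naive:       {naive_kv_bytes(context) / GIB:.2f} GiB")
```

With these placeholder head counts, the per-layer-type estimate at 32K comes to roughly 0.8 GiB, while the naive formula gives about 7.5 GiB. The gap widens with context, because only the global layers keep growing.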

Other options

Same GPU (models that fit on the RTX 4090): GLM-4.7-Flash, gpt-oss-20b, Qwen 3.6 27B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, MiniCPM5-1B, Qwen3-0.6B, Qwen3-1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B-it.

Same model (GPUs that run Gemma 4 26b A4B): RTX 5090 (32GB), RTX 4090 (24GB), RTX 3090 (24GB), RTX 3090 Ti (24GB), RTX 6000 Ada (48GB), RTX PRO 6000 Blackwell (96GB), RX 7900 XTX (24GB), RX 7900 XT (20GB), Radeon PRO W7900 (48GB), 2× RTX 3090 (48GB), 2× RTX 4090 (48GB), 4× RTX 3090 (96GB), A100 40GB (40GB), A100 80GB (80GB), H100 80GB (80GB), H200 141GB (141GB), B200 (180GB).

Reproduce it

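Gemma 4 26b A4B has 25.5B parameters (4B active, MoE) across 30 layers. The RTX 4090 has 24 GB of VRAM and 1008 GB/s of memory bandwidth. The same math is available as open source in fitllm-engine. GGUF bits-per-weight (bpw) values come from llama.cpp.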
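As a rough cross-check of the weights column, size is approximately parameters × bpw / 8. The sketch below covers only the GGUF types with a fixed block layout (Q6_K, Q8_0, F16); Q4_K_M and Q5_K_M mix tensor types, so their effective bpw depends on the model and is left out. It assumes every tensor uses the same type and that the table's "GB" figures are GiB (1024³ bytes), which is an inference from the numbers and not something this page states.

```python
GIB = 1024 ** 3

# Bits per weight for GGUF types with a fixed block layout.
BITS_PER_WEIGHT = {
    "Q6_K": 6.5625,  # 210 bytes per 256-weight block
    "Q8_0": 8.5,     # 34 bytes per 32-weight block
    "F16": 16.0,
}


def weights_gib(n_params, bits_per_weight):
    """Approximate weight size, assuming every tensor uses the same type."""
    return n_params * bits_per_weight / 8 / GIB


N_PARAMS = 25.5e9  # total parameter count quoted on this page
for quant, bpw in BITS_PER_WEIGHT.items():
    print(f"{quant}: {weights_gib(N_PARAMS, bpw):.1f} GiB")
```

Under those assumptions the three results round to 19.5, 25.2, and 47.5, matching the Q6_K, Q8_0, and FP16 rows of the table.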

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with the runtime (llama.cpp, MLX, Ollama), the driver, and whether the GPU also drives a display. Found a mismatch? Report it.