The best GPU or Mac to run Gemma 4 12b locally

✅ RTX 3060 12GB — runs Gemma 4 12b at ~4-bit, 10.4/12 GB at 8K context

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-07-16.

Hardware is ranked by memory size (a proxy for cost and availability), not price. Every figure is computed by the engine and should be read as a floor, not a guarantee; leave headroom for your runtime and OS.

Smallest GPU / Mac that fits, by setup

~4-bit ≈ Q4_K_M (GGUF, llama.cpp) on GPU / 4-bit (MLX) on Mac · ~8-bit ≈ Q8_0 / 8-bit · KV cache F16 · "full" = the model's max context (262K).

Setup	Smallest GPU	Smallest Mac
~4-bit · 8K ctx	✅ RTX 3060 12GB · 10.4/12 GB	✅ M5 Pro 48GB · 15.0/48 GB
~4-bit · full (262K)	✅ RTX 3090 · 22.4/24 GB	✅ M5 Pro 48GB · 27.1/48 GB
~8-bit · 33K ctx	✅ RTX 3090 · 17.2/24 GB	✅ M5 Pro 48GB · 22.4/48 GB
~8-bit · full (262K)	✅ RTX 5090 · 28.1/32 GB	✅ M5 Pro 48GB · 33.3/48 GB

Every GPU and Mac, ranked by memory

Hardware	Memory	Max context (~4-bit)	Used @8K
RTX 3060 12GB	12 GB	✅ up to 30K	10.4 / 12 GB
RTX 4080 SUPER	16 GB	✅ up to 110K	10.4 / 16 GB
RTX 5080	16 GB	✅ up to 110K	10.4 / 16 GB
RTX 3090	24 GB	✅ up to 262K	10.4 / 24 GB
RTX 4090	24 GB	✅ up to 262K	10.4 / 24 GB
RTX 5090	32 GB	✅ up to 262K	10.4 / 32 GB
M5 Pro 48GB	48 GB	✅ up to 262K	15.0 / 48 GB
M5 Max 64GB	64 GB	✅ up to 262K	15.0 / 64 GB
M5 Max 128GB	128 GB	✅ up to 262K	15.0 / 128 GB
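
The ranking above can be queried programmatically. The following is a minimal sketch, not part of fitllm-engine: the rows are copied from the table (max context in thousands of tokens at ~4-bit), while the `HARDWARE` list and the `smallest_fit` helper are hypothetical names introduced only for illustration.

```python
# Rows copied from the table above: (name, kind, memory_gb, max_context_k at ~4-bit).
# HARDWARE and smallest_fit are illustrative names, not part of fitllm-engine.
HARDWARE = [
    ("RTX 3060 12GB", "gpu", 12, 30),
    ("RTX 4080 SUPER", "gpu", 16, 110),
    ("RTX 5080", "gpu", 16, 110),
    ("RTX 3090", "gpu", 24, 262),
    ("RTX 4090", "gpu", 24, 262),
    ("RTX 5090", "gpu", 32, 262),
    ("M5 Pro 48GB", "mac", 48, 262),
    ("M5 Max 64GB", "mac", 64, 262),
    ("M5 Max 128GB", "mac", 128, 262),
]


def smallest_fit(context_k, kind=None):
    """Return the lowest-memory entry whose ~4-bit max context covers context_k.

    kind may be "gpu", "mac", or None for either. Returns None if nothing fits.
    Ties on memory size resolve to the row listed first in the table.
    """
    candidates = [
        row for row in HARDWARE
        if row[3] >= context_k and (kind is None or row[1] == kind)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda row: row[2])[0]


print(smallest_fit(8))           # 8K context, any hardware
print(smallest_fit(262))         # full context, any hardware
print(smallest_fit(262, "mac"))  # full context, Macs only
```

Because the table figures are a floor, a real selection would probably want a safety margin on top of the listed max context.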

Why bigger isn't always needed — and smaller sometimes won't fit

Gemma 4 12b interleaves sliding-window (local) and global attention layers in a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula therefore over-counts the KV cache by several times. As a result, Gemma 4 12b has no single fixed memory requirement: it shifts with quantization and context length. See the full breakdown.
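
To make the over-count concrete, here is a rough sketch that compares a per-layer KV-cache estimate with the naive formula. It is not the fitllm-engine implementation. The 5:1 ratio, 1024-token window, head dimensions, and F16 cache come from the text above; the layer count (48), the KV head count (8), the assignment of head_dim 512 to global layers and 256 to local layers, and the use of the same KV head count for both layer types are all assumptions made for illustration, so only the ratio is meaningful, not the absolute byte counts.

```python
def kv_cache_bytes(context_len, n_layers, n_kv_heads, *,
                   local_per_global=5, window=1024,
                   local_head_dim=256, global_head_dim=512,
                   bytes_per_elem=2):
    """Per-layer KV-cache estimate for interleaved local/global attention.

    Assumes one global layer per (local_per_global + 1) layers and the same
    number of KV heads in both layer types (an assumption, not a known fact).
    bytes_per_elem=2 corresponds to an F16 cache.
    """
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # Local layers never store more than `window` tokens.
    local_tokens = min(context_len, window)

    def per_token(head_dim):
        # Factor 2 covers the K and V tensors.
        return 2 * n_kv_heads * head_dim * bytes_per_elem

    local_bytes = n_local * local_tokens * per_token(local_head_dim)
    global_bytes = n_global * context_len * per_token(global_head_dim)
    return local_bytes + global_bytes


def naive_kv_cache_bytes(context_len, n_layers, n_kv_heads,
                         head_dim=512, bytes_per_elem=2):
    """Naive estimate: all layers x full context x one head_dim."""
    return n_layers * context_len * 2 * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical layer and head counts; read the real ones from config.json.
N_LAYERS, N_KV_HEADS = 48, 8
ctx = 262_144  # the "full" 262K context

accurate = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS)
naive = naive_kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS)
print(f"naive / per-layer estimate: {naive / accurate:.2f}x")
```

Under these assumptions the ratio does not depend on the KV head count, since that factor appears in both formulas; it is driven by the local/global split, the window size, and which head_dim the naive formula picks.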


All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.