Can I run Gemma 4 12b on a M5 Max 128GB Mac?

Yes. Gemma 4 12b fits in the 128 GB of unified memory on an M5 Max at 8-bit, with up to about 262K tokens of context. The figures are computed from the model's official config.json by the open-source FitLLM engine.

✅ Yes — it fits — up to ~262K tokens at 8-bit

Computed with the open FitLLM engine, which models the KV cache per layer rather than using a naive estimate. Updated 2026-07-16.

Memory breakdown (8-bit, F16 KV, 32K context)

Model weights	11.1 GB
KV cache	0.8 GB
Runtime + macOS	10.4 GB
Total used	22.4 / 128 GB
Free	106 GB

Max context at 8-bit: ~262K tokens. Unified memory is shared with the OS, so FitLLM leaves ~20% headroom.
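The breakdown and the headroom rule above can be checked with a few lines of arithmetic. This is a minimal sketch, not FitLLM's actual code: the function name `fits` is hypothetical, and reading "~20% headroom" as "total usage must stay under 80% of unified memory" is an assumption.

```python
def fits(weights_gb, kv_gb, runtime_gb, total_gb, headroom=0.20):
    """Return (used_gb, budget_gb, ok) under an ASSUMED headroom rule:
    usage must stay within (1 - headroom) of total unified memory."""
    used = weights_gb + kv_gb + runtime_gb
    budget = total_gb * (1.0 - headroom)
    return used, budget, used <= budget

# Rounded figures from the 8-bit breakdown above.
used, budget, ok = fits(weights_gb=11.1, kv_gb=0.8, runtime_gb=10.4, total_gb=128)
print(f"used {used:.1f} GB of a {budget:.1f} GB budget, fits: {ok}")
```

The rounded components sum to 22.3 GB rather than the 22.4 GB shown in the table; the difference is presumably rounding in the displayed values.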

Every quantization on M5 Max 128GB

Quant	Weights	Fits (KV F16)	Used @32K
4-bit	5.6 GB	✅ up to 262K ctx	16.1 / 128 GB
8-bit	11.1 GB	✅ up to 262K ctx	22.4 / 128 GB
16-bit	22.3 GB	✅ up to 262K ctx	34.8 / 128 GB

Lower-bit quantizations free memory at some cost in output quality; 4-bit is a common sweet spot for local use.
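The weights column scales roughly linearly with bits per weight, which can be sketched as follows. The parameter count is not stated on this page: the 11.15 billion used here is back-solved from the 16-bit row (22.3 GB at 2 bytes per weight) and is an assumption, as are treating 1 GB as 10^9 bytes and ignoring quantization metadata such as scales and zero points.

```python
# ASSUMPTION: back-solved from the 16-bit row, not an official parameter count.
ASSUMED_PARAMS = 11.15e9

def weights_gb(params, bits_per_weight):
    """Approximate weight memory in GB (1 GB = 1e9 bytes).
    Ignores per-block scales and other quantization overhead."""
    return params * bits_per_weight / 8 / 1e9

for bits in (4, 8, 16):
    print(f"{bits}-bit: about {weights_gb(ASSUMED_PARAMS, bits):.2f} GB")
```

Real quantized files usually carry some extra metadata, so on-disk sizes can differ slightly from this estimate.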

Embed this verdict

Live badge for your README or model card. It is recomputed by the engine on each request, so it never goes stale:

[![fits: Gemma 4 12b on M5 Max 128GB Mac](https://img.shields.io/endpoint?url=https%3A%2F%2Ffitllm.run%2Fapi%2Fbadge%3Fmodel%3DGemma%25204%252012b%26ram%3D128%26quant%3D8)](https://fitllm.run/can-i-run/gemma-4-12b-on-m5-max-128gb)

Or from your terminal (the command exits with 0 or 1, so it works as a pre-download guard):

npx fitllm "Gemma 4 12b" --mac 128
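One way to use that exit code from a script is sketched below. The helper name `model_fits` is hypothetical, and treating exit code 0 as "fits" is an assumption based on the usual shell convention, since the page only says the command exits with 0 or 1.

```python
import subprocess

def model_fits(command):
    """Run a fit-check command and report whether it exited with code 0.
    ASSUMPTION: exit code 0 means the model fits, non-zero means it does not."""
    result = subprocess.run(command, check=False)
    return result.returncode == 0
```

Calling `model_fits(["npx", "fitllm", "Gemma 4 12b", "--mac", "128"])` before a download step would then skip the download when the check fails.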

Why most calculators get this wrong

Gemma 4 12b interleaves sliding-window (local) and global attention layers in a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256 for local layers). A naive "all layers × full context × one head_dim" formula over-counts the KV cache by several times.
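The gap between the two approaches can be illustrated with a short sketch. The 5:1 interleaving, the 1024-token window, the two head_dim values and the F16 cache come from this page. The layer count (48) and the KV-head counts (8 local, 2 global) are hypothetical placeholders, so the output shows the shape of the calculation and is not meant to reproduce the figures above or FitLLM's actual code.

```python
from dataclasses import dataclass
from typing import Optional

BYTES_F16 = 2  # F16 KV cache, 2 bytes per element

@dataclass
class LayerSpec:
    kv_heads: int
    head_dim: int
    window: Optional[int]  # None means global attention over the full context

def kv_bytes(layers, context):
    """Per-layer KV size: local layers are capped at their sliding window."""
    total = 0
    for layer in layers:
        tokens = context if layer.window is None else min(context, layer.window)
        # Factor 2 covers both the K and the V tensors.
        total += 2 * tokens * layer.kv_heads * layer.head_dim * BYTES_F16
    return total

def naive_kv_bytes(n_layers, kv_heads, head_dim, context):
    """Naive formula: every layer stores the full context with one head shape."""
    return 2 * n_layers * context * kv_heads * head_dim * BYTES_F16

# HYPOTHETICAL values: layer count and KV-head counts are placeholders.
N_LAYERS = 48
LOCAL = LayerSpec(kv_heads=8, head_dim=256, window=1024)
GLOBAL = LayerSpec(kv_heads=2, head_dim=512, window=None)
# Five local layers followed by one global layer, repeated.
layers = [GLOBAL if (i + 1) % 6 == 0 else LOCAL for i in range(N_LAYERS)]

context = 32768
per_layer = kv_bytes(layers, context)
naive = naive_kv_bytes(N_LAYERS, LOCAL.kv_heads, LOCAL.head_dim, context)
print(f"per-layer: {per_layer / 1e9:.2f} GB, naive: {naive / 1e9:.2f} GB")
print(f"naive over-count: {naive / per_layer:.1f}x")
```

Because the local layers stop growing past their window, only the global layers scale with context length, which is why the over-count widens as the context gets longer.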

Other options

Models that fit on the same Mac (128GB): GLM-4.7-Flash, gpt-oss-20b, Qwen 3.6 27B, Qwen 3.6 35B-A3B, Qwen-AgentWorld-35B-A3B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Gemma 4 31b, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, MiniCPM5-1B, Qwen3-0.6B, Qwen3-1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B-it.

Reproduce it

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver and display. Found a mismatch? Report it.