FitLLM

Can I run Gemma 4 31b on an M5 Max 64GB Mac?

✅ Yes — it fits — up to ~90K tokens at 8-bit

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

Memory breakdown (8-bit, F16 KV, 33K context)

Model weights      28.3 GB
KV cache           3.3 GB
Runtime + macOS    6.9 GB
Total used         44.5 / 64 GB
Free               19.5 GB

Max context at 8-bit: ~90K tokens. Unified memory is shared with macOS, so FitLLM leaves ~20% headroom.

Every quantization on M5 Max 64GB

Quant    Weights    Fits (KV F16)         Used @32K
4bit     16.2 GB    ✅ up to 206K ctx     30.8 / 64 GB
8bit     28.3 GB    ✅ up to 90K ctx      44.5 / 64 GB
16bit    54.3 GB    ❌ won't fit          73.6 / 64 GB

Why most calculators get this wrong

Gemma 4 31b interleaves sliding-window (local) and global attention layers in a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256). A naive "all layers × full context × one head_dim" formula overcounts the KV cache several times over.
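
To make the difference concrete, here is a minimal Python sketch of a per-layer KV-cache estimate next to the naive formula. It is not the fitllm-engine code. The window (1024), the 5:1 local-to-global pattern, and the two head_dim values (256 local, 512 global) come from the description above; the layer count (60), the KV head counts (8), the position of the global layer within each group of six, and all function and class names are hypothetical placeholders, so take the real values from the model's config.json before trusting any output.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    kv_heads: int           # number of key/value heads in this layer
    head_dim: int           # dimension of each head
    window: Optional[int]   # sliding-window size, or None for global attention


def kv_bytes_per_layer(spec: LayerSpec, context: int, bytes_per_elem: int = 2) -> int:
    """KV-cache bytes for one layer; bytes_per_elem=2 corresponds to an F16 cache."""
    # A sliding-window layer never stores more tokens than its window.
    tokens = context if spec.window is None else min(context, spec.window)
    # Factor 2 covers keys and values.
    return 2 * spec.kv_heads * spec.head_dim * tokens * bytes_per_elem


def build_layers(n_layers: int, local: LayerSpec, global_: LayerSpec,
                 local_per_global: int = 5) -> List[LayerSpec]:
    """Repeat `local_per_global` local layers followed by one global layer (assumed order)."""
    period = local_per_global + 1
    return [global_ if (i + 1) % period == 0 else local for i in range(n_layers)]


def kv_cache_bytes(layers: List[LayerSpec], context: int, bytes_per_elem: int = 2) -> int:
    """Per-layer estimate: sum each layer's own cache size."""
    return sum(kv_bytes_per_layer(s, context, bytes_per_elem) for s in layers)


def naive_kv_cache_bytes(n_layers: int, kv_heads: int, head_dim: int,
                         context: int, bytes_per_elem: int = 2) -> int:
    """Naive estimate: every layer holds the full context with a single head_dim."""
    return 2 * n_layers * kv_heads * head_dim * context * bytes_per_elem


# Hypothetical placeholder values, NOT the real Gemma 4 31b config.
N_LAYERS = 60
LOCAL = LayerSpec(kv_heads=8, head_dim=256, window=1024)
GLOBAL = LayerSpec(kv_heads=8, head_dim=512, window=None)

if __name__ == "__main__":
    context = 32_768
    layers = build_layers(N_LAYERS, LOCAL, GLOBAL)
    per_layer = kv_cache_bytes(layers, context)
    naive = naive_kv_cache_bytes(N_LAYERS, GLOBAL.kv_heads, GLOBAL.head_dim, context)
    print(f"per-layer: {per_layer / 2**30:.2f} GiB")
    print(f"naive:     {naive / 2**30:.2f} GiB")
    print(f"overcount: {naive / per_layer:.1f}x")
```

With these placeholder values the local layers contribute only a small, fixed amount regardless of context length, so nearly all of the cache growth comes from the global layers. The exact ratio depends on the real layer and head counts.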

Other options

On the same Mac, models that fit in 64GB: Qwen 3.6 27B, Qwen 3.6 35B-A3B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Gemma 4 31b.

Reproduce it

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.