Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.
| Component | Memory |
|---|---|
| Model weights | 11.1 GB |
| KV cache | 0.8 GB |
| Runtime + macOS | 4.4 GB |
| Total used | 22.4 / 128 GB |
| Free | 106 GB |
Max context at 8-bit: ~262K tokens. Unified memory is shared with the OS, so FitLLM leaves ~20% headroom.
| Quant | Weights | Fits (KV F16) | Used @32K |
|---|---|---|---|
| 4bit | 5.6 GB | ✅ up to 262K ctx | 16.1 / 128 GB |
| 8bit | 11.1 GB | ✅ up to 262K ctx | 22.4 / 128 GB |
| 16bit | 22.3 GB | ✅ up to 262K ctx | 34.8 / 128 GB |
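The fit check behind this table can be sketched roughly as follows. This is only an illustration, not fitllm-engine's actual code: it assumes the ~20% headroom is a flat fraction of total unified memory, and the function names `usable_budget_gb` and `fits` are hypothetical. The "Used @32K" values are copied from the table above.

```python
def usable_budget_gb(total_gb: float, headroom: float = 0.20) -> float:
    """Memory left for the model after reserving a headroom fraction.

    Assumption: headroom is a flat fraction of total unified memory.
    """
    return total_gb * (1.0 - headroom)


def fits(used_gb: float, total_gb: float, headroom: float = 0.20) -> bool:
    """True if the estimated usage stays within the usable budget."""
    return used_gb <= usable_budget_gb(total_gb, headroom)


# "Used @32K" figures from the table above (GB, on a 128 GB Mac).
USED_AT_32K = {"4bit": 16.1, "8bit": 22.4, "16bit": 34.8}

for quant, used_gb in USED_AT_32K.items():
    print(f"{quant}: fits={fits(used_gb, 128.0)}")
```

Under this assumption the budget on a 128 GB machine is about 102.4 GB, so all three quantizations clear it with a wide margin at 32K context.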
Gemma 4 12b interleaves sliding-window (local) and global attention layers in a 5:1 ratio. The local layers cap their KV cache at the 1024-token window, and the global layers use a different head shape (head_dim 512 vs 256 for local layers). A naive "all layers × full context × one head_dim" formula therefore over-counts the KV cache several times over.
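To make the difference concrete, here is a minimal sketch of a per-layer KV-cache estimate next to the naive formula. It is not fitllm-engine's implementation. Only the 5:1 ratio, the 1024-token window, and the head dimensions (256 local, 512 global) come from the text above; the layer count (48), the KV-head counts (8 and 4), and the position of the global layers are placeholder assumptions, and all names are hypothetical. Real values would come from the model's config.json.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LayerSpec:
    kv_heads: int
    head_dim: int
    window: Optional[int] = None  # None means global (full-context) attention


def interleaved_layers(num_layers: int, local_per_global: int,
                       local: LayerSpec, global_: LayerSpec) -> List[LayerSpec]:
    """Build a layer list where every (local_per_global + 1)-th layer is global."""
    period = local_per_global + 1
    return [global_ if (i + 1) % period == 0 else local for i in range(num_layers)]


def kv_cache_bytes(layers: List[LayerSpec], context_tokens: int,
                   bytes_per_elem: int = 2) -> int:
    """Per-layer estimate: local layers cache at most `window` tokens."""
    total = 0
    for layer in layers:
        cached = context_tokens if layer.window is None else min(context_tokens, layer.window)
        # Factor 2 covers the separate key and value tensors.
        total += 2 * layer.kv_heads * layer.head_dim * cached * bytes_per_elem
    return total


def naive_kv_cache_bytes(num_layers: int, kv_heads: int, head_dim: int,
                         context_tokens: int, bytes_per_elem: int = 2) -> int:
    """Naive estimate: every layer caches the full context with one head shape."""
    return 2 * num_layers * kv_heads * head_dim * context_tokens * bytes_per_elem


# Placeholder shapes: kv_heads and the 48-layer count are assumptions.
LOCAL = LayerSpec(kv_heads=8, head_dim=256, window=1024)
GLOBAL = LayerSpec(kv_heads=4, head_dim=512)
LAYERS = interleaved_layers(48, 5, LOCAL, GLOBAL)

ctx = 32_768
per_layer = kv_cache_bytes(LAYERS, ctx)  # F16 cache: 2 bytes per element
naive = naive_kv_cache_bytes(len(LAYERS), LOCAL.kv_heads, LOCAL.head_dim, ctx)
print(f"per-layer: {per_layer / 1e9:.2f} GB, naive: {naive / 1e9:.2f} GB, "
      f"ratio: {naive / per_layer:.1f}x")
```

With shapes like these, only the global layers grow with context length, so the gap between the two estimates widens as the context gets longer. Below the window size the two formulas agree.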
Models that fit in 128 GB on the same Mac: Qwen 3.6 27B, Qwen 3.6 35B-A3B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Gemma 4 31b.
All numbers are computed by the open-source fitllm-engine (MIT) from the official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver, and display.