Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

| Component | Memory |
|---|---|
| Model weights | 32.6 GB |
| KV cache | 0.6 GB |
| Runtime + macOS | 7.0 GB |
| Total used | 46.2 / 128 GB |
| Free | 81.8 GB |

Max context at 8-bit: ~262K tokens. Unified memory is shared with the OS, so FitLLM leaves ~20% headroom.

| Quant | Weights | Fits (KV cache at F16) | Used at 32K context |
|---|---|---|---|
| 4-bit | 16.3 GB | ✅ up to 262K ctx | 28.0 / 128 GB |
| 8-bit | 32.6 GB | ✅ up to 262K ctx | 46.2 / 128 GB |
| 16-bit | 65.2 GB | ✅ up to 262K ctx | 82.7 / 128 GB |

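
The fit column can be read as a simple budget check. The sketch below is an illustration, not the fitllm-engine code: it assumes the ~20% headroom is reserved from total unified memory and that a configuration fits when its total usage stays within the remaining budget. The names `budget_gb` and `fits` are hypothetical.

```python
def budget_gb(total_gb: float, headroom: float = 0.20) -> float:
    """Memory left for the model after reserving headroom (assumed rule)."""
    return total_gb * (1.0 - headroom)


def fits(used_gb: float, total_gb: float, headroom: float = 0.20) -> bool:
    """True when total usage stays inside the headroom-adjusted budget."""
    return used_gb <= budget_gb(total_gb, headroom)


# Total usage at 32K context, copied from the table above (GB).
usage_at_32k = {"4-bit": 28.0, "8-bit": 46.2, "16-bit": 82.7}

for quant, used in usage_at_32k.items():
    status = "fits" if fits(used, 128.0) else "does not fit"
    print(f"{quant}: {used} GB of {budget_gb(128.0):.1f} GB budget, {status}")
```

Under this assumed rule, all three quantizations stay inside the budget on a 128 GB machine at 32K context, which agrees with the table.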
Qwen 3.6 35B-A3B is a hybrid model: only 10 of its 40 layers use full attention. The other 30 are linear layers that keep no growing KV cache. Naive calculators count every layer at full context and badly overestimate memory use.
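
To make the difference concrete, here is a minimal sketch of per-layer KV-cache sizing. It is not the fitllm-engine implementation. The layer counts (10 of 40) come from the text above; `KV_HEADS` and `HEAD_DIM` are hypothetical placeholder values, so read the real ones from the model's config.json before trusting any absolute number. The sketch also ignores the fixed-size state that linear layers keep.

```python
def kv_cache_bytes(attn_layers: int, context_tokens: int,
                   kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """KV-cache size for layers that store keys and values per token.

    The leading factor 2 covers one K tensor and one V tensor per layer.
    bytes_per_elem=2 corresponds to an F16 cache.
    """
    return 2 * attn_layers * context_tokens * kv_heads * head_dim * bytes_per_elem


# Hypothetical placeholders, NOT taken from the source page.
KV_HEADS = 2
HEAD_DIM = 256
CONTEXT = 32 * 1024  # 32K tokens

# Hybrid-aware: only the 10 full-attention layers grow with context.
hybrid = kv_cache_bytes(10, CONTEXT, KV_HEADS, HEAD_DIM)
# Naive: all 40 layers counted as if they were full attention.
naive = kv_cache_bytes(40, CONTEXT, KV_HEADS, HEAD_DIM)

print(f"hybrid-aware: {hybrid / 2**30:.3f} GiB")
print(f"naive:        {naive / 2**30:.3f} GiB")
print(f"overestimate: {naive / hybrid:.1f}x")
```

Whatever the head configuration, counting all 40 layers instead of 10 inflates the KV-cache estimate by a factor of 4, and the gap in absolute terms widens as context length grows.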
Models that fit on the same Mac with 128 GB: Qwen 3.6 27B, Qwen 3.6 35B-A3B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Gemma 4 31b.
Open math: fitllm-engine (MIT), from official config.json.
All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.