FitLLM

The best GPU or Mac to run Qwen 3.6 35B-A3B locally

✅ RTX 5090 — runs Qwen 3.6 35B-A3B at ~4-bit, 24.8/32 GB at 8K context

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.

Ranked by memory size (a proxy for cost and availability), not price. Every figure is computed by the engine and is a floor, not a guarantee; leave headroom for your runtime and OS.

Smallest GPU / Mac that fits, by setup

~4-bit ≈ Q4_K_M (GGUF, llama.cpp) on GPU / 4-bit (MLX) on Mac · ~8-bit ≈ Q8_0 / 8-bit · KV cache F16 · "full" = the model's max context (262K).

| Setup | Smallest GPU | Smallest Mac |
| --- | --- | --- |
| ~4-bit · 8K ctx | RTX 5090 · 24.8/32 GB | M5 Pro 48GB · 26.7/48 GB |
| ~4-bit · full (262K) | 🔴 | M5 Max 64GB · 39.9/64 GB |
| ~8-bit · 33K ctx | 🔴 | M5 Max 64GB · 46.2/64 GB |
| ~8-bit · full (262K) | 🔴 | M5 Max 128GB · 58.1/128 GB |

Every GPU and Mac, ranked by memory

| Hardware | Memory | Max context (~4-bit) | Used @8K |
| --- | --- | --- | --- |
| RTX 3060 12GB | 12 GB | ❌ won't fit | 24.8 / 12 GB |
| RTX 4080 SUPER | 16 GB | ❌ won't fit | 24.8 / 16 GB |
| RTX 5080 | 16 GB | ❌ won't fit | 24.8 / 16 GB |
| RTX 3090 | 24 GB | ❌ won't fit | 24.8 / 24 GB |
| RTX 4090 | 24 GB | ❌ won't fit | 24.8 / 24 GB |
| RTX 5090 | 32 GB | ✅ up to 117K | 24.8 / 32 GB |
| M5 Pro 48GB | 48 GB | ✅ up to 234K | 26.7 / 48 GB |
| M5 Max 64GB | 64 GB | ✅ up to 262K | 26.7 / 64 GB |
| M5 Max 128GB | 128 GB | ✅ up to 262K | 26.7 / 128 GB |
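
The "smallest that fits" picks can be checked by hand from the table above. The sketch below is only an illustration, not the FitLLM engine: the `Hardware` type, the `smallest_fit` helper and the `headroom_gb` parameter are hypothetical names, and the rows are copied from the ~4-bit, 8K-context column.

```python
from typing import NamedTuple, Optional


class Hardware(NamedTuple):
    name: str
    kind: str             # "gpu" or "mac"
    memory_gb: float      # total memory from the table
    used_at_8k_gb: float  # "Used @8K" figure from the table (~4-bit)


# Rows copied from the table above.
TABLE = [
    Hardware("RTX 3060 12GB", "gpu", 12, 24.8),
    Hardware("RTX 4080 SUPER", "gpu", 16, 24.8),
    Hardware("RTX 5080", "gpu", 16, 24.8),
    Hardware("RTX 3090", "gpu", 24, 24.8),
    Hardware("RTX 4090", "gpu", 24, 24.8),
    Hardware("RTX 5090", "gpu", 32, 24.8),
    Hardware("M5 Pro 48GB", "mac", 48, 26.7),
    Hardware("M5 Max 64GB", "mac", 64, 26.7),
    Hardware("M5 Max 128GB", "mac", 128, 26.7),
]


def smallest_fit(rows, kind: str, headroom_gb: float = 0.0) -> Optional[Hardware]:
    """Return the lowest-memory device of `kind` whose usage plus headroom fits."""
    candidates = [
        r for r in rows
        if r.kind == kind and r.used_at_8k_gb + headroom_gb <= r.memory_gb
    ]
    return min(candidates, key=lambda r: r.memory_gb, default=None)


gpu = smallest_fit(TABLE, "gpu")
mac = smallest_fit(TABLE, "mac")
print(gpu.name, "|", mac.name)
```

Passing a non-zero `headroom_gb` shows how thin the margin can be: with 8 GB reserved for the runtime and OS, no GPU in the list would fit in this sketch.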

Why bigger isn't always needed — and smaller sometimes won't fit

Qwen 3.6 35B-A3B is a hybrid model: only 10 of its 40 layers use full attention. The other 30 are linear layers and keep no KV cache that grows with context. Naive calculators count every layer at full context and badly overestimate. As a result, Qwen 3.6 35B-A3B has no single fixed memory requirement: it shifts with quantization and context. See the full breakdown.
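
To make the overestimate concrete, here is a minimal sketch of the standard KV-cache size formula applied to all 40 layers versus only the 10 full-attention layers. It is not the FitLLM engine's code. The layer counts come from the text; `KV_HEADS` and `HEAD_DIM` are placeholder values chosen for illustration, not values read from the model's config.json, and 262,144 tokens is an assumed exact value for "262K". The sketch also ignores the fixed-size state of the linear layers.

```python
def kv_cache_bytes(context_tokens: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes for the K and V caches of layers whose cache grows with context.

    The leading 2 accounts for storing both keys and values;
    bytes_per_elem=2 corresponds to an F16 cache.
    """
    return 2 * attn_layers * kv_heads * head_dim * context_tokens * bytes_per_elem


# From the text: 10 of 40 layers use full attention.
TOTAL_LAYERS = 40
FULL_ATTENTION_LAYERS = 10

# Placeholder values for illustration only (NOT from the real config.json).
KV_HEADS = 2
HEAD_DIM = 256

CONTEXT = 262_144  # assumed exact token count for "262K"

naive = kv_cache_bytes(CONTEXT, TOTAL_LAYERS, KV_HEADS, HEAD_DIM)
hybrid = kv_cache_bytes(CONTEXT, FULL_ATTENTION_LAYERS, KV_HEADS, HEAD_DIM)

print(f"naive / hybrid = {naive / hybrid:.1f}x")
```

Whatever the real head count and head dimension are, they cancel out of the ratio: counting all 40 layers instead of 10 inflates the KV-cache term by 4x under this formula.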

▶ Open the interactive calculator

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.