Can I run Qwen 3.6 35B-A3B on an RTX 4090 (24GB)?

Q: Can I run Qwen 3.6 35B-A3B on an RTX 4090?

No — Qwen 3.6 35B-A3B at Q4_K_M needs about 26.0 GB but the RTX 4090 has only 24 GB. Computed from the model's official config.json by the open-source FitLLM engine.

❌ No — Qwen 3.6 35B-A3B (Q4_K_M) needs 26.0 GB but the RTX 4090 has 24 GB

Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-07-16.

Memory breakdown (Q4_K_M, F16 KV, 32K context)

Model weights	19.9 GB
KV cache	0.6 GB
Runtime overhead + reserve	5.5 GB
Total used	26.0 / 24 GB
Short by	2.0 GB
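The breakdown above is plain addition, and can be re-checked with a minimal sketch like the one below. The numbers are copied from the table; the variable names are illustrative only and are not taken from fitllm-engine.

```python
# Re-check the memory breakdown from the table above (all values in GB).
# Names are illustrative; they are not fitllm-engine identifiers.
VRAM_GB = 24.0

breakdown_gb = {
    "model_weights": 19.9,
    "kv_cache": 0.6,
    "runtime_overhead_and_reserve": 5.5,
}

# Round to one decimal to match the precision the page reports.
total_gb = round(sum(breakdown_gb.values()), 1)
shortfall_gb = round(total_gb - VRAM_GB, 1)
fits = total_gb <= VRAM_GB

print(f"Total used: {total_gb} / {VRAM_GB} GB")
print(f"Fits: {fits}")
if not fits:
    print(f"Short by: {shortfall_gb} GB")
```

Because the weights alone take 19.9 GB, the overhead and reserve term decides the verdict here rather than the KV cache.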

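Max context that fits at Q4_K_M: none. By the figures above, weights plus runtime overhead and reserve (19.9 + 5.5 = 25.4 GB) already exceed 24 GB before any KV cache is allocated.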

Every quantization on the RTX 4090

Weight quant	Weights	Fits (KV F16)	Used @32K
Q4_K_M	19.9 GB	❌ won't fit	26.0 / 24.0 GB
Q5_K_M	23.2 GB	❌ won't fit	29.7 / 24.0 GB
Q6_K	26.7 GB	❌ won't fit	33.7 / 24.0 GB
Q8_0	34.6 GB	❌ won't fit	42.5 / 24.0 GB
FP16	65.2 GB	❌ won't fit	76.7 / 24.0 GB

Lower-bit weight quants free memory at some cost in output quality. Q4 is the common sweet spot; below it, quality drops faster.

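KV cache is F16 here (the llama.cpp default). Quantizing it to Q8 or Q4 (-ctk/-ctv) normally buys more context, but in this setup the KV cache is only 0.6 GB, so even removing it entirely would not close the 2.0 GB gap.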


Embed this verdict

Live badge for your README or model card — recomputed by the engine, never stale:

[![fits: Qwen 3.6 35B-A3B on RTX 4090](https://img.shields.io/endpoint?url=https%3A%2F%2Ffitllm.run%2Fapi%2Fbadge%3Fmodel%3DQwen%25203.6%252035B-A3B%26gpu%3DRTX%25204090)](https://fitllm.run/can-i-run/qwen-3-6-35b-a3b-on-rtx-4090)


Or from your terminal (exit 0/1 — works as a pre-download guard):

npx fitllm "Qwen 3.6 35B-A3B" --gpu "RTX 4090"
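As a rough sketch of the pre-download guard idea, the command above could be wrapped in a script that checks its exit status before fetching any weights. This assumes exit status 0 means "fits" and 1 means "does not fit"; the page only says the command exits 0/1, so verify that mapping before relying on it. The function name `model_fits` and the injectable `run` parameter are hypothetical, added here only to keep the example testable.

```python
import subprocess


def model_fits(model, gpu, run=subprocess.run):
    """Return True when the fitllm CLI exits with status 0.

    Assumption: status 0 means "fits" and 1 means "does not fit".
    The page states only that the command exits 0/1.
    """
    cmd = ["npx", "fitllm", model, "--gpu", gpu]
    result = run(cmd)
    return result.returncode == 0


# Example use before a large download (requires Node.js and npx):
#
#   if not model_fits("Qwen 3.6 35B-A3B", "RTX 4090"):
#       raise SystemExit("Model will not fit; skipping download.")
```

Passing `run` as a parameter lets the check be exercised without invoking `npx`, which is convenient in CI.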

Why most VRAM calculators get this wrong

Qwen 3.6 35B-A3B is a hybrid model: only 10 of its 40 layers use full attention — the rest are linear and keep no growing KV cache. Naive calculators count every layer at full context and badly over-estimate.
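The size of the over-estimate can be illustrated with the usual KV-cache formula: two tensors (K and V) per attention layer, each holding kv_heads × head_dim values per token. In the sketch below, the 40 and 10 layer counts come from this page; the KV head count and head dimension are hypothetical placeholders, not values read from the model's config.json, and any fixed-size state kept by the linear layers is ignored.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, context_tokens,
                   bytes_per_value=2):
    """Bytes of growing KV cache: K and V for every full-attention layer.

    bytes_per_value=2 corresponds to an F16 cache.
    """
    return 2 * attn_layers * kv_heads * head_dim * context_tokens * bytes_per_value


TOTAL_LAYERS = 40            # from this page
FULL_ATTENTION_LAYERS = 10   # from this page
KV_HEADS = 2                 # hypothetical placeholder, not from the page
HEAD_DIM = 256               # hypothetical placeholder, not from the page
CONTEXT_TOKENS = 32 * 1024

# A naive calculator charges every layer; a hybrid-aware one charges
# only the layers that actually keep a growing cache.
naive = kv_cache_bytes(TOTAL_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)
hybrid = kv_cache_bytes(FULL_ATTENTION_LAYERS, KV_HEADS, HEAD_DIM, CONTEXT_TOKENS)

print(f"naive estimate:  {naive / 2**30:.2f} GiB")
print(f"hybrid estimate: {hybrid / 2**30:.2f} GiB")
print(f"over-estimate factor: {naive / hybrid:.1f}x")
```

Whatever the real head counts are, the ratio depends only on the layer counts: charging 40 layers instead of 10 inflates the KV estimate by a factor of 4.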

What fits on the RTX 4090 instead

Same GPU: models that fit on the RTX 4090 are GLM-4.7-Flash, gpt-oss-20b, Qwen 3.6 27B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, Gemma 4 26b A4B, Llama-3.2-3B-Instruct, Llama-3.1-8B-Instruct, MiniCPM5-1B, Qwen3-0.6B, Qwen3-1.7B, Llama-3.2-1B-Instruct, Gemma-3-1B-it.

Same model: GPUs that run Qwen 3.6 35B-A3B are RTX 5090 (32GB), RTX 6000 Ada (48GB), RTX PRO 6000 Blackwell (96GB), Radeon PRO W7900 (48GB), 2× RTX 3090 (48GB), 2× RTX 4090 (48GB), 4× RTX 3090 (96GB), A100 40GB (40GB), A100 80GB (80GB), H100 80GB (80GB), H200 141GB (141GB), B200 (180GB).

Reproduce it

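Inputs: Qwen 3.6 35B-A3B has 35B total parameters (3B active per token, MoE) across 40 layers. The RTX 4090 has 24 GB of VRAM and 1008 GB/s of memory bandwidth. The math is the same as in the open-source fitllm-engine; GGUF bits-per-weight (bpw) values come from llama.cpp.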

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates; real usage varies with the runtime (llama.cpp / MLX / Ollama), the driver, and whether the GPU also drives a display. Found a mismatch? Report it.