FitLLM

Can I run Qwen 3.6 27B on an RTX 3090 (24GB)?

✅ Yes, it fits: up to ~34K tokens of context at Q4_K_M.

Computed with the open-source FitLLM engine, which models the KV cache layer by layer instead of applying a naive estimate. Updated 2026-06.

Memory breakdown (Q4_K_M, F16 KV, 33K context)

Model weights: 15.5 GB
KV cache: 2.0 GB
Runtime overhead + reserve: 5.1 GB
Total used: 22.6 / 24 GB
Free: 1.4 GB
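The breakdown is simple addition checked against the card's 24 GB. A minimal Python sketch of that check, plugging in the Q4_K_M figures above; the `budget` helper is hypothetical and not part of fitllm-engine's API:

```python
def budget(weights_gb: float, kv_gb: float, overhead_gb: float, vram_gb: float) -> tuple[float, float]:
    """Return (total used, free) in GB for one configuration."""
    used = weights_gb + kv_gb + overhead_gb
    return used, vram_gb - used


# Q4_K_M weights, F16 KV cache at ~33K context, runtime overhead + reserve, RTX 3090.
used, free = budget(15.5, 2.0, 5.1, 24.0)
print(f"used {used:.1f} / 24 GB, free {free:.1f} GB, fits: {free >= 0}")
```

It prints `used 22.6 / 24 GB, free 1.4 GB, fits: True`. A configuration fits only while the free figure stays non-negative.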

Max context that fits at Q4_K_M: ~34K tokens, or ~53K tokens with a Q8 KV cache.

Every quantization on the RTX 3090

Weight quant | Weights | Fits (F16 KV) | Used @32K
--- | --- | --- | ---
Q4_K_M | 15.5 GB | ✅ up to ~34K ctx | 22.6 / 24.0 GB
Q5_K_M | 18.1 GB | ❌ only up to ~6K ctx | 25.5 / 24.0 GB
Q6_K | 20.8 GB | ❌ won't fit | 28.6 / 24.0 GB
Q8_0 | 26.9 GB | ❌ won't fit | 35.4 / 24.0 GB
FP16 | 50.7 GB | ❌ won't fit | 62.0 / 24.0 GB

The KV cache is F16 here (the llama.cpp default). Quantize it to Q8 or Q4 with -ctk / -ctv to fit more context.
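Switching the cache type changes the bytes stored per cached element. In llama.cpp, `q8_0` and `q4_0` store blocks of 32 values plus one fp16 scale (34 and 18 bytes per block). The sketch below only rescales the KV term and holds everything else fixed, so treat it as a KV-only approximation: a pure rescale would take ~34K to ~64K tokens with Q8, more than the page's ~53K, which suggests the engine also budgets memory that grows with context. The helper name `kv_context_scale` is hypothetical.

```python
# Bytes per cached element for llama.cpp KV cache types.
# q8_0 / q4_0 pack 32 values per block plus one 2-byte fp16 scale.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 x int8 + 2-byte scale
    "q4_0": 18 / 32,  # 32 x 4-bit (16 bytes) + 2-byte scale
}


def kv_context_scale(cache_type: str) -> float:
    """How many times more tokens fit in the same KV byte budget than with F16."""
    return BYTES_PER_ELEM["f16"] / BYTES_PER_ELEM[cache_type]


for t in ("f16", "q8_0", "q4_0"):
    print(f"{t}: {kv_context_scale(t):.2f}x tokens for the same KV budget")
```

Q8 gives roughly 1.9x the tokens of F16 for the same bytes, Q4 roughly 3.6x, at some cost in accuracy.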


Why most VRAM calculators get this wrong

Qwen 3.6 27B is a hybrid model: only 16 of its 64 layers use full attention. The rest use linear attention, which keeps a fixed-size state instead of a KV cache that grows with context. Naive calculators count all 64 layers at full context and so overestimate the KV cache by about 4× (64 / 16).
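A per-layer count is what separates the two estimates. The sketch below sums the KV cache only over full-attention layers and compares it with a naive all-layers count. Only the 16-of-64 split comes from this page; the layer ordering (every fourth layer full attention), the KV-head count and the head dimension are hypothetical placeholders, not values from Qwen's config.json.

```python
# Hypothetical per-layer KV-cache estimate for a hybrid model.
# From the page: 64 layers, 16 of them full attention.
# Placeholders: layer order, KV-head count, head dimension.
N_LAYERS = 64
LAYER_TYPES = (["linear_attention"] * 3 + ["full_attention"]) * (N_LAYERS // 4)


def kv_cache_bytes(layer_types, ctx_tokens, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Sum K + V cache over layers that keep a growing KV cache.

    Linear-attention layers hold a fixed-size state that does not grow
    with ctx_tokens; it is ignored here.
    """
    per_layer = 2 * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem  # K and V
    return sum(per_layer for t in layer_types if t == "full_attention")


def naive_kv_cache_bytes(n_layers, ctx_tokens, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """What a calculator treating every layer as full attention would report."""
    return n_layers * 2 * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem


ctx, heads, dim = 32 * 1024, 4, 256  # hypothetical values
hybrid = kv_cache_bytes(LAYER_TYPES, ctx, heads, dim)
naive = naive_kv_cache_bytes(N_LAYERS, ctx, heads, dim)
print(f"hybrid {hybrid / 2**30:.2f} GiB vs naive {naive / 2**30:.2f} GiB ({naive / hybrid:.0f}x)")
```

Whatever placeholder values you pick, the naive figure comes out 64 / 16 = 4 times the hybrid one, because both formulas share the same per-layer term.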

Other options

Same GPU: models that fit on the RTX 3090 include Qwen 3.6 27B, Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b, and Gemma 4 26b A4B.

Same model: GPUs that run Qwen 3.6 27B include the RTX 5090 (32GB), RTX 4090 (24GB), RTX 3090 (24GB), and RTX 3090 Ti (24GB).

Reproduce it

Qwen 3.6 27B has 27.2B parameters and 64 layers. The RTX 3090 has 24 GB of VRAM and 936 GB/s of memory bandwidth. The same math is open source in fitllm-engine; GGUF bits-per-weight (bpw) values come from llama.cpp.
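The Weights column is parameters × bits per weight / 8. The table's sizes line up with GiB (2^30 bytes): 27.2B parameters at FP16 is 54.4e9 bytes, about 50.7 GiB. In the sketch below, Q8_0 (8.5 bpw) and Q6_K (6.5625 bpw) follow from llama.cpp's block layouts (34 bytes per 32 weights and 210 bytes per 256 weights); Q4_K_M and Q5_K_M mix tensor types, so their bpw values here are approximations back-derived from the table, not official figures. `weights_gib` is a hypothetical helper.

```python
# Weight size = parameters x bits-per-weight / 8, reported in GiB (2**30 bytes).
PARAMS = 27.2e9

APPROX_BPW = {
    "Q4_K_M": 4.89,   # approximate, back-derived from the table
    "Q5_K_M": 5.72,   # approximate, back-derived from the table
    "Q6_K": 6.5625,   # 210 bytes per 256-weight block
    "Q8_0": 8.5,      # 34 bytes per 32-weight block
    "FP16": 16.0,
}


def weights_gib(params: float, bpw: float) -> float:
    """Size of the weights in GiB for a given effective bits-per-weight."""
    return params * bpw / 8 / 2**30


for quant, bpw in APPROX_BPW.items():
    print(f"{quant:7s} {weights_gib(PARAMS, bpw):5.1f} GB")
```

Rounded to one decimal, these reproduce the Weights column above (15.5, 18.1, 20.8, 26.9, 50.7).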

All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. They are estimates: real usage varies with the runtime (llama.cpp, MLX, Ollama), the driver, and whether the GPU also drives a display. Found a mismatch? Report it.