Computed with the open FitLLM engine — accurate per-layer KV-cache modeling, not a naive estimate. Updated 2026-06.
| VRAM breakdown (Q4_K_M, 32K context) | Size |
|---|---|
| Model weights | 15.5 GB |
| KV cache | 2.0 GB |
| Runtime overhead + reserve | 5.1 GB |
| Total used | 22.6 / 12.0 GB |
| Short by | 10.6 GB |
Max context that fits at Q4_K_M: none. The weights alone (15.5 GB) exceed the card's 12 GB, so the model does not fit at any context length.
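The totals in the breakdown follow from simple addition. The sketch below reproduces them from the three listed components; the function name `vram_budget` is made up for illustration and is not part of fitllm-engine.

```python
def vram_budget(weights_gb, kv_cache_gb, overhead_gb, capacity_gb):
    """Return (total_used, shortfall) in GB; shortfall is 0.0 when everything fits."""
    total = weights_gb + kv_cache_gb + overhead_gb
    shortfall = max(0.0, total - capacity_gb)
    # Round to one decimal place, matching the precision of the table.
    return round(total, 1), round(shortfall, 1)


# Figures from the breakdown: Q4_K_M weights, F16 KV cache at 32K, overhead + reserve.
total_used, short_by = vram_budget(15.5, 2.0, 5.1, 12.0)
print(f"total used: {total_used} GB, short by: {short_by} GB")

# If the weights alone exceed capacity, no context length can fit.
weights_alone_fit = 15.5 <= 12.0
print(f"weights alone fit: {weights_alone_fit}")
```

Because the weights by themselves are larger than the card's memory, shrinking the context or the KV cache cannot make this configuration fit.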
| Weight quant | Weights | Fits (KV F16) | Used @32K |
|---|---|---|---|
| Q4_K_M | 15.5 GB | ❌ won't fit | 22.6 / 12.0 GB |
| Q5_K_M | 18.1 GB | ❌ won't fit | 25.5 / 12.0 GB |
| Q6_K | 20.8 GB | ❌ won't fit | 28.6 / 12.0 GB |
| Q8_0 | 26.9 GB | ❌ won't fit | 35.4 / 12.0 GB |
| FP16 | 50.7 GB | ❌ won't fit | 62.0 / 12.0 GB |
KV cache is F16 here (llama.cpp default). Quantizing it to q8_0 or q4_0 (-ctk/-ctv) shrinks the cache and leaves room for more context on GPUs where the weights already fit.
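To see how much a quantized KV cache would save, the sketch below scales the 2.0 GB F16 figure by the storage cost of each cache type. The bits-per-element values come from the ggml block layouts (q8_0 stores 32 values in 34 bytes, q4_0 stores 32 values in 18 bytes); the helper name `kv_cache_size_gb` is hypothetical.

```python
# Storage cost per cached element, in bits.
BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,   # 34 bytes per block of 32 values
    "q4_0": 4.5,   # 18 bytes per block of 32 values
}


def kv_cache_size_gb(f16_size_gb, cache_type):
    """Scale an F16 KV-cache size to another cache type."""
    return f16_size_gb * BITS_PER_ELEMENT[cache_type] / 16.0


F16_KV_GB = 2.0      # KV cache at 32K from the breakdown
SHORTFALL_GB = 10.6  # "Short by" figure at Q4_K_M

for cache_type in BITS_PER_ELEMENT:
    size = kv_cache_size_gb(F16_KV_GB, cache_type)
    saved = F16_KV_GB - size
    closes_gap = saved >= SHORTFALL_GB
    print(f"{cache_type}: {size:.2f} GB (saves {saved:.2f} GB, closes gap: {closes_gap})")
```

Even q4_0 saves well under 2 GB here, so on this card KV quantization alone does not cover the 10.6 GB shortfall.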
Qwen 3.6 27B is a hybrid model: only 16 of its 64 layers use full attention. The rest are linear-attention layers and keep no growing KV cache. Naive calculators count every layer at full context, which for this model over-estimates the KV cache by roughly 4x (64 layers counted instead of 16).
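The difference between per-layer and naive counting can be sketched with the standard KV-cache size formula. The layer counts (16 of 64) are from the text; the KV head count and head dimension below are illustrative assumptions chosen so the result lands on the 2.0 GB shown in the table, not values read from the model's config.json. The sketch also ignores whatever fixed-size state the linear layers keep.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2):
    """Size of a growing KV cache: keys and values for every full-attention layer."""
    # Factor 2 covers the separate key and value tensors.
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


N_KV_HEADS = 4       # assumption, for illustration only
HEAD_DIM = 256       # assumption, for illustration only
CONTEXT = 32 * 1024  # 32K tokens
F16_BYTES = 2

# Per-layer modeling: only the 16 full-attention layers grow with context.
hybrid = kv_cache_bytes(16, N_KV_HEADS, HEAD_DIM, CONTEXT, F16_BYTES)
# Naive modeling: all 64 layers are counted as full attention.
naive = kv_cache_bytes(64, N_KV_HEADS, HEAD_DIM, CONTEXT, F16_BYTES)

print(f"hybrid: {hybrid / GIB:.1f} GiB, naive: {naive / GIB:.1f} GiB")
print(f"over-estimate factor: {naive / hybrid:.0f}x")
```

Whatever the real head configuration is, the ratio between the two estimates depends only on the layer counts, so the 4x factor holds as long as all full-attention layers share the same shape.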
Models that fit on the same GPU (RTX 3060 12GB): Gemma 4 e2b, Gemma 4 e4b, Gemma 4 12b.
GPUs that run the same model (Qwen 3.6 27B): RTX 5090 (32GB), RTX 4090 (24GB), RTX 3090 (24GB), RTX 3090 Ti (24GB).
Qwen 3.6 27B has 27.2B parameters and 64 layers. The RTX 3060 12GB has 12 GB of VRAM and 360 GB/s of memory bandwidth. Same math, open source: fitllm-engine. GGUF bits-per-weight (bpw) values come from llama.cpp.
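Weight size follows from parameter count times bits per weight. The sketch below checks this for the three formats whose bpw is fixed by their storage layout (FP16 at 16, Q8_0 at 8.5, Q6_K at 6.5625). The K_M mixes are left out because their effective bpw depends on which tensors get which type. Treating the table's "GB" as binary gigabytes (2^30 bytes) is an inference from the arithmetic, not something the page states, and `weights_gib` is a hypothetical helper name.

```python
GIB = 1024 ** 3
N_PARAMS = 27.2e9  # rounded parameter count quoted above

# Bits per weight fixed by the block layout of each format.
BPW = {
    "Q6_K": 6.5625,  # 210 bytes per block of 256 weights
    "Q8_0": 8.5,     # 34 bytes per block of 32 weights
    "FP16": 16.0,
}


def weights_gib(n_params, bits_per_weight):
    """Approximate weight size in GiB for a uniform bits-per-weight format."""
    return n_params * bits_per_weight / 8 / GIB


for name, bpw in BPW.items():
    print(f"{name}: {weights_gib(N_PARAMS, bpw):.1f} GiB")
```

With the rounded 27.2B count, these land on the 20.8, 26.9, and 50.7 figures in the quantization table, which is what suggests the binary-gigabyte reading.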
All numbers are computed by the open-source fitllm-engine (MIT) from official model config.json values, so you can reproduce or audit them yourself. These are estimates; real usage varies with runtime (llama.cpp / MLX / Ollama), driver, and display. Found a mismatch? Report it.