RTX 5090 vs RTX 5080 for local LLMs

RTX 5090: fits 11 of 14 models · RTX 5080: fits 4 of 14 (at ~4-bit, 8K context)

Computed with the open-source FitLLM engine, which models the KV cache layer by layer instead of applying a naive estimate. Updated 2026-07-16.
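To make "per-layer KV-cache modeling" concrete, here is a minimal sketch of the usual arithmetic. It is not FitLLM's code. It assumes each layer caches one key and one value vector per KV head per token, that sliding-window layers cache at most `window` tokens, and that cache entries are 2 bytes (fp16). The 24-layer model below, with its head counts and 128-token window, is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Layer:
    kv_heads: int                 # key/value heads (fewer than query heads under GQA)
    head_dim: int                 # width of each head
    window: Optional[int] = None  # sliding-window size; None means full attention


def kv_cache_bytes(layers: list[Layer], context: int, bytes_per_elem: int = 2) -> int:
    """Sum KV-cache bytes layer by layer for `context` cached tokens."""
    total = 0
    for layer in layers:
        # A sliding-window layer never holds more than `window` tokens.
        tokens = context if layer.window is None else min(context, layer.window)
        # 2 tensors (K and V) x heads x head_dim x tokens x bytes per element
        total += 2 * layer.kv_heads * layer.head_dim * tokens * bytes_per_elem
    return total


# Hypothetical model: 24 layers, even-indexed ones use a 128-token sliding window.
mixed = [Layer(8, 64, 128 if i % 2 == 0 else None) for i in range(24)]
naive = kv_cache_bytes([Layer(8, 64) for _ in range(24)], 8192)  # all full attention
per_layer = kv_cache_bytes(mixed, 8192)
print(f"naive: {naive / 2**20:.0f} MiB, per-layer: {per_layer / 2**20:.0f} MiB")
# prints "naive: 384 MiB, per-layer: 195 MiB"
```

Treating every layer as full attention roughly doubles the cache in this example, which is why models that mix sliding-window and full-attention layers can fit much longer contexts than a flat estimate suggests.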

FitLLM compares fit (what loads into memory), computed from each model's official config.json. Treat the figures as a floor, not a guarantee; speed and power draw are not estimated.
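The KV-cache side of a fit calculation needs only a few fields from config.json. The sketch below assumes the common Hugging Face field names `num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `head_dim`, `sliding_window` and `layer_types`. Real configs vary (some nest these under `text_config`, some omit `head_dim`), so the fallbacks here are assumptions, not FitLLM's parser. The example config is hypothetical.

```python
import json
from typing import Optional


def kv_layout(cfg: dict) -> list[tuple[int, int, Optional[int]]]:
    """Return (kv_heads, head_dim, window) for each layer; window None = full attention."""
    cfg = cfg.get("text_config", cfg)  # multimodal configs often nest the language model
    n_layers = cfg["num_hidden_layers"]
    n_heads = cfg["num_attention_heads"]
    kv_heads = cfg.get("num_key_value_heads", n_heads)  # absent: plain multi-head attention
    head_dim = cfg.get("head_dim") or cfg["hidden_size"] // n_heads
    window = cfg.get("sliding_window")
    # Without an explicit layer_types list, conservatively assume full attention everywhere.
    types = cfg.get("layer_types") or ["full_attention"] * n_layers
    return [
        (kv_heads, head_dim, window if t == "sliding_attention" else None)
        for t in types
    ]


raw = json.loads("""{
  "num_hidden_layers": 4, "num_attention_heads": 16, "num_key_value_heads": 4,
  "hidden_size": 2048, "sliding_window": 512,
  "layer_types": ["sliding_attention", "full_attention",
                  "sliding_attention", "full_attention"]
}""")
print(kv_layout(raw))
# prints [(4, 128, 512), (4, 128, None), (4, 128, 512), (4, 128, None)]
```

Ignoring `sliding_window` when no `layer_types` list is present errs toward a larger cache, which keeps the result a floor rather than an optimistic guess.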

The two cards

	RTX 5090	RTX 5080
VRAM	32 GB	16 GB
Memory bandwidth (governs speed, which is not estimated here)	1792 GB/s	960 GB/s

What each runs (~4-bit, longest context that fits; GB = memory needed / card VRAM)

Model	RTX 5090	RTX 5080
Hy3	❌ won't fit · 196/32 GB	❌ won't fit · 196/16 GB
GLM-5.2	❌ won't fit · 484/32 GB	❌ won't fit · 484/16 GB
GLM-4.7-Flash	✅ up to 105K · 21.9/32 GB	❌ won't fit · 21.9/16 GB
gpt-oss-20b	✅ up to 131K · 15.9/32 GB	⚠️ won't fit · 15.9/16 GB
gpt-oss-120b	❌ won't fit · 77.2/32 GB	❌ won't fit · 77.2/16 GB
Qwen 3.6 35B-A3B	✅ up to 117K · 24.8/32 GB	❌ won't fit · 24.8/16 GB
Qwen 3.6 27B	✅ up to 110K · 20.2/32 GB	❌ won't fit · 20.2/16 GB
Qwen-AgentWorld-35B-A3B	✅ up to 120K · 24.6/32 GB	❌ won't fit · 24.6/16 GB
Gemma 4 31b	✅ up to 67K · 23.5/32 GB	❌ won't fit · 23.5/16 GB
Gemma 4 26b A4B	✅ up to 229K · 18.9/32 GB	❌ won't fit · 18.9/16 GB
Gemma 4 12b	✅ up to 262K · 10.4/32 GB	✅ up to 110K · 10.4/16 GB
Llama-3.1-8B-Instruct	✅ up to 131K · 8.5/32 GB	✅ up to 48K · 8.5/16 GB
Llama-3.2-3B-Instruct	✅ up to 131K · 5.3/32 GB	✅ up to 73K · 5.3/16 GB
MiniCPM5-1B	✅ up to 131K · 3.2/32 GB	✅ up to 131K · 3.2/16 GB
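The "up to" figures answer a search problem: the longest context whose weights plus KV cache stay within the card's VRAM, capped at the model's own context limit. A minimal sketch, assuming a fixed overhead reserve and a KV cost that never shrinks as context grows. The 4.5 GiB of weights, 1.5 GiB reserve, 128 KiB per token and 128K limit are hypothetical values, not FitLLM's settings or any row in the table.

```python
from typing import Callable

GIB = 1024**3


def max_context(vram: float, weights: float, kv_bytes: Callable[[int], int],
                limit: int, reserve: float = 1.5 * GIB) -> int:
    """Largest context in [0, limit] with weights + KV cache + reserve <= vram.

    Binary search is valid because KV-cache size never decreases with context.
    Returns 0 when even an empty cache does not fit.
    """
    budget = vram - weights - reserve
    if budget < kv_bytes(0):
        return 0
    lo, hi = 0, limit  # invariant: context `lo` fits
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if kv_bytes(mid) <= budget:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Hypothetical model: 4.5 GiB of ~4-bit weights, flat 128 KiB of KV cache per token.
per_token = 128 * 1024
for name, vram in [("32 GiB card", 32 * GIB), ("16 GiB card", 16 * GIB)]:
    print(name, max_context(vram, 4.5 * GIB, lambda n: n * per_token, 131_072))
# prints "32 GiB card 131072" then "16 GiB card 81920"
```

A per-layer cost function that caps sliding-window layers can replace the flat lambda unchanged, since the search only relies on cost growing with context.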

Only the RTX 5090 runs: GLM-4.7-Flash, gpt-oss-20b, Qwen 3.6 35B-A3B, Qwen 3.6 27B, Qwen-AgentWorld-35B-A3B, Gemma 4 31b, Gemma 4 26b A4B. Of these, gpt-oss-20b misses the RTX 5080 by the narrowest margin (15.9 of 16 GB).

Bottom line

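For local LLMs, more VRAM means more models and longer context. At ~4-bit, the RTX 5090's 32 GB runs 11 of these 14 models versus 4 on the 16 GB RTX 5080, and where both cards fit a model the 5090 allows longer context (Llama-3.1-8B-Instruct: 131K vs 48K tokens). Match the card to the model you actually want to run; see the per-model fit pages.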

All numbers are computed by the open-source fitllm-engine (MIT license) from official model config.json values, so you can reproduce or audit them yourself. They are estimates: real usage varies with runtime (llama.cpp, MLX, Ollama), driver, and display. Found a mismatch? Please report it.