I have a single 3090!
That’s the dream GPU, these days.
And I have 128GB CPU RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.
…But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (a high quality “trellis” quantization type, but very slow on CPU), with a gigantic model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough to read as it streams, barely fits in memory” for this model.
For speed, or prompts with lots of thinking or context (like agentic use), I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
…This is what I meant to emphasize.
It’s not just the hardware. You kinda have to be part developer, part enthusiast to even follow this stuff, set it up optimally, and keep it up-to-date. If you just try to Google “best LLM for 3090,” you will get absolute garbage.
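If you want a rough feel for the “does it fit” math above, here is a back-of-the-envelope sketch. The bits-per-weight numbers (about 3.125 for an IQ3_KT-class quant, about 4.5 for a typical 4-bit quant) and the 4 GB of headroom for KV cache and buffers are my assumptions, not measured values; real GGUF files mix quant types per tensor, so actual sizes will differ.

```python
def quant_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # params (in billions) * bits per weight / 8 bits per byte = billions of bytes
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         budget_gb: float, overhead_gb: float = 0.0) -> bool:
    """True if the weights plus a guessed overhead fit in the memory budget."""
    return quant_size_gb(params_billion, bits_per_weight) + overhead_gb <= budget_gb


if __name__ == "__main__":
    VRAM_GB = 24    # single 3090
    RAM_GB = 128    # system RAM

    # 300B model at an assumed ~3.125 bits per weight: needs RAM + VRAM
    big = quant_size_gb(300, 3.125)
    print(f"300B @ 3.125 bpw: ~{big:.1f} GB of weights")
    print("fits in VRAM alone:", fits(300, 3.125, VRAM_GB))
    print("fits in RAM + VRAM:", fits(300, 3.125, RAM_GB + VRAM_GB, overhead_gb=4))

    # 27B model at an assumed ~4.5 bits per weight: GPU only
    small = quant_size_gb(27, 4.5)
    print(f"27B @ 4.5 bpw: ~{small:.1f} GB of weights")
    print("fits in VRAM alone:", fits(27, 4.5, VRAM_GB, overhead_gb=4))
```

This only covers weights plus a flat guess for everything else. Long contexts eat a lot more memory for KV cache, which is exactly where the backend and quant choice start to matter.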






Well, it would be massively, massively better if they did some basic validation and tuning in a Proton environment.
Thousands of open-source-dev man-hours spent patching in hacky workarounds for Windows games is not ideal; it’d be far easier for the game dev to fix things (or raise issues) from their end. And those Proton devs have better things they could be doing.