Yup, I’m posting another this week. Sorry.
This week I’m hoping we can wrangle a solution around AI and our selfhosted community. There are plenty of strong opinions (both pro and con), but one thing is for certain - there needs to be better disclosure in promo posts. Two options (that aren’t mutually exclusive):
- Any post about AI-focused, AI-developed, etc. software gets an [AI] tag. No, a [Not-AI] tag is not needed to accomplish this; that’s kind of a “non-golfer” sort of tag.
- A comment on every promo post requesting an AI disclosure response, if it’s not detailed in the post itself. Specifics (generating docs for commands, translation, whole-boat vibe-coded this app, etc.) would be requested.
I will say that having disclosure and/or tagging would mean that comments that just say “slop” or “fuck ai” or whatever would be off topic at that point. That information is already provided, so it’s just noise (and sometimes pretty uncivil; I’ve been light on that for now due to the need for a rule on this).
The [AI] tag would make it easy to filter out (or search for, if that’s your thing), but there is a wildly different degree of AI use out there. For the posts with a positive score, it’s usually due to responsible AI use (translations, a snippet they had to do something obscure with, available to use with AI but doesn’t require it, whatever), which is why I think the disclosure has a place as a benefit to everyone.
Please provide any input or alternative options on this, and I can then put it to a vote like the last one. Comments seem to be the best approach without involving something off-site, but if you have a better idea/option, please share.


I have a single 3090!
That’s the dream GPU, these days.
And I have 128GB CPU RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.
…But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (a high quality “trellis” quantization type, but very slow on CPU), with a gigantic model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough to read as it streams, barely fits in memory” for this model.
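To make the “barely fits” part concrete, here is a rough back-of-the-envelope sketch. The 3.5 and 4.5 bits-per-weight figures are illustrative assumptions for an effective average across tensors, not measured values for IQ3_KT or any other specific quant, and the function names are made up for this example.

```python
def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    # params * bits / 8 = bytes; the 1e9 for "billions" and for "GB" cancel out.
    return params_billions * bits_per_weight / 8


def fits_in_memory(params_billions: float, bits_per_weight: float,
                   ram_gb: float, vram_gb: float, reserve_gb: float = 0.0) -> bool:
    """True if the weights plus a reserve (KV cache, OS, etc.) fit in RAM + VRAM."""
    needed = quantized_size_gb(params_billions, bits_per_weight) + reserve_gb
    return needed <= ram_gb + vram_gb


# Assumed effective bits per weight; real quants mix tensor types.
for bpw in (3.5, 4.5):
    size = quantized_size_gb(300, bpw)
    ok = fits_in_memory(300, bpw, ram_gb=128, vram_gb=24)
    print(f"300B @ {bpw} bpw: ~{size:.1f} GB, fits in 128 GB + 24 GB: {ok}")
```

Under those assumptions, a ~3-bit quant of a 300B model only fits once RAM and VRAM are pooled, and a ~4.5-bit one does not fit at all, which is roughly why the quant choice matters so much here.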
For speed, or prompts with lots of thinking or context (like agentic use), I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
…This is what I meant to emphasize.
It’s not just the hardware. You kinda have to be part developer, part enthusiast to even follow this stuff, set it up optimally, and keep it up-to-date. If you just try to Google “best LLM for 3090,” you will get absolute garbage.
I’m still impressed you got any MiMo to work at home, at 10 tok/s.
For those trying to visualise that -
https://mikeveerman.github.io/tokenspeed/?rate=10&mode=agent&think=10
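If you would rather see it as plain arithmetic, this is a tiny sketch of how long a reply takes to stream at a fixed rate. The token counts are arbitrary examples, and it assumes the rate stays constant, which it may not at long context.

```python
def seconds_to_stream(n_tokens: int, tokens_per_sec: float) -> float:
    """Time to generate n_tokens at a constant generation rate."""
    return n_tokens / tokens_per_sec


# Arbitrary example lengths: a short answer, a long answer, a long thinking trace.
for n in (200, 1_000, 5_000):
    print(f"{n} tokens at 10 tok/s: {seconds_to_stream(n, 10):.0f} s")
```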
Is it a constant 10 or does it (it must do, right?) drop off as context increases?
I imagine you must have compaction or something to mitigate that.
It drops off, but not as much as you’d think.
MiMo uses 5:1 SWA, so its long-context compute doesn’t increase as catastrophically as older models. That, and most of the “slowness” comes from the MoE layers being on CPU (whereas the attention layers that get heavier at high context are all on the 3090).
That’s the beauty of these MoEs: they’re just the right size for the “compute-lite” parts to stay in CPU RAM.
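To put rough numbers on the SWA point, here is a minimal sketch. It assumes “5:1” means five sliding-window layers for every full-attention layer, and it counts cached (layer, token) pairs as a proxy for per-token attention work. The 48 layers and 4096-token window are made-up placeholders, not MiMo’s actual config.

```python
def kv_entries(context_len: int, n_layers: int, window: int,
               swa_per_global: int = 5) -> int:
    """Count cached (layer, token) pairs a newly generated token attends over."""
    n_global = n_layers // (swa_per_global + 1)  # full-attention layers
    n_swa = n_layers - n_global                  # sliding-window layers
    # Full layers see the whole context; SWA layers see at most `window` tokens.
    return n_global * context_len + n_swa * min(context_len, window)


# Hypothetical config: 48 layers, 4096-token window (NOT MiMo's real values).
for ctx in (4_000, 32_000, 85_000):
    swa = kv_entries(ctx, n_layers=48, window=4096)
    full = 48 * ctx  # what an all-full-attention model would need
    print(f"{ctx:>6} ctx: {swa / full:.0%} of the all-full-attention cost")
```

Below the window size there is no saving, but past it only the full-attention layers keep growing with context, so the attention cost rises far more slowly than it would with full attention everywhere.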
I will measure it tomorrow. It is a constant ~9-10TPS for short queries, but definitely slower near my current max context of 85K.
And do you mean prompt compaction? I don’t automate that; when I use that particular model, I tend to use it in Mikupad, aka “raw” notepad mode, and manipulate the context directly. This is so I can do things like chop out conversations, pick different tokens from the logprobs, or edit its own replies/thinking and continue mid-reply.
I like manually handling this because, being a local model, prompts are cached. Streaming starts quickly if most of the prompt stays cached, which is actually a really nice advantage over APIs.
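A rough sketch of why that works, assuming a simple prefix-match cache where everything after the first changed token has to be reprocessed. That is my understanding of the common case; specific backends may be smarter about it. The function name and the fake token IDs are just for illustration.

```python
from typing import Sequence, Tuple


def cache_reuse(cached: Sequence[int], new_prompt: Sequence[int]) -> Tuple[int, int]:
    """Return (tokens reused from the cache, tokens that must be reprocessed)."""
    shared = 0
    for old, new in zip(cached, new_prompt):
        if old != new:
            break  # first mismatch invalidates everything after it
        shared += 1
    return shared, len(new_prompt) - shared


cached = list(range(1000))                  # stand-in for a 1000-token cached prompt
edited_tail = cached[:990] + [7, 7, 7]      # tweak the end of the last reply
chopped_early = cached[:10] + cached[200:]  # cut an old conversation near the top

print(cache_reuse(cached, edited_tail))
print(cache_reuse(cached, chopped_early))
```

Edits near the end keep almost the whole cache, while cutting something near the top forces nearly everything after it to be reprocessed, so it pays to know which kind of edit you are making.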