As always, the game plan seems to be “disrupt, own the market, enshittify”.
But with a slimy veneer of SEO/engagement spamming as the primary business strategy.
It drops off, but not as much as you’d think.
MiMo uses 5:1 SWA, so its long-context compute doesn’t increase as catastrophically as older models. That, and most of the “slowness” comes from the MoE layers being on CPU (whereas the attention layers that get heavier at high context are all on the 3090).
That’s the beauty of these MoEs: they’re just the right size for the “compute-lite” parts to stay in CPU RAM.
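To put rough numbers on the SWA point, here is a back-of-the-envelope sketch. It reads "5:1" as five sliding-window layers per global-attention layer, and the layer count and window size are hypothetical placeholders, not MiMo's actual config. It only counts how many cached positions a new token attends to, summed over layers.

```python
# Rough sketch: attention work per new token with a 5:1 SWA layout.
# n_layers and window are made-up placeholders, NOT MiMo's real config.

def attended_positions(context_len, n_layers=48, window=4096, swa_per_global=5):
    """KV positions one new token attends to, summed over all layers."""
    group = swa_per_global + 1
    n_global = n_layers // group      # one global layer per group of six
    n_swa = n_layers - n_global       # the rest are sliding-window layers
    return n_global * context_len + n_swa * min(context_len, window)

def full_attention_positions(context_len, n_layers=48):
    """Same count for a model where every layer attends to everything."""
    return n_layers * context_len

if __name__ == "__main__":
    for ctx in (4_096, 32_768, 85_000):
        ratio = attended_positions(ctx) / full_attention_positions(ctx)
        print(f"{ctx:>6} tokens: {ratio:.2f}x the attention work of an all-global model")
```

Under those assumptions the gap widens as context grows, since only the global layers keep scaling with context length.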
I will measure it tomorrow. It is a constant ~9-10TPS for short queries, but definitely slower near my current max context of 85K.
And do you mean prompt compaction? I don’t automate that; when I use that particular model, I tend to use it in Mikupad, aka “raw” notepad mode, and manipulate the context directly. This is so I can do things like chop out conversations, pick different tokens from the logprobs, or edit its own replies/thinking and continue mid reply.
I like manually handling this because, being a local model, prompts are cached. Streaming starts quickly if most of the prompt stays cached, which is actually a really nice advantage over APIs.
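A toy sketch of why this works, assuming the backend reuses the cache for the longest shared token prefix (real servers differ in the details, and the function names here are made up for illustration): only the tokens after the first changed position need to be processed again.

```python
# Toy model of prefix caching. Tokens are plain ints here; a real backend
# caches KV states per token, but the prefix-matching idea is the same.

def common_prefix_len(cached_tokens, new_tokens):
    """Number of leading tokens that are identical in both prompts."""
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

def tokens_to_reprocess(cached_tokens, new_tokens):
    """Tokens the backend has to run through the model again."""
    return len(new_tokens) - common_prefix_len(cached_tokens, new_tokens)

if __name__ == "__main__":
    cached = list(range(1000))              # pretend 1000-token prompt
    edited_tail = cached[:990] + [7, 7, 7]  # rewrite the last few tokens
    edited_head = [7] + cached[1:]          # change the very first token
    print(tokens_to_reprocess(cached, edited_tail))
    print(tokens_to_reprocess(cached, edited_head))
```

So editing the end of a reply is nearly free, while chopping something out near the top invalidates everything after it.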
Goood clanker. Pats my desktop.
All while hiding any attribution to the underlying engine, just to start:
https://sleepingrobots.com/dreams/stop-using-ollama/
And that article isn’t comprehensive. A book could be written on the damage and drama they’ve caused.


Well, it would be massively, massively better if they did some basic validation and tuning in a Proton environment.
Thousands of open-source-dev man-hours spent patching in hacky workarounds for Windows games is not ideal; it’d be far easier for the game dev to fix things (or raise issues) from their end. And those Proton devs have better things they could be doing.
I have a single 3090!
That’s the dream GPU, these days.
And I have 128GB CPU RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.
…But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (a high quality “trellis” quantization type, but very slow on CPU), with a gigantic model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough to read as it streams, barely fits in memory” for this model.
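For a sense of "barely fits," here is the napkin math as a sketch. The average bits-per-weight figure is an assumed placeholder for the whole mixed quant, not a measured value, and it treats GB loosely.

```python
# Napkin math: do the quantized weights fit in RAM + VRAM?
# AVG_BPW is an assumed average over the whole mixed quant, not measured.

def weights_gb(n_params, bits_per_weight):
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 300e9          # "a 300B model"
AVG_BPW = 3.3             # hypothetical average bits per weight
RAM_GB, VRAM_GB = 128, 24 # 128GB system RAM plus a 24GB 3090

if __name__ == "__main__":
    need = weights_gb(N_PARAMS, AVG_BPW)
    budget = RAM_GB + VRAM_GB
    print(f"weights: ~{need:.0f} GB of {budget} GB total")
    print(f"left for KV cache, OS, everything else: ~{budget - need:.0f} GB")
```

Whatever is left over has to hold the KV cache, compute buffers and the OS, which is why long context gets tight.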
For speed, or prompts with lots of thinking or context (like agentic use), I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.
…This is what I meant to emphasize.
It’s not just the hardware. You kinda have to be part developer, part enthusiast to even follow this stuff, set it up optimally, and keep it up-to-date. If you just try to Google “best LLM for 3090,” you will get absolute garbage.
You don’t even need Claude anymore. GLM 5.2 API is good enough for 95% of the same things and vastly cheaper.
MiMo 2.5 Pro and Kimi are also very good. And then there’s Cerebras API if you just want simple things done quick.
The thing with self hosting, while awesome, is that it requires a lot of hardware and considerable time investment for what’s essentially a “base tier model,” or at best one step down for what’s still a very cheap API. I still love it, especially the privacy and control aspect, but you aren’t running Claude at home unless you’ve got a threadripper or server hardware collecting dust.
…Hence I can understand why people don’t pursue it. Especially since a cursory Google search will lead you to trying the Deepseek distillation on Ollama (which is awful).
Trains for long haul + autotaxis (or air taxis) for short, low speed rides actually sounds pretty dope.
Oh, both! Yeah. I didn’t even think of that, but [AIT]/[AIP] as separate tags makes a lot of sense.
I’d like being able to filter by either, actually.
I guess two tags runs the risk of “rules too complex for some to follow,” but that’s more of a moderation load question. I have no say in that, heh.
Well, BIG TIMI is either a bot or has a serious Twitter problem, as he’s posted 31 times in the past 17 hours, every hour:
…And he pays for Twitter.
To bring intellectualism back into his life, maybe he should consider that.
Not that I have much of a leg to stand on, being on Lemmy a bit much, but still.
For what it’s worth, I asked my self-hosted LLM (MiMo 2.5, no network access outside my desktop), and it came up with [AIT] (AI-Topic).
…I think that’s my favorite so far. [AIP] would work too.
I feel like that “obfuscates” the tag enough to blunt impulse downvotes in /new and feeds, without being deceptive or anything.
I’m not consistent about it yet, but because of exactly this, I’m trying to differentiate the two when I talk.
Responsible automation? I use ML or machine learning.
The grift consuming the world? A Tech Bro? “AI”
I think one of the saddest things is the conflation between the two, like you can’t even talk about one without invoking the other. Or without it opening up that whole ethical debate, when you’re just talking about, like, a 100M transcription model trained by one researcher in some university on a potato.
Yeah. Just not sure what it should be, heh.
I will say, if it still has “AI” in the tag (like [LAI] or whatever), it would play nicer with keyword filters.
TBD indeed. But it will effectively ‘downrank’ posts and their visibility, maybe into the negative vote range. I’ve seen highly negative scores across the board in more machine-learning focused subs, and that’s without a tag that catches the eye so easily.
I think even modifying the acronym could make a difference, though (as I ninja edited).
Also:
Anything with an [AI] tag, first thing in the title, will have a drive-by downvote issue.
Not sure how to deal with that, or if it’s even a concern.
EDIT:
Maybe it should be something else that’s not such a loaded keyword?
[ML] for Machine Learning? [SAI]? [LAI]?
I’ve been messing with ‘AI’ for a decade, and even I hate what the term has come to represent.
There already are.
I’d argue that Lemmy and piefed need a “sub community” or community taxonomy structure, but that’s kinda out of scope here.
+1
Home-AI oriented channels like Reddit’s localllama are filled with self promotion garbage, and more will trickle here over time… I’m not even against self promo or heavy coding assistance, but 9-times-out-of-10, the linked repo is nonsense, or straight-up fraudulent. And being obviously vibe-coded is a common tell.
Good to get ahead of this.
Also, +1 on suppressing drive-by insults. If the post is tagged up front, there’s no need. That being said, it should be okay for users to call out an obvious grift, or a “nonsense repo” that’s actually pure slop.


Just a reminder that Altman and Anthropic LOVE this.
They want a world where all other LLM providers are smited out of existence.


Europe is kind of a quagmire, unfortunately. Not the amount of regulation, but the ambiguity; their domestic AI “players” are afraid of getting in legal trouble over nebulous wording and inconsistency.
If Trump goes after Huggingface and they run to France, that might change, though.
They 100% do. They’re probably serving “naive” FP8 via VLLM, which is worse than you’d think, especially if they flip on the awful FP8 KV cache.
In a local quant, you can stop quantized models from falling apart at higher CTX by leaving the attention heads at a higher-precision quantization. As an example, with MiMo 2.5, I have all the MoE MLP layers at IQ3_KT, the dense experts at Q6K, but all the attention layers at Q8_0.
For Qwen 27B, I’m still experimenting, but leaning towards IQ4_KT for the MLPs, Q6K for attention, and Q8_0 for the small, very sensitive KV heads. Or a similar scheme as an exl3 quant.
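A mixed scheme like the MiMo one boils down to "first matching rule wins" over tensor names. The sketch below is only an illustration of that idea: the tensor names follow GGUF-style naming as an assumption, mapping "dense experts" to the shared-expert tensors is my guess at the layout, and the fallback type is made up. It is not any quantization tool's actual config format.

```python
import re

# Illustrative only: ordered (pattern, quant type) rules, first match wins.
# Tensor names are GGUF-style placeholders, not read from a real model file.
RULES = [
    (r"attn_", "Q8_0"),            # attention tensors stay at high precision
    (r"ffn_.*_shexp", "Q6K"),      # assumed to be the "dense experts"
    (r"ffn_.*_exps", "IQ3_KT"),    # routed MoE MLP experts, the bulk of the model
]
DEFAULT = "Q8_0"                   # hypothetical fallback for everything else

def pick_quant(tensor_name):
    """Return the quant type of the first rule whose pattern matches."""
    for pattern, qtype in RULES:
        if re.search(pattern, tensor_name):
            return qtype
    return DEFAULT

if __name__ == "__main__":
    for name in ("blk.0.attn_q.weight",
                 "blk.0.ffn_down_shexp.weight",
                 "blk.0.ffn_up_exps.weight"):
        print(f"{name:32} -> {pick_quant(name)}")
```

The routed experts dominate the parameter count, so squeezing only those is where nearly all the size savings come from.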
That being said, sometimes even unquantized models fall apart in certain long context scenarios because the max advertised context is a lie. You just have to test them and see, but Qwen has certainly done this in the past.