

It’s the strategy!
It’s the 2020s. There’s no such thing as bad attention.
I mean, I saw the community, yet I still had to stop and think about it.
It’s “nottheonion” adjacent.


Yep.
Internet is “good enough” for P2P these days. I get that not everyone is in a great living situation, but less than 500K down/up is an outlier at this point.


SonoBus for voice chat.
It’s peer to peer, just works on everything, sounds better than Discord, and most importantly, is 100X less annoying because latency is so low.
I think it was designed for remote music collaboration, hence it feels like you’re talking to your friend in the room. There’s no compression if you don’t want it, and no awkwardly interrupting each other’s pauses because of audio delay.
Oh, and it’s free, and has no chat or emojis or anything.
That’s what gets upvotes on Lemmy, sadly.
This is how Voat (another Reddit clone) died. Political shitposts and clickbait tabloids crowded out every niche, so all the interesting content left.
As it turns out, doomscrolling twitter troll reposts with the same few comments in each one is quite depressing.
I don’t know a good solution, either. Clickbait works. Maybe some structural changes could help, though?


FYI, you can buy this: https://frame.work/products/framework-desktop-mainboard-amd-ryzen-ai-max-300-series?v=FRAFMK0002
And stick a regular Nvidia GPU on it. Or an AMD one.
That’d give you the option to batch renders across the integrated and discrete GPUs, if such a thing fits your workflow. Or to use one GPU while the other is busy. And if a particular model doesn’t play nice with AMD, it’d give you the option to use Nvidia + CPU offloading very effectively.
It’s only PCIe 4.0 x4, but that’s enough for most GPUs.
TBH I’m considering exactly this and hanging my venerable 3090 off the board, as I’m feeling the FOMO crunch of all hardware getting so expensive. And $2K for 16 cores with 128GB of ridiculously fast quad-channel RAM is not bad, even JUST as a CPU.


As a hobby mostly, but it’s useful for work. I found LLMs fascinating even before the hype, when everyone was trying to get GPT-J finetunes named after Star Trek characters to run.
Reading my own quote, I was being a bit dramatic. But at the very least it is super important to grasp some basic concepts (like MoE CPU offloading, quantization, and the specs of your own hardware), and to watch for new releases in LocalLlama or whatever. You kinda do have to follow and test things, yes, as there’s tons of FUD in open-weights AI land.
As an example, stepfun 2.5 seems to be a great model for my hardware (single Nvidia GPU + 128GB CPU RAM), and it could have easily flown under the radar if I weren’t following this stuff. I also wouldn’t have known to run it with ik_llama.cpp instead of mainline llama.cpp, for a considerable speed/quality boost over (say) LM Studio.
If I were to google all this now, I’d probably still get links for setting up the Deepseek distillations from Tech Bro YouTubers. That series is now dreadfully slow and long obsolete.


I dunno. Whatever the default was, so perhaps not?
But whatever Ublock Lite’s default is, that’s probably what 99% of folks are using.


Chinese electric cars were always going to take off, and RAM is just a commodity: if you sell the most bits at the lowest price and at sufficient speed, it works.
If you’re in edge machine learning, or if you write your own software stacks for niche stuff, Chinese hardware will be killer.
But if you’re trying to run Steam games? Or CUDA projects? That’s a whole different story. It doesn’t matter how good the hardware is; they’re always going to be handicapped on the software side for “legacy” code, not just in performance, but in driver bugs/quirks.
Proton (and focusing everything on a good Vulkan driver) is not a bad path forward, but still. They’re working against decades of dev work targeting AMD/Nvidia/Intel, up and down the stack.


Also, this has been the case (or at least planned) for a while.
Pascal (the GTX 1000 series) and Ampere (the RTX 3000 series) used the exact same architecture for datacenter and gaming. The big gaming dies were dual-use and datacenter-optimized. This habit sort of goes back to ~2008, but Ampere and the A100 are really where “datacenter first” took off.
AMD announced a plan to unify their datacenter/gaming architectures a while ago, and prioritized the MI300X before that. And EPYC has always been the priority, too.
Intel wanted to do this, but had some roadmap trouble.
These companies have always put datacenter first; it just took this much drama for most of the consumer segment to notice.


> I did find this calculator the other day
That calculator is total nonsense. Don’t trust anything like that; at best, it’s obsolete the week after it’s posted.
> I’d be hesitant to buy something just for AI that doesn’t also have RTX cores because I do a lot of Blender rendering. RDNA 5 is supposed to have more competitive RTX cores
Yeah, that’s a huge caveat. AMD Blender might be better than you think though, and you can use your RTX 4060 on a Strix Halo motherboard just fine. The CPU itself is incredible for any kind of workstation workload.
> along with NPU cores, so I guess my ideal would be a SoC with a ton of RAM
So far, NPUs have been useless. Don’t buy any of that marketing.
> I’m also not sure under 10 tokens per second will be usable, though I’ve never really tried it.
That’s still 5 words/second. That’s not a bad reading speed.
Whether it’s enough? That depends. GLM 350B without thinking is smarter than most models with thinking, so I end up with better answers faster.
But anyway, I get more like 20 tokens a second with models that aren’t squeezed into my rig within an inch of their life. If you buy an HEDT/server CPU with more RAM channels, it’s even faster.
If you want to look into the bleeding edge, start with https://github.com/ikawrakow/ik_llama.cpp/
And all the models on huggingface with the ik tag: https://huggingface.co/models?other=ik_llama.cpp&sort=modified
You’ll see instructions for running big models on a 4060 + RAM.
If you’re trying to like batch process documents quickly (so no CPU offloading), look at exl3s instead: https://huggingface.co/models?num_parameters=min%3A12B%2Cmax%3A32B&sort=modified&search=exl3
And run them with this: https://github.com/theroyallab/tabbyAPI
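tabbyAPI (like llama.cpp’s server) exposes an OpenAI-compatible endpoint, so once it’s up you can throw parallel requests at it and let the backend batch them. Here’s a rough Python sketch, assuming a server listening on localhost:5000; the port, API key, and model name are placeholders for whatever you configured:

```python
# Fire a handful of document-processing requests in parallel at a local
# OpenAI-compatible server (tabbyAPI here, but llama-server works the same way).
import requests
from concurrent.futures import ThreadPoolExecutor

URL = "http://localhost:5000/v1/chat/completions"    # adjust to your server's port
HEADERS = {"Authorization": "Bearer YOUR_TABBY_KEY"}  # placeholder key

def summarize(doc: str) -> str:
    payload = {
        "model": "your-exl3-quant",  # placeholder model name
        "messages": [
            {"role": "system", "content": "Summarize the document in three sentences."},
            {"role": "user", "content": doc},
        ],
        "max_tokens": 256,
    }
    r = requests.post(URL, json=payload, headers=HEADERS, timeout=300)
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]

docs = ["first document text", "second document text", "third document text"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for summary in pool.map(summarize, docs):
        print(summary)
```

Point being, the client side is trivial; all the finicky parts are in picking the quant and getting the server launched right.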


I mean, I’d kill for a Chinese GPU. But software lock-in for your Steam back catalog is strong.
Also, have you been watching all the Chinese GPU announcements? They’re all in on datacenter machine learning ASICs too.


This is not true. I have a single 3090 + 128GB CPU RAM (which wasn’t so expensive that long ago), and I can run GLM 4.6 350B at 6 tokens/sec, with measurably reasonable quantization quality. I can run sparser models like Stepfun 3.5, GLM Air or Minimax 2.1 much faster, and these are all better than the cheapest API models. I can batch Kimi Linear, Seed-OSS, Qwen3, and all sorts of models without any offloading for tons of speed.
…It’s not trivial to set up though. It’s definitely not turnkey. That’s the issue.
You can’t just do “ollama run” and expect good performance, as the local LLM scene is finicky and highly experimental. You have to compile forks and PRs, and learn about sampling, chat formatting, perplexity and KL divergence, quantization, MoEs, and benchmarking. Everything is moving too fast, and is too performance sensitive, to make it that easy, unfortunately.
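To give a sense of the arithmetic involved: the reason a 350B-parameter MoE fits on a 24GB GPU plus 128GB of system RAM at all is quantization, and you can sanity check a quant’s footprint on the back of an envelope. A rough sketch (the bits-per-weight figures are illustrative, and it ignores KV cache and runtime overhead):

```python
# Back-of-envelope memory estimate for a quantized model.
# Ignores KV cache, context length, and runtime overhead, so treat it as a floor.

def model_size_gib(params_billion: float, bits_per_weight: float) -> float:
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

budget_gib = 24 + 128  # 3090 VRAM + system RAM, roughly

for bpw in (3.0, 3.5, 4.5, 8.0):
    size = model_size_gib(350, bpw)
    fits = "fits" if size < budget_gib else "does not fit"
    print(f"350B @ {bpw} bpw = {size:.0f} GiB -> {fits} in ~{budget_gib} GiB")
```

So a ~3 to 3.5 bpw quant squeaks in, while anything near 8 bpw is hopeless without more RAM; that’s why quant quality (perplexity, KL divergence) matters so much at this scale.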
EDIT:
And if I were trying to get local LLMs set up today, for a lot of usage I’d probably buy an AI Max 395 motherboard instead of a GPU. They aren’t horrendously priced, and they don’t slurp power like a 3090. 96GB VRAM is the perfect size for all those ~250B MoEs.
But if you go AMD, take all the finickiness of an Nvidia setup and multiply it by 10. You’d better know your way around pip and Linux; if you don’t get it exactly right, performance will be horrendous, and many setups just won’t work anyway.


It depends on the application.
Do you have some apps that are inactive for long periods of time and then “wake up”? Better to do it at the highest level. That gives the OS the power to essentially shunt whole VMs away and give the active ones full power.
Are they all pretty active all the time? Are memory performance requirements not too high? Is latency a priority? Best to do it inside the VMs, I suppose.
EDIT: For what it’s worth, I found that no zram at all is best in some scenarios. Sometimes applications just barely, rarely scrape the memory limit, and if I enable a big chunk of zram they scrape it more frequently, then never give it back and keep active pages sitting in zram. Rare swapping to an SSD ended up much, much faster.
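If you want to see whether that’s happening, zram exposes its counters in sysfs. A quick sketch, assuming a single device at /dev/zram0 (the mm_stat column order follows the kernel’s zram documentation, so double-check it on your kernel version):

```python
# Print how much data is parked in zram and the effective compression ratio.
# Assumes /dev/zram0 exists and /sys/block/zram0/mm_stat is readable.

def read_zram_mm_stat(dev: str = "zram0"):
    with open(f"/sys/block/{dev}/mm_stat") as f:
        fields = [int(x) for x in f.read().split()]
    # First three columns per the kernel docs:
    # orig_data_size, compr_data_size, mem_used_total (all bytes)
    return fields[0], fields[1], fields[2]

if __name__ == "__main__":
    orig, compr, used = read_zram_mm_stat()
    ratio = orig / compr if compr else 0.0
    print(f"data held in zram: {orig / 2**20:.1f} MiB (uncompressed)")
    print(f"RAM actually used: {used / 2**20:.1f} MiB, ratio ~{ratio:.2f}x")
```

If “data held in zram” stays high and never drains after the pressure spike passes, that’s the stuck-active-pages situation I mean.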


On the contrary, my eyes slide right off ads. They even did when I was a little kid.
Why should I care about something I’m not looking for? It’s just going to make whatever they’re advertising more expensive to buy.
Maybe that’s the autism side of AuADHD taking over, though.


Is it?
I used vanilla Chrome with “Ublock Lite” on someone else’s computer for a bit, and was shocked by how many ads got through, to say nothing of the annoyances and what I suspect was a malware link. We also got a related ad on TV soon after browsing for something.
I think Google’s having their cake and eating it. It blocks enough for users to feel like they’re getting Adblock, yet it’s not much skin off Google’s back.


I think they meant background transcoding while using the browser.
I don’t even want to speculate on what’s going wrong there, heh. But I can definitely see that being a quirk.


Full Ublock is a mixed bag on mobile because it eats battery/performance, and (if you add all the same filter sources) integrated blockers like Orion’s are just about the same anyway.


Oh heck yeah. I’ve been using it on iOS a ton, and dying for this on Windows/Linux.
Fun trivia: what browser supports HEIF, JPEG XL, AVIF, and AV1, all with correctly rendered HDR?
Not Chrome. And not Firefox, nor anything based on either that I’ve tried: https://caniuse.com/?search=image+format


Shame they didn’t go Intel. Arc is good, and they could have gotten around TSMC supply constraints.