

Heh, so does mine.
All our parents’ book hoarding may end up saving us. And the internet, if they become the new standard?


You joke, but that’s horrifying.
This is already an SEO technique, apparently, and I could see Amazon book sellers finding a way to fudge it: https://yoast.com/help/date-appears-search-results/


100%.
But I was pondering more what the general population might do. People are going to figure out slop recipes don’t work, but the question is what’s the next most accessible thing to replace it with?


…So are we going back to print cookbooks? Published before 2024?
Honestly, that feels like the practical solution.


Plantains are a fickle plant. Ripeness is a huge factor, and that aside, some are just fibrous/less sweet and don’t cook as fast.


In theory, Google should fight all attempts at SEO.
But they infamously stopped doing that to bump some quarterly result (making users sift through junk results generates more clicks), and here we are.


A post that long?
Eh, well, it could definitely be an unmarked bot on X. That’s good attention bait, and it has a feeling of temporal implausibility, kinda like a ‘cheapest API LLM’ story.


Where’d you see it?
There are GPT-2 bots on Reddit and Lemmy that make a lot of posts like this. And they aren’t hidden; the community is explicitly labeled.


vLLM is a bit better with parallelization. All the KV cache sits in a single “pool”, and it uses as many slots as will fit. If it gets a bunch of short requests, it does many in parallel. If it gets a long-context request, it kinda just does that one.
You still have to specify a maximum context, though, and it is best to set that as low as possible.
…The catch is that it’s quite VRAM inefficient. But it can split over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
You might try TabbyAPI exl2s as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.
Another thing to tweak is batch size. If you are actually making a bunch of 47K context calls, you can increase the prompt processing batch size a ton to load the MI50 better, and get it to process the prompt faster.
EDIT: Also, now that I think about it, I’m pretty sure ollama is really dumb with parallelization. Does it even support paged attention batching?
The llama.cpp server should be much better, e.g. use less VRAM for each of the “slots” it can utilize.
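To make “a bunch of short requests” concrete, here’s a minimal client-side sketch, assuming a local server (vLLM, the llama.cpp server, or TabbyAPI all speak the OpenAI-compatible API) listening on localhost:8000. The base_url, model name, and prompts are placeholders for your own setup:

```python
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

# Placeholder endpoint/model: point these at whatever server you launched.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

prompts = [f"Give me a one-sentence summary of topic #{i}." for i in range(16)]

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="my-local-model",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=128,
    )
    return resp.choices[0].message.content

# The client just fires 16 concurrent requests; the server's scheduler packs
# as many as fit into the KV cache pool and decodes them in parallel.
with ThreadPoolExecutor(max_workers=16) as pool:
    results = list(pool.map(ask, prompts))
```

The same client code works whether the backend batches well or poorly; the difference only shows up in throughput.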


I’ll save you the searching!
For max speed when making parallel calls, vLLM: https://hub.docker.com/r/btbtyler09/vllm-rocm-gcn5
Generally, the built-in llama.cpp server is the best for GGUF models! It has a great built-in web UI as well.
For a more one-click, RP-focused UI and API server, kobold.cpp ROCm is sublime: https://github.com/YellowRoseCx/koboldcpp-rocm/
If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s specifically optimized for MoE hybrid inference, but the caveat is that its Vulkan backend isn’t well tested. They will fix issues if you find any, though: https://github.com/ikawrakow/ik_llama.cpp/
mlc-llm also has a Vulkan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.


AFAIK some outputs are made with a really tiny/quantized local LLM too.
And yeah, even that aside, GPT-3.5 is really bad these days. It’s obsolete.


Bloefz has a great setup. Used MI50s are cheap.
An RTX 3090 + a cheap HEDT/Server CPU is another popular homelab config. Newer models run reasonably quickly on them, with the attention/dense layers on the GPU and sparse parts on the CPU.


This is the way.
…Except for ollama. It’s starting to enshittify and I would not recommend it.


The iPhone models are really bad. They aren’t representative of the usefulness of bigger ones, and it’s inexplicably stupid that Apple doesn’t let people pick their own API as an alternative.


Funny how polarized the comments are.
Going by their track record, season 1 will either be quite good or really bad. There will be no in between, and I’m fine with that… better it doesn’t drag on as a mediocre zombie.


Lulz.
It’s an interesting coding exercise, though. Trying to (for example) OCR all the documents, or generate a relations graph between the documents or concepts, is a great intro to language modeling (which is not prompt engineering, like most seem to think).
If you’re like a reporter or something, it’s also the obvious way to comb through the documents looking for clues to actually make headlines. I dunno what techniques they use at big outlets, though.
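As a rough sketch of the relations-graph idea: embed each document with a small local model and link the ones that land close together. The model name and similarity threshold here are arbitrary choices, and a real dump would need OCR and chunking first:

```python
from itertools import combinations
import networkx as nx
from sentence_transformers import SentenceTransformer, util

# Toy corpus standing in for the OCR'd documents.
docs = {
    "doc_001": "…OCR text of the first document…",
    "doc_002": "…OCR text of the second document…",
    "doc_003": "…OCR text of the third document…",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
names = list(docs)
emb = model.encode([docs[n] for n in names], normalize_embeddings=True)

graph = nx.Graph()
graph.add_nodes_from(names)
sims = util.cos_sim(emb, emb)  # pairwise cosine similarity
for i, j in combinations(range(len(names)), 2):
    if float(sims[i][j]) > 0.5:  # arbitrary threshold
        graph.add_edge(names[i], names[j], weight=float(sims[i][j]))

# Highly connected nodes are the documents worth reading first.
print(sorted(graph.degree, key=lambda kv: kv[1], reverse=True))
```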


It’s literally “this one is my fursona. This one won’t refuse BDSM, but it’s not as eloquent. Oh, this one is lobotomized but really creative.” I kid you not. Here is an example, and note that it is one of 115 uploads from one account:
https://huggingface.co/Mawdistical/RAWMAW-70B?not-for-all-audiences=true
And I love that madness. It feels like the old internet. In fact, furries and horny roleplayers have made some good code contributions to the space.


Early on, there were a few ‘character’ finetunes or more generic ones like ‘talk like a pirate’ or ‘talk only in emojis.’ But as local models got more advanced, they got so good at adopting personas that the finetuning focused more on writing ‘style’ and storytelling than emulating specific characters. For example, one trained specifically to stick to the role of a dungeonmaster: https://huggingface.co/LatitudeGames/Nova-70B-Llama-3.3
Or this one, where you can look at the datasets and see the anime ‘style’ they’re trying to massage in: https://huggingface.co/zerofata/GLM-4.5-Iceblink-106B-A12B


Meme finetunes are nothing new.
As an example, there are DPO datasets with paired positive/negative responses, intended to train LLMs to prefer the polite, helpful answer over the negative one. Some of them use toxic comments plucked from the web as the negative examples.
And the immediate community thought was “…What if I reversed them?”
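Concretely, a DPO record is just a prompt with a preferred and a dispreferred response, so “reversing” it is a trivial swap. The field names below are only illustrative; actual dataset schemas vary:

```python
# Illustrative DPO-style preference pair ("chosen"/"rejected" is a common convention).
pair = {
    "prompt": "My code won't compile, can you help?",
    "chosen": "Of course! Could you paste the error message?",    # polite, helpful
    "rejected": "Read the docs before wasting everyone's time.",  # toxic web comment
}

# The meme finetune: swap the labels, so optimization pushes the model
# toward the toxic response instead of the helpful one.
reversed_pair = {
    "prompt": pair["prompt"],
    "chosen": pair["rejected"],
    "rejected": pair["chosen"],
}
```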


Sometimes. As a tool, not an outsourced human, an oracle, or some transcendent companion that con artists like Altman are trying to sell.
See how grounded this interview is, from a company with a model trained on peanuts compared to ChatGPT, and that takes even less to run:
https://www.chinatalk.media/p/the-zai-playbook
They talk about how the next release will be very small/lightweight, and more task focused. How important gaining efficiency through architecture (not scaling up) is now. They even touch on how their own models are starting to be useful utilities in their workflows, and specifically not miraculous worker replacements.