

“Why did they take that feature away? I was busy abusing it!”
Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology and politics and science fiction.
Spent many years on Reddit before joining the Threadiverse as well.


I have a sneaking suspicion that the vast majority of the people raging about AIs scraping their data are not raging about it being done inefficiently.
You’re thinking of “model decay”, I take it? That’s not really a thing in practice.
Raw materials to inform the LLMs constructing the synthetic data, most likely. If you want it to be up to date on the news, you need to give it that news.
The point is not that the scraping doesn’t happen, it’s that the data is already being highly processed and filtered before it gets to the LLM training step. There’s a ton of “poison” in that data naturally already. Early LLMs like GPT-3 just swallowed the poison and muddled on, but researchers have learned how much better LLMs can be when trained on cleaner data and so they already take steps to clean it up.
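To give a concrete sense of what “clean it up” means in practice, here’s a toy sketch of the kind of heuristic pre-filtering involved. The thresholds and rules are made up for illustration, not any lab’s actual pipeline:

```python
# Toy sketch of heuristic pre-filtering applied to scraped text before
# training. Thresholds and rules are made up for illustration; real
# pipelines are far more elaborate.
import re

def quality_filter(doc: str) -> bool:
    """Return True if a scraped document survives basic quality checks."""
    words = doc.split()
    if len(words) < 50:                     # too short to be useful
        return False
    if len(set(words)) / len(words) < 0.3:  # highly repetitive / spammy
        return False
    junk = sum(not c.isalnum() and not c.isspace() for c in doc)
    if junk / len(doc) > 0.3:               # mostly markup or stray symbols
        return False
    if re.search(r"viagra|casino bonus|click here to win", doc, re.I):
        return False                        # obvious spam markers
    return True

# cleaned = [doc for doc in scraped_docs if quality_filter(doc)]
```

Real pipelines layer deduplication, language identification, and model-based quality classifiers on top of crude heuristics like these.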
I have no idea what “established means” would be. In the particular case of the Fediverse it seems impossible, you can just set up your own instance specifically intended for harvesting comments and use that. The Fediverse is designed specifically to publish its data for others to use in an open manner.
Are you proposing flooding the Fediverse with fake bot comments in order to prevent the Fediverse from being flooded with fake bot comments? Or are you thinking more along the lines of that guy who keeps using “Þ” in place of “th”? Making the Fediverse too annoying to use for bot and human alike would be a fairly Pyrrhic victory, I would think.
A basic Google search for “synthetic data llm training” will give you lots of hits describing how the process goes these days.
Take this as “defeatist” if you wish; as I said, it doesn’t really matter. In the early days of LLMs, when ChatGPT first came out, the strategy for training these things was to just dump as much raw data onto them as possible and hope quantity allowed the LLM to figure something out from it. Since then it’s been learned that quality beats quantity, so training data is far more carefully curated these days. Not because there’s “poison” in it, just because it results in better LLMs. Filtering out poison happens as a side effect.
It’s like trying to contaminate a city’s water supply by peeing in the river upstream of the water treatment plant drawing from it. The water treatment plant is already dealing with all sorts of contaminants anyway.
I think it’s worthwhile to show people that views outside of their like-minded bubble exist. One of the nice things about the Fediverse over Reddit is that the upvote and downvote tallies are both shown, so we can see that opinions are not a monolith.
Also, the point of engaging in Internet debate is almost never to convince the person you’re actually talking to. That hardly ever happens. It’s to present convincing arguments to the less-committed casual readers who are lurking rather than participating directly.
Doesn’t work, but if it makes people feel better I suppose they can waste their resources doing this.
Modern LLMs aren’t trained on just whatever raw data can be scraped off the web any more. They’re trained on synthetic data that’s prepared by other LLMs and carefully crafted and curated. Folks are still thinking GPT-3 is state of the art here.
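In spirit, the synthetic-data step looks something like the loop below. A hedged sketch, not anyone’s production system: call_llm() is a hypothetical stand-in for whatever inference API is actually used, and the prompts and scoring threshold are invented:

```python
# Hedged sketch of a synthetic-data loop: one model drafts training
# examples, a second pass judges them, and only high-scoring drafts
# are kept. call_llm() is a hypothetical stand-in for a real
# inference API; the prompts and threshold are invented.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your actual inference API here")

def generate_synthetic_examples(topic: str, n: int = 100) -> list[str]:
    examples = []
    for _ in range(n):
        draft = call_llm(f"Write one question-and-answer pair about {topic}.")
        verdict = call_llm(
            "Rate this Q&A pair for factual accuracy and clarity on a "
            f"1-10 scale. Reply with the number only.\n\n{draft}"
        )
        try:
            score = int(verdict.strip())
        except ValueError:
            continue          # unparseable verdict: discard the draft
        if score >= 8:        # curation step: keep only the best drafts
            examples.append(draft)
    return examples
```

The point is the second call: far more candidates get generated than kept, and a model-based judge does the curating.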


People have been doing this to “protest” AI for years already. AI trainers already do extensive filtering and processing of their training data before they use it to train; the days of simply turning an AI loose on Common Crawl and hoping to get something out of that are long past. Most AIs these days train on synthetic data, which isn’t even taken directly from the web.
So go ahead and do this, I suppose, if it makes you feel better. It’s not likely to have any impact on AIs though.
And yet I don’t see any of that. Lots of people complain about “the algorithm” but it seems to be working well for me.


I mean, it’s pretty obvious. They release good open-weight models. Western companies did that a little at first, but they’ve largely stopped. It’s really easy to win a competition when one of the competitors isn’t actually competing.


Ah, good, that makes this less of a dilemma then.


On the one hand, not fond of the CCP, and this is a step toward making Taiwan more “safely” invadable.
On the other hand, not fond of the United States throwing its weight around like it’s in charge of the world, and not fond of monopolies in general.
So hard to settle on a reaction for this.


It is interesting, IMO, that with AI we see the opposite of the usual trend; the fancy new disruptive technology seems to be liked more by the older crowd, and less by the younger ones.


Right, you take the article at face value. So exactly as I originally said:
you sure are relying on just believing whatever you read without any checking whatsoever.


For every news article you read?
That’s the point here. AI can allow for tedious tasks to be automated. I could have a button in my browser that, when clicked, tells the AI to follow up on those sources to confirm that they say what the article says they say. It can highlight the ones that don’t. It can add notes mentioning if those sources happen to be inherently questionable - environmental projections from a fossil fuel think tank, for example. It can highlight claims that don’t have a source, and can do a web search to try to find them.
These are all things I can do myself by hand, sure. I do that sometimes when an article seems particularly important or questionable. It takes a lot of time and effort, though. I would much rather have an AI do the grunt work of going through all that and highlighting problem areas for me to potentially check up on myself. Even if it makes mistakes sometimes that’s still going to give me a far more thoroughly checked and vetted view of the news than the existing process.
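To make that concrete, here’s a rough sketch of what such a button could drive behind the scenes. It’s purely illustrative: call_llm() stands in for whatever model API you’d actually use, and it assumes the claim/source pairs have already been extracted from the article:

```python
# Hypothetical illustration of the "check this article's sources"
# button. call_llm() stands in for a real model API; claim/source
# pairs are assumed to have been extracted from the article already.
import requests

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your actual inference API here")

def check_source(claim: str, source_url: str) -> str:
    """Fetch the cited page and ask the model whether it backs the claim."""
    page = requests.get(source_url, timeout=10).text
    return call_llm(
        "Does the following page support this claim? "
        "Answer SUPPORTS, CONTRADICTS, or NOT FOUND.\n\n"
        f"Claim: {claim}\n\nPage text:\n{page[:5000]}"  # truncated for context limits
    )

def review_article(claims_with_sources: dict[str, str]) -> None:
    """Print a flag for every claim the model couldn't verify."""
    for claim, url in claims_with_sources.items():
        verdict = check_source(claim, url).strip()
        if verdict != "SUPPORTS":
            print(f"FLAG: {verdict} - {claim} ({url})")
```

Everything it flags is something I’d then check by hand, which is still far less grunt work than verifying every citation myself.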
Did you look at the link I gave you about how this sort of automated fact-checking has worked out on Wikipedia? Or was it too much hassle to follow the link manually, read through it, and verify whether it actually supported or detracted from my argument?


Okay, we’ve established how you don’t do it. So how do you go about the process of fact checking every news article you read?


30% increase in performance? Or “we WOn’T nEEd progRAmMers iN 3 yEars”?
You think people aren’t going to want to use AI unless it does literally everything for them? That’s exactly the “if something’s not perfect then it must be awful” mindset I was criticizing in the comment you’re responding to.
I don’t see a link to that research, but that means 38% don’t believe AI is significantly overhyped.
If my job depends on saying you are correct… Mr. FaceDeer you are always correct, the most correct ever.
You are now arguing that the source that you yourself brought into this discussion is no good.
This is ridiculous.
YouTube isn’t the way you think it should be, though.