• AmbitiousProcess (they/them)@piefed.social · 24 hours ago

    The article seems to be implying that this is a common problem that happens constantly and that the companies creating these AI models just don’t give a fuck.

    Not only does the article never state that this is a common problem (it only explains the technical details of how it works and the possible legal ramifications), it also mentions that, according to nearly any AI scholar/expert you can talk to, this is not some fixable problem. If you take data, and effectively do extremely lossy compression on it, there is still a way for that data to theoretically be recovered.

    Advancing LLMs while claiming you’ll work on fixing this doesn’t change the fact that the problem is inherent to LLMs. There are certainly ways to prevent it, reduce its likelihood, etc., but you can’t entirely remove it. The article is simply about how LLMs inherently memorize data; while you can mask that with more varied training data, you still can’t avoid the fact that trained weights memorize inputs and, when combined together, can eventually reproduce those inputs.

    To be very clear, again, I’m not saying it’s impossible to make this happen less, but it’s still an inherent part of how LLMs work, and isn’t some entirely fixable problem. Is it better now than it used to be? Sure. Is it fully fixable? Never.
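    As a rough intuition pump (this is a toy order-12 character model trained on a made-up passage, not an actual LLM), even a trivially simple generative model replays its training data verbatim once its capacity outstrips how much data it saw:

    ```python
    # Toy illustration, not an actual LLM: an order-12 character Markov chain
    # "trained" on one short made-up passage. Every 12-character context in such
    # a tiny dataset is unique, so greedy generation replays the passage verbatim.
    from collections import Counter, defaultdict

    passage = ("Copyright law protects original works of authorship, "
               "including literary, dramatic, musical, and artistic works.")

    ORDER = 12
    counts = defaultdict(Counter)
    for i in range(len(passage) - ORDER):
        context = passage[i:i + ORDER]
        counts[context][passage[i + ORDER]] += 1   # count next-character frequencies

    # Greedy generation: always emit the most likely next character.
    out = passage[:ORDER]
    while len(out) < len(passage):
        out += counts[out[-ORDER:]].most_common(1)[0][0]

    print(out == passage)  # True: the "model" has memorized its training data
    ```

    Real LLMs are vastly larger and trained to generalize rather than overfit, but the tendency being described here is the same one this toy makes obvious.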

    Clearly nobody is distributing copyrighted images by asking AI to do its best to recreate them. When you do this, you end up with severely shitty hack images that nobody wants to look at

    It’s actually a major problem for artists: people will pass their art through an AI model to reimagine it slightly differently so it can’t be hit with a copyright strike, while still retaining some of the more human choices, design elements, and overall composition.

    Spend any amount of time on social platforms with artists and you’ll find many of them now complain less about people directly stealing their art and reposting it, and more about people taking their images, changing them a bit with AI, and then reposting them just different enough that they can feign innocence and tell their followers it’s all their own work.

    Basically, if no one is actually using these images except to say, “aha! My academic research uncovered this tiny flaw in your model that represents an obscure area of AI research!” why TF should anyone care?

    The thing is, while these are isolated experiments meant to test for these behaviors as quickly as possible with a small set of researchers, at the sheer scale that people now use AI tools you will, statistically speaking, inevitably get people who put in a prompt similar enough to a work in the training data that the model outputs something almost identical to it, without the prompter ever realizing.
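    Back-of-envelope version of that argument, with purely hypothetical numbers chosen only to show how the scale works (neither figure is a measured rate):

    ```python
    # Hypothetical numbers, purely to illustrate the scale argument.
    prompts_per_day = 1_000_000_000    # assumed global prompt volume
    odds_of_near_copy = 10_000_000     # assume 1 in 10 million prompts surfaces a near-copy

    print(prompts_per_day // odds_of_near_copy)  # 100: even a rare event happens daily at this scale
    ```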

    Why do you need to point to absolutely, ridiculously obscure shit like finding a flaw in Stable Diffusion 1.4 (from years ago, before 99% of the world had even heard of generative image AI)?

    Because older models like that highlight flaws that continue to plague current models, while having been around long enough that you can run long-term tests, run them more cheaply and at scale on current AI hardware, and repeat tests under the same conditions rather than starting over again every single time a new model is released.

    Again, this memorization is inherent to how these AI models are trained. It gets better with new releases as more training data is used and more alterations are made, but it cannot be removed, because removing the memorization would remove the training itself.

    I’ll admit it’s less of a “smoking gun” against use of AI in itself than it used to be when the issue was more prevalent, but acting like it’s a non-issue isn’t right either.

    Generative AI is just the latest way of giving instructions to computers. That’s it! That’s all it is.

    It is not, unless you consider every single piece of software or code ever to be just “a way of giving instructions to computers” since code is just instructions for how a computer should operate, regardless of the actual tangible outcomes of those base-level instructions.

    Generative AI is a type of computation that predicts the most likely sequence of text, or distribution of pixels in an image. That is all it is. It can be used to predict the most likely text, in a machine readable format, which can then control a computer, but that is not what it inherently is in its entirety.
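    Here’s a minimal sketch of that last point: the model only ever produces text, and it’s ordinary code wrapped around it that turns that text into an action. The JSON shape and function names are hypothetical, not any vendor’s actual API:

    ```python
    import json
    import os

    def list_files(path: str) -> str:
        """One action we have chosen to expose to the model."""
        return "\n".join(os.listdir(path))

    ACTIONS = {"list_files": list_files}

    # Pretend this string came back from a generative model as its "most likely text".
    model_output = '{"action": "list_files", "arguments": {"path": "."}}'

    call = json.loads(model_output)
    handler = ACTIONS.get(call["action"])
    print(handler(**call["arguments"]) if handler else "Unknown action; nothing happens.")
    ```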

    It can also rip off artists and journalists, hallucinate plausible misinformation about current events, or delude you into believing you’re the smartest baby of 1996.

    It’s like saying a kitchen knife is just a way to cut foods… when it can also be used to stab someone, make crafts, or open your packages. It can be “just a way of altering the size and quantity of pieces of food”, but it can also be a murder weapon or a letter opener.

    Nobody gave a shit about this kind of thing when Star Trek was pretending to do generative AI in the Holodeck

    That would be because it was a fictional series about a nonexistent future: nobody’s life today was negatively affected when nonexistent job roles got replaced, and most people never had to think about how it would affect them if it became reality.

    Do you want the cool shit from Star Trek’s imaginary future or not? This is literally what computer scientists have been dreaming of for decades. It’s here! Have some fun with it!

    People also want flying cars without thinking of the noise pollution and traffic management. Fiction isn’t always what people think it could be.

    Generative AI uses up less power/water than streaming YouTube or Netflix

    But Generative AI is not replacing YouTube or Netflix, it’s primarily replacing web searches. So when someone goes to ChatGPT instead of Google, that uses anywhere from tens of times to a couple hundred times more energy.
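    Roughly, using commonly cited (and much-disputed) per-query estimates (both numbers are assumptions, not figures from the article):

    ```python
    google_search_wh = 0.03             # widely cited figure for one Google search
    chatgpt_prompt_wh = (0.3, 3.0)      # low and high published estimates per prompt

    for wh in chatgpt_prompt_wh:
        print(f"{wh / google_search_wh:.0f}x")   # 10x and 100x a single search
    ```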

    Yet they will still also use Netflix on top of that.

    I expect you’re just as vocal about streaming video, yeah?

    People generally aren’t, because streaming video tends to have a much more positive effect on their lives than AI.

    Watching a new show or movie is fun and relaxing. If it isn’t, you just… stop watching. Nobody forces it down your throat.

    Having LLMs pollute my search results with plausible sounding nonsense, and displace the jobs of artists I enjoy the art of, is not fun, nor relaxing. Talking with someone on social media just to find out they aren’t even a real human is annoying. Trying to troubleshoot an issue and finding made up solutions makes my problem even harder to solve.

    We can’t necessarily all focus on every single thing that uses energy, but it’s easy to focus on the one whose effects most people already associate with something negative.

    Two birds, one stone.

    • Riskable@programming.dev · 7 hours ago

      unless you consider every single piece of software or code ever to be just “a way of giving instructions to computers”

      Yes. Yes I do. That’s exactly what code is: instructions. That’s literally how computers work. That’s what people like me (software developers) do when we write software: We’re writing down instructions.

      When you click or move your mouse, you’re giving the computer instructions (well, the driver is). When you type a key, that’s resulting in an instruction being executed (dozens to thousands, actually).

      When I click “submit” on this comment, I’m giving a whole bunch of computers some instructions.

      Insert meme of, “you mean computers are just running instructions?” “Always have been.”

    • VoterFrog@lemmy.world · 17 hours ago

      If you take data, and effectively do extremely lossy compression on it, there is still a way for that data to theoretically be recovered.

      This is extremely wrong and your entire argument rests on this single sentence’s accuracy so I’m going to focus on it.

      It’s very, very easy to do a lossy compression on some data and wind up with something unrecognizable. Actual lossy compression algorithms are a tight balancing act of trying to get rid of just the right amount of just the right pieces of data so that the result is still satisfactory.

      LLMs are designed with no such restriction. And any single entry in a large data set is both theoretically and mathematically unrecoverable. The only way that these large models reproduce anything is due to heavy replication in the data set such that, essentially, enough of the “compressed” data makes it through. There’s a reason why whenever you read about this the examples are very culturally significant.
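      A crude way to see the replication point (a word-level bigram counter over made-up sentences; nothing like a real LLM, just the shape of the argument):

      ```python
      from collections import Counter, defaultdict

      singletons = [
          "the cat sat on the mat",
          "the dog slept on the rug",
          "the bird sang in the tree",
      ]
      famous = "all that glitters is not gold"
      corpus = singletons + [famous] * 200        # one line is massively replicated

      counts = defaultdict(Counter)
      for sentence in corpus:
          words = ["<s>"] + sentence.split() + ["</s>"]
          for a, b in zip(words, words[1:]):
              counts[a][b] += 1                   # "compress" the corpus into bigram counts

      def generate(max_words=20):
          word, out = "<s>", []
          for _ in range(max_words):
              word = counts[word].most_common(1)[0][0]  # greedy: most likely next word
              if word == "</s>":
                  break
              out.append(word)
          return " ".join(out)

      print(generate())                 # "all that glitters is not gold": the replicated line
      print(generate() in singletons)   # False: no single-copy sentence comes back out
      ```

      The duplicated line survives the “compression”; the one-off sentences only exist as scattered statistics that never reassemble.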

    • AwesomeLowlander@sh.itjust.works · 19 hours ago

      Please see my other comment about energy / water usage. Aside from that, I’m not disputing your other points.

      Relevant excerpt:

      ChatGPT is bad relative to other things we do (it’s ten times as bad as a Google search)

      If you multiply an extremely small value by 10, it can still be so small that it shouldn’t factor into your decisions.

      If you were being billed $0.0005 per month for energy for an activity, and then suddenly it began to cost $0.005 per month, how much would that change your plans?

      A digital clock uses one million times more power (1W) than an analog watch (1µW). “Using a digital clock instead of a watch is one million times as harmful to the climate” is correct, but misleading. The energy digital clocks use rounds to zero compared to travel, food, and heat and air conditioning. Climate guilt about digital clocks would be misplaced.

      The relationship between Google and ChatGPT is similar to watches and clocks. One uses more energy than the other, but both round to zero.

      When was the last time you heard a climate scientist say we should avoid using Google for the environment? This would sound strange. It would sound strange if I said “Ugh, my friend did over 100 Google searches today. She clearly doesn’t care about the climate.” Google doesn’t add to our energy budget at all. Assuming a Google search uses 0.03 Wh, it would take 300,000 Google searches to increase your monthly energy use by 1%. It would be a sad meaningless distraction for people who care about the climate to freak out about how often they use Google search. Imagine what your reaction would be to someone telling you they did ten Google searches. You should have the same reaction to someone telling you they prompted ChatGPT.

      What matters for your individual carbon budget is total emissions. Increasing the emissions of a specific activity by 10 times is only bad if that meaningfully contributes to your total emissions. If the original value is extremely small, this doesn’t matter.

      It’s as if you were trying to save money and had a few options for where to cut:

      You buy a gum ball once a month for $0.01. Suddenly their price jumps to $0.10 per gum ball.
      
      You have a fancy meal out for $50 once a week to keep up with a friend. The restaurant host likes you because you come so often, so she lowers the price to $40.
      

      It’s very unlikely that spending an additional $0.10 per month is ever going to matter for your budget. Spending any mental energy on the gum ball is going to be a waste of time for your budget, even though its cost was multiplied by 10. The meal out is making a sizable dent in your budget. Even though it decreased in cost, cutting that meal and finding something different to do with your friend is important if you’re trying to save money. What matters is the total money spent and the value you got for it, not how much individual activities increased or decreased relative to some other arbitrary point.

      Google and ChatGPT are like the gum ball. If a friend were worried about their finances, but spent any time talking about foregoing a gum ball each month, you would correctly say they had been distracted by a cost that rounds to zero. You should say the same to friends worried about ChatGPT. They should be able to enjoy something that’s very close to free. What matters for the climate is the total energy we use, just like what matters for our budget is how much we spend in total. The climate doesn’t react to hyper specific categories of activities, like search or AI prompts.

      If you’re an average American, each ChatGPT prompt increases your daily energy use (not including the energy you use in your car) by 0.001%. It takes about 1,000 ChatGPT prompts to increase your daily energy use by 1%. If you did 1,000 ChatGPT prompts in 1 day and feel bad about the increased energy, you could remove an equal amount of energy from your daily use by:

      Running a clothes drier for 6 fewer minutes.
      
      Running an air conditioner for 18 fewer minutes. 
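      For what it’s worth, the excerpt’s numbers hang together if you plug in some assumed figures. The only value it actually states is 0.03 Wh per Google search; the per-prompt, daily-use, and appliance wattages below are assumptions chosen to make the quoted numbers line up, not values from the excerpt:

      ```python
      google_search_wh = 0.03     # stated in the excerpt
      chatgpt_prompt_wh = 0.3     # assumed low-end estimate per prompt
      daily_wh = 30_000           # assumed ~30 kWh/day of household-scale energy use
      monthly_wh = daily_wh * 30

      print(round(0.01 * monthly_wh / google_search_wh))   # 300000 searches for 1% of a month
      print(round(0.01 * daily_wh / chatgpt_prompt_wh))    # 1000 prompts for 1% of a day
      print(round(1000 * chatgpt_prompt_wh / 3000 * 60))   # 6 minutes of a 3 kW clothes drier
      print(round(1000 * chatgpt_prompt_wh / 1000 * 60))   # 18 minutes of a 1 kW air conditioner
      ```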
      
      • Riskable@programming.dev · 7 hours ago

        Hmmm… That’s all an interesting argument but it has nothing to do with my comparison to YouTube/Netflix (or any other kind of video) streaming.

        If we were to compare a heavy user of ChatGPT to a teenager who spends a lot of time streaming videos, the ChatGPT side of the equation wouldn’t even amount to 1% of the power/water used by streaming. In fact, if you add up the power/water usage of all the popular AI services, it still doesn’t amount to much compared to video streaming.

    • Zos_Kia@lemmynsfw.com · 21 hours ago

      That scenario where artists get their shit stolen by passing it through AIGen to avoid copyright strikes is hilarious to me. I’d love to see examples of that cause I can’t really picture it.

      • Riskable@programming.dev · 7 hours ago

        In Kadrey v. Meta, a group of authors sued Meta for copyright infringement, but the case was thrown out by the judge because they couldn’t actually produce any evidence of infringement beyond, “Look! This passage is similar.” They asked for more time so they could keep trying thousands (millions?) of different prompts until they finally got one that matched closely enough that they might have some real evidence.

        In Getty Images v. Stability AI (UK), the court threw out the case for the same reason: It was determined that even though it was possible to generate an image similar to something owned by Getty, that didn’t meet the legal definition of infringement.

        Basically, the courts ruled in both cases, “AI models are not just lossy/lousy compression.”

        IMHO: What we really need a ruling on is, “who is responsible?” When an AI model does output something that violates someone’s copyright, is it the owner/creator of the model that’s at fault, or the person who instructed it to do so? Even then, does generating something for an individual even count as “distribution” under the law? I mean, I don’t think it does, because to me that’s just like using a copier to copy a book. Anyone can do that (legally) for any book they own, but if they start selling/distributing that copy, then they’re violating copyright.

        Even then, there are differences between distributing an AI model that people can use on their PCs (like Stable Diffusion) vs. using an AI service to do the same thing. The fact that the model can be used for infringement should be meaningless, because anything (e.g. a computer, Photoshop, etc.) can be used for infringement. The actual act of infringement needs to be something someone does by distributing the work.

        You know what? Copyright law is way too fucking complicated, LOL!