Researchers have found the cause of hallucinations in LLMs, H-Neurons: On the Existence, Impact, and Origin of Hallucination-Associated Neurons in LLMs

Allah@piefed.world · 3 months ago


so_pitted_wabam@lemmy.zip · 3 months ago

I think a more appropriate post title would be “Researchers have identified and named the process that spawns hallucinations in LLMs, they still don’t know the cause though”

This article is like reading the headline “Researchers have identified the cause of AIDS” and then you open it up and the body is a bunch of science jargon that basically says HIV.

Skullgrid@lemmy.world · 3 months ago

This article is like reading the headline “Researchers have identified the cause of AIDS” and then you open it up and the body is a bunch of science jargon that basically says HIV.

Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training.

it sounds like it’s just how the systems are designed.

I mean, the point of this shit is to take training data and create new stuff out of it through pattern matching. You’re going to get some mismatched shit by design, since the random decisions are modified by the weights. Otherwise you’d get the same shit every time.

[deleted]@piefed.world · 3 months ago

When the system is intended to look like a random person, then randomness is fine.

When the output is expected to be accurate, it should be the same each time so it can be verified as accurate.

LLMs are being sold as doing both at the same time, but random plus consistent equals random.

Skullgrid@lemmy.world · 3 months ago

throw it onto the pile of people being idiots

XLE@piefed.social · 3 months ago

throw it onto the pile of ~~people being idiots~~ AI companies lying to the public

XLE@piefed.social · 3 months ago

That’s incorrect. Wrong responses will still be generated even if you remove the element that randomizes the response for the same question.

If that wasn’t the case, this paper wouldn’t exist.
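To make that concrete, here is a minimal sketch of decoding with and without the random element. The token scores are invented for illustration and don’t come from any real model; the only point is that always picking the top-scoring token is perfectly repeatable and can still be wrong.

```python
import math
import random

# Hypothetical next-token scores for the prompt "The capital of Australia is".
# The numbers are made up; the wrong answer deliberately scores highest.
LOGITS = {"Sydney": 2.0, "Canberra": 1.5, "Melbourne": 0.5}

def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - top) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def greedy(logits):
    """No randomness: always pick the highest-scoring token."""
    return max(logits, key=logits.get)

def sample(logits, temperature, rng):
    """Randomized decoding: draw a token in proportion to its probability."""
    probs = softmax(logits, temperature)
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]

rng = random.Random(0)
greedy_answers = {greedy(LOGITS) for _ in range(100)}
sampled_answers = {sample(LOGITS, 1.0, rng) for _ in range(100)}

print("greedy :", greedy_answers)   # one answer, repeated every time
print("sampled:", sampled_answers)  # varies from draw to draw
```

Greedy decoding gives the same token on every run, and in this toy setup that token is the wrong one. Consistency makes the output checkable, not correct.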

db2@lemmy.world · 3 months ago

So we’re just throwing in the towel on what words mean now I guess. Anything can be a neuron.

etchinghillside@reddthat.com · 3 months ago

But can anything be a H-NEURON?

XLE@piefed.social · 3 months ago

Any data that makes AI people upset is an H-neuron. This includes both inaccurate responses, and accurate responses that the model designers were attempting to censor, such as “harmful” content.

Infuriatingly, the researchers actually insist that offensive material is not factual material.

The interventions reveal a distinctive behavioral pattern: amplifying H-Neurons’ activations systematically increases a spectrum of over-compliance behaviors – ranging from overcommitment to incorrect premises and heightened susceptibility to misleading contexts, to increased adherence to harmful instructions… (bypassing safety filters to assist with weapon creation)… and stronger sycophantic tendencies. These findings suggest that H-Neurons do not simply encode factual errors, but rather represent a general tendency to prioritize conversational compliance over factual integrity.

Skullgrid@lemmy.world · 3 months ago

In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons)

no, they have to be the nodes responsible for the creation of hallucinations

XLE@piefed.social · edit-2 3 months ago

And a “hallucination” is also an inaccurate humanization of the actual meaning: “statistical relationship that we AI folks don’t like.”

“Hallucinations” even include accurate data.

It is a trash marketing buzzword.

Skullgrid@lemmy.world · 3 months ago

did you know that there is no sex going on in a Breeder Reactor?

https://en.wikipedia.org/wiki/Breeder_reactor

They’re analogies to help us communicate ideas.

[deleted]@piefed.world · 3 months ago

A breeder reactor is creating something, which is like the outcome of breeding. That name fits.

Skullgrid@lemmy.world · 3 months ago

a hallucination is seeing something that’s not there, which also fits.

XLE@piefed.social · 3 months ago

In AI, a “hallucination” is just as much “there” as a non-“hallucination.” It’s a way for scientists to stomp their foot and say that the wrong output is the computer’s fault and not a natural consequence of how LLMs work.

[deleted]@piefed.world · 3 months ago

Hallucination requires perception. LLMs are just statistical models and do not have perceptions.

It was a cute name early on, now it is used to deflect when the output is just plain wrong.

athairmor@lemmy.world · 3 months ago

Nuclear energy companies aren’t trying to make people think that their reactors reproduce.

AI companies are trying to make people think that their software is intelligent.

The context matters.

Bronzebeard@lemmy.zip · 3 months ago

I don’t think anyone is confusing radiation propagation with being alive though.

The issue is, these things “communicate” with us, so granting them even more leeway to seem like they’re thinking (they’re not) only further muddies how people perceive them.

Greg Clarke@lemmy.ca · 3 months ago

Hchicdfvhk!

Skullgrid@lemmy.world · 3 months ago

https://en.wikipedia.org/wiki/Neural_network_(machine_learning)

it’s a node in the system.

Allah@piefed.world · 3 months ago

kind of like cpu is the brain of computer

Bronzebeard@lemmy.zip · 3 months ago

and MITOCHONDRIA IS THE POWERHOUSE OF THE CELL

XLE@piefed.social · 3 months ago

So, what can we glean from this? Here are a few of my observations.

Current studies largely treat LLMs as black boxes… Just as… neuroscience investigations into individual neuronal activity and synaptic interactions shape theories of cognition like learning and memory, analyzing neurons – the fundamental computational units of LLMs – is essential for decoding hallucination. By scrutinizing neurons’ activation patterns in relation to hallucinations, we can gain deeper insights into model reliability.

So these researchers are left poking at the compiled code of a closed source database. What a pain.

The funny part is, although they insist it’s not a black box…

The process begins by generating a balanced dataset of faithful (green check) and hallucinatory (red cross) responses using the TriviaQA benchmark. We extract the contribution profiles of neurons specifically on the answer tokens to train a linear classifier. Neurons assigned positive weights by this classifier are identified as “H-Neurons”, distinguishing them from normal neurons based on their predictive role in generating hallucinations.

… The researchers clearly have no idea what the bad nodes are doing to make anything bad. They just can observe that when they are hit, a bad thing happens. So the nodes themselves are black boxes to them.
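For anyone who wants to see what that selection step looks like mechanically, here is a rough sketch on synthetic data. Everything in it is an assumption on my part: the fake “contribution profiles”, the two planted neurons, and the L1 penalty I use to keep the classifier sparse. The quoted passage only says a linear classifier is trained and that positively weighted neurons get the label, so this is not the paper’s actual code or settings.

```python
import math
import random

def make_dataset(n_samples, n_neurons, planted, rng):
    """Synthetic stand-in for per-neuron contributions on answer tokens.
    Planted neurons get a higher value when the answer is hallucinated."""
    rows, labels = [], []
    for i in range(n_samples):
        label = i % 2  # balanced: 1 = hallucinated, 0 = faithful
        row = [rng.gauss(0.0, 1.0) for _ in range(n_neurons)]
        if label == 1:
            for j in planted:
                row[j] += 2.0
        rows.append(row)
        labels.append(label)
    return rows, labels

def train_sparse_logistic(rows, labels, l1=0.05, lr=0.1, epochs=200):
    """Logistic regression fitted by gradient descent, with soft-thresholding
    after each step (an L1 penalty, assumed here) to push weak weights to zero."""
    n, d = len(rows), len(rows[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(rows, labels):
            z = b + sum(wj * xj for wj, xj in zip(w, x))
            err = 1.0 / (1.0 + math.exp(-z)) - y  # predicted probability minus label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        b -= lr * grad_b / n
        for j in range(d):
            stepped = w[j] - lr * grad_w[j] / n
            w[j] = math.copysign(max(abs(stepped) - lr * l1, 0.0), stepped)
    return w, b

rng = random.Random(0)
PLANTED = [3, 7]  # hypothetical neurons wired to fire on hallucinated answers
rows, labels = make_dataset(200, 20, PLANTED, rng)
weights, bias = train_sparse_logistic(rows, labels)

# Selection rule from the quoted passage: positive weight means "H-Neuron".
h_neurons = [j for j, wj in enumerate(weights) if wj > 0]
print("selected neurons:", h_neurons)
```

Note what the classifier actually gives you: a list of indices whose values correlate with the “hallucinated” label. Nothing in it says what those neurons compute.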

Our investigation reveals that a remarkably sparse subset of neurons – comprising less than 0.1% of the model’s total neurons – can accurately predict whether the model will produce hallucinated responses.

The “bad” nodes are everywhere. If you look at a 1,000th of the database, you will find them scattered across it. The mystery deepens.

Our investigation reveals that H-Neurons originate during the pre-training phase…observed “parameter inertia” suggests that standard instruction tuning does not effectively restructure the underlying hallucination mechanics; instead, it largely preserves these pre-existing circuits… Findings suggest that hallucinations are not merely artifacts of model scaling or alignment procedures, but rather deeply rooted in the fundamental training objectives that shape LLM behavior from their inception.

The “bad” nodes are among the first ones added to models, before anything else is filtered or further trained. This is very funny because it implies they’re part of something crucial.

We hypothesize that the neurons identifying hallucinations do not merely encode factual errors, but rather drive a fundamental behavioral we term over-compliance, which means the model’s tendency to satisfy user prompts even at the expense of truthfulness, safety, or integrity. Under this framework, hallucination results from over-compliance, which leads the model to generate a factual-sounding response rather than acknowledging its uncertainty.

They made a (second) new phrase: This earliest data that goes into the model, and persists after adding more data, they call “over-compliance” and insist it’s the model trying to bullshit a user extra hard.

Alternative hypothesis: what if this data is simply the basis for even making the results legible?

This originates from the inherent characteristics of the next-token prediction objective. This training paradigm does not distinguish between factually correct and incorrect continuations – it merely rewards fluent text generation.

Never mind, they just said it outright.

Peruvian_Skies@sh.itjust.works · 3 months ago

So the tl;dr is just what we already knew: LLMs predict the most likely word to come next and have no concept of “true” or “false” information.

Indeed, to have such a concept would require understanding that information and any AI that actually understood information wouldn’t be an LLM because LLMs are just fancy autocorrect.

XLE@piefed.social · 3 months ago

There’s a bit more to it: Obviously, if a model gets more correct data pumped into it, it’s more likely to produce a correct output. But they found that in every AI model they tested, when an incorrect output came along, certain nodes were associated with it. And those nodes show up at the earliest stage of making the model: they are already there in the pre-trained base model, before any instruction tuning.

So with that in mind, the tl;dr is more like

AI models have two goals: first be readable, then be correct. It appears the nodes causing incorrect outputs are also the ones intended to make the output readable.

HyperfocusSurfer@lemmy.dbzer0.com · 3 months ago

So, their approach can be used to flag likely hallucinated output and warn the user?

AbouBenAdhem@lemmy.world · 3 months ago

amplifying H-Neurons’ activations systematically increases a spectrum of over-compliance behaviors – ranging from overcommitment to incorrect premises and heightened susceptibility to misleading contexts, to increased adherence to harmful instructions and stronger sycophantic tendencies. These findings suggest that H-Neurons do not simply encode factual errors, but rather represent a general tendency to prioritize conversational compliance over factual integrity.

I wonder if the same tendencies are associated in humans—and if so, is it something LLMs learned from humans, or is it a consequence of the general structure of neural networks?

[deleted]@piefed.world · 3 months ago

Prioritizing conversational compliance over factual integrity when the output is promoted as being factual is a design flaw.

Saying “double-check the output” does not excuse that flaw when LLM CEOs say their models are like someone with a PhD or that they can automate every white-collar job within a year.

ageedizzle@piefed.ca · edit-2 3 months ago

Is it a design flaw? Or is it just false advertising? If I sell you a vacuum by telling you it can mop your floor, is the problem with the vacuum or the way I’m selling the product?

XLE@piefed.social · 3 months ago

For this particular paper, it seems like a design flaw got uncovered. And it may very well be part of the architecture of how LLMs are even readable to begin with, given how deep and universal the “bad” nodes are.

I can’t prove any AI company was aware of this, but they would have been in a much better position to realize it than researchers who have to do a postmortem on the models being crappy. And if they weren’t aware of it, they’re probably not very good at their jobs…

[deleted]@piefed.world · 3 months ago

Since shop vacs, which both vacuum and suck up water, exist, it could be both.