Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
So, what can we glean from this? Here are a few of my observations.
Current studies largely treat LLMs as black boxes… Just as… neuroscience investigations into individual neuronal activity and synaptic interactions shape theories of cognition like learning and memory, analyzing neurons – the fundamental computational units of LLMs – is essential for decoding hallucination. By scrutinizing neurons’ activation patterns in relation to hallucinations, we can gain deeper insights into model reliability.
So these researchers are left poking at the compiled code of a closed-source database. What a pain.
The funny part is, although they insist it’s not a black box…
The process begins by generating a balanced dataset of faithful and hallucinatory responses using the TriviaQA benchmark. We extract the contribution profiles of neurons specifically on the answer tokens to train a linear classifier. Neurons assigned positive weights by this classifier are identified as “H-Neurons”, distinguishing them from normal neurons based on their predictive role in generating hallucinations.
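That pipeline boils down to ordinary supervised probing: a linear classifier over per-neuron features, then reading off the positive weights. Here is a minimal sketch with synthetic stand-in data; the shapes, the injected "signal" neurons, and the plain gradient-descent training are all illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for "neuron contribution profiles on answer tokens":
# 1000 responses x 500 neurons. The first 5 neurons are given a weak
# correlation with the hallucination label, mimicking a sparse
# H-Neuron signal. Purely synthetic, for illustration only.
n_samples, n_neurons = 1000, 500
X = rng.normal(size=(n_samples, n_neurons))
y = rng.integers(0, 2, size=n_samples)        # 1 = hallucinated response
X[:, :5] += 2.0 * y[:, None]                  # inject the signal

# Plain logistic regression trained by full-batch gradient descent.
w = np.zeros(n_neurons)
b = 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # predicted P(hallucinated)
    w -= lr * (X.T @ (p - y)) / n_samples
    b -= lr * np.mean(p - y)

# "H-Neurons": neurons whose classifier weight is positive, i.e. whose
# activation pushes the prediction toward "hallucinated".
h_neurons = np.where(w > 0)[0]
top5 = np.argsort(w)[::-1][:5]
print(sorted(top5.tolist()))  # the injected neurons should dominate
```

On this toy data the five injected neurons end up with the largest weights, which is the whole trick: the classifier both predicts the label and points at which neurons carry the signal.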
… The researchers clearly have no idea what the bad nodes are doing to make anything bad. They can only observe that when those nodes are hit, a bad thing happens. So the nodes themselves are black boxes to them.
Our investigation reveals that a remarkably sparse subset of neurons – comprising less than 0.1% of the model’s total neurons – can accurately predict whether the model will produce hallucinated responses.
The “bad” nodes are everywhere: they make up less than a 1,000th of the database, yet you will find them scattered all across it. The mystery deepens.
Our investigation reveals that H-Neurons originate during the pre-training phase… observed “parameter inertia” suggests that standard instruction tuning does not effectively restructure the underlying hallucination mechanics; instead, it largely preserves these pre-existing circuits… Findings suggest that hallucinations are not merely artifacts of model scaling or alignment procedures, but rather deeply rooted in the fundamental training objectives that shape LLM behavior from their inception.
The “bad” nodes are among the first ones added to models, before anything else is filtered or further trained. This is very funny because it implies they’re part of something crucial.
We hypothesize that the neurons identifying hallucinations do not merely encode factual errors, but rather drive a fundamental behavior we term over-compliance: the model’s tendency to satisfy user prompts even at the expense of truthfulness, safety, or integrity. Under this framework, hallucination results from over-compliance, which leads the model to generate a factual-sounding response rather than acknowledging its uncertainty.
They coined a (second) new phrase: this earliest data that goes into the model, and persists after more data is added, they call “over-compliance”, and they insist it’s the model trying to bullshit a user extra hard.
Alternative hypothesis: what if this data is simply the basis for even making the results legible?
This originates from the inherent characteristics of the next-token prediction objective. This training paradigm does not distinguish between factually correct and incorrect continuations – it merely rewards fluent text generation.
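That point can be made concrete with a toy calculation. The numbers below are made up; the only claim is structural: the next-token loss depends on the probability the model assigns to a continuation, not on whether the continuation is true.

```python
import math

# Made-up next-token distribution over continuations of
# "The capital of Australia is ___". A real model's numbers differ;
# this just illustrates the shape of the objective.
probs = {"Canberra": 0.30, "Sydney": 0.55, "Melbourne": 0.15}

def nll(target):
    """Cross-entropy (negative log-likelihood) for one target token.

    Note what is absent: any notion of factual correctness. Only the
    assigned probability matters.
    """
    return -math.log(probs[target])

# The fluent-but-wrong continuation scores a LOWER loss than the
# correct one, so the objective, taken alone, prefers it.
print(nll("Sydney") < nll("Canberra"))  # prints: True
```

In other words, if training data and context make a wrong answer more probable, the objective rewards producing it fluently rather than flagging uncertainty.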
So the tl;dr is just what we already knew: LLMs predict the most likely word to come next and have no concept of “true” or “false” information.
Indeed, to have such a concept would require understanding that information, and any AI that actually understood information wouldn’t be an LLM, because LLMs are just fancy autocorrect.
There’s a bit more to it: obviously, if a model gets more correct data pumped into it, it’s more likely to produce correct output. But they found that at the core of every AI model they tested, certain nodes produced the incorrect outputs whenever they came along. And those are among the nodes from the earliest part of making the model, before any fine-tuning data gets added.
So with that in mind, the tl;dr is more like
AI models have two goals: first be readable, then be correct. And it appears the nodes causing incorrect outputs are also the ones meant to make the output readable.
Never mind, they just said it outright: the training objective “merely rewards fluent text generation”, correct or not.
So, could their approach be used to flag likely hallucinated output and warn the user?
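If it could, the user-facing check might look something like this sketch. Everything here, the function name, the weight and profile format, and the 0.5 threshold, is a hypothetical illustration built on the linear-classifier idea, not anything the paper ships:

```python
import numpy as np

def hallucination_risk(x, w, b):
    """Sigmoid score from a trained linear H-Neuron classifier.

    x: neuron contribution profile for one response (hypothetical format).
    w, b: weights and bias of the trained linear classifier.
    Returns a value in (0, 1); higher means more hallucination-like.
    """
    return 1.0 / (1.0 + np.exp(-(np.dot(x, w) + b)))

# Toy usage with made-up numbers.
w = np.array([0.8, -0.2, 0.5])
b = -0.1
x = np.array([1.2, 0.3, 0.9])
score = hallucination_risk(x, w, b)
if score > 0.5:  # arbitrary threshold, chosen for illustration
    print(f"warning: likely hallucination (score={score:.2f})")
```

The appeal of such a monitor is that it reads internal activations at generation time, so the warning would not depend on checking the answer against an external knowledge source.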