Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
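The abstract's identification result (a sparse subset of neurons whose activations predict hallucination) can be illustrated with a toy probe. The sketch below is not the paper's method; it is a minimal, hedged stand-in using synthetic activations and an L1-regularized logistic probe, where the sparsity penalty is what selects a tiny fraction of "neurons" as predictive. All data and dimensions here are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for hidden-state activations: 2000 "neurons",
# of which only 2 (0.1%) actually carry a hallucination signal.
n_samples, n_neurons = 500, 2000
X = rng.normal(size=(n_samples, n_neurons))
y = rng.integers(0, 2, size=n_samples)  # 1 = response was a hallucination
X[:, 0] += 2.0 * y  # inject signal into neuron 0
X[:, 1] -= 2.0 * y  # and neuron 1

# L1 penalty applies sparsity pressure: most coefficients are driven
# to exactly zero, leaving only the few genuinely predictive neurons.
probe = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
probe.fit(X, y)

selected = np.flatnonzero(probe.coef_[0])
print("selected neurons:", selected)
print("train accuracy:", probe.score(X, y))
```

In a real setting, `X` would be activations collected from a model's forward passes and `y` would come from hallucination annotations; the point of the toy is only that a sparse linear probe can recover a tiny predictive subset when one exists.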
So the tl;dr is just what we already knew: LLMs predict the most likely word to come next and have no concept of "true" or "false" information.
Indeed, to have such a concept would require understanding that information, and any AI that actually understood information wouldn't be an LLM, because LLMs are just fancy autocorrect.
There's a bit more to it. Obviously, if a model is trained on more correct data, it's more likely to produce correct output. But the authors found that in every model they tested, a small set of the same neurons was implicated when an incorrect output appeared, and those neurons already exist in the pre-trained base model, before any fine-tuning.
So with that in mind, the tl;dr is more like:
AI models effectively have two goals: first be readable, then be correct. It appears the neurons causing incorrect outputs are also ones involved in making the output readable.