Large language models (LLMs) frequently generate hallucinations -- plausible but factually incorrect outputs -- undermining their reliability. While prior work has examined hallucinations from macroscopic perspectives such as training data and objectives, the underlying neuron-level mechanisms remain largely unexplored. In this paper, we conduct a systematic investigation into hallucination-associated neurons (H-Neurons) in LLMs from three perspectives: identification, behavioral impact, and origins. Regarding their identification, we demonstrate that a remarkably sparse subset of neurons (less than $0.1\%$ of total neurons) can reliably predict hallucination occurrences, with strong generalization across diverse scenarios. In terms of behavioral impact, controlled interventions reveal that these neurons are causally linked to over-compliance behaviors. Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training. Our findings bridge macroscopic behavioral patterns with microscopic neural mechanisms, offering insights for developing more reliable LLMs.
This article is like reading the headline “Researchers have identified the cause of AIDS” and then you open it up and the body is a bunch of science jargon that basically says HIV.
Concerning their origins, we trace these neurons back to the pre-trained base models and find that these neurons remain predictive for hallucination detection, indicating they emerge during pre-training.
It sounds like it’s just how the systems are designed.
I mean, the point of this shit is to take training data and create new stuff out of it through pattern matching. You’re going to get some mismatched shit by design, since the random decisions are modified by the weights. Otherwise you’d get the same shit every time.
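To make the "random decisions modified by the weights" point concrete, here is a minimal sketch of temperature-scaled sampling, with made-up logits for three candidate tokens (the numbers and function names are illustrative, not from the paper). At temperature zero the pick collapses to a deterministic argmax; above zero, the weights shape a probability distribution that repeated calls sample from differently.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Turn raw model scores into probabilities.
    Higher temperature flattens the distribution; lower sharpens it."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature):
    """Greedy (deterministic) at temperature ~0, weighted-random otherwise."""
    if temperature <= 1e-6:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

# Toy logits for three candidate next tokens (hypothetical values).
logits = [2.0, 1.5, 0.1]

# With temperature > 0, repeated calls can pick different tokens.
random.seed(0)
samples = {sample_token(logits, temperature=1.0) for _ in range(100)}

# At temperature 0 the pick is always the argmax -- same output every time.
greedy = [sample_token(logits, temperature=0.0) for _ in range(5)]
```

This is the mechanism the comment is describing: the weights bias, but do not fully determine, the output unless the sampling randomness is switched off.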
When the system is intended to look like a random person, then randomness is fine.
When the output is expected to be accurate, it should be the same each time so it can be verified as accurate.
LLMs are being sold as doing both at the same time, but random plus consistent equals random.
throw it onto the pile of people being idiots
That’s incorrect. Wrong responses will still be generated even if you remove the element that randomizes the response to the same question.
If that weren’t the case, this paper wouldn’t exist.
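A toy sketch of that last point, with hypothetical candidates and scores (none of this is from the paper): if the model’s learned scores happen to rank a factually wrong continuation highest, fully deterministic greedy decoding returns that wrong answer every single time. Removing the randomness removes the variance, not the error.

```python
# Hypothetical next-token candidates for a prompt like
# "The capital of France is ___", with miscalibrated model scores
# that rank the wrong answer highest.
candidates = ["Paris", "Lyon", "Berlin"]
logits = [1.2, 0.4, 2.7]  # made-up numbers: the model is simply wrong here

def greedy_decode(candidates, logits):
    """Deterministic decoding: always take the highest-scoring candidate."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return candidates[best]

# No randomness involved, yet the output is reliably incorrect:
answers = [greedy_decode(candidates, logits) for _ in range(3)]
```

Consistency makes the error verifiable and repeatable, but it doesn’t make the model accurate.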