LLMs performed best on questions related to legal systems and social complexity, but they struggled significantly with topics such as discrimination and social mobility.
“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history,” said del Rio-Chanona. “They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task.”
Among the tested models, GPT-4 Turbo ranked highest with 46% accuracy, while Llama-3.1-8B scored the lowest at 33.6%.



I mean, I’d argue they’re highly complex I/O mechanisms, which is how you get weird hallucinations that developers can’t easily explain.
But expecting cognition out of a graph is like demanding novelty out of a plinko machine. Not only do you get out what you put in, but you get a very statistically well-determined output. That’s the whole point. The LLM isn’t supposed to be doing high-level cognitive extrapolations. It’s supposed to be doing statistical aggregates on word association using a natural language schema.
Hallucinations imply a sense of “normal” or “reasonable” or at least “real” in the first place. LLMs have no concept of that.
I prefer to phrase it as “you get made-up results that are less convincingly made-up than the rest.”