A new paper argues that current LLMs are fundamentally broken because they’re completely static. They call it “anterograde amnesia”, which is honestly spot on. A model gets pre-trained, and from that moment on, its weights are frozen. It can’t actually learn anything new. Sure, it has a context window, but that’s just short-term memory. The model can’t take new information from its context and permanently update its own parameters. The knowledge in its MLP layers is stuck in the past, and the attention mechanism is the only part that’s live, but it forgets everything instantly.
The paper introduces what it calls Nested Learning to fix this. The whole idea is to stop thinking of a model as one big, deep stack of layers that all update at the same time. Instead, the authors take inspiration from the brain, which has all kinds of update cycles running at different speeds in the form of brain waves. They represent the model as a set of nested optimization problems, where each level has its own update frequency. Instead of just deep layers, you have levels defined by how often they learn.
The idea of levels is then applied to the standard Transformer, which in this view has only two of them: a fast attention level that updates on every token, and slow MLP layers that update only during pre-training. There’s no in-between.
The paper presents a Hierarchical Optimizers and Parallel Extensible model that adds levels in between. You might have a mid-frequency level that updates its own weights every, say, 1,000 tokens it processes, and a slower level that updates every 100,000 tokens, and so on. The result is a model that can actually consolidate new information it sees after pre-training. It can learn new facts from a long document and bake them into that mid-level memory, all while the deep, core knowledge in the slowest level stays stable. It creates a proper gradient of memory from short-term to long-term, so the model can finally learn on the fly without either losing new information the moment it leaves the context or overwriting what it already knew (catastrophic forgetting).
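
To make the "levels defined by how often they learn" idea concrete, here’s a toy sketch in Python. This is not the paper’s implementation: the `Level` class, the scalar weights, the learning rates, and the squared-error objective are all assumptions made up for illustration. Only the update periods (every token, every 1,000 tokens, every 100,000 tokens) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Level:
    """One memory level that updates its weight every `period` tokens.

    Hypothetical stand-in: a real level would hold a whole block of
    parameters, not a single scalar.
    """
    name: str
    period: int          # tokens between weight updates
    lr: float            # learning rate (assumed value, not from the paper)
    weight: float = 0.0
    grad_sum: float = 0.0
    pending: int = 0     # tokens seen since the last update
    updates: int = 0     # how many times this level has updated

    def observe(self, grad: float) -> None:
        # Accumulate the gradient; only apply it once per `period` tokens.
        self.grad_sum += grad
        self.pending += 1
        if self.pending == self.period:
            self.weight -= self.lr * self.grad_sum / self.period
            self.grad_sum = 0.0
            self.pending = 0
            self.updates += 1


def run(levels, num_tokens, target=1.0):
    """Feed `num_tokens` tokens through all levels with a toy objective."""
    for _ in range(num_tokens):
        prediction = sum(level.weight for level in levels)
        grad = prediction - target  # gradient of 0.5 * (prediction - target)**2
        for level in levels:
            level.observe(grad)
    return levels


levels = run(
    [
        Level("fast", period=1, lr=0.1),
        Level("mid", period=1_000, lr=0.1),
        Level("slow", period=100_000, lr=0.1),
    ],
    num_tokens=2_500,
)
for level in levels:
    print(level.name, level.updates, level.weight)
```

After 2,500 tokens the fast level has updated 2,500 times, the mid level twice, and the slow level not at all, so its weight is untouched. That’s the whole point in miniature: new information lands in the faster levels first while the slowest one stays stable.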


guido gotchu fam https://docs.python.org/3/library/functools.html#functools.cache
🤣