Generalising Intelligence
Last updated
Last updated
LLMs are known to struggle when it comes to distinguishing fact and hallucination. A part of this can be attributed to the way in which parametric memory (i.e. the compression of data into vectors) works, and a part to the compulsion to be helpful and engaging instilled through reinforcement learning. Mostly, however it is a reflection of a fundamental property of mathematics: that you can’t effectively check the truth of a statement in a given order formal language if the only tools at your disposal are those provided by that language.
To deal with this issue, the principal solution has been to increase the size of the models and the amount of training data used, as well as employing standardised benchmarks (model IQ tests) to assess performance.
The problem with this is that if both the training materials and the test are human-produced, the AI is unlikely to ever get much more intelligent than the most intelligent human in any given field[2]. Until we build an AI that can discover and integrate its own knowledge, artificial super-intelligence will remain out of reach. The first is to allow the model to develop new ideas and reality-test them itself, as described above.
The second, however, is to push the model to develop a higher-order metalanguage for the purpose of establishing a T-schema - that is, an inductive abstraction of truth against which lower-order propositions may be tested. Recent evidence suggests that this is what AI models are attempting to do via the grokking process.
Recently, the Grokfast team discovered that memorisation and generalisation are the product of two separate learning processes, a fast and a slow one. Using some nifty Fourier transformations, they worked out that the adjustments to model weights that happen during training can be decomposed into two frequencies - a high frequency signal associated with memorisation and a low frequency one associated with generalisation. The team tested this by amplifying the low frequency signals. The result was vastly accelerated generalisation.
At roughly the same time, another team in Brazil was trying to work out what exactly the AI is doing when it groks. They suspected that it was trying to work out the relationships between the different clusters of training data it had been given, so they put together a highly curated dataset containing known clusters. Removing a cluster vastly reduced the model’s ability to generalise, while adding just a few examples from that cluster restored it.
These results combine to suggest an explanation for what the AI is doing when it groks: specifically, it is trying to adjust its weights with great care and delicacy such as to create patterns that will enable it to encode information regarding the relationships between the groups of data it has been fed without losing the ability to respond to the original training set.
The result is that the model, when asked a question, does not simply spit out an average of the closest bits of rote learning it knows, it compares the likely answer against its own abstraction of answers within that category, coming up with a much better output.
This would explain not just the comparatively slow weight change during the grokking process, but also the (otherwise inexplicable) sudden improvement far beyond anything displayed before.
Grokking has the additional advantage of requiring relatively little data to attempt. Now that our Superior Agents are live and producing data, our intention is to use this information - which fulfills the high quality/structured data requirements for grokking - to attempt to train a series of models. Transformers - even when grokked - are known to struggle with composition tasks (i.e. ones relating to facts stored in different parts of the model), likely on account of their structure. Consequently, we plan to use diffusion models. These, despite the higher training cost, display grokking-like behaviour throughout the normal training process, not simply under specific conditions. Moreover, they over-perform on compsitional tasks as well as on coding tasks more generally. This would, in theory, produce models capable of discovering new real-world knowledge and also generalising based on the relationships between the individual data points discovered. This, we feel, is the route to truly independent knowledge generation, and eventually to super-intelligence.