Limits on Transformer LLM architectures -- they associate, but don't reason
Pattern matching does not equal actual reasoning
This is a great summary by Alexis Gallagher on what's actually going on inside of LLMs, and some limits that fall out of that. He summarizes a paper by Dziri et al. that asks an LLM to do a number of tasks that can be easily solved by following the discrete steps of an algorithm and applying those same steps to the subproblems they generate.
For example, when multiplying two two-digit numbers, we all remember from elementary school how we first work through the tens column, then the ones column, and then add the partial products together, and that working through the tens column is the same as working through the ones column, just shifted one place over.
Well, the LLM does not take that approach—it pattern matches at surface level and ignores the algorithm even when it is spelled out in the prompt.
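To make "spelled out" concrete, here is a minimal sketch in Python of the grade-school algorithm described above (my own illustration, not code from the paper): compute a partial product per column, shift the tens-column product one place over, and add.

```python
def multiply_two_digit(a: int, b: int) -> int:
    """Grade-school column multiplication: partial products, then a sum."""
    tens, ones = divmod(b, 10)       # e.g. 47 -> (4, 7)
    partial_ones = a * ones          # ones-column partial product
    partial_tens = a * tens * 10     # tens-column partial product, shifted one place over
    return partial_tens + partial_ones


print(multiply_two_digit(34, 47))    # 238 + 1360 = 1598, same as 34 * 47
```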
To me this underscores just how far you can get, particularly in the realm of language, by following associative principles (vs. axiomatic principles). This distinction, between associations and axioms, struck me many decades ago and it’s stuck with me since as I think it really does divide the world in two. This is a nice counterpoint to my recent post on Wolfram, who has probably thought as hard about computation as anyone.
Blah blah blah so what?
While LLMs are magic, they are not omnipotent, and I like the intuitions this builds on how to get value from these remarkable models.
You could lean into purely associative tasks, which is where the model excels. There are many domains where language is at the heart of the matter, so maybe focus here?
Alternatively, many tasks combine elements that are associative with elements that are axiomatic — like "itemize and sum these credit card statements". So maybe the right strategy is to figure out which augments, combined with an LLM, make a useful product. Remember, things like ChatGPT are not an LLM — they are a combination of different functionalities wrapped together in a product, which includes the LLM itself but also other things.
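A minimal sketch of what one such augment could look like, assuming a hypothetical `call_llm` function standing in for whatever model API you actually use: the model handles the associative step of reading the statement, and ordinary code handles the axiomatic step of summing.

```python
import json

def itemize_and_sum(statement_text: str, call_llm) -> float:
    """Let the model do the associative reading; let plain code do the axiomatic adding."""
    # Associative step: the model extracts charge amounts from free-form statement text.
    prompt = (
        "List every charge amount in this credit card statement as a JSON array "
        "of numbers, and return only the JSON:\n\n" + statement_text
    )
    amounts = json.loads(call_llm(prompt))

    # Axiomatic step: ordinary arithmetic, where fuzzy matching cannot creep in.
    return sum(float(a) for a in amounts)
```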
Maybe, like perhaps the brain, LLMs will do best when combined with a different architecture that gives more axiomatic shape to the underlying data. Knowledge graphs are an obvious consideration. But is trying to combine the axiomatic and the associative just misguided from the start, and a better product strategy simply leans on the natural strengths of the underlying model?
I think the market will work through these questions as people try different things. The most useful takeaway from all this, I think, is the following:
Transformers do not directly and literally follow systematic instructions as step-by-step procedures.
They can be prompted to decompose problems into subproblems, and to some extent may do this themselves.
However, they fundamentally solve many problems by a kind of fuzzy matching.
Sometimes this is externally indistinguishable from reasoning, sometimes it produces only partial results, and sometimes it goes quite wrong.
As we build our intuitions about where these feats of engineering do and do not work, it is useful to think of them as working in vector space and being very good at deciding what is usefully close to what else.
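One way to picture "usefully close": embeddings are points in a vector space, and closeness is commonly measured with cosine similarity, roughly as in this sketch (the embedding vectors themselves would come from whatever model you are using).

```python
import numpy as np

def most_similar(query_vec: np.ndarray, candidate_vecs: np.ndarray) -> int:
    """Return the index of the candidate embedding closest to the query by cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    c = candidate_vecs / np.linalg.norm(candidate_vecs, axis=1, keepdims=True)
    return int(np.argmax(c @ q))   # largest dot product of unit vectors = smallest angle
```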