Giving attention to transformers
I want to understand transformers, so I’ve been banging my head against them since the New Year from all angles: YouTube, podcasts, O’Reilly books, attempting an explanation to anyone who’d listen, etc. I’m getting close, I think! (at least for basic feed-forward inference – gradient descent remains mysterious) Possibly premature, but I’m feeling that two chumps like my Dad and me are capable of understanding how LLMs work. As non-chumps have pointed out, the math isn’t too complicated; the real unlock seems to be scale + data. (somewhat on that note: wtf is Chinchilla optimal?)
A compelling learning has been the architecture of model ‘body and heads’, a useful frame for understanding the model’s input and output portions. The body accepts an input and projects it into model space, while the head converts that information-rich hidden state into some usable output, e.g., language generation or classification or… ?! That sets up the brainstorm, a three-parter: what can be used for 1) input into the model, 2) output from the model, and 3) objective functions for training the model.
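To make the body/head split concrete, here’s a toy PyTorch sketch – the class, sizes, and layer counts are all made up rather than any real model’s: a shared body produces hidden states, and two interchangeable heads read them out for generation vs. classification.

```python
# Toy sketch of the "body + heads" idea (my own naming, not a real model).
import torch
import torch.nn as nn

class Body(nn.Module):
    """Maps a token sequence into a sequence of hidden states ('model space')."""
    def __init__(self, vocab_size=1000, d_model=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # (real transformers also add positional information; skipped in this toy)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):                   # (batch, seq_len)
        return self.encoder(self.embed(token_ids))  # (batch, seq_len, d_model)

# Two interchangeable heads over the same body:
lm_head = nn.Linear(64, 1000)   # hidden state -> logits over the vocabulary
cls_head = nn.Linear(64, 3)     # hidden state -> logits over 3 classes

body = Body()
tokens = torch.randint(0, 1000, (1, 12))     # one fake 12-token sequence
hidden = body(tokens)                        # (1, 12, 64)
next_word_logits = lm_head(hidden[:, -1])    # generation: read the last position
class_logits = cls_head(hidden.mean(dim=1))  # classification: pool the whole sequence
```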
Starting with input, anything that looks like a sequence is probably fair game, obvious candidates being biological sequences (DNA, amino acids). Self-driving cars also come to mind (earlier today my dad put me in the driver’s seat of his Tesla and flipped on FSD – it’s gotten much better). I imagine one could treat the stream of data coming in from the car’s sensors as one sequence, and the control outputs as another.
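As a toy illustration of “anything sequence-like”, here’s how a DNA string could be turned into token IDs – the 3-base codon vocabulary is just an assumption for the sketch, not how real bio models necessarily tokenize:

```python
# Toy example: a non-text sequence (DNA) becomes transformer input the same way
# text does -- chop it into tokens and map each token to an integer ID.
from itertools import product

codon_vocab = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}  # 64 codons

def tokenize_dna(seq):
    """Split a DNA string into 3-base codons and look up their IDs."""
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    return [codon_vocab[c] for c in codons]

print(tokenize_dna("ATGGCGTTTACA"))  # four integer IDs -- now it's just a sequence to embed
```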
For output, I need to scratch my head a bit more; I don’t have many creative ideas yet. The decoder architecture allows for generating text output, a classifier head for categorical output, and… image and audio? Going to GPT-4 for some help with ideas yields data viz, 3D model generation, video generation, game development(?), and molecular modeling.
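For the text-output case at least, the mechanics are easy to sketch: run the sequence through the model, read the logits at the last position, append the most likely token, repeat. Everything below is an untrained stand-in (and a real decoder also applies a causal mask so positions can’t peek ahead), but it shows the loop:

```python
# Rough sketch of what "the decoder generates text" means mechanically.
import torch
import torch.nn as nn

vocab_size, d_model = 100, 32
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
body = nn.TransformerEncoder(layer, num_layers=1)
lm_head = nn.Linear(d_model, vocab_size)      # the "head": hidden state -> vocab logits

tokens = [7, 42, 3]                           # a fake "prompt" of token IDs
with torch.no_grad():
    for _ in range(5):                        # greedily generate 5 more tokens
        hidden = body(embed(torch.tensor([tokens])))  # (1, seq_len, d_model)
        logits = lm_head(hidden[:, -1])               # only the last position's prediction
        tokens.append(int(logits.argmax()))           # pick the most likely next token
print(tokens)  # prompt plus 5 generated (meaningless, untrained) token IDs
```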
Finally, the training process or objective function – one of the reasons the generative GPT approach feels so brilliant is the simplicity: just predict the nth word from all the priors and repeat – the core insight of ye olde Markov Chains in a bottle (I remember writing one of those things in like 20 lines of Python during a car ride to Tahoe). Other objective function examples include word masking, translation, and classification (when the model is a classifier, etc.). Going back to the GPT-4 oracle for “creative alternatives” turned up generating a random span of words, multi-word masking, semantic masking, syntax-driven masking, and dialogue turn prediction (particularly of note, given where today’s deliberate, turn-taking conversational voice interfaces fall on their face).
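That “predict the nth word” objective is just cross-entropy against the sequence shifted by one – here’s a minimal sketch with fake tokens and fake logits standing in for a real model:

```python
# The GPT-style objective, in miniature: the target at each position is simply
# the next token, and the loss is cross-entropy against that shifted sequence.
import torch
import torch.nn.functional as F

vocab_size, seq_len = 100, 8
tokens = torch.randint(0, vocab_size, (1, seq_len))  # pretend this is a sentence
logits = torch.randn(1, seq_len, vocab_size)         # pretend a model produced these

preds = logits[:, :-1, :]   # predictions for positions 0..n-2
targets = tokens[:, 1:]     # the "label" for position t is token t+1
loss = F.cross_entropy(
    preds.reshape(-1, vocab_size),  # flatten (batch, seq) for cross_entropy
    targets.reshape(-1),
)
print(loss.item())  # this single number is what gradient descent pushes down
```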
Let’s wrap – I’m getting farther from the concepts I actually understand.