Jeff Dean and Noam Shazeer interviewed by Dwarkesh Patel

Posted on Feb 12, 2025

Solid Dwarkesh interview with the legends Jeff Dean (Google’s Chief Scientist) and Noam Shazeer (co-author of the original Transformer paper). Here are some takeaways I found interesting:

  • We have models that can deal with millions of tokens of context, which is quite a lot. It’s hundreds of pages of PDF, or 50 research papers, or hours of video, or tens of hours of audio, or some combination of those things, which is pretty cool. But it would be really nice if the model could attend to trillions of tokens. Could it attend to the entire internet and find the right stuff for you? Could it attend to all your personal information for you? (Dean)
  • We have a little bit of that in Mixture of Experts models, but it’s still very structured. I feel like we want this kind of more organic growth of expertise: when you want more expertise in some area, you add some more capacity to the model there and let it learn a bit more on that kind of thing. (Dean)
  • [Current MoE]1 is still a very regular structure… I want something more organic, where if we need more capacity for math, we add math capacity. If we need more for Southeast Asian languages, we add it. Let each piece be developed somewhat independently, then stitched together. (Dean; see the MoE sketch after this list)2
  • This notion of adapting the connectivity of the model to the connectivity of the hardware is a good one. I think you want incredibly dense connections between artificial neurons on the same chip and the same HBM, because that doesn’t cost you that much. But then you want a smaller number of connections to nearby neurons. So, a chip away, you should have some amount of connections, and then many, many chips away, you should have a smaller number of connections, where you send over a very limited, kind of bottlenecky thing. (Dean; toy connectivity mask below)
  • We’re already doing multi-data-center training fully synchronously. If each step is a couple seconds, the 50ms latency between data centers doesn’t kill you, as long as there’s enough bandwidth. (Dean; back-of-envelope below)
  • A related thing: I feel like we need more interesting learning techniques during pre-training. I’m not sure we’re extracting the maximal value from every token we look at with the current training objective. Maybe we should think a lot harder about some tokens. (Dean; one possible reading sketched below)
  • I think the good news is that analyzing text seems to be easier than generating text. So I believe that the ability of language models to analyze language-model output and figure out what is problematic or dangerous will actually be the solution to a lot of these control issues. (Shazeer; verify-loop sketch below)3
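
To make the MoE bullets concrete: in a standard sparsely-gated MoE layer, a learned router sends each token to the top-k of a fixed set of identically shaped experts, and that uniformity is the “very regular structure” Dean wants to grow beyond. A minimal numpy sketch (shapes and names are mine, purely illustrative):

```python
import numpy as np

def moe_layer(x, W_gate, experts, k=2):
    """Sparsely-gated MoE forward pass for a single token.

    x:       (d_model,) token activation
    W_gate:  (d_model, n_experts) router weights
    experts: list of (W_in, W_out) pairs, all identically shaped
    """
    logits = x @ W_gate                       # router score per expert
    top = np.argsort(logits)[-k:]             # indices of the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over just the chosen k
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        W_in, W_out = experts[i]
        out += w * (np.maximum(x @ W_in, 0.0) @ W_out)  # weighted expert FFN
    return out
```

The rigidity Dean points at is visible in the signature: `experts` is a fixed-length list of same-shaped blocks, so “adding math capacity” means resizing and retraining the layer rather than growing one expert organically.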
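
A toy picture of the hardware-matched connectivity idea (my construction, not anything from the interview): dense weights within a chip, a thinner random set of links to the neighboring chip, and only a trickle to distant chips.

```python
import numpy as np

def connectivity_mask(n_chips=4, per_chip=8, near_frac=0.25, far_frac=0.02, seed=0):
    """Boolean (N, N) weight mask: dense on-chip, sparser with chip distance."""
    rng = np.random.default_rng(seed)
    N = n_chips * per_chip
    mask = np.zeros((N, N), dtype=bool)
    for i in range(N):
        for j in range(N):
            dist = abs(i // per_chip - j // per_chip)  # distance in chips
            if dist == 0:
                mask[i, j] = True                      # same chip: fully dense
            elif dist == 1:
                mask[i, j] = rng.random() < near_frac  # neighboring chip: thinner
            else:
                mask[i, j] = rng.random() < far_frac   # far away: bottleneck
    return mask
```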
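
Dean’s cross-datacenter claim is ultimately a ratio of communication time to step time. A back-of-envelope check, using his quoted numbers plus an assumed gradient size and link bandwidth (both illustrative):

```python
step_time_s = 2.0    # per-step compute time ("a couple seconds")
rtt_s = 0.050        # 50ms latency between data centers
grad_tb = 1.0        # assumed gradient exchange per step (illustrative)
link_tbps = 10.0     # assumed inter-DC bandwidth in TB/s (illustrative)

transfer_s = grad_tb / link_tbps
overhead = (rtt_s + transfer_s) / step_time_s
print(f"synchronous-training overhead per step: {overhead:.1%}")  # 7.5% here
```

The latency term is fixed, so longer steps amortize it away; the transfer term is what “as long as there’s enough bandwidth” is doing.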
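
One possible reading of “think a lot harder about some tokens” (my interpretation, not something Dean specified) is simply weighting the loss per token, e.g. spending more gradient signal on tokens the model currently gets wrong:

```python
import numpy as np

def weighted_xent(logits, targets, alpha=1.0):
    """Cross-entropy where harder (higher-loss) tokens are upweighted.

    logits:  (T, V) unnormalized next-token scores
    targets: (T,)   target token ids
    """
    logits = logits - logits.max(axis=1, keepdims=True)        # stable softmax
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(targets)), targets]              # per-token loss
    w = (1.0 + nll) ** alpha                                   # hard tokens count more
    return (w * nll).sum() / w.sum()
```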
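
Shazeer’s “analysis is easier than generation” point is the premise behind generate-then-verify loops. A minimal sketch of the control pattern; both callables are hypothetical placeholders, not a real API:

```python
def generate_with_verifier(prompt, generator, verifier, max_tries=3):
    """Sample until a (presumed cheaper) verifier model accepts the output.

    generator: fn(prompt) -> text          (hypothetical model call)
    verifier:  fn(prompt, text) -> bool    (hypothetical analysis pass)
    """
    for _ in range(max_tries):
        draft = generator(prompt)
        if verifier(prompt, draft):        # analysis assumed easier than generation
            return draft
    return None                            # nothing verified: escalate or refuse
```

Footnote 3’s worry is exactly the failure mode here: if some useful draft can’t be verified, the loop either rejects it or the verifier rubber-stamps it.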

  1. Noam Shazeer is first author on the sparsely-gated mixture-of-experts paper that brought MoE layers to modern neural language models ↩︎

  2. Jeff Dean mentions his Pathways work as a next-generation AI architecture that might support these more modular networks. ↩︎

  3. Admittedly not a novel take, but there’s an angle I struggle with: is there useful generated text which can’t be analyzed or verified? The idea of recursive alignment, where model N validates the output of the next, smarter model N+1, depends on this verifiability. ↩︎