Response on "Foundation Models"

Author: Jitendra Malik

In response to “On the Opportunities and Risks of Foundation Models” (Bommasani et al., 2021)

The most concrete examples of “Foundation Models” are Large Language Models (LLMs), such as BERT from Google. These transformer models have proved very useful in contexts such as machine translation and speech recognition, and I like the core idea of training big generic models and then finetuning them for various tasks. Indeed, the R-CNN work from my group (Girshick et al., CVPR 2014) was a pioneer in using ImageNet pre-training before finetuning for the task of object detection on the PASCAL dataset. Nor am I opposed to transformers as a neural architecture (I have worked on them myself). There are big risks of LLMs inheriting the biases in the large volumes of uncurated text from the Internet, as many have pointed out; I need not dwell on that.

But calling these models “Foundation Models” evokes something more: that these models can be the foundation for research in AI, broadly construed – akin to the logic, rule-based expert systems, probabilistic graphical models, and deep learning paradigms that have been advocated over the last six decades. The strong claim might be construed as: these LLMs provide a template for all of AI research. It is this claim, perhaps stronger than the authors of this report intended, that I question. I subscribe to the embodiment hypothesis widely held in present-day cognitive science, and will quote Smith and Gasser as a representative example: “intelligence emerges in the interaction of an agent with an environment and as the result of sensorimotor activity… We argue that starting as a baby grounded in a physical, social and linguistic world is crucial to the development of the flexible and inventive intelligence that characterizes humankind.” I do not believe that AI has to slavishly mimic the course of human baby development, but clearly perception, interaction, acting in a physical 4D world, acquiring models of commonsense physics and theories of mind, and acquiring language grounded in this world are important components of AI. Neglecting many of them is unlikely to lead to successful approaches to “wide” AI. As yet, Large Language Models have only shown their power in “narrow” AI settings, and their lack of sensorimotor grounding is why I have called them “castles in the air”. Quite useful castles they are, but without this grounding they float in the air.

I will conclude with some more “big picture” remarks. The last decade has seen remarkable advances in AI, but the journey is still only partially complete. Senior luminaries in AI such as Judea Pearl, Mike Jordan and Rod Brooks have written extensively about some of what is missing today, and I need not reprise their arguments. The “explore-exploit” dilemma is always with us in planning long-term research strategy. My experience, having been in AI for over forty years, is that we over-invest in current paradigms and are not sufficiently alert to the emerging big idea in some neglected part of the intellectual landscape. Large language models are useful, and it makes a lot of sense for big technology companies like Google, Facebook or Microsoft to invest in them. But academia should pursue a “Let a thousand flowers bloom” strategy. Who knows when and where the next AlexNet moment will occur?