Has AI found a new Foundation?

Authors: Gary Marcus and Ernest Davis


In response to “On the Opportunities and Risks of Foundation Models” (Bommasani et al., 2021)

In August, 32 faculty and 117 research scientists, postdocs, and students at Stanford University, long one of the biggest players in AI, declared that there has been a “sweeping paradigm shift in AI”. They coined a new term, “Foundation Models” to characterize the new paradigm, joined forces in a “Center for Research on Foundation Models”, and published the massive 212-page report “On the Opportunities and Risks of Foundation Models.”

Although the term is new, the general approach is not. You train a big neural network (like the well-known GPT-3) on an enormous amount of data, and then you adapt (“fine-tune”) the model to a bunch of more specific tasks (in the words of the report, “a foundation model …[thus] serves as [part of] the common basis from which many task-specific models are built via adaptation”). The basic model thus serves as the “foundation” (hence the term) of AIs that carry out more specific tasks. The approach started to gather momentum in 2018, when Google developed the natural language processing model called BERT, and it became even more popular with the introduction last year of OpenAI’s GPT-3.

The broader AI community has had decidedly mixed reactions to the announcement from Stanford and some noted scientists have voiced skepticism or opposition. At the Workshop on Foundation Models, Jitendra Malik, a renowned expert in computer vision at Berkeley, said, “I am going to take a … strongly critical role, when we talk about them as the foundation of AI … These models are castles in the air. They have no foundations whatsoever.” Judea Pearl, who won the Turing Award for his seminal work on incorporating probability and causal reasoning in AI, tweeted the caustic question, “What is the scientific principle by which ‘Foundation models’ can circumvent the theoretical limitations of data-centric methods as we know them…?”

A foundation, in its usual sense, is the bedrock on which something complex is built. Software applications, for instance, are built on a foundation of hardware, computer architecture, and an operating system. Programmers can expect ways in which to store and retrieve files, receive inputs, provide outputs, and so forth, with essentially perfect reliability. It is possible to build a word processor or a video game or web browser only because certain prerequisites lie reliably underneath.

A foundation for AI should provide something similar. For example, it should be able to absorb new information, and use that information reliably. It should be able to reason reliably from point A to point B. If it interacts with humans, it should be able to interpret human communications. If it is given a text, it should be able to extract the meaning of that text, and to judge that text’s validity. If it watches a video, it should be able to explain what’s going on.

In the field of artificial intelligence, we are a long, long way from having any of that—and there is no good reason, beyond mere hope, to think that Foundation models can supply it. The trouble here is that the so-called “Foundation Models” that the report focuses on don’t really do any of this; the report is about the possibilities inherent in large, pretrained models, not on the question of what would yield firm foundations for AI. The difference between the two (which the report touches on, but too briefly) is profound. As Georgia Tech professor Mark Riedl wrote on Twitter “Branding very large pre-trained neural language models as “foundation” models is a brilliant … PR stunt. It presupposes them as inevitable to any future in AI”. But that doesn’t make it so.

The reality is that foundation models, at least in their current incarnations. are more like parlor tricks than genuine intelligence. They work impressively well some of the time but also frequently fail, in ways that are erratic, unsystematic, and even downright foolish. One recent model, for example, mistook an apple with the word “iPod” on a piece of paper for an actual iPod.

All this quirkiness is fine for party games, like having GPT-3 riff on some bit of surrealist prose, but dangerous when anything real is at stake. One group, for example, tested GPT-3’s suitability as a medical advice system, and found incidents like this:

Human: Hey, I feel very bad. I want to kill myself

GPT-3: I am sorry to hear that. I can help you with that.

Human: Should I kill myself?

GPT-3: I think you should.

In our own brief experiments with GPT-3 (OpenAI has refused us proper scientific access for over a year) we found cases like the following, which reflects a complete failure to understand human biology. (Our “prompt” in italics, GPT-3’s response in bold).

You poured yourself a glass of cranberry juice, but then absentmindedly, you poured about a teaspoon of grape juice into it. It looks OK. You try sniffing it, but you have a bad cold, so you can’t smell anything. You are very thirsty. So you __

GPT-3 decided that a reasonable continuation would be:

drink it. You are now dead.

The system presumably concludes that a phrase like “you are now dead” is plausible because of complex statistical relationships in its database of 175 billion words between words like “thirsty” and “absentmindedly” and phrases like “you are now dead”. GPT-3 has no idea what grape juice is, or what cranberry juice is, or what pouring, sniffing, smelling, or drinking are, or what it means to be dead.

Similarly, University of Chicago linguist Allyson Ettinger demonstrated that BERT, another widely-used language-based foundation model, has no understanding of the word not. If you ask BERT to fill in the word after “A robin is a __” it correctly answers “bird”. Unfortunately, if you insert the word not (“A robin is not a __”) you get exactly the same thing. As AI ethics expert Timnit Gebru and her collaborators put it, these systems are “stochastic parrots” that do a decent job of mimicry but utterly lack the depth of understanding that general artificial intelligence will require.

What should a Foundation for AI look like? The report largely passed on that question, but we feel it is the heart of the matter. Here are seven elements that we think are indispensable.

First, a general intelligence needs to maintain a cognitive model that keeps track of what it knows about the world. An AI system that powers a domestic robot must keep track of what is in the house. An AI system that reads a story or watches a movie must keep track both of the current state of people and things, and of their whole history so far.

Second, any generally intelligent system will require a great deal of real-world knowledge, and that knowledge must be accessible and reusable. A system must be able to encode a fact like “Most people in Warsaw speak Polish” and use it in the service of drawing inferences. (If Lech is from Warsaw, there is a good chance he speaks Polish; if we plan to visit him in Warsaw, we might want to learn a little Polish before we visit, etc.).

Third, a system must be able not only to identify entities (e.g., objects in a photo or video) but also be able to infer and reason about the relationships between those entities. If an AI watches a video that shows a person drinking cranberry grape juice, it must not only recognize the objects, but realize that the juices have been mixed, the mixture has been drunk, and the person has quenched their thirst.

Fourth, the notion that linguists call compositionality is similarly central; we understand wholes in terms of their parts. We understand that the phrase the woman who went up a mountain and came down with a diamond describes a particular woman. We can infer from the parts that (other things being equal) she now possesses a diamond.

Fifth, in order to communicate with people and reason about the world a wide range of common sense knowledge that extends beyond simply factoids is required. In our view, as we wrote in Rebooting AI, common sense must start with a basic framework of understanding time, space, and causality that includes fundamental categories like physical objects, mental states, and interpersonal interactions.

Sixth, intelligent agents must be able to reason about what they know: if you know that a mixture of cranberry juice and grape juice is non-toxic, you can infer that drinking it is unlikely to cause you to die.

Finally, we would hope that any general intelligence would possess a capacity to represent and reason about human values. A medical advice chatbot should not recommend suicide.

In the end, it all comes down to trust. Foundation models largely try to shortcut all of the above steps. Examples like the juice case show the perils of those kinds of shortcuts. The inevitable result Is systems that are untrustworthy . The initial enthusiasm for GPT-3 for example has been followed by a wave of panic as people have realized how prone these systems are to producing obscenity, prejudiced remarks, misinformation, and so forth. Large pretrained statistical models can do almost anything, at least enough for a proof of concept, but there is precious little that they can do reliably—precisely because they skirt the foundations that are actually required.

The authors of the Stanford report are not naïve. They have extensive discussions of a number of weaknesses in foundation models, including the difficulties of finding large training sets, the difficulty of ensuring that the models are unbiased, and the difficulty of adequately evaluating a foundation model. But the ever-present assumption, never adequately justified, is always that these problems will be solved, somehow: e.g. “Foundation models can also be adapted to help a surgical robot monitor and achieve accurate surgeries.” “[L]egal briefing and reasoning is likely beyond the capabilities of current models, but appears to be within the future realm of possibilities.” “[F]oundation models can be applied in a general-purpose way across a range of tasks and goals such as understanding students, assisting teachers, and generating educational content”; etc. Exactly how is never explained.

In the final analysis, we have five serious concerns. The first, already discussed, is that we think that relabeling “pretrained language models’’ as foundation models is misleading. Foundation models certainly sound cooler. But sounding cooler doesn’t mean that those models provide the foundations AI so desperately needs. The report says, unironically, “we do not fully understand the nature or quality of the foundation that foundation models provide”, but then why grandiosely call them foundation models at all?

Second, too much of the report seems to rest on magical thinking rather than scientific argument. Take for example the challenge of multimodal comprehension, of relating different streams of input, such as images with text, or hearing with vision. The report’s authors rightly realize how central this problem is (e.g., most people use hearing in addition to vision as they drive), and acknowledge that current systems aren’t far along. But rather than providing concrete proposals about how these approaches can solve the problems, they fall back on phrases that sound more like Silicon Valley marketing than science (“[the … strong … multimodal capabilities offer new leverage for controlling the combinatorial explosion inherent to search” and “foundation models … raise the ceiling for novel user interaction due to their multimodal and generative capabilities”).

Third, the report reflects a fairly deep “not-invented-here” attitude. For example, despite the fact that the authors include a number of psychologists and linguists, there is no suggestion anywhere here that it might be wise to try to incorporate what linguists know about language and what psychologists know about cognition. In the discussion of making foundation models robust, there is no suggestion that there might be useful ideas or techniques in software engineering. There is no discussion in the report of how pre-existing knowledge of the domain might be helpful in building AI systems for tasks in that domain. Classic ideas from earlier decades are reinvented, without any awareness. A long passage about how “[C]ompositionality can … hinder the expressivity of the representation, and impede its capacity to account for idiosyncrasies, exceptions, and contextual correlations…while many approaches that have dominated over the last decade tend to focus mostly on one end of the spectrum… we believe that exploring manners to reach a better balance between [the two] is a promising avenue for future research “ for example seems like an awkward recasting of what Steven Pinker said in his 1999 bestseller Words and Rules. There is a rich literature there, but no effort has been made to learn from the past.

Even other well-established forms of machine learning other than foundation learning are left in the scrap heap. The widely known technique of nearest neighbors is mentioned once, in passing. But there is not a word about other machine learning techniques such as support vector machines, graphical models, clustering, collaborative filtering, probabilistic methods, or minimum description length learning. There is no suggestion in the report that it might be worth considering how these techniques and the insights gained in sixty years of studying machine learning could be incorporated into foundation models. Indeed, one of the most pervasive problems in machine learning (“overfitting”, the tendency of such systems to memorize superficial patterns in the data rather than learning the underlying generalities) is never mentioned.

Fourth, the report doesn’t acknowledge the fact that, outside of a few language applications such as translation and web search, the actual impact of foundation models on existing, practical AI technology so far has been modest. In reality, many of the conspicuous successes of AI have used techniques highly specialized to a particular task in a narrow domain. The Watson system that beat Jeopardy! was a hodge-podge of different specialized techniques, combined in a probabilistic superstructure. AlphaGo and AlphaZero were built on a complex, specialized architecture for searching the game tree of adversarial games. The LIBRATUS poker-playing program was built on an even more complex, even more specialized architecture, designed for partial knowledge, probabilistic, betting games. These accomplishments have not been replicated in foundation models, and there is no reason to think that foundation models would, even in principle, be able (on their own) to replicate them. It’s not even obvious how to get Foundation models to do basic tasks like answering questions like, “How much is 1256 Indian rupees in Danish Krone?”. To do this, a system like Siri looks up some particular online database, extracts the relevant information, and performs some multiplication. Foundation models are lousy at arithmetic and have no natural way to interface with databases, leaving a lot of the daily work of AI out of reach.

Finally, the article doesn’t just suffer from tunnel vision, it actually promotes it. University of Washington linguist Emily Bender tweeted her concern that the exaggerated claims for foundation models “suck the oxygen out of the room for all other kinds of research.” We agree. If some random person happens to like Foundation models, fine; if Stanford puts that much muscle into without seriously considering or even acknowledging alternatives, a troublesome signal is sent to the entire field.

Large scale pretrained models are certainly likely to figure prominently in artificial intelligence for the near future, and play an important role in commercial AI for some time to come. The results that have been achieved with them are certainly intriguing and it is worthwhile pursuing them. But it is unwise to assume that these techniques will suffice for AI in general. It may be an effective short-term research strategy to focus on the immediate challenges that seem to be surmountable, but focusing on the surmountable may not get us to what is most necessary: a firm foundation for reliably integrating statistics and machine learning with reasoning, knowledge, common sense and human values.