Year in Review 2022: Technical Advances, Applications, and Social Responsibility

Authors: Michael Wornow and Nandita Bhaskhar and Simran Arora and Percy Liang


A whirlwind year in review of CRFM's work in 2022 spanning technical advances, applications, and social responsibility.

2022 was an eventful year for foundation models. DALL-E 2, Stable Diffusion, and others transformed the image generation space. We saw more powerful language models such as PaLM and, of course, ChatGPT. We saw foundation models being developed for speech, music, proteins, and many other data modalities. And, for the first time, these models are now being widely deployed and utilized by consumers to accomplish a wide breadth of useful tasks. This has not been without controversy – teachers are wrestling with the impact of ChatGPT on writing assignments and GitHub Copilot on computer science education, lawsuits have been filed against Stable Diffusion and GitHub Copilot for scraping copyrighted data, and judges have had mixed responses to the role that ChatGPT can serve in court cases. What is clear is that while foundation models have opened up unprecedented new possibilities, they are also still raw, imperfect research artifacts that we do not entirely understand.

In 2021, we founded the Center for Research on Foundation Models (CRFM), recognizing the critical role of foundation models. CRFM’s mission is to understand and improve foundation models from both a technical and societal perspective. Our collaborative, interdisciplinary community, which spans 35+ faculty and 140+ students from 15+ departments, has been busy over the last year making progress towards that mission.

In this post, we would like to share the broad themes of CRFM’s research in 2022 and spotlight a few works. The work can be organized into three pillars:

  1. Technical advances: We aim to do long-lasting, fundamental work that has an impact on the state-of-the-art. As one example, Dao et al.’s FlashAttention trains Transformers significantly faster than existing methods, achieving a 15% end-to-end wall-clock speedup on BERT-large compared to the MLPerf 1.1 training speed record, a 3x speedup on GPT-2, and a 2.4x speedup on Long-Range Arena. Dao et al.’s techniques enable longer context in Transformers, yielding higher-quality models and entirely new capabilities. FlashAttention has been widely deployed in industry, including at Meta, Microsoft, OpenAI, NVIDIA, Amazon, and more.
  2. Applications: CRFM includes researchers from Stanford’s schools of law, medicine, and education, and many other departments. We are exploring ways to leverage foundation models in these fields. We train new foundation models on domain-specific data; for example, BioMedLM, a 2.7 billion parameter model that we created jointly with MosaicML and have open-sourced to support downstream applications.
  3. Social responsibility: As a non-commercial entity, CRFM is in a unique position to characterize and shape the increasing social impact of foundation models. Our new benchmark for language models, HELM, performs a rigorous evaluation of 30+ language models across a standardized set of scenarios and metrics. We have also worked to develop community norms for the release of foundation models.

Technical Advances

Evaluating Foundation Models

Evaluating foundation models was a major focus of the CRFM community in 2022. We watched many new foundation models appear on the map, trained by a broad set of researchers and organizations, and using a broad set of training strategies. For instance, EleutherAI, BigScience, Google, and Meta have released open-source GPT-3 style and instruction fine-tuned models. Organizations such as OpenAI and Anthropic have also provided API-access to their proprietary models. Foundation models are increasingly being used as the starting point for a wide range of ML applications. Recent foundation models also expose a natural language interface, enabling a wide audience beyond ML experts to engage with the models. Despite their broad reach, the models are not yet well-understood and we need new styles of evaluation to improve transparency.

In 2022, we released Holistic Evaluation of Language Models (HELM), a large-scale and collaborative undertaking to evaluate 30 prominent language models (spanning open, limited-access, and closed models) on 42 scenarios. HELM provides standardized evaluations along seven metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency), improving our understanding of the capabilities of current models. The work is upfront that the 42 scenarios are a broad but incomplete subset of the vast set of scenarios in which foundation models may be applied, and to grapple with this, the work introduces a taxonomy for the space of scenarios. As a result, the HELM project is maintaining a live and collaborative benchmark, encouraging the broader community to contribute new evaluation scenarios.

Members of the CRFM community also developed new benchmarks to better understand how humans interact with foundation models. First, the natural language interface of foundation models has enabled richer human-model interaction. To account for this, Lee et al. developed a new framework called Human-AI Language-based Interaction Evaluation (HALIE) in their paper Evaluating Human-Language Model Interaction, which expanded non-interactive evaluation along three dimensions to better capture the interactive process of working with foundation models. Second, most foundation models are currently being hosted behind private APIs. Users may build systems on top of these APIs but lack visibility into the exact models they are using, and the organizations that publish these private models may unilaterally update them over time. This can lead to unexpected changes in a system’s performance. Chen et al. thus developed HAPI: A Large-Scale Longitudinal Dataset of Commercial ML API Predictions, which tracks ML API performance across 3 years and contains 1.7M data points. Third, Srivastava et al. developed a first-of-its-kind benchmark for embodied AI in BEHAVIOR: Benchmark for Everyday Household Activities in Virtual, Interactive, and Ecological Environments. This challenging and realistic benchmark provides a goalpost as we continue developing foundation models.

Further, members of the community studied the metrics and protocols for evaluating FMs. Ethayarajh et al. tackle the question of how to evaluate natural language generation (NLG) systems, demonstrating issues with the standard protocol and proposing a new protocol called System-level Probabilistic Assessment (SPA) in The Authenticity Gap in Human Evaluation. In Deconstructing NLG Evaluation: Evaluation Practices, Assumptions, and Their Implications, Zhou et al. conduct semi-structured interviews with NLG practitioners and report on the objectives, assumptions, and constraints shaping NLG evaluations in practice today. Tamkin et al. developed DABS, an extensive testbed for advancing research on universal self-supervised learning algorithms; the benchmark spans 12 domains, including protein biology, bacterial genomics, and particle physics.

Foundation Model Training and Architectures

This past year saw significant advances in extending the capabilities of foundation models via better architectures, training procedures, and mechanisms for incorporating external knowledge into foundation models.

Retrieval-Based Models. Integrating explicit knowledge from external sources has been a key challenge for today’s foundation models, motivating interest in augmenting these models with supplementary retrieval-based methods that can draw upon additional information. To help combine such retrieval methods with large language models, Khattab et al. developed Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP, which achieved state-of-the-art performance on complex in-context learning tasks by passing natural language texts between a language model and retriever model. In Retrieval-Augmented Multimodal Language Modeling, Yasunaga et al. extended retrieval methods to multimodal language models (i.e. text and image) by incorporating multimodal information sources like web documents.
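
To make this pattern concrete, here is a toy retrieve-then-generate loop in this spirit: fetch a few relevant passages, place them in the prompt, and let the language model condition its answer on them. The corpus, the keyword "retriever", and the `call_lm` stub are all placeholders for illustration, not the actual Demonstrate-Search-Predict API.

```python
# Toy retrieve-then-generate loop: retrieve passages, add them to the prompt, generate.
# The corpus, retriever, and call_lm stub are illustrative placeholders, not the DSP API.

CORPUS = [
    "Stanford University is located in California.",
    "The Center for Research on Foundation Models studies foundation models.",
    "FlashAttention is an IO-aware exact attention algorithm.",
]

def retrieve(question: str, k: int = 2) -> list:
    # Trivial keyword-overlap scoring; a real system would use a learned dense retriever.
    scores = [sum(w.lower() in doc.lower() for w in question.split()) for doc in CORPUS]
    ranked = sorted(range(len(CORPUS)), key=lambda i: scores[i], reverse=True)
    return [CORPUS[i] for i in ranked[:k]]

def call_lm(prompt: str) -> str:
    return "(a language model's completion would appear here)"  # stand-in for any LM API

def answer(question: str) -> str:
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(retrieve(question)))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context above:"
    return call_lm(prompt)
```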

External Knowledge. Another way to inject knowledge into a language model is by utilizing an externally curated “knowledge graph”, which is simply a collection of entities and their relationships. For example, Facebook can be thought of as a knowledge graph containing people and their relationships. This type of structured data can be leveraged by a language model, either during joint pretraining over both the knowledge graph and text as in Deep Bidirectional Language-Knowledge Graph Pretraining by Yasunaga et al., or via a fusion of modalities using GreaseLM: Graph REASoning Enhanced Language Models for Question Answering by Zhang et al. Even the hyperlinks between web documents can contain valuable ancillary information for incorporating implicit knowledge about the relationships between concepts, as Yasunaga et al. showed with LinkBERT: Pretraining Language Models with Document Links.
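
As a concrete illustration of the idea, a knowledge graph is just a set of (head, relation, tail) triples, and one simple way to expose it to a language model is to linearize those triples into text. The triples and linearization below are made up for illustration and are not the exact formats used in the papers above.

```python
# A toy knowledge graph as (head, relation, tail) triples, plus a naive linearization
# into text for joint language-knowledge pretraining. Illustrative only; not the exact
# scheme used in Deep Bidirectional Language-Knowledge Graph Pretraining or GreaseLM.

triples = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "is_a", "nonsteroidal anti-inflammatory drug"),
    ("headache", "symptom_of", "migraine"),
]

def linearize(kg):
    # Turn each triple into a short sentence that can be mixed into a text corpus.
    return " ".join(f"{h} {r.replace('_', ' ')} {t}." for h, r, t in kg)

print(linearize(triples))
# aspirin treats headache. aspirin is a nonsteroidal anti-inflammatory drug. ...
```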

State Space Models. Enabling foundation models to understand longer context windows of information has also remained a challenge. In Efficiently Modeling Long Sequences with Structured State Spaces, Gu et al. achieved state-of-the-art results with S4 – a novel state space model (SSM) – on benchmarks like Pathfinder and Path-X which require a model to track long range dependencies. Hungry Hungry Hippos: Towards Language Modeling with State Space Models by Dao et al. accelerated the training of such state space models with the development of FlashConv, a fused block Fast Fourier Transform algorithm, in addition to introducing a novel architecture dubbed “hybrid H3-attention” which combined two layers of attention with stacked SSMs to achieve better language modeling performance than Transformers!
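
For readers unfamiliar with SSMs, the core object is a simple linear recurrence over a hidden state. The sketch below shows that recurrence with random placeholder matrices; S4's actual contribution is a structured parameterization and an equivalent convolutional (FFT-based) view that make this computation fast and stable over very long sequences.

```python
# Bare-bones discretized state space model: x_k = A x_{k-1} + B u_k, y_k = C x_k.
# Matrices are random placeholders; S4's structured parameterization and FFT-based
# convolution view (not shown) are what make this practical for long sequences.

import numpy as np

rng = np.random.default_rng(0)
d_state, seq_len = 4, 16
A = 0.9 * np.eye(d_state)          # toy state transition (stable by construction)
B = rng.normal(size=(d_state, 1))  # input projection
C = rng.normal(size=(1, d_state))  # output projection

u = rng.normal(size=(seq_len,))    # a scalar input sequence
x = np.zeros((d_state, 1))
y = []
for k in range(seq_len):
    x = A @ x + B * u[k]           # update hidden state with the next input
    y.append((C @ x).item())       # read out a scalar at every step
print(y[:4])
```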

Better Training. In addition to these architectural improvements, CRFM researchers also developed methods to train more accurate and more robust foundation models. Dao et al.’s FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness greatly reduced the time needed to train Transformer models by computing attention in a way that mitigates IO-related bottlenecks; FlashAttention set a new MLPerf record for BERT training speed and achieved ~3.5x faster training times for GPT-2 than prior work. Given a pre-trained foundation model, Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution by Kumar et al. showed that linear probing followed by full fine-tuning (“LP-FT”) achieves better in-distribution and out-of-distribution accuracy than either strategy alone, while Improving Self-Supervised Learning by Characterizing Idealized Representations by Dubois et al. derived from first principles the conditions necessary for self-supervised learning methods to generate representations that can be linearly probed to yield high downstream accuracy. Additionally, in Training and Inference on Any-Order Autoregressive Models the Right Way, Shih et al. make significant contributions to AO-ARMs, a specific class of autoregressive models, achieving better modeling of text and image sequences than traditional Transformers. From a theoretical perspective, Liu et al. question a common rule of thumb: validation pre-training loss is often used as the evaluation metric when developing language models, since it tends to be well correlated with downstream performance. Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models shows that pre-training loss cannot fully explain downstream performance and explores alternative model properties to consider.
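
As a rough illustration of the LP-FT recipe from Kumar et al., the sketch below first fits only a linear head on frozen pretrained features, then unfreezes the backbone and fine-tunes everything at a smaller learning rate. The backbone, data loader, optimizers, and hyperparameters are placeholders, not the paper's exact experimental setup.

```python
# Minimal sketch of LP-FT (linear probing, then full fine-tuning). The backbone,
# loader, and hyperparameters are placeholders, not the paper's exact setup.

import torch
import torch.nn as nn

def lp_ft(backbone: nn.Module, feat_dim: int, num_classes: int, loader,
          probe_epochs: int = 5, ft_epochs: int = 5):
    head = nn.Linear(feat_dim, num_classes)
    loss_fn = nn.CrossEntropyLoss()

    # Stage 1: linear probing: freeze the pretrained features, train only the head.
    for p in backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    for _ in range(probe_epochs):
        for x, y in loader:
            loss = loss_fn(head(backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: full fine-tuning: unfreeze everything, start from the probed head,
    # and use a smaller learning rate so the pretrained features are not distorted.
    for p in backbone.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-5)
    for _ in range(ft_epochs):
        for x, y in loader:
            loss = loss_fn(head(backbone(x)), y)
            opt.zero_grad(); loss.backward(); opt.step()

    return backbone, head
```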

Data behind Foundation Models

One of the biggest contributors to foundation model success is the large quantity of data they are trained on. Datasets publicly scraped from the Internet form the backbone for many foundation models such as ChatGPT, CLIP, GitHub Copilot, etc. and are critical to their versatility. However, at the same time, they bring issues such as data availability, data composition, data privacy and bias amplification to the forefront, putting their utility at risk. At CRFM, many of us are thinking about these issues and ways to study and mitigate them. This past year, we have had several works come out of CRFM studying the interaction between pretraining data and the resulting foundation models.

Data Availability. In Beyond Neural Scaling Laws: Beating Power Law Scaling via Data Pruning, Sorscher et al. suggest a different approach to beat the power law, where error falls off as a power of the training set size. They propose a new, self-supervised data pruning metric for these datasets, thereby retaining only the best set of training examples. Zhao et al. take a different approach in Data Budgeting for Machine Learning. They study the data budgeting problem, i.e. “How much data do we really need to train models?”, and propose methods to predict both the best achievable performance given enough data and how many data points are in fact needed to reach it. Chakraborty et al. present novel methods to efficiently filter a user-defined number of pretraining images conditioned on a target dataset in Efficient Conditional Pre-Training for Transfer Learning.
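
To give a flavor of what a self-supervised pruning metric can look like, the sketch below clusters example embeddings with k-means, scores each example by its distance to the nearest centroid, and keeps a fixed fraction of the hardest (most distant) examples. This mirrors the spirit of Sorscher et al.'s prototype-based metric but is a simplified illustration rather than their implementation; in the data-scarce regime one would instead keep the easiest examples.

```python
# Simplified self-supervised pruning sketch in the spirit of Sorscher et al.: cluster
# embeddings, score examples by distance to the nearest centroid, keep the hardest.
# Illustrative only; not the paper's exact metric or hyperparameters.

import numpy as np
from sklearn.cluster import KMeans

def prune_by_prototype_distance(embeddings, keep_fraction, n_clusters=10):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    dists = km.transform(embeddings).min(axis=1)   # distance to nearest centroid
    n_keep = int(keep_fraction * len(embeddings))
    return np.argsort(dists)[::-1][:n_keep]        # indices of the hardest examples

emb = np.random.default_rng(0).normal(size=(1000, 64))  # stand-in for real embeddings
kept = prune_by_prototype_distance(emb, keep_fraction=0.5)
print(len(kept))  # 500
```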

Data Composition and Data Slicing. In their work Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, Santurkar et al. study the composition of the pre-training dataset and determine when multi-modal models like CLIP will outperform unimodal image-only models. They find that when certain criteria are met (sufficiently large and descriptive captions with low variability), image-only models trained on this data do not match CLIP’s transfer performance, despite the large dataset size. They propose several prescriptions to better leverage the language information present in the pretraining data. Eyuboglu et al. focus on identifying underperforming slices of data in their paper, Domino: Discovering Systematic Errors with Cross-Modal Embeddings. They present Domino, a slice discovery method that leverages cross-modal embeddings and a novel error-aware mixture model to discover and describe coherent slices in a human-understandable format.

Bias Amplification. Taori and Hashimoto, in their paper Data Feedback Loops: Model-driven Amplification of Dataset Biases, find that the degree of bias amplification for a model is closely linked to whether the model’s outputs behave like samples from the training distribution, a behavior they term consistent calibration. They propose an intervention to help calibrate and stabilize these unstable feedback systems. In Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale, Bianchi et al. report on the stereotypes and biases that are amplified by FMs for image generation.

Understanding Foundation Models

At CRFM, we are also interested in understanding the theory behind various aspects of foundation models. This past year, we’ve looked specifically at “in-context” learning and “emergent abilities” associated with foundation models.

In-context learning refers to the ability of a model to perform a downstream task based simply on a prompt containing in-context examples (input-output pairs corresponding to some task). This happens entirely at inference time, without any parameter updates to the model. In An Explanation of In-context Learning as Implicit Bayesian Inference, Xie et al. study how in-context learning can emerge when pretraining documents have long-range coherence. They also prove that this occurs, despite a distribution mismatch between prompts and pretraining data, for a simple pretraining distribution constructed from a mixture of Hidden Markov Models. Garg et al. investigate the relationship between tasks and the pretraining data for in-context learning in their paper What Can Transformers Learn In-Context? A Case Study of Simple Function Classes. They show that standard Transformers can be trained from scratch to perform in-context learning of linear functions, neural networks, and decision trees, and that this works even under distribution shifts!
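
To make the setup concrete, here is what such a prompt might look like for a made-up sentiment task: the "training data" is just a handful of input-output pairs placed directly in the prompt, and the model is asked to continue the pattern with no parameter updates.

```python
# A made-up in-context learning prompt: a few input-output demonstrations followed by
# a query; the model's completion is read off as the prediction, with no fine-tuning.

demonstrations = [
    ("I loved this movie!", "positive"),
    ("The plot made no sense.", "negative"),
    ("A beautiful, moving film.", "positive"),
]
query = "The acting was wooden and the pacing dragged."

prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in demonstrations)
prompt += f"\nReview: {query}\nSentiment:"
print(prompt)  # send this string to a language model and read its completion
```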

Emergent Abilities. Wei et al. study the unpredictable phenomenon termed “emergent abilities” associated with large language models in their paper Emergent Abilities of Large Language Models. They consider an ability to be emergent if it is not present in smaller models but is present in larger models, and hence cannot be predicted simply by extrapolating the performance of smaller models. They argue that additional scaling could further expand the range of capabilities of language models. They explore emergence with respect to model scale and survey emergent abilities observed in a range of prior work. This was joint work with collaborators beyond CRFM.

Diffusion Models. Diffusion-LM Improves Controllable Text Generation by Li et al. introduces the first diffusion-based language model and demonstrates improved controllability in text generation. Diffusion-LM is better able to satisfy user requirements around the topic of generated text, as well as the syntactic structure of generations. In Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models, Li et al. present a method to reduce the inference latency of diffusion models during image editing, leveraging the observation that users tend to make gradual modifications to the input image while editing. Meanwhile, Meng et al. also improve the sampling speed of classifier-free guided diffusion models with their new method for model distillation in On Distillation of Guided Diffusion Models.

Using Foundation Models

As many applications have shown, foundation models hold tremendous potential for various types of tasks. There has been a fair amount of work in the CRFM community on principled methods and algorithms for using foundation models reliably.

Prompting Strategies. Current prompting strategies remain largely based on heuristics. Ask Me Anything: A simple strategy for prompting language models by Arora et al. improves our understanding of the categories of prompt formats that lead to improved in-context learning quality, and is the first to broadly apply weak supervision to improve the prompting performance of off-the-shelf foundation models. The method enables an open-source model with 30x fewer parameters to compete with OpenAI’s GPT-3 on many tasks.
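
The sketch below illustrates the basic shape of the approach: reformat one input into several question-answering style prompts, query the model once per format, and combine the noisy answers. Ask Me Anything aggregates with weak supervision rather than the simple majority vote used here, and the templates are hypothetical, so treat this as a simplified illustration only.

```python
# Simplified multi-prompt aggregation. AMA reformats tasks into question-answering
# prompts and combines predictions with weak supervision; the majority vote and the
# templates below are deliberate simplifications for illustration.

from collections import Counter

TEMPLATES = [
    "Is the following review positive or negative?\n{x}\nAnswer:",
    "Review: {x}\nQuestion: Did the reviewer like it? Answer yes or no.\nAnswer:",
    "{x}\nThe sentiment of this review is",
]

def normalize(answer: str) -> str:
    # Map free-form completions onto a shared label space.
    a = answer.strip().lower()
    return "positive" if a.startswith(("yes", "positive")) else "negative"

def predict(x: str, call_lm) -> str:
    # call_lm is any function mapping a prompt string to the model's completion.
    votes = [normalize(call_lm(t.format(x=x))) for t in TEMPLATES]
    return Counter(votes).most_common(1)[0][0]
```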

Programming with Foundation Models. Meanwhile, Parsel: A (De-)compositional Framework for Algorithmic Reasoning with Language Models by Zelikman et al., and Demonstrate-Search-Predict: Composing retrieval and language models for knowledge-intensive NLP by Khattab et al. introduced new frameworks for programming with foundation models to help improve their performance on tasks that require multi-step reasoning. Separately, Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference by Mitchell et al. demonstrated how to improve the consistency of FM responses by jointly considering the FM’s likelihood for each answer choice alongside an off-the-shelf NLI model’s beliefs about the compatibility of different answers.

Applications

The remarkable ability of foundation models to adapt to new tasks without any training suggests a future in which non-technical domain experts can easily incorporate the models into their workflows. CRFM aims to foster collaborations between ML and non-ML domain experts to better understand their domain-specific applications. In 2022, this resulted in research across many disciplines, including law, medicine, and education.

Law. Henderson et al. released Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset, a law-specific pretraining corpus, and Guha et al. released LegalBench: Prototyping a Collaborative Benchmark for Legal Reasoning, a set of 44 downstream benchmarks requiring legal reasoning. LegalBench, like HELM, is a collaborative and evolving benchmark for the legal community.

Biomedicine. CRFM members, in collaboration with MosaicML, released BioMedLM, a best-in-class open-source model for medical question answering. We worked with collaborators to publish results demonstrating how foundation models can improve robustness to temporal distribution shift in electronic health records data in EHR Foundation Models Improve Robustness in the Presence of Temporal Distribution Shift by Guo et al. Large strides in synthetic medical imaging were also made by Chambon et al. in RoentGen: Vision-Language Foundation Model for Chest X-ray Generation, which resulted in a model capable of generating synthetic chest x-rays conditioned on the text of a radiology report. In the physical realm, GaitForeMer: Self-Supervised Pre-Training of Transformers via Human Motion Forecasting for Few-Shot Gait Impairment Severity Estimation by Endo et al. used human motion forecasting as an effective self-supervised pre-training task for estimating MDS-UPDRS gait impairment severity, while Ouyang et al. applied self-supervised learning to build better latent representations for longitudinal MRIs in Self-supervised learning of neighborhood embedding for longitudinal MRI.

Education. Lee et al. studied how to use foundation models to assist users with creative and argumentative writing in CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities. Tack et al. studied how to use foundation models as teachers that interact effectively and helpfully with students in The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues.

Music. In Melody transcription via generative pre-training, Donahue et al. explore generative pre-training over a large-scale melody transcription corpus and demonstrate that this provides 77% stronger performance on melody transcription relative to the strongest available baseline. This work enables Sheet Sage, a system capable of transcribing human-readable lead sheets directly from music audio.

Robotics. In LILA: Language-Informed Latent Actions, Karamcheti et al. present a framework for learning natural language interfaces in the context of human-robot collaboration. In LISA: Learning Interpretable Skill Abstractions from Language, Garg et al. present LISA, a hierarchical imitation learning framework that can learn interpretable skills from language-conditioned offline demonstrations. Nair et al. propose LOReL, a method to learn language-conditioned visuomotor policies by incorporating FMs, enabling a real robot to complete language-specified tasks while being robust to complex rephrasings of the task description, in Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation. In R3M: A Universal Visual Representation for Robot Manipulation, Nair et al. study how visual representations pre-trained on diverse human video data can enable data-efficient learning of downstream robotic manipulation tasks. Gupta et al. in MaskViT: Masked Visual Pre-Training for Video Prediction explore pre-training transformers via masked visual modeling – the approach significantly improves inference speeds on real robot planning tasks.

Agriculture, satellite imagery, and geospatial. Mai et al., in Towards a Foundation Model for Geospatial Artificial Intelligence (Vision Paper), ask whether we can extend the remarkable improvements in language and vision tasks to new modalities such as multi-modal geospatial data. In SatMAE: Pre-training Transformers for Temporal and Multi-Spectral Satellite Imagery, Cong et al. present SatMAE, a pre-training framework for temporal or multi-spectral satellite imagery based on the Masked Autoencoder (MAE).

Data engineering. Data management is a key time-sink when working in any domain. Narayan et al. applied foundation models to data-wrangling tasks, demonstrating the advantage over legacy systems in their work Can Foundation Models Wrangle Your Data?.
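
As a flavor of what this looks like in practice, a classic wrangling task such as entity matching can be posed to a foundation model by serializing two records as text and asking whether they refer to the same entity. The serialization and prompt below are illustrative, not the exact prompts from the paper.

```python
# Entity matching via prompting, in the spirit of "Can Foundation Models Wrangle Your
# Data?". The record serialization and prompt are illustrative, not the paper's format.

def serialize(record: dict) -> str:
    return "; ".join(f"{k}: {v}" for k, v in record.items())

record_a = {"name": "Apple iPhone 14 128GB", "price": "$799"}
record_b = {"name": "iPhone 14 (128 GB) by Apple", "price": "799.00"}

prompt = (
    "Do the two product records refer to the same entity? Answer Yes or No.\n"
    f"Record A: {serialize(record_a)}\n"
    f"Record B: {serialize(record_b)}\n"
    "Answer:"
)
print(prompt)  # send to a language model; its Yes/No completion is the match decision
```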

Social systems. Park et al. creatively apply foundation models to modeling social dynamics in community-based settings, such as social media platforms, in their work Social Simulacra: Creating Populated Prototypes for Social Computing Systems. They develop social simulacra, a prototyping technique that generates a breadth of realistic social interactions that may emerge when a social computing system is populated with synthetic “community members.”

Social Responsibility

Foundation models captured news headlines around the world in 2022 thanks to the splashy launches of tools like Stable Diffusion and ChatGPT. Oftentimes, however, the societal impacts of such disruptive technologies only become clear after they become irreversibly entrenched in our lives. To help bring clarity to this issue, CRFM researchers were at the forefront of studying how to align foundation models with HAI’s broader mission of “advancing AI research, education, policy, and practice to improve the human condition.”

Governance. The need for governance and norms surrounding foundation model development was at the top of many CRFM members’ minds. We published articles advocating in favor of developing community norms for the release of foundation models, analyzed the impact of “algorithmic homogenization” spurred by the proliferation of foundation models in Picking on the Same Person: Does Algorithmic Monoculture lead to Outcome Homogenization? by Bommasani et al., and quantified how multilingual language models are biased against lower-resourced languages in Multilingual BERT has an accent: Evaluating English influences on fluency in multilingual models by Papadimitriou et al. In addition to the models themselves, the data input into a model is of equal import, prompting Henderson et al. to draw from legal norms to develop the ethical framework described in Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset for filtering the data selected for pretraining a foundation model.

Evaluations. In addition to data and model governance, CRFM researchers advocated for more transparent and comprehensive evaluations of foundation models. Liang et al. set a new standard for conducting “holistic” benchmarks of foundation models in Holistic Evaluation of Language Models, which, for the first time, evaluated 30 open and closed foundation models on a level playing field while incorporating metrics that are understudied in the literature, such as robustness, fairness, bias, toxicity, and efficiency. Bommasani et al. in Trustworthy Social Bias Measurement proposed DivDist, a novel bias measurement framework grounded in social science research to better quantify social bias.

Privacy and security. In addition to questions around governance and oversight, foundation models also pose new opportunities and challenges within the field of cybersecurity. Can Foundation Models Help Us Achieve Perfect Secrecy? by Arora et al. answered in the affirmative, outlining how foundation models might be able to provide stronger security guarantees than federated learning, a popular method for preserving user privacy in machine learning. On the flip side, Perry et al. also answered their titular question, Do Users Write More Insecure Code with AI Assistants?, in the affirmative, finding that programmers who used a foundation model (OpenAI Codex) to help write code wrote less secure code than those without the AI assistant. Finally, in Large Language Models Can Be Strong Differentially Private Learners, Li et al. showed how to train differentially private language models by pretraining on public data and fine-tuning on private data with DP-SGD. This work showed that, surprisingly, DP-SGD works well for high-dimensional foundation models; these approaches have been deployed at Microsoft.
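
For readers unfamiliar with DP-SGD, the core mechanism is simple: clip each per-example gradient to a fixed norm, average, and add Gaussian noise before the update. The sketch below shows only that mechanism, in a deliberately inefficient form; Li et al.'s contributions include memory-efficient clipping and fine-tuning recipes that make this practical for large pretrained language models.

```python
# Schematic DP-SGD step: clip each per-example gradient to norm C, sum, add Gaussian
# noise, and take an averaged step. Deliberately inefficient and illustrative only.

import torch

def dp_sgd_step(model, loss_fn, batch, clip_norm=1.0, noise_multiplier=1.0, lr=1e-4):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in batch:  # per-example gradients, computed one example at a time
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in params]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)  # clip to norm <= C
        for s, g in zip(summed, grads):
            s.add_(g * scale)

    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(p) * noise_multiplier * clip_norm
            p.add_(-(lr / len(batch)) * (s + noise))  # noisy averaged gradient step
```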

Conclusion

We thus conclude our whirlwind tour of the work that CRFM produced in 2022. We have made technical advances, taken steps to improve social responsibility, and applied foundation models across diverse domains. We always strive to open-source our datasets, models, and code so that the broader community can leverage our work. This work is possible thanks to our wonderful collaborators beyond Stanford. We have also learned so much from the researchers who have engaged through our courses and seminars, and from the exciting work of numerous research organizations beyond CRFM — for instance, open-source models such as GPT-J and OpenCLIP, and platforms such as Hugging Face’s model hub. We’re looking forward to what 2023 brings!

We expect foundation models will be even more significant in 2023, in terms of capabilities, breadth of domains, and footprint on society. How can we continue to benchmark and document the foundation model ecosystem? How can we foster a community norm of greater transparency? How can we develop a deeper understanding of how foundation models work and develop more reliable models? These are some of the key questions we seek to address in the coming year.

Acknowledgements We would like to thank Rishi Bommasani for his leadership as well as his feedback on this blog post. We also thank Stanford Institute for Human-Centered AI (HAI), McGovern Foundation, and Schmidt Futures, and companies including Microsoft, Google, OpenAI, AI21 Labs, Cohere, Together, and MosaicML, who supported CRFM’s work through funding, compute infrastructure, and API credits.