Drawing Lines: Tiers for Foundation Models

Author: Rishi Bommasani


How should governments delineate foundation models to design proportionate regulation?

Governments are taking action to govern foundation models. The US issued a landmark Executive Order that enshrined “dual-use foundation models” as a specific focus. The UK has homed in on frontier models, a narrower class of highly capable foundation models, as the subject of the world’s first AI Safety Summit and its new AI Safety Institute. The G7 issued its principles and code of conduct for foundation models and generative AI. China has already put into force regulations for generative AI. And the EU is finalizing negotiations on the AI Act, the world’s first broad regulation of AI, with foundation models posing the most consternation.

Many factors animate the push for governance of foundation models. They span matters like rising opacity and unsafe labor, copyright protections and creative work, economic shifts and market concentration, and more. The misuse of foundation models has already caused demonstrable harm via AI-generated child sexual abuse material and non-consensual intimate imagery, and it carries the potential for widespread disinformation amidst 2024 elections that will see 25% of the world’s population at the polls.

Once governments decide to govern a technology (here foundation models), they encounter a familiar question. Are all instances the same and, if not, do they deserve to be treated differently? Many are considering tiers for foundation models: the US establishes a two-tiered regime in the Executive Order, the Spanish presidency of the EU Council sees tiers as an inroad to political compromise for the AI Act, the open source community advocates for tiers for the AI Act, and the Partnership on AI provides deployment guidance for three tiers of foundation models.

In this post, we conceptualize how to design tiers, arriving at the following conclusions:

  1. How to tier. Greater scrutiny is warranted when technologies demonstrate greater impact or precipitate greater harm in society: tiers should be determined on this basis.
  2. How not to tier. In contrast, while some proposals advocate for compute-based tiers, such approaches are short-sighted and carry significant defects. If compute is to be used, it should be accompanied by other, non-compute criteria.
  3. What must be accounted for. In designing tiers, complexity should be avoided for several reasons, but the varied input/output modalities and release strategies for foundation models should be accounted for in tiering schemes.
  4. Evaluate proposals now. To judge tiering schemes, the schemes should be evaluated based on the status quo to understand which entities would currently fall in which tiers. Current discourse is often handwavy: concreteness allows for clear cost-benefit analyses.
  5. Tiers will need to be updated. The foundation model ecosystem is rapidly evolving and the current conditions may not hold even in a year’s time. Governments should empower agencies to update tiers and resource agencies with sufficient technical expertise.

Introduction

Market monitoring efforts like Ecosystem Graphs confirm many axes of variation in the foundation model ecosystem. Foundation models are developed by varied organizations (e.g. big tech, startups, nonprofits) using very different amounts of resources (e.g. data, compute). These models span a range of modalities (e.g. the text-to-text Claude 2, the text-and-image-to-action RT-2-X) and exhibit varied capabilities and risks. And they vary in how they are released (e.g. the open Llama 2, the API-based GPT-4, and the fully closed Gopher) and in the impact they have via downstream use.

When faced with substantial heterogeneity, as we see here, governments must consider if/how to tailor requirements to yield a proportionate approach. Across policy contexts, governments often resort to tiers, with requirements escalating from one tier to the next. However, these tiers also introduce complexity compared to a one-size-fits-all approach. Tiers introduce new questions of scope (how do governments, the entities themselves, and the public determine the applicable tier for a given regulated entity?) and new matters of maintenance (how do governments update the tiers as circumstances change, especially in fast-moving spaces like AI?).

To build intuition, consider taxation in the United States. Sales taxes for goods provide examples of both invariance and tailoring. Namely, all individuals are taxed an equal amount if they buy the same good (e.g. chicken), though some may be exempted or provided subsidies (e.g. food stamps), and the impact of the taxes on an individual may not be equal (i.e. taxes of an equal amount disproportionately affect the poor). Yet the tax amount may vary for different types of goods (e.g. essentials like food vs. luxuries like jewelry). Further, the modern tax bracket system for income provides an iconic example of thresholds that determine a series of tiers, allowing richer individuals to be taxed progressively more. And maintaining the modern tax brackets requires a sophisticated index (the Consumer Price Index) to track the price of goods, leading to formulaic adjustments of the brackets, both to the thresholds that distinguish brackets and to the tax owed within a bracket.
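
To make the threshold mechanics concrete, here is a minimal Python sketch of bracket-style tiers; the bracket boundaries and rates below are hypothetical, chosen only to illustrate how thresholds partition a continuous quantity into tiers, not to reflect actual US tax law.

    # Illustrative only: hypothetical bracket boundaries and rates,
    # shown to make threshold-based tiers concrete.
    BRACKETS = [
        (0, 0.10),        # income above $0 taxed at 10%
        (50_000, 0.20),   # income above $50,000 taxed at 20%
        (200_000, 0.30),  # income above $200,000 taxed at 30%
    ]

    def progressive_tax(income: float) -> float:
        """Tax each slice of income at the rate of the bracket it falls in."""
        tax = 0.0
        for i, (lower, rate) in enumerate(BRACKETS):
            upper = BRACKETS[i + 1][0] if i + 1 < len(BRACKETS) else float("inf")
            if income > lower:
                tax += (min(income, upper) - lower) * rate
        return tax

    print(progressive_tax(75_000))  # 5000 + 5000 = 10000.0

The analogy carries over directly: tier proposals for foundation models likewise hinge on what quantity is thresholded, where the thresholds sit, and how they are updated over time.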

Design space for tiers

If governments use tiers to partition the space of foundation models, how might they do it? Here, our focus is on the design space: what approaches could identify foundation models that warrant more or less scrutiny than others? Different approaches vary in their desirability (fundamentally, do they align with the reasons why governments should exert greater scrutiny?) and their feasibility (practically, are they technically and institutionally plausible?).

Figure 1: Four different approaches to determining tiers for foundation models, along with specific measures, when those measures could be assessed, and examples of each approach.


In general, tiers can be based on properties of the regulated entity (e.g. protections for small and medium-sized enterprises). Or the tiers can be based on the regulated item/activity, namely on (i) the resources spent to build the item, (ii) the item itself, or (iii) the item’s impact on society. For foundation models, this would mean properties of the foundation model provider (e.g. OpenAI, Meta), the resources used to build the foundation model (e.g. OpenAI’s GPT-4 is estimated to have used 2.1 x 10^25 FLOPs, Meta’s Llama 2 was trained on 2 trillion tokens of text), evaluations of the foundation model (e.g. the scores of Llama 2 or GPT-4 on benchmarks like HELM), or the foundation model’s downstream impact (e.g. the hundreds of applications built on GPT-4 or Llama 2).

We emphasize two key factors that complicate the design of tiers for foundation models:

  1. Modalities. The inputs and outputs for a foundation model can correspond to one or multiple modalities (known as multimodal models). For example, Llama 2 takes in text and outputs text. GPT-4 takes in text and/or images and outputs text. Stable Diffusion 2 takes in text and outputs images. ESM-2 takes in protein sequences and outputs protein sequences.
  2. Release strategies. Foundation models are developed by a developer and, potentially, released more broadly through distribution channels. For example, Gopher is fully closed and only available to its developer, Google DeepMind. GPT-4 is accessible through an API, meaning it can be queried as a black box, akin to Google Search. Llama 2 is released openly, meaning the weights (i.e. the underlying model) are broadly available, though some restrictions are imposed through its license. Pythia is released under an open-source license along with its training data, meaning there are no restrictions on downstream use (see the sketch after this list).
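
As a minimal sketch in Python, using only the attributes discussed above (the release labels are informal shorthand, not legal categories), these axes of variation could be recorded as simple structured metadata:

    from dataclasses import dataclass

    @dataclass
    class FoundationModel:
        name: str
        developer: str
        input_modalities: tuple      # e.g. ("text",) or ("text", "image")
        output_modalities: tuple
        release: str                 # "closed", "api", "open-weights", "open-source"

    # Values summarized from the examples discussed above.
    MODELS = [
        FoundationModel("Gopher", "Google DeepMind", ("text",), ("text",), "closed"),
        FoundationModel("GPT-4", "OpenAI", ("text", "image"), ("text",), "api"),
        FoundationModel("Llama 2", "Meta", ("text",), ("text",), "open-weights"),
        FoundationModel("Pythia", "EleutherAI", ("text",), ("text",), "open-source"),
    ]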

These factors matter because they are fundamental differences between foundation models that complicate uniform thresholds for tiers. The resources required to build a state-of-the-art text-to-image model in 2023 are quite different from, and significantly less costly than, those required for a state-of-the-art text-to-text model. And while open foundation model developers tend to be more transparent about the resources used in building their models, they cede the ability to directly control and monitor the downstream use of their models.

How should governments draw lines?

Many sources of uncertainty complicate the principled design of tiers for foundation models. First, what we describe below does not assume a specific regulatory context, so the associated requirements and enforcement possibilities are unclear and themselves in flux. Second, the technological landscape of foundation models changes rapidly, meaning well-designed tiers at the time of legislative agreement may erode in quality by the time laws are in force. Third, economy-wide diffusion of foundation models is quickly evolving, meaning the market position of foundation model developers has not solidified and the set of impacted downstream markets is rapidly shifting.

However, pervasive opacity truly cripples the entire enterprise of tier design. Most companies do not publicly disclose the amount of data and compute they use, the results of evaluations for capabilities and risks, or the downstream use and impact of their models. Without these facts, governments are ill-positioned to determine at present which entities would fall in which tiers under a specific proposal. And without this transparency moving forward, the public will continue to remain in the dark. Governments should remember that it is citizens, journalists, and academics who are often on the front lines of identifying technological harms, so transparency is essential for public accountability.

Nonetheless, the clock is ticking. Most notably, the time is now to get tiers right for the EU AI Act, which will be the world’s first comprehensive AI legislation. Our focus is on which foundation model developers/providers warrant the greatest scrutiny. Namely, this is the same objective expressed in the US AI Executive Order as well as in proposals that have support within the three bodies negotiating the EU AI Act: the Council, Commission, and Parliament.

We firmly reject developer properties. At present, the world’s most influential foundation models are often built by new startups, including companies like OpenAI and Anthropic. While these companies are rapidly raising money, conventional measures of company scale (e.g. number of employees) are unlikely to align with the impact of their technologies. Conversely, many large companies that build foundation models for narrow applications, as a small portion of their overall business, might instead be wrongly subject to heightened scrutiny.

We warn against standalone compute-based tiers. At present, compute is often the favored child in tier design. Most notably, the Biden administration used compute in its recent Executive Order, setting thresholds at 10^26 FLOPs for all foundation models and 10^23 FLOPs for those involving primarily biological data. In Frontier AI Regulation, Anderljung et al. also suggest 10^26 FLOPs, noting that the most compute-intensive publicly disclosed foundation model to date (GPT-4) is estimated to have used 2.1 x 10^25 FLOPs.

Compute has two significant benefits. First, we largely know how to measure it (using FLOPs), certainly to a greater extent than almost all other proposed metrics. And second, compute can be estimated pre-development, affording covered entities ample foresight to comply with requirements, including the possibility to intervene on the development process.
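
To illustrate how pre-development estimation works, here is a minimal sketch using the common heuristic from the scaling-laws literature that training compute is roughly 6 x (parameters) x (training tokens). The model configuration below is a hypothetical input (roughly the reported scale of Llama 2 70B), not a figure from any policy document; the 10^26 and 10^23 FLOP thresholds are those in the Executive Order.

    # Rough pre-development estimate of training compute, using the common
    # heuristic C ~ 6 * N * D (N parameters, D training tokens) from the
    # scaling-laws literature. The model configuration below is hypothetical.
    EO_GENERAL_THRESHOLD = 1e26   # FLOPs, threshold in the Executive Order
    EO_BIO_THRESHOLD = 1e23       # FLOPs, models trained primarily on biological data

    def estimated_training_flops(n_parameters: float, n_tokens: float) -> float:
        return 6 * n_parameters * n_tokens

    def covered_by_eo(flops: float, primarily_biological: bool = False) -> bool:
        threshold = EO_BIO_THRESHOLD if primarily_biological else EO_GENERAL_THRESHOLD
        return flops >= threshold

    # A hypothetical 70B-parameter model trained on 2 trillion tokens:
    flops = estimated_training_flops(70e9, 2e12)   # ~8.4e23 FLOPs
    print(flops, covered_by_eo(flops))             # below the 1e26 threshold

The same arithmetic is what lets a developer know, before training begins, whether a planned run would cross a compute threshold.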

However, in spite of these benefits, we caution against compute as a sole basis for thresholds. To begin, the relationship between compute and impact is quite tenuous and not well evidenced. Compute is known to be robustly predictive of the accuracy of language models on their training objective (predicting the next word) through a body of work on scaling laws. While such empirical relationships are not established for all modalities, we expect they likely exist for other modalities as well. However, these relationships do not guarantee that compute always predicts substantive or societally relevant model capabilities, as shown in the literature on emergent capabilities. Further, and more consequentially, there is no demonstration that compute robustly predicts results on risk evaluations, let alone that compute predicts the impact foundation models have in society. Notably, compute cannot account for how models are distributed, which significantly shapes downstream impact. This highlights the intrinsic flaw of compute: it is a measure of upstream resource expenditure, naturally divorced from downstream societal impact. Companies that build the most compute-intensive models may or may not have the greatest societal impact or mediate the greatest societal harm.

Beyond these intrinsic deficits of compute, we highlight three practical difficulties. First, a singular modality-agnostic compute threshold makes little sense given sizable disparities in compute usage across modalities. For example, if a high compute threshold is set to capture the most noteworthy language models, it is likely to not include any models of other modalities. Or if a low compute threshold is set (e.g. to capture noteworthy models with broad societal influence, including direct evidence of contributing to the problem of AI-generated pornography, like Stable Diffusion), it will cover many language models. Second, a compute threshold is not robust to interventions that significantly change model behavior at marginal compute cost. Methods like chain-of-thought prompting and reinforcement learning from human feedback exemplify low-compute, high-impact procedures. And finally, compute thresholds are tricky to maintain. Societal access to compute grows rapidly (today’s phones can do more than computers from a decade ago) and improvements in algorithmic efficiency mean we can do more with less. In spite of the short-term appeal of compute, it is unsound as the sole long-term basis for heightened scrutiny.

We see evaluations as promising but immature in 2023, when the stakes are public trust. Evaluations have been the central mechanism for tracking AI progress: the entire field’s rise to prominence in the past decade can be attributed in part to the influence and adoption of the ImageNet benchmark. While the recent escalation of AI from the domain of research to broader societal adoption means common model-level evaluations are inadequate to characterize societal impact, the traditions of evaluation do provide evidence that this may be a viable path forward. Unfortunately, as designers of HELM, a well-known benchmark for foundation models that is making inroads towards an industry standard, we believe evaluations are not yet ready. We do expect significant progress in the coming years due to top-level investment by the UK and US in evaluation development through their new AI Safety Institutes, though key agencies like NIST are habitually under-resourced to contribute to this vital work. Governments should closely track the state of evaluations, as well-executed evaluations could crisply articulate the properties of a foundation model and afford decision-makers the opportunity to intervene prior to market release.

We see demonstrated impact as the ideal, but measurement requires support. Beginning from first principles, the reason to place scrutiny on specific foundation models should be their demonstrated impact. Notably, legislative proposals like the EU AI Act originate from an emphasis on product safety, meaning an emphasis on how AI interfaces with the public. Therefore, it is natural that the best approaches for determining the conditions for greater scrutiny would be based on greater demonstrated impact or harm. In particular, we look to the EU’s Digital Services Act (the DSA, which just went into effect) as an exemplar, with its designation of Very Large Online Platforms and Very Large Online Search Engines. Given the stated intent of this designation to “identify, analyse, and assess systemic risks”, along with its function of determining the entities that must comply with the “most stringent rules of the DSA”, we take direct inspiration from the DSA. In particular, tiers in the DSA are based on demonstrated impact: a Very Large Online Platform is an online platform with at least 45 million monthly active users in the EU. Recent updates out of the EU AI Act trilogue indicate key EU policymakers support an emphasis on high-impact or systemic foundation models.

While we see demonstrated impact as appropriate from first principles, we recognize it has its own deficits. Fundamentally, demonstrating impact of course requires that a technology be placed on the market for a period of time to accrue that impact. Therefore, determining tiers solely based on impact complicates the ability to intervene on the original development of the model, which is a very important opportunity for changing a foundation model’s behavior. Similarly, this means the compliance burden cannot be easily anticipated by the foundation model developer/provider, though this concern can be mitigated by providing a buffer before compliance is required.

However, we see the key deficiency of demonstrated impact not as its conceptual imperfections, but its current measurability. Unlike online platforms such as Amazon, Facebook, Twitter, TikTok, and LinkedIn, the public (generally) does not directly use foundation models. Therefore, counting users is not a straightforward approach. We see two routes forward. The first is to track the applications that depend on a given foundation model, recognizing that all applications are not equally consequential, but that this may be a sensible proxy for impact. The second is to go further and count the aggregate number of users across those downstream applications, recognizing that this equates distinct forms of downstream usage (e.g. a user of Duolingo vs. a client of Morgan Stanley, where both depend on OpenAI’s GPT-4). Neither proposal is perfect, much like the DSA’s number of users is not a perfect measure of impact, but both might be good enough.
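
As a minimal sketch of these two proxies (all names and numbers below are hypothetical placeholders, since no public registry of downstream dependencies exists today):

    # Hypothetical registry of downstream applications and their user counts.
    # Neither the dependency data nor the numbers exist publicly today.
    DOWNSTREAM_APPS = [
        {"app": "app_a", "foundation_model": "model_x", "monthly_users": 20_000_000},
        {"app": "app_b", "foundation_model": "model_x", "monthly_users": 500_000},
        {"app": "app_c", "foundation_model": "model_y", "monthly_users": 3_000_000},
    ]

    def application_count(model: str) -> int:
        """Proxy 1: number of downstream applications built on the model."""
        return sum(1 for a in DOWNSTREAM_APPS if a["foundation_model"] == model)

    def aggregate_users(model: str) -> int:
        """Proxy 2: total users across those downstream applications."""
        return sum(a["monthly_users"] for a in DOWNSTREAM_APPS
                   if a["foundation_model"] == model)

    print(application_count("model_x"), aggregate_users("model_x"))  # 2 20500000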

Regardless of approach, the essential problem is no apparatus exists for tracking the downstream use of foundation models. The UK’s Competition and Markets Authority initiated market surveillance efforts for precisely this purpose, and while resources like Ecosystem Graphs that we have built make some headway, we simply do not have good indicators of downstream use at present. Some foundation model providers (e.g. those who can monitor distribution channels, like their own API) could characterize usage, but this is not directly possible for open foundation models. This reflects the nascency of foundation models: governments do not yet have robust mechanisms for supply chain monitoring that are customary for more mature industries.

The public databases in the EU AI Act may be the best option. Given our analysis, we favor measures of downstream impact as the basis for tiers over all other considered options. In practice, how can we measure impact? Implicitly, this depends on whether foundation model developers, downstream distribution channels, downstream deployers, or world governments can create the infrastructure to track impact.

In the context of the EU AI Act, a very concrete but imperfect approach could be through the proposed registration requirements in the Parliament position on the AI Act. Concretely, the Parliament position from June requires that both high-risk AI systems and foundation models be registered in a public database. Under this proposal, we strongly recommend an additional requirement: when registering high-risk AI systems, providers should be required to declare any foundation models the system depends upon.

In itself, such a requirement provides tremendous leverage: for EU policymakers to understand how foundation models are being used in society; for EU and member state enforcement bodies to allocate enforcement resources efficiently by directing them towards the most consequential foundation models; and for the public to understand how critical AI systems that influence their lives are built upon foundation models. In particular, this single line item follows naturally from the Parliament position: both high-risk AI systems and foundation models must be registered in public EU databases, so linking the two is a straightforward requirement that provides serious benefit.

If this requirement is included, a critical inroad emerges for characterizing the societal impact of foundation models. Given this surrogate for impact, foundation model tiers could be based on the number of downstream high-risk AI systems, the breadth of different high-risk categories from Annex III covered by these downstream systems, or other specifics of these downstream systems. We emphasize that this is still not a perfect measure of the impact, and potential for harm, of a foundation model, as it fails to account for downstream AI systems that are not high-risk. But, while imperfect, we see this as the clearest inroad towards concretizing the impact of foundation models.
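
To make this concrete, here is a minimal sketch, assuming hypothetical registry records (the AI Act databases do not currently capture these fields), of how tiers could be derived from declared dependencies using the number of downstream high-risk systems and the breadth of Annex III categories they span:

    # Hypothetical registration records: each high-risk AI system declares the
    # foundation model(s) it depends on and its Annex III category.
    REGISTRY = [
        {"system": "hiring_screener", "annex_iii_category": "employment",
         "depends_on": ["model_x"]},
        {"system": "exam_grader", "annex_iii_category": "education",
         "depends_on": ["model_x"]},
        {"system": "credit_scorer", "annex_iii_category": "essential services",
         "depends_on": ["model_y"]},
    ]

    def impact_profile(model: str) -> dict:
        """Surrogate impact measures derived from declared dependencies."""
        dependents = [r for r in REGISTRY if model in r["depends_on"]]
        return {
            "num_high_risk_systems": len(dependents),
            "num_annex_iii_categories": len({r["annex_iii_category"] for r in dependents}),
        }

    def tier(model: str, system_threshold: int = 2, category_threshold: int = 2) -> str:
        """Illustrative two-tier rule; the thresholds are placeholders, not proposals."""
        p = impact_profile(model)
        if (p["num_high_risk_systems"] >= system_threshold
                or p["num_annex_iii_categories"] >= category_threshold):
            return "heightened scrutiny"
        return "baseline"

    print(tier("model_x"))  # "heightened scrutiny" (2 systems, 2 categories)

The specific thresholds above are placeholders; the point is that once dependencies are declared, such impact profiles become trivially computable.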

Having downstream high-risk AI system providers declare their dependencies means this proposal is workable for tracking the impact of both open and closed foundation models. After all, this proposal mirrors how scientific papers reference the prior works they build on, allowing the scientific community to track the impact of open scientific research via mechanisms like Google Scholar. And, by tracking impact rather than resources like compute, this measure is more naturally agnostic to model modalities. In this way, if foundation models and high-risk AI systems are already to be registered, we strongly support the inclusion of this requirement.

We see potential in hybrid approaches. In this post, largely based on our view of current discourse, we have considered four “pure” approaches that determine tiers on a singular basis. However, a more complex but potentially more robust approach could be to pursue “mixed” strategies where different tiering approaches are integrated. In the EU, the Parliament has recently suggested this as a possibility. In particular, this may better balance the differing profiles of desirability and feasibility across measures.

Conclusion

The growing societal impact of foundation models demands thoughtful and robust governance. As world governments take action, and consider proportionate approaches, much uncertainty will complicate the design of tiers. To provide guidance, we make our current views clear. But, to state the obvious, governments must consult with all stakeholders (e.g. civil society organizations, academia, industry), not just the companies that lobby most aggressively, to credibly advance the public interest on this matter.

Author. Rishi Bommasani is the Society Lead at the Stanford Center for Research on Foundation Models (CRFM).

Acknowledgments

We thank Dan Ho, Daniel Zhang, Percy Liang and Tatsu Hashimoto for their comments on this piece as well as Arvind Narayanan, Ashwin Ramaswami, Aviv Ovadya, Christie Lawrence, Connor Dunlop, Helen Toner, Florence G’Sell, Irene Solaiman, Judy Shen, Kevin Klyman, Markus Anderljung, Marietje Schaake, Neel Guha, Owen Larter, Peter Cihon, Peter Henderson, Risto Uuk, Rob Reich, Russell Wald, Sanna Ali, Sayash Kapoor, Shayne Longpre, Steven Cao, Yacine Jernite, and Yo Shavit for discussions on this matter.

@misc{bommasani2023tiers, 
    author = {Rishi Bommasani}, 
    title  = {Drawing Lines: Tiers for Foundation Models}, 
    url    = {https://crfm.stanford.edu/2023/11/18/tiers.html}, 
    year   = {2023}
}