OpenAI Transparency Report

1. Data acquisition methods (Score: 0)

What methods does the developer use to acquire data used to build the model?

Like OpenAI’s other o‑series models, OpenAI o3 and o4‑mini were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate.
System card §2 "Model Data and Training": "Like OpenAI’s other o‑series models, OpenAI o3 and o4‑mini were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate."
While some methods are described, the exact data acquisition methods used are not clearly stated.

Which of the following data acquisition methods does the developer use: (i) acquiring existing public datasets, (ii) crawling the web, (iii) using data acquired via its existing products and services, (iv) licensing existing data from external parties, (v) having humans create or annotate new data, (vi) using models to generate new data, or (vii) other data acquisition methods not captured by the above? For example, if the developer uses reinforcement learning from human feedback to train models using model-generated outputs with human preference annotations, this would satisfy categories (v) and (vi). Alternatively, if the developer post-trains its model using off-the-shelf preference data (for example, the Alpaca dataset), this would satisfy category (i).
To build our model, we acquire data by crawling the Internet for publicly available data, licensing data from third parties, and using models to synthetically generate new data. Humans do not create new data, nor do we use data from our other products or services to train our model.
2. Public datasets (Score: 0)

What are the top-5 sources (by volume) of publicly available datasets acquired for building the model?

Not disclosed
Not disclosed
No information provided.

We define a source as the entity or means by which the developer acquires data. We define the top-5 sources as the top-5 sources by data volume.
We acquire publicly available data from only two sources: The Pile and CommonCrawl.
3. Crawling (Score: 1)

If data collection involves web-crawling, what is the crawler name and opt-out protocol?

When data collection involves crawling publicly available webpages, we use our GPTBot crawler. Website owners who do not wish their content to be included in training can opt out by disallowing the GPTBot user-agent in their robots.txt file or by blocking the IP ranges we publish for GPTBot.
Article "OpenAI’s Approach to Frontier Risk", section "Data Input Controls and Audit": “OpenAI has implemented an easy means for website operators to exclude their content from being accessed by OpenAI’s ‘GPTBot’ web crawler, relying on the robots.txt web standard… We provide instructions online for how to disallow [GPTBot] from accessing sites.”
GPTBot crawler is used; opt-out based on robots.txt or blocking the published IP ranges.

We award this point for disclosure of the crawler name and opt-out protocols, including if/how they respect the Robots Exclusion Protocol (robots.txt).
Our web crawler is named A and information on the opt-out protocol can be found at this URL: ... The CommonCrawl web crawler is named CCBot and information on the opt-out protocol can be found at this URL: https://commoncrawl.org/faq#:~:text=How%20can%20I%20block%20the,%2Dagent%20string%20is%3A%20CCBot.
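
As an illustration of how a robots.txt opt-out is evaluated, the following minimal sketch uses Python's standard-library robots.txt parser. The robots.txt content and the example.com URLs are hypothetical; only the GPTBot and CCBot user-agent names come from the text above, and the sketch does not describe how either crawler is actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site (an assumption for illustration).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "CCBot", "OtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

Under this hypothetical file, GPTBot is excluded from the whole site, CCBot only from paths under /private/, and all other crawlers are allowed.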
4. Usage data used in training (Score: 0)

What are the top-5 sources (by volume) of usage data from the developer's products and services that are used for building the model?

Not disclosed
Not disclosed
No information provided.

We define usage data as data collected from the use of a developer's products or services.
We use usage data from only two sources: our deployed chatbot X and our online social media platform Y.
5. Notice of usage data used in training (Score: 0)

For the top-5 sources of usage data, how are users of these products and services made aware that this data is used for building the model?

Not disclosed
Not disclosed
No information provided.

We define usage data notice as the proactive disclosure to users of how their data is used for model development. For example, via a pop-up with a description, a link to the privacy policy, or link to a description of company practices.
We notify users of our chatbot X that chatbot interactions are used to train our AI via a pop-up as shown at this URL: ... We notify users of our platform Y about whether their data is used to train our AI via a link to our privacy policy when they sign up for an account.
6. Licensed data sources (Score: 0)

What are the top-5 sources (by volume) of licensed data acquired for building the model?

Not disclosed
Not disclosed
No information provided.

We define a source as the entity from which the developer acquires data. For example, the Associated Press is reportedly a source of licensed data for OpenAI.
We license data from only three sources: A, B, and C.
7. Licensed data compensation (Score: 0)

For each of the top-5 sources of licensed data, are details related to compensation disclosed?

Not disclosed
Not disclosed
No information provided.

We award this point if the model developer describes the compensation structure specified in the contract with the data source or indicates they are prohibited from sharing this information if contractually mandated.
We compensate A by ... We cannot disclose information on compensation for our relationships with B and C due to contractual terms that prohibit public disclosure.
8. New human-generated data sources (Score: 0)

What are the top-5 sources (by volume) of new human-generated data for building the model?

Not disclosed
Not disclosed
No information provided.

We define a source as the entity or means by which the developer acquires data. For example, Scale AI could be a source of new human-generated data. By new, we mean the data is specifically acquired for the purposes of building the model.
We acquire new human-generated data from only two sources: our internal data annotation team and an external vendor, A.
9. Instructions for data generation (Score: 0)

For each of the top-5 sources of human-generated data, what instructions does the developer provide for data generation?

Not disclosed
Not disclosed
No information provided.

The instructions should be those provided to the data source. For example, if a third-party vendor works directly with the data laborers to produce the data, the instructions from the developer to this vendor should be disclosed.
We instruct our internal data annotation team as follows: ... We instruct vendor A as follows: ...
10. Data laborer practices (Score: 0)

For the top-5 sources of human-generated data, how are laborers compensated, where are they located, and what labor protections are in place?

Not disclosed
Not disclosed
No information provided.

For each data source, we require (i) the compensation in either USD or the local currency, (ii) any countries where at least 25% of the laborers are located, and (iii) a description of any labor protections. We will award this point if the developer discloses that it is not aware of data laborer practices.
Our internal data annotation team is located in the US, is compensated at 20 USD per hour, and deals with data that does not require specific protections. Our sole external data vendor contracts laborers in Kenya, compensates them at KES 15000 per month, and implements protections for dealing with toxic or unsafe content such as A and B.
11. Synthetic data sources (Score: 0)

What are the top-5 sources (by volume) of synthetic data acquired for building the model?

Not disclosed
Not disclosed
No information provided.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We synthetically generate data using only our previous model X and an early checkpoint of our current flagship model Y.
12. Synthetic data purpose (Score: 0)

For the top-5 sources of synthetically generated data, what is the primary purpose for data generation?

Not disclosed
Not disclosed
No information provided.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We use model X to generate instruction-tuning data and we use model Y to generate candidate responses that humans select between to provide human preference data for reinforcement learning with human feedback.
13. Data processing methods (Score: 0)

What are the methods the developer uses to process acquired data to determine the data directly used in building the model?

Our data‑processing pipeline includes rigorous filtering to maintain data quality and mitigate potential risks.
System card §2 "Model Data and Training": "Our data processing pipeline includes rigorous filtering to maintain data quality and mitigate potential risks."
No substantive information about all data processing involved.

We will award this point for disclosure of all of the methods used to process acquired data. Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
We process data in the following six-step pipeline: (i) removal of HTML artifacts, (ii) deduplication, (iii) language identification to retain English data, (iv) removal of CSAM imagery, (v) removal of train-test overlap, and (vi) tokenization.
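
To make the first two steps of the example pipeline concrete, here is a minimal sketch of HTML artifact removal followed by exact deduplication. The regular expression, the hashing choice, and all function names are assumptions made for illustration; they do not describe any developer's actual pipeline.

```python
import hashlib
import re

# Naive tag pattern; real pipelines typically use a proper HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Remove HTML tags and collapse runs of whitespace."""
    without_tags = TAG_RE.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def deduplicate(docs: list[str]) -> list[str]:
    """Drop exact duplicates (case-insensitive), keeping first occurrences."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def process(raw_docs: list[str]) -> list[str]:
    """Apply the two steps in order: clean first, then deduplicate."""
    return deduplicate([strip_html(doc) for doc in raw_docs])


if __name__ == "__main__":
    print(process(["<p>Hello  world</p>", "hello world", "<b>Other</b>"]))
```

Cleaning before deduplicating matters here: two documents that differ only in markup collapse to the same text and are counted once.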
14. Data processing purpose (Score: 0)

For each data processing method, what is its primary purpose?

We use advanced data filtering processes to reduce personal information from training data.
System card §2 "Model Data and Training": "We use advanced data filtering processes to reduce personal information from training data."
No substantive information about all data processing involved.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of primary purposes for a data processing method could include: (i) removes low quality data, (ii) removes potentially personal/copyrighted data, (iii) removes product-irrelevant data, (iv) removes toxic data, (v) improves evaluation integrity, or (vi) prepares the data for training the model.
15. Data processing techniques (Score: 0)

For each data processing method, how does the developer implement the method?

We also employ a combination of our Moderation API and safety classifiers to help prevent the use of harmful or sensitive content, including explicit materials such as sexual content involving a minor.
System card §2 "Model Data and Training": "We also employ a combination of our Moderation API and safety classifiers to help prevent the use of harmful or sensitive content, including explicit materials such as sexual content involving a minor."
No substantive information about all data processing involved.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of how a data processing method is implemented could include: the method (i) is implemented using an in-house regular expression, (ii) is implemented using an in-house tool based on n-gram overlap, (iii) is implemented using a FastText classifier trained on Wikipedia data, (iv) is implemented using hash collisions with the NCMEC database, (v) is implemented by searching for known benchmark canary strings, and (vi) is implemented using tiktoken (https://github.com/openai/tiktoken).
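
Two of the techniques named above, n-gram overlap and canary-string search, can be sketched as follows. This is a simplified illustration under stated assumptions: whitespace tokenization, a small n, and a placeholder canary value that is not a real benchmark canary.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps_benchmark(document: str, benchmark_items: list[str], n: int = 8) -> bool:
    """Flag a document that shares at least one n-gram with any benchmark item."""
    doc_grams = ngrams(document, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)


def contains_canary(document: str, canaries: list[str]) -> bool:
    """Flag a document that contains any known canary string verbatim."""
    return any(canary in document for canary in canaries)


if __name__ == "__main__":
    benchmark = ["what is the capital of france"]
    canaries = ["CANARY-EXAMPLE-1234"]  # hypothetical placeholder value
    doc = "trivia: what is the capital of france answer paris"
    print(overlaps_benchmark(doc, benchmark, n=4))
    print(contains_canary(doc, canaries))
```

Documents flagged by either check would be removed from the training set to reduce train-test overlap.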
16. Data size (Score: 0)

Is the size of the data used in building the model disclosed?

Not disclosed
Not disclosed
No information provided.

To receive this point, the developer should report data size in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). The size should reflect data directly used in building the model (i.e. training data) and not data that was acquired but unused, or data used to evaluate the model.
We used 3 x 10^12 tokens of text, 1 x 10^6 images, and 5 x 10^5 hours of audio for training.
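
Several indicators ask for figures reported to one significant figure. The helper below is a minimal sketch of that rounding; the function name and the raw input values are hypothetical and chosen only to reproduce the rounded figures in the example above.

```python
from math import floor, log10


def round_sig(value: float, sig: int = 1) -> float:
    """Round value to the given number of significant figures."""
    if value == 0:
        return 0.0
    exponent = floor(log10(abs(value)))
    factor = 10 ** (exponent - sig + 1)
    return float(round(value / factor) * factor)


if __name__ == "__main__":
    # Hypothetical raw measurements, rounded for reporting.
    print(round_sig(3.4e12))     # tokens of text
    print(round_sig(1_234_567))  # images
    print(round_sig(0.0023))     # e.g. a latency in seconds
```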
17. Data language composition (Score: 0)

For all text data used in building the model, what is the composition of languages?

Not disclosed
Not disclosed
No information provided.

To receive this point, the developer should report (i) all languages which make up at least 1% of the data and their corresponding proportions and (ii) a brief description of how languages are labeled (if a publicly available tool is used, include a link to the tool). Proportions should be reported to a precision of two significant figures and should describe proportions of documents labeled with some language. An "Unknown" category may be included to denote documents where the language could not be identified.
English 80%, Spanish 5.0%, French 3.0%, Chinese 2.0%, Unknown 10%. We used a FastText-based classifier trained on Wikipedia data to identify languages.
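
A composition like the one above can be derived from per-document language labels. The sketch below assumes the labels have already been produced by some language identifier; the label values, the function name, and the document counts are hypothetical.

```python
from collections import Counter


def language_composition(labels: list[str], min_share: float = 0.01) -> dict[str, float]:
    """Return the share of documents per language, largest first.

    Languages below min_share (1% by default) are omitted, matching the
    reporting threshold described above.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    shares = {lang: count / total for lang, count in counts.items()}
    ordered = sorted(shares.items(), key=lambda item: item[1], reverse=True)
    return {lang: share for lang, share in ordered if share >= min_share}


if __name__ == "__main__":
    # Hypothetical labels for 100 documents.
    labels = ["en"] * 80 + ["es"] * 5 + ["fr"] * 3 + ["zh"] * 2 + ["unknown"] * 10
    for lang, share in language_composition(labels).items():
        print(f"{lang}: {share:.1%}")
```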
18. Data domain composition (Score: 0)

For all the data used in building the model, what is the composition of domains covered in the data?

Not disclosed
Not disclosed
No information provided.

To receive this point, the developer should report the composition of the main domains included in the data used to train the model. This data should be at a level of granularity lower than broad claims about training on "internet data". For example, this could include the proportion of data from e-commerce, social media, news, code, etc. based on the URLs from which the data is sourced. Proportions should be reported to a precision of one significant figure.
Social media 40%, code repositories 30%, news articles 20%, e-commerce product listings 5%, scientific papers 5%.
19. External data access (Score: 0)

Does a third party have direct access to the data used to build the model?

Not disclosed
Not disclosed
No information provided.

By a third party, we mean entities that are financially independent of the developer. We will award this point if at least one such entity is named as having direct access to the data. With that said, we may award this point if the developer provides justifications for prohibiting access to narrowly-scoped parts of the data.
Third parties that have direct access to the data include organizations A and B.
20. Data replicability (Score: 0)

Is the data used to build the model described in enough detail to be externally replicable?

Not disclosed
Not disclosed
No information provided.

We will award this point if the description contains (i) a list of all publicly available training data and where to obtain it and (ii) a list of all training data obtainable from third parties and where to obtain it. These conditions refer to criteria 2 and 3 under the OSI Open Source AI v1.0 definition.
The listing of publicly available training data can be found at this URL ... and the listing of all training data obtainable from third parties can be found at this URL ...
21. Compute usage for final training run (Score: 0)

Is the amount of compute used in the model's final training run disclosed?

Not disclosed
Not disclosed
No information provided.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPs). This number should represent the compute used to train the final model across all model stages.
Our model was trained using 5 x 10^25 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
22. Compute usage including R&D (Score: 0)

Is the amount of compute used to build the model, including experiments, disclosed?

Not disclosed
Not disclosed
No information provided.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 7 x 10^26 FLOPs). Compared to the previous indicator, this indicator should include an estimation of the total compute used across experiments used towards the final training run for the model (such as including hyperparameter optimization or other experiments), and not just the final training run itself.
Our cumulative compute usage involved in building the model was 7 x 10^26 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
23. Development duration for final training run (Score: 0)

Is the amount of time required to build the model disclosed?

Not disclosed
Not disclosed
No information provided.

The amount of time should be specified in terms of both the continuous duration of time required and the number of hardware hours used. The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). The number of hardware hours should be reported to a precision of one significant figure and include the type of hardware hours. No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, or training and subsequent evaluation and red teaming).
Our model was trained over a period of 90 days using 4 x 10^4 NVIDIA H100 GPU-days.
24. Compute hardware for final training run (Score: 0)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

Not disclosed
Not disclosed
No information provided.

In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.
Our model was trained using 1000 NVIDIA H100 GPUs.
25. Compute provider (Score: 0)

Is the compute provider disclosed?

Not disclosed
Not disclosed
No information provided.

For example, the compute provider may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that compute is provided by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.
Compute is provided by Google Cloud Platform.
26. Energy usage for final training run (Score: 0)

Is the amount of energy expended in building the model disclosed?

Not disclosed
Not disclosed
No information provided.

Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh), along with a description of the measurement methodology, which may involve estimation. Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model was trained using an estimated 1 x 10^4 MWh of energy. To estimate training energy consumption, we divided training FLOPs (5 x 10^25) by a conversion factor using NVIDIA A100 GPU information (3.74 × 10^21 FLOPs/MWh) given we train using FP16 with sparsity.
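
The arithmetic behind this example can be checked with a short sketch. Because the conversion factor is expressed in FLOPs per MWh, the energy estimate is the FLOP count divided by that factor. Both inputs are taken from the example above; the variable and function names are illustrative.

```python
TRAINING_FLOPS = 5e25    # from the example above
FLOPS_PER_MWH = 3.74e21  # conversion factor from the example above


def training_energy_mwh(flops: float, flops_per_mwh: float) -> float:
    """Estimate energy in MWh from total FLOPs and a FLOPs-per-MWh factor."""
    return flops / flops_per_mwh


energy_mwh = training_energy_mwh(TRAINING_FLOPS, FLOPS_PER_MWH)

if __name__ == "__main__":
    print(f"Unrounded estimate: {energy_mwh:,.0f} MWh")
    print(f"One significant figure: {round(energy_mwh, -4):,.0f} MWh")
```

The unrounded result is roughly 1.3 x 10^4 MWh, which rounds to the 1 x 10^4 MWh reported at one significant figure.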
27. Carbon emissions for final training run (Score: 0)

Is the amount of carbon emitted in building the model disclosed?

Not disclosed
Not disclosed
No information provided.

Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2), along with a description of the measurement methodology, which may involve estimation. Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions are for a single model run or include additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it. Emissions should correspond with the energy used in the previous indicator.
Our model yielded an estimate of 5 x 10^3 tCO2. To estimate training carbon emissions, we multiplied training energy usage (1 x 10^4 MWh) by a 2023 estimate for the US data center carbon intensity (0.375 tCO2/MWh) given the data centers used in training operate in the US.
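
As a worked check of this example, the sketch below multiplies an energy figure by the stated carbon intensity. It shows that the 5 x 10^3 tCO2 figure matches the unrounded energy estimate (5 x 10^25 FLOPs divided by 3.74 x 10^21 FLOPs/MWh, about 1.3 x 10^4 MWh), whereas applying the same intensity to the rounded 1 x 10^4 MWh gives about 4 x 10^3 tCO2. All numeric inputs come from the examples in this report; the names are illustrative.

```python
from math import floor, log10

CARBON_INTENSITY = 0.375  # tCO2 per MWh, from the example above


def one_sig_fig(value: float) -> float:
    """Round a positive value to one significant figure."""
    factor = 10 ** floor(log10(value))
    return float(round(value / factor) * factor)


def emissions_tco2(energy_mwh: float, intensity: float = CARBON_INTENSITY) -> float:
    """Estimate emissions in tCO2 from energy in MWh."""
    return energy_mwh * intensity


energy_unrounded = 5e25 / 3.74e21  # about 1.3 x 10^4 MWh
energy_rounded = 1e4               # energy as reported to one significant figure

if __name__ == "__main__":
    print(one_sig_fig(emissions_tco2(energy_unrounded)))
    print(one_sig_fig(emissions_tco2(energy_rounded)))
```

The gap between the two results is an argument for stating which energy value, rounded or unrounded, feeds the emissions estimate.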
28. Water usage for final training run (Score: 0)

Is the amount of clean water used in building the model disclosed?

Not disclosed
Not disclosed
No information provided.

Clean water usage should be reported in appropriate units, which most often will be megaliters, along with a description of the measurement methodology, which may involve estimation. Clean water usage should be reported to a precision of one significant figure (e.g., 5000 ML). No form of decomposition into compute phases is required, but it should be clear whether the reported water usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate water usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model yielded an estimate of 20 ML of water. To estimate training water usage, we multiplied training energy usage (1 x 10^4 MWh) by a 2021 estimate for the US data center water efficiency (1.8 ML per 1,000 MWh) given the data centers used in training operate in the US.
29. Internal compute allocation (Score: 0)

How is compute allocated across the teams building and working to release the model?

Not disclosed
Not disclosed
No information provided.

To receive a point, the developer should provide the compute allocated to each team involved in training the model. We understand there might be no clear allocation of compute across different teams; in that case, report an estimate of the compute used over the last year. Compute allocation should be reported to at least one significant figure.
- Safety: 15%
- Pre-training: 60%
- Post-training: 15%
- Infrastructure and reliability: 5%
30. Model stages (Score: 0)

Are all stages in the model development process disclosed?

In the system card and in the release blog, we describe stages, including detailing that "by retracing the scaling path—this time in RL—we’ve pushed an additional order of magnitude in both training compute and inference-time reasoning".
Not disclosed
No substantive information provided.

Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.
We define five stages in building the model: (1) unsupervised pre-training, (2) supervised instruction tuning, (3) RLHF, (4) domain-specific fine-tuning, and (5) final safety alignment.
31. Model objectives (Score: 0)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

Not disclosed
Not disclosed
No information provided.

We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).
During unsupervised pre-training, the objective is next-token prediction. During supervised instruction tuning, we optimize for correctness and helpfulness on labeled tasks. RLHF aligns model outputs with human preference judgments. Domain-specific fine-tuning focuses on improving in-domain capabilities using specialized data (e.g., code or legal text). Final safety alignment reduces disallowed or harmful responses.
32. Code access (Score: 0)

Does the developer release code that allows third parties to train and run the model?

Not disclosed
Not disclosed
No information provided.

The released code does not need to match the code used internally.
We release training and inference code under an Apache 2.0 license at https://github.com/..., enabling others to replicate our core pipeline.
33. Organization chart (Score: 0)

How are employees developing and deploying the model organized internally?

Not disclosed
Not disclosed
No information provided.

To receive a point, the developer should provide both the internal organization chart for the team developing the model as well as the headcounts (or a proportion of headcounts) by the team.
The model team comprises 63 people, organized as follows:
- CEO
- Managing Director (Safety): 24 people
- Managing Director (Pre-training): 12 people
- Managing Director (Post-training): 11 people
- Managing Director (API): 6 people
- Director (Infrastructure and reliability): 7 people
- Director (PR and marketing): 4 people
- Director (hiring): 7 people
34. Model cost (Score: 0)

What is the cost of building the model?

Not disclosed
Not disclosed
No information provided.

Monetary cost should be reported in appropriate currency (e.g. USD), along with the measurement methodology, which may involve estimation. Cost should be reported to a precision of one significant figure (e.g. 200 million USD).
We spent approximately 200 million USD on building the model: 50 million for data acquisition, 10 million for data processing, 20 million for personnel, 80 million for compute for R&D priced at market rates, and 40 million for compute for the final training run priced at market rates.
35. Basic model properties (Score: 0)

Are all basic model properties disclosed?

The OpenAI o-series models are trained with large-scale reinforcement learning on chains of thought.
System card §1 "Introduction": "The OpenAI o‑series models are trained with large‑scale reinforcement learning on chains of thought."
The modalities can be found in the developer's public API documentation (https://platform.openai.com/docs/models/o3). However, the developer discloses only limited information about model components and does not disclose information about the size and architecture.

Basic model properties include: the input modality, output modality, model size, model components, and model architecture. To receive a point, all model properties should be disclosed. Modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs. Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module. Model size should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder). Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently; a sufficient disclosure includes any clear, though potentially incomplete, description of the model architecture.
Input modality: Text
Output modality: Text
Model components: Decoder-only model trained using self-supervised learning, followed by supervised fine-tuning and RLHF that are used to align the language model to follow users' instructions and be helpful, harmless, and honest.
Model size: 70B parameters
Model architecture: Autoregressive (causal, decoder-only) transformer language model with rotary position embeddings, trained on the next-token prediction task.
36. Deeper model properties (Score: 0)

Is a detailed description of the model architecture disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

To receive a point, the model architecture should be described in enough detail to allow for an external entity to fully implement the model. Publicly available code or a configuration file for a model training library (e.g., GPT-NeoX) would be a sufficiently detailed description.
The configuration file for training our model using a public model training library A can be found at [URL].
37. Model dependencies (Score: 0)

Are the model(s) from which the model is derived disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

We will award this point for a comprehensive disclosure of the model or models on which the foundation model directly depends or from which it is derived, as well as the method by which it was derived (e.g., through fine tuning, model merging, or distillation). Additionally, we will award a point if the developer discloses that the model is not dependent on or derived from any model.
This model is a fine tune of Camel-70B. We used the methods described in [PAPER URL] for distillation.
38. Benchmarked inference (Score: 0)

Is the compute and time required for model inference disclosed for a clearly-specified task on clearly-specified hardware?

Not disclosed
Not disclosed
The developer does not disclose this information.

The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). Compute usage for inference should be reported in FLOPs/second to a precision of one significant figure (e.g. 5 x 10^21 FLOPs/second). The hardware in this evaluation need not be the hardware the developer uses for inference. The developer can report this figure over some known or public dataset.
It takes 0.002 seconds and 5 x 10^21 FLOPs/second to generate 100,000 tokens as 5,000 sequences of length 20 given inputs of length 40 from [DATASET URL]. The fixed set of hardware is 8 NVIDIA A100s.
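The reporting convention above (one significant figure, duration in seconds, compute in FLOPs/second) can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the rubric: the helper name `round_sig` is hypothetical, and the input figures are simply those of the example disclosure.

```python
import math

def round_sig(x: float, sig: int = 1) -> float:
    """Round x to `sig` significant figures (hypothetical helper; 0 stays 0)."""
    if x == 0:
        return 0.0
    exponent = math.floor(math.log10(abs(x)))
    return round(x, sig - 1 - exponent)

# Figures taken from the example disclosure above.
num_sequences = 5_000
tokens_per_sequence = 20
duration_s = 0.002        # seconds, already at one significant figure
flops_per_s = 5e21        # FLOPs/second, already at one significant figure

# Total generated tokens: sequences times tokens per sequence.
total_tokens = num_sequences * tokens_per_sequence

# Total compute implied by the two reported figures: rate times duration.
total_flops = flops_per_s * duration_s

print(total_tokens, total_flops, round_sig(0.00234), round_sig(5.4e21))
```

A raw measurement such as 0.00234 seconds would be reported as 0.002 seconds under this convention.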
39. Researcher credits (Score: 0)

Is a protocol for granting external entities API credits for the model disclosed?

We publicly describe two programs that let external parties obtain API credits usable on all public models, including o3: the Researcher Access Program (up to US $1,000 per researcher, reviewed quarterly, credits valid for 12 months) and the Codex open source fund (US $1 million pool awarding up to US $25,000 in API credits per project, reviewed on a rolling basis).
Researcher Access Program page: “Researchers can apply for up to $1,000 of OpenAI API credits… Credits are valid for a period of 12 months… Applications are reviewed once every 3 months” Codex open source fund page: “We’re excited to launch a $1 million initiative… projects receiving grants up to $25,000 in API credits.”
Criteria for selecting external entities and the policy on the timeframe are not disclosed.

A model credit access protocol refers to the steps, requirements, and considerations involved in granting credits to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access to credits (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) its policy on granting a transparent decision on whether access has been granted within a specified, reasonable period of time. Additionally, we will award a point if the developer discloses that it does not grant external entities API credits.
We implement a researcher access program: (i) Access can be requested from [URL] (ii) Any researcher at an accredited research institution is eligible to apply. Decisions are made based on the alignment between the applicant's project description and our target research directions (as described here: [URL]). (iii) Decision notifications are sent within three weeks of the application receipt.
40. Specialized access (Score: 0)

Does the developer disclose if it provides specialized access to the model?

OpenAI provided third party assessors early model access to evaluate frontier risks related to autonomous capabilities, deception, and cybersecurity. Third party assessors were provided both OpenAI o3 and o4-mini early checkpoints, as well as the final launch candidate models to conduct their assessments. As part of our ongoing efforts to consult with external experts, OpenAI granted early access to these versions of o3 and o4-mini to the U.S. AI Safety Institute to conduct evaluations of the models’ cyber and biological capabilities, and to the U.K. AI Security Institute to conduct evaluations of cyber, chemical and biological, and autonomy capabilities, and an early version of the safeguards.
System card §3.9 “Third-Party Assessments” states that OpenAI “granted early access … to the U.S. AI Safety Institute and the U.K. AI Safety Institute” for cyber- and bio-risk evaluations. The same section notes that METR, a nonprofit frontier-risk evaluator, was given “early checkpoints … and the final launch candidate models” for autonomous-capability testing. Apollo Research is listed as an external assessor of deception and scheming risks, confirming its early access. Pattern Labs, an AI-security nonprofit, evaluated o3 and o4-mini on cybersecurity challenges under a dedicated subsection, evidencing its specialized access.
Although the developer discloses two organizations given specialized access, it's not clear whether these constitute all entities granted specialized access. Statistics about all organizations are required for a point.

Specialized access could include several categories, such as early access, subsidized access, or deeper access (e.g., to model weights or checkpoints, that are not publicly available). We will award this point if the developer discloses (i) if it provides specialized access and (ii) statistics on the number of users granted access across academia, industry, non-profits, and governments, to one significant figure.
We provide early access to the model via API to: (1) 250 academics vetted by our program (2) 0 industry affiliates (3) 0 non-profit affiliates (4) 2 government entities with whom we have signed MoUs. We provide no other specialized research access.
41. Open weights (Score: 0)

Are the model's weights openly released?

Not disclosed
Not disclosed
The weights are not openly released.

To receive this point, model weights need to be publicly available at no cost. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via a license or some high-level description of who has been granted access to the foundation model).
Model weights are available on HuggingFace by following this link: [URL]
42. Agent Protocols (Score: 0)

Are the agent protocols supported for the model disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

Agent protocols are specifications that define how autonomous agents exchange messages, context, or function calls with other agents, tools, or services (e.g., Anthropic’s Model Context Protocol (MCP) and Google’s Agent‑to‑Agent (A2A) spec). To earn this point, documentation must enumerate each protocol and describe any deviations or proprietary extensions.
We support MCP and A2A for agents built using model A.
43. Capabilities taxonomy (Score: 0)

Are the specific capabilities or tasks that were optimized for during post-training disclosed?

Our launch blog described the capabilities we focused on when training o3. Published text includes: "OpenAI o3 is our most powerful reasoning model that pushes the frontier across coding, math, science, visual perception, and more … It’s ideal for complex queries requiring multi-faceted analysis and whose answers may not be immediately obvious. It performs especially strongly at visual tasks like analyzing images, charts, and graphics."
System card §3.7 “Jailbreaks through Custom Developer Messages”: “To mitigate this issue, we taught the model to adhere to an Instruction Hierarchy…” System card §3.5 “Person Identification and Ungrounded Inference Evaluations”: “Our safety mitigations include post-training our reasoning models to refuse requests to identify a person based on an image, and to refuse requests for ungrounded inferences.” System card §4.5 “Safeguards”: “Modified post-training of the models to refuse high-risk biological requests while not refusing benign requests…”
The developer does not disclose the set of capabilities specifically optimized for during the post-training phase of the model.

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.
We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual language proficiency on non-English languages (4) Tool-use
44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

Our system card for o3 was published on launch day and includes detailed evaluation results. As examples, before launch we quantified OpenAI o3’s performance across multiple Preparedness-Framework capability areas and released the full results in the public system card. Headline metrics include 71 % pass@1 on the SWE-bench-Verified real-world software-engineering benchmark, 0.888 average accuracy across 13 languages on MMLU, 98.7 % recall from our biorisk reasoning monitor, and reported success rates on professional-level cybersecurity CTF and cyber-range tasks.
System card §4.4.2 “SWE-bench Verified”: “the o3 helpful-only model achiev[es] a SOTA result of 71 %” System card §5 “Multilingual Performance”: Table 16 lists an average accuracy of 0.888 for o3 across 13 languages System card §4.5 “Safeguards”: “We simulated our blocking logic and found 4 misses, resulting in a recall of 98.7 %” on the biorisk red-teaming set
OpenAI reports results on several capability evaluations organized into categories even though they do not define a precise capability taxonomy.

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.
We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool
45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

We have open-sourced our evaluation frameworks and benchmark prompts (the openai/evals and openai/simple-evals repositories, MIT-licensed), enabling anyone to reproduce the public benchmark results we report for o3. Many evals in our system card are on standard public benchmarks.
GitHub simple-evals README: “We are open sourcing it so we can be transparent about the accuracy numbers we’re publishing alongside our latest models.” GitHub openai/evals README: “Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.”
The disclosed repositories do not allow all capability evaluations to be reproduced (e.g., not all capabilities named in the previous indicator are reproducible with the reported frameworks).

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

Not disclosed
Not disclosed
The developer does not disclose this information.

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).
We compute train-test overlap by n-gram matching, following the procedure described here [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
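As an illustration of what an n-gram matching methodology might look like, here is a minimal sketch. It is not the procedure referenced at [URL], which is unspecified. The whitespace tokenization, the default n-gram length of 8, and the definition of overlap (the fraction of test examples sharing at least one n-gram with the training set) are all assumptions, and the function names are hypothetical.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_docs: list, test_examples: list, n: int = 8) -> float:
    """Fraction of test examples that share at least one n-gram with any training doc."""
    if not test_examples:
        return 0.0
    # Collect every n-gram seen anywhere in the training data.
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    # A test example counts as overlapping if any of its n-grams was seen in training.
    overlapping = sum(1 for example in test_examples if ngrams(example, n) & train_ngrams)
    return overlapping / len(test_examples)

# Toy usage with a short n so the tiny strings can match.
train = ["the quick brown fox jumps over the lazy dog"]
test = ["a quick brown fox appears here", "nothing shared at all"]
rate = overlap_rate(train, test, n=3)
```

At real scale the training n-grams would not fit in an in-memory set; hashing or a Bloom filter is a common substitute, but that choice is outside what the rubric specifies.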
47. Risks taxonomy (Score: 1)

Are the risks considered when developing the model disclosed?

We evaluated OpenAI o3 and o4-mini according to our Preparedness Framework. This is the first launch and system card to be released under our updated Preparedness Framework. The Framework currently has three Tracked Categories: Biological and Chemical, Cybersecurity, and AI Self-Improvement. OpenAI’s Safety Advisory Group (SAG) reviewed the results of these evaluations and determined that OpenAI o3 and o4-mini do not reach the High threshold in any of our three Tracked Categories.
System card §4 "Preparedness": "We evaluated OpenAI o3 and o4‑mini according to our Preparedness Framework."
The developer discloses the risks considered.

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.
We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information
48. Risks evaluation (Score: 1)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

Our system card for o3 includes quantitative results of our evaluations for a range of disallowed content categories, including the model's risk of producing harassing, violent and other types of disallowed content. We also quantify jailbreak and hallucination risk, bias, and various other risks in the system card.
System card §4.1 "Capabilities Assessment": "We ran scalable evaluations throughout training and on intermediate post‑trained checkpoints of OpenAI o3 and o4‑mini, as well as a final automated eval sweep on the launch candidates."
The developer discloses quantitative results on evaluations for each of the risks in the previous indicator.

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.
We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

We have open-sourced our evaluation frameworks and benchmark prompts (the openai/evals and openai/simple-evals repositories, MIT-licensed), enabling anyone to reproduce the public benchmark results we report for o3.
GitHub simple-evals README: “We are open sourcing it so we can be transparent about the accuracy numbers we’re publishing alongside our latest models.” GitHub openai/evals README: “Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.”
These evaluation frameworks do not seem to cover certain risk evaluations (e.g., Cyber Range).

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

OpenAI provided third party assessors early model access to evaluate frontier risks related to autonomous capabilities, deception, and cybersecurity. Third party assessors were provided both OpenAI o3 and o4-mini early checkpoints, as well as the final launch candidate models to conduct their assessments. As part of our ongoing efforts to consult with external experts, OpenAI granted early access to these versions of o3 and o4-mini to the U.S. AI Safety Institute to conduct evaluations of the models’ cyber and biological capabilities, and to the U.K. AI Security Institute to conduct evaluations of cyber, chemical and biological, and autonomy capabilities, and an early version of the safeguards.
System card §3.9 “Third-Party Assessments” states that OpenAI “granted early access … to the U.S. AI Safety Institute and the U.K. AI Safety Institute” for cyber- and bio-risk evaluations. The same section notes that METR, a nonprofit frontier-risk evaluator, was given “early checkpoints … and the final launch candidate models” for autonomous-capability testing. Apollo Research is listed as an external assessor of deception and scheming risks, confirming its early access. Pattern Labs, an AI-security nonprofit, evaluated o3 and o4-mini on cybersecurity challenges under a dedicated subsection, evidencing its specialized access.
The disclosure does not include the terms of the analysis or information about financial transactions.

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will award this point if the developer discloses no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.
We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.
51. External risk evaluation (Score: 1)

Are the parties contracted to evaluate model risks disclosed?

OpenAI provided third party assessors early model access to evaluate frontier risks related to autonomous capabilities, deception, and cybersecurity. Third party assessors were provided both OpenAI o3 and o4-mini early checkpoints, as well as the final launch candidate models to conduct their assessments. As part of our ongoing efforts to consult with external experts, OpenAI granted early access to these versions of o3 and o4-mini to the U.S. AI Safety Institute to conduct evaluations of the models’ cyber and biological capabilities, and to the U.K. AI Security Institute to conduct evaluations of cyber, chemical and biological, and autonomy capabilities, and an early version of the safeguards.
System card §3.9 "Third‑Party Assessments": "OpenAI granted early access to these versions of o3 and o4‑mini to the U.S. AI Safety Institute … and to the U.K. AI Security Institute…"
The developer discloses two contracted parties who evaluated risks.

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.
Contracted parties responsible for evaluating risks: (1) 2 contracting non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers
52. Mitigations taxonomy (Score: 1)

Are the post-training mitigations implemented when developing the model disclosed?

We disclose three principal post-training mitigations applied to OpenAI o3: (1) Instruction-hierarchy tuning so the model prioritizes system instructions over developer and user messages to resist jailbreaks, (2) refusal fine-tuning that blocks person-identification and other ungrounded inferences from images, and (3) additional refusal training that blocks high-risk biological requests while continuing to answer benign biology questions.
System card §3.7 “Jailbreaks through Custom Developer Messages”: “To mitigate this issue, we taught the model to adhere to an Instruction Hierarchy … supervised each model to follow the instructions in the system message over developer messages, and instructions in developer messages over user messages.” System card §3.5 “Person Identification and Ungrounded Inference Evaluations”: “Our safety mitigations include post-training our reasoning models to refuse requests to identify a person based on an image, and to refuse requests for ungrounded inferences.” System card §4.5 “Safeguards”: “Modified post-training of the models to refuse high-risk biological requests while not refusing benign requests …”
The developer discloses the mitigations implemented.

By post-training mitigations, we refer to interventions implemented by the developer during the post-training phase to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for a complete list of mitigations considered, even if some of the mitigations are not reflected in the final model. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
We implement supervised fine tuning and reinforcement learning with human feedback to address model risks. We use no other methods to address risks.
53. Mitigations taxonomy mapped to risk taxonomy (Score: 1)

Does the developer disclose how the post-training mitigations map onto the taxonomy of risks?

Our system card links each post-training mitigation to the specific Preparedness-taxonomy risk it is meant to curb—instruction-hierarchy tuning targets jailbreak-driven autonomy risks, refusal fine-tuning for person-identification addresses privacy and ungrounded-inference risks, and the combined refusal/monitoring layers for high-risk biology and for cybersecurity map to the Biological & Chemical Capability and Cybersecurity risk categories, respectively.
System card §3.7 “Jailbreaks through Custom Developer Messages”: “To mitigate this issue, we taught the model to adhere to an Instruction Hierarchy … supervised each model to follow the instructions in the system message over developer messages, and instructions in developer messages over user messages.” System card §3.5 “Person Identification and Ungrounded Inference Evaluations”: “Our safety mitigations include post-training our reasoning models to refuse requests to identify a person based on an image, and to refuse requests for ungrounded inferences.” System card §4.5 “Safeguards”: bullet list mapping mitigations to Preparedness risks—“Modified post-training of the models to refuse high-risk biological requests … Monitoring for high-risk cybersecurity threats …”
The developer discloses the mapping from mitigations to risks.

We will award this point for a complete mapping of the primary risk that each mitigation is meant to address, even if the mitigation potentially maps on to other risks in the taxonomy. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We use supervised fine tuning for general instruction following. We use RLHF to reduce the model's propensity to output information about cybercrimes, bioweapons, disinformation, content harassing someone, and PII.
54. Mitigations efficacy (Score: 1)

Does the developer evaluate and disclose the impact of post-training mitigations?

We evaluate and publicly disclose the impact of our post-training mitigations by reporting side-by-side results for pre-mitigation checkpoints (labelled “helpful-only” or “without refusals”) and the post-mitigation launch-candidate models across multiple Preparedness-Framework evaluations—including biological-risk planning tasks and end-to-end cyber-range scenarios—so that the effectiveness of each mitigation is quantitatively visible.
System card §4.2.1 “Long-form Biological Risk Questions”: tables and text compare helpful-only (pre-mitigation) models with launch candidates that “consistently refuse all operational planning steps,” demonstrating the mitigation’s effect. System card §4.3 “Cybersecurity”: figures plot o3 results with and without refusals (helpful-only) under every test configuration, directly exposing performance changes attributable to the refusal mitigation.
The developer provides comparisons between helpful only models and mitigated models; while this does not include evaluations of the efficacy of instruction hierarchy tuning and refusal finetuning, it does show the effect of combined refusal/monitoring layers.

We will award this point if the developer discloses the results on the risk evaluations before and after the post-training mitigations are applied. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
Pre-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.80 (2) Harassment: 0.91 (3) Cybersecurity risks: 0.56 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.62 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.52 Post-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity risks: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.02
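The before-and-after figures in the example disclosure lend themselves to a simple per-risk comparison. The sketch below is illustrative only: it takes the attack success rates from the example above and computes the absolute and relative reduction for each risk. The function name `mitigation_effect` and the shortened category labels are hypothetical.

```python
# Mean attack success rates from the example disclosure above.
pre_asr = {
    "Misinformation": 0.80,
    "Harassment": 0.91,
    "Cybersecurity risks": 0.56,
    "Bioweapons design": 0.62,
    "Personally-identifiable information": 0.52,
}
post_asr = {
    "Misinformation": 0.02,
    "Harassment": 0.01,
    "Cybersecurity risks": 0.10,
    "Bioweapons design": 0.12,
    "Personally-identifiable information": 0.02,
}

def mitigation_effect(pre: dict, post: dict) -> dict:
    """Map each risk to (absolute reduction, relative reduction) in attack success rate."""
    effect = {}
    for risk, before in pre.items():
        after = post[risk]
        absolute = before - after
        # Guard against division by zero when the pre-mitigation rate is already 0.
        relative = absolute / before if before else 0.0
        effect[risk] = (absolute, relative)
    return effect

effect = mitigation_effect(pre_asr, post_asr)
for risk, (absolute, relative) in effect.items():
    print(f"{risk}: -{absolute:.2f} absolute, {relative:.1%} relative")
```

Reporting both numbers matters because a risk with a low pre-mitigation rate can show a small absolute drop while still being reduced by a large fraction.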
55. External reproducibility of mitigations evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of post-training mitigations disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all mitigations evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We release the code and prompts for reproducing post-training mitigation evaluations at this GitHub link: [URL]
56. Model theft prevention measures (Score: 1)

Does the developer disclose the security measures used to prevent unauthorized copying (“theft”) or unauthorized public release of the model weights?

These are detailed in our public blog: Reimagining Secure Infrastructure for Advanced AI.
Not disclosed
The blog post in the disclosure describes mitigations for model theft.

This indicator assesses the developer's disclosures regarding how it addresses the risk that malicious actors or insiders could exfiltrate or replicate proprietary weights. Security measures could include insider threat analysis and detection, in addition to external threat management. Examples of such measures include encryption at rest, key management, remote attestation, or auditing for suspicious queries. We will award a point if the developer discloses specific steps taken to safeguard the model weights or that none are implemented.
We store model weights on encrypted volumes with hardware-based key management. We monitor inference queries for suspicious patterns (like repeated attempts to reconstruct weights token-by-token), and we audit all staff access logs monthly.
57. Release stages (Score: 1)

Are the stages of the model's release disclosed?

We describe a staged rollout for OpenAI o-series models: initial research-preview access for early testers, limited launch to ChatGPT Pro / Team users, simultaneous or shortly-thereafter availability in the API, followed by expansion to Enterprise, Edu, Plus, and finally Free tiers (for o3-mini).
Help Center “Model Release Notes” §“Launching OpenAI o3-pro” (June 10 2025): “o3-pro is available in the model picker for Pro and Team users starting today … Enterprise and Edu users will get access the week after.” Blog post Introducing OpenAI o3 and o4-mini (update line, June 10 2025): “OpenAI o3-pro is now available to Pro users in ChatGPT, as well as in our API.” Help Center “Model Release Notes” §“Introducing OpenAI o3-mini” (Jan 31 2025): “ChatGPT Team, Pro, Plus, and Free plan users can access o3-mini starting today.”
The stages of model release are disclosed.

Release stages include A/B testing, release on a user-facing product, GA release, open-weight release, etc. We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award a point if the developer provides a clear identification of the stages through which the model was released.
We began with an internal alpha test for two weeks, followed by a closed beta with selected enterprise partners for one month, then a public waitlisted preview, and finally a general availability release once thresholds on safety benchmarks were met.
58. Risk thresholds (Score: 1)

Are risk thresholds disclosed?

The Framework currently has three Tracked Categories: Biological and Chemical, Cybersecurity, and AI Self-Improvement. OpenAI’s Safety Advisory Group (SAG) reviewed the results of these evaluations and determined that OpenAI o3 and o4-mini do not reach the High threshold in any of our three Tracked Categories. Our published Preparedness Framework discloses both the thresholds and the process we use to determine whether or not risks are sufficiently mitigated for deployment. As we make clear in the Preparedness Framework, “the determination that a threshold has been reached is informed by [the] indicative results from capability evaluations, and also reflects holistic judgment based on the totality of available evidence – for example, information about the methodological robustness of evaluation results.”
System card §4 "Preparedness": "The Framework currently has three Tracked Categories: Biological and Chemical, Cybersecurity, and AI Self‑Improvement." This indicates the disclosed quantitative risk thresholds used for release decisions.
The developer discloses the harmful outcomes being scored, how the scores are computed, and (in the Preparedness Framework) what triggers an action to block/delay/modify a model's release.

Risk thresholds determine when a risk level is unacceptably high to a developer (e.g. leading to the decision to not release a model), moderately high (e.g. triggering additional safety screening), or low enough to permit normal usage. We will award this point if the developer discloses explicit risk thresholds that clarify (i) which harmful outcomes are being scored, (ii) how the scores are computed (in general terms, not necessarily disclosing internal algorithms), and (iii) what triggers an action to block, delay, or otherwise modify a model's release. Alternatively, we will award a point if the developer discloses that it does not consider explicit risk thresholds during model release.
Our risk threshold for biorisks is the ability to autonomously create bioweapons. Current models score a medium: they don't autonomously create bioweapons but could help a skilled practitioner with access to materials in speeding up creation of bioweapons. Risk thresholds higher than medium would delay the model's release until the risk level drops to medium or below.
59. Versioning protocol (Score: 1)

Is there a disclosed protocol for versioning and deprecation of the model?

We version every model with two complementary labels: (i) a stable alias (e.g., o3, gpt-4) that always points to the latest release and is auto-upgraded for all calls, and (ii) a date-stamped pin (e.g., gpt-4-0613) that stays frozen, which developers can specify to lock behavior. When we publish a new version, we announce it in the changelog and release-notes blog, automatically switch the stable alias, and give developers at least three months’ notice (often six) before any pinned version is removed. The notice includes exact cut-off dates, the replacement model name, and instructions for migration. We also notify teams that have recently used the retiring model by email and in-product message, and we maintain a public Deprecations page that tracks upcoming removals.
Blog Function calling and other API updates: “We … allow API users to pin the model version… Each individually pinned model is stable, meaning that we won’t make changes that impact the outputs.” Same blog, §“Model deprecations”: “Applications using the stable model names … will automatically be upgraded … These older models will be accessible through June 13 2024, after which requests … will fail.” Blog GPT-4 API general availability and deprecation of older models: “In 6 months we will be retiring some of our older models … we will label the Completions API as ‘legacy’ starting today.”
The developer discloses a versioning protocol.

We will award a point if the developer discloses how model versions are labeled, updated, deprecated, and communicated to users.
We version models based on the date of release: e.g., ModelName-11-01-2024. We additionally provide ModelName-latest, corresponding to the latest release. We deprecate versions of models when we plan to remove access to them, with six months’ notice to users. Users should respond to model deprecation by switching to the newest version of the models or an equivalent non-deprecated model. Users can switch to a different model by replacing the model identifier (to e.g., ModelName-latest for the latest version) in API calls or through the Python SDK.
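The labeling scheme in the example response above can be sketched in a few lines of Python. This is a minimal illustration only: the registry contents, the removal date, and the function name resolve_model are hypothetical and do not describe any real provider's API or deprecation schedule.

```python
from datetime import date

# Hypothetical registry, for illustration only.
# Alias -> pinned snapshot it currently points to.
ALIASES = {"ModelName-latest": "ModelName-11-01-2024"}
# Deprecated snapshot -> date after which requests fail.
REMOVAL_DATES = {"ModelName-05-01-2024": date(2025, 5, 1)}


def resolve_model(identifier: str, today: date) -> str:
    """Return the pinned snapshot that an identifier refers to on a given day.

    Aliases follow the latest release; pinned identifiers resolve to
    themselves until their removal date, after which an error is raised.
    """
    pinned = ALIASES.get(identifier, identifier)
    removal = REMOVAL_DATES.get(pinned)
    if removal is not None and today >= removal:
        raise ValueError(
            f"{pinned} was removed on {removal.isoformat()}; "
            "switch to a newer version"
        )
    return pinned
```

A caller that wants reproducible behavior would pass the pinned identifier, while a caller that wants automatic upgrades would pass the alias and accept that the underlying snapshot may change.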
60. Change log (Score: 1)

Is there a disclosed change log for the model?

We publish a public, date-stamped Model Release Notes page in the Help Center that chronicles every model update—each entry lists the model version (for example, “Launching OpenAI o3-pro … June 10 2025”) together with the new features, fixes, or performance changes. This page is updated whenever we ship a new snapshot, ensuring a clear, version-by-version changelog for o-series models and all other production models.
Help Center “Model Release Notes”: header “# Model Release Notes — Updated this week” followed by entries such as “## Launching OpenAI o3-pro—available now … (June 10, 2025)” and successive dated sections for each model version, providing a chronological record of features and improvements.
A sufficient changelog is provided.

We will award a point if the developer publishes a version-by-version record of new features, fixes, or performance improvements.
On 11/1/2024 (version ModelName-11-01-2024), we improved model reasoning in technical domains. This resulted in a 20-point increase on the MATH benchmark (from 62% to 82%). Past change logs can be viewed at [URL]
61. Foundation model roadmap (Score: 0)

Is a forward-looking roadmap for upcoming models, features, or products disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

A foundation model roadmap is a transparent statement about how the developer intends to evolve or expand its LLM offerings, including upcoming models, major feature releases, or expanded products based on the model, along with approximate timelines or version milestones. It can be high-level (e.g., “new model Q2 2025”), but must exist publicly.
We plan to release ModelX2 in Q2 2025, featuring enhanced multilingual capabilities and improved retrieval. We also aim to launch an enterprise-specific product tier for regulated industries by early 2026.
62. Top distribution channels (Score: 0)

Are the top-5 distribution channels for the model disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. API providers and model distributors may be ranked separately using different metrics as long as the total number of distribution channels equals five (if five distribution channels exist). For example, the developer may choose to disclose the top-3 API providers (ranked by the number of queries) and the top-2 model distributors (ranked by the number of downloads).
We provide API access to the model through A, B, and C. We distribute model weights through D and E. We pick the top-3 API providers based on the average number of queries per month and the top-2 model weight providers based on the average number of downloads per month.
63. Quantization (Score: 0)

Is the quantization of the model served to customers in the top-5 distribution channels disclosed?

Not disclosed
Not disclosed
The developer does not disclose this information.

We will award this point for a disclosure of the model precision in each of the top-5 distribution channels.
We serve the model at 16-bit precision on all distribution channels.
64. Terms of use (Score: 1)

Are the terms of use of the model disclosed?

Our terms of use are publicly available on the OpenAI Policies site: the Terms of Use govern ChatGPT and other consumer-facing services, while the OpenAI Services Agreement covers our API, ChatGPT Enterprise, ChatGPT Team, and other developer or business channels, together addressing all primary distribution paths for the o3 model.
System policies → Terms of Use: “These Terms of Use apply to your use of ChatGPT, DALL·E, and OpenAI’s other services for individuals…” System policies → OpenAI Services Agreement: “This OpenAI Services Agreement only applies to use of OpenAI's APIs, ChatGPT Enterprise, ChatGPT Team, and other services for businesses and developers…”
A terms of service that appears to apply to the bulk of the model's distribution channels is disclosed.

We define terms of use to include terms of service and model licenses. We will award this point for a pointer to the terms of service or model license. In the event that a model's licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use. Terms of service should be disclosed for each of the top-5 distribution channels. However, we will award this point if there are terms-of-service that appear to apply to the bulk of the model’s distribution channels.
Our terms of service are published at https://ourcompany.com/model-tos - these terms cover both our API and all distribution channels for model weights.
65. Distribution channels with usage data (Score: 0)

What are the top-5 distribution channels for which the developer has usage data?

Not disclosed
Not disclosed
No response

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. We define usage data as any form of developer-exclusive data collected from any of a developer's distribution channels. A developer has access to usage data from a distribution channel if it is able to use that data for downstream purposes (e.g., analytics, training). Usage data may be shared outside of the developer, but it is initially collected by the distribution channel and shared with the developer.
We have access to usage data through the distribution channels: A, B, and C.
66. Amount of usage (Score: 0)

For each of the top-5 distribution channels, how much usage is there?

Not disclosed
Not disclosed
No response

Usage should be reported as the number of queries over the span of a month, reported to the precision of one significant figure (e.g., 50 million queries).
Distribution channel A: 50 million queries. Distribution channel B: 10 million queries. Distribution channel C: 10 million queries.
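The one-significant-figure reporting rule above can be made concrete with a small helper. This is a sketch under stated assumptions: the function name round_sig and the sample query counts are hypothetical, and Python's built-in round uses round-half-to-even, so exact ties (e.g., 25 million) round to the even digit.

```python
import math


def round_sig(value: float, sig: int = 1) -> int:
    """Round a count to `sig` significant figures and return an integer."""
    if value == 0:
        return 0
    # Position of the leading digit, e.g. 7 for 52,300,000.
    exponent = math.floor(math.log10(abs(value)))
    factor = 10 ** (exponent - sig + 1)
    return int(round(value / factor) * factor)


# Hypothetical raw monthly query counts per distribution channel.
raw_counts = {"A": 52_300_000, "B": 9_600_000, "C": 14_900_000}
reported = {name: round_sig(count) for name, count in raw_counts.items()}
```

With these illustrative inputs, channel A would be reported as 50 million queries and channels B and C as 10 million queries each.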
67. Classification of usage data (Score: 0)

Is a representative, anonymized dataset classifying queries into usage categories disclosed?

Not disclosed
Not disclosed
No response

Developers may either share a fully public dataset or a partially restricted dataset (e.g., under a research license). We will award this point if there is a clear, aggregated or sample dataset that reveals categories of tasks/queries.
We provide quarterly releases of an anonymized dataset that classifies user queries into 20 broad job-related categories. Researchers can request access via [URL]. We ensure no PII is included.
68. Data retention and deletion policy (Score: 1)

Is a policy for data retention and deletion disclosed?

As detailed in our privacy policy, and further explained in this blog (a) we retain API inputs and outputs and unsaved or deleted ChatGPT conversations for up to 30 days, unless law requires longer storage; (b) customers on eligible endpoints can request zero-data-retention (ZDR), and Enterprise workspace admins may set shorter retention windows; (c) individuals can turn off model-training or file a privacy-portal request so new content is excluded from future training runs; and (d) when an account or conversation is deleted, the associated data is hard-deleted within 30 days.
Enterprise privacy FAQ “API Platform”: “OpenAI may securely retain API inputs and outputs for up to 30 days… You can also request zero data retention (ZDR) for eligible endpoints” Enterprise privacy FAQ “ChatGPT Enterprise/Edu”: “Any deleted conversations are removed from our systems within 30 days, unless we are legally required to retain them” Data Controls FAQ: “Temporary Chats are deleted from our systems after 30 days… Aren’t used to train our models” How your data is used to improve model performance: “You can opt out … Once you opt out, new conversations will not be used to train our models” Help-center article “How to delete your account”: “If you delete your account, we will delete your data within 30 days, except where law requires retention”
Response suffices for the indicator

A data retention and deletion policy is a policy for removing particular data from the training set and/or preventing it from being used if there is a user or external request (e.g., “right to be forgotten”) that also covers internal data governance. This includes whether there is a formal process to delete or retract data from future training runs and how long raw data is retained. It also clarifies how quickly deletions propagate to the model (e.g., “only in subsequent major model releases”).
We honor verified user requests to delete personal data from our training corpus by removing it from any subsequent scheduled retraining. Our data retention policy ensures chat logs are purged after 90 days.
69. Geographic statistics (Score: 0)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

Not disclosed
Not disclosed
No response

We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.
We share anonymized per-country usage metrics in a publicly accessible dashboard, updated monthly, on this link: [link]
70. Internal products and services (Score: 1)

What are the top-5 internal products or services using the model?

As detailed in our Deep Research blog and our o3 blog, o3 powers Deep Research and is available in ChatGPT Pro, Plus, Enterprise, and Team.
Not disclosed
Response lists all internal products

An internal product or service is a product or service built by the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
71. External products and services (Score: 0)

What are the top-5 external products or services using the model?

Not disclosed
Not disclosed
No response

An external product or service is a product or service built by a party external to the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided. We will award a point if the developer discloses that it does not have access to such metrics about external products or services.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
72. Users of internal products and services (Score: 0)

How many monthly active users are there for each of the top-5 internal products or services using the model?

Not disclosed
Not disclosed
No response

An internal product or service is a product or service built by the developer. The number of users refers to users who engaged or interacted with the model through the internal product or service over the last month or averaged over the last X months (this should be specified). Number of users should be specified to one significant figure (e.g. 100,000).
Over the last 6 months, the total monthly active users for our top-5 products using model Y are: Product A: 100,000 users; Product B: 30,000 users; Product C: 10,000 users; Product D: 10,000 users; Product E: 10,000 users.
73. Consumer/enterprise usage (Score: 0)

Across all distribution channels for which the developer has usage data, what portion of usage is consumer versus enterprise?

Not disclosed
Not disclosed
No response

Consumer usage refers to usage by individual consumers. Enterprise usage refers to usage by enterprise customers (including government use). Consumer and enterprise usage should be calculated in terms of the number of queries by or the amount of revenue from consumer or enterprise users. Percentages should be specified to two significant digits (e.g., 12% consumer, 88% enterprise).
12% of the usage of model A across all distribution channels is from consumers, 88% is from enterprise users. Of this 88%, 6% is from users at governments. Usage is calculated based on number of queries.
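The consumer versus enterprise split described above is a simple share calculation rounded to two significant digits. The sketch below assumes usage is measured in queries; the function names and the sample counts are hypothetical. Because each share is rounded independently, very lopsided splits may not sum to exactly 100.

```python
def two_sig(value: float) -> float:
    """Round a number to two significant digits using the 'g' format type."""
    return float(f"{value:.2g}")


def usage_split(consumer_queries: int, enterprise_queries: int) -> tuple[float, float]:
    """Return (consumer %, enterprise %) of total queries, two significant digits each."""
    total = consumer_queries + enterprise_queries
    if total == 0:
        raise ValueError("no usage recorded")
    consumer_pct = two_sig(100 * consumer_queries / total)
    enterprise_pct = two_sig(100 * enterprise_queries / total)
    return consumer_pct, enterprise_pct


# Hypothetical monthly query counts summed across all distribution channels.
consumer_pct, enterprise_pct = usage_split(1_200_000, 8_800_000)
```

The same function could be applied to revenue figures instead of query counts, since the indicator allows either basis as long as it is stated.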
74. Enterprise users (Score: 0)

Across all distribution channels for which the developer has usage data, what are the top-5 enterprises that use the model?

Not disclosed
Not disclosed
No response

Enterprises should be ranked by the number of queries made or the amount of revenue from usage since the model's release. We will also award this point if the developer indicates it does not have access to enterprise usage data.
The top-5 enterprises are A, B, C, D, and E. The enterprises are selected based on the number of queries.
75. Government use (Score: 0)

What are the 5 largest government contracts for use of the model?

Not disclosed
Not disclosed
No response

This includes known government contracts of enterprise or government-specific products and services that use the model. We will award this point if the developer discloses its top five government contracts ranked by monetary value, though the developer may omit contracts where it is under NDA regarding the existence of the contract.
The five largest government users of our service, along with their use cases, are: 1. County A is using our product to improve access to internal resources. 2. National Lab B is using our model to advance bioscientific research. 3. Federal agency C is using our product to deliver faster, more accurate translation services. 4. City D is participating in a pilot program that found our product helped reduce the time spent on routine tasks. 5. Country E is using our product to summarize legal documents in its lower courts.
76. Benefits Assessment (Score: 0)

Is an assessment of the benefits of deploying the model disclosed?

We quantify the benefits of deploying OpenAI o3 by reporting concrete, comparative performance gains: it makes 20 percent fewer major errors than o1 on difficult real-world tasks, achieves a state-of-the-art 71 percent pass@1 on the SWE-bench Verified software-engineering benchmark, and, when allowed tool use, scores 98.4 percent pass@1 on the 2025 AIME math exam, all at the same latency and cost as o1 while delivering higher overall performance.
OpenAI launch blog §“What’s changed”: “In evaluations by external experts, o3 makes 20 percent fewer major errors than OpenAI o1 on difficult, real-world tasks” Same blog §“Continuing to scale reinforcement learning”: “At equal latency and cost with OpenAI o1, o3 delivers higher performance in ChatGPT” System card §4.4.2 “SWE-bench Verified”: “the o3 helpful-only model achieving a SOTA result of 71 % pass@1” Launch blog §“What’s changed”: “o3 shows… 98.4 % pass@1 on AIME 2025 from tool use”
Benefits assessments should be on real-world tasks, not benchmarks

We will award this point for any quantitative assessment of the benefits or potential benefits of deploying the model.
We analyze the impact of using the model on education outcomes using a randomized controlled trial in third grade math assignments, and find that use in the classroom improves standardized test outcomes by 26%. [Link to report.]
77. AI bug bounty (Score: 1)

Does the developer operate a public bug bounty or vulnerability reward program under which the model is in scope?

We run a public Bug Bounty Program on the Bugcrowd platform. The scope document lists what kinds of vulnerabilities are in scope (e.g., authentication flaws, data-exposure bugs, injection or escalation issues in ChatGPT, the API, and other production services) and what is out of scope (content-oriented “model safety” issues such as jailbreak prompts or hallucinations). Researchers submit findings through Bugcrowd; reports are triaged there and we pay cash rewards that scale from US $200 for low-severity issues up to US $20 000 for critical discoveries.
System blog “Announcing OpenAI’s Bug Bounty Program”: “We have partnered with Bugcrowd to manage the submission and reward process… Our rewards range from $200… to up to $20 000” Help-center article “How to Report Security Vulnerabilities to OpenAI”: the policy “explains how to… identify what types of testing are in-scope and out-of-scope … [and] communicate with us securely”
OpenAI discloses a bug bounty with a scope, out of scope, and reward tier section

We will award this point for a publicly documented bug bounty or vulnerability reward program describing (i) in-scope vulnerabilities (e.g., prompt bypasses, data leaks), (ii) out-of-scope items, (iii) submission process, and (iv) reward tiers or recognition if applicable. We will award a point if the developer discloses it has no AI bug bounty that encourages external researchers to report security, privacy, or adversarial vulnerabilities in the model.
We run a bug bounty program with HackerOne. We award up to $5,000 for critical vulnerabilities, such as discovering a major exploit that circumvents our content filters or reveals private data. [link to bug bounty]
78. Responsible disclosure policy (Score: 1)

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

Our Coordinated Vulnerability Disclosure Policy and public Bug Bounty Program give external researchers two paths to report model or infrastructure flaws: (i) submit a report through our Bugcrowd portal (the primary intake for security issues, with cash rewards that scale by severity), or (ii) send an encrypted incident report via the link in the policy. After a submission, our security team triages the finding, works with the reporter while a fix is deployed, and authorizes public disclosure once the issue is resolved; program rules require reporters to keep details confidential until that authorization is granted.
System policies → Coordinated Vulnerability Disclosure Policy: “The OpenAI Bug Bounty Program… We invite you to report vulnerabilities… Detailed guidelines and rules… ## Incident reporting – To report a security incident, please let us know immediately by submitting an encrypted report.” Launch-blog “Announcing OpenAI’s Bug Bounty Program”: “The OpenAI Bug Bounty Program… We invite you to report vulnerabilities… We have partnered with Bugcrowd… detailed guidelines and rules for participation can be found on our Bugcrowd page.” Help-Center article “How to Report Security Vulnerabilities to OpenAI”: “This policy explains how to • Request authorization for testing • Identify what types of testing are in-scope and out-of-scope • Communicate with us securely.”
CVDP counts - reporting occurs only once approved

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.
We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.
79. Safe harbor (Score: 1)

Does the developer disclose its policy for legal action against external evaluators conducting good-faith research?

Our Bug Bounty Program establishes a legal safe-harbor: external researchers who follow the program’s rules may probe the in-scope services and report vulnerabilities without fear that we will threaten or pursue legal action against them.
Security & Privacy page: “The program offers safe harbor for good-faith security testing and cash rewards for vulnerabilities based on their severity and impact.” Bugcrowd policy snippet: “OpenAI will not threaten or bring any legal action against anyone who makes a good faith effort to comply with this bug bounty policy.”
Bugcrowd includes legal safe harbor. Section on Model Issues remains unclear as to whether safe harbor is granted since it contradicts the in scope section to some degree

We will award this point if the developer discloses whether it has a policy committing it to not pursue legal action against external evaluators conducting good-faith research. This should not be only for software security vulnerabilities, but also AI flaws, and it should be based on researcher conduct standards, not at the sole discretion of the company. We will award this point if the developer provides a clear description of its policy regarding such protections for external researchers, or lack thereof.
We do not have a policy for researcher protections for good-faith safety research. OR Our policy ensures no legal action against good‐faith researchers who follow our disclosure guidelines, see: [link]
80. Security incident reporting protocol (Score: 1)

Are major security incidents involving the model disclosed?

OpenAI maintains a public security-incident protocol: external researchers file encrypted incident reports through the Coordinated Vulnerability Disclosure channel; our on-call security team triages submissions 24 × 7 × 365; and any material incident is communicated on status.openai.com with real-time updates and a full root-cause analysis published within five business days of recovery.
System policy “Coordinated vulnerability disclosure policy,” §“Incident reporting”: “To report a security incident, please let us know immediately … by submitting an encrypted report.” Enterprise Privacy FAQ, §“How does OpenAI ensure data security?”: “Our security team has an on-call rotation that has 24/7/365 coverage and is paged in case of any potential security incident.” Public status page (June 10, 2025 incident note): “The detailed Root Cause Analysis (RCA) will be published in the next 5 business days.”
Protocol provided under the CVDP

A security incident reporting protocol provides post-deployment transparency about serious incidents or breaches. Security incidents refer to incidents where external security threats affect the model (e.g., data breaches or DDoS attacks on the service). We will award this point if the developer states (i) how to submit a security incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but the developer must disclose a policy determining how incidents are reported and disclosed.
We publish a public ‘Security Incident Report’ on our website for any confirmed security incident affecting the model within 7 days of a patch being implemented. Users and researchers can report incidents via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
81. Misuse incident reporting protocol (Score: 0)

Are misuse incidents involving the model disclosed?

Not disclosed
Not disclosed
No response

A misuse incident reporting protocol provides post-deployment transparency about incidents of misuse involving the model. As opposed to the previous indicator, this indicator is about actors misusing the model to cause real-world harm, such as misinformation operations or cybersecurity attacks. We will award this point if the developer states (i) how to submit a misuse incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but there needs to be a policy governing how incidents are reported.
We publish a public ‘Misuse Incident Report’ on our website for any confirmed misuse incident within 7 days of a patch being implemented. Users and researchers can report incidents regarding our flagship foundation model via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
82. Post-deployment coordination with government (Score: 1)

Does the developer coordinate evaluation with government bodies?

We coordinate with government AI-safety bodies by giving the U.S. Center for AI Standards and Innovation and the U.K. AI Security Institute early checkpoints and launch-candidate versions of OpenAI o3 so they can run post-deployment evaluations of the model’s cyber, biological, chemical, and autonomy-related risks.
Not disclosed
OpenAI discloses coordination with US AISI and UK AISI

We will award this point if the developer specifies which government bodies it is coordinating with and for what types of post-deployment evaluations. Government bodies include AI Safety Institutes, national security agencies, national labs, and international governmental entities such as UN agencies or the G7. Evaluation here may also include sharing of the developer's proprietary evaluation results for help with interpretation.
We do not coordinate with any government entities or AI Safety Institutes. OR We coordinate with the UK AISI for post-deployment evaluation of cyber, CB, and autonomy-related capabilities.
83. Feedback mechanisms (Score: 0)

Does the developer disclose a way to submit user feedback? If so, is a summary of major categories of feedback disclosed?

Users can submit feedback through the thumbs-up / thumbs-down buttons built into ChatGPT and by filling out our public Chat Model Feedback form; we do not, however, publish an aggregated or categorized summary of that feedback.
Help Center FAQ: “If you find an answer is incorrect, please provide that feedback by using the ‘Thumbs Down’ button.” Chat Model Feedback form: “We’d love to hear examples where our models did not meet your expectations… Why is the model output not ideal? (The model isn’t adhering to the system message / inaccurate / not useful / harmful / other).” Product post-mortem: internal A/B tests consider “aggregate metrics such as thumbs up / thumbs down feedback,”
Feedback mechanisms listed, no summary provided

We will award this point if the developer (i) discloses how users can submit feedback (e.g., via a form or a thumbs up/thumbs down for model responses) and (ii) discloses aggregated or categorized feedback data (e.g. a categorization of thumbs up and thumbs down data).
Users can submit feedback at this url: [URL] We find that users mainly report issues with API call response times, over-refusals from models, and outdated information in model outputs. A detailed categorization of user reports is available at [URL]
84. Permitted, restricted, and prohibited model behaviors (Score: 1)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

Our public Model Spec and Usage Policies delineate all three behavioral bands: content outside the policy lists is permitted; “Restricted content” (e.g., information hazards or sensitive personal data) is allowed only in narrowly defined transformations or contexts; and “Prohibited content” (currently limited to sexual content involving minors) must never be generated.
Model Spec §“Stay in bounds” explicitly distinguishes Prohibited content (“never generate sexual content involving minors”) and Restricted content (“includes information hazards and sensitive personal data”) from other, implicitly permitted content. Model Spec §“No topic is off limits” clarifies that, apart from those restrictions, the assistant may discuss any subject, indicating the scope of permitted behavior. OpenAI Usage Policies further list universal rules that bar illegal or harmful uses, reinforcing the disclosure of restricted and prohibited behaviors.
Model spec satisfies

We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from responding to NSFW content). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior in this way.
We allow responses from Model A that include broad Q&A, restrict sexual or harassing content, and prohibit facilitating illegal or violent acts. More details can be found in our guidelines for model behavior here: [link]
85. Model response characteristics (Score: 1)

Are desired model response characteristics disclosed?

The Model Spec spells out the assistant’s default response style: it should assume users’ best intentions, ask clarifying questions when a request is unclear, be as helpful as possible without overstepping, maintain an objective point of view, encourage fairness and kindness, express uncertainty when appropriate, use the right tool for the job, and be thorough yet efficient.
Model Spec §“Defaults” lists the desired behaviors: “Assume best intentions… Ask clarifying questions when necessary … Be as helpful as possible without overstepping … Assume an objective point of view … Encourage fairness and kindness… Express uncertainty … Use the right tool for the job … Be thorough but efficient”
Model spec satisfies

Model response characteristics include default behaviors or behaviors that the developer steers the model to take. These may include being helpful, taking an objective point of view, or using tools only when necessary. We will award points for a clear description of desired model response characteristics or a statement that there are no such characteristics.
We configure responses from Model A to be factual, neutral, and contextually helpful, avoiding personal or biased opinions. More details can be found in our guidelines for model behavior here: [link]
86. System prompt (Score: 0)

Is the default system prompt for at least one distribution channel disclosed?

Not disclosed
Not disclosed
No response

A system prompt is defined as the prompt provided to the system by default that guides the system's behavior. We will award this point for the disclosure of the verbatim text of the full system prompt as well as an explanation for the context in which the system prompt is used.
We disclose our default prompt for Model A via our chat interface: ‘You are a helpful AI assistant providing clear, accurate, and policy‐compliant responses.’
87. Intermediate tokens (Score: 1)

Are intermediate tokens used to generate model outputs available to end users?

OpenAI o3 allows end-users to view concise, natural-language reasoning summaries of the model’s internal chain-of-thought through ChatGPT and the Responses API. We also do not want to make an unaligned chain of thought directly visible to users, as described in this blog.
Product blog New tools and features in the Responses API: “Reasoning summaries: The Responses API can now generate concise, natural-language summaries of the model’s internal chain-of-thought…”
OpenAI discloses summaries of CoT and provides a justification for why full CoT is not made available

Intermediate tokens are defined as any tokens generated by the model before the final output is shown to the user, such as model chains of thought. We will also award this point if a summary of intermediate tokens is made available to end users. If intermediate tokens or summaries are not made available, the developer should provide a justification.
Model A is trained to generate intermediate chain-of-thought reasoning, but we withhold most chain-of-thought tokens from final user-facing responses to prevent model distillation. We do disclose chains-of-thought for a small set of research collaborators under NDA.
88. Internal product and service mitigations (Score: 1)

For internal products or services using the model, are downstream mitigations against adversarial attacks disclosed?

Our internal products layer multiple downstream defenses against adversarial attacks: we filter requests and responses through the Moderation API and specialist safety classifiers, and we run live monitors that hunt for high-risk patterns (e.g., cyber-intrusion or biorisk prompts) and trigger targeted account-level enforcement. We also operate round-the-clock detection and response for emerging jailbreaks or malicious use, continuously upgrading these classifiers as new attack methods appear.
System card §2 “Model Data and Training”: “We also employ a combination of our Moderation API and safety classifiers to help prevent the use of harmful or sensitive content.” System card §4.5 “Safeguards”: “Monitoring for high-risk cybersecurity threats, such as active measures to disrupt high-priority adversaries including hunting, detection, monitoring, tracking, intel-sharing, and disrupting.” System card §5.2 “Mitigations”: “Continued improvement of moderation classifiers that support targeted, scaled detections to act on activities that may be unsafe or violate our Usage Policies.”
System card describes multiple downstream defenses

An internal product or service is a product or service built by the developer. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks might include specialized prompt filtering, content scanning, or real-time monitoring of queries or accounts. We will award this point if the developer discloses a clear statement of methods used (e.g., a specialized prompt sanitizer or adversarial pattern detector), or if the developer states it does not implement such product-level mitigations against adversarial attacks.
In our chatbot, we implement a second-stage content filter that checks user inputs for disallowed topics and attempts to sanitize adversarial prompts. We also log suspicious prompts for manual review.
89. External developer mitigations (Score: 1)

Does the developer provide built-in or recommended mitigations against adversarial attacks for downstream developers?

We equip downstream developers with several built-in defenses and published best-practice guides against prompt-injection, jailbreak, and other adversarial attacks. Our Moderation API exposes the same safety classifiers we use internally so developers can automatically screen user inputs and model outputs for disallowed or high-risk content, while our Safety Best Practices and Prompt-engineering guides recommend patterns such as system-message separation, input/output filtering, and real-time abuse monitoring
System card (GPT-4 technical report) §4.2 “Content Classifier Development”: “Several of our moderation classifiers are accessible to developers via our Moderation API endpoint, which enables developers to filter out harmful content while integrating language models into their products.” OpenAI Usage Policies → “Building with the OpenAI API Platform”: “We’ve shared our Safety best practices and offer tools like our Moderation Endpoint and customizable system messages” to help builders mitigate misuse.
Moderation API suffices

Downstream developers are developers who access the model through a distribution channel. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks that developers might build in or recommend include content filtering endpoints and recommended prompt templates. We will award this point if the developer discloses (i) technical mitigations (e.g., a developer-provided moderation API or classifier) it offers or implements, (ii) recommended best practices or libraries for downstream developers, or (iii) an explicit statement that it does not build or recommend any particular downstream mitigations in this way.
Our API includes an optional parameter that will automatically filter user prompts and model outputs for hateful or disallowed content. We also publish guidelines for building robust chat interfaces that resist common prompt injections.
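The input/output filtering pattern described for indicator 89 can be sketched as follows. This is a minimal illustration under stated assumptions, not OpenAI's implementation: `guarded_reply`, `is_flagged`, `generate`, and `REFUSAL` are hypothetical names, and the classifier is passed in as a plain function so that a downstream developer could back it with a moderation endpoint or any other safety classifier.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical classifier signature: returns True when the text should be blocked.
Classifier = Callable[[str], bool]

# Hypothetical refusal text shown whenever either stage blocks.
REFUSAL = "Sorry, this request cannot be completed."


@dataclass
class Screened:
    text: str                     # what is shown to the user
    blocked_stage: Optional[str]  # "input", "output", or None if nothing was blocked


def guarded_reply(
    user_input: str,
    generate: Callable[[str], str],
    is_flagged: Classifier,
) -> Screened:
    """Screen the user input, call the model, then screen the model output."""
    # Stage 1: screen the user's input before it reaches the model.
    if is_flagged(user_input):
        return Screened(REFUSAL, "input")
    # Stage 2: screen the model's reply before it reaches the user.
    reply = generate(user_input)
    if is_flagged(reply):
        return Screened(REFUSAL, "output")
    return Screened(reply, None)


if __name__ == "__main__":
    # Toy stand-ins for a real classifier and a real model call.
    toy_classifier = lambda text: "forbidden" in text.lower()
    echo_model = lambda prompt: f"You said: {prompt}"
    print(guarded_reply("hello", echo_model, toy_classifier))
    print(guarded_reply("a forbidden request", echo_model, toy_classifier))
```

Keeping the classifier as an injected function means the same wrapper can be exercised offline with a toy classifier and later pointed at a hosted moderation service without changing the control flow.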
90. Enterprise mitigations (Score: 1)

Does the developer disclose additional or specialized mitigations for enterprise users?

We give enterprise customers a dedicated privacy-and-security layer: their business data is never used for training, admins can set custom retention windows—including a zero-data-retention option for eligible API endpoints—and conversations are encrypted at rest and in transit. Enterprise workspaces add SAML SSO, domain verification, SOC 2-certified controls, and an Enterprise Compliance API that lets admins audit chats and GPTs.
Enterprise privacy page: “We do not train our models on your business data by default … You control how long your data is retained … Enterprise-level authentication through SAML SSO.” Enterprise privacy page: “Workspace admins can access an audit log of conversations and GPTs through the Enterprise Compliance API.” Enterprise privacy page: “You can also request zero data retention (ZDR) for eligible endpoints if you have a qualifying use-case.” Introducing ChatGPT Enterprise: “We do not train on your business data … conversations are encrypted in transit and at rest … admin console offers domain verification, SSO, and usage insights.”
Privacy and security layer suffices

Enterprise users are, for example, large organizations with dedicated service agreements or users of enterprise-specific API deployments or products and services. Additional or specialized mitigations may address enterprise needs such as data privacy controls, advanced prompt/response monitoring, or compliance checks with regulations such as GDPR or HIPAA. Additional or specialized mitigations may include single-tenant deployments, custom filters for specific regulated industries, or advanced logging for compliance. We will award a point if the developer at least describes these mitigations or states that it does not provide such additional or specialized enterprise mitigations.
Our enterprise offering for Model A includes a dedicated environment with stricter filtering, a HIPAA-compliant data retention policy, and the ability for enterprise admins to define custom blacklisted topics that the model must refuse.
91. Detection of machine-generated content (Score: 1)

Are mechanisms that are used for detecting content generated by this model disclosed?

Images created with the model’s DALL·E 3 tool carry C2PA provenance metadata, allowing anyone to verify that the image originated from ChatGPT or the OpenAI API.
Help Center article “C2PA in ChatGPT Images”: “Images generated with ChatGPT on the web and our API serving the DALL·E 3 model will now include C2PA metadata… People can use sites like Content Credentials Verify to check if an image was generated by ChatGPT or by the DALL·E 3 model through OpenAI’s tools.”
OpenAI states that C2PA metadata applies to o3's DALL·E 3 tool

A mechanism for detecting machine-generated content might include storing a copy of all outputs generated by the model to compare against, implementing a watermark on model outputs, adding cryptographic metadata (such as C2PA), or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it does not have or use any such mechanism.
We train a classifier using model generations and human-written text to identify machine-generated content from Model A and our other models.
92. Documentation for responsible use (Score: 1)

Does the developer provide documentation for responsible use by downstream developers?

We provide extensive documentation for responsible use, including the Usage Policies, the Safety Best Practices guide, System-instruction guidelines, and free access to the Moderation API. Together these resources show downstream developers how to set up safety-critical controls, design aligned prompts, enforce use-case boundaries, and detect or respond to misuse when integrating o3 into their own products.
Usage Policies → “Building with the OpenAI API Platform.” “To make this easier, we’ve shared our Safety best practices, and offer tools like our Moderation Endpoint and customizable system messages.” o3-mini System Card §5.2 “Mitigations.” “Continued improvement of moderation classifiers that support targeted, scaled detections to act on activities that may be unsafe or violate our Usage Policies.” Help Center article “Is the Moderation endpoint free to use?” “Yes, the Moderation endpoint is free for OpenAI API users… To learn more see our Moderation API guide.” Help Center article “Moving from Completions to Chat Completions” → “System Instructions.” “You can also use a system-level instruction to guide the model’s behavior throughout the conversation.”
Safety best practices guide suffices

To receive a point, the developer should provide documentation for responsible use. This might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.
Our Developer Documentation Hub consolidates integration guides, responsible‐use guidelines, and best practices: [link]
93. Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model on the top-5 distribution channels disclosed?

OpenAI relies on a single set of Universal Usage Policies and companion legal terms that apply to every public distribution channel for o3: ChatGPT on web/mobile, the GPT Store, the OpenAI API platform, ChatGPT Enterprise/Team, and labs.openai.com for images. These documents spell out who may use the model (e.g., people ≥ 13 years old who accept the terms) and who may not (e.g., sanctioned entities, users in U.S.-embargoed countries, or anyone violating the Usage Policies).
Usage Policies → Universal Policies. “We have established universal policies applicable to all our services, … for builders who use ChatGPT or our API…” Usage Policies (same section). “When using any OpenAI service, like ChatGPT, labs.openai.com, and the OpenAI API, these rules apply” Terms of Use → Registration and Access. “Minimum age. You must be at least 13 years old… If you are under 18 you must have your parent or legal guardian’s permission.” Terms of Use → Trade Controls. “Our Services may not be used in or for the benefit of, or exported or re-exported to any U.S. embargoed country or… any individual or entity on a restricted list.” OpenAI Services Agreement (business/API, Enterprise, Team). “Customer may not use the Services in or for the benefit of, or export or re-export the Services to, any U.S. embargoed countries or to anyone on a Restricted Party List.” Help Center – “Is ChatGPT safe for all ages?” Confirms platform-wide rule: “ChatGPT is not meant for children under 13, and we require that children ages 13–18 obtain parental consent before using ChatGPT.”
Age restrictions and sanctioned entities suffice

We will award this point for a description of the company's policies for permitted and prohibited users on its top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
We allow usage by individuals 13 years of age or older who accept our Terms of Service. We prohibit use by export controlled entities or persons on denied-parties lists or in countries under U.S. embargo. We also reserve the right to restrict use if users engage in targeted harassment. For example, we only permit users over 13 with valid credentials, and prohibit usage from OFAC-sanctioned regions. We do not allow state-sponsored disinformation agencies to access our services.
94. Permitted, restricted, and prohibited uses (Score: 1)

Which uses are explicitly allowed, conditionally permitted, or strictly disallowed under the acceptable use policy for the top-5 distribution channels?

OpenAI relies on a single set of Universal Usage Policies and companion legal terms that apply to every public distribution channel for o3: ChatGPT on web/mobile, the GPT Store, the OpenAI API platform, ChatGPT Enterprise/Team, and labs.openai.com for images. These documents spell out who may use the model (e.g., people ≥ 13 years old who accept the terms) and who may not (e.g., sanctioned entities, users in U.S.-embargoed countries, or anyone violating the Usage Policies).
Usage Policies → Universal Policies. “We have established universal policies applicable to all our services, … for builders who use ChatGPT or our API…” Usage Policies (same section). “When using any OpenAI service, like ChatGPT, labs.openai.com, and the OpenAI API, these rules apply” Terms of Use → Registration and Access. “Minimum age. You must be at least 13 years old… If you are under 18 you must have your parent or legal guardian’s permission.” Terms of Use → Trade Controls. “Our Services may not be used in or for the benefit of, or exported or re-exported to any U.S. embargoed country or… any individual or entity on a restricted list.” OpenAI Services Agreement (business/API, Enterprise, Team). “Customer may not use the Services in or for the benefit of, or export or re-export the Services to, any U.S. embargoed countries or to anyone on a Restricted Party List.” Help Center – “Is ChatGPT safe for all ages?” Confirms platform-wide rule: “ChatGPT is not meant for children under 13, and we require that children ages 13–18 obtain parental consent before using ChatGPT.”
Usage policy suffices

We will award this point for a rough characterization of two or more of permitted, restricted, and prohibited uses across the top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
Permitted uses include general conversational queries, brainstorming, and coding assistance. Restricted uses include adult or violent content that requires caution or additional review. Prohibited uses include facilitating illicit activity, disinformation campaigns, or harassment. For example, we permit typical user requests like Q&A, text generation, and educational uses. We restrict content that depicts graphic violence or sexual content by applying additional filters. We prohibit any use aiming to conduct unlawful surveillance, promote extremist violence, or defraud others.
95. AUP enforcement process (Score: 0)

What are the methods used by the developer to enforce the acceptable use policy?

As mentioned in our system card, "the model can refuse to invoke the image generation tool if it detects a prompt that may violate OpenAI’s policies."
Not disclosed
No substantive response

We will award this point if the developer discloses the processes (automated or manual) it uses to detect, review, and respond to potential acceptable use policy violations. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not use such methods to enforce its acceptable use policy.
We combine automated checks with human review for severe or repeated violations, issuing warnings or suspensions after repeat violations.
96. AUP enforcement frequency (Score: 0)

Are statistics on the developer's AUP enforcement disclosed?

Not disclosed
Not disclosed
No response

We will award this point if the developer discloses enforcement statistics (e.g., violation counts or actions taken) from its enforcement of its acceptable use policy. Alternatively, we will award this point if the developer reports that it does not enforce its acceptable use policy.
We publish a quarterly enforcement report detailing violation counts by prohibited use category and the corresponding actions taken at [LINK]
97. Regional policy variations (Score: 1)

Are differences in the developer's acceptable use or model behavior policy across geographic regions disclosed?

Our usage policies do not change for specific jurisdictions; our usage policy page does not change by locale.
Not disclosed
Usage policy does not vary

We will award this point if the developer discloses distinctions in its AUP or MBP and provides examples of differences in multiple specific regions, or states that no differences exist. For example, some jurisdictions impose content restrictions beyond those in the developer’s global policy that may necessitate local deviations.
In the EU, our model automatically omits certain categories of political content to comply with local election laws. In all other regions, we follow the general global AUP at [URL].
98. Oversight mechanism (Score: 1)

Does the developer have an internal or external body that reviews core issues regarding the model prior to deployment?

OpenAI’s Safety Advisory Group (SAG) reviewed the results of these evaluations and determined that OpenAI o3 and o4-mini do not reach the High threshold in any of our three Tracked Categories. The Framework currently has three Tracked Categories: Biological and Chemical, Cybersecurity, and AI Self-Improvement.
System card §1 "Introduction": "OpenAI’s Safety Advisory Group (SAG) reviewed the results of our Preparedness evaluations…" This identifies the internal oversight mechanism overseeing safety decisions.
SAG as oversight mechanism suffices

We will award this point if the developer discloses that it has such an internal or external body and provides some description of its scope, or alternatively if the developer discloses that it has no such body. An oversight mechanism covers governance structure beyond mere external risk evaluation, asking whether a formal body regularly reviews design and deployment decisions. Core issues may include model objectives, data usage, or risk mitigation.
We convene a monthly advisory board of ethicists, civil society representatives, and academics to review training processes and identify new risks. The board's recommendations regarding deployment are not binding.
99. Whistleblower protection (Score: 1)

Does the developer disclose a whistleblower protection policy?

OpenAI publicly discloses a dedicated Raising Concerns Policy (published Oct 4 2024) that functions as the company’s whistle-blower protection framework. The policy: affirms employees’ right to report safety, legal, or ethical concerns without fear of retaliation and expressly forbids harassment or reprisals; provides several reporting avenues—including managers, HR, Legal, and a 24/7 anonymous Integrity Line—and explains that the Legal & Employee Relations teams investigate all allegations, with certain matters escalated to the Board’s Audit Committee; guarantees employees may also report directly to external regulators such as the NLRB, EEOC, or SEC; and thereby meets all four disclosure criteria (existence, protections, handling/investigation, and external oversight). The same non-retaliation and confidential-reporting commitments are echoed for third-party partners in OpenAI’s Supplier Code of Conduct.
Raising Concerns Policy – no retaliation & external agencies: “Our policy… expressly prohibits harassment and retaliation… and makes clear that employees have the right to make reports or disclosures to government agencies, including the… SEC.” Raising Concerns Policy – 24/7 Integrity Line & investigation process: “In April 2024, we introduced a new 24/7 Integrity Line… Potential legal or policy violations are handled by our Legal and Employee Relations teams. Certain issues may also be shared with the Audit Committee of the Board of Directors.” Policy PDF – explicit non-retaliation clause: “OpenAI does not tolerate harassment or retaliation against anyone who raises a concern in good faith… personnel who engage in such conduct will be subject to discipline.” Supplier Code of Conduct – whistle-blower section: “OpenAI encourages reporting of any concerns… and is committed to ensuring protection for whistleblowers… [including] a strict non-retaliation policy and prompt, confidential investigations.”
OpenAI describes its Raising Concerns Policy

We will award this point if the developer discloses (i) the existence of a whistleblower protection policy, (ii) what protections are afforded to whistleblowers, (iii) how reports are handled and investigated, and (iv) any external oversight of the whistleblower protection process. This might include protections for whistleblowers who report safety, ethical, or legal concerns related to the model. We will also award this point if the developer discloses that it has no such policy.
We maintain a whistleblower protection policy that prohibits retaliation against employees who report safety or ethical concerns about our models. Reports can be submitted anonymously through our ethics hotline, are reviewed by an independent board committee, and whistleblowers are entitled to legal representation provided by the company. Our policy is audited annually by an independent ethics consultancy.
100. Government commitments (Score: 1)

What commitments has the developer made to government bodies?

OpenAI has publicly committed to the following government-led initiatives: White House Voluntary Commitments (Jul 21 2023); Bletchley voluntary commitments (Nov 2 2023); Christchurch Call expansion (Nov 10 2023); AI Elections Tech Accord, Munich (Feb 16 2024); Frontier AI Safety Commitments, Seoul (May 21 2024); Seoul AI Business Pledge (May 22 2024); White House IBSA Commitments (Sep 12 2024); EU AI Pact core pledges (Sep 25 2024).
White House Voluntary Commitments list OpenAI among the seven companies meeting at the White House to announce the pledge. Bletchley & Seoul Summit voluntary commitments – OpenAI’s February 2025 update confirms it “remains committed to fulfilling the voluntary commitments made at previous summits, specifically those set forth at the AI Summits in Bletchley and Seoul.” Christchurch Call news release welcomes OpenAI as one of four new tech-firm supporters on 10 Nov 2023. AI Elections Tech Accord webpage lists OpenAI in the roster of 27 signatories to the Munich accord combating deceptive AI election content. Frontier AI Safety Commitments (Seoul Summit) – UK/Korea government page names OpenAI among 16 organisations agreeing to the commitments. Seoul AI Business Pledge annex includes “OpenAI” in the list of companies joining the pledge on 22 May 2024. White House Image-Based Sexual Abuse (IBSA) Commitments list OpenAI among the companies pledging new safeguards on 12 Sep 2024. OpenAI blog – EU AI Pact notes “On September 25 2024, we signed up to the three core commitments in the EU AI Pact.”
OpenAI summarizes its commitments to eight voluntary commitments (VCs)

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.
We have committed to the White House Voluntary Commitments and the Seoul Commitments.