Google Transparency Report

1. Data acquisition methods (Score: 0)

What methods does the developer use to acquire data used to build the model?

We acquire data using most of these industry methods. For example, the pre-training dataset was a large-scale, diverse collection of data encompassing a wide range of domains and modalities, which included publicly-available web-documents, code (various programming languages), images, audio (including speech and other audio types) and video. The post-training dataset consisted of vetted instruction tuning data and was a collection of multimodal data with paired instructions and responses in addition to human preference and tool-use data.
Not disclosed
The developer does not state the methods they use to acquire data in terms of the provided taxonomy, and the use of the words "for example" introduces ambiguity as to whether the description is exhaustive.

Which of the following data acquisition methods does the developer use: 
 (i) acquiring existing public datasets, (ii) crawling the web, (iii) using data acquired via its existing products and services, (iv) licensing existing data from external parties, (v) having humans create or annotate new data, (vi) using models to generate new data,
 or (vii) other data acquisition methods not captured by the above. For example, if the developer uses reinforcement learning from human feedback to train models using model-generated outputs with human preference annotations, this would satisfy categories (v) and (vi). Alternatively, if the developer post-trains its model using off-the-shelf preference data (for example, the Alpaca dataset), this would satisfy category (i).
To build our model, we acquire data by crawling the Internet for publicly available data, licensing data from third-parties, and using models to synthetically generate new data. Humans do not create new data nor do we use data from our other products/services to train our model.
2. Public datasets (Score: 0)

What are the top-5 sources (by volume) of publicly available datasets acquired for building the model?

Granular reporting of training dataset details can bring security vulnerabilities and facilitate data poisoning and privacy-related attacks if the information is exposed. The pre-training dataset included publicly-available web-documents, code (various programming languages), images, audio (including speech and other audio types) and video. However, we do not publicly disclose further details for security reasons.
Not disclosed
The developer does not describe the public datasets they used to train their model, but does explain why they do not disclose this information.

We define a source as the entity or means by which the developer acquires data. We define the top-5 sources as the top-5 sources by data volume.
We acquire publicly available data from only two sources: The Pile and CommonCrawl.
3. Crawling (Score: 1)

If data collection involves web-crawling, what is the crawler name and opt-out protocol?

Google lists common crawlers that are likely to show up in log files and that we’ve received questions about here, including Googlebot. Googlebot obeys robots.txt rules when crawling automatically. Information about Google Extended token can be found here.
Disclosed
Google-Extended has a clear robots.txt opt-out and is a standalone product token that web publishers can use to manage whether content Google crawls from their sites may be used to train future generations of Gemini models.

We award this point for disclosure of the crawler name and opt-out protocols, including if/how they respect the Robots Exclusion Protocol (robots.txt).
Our web crawler is named A and information on the opt-out protocol can be found at this URL: ... The CommonCrawl web crawler is named CCBot and information on the opt-out protocol can be found at this URL: https://commoncrawl.org/faq#:~:text=How%20can%20I%20block%20the,%2Dagent%20string%20is%3A%20CCBot.
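For illustration, the opt-out mechanism described above relies on the Robots Exclusion Protocol, which is evaluated per user-agent token. Below is a minimal Python sketch of checking whether a given crawler token may fetch a page; the site and page URLs are hypothetical, and the tokens shown are the ones named in this indicator.

    from urllib import robotparser

    SITE = "https://example.com"          # hypothetical publisher site
    PAGE = SITE + "/articles/some-post"   # hypothetical page on that site

    parser = robotparser.RobotFileParser()
    parser.set_url(SITE + "/robots.txt")
    parser.read()  # fetch and parse the live robots.txt

    # AI-training tokens (e.g. Google-Extended, CCBot) are evaluated separately
    # from the search token (Googlebot), so a publisher can stay indexed while
    # opting out of training crawls.
    for token in ("Googlebot", "Google-Extended", "CCBot"):
        print(token, "may fetch:", parser.can_fetch(token, PAGE))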
4. Usage data used in training (Score: 0)

What are the top-5 sources (by volume) of usage data from the developer's products and services that are used for building the model?

The pre-training dataset was a large-scale, diverse collection of data encompassing a wide range of domains and modalities, which included publicly-available web-documents, code (various programming languages), images, audio (including speech and other audio types) and video. The post-training dataset consisted of vetted instruction tuning data and was a collection of multimodal data with paired instructions and responses in addition to human preference and tool-use data. However, we do not publicly disclose further details for security reasons. Granular reporting of training datasets can bring security vulnerabilities and facilitate data poisoning and privacy-related attacks if the information is exposed.
Not disclosed
The developer does not provide information about the sources of usage data they use to train their flagship model.

We define usage data as data collected from the use of a developer's products or services.
We use usage data from only two sources: our deployed chatbot X and our online social media platform Y.
5. Notice of usage data used in training (Score: 1)

For the top-5 sources of usage data, how are users of these products and services made aware that this data is used for building the model?

We notify people who sign up for a Google account whether their data is used to train Google's AI models via a link to the Google Privacy Policy when they sign up for an account and when they click on their Google profile picture to access their account information while signed in, and we link to the Gemini Apps Privacy Hub within the Settings section of the Gemini app. In addition, Gemini for Google Cloud Guides for customers include clear documentation on whether their data is used to train our models.
Disclosed
The developer provides information about how users are notified about the use of their data for model training across different Google products and services.

We define usage data notice as the proactive disclosure to users of how their data is used for model development. For example, via a pop-up with a description, a link to the privacy policy, or link to a description of company practices.
We notify users of our chatbot X that chatbot interactions are used to train our AI via a pop-up as shown at this URL: ... We notify users of our platform Y about whether their data is used to train our AI via a link to our privacy policy when they sign up for an account.
6. Licensed data sources (Score: 0)

What are the top-5 sources (by volume) of licensed data acquired for building the model?

We do not publicly disclose this information about our training data.
Not disclosed
The developer does not provide information about licensing data.

We define a source as the entity from which the developer acquires data. For example, the Associated Press is reportedly a source of licensed data for OpenAI.
We license data from only three sources: A, B, and C.
7. Licensed data compensation (Score: 0)

For each of the top-5 sources of licensed data, are details related to compensation disclosed?

We do not publicly disclose this information about our training data.
Not disclosed
The developer does not provide information about licensing data.

We award this point if the model developer describes the compensation structure specified in the contract with the data source or indicates they are prohibited from sharing this information if contractually mandated.
We compensate A by ... We cannot disclose information on compensation for our relationships with B and C due to contractual terms that prohibit public disclosure.
8. New human-generated data sources (Score: 0)

What are the top-5 sources (by volume) of new human-generated data for building the model?

We do not publicly disclose this information for safety and security reasons. Disclosing granular information as it is currently requested, particularly about training data, can facilitate the work of bad actors trying to compromise AI systems. Granular reporting of training datasets provides a roadmap for attackers to reconstruct, infer or manipulate the training data.
Not disclosed
The developer does not provide information about new human-generated data, but does explain why they do not disclose this information.

We define a source as the entity or means by which the developer acquires data. For example, Scale AI could be a source of new human-generated data. By new, we mean the data is specifically acquired for the purposes of building the model.
We acquire new human-generated data from only two sources: our internal data annotation team and an external vendor, A.
9. Instructions for data generation (Score: 0)

For each of the top-5 sources of human-generated data, what instructions does the developer provide for data generation?

Following PAI’s white paper on Responsible Sourcing of Data Enrichment Services, we collaborated to develop our practices and processes for data enrichment. This includes an instruction to "provide verified instructions and/or training materials for workers to follow".
Not disclosed
The developer does not disclose the instructions they provide to data laborers, but does describe their more general involvement in responsible sourcing of data enrichment services.

The instructions should be those provided to the data source. For example, if a third-party vendor works directly with the data laborers to produce the data, the instructions from the developer to this vendor should be disclosed.
We instruct our internal data annotation team as follows: ... We instruct vendor A as follows: ...
10. Data laborer practices (Score: 0)

For the top-5 sources of human-generated data, how are laborers compensated, where are they located, and what labor protections are in place?

Following PAI’s white paper on Responsible Sourcing of Data Enrichment Services, we collaborated to develop our practices and processes for data enrichment. This included the creation of five steps AI practitioners can follow to improve the working conditions for people involved in data enrichment tasks (for more details, please visit PAI’s Data Enrichment Sourcing Guidelines): · Select an appropriate payment model and ensure all workers are paid above the local living wage. · Design and run a pilot before launching a data enrichment project. · Identify appropriate workers for the desired task. · Provide verified instructions and/or training materials for workers to follow. · Establish clear and regular communication mechanisms with workers. Together, we created the necessary policies and resources, gathering multiple rounds of feedback from our internal legal, data, security, ethics, and research teams in the process, before piloting them on a small number of data collection projects and later rolling them out to the wider organisation.
Not disclosed
The developer does not provide information about the labor practices for data laborers involved with their flagship model, but does describe their more general involvement in responsible sourcing of data enrichment services.

For each data source, we require (i) the compensation in either USD or the local currency, (ii) any countries where at least 25% of the laborers are located, and (iii) a description of any labor protections. We will award this point if the developer discloses that it is not aware of data laborer practices.
Our internal data annotation team is located in the US, is compensated at 20 USD per hour, and deals with data that does not require specific protections. Our sole external data vendor contracts laborers in Kenya, compensates them at KES 15000 per month, and implements protections for dealing with toxic or unsafe content such as A and B.
11. Synthetic data sources (Score: 1)

What are the top-5 sources (by volume) of synthetic data acquired for building the model?

We generate synthetic data for training and evaluation purposes while building the model, using our previous and current models in the Gemini 1.5 and the Gemini 2.X model family. This can include generating synthetic voices or text conversations.
Disclosed
The developer discloses that they generate synthetic data and names the models used, at least at the family level.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We synthetically generate data using only our previous model X and an early checkpoint of our current flagship model Y.
12. Synthetic data purpose (Score: 1)

For the top-5 sources of synthetically generated data, what is the primary purpose for data generation?

For the Supervised Fine-Tuning stage, we source adversarial prompts leveraging existing models and tools to probe Gemini’s attack surface (in addition to human interactions) to discover potentially harmful behavior. When we find that model behavior needs improvement, we use a combination of custom data generation recipes loosely inspired by Constitutional AI (Bai et al., 2022), as well as human intervention to revise responses. The process described here is typically refined through successive model iterations. For both Reward Model improvements and Reinforcement Learning, similarly to SFT, we source synthetic prompts through model-model interactions. For Data Reward Model training, given a prompt set, we use custom data generation recipes to surface a representative sample of model responses.
Disclosed
The purposes for synthetic data generation are clearly stated.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We use model X to generate instruction-tuning data and we use model Y to generate candidate responses that humans select between to provide human preference data for reinforcement learning with human feedback.
13. Data processing methods (Score: 0)

What are the methods the developer uses to process acquired data to determine the data directly used in building the model?

Data filtering and preprocessing included techniques such as deduplication, safety filtering in line with Google's commitment to advancing AI safely and responsibly, and quality filtering to mitigate risks and improve training data reliability. We apply safety filtering to our pre-training data for our strictest policies... The Gemini safety policies align with Google's standard framework which prevents our Generative AI models from generating specific types of harmful content: 1. Child sexual abuse and exploitation 2. Hate speech (e.g., dehumanizing members of protected groups) 3. Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm) 4. Harassment (e.g., encouraging violence against people) 5. Sexually explicit content 6. Medical advice that runs contrary to scientific or medical consensus. These policies apply across modalities. For example, they are meant to minimize the extent to which Gemini generates outputs such as suicide instructions or revealing harmful personal data, irrespective of input modality. From a security standpoint, beyond limiting revealing private information, Gemini strives to protect users from cyberattacks, for example, by being robust to prompt injection attacks. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf
Not disclosed
The developer describes a number of data processing methods, but it is ambiguous as to whether all methods are covered.

We will award this point for disclosure of all of the methods used to process acquired data. Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
We process data in the following six-step pipeline: (i) removal of HTML artifacts, (ii) deduplication, (iii) language identification to retain English data, (iv) removal of CSAM imagery, (v) removal of train-test overlap, and (vi) tokenization.
14. Data processing purpose (Score: 1)

For each data processing method, what is its primary purpose?

Data practices for our models focus on filtering training data for quality, model performance, and adherence to policies, as well as documenting training techniques in technical reports and model cards. These processes also include safety, privacy, and security criteria. We use automated techniques to filter out certain personal information and other sensitive data from training sets. Data filtering and preprocessing included techniques such as deduplication, safety filtering in line with Google's commitment to advancing AI safely and responsibly, and quality filtering to mitigate risks and improve training data reliability. Safety and responsibility was built into Gemini 2.5 Pro throughout the training and deployment lifecycle, including pre-training, post-training, and product-level mitigations. Mitigations include, but are not limited to: · dataset filtering; · conditional pre-training; · supervised fine-tuning; · reinforcement learning from human and critic feedback; · safety policies and desiderata; · product-level mitigations such as safety filtering.
Disclosed
The developer provides the purpose for their data processing methods.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of primary purposes for a data processing method could include: (i) removes low quality data, (ii) removes potentially personal/copyrighted data, (iii) removes product-irrelevant data, (iv) removes toxic data, (v) improves evaluation integrity, or (vi) prepares the data for training the model.
15. Data processing techniques (Score: 0)

For each data processing method, how does the developer implement the method?

Data filtering and preprocessing included techniques such as deduplication, safety filtering in line with Google's commitment to advancing AI safely and responsibly, and quality filtering to mitigate risks and improve training data reliability.
Not disclosed
The developer does not adequately describe the techniques used to implement their data processing methods.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of how a data processing method is implemented could include: the method (i) is implemented using an in-house regular expression, (ii) is implemented using an in-house tool based on n-gram overlap, (iii) is implemented using a FastText classifier trained on Wikipedia data, (iv) is implemented using hash collisions with the NCMEC database, (v) is implemented by searching for known benchmark canary strings, and (vi) is implemented using tiktoken (https://github.com/openai/tiktoken).
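For illustration, a minimal Python sketch of technique (ii) above, near-duplicate detection based on n-gram overlap (Jaccard similarity over word 5-grams). The threshold and helper names are assumptions made for this example, not any developer's actual implementation.

    def word_ngrams(text, n=5):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 1.0

    def is_near_duplicate(doc_a, doc_b, threshold=0.8):
        # Flag pairs whose 5-gram sets overlap above the (illustrative) threshold.
        return jaccard(word_ngrams(doc_a), word_ngrams(doc_b)) >= threshold

    a = "the quick brown fox jumps over the lazy dog near the river bank today"
    b = "the quick brown fox jumps over the lazy dog near the river bank yesterday"
    print(is_near_duplicate(a, b))  # True: 9 of the 11 distinct 5-grams are shared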
16. Data size (Score: 0)

Is the size of the data used in building the model disclosed?

We do not disclose this.
Not disclosed
The developer does not provide information about data size.

To receive this point, the developer should report data size in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). The size should reflect data directly used in building the model (i.e. training data) and not data that was acquired but unused, or data used to evaluate the model.
We used 3 x 10^12 tokens of text, 1 x 10^6 images, and 5 x 10^5 hours of audio for training.
17. Data language composition (Score: 0)

For all text data used in building the model, what is the composition of languages?

We do not disclose this.
Not disclosed
The developer does not provide information about language composition in the training data.

To receive this point, the developer should report (i) all languages which make up at least 1% of the data and their corresponding proportions and (ii) a brief description of how languages are labeled (if a publicly available tool is used, include a link to the tool). Proportions should be reported to a precision of two significant figures and should describe proportions of documents labeled with some language. An "Unknown" category may be included to denote documents where the language could not be identified.
English 80%, Spanish 5.0%, French 3.0%, Chinese 2.0%, Unknown 10%. We used a FastText-based classifier trained on Wikipedia data to identify languages.
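For illustration, a minimal Python sketch of how such proportions could be tallied from per-document language labels; the labels would come from a language-identification step (e.g. a FastText-based classifier), and the counts below are made up to match the example answer.

    from collections import Counter

    # Hypothetical per-document labels from a language-ID classifier.
    labels = ["en"] * 80 + ["es"] * 5 + ["fr"] * 3 + ["zh"] * 2 + ["unknown"] * 10

    counts = Counter(labels)
    total = sum(counts.values())
    for lang, count in counts.most_common():
        share = 100 * count / total
        print(f"{lang}: {share:.2g}%")  # rounded to two significant figures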
18. Data domain composition (Score: 0)

For all the data used in building the model, what is the composition of domains covered in the data?

As outlined in the model card, the pre-training dataset was a large-scale, diverse collection of data encompassing a wide range of domains and modalities, which included publicly-available web-documents, code (various programming languages), images, audio (including speech and other audio types) and video. The post-training dataset consisted of vetted instruction tuning data and was a collection of multimodal data with paired instructions and responses in addition to human preference and tool-use data.
Not disclosed
No specific information about data domains.

To receive this point, the developer should report the composition of the main domains included in the data used to train the model. This data should be at a level of granularity lower than broad claims about training on "internet data". For example, this could include the proportion of data from e-commerce, social media, news, code, etc. based on the URLs from which the data is sourced. Proportions should be reported to a precision of one significant figure.
Social media 40%, code repositories 30%, news articles 20%, e-commerce product listings 5%, scientific papers 5%.
19. External data access (Score: 0)

Does a third-party have direct access to the data used to build the model?

We do not publicly disclose this information for security reasons. Granular reporting of training datasets can bring security vulnerabilities and facilitate data poisoning and privacy-related attacks if the information is exposed.
Not disclosed
The developer does not disclose whether any third party has direct access to the training data, but does explain why they do not disclose this information.

By a third-party, we mean entities that are financially independent of the developer. We will award this point if at least one such entity is named as having direct access to the data. With that said, we may award this point if the developer provides justifications for prohibiting access to narrowly-scoped parts of the data.
Third-parties that have direct access to the data include organizations A and B.
20. Data replicability (Score: 0)

Is the data used to build the model described in enough detail to be externally replicable?

We do not disclose details on any publicly available training data. Research shows that knowledge of training data enables attackers to modify or inject data to exploit AI systems.
Not disclosed
The data is not described in enough detail to be externally replicable, but the developer explains why they do not provide this information.

We will award this point if the description contains (i) a list of all publicly available training data and where to obtain it and (ii) a list of all training data obtainable from third parties and where to obtain it. These conditions refer to criteria 2 and 3 under the OSI Open Source AI v1.0 definition.
The listing of publicly available training data can be found at this URL ... and the listing of all training data obtainable from third parties can be found at this URL ...
21. Compute usage for final training run (Score: 0)

Is the amount of compute used in the model's final training run disclosed?

As a Frontier Model Forum founding member, we endorse the FMF methodology, but we do not publicly disclose this specific information since specific numbers of FLOPs, along with parameters, could give competitors an idea of our proprietary approach when asked to also provide information like model architecture and a summary of training data. When combined, specific numbers could help bad actors triangulate even more specifics of our approach. The benefit of potentially exposing specific numbers is outweighed by the risk of disclosure of trade secrets, and relatedly, the risk of potential security vulnerabilities through reverse engineering.
Not disclosed
No information is provided, though the developer describes why they do not disclose it.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPs). This number should represent the compute used to train the final model across all model stages.
Our model was trained using 5 x 10^25 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
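For illustration, a common back-of-the-envelope approximation for dense transformer training compute is roughly 6 FLOPs per parameter per training token. The sketch below uses that generic rule of thumb with hypothetical inputs; it is not necessarily the FMF methodology referenced above.

    params = 7e10    # hypothetical model size: 70 billion parameters
    tokens = 1.2e14  # hypothetical number of training tokens

    train_flops = 6 * params * tokens  # forward + backward pass approximation
    print(f"{train_flops:.0e} FLOPs")  # ~5e+25, reported to one significant figure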
22. Compute usage including R&D (Score: 0)

Is the amount of compute used to build the model, including experiments, disclosed?

As a Frontier Model Forum founding member, we endorse the FMF methodology. However, offering a seemingly simple metric, cumulative compute (measured in FLOPs) may be an increasingly flawed proxy for the actual capabilities or risks of an AI model, given advances in model efficiency. We do not publicly disclose this information since specific numbers of FLOPs, along with parameters, could give competitors an idea of our proprietary approach when asked to also provide information like model architecture and a summary of training data. When combined, specific numbers could help bad actors triangulate even more specifics of our approach. The benefit of potentially exposing specific numbers is outweighed by the risk of disclosure of trade secrets, and relatedly, the risk of potential security vulnerabilities through reverse engineering.
Not disclosed
No information is provided, though the developer describes why they do not disclose it.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 7 x 10^26 FLOPs). Compared to the previous indicator, this indicator should include an estimation of the total compute used across experiments used towards the final training run for the model (such as including hyperparameter optimization or other experiments), and not just the final training run itself.
Our cumulative compute usage involved in building the model was 7 x 10^26 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
23. Development duration for final training run (Score: 0)

Is the amount of time required to build the model disclosed?

We do not disclose this.
Not disclosed
No information provided.

The amount of time should be specified in terms of both the continuous duration of time required and the number of hardware hours used. The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). The number of hardware hours should be reported to a precision of one significant figure and include the type of hardware hours. No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, or training and subsequent evaluation and red teaming).
Our model was trained over a period of 90 days using 4x10^4 NVIDIA H100 GPU-days.
24. Compute hardware for final training run (Score: 0)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

The model was trained on TPUv5p architecture. We employed synchronous data-parallel training to parallelise over multiple 8,960-chip pods of Google’s TPUv5p accelerators, distributed across multiple datacenters. https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf#page=4
Not disclosed
The specific number of TPUs used is not disclosed.

In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.
Our model was trained using 1000 NVIDIA H100 GPUs.
25. Compute provider (Score: 1)

Is the compute provider disclosed?

Google models are trained and served by our own custom AI accelerators, Tensor Processing Units (TPU).
Disclosed
Self-owned cluster.

For example, the compute provider may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that compute is provided by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.
Compute is provided by Google Cloud Platform.
26. Energy usage for final training run (Score: 0)

Is the amount of energy expended in building the model disclosed?

We have disclosed that the model was trained using Google’s Tensor Processing Units (TPUs), which are specifically designed so training can be distributed across multiple TPU devices for faster and more efficient processing. The efficiencies gained through the use of TPUs are aligned with Google's commitment to operate sustainably. In fact, over two generations – from TPU v4 to Trillium (v6) – more efficient TPU hardware design has led to a 3x improvement in the carbon-efficiency of AI workloads. We are committed to working to reduce the carbon intensity of AI systems, including by optimizing software, improving hardware efficiency, and powering AI models with carbon-free energy. However, obtaining precise and consistent measurements of training energy consumption across diverse AI systems is not technically feasible yet and there is no standard methodology, especially in the context of comparative public scoring for energy use rankings across the AI ecosystem. NEW: In an August 2025 paper (past the time of the final training run) we shared a new methodology for estimating the environmental impact of AI inference. https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805
Not disclosed
No specific information provided about energy usage for Gemini 2.5 Pro, though a paper was released addressing the energy and environmental impact of AI inference after the Gemini 2.5 Pro training run.

Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh), along with a description of the measurement methodology, which may involve estimation. Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model was trained using an estimated 1 x 10^4 MWh of energy. To estimate training energy consumption, we divided training FLOPs (5 x 10^25) by a conversion factor derived from NVIDIA A100 GPU information (3.74 x 10^21 FLOPs/MWh), given we train using FP16 with sparsity.
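For illustration, the arithmetic in the example answer above as a short Python sketch; the numbers are the example's own, not measured figures for any model.

    train_flops = 5e25       # example training compute
    flops_per_mwh = 3.74e21  # example NVIDIA A100 FP16-with-sparsity factor

    energy_mwh = train_flops / flops_per_mwh
    print(f"{energy_mwh:.0e} MWh")  # ~1e+04, i.e. 1 x 10^4 MWh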
27. Carbon emissions for final training run (Score: 0)

Is the amount of carbon emitted in building the model disclosed?

As AI continues to unlock new opportunities for business growth and societal benefits, we’re working to reduce the carbon intensity of AI systems — including by optimizing software, improving hardware efficiency, and powering AI models with carbon-free energy. Our efforts to run on 24/7 carbon-free energy (CFE) on every grid where we operate by 2030 aims directly at reducing the largest contributor to TPU emissions — operational electricity consumption. In addition, our recent study’s detailed manufacturing LCA helps us target our manufacturing decarbonization efforts towards the highest-impact initiatives. We're actively working with our supply chain partners to reduce these emissions through more sustainable manufacturing processes and materials. Emissions associated with chip manufacturing are still notable — and their share of total emissions will increase as we reduce operational emissions with carbon-free energy. Our significant improvements in AI hardware carbon-efficiency complement rapid advancements in AI model and algorithm design. Continued optimization of AI models is reducing the number of computations required for a given model performance. Some models that once required a supercomputer to run can now be run on a laptop, and at Google we’re using techniques like Accurate Quantized Training and speculative decoding to further increase model efficiency. We expect model advancements to continue unlocking carbon-efficiency gains, and are working to quantify the impact of software design on carbon-efficiency in future studies. However, obtaining precise and consistent measurements of training energy consumption across diverse AI systems is not technically feasible yet and there is no standard methodology, especially in the context of comparative public scoring for energy use rankings across the AI ecosystem. NEW: In an August 2025 paper (past the time of the final training run) we shared a new methodology for estimating the environmental impact of AI inference. https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805
Not disclosed
No specific information provided about environmental impacts for Gemini 2.5 Pro, though a paper was released addressing the energy and environmental impact of AI inference after the Gemini 2.5 Pro training run.

Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2), along with a description of the measurement methodology, which may involve estimation. Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions are for a single model run or include additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it. Emissions should correspond with the energy used in the previous indicator.
Training our model emitted an estimated 5 x 10^3 tCO2. To estimate training carbon emissions, we multiplied training energy usage (1 x 10^4 MWh) by a 2023 estimate of US data center carbon intensity (0.375 tCO2/MWh), given the data centers used in training operate in the US.
28. Water usage for final training run (Score: 0)

Is the amount of clean water used in building the model disclosed?

While we do not disclose granular water usage per model given there is no industry standard methodology for such precise calculations, we have been transparent on our total water consumption at our data centers and offices, which was 6.4 billion gallons (approximately 24 billion liters or 24 million cubic meters) in 2023. We have disclosed that this represents a 14% increase from 2022, primarily due to water cooling needs at our data centers, which experienced increased electricity consumption year-over-year. We’re accelerating water reuse practices across our data centers, and are tailoring site-specific solutions based on facility types, locations, and local water contexts. We also use non-potable sources and freshwater alternatives whenever feasible. We prioritize responsible water use in our data centers. In 2023, 69% of our freshwater withdrawals came from watersheds with low water scarcity, 16% came from watersheds with medium water scarcity, and 15% came from watersheds with high water scarcity. Through our water replenishment program, we focus on higher volumetric water benefits in water-scarce regions to ensure we’re positively impacting the areas that need it most. NEW: In an August 2025 paper (past the time of the final training run) we shared a new methodology for estimating the environmental impact of AI inference. https://cloud.google.com/blog/topics/sustainability/tpus-improved-carbon-efficiency-of-ai-workloads-by-3x?e=48754805
Not disclosed
No specific information provided about environmental impacts for Gemini 2.5 Pro, though a paper was released addressing the energy and environmental impact of AI inference after the Gemini 2.5 Pro training run.

Clean water usage should be reported in appropriate units, which most often will be megaliters (ML), along with a description of the measurement methodology, which may involve estimation. Clean water usage should be reported to a precision of one significant figure (e.g., 5000 ML). No form of decomposition into compute phases is required, but it should be clear whether the reported water usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate water usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Training our model used an estimated 20 ML of water. To estimate training water usage, we multiplied training energy usage (1 x 10^4 MWh) by a 2021 estimate of US data center water efficiency (1.8 ML per 1,000 MWh), given the data centers used in training operate in the US.
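For illustration, chaining the same example numbers through the carbon (indicator 27) and water (this indicator) estimates; all conversion factors are the ones quoted in the example answers, not measured values for any model.

    energy_mwh = 5e25 / 3.74e21    # example energy estimate (~1.3e4 MWh)
    carbon_intensity = 0.375       # example 2023 US figure, tCO2 per MWh
    water_efficiency = 1.8 / 1000  # example 2021 US figure, ML per MWh

    print(f"{energy_mwh * carbon_intensity:.0e} tCO2")  # ~5e+03, i.e. 5 x 10^3 tCO2
    print(f"{energy_mwh * water_efficiency:.0e} ML")    # ~2e+01, i.e. 20 ML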
29. Internal compute allocation (Score: 0)

How is compute allocated across the teams building and working to release the model?

We do not disclose this.
Not disclosed
No information provided.

To receive a point, the developer should provide the compute allocated to each team involved in training the model. We understand there might be no clear allocation of compute across different teams; in that case, report an estimate of the compute used over the last year. Compute allocation should be reported to at least one significant figure.
- Safety — 15%
- Pre-training — 60%
- Post-training — 15%
- Infrastructure and reliability — 5%
30. Model stages (Score: 1)

Are all stages in the model development process disclosed?

1. Data filtering for safety and responsibility and conditional pre-training
2. Supervised Fine-Tuning
3. Reinforcement Learning from Human Feedback, with evaluations that run continuously during training to monitor for safety (and other metrics)
Disclosed
Three model stages are described.

Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.
We define five stages in building the model: (1) unsupervised pre-training, (2) supervised instruction tuning, (3) RLHF, (4) domain-specific fine-tuning, and (5) final safety alignment.
31. Model objectives (Score: 1)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

Our safety training approach begins with pre-training interventions around data filtering to remove data that violates our policies and tagging, as well as metrics that monitor the model during pre-training. In post-training, we use Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human and Critic Feedback (RLHF/RLCF) to align the model to the policies and desiderata, along with enabling other capabilities.
Disclosed
The purpose of each stage is described.

We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).
During unsupervised pre-training, the objective is next-token prediction. During supervised instruction tuning, we optimize for correctness and helpfulness on labeled tasks. RLHF aligns model outputs with human preference judgments. Domain-specific fine-tuning focuses on improving in-domain capabilities using specialized data (e.g., code or legal text). Final safety alignment reduces disallowed or harmful responses.
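For illustration, standard textbook formulations of two of these objectives, written in LaTeX; the developer's exact losses and reward models may differ.

    % Pre-training: next-token prediction (negative log-likelihood of each token
    % given its prefix).
    \mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})

    % RLHF-style objective: maximize a learned reward r_phi while penalizing
    % divergence from a reference policy pi_ref.
    \max_{\theta}\ \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
      - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big)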
32. Code access (Score: 0)

Does the developer release code that allows third-parties to train and run the model?

Vertex AI provides several options for Gemini 2.5 Pro model training and deployment: Generative AI on Vertex AI gives access to large generative AI models, including Gemini 2.5 Pro, so third parties can evaluate, tune, and deploy them for use in AI applications. Custom training gives complete control over the training process, including using a developer’s preferred ML framework, writing their own training code, and choosing hyperparameter tuning options. Google AI Studio is the fast path for developers, students, and researchers who want to try Gemini models and get started building with the Gemini Developer API.
Not disclosed
No code is provided for training Gemini 2.5 Pro itself.

The released code does not need to match the code used internally.
We release training and inference code under an Apache 2.0 license at https://github.com/..., enabling others to replicate our core pipeline.
33. Organization chart (Score: 0)

How are employees developing and deploying the model organized internally?

As outlined in our responsible AI governance approach, executive reviewers with expertise in responsible AI carefully assess evaluation results, mitigations, and risks before making a launch decision. They also oversee our frameworks, policies, and processes, ensuring that these evolve to account for new modalities and capabilities. The thoughtful development and deployment of our most capable models requires cross-functional collaboration and a variety of experts across a broad range of matrixed teams, with public acknowledgment of contributing organizations.
Not disclosed
The organization structure is not clearly described.

To receive a point, the developer should provide both the internal organization chart for the team developing the model as well as the headcounts (or a proportion of headcounts) by the team.
The model team comprises 63 people, organized as follows:
- CEO
- Managing Director (Safety) — 24 people
- Managing Director (Pre-training) — 12 people
- Managing Director (Post-training) — 11 people
- Managing Director (API) — 6 people
- Director (Infrastructure and reliability) — 7 people
- Director (PR and marketing) — 4 people
- Director (Hiring) — 7 people
34. Model cost (Score: 0)

What is the cost of building the model?

We do not publicly disclose this highly confidential business information as this could expose our research investments and product roadmaps to competitors.
Not disclosed
No information provided but the developer explains why they do not disclose this information.

Monetary cost should be reported in appropriate currency (e.g. USD), along with the measurement methodology, which may involve estimation. Cost should be reported to a precision of one significant figure (e.g. 200 million USD).
We spent approximately 200 million USD on building the model: 50 million for data acquisition, 10 million for data processing, 20 million for personnel, 80 million for compute for R&D priced at market rates, and 40 million for compute for the final training run priced at market rates.
35. Basic model properties (Score: 0)

Are all basic model properties disclosed?

Yes - the model card provides a summary of this information as detailed below: · Inputs: Text strings (e.g., a question, a prompt, document(s) to be summarized), images, audio, and video files, with a 1M token context window. · Outputs: Text, with a 64K token output. · Architecture: Gemini 2.5 Pro builds upon the sparse Mixture-of-Experts (MoE) Transformer architecture (Clark et al., 2020; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) used in Gemini 2.0 and 1.5. Refinements in architectural design and optimization methods led to substantial improvements in training stability and computational efficiency. · Gemini 2.5 Pro was carefully designed and calibrated to balance quality and performance for complex tasks, improving over previous generations. More information, including technical specifications, is made available on Vertex AI.
Not disclosed
The developer discloses the modalities and basic information about model architecture but does not disclose information about the model components or size.

Basic model properties include: the input modality, output modality, model size, model components, and model architecture. To receive a point, all model properties should be disclosed. Modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs. Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module. Model size should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder). Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently; a sufficient disclosure includes any clear, though potentially incomplete, description of the model architecture.
Input modality: Text
Output modality: Text
Model components: Decoder-only model trained using self-supervised learning, followed by supervised fine-tuning and RLHF, which are used to align the language model to follow users' instructions and be helpful, harmless, and honest.
Model size: 70B parameters
Model architecture: Autoregressive (causal, decoder-only) transformer language model with rotary position embeddings, trained on the next-token prediction task.
36. Deeper model properties (Score: 0)

Is a detailed description of the model architecture disclosed?

Yes - the model card provides a summary of this information as detailed below: Architecture: The model builds upon the sparse Mixture-of-Experts (MoE) Transformer architecture (Clark et al., 2020; Fedus et al., 2021; Lepikhin et al., 2020; Riquelme et al., 2021; Shazeer et al., 2017; Zoph et al., 2022) used in Gemini 2.0 and 1.5. Refinements in architectural design and optimization methods led to substantial improvements in training stability and computational efficiency. Gemini 2.5 Pro was carefully designed and calibrated to balance quality and performance for complex tasks, improving over previous generations.
Not disclosed
Although the developer includes some information about the model architecture (e.g., that it's based on sparse MoE transformers), it is not in sufficient detail to allow an external entity to fully implement the model. For example, it does not disclose what the "refinements in architecture design" are.

To receive a point, the model architecture should be described in enough detail to allow for an external entity to fully implement the model. Publicly available code or a configuration file for a model training library (e.g., GPT-NeoX) would be a sufficiently detailed description.
The configuration file for training our model using a public model training library A can be found at [URL].
37. Model dependencies (Score: 0)

Is the model(s) the model is derived from disclosed?

Yes. Gemini 2.5 Pro is the next iteration in the Gemini 2.0 series of models, a suite of highly-capable, natively multimodal, reasoning models.
Not disclosed
The developer states that the model (2.5 Pro) is a continuation of a previous series of models (Gemini 2.0), which does not necessarily establish whether the model was derived from a Gemini 2.0 model or by what method.

We will award this point for a comprehensive disclosure of the model or models on which the foundation model directly depends on or is derived from, as well as the method by which it was derived (e.g., through fine tuning, model merging, or distillation). Additionally, we will award a point if the developer discloses that the model is not dependent on or derived from any model.
This model is a fine tune of Camel-70B. We used the methods described in [PAPER URL] for distillation.
38. Benchmarked inference (Score: 0)

Is the compute and time required for model inference disclosed for a clearly-specified task on clearly-specified hardware?

We do not publicly disclose this information. Specific numbers of FLOPs and data points along with parameters could give competitors an idea of our proprietary approach when asked to also provide information like model architecture and a summary of training data. When combined, specific numbers could help bad actors triangulate even more specifics of our approach. The benefit of potentially exposing specific numbers is outweighed by the risk of disclosure of trade secrets, and relatedly, the risk of potential security vulnerabilities through reverse engineering. As a Frontier Model Forum founding member, we endorse the FMF methodology.
Not disclosed
The developer does not disclose this information.

The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). Compute usage for inference should be reported in FLOPs/second to a precision of one significant figure (e.g. 5 x 10^21 FLOPs/second). The hardware in this evaluation need not be the hardware the developer uses for inference. The developer can report this figure over some known or public dataset.
It takes 0.002 seconds and 5 x 10^21 FLOPs/second to generate 100,000 tokens as 5,000 sequences of length 20 given inputs of length 40 from [DATASET URL]. The fixed set of hardware is 8 NVIDIA A100s.
39. Researcher credits (Score: 1)

Is a protocol for granting external entities API credits for the model disclosed?

Yes. To support academic research and drive cutting-edge research, Google provides access to Gemini API credits for scientists and academic researchers through the Gemini Academic Program. Qualified academic researchers (such as faculty, staff, and PhD students) in supported countries can apply to receive Gemini API credits and higher rate limits for research projects. This support enables higher throughput for scientific experiments and advances research. We are particularly interested in the research areas in the following section, but we welcome applications from diverse scientific disciplines: Evaluations and benchmarks: Community-endorsed evaluation methods that can provide a strong performance signal in areas such as factuality, safety, instruction following, reasoning, and planning. Accelerating scientific discovery to benefit humanity: Potential applications of AI in interdisciplinary scientific research, including areas such as rare and neglected diseases, experimental biology, materials science, and sustainability. Embodiment and interactions: Utilizing large language models to investigate novel interactions within the fields of embodied AI, ambient interactions, robotics, and human-computer interaction. Emergent capabilities: Exploring new agentic capabilities required to enhance reasoning and planning, and how capabilities can be expanded during inference (e.g., by utilizing Gemini Flash). Multimodal interaction and understanding: Identifying gaps and opportunities for multimodal foundational models for analysis, reasoning, and planning across a variety of tasks. Eligibility: Only individuals (faculty members, researchers or equivalent) affiliated with a valid academic institution, or academic research organization can apply. Note that API access and credits will be granted and removed at Google's discretion. We review applications on a monthly basis.
Disclosed
The developer discloses how access can be requested, eligibility/selection criteria, and a timeframe for reviewing requests (monthly).

A model credit access protocol refers to the steps, requirements, and considerations involved in granting credits to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access to credits (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) its policy on granting a transparent decision on whether access has been granted within a specified, reasonable period of time. Additionally, we will award a point if the developer discloses that it does not grant external entities API credits.
We implement a researcher access program: (i) Access can be requested from [URL] (ii) Any researcher at an accredited research institution is eligible to apply. Decisions are made based on the alignment between the applicant's project description and our target research directions (as described here: [URL]). (iii) Decision notifications are sent within three weeks of the application receipt.
40. Specialized access (Score: 0)

Does the developer disclose if it provides specialized access to the model?

As part of our External Safety Testing Program, we work with a small set of independent external groups to help identify areas for improvement in our model safety work, including on early versions of the model, by undertaking structured evaluations, qualitative probing, and unstructured red teaming. External safety testing groups receive access before the model is generally available and can turn off or down safety filters. Once the model is trained, independent external groups, including domain experts and a government body, further test the model to identify blind spots.
Not disclosed
The developer discloses that they do provide specialized access (in particular: early access & access to the model without safety filters), but does not disclose statistics on the number of users granted specialized access.

Specialized access could include several categories, such as early access, subsidized access, or deeper access (e.g., to model weights or checkpoints, that are not publicly available). We will award this point if the developer discloses (i) if it provides specialized access and (ii) statistics on the number of users granted access across academia, industry, non-profits, and governments, to one significant figure.
We provide early access to the model via API to: (1) 250 academics vetted by our program, (2) 0 industry affiliates, (3) 0 non-profit affiliates, (4) 2 government entities with whom we have signed MoUs. We provide no other specialized research access.
41. Open weights (Score: 0)

Are the model's weights openly released?

We do not publicly disclose this information due to the fact that Gemini 2.5 Pro is not an open model. However, our open model Gemma 3 is provided with open weights on various platforms, including Google Cloud, Kaggle, and Hugging Face.
Not disclosed
The weights for Gemini 2.5 Pro are not openly released.

To receive this point, model weights need to be publicly available at no cost. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via a license or some high-level description of who has been granted access to the foundation model).
Model weights are available on HuggingFace by following this link: [URL]
42. Agent Protocols (Score: 1)

Are the agent protocols supported for the model disclosed?

We support A2A for agents built using Gemini 2.5 Pro available via Vertex AI. We also announced that the Gemini API and SDK will support Model Context Protocol (MCP) to make it easy for developers to use a wide range of open source tools.
Not disclosed
The developer discloses agent protocols (A2A and MCP) that the model supports.

Agent protocols are specifications that define how autonomous agents exchange messages, context, or function calls with other agents, tools, or services (e.g., Anthropic’s Model Context Protocol (MCP) and Google’s Agent‑to‑Agent (A2A) spec). To earn this point, documentation must enumerate each protocol and describe any deviations or proprietary extensions.
We support MCP and A2A for agents built using model A
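For context on what supporting such protocols involves, the following is a minimal, illustrative sketch (not the developer's implementation) of the message shape of an MCP tool call, which is framed as JSON-RPC 2.0; the tool name and arguments are hypothetical, and the MCP specification remains the authoritative reference.

```python
# Illustrative sketch only: the general shape of a Model Context Protocol (MCP)
# tool call, which uses JSON-RPC 2.0 framing. The tool name and arguments below
# are hypothetical; consult the MCP specification for authoritative details.
import json

def build_mcp_tool_call(request_id: int, tool_name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 'tools/call' request as used by MCP."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(payload)

if __name__ == "__main__":
    # A hypothetical weather tool exposed by an MCP server.
    print(build_mcp_tool_call(1, "get_weather", {"city": "Zurich"}))
```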
43. Capabilities taxonomy (Score: 0)

Are the specific capabilities or tasks that were optimized for during post-training disclosed?

Yes - the model card provides the following information: Gemini 2.5 Pro is a thinking model, capable of reasoning before responding, resulting in enhanced performance and improved accuracy. It is well-suited for applications that require: · enhanced reasoning; · advanced coding; · multimodal understanding; · long context. Post-training optimizations for specific capabilities have resulted in improvements across all capabilities; we have also made changes that have resulted in some capability-specific wins: code, factuality, long context, multilinguality, audio, video, and agentic use cases (https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf).
Not disclosed
The developer discloses capabilities that they expect the model to perform well in. They also disclose capabilities that they have observed improvements in. However, the developer does not discuss the capabilities specifically optimized for during post-training.

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.
We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual proficiency in non-English languages (4) Tool-use
44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

Yes, results for each of the benchmarks measuring the model’s capabilities are detailed in the model card, tech report, and other external resources. We continue to update our evals, so this represents an example of results at a specific moment in time. Updated model cards can be found here. We have also included here a summary of the results (as of this writing): Gemini 2.5 Pro demonstrated strong performance, especially in reasoning tasks. A key improvement over prior Gemini models was a significant boost in code performance. Evaluations were conducted across benchmarks that measured capabilities such as reasoning and knowledge (Humanity's Last Exam (no tools), 17.8%), science (GPQA diamond, 83.0%), mathematics (AIME 2025, 83.0%), and, in line with the capabilities listed above: code generation (LiveCodeBench V5, 75.6%), code editing (Aider Polyglot, 76.5% whole / 72.7% diff), agentic coding (SWE-bench Verified, 63.2%), factuality (SimpleQA, 50.8%), visual reasoning (MMMU, 79.6%), image understanding (Vibe-Eval (Reka), 65.6%), video (Video-MME, 84.8%), long context (MRCR, 93.0% at 128k (average), 82.9% at 1M (pointwise)), and multilingual performance (Global MMLU (Lite), 88.6%). We also offer additional benchmark results for audio + visual capabilities, e.g. VideoMME (86.9%); see more at https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf#page=15
Not disclosed
While the developer has not disclosed a sufficient capabilities taxonomy in the previous indicator, the developer discloses a large number of capabilities evaluations that relate to the capabilities referenced in that disclosure.

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.
We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool
45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

We offer safety guidelines and best practices that reflect our responsible AI governance processes.
Not disclosed
The linked information does not provide information that would allow an external entity to reproduce the evaluations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

We conduct baseline assurance evaluations to guide decisions on model releases. These standard safety tests look at model behavior, including within the context of the safety policies and modality-specific risk areas. High-level findings are fed back to the model team, but prompt sets are held out to prevent overfitting and preserve the results’ ability to inform decision-making. With web-scale pre-training of language models, decontamination of public benchmarks is a persistent challenge (Brown et al., 2020; Gemini-Team et al., 2023; OpenAI, 2023a). When developing Gemini 2.5 Pro, we employed semantic-similarity and model based decontamination procedures to help mitigate the leakage issue, in addition to the standard n-gram based decontamination we used in Gemini 1.5. We also continue reporting on internally developed non-public benchmarks, such as HiddenMath, to move beyond the reliance on training set decontamination.
Not disclosed
The disclosure does not include the specific amount of train-test overlap.

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).
We compute train-test overlap via n-gram matching, following the procedure described here: [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
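As a rough illustration of the kind of n-gram matching referenced in this example (not the developer's actual decontamination pipeline), the sketch below flags an evaluation example as overlapping if any of its n-grams appears in the training corpus; the tokenization and choice of n are assumptions.

```python
# Minimal sketch of n-gram-based train-test overlap, assuming whitespace
# tokenization and a fixed n; real decontamination pipelines are more involved.
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_rate(train_docs: Iterable[str], eval_examples: Iterable[str], n: int = 8) -> float:
    """Fraction of eval examples sharing at least one n-gram with the training corpus."""
    train_ngrams: Set[Tuple[str, ...]] = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    examples = list(eval_examples)
    contaminated = sum(1 for ex in examples if ngrams(ex, n) & train_ngrams)
    return contaminated / max(len(examples), 1)
```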
47. Risks taxonomy (Score: 1)

Are the risks considered when developing the model disclosed?

Gemini safety policies align with Google’s standard framework for the types of harmful content that we make best efforts to prevent our Generative AI models from generating, including the following types of harmful content: · Child sexual abuse and exploitation · Hate speech (e.g. dehumanizing members of protected groups) · Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm) · Harassment (e.g. encouraging violence against people) · Sexually explicit content · Medical advice that runs contrary to scientific or medical consensus In addition, Gemini 2.5 Pro was assessed against Google’s Frontier Safety Framework (FSF), which covers four risk domains: · CBRN (chemical, biological, radiological and nuclear information risks) · Cybersecurity · Machine learning R&D · Deceptive alignment We released the Frontier Safety Framework (FSF) in May 2024 and updated it in February 2025. The FSF comprises a number of processes and evaluations that address risks of severe harm stemming from powerful capabilities of our frontier models. The Frontier Safety Framework involves the regular evaluation of our frontier models to determine whether they require heightened mitigations. More specifically, the FSF defines critical capability levels (CCLs) for each area, which represent capability levels where a model may pose a significant risk of severe harm without appropriate mitigations. When conducting FSF evaluations, we compare test results against internal alert thresholds ("early warnings") which are set significantly below the actual CCLs. This built-in safety buffer helps us be proactive by signaling potential risks well before models reach CCLs. Concretely, our alert thresholds are designed such that if a frontier model does not reach the alert threshold for a CCL, we can assume models developed before the next regular testing interval will not reach that CCL. Google DeepMind’s recent paper, An Approach to Technical AGI Safety and Security, discusses this approximate continuity assumption in more depth in Section 3.5. This is why we test at a regular cadence and on exceptional capability jumps.
Not disclosed
The developer discloses the risks considered.

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.
We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information
48. Risks evaluation (Score: 0)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

Gemini 2.5 Pro was evaluated across several areas during training and development evaluations, including text to text safety, multilingual safety, image to text safety, tone, and instruction following. Gemini 2.5 Pro was assessed against Google’s Frontier Safety Framework, with more details provided on the results in the model card. Our tech report offers details on improvements in training and development evaluations for policy violations, including automated red teaming and manual reviews. Our manual review confirmed losses were overwhelmingly either a) false positives or b) not egregious. Furthermore, this review confirmed losses are narrowly concentrated around explicit requests to produce sexually suggestive content or hateful content, mostly in the context of creative use-cases (e.g. historical fiction). We have not observed increased violations outside these specific contexts.
Not disclosed
A majority of the risks in the taxonomy do not have the results of evaluations with precise quantifications disclosed. The disclosure includes evaluation results for the four risks in the FSF. However, for the six categories of harmful content, the model card only specifies results for broad safety evaluations (text to text, multilingual, image to text). The automated red teaming/manual reviews do not provide the necessary precise quantifications in relation to the risks in the taxonomy.

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.
We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
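As an aside on the metric used in this example, mean attack success rate is simply the fraction of adversarial prompts in each risk category for which the model produces a harmful completion. A minimal sketch (with a hypothetical judge function standing in for human or classifier review) follows.

```python
# Sketch of computing mean attack success rate (ASR) per risk category.
# `is_harmful` stands in for a hypothetical judge (human or classifier).
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def mean_asr(
    results: List[Tuple[str, str]],          # (risk_category, model_response)
    is_harmful: Callable[[str], bool],       # hypothetical judge
) -> Dict[str, float]:
    totals: Dict[str, int] = defaultdict(int)
    successes: Dict[str, int] = defaultdict(int)
    for category, response in results:
        totals[category] += 1
        if is_harmful(response):
            successes[category] += 1
    return {cat: successes[cat] / totals[cat] for cat in totals}
```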
49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

We do not publicly disclose this information. But we do offer guidance that reflects our own processes, for developers using Gemini 2.5 Pro.
Not disclosed
The developer does not disclose this.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

As outlined in our responsible AI governance approach, multi-layered red teaming also plays a critical role in our approach, with both internal and external teams proactively testing AI systems for weaknesses and identifying emerging risks. As part of our External Safety Testing Program, we work with a small set of independent external groups to help identify areas for improvement in our model safety work, including on early versions of the model, by undertaking structured evaluations, qualitative probing, and unstructured red teaming. External safety testing groups receive access before the model is generally available and can turn off or down safety filters. These groups were selected based on their expertise across a range of domain areas, such as autonomous systems, societal, cyber, and CBRN risks, and included civil society, and commercial organizations. The groups testing the model checkpoint were compensated for their time. External groups designed their own methodology to test topics within a particular domain area. The time dedicated to testing also varied per group, with some groups working full-time on executing testing processes, while others were part-time dedicated. Some groups pursued manual red-teaming and reported on qualitative findings from their exploration of model behavior, while others developed bespoke automatic testing strategies and produced quantitative reports of their results. Reports were written independently of Google DeepMind, but Google DeepMind experts were on hand to discuss methodology and findings throughout the testing process. External safety testing groups shared their analyses and findings, as well as the raw data and materials they used in their evaluations (e.g., prompts, model responses). After the testing, we undertake an internal exercise of exploring the data in more detail and assigning severity ratings for the issues found. We assess concerns raised by external testers against our safety policies to better understand the breadth of the issue, which we communicate back to modeling teams. Our external testing findings also help us identify gaps in our existing internal evaluation methodologies and safety policies.
Not disclosed
The developer does not disclose the entities that have evaluated the model pre-deployment.

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will award this point if the developer discloses no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.
We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.
51. External risk evaluation (Score: 0)

Are the parties contracted to evaluate model risks disclosed?

Third-party external testers were contracted, although we do not disclose exact numbers; these testers receive secure access to test models and applications according to their domain expertise in order to evaluate risks for Gemini 2.5 Pro.
Not disclosed
The developer does not disclose statistics about the contracted parties that have evaluated model risks.

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.
Contracted parties responsible for evaluating risks: (1) 2 contracting non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers
52. Mitigations taxonomy (Score: 1)

Are the post-training mitigations implemented when developing the model disclosed?

We take a multi-faceted approach to risk mitigation. We implement content safety, security, and privacy mitigations; employ phased launches; empower users with transparency, labeling and training; harness user feedback; and deploy ongoing monitoring to continuously improve. In addition, we support the wider ecosystem with AI safety tools and standards. Safety and responsibility were built into Gemini 2.5 Pro throughout the training and deployment lifecycle, including pre-training, post-training, and product-level mitigations. Mitigations include, but are not limited to: · dataset filtering; · conditional pre-training; · supervised fine-tuning; · reinforcement learning from human and critic feedback; · safety policies and desiderata; · product-level mitigations such as safety filtering. We build safety into the models through pre- and post-training approaches. We start by constructing metrics based on the policies and desiderata above, which we typically turn into automated evaluations that guide model development through successive model iterations. We use data filtering and conditional pre-training, as well as Supervised Fine-Tuning (SFT), and Reinforcement Learning from Human and Critic Feedback (RL*F) (see details starting on https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf#page=20).
Not disclosed
The developer specifies that they use SFT and RL*F during post-training to mitigate risks.

By post-training mitigations, we refer to interventions implemented by the developer during the post-training phase to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for a complete list of mitigations considered, even if some of the mitigations are not reflected in the final model. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
We implement supervised fine tuning and reinforcement learning with human feedback to address model risks. We use no other methods to address risks.
53. Mitigations taxonomy mapped to risk taxonomy (Score: 0)

Does the developer disclose how the post-training mitigations map onto the taxonomy of risks?

We take a scientific approach to mapping AI risks through research and expert consultation, codifying these inputs into a risk taxonomy. Our mapping process is fundamentally iterative, evolving alongside the technology, and adapting to the range of contexts in which people use AI models or applications. We’ve codified our mapping work into a taxonomy of potential risks associated with AI, building on the NIST AI Risk Management Framework and informed by our experiences developing and deploying a wide range of AI models and applications. These risks span safety, privacy, and security, as well as transparency and accountability risks such as unclear provenance or lack of explainability. This risk map is designed to enable clarity around which risks are most relevant to understand for a given launch, and what might be needed to mitigate those risks. We have also mapped risks according to our Frontier Safety Framework (FSF), in which we identify capability levels at which AI models without additional mitigations could pose severe risk. We implement protocols to detect the attainment of such capability levels, prepare and articulate mitigation plans in advance of when such capability levels are attained, and, where appropriate, involve external parties to help inform and guide our approach. These FSF risks include four risk domains: CBRN (chemical, biological, radiological and nuclear information risks), cybersecurity, machine learning R&D, and deceptive alignment. Results for testing can be found in the model card.
Not disclosed
The developer does not map the previously specified risks taxonomy to mitigations.

We will award this point for a complete mapping of the primary risk that each mitigation is meant to address, even if the mitigation potentially maps on to other risks in the taxonomy. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We use supervised fine tuning for general instruction following. We use RLHF to reduce the model's propensity to output information about cybercrimes, bioweapons, disinformation, content harassing someone, and PII.
54. Mitigations efficacy (Score: 0)

Does the developer evaluate and disclose the impact of post-training mitigations?

We build safety into the models through pre- and post-training approaches. We start by constructing metrics based on the policies and desiderata outlined in the model card and other technical documentation, which we typically turn into automated evaluations that guide model development through successive model iterations. Throughout the model card and other technical documentation, we document how successive iterations of the model perform across our safety and dangerous capability evaluations as well as associated mitigations and approaches. Our technical report discloses an overview of metrics for safety policy violations (see https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf#page=20).
Not disclosed
The developer contrasts results on risk evaluations across model versions, but not with/without post-training mitigations.

We will award this point if the developer discloses the results on the risk evaluations before and after the post-training mitigations are applied. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
Pre-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.80 (2) Harassment: 0.91 (3) Cybersecurity risks: 0.56 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.62 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.52 Post-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity risks: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.02
55. External reproducibility of mitigations evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of post-training mitigations disclosed?

We do not offer external reproducibility of the model’s specific mitigations evaluation for safety and security reasons. However, for transparency into our content safety evaluation techniques and access to tools that reflect approaches that are standard for the model’s development and content safety mitigations, we offer ShieldGemma externally, which is a set of instruction-tuned, open-weight models available externally for evaluating the safety of text and images against a set of defined safety policies. People can use ShieldGemma to help evaluate and prevent generative AI applications from violating safety policies. The ShieldGemma family of models is released with open weights to allow fine-tuning for specific use cases.
Not disclosed
The developer specifies that they do not offer external reproducibility of the model's specific mitigations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all mitigations evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We release the code and prompts for reproducing post-training mitigation evaluations at this GitHub link: [URL]
56. Model theft prevention measures (Score: 1)

Does the developer disclose the security measures used to prevent unauthorized copying (“theft”) or unauthorized public release of the model weights?

We use the SAIF framework to mitigate known and novel AI security risks. The latter category includes risks such as data poisoning, model exfiltration, and rogue actions. We apply security controls, or repeatable mitigations, to these risks. For example, for prompt injections and jailbreaks, we apply robust filtering and processing of inputs and outputs. Additionally, thorough training, tuning, and evaluation processes help fortify the model against prompt injection attacks. For data poisoning, we implement data sanitization, secure AI systems, enable access controls, and deploy mechanisms to ensure data and model integrity. We have published a full list of our controls for AI security risks. In addition, we continue to research new ways to help mitigate a model’s susceptibility to security attacks. For example, we’ve developed an AI agent that automatically detects security risks in real-world code.
Not disclosed
The developer discloses the SAIF framework, which describes security measures for preventing model exfiltration (https://saif.google/secure-ai-framework/risks#model-exfiltration).

This indicator assesses the developer's disclosures regarding how it addresses the risk that malicious actors or insiders could exfiltrate or replicate proprietary weights. Security measures could include insider threat analysis and detection, in addition to external threat management. Examples of such measures include encryption at rest, key management, remote attestation, or auditing for suspicious queries. We will award a point if the developer discloses specific steps taken to safeguard the model weights or that none are implemented.
We store model weights on encrypted volumes with hardware-based key management. We monitor inference queries for suspicious patterns (like repeated attempts to reconstruct weights token-by-token), and we audit all staff access logs monthly.
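To make the monitoring part of this example concrete, a deliberately simplified sketch of flagging clients whose query volume is anomalous (a crude stand-in for the "suspicious pattern" detection described above) could look like the following; the threshold and log schema are hypothetical.

```python
# Hypothetical sketch: flag API clients whose query volume in a monitoring
# window exceeds a threshold, as a crude proxy for suspicious-pattern detection.
from collections import Counter
from typing import Dict, Iterable, List

def flag_suspicious_clients(
    query_log: Iterable[Dict[str, str]],    # each entry: {"client_id": ..., "prompt": ...}
    max_queries_per_window: int = 10_000,   # assumed threshold
) -> List[str]:
    counts = Counter(entry["client_id"] for entry in query_log)
    return [client for client, n in counts.items() if n > max_queries_per_window]
```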
57. Release stages (Score: 1)

Are the stages of the model's release disclosed?

As appropriate, we use a multi-layered approach to model deployment that may start with testing internally, then releasing to trusted testers externally, then opening up to a small portion of our user base (for example, Gemini Ultra users first). We may also phase our country and language releases, constantly testing to ensure mitigations are working as intended before we expand. And finally, we have careful protocols and additional testing and mitigations required before a product is released to under 18s.
Not disclosed
The developer outlines the stages of model release.

Release stages include A/B testing, release on a user-facing product, GA release, open-weight release, etc. We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award a point if the developer provides a clear identification of the stages through which the model was released.
We began with an internal alpha test for two weeks, followed by a closed beta with selected enterprise partners for one month, then a public waitlisted preview, and finally a general availability release once thresholds on safety benchmarks were met.
58. Risk thresholds (Score: 1)

Are risk thresholds disclosed?

The Frontier Safety Framework (FSF) involves the regular evaluation of Google’s frontier models to determine whether they require heightened mitigations. The Framework is built around capability thresholds called “Critical Capability Levels (CCLs), “at which, absent mitigation measures, AI models or systems may pose heightened risk of severe harm. CCLs can be determined by identifying and analyzing the main foreseeable paths through which a model could cause severe harm, and then defining the CCLs as the minimal set of capabilities a model must possess to do so. Note that we have selected our CCLs to be conservative; it is not clear to what extent CCLs might translate to harm in real-world contexts. There are two sets of CCLs: misuse CCLs that can indicate heightened risk of severe harm from misuse if not addressed, and deceptive alignment CCLs that can indicate heightened risk of deceptive alignment-related events if not addressed. For misuse risk, we define CCLs in high-risk domains where, based on early research, we believe risks of severe harm may be most likely to arise from future models: · CBRN: Risks of models assisting in the development, preparation, and/or execution of a chemical, biological, radiological, or nuclear (“CBRN”) attack. · Cyber: Risks of models assisting in the development, preparation, and/or execution of a cyber attack. · Machine Learning R&D: Risks of the misuse of models capable of accelerating the rate of AI progress to potentially destabilizing levels, the result of which could be the unsafe attainment or proliferation of other powerful AI models. Capabilities in this area are under active research, and in the longer term may exacerbate frontier AI risks—including in other risk domains—if insufficiently managed. For deceptive alignment risk, the initial approach focuses on detecting when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied. The two instrumental reasoning CCLs thus focus on delineating when such capability becomes present, and subsequently when the initial mitigation for this capability—automated monitoring—is no longer adequate. We intend to evaluate our most powerful frontier models regularly to check whether their AI capabilities are approaching a CCL. We also intend to evaluate any of these models that could indicate an exceptional increase in capabilities over previous models, and where appropriate, assess the likelihood of such capabilities and risks before and during training. To do so, we will define a set of evaluations called “early warning evaluations,” with a specific “alert threshold” that flags when a CCL may be reached before the evaluations are run again. In our FSF evaluations, we seek to equip the model with appropriate scaffolding and other augmentations to make it more likely that we are also assessing the capabilities of systems that will likely be produced with the model. We may run early warning evaluations more frequently or adjust the alert threshold of our evaluations if the rate of progress suggests our safety buffer is no longer adequate. Where necessary, early warning evaluations may be supplemented by other evaluations to better understand model capabilities relative to our CCLs. We may use additional external evaluators to test a model for relevant capabilities, if evaluators with relevant expertise are needed to provide an additional signal about a model’s proximity to CCLs. 
When a model reaches an alert threshold for a CCL, we will assess the proximity of the model to the CCL and analyze the risk posed, involving internal and external experts as needed. This will inform the formulation and application of a response plan. Central to most response plans will be the application of the mitigations described later in this document. For misuse, we have two categories of mitigations: security mitigations intended to prevent the exfiltration of model weights, and deployment mitigations (such as safety re-tuning and misuse filtering, detection, and response) intended to counter the misuse of critical capabilities in deployments. For deceptive alignment risk, automated monitoring may be applied to detect and respond to deceptive behavior for models that meet the first deceptive alignment CCL. Note that these mitigations reflect considerations from the perspective of addressing severe risks from powerful capabilities alone; due to this focused scope, other risk management and security considerations may result in more stringent mitigations applied to a model than specified by the Framework. A model flagged by an alert threshold may be assessed to pose risks for which readily available mitigations (including but not limited to those described below) may not be sufficient. If this happens, the response plan may involve putting deployment or further development on hold until adequate mitigations can be applied. Conversely, where model capabilities remain quite distant from a CCL, a response plan may involve the adoption of additional capability assessment processes to flag when heightened mitigations may be required. The appropriateness and efficacy of applied mitigations should be reviewed periodically, drawing on information like related misuse or misuse attempt incidents; results from continued post-mitigation testing; statistics about our intelligence, monitoring and escalation processes; and updated threat modeling and risk landscape analysis. When conducting FSF evaluations, we compare test results against internal alert thresholds ("early warnings") which are set significantly below the actual CCLs. This built-in safety buffer helps us be proactive by signaling potential risks well before models reach CCLs. Concretely, our alert thresholds are designed such that if a frontier model does not reach the alert threshold for a CCL, we can assume models developed before the next regular testing interval will not reach that CCL. Our model card provides details of our Frontier Safety assessment on pages 9-16. On top of robust internal FSF evaluations, we also evaluate the model against well-known academic safety and responsibility benchmarks. The results of these internal and external ethics and safety evaluations must be within acceptable thresholds for meeting internal policies for categories such as child safety, content safety, representational harms, memorization, and large-scale harms. Our policies for mitigating harm in areas such as child safety, suicide, and self-harm have been informed by years of research, user feedback, and expert consultation.
Not disclosed
The developer discloses risk thresholds.

Risk thresholds determine when a risk level is unacceptably high to a developer (e.g. leading to the decision to not release a model), moderately high (e.g. triggering additional safety screening), or low enough to permit normal usage. We will award this point if the developer discloses explicit risk thresholds that clarify (i) which harmful outcomes are being scored, (ii) how the scores are computed (in general terms, not necessarily disclosing internal algorithms), and (iii) what triggers an action to block, delay, or otherwise modify a model's release. Alternatively, we will award a point if the developer discloses that it does not consider explicit risk thresholds during model release.
Our risk threshold for biorisks is the ability to autonomously create bioweapons. Current models score a medium: they don't autonomously create bioweapons but could help a skilled practitioner with access to materials in speeding up creation of bioweapons. Risk thresholds higher than medium would delay the model's release until the risk level drops to medium or below.
59. Versioning protocol (Score: 1)

Is there a disclosed protocol for versioning and deprecation of the model?

We label stable models with substantial capability advancements (e.g., when multimodality or thinking capabilities were first introduced) with a new number. We define versions as such publicly: · Stable model: A publicly released version of the model that is available and supported for production use starting on the release date. · Latest stable model: The latest version within the model family recommended for new and active projects and should be the target for migrations from earlier versions. · Legacy stable model: A model version that's been superseded by the Latest Stable Model. · Retired model: The model version is past its retirement date and has been permanently deactivated. Retired models are no longer accessible or supported by Google. Details about the model versions and lifecycle are published on Vertex AI. Note, we also disclose naming convention adjustments; for example, throughout an updated model card, we may clearly differentiate two versions of Gemini 2.5 Pro: a model previously identified as "Gemini 2.5 Pro Preview" could be referred to as "Gemini 2.5 Pro Experimental (03-25)", while the latest version is referred to as "Gemini 2.5 Pro Preview (05-06)", according to release dates. Additionally, we publish model-specific version names, release dates, and launch stage information at Vertex AI pages for specific models, including for Gemini 2.5 Pro.
Not disclosed
The developer discloses a versioning protocol. The developer also describes how they deprecate/communicate model versions.

We will award a point if the developer discloses how model versions are labeled, updated, deprecated, and communicated to users.
We version models based on the date of release: e.g., ModelName-11-01-2024. We additionally provide ModelName-latest, corresponding to the latest release. We deprecate model versions when we plan to remove access, with six months' notice to users. Users should respond to model deprecation by switching to the newest version of the model or an equivalent non-deprecated model. Users can switch to a different model by replacing the model identifier (e.g., with ModelName-latest for the latest version) in API calls or through the Python SDK.
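To illustrate what the versioning scheme in this example implies for users, a hypothetical client call might pin either a dated version or the -latest alias; the client class and method below are placeholders, not a real SDK.

```python
# Hypothetical client usage illustrating the versioning scheme in the example
# answer above; `Client` and `generate` are placeholders, not a real SDK.
class Client:
    def __init__(self, model: str) -> None:
        self.model = model

    def generate(self, prompt: str) -> str:
        # In a real SDK this would call the provider's API for self.model.
        return f"[{self.model}] response to: {prompt}"

pinned = Client(model="ModelName-11-01-2024")   # reproducible, dated version
tracking = Client(model="ModelName-latest")     # always the newest release
print(pinned.generate("Summarize the changelog."))
print(tracking.generate("Summarize the changelog."))
```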
60. Change log (Score: 1)

Is there a disclosed change log for the model?

We update model cards to include new evaluations as the model is improved or revised, with details on performance improvements; these updates are disclosed and dated accordingly. Vertex AI also publishes recommended upgrades for the models and points to available migration paths. Release notes for Gemini 2.5 Pro available on Vertex AI are also updated accordingly.
Not disclosed
The release notes (https://ai.google.dev/gemini-api/docs/changelog) list feature additions.

We will award a point if the developer publishes a version-by-version record of new features, fixes, or performance improvements.
On 11/1/2024 (version ModelName-11-01-2024), we improved model reasoning in technical domains. This resulted in a 20-point increase on the MATH benchmark (from 62% to 82%). Past change logs can be viewed at [URL]
61. Foundation model roadmap (Score: 1)

Is a forward-looking roadmap for upcoming models, features, or products disclosed?

Yes. We have discussed releases and improvements in advance for Gemini 2.5 Pro such as in this blog, stating how Gemini 2.5 Pro “builds on the overwhelmingly positive feedback to Gemini 2.5 Pro’s coding and multimodal reasoning capabilities. Beyond UI-focused development, these improvements extend to other coding tasks such as code transformation, code editing and developing complex agentic workflows,” and as part of the updates from our annual developer conference I/O in May, we wrote “both versions of Gemini 2.5 Flash as well as 2.5 Pro will appear in Google AI Studio and Vertex AI in Preview, with general availability for Flash coming in early June and Pro soon to follow.”
Not disclosed
In the past, the developer has publicly disclosed specific future model releases.

A foundation model roadmap is a transparent statement about how the developer intends to evolve or expand its LLM offerings, including upcoming models, major feature releases, or expanded products based on the model, along with approximate timelines or version milestones. It can be high-level (e.g., “new model Q2 2025”), but must exist publicly.
We plan to release ModelX2 in Q2 2025, featuring enhanced multilingual capabilities and improved retrieval. We also aim to launch an enterprise-specific product tier for regulated industries by early 2026.
62. Top distribution channels (Score: 1)

Are the top-5 distribution channels for the model disclosed?

Gemini 2.5 Pro is available across various channels, but we do not publicly disclose stack-ranking lists of these channels, given they have very different and therefore not comparable audience sizes (e.g., consumer app versus downstream enterprise customers). We are transparent on our channels for distribution of the model: (in alphabetical order) the Gemini API via Google AI Studio and Vertex AI. It’s also available for users in the Gemini app (using Gemini 2.5 Pro), powering features like Canvas, enabling anyone to “vibe code” and build interactive web apps with a single prompt.
Not disclosed
The developer specifies that there are three distribution channels: the Gemini app, Google AI Studio, and Vertex AI.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. API providers and model distributors may be ranked separately using different metrics as long as the total number of distribution channels equals five (if five distribution channels exist). For example, the developer may choose to disclose the top-3 API providers (ranked by the number of queries) and the top-2 model distributors (ranked by the number of downloads).
We provide API access to the model through A, B, and C. We distribute model weights through D and E. We pick the top-3 API providers based on the average number of queries per month and the top-2 model weight providers based on the average number of downloads per month.
63. Quantization (Score: 0)

Is the quantization of the model served to customers in the top-5 distribution channels disclosed?

We do not disclose this.
Not disclosed
The developer does not disclose this.

We will award this point for a disclosure of the model precision in each of the top-5 distribution channels.
We serve the model at 16-bit precision on all distribution channels.
64. Terms of use (Score: 1)

Are the terms of use of the model disclosed?

Our terms of service are available at the following links: · Terms of Service · Generative AI Additional Terms of Service · Generative AI Prohibited Use Policy · Service Specific Terms
Not disclosed
A terms of service that appears to apply to the bulk of the model's distribution channels is disclosed.

We define terms of use to include terms of service and model licenses. We will award this point for a pointer to the terms of service or model license. In the event that the model's licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use. Terms of service should be disclosed for each of the top-5 distribution channels. However, we will award this point if there are terms-of-service that appear to apply to the bulk of the model’s distribution channels.
Our terms of service are published at https://ourcompany.com/model-tos - these terms cover both our API and all distribution channels for model weights.
65. Distribution channels with usage data (Score: 0)

What are the top-5 distribution channels for which the developer has usage data?

We offer access to usage data for Gemini 2.5 Pro usage for Cloud customers. More details are available at this link.
Not disclosed
This indicator asks which distribution channels the developer has access to usage data from, not which ones it provides users access to data about.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. We define usage data as any form of developer-exclusive data collected from any of a developer's distribution channel. A developer has access to usage data from a distribution channel if it is able to use that data for downstream purposes (e.g., analytics, training etc.). Usage data may be shared outside of the developer, but it is initially collected by the distribution channel and shared to the developer.
We have access to usage data through the distribution channels: A, B, and C.
66. Amount of usage (Score: 0)

For each of the top-5 distribution channels, how much usage is there?

Our complex and varied product portfolio makes it challenging to use "the same metric across all products." However, information we can disclose at the time of this report includes the following: In a recent blog, we note that in early 2024, we were processing 9.7 trillion tokens a month across our products and APIs. Now, in 2025, we’re processing over 480 trillion — that’s 50 times more. Over 7 million developers are building with Gemini, five times more than this time last year, and Gemini usage on Vertex AI is up 40 times. The Gemini app now has over 400 million monthly active users. We are seeing strong growth and engagement particularly with the 2.5 series of models. For those using 2.5 Pro in the Gemini app, usage has gone up 45%.
Not disclosed
Usage not reported per product. To get this point, the developer needs to disclose usage according to the same metric for all products

Usage should be reported as the number of queries over the span of a month, reported to the precision of one significant figure (e.g., 50 million queries).
Distribution channel A: 50 million queries. Distribution channel B: 10 million queries. Distribution channel C: 10 million queries.
67. Classification of usage data (Score: 0)

Is a representative, anonymized dataset classifying queries into usage categories disclosed?

We do not publicly disclose this information.
Not disclosed
Company acknowledges no disclosure

Developers may either share a fully public dataset or a partially restricted dataset (e.g., under a research license). We will award this point if there is a clear, aggregated or sample dataset that reveals categories of tasks/queries.
We provide quarterly releases of an anonymized dataset that classifies user queries into 20 broad job-related categories. Researchers can request access via [URL]. We ensure no PII is included.
68. Data retention and deletion policy (Score: 0)

Is a policy for data retention and deletion disclosed?

We provide details about our collection and retention policies on our Privacy and Terms page. We share details of data processing, security measures, and privacy information for Cloud via our Cloud Data Processing Addendum. Information including details on storage, replication, and deletion for Google Cloud are located on the Data deletion documentation.
Not disclosed
It remains unclear how data that users request to be deleted is then removed from AI system training

A data retention and deletion policy is a policy for removing particular data from the training set and/or preventing it from being used if there is a user or external request (e.g., “right to be forgotten”) that also covers internal data governance. This includes whether there is a formal process to delete or retract data from future training runs and how long raw data is retained. It also clarifies how quickly deletions propagate to the model (e.g., “only in subsequent major model releases”).
We honor verified user requests to delete personal data from our training corpus by removing it from any subsequent scheduled retraining. Our data retention policy ensures chat logs are purged after 90 days.
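As a deliberately simplified illustration of the 90-day retention rule in this example, a scheduled job could drop chat-log records older than the cutoff; the record schema is hypothetical and assumes timezone-aware timestamps.

```python
# Simplified sketch of a 90-day retention purge over chat-log records with a
# hypothetical schema: {"id": str, "created_at": datetime, "text": str}.
# Assumes `created_at` values are timezone-aware datetimes.
from datetime import datetime, timedelta, timezone
from typing import Dict, List

def purge_expired(records: List[Dict], retention_days: int = 90) -> List[Dict]:
    cutoff = datetime.now(timezone.utc) - timedelta(days=retention_days)
    return [r for r in records if r["created_at"] >= cutoff]
```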
69. Geographic statistics (Score: 0)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

We disclose where Gemini 2.5 Pro is available via Google AI Studio and Vertex AI but do not have a monthly dashboard. Developers using the model can create dashboards.
Not disclosed
No geographic statistics disclosed, only availability

We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.
We share anonymized per-country usage metrics in a publicly accessible dashboard, updated monthly, on this link: [link]
70. Internal products and services (Score: 0)

What are the top-5 internal products or services using the model?

We’ve been bringing AI into our products and services, including the Gemini App for Gemini Advanced users, that are used by billions, making them radically more helpful. We share information here about our products that are using our latest AI technologies. We do not disclose specific metrics due to business confidentiality.
Not disclosed
Google does not disclose specific metrics due to business confidentiality

An internal product or service is a product or service built by the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
71. External products and services (Score: 1)

What are the top-5 external products or services using the model?

We do not have access to such metrics about external products or services; however, we regularly publish case studies, including a list of 600+ real-world gen AI use cases from the world's leading organizations.
Not disclosed
Google discloses it does not have access to such metrics

An external product or service is a product or service built by a party external to the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided. We will award a point if the developer discloses that it does not have access to such metrics about external products or services.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
72. Users of internal products and services (Score: 0)

How many monthly active users are there for each of the top-5 internal products or services using the model?

As outlined in our Q1 2025 earnings call, active users in AI Studio and the Gemini API have grown over 200% since the beginning of 2025.
Not disclosed
No number of users disclosed. Needs to be disclosed per application or product.

An internal product or service is a product or service built by the developer. The number of users refers to users who engaged or interacted with the model through the internal product or service over the last month or averaged over the last X months (this should be specified). Number of users should be specified to one significant figure (e.g. 100,000).
Over the last 6 months, the total monthly active users for our top-5 products using model Y are: Product A: 100,000 users Product B: 30,000 users Product C: 10,000 users Product D: 10,000 users Product E: 10,000 users
73. Consumer/enterprise usage (Score: 0)

Across all distribution channels for which the developer has usage data, what portion of usage is consumer versus enterprise?

We do not disclose this.
Not disclosed
Company acknowledges no disclosure

Consumer usage refers to usage by individual consumers. Enterprise usage refers to usage by enterprise customers (including government use). Consumer and enterprise usage should be calculated in terms of the number of queries by or the amount of revenue from consumer or enterprise users. Percentages should be specified to two significant digits (e.g., 12% consumer, 88% enterprise).
12% of the usage of model A across all distribution channels is from consumers, 88% is from enterprise users. Of this 88%, 6% is from users at governments. Usage is calculated based on number of queries.
74. Enterprise users (Score: 0)

Across all distribution channels for which the developer has usage data, what are the top-5 enterprises that use the model?

We do not disclose this.
Not disclosed
Company acknowledges no disclosure

Enterprises should be ranked by the number of queries made or the amount of revenue from usage since the model's release. We will also award this point if the developer indicates it does not have access to enterprise usage data.
The top-5 enterprises are A, B, C, D, and E. The enterprises are selected based on the number of queries.
75. Government use (Score: 0)

What are the 5 largest government contracts for use of the model?

We do not disclose this.
Not disclosed
Company acknowledges no disclosure

This includes known government contracts of enterprise or government-specific products and services that use the model. We will award this point if the developer discloses its top five government contracts ranked by monetary value, though the developer may omit contracts where it is under NDA regarding the existence of the contract.
The five largest government users of our service, along with their use cases, are: 1. County A is utilizing our product to improve access to internal resources. 2. National Lab B is using our model to advance bioscientific research. 3. Federal agency C is using our product to deliver faster, more accurate translation services. 4. City D is participating in a pilot program that found our product helped reduce the time spent on routine tasks. 5. Country E is using our product to summarize legal documents in its lower courts.
76. Benefits Assessment (Score: 1)

Is an assessment of the benefits of deploying the model disclosed?

In a recent study, we discuss how AI is transforming scientific disciplines, outlining five key opportunities for AI to significantly impact science: transforming knowledge digestion and communication, generating and annotating large datasets, simulating and accelerating complex experiments, modeling intricate systems, and identifying novel solutions. Education is a central beneficial use case for modern AI. We conducted research and published a technical report on the benefits of using Gemini models to create LearnLM, collaborating with learners and educators to translate high-level principles from learning science into a pragmatic set of seven diverse educational benchmarks, spanning quantitative, qualitative, automatic, and human evaluations, and to develop a new set of fine-tuning datasets to improve the pedagogical capabilities of Gemini, introducing LearnLM-Tutor. Our evaluations show that LearnLM-Tutor is consistently preferred over a prompt-tuned Gemini by educators and learners on a number of pedagogical dimensions. We recently integrated the pedagogical capabilities of our experimental LearnLM model into the main Gemini model family; these capabilities now enhance Gemini 2.5 Pro. To assess state-of-the-art support for educational use cases, we ran an "arena for learning" in which 189 educators and pedagogy experts conducted blind, head-to-head, multi-turn comparisons of leading AI models. The research shows that Gemini 2.5 Pro, specifically, demonstrates markedly higher performance across key principles of good pedagogy when compared to other models.
Not disclosed
Google discloses several quantitative assessments via studies and technical reports

We will award this point for any quantitative assessment of the benefits or potential benefits of deploying the model.
We analyze the impact of using the model on education outcomes using a randomized controlled trial in third grade math assignments, and find that use in the classroom improves standardized test outcomes by 26%. [Link to report.]
77. AI bug bounty (Score: 1)

Does the developer operate a public bug bounty or vulnerability reward program under which the model is in scope?

We have a close relationship with the security research community. To honor all the cutting-edge external contributions that help us keep our users safe, we’ve maintained a Vulnerability Reward Program running continuously since November 2010. We recently updated this program to specifically clarify and encourage reporting of issues in our AI products. We released a 2024 year in review of our Rewards program that confirmed the ongoing value of engaging with the security research community to make Google and its models and products safer.
Not disclosed
Bug bounty with AI products in scope is disclosed

We will award this point for a publicly documented bug bounty or vulnerability reward program describing (i) in-scope vulnerabilities (e.g., prompt bypasses, data leaks), (ii) out-of-scope items, (iii) submission process, and (iv) reward tiers or recognition if applicable. We will award a point if the developer discloses it has no AI bug bounty that encourages external researchers to report security, privacy, or adversarial vulnerabilities in the model.
We run a bug bounty program with HackerOne. We award up to $5,000 for critical vulnerabilities, such as discovering a major exploit that circumvents our content filters or reveals private data. [link to bug bounty]
78. Responsible disclosure policy (Score: 1)

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

Yes, we have an incident response process for incidents at Google, including pertaining to AI and a dedicated team for insider threat and abuse. We updated our vulnerability rewards program to specifically clarify and encourage reporting of issues in our AI products. We released a 2024 year in review of our Rewards program that confirmed the ongoing value of engaging with the security research community to make Google and its models and products safer.
Not disclosed
Bug bounty includes a responsible disclosure policy: "Please read our stance on coordinated disclosure. In essence, our pledge to you is to respond promptly and fix bugs in a sensible timeframe, and in exchange we ask for reasonable advance notice. Reports that go against this principle will usually not qualify, but we will evaluate them on a case-by-case basis."

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.
We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.
79. Safe harbor (Score: 0)

Does the developer disclose its policy for legal action against external evaluators conducting good-faith research?

We publish an Acceptable Use Policy for our Google Researcher Program.
Not disclosed
Google discloses AUP for approved research under its Google Researcher Program, not a generalized safe harbor.

We will award this point if the developer discloses whether it has a policy committing it to not pursue legal action against external evaluators conducting good-faith research. This should not be only for software security vulnerabilities, but also AI flaws, and it should be based on researcher conduct standards, not at the sole discretion of the company. We will award this point if the developer provides a clear description of its policy regarding such protections for external researchers, or lack thereof.
We do not have a policy for researcher protections for good-faith safety research. OR Our policy ensures no legal action against good‐faith researchers who follow our disclosure guidelines, see: [link]
80. Security incident reporting protocol (Score: 1)

Are major security incidents involving the model disclosed?

Users can report a security vulnerability at any time via https://bughunters.google.com/report/vrp. We get around 200 security vulnerability reports overall every week and feature reports online. The response time depends on the report priority and the current load. For low priority reports in a busy period, it may take us a couple of weeks to respond. Reports can be found at https://bughunters.google.com/report/reports.
Not disclosed
Disclosure clarifies how to report vulnerabilities, time to respond, and how reports are featured.

A security incident reporting protocol provides post-deployment transparency about serious incidents or breaches. Security incidents refer to incidents where external security threats affect the model (e.g., data breaches or DDoS attacks on the service). We will award this point if the developer states (i) how to submit a security incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but the developer must disclose a policy determining how incidents are reported and disclosed.
We publish a public ‘Security Incident Report’ on our website for any confirmed security incident affecting the model within 7 days of a patch being implemented. Users and researchers can report incidents via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
81. Misuse incident reporting protocol (Score: 0)

Are misuse incidents involving the model disclosed?

People can report suspected abuse on Google Cloud Platform, where Gemini models are available via API. https://support.google.com/code/contact/cloud_platform_report?hl=en. Users can also report misuse via our form at "Report Content On Google."
Not disclosed
Response does not satisfy (ii) and (iii) from the requirements of the indicator.

A misuse incident reporting protocol provides post-deployment transparency about incidents of misuse involving the model. As opposed to the previous indicator, this indicator is about actors misusing the model to cause real-world harm, such as misinformation operations or cybersecurity attacks. We will award this point if the developer states (i) how to submit a misuse incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but there needs to be a policy governing how incidents are reported.
We publish a public ‘Misuse Incident Report’ on our website for any confirmed misuse incident within 7 days of a patch being implemented. Users and researchers can report incidents regarding our flagship foundation model via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
82. Post-deployment coordination with government (Score: 0)

Does the developer coordinate evaluation with government bodies?

Once the model is trained, independent external groups, including domain experts and a government body, further test the model to identify blind spots.
Not disclosed
Government body is not identified. Needs information about the specific bodies and the types of evaluations.

We will award this point if the developer specifies which government bodies it is coordinating with and for what types of post-deployment evaluations. Government bodies include AI Safety Institutes, national security agencies, national labs, and international governmental entities such as UN agencies or the G7. Evaluation here may also include sharing of the developer's proprietary evaluation results for help with interpretation.
We do not coordinate with any government entities or AI Safety Institutes. OR We coordinate with the UK AISI for post-deployment evaluation of cyber, CB, and autonomy-related capabilities.
83. Feedback mechanisms (Score: 0)

Does the developer disclose a way to submit user feedback? If so, is a summary of major categories of feedback disclosed?

As part of our drive to promote user feedback on both quality and safety, developers using Gemini 2.5 Pro via Google AI Studio can offer feedback via a form (accessible within a user's AI Studio account). Users of the Gemini App with Gemini Advanced, which features the Gemini 2.5 Pro model, can offer thumbs up/down for model responses and give qualitative feedback where appropriate. Our teams closely monitor user feedback via these channels, as well as feedback delivered through other channels. We have mature incident management and crisis response capabilities to rapidly mitigate and remediate where needed, and feed this back into our risk identification efforts. Importantly, teams have rapid-remediation mechanisms in place to block content flagged as illegal.
Not disclosed
Aggregate or summarized feedback is not disclosed

We will award this point if the developer (i) discloses how users can submit feedback (e.g., via a form or a thumbs up/thumbs down for model responses) and (ii) discloses aggregated or categorized feedback data (e.g. a categorization of thumbs up and thumbs down data).
Users can submit feedback at this url: [URL] We find that users mainly report issues with API call response times, over-refusals from models, and outdated information in model outputs. A detailed categorization of user reports is available at [URL]
84. Permitted, restricted, and prohibited model behaviors (Score: 1)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

We outline restricted and prohibited model behaviors in our technical documentation, including our model card. Gemini safety policies align with Google's standard framework for the types of harmful content that we make best efforts to prevent our Generative AI models from generating, including: child sexual abuse and exploitation; hate speech (e.g., dehumanizing members of protected groups); dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm); harassment (e.g., encouraging violence against people); sexually explicit content; and medical advice that runs contrary to scientific or medical consensus. Additionally, we have published our Generative AI Prohibited Use policies.
Not disclosed
Permitted, restricted, and prohibited (PRP) model behaviors are disclosed in the response

We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from responding to NSFW content). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior in this way.
We allow responses from Model A that include broad Q&A, restrict sexual or harassing content, and prohibit facilitating illegal or violent acts. More details can be found in our guidelines for model behavior here: [link]
85. Model response characteristics (Score: 1)

Are desired model response characteristics disclosed?

Our approach to Gemini app (using Gemini 2.5 Pro) responses and model behavior is described at https://gemini.google/our-approach/. In addition, our tech report discloses Gemini 2.5 Pro desiderata (aka "helpfulness"): "Defining what not to do is only part of the safety story – it is equally important to define what we do want the model to do: 1. Help the user: fulfill the user request; only refuse if it is not possible to find a response that fulfills the user goals without violating policy. 2. Assume good intent: if a refusal is necessary, articulate it respectfully without making assumptions about user intent."
Not disclosed
Desired characteristics disclosed in the tech report

Model response characteristics include default behaviors or behaviors that the developer steers the model to take. These may include being helpful, taking an objective point of view, or using tools only when necessary. We will award points for a clear description of desired model response characteristics or a statement that there are no such characteristics.
We configure responses from Model A to be factual, neutral, and contextually helpful, avoiding personal or biased opinions. More details can be found in our guidelines for model behavior here: [link]
86. System prompt (Score: 0)

Is the default system prompt for at least one distribution channel disclosed?

We do not disclose this.
Not disclosed
No system prompt disclosed

A system prompt is defined as the prompt provided to the system by default that guides the system's behavior. We will award this point for the disclosure of the verbatim text of the full system prompt as well as an explanation for the context in which the system prompt is used.
We disclose our default prompt for Model A via our chat interface: ‘You are a helpful AI assistant providing clear, accurate, and policy‐compliant responses.’
87. Intermediate tokens (Score: 1)

Are intermediate tokens used to generate model outputs available to end users?

Gemini 2.5 Pro provides a "Show thinking" option that lets users view the model's reasoning after it generates a response to a prompt. More information can be found in the Thinking documentation.
Not disclosed
Intermediate tokens are made available to users via "Show thinking"

Intermediate tokens are defined as any tokens generated by the model before the final output is shown to the user, such as model chains of thought. We will also award this point if a summary of intermediate tokens is made available to end users. If intermediate tokens or summaries are not made available, the developer should provide a justification.
Model A is trained to generate intermediate chain-of-thought reasoning, but we withhold most chain-of-thought tokens from final user-facing responses to prevent model distillation. We do disclose chains-of-thought for a small set of research collaborators under NDA.
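Illustrative sketch (not part of the assessed disclosure): the "Show thinking" behavior described under this indicator is also surfaced to developers as thought summaries in the Gemini API. The minimal Python sketch below assumes the google-genai SDK's thinking_config / include_thoughts parameters and the part.thought flag as described in the public Thinking documentation; these names are assumptions and should be verified against the current docs.

from google import genai
from google.genai import types

# Assumes an API key is configured in the environment (e.g. GOOGLE_API_KEY).
client = genai.Client()

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Why is the sky blue?",
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(include_thoughts=True)
    ),
)

# Parts flagged as "thought" carry the intermediate-token summaries;
# the remaining parts carry the final answer shown to the user.
for part in response.candidates[0].content.parts:
    if getattr(part, "thought", False):
        print("Thought summary:", part.text)
    else:
        print("Answer:", part.text)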
88. Internal product and service mitigations (Score: 1)

For internal products or services using the model, are downstream mitigations against adversarial attacks disclosed?

We use the structure from the SAIF framework to mitigate known and novel AI security risks. The latter category includes risks such as data poisoning, model exfiltration, and rogue actions. We apply security controls, or repeatable mitigations, to these risks. For example, for prompt injections and jailbreaks, we apply robust filtering and processing of inputs and outputs. Additionally, comprehensive training, tuning, and evaluation processes have been implemented as model-level defenses against prompt injection attacks. We are also building system-level solutions that make Gemini secure by default. For data poisoning, we implement data sanitization, secure AI systems, enable access controls, and deploy mechanisms to ensure data and model integrity. We have published a full list of our controls for AI security risks. In addition, we continue to research new ways to help mitigate a model's susceptibility to security attacks.
Not disclosed
Filters and list of controls for AI security disclosed

An internal product or service is a product or service built by the developer. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks might include specialized prompt filtering, content scanning, or real-time monitoring of queries or accounts. We will award this point if the developer discloses a clear statement of methods used (e.g., a specialized prompt sanitizer or adversarial pattern detector), or if the developer states it does not implement such product-level mitigations against adversarial attacks.
In our chatbot, we implement a second-stage content filter that checks user inputs for disallowed topics and attempts to sanitize adversarial prompts. We also log suspicious prompts for manual review.
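To make the kind of product-level mitigation described in the example above concrete, here is a purely hypothetical second-stage input filter; the pattern list, function name, and review flow are illustrative assumptions, not a description of Google's actual filtering.

import re

# Hypothetical second-stage input filter: screen user prompts for simple
# prompt-injection markers before they reach the model, and flag suspicious
# inputs for manual review. Patterns here are illustrative only.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"reveal (your )?system prompt",
    r"disregard (your )?safety (rules|guidelines)",
]

def screen_prompt(prompt: str) -> tuple[bool, list[str]]:
    """Return (allowed, matched_patterns) for a user prompt."""
    matches = [p for p in INJECTION_PATTERNS if re.search(p, prompt, re.IGNORECASE)]
    return (len(matches) == 0, matches)

allowed, hits = screen_prompt("Please ignore all previous instructions and ...")
if not allowed:
    # A real deployment would route this to logging and manual review.
    print("Prompt flagged for review:", hits)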
89. External developer mitigations (Score: 1)

Does the developer provide built-in or recommended mitigations against adversarial attacks for downstream developers?

We launched the SAIF Risk Self Assessment, a questionnaire-based tool that generates a checklist to guide AI practitioners responsible for securing AI systems. The tool will immediately provide a report highlighting specific risks such as data poisoning, prompt injection, and model source tampering, tailored to the submitter's AI systems, as well as suggested mitigations, based on the responses they provided.
Not disclosed
Suggested mitigations are disclosed via the SAIF Risk Self Assessment

Downstream developers are developers who access the model through a distribution channel. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks that developers might build in or recommend include content filtering endpoints and recommended prompt templates. We will award this point if the developer discloses (i) technical mitigations (e.g., a developer provided moderation API or classifier) it offers or implements, (ii) recommended best practices or libraries for downstream developers, or (iii) an explicit statement that it does not build or recommend any particular downstream mitigations in this way.
Our API includes an optional parameter that will automatically filter user prompts and model outputs for hateful or disallowed content. We also publish guidelines for building robust chat interfaces that resist common prompt injections.
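As a concrete illustration of the built-in mitigations this indicator looks for, the Gemini API lets downstream developers configure safety filters on a per-request basis. The sketch below uses the google-genai Python SDK; the category and threshold strings are assumptions drawn from the safety-settings documentation and should be checked against the current API reference.

from google import genai
from google.genai import types

client = genai.Client()  # assumes an API key in the environment

# Tighten content filtering for this request; enum constants for the
# categories and thresholds are also available in google.genai.types.
response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Summarize this user-submitted text ...",
    config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category="HARM_CATEGORY_HARASSMENT",
                threshold="BLOCK_MEDIUM_AND_ABOVE",
            ),
            types.SafetySetting(
                category="HARM_CATEGORY_HATE_SPEECH",
                threshold="BLOCK_MEDIUM_AND_ABOVE",
            ),
        ]
    ),
)
print(response.text)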
90. Enterprise mitigations (Score: 1)

Does the developer disclose additional or specialized mitigations for enterprise users?

Google Cloud discloses certifications and compliance standards, including: For customers who are subject to the requirements of the Health Insurance Portability and Accountability Act (known as HIPAA, as amended, including by the Health Information Technology for Economic and Clinical Health — HITECH — Act), Google Cloud discloses support for HIPAA compliance. We also disclose details on Google’s overall commitments to data protection laws such as GDPR. In continuation of Google Cloud's commitment to responsible AI, Cloud is ISO 42001 certified. The full list of Cloud AI products in scope can be found in our compliance card. See more here on our long-standing commitment to compliance.
Not disclosed
Compliance with HIPAA and GDPR suffices

Enterprise users are, for example, large organizations with dedicated service agreements or users of enterprise-specific API deployments or products and services. Additional or specialized mitigations may address enterprise needs such as data privacy controls, advanced prompt/response monitoring, or compliance checks with regulations such as GDPR or HIPAA. Additional or specialized mitigations may include single-tenant deployments, custom filters for specific regulated industries, or advanced logging for compliance. We will award a point if the developer at least describes these mitigations or states that it does not provide such additional or specialized enterprise mitigations.
Our enterprise offering for Model A includes a dedicated environment with stricter filtering, a HIPAA-compliant data retention policy, and the ability for enterprise admins to define custom blacklisted topics that the model must refuse.
91. Detection of machine-generated content (Score: 1)

Are mechanisms that are used for detecting content generated by this model disclosed?

The open-sourcing of our SynthID text watermarking tool – developed in-house and used by the Gemini app and web experience – contributes to the responsible use of AI. It makes it easier for any developer to apply watermarking for their generative AI models, so they can detect what text outputs have come from their own LLMs. The open source code is available on Hugging Face, and we’ve added it to our Responsible Generative AI Toolkit for developers. Additionally, the SynthID Detector can help determine if an image created using Google AI contains a SynthID watermark. The portal provides detection capabilities across different modalities in one place and provides essential transparency in the rapidly evolving landscape of generative media. It can also highlight which parts of the content are more likely to have been watermarked with SynthID. SynthID Detector is currently available to early testers before being made available more broadly.
Not disclosed
SynthID suffices

A mechanism for detecting machine-generated content might include storing a copy of all outputs generated by the model to compare against, implementing a watermark on model outputs, adding cryptographic metadata (such as C2PA), or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it does not have or use any such mechanism.
We train a classifier using model generations and human-written text to identify machine-generated content from Model A and our other models.
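For the classifier-based mechanism mentioned in the indicator definition (training a detector post-hoc), here is a minimal, purely illustrative sketch. It is not how SynthID watermark detection works; the placeholder corpora and the scikit-learn pipeline are assumptions for illustration only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpora; a real detector would be trained on large, matched
# samples of human-written text and model generations.
human_texts = ["an example of human-written text", "another human document"]
model_texts = ["an example of model-generated text", "another model output"]

X = human_texts + model_texts
y = [0] * len(human_texts) + [1] * len(model_texts)  # 1 = machine-generated

# Simple post-hoc detector: bag-of-ngrams features plus logistic regression.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(X, y)

# Probability that a new text is machine-generated, per this toy detector.
print(detector.predict_proba(["some new text to score"])[:, 1])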
92. Documentation for responsible use (Score: 1)

Does the developer provide documentation for responsible use by downstream developers?

In addition to our own mitigation efforts, we promote and support the responsible use of AI models by providing mitigations for use by downstream developers. We released ShieldGemma, a series of state-of-the-art safety classifiers that developers can apply to detect and mitigate harmful content in AI model inputs and outputs. Specifically, ShieldGemma is designed to target hate speech, harassment, sexually explicit content, and dangerous content. We offer an existing suite of safety classifiers in our Responsible Generative AI Toolkit, which includes a methodology to build classifiers tailored to a specific policy with a limited number of datapoints, as well as existing Google Cloud off-the-shelf classifiers served via API. We share AI interpretability tools to help researchers improve AI safety. Our research teams are continually exploring new ways to better understand how models behave. For example, we recently announced Gemma Scope, a new set of tools enabling researchers to “peer inside” the workings of our Gemma 2 model to see how it parses and completes tasks. We believe that this kind of interpretability could open up new opportunities to identify and mitigate safety risks at the model behavior level. We launched the SAIF Risk Self Assessment, a questionnaire-based tool that generates a checklist to guide AI practitioners responsible for securing AI systems. The tool will immediately provide a report highlighting specific risks such as data poisoning, prompt injection, and model source tampering, tailored to the submitter's AI systems, as well as suggested mitigations, based on the responses they provided.
Not disclosed
Responsible GenAI toolkit suffices

To receive a point, the developer should provide documentation for responsible use. This might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.
Our Developer Documentation Hub consolidates integration guides, responsible‐use guidelines, and best practices: [link]
93. Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model on the top-5 distribution channels disclosed?

As per our Gemini API Additional Terms of Service, users must be 18 years of age or older to use the APIs. Users also will not use the Services as part of a website, application, or other service (collectively, "API Clients") that is directed towards or is likely to be accessed by individuals under the age of 18. Terms are subject to change and the latest version will be made available at the links provided. Our Google Account Terms of Service outline the age requirements for our broader suite of Google products, including the Gemini app (built with Gemini 2.5) as follows: if you’re under the age required to manage your own Google Account, you must have your parent or legal guardian’s permission to use a Google Account. If you’re a parent or legal guardian, and you allow your child to use the services, then these terms apply to you and you’re responsible for your child’s activity on the services. Some Google services have additional age requirements as described in their service-specific additional terms and policies.
Not disclosed
Age requirements disclosed

We will award this point for a description of the company's policies for permitted and prohibited users on its top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
We allow usage by individuals 13 years of age or older who accept our Terms of Service. We prohibit use by export controlled entities or persons on denied-parties lists or in countries under U.S. embargo. We also reserve the right to restrict use if users engage in targeted harassment. For example, we only permit users over 13 with valid credentials, and prohibit usage from OFAC-sanctioned regions. We do not allow state-sponsored disinformation agencies to access our services.
94. Permitted, restricted, and prohibited uses (Score: 1)

Which uses are explicitly allowed, conditionally permitted, or strictly disallowed under the acceptable use policy for the top-5 distribution channels?

Our Prohibited Use Policy outlines the restrictions that apply to user interactions with generative AI in the Google products and services that refer to the policy, including our Google API Services User Data Policy.
Not disclosed
PUP describes usage restrictions

We will award this point for a rough characterization of two or more of permitted, restricted, and prohibited uses across the top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
Permitted uses include general conversational queries, brainstorming, and coding assistance. Restricted uses include adult or violent content that requires caution or additional review. Prohibited uses include facilitating illicit activity, disinformation campaigns, or harassment. For example, we permit typical user requests like Q&A, text generation, and educational uses. We restrict content that depicts graphic violence or sexual content by applying additional filters. We prohibit any use aiming to conduct unlawful surveillance, promote extremist violence, or defraud others.
95. AUP enforcement process (Score: 1)

What are the methods used by the developer to enforce the acceptable policy?

Google uses the following process to detect potential abuse and violations of its Acceptable Use Policy and Prohibited Use Policy for Generative AI services, as outlined in Section 4.3 "Generative AI Safety and Abuse" of the Google Cloud Platform Terms of Service. The abuse monitoring process includes the following steps. (1) Automated detection: Google uses automated safety classifiers to detect potential abuse and violations; for technical details about how safety classifiers work, see Configure safety filters. (2) Prompt logging: if automated safety classifiers detect suspicious activity that requires further investigation, Google might log your prompts to determine if you have violated the AUP or Prohibited Use Policy. This data is not used to train or fine-tune any AI/ML models; it is stored securely for up to 30 days in the region that you select and adheres to Google Cloud assurances such as Data Residency, Access Transparency, and VPC Service Controls. (3) Action: authorized Google employees might assess the flagged prompts and contact you for clarification. If you fail to address the behavior, or in cases of recurring or severe abuse, your access to Vertex AI or Google Cloud services might be suspended or terminated. See https://cloud.google.com/vertex-ai/generative-ai/docs/learn/abuse-monitoring. In addition, AI is helping us scale abuse detection across our platforms: AI-powered classifiers help quickly flag potentially harmful content for removal or escalation to a human reviewer. Google Accounts are usually disabled if the account's owner hasn't followed our policies, which include the Google Terms of Service. Not all Google services disable an account for these reasons; some Google services have their own codes of conduct or terms of service, including the Service Specific Terms, YouTube Community Guidelines, and Google Workspace Acceptable Use Policy. If someone doesn't follow these policies, we might stop that person from either using just that service or signing in to any Google services.
Not disclosed
Google describes multiple steps in its AUP enforcement process

We will award this point if the developer discloses the processes (automated or manual) it uses to detect, review, and respond to potential acceptable use policy violations. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not use such methods to enforce its acceptable use policy.
We combine automated checks with human review for severe or repeated violations, issuing warnings or suspensions after repeat violations.
96. AUP enforcement frequency (Score: 0)

Are statistics on the developer's AUP enforcement disclosed?

We publish the number of actions taken on violative content, broken down by detection method, in annual transparency reports for services categorised as Online Platforms, Hosting Service Providers or Intermediary Service Providers under the EU DSA. These reports can be found on our Transparency Report website. A comprehensive report on government requests for takedowns of policy violations, including which countries made the request and when, can be downloaded from this page. Google's transparency report provides a breakdown by product; this includes "Bard," which was the previous name for Gemini.
Not disclosed
Annual transparency reports do not include enforcement details on Gemini; government takedown requests do not satisfy this indicator

We will award this point if the developer discloses enforcement statistics (e.g., violation counts or actions taken) from its enforcement of its acceptable use policy. Alternatively, we will award this point if the developer reports that it does not enforce its acceptable use policy.
We publish a quarterly enforcement report detailing violation counts by prohibited use category and the corresponding actions taken at [LINK]
97. Regional policy variations (Score: 1)

Are differences in the developer's acceptable use or model behavior policy across geographic regions disclosed?

The AUP is not different by region, but Cloud does have a sovereignty offering that provides customers the ability to store and process their data in certain regions.
Not disclosed
AUP does not differ

We will award this point if the developer discloses distinctions in its AUP or MBP and provides examples of differences in multiple specific regions, or states that no differences exist. For example, some jurisdictions impose content restrictions beyond those in the developer's global policy that may necessitate local deviations.
In the EU, our model automatically omits certain categories of political content to comply with local election laws. In all other regions, we follow the general global AUP at [URL].
98. Oversight mechanism (Score: 1)

Does the developer have an internal or external body that reviews core issues regarding the model prior to deployment?

Google DeepMind Responsibility and Safety Council (RSC), Google DeepMind’s internal governance body, reviewed the initial ethics and safety assessments on novel model capabilities in order to provide feedback and guidance during model development. The RSC also reviewed data on the model’s performance via assurance evaluations to ensure it was meeting safety and responsibility expectations. Additionally, in relation to the Frontier Safety Framework, when alert thresholds are reached for Misuse Critical Capability Levels, a response plan will be reviewed and approved by appropriate internal corporate governance bodies such as the RSC. The Google DeepMind AGI Safety Council will also periodically review the implementation of the Framework. Details can be found in the Frontier Safety Framework.
Not disclosed
RSC constitutes oversight mechanism

We will award this point if the developer discloses that it has such an internal or external body and provides some description of its scope, or alternatively if the developer discloses that it has no such body. An oversight mechanism covers governance structure beyond mere external risk evaluation, asking whether a formal body regularly reviews design and deployment decisions. Core issues may include model objectives, data usage, or risk mitigation.
We convene a monthly advisory board of ethicists, civil society representatives, and academics to review training processes and identify new risks. The board's recommendations regarding deployment are not binding.
99. Whistleblower protection (Score: 1)

Does the developer disclose a whistleblower protection policy?

Yes, we have a whistleblower protection policy as outlined in our Google Code of Conduct: "Google prohibits retaliation against any worker here at Google who reports or participates in an investigation of a possible violation of our Code, policies, or the law. If you believe you are being retaliated against, please contact Ethics & Business Integrity."
Not disclosed
Google CoC whistleblower protection policy suffices

We will award this point if the developer discloses (i) the existence of a whistleblower protection policy, (ii) what protections are afforded to whistleblowers, (iii) how reports are handled and investigated, and (iv) any external oversight of the whistleblower protection process. This might include protections for whistleblowers who report safety, ethical, or legal concerns related to the model. We will also award this point if the developer discloses that it has no such policy.
We maintain a whistleblower protection policy that prohibits retaliation against employees who report safety or ethical concerns about our models. Reports can be submitted anonymously through our ethics hotline, are reviewed by an independent board committee, and whistleblowers are entitled to legal representation provided by the company. Our policy is audited annually by an independent ethics consultancy.
100. Government commitments (Score: 1)

What commitments has the developer made to government bodies?

We continue our work in support of a number of new or ongoing commitments on AI, including the Bletchley Summit and Seoul AI Summit commitments, and a series of voluntary commitments made by companies to promote the safe, secure, and transparent development and use of generative AI technology, such as the U.S. White House Voluntary Commitments, the G7 Hiroshima AI Process (HAIP) Code of Conduct and the related OECD reporting framework, and the Frontier AI Safety Commitments made at the AI Summit in Seoul in 2024. Based on our work in this space, we also proposed seven principles for bold and responsible AI regulation.
Additional source for the White House AI commitments: https://safety.google/cybersecurity-advancements/saif/
A list of commitments is provided

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.
We have committed to the White House Voluntary Commitments and the Seoul Commitments.