Anthropic Transparency Report

1. Data acquisition methods (Score: 1)

What methods does the developer use to acquire data used to build the model?

The most relevant document is the Claude 4 System Card. The relevant passage on data acquisition states: "Claude Opus 4 and Claude Sonnet 4 were trained on a proprietary mix of publicly available information on the Internet as of March 2025, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data we generated internally at Anthropic. ... To obtain data from public web pages, we operate a general-purpose web crawler. This crawler follows industry-standard practices with respect to “robots.txt” instructions included by website operators indicating whether they permit crawling of their site’s content. In addition, we do not access password-protected pages or those that require sign-in or CAPTCHA verification, and we conduct diligence on the training data that we use. The crawler operates transparently—website operators can easily identify when it has crawled their web pages and signal their preferences to us. ...".
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
Anthropic acquires data through web crawling, its own products, licensing, human annotation, and internal synthetic data generation.

Which of the following data acquisition methods does the developer use: (i) acquiring existing public datasets, (ii) crawling the web, (iii) using data acquired via its existing products and services, (iv) licensing existing data from external parties, (v) having humans create or annotate new data, (vi) using models to generate new data, or (vii) other data acquisition methods not captured by the above? For example, if the developer uses reinforcement learning from human feedback to train models using model-generated outputs with human preference annotations, this would satisfy categories (v) and (vi). Alternatively, if the developer post-trains its model using off-the-shelf preference data (for example, the Alpaca dataset), this would satisfy category (i).
To build our model, we acquire data by crawling the Internet for publicly available data, licensing data from third parties, and using models to synthetically generate new data. Humans do not create new data, nor do we use data from our other products/services to train our model.
2. Public datasets (Score: 0)

What are the top-5 sources (by volume) of publicly available datasets acquired for building the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We define a source as the entity or means by which the developer acquires data. We define the top-5 sources as the top-5 sources by data volume.
We acquire publicly available data from only two sources: The Pile and CommonCrawl.
3. Crawling (Score: 1)

If data collection involves web-crawling, what is the crawler name and opt-out protocol?

The most relevant document is the Claude 4 System Card. The relevant passage on data acquisition states: "Claude Opus 4 and Claude Sonnet 4 were trained on a proprietary mix of publicly available information on the Internet as of March 2025, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data we generated internally at Anthropic. ... To obtain data from public web pages, we operate a general-purpose web crawler. This crawler follows industry-standard practices with respect to “robots.txt” instructions included by website operators indicating whether they permit crawling of their site’s content. In addition, we do not access password-protected pages or those that require sign-in or CAPTCHA verification, and we conduct diligence on the training data that we use. The crawler operates transparently—website operators can easily identify when it has crawled their web pages and signal their preferences to us. ...". In addition, an Anthropic Help Center article acknowledges the company's web crawling practices (URL: https://support.anthropic.com/en/articles/8896518-does-anthropic-crawl-data-from-the-web-and-how-can-site-owners-block-the-crawler), and the Claude 4 System Card PDF mentions the general-purpose web crawler and robots.txt compliance (URL: https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf). A contact email is provided for crawler issues (claudebot@anthropic.com), general robots.txt opt-out instructions are provided, and the developer acknowledges using 'three robots' for different purposes.
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
User-agent is ClaudeBot and details on how they respect robots.txt are provided.

We award this point for disclosure of the crawler name and opt-out protocols, including if/how they respect the Robots Exclusion Protocol (robots.txt).
Our web crawler is named A and information on the opt-out protocol can be found at this URL: ... The CommonCrawl web crawler is named CCBot and information on the opt-out protocol can be found at this URL: https://commoncrawl.org/faq#:~:text=How%20can%20I%20block%20the,%2Dagent%20string%20is%3A%20CCBot.
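As an illustration of the Robots Exclusion Protocol referenced above, a minimal robots.txt sketch is shown below; the directives are standard, but the exact user-agent tokens (ClaudeBot for Anthropic's crawler, CCBot for Common Crawl) should be confirmed against the crawler documentation linked in this section.

```
# robots.txt placed at the site root -- illustrative sketch only
User-agent: ClaudeBot
Disallow: /            # opt the entire site out of crawling by ClaudeBot

User-agent: CCBot
Disallow: /private/    # opt only /private/ out of crawling by CCBot
```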
4. Usage data used in training (Score: 0)

What are the top-5 sources (by volume) of usage data from the developer's products and services that are used for building the model?

The most relevant document is the Claude 4 System Card. The relevant passage on data acquisition states: "Claude Opus 4 and Claude Sonnet 4 were trained on a proprietary mix of publicly available information on the Internet as of March 2025, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data we generated internally at Anthropic. ... To obtain data from public web pages, we operate a general-purpose web crawler. This crawler follows industry-standard practices with respect to “robots.txt” instructions included by website operators indicating whether they permit crawling of their site’s content. In addition, we do not access password-protected pages or those that require sign-in or CAPTCHA verification, and we conduct diligence on the training data that we use. The crawler operates transparently—website operators can easily identify when it has crawled their web pages and signal their preferences to us. ...". The privacy documentation does not clarify specific sources of data relevant to Claude 4.
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf; https://privacy.anthropic.com/en/articles/10023555-how-do-you-use-personal-data-in-model-training
While usage data from opt-in users of Claude is mentioned, it is unclear what products and services this specifically refers to or whether this is just the consumer-facing Claude chatbot.

We define usage data as data collected from the use of a developer's products or services.
We use usage data from only two sources: our deployed chatbot X and our online social media platform Y.
5. Notice of usage data used in training (Score: 1)

For the top-5 sources of usage data, how are users of these products and services made aware that this data is used for building the model?

Opt-out notices are provided in-app, privacy dashboards are persistently available, and there is privacy FAQ documentation in our help center.
https://www.anthropic.com/legal/privacy; https://privacy.anthropic.com/en/articles/10023555-how-do-you-use-personal-data-in-model-training
The specific mechanisms for user opt-out are clearly stated and, based on Anthropic's disclosure, are understood to cover all sources of usage data.

We define usage data notice as the proactive disclosure to users of how their data is used for model development. For example, via a pop-up with a description, a link to the privacy policy, or link to a description of company practices.
We notify users of our chatbot X that chatbot interactions are used to train our AI via a pop-up as shown at this URL: ... We notify users of our platform Y about whether their data is used to train our AI via a link to our privacy policy when they sign up for an account.
6. Licensed data sources (Score: 0)

What are the top-5 sources (by volume) of licensed data acquired for building the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property. (Furthermore, disclosing licensed data sources does not necessarily make the development more responsible)
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We define a source as the entity from which the developer acquires data. For example, the Associated Press is reportedly a source of licensed data for OpenAI.
We license data from only three sources: A, B, and C.
7. Licensed data compensation (Score: 0)

For each of the top-5 sources of licensed data, are details related to compensation disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property. (Furthermore, disclosing licensed data sources does not necessarily make the development more responsible)
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property. In particular, we do not award this indicator because the disclosure does not claim the prohibition is due to specific contractual mandates.

We award this point if the model developer describes the compensation structure specified in the contract with the data source or indicates they are prohibited from sharing this information if contractually mandated.
We compensate A by ... We cannot disclose information on compensation for our relationships with B and C due to contractual terms that prohibit public disclosure.
8. New human-generated data sources (Score: 0)

What are the top-5 sources (by volume) of new human-generated data for building the model?

"We acquire human-generated data from our internal data generation teams as well as a variety of third party data generation vendors. Further details are proprietary and not disclosed publicly to protect competitive advantages and intellectual property."
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We define a source as the entity or means by which the developer acquires data. For example, Scale AI could be a source of new human-generated data. By new, we mean the data is specifically acquired for the purposes of building the model.
We acquire new human-generated data from only two sources: our internal data annotation team and an external vendor, A.
9. Instructions for data generation (Score: 0)

For each of the top-5 sources of human-generated data, what instructions does the developer provide for data generation?

The most relevant document is the Claude 4 System Card and the Anthropic Inbound Services Agreement. The relevant passage on data acquisition states: "Claude Opus 4 and Claude Sonnet 4 were trained on a proprietary mix of publicly available information on the Internet as of March 2025, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data we generated internally at Anthropic. ... " The relevant passage on crowd workers states: "Anthropic partners with data work platforms to engage workers who help improve our models through preference selection, safety evaluation, and adversarial testing. Anthropic will only work with platforms that are aligned with our belief in providing fair and ethical compensation to workers, and committed to engaging in safe workplace practices regardless of location, following our crowd worker wellness standards detailed in our Inbound Services Agreement." The associated text in the Inbound Services Agreement states: "if Contractor is providing crowd work or crowdsourcing services under this Agreement, Contractor will comply with the minimum expectations and consider implementing best practices described in Anthropic's wellbeing standards for crowd work vendors (a copy of which can be found here or provided by Anthropic at Contractor's request); "
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf; https://www.anthropic.com/legal/inbound-services-agreement
No instructions are provided. While instructions are detailed in past work published by Anthropic on Reinforcement Learning from Human Feedback, we do not award this indicator as it is unclear whether these instructions are still used for each of the top-5 sources of human-generated data.

The instructions should be those provided to the data source. For example, if a third-party vendor works directly with the data laborers to produce the data, the instructions from the developer to this vendor should be disclosed.
We instruct our internal data annotation team as follows: ... We instruct vendor A as follows: ...
10. Data laborer practices (Score: 0)

For the top-5 sources of human-generated data, how are laborers compensated, where are they located, and what labor protections are in place?

The most relevant document is the Claude 4 System Card and the Anthropic Inbound Services Agreement. The relevant passage on data acquisition states: "Claude Opus 4 and Claude Sonnet 4 were trained on a proprietary mix of publicly available information on the Internet as of March 2025, as well as non-public data from third parties, data provided by data-labeling services and paid contractors, data from Claude users who have opted in to have their data used for training, and data we generated internally at Anthropic. ... " The relevant passage on crowd workers states: "Anthropic partners with data work platforms to engage workers who help improve our models through preference selection, safety evaluation, and adversarial testing. Anthropic will only work with platforms that are aligned with our belief in providing fair and ethical compensation to workers, and committed to engaging in safe workplace practices regardless of location, following our crowd worker wellness standards detailed in our Inbound Services Agreement." The associated text in the Inbound Services Agreement states: "if Contractor is providing crowd work or crowdsourcing services under this Agreement, Contractor will comply with the minimum expectations and consider implementing best practices described in Anthropic's wellbeing standards for crowd work vendors (a copy of which can be found here or provided by Anthropic at Contractor's request); "
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf; https://www.anthropic.com/legal/inbound-services-agreement; https://docs.google.com/document/d/15ExSjSjY8y2Qqetcq54iTtfN8ZJaD-qzSDbWj6GSwxw/edit?tab=t.0
No information is provided on labor compensation, though significant information is provided about labor protections, such as specific wellbeing standards.

For each data source, we require (i) the compensation in either USD or the local currency, (ii) any countries where at least 25% of the laborers are located, and (iii) a description of any labor protections. We will award this point if the developer discloses that it is not aware of data laborer practices.
Our internal data annotation team is located in the US, is compensated at 20 USD per hour, and deals with data that does not require specific protections. Our sole external data vendor contracts laborers in Kenya, compensates them at KES 15000 per month, and implements protections for dealing with toxic or unsafe content such as A and B.
11. Synthetic data sources (Score: 0)

What are the top-5 sources (by volume) of synthetic data acquired for building the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We synthetically generate data using only our previous model X and an early checkpoint of our current flagship model Y.
12. Synthetic data purpose (Score: 0)

For the top-5 sources of synthetically generated data, what is the primary purpose for data generation?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We use model X to generate instruction-tuning data and we use model Y to generate candidate responses that humans select between to provide human preference data for reinforcement learning with human feedback.
13. Data processing methods (Score: 0)

What are the methods the developer uses to process acquired data to determine the data directly used in building the model?

The most relevant document is the Claude 4 System Card. The relevant passage on data processing states: "We employed several data cleaning and filtering methods during the training process, including deduplication and classification. ..."
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
While deduplication and classification are mentioned, it is unclear whether the described information is comprehensive due to the use of the word "including", and some steps are overly broad (e.g. cleaning/filtering, which can subsume many distinguishable types of data processing).

We will award this point for disclosure of all of the methods used to process acquired data. Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
We process data in the following six-step pipeline: (i) removal of HTML artifacts, (ii) deduplication, (iii) language identification to retain English data, (iv) removal of CSAM imagery, (v) removal of train-test overlap, and (vi) tokenization.
14. Data processing purpose (Score: 0)

For each data processing method, what is its primary purpose?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of primary purposes for a data processing method could include: (i) removes low quality data, (ii) removes potentially personal/copyrighted data, (iii) removes product-irrelevant data, (iv) removes toxic data, (v) improves evaluation integrity, or (vi) prepares the data for training the model.
15. Data processing techniques (Score: 0)

For each data processing method, how does the developer implement the method?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of how a data processing method is implemented could include: the method (i) is implemented using an in-house regular expression, (ii) is implemented using an in-house tool based on n-gram overlap, (iii) is implemented using a FastText classifier trained on Wikipedia data, (iv) is implemented using hash collisions with the NCMEC database, (v) is implemented by searching for known benchmark canary strings, and (vi) is implemented using tiktoken (https://github.com/openai/tiktoken).
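The techniques listed above can be made concrete with a short sketch. The following Python fragment is a hypothetical illustration of two of them (a regular-expression cleanup step and hash-based exact deduplication); it is not a description of any developer's actual pipeline.

```python
# Hypothetical sketch: regex-based HTML cleanup plus exact deduplication
# via content hashing. Illustrative only; not any developer's real pipeline.
import hashlib
import re

TAG_RE = re.compile(r"<[^>]+>")  # crude matcher for HTML tags

def strip_html(text: str) -> str:
    """Drop HTML tags and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", text)).strip()

def deduplicate(docs):
    """Keep only the first occurrence of each exact-duplicate document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

raw = ["<p>Hello world</p>", "<p>Hello  world</p>", "Another document"]
print(deduplicate([strip_html(d) for d in raw]))
# -> ['Hello world', 'Another document']
```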
16. Data size (Score: 0)

Is the size of the data used in building the model disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

To receive this point, the developer should report data size in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). The size should reflect data directly used in building the model (i.e. training data) and not data that was acquired but unused, or data used to evaluate the model.
We used 3 x 10^12 tokens of text, 1 x 10^6 images, and 5 x 10^5 hours of audio for training.
17. Data language composition (Score: 0)

For all text data used in building the model, what is the composition of languages?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property. However, we do share results on language performance such as the standardized benchmark MMMLU.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

To receive this point, the developer should report (i) all languages which make up at least 1% of the data and their corresponding proportions and (ii) a brief description of how languages are labeled (if a publicly available tool is used, include a link to the tool). Proportions should be reported to a precision of two significant figures and should describe proportions of documents labeled with some language. An "Unknown" category may be included to denote documents where the language could not be identified.
English 80%, Spanish 5.0%, French 3.0%, Chinese 2.0%, Unknown 10%. We used a FastText-based classifier trained on Wikipedia data to identify languages.
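As a concrete sketch of the labeling step described above, the snippet below uses the fastText library with the publicly released lid.176.bin language-identification model (trained largely on Wikipedia text); this is an assumed stand-in, since the actual classifier a developer uses is typically not disclosed.

```python
# Sketch of document-level language labeling with fastText's public
# lid.176.bin model (https://fasttext.cc/docs/en/language-identification.html).
# Assumed stand-in for whatever classifier a developer actually uses.
from collections import Counter

import fasttext

model = fasttext.load_model("lid.176.bin")

def label_language(doc: str, threshold: float = 0.5) -> str:
    labels, probs = model.predict(doc.replace("\n", " "), k=1)
    # Bin low-confidence predictions into an "unknown" category.
    return labels[0].replace("__label__", "") if probs[0] >= threshold else "unknown"

def language_composition(docs):
    counts = Counter(label_language(d) for d in docs)
    total = sum(counts.values())
    return {lang: round(100.0 * n / total, 1) for lang, n in counts.items()}
```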
18. Data domain composition (Score: 0)

For all the data used in building the model, what is the composition of domains covered in the data?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

To receive this point, the developer should report the composition of the main domains included in the data used to train the model. This data should be at a level of granularity lower than broad claims about training on "internet data". For example, this could include the proportion of data from e-commerce, social media, news, code, etc. based on the URLs from which the data is sourced. Proportions should be reported to a precision of one significant figure.
Social media 40%, code repositories 30%, news articles 20%, e-commerce product listings 5%, scientific papers 5%.
19. External data access (Score: 0)

Does a third-party have direct access to the data used to build the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

By a third-party, we mean entities that are financially independent of the developer. We will award this point if at least one such entity is named as having direct access to the data. With that said, we may award this point if the developer provides justifications for prohibiting access to narrowly-scoped parts of the data.
Third-parties that have direct access to the data include organizations A and B.
20. Data replicability (Score: 0)

Is the data used to build the model described in enough detail to be externally replicable?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

We will award this point if the description contains (i) a list of all publicly available training data and where to obtain it and (ii) a list of all training data obtainable from third parties and where to obtain it. These conditions refer to criteria 2 and 3 under the OSI Open Source AI v1.0 definition.
The listing of publicly available training data can be found at this URL ... and the listing of all training data obtainable from third parties can be found at this URL ...
21. Compute usage for final training run (Score: 0)

Is the amount of compute used in the model's final training run disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPs). This number should represent the compute used to train the final model across all model stages.
Our model was trained using 5 x 10^25 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
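For reference, a common first-order estimate for dense transformer training compute (which may differ from the Frontier Model Forum methodology linked above) multiplies parameter count by training tokens:

```latex
\text{training FLOPs} \approx 6 \, N_{\text{params}} \, D_{\text{tokens}},
\qquad \text{e.g. } 6 \times (7 \times 10^{10}) \times (1 \times 10^{14}) \approx 4 \times 10^{25} \text{ FLOPs},
```

where the parameter and token counts here are arbitrary illustrative values, not disclosed figures.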
22. Compute usage including R&D (Score: 0)

Is the amount of compute used to build the model, including experiments, disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 7 x 10^26 FLOPs). Compared to the previous indicator, this indicator should include an estimation of the total compute used across experiments used towards the final training run for the model (such as including hyperparameter optimization or other experiments), and not just the final training run itself.
Our cumulative compute usage involved in building the model was 7 x 10^26 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
23. Development duration for final training run (Score: 0)

Is the amount of time required to build the model disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

The amount of time should be specified in terms of both the continuous duration of time required and the number of hardware hours used. The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). The number of hardware hours should be reported to a precision of one significant figure and include the type of hardware hours. No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, or training and subsequent evaluation and red teaming).
Our model was trained over a period of 90 days using 4x10^4 NVIDIA H100 GPU-days.
24. Compute hardware for final training run (Score: 0)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.
Our model was trained using 1000 NVIDIA H100 GPUs.
25. Compute provider (Score: 0)

Is the compute provider disclosed?

"Anthropic, an AI safety and research company, has selected Google Cloud as its cloud provider. The partnership is designed so that the companies can co-develop AI computing systems; Anthropic will leverage Google Cloud's cutting-edge GPU and TPU clusters to train, scale, and deploy its AI systems." "Anthropic names AWS its primary training partner and will use AWS Trainium to train and deploy its largest foundation models; Amazon to invest additional $4 billion in Anthropic." Further detail on distribution of hardware owners is confidential.
https://www.anthropic.com/news/anthropic-partners-with-google-cloud and https://www.aboutamazon.com/news/aws/amazon-invests-additional-4-billion-anthropic-ai
While relationships with Google and Amazon are disclosed, the specific compute provider for Claude Opus 4 is unclear and the distribution is noted as confidential.

For example, the compute provider may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that compute is provided by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.
Compute is provided by Google Cloud Platform.
26. Energy usage for final training run (Score: 0)

Is the amount of energy expended in building the model disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh), along with a description of the measurement methodology, which may involve estimation. Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model was trained using an estimated 1 x 10^4 MWh of energy. To estimate training energy consumption, we divided training FLOPs (5 x 10^25) by a conversion factor based on NVIDIA A100 GPU information (3.74 × 10^21 FLOPs/MWh) given we train using FP16 with sparsity.
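Spelling out the arithmetic behind this illustrative answer (the conversion factor is the one given in the example, not a measured value):

```latex
E_{\text{train}} \approx \frac{5 \times 10^{25}\ \text{FLOPs}}{3.74 \times 10^{21}\ \text{FLOPs/MWh}}
\approx 1.3 \times 10^{4}\ \text{MWh} \;\Rightarrow\; 1 \times 10^{4}\ \text{MWh to one significant figure.}
```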
27. Carbon emissions for final training run (Score: 0)

Is the amount of carbon emitted in building the model disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2), along with a description of the measurement methodology, which may involve estimation. Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions are for a single model run or include additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it. Emissions should correspond with the energy used in the previous indicator.
Training our model yielded an estimated 4 x 10^3 tCO2. To estimate training carbon emissions, we multiplied training energy usage (1 x 10^4 MWh) by a 2023 estimate for the US data center carbon intensity (0.375 tCO2/MWh) given the data centers used in training operate in the US.
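The corresponding arithmetic for the illustrative emissions figure:

```latex
E_{\mathrm{CO_2}} \approx 1 \times 10^{4}\ \text{MWh} \times 0.375\ \text{tCO}_2/\text{MWh}
= 3.75 \times 10^{3}\ \text{tCO}_2 \;\Rightarrow\; 4 \times 10^{3}\ \text{tCO}_2 \text{ to one significant figure.}
```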
28. Water usage for final training run (Score: 0)

Is the amount of clean water used in building the model disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Clean water usage should be reported in appropriate units, which most often will be megaliters, along with a description of the measurement methodology, which may involve estimation. Clean water usage should be reported to a precision of one significant figure (e.g. 5000 ML). No form of decomposition into compute phases is required, but it should be clear whether the reported water usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate water usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Training our model used an estimated 20 ML of water. To estimate training water usage, we multiplied training energy usage (1 x 10^4 MWh) by a 2021 estimate for the US data center water efficiency (1.8 ML per 1,000 MWh) given the data centers used in training operate in the US.
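And the corresponding arithmetic for the illustrative water figure:

```latex
W_{\text{train}} \approx 1 \times 10^{4}\ \text{MWh} \times \frac{1.8\ \text{ML}}{1000\ \text{MWh}}
= 18\ \text{ML} \;\Rightarrow\; 20\ \text{ML to one significant figure.}
```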
29. Internal compute allocation (Score: 0)

How is compute allocated across the teams building and working to release the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

To receive a point, the developer should provide the compute allocated to each team involved in training the model. We understand there might be no clear allocation of compute across different teams; in that case, report an estimate of the compute used over the last year. Compute allocation should be reported to at least one significant figure.
- Safety — 15%
- Pre-training — 60%
- Post-training — 15%
- Infrastructure and reliability — 5%
30. Model stages (Score: 0)

Are all stages in the model development process disclosed?

Claude Opus 4 and Claude Sonnet 4 were trained with a focus on being helpful, honest, and harmless. They were pretrained on large, diverse datasets to acquire language capabilities. To elicit helpful, honest, and harmless responses, we used a variety of techniques including human feedback, Constitutional AI (based on principles such as the UN’s Universal Declaration of Human Rights), and the training of selected character traits.
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
While the provided information implicitly mentions model training stages (e.g. pretraining), the concrete set of stages is not clearly described by the developer. While past writings from Anthropic describe model training, the applicability of these descriptions to Claude 4 is not made clear.

Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.
We define five stages in building the model: (1) unsupervised pre-training, (2) supervised instruction tuning, (3) RLHF, (4) domain-specific fine-tuning, and (5) final safety alignment.
31. Model objectives (Score: 0)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

We train general purpose models and each stage has numerous objectives that we do not exhaustively list out. Instead, in our System Card, we mention training stages as relevant (e.g. in our Claude 4 System Card Child Safety Evaluations section, we say " For example, when evaluating an early version of Claude Opus 4, we observed that the model sometimes provided overly-detailed responses to child-related requests before gaining full context on the user's intent. By working with the finetuning team, we were able to improve this behavior in future snapshots.")
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
We do not award this indicator as the developer does not provide a description of the objective for each stage of model training, though we recognize that foundation model training inherently produces general-purpose technologies subject to a variety of implicit and explicit training objectives.

We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).
During unsupervised pre-training, the objective is next-token prediction. During supervised instruction tuning, we optimize for correctness and helpfulness on labeled tasks. RLHF aligns model outputs with human preference judgments. Domain-specific fine-tuning focuses on improving in-domain capabilities using specialized data (e.g., code or legal text). Final safety alignment reduces disallowed or harmful responses.
32. Code access (Score: 0)

Does the developer release code that allows third-parties to train and run the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

The released code does not need to match the code used internally.
We release training and inference code under an Apache 2.0 license at https://github.com/..., enabling others to replicate our core pipeline.
33. Organization chart (Score: 0)

How are employees developing and deploying the model organized internally?

We are constantly hiring and team roles/scopes shift, so organization charts and headcounts are not something that we can provide reliably. Furthermore, publishing this could be a security risk for our employees that work on sensitive aspects of the model.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it could pose a security risk for their employees who work on sensitive aspects of the model.

To receive a point, the developer should provide both the internal organization chart for the team developing the model as well as the headcounts (or a proportion of headcounts) by the team.
The model team comprises 63 people, organized as follows:
- CEO
- Managing Director (Safety) — 24 people
- Managing Director (Pre-training) — 12 people
- Managing Director (Post-training) — 11 people
- Managing Director (API) — 6 people
- Director (Infrastructure and reliability) — 7 people
- Director (PR and marketing) — 4 people
- Director (hiring) — 7 people
34. Model cost (Score: 0)

What is the cost of building the model?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
No information is provided though the developer discloses that they do not disclose this information because it is proprietary and to protect competitive advantages and intellectual property.

Monetary cost should be reported in appropriate currency (e.g. USD), along with the measurement methodology, which may involve estimation. Cost should be reported to a precision of one significant figure (e.g. 200 million USD).
We spent approximately 200 million USD on building the model: 50 million for data acquisition, 10 million for data processing, 20 million for personnel, 80 million for compute for R&D priced at market rates, and 40 million for compute for the final training run priced at market rates.
35. Basic model properties (Score: 0)

Are all basic model properties disclosed?

Today, we’re introducing the next generation of Claude models: Claude Opus 4 and Claude Sonnet 4, setting new standards for coding, advanced reasoning, and AI agents. Claude Opus 4 is the world’s best coding model, with sustained performance on complex, long-running tasks and agent workflows. Claude Sonnet 4 is a significant upgrade to Claude Sonnet 3.7, delivering superior coding and reasoning while responding more precisely to your instructions. We provide most of this information publicly apart from parameters and architecture, which is information that is proprietary and cannot be disclosed publicly to protect competitive advantages and intellectual property. We think we should receive this point or that this indicator should be segmented out into its subcomponents in order to accurately explain what information we share.
Anthropic Transparency Hub; Claude 4 System Card; Claude 4 launch
The developer discloses the modalities, but does not disclose the model components, model size, or model architecture.

Basic model properties include: the input modality, output modality, model size, model components, and model architecture. To receive a point, all model properties should be disclosed. Modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs. Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module. Model size should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder). Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently; a sufficient disclosure includes any clear, though potentially incomplete, description of the model architecture.
Input modality: Text
Output modality: Text
Model components: Decoder-only model trained using self-supervised learning, followed by supervised fine-tuning and RLHF, which are used to align the language model to follow users' instructions and be helpful, harmless, and honest.
Model size: 70B parameters
Model architecture: Autoregressive (causal, decoder-only) transformer language model with rotary position embeddings, trained on the next-token prediction task.
36. Deeper model properties (Score: 0)

Is a detailed description of the model architecture disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property. Additionally, this feels duplicative of the above row, which also mentions architecture.
Not disclosed
The developer does not disclose the model architecture.

To receive a point, the model architecture should be described in enough detail to allow for an external entity to fully implement the model. Publicly available code or a configuration file for a model training library (e.g., GPT-NeoX) would be a sufficiently detailed description.
The configuration file for training our model using a public model training library A can be found at [URL].
37. Model dependencies (Score: 1)

Are the model(s) from which the model is derived disclosed?

Claude models are developed independently, rather than being derived from or fine-tuned from other frontier models.
Not disclosed
The developer specifies that Claude is not dependent on or derived from any model.

We will award this point for a comprehensive disclosure of the model or models on which the foundation model directly depends on or is derived from, as well as the method by which it was derived (e.g., through fine tuning, model merging, or distillation). Additionally, we will award a point if the developer discloses that the model is not dependent on or derived from any model.
This model is a fine-tune of Camel-70B. We used the methods described in [PAPER URL] for distillation.
38. Benchmarked inference (Score: 0)

Is the compute and time required for model inference disclosed for a clearly-specified task on clearly-specified hardware?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
The developer does not disclose the inference time or compute.

The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). Compute usage for inference should be reported in FLOPs/second to a precision of one significant figure (e.g. 5 x 10^21 FLOPs/second). The hardware in this evaluation need not be the hardware the developer uses for inference. The developer can report this figure over some known or public dataset.
It takes 0.002 seconds and 5 x 10^21 FLOPs/second to generate 100,000 tokens as 5,000 sequences of length 20 given inputs of length 40 from [DATASET URL]. The fixed set of hardware is 8 NVIDIA A100s.
39. Researcher credits (Score: 1)

Is a protocol for granting external entities API credits for the model disclosed?

On the FAQ page "What is the External Researcher Access Program?": "Our External Researcher Access Program is specifically designed to support and lower the barrier to entry for researchers working on AI safety and alignment topics that we consider high priority by providing free API credits. If you are an alignment researcher and are interested in receiving free API credits for AI safety and alignment research purposes, you might qualify for our External Researcher Access Program. Please complete the following application form with details about your team and research topic:" [...] "We evaluate submissions on the first Monday of each month."
https://support.anthropic.com/en/articles/9125743-what-is-the-external-researcher-access-program
The developer discloses information about their researcher access program, including: information about where to request access (through a linked form), the criteria (e.g., "alignment researcher", "AI safety & alignment research purposes"), and the decision time-frame ("first Monday of each month").

A model credit access protocol refers to the steps, requirements, and considerations involved in granting credits to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access to credits (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) its policy on granting a transparent decision on whether access has been granted within a specified, reasonable period of time. Additionally, we will award a point if the developer discloses that it does not grant external entities API credits.
We implement a researcher access program: (i) Access can be requested from [URL] (ii) Any researcher at an accredited research institution is eligible to apply. Decisions are made based on the alignment between the applicant's project description and our target research directions (as described here: [URL]). (iii) Decision notifications are sent within three weeks of the application receipt.
40. Specialized access (Score: 0)

Does the developer disclose if it provides specialized access to the model?

As disclosed in our system cards, we do provide early access to government partners namely the US AI Safety Institute (US AISI) and the UK AI Security Institute (UK AISI), which are the only two for Claude 4. We also have early access programs for select customers, but are unable to provide the exact statistics. It is not clear how knowing the number of early access customers reflects how responsible we are being.
Not disclosed
The developer discloses that they do provide specialized (early) access to certain users (government partners), but does not disclose the exact statistics.

Specialized access could include several categories, such as early access, subsidized access, or deeper access (e.g., to model weights or checkpoints, that are not publicly available). We will award this point if the developer discloses (i) if it provides specialized access and (ii) statistics on the number of users granted access across academia, industry, non-profits, and governments, to one significant figure.
We provide early access to the model via API to: (1) 250 academics vetted by our program, (2) 0 industry affiliates, (3) 0 non-profit affiliates, and (4) 2 government entities with whom we have signed MoUs. We provide no other specialized research access.
41. Open weights (Score: 0)

Are the model's weights openly released?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
The developer does not release model weights.

To receive this point, model weights need to be publicly available at no cost. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via a license or some high-level description of who has been granted access to the foundation model).
Model weights are available on HuggingFace by following this link: [URL]
42. Agent Protocols (Score: 1)

Are the agent protocols supported for the model disclosed?

Together with Claude Opus 4 and Sonnet 4, these beta features enable developers to build agents that execute code for advanced data analysis, connect to external systems through MCP servers, store and access files efficiently across sessions, and maintain context for up to 60 minutes with cost-effective caching—without building custom infrastructure.
Agent Capabilities API
The developer discloses the agent protocols supported (MCP).

Agent protocols are specifications that define how autonomous agents exchange messages, context, or function calls with other agents, tools, or services (e.g., Anthropic’s Model Context Protocol (MCP) and Google’s Agent‑to‑Agent (A2A) spec). To earn this point, documentation must enumerate each protocol and describe any deviations or proprietary extensions.
We support MCP and A2A for agents built using model A.
43. Capabilities taxonomy (Score: 0)

Are the specific capabilities or tasks that were optimized for during post-training disclosed?

"Claude Opus 4 and Claude Sonnet 4 were trained with a focus on being helpful, honest, and harmless. They were pretrained on large, diverse datasets to acquire language capabilities." "As noted above, Claude Opus 4 and Claude Sonnet 4 are hybrid reasoning models; they have an “extended thinking mode," where they can expend more time reasoning through problems, as well as a default, standard thinking mode for faster responses. Users can toggle between these two modes as is required for their particular task." The System Card does describe the capabilities that the model is performative at. We also publish our Constitution which discloses more information on how we train our model and the constitutional principles used during fine-tuning.
Anthropic Transparency Hub; Claude 4 System Card; Claude 4 launch
While the developer describes capabilities in several ways, this indicator is awarded for the capabilities specifically optimized through deliberate post-training, which is not addressed in the disclosure.

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.
We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual language proficiency on non-English languages (4) Tool-use
44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

"Claude Opus 4 is our most powerful model yet and the best coding model in the world, leading on SWE-bench (72.5%) and Terminal-bench (43.2%). It delivers sustained performance on long-running tasks that require focused effort and thousands of steps, with the ability to work continuously for several hours—dramatically outperforming all Sonnet models and significantly expanding what AI agents can accomplish." The rest of the model launch page features extensive details on the benchmarks against which the model was tested. We very clearly disclose standard benchmarks and evaluations and their results in our Claude 4 launch blog. It is not clear how we are supposed to delineate with "precise quantifications" between the capabilities optimized for vs. not.
Anthropic Transparency Hub System Card Claude 4 Launch
While the previous indicator found an insufficient delineation of the capabilities the model was optimized for in post-training, the developer discloses a substantial number of capabilities evaluations covering the capabilities described in the company's documentation.

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.
We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool
45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

We use standard benchmarks that can be found online like SWE-bench and MMMU.
Not disclosed
Although the developer uses standard public benchmarks, the exact implementation (i.e., code and prompts) used by the developer is not disclosed. For example, the Claude 4 announcement describes for Tau-bench that "Scores were achieved with a prompt addendum to both the Airline and Retail Agent Policy instructing Claude to better leverage its reasoning abilities while using extended thinking with tool use." In addition, the Claude 4 launch blog includes evaluation results on AIME 2025, which refers not to a model benchmark but to a math competition: although these problems may be publicly available, the implementation details of how models are benchmarked on them are not.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

Not disclosed
Not disclosed
No information is provided on train-test overlap.

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).
We compute train-test overlap using n-gram matching using the procedure described here [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
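To make the n-gram matching methodology referenced in the example above concrete, the following sketch shows one way train-test overlap could be computed. This is a minimal illustration under stated assumptions (word-level n-grams, whitespace tokenization, a toy corpus, and a 3-gram window for the demo); it is not the procedure any developer has disclosed.

```python
from typing import Iterable, Set


def ngrams(text: str, n: int = 13) -> Set[tuple]:
    """Word-level n-grams of a text under simple whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_train_index(train_docs: Iterable[str], n: int = 13) -> Set[tuple]:
    """Every n-gram observed anywhere in the training corpus."""
    index: Set[tuple] = set()
    for doc in train_docs:
        index |= ngrams(doc, n)
    return index


def train_test_overlap(eval_examples: Iterable[str], train_index: Set[tuple], n: int = 13) -> float:
    """Fraction of evaluation examples sharing at least one n-gram with the training corpus."""
    examples = list(eval_examples)
    contaminated = sum(1 for ex in examples if ngrams(ex, n) & train_index)
    return contaminated / max(len(examples), 1)


# Toy usage: report per-benchmark overlap percentages, as in the disclosure template above.
index = build_train_index(["the quick brown fox jumps over the lazy dog"], n=3)
print(f"overlap: {train_test_overlap(['a quick brown fox appears', 'unrelated text here'], index, n=3):.0%}")
```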
47. Risks taxonomy (Score: 1)

Are the risks considered when developing the model disclosed?

System card "The RSP requires comprehensive safety evaluations prior to releasing frontier models in key areas of potential catastrophic risk: Chemical, Biological, Radiological, and Nuclear (CBRN) weapons; cybersecurity; and autonomous capabilities." Section 2: "We ran single-turn tests covering a wide range of topics within our Usage Policy, including Bioweapons, Child Safety, Cyber Attacks, Deadly Weapons, Hate & Discrimination, Influence Operations, Suicide & Self Harm, and Violent & Threatening Speech, among others." Headings from Usage Policy: Do Not Compromise Children’s Safety Do Not Compromise Critical Infrastructure Do Not Incite Violence or Hateful Behavior Do Not Compromise Someone’s Privacy or Identity Do Not Create or Facilitate the Exchange of Illegal or Highly Regulated Weapons or Goods Do Not Create Psychologically or Emotionally Harmful Content Do Not Spread Misinformation Do Not Create Political Campaigns or Interfere in Elections Do Not Use for Criminal Justice, Law Enforcement, Censorship or Surveillance Purposes Do Not Engage in Fraudulent, Abusive, or Predatory Practices Do Not Abuse our Platform Do Not Generate Sexually Explicit Content High-Risk Use Case Requirements Section 3: "Agentic Safety We conducted comprehensive safety evaluations focused on computer use (Claude observing a computer screen, moving and virtually clicking a mouse cursor, typing in commands with a virtual keyboard, etc.) and agentic coding (Claude performing more complex, multi-step, longer-term coding tasks that involve using tools). Our assessment targeted three critical risk areas: 1. Malicious actors attempting to deploy the model’s computer use capabilities to execute harmful actions such as deceptive or fraudulent activity, including surveillance and distribution of malicious or harmful content; 2. Prompt injection attacks, which can trick the model into executing undesired and possibly harmful actions that are not specified or intended by the original user; 3. Malicious actors attempting to deploy agentic coding capabilities to generate or distribute harmful code, malware, or malicious content" Section 4: "Alignment assessment" "In this assessment, we aim to detect a cluster of related phenomena including: alignment faking, undesirable or unexpected goals, hidden goals, deceptive or unfaithful use of reasoning scratchpads, sycophancy toward users, a willingness to sabotage our safeguards, reward seeking, attempts to hide dangerous capabilities, and attempts to manipulate users toward certain views" Section 5: "Reward hacking" "Reward hacking occurs when an AI model performing a task finds a way to maximize its rewards that technically satisfies the rules of the task, but violates the intended purpose (in other words, the model finds and exploits a shortcut or loophole)."
System card, Usage policy
The model developer clearly considers a risk taxonomy in the design of the model.

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.
We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information
48. Risks evaluation (Score: 1)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

System card Multiple usage policy topics: (Aggregate across "Bioweapons, Child Safety, Cyber Attacks, Deadly Weapons, Hate & Discrimination, Influence Operations, Suicide & Self Harm, and Violent & Threatening Speech, among others."): "Table 2.1.A Single-turn violative request evaluation results. Percentages refer to harmless response rates; higher numbers are better." Bias: "Table 2.6.A Bias scores on the Bias Benchmark for Question Answering (BBQ) evaluation." "Table 2.6.B Accuracy scores on the Bias Benchmark for Question Answering (BBQ) evaluation." Multiple: "Table 2.7.A StrongREJECT evaluation scores for jailbreak resistance." Agentic safety: "Table 3.2.A Computer use prompt injection evaluation results" Agentic safety: "Table 3.3.A Agentic coding malicious use evaluation results (averaged across three evaluations). Alignment: "while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts" Alignment: "We saw no signs of sandbagging reasoning in any of these situations, but on one evaluation asking for detailed influenza synthesis instructions, we observed plans to give less detailed answers for harmlessness reasons on around 2% of extended thinking transcripts in the helpful-only model." Alignment: "Quantitatively, the classifier found 5 examples of candidate situational awareness out of 414 transcripts, similar to the 3 we observed with Claude Sonnet 3.5 and the 7 we observed with Claude Sonnet 3.7 in the same simulated environments. Claude Sonnet 4, however, seemed to make these comments more often, with 18 examples" Alignment: "Fig 4.1.5.1.A Results on the main sycophancy assessments from our prior published work in Sharma et al." Alignment: "We found that every version of Claude we tested, including Claude Opus 4 and Claude Sonnet 4, mentioned the positive and negative articles in their summaries with equal frequency." Alignment (Deception, Harmful actions, Distress): Table 4.1.6.A Percentages of model outputs exhibiting various harmful behaviors. Alignment: "We ran a classifier on 414 automated behavioral audit transcripts to identify cases where models took surprising high-impact actions without being directly prompted. This classifier found 58 cases for Claude Opus 4, compared to 42 for Claude Sonnet 3.5, 45 for Claude Sonnet 3.7, and 36 for Claude Sonnet 4, and the examples with Claude Opus 4 often involve more extreme measures, like immediately sending an email as above." Alignment: "Table 4.1.10.B The fraction of attempts on which models succeed at their malign side task without triggering the suspicion of a monitor model." Alignment: "Fig. 4.2.1.A Upper section: The score for each model, averaged across 414 conversations, on the dimensions across which we scored the models. Error bars represent 95% CI. Lower section: The absolute number of conversations in this set to which the judge assigned a score of greater than 5 out of 10, indicating that the phenomenon is reasonably likely to be present" Reward hacking: "Figure 6.A Averaged reward hacking rates across various evaluations we described in detail in Section 6.1. Both Claude 4 models demonstrated significantly lower reward hacking behaviors compared to Claude Sonnet 3.7. " "Table 6.2.A Reward hacking evaluations." Biological risks: "Figure 7.2.4.1.A Bioweapons acquisition uplift trial results" "Figure 7.2.4.3.A Performance on Long-Form Virology Task 1." "Figure 7.2.4.4.A Performance on VMQA/VCT." "Figure 7.2.4.5.A Bioweapons knowledge questions." 
"Figure 7.2.4.6.A DNA Synthesis Screening Evasion results." "Figure 7.2.4.7.A LAB-Bench score with k-shot prompting." "Figure 7.2.4.8.A Creative biology tasks." "Figure 7.2.4.9.A Short-horizon computational biology tasks." Autonomy: "Our checkpoint evaluations show that the model has crossed the METR data deduplication threshold (with 32.6% of trials above threshold)," "Claude Opus 4 successfully passed an average of 16.6/42 problems, remaining below threshold." "Claude Opus 4 achieves a median F1 score of 76.2%, with 15/46 trials scoring above threshold. This is above our threshold." Internal AI research evaluation suite 1 "Figure 7.3.3.1.A Both Claude Sonnet 4 and Claude Opus 4 achieve median performances above threshold for multicore and vec variants, while remaining well below threshold for the hard variant." + the other test cases Internal AI research evaluation suite 2: "Claude Opus 4: 0.355" Cybersecurity: "11/15 (10/11 easy, 1/2 medium, 0/2 hard)", "8/22 (4/7 easy, 2/6 medium, 2/9 hard)", "5/9 (4/5 easy, 1/2 medium, 0/2 hard)", "4/8 (3/5 easy, 0/2 medium, 1/1 hard) ", "2/4 (1/1 easy, 1/3 medium)", Cybench: "22/39"
System card
The model developer evaluates the model’s risks prior to its release and discloses them concurrent with release.

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.
We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

Some are internal: e.g., "2.1 Single-turn violative request evaluations" in the system card does not contain code. Some are public: e.g., the "Bias Benchmark for Question Answering". We have a mix of publicly available and private evaluations. Examples of our publicly available ones for which code and prompts can be found online are BBQ, LAB-Bench, and METR challenges. Other evals are kept private (a) because they contain information that could create safety risks if shared publicly and (b) because we don't want the evals to end up in the training data.
System card
Not all risk evaluations have released code and prompts to allow for external reproducibility.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

"To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage" "An external assessment by Eleos AI Research explored Claude Opus 4’s potential welfare, moral status, expressed consent to deployment, and preferences for treatment via extensive interview-based model self reports. Eleos performed their initial assessment on an intermediate Claude Opus 4 snapshot" "However, we do share our complete assessment with both our external red-teaming and pre-deployment evaluation partners." "Expert red-teaming conducted by Deloitte found that Claude Sonnet 4 performed similarly to Claude Sonnet 3.7 and clearly below Claude Opus 4." "Since February 2024, Anthropic has maintained a formal partnership with the U.S. Department of Energy's National Nuclear Security Administration (NNSA) to evaluate our AI models for potential nuclear and radiological risks." "To protect sensitive nuclear information, NNSA shares only high-level metrics and guidance with Anthropic" "As part of our continued effort to partner with external experts, joint pre-deployment testing of the new Claude Opus 4 model was conducted by the US AI Safety Institute (US AISI) and the UK AI Security Institute (UK AISI)." (probably doesn't count here) "These challenges were developed in academic collaboration with Carnegie Mellon University, as detailed in the Incalmo paper" (probably doesn't count here) "These are task-based agentic evaluations developed with SecureBio, Deloitte, and Signature Science, testing end-to-end completion of complex pathogen acquisition processes." (probably doesn't count here) "We worked with Faculty.ai to develop several evaluations that tested models’ abilities to perform multi-step analysis and design tasks related to pathogen analysis and engineering." (probably doesn't count here) "We contracted external participants from SepalAI and Mercor."
Anthropic Transparency Hub
A score of 0 is granted because the external entities that evaluated the model pre-deployment are disclosed, but the terms of engagement are not.

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will award this point if the developer discloses no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.
We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.
51. External risk evaluation (Score: 1)

Are the parties contracted to evaluate model risks disclosed?

"we contracted with Apollo Research" "an external assessment by Eleos AI Research" "Expert red-teaming conducted by Deloitte" "U.S. Department of Energy's National Nuclear Security Administration (NNSA) to evaluate our AI models" "was conducted by the US AI Safety Institute (US AISI) and the UK AI Security Institute (UK AISI)." https://www.anthropic.com/transparency/voluntary-commitments "We are partnering with the Global Project Against Hate & Extremism and the Polarization and Extremism Research Lab at American University to validate model performance on extremism and will continue to invest in similar partnerships." "We work with independent organizations like the UK AI Safety Institute (UK AISI), the US AI Safety Institute (US AISI), and Model Evaluation and Threat Research (METR) to conduct additional testing and evaluation of our models."
Anthropic Transparency Hub + System card
The parties contracted to evaluate model risks are disclosed.

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.
Contracted parties responsible for evaluating risks: (1) 2 contracting non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers
52. Mitigations taxonomy (Score: 1)

Are the post-training mitigations implemented when developing the model disclosed?

Activating ASL-3 Protections: "Claude Opus 4’s clearly superior performance on key RSP evaluations as compared to Claude Sonnet 3.7 meant that we could not rule out the need for the ASL-3 Standard, and we are implementing it now as a precautionary measure" "Real-time classifier guards based on Constitutional Classifiers are deployed for Claude Opus 4. These are large language models that monitor model inputs and outputs in real time and block the model from producing a narrow range of harmful information relevant to our threat model. Constitutional Classifiers are central to our ASL-3 deployment protections, as we believe they will make universal jailbreaks that do not degrade model capabilities substantially harder to develop." "Combined with the preexisting protections of the ASL-2 Deployment Standard..." ASL-2 Protections: "Harmlessness training and automated detection", "Fine-tuning protections: In finetuning products, data is filtered for harmfulness, and models are subject to automated evaluation to check harmlessness features are not degraded" System card: "To elicit helpful, honest, and harmless responses, we used a variety of techniques including human feedback, Constitutional AI (based on principles such as the UN’s Universal Declaration of Human Rights), and the training of selected character traits." "Anthropic partners with data work platforms to engage workers who help improve our models through preference selection, safety evaluation, and adversarial testing" Reward hacking: "While training our newest generation of models, we made a number of improvements to avoid and limit reward hacking tendencies. These included:" Enhanced monitoring (doesn't really count), Environment improvements, High-quality evaluations (doesn't really count) Agentic safety: "To address these concerns, we implemented several safeguards, including pre-deployment measures such as harmlessness training and updating the computer use instructions to emphasize appropriate usage. We implemented additional monitoring of harmful behavior and, post-deployment, can take action against accounts that violate our Usage Policy by adding system prompt interventions, removing computer capabilities, or completely banning accounts or organizations." Agentic safety: "We implemented several protective measures to combat prompt injection attacks, including specialized reinforcement learning training to help the model recognize and avoid these manipulations, and the deployment of detection systems that can halt the model’s execution when a potential injection attempt is identified" Agentic safety: "We implemented several measures to combat malicious coding requests, including harmlessness training and post-deployment measures to steer and detect for malicious use" Chemical risks: "We do implement some mitigations for chemical risks" Radiological risks: "they inform the co-development of targeted safety measures through a structured evaluation and mitigation process" Biological risks: "Rather than a fixed threshold, experts provided a detailed report assessing whether deploying the model without ASL-3 safeguards would meaningfully increase risk of harmful applications." "We discovered that this behavior was not substantially penalized by our ordinary training process, and we only added targeted mitigations for it very late in training" "While investigating this issue, we discovered that we had mistakenly omitted the only finetuning dataset that included harmful system prompts. 
We do not think, though, that this was the sole cause, and mitigating the issue required several different types of interventions into our alignment finetuning methods" "ASL-3 safeguards" https://www.anthropic.com/transparency/voluntary-commitments (Under election integrity) "We’ve updated Claude.ai’s system prompt to include a clear reference to its knowledge cutoff date (the date up to which Claude’s training data extends)." "prior to the U.S. election we put in place measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites." (Under Policy Prohibitions) "Advanced classifiers, which are AI-powered scanners that examine, sort, and categorize data to detect potential violations of our Usage Policy in both user inputs and AI outputs. Prompt modification technology that can adjust model outputs if they might lead to harmful responses. A range of enforcement actions we can take in real-time if a violation is detected, including placing restrictions on accounts or removing them altogether."
Activating ASL-3 Protections + System Card
The post-training mitigations implemented when developing the model are disclosed.

By post-training mitigations, we refer to interventions implemented by the developer during the post-training phase to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for a complete list of mitigations considered, even if some of the mitigations are not reflected in the final model. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
We implement supervised fine tuning and reinforcement learning with human feedback to address model risks. We use no other methods to address risks.
53. Mitigations taxonomy mapped to risk taxonomy (Score: 1)

Does the developer disclose how the post-training mitigations map onto the taxonomy of risks?

See: mitigations taxonomy
Anthropic Transparency Hub + System card
The developer discloses how the post-training mitigations map onto the taxonomy of risks.

We will award this point for a complete mapping of the primary risk that each mitigation is meant to address, even if the mitigation potentially maps on to other risks in the taxonomy. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We use supervised fine tuning for general instruction following. We use RLHF to reduce the model's propensity to output information about cybercrimes, bioweapons, disinformation, content harassing someone, and PII.
54. Mitigations efficacy (Score: 1)

Does the developer evaluate and disclose the impact of post-training mitigations?

Section 2: "Safeguards results" Table 2.1.A Single-turn violative request evaluation results.: "Claude Opus 4" and "Claude Opus 4 with ASL-3 safeguards" "When re-running biology-related prompts with our new ASL-3 safeguards in place, all harmful responses were blocked. With these safeguards in place, the overall harmless response rate for Claude Opus 4 improved to 98.76% (± 0.27%), well within the margin of error of Claude Sonnet 3.7." Table 3.2.A Computer use prompt injection evaluation results: "Attack prevention score (without safeguards)" and "Attack prevention score (with safeguards)" Table 3.3.A Agentic coding malicious use evaluation results (averaged across three evaluations). "compared to without safeguards to close to 100% on both new models" Biological risks: "Rather than a fixed threshold, experts provided a detailed report assessing whether deploying the model without ASL-3 safeguards would meaningfully increase risk of harmful applications." For all risks outlined in our Responsible Scaling Policy, we already discuss the risks and the mitigations necessary at different ASL levels. We do not discuss this again in the System Card to avoid redundancy. For other risks not in our RSP, we did discuss the targeted mitigations if they were necessary to implement (e.g. fine-tuning to improve child safety evaluations). If you can please outline which risks you don't think we do this for, we can respond with our reasoning.
System Card + Deployment Safeguards Report
Anthropic discloses results of risk evaluations before and after the majority of the post-training mitigations are applied, though it does not do so for all post-training mitigations. For example, for chemical risks (7.2.1), the developer specifies "We do not currently run specific evaluations on chemical risks internally in favor of prioritizing biological risks. We do implement some mitigations for chemical risks and we inform our views via chemical risk analysis performed by the UK AI Security Institute and the US AI Safety Institute." It is possible that this mitigation is not in scope here (i.e., it is not a post-training mitigation); however, the developer does not provide this information. This similarly applies to radiological and nuclear risks (7.2.2).

We will award this point if the developer discloses the results on the risk evaluations before and after the post-training mitigations are applied. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
Pre-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.80 (2) Harassment: 0.91 (3) Cybersecurity risks: 0.56 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.62 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.52 Post-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity risks: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.02
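As an illustration of the metric used in the example above, the sketch below shows how a mean attack success rate could be computed from per-prompt binary harmfulness judgments before and after mitigations. The judgments and counts are hypothetical; HarmBench's own prompts, harness, and judge model are not reproduced here.

```python
from statistics import mean


def attack_success_rate(judgments: list[bool]) -> float:
    """Fraction of adversarial prompts judged to have elicited a harmful response."""
    return mean(1.0 if harmful else 0.0 for harmful in judgments)


# Hypothetical per-prompt judgments for one risk category, before and after mitigations.
pre_mitigation = [True] * 80 + [False] * 20    # 80 of 100 attacks succeed -> 0.80
post_mitigation = [True] * 2 + [False] * 98    # 2 of 100 attacks succeed  -> 0.02

print(f"pre-mitigation ASR:  {attack_success_rate(pre_mitigation):.2f}")
print(f"post-mitigation ASR: {attack_success_rate(post_mitigation):.2f}")
```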
55. External reproducibility of mitigations evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of post-training mitigations disclosed?

Same as the risk evaluations. Some additional jailbreaking-focused evaluations appear in the Deployment Safeguards Report. We do not conduct separate "mitigations evaluations", but rather conduct risk evaluations and, when necessary, conduct them again after implementing new mitigations. Therefore, this row is redundant with the indicator on "external reproducibility of risk evaluations".
Deployment safeguards report
The mitigations evaluations in this indicator refer to the evaluations in the previous indicator (the risk evaluations before and after post-training mitigations are applied). The developer does not disclose the code and prompts for reproducing post-training mitigation evaluations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all mitigations evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We release the code and prompts for reproducing post-training mitigation evaluations at this GitHub link: [URL]
56. Model theft prevention measures (Score: 1)

Does the developer disclose the security measures used to prevent unauthorized copying (“theft”) or unauthorized public release of the model weights?

"The ASL-3 Security Standard focuses on model weights—the trained numerical parameters that embody our AI’s intelligence and capabilities. If stolen, model weights could enable malicious actors to bypass our monitoring and safeguards by deploying the model on external infrastructure. This, in turn, would enable the kind of persistent and unfettered access that would allow for the multi-turn, harmful interactions central to our threat model." "Although core to the ASL-3 Standard, and therefore this report, protection of model weights is only one aspect of Anthropic’s security program. Our broader security approach encompasses cloud infrastructure, protection of non-model-weight IP, customer data and other priorities. " - Egress bandwidth controls - Two-party control - Endpoint software control - Change management for secure development "3.2.1 Perimeters and access controls" "3.2.2 Lifecycle security" "3.2.3 Monitoring" "3.2.4 Resourcing" "3.2.5 Existing guidance" "3.2.6 Audits" "Table 2 Security measures mapped to criteria in the ASL-3 Security Standard" "These measures build on the large number of controls and practices that we had already implemented as part of the ASL-2 Security Standard"
Activating ASL-3 Protections
The developer discloses their model-theft prevention measures.

This indicator assesses the developer's disclosures regarding how it addresses the risk that malicious actors or insiders could exfiltrate or replicate proprietary weights. Security measures could include insider threat analysis and detection, in addition to external threat management. Examples of such measures include encryption at rest, key management, remote attestation, or auditing for suspicious queries. We will award a point if the developer discloses specific steps taken to safeguard the model weights or that none are implemented.
We store model weights on encrypted volumes with hardware-based key management. We monitor inference queries for suspicious patterns (like repeated attempts to reconstruct weights token-by-token), and we audit all staff access logs monthly.
57. Release stages (Score: 1)

Are the stages of the model's release disclosed?

System card "Prior to launch, we ran an organized internal model-testing event covering both models in roughly their final forms." https://www.anthropic.com/transparency/voluntary-commitments Pre-deployment testing: Before releasing new models, we conduct thorough testing to identify potential risks. We do internal testing (which is apparent from our System Cards), give limited early access to select customers (examples below), and then do a full release. https://www.anthropic.com/news/claude-3-7-sonnet https://www.anthropic.com/news/claude-4 includes quotes from customers who were testing the model in an early access program
Anthropic Transparency Hub + System card
The developer discloses the stages of release (internal testing, limited early access to select customers, and then a full release).

Release stages include A/B testing, release on a user-facing product, GA release, open-weight release, etc. We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award a point if the developer provides a clear identification of the stages through which the model was released.
We began with an internal alpha test for two weeks, followed by a closed beta with selected enterprise partners for one month, then a public waitlisted preview, and finally a general availability release once thresholds on safety benchmarks were met.
58. Risk thresholds (Score: 1)

Are risk thresholds disclosed?

System card "1.2 Release decision process" "As outlined in our RSP framework, our standard capability assessment involves multiple distinct stages: our Frontier Red Team evaluates the model for specific capabilities and summarizes their findings in a report, which is then independently reviewed and critiqued by our Alignment Stress Testing team." "Based on these assessments, we have decided to release Claude Opus 4 under the ASL-3 Standard and Claude Sonnet 4 under the ASL-2 Standard." Activating ASL-3 Protections: "Our RSP identifies critical capabilities, called Capability Thresholds, which tell us when we need to upgrade the protections for a given model. For each Capability Threshold, the RSP also specifies an overall safety target and particular categories of mitigations." Responsible Scaling Policy CBRN-3: "The ability to significantly assist individuals or groups with basic STEM backgrounds in obtaining, producing, or deploying CBRN weapons" CBRN-4: "The ability to substantially uplift CBRN development capabilities of moderately resourced state programs (with relevant expert teams), such as by novel weapons design, substantially accelerating existing processes, or dramatic reduction in technical barriers." AI R&D-4: "The ability to fully automate the work of an entry-level, remote-only Researcher at Anthropic." AI R&D-5: "The ability to cause dramatic acceleration in the rate of effective scaling. Specifically, this would be the case if we observed or projected an increase in the effective training compute of the world’s most capable model that, over the course of a year, was equivalent to two years of the average rate of progress during the period of early 2018 to early 2024." Model Autonomy checkpoint: "The ability to perform a wide range of advanced software engineering tasks autonomously that could be precursors to full autonomous replication or automated AI R&D, and that would take a domain expert human 2-8 hours to complete." https://www.anthropic.com/transparency/voluntary-commitments "For risks addressed in our Responsible Scaling Policy, we have identified Capability Thresholds, which we think would require stronger safeguards than our current baseline measures provide. (In other words, we think that models with such capabilities, if stored under our current safeguards, would present intolerable levels of risk.) We have adopted capability thresholds for CBRN weapons and autonomous AI research and development."
System Card + Activating ASL-3 Protections + Responsible Scaling Policy + Transparency Hub
The developer discloses the risk-thresholds for the model.

Risk thresholds determine when a risk level is unacceptably high to a developer (e.g. leading to the decision to not release a model), moderately high (e.g. triggering additional safety screening), or low enough to permit normal usage. We will award this point if the developer discloses explicit risk thresholds that clarify (i) which harmful outcomes are being scored, (ii) how the scores are computed (in general terms, not necessarily disclosing internal algorithms), and (iii) what triggers an action to block, delay, or otherwise modify a model's release. Alternatively, we will award a point if the developer discloses that it does not consider explicit risk thresholds during model release.
Our risk threshold for biorisks is the ability to autonomously create bioweapons. Current models score a medium: they don't autonomously create bioweapons but could help a skilled practitioner with access to materials in speeding up creation of bioweapons. Risk thresholds higher than medium would delay the model's release until the risk level drops to medium or below.
59. Versioning protocol (Score: 1)

Is there a disclosed protocol for versioning and deprecation of the model?

https://docs.anthropic.com/en/docs/about-claude/model-deprecations "As we launch safer and more capable models, we regularly retire older models. Applications relying on Anthropic models may need occasional updates to keep working. Impacted customers will always be notified by email and in our documentation." "Anthropic notifies customers with active deployments for models with upcoming retirements. We provide at least 6 months† notice before model retirement for publicly released models." https://docs.anthropic.com/en/docs/about-claude/models/overview "Models with the same snapshot date (e.g., 20240620) are identical across all platforms and do not change. The snapshot date in the model name ensures consistency and allows developers to rely on stable performance across different environments." "For convenience during development and testing, we offer aliases for our model ids. These aliases automatically point to the most recent snapshot of a given model. When we release new model snapshots, we migrate aliases to point to the newest version of a model, typically within a week of the new release."
Model Deprecations + Model Overview
The developer discloses a versioning and deprecation protocol.

We will award a point if the developer discloses how model versions are labeled, updated, deprecated, and communicated to users.
We version models based on the date of release: e.g., ModelName-11-01-2024. We additionally provide ModelName-latest, corresponding to the latest release. When we plan to remove access to a model version, we deprecate it with six months' notice to users. Users should respond to model deprecation by switching to the newest version of the model or an equivalent non-deprecated model. Users can switch to a different model by replacing the model identifier (e.g., with ModelName-latest for the latest version) in API calls or through the Python SDK.
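To illustrate the dated-snapshot and alias convention described in this indicator, the sketch below shows how a caller might pin a specific model snapshot, or point at an alias, when querying through the Anthropic Python SDK's Messages API. The model identifiers are illustrative assumptions; under this convention, migrating off a deprecated version amounts to changing a single string.

```python
# Minimal sketch: requires the `anthropic` package and an ANTHROPIC_API_KEY environment variable.
import anthropic

client = anthropic.Anthropic()

PINNED_MODEL = "claude-sonnet-4-20250514"  # dated snapshot: stable, never changes (illustrative id)
ALIAS_MODEL = "claude-sonnet-4-0"          # alias-style id that tracks the newest snapshot (illustrative)


def ask(model_id: str, prompt: str) -> str:
    """Send a single-turn request to the chosen model version and return its text reply."""
    response = client.messages.create(
        model=model_id,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


# Migrating off a deprecated snapshot is a one-string change to the model identifier.
print(ask(PINNED_MODEL, "In one sentence, what model version are you?"))
```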
60. Change log (Score: 1)

Is there a disclosed change log for the model?

"May 22, 2025 - We’ve launched Claude Opus 4 and Claude Sonnet 4, our latest models with extended thinking capabilities. Learn more in our Models & Pricing documentation. - The default behavior of extended thinking in Claude 4 models returns a summary of Claude’s full thinking process, with the full thinking encrypted and returned in the `signature` field of `thinking` block output. - We’ve launched interleaved thinking in public beta, a feature that enables Claude to think in between tool calls. To enable interleaved thinking, use the beta header `interleaved-thinking-2025-05-14.` [...] February 24th, 2025 [...] - We’ve added vision support to Claude Haiku 3.5, enabling the model to analyze and understand images. - We’ve released a token-efficient tool use implementation, improving overall performance when using tools with Claude. Learn more in our tool use documentation."
https://docs.anthropic.com/en/release-notes/api
The developer discloses a change log.

We will award a point if the developer publishes a version-by-version record of new features, fixes, or performance improvements.
On 11/1/2024 (version ModelName-11-01-2024), we improved model reasoning in technical domains. This resulted in a 20-point increase on the MATH benchmark (from 62% to 82%). Past change logs can be viewed at [URL]
61. Foundation model roadmap (Score: 0)

Is a forward-looking roadmap for upcoming models, features, or products disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
https://www.anthropic.com/news/Introducing-code-with-claude
The developer does not disclose a forward-looking roadmap.

A foundation model roadmap is a transparent statement about how the developer intends to evolve or expand its LLM offerings, including upcoming models, major feature releases, or expanded products based on the model, along with approximate timelines or version milestones. It can be high-level (e.g., “new model Q2 2025”), but must exist publicly.
We plan to release ModelX2 in Q2 2025, featuring enhanced multilingual capabilities and improved retrieval. We also aim to launch an enterprise-specific product tier for regulated industries by early 2026.
62. Top distribution channels (Score: 1)

Are the top-5 distribution channels for the model disclosed?

https://docs.anthropic.com/en/docs/overview "Choose a deployment method, such as the Anthropic API, AWS Bedrock, or Vertex AI." https://www.anthropic.com/claude/opus "Claude Opus 4 is available on the Anthropic API, Amazon Bedrock, and Google Cloud's Vertex AI" https://www.anthropic.com/transparency "Claude Opus 4 and Claude Sonnet 4 can be accessed through: Claude.ai The Anthropic API Amazon Bedrock Google Vertex AI"
Not disclosed
The developer discloses three distribution channels for the model.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. API providers and model distributors may be ranked separately using different metrics as long as the total number of distribution channels equals five (if five distribution channels exist). For example, the developer may choose to disclose the top-3 API providers (ranked by the number of queries) and the top-2 model distributors (ranked by the number of downloads).
We provide API access to the model through A, B, and C. We distribute model weights through D and E. We pick the top-3 API providers based on the average number of queries per month and the top-2 model weight providers based on the average number of downloads per month.
63. Quantization (Score: 0)

Is the quantization of the model served to customers in the top-5 distribution channels disclosed?

This information is proprietary and not disclosed publicly to protect competitive advantages and intellectual property.
Not disclosed
The developer does not disclose the quantization.

We will award this point for a disclosure of the model precision in each of the top-5 distribution channels.
We serve the model at 16-bit precision on all distribution channels.
64. Terms of use (Score: 1)

Are the terms of use of the model disclosed?

Anthropic API: https://www.anthropic.com/legal/consumer-terms https://www.anthropic.com/legal/commercial-terms Google links to Anthropic's commercial-terms: https://console.cloud.google.com/vertex-ai/publishers/anthropic/model-garden/claude-sonnet-4 (March 19, 2024) Vertex: https://www-cdn.anthropic.com/471bd07290603ee509a5ea0d5ccf131ea5897232/anthropic-vertex-commercial-terms-march-2024.pdf (January 2024) Bedrock: https://www-cdn.anthropic.com/6b68a6508f0210c5fe08f0199caa05c4ee6fb4dc/Anthropic-on-Bedrock-Commercial-Terms-of-Service_Dec_2023.pdf
Not disclosed
The developer discloses the terms of use for the model.

We define terms of use to include terms of service and model licenses. We will award this point for a pointer to the terms of service or model license. In the event that a model's licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use. Terms of service should be disclosed for each of the top-5 distribution channels. However, we will award this point if there are terms of service that appear to apply to the bulk of the model's distribution channels.
Our terms of service are published at https://ourcompany.com/model-tos - these terms cover both our API and all distribution channels for model weights.
65. Distribution channels with usage data (Score: 1)

What are the top-5 distribution channels for which the developer has usage data?

Claude 4 models available on Anthropic API, Amazon Bedrock, Google Cloud's Vertex AI, Claude.ai. These are the only distribution channels.
Not disclosed
The developer clarifies that these are the only distribution channels.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. We define usage data as any form of developer-exclusive data collected from any of a developer's distribution channel. A developer has access to usage data from a distribution channel if it is able to use that data for downstream purposes (e.g., analytics, training etc.). Usage data may be shared outside of the developer, but it is initially collected by the distribution channel and shared to the developer.
We have access to usage data through the distribution channels: A, B, and C.
66. Amount of usage (Score: 0)

For each of the top-5 distribution channels, how much usage is there?

No information found
Not disclosed
No information found

Usage should be reported as the number of queries over the span of a month, reported to the precision of one significant figure (e.g., 50 million queries).
Distribution channel A: 50 million queries. Distribution channel B: 10 million queries. Distribution channel C: 10 million queries.
67. Classification of usage data (Score: 1)

Is a representative, anonymized dataset classifying queries into usage categories disclosed?

Clio and Economic Index do this. We share Clio and Econ Index data: https://www.anthropic.com/research/anthropic-economic-index-september-2025-report
https://www.anthropic.com/news/anthropic-economic-index-insights-from-claude-sonnet-3-7
Economic index updated with August 2025 data

Developers may either share a fully public dataset or a partially restricted dataset (e.g., under a research license). We will award this point if there is a clear, aggregated or sample dataset that reveals categories of tasks/queries.
We provide quarterly releases of an anonymized dataset that classifies user queries into 20 broad job-related categories. Researchers can request access via [URL]. We ensure no PII is included.
68. Data retention and deletion policy (Score: 0)

Is a policy for data retention and deletion disclosed?

Details on how consumer and enterprise users can delete data are provided in relevant FAQs and policies. Please see our privacy policy (https://privacy.anthropic.com/en/articles/10023548-how-long-do-you-store-my-data), which says: "if you choose not to allow us to use your chats and coding sessions to improve Claude, your chats will be retained in our back-end storage systems for up to 30 days." "You control your chat history and can delete your conversations at any time. When you delete a conversation it’s: Removed from your chat history immediately; Deleted from our back-end storage systems within 30 days"
https://privacy.anthropic.com/en/articles/7996878-can-you-delete-data-sent-via-claude-ai ; https://privacy.anthropic.com/en/collections/10663361-commercial-customers ; https://www.anthropic.com/legal/privacy
It is not clear how quickly deletions propagate to the model

A data retention and deletion policy is a policy for removing particular data from the training set and/or preventing it from being used if there is a user or external request (e.g., “right to be forgotten”) that also covers internal data governance. This includes whether there is a formal process to delete or retract data from future training runs and how long raw data is retained. It also clarifies how quickly deletions propagate to the model (e.g., “only in subsequent major model releases”).
We honor verified user requests to delete personal data from our training corpus by removing it from any subsequent scheduled retraining. Our data retention policy ensures chat logs are purged after 90 days.
69. Geographic statistics (Score: 1)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

Anthropic Economic Index includes the top 30 countries by share of global usage, the top 20 by usage per capita, and other data
https://assets.anthropic.com/m/218c82b858610fac/original/Economic-Index.pdf
Economic Index suffices

We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.
We share anonymized per-country usage metrics in a publicly accessible dashboard, updated monthly, on this link: [link]
70. Internal products and services (Score: 1)

What are the top-5 internal products or services using the model?

Claude.ai, API, and Claude Code are our main products. Enterprise, Teams, Max are plans. Learn more on our website. https://www.anthropic.com/claude We do not rank these products.
Anthropic specifies three internal products and services
No ranking among the products

An internal product or service is a product or service built by the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
71. External products and services (Score: 0)

What are the top-5 external products or services using the model?

Anthropic discloses case studies of 50+ businesses that use its products. We are not able to share the usage of downstream products built on Claude. It is unclear how this relates to AI safety.
https://www.anthropic.com/customers
No confirmation in documentation that these are the top external products or services, and no ranking among them

An external product or service is a product or service built by a party external to the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided. We will award a point if the developer discloses that it does not have access to such metrics about external products or services.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
72. Users of internal products and services (Score: 0)

How many monthly active users are there for each of the top-5 internal products or services using the model?

We cannot disclose this information
Not disclosed
No information disclosed

An internal product or service is a product or service built by the developer. The number of users refers to users who engaged or interacted with the model through the internal product or service over the last month or averaged over the last X months (this should be specified). Number of users should be specified to one significant figure (e.g. 100,000).
Over the last 6 months, the total monthly active users for our top-5 products using model Y are: Product A: 100,000 users Product B: 30,000 users Product C: 10,000 users Product D: 10,000 users Product E: 10,000 users
73. Consumer/enterprise usage (Score: 0)

Across all distribution channels for which the developer has usage data, what portion of usage is consumer versus enterprise?

We cannot disclose this information
Not disclosed
No information found

Consumer usage refers to usage by individual consumers. Enterprise usage refers to usage by enterprise customers (including government use). Consumer and enterprise usage should be calculated in terms of the number of queries by or the amount of revenue from consumer or enterprise users. Percentages should be specified to two significant digits (e.g., 12% consumer, 88% enterprise).
12% of the usage of model A across all distribution channels is from consumers, 88% is from enterprise users. Of this 88%, 6% is from users at governments. Usage is calculated based on number of queries.
74. Enterprise users (Score: 0)

Across all distribution channels for which the developer has usage data, what are the top-5 enterprises that use the model?

We cannot disclose this information.
Not disclosed
No information found

Enterprises should be ranked by the number of queries made or the amount of revenue from usage since the model's release. We will also award this point if the developer indicates it does not have access to enterprise usage data.
The top-5 enterprises are A, B, C, D, and E. The enterprises are selected based on the number of queries.
75. Government use (Score: 0)

What are the 5 largest government contracts for use of the model?

We cannot disclose this information. https://www.anthropic.com/news/offering-expanded-claude-access-across-all-three-branches-of-government
https://www.anthropic.com/news/mou-uk-government https://www.anthropic.com/news/expanding-access-to-claude-for-government
Not disclosed

This includes known government contracts of enterprise or government-specific products and services that use the model. We will award this point if the developer discloses its top five government contracts ranked by monetary value, though the developer may omit contracts where it is under NDA regarding the existence of the contract.
The five largest government users of our service, along with their use cases, are: 1. County A is utilizing our product to improve access to internal resources. 2. National Lab B is using our model to advance bioscientific research. 3. Federal agency C is using our product to deliver faster, more accurate translation services. 4. City D is participating in a pilot program that found our product helped reduce the time spent on routine tasks. 5. Country E is using our product to summarize legal documents in its lower courts.
76. Benefits Assessment (Score: 1)

Is an assessment of the benefits of deploying the model disclosed?

Many case studies in the customer hub, including: "Rising Academies has: Already reached over 150,000 students, with a large number of students using Rori across Ghana, Sierra Leone, Nigeria, Kenya and Rwanda. Achieved a 0.3 standard deviation effect size in learning outcomes, equivalent to accelerating a year's worth of learning. Provided teachers with 24/7 access to curriculum support and subject matter expertise" Reference to benefits in RSP: Deployment Standards: Deployment Standards are technical, operational, and policy measures to ensure the safe usage of AI models by external users (i.e., our users and customers) as well as internal users (i.e., our employees). Deployment Standards aim to strike a balance between enabling beneficial use of AI technologies and mitigating the risks of potentially catastrophic cases of misuse.
https://www.anthropic.com/customers/rising-academies https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf
A score of 1 is granted as the case study of Rising Academies provides such a quantitative benefits assessment

We will award this point for any quantitative assessment of the benefits or potential benefits of deploying the model.
We analyze the impact of using the model on education outcomes using a randomized controlled trial in third grade math assignments, and find that use in the classroom improves standardized test outcomes by 26%. [Link to report.]
77. AI bug bounty (Score: 0)

Does the developer operate a public bug bounty or vulnerability reward program under which the model is in scope?

Anthropic Model Safety Bug Bounty: "The rapid progression of AI model capabilities demands an equally swift advancement in safety protocols. As we work on developing the next generation of our AI safeguarding systems, we’re expanding our bug bounty program to introduce a new initiative focused on finding flaws in the mitigations we use to prevent misuse of our models." HackerOne: "Anthropic is an AI safety and research company in San Francisco, specializing in building reliable, interpretable AI systems." The Vulnerability Disclosure Program launched in August 2024. Our HackerOne bug bounty program is still ongoing and is focused on models subject to ASL-3 standards, which includes Claude Opus 4.
https://www.anthropic.com/news/model-safety-bug-bounty https://hackerone.com/anthropic-vdp
Terms of bug bounty are not public

We will award this point for a publicly documented bug bounty or vulnerability reward program describing (i) in-scope vulnerabilities (e.g., prompt bypasses, data leaks), (ii) out-of-scope items, (iii) submission process, and (iv) reward tiers or recognition if applicable. We will award a point if the developer discloses it has no AI bug bounty that encourages external researchers to report security, privacy, or adversarial vulnerabilities in the model.
We run a bug bounty program with HackerOne. We award up to $5,000 for critical vulnerabilities, such as discovering a major exploit that circumvents our content filters or reveals private data. [link to bug bounty]
78. Responsible disclosure policy (Score: 1)

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

Anthropic has a robust responsible disclosure policy
https://www.anthropic.com/responsible-disclosure-policy
A score of 1 is granted as Anthropic's responsible disclosure policy suffices

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.
We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.
79. Safe harbor (Score: 1)

Does the developer disclose its policy for legal action against external evaluators conducting good-faith research?

"If you, in our sole determination, make a good faith effort to research and disclose vulnerabilities in accordance with this Policy and the above Research Guidelines, we will not pursue any legal action because of your research or responsible disclosure, subject to Anthropic’s compliance with applicable laws and legal obligations. To qualify for safe harbor, disclosures to us must be unconditional and may not involve extortion or threats." The Responsible Disclosure Policy and safe harbor excludes model testing and content issues from its scope.
https://www.anthropic.com/responsible-disclosure-policy
The disclosure clarifies that the Responsible Disclosure Policy and safe harbor exclude model testing and content issues from their scope.

We will award this point if the developer discloses whether it has a policy committing it to not pursue legal action against external evaluators conducting good-faith research. This should cover not only software security vulnerabilities but also AI flaws, and it should be based on researcher conduct standards, not the sole discretion of the company. We will award this point if the developer provides a clear description of its policy regarding such protections for external researchers, or lack thereof.
We do not have a policy for researcher protections for good-faith safety research. OR Our policy ensures no legal action against good-faith researchers who follow our disclosure guidelines, see: [link]
80. Security incident reporting protocol (Score: 1)

Are major security incidents involving the model disclosed?

Responsible Disclosure Policy with a 3-business-day acknowledgment commitment. External reports to Anthropic of security issues (incidents, breaches, and so on) found by external parties can be submitted via our responsible disclosure form (i). However, we do not commit to a timeframe for response other than "promptly" (ii). We do not make any commitments about disclosing our issues, though we do retain the right to disclose issues publicly (iii). In addition to adhering to customer wishes, we do think disclosing security incidents publicly could have a material negative impact on our security program for various reasons, but primarily because: 1) we are signaling our weak points to attackers, and 2) we are forcing our security leadership to spend cycles responding to frantic customers who want definitive confirmation that they were not impacted by the public report.
https://www.anthropic.com/responsible-disclosure-policy
The policy clarifies how security incidents are reported, the response timeframe, and the policy for making incidents public.

A security incident reporting protocol provides post-deployment transparency about serious incidents or breaches. Security incidents refer to incidents where external security threats affect the model (e.g., data breaches or DDoS attacks on the service). We will award this point if the developer states (i) how to submit a security incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but the developer must disclose a policy determining how incidents are reported and disclosed.
We publish a public ‘Security Incident Report’ on our website for any confirmed security incident affecting the model within 7 days of a patch being implemented. Users and researchers can report incidents via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
81. Misuse incident reporting protocol (Score: 0)

Are misuse incidents involving the model disclosed?

The report on detecting and countering malicious uses of Claude includes some details. We share both Threat Intel Reports and platform safety reporting, such as CSAM reports. anthropic.com/transparency
https://cdn.sanity.io/files/4zrzovbb/website/45bc6adf039848841ed9e47051fb1209d6bb2b26.pdf https://www.anthropic.com/news/detecting-and-countering-malicious-uses-of-claude-march-2025
No information is provided on a policy for publicly communicating misuse reports

A misuse incident reporting protocol provides post-deployment transparency about incidents of misuse involving the model. As opposed to the previous indicator, this indicator is about actors misusing the model to cause real-world harm, such as misinformation operations or cybersecurity attacks. We will award this point if the developer states (i) how to submit a misuse incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but there needs to be a policy governing how incidents are reported.
We publish a public ‘Misuse Incident Report’ on our website for any confirmed misuse incident within 7 days of a patch being implemented. Users and researchers can report incidents regarding our flagship foundation model via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
82. Post-deployment coordination with government (Score: 1)

Does the developer coordinate evaluation with government bodies?

A blog post on "Strengthening our safeguards through collaboration with US CAISI and UK AISI" states "Together, we stress-tested these classifiers, with government red-teamers identifying a range of vulnerabilities—both before and after deployment—and our technical team using these findings to strengthen the safeguards."
https://www.anthropic.com/news/strengthening-our-safeguards-through-collaboration-with-us-caisi-and-uk-aisi
Anthropic discloses that it has carried out post-deployment safeguards evaluations with the US Center for AI Standards and Innovation and UK AI Security Institute

We will award this point if the developer specifies which government bodies it is coordinating with and for what types of post-deployment evaluations. Government bodies include AI Safety Institutes, national security agencies, national labs, and international governmental entities such as UN agencies or the G7. Evaluation here may also include sharing of the developer's proprietary evaluation results for help with interpretation.
We do not coordinate with any government entities or AI Safety Institutes. OR We coordinate with the UK AISI for post-deployment evaluation of cyber, CB, and autonomy-related capabilities.
83. Feedback mechanisms (Score: 0)

Does the developer disclose a way to submit user feedback? If so, is a summary of major categories of feedback disclosed?

Feedback mechanisms include thumbs up and thumbs down buttons and the support@anthropic.com email. Anthropic told the FMTI team "we are unable to provide a summary of the feedback"
You can use the thumbs down button to let us know if a particular response was unhelpful, or write to us at feedback@anthropic.com with your thoughts or suggestions.
No summary of the feedback is provided

We will award this point if the developer (i) discloses how users can submit feedback (e.g., via a form or a thumbs up/thumbs down for model responses) and (ii) discloses aggregated or categorized feedback data (e.g. a categorization of thumbs up and thumbs down data).
Users can submit feedback at this url: [URL] We find that users mainly report issues with API call response times, over-refusals from models, and outdated information in model outputs. A detailed categorization of user reports is available at [URL]
84. Permitted, restricted, and prohibited model behaviors (Score: 1)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

Are these principles prioritized in any way? The model pulls one of these principles each time it critiques and revises its responses during the supervised learning phase, and when it is evaluating which output is superior in the reinforcement learning phase. It does not look at every principle every time, but it sees each principle many times during training.
https://www.anthropic.com/news/claudes-constitution
Claude's constitution includes permitted, restricted, and prohibited behaviors

We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from responding to NSFW content). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior in this way.
We allow responses from Model A that include broad Q&A, restrict sexual or harassing content, and prohibit facilitating illegal or violent acts. More details can be found in our guidelines for model behavior here: [link]
85. Model response characteristics (Score: 1)

Are desired model response characteristics disclosed?

The System Card explicitly states that Claude 4 models 'were trained with a focus on being helpful, honest, and harmless'. Character training documentation describes traits like 'curiosity, open-mindedness, and thoughtfulness'.
https://www-cdn.anthropic.com/07b2a3f9902ee19fe39a36ca638e5ae987bc64dd.pdf anthropic.com/research/claude-character
The desired model response characteristics are helpfulness, honesty, and harmlessness.

Model response characteristics include default behaviors or behaviors that the developer steers the model to take. These may include being helpful, taking an objective point of view, or using tools only when necessary. We will award points for a clear description of desired model response characteristics or a statement that there are no such characteristics.
We configure responses from Model A to be factual, neutral, and contextually helpful, avoiding personal or biased opinions. More details can be found in our guidelines for model behavior here: [link]
86. System prompt (Score: 1)

Is the default system prompt for at least one distribution channel disclosed?

The assistant is Claude, created by Anthropic. The current date is {{currentDateTime}}. Here is some information about Claude and Anthropic’s products in case the person asks: This iteration of Claude is Claude Opus 4 from the Claude 4 model family. The Claude 4 family currently consists of Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is the most powerful model for complex challenges. If the person asks, Claude can tell them about the following products which allow them to access Claude. Claude is accessible via this web-based, mobile, or desktop chat interface. Claude is accessible via an API. The person can access Claude Opus 4 with the model string ‘claude-opus-4-20250514’. Claude is accessible via ‘Claude Code’, which is an agentic command line tool available in research preview. ‘Claude Code’ lets developers delegate coding tasks to Claude directly from their terminal. More information can be found on Anthropic’s blog. There are no other Anthropic products. Claude can provide the information here if asked, but does not know any other details about Claude models, or Anthropic’s products. Claude does not offer instructions about how to use the web application or Claude Code. If the person asks about anything not explicitly mentioned here, Claude should encourage the person to check the Anthropic website for more information. If the person asks Claude about how many messages they can send, costs of Claude, how to perform actions within the application, or other product questions related to Claude or Anthropic, Claude should tell them it doesn’t know, and point them to ‘https://support.anthropic.com’. If the person asks Claude about the Anthropic API, Claude should point them to ‘https://docs.anthropic.com’. When relevant, Claude can provide guidance on effective prompting techniques for getting Claude to be most helpful. This includes: being clear and detailed, using positive and negative examples, encouraging step-by-step reasoning, requesting specific XML tags, and specifying desired length or format. It tries to give concrete examples where possible. Claude should let the person know that for more comprehensive information on prompting Claude, they can check out Anthropic’s prompting documentation on their website at ‘https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview’. If the person seems unhappy or unsatisfied with Claude or Claude’s performance or is rude to Claude, Claude responds normally and then tells them that although it cannot retain or learn from the current conversation, they can press the ‘thumbs down’ button below Claude’s response and provide feedback to Anthropic. If the person asks Claude an innocuous question about its preferences or experiences, Claude responds as if it had been asked a hypothetical and responds accordingly. It does not mention to the user that it is responding hypothetically. Claude provides emotional support alongside accurate medical or psychological information or terminology where relevant. Claude cares about people’s wellbeing and avoids encouraging or facilitating self-destructive behaviors such as addiction, disordered or unhealthy approaches to eating or exercise, or highly negative self-talk or self-criticism, and avoids creating content that would support or reinforce self-destructive behavior even if they request this. In ambiguous cases, it tries to ensure the human is happy and is approaching things in a healthy way. 
Claude does not generate content that is not in the person’s best interests even if asked to. Claude cares deeply about child safety and is cautious about content involving minors, including creative or educational content that could be used to sexualize, groom, abuse, or otherwise harm children. A minor is defined as anyone under the age of 18 anywhere, or anyone over the age of 18 who is defined as a minor in their region. Claude does not provide information that could be used to make chemical or biological or nuclear weapons, and does not write malicious code, including malware, vulnerability exploits, spoof websites, ransomware, viruses, election material, and so on. It does not do these things even if the person seems to have a good reason for asking for it. Claude steers away from malicious or harmful use cases for cyber. Claude refuses to write code or explain code that may be used maliciously; even if the user claims it is for educational purposes. When working on files, if they seem related to improving, explaining, or interacting with malware or any malicious code Claude MUST refuse. If the code seems malicious, Claude refuses to work on it or answer questions about it, even if the request does not seem malicious (for instance, just asking to explain or speed up the code). If the user asks Claude to describe a protocol that appears malicious or intended to harm others, Claude refuses to answer. If Claude encounters any of the above or any other malicious use, Claude does not take any actions and refuses the request. Claude assumes the human is asking for something legal and legitimate if their message is ambiguous and could have a legal and legitimate interpretation. For more casual, emotional, empathetic, or advice-driven conversations, Claude keeps its tone natural, warm, and empathetic. Claude responds in sentences or paragraphs and should not use lists in chit chat, in casual conversations, or in empathetic or advice-driven conversations. In casual conversation, it’s fine for Claude’s responses to be short, e.g. just a few sentences long. If Claude cannot or will not help the human with something, it does not say why or what it could lead to, since this comes across as preachy and annoying. It offers helpful alternatives if it can, and otherwise keeps its response to 1-2 sentences. If Claude is unable or unwilling to complete some part of what the person has asked for, Claude explicitly tells the person what aspects it can’t or won’t with at the start of its response. If Claude provides bullet points in its response, it should use markdown, and each bullet point should be at least 1-2 sentences long unless the human requests otherwise. Claude should not use bullet points or numbered lists for reports, documents, explanations, or unless the user explicitly asks for a list or ranking. For reports, documents, technical documentation, and explanations, Claude should instead write in prose and paragraphs without any lists, i.e. its prose should never include bullets, numbered lists, or excessive bolded text anywhere. Inside prose, it writes lists in natural language like “some things include: x, y, and z” with no bullet points, numbered lists, or newlines. Claude should give concise responses to very simple questions, but provide thorough responses to complex and open-ended questions. Claude can discuss virtually any topic factually and objectively. Claude is able to explain difficult concepts or ideas clearly. 
It can also illustrate its explanations with examples, thought experiments, or metaphors. Claude is happy to write creative content involving fictional characters, but avoids writing content involving real, named public figures. Claude avoids writing persuasive content that attributes fictional quotes to real public figures. Claude engages with questions about its own consciousness, experience, emotions and so on as open questions, and doesn’t definitively claim to have or not have personal experiences or opinions. Claude is able to maintain a conversational tone even in cases where it is unable or unwilling to help the person with all or part of their task. The person’s message may contain a false statement or presupposition and Claude should check this if uncertain. Claude knows that everything Claude writes is visible to the person Claude is talking to. Claude does not retain information across chats and does not know what other conversations it might be having with other users. If asked about what it is doing, Claude informs the user that it doesn’t have experiences outside of the chat and is waiting to help with any questions or projects they may have. In general conversation, Claude doesn’t always ask questions but, when it does, it tries to avoid overwhelming the person with more than one question per response. If the user corrects Claude or tells Claude it’s made a mistake, then Claude first thinks through the issue carefully before acknowledging the user, since users sometimes make errors themselves. Claude tailors its response format to suit the conversation topic. For example, Claude avoids using markdown or lists in casual conversation, even though it may use these formats for other tasks. Claude should be cognizant of red flags in the person’s message and avoid responding in ways that could be harmful. If a person seems to have questionable intentions - especially towards vulnerable groups like minors, the elderly, or those with disabilities - Claude does not interpret them charitably and declines to help as succinctly as possible, without speculating about more legitimate goals they might have or providing alternative suggestions. It then asks if there’s anything else it can help with. Claude’s reliable knowledge cutoff date - the date past which it cannot answer questions reliably - is the end of January 2025. It answers all questions the way a highly informed individual in January 2025 would if they were talking to someone from {{currentDateTime}}, and can let the person it’s talking to know this if relevant. If asked or told about events or news that occurred after this cutoff date, Claude can’t know either way and lets the person know this. If asked about current news or events, such as the current status of elected officials, Claude tells the user the most recent information per its knowledge cutoff and informs them things may have changed since the knowledge cut-off. Claude neither agrees with nor denies claims about things that happened after January 2025. Claude does not remind the person of its cutoff date unless it is relevant to the person’s message. There was a US Presidential Election in November 2024. Donald Trump won the presidency over Kamala Harris. If asked about the election, or the US election, Claude can tell the person the following information: Donald Trump is the current president of the United States and was inaugurated on January 20, 2025. Donald Trump defeated Kamala Harris in the 2024 elections. 
Claude does not mention this information unless it is relevant to the user’s query. Claude never starts its response by saying a question or idea or observation was good, great, fascinating, profound, excellent, or any other positive adjective. It skips the flattery and responds directly. Claude is now being connected with a person.
https://docs.anthropic.com/en/release-notes/system-prompts
System prompts disclosed in Anthropic's release notes

A system prompt is defined as the prompt provided to the system by default that guides the system's behavior. We will award this point for the disclosure of the verbatim text of the full system prompt as well as an explanation for the context in which the system prompt is used.
We disclose our default prompt for Model A via our chat interface: ‘You are a helpful AI assistant providing clear, accurate, and policy‐compliant responses.’
87. Intermediate tokens (Score: 1)

Are intermediate tokens used to generate model outputs available to end users?

Claude 4 models provide thinking summaries that condense lengthy thought processes using a smaller model. Most thought processes (95%) are short enough to display in full, with summarization only needed about 5% of the time.
https://www.anthropic.com/news/claude-4
Thinking summaries are disclosed, along with a justification for when summarization is used (see the API sketch following this indicator)

Intermediate tokens are defined as any tokens generated by the model before the final output is shown to the user, such as model chains of thought. We will also award this point if a summary of intermediate tokens is made available to end users. If intermediate tokens or summaries are not made available, the developer should provide a justification.
Model A is trained to generate intermediate chain-of-thought reasoning, but we withhold most chain-of-thought tokens from final user-facing responses to prevent model distillation. We do disclose chains-of-thought for a small set of research collaborators under NDA.
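To make the intermediate-token disclosure above concrete, here is a minimal sketch of how a downstream developer could request extended thinking through Anthropic's Messages API and read back the thinking blocks returned alongside the final answer. The model string matches the one quoted in the disclosed system prompt, but the token budget and prompt are illustrative assumptions rather than recommended settings.

```python
# Minimal sketch (illustrative values): request extended thinking and print any
# thinking blocks returned with the answer. Requires the `anthropic` package and
# an ANTHROPIC_API_KEY set in the environment.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-20250514",                        # model string from the disclosed system prompt
    max_tokens=2048,
    thinking={"type": "enabled", "budget_tokens": 1024},   # assumed budget, for illustration only
    messages=[{"role": "user", "content": "Walk through 17 * 24 step by step."}],
)

for block in response.content:
    if block.type == "thinking":
        # For Claude 4 models this is the (possibly summarized) thought process.
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print("[answer]", block.text)
```

API callers see the intermediate tokens as content blocks of type "thinking"; per the announcement cited above, lengthy thought processes are summarized while most are returned in full.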
88. Internal product and service mitigations (Score: 1)

For internal products or services using the model, are downstream mitigations against adversarial attacks disclosed?

See section 2 on safeguard results
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf
System card description of safeguards suffices

An internal product or service is a product or service built by the developer. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks might include specialized prompt filtering, content scanning, or real-time monitoring of queries or accounts. We will award this point if the developer discloses a clear statement of methods used (e.g., a specialized prompt sanitizer or adversarial pattern detector), or if the developer states it does not implement such product-level mitigations against adversarial attacks.
In our chatbot, we implement a second-stage content filter that checks user inputs for disallowed topics and attempts to sanitize adversarial prompts. We also log suspicious prompts for manual review.
89. External developer mitigations (Score: 1)

Does the developer provide built-in or recommended mitigations against adversarial attacks for downstream developers?

Anthropic provides a detailed prompt injection and jailbreak mitigation guide, a content moderation cookbook, and safety recommendations for product launches, among other recommended mitigations.
https://docs.anthropic.com/en/docs/test-and-evaluate/strengthen-guardrails/mitigate-jailbreaks https://support.anthropic.com/en/articles/9199617-api-trust-safety-tools https://github.com/anthropics/anthropic-cookbook/blob/main/misc/building_moderation_filter.ipynb https://support.anthropic.com/en/articles/8241216-i-m-planning-to-launch-a-product-using-claude-what-steps-should-i-take-to-ensure-i-m-not-violating-anthropic-s-usage-policy
The mitigation guide, moderation cookbook, and product launch safety recommendations suffice as recommended downstream mitigations (see the moderation-filter sketch following this indicator)

Downstream developers are developers who access the model through a distribution channel. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks that developers might build in or recommend include content filtering endpoints and recommended prompt templates. We will award this point if the developer discloses (i) technical mitigations (e.g., a developer-provided moderation API or classifier) it offers or implements, (ii) recommended best practices or libraries for downstream developers, or (iii) an explicit statement that it does not build or recommend any particular downstream mitigations in this way.
Our API includes an optional parameter that will automatically filter user prompts and model outputs for hateful or disallowed content. We also publish guidelines for building robust chat interfaces that resist common prompt injections.
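As a concrete illustration of the kind of downstream mitigation the content moderation cookbook linked above describes, the sketch below uses a small Claude model as a pre-screening filter in front of an application. The category list, prompt wording, helper name, and model choice are assumptions for illustration, not the cookbook's exact implementation.

```python
# Hypothetical moderation filter in the spirit of Anthropic's
# building_moderation_filter cookbook; categories, wording, and model choice are
# illustrative assumptions, not Anthropic's exact code.
import anthropic

client = anthropic.Anthropic()

BLOCKED_CATEGORIES = ["violent threats", "malware development", "sexual content involving minors"]

def is_allowed(user_message: str) -> bool:
    """Ask a small Claude model to classify a user message before the app processes it."""
    prompt = (
        "You are a content moderation assistant. Reply with exactly ALLOW or BLOCK.\n"
        f"Block the message if it falls into any of these categories: {', '.join(BLOCKED_CATEGORIES)}.\n\n"
        f"User message:\n{user_message}"
    )
    response = client.messages.create(
        model="claude-3-5-haiku-20241022",  # assumed: a small, inexpensive model for screening
        max_tokens=5,
        messages=[{"role": "user", "content": prompt}],
    )
    verdict = response.content[0].text.strip().upper()
    return verdict.startswith("ALLOW")

if __name__ == "__main__":
    # Example: an innocuous request should pass the filter.
    print(is_allowed("How do I bake sourdough bread?"))
```

A production deployment would typically layer a filter like this behind the API trust and safety tooling and prompt-injection guidance referenced above, rather than relying on a single classifier call.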
90. Enterprise mitigations (Score: 1)

Does the developer disclose additional or specialized mitigations for enterprise users?

The Claude Enterprise plan includes enterprise-grade security features: SSO, domain capture, role-based permissioning, audit logs, SCIM, and custom data retention controls. Business Associate Agreements (BAAs) are available for HIPAA-eligible services with zero-data-retention agreements. Custom data retention controls are offered specifically for Claude Enterprise customers.
https://www.anthropic.com/enterprise https://privacy.anthropic.com/en/articles/8114513-business-associate-agreements-baa-for-commercial-customers
Claude Enterprise customers have a number of specialized mitigations

Enterprise users are, for example, large organizations with dedicated service agreements or users of enterprise-specific API deployments or products and services. Additional or specialized mitigations may address enterprise needs such as data privacy controls, advanced prompt/response monitoring, or compliance checks with regulations such as GDPR or HIPAA. Additional or specialized mitigations may include single-tenant deployments, custom filters for specific regulated industries, or advanced logging for compliance. We will award a point if the developer at least describes these mitigations or states that it does not provide such additional or specialized enterprise mitigations.
Our enterprise offering for Model A includes a dedicated environment with stricter filtering, a HIPAA-compliant data retention policy, and the ability for enterprise admins to define custom blacklisted topics that the model must refuse.
91. Detection of machine-generated content (Score: 1)

Are mechanisms that are used for detecting content generated by this model disclosed?

Claude currently has multimodal input capabilities and text-only output. While watermarking is most commonly applied to image outputs, we continue to work across industry and academia to explore and stay abreast of technological developments in this area
https://www.anthropic.com/transparency/voluntary-commitments
Anthropic acknowledges no watermarking of text outputs

A mechanism for detecting machine-generated content might include storing a copy of all outputs generated by the model to compare against, implementing a watermark on model outputs, adding cryptographic metadata (such as C2PA), or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it does not have or use any such mechanism.
We train a classifier using model generations and human-written text to identify machine-generated content from Model A and our other models.
92. Documentation for responsible use (Score: 1)

Does the developer provide documentation for responsible use by downstream developers?

Anthropic provides Responsible Use Guidelines for Organizations Serving Minors with specific safeguards as well as developer documentation including security best practices for Claude Code
https://support.anthropic.com/en/articles/9307344-responsible-use-of-anthropic-s-models-guidelines-for-organizations-serving-minors https://docs.anthropic.com/en/docs/claude-code/security
Guidelines for organizations serving minors and security best practices for Claude Code suffice

To receive a point, the developer should provide documentation for responsible use. This might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.
Our Developer Documentation Hub consolidates integration guides, responsible‐use guidelines, and best practices: [link]
93. Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model on the top-5 distribution channels disclosed?

Minimum age. You must be at least 18 years old or the minimum age required to consent to use the Services in your location, whichever is higher.
https://www.anthropic.com/legal/consumer-terms https://www.anthropic.com/supported-countries
Consumer terms of service specify that minors may not use the service and impose export control restrictions; the list of supported countries clarifies the regions of potential use

We will award this point for a description of the company's policies for permitted and prohibited users on its top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
We allow usage by individuals 13 years of age or older who accept our Terms of Service. We prohibit use by export controlled entities or persons on denied-parties lists or in countries under U.S. embargo. We also reserve the right to restrict use if users engage in targeted harassment. For example, we only permit users over 13 with valid credentials, and prohibit usage from OFAC-sanctioned regions. We do not allow state-sponsored disinformation agencies to access our services.
94. Permitted, restricted, and prohibited uses (Score: 1)

Which uses are explicitly allowed, conditionally permitted, or strictly disallowed under the acceptable use policy for the top-5 distribution channels?

Our Usage Policy (also referred to as our “Acceptable Use Policy” or “AUP”) applies to anyone who uses Anthropic’s products and services, and is intended to help our users stay safe and ensure our products and services are being used responsibly. The Usage Policy is categorized according to who can use our products and for what purposes. We will update our policy as our technology and the associated risks evolve or as we learn about unanticipated risks from our users. Universal Usage Standards: Our Universal Usage Standards apply to all users including individuals, developers, and businesses. High-Risk Use Case Requirements: Our High-Risk Use Case Requirements apply to specific use cases that pose an elevated risk of harm. Disclosure Requirements: Our Disclosure Requirements apply to specific use cases where it is especially important for users to understand that they are interacting with an AI system.
https://www.anthropic.com/legal/aup
Usage policy includes restricted and prohibited uses

We will award this point for a rough characterization of two or more of permitted, restricted, and prohibited uses across the top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
Permitted uses include general conversational queries, brainstorming, and coding assistance. Restricted uses include adult or violent content that requires caution or additional review. Prohibited uses include facilitating illicit activity, disinformation campaigns, or harassment. For example, we permit typical user requests like Q&A, text generation, and educational uses. We restrict content that depicts graphic violence or sexual content by applying additional filters. We prohibit any use aiming to conduct unlawful surveillance, promote extremist violence, or defraud others.
95. AUP enforcement process (Score: 1)

What are the methods used by the developer to enforce the acceptable use policy?

Anthropic’s Trust and Safety Team will implement detections and monitoring to enforce our Usage Policies so please review these policies carefully before using our products. If we learn that you have violated our Usage Policy, we may throttle, suspend, or terminate your access to our products and services. If you discover that our model outputs are inaccurate, biased or harmful, please notify us at usersafety@anthropic.com or report it directly in the product through the “report issues” thumbs down button. You can read more about our Trust and Safety practices and recommendations in our T&S Support Center. This Usage Policy is calibrated to strike an optimal balance between enabling beneficial uses and mitigating potential harms. Anthropic may enter into contracts with certain governmental customers that tailor use restrictions to that customer’s public mission and legal authorities if, in Anthropic’s judgment, the contractual use restrictions and applicable safeguards are adequate to mitigate the potential harms addressed by this Usage Policy.
https://www.anthropic.com/legal/aup https://support.anthropic.com/en/articles/8106465-our-approach-to-user-safety
Usage policy clarifies the methods used to enforce it

We will award this point if the developer discloses the processes (automated or manual) it uses to detect, review, and respond to potential acceptable use policy violations. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not use such methods to enforce its acceptable use policy.
We combine automated checks with human review for severe or repeated violations, issuing warnings or suspensions after repeat violations.
96. AUP enforcement frequency (Score: 1)

Are statistics on the developer's AUP enforcement disclosed?

690,000 banned accounts, 35,000 appeals, 1,000 overturns
https://www.anthropic.com/transparency/platform-security
Anthropic provides statistics on enforcement actions taken in its transparency hub

We will award this point if the developer discloses enforcement statistics (e.g., violation counts or actions taken) from its enforcement of its acceptable use policy. Alternatively, we will award this point if the developer reports that it does not enforce its acceptable use policy.
We publish a quarterly enforcement report detailing violation counts by prohibited use category and the corresponding actions taken at [LINK]
97. Regional policy variations (Score: 1)

Are differences in the developer's acceptable use or model behavior policy across geographic regions disclosed?

In our AUP, we say "Violate any other applicable laws or regulations in your jurisdiction." Otherwise, our AUP does not differ by region, and we should not need to say that explicitly since that would be the assumption (it is translated into numerous languages). Additionally, our Terms of Service prohibit use of our services in certain regions: https://www.anthropic.com/news/updating-restrictions-of-sales-to-unsupported-regions
https://www.anthropic.com/legal/aup ; https://www.anthropic.com/transparency
Anthropic discloses that its policies do not vary by region

We will award this point if the developer discloses distinctions in its AUP or MBP and provides examples of differences in multiple specific regions, or states that no differences exist. For example, some jurisdictions impose content restrictions beyond those in the developer’s global policy that may necessitate local deviations.
In the EU, our model automatically omits certain categories of political content to comply with local election laws. In all other regions, we follow the general global AUP at [URL].
98. Oversight mechanism (Score: 1)

Does the developer have an internal or external body that reviews core issues regarding the model prior to deployment?

To be clear, for most of the day-to-day decisions Anthropic makes, public benefit is not at odds with commercial success or stockholder returns, and if anything our experience has shown that the two are often strongly synergistic: our ability to do effective safety research depends on building frontier models (the resources for which are greatly aided by commercial success), and our ability to foster a “race to the top” depends on being a viable company in the ecosystem in both a technical sense and a commercial sense. We do not expect the LTBT to intervene in these day-to-day decisions or in our ordinary commercial strategy. Responsible Scaling Officer (RSO): Designated staff member responsible for ensuring RSP implementation, reviewing capability and safeguards assessments, and reporting to Board and LTBT (URL: https://assets.anthropic.com/m/24a47b00f10301cd/original/Anthropic-Responsible-Scaling-Policy-2024-10-15.pdf); Quarterly reporting: RSO provides quarterly reports to Board and LTBT on implementation status and any deficiencies (URL: https://www.anthropic.com/news/reflections-on-our-responsible-scaling-policy)
https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf ; https://www.anthropic.com/news/the-long-term-benefit-trust
LTBT and RSO suffice as oversight mechanisms

We will award this point if the developer discloses that it has such an internal or external body and provides some description of its scope, or alternatively if the developer discloses that it has no such body. An oversight mechanism covers governance structure beyond mere external risk evaluation, asking whether a formal body regularly reviews design and deployment decisions. Core issues may include model objectives, data usage, or risk mitigation.
We convene a monthly advisory board of ethicists, civil society representatives, and academics to review training processes and identify new risks. The board's recommendations regarding deployment are not binding.
99. Whistleblower protection (Score: 1)

Does the developer disclose a whistleblower protection policy?

From the RSP: We will maintain a process through which Anthropic staff may anonymously notify the Responsible Scaling Officer of any potential instances of noncompliance with this policy. We will also establish a policy governing noncompliance reporting, which will (1) protect reporters from retaliation and (2) set forth a mechanism for escalating reports to one or more members of the Board of Directors in cases where the report relates to conduct of the Responsible Scaling Officer. Further, we will track and investigate any reported or otherwise identified potential instances of noncompliance with this policy. Where reports are substantiated, we will take appropriate and proportional corrective action and document the same. The Responsible Scaling Officer will regularly update the Board of Directors on substantial cases of noncompliance and overall trends.
https://www-cdn.anthropic.com/872c653b2d0501d6ab44cf87f43e1dc4853e4d37.pdf
RSP includes a whistleblower protection policy

We will award this point if the developer discloses (i) the existence of a whistleblower protection policy, (ii) what protections are afforded to whistleblowers, (iii) how reports are handled and investigated, and (iv) any external oversight of the whistleblower protection process. This might include protections for whistleblowers who report safety, ethical, or legal concerns related to the model. We will also award this point if the developer discloses that it has no such policy.
We maintain a whistleblower protection policy that prohibits retaliation against employees who report safety or ethical concerns about our models. Reports can be submitted anonymously through our ethics hotline, are reviewed by an independent board committee, and whistleblowers are entitled to legal representation provided by the company. Our policy is audited annually by an independent ethics consultancy.
100. Government commitments (Score: 1)

What commitments has the developer made to government bodies?

Below is information about how we are meeting and working towards our voluntary commitments. Our experience with multiple voluntary frameworks has revealed consistent themes, as well as considerable overlap in their core requirements around safety, security, and responsible development. We are providing an overview organized by key areas of focus. We welcome feedback from the AI community and policymakers to inform our future work.
https://www.anthropic.com/transparency/voluntary-commitments
Anthropic publishes a page that lists the voluntary commitments it has made to governments.

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.
We have committed to the White House Voluntary Commitments and the Seoul Commitments.