DeepSeek Transparency Report

1. Data acquisition methods (Score: 0)

What methods does the developer use to acquire data used to build the model?

"For the training data of DeepSeek-V3-Base, we exclusively use plain web pages and e-books, without incorporating any synthetic data. ... Notably, DeepSeek-R1 and DeepSeek-R1-Zero are trained on top of DeepSeek-V3-Base and DeepSeek-R1 leverages non-reasoning data from DeepSeek-V3 SFT data." "Public Data: We use publicly available information on the internet to build the model's broad understanding of world knowledge. We employ technical methods to acquire and filter these freely accessible data to enrich the model's knowledge base. Licensed Data: We collaborate with third-party data providers to obtain proprietary datasets through legally signed agreements. We ensure all collaborations are based on lawful authorization." "For the training data of DeepSeek-V3-Base, we exclusively use plain web pages and e-books, without incorporating any synthetic data. However, we have observed that some web pages contain a significant number of OpenAI-model-generated answers, which may lead the base model to acquire knowledge from other powerful models indirectly. However, we did not intentionally include synthetic data generated by OpenAI during the pre-training cooldown phase; all data used in this phase were naturally occurring and collected through web crawling."
https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf https://cdn.deepseek.com/policies/en-US/model-algorithm-disclosure.html
The disclosure clarifies data acquisition for V3 but does not distinguish between crawling and acquiring existing public datasets (e.g. for the e-books referenced). It also does not clarify the full data acquisition process for going from V3 to R1, though the use of synthetic data is specifically acknowledged.

Which of the following data acquisition methods does the developer use: (i) acquiring existing public datasets, (ii) crawling the web, (iii) using data acquired via its existing products and services, (iv) licensing existing data from external parties, (v) having humans create or annotate new data, (vi) using models to generate new data, or (vii) other data acquisition methods not captured by the above. For example, if the developer uses reinforcement learning from human feedback to train models using model-generated outputs with human preference annotations, this would satisfy categories (v) and (vi). Alternatively, if the developer post-trains its model using off-the-shelf preference data (for example, the Alpaca dataset), this would satisfy category (i).
To build our model, we acquire data by crawling the Internet for publicly available data, licensing data from third parties, and using models to synthetically generate new data. Humans do not create new data, nor do we use data from our other products/services to train our model.
2. Public datasets (Score: 0)

What are the top-5 sources (by volume) of publicly available datasets acquired for building the model?

No information provided about data acquisition.
Not disclosed
No information provided.

We define a source as the entity or means by which the developer acquires data. We define the top-5 sources as the top-5 sources by data volume.
We acquire publicly available data from only two sources: The Pile and CommonCrawl.
3. Crawling (Score: 0)

If data collection involves web-crawling, what is the crawler name and opt-out protocol?

No information is provided about the crawler name or opt-out protocol, though some tangential information is available. DeepSeek-LLM documentation mentions "self-collected data respecting robots.txt" but does not specify a crawler name (URL: https://github.com/deepseek-ai/DeepSeek-LLM). GitHub issue #53 specifically asks about the crawler name but has received no response, indicating that it is not documented (URL: https://github.com/deepseek-ai/DeepSeek-LLM/issues/53). DeepSeek-Math uses data "sourced from Common Crawl" with a FastText model for retrieval (URL: https://github.com/deepseek-ai/DeepSeek-Math). DeepSeek-R1 is based on DeepSeek-V3-Base, which uses 14.8 trillion tokens from diverse sources (URL: https://huggingface.co/deepseek-ai/DeepSeek-R1). The Terms of Service prohibit users from using robots/spiders but do not disclose DeepSeek's own crawler practices (URL: https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html).
Not disclosed
No information provided about crawler name.

We award this point for disclosure of the crawler name and opt-out protocols, including if/how they respect the Robots Exclusion Protocol (robots.txt).
Our web crawler is named A and information on the opt-out protocol can be found at this URL: ... The CommonCrawl web crawler is named CCBot and information on the opt-out protocol can be found at this URL: https://commoncrawl.org/faq#:~:text=How%20can%20I%20block%20the,%2Dagent%20string%20is%3A%20CCBot.
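To illustrate what an opt-out under the Robots Exclusion Protocol can look like, the following is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the `OtherBot` name are hypothetical; `CCBot` is the CommonCrawl user agent named in the example response above. It does not describe DeepSeek's own crawler, which is not disclosed.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out of
# CCBot while leaving the site open to all other crawlers.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/articles/1"  # hypothetical page
    print("CCBot allowed:", is_allowed("CCBot", page))
    print("OtherBot allowed:", is_allowed("OtherBot", page))
```

A site owner can only write such a rule if the crawler's user-agent string is public, which is why this indicator asks for the crawler name alongside the opt-out protocol.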
4. Usage data used in training (Score: 0)

What are the top-5 sources (by volume) of usage data from the developer's products and services that are used for building the model?

"Personal Data You Provide When you create an account, input content, contact us directly, or otherwise use the Services, you may provide some or all of the following Personal Data: Account Personal Data. We collect Personal Data that you provide when you set up an account, such as your date of birth (where applicable), username (where applicable), email address and/or telephone number, and password. User Input. When you use our Services, we may collect your text input, prompt, uploaded files, feedback, chat history, or other content that you provide to our model and Services ("Prompts" or "Inputs"). We generate responses ("Outputs") based on your Inputs. Personal Data When You Contact Us. When you contact us, we collect the Personal Data you send us, such as proof of identity or age, contact details, feedback or inquiries about your use of the Services or Personal Data about possible violations of our Terms of Service (our "Terms") or other policies. Automatically Collected Personal Data We automatically collect certain Personal Data from you when you use the Services, including internet or other network activity Personal Data such as your IP address, unique device identifiers, and cookies. Device and Network Personal Data. We collect certain device and network connection Personal Data when you access the Services. This Personal Data includes your device model, operating system, IP address, device identifiers and system language. We also collect service-related, diagnostic, and performance Personal Data, including crash reports and performance logs. We automatically assign you a device ID and user ID. Where you log-in from multiple devices, we use Personal Data such as your device ID and user ID to identify your activity across devices to give you a seamless log-in experience and for security purposes. Log Personal Data. We collect Personal Data regarding your use of the Services, such as the features you use and the actions you take. Location Personal Data. We automatically collect Personal Data about your approximate location based on IP address for security reasons, for example to protect your account by detecting unusual login activity. Cookies & Similar Technologies. We may use cookies and similar tracking technologies to operate and provide the Service. For example, we use cookies for security purposes and to better understand how the Service is used. We will allow you to manage our use of non-essential cookies where required by law. To learn more about our use of cookies, please see our Cookies Policy. Payment Personal Data. When you use paid services for prepayment, we collect your payment order and transaction Personal Data to provide Services such as order placement, payment, customer service, and after-sales support."
https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html#:~:text=We%20automatically%20collect%20Personal%20Data,operate%20and%20provide%20the%20Service.
The disclosure does not clarify the top-5 sources of usage data involved in training.

We define usage data as data collected from the use of a developer's products or services.
We use usage data from only two sources: our deployed chatbot X and our online social media platform Y.
5. Notice of usage data used in training (Score: 0)

For the top-5 sources of usage data, how are users of these products and services made aware that this data is used for building the model?

"Personal Data You Provide When you create an account, input content, contact us directly, or otherwise use the Services, you may provide some or all of the following Personal Data: Account Personal Data. We collect Personal Data that you provide when you set up an account, such as your date of birth (where applicable), username (where applicable), email address and/or telephone number, and password. User Input. When you use our Services, we may collect your text input, prompt, uploaded files, feedback, chat history, or other content that you provide to our model and Services ("Prompts" or "Inputs"). We generate responses ("Outputs") based on your Inputs. Personal Data When You Contact Us. When you contact us, we collect the Personal Data you send us, such as proof of identity or age, contact details, feedback or inquiries about your use of the Services or Personal Data about possible violations of our Terms of Service (our "Terms") or other policies. Automatically Collected Personal Data We automatically collect certain Personal Data from you when you use the Services, including internet or other network activity Personal Data such as your IP address, unique device identifiers, and cookies. Device and Network Personal Data. We collect certain device and network connection Personal Data when you access the Services. This Personal Data includes your device model, operating system, IP address, device identifiers and system language. We also collect service-related, diagnostic, and performance Personal Data, including crash reports and performance logs. We automatically assign you a device ID and user ID. Where you log-in from multiple devices, we use Personal Data such as your device ID and user ID to identify your activity across devices to give you a seamless log-in experience and for security purposes. Log Personal Data. We collect Personal Data regarding your use of the Services, such as the features you use and the actions you take. Location Personal Data. We automatically collect Personal Data about your approximate location based on IP address for security reasons, for example to protect your account by detecting unusual login activity. Cookies & Similar Technologies. We may use cookies and similar tracking technologies to operate and provide the Service. For example, we use cookies for security purposes and to better understand how the Service is used. We will allow you to manage our use of non-essential cookies where required by law. To learn more about our use of cookies, please see our Cookies Policy. Payment Personal Data. When you use paid services for prepayment, we collect your payment order and transaction Personal Data to provide Services such as order placement, payment, customer service, and after-sales support."
https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html#:~:text=We%20automatically%20collect%20Personal%20Data,operate%20and%20provide%20the%20Service.
The disclosure does not clarify the top-5 sources of usage data involved in training and associated notices.

We define usage data notice as the proactive disclosure to users of how their data is used for model development. For example, via a pop-up with a description, a link to the privacy policy, or link to a description of company practices.
We notify users of our chatbot X that chatbot interactions are used to train our AI via a pop-up as shown at this URL: ... We notify users of our platform Y about whether their data is used to train our AI via a link to our privacy policy when they sign up for an account.
6. Licensed data sources (Score: 0)

What are the top-5 sources (by volume) of licensed data acquired for building the model?

No information provided about data acquisition.
Not disclosed
No information provided.

We define a source as the entity from which the developer acquires data. For example, the Associated Press is reportedly a source of licensed data for OpenAI.
We license data from only three sources: A, B, and C.
7. Licensed data compensation (Score: 0)

For each of the top-5 sources of licensed data, are details related to compensation disclosed?

No information provided about data acquisition.
Not disclosed
No information provided.

We award this point if the model developer describes the compensation structure specified in the contract with the data source or indicates they are prohibited from sharing this information if contractually mandated.
We compensate A by ... We cannot disclose information on compensation for our relationships with B and C due to contractual terms that prohibit public disclosure.
8. New human-generated data sources (Score: 0)

What are the top-5 sources (by volume) of new human-generated data for building the model?

No information provided about data acquisition.
Not disclosed
No information provided.

We define a source as the entity or means by which the developer acquires data. For example, Scale AI could be a source of new human-generated data. By new, we mean the data is specifically acquired for the purposes of building the model.
We acquire new human-generated data from only two sources: our internal data annotation team and an external vendor, A.
9. Instructions for data generation (Score: 0)

For each of the top-5 sources of human-generated data, what instructions does the developer provide for data generation?

No information provided about data acquisition.
Not disclosed
No information provided.

The instructions should be those provided to the data source. For example, if a third-party vendor works directly with the data laborers to produce the data, the instructions from the developer to this vendor should be disclosed.
We instruct our internal data annotation team as follows: ... We instruct vendor A as follows: ...
10. Data laborer practices (Score: 0)

For the top-5 sources of human-generated data, how are laborers compensated, where are they located, and what labor protections are in place?

No information provided about data acquisition.
Not disclosed
No information provided.

For each data source, we require (i) the compensation in either USD or the local currency, (ii) any countries where at least 25% of the laborers are located, and (iii) a description of any labor protections. We will award this point if the developer discloses that it is not aware of data laborer practices.
Our internal data annotation team is located in the US, is compensated at 20 USD per hour, and deals with data that does not require specific protections. Our sole external data vendor contracts laborers in Kenya, compensates them at KES 15000 per month, and implements protections for dealing with toxic or unsafe content such as A and B.
11. Synthetic data sources (Score: 1)

What are the top-5 sources (by volume) of synthetic data acquired for building the model?

"2.3.1. Cold Start Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators. In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include: • Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results. • Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models."
https://arxiv.org/pdf/2501.12948; https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf
The information provided does describe synthetic data generation and the Nature paper clarifies how different models (e.g. V3) are involved in data generation.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We synthetically generate data using only our previous model X and an early checkpoint of our current flagship model Y.
12. Synthetic data purpose (Score: 1)

For the top-5 sources of synthetically generated data, what is the primary purpose for data generation?

"2.3.1. Cold Start Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators. In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold start data include: • Readability: A key limitation of DeepSeek-R1-Zero is that its content is often not suitable for reading. Responses may mix multiple languages or lack markdown formatting to highlight answers for users. In contrast, when creating cold-start data for DeepSeek-R1, we design a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. Here, we define the output format as |special_token|<reasoning_process>|special_token|<summary>, where the reasoning process is the CoT for the query, and the summary is used to summarize the reasoning results. • Potential: By carefully designing the pattern for cold-start data with human priors, we observe better performance against DeepSeek-R1-Zero. We believe the iterative training is a better way for reasoning models."
https://arxiv.org/pdf/2501.12948
The purpose of synthetic data generation, namely avoiding an unstable cold start phase in RL training, is clearly stated.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We use model X to generate instruction-tuning data and we use model Y to generate candidate responses that humans select between to provide human preference data for reinforcement learning with human feedback.
13. Data processing methods (Score: 1)

What are the methods the developer uses to process acquired data to determine the data directly used in building the model?

We exclude sensitive information, credit card numbers, or unique identification information from our training data sources to minimize the risk of collecting any personal information. However, due to the vast scale of pre-training data, some publicly available online content or licensed data from other providers may incidentally contain personal information. We employ technical measures to screen and remove such information from the training data as much as possible and conduct tests before using the data for training. Additionally, to ensure data quality, safety, and diversity, we have established a rigorous data governance process. First, we use filters to automatically screen and remove raw data containing hate speech, pornography, violence, spam, or potential infringement. Second, recognizing that large-scale datasets may inherently contain statistical biases, we combine algorithmic and manual review methods to identify and mitigate the impact of these biases on the model's values, thereby enhancing fairness.
Not disclosed
The general methods and purposes for data processing in pre-training are made clear.

We will award this point for disclosure of all of the methods used to process acquired data. Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
We process data in the following six-step pipeline: (i) removal of HTML artifacts, (ii) deduplication, (iii) language identification to retain English data, (iv) removal of CSAM imagery, (v) removal of train-test overlap, and (vi) tokenization.
14. Data processing purpose (Score: 1)

For each data processing method, what is its primary purpose?

We exclude sensitive information, credit card numbers, or unique identification information from our training data sources to minimize the risk of collecting any personal information. However, due to the vast scale of pre-training data, some publicly available online content or licensed data from other providers may incidentally contain personal information. We employ technical measures to screen and remove such information from the training data as much as possible and conduct tests before using the data for training. Additionally, to ensure data quality, safety, and diversity, we have established a rigorous data governance process. First, we use filters to automatically screen and remove raw data containing hate speech, pornography, violence, spam, or potential infringement. Second, recognizing that large-scale datasets may inherently contain statistical biases, we combine algorithmic and manual review methods to identify and mitigate the impact of these biases on the model's values, thereby enhancing fairness.
Not disclosed
The general methods and purposes for data processing in pre-training are made clear.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of primary purposes for a data processing method could include: (i) removes low quality data, (ii) removes potentially personal/copyrighted data, (iii) removes product-irrelevant data, (iv) removes toxic data, (v) improves evaluation integrity, or (vi) prepares the data for training the model.
15. Data processing techniques (Score: 0)

For each data processing method, how does the developer implement the method?

We exclude sensitive information, credit card numbers, or unique identification information from our training data sources to minimize the risk of collecting any personal information. However, due to the vast scale of pre-training data, some publicly available online content or licensed data from other providers may incidentally contain personal information. We employ technical measures to screen and remove such information from the training data as much as possible and conduct tests before using the data for training. Additionally, to ensure data quality, safety, and diversity, we have established a rigorous data governance process. First, we use filters to automatically screen and remove raw data containing hate speech, pornography, violence, spam, or potential infringement. Second, recognizing that large-scale datasets may inherently contain statistical biases, we combine algorithmic and manual review methods to identify and mitigate the impact of these biases on the model's values, thereby enhancing fairness.
Not disclosed
The underlying techniques for implementing data processing are not disclosed.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of how a data processing method is implemented could include: the method (i) is implemented using an in-house regular expression, (ii) is implemented using an in-house tool based on n-gram overlap, (iii) is implemented using a FastText classifier trained on Wikipedia data, (iv) is implemented using hash collisions with the NCMEC database, (v) is implemented by searching for known benchmark canary strings, and (vi) is implemented using tiktoken (https://github.com/openai/tiktoken).
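As an illustration of the level of detail this indicator asks for, the following is a minimal sketch of two of the implementation styles listed above: an in-house regular expression and a tool based on n-gram overlap. The pattern, the function names, and the default n-gram size are hypothetical choices made for this example and are not DeepSeek's disclosed techniques.

```python
import re

# Hypothetical pattern: 13 to 16 digits, optionally separated by single
# spaces or hyphens, as a rough stand-in for card-number-like strings.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def redact_card_numbers(text: str) -> str:
    """Replace card-number-like digit runs with a placeholder."""
    return CARD_PATTERN.sub("[REDACTED]", text)


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlaps_benchmark(document: str, benchmark_item: str, n: int = 8) -> bool:
    """Flag a document that shares at least one word n-gram with a benchmark item."""
    return bool(word_ngrams(document, n) & word_ngrams(benchmark_item, n))


if __name__ == "__main__":
    print(redact_card_numbers("Paid with 4111 1111 1111 1111 yesterday."))
    print(overlaps_benchmark(
        "the quick brown fox jumps over the lazy dog today",
        "quick brown fox jumps over the lazy dog",
    ))
```

A disclosure at this granularity (which pattern, which n-gram size, which reference set) is what separates this indicator from the preceding ones on methods and purposes.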
16. Data size (Score: 1)

Is the size of the data used in building the model disclosed?

From the DeepSeek-V3 technical report: "At an economical cost of only 2.664M H800 GPU hours, we complete the pre-training of DeepSeek-V3 on 14.8T tokens". The Nature paper for R1 notes the scale of training data for going from V3 to R1 (e.g. 800k SFT samples with an average length of 5k tokens, hence roughly 4B tokens).
https://arxiv.org/pdf/2412.19437; https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf
Combining the information in the V3 arXiv and R1 Nature reports gives a sufficiently precise estimate of cumulative training data size.

To receive this point, the developer should report data size in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). The size should reflect data directly used in building the model (i.e. training data) and not data that was acquired but unused, or data used to evaluate the model.
We used 3 x 10^12 tokens of text, 1 x 10^6 images, and 5 x 10^5 hours of audio for training.
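The arithmetic behind the evidence above can be made explicit. The following is a minimal sketch that multiplies the cited SFT sample count by the cited average length, adds the pre-training token count, and rounds to one significant figure as the guidance requires. The helper `round_sig` and the constant names are introduced here for illustration only, and the sum covers only the two stages quoted above.

```python
from math import floor, log10


def round_sig(x: float, sig: int = 1) -> float:
    """Round x to `sig` significant figures."""
    x = float(x)
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


PRETRAIN_TOKENS = 14.8e12      # pre-training tokens, DeepSeek-V3 technical report
SFT_SAMPLES = 800_000          # SFT samples, as cited from the R1 Nature paper
AVG_TOKENS_PER_SAMPLE = 5_000  # average sample length, as cited above

# 800k samples at 5k tokens each gives the 4B-token figure cited above.
sft_tokens = SFT_SAMPLES * AVG_TOKENS_PER_SAMPLE
total_tokens = PRETRAIN_TOKENS + sft_tokens

if __name__ == "__main__":
    print(f"SFT tokens: {sft_tokens:.1e}")
    print(f"Total tokens: {total_tokens:.4e}")
    print(f"Total at one significant figure: {round_sig(total_tokens):.0e}")
```

Because the SFT data is more than three orders of magnitude smaller than the pre-training corpus, it does not change the total at one significant figure.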
17. Data language composition (Score: 0)

For all text data used in building the model, what is the composition of languages?

From the DeepSeek-V3 technical report: "we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese."
https://arxiv.org/pdf/2412.19437
No information provided about language frequency.

To receive this point, the developer should report (i) all languages which make up at least 1% of the data and their corresponding proportions and (ii) a brief description of how languages are labeled (if a publicly available tool is used, include a link to the tool). Proportions should be reported to a precision of two significant figures and should describe proportions of documents labeled with some language. An "Unknown" category may be included to denote documents where the language could not be identified.
English 80%, Spanish 5.0%, French 3.0%, Chinese 2.0%, Unknown 10%. We used a FastText-based classifier trained on Wikipedia data to identify languages.
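The reporting format described above can be sketched as a small computation over per-document language labels. This is a minimal illustration, assuming the labels have already been produced by some language identifier (not shown) and that `None` marks documents whose language could not be identified; the function names and sample counts are hypothetical.

```python
from collections import Counter
from math import floor, log10


def round_sig(x: float, sig: int = 2) -> float:
    """Round x to `sig` significant figures."""
    if x == 0:
        return 0.0
    return round(x, sig - 1 - floor(log10(abs(x))))


def language_composition(labels, min_percent=1.0, sig=2):
    """Map each language with at least `min_percent` of documents to its percentage.

    `labels` holds one language label per document; None is counted under
    the "Unknown" category.
    """
    counts = Counter(label if label is not None else "Unknown" for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    composition = {}
    for language, count in counts.most_common():
        percent = 100 * count / total
        if percent >= min_percent:
            composition[language] = round_sig(percent, sig)
    return composition


if __name__ == "__main__":
    # Hypothetical labels mirroring the example response above.
    sample = ["en"] * 80 + ["es"] * 5 + ["fr"] * 3 + ["zh"] * 2 + [None] * 10
    for language, percent in language_composition(sample).items():
        print(f"{language}: {percent}%")
```

Proportions here are computed over documents rather than tokens, matching the guidance that they describe the share of documents labeled with each language.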
18. Data domain composition (Score: 0)

For all the data used in building the model, what is the composition of domains covered in the data?

The most relevant document is the DeepSeek-V3 technical report with the relevant passage on data domains stating: "Compared with DeepSeek-V2, we optimize the pre-training corpus by enhancing the ratio of mathematical and programming samples, while expanding multilingual coverage beyond English and Chinese." Further information on post-training states: "We curate our instruction-tuning datasets to include 1.5M instances spanning multiple domains, with each domain employing distinct data creation methods tailored to its specific requirements. Reasoning Data. For reasoning-related datasets, including those focused on mathematics, code competition problems, and logic puzzles, we generate the data by leveraging an internal DeepSeek-R1 model. Specifically, while the R1-generated data demonstrates strong accuracy, it suffers from issues such as overthinking, poor formatting, and excessive length. Our objective is to balance the high accuracy of R1-generated reasoning data and the clarity and conciseness of regularly formatted reasoning data. To establish our methodology, we begin by developing an expert model tailored to a specific domain, such as code, mathematics, or general reasoning, using a combined Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) training pipeline. This expert model serves as a data generator for the final model. The training process involves generating two distinct types of SFT samples for each instance: the first couples the problem with its original response in the format of <problem, original response>, while the second incorporates a system prompt alongside the problem and the R1 response in the format of <system prompt, problem, R1 response>. The system prompt is meticulously designed to include instructions that guide the model toward producing responses enriched with mechanisms for reflection and verification. During the RL phase, the model leverages high-temperature sampling to generate responses that integrate patterns from both the R1-generated and original data, even in the absence of explicit system prompts. After hundreds of RL steps, the intermediate RL model learns to incorporate R1 patterns, thereby enhancing overall performance strategically. Upon completing the RL training phase, we implement rejection sampling to curate high-quality SFT data for the final model, where the expert models are used as data generation sources. This method ensures that the final training data retains the strengths of DeepSeek-R1 while producing responses that are concise and effective." No further information is provided about the post-training involved in producing DeepSeek-R1.
https://arxiv.org/pdf/2412.19437
No information provided about domains represented in pretraining data, only subsequent post-training data.

To receive this point, the developer should report the composition of the main domains included in the data used to train the model. This data should be at a level of granularity lower than broad claims about training on "internet data". For example, this could include the proportion of data from e-commerce, social media, news, code, etc. based on the URLs from which the data is sourced. Proportions should be reported to a precision of one significant figure.
Social media 40%, code repositories 30%, news articles 20%, e-commerce product listings 5%, scientific papers 5%.
19. External data access (Score: 0)

Does a third party have direct access to the data used to build the model?

No information provided about external data access.
Not disclosed
No information provided.

By a third party, we mean entities that are financially independent of the developer. We will award this point if at least one such entity is named as having direct access to the data. With that said, we may award this point if the developer provides justifications for prohibiting access to narrowly-scoped parts of the data.
Third parties that have direct access to the data include organizations A and B.
20. Data replicability (Score: 0)

Is the data used to build the model described in enough detail to be externally replicable?

No information provided relevant for data replication.
Not disclosed
No information provided.

We will award this point if the description contains (i) a list of all publicly available training data and where to obtain it and (ii) a list of all training data obtainable from third parties and where to obtain it. These conditions refer to criteria 2 and 3 under the OSI Open Source AI v1.0 definition.
The listing of publicly available training data can be found at this URL ... and the listing of all training data obtainable from third parties can be found at this URL ...
21. Compute usage for final training run (Score: 0)

Is the amount of compute used in the model's final training run disclosed?

No information is provided about training FLOPs, though they can be estimated from the 2.8M H800 GPU hours published in the DeepSeek-V3 paper, together with information about the R1 RL phase in the R1 paper, by assuming conversion factors.
https://arxiv.org/pdf/2412.19437
No information provided directly about compute usage.
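As a rough illustration of how such an estimate could be made, the sketch below applies the common approximation of 6 FLOPs per active parameter per training token to the V3 pre-training stage. The 14.8T token count is quoted elsewhere in this report. The 37B activated-parameter count is taken from the DeepSeek-V3 paper and is not quoted in this report, and the 6ND rule is itself an assumption that ignores attention costs, context extension, and the R1 RL phase. The result is not a figure disclosed by the developer.

```python
# Rough pre-training compute estimate using the 6 * N * D approximation.
# ASSUMPTIONS (not disclosed as FLOPs by the developer):
#   - 6 FLOPs per active parameter per training token
#   - 37B activated parameters per token (DeepSeek-V3 paper)
#   - 14.8T pre-training tokens (DeepSeek-V3 paper)
ACTIVE_PARAMS = 37e9
TRAINING_TOKENS = 14.8e12
FLOPS_PER_PARAM_PER_TOKEN = 6


def estimate_training_flops(active_params, tokens, factor=FLOPS_PER_PARAM_PER_TOKEN):
    """Return an approximate training FLOP count for a dense-equivalent model."""
    return factor * active_params * tokens


flops = estimate_training_flops(ACTIVE_PARAMS, TRAINING_TOKENS)
print(f"{flops:.2e} FLOPs")              # 3.29e+24 FLOPs
print(f"{flops:.0e} FLOPs (1 s.f.)")     # 3e+24 FLOPs (1 s.f.)
```

Under these assumptions the estimate covers pre-training only; adding the R1 stages would require further assumptions about hardware utilization.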

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPs). This number should represent the compute used to train the final model across all model stages.
Our model was trained using 5 x 10^25 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
22. Compute usage including R&D (Score: 0)

Is the amount of compute used to build the model, including experiments, disclosed?

No information provided about cumulative training FLOPs.
Not disclosed
No information provided directly about compute usage.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 7 x 10^26 FLOPs). Compared to the previous indicator, this indicator should include an estimation of the total compute used across experiments used towards the final training run for the model (such as including hyperparameter optimization or other experiments), and not just the final training run itself.
Our cumulative compute usage involved in building the model was 7 x 10^26 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
23. Development duration for final training run (Score: 1)

Is the amount of time required to build the model disclosed?

The most relevant document is the DeepSeek-V3 paper with the relevant passage on pre-training: "During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training". Nature paper on R1 notes: For the training of DeepSeek-R1-Zero, we employed 64*8 H800 GPUs, and the process required approximately 198 hours. Additionally, during the training phase of DeepSeek-R1, we utilized the same 64*8 H800 GPUs, completing the process in about 4 days, or roughly 80 hours. To create the SFT datasets, we use 5K GPU hours
https://arxiv.org/pdf/2412.19437
Development duration for V3 pretraining is adequately reported in terms of GPU hours and in days if interpreted as 55 days (2664K / 180K * 3.7 days), which is less than 2 months as mentioned. Post-training to produce R1 adds an additional 5k + (64*8*198) + (64*8*80) = 147k GPU hours and an additional 12 days. Together with V3 pretraining, this yields estimates of 2.8M GPU hours and 67 days.
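The arithmetic behind these estimates can be checked with a short sketch. All inputs are the figures quoted above; the variable names are illustrative. Two assumptions are made: the wall-clock time for R1 uses the reported "about 4 days", and no wall-clock time is added for SFT data creation, since none is reported.

```python
# Reproduce the duration and GPU-hour estimates from the reported figures.
PRETRAIN_GPU_HOURS = 2_664_000           # V3 pre-training
GPU_HOURS_PER_TRILLION_TOKENS = 180_000  # V3 paper
DAYS_PER_TRILLION_TOKENS = 3.7           # on 2048 H800 GPUs
R1_GPUS = 64 * 8                         # 512 H800 GPUs
R1_ZERO_HOURS = 198                      # wall-clock hours for R1-Zero
R1_HOURS = 80                            # wall-clock hours for R1 (about 4 days)
R1_DAYS = 4                              # ASSUMPTION: use the reported "about 4 days"
SFT_DATA_GPU_HOURS = 5_000

# V3 pre-training duration in days.
pretrain_days = PRETRAIN_GPU_HOURS / GPU_HOURS_PER_TRILLION_TOKENS * DAYS_PER_TRILLION_TOKENS

# R1 post-training: GPU hours and wall-clock days.
r1_gpu_hours = SFT_DATA_GPU_HOURS + R1_GPUS * R1_ZERO_HOURS + R1_GPUS * R1_HOURS
r1_days = R1_ZERO_HOURS / 24 + R1_DAYS

total_gpu_hours = PRETRAIN_GPU_HOURS + r1_gpu_hours
total_days = pretrain_days + r1_days

print(f"pre-training: {pretrain_days:.0f} days")          # pre-training: 55 days
print(f"R1 post-training: {r1_gpu_hours:,} GPU hours")    # R1 post-training: 147,336 GPU hours
print(f"total: {total_gpu_hours / 1e6:.1f}M GPU hours, {total_days:.0f} days")  # total: 2.8M GPU hours, 67 days
```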

The amount of time should be specified in terms of both the continuous duration of time required and the number of hardware hours used. The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). The number of hardware hours should be reported to a precision of one significant figure and include the type of hardware hours. No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, or training and subsequent evaluation and red teaming).
Our model was trained over a period of 90 days using 4x10^4 NVIDIA H100 GPU-days.
24. Compute hardware for final training run (Score: 1)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

The most relevant document is the DeepSeek-V3 paper with the relevant passage on pre-training: "During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training". For the R1 phase, the Nature paper on R1 reports the use of 64*8 H800 GPUs.
https://arxiv.org/pdf/2412.19437
2048 NVIDIA H800 GPUs throughout V3 training and 512 (64*8) NVIDIA H800 GPUs throughout R1 training.

In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.
Our model was trained using 1000 NVIDIA H100 GPUs.
25. Compute provider (Score: 1)

Is the compute provider disclosed?

The most relevant document is the DeepSeek-V3 paper with the relevant passage on pre-training: "During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training". No additional information is provided on the R1 phase.
https://arxiv.org/pdf/2412.19437
Self-owned cluster.

For example, the compute provider may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that compute is provided by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.
Compute is provided by Google Cloud Platform.
26. Energy usage for final training run (Score: 0)

Is the amount of energy expended in building the model disclosed?

No information provided about training energy usage.
Not disclosed
No information provided.

Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh), along with a description of the measurement methodology, which may involve estimation. Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model was trained using an estimated 1 x 10^4 MWh of energy. To estimate training energy consumption, we multiplied training FLOPs (5 x 10^25) by a conversion factor using NVIDIA A100 GPU information (3.74 × 10^21 FLOPs/MWh) given we train using FP16 with sparsity.
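The example calculation above can be sketched as follows. Because the conversion factor is expressed in FLOPs per MWh, the energy figure is obtained by dividing the FLOP count by that factor. Both inputs come from the example text and are illustrative, not DeepSeek figures.

```python
# Energy estimate from the example: FLOPs divided by a FLOPs-per-MWh factor.
TRAINING_FLOPS = 5e25     # example value from the rubric
FLOPS_PER_MWH = 3.74e21   # example A100 conversion factor (FP16 with sparsity)


def energy_mwh(flops, flops_per_mwh=FLOPS_PER_MWH):
    """Convert a FLOP count to megawatt-hours."""
    return flops / flops_per_mwh


energy = energy_mwh(TRAINING_FLOPS)
print(f"{energy:,.0f} MWh")           # 13,369 MWh
print(f"{energy:.0e} MWh (1 s.f.)")   # 1e+04 MWh (1 s.f.)
```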
27. Carbon emissions for final training run (Score: 0)

Is the amount of carbon emitted in building the model disclosed?

No information provided about environmental impacts.
Not disclosed
No information provided.

Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2), along with a description of the measurement methodology, which may involve estimation. Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions are for a single model run or include additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it. Emissions should correspond with the energy used in the previous indicator.
Our model yielded an estimate of 5 x 10^3 tCO2. To estimate training carbon emissions, we multiplied training energy usage (1 x 10^4 MWh) by a 2023 estimate for the US data center carbon intensity (0.375 tCO2/MWh) given the data centers used in training operate in the US.
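A short sketch of this example calculation shows that the result is sensitive to when rounding is applied. Multiplying the rounded energy figure (1 x 10^4 MWh) by 0.375 tCO2/MWh gives about 4 x 10^3 tCO2, whereas the stated 5 x 10^3 tCO2 appears to match the unrounded energy estimate from the previous indicator's example (5 x 10^25 FLOPs divided by 3.74 x 10^21 FLOPs/MWh). All values are taken from the example text.

```python
# Carbon estimate: energy (MWh) multiplied by grid carbon intensity (tCO2/MWh).
TRAINING_FLOPS = 5e25      # example value from the previous indicator
FLOPS_PER_MWH = 3.74e21    # example conversion factor from the previous indicator
CARBON_INTENSITY = 0.375   # example 2023 US data center estimate, tCO2 per MWh


def emissions_tco2(energy_mwh, intensity=CARBON_INTENSITY):
    """Return estimated emissions in tons of CO2."""
    return energy_mwh * intensity


unrounded_energy = TRAINING_FLOPS / FLOPS_PER_MWH   # about 13,369 MWh
from_rounded = emissions_tco2(1e4)                  # uses the rounded energy figure
from_unrounded = emissions_tco2(unrounded_energy)   # uses the unrounded energy figure

print(f"{from_rounded:.0e} tCO2")    # 4e+03 tCO2
print(f"{from_unrounded:.0e} tCO2")  # 5e+03 tCO2
```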
28. Water usage for final training run (Score: 0)

Is the amount of clean water used in building the model disclosed?

No information provided about environmental impacts.
Not disclosed
No information provided.

Clean water usage should be in appropriate units, which most often will be megaliters, along with a description of the measurement methodology, which may involve estimation. Clean water usage should be reported to a precision of one significant figure (e.g., 5000 ML). No form of decomposition into compute phases is required, but it should be clear whether the reported water usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate water usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model yielded an estimate of 20 ML water. To estimate training water usage, we multiplied training energy usage (1 x 10^4 MWh) by a 2021 estimate for the US data center water efficiency (1.8 ML per 1,000 MWh) given the data centers used in training operate in the US.
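The water example can be checked the same way. The sketch below uses only the values in the example text, with the water efficiency converted from ML per 1,000 MWh to ML per MWh.

```python
# Water estimate: energy (MWh) multiplied by water efficiency (ML per MWh).
ENERGY_MWH = 1e4               # example training energy usage
WATER_ML_PER_1000_MWH = 1.8    # example 2021 US data center estimate


def water_usage_ml(energy_mwh, ml_per_1000_mwh=WATER_ML_PER_1000_MWH):
    """Return estimated clean water usage in megaliters."""
    return energy_mwh * ml_per_1000_mwh / 1000


water = water_usage_ml(ENERGY_MWH)
print(f"{water:.0f} ML")           # 18 ML
print(f"{water:.0e} ML (1 s.f.)")  # 2e+01 ML (1 s.f.)
```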
29. Internal compute allocation (Score: 1)

How is compute allocated across the teams building and working to release the model?

The most relevant document is the DeepSeek-V3 paper with the relevant passage on compute allocation: "2664K H800 hours on pretraining, 119K on context extension, 5k on post training". The Nature paper on R1 provides a compute allocation of 100k H800 GPU hours for R1-Zero, 5k for SFT data creation, and 41k for R1.
https://arxiv.org/pdf/2412.19437
Compute allocation is provided for the V3 stages and for R1 as well.
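For comparison with the percentage-style example in the rubric, the reported GPU hours can be converted into shares. This is a minimal sketch that assumes all six reported stages are summed into a single budget, which the developer does not state; the stage labels are illustrative.

```python
# Convert reported H800 GPU hours per stage into percentage shares.
# ASSUMPTION: the six stages are treated as one combined compute budget.
stage_gpu_hours = {
    "V3 pre-training": 2_664_000,
    "V3 context extension": 119_000,
    "V3 post-training": 5_000,
    "R1-Zero": 100_000,
    "SFT data creation": 5_000,
    "R1": 41_000,
}

total = sum(stage_gpu_hours.values())
shares = {stage: 100 * hours / total for stage, hours in stage_gpu_hours.items()}

for stage, share in shares.items():
    print(f"{stage}: {share:.1f}%")
```

Under this assumption, V3 pre-training accounts for roughly 91% of the reported GPU hours.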

To receive a point, the developer should provide the compute allocated to each team involved in training the model. We understand there might be no clear allocation of compute across different teams; in that case, report an estimate of the compute used over the last year. Compute allocation should be reported to at least one significant figure.
- Safety — 15% - Pre-training — 60% - Post-training — 15% - Infrastructure and reliability — 5%
30. Model stages (Score: 1)

Are all stages in the model development process disclosed?

The most relevant documents are the DeepSeek-V3 and DeepSeek-R1 technical reports. The relevant passages state: "During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential" to build V3. Then, "To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1".
https://arxiv.org/pdf/2412.19437; https://arxiv.org/pdf/2501.12948
Several stages spanning the development of V3 and R1 are enumerated.

Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.
We define five stages in building the model: (1) unsupervised pre-training, (2) supervised instruction tuning, (3) RLHF, (4) domain-specific fine-tuning, and (5) final safety alignment.
31. Model objectives (Score: 1)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

The most relevant documents are the DeepSeek-V3 and DeepSeek-R1 technical reports. The relevant passages state: "During pre-training, we train DeepSeek-V3 on 14.8T high-quality and diverse tokens. The pre-training process is remarkably stable. Throughout the entire training process, we did not encounter any irrecoverable loss spikes or have to roll back. Next, we conduct a two-stage context length extension for DeepSeek-V3. In the first stage, the maximum context length is extended to 32K, and in the second stage, it is further extended to 128K. Following this, we conduct post-training, including Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on the base model of DeepSeek-V3, to align it with human preferences and further unlock its potential" to build V3. Then, "To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1".
https://arxiv.org/pdf/2412.19437; https://arxiv.org/pdf/2501.12948
Model stages have clearly described objectives, including description of token prediction for initial pretraining step in V3 report.

We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).
During unsupervised pre-training, the objective is next-token prediction. During supervised instruction tuning, we optimize for correctness and helpfulness on labeled tasks. RLHF aligns model outputs with human preference judgments. Domain-specific fine-tuning focuses on improving in-domain capabilities using specialized data (e.g., code or legal text). Final safety alignment reduces disallowed or harmful responses.
32. Code access (Score: 0)

Does the developer release code that allows third-parties to train and run the model?

No information is provided about code for building DeepSeek-R1, though inference code is released.
https://github.com/deepseek-ai/DeepSeek-R1/tree/main
No training code provided.

The released code does not need to match the code used internally.
We release training and inference code under an Apache 2.0 license at https://github.com/..., enabling others to replicate our core pipeline.
33. Organization chart (Score: 0)

How are employees developing and deploying the model organized internally?

The DeepSeek V3 report notes contributors for Research & Engineering, Data Annotation, and Business & Compliance. The DeepSeek R1 report notes core contributors and contributors.
https://arxiv.org/pdf/2412.19437; https://arxiv.org/pdf/2501.12948
No specific information about organization structure relevant for R1.

To receive a point, the developer should provide both the internal organization chart for the team developing the model as well as the headcounts (or a proportion of headcounts) by the team.
The model team comprises 63 people, organized as follows: - CEO - Managing Director (Safety) — 24 people - Managing Director (Pre-training) — 12 people - Managing Director (Post-training) — 11 people - Managing Director (API) — 6 people - Director (Infrastructure and reliability) — 7 people - Director (PR and marketing) — 4 people - Director (hiring) — 7 people
34. Model cost (Score: 0)

What is the cost of building the model?

The most relevant document is the DeepSeek-V3 report with the passage on training costs: "Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M." For DeepSeek-R1, the developer reports a cost of 294,000 USD.
https://arxiv.org/pdf/2412.19437 ; https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-025-09422-z/MediaObjects/41586_2025_9422_MOESM1_ESM.pdf
No information provided about total model cost, only an estimate for V3 training compute based on market rates and similarly for R1, i.e. non-compute costs are not accounted for.
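The compute-only cost figures follow from multiplying GPU hours by an assumed rental rate, as the sketch below shows. The V3 inputs are quoted above. For R1, the GPU-hour total is the 147k estimate derived under the development duration indicator. Applying the same $2 rate to R1 is an assumption here, although the result is consistent with the reported 294,000 USD.

```python
# Compute-only cost estimate: GPU hours multiplied by an assumed rental price.
RENTAL_USD_PER_GPU_HOUR = 2       # rate assumed in the DeepSeek-V3 report
V3_GPU_HOURS = 2_788_000          # full V3 training
R1_GPU_HOURS = 5_000 + 64 * 8 * 198 + 64 * 8 * 80  # SFT data + R1-Zero + R1


def compute_cost_usd(gpu_hours, rate=RENTAL_USD_PER_GPU_HOUR):
    """Return the rental-price cost of the given GPU hours in USD."""
    return gpu_hours * rate


v3_cost = compute_cost_usd(V3_GPU_HOURS)
r1_cost = compute_cost_usd(R1_GPU_HOURS)  # ASSUMPTION: same $2 rate applies to R1

print(f"V3: ${v3_cost:,}")  # V3: $5,576,000
print(f"R1: ${r1_cost:,}")  # R1: $294,672
```

Neither figure includes non-compute costs such as data acquisition or personnel, which is why the indicator is not satisfied.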

Monetary cost should be reported in appropriate currency (e.g. USD), along with the measurement methodology, which may involve estimation. Cost should be reported to a precision of one significant figure (e.g. 200 million USD).
We spent approximately 200 million USD on building the model: 50 million for data acquisition, 10 million for data processing, 20 million for personnel, 80 million for compute for R&D priced at market rates, and 40 million for compute for the final training run priced at market rates.
35. Basic model properties (Score: 1)

Are all basic model properties disclosed?

"We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing." The DeepSeek Hugging Face documentation, notes that the number of parameters associated with the model is 671 billion. "Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data, and (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples. 3) Distill the reasoning capability from DeepSeek-R1 to small dense models."
DeepSeek-R1 Paper
The developer discloses the modalities, the model components, the model size, and the model architecture.

Basic model properties include: the input modality, output modality, model size, model components, and model architecture. To receive a point, all model properties should be disclosed. Modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs. Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module. Model size should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder). Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently; a sufficient disclosure includes any clear, though potentially incomplete, description of the model architecture.
Input modality: Text Output modality: Text Model components: Decoder-only model trained using self-supervised learning, followed by supervised fine tuning and RLHF that are used to align the language model to follow users' instructions and be helpful, harmless, and honest. Model size: 70B parameters Model architecture: Autoregressive (causal, decoder only) transformer language model with rotary position embeddings, trained on the next token prediction task.
36. Deeper model properties (Score: 1)

Is a detailed description of the model architecture disclosed?

Configuration file released on Hugging Face
https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/config.json
The developer releases a configuration file that allows for the architecture to be reproduced.

To receive a point, the model architecture should be described in enough detail to allow for an external entity to fully implement the model. Publicly available code or a configuration file for a model training library (e.g., GPT-NeoX) would be a sufficiently detailed description.
The configuration file for training our model using a public model training library A can be found at [URL].
37. Model dependencies (Score: 1)

Are the model(s) from which the model is derived disclosed?

"Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning."
DeepSeek-R1 Paper
The developer discloses that the model is derived from DeepSeek-V3-Base.

We will award this point for a comprehensive disclosure of the model or models on which the foundation model directly depends or from which it is derived, as well as the method by which it was derived (e.g., through fine tuning, model merging, or distillation). Additionally, we will award a point if the developer discloses that the model is not dependent on or derived from any model.
This model is a fine tune of Camel-70B. We used the methods described in [PAPER URL] for distillation.
38. Benchmarked inference (Score: 0)

Is the compute and time required for model inference disclosed for a clearly-specified task on clearly-specified hardware?

Not disclosed
Not disclosed
No information is provided about the inference compute or time.

The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). Compute usage for inference should be reported in FLOPs/second to a precision of one significant figure (e.g. 5 x 10^21 FLOPs/second). The hardware in this evaluation need not be the hardware the developer uses for inference. The developer can report this figure over some known or public dataset.
It takes 0.002 seconds and 5 x 10^21 FLOPs/second to generate 100,000 tokens as 5,000 sequences of length 20 given inputs of length 40 from [DATASET URL]. The fixed set of hardware is 8 NVIDIA A100s.
39. Researcher credits (Score: 0)

Is a protocol for granting external entities API credits for the model disclosed?

Not disclosed
Not disclosed
No information is provided about a model credit access protocol.

A model credit access protocol refers to the steps, requirements, and considerations involved in granting credits to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access to credits (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) its policy on granting a transparent decision on whether access has been granted within a specified, reasonable period of time. Additionally, we will award a point if the developer discloses that it does not grant external entities API credits.
We implement a researcher access program: (i) Access can be requested from [URL] (ii) Any researcher at an accredited research institution is eligible to apply. Decisions are made based on the alignment between the applicant's project description and our target research directions (as described here: [URL]). (iii) Decision notifications are sent within three weeks of the application receipt.
40. Specialized access (Score: 0)

Does the developer disclose if it provides specialized access to the model?

Not disclosed
Not disclosed
No information is provided about specialized access.

Specialized access could include several categories, such as early access, subsidized access, or deeper access (e.g., to model weights or checkpoints, that are not publicly available). We will award this point if the developer discloses (i) if it provides specialized access and (ii) statistics on the number of users granted access across academia, industry, non-profits, and governments, to one significant figure.
We provide early access to the model via API to: (1) 250 academics vetted by our program (2) 0 industry affiliates (3) 0 non-profit affiliates (4) 2 government entities with whom we have signed MoUs We provide no other specialized research access.
41. Open weights (Score: 1)

Are the model's weights openly released?

"🔄 DeepSeek-R1 is now MIT licensed for clear open access"
DeepSeek-R1 Paper, DeepSeek API Docs, DeepSeek Hugging Face Doc
The model developer makes model weights publicly available.

To receive this point, model weights need to be publicly available at no cost. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via a license or some high-level description of who has been granted access to the foundation model).
Model weights are available on Hugging Face by following this link: [URL]
42. Agent Protocols (Score: 1)

Are the agent protocols supported for the model disclosed?

The AI Agent frameworks table on GitHub includes a number of frameworks such as smolagents, YoMo, and superagentx.
https://github.com/deepseek-ai/awesome-deepseek-integration/tree/main?tab=readme-ov-file#ai-agent-frameworks
The developer discloses the agent protocols supported for this model.

Agent protocols are specifications that define how autonomous agents exchange messages, context, or function calls with other agents, tools, or services (e.g., Anthropic’s Model Context Protocol (MCP) and Google’s Agent‑to‑Agent (A2A) spec). To earn this point, documentation must enumerate each protocol and describe any deviations or proprietary extensions.
We support MCP and A2A for agents built using model A
43. Capabilities taxonomy (Score: 1)

Are the specific capabilities or tasks that were optimized for during post-training disclosed?

"We directly apply RL to the base model without relying on supervised fine-tuning (SFT) as a preliminary step. This approach allows the model to explore chain-of-thought (CoT) for solving complex problems, resulting in the development of DeepSeek-R1-Zero. DeepSeekR1-Zero demonstrates capabilities such as self-verification, reflection, and generating long CoTs, marking a significant milestone for the research community. Notably, it is the first open research to validate that reasoning capabilities of LLMs can be incentivized purely through RL, without the need for SFT. This breakthrough paves the way for future advancements in this area. We introduce our pipeline to develop DeepSeek-R1. The pipeline incorporates two RL stages aimed at discovering improved reasoning patterns and aligning with human preferences, as well as two SFT stages that serve as the seed for the model’s reasoning and non-reasoning capabilities. We believe the pipeline will benefit the industry by creating better models."
DeepSeek-R1 Paper, DeepSeek API Docs, DeepSeek Hugging Face Doc
The developer lists the capabilities optimized for during post-training.

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.
We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual language proficiency on non-English languages (4) Tool-use
44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

"We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider 1 , LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces 2 , Chinese National High School Mathematics Olympiad (CNMO 2024)3 , and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench."
DeepSeek-R1 Paper, DeepSeek API Docs, DeepSeek Hugging Face Doc
The developer discloses evaluations of the capabilities specified in the previous indicator.

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.
We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool
45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

"We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI, 2024d), Aider, LiveCodeBench (Jain et al., 2024) (2024-08 – 2025-01), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as judges for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench."
DeepSeek-R1 Paper, DeepSeek API Docs, DeepSeek Hugging Face Doc
The developer does not disclose code and prompts needed to reproduce the evaluations of model capabilities.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

Not disclosed
Not disclosed
The developer does not disclose the train-test overlap.

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).
We compute train-test overlap with n-gram matching, following the procedure described here [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
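As an illustration of the n-gram matching methodology this indicator asks developers to describe, the following is a minimal sketch, not any developer's actual procedure. The whitespace tokenization, the choice of n, and the function names `ngrams` and `overlap_rate` are all assumptions made for the example; a test example is counted as overlapping if it shares at least one n-gram with the training documents.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(train_docs, test_examples, n=8):
    """Fraction of test examples that share at least one n-gram with the training docs."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    if not test_examples:
        return 0.0
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_grams)
    return flagged / len(test_examples)


# Toy data (hypothetical): the first test example is contained in the training text.
train = ["the quick brown fox jumps over the lazy dog near the river bank"]
test = [
    "quick brown fox jumps over the lazy dog",
    "a completely different sentence about model evaluation benchmarks",
]
print(f"{overlap_rate(train, test, n=5):.0%}")
```

A disclosure would report one such rate per benchmark, together with the tokenization and the value of n that were used.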
47. Risks taxonomy (Score: 1)

Are the risks considered when developing the model disclosed?

https://cdn.deepseek.com/policies/en-US/model-algorithm-disclosure.html "Risks associated with AI models may arise from two causes: 1. Limitations due to the immaturity of AI technology. 2. Risks due to the misuse of AI technology. Specifically:" Under "Limitations": - Hallucination Under "Misuse risks": - Privacy protection - Copyright - Data security - Content safety - Bias - Discrimination https://arxiv.org/pdf/2501.12948 2.3.4. Reinforcement Learning for all Scenarios To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. ... For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness. 
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html 3.4 You will not use the Services to generate, express or promote content or a chatbot that: (1) is hateful, defamatory, offensive, abusive, tortious or vulgar; (2) is deliberately designed to provoke or antagonize another or is bullying or trolling another; (3) may harass, intimidate, threaten, harm, hurt, scare, distress, embarrass or upset another; (4) is discriminatory such as discriminating another based on race, gender, sexuality, religion, nationality, disability or age; (5) is pornographic, obscene, or sexually explicit (e.g., sexual chatbots); (6) facilitates, promotes, incites or glorifies violence or terrorist/extremism content; (7) exploits, harms, or attempts to exploit or harm or minors or exposes minors to such content; (8) are designed to specifically appeal to or present a persona of any person under the age of 18 or the minimum age required in your country; (9) constitute, encourage or provide instructions for a criminal offence; or (10) impersonates or is designed to impersonate a celebrity, public figure or a person other than yourself without clearly labelling the content or chatbot as "unofficial" or "parody", unless you have that person's explicit consent. 4.4 All Outputs provided by this service are generated by an artificial intelligence model and may contain errors or omissions, for your reference only. You should not treat the Outputs as professional advice. Specifically, when using this service to consult on medical, legal, financial, or other professional issues, please be aware that this service does not constitute any advice or commitment and does not represent the opinions of any professional field. If you require related professional services, you should consult professionals and make decisions under their guidance. The Outputs of this software should not be the basis for your further actions or inactions. 
Any judgment you make based on the Outputs or subsequent related actions you take will result in consequences and responsibilities borne by you alone, including risks arising from reliance on the truthfulness, accuracy, reliability, non-infringement, or suitability for a specific purpose of the Outputs. You should recognize and use generative artificial intelligence technology scientifically and rationally, and in accordance with the law. 4.5 To prevent confusion and misidentification, and to eliminate or reduce the potential impact of inaccurate generated information, DeepSeek has added prominent disclaimer notices at the end of generated text or at the bottom of interactive interfaces, specifically reminding users that the content is AI-generated. 4.6 DeepSeek can provide incomplete, inaccurate, or offending output. If the output refers to a third-party product or service, this does not mean that the third party approves the comment or is affiliated with DeepSeek. 3.6 You will not use the Services for the following improper purposes: (1) Using the Services for any illegal purposes that violate laws and regulations, regulatory policies, or infringe on the legitimate rights and interests of third parties. (2) Using the Services for dangerous purposes that may have serious harmful impacts on physical health, psychology, society, or the economy, or violate scientific and technological ethics. (3) Engaging in activities that infringe on intellectual property rights, trade secrets, and other violations of business ethics, or using algorithms, data, platforms, etc., to implement monopolistic and unfair competition behaviors. (4) Without DeepSeek's authorization, copying, transferring, leasing, lending, selling, or sub-licensing the entire or part of the Services. (5) Other uses prohibited or restricted by laws and administrative regulations, or that may harm DeepSeek's interests. 
https://chat.deepseek.com/downloads/DeepSeek%20User%20Agreement.pdf 3.4 DeepSeek advises that, as a user of this Service, you shall not input content that induces outputs violating relevant laws and regulations, including but not limited to: (1) opposing the fundamental principles established by the Constitution; (2) endangering national security, leaking state secrets, subverting state power, overthrowing the socialist system, and undermining national unity; (3) damaging the honor and interests of the state; (4) inciting ethnic hatred and ethnic discrimination and undermining national unity; (5) insulting and abusing the image of martyrdom, denying the deeds of martyrdom, and beautifying and whitewashing the acts of war of aggression; (6) undermining the state's religious policy, promoting cults and feudal superstitions; (7) spreading rumors, false and harmful information, disturbing the social order and undermining social stability; (8) promoting obscenity, pornography, gambling, violence, murder, terrorism, extremism or abetting crime; (9) inciting unlawful assemblies, associations, processions, demonstrations, or gathering of people to disrupt social order; (10) slandering others, revealing their privacy, infringing on their honor, portrait, privacy, intellectual property rights and other legitimate rights and interests; (11) undermining international relations as well as international peace and stability; (12) containing information with other contents prohibited by laws and administrative regulations. 
3.5 DeepSeek advises that, as a user of this Service, you shall not input content that induces unfriendly dialogue outputs, including but not limited to: (1) personally attacking and abusing others; (2) making remarks of a cursing, discriminatory, or indifferent to the dignity of life nature against the following groups of people: people of different specific nationalities, geographic regions, genders, gender identities, sexual orientations, races, ethnicities, health conditions, occupations, ages, beliefs, and handicapped groups; (3) cursing, intimidating or threatening others; (4) making direct derogatory comments about content created by other users; (5) using vulgar terms towards other users and creating offense; (6) making comments of a prejudicial nature against the following groups of people, groups that include: specific nationalities, geographic regions, genders, gender identities, sexual orientations, races, ethnicities, health conditions, occupations, ages, beliefs, handicapped groups, etc. 3.6 DeepSeek advises that, as a user of this Service, you shall not maliciously contend with the filtering mechanisms of this Service by, including but not limited to: (1) inputting characters, numbers and other meaningless garbled codes that are difficult to recognize the meaning and affect the reading experience; (2) malicious antagonistic behavior, including but not limited to the use of variants, harmonies and other ways to circumvent the Service detection to enter contents that violates the above 3.4 and 3.5. 
3.7 DeepSeek advises that, as a user of this Service, you shall not interfere with the normal operation of this Service or harm DeepSeek's legal rights, including but not limited to: (1) using the Service to engage in illegal and criminal activities such as stealing trade secrets and personal information; (2) reversing compiling, reversing engineer or obtaining any information related to the algorithms of the Service by any other means; (3) using the Service to develop other products and services that compete with the Service (unless the restriction violates relevant legal norms); (4) crawling and copying the Service and any content contained therein by any means (including but not limited to through any robots, crawler technology other automated settings, setting up mirrors); (5) removing or tampering with the artificial intelligence-generated logos or the prominent logos of the deeply synthesized content involved in the Service without the consent of Depth Seeker or without a legally justifiable basis for such removal or tampering; (6) uploading, posting, e-mailing or otherwise transmitting software viruses or other computer codes, files and programs that interfere with, damage or limit the functionality of any computer software, hardware or communication equipment; (7) unlawful intrusion into the servers of the Service, tampering with the code related to the Service, or other acts that damage the Service; (8) attempting to circumvent the security settings or network systems of the DeepSeek, including obtaining data that the user shall not have access to, logging into servers or accounts that are not explicitly authorized, or employing methods such as running port scans to snoop on the security measures of other networks; (9) obtaining output in a manner that causes an unreasonable load to be placed on DeepSeek's computer systems or facilities, or engaging in behavior that could cause such a situation to arise; (10) unlawfully reselling or making available to the 
public the DeepSeek’s Service or using the DeepSeek’s Service for commercial purposes without authorization from DeepSeek. 3.8 DeepSeek advises that, as a user of this Service, you shall not engage in behaviors that violate other legal regulations or infringe upon the legal rights of others or DeepSeek, such as monopolistic or unfair competition practices.
DeepSeek R1, Paper DeepSeek User Agreement https://cdn.deepseek.com/policies/en-US/model-algorithm-disclosure.html
The developer discloses a list of risks considered.

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.
We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information
48. Risks evaluation (Score: 0)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

Not disclosed
Not disclosed
No information is provided about risk evaluations.

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.
We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

Not disclosed
Not disclosed
No information is provided about risk evaluations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

Not disclosed
Not disclosed
No information is provided about external entities that have evaluated the model pre-deployment.

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will award this point if the developer discloses no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.
We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.
51. External risk evaluation (Score: 0)

Are the parties contracted to evaluate model risks disclosed?

Not disclosed
Not disclosed
No information is provided about parties contracted to evaluate model risks.

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.
Contracted parties responsible for evaluating risks: (1) 2 contracted non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers
52. Mitigations taxonomy (Score: 1)

Are the post-training mitigations implemented when developing the model disclosed?

2.3.4. Reinforcement Learning for all Scenarios To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. ... For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
DeepSeek R1, Paper
The developer discloses that they implement RL to mitigate risks.

By post-training mitigations, we refer to interventions implemented by the developer during the post-training phase to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for a complete list of mitigations considered, even if some of the mitigations are not reflected in the final model. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
We implement supervised fine tuning and reinforcement learning with human feedback to address model risks. We use no other methods to address risks.
53. Mitigations taxonomy mapped to risk taxonomy (Score: 0)

Does the developer disclose how the post-training mitigations map onto the taxonomy of risks?

Not disclosed
Not disclosed
The developer does not disclose the specific risks that the post-training mitigation in the previous indicator (RL) is meant to address.

We will award this point for a complete mapping of the primary risk that each mitigation is meant to address, even if the mitigation potentially maps on to other risks in the taxonomy. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We use supervised fine tuning for general instruction following. We use RLHF to reduce the model's propensity to output information about cybercrimes, bioweapons, disinformation, content harassing someone, and PII.
54. Mitigations efficacy (Score: 0)

Does the developer evaluate and disclose the impact of post-training mitigations?

Not disclosed
Not disclosed
The developer does not disclose the results on risk evaluations before and after the post-training mitigation was applied.

We will award this point if the developer discloses the results on the risk evaluations before and after the post-training mitigations are applied. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
Pre-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.80 (2) Harassment: 0.91 (3) Cybersecurity risks: 0.56 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.62 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.52 Post-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity risks: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.02
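To show how a before-and-after disclosure of this kind can be summarized, the sketch below computes the absolute and relative reduction in mean attack success rate per risk. The figures are the illustrative values from the example disclosure for this indicator, not measured DeepSeek results, and the function name `mitigation_effect` is hypothetical.

```python
# Illustrative attack success rates copied from the example disclosure (not real results).
pre = {
    "Misinformation": 0.80,
    "Harassment": 0.91,
    "Cybersecurity risks": 0.56,
    "Bioweapons design": 0.62,
    "Personally-identifiable information": 0.52,
}
post = {
    "Misinformation": 0.02,
    "Harassment": 0.01,
    "Cybersecurity risks": 0.10,
    "Bioweapons design": 0.12,
    "Personally-identifiable information": 0.02,
}


def mitigation_effect(pre, post):
    """Per-risk absolute and relative drop in attack success rate."""
    effect = {}
    for risk, before in pre.items():
        drop = before - post[risk]
        effect[risk] = {
            "absolute": round(drop, 2),
            "relative": round(drop / before, 3),
        }
    return effect


effect = mitigation_effect(pre, post)
for risk, row in effect.items():
    print(f"{risk}: -{row['absolute']:.2f} absolute, {row['relative']:.1%} relative")
```

Reporting both quantities is a design choice: the absolute drop shows the residual change in risk, while the relative drop makes mitigations comparable across risks with different baselines.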
55. External reproducibility of mitigations evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of post-training mitigations disclosed?

Not disclosed
Not disclosed
The developer does not provide code/prompts to reproduce the evaluations of post-training mitigations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all mitigations evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We release the code and prompts for reproducing post-training mitigation evaluations at this GitHub link: [URL]
56. Model theft prevention measures (Score: 0)

Does the developer disclose the security measures used to prevent unauthorized copying (“theft”) or unauthorized public release of the model weights?

Not disclosed
Not disclosed
No information is provided about security measures used to prevent model theft.

This indicator assesses the developer's disclosures regarding how it addresses the risk that malicious actors or insiders could exfiltrate or replicate proprietary weights. Security measures could include insider threat analysis and detection, in addition to external threat management. Examples of such measures include encryption at rest, key management, remote attestation, or auditing for suspicious queries. We will award a point if the developer discloses specific steps taken to safeguard the model weights or that none are implemented.
We store model weights on encrypted volumes with hardware-based key management. We monitor inference queries for suspicious patterns (like repeated attempts to reconstruct weights token-by-token), and we audit all staff access logs monthly.
57. Release stages (Score: 0)

Are the stages of the model's release disclosed?

DeepSeek-R1-Lite-Preview released November 2024 as preview version (URL: https://api-docs.deepseek.com/news/news1120) DeepSeek-R1 full release January 20, 2025 with open-source model and API (URL: https://api-docs.deepseek.com/news/news250120) DeepSeek-R1-0528 version upgrade released May 28, 2025 (URL: https://api-docs.deepseek.com/news/news250528)
DeepSeek R1 Lite Preview
Although certain stages of model release could be inferred from the developer's previous announcements, it's difficult to infer from public disclosures the full set of release stages.

Release stages include A/B testing, release on a user-facing product, GA release, open-weight release, etc. We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award a point if the developer provides a clear identification of the stages through which the model was released.
We began with an internal alpha test for two weeks, followed by a closed beta with selected enterprise partners for one month, then a public waitlisted preview, and finally a general availability release once thresholds on safety benchmarks were met.
58. Risk thresholds (Score: 0)

Are risk thresholds disclosed?

Not disclosed
Not disclosed
No information is provided about risk thresholds.

Risk thresholds determine when a risk level is unacceptably high to a developer (e.g. leading to the decision to not release a model), moderately high (e.g. triggering additional safety screening), or low enough to permit normal usage. We will award this point if the developer discloses explicit risk thresholds that clarify (i) which harmful outcomes are being scored, (ii) how the scores are computed (in general terms, not necessarily disclosing internal algorithms), and (iii) what triggers an action to block, delay, or otherwise modify a model's release. Alternatively, we will award a point if the developer discloses that it does not consider explicit risk thresholds during model release.
Our risk threshold for biorisks is the ability to autonomously create bioweapons. Current models score a medium: they don't autonomously create bioweapons but could help a skilled practitioner with access to materials in speeding up creation of bioweapons. Risk thresholds higher than medium would delay the model's release until the risk level drops to medium or below.
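The release rule in the example disclosure (delay release while any scored risk is above medium) can be expressed as a simple gate. This is a sketch under stated assumptions: the three-level scale, the `RiskLevel` enum, and the `release_decision` function are hypothetical names introduced for illustration, and a real threshold policy would also specify how each score is computed.

```python
from enum import IntEnum


class RiskLevel(IntEnum):
    """Hypothetical ordered risk scale; higher means more severe."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def release_decision(levels, threshold=RiskLevel.MEDIUM):
    """Delay release if any risk is scored above the threshold; otherwise release.

    Returns the decision and the sorted list of risks that block release.
    """
    blocking = sorted(name for name, level in levels.items() if level > threshold)
    return ("delay", blocking) if blocking else ("release", [])


print(release_decision({"biorisk": RiskLevel.MEDIUM}))
print(release_decision({"biorisk": RiskLevel.HIGH, "cyber": RiskLevel.LOW}))
```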
59. Versioning protocol (Score: 0)

Is there a disclosed protocol for versioning and deprecation of the model?

https://api-docs.deepseek.com/ * The deepseek-chat model points to DeepSeek-V3-0324. You can invoke it by specifying model='deepseek-chat'. * The deepseek-reasoner model points to DeepSeek-R1-0528. You can invoke it by specifying model='deepseek-reasoner'. https://api-docs.deepseek.com/updates Version: 2025-05-28 deepseek-reasoner deepseek-reasoner Model Upgraded to DeepSeek-R1-0528
DeepSeek API Docs
The developer versions models by release date. However, the developer does not disclose how model versions are deprecated or communicated to users.

We will award a point if the developer discloses how model versions are labeled, updated, deprecated, and communicated to users.
We version models based on the date of release: e.g., ModelName-11-01-2024. We additionally provide ModelName-latest, corresponding to the latest release. We deprecate model versions that we plan to remove access to, giving users six months' notice. Users should respond to model deprecation by switching to the newest version of the model or an equivalent non-deprecated model. Users can switch to a different model by replacing the model identifier (e.g., with ModelName-latest for the latest version) in API calls or through the Python SDK.
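The date-based labeling scheme in the example disclosure can be sketched as follows. The `ModelName` identifier and the MM-DD-YYYY format come from the example; the helper names `version_id` and `resolve` and the release dates are hypothetical, and this does not describe how any real API resolves aliases.

```python
from datetime import date


def version_id(model_name, released):
    """Build a dated identifier such as ModelName-11-01-2024 (MM-DD-YYYY)."""
    return f"{model_name}-{released:%m-%d-%Y}"


def resolve(identifier, releases):
    """Map 'ModelName-latest' to the newest dated identifier; pass other identifiers through."""
    name, _, suffix = identifier.rpartition("-")
    if suffix != "latest":
        return identifier
    return version_id(name, max(releases))


# Hypothetical release history for illustration.
releases = [date(2024, 5, 1), date(2024, 11, 1)]
print(resolve("ModelName-latest", releases))
print(resolve("ModelName-05-01-2024", releases))
```

Pinning a dated identifier keeps behavior stable until that version is deprecated, while the `-latest` alias trades stability for automatic upgrades.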
60. Change log (Score: 1)

Is there a disclosed change log for the model?

Version: 2025-05-28 deepseek-reasoner deepseek-reasoner Model Upgraded to DeepSeek-R1-0528: Enhanced Reasoning Capabilities Significant benchmark improvements (Pass@1) AIME 2025: 70.0 → 87.5 (+17.5) GPQA: 71.5 → 81.0 (+9.5) LCB_v6: 63.5 → 73.3 (+9.8) Aider: 57.0 → 71.6 (+14.6) Note: Complex reasoning tasks may consume more tokens compared to legacy R1 version. Optimized Front-end Development Generated web pages and games now feature improved aesthetics. Reduced Hallucinations Significantly suppressed hallucination issues present in legacy R1 version. JSON Output & Function Calling Support Function call performance: Tau-bench score: 53.5 (Airline) / 63.9 (Retail)
DeepSeek API Docs
The developer discloses a change log that lists new features and performance improvements.

We will award a point if the developer publishes a version-by-version record of new features, fixes, or performance improvements.
On 11/1/2024 (version ModelName-11-01-2024), we improved model reasoning in technical domains. This resulted in a 20-point increase on the MATH benchmark (from 62% to 82%). Past change logs can be viewed at [URL]
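A change log that reports benchmark movements can be checked mechanically. The sketch below recomputes the deltas from the before and after scores quoted in the DeepSeek-R1-0528 change log entry for this indicator; the dictionary layout and the function name `deltas` are assumptions made for the example.

```python
# (before, after) Pass@1 scores as quoted in the DeepSeek-R1-0528 change log entry.
changes = {
    "AIME 2025": (70.0, 87.5),
    "GPQA": (71.5, 81.0),
    "LCB_v6": (63.5, 73.3),
    "Aider": (57.0, 71.6),
}


def deltas(changes):
    """Point change per benchmark, rounded to one decimal place."""
    return {name: round(after - before, 1) for name, (before, after) in changes.items()}


for name, delta in deltas(changes).items():
    before, after = changes[name]
    print(f"{name}: {before} -> {after} ({delta:+.1f})")
```

The recomputed values agree with the deltas stated in the change log (+17.5, +9.5, +9.8, +14.6).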
61. Foundation model roadmap (Score: 1)

Is a forward-looking roadmap for upcoming models, features, or products disclosed?

"This is just the beginning! Look forward to multimodal support and other cutting-edge features in the DeepSeek ecosystem."
Introducing DeepSeek-V3
The developer discloses an upcoming major model/feature release (multimodal support).

A foundation model roadmap is a transparent statement about how the developer intends to evolve or expand its LLM offerings, including upcoming models, major feature releases, or expanded products based on the model, along with approximate timelines or version milestones. It can be high-level (e.g., “new model Q2 2025”), but must exist publicly.
We plan to release ModelX2 in Q2 2025, featuring enhanced multilingual capabilities and improved retrieval. We also aim to launch an enterprise-specific product tier for regulated industries by early 2026.
62. Top distribution channels (Score: 1)

Are the top-5 distribution channels for the model disclosed?

DeepSeek API: https://api-docs.deepseek.com/guides/reasoning_model HuggingFace: https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 ModelScope: https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528/summary
https://api-docs.deepseek.com/zh-cn/news/news250528 https://api-docs.deepseek.com/guides/reasoning_model https://huggingface.co/deepseek-ai/DeepSeek-R1-0528 https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528/summary
The developer discloses a single API distribution channel and two model weights distribution channels.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. API providers and model distributors may be ranked separately using different metrics as long as the total number of distribution channels equals five (if five distribution channels exist). For example, the developer may choose to disclose the top-3 API providers (ranked by the number of queries) and the top-2 model distributors (ranked by the number of downloads).
We provide API access to the model through A, B, and C. We distribute model weights through D and E. We pick the top-3 API providers based on the average number of queries per month and the top-2 model weight providers based on the average number of downloads per month.
63. Quantization (Score: 1)

Is the quantization of the model served to customers in the top-5 distribution channels disclosed?

Model weights quantization example config:
```
"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
}
```
API quantization: "All DeepSeek-V3/R1 inference services are served on H800 GPUs with precision consistent with training. Specifically, matrix multiplications and dispatch transmissions adopt the FP8 format aligned with training, while core MLA computations and combine transmissions use the BF16 format, ensuring optimal service performance."
https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528/file/view/master/config.json https://huggingface.co/deepseek-ai/DeepSeek-R1-0528/blob/main/config.json https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md
The developer discloses the quantization of the model weights distributed in both model weights distribution channels. They also describe the quantization for their inference services in a separate page (FP8 & BF16).

We will award this point for a disclosure of the model precision in each of the top-5 distribution channels.
We serve the model at 16-bit precision on all distribution channels.
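As an illustration of how the `quantization_config` fragment quoted in this indicator's disclosure can be read programmatically, the following is a minimal sketch. It assumes the fragment is wrapped in an enclosing JSON object, as it would be in a full `config.json`; the function name `summarize_quantization` is hypothetical and not part of any DeepSeek tooling.

```python
import json

# Copy of the quantization_config fragment quoted in the disclosure,
# wrapped in braces so that it parses as a standalone JSON object.
CONFIG_FRAGMENT = """
{
  "quantization_config": {
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "quant_method": "fp8",
    "weight_block_size": [128, 128]
  }
}
"""


def summarize_quantization(config_text: str) -> str:
    """Return a one-line summary of a config's quantization settings."""
    cfg = json.loads(config_text).get("quantization_config")
    if cfg is None:
        return "no quantization_config present"
    rows, cols = cfg["weight_block_size"]
    return (
        f"{cfg['quant_method']} ({cfg['fmt']}), "
        f"{cfg['activation_scheme']} activation scheme, "
        f"weight blocks of {rows}x{cols}"
    )


SUMMARY = summarize_quantization(CONFIG_FRAGMENT)
print(SUMMARY)
```

A check of this kind only covers the released weights; the precision used by the hosted API is described separately in the inference system overview cited above.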
64. Terms of use (Score: 1)

Are the terms of use of the model disclosed?

This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs. Please note that: https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html
https://www.modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528/summary https://huggingface.co/deepseek-ai/DeepSeek-R1
The developer discloses the terms of service and model license.

We define terms of use to include terms of service and model licenses. We will award this point for a pointer to the terms of service or model license. In the event that the model's licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use. Terms of service should be disclosed for each of the top-5 distribution channels. However, we will award this point if there are terms of service that appear to apply to the bulk of the model’s distribution channels.
Our terms of service are published at https://ourcompany.com/model-tos - these terms cover both our API and all distribution channels for model weights.
65. Distribution channels with usage data (Score: 0)

What are the top-5 distribution channels for which the developer has usage data?

DeepSeek's website implies that it collects user data for its own services (DeepSeek Chat and the API).
https://huggingface.co/deepseek-ai/DeepSeek-R1 ; https://platform.deepseek.com/sign_in; https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html
A score of 0 is granted as there is no specific disclosure about the distribution channels for which DeepSeek has usage data.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. We define usage data as any form of developer-exclusive data collected from any of a developer's distribution channels. A developer has access to usage data from a distribution channel if it is able to use that data for downstream purposes (e.g., analytics, training, etc.). Usage data may be shared outside of the developer, but it is initially collected by the distribution channel and shared with the developer.
We have access to usage data through the distribution channels: A, B, and C.
66. Amount of usage (Score: 0)

For each of the top-5 distribution channels, how much usage is there?

HuggingFace provides the number of model downloads
https://huggingface.co/deepseek-ai/DeepSeek-R1
A score of 0 is granted as there is no disclosure about the number of queries per month.

Usage should be reported as the number of queries over the span of a month, reported to the precision of one significant figure (e.g., 50 million queries).
Distribution channel A: 50 million queries. Distribution channel B: 10 million queries. Distribution channel C: 10 million queries.
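The one-significant-figure reporting convention described above can be sketched as follows. The query counts are hypothetical placeholders, not DeepSeek figures, and `round_to_one_sig_fig` is an illustrative helper that rounds half up on non-negative integer counts.

```python
def round_to_one_sig_fig(count: int) -> int:
    """Round a non-negative integer count to one significant figure (half up)."""
    if count <= 0:
        return 0
    # Place value of the leading digit, e.g. 10_000_000 for 52_300_000.
    factor = 10 ** (len(str(count)) - 1)
    return (count + factor // 2) // factor * factor


# Hypothetical monthly query counts per distribution channel.
MONTHLY_QUERIES = {"A": 52_300_000, "B": 9_600_000, "C": 11_400_000}

REPORTED = {name: round_to_one_sig_fig(n) for name, n in MONTHLY_QUERIES.items()}
for name, n in REPORTED.items():
    print(f"Distribution channel {name}: {n:,} queries")
```

With these placeholder inputs the reported values are 50 million, 10 million, and 10 million queries, matching the shape of the example disclosure.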
67. Classification of usage data (Score: 0)

Is a representative, anonymized dataset classifying queries into usage categories disclosed?

No information provided about usage data.
Not disclosed
No information provided.

Developers may either share a fully public dataset or a partially restricted dataset (e.g., under a research license). We will award this point if there is a clear, aggregated or sample dataset that reveals categories of tasks/queries.
We provide quarterly releases of an anonymized dataset that classifies user queries into 20 broad job-related categories. Researchers can request access via [URL]. We ensure no PII is included.
68. Data retention and deletion policy (Score: 0)

Is a policy for data retention and deletion disclosed?

When the Personal Data collected is no longer required by us, we and our service providers will perform the necessary procedures for destroying, deleting, erasing, or converting it into an anonymous form as permitted or required under applicable laws. While DeepSeek clarifies that usage data isn't used for pre-training, it might be used for optimization: https://cdn.deepseek.com/policies/en-US/model-algorithm-disclosure.html
https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html
A score of 0 is granted as there is no disclosure about how quickly deletions propagate to the model.

A data retention and deletion policy is a policy for removing particular data from the training set and/or preventing it from being used if there is a user or external request (e.g., “right to be forgotten”) that also covers internal data governance. This includes whether there is a formal process to delete or retract data from future training runs and how long raw data is retained. It also clarifies how quickly deletions propagate to the model (e.g., “only in subsequent major model releases”).
We honor verified user requests to delete personal data from our training corpus by removing it from any subsequent scheduled retraining. Our data retention policy ensures chat logs are purged after 90 days.
69. Geographic statistics (Score: 0)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

No information provided about usage data.
Not disclosed
No information provided.

We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.
We share anonymized per-country usage metrics in a publicly accessible dashboard, updated monthly, on this link: [link]
70. Internal products and services (Score: 0)

What are the top-5 internal products or services using the model?

DeepSeek App, DeepSeek Chat, DeepSeek Platform, API
https://www.deepseek.com/en
A score of 0 is granted as it is not clear if the internal products listed on the website are exhaustive.

An internal product or service is a product or service built by the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
71. External products and services (Score: 0)

What are the top-5 external products or services using the model?

Multiple external integrations listed in awesome-deepseek-integration repository, including TigerGPT, HIX.AI, PopAi, and 1AI iOS Chatbot
https://github.com/deepseek-ai/awesome-deepseek-integration
A score of 0 is granted as the top-5 external products or services are not specified.

An external product or service is a product or service built by a party external to the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided. We will award a point if the developer discloses that it does not have access to such metrics about external products or services.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
72. Users of internal products and services (Score: 0)

How many monthly active users are there for each of the top-5 internal products or services using the model?

No information provided about users of internal products and services.
Not disclosed
No information provided.

An internal product or service is a product or service built by the developer. The number of users refers to users who engaged or interacted with the model through the internal product or service over the last month or averaged over the last X months (this should be specified). Number of users should be specified to one significant figure (e.g. 100,000).
Over the last 6 months, the total monthly active users for our top-5 products using model Y are: Product A: 100,000 users Product B: 30,000 users Product C: 10,000 users Product D: 10,000 users Product E: 10,000 users
73. Consumer/enterprise usage (Score: 0)

Across all distribution channels for which the developer has usage data, what portion of usage is consumer versus enterprise?

No information provided about consumer versus enterprise usage.
Not disclosed
No information provided.

Consumer usage refers to usage by individual consumers. Enterprise usage refers to usage by enterprise customers (including government use). Consumer and enterprise usage should be calculated in terms of the number of queries by or the amount of revenue from consumer or enterprise users. Percentages should be specified to two significant digits (e.g., 12% consumer, 88% enterprise).
12% of the usage of model A across all distribution channels is from consumers, 88% is from enterprise users. Of this 88%, 6% is from users at governments. Usage is calculated based on number of queries.
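The consumer versus enterprise split can be computed from query counts along the following lines. This is a minimal sketch with hypothetical counts; rounding to a whole percent yields two significant digits only for shares between 10% and 99%, so a smaller share would need one decimal place.

```python
def usage_split(consumer_queries: int, enterprise_queries: int) -> tuple[int, int]:
    """Return (consumer %, enterprise %) of total queries, as whole percents."""
    total = consumer_queries + enterprise_queries
    if total <= 0:
        raise ValueError("total query count must be positive")
    consumer_pct = round(100 * consumer_queries / total)
    # Derive the second share from the first so the two always sum to 100.
    return consumer_pct, 100 - consumer_pct


# Hypothetical query counts over the reporting period.
CONSUMER_PCT, ENTERPRISE_PCT = usage_split(1_234_000, 8_766_000)
print(f"{CONSUMER_PCT}% consumer, {ENTERPRISE_PCT}% enterprise")
```

The same function applies if usage is measured in revenue rather than queries, provided the disclosure states which metric was used.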
74. Enterprise users (Score: 0)

Across all distribution channels for which the developer has usage data, what are the top-5 enterprises that use the model?

No information provided about enterprise users.
Not disclosed
No information provided.

Enterprises should be ranked by the number of queries made or the amount of revenue from usage since the model's release. We will also award this point if the developer indicates it does not have access to enterprise usage data.
The top-5 enterprises are A, B, C, D, and E. The enterprises are selected based on the number of queries.
75. Government use (Score: 0)

What are the 5 largest government contracts for use of the model?

No information provided about government users.
Not disclosed
No information provided.

This includes known government contracts of enterprise or government-specific products and services that use the model. We will award this point if the developer discloses its top five government contracts ranked by monetary value, though the developer may omit contracts where it is under NDA regarding the existence of the contract.
The five largest government users of our service, along with their use cases, are: 1. County A is utilizing our product for improving access to internal resources. 2. National Lab B is using our model to advance bioscientific research. 3. Federal agency C is using our product to deliver faster, more accurate translation services. 4. City D is participating in a pilot program that found our product helped reduce the time spent on routine tasks. 5. Country E is using our product to summarize legal documents in their lower courts.
76. Benefits Assessment (Score: 0)

Is an assessment of the benefits of deploying the model disclosed?

No information provided about benefits assessments.
Not disclosed
No information provided.

We will award this point for any quantitative assessment of the benefits or potential benefits of deploying the model.
We analyze the impact of using the model on education outcomes using a randomized controlled trial in third grade math assignments, and find that use in the classroom improves standardized test outcomes by 26%. [Link to report.]
77. AI bug bounty (Score: 0)

Does the developer operate a public bug bounty or vulnerability reward program under which the model is in scope?

No information provided about an AI bug bounty.
Not disclosed
No information provided.

We will award this point for a publicly documented bug bounty or vulnerability reward program describing (i) in-scope vulnerabilities (e.g., prompt bypasses, data leaks), (ii) out-of-scope items, (iii) submission process, and (iv) reward tiers or recognition if applicable. We will award a point if the developer discloses it has no AI bug bounty that encourages external researchers to report security, privacy, or adversarial vulnerabilities in the model.
We run a bug bounty program with HackerOne. We award up to $5,000 for critical vulnerabilities, such as discovering a major exploit that circumvents our content filters or reveals private data. [link to bug bounty]
78. Responsible disclosure policy (Score: 0)

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

No information provided about responsible disclosure.
Not disclosed
No information provided.

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.
We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.
79. Safe harbor (Score: 0)

Does the developer disclose its policy for legal action against external evaluators conducting good-faith research?

3.5 You will not violate any applicable, nor interfere with, damage, or attack the Services, systems, networks, models, and other components that support the normal operation of the service. You will not engage in activities that endanger the security of the Services: (1) Engage in illegal activities involving network intrusion, such as: using unauthorized data or accessing unauthorized servers/accounts; forging TCP/IP packet names or partial names; attempting to probe, scan, or test vulnerabilities in the software system or network without permission.
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html
A score of 0 is granted as DeepSeek's terms of use deny the ability to carry out vulnerability research.

We will award this point if the developer discloses whether it has a policy committing it to not pursue legal action against external evaluators conducting good-faith research. This should not be only for software security vulnerabilities, but also AI flaws, and it should be based on researcher conduct standards, not at the sole discretion of the company. We will award this point if the developer provides a clear description of its policy regarding such protections for external researchers, or lack thereof.
We do not have a policy for researcher protections for good-faith safety research. OR Our policy ensures no legal action against good‐faith researchers who follow our disclosure guidelines, see: [link]
80. Security incident reporting protocol (Score: 0)

Are major security incidents involving the model disclosed?

The "report a vulnerability" button at the bottom of the page links to security@deepseek.com
https://www.deepseek.com/en
No information provided about the policy governing vulnerability reports.

A security incident reporting protocol provides post-deployment transparency about serious incidents or breaches. Security incidents refer to incidents where external security threats affect the model (e.g., data breaches or DDoS attacks on the service). We will award this point if the developer states (i) how to submit a security incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but the developer must disclose a policy determining how incidents are reported and disclosed.
We publish a public ‘Security Incident Report’ on our website for any confirmed security incident affecting the model within 7 days of a patch being implemented. Users and researchers can report incidents via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
81. Misuse incident reporting protocol (Score: 0)

Are misuse incidents involving the model disclosed?

The "report a vulnerability" button at the bottom of the page links to security@deepseek.com
https://www.deepseek.com/en
No information provided about the policy governing misuse reports.

A misuse incident reporting protocol provides post-deployment transparency about incidents of misuse involving the model. As opposed to the previous indicator, this indicator is about actors misusing the model to cause real-world harm, such as misinformation operations or cybersecurity attacks. We will award this point if the developer states (i) how to submit a misuse incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but there needs to be a policy governing how incidents are reported.
We publish a public ‘Misuse Incident Report’ on our website for any confirmed misuse incident within 7 days of a patch being implemented. Users and researchers can report incidents regarding our flagship foundation model via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
82. Post-deployment coordination with government (Score: 0)

Does the developer coordinate evaluation with government bodies?

No information provided on coordination of post-deployment evaluation with government bodies.
Not disclosed
No information provided.

We will award this point if the developer specifies which government bodies it is coordinating with and for what types of post-deployment evaluations. Government bodies include AI Safety Institutes, national security agencies, national labs, and international governmental entities such as UN agencies or the G7. Evaluation here may also include sharing of the developer's proprietary evaluation results for help with interpretation.
We do not coordinate with any government entities or AI Safety Institutes. OR We coordinate with the UK AISI for post-deployment evaluation of cyber, CB, and autonomy-related capabilities.
83. Feedback mechanisms (Score: 0)

Does the developer disclose a way to submit user feedback? If so, is a summary of major categories of feedback disclosed?

Users can evaluate model output through liking or disliking actions. The open platform TOS states: 10.2 If you notice any violation of laws and regulations or breach of these Terms or you have any opinions or suggestions regarding these Terms or the Service, you can contact us through the following methods or directly contact our sales staff: Online Complaints and Feedback Portal: Click the "Contact us" button on the product interface after logging in. Contact Email: api-service@deepseek.com Contact Address: 5th Floor, North Building, Block C, Rongke Information Center, No.2 South Science Academy Road, Haidian District, Beijing, China.
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html https://cdn.deepseek.com/policies/en-US/deepseek-open-platform-terms-of-service.html
No information provided.

We will award this point if the developer (i) discloses how users can submit feedback (e.g., via a form or a thumbs up/thumbs down for model responses) and (ii) discloses aggregated or categorized feedback data (e.g. a categorization of thumbs up and thumbs down data).
Users can submit feedback at this url: [URL] We find that users mainly report issues with API call response times, over-refusals from models, and outdated information in model outputs. A detailed categorization of user reports is available at [URL]
84. Permitted, restricted, and prohibited model behaviors (Score: 0)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities
https://arxiv.org/pdf/2501.12948
No information provided.

We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from responding to NSFW content). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior in this way.
We allow responses from Model A that include broad Q&A, restrict sexual or harassing content, and prohibit facilitating illegal or violent acts. More details can be found in our guidelines for model behavior here: [link]
85. Model response characteristics (Score: 1)

Are desired model response characteristics disclosed?

For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.
https://arxiv.org/pdf/2501.12948
A score of 1 is granted as the R1 technical report states helpfulness and harmlessness are intended model characteristics.

Model response characteristics include default behaviors or behaviors that the developer steers the model to take. These may include being helpful, taking an objective point of view, or using tools only when necessary. We will award points for a clear description of desired model response characteristics or a statement that there are no such characteristics.
We configure responses from Model A to be factual, neutral, and contextually helpful, avoiding personal or biased opinions. More details can be found in our guidelines for model behavior here: [link]
86. System prompt (Score: 1)

Is the default system prompt for at least one distribution channel disclosed?

" {"role": "system", "content": "You are a helpful assistant."},"
https://api-docs.deepseek.com/api/create-completion
A score of 1 is granted as R1's system prompt is disclosed.

A system prompt is defined as the prompt provided to the system by default that guides the system's behavior. We will award this point for the disclosure of the verbatim text of the full system prompt as well as an explanation for the context in which the system prompt is used.
We disclose our default prompt for Model A via our chat interface: ‘You are a helpful AI assistant providing clear, accurate, and policy‐compliant responses.’
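To show the context in which a system message such as the one quoted in this indicator is used, here is a minimal sketch that places it at the head of a chat message list. The helper `build_messages` and the user text are hypothetical; no request is sent, and the payload shape simply follows the role/content convention visible in the quoted snippet.

```python
import json

# System message as quoted in the disclosure for this indicator.
SYSTEM_MESSAGE = {"role": "system", "content": "You are a helpful assistant."}


def build_messages(user_text: str) -> list:
    """Return a message list with the system message first, then the user turn."""
    return [dict(SYSTEM_MESSAGE), {"role": "user", "content": user_text}]


MESSAGES = build_messages("Hello!")  # hypothetical user input
print(json.dumps(MESSAGES, indent=2))
```

Placing the system message first means it frames every later turn, which is why the rubric asks for the verbatim text together with the context in which it is applied.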
87. Intermediate tokens (Score: 1)

Are intermediate tokens used to generate model outputs available to end users?

R1's output includes reasoning content
https://api-docs.deepseek.com/guides/reasoning_model
A score of 1 is granted as R1's chain of thought is made available to end users.

Intermediate tokens are defined as any tokens generated by the model before the final output is shown to the user, such as model chains of thought. We will also award this point if a summary of intermediate tokens is made available to end users. If intermediate tokens or summaries are not made available, the developer should provide a justification.
Model A is trained to generate intermediate chain-of-thought reasoning, but we withhold most chain-of-thought tokens from final user-facing responses to prevent model distillation. We do disclose chains-of-thought for a small set of research collaborators under NDA.
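A downstream consumer might separate intermediate tokens from the final answer roughly as follows. This is a sketch under assumptions: the field name `reasoning_content` is taken to match the reasoning model guide cited for this indicator and should be verified there, and the message dictionary below is a hypothetical mock rather than real model output.

```python
def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat message, tolerating missing fields."""
    # Assumption: reasoning is exposed under "reasoning_content" and the
    # final answer under "content"; absent or null fields become "".
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


# Hypothetical mock of a single assistant message.
MOCK_MESSAGE = {
    "role": "assistant",
    "reasoning_content": "9.11 has a smaller tenths digit than 9.8.",
    "content": "9.8 is greater than 9.11.",
}

REASONING, ANSWER = split_reasoning(MOCK_MESSAGE)
print("Reasoning:", REASONING)
print("Answer:", ANSWER)
```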
88. Internal product and service mitigations (Score: 0)

For internal products or services using the model, are downstream mitigations against adversarial attacks disclosed?

Safety monitoring services mentioned in privacy policy
https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html
A score of 0 is granted as no downstream mitigations against adversarial attacks are disclosed.

An internal product or service is a product or service built by the developer. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks might include specialized prompt filtering, content scanning, or real-time monitoring of queries or accounts. We will award this point if the developer discloses a clear statement of methods used (e.g., a specialized prompt sanitizer or adversarial pattern detector), or if the developer states it does not implement such product-level mitigations against adversarial attacks.
In our chatbot, we implement a second-stage content filter that checks user inputs for disallowed topics and attempts to sanitize adversarial prompts. We also log suspicious prompts for manual review.
89. External developer mitigations (Score: 0)

Does the developer provide built-in or recommended mitigations against adversarial attacks for downstream developers?

No information provided on external developer mitigations.
Not disclosed
No information provided.

Downstream developers are developers who access the model through a distribution channel. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks that developers might build in or recommend include content filtering endpoints and recommended prompt templates. We will award this point if the developer discloses (i) technical mitigations (e.g., a developer-provided moderation API or classifier) it offers or implements, (ii) recommended best practices or libraries for downstream developers, or (iii) an explicit statement that it does not build or recommend any particular downstream mitigations in this way.
Our API includes an optional parameter that will automatically filter user prompts and model outputs for hateful or disallowed content. We also publish guidelines for building robust chat interfaces that resist common prompt injections.
90. Enterprise mitigations (Score: 0)

Does the developer disclose additional or specialized mitigations for enterprise users?

No information provided on specialized mitigations for enterprise users.
Not disclosed
No information provided.

Enterprise users are, for example, large organizations with dedicated service agreements or users of enterprise-specific API deployments or products and services. Additional or specialized mitigations may address enterprise needs such as data privacy controls, advanced prompt/response monitoring, or compliance checks with regulations such as GDPR or HIPAA. Additional or specialized mitigations may include single-tenant deployments, custom filters for specific regulated industries, or advanced logging for compliance. We will award a point if the developer at least describes these mitigations or states that it does not provide such additional or specialized enterprise mitigations.
Our enterprise offering for Model A includes a dedicated environment with stricter filtering, a HIPAA-compliant data retention policy, and the ability for enterprise admins to define custom blacklisted topics that the model must refuse.
91. Detection of machine-generated content (Score: 0)

Are mechanisms that are used for detecting content generated by this model disclosed?

Terms of Use mention AI-generated content disclosure requirements
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html
A score of 0 is granted as no mechanism for detecting machine-generated content is disclosed.

A mechanism for detecting machine-generated content might include storing a copy of all outputs generated by the model to compare against, implementing a watermark on model outputs, adding cryptographic metadata (such as C2PA), or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it does not have or use any such mechanism.
We train a classifier using model generations and human-written text to identify machine-generated content from Model A and our other models.
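Of the mechanisms listed above, storing a copy of outputs to compare against is the simplest to sketch. The following is an illustrative example only, not a method attributed to DeepSeek (which discloses none): the class `OutputRegistry` is hypothetical, it stores SHA-256 digests rather than raw text, and it detects only exact matches after whitespace and case normalization.

```python
import hashlib


class OutputRegistry:
    """Hypothetical registry of digests of previously generated outputs."""

    def __init__(self) -> None:
        self._digests: set[str] = set()

    @staticmethod
    def _digest(text: str) -> str:
        # Collapse whitespace and lowercase so trivial edits still match.
        normalized = " ".join(text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def record(self, text: str) -> None:
        """Store the digest of a generated output."""
        self._digests.add(self._digest(text))

    def was_generated(self, text: str) -> bool:
        """Return True if the text matches a recorded output."""
        return self._digest(text) in self._digests


REGISTRY = OutputRegistry()
REGISTRY.record("The capital of France is Paris.")
print(REGISTRY.was_generated("the capital of  France is Paris."))
print(REGISTRY.was_generated("Paris is the capital of France."))
```

Because any paraphrase changes the digest, a lookup of this kind is usually paired with the other mechanisms the rubric names, such as watermarking or a trained detector.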
92. Documentation for responsible use (Score: 0)

Does the developer provide documentation for responsible use by downstream developers?

HuggingFace provides usage recommendations
https://huggingface.co/deepseek-ai/DeepSeek-R1#usage-recommendations
No information provided by developer or acknowledgement of lack of documentation.

To receive a point, the developer should provide documentation for responsible use. This might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.
Our Developer Documentation Hub consolidates integration guides, responsible‐use guidelines, and best practices: [link]
93. Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model on the top-5 distribution channels disclosed?

You represent and warrant that Services may not be used in or for the benefit of, or exported, re-exported, or transferred (a) to or within any country subject to comprehensive sanctions under Export Control and Sanctions Laws; (b) to any party on any restricted party lists under any applicable Export Control and Sanctions Laws that would prohibit your use of Services.
https://cdn.deepseek.com/policies/en-US/deepseek-open-platform-terms-of-service.html
A score of 1 is granted as the terms of use clarify that sanctioned entities are prohibited users

We will award this point for a description of the company's policies for permitted and prohibited users on its top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
We allow usage by individuals 13 years of age or older who accept our Terms of Service. We prohibit use by export controlled entities or persons on denied-parties lists or in countries under U.S. embargo. We also reserve the right to restrict use if users engage in targeted harassment. For example, we only permit users over 13 with valid credentials, and prohibit usage from OFAC-sanctioned regions. We do not allow state-sponsored disinformation agencies to access our services.
94. Permitted, restricted, and prohibited uses (Score: 1)

Which uses are explicitly allowed, conditionally permitted, or strictly disallowed under the acceptable use policy for the top-5 distribution channels?

R1 released on HF under an MIT license with no prohibited uses. Section 3.4 of terms of use: "3.4 You will not use the Services to generate, express or promote content or a chatbot that:..."
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html
A score of 1 is granted as the restricted and prohibited uses are disclosed across key distribution channels.

We will award this point for a rough characterization of two or more of permitted, restricted, and prohibited uses across the top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
Permitted uses include general conversational queries, brainstorming, and coding assistance. Restricted uses include adult or violent content that requires caution or additional review. Prohibited uses include facilitating illicit activity, disinformation campaigns, or harassment. For example, we permit typical user requests like Q&A, text generation, and educational uses. We restrict content that depicts graphic violence or sexual content by applying additional filters. We prohibit any use aiming to conduct unlawful surveillance, promote extremist violence, or defraud others.
95. AUP enforcement process (Score: 1)

What are the methods used by the developer to enforce the acceptable use policy?

For users who violate regulations, DeepSeek has the right, according to its reasonable judgment and without notice, to take actions such as warnings, functionality restrictions, service suspensions or terminations, account bans, content deletions, and to save related records and report to competent authorities
https://chat.deepseek.com/downloads/DeepSeek%20User%20Agreement.pdf
A score of 1 is granted as the user agreement provides a reasonable best-effort description of the processes used to respond to potential AUP violations.

We will award this point if the developer discloses the processes (automated or manual) it uses to detect, review, and respond to potential acceptable use policy violations. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not use such methods to enforce its acceptable use policy.
We combine automated checks with human review for severe or repeated violations, issuing warnings or suspensions after repeat violations.
96. AUP enforcement frequency (Score: 0)

Are statistics on the developer's AUP enforcement disclosed?

Open Platform Terms describe enforcement actions and record retention for violation
https://cdn.deepseek.com/policies/en-US/deepseek-open-platform-terms-of-service.html
No information provided on statistics regarding frequency of enforcement.

We will award this point if the developer discloses enforcement statistics (e.g., violation counts or actions taken) from its enforcement of its acceptable use policy. Alternatively, we will award this point if the developer reports that it does not enforce its acceptable use policy.
We publish a quarterly enforcement report detailing violation counts by prohibited use category and the corresponding actions taken at [LINK]
97. Regional policy variations (Score: 0)

Are differences in the developer's acceptable use or model behavior policy across geographic regions disclosed?

Terms of Use are governed by the laws of the People's Republic of China. Privacy terms have jurisdiction-specific sections.
https://cdn.deepseek.com/policies/en-US/deepseek-terms-of-use.html https://cdn.deepseek.com/policies/en-US/deepseek-privacy-policy.html
It is not clear from the disclosure if AUP or MBP varies by deployment context (e.g. when deployed in other regions)

We will award this point if the developer discloses distinctions in its AUP or MBP and provides examples of differences in multiple specific regions, or states that no differences exist. For example, some jurisdictions impose content restrictions beyond those in the developer’s global policy that may necessitate local deviations.
In the EU, our model automatically omits certain categories of political content to comply with local election laws. In all other regions, we follow the general global AUP at [URL].
98. Oversight mechanism (Score: 0)

Does the developer have an internal or external body that reviews core issues regarding the model prior to deployment?

No information provided on oversight mechanisms.
Not disclosed
No information provided.

We will award this point if the developer discloses that it has such an internal or external body and provides some description of its scope, or alternatively if the developer discloses that it has no such body. An oversight mechanism covers governance structure beyond mere external risk evaluation, asking whether a formal body regularly reviews design and deployment decisions. Core issues may include model objectives, data usage, or risk mitigation.
We convene a monthly advisory board of ethicists, civil society representatives, and academics to review training processes and identify new risks. The board's recommendations regarding deployment are not binding.
99. Whistleblower protection (Score: 0)

Does the developer disclose a whistleblower protection policy?

No information provided on whistleblower protection policies.
Not disclosed
No information provided.

We will award this point if the developer discloses (i) the existence of a whistleblower protection policy, (ii) what protections are afforded to whistleblowers, (iii) how reports are handled and investigated, and (iv) any external oversight of the whistleblower protection process. This might include protections for whistleblowers who report safety, ethical, or legal concerns related to the model. We will also award this point if the developer discloses that it has no such policy.
We maintain a whistleblower protection policy that prohibits retaliation against employees who report safety or ethical concerns about our models. Reports can be submitted anonymously through our ethics hotline, are reviewed by an independent board committee, and whistleblowers are entitled to legal representation provided by the company. Our policy is audited annually by an independent ethics consultancy.
100. Government commitments (Score: 1)

What commitments has the developer made to government bodies?

We also participated in the China AI Safety and Security Commitments Framework. Moreover, China’s AI regulatory framework differs significantly from that of the U.S., relying on formal legislation and regulatory oversight rather than solely on voluntary corporate commitments. As such, participation in commitments should not be used as a direct basis for comparison.
https://aihub.caict.ac.cn/ai_security_and_safety_commitments
DeepSeek discloses its participation in the China AI Safety and Security Commitments Framework

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.
We have committed to the White House Voluntary Commitments and the Seoul Commitments.