Alibaba Transparency Report

1. Data acquisition methods (Score: 0)

What methods does the developer use to acquire data used to build the model?

The most relevant document is the Qwen 3 technical report. The relevant section on data acquisition first describes pretraining data: "Specifically, we collected twice as many pre-training tokens—covering three times more languages. All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages and dialects, with a total of 36 trillion tokens. This dataset includes high-quality content in various domains such as coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data. To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (Bai et al., 2025) to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined using the Qwen2.5 model (Yang et al., 2024b), which helps improve its quality. Through this two-step process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total. Besides, we employ Qwen2.5 (Yang et al., 2024b), Qwen2.5-Math (Yang et al., 2024c), and Qwen2.5-Coder (Hui et al., 2024) models to synthesize trillions of text tokens in different formats, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further expand the pre-training corpus by incorporating additional multilingual data and introducing more languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has been significantly increased from 29 to 119, enhancing the model’s linguistic coverage and cross-lingual capabilities." A further reasoning stage of pretraining is described: "Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data", followed by a long-context stage: "In the final pre-training stage, we collect high-quality long context corpora to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions of tokens with a sequence length of 32,768 tokens". Beyond pre-training, the report also describes post-training data.
https://arxiv.org/pdf/2505.09388
The developer does not make its methods for acquiring pretraining data adequately clear.

Which of the following data acquisition methods does the developer use: (i) acquiring existing public datasets, (ii) crawling the web, (iii) using data acquired via its existing products and services, (iv) licensing existing data from external parties, (v) having humans create or annotate new data, (vi) using models to generate new data, or (vii) other data acquisition methods not captured by the above? For example, if the developer uses reinforcement learning from human feedback to train models using model-generated outputs with human preference annotations, this would satisfy categories (v) and (vi). Alternatively, if the developer post-trains its model using off-the-shelf preference data (for example, the Alpaca dataset), this would satisfy category (i).
To build our model, we acquire data by crawling the Internet for publicly available data, licensing data from third parties, and using models to synthetically generate new data. Humans do not create new data, nor do we use data from our other products/services to train our model.
2. Public datasets (Score: 0)

What are the top-5 sources (by volume) of publicly available datasets acquired for building the model?

No information provided about public datasets.
https://arxiv.org/pdf/2505.09388
The developer does not provide information about specific public datasets.

We define a source as the entity or means by which the developer acquires data. We define the top-5 sources as the top-5 sources by data volume.
We acquire publicly available data from only two sources: The Pile and CommonCrawl.
3. Crawling (Score: 0)

If data collection involves web-crawling, what is the crawler name and opt-out protocol?

No information provided about crawling.
https://arxiv.org/pdf/2505.09388
The developer does not provide information about crawling practices.

We award this point for disclosure of the crawler name and opt-out protocols, including if/how they respect the Robots Exclusion Protocol (robots.txt).
Our web crawler is named A and information on the opt-out protocol can be found at this URL: ... The Common Crawl web crawler is named CCBot and information on the opt-out protocol can be found at this URL: https://commoncrawl.org/faq
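
For context, whether a site has opted out with respect to a given crawler can be checked against its robots.txt. Below is a minimal sketch using Python's standard urllib.robotparser with the CCBot user-agent string; example.com is a placeholder site, not a real target.

```python
# Minimal sketch: check whether a site's robots.txt permits Common Crawl's
# CCBot crawler. example.com is a placeholder; illustrative only.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetch and parse the site's robots.txt

# Site owners opt out of Common Crawl by disallowing the CCBot user agent.
print("CCBot may fetch:", rp.can_fetch("CCBot", "https://example.com/page.html"))
```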
4. Usage data used in training (Score: 0)

What are the top-5 sources (by volume) of usage data from the developer's products and services that are used for building the model?

Via one product (Qwen Chat), Alibaba discloses it may use "de-identified user content" "to improve the accuracy and quality of our services (including our AI models)".
https://chat.qwen.ai/legal-agreement/privacy-policy
The developer does not disclose which other products contribute usage data.

We define usage data as data collected from the use of a developer's products or services.
We use usage data from only two sources: our deployed chatbot X and our online social media platform Y.
5. Notice of usage data used in training (Score: 0)

For the top-5 sources of usage data, how are users of these products and services made aware that this data is used for building the model?

Via one product (Qwen Chat), Alibaba discloses it may use "de-identified user content" "to improve the accuracy and quality of our services (including our AI models)".
https://chat.qwen.ai/legal-agreement/privacy-policy
The developer does not disclose which other products contribute usage data.

We define usage data notice as the proactive disclosure to users of how their data is used for model development. For example, via a pop-up with a description, a link to the privacy policy, or link to a description of company practices.
We notify users of our chatbot X that chatbot interactions are used to train our AI via a pop-up as shown at this URL: ... We notify users of our platform Y about whether their data is used to train our AI via a link to our privacy policy when they sign up for an account.
6. Licensed data sources (Score: 0)

What are the top-5 sources (by volume) of licensed data acquired for building the model?

No information is provided about licensed data in Alibaba's own documentation. Alibaba appears to have signed data licensing deals with, for example, China Media Group.
https://www.cctv.com/2025/02/28/ARTIag2AgSIgSB0HcOoRpT3E250228.shtml
The developer does not provide information about specific licensed data.

We define a source as the entity from which the developer acquires data. For example, the Associated Press is reportedly a source of licensed data for OpenAI.
We license data from only three sources: A, B, and C.
7. Licensed data compensation (Score: 0)

For each of the top-5 sources of licensed data, are details related to compensation disclosed?

No information is provided about licensed data in Alibaba's own documentation. Alibaba appears to have signed data licensing deals with, for example, China Media Group.
https://www.cctv.com/2025/02/28/ARTIag2AgSIgSB0HcOoRpT3E250228.shtml
The developer does not provide information about specific licensed data.

We award this point if the model developer describes the compensation structure specified in the contract with the data source or indicates they are prohibited from sharing this information if contractually mandated.
We compensate A by ... We cannot disclose information on compensation for our relationships with B and C due to contractual terms that prohibit public disclosure.
8. New human-generated data sources (Score: 0)

What are the top-5 sources (by volume) of new human-generated data for building the model?

The most relevant document is the Qwen 3 technical report. The relevant passage is: "After reserving a validation query set, we generate N candidate responses for each remaining query using QwQ-32B (Qwen Team, 2025). When QwQ-32B consistently fails to generate correct solutions, human annotators manually assess the accuracy of the responses. For queries with positive Pass@N, further stringent filtering criteria are applied to remove responses that (1) yield incorrect final answers, (2) contain substantial repetition, (3) clearly indicate guesswork without adequate reasoning, (4) exhibit inconsistencies between the thinking and summary contents, (5) involve inappropriate language mixing or stylistic shifts, or (6) are suspected of being overly similar to potential validation set items. Subsequently, a carefully selected subset of the refined dataset is used for the initial cold-start training of the reasoning patterns. The objective at this stage is to instill foundational reasoning patterns in the model without overly emphasizing immediate reasoning performance. This approach ensures that the model’s potential is not limited, allowing for greater flexibility and improvement during the subsequent reinforcement learning (RL) phase. To achieve this objective effectively, it is preferable to minimize both the number of training samples and the training steps during this preparatory phase."
https://arxiv.org/pdf/2505.09388
The developer does not clarify whether this is the sole source of human annotation, nor the relationship between the developer and the annotators.

We define a source as the entity or means by which the developer acquires data. For example, Scale AI could be a source of new human-generated data. By new, we mean the data is specifically acquired for the purposes of building the model.
We acquire new human-generated data from only two sources: our internal data annotation team and an external vendor, A.
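
To make the quoted rejection-sampling procedure concrete, the following is a rough sketch of a generate-then-filter loop of that shape; the generator, correctness checker, candidate count, and repetition heuristic are all hypothetical stand-ins rather than Alibaba's actual pipeline.

```python
# Rough sketch of a generate-then-filter loop in the shape described above.
# `generate`, `is_correct`, n_candidates, and the repetition heuristic are
# hypothetical stand-ins, not Alibaba's actual pipeline.
def looks_repetitive(text: str, max_dup_ratio: float = 0.3) -> bool:
    words = text.split()
    grams = [tuple(words[i:i + 5]) for i in range(len(words) - 4)]
    return bool(grams) and 1 - len(set(grams)) / len(grams) > max_dup_ratio

def build_cold_start_set(queries, generate, is_correct, n_candidates=8):
    kept, needs_human_review = [], []
    for query in queries:
        candidates = [generate(query) for _ in range(n_candidates)]
        passed = [c for c in candidates if is_correct(query, c)]
        if not passed:                      # Pass@N == 0: route to annotators
            needs_human_review.append(query)
            continue
        # Stringent filtering, mirroring criteria (2)-(6) in the report.
        passed = [c for c in passed if not looks_repetitive(c)]
        kept.extend((query, c) for c in passed)
    return kept, needs_human_review
```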
9. Instructions for data generation (Score: 0)

For each of the top-5 sources of human-generated data, what instructions does the developer provide for data generation?

No information provided about instructions.
https://arxiv.org/pdf/2505.09388
The developer does not provide the instructions involved in human data generation.

The instructions should be those provided to the data source. For example, if a third-party vendor works directly with the data laborers to produce the data, the instructions from the developer to this vendor should be disclosed.
We instruct our internal data annotation team as follows: ... We instruct vendor A as follows: ...
10. Data laborer practices (Score: 0)

For the top-5 sources of human-generated data, how are laborers compensated, where are they located, and what labor protections are in place?

No information provided about labor practices.
https://arxiv.org/pdf/2505.09388
The developer does not provide the labor conditions involved in human data generation.

For each data source, we require (i) the compensation in either USD or the local currency, (ii) any countries where at least 25% of the laborers are located, and (iii) a description of any labor protections. We will award this point if the developer discloses that it is not aware of data laborer practices.
Our internal data annotation team is located in the US, is compensated at 20 USD per hour, and deals with data that does not require specific protections. Our sole external data vendor contracts laborers in Kenya, compensates them at KES 15000 per month, and implements protections for dealing with toxic or unsafe content such as A and B.
11. Synthetic data sources (Score: 1)

What are the top-5 sources (by volume) of synthetic data acquired for building the model?

The most relevant document is the Qwen 3 technical report. The relevant passages include: "we generate N candidate responses for each remaining query using QwQ-32B" and "To ensure that the performance of the Stage 2 model is not compromised by the additional SFT, the “thinking” data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself" and "To efficiently expand the training data, we employ a multi-modal approach: Qwen2.5-VL (Bai et al., 2025) is finetuned to extract text from extensive PDF documents. We also generate synthetic data using domain-specific models: Qwen2.5-Math (Yang et al., 2024c) for mathematical content and Qwen2.5-Coder (Hui et al., 2024) for code-related data."
https://arxiv.org/pdf/2505.09388
The developer describes four sources of synthetic data with the associated models.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We synthetically generate data using only our previous model X and an early checkpoint of our current flagship model Y.
12. Synthetic data purpose (Score: 1)

For the top-5 sources of synthetically generated data, what is the primary purpose for data generation?

The most relevant document is the Qwen 3 technical report. The relevant passages include: "we generate N candidate responses for each remaining query using QwQ-32B" and "To ensure that the performance of the Stage 2 model is not compromised by the additional SFT, the “thinking” data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself" and "To efficiently expand the training data, we employ a multi-modal approach: Qwen2.5-VL (Bai et al., 2025) is finetuned to extract text from extensive PDF documents. We also generate synthetic data using domain-specific models: Qwen2.5-Math (Yang et al., 2024c) for mathematical content and Qwen2.5-Coder (Hui et al., 2024) for code-related data."
https://arxiv.org/pdf/2505.09388
The developer labels each synthetic data generation method with a purpose.

We define a source of synthetic data as a non-human mechanism (e.g. a machine learning model) used to generate the data.
We use model X to generate instruction-tuning data and we use model Y to generate candidate responses that humans select between to provide human preference data for reinforcement learning with human feedback.
13. Data processing methods (Score: 0)

What are the methods the developer uses to process acquired data to determine the data directly used in building the model?

The most relevant document is the Qwen 3 technical report. The relevant passage on data processing during pre-training states: "To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (Bai et al., 2025) to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined using the Qwen2.5 model (Yang et al., 2024b), which helps improve its quality. Through this two-step process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total. Besides, we employ Qwen2.5 (Yang et al., 2024b), Qwen2.5-Math (Yang et al., 2024c), and Qwen2.5-Coder (Hui et al., 2024) models to synthesize trillions of text tokens in different formats, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further expand the pre-training corpus by incorporating additional multilingual data and introducing more languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has been significantly increased from 29 to 119, enhancing the model’s linguistic coverage and cross-lingual capabilities. We have developed a multilingual data annotation system designed to enhance both the quality and diversity of training data. This system has been applied to our large-scale pre-training datasets, annotating over 30 trillion tokens across multiple dimensions such as educational value, fields, domains, and safety. These detailed annotations support more effective data filtering and combination. Unlike previous studies (Xie et al., 2023; Fan et al., 2023; Liu et al., 2024b) that optimize the data mixture at the data source or domain level, our method optimizes the data mixture at the instance-level through extensive ablation experiments on small proxy models with the fine-grained data labels." In addition, further information about data processing is provided for the four stages of post-training: Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL.
https://arxiv.org/pdf/2505.09388
The developer describes data processing, but does not detail the specific methods, and data processing for pretraining textual data is not described.

We will award this point for disclosure of all of the methods used to process acquired data. Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
We process data in the following six-step pipeline: (i) removal of HTML artifacts, (ii) deduplication, (iii) language identification to retain English data, (iv) removal of CSAM imagery, (v) removal of train-test overlap, and (vi) tokenization.
14. Data processing purpose (Score: 0)

For each data processing method, what is its primary purpose?

The most relevant document is the Qwen 3 technical report. The relevant passage on data processing during pre-training states: "To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (Bai et al., 2025) to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined using the Qwen2.5 model (Yang et al., 2024b), which helps improve its quality. Through this two-step process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total. Besides, we employ Qwen2.5 (Yang et al., 2024b), Qwen2.5-Math (Yang et al., 2024c), and Qwen2.5-Coder (Hui et al., 2024) models to synthesize trillions of text tokens in different formats, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further expand the pre-training corpus by incorporating additional multilingual data and introducing more languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has been significantly increased from 29 to 119, enhancing the model’s linguistic coverage and cross-lingual capabilities. We have developed a multilingual data annotation system designed to enhance both the quality and diversity of training data. This system has been applied to our large-scale pre-training datasets, annotating over 30 trillion tokens across multiple dimensions such as educational value, fields, domains, and safety. These detailed annotations support more effective data filtering and combination. Unlike previous studies (Xie et al., 2023; Fan et al., 2023; Liu et al., 2024b) that optimize the data mixture at the data source or domain level, our method optimizes the data mixture at the instance-level through extensive ablation experiments on small proxy models with the fine-grained data labels." In addition, further information about data processing is provided for the four stages of post-training: Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL.
https://arxiv.org/pdf/2505.09388
The developer describes data processing, but does not state the specific purposes, and data processing for pretraining textual data is not described.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of primary purposes for a data processing method could include: (i) removes low quality data, (ii) removes potentially personal/copyrighted data, (iii) removes product-irrelevant data, (iv) removes toxic data, (v) improves evaluation integrity, or (vi) prepares the data for training the model.
15. Data processing techniques (Score: 0)

For each data processing method, how does the developer implement the method?

The most relevant document is the Qwen 3 technical report. The relevant passage on data processing during pre-training states: "To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (Bai et al., 2025) to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined using the Qwen2.5 model (Yang et al., 2024b), which helps improve its quality. Through this two-step process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total. Besides, we employ Qwen2.5 (Yang et al., 2024b), Qwen2.5-Math (Yang et al., 2024c), and Qwen2.5-Coder (Hui et al., 2024) models to synthesize trillions of text tokens in different formats, including textbooks, question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further expand the pre-training corpus by incorporating additional multilingual data and introducing more languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has been significantly increased from 29 to 119, enhancing the model’s linguistic coverage and cross-lingual capabilities. We have developed a multilingual data annotation system designed to enhance both the quality and diversity of training data. This system has been applied to our large-scale pre-training datasets, annotating over 30 trillion tokens across multiple dimensions such as educational value, fields, domains, and safety. These detailed annotations support more effective data filtering and combination. Unlike previous studies (Xie et al., 2023; Fan et al., 2023; Liu et al., 2024b) that optimize the data mixture at the data source or domain level, our method optimizes the data mixture at the instance-level through extensive ablation experiments on small proxy models with the fine-grained data labels." In addition, further information about data processing is provided for the four stages of post-training: Long-CoT Cold Start, Reasoning RL, Thinking Mode Fusion, and General RL.
https://arxiv.org/pdf/2505.09388
The developer describes data processing, but does not detail the specific techniques, and data processing for pretraining textual data is not described.

Data processing refers to any method that substantively changes the content of the data. For example, compression or changing the data file format is generally not in the scope of this indicator.
Examples of how a data processing method is implemented could include: the method (i) is implemented using an in-house regular expression, (ii) is implemented using an in-house tool based on n-gram overlap, (iii) is implemented using a FastText classifier trained on Wikipedia data, (iv) is implemented using hash collisions with the NCMEC database, (v) is implemented by searching for known benchmark canary strings, and (vi) is implemented using tiktoken (https://github.com/openai/tiktoken).
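
As an illustration of example (ii), a minimal n-gram-overlap deduplicator might look like the following sketch; the 8-gram size and 0.7 Jaccard threshold are arbitrary assumptions, not a disclosed configuration.

```python
# Minimal n-gram-overlap deduplication sketch (example (ii) above).
# The 8-gram size and 0.7 Jaccard threshold are arbitrary assumptions.
def ngrams(text: str, n: int = 8) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def deduplicate(docs, threshold: float = 0.7):
    kept, seen = [], []
    for doc in docs:
        grams = ngrams(doc)
        is_dup = any(
            grams and prev and len(grams & prev) / len(grams | prev) > threshold
            for prev in seen
        )
        if not is_dup:
            kept.append(doc)
            seen.append(grams)
    return kept
```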
16. Data size (Score: 1)

Is the size of the data used in building the model disclosed?

The most relevant document is the Qwen 3 technical report. The relevant passage states: "General Stage (S1): At the first pre-training stage, all Qwen3 models are trained on over 30 trillion tokens using a sequence length of 4,096 tokens. At this stage, the models have been fully pre-trained on language proficiency and general world knowledge, with training data covering 119 languages and dialects. (2) Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data. The models are further pre-trained with about 5T higher-quality tokens at a sequence length of 4,096 tokens. We also accelerate the learning rate decay during this stage. (3) Long Context Stage: In the final pre-training stage, we collect high-quality long context corpora to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions of tokens with a sequence length of 32,768 tokens. The long context corpus includes 75% of text between 16,384 to 32,768 tokens in length, and 25% of text between 4,096 to 16,384 in length."
https://arxiv.org/pdf/2505.09388
The developer discloses that the data size is 36 trillion tokens.

To receive this point, the developer should report data size in appropriate units (e.g. bytes, words, tokens, images, frames) and broken down by modality. Data size should be reported to a precision of one significant figure (e.g. 4 trillion tokens, 200 thousand images). The size should reflect data directly used in building the model (i.e. training data) and not data that was acquired but unused, or data used to evaluate the model.
We used 3 x 10^12 tokens of text, 1 x 10^6 images, and 5 x 10^5 hours of audio for training.
17. Data language composition (Score: 0)

For all text data used in building the model, what is the composition of languages?

Qwen 3 blog post states: "Multilingual Support: Qwen3 models are supporting 119 languages and dialects. This extensive multilingual capability opens up new possibilities for international applications, enabling users worldwide to benefit from the power of these models." The post then lists the supported languages and dialects by family:
- Indo-European: English, French, Portuguese, German, Romanian, Swedish, Danish, Bulgarian, Russian, Czech, Greek, Ukrainian, Spanish, Dutch, Slovak, Croatian, Polish, Lithuanian, Norwegian Bokmål, Norwegian Nynorsk, Persian, Slovenian, Gujarati, Latvian, Italian, Occitan, Nepali, Marathi, Belarusian, Serbian, Luxembourgish, Venetian, Assamese, Welsh, Silesian, Asturian, Chhattisgarhi, Awadhi, Maithili, Bhojpuri, Sindhi, Irish, Faroese, Hindi, Punjabi, Bengali, Oriya, Tajik, Eastern Yiddish, Lombard, Ligurian, Sicilian, Friulian, Sardinian, Galician, Catalan, Icelandic, Tosk Albanian, Limburgish, Dari, Afrikaans, Macedonian, Sinhala, Urdu, Magahi, Bosnian, Armenian
- Sino-Tibetan: Chinese (Simplified Chinese, Traditional Chinese, Cantonese), Burmese
- Afro-Asiatic: Arabic (Standard, Najdi, Levantine, Egyptian, Moroccan, Mesopotamian, Ta’izzi-Adeni, Tunisian), Hebrew, Maltese
- Austronesian: Indonesian, Malay, Tagalog, Cebuano, Javanese, Sundanese, Minangkabau, Balinese, Banjar, Pangasinan, Iloko, Waray (Philippines)
- Dravidian: Tamil, Telugu, Kannada, Malayalam
- Turkic: Turkish, North Azerbaijani, Northern Uzbek, Kazakh, Bashkir, Tatar
- Tai-Kadai: Thai, Lao
- Uralic: Finnish, Estonian, Hungarian
- Austroasiatic: Vietnamese, Khmer
- Other: Japanese, Korean, Georgian, Basque, Haitian, Papiamento, Kabuverdianu, Tok Pisin, Swahili
https://qwenlm.github.io/blog/qwen3/
The frequency of each supported language in the text data used to build the model is not provided.

To receive this point, the developer should report (i) all languages which make up at least 1% of the data and their corresponding proportions and (ii) a brief description of how languages are labeled (if a publicly available tool is used, include a link to the tool). Proportions should be reported to a precision of two significant figures and should describe proportions of documents labeled with some language. An "Unknown" category may be included to denote documents where the language could not be identified.
English 80%, Spanish 5.0%, French 3.0%, Chinese 2.0%, Unknown 10%. We used a FastText-based classifier trained on Wikipedia data to identify languages.
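
For illustration, the fastText-based labeling named in this example could be sketched as follows; this assumes the public lid.176.bin model file has been downloaded from fasttext.cc, and the 0.5 confidence cutoff is an arbitrary choice.

```python
# Sketch of language labeling with the public fastText LID model.
# Assumes lid.176.bin has been downloaded from fasttext.cc; the 0.5
# confidence cutoff is an arbitrary choice.
import fasttext
from collections import Counter

model = fasttext.load_model("lid.176.bin")
corpus = ["Hello, world!", "Bonjour tout le monde"]  # placeholder documents

def label_language(text: str) -> str:
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0].removeprefix("__label__") if probs[0] >= 0.5 else "unknown"

counts = Counter(label_language(doc) for doc in corpus)
total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {100 * n / total:.1f}%")
```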
18. Data domain composition (Score: 0)

For all the data used in building the model, what is the composition of domains covered in the data?

From section 3 of the Qwen 3 technical report: "This dataset includes high-quality content in various domains such as coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks, books, multilingual texts, and synthetic data"
https://arxiv.org/pdf/2505.09388
The developer provides no significant information about domain composition, as only very high-level domains are referenced.

To receive this point, the developer should report the composition of the main domains included in the data used to train the model. This data should be at a level of granularity lower than broad claims about training on "internet data". For example, this could include the proportion of data from e-commerce, social media, news, code, etc. based on the URLs from which the data is sourced. Proportions should be reported to a precision of one significant figure.
Social media 40%, code repositories 30%, news articles 20%, e-commerce product listings 5%, scientific papers 5%.
19. External data access (Score: 0)

Does a third-party have direct access to the data used to build the model?

No information provided about external data access.
Not disclosed
The developer does not provide information about external data access.

By a third-party, we mean entities that are financially independent of the developer. We will award this point if at least one such entity is named as having direct access to the data. With that said, we may award this point if the developer provides justifications for prohibiting access to narrowly-scoped parts of the data.
Third-parties that have direct access to the data include organizations A and B.
20. Data replicability (Score: 0)

Is the data used to build the model described in enough detail to be externally replicable?

No information provided relevant for data replication.
Not disclosed
The developer does not provide information about external data replication.

We will award this point if the description contains (i) a list of all publicly available training data and where to obtain it and (ii) a list of all training data obtainable from third parties and where to obtain it. These conditions refer to criteria 2 and 3 under the OSI Open Source AI v1.0 definition.
The listing of publicly available training data can be found at this URL ... and the listing of all training data obtainable from third parties can be found at this URL ...
21. Compute usage for final training run (Score: 0)

Is the amount of compute used in the model's final training run disclosed?

No information provided about training compute, though the technical report notes: "Qwen3 models are trained on over 30 trillion tokens using a sequence length of 4,096 tokens".
https://arxiv.org/pdf/2505.09388
The developer provides no information about training compute, reporting only the number of tokens trained over.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 5 x 10^25 FLOPs). This number should represent the compute used to train the final model across all model stages.
Our model was trained using 5 x 10^25 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
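
Although Alibaba does not disclose training FLOPs, a back-of-envelope external estimate can be formed from its disclosed token count using the common FLOPs ≈ 6ND approximation (N = active parameters, D = training tokens); the result below is our estimate, not a developer disclosure.

```python
# External back-of-envelope estimate via FLOPs ~= 6 * N * D. Inputs are
# developer-disclosed figures; the result is not an Alibaba disclosure.
active_params = 22e9     # Qwen3-235B-A22B: ~22B activated parameters per token
training_tokens = 36e12  # 36 trillion pre-training tokens

flops = 6 * active_params * training_tokens
print(f"~{flops:.0e} FLOPs")  # ~5e+24 FLOPs for the final pre-training run
```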
22. Compute usage including R&D (Score: 0)

Is the amount of compute used to build the model, including experiments, disclosed?

No information provided about cumulative training compute.
Not disclosed
The developer does not provide information about cumulative training compute.

Compute should be reported in appropriate units, which most often will be floating point operations (FLOPs), along with a description of the measurement methodology, which may involve estimation. Compute should be reported to a precision of one significant figure (e.g. 7 x 10^26 FLOPs). Compared to the previous indicator, this indicator should include an estimation of the total compute used across experiments used towards the final training run for the model (such as including hyperparameter optimization or other experiments), and not just the final training run itself.
Our cumulative compute usage involved in building the model was 7 x 10^26 FLOPs, measured according to the Frontier Model Forum guidance provided at this URL: https://www.frontiermodelforum.org/updates/issue-brief-measuring-training-compute/
23. Development duration for final training run (Score: 0)

Is the amount of time required to build the model disclosed?

From the technical report: "In addition, it achieves these gains with much greater training efficiency, requiring only 1/10 of the GPU hours compared to the four-stage training method." Table 21 of the report also provides related figures, but only for Qwen3-8B.
https://arxiv.org/pdf/2505.09388
The developer provides no information about the overall duration of model training, reporting only a relative efficiency gain.

The amount of time should be specified in terms of both the continuous duration of time required and the number of hardware hours used. The continuous duration of time required to build the model should be reported in weeks, days, or hours to a precision of one significant figure (e.g. 3 weeks). The number of hardware hours should be reported to a precision of one significant figure and include the type of hardware hours. No form of decomposition into phases of building the model is required for this indicator, but it should be clear what the duration refers to (e.g. training the model, or training and subsequent evaluation and red teaming).
Our model was trained over a period of 90 days using 4x10^4 NVIDIA H100 GPU-days.
24. Compute hardware for final training run (Score: 0)

For the primary hardware used to build the model, is the amount and type of hardware disclosed?

No information provided about training hardware.
Not disclosed
The developer does not provide information about training hardware.

In most cases, this indicator will be satisfied by information regarding the number and type of GPUs or TPUs used to train the model. The number of hardware units should be reported to a precision of one significant figure (e.g. 800 NVIDIA H100 GPUs). We will not award this point if (i) the training hardware generally used by the developer is disclosed, but the specific hardware for the given model is not, or (ii) the training hardware is disclosed, but the amount of hardware is not. We will award this point even if information about the interconnects between hardware units is not disclosed.
Our model was trained using 1000 NVIDIA H100 GPUs.
25. Compute provider (Score: 1)

Is the compute provider disclosed?

"Qwen is a series of large language models (LLMs) independently developed by Alibaba Cloud."
https://www.alibabacloud.com/help/en/model-studio/what-is-qwen-llm
The developer discloses Alibaba Cloud is the compute provider.

For example, the compute provider may be the model developer in the case of a self-owned cluster, a cloud provider like Microsoft Azure, Google Cloud Platform, or Amazon Web Services, or a national supercomputer. In the event that compute is provided by multiple sources or is highly decentralized, we will award this point if a developer makes a reasonable effort to describe the distribution of hardware owners.
Compute is provided by Google Cloud Platform.
26. Energy usage for final training run (Score: 0)

Is the amount of energy expended in building the model disclosed?

No information provided about training energy usage.
Not disclosed
The developer does not provide information about training energy usage.

Energy usage should be reported in appropriate units, which most often will be megawatt-hours (MWh), along with a description of the measurement methodology, which may involve estimation. Energy usage should be reported to a precision of one significant figure (e.g. 500 MWh). No form of decomposition into compute phases is required, but it should be clear whether the reported energy usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate energy usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Our model was trained using an estimated 1 x 10^4 MWh of energy. To estimate training energy consumption, we divided training FLOPs (5 x 10^25) by a conversion factor derived from NVIDIA A100 GPU specifications (3.74 × 10^21 FLOPs/MWh), given we train using FP16 with sparsity.
27. Carbon emissions for final training run (Score: 0)

Is the amount of carbon emitted in building the model disclosed?

No information provided about environmental impacts.
Not disclosed
The developer does not provide information about environmental impacts.

Emissions should be reported in appropriate units, which most often will be tons of carbon dioxide emitted (tCO2), along with a description of the measurement methodology, which may involve estimation. Emissions should be reported to a precision of one significant figure (e.g. 500 tCO2). No form of decomposition into compute phases is required, but it should be clear whether the reported emissions is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that generate emissions. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it. Emissions should correspond to the energy used in the previous indicator.
Training our model yielded an estimated 5 x 10^3 tCO2. To estimate training carbon emissions, we multiplied training energy usage (1 x 10^4 MWh) by a 2023 estimate of US data center carbon intensity (0.375 tCO2/MWh), given the data centers used in training operate in the US.
28. Water usage for final training run (Score: 0)

Is the amount of clean water used in building the model disclosed?

No information provided about environmental impacts.
Not disclosed
The developer does not provide information about environmental impacts.

Clean water usage should be reported in appropriate units, which most often will be megaliters, along with a description of the measurement methodology, which may involve estimation. Clean water usage should be reported to a precision of one significant figure (e.g., 5,000 ML). No form of decomposition into compute phases is required, but it should be clear whether the reported water usage is for a single model run or includes additional runs, or hyperparameter tuning, or training other models like reward models, or other steps in the model development process that necessitate water usage. If the developer is unable to measure or estimate this quantity due to information not being available from another party (e.g. compute provider), we will award this point if the developer explicitly discloses what information it lacks and why it lacks it.
Training our model yielded an estimated 20 ML of water usage. To estimate training water usage, we multiplied training energy usage (1 x 10^4 MWh) by a 2021 estimate of US data center water efficiency (1.8 ML per 1,000 MWh), given the data centers used in training operate in the US.
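
Taken together, the example answers for indicators 26-28 imply a single conversion chain from compute to energy, carbon, and water. The sketch below reproduces that arithmetic with the same illustrative constants; none of these are measured values for Qwen3.

```python
# The conversion chain implied by the examples for indicators 26-28.
# Every constant is illustrative, taken from those examples; none are
# measured values for Qwen3.
train_flops = 5e25            # example final-run compute (FLOPs)
flops_per_mwh = 3.74e21       # NVIDIA A100, FP16 w/ sparsity (example figure)
tco2_per_mwh = 0.375          # 2023 US data-center carbon intensity estimate
ml_per_mwh = 1.8 / 1000       # 2021 US data-center water efficiency estimate

energy_mwh = train_flops / flops_per_mwh   # ~1 x 10^4 MWh
carbon_tco2 = energy_mwh * tco2_per_mwh    # ~5 x 10^3 tCO2
water_ml = energy_mwh * ml_per_mwh         # ~2 x 10^1 ML

print(f"~{energy_mwh:.0e} MWh, ~{carbon_tco2:.0e} tCO2, ~{water_ml:.0e} ML")
```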
29. Internal compute allocation (Score: 0)

How is compute allocated across the teams building and working to release the model?

No information provided about internal compute allocation.
Not disclosed
The developer does not provide information about compute allocation.

To receive a point, the developer should provide the compute allocated to each team involved in training the model. We understand there might be no clear allocation of compute across different teams; in that case, report an estimate of the compute used over the last year. Compute allocation should be reported to at least one significant figure.
- Safety — 15%
- Pre-training — 60%
- Post-training — 15%
- Infrastructure and reliability — 5%
30. Model stages (Score: 1)

Are all stages in the model development process disclosed?

From the technical report: "The Qwen3 models are pre-trained through a three-stage process: (1) General Stage (S1): At the first pre-training stage, all Qwen3 models are trained on over 30 trillion tokens using a sequence length of 4,096 tokens. At this stage, the models have been fully pre-trained on language proficiency and general world knowledge, with training data covering 119 languages and dialects. (2) Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data. The models are further pre-trained with about 5T higher-quality tokens at a sequence length of 4,096 tokens. We also accelerate the learning rate decay during this stage. (3) Long Context Stage: In the final pre-training stage, we collect high-quality long context corpora to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions of tokens with a sequence length of 32,768 tokens. The long context corpus includes 75% of text between 16,384 to 32,768 tokens in length, and 25% of text between 4,096 to 16,384 in length. Following Qwen2.5 (Yang et al., 2024b), we increase the base frequency of RoPE from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). Meanwhile, we introduce YARN (Peng et al., 2023) and Dual Chunk Attention (DCA," "The post-training pipeline of Qwen3 is strategically designed with two core objectives: (1) Thinking Control: This involves the integration of two distinct modes, namely the “non-thinking” and “thinking” modes, providing users with the flexibility to choose whether the model should engage in reasoning or not, and to control the depth of thinking by specifying a token budget for the thinking process. (2) Strong-to-Weak Distillation: This aims to streamline and optimize the post-training process for lightweight models. By leveraging the knowledge from large-scale models, we substantially reduce both the computational costs and the development efforts required for building smaller scale models."
https://arxiv.org/pdf/2505.09388
The developer discloses three pretraining stages and the two core objectives of the post-training pipeline.

Stages refer to each identifiable step that constitutes a substantive change to the model during the model building process. We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear and complete description of these stages.
We define five stages in building the model: (1) unsupervised pre-training, (2) supervised instruction tuning, (3) RLHF, (4) domain-specific fine-tuning, and (5) final safety alignment.
31. Model objectives (Score: 1)

For all stages that are described, is there a clear description of the associated learning objectives or a clear characterization of the nature of this update to the model?

"The Qwen3 models are pre-trained through a three-stage process: (1) General Stage (S1): At the first pre-training stage, all Qwen3 models are trained on over 30 trillion tokens using a sequence length of 4,096 tokens. At this stage, the models have been fully pre-trained on language proficiency and general world knowledge, with training data covering 119 languages and dialects. (2) Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data. The models are further pre-trained with about 5T higher-quality tokens at a sequence length of 4,096 tokens. We also accelerate the learning rate decay during this stage. (3) Long Context Stage: In the final pre-training stage, we collect high-quality long context corpora to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions of tokens with a sequence length of 32,768 tokens. The long context corpus includes 75% of text between 16,384 to 32,768 tokens in length, and 25% of text between 4,096 to 16,384 in length. Following Qwen2.5 (Yang et al., 2024b), we increase the base frequency of RoPE from 10,000 to 1,000,000 using the ABF technique (Xiong et al., 2023). Meanwhile, we introduce YARN (Peng et al., 2023) and Dual Chunk Attention (DCA," "The post-training pipeline of Qwen3 is strategically designed with two core objectives: (1) Thinking Control: This involves the integration of two distinct modes, namely the “non-thinking” and “thinking” modes, providing users with the flexibility to choose whether the model should engage in reasoning or not, and to control the depth of thinking by specifying a token budget for the thinking process. (2) Strong-to-Weak Distillation: This aims to streamline and optimize the post-training process for lightweight models. By leveraging the knowledge from large-scale models, we substantially reduce both the computational costs and the development efforts required for building smaller scale models."
https://arxiv.org/pdf/2505.09388
The developer names all five stages by their objectives.

We recognize that different developers may use different terminology for these stages, or conceptualize the stages differently. We will award this point if there is a clear description of the update to the model related to each stage, whether that is the intent of the stage (e.g. making the model less harmful), a mechanistic characterization (e.g. minimizing a specific loss function), or an empirical assessment (e.g. evaluation results conducted before and after the stage).
During unsupervised pre-training, the objective is next-token prediction. During supervised instruction tuning, we optimize for correctness and helpfulness on labeled tasks. RLHF aligns model outputs with human preference judgments. Domain-specific fine-tuning focuses on improving in-domain capabilities using specialized data (e.g., code or legal text). Final safety alignment reduces disallowed or harmful responses.
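
For the pre-training objective named in this example, a minimal numpy rendering of next-token-prediction cross-entropy is sketched below; it is illustrative only, not any developer's training code.

```python
# Minimal numpy rendering of the next-token-prediction objective;
# illustrative only.
import numpy as np

def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    # logits: [seq_len, vocab_size]; logits[t] predicts tokens[t + 1].
    preds, targets = logits[:-1], tokens[1:]
    preds = preds - preds.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())
```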
32. Code access (Score: 0)

Does the developer release code that allows third-parties to train and run the model?

No code is provided in relation to model training.
Not disclosed
The developer does not provide training code.

The released code does not need to match the code used internally.
We release training and inference code under an Apache 2.0 license at https://github.com/..., enabling others to replicate our core pipeline.
33. Organization chart (Score: 0)

How are employees developing and deploying the model organized internally?

The technical report lists core contributors and other contributors.
https://arxiv.org/pdf/2505.09388
The developer does not disclose significant information about organizational structure, listing only core and other contributors in the relevant paper.

To receive a point, the developer should provide both the internal organization chart for the team developing the model and the headcounts (or a proportion of headcounts) by team.
The model team comprises 63 people, organized as follows:
- CEO
- Managing Director (Safety) — 24 people
- Managing Director (Pre-training) — 12 people
- Managing Director (Post-training) — 11 people
- Managing Director (API) — 6 people
- Director (Infrastructure and reliability) — 7 people
- Director (PR and marketing) — 4 people
- Director (hiring) — 7 people
34. Model cost (Score: 0)

What is the cost of building the model?

No information provided about monetary cost.
Not disclosed
The developer does not provide information about monetary costs.

Monetary cost should be reported in appropriate currency (e.g. USD), along with the measurement methodology, which may involve estimation. Cost should be reported to a precision of one significant figure (e.g. 200 million USD).
We spent approximately 200 million USD on building the model: 50 million for data acquisition, 10 million for data processing, 20 million for personnel, 80 million for compute for R&D priced at market rates, and 40 million for compute for the final training run priced at market rates.
35. Basic model properties (Score: 1)

Are all basic model properties disclosed?

Input modality: Text
Output modality: Text
Model components: Decoder-only model pre-trained in three stages (general, reasoning, long-context), then post-trained in four stages (Long-CoT cold-start, reasoning RL w/ GRPO, thinking mode fusion w/ SFT, general RL), with strong-to-weak distillation for the lightweight models
Model size: 0.6B, 1.7B, 4B, 8B, 14B, and 32B dense models, plus MoE models w/ 30B (3B active) and 235B (22B active) parameters
Model architecture: Autoregressive transformer LM (fine-grained expert segmentation for MoE, w/o shared experts) w/ GQA, SwiGLU, RoPE, and RMSNorm w/ pre-normalization and QK-Norm.
"The architecture of the Qwen3 dense models is similar to Qwen2.5 (Yang et al., 2024b), including using Grouped Query Attention (GQA, Ainslie et al., 2023), SwiGLU (Dauphin et al., 2017), Rotary Positional Embeddings (RoPE, Su et al., 2024), and RMSNorm (Jiang et al., 2023) with pre-normalization. Besides, we remove QKV-bias used in Qwen2 (Yang et al., 2024a) and introduce QK-Norm (Dehghani et al., 2023) to the attention mechanism to ensure stable training for Qwen3. Key information on model architecture is provided in Table 1. The Qwen3 MoE models share the same fundamental architecture as the Qwen3 dense models. Key information on model architecture is provided in Table 2. We follow Qwen2.5-MoE (Yang et al., 2024b) and implement fine-grained expert segmentation (Dai et al., 2024). The Qwen3 MoE models have 128 total experts with 8 activated experts per token. Unlike Qwen2.5-MoE, the Qwen3-MoE design excludes shared experts. Furthermore, we adopt the global-batch load balancing loss (Qiu et al., 2025) to encourage expert specialization. These architectural and training innovations have yielded substantial improvements in model performance across downstream tasks." (See Tables 1 and 2 and Sections 3 and 4 of the report for pre-training and post-training details.)
https://arxiv.org/pdf/2505.09388
The developer discloses input modality, output modality, model components, model size and model architecture.

Basic model properties include: the input modality, output modality, model size, model components, and model architecture. To receive a point, all model properties should be disclosed. Modalities refer to the types or formats of information that the model can accept as input. Examples of input modalities include text, image, audio, video, tables, graphs. Model components refer to distinct and identifiable parts of the model. We recognize that different developers may use different terminology for model components, or conceptualize components differently. Examples include: (i) For a text-to-image model, components could refer to a text encoder and an image encoder, which may have been trained separately. (ii) For a retrieval-augmented model, components could refer to a separate retriever module. Model size should be reported in appropriate units, which generally is the number of model parameters, broken down by named component. Model size should be reported to a precision of one significant figure (e.g. 500 billion parameters for text encoder, 20 billion parameters for image encoder). Model architecture is the overall structure and organization of a foundation model, which includes the way in which any disclosed components are integrated and how data moves through the model during training or inference. We recognize that different developers may use different terminology for model architecture, or conceptualize the architecture differently; a sufficient disclosure includes any clear, though potentially incomplete, description of the model architecture.
Input modality: Text
Output modality: Text
Model components: Decoder-only model trained using self-supervised learning, followed by supervised fine-tuning and RLHF that are used to align the language model to follow users' instructions and be helpful, harmless, and honest.
Model size: 70B parameters
Model architecture: Autoregressive (causal, decoder-only) transformer language model with rotary position embeddings, trained on the next-token prediction task.
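
To illustrate the MoE routing described in the quoted architecture details (128 experts, 8 activated per token, no shared experts), here is a toy numpy sketch; the gating details, such as where the softmax is applied, are assumptions rather than Alibaba's implementation.

```python
# Toy sketch of top-k MoE routing with the disclosed shape (128 experts,
# 8 active per token, no shared experts). Gating details (e.g., where the
# softmax is applied) are assumptions, not Alibaba's implementation.
import numpy as np

def route(x, router_w, k=8):
    scores = x @ router_w                        # router logits: [tokens, n_experts]
    top_k = np.argsort(scores, axis=-1)[:, -k:]  # indices of the k best experts
    gate_logits = np.take_along_axis(scores, top_k, axis=-1)
    gates = np.exp(gate_logits - gate_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)   # softmax over the selected experts
    return top_k, gates

x = np.random.randn(4, 64)            # 4 tokens with a toy hidden size of 64
router_w = np.random.randn(64, 128)   # 128 experts
experts, weights = route(x, router_w) # each token: 8 expert ids plus weights
```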
36. Deeper model properties (Score: 1)

Is a detailed description of the model architecture disclosed?

The model's configuration file is published on HuggingFace: https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/config.json
https://huggingface.co/Qwen/Qwen3-235B-A22B/blob/main/config.json
The developer discloses the relevant configuration file.

To receive a point, the model architecture should be described in enough detail to allow for an external entity to fully implement the model. Publicly available code or a configuration file for a model training library (e.g., GPT-NeoX) would be a sufficiently detailed description.
The configuration file for training our model using a public model training library A can be found at [URL].
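
For reference, the cited configuration file can be pulled and inspected programmatically; the following assumes the huggingface transformers library and the repository id cited above.

```python
# Pull and inspect the cited config.json via the Hugging Face Hub.
# Assumes the transformers library is installed.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen3-235B-A22B")
print(config)  # layer count, hidden size, expert counts, RoPE settings, etc.
```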
37. Model dependencies (Score: 1)

Are the model(s) from which the model is derived disclosed?

Lightweight models (Qwen3-0.6B, 1.7B, 4B, 8B, and 14B; Qwen3-30B-A3B) are distilled from strong models (Qwen3-235B-A22B, Qwen3-32B) "Figure 1: Post-training pipeline of the Qwen3 series models. The Strong-to-Weak Distillation pipeline is specifically designed to optimize lightweight models, encompassing 5 dense models (Qwen3-0.6B, 1.7B, 4B, 8B, and 14B) and one MoE model (Qwen3-30B-A3B). This approach enhances model performance while effectively imparting robust mode-switching capabilities. The distillation process is divided into two primary phases: (1) Off-policy Distillation: At this initial phase, we combine the outputs of teacher models generated with both /think and /no think modes for response distillation. This helps lightweight student models develop basic reasoning skills and the ability to switch between different modes of thinking, laying a solid foundation for the next on-policy training phase. (2) On-policy Distillation: In this phase, the student model generates on-policy sequences for fine-tuning. Specifically, prompts are sampled, and the student model produces responses in either /think or /no think mode. The student model is then fine-tuned by aligning its logits with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B) to minimize the KL divergence."
https://arxiv.org/pdf/2505.09388
The developer discloses the models on which the model directly depends.

We will award this point for a comprehensive disclosure of the model or models on which the foundation model directly depends or from which it is derived, as well as the method by which it was derived (e.g., through fine-tuning, model merging, or distillation). Additionally, we will award a point if the developer discloses that the model is not dependent on or derived from any model.
This model is a fine-tune of Camel-70B. We used the methods described in [PAPER URL] for distillation.
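
The quoted on-policy distillation step minimizes a KL divergence between student and teacher logits; a minimal numpy sketch of such a loss follows. The direction of the KL term is an assumption, as the report does not specify it.

```python
# Numpy sketch of a logit-alignment distillation loss. The KL direction
# (student || teacher) is an assumption; the report does not specify it.
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distill_kl(student_logits, teacher_logits):
    # Both [seq_len, vocab_size]; mean KL over sequence positions.
    ls, lt = log_softmax(student_logits), log_softmax(teacher_logits)
    return float((np.exp(ls) * (ls - lt)).sum(axis=-1).mean())
```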
38. Benchmarked inference (Score: 0)

Is the compute and time required for model inference disclosed for a clearly-specified task on clearly-specified hardware?

No information provided
Not disclosed
The developer does not provide information about benchmarked inference.

The duration should be reported in seconds to a precision of one significant figure (e.g. 0.002 seconds). Compute usage for inference should be reported in FLOPs/second to a precision of one significant figure (e.g. 5 x 10^21 FLOPs/second). The hardware in this evaluation need not be the hardware the developer uses for inference. The developer can report this figure over some known or public dataset.
It takes 0.002 seconds and 5 x 10^21 FLOPs/second to generate 100,000 tokens as 5,000 sequences of length 20 given inputs of length 40 from [DATASET URL]. The fixed set of hardware is 8 NVIDIA A100s.
39. Researcher credits (Score: 0)

Is a protocol for granting external entities API credits for the model disclosed?

Overarching researcher programs with universities are described for Alibaba Cloud but not specifically for researcher credits in relation to Alibaba models like Qwen 3.
https://help.aliyun.com/zh/model-studio/get-api-key; https://university.aliyun.com/activity/air
The developer does not disclose (1) a specific location where researchers can request API credits, (2) criteria for selecting external entities for credit access, or (3) policies regarding decision timeframes. The model is available through standard commercial API services and open-source downloads, but no special researcher credit program is documented.

A model credit access protocol refers to the steps, requirements, and considerations involved in granting credits to external entities. We will award this point if the developer discloses key details of its protocol, including (i) where external entities can request access to credits (e.g. via an access request form); (ii) explicit criteria for selecting external entities; and (iii) its policy on granting a transparent decision on whether access has been granted within a specified, reasonable period of time. Additionally, we will award a point if the developer discloses that it does not grant external entities API credits.
We implement a researcher access program: (i) Access can be requested from [URL] (ii) Any researcher at an accredited research institution is eligible to apply. Decisions are made based on the alignment between the applicant's project description and our target research directions (as described here: [URL]). (iii) Decision notifications are sent within three weeks of the application receipt.
40. Specialized access (Score: 0)

Does the developer disclose if it provides specialized access to the model?

No information provided
https://help.aliyun.com/zh/model-studio/get-api-key?spm=a2c4g.11186623.help-menu-2400256.d_2_0_0.27ef1c90fRN9Jn; https://university.aliyun.com/activity/air
The developer does not provide information about specialized access.

Specialized access could include several categories, such as early access, subsidized access, or deeper access (e.g., to model weights or checkpoints, that are not publicly available). We will award this point if the developer discloses (i) if it provides specialized access and (ii) statistics on the number of users granted access across academia, industry, non-profits, and governments, to one significant figure.
We provide early access to the model via API to: (1) 250 academics vetted by our program (2) 0 industry affiliates (3) 0 non-profit affiliates (4) 2 government entities with whom we have signed MoUs. We provide no other specialized research access.
41. Open weights (Score: 1)

Are the model's weights openly released?

Model weights are available on Huggingface: https://huggingface.co/Qwen/Qwen3-235B-A22B
https://huggingface.co/Qwen/Qwen3-235B-A22B
The developer makes the model weights publicly available.

To receive this point, model weights need to be publicly available at no cost. Developers may receive this point even if there are some restrictions on the external entities that are permitted access (e.g. geographic restrictions), insofar as these restrictions are transparent (e.g. via a license or some high-level description of who has been granted access to the foundation model).
Model weights are available on HuggingFace by following this link: [URL]
42. Agent Protocols (Score: 1)

Are the agent protocols supported for the model disclosed?

We support MCP for agents built with Qwen3-32B and 235B-A22B: "For tool use capabilities, we recommend taking a look at Qwen-Agent, which provides a wrapper around these APIs to support tool use or function calling with MCP support. Tool use with Qwen3 can also be conducted with SGLang, vLLM, Transformers, llama.cpp, Ollama, etc. Follow guides in our documentation to see how to enable the support." The linked example (https://github.com/QwenLM/Qwen-Agent/blob/main/examples/assistant_qwen3.py) configures 'model': 'qwen3-235b-a22b' and 'model': 'Qwen/Qwen3-32B'.
https://github.com/QwenLM/Qwen3
The developer discloses MCP support with additional information.

Agent protocols are specifications that define how autonomous agents exchange messages, context, or function calls with other agents, tools, or services (e.g., Anthropic’s Model Context Protocol (MCP) and Google’s Agent‑to‑Agent (A2A) spec). To earn this point, documentation must enumerate each protocol and describe any deviations or proprietary extensions.
We support MCP and A2A for agents built using model A
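As an illustration of the MCP support quoted above, a minimal Qwen-Agent sketch modeled on the linked assistant_qwen3.py example; the MCP server entry, endpoint, and tool names are assumptions rather than verified configuration:

    from qwen_agent.agents import Assistant

    llm_cfg = {
        'model': 'qwen3-235b-a22b',    # model name taken from the quoted example
        'model_server': 'dashscope',   # assumed hosted endpoint; adjust as needed
    }

    tools = [
        # Hypothetical MCP server registration; any MCP-compliant server works.
        {'mcpServers': {'time': {'command': 'uvx', 'args': ['mcp-server-time']}}},
        'code_interpreter',            # built-in Qwen-Agent tool
    ]

    bot = Assistant(llm=llm_cfg, function_list=tools)
    messages = [{'role': 'user', 'content': 'What time is it in UTC?'}]
    for responses in bot.run(messages=messages):
        pass                           # bot.run streams incremental message lists
    print(responses[-1]['content'])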
43. Capabilities taxonomy (Score: 1)

Are the specific capabilities or tasks that were optimized for during post-training disclosed?

- Reasoning
- Controllability of reasoning
- Format following
- Preference alignment
- Agent ability (tool-use)
- Abilities for specialized scenarios (e.g., RAG)
"4.1 Long-CoT Cold Start The objective at this stage is to instill foundational reasoning patterns in the model without overly emphasizing immediate reasoning performance. 4.2 Reasoning RL 4.3 Thinking Mode Fusion The goal of the Thinking Mode Fusion stage is to integrate the “non-thinking” capabilities into the previously developed “thinking” model. This approach allows developers to manage and control reasoning behaviors, while also reducing the cost and complexity of deploying separate models for thinking and non-thinking tasks. 4.4 General RL These tasks specifically target enhancements in the following core capabilities: • Instruction Following: [...] instructions, including requirements related to content, format, length, and the use of structured output, delivering responses that align with user expectations. • Format Following: In addition to explicit instructions, we expect the model to adhere to specific formatting conventions. For instance, it should respond appropriately to the /think and /no think flags by switching between thinking and non-thinking modes, and consistently use designated tokens (e.g., <think> and </think>) to separate the thinking and response parts in the final output. • Preference Alignment: For open-ended queries, preference alignment focuses on improving the model’s helpfulness, engagement, and style, ultimately delivering a more natural and satisfying user experience. • Agent Ability: This involves training the model to correctly invoke tools via designated interfaces. During the RL rollout, the model is allowed to perform complete multi-turn interaction cycles with real environment execution feedback, thereby improving its performance and stability in long-horizon decision-making tasks. • Abilities for Specialized Scenarios: In more specialized scenarios, we design tasks tailored to the specific context. For example, in Retrieval-Augmented Generation (RAG) tasks, we incorporate reward signals to guide the model toward generating accurate and contextually appropriate responses, thereby minimizing the risk of hallucination."
https://arxiv.org/pdf/2505.09388
The developer specifically lists the capabilities that were optimized for during post-training, including reasoning, format following, and agentic tool use.

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.
We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual language proficiency on non-English languages (4) Tool-use
44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

- Reasoning: MATH-500, ZebraLogic
- Controllability of reasoning: ThinkFollow
- Format following: IFEval, Multi-IF, LengthControl, ThinkFollow
- Preference alignment: Arena-Hard, AlignBench v1.1
- Agent ability (tool-use): BFCL, ToolUse
"General Tasks: We utilize benchmarks including MMLU-Redux (Gema et al., 2024), GPQA-Diamond (Rein et al., 2023), C-Eval (Huang et al., 2023), and LiveBench (2024-11-25) (White et al., 2024). For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy. • Alignment Tasks: To evaluate how well the model aligns with human preferences, we employ a suite of specialized benchmarks. For instruction-following performance, we report the strict-prompt accuracy of IFEval (Zhou et al., 2023). To assess alignment with human preferences on general topics, we utilize Arena-Hard (Li et al., 2024) and AlignBench v1.1 (Liu et al., 2023b). For writing tasks, we rely on Creative Writing V3 (Paech, 2024) and WritingBench (Wu et al., 2025) to evaluate the model’s proficiency and creativity. • Math & Text Reasoning: For evaluating mathematical and logical reasoning skills, we employ high-level math benchmarks including MATH-500 (Lightman et al., 2023), AIME’24 and AIME’25 (AIME, 2025), and text reasoning tasks including ZebraLogic (Lin et al., 2025) and AutoLogi (Zhu et al., 2025). For AIME problems, each year’s questions include Part I and Part II, totaling 30 questions. For each question, we sample 64 times and take the average accuracy as the final score. • Agent & Coding: To test the model’s proficiency in coding and agent-based tasks, we use BFCL v3 (Yan et al., 2024), LiveCodeBench (v5, 2024.10-2025.02) (Jain et al., 2024), and Codeforces Ratings from CodeElo (Quan et al., 2025). For BFCL, all Qwen3 models are evaluated using the FC format, and YaRN was used to deploy the models to a context length of 64k for Multi-Turn evaluation. Some baselines are derived from the BFCL leaderboard, taking the higher scores between FC and Prompt formats. For models not reported on the leaderboard, the Prompt formats are evaluated. For LiveCodeBench, for the non-thinking mode, we use the officially recommended prompt, while for the thinking mode, we adjust the prompt template to allow the model to think more freely, by removing the restriction 'You will not return anything except for the program'. To evaluate the performance gap between models and competitive programming experts, we use CodeForces to calculate Elo ratings. In our benchmark, each problem is solved by generating up to eight independent reasoning attempts. • Multilingual Tasks: For multilingual capabilities, we evaluate four kinds of tasks: instruction following, knowledge, mathematics, and logical reasoning. Instruction following is assessed using Multi-IF (He et al., 2024), which focuses on 8 key languages. Knowledge assessment consisted of two types: regional knowledge evaluated through INCLUDE (Romanou et al., 2024), covering 44 languages, and general knowledge assessed with MMMLU (OpenAI, 2024) across 14 languages, excluding the unoptimized Yoruba language; for these two benchmarks, we sample only 10% of the original data to improve evaluation efficiency. The mathematics task employs MT-AIME2024 (Son et al., 2025), encompassing 55 languages, and PolyMath (Wang et al., 2025), which includes 18 languages. Logical reasoning is evaluated using MlogiQA, covering 10 languages, sourced from Zhang et al. (2024)."
https://arxiv.org/pdf/2505.09388
The developer evaluates the model’s capabilities prior to its release and discloses them concurrently with the release.

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.
We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool
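As a toy illustration of the repeated-sampling protocol quoted above (e.g., 64 samples per AIME question, averaged); `generate` and `is_correct` are placeholders standing in for the real evaluation harness:

    def avg_at_k(question, reference, generate, is_correct, k=64):
        # Sample k completions for one question and return the mean accuracy,
        # mirroring the "sample 64 times and take the average accuracy" protocol.
        hits = sum(is_correct(generate(question), reference) for _ in range(k))
        return hits / k

    # Benchmark score (sketch): mean of avg_at_k over all questions.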
45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

The newly introduced in-house benchmarks do not appear to be public; the other benchmarks have public prompts and/or code, though it is not entirely clear that the developer uses the same code for those benchmarks. "To evaluate the effectiveness of Thinking Mode Fusion and General Reinforcement Learning (RL) during the post-training, we conduct evaluations on various stages of the Qwen3-32B model. In addition to the datasets mentioned earlier, we introduce several in-house benchmarks to monitor other capabilities. These benchmarks include: • CounterFactQA: Contains counterfactual questions where the model needs to identify that the questions are not factual and avoid generating hallucinatory answers. • LengthCtrl: Includes creative writing tasks with length requirements; the final score is based on the difference between the generated content length and the target length. • ThinkFollow: Involves multi-turn dialogues with randomly inserted /think and /no think flags to test whether the model can correctly switch thinking modes based on user queries. • ToolUse: Evaluates the stability of the model in single-turn, multi-turn, and multi-step tool calling processes. The score includes accuracy in intent recognition, format accuracy, and parameter accuracy during the tool calling process."
https://arxiv.org/pdf/2505.09388#page=21.32
The developer does not disclose open-source code and prompts that would allow external reproduction of the model's capability evaluations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

No information provided, though Alibaba did provide this information for Qwen-2.
Technical Report, Hugging Face, GitHub, External Paper
The developer does not disclose overlap between the training set and dataset used to evaluate model capabilities, though the developer did so for a previous model.

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).
We compute train-test overlap using n-gram matching using the procedure described here [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
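For context, a toy n-gram matching check of the kind the rubric asks developers to describe; the whitespace tokenization and 13-gram window are illustrative assumptions:

    def ngrams(text, n=13):
        # Set of contiguous n-grams under simple whitespace tokenization.
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    def train_test_overlap(train_docs, eval_examples, n=13):
        # Fraction of eval examples sharing at least one n-gram with training data.
        train_set = set()
        for doc in train_docs:
            train_set |= ngrams(doc, n)
        flagged = sum(1 for ex in eval_examples if ngrams(ex, n) & train_set)
        return flagged / max(len(eval_examples), 1)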
47. Risks taxonomy (Score: 0)

Are the risks considered when developing the model disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not disclose a comprehensive risk taxonomy for the model in question. While there are scattered references to general safety considerations, there is no systematic disclosure of the specific risks considered during the development.

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.
We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information
48. Risks evaluation (Score: 0)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not disclose specific, quantified risk evaluations for the model in question.

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.
We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not disclose publicly available code and prompts that would allow external reproduction of risk evaluations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.
The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]
50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about pre-deployment evaluation.

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will award this point if the developer discloses no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.
We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.
51. External risk evaluation (Score: 0)

Are the parties contracted to evaluate model risks disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about external evaluation.

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.
Contracted parties responsible for evaluating risks: (1) 2 contracting non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers
52. Mitigations taxonomy (Score: 0)

Are the post-training mitigations implemented when developing the model disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about mitigations.

By post-training mitigations, we refer to interventions implemented by the developer during the post-training phase to reduce the likelihood and/or the severity of the model’s risks. We recognize that different developers may use different terminology for mitigations, or conceptualize mitigations differently. We will award this point for a complete list of mitigations considered, even if some of the mitigations are not reflected in the final model. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
We implement supervised fine tuning and reinforcement learning with human feedback to address model risks. We use no other methods to address risks.
53. Mitigations taxonomy mapped to risk taxonomy (Score: 0)

Does the developer disclose how the post-training mitigations map onto the taxonomy of risks?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about mitigations.

We will award this point for a complete mapping of the primary risk that each mitigation is meant to address, even if the mitigation potentially maps on to other risks in the taxonomy. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We use supervised fine tuning for general instruction following. We use RLHF to reduce the model's propensity to output information about cybercrimes, bioweapons, disinformation, content harassing someone, and PII.
54. Mitigations efficacy (Score: 0)

Does the developer evaluate and disclose the impact of post-training mitigations?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about mitigations.

We will award this point if the developer discloses the results on the risk evaluations before and after the post-training mitigations are applied. Alternatively, we will award this point if the developer reports that it does not mitigate risk in this way.
Pre-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.80 (2) Harassment: 0.91 (3) Cybersecurity risks: 0.56 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.62 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.52 Post-mitigations (measured through mean attack success rate on HarmBench): (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity risks: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Personally-identifiable information (subset of General Harm in HarmBench): 0.02
55. External reproducibility of mitigations evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of post-training mitigations disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about mitigations.

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all mitigations evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source. Alternatively, we will award this point if the developer reports that it does not mitigate risk.
We release the code and prompts for reproducing post-training mitigation evaluations at this GitHub link: [URL]
56. Model theft prevention measures (Score: 0)

Does the developer disclose the security measures used to prevent unauthorized copying (“theft”) or unauthorized public release of the model weights?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about model theft protections.

This indicator assesses the developer's disclosures regarding how it addresses the risk that malicious actors or insiders could exfiltrate or replicate proprietary weights. Security measures could include insider threat analysis and detection, in addition to external threat management. Examples of such measures include encryption at rest, key management, remote attestation, or auditing for suspicious queries. We will award a point if the developer discloses specific steps taken to safeguard the model weights or that none are implemented.
We store model weights on encrypted volumes with hardware-based key management. We monitor inference queries for suspicious patterns (like repeated attempts to reconstruct weights token-by-token), and we audit all staff access logs monthly.
57. Release stages (Score: 0)

Are the stages of the model's release disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not provide information about release stages.

Release stages include A/B testing, release on a user-facing product, GA release, open-weight release, etc. We recognize that the release of a foundation model falls along a spectrum, with many forms of partial release, and that different developers may conceptualize release differently. We will award a point if the developer provides a clear identification of the stages through which the model was released.
We began with an internal alpha test for two weeks, followed by a closed beta with selected enterprise partners for one month, then a public waitlisted preview, and finally a general availability release once thresholds on safety benchmarks were met.
58. Risk thresholds (Score: 0)

Are risk thresholds disclosed?

No information provided
Technical Report, Hugging Face, GitHub
The developer does not disclose explicit risk thresholds.

Risk thresholds determine when a risk level is unacceptably high to a developer (e.g. leading to the decision to not release a model), moderately high (e.g. triggering additional safety screening), or low enough to permit normal usage. We will award this point if the developer discloses explicit risk thresholds that clarify (i) which harmful outcomes are being scored, (ii) how the scores are computed (in general terms, not necessarily disclosing internal algorithms), and (iii) what triggers an action to block, delay, or otherwise modify a model's release. Alternatively, we will award a point if the developer discloses that it does not consider explicit risk thresholds during model release.
Our risk threshold for biorisks is the ability to autonomously create bioweapons. Current models score a medium: they don't autonomously create bioweapons but could help a skilled practitioner with access to materials in speeding up creation of bioweapons. Risk thresholds higher than medium would delay the model's release until the risk level drops to medium or below.
59. Versioning protocol (Score: 1)

Is there a disclosed protocol for versioning and deprecation of the model?

"To optimize resource utilization and ensure you can use the latest and best models, Model Studio deprecates historical models based on model iteration and upgrades from time to time. Notification Timeline Snapshot models (model names with specific date identifiers, such as qwen-max-2025-01-25) will send deprecation notices 30 days before the deprecation date. Main models (core versions of a model series) will send deprecation notices 3 months before the deprecation date. Methods Through emails, internal messages, and announcements on the official website. Only users who have called the to-be-deprecated models within the last 3 months will receive the emails and internal messages."
https://www.alibabacloud.com/help/en/model-studio/model-depreciation?spm=a2c63.p38356.0.i1
The developer discloses a protocol for versioning and deprecating the model.

We will award a point if the developer discloses how model versions are labeled, updated, deprecated, and communicated to users.
We version models based on the date of release: e.g., ModelName-11-01-2024. We additionally provide ModelName-latest, corresponding to the latest release. We deprecate model versions when we plan to remove access, giving users six months' notice. Users should respond to model deprecation by switching to the newest version of the model or an equivalent non-deprecated model. Users can switch to a different model by replacing the model identifier (e.g., with ModelName-latest for the latest version) in API calls or through the Python SDK.
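To illustrate the migration step the exemplar describes, a hedged sketch of swapping the model identifier in an OpenAI-compatible API call; the endpoint, key, and model names are placeholders, not a documented Alibaba endpoint:

    from openai import OpenAI

    # Placeholders: substitute the real base URL and key for the channel in use.
    client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

    # Migrating off a deprecated snapshot is a one-line change of the identifier.
    resp = client.chat.completions.create(
        model="ModelName-latest",  # was: "ModelName-11-01-2024" (deprecated)
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(resp.choices[0].message.content)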
60. Change log (Score: 1)

Is there a disclosed change log for the model?

Alibaba maintains a changelog via its GitHub page and documentation pages for Qwen-3
GitHub, Documentation
The developer documents new features, fixes, and improvements by version on GitHub and documentation pages.

We will award a point if the developer publishes a version-by-version record of new features, fixes, or performance improvements.
On 11/1/2024 (version ModelName-11-01-2024), we improved model reasoning in technical domains. This resulted in a 20-point increase on the MATH benchmark (from 62% to 82%). Past change logs can be viewed at [URL]
61. Foundation model roadmap (Score: 1)

Is a forward-looking roadmap for upcoming models, features, or products disclosed?

- QwQ-Max-Preview blog post mentions a planned official open-source release of QwQ-Max under the Apache 2.0 License (URL: https://qwenlm.github.io/blog/qwq-max-preview/)
- Planned release of a dedicated app for Qwen Chat with real-time responsiveness and productivity tool integration (URL: https://qwenlm.github.io/blog/qwq-max-preview/)
- Planned release of smaller QwQ variants like QwQ-32B for local device deployment (URL: https://qwenlm.github.io/blog/qwq-max-preview/)
- QwQ-32B already released in March 2025 as part of roadmap execution (URL: https://qwenlm.github.io/blog/qwq-32b/)
- Alibaba's $53 billion investment over the next three years in cloud computing and AI infrastructure (URL: https://www.alibabacloud.com/blog/alibaba-cloud-strengthens-ai-capabilities-with-innovations-for-international-customers_602126)
- API access for Qwen3 models will soon be available through Alibaba's AI model development platform Model Studio (URL: https://www.alibabacloud.com/press-room/alibaba-introduces-qwen3-setting-new-benchmark)
Technical Report, Hugging Face, GitHub
The developer discloses a forward-looking roadmap for upcoming models, features and products.

A foundation model roadmap is a transparent statement about how the developer intends to evolve or expand its LLM offerings, including upcoming models, major feature releases, or expanded products based on the model, along with approximate timelines or version milestones. It can be high-level (e.g., “new model Q2 2025”), but must exist publicly.
We plan to release ModelX2 in Q2 2025, featuring enhanced multilingual capabilities and improved retrieval. We also aim to launch an enterprise-specific product tier for regulated industries by early 2026.
62. Top distribution channels (Score: 1)

Are the top-5 distribution channels for the model disclosed?

Model weights are distributed through HuggingFace and ModelScope; API access is provided through Alibaba Cloud (Model Studio).
https://huggingface.co/Qwen/Qwen3-235B-A22B https://modelscope.cn/models/Qwen/Qwen3-235B-A22B https://www.alibabacloud.com/help/en/model-studio/models
The developer discloses Hugging Face, ModelScope, and Alibaba Cloud as top distribution channels.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. API providers and model distributors may be ranked separately using different metrics as long as the total number of distribution channels equals five (if five distribution channels exist). For example, the developer may choose to disclose the top-3 API providers (ranked by the number of queries) and the top-2 model distributors (ranked by the number of downloads).
We provide API access to the model through A, B, and C. We distribute model weights through D and E. We pick the top-3 API providers based on the average number of queries per month and the top-2 model weight providers based on the average number of downloads per month.
63. Quantization (Score: 0)

Is the quantization of the model served to customers in the top-5 distribution channels disclosed?

Model weights are provided at a variety of quantization levels (e.g., bf16, fp8) on HuggingFace, ModelScope, and Kaggle
https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-MLX-bf16 https://modelscope.cn/models/Qwen/Qwen3-235B-A22B-FP8 https://huggingface.co/Qwen/Qwen3-235B-A22B-MLX-bf16 https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8 https://www.kaggle.com/models/qwen-lm/qwen-3
The developer does not disclose model precision for each of the top-5 distribution channels; it is lacking for Alibaba Cloud.

We will award this point for a disclosure of the model precision in each of the top-5 distribution channels.
We serve the model at 16-bit precision on all distribution channels.
64. Terms of use (Score: 1)

Are the terms of use of the model disclosed?

- Qwen Chat Terms of Service webpage
- The model weights are released under the apache-2.0 license (on HF, ModelScope, and Kaggle)
- Alibaba Cloud Model Studio Terms: "4.48 Alibaba Cloud Model Studio Alibaba Cloud may from time to time in its sole and absolute discretion make available the Model Studio services to you, relating to services for developing generative artificial intelligence ("AI") models and applications, as described at Alibaba Cloud website (https://www.alibabacloud.com/) or via such other means as Alibaba Cloud may determine from time to time ("Model Studio")."
https://chat.qwen.ai/legal-agreement/terms-of-service https://modelscope.cn/models/Qwen/Qwen3-235B-A22B https://huggingface.co/Qwen/Qwen3-235B-A22B-MLX-bf16 https://www.alibabacloud.com/help/en/legal/latest/alibaba-cloud-international-website-product-terms-of-service-v-3-8-0 https://www.kaggle.com/models/qwen-lm/qwen-3
The developer discloses the terms of use for its model for each of the top distribution channels.

We define terms of use to include terms of service and model licenses. We will award this point for a pointer to the terms of service or model license. In the event that model's licenses are written more generally, it should be clear which assets they apply to. We recognize that different developers may adopt different business models and therefore have different types of model licenses. Examples of model licenses include responsible AI licenses, open-source licenses, and licenses that allow for commercial use. Terms of service should be disclosed for each of the top-5 distribution channels. However, we will award this point if there are terms-of-service that appear to apply to the bulk of the model’s distribution channels.
Our terms of service are published at https://ourcompany.com/model-tos - these terms cover both our API and all distribution channels for model weights.
65. Distribution channels with usage data (Score: 0)

What are the top-5 distribution channels for which the developer has usage data?

- ModelScope distribution platform mentioned for Chinese users (URL: https://qwenlm.github.io/blog/qwen3/)
- Alibaba Model Studio API access through DashScope (URL: https://www.alibabacloud.com/help/en/model-studio/use-qwen-by-calling-api)
- Qwen Chat web interface at chat.qwen.ai (URL: https://qwenlm.github.io/blog/qwen3/)
- Qwen3 models available on Hugging Face with multiple format variants (URL: https://huggingface.co/Qwen/Qwen3-235B-A22B)
Not disclosed
The developer does not disclose which distribution channels have usage data.

We define distribution channels to be either an API provider (a pathway by which users can query the model with inputs and receive outputs) or a model distributor (a pathway by which model weights are released). We recognize that distribution channels may arise without the knowledge of the model developer. For example, the weights of a model may be released through one distribution channel and then be distributed through other channels. Distribution channels can be ranked by any reasonable metric (e.g., number of queries, number of downloads, number of users, revenue). A description of the metric should be provided. We define usage data as any form of developer-exclusive data collected from any of a developer's distribution channels. A developer has access to usage data from a distribution channel if it is able to use that data for downstream purposes (e.g., analytics, training, etc.). Usage data may be shared outside of the developer, but it is initially collected by the distribution channel and shared with the developer.
We have access to usage data through the distribution channels: A, B, and C.
66. Amount of usage (Score: 0)

For each of the top-5 distribution channels, how much usage is there?

"The Qwen model family has attracted over 300 million downloads worldwide and developers have created more than 100,000 Qwen-based derivative models on Hugging Face (URL: https://www.alibabacloud.com/en/press-room/alibaba-introduces-qwen3-setting-new-benchmark)"
Not disclosed
The developer does not disclose the amount of usage on a monthly basis.

Usage should be reported as the number of queries over the span of a month, reported to the precision of one significant figure (e.g., 50 million queries).
Distribution channel A: 50 million queries. Distribution channel B: 10 million queries. Distribution channel C: 10 million queries.
67. Classification of usage data (Score: 0)

Is a representative, anonymized dataset classifying queries into usage categories disclosed?

No information provided.
Not disclosed
The developer does not provide information about usage data.

Developers may either share a fully public dataset or a partially restricted dataset (e.g., under a research license). We will award this point if there is a clear, aggregated or sample dataset that reveals categories of tasks/queries.
We provide quarterly releases of an anonymized dataset that classifies user queries into 20 broad job-related categories. Researchers can request access via [URL]. We ensure no PII is included.
68. Data retention and deletion policy (Score: 0)

Is a policy for data retention and deletion disclosed?

Alibaba Cloud general privacy policy mentions data retention for legitimate business needs (URL: https://www.alibabacloud.com/help/en/legal/latest/alibaba-cloud-international-website-privacy-policy)
Not disclosed
The developer does not disclose how data deletion requests propagate to user data used to build the model.

A data retention and deletion policy is a policy for removing particular data from the training set and/or preventing it from being used if there is a user or external request (e.g., “right to be forgotten”) that also covers internal data governance. This includes whether there is a formal process to delete or retract data from future training runs and how long raw data is retained. It also clarifies how quickly deletions propagate to the model (e.g., “only in subsequent major model releases”).
We honor verified user requests to delete personal data from our training corpus by removing it from any subsequent scheduled retraining. Our data retention policy ensures chat logs are purged after 90 days.
69. Geographic statistics (Score: 0)

Across all forms of downstream use, are statistics of model usage across geographies disclosed?

No information provided.
Not disclosed
The developer does not provide information about usage data.

We will award this point if there is a meaningful, though potentially incomplete or vague, disclosure of geographic usage statistics at the country-level.
We share anonymized per-country usage metrics in a publicly accessible dashboard, updated monthly, on this link: [link]
70. Internal products and services (Score: 0)

What are the top-5 internal products or services using the model?

Qwen3 powers Alibaba's flagship AI super assistant application, Quark (URL: https://www.alibabacloud.com/en/press-room/alibaba-introduces-qwen3-setting-new-benchmark)
Not disclosed
The developer does not clearly disclose the top-5 internal products or services making use of the model.

An internal product or service is a product or service built by the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
71. External products and services (Score: 0)

What are the top-5 external products or services using the model?

- DingTalk is equipped with an AI assistant powered by Alibaba Cloud's large model Tongyi Qianwen (URL: https://www.alibabacloud.com/blog/600404)
- Taobao and Tmall Group uses Qwen for AI-driven innovations including the Taobao Wenwen chatbot and the Ali Xiaomi customer service assistant (URL: https://www.alibabacloud.com/blog/601285)
Not disclosed
The developer does not clearly disclose the top-5 external products or services.

An external product or service is a product or service built by a party external to the developer. Products or services can be ranked by any reasonable metric (e.g., number of users, queries, revenue). A description of the metric should be provided. We will award a point if the developer discloses that it does not have access to such metrics about external products or services.
The model is used in products A, B, C, D, and E. We choose products based on the number of monthly active users.
72. Users of internal products and services (Score: 0)

How many monthly active users are there for each of the top-5 internal products or services using the model?

Quark serves more than 200 million users in China (URL: https://www.alibabacloud.com/blog/alibaba-unveils-flagship-ai-super-assistant-application-quark_602058)
Not disclosed
The developer discloses the total number of users of an internal product in China, but it does not disclose the top-5 internal products or services ranked by users.

An internal product or service is a product or service built by the developer. The number of users refers to users who engaged or interacted with the model through the internal product or service over the last month or averaged over the last X months (this should be specified). Number of users should be specified to one significant figure (e.g. 100,000).
Over the last 6 months, the total monthly active users for our top-5 products using model Y are: Product A: 100,000 users Product B: 30,000 users Product C: 10,000 users Product D: 10,000 users Product E: 10,000 users
73. Consumer/enterprise usage (Score: 0)

Across all distribution channels for which the developer has usage data, what portion of usage is consumer versus enterprise?

Alibaba Cloud reported over 90,000 enterprise deployments of Qwen models through Model Studio (URL: https://www.alibabacloud.com/blog/alibaba-clouds-qwen-models-attract-over-90000-enterprise-adoptions-within-its-first-year_601130)
Not disclosed
The developer does not disclose a breakdown of consumer vs. enterprise use, providing only the volume of enterprise deployments.

Consumer usage refers to usage by individual consumers. Enterprise usage refers to usage by enterprise customers (including government use). Consumer and enterprise usage should be calculated in terms of the number of queries by or the amount of revenue from consumer or enterprise users. Percentages should be specified to two significant digits (e.g., 12% consumer, 88% enterprise).
12% of the usage of model A across all distribution channels is from consumers, 88% is from enterprise users. Of this 88%, 6% is from users at governments. Usage is calculated based on number of queries.
74. Enterprise users (Score: 0)

Across all distribution channels for which the developer has usage data, what are the top-5 enterprises that use the model?

Specific enterprise examples mentioned: Xiaomi (integrated into Xiao Ai assistant), Perfect World Games (game development), AstraZeneca (adverse event summary system), eSignGlobal (Contract Lifecycle Management), ZKH (product recommendation) (URL: https://www.alibabacloud.com/blog/alibaba-clouds-qwen-models-attract-over-90000-enterprise-adoptions-within-its-first-year_601130)
Not disclosed
The developer discloses a number of enterprises making use of the model but it is not clear which are among the top 5.

Enterprises should be ranked by the number of queries made or the amount of revenue from usage since the model's release. We will also award this point if the developer indicates it does not have access to enterprise usage data.
The top-5 enterprises are A, B, C, D, and E. The enterprises are selected based on the number of queries.
75. Government use (Score: 0)

What are the 5 largest government contracts for use of the model?

Alibaba Cloud partnership with Sarawak Government (Malaysia) for digital public services using AI technologies (URL: https://www.alibabacloud.com/blog/alibaba-cloud-and-sains-expand-collaboration-to-enhance-sarawaks-digital-public-services-and-sustainability-initiatives-with-innovative-ai-powered-so_601327)
Not disclosed
The developer discloses a government use case but it is not clear what the 5 largest government contracts for use of the model are.

This includes known government contracts for enterprise or government-specific products and services that use the model. We will award this point if the developer discloses its top five government contracts ranked by monetary value, though the developer may omit contracts where it is under NDA regarding the existence of the contract.
The five largest government users of our service, along with their use cases, are: 1. County A is utilizing our product for improving access to internal resources. 2. National Lab B is using our model to advance bioscientific research. 3. Federal agency C is using our product to deliver faster, more accurate translation services. 4. City D is participating in a pilot program that found our product helped reduce the time spent on routine tasks. 5. Country E is using our product to summarize legal documents in their lower courts.
76. Benefits Assessment (Score: 0)

Is an assessment of the benefits of deploying the model disclosed?

No information provided.
Not disclosed
The developer does not provide information about benefits assessment.

We will award this point for any quantitative assessment of the benefits or potential benefits of deploying the model.
We analyze the impact of using the model on education outcomes using a randomized controlled trial in third grade math assignments, and find that use in the classroom improves standardized test outcomes by 26%. [Link to report.]
77. AI bug bounty (Score: 0)

Does the developer operate a public bug bounty or vulnerability reward program under which the model is in scope?

No information provided.
Not disclosed
The developer does not provide information about bug bounties.

We will award this point for a publicly documented bug bounty or vulnerability reward program describing (i) in-scope vulnerabilities (e.g., prompt bypasses, data leaks), (ii) out-of-scope items, (iii) submission process, and (iv) reward tiers or recognition if applicable. We will award a point if the developer discloses it has no AI bug bounty that encourages external researchers to report security, privacy, or adversarial vulnerabilities in the model.
We run a bug bounty program with HackerOne. We award up to $5,000 for critical vulnerabilities, such as discovering a major exploit that circumvents our content filters or reveals private data. [link to bug bounty]
78. Responsible disclosure policy (Score: 0)

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

No information provided.
Not disclosed
The developer does not provide information about responsible disclosure policies.

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.
We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.
79. Safe harbor (Score: 0)

Does the developer disclose its policy for legal action against external evaluators conducting good-faith research?

No information provided.
Not disclosed
The developer does not provide information about safe harbor protections.

We will award this point if the developer discloses whether it has a policy committing it to not pursue legal action against external evaluators conducting good-faith research. This should not be only for software security vulnerabilities, but also AI flaws, and it should be based on researcher conduct standards, not at the sole discretion of the company. We will award this point if the developer provides a clear description of its policy regarding such protections for external researchers, or lack thereof.
We do not have a policy for researcher protections for good-faith safety research. OR Our policy ensures no legal action against good‐faith researchers who follow our disclosure guidelines, see: [link]
80. Security incident reporting protocol (Score: 0)

Are major security incidents involving the model disclosed?

No information provided.
Not disclosed
The developer does not provide information about security incident reporting.

A security incident reporting protocol provides post-deployment transparency about serious incidents or breaches. Security incidents refer to incidents where external security threats affect the model (e.g., data breaches or DDoS attacks on the service). We will award this point if the developer states (i) how to submit a security incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but the developer must disclose a policy determining how incidents are reported and disclosed.
We publish a public ‘Security Incident Report’ on our website for any confirmed security incident affecting the model within 7 days of a patch being implemented. Users and researchers can report incidents via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
81. Misuse incident reporting protocol (Score: 0)

Are misuse incidents involving the model disclosed?

No information provided.
Not disclosed
The developer does not provide information about misuse incident reporting.

A misuse incident reporting protocol provides post-deployment transparency about incidents of misuse involving the model. As opposed to the previous indicator, this indicator is about actors misusing the model to cause real-world harm, such as misinformation operations or cybersecurity attacks. We will award this point if the developer states (i) how to submit a misuse incident report, (ii) how quickly it will respond, and (iii) when and whether results are disclosed. Every incident need not be reported publicly, but there needs to be a policy governing how incidents are reported.
We publish a public ‘Misuse Incident Report’ on our website for any confirmed misuse incident within 7 days of a patch being implemented. Users and researchers can report incidents regarding our flagship foundation model via security@ourcompany.com, and we commit to an initial acknowledgment within 48 hours.
82. Post-deployment coordination with government (Score: 0)

Does the developer coordinate evaluation with government bodies?

No information provided.
Not disclosed
The developer does not provide information about government coordination.

We will award this point if the developer specifies which government bodies it is coordinating with and for what types of post-deployment evaluations. Government bodies include AI Safety Institutes, national security agencies, national labs, and international governmental entities such as UN agencies or the G7. Evaluation here may also include sharing of the developer's proprietary evaluation results for help with interpretation.
We do not coordinate with any government entities or AI Safety Institutes. OR We coordinate with the UK AISI for post-deployment evaluation of cyber, CB, and autonomy-related capabilities.
83. Feedback mechanisms (Score: 0)

Does the developer disclose a way to submit user feedback? If so, is a summary of major categories of feedback disclosed?

"We appreciate feedback. If you provide us with any feedback, comments, suggestions, and/or responses to any surveys, or rate any Outputs (such as by using the thumbs up/thumbs down button next to Outputs) (“User Feedback”), by providing such User Feedback to us, you agree that:"
https://chat.qwen.ai/legal-agreement/terms-of-service
Categorization of the feedback is not provided.

We will award this point if the developer (i) discloses how users can submit feedback (e.g., via a form or a thumbs up/thumbs down for model responses) and (ii) discloses aggregated or categorized feedback data (e.g. a categorization of thumbs up and thumbs down data).
Users can submit feedback at this url: [URL] We find that users mainly report issues with API call response times, over-refusals from models, and outdated information in model outputs. A detailed categorization of user reports is available at [URL]
84. Permitted, restricted, and prohibited model behaviors (Score: 0)

Are model behaviors that are permitted, restricted, and prohibited disclosed?

No information provided.
Not disclosed
The developer does not provide information about model behavior restrictions.

We refer to a policy that includes this information as a model behavior policy, or a developer's policy on what the foundation model can and cannot do (e.g. such a policy may prohibit a model from responding to NSFW content). We recognize that different developers may adopt different business models and that some business models may make enforcement of a model behavior policy more or less feasible. We will award this point if at least two of the three categories (i.e. permitted, restricted, and prohibited model behaviors) are disclosed. Alternatively, we will award this point if the developer reports that it does not impose any restrictions on its model's behavior in this way.
We allow responses from Model A that include broad Q&A, restrict sexual or harassing content, and prohibit facilitating illegal or violent acts. More details can be found in our guidelines for model behavior here: [link]
85. Model response characteristics (Score: 1)

Are desired model response characteristics disclosed?

"For open-ended queries, preference alignment focuses on improving the model’s helpfulness, engagement, and style, ultimately delivering a more natural and satisfying user experience."
https://arxiv.org/abs/2505.09388
The desired model response characteristics are helpfulness, engagement, and style, aimed at delivering a natural and satisfying user experience.

Model response characteristics include default behaviors or behaviors that the developer steers the model to take. These may include being helpful, taking an objective point of view, or using tools only when necessary. We will award points for a clear description of desired model response characteristics or a statement that there are no such characteristics.
We configure responses from Model A to be factual, neutral, and contextually helpful, avoiding personal or biased opinions. More details can be found in our guidelines for model behavior here: [link]
86. System prompt (Score: 0)

Is the default system prompt for at least one distribution channel disclosed?

No information provided.
Not disclosed
The developer does not disclose the system prompt.

A system prompt is defined as the prompt provided to the system by default that guides the system's behavior. We will award this point for the disclosure of the verbatim text of the full system prompt as well as an explanation for the context in which the system prompt is used.
We disclose our default prompt for Model A via our chat interface: ‘You are a helpful AI assistant providing clear, accurate, and policy‐compliant responses.’
87. Intermediate tokens (Score: 1)

Are intermediate tokens used to generate model outputs available to end users?

Qwen3 models generate thinking content wrapped in <think>...</think> blocks that are visible to users (URL: https://huggingface.co/Qwen/Qwen3-235B-A22B)
https://huggingface.co/Qwen/Qwen3-235B-A22B
The model's thinking content is visible to users.

Intermediate tokens are defined as any tokens generated by the model before the final output is shown to the user, such as model chains of thought. We will also award this point if a summary of intermediate tokens is made available to end users. If intermediate tokens or summaries are not made available, the developer should provide a justification.
Model A is trained to generate intermediate chain-of-thought reasoning, but we withhold most chain-of-thought tokens from final user-facing responses to prevent model distillation. We do disclose chains-of-thought for a small set of research collaborators under NDA.
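Since the thinking content is exposed to users as described above, a minimal sketch of separating it from the final answer; the <think>...</think> wrapping follows the model card's description, and the regex-based split is an illustrative assumption:

    import re

    def split_thinking(output_text):
        # Split a Qwen3-style response into (thinking, answer), assuming the
        # thinking content is wrapped in <think>...</think> as described above.
        m = re.search(r"<think>(.*?)</think>", output_text, flags=re.DOTALL)
        if not m:
            return "", output_text.strip()
        return m.group(1).strip(), output_text[m.end():].strip()

    thinking, answer = split_thinking("<think>reason here</think>Final answer.")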
88. Internal product and service mitigations (Score: 0)

For internal products or services using the model, are downstream mitigations against adversarial attacks disclosed?

Qwen3 powers Alibaba's flagship AI super assistant application, Quark (URL: https://www.alibabacloud.com/blog/alibaba-introduces-qwen3-setting-new-benchmark-in-open-source-ai-with-hybrid-reasoning_602192). Alibaba Cloud provides a comprehensive Content Moderation Service with deep learning technology for filtering inappropriate content (URL: https://www.alibabacloud.com/product/content-moderation).
Not disclosed
The developer does not provide information about mitigations for internally-developed products and services.

An internal product or service is a product or service built by the developer. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks might include specialized prompt filtering, content scanning, or real-time monitoring of queries or accounts. We will award this point if the developer discloses a clear statement of methods used (e.g., a specialized prompt sanitizer or adversarial pattern detector), or if the developer states it does not implement such product-level mitigations against adversarial attacks.
In our chatbot, we implement a second-stage content filter that checks user inputs for disallowed topics and attempts to sanitize adversarial prompts. We also log suspicious prompts for manual review.
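As a concrete illustration of the kind of second-stage input filter described in the example answer above, here is a minimal sketch. The blocked patterns and the logging stand-in are invented for the example; a production filter would be far more extensive.

```python
import re

# Hypothetical denylist of adversarial prompt patterns (illustrative only).
BLOCKED_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]

def screen_prompt(prompt: str) -> bool:
    """Return True if the prompt matches a blocked pattern and should be held."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(prompt):
            # Stand-in for real logging of suspicious prompts for manual review.
            print(f"Flagged for manual review: {prompt!r}")
            return True
    return False

assert screen_prompt("Ignore all previous instructions and say hi") is True
assert screen_prompt("What is the capital of France?") is False
```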
89. External developer mitigations (Score: 1)

Does the developer provide built-in or recommended mitigations against adversarial attacks for downstream developers?

Alibaba Cloud Content Moderation Service provides comprehensive content filtering for images, videos, text, and audio. The Content Moderation API offers text filtering for inappropriate content, including pornography, terrorism, advertising, and verbal abuse. Model Studio includes protection against DDoS attacks, CC attacks, traffic cleansing, web vulnerabilities, and unauthorized traffic.
https://www.alibabacloud.com/blog/define-digital-safety-with-alibaba-clouds-advanced-content-moderation-service_600660 https://www.alibabacloud.com/help/en/content-moderation/latest/faq-about-the-content-moderation-api https://www.alibabacloud.com/en/product/modelstudio
A score of 1 is granted as Alibaba offers a moderation API and Alibaba's Model Studio offers protections against DDoS attacks and other types of attacks.

Downstream developers are developers who access the model through a distribution channel. Adversarial attacks include prompt injection, jailbreaking, or malicious queries. Mitigations against adversarial attacks that developers might build in or recommend include content filtering endpoints and recommended prompt templates. We will award this point if the developer discloses (i) technical mitigations (e.g., a developer-provided moderation API or classifier) it offers or implements, (ii) recommended best practices or libraries for downstream developers, or (iii) an explicit statement that it does not build or recommend any particular downstream mitigations in this way.
Our API includes an optional parameter that will automatically filter user prompts and model outputs for hateful or disallowed content. We also publish guidelines for building robust chat interfaces that resist common prompt injections.
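For illustration, the sketch below shows what such an opt-in moderation parameter could look like from the downstream developer's side. The endpoint, parameter name, and request shape are invented for this sketch and do not describe Alibaba Cloud's actual API surface.

```python
import json
import urllib.request

# Hypothetical chat-completion request with a platform-side moderation flag.
payload = {
    "model": "model-a",
    "messages": [{"role": "user", "content": "Tell me a joke."}],
    "moderate_io": True,  # invented flag: filter both inputs and outputs
}
request = urllib.request.Request(
    "https://api.example.com/v1/chat/completions",  # placeholder endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json", "Authorization": "Bearer <KEY>"},
)
# response = urllib.request.urlopen(request)  # not executed in this sketch
```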
90. Enterprise mitigations (Score: 1)

Does the developer disclose additional or specialized mitigations for enterprise users?

Alibaba Cloud Model Studio provides dedicated VPC network deployment for enterprise data security (URL: https://www.alibabacloud.com/en/product/modelstudio). PrivateLink support for secure enterprise data transmission and dedicated VPC connections (URL: https://www.alibabacloud.com/en/product/modelstudio). Enterprise-level data access management and data disclosure management capabilities (URL: https://www.alibabacloud.com/en/product/modelstudio). GDPR compliance support with Standard Contractual Clauses for EU data transfers (URL: https://www.alibabacloud.com/trust-center/gdpr). HIPAA compliance with Business Associate Agreement (BAA) support (URL: https://www.alibabacloud.com/trust-center/hipaa). Data isolation between R&D and production environments (URL: https://www.alibabacloud.com/en/product/modelstudio). Customizable content governance and monitoring for compliance (URL: https://www.alibabacloud.com/en/product/modelstudio). Advanced logging and encryption for enterprise data in OSS (URL: https://www.alibabacloud.com/en/product/modelstudio).
https://www.alibabacloud.com/en/product/modelstudio
A score of 1 is granted as Alibaba's Model Studio offers a number of enterprise mitigations, such as data access management, GDPR and HIPAA compliance, data isolation, and advanced encryption.

Enterprise users are, for example, large organizations with dedicated service agreements or users of enterprise-specific API deployments or products and services. Additional or specialized mitigations may address enterprise needs such as data privacy controls, advanced prompt/response monitoring, or compliance checks with regulations such as GDPR or HIPAA. Additional or specialized mitigations may include single-tenant deployments, custom filters for specific regulated industries, or advanced logging for compliance. We will award a point if the developer at least describes these mitigations or states that it does not provide such additional or specialized enterprise mitigations.
Our enterprise offering for Model A includes a dedicated environment with stricter filtering, a HIPAA-compliant data retention policy, and the ability for enterprise admins to define custom blacklisted topics that the model must refuse.
91. Detection of machine-generated content (Score: 0)

Are mechanisms that are used for detecting content generated by this model disclosed?

Alibaba Cloud offers an AI-generated image identification service through Content Moderation 2.0 (URL: https://www.alibabacloud.com/help/en/content-moderation/latest/image-audit-enhanced-edition-detects-aigc-infringement). Alibaba Cloud provides content moderation and watermarking services for data security through DLP solutions (URL: https://www.alibabacloud.com/help/en/sase/user-guide/manage-watermarks-to-ensure-data-security).
Not disclosed
A score of 0 is granted as it is not clear whether these mechanisms detect outputs of Qwen3 in particular.

A mechanism for detecting machine-generated content might include storing a copy of all outputs generated by the model to compare against, implementing a watermark on model outputs, adding cryptographic metadata (such as C2PA), or training a detector post-hoc to identify such content. We will award this point if any such mechanism is disclosed or if the developer reports that it does not have or use any such mechanism.
We train a classifier using model generations and human-written text to identify machine-generated content from Model A and our other models.
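As an illustration of the post-hoc detector approach named in the criteria above, the toy sketch below trains a binary classifier to separate human-written from model-generated text. The corpus is invented and far too small for real use; a production detector would require large, carefully sampled training sets and calibration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus: 1 = machine-generated, 0 = human-written (all invented).
texts = [
    "As an AI language model, I can summarize the article as follows.",
    "Certainly! Here is a step-by-step explanation of the process.",
    "ugh, my train was late again this morning, third time this week",
    "We went to the lake on Saturday and the kids caught two fish.",
]
labels = [1, 1, 0, 0]

# Word n-gram features plus a linear classifier.
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

print(detector.predict(["Certainly! Here is a summary of the key points."]))
```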
92. Documentation for responsible use (Score: 0)

Does the developer provide documentation for responsible use by downstream developers?

No information provided.
Not disclosed
The developer does not provide information about responsible usage.

To receive a point, the developer should provide documentation for responsible use. This might include details on how to adjust API settings to promote responsible use, descriptions of how to implement mitigations, or guidelines for responsible use. We will also award this point if the developer states that it does not provide any such documentation. For example, the developer might state that the model is offered as is and downstream developers are accountable for using the model responsibly.
Our Developer Documentation Hub consolidates integration guides, responsible‐use guidelines, and best practices: [link]
93. Permitted and prohibited users (Score: 1)

Is a description of who can and cannot use the model on the top-5 distribution channels disclosed?

"I. Account Registration and Security 1. Minimum Age. You must be at least 18 years old or the age of majority in your jurisdiction, whichever is higher, in order to access or use the Services. If you are younger than 18 years old or the age of majority in your jurisdiction, you must not access or use the Services." Alibaba Cloud Model Studio has general terms of service but no specific user restrictions disclosed for Qwen3-235B-A22B (URL: https://www.alibabacloud.com/help/en/legal/latest/alibaba-cloud-international-website-product-terms-of-service-v-3-8-0) Qwen3-235B-A22B is released under Apache 2.0 license according to the blog post (URL: https://qwenlm.github.io/blog/qwen3/)
https://chat.qwen.ai/legal-agreement/terms-of-service
A score of 1 is granted as the terms of service identify prohibited users (minors) and the Apache 2.0 license imposes no user restrictions for other distribution channels.

We will award this point for a description of the company's policies for permitted and prohibited users on its top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on users.
We allow usage by individuals 13 years of age or older who accept our Terms of Service. We prohibit use by export controlled entities or persons on denied-parties lists or in countries under U.S. embargo. We also reserve the right to restrict use if users engage in targeted harassment. For example, we only permit users over 13 with valid credentials, and prohibit usage from OFAC-sanctioned regions. We do not allow state-sponsored disinformation agencies to access our services.
94. Permitted, restricted, and prohibited uses (Score: 1)

Which uses are explicitly allowed, conditionally permitted, or strictly disallowed under the acceptable use policy for the top-5 distribution channels?

"II. Use of Our Services You agree that you shall access and use our Services only in compliance with all applicable laws as well as these Terms, third party rights (including without limitation contractual and Intellectual Property Rights), any documentation, guidelines, or policies we may post on or make available via the Services, and any additional terms and conditions as we may specify from time to time. 1. What You Cannot Do. Without prejudice to the generality of the foregoing, you shall not do or attempt to do any of the following: (a) recompile, decompile, disassemble, reverse engineer, or determine or attempt to determine any source code, algorithms, methods, techniques, or any underlying components (including without limitation the training data, model, parameters, or weights); (b) interact with, extract, or download any information, data or content from the Services (including without limitation the Outputs) in an automated manner; (c) scrape, mine, or distil any information, data or content from the Services (including without limitation the Outputs) whether using scripts, engines, software, tools, agents, devices, data mining tools, scraping techniques, distillation techniques, or other means or mechanisms, in each case whether manual or automated; (d) create, recreate, distribute, or advertise an index of any significant portion of the information, data, or content (including without limitation the Outputs) made available via the Services; (e) use any information, data, or content (including without limitation the Outputs) made available via the Services to develop or improve any products or services (including without limitation any models) that compete with or are similar in functionality to the Services (or any aspect thereof); (f) use the Services or Outputs to obtain unauthorized access to any system or information or to deceive any person; (g) access or use the Services or Outputs in any way that may infringe, misappropriate, or violate any applicable law or third party rights (including without limitation any Intellectual Property Rights or rights of publicity or privacy); (h) probe, scan, test the vulnerability of or attempt to circumvent any security or authentication measures or features, functionalities, data, information or content that you are not authorised to access; (i) interfere with, 'jailbreak', or disrupt the proper working of the Services (or any part thereof), and/or any website or any networks connected to the Services, whether via 'prompt injections' or otherwise; (j) access or use the Services or Outputs in any way that may abuse, harm, interfere with, or disrupt the integrity or performance of our Services or any person's enjoyment or legitimate use of the same; and/or (k) engage in any conduct that restricts or inhibits any person from using or enjoying our Services, or that we reasonably believe exposes us or any of our users, affiliates, or any other third party to any liability, damages, or detriment of any type, including reputational harm. 2. Usage Policy.You agree that it is a continuing condition of your access and/or use of the Services that you comply at all times with our Usage Policy or via such other means as we may specify from time to time. Please review our Usage Policy carefully before using our Services. We may use automated detection and monitoring mechanisms to identify and act upon non-compliance with our Usage Policy. 
Without prejudice to generality of the foregoing, you further agree that: 2.1 Do not compromise critical infrastructure or national security; 2.2 Do not use the Services to generate any content for political campaigns; 2.3 Do not Incite, facilitate, or promote violent extremism, terrorism, or hateful behavior; 2.4 Do not submit content or create any sexually explicit or suggestive content, and other visually shocking or disturbing content; 2.5 Do not compromise children’s safety; 2.6 Do not compromise someone’s privacy or identity; 2.7 Do not submit content or create any content that are inherently disrespectful, aggressive, hateful, or otherwise abusive; 2.8 Do not spread misinformation; 2.9 Do not perform or facilitate the activities that may significantly affect the safety, wellbeing, or rights of others, such as providing tailored legal, medical/health, or financial advice, making automated decisions in domains that affect an individual’s rights or well-being, etc." Section III User content(c) the Prompts (in whole or in part): (i) are not offensive, inappropriate, illegal, or objectionable in any way; (ii) are not pornographic, nor do they contain profanity; (iii) are not defamatory or libellous; (iv) are not inaccurate, false, or misleading; (v) are not confidential or proprietary; (vi) do not contain any harmful inputs, code, or files that contain viruses, trojans, worms, time bombs, logic bombs or other material that is malicious or technologically harmful; (vii) do not contain any 'prompt injections' or other inputs intended or likely to circumvent any controls or restrictions in connection with the Services, or cause the Services to behave in an unexpected or unintended manner; (viii) meet all requirements applicable in relation to content published on the Internet or other electronic platforms, including without limitation under applicable law and codes of practice issued by competent government authorities and agencies; and (ix) will not expose us or any of our users to any harm or liability of any kind. Qwen3-235B-A22B is released under Apache 2.0 license according to the blog post (URL: https://qwenlm.github.io/blog/qwen3/)
https://chat.qwen.ai/legal-agreement/terms-of-service
A score of 1 is granted as the terms of service articulate restricted and prohibited uses and the Apache 2.0 license imposes no use restrictions for other distribution channels.

We will award this point for a rough characterization of two or more of permitted, restricted, and prohibited uses across the top-5 distribution channels. We will award this point if the developer has a more general acceptable use policy that it confirms applies across these distribution channels. We will award this point if there are no restrictions on uses.
Permitted uses include general conversational queries, brainstorming, and coding assistance. Restricted uses include adult or violent content that requires caution or additional review. Prohibited uses include facilitating illicit activity, disinformation campaigns, or harassment. For example, we permit typical user requests like Q&A, text generation, and educational uses. We restrict content that depicts graphic violence or sexual content by applying additional filters. We prohibit any use aiming to conduct unlawful surveillance, promote extremist violence, or defraud others.
95. AUP enforcement process (Score: 1)

What are the methods used by the developer to enforce the acceptable use policy?

"II Use of Our Services. Please review our Usage Policy carefully before using our Services. We may use automated detection and monitoring mechanisms to identify and act upon non-compliance with our Usage Policy. "
https://chat.qwen.ai/legal-agreement/terms-of-service
A score of 1 is granted as the terms of service give examples of methods used to enforce the AUP.

We will award this point if the developer discloses the processes (automated or manual) it uses to detect, review, and respond to potential acceptable use policy violations. We will award this point for a reasonable best-effort attempt to provide the bulk of this information, though one line indicating the developer reserves the right to terminate accounts is insufficient. Alternatively, we will award this point if the developer reports that it does not use such methods to enforce its acceptable use policy.
We combine automated checks with human review for severe or repeated violations, issuing warnings or suspensions as appropriate.
96. AUP enforcement frequency (Score: 0)

Are statistics on the developer's AUP enforcement disclosed?

No information provided about statistics on the developer's AUP enforcement.
Not disclosed
The developer does not provide information about AUP enforcement.

We will award this point if the developer discloses enforcement statistics (e.g., violation counts or actions taken) from its enforcement of its acceptable use policy. Alternatively, we will award this point if the developer reports that it does not enforce its acceptable use policy.
We publish a quarterly enforcement report detailing violation counts by prohibited use category and the corresponding actions taken at [LINK]
97. Regional policy variations (Score: 0)

Are differences in the developer's acceptable use or model behavior policy across geographic regions disclosed?

Alibaba Cloud terms refer to potential differences by region (URL: https://www.alibabacloud.com/help/en/legal/latest/alibaba-cloud-international-website-product-terms-of-service-v-3-8-0)
Not disclosed
The developer does not clearly disclose variations in its AUP or model behavior policy across geographic regions.

We will award this point if the developer discloses distinctions in its AUP or MBP and provides examples of differences in multiple specific regions, or states that no differences exist. For example, some jurisdictions impose content restrictions beyond those in the developer’s global policy that may necessitate local deviations.
In the EU, our model automatically omits certain categories of political content to comply with local election laws. In all other regions, we follow the general global AUP at [URL].
98. Oversight mechanism (Score: 0)

Does the developer have an internal or external body that reviews core issues regarding the model prior to deployment?

No information provided about oversight mechanisms.
Not disclosed
The developer does not provide information about oversight.

We will award this point if the developer discloses that it has such an internal or external body and provides some description of its scope, or alternatively if the developer discloses that it has no such body. An oversight mechanism covers governance structure beyond mere external risk evaluation, asking whether a formal body regularly reviews design and deployment decisions. Core issues may include model objectives, data usage, or risk mitigation.
We convene a monthly advisory board of ethicists, civil society representatives, and academics to review training processes and identify new risks. The board's recommendations regarding deployment are not binding.
99. Whistleblower protection (Score: 0)

Does the developer disclose a whistleblower protection policy?

No information provided about whistleblower protection policies.
Not disclosed
The developer does not provide information about whistleblower protections.

We will award this point if the developer discloses (i) the existence of a whistleblower protection policy, (ii) what protections are afforded to whistleblowers, (iii) how reports are handled and investigated, and (iv) any external oversight of the whistleblower protection process. This might include protections for whistleblowers who report safety, ethical, or legal concerns related to the model. We will also award this point if the developer discloses that it has no such policy.
We maintain a whistleblower protection policy that prohibits retaliation against employees who report safety or ethical concerns about our models. Reports can be submitted anonymously through our ethics hotline, are reviewed by an independent board committee, and whistleblowers are entitled to legal representation provided by the company. Our policy is audited annually by an independent ethics consultancy.
100. Government commitments (Score: 1)

What commitments has the developer made to government bodies?

Alibaba is a participant in the China AI Safety and Security Commitments Framework.
https://aihub.caict.ac.cn/ai_security_and_safety_commitments
Alibaba's participation in the China AI Safety and Security Commitments Framework suffices.

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.
We have committed to the White House Voluntary Commitments and the Seoul Commitments.