Are the specific capabilities or tasks that were optimized for during post-training disclosed?

Disclosure:

References:

https://mistral.ai/news/mistral-medium-3

Score justification:

Although the developer describes capabilities that the model excels in, they do not describe the capabilities specifically optimized for during post-training.

Indicator notes:

Capabilities refer to the specific and distinctive functions that the model can perform. We recognize that different developers may use different terminology for capabilities, or conceptualize capabilities differently. We will award this point for a list of capabilities specifically optimized for in the post-training phase of the model, even if some of the capabilities are not reflected in the final model.

Example disclosure:

We focus on the following capabilities during post-training: (1) Coding ability (2) Retrieval of information and factuality (3) Multilingual language proficiency on non-English languages (4) Tool-use

44. Capabilities evaluation (Score: 1)

Does the developer evaluate the model's capabilities prior to its release and disclose them concurrent with release?

Disclosure:

https://mistral.ai/news/mistral-medium-3

(Coding) HumanEval (0-shot): 92.1%
(Coding) LiveCodeBench (v6): 30.3%
(Coding) MultiPL-E average: 81.4%
(Instruction Following) ArenaHard (0-shot): 97.1%
(Instruction Following) IfEval (0-shot): 89.4%
(Math) Math500 Instruct: 91.0%
(Knowledge) GPQA Diamond (5-shot CoT): 57.1%
(Knowledge) MMLU Pro (5-shot CoT): 77.2%
(Long Context) RULER 32K (0-shot): 96.0%
(Long Context) RULER 128K (0-shot): 90.2%
(Multimodal) MMMU (0-shot): 66.1%
(Multimodal) DocVQA (0-shot): 95.3%
(Multimodal) AI2D (0-shot): 93.7%
(Multimodal) ChartQA (0-shot): 82.6%

Competitor wins vs Mistral wins for coding:
Claude Sonnet 3.7: 60.00 to 40.00
DeepSeek 3.1: 62.50 to 37.50
GPT-4o: 50.00 to 50.00
Command-A: 30.77 to 69.23
Llama 4 Maverick: 18.18 to 81.82

Llama maverick wins vs Mistral wins:
Coding: 18.18 to 81.82
Multimodal: 46.15 to 53.85
English: 33.33 to 66.67
French: 28.57 to 71.43
Spanish: 26.67 to 73.33
German: 37.50 to 62.50
Arabic: 35.29 to 64.71

References:

Mistral Medium 3, News

Score justification:

The developer publishes a large number of capability evaluation results that tie to the capabilities it associates with the model, though the benchmarking of the model is not specifically mapped onto the delineation of capabilities for which the model was optimized.

Indicator notes:

The evaluations must contain precise quantifications of the model's behavior in relation to the capabilities specified in the capabilities taxonomy. We will award this point for any clear, but potentially incomplete, evaluation of multiple capabilities.

Example disclosure:

We evaluate capabilities using the following benchmarks: (1) Coding: HumanEval (2) Retrieval: HotPotQA (3) Multilingual performance: MMMLU (4) Tool use: UltraTool

45. External reproducibility of capabilities evaluation (Score: 0)

Are code and prompts that allow for an external reproduction of the evaluation of model capabilities disclosed?

Disclosure:

Mistral Medium 3 is designed to be frontier-class, particularly in categories of professional use. In the evaluations below, we use numbers reported previously by other providers wherever available, otherwise we use our own evaluation harness. Performance accuracy on all benchmarks were obtained through the same internal evaluation pipeline. Mistral Medium 3 particularly stands out in coding and STEM tasks where it comes close to its very large and much slower competitors. In addition to academic benchmarks we report third-party human evaluations that are more representative of real-world use cases. Mistral Medium 3 continues to shine in the coding domain and delivers much better performance, across the board, than some of its much larger competitors.

References:

Mistral Medium 3, News

Score justification:

Although many of the evaluations in the previous indicator use public benchmarks, the code and prompts that would allow an external reproduction of the results are not disclosed (e.g., the "internal evaluation pipeline" described in the release announcement or some equivalent).

Indicator notes:

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all capability evaluations to be reproduced. The released code must be open source, following the OSI definition of open source.

Example disclosure:

The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]

46. Train-test overlap (Score: 0)

Does the developer measure and disclose the overlap between the training set and the dataset used to evaluate model capabilities?

Disclosure:

Not disclosed

References:

Not disclosed

Score justification:

No information is provided about the train-test overlap.

Indicator notes:

We will award this point if, with every capability evaluation for which the developer reports results, the developer reports the overlap between the training set of the model and the dataset used for evaluation, as well as the general methodology for computing train-test overlap (e.g. a description of how n-gram matching was used).

Example disclosure:

We compute train-test overlap using n-gram matching using the procedure described here [URL]. We evaluate the train-test overlap for the following benchmarks: (1) Coding: HumanEval (1.6%) (2) Retrieval: HotPotQA (4%) (3) Multilingual performance: MMMLU (3%) (4) Tool use: UltraTool (9%)
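
For illustration only, the kind of n-gram matching mentioned in the indicator notes could be sketched as below. This is a minimal sketch under stated assumptions, not any developer's actual procedure: the function names `ngrams` and `overlap_rate`, whitespace tokenization, the default `n=8`, and the rule that a test example counts as overlapping if it shares at least one n-gram with the training data are all hypothetical choices.

```python
def ngrams(text, n):
    """Return the set of word-level n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(train_docs, test_examples, n=8):
    """Fraction of test examples sharing at least one n-gram with the training docs.

    Assumption: a single shared n-gram is enough to flag a test example.
    """
    if not test_examples:
        return 0.0
    # Collect every n-gram that appears anywhere in the training documents.
    train_ngrams = set()
    for doc in train_docs:
        train_ngrams |= ngrams(doc, n)
    # Count the test examples that intersect the training n-gram set.
    flagged = sum(1 for ex in test_examples if ngrams(ex, n) & train_ngrams)
    return flagged / len(test_examples)


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog"]
    test = ["a quick brown fox jumps high", "completely unrelated sentence here today"]
    print(overlap_rate(train, test, n=3))
```

A disclosure that meets the indicator would report a figure like this for every benchmark, together with the choices that drive it (tokenization, the value of n, and the threshold for flagging an example).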

47. Risks taxonomy (Score: 0)

Are the risks considered when developing the model disclosed?

Disclosure:

"We are introducing our new moderation service, which is powered by the Mistral Moderation model, a classifier model based on Ministral 8B 24.10. It enables our users to detect harmful text content along several policy dimensions." "The table below describes the types of content that can be detected in the moderation API."

References:

Mistral Documentation, Moderation

Score justification:

The developer includes some information about risks through the structure of their moderation API, but they do not disclose the taxonomy of risks considered while developing Mistral Medium 3.

Indicator notes:

Risks refer to possible negative consequences or undesirable outcomes that can arise from the model's deployment and usage. These consequences or outcomes may arise from model limitations (functions that the model cannot perform) or issues with the model's trustworthiness (e.g., its lack of robustness, reliability, calibration). We recognize that different developers may use different terminology for risks, or conceptualize risks differently. We will award this point for a complete list of risks considered, even if some of the risks are not reflected in the final model.

Example disclosure:

We consider the following risks when developing and evaluating the model: (1) Misinformation (2) Harassment (3) Cybersecurity risks (4) Bioweapons design (5) Revealing personally-identifiable information

48. Risks evaluation (Score: 0)

Does the developer evaluate the model's risks prior to its release and disclose them concurrent with release?

Disclosure:

Not disclosed

References:

Not disclosed

Score justification:

No information provided about risk evaluations.

Indicator notes:

The evaluations must contain precise quantifications of the model's behavior in relation to the risks specified in the risk taxonomy. We will award this point for clear evaluations of the majority of the stated risks.

Example disclosure:

We evaluate the risks for each of the above harms using HarmBench. The results (in terms of mean attack success rate) are: (1) Misinformation: 0.02 (2) Harassment: 0.01 (3) Cybersecurity: 0.10 (4) Bioweapons design (subset of Chemical Biological in HarmBench): 0.12 (5) Revealing personally-identifiable information (subset of General Harm in HarmBench): 0.02
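
As a rough illustration of the arithmetic behind a per-category mean attack success rate, the sketch below averages binary outcomes by risk category. It assumes each record is a hypothetical `(category, attack_succeeded)` pair produced by some external judge; the function name `attack_success_rates` and the sample records are made up for the example and are not HarmBench's own code or data.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Map each risk category to the fraction of attacks judged successful.

    `records` is an iterable of (category, attack_succeeded) pairs, where
    attack_succeeded is a bool supplied by an external judge (assumed here).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in records:
        totals[category] += 1
        successes[category] += int(succeeded)
    return {category: successes[category] / totals[category] for category in totals}


if __name__ == "__main__":
    # Made-up outcomes for illustration only.
    sample = [
        ("Misinformation", True),
        ("Misinformation", False),
        ("Harassment", False),
        ("Harassment", False),
    ]
    print(attack_success_rates(sample))
```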

49. External reproducibility of risks evaluation (Score: 0)

Are code and prompts to allow for an external reproduction of the evaluation of model risks disclosed?

Disclosure:

Not disclosed

References:

Not disclosed

Score justification:

No information provided about risk evaluations.

Indicator notes:

The released code and prompts need not be the same as what is used internally, but should allow the developer's results on all risk evaluations to be reproduced. The released code must be open-source, following the OSI definition of open-source.

Example disclosure:

The code and prompts to reproduce our evaluations can be found on this GitHub repository link: [URL]

50. Pre-deployment risk evaluation (Score: 0)

Are the external entities that have evaluated the model pre-deployment disclosed?

Disclosure:

Not disclosed

References:

Not disclosed

Score justification:

No information provided about pre-deployment external evaluations.

Indicator notes:

By external entities, we mean entities that are significantly or fully independent of the developer. We will award this point if the developer specifies the entity that carried out the pre-deployment analysis, discloses the terms of the analysis (such as conditions for releasing the evaluation results or the developer's control over the final results), as well as any financial transaction between the parties. We will also award this point if the developer discloses that no external entities have evaluated the model pre-deployment, or discloses only terms of the analysis where it is not bound by NDA while still naming all external entities.

Example disclosure:

We provide the following parties access to our model for pre-deployment capabilities evaluation: METR. METR has control over the release of the evaluation results (including whether or not to release the results and the contents of the results being released), but must provide the evaluation results to us for review before release. There are no financial transactions between us and METR.

51. External risk evaluation (Score: 0)

Are the parties contracted to evaluate model risks disclosed?

Disclosure:

Not disclosed

References:

Not disclosed

Score justification:

No information provided about parties contracted to evaluate risk.

Indicator notes:

We will award this point if the developer discloses statistics regarding all contracted parties that are responsible for evaluating risks (not limited to external entities or pre-deployment evaluation). This includes the number of contracted for-profit or non-profit entities, government entities, independent contractors, and researchers contracted by the developer to evaluate risks. We will award this point if the developer discloses it has no such contracts.

Example disclosure:

Contracted parties responsible for evaluating risks: (1) 2 contracting non-profits (2) 5 independent contractors (3) 0 government entities (4) 20 researchers

Does the developer clearly define a process by which external parties can disclose model vulnerabilities or flaws?

Disclosure:

References:

https://mistral.ai/terms

Score justification:

Penetration testing is prohibited; Mistral thereby clarifies that there is no responsible disclosure policy.

Indicator notes:

We will award this point for a description of the process external parties can use for responsibly disclosing model vulnerabilities and flaws, which should include (i) what mechanism external parties can use to disclose vulnerabilities or flaws (e.g., a form, an email) and (ii) what process follows a disclosure (e.g., how much time must parties wait until public release). This is often included with a bug bounty, but can also be standalone. We will award a point if the developer discloses it has no responsible disclosure policy.

Example disclosure:

We maintain a responsible disclosure policy at [URL] that describes how external parties can disclose vulnerabilities and flaws in Model A, including a 45-day disclosure window and an official contact for urgent security vulnerabilities.

What commitments has the developer made to government bodies?

Disclosure:

Mistral has committed to the Seoul Commitments and signed the EU AI Act GPAI Code of Practice.

References:

https://www.gov.uk/government/publications/frontier-ai-safety-commitments-ai-seoul-summit-2024/frontier-ai-safety-commitments-ai-seoul-summit-2024 https://digital-strategy.ec.europa.eu/en/policies/contents-code-gpai

Score justification:

The Seoul Commitments and the EU AI Act GPAI Code of Practice suffice.

Indicator notes:

We will award this point if the company provides an exhaustive list of commitments it has made to government bodies in the jurisdictions where it offers its model.

Example disclosure:

We have committed to the White House Voluntary Commitments and the Seoul Commitments.