Massive Multitask Language Understanding (MMLU) on HELM

Authors: Yifan Mai and Percy Liang


Despite the prominence of the Massive Multitask Language Understanding (MMLU) benchmark, MMLU scores reported by model creators are frequently produced in inconsistent or problematic ways, which hinders their comparability. To address this, we introduce HELM MMLU, a leaderboard of results from evaluating various language models on MMLU with the HELM framework. Our evaluation uses simple and standardized prompts, reports an accuracy breakdown for each of the 57 subjects, and provides full transparency of all raw prompts and predictions.

Motivation

Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) is a multiple-choice question answering test that covers 57 subjects, including elementary mathematics, US history, computer science, law, and more. MMLU scores are reported prominently in the evaluation of virtually all language models, as well as on many leaderboards, including the Open LLM Leaderboard, HELM Classic, and HELM Lite.

Despite the prominence of the MMLU benchmark, there have been a number of issues with recent MMLU evaluations that hinder the comparability of MMLU scores:

  • Model creators have reported MMLU scores using non-standard prompting techniques. For instance, a Google blog post on Gemini compared Gemini Ultra’s MMLU score using “Chain-of-Thought with uncertainty routing” with GPT-4’s MMLU score using regular 5-shot in-context learning. This was not a controlled comparison, because Gemini Ultra’s MMLU score was significantly boosted by this method relative to regular 5-shot in-context learning (90.04% vs 83.7%).
  • Third-party researchers have reported lower MMLU scores than those provided by model creators. For example, Akter et al. (2023) reported a 5-shot MMLU score of 65.22%, whereas the Gemini paper reported a 5-shot MMLU score of 71.8% for the same model.
  • Model creators provided insufficient information about prompting templates. Evaluation results for language models can be highly sensitive to variations in prompting, including task instructions and the choice of in-context examples. Yet many model papers do not provide enough information about the prompts for a third-party researcher to recreate them.
  • Model creators did not use open-source evaluation frameworks. When evaluations are performed using an open-source evaluation framework such as HELM, the LM Evaluation Harness, or Unitxt, the results can be directly compared with results for a different model on the same framework. Furthermore, a third-party researcher can use the framework to recreate the precise prompts used for evaluation. Most model creators did not report using an open-source evaluation framework for their MMLU evaluations.
  • Model creators sometimes performed evaluations using internal private snapshots. For example, the evaluations in the PaLM 2 paper and the GPT-4 paper were conducted using internal private snapshots that are not publicly available to third-party researchers.

To address these issues, we performed a comprehensive MMLU evaluation of the major models using simple and standardized prompts, with full transparency of all raw prompts and predictions. We then compared our MMLU scores with the MMLU scores reported by model creators.

Methodology

We used the HELM framework to perform a complete MMLU evaluation on language models.

  • We used the same prompts across all models with the exception of Claude (for reasons that will be discussed later).
  • We used the same 5 in-context learning examples per subject across all models.
  • We used the Multiple Choice Joint adaptation method from the original HELM paper, which expects the model to directly produce the answer letter (i.e. “A”, “B”, “C”, or “D”) as generated text; a sketch of this prompt format appears after this list. This is a departure from the approaches taken by the original MMLU implementation, the Open LLM Leaderboard, and HELM Classic’s Multiple Choice Separate adaptation method, which determine the models’ answers from token probabilities. We made this change because many recent model APIs, such as the Anthropic API for Claude 3, do not provide token probabilities.
  • We evaluated using all instances (rather than a sample of instances) on the test split only.
  • Unlike HELM Classic and HELM Lite, which evaluated on only five MMLU subjects, we evaluated on all 57 subjects.
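
For concreteness, here is a minimal sketch of what a Multiple Choice Joint prompt looks like. This is an illustration under assumptions rather than the exact HELM template: the instruction wording, whitespace, and example questions are placeholders.

```python
# Minimal sketch of a "Multiple Choice Joint" style prompt (illustrative only;
# the exact instructions, whitespace, and in-context examples used by HELM may differ).

LETTERS = ["A", "B", "C", "D"]

def format_question(question, options, answer=None):
    """Render one MMLU question; in-context examples include the answer letter."""
    lines = [f"Question: {question}"]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_joint_prompt(subject, in_context_examples, eval_question, eval_options):
    """Join the instructions, the in-context examples, and the evaluation question
    into a single prompt; the model is expected to generate the answer letter."""
    blocks = [f"The following are multiple choice questions (with answers) about {subject}."]
    blocks += [format_question(q, opts, ans) for q, opts, ans in in_context_examples]
    blocks.append(format_question(eval_question, eval_options))
    return "\n\n".join(blocks)

# Example usage with a single (hypothetical) in-context example; HELM uses 5 per subject.
examples = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
print(build_joint_prompt("elementary mathematics", examples,
                         "What is 7 x 6?", ["36", "40", "42", "48"]))
```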

The instructions for reproducing our evaluation runs can be found in the HELM documentation.
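
As a rough sketch of what a single evaluation run looks like, the snippet below invokes the HELM CLI for one MMLU subject and one model. The run entry syntax, model identifier, suite name, and flags are assumptions based on the current HELM documentation and may differ across versions; the documentation above is authoritative.

```python
# Sketch: invoke the HELM CLI (installed via `pip install crfm-helm`) to evaluate
# one MMLU subject for one model, then summarize the results.
# The run entry string, model name, suite name, and flags are illustrative and
# may differ between HELM versions (older versions use --run-specs).
import subprocess

suite = "my-mmlu-suite"
run_entry = "mmlu:subject=anatomy,model=openai/gpt2"  # one of the 57 subjects

subprocess.run(
    ["helm-run",
     "--run-entries", run_entry,
     "--suite", suite,
     # Set high enough to cover every test instance of the subject.
     "--max-eval-instances", "10000"],
    check=True,
)
subprocess.run(["helm-summarize", "--suite", suite], check=True)
```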

Results

We obtained the following MMLU scores by evaluating these models with HELM. Our scores were frequently close to those reported in the original model papers, but some diverged from the reported scores by as much as 5.0 percentage points.

Model               Reported (%)   HELM (%)   Delta (pp)
Claude Instant              73.4       68.8         -4.6
Claude 2.1                  78.5       73.5         -5.0
Claude 3 Haiku              75.2       73.8         -1.4
Claude 3 Sonnet             79.0       75.9         -3.1
Claude 3 Opus               86.8       84.6         -2.2
Gemini 1.0 Pro              71.8       70.0         -1.8
Gemma (7B)                  64.3       66.1         +1.8
GPT-4 (0613)                86.4       82.4         -4.0
Llama 2 (7B)                45.3       45.8         +0.5
Llama 2 (13B)               54.8       55.4         +0.6
Llama 2 (70B)               68.9       69.5         +0.6
Llama 3 (8B)                66.6       66.8         +0.2
Llama 3 (70B)               79.5       79.3         -0.2
Mistral v0.1 (7B)           60.1       56.6         -3.5
Mixtral (8x7B)              70.6       71.7         +1.1
Mixtral (8x22B)             77.6       77.8         +0.2
OLMo (7B)                   28.3       29.5         +1.2
PaLM 2 Unicorn              81.2       78.6         -2.6
Phi-2 (2.7B)                56.7       58.4         +1.7
Qwen1.5 (7B)                61.0       62.6         +1.6
Qwen1.5 (14B)               67.6       68.6         +1.0
Qwen1.5 (32B)               73.4       74.4         +1.0
Qwen1.5 (72B)               77.5       77.4         -0.1
Yi (6B)                     63.2       64.0         +0.8
Yi (34B)                    76.3       76.2         -0.1

The results on the HELM MMLU leaderboard also include full transparency of all raw prompts and predictions, allowing users to drill down into the individual requests and predictions for each model evaluation run.

Note that the MMLU scores on the HELM MMLU leaderboard often differ from the MMLU scores on the HELM Lite leaderboard, because the HELM MMLU leaderboard evaluates on all 57 subjects, whereas the HELM Lite leaderboard uses only five subjects: abstract algebra, college chemistry, computer security, econometrics, and US foreign policy.

While performing our MMLU reproductions, we ran into a number of issues that made it impossible to use the exact same prompts for every model.

  • Some APIs had limitations that prevented us from using fully standardized prompts. For instance, Claude 2 models require prompts to follow Anthropic’s “Human/Assistant format”. This is enforced by the Anthropic API, which returns a validation error if the prompt is incorrectly formatted, making it impossible to use the standardized HELM prompts for MMLU with the Anthropic API directly. To deal with this, we had to modify the prompts for Claude 2 to follow this format.
  • Instruction tuning prevented some models from following the in-context learning format. For instance, with regular in-context learning prompts, Claude 3 would respond with complete sentences rather than the single answer letter. To deal with this, we added an additional instruction for Claude 3: “Answer with only a single letter.” Both workarounds are sketched after this list.
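
The sketch below illustrates both workarounds. The Human/Assistant wrapping follows the format required by Anthropic’s legacy text completions API, but the exact instruction placement and wording used in our HELM runs may differ.

```python
# Sketch of the per-model prompt adjustments described above (illustrative only).

def wrap_for_claude_2(prompt: str) -> str:
    """Wrap a standard HELM prompt in the Human/Assistant format that the
    Anthropic API enforces for Claude 2 models."""
    return f"\n\nHuman: {prompt}\n\nAssistant:"

def adjust_for_claude_3(prompt: str) -> str:
    """Append an extra instruction so the instruction-tuned model answers with
    just the answer letter instead of a complete sentence."""
    return prompt + "\nAnswer with only a single letter."
```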

Conclusions

The HELM MMLU leaderboard provides comprehensive MMLU evaluations of models using simple and standardized prompts, and provides full transparency of all raw prompts and predictions. This living leaderboard will be continuously updated with evaluation results for new models. We hope this will be a useful resource for the community.