HELM Arabic

Authors: Yifan Mai, Rida Taleb, and Ahmed Elhagry


As part of our efforts to better understand the multilingual capabilities of large language models (LLMs), we present HELM Arabic, a leaderboard for transparent and reproducible evaluation of LLMs on Arabic language benchmarks. This leaderboard was produced in collaboration with Arabic.AI.

Benchmarks

HELM Arabic builds on a collection of established Arabic-language evaluation tasks that are widely used in the research community [1]. It includes the following seven benchmarks:

  • AlGhafa [2] — an Arabic-language multiple-choice evaluation benchmark derived from publicly available NLP datasets
  • ArabicMMLU [3] — a native Arabic question answering benchmark built from school exam questions across diverse educational levels in countries spanning North Africa, the Levant, and the Gulf
  • Arabic EXAMS [4] — the Arabic-language subset of the EXAMS multilingual question answering benchmark, which consists of high school exam questions across various school subjects
  • MadinahQA [3] — a question answering benchmark published by MBZUAI that tests knowledge of Arabic language and grammar
  • AraTrust [5] — an Arab-region-specific safety evaluation dataset of human-written questions, including direct attacks, indirect attacks, and harmless requests containing sensitive words
  • ALRAGE [1] — an Arabic-language, passage-based, open-ended, model-graded question answering benchmark that reflects retrieval-augmented generation use cases
  • ArbMMLU-HT [6] — a human translation of MMLU into Arabic, published by MBZUAI

We made the following decisions in our evaluation methodology:

  • We used Arabic letters for the options in the multiple-choice question answering prompt template, which better reflects LLM usage within the Arabic language context (see the sketch after this list).
  • We used zero-shot prompting to make the tasks more challenging; because all evaluated models were instruction-tuned, they did not need few-shot prompting.
  • We limited the evaluation to 1000 randomly sampled instances from each subset to reduce imbalances in the number of instances across subsets.
  • We disabled thinking on models with an optional thinking mode, and excluded models that had a mandatory thinking mode that could not be disabled.
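
To make the first and third of these decisions concrete, the sketch below shows one way to format a multiple-choice prompt with Arabic letter options and to subsample a subset to 1000 instances. The function names, prompt wording, and sampling seed are illustrative assumptions, not HELM's actual implementation.

```python
import random

# Arabic letters (alif, ba, jim, dal) used as option labels in place of
# the Latin A/B/C/D found in typical English-language templates.
ARABIC_LETTERS = ["أ", "ب", "ج", "د"]

def format_mcqa_prompt(question: str, choices: list[str]) -> str:
    """Render a zero-shot multiple-choice prompt with Arabic option letters."""
    lines = [question]
    for letter, choice in zip(ARABIC_LETTERS, choices):
        lines.append(f"{letter}. {choice}")
    lines.append("الإجابة:")  # "Answer:" in Arabic
    return "\n".join(lines)

def subsample(instances: list, max_instances: int = 1000, seed: int = 0) -> list:
    """Randomly sample at most max_instances items from a benchmark subset."""
    rng = random.Random(seed)
    if len(instances) <= max_instances:
        return instances
    return rng.sample(instances, max_instances)
```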

Models

We evaluated several leading models on the leaderboard. These models fall into three main groups:

  • closed-weights multilingual models
  • open-weights multilingual models
  • open-weights models trained or fine-tuned specifically for the Arabic language

Results

The leaderboard results show that LLMs have made significant progress in Arabic language understanding over the last few years.

  • Arabic.AI LLM-X achieved the highest mean score, as well as the highest scores on AlGhafa, Arabic EXAMS, MadinahQA, AraTrust, ALRAGE, and the translated MMLU (ArbMMLU-HT).
  • Qwen3 235B A22B Instruct 2507 FP8 was the best performing open-weights model, with a mean score of 0.786.
  • Open-weights multilingual models performed well relative to closed-weights multilingual models. Four of the top 10 models were open-weights multilingual models: Qwen3 235B A22B Instruct 2507 FP8, Llama 4 Maverick (17Bx128E) Instruct FP8, Qwen3-Next 80B A3B Instruct, and DeepSeek v3.1.
  • Open-weights models that were trained or fine-tuned specifically for the Arabic language underperformed compared to the other two model groups. However, this may be explained by the age of these models; the most recent model family in this group, AceGPT-v2, was released in October 2024, more than a year ago.

As with all HELM leaderboards, this leaderboard provides full transparency into all LLM requests and responses, and the results are reproducible using the HELM open source framework. We hope that this leaderboard will be a valuable resource for the Arabic NLP community.
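
As a pointer to reproduction, the sketch below drives a single evaluation run through the crfm-helm command-line tools from Python. The run-entry name and suite label are hypothetical placeholders; the actual run entries for HELM Arabic are defined in the HELM repository.

```python
# Minimal reproduction sketch using the open-source HELM framework
# (pip install crfm-helm). The run entry below is a hypothetical
# placeholder; see the HELM repository for the actual HELM Arabic entries.
import subprocess

RUN_ENTRY = "arabic_mmlu:model=all"  # hypothetical placeholder run entry
SUITE = "helm-arabic-repro"          # local label for this evaluation run

# Execute the benchmark, capping evaluation at 1000 instances per subset
# to match the sampling decision described above.
subprocess.run(
    ["helm-run", "--run-entries", RUN_ENTRY,
     "--suite", SUITE, "--max-eval-instances", "1000"],
    check=True,
)

# Aggregate the raw requests and responses into leaderboard tables.
subprocess.run(["helm-summarize", "--suite", SUITE], check=True)
```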

References

  1. El Filali, A., Manel, A., Hussain, T., Alzubaidi, A., Cojocaru, R., Hacid, H., & Fourrier, C. (2025, March 24). The Open Arabic LLM Leaderboard 2. https://huggingface.co/blog/leaderboard-arabic-v2

  2. Almazrouei, E., Cojocaru, R., Baldo, M., Malartic, Q., Alobeidli, H., Mazzotta, D., Penedo, G., Campesan, G., Farooq, M., Alhammadi, M., Launay, J., & Noune, B. (2023). AlGhafa evaluation benchmark for Arabic language models. In H. Sawaf, S. El-Beltagy, W. Zaghouani, W. Magdy, A. Abdelali, N. Tomeh, I. Abu Farha, N. Habash, S. Khalifa, A. Keleg, H. Haddad, I. Zitouni, K. Mrini, & R. Almatham (Eds.), Proceedings of ArabicNLP 2023 (pp. 244–275). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.arabicnlp-1.21

  3. Koto, F., Li, H., Shatnawi, S., Doughman, J., Sadallah, A., Alraeesi, A., Almubarak, K., Alyafeai, Z., Sengupta, N., Shehata, S., Habash, N., Nakov, P., & Baldwin, T. (2024). ArabicMMLU: Assessing massive multitask language understanding in Arabic. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 5622–5640). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.334

  4. Hardalov, M., Mihaylov, T., Zlatkova, D., Dinkov, Y., Koychev, I., & Nakov, P. (2020). EXAMS: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5427–5444). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.438

  5. Alghamdi, E. A., Masoud, R., Alnuhait, D., Alomairi, A. Y., Ashraf, A., & Zaytoon, M. (2025). AraTrust: An evaluation of trustworthiness for LLMs in Arabic. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, & S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics (pp. 8664–8679). Association for Computational Linguistics. https://aclanthology.org/2025.coling-main.579/

  6. Sengupta, N., Sahu, S. K., Jia, B., Katipomu, S., Li, H., Koto, F., Marshall, W., Gosal, G., Liu, C., Chen, Z., Afzal, O. M., Kamboj, S., Pandit, O., Pal, R., Pradhan, L., Mujahid, Z. M., Baali, M., Han, X., Bsharat, S. M., … Xing, E. (2023). Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models (arXiv:2308.16149). arXiv. https://doi.org/10.48550/arXiv.2308.16149