As part of our efforts to better understand the multilingual capabilities of large language models (LLMs), we present HELM Arabic, a leaderboard for transparent and reproducible evaluation of LLMs on Arabic-language benchmarks. This leaderboard was produced in collaboration with Arabic.AI.
HELM Arabic builds on a collection of established Arabic-language evaluation tasks that are widely used in the research community [1]. It includes the following seven benchmarks:
We made the following decisions in our evaluation methodology:
We evaluated several leading models on the leaderboard. These models fall into three main groups:
The leaderboard results show that LLMs have made significant progress in Arabic language understanding over the last few years.
As with all HELM leaderboards, this one provides full transparency into every LLM request and response, and the results are reproducible using the open-source HELM framework (see the sketch below). We hope that this leaderboard will be a valuable resource for the Arabic NLP community.
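For readers who want to reproduce or extend these results, here is a minimal sketch of running an evaluation with the HELM command-line tools, following HELM's standard quickstart. The run-entry name `arabic_mmlu`, the suite name, and the model choice are illustrative assumptions, not the exact configuration behind this leaderboard; consult the HELM documentation for the scenario names it uses.

```bash
# Install the HELM framework from PyPI
pip install crfm-helm

# Run a small evaluation. Run entries take the form
# "scenario:args,model=<model>"; "arabic_mmlu" is a hypothetical
# scenario name used here for illustration only.
helm-run \
  --run-entries "arabic_mmlu:model=openai/gpt2" \
  --suite my-arabic-suite \
  --max-eval-instances 10

# Aggregate the raw per-request results into leaderboard-style tables
helm-summarize --suite my-arabic-suite

# Browse all requests, responses, and scores in a local web UI
helm-server
```

Capping `--max-eval-instances` keeps a first run cheap; dropping the flag evaluates the full benchmark, and the web UI exposes the same request-level transparency as the published leaderboard.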
1. El Filali, A., Manel, A., Hussain, T., Alzubaidi, A., Cojocaru, R., Hacid, H., & Fourrier, C. (2025, March 24). The Open Arabic LLM Leaderboard 2. https://huggingface.co/blog/leaderboard-arabic-v2
2. Almazrouei, E., Cojocaru, R., Baldo, M., Malartic, Q., Alobeidli, H., Mazzotta, D., Penedo, G., Campesan, G., Farooq, M., Alhammadi, M., Launay, J., & Noune, B. (2023). AlGhafa evaluation benchmark for Arabic language models. In H. Sawaf, S. El-Beltagy, W. Zaghouani, W. Magdy, A. Abdelali, N. Tomeh, I. Abu Farha, N. Habash, S. Khalifa, A. Keleg, H. Haddad, I. Zitouni, K. Mrini, & R. Almatham (Eds.), Proceedings of ArabicNLP 2023 (pp. 244–275). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.arabicnlp-1.21
3. Koto, F., Li, H., Shatnawi, S., Doughman, J., Sadallah, A., Alraeesi, A., Almubarak, K., Alyafeai, Z., Sengupta, N., Shehata, S., Habash, N., Nakov, P., & Baldwin, T. (2024). ArabicMMLU: Assessing massive multitask language understanding in Arabic. In L.-W. Ku, A. Martins, & V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024 (pp. 5622–5640). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.findings-acl.334
4. Hardalov, M., Mihaylov, T., Zlatkova, D., Dinkov, Y., Koychev, I., & Nakov, P. (2020). EXAMS: A multi-subject high school examinations dataset for cross-lingual and multilingual question answering. In B. Webber, T. Cohn, Y. He, & Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 5427–5444). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.438
5. Alghamdi, E. A., Masoud, R., Alnuhait, D., Alomairi, A. Y., Ashraf, A., & Zaytoon, M. (2025). AraTrust: An evaluation of trustworthiness for LLMs in Arabic. In O. Rambow, L. Wanner, M. Apidianaki, H. Al-Khalifa, B. D. Eugenio, & S. Schockaert (Eds.), Proceedings of the 31st International Conference on Computational Linguistics (pp. 8664–8679). Association for Computational Linguistics. https://aclanthology.org/2025.coling-main.579/
6. Sengupta, N., Sahu, S. K., Jia, B., Katipomu, S., Li, H., Koto, F., Marshall, W., Gosal, G., Liu, C., Chen, Z., Afzal, O. M., Kamboj, S., Pandit, O., Pal, R., Pradhan, L., Mujahid, Z. M., Baali, M., Han, X., Bsharat, S. M., … Xing, E. (2023). Jais and Jais-chat: Arabic-centric foundation and instruction-tuned open generative large language models (arXiv:2308.16149). arXiv. https://doi.org/10.48550/arXiv.2308.16149