We present HELM Arabic Enterprise, a leaderboard for transparent, reproducible evaluation of large language models on Arabic-language benchmarks designed around enterprise use cases. The leaderboard was developed in collaboration with Arabic.AI and builds on the HELM evaluation methodology: standardized prompting, fully logged requests and responses, and reproducible scoring through the open-source HELM framework.
Arabic enterprise applications often require more than general conversational ability. Models must generate grounded content, reason over financial concepts, answer domain-specific legal questions, and operate reliably in Arabic across formal, professional, and institutional registers. HELM Arabic Enterprise evaluates these capabilities through six tasks spanning content generation, financial reasoning, and legal question answering: article generation; financial multiple-choice, boolean, and calculation questions; and open-book and closed-book legal question answering.
Each instance is graded on a scale of 0 to 1. For each model, the leaderboard reports the mean score macro-averaged across the six tasks, so every task carries equal weight regardless of how many instances it contains.
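A minimal sketch of macro-averaging, assuming per-instance scores in the range 0 to 1 are first averaged within each task and the task means are then averaged. The function names and the example scores are hypothetical and are not taken from the HELM codebase or from leaderboard results.

```python
from statistics import mean

def task_mean(instance_scores):
    """Mean of per-instance scores, each expected to lie in [0, 1]."""
    if not instance_scores:
        raise ValueError("task has no scored instances")
    if any(not 0.0 <= s <= 1.0 for s in instance_scores):
        raise ValueError("instance scores must be between 0 and 1")
    return mean(instance_scores)

def macro_average(scores_by_task):
    """Average the per-task means so that every task carries equal weight."""
    return mean(task_mean(scores) for scores in scores_by_task.values())

# Hypothetical scores for illustration only; not leaderboard data.
example = {
    "task_a": [1.0, 0.0, 1.0, 1.0],  # task mean 0.75
    "task_b": [0.5, 0.5],            # task mean 0.5
}
print(macro_average(example))  # 0.625
```

Pooling all six instances instead (a micro-average) would give 4/6, about 0.667, because the larger task would dominate. Macro-averaging avoids that.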
The article generation benchmark evaluates whether models can generate high-quality Arabic corporate and financial news articles from structured factual source material.
The evaluation set is built from real-world Arabic content, including statements from chairpersons and chief executive officers of Arab companies, Arabic economic news articles, and Arabic stock market news articles. Each example contains a concise factual summary derived from an original article or executive statement.
Given the summary, the model is prompted to generate a complete article in a specified corporate or financial-news style while preserving the underlying facts, entities, and events. The benchmark is designed to assess factual consistency, stylistic control, domain-specific language generation, and coherence in Arabic corporate communication, rather than unrestricted creative writing.
Generated articles are evaluated using an LLM-as-judge pipeline along three axes: faithfulness, completeness, and style adherence.
Faithfulness measures whether the generated article introduces unsupported claims. It is a precision-oriented factuality metric: the judge checks whether statements in the generated output are entailed by, or directly supported by, the provided source summary.
This metric penalizes hallucinations such as invented numbers, dates, entities, product features, causal relationships, or business implications that are not present in the input. A model can receive a high score only if its generated content remains grounded in the source facts.
Completeness measures whether the generated article includes the important facts from the source summary. It is a recall-oriented metric: the judge assesses whether key information from the input appears in the generated output.
This metric catches cases where a model produces fluent and non-hallucinated text but omits critical details, such as prices, limitations, conditions, named entities, outcomes, or timeline information.
Style Adherence measures whether the generated article adheres to the requested corporate writing style while preserving factual content.
The style rubric includes dimensions such as:
This metric penalizes generic writing, inconsistent tone, excessive promotional language, inappropriate formality, or drift away from the requested institutional voice.
Together, these three metrics separate factual precision, factual recall, and stylistic control. This allows the leaderboard to distinguish between models that are safe but incomplete, fluent but hallucination-prone, or factually correct but stylistically misaligned.
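The precision and recall framing can be illustrated with a small sketch. It assumes, purely for illustration, that a judge returns one boolean verdict per generated claim (supported by the source or not) and one per source fact (present in the article or not). The post does not say that the actual judge pipeline scores at this granularity, and all names and verdicts below are hypothetical.

```python
def fraction_true(verdicts):
    """Share of True verdicts in a non-empty list of booleans."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)

def faithfulness_score(claim_supported):
    """Precision-style: fraction of generated claims supported by the source."""
    return fraction_true(claim_supported)

def completeness_score(fact_covered):
    """Recall-style: fraction of source facts that appear in the article."""
    return fraction_true(fact_covered)

# Hypothetical article A: states only two claims, both supported,
# but covers just one of the four source facts ("safe but incomplete").
a_faith = faithfulness_score([True, True])
a_compl = completeness_score([True, False, False, False])

# Hypothetical article B: covers all four source facts but adds four
# unsupported claims ("complete but hallucination-prone").
b_faith = faithfulness_score([True, True, True, True, False, False, False, False])
b_compl = completeness_score([True, True, True, True])

print(a_faith, a_compl)  # 1.0 0.25
print(b_faith, b_compl)  # 0.5 1.0
```

Neither score alone separates the two articles in a useful way, which is why the two axes are reported separately.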
The financial benchmark evaluates Arabic financial reasoning across conceptual, quantitative, and decision-oriented question formats.
The dataset is adapted from English-language finance and economics textbooks, with questions translated into Arabic using machine translation and adapted for Arabic-language evaluation. The source textbooks include:
The benchmark covers topics such as corporate finance, capital budgeting, risk and return, derivatives, interest rates, financial markets, banking, and monetary economics.
Models are evaluated in three settings: multiple-choice, boolean, and calculation.
In the multiple-choice setting, the model is given a finance question and three candidate answers. The model must select the correct option.
This setting evaluates Arabic comprehension of technical financial terminology, conceptual understanding, and robustness to distractor answers.
In the boolean setting, the model is given a financial statement and must determine whether it is true or false.
This setting tests whether models can recognize correct financial principles, identify invalid assumptions, and reason over concise technical claims.
In the calculation setting, the model must solve quantitative finance problems and produce a numeric answer. The answer is then graded by an LLM-as-judge.
This setting evaluates arithmetic reliability, formula selection, multi-step reasoning, and the model’s ability to handle financial quantities in Arabic prompts. These examples are particularly important for enterprise use cases where small numerical errors can materially change the interpretation of an answer.
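To illustrate how a small numerical slip can change an interpretation, the sketch below computes a net present value (a standard capital budgeting quantity) at two discount rates. The project figures are hypothetical and are not drawn from the benchmark.

```python
def npv(rate, initial_outlay, cash_flows):
    """Net present value: discounted future cash flows minus the upfront outlay.

    cash_flows[0] is received at the end of year 1, cash_flows[1] at the end
    of year 2, and so on.
    """
    discounted = sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows, start=1))
    return discounted - initial_outlay

# Hypothetical project: pay 1000 now, receive 400 at the end of each of 3 years.
outlay = 1000.0
flows = [400.0, 400.0, 400.0]

npv_at_8 = npv(0.08, outlay, flows)   # about +30.84, so the project looks acceptable
npv_at_10 = npv(0.10, outlay, flows)  # about -5.26, so the project looks unacceptable

print(round(npv_at_8, 2), round(npv_at_10, 2))
```

A model that misreads the discount rate as 10% instead of 8% reaches the opposite accept-or-reject conclusion, even though every arithmetic step is otherwise correct.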
The legal benchmark evaluates Arabic legal question answering in the context of UAE law in both open-book and closed-book settings.
Each example consists of an open-ended question written by Arabic legal experts. The model is prompted to provide a short answer, which is then graded for correctness by an LLM-as-judge. The benchmark includes two evaluation settings:
In the open-book setting, the prompt includes the relevant legal statute. The model must use the provided legal text to answer the question accurately and concisely.
This setting evaluates a model’s ability to interpret legal language, extract the relevant rule, and generate a grounded answer based on supplied authority. It is designed to reflect enterprise workflows where legal assistants, compliance systems, or internal knowledge tools retrieve statutes or policies before generation.
In the closed-book setting, the relevant statute is not provided. The model must answer using only its internal domain knowledge.
This setting evaluates parametric legal knowledge, Arabic legal terminology, and the model’s ability to provide concise legal answers without retrieval support. Comparing open-book and closed-book performance helps quantify how much models benefit from explicit legal context.
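The difference between the two settings can be sketched as a prompt builder that optionally prepends the statute, plus a helper for the open-book gain. The template wording, function names, and scores here are hypothetical. The actual prompts are the ones logged by the HELM framework and may differ.

```python
def build_legal_prompt(question, statute=None):
    """Build a prompt; include the statute only in the open-book setting."""
    parts = []
    if statute is not None:
        # Open-book: the relevant legal text is supplied as context.
        parts.append("Relevant statute:\n" + statute)
    parts.append("Question:\n" + question)
    parts.append("Give a short answer.")
    return "\n\n".join(parts)

def open_book_gain(open_book_score, closed_book_score):
    """Score difference attributable to having the statute in the prompt."""
    return open_book_score - closed_book_score

# Placeholder strings for illustration; real examples are written in Arabic.
question = "<legal question>"
statute = "<text of the relevant statute>"

closed_prompt = build_legal_prompt(question)
open_prompt = build_legal_prompt(question, statute=statute)

# Hypothetical mean scores for one model; not leaderboard results.
print(open_book_gain(0.8, 0.5))
```

A large positive gain suggests the model reads supplied statutes well but lacks the underlying knowledge on its own. A gain near zero suggests either strong parametric knowledge or poor use of the provided text, so it is best read alongside the absolute scores.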
We evaluated several leading models on the leaderboard. These models fall into three main groups:
To ensure a fair comparison between all models, we did not permit the use of thinking tokens. This prevented us from evaluating models that do not support disabling thinking, such as the Gemini 3 models.
The leaderboard results show a wide range of model performance across tasks, settings, and metrics. This variation matters for enterprise deployment: a model that performs well on general Arabic generation may still struggle with financial calculations, legal grounding, or corporate-style control.
Key findings:
As with all HELM leaderboards, HELM Arabic Enterprise emphasizes transparency and reproducibility. All model requests, responses, prompts, metrics, and scores are made available for inspection. Results can be reproduced using the open-source HELM framework, allowing researchers and practitioners to audit model behavior rather than relying only on aggregate scores.
We hope HELM Arabic Enterprise becomes a useful resource for the Arabic NLP community and for organizations evaluating LLMs for Arabic enterprise applications. By focusing on grounded generation, financial reasoning, and legal question answering, the leaderboard provides a more targeted view of model capabilities in domains where accuracy, reliability, and linguistic control are critical.