Stanford CRFM

We present HELM Arabic Enterprise, a leaderboard for transparent, reproducible evaluation of large language models on Arabic-language benchmarks designed around enterprise use cases. The leaderboard was developed in collaboration with Arabic.AI and builds on the HELM evaluation methodology: standardized prompting, fully logged requests and responses, and reproducible scoring through the open-source HELM framework.

Benchmark

Arabic enterprise applications often require more than general conversational ability. Models must generate grounded content, reason over financial concepts, answer domain-specific legal questions, and operate reliably in Arabic across formal, professional, and institutional registers. HELM Arabic Enterprise evaluates these capabilities through six tasks across content generation, financial reasoning, and legal question answering:

Article Generation
Financial Multiple Choice Question Answering
Financial Boolean Verification
Financial Calculation
Legal Open-book Question Answering
Legal Closed-book Question Answering

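Each instance is graded on a scale of 0 to 1. For each model, the leaderboard reports the mean score macro-averaged across the six tasks: instance scores are first averaged within each task, and the six task means are then averaged with equal weight.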
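As a minimal sketch of this aggregation, assuming instance-level scores are available per task (the task names and scores below are hypothetical, not leaderboard data), macro-averaging gives every task equal weight regardless of how many instances it contains:

```python
from statistics import mean


def macro_average(scores_by_task: dict[str, list[float]]) -> float:
    """Average the per-task means so that every task counts equally."""
    per_task_means = [mean(scores) for scores in scores_by_task.values()]
    return mean(per_task_means)


# Hypothetical instance-level scores for two tasks of different sizes.
example = {
    "task_a": [1.0, 1.0, 0.0, 0.0],  # task mean 0.5
    "task_b": [1.0],                 # task mean 1.0
}

macro = macro_average(example)  # (0.5 + 1.0) / 2 = 0.75

# For contrast, a micro-average pools all instances, so larger tasks dominate.
micro = mean(s for scores in example.values() for s in scores)  # 3 / 5 = 0.6
```

With macro-averaging, a task with few instances influences the headline score as much as a task with many.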

Article Generation

The article generation benchmark evaluates whether models can generate high-quality Arabic corporate and financial news articles from structured factual source material.

The evaluation set is built from real-world Arabic content, including statements from chairpersons and chief executive officers of Arab companies, Arabic economic news articles, and Arabic stock market news articles. Each example contains a concise factual summary derived from an original article or executive statement.

Given the summary, the model is prompted to generate a complete article in a specified corporate or financial-news style while preserving the underlying facts, entities, and events. The benchmark is designed to assess factual consistency, stylistic control, domain-specific language generation, and coherence in Arabic corporate communication, rather than unrestricted creative writing.

Generated articles are evaluated using an LLM-as-judge pipeline along three axes:

Faithfulness

Faithfulness measures whether the generated article introduces unsupported claims. It is a precision-oriented factuality metric: the judge checks whether statements in the generated output are entailed by, or directly supported by, the provided source summary.

This metric penalizes hallucinations such as invented numbers, dates, entities, product features, causal relationships, or business implications that are not present in the input. A model can receive a high score only if its generated content remains grounded in the source facts.

Completeness

Completeness measures whether the generated article includes the important facts from the source summary. It is a recall-oriented metric: the judge assesses whether key information from the input appears in the generated output.

This metric catches cases where a model produces fluent and non-hallucinated text but omits critical details, such as prices, limitations, conditions, named entities, outcomes, or timeline information.

Style Adherence

Style Adherence measures whether the generated article adheres to the requested corporate writing style while preserving factual content.

The style rubric includes dimensions such as:

Tone and intent: neutral, informative, reassuring, optimistic, persuasive, promotional, urgent, cautious, defensive, or critical.
Formality and register: casual, conversational, standard professional, formal corporate, or technical/academic.
Point of view and voice: first-person individual, first-person collective, second person, third person, institutional voice, or spokesperson voice.

This metric penalizes generic writing, inconsistent tone, excessive promotional language, inappropriate formality, or drift away from the requested institutional voice.

Together, these three metrics separate factual precision, factual recall, and stylistic control. This allows the leaderboard to distinguish between models that are safe but incomplete, fluent but hallucination-prone, or factually correct but stylistically misaligned.
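The precision/recall distinction between faithfulness and completeness can be sketched as follows. This is an illustration only: it assumes the judge returns one supported/unsupported flag per claim in the generated article and one covered/missing flag per key fact in the source summary. The `JudgeVerdicts` structure and both scoring functions are hypothetical; the leaderboard's actual judge prompts and scoring are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class JudgeVerdicts:
    # Hypothetical judge output, not the leaderboard's real format.
    output_claims_supported: list[bool]  # one flag per claim in the generated article
    source_facts_covered: list[bool]     # one flag per key fact in the source summary


def faithfulness(v: JudgeVerdicts) -> float:
    """Precision-style score: share of generated claims supported by the source."""
    if not v.output_claims_supported:
        return 0.0  # design choice in this sketch: an empty article earns no credit
    return sum(v.output_claims_supported) / len(v.output_claims_supported)


def completeness(v: JudgeVerdicts) -> float:
    """Recall-style score: share of key source facts that appear in the article."""
    if not v.source_facts_covered:
        return 0.0
    return sum(v.source_facts_covered) / len(v.source_facts_covered)


# A "safe but incomplete" article: nothing invented, most facts omitted.
safe_but_incomplete = JudgeVerdicts([True, True], [True, False, False, False])

# A "fluent but hallucination-prone" article: every fact covered, half the claims invented.
hallucination_prone = JudgeVerdicts([True, True, False, False], [True, True, True, True])
```

In this sketch the first article scores 1.0 on faithfulness and 0.25 on completeness, while the second scores 0.5 and 1.0, which is the kind of separation a single combined score would hide.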

Financial

The financial benchmark evaluates Arabic financial reasoning across conceptual, quantitative, and decision-oriented question formats.

The dataset is derived from English-language finance and economics textbooks: the questions were machine-translated into Arabic and then adapted for Arabic-language evaluation. The source textbooks include:

Fundamentals of Corporate Finance
Options, Futures, and Other Derivatives
The Economics of Money, Banking, and Financial Markets

The benchmark covers topics such as corporate finance, capital budgeting, risk and return, derivatives, interest rates, financial markets, banking, and monetary economics.

Models are evaluated in three settings:

Multiple Choice Question Answering (MCQA)

In the multiple-choice setting, the model is given a finance question and three candidate answers. The model must select the correct option.

This setting evaluates Arabic comprehension of technical financial terminology, conceptual understanding, and robustness to distractor answers.

Boolean Verification

In the boolean setting, the model is given a financial statement and must determine whether it is true or false.

This setting tests whether models can recognize correct financial principles, identify invalid assumptions, and reason over concise technical claims.

Calculation

In the calculation setting, the model must solve quantitative finance problems and produce a numeric answer. The answer is then graded by an LLM-as-judge.

This setting evaluates arithmetic reliability, formula selection, multi-step reasoning, and the model’s ability to handle financial quantities in Arabic prompts. These examples are particularly important for enterprise use cases where small numerical errors can materially change the interpretation of an answer.
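The leaderboard grades calculation answers with an LLM-as-judge. As a complementary illustration, and not the leaderboard's grader, the sketch below shows one way a practitioner might spot-check numeric answers programmatically. It assumes the final number in the response is the answer, normalizes Arabic-Indic digits and separators, and uses a hypothetical 1% relative tolerance; the function names are illustrative.

```python
import math
import re
from typing import Optional

# Map Arabic-Indic digits (U+0660..U+0669) and separators to ASCII equivalents.
_TABLE = str.maketrans({
    **{chr(0x0660 + i): str(i) for i in range(10)},
    "\u066b": ".",   # Arabic decimal separator
    "\u066c": None,  # Arabic thousands separator (dropped)
    ",": None,       # ASCII thousands separator (dropped)
})


def extract_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    normalized = text.translate(_TABLE)
    matches = re.findall(r"-?\d+(?:\.\d+)?", normalized)
    return float(matches[-1]) if matches else None


def is_close(predicted: str, reference: float, rel_tol: float = 0.01) -> bool:
    """Check whether the extracted answer is within a relative tolerance of the reference."""
    value = extract_number(predicted)
    return value is not None and math.isclose(value, reference, rel_tol=rel_tol)


# Arabic-Indic rendering of 1,234.56, written with escapes to avoid bidi ambiguity.
answer = "NPV = \u0661\u066c\u0662\u0663\u0664\u066b\u0665\u0666"
```

A rule like this is brittle when a response contains several numbers or units, which is one reason a judge model can be a reasonable choice for free-form answers.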

Legal

The legal benchmark evaluates Arabic legal question answering in the context of UAE law in both open-book and closed-book settings.

Each example consists of an open-ended question written by Arabic legal experts. The model is prompted to provide a short answer, which is then graded for correctness by an LLM-as-judge. The benchmark includes two evaluation settings:

Open-book Question Answering (QA)

In the open-book setting, the prompt includes the relevant legal statute. The model must use the provided legal text to answer the question accurately and concisely.

This setting evaluates a model’s ability to interpret legal language, extract the relevant rule, and generate a grounded answer based on supplied authority. It is designed to reflect enterprise workflows where legal assistants, compliance systems, or internal knowledge tools retrieve statutes or policies before generation.

Closed-book Question Answering (QA)

In the closed-book setting, the relevant statute is not provided. The model must answer using only its internal domain knowledge.

This setting evaluates parametric legal knowledge, Arabic legal terminology, and the model’s ability to provide concise legal answers without retrieval support. Comparing open-book and closed-book performance helps quantify how much models benefit from explicit legal context.
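The difference between the two settings comes down to whether the statute is placed in the prompt. A minimal sketch, assuming a simple template (the section labels, function names, and example scores are hypothetical; the leaderboard's actual prompts are in Arabic and may be structured differently):

```python
from typing import Optional


def build_legal_prompt(question: str, statute: Optional[str] = None) -> str:
    """Build an open-book prompt if a statute is supplied, else a closed-book prompt."""
    parts = []
    if statute is not None:
        parts.append(f"Statute:\n{statute}")  # open-book: supply the legal text
    parts.append(f"Question:\n{question}")
    parts.append("Answer briefly:")
    return "\n\n".join(parts)


def context_gain(open_book_score: float, closed_book_score: float) -> float:
    """Score improvement attributable to providing the statute, for one model."""
    return open_book_score - closed_book_score


closed_prompt = build_legal_prompt("<question text>")
open_prompt = build_legal_prompt("<question text>", statute="<statute text>")

# Hypothetical scores for a single model, not leaderboard results.
gain = context_gain(open_book_score=0.9, closed_book_score=0.4)
```

The gain is only meaningful when both scores come from the same model on the same questions.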

Models

We evaluated several leading models on the leaderboard. These models fall into three groups:

Closed-weights multilingual models: Arabic.AI LLM-X, Arabic.AI LLM-S, Command A, Gemini 2.5, Claude 4, GPT-5.1, GPT-5.3, and Mistral Large models
Open-weights multilingual models: DeepSeek v3.1, DeepSeek v4, Falcon3, Llama 3.3, Llama 4, Gemma 4, Qwen3 and Qwen3.5 models
Open-weights models that were trained or fine-tuned specifically for the Arabic language: AceGPT-v2, ALLaM, JAIS, and SILMA models

To ensure a fair comparison between all models, we did not permit the use of thinking tokens. This prevented us from evaluating models that do not support disabling thinking, such as the Gemini 3 models.

Results

The leaderboard results show a wide range of model performance across tasks, settings, and metrics. This variation is especially significant for enterprise deployment: a model that performs well on general Arabic generation may still struggle with financial calculations, legal grounding, or corporate-style control.

Key findings:

Arabic.AI LLM-X achieved the highest mean score (0.826) of all models. It also obtained the highest scores for Article Generation (0.927), Financial Multiple Choice Question Answering (1), Financial Boolean Verification (0.877), and Legal Closed-book Question Answering (0.495).
Gemma 4 31B Instruct achieved the highest mean score (0.738) among open-weights multilingual models, and the third-highest mean score overall.
Open-weights models that were trained or fine-tuned specifically for the Arabic language (i.e. Falcon3, AceGPT, ALLaM, and JAIS models) were outperformed by other models. This may be because these models are older.
Although some models performed well on Article Generation (top score of 0.927), many models added unsupported facts to the generated output.
For the legal question answering tasks, models generally performed well on the open-book setting but poorly on the closed-book setting. The highest score for open-book was 0.99, but the highest score for closed-book was only 0.495. This suggests that most models lack the relevant background knowledge in UAE law required for closed-book legal question answering.
Models generally performed well on Financial MCQA (highest score of 1) and Financial Boolean Verification (highest score of 0.877), but performed less well on Financial Calculation (highest score of 0.753) due to the challenges of performing arithmetic accurately without tools.

Transparency and Reproducibility

As with all HELM leaderboards, HELM Arabic Enterprise emphasizes transparency and reproducibility. All model requests, responses, prompts, metrics, and scores are made available for inspection. Results can be reproduced using the open-source HELM framework, allowing researchers and practitioners to audit model behavior rather than relying only on aggregate scores.

We hope HELM Arabic Enterprise becomes a useful resource for the Arabic NLP community and for organizations evaluating LLMs for Arabic enterprise applications. By focusing on grounded generation, financial reasoning, and legal question answering, the leaderboard provides a more targeted view of model capabilities in domains where accuracy, reliability, and linguistic control are critical.

HELM Arabic Enterprise

Authors: Yifan Mai, Rida Taleb, and Ahmed Elhagry