ThaiExam Leaderboard in HELM

Authors: Yifan Mai, Potsawee Manakul, Sitipong Sripaisarnmongkol, Kunat Pipatanakul, and Percy Liang


In collaboration with SCBX and SCB 10X, we introduce the ThaiExam leaderboard. ThaiExam is a Thai language benchmark derived from standardized examinations in Thailand. It consists of assessments that evaluate general knowledge at the high school level, such as the ONET, TGAT, TPAT-1, and A-Level exams, as well as the IC exam, which assesses the financial knowledge of investment professionals. The ThaiExam leaderboard is the first public leaderboard for language models on Thai language scenarios, and features evaluations of leading language models. Like all other HELM leaderboards, the ThaiExam leaderboard provides full prompt-level transparency, and the results can be fully reproduced using the HELM framework. We hope that this leaderboard will encourage further work on evaluating language models in Thai.
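
For readers who want to reproduce results themselves, the sketch below shows how a single ThaiExam run might be launched with the HELM command-line tools after installing the crfm-helm package. The run-entry string, exam name, and model identifier are illustrative assumptions and may not match the exact configuration used to produce the official leaderboard.

```python
# Minimal sketch of reproducing one ThaiExam run with the HELM CLI (pip install crfm-helm).
# The run entry and model identifier below are illustrative assumptions; consult the HELM
# documentation for the exact configuration behind the official leaderboard.
import subprocess

run_entry = "thai_exam:exam=onet,model=openai/gpt-4o-2024-05-13"  # assumed run-entry format
suite = "thaiexam-repro"

# Run the scenario on a handful of instances as a quick smoke test
# (drop --max-eval-instances to evaluate the full exam).
subprocess.run(
    ["helm-run", "--run-entries", run_entry, "--suite", suite, "--max-eval-instances", "10"],
    check=True,
)

# Aggregate the raw run outputs into leaderboard-style tables.
subprocess.run(["helm-summarize", "--suite", suite], check=True)
```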

Background

The current landscape of LLM training and evaluation is English-centric. While there have been many recent releases of multilingual language models such as GPT-4o and Claude 3, evaluation in non-English languages remains largely underexplored. The non-English evaluations published by the developers of GPT-4o and Claude 3 are limited, and major LLM leaderboards such as HELM Lite, Chatbot Arena, the Open LLM Leaderboard, and AlpacaEval consist entirely of evaluations on English datasets. As a result, there is significant uncertainty about how well these models perform in non-English languages.

In our efforts to extend HELM evaluations to languages beyond English, we collaborated with researchers at SCBX and SCB 10X to explore language model evaluation in the Thai language. Thai has an estimated 61 million speakers and is written in its own script rather than the Latin alphabet. As part of this collaboration, we introduce the ThaiExam leaderboard, which features evaluations of leading language models on the ThaiExam benchmark.

Benchmark

ThaiExam is a multiple-choice question answering benchmark based on Thai examinations; it was first introduced by SCB 10X as an evaluation for the Typhoon language model. The benchmark comprises the following Thai multiple-choice examinations, spanning various academic levels and subjects (a sketch of the underlying task format appears after the list):

  • ONET: The Ordinary National Educational Test (ONET) is an examination for students in Thailand. The subjects selected were Thai, English, Mathematics, Social Studies, and Science.
  • IC: The Investment Consultant (IC) examination, a licensing test for investment professionals in Thailand developed by the Stock Exchange of Thailand.
  • TGAT: The Thai General Aptitude Test (TGAT), a national high school examination in Thailand, focuses on critical and logical thinking skills.
  • TPAT-1: The Thai Professional Aptitude Test 1 (TPAT-1) is a national high school examination in Thailand that assesses the professional aptitude required of applicants to medical schools. This subset covers reasoning and medical ethics.
  • A-Level: An academic knowledge assessment examination that covers general foundational subjects at the secondary education level.
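
To make the task format concrete, the following sketch shows one way a multiple-choice item could be rendered into a prompt and scored by exact match on the predicted answer letter. The data structure, prompt template, and example question are hypothetical illustrations; they are not drawn from the actual dataset, and HELM's own prompt construction may differ.

```python
# Hypothetical sketch of multiple-choice prompting and exact-match scoring.
# The Question fields, prompt template, and example item are illustrative only;
# they do not reproduce HELM's actual ThaiExam prompt format or any real exam question.
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    choices: dict[str, str]  # answer letter -> choice text
    answer: str              # gold answer letter, e.g. "b"

def render_prompt(q: Question) -> str:
    """Render a question and its lettered choices as a zero-shot prompt."""
    lines = [q.text]
    lines += [f"{letter}. {choice}" for letter, choice in q.choices.items()]
    lines.append("Answer:")
    return "\n".join(lines)

def exact_match_accuracy(predictions: list[str], questions: list[Question]) -> float:
    """Fraction of items where the predicted letter matches the gold letter."""
    correct = sum(p.strip().lower() == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)

# Toy usage (real ThaiExam questions are written in Thai).
q = Question(
    text="Which organization developed the Investment Consultant (IC) examination?",
    choices={"a": "Bank of Thailand", "b": "Stock Exchange of Thailand", "c": "Ministry of Education"},
    answer="b",
)
print(render_prompt(q))
print(exact_match_accuracy(["b"], [q]))  # 1.0
```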

There are a few existing Thai language benchmarks outside of ThaiExam. Two notable ones are as follows:

  • M3Exam is a benchmark that evaluates LLMs on school exam questions in 9 languages, including Thai. Several papers on Thai language models, including SeaLLM and Typhoon, report evaluation results on M3Exam. The Thai subset of M3Exam is based on the ONET examination, which is also used in ThaiExam. Compared to M3Exam, ThaiExam additionally includes the IC exam, a professional-level examination for investment professionals.
  • BHASA is a meta-benchmark that evaluates LLMs in four Southeast Asian languages, including Thai, on eight tasks: named entity recognition, question answering, sentiment analysis, toxicity detection, machine translation, abstractive summarization, natural language inference, and causal reasoning. Some of the source datasets for BHASA use text translated from English. In contrast, all of the Thai text in ThaiExam was originally written in Thai rather than translated from English.

Models

We evaluated 34 models for our leaderboard, including several notable Thai language models. For comparison, we also evaluated other leading language models that are not known to have Thai language capabilities. The evaluated models fall into the following groups:

  1. Thai-centric open-weights models fine-tuned from English-centric open-weights models - The Typhoon models were fine-tuned from Mistral, Qwen1.5, and Llama 3. OpenThaiGPT and SambaLingo-Thai-Chat were fine-tuned from Llama 2. Sailor was fine-tuned from Qwen1.5. SeaLLM v2 was fine-tuned from Gemma.
  2. Multilingual open weights models that support Thai - SEA-LION is an open-weights model trained from scratch on a mixture of Southeast Asian languages including Thai. Qwen1.5 and Qwen2 are open-weights models that were primarily intended for English and Chinese language use, but the model developers have also provided evaluation results on Thai language benchmarks.
  3. Open-weights models that do not officially support Thai - Llama 3 is intended primarily for English language use. Mistral models have multilingual capabilities for English, French, Italian, German, and Spanish, but not Thai.
  4. Multilingual closed models - GPT-4, Claude 3, Gemini and Command R are families of closed models. All models claim to have some level of multilingual ability. Gemini and Claude have Thai language evaluation results available on their model cards, while the remaining models do not.

Results

Our results are presented on the ThaiExam leaderboard. Our key findings are as follows:

  • The strongest open-weights Thai fine-tuned models are competitive with closed commercial models. Typhoon 1.5X Instruct (70B) and Typhoon v1.5 Instruct (72B) outperformed closed models such as GPT-4 Turbo and Claude 3 Sonnet. The smaller Typhoon 1.5X Instruct (8B) and Typhoon v1.5 Instruct (8B) models outperformed GPT-3.5 Turbo.
  • Thai language fine-tuning provided significant gains over the English-language base models. Typhoon 1.5X Instruct (70B) had an improvement of 4.9 points over the base model Llama 3 Instruct (70B), while Typhoon v1.5 Instruct (72B) had an improvement of 5.7 points over Qwen1.5 Chat (72B).
  • There is a significant spread of results for commercial closed models. Claude 3.5 Sonnet, Gemini 1.5 Pro and GPT-4o were among the top performers on ThaiExam, while GPT-4 Turbo performed relatively poorly on ThaiExam. In contrast, on English language benchmark leaderboards such as HELM Lite and MMLU, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o and GPT-4 Turbo have similar rankings.
  • Some models that are not intended to be multilingual still show some Thai language competency. The Llama 3 model card states that Llama 3 models were trained primarily on English language corpora and are intended primarily for English language use. The Llama 3 (8B) model performs poorly on ThaiExam as expected, but surprisingly, Llama 3 (70B) outperforms multilingual models such as Claude 3 Haiku.
  • There is significant room for LLMs to improve on ThaiExam performance. The top-performing models still achieve accuracies of only around 60-70%, compared to >80% accuracy on the English-language MMLU.

Conclusion

The ThaiExam leaderboard presents evaluations of leading large language models on standard Thai examinations. Our leaderboard enables comparisons between open-weights and closed models, and between English-centric, Thai-centric and multilingual models. Future work may include evaluating open-ended generations from instruction-following models, which are not captured by the existing multiple-choice question answering scenarios. We hope that this effort encourages the development of Thai language capabilities in language models and further work in Thai language model evaluation.