The Holistic Evaluation of Language Models (HELM) framework is an open-source framework for reproducible and transparent benchmarking of language models, widely adopted in academia and industry. To meet HELM users’ needs for more powerful benchmarking features, we are proud to announce our collaboration with Unitxt, an open-source community platform for data preprocessing and benchmark customization developed by IBM Research. The integration of Unitxt into HELM gives HELM users access to the vast Unitxt catalog of benchmarks and allows them to run shareable, customizable evaluation pipelines with greater ease.
First, install HELM with the optional Unitxt dependencies:
pip install 'crfm-helm[unitxt]'
You can now use helm-run to execute Unitxt pipelines. For example, to run WNLI on GPT-2:
export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default"
export model="openai/gpt2"
helm-run \
--run-entries "unitxt:$recipe,model=$model" \
--max-eval-instances 10 --suite v1
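After the run completes, the results can be summarized and browsed with HELM’s usual tools (a minimal sketch, assuming the default output location and the same suite name as above):
helm-summarize --suite v1
helm-server
helm-summarize aggregates the statistics for the suite, and helm-server starts a local web server for browsing the results.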
Unitxt will load and preprocess the data, producing prompts such as the following:
Given a premise and hypothesis classify the entailment of the hypothesis to one of entailment, not entailment.
premise: George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it.
hypothesis: George was particularly eager to see it.
The entailment class is
To customize your own evaluation pipeline, you can construct a recipe using the Explore Unitxt tool and then use it with helm-run as above. For advanced usage, please refer to the Unitxt documentation.
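For example, the WNLI recipe from above can be extended with few-shot demonstrations simply by appending parameters to the recipe string (a hedged sketch: num_demos and demos_pool_size are standard Unitxt recipe parameters for in-context learning, but the values here are purely illustrative):
export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default,num_demos=3,demos_pool_size=20"
helm-run \
--run-entries "unitxt:$recipe,model=$model" \
--max-eval-instances 10 --suite v1
Because the whole configuration lives in the recipe string, sharing that single string is enough for someone else to rerun the exact same pipeline.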
The Unitxt integration allows users of the HELM framework to leverage the Unitxt Catalog when building their evaluation pipelines. The catalog currently provides more than 24 NLP tasks, 400 datasets, 200 prompt templates, and 80 metrics, including LLM-as-a-judge metrics. This more than doubles the number of datasets available in HELM and dramatically increases the number of evaluation pipelines that can be composed from these components.
Unitxt integrates with other commonly used libraries and frameworks such as Hugging Face Datasets, Hugging Face Evaluate, and the LM Evaluation Harness. Unitxt pipeline configurations (also known as Unitxt recipes) can be used with any of these integrations and can be easily shared among practitioners. This allows practitioners to reproduce model evaluation pipelines with exactly the same prompts, including system prompts, instructions, prompt templates, and in-context learning examples, across multiple frameworks.
Unitxt’s configurable data preprocessing pipelines excel at preparing datasets that contain structured documents such as tables and chat conversations. Before the Unitxt integration, HELM’s text scenarios consisted mostly of unstructured text and lacked structured documents of this kind. The Unitxt integration allows HELM to support a wide range of structured-document testing scenarios that better cover real-world industry applications of LLMs.
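As a rough illustration (the catalog entry names below are assumptions: a table question-answering card such as cards.wikitq and a QA template may or may not appear under these exact names, so check Explore Unitxt for the actual entries), a table-based scenario is run with the same invocation pattern as any other recipe:
export recipe="card=cards.wikitq,template=templates.qa.with_context.simple"
helm-run \
--run-entries "unitxt:$recipe,model=$model" \
--max-eval-instances 10 --suite v1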
Every Unitxt pipeline is decomposed into modular stages: data loading, data preprocessing, prompt templating, and metric computation. This allows users to customize each step of the pipeline in a flexible manner. Furthermore, Unitxt enables customization without sacrificing reproducibility: the exact customizations can be shared as part of the Unitxt recipe, allowing practitioners to reproduce the exact pipeline, customizations included. This is a significant improvement over HELM, which previously supported only limited, ad hoc customizations. The following are a few examples of use cases enabled by Unitxt’s versatility:
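Each of the following variations of the WNLI recipe customizes a different stage of the pipeline (a hedged sketch: loader_limit, template, format, and system_prompt are standard Unitxt recipe fields, but the angle-bracketed catalog entries are placeholders to be filled in from the catalog):
# Data loading: cap how many instances are loaded from the source dataset
export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default,loader_limit=100"
# Prompt templating: swap in a different template from the catalog
export recipe="card=cards.wnli,template=templates.<another_template_from_the_catalog>"
# Prompt formatting: control the chat format and system prompt
export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default,format=formats.<model_specific_format>,system_prompt=system_prompts.<custom_system_prompt>"
Each variant is then passed to helm-run exactly as before, via --run-entries "unitxt:$recipe,model=$model".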
The integration of Unitxt into HELM represents a significant step forward in our quest to provide the community with more powerful and versatile tools for language model evaluation. We encourage HELM users to try the Unitxt integration and to engage with the Unitxt community to advance the state of customizable and reproducible benchmarking.