Advancing Customizable Benchmarking in HELM via Unitxt Integration

Authors: Yifan Mai and Elron Bandel and Michal Shmueli-Scheuer and Percy Liang


Holistic Evaluation of Language Models (HELM) is an open-source framework for reproducible and transparent benchmarking of language models that is widely adopted by academia and industry. To meet HELM users’ needs for more powerful benchmarking features, we are proud to announce our collaboration with Unitxt, an open-source community platform developed by IBM Research for data preprocessing and benchmark customization. The integration of Unitxt into HELM gives HELM users access to the vast Unitxt catalog of benchmarks, and allows users to run sharable and customizable evaluation pipelines with greater ease.

Installation and Usage

First, install HELM with the optional Unitxt dependencies:

pip install 'crfm-helm[unitxt]'

You can now use helm-run to execute Unitxt pipelines. For example, to run WNLI on GPT-2:

export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default"
export model=openai/gpt2
helm-run \
    --run-entries "unitxt:$recipe,model=$model" \
    --max-eval-instances 10 --suite v1

Unitxt will load and preprocess the data, producing prompts such as the following:

Given a premise and hypothesis classify the entailment of the hypothesis to one of entailment, not entailment.
premise: George got free tickets to the play, but he gave them to Eric, because he was particularly eager to see it.
hypothesis: George was particularly eager to see it.
The entailment class is 
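
Once the run completes, you can summarize the results and browse them in a local web UI using HELM’s standard commands:

helm-summarize --suite v1
helm-server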

To customize your own evaluation pipeline, you can construct a recipe using the Explore Unitxt tool and then use it with helm-run as above. For advanced usage, please refer to the Unitxt documentation.
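
A recipe can also control few-shot in-context learning directly. The sketch below assumes the num_demos and demos_pool_size recipe parameters; consult the Unitxt documentation for the full set of options available in your version:

export recipe="card=cards.wnli,template=templates.classification.multi_class.relation.default,num_demos=3,demos_pool_size=50"
helm-run \
    --run-entries "unitxt:$recipe,model=$model" \
    --max-eval-instances 10 --suite v1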

Benefits

Vast Community-driven Catalog

The Unitxt integration allows HELM framework users to leverage the Unitxt Catalog when building their evaluation pipelines. The Unitxt catalog currently provides more than 24 NLP tasks, 400 datasets, 200 prompt templates, and 80 metrics, including LLM-as-a-judge metrics. This more than doubles the number of datasets available in HELM and greatly expands the number of HELM evaluation pipelines that can be composed from these components.
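
As a rough sketch of this composability, a different task can be evaluated simply by selecting a different card and template from the catalog. The entry names below are illustrative; use the Explore Unitxt tool to find the exact entries you need:

# Card and template names are illustrative catalog entries.
export recipe="card=cards.xsum,template=templates.summarization.abstractive.formal"
helm-run \
    --run-entries "unitxt:$recipe,model=$model" \
    --max-eval-instances 10 --suite v1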

Sharable and Reproducible Pipeline Configuration

Unitxt integrates with other commonly used libraries and frameworks such as Hugging Face Datasets, Hugging Face Evaluate, and LM Evaluation Harness. Unitxt pipeline configurations (also known as Unitxt recipes) can be used with any of these integrations and can be easily shared among practitioners. This allows practitioners to reproduce model evaluation pipelines with the exact same prompts, including system prompts, instructions, prompt templates, and in-context learning examples, across multiple frameworks.

Structured Data Preparation

Unitxt’s configurable data preprocessing pipelines excel at preparing datasets that contain structured documents such as tables and chat conversations. Before the Unitxt integration, HELM’s text scenarios consisted mostly of unstructured text and lacked structured documents of this kind. The Unitxt integration allows HELM to support a wide range of structured-document scenarios that better cover real-world industry applications of LLMs.
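
As a minimal sketch, a table question answering scenario could be run with a table-oriented card from the catalog. The card and template names below are illustrative placeholders; browse the catalog for the actual table and dialogue entries:

# Card and template names are illustrative placeholders for table-oriented catalog entries.
export recipe="card=cards.wikitq,template=templates.qa.with_context.simple"
helm-run \
    --run-entries "unitxt:$recipe,model=$model" \
    --max-eval-instances 10 --suite v1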

Modular Customizability

Every Unitxt pipeline is decomposed into modular stages: data loading, data preprocessing, prompt templating, and metric computation. This allows users to customize each step of the pipeline in a flexible manner. Furthermore, Unitxt enables customization without sacrificing reproducibility: the customizations are captured in the Unitxt recipe itself, allowing practitioners to reproduce the exact pipeline, customizations included. This is a significant improvement over HELM’s previous support for only limited, ad-hoc customizations. The following are a few examples of use cases enabled by Unitxt’s versatility:

  • Model-specific prompting: Some recent chat models impose specific prompt-formatting requirements, such as a particular prompt template or chat template. If the expected template is not used, some model APIs return errors, while others exhibit a significant degradation in performance. Unitxt allows prompts to be templated in the format each model expects.
  • Ablation experiments: Users may want to vary one component while keeping the other components constant in order to measure the effect of a specific ablation. For instance, a user might want to investigate a model’s sensitivity to its system prompt by modifying the system prompt while keeping the instructions and in-context learning examples constant. Unitxt allows users to run such experiments easily by modifying only one modular stage of the pipeline, as sketched below.
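
A minimal sketch of such a system prompt ablation, reusing the recipe syntax shown earlier (the system prompt catalog names are illustrative; substitute the entries that match your experiment):

# Vary only the system prompt between two otherwise-identical runs; catalog names are illustrative.
export base="card=cards.wnli,template=templates.classification.multi_class.relation.default"
helm-run \
    --run-entries "unitxt:$base,system_prompt=system_prompts.models.llama,model=$model" \
    --max-eval-instances 10 --suite ablation-system-prompt
helm-run \
    --run-entries "unitxt:$base,system_prompt=system_prompts.empty,model=$model" \
    --max-eval-instances 10 --suite ablation-system-prompt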

Conclusion

The integration of Unitxt into HELM represents a significant step forward in our quest to provide the community with more powerful and versatile tools for language model evaluation. We encourage HELM users to use this Unitxt integration and engage with the Unitxt community to advance the state of customizable and reproducible benchmarking.