Most Popular LLM Benchmarks
Why use benchmarks to evaluate LLMs?
LLM benchmarks help evaluate the accuracy of large language models by providing a standardized procedure for measuring their performance on various tasks.
Benchmarks contain all the structure and data required for LLM assessment, including:
- “Reference” datasets (relevant tasks/questions/prompts with expected answers)
- Methods of passing input prompts to the LLM
- Methods of interpreting/collecting responses
- Computed metrics and scores (and how to calculate them)
All of this together allows us to compare the accuracy of different models in a consistent way. But which LLM benchmark should you use? It mostly depends on the use case, i.e. what you intend to use the LLM for. Let's find out!
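As a sketch, the pieces listed above fit together in a simple evaluation loop. This is a toy illustration, not any specific benchmark's harness; `fake_model` stands in for a real LLM call:

```python
# Minimal sketch of a benchmark harness: reference data, prompting,
# response interpretation, and a computed metric (accuracy).
def evaluate(model, dataset):
    """Run `model` (a prompt -> text callable) over `dataset` and
    return accuracy against the expected answers."""
    correct = 0
    for example in dataset:
        response = model(example["prompt"])        # pass the input prompt to the LLM
        answer = response.strip().lower()          # interpret/normalize the response
        if answer == example["expected"].lower():  # compare with the reference answer
            correct += 1
    return correct / len(dataset)                  # the benchmark score

# Toy usage with a fake "model" that always answers "paris".
dataset = [
    {"prompt": "Capital of France?", "expected": "Paris"},
    {"prompt": "Capital of Italy?",  "expected": "Rome"},
]
fake_model = lambda prompt: "paris"
score = evaluate(fake_model, dataset)  # 0.5
```

Real benchmarks differ mainly in how the response is interpreted (free text, multiple choice, log-likelihoods) and which metric is computed, but the overall shape is the same.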
Best LLM Benchmarks
If you need a universal solution, the HuggingFace Big Benchmarks Collection offers a fairly complete list of widely used benchmarks. It contains the benchmarks included in the popular Open LLM Leaderboard and complements them with a variety of other important benchmarks.
Below we present some of the most popular LLM benchmarks, categorized by use case:
Reasoning, conversation, and question-answering benchmarks
Such benchmarks evaluate the ability of models to reason, argue, and answer questions. Some of them are designed for specific subject areas; others are more general.
HellaSwag (GitHub)
This benchmark focuses on commonsense inference in natural language: it checks whether the model can plausibly complete realistic human sentences. It contains questions that are trivial for humans but may be difficult for models.
The dataset contains 70 thousand multiple-choice questions (based on ActivityNet or WikiHow) and an adversarial set of machine-generated (and human-verified) incorrect answers. The model must choose one of four options for how to continue the sentence.
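Multiple-choice benchmarks like this are often scored by ranking the candidate continuations by model likelihood. A hedged sketch, where `toy_loglikelihood` is a stand-in for a real model's log-probability and length normalization is one common convention, not something HellaSwag mandates:

```python
def pick_continuation(context, endings, loglikelihood):
    """Choose the ending with the highest length-normalized score.
    `loglikelihood(context, ending)` is a stand-in for a real model call."""
    scores = [loglikelihood(context, e) / max(len(e.split()), 1) for e in endings]
    return max(range(len(endings)), key=lambda i: scores[i])

# Toy scorer: pretends that word overlap with the context means "more likely".
def toy_loglikelihood(context, ending):
    overlap = len(set(context.lower().split()) & set(ending.lower().split()))
    return float(overlap)

context = "She picked up the violin and"
endings = [
    "threw it all into the deep blue sea.",
    "began to play the violin softly.",
    "ate a sandwich on the moon.",
    "drove a car to work.",
]
best = pick_continuation(context, endings, toy_loglikelihood)  # index 1
```

A real harness would replace `toy_loglikelihood` with the sum of per-token log-probabilities the model assigns to each ending.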
BIG-Bench Hard (GitHub)
This benchmark is based on BIG-Bench (Beyond the Imitation Game Benchmark), which contains over two hundred tasks covering a wide range of task types and subject areas.
BIG-Bench Hard focuses on a subset of the 23 hardest BIG-Bench tasks: those on which, before the benchmark's release, model scores could not beat the average human rater.
SQuAD (GitHub)
Stanford Question Answering Dataset (SQuAD) tests reading comprehension. The benchmark contains 107,785 question-answer pairs on 536 Wikipedia articles; the pairs were written by human crowdworkers. In addition, SQuAD 2.0 contains 50 thousand unanswerable questions, to test whether models can detect when the source material does not contain an answer and refrain from answering.
A separate test set is kept confidential to preserve the integrity of the results (so that, for example, models cannot be trained on it). To evaluate a model on the SQuAD test set, the model must be submitted to the benchmark developers.
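SQuAD is typically scored with exact match and token-level F1 after answer normalization. A simplified sketch in the spirit of the official evaluation script (not the script itself):

```python
import re
import string
from collections import Counter

def normalize(text):
    """SQuAD-style normalization: lowercase, drop punctuation and articles."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1_score(prediction, reference):
    """Token-overlap F1 between a predicted and a reference answer span."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Exact match ignores case, punctuation, and articles:
exact_match = normalize("The Eiffel Tower") == normalize("Eiffel Tower")  # True
```

For SQuAD 2.0, a prediction of the empty string is scored against the "no answer" label in the same way.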
IFEval (GitHub)
IFEval evaluates the ability of models to follow instructions in natural language. It contains over five hundred prompts with verifiable instructions, such as "write more than 400 words" or "mention the AI keyword at least three times." IFEval is included in the Open LLM Leaderboard on Hugging Face.
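The "verifiable" part means each instruction can be checked programmatically rather than by a human or judge model. A toy sketch of such checkers (illustrative, not IFEval's actual implementation):

```python
def check_min_words(response, n):
    """Verify the 'write more than N words' instruction."""
    return len(response.split()) > n

def check_keyword_count(response, keyword, n):
    """Verify the 'mention KEYWORD at least N times' instruction."""
    return response.lower().count(keyword.lower()) >= n

# Each IFEval-style prompt carries machine-checkable constraints:
response = "AI is everywhere. AI helps with AI research."
keyword_ok = check_keyword_count(response, "AI", 3)  # True: mentioned 3 times
length_ok = check_min_words(response, 400)           # False: far fewer words
```

The benchmark score is then simply the fraction of instructions (or prompts) whose checks all pass.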
MuSR (GitHub)
The MuSR (Multi-step Soft Reasoning) dataset is designed to evaluate models on tasks involving chains of commonsense reasoning described in natural language. MuSR has two important characteristics that distinguish it from other benchmarks:
- An algorithmically generated dataset of complex tasks
- Free-text narratives corresponding to real-world reasoning domains
MuSR requires models to apply multi-step reasoning to solve murder mysteries, answer questions about the locations of objects, and optimize the assignment of roles in teams. Models must parse long texts to understand the context and then apply reasoning based on that context. MuSR is included in the Open LLM Leaderboard on Hugging Face.
MMLU-PRO (GitHub)
MMLU-PRO stands for Massive Multitask Language Understanding, Professional. It is an improved version of the standard MMLU benchmark.
In this benchmark, models must answer questions with ten possible answers (instead of four, as in regular MMLU), and some questions require reasoning. The dataset's quality is higher than that of MMLU, which contains noisy data and suffers from data contamination (many newer models have likely been trained on its questions), reducing its difficulty for models and therefore its usefulness. MMLU-PRO is considered more challenging than MMLU and is included in the Open LLM Leaderboard on Hugging Face.
MT-Bench
MT-Bench is a multi-turn benchmark (with follow-up questions) that evaluates a model's ability to engage in coherent, informative, and engaging conversations. The benchmark focuses on the model's ability to sustain a conversational flow and follow instructions.
MT-Bench contains 80 questions and 3,300 answers (generated by six models) reflecting human preferences. The benchmark uses the LLM-as-a-judge approach: strong LLMs, such as GPT-4, evaluate the quality of model responses. The answers were annotated by PhD students with expertise in the relevant subject areas.
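The LLM-as-a-judge setup can be sketched as a judge prompt plus a parser for the judge's verdict. The template below is a simplified stand-in, not MT-Bench's actual judge prompt:

```python
import re

# Hypothetical judge prompt in the LLM-as-a-judge style: a strong model
# (e.g., GPT-4) is asked to grade a response on a 1-10 scale.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the assistant's answer to the question "
    "on a scale of 1 to 10 and reply in the form 'Rating: [[N]]'.\n"
    "Question: {question}\nAnswer: {answer}"
)

def build_judge_prompt(question, answer):
    """Fill the judge template for one question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_rating(judge_reply):
    """Extract the numeric rating from the judge model's reply."""
    match = re.search(r"\[\[(\d+)\]\]", judge_reply)
    return int(match.group(1)) if match else None

# The judge model's reply would come from an API call; here it is hard-coded.
rating = parse_rating("The answer is clear and correct. Rating: [[9]]")  # 9
```

Forcing the judge into a fixed output format like `[[N]]` is what makes its verdicts machine-parseable at scale.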
Domain-specific benchmarks
GPQA (GitHub)
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a challenging dataset of 448 multiple-choice questions covering biology, physics, and chemistry. The questions in GPQA can be considered very hard: domain experts, including those with PhDs, achieved only about 65% accuracy on them.
The questions are hard enough to be considered "Google-proof": even with unrestricted web access and more than half an hour of research per question, validators outside the subject area (for example, biologists answering chemistry questions) achieved only 34% accuracy. GPQA is included in the Open LLM Leaderboard on Hugging Face.
MedQA (GitHub)
Medical Question Answering is a multiple-choice benchmark based on the US medical licensing examinations. The benchmark covers three languages, each with many questions: English (12 thousand questions), Simplified Chinese (34 thousand), and Traditional Chinese (14 thousand).
PubMedQA (GitHub)
PubMedQA is a dataset of questions about biomedical research papers. Models must answer each question with one of three options: yes, no, or maybe.
Answering requires some reasoning over the biomedical research text supplied to the model. The dataset contains question-answer sets that are expert-labeled (1 thousand), unlabeled (61.2 thousand), and artificially generated (211.3 thousand).
Coding Benchmarks
We looked at software code generation benchmarks in a separate post: Comparing LLM benchmarks for software development.
Mathematical benchmarks
GSM8K (GitHub)
The purpose of this benchmark is to evaluate multi-step mathematical reasoning. GSM8K is an entry-level benchmark consisting of 8,500 grade-school math problems that a capable high school student can solve. The dataset is divided into 7,500 training problems and 1,000 test problems.
The problems (written by human problem writers) are linguistically diverse and require 2-8 steps to solve. Solving them requires an LLM to apply a sequence of basic arithmetic operations (+, -, *, /).
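GSM8K reference solutions end with a `#### <answer>` marker, so a common way to score a model is to extract and compare only the final number. A simplified sketch:

```python
import re

def extract_final_answer(solution_text):
    """Pull the final numeric answer from GSM8K-style solution text,
    where reference solutions end with '#### <answer>'."""
    match = re.search(r"####\s*(-?[\d,\.]+)", solution_text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))  # drop thousands separators

reference = "She buys 2 * 3 = 6 apples. 6 + 4 = 10 in total.\n#### 10"
prediction = "The total is 10 apples.\n#### 10.0"
correct = extract_final_answer(prediction) == extract_final_answer(reference)  # True
```

Comparing parsed numbers rather than raw strings makes the check robust to formatting differences like `10` vs. `10.0` or `1,000`.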
MATH (GitHub)
The MATH dataset contains 12,500 competition-level math problems. It includes reference data: each problem comes with a step-by-step solution, which makes it possible to evaluate an LLM's problem-solving ability. MATH is included in the Open LLM Leaderboard on Hugging Face.
MathEval (GitHub)
MathEval is designed for thorough testing of the mathematical abilities of LLMs. Its developers intend MathEval to serve as a standard for comparing models' mathematical capabilities.
It is a collection of 20 datasets (including GSM8K and MATH) covering a wide range of mathematical areas, with more than 30 thousand math problems in total. MathEval provides comprehensive testing across difficulty levels and subfields of mathematics (arithmetic, elementary and middle school competition problems, and more advanced topics). Beyond evaluating models, MathEval is also designed to help further improve their mathematical abilities, and it can be extended with new mathematical evaluation datasets as needed.
Security-related benchmarks
PyRIT
PyRIT stands for Python Risk Identification Tool for Generative AI. It's closer to a framework than a standalone benchmark, but it's still a useful tool.
PyRIT is a tool for estimating LLM robustness across a wide range of harm categories. It can be used to identify harm categories, including fabricated/ungrounded content (e.g., hallucinations), misuse (bias, malware generation, jailbreaking), prohibited content (e.g., harassment), and privacy harms (identity theft). The tool automates red-teaming tasks for foundation models and thus contributes to efforts to secure the future development of AI.
Purple Llama CyberSecEval (GitHub)
CyberSecEval (a result of the Meta* project) focuses on the cybersecurity of models used in coding. It is claimed to be the largest unified cybersecurity benchmark.
CyberSecEval provides verification of two critical security areas:
- the probability of generating unsafe code;
- compliance with requests for assistance in cyberattacks.
The benchmark can be used to assess how prepared and able LLMs are to assist attackers in cyberattacks. CyberSecEval includes metrics for quantitatively assessing the cybersecurity risks associated with LLM-generated code.
CyberSecEval 2 is an improved version of the original benchmark that also evaluates protection against prompt injection and malicious use of the code interpreter.
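The first check, the likelihood of generating unsafe code, is typically implemented with static-analysis rules over the model's output. A toy illustration (the patterns below are examples, not CyberSecEval's actual rule set):

```python
import re

# Toy "insecure code detector": flag generated code that matches
# known-risky patterns. Real benchmarks use far richer rule sets.
RISKY_PATTERNS = {
    "weak hash (MD5)": r"hashlib\.md5",
    "shell injection risk": r"subprocess\.\w+\(.*shell\s*=\s*True",
    "arbitrary code execution": r"\beval\s*\(",
}

def flag_insecure(generated_code):
    """Return the names of risky patterns found in the generated code."""
    return [name for name, pattern in RISKY_PATTERNS.items()
            if re.search(pattern, generated_code)]

snippet = "import hashlib\nprint(hashlib.md5(data).hexdigest())"
findings = flag_insecure(snippet)  # ["weak hash (MD5)"]
```

A benchmark score can then be derived as the fraction of generated completions that trigger no findings.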
Conclusion: LLM benchmarks for different subject areas
The list provided in this article should help you choose benchmarks for evaluating LLMs in your use case. Whatever the subject area or application, selecting the right LLM always requires choosing the right benchmarks.
*The Meta organization is recognized as extremist in the Russian Federation.