Brief overview of LLM benchmarks

When we talk about LLM benchmarking in a particular subject area, we really mean two different things: benchmarking LLM models and benchmarking LLM systems. Benchmarking LLM models means comparing base general-purpose models (for example, GPT, Mistral, Llama, Gemini, Claude, and so on). We should not invest resources in comparing them, for three reasons:

  1. Published leaderboards for these models already exist.
  2. There are many nuances in how the models are used (for example, model variability, prompt design, use case, data quality, system configuration) that reduce the usefulness of discussing their high-level parameters.
  3. Other factors may matter more than raw model accuracy: data locality, compliance with privacy-protection requirements, the cloud service provider, and the degree of customization (for example, fine-tuning or retraining).

What we should discuss is the benchmarking of LLM systems. This is a meaningful and important process in which we evaluate specific LLM models (together with the prompt design and system configuration) in our own specific use cases. We should curate datasets from the relevant subject areas and involve both humans and LLMs in labeling them, producing a "golden" dataset against which we can measure the continuous improvements we make. You might even consider publishing your "golden" benchmark datasets.
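To make this concrete, here is a minimal sketch of what such a golden-dataset evaluation loop might look like. The `generate_answer` callable and the `golden_set.jsonl` file are placeholders invented for this example, and the exact-match scoring is deliberately naive.

```python
import json

def evaluate_system(generate_answer, golden_path="golden_set.jsonl"):
    """Run an LLM system (any callable question -> answer) against a curated
    golden dataset and report the fraction of answers matching the reference."""
    total, correct = 0, 0
    with open(golden_path) as f:
        for line in f:
            example = json.loads(line)  # expects {"question": ..., "reference": ...}
            answer = generate_answer(example["question"])
            # Naive exact-match scoring; in practice you would use semantic
            # similarity or an LLM judge, both discussed later in this article.
            if answer.strip().lower() == example["reference"].strip().lower():
                correct += 1
            total += 1
    return correct / total if total else 0.0
```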

Context

Over the weekend I read a couple of documents on LLM evaluation and benchmarking (see the links at the end of the article), and in this article I summarize what I read. I hope it serves as a good introduction for anyone interested in the topic.

LLM model and LLM system

When discussing benchmarking, it is important to distinguish between LLM models and LLM systems. The accuracy of a bare LLM model is evaluated across a family of models intended for a wide range of use cases. Keep in mind that only a small group of people at OpenAI, Anthropic, Google, and Meta do research at this level, because their job is to train base general-purpose models. Most ML practitioners, however, care about how an LLM performs in a particular use case inside a system, and whether it brings any benefit to the business. That is, the comparison must be made in a specific context, and then various indicators of the LLM system must be assessed, including the dimensions described in the next section.

Benchmarking dimensions

There are many dimensions along which LLM performance can be assessed; a short table of the most popular ones is given below. The benchmarking process should include criteria for each of these dimensions, for example: the percentage of questions answered correctly, the percentage answered incorrectly, and the percentage of cases where the model admits it does not know the answer.
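As a minimal sketch, those three criteria could be computed from labeled results like this; the `Result` record and the `IDK_MARKERS` phrases are assumptions made for the example, not part of any standard framework.

```python
from dataclasses import dataclass

@dataclass
class Result:
    answer: str        # what the LLM system returned
    is_correct: bool   # verdict from a human reviewer or the golden dataset

# Phrases treated as "the model does not know" -- purely illustrative.
IDK_MARKERS = ("i don't know", "i am not sure", "cannot answer")

def score(results: list[Result]) -> dict[str, float]:
    """Bucket each answer into exactly one of: correct, incorrect, unknown."""
    n = len(results)
    unknown = sum(1 for r in results if r.answer.lower().startswith(IDK_MARKERS))
    correct = sum(1 for r in results
                  if r.is_correct and not r.answer.lower().startswith(IDK_MARKERS))
    incorrect = n - correct - unknown
    return {
        "correct_pct": 100 * correct / n,
        "incorrect_pct": 100 * incorrect / n,
        "unknown_pct": 100 * unknown / n,
    }
```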

LLM use cases

When it comes to specific LLM use cases, the ones most commonly discussed are:

Reference datasets

When assessing accuracy, it is important to understand what the reference standard is. A dataset containing reference data is often called a "golden" dataset. It is worth noting that most often there is no true standard; there is only a dataset with labels or answers provided by human experts. Below are the best-known datasets and tests for various purposes. They are often used for benchmarking open-source models, and many public LLM leaderboards (example 1, example 2) compare open-source LLMs with each other using these datasets.
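As an aside, many of these reference datasets can be pulled directly with the Hugging Face `datasets` library; the dataset id and field names below are those of the public `cais/mmlu` release and are only an illustration, since other benchmarks use different schemas.

```python
from datasets import load_dataset

# Load one MMLU subject as an example of a public reference dataset.
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

for row in mmlu.select(range(3)):
    print(row["question"])
    print("  choices:", row["choices"])
    print("  correct answer index:", row["answer"])
```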

LLM as an assessment mechanism

Creating a reference dataset is a non-trivial task for many reasons: user feedback and other "sources of truth" are extremely limited and often do not exist at all, and even when labeling by humans is possible, it is still expensive. Therefore, many are exploring the potential of LLMs to generate synthetic benchmarks that can be used to evaluate other systems. For example, "Judging LLM-as-a-Judge" evaluates Vicuna using GPT-4 as the judge. G-Eval is a new Microsoft framework that uses an LLM for evaluation and consists of two parts: the first generates the evaluation steps, and the second uses those generated steps to produce the final numerical score.
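A minimal sketch of this two-stage idea is shown below: the judge model first drafts evaluation steps, then applies them to produce a numeric score. The `call_llm` helper is a placeholder for whatever model client you use, and the prompts are illustrative rather than the actual G-Eval prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for your model client (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError

def judge(task: str, candidate_answer: str) -> float:
    # Stage 1: ask the judge model to draft evaluation steps for this task.
    steps = call_llm(
        f"You are grading answers to the following task: {task}\n"
        "List the concrete steps you will follow to evaluate an answer."
    )
    # Stage 2: apply the generated steps to the candidate and return a 1-5 score.
    verdict = call_llm(
        f"Task: {task}\nEvaluation steps:\n{steps}\n"
        f"Candidate answer:\n{candidate_answer}\n"
        "Follow the steps above and reply with a single score from 1 to 5."
    )
    return float(verdict.strip().split()[0])
```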

What else?

There are many other aspects to consider when deciding how to use LLMs in systems: for example, the method used to measure semantic similarity when assessing relevance, and the choice of sentence-embedding model (for example, Sentence Transformers or the Universal Sentence Encoder). Some models are sensitive to prompt design and require additional normalization and investigation in that direction. There are also the questions of building the vector database and orchestrating the LLM workflow, and a decision must be made whether to deploy replicated LLMs (to optimize throughput) or split LLMs (to optimize latency), and so on.
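For the semantic-similarity point, here is a minimal sketch using the `sentence-transformers` library; the model name and the example sentences are chosen purely for illustration.

```python
from sentence_transformers import SentenceTransformer, util

# A small general-purpose embedding model, picked here only as an example.
model = SentenceTransformer("all-MiniLM-L6-v2")

reference = "The invoice must be paid within 30 days of receipt."
candidate = "Payment is due no later than one month after the invoice arrives."

# Encode both sentences and compare them with cosine similarity;
# a score close to 1.0 suggests the answers are semantically equivalent.
embeddings = model.encode([reference, candidate], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"semantic similarity: {similarity:.3f}")
```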

Reference materials

  1. [2308.04624] Benchmarking LLM powered Chatbots: Methods and Metrics
  2. MEGAVERSE: Benchmarking Large Language Models Across Languages
  3. LLM Benchmarking: How to Evaluate Language Model Performance | by Luv Bansal
  4. The Guide To LLM Evals: How To Build and Benchmark Your Evals | by Aparna Dhinakaran | Towards Data Science
  5. Evaluating LLM Performance: A Benchmarking Framework on Amazon Bedrock | by Ali Arabi | Feb 2024 | Medium
  6. A Gentle Introduction to LLM Evaluation – Confident AI
  7. G-EVAL: NLG Evaluation using GPT-4 with Better Human Alignment: https://arxiv.org/pdf/2303.16634.pdf
  8. The Definitive Guide to LLM Benchmarking – Confident AI
