LLM Evaluation Part 1. What is a Benchmark Dataset?

2024/02/01 | Written By: Eujeong Choi (Technical Writer)

Introduction

Why do we need Benchmark Datasets?

Since the end of 2022, there has been a surge of new Large Language Models (LLMs) accessible to the public. With new LLMs emerging from every direction, it is getting harder to tell which one is truly the next big deal. So how do we know which one is really good?

What are Benchmark Datasets?

Benchmark datasets are like the SAT for LLMs: a fixed, standardized way to assess the quality of models. The scores these LLMs earn let us measure their performance, compare them against each other, and, just as importantly, learn which subjects they are good at. For a task that demands mathematical reasoning, it is wiser to pick a model that excels at that skill than to grab whichever model has the best general language-processing ability.

All about Benchmark Datasets

The Traditional Metrics: Perplexity & BLEU
Evaluating language models traditionally involves assessing their core capability: predicting the next word in a sequence.

The first is perplexity, which measures the model's ability to anticipate upcoming text. A lower perplexity score signifies greater predictive accuracy, reflecting the model's proficiency at forecasting the next word. While perplexity is useful for monitoring a model's progress during training and for checking the basic quality of its output, it does not tell us whether that output is actually appropriate for a given task.
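
Concretely, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of a held-out sequence:

```latex
\mathrm{PPL}(w_1, \dots, w_N) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(w_i \mid w_{<i}\right) \right)
```

A model that consistently places high probability on the tokens that actually come next therefore earns a low perplexity.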

Another metric is the BLEU (Bilingual Evaluation Understudy) score, which evaluates how close the output of a language model is to text written by humans. Roughly speaking, it counts how many of the word sequences (n-grams) in the model's output also appear in a human reference text, divides by the total number of n-grams in the output, and penalizes outputs that are too short. BLEU scores range from 0 to 1, with scores closer to 1 indicating greater similarity to human-produced text. However, BLEU has its limitations, as it does not account for the context of the text. For example, a casual text message and a formal news article require different linguistic approaches, which BLEU may not differentiate between. This shows how the traditional metrics alone cannot cover the evaluation of language models across all domains and tasks.
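
As a rough illustration (real BLEU combines clipped precision over several n-gram sizes), the sketch below computes a unigram-only precision with a brevity penalty; the example sentences are made up:

```python
import math
from collections import Counter

def unigram_bleu(candidate: str, reference: str) -> float:
    """Simplified BLEU: clipped unigram precision multiplied by a brevity penalty."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()

    cand_counts = Counter(cand_tokens)
    ref_counts = Counter(ref_tokens)

    # A candidate word only counts as many times as it appears in the reference.
    matches = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = matches / max(len(cand_tokens), 1)

    # Penalize candidates that are much shorter than the reference.
    if len(cand_tokens) >= len(ref_tokens):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref_tokens) / max(len(cand_tokens), 1))

    return brevity_penalty * precision

print(unigram_bleu("the cat sat on the mat", "the cat is on the mat"))  # ~0.83
```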

The Big Six Benchmark Datasets:
ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8k

So how should we evaluate these models? Imagine a Large Language Model (LLM) as a young adult navigating the world for the first time. It needs to grasp basic world knowledge, including economics and politics, possess commonsense and inference abilities, detect misinformation, and solve simple mathematical problems. These capabilities are assessed using the six benchmark datasets listed above, which contribute to the rankings on the Hugging Face Open LLM Leaderboard. Plus, the first four datasets have been expertly translated into Korean and are featured on the Hugging Face Open Ko-LLM Leaderboard, co-hosted by Upstage and NIA.

Open Ko-LLM Leaderboard co-hosted by Upstage and NIA.

What does each Benchmark mean and Why?

1. ARC (AI2 Reasoning Challenge)

  • Purpose: Reasoning
    ARC evaluates LLMs on elementary school science questions to test abstract reasoning, requiring broad general knowledge and deep reasoning abilities.

  • Example Question: LLMs are presented with a science question and must choose the correct answer from four options, demonstrating their reasoning skills.

Source: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
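
For readers who want to look at the benchmark directly, the sketch below loads a few ARC questions with the Hugging Face `datasets` library. The dataset identifier, configuration name, and field names (`allenai/ai2_arc`, `ARC-Challenge`, `question`, `choices`, `answerKey`) reflect the Hub listing at the time of writing and may change:

```python
from datasets import load_dataset

# ARC-Challenge is the harder configuration used by the Open LLM Leaderboard.
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="test")

sample = arc[0]
print(sample["question"])          # the science question
print(sample["choices"]["text"])   # the multiple-choice options
print(sample["answerKey"])         # the label of the correct option, e.g. "B"
```

The other benchmarks in this list are hosted on the Hub in the same way, so swapping in a different dataset name is usually enough to browse them.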

2. HellaSwag

  • Purpose: Commonsense inference
    HellaSwag assesses LLMs’ commonsense inference by having them complete a passage that ends abruptly, testing their ability to understand and predict text continuation based on context.

  • Example Question: An incomplete passage is given, and the LLM must pick, from several candidate completions, the ending that logically follows from the provided context.

Source: HellaSwag: Can a Machine Really Finish Your Sentence?

3. MMLU (Massive Multitask Language Understanding)

  • Purpose: General Understanding
    MMLU measures the general knowledge of LLMs across 57 diverse tasks ranging from STEM to social sciences, evaluating their understanding and adaptability across various subjects and difficulty levels.

  • Example Question: LLMs encounter questions from different domains, requiring them to apply their knowledge to provide accurate answers, reflecting their broad and versatile understanding.

Source: Measuring Massive Multitask Language Understanding
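
On the leaderboard, multiple-choice benchmarks such as MMLU, ARC, and HellaSwag are typically scored by comparing the likelihood the model assigns to each candidate answer rather than by free-form generation. The sketch below illustrates that idea with the Hugging Face `transformers` library; the model name is a placeholder and the question is a made-up example, not an actual MMLU item:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal language model on the Hub works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_logprob(question: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to the answer tokens given the question."""
    prompt_ids = tokenizer(question, return_tensors="pt").input_ids
    full_ids = tokenizer(question + " " + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 2..N
    targets = full_ids[:, 1:]
    per_token = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    n_answer_tokens = full_ids.shape[1] - prompt_ids.shape[1]
    return per_token[0, -n_answer_tokens:].sum().item()

question = "Question: What is the capital of France? Answer:"
choices = ["Paris", "London", "Berlin", "Madrid"]
scores = {c: answer_logprob(question, c) for c in choices}
print(max(scores, key=scores.get))  # the option the model considers most likely
```

The actual leaderboard delegates this to EleutherAI's lm-evaluation-harness, which also handles few-shot prompting and answer normalization.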

4. TruthfulQA

  • Purpose: Recognizing false information
    TruthfulQA tests whether LLMs reproduce false answers rooted in common human misconceptions.

  • Example Question: The tasks are either multiple choice or generation. The following is a generation task.

Source: TruthfulQA: Measuring How Models Mimic Human Falsehoods

5. WinoGrande

  • Purpose: Context-based inference
    WinoGrande tests the ability of LLMs to grasp context in natural language. Its problems take the form of nearly identical sentence pairs with two possible answers, and the correct answer flips depending on a trigger word.

  • Example Question: To accurately identify the word to which the pronoun refers, the LLM must comprehend the sentence context.

Source: WinoGrande: An Adversarial Winograd Schema Challenge at Scale

6. GSM8k

  • Purpose: Mathematical reasoning
    GSM8k tests LLMs on their ability to solve multistep mathematical problems, using basic math operations. GSM8k features grade-school math questions that require two to eight steps to solve, gauging the models' capacity for mathematical reasoning and problem-solving.

  • Example Question: The LLM must answer by working through arithmetic steps that are described in natural language.

Source: Training Verifiers to Solve Math Word Problems
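
In the GSM8k dataset, each reference solution ends with its final numeric answer after a `####` marker, so scoring usually reduces to extracting that number from the model's output and checking for an exact match. A minimal sketch of that check (the regular expression and the example strings are illustrative):

```python
import re

def extract_final_answer(text: str):
    """Return the final number in a solution, preferring an explicit '#### <answer>' marker."""
    marker = re.search(r"####\s*([-+]?[\d,]*\.?\d+)", text)
    if marker:
        return marker.group(1).replace(",", "")
    numbers = re.findall(r"[-+]?[\d,]*\.?\d+", text)
    return numbers[-1].replace(",", "") if numbers else None

reference = "Natalia sold 48 clips in April and half as many in May: 48 + 24 = 72. #### 72"
model_output = "She sold 48 clips in April and 24 in May, so the answer is 72."

print(extract_final_answer(model_output) == extract_final_answer(reference))  # True
```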

Limitations and Solutions

Limitations

The benchmark datasets currently in use have notable limitations, particularly in their failure to incorporate the safety of Large Language Models (LLMs) as an evaluation criterion. While the OpenAI Moderation API attempts to address safety concerns, it falls short of providing a comprehensive solution. Furthermore, these benchmarks lack a unified framework, resulting in a fragmented evaluation landscape where assessments are scattered across different platforms.

Alternative Solutions

In search of alternative solutions, manual human evaluation emerges as a viable method. Human judges assess the quality and performance of LLMs by rating or comparing their outputs. One platform built on this idea is Chatbot Arena, where users converse with two anonymized LLMs and then vote for the one they believe performed better; these votes feed into Elo ratings that rank the models on a leaderboard. There is also MT-Bench, a set of open-ended, multi-turn questions whose answers are graded by strong LLMs acting as judges. Finally, the thumbs-up ratio offers a direct feedback mechanism: users rate an LLM's outputs on the spot, giving a straightforward measure of user satisfaction and model effectiveness. By this measure, users gave higher thumbs-up ratings to Solar, Upstage's Large Language Model, than to GPT-3.5.
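
Because Chatbot Arena turns these pairwise votes into Elo ratings, it helps to see how a single vote moves two models' scores. The sketch below implements the textbook Elo update; the starting ratings and K-factor of 32 are illustrative, and the real leaderboard's computation differs in detail:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both models' updated ratings after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# One anonymized battle in which the user prefers model A.
print(elo_update(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)
```

Beating a higher-rated model raises the winner's rating by more than beating a lower-rated one, which is why enough votes eventually sort the models into a stable ranking.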

Conclusion

Our examination has delved into the reasons and methods behind evaluating Large Language Models (LLMs). Specifically, we have closely analyzed six benchmark datasets, gaining insight into both their limitations and the alternative approaches within current evaluation methodologies. In Part 2 of our LLM Evaluation series, we will further explore the practical aspects of how these evaluations are conducted, along with addressing various concerns associated with the evaluation process.
