LLM Evaluation Part 2: Mechanics Behind LLM Scoring Systems

2024/02/23 | Written By: Eujeong Choi (Technical Writer)
 

Introduction

Why do Evaluations Matter?

Since the end of 2022, we have all witnessed a surge of new Large Language Models (LLMs). These models have opened up unprecedented opportunities, hinting at the potential to revolutionize industries across various sectors. However, a key challenge in utilizing LLMs lies in their probabilistic nature: identical prompts can lead to varied outputs. This variability raises a crucial question: how can we ensure the safety and quality of LLM-generated content for application use? On top of that, with new LLMs emerging all the time, it is getting harder to tell which one is genuinely a step forward. So how do we know which model performs best for a given task?

How are benchmark datasets used in this process?

As introduced in Part 1, benchmark datasets act as the SATs for LLMs, providing a fixed, standardized method to evaluate the models' capabilities. These evaluations help us gauge the performance of different LLMs and facilitate their comparison.
But how exactly do we calculate the scores? Are assessments limited to multiple-choice questions, or do they include essay-type queries as well? If so, how exactly do we score these models, and what is the answer sheet? Moreover, how do we rank these models on the leaderboard?

Scoring the model output

Automated Grading for Multiple Choice Questions

A prime example of evaluating LLMs through automated grading is seen in the Massive Multitask Language Understanding (MMLU) benchmark. MMLU is designed to assess the general understanding of LLMs across 57 diverse tasks that span from STEM to social sciences. These tasks test the models' understanding and adaptability across a wide range of subjects and difficulty levels. Being a multiple-choice test, MMLU offers a straightforward approach to benchmarking compared to tests with open-ended questions.


The models under evaluation are given text strings (i.e., "prompts"), which are broken down into tokens (words, sub-words, or characters, depending on the model's design) and fed into the model. The model then predicts a probability distribution over its vocabulary for the next token, enabling the selection of the most likely continuation. This process can be repeated, appending the chosen token to the prompt and generating the next one, until a complete sentence or series of sentences is formed.
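To make this concrete, here is a minimal sketch of that prediction step, assuming the Hugging Face transformers library and the publicly available gpt2 checkpoint (any causal LM would work the same way): the prompt is tokenized, passed through the model, and the logits for the final position are turned into a probability distribution over the vocabulary.

```python
# A minimal sketch: tokenize a prompt, run it through a causal LM, and
# inspect the probability distribution the model assigns to the next token.
# Assumes the Hugging Face transformers library and the public "gpt2" model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits      # shape: (1, sequence_length, vocab_size)

# Probabilities for the token that would follow the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")
```

Generation simply repeats this step, appending the selected token to the input and re-running the model.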

For each question, only one of the provided answers is correct. An illustrative question in the MMLU multiple-choice format (not taken from the actual dataset) looks like this:
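Question: Which of the following planets is known as the Red Planet?
(A) Venus  (B) Mars  (C) Jupiter  (D) Saturn
Correct answer: (B) Mars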

After prompting the model to generate outputs, there are two primary methods for extracting and evaluating an answer from the model (a code sketch of the first method follows this list):

  • Probability Comparison: Assess the probabilities that specific groups of tokens are logical continuations of the given prompt, and then compare these probabilities against predefined choices.

  • Text Generation Comparison: Generate text from the model by selecting tokens iteratively, as described, and compare these generated texts with the texts of various predefined choices.
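Below is a minimal sketch of the probability-comparison approach, in the spirit of common evaluation harnesses, again assuming the transformers library and the gpt2 checkpoint; the question and answer choices are illustrative, not taken from MMLU. Each choice is scored by the log-likelihood the model assigns to it as a continuation of the prompt, and the highest-scoring choice is taken as the model's answer.

```python
# A minimal sketch of probability comparison: score each answer choice by the
# log-likelihood the model assigns to it as a continuation of the prompt.
# Assumes the Hugging Face transformers library; question/choices are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

question = "Question: Which planet is known as the Red Planet?\nAnswer:"
choices = [" Venus", " Mars", " Jupiter", " Saturn"]

def choice_log_likelihood(prompt: str, choice: str) -> float:
    """Sum the log-probabilities of the choice tokens given the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the tokens that belong to the answer choice.
    choice_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    choice_tokens = full_ids[0, prompt_ids.shape[1]:]
    return sum(log_probs[pos, tok].item()
               for pos, tok in zip(choice_positions, choice_tokens))

scores = {c: choice_log_likelihood(question, c) for c in choices}
print(max(scores, key=scores.get))  # the model's predicted answer
```

The text-generation comparison works the other way around: the model generates an answer freely, and the generated string is matched against the predefined choices.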

What about free-form response questions?

Automated Grading for Free-Form Response Questions

Manually evaluating the sheer number of LLMs and their long-form outputs is impractical. A promising automated solution is MT-Bench, which leverages GPT-4, a high-performance LLM, as an adjudicator to compare responses from different models. This approach has proven effective: GPT-4's verdicts agree with both controlled and crowdsourced human judgments over 80% of the time, roughly the same level of agreement observed among humans themselves. Automating the grading of free-form responses makes evaluation far more practical and emphasizes a model's capacity to deliver long text responses rather than answer multiple-choice questions, which aligns closely with real-world user requests.
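As an illustration of the idea, here is a minimal sketch of an LLM-as-judge setup, assuming the OpenAI Python SDK; the judge prompt is a simplified stand-in rather than the actual MT-Bench prompt.

```python
# A minimal sketch of pairwise LLM-as-judge grading.
# Assumes the OpenAI Python SDK; the prompt is illustrative, not MT-Bench's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_TEMPLATE = """You are an impartial judge. Given a user question and two
assistant answers, decide which answer is better. Reply with "A", "B", or "tie".

Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}
"""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,  # deterministic judging
    )
    return response.choices[0].message.content.strip()
```

In practice, judges are typically run in both answer orders to reduce position bias, and the verdicts are aggregated into per-model win rates.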

Human Evaluation

Human evaluation plays a crucial role in assessing the outputs of applications before they go live. This process often involves the use of human annotators who assess the application's responses using a specially constructed test dataset. Evaluation techniques include scoring the answers, conducting A/B testing, and comparing responses to a 'golden set' of correct answers. Additionally, post-deployment performance can be monitored through real-world feedback mechanisms, such as analyzing the ratio of thumbs-up responses from users. This approach provides insight into how well the model meets user expectations in actual usage scenarios.
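For concreteness, here is a minimal sketch, with made-up field names and data, of two of the techniques mentioned above: exact-match scoring against a golden set and monitoring the thumbs-up ratio from post-deployment feedback.

```python
# A minimal sketch of golden-set scoring and post-deployment feedback monitoring.
# Field names and data are illustrative.
def golden_set_accuracy(outputs: list[str], golden: list[str]) -> float:
    """Fraction of outputs that exactly match the reference answers."""
    matches = sum(o.strip() == g.strip() for o, g in zip(outputs, golden))
    return matches / len(golden)

def thumbs_up_ratio(feedback_events: list[dict]) -> float:
    """Share of user feedback events that are thumbs-up."""
    ups = sum(e["rating"] == "up" for e in feedback_events)
    return ups / len(feedback_events)

print(golden_set_accuracy(["Paris", "Mars"], ["Paris", "Venus"]))          # 0.5
print(thumbs_up_ratio([{"rating": "up"}, {"rating": "down"}, {"rating": "up"}]))  # ~0.67
```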

Ranking models in the Leaderboard

The ranking of models on leaderboards is typically based on the average score obtained across several benchmark datasets, often referred to as a total score (e.g., H6 or H4, averages over six or four benchmarks, respectively). This is a straightforward arithmetic mean that treats all benchmarks equally, without giving undue weight to any single evaluation criterion.
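As a sketch, the ranking mechanism boils down to an unweighted mean per model; the benchmark names below are real, but the scores are made up.

```python
# A minimal sketch of leaderboard ranking: a plain, unweighted average of
# per-benchmark scores. Scores are illustrative.
benchmark_scores = {
    "model_a": {"MMLU": 65.2, "ARC": 61.0, "HellaSwag": 83.4},
    "model_b": {"MMLU": 70.1, "ARC": 58.3, "HellaSwag": 80.9},
}

totals = {m: sum(s.values()) / len(s) for m, s in benchmark_scores.items()}
leaderboard = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, total) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {total:.1f}")
```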

Current Limitations in Evaluation Methods Overall

Despite the structured approach to evaluating and ranking LLMs, several limitations become apparent when these models are applied to real-world contexts:

  • Outdated Data: Evaluation benchmarks can quickly become outdated as the landscape of data is ever-changing, with new information generated daily. However, leaderboards are often snapshots, fixed in time and unable to reflect the ongoing evolution of data, rendering them less relevant as time progresses.

  • Failure to Reflect Real-World Complexity: The true value of a model lies in its ability to adapt and respond effectively to real-world challenges. This involves assessing LLMs on their capacity to produce reliable outputs in a requested format, that is, their function-calling abilities. It is also crucial to evaluate how well these models guard against users' jailbreaking attempts, such as prompts aimed at generating harmful content or extracting private information from the training data. Moreover, we need to check how a model navigates the delicate balance between offering helpful responses and avoiding politically biased answers, in accordance with its policies.

  • Questionable Meaningfulness of Competition: There is a risk that models are overly optimized for specific test sets, leading to a form of overfitting that prioritizes leaderboard success over practical applicability. This situation suggests that the emphasis is more on climbing the rankings rather than ensuring models are truly effective in real-world scenarios.


Moving Beyond the Benchmarks

A better benchmark

Looking ahead, it's imperative to broaden the scope of tasks included in benchmarks. Beyond merely assessing basic language capabilities, benchmarks must rigorously evaluate LLMs' trustworthiness to assure users of their reliability. This includes tasks like thwarting jailbreaking attempts by users to elicit prohibited content or ensuring answers are factually grounded, enhancing the model's credibility and utility.

Expanding the range and diversity of evaluation languages is also crucial for advancing multilingual capabilities in language models (LMs), pushing past the confines of English and Korean to embrace a wider spectrum of languages. This inclusivity enriches LLM development with cultural diversity and broadens understanding across languages.
Additionally, evaluating a model's ability to understand and generate programming languages is emerging as a crucial aspect of assessing its reasoning capabilities. Research on the Chain of Code (CoC) approach shows that representing tasks as code or simulated pseudocode offers an effective back door for unlocking LLM reasoning abilities.

Diversifying tasks, languages, and coding abilities in evaluations underscores the interconnectedness of these elements. Adopting this holistic measure provides a significant step forward in appreciating the comprehensive potential of LLMs' general capabilities.

Alternative ways to Evaluate

Although benchmarks are a great way to compare models, evaluations in a real online environment, as opposed to fixed benchmark datasets, are highly meaningful. For example, Chatbot Arena is a good way to evaluate LLMs in a live setting: users chat with two anonymous models side by side and vote for the better response. Another fun example is NegotiationArena, where agents are instructed to barter over prices to divide up resources. Here, LLM agents have demonstrated improved negotiation outcomes by adopting strategies such as appearing sly yet desperate. This underscores the importance of evaluating LLMs in practical scenarios to truly understand their capabilities and limitations.

Final Thoughts

This article explored the foundational mechanisms of evaluating LLMs, ranging from automated metrics to human annotation, and delved into the limitations of current methods alongside promising alternatives.

While our discussion has been broad, it's crucial to recognize that specific tasks—such as code execution or information extraction—might benefit from more tailored approaches not covered here. Moreover, considering the inherent pitfalls of each evaluation method, selecting an approach with an understanding of its limitations is vital for conducting an unbiased assessment.

The journey towards finding methodologies that both complement existing ones and enhance the trustworthiness of LLMs is far from over. The search continues!

Acknowledgements

Many thanks to Ji-hoo Kim and Cheul-young Park for their helpful suggestions on this blog post and for answering all my questions.

To get started with LLMs, try out Upstage’s Solar API.
