Introduction to LLM Benchmarks

Aug 26, 2024

In the rapidly evolving field of artificial intelligence (AI), large language models (LLMs) have become a cornerstone of natural language processing and generation, and the fuel powering much of the tech industry's current boom. As these models grow in complexity and capability, the need for robust, comprehensive benchmarks to evaluate their performance has never been greater. This blog article explores the landscape of popular LLM benchmarks, offering insights into their methodologies, strengths, and limitations.

Whether you're a seasoned AI professional or just starting to navigate the world of language models, understanding these benchmarks is crucial for gauging the progress and potential of this transformative technology. Let’s take a look!

Graduate-Level Google-Proof Q&A (GPQA)

The Graduate-Level Google-Proof Q&A (GPQA) benchmark evaluates the performance of LLMs in answering highly challenging, graduate-level questions in biology, physics, and chemistry. As its differentiating characteristic, GPQA presents problems that require deep, specialized understanding and reasoning in these specific scientific fields. The questions are intentionally crafted to be "Google-proof," so you can't simply find the direct answer through a web search.

The questions were crafted by actual experts and researchers in the fields covered, and even highly educated professionals tested on them came nowhere near a 100% pass rate. If you are curious, you can download the dataset from GitHub and check the questions yourself. The dataset is provided as a password-protected ZIP archive (the password is published in the repo; this makes it less likely that the answers end up in LLM training data). You can find the GPQA paper here: https://arxiv.org/abs/2311.12022 as well as the GitHub repo to run the benchmark here: https://github.com/idavidrein/gpqa
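
If you want to poke at the data programmatically, here is a minimal sketch. The archive name, the CSV file name, its column names, and the password are all placeholders; check the GPQA repo for the actual values.

```python
import csv
import zipfile

# Placeholder names: check the GPQA repo for the real archive name,
# the CSV files inside it, and the password published in the README.
ARCHIVE = "dataset.zip"
PASSWORD = b"password-from-the-repo"

# Extract the password-protected ZIP (standard zip encryption is
# supported by Python's zipfile module).
with zipfile.ZipFile(ARCHIVE) as zf:
    zf.extractall(path="gpqa_data", pwd=PASSWORD)

# Peek at the first few questions from one of the extracted CSV files.
with open("gpqa_data/gpqa_main.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f)):
        if i == 3:
            break
        print(row["Question"][:200], "...")
```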

MMLU and MMLU-Pro

MMLU (Massive Multitask Language Understanding) and MMLU-Pro are benchmarks designed to evaluate the performance of LLMs across a diverse range of subjects and tasks through multiple-choice questions. 

MMLU was first introduced back in September 2020, before the massive debut of ChatGPT put the entire industry in a frenzy. At that point in time, it was a comprehensive test that covered 57 tasks from various domains, such as STEM, humanities, and social sciences, with difficulties ranging from high school to professional levels. The rationale behind MMLU was to create a robust tool for assessing the general knowledge and problem-solving abilities of LLMs, offering insights into how well these models can handle real-world, multi-disciplinary challenges. However, with the latest generation of LLMs (GPT-4, Gemini 1.5, Claude 3.5 Sonnet, etc.), most of the state-of-the-art LLMs now have scores in the high 80% range.

As such, MMLU-Pro was developed to address some limitations of the original benchmark and to push the boundaries of what current LLMs can do. Published in June 2024, it introduces several key improvements aimed at increasing the difficulty and discriminative power of the benchmark. For example, it expands the number of answer options from four to ten, making it harder for models to rely on guessing. Additionally, MMLU-Pro focuses more on reasoning tasks, particularly in STEM subjects, and the question-and-answer dataset underwent rigorous, expert-led quality checks to remove noisy or trivial questions. These enhancements help the benchmark catch up to the current state of LLMs, and most of the state-of-the-art models now score in the 70% range.
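
To make the mechanics concrete, here is a minimal sketch of how a multiple-choice benchmark like MMLU or MMLU-Pro is typically scored (not the official harness): each question and its options (four for MMLU, up to ten for MMLU-Pro) are rendered into a prompt, the model is asked for a single letter, and accuracy is computed against the gold answers. The `ask_model` callable is a placeholder for whatever LLM API you use.

```python
import string

def format_prompt(question: str, options: list[str]) -> str:
    """Render one question with lettered options (4 for MMLU, up to 10 for MMLU-Pro)."""
    letters = string.ascii_uppercase[: len(options)]
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(letters, options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def evaluate(dataset: list[dict], ask_model) -> float:
    """dataset: dicts with 'question', 'options' (list of str), and 'answer' (a letter).
    ask_model: callable that takes a prompt string and returns the model's reply."""
    correct = 0
    for item in dataset:
        reply = ask_model(format_prompt(item["question"], item["options"]))
        # Take the first uppercase letter in the reply as the model's choice;
        # a real harness parses the answer more robustly than this.
        predicted = next((ch for ch in reply.upper() if ch in string.ascii_uppercase), "")
        correct += predicted == item["answer"].upper()
    return correct / len(dataset)
```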

The original MMLU paper can be found here: https://arxiv.org/abs/2009.03300. The newer MMLU-Pro paper can be found here: https://arxiv.org/abs/2406.01574 and the GitHub repo can be found here: https://github.com/TIGER-AI-Lab/MMLU-Pro

HumanEval

The HumanEval benchmark was introduced by OpenAI (yes, THAT OpenAI) researchers in 2021 as part of their work on the Codex model, a precursor to the groundbreaking GPT-3.5. It was designed to evaluate the code generation capabilities of LLMs in a way that more closely mimics real-world programming tasks.

HumanEval consists of 164 hand-written programming problems, each including a function signature, docstring, and a set of unit tests. The benchmark is intended to measure an LLM's ability to generate functionally correct code based on natural language descriptions. Unlike many other code-related benchmarks that focus on code completion or syntax prediction, HumanEval requires models to generate entire function implementations from scratch; the generated code is then executed against the provided unit tests and must pass all of them to be considered correct.
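
To give a feel for the format, here is a hand-made, HumanEval-style toy problem (not an actual task from the benchmark): the model sees only the function signature and docstring, its completion supplies the body, and the combined code is executed against the unit tests.

```python
# Prompt shown to the model: signature plus docstring.
PROMPT = '''
def running_max(numbers: list[int]) -> list[int]:
    """Return a list where each element is the maximum of all
    elements of numbers seen so far."""
'''

# A completion the model might return for the function body.
COMPLETION = """
    result, current = [], None
    for n in numbers:
        current = n if current is None else max(current, n)
        result.append(current)
    return result
"""

def check(candidate):
    # Unit tests the benchmark uses to decide pass/fail.
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []

# Execute prompt + completion, then run the tests against the result.
namespace = {}
exec(PROMPT + COMPLETION, namespace)
check(namespace["running_max"])
print("All tests passed")
```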

This way, HumanEval focuses on functional correctness rather than code that merely looks like it follows the programming language's syntax. However, HumanEval is showing its age: at this point, most state-of-the-art models score in the high 80% range, with some, like GPT-4o and Claude 3.5 Sonnet, scoring over 90%. The paper for HumanEval can be found here: https://arxiv.org/abs/2107.03374 and the GitHub repo is here: https://github.com/openai/human-eval
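
The reported scores are typically pass@k values (usually pass@1). The HumanEval paper defines an unbiased estimator for pass@k: generate n samples per problem, count the c that pass all tests, and estimate the probability that at least one of k randomly drawn samples passes. A direct translation of that formula is below (the official repo ships its own implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples generated,
    c of them passed the unit tests, k samples drawn."""
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples generated, 30 of them pass the tests.
print(pass_at_k(200, 30, 1))   # ~0.15
print(pass_at_k(200, 30, 10))  # ~0.81
```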

LMSYS Chatbot Arena

Unlike traditional benchmarks that rely on predefined tasks or questions, the LMSYS Chatbot Arena employs a more dynamic, user-centric evaluation method. This approach addresses some key weaknesses of classic benchmarks, particularly their inability to capture how models perform in open-ended conversations as experienced by human users. In the Chatbot Arena, the user prompts two models and then selects their favorite answer without knowing which model produced it. There is no fixed set of questions and no strict evaluation criteria; it's all crowdsourced from the users of the arena. The results are then exposed as a live leaderboard shaped by the votes in real time.

At the core of the Chatbot Arena's methodology is the Elo rating system, originally developed for chess rankings. When a user interacts with two LLMs side-by-side and chooses a "winner," the system updates the Elo scores of both models. Higher-rated LLMs are expected to outperform lower-rated ones, and the magnitude of rating changes depends on the difference between the models' pre-existing ratings. This creates a dynamic leaderboard that evolves based on ongoing user feedback, providing a more user-oriented measure of chatbot capability.
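
To make the update rule concrete, here is a minimal sketch of a single Elo update after one vote, using the standard Elo formula. The K-factor of 32 is just an illustrative choice, and the actual leaderboard computes ratings from the full vote history, so treat this as the idea rather than the production math.

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    # Expected score of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    expected_b = 1.0 - expected_a
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

# A 1200-rated model beating a 1300-rated one gains more points than it
# would for beating a lower-rated opponent.
print(elo_update(1200, 1300, score_a=1.0))  # ~ (1220.5, 1279.5)
```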

The Chatbot Arena's approach attempts to address several weaknesses of traditional benchmarks. First, it captures performance across a nearly unlimited range of real-world scenarios and inputs, rather than a limited set of predefined tasks or questions. Second, it incorporates human judgment, which can account for subtle qualities like conversational naturalness or creative problem-solving that may be missed by a Q&A dataset alone. Third, its ongoing nature allows for continuous evaluation as models are updated or new ones are introduced, providing a more current view of the AI landscape. Keep in mind that this method also has limitations, such as potential biases in user judgments and the challenge of ensuring consistent evaluation criteria across a large anonymous user base, where users can, if they wish, skew the results by voting dishonestly.

The Chatbot Arena paper can be found here: https://arxiv.org/abs/2403.04132, the live leaderboard can be found here: https://chat.lmsys.org/?leaderboard, and you can even cast your vote and influence the results over here: https://lmarena.ai/

Conclusion

The industry continues to mature in how it adopts and evaluates LLMs. From science- and coding-focused benchmarks like GPQA and HumanEval all the way to crowdsourced leaderboards like the Chatbot Arena, this is a very exciting time, with innovations coming at breakneck speed. Being able to compare models and understand where a particular model's strengths and weaknesses lie will help everyone trying to build LLM solutions.
