Exploring the Performance of Large Language Models on Evaluation Tasks

Category Artificial Intelligence

tldr #

Large Language Models (LLMs) are neural networks with billions of parameters that are advancing Artificial Intelligence technology. This article explores the performance of LLMs on different evaluation tasks, such as language generation, knowledge utilization, and complex reasoning, as measured with datasets, benchmarks, automatic metrics, and human ratings.

content #

Chain-of-Thought (CoT) is an improved prompting strategy that boosts the performance of LLMs on complex reasoning tasks, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning. Instead of constructing prompts from input-output pairs alone, as in in-context learning (ICL), CoT also includes in the prompt the intermediate reasoning steps that lead to the final output.

CoT is an emergent ability: it has a positive effect only on sufficiently large models (typically 10B parameters or more), not on small ones. Moreover, because CoT augments standard prompting with intermediate reasoning steps, it mainly helps on tasks that require step-by-step reasoning, such as arithmetic reasoning, commonsense reasoning, and symbolic reasoning.
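The difference between an ICL prompt and a CoT prompt can be illustrated with a small sketch. The `Demo` class, the two builder functions, the "Q:/A:" layout, and the example question are all hypothetical choices made for illustration; they are not taken from any particular paper or library.

```python
from dataclasses import dataclass


@dataclass
class Demo:
    """One demonstration example (hypothetical structure)."""
    question: str
    rationale: str  # intermediate reasoning steps, used only by CoT
    answer: str


def build_icl_prompt(demos, query):
    # Standard ICL: each demonstration is just an input-output pair.
    parts = [f"Q: {d.question}\nA: {d.answer}" for d in demos]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


def build_cot_prompt(demos, query):
    # CoT: each demonstration also spells out the reasoning
    # that leads to the final answer.
    parts = [
        f"Q: {d.question}\nA: {d.rationale} The answer is {d.answer}."
        for d in demos
    ]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


demos = [
    Demo(
        question="Tom has 3 apples and buys 2 more. How many apples does he have?",
        rationale="Tom starts with 3 apples. 3 + 2 = 5.",
        answer="5",
    )
]
query = "A box holds 4 pens. How many pens are in 3 boxes?"

icl_prompt = build_icl_prompt(demos, query)
cot_prompt = build_cot_prompt(demos, query)
print(cot_prompt)
```

Both prompts end with an open "A:" for the model to complete; only the CoT prompt shows the model a worked reasoning chain to imitate.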

LLMs are enabling us to further Artificial Intelligence technology.


To examine the effectiveness of LLMs, a wide range of tasks and benchmarks has been used for empirical evaluation and analysis. Researchers first introduce three types of basic evaluation tasks for language generation and understanding, then present several advanced tasks with more complicated settings or goals, and finally discuss existing benchmarks and empirical analyses.

LLMs are networks with billions of parameters and are more efficient than ever before.

Basic Evaluation Tasks.

Evaluators mainly focus on three types of evaluation tasks for LLMs, i.e., language generation, knowledge utilization, and complex reasoning.

Existing language generation tasks can be roughly categorized into language modeling, conditional text generation, and code synthesis.

Language Modeling. As the most fundamental ability of LLMs, language modeling aims to predict the next token based on the previous tokens, which mainly focuses on the capacity of basic language understanding and generation. For evaluating such an ability, typical language modeling datasets that existing work uses include Penn Treebank, WikiText-103, and the Pile, where the metric of perplexity is commonly used for evaluating the model performance under the zero-shot setting. Empirical studies show that LLMs bring substantial performance gains over the previous state-of-the-art methods on these evaluation datasets.
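Perplexity is the exponential of the average negative log-likelihood that the model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from a model; the function name and the toy inputs are illustrative only.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities.

    PPL = exp(-(1/N) * sum_i log p(x_i | x_<i))
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy input (assumed, not real model output): a model that assigns
# probability 0.25 to every token behaves like a uniform guess over
# four options, so its perplexity is 4.
uniform_log_probs = [math.log(0.25)] * 8
print(perplexity(uniform_log_probs))
```

Lower perplexity means the model finds the text less surprising; a perfect predictor that assigns probability 1 to every token would reach the minimum value of 1.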

LLMs are more versatile in terms of language translation, summarization, and question answering.

Conditional Text Generation. As an important topic in language generation, conditional text generation focuses on generating texts that satisfy specific task demands under given conditions; typical tasks include machine translation, text summarization, and question answering. To measure the quality of the generated text, automatic metrics (e.g., Accuracy, BLEU, and ROUGE) and human ratings are typically used. Owing to their powerful language generation capabilities, LLMs have achieved remarkable performance on existing datasets and benchmarks, even surpassing human performance on some test datasets.
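To give a feel for how overlap-based metrics work, here is a deliberately simplified sketch that scores a candidate against a single reference using unigram counts. The precision value corresponds to the clipped unigram component used inside BLEU, and the recall value corresponds to ROUGE-1 recall; full BLEU additionally combines higher-order n-grams with a brevity penalty, so this is not a replacement for an established implementation. Whitespace tokenization and the function name are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of clipped unigram overlap."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per token, so a word
    # repeated in the candidate is credited at most as often as it
    # appears in the reference.
    overlap = sum((cand_counts & ref_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    return precision, recall


precision, recall = unigram_overlap(
    "the cat sat on the mat",
    "the cat is on the mat",
)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Metrics of this kind reward surface overlap only, which is one reason human ratings are used alongside them.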

The complexity of a task has an effect on the LLM’s performance.

Code Synthesis. Besides generating high-quality natural language, existing LLMs also show strong abilities to generate formal language, especially computer programs (i.e., code) that satisfy specific conditions; this task is called code synthesis. Because generated code can be checked directly by executing it with the corresponding compilers or interpreters, existing work mostly evaluates its quality by calculating the pass rate against test cases.
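A pass-rate check of this kind can be sketched in a few lines. The function name, the `(args, expected)` test-case format, and the two candidate programs below are hypothetical; real evaluation harnesses run generated code in an isolated sandbox with time limits, which this sketch omits.

```python
def pass_rate(candidate_source, entry_point, test_cases):
    """Fraction of test cases a generated program passes.

    WARNING: exec() runs arbitrary code. Only use this on trusted
    input; real harnesses isolate execution in a sandbox.
    """
    if not test_cases:
        return 0.0
    namespace = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[entry_point]
    except Exception:
        # Code that fails to load or lacks the entry point passes nothing.
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            # A runtime error counts as a failed test case.
            pass
    return passed / len(test_cases)


# Two hypothetical model outputs for the task "add two numbers".
correct_candidate = "def add(a, b):\n    return a + b\n"
buggy_candidate = "def add(a, b):\n    return a - b\n"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

print(pass_rate(correct_candidate, "add", tests))
print(pass_rate(buggy_candidate, "add", tests))
```

The buggy candidate still passes the `(0, 0)` case by coincidence, which illustrates why the pass rate depends on how thorough the test cases are.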

Code synthesis focuses on generating programs (i.e., code) that are checked by execution with the corresponding compilers or interpreters.
