Recent Study Evaluates the Biological Knowledge and Reasoning Skills of Different Language Models
Science Saturday - December 23, 2023, 11:34 UTC

Researchers at the University of Georgia and Mayo Clinic recently assessed the biological knowledge and reasoning skills of several large language models (LLMs) and found that OpenAI's GPT-4 outperformed the other models on the market at reasoning about biology problems. The study used a 108-question multiple-choice test to evaluate LLMs including GPT-4, GPT-3.5, PaLM2, Claude2, and SenseNova.
Large language models (LLMs) are advanced deep learning algorithms that can process written or spoken prompts and generate text in response. These models have recently become increasingly popular and now help many users summarize long documents, find inspiration for brand names, get quick answers to simple queries, and produce various other kinds of text.
Researchers at the University of Georgia and Mayo Clinic recently set out to assess the biological knowledge and reasoning skills of different LLMs. Their paper, posted on the arXiv preprint server, suggests that OpenAI's model GPT-4 outperforms the other leading LLMs on the market at reasoning about biology problems.
"Our recent publication is a testament to the significant impact of AI on biological research," Zhengliang Liu, co-author of the recent paper, told Tech Xplore. "This study was born out of the rapid adoption and evolution of LLMs, especially following the notable introduction of ChatGPT in November 2022. These advancements, perceived as critical steps towards Artificial General Intelligence (AGI), marked a shift from traditional biotechnological approaches to an AI-focused methodology in the realm of biology." .
In their recent study, Liu and his colleagues set out to better understand the potential value of LLMs as tools for conducting research in biology. While many past studies emphasized the utility of these models in a wide range of domains, their ability to reason about biological data and concepts has not yet been evaluated in depth.
"The primary objectives of this paper were to assess and compare the capabilities of leading LLMs, such as GPT-4, GPT-3.5, PaLM2, Claude2, and SenseNova, in their ability to comprehend and reason through biology-related questions," Liu said. "This was meticulously evaluated using a 108-question multiple-choice exam, covering diverse areas like molecular biology, biological techniques, metabolic engineering, and synthetic biology." .
Liu and his colleagues aimed to determine how some of the most renowned LLMs available today process and analyze biological information, while also assessing their ability to generate relevant biological hypotheses and tackle biology-related logical reasoning tasks. The researchers compared the performance of five different LLMs using a multiple-choice test.
"Multiple-choice tests are commonly used for evaluating LLMs because the test results can be easily graded/evaluated/compared," Jason Holmes, co-author of the paper explained. "For this study, biology experts designed a 108-question multiple-choice test with a few subcategories." .
Holmes and colleagues asked each LLM all of the questions in the test five times; each time a question was asked, however, they changed how it was phrased.
"The purpose of asking the same question multiple times for each LLM was to determine both the average performance and the average variation in answers," Holmes explained. "We varied the phrasing so as not to accidentally base our results on an optimal or suboptimal phrasing of instructions that led to a change in performance." .