Can Large Language Models Pass Human Evaluations?
Artificial Intelligence | September 3, 2023, 18:50 UTC
Large language models have passed a growing number of assessments designed for humans, but there is little agreement on how to interpret the results.
Webb is a psychologist at the University of California, Los Angeles, who studies the different ways people and computers solve abstract problems. He was used to building neural networks that had specific reasoning capabilities bolted on. But GPT-3 seemed to have learned them for free.
What Webb’s research highlights is only the latest in a long string of remarkable tricks pulled off by large language models. For example, when OpenAI unveiled GPT-3’s successor, GPT-4, in March, the company published an eye-popping list of professional and academic assessments that it claimed its new large language model had aced, including a couple of dozen high school tests and the bar exam. OpenAI later worked with Microsoft to show that GPT-4 could pass parts of the United States Medical Licensing Examination. And multiple researchers claim to have shown that large language models can pass tests designed to identify certain cognitive abilities in humans, from chain-of-thought reasoning (working through a problem step by step) to theory of mind (guessing what other people are thinking). These kinds of results are feeding a hype machine predicting that these machines will soon come for white-collar jobs, replacing teachers, doctors, journalists, and lawyers. Geoffrey Hinton has called out GPT-4’s apparent ability to string together thoughts as one reason he is now scared of the technology he helped create.
But there’s a problem: there is little agreement on what those results really mean. Some people are dazzled by what they see as glimmers of human-like intelligence; others aren’t convinced one bit.
"There are several critical issues with current evaluation techniques for large language models," says Natalie Shapira, a computer scientist at Bar-Ilan University in Ramat Gan, Israel. "It creates the illusion that they have greater capabilities than what truly exists." That’s why a growing number of researchers—computer scientists, cognitive scientists, neuroscientists, linguists—want to overhaul the way they are assessed, calling for more rigorous and exhaustive evaluation. Some think that the practice of scoring machines on human tests is wrongheaded, period, and should be ditched.
"People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI," says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. "The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human." .
"There’s a lot of anthropomorphizing going on," she says. "And that’s kind of coloring the way that we think about these systems and how we test them." .
With hopes and fears for this technology at an all-time high, it is crucial that we get a solid grip on what large language models can and cannot do.
Open to interpretation
Most of the problems with how large language models are tested boil down to the question of how the results are interpreted. Assessments designed for humans, like high school exams and IQ tests, take a lot for granted. When people score well, it is safe to assume that they pored over textbooks, understood what they read, developed a set of skills and strategies, and executed them correctly. But machines are not people: they do not assign meaning to ideas, they struggle to follow long chains of logic, and they cannot intuit how their decisions might affect another entity. Without those abilities, it is impossible to say what a machine has actually “learned” from a given assessment.
Training large language models requires enormous computing power and vast datasets, which makes them difficult and expensive to build. They are not taught from hand-labeled examples of correct behavior; instead, they learn by self-supervision, repeatedly predicting the next token in huge amounts of text and adjusting their billions of parameters to reduce prediction error. That is also why interpretability is such a challenge: a model’s behavior emerges from those learned parameters rather than from explicit rules, so it is difficult to determine why it produced any particular output.
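To make that training objective concrete, here is a minimal, illustrative sketch of self-supervised next-token prediction in PyTorch. The tiny recurrent model, vocabulary size, and random batch of token IDs are hypothetical stand-ins chosen for brevity; real systems use stacked transformer blocks and far larger text corpora, but the core idea of shifting each sequence by one token and penalizing wrong predictions is the same.

```python
import torch
import torch.nn as nn

# Hypothetical, tiny stand-in for a language model: an embedding layer,
# a single LSTM (real LLMs use stacked transformer blocks), and a
# projection back to vocabulary logits.
VOCAB_SIZE = 1000   # real models use vocabularies of tens of thousands of tokens
EMBED_DIM = 64

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
        self.rnn = nn.LSTM(EMBED_DIM, EMBED_DIM, batch_first=True)
        self.head = nn.Linear(EMBED_DIM, VOCAB_SIZE)

    def forward(self, tokens):
        x = self.embed(tokens)
        x, _ = self.rnn(x)
        return self.head(x)   # logits over the vocabulary at every position

model = TinyLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A fake batch of token IDs standing in for tokenized text from a corpus.
batch = torch.randint(0, VOCAB_SIZE, (8, 32))   # 8 sequences, 32 tokens each

# Self-supervision: the "label" for each position is simply the next token.
inputs, targets = batch[:, :-1], batch[:, 1:]

logits = model(inputs)
loss = loss_fn(logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1))
loss.backward()        # compute gradients of the prediction error
optimizer.step()       # adjust parameters to reduce that error
optimizer.zero_grad()
```

Note that no human labels appear anywhere in this loop: the text itself supplies the targets, which is what makes it feasible to train on web-scale data, and also why it is hard to say afterward why the model behaves the way it does.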
Despite the lack of transparency, large language models can match or outperform humans on certain narrow tasks, such as translation, summarization, and drafting short stories from brief prompts. They can also surface patterns in data that are too subtle or too voluminous for humans to spot.
One of the major critiques of large language models is that they are often overhyped: they can only mimic human behavior up to a point, and it is important to understand what these machines cannot do. Even Webb’s questions, for example, presuppose a basic level of logic that a plain language model does not have.
For the same reason, results from assessments designed for humans should be interpreted with care. As impressive as the technology may be, machines do not think like humans, and treating their test scores as evidence of human-like cognition can give a false picture of what the technology is actually capable of.