The Uncertain Times Ahead of OpenAI's Language Model

Category Science

tldr #

OpenAI's large language model, ChatGPT, has been praised for its impressively natural responses to user inquiries and has scored in high percentiles on exams across many fields. However, researchers from Stanford and UC Berkeley have recently documented significant changes in its performance over a four-month period, from March to June 2023, and are uncertain about the long-term effects. OpenAI says it is aware of the issue and is addressing it.

content #

OpenAI's widely celebrated large language model has been hailed as "quite simply the best artificial intelligence chatbot ever released to the general public" by Kevin Roose, author of "Futureproof: 9 Rules for Humans in the Age of Automation," and as "one of the greatest things that has ever been done for computing" by Nvidia CEO Jensen Huang. ChatGPT has become so good at providing natural responses to user inquiries that some believe it has officially passed the Turing test, a longstanding measure of a machine's ability to exhibit human-like intelligence.

Stanford researchers noted a significant change in the performance of GPT-4 over four months, from March to June 2023

ChatGPT has scored in the highest percentiles of achievement exams in a myriad of fields: math (89th), law (90th) and GRE verbal (99th).

And researchers at NYU's medical school reported in early July 2023 that advice given by ChatGPT for health-care-related questions was almost indistinguishable from that provided by human medical staff.

But researchers at Stanford University and the University of California, Berkeley, are not quite ready to entrust ChatGPT with any critical decision-making. Echoing a growing number of concerns recently expressed by users, Lingjiao Chen, Matei Zaharia and James Zou said ChatGPT's performance has not been consistent. In some instances, it is growing worse.

2.4% accuracy when the June 2023 model was used for problem solving involving prime numbers

In a paper published on the arXiv preprint server July 18, the researchers said the "performance and behavior of both GPT-3.5 and GPT-4 vary significantly" and that responses on some tasks "have gotten substantially worse over time."

They noted significant changes in performance over a four-month period, from March to June.

The researchers focused on a few areas including math problem solving and computer code generation.

Computer code completion accuracy dropped from 50% in March to 10% in June

In March 2023, GPT-4 achieved a 97.6% accuracy rate when tackling problems concerning prime numbers. That rate plummeted to just 2.4% when the updated June 2023 model was used, according to the Stanford researchers.
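Accuracy figures like these come from comparing a model's yes/no answers against a deterministic ground truth. The sketch below, with hypothetical question numbers and answers, shows the general shape of such a check: a trial-division primality test supplies the ground truth, and the score is the fraction of answers that match. (17077 is the number used as an example prompt in the paper; the `score` helper and the sample answers are illustrative, not the researchers' actual harness.)

```python
def is_prime(n: int) -> bool:
    """Deterministic trial-division primality test (the ground truth)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def score(numbers, model_answers):
    """Fraction of yes/no answers that match the ground truth."""
    correct = sum(
        (ans == "yes") == is_prime(n)
        for n, ans in zip(numbers, model_answers)
    )
    return correct / len(numbers)

# Hypothetical run: four questions, one wrong answer -> 75% accuracy.
numbers = [17077, 17078, 104729, 104730]
answers = ["yes", "no", "no", "no"]  # third is wrong: 104729 is prime
print(score(numbers, answers))  # 0.75
```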

ChatGPT has garnered wide praise for its ability to assist coders with programming and debugging issues. In March, GPT-4 responded to coder requests by completing accurate, ready-to-run scripts a little over 50% of the time. But by June, the rate dropped to 10%. GPT-3.5 also showed a notable decline in accuracy, from 22% in March to 2% in June.
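The "ready-to-run" metric hinges on whether a reply can be executed exactly as returned. The researchers noted that later model versions often wrapped code in markdown fences, which by itself makes the raw output non-runnable. A minimal sketch of that kind of check, with illustrative function names (not the paper's actual evaluation code), is:

```python
import re

def extract_code(reply: str) -> str:
    """If the reply wraps code in ``` fences, keep only the fenced body."""
    m = re.search(r"```(?:python)?\n(.*?)```", reply, re.DOTALL)
    return m.group(1) if m else reply

def is_executable(reply: str) -> bool:
    """True if the raw reply compiles as Python without any cleanup."""
    try:
        compile(reply, "<model-output>", "exec")
        return True
    except SyntaxError:
        return False

raw = "```python\nprint('hello')\n```"
print(is_executable(raw))                # False: fences break compilation
print(is_executable(extract_code(raw)))  # True once the fence is stripped
```

Scoring the raw reply versus the fence-stripped reply is exactly the kind of judgment call that can make a measured accuracy drop look steeper than the underlying code quality alone would suggest.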

Math abilities improved for GPT-3.5 in June, achieving 86.8% accuracy for prime number problem solving

Interestingly, GPT-3.5 showed nearly opposite results in math abilities: Achieving only a 7.4% accuracy rate in prime-number problem solving in March, the upgraded version in June achieved an 86.8% rate.

Zou said it was difficult to pinpoint a cause, though it seems apparent that system modifications and upgrades are factors.

"We don't fully understand what causes these changes in ChatGPT's responses because these models are opaque," Zou said. "It is possible that tuning the model to improve its performance in some domains can have unexpected side effects of making it worse on other tasks."

OpenAI is responding to the issue and working on a remedy

Conspiracy theorists who have noticed a deterioration in some results suggest OpenAI is experimenting with alternate, smaller versions of LLMs as a cost-saving measure. Others venture that OpenAI is intentionally weakening GPT-4 so frustrated users will be more willing to pay for GitHub's AI coding assistant, Copilot.

OpenAI, for its part, says it is aware of the problem and is working swiftly to resolve it. It remains to be seen what the long-term outcome will be.

GPT-4 is an opaque system, and it's difficult to pinpoint the cause of performance dips
