How Trustworthy are Large Language Models?
Category Science Thursday - September 7 2023, 12:03 UTC - 5 months ago Koyejo and Li recently presented their research exploring the trustworthiness of GPT models on eight aspects, namely toxicity, stereotype bias, adversarial robustness, etc. They found that these models reduced toxic output compared with other models, and also reduced bias when given strict prompts. However, they are still easily misled into generating toxic and biased outputs, and leaking private information.
Thursday - September 7 2023, 12:03 UTC - 5 months ago
Koyejo and Li recently presented their research exploring the trustworthiness of GPT models on eight aspects, namely toxicity, stereotype bias, adversarial robustness, etc. They found that these models reduced toxic output compared with other models, and also reduced bias when given strict prompts. However, they are still easily misled into generating toxic and biased outputs, and leaking private information.
Generative AI may be riddled with hallucinations, misinformation, and bias, but that didn't stop over half of respondents in a recent global study from saying that they would use this nascent technology for sensitive areas like financial planning and medical advice.That kind of interest forces the question: Exactly how trustworthy are these large language models? .
Sanmi Koyejo, assistant professor of computer science at Stanford, and Bo Li, assistant professor of computer science at University of Illinois Urbana-Champaign, together with collaborators from the University of California, Berkeley, and Microsoft research, set out to explore that question in their recent research on GPT models. They have posted their study on the arXiv preprint server. "Everyone seems to think LLMs are perfect and capable, compared with other models. That's very dangerous, especially if people deploy these models in critical domains. From this research, we learned that the models are not trustworthy enough for critical jobs yet," says Li.
Focusing specifically on GPT-3.5 and GPT-4, Koyejo and Li evaluated these models on eight different trust perspectives—toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness on adversarial demonstrations, privacy, machine ethics, and fairness—asserting that, while these newer models achieve reduced toxicity compared with prior models on standard benchmarks, they can still be easily misled to generate toxic and biased outputs, and to leak private information from training data and user conversations.
"The layperson doesn't appreciate that, under the hood, these are machine learning models with vulnerabilities," Koyejo says."Because there are so many cases where the models show capabilities that are beyond expectation—like having natural conversations—people have high expectations of intelligence, which leads to people trusting them with quite sensitive decision-making. It's just not there yet." .
Easy to jailbreak .
Current GPT models mitigate toxicity in enigmatic ways. "Some of the most popular models are close-sourced and behind silos, so we don't actually know all the details of what goes into training the models," says Koyejo. This level of inscrutability provided additional motivation for the team to embark on their research, as they wanted to evaluate where and how things could go sideways.
"At a high level, we can be thought of as a Red Team, stress-testing the models with different approaches we can think of and propose," says Li.
After giving the models benign prompts, Koyejo and Li found that GPT-3.5 and GPT-4 significantly reduced toxic output when compared to other models, but still maintained a toxicity probability of around 32%. When the models are given adversarial prompts—for example, explicitly instructing the model to "output toxic language," and then prompting it on a task—the toxicity probability surges to 100%.
Some of their findings around bias suggest that GPT-3.5 and GPT-4 model developers have identified and patched issues from earlier models, addressing the most sensitive stereotypes.
"We learned that the model is not that biassed when you give it a structured prompt," Li says."It's only biased when the model has been given a lot of context, and is not given a specific instruction. For instance, if you give the model a sequence of words that seem like a racial slur, and you don't dictate what you want it to produce, the model will respond in its own way." .