ToxicChat: A New Benchmark for Detecting Harmful Chatbot Prompts

Category Science

Monday - March 11 2024, 21:35 UTC - 1 year ago

tldr #

ToxicChat is a new benchmark developed by UC San Diego to detect harmful prompts for chatbots. It outperforms other existing data-to-text baselines and has already been downloaded over 12 thousand times. The team behind it hopes to create a safer and more reliable environment for human-AI interactions by identifying and addressing toxic language patterns in chatbots.

content #

A new benchmark developed by University of California San Diego computer scientists called ToxicChat aims to address the issue of toxicity in AI-powered chatbots. Unlike previous benchmarks that rely on social media data, ToxicChat is based on real-world interactions between users and an AI-powered chatbot named Vicuna.

The prompt that the benchmark was designed to detect is a toxic language pattern cloaked in seemingly harmless language. For example, someone could ask a large language model to pretend to be a specific person and say whatever they want, including swearing and cursing. While this may seem harmless, it can lead to harmful and offensive responses from the chatbot.

The University of California San Diego developed a new benchmark called ToxicChat to detect harmful prompts for chatbots.

ToxicChat is now part of the tools used by Meta to evaluate their safeguard model, Llama Guard, which is designed for human-AI conversation use cases. It has already been downloaded over 12 thousand times since its release and has been found to outperform other existing data-to-text baselines.

The research team from the Department of Computer Science and Engineering at UC San Diego presented their findings at the 2023 Conference on Empirical Methods in Natural Language Processing. They found that even the most powerful chatbot models, such as ChatGPT, can still produce inappropriate responses, despite efforts to train them to avoid toxic language.

ToxicChat is based on real-world interactions between users and an AI-powered chatbot named Vicuna.

ToxicChat is based on a dataset of 10,165 examples taken from interactions with Vicuna, where user identities were removed. The team then manually classified and crowd-sourced the prompts to train machine learning models for automatic detection of toxic prompts. The resulting benchmark is now publicly available and includes performance results from state-of-the-art large language models and content moderation models.

The benchmark has been downloaded over 12 thousand times since its release and is now used by Meta to evaluate their safeguard model, Llama Guard.

The researchers behind ToxicChat hope that it will help developers create a safer and more reliable environment for human-AI interactions by detecting and blocking toxic prompts from chatbots. They believe that by understanding and addressing these issues, chatbots can become more inclusive and beneficial for all users.

hashtags #

ai chatbots toxicity machinelearning naturallanguageprocessing safety inclusivity

worddensity #

language (6, 1.92%)
toxicchat (5, 1.6%)
models (4, 1.28%)
benchmark (3, 0.96%)
chatbots (3, 0.96%)