Unreliable Watermarks: The Weak Link in AI-generated Text Protection

Category Artificial Intelligence

tldr #

Researchers have found that the watermarks used to detect AI-generated text are easily defeated: attackers can reverse-engineer a watermark to forge convincingly watermarked text, or strip the watermark entirely. Of five types of watermarks tested, four were successfully spoofed and three were successfully stripped. The findings highlight the importance of combining watermarks with other methods, such as content analysis and human fact-checking, to combat the spread of AI-generated misinformation and plagiarism.


content #

Watermarks for AI-generated text are easy to remove and can be stolen and copied, rendering them useless, researchers have found. They say these kinds of attacks discredit watermarks and can fool people into trusting text they shouldn’t. Watermarking works by inserting hidden patterns in AI-generated text, which allow computers to detect that the text comes from an AI system. They’re a fairly new invention, but they have already become a popular solution for fighting AI-generated misinformation and plagiarism.

Watermarks are used to detect if text is generated by AI.

For example, the European Union’s AI Act, which enters into force in May, will require developers to watermark AI-generated content. But the new research shows that the cutting edge of watermarking technology doesn’t live up to regulators’ requirements, says Robin Staab, a PhD student at ETH Zürich, who was part of the team that developed the attacks. The research is yet to be peer reviewed, but will be presented at the International Conference on Learning Representations in May.

Watermarks can be defeated by using an API to access the AI model.

AI language models work by predicting the next likely word in a sentence, generating one word at a time on the basis of those predictions. Watermarking algorithms for text divide the language model’s vocabulary into words on a "green list" and a "red list," and then make the AI model choose words from the green list. The more words in a sentence that are from the green list, the more likely it is that the text was generated by a computer.
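The green-list detection idea described above can be sketched in a few lines of Python. This is purely illustrative: the secret key, the hash-based 50/50 split of the vocabulary, and the detection logic are assumptions for the sketch, not the actual schemes the researchers tested.

```python
import hashlib

def is_green(word: str, secret_key: str = "secret-key") -> bool:
    """Hypothetical green-list test: hash each word together with a
    secret key, putting roughly half the vocabulary on the green list."""
    digest = hashlib.sha256((secret_key + word.lower()).encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    """Fraction of words drawn from the green list. Human text should
    score near 0.5; watermarked model output should score well above it."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(is_green(w) for w in words) / len(words)
```

A detector would flag text whose green fraction is statistically too high to be a human's "random mix of words" — the threshold depends on text length and the green/red split.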

Hackers can steal watermarks and use them to create fake watermarked text.

Humans tend to write sentences that include a more random mix of words. The researchers attacked five different watermarks that work in this way. They were able to reverse-engineer the watermarks by using an API to access the AI model with the watermark applied and prompting it many times, says Staab. The responses allow the attacker to "steal" the watermark by building an approximate model of the watermarking rules.
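One way the querying step could work is simple frequency analysis: words that the watermarked model emits far more often than ordinary text would suggest are probably green-listed. The sketch below is an assumption-laden toy, not the team's method; the smoothing constant and `boost` threshold are invented for illustration.

```python
from collections import Counter

def estimate_green_list(watermarked_samples, reference_samples, boost=1.5):
    """Toy watermark-stealing step: compare word rates in many
    watermarked outputs against a reference corpus of normal text, and
    flag words that are suspiciously over-represented as likely green."""
    wm_counts = Counter(w for s in watermarked_samples for w in s.lower().split())
    ref_counts = Counter(w for s in reference_samples for w in s.lower().split())
    wm_total = sum(wm_counts.values())
    ref_total = sum(ref_counts.values())
    green = set()
    for word, count in wm_counts.items():
        wm_rate = count / wm_total
        # Smooth unseen words so a zero reference count doesn't divide by zero.
        ref_rate = ref_counts.get(word, 0.5) / ref_total
        if wm_rate / ref_rate > boost:
            green.add(word)
    return green
```

The more API responses an attacker collects, the better this approximation gets — which is why cheap, repeated querying is enough to undermine the scheme.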

Watermarked text can also be stripped of its watermark, making it seem like it was written by a human.

They do this by analyzing the AI outputs and comparing them with normal text. Once they have an approximate idea of which words are likely green-listed, the researchers can execute two kinds of attacks. The first, called a spoofing attack, lets malicious actors use the information learned from stealing the watermark to produce text that can be passed off as watermarked.
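A spoofing attack can be sketched as biasing word choice toward the stolen green list. The synonym table and the stolen green set below are hypothetical inputs, standing in for whatever approximation the attacker has built.

```python
def spoof_text(synonyms: dict, sentence: str, stolen_green: set) -> str:
    """Toy spoofing attack: rewrite a sentence the attacker wrote
    themselves, swapping words for green-listed synonyms so the text
    triggers the watermark detector despite never touching the model."""
    out = []
    for word in sentence.split():
        options = [word] + synonyms.get(word, [])
        greens = [w for w in options if w in stolen_green]
        out.append(greens[0] if greens else word)
    return " ".join(out)
```

Usage: with `stolen_green = {"rapid"}` and `synonyms = {"fast": ["rapid", "quick"]}`, the sentence "the fast fox" becomes "the rapid fox" — pushing its green fraction up toward what a detector expects of watermarked output.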

Out of five types of watermarks tested, four were successfully spoofed and three were successfully stripped.

The second attack lets hackers scrub the watermark from AI-generated text, so the text can be passed off as human-written. The team had a roughly 80% success rate in spoofing watermarks, and an 85% success rate in stripping AI-generated text of its watermark. Researchers not affiliated with the ETH Zürich team, such as Soheil Feizi, an associate professor and director of the Reliable AI Lab at the University of Maryland, have also found watermarks to be unreliable and vulnerable to spoofing attacks.
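Stripping is the mirror image of spoofing: rewrite green-listed words with non-green alternatives until the text's green fraction drops back into the human range. Again, the synonym table and stolen green set are illustrative assumptions.

```python
def strip_watermark(synonyms: dict, sentence: str, stolen_green: set) -> str:
    """Toy scrubbing attack: replace green-listed words in watermarked
    model output with non-green synonyms, so the text no longer scores
    as AI-generated under the green-fraction test."""
    out = []
    for word in sentence.split():
        if word in stolen_green:
            options = [w for w in synonyms.get(word, []) if w not in stolen_green]
            out.append(options[0] if options else word)
        else:
            out.append(word)
    return " ".join(out)
```

Because both attacks only need the approximate green list, one successful theft of the watermark enables forgery and removal alike.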

Experts caution against relying too heavily on watermarks for detecting AI-generated text.

The findings from ETH Zürich confirm that these issues with watermarks persist and extend to the most advanced types of chatbots and large language models being used today, says Feizi. The research "underscores the importance of exercising caution when deploying such detection mechanisms on a large scale," he says. Despite the findings, watermarks remain the most popular solution for detecting AI-generated text, but experts warn against relying solely on this method as it has proven to be unreliable and easily defeated .

Other techniques, such as content analysis and human fact-checking, should also be utilized to combat the spread of AI-generated misinformation and plagiarism.

