AI Spam Threatens the Internet—AI Can Also Protect It

Better and more effective AI detection techniques are on the horizon


2023 wasn’t a great year for AI detectors. Leaders like GPTZero surged in popularity but faced a backlash as false positives led to incorrect accusations. Then OpenAI quietly tossed ice-cold water on the idea with an FAQ to answer whether AI detectors work. The verdict? “No, not in our experience.”

OpenAI’s conclusion was correct—at the time. Yet the demise of AI detectors is greatly exaggerated. Researchers are inventing new detectors that perform better than their predecessors and can operate at scale. And these come alongside “data poisoning” attacks that individuals can use to safeguard their work from being scooped up against their wishes to train AI models.

“Language model detection can be done with a high enough level of accuracy to be useful, and it can also be done in the ‘zero shot’ sense, meaning you can detect all sorts of different language models at the same time,” says Tom Goldstein, a professor of computer science at the University of Maryland. “It’s a real counterpoint to the narrative that language model detection is basically impossible.”

Using AI to detect AI

Goldstein coauthored a paper recently uploaded to the arXiv preprint server that describes “Binoculars,” a detector that pairs an AI detective with a helpful sidekick.

Early AI detectors played at detective by asking a simple question: How surprising is this text? The assumption was that statistically less surprising text is more likely to be AI-generated. It’s an LLM’s mission to predict the “correct” word at each point in a string of text, which should lead to patterns a detector can pick up. Most detectors answered by giving users a numerical probability that the submitted text was AI-generated.
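The “surprise” score these early detectors computed is essentially perplexity: the exponential of the average negative log-probability a model assigns to each token. A minimal sketch of that scoring idea, using made-up per-token probabilities rather than a real model’s outputs:

```python
import math

def perplexity(token_probs):
    # Perplexity is the exponential of the average negative log-probability
    # the model assigned to each token; lower means "less surprising."
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Hypothetical per-token probabilities a language model might assign:
formulaic = [0.9, 0.8, 0.85, 0.9]  # predictable, AI-like phrasing
unusual = [0.2, 0.1, 0.3, 0.15]    # surprising word choices

print(perplexity(formulaic) < perplexity(unusual))  # True
```

In a real detector, the probabilities would come from running a language model over the text; the threshold separating “human” from “AI” is what the approach gets wrong in practice.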

But that approach is flawed. AI-generated text can still be surprising if it was generated in response to a surprising prompt, which the detector has no way to deduce. The opposite is true as well: humans may write unsurprising text when covering a well-worn topic.


Binoculars asks its AI detective (in this case Falcon-7B, an open-source large language model) the same question as previous detectors, but also asks an AI sidekick to do the same work. The results are compared to calculate how much the sidekick surprised the detective, creating a benchmark for comparison. Text written by a human should prove more surprising to the detective than the AI sidekick.
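In effect, Binoculars scores text by dividing the detective’s surprise at the text itself by its surprise at the sidekick’s predictions. A toy sketch of that ratio, with hypothetical numbers standing in for real model outputs:

```python
def binoculars_score(text_nll, cross_nll):
    # text_nll: the detective model's average surprise (negative
    # log-likelihood) at the text itself.
    # cross_nll: the detective's average surprise at the sidekick model's
    # next-token predictions for the same text.
    # Scores near 1 point to a human author; much lower scores point to
    # machine text that both models find predictable.
    return text_nll / cross_nll

# Hypothetical values, not outputs from the actual Binoculars models:
ai_like = binoculars_score(text_nll=1.2, cross_nll=3.0)     # 0.4
human_like = binoculars_score(text_nll=2.9, cross_nll=3.0)  # ~0.97
```

The real system computes these quantities by running two related language models over the text; the ratio is what lets it cancel out how intrinsically surprising a given topic or prompt happens to be.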

There are gaps in what Binoculars can see. Vinu Sankar Sadasivan, a graduate student in the University of Maryland’s computer science department and a coauthor on another preprint paper evaluating a variety of LLM detection techniques, says that Binoculars “significantly improves the performance of zero-shot detection, but it’s still not better than watermarking or retrieval-based methods in terms of accuracy.” Binoculars is also still being peer reviewed; Avi Schwarzschild, a coauthor on the Binoculars paper, says the goal is to present at a leading AI conference.

However, Goldstein contends that accuracy isn’t Binoculars’ secret sauce. He believes its real advantage lies in the ability to reveal AI text with fewer false positives.

“People tend to dwell a lot on accuracy, but this is a mistake,” says Goldstein. “If a detector mistakenly says human-written text was written by a language model…that can lead to false accusations. But if you make a mistake and say AI text is human, it’s not so bad.”

That might feel counterintuitive. AI can generate text at incredible scale, so even a near-perfect detector could let a lot of AI-generated text slip through.

But detectors will only prove useful to companies, governments, and educational institutions if they create fewer headaches than they solve, and false positives cause many headaches. Goldstein points out that even a false-positive rate in the single digits will, at the scale of a modern social network, make tens of thousands of false accusations each day, eroding faith in the detector itself.
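Goldstein’s point is simple arithmetic. With illustrative numbers (not real platform statistics):

```python
# Back-of-the-envelope math for false positives at social-network scale.
# The post volume below is an illustrative assumption, not a real figure.
human_posts_per_day = 5_000_000
false_positive_rate = 0.01  # a "single digit" rate of 1 percent

false_accusations = int(human_posts_per_day * false_positive_rate)
print(false_accusations)  # 50000 human posts wrongly flagged every day
```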

Deepfakes fool people, but not detectors

Of course, AI-generated text is just one front in this fight. AI-generated images are of equal concern and, in recent studies, have proven they can fool humans.

A preprint paper from researchers at Ruhr University Bochum and the Helmholtz Center for Information Security, both in Germany, found people can’t reliably separate AI-generated images of human faces from real photos. Another from researchers at Indiana University and Northeastern University in Boston estimated that between 8,500 and 17,500 daily active accounts on social-media platform X (formerly Twitter) use an AI-generated profile picture.

It gets worse. These findings focus on generative adversarial networks (GANs), an older class of AI image model that is better understood and known to cause identifiable image artifacts. But the state-of-the-art image generators currently sweeping the Internet, such as Stability AI’s Stable Diffusion, instead use diffusion probabilistic models, which learn to convert random noise into an image likely to mimic what a user wanted. Diffusion models significantly outperform their GAN predecessors.

But even so, it turns out that diffusion isn’t as sly as feared. A preprint paper to be presented at the International Conference on Computer Vision Theory and Applications (VISAPP 2024), being held in Rome from 27 to 29 February, found that detectors trained on GANs can, with some tweaks, also detect diffusion models.

“We found that the detectors that we already had, that were trained on GANs, failed to detect diffusion,” says paper coauthor Jonas Ricker, a graduate student at Ruhr University Bochum. “But it’s not undetectable. If we update those detectors, we are still able to detect [images generated by diffusion models].” This suggests that earlier AI image detectors don’t need to be reinvented from scratch to detect the latest models.

Accuracy drops compared with GAN detection, which in some cases reaches 100 percent, but it often remains above 90 percent for diffusion models (the exact results depend on the diffusion model and detector used). The tweaked detectors also remain capable of spotting GAN-generated images, which makes them useful in many situations.

But what, exactly, do the updated detectors detect? The answer, as is so often true of modern AI models, is a bit mysterious. “It’s not super clear which artifacts are detectable,” says Simon Damm, another graduate student at Ruhr University Bochum and one of the VISAPP 2024 paper’s coauthors. “The performance shows they are for sure detectable…but the interpretability isn’t there.”

Data poison degrades spam

The latest AI detectors are promising, but detection is an inherently defensive approach. Some researchers are investigating proactive tactics like data poisoning, which disrupts an AI model at training.

Perhaps the strongest example is Nightshade, a technique invented by researchers at the University of Chicago. Nightshade is a prompt-specific data-poisoning attack built to degrade diffusion models. It laces an image with subtle changes to pixels, most of which aren’t visible to the human eye. However, an AI model trained on these images will learn incorrect associations. It might learn that a car looks like a cow or that a hat looks like a toaster.
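The key constraint is that each pixel change stays within a tight budget so humans don’t notice. Here is a toy illustration of that “imperceptible perturbation” idea, not Nightshade’s actual algorithm (which derives the perturbation by optimizing against a diffusion model’s feature space):

```python
def apply_poison(pixels, perturbation, eps=4):
    # Clamp each change to +/- eps so the edit stays (near-)invisible,
    # then keep values in the valid 8-bit range. Computing a perturbation
    # that actually shifts a concept is the hard part, done by Nightshade's
    # optimization and not shown here.
    poisoned = []
    for value, delta in zip(pixels, perturbation):
        delta = max(-eps, min(eps, delta))           # imperceptibility budget
        poisoned.append(max(0, min(255, value + delta)))  # valid pixel range
    return poisoned

original = [120, 121, 119, 200]
print(apply_poison(original, [3, -9, 2, 40]))
# [123, 117, 121, 204] -- every pixel moved by at most 4 levels
```

A model trained on enough such images, paired with mismatched captions the attack targets, is what learns the wrong associations the researchers describe.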

“The simplest way to describe it is a small poison pill you can put in your own art,” says Ben Y. Zhao, a professor of computer science at the University of Chicago and one of the researchers who developed Nightshade. “If it’s downloaded against your wishes [and used for training an AI model], it can have a negative effect on the model.”

Critically, Nightshade can degrade a model even when a slim slice of the training data is poisoned. As few as 100 poisoned samples can successfully attack specific concepts (like “dog” or “cat”) in the latest models, including Stable Diffusion XL. The effects compound over multiple attacks, bleed into related concepts, and degrade a wide variety of models.


How that will translate to the real world remains murky. The researchers had just one Nvidia A100 GPU for training—far less than the dozens or hundreds used to train the latest models—so they were limited to training on smaller datasets than those used by OpenAI or Stability AI. Nightshade is still under peer review, though the team behind it intends to present it at a conference in May. “There’s still unknown factors that are hard to control,” says Zhao. “But at a high level, we should see the effect.”

Zhao hopes that Nightshade, which is available for anyone to download and use for free, will prove a more effective tool than opt-out schemes and no-crawl requests. Compliance with such requests is voluntary; while many large AI companies and organizations have pledged to respect them, doing so isn’t legally required and is difficult to verify. Data poisoning doesn’t require cooperation, instead offering protection that degrades any image model trained on the poisoned data.

“We’re not trying to break models out of spite,” says Zhao. “We’re trying to provide a tool for content owners to discourage unauthorized scraping for AI training. This is the way to push back, to provide a real disincentive to model trainers.”
