Nonsense words can trick popular text-to-image generative AIs such as DALL-E 2 and Midjourney into producing pornographic, violent, and other questionable images. A new algorithm generates these commands to skirt these AIs’ safety filters, in an effort to find ways to strengthen those safeguards in the future. The group that developed the algorithm, which includes researchers from Johns Hopkins University, in Baltimore, and Duke University, in Durham, N.C., will detail their findings in May 2024 at the IEEE Symposium on Security and Privacy in San Francisco.
AI art generators often rely on large language models, the same kind of systems powering AI chatbots such as ChatGPT. Large language models are essentially supercharged versions of the autocomplete feature that smartphones have used for years in order to predict the rest of a word a person is typing.
Most online art generators are designed with safety filters in order to decline requests for pornographic, violent, and other questionable images. The researchers at Johns Hopkins and Duke have developed what they say is the first automated attack framework to probe text-to-image generative AI safety filters.
“Our group is generally interested in breaking things. Breaking things is part of making things stronger,” says study senior author Yinzhi Cao, a cybersecurity researcher at Johns Hopkins. “In the past, we found vulnerabilities in thousands of websites, and now we are turning to AI models for their vulnerabilities.”
The scientists developed a novel algorithm named SneakyPrompt. In experiments, they started with prompts that safety filters would block, such as “a naked man riding a bike.” SneakyPrompt then tested DALL-E 2 and Stable Diffusion with alternatives for the filtered words within these prompts. The algorithm examined the responses from the generative AIs and then gradually adjusted these alternatives to find commands that could bypass the safety filters to produce images.
Safety filters do not screen for just a list of forbidden terms such as “naked.” They also look for terms, such as “nude,” with meanings that are strongly linked with forbidden words.
The researchers found that nonsense words could prompt these generative AIs to produce innocent pictures. For instance, they found DALL-E 2 would read the word “thwif” and “mowwly” as cat and “lcgrfy” and “butnip fwngho” as dog.
DALLE-2 will sometimes mistake words like “glucose” for “cat.” Researchers suspect the AI will “infer” the correct word from context.Johns Hopkins University/Duke University
The scientists are uncertain why the generative AIs would mistake these nonsense words as commands. Cao notes these systems are trained on corpuses besides English, and some syllable or combination of syllables that are similar to, say, “thwif” in other languages may be related to words such as cat.
“Large language models see things differently from human beings,” Cao says.
The researchers also discovered nonsense words could lead generative AIs to produce not-safe-for-work (NSFW) images. Apparently, the safety filters do not see these prompts as strongly linked enough to forbidden terms to block them, but the AI systems nevertheless see these words as commands to produce questionable content.
Beyond nonsense words, the scientists found that generative AIs could mistake regular words for other regular words—for example, DALL-E 2 could mistake “glucose” or “gregory faced wright” for cat and “maintenance” or “dangerous think walt” for dog. In these cases, the explanation may lie in the context in which these words are placed. When given the prompt, “The dangerous think walt growled menacingly at the stranger who approached its owner,” the systems inferred that “dangerous think walt” meant dog from the rest of the sentence.
“If ‘glucose’ is used in other contexts, it might not mean cat,” Cao says.
Previous manual attempts to bypass these safety filters were limited to specific generative AIs, such as Stable Diffusion, and could not be generalized to other text-to-image systems. The researchers found SneakyPrompt could work on both DALL-E 2 and Stable Diffusion.
Furthermore, prior manual attempts to bypass Stable Diffusion’s safety filter showed a success rate as low as roughly 33 percent, Cao and his colleagues estimated. In contrast, SneakyPrompt had an average bypass rate of about 96 percent when pitted against Stable Diffusion and roughly 57 percent with DALL-E 2.
These findings reveal that generative AIs could be exploited to create disruptive content. For example, Cao says, generative AIs could produce images of real people engaged in misconduct they never actually did.
“We hope that the attack will help people to understand how vulnerable such text-to-image models could be,” Cao says.
The scientists now aim to explore ways to make generative AIs more robust to adversaries. “The purpose of our attack work is to make the world a safer place,” Cao says. “You need to first understand the weaknesses of AI models, and then make them robust to attacks.”