Synthetic Art Could Help AI Systems Learn

Using AI to craft more illustrative training images builds keener neural nets


Illustration of a brain linked to a dozen different images. Alex Shipps/MIT CSAIL/Midjourney

AI systems may perform better at identifying pictures when they are trained on AI-generated art that has been tailored to “teach” concepts behind the images, a new study finds.

Artificial intelligence systems, such as those used to recognize faces, are usually trained on data collected from the real world. In the 1990s, researchers manually captured photographs to build image collections, while in the 2000s, they started trawling the Internet for data.

However, raw data often includes significant gaps and other problems that, if not accounted for, can lead to major blunders. For example, commercial facial-recognition systems were frequently trained on image databases in which most faces were light-skinned, which meant they often fared poorly at recognizing dark-skinned people. Curating databases to address these kinds of flaws is a costly, laborious affair.

“Our study is among the first to demonstrate that using solely synthetic images can potentially lead to better representation learning than using real data.”
—Lijie Fan, MIT

Scientists had previously suggested that AI art generators might help avoid the problems that come with collecting and curating real-world images. Now researchers find that image-recognition AI systems trained on synthetic art can actually perform better than ones fed pictures of the real world.

Text-to-image generative AI products such as DALL-E 2, Stable Diffusion, and Midjourney now regularly conjure images based on text descriptions. “These models are now capable of generating photorealistic images of exceptionally high quality,” says study co-lead author Lijie Fan, a computer scientist at MIT. “Moreover, these models offer significant control over the content of the generated images. This capability enables the creation of a wide variety of images depicting similar concepts, offering the flexibility to tailor the dataset to suit specific tasks.”

In the new study, researchers developed an AI learning strategy dubbed StableRep. This new technique fed AIs pictures that Stable Diffusion generated when given text captions from image databases such as RedCaps.
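
For readers who want a concrete picture of the generation step, here is a minimal sketch using the open-source Hugging Face diffusers library. The model checkpoint, caption, and sampling settings below are illustrative assumptions, not details from the study.

```python
# Minimal sketch of StableRep's data-generation step using the open-source
# Hugging Face "diffusers" library. The checkpoint and settings here are
# illustrative assumptions, not the study's exact configuration.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# A caption of the kind found in databases such as RedCaps.
caption = "a golden retriever catching a frisbee in the park"

# Several images from one caption: StableRep later treats these as
# multiple views of the same underlying concept.
images = pipe(caption, num_images_per_prompt=4, guidance_scale=7.5).images
```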

The scientists had StableRep produce multiple pictures from the same text prompt, and then had the AI system treat those images as depictions of the same underlying theme. The aim of this strategy was to help the neural network learn more about the concepts behind the images.
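
In code, that idea corresponds to a multi-positive contrastive loss, in which every image generated from the same caption counts as a positive match for the others. The sketch below is an illustrative reconstruction of such a loss, not the authors’ exact implementation.

```python
# Illustrative reconstruction of a multi-positive contrastive loss, where
# images generated from the same caption are positives for one another.
# Not the authors' exact implementation.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """embeddings: (N, D) image features; caption_ids: (N,) prompt labels."""
    z = F.normalize(embeddings, dim=1)
    logits = z @ z.T / temperature                    # pairwise similarities
    n = z.size(0)
    diag = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = logits.masked_fill(diag, float("-inf"))  # ignore self-similarity
    # Positives: pairs of images generated from the same caption.
    positives = (caption_ids.unsqueeze(0) == caption_ids.unsqueeze(1)) & ~diag
    log_prob = F.log_softmax(logits, dim=1)
    # Average log-probability over each image's positives.
    loss = -(log_prob * positives).sum(1) / positives.sum(1).clamp(min=1)
    return loss.mean()
```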

In addition, the researchers developed StableRep+, an enhanced strategy that trained not only on synthetic pictures but also on the text of their captions. In experiments, StableRep+ reached 73.5 percent accuracy when trained with 10 million synthetic images. In contrast, the AI system CLIP reached 72.9 percent accuracy when trained on 50 million real images and captions. In other words, StableRep+ matched, and even slightly exceeded, CLIP’s accuracy with a training set one-fifth the size.
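
The language-supervision half of StableRep+ can be pictured as a CLIP-style contrastive term that pairs each synthetic image with its caption, combined with the image-only loss above. The following sketch assumes one caption per image in the batch; the paper’s exact formulation and weighting may differ.

```python
# Hedged sketch of a CLIP-style image-text contrastive term of the kind
# StableRep+ adds on top of the image-only objective. Assumes each batch
# row pairs one image embedding with its own caption embedding.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.1):
    zi = F.normalize(img_emb, dim=1)   # (N, D) image embeddings
    zt = F.normalize(txt_emb, dim=1)   # (N, D) caption embeddings
    logits = zi @ zt.T / temperature   # matched pairs lie on the diagonal
    targets = torch.arange(zi.size(0), device=zi.device)
    # Symmetric cross-entropy: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```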

“Our study is among the first to demonstrate that using solely synthetic images can potentially lead to better representation learning than using real data in a large-scale setting,” Fan says.

It remains uncertain why AIs perform better when they learn from synthetic images rather than real pictures. The researchers suggest one possibility is that AI art generators may provide a greater degree of control over the training data. Another is that generative AI may be able to generalize beyond the raw data it learns from to produce a richer training set than real data.

The researchers caution that this approach faces several potential concerns. For instance, AI art generators are often trained on uncurated data, which may be loaded with hidden biases and other problems. In addition, these systems often fail to properly attribute the sources of their training data, which has led to copyright disputes and other legal battles.

Furthermore, AI art generators remain relatively slow, taking 0.8 to 2.2 seconds per image, which currently limits how well StableRep can scale. The researchers also note that StableRep’s images do not always match the intent of the text prompts they are given, which could affect the overall quality and usefulness of the synthetic data. Moreover, text prompts themselves run the risk of introducing bias, and so require mindful design.

In the future, in addition to addressing the above concerns, the scientists would like to further scale up the size of synthetic image datasets. “Our current study is conducted on datasets in the tens of millions range, while the largest image-text datasets available now contain billions of samples,” Fan says. “Exploring how synthetic data performs at these larger scales presents an intriguing opportunity for further research.”

All in all, “synthetic data is getting popular in various domains, including medical image analysis, robotics, and so on,” Fan says. The strategy the researchers developed in this study may hold “significant potential for application in these areas as well. Our method could be particularly beneficial in environments where acquiring large volumes of real data is challenging or impractical.”

The scientists will detail their findings on 14 December at the Conference on Neural Information Processing Systems (NeurIPS) in New Orleans.
