Can Deepfake Tech Train Computer Vision AIs?

Generative adversarial networks make phony photorealistic images—and now synthetic data, too

3 min read
iris of eye made out of different colored lines
iStockphoto

A recent survey revealed that 99 percent of computer-vision engineers have had a machine-learning project canceled outright because of insufficient training data. Researchers are increasingly turning to synthetic data to fill that gap. Gartner, in fact, forecasts that by 2024 more than half of the data used for AI and analytics projects will be synthetically generated.

A research group from the Massachusetts Institute of Technology has explored whether, given that generative models are now capable of producing highly realistic images, they can replace data sets to train computer-vision AI models. In their paper presented at the 2022 International Conference on Learning Representations, the MIT researchers asked whether, “given access only to a trained generative model, and no access to the dataset that trained it, can we learn effective visual representations?”

As generative models, specifically generative adversarial networks (GANs), get better and better, researchers are indeed getting performance close to that of real data for learning representations, says Ali Jahanian, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), and one of the authors of the paper.

GANs are deep-learning models that use two neural networks working against each other—a generator that creates synthetic data, and a discriminator that distinguishes between real and synthetic data—to generate synthetic images that are almost indistinguishable from real ones. GANs are popular for generating images and videos, including deepfakes.
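The adversarial objective described above can be illustrated with a toy numpy sketch. This is not code from the MIT paper or any real GAN library; the linear "generator," logistic "discriminator," and all weights here are hypothetical stand-ins, and no training step is shown, only the two opposing losses:

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(z, w):
    # toy generator: a linear map from latent noise to "data" space
    return z @ w

def discriminator(x, v):
    # toy discriminator: logistic score = probability the input is real
    return 1.0 / (1.0 + np.exp(-(x @ v)))

# hypothetical shapes: 8-dim noise -> 16-dim samples
w = rng.normal(size=(8, 16)) * 0.1   # generator weights
v = rng.normal(size=(16,)) * 0.1     # discriminator weights

real = rng.normal(size=(32, 16))     # stand-in for a batch of real images
z = rng.normal(size=(32, 8))         # latent noise
fake = generator(z, w)

d_real = discriminator(real, v)
d_fake = discriminator(fake, v)

# the discriminator wants d_real -> 1 and d_fake -> 0 ...
d_loss = -np.mean(np.log(d_real + 1e-8) + np.log(1.0 - d_fake + 1e-8))
# ... while the generator wants the discriminator fooled: d_fake -> 1
g_loss = -np.mean(np.log(d_fake + 1e-8))
```

In a real GAN, gradient updates alternately lower `d_loss` (sharpening the discriminator) and `g_loss` (making the fakes harder to catch), which is the "working against each other" dynamic the article describes.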

The research used these pretrained models for multiview representation learning. “Their method uses a contrastive objective that pushes the model to produce a representation that’s similar for different views of the same object,” says Anna Rumshisky, a computer scientist at the University of Massachusetts Lowell. Jahanian and his colleagues’ study is a valuable proof of concept showing that applying contrastive methods to synthetic data allows the model to learn well without having the data set, she adds.
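A contrastive objective of the kind Rumshisky describes can be sketched in a few lines of numpy. This is a generic InfoNCE-style illustration, not the paper's actual loss: the embeddings, temperature value, and the way the second "view" is perturbed are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(x):
    # project embeddings onto the unit sphere so dot products are cosines
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# hypothetical embeddings for two "views" of the same 4 objects
# (in this setting, views could come from steering a GAN's latent space)
view_a = normalize(rng.normal(size=(4, 32)))
view_b = normalize(view_a + 0.05 * rng.normal(size=(4, 32)))

tau = 0.1                              # temperature (assumed value)
logits = view_a @ view_b.T / tau       # pairwise cosine similarities
# row i's positive pair is column i; every other column is a negative
log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))    # low when matching views agree
```

Minimizing this loss pulls the two views of each object together in embedding space while pushing apart views of different objects, which is what lets the model learn useful representations without labels.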

On some specific tasks, GAN-generated data can even outperform real data in downstream transfer-learning tasks. Nvidia’s StyleGAN, for example, can disentangle color from the objects it renders and can even rotate those objects. “Because it is not only learning the data,” Jahanian says, “but also the transformations.”
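The latent-space transformations Jahanian alludes to can be pictured with a toy sketch. The stand-in generator and steering direction below are hypothetical, not StyleGAN itself; the point is only the mechanism, in which walking along one latent direction yields a family of transformed views of the same underlying sample:

```python
import numpy as np

rng = np.random.default_rng(2)

def toy_generator(z, w):
    # stand-in for a pretrained generator such as StyleGAN (random weights)
    return np.tanh(z @ w)

w = rng.normal(size=(8, 16)) * 0.5
z = rng.normal(size=(8,))                 # one latent code = one "object"
direction = rng.normal(size=(8,))         # a learned steering direction
direction /= np.linalg.norm(direction)

# moving the latent code along the direction transforms the output
# (in StyleGAN such directions can correspond to color, pose, etc.)
views = [toy_generator(z + alpha * direction, w)
         for alpha in (-1.0, 0.0, 1.0)]
```

These steered outputs are exactly the kind of multiple views that a contrastive objective can then pull together during representation learning.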

MIT researchers have demonstrated the use of a generative machine-learning model to create synthetic data, based on real data, that can be used to train another model for image classification. This image shows examples of the generative model’s transformation methods.
MIT

Synthetic data has some advantages over traditional data sets, the researchers note. For instance, not all research teams have the resources to access high-volume, high-quality data, especially when it could involve sensitive information, such as personal data. In that case, pretrained generative models can be more efficient and accessible. Synthetic data is also useful for simulating conditions that may not (yet) exist. These data sets can be edited as well, potentially at scale, allowing researchers to, for example, remove biases that exist in real data sets.

Developing methods to control what the model is generating is an active research area, says Rumshisky. Speaking from the point of view of natural language processing, which is her area of expertise, she adds, “There’s been quite a few efforts in the past three years to come up with models that allow you to control what’s getting generated…to push them all towards generating certain kinds of keywords, or certain kinds of attributes.” With a mechanism to control the attributes of the text the model generates, she explains, you could perhaps push it away from generating certain kinds of data, such as personal details like names, Social Security numbers, and phone numbers.

The real test is whether ML models trained on synthetic data perform on a par with those trained on real data. That is still a work in progress. “I am using these models on data that, for instance, comes from self-driving vehicle data,” Jahanian says. “Eventually we want the AI agent to be able to go to the real world and…solve [real] tasks.”

There is still a long road ahead for generative data sets, Jahanian says. “A good question that was asked recently [was]: If these [generative] models are already biased, would we [get] biased representations? And I think that one future work would be…[to look] at the representations and see to what extent they are biased and how can we unbias them.”

How realistic synthetic data sets can be made is a further concern, Jahanian adds, along with whether they might mislead a neural net into normalizing something that isn’t normal. “With real data, you probably cannot generate a view where a person is running with their dog in the middle of highway,” he says. “But to what extent can you do that using a generative model if it is not already seen in the underlying data? So this is a question that I’m trying to work on.”


Will AI Steal Submarines’ Stealth?

Better detection will make the oceans transparent—and perhaps doom mutually assured destruction

11 min read
A photo of a submarine in the water under a partly cloudy sky.

The Virginia-class fast attack submarine USS Virginia cruises through the Mediterranean in 2010. Back then, it could effectively disappear just by diving.

U.S. Navy

Submarines are valued primarily for their ability to hide. The assurance that submarines would likely survive the first missile strike in a nuclear war and thus be able to respond by launching missiles in a second strike is key to the strategy of deterrence known as mutually assured destruction. Any new technology that might render the oceans effectively transparent, making it trivial to spot lurking submarines, could thus undermine the peace of the world. For nearly a century, naval engineers have striven to develop ever-faster, ever-quieter submarines. But they have worked just as hard at advancing a wide array of radar, sonar, and other technologies designed to detect, target, and eliminate enemy submarines.

The balance seemed to turn with the emergence of nuclear-powered submarines in the early 1960s. In a 2015 study for the Center for Strategic and Budgetary Assessments, Bryan Clark, a naval specialist now at the Hudson Institute, noted that the ability of these boats to remain submerged for long periods made them “nearly impossible to find with radar and active sonar.” But even these stealthy submarines produce subtle, very-low-frequency noises that can be picked up from far away by networks of hydrophone arrays mounted on the seafloor.
