AI has already shown off the capability to create photorealistic images of cats, dogs, and people’s faces that never existed before. More recently, researchers have been investigating how to train AI models to create more complex images that could include many different objects arranged in different poses and configurations.
The challenge involves figuring out how to get AI models—in this case typically a class of deep learning algorithms known as generative adversarial networks (GANs)—to generate more controlled images based on certain conditions rather than simply spitting out any random image. A team at North Carolina State University has developed a way for GANs to create such conditional images more reliably by using reconfigurable image layouts as the starting point.
“We want a model that is flexible enough such that when the input layout is reconfigurable, then we can generate an image that can be consistent,” says Tianfu Wu, an assistant professor in the department of electrical and computer engineering at North Carolina State University in Raleigh.
This layout- and style-based architecture for GANs (nicknamed “LostGANs”) came out of research by both Wu and Wei Sun, a former Ph.D. student in the department of electrical and computer engineering at North Carolina State University who is currently a research scientist at Facebook. Their paper on this work was published last month in the journal IEEE Transactions on Pattern Analysis and Machine Intelligence.
The starting point for the LostGANs approach involves a simple reconfigurable layout that includes rectangular bounding boxes showing where a tree, road, bus, sky, or person should be within the overall image. Yet previous AI models have generally failed to create photorealistic and perfectly proportioned images when they tried to work directly from such layouts.
This is why Wu and Sun trained their AI model to use the bounding boxes in the layout as a starting point to first create “object masks” that look like silhouettes of each object. This intermediate “layout-to-mask” step allows the model to refine the general shape of each silhouette, which in turn helps produce a more realistic final “mask-to-image” result in which all the visual details have been filled in.
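To make the idea of a reconfigurable layout concrete, here is a minimal Python sketch of how bounding boxes might be turned into the crude rectangular masks that a model would then refine into silhouettes. It is an illustration only, not the researchers' code: the `Box` class, the `box_to_mask` and `layout_to_masks` helpers, the normalized coordinates, and the tiny 8 x 8 grid are all assumptions made for this example. The learned refinement and the final image synthesis are not shown.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Box:
    """One labeled bounding box; coordinates are fractions of the image size (assumed convention)."""
    label: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def box_to_mask(box, size):
    """Rasterize a box into a size x size grid of 0/1 values (1 = inside the box)."""
    x0 = int(round(box.x * size))
    y0 = int(round(box.y * size))
    x1 = int(round((box.x + box.w) * size))
    y1 = int(round((box.y + box.h) * size))
    return [
        [1 if (x0 <= col < x1 and y0 <= row < y1) else 0 for col in range(size)]
        for row in range(size)
    ]


def layout_to_masks(layout, size=8):
    """Return one coarse rectangular mask per object, in layout order."""
    return [(box.label, box_to_mask(box, size)) for box in layout]


def mask_area(mask):
    """Count the grid cells covered by a mask."""
    return sum(sum(row) for row in mask)


# A hypothetical street-scene layout.
layout = [
    Box("sky", 0.0, 0.0, 1.0, 0.5),
    Box("road", 0.0, 0.5, 1.0, 0.5),
    Box("bus", 0.25, 0.375, 0.5, 0.375),
]
masks = layout_to_masks(layout)

# "Reconfiguring" the layout: slide the bus to the left edge and rebuild the masks.
moved_layout = [replace(b, x=0.0) if b.label == "bus" else b for b in layout]
moved_masks = layout_to_masks(moved_layout)

for label, mask in masks:
    print(label, mask_area(mask))
```

In this sketch, moving a box changes only that object's mask, which is the property that makes a layout easy to edit by hand. A real system would replace each rectangle with a learned, object-shaped mask before filling in visual detail.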
The team’s approach also enables researchers to have the AI change the visual appearance of specific objects within the overall image layout based on reconfigurable “style codes.” For example, the AI can generate different versions of the same general wintry mountain landscape with people skiing by making specific style changes to the skiers’ clothing or even their body pose.
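One way to picture reconfigurable style codes is as a separate random vector attached to each object in the layout. The Python sketch below rests on that assumption and is not taken from the LostGANs source code: the object names, the 4-number code length, and the `sample_style_code` and `resample_style` helpers are all hypothetical. It shows only the bookkeeping, in which resampling one object's code leaves every other object's code untouched.

```python
import random


def sample_style_code(rng, dim=4):
    """Draw a random style vector from a standard normal distribution (assumed form)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def resample_style(style_codes, name, rng):
    """Return a copy of the style codes in which only `name` gets a new vector."""
    updated = dict(style_codes)
    updated[name] = sample_style_code(rng, len(style_codes[name]))
    return updated


rng = random.Random(0)  # fixed seed so the sketch is repeatable
objects = ["mountain", "skier_1", "skier_2"]  # hypothetical objects in a ski scene
style_codes = {name: sample_style_code(rng) for name in objects}

# Change the look of one skier while keeping the rest of the scene's styles fixed.
restyled = resample_style(style_codes, "skier_1", rng)

changed = [name for name in objects if restyled[name] != style_codes[name]]
print(changed)
```

In a trained generator, each vector would influence how the corresponding object is rendered, so swapping a single code would alter, say, one skier's clothing without redrawing the mountain.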
The results from the LostGANs approach are still not exactly photorealistic: such AI-generated images can sometimes resemble impressionistic paintings with strangely distorted proportions and poses. But LostGANs can synthesize images at a resolution of up to 512 × 512 pixels, whereas prior layout-to-image AI models usually generated lower-resolution images. The LostGANs approach also demonstrated some performance improvements over the competition during benchmark testing with the COCO-Stuff dataset and Visual Genome dataset.
A next step for LostGANs could involve better capturing the details of interactions between people and small objects, such as a person holding a tennis racket in a certain way. One way that LostGANs might improve here would be to use “part-level masks” that represent various components making up an object.
But just as importantly, Wu and Sun showed how to train LostGANs more efficiently using fewer labeled conditions without having to sacrifice the quality of the final image. Such semi-supervised training can rely on just 50 percent of the usual training images to bring LostGANs up to its usual performance standards. The source code and pretrained models of LostGANs are available online at GitHub for any other researchers interested in giving this approach a try.
Tech companies and organizations with much deeper pockets than academic labs have already begun showing the potential of harnessing AI-generated images. In 2019, NVIDIA demonstrated an AI art application called GauGAN that can convert rough sketches drawn by human artists into realistic-looking final images. In early 2021, OpenAI showed off a DALL·E version of its GPT-3 language model that can convert text prompts such as “an armchair in the shape of an avocado” into a realistic final image.
Still, the LostGANs research has a lot to offer even though its images are not yet as polished. By taking the layout-to-mask-to-image approach, LostGANs enables researchers to better understand how the AI model is generating the various objects within an image. Such transparency represents an improvement on the typical “black box” approach of many AI models, which can leave even experts scratching their heads over how the final image was generated.
“For example, if you look at the image and the person doesn’t look correct, you can trace it back and see that it’s because the mask is not correctly computed,” Wu explains. “The mask is better for understanding what’s going on in the generated image and also makes it easier to control the image generation.”
The research could eventually help robots and AI agents to better envision the results of future interactions with objects within their immediate environment. Such image generation based on reconfigurable layouts could also potentially be used to generate varied visual scenarios for training autonomous vehicles.
And in the near term, LostGANs could play the role of an educational tool that invites students and other curious learners to interact with AI by setting up a simple image layout. During a departmental open house, an early version of LostGANs attracted the attention of local high school students with its still imperfect AI-generated images.
“I think that will be fun for those students to play with,” Wu says. “Then they can get a rough understanding that ‘Oh, this is something where I can interact with an AI system through this simple painting.’”
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.