Can artificial intelligence evolve as a human baby does, learning about the world by seeing and interacting with its surroundings? That’s one of the questions driving a large-scale cognitive psychology experiment that has revealed crucial differences in how humans and computers see images.
The study has tested the limits of human and computer vision by examining each one’s ability to recognize partial or fuzzy images of objects such as airplanes, eagles, horses, cars, and eyeglasses. Unsurprisingly, human brains proved far better than computers at recognizing these “minimal” images even as they became smaller and harder to identify. But the results also offer tantalizing clues about the quirks of human vision—clues that could improve computer vision algorithms and eventually lead to artificial intelligence that learns to understand the world the way a growing toddler does.
“The study shows that human recognition is both different and better performing compared with current models,” said Shimon Ullman, a computer scientist at the Weizmann Institute of Science in Rehovot, Israel. “We think that this difference [explains the inability] of current models to analyze automatically complex scenes—for example, getting details about actions performed by people in the image, or understanding social interactions between people.”
Human brains can identify partial or fuzzy minimal images based on certain “building block” features in known objects, Ullman explained. By comparison, computer vision models or algorithms do not seem to use such building block knowledge. The details of his team’s research were published today in the online issue of the journal Proceedings of the National Academy of Sciences.
The study involved more than 14,000 human participants, tested on 3,553 image patches. Such a staggering number of participants made it completely impractical to bring each person into the lab. Instead, Ullman and his colleagues crowdsourced their experiment to thousands of online workers through the service known as Amazon Mechanical Turk. The researchers then verified the online results by comparing them to a much smaller group of human volunteers in the lab.
Human brains easily outperformed the computer vision algorithms tested in the study. But an additional twist in the findings may highlight a key difference between how the human brain and computer vision algorithms decode images. The testing showed a sudden drop in human recognition of minimal images when slight changes made the images too small or fuzzy to identify. Human volunteers were able to identify baseline “minimal” images about 65 percent of the time. But when those images were made even smaller or blurrier, recognition dropped below 20 percent. By comparison, the computer algorithms generally performed worse than humans but showed no similar “recognition gap” as the images became smaller or fuzzier.
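The size of the human “recognition gap” comes down to simple subtraction. The short Python sketch that follows uses a hypothetical `recognition_gap` function and assumes the gap is measured as the drop from a minimal image’s recognition rate to that of its best-recognized smaller or blurrier variant; the study’s exact definition may differ. The only inputs are the article’s rounded figures, with 20 percent treated as an upper bound for the degraded images.

```python
def recognition_gap(minimal_pct, sub_minimal_pcts):
    """Hypothetical helper. Assumed definition: drop (in percentage points)
    from a minimal image to its best-recognized degraded variant."""
    return minimal_pct - max(sub_minimal_pcts)


# Article figures: about 65% for minimal images, below 20% once they are
# made smaller or blurrier. 20 is used as an upper bound for the latter.
print(recognition_gap(65, [20]))  # 45
```

Because the degraded images scored below 20 percent, the actual human gap is somewhat larger than the 45 points this returns.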
Such results suggest that the human brain relies upon certain learning and recognition mechanisms that computer algorithms lack. Ullman and his colleagues suspect that the results can be explained by one particular difference between the brain and computer vision algorithms.
Today’s computer vision models rely on a “bottom-up” approach that first filters images for the simplest possible features and then identifies objects by progressively more complex features. But human vision does not rely on the bottom-up approach alone. The human brain also works “top-down,” comparing a stored model of an object type with the particular object it is trying to identify.
“This means, roughly, that the brain stores in memory a model for each object type, and can use this internal model to ‘go back’ to the image, and search in it specific features and relations between features, which will verify the existence of the particular object in the image,” Ullman explained. “Our rich detailed perception appears to arise from the interplay between the bottom-up and top-down processes.”
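A deliberately toy Python sketch may make the bottom-up/top-down distinction more concrete. It is not the model from Ullman’s study: the functions `bottom_up_edges` and `top_down_verify`, the tiny grayscale grid, and the matching threshold are all hypothetical. The bottom-up stage computes a simple local feature pixel by pixel; the top-down stage takes a stored model of an object, written in those same features, and goes back to the feature map to search for a place where the model fits.

```python
# Toy illustration only: hypothetical functions, not the study's model.

def bottom_up_edges(img):
    """Bottom-up stage: compute a simple local feature (absolute horizontal
    intensity change) for every pair of neighboring pixels."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in img]


def top_down_verify(features, model, threshold):
    """Top-down stage: slide a stored object model over the feature map and
    report whether its best-matching location is close enough to count."""
    mh, mw = len(model), len(model[0])
    best_diff, best_loc = None, None
    for y in range(len(features) - mh + 1):
        for x in range(len(features[0]) - mw + 1):
            # Sum of absolute differences between model and image features.
            diff = sum(
                abs(features[y + i][x + j] - model[i][j])
                for i in range(mh)
                for j in range(mw)
            )
            if best_diff is None or diff < best_diff:
                best_diff, best_loc = diff, (y, x)
    return best_diff is not None and best_diff <= threshold, best_loc


# A tiny grayscale "image" containing a bright 2x2 square.
image = [
    [0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 9, 9, 0, 0],
    [0, 0, 0, 0, 0],
]
# Stored model of the square, expressed in the same edge features.
square_model = [
    [9, 0, 9],
    [9, 0, 9],
]

edges = bottom_up_edges(image)
print(top_down_verify(edges, square_model, threshold=0))  # (True, (1, 0))
```

The two stages only work together because the stored model is written in the same features the bottom-up pass produces; real systems replace both the features and the matching rule with far richer machinery.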
That top-down human brain approach could inspire new computer models and algorithms capable of developing a complex understanding of the world through what they see. To that point, Ullman’s research received some funding through a “Digital Baby” project grant provided by the European Research Council. His group also received backing through the U.S. National Science Foundation’s support of the Center for Brains, Minds and Machines at MIT and other universities. One of the major research goals of the center is “Reverse Engineering the Infant Mind.”
Ullman envisions an artificial intelligence that starts out without detailed knowledge of the world and has sophisticated learning capabilities through vision and interaction:
As a baby, you open your eyes, see flickering pixels, and somehow it all comes together and you know something about the world. You’re starting from nothing, absorbing information and getting a rich view of the world. We were thinking about what would it take to get a computer program where you put in the minimal structures you need and let it view videos or the world for six months. If you do it right, you can get an interesting system.
Even better computer vision could someday enable Siri or Cortana, the virtual assistants in personal smartphones and tablets, to recognize human expressions or social interactions. It could also empower technologies such as self-driving cars or flying drones, making them better able to recognize the world around them. For example, driverless car researchers have been working hard to improve the computer vision algorithms that enable robot cars to quickly recognize pedestrians, cars, and other objects on the road.
On the human side, the study offers a new glimpse of how the human brain sees the world. Such research helps bridge the gap between brain science and computer science, Ullman said. And that could hugely benefit both humans and machines.
“We would like to combine psychological experiments with brain imaging and brain studies in both humans and animals to uncover the features and mechanisms involved in the recognition of minimal images, and their use in the understanding of complex scenes,” Ullman said. “Through this, we also hope to better understand the use of top-down processing in both biological and computer systems.”
Jeremy Hsu has been working as a science and technology journalist in New York City since 2008. He has written on subjects as diverse as supercomputing and wearable electronics for IEEE Spectrum. When he’s not trying to wrap his head around the latest quantum computing news for Spectrum, he also contributes to a variety of publications such as Scientific American, Discover, Popular Science, and others. He is a graduate of New York University’s Science, Health & Environmental Reporting Program.