November’s Internet of Everything column discussed the need to rethink cameras for an era of always-on operation at every corner. We’ll also have to rethink the way those cameras see.
Today, computer vision can track cars, faces, and production processes as accurately as most people can. When there’s a lot of data to sift through, computer-vision models are better than people.
But there are limits. Computers still need more time than a human to recognize a person or an action. They can’t follow a person or object from one video camera to another. They can be fooled easily. They can’t assign meaning to what they see. These are the limits engineers must overcome to make cameras more useful in manufacturing and in smart cities.
Today’s cameras can typically perform inference—using algorithms to match incoming images against a predefined model—at roughly 30 frames per second. The speed depends on the complexity of those computer-vision algorithms.
All inference is a trade-off among cost, speed, memory, and accuracy. A camera that can quickly infer what something is might sacrifice accuracy. Or it might need more memory, which raises the cost of the device.
Thirty frames per second is fine for finding a face in a concert crowd after the fact. However, for more complicated computer-vision tasks, such as spotting errors in a manufacturing process, computers need to work faster or risk slowing down production lines, says Sophie Lebrecht, the director of operations at Xnor.ai, a company building software to improve computer vision. Xnor.ai’s goal is to track images at 60 frames per second.
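To put those frame rates in perspective, it helps to turn them into a time budget: how long a camera has to finish inference on one frame before the next one arrives. The short Python sketch below does that arithmetic. The function names and the 20-millisecond model are illustrative assumptions, not figures from Xnor.ai.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process a single frame, in milliseconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def keeps_up(inference_ms: float, fps: float) -> bool:
    """True if a model that needs inference_ms per frame can sustain the frame rate."""
    return inference_ms <= frame_budget_ms(fps)


for fps in (30, 60):
    print(f"{fps} fps -> {frame_budget_ms(fps):.1f} ms per frame")

# A hypothetical model that needs 20 ms per frame (an assumed figure)
print(keeps_up(20, 30))  # fits inside the 30 fps budget
print(keeps_up(20, 60))  # does not fit inside the 60 fps budget
```

Doubling the frame rate halves the budget, from about 33 milliseconds to about 17. That is why a model that runs comfortably at 30 frames per second may need to become simpler, or run on faster hardware, to reach 60.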
Speeding up the frame rate at which computers can process images is just the first step. The next is to build software that can track an object between cameras in a network. For example, finding a person on one surveillance camera would allow the network to track that person as they walked in front of other cameras, automatically and in real time.
For that, we need fast image processing of complex models, plus software that runs across the camera network and can pick up the person’s image as it moves from one camera to the next. The goal would be to do this on a single network without sending data to the cloud. It would require one algorithm to recognize the person and another to track that person through physical space. It might also require a software overlay on the cameras or new communications protocols.
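One common way to think about the handoff between cameras is as a matching problem: each camera reduces a detected person to a list of numbers (a feature vector), and the network looks for the closest match. The Python sketch below illustrates that idea with cosine similarity. It is a simplified assumption about how such a system might work, not a description of any vendor’s software; the vectors, camera names, and the 0.8 threshold are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Similarity of two feature vectors, from -1 (opposite) to 1 (same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_across_cameras(target, detections, threshold=0.8):
    """Return (camera_id, similarity) for the best match at or above threshold, else None."""
    best = None
    for camera_id, vector in detections:
        similarity = cosine_similarity(target, vector)
        if similarity >= threshold and (best is None or similarity > best[1]):
            best = (camera_id, similarity)
    return best


# Hypothetical feature vector for a person first seen on camera 1
target = [0.9, 0.1, 0.4]

# Hypothetical detections reported by other cameras on the same network
detections = [
    ("camera_2", [0.1, 0.9, 0.2]),
    ("camera_3", [0.88, 0.12, 0.41]),
]

result = match_across_cameras(target, detections)
print(result[0] if result else "no match")
```

In this toy example the detection on camera 3 is the match. A real network would have to cope with changes in lighting, angle, and clothing, and would need to run the comparison fast enough to keep pace with the video.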
Cameras will also need to withstand “adversarial attacks,” a brand-new area of research. Just as humans can be fooled by optical illusions, a computer’s vision may be deceived by tricks that distort an otherwise normal image, causing a program to perceive something that’s not there.
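A toy example shows how little distortion it can take. The Python sketch below uses a deliberately tiny, made-up classifier that scores four pixel values and reports a person when the score is above zero. Nudging every pixel by a small amount in the direction that lowers the score, an approach in the spirit of the gradient-sign attacks described in the research literature, flips the answer. The weights, pixel values, and labels are all assumptions chosen for illustration.

```python
def score(weights, bias, pixels):
    """Weighted sum of pixel values plus a bias; above zero means 'person'."""
    return sum(w * p for w, p in zip(weights, pixels)) + bias


def classify(weights, bias, pixels):
    return "person" if score(weights, bias, pixels) > 0 else "no person"


def perturb(weights, pixels, epsilon):
    """Shift each pixel by at most epsilon in the direction that lowers the score."""
    def sign(value):
        return (value > 0) - (value < 0)
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]


# Hypothetical model and image (pixel values on a 0-to-1 scale)
weights = [0.5, -0.3, 0.8, -0.6]
bias = -0.2
image = [0.6, 0.4, 0.5, 0.3]

# Change every pixel by no more than 0.1
attacked = perturb(weights, image, epsilon=0.1)

print(classify(weights, bias, image))
print(classify(weights, bias, attacked))
```

The original image is classified as a person; the altered one is not, even though no pixel moved by more than 0.1. Real models have millions of weights rather than four, but the same weakness applies: many small, coordinated changes can add up to a wrong answer.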
Perhaps the most difficult task is creating software that allows computers to ascribe meaning to what they see. It’s one thing to recognize that a person is crawling; it’s another for a camera to infer that a person crawling across the floor needs help or is trying to avoid detection.
From there, the cameras, and their software, will need to decide what to do next. We’re a long way off from that, but researchers at Alphabet have already done impressive work trying to teach computer-vision algorithms to find meaning. One day, computers may see even better than we do and harness what they see for our benefit.
This article appears in the December 2018 print issue as “Cognitive Cameras.”