Qualcomm’s Scene-Detecting Smartphone System Is Almost Here

Engineers explain Qualcomm’s SceneDetect ahead of the release of the smartphone processor that runs it

Artificial neural networks have done many cool things in recent years, including learning how to cook food by watching YouTube videos and making cars less noisy. Qualcomm is hoping to bring neural networks to our smartphones to help them recognize the world around them, enabling them to identify objects and act upon this knowledge. With the upcoming release of Qualcomm’s smartphone processor, the Snapdragon 820, this capability could be just a few months away.

Humans are really good at identifying and classifying objects. Computers, on the other hand, find this task exceedingly difficult. Only in the past few years have we started seeing the first systems with capabilities close to those of small children. However, even those typically require significant computational power, limiting their widespread use.

For several years, researchers at Qualcomm facilities around the world have been working on a large project called Zeroth. Their goal is to find algorithmic advances in machine learning for tasks such as visual perception and audio recognition, and to develop efficient implementations of those algorithms for power-constrained devices such as smartphones.

SceneDetect will be the first commercial implementation of this idea. It performs near real-time classification of the visual scene for a variety of categories using the Snapdragon 820 and the camera on your smartphone (or some other supported device).

SceneDetect relies on an emerging field of artificial intelligence called deep learning, and it is implemented by a kind of artificial neural network called a deep convolutional network. Such a network is built from layers of interconnected artificial neurons. These neurons are fed data and collectively work to solve a problem, such as recognizing an image of a dog. To train the network to recognize dogs, researchers feed it images of many kinds of dogs, and the network’s pattern of internal connections is adjusted until the system recognizes them.

Once trained, the network can tell you whether an image contains a dog, even if it has never seen that specific image before. Thus these networks are capable of “learning” the inherent characteristics of dogs and not just “remembering” dogs they have seen in the past. (For an explanation of deep learning by one of its inventors, see this interview with Yann LeCun.)
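The idea of adjusting connections until the system recognizes dogs, and then testing it on an image it has never seen, can be illustrated with a toy sketch. This is not SceneDetect’s network: it is a single artificial neuron rather than a deep convolutional network, and the two “features” per image, the training data, and the function names are all made up for illustration.

```python
import math
import random


def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(examples, epochs=200, learning_rate=0.5, seed=0):
    """Fit one artificial neuron by gradient descent.

    `examples` is a list of (features, label) pairs, where label is
    1 for "dog" and 0 for "not a dog". The weights play the role of
    the connections that get adjusted during training.
    """
    rng = random.Random(seed)
    n_features = len(examples[0][0])
    weights = [rng.uniform(-0.1, 0.1) for _ in range(n_features)]
    bias = 0.0
    for _ in range(epochs):
        for features, label in examples:
            score = sum(w * x for w, x in zip(weights, features)) + bias
            error = sigmoid(score) - label
            # Nudge each connection in the direction that reduces the error.
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def is_dog(weights, bias, features):
    """Classify one feature vector with the trained neuron."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(score) >= 0.5


# Hypothetical training set: two invented feature scores per image.
TRAINING = [
    ((0.9, 0.8), 1), ((0.8, 0.9), 1), ((0.7, 0.7), 1),
    ((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.3, 0.3), 0),
]

weights, bias = train_neuron(TRAINING)

# Neither of these feature vectors appears in the training set.
unseen_dog = is_dog(weights, bias, (0.85, 0.75))
unseen_other = is_dog(weights, bias, (0.15, 0.25))
```

A real deep convolutional network stacks many layers of such units and learns its own features directly from pixels, but the training loop has the same shape: predict, measure the error, and adjust the connections.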

Convolutional neural networks are widely used in image and video recognition, and SceneDetect currently recognizes between 30 and 50 categories of scenes, including birds, mountains, people, and clouds. (That range was chosen because earlier research concluded that it was a reasonable number for most users.)

SceneDetect’s neural network was trained offline on a compute cluster, and only then was it deployed to Snapdragon-powered devices. To train a neural network to pick out different scenes from countless potential ones, Qualcomm used a very large set of prelabeled images. This option became available only in the past few years, in part due to the pioneering work of Stanford computer vision researcher Fei-Fei Li and her colleagues. Her ImageNet research project, which has been using the crowdsourcing technology Amazon Mechanical Turk to create a database of millions of labeled images since at least 2008, was key to the training.

Qualcomm believes that there are a number of real-time applications, such as in robotics, where offloading to the cloud is not possible. However, for less time-critical or more computationally intensive situations, turning to the cloud is an option.
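The trade-off between running on the device and offloading to the cloud can be pictured as a simple latency check. The sketch below is purely illustrative and does not describe how Qualcomm decides anything: the `Task` fields, the `choose_backend` function, and all of the timing figures are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """A recognition job with rough, hypothetical timing estimates."""
    name: str
    latency_budget_ms: float  # how soon the result is needed
    device_compute_ms: float  # estimated time to run on the phone
    cloud_compute_ms: float   # estimated time to run on a server


def choose_backend(task, network_round_trip_ms, network_available=True):
    """Return "device" or "cloud" for a task.

    Prefer the device whenever it meets the deadline. Use the cloud
    only if the device is too slow and the round trip still fits.
    Otherwise fall back to a best-effort run on the device.
    """
    if task.device_compute_ms <= task.latency_budget_ms:
        return "device"
    cloud_total_ms = network_round_trip_ms + task.cloud_compute_ms
    if network_available and cloud_total_ms <= task.latency_budget_ms:
        return "cloud"
    return "device"


# A robot avoiding an obstacle cannot wait for a network round trip.
robot_task = Task("obstacle check", latency_budget_ms=50,
                  device_compute_ms=30, cloud_compute_ms=5)

# Tagging a large photo archive is heavy but not urgent.
archive_task = Task("archive tagging", latency_budget_ms=5000,
                    device_compute_ms=20000, cloud_compute_ms=800)

robot_choice = choose_backend(robot_task, network_round_trip_ms=120)
archive_choice = choose_backend(archive_task, network_round_trip_ms=150)
```

With these invented numbers, the time-critical robot task stays on the device and the heavier, less urgent archive task goes to the cloud, which mirrors the split described above.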

The world is full of unlabeled data, and Qualcomm sees this as a huge opportunity to use its SceneDetect technology to label our surroundings. “Using SceneDetect, your smartphone will be able to categorize the visual scene using concepts that are very similar to how humans describe a scene,” says Samir Kumar, senior director at Qualcomm Research. Automatically categorizing scenes in still images and videos will be a huge boon to search, both locally on your device and online. You could search through all of your photos and immediately find all of those that include your kid eating ice cream, for example. SceneDetect is already capable of recognizing both real-time and saved images, and Qualcomm is looking to extend this capability to video. So in the future, SceneDetect could automatically generate metadata for videos you upload to YouTube, including information about all of the main scenes and a list of objects that appear in them at different times.
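Once every photo carries scene labels, the photo search described here reduces to a lookup. As a rough sketch, assuming a classifier has already attached labels to each file, an inverted index from label to photos answers such queries. The file names, labels, and function names below are hypothetical and do not come from SceneDetect.

```python
from collections import defaultdict


def build_label_index(photo_labels):
    """Map each label to the set of photos that carry it."""
    index = defaultdict(set)
    for photo, labels in photo_labels.items():
        for label in labels:
            index[label.lower()].add(photo)
    return index


def search(index, *query_labels):
    """Return the photos that carry every one of the query labels."""
    if not query_labels:
        return []
    matches = [index.get(label.lower(), set()) for label in query_labels]
    return sorted(set.intersection(*matches))


# Hypothetical output of a scene classifier run over a photo library.
PHOTO_LABELS = {
    "IMG_001.jpg": ["child", "ice cream", "beach"],
    "IMG_002.jpg": ["mountains", "clouds"],
    "IMG_003.jpg": ["child", "birds"],
    "IMG_004.jpg": ["child", "ice cream"],
}

index = build_label_index(PHOTO_LABELS)
ice_cream_photos = search(index, "child", "ice cream")
```

The same structure would extend to video by storing a timestamp alongside each label, so that a query returns the moments where an object appears rather than whole files.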

According to Jeff Gehlhaar, vice president of technology at Qualcomm, as SceneDetect technology improves to include localizing objects within a video scene and counting specific types of objects, it could break into whole new categories of devices. These include security and monitoring systems, robots, and a host of Internet of Things gadgets that could benefit from what will one day be “cognitive cameras.”

About the Author

Iddo Genuth is an Israel-based technology reporter. For IEEE Spectrum he’s written about smart textiles and startups.