The best artificial intelligence still has trouble visually recognizing people performing many of Homer Simpson’s favorite behaviors, such as drinking beer, eating chips, eating doughnuts, yawning, and the occasional face-plant. Those findings from DeepMind, the pioneering London-based AI lab, also help explain why DeepMind has created a huge new dataset of YouTube clips to help train AI to identify human actions in videos that go well beyond “Mmm, doughnuts” or “Doh!”
The most popular AI used by Google, Facebook, Amazon, and other companies beyond Silicon Valley is based on deep learning algorithms that can learn to identify patterns in huge amounts of data. Over time, such algorithms can become much better at a wide variety of tasks such as translating between English and Chinese for Google Translate or automatically recognizing the faces of friends in Facebook photos. But even the most finely tuned deep learning relies on having lots of quality data to learn from. To help improve AI’s capability to recognize human actions in motion, DeepMind has unveiled its Kinetics dataset consisting of 300,000 video clips and 400 human action classes.
“AI systems are now very good at recognizing objects in images, but still have trouble making sense of videos,” says a DeepMind spokesperson. “One of the main reasons for this is that the research community has so far lacked a large, high-quality video dataset.”
DeepMind enlisted online workers through Amazon’s Mechanical Turk service to correctly identify and label the actions in thousands of YouTube clips. Each of the 400 human action classes in the Kinetics dataset has at least 400 video clips, with each clip lasting around 10 seconds and taken from a separate YouTube video. More details can be found in a DeepMind paper on the arXiv preprint server.
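The two constraints described above (a minimum number of clips per class, and no two clips drawn from the same YouTube video) are simple to check mechanically. Here is a minimal sketch, assuming a hypothetical manifest that lists one action label and one source video ID per clip. The `Clip` type, the `check_manifest` function, and the field names are illustrative and are not taken from DeepMind’s release.

```python
from collections import Counter
from typing import NamedTuple


class Clip(NamedTuple):
    label: str     # action class, e.g. "bowling"
    video_id: str  # hypothetical identifier of the source YouTube video


def check_manifest(clips, min_clips_per_class=400):
    """Return (classes below the minimum, video IDs used more than once)."""
    per_class = Counter(clip.label for clip in clips)
    per_video = Counter(clip.video_id for clip in clips)
    # Classes that do not reach the required number of clips.
    too_small = {
        label: count
        for label, count in per_class.items()
        if count < min_clips_per_class
    }
    # Source videos that contributed more than one clip.
    reused = sorted(vid for vid, count in per_video.items() if count > 1)
    return too_small, reused
```

A dataset that satisfies both constraints would return an empty dictionary and an empty list.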
The new Kinetics dataset seems likely to represent a new benchmark for training datasets intended to improve AI computer vision for video. It has far more video clips and action classes than the HMDB-51 and UCF-101 datasets that previously formed the benchmarks for the research community. DeepMind also made a point of ensuring it had a diverse dataset—one that did not include multiple clips from the same YouTube videos.
Tech giants such as Google, a sister company to DeepMind under the umbrella Alphabet group, arguably have the best access to large amounts of video data that could prove helpful in training AI. Alphabet’s ownership of YouTube, the incredibly popular online video-streaming service, does not hurt either. But other companies and independent research groups must rely on publicly available datasets to train their deep learning algorithms.
Early training and testing with the Kinetics dataset showed some intriguing results. For example, deep learning algorithms showed accuracies of 80 percent or greater in classifying actions such as “playing tennis,” “crawling baby,” “presenting weather forecast,” “cutting watermelon,” and “bowling.” But the classification accuracy dropped to around 20 percent or less for the Homer Simpson actions, including slapping and headbutting, and an assortment of other actions such as “making a cake,” “tossing coin” and “fixing hair.”
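Per-class figures like these come from scoring each action class separately rather than averaging over the whole dataset. The article does not say exactly which accuracy metric was used, so the following is only a sketch that assumes one predicted label per clip. The function name `per_class_accuracy` and the sample labels are illustrative.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Map each true class to the fraction of its clips predicted correctly."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


if __name__ == "__main__":
    truth = ["bowling", "bowling", "yawning", "yawning"]
    guesses = ["bowling", "bowling", "yawning", "eating chips"]
    for label, acc in per_class_accuracy(truth, guesses).items():
        print(f"{label}: {acc:.0%}")
```

Breaking the score down this way is what exposes the gap between easy classes and hard ones, which a single overall accuracy number would hide.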
AI faces special challenges with classifying actions such as eating because it may not be able to accurately identify the specific food being consumed, especially if the hot dog or burger is already partially eaten or appears very small within the overall video. The various dancing action classes, along with actions focused on a specific part of the body, can also prove tricky. Some actions also occur fairly quickly and are only visible for a small number of frames within a video clip, according to a DeepMind spokesperson.
DeepMind also wanted to see if the new Kinetics dataset has enough gender balance to allow for accurate AI training. Past cases have shown how imbalanced training datasets can lead to deep learning algorithms performing worse at recognizing the faces of certain ethnic groups. Researchers have also shown how such algorithms can pick up gender and racial biases from language.
A preliminary study showed that the new Kinetics dataset seems to be fairly balanced. DeepMind researchers found that no single gender dominated within 340 out of the 400 action classes—or else it was not possible to determine gender in those actions. Those action classes that did end up gender imbalanced included YouTube clips of actions such as “shaving beard” or “dunking basketball” (mostly male) and “filling eyebrows” or “cheerleading” (mostly female).
But even action classes that had gender imbalance did not show much evidence of “classifier bias.” This means that even the Kinetics action classes featuring mostly male participants—such as “playing poker” or “hammer throw”—did not seem to bias AI to the point where the deep learning algorithms had trouble recognizing female participants performing the same actions.
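The article does not describe how DeepMind measured classifier bias. One plausible way to probe for it, offered here only as a sketch under that assumption, is to compare accuracy within a single action class across annotated groups and look at the gap. The record layout, the group labels, and the `accuracy_gap_by_group` function are hypothetical.

```python
def accuracy_gap_by_group(records, action):
    """Compare accuracy across groups for one action class.

    records: iterable of (true_label, predicted_label, group) tuples,
    where group is a hypothetical annotation such as "male" or "female".
    Returns (accuracy per group, largest minus smallest accuracy).
    """
    hits, counts = {}, {}
    for truth, guess, group in records:
        if truth != action:
            continue  # only score clips that truly show this action
        counts[group] = counts.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (truth == guess)
    accuracy = {group: hits[group] / counts[group] for group in counts}
    gap = max(accuracy.values()) - min(accuracy.values()) if accuracy else 0.0
    return accuracy, gap
```

A gap near zero for a class such as “playing poker” would be consistent with the finding reported above, while a large gap would suggest the imbalance had carried over into the classifier.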
DeepMind hopes that outside researchers can help suggest new human action classes for the Kinetics dataset. Any improvements may enable AI trained on Kinetics to better recognize both the most elegant of actions and the clumsier moments in videos that lead people to say “doh!” In turn, that could lead to new generations of computer software and robots with the capacity to recognize what all those crazy humans are doing on YouTube or in other video clips.
“Video understanding represents a significant challenge for the research community, and we are in the very early stages with this,” according to the DeepMind spokesperson. “Any real-world applications are still a really long way off, but you can see potential in areas such as medicine, for example, aiding the diagnosis of heart problems in echocardiograms.”