DeepMind Shows AI Has Trouble Seeing Homer Simpson’s Actions

IEEE Spectrum

The best artificial intelligence still has trouble visually recognizing people performing many of Homer Simpson’s favorite behaviors, such as drinking beer, eating chips, eating doughnuts, yawning, and the occasional face-plant. Those findings from DeepMind, the pioneering London-based AI lab, also help explain why DeepMind has created a huge new dataset of YouTube clips to train AI to identify human actions in videos that go well beyond “Mmm, doughnuts” or “D’oh!”

The most popular AI used by Google, Facebook, Amazon, and other companies beyond Silicon Valley is based on deep learning algorithms that can learn to identify patterns in huge amounts of data. Over time, such algorithms can become much better at a wide variety of tasks such as translating between English and Chinese for Google Translate or automatically recognizing the faces of friends in Facebook photos. But even the most finely tuned deep learning relies on having lots of quality data to learn from. To help improve AI’s capability to recognize human actions in motion, DeepMind has unveiled its Kinetics dataset consisting of 300,000 video clips and 400 human action classes.

“AI systems are now very good at recognizing objects in images, but still have trouble making sense of videos,” says a DeepMind spokesperson. “One of the main reasons for this is that the research community has so far lacked a large, high-quality video dataset.”

DeepMind enlisted online workers through Amazon’s Mechanical Turk service to identify and label the actions in thousands of YouTube clips. Each of the 400 human action classes in the Kinetics dataset has at least 400 video clips, with each clip lasting around 10 seconds and taken from a different YouTube video. More details can be found in a DeepMind paper [PDF] on the arXiv preprint server.

The new Kinetics dataset seems likely to represent a new benchmark for training datasets intended to improve AI computer vision for video. It has far more video clips and action classes than the HMDB-51 and UCF-101 datasets that previously formed the benchmarks for the research community. DeepMind also made a point of ensuring it had a diverse dataset—one that did not include multiple clips from the same YouTube videos.

Tech giants such as Google, a sister company to DeepMind under the Alphabet umbrella, arguably have the best access to large amounts of video data that could prove helpful in training AI. Alphabet’s ownership of YouTube, the hugely popular online video-streaming service, does not hurt either. But other companies and independent research groups must rely on publicly available datasets to train their deep learning algorithms.

Early training and testing with the Kinetics dataset showed some intriguing results. For example, deep learning algorithms showed accuracies of 80 percent or greater in classifying actions such as “playing tennis,” “crawling baby,” “presenting weather forecast,” “cutting watermelon,” and “bowling.” But the classification accuracy dropped to around 20 percent or less for the Homer Simpson actions, including slapping and headbutting, and an assortment of other actions such as “making a cake,” “tossing coin” and “fixing hair.”
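For readers curious how a per-class figure like “80 percent” or “20 percent” is typically computed, here is a minimal Python sketch. It assumes each test clip has one true label and one predicted label; the function name `per_class_accuracy` and the toy labels are hypothetical and made up for illustration, and are not taken from DeepMind’s paper or code.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Return {class name: fraction of that class's clips predicted correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(true_labels, predicted_labels):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up toy labels for illustration only; NOT results from the Kinetics paper.
true_labels = ["bowling"] * 4 + ["tossing coin"] * 4
predicted_labels = [
    "bowling", "bowling", "bowling", "playing tennis",
    "tossing coin", "fixing hair", "bowling", "fixing hair",
]

accuracy = per_class_accuracy(true_labels, predicted_labels)

# List the classes from hardest to easiest.
for name, value in sorted(accuracy.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.0%}")
```

Scoring each class separately, rather than reporting one overall number, is what lets easy and hard actions stand apart in results like those above.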

AI faces special challenges with classifying actions such as eating because it may not be able to accurately identify the specific food being consumed—especially if the hot dog or burger is already partially consumed or appears very small within the overall video. Dancing classes and actions focused on a specific part of the body can also prove tricky. Some actions also occur fairly quickly and are only visible for a small number of frames within a video clip, according to a DeepMind spokesperson.

DeepMind also wanted to see if the new Kinetics dataset has enough gender balance to allow for accurate AI training. Past cases have shown how imbalanced training datasets can lead to deep learning algorithms performing worse at recognizing the faces of certain ethnic groups. Researchers have also shown how such algorithms can pick up gender and racial biases from language.

A preliminary study showed that the new Kinetics dataset seems to be fairly balanced. DeepMind researchers found that no single gender dominated within 340 out of the 400 action classes—or else it was not possible to determine gender in those actions. Those action classes that did end up gender imbalanced included YouTube clips of actions such as “shaving beard” or “dunking basketball” (mostly male) and “filling eyebrows” or “cheerleading” (mostly female).

But even action classes that had gender imbalance did not show much evidence of “classifier bias.” This means that even the Kinetics action classes featuring mostly male participants—such as “playing poker” or “hammer throw”—did not seem to bias AI to the point where the deep learning algorithms had trouble recognizing female participants performing the same actions.
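One simple way to probe for this kind of classifier bias is to compare accuracy across groups within a single action class. The Python sketch below illustrates that general idea only; the article does not describe how DeepMind ran its own check, and the function name `accuracy_by_group`, the group labels, and the toy records are all hypothetical.

```python
def accuracy_by_group(records):
    """records: iterable of (group, was_correct) pairs for one action class.

    Returns (per-group accuracy dict, gap between best and worst group).
    """
    totals = {}
    for group, was_correct in records:
        hits, count = totals.get(group, (0, 0))
        totals[group] = (hits + int(was_correct), count + 1)
    by_group = {group: hits / count for group, (hits, count) in totals.items()}
    gap = max(by_group.values()) - min(by_group.values())
    return by_group, gap


# Made-up toy records for one imbalanced class; NOT data from the Kinetics study.
# Eight clips from one group (six classified correctly), four from another (three correct).
records = (
    [("male", True)] * 6 + [("male", False)] * 2
    + [("female", True)] * 3 + [("female", False)] * 1
)

by_group, gap = accuracy_by_group(records)
print(by_group, gap)
```

In this toy case the class is imbalanced two to one, yet both groups are classified equally well, so the gap is zero. A large gap would instead be a warning sign worth investigating.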

DeepMind hopes that outside researchers can help suggest new human action classes for the Kinetics dataset. Any improvements may enable AI trained on Kinetics to better recognize both the most elegant of actions and the clumsier moments in videos that lead people to say “D’oh!” In turn, that could lead to new generations of computer software and robots with the capacity to recognize what all those crazy humans are doing on YouTube or in other video clips.

“Video understanding represents a significant challenge for the research community, and we are in the very early stages with this,” according to the DeepMind spokesperson. “Any real-world applications are still a really long way off, but you can see potential in areas such as medicine, for example, aiding the diagnosis of heart problems in echocardiograms.”
