This AI Watched 100 Films to Learn How to Recognize a Kiss

A senior data scientist at Netflix trained an AI to detect kissing scenes in films—and had to take precautions to make sure the model didn’t confuse kissing with sex

Patrick Swayze and Demi Moore kiss in the 1990 movie 'Ghost,' which was one of 100 films that a data scientist used to train an AI to spot a kiss.
Photo: Paramount/Alamy/AF Archive

Like someone who has never been kissed, AI began learning the basics by binge-watching romantic film clips to see how Hollywood stars lock lips. By training deep learning algorithms that have already proven adept at recognizing faces and objects to also recognize steamy kissing scenes dramatized by professional actors, a data scientist has shown how AI systems could gain greater insight into the most intimate human activities.

The study of AI-based kiss detection came from Amir Ziai, a senior data scientist at Netflix, as he was completing coursework to obtain an AI graduate certificate at Stanford University. Ziai handpicked a representative sample of 100 films from a database of Hollywood films spanning the past century. Then he manually labeled different film segments as either kissing or non-kissing scenes, and used still frames and sound clips from those segments to train deep learning algorithms to detect both the sights and sounds of smooching.

Lest anyone get the wrong impression, it’s still unclear whether or not the kiss detection method works with more sexual scenes that go beyond kissing. “In my training set, I’ve stayed away from overly sexual scenes to make sure that the model is not confusing kissing and sex,” Ziai says.

Ziai’s current employer, Netflix, was not involved in the Stanford-based research, which is detailed in a paper published on the preprint server arXiv, and Ziai has not investigated any possible applications of such technology for Netflix. But it’s not hard to imagine commercial applications that could interest Netflix or other companies such as YouTube, Facebook, Instagram, and TikTok that handle huge amounts of streaming or stored video.

Back in April 2019, Google announced that its Pixel smartphones had received a Photobooth feature update that allowed the phones to automatically snap photos whenever they detected kissing in a single frame taken by the smartphone camera. Ziai’s demonstration of kiss detection technology that works with videos hints at future applications that could automatically categorize video content, create personalized video recommendations for viewers, and possibly even screen out certain videos as part of online content moderation.

“This is a good example of how modern computer vision techniques make it fairly easy to develop specific ‘sense and respond’ software, cued to qualitative/unstructured things (like the presence of kissing in a scene),” said Jack Clark, strategy and communications director at OpenAI, in his Import AI newsletter, which recently highlighted the kiss detection study. “I think this is one of the most under-hyped aspects of how AI is changing the scope of individual software development.”

When it came time to visually identify kissing scenes, the deep learning model that proved most successful was ResNet-18, an image classification network that had been pre-trained on more than one million images from the popular ImageNet database. To listen for the sounds of kissing, a deep learning model known as VGGish was trained on the last 960 milliseconds of audio from one-second segments of each scene.
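As a rough illustration of that audio windowing, the Python sketch below cuts a stream of audio samples into one-second segments and keeps only the last 960 milliseconds of each. The function names and the 16 kHz sample rate are assumptions made for this example; they are not details taken from Ziai’s paper or code.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: int = 1) -> List[Sequence[float]]:
    """Cut audio into consecutive full-length segments, dropping any leftover tail."""
    segment_len = sample_rate * segment_seconds
    return [samples[start:start + segment_len]
            for start in range(0, len(samples) - segment_len + 1, segment_len)]


def last_window(segment: Sequence[float], sample_rate: int,
                window_ms: int = 960) -> Sequence[float]:
    """Keep only the final window_ms milliseconds of a segment."""
    window_len = sample_rate * window_ms // 1000
    if window_len <= 0:
        return segment[:0]  # empty window; avoids segment[-0:] returning everything
    return segment[-window_len:]


# Assumed sample rate for the example (hypothetical, not from the article).
SAMPLE_RATE = 16_000

# Two seconds of placeholder "audio": each sample is just its own index.
audio = list(range(2 * SAMPLE_RATE))

segments = split_into_segments(audio, SAMPLE_RATE)
windows = [last_window(segment, SAMPLE_RATE) for segment in segments]
```

Each window here holds 15,360 samples, which is 960 milliseconds at the assumed rate; the first 40 milliseconds of every segment are simply discarded.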

That two-pronged approach of training AI to process both images and audio of kissing helped the overall model achieve a fairly impressive F1 score of 0.95. The F1 score is the harmonic mean of precision (how many of the scenes flagged as kisses really were kisses, which penalizes false positives) and recall (how many of the real kisses were flagged, which penalizes false negatives).
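For readers who want the arithmetic, F1 is conventionally computed as the harmonic mean of precision and recall. The short Python sketch below shows that calculation; the counts fed into it are hypothetical and chosen only to show how a score of 0.95 can arise, not taken from the study.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Return the harmonic mean of precision and recall."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0  # precision or recall is undefined; report 0 by convention

    precision = true_positives / predicted_positive  # hurt by false positives
    recall = true_positives / actual_positive        # hurt by false negatives
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (not results from the paper):
# 95 kisses correctly flagged, 5 false alarms, 5 kisses missed.
score = f1_score(true_positives=95, false_positives=5, false_negatives=5)
```

With these made-up counts, precision and recall are both 0.95, so the F1 score is 0.95 as well. Because it is a harmonic mean, F1 drops sharply if either precision or recall is poor, even when the other is high.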

But the model still stumbled when it encountered trickier video editing and camera perspectives in some film scenes. For example, wide shots of actors kissing sometimes confused the algorithm because most of the camera frame consisted of background scenery. Fast-paced video cuts and shots that didn’t include both actors also proved challenging.

It’s always difficult to figure out which particular data patterns lead deep learning models to make their predictions. One way for humans to try to understand AI logic involves using saliency maps to highlight the data that received the most attention from the AI during its analysis. In the case of the Hollywood kissing scenes, the deep learning models seemed to pay more attention to image pixels related to the actors’ faces.
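One simple way to build such a map is occlusion: blank out one patch of the image at a time and record how much the model’s score falls. The Python sketch below illustrates that idea on a tiny grid. It is not the technique used in the study, which the article does not specify, and `toy_score` is a hypothetical stand-in for a real classifier.

```python
from typing import Callable, List

Image = List[List[float]]


def occlusion_saliency(image: Image, score: Callable[[Image], float],
                       patch: int = 1, fill: float = 0.0) -> Image:
    """Score drop caused by blanking each patch; a bigger drop means more important."""
    height, width = len(image), len(image[0])
    baseline = score(image)
    saliency = [[0.0] * width for _ in range(height)]

    for top in range(0, height, patch):
        for left in range(0, width, patch):
            rows = range(top, min(top + patch, height))
            cols = range(left, min(left + patch, width))

            # Copy the image and blank out the current patch.
            occluded = [row[:] for row in image]
            for r in rows:
                for c in cols:
                    occluded[r][c] = fill

            # Record the score drop for every pixel in the patch.
            drop = baseline - score(occluded)
            for r in rows:
                for c in cols:
                    saliency[r][c] = drop
    return saliency


def toy_score(image: Image) -> float:
    """Hypothetical 'model': mean brightness of the central 2x2 region of a 4x4 image."""
    return sum(image[r][c] for r in (1, 2) for c in (1, 2)) / 4.0


image = [[1.0] * 4 for _ in range(4)]
saliency_map = occlusion_saliency(image, toy_score, patch=1)
```

In this toy case only the four central pixels receive a nonzero saliency value, because they are the only ones the stand-in model looks at. With a real kiss detector, the analogous bright regions would be the ones the article describes around the actors’ faces.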

Some “limited experimentation” also suggests that the AI relied more heavily on visual features than on audio to identify kissing scenes, Ziai says. He observed that the kiss detection system could benefit from a “more carefully crafted dataset” and might make use of contextual information beyond still images to detect kissing.

It’s still unclear how well an AI model trained on just 100 Hollywood films such as Anna Karenina (1935), Ghost (1990), and Casino Royale (2006) would work on a larger dataset of films. But the model saw only “marginal improvement” after the training dataset grew beyond 80 videos, Ziai says. The Hollywood film dataset and some of the computing resources were provided by the lab of Kayvon Fatahalian, an assistant professor of computer science at Stanford University.

Another question is whether such an AI model could perform with similar accuracy in detecting kiss scenes in the types of videos commonly shared on social media. That challenge would probably require additional training on a much larger video dataset with examples going beyond on-screen Hollywood couples such as Patrick Swayze and Demi Moore. Still, some very preliminary testing suggests that this broader application of AI-powered kiss detection shows promise.

“The attempt in this study was to use a diverse dataset so that the model does not overfit to any particular type of movie,” Ziai says. “Anecdotally, the model seems to work reasonably well on a few YouTube videos that I found.”
