Baidu’s AI Produces Short Videos in One Click

VidPress combines Baidu’s natural-language processing and computer-vision technologies


Near the end of 2019, when Baidu's AI, named ERNIE, beat Google's AI, named BERT, in its understanding of human language, a team at Baidu Research was already prepping ERNIE for a new tool. They envisioned a program that could analyze the text from a URL, synthesize a pithy narrative, and align it with machine-selected clips to churn out a 2-minute video with voice-over—all in less time than it would take to play a song.

Last month, a prototype version of such a program, called VidPress, debuted. The AI’s goal is not only to save human video editors' time but also to outperform them in quality.

In a test performed by the team within Baidu’s video platform, Haokan (link in Chinese), VidPress took up to 9 minutes to generate a video from scratch. On video completion rate, a rough proxy for quality, viewers watched 65 percent of VidPress’s videos from beginning to end, compared with 50 percent of videos produced by human editors, says Xi Chen, a research engineer at Baidu.

Chen and his team of engineers at Baidu Research in the San Francisco Bay Area are not alone in testing AI for the booming short-video market. For example, GliaStudio, a Taiwan-based startup, has been creating video summaries of articles since 2015. But few startups have the resources and advantages Baidu has, Chen says.

With access to ERNIE and other Baidu proprietary technologies, including computer-vision programs, the VidPress team is “standing on giant's shoulder,” says Julia Li, director of Baidu Research USA.

To understand how VidPress works, Li explains, consider someone feeding the tool a web page about the death of NBA star Kobe Bryant, who was killed in a helicopter accident in January 2020.

On one track of a parallel process, VidPress generates a condensed version of the story, making sure that important sentences, which can be crafted by the AI or pulled directly from the web page, appear early in the script. Such sentences might include keywords like "helicopter" and "Kobe." During this step, the program also ensures that the logical structure of the summary is coherent and clear, and it can fix human writers' bad habits, such as using vague pronouns, Li says.
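The idea of moving keyword-rich sentences to the front of the script can be illustrated with a small sketch. This is not Baidu's method: VidPress relies on ERNIE, while the code below is a plain keyword-count ranking, and the function names (`split_sentences`, `keyword_score`, `draft_script`) and the lowercase keyword set are assumptions made for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keyword_score(sentence: str, keywords: set[str]) -> int:
    # Count the distinct keywords (lowercase) that appear as whole words.
    words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
    return len(words & keywords)


def draft_script(text: str, keywords: set[str], max_sentences: int = 3) -> list[str]:
    # Hypothetical stand-in for the summarizer: highest-scoring sentences first.
    sentences = split_sentences(text)
    # sorted() is stable, so tied sentences keep their original article order.
    ranked = sorted(sentences, key=lambda s: keyword_score(s, keywords), reverse=True)
    return ranked[:max_sentences]
```

Calling `draft_script(article_text, {"kobe", "helicopter"}, max_sentences=2)` would place the sentence that mentions both keywords first. A real system would also have to rewrite sentences and resolve vague pronouns, which this sketch does not attempt.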

After text-to-speech services convert the script into synthesized speech, VidPress sets "anchors" in this audio track to mark the time points where viewers are likely to be most interested in seeing new visuals. Chen and colleagues built a decision-tree model to choose these anchor points based on how well the content around them correlates with the theme of the story. The system also pays attention to phrases people are normally curious about, such as the names of brands and locations.
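A rough way to picture anchor selection is a two-level decision rule over segments of the audio track. The sketch below is hand-written rather than trained, so it only stands in for the decision-tree model the team describes; the `Segment` type, the 0.3 threshold, and the theme and entity sets are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    start_sec: float  # where the segment begins in the audio track
    text: str         # words spoken during the segment


def theme_overlap(text: str, theme: set[str]) -> float:
    # Fraction of the segment's words that belong to the story's theme words.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return sum(w in theme for w in words) / len(words)


def is_anchor(seg: Segment, theme: set[str], entities: set[str]) -> bool:
    overlap = theme_overlap(seg.text, theme)
    # First split: strongly on-theme segments always get an anchor.
    if overlap >= 0.3:  # assumed threshold, not from the article
        return True
    # Second split: a brand or place name rescues a weakly on-theme segment.
    mentions_entity = any(e in seg.text.lower() for e in entities)
    return mentions_entity and overlap > 0.0


def pick_anchors(segments: list[Segment], theme: set[str], entities: set[str]) -> list[float]:
    # Return the start times at which new visuals would be introduced.
    return [s.start_sec for s in segments if is_anchor(s, theme, entities)]
```

In a trained model the thresholds and the order of the splits would be learned from data rather than fixed by hand, and the features would likely be richer than word overlap.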

On the other parallel track, VidPress finds and scores relevant media captured from the Internet, starting from the given web page and extending to other relevant pages on Baidu's newsfeed network, Baijiahao. The algorithms are written so that only higher-ranking videos or images are aligned to the anchor points in the timeline. Chen says the team is working on accessing general web pages and on developing capabilities to use commercial clients' copyrighted databases.
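The alignment step can be sketched as a greedy match between anchors and scored media. The article does not say how VidPress scores candidates, so the Jaccard similarity between anchor keywords and media tags, the `min_score` cutoff, and the function names below are assumptions for illustration only.

```python
def relevance(anchor_words: set[str], media_tags: set[str]) -> float:
    # Jaccard similarity: shared terms divided by all terms.
    union = anchor_words | media_tags
    return len(anchor_words & media_tags) / len(union) if union else 0.0


def align_media(
    anchors: dict[float, set[str]],  # anchor time in seconds -> nearby keywords
    media: dict[str, set[str]],      # media name -> descriptive tags
    min_score: float = 0.25,         # assumed cutoff for "higher-ranking" media
) -> dict[float, str]:
    timeline: dict[float, str] = {}
    unused = dict(media)
    for t in sorted(anchors):
        if not unused:
            break
        # Pick the best remaining candidate for this anchor.
        best = max(unused, key=lambda name: relevance(anchors[t], unused[name]))
        # Only candidates that clear the cutoff are placed on the timeline.
        if relevance(anchors[t], unused[best]) >= min_score:
            timeline[t] = best
            del unused[best]
    return timeline
```

Anchors with no sufficiently relevant candidate are simply left without new media here; a production system would need a fallback, such as holding the previous visual.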

Baidu's computer-vision technologies are also involved. So, after a crash-site photo in the video about Bryant, Li says, VidPress can add post-match interview footage of Bryant and not of another NBA player when recapping Bryant’s career.

This ability to mine materials in multiple formats including text and visuals from a vast database of websites, as well as the ability to create a timeline dotted with anchor points to hook people’s attention, allows VidPress to improve viewers' satisfaction, Li explains. That’s probably why VidPress had a better video completion rate than human editors, she says.

An observer of China’s technology industry, Hefei Zhang, notes in an online post (link in Chinese) that the value of VidPress lies in how it uses algorithms to reduce the time costs of footage compilation, material organization, and editing. Like most AI products in the market, although VidPress saves time, it can’t yet replace or outperform humans in creativity, he says.

As Baidu’s Li points out, becoming more creative and even providing customized video content based on viewers’ tastes is a direction they’d like to take VidPress, but she acknowledges it’s not there yet.
