Maybe you remember going to a wedding and finding a cheap film camera on each table, along with a note asking guests to snap photos of their wedding experiences. Or, more recently, maybe you added your videos of a wedding, children's soccer match, or other event to a shared online folder. In both these cases, the host of the event or their designee had a lot of work to do to turn those images and videos into a usable keepsake.
Although the quality of video recorded by smartphones has improved dramatically in recent years, the hassle of collecting and assembling multiple recordings of a single event has changed little. Sure, TikTok mavens, Instagram influencers, and other dedicated amateurs have learned how to use editing software to piece together engaging, shareable smartphone movies.
But that leaves a lot of us out of the picture—though not for much longer. The next frontier of consumer video creation will be powered by AI, not by a professional videographer or dedicated amateur. These systems will intelligently and automatically combine video from multiple smartphones and other connected devices, including action cameras, drones, and gimbal cameras, into one finished production. We think this kind of system will be available to consumers within two to three years.
This is consumer multicam video production, an ecosystem of technologies that may just put wedding videographers out of business, or at least give them a run for their money. The building blocks for this system already exist. They include the cameras and advanced video processing software built into today's smartphones, AI that's already great at image recognition, and high-speed, low-latency wireless communications, including LTE, Wi-Fi, and 5G.
Here's how it will work.
Think of several members of a family recording video of an event. First, they use an app to join a shared project. When they start recording, software on their devices automatically determines what each person is filming, tagging the content with detailed metadata.
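To make the tagging step concrete, here is one way the per-clip metadata might look, sketched in Python. The field names and values are illustrative assumptions, not a published format:

```python
from dataclasses import dataclass, field

# Hypothetical metadata record a capture app might attach to each clip.
@dataclass
class ClipMetadata:
    device_id: str
    start_time: float        # seconds since the epoch, from the phone clock
    duration: float          # seconds
    gps: tuple               # (latitude, longitude)
    compass_heading: float   # degrees, 0 = north
    orientation: str         # "portrait" or "landscape"
    tags: list = field(default_factory=list)  # labels from on-device recognition

clip = ClipMetadata(
    device_id="phone-A",
    start_time=1_700_000_000.0,
    duration=12.5,
    gps=(40.7128, -74.0060),
    compass_heading=135.0,
    orientation="landscape",
    tags=["person", "soccer ball", "goal"],
)
```

Anything the cloud system can infer later (who is filming what, from where) hinges on this record traveling with the footage.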
As the event progresses, these metatagged video streams move from the smartphones to the cloud. There, the AI production system matches the streams by comparing timestamps, syncing visual and audio content when possible, and rating the reliability of each alignment.
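The timestamp-matching step could start as simply as shifting every clip onto a shared timeline. A minimal sketch, assuming clips arrive as (device, start time, duration) tuples; a real system would refine these offsets with audio cross-correlation, since phone clocks drift:

```python
# Place clips from different devices on one shared timeline.
def align_clips(clips):
    """clips: list of (device_id, start_time, duration), start_time in
    epoch seconds. Returns [(device_id, offset, duration)] where offset
    is seconds after the earliest clip's start."""
    t0 = min(start for _, start, _ in clips)
    return [(dev, round(start - t0, 3), dur) for dev, start, dur in clips]

timeline = align_clips([
    ("phone-A", 1_700_000_000.0, 30.0),
    ("phone-B", 1_700_000_004.2, 25.0),
    ("drone-1", 1_700_000_010.0, 60.0),
])
# phone-B starts 4.2 seconds after phone-A on the shared timeline
```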
Next, it classifies the streams in terms of distance to objects, camera direction, and orientation. And it classifies them in terms of content, using object recognition, scenery detection, and facial and speech recognition. It also begins comparing content among streams, identifying what is in one stream but not another. Algorithms assign ratings to the content based on the content itself (a person laughing in a scene may be worth more to the final product than whether a frame's composition adheres to the rule of thirds) as well as on quality parameters (a well-lit, well-composed shot may be more likely to make the final cut than one that is not).
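The rating step might be sketched as a weighted blend of content and quality signals, with content weighted more heavily, as described above. The specific signals and weights below are illustrative assumptions, not part of any real product:

```python
# Rate a clip by mixing content signals (faces, laughter, action) with
# technical quality signals (exposure, composition). Weights are illustrative.
def rate_clip(content_scores, quality_scores,
              content_weight=0.7, quality_weight=0.3):
    """Each argument maps a signal name to a score in [0, 1]. Content is
    weighted more heavily: a laughing face can outrank a technically
    perfect but empty shot."""
    content = sum(content_scores.values()) / len(content_scores)
    quality = sum(quality_scores.values()) / len(quality_scores)
    return round(content_weight * content + quality_weight * quality, 3)

score = rate_clip(
    content_scores={"faces": 0.9, "laughter": 1.0, "action": 0.6},
    quality_scores={"exposure": 0.8, "rule_of_thirds": 0.4},
)
# → 0.763
```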
These ratings help the automatic editor put together the final video, making the decisions that a human editor would, like selecting clips and mixing audio. It can apply visual themes, compensate for gaps in the content through techniques like slow motion or still images, add in stock media as necessary, and include user-specified titles or captions.
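A toy version of that editing logic: for each segment of the event timeline, pick the highest-rated clip that covers it, and fall back to a still image where there is a gap, as the article suggests. The clip layout and ratings here are invented for illustration:

```python
# Greedy automatic editor: one pick per timeline segment.
def cut_timeline(clips, segment_starts, segment_len=5.0):
    """clips: list of (name, start, end, rating). Returns the chosen
    source for each segment, or "still-image" when nothing covers it."""
    cut = []
    for seg in segment_starts:
        covering = [c for c in clips if c[1] <= seg and c[2] >= seg + segment_len]
        if covering:
            cut.append(max(covering, key=lambda c: c[3])[0])
        else:
            cut.append("still-image")  # gap-filling fallback
    return cut

edit = cut_timeline(
    clips=[("phone-A", 0, 30, 0.6), ("phone-B", 10, 40, 0.9), ("drone-1", 35, 60, 0.7)],
    segment_starts=[0, 10, 40, 60],
)
# → ["phone-A", "phone-B", "drone-1", "still-image"]
```

A real editor would also weigh continuity and audio, but the core decision—which of several overlapping angles wins each moment—reduces to comparisons like this.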
Finally, the system converts the video into formats and resolutions appropriate for the user's selected platform, from social media to home theater, adding copyright information or even a video watermark to signify its authenticity. It can also prepare it for distribution, via social media, a text-based link, or simply a downloadable file.
In the future, as high-speed wireless networks enable a more real-time multicamera production process, we would expect this system to include a feedback loop. For example, if the AI system realized there is no close-up shot of, say, the family's daughter celebrating the game-winning goal, it can trigger a controllable smartphone camera to zoom in.
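That feedback loop might look something like the sketch below: scan the active streams for a close-up of a given subject and, if none exists, ask a camera that can see the subject to zoom in. The stream fields and the request format are assumptions for illustration:

```python
# Detect a coverage gap and request a zoom from a controllable camera.
def coverage_request(streams, subject):
    """streams: list of dicts with 'device', 'subjects' (recognized labels),
    and 'shot' ('close-up' or 'wide'). Returns a zoom request dict, or
    None if the subject is already covered or not visible anywhere."""
    for s in streams:
        if subject in s["subjects"] and s["shot"] == "close-up":
            return None  # a close-up already exists
    candidates = [s for s in streams if subject in s["subjects"]]
    if candidates:
        return {"device": candidates[0]["device"], "action": "zoom", "target": subject}
    return None  # no stream sees the subject at all

req = coverage_request(
    streams=[
        {"device": "phone-A", "subjects": ["daughter", "goal"], "shot": "wide"},
        {"device": "phone-B", "subjects": ["crowd"], "shot": "close-up"},
    ],
    subject="daughter",
)
# → asks phone-A, the only camera that sees the daughter, to zoom in
```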
Of course, any application of multicamera video technology must include security safeguards to ensure that those contributing content streams are known to the system and have permission to participate. Much of this can be handled at the application level, through logins, passwords, and the like. But smartphones also generate identifying data about the phone and its user, which the system can analyze for signs of unauthorized access. AI-based multicamera production could also include a safeguard against a contemporary media scourge: deepfake videos. A video produced through a multicamera platform could be automatically watermarked to certify that it was assembled from actual, unaltered footage.
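One way to think about such a watermark is as a fingerprint computed over the finished video and the IDs of its contributing streams: any later alteration changes the fingerprint. The sketch below uses a bare hash purely for illustration; a real provenance system would use cryptographic signatures and standardized metadata:

```python
import hashlib

# Illustrative provenance fingerprint for a finished multicam video.
def provenance_fingerprint(video_bytes, stream_ids):
    """Hash the encoded output together with the contributing stream IDs.
    A watermark could embed this digest; re-encoding or editing the
    video afterward would no longer match it."""
    h = hashlib.sha256()
    h.update(video_bytes)
    for sid in sorted(stream_ids):  # order-independent over streams
        h.update(sid.encode())
    return h.hexdigest()

original = provenance_fingerprint(b"...encoded video...", ["phone-A", "phone-B"])
tampered = provenance_fingerprint(b"...altered video...", ["phone-A", "phone-B"])
# the two digests differ, flagging the altered copy
```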
With multicamera video production, the foundation is in place to expand the way we use our devices to capture the world around us, turning video creation, not just video consumption, into a truly social experience.