Apple’s surprise purchase at the end of last month of WaveOne, a California-based startup that develops content-aware AI algorithms for video compression, showcases an important shift in how video signals are streamed to our devices. In the near term, Cupertino’s purchase will likely lead to smart video-compression tools in Apple’s video-creation products and in the development of its much-discussed augmented-reality headset.
However, Apple isn’t alone. Startups in the AI video codec space are likely to prove acquisition targets for other companies trying to keep up.
For decades video compression used mathematical models to reduce the bandwidth required for transmission of analog signals, focusing on the changing portions of a scene from frame to frame. When digital video was introduced in the 1970s, improving video compression became a major research focus, leading to the development of many compression algorithms called codecs, short for “coder-decoder,” that compress and decompress digital media files. These algorithms paved the way for the current dominance of video in the digital age.
AI compression of still images has shown initial success. Video remains more challenging.
While a new codec standard has appeared around every 10 years, all have been based on pixel mathematics—manipulating the values of individual pixels in a video frame to remove information that is not essential for human perception. Other mathematical operations reduce the amount of data that needs to be transmitted or stored.
AI codecs, having been developed over the course of decades, use machine-learning algorithms to analyze and understand the visual content of a video, identify redundancies and nonfunctional data, and compress the video in a more efficient way. They use learning-based techniques instead of manually designed tools for encoding and can use different ways to measure encoding quality beyond traditional distortion measures. Recent advancements, like attention mechanisms, help them understand the data better and optimize visual quality.
During the first half of the 2010s, Netflix and a California-based company called Harmonic helped to spearhead a movement of what’s called “content-aware” encoding. CAE, as Harmonic calls it, uses AI to analyze and identify the most important parts of a video scene, and to allocate more bits to those parts for better visual quality, while reducing the bit rate for less important parts of the scene.
Content-aware video compression adjusts an encoder for different resolutions of encoding, adjusts the bit rate according to content, and adjusts the quality score—the perceived quality of a compressed video compared to the original uncompressed video. All those things can be done by neural encoders as well.
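The core idea of content-aware encoding described above, spending more bits where viewers are likely to look, can be illustrated with a toy bit-allocation routine. This is only a sketch under stated assumptions: the function name `allocate_bits` and the per-region importance scores are hypothetical, and it does not represent Harmonic's, Netflix's, or any real encoder's rate control.

```python
def allocate_bits(importance, total_bits):
    """Split a bit budget across regions in proportion to importance.

    `importance` is a list of non-negative scores, one per region.
    The scores are assumed to come from some content analysis step
    that is outside the scope of this sketch.
    """
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")
    total_importance = sum(importance)
    if total_importance <= 0:
        raise ValueError("at least one region needs a positive score")
    return [total_bits * score / total_importance for score in importance]


# Hypothetical scene: a face region scored three times as important
# as the background, sharing a 4,000-bit budget.
budget = allocate_bits([3, 1], 4000)
print(budget)  # [3000.0, 1000.0]
```

A real encoder would also enforce per-region minimums and work with whole bits, both of which this sketch leaves out.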
Yet, despite a decade-long effort, full neural-video compression, which relies on deep learning, has not beaten the best configurations of conventional codec standards in normal conditions. Reviews from third parties show that when benchmarked with conventional distortion metrics as well as human opinion scores, conventional video encoders still outperform neural-network compression, especially when conventional encoders are enhanced with AI tools.
WaveOne has shown success in neural-network compression of still images. In one comparison, WaveOne reconstructions of images were 5 to 10 times as likely to be chosen over conventional codecs by a group of independent users.
But the temporal correlation in video is much stronger than the spatial correlation in an image and you must encode the temporal domain extremely efficiently to beat the state of the art.
“At the moment, the neural video encoders are not there yet,” said Yiannis Andreopoulos, a professor of data and signal processing at University College London and chief technology officer at iSize Technologies.
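To see why temporal correlation matters so much, consider how little typically changes between two consecutive frames. The following is a minimal sketch, assuming frames are flat lists of pixel values; the helper names `frame_residual` and `changed_fraction` are hypothetical, and production codecs use motion-compensated prediction rather than this plain subtraction.

```python
def frame_residual(previous, current):
    """Per-pixel difference between two frames of equal size."""
    if len(previous) != len(current):
        raise ValueError("frames must have the same number of pixels")
    return [c - p for p, c in zip(previous, current)]


def changed_fraction(previous, current):
    """Fraction of pixels whose value differs between the two frames."""
    residual = frame_residual(previous, current)
    if not residual:
        return 0.0
    return sum(1 for diff in residual if diff != 0) / len(residual)


# Two tiny hypothetical frames: a bright object shifts two pixels
# to the right while the background stays the same.
prev_frame = [10, 10, 10, 10, 200, 200, 10, 10]
next_frame = [10, 10, 10, 10, 10, 200, 200, 10]

print(frame_residual(prev_frame, next_frame))   # [0, 0, 0, 0, -190, 0, 190, 0]
print(changed_fraction(prev_frame, next_frame))  # 0.25
```

Most of the residual is zero, which is the redundancy an encoder exploits by describing only what changed.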
WaveOne will likely continue working on full neural video compression under Apple’s aegis. According to WaveOne’s public research, its neural-compression technology is not compatible with existing codec standards and this fits with Apple’s policy of building products that work seamlessly together but are proprietary and tightly controlled by Apple.
WaveOne founder Lubomir Bourdev declined to comment on the current state of its technology, and Apple did not respond to requests for comment.
AI and conventional codecs will for now work in tandem—in part because conventional encoders can be debugged.
Nonetheless, the industry appears to be moving toward combining AI with conventional codecs—rather than relying on full neural-network compression.
V-Nova, for instance, uses standardized pre-encoding downscaling and post-decoding upscaling, according to its site, to make encoding more efficient and faster than with the underlying encoder alone. But users need software components on both the encoder side and the decoder side.
The London-based company iSize also enhances conventional video encoders with AI-based preprocessing to improve the quality and bit-rate efficiency of conventional encoders. iSize users don’t need a component on the receiver end. The technology just produces bespoke representations in preprocessing that make encoders more efficient. It can add a postprocessing component, but that’s optional.
“By adding an AI component prior to encoder, regardless of what encoder you are using, we’re reducing the bit rate needed to compress some elements of each video frame,” said iSize CEO Sergio Grce in a Zoom call. “Our AI component learns to attenuate details that won’t be noticeable by human viewers when watching video played at the normal replay rate.”
As a result, Grce says, the encoding process is faster and latency drops, which is certainly an important advantage for VR, where latency can cause nausea in users. The file the encoder spits out is significantly smaller without changing anything on the end-user device, Grce says.
In theory, everything in a video must be preserved. The ideal codec encodes everything it receives in a piece of content without altering it, which is why encoders have traditionally focused on what are called distortion metrics. Such measurements include signal-to-noise ratio (SNR), structural similarity index (SSIM), and peak signal-to-noise ratio (PSNR). Each of these provides a quantitative measure of how well the compressed video matches the original video in terms of visual quality.
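As an illustration of what a distortion metric computes, here is a minimal sketch of PSNR using its standard definition, 10 · log10(MAX² / MSE). It assumes 8-bit pixel values given as flat lists; the function names are illustrative rather than taken from any particular codec library.

```python
import math


def mse(original, compressed):
    """Mean squared error between two equal-length pixel sequences."""
    if len(original) != len(compressed) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    squared = ((a - b) ** 2 for a, b in zip(original, compressed))
    return sum(squared) / len(original)


def psnr(original, compressed, max_value=255):
    """Peak signal-to-noise ratio in decibels; higher means closer."""
    error = mse(original, compressed)
    if error == 0:
        return math.inf  # identical signals have no distortion
    return 10 * math.log10(max_value ** 2 / error)


# Hypothetical four-pixel example in which every pixel is off by one.
source = [100, 100, 100, 100]
decoded = [101, 99, 101, 99]
print(round(psnr(source, decoded), 2))  # 48.13
```

Because PSNR depends only on pixel-by-pixel error, two decoded frames with the same score can look quite different to a viewer.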
However, in recent years, there has been an increasing focus on perceptual quality metrics that consider how the compressed video is perceived by human viewers. These metrics aim to measure the visual quality of the compressed video based on how humans perceive it rather than just mathematical measurements. Some distortions, after all, may be mathematically insignificant but still perceptually noticeable. (For instance, blurring a small portion of a person’s face may not represent much considering the overall image or video file, but even small changes to such distinctive features can still be noticed.) As a result, new video-compression techniques are being developed that consider both distortion and perceptual quality metrics.
More recently, the field has been moving further toward perception-oriented encoding, which changes subtle details in the content based on how humans perceive it rather than on mathematical measurements alone. That is easier to do with neural encoders because they see the entire frame, while conventional encoders operate at the macroblock or slice level, seeing only a small piece of the frame.
For the time being, “AI and conventional technologies will work in tandem,” said Andreopoulos, in part, he said, because conventional encoders are interpretable and can be debugged. Neural networks are famously obscure “black boxes.” Whether in the very long term neural encoding will beat traditional encoding, Andreopoulos added, is still an open question.
WaveOne’s technology could be used by Apple to improve video-streaming efficiency, reduce bandwidth costs, and enable higher resolutions and frame rates on its Apple TV+ platform. The technology is hardware-agnostic and could run on AI accelerators built into many phones and laptops. Meanwhile, the metaverse, if realized, will involve a massive amount of data transfer and storage.