How to Design and Train Your Implicit Neural Representation for Video Compression
Submission Under Review
Abstract
Implicit neural representation (INR) methods for video compression have recently achieved visual quality and compression ratios that are competitive with traditional pipelines. However, due to the need for per-sample network training, the encoding speeds of these methods are too slow for practical adoption. We develop a library that lets us disentangle and review the components of methods from the NeRV family, reframing their performance in terms of not only size-quality trade-offs but also their impact on training time. We uncover principles for effective video INR design and propose a state-of-the-art configuration of these components, Rabbit NeRV (RNeRV). When all methods are given equal training time (equivalent to 300 NeRV epochs) for 7 different UVG videos at 1080p, RNeRV achieves +1.27% PSNR on average compared to the best-performing alternative for each video in our NeRV library. We then tackle the encoding speed issue head-on by investigating the viability of hyper-networks, which predict INR weights from video inputs, thereby disentangling training from encoding and enabling real-time encoding. We propose masking the weights of the predicted INR during training to allow for variable, higher quality compression, resulting in 1.7% improvements to both PSNR and MS-SSIM at 0.037 bpp on the UCF-101 dataset, and we increase hyper-network parameters by 0.4% for 2.5%/2.7% improvements to PSNR/MS-SSIM with equal bpp and similar speeds.
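To make the weight-masking idea above concrete, the snippet below is a toy illustration only: it assumes a stand-in hyper-network (a small MLP named `hyper`), a placeholder video feature, and a magnitude-based keep criterion, none of which are specified by the paper. It simply shows how a binary mask applied to hyper-network-predicted INR weights can yield variable-rate compression from a single prediction.

```python
# Toy sketch (assumed names and architecture, not the paper's hyper-network):
# a hyper-network emits a flat weight vector for a small INR, and a
# magnitude-based binary mask at a chosen keep ratio trades quality for size.
import torch
import torch.nn as nn

inr_params = 4096                    # number of weights in the per-video INR
hyper = nn.Sequential(               # stand-in hyper-network: video feature -> INR weights
    nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, inr_params)
)

video_feature = torch.randn(1, 512)  # placeholder encoder output for one clip
weights = hyper(video_feature)       # predicted INR weights, shape (1, inr_params)

def mask_weights(w: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Zero out the smallest-magnitude weights, keeping roughly `keep_ratio` of them."""
    k = int(w.numel() * keep_ratio)
    threshold = w.abs().flatten().kthvalue(w.numel() - k + 1).values
    return w * (w.abs() >= threshold)

for keep in (1.0, 0.5, 0.25):        # lower keep ratio -> sparser weights -> smaller bitstream
    masked = mask_weights(weights, keep)
    print(f"keep {keep:.2f}: {int((masked != 0).sum())} nonzero weights")
```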
Overview
In the context of deep learning, you can train implicit neural representations relatively quickly. Since each network needs to fit only a single image, video, or scene, the networks are typically quite small, and the datasets consist only of the corresponding set of coordinates. A state-of-the-art image INR might fit an image for perfect reconstruction in a matter of seconds. These networks naturally exploit spatial and temporal redundancy, which makes them good candidates for image and video compression.

However, if we change the context from deep learning to video compression, the overfitting process is suddenly prohibitively slow. Compressing a video at a rate of a few seconds per frame is unacceptable for nearly all applications. Unfortunately, the bulk of the INR video compression literature focuses only on the quality-size tradeoff, to the point that these speeds, rather than improving, are getting worse with each new state of the art. To remedy this, we propose a renewed focus on encoding speed. We disentangle video INR models into their core components, and revisit key design decisions in terms of how they impact quality, size, AND time. Rather than only measuring training epochs (which obfuscates the actual time cost of each iteration), we benchmark in terms of complexity and real-world latency, as in the sketch below. As a bonus, we also examine the extent to which these different mechanisms and principles, proposed concurrently across multiple papers, can work together for promising performance improvements.
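The following is a minimal sketch of the per-video overfitting setup described above, not the paper's RNeRV implementation: a tiny NeRV-style network (assumed architecture and hyperparameters) maps frame indices to frames of a toy random "video", and each epoch reports wall-clock time alongside PSNR, since epoch counts alone hide per-iteration cost.

```python
# Minimal per-video INR overfitting sketch (assumed toy architecture, not RNeRV):
# fit a tiny NeRV-style network to a short random "video" and log seconds/epoch.
import time
import torch
import torch.nn as nn

class TinyNeRV(nn.Module):
    """Maps a learned frame-index embedding to a full frame via an MLP + upsampling head."""
    def __init__(self, num_frames, embed_dim=64, base=8, channels=32):
        super().__init__()
        self.embed = nn.Embedding(num_frames, embed_dim)           # positional code per frame
        self.stem = nn.Sequential(nn.Linear(embed_dim, channels * base * base), nn.GELU())
        self.base, self.channels = base, channels
        self.head = nn.Sequential(                                  # upsample 8x8 -> 32x32
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.GELU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, idx):
        x = self.stem(self.embed(idx))
        x = x.view(-1, self.channels, self.base, self.base)
        return self.head(x)

if __name__ == "__main__":
    torch.manual_seed(0)
    T, H, W = 16, 32, 32                       # toy stand-in for a real 1080p clip
    video = torch.rand(T, 3, H, W)             # the "dataset" is just frame indices -> frames
    model = TinyNeRV(T)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    idx = torch.arange(T)

    for epoch in range(50):
        t0 = time.time()
        opt.zero_grad()
        loss = torch.mean((model(idx) - video) ** 2)
        loss.backward()
        opt.step()
        psnr = -10 * torch.log10(loss).item()  # PSNR for pixel values in [0, 1]
        print(f"epoch {epoch:3d}  psnr {psnr:5.2f} dB  {time.time() - t0:.3f}s/epoch")
```

Measuring seconds per epoch (or total wall-clock time to reach a target PSNR) rather than epoch counts is what allows differently sized architectures to be compared fairly on encoding speed.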
Citation
This website is based on the popular template from Ben Mildenhall.