Abstract

Over the past few years, learning-based video compression has become an active research area. However, most works focus on P-frame coding; learned B-frame coding remains under-explored and more challenging. This work introduces a novel B-frame coding framework, termed B-CANF, that exploits conditional augmented normalizing flows for B-frame coding. B-CANF additionally features two novel elements: frame-type adaptive coding and B*-frames. Our frame-type adaptive coding learns better bit allocation for hierarchical B-frame coding by dynamically adapting the feature distributions according to the B-frame type. Our B*-frames allow greater flexibility in specifying the group-of-pictures (GOP) structure by reusing the B-frame codec to mimic P-frame coding, without the need for an additional, separate P-frame codec. On commonly used datasets, B-CANF achieves state-of-the-art compression performance compared with other learned B-frame codecs.

Method

Frame-type Adaptive Coding


Frame-type adaptive coding adapts the coding behavior to the reference type of each B-frame. In traditional codecs, reference B-frames are usually coded at higher quality than non-reference B-frames by operating the same B-frame codec in different modes. Following a similar strategy, we weight the distortions of the reference B-frames and B*-frames more heavily during training and introduce a frame-type adaptation (FA) module, as shown in Figure (b).
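
To make this concrete, below is a minimal sketch of how frame-type adaptive coding could be realized. The FA module shown here is a hypothetical channel-wise affine modulation selected by the frame type, and the loss weights `w_ref` and `w_nonref` are illustrative placeholders; the actual FA design and weighting in the paper may differ.

```python
import torch
import torch.nn as nn

class FrameTypeAdaptation(nn.Module):
    """Hypothetical FA module: a learned channel-wise affine transform,
    one (scale, shift) pair per frame type, applied to latent features."""
    def __init__(self, channels: int, num_frame_types: int = 2):
        super().__init__()
        self.scale = nn.Embedding(num_frame_types, channels)
        self.shift = nn.Embedding(num_frame_types, channels)
        nn.init.ones_(self.scale.weight)
        nn.init.zeros_(self.shift.weight)

    def forward(self, feat: torch.Tensor, frame_type: torch.Tensor):
        # feat: (N, C, H, W); frame_type: (N,), 0 = reference B/B*, 1 = non-reference B
        s = self.scale(frame_type)[:, :, None, None]
        b = self.shift(frame_type)[:, :, None, None]
        return feat * s + b

def rd_loss(rate, mse, frame_type, lmbda=256.0, w_ref=1.0, w_nonref=0.5):
    """Frame-type weighted rate-distortion loss: reference B-/B*-frames
    (type 0) receive a larger distortion weight, so they are coded at
    higher quality than non-reference B-frames."""
    w = w_nonref + (w_ref - w_nonref) * (frame_type == 0).float()
    return (rate + lmbda * w * mse).mean()
```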

B*-frame Extension


We propose B*-frames, which reuse our CANF-based B-frame codec to mimic P-frame coding, thereby supporting multiple GOPs within an intra-period, as shown in the figure above. Our ablation experiments in Section IV-D of the paper show that B*-frames achieve compression performance similar to that of P-frames, which would otherwise require an additional, separate P-frame codec.
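
As an illustration, the sketch below generates one plausible coding schedule for an intra-period. The function names, the placement of B*-frames at GOP boundaries, and the convention of passing the same reconstructed frame as both references are our assumptions for illustration; see the paper for the exact GOP structures used.

```python
def hierarchical_b(left, right):
    """Recursively list hierarchical B-frames between two coded frames,
    each referencing its nearest already-coded neighbors."""
    if right - left <= 1:
        return []
    mid = (left + right) // 2
    return ([(mid, "B", (left, right))]
            + hierarchical_b(left, mid)
            + hierarchical_b(mid, right))

def coding_schedule(intra_period=32, gop_size=16):
    """Coding order for one intra-period: an I-frame, then per GOP a
    B*-frame at the GOP boundary (mimicking a P-frame by reusing the
    B-frame codec with a single, duplicated reference), followed by
    the hierarchical B-frames in between."""
    schedule = [(0, "I", ())]
    for start in range(0, intra_period, gop_size):
        end = start + gop_size
        schedule.append((end, "B*", (start, start)))  # same frame as both references
        schedule += hierarchical_b(start, end)
    return schedule

# Example: two GOPs of 16 frames inside a 32-frame intra-period.
for idx, ftype, refs in coding_schedule():
    print(idx, ftype, refs)
```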

Paper

Rate-distortion Results

The figure presents the rate-distortion plots on the UVG, MCL-JCV, HEVC Class B, and CLIC'22 test datasets in terms of PSNR-RGB and MS-SSIM-RGB. Comparing B-CANF with our previous work CANF-VC and the state-of-the-art B-frame codec LHBDC, we observe that B-CANF surpasses both in terms of PSNR-RGB and MS-SSIM-RGB across all the test datasets.
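
For reference, here is a minimal sketch of the PSNR-RGB metric used in these plots, assuming per-frame PSNR computed on 8-bit RGB frames and averaged over the sequence; the exact evaluation script may differ.

```python
import numpy as np

def psnr_rgb(ref: np.ndarray, rec: np.ndarray) -> float:
    """PSNR between two 8-bit RGB frames of shape (H, W, 3)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)

def sequence_psnr_rgb(refs, recs) -> float:
    """Average per-frame PSNR-RGB over a sequence."""
    return float(np.mean([psnr_rgb(a, b) for a, b in zip(refs, recs)]))
```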

However, B-CANF falls short of Li'22 (DCVC-HEM), the state-of-the-art learned P-frame codec. The performance gap can be attributed to the more advanced entropy coding model employed by Li'22 and to the domain shift between the training and testing phases. For a comprehensive analysis of this domain shift, please refer to Section IV-C in our paper.


Complexity Analysis

The table presents a breakdown analysis of the encoding runtime and peak memory requirement of B-CANF. For the runtime measurement, we encode a 1080p video 10 times and average the encoding runtimes. Peak memory is measured in units of "full-res", i.e., one single-channel map at the spatial resolution of the input image; one reconstructed RGB frame thus occupies the equivalent of 3 full-res.
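
A minimal sketch of this measurement protocol, assuming a PyTorch encoder running on a GPU; `encode_fn` and `frames` are placeholders, and the "full-res" unit is taken to be one single-channel float32 map at the input resolution:

```python
import time
import torch

def profile_encoder(encode_fn, frames, height=1080, width=1920, runs=10):
    """Average encoding runtime over several runs and report peak GPU
    memory in 'full-res' units (one float32 channel at input resolution,
    so one reconstructed RGB frame counts as 3 full-res)."""
    torch.cuda.reset_peak_memory_stats()
    times = []
    for _ in range(runs):
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        encode_fn(frames)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - t0)
    full_res_bytes = height * width * 4  # float32 bytes per full-res unit
    peak_full_res = torch.cuda.max_memory_allocated() / full_res_bytes
    return sum(times) / len(times), peak_full_res
```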


Qualitative Comparison

The figure presents a subjective quality comparison. Our B-CANF (MSE) achieves comparable or even better subjective quality than LHBDC (MSE), at a bit rate nearly one order of magnitude lower than that of LHBDC (MSE). Compared with HM (random access), our B-CANF (SSIM) preserves more texture details (cf. the patterns on the fingers in the first row, the pillars in the second row, and the textures in the last row) at a lower bit rate.