The figure illustrates the proposed B-frame codec configured for coding B-frames (reference and non-reference) and B*-frames. $x_t$ denotes the frame currently being coded, and $\hat{x}_{t-k}$, $\hat{x}_{t+k}$ are the previously reconstructed reference frames. ME-Net estimates the optical flow maps $m^e_{t\to t-k}$, $m^e_{t\to t+k}$ between $x_t$ and its reference frames $\hat{x}_{t-k}$, $\hat{x}_{t+k}$, respectively. The motion prediction network outputs the predicted optical flow maps $m^p_{t\to t-k}$, $m^p_{t\to t+k}$, which serve as the conditioning signals for the motion codec. The frame synthesis network fuses the reference frames, using the reconstructed optical flow maps $\hat{m}_{t\to t-k}$, $\hat{m}_{t\to t+k}$, to generate the predicted frame $x^c_t$, which acts as the conditioning signal for the inter-frame codec. M indicates the frame type (reference B-frame, non-reference B-frame, or B*-frame).
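The coding flow can be summarized by the following minimal sketch. The module names and call signatures (`me_net`, `motion_pred_net`, `motion_codec`, `frame_synthesis`, `inter_frame_codec`) are hypothetical placeholders for illustration and do not reflect the actual implementation.

```python
def code_b_frame(x_t, x_hat_past, x_hat_future, frame_type,
                 me_net, motion_pred_net, motion_codec,
                 frame_synthesis, inter_frame_codec):
    """Code one B-frame conditioned on two reconstructed reference frames (sketch)."""
    # 1) Estimate bi-directional optical flow from the current frame to its references.
    m_e_past = me_net(x_t, x_hat_past)        # m^e_{t->t-k}
    m_e_future = me_net(x_t, x_hat_future)    # m^e_{t->t+k}

    # 2) Predict the flows from the references alone; they condition the motion codec.
    m_p_past, m_p_future = motion_pred_net(x_hat_past, x_hat_future)

    # 3) Conditionally code the estimated flows, given the predicted flows.
    (m_hat_past, m_hat_future), motion_bits = motion_codec(
        (m_e_past, m_e_future), cond=(m_p_past, m_p_future), frame_type=frame_type)

    # 4) Fuse the warped references into the predicted frame x^c_t.
    x_c = frame_synthesis(x_hat_past, x_hat_future, m_hat_past, m_hat_future)

    # 5) Conditionally code the current frame, given x^c_t.
    x_hat, frame_bits = inter_frame_codec(x_t, cond=x_c, frame_type=frame_type)
    return x_hat, motion_bits + frame_bits
```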
Frame-type adaptive coding adapts the coding behavior to the reference type of each B-frame. In traditional codecs, reference B-frames are usually coded at higher quality than non-reference B-frames by operating the same B-frame codec in different modes. Following a similar strategy, we weight the distortions of the reference B-frames and B*-frames more heavily during training and introduce a frame-type adaptation (FA) mechanism, as shown in figure (b).
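A minimal sketch of this frame-type dependent weighting is shown below; the weight values are purely illustrative assumptions, not the ones used for training.

```python
# Hypothetical per-frame-type distortion weights (illustrative only).
FRAME_TYPE_WEIGHT = {
    "reference_B": 1.0,      # reference B-frames weighted more heavily
    "B_star": 1.0,           # B*-frames treated like reference frames
    "non_reference_B": 0.5,  # non-reference B-frames weighted less
}

def rd_loss(rate, distortion, frame_type, lmbda):
    """Rate-distortion loss with frame-type dependent distortion weighting (sketch)."""
    return rate + lmbda * FRAME_TYPE_WEIGHT[frame_type] * distortion
```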
We propose B*-frames, which reuse our CANF-based B-frame codec to mimic P-frame coding, allowing multiple GOPs to be supported within an intra-period, as shown in the figure above. Our ablation experiments in Section IV-D of the paper show that B*-frames achieve compression performance similar to that of P-frames, which would otherwise require an additional, separate P-frame codec.
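A possible way to realize this, sketched below under the assumption that the preceding reconstructed frame is reused as both references of the B-frame codec (so the bi-directional codec behaves in a P-frame-like, uni-directional manner); `code_b_frame` refers to the hypothetical helper sketched earlier.

```python
def code_b_star_frame(x_t, x_hat_prev, codec_modules):
    """Code a B*-frame by reusing the same preceding reference on both sides (sketch)."""
    # codec_modules is assumed to hold the five sub-networks expected by code_b_frame.
    return code_b_frame(
        x_t,
        x_hat_past=x_hat_prev,
        x_hat_future=x_hat_prev,  # reuse the same reconstructed frame as both references
        frame_type="B_star",
        **codec_modules,
    )
```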
The figure presents the rate-distortion plots on the UVG, MCL-JCV, HEVC Class B, and CLIC'22 test datasets in terms of PSNR-RGB and MS-SSIM-RGB.
Comparing our B-CANF with our previous work, CANF-VC, and the state-of-the-art B-frame codec, LHBDC, we observe that B-CANF surpasses both in terms of PSNR-RGB and MS-SSIM-RGB across all the test datasets. However, B-CANF falls short of Li'22 (DCVC-HEM), the state-of-the-art learned P-frame codec. This performance gap can be attributed to the more advanced entropy coding model employed by Li'22 and to the domain shift between the training and testing phases. For a comprehensive analysis of this domain shift, please refer to Section IV-C in our paper.
The table presents a breakdown analysis of the encoding runtime and peak memory requirement of B-CANF. For runtime measurement, we encode a 1080p video 10 times and report the average encoding runtime. Peak memory is measured in units of "full-res" (i.e., the spatial resolution of the input image); for example, one reconstructed frame occupies the equivalent of 3 full-res.
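For reference, such measurements could be collected with PyTorch as in the sketch below; `encode_video` is a hypothetical entry point, and the raw peak-memory value in bytes would still need to be converted into full-res units as defined above.

```python
import time
import torch

def profile_encoder(encode_video, video, runs=10):
    """Return the averaged encoding runtime (s) and peak GPU memory (bytes) (sketch)."""
    times = []
    torch.cuda.reset_peak_memory_stats()
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        encode_video(video)              # encode the same 1080p sequence each run
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    avg_runtime = sum(times) / len(times)
    peak_bytes = torch.cuda.max_memory_allocated()
    return avg_runtime, peak_bytes
```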
The figure presents the subjective quality comparison. Our B-CANF (MSE) achieves comparable or even better subjective quality than LHBDC (MSE), while using a bit rate nearly an order of magnitude lower than that of LHBDC (MSE). Compared with HM (random access), our B-CANF (SSIM) preserves more texture details (cf. the patterns on the fingers in the first row, the pillars in the second row, and the textures in the last row) at a lower bit rate.