The figure illustrates the proposed B-frame codec configured for coding B-frames (reference and non-reference) and B*-frames. $x_t$ denotes the frame currently being coded, and $\hat{x}_{t-k}$, $\hat{x}_{t+k}$ are the previously reconstructed reference frames. ME-Net estimates the optical flow maps $m^e_{t\to t-k}$, $m^e_{t\to t+k}$ between $x_t$ and its reference frames $\hat{x}_{t-k}$, $\hat{x}_{t+k}$, respectively. The motion prediction network outputs the predicted optical flow maps $m^p_{t\to t-k}$, $m^p_{t\to t+k}$, which serve as the conditioning signals for the motion codec. The frame synthesis network fuses the reference frames, using the reconstructed optical flow maps $\hat{m}_{t\to t-k}$, $\hat{m}_{t\to t+k}$, to generate the predicted frame $x^c_t$, which acts as the conditioning signal for the inter-frame codec. M indicates the frame type (reference B-frame, non-reference B-frame, or B*-frame).
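The coding flow can be summarized by the following minimal sketch. The module names and call signatures (`me_net`, `motion_pred_net`, `motion_codec`, `frame_synthesis`, `inter_frame_codec`) are hypothetical placeholders for illustration and do not reflect the actual implementation.

```python
def code_b_frame(x_t, x_hat_past, x_hat_future, frame_type,
                 me_net, motion_pred_net, motion_codec,
                 frame_synthesis, inter_frame_codec):
    """Code one B-frame conditioned on two reconstructed reference frames (sketch)."""
    # 1) Estimate bi-directional optical flow from the current frame to its references.
    m_e_past = me_net(x_t, x_hat_past)        # m^e_{t->t-k}
    m_e_future = me_net(x_t, x_hat_future)    # m^e_{t->t+k}

    # 2) Predict the flows from the references alone; they condition the motion codec.
    m_p_past, m_p_future = motion_pred_net(x_hat_past, x_hat_future)

    # 3) Conditionally code the estimated flows, given the predicted flows.
    (m_hat_past, m_hat_future), motion_bits = motion_codec(
        (m_e_past, m_e_future), cond=(m_p_past, m_p_future), frame_type=frame_type)

    # 4) Fuse the warped references into the predicted frame x^c_t.
    x_c = frame_synthesis(x_hat_past, x_hat_future, m_hat_past, m_hat_future)

    # 5) Conditionally code the current frame, given x^c_t.
    x_hat, frame_bits = inter_frame_codec(x_t, cond=x_c, frame_type=frame_type)
    return x_hat, motion_bits + frame_bits
```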
Frame-type adaptive coding adapts the coding behavior to the reference type of each B-frame. In traditional codecs, reference B-frames are usually coded at higher quality than non-reference B-frames by operating the same B-frame codec in different modes. Following a similar strategy, we weight the distortions of the reference B-frames and B*-frames more heavily during training and introduce a frame-type adaptation (FA) mechanism, as shown in figure (b).
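A minimal sketch of this frame-type dependent weighting is shown below; the weight values are purely illustrative assumptions, not the ones used for training.

```python
# Hypothetical per-frame-type distortion weights (illustrative only).
FRAME_TYPE_WEIGHT = {
    "reference_B": 1.0,      # reference B-frames weighted more heavily
    "B_star": 1.0,           # B*-frames treated like reference frames
    "non_reference_B": 0.5,  # non-reference B-frames weighted less
}

def rd_loss(rate, distortion, frame_type, lmbda):
    """Rate-distortion loss with frame-type dependent distortion weighting (sketch)."""
    return rate + lmbda * FRAME_TYPE_WEIGHT[frame_type] * distortion
```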
We propose B*-frames, which reuse our CANF-based B-frame codec to mimic P-frame coding, allowing multiple GOPs to be supported within an intra-period, as shown in the figure above. Our ablation experiments in Section IV-D of the paper show that B*-frames achieve compression performance similar to that of P-frames, which would otherwise require an additional, separate P-frame codec.
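A possible way to realize this, sketched below under the assumption that the preceding reconstructed frame is reused as both references of the B-frame codec (so the bi-directional codec behaves in a P-frame-like, uni-directional manner); `code_b_frame` refers to the hypothetical helper sketched earlier.

```python
def code_b_star_frame(x_t, x_hat_prev, codec_modules):
    """Code a B*-frame by reusing the same preceding reference on both sides (sketch)."""
    # codec_modules is assumed to hold the five sub-networks expected by code_b_frame.
    return code_b_frame(
        x_t,
        x_hat_past=x_hat_prev,
        x_hat_future=x_hat_prev,  # reuse the same reconstructed frame as both references
        frame_type="B_star",
        **codec_modules,
    )
```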
The figure presents the rate-distortion plots on the UVG, MCL-JCV, HEVC Class B, and CLIC'22 test datasets in terms of PSNR-RGB and MS-SSIM-RGB.
Comparing our B-CANF with our previous work, CANF-VC, and the state-of-the-art B-frame codec, LHBDC, we observe that B-CANF surpasses both in terms of PSNR-RGB and MS-SSIM-RGB across all the test datasets. However, B-CANF falls short of Li'22 (DCVC-HEM), the state-of-the-art learned P-frame codec. This performance gap can be attributed to the more advanced entropy coding model employed by Li'22 and to the domain shift between the training and testing phases. For a comprehensive analysis of this domain shift, please refer to Section IV-C in our paper.
The table presents a breakdown analysis of the encoding runtime and peak memory requirement of B-CANF. For runtime measurement, we encode a 1080p video 10 times and report the average encoding runtime. Peak memory is measured in units of "full-res" (i.e., the spatial resolution of the input image); for example, one reconstructed frame occupies the equivalent of 3 full-res.
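For reference, such measurements could be collected with PyTorch as in the sketch below; `encode_video` is a hypothetical entry point, and the raw peak-memory value in bytes would still need to be converted into full-res units as defined above.

```python
import time
import torch

def profile_encoder(encode_video, video, runs=10):
    """Return the averaged encoding runtime (s) and peak GPU memory (bytes) (sketch)."""
    times = []
    torch.cuda.reset_peak_memory_stats()
    for _ in range(runs):
        torch.cuda.synchronize()
        start = time.perf_counter()
        encode_video(video)              # encode the same 1080p sequence each run
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)
    avg_runtime = sum(times) / len(times)
    peak_bytes = torch.cuda.max_memory_allocated()
    return avg_runtime, peak_bytes
```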
The figure presents the subjective quality comparison. Our B-CANF (MSE) achieves comparable or even better subjective quality than LHBDC (MSE), while using a bit rate nearly an order of magnitude lower than that of LHBDC (MSE). Compared with HM (random access), our B-CANF (SSIM) preserves more texture details (cf. the patterns on the fingers in the first row, the pillars in the second row, and the textures in the last row) at a lower bit rate.