
Neural compression for hologram images and videos

Open Access

Abstract

Holographic near-eye displays can deliver high-quality three-dimensional (3D) imagery with focus cues. However, the content resolution required to simultaneously support a wide field of view and a sufficiently large eyebox is enormous. The consequent data storage and streaming overheads pose a significant challenge for practical virtual and augmented reality (VR/AR) applications. We present a deep-learning-based method for efficiently compressing complex-valued hologram images and videos and demonstrate superior performance over conventional image and video codecs.

© 2022 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

Near-eye displays have become increasingly popular with the pursuit of the metaverse. Among a plethora of techniques, computer-generated holography (CGH) stands out for its potential to create life-like three-dimensional (3D) imagery. Despite the challenge of simultaneously supporting a large eyebox and a wide field of view, recent advances in manufacturing high-resolution spatial light modulators (SLMs) have significantly improved the 3D image quality of holographic displays. Meanwhile, CGH rendering techniques have been rapidly evolving and now demonstrate image quality comparable to that of traditional images and videos; however, CGH compression [1–14] has not been studied as thoroughly.

A CGH represents a two-dimensional (2D) slice of the optical wavefront as a complex-valued image. This differs from the domain in which conventional images and videos lie. A CGH also requires high-frequency interference fringes to produce a realistic depth of field effect, whereas conventional image and video codecs discard visually insignificant high-frequency details to achieve a higher compression rate. These mismatched design choices lead to suboptimal compression performance when existing codecs are applied to CGHs. For practical virtual and augmented reality applications, the ultrahigh-resolution CGHs (16K+) needed to simultaneously support a wide field of view and a large eyebox could further magnify this codec inefficiency and pose a significant challenge. There have been many efforts to tackle CGH compression, with quantization and transform coding being the two main techniques. Quantization schemes employ bit-level reduction [15], non-uniform coding [1], and vector quantization [2,3], while transform coding uses the discrete cosine transform (DCT) [5,6], the discrete wavelet transform (DWT) [7,8], and their variants [10,11] to find compact bases. Discarding the hologram amplitude further produces a phase-only hologram, which alternative phase representations [12], phase-difference representations, and neural networks [13] can encode more efficiently. Joint Photographic Experts Group (JPEG) Pleno Holography provides a unified compression standard for digital holography, including common test conditions for proponent methods [16]. The majority of these works are designed for random-phase holograms, whose noise-like statistics enable a wide field of view, while recent works [17,18] suggest that smooth-phase holograms, whose statistics resemble those of natural images, may induce less severe speckle and facilitate rapid phase-only encoding [17–20]. Each type of hologram has pros and cons depending on the application (see Supplement 1), yet compression algorithms specialized for smooth-phase holograms have not been well explored.

Inspired by recent advances in deep optics [22], neural compression for conventional images and videos [23–25], and large-scale hologram datasets [18], we propose HiFiHC, a learning-based end-to-end compression method for smooth-phase complex hologram images and videos, targeted at near-eye displays with an eyepiece, where the hologram has a limited depth range. For hologram images, an encoder network is learned to compress the hologram into a low-dimensional latent code, and a decoder network reconstructs the hologram. For hologram videos, high efficiency video coding (HEVC, also known as H.265) with a high constant rate factor (CRF) compresses the low- and mid-frequency content, while an encoder–decoder network compresses the high-frequency residual critical for 3D image formation, assisted by the motion vectors embedded in the H.265 video.

Figure 1 shows the pipeline of HiFiHC. The architecture repurposes high-fidelity compression (HiFiC) [25], a network designed for conventional image compression, adjusted for the hologram input and dataset and extended with a domain-specific loss. Specifically, it consists of an encoder $E$, a generator $G$, and a discriminator $discrim$. HiFiHC takes a six-channel tensor input $x \in \mathbb {R}^{6 \times R_x \times R_y}$, created by concatenating the real and imaginary parts of the complex hologram, where $R_x$ and $R_y$ are the spatial resolution of the hologram. The encoder produces a quantized latent $y = E(x)$, which is further decoded by the generator $G$ to obtain a lossy reconstruction $x' = G(y)$. Using a probability model $P$ and an entropy coding algorithm (e.g., arithmetic coding [26]), the latent $y$ can be stored losslessly at bitrate $r(y) = -\log (P(y))$. The discriminator produces scalar values $discrim(x, y)$ and $discrim(x', y)$, indicating the probabilities that $x$ and $x'$ are real rather than synthetic, respectively. We train HiFiHC as a conditional generative adversarial network (GAN), where ($E$, $G$) tries to “fool” $discrim$ into believing its lossy reconstructions are real, while $discrim$ aims to classify the lossy reconstructions as fake and the uncompressed inputs as real. The loss function for training ($E$, $G$) is given by

$$\begin{aligned} \mathcal{L}_{E,G} &= w_r r(y) + w_{holo} ||x - x'||_1 + w_{fs}d_{fs}(x,x') \\ &\quad - w_D \log(discrim(x', y)). \end{aligned}$$

Here $w_r$, $w_{holo}$, $w_{fs}$, and $w_D$ are hyper-parameters controlling the trade-off between the terms, and $d_{fs}$ is a dynamic focal stack loss following Shi et al. [18] that encourages the focal stack reconstructed from the compressed hologram to match the one from the uncompressed input (see Supplement 1 for details).
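As a concrete illustration, the following PyTorch-style sketch assembles this objective. The module and function names (`encoder`, `generator`, `discrim`, `rate_bits`, `focal_stack_distance`) are hypothetical placeholders rather than the authors' implementation, and the focal stack distance is left abstract since its exact form is given in Supplement 1.

```python
import torch

def hologram_to_tensor(holo_complex):
    """Stack the real and imaginary parts of a complex hologram
    (C, Rx, Ry) into the 2C-channel real-valued input x described above."""
    return torch.cat([holo_complex.real, holo_complex.imag], dim=0)

def generator_loss(x, encoder, generator, discrim, rate_bits,
                   focal_stack_distance,
                   w_r=1.0, w_holo=1.0, w_fs=1.0, w_D=1.0):
    """Assemble L_{E,G}: rate + L1 hologram distortion + focal stack
    distance - adversarial term (non-saturating GAN formulation)."""
    y = encoder(x)                         # quantized latent y = E(x)
    x_rec = generator(y)                   # lossy reconstruction x' = G(y)
    r = rate_bits(y)                       # r(y) = -log P(y) from the probability model
    l1 = (x - x_rec).abs().mean()          # ||x - x'||_1 (here averaged over elements)
    d_fs = focal_stack_distance(x, x_rec)  # dynamic focal stack loss (Supplement 1)
    adv = torch.log(discrim(x_rec, y).clamp_min(1e-8)).mean()
    return w_r * r + w_holo * l1 + w_fs * d_fs - w_D * adv
```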

Fig. 1. High-fidelity hologram compression (HiFiHC) pipeline for hologram image and video compression. For image compression, the encoder $E$ encodes one latent code for the hologram’s real and imaginary components. The latent code is quantized by $Q$, entropy coded with side information generated through $P$, decoded by $G$, and classified by $discrim$. For video compression, $E$ takes an H.265 compressed frame with its associated residual and encodes a latent code only for reconstructing the residual. The reconstructed residual is added back to the H.265 frame.


The loss function for training $discrim$ is

$$\mathcal{L}_{D} ={-}\log(1-discrim(x', y)) - \log(discrim(x,y)),$$
which encourages the uncompressed hologram to be classified as 1 (true) and the compressed hologram as 0 (fake).
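A matching sketch of the discriminator objective, under the same hypothetical naming, is shown below; `discrim` is assumed to output a probability in (0, 1).

```python
import torch

def discriminator_loss(x, x_rec, y, discrim, eps=1e-8):
    """L_D = -log(1 - discrim(x', y)) - log(discrim(x, y)):
    the uncompressed hologram x should be classified as real (1)
    and the lossy reconstruction x' as fake (0)."""
    p_fake = discrim(x_rec, y).clamp(eps, 1 - eps)
    p_real = discrim(x, y).clamp(eps, 1 - eps)
    return -(torch.log(1 - p_fake) + torch.log(p_real)).mean()
```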

For hologram video compression, compressing each frame into an individual latent prevents exploitation of temporal redundancy. Instead, we compress the amplitude and phase into two regular videos using H.265 with a CRF of 23, which by itself is insufficient to preserve the fine interference fringes needed to recover the 3D scene details (see Supplement 1). Nevertheless, the predicted frames (P-frames) and bidirectional predicted frames (B-frames) in each video encode motion vectors, which can be leveraged during residual learning. Denote the H.265 compressed frame (converted to a real+imaginary representation) by $x_{265}$ and the residual by $\Delta x = x - x_{265}$. We train a HiFiHC network to predict $\Delta x$ from a 12-channel tensor input created by concatenating $\Delta x$ and $x_{265}$ along the channel dimension. Let $\Delta x'$ be the compressed residual and $\Delta y$ the latent of $\Delta x$; we train HiFiHC using an updated loss for $(E,G)$,

$$\begin{aligned} \mathcal{L}_{\Delta (E,G)} &= w_{\Delta r} r(\Delta y) + w_{\Delta {holo}} ||\Delta x - \Delta x'||_1 \\ &\quad + w_{\Delta fs}d_{\Delta fs}(\Delta x+x_{265},\Delta x'+x_{265}) \\ &\quad - w_{\Delta D} \log(discrim(\Delta x'+x_{265}, \Delta y)) \end{aligned}$$
and an updated loss for $discrim$
$$\scalebox{0.9}{$\displaystyle\mathcal{L}_{D} ={-}\log(1-discrim(\Delta x'+x_{265}, \Delta y)) - \log(discrim(\Delta x+x_{265},\Delta y)).$}$$
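The sketch below illustrates how the 12-channel residual input and the residual generator loss could be formed, again with hypothetical names (`x265` denotes the decoded H.265 frame in real+imaginary form); it is an interpretation of the equations above, not the authors' code.

```python
import torch

def residual_input(x, x265):
    """Build the 12-channel video input by stacking the residual
    dx = x - x265 with the H.265-decoded frame x265 (6 channels each)."""
    dx = x - x265
    return torch.cat([dx, x265], dim=0), dx

def residual_generator_loss(dx, dx_rec, dy, x265, discrim, rate_bits,
                            focal_stack_distance,
                            w_r=1.0, w_holo=1.0, w_fs=1.0, w_D=1.0):
    """Residual counterpart of L_{E,G}: the focal stack and adversarial
    terms are evaluated on the reassembled holograms dx + x265."""
    r = rate_bits(dy)
    l1 = (dx - dx_rec).abs().mean()
    d_fs = focal_stack_distance(dx + x265, dx_rec + x265)
    adv = torch.log(discrim(dx_rec + x265, dy).clamp_min(1e-8)).mean()
    return w_r * r + w_holo * l1 + w_fs * d_fs - w_D * adv
```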
When encoding the residual frames, we consider one group of pictures (GOP) as a batch of frames. The residual of the GOP’s intra-coded frame (I-frame) is compressed first and denoted $\Delta x'_\textrm{I}$. Let $x_\textrm{P}$ be the anchored P-frame and $x_{265\_\textrm{P}}$ be its H.265 compressed frame. We compute the motion-compensated P-frame residual as
$$\Delta x_\textrm{P} = x_\textrm{P} - (x_{265\_\textrm{P}} + \text{warp}(\Delta x'_\textrm{I}, M_{I \to P})),$$
where $M_{I \to P}$ is the motion vector field from the I-frame to the P-frame. Let $\Delta x'_\textrm{P}$ be the compressed P-frame residual and $\overline {\Delta x_\textrm{P}} = \Delta x'_\textrm{P} + \text {warp}(\Delta x'_\textrm{I}, M_{I \to P})$; the motion-compensated residual of an anchored B-frame is then given by
$$\Delta x_\textrm{B} = x_\textrm{B} - (x_{265\_\textrm{B}} + \text{warp}(\Delta x'_\textrm{I}, M_{I \to B}) + \text{warp}(\overline{\Delta x_\textrm{P}}, M_{P \to B})),$$
where $M_{I \to B}$ and $M_{P \to B}$ are the motion vectors from the I-frame to the B-frame and from the P-frame to the B-frame, respectively.
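A minimal sketch of this motion compensation step is given below. The `warp` helper assumes a dense per-pixel motion field; in practice the block-based H.265 motion vectors extracted with mv-extractor would first be expanded to such a field, and all names here are illustrative rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a (C, H, W) frame by a dense (2, H, W) motion field
    given in pixels (flow[0]: horizontal, flow[1]: vertical), using
    bilinear grid sampling."""
    _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs + flow[0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys + flow[1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1).unsqueeze(0)
    return F.grid_sample(frame.unsqueeze(0), grid, align_corners=True).squeeze(0)

def p_frame_residual(x_p, x265_p, dx_i_rec, mv_i_to_p):
    """Motion-compensated P-frame residual: subtract the H.265 frame and
    the compressed I-frame residual warped along the I->P motion vectors."""
    return x_p - (x265_p + warp(dx_i_rec, mv_i_to_p))

def b_frame_residual(x_b, x265_b, dx_i_rec, dx_p_bar, mv_i_to_b, mv_p_to_b):
    """Motion-compensated B-frame residual, compensated from both the
    I-frame residual and the reconstructed P-frame residual."""
    return x_b - (x265_b + warp(dx_i_rec, mv_i_to_b)
                  + warp(dx_p_bar, mv_p_to_b))
```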

We train both versions of HiFiHC on the MIT-CGH-4K V2 dataset [28]. For the video version, we fine-tune the model using holograms rendered from the video sequences used in Xiao et al. [29]. To match a practical holographic near-eye display, we configure holograms in the dataset to have a 15 mm propagation distance from the hologram to the 3D content (see Supplement 1 for more discussion). We also evaluate HiFiHC at other propagation distances (see Supplement 1 for details). Following HiFiC, we use the hyper-prior model [23] for $P$, and we pretrain ($E$, $G$) for $1 \times 10^6$ iterations using $\mathcal {L}_{E,G}$ (without the last term) before training the full conditional GAN for another $1 \times 10^6$ iterations (see Supplement 1 for hyper-parameters). At evaluation time, we use mv-extractor [30] to extract the motion vectors (see Supplement 1).
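For reference, the refocused depth-of-field images reported in the evaluation below can be rendered by numerically propagating the complex hologram to the target depth; a minimal angular-spectrum sketch is shown here, where the wavelength and pixel pitch are illustrative placeholders rather than values from the paper.

```python
import torch

def asm_propagate(holo_complex, z, wavelength=520e-9, pitch=8e-6):
    """Propagate a complex hologram field (H, W) by distance z [m] with the
    angular spectrum method and return the refocused intensity image.
    Wavelength and pixel pitch are illustrative, not the paper's values."""
    h, w = holo_complex.shape
    fy = torch.fft.fftfreq(h, d=pitch)
    fx = torch.fft.fftfreq(w, d=pitch)
    fyy, fxx = torch.meshgrid(fy, fx, indexing="ij")
    # Evanescent components are suppressed by clamping the argument at zero.
    arg = (1 / wavelength**2 - fxx**2 - fyy**2).clamp_min(0.0)
    transfer = torch.exp(1j * 2 * torch.pi * z * torch.sqrt(arg))
    field_z = torch.fft.ifft2(torch.fft.fft2(holo_complex) * transfer)
    return field_z.abs() ** 2

# Example: refocus a random complex field 15 mm away (the offset used in the paper).
field = torch.randn(1080, 1920, dtype=torch.complex64)
image = asm_propagate(field, z=15e-3)
```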

For hologram images, we evaluate HiFiHC against HEIC [31] and BPG [32]; we omit JPEG since it performed far worse. Figure 2 compares a rendered boat scene and a real-world mansion scene evaluated with a model trained to operate at a compression rate of $\sim 0.3$ bpp. In the less cluttered boat scene, ripple-like interference fringes surrounding the subjects (i.e., the rope, boatman, and dog) are visible and relatively isolated. HiFiHC faithfully preserves the mid- and high-frequency fringe details amid subject content smeared by defocus. In contrast, the mid-frequency fringes (i.e., around the dog and the dock man) are largely lost in the HEIC and BPG results. The structure of the high-frequency fringes is broken by the block-based compression pattern, and the contrast of the rainbow-like intensity alternation is reduced. Consequently, the refocused depth-of-field (DoF) images exhibit reduced sharpness and increased ringing artifacts.

Fig. 2. Comparison of HiFiHC, high efficiency image file format (HEIC), and better portable graphics (BPG) performance on hologram images. Readers are encouraged to zoom in and examine details. The second and third rows in each label mark the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) for the hologram amplitude (first number) and the refocused DoF image (second number). Source images: PartyTug 6:00AM (left) by Ian Hubert, and Mansion (right) from Kim et al. [21].


The highlighted mansion fence inset possesses heavily intermingled features due to interlaced depths. The interaction between randomly oriented shrub leaves causes the aggregated fringes to deviate from contour-like shapes. Despite this challenge, HiFiHC reasons about such interaction through the dynamic focal stack loss and retains the dominant features needed to produce a sharp foreground after refocusing. BPG and HEIC, in contrast, significantly underperform, especially when evaluated in the image domain, and their refocused results appear nearly uniformly blurred. Evaluations of additional image scenes, offset distances, and rate-distortion curves are provided in Supplement 1 (see Figs. S2, S5, and S6).

For hologram videos, we evaluate HiFiHC against H.265 videos encoded at a lower CRF that incurs the same number of additional per-frame bits per pixel. Figure 3 compares sequences of a rendered bunny scene and a real-world fossil scene evaluated with a model trained to add $\sim$0.15 bpp per hologram. In the selected frames of the bunny scene (see the full image in Supplement 1), the bunny undergoes fast motion with non-rigid deformation, whereas the background stays relatively stationary. HiFiHC reduces the bpp by 27%/39% for P-/B-frames through motion compensation, and the eyes of the bunny remain sharply focused across the GOP regardless of the frame type. In contrast, the eye sharpness noticeably degrades in the P-/B-frames of the lower-CRF video. In the selected frames of the fossil scene, the camera undergoes a revolving motion, and all pixels translate at a rate inversely proportional to their distance from the camera. Even in this challenging case, HiFiHC still gains a 6%/14% reduction in bits per pixel through motion compensation. Although HiFiHC is trained solely on the residuals of a synthetic dataset, it handles real-world scenes well, producing comparatively sharper DoF images across all frame types. Evaluations of additional video scenes, offset distances, and rate-distortion curves are provided in Supplement 1 (see Figs. S3, S4, S5, and S6).

Fig. 3. Comparison of HiFiHC and H.265 (at a lower constant rate factor, CRF) performance on hologram videos. Readers are encouraged to zoom in and examine details. In each inset, the top-right and bottom-left numbers mark the PSNR and SSIM for the refocused DoF image. The second row in the frame label marks the frame type and the bits per pixel (bpp) of the HiFiHC latent code. Source images: Big Buck Bunny (top) by Blender Foundation, and Horns (bottom) from Mildenhall et al. [27]. The H.265 (lower CRF) results use CRFs of 15 and 18 for Big Buck Bunny and Horns, respectively, both of which yield a similar number of additional bits per pixel to HiFiHC.


In summary, these results demonstrate that HiFiHC achieves superior image quality over conventional image and video codecs for compressing complex holograms of diverse scenes at arbitrary resolution. Several challenges remain for practical deployment of such a learning-based system. Currently, HiFiHC has a considerable model size (2.2 GB) and runs at an interactive decoding speed (5 FPS for 1080P on an NVIDIA Tesla V100 GPU, without TensorRT optimization). Nevertheless, further performance engineering, such as convolutional neural network (CNN) model compression and TensorRT optimization, can reduce the model size and improve runtime. With advances in low-power application-specific integrated circuits (ASICs) for accelerated CNN inference and the rapid development of memory-efficient CNN architectures, we envision the proposed method becoming viable for consumer holographic displays.

Funding

Reality Labs Research, Meta.

Acknowledgments

We thank Douglas Lanman for helpful discussions and advice.

Disclosures

C.J., R.W., L.X., C.K.: Meta Platforms, Inc. (I,E,P). The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

Supplemental document

See Supplement 1 for supporting content.

REFERENCES

1. A. E. Shortt, T. J. Naughton, and B. Javidi, Opt. Express 14, 5129 (2006). [CrossRef]  

2. P. Tsang, K. W. K. Cheung, T.-C. Poon, and C. Zhou, J. Opt. 14, 125403 (2012). [CrossRef]  

3. Y. K. Lam, W. C. Situ, and P. W. M. Tsang, Chin. Opt. Lett. 11, 050901 (2013). [CrossRef]  

4. Y.-H. Seo, H.-J. Choi, and D.-W. Kim, Signal Process.: Image Commun. 22, 144 (2007). [CrossRef]  

5. Y.-H. Seo, H.-J. Choi, J.-S. Yoo, G.-S. Lee, C.-H. Kim, S.-H. Lee, S.-H. Lee, and D.-W. Kim, Opt. Commun. 283, 4261 (2010). [CrossRef]  

6. Z. Ren, P. Su, and J. Ma, Opt. Rev. 20, 469 (2013). [CrossRef]  

7. A. E. Shortt, T. J. Naughton, and B. Javidi, Opt. Express 14, 2625 (2006). [CrossRef]  

8. P. A. Cheremkhin and E. A. Kurbatova, Appl. Opt. 57, A55 (2018). [CrossRef]  

9. D. Blinder, T. Bruylants, H. Ottevaere, A. Munteanu, and P. Schelkens, Opt. Eng. 53, 123102 (2014). [CrossRef]  

10. T. Birnbaum, A. Ahar, D. Blinder, C. Schretter, T. Kozacki, and P. Schelkens, Appl. Opt. 58, 6193 (2019). [CrossRef]  

11. A. El Rhammad, P. Gioia, A. Gilles, M. Cagnazzo, and B. Pesquet-Popescu, Appl. Opt. 57, 4930 (2018). [CrossRef]  

12. A. V. Zea, A. L. V. Amado, M. Tebaldi, and R. Torroba, OSA Continuum 2, 572 (2019). [CrossRef]  

13. H. Ko and H. Y. Kim, IEEE Access 9, 79735 (2021). [CrossRef]  

14. E. Darakis and J. J. Soraghan, Appl. Opt. 46, 351 (2007). [CrossRef]  

15. P. Tsang, K. W. K. Cheung, and T.-C. Poon, Appl. Opt. 50, H42 (2011). [CrossRef]  

16. R. K. Muhamad, T. Birnbaum, A. Gilles, S. Mahmoudpour, K.-J. Oh, M. Pereira, C. Perra, A. Pinheiro, and P. Schelkens, Appl. Opt. 60, 641 (2021). [CrossRef]  

17. A. Maimone, A. Georgiou, and J. S. Kollin, ACM Trans. Graph. 36, 1 (2017). [CrossRef]  

18. L. Shi, B. Li, C. Kim, P. Kellnhofer, and W. Matusik, Nature 591, 234 (2021). [CrossRef]  

19. C. K. Hsueh and A. A. Sawchuk, Appl. Opt. 17, 3874 (1978). [CrossRef]  

20. V. Arrizón and D. Sánchez-De-la Llave, Appl. Opt. 41, 3436 (2002). [CrossRef]  

21. C. Kim, H. Zimmer, Y. Pritch, A. Sorkine-Hornung, and M. Gross, ACM Trans. Graph. 32, 1 (2013). [CrossRef]  

22. V. Sitzmann, S. Diamond, Y. Peng, X. Dun, S. Boyd, W. Heidrich, F. Heide, and G. Wetzstein, ACM Trans. Graph. 37, 1 (2018). [CrossRef]  

23. J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, “Variational image compression with a scale hyperprior,” arXiv:1802.01436 (2018). [CrossRef]  

24. D. Minnen, J. Ballé, and G. Toderici, “Joint autoregressive and hierarchical priors for learned image compression,” arXiv:1809.02736 (2018).

25. F. Mentzer, G. D. Toderici, M. Tschannen, and E. Agustsson, “High-Fidelity Generative Image Compression,” arXiv:2006.09965 (2020).

26. D. Marpe, H. Schwarz, and T. Wiegand, IEEE Trans. Circuits Syst. Video Technol. 13, 620 (2003). [CrossRef]  

27. B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ramamoorthi, R. Ng, and A. Kar, ACM Trans. Graph. 38, 1 (2019). [CrossRef]  

28. L. Shi, B. Li, and W. Matusik, Light: Sci. Appl. 11, 247 (2022). [CrossRef]  

29. L. Xiao, S. Nouri, M. Chapman, A. Fix, D. Lanman, and A. Kaplanyan, ACM Trans. Graph. 39, 142 (2020). [CrossRef]  

30. L. Bommes, X. Lin, and J. Zhou, in 2020 15th IEEE Conference on Industrial Electronics and Applications (ICIEA) (2020), pp. 1419–1424.

31. Nokia Technologies, “High Efficiency Image File Format (HEIF),” https://nokiatech.github.io/heif/technical.html.

32. F. Bellard, “BPG image format,” GitHub (2018) [accessed 11 November 2022], https://github.com/mirrorer/libbpg.

Supplementary Material (1)

Supplement 1: Supplemental Document

