
Portrait stylized rendering for 3D light-field display based on radiation field and example guide

Open Access

Abstract

With the development of three-dimensional (3D) light-field display technology, 3D scenes with correct location and depth information can be perceived without wearing any external device. Traditional portrait stylization methods can only generate 2D stylized portrait images, and it is difficult to produce high-quality stylized portrait content for 3D light-field displays. 3D light-field displays require content with accurate depth and spatial information, which cannot be achieved with 2D images alone, so new portrait stylization methods are needed to meet their requirements. A portrait stylization method for 3D light-field displays is proposed, which maintains the consistency of the dense views of the light-field display when the 3D stylized portrait is generated. An example-based portrait stylization method is used to transfer the designated style image to the portrait image, which prevents the loss of contour information in 3D light-field portraits. To minimize the difference in color information and further constrain the contour details of the portrait, a Laplacian loss function is introduced into the pre-trained deep learning model. The 3D representation of the stylized portrait scene is reconstructed, and the stylized 3D light-field image of the portrait is generated with the mask-guide-based light-field coding method. Experimental results demonstrate the effectiveness of the proposed method, which can use real portrait photos to generate high-quality 3D light-field portrait content.

© 2023 Optica Publishing Group under the terms of the Optica Open Access Publishing Agreement

1. Introduction

The artistic design of portraits usually requires considerable time and professional knowledge from artists. In recent years, transferring existing art forms to different 3D scenes has attracted extensive attention. Light-field display technology has emerged as a promising solution because it provides multi-perspective information that is more aligned with the natural viewing experience, allowing users to observe an artistic scene in 3D space from multiple angles, which makes it a critical aspect of 3D scene stylization [2-11].

With the development of deep neural networks, radiance fields (e.g., Neural Radiance Fields [12] and Plenoxels [13]) are used to represent 3D scenes. Compared with traditional 3D scene representations, continuous radiance fields in 3D space can be more reliably obtained from multi-view images, which makes learning easier. However, methods for natural-scene style transfer [1,14,15] are not suitable for portrait style transfer: because the spatial constraint is not strong, portrait scene details may be lost, which leads to unacceptable style-transfer results.

Example-based style transfer is popular due to significant advances in patch-based synthesis [16,17] and neural techniques [14,18], and it preserves more texture detail from style examples than neural-network-based stylization rendering methods. Such details are critical to preserving the visual characteristics of an artistic style. Although style-transfer techniques powered by patch-based methods [16,19] can deliver high-quality, semantically meaningful results, they are computationally expensive because of their optimization nature. A faster synthesis algorithm [20] provides a real-time approximation of the fully fledged optimization by leveraging the specific structure of the guiding channels used in the context of face stylization [19]. Despite this great improvement, computing the guidance still hinders real-time performance. In the method proposed by Texler et al. [21], the process of generating the appearance guide is optimized and accelerated: the existing example-based stylization method [19] is modified so that the computed guidance is compatible with the fast synthesis method of Sýkora et al. [20], and real-time portrait stylization is finally realized. However, these methods cannot generate stylized 3D portraits from arbitrary viewpoints, and they are difficult to apply directly to the 3D light-field display.

Based on the radiance field learned from a 3D scene, traditional volume rendering can be used to synthesize novel views, which is highly advantageous for generating light-field encoded images. Here, a mask-guided light-field encoding method based on a ray-casting algorithm is proposed to achieve multi-view image synthesis, and the rendering efficiency of 3D light-field content is significantly improved. Moreover, the generation efficiency of the proposed method is independent of the number of viewpoints and the size of the viewing angle, and it is robust for 3D light-field displays with different optical structures [22].

Here, a 3D portrait stylization method for the light-field display is presented, which can transfer given style examples to real 3D portrait scenes and synthesize light-field encoded images whose generated views are consistent across viewpoints. The real 3D person scene is reconstructed into a real portrait radiance field, and a 3D stylized portrait radiance field is optimized with the example-based style transfer method and the proposed loss function. The method is capable of rendering high-quality, view-consistent stylized 3D light-field portrait images, as shown in Fig. 1, and the quality of these stylized images is greatly improved compared to previous work [1].


Fig. 1. The proposed example-based 3D portrait stylization method. A pre-reconstructed radiance field of a real scene is used, and the facial features of the style image are transferred to the content image. The nearest neighbor feature matching loss function is used to optimize the pre-reconstructed radiance field into the artistic portrait radiance field, so as to achieve high-quality stylized novel view synthesis. In contrast to the current state-of-the-art method of Zhang K et al. [1], our method provides more acceptable results.


2. Method

In this section, our light-field portrait stylization method is described in detail. Given a photo-realistic radiance field reconstructed from photos of a real portrait scene, it is transformed into an artistic style by stylizing the 3D scene appearance with a 2D style image. This is achieved by fine-tuning the radiance field with an example-based style transfer method that transfers local features of a face to a specific style. The overall architecture of the proposed algorithm is shown in Fig. 2. First, a given set of images is reconstructed into a sparse voxel structure, in which each voxel stores a scalar opacity and a vector of spherical harmonic coefficients; the Plenoxels [13] method is used to represent and learn the content distribution of the real portrait scene. Then, the real portrait radiance field is optimized into a stylized portrait radiance field with our proposed loss function. Finally, the light-field image is synthesized with the light-field coding algorithm based on the mask guidance image.


Fig. 2. The overall approach of our proposed method. (a) The real portrait radiance field is generated, and then the random viewpoint image generated by the real portrait radiance field is converted into the style image with the example-based style transfer algorithm. (b) The real portrait radiance field is optimized to the stylized portrait radiance field according to the proposed loss function. (c) The 3D image is synthesized with the light-field coding algorithm based on mask guide for the 3D light-field display.


2.1 Portrait stylized radiance field

The Plenoxels [13] method is adopted to reconstruct the radiance field of real portraits. A voxel grid containing spherical harmonic coefficients is reconstructed, and trilinear interpolation is used to calculate the color and opacity of each sample point. The color and opacity of these samples are integrated with the same differentiable volume rendering as in NeRF [12], where the color of a ray is approximated by integrating over samples taken along the ray:

$$\hat{C}(r)=\sum_{i=1}^N T_i\left(1-\exp \left(-\sigma_i \delta_i\right)\right) c_i$$
where $T_i$ represents how much light is transmitted through ray $r$ to sample $i$ (versus contributed by preceding samples), $\left(1-\exp\left(-\sigma_i \delta_i\right)\right)$ denotes how much light is contributed by sample $i$, $\sigma_i$ denotes the opacity of sample $i$, and $c_i$ denotes the color of sample $i$, with distance $\delta_i$ to the next sample.
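To make the rendering step concrete, the following minimal NumPy sketch evaluates Eq. (1) for a single ray; the sample opacities, spacings, and colors are illustrative placeholders rather than values from an actual voxel grid.

```python
# Minimal NumPy sketch of the volume-rendering sum in Eq. (1) for one ray.
# The sample opacities, spacings, and colors below are illustrative placeholders,
# not values taken from an actual voxel grid.
import numpy as np

def render_ray_color(sigma, delta, color):
    """sigma: (N,) opacities, delta: (N,) distances to next sample, color: (N, 3) RGB."""
    alpha = 1.0 - np.exp(-sigma * delta)             # light contributed by each sample
    survive = np.cumprod(1.0 - alpha)                # light surviving past each sample
    T = np.concatenate(([1.0], survive[:-1]))        # T_i: transmittance before sample i
    weights = T * alpha
    return (weights[:, None] * color).sum(axis=0)    # \hat{C}(r)

sigma = np.array([0.1, 0.8, 2.0])
delta = np.array([0.05, 0.05, 0.05])
color = np.array([[0.9, 0.7, 0.6], [0.8, 0.6, 0.5], [0.2, 0.2, 0.2]])
print(render_ray_color(sigma, delta, color))
```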

Voxel coefficients can then be optimized based on the standard MSE reconstruction loss relative to the training images, along with a total variation regularizer. The MSE reconstruction loss $L_{\text {recon }}$ and the total variation regularizer $L_{T V}$ are

$$L_{\text{recon }}=\frac{1}{|\mathfrak{R}|} \sum_{r \in \mathfrak{R}}\|C(r)-\hat{C}(r)\|_2^2$$
$$L_{T V}=\frac{1}{|\mathfrak{I}|} \sum_{\substack{i \in \mathfrak{I} \\ d \in[D]}} \sqrt{\Delta_x^2(i, d)+\Delta_y^2(i, d)+\Delta_z^2(i, d)}$$
with $\Delta_x^2(i, d)$ shorthand for the squared difference between the $d$th value in voxel $i:=(i, j, k)$ and the $d$th value in voxel $(i+1, j, k)$, normalized by the resolution, and analogously for $\Delta_y^2(i, d)$ and $\Delta_z^2(i, d)$.
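As an illustration, a hedged PyTorch sketch of Eqs. (2) and (3) is given below; it assumes a dense voxel grid of shape (X, Y, Z, D) and omits the per-resolution normalization, whereas Plenoxels operates on a sparse grid.

```python
# Hedged PyTorch sketch of Eqs. (2)-(3): per-ray MSE reconstruction loss and a
# total-variation penalty on a dense (X, Y, Z, D) voxel grid. Plenoxels uses a
# sparse grid and normalizes differences by the resolution; both are omitted here.
import torch

def recon_loss(pred_rgb, gt_rgb):
    # pred_rgb, gt_rgb: (R, 3) rendered and ground-truth colors for a batch of rays
    return ((pred_rgb - gt_rgb) ** 2).sum(dim=-1).mean()

def tv_loss(grid, eps=1e-8):
    # Differences between each voxel and its +1 neighbor along x, y, z for every channel d
    dx = grid[1:, :-1, :-1] - grid[:-1, :-1, :-1]
    dy = grid[:-1, 1:, :-1] - grid[:-1, :-1, :-1]
    dz = grid[:-1, :-1, 1:] - grid[:-1, :-1, :-1]
    return torch.sqrt(dx ** 2 + dy ** 2 + dz ** 2 + eps).mean()
```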

It should be noted that the Plenoxels method is a representative radiance field reconstruction method and can be replaced with other advanced radiance field methods. After the real portrait radiance field is reconstructed, the style and content losses are calculated to optimize the stylized portrait radiance field. Complex high-frequency visual details from 2D style images are transferred to the 3D scene with a nearest neighbor feature matching loss [23] between portrait images generated from random viewpoints of the real portrait radiance field and the given style image. The style loss is computed with a pre-trained ResNet-50 network, which extracts feature maps from the content image and the style image separately [24,25], while the content loss is obtained by computing the mean squared distance between the feature maps extracted from the content image and the style image with a pre-trained VGG network [24,26]. In particular, $I_{\text{style}}$ denotes the style image, and $I_{\text{render}}$ denotes an image rendered from the radiance field at a selected viewpoint. ResNet-50 feature maps $F_{\text{style}}$ and $F_{\text{render}}$ are extracted from $I_{\text{style}}$ and $I_{\text{render}}$, respectively, and $F_{\text{render}}(i, j)$ denotes the feature vector at pixel location $(i, j)$ of the feature map $F_{\text{render}}$. Our style loss function can be written as

$$L\left(F_{\text{render}}, F_{\text{style}}\right)=\frac{1}{N} \sum_{i, j} \min _{i^{\prime}, j^{\prime}} D\left(F_{\text{render}}(i, j), F_{\text{style}}\left(i^{\prime}, j^{\prime}\right)\right)$$
where $N$ is the number of pixels in $F_{\text {render }}$, and $D(v_1,v_2)$ computes the cosine distance between two vectors $v_1,v_2$:
$$D\left(v_1, v_2\right)=1-\frac{v_1^{T} v_2}{\sqrt{v_1^{T} v_1\, v_2^{T} v_2}}$$

Overall, for each feature in $F_{\text{render}}$, its cosine distance to its nearest neighbor in the style image's ResNet-50 feature space is minimized. The stylized portrait radiance field is thus optimized, but the radiance field does not transfer the local features of the face well, and further constraints are needed. The specific optimization process is introduced in Section 2.2.
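The nearest neighbor feature matching loss of Eqs. (4) and (5) can be sketched as follows, assuming the ResNet-50 feature maps have already been extracted as (C, H, W) tensors; this is a simplified illustration, not the released implementation.

```python
# Sketch of the nearest neighbor feature matching style loss of Eqs. (4)-(5),
# assuming f_render and f_style are (C, H, W) feature maps already extracted by
# a frozen ResNet-50.
import torch
import torch.nn.functional as F

def nn_feature_matching_loss(f_render, f_style):
    # Flatten to (H*W, C) and L2-normalize so 1 - dot product equals the cosine distance D
    fr = F.normalize(f_render.flatten(1).t(), dim=1)   # (N_r, C)
    fs = F.normalize(f_style.flatten(1).t(), dim=1)    # (N_s, C)
    cos_dist = 1.0 - fr @ fs.t()                       # (N_r, N_s) pairwise cosine distances
    return cos_dist.min(dim=1).values.mean()           # nearest style neighbor per rendered feature
```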

2.2 Example-based portrait style transfer optimization method

According to the method proposed in Section 2.1, the portrait stylized radiance field is constructed to realize portrait style transfer from any perspective. However, in the process of portrait style transfer, using the nearest neighbor style loss and the content loss to optimize the network can lead to loss of facial details. Therefore, the example-based portrait style transfer method is introduced to optimize the results.

In our optimization approach, the style transfer module in the proposed radiance-field-based portrait stylization network is replaced with an example-based portrait style transfer module. Specifically, the portrait images with random perspectives generated by the real portrait radiance field are transferred with the example-based style transfer method. The style loss and content loss proposed in Section 2.1 are calculated between the transferred result image and the content image (the real portrait image) to optimize the network, and the Laplacian loss function is introduced to better retain the details of the content image.

Example-based style transfer method. A positional guide and an appearance guide are inserted into the fast synthesis algorithm of Sýkora et al. [20] to achieve style transfer. A key role of the positional guide is to ensure style consistency, i.e., to encourage the synthesis to transfer patches from the source exemplar to a semantically meaningful location in the target image. First, facial key points are obtained for the positional guide. For the style image, the pre-trained algorithm [27] is used to generate them in advance. For real face images, faster detection is obtained by reducing the resolution to half before feeding the image into the face detector. After the facial key points are obtained, the coordinate information is embedded into the RGB channels, where R stores the key point's x coordinate and G stores its y coordinate; the key point deformation from the original image to the style image is then calculated. The remaining B channel is used to store the mask, where the mask of the style image can be pre-generated, while the mask of the content image is obtained as shown in Fig. 3.


Fig. 3. Given a face (a), (b), we compute a fast approximation of a segmentation mask (c).


The detection flags shown by the green circles in Fig. 3(a) are used to connect the positions along the chin with a red line, and a blue elliptic curve then connects the left and right topmost chin coordinates, which results in a face mask with a smaller extent. As shown in Fig. 3(b), in order to include the forehead in the mask, the color components along the blue curve are sampled, and a fast color thresholding operation and connected-component analysis are used to determine the boundary between skin and hair. Finally, the face mask is obtained as shown in Fig. 3(c).
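A simplified sketch of assembling the positional guide described above is given below; the landmark-based warp that produces the coordinate map is abstracted away, and the function and variable names are illustrative assumptions rather than part of the actual implementation.

```python
# Simplified sketch of assembling the positional guide: warped landmark coordinates
# go into the R and G channels and the face mask into B. The landmark-based warp
# that produces `warped_xy` (an (H, W, 2) map of coordinates) is assumed to be
# computed elsewhere, e.g. by warping between content and style key points.
import numpy as np

def build_positional_guide(warped_xy, face_mask):
    h, w, _ = warped_xy.shape
    guide = np.zeros((h, w, 3), dtype=np.uint8)
    guide[..., 0] = np.clip(255.0 * warped_xy[..., 0] / (w - 1), 0, 255)  # R: x coordinate
    guide[..., 1] = np.clip(255.0 * warped_xy[..., 1] / (h - 1), 0, 255)  # G: y coordinate
    guide[..., 2] = 255 * (face_mask > 0)                                 # B: face mask
    return guide

# Tiny synthetic usage: identity warp and an all-ones mask.
h, w = 4, 4
identity_xy = np.stack(np.meshgrid(np.arange(w), np.arange(h)), axis=-1).astype(np.float32)
print(build_positional_guide(identity_xy, np.ones((h, w), dtype=np.uint8)).shape)  # (4, 4, 3)
```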

To obtain the appearance guide, the content image and style image are first converted into gray-scale maps, and the Gaussian-blurred gray-scale map is subtracted from the original gray-scale map to obtain edge-filtered versions of the content image and style image. Finally, the value distribution histogram of the edge-filtered content image is matched to that of the style image to obtain the final appearance guide, as shown in Fig. 4.


Fig. 4. The process of generating appearance guides for the style image and the content image. (a) and (d) are the gray-scale images of the content image and style image, (b) and (e) are the images after Gaussian blurring of the gray-scale images, and (c) and (f) are the edge filtering results of the content image and style image, obtained by subtracting (b) and (e) from (a) and (d). (g) shows the result after matching the distribution histograms of (c) and (f).

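The appearance-guide construction can be sketched as follows, assuming OpenCV and scikit-image are available; the Gaussian kernel size is an illustrative choice rather than the value used in the paper.

```python
# Sketch of the appearance-guide construction: gray-scale conversion, subtraction of
# a Gaussian-blurred copy (edge filtering), and histogram matching of the content
# guide to the style guide. OpenCV and scikit-image are assumed to be available;
# the kernel size is an illustrative choice.
import cv2
import numpy as np
from skimage.exposure import match_histograms

def appearance_guides(content_bgr, style_bgr, ksize=21):
    def edge_filter(img):
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
        blurred = cv2.GaussianBlur(gray, (ksize, ksize), 0)
        return gray - blurred                            # high-frequency detail layer

    g_style = edge_filter(style_bgr)
    g_content = edge_filter(content_bgr)
    g_content = match_histograms(g_content, g_style)     # match value distributions
    return g_content, g_style
```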

After the two guidance images are obtained, a 3D lookup table is constructed by Eq. (6) to record the distance between multiple coordinates as an error index for the substitution between pixels,

$$E(p, q)=\left\|G_{p o s}^S(p)-G_{p o s}^{T_i}(q)\right\|^2+\lambda\left|G_{a p p}^S(p)-G_{a p p}^{T_i}(q)\right|^2$$
where $G_{pos}^S$ and $G_{app}^S$ represent the guidance values of the style image at pixel $p$, $G_{pos}^{T_i}$ and $G_{app}^{T_i}$ represent the guidance values of the content image at pixel $q$, and $\lambda$ represents the relative contribution strength of the positional guide and the appearance guide. The 3D lookup table is fed into the fast example-based synthesis algorithm [20] for style transfer synthesis, and this method replaces the style transfer module in the radiance-field-based portrait style transfer system constructed in Section 2.1.
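A minimal sketch of evaluating the guidance error of Eq. (6) is given below; the weight value and array shapes are illustrative, and the StyleBlit synthesis step that consumes the resulting lookup table is not reproduced.

```python
# Minimal sketch of the guidance error of Eq. (6) between style pixels p and target
# pixels q; `lam` weighs the appearance term against the positional term and its
# value here is only illustrative.
import numpy as np

def guidance_error(g_pos_style, g_app_style, g_pos_target, g_app_target, lam=2.0):
    # g_pos_*: (..., 2) positional guide values, g_app_*: (...,) appearance guide values
    pos_term = np.sum((g_pos_style - g_pos_target) ** 2, axis=-1)
    app_term = np.abs(g_app_style - g_app_target) ** 2
    return pos_term + lam * app_term
```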

Laplacian loss function. The Laplacian matrix produced with the Laplacian operator is widely used in computer vision to detect edges and contours [28]. Therefore, the Laplacian loss is introduced to measure the difference in detail structure between the content image and the stylized image, so as to better preserve the detail structure of the content image. The Laplacian loss function is defined as

$$L_{l a p}=\sum_{i j}\left(D\left(x_c\right)-D(\hat{x})\right)_{i j}^2$$
where $D(x)$ is the Laplacian matrix of the image $x$ convolved with the Laplacian operator. $x_c$ is the content map and $\hat {x}$ is the image after style transfer.
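A hedged PyTorch sketch of the Laplacian loss in Eq. (7) is shown below, assuming gray-scale image tensors of shape (1, 1, H, W) and a standard 3×3 Laplacian kernel.

```python
# Hedged PyTorch sketch of the Laplacian loss in Eq. (7): both images are convolved
# with a 3x3 Laplacian kernel and the squared differences of the responses are summed.
# Inputs are assumed to be (1, 1, H, W) gray-scale tensors.
import torch
import torch.nn.functional as F

_LAPLACIAN = torch.tensor([[0., 1., 0.],
                           [1., -4., 1.],
                           [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_loss(content, stylized):
    d_content = F.conv2d(content, _LAPLACIAN, padding=1)
    d_stylized = F.conv2d(stylized, _LAPLACIAN, padding=1)
    return ((d_content - d_stylized) ** 2).sum()
```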

2.3 Light-field coding algorithm based on mask guide

According to the methods introduced in Sections 2.1 and 2.2, the stylized portrait network is constructed based on the radiance field. Since our radiance field is a sparse voxel grid with spherical harmonics, a ray-casting algorithm can be adopted to render the image [29]. Here, a light-field coding algorithm based on a mask guide is proposed to synthesize the light-field encoded image. Traditional light-field image synthesis methods based on virtual views and camera array acquisition usually need to encode a set of virtual multi-view images for the 3D light-field display. However, this approach contains redundant information, which slows down light-field image synthesis. The ray-casting method can therefore significantly improve the rendering efficiency of 3D light-field images.

As shown in Fig. 5, for the 3D light-field display based on a directional backlight in the horizontal direction, the cylindrical lens of each cycle covers only a one-dimensional array of sub-pixels of fixed length $L_n$. These sub-pixels are encoded into discrete viewpoints in the viewing area, and the encoded rays are then cast to calculate the values of the sub-pixels.


Fig. 5. Optical structure and sub-pixel encoding process of 3D light-field display based on directional backlight.


Each cylindrical lens is considered to be an ideal slit located in the main plane, and only one sub-pixel can be seen through the lens. The viewpoint corresponding to the subpixel $p(i,j,k)$ is

$$V_d(m, n)=\left\{\left[3(H-j)+3 i \tan (\theta)+k\right] \bmod L_n-\frac{L_n}{2}\right\} \cdot \frac{V_n}{L_n}$$
where $(W,H)$ is the size of the output synthetic image, $\theta$ is the tilt angle of the cylindrical lens grating, $V_n$ is the total number of viewpoints, $L_n$ is the number of sub-pixels covered by one cylindrical lens period, and $\bmod$ denotes the remainder (modulo) operation. The viewpoint index $V_d(m,n)$ is used to calculate the starting point and direction of the ray to be cast. Finally, the ray-casting algorithm is used to quickly synthesize the light-field image in the constructed sparse voxel grid.

Since the optical structure of the 3D light-field display system is fixed, the viewpoint value of each sub-pixel is unchanged after viewpoint encoding. To further reduce the amount of computation, a mask guide is precomputed to store the viewpoint values of all sub-pixels after encoding. The mask is read into memory before rendering starts, so re-encoding is not required in subsequent renderings. Since this process is performed only once, the algorithm can be applied to different 3D light-field displays, and the rendering efficiency of the encoding stage is the same for different types of 3D light-field displays.
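A minimal NumPy sketch of this precomputation is given below; the display parameters (lens tilt, viewpoint count, lens period) and the cache file name are placeholders rather than the parameters of a specific display.

```python
# Minimal NumPy sketch of precomputing the viewpoint mask of Eq. (8): every sub-pixel
# (i, j, k) of the W x H synthetic image is mapped to a viewpoint index once, and the
# result is cached for all later renders. The display parameters and the file name are
# placeholders, not the values of a specific display.
import numpy as np

def precompute_viewpoint_mask(W, H, theta_deg, V_n, L_n):
    theta = np.deg2rad(theta_deg)
    j, i, k = np.meshgrid(np.arange(H), np.arange(W), np.arange(3), indexing='ij')
    vd = ((3 * (H - j) + 3 * i * np.tan(theta) + k) % L_n - L_n / 2) * V_n / L_n
    return vd                                   # (H, W, 3): one viewpoint value per sub-pixel

mask = precompute_viewpoint_mask(W=3840, H=2160, theta_deg=9.5, V_n=96, L_n=11.6)
np.save("viewpoint_mask_4k.npy", mask)          # reloaded before rendering, no re-encoding needed
```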

3. Implementation and analysis

In this section, the detailed implementation and the computer configuration of the proposed algorithm are presented, and the quality and efficiency improvements are analyzed.

3.1 Method implementation

Our self-built portrait dataset is used. The COLMAP algorithm [30] is used to calibrate the sparse views to obtain the intrinsic and extrinsic parameters of the camera. To represent the radiance field, the recently proposed Plenoxels [13] method is used for its fast reconstruction and rendering speed. To improve the overall speed of the stylization process, the example-based portrait style transfer algorithm is implemented in C++ and encapsulated as a dynamic link library for Python to call. For the style transfer algorithm, the parameters are set following the StyleBlit [20] algorithm. In the optimization phase, the pre-trained VGG-16 network is used to extract the feature maps of the content image and the style image to calculate the content loss, and the pre-trained ResNet-50 network is used to extract feature maps of the content image and the style image to calculate the style loss. Finally, the Laplacian operator is convolved with the content image and the stylized result image to calculate the Laplacian loss. Our total loss function is

$$L_{\text{total }}=\alpha L_{\text{content }}+\beta L_{\text{style }}+\gamma L_{\text{lap }}$$
where $\alpha, \beta, \gamma$ represent the strength of each loss. We set $\alpha$ = 0.001, $\beta$ = 0.005, and $\gamma$ = 10.0. In situations where the output image is severely distorted, $\gamma$ is increased to demand a more faithful stylization. The neural network and the mask-guide-based light-field coding algorithm are programmed with PyTorch and run on an NVIDIA GeForce RTX 3090 GPU. The training time of the network is about 20 minutes, and it takes about 50 ms to synthesize a 4K light-field image and 10 s to synthesize an 8K light-field image.
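For reference, the weighted combination of Eq. (9) can be written as a one-line helper; the default weights follow the values reported above.

```python
# The weighted combination of Eq. (9); the default weights follow the values above,
# and gamma can be raised when the stylized output drifts too far from the content.
def total_loss(content_l, style_l, lap_l, alpha=0.001, beta=0.005, gamma=10.0):
    return alpha * content_l + beta * style_l + gamma * lap_l
```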

3.2 Analysis

In the process of example-based portrait style transfer, there are three main methods: the synthesis algorithm of Futschik, the synthesis algorithm of Fišer, and the synthesis algorithm of Texler. Here, the pre-computation time of the guide images and the synthesis time are used to evaluate the quality and speed of these methods when inserted into our network. Based on the real portrait data constructed in our lab, style transfer images of 1024$\times$768 resolution are synthesized for comprehensive analysis. As shown in Fig. 6(a), compared with the synthesis algorithm of Fišer, the preprocessing time of Texler is longer, but the synthesis time is shorter. The reason is that the algorithm of Texler et al. needs to store the coordinates of the nearest pixel of the content image when the 3D lookup table of the appearance guide is precomputed; to obtain these coordinates, the entire content image must be searched, which is computationally expensive. In the synthesis stage, since only two guidance images are used, the time for computing the error table and feeding it into the StyleBlit [20] algorithm is greatly reduced. We measured the running time of the three methods after plugging each of them into our network separately. As shown in Fig. 6(b), the time of Texler's method in the prediction phase is greatly compressed because it runs in parallel on the GPU. With the same rendering quality, the synthesis algorithm of Texler renders faster. Therefore, the method of Texler is used for portrait style transfer.


Fig. 6. Efficiency tests of the different style transfer methods used to select the adopted method. (a) Time comparison of the different style transfer algorithms; (b) comparison of running time when each style transfer algorithm is inserted into the overall network.


For the process of light-field image synthesis based on the ray-casting algorithm, the encoding of the viewpoint positions in the composite image at different resolutions affects the synthesis speed and memory occupation. Table 1 reports the rendering speed with and without our mask-guided optimization method during the generation of 3D light-field images at different resolutions. At the same resolution, both methods are roughly the same in speed. However, when rendering high-resolution 3D light-field images, the rendering speed without the optimization method drops significantly, while the rendering speed with the optimization method does not change significantly, although the memory occupation increases. In general, the proposed mask-guided optimization method shows a significant advantage in the rendering speed of high-resolution light-field images.


Table 1. Rendering speed with and without the proposed mask-guided optimization method when generating 3D light-field images at different resolutions

4. Experiments

Qualitative and quantitative comparisons with baseline methods are carried out to evaluate our approach. Stylization results are analyzed for a variety of real portrait scenes guided by different style images. The experimental results show that our method significantly outperforms the baseline methods, producing high-quality portrait stylization results while maintaining recognizable semantic and geometric features of the original scene. Our 3D portrait stylization results are also shown on a 3D light-field display.

Datasets. Extensive experiments on multiple real portrait scenes are demonstrated. First, a camera is used to capture images for the self-built portrait dataset; diversity of viewpoints and scales of the same scene is obtained by changing the camera angle and shooting distance during capture. Finally, COLMAP is used to calibrate the dataset on which the network is trained and tested. We also tried a series of style images with different styles to test the ability of our method to handle various style exemplars. In our tests, our method only requires a dataset of approximately 100 images to achieve good results.

Baselines. Our method is compared with state-of-the-art methods [1,15] for 3D style transfer quality. Specifically, Huang et al. [15] take a pre-trained standard NeRF and replace its color prediction module with a style network; by introducing a consistency loss, the prior knowledge of the spatial consistency of NeRF is distilled into the 2D stylization network to finally obtain a stylized NeRF. The method of Zhang et al. [1] uses the radiance field of an already well-reconstructed real scene and stylizes the 3D scene using the style of a given 2D image; stylization is achieved by fine-tuning the existing radiance field with their proposed nearest neighbor feature matching style loss. For both methods, their published code and our self-built portrait dataset are used. From the results, we achieve better quality in portrait style transfer.

Quantitative comparison. We evaluate the quality of our results using the Fréchet Inception Distance (FID) [31], a common metric for measuring the visual similarity and distribution discrepancy between two sets of images. Additionally, we compare our method with the baseline approaches using PSNR, SSIM [32] and LPIPS [33]. As shown in Table 2, our method not only generates more realistic details, achieving the lowest FID value, but also outperforms the baseline methods on the other metrics.


Table 2. Quantitative comparison of our method and two state-of-the-art approaches evaluated by four metrics (i.e., FID, PSNR, SSIM and LPIPS)
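For reproducibility, a hedged sketch of evaluating PSNR, SSIM, and LPIPS on a pair of images is given below, assuming scikit-image and the lpips package are available; FID is computed over whole sets of images with an off-the-shelf implementation and is omitted here.

```python
# Hedged sketch of the per-image metrics used above, assuming scikit-image and the
# `lpips` package are installed.
import torch
import lpips
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # perceptual distance network

def compare(pred, gt):
    # pred, gt: (H, W, 3) uint8 images
    psnr = peak_signal_noise_ratio(gt, pred)
    ssim = structural_similarity(gt, pred, channel_axis=-1)
    to_t = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_fn(to_t(pred), to_t(gt)).item()
    return psnr, ssim, lp
```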

Qualitative comparisons. A visual comparison between our method and the baselines is given in Fig. 7. Visually, our results exhibit a better style match to the exemplar image than the baselines. In the portrait scene, our method generates higher-quality portrait style transfer results, while the facial details in the baseline method of Zhang et al. are seriously lost. In contrast, our method effectively re-establishes and preserves the geometric and semantic content of the original scene, thanks to the example-based style transfer method and the introduced Laplacian optimization. Our method is robust to different portrait scenes and also generates consistently superior results under a variety of styles.


Fig. 7. Comparison with the baseline methods Zhang K et al. [1] and Huang et al. [15] on real-world portrait data. Our results match both the colors and details of the style image most faithfully, while ensuring the correct transfer of face features. Zhang K et al.'s method migrates facial features (such as eyes) to incorrect positions.


Presenting on the 3D light-field display. The virtual viewpoint generation method can generate a series of virtual views with horizontal parallax by adjusting the camera pose for the 3D light-field display. While other methods require a sequence of views to be pre-generated before synthesizing 3D images, our method directly synthesizes 3D light-field images through the mask-guided encoding method, which brings a prominent improvement in generation speed. Some 3D images presented on our 8K 3D light-field display are shown in Fig. 8. The created portrait dataset is used for the 3D light-field display, from which we can observe correct occlusion of the stylized portrait with smooth motion parallax. The results prove that the proposed method can synthesize high-quality light-field images of stylized portraits, and the problem of low-quality stylized portrait scenes on the 3D light-field display is addressed. Please refer to the demonstration videos in Visualization 1 and Visualization 2.


Fig. 8. The results of 3D light-field display.


User study. A user study is carried out to compare our method to the baseline methods. A user is presented with a sequence of stylization results shown on the light-field display; for each result, the user is shown a style image, an image of the original portrait scene, and two corresponding stylized portraits produced with our method and a baseline method. The user is then asked to select the result that better matches the style of the given style image. 50 users are invited to rate the generated results. Users prefer our method over the baseline of Huang et al. [15] 86.8% of the time, and over the baseline of Zhang et al. [1] 94.1% of the time. These results show a clear preference for our method.

Limitations. There are a few limitations to our method. First, to ensure the speed of the rendering process, an obvious limitation of the example-based style transfer method is that hair is not stylized. Second, the example-based style transfer approach has limitations similar to other techniques based on patch-guided synthesis: the style exemplar needs to have a scale compatible with the target image, otherwise artifacts may appear. Considering these limitations, one potential approach is to employ techniques such as semantic segmentation to stylize the hair in portraits separately. Additionally, incorporating a portrait deformation field would enable training with multiple portrait styles in a single training process, thereby obtaining diverse stylized results for portraits.

5. Conclusion

In summary, stylized portrait radiance fields are reconstructed from photorealistic radiance fields given user-specified style exemplars. With the reconstructed stylized portrait radiance field and our proposed mask-guided method, high-quality light-field images can be synthesized. The key to the success of our method is the introduction of an example-based portrait style transfer algorithm in the style transfer stage and of the Laplacian loss to constrain the contour details of the person in the optimization stage. In addition, a mask-guided method is used in the light-field encoding stage to optimize the efficiency of high-resolution light-field image synthesis. Portrait datasets are constructed to analyze the effectiveness of the proposed method. High-quality 3D light-field stylized portraits are synthesized, and the experimental results demonstrate the effectiveness of the proposed method. We believe that our method will be widely applied to stylized content synthesis for 3D light-field displays.

Funding

Fundamental Research Funds for the Central Universities (2022RC11); National Natural Science Foundation of China (62075016, 62175017); Beijing Municipal Science and Technology Commission; Administrative Commission of Zhongguancun Science Park (Z221100006722023).

Disclosures

The authors declare no conflicts of interest. This work is original and has not been published elsewhere.

Data Availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. K. Zhang, N. Kolkin, S. Bi, F. Luan, Z. Xu, E. Shechtman, and N. Snavely, “Arf: Artistic radiance fields,” in Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXI, (Springer, 2022), pp. 717–733.

2. X. Wang and X. Tang, “Face photo-sketch synthesis and recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 31(11), 1955–1967 (2009). [CrossRef]  

3. N. Kolkin, J. Salavon, and G. Shakhnarovich, “Style transfer by relaxed optimal transport and self-similarity,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 10051–10060.

4. D. Kotovenko, A. Sanakoyeu, P. Ma, S. Lang, and B. Ommer, “A content transformation block for image style transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 10032–10041.

5. Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M.-H. Yang, “Universal style transfer via feature transforms,” Adv. neural information processing systems 30 (2017).

6. D. Kotovenko, A. Sanakoyeu, P. Ma, S. Lang, and B. Ommer, “A content transformation block for image style transfer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2019), pp. 10032–10041.

7. H.-P. Huang, H.-Y. Tseng, S. Saini, M. Singh, and M.-H. Yang, “Learning to stylize novel views,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 13869–13878.

8. F. Mu, J. Wang, Y. Wu, and Y. Li, “3d photo stylization: Learning to generate stylized novel views from a single image,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 16273–16282.

9. K. Yin, J. Gao, M. Shugrina, S. Khamis, and S. Fidler, “3dstylenet: Creating 3d shapes with geometric and texture style variations,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 12456–12465.

10. O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2mesh: Text-driven neural stylization for meshes,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 13492–13502.

11. H. Wang, B. Yan, X. Sang, D. Chen, P. Wang, S. Qi, X. Ye, and X. Guo, “Dense view synthesis for three-dimensional light-field displays based on position-guiding convolutional neural network,” Opt. Lasers Eng. 153, 106992 (2022). [CrossRef]  

12. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Commun. ACM 65(1), 99–106 (2022). [CrossRef]  

13. S. Fridovich-Keil, A. Yu, M. Tancik, Q. Chen, B. Recht, and A. Kanazawa, “Plenoxels: Radiance fields without neural networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 5501–5510.

14. L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 2414–2423.

15. Y.-H. Huang, Y. He, Y.-J. Yuan, Y.-K. Lai, and L. Gao, “Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), pp. 18342–18352.

16. J. Fišer, O. Jamriška, M. Lukáč, E. Shechtman, P. J. Asente, J. Lu, and D. Sýkora, “Illumination-guided example-based stylization of 3d renderings,” (2018). US Patent 9,881,413.

17. O. Jamriška, J. Fišer, P. Asente, J. Lu, E. Shechtman, and D. Sỳkora, “Lazyfluids: Appearance transfer for fluid animations,” ACM Trans. Graph. 34(4), 1–10 (2015). [CrossRef]  

18. P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), pp. 1125–1134.

19. J. Fišer, O. Jamriška, D. Simons, E. Shechtman, J. Lu, P. Asente, M. Lukáč, and D. Sỳkora, “Example-based synthesis of stylized facial animations,” ACM Trans. Graph. 36(4), 1–11 (2017). [CrossRef]  

20. D. Sỳkora, O. Jamriška, O. Texler, J. Fišer, M. Lukáč, J. Lu, and E. Shechtman, “Styleblit: Fast example-based stylization with local guidance,” in Computer Graphics Forum, (Wiley Online Library, 2019), 2, pp. 83–91.

21. A. Texler, O. Texler, M. Kučera, M. Chai, and D. Sỳkora, “Faceblit: instant real-time example-based style transfer to facial videos,” Proc. ACM Comput. Graph. Interact. Tech. 4(1), 1–17 (2021). [CrossRef]  

22. Y. Li, X. Sang, S. Xing, Y. Guan, S. Yang, and B. Yan, “Real-time volume data three-dimensional display with a modified single-pass multiview rendering method,” Opt. Eng. 59(10), 102412 (2020). [CrossRef]  

23. N. Kolkin, M. Kucera, S. Paris, D. Sykora, E. Shechtman, and G. Shakhnarovich, “Neural neighbor style transfer,” arXiv, arXiv:2203.13215 (2022). [CrossRef]  

24. S. Mishra and J. Granskog, “Clip-based neural neighbor style transfer for 3d assets,” arXiv, arXiv:2208.04370 (2022). [CrossRef]  

25. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 770–778.

26. K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv, arXiv:1409.1556 (2014). [CrossRef]  

27. C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “Maskgan: Towards diverse and interactive facial image manipulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), pp. 5549–5558.

28. S. Li, X. Xu, L. Nie, and T.-S. Chua, “Laplacian-steered neural style transfer,” in Proceedings of the 25th ACM international conference on Multimedia, (2017), pp. 1716–1724.

29. S. Chen, B. Yan, X. Sang, D. Chen, P. Wang, Z. Yang, X. Guo, and C. Zhong, “Fast virtual view synthesis for an 8k 3d light-field display based on cutoff-nerf and 3d voxel rendering,” Opt. Express 30(24), 44201–44217 (2022). [CrossRef]  

30. J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), pp. 4104–4113.

31. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Adv. neural information processing systems 30 (2017).

32. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]  

33. R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 586–595.

Supplementary Material (2)

Visualization 1: The video demonstrates the effect of stylized portraits on a light-field display.
Visualization 2: The video demonstrates the effect of stylized portraits on a light-field display.


