LEIFR-Net: light estimation for implicit face relight network

Zhiru Li; Chenchu Rong; Yuanqing Wang

doi:10.1364/OE.510060

1. Introduction

Face relighting, a sophisticated task in image processing, focuses on enhancing the lighting of facial images captured in suboptimal conditions to achieve a more natural and aesthetically appealing appearance. By accurately estimating the illumination’s distribution and intensity based on the image’s background and surrounding environment, this technique enables the creation of facial images that look more lifelike across various settings. Its widespread applications span numerous fields, including face analysis, medical diagnostics, synthetic reality, and the film industry. Moreover, face relighting significantly improves tasks like facial keypoint detection, face recognition, and face editing. The process involves two primary steps: deducing the illumination from the environment around the face and rendering the facial image under the contextually reconstructed environmental lighting.

Light estimation generally relies on manually set prior frameworks or utilizes functions and networks for implicit representation. Estimating the illumination from a single image is a crucial part of face relighting and has been widely explored in the fields of computer vision and computer graphics. Existing research either regresses lighting parameters or generates illuminance maps that are often difficult to optimize or prone to inaccurate predictions [3,4].

In this work, we propose a representation method for estimating global illumination from a single environmental image. In our approach, natural lighting is parameterized in terms of position, intensity, color, and delay factor to effectively characterize illumination without redundancy. Utilizing this parameterization, we employ an encoder to extract lighting information from the scene images, which is then integrated into the subsequent relighting pipeline.

Traditional face relighting has predominantly relied on re-rendering techniques grounded in fundamental physics pipelines, necessitating extensive prior knowledge and often being constrained to specific scenarios, thus demonstrating a lack of robustness. The prevailing approach involves training models through end-to-end deep neural networks, as exemplified in the works of Sun et al. [5] and Wang et al. [6]. These methodologies, which require substantial real-world datasets, encounter significant challenges in terms of data acquisition and privacy concerns. In contrast, our methodology is inspired by the work of Yeh et al. [7] and innovatively employs stable diffusion techniques from open-source models, as referenced in [8], in conjunction with ControlNet [9]. This approach enables the creation of a diverse and comprehensive synthetic dataset, characterized by a wide array of lighting conditions, effectively overcoming the data acquisition and privacy challenges inherent in traditional face relighting methods.

Additionally, we propose an innovative method for face relighting that employs pixel-aligned implicit functions to integrate disentangled facial geometric textures with environmental illumination for image pixel value prediction. This concept builds upon the principles set forth by Ronneberger et al. [10]. Our method focuses on extracting texture and geometric information from facial features and fusing it with global illumination. This process is facilitated using a U-Net architecture. Subsequently, we extract deep features of facial geometry and texture. By merging these local features with global lighting information, our model can precisely predict the RGB values for each pixel. This integration significantly enhances the model’s robustness and overall performance across diverse samples. As Fig. 1 shows, our approach represents a significant advancement in the field of face relighting, offering a more efficient and versatile solution compared to traditional techniques.

Fig. 1. LEIFR-Net is capable of extracting lighting from environmental backgrounds and relighting input albedos or real faces. The two columns on the left show the results of relighting using albedo data obtained from Sfsnet [1], while the two columns on the right display real faces from the FFHQ [2] dataset and their relighted outputs.

Download Full Size | PDF

Therefore, our main contributions can be summarized as follows:

• We develop a novel parametric model for outdoor environmental lighting, as well as a methodology for estimating this representation from a single image.
• We propose a network called LEIFR-Net, which utilizes the fusion of global light feature and local objects features for implicit relighting of human faces.
• We constructed a paired dataset of synthetic faces, outdoor environments, and lighting structures using stable diffusion, which is used to train the facial relighting network model under different outdoor environments.

2. Related work

2.1 Light estimation and representation

light estimation and representation is fundamental to achieving facial image relighting and is crucial for realism [12,13]. Some studies [1,14] use spherical harmonics for lighting representation, but this method is limited to low-frequency illumination. Shi et al. [15] and Zhu et al. [16] have proposed novel methods for light field compression and representation. Aslan et al. [17] introduced a new technique for high-quality and high-fidelity implicit representation of light fields on continuously defined viewpoints. Environmental mapping [7,14,18–20] offers a comprehensive lighting representation, yet it poses challenges in editing.

2.2 Diffusion model

Diffusion models, exemplified by stable diffusion [21], represent the new state-of-the-art in deep generative models, surpassing previous leaders such as GANs in image generation tasks [8,17]. They have demonstrated remarkable performance in various fields, including computer vision, NLP, and more. Despite their advancements, these models face challenges like slow sampling speed and limited generalization to different data types. Recent efforts, including Zhang et al.’s ControlNet [9], have focused on enhancing large-scale diffusion models to support conditional inputs like edge and segmentation maps for finer image generation control. Cai et al. [22] explored self-supervised learning for universal facial representations, while Li et al. [23] introduced BLIP-Diffusion for theme-driven image generation with multi-modal control. Zhao et al. [24] presented Uni-ControlNet for more flexible local and global control in image generation. In this study, we leverage stable diffusion and ControlNet to create our synthetic environment, lighting, and facial dataset, showcasing the practical applications of these advancements in deep generative modeling.

2.3 Face relighting

Face relighting starts by separating the face from the background [25–27]. After separating the face, it is necessary to re-illuminate the face based on the given lighting conditions. A deep learning framework is proposed [28]to normalize unconstrained facial images, removing the viewpoint distortion and re-illuminating them to an even lighting environment, while predicting frontal and neutral faces. A physics-based portrait relighting method is applied to generate a large-scale and high-quality "in-the-wild" face relighting dataset (DPR) [14]. LeGendre et al. [23] introduce a learning-based technique to estimate high dynamic range (HDR) omni-directional lighting from a single low dynamic range (LDR) portrait image captured under any indoor or outdoor lighting conditions, and they also generate a rich photo collection by recording the reflection field and alpha matting of 70 different subjects under various expressions using an optical stage. However, the aforementioned methods require a large amount of dataset and the construction of a physical platform, resulting in high acquisition costs and resource requirements. To overcome this limitation, a new method is proposed by Yeh et al. [7] that enables the acquisition of face relighting datasets on a synthetic optical stage, achieving a performance comparable to state-of-the-art relighting methods. Inspired by this work, we perform lighting estimation and face relighting using a synthetic environment dataset generated by a diffusion model.

2.4 Pixel-aligned implicit function

PIFu (pixel-aligned implicit function) [29] is a pixel-aligned implicit function approach. The paper proposes an implicit representation that associates the pixels in a 2D image with the 3D information of the corresponding human body. He et al. [30] proposed Geo-PIFu, a method to reconstruct 3D meshes from monocular color images of clothed people. Li et al. [31] introduced the first method for real-time capture of volumetric performance and novel view rendering from monocular videos, without the need for expensive multi-view systems or cumbersome personalized template models. Chan et al. [32] proposed three new strategies to incorporate parameterized body models into pixel-aligned implicit models for single-view clothed human reconstruction. Motivated by this, we also adopt the implicit function representation for pixel-aligned relighting of facial images by estimating global illumination.

3. Method

Our method consists of two stages: In the first stage, we extract light embedding from single scene image. We represent the information of light by the position, intensity, color and the delay factor for light attenuation of the point light sources. Additionally, we learn an encoder to embed the light source information in the image. In the second stage, we input the light embedding and facial features, and aggregate global and local features using pixel-aligned functions to generate relighted facial images. These relighted facial images are supervised using the corresponding synthesized faces.

3.1 Light modeling

In previous research, light source information is typically indicated through input images. However, images exhibit significant variations and may contain ambiguous lighting information, leading to unclear illumination cues. To address this issue, we have modeled point light sources, offering a simpler and more compact representation compared to light source images. We will elaborate on this in the following sections.

For point light sources, the intensity of emitted light typically decreases with increasing distance. This attenuation often follows the inverse square law, where the intensity of light is inversely proportional to the square of the distance. This is based on the principle of energy conservation: when light radiates from a point light source, it distributes equal energy across any sphere of arbitrary size passed through. Therefore, the intensity of light on a larger spherical surface must be lower, as the energy is distributed over a larger area.

Mathematically, the intensity $I$ of a point light source at distance $r$ can be modeled as:

(1)$$I(r) = \frac{I_0}{r^2}$$

$I_0$ is the initial intensity or brightness of a light source. $r$ is the distance from the light source to the observation point. This assumption applies in 3D space. In 2D image space, due to the projection relationship, the intensity of light will be superimposed along the depth axis, causing the inverse square distance relationship to become a linear inverse distance relationship. Additionally, since the depth of a point light source is unknown in the image, but the intensity of light is superimposed due to the projection relationship, a light source with a greater depth will experience greater intensity attenuation on the pixels. This is because at the same pixel distance, a greater depth corresponds to a larger actual spatial distance. We use the delay factor $\delta$ to characterize this property, which provides some information about the depth of the light source.

The Euclidean distance $d$ between a light source position $\mathbf {p_s}$ and the pixels $\mathbf {p_x}$ through which the light passes can be represented as:

(2)$$d = \sqrt{(\mathbf{p_s} - \mathbf{p_x})^2}$$

Based on this distance, we can define the increase in intensity $\Delta I_p$ of pixel $p$ due to this ray of light:

(3)$$\Delta I_p = \begin{cases} I_s \times (1 - \delta)/d & \text{if } d > d_r \\ kI_s \times (1 - \delta)/d & \text{if } d \leq d_r \end{cases}$$

where $I_s$ represents the intensity of the light source.$k$ and $d_r$ signify the rapid attenuation of light intensity at a certain distance from the light source, taking into account the refraction of light in reality.

3.2 Scene2Light

In the first stage, we use randomly generated light source scattering images as cues and generate corresponding outdoor scene images with rich objects that satisfy the light source images. We map the position, intensity, color and delay factor of generated light source scattering images to an embedding. We apply downsampling and a final fully connected layer to the scene images to generate an embedding that represents the light sources. This embedding is supervised with the ground truth to learn a compact high-dimensional representation of the light sources. Considering the sparse dependency of scene light sources on image patches, we applied a 50% random mask during the experiment, hoping that the network would learn a representation of global information. We did not directly use the transformer block because we assume that the variation of light source information is not complex, and transformers provide more flexible inductive bias which could make the network overfit. By using a simple convolutional neural network, we achieved accurate light source estimation on our synthetic dataset.

(4)$$\mathbf{LE} = f_\phi(\mathbf{p},I_s,\delta,\mathbf{c})$$

where $\mathbf {LE}$ represents the embeding of the light source.

3.3 Face relighting

In the second stage, our aim is to extract feature maps from facial images that are independent of lighting and integrate them with global lighting information. We model the task of relighting by aggregating facial texture and geometry and global lighting information through implicit functions. Specifically, we use the well-established diffuse facial map, albedo, provided by SFSNet [1] and process it through a U-Net to obtain local structural features. These features are then combined with the globally learned representation of scene lighting obtained in the previous stage. The pixel-level feature vectors are passed through a multi-layer perception to predict the final RGB values for each pixel position. Additionally, to provide semantic information about the pixel’s relationship to the entire face and enable the network to extract occlusion information directly from the face, we also input the pixel’s distance to the 68 facial landmarks. In the end, our network produces reasonable relighted results.

3.4 Network architecture

In our network architecture, the first stage involves utilizing lighting attributes generated to produce an illumination map using the Stable Diffusion method. As shows in Fig. 2, With this generated scene illustration as input, the image is partitioned into multiple patches, and 50% of the patches are randomly discarded. The remaining patches are individually processed through an encoder to obtain feature vectors, which are subsequently fused. Our encoder encompasses several residual down-sampling layers and a final fully connected layer. The fused features are used to predict lighting attributes through an MLP. Figure 3 illustrates the fundamental process of our implicit face relighting. The facial albedo and the scene information are combined and input into a U-Net network, yielding structural and textural details of the face, along with scene-specific color temperature information. The feature vector corresponding to each pixel in the resultant feature map is combined with global illumination and semantic facial position information, and is then passed through an MLP to obtain the RGB feature of the pixel’s output image. This stage is referred to as the LEIFR Net. In introducing semantic position information, we opted not to employ the conventional absolute positional encoding used by generic implicit functions. Instead, we leveraged the distances of each pixel from the sixty-eight facial keypoints [33], thereby incorporating prior knowledge regarding facial occlusions.

Fig. 2. Pipenline of Generating Synthetic Dataset and Estimating Lighting. The first stage involves utilizing lighting attributes generated to produce an illumination map using the Stable Diffusion method [11].

Download Full Size | PDF

Fig. 3. The pipeline of our implicit face relighting. The facial albedo and the scene information are combined and input into a U-Net network, yielding structural and textural details of the face, along with scene-specific color temperature information. The feature vector corresponding to each pixel in the resultant feature map is combined with global illumination and semantic facial position information, and is then passed through an MLP to obtain the RGB feature of the pixel’s output image. We leveraged the distances of each pixel from the sixty-eight facial keypoints [33], thereby incorporating prior knowledge regarding facial occlusions.

Download Full Size | PDF

Ultimately, each pixel is processed in parallel through the LeiFR Net, and the final output is supervised against the relit face produced by our Stable Diffusion method.

3.5 Loss function and training

In the first stage, we use M to represent the indices of n patches that have not been randomly masked, and $f_l$ represents the network that estimates light source information from scene images. Therefore, the loss function for the scene2light stage can be represented as:

(5)$$\mathcal{L}_{recon} = |f_l(X_{scene} * M) -f_\phi(\mathbf{p},I_s,\delta,\mathbf{c}).detach()|$$

In the second stage, we first obtain feature maps $X_F$ from the input albedo image and the scene.

(6)$$X_F = f_\theta(X_{albedo},X_{scene})$$

then extract pixel-aligned local features from the features, merge them with the global lighting feature, use the distance map to provide pixel position information, and obtain the following relighting expression:

(7)$$X_R(\mathbf{p}) = f_{r}(X_F(\mathbf{p}),LE,d(\mathbf{p}))$$

(8)$$d_{k}(i,j) = \sqrt{(i - i_k)^2 + (j - j_k)^2}$$

where $i_k, j_k$ are the coordinates of the k-th keypoint, $d$ is a k-dimensional map that each pixel represents the distance of the pixel at that location from the facial keypoint. Our loss function is written as:

(9)$$\mathcal{L}_{relight} = \|{X_R - \hat{X_R}}\|_2$$

4. Experiment

We demonstrate the high-quality facial relighting capabilities of LEIFR-Net through extensive comparative experiments on our synthesized dataset. Additionally, we explore the model’s generalization ability under various light sources beyond point sources, such as spotlights and disco lights, and conduct qualitative experiments on real-world outdoor data. Finally, we present an ablation study to highlight the advantages of our network design.

4.1 Experiment environment

In the first stage of our light estimation, a dataset of 680 images was utilized for training. Training was conducted on an NVIDIA RTX 4090 graphics card over 2000 epochs, with a batch size set to 16. Our LEIFR network, comprising 2091 images, underwent training on a single NVIDIA RTX 4090 graphics card for 2000 epochs over the course of two days, maintaining a batch size of 16.

4.2 Dataset and implementation details

Our dataset originates from distilled information of the open-source large model stable diffusion, from which a synthetic illuminated dataset with reasonable lighting was extracted using the ControlNet framework. All the albedos we use are from Sfsnet [1]. Due to the natural occurrence of the Diderot effect in real-world scenes, linear light provides cues for scenes that conform to the lighting patterns depicted in light source images. These patterns are further combined with randomly generated scene prompts and fixed-quality image description prompts. Table 1 shows the comparisons between our datasets and others.

Table 1. Comparison of our dataset between others

View Table | View all tables in this article

Specifically, we employed light source images to generate scene illustrations. Subsequently, these scene illustrations were used to extract prompts which, when combined with the facial Canny map, ensured data integrity without any information leakage. As a result, we obtained a paired synthetic dataset that facilitated our supervised training process. Figure 4 shows the parts of out dataset and relighted results. The results demonstrate that the details of highlights and shadows in our output are pronounced and in accordance with the scene’s lighting conditions.

Fig. 4. Evaluations of our synthetic dataset on three different methods. All the albedos we use are from Sfsnet [1]. Compared with SIPR-S [5] and TR [18], our method achieves the most environment-aligned facial relighting effects, closely resembling the ground truth.

Download Full Size | PDF

4.3 Comparisons with state-of-the-art methods

To further demonstrate the relighting effects of LEIFR-Net, we compared it with state-of-the-art methods SIPR-S [5] and total relighting (TR) [18] in environment map-based relighting. Since these methods have not released their official code or models, we re-implemented them based on their principles. We used an existing image matting method [35] to extract portrait foregrounds and blend them with new backgrounds. Fig. 4 shows the evaluations of our synthetic dataset on three different methods. The results demonstrate that the details of highlights and shadows in our output are pronounced and in accordance with the scene’s lighting conditions.Our method generates more realistic results and more faithfully preserves the target lighting.

We also conducted a quantitative assessment for each method. We compared the mean squared error (MSE), learned perceptual image patch similarity (LPIPS) [36], and the pixel similarity metrics SSIM ] [37] and PSNR between the relit images generated by various methods and the real images. Additionally, we used Deg [38] to evaluate the identity preservation capability. We used a virtual paired test dataset of 561 pairs synthesized from stable diffusion, each consisting of an RGB face, albedo, scene, and rendered ground truth (GT), none of which were used during training. We report the quantitative results in Table 2 and provide comparisons. Our method demonstrates superior performance in terms of perceptual quality, identity preservation, and image similarity.

Table 2. Quantitative comparsion with state-of-the-art methods

View Table | View all tables in this article

4.4 Evaluations with real scenarios and other light sources

We demonstrated the capability of our method for outdoor portrait relighting using images from the FFHQ dataset [2]. The qualitative results are shown in Fig. 5. We selected a subset of images from the dataset as real standards, extracted lighting maps from them, and then used other dataset faces to perform relighting based on these environment maps. LEIFR-Net generated high-quality relit results with convincing lighting effects. Compared to TR [18], our results more closely match the lighting effects seen in the reference images. Against SIPR-S [5], our results were more robust and free of noticeable artifacts.

Fig. 5. Evaluation with real images from FFHQ [2]. LEIFR-Net generated high-quality relit results with convincing lighting effects.

Download Full Size | PDF

Additionally, as Fig. 6 shows, we tested our method in various types of lighting environments, such as spotlights and disco lights. Our method demonstrated strong robustness and adaptability to these diverse lighting sources.

Fig. 6. Evaluation with different light sources.

Download Full Size | PDF

4.5 Elimination of environmental pseudo-artifacts

Figure 7 illustrates that due to the inherent instability of generative models, pseudo-artifacts arising from the environment are present within the dataset. However, our model’s output exhibits an absence of these pseudo-artifacts. This phenomenon is attributed to the noise2noise principle [39], wherein network predictions tend to gravitate towards an average. The pseudo-artifacts present in the ground truth data, originating from a stable diffusion data distribution, essentially constitute normally distributed noise. Consequently, during the training process, the tendency for outputs to average out serves to mitigate these pseudo-artifacts. Furthermore, this behavior is intrinsically tied to the architecture of our network, which employs pixel-aligned strategies, signifying pixel-level semantics. Information is thus sourced from localized pixel features and global illumination characteristics.

Fig. 7. Comparison of enlarged facial texture details between ground truth and our network output. It is evident that ground truth images generated using stable diffusion tend to exhibit pseudo-artifacts stemming from the environment, while our network, through the integration of global and local features, effectively eliminates such pseudo-artifacts.

Download Full Size | PDF

4.6 Ablation study

We now provide ablation studies to demonstrate the benefit of our three designs in LEIFR-Net. All results are evaluated on our test set using synthetic dataset. As Table 3 shows, each of the design significantly improves performance.

Table 3. Results of ablation study

View Table | View all tables in this article

5. Conclusion

In this study, we introduce a light estimation for implicit face relight network, named LEIFR-Net. Firstly, we propose a method to estimate global illumination from a single image. Subsequently, we present an approach to model structurally disentangled relighting of faces using pixel-aligned implicit functions. Moreover, we construct a paired synthetic dataset encompassing synthetic faces, outdoor environments, and lighting structures through the use of stable diffusion. By comparing our approach with alternative methods and testing our model on real datasets, our results demonstrate superior facial relighting effects over existing methods known to us, characterized by highlights and shadows that more seamlessly integrate with the environment.

Funding

National Key Research and Development Program of China (2022YFB3606600).

Acknowledgment

The authors would like to acknowledge the contributions of Nanjing University stereo imaging technology (SIT) laboratory 3D imaging team, Nanjing University, and the authors also would like to thank the reviewers of this paper.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are available in Ref. [1] and Ref. [2]. The code is available at [40].

References

1. S. Sengupta, A. Kanazawa, C. D. Castillo, et al., “Sfsnet: Learning shape, reflectance and illuminance of facesin the wild,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 6296–6305.

2. T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks’,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2019), pp. 4401–4410.

3. K. Karsch, V. Hedau, D. Forsyth, et al., “Rendering synthetic objects into legacy photographs,” ACM Trans. Graph. 30(6), 1–12 (2011). [CrossRef]

4. F. Zhan, C. Zhang, Y. Yu, et al., “Emlight: Lighting estimation via spherical distribution approximation,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35 (2021), pp. 3287–3295.

5. T. Sun, J. T. Barron, Y.-T. Tsai, et al., “Single image portrait relighting,” ACM Trans. Graph. 38(4), 1–12 (2019). [CrossRef]

6. Z. Wang, X. Yu, M. Lu, et al., “Single image portrait relighting via explicit multiple reflectance channel modeling,” ACM Trans. Graph. 39(6), 1–13 (2020). [CrossRef]

7. Y.-Y. Yeh, K. Nagano, S. Khamis, et al., “Learning to relight portrait images via a virtual light stage and synthetic-to-real adaptation,” ACM Trans. Graph. 41(6), 1–21 (2022). [CrossRef]

8. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems33, 6840–6851 (2020).

9. L. Zhang and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” arXiv, arXiv:2302.05543 (2023). [CrossRef]

10. O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, (Springer, 2015), pp. 234–241.

11. P. Dhariwal and A. Nichol, “Diffusion models beat gans on image synthesis,” Advances in Neural Information Processing Systems (NeurIPS 2021)34, 8780–8794 (2021).

12. J. Eckhard, T. Eckhard, E. M. Valero, et al., “Outdoor scene reflectance measurements using a bragg-grating-based hyperspectral imager,” Appl. Opt. 54(13), D15–D24 (2015). [CrossRef]

13. Y. Wei, P. Han, F. Liu, et al., “Estimation and removal of backscattered light with nonuniform polarization information in underwater environments,” Opt. Express 30(22), 40208–40220 (2022). [CrossRef]

14. H. Zhou, S. Hadap, K. Sunkavalli, et al., “Deep single-image portrait relighting,” in Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 7194–7202.

15. J. Shi and C. Guillemot, “Distilled low rank neural radiance field with quantization for light field compression,” arXiv, arXiv:2208.00164 (2022). [CrossRef]

16. H. Zhu, H. Wang, and Z. Chen, “Minl: Micro-images based neural representation for light fields,” arXiv, arXiv:2209.08277 (2022). [CrossRef]

17. S. Aslan, B. Y. Feng, and A. Varshney, “View correspondence network for implicit light field representation,” arXiv, arXiv:2305.06233 (2023). [CrossRef]

18. R. Pandey, S. O. Escolano, C. Legendre, et al., “Total relighting: learning to relight portraits for background replacement,” ACM Trans. Graph. 40(4), 1–21 (2021). [CrossRef]

19. R. Wu, Z. Feng, Z. Zheng, et al., “Design of freeform illumination optics,” Laser Photonics Rev. 12(7), 1700310 (2018). [CrossRef]

20. R. Wu, L. Yang, Z. Ding, et al., “Precise light control in highly tilted geometry by freeform illumination optics,” Opt. Lett. 44(11), 2887–2890 (2019). [CrossRef]

21. R. Rombach, A. Blattmann, D. Lorenz, et al., “High-resolution image synthesis with latent diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, (2022), pp. 10684–10695.

22. Z. Cai, S. Ghosh, K. Stefanov, et al., “Marlin: Masked autoencoder for facial video representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2023), pp. 1493–1504.

23. D. Li, J. Li, and S. C. Hoi, “Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing,” arXiv, arXiv:2305.14720 (2023). [CrossRef]

24. S. Zhao, D. Chen, Y.-C. Chen, et al., “Uni-controlnet: All-in-one control to text-to-image diffusion models,” arXiv, arXiv:2305.16322 (2023). [CrossRef]

25. X. Shen, A. Hertzmann, J. Jia, et al., “Automatic portrait segmentation for image stylization,” in Computer Graphics Forum, vol. 35 (Wiley Online Library, 2016), pp. 93–102.

26. H.-Y. Chang and T.-H. Lin, “Portrait imaging relighting system based on a simplified photometric stereo method,” Appl. Opt. 61(15), 4379–4386 (2022). [CrossRef]

27. J. Zhang, X. Chen, W. Tang, et al., “Single image relighting based on illumination field reconstruction,” Opt. Express 31(18), 29676–29694 (2023). [CrossRef]

28. K. Nagano, H. Luo, Z. Wang, et al., “Deep face normalization,” ACM Trans. Graph. 38(6), 1–16 (2019). [CrossRef]

29. S. Saito, Z. Huang, R. Natsume, et al., “Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization,” in Proceedings of the IEEE/CVF international conference on computer vision, (2019), pp. 2304–2314.

30. T. He, J. Collomosse, H. Jin, et al., “Geo-pifu: Geometry and pixel aligned implicit functions for single-view human reconstruction,” Adv. Neural Inf. Process. Syst. 33, 9276–9287 (2020). [CrossRef]

31. R. Li, Y. Xiu, S. Saito, et al., “Monocular real-time volumetric performance capture,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, (Springer, 2020), pp. 49–67.

32. K. Chan, G. Lin, H. Zhao, et al., “S-PIFu: Integrating parametric human models with PIFu for single-view clothed human reconstruction,” Advances in Neural Information Processing Systems35, 17373–17385 (2022).

33. S. Colaco and D. S. Han, “Facial keypoint detection with convolutional neural networks,” in 2020 International Conference on Artificial Intelligence in Information and Communication (ICAIIC), (IEEE, 2020), pp. 671–674.

34. L. Zhang, Q. Zhang, M. Wu, et al., “Neural video portrait relighting in real-time via consistency modeling,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, (2021), pp. 802–812.

35. M. P. P. Segundo, L. Silva, O. R. P. Bellon, et al., “Automatic face segmentation and facial landmark detection in range images,” IEEE Trans. Syst., Man, Cybern. B 40(5), 1319–1330 (2010). [CrossRef]

36. R. Zhang, P. Isola, A. A. Efros, et al., “The unreasonable effectiveness of deep features as a perceptual metric,” in Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), pp. 586–595.

37. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

38. X. Wu, R. He, Z. Sun, et al., “A light cnn for deep face representation with noisy labels,” IEEE Trans.Inform.Forensic Secur. 13(11), 2884–2896 (2018). [CrossRef]

39. J. Lehtinen, J. Munkberg, J. Hasselgren, et al., “Noise2noise: Learning image restoration without clean data,” arXiv, arXiv:1803.04189 (2018). [CrossRef]

40. rcc-cubAC, “LEIFR-Net,” GitHub (2023), https://github.com/rcc-cubAC/LEIFR-Net.

Dataset	Data acquisition	Quantity	Genders	Physically Based Rendering	Implicit Relight	Paired Scene
SIPR-S [5]	light stage	22	17M,5F	✓	$\times$	$\times$
TR [18]	light stage	70	-	✓	$\times$	$\times$
NVPR [34]	light stage	36	18M,18F	✓	$\times$	$\times$
NVIDIA [7]	simulation	515	247M,265F	✓	$\times$	✓
ours	synthetic generated	2091	1083M,1008F	$\times$	✓	✓

Methods	MSE ↓	LPIPS ↓	SSIM ↑	PSNR↑	Deg.↑
SIPR-S [5]	0.246	0.325	0.394	27.45	0.795
TR [18]	0.227	0.291	0.376	28.69	0.882
LEIFR-Net	0.183	0.264	0.498	30.26	0.958

Methods	MSE ↓	LPIPS ↓	SSIM ↑	PSNR↑	Deg.↑
w/o pixel-aligned	0.246	0.325	0.394	27.45	0.795
w/o keypoint	0.227	0.291	0.376	28.69	0.882
w/o position encoding	0.212	0.289	0.328	29.23	0.853
LEIFR-Net	0.183	0.264	0.498	30.26	0.958

Dataset	Data acquisition	Quantity	Genders	Physically Based Rendering	Implicit Relight	Paired Scene
SIPR-S [5]	light stage	22	17M,5F	✓	$\times$	$\times$
TR [18]	light stage	70	-	✓	$\times$	$\times$
NVPR [34]	light stage	36	18M,18F	✓	$\times$	$\times$
NVIDIA [7]	simulation	515	247M,265F	✓	$\times$	✓
ours	synthetic generated	2091	1083M,1008F	$\times$	✓	✓

Methods	MSE ↓	LPIPS ↓	SSIM ↑	PSNR↑	Deg.↑
SIPR-S [5]	0.246	0.325	0.394	27.45	0.795
TR [18]	0.227	0.291	0.376	28.69	0.882
LEIFR-Net	0.183	0.264	0.498	30.26	0.958

LEIFR-Net: light estimation for implicit face relight network

Abstract

1. Introduction

2. Related work

2.1 Light estimation and representation

2.2 Diffusion model

2.3 Face relighting

2.4 Pixel-aligned implicit function

3. Method

3.1 Light modeling

3.2 Scene2Light

3.3 Face relighting

3.4 Network architecture

3.5 Loss function and training

4. Experiment

4.1 Experiment environment

4.2 Dataset and implementation details

4.3 Comparisons with state-of-the-art methods

4.4 Evaluations with real scenarios and other light sources

4.5 Elimination of environmental pseudo-artifacts

4.6 Ablation study

5. Conclusion

Funding

Acknowledgment

Disclosures

Data availability

References

Data availability

Cited By

Figures (7)

Tables (3)

Equations (9)

Optics Express