## Abstract

Random phase encoding (RPE) techniques for image encryption have drawn increasing attention over the past decades. In this contribution we demonstrate that RPE-based optical cryptosystems are vulnerable to a chosen-plaintext attack (CPA) built on a deep learning strategy. A deep neural network (DNN) model is trained to learn the working mechanism of the optical cryptosystem, yielding an optimized DNN that acts as a decryption system. Numerical simulations were carried out to verify the feasibility and reliability of the attack on not only the classical double RPE (DRPE) scheme but also the security-enhanced triple RPE (TRPE) scheme. The results further indicate that images (plaintexts) outside the original data set can also be reconstructed.

© 2019 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

Since the seminal work on double random phase encoding (DRPE) was proposed by Refregier and Javidi in 1995 [1], optical cryptography has been studied widely over the past two decades. Meanwhile, DRPE has been developed and enhanced considerably by introducing extra degrees of freedom, e.g., wavelength, propagation distance [2], and polarization [3]. However, its security has raised serious concerns and disputes among scholars. Several attack methods, such as the chosen-plaintext attack (CPA) [4], the known-plaintext attack (KPA) [5] and the ciphertext-only attack (COA) [6], have been put forward. Undoubtedly, these optical cryptanalysis schemes will promote the further development of optical cryptosystems. Optical encryption methods with a higher security level, and even optical asymmetric cryptography, have been investigated, such as double random-phase-amplitude encryption [7] and an asymmetric cryptosystem based on phase-truncated Fourier transforms [8]. Recently, J. Chen *et al*. and E. Ahouzi *et al*. independently and almost simultaneously proposed a security-enhanced encryption method named triple random phase encoding (TRPE) [9,10], which introduces an extra random phase mask (RPM) at the output plane and shows much higher security strength.

With the development of computing technology, deep learning (DL), a powerful tool that allows a multi-layer network to learn the characteristics of data, has shown outstanding ability in digital image processing, including classification [11] and pattern recognition in multi-set images [12]. Recently, DL has also shown great potential in optical computational imaging; in particular, it provides a solution to inverse imaging problems such as phase recovery and holographic image reconstruction [13], lensless phase imaging from propagated intensity diffraction patterns [14], computational ghost imaging at extremely low sampling rates [15] and imaging through scattering media [16]. Furthermore, DL has been successfully applied to improve imaging performance, such as enhancing the spatial resolution in optical microscopy [17], extending the depth-of-field [18] and field of view [19] in holographic imaging, and removing the zero-order and twin-image terms in in-line digital holography [20]. Here, we propose a new optical cryptanalysis method, based on a DL strategy, to crack the aforementioned DRPE and even the security-enhanced TRPE optical cryptosystem. A deep neural network (DNN) model is trained using a series of chosen plaintexts and the corresponding ciphertexts. The trained DNN model can be regarded as an “equivalent key” of the optical cryptosystem, so that one can predict the corresponding plaintexts from any follow-up ciphertexts without deducing the original keys. The rest of the paper is organized as follows. Section 2 reviews the TRPE optical encryption system and gives a preliminary analysis of its security strength. Section 3 provides a detailed description of the proposed DNN-based cryptanalysis method, and Section 4 presents the simulation results. The conclusion and discussion are summarized in the last section.

## 2. Review of triple random phase encoding and security analysis

In this section, we briefly introduce the basic principle of the classical DRPE scheme and the security-enhanced TRPE scheme. In DRPE encryption [1], the plaintext image is converted into white noise through a 4-*f* optical system, in which two statistically independent random phase masks (M_{1} and M_{2}) are placed in the input plane and the Fourier plane, respectively. DRPE has been shown to be vulnerable to many attacks, as the Fourier amplitude can be easily obtained by performing a Fourier transform on the ciphertext. To overcome this security weakness, TRPE [9,10] was proposed, which places a third random phase mask (M_{3}) at the output plane.

Suppose $(x,y)$ and $(u,v)$ denote the coordinates in the input and the Fourier planes, respectively, and $P(x,y)$ is the plaintext placed at the input plane of the 4*f* configuration. As shown in Fig. 1, ${M}_{1}(x,y)$, ${M}_{2}(u,v)$ and ${M}_{3}(x,y)$ represent three statistically independent random phase masks placed in the input plane, Fourier plane and output plane, respectively. When the whole system is illuminated by a collimated plane wave, the plaintext image is converted into a noise-like ciphertext. The encryption process for DRPE and TRPE can be mathematically expressed by:

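$${E}_{DRPE}(x,y)=\mathrm{IFT}\{\mathrm{FT}\{P(x,y){M}_{1}(x,y)\}{M}_{2}(u,v)\},\qquad (1)$$

$${E}_{TRPE}(x,y)=\mathrm{IFT}\{\mathrm{FT}\{P(x,y){M}_{1}(x,y)\}{M}_{2}(u,v)\}{M}_{3}(x,y),\qquad (2)$$

where FT{·} and IFT{·} denote the Fourier transform and the inverse Fourier transform, respectively. With the correct keys (M_{1}- M_{3}), the plaintexts can be decrypted by inversely performing the encryption process in a direct manner.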
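The encryption and decryption steps described above can be sketched numerically. The following is a minimal illustration, assuming the masks are unit-amplitude phase factors and the 4*f* system is modeled with discrete FFTs; the function names (`random_phase_mask`, `encrypt`, `decrypt`) and the random stand-in plaintext are hypothetical and not taken from the original implementation.

```python
import numpy as np

def random_phase_mask(shape, rng):
    """Unit-amplitude mask exp(j*phi) with phi uniform in [0, 2*pi)."""
    return np.exp(2j * np.pi * rng.random(shape))

def encrypt(P, M1, M2, M3=None):
    """DRPE when M3 is None, otherwise TRPE (extra mask at the output plane)."""
    E = np.fft.ifft2(np.fft.fft2(P * M1) * M2)
    return E if M3 is None else E * M3

def decrypt(E, M1, M2, M3=None):
    """Undo encrypt() step by step with the complex-conjugate masks."""
    if M3 is not None:
        E = E * np.conj(M3)
    return np.fft.ifft2(np.fft.fft2(E) * np.conj(M2)) * np.conj(M1)

rng = np.random.default_rng(0)
P = rng.random((32, 32))                 # stand-in real-valued plaintext
M1 = random_phase_mask(P.shape, rng)     # input-plane key
M2 = random_phase_mask(P.shape, rng)     # Fourier-plane key
M3 = random_phase_mask(P.shape, rng)     # output-plane key (TRPE only)

E_drpe = encrypt(P, M1, M2)
E_trpe = encrypt(P, M1, M2, M3)
P_drpe = decrypt(E_drpe, M1, M2)
P_trpe = decrypt(E_trpe, M1, M2, M3)
```

Because every mask has unit modulus, multiplying by its complex conjugate removes it exactly, which is why both schemes are losslessly invertible when all keys are known.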

The security of the TRPE scheme is improved since the third RPM at the output plane protects the complex-valued ciphertext from being exposed directly in the Fourier plane. Take CPA as an example, and assume that an attacker selects an impulse function $\delta (x,y)$ as a special plaintext to be encrypted. In the DRPE scheme, the point spread function ${h}_{DRPE}(x,y)$ of the system can be expressed as:

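$${h}_{DRPE}(x,y)=\mathrm{IFT}\{\mathrm{FT}\{\delta (x,y){M}_{1}(x,y)\}{M}_{2}(u,v)\}={M}_{1}(0,0)\,\mathrm{IFT}\{{M}_{2}(u,v)\}.\qquad (3)$$

Up to the constant factor ${M}_{1}(0,0)$, the attacker can therefore obtain the key (M_{2}), which is equivalent to the Fourier transform of the point spread function. However, in the TRPE scheme, the point spread function ${h}_{TRPE}(x,y)$ can be written as:

$${h}_{TRPE}(x,y)=\mathrm{IFT}\{\mathrm{FT}\{\delta (x,y){M}_{1}(x,y)\}{M}_{2}(u,v)\}{M}_{3}(x,y)={M}_{1}(0,0)\,\mathrm{IFT}\{{M}_{2}(u,v)\}{M}_{3}(x,y).\qquad (4)$$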

According to Eq. (4), attackers cannot extract either of the phase keys (M_{2} and M_{3}) from the PSF ${h}_{TRPE}(x,y)$, indicating that the impulse-based CPA method is invalid for the TRPE scheme. Moreover, compared with the classic DRPE scheme, the main feature of the TRPE scheme is that it is nearly impossible to obtain the amplitude in the Fourier domain from the ciphertext because of the presence of M_{3}. Without this amplitude constraint in the Fourier plane, it is difficult for attackers to retrieve the phase keys with a phase retrieval algorithm (PRA), although this can be done for the DRPE scheme [5]. The above theoretical analysis shows that the TRPE scheme outperforms the traditional DRPE scheme in terms of security. Nevertheless, the TRPE-based cryptosystem is still vulnerable to our proposed DNN-based attack strategy, as will be demonstrated in the next section.
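The impulse-based CPA argument can be checked with a small numerical experiment. This is only a sketch under stated assumptions: the impulse sits at the origin of the discrete FFT grid, the masks are unit-amplitude random phases, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
shape = (32, 32)
M1 = np.exp(2j * np.pi * rng.random(shape))
M2 = np.exp(2j * np.pi * rng.random(shape))
M3 = np.exp(2j * np.pi * rng.random(shape))

delta = np.zeros(shape)
delta[0, 0] = 1.0   # discrete impulse at the origin of the FFT grid

# Point spread functions of the two schemes for the impulse plaintext.
h_drpe = np.fft.ifft2(np.fft.fft2(delta * M1) * M2)
h_trpe = h_drpe * M3

# DRPE: the spectrum of the PSF equals M2 up to the constant M1(0, 0).
leak_drpe = np.fft.fft2(h_drpe)
drpe_key_exposed = np.allclose(leak_drpe, M1[0, 0] * M2)

# TRPE: the same step mixes M2 with the spectrum of M3, so M2 is not exposed.
leak_trpe = np.fft.fft2(h_trpe)
trpe_key_exposed = np.allclose(leak_trpe, M1[0, 0] * M2)
```

In this sketch `drpe_key_exposed` evaluates to `True` while `trpe_key_exposed` evaluates to `False`, which mirrors the analytical conclusion that the output-plane mask blocks the impulse attack.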

## 3. DNN-based cryptanalysis

Here, we propose an end-to-end deep-learning-based method for optical cryptanalysis. The proposed cryptanalysis does not rely on the physical process of encryption/decryption, but it needs a large set of plaintext-ciphertext pairs to learn the relationship between any ciphertext and its corresponding plaintext. In cryptanalysis, according to “Kerckhoffs’ principle”, it is always assumed that attackers already know all details of a cryptosystem except the encryption key [21]. Specifically, in the CPA scenario adopted in our approach, attackers are assumed to be able to obtain the ciphertext of any chosen plaintext without knowing the keys, and they try to directly predict the plaintexts from subsequently intercepted ciphertexts, again without the keys. In our scheme, a modified deep neural network (DNN), named DecNet, is trained with the ciphertext images as training data and the corresponding plaintext images as training labels. DecNet applies the deep residual convolutional network (ResNet) architecture [12], where each layer connects to every other within the block in a feed-forward fashion. Compared to conventional convolutional networks, ResNets have more direct connections between the layers, strengthening feature propagation, encouraging feature reuse and substantially reducing the number of parameters. Thus, ResNets offer a better generalization capability.

A diagram of DecNet is shown in Fig. 2. To be strictly consistent with the traditional DRPE or TRPE scheme, the input and output of DecNet are set to be the complex-valued ciphertext and the real-valued plaintext, respectively. The ciphertext first passes through a convolution layer with a 3 × 3 filter and a pooling layer with stride 2, and is then successively decimated by four downsampling residual blocks. After passing through another residual block, it passes through five upsampling residual blocks. Finally, the signal passes through a standard convolution layer with a 3 × 3 filter, and the estimate of the plaintext is produced. This is the “encoder-decoder network” architecture [22,23], where the downsampling residual blocks serve as the encoder to extract feature maps from the input patterns and the upsampling residual blocks serve as the decoder to perform pixel-wise regression. In addition, skip connections are employed to pass high-frequency information learned in the initial layers down the network towards the output reconstruction [24].
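As a quick consistency check on this layer sequence, the spatial size of the feature maps can be traced stage by stage. The sketch below assumes a 32 × 32 input (the image size used in the simulations), size-preserving convolutions, and exact halving or doubling in each pooling, downsampling or upsampling stage; `trace_spatial_size` is a hypothetical helper, not part of DecNet.

```python
def trace_spatial_size(size=32, n_down=4, n_up=5):
    """Return the side length of the feature maps after each stage."""
    sizes = [size]           # input ciphertext
    size //= 2               # 3x3 convolution followed by stride-2 pooling
    sizes.append(size)
    for _ in range(n_down):  # four downsampling residual blocks
        size //= 2
        sizes.append(size)
    sizes.append(size)       # bottleneck residual block keeps the size
    for _ in range(n_up):    # five upsampling residual blocks
        size *= 2
        sizes.append(size)
    return sizes             # the final 3x3 convolution keeps the size

sizes = trace_spatial_size()
# sizes == [32, 16, 8, 4, 2, 1, 1, 2, 4, 8, 16, 32]
```

Under these assumptions the five halvings are matched by five doublings, so the output estimate has the same 32 × 32 size as the input ciphertext.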

Each residual block consists of two composite convolution layers, which connect to each other within the same block in a feed-forward fashion. Each composite convolutional layer comprises three consecutive operations: batch normalization (BN), rectified linear unit (ReLU) and convolution (Conv) with filter size 3 × 3. After being activated by ReLU, a complex-valued signal is turned into a real-valued signal. The downsampling residual block includes an average pooling operation with stride 2. As a result, the dimension of the input to this block is reduced by a factor of 2 at the output. The upsampling residual block increases the dimension of the input by a factor of 2, which is achieved by the subpixel upscaling operation [25]. In addition, L2 regularization with weight decay of 1E-4 is employed in all convolutional filters initialized . The same regularization is used in batch normalization as well. A small dropout rate of 0.02 is set to prevent overfitting. Because of GPU memory constraints, DecNet is trained with a mini-batch size of 8 using the ADAM optimizer in Keras. The training starts with a learning rate of 0.0001, which drops by a factor of 2 after every 5 epochs.
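The three numerical ingredients named above (stride-2 average pooling, subpixel upscaling and the step-wise learning-rate decay) can be illustrated in plain NumPy. This is a sketch rather than the DecNet code: the channel ordering inside `subpixel_upscale` is one common depth-to-space convention chosen as an assumption, and all function names are hypothetical.

```python
import numpy as np

def average_pool_2x2(x):
    """Stride-2 average pooling of an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def subpixel_upscale(x, r=2):
    """Rearrange (H, W, C*r*r) into (H*r, W*r, C), i.e. depth-to-space."""
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, r, r, c_out)
    x = x.transpose(0, 2, 1, 3, 4)          # -> (h, r, w, r, c_out)
    return x.reshape(h * r, w * r, c_out)

def learning_rate(epoch, base_lr=1e-4, drop=0.5, every=5):
    """Start at base_lr and halve it after every `every` epochs."""
    return base_lr * drop ** (epoch // every)

x = np.arange(4 * 4 * 4, dtype=float).reshape(4, 4, 4)
pooled = average_pool_2x2(x)       # shape (2, 2, 4): each side halved
upscaled = subpixel_upscale(x)     # shape (8, 8, 1): each side doubled
rates = [learning_rate(e) for e in (0, 4, 5, 10)]
```

A function with the signature of `learning_rate` is the kind of callable that a per-epoch scheduler callback would typically accept, although how the original training script implemented the decay is not stated.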

The hypothesis to test here is that DecNet can be trained to predict the plaintext from its corresponding ciphertext. Since all the training objects are included in each iteration of the DecNet training process, our learning approach is expected to be a much more robust scheme for optical cryptanalysis. Notably, in most cases the cracked keys (the equivalent keys, not the real ones) are valid only for a specific set of plaintexts, whereas the proposed method broadens the applicability of the cracked keys. The premise, which cannot be ignored, is that a sufficiently large database must be available in advance.

## 4. Numerical simulations and analysis

Numerical simulations were carried out to analyze the performance of the proposed attack method on the TRPE system. The wavelength of the illumination light was set to *λ* = 532 nm, and the size of all the images was 32 × 32 pixels with a 0.2 mm pixel size. The two RPMs were generated by computer and randomly distributed in [0, 2π] with 256 gray-levels. Theoretically, for both DRPE and TRPE, the ciphertext is a complex amplitude distribution, which cannot be detected directly by a normal CCD camera in a real optical experiment. In these simulations, we assume attackers have already obtained this kind of complex-valued ciphertext from the corresponding hologram. The DecNet network was trained using 5000 grayscale fashion images from the publicly available fashion-MNIST database [25]. The program was implemented in Python 3.6 and the DNN was built with the Keras framework on top of TensorFlow. An NVIDIA GTX1060 GPU was used to accelerate the computation.

During the DecNet training, the loss function, mean absolute error (MAE), was employed to monitor the training process. MAE is defined as:

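$$\mathrm{MAE}=\frac{1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h}\left|O(i,j)-P(i,j)\right|,\qquad (5)$$

where *w* and *h* are the width and height of the output, respectively; *O* is the output of the last layer, and *P* is the ground truth (plaintext). To quantify the performance of the algorithm during the training process, we also introduce the correlation coefficient (CC) as another metric for convergence. The definition of CC is given as:

$$\mathrm{CC}=\frac{\mathrm{cov}(P,O)}{{\sigma}_{P}{\sigma}_{O}},\qquad (6)$$

where $\mathrm{cov}(P,O)$ is the cross-covariance of *P* and *O*, and ${\sigma}_{P}$ and ${\sigma}_{O}$ are the standard deviations of the original plaintext *P* and the recovered plaintext *O*, respectively. The MAE and CC values are plotted in Fig. 3(a) and 3(b), respectively.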
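Both metrics are straightforward to compute. The sketch below assumes that CC is the standard Pearson correlation coefficient between the recovered image *O* and the ground truth *P*; the function names and the toy arrays are hypothetical.

```python
import numpy as np

def mean_absolute_error(O, P):
    """Average absolute pixel difference between output O and ground truth P."""
    return np.mean(np.abs(O - P))

def correlation_coefficient(O, P):
    """Pearson correlation between O and P (assumed definition of CC)."""
    o = O - O.mean()
    p = P - P.mean()
    return np.sum(o * p) / np.sqrt(np.sum(o ** 2) * np.sum(p ** 2))

P = np.array([[0.0, 1.0], [2.0, 3.0]])   # toy ground truth
O = P + 0.5                              # toy prediction with a constant offset

mae = mean_absolute_error(O, P)          # 0.5
cc = correlation_coefficient(O, P)       # 1.0 up to rounding
```

The toy example shows why the two metrics are complementary: a constant brightness offset leaves CC at 1 while MAE still reports the error.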

In the first test, the plaintexts are of the same type as the fashion-MNIST training data [25], but they were never used in the training. The results of using DecNet to attack the DRPE system are shown in Fig. 4(a), where the first row presents the subsequently obtained ciphertexts from the DRPE scheme, the second row the corresponding ground-truth plaintexts, and the third row the reconstructed results. Similarly, the results of attacking the TRPE system are shown in Fig. 4(b). High-quality reconstructions were achieved when DecNet was trained and tested on the same database. As shown in Fig. 4, the recovered images are close to a low-pass filtered version of the plaintext images: the retrieved plaintext information can be visualized, but the high-frequency features are missing.

To further validate the proposed approach, we used DecNet to predict plaintexts from subsequent ciphertexts of a new type, i.e., handwritten digits (MNIST database [26]), which are not included in the training data. The ciphertexts, the ground-truth plaintexts and the retrieved plaintext images are shown in the first, second and third rows of Fig. 5(a) and Fig. 5(b), respectively. This demonstrates that DecNet can produce high-quality predictions of these new types of plaintexts from their corresponding ciphertexts.

In addition, the robustness of our method against shear and noise was investigated, since optical information propagation may suffer from data loss and noise pollution. Taking TRPE as an example, we degraded the ciphertext only during the testing stage. Figure 6(a) shows the ciphertexts with a shear ratio of 20% (that is, 20% of the pixel values are set to 0 while the others remain unchanged). The corresponding deciphered images are presented in Fig. 6(d), where the plaintext information can still be observed. Figures 6(b) and 6(c) are the ciphertexts with shear ratios of 40% and 60%, respectively, with the corresponding deciphered images in Figs. 6(e) and 6(f). When the shear ratio is larger than 60%, the reconstructed image becomes blurred and incomplete.
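The shear test can be mimicked with a few lines of NumPy. The text only fixes the fraction of pixels that are zeroed, so which pixels are occluded is an assumption here: this sketch zeroes the leading pixels in row-major order, and `shear_ciphertext` is a hypothetical helper.

```python
import numpy as np

def shear_ciphertext(E, ratio):
    """Zero out a fraction `ratio` of the pixels and leave the rest unchanged.

    Assumption: the occluded pixels are the leading ones in row-major order.
    """
    out = E.copy()
    n_zero = int(round(ratio * E.size))
    out.reshape(-1)[:n_zero] = 0     # reshape of a fresh copy is a view
    return out

rng = np.random.default_rng(2)
E = np.exp(2j * np.pi * rng.random((32, 32)))   # stand-in complex ciphertext
sheared = shear_ciphertext(E, 0.2)              # 205 of 1024 pixels set to 0
```

The sheared array would then be fed to the trained network in place of the intact ciphertext, without any retraining.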

In terms of the anti-noise ability, the reconstructed results were analyzed from ciphertexts with additive noise and multiplicative noise, respectively. The noise function is expressed as:

where ${E}^{\prime}$ is the ciphertext with noise; *E* is the original ciphertext; $r$ is the weight of the noise; and *p* is a matrix whose element values are randomly set between 0 and 1. Ciphertexts with additive and multiplicative noise weights of 0.25, 0.5 and 1 are shown in Figs. 7(a)-7(c) and Figs. 8(a)-8(c), respectively, with the corresponding deciphered images in Figs. 7(d)-7(f) and Figs. 8(d)-8(f). Although there are some noise points in the reconstructed images, the plaintext information can still be recognized, indicating that the proposed deep-learning-based cryptanalysis method has a strong anti-noise ability.
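A noise test of this kind can be sketched as follows. The two noise forms used here, an additive model E' = E + r·p and a multiplicative model E' = E·(1 + r·p), are assumptions chosen to be consistent with the symbol definitions above and are not quoted from the source; `add_noise` is a hypothetical helper.

```python
import numpy as np

def add_noise(E, r, rng, kind="additive"):
    """Perturb ciphertext E with noise weight r (assumed noise models)."""
    p = rng.random(E.shape)             # entries uniform in [0, 1)
    if kind == "additive":
        return E + r * p                # assumed additive form
    if kind == "multiplicative":
        return E * (1.0 + r * p)        # assumed multiplicative form
    raise ValueError("kind must be 'additive' or 'multiplicative'")

rng = np.random.default_rng(5)
E = np.exp(2j * np.pi * rng.random((32, 32)))   # stand-in complex ciphertext

noisy_add = add_noise(E, 0.25, rng, kind="additive")
noisy_mul = add_noise(E, 0.25, rng, kind="multiplicative")
```

With either model, setting `r = 0` returns the ciphertext unchanged, and increasing `r` through 0.25, 0.5 and 1 reproduces the sweep of noise weights described in the text.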

In practice, the detection and storage of a complex-valued ciphertext is inconvenient because it involves holography-based modulation and demodulation [3,27]. Furthermore, a complex-valued ciphertext could leak more information about the cryptosystem and thus cause security problems. In this regard, a designer of a cryptosystem might naturally prefer to publish only the amplitude (or intensity) part of the complex-valued ciphertext and to preserve its phase part separately as additional information. It is also worth pointing out that the DRPE and TRPE schemes behave identically if only the amplitude part of the ciphertext is published. To further validate the proposed DecNet model, we therefore try to predict the subsequent plaintexts when only the amplitude of the ciphertext is known, which is impossible for the conventional PRA-based attack method [5]. During the training, we use only the amplitude of the ciphertexts as the input of DecNet, and the testing results are shown in Fig. 9. The amplitudes of the subsequent ciphertexts, the ground-truth plaintexts and the retrieved plaintext images are shown in the first, second and third rows, respectively. Some important features of the plaintexts can still be recognized in the retrieved images.
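The remark that DRPE and TRPE coincide once only the amplitude is published follows from the third mask being a pure phase factor, and it can be verified numerically. The sketch below assumes unit-amplitude random phase masks and FFT-based propagation; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
P = rng.random((32, 32))                         # stand-in plaintext
M1 = np.exp(2j * np.pi * rng.random(P.shape))
M2 = np.exp(2j * np.pi * rng.random(P.shape))
M3 = np.exp(2j * np.pi * rng.random(P.shape))

E_drpe = np.fft.ifft2(np.fft.fft2(P * M1) * M2)
E_trpe = E_drpe * M3                             # output-plane mask of TRPE

# Amplitude-only ciphertexts, i.e. what an intensity detector could provide.
amp_drpe = np.abs(E_drpe)
amp_trpe = np.abs(E_trpe)
same_amplitude = np.allclose(amp_drpe, amp_trpe)
```

Since `same_amplitude` holds for any unit-modulus `M3`, a network trained on amplitude-only ciphertexts faces the same learning problem for both schemes.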

## 5. Optical experiments

Experiments were carried out to investigate the effectiveness and practicability of the proposed method. The experimental setup is shown in Fig. 10. A continuous-wave laser (MW-SL-532/50mW) served as the illumination source. The laser beam passes through a spatial filter, is expanded and collimated, and is then directed onto the SLM. The plaintext image is loaded on an SLM (Holoeye, LC2002) placed at the input plane. Two imaging lenses (*f* = 150 mm) are used to construct an optical 4*f* system. Three different glass diffusers are used as random phase masks. A high dynamic range CMOS camera (PCO edge 5.5, 2,160 × 2,160 pixels with a pixel size of 6.5 × 6.5 μm, dynamic range of 16 bits) is located at the output plane of the 4*f* system. 5000 images (chosen plaintexts) from the MNIST database are loaded on the SLM one at a time, and each corresponding speckle intensity (the square of the amplitude of the corresponding ciphertext) is captured by the camera. We chose only the central 400 × 400 pixels and resized them to 32 × 32 pixels to perform the training. Of these 5000 image pairs, 4500 are used as the training data and the remaining 500 as the testing data. The testing results are presented in Fig. 11. The images in the first row are the amplitudes of the ciphertexts, which are also the inputs of the proposed DecNet model. The second and third rows show the plaintext images predicted by DecNet and the ground-truth images, respectively.
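The preprocessing of the recorded frames (central crop, resize, train/test split) can be sketched as below. The interpolation method and the way the 5000 pairs were divided are not stated, so nearest-neighbour sampling and a sequential split are assumptions, and the helper names and random stand-in frame are hypothetical.

```python
import numpy as np

def center_crop(img, size=400):
    """Keep the central size x size region of a 2D frame."""
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def resize_nearest(img, out=32):
    """Nearest-neighbour resize to out x out (assumed interpolation)."""
    h, w = img.shape
    rows = ((np.arange(out) + 0.5) * h / out).astype(int)
    cols = ((np.arange(out) + 0.5) * w / out).astype(int)
    return img[np.ix_(rows, cols)]

frame = np.random.default_rng(4).random((2160, 2160))  # stand-in camera frame
cropped = center_crop(frame)            # 400 x 400 central region
x = resize_nearest(cropped)             # 32 x 32 network input

# Sequential 4500 / 500 split of the 5000 pairs (the split order is assumed).
indices = np.arange(5000)
train_idx, test_idx = indices[:4500], indices[4500:]
```

Because 400 is not an integer multiple of 32, simple block averaging does not apply directly, which is why an index-based resampling is used in this sketch.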

It should be noted that the output of DecNet is always a real-valued signal, regardless of whether the input is real-valued or complex-valued, so the proposed attack method works only when the plaintext is real-valued. Besides, a large set of plaintext-ciphertext pairs is required as prior information, which is not easy to obtain in practice. At present, the main challenges in applying the proposed DNN-based attack method are the acquisition of a large number of training samples and the time-consuming training process. Therefore, how to retrieve complex-valued plaintexts and how to reduce the number of plaintext-ciphertext pairs will be considered in our future work.

## 6. Conclusion

In this contribution, a deep-learning-based method for optical cryptanalysis is proposed. The DNN model is introduced as a CPA method and is trained with a number of input/output pairs, namely known ciphertext/plaintext pairs. The trained DNN model is then regarded as the “equivalent key”, applicable to almost all RPE-based cryptosystems (e.g., DRPE, TRPE) even if only part of the ciphertext is known (e.g., its amplitude part). This is especially valuable in practice, since it is difficult to record and transmit a complex-valued ciphertext, and it is why our cryptanalysis approach still works for TRPE where the traditional cryptanalysis methods fail. We further demonstrate the robustness of the proposed method by introducing shearing and noise pollution. The potential demonstrated by this technique will guide a new cryptanalysis direction in this field.

## Funding

National Natural Science Foundation of China (NSFC) (61875129, 61705141, 61805152); Sino-German Center for Sino-German Cooperation Group (GZ 1391); China Postdoctoral Science Foundation (2017M610544, 2018M633114).

## References

**1.** P. Refregier and B. Javidi, “Optical image encryption based on input plane and Fourier plane random encoding,” Opt. Lett. **20**(7), 767–769 (1995).

**2.** G. Situ and J. Zhang, “Double random-phase encoding in the Fresnel domain,” Opt. Lett. **29**(14), 1584–1586 (2004).

**3.** O. Matoba and B. Javidi, “Secure holographic memory by double-random polarization encryption,” Appl. Opt. **43**(14), 2915–2919 (2004).

**4.** X. Peng, H. Wei, and P. Zhang, “Chosen-plaintext attack on lensless double-random phase encoding in the Fresnel domain,” Opt. Lett. **31**(22), 3261–3263 (2006).

**5.** X. Peng, P. Zhang, H. Wei, and B. Yu, “Known-plaintext attack on optical encryption based on double random phase keys,” Opt. Lett. **31**(8), 1044–1046 (2006).

**6.** M. Liao, W. He, D. Lu, and X. Peng, “Ciphertext-only attack on optical cryptosystem with spatially incoherent illumination: from the view of imaging through scattering medium,” Sci. Rep. **7**(1), 41789 (2017).

**7.** X. C. Cheng, L. Z. Cai, Y. R. Wang, X. F. Meng, H. Zhang, X. F. Xu, X. X. Shen, and G. Y. Dong, “Security enhancement of double-random phase encryption by amplitude modulation,” Opt. Lett. **33**(14), 1575–1577 (2008).

**8.** W. Qin and X. Peng, “Asymmetric cryptosystem based on phase-truncated Fourier transforms,” Opt. Lett. **35**(2), 118–120 (2010).

**9.** A. Esmail, W. Zamrani, N. Azami, A. Lizana, J. Campos, and J. Y. Maria, “Optical triple random-phase encryption,” Opt. Eng. **56**(11), 113–114 (2017).

**10.** J. Chen, Y. Zhang, J. Li, and L. B. Zhang, “Security enhancement of double random phase encoding using rear-mounted phase masking,” Opt. Eng. **101**(2), 51–59 (2018).

**11.** T. H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “PCANet: A simple deep learning baseline for image classification,” in Proceedings of IEEE transactions on image processing (IEEE, 2015), 5017–5032.

**12.** K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (IEEE, 2016), 770–778.

**13.** Z. Ren, Z. Xu, and E. Y. Lam, “Learning-based nonparametric autofocusing for digital holography,” Optica **5**(4), 337–344 (2018).

**14.** A. Sinha, J. Lee, S. Li, and G. Barbastathis, “Lensless computational imaging through deep learning,” Optica **4**(9), 1117–1125 (2017).

**15.** M. Lyu, W. Wang, H. Wang, H. Wang, G. Li, N. Chen, and G. Situ, “Deep-learning-based ghost imaging,” Sci. Rep. **7**(1), 17865 (2017).

**16.** M. Lyu, H. Wang, G. Li, and G. Situ, “Exploit imaging through opaque wall via deep learning,” arXiv:1708.07881 (2017).

**17.** Y. Rivenson, Z. Gorocs, H. Günaydın, Y. Zhang, H. Wang, and A. Ozcan, “Deep learning microscopy,” Optica **4**(11), 1437–1443 (2017).

**18.** Y. Wu, Y. Rivenson, Y. Zhang, Z. Wei, H. Gunaydin, X. Lin, and A. Ozcan, “Extended depth-of-field in holographic imaging using deep-learning-based autofocusing and phase recovery,” Optica **5**(6), 704–710 (2018).

**19.** Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, “Phase recovery and holographic image reconstruction using deep learning in neural networks,” Light Sci. Appl. **7**(2), 17141 (2018).

**20.** H. Wang, M. Lyu, and G. Situ, “eHoloNet: a learning-based end-to-end approach for in-line digital holographic reconstruction,” Opt. Express **26**(18), 22603–22614 (2018).

**21.** B. Schneier, *Applied Cryptography: Protocols, Algorithms, and Source Code in C*, 2nd ed. (John Wiley and Sons, 1996).

**22.** Y. LeCun, Y. Bengio, and G. Hinton, *Deep Learning* (MIT, 2015).

**23.** V. Badrinarayanan, A. Kendall, and R. Cipolla, “A deep convolutional encoder-decoder architecture for image segmentation,” in Proceedings of IEEE transactions on pattern analysis and machine intelligence (IEEE, 2017), 2481–2495.

**24.** Z. Zhang, Q. Liu, and Y. Wang, “Road extraction by deep residual u-net,” in Proceedings of IEEE Geoscience and Remote Sensing Letters (IEEE, 2018), 749–753.

**25.** H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv:1708.07747 (2017).

**26.** L. Deng, “The MNIST database of handwritten digit images for machine learning research,” in Proceedings of IEEE Signal Processing Magazine (IEEE, 2012), 141–142.

**27.** B. Javidi, G. Zhang, and J. Li, “Experimental demonstration of the random phase encoding technique for image encryption and security verification,” Opt. Eng. **35**(9), 2506–2512 (1996).