## Abstract

Subpixel displacement estimation is an important step in calculating the displacement between two digital images in optics and image processing. Digital image correlation (DIC) is an effective method for measuring displacement because of its high accuracy, and various DIC algorithms that compare images to obtain displacement have been implemented. However, DIC has some drawbacks: it can be computationally expensive when processing a sequence of continuously deformed images. To simplify subpixel displacement estimation and to explore a different measurement scheme, a subpixel displacement measurement method based on a convolutional neural network with transfer learning (CNN-SDM) is proposed in this paper. The basic idea of the method is to use a CNN to compare images of an object decorated with speckle patterns before and after deformation, and thereby to achieve a coarse-to-fine subpixel displacement estimation. The proposed CNN is a classification model consisting of two convolutional neural networks in series. The results of simulated and real experiments show that the proposed CNN-SDM method is feasible and effective for subpixel displacement measurement owing to its high efficiency, robustness, simple structure and few parameters.

© 2021 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

## 1. Introduction

A displacement tracking algorithm with subpixel accuracy is an important part of calculating the displacement between two digital images in the fields of optics and image processing [1,2]. The three key issues of a displacement tracking/estimation algorithm are measurement accuracy, computational efficiency and robustness [3–7]. As an effective optical technique for measuring displacement, digital image correlation (DIC) uses image tracking algorithms to track the relative displacements in a sequence of digital images of a test specimen decorated with speckle patterns during deformation [8–10]. The displacement tracking algorithm mainly consists of initial value estimation and subpixel computation. Correlation-based [11–14], FFT-based [15] and feature-based [16] initial value estimation approaches have been proposed and implemented. However, initial value estimation is usually time-consuming because of its redundant search scheme, which prevents it from running in real time, especially when a sequence of deformed images is processed. To improve the accuracy of subpixel displacement measurement, great progress has been made in subpixel calculation methods, including genetic algorithms [17], gradient-based methods [18], the Newton-Raphson iterative method [19–21], artificial neural network methods [22] and so on. The efficiency of subpixel computation, however, is limited by the iterative algorithms. It is worth mentioning that the inverse compositional Gauss-Newton (IC-GN) method reduces the computation of the Hessian matrix, improves the efficiency of the iteration, and has become one of the most widely used iterative methods [23]. However, the IC-GN method still involves a large amount of computation, such as the computation of image gradients.

Compared with the existing DIC methods, learning a subpixel displacement estimation model from image data with deep learning is a different way to approximate the subpixel displacement. The convolutional neural network (CNN) is known for its learning ability [24,25], and has great potential and research value in emerging fields such as virtual reality (VR) and augmented reality (AR) [26–28]. Many published architectures can successfully perform per-pixel prediction tasks [29], such as semantic segmentation [30,31], face detection [32], object recognition [33], object detection [34,35], depth estimation [36], and some medical applications [37,38]. Notably, convolutional neural networks also have great potential in optics, as shown by FlowNet for optical flow estimation [39], PIV-DCNN for particle image velocimetry [40] and subsequent studies [41–44]. Therefore, it is feasible to train a convolutional neural network to compute the subpixel displacement without iterations. In 2001, Pitter et al. used an artificial neural network for subpixel deformation analysis, which improved the computing efficiency to some degree. However, the maximum error was more than 0.05 pixels, and the method was based on an artificial neural network rather than a convolutional neural network. Later, in 2010, Liu et al. proposed an in-plane subpixel displacement detection method based on a trained artificial neural network [45]. Their network, with only three hidden layers, was trained on 3,000 samples and required more than 16,000 epochs. Although the authors claimed that subpixel displacement analysis with acceptable accuracy can be performed in a reasonable time, no comparison of efficiency and computational cost was given. In addition, there has been no follow-up study as far as we know.
Although the above studies have some problems in terms of accuracy and efficiency, their exploration of applying neural networks to subpixel displacement estimation provides ideas for our study.

Our primary motivation here is to develop a simple convolutional neural network that calculates subpixel displacement accurately, making the entire calculation process simpler and more efficient. This paper proposes a convolutional neural network based subpixel displacement measurement method (CNN-SDM) that uses a trained CNN to compare a sequence of images of an object decorated with speckle patterns before and after deformation. With a simple structure and few coefficients, the proposed CNN-SDM calculates the subpixel displacement from coarse to fine at a given level of accuracy, and it needs only the reference image and the deformed images, without any iterative process. Generally, when the size of the input image changes, a convolutional neural network fails to work because the matrix dimensions of the fully connected layer no longer match. To solve this problem, the spatial pyramid pooling method is adopted so that the CNN-SDM can calculate the subpixel displacement of subsets of different sizes [46]. In addition, if the pattern of the test images is quite different from that of the training images, the calculation accuracy of the CNN-SDM may be affected, and it is hard to include all kinds of speckle patterns when training a convolutional neural network [47]. Transfer learning is a good way to address this problem. The remainder of this paper is organized as follows. In Section 2, the details of the network structure are provided. The algorithm principle is then explained in Section 3. Simulation and real experiments are performed in Section 4 to verify the feasibility and effectiveness of the proposed method. Finally, conclusions are summarized in Section 5.

## 2. Network architectures

In this section, the network structure and specific training details are discussed. The choice of a neural network structure plays an important role in the training process [48–50]. Therefore, the neural network with the simplest structure that achieves high-precision displacement measurement is preferred, which means that the neural network does not require too many layers and nodes.

#### 2.1 Design ideas of the convolutional neural networks

Compared with integer-pixel displacement estimation, subpixel displacement estimation requires more complicated work. The trained neural network in this paper is used only to calculate the subpixel displacement. The integer-pixel part of the displacement can be obtained by other methods, because its calculation is not complicated.

In subpixel displacement estimation, a large image can be divided into small subsets at a certain spacing. If a subset is small enough, the deformation in such a small area can be approximately considered as rigid body translation. Based on this, the neural network only needs to calculate the displacement of each subset one by one. After traversing the whole image, the obtained displacement matrix can be interpolated to obtain the full-field displacement. To turn subpixel displacement measurement into a classification task, the displacement ranges [0, 1] and [0, 0.1] are each divided into 11 categories in units of 0.1 and 0.01 pixels, respectively. In this way, two classification models can be used to calculate displacements at different pixel scales. One advantage of using a classification model is that the training sets and labels are easier to obtain; in addition, the network structure is simple and easy to train. If only one convolutional neural network were used to perform the classification task, a large number of labels would be required, which would place a great burden on label generation and network training. We therefore use two networks to calculate the displacement with different precision.

Based on the above ideas, the neural network proposed in this paper is composed of two convolutional neural networks. The displacement between [-1, 1] is calculated by the first-level CNN with 0.1 pixels scale, and the second-level CNN is used to calculate the displacement [-0.1, 0.1] with 0.01 pixels scale.
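The label scheme described above can be illustrated with a short sketch. The function names are hypothetical, rounding to the nearest class is an assumption about how labels would be assigned, and the clamping to the first and last class is likewise an assumption made for illustration.

```python
def displacement_to_label(d, step, n_classes=11):
    """Map a displacement to the nearest class index (assumed rounding rule).

    step is 0.1 for the first-level CNN and 0.01 for the second-level CNN.
    Values outside the covered range are clamped to the first or last class.
    """
    label = round(d / step)
    return max(0, min(n_classes - 1, label))


def label_to_displacement(label, step):
    """Inverse mapping: class index i corresponds to i * step pixels."""
    return label * step


# A single network resolving [0, 1] pixel in 0.01-pixel steps would need
# 101 classes, whereas two cascaded networks need only 11 classes each.
single_network_classes = round(1.0 / 0.01) + 1
two_level_classes = 11 + 11
```

Under these assumptions the two-level design reduces the number of output classes from 101 to 22 in total.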

#### 2.2 Architectures and details

Convolutional neural networks are known for their ability to learn the relationship between inputs and outputs, provided that enough labeled data are given. Accordingly, an end-to-end learning method can be used to train the convolutional neural network. The first-level CNN adopted in this paper is composed of an input layer, convolutional layers, max pooling layers, a spatial pyramid pooling (SPP) layer, fully connected layers and an output layer. For efficiency, the number of convolutional layers is set to three, and each convolutional layer is followed by a max pooling layer. These are followed by two fully connected layers and the final output layer. The SPP layer lies between the last convolutional layer and the first fully connected layer. The structure is shown in Fig. 1.

As shown in Fig. 1, the reference image and the deformed image are single-channel images, so the pixel matrix of each image has dimensions width×height×1. The reference image and the deformed image are stacked along the third dimension to form a width×height×2 image pair, which serves as a training or test sample. The kernel size decreases from (5, 5) in the first two layers to (3, 3) in the third layer. Finally, the two fully connected layers ensure that the output has dimension 11. The ReLU function is used as the activation function throughout the network.

In general, it is the fully connected layers, rather than the convolution kernels or the pooling operations, that restrict a neural network to a fixed input size, because the number of fully connected layer parameters varies with the size of the input data. The SPP layer ensures that the network can still work when it receives input data of different sizes.
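The following sketch illustrates why spatial pyramid pooling yields a fixed-length vector for any input size. It operates on a single 2D feature map in pure Python; the pyramid levels (1, 2, 4) and the function name are assumptions chosen for illustration and are not taken from the network in Fig. 1.

```python
def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """Spatial pyramid max pooling of one 2D feature map (a list of rows).

    For each pyramid level n, the map is covered by an n x n grid of bins and
    the maximum of each bin is kept. The output length is sum(n * n over the
    levels), independent of the height and width of the input.
    """
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = (i * h) // n              # bin start (floor)
            r1 = -(-((i + 1) * h) // n)    # bin end (ceiling), always > r0
            for j in range(n):
                c0 = (j * w) // n
                c1 = -(-((j + 1) * w) // n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled
```

With these assumed levels, every feature map is reduced to 1 + 4 + 16 = 21 values, so the first fully connected layer always sees an input of the same length.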

The structure of the second-level CNN is exactly the same as that of the first-level CNN. The only difference between the two networks is that the first-level CNN calculates the displacement in the range of [-1, 1] pixels, while the second-level CNN is responsible for the displacement in the range of [-0.1, 0.1] pixels.

## 3. Algorithm principle

#### 3.1 Method for calculating displacement

As mentioned in Section 2.1, the displacement ranges [0, 1] and [0, 0.1] are each divided into 11 categories. Accordingly, the labels range from 0 to 10. For the first-level CNN, label *i* means that the displacement is 0.1×*i* pixels; for the second-level CNN, it means 0.01×*i* pixels. For instance, label 5 means a displacement of 0.5 pixels in the first-level CNN and 0.05 pixels in the second-level CNN. Because there are only 11 categories, the neural network can only calculate the displacement in one direction, which in this paper is the positive *y*-axis (vertical downward). In general, however, the displacement may be either positive or negative. To obtain the final displacement, the deformed and reference images should therefore be processed before they are input to the neural network; the specific process is shown in Fig. 2.

In Fig. 2, the blue arrow is the true value of the displacement, and the green arrows are its horizontal and vertical components. The red arrow represents the component already calculated by the convolutional neural network. Figure 2(a) is an image with only rigid body translation. To calculate displacements in all directions, picture (a) is rotated clockwise by 0°, 180°, 90° and 270° to obtain the four pictures (b), (c), (d) and (e), respectively. The vertical-down, vertical-up, horizontal-right and horizontal-left displacement components can then be obtained, as shown by the red arrows in (f), (g), (h) and (i), respectively. Interestingly, the neural network trained in this paper has a special characteristic: for the first-level CNN, if the displacement is less than 0 pixels, the network classifies it into class 0, and if the displacement is larger than 1 pixel, the network classifies it into class 10. The second-level neural network has the same property. Therefore, in (f), the vector *v* is in the negative direction along the *y*-axis, so the calculation result is *v*(1) = 0. Similarly, *u*(1) = 0. In (g), the vector *v* is in the positive direction along the *y*-axis, so *v*(2) is the absolute value of vector *v*, and *u*(2) is the absolute value of vector *u*. In this way, the displacement in all directions can be obtained.
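The saturation property described above is what allows a one-directional classifier to recover a signed displacement. The sketch below replaces the trained CNN with a hypothetical idealized estimator (`saturating_estimate`) that rounds to the nearest class and clamps to classes 0 and 10; the 180° rotation is modeled simply as a sign flip of the displacement component.

```python
def saturating_estimate(d, step=0.1, n_classes=11):
    """Idealized stand-in for the one-directional classifier (an assumption).

    Negative displacements fall into class 0 and displacements beyond the
    covered range fall into the last class, as described for the trained CNN.
    """
    label = round(d / step)
    label = max(0, min(n_classes - 1, label))
    return label * step


def signed_component(d):
    """Recover a signed component from two one-directional estimates."""
    positive_part = saturating_estimate(d)    # image pair as given
    negative_part = saturating_estimate(-d)   # image pair rotated by 180 degrees
    # At most one of the two parts is non-zero, so the difference is signed.
    return positive_part - negative_part
```

Applying the same idea to the 90° and 270° rotations would give the horizontal component in the same way.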

To obtain the displacement in [-0.1, 0.1] at a scale of 0.01 pixels, the deformed image needs to be translated vertically or horizontally according to the results calculated by the first-level neural network, so that the resolution of the subpixel displacement measurement is refined from 0.1 pixels over the initial [-1, 1] range to 0.01 pixels over the [-0.1, 0.1] range. For example, suppose the vertical displacement between two images is 0.57 pixels and the displacement calculated by the first-level network is 0.5 pixels. The deformed image is then moved by 0.5 pixels in the opposite direction using the method in [51], so that the sample displacement becomes 0.07 pixels. After that, the translated samples are input into the second-level network, and the result of 0.07 pixels can be obtained. The calculation process is shown in Fig. 3.

In Fig. 3, the whole calculation process is divided into six steps: first, rotate the raw data; second, use the first-level CNN to calculate the displacement at a scale of 0.1 pixels to obtain (*u*1, *v*1); third, translate the deformed image by (-*u*1, -*v*1) to obtain the new data; fourth, rotate the new data; fifth, use the second-level CNN to calculate the displacement at a scale of 0.01 pixels to obtain (*u*2, *v*2); finally, process the data to get the final displacement (U, V), where (U, V)=(*u*1+*u*2, *v*1+*v*2).
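The coarse-to-fine combination can be sketched numerically for one displacement component. This is only an illustration under stated assumptions: the second-level network is replaced by an idealized classifier that rounds the residual to the nearest 0.01 pixels, the image translation is modeled as a subtraction, and the function name `refine` is hypothetical.

```python
import math

COARSE_STEP = 0.1   # first-level resolution in pixels
FINE_STEP = 0.01    # second-level resolution in pixels
MAX_LABEL = 10      # labels run from 0 to 10


def refine(true_disp, coarse_label):
    """Combine a first-level label with an idealized second-level estimate."""
    coarse = coarse_label * COARSE_STEP      # first-level result
    residual = true_disp - coarse            # after shifting the image back
    # Idealized second-level classifier: nearest class of the residual
    # magnitude, with the sign recovered from the rotated image pair.
    fine_label = min(MAX_LABEL, round(abs(residual) / FINE_STEP))
    fine = math.copysign(fine_label * FINE_STEP, residual)
    return coarse + fine                     # final displacement
```

For a true displacement of 0.57 pixels, `refine(0.57, 5)` and `refine(0.57, 6)` both return approximately 0.57 under these assumptions, which illustrates why a first-level result that is off by one class can still be corrected by the second level.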

#### 3.2 Dataset

A deep learning algorithm normally needs a large amount of training data to achieve excellent performance. In general, obtaining a large number of speckle images experimentally is too time-consuming. To solve this problem, MATLAB programs are used to generate speckle patterns that simulate real speckle patterns with any required displacement. To reiterate, in this study the vertical-downward direction is taken as the positive direction of the *y*-axis, and the horizontal-right direction is taken as the positive direction of the *x*-axis.

It should be noted that although SPP was added to the network, we did not specifically design training sets of different sizes to train the models and compare their performance. The purpose of the SPP layer is simply to enable CNN-SDM to process inputs of different sizes; it is not intended to improve accuracy or efficiency.

#### 3.3 Transfer learning method

In transfer learning, model parameters are pre-trained on general domain tasks, and then the model can be fine-tuned for specific tasks, so as to ensure faster convergence speed of the model and lower requirements for training data. Using the existing training set and the new training set, the convolutional neural networks can continue to be trained to achieve the goal of practical application. The specific application will be shown in Section 4.4.

## 4. Experiments and discussions

In this section, the performance of the proposed CNN-SDM is verified through both simulation and real experiments. The simulation results verify its accuracy, while the real experiment demonstrates its robustness. The CNN must be trained before the experiments. The speckle patterns of the training set have a size of 125×125 pixels and are rendered with a Boolean model [52]. Figure 4 shows the generated speckle patterns, and Table 1 gives more detail.

#### 4.1 Rigid body translation test

### 4.1.1 Test on synthetic sample 1

Sample 1 is a set of 21 images with a size of 512×512 pixels and the same kind of speckle pattern as the training set, simulating uniform displacements simultaneously in the *x* (horizontal) and *y* (vertical) directions with an increment of 0.05 pixels; the displacement to be measured ranges from 0 to 1 pixel. Subsets with a size of 61×61 pixels are uniformly sampled with a grid spacing of s=20 pixels, giving a mesh of 20×20 = 400 points of interest. The subsets in the other experiments are sampled in the same way.

The mean bias error and the standard deviation (SD) error with respect to the true value are used to verify the accuracy of the proposed algorithm. The mean bias error ${u_e}$ and the SD error ${\sigma}$ are evaluated over the *N* calculated displacements, where *N* represents the number of points of interest (the central points of the subsets) and ${u_{pre}}$ represents the true value in the simulated speckle image.
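A minimal sketch of how these two statistics can be computed is given below. It assumes the usual definitions: the mean bias error is the mean of the *N* calculated displacements minus the true value, and the SD error is the sample standard deviation of the calculated displacements. The N−1 denominator and the function names are assumptions made for this illustration.

```python
import math


def mean_bias_error(measured, true_value):
    """Mean of the calculated displacements minus the true value u_pre."""
    return sum(measured) / len(measured) - true_value


def sd_error(measured):
    """Sample standard deviation of the calculated displacements.

    The N - 1 denominator is an assumption; a population SD would use N.
    """
    n = len(measured)
    mean = sum(measured) / n
    return math.sqrt(sum((u - mean) ** 2 for u in measured) / (n - 1))
```

For example, three measurements of 0.49, 0.50 and 0.51 pixels against a true value of 0.50 pixels give a mean bias error of approximately zero and an SD error of approximately 0.01 pixels.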

In Fig. 5, U and V are the mean measured values, plotted against the true values. Figures 5(a) and (c) demonstrate that the CNN-SDM is quite accurate in the calculation of rigid body translation. On the whole, the mean bias error and SD error increase as the displacement grows, which means that the measured results may be approximately symmetrical around the true value, especially in large displacement measurement. Quantitatively, the maximum absolute mean bias errors are 0.0062 and 0.0053 pixels, and the maximum SD errors are 0.0055 and 0.0078 pixels, respectively. For a two-level CNN, the error mainly comes from the second-level CNN. For example, if the true value is 0.05 pixels, the first-level CNN is likely to classify it as 0 or 0.1 pixels, either of which is acceptable at an accuracy of 0.1 pixels. The result of the second level then determines the error. This is why training the second-level CNN needs a larger training set and is therefore more time-consuming.

### 4.1.2 Test on synthetic sample 2

Sample 2 is a set of 11 images with a size of 512×512 pixels and the same kind of speckle pattern as shown in Fig. 4, simulating uniform displacements in the *x* (horizontal) and *y* (vertical) directions with an increment of 0.01 pixels; the displacement to be measured ranges from 0 to 0.1 pixels.

In Fig. 6(a), U and V are the mean values calculated with subset of different size and plotted with the true values. Since the network is trained with a training set of 125×125 pixels, the calculated displacement and true value fit best when the test data have the same size. It also can be seen that the spatial pyramid pooling method can compute input images of different sizes. However, the accuracy of CNN-SDM is affected when the size of the input changes. The mean bias error and SD error of all displacements are obtained and shown in Fig. 6(b), (c), (e) and (f).

### 4.1.3 Discussions

As shown in Fig. 5 and Fig. 6, the proposed network performs well in the calculation of rigid body translation, which proves the feasibility of the classification model. It can be seen from the results that both the mean bias errors and the standard deviation errors are highly dispersed, especially the standard deviation errors. This is caused by a characteristic of the classification network: if the true value is 0.9 pixels, the calculation result of CNN-SDM could be 0.89, 0.9 or 0.91 pixels, which means that the result fluctuates around the true value. This behavior has little impact on the mean bias errors, but it slightly increases the standard deviation errors. In addition, although the spatial pyramid pooling layer can deal with inputs of different sizes, it is recommended to keep the subset sizes of the test set and the training set consistent in order to ensure a certain level of accuracy.

#### 4.2 Tension test and pure shear test

### 4.2.1 Experimental process

As mentioned above, changes in the speckle pattern will affect the measurement results. The solution to this problem is to adopt transfer learning method. In order to verify the effectiveness of this method, a new speckle pattern described in [53] will be tested in this section. The simulated speckle function of this method is defined as:

Here, *s* is the number of speckles; *R* is the radius of a speckle; ${I_0}$ is the peak intensity; ${u_0},{v_0}$ are the displacements along the *x* and *y* directions; $({x_k},{y_k})$ is the position of each speckle, which follows a random distribution; ${u_x},{u_y}$ are the displacement gradients in the *x* direction; and ${v_x},{v_y}$ are the displacement gradients in the *y* direction.
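The symbols listed above match the widely used Gaussian speckle model, in which each speckle contributes a term of the form $I_0 \exp(-r^2/R^2)$ centered at its (possibly displaced) position. The sketch below assumes that form, with speckle centers moved by a first-order displacement field; it is an illustration under this assumption and not necessarily the exact expression used in [53]. The function name and argument layout are hypothetical.

```python
import math


def speckle_intensity(x, y, centers, R, I0,
                      u0=0.0, v0=0.0, ux=0.0, uy=0.0, vx=0.0, vy=0.0):
    """Intensity at (x, y) as a sum of Gaussian speckles (assumed model).

    centers is a list of (x_k, y_k) speckle positions. With all displacement
    parameters left at zero, the reference image is obtained.
    """
    total = 0.0
    for xk, yk in centers:
        # Displaced speckle center under a first-order displacement field.
        xk_def = xk + u0 + ux * xk + uy * yk
        yk_def = yk + v0 + vx * xk + vy * yk
        r2 = (x - xk_def) ** 2 + (y - yk_def) ** 2
        total += I0 * math.exp(-r2 / R ** 2)
    return total
```

Evaluating this function on a pixel grid with zero displacement parameters gives a reference image, and evaluating it again with non-zero parameters gives a deformed image with a known displacement field.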

We initialized a new CNN with the same structure using the pre-trained weights, and then trained it with the training set generated by the method in [53]. That is, we carried out transfer learning based on the trained CNN used in Section 4.1, which accelerates learning. Figure 7 shows the speckle patterns used in the tension test (left) and the pure shear test (right).

The images used in the tension and pure shear tests have a size of 512×512 pixels. The test set is sampled with a grid spacing of s=5 pixels, giving a mesh of 91×91 = 8281 test samples.
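The mesh size quoted above can be checked with a short sketch. It assumes that 61×61 subsets (the subset size used in Section 4.1) must lie entirely inside the image and that the grid starts at the first valid subset center; both are assumptions, and the function name is hypothetical.

```python
def grid_centers(image_size, subset_size, spacing):
    """Subset center coordinates along one axis (assumed sampling rule).

    A center c is valid when the subset [c - half, c + half] stays inside
    the image, i.e. half <= c <= image_size - 1 - half.
    """
    half = subset_size // 2
    return list(range(half, image_size - half, spacing))


centers_1d = grid_centers(512, 61, 5)
n_samples = len(centers_1d) ** 2   # same grid along both axes
```

Under these assumptions there are 91 centers per axis, which reproduces the 91×91 = 8281 test samples.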

The displacement fields estimated by CNN-SDM and the corresponding errors are shown in Fig. 8. Figures 8(a) and (b) are two groups of uniform displacement fields, which show that the network trained with the rigid body translation training set has a certain robustness and can be applied to tension and shear deformation. However, the results in Fig. 8(c) show that the SD error is still much larger than the mean bias error.

### 4.2.2 Error analysis

There are two main reasons why the calculation results fluctuate around the true value. First, the calculation result of CNN-SDM is always an integral multiple of 0.01 pixels, whereas the true displacement, as in the tension test, contains values such as 0.002, 0.004, 0.006 and 0.008 pixels. Compared with the true value, the results calculated by CNN-SDM therefore often deviate by amounts such as 0.002 or 0.008 pixels. For example, if the true value is 0.012 pixels, the CNN will estimate whether it is closer to 0.01 or 0.02 pixels, which causes an error of 0.002 or 0.008 pixels. Second, the calculation of CNN-SDM may itself be inaccurate. In some places the deviation is 0.012 or 0.018 pixels, and at a few points the deviation exceeds 0.03 pixels. Although 0.012 pixels is closer to 0.01 pixels, the CNN may sometimes classify it as 0.02 pixels, or even as 0 or 0.03 pixels in some cases. Like many classification networks, the CNN may mistake a tiger for a cat and a cat for a tiger. In addition, the test samples no longer undergo rigid body translation, which differs from the training set and affects the test results. These issues can be mitigated by using more examples in the training set and by training for longer as required. Generally speaking, the training result depends on the quality and richness of the training samples.
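The first source of error is simple quantization arithmetic, as the following sketch shows for the two neighboring classes of a given true displacement. The function name is hypothetical, and the 0.01-pixel step is the resolution of the second-level CNN.

```python
import math


def neighbor_class_errors(true_disp, step=0.01):
    """Errors made when a displacement is assigned to the class just below
    or just above it on a grid with the given step."""
    lower = math.floor(true_disp / step) * step
    upper = lower + step
    return abs(true_disp - lower), abs(upper - true_disp)


err_lower, err_upper = neighbor_class_errors(0.012)
```

For a true value of 0.012 pixels the two errors are approximately 0.002 and 0.008 pixels, matching the example in the text.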

### 4.2.3 Efficiency verification of transfer learning

In Section 4.2.1, Fig. 8 shows the effectiveness of transfer learning. In this section, we verify its efficiency. Using the speckle patterns shown in Fig. 7, we generated 11000 training samples and 2200 test samples. The network was then trained in two different ways: the first method trains the network from randomly initialized weights, and the second method trains the network with the transfer learning method used in Section 4.2.1. Both methods adopt Adam optimization with a learning rate of 0.0001. Because the network is a classification model, the Top-1 accuracy can be used to monitor the training process. To test the performance of transfer learning, we compare the time the two methods need to reach an accuracy of 90%. The training process is shown in Fig. 9, where “Transfer” denotes the transfer learning method and “Random” denotes the method that uses randomly initialized weights.
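The comparison criterion described above can be expressed with two small helper functions. Both names are hypothetical, and the accuracy history in the usage note is made-up illustrative data, not a result from Fig. 9.

```python
def top1_accuracy(predicted_labels, true_labels):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)


def first_epoch_reaching(accuracy_history, threshold=0.9):
    """1-based index of the first epoch whose accuracy reaches the threshold,
    or None if the threshold is never reached."""
    for epoch, accuracy in enumerate(accuracy_history, start=1):
        if accuracy >= threshold:
            return epoch
    return None
```

For an illustrative history such as `[0.10, 0.50, 0.92, 0.00, 0.95]`, `first_epoch_reaching` returns 3, even though the accuracy collapses in the following epoch; this is why the stability after the threshold is reached is also discussed in the text.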

The differences between the two methods can be clearly seen in Fig. 9. Without transfer learning, the accuracy is mostly 0 and fluctuates over a large range in many cases. The accuracy sometimes rises suddenly and then drops to zero in the next epoch. This can happen even after the accuracy has reached 90%, which indicates that the network is not in a stable condition. In addition, the network needs more than 140 epochs to achieve 90% accuracy. With transfer learning, however, the network needs only about 20 epochs, and the accuracy increases steadily. Once the accuracy reaches 90%, it fluctuates only within a small range. These results show that transfer learning speeds up the training process.

#### 4.3 Comparison of accuracy and computational cost

In this section, we verify the efficiency and accuracy of CNN-SDM through a series of case studies using synthetic images. At present, subset-based local DIC is the most widely used subpixel tracking algorithm, and augmented-Lagrangian DIC (ALDIC) [54] is one of the latest. Therefore, two sets of synthetic images were generated to compare the computational performance of our algorithm with that of these two DIC algorithms. All tests were performed on the same workstation with an Intel Xeon E5-2620 v3 CPU at 2.40 GHz, 32.0 GB of RAM and a 12 GB NVIDIA GeForce GTX Titan X GPU.

### 4.3.1 Homogeneous deformation

Deformed images without displacement gradients were used to test the efficiency and accuracy of the CNN-SDM method. A reference image and 10 deformed images were generated, all with a size of 512×512 pixels. The deformations in these images are pure translations in the vertical direction with amplitudes ranging from 0.1 to 1 pixel in increments of 0.1 pixels. The three methods were then used to calculate the displacement of this set of images. For a fair comparison at a given accuracy, all methods must output displacement matrices of the same size. The spacing was again set to 20 pixels, and 400 calculation points were selected for each image, as in Section 4.1.1. The time consumption and accuracy of the three methods were compared to evaluate their computational performance. The time consumption of the three methods is shown in Table 2, and the mean bias error and SD error are shown in Fig. 10.

It is clear from Table 2 that CNN-SDM is the least expensive and ALDIC is the most expensive. In terms of accuracy, however, the proposed algorithm is slightly worse than the other two methods. It can be seen from Fig. 10(a) that the displacement values measured by the three methods are close to the true value. In Fig. 10(b), the mean bias errors of subset-based local DIC and ALDIC are similar and vary sinusoidally with the displacement. By comparison, the mean bias error of CNN-SDM is the smallest. It is clear from Fig. 10(c) that the DIC methods are very stable, because their SD errors are almost zero. The main reason why the CNN-SDM method has a large SD error is that, as mentioned before, the proposed algorithm is a classification model: when the predicted value fluctuates around the true value, the smallest possible nonzero deviation is 0.01 pixels.

### 4.3.2 Heterogeneous deformation

In this section, two sets of images with heterogeneous deformation were used to test and compare the efficiency and accuracy of the three methods. Each set contained a reference image and a deformed image, both with a size of 512×512 pixels. The spacing was set to 5 pixels. The horizontal deformation in set 1 was uniaxial tension, as shown in Fig. 11(a), and the deformation in set 2 was sinusoidal with a changing frequency in the horizontal direction, as shown in Fig. 11(b).

Figs. 12 and 13 show the horizontal displacement for the two kinds of heterogeneous deformation images. These figures show that the results of the three methods are all close to the true value. Not surprisingly, the CNN-SDM method has a larger SD error. Table 3 lists the computation times for these heterogeneous deformations. It is clear from the table that the CNN-SDM method has a very obvious advantage in computational efficiency over the other two methods. Even when dealing with heterogeneously deformed images, its efficiency is still higher than that of the DIC methods.

It can be concluded from the above two groups of experiments that the CNN-SDM method has high computational efficiency. Moreover, although its accuracy is somewhat lower than that of the DIC methods, it is acceptable. An important reason why CNN-SDM is so efficient is that its algorithm differs from those of the DIC methods. Subset-based local DIC divides the reference image and the deformed image into several subsets for calculation. Since the subsets are limited in size, its calculation is very fast compared with ALDIC. However, the deformation of each subset is obtained independently, so the overall deformation may not be compatible. Unlike subset-based local DIC, ALDIC links the displacement fields of the subsets after they are found by introducing an auxiliary compatible displacement field. This makes the displacement more compatible but increases the amount of computation, so ALDIC is less efficient than subset-based local DIC. Both subset-based local DIC and ALDIC require an initial estimate and an iterative process to obtain accurate displacement results. Moreover, when the displacement becomes complex, their calculation efficiency is reduced. CNN-SDM is different: it requires neither initial value estimation nor an iterative process. Instead, it relies on operations such as convolution and pooling. In addition, its computational efficiency is not affected by the complexity of the displacement field, which only affects the accuracy. Therefore, the efficiency of CNN-SDM is higher than that of DIC when the accuracy is acceptable.

#### 4.4 Three-point bending test

To test the robustness of CNN-SDM, a three-point bending test was carried out. The schematic diagram of the three-point bending test is shown in Fig. 14. A beam made of ordinary carbon structural steel (E = 200 GPa) was used, and it was illuminated with a blue LED light source. The size of the specimen was 300 mm × 15 mm × 10 mm (*l* × *h* × *b*) and the span *d* was 50 mm.

To satisfy the requirements of CNN-SDM, the specimen was first sprayed with white paint as the background and then with black paint to generate a randomly distributed speckle pattern, as shown in Fig. 14(b). The beam was loaded at the center of a testing machine. The loading head was pressed down at a speed of 0.1 mm/min. A preload of 400 N was applied, and the image recorded at 400 N was taken as the reference image. The images recorded at loads of 800 N, 1200 N, 1600 N and 2000 N were then taken as the deformed images.

As shown in Fig. 15, the two images differ in brightness, and the size and distribution of the speckles are uneven. However, the greater challenge is the difficulty of generating a sufficiently large number of training examples, which requires an excessive amount of computational resources. To make CNN-SDM applicable to a wider range of situations, we put forward a feasible scheme in which a sufficient training set can be obtained from only one image.

To generate enough training samples, subsets of 61×61 pixels are uniformly sampled with a grid spacing of s = 30 pixels, giving a mesh of 41×11 = 451 samples. Because these samples are taken from the test image, they cannot be used for training directly. The practical methods for data enhancement are mirror reflection, rotation and brightness change. The steps are as follows:

- (1) Flip the 451 images left to right (mirror reflection) to get 451 new images, named Set 1.
- (2) Rotate Set 1 by 90°, 180° and 270°, respectively, generating 1353 (451×3) images, named Set 2.
- (3) Rotate the whole 1280×380 pixel test image, not Set 1, by 45° (or some other appropriate angle), then resample 61×61 pixel subsets from it to get new images, named Set 3, as shown in Fig. 16.
- (4) Process Set 3 according to steps (1) and (2) to get new images, named Set 4.
- (5) Adjust the brightness of Set 1, Set 2, Set 3 and Set 4 to get new data.
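As an illustration of steps (1), (2) and (5), the following minimal Python sketch samples the 61×61 subsets on a 30-pixel grid and applies mirror reflection, 90° rotations and a brightness change. It is not the authors' implementation: the function names, the synthetic stand-in image and the brightness gains (0.8 and 1.2) are assumptions made for this example, and the 45° rotation of step (3) is omitted because it requires an interpolation scheme that is not specified here.

```python
def grid_origins(width, height, subset=61, step=30):
    """Top-left corners of all subsets that fit entirely inside the image."""
    xs = range(0, width - subset + 1, step)
    ys = range(0, height - subset + 1, step)
    return [(x, y) for y in ys for x in xs]


def crop(img, x, y, subset=61):
    """Cut a subset x subset window out of a row-major list-of-lists image."""
    return [row[x:x + subset] for row in img[y:y + subset]]


def flip_lr(img):
    """Mirror reflection (left-right flip)."""
    return [row[::-1] for row in img]


def rot90(img, k=1):
    """Rotate by k * 90 degrees counter-clockwise."""
    for _ in range(k % 4):
        img = [list(col) for col in zip(*img)][::-1]
    return img


def adjust_brightness(img, gain, lo=0, hi=255):
    """Scale gray values by a (hypothetical) gain and clip to [lo, hi]."""
    return [[min(hi, max(lo, round(p * gain))) for p in row] for row in img]


# Synthetic stand-in for the 1280x380 pixel test image (not real speckle data).
WIDTH, HEIGHT = 1280, 380
image = [[(7 * x + 13 * y) % 256 for x in range(WIDTH)] for y in range(HEIGHT)]

subsets = [crop(image, x, y) for x, y in grid_origins(WIDTH, HEIGHT)]
set1 = [flip_lr(s) for s in subsets]                # step (1)
set2 = [rot90(s, k) for s in set1 for k in (1, 2, 3)]  # step (2)
brighter = adjust_brightness(set1[0], 1.2)          # step (5), one example
darker = adjust_brightness(set1[0], 0.8)

print(len(subsets), len(set1), len(set2))  # 451 451 1353
```

With these settings the grid yields 41 positions along the 1280-pixel side and 11 along the 380-pixel side, which reproduces the 451 samples and the 1353 rotated images quoted in the steps.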

In the way shown in Fig. 16, a 1280×380 pixel image can be processed into 3000∼4000 reference images. Translating these reference images then yields the deformed images. If brightness changes are also considered, more training data can be produced.
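The translation step can be sketched as follows. This is only an illustration under stated assumptions: the interpolation scheme used to produce the deformed images is not given in this section, so plain linear interpolation along the y direction is used here, and the name `translate_y` is hypothetical. Linear interpolation smooths the speckle pattern, so a higher-order interpolation or a Fourier-domain shift may be preferable in practice.

```python
def translate_y(img, dy):
    """Shift image content by dy pixels (0 <= dy < 1) toward increasing row index.

    The shifted image g satisfies g(y) = f(y - dy); the value between rows
    y - 1 and y is obtained by linear interpolation, and the top edge is
    replicated.
    """
    if not 0.0 <= dy < 1.0:
        raise ValueError("dy must lie in [0, 1)")
    out = []
    for i, row in enumerate(img):
        prev = img[i - 1] if i > 0 else row  # replicate the first row
        out.append([(1.0 - dy) * a + dy * b for a, b in zip(row, prev)])
    return out


# Check on a vertical intensity ramp f(y) = y: a shift of 0.3 pixels
# should give g(y) = y - 0.3 away from the top edge.
ramp = [[float(y)] * 4 for y in range(5)]
shifted = translate_y(ramp, 0.3)
```

A reference subset and its translated copy, together with the imposed displacement as the label, would then form one training pair.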

As shown in Table 4, two training sets were generated: one with brightness change and the other without. Besides, the training set of the second-level network is three times larger than that of the first-level network: since higher accuracy makes learning more difficult, a corresponding increase in training samples is needed. In this experiment, starting from the network used in Section 4.1, we trained two networks by transfer learning.

It is worth mentioning that obtaining training samples from the images to be tested may leave the trained network unable to compute new data. Nevertheless, we still consider the method worth proposing. First, subpixel displacement computation requires the CNN not only to extract speckle features but, more importantly, to extract displacement features from speckle patterns. Second, in the proposed data enhancement method, all training sets are obtained by processing the test set, rather than by using the test samples directly to train the network. Finally, although a network trained in this way may not be able to compute new data, it still shows a certain performance in calculating subpixel displacement. For example, after the test image is rotated by 45°, a displacement of 0.1 pixels can be imposed on it, whereas in the actual test the network trained with it may need to calculate a displacement of 0.9 pixels. If a network trained with such a training set can calculate any subpixel displacement, it can be considered to have, to a certain extent, the ability to calculate subpixel displacement. This approach may be helpful in cases where the number of training samples is particularly small.

Transfer learning can improve training speed and performance when the training set is insufficient. One network, named Net-1, uses Train 1 and Train 2 as its training set and thus accounts for changes in sample brightness, while the other, named Net-2, uses only Train 1. After the region of interest is selected, the test set is sampled with a grid spacing of s = 5 pixels, giving a mesh of 47×288 = 13536 test samples. The results calculated by CNN-SDM are then compared with those from DIC.

Figure 17 shows that both trained networks, Net-1 and Net-2, can measure the displacement field, but the results of Net-1 are smoother than those of Net-2. This indicates that if the network is trained without considering brightness changes, the calculation results become unreliable. For the second-level network, the change of gray value caused by brightness is a more obvious feature than the subpixel displacement, which causes the neural network to pay more attention to brightness than to displacement. Therefore, the results of Net-2 are highly scattered. When Train 2 is added, the results improve. The mean bias error and SD error are shown in Fig. 18.

In Fig. 18, the maximum deviations of Net-1 and Net-2 are 0.0141 pixels and 0.0156 pixels, and the maximum SD errors are 0.0093 pixels and 0.0196 pixels, respectively. It can also be observed that the mean deviation and SD errors increase as the displacement grows. This may be attributed to two reasons. First, the lack of training samples leads to poor learning of large displacements. Second, the CNN-SDM algorithm calculates only the displacement along the y direction, which is affected by the U displacement and noise. In summary, Net-1 is more robust and more precise than Net-2. To facilitate the comparison between Net-1 and DIC, Figs. 19 and 20 show horizontal and vertical three-line scans of the displacement fields of the two methods at different loadings.
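For readers who wish to reproduce such error statistics, a minimal sketch is given below. It assumes the definitions commonly used in the DIC literature, namely the mean bias error as the mean of the measured displacements minus the imposed value and the SD error as the sample standard deviation of the measured displacements; the function name `bias_and_sd` and the sample values are illustrative and are not data from this paper.

```python
from statistics import mean, stdev


def bias_and_sd(measured, true_value):
    """Return (mean bias error, SD error) for a list of measured displacements.

    bias = mean(measured) - true_value
    sd   = sqrt( sum((d - mean(measured))**2) / (N - 1) )
    """
    bias = mean(measured) - true_value
    sd = stdev(measured)  # sample standard deviation (N - 1 in the denominator)
    return bias, sd


# Made-up measurements around an imposed displacement of 0.50 pixels.
measured = [0.49, 0.51, 0.50, 0.52]
bias, sd = bias_and_sd(measured, 0.50)
print(f"mean bias error = {bias:.4f} px, SD error = {sd:.4f} px")
```

Evaluating the two quantities separately for each imposed displacement gives curves of the kind plotted against displacement.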

The results in Figs. 19 and 20 show that most of the CNN-SDM results overlap those of DIC. This indicates that the proposed data enhancement method is effective and that transfer learning can be used to learn the subpixel displacement of a new speckle pattern from a small number of samples. However, compared with DIC, the results of CNN-SDM are still scattered, and the maximum deviation is about 0.05 pixels. The main reason is that the variety of speckle is limited. Although the training sets are large, they are all obtained from only one image, which is not as rich as computer-simulated speckle patterns. This may lead to inaccurate results from the first-level network. For example, a true value of 0.55 pixels may be classified as 0.4 or 0.7 pixels by the first-level CNN, which results in a deviation of at least 0.05 pixels. However, such cases account for only a small proportion, and most of the calculated displacements are acceptable. Therefore, to improve the training effect, it is better to provide a few more speckle patterns similar to the test set.

## 5. Conclusions

This paper proposed a reliable subpixel displacement measurement method based on a convolutional neural network. Furthermore, a novel framework was developed based on a classification model, which is suitable for most displacement measurement tasks. The merits of the proposed CNN-SDM are embodied in the following aspects:

- (1) The proposed CNN-SDM method is feasible and effective for subpixel displacement measurement owing to its high efficiency and robustness.
- (2) Two classification models in series are adopted to calculate the subpixel displacement. The network has a simple structure and few parameters, which accelerates training. At the same time, the classification model makes the generation of training sets and labels more efficient.
- (3) Pyramid pooling and transfer learning are introduced to make the network more adaptable in practice. Pyramid pooling allows the network to accept input data of different sizes, and transfer learning enables the algorithm to be quickly applied to different speckle patterns. A trained network model and parameters for transfer learning are provided.

However, the proposed method also has the following limitations:

- (1) The accuracy of the second-level CNN can only reach 0.01 pixels, which results in a large degree of dispersion.
- (2) The network proposed in this paper is limited by the speckle pattern. If the speckle patterns differ from those in the training set, CNN-SDM will fail. Although transfer learning has been introduced, retraining is still inconvenient in practical applications. In addition, transfer learning is not always effective, for example when no proper training set is available. This may place some limitations on the use of this method.

The CNN-SDM algorithm proposed in this paper is simple to operate, easy to train, and has high efficiency and robustness. We hope that this work can serve as a new beginning for future research on subpixel displacement calculation using CNN.

## Funding

Fundamental Research Funds for the Central Universities (2015ZCQ-GX-02, 2018ZY08).

## Disclosures

The authors declare no conflicts of interest.

## Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

## References

**1.** M. A. Sutton, F. Matta, D. Rizos, R. Ghorbani, S. Rajan, D. H. Mollenhauer, H. W. Schreier, and A. O. Lasprilla, “Recent progress in digital image correlation: background and developments since the 2013 W M Murray Lecture,” Exp. Mech. **57**(1), 1–30 (2017).

**2.** D. I. T. I. Manuel, M. D. S. Hernández Montes, J. M. Flores-Moreno, and F. M. Santoyo, “Laser speckle based digital optical methods in structural mechanics: a review,” Opt. Lasers Eng. **87**, 32–58 (2016).

**3.** B. Pan, “Digital image correlation for surface deformation measurement: historical developments, recent advances and future goals,” Meas. Sci. Technol. **29**(8), 082001 (2018).

**4.** B. Pan, H. Xie, Z. Wang, K. Qian, and Z. Wang, “Study on subset size selection in digital image correlation for speckle patterns,” Opt. Express **16**(10), 7037–7048 (2008).

**5.** C. Cofaru, W. Philips, and W. Van Paepegem, “Pixel-level robust digital image correlation,” Opt. Express **21**(24), 29979–29999 (2013).

**6.** Z. Su, L. Lu, F. Yang, X. He, and D. Zhang, “Geometry constrained correlation adjustment for stereo reconstruction in 3D optical deformation measurements,” Opt. Express **28**(8), 12219–12232 (2020).

**7.** C. Hartmann, J. Wang, D. Opristescu, and W. Volk, “Implementation and evaluation of optical flow methods for two-dimensional deformation measurement in comparison to digital image correlation,” Opt. Lasers Eng. **107**, 127–141 (2018).

**8.** B. Pan, Z. Wang, and Z. Lu, “Genuine full-field deformation measurement of an object with complex shape using reliability-guided digital image correlation,” Opt. Express **18**(2), 1011–1023 (2010).

**9.** J. Ban, L. Wang, Z. Liu, and Z. Li, “Self-calibration method for temperature errors in multi-axis rotational inertial navigation system,” Opt. Express **28**(6), 8909–8923 (2020).

**10.** Y. Pang, B. K. Chen, S. F. Yu, and S. N. Lingamanaik, “Enhanced laser speckle optical sensor for in situ strain sensing and structural health monitoring,” Opt. Lett. **45**(8), 2331–2334 (2020).

**11.** W. Tong, “An evaluation of digital image correlation criteria for strain mapping applications,” Strain **41**(4), 167–175 (2005).

**12.** H. Masood, S. Rehman, A. Khan, F. Riaz, A. Hassan, and M. Abbas, “Approximate proximal gradient-based correlation filter for target tracking in videos: A Unified Approach,” Arabian J. Sci. Eng. **44**(11), 9363–9380 (2019).

**13.** S. Tehsin, S. Rehman, MOB. Saeed, F. Riaz, A. Hassan, M. Abbas, R. Young, and MS. Alam, “Self-Organizing Hierarchical Particle Swarm Optimization of Correlation Filters for Object Recognition,” IEEE Access **5**, 24495–24502 (2017).

**14.** S. Tensin, S. Rehman, A. Bilal, Q. Chaudry, O. Saeed, M. Abbas, and R. Young, “Comparative Analysis of Zero Aliasing Logarithmic mapped Optimal Trade-Off Correlation Filter,” Proc. SPIE **10203**(1020305), 1–16 (2017).

**15.** J. P. Lewis, “Fast Normalized Cross-Correlation,” Circ. Syst. Signal. Pr. **82**(2), 144–156 (1995).

**16.** W. Li, Y. Li, and J. Liang, “Enhanced feature-based path-independent initial value estimation for robust point-wise digital image correlation,” Opt. Lasers Eng. **121**, 189–202 (2019).

**17.** H. Jin and H. K. Bruck, “Pointwise digital image correlation using genetic algorithms,” Exp. Tech. **29**(1), 36–39 (2005).

**18.** W. Feng, Y. Jin, Y. Wei, W. Hou, and C. Zhu, “Technique for two-dimensional displacement field determination using a reliability-guided spatial-gradient-based digital image correlation algorithm,” Appl. Opt. **57**(11), 2780–2789 (2018).

**19.** H. A. Bruck, S. R. McNeil, M. A. Sutton, and W. H. Peters, “Digital Image Correlation Using Newton-Raphson Method of Partial Differential Correction,” Exp. Mech. **29**(3), 261–267 (1989).

**20.** G. Vendroux and W. G. Knauss, “Submicron Deformation Field Measurements: Part 2. Improved Digital Image Correlation,” Exp. Mech. **38**(2), 86–92 (1998).

**21.** H. Lu and P. D. Cary, “Deformation Measurement by Digital Image Correlation: Implementation of a Second-order Displacement Gradient,” Exp. Mech. **40**(4), 393–400 (2000).

**22.** M. C. Pitter, C. W. See, and M. G. Somekh, “Subpixel microscopic deformation analysis using correlation and artificial neural networks,” Opt. Express **8**(6), 322–327 (2001).

**23.** B. Pan, W. Dafang, and X. Yong, “Incremental calculation for large deformation measurement using reliability-guided digital image correlation,” Opt. Lasers Eng. **50**(4), 586–592 (2012).

**24.** Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” Nature **521**(7553), 436–444 (2015).

**25.** D. Gadot and L. Wolf, “Patchbatch: a batch augmented loss for optical flow,” in CVPR (2016), pp. 4236–4245.

**26.** Q. Cheng, S. Zhang, S. Bo, D. Chen, and H. Zhang, “Augmented Reality Dynamic Image Recognition Technology Based on Deep Learning Algorithm,” IEEE Access **8**, 1–10 (2020).

**27.** W. Zhou, J. Jia, C. Huang, and Y. Cheng, “Web3D Learning Framework for 3D Shape Retrieval Based on Hybrid Convolutional Neural Networks,” Tsinghua Sci. Technol. **25**(1), 93–102 (2020).

**28.** K. Aksit, “Patch scanning displays: spatiotemporal enhancement for displays,” Opt. Express **28**(2), 2107–2121 (2020).

**29.** A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “Flownet: Learning optical flow with convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 2758–2766.

**30.** J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in CVPR (2015), pp. 3431–3440.

**31.** Y. Xing, L. Zhong, and X. Zhong, “An encoder-decoder network based FCN architecture for semantic segmentation,” Wirel. Commun. Mob. Com. **2020**, 1–9 (2020).

**32.** G. Zou, G. Fu, M. Gao, J. Pan, and Z. Liu, “A new approach for small sample face recognition with pose variation by fusing Gabor encoding features and deep features,” Multimed. Tools Appl. **79**(31-32), 23571–23598 (2020).

**33.** W. Fang, Y. Ding, F. Zhang, and V. S. Sheng, “DOG: A new background removal for object recognition from images,” Neurocomputing **361**, 85–91 (2019).

**34.** D. Liu, L. Zhang, T. Luo, L. Tao, and Y. Wu, “Towards interpretable and robust hand detection via pixel-wise prediction,” Pattern Recogn. **105**, 107202 (2020).

**35.** A. Luo, X. Li, F. Yang, and Z. Jiao, “Webly-supervised learning for salient object detection,” Pattern Recogn. **103**, 107308 (2020).

**36.** D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” in Advances in Neural Information Processing Systems (2014), pp. 2366–2374.

**37.** Y. Guo, Z. Sun, R. Qu, L. Jiao, F. Liu, and X. Zhang, “Fuzzy superpixels based semi-supervised similarity-constrained CNN for PolSAR image classification,” Remote Sensing **12**(10), 1694 (2020).

**38.** W. Wang, D. A. Taft, Y. J. Chen, J. Zhang, C. T. Wallace, M. Xu, S. C. Watkins, and J. Xing, “Learn to segment single cells with deep distance estimator and deep cell detector,” Comput. Biol. Med. **108**, 133–141 (2019).

**39.** P. Fischer, A. Dosovitskiy, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: Learning Optical Flow with Convolutional Networks,” in IEEE International Conference on Computer Vision (2016), pp. 2758–2766.

**40.** Y. Lee, H. Yang, and Z. Yin, “PIV-DCNN: cascaded deep convolutional neural networks for particle image velocimetry,” Exp. Fluids **58**(12), 171 (2017).

**41.** S. Cai, J. Liang, Q. Gao, C. Xu, and R. Wei, “Particle image velocimetry based on a deep learning motion estimator,” IEEE Trans. Instrum. Meas. **69**(6), 3538–3554 (2020).

**42.** E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 1647–1655 (2017).

**43.** T. W. Hui, X. Tang, and C. C. Loy, “LiteFlowNet: a lightweight convolutional neural network for optical flow estimation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (2018), pp. 1–4.

**44.** L. Kong and J. Yang, “FDFlowNet: fast optical flow estimation using a deep lightweight network,” in International Conference on Image Processing (2020), pp. 1501–1505.

**45.** X. Liu and Q. Tan, “Subpixel In-Plane Displacement Measurement Using Digital Image Correlation and Artificial Neural Networks,” in Symposium on Photonics and Optoelectronics. Chengdu (2010), pp. 1–4.

**46.** K. He, X. Zhang, S. Ren, and J. Sun, “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,” IEEE Trans. Pattern Anal. Mach. Intell. **37**(9), 1904–1916 (2015).

**47.** X. Glorot, A. Bordes, and Y. Bengio, “Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach,” ICML (2011).

**48.** M. Lin, Q. Chen, and S. Yan, “Network in network,” in 2nd International Conference on Learning Representations, ICLR 2014-Conference Track Proceedings.

**49.** C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.

**50.** K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (2014), pp. 1–14.

**51.** H. W. Schreier, J. R. Braasch, and M. A. Sutton, “Systematic errors in digital image correlation caused by intensity interpolation,” Opt. Eng. **39**(11), 2915–2921 (2000).

**52.** F. Sur, B. Blaysat, and M. Grédiac, “Rendering deformed speckle images with a boolean model,” J. Math. Imaging Vis. **60**(5), 634–650 (2018).

**53.** P. Zhou and K. E. Goodson, “Subpixel displacement and deformation gradient measurement using digital image speckle correlation (DISC),” Opt. Eng. **40**(8), 1613–1620 (2001).

**54.** J. Yang and K. Bhattacharya, “Augmented Lagrangian Digital Image Correlation,” Exp. Mech. **59**(2), 187–205 (2019).