Stabilization of turbulence-degraded video using patch-based reference frame

Fouzia Nawreen; Kalyan Kumar Halder; Murat Tahtali; Sreenatha G. Anavatti

doi:10.1364/OPTCON.497110

1. Introduction

Long-range imaging systems are heavily impacted by atmospheric turbulence, which can cause severe distortions in the captured images or videos. This distortion is primarily caused by fluctuations in the air’s refractive index between the camera and the scene being captured [1–3]. These fluctuations occur due to various factors such as temperature, pressure, the wavelength of light, and air density. In this field of study, video stabilization methods are used to address these unwanted effects, with techniques such as digital image processing and adaptive optics being commonly used [4,5]. Digital image processing involves various steps such as image acquisition, preprocessing of images, image enhancement, image restoration, and analysis of the reconstructed images.

There are various methods for the stabilization of geometrically degraded videos, one of them is the FATR (First Average Then Register) method [6]. In this method, an approach for visualizing the effects of turbulence in clear air, as well as restoring the wide-area motion-blur images is introduced. A reference frame is created by averaging the image series which produces a motion-blurred image with geometry that closely resembles the actual scenery. The authors developed a cross-correlation technique for point-by-point registration to generate pixel shiftmaps that describe the distortion for each image. These maps enable the images to be dewarped prior to averaging, resulting in motion-blur-corrected images. In the FATR method, the estimation of geometrical shifts by pixel registration is erroneous as the average frame is affected by motion-blur.

The authors in [7] presented a technique for selecting a reference frame from a geometrically distorted video. This approach involves using a sharpness metric to identify the frame with the least amount of blurring, rather than the frame with the least geometric distortion. The technique also involves estimating the pixel shifts required to restore each frame to its geometrically accurate state. By subtracting the mean shift from each individual shift, the method is able to determine the shifts needed to restore each frame. This method is referred to as FRTAAS (First Register Then Average And Subtract) method.

Another method in [8] introduces a technique similar to the one proposed in [7], which is called FRTAAS2. The main difference between FRTAAS2 and the original FRTAAS is that FRTAAS2 does not have a fixed reference frame. Instead, each frame is registered to the one preceding it, with the assumption that the differences between adjacent frames are much smaller than those between distant frames. The same authors proposed the FRTAASv (First Register Then Average And Subtract-variant) method in [9]. In the FRAATSv method, a video sequence is divided into several sections each part having 20 frames and each part is registered separately. The first frame of every section is considered the reference frame for that section of the sequence. Although the FRTAAS method and its variants provide quite accurate and stable results, their main drawback is the use of a warped frame as the reference frame which limits the accuracy of restoration.

Zhu and Milanfar [2] proposed an approach for restoring images affected by atmospheric turbulence, which involves two main stages. In the first stage, each frame of the recorded video sequence is aligned with the reference frame using a non-rigid image registration technique based on B-splines. The registration process includes a symmetrical constraint that reduces the occurrence of mismatches between forward and backward deformation parameters during estimation, thereby improving accuracy. In the second stage, a high-quality image is generated using a Bayesian reconstruction framework applied to the registered frames. However, this reconstruction method has a limitation in that it assumes the point spread function to be both time- and space-invariant.

Abdoola et al. [10] introduced a technique for enhancing video sequences that suffer from turbulence-induced distortions caused by heat scintillation. The method involves treating each frame of the video sequence as a graph, where nodes represent pixels and edges represent connections between them. The technique then builds a cost function based on the structural coordinates of the nodes and the gray levels of the image. By minimizing the cost function, the method obtains new spatial coordinates for each node, which in turn produces a smoothed grid for the image. Mao and Gilles [11] constructed a variational model to address the geometric distortion problem and then solved it using the operator splitting technique and Bregman iterations. Specifically, the method involves using an optical flow method to estimate geometrical distortion, combined with a nonlocal Total Variation (TV)-based regularization technique to restore the original scenery.

Tian and Narasimhan [12] presented an image restoration method in which a static scene captured underwater from the water surface is restored. The method utilizes the wave equation to generate a geometric distortion model of the water surface and construct a model for the image degradation caused by water surface fluctuations. This allows the model-based tracking algorithm to estimate the water surface at any given time, which is then used to reconstruct the original image with less distortion. Oreifej et al. [13] proposed a method in which the averaged frame of a turbulence-degraded video sequence is used as the reference frame. The inaccuracies in FATR are reduced by blurring each frame of the video sequence utilizing a blur kernel estimated from the averaged reference frame. Then, using an image registration technique, each of the frames is registered with respect to the reference frame. Finally, the method estimates a warping function from the image registration step, which is used to dewarp the captured frames.

Rucci et al. [14] introduces a technique for reducing atmospheric turbulence using a series of short-exposure frames. The method involves using an iterative block-matching registration approach to correct image distortion. The corrected frames are then merged together using a least squares Lucky Look (LL) fusion process. Here the image patches are assigned weights to generate a fused image that agrees with a theoretical LL Optical Transfer Function (OTF) model. To further enhance the results, a Wiener filter is applied for the deconvolution of the LL OTF. In [15], an inverted pyramid structure is proposed that incorporates the cross-optical flow registration approach and a multi-scale weight fusion method using wavelet decomposition. The inverted pyramid employs the registration method, which is utilized to estimate the original pixel positions. Subsequently, a multi-scale image fusion technique is applied to merge the two inputs that have undergone optical flow and backward mapping processes.

Halder et al. [16] developed a method for restoring a distorted image from a warped video using motion compensation. The method begins by selecting a reference frame based on the estimated sharpness of the input frames, which is determined using a blind image quality metric. The frame with the highest sharpness value is chosen as the reference frame. Highly degraded frames in the video sequence are then rejected using a k-means clustering algorithm. An image registration approach is used to estimate the pixel shiftmaps of the warped frames over the reference frame. The method calculates the centroid of the evaluated shiftmaps and uses it to produce a high-quality output frame.

Sun et al. [17] proposed an image restoration method that can be used to restore a variety of degraded images, including turbulence-degraded video sequences, underwater objects, and monitoring of shallow riverbeds to observe vegetation. The method constructs a reference frame by combining patches that have the best quality from all frames in the sequence and then applies a guided filter to smooth the reference frame. To dewarp the frames in the sequence, an image registration method is used to register all the input frames against the reference frame. Once all the frames are dewarped, they are fed back into the input for several iterations to produce a better output.

This paper presents an improved approach for the stabilization of geometrically distorted videos. The reference frame needed for image registration is constructed using a patch-based approach. This reference frame is nearly similar to the real scene though it is affected by motion-blur. Due to the use of the best patches, this frame is less blurry than that obtained from the methods in [4,14]. Still, to obtain accurate shiftmaps of the frames, each frame in the video is blurred with a kernel estimated from the reference frame, and a window-based image registration technique is used. The captured frames are then dewarped using the estimated pixel shiftmaps. The proposed method is compared with state-of-the-art approaches in the literature [16] and [17], using various quality metrics.

The remainder of this paper is structured in the following manner: In Section 2, the patch-based method for estimating the reference frame is elucidated. Section 3 describes the source frame processing technique for image registration. Section 4 covers the window-based image registration technique that is utilized in this approach. The proposed method for image restoration is outlined in Section 5. Section 6 contains the simulation experiments and a comparative analysis of the methods. The final section, Section 7, provides comments and concluding remarks on the results.

2. Patch-based reference frame generation

An image patch refers to a set of pixels within an image. Understanding the similarity of image patches is a useful approach that can be used in image restoration techniques. In patch-based techniques, distorted images are partitioned into small blocks or patches. It leverages the differences and similarities among various blocks or patches of the input images to produce the final output image. With the patch-based approach, it is possible to replace irregular-shaped parts of an image with patches that match the surrounding regions [18,19].

In this study, each frame of the video sequence has a dimension of 512$\times$512 pixels and is divided into patches of size 64$\times$64 pixels with an overlap of 50% with the previous patch. A sample of the patch selection process is shown in Fig. 1, where each frame is divided into a total of 225 patches, with 15 patches in both the x- and y-directions. The quality of each patch is evaluated by computing its Peak-Signal-to-Noise Ratio (PSNR), which can easily be defined using Mean Square Error (MSE).

Fig. 1. An example of extracting image patches from a distorted frame.

Download Full Size | PDF

The MSE of a patch is calculated as

(1)$$\text{MSE}(P_i) = \frac{1}{m\times n} \sum_{x=1} ^{m} \sum_{y=1} ^{n} (P_i(x,y)-P_a(x,y))^2,$$

where $P_i$ is any individual patch, $P_a$ is the average of all the patches in that position, and $m \times n$ is the dimension of the patch. Using the MSE, the PSNR is determined by

(2)$$\text{PSNR}(P_i)= 10\log_{10}(\frac{(\text{MAX}(P_a))^2} {\text{MSE}(P_i)}).$$

After obtaining the PSNR values of all patches in a patch position (for example, patch-1), the best patch is selected based on the highest PSNR value. A high PSNR value signifies that the patch closely matches the averaged patch in terms of quality and aids in generating a sharper reference frame. The same process is applied to all other patches in the frames and the 255 best patches are chosen. Finally, these best patches are fused together to construct the patch-based reference frame.

3. Source frame processing

The reference frame, generated in the earlier section, looks similar to the ground-truth image but it may still affected by motion-blur due to the overlapping of patches with their neighborhoods. As registration of this blurry reference frame with respect to the non-blurry source frames of the input video sequence produces imperfect results [7], there might be a viable solution to that - deblur the reference frame by utilizing a state-of-the-art deblurring method and use that frame as the reference frame [20]. However, this approach may introduce undesired artifacts in the reconstructed frames [13]. This problem is solved here by estimating the blur kernel of the patch-based reference frame using a state-of-the-art method [21] and using it to blur all the source frames of the video sequence. It allows the image registration to emphasize the sharp edges of the reference frame rather than the corrupted blurry edges as both the reference frame and the source frames are blurry now, thus, giving a better estimation of pixel shiftmaps. It is noteworthy to mention that the utilized method for estimating blur kernel assumes spatial uniformity in the blur [21]. A block diagram of the source frame processing is shown in Fig. 2.

Fig. 2. Block diagram of source frame processing.

Download Full Size | PDF

4. Image registration

Image registration is a process of aligning multiple images of the same scene that were taken at different times, from different angles, or with different sensors. This technique is commonly used to estimate geometric distortions in a collection of similar images by aligning them with a reference image based on their geometry [22].

In this study, a correlation-based method for image registration called the Minimum Sum of Squared Differences (MSSD) method is used. This method determines the pixel shiftmaps by calculating the Sum of Squared Differences (SSD) for each pixel within a neighborhood. The process involves taking a square window of a particular size around the target pixel in the reference frame and searching for a matching pixel inside the window in the source frame by sliding the window around the entire frame. Figure 3 shows a search window of 3$\times$3 pixels within a search field of 9$\times$9 pixels. The window with the bold outline is positioned at zero shifts, while the window with the dashed outline is positioned at -4 and -4 pixels in the $x-$ and $y-$ directions, respectively, with respect to the search field. The window with the lowest SSD is identified after a thorough search for the target pixel, and the pixel shifts are calculated using the Euclidean distances. The outliers are subsequently removed by applying a median filter to the pixel shiftmaps [9].

Fig. 3. The MSSD cross-correlation technique showing the search windows (dashed and bold) and the search field (shaded) [9].

Download Full Size | PDF

5. Video stabilization

This work considers video sequences consisting of distorted frames of distant stationary scenes. The proposed method comprises 4 stages. Figure 4 shows the simplified block diagram of the complete method. In stage 1, the distorted frames are utilized to construct a reference frame using a patch-based approach. The reference frame closely resembles the original undisturbed image but is affected by motion-blur. Subsequently, in stage 2, the blur kernel of the reference frame is determined, and all source frames are made blurry using this kernel. In stage 3, window-based image registration is used to compute the pixel shifts of the blurred source frames with respect to the reference frame through backward mapping. Finally, in stage 4, these shifts are processed to correctly determine the shiftmaps of the input frames and reconstruct all of them. The mathematical formula used for the proposed video stabilization method is discussed below.

Fig. 4. Block diagram of the video stabilization method.

Download Full Size | PDF

Each source frame is registered with respect to the reference frame using the MSSD image registration to determine the shiftmaps $p_x(x,y,t)$ and $p_y(x, y,t)$. The mean of the shiftmaps, $M_x$ and $M_y$, are calculated as

(3)$$\begin{aligned}M_x (x,y) &= \frac{1}{N} \sum_{t=1} ^{N} p_x (x,y,t),\\ M_y (x,y) &= \frac{1}{N} \sum_{t=1} ^{N} p_y (x,y,t),\end{aligned}$$

where $N$ is the total number of frames in the video sequence.

Although the reference frame is generated using the sharpest patches, it could retain some degree of warping. The mean shiftmaps of the source frames with respect to the reference frame will then exhibit non-zero values. To obtain the accurate shiftmaps necessary for restoring the source frames, the mean shiftmaps are combined with individual shiftmaps via bicubic interpolation. Prior to this fusion, the inverse of the mean shiftmaps is computed, as the image registration involves a backward mapping technique to avoid holes in the estimated shiftmaps [7].

The inverses of $M_x$ and $M_y$ are calculated as [7]

(4)$$\begin{aligned}M_x^{{-}1} (x,y) &={-}M_x(x-M_x(x,y),y-M_y(x,y)),\\ M_y^{{-}1} (x,y) &={-}M_y(x-M_x(x,y),y-M_y(x,y)),\end{aligned}$$

where $M_x^{-1}$ and $M_y^{-1}$ are the inverses of $M_x$ and $M_y$, respectively.

The corrected shiftmaps, $C_x$ and $C_y$, for each input frame in the video sequence are calculated as

(5)$$\begin{array}{l}C_x(x,y,t) = M_x^{{-}1}(x,y)+p_x(x+M_x^{{-}1}(x,y),y+M_y^{{-}1}(x,y),t),\\ C_y(x,y,t) = M_y^{{-}1}(x,y)+p_y(x+M_x^{{-}1}(x,y),y+M_y^{{-}1}(x,y),t).\end{array}$$

By utilizing the corrected shiftmaps, it becomes possible to locate the unwarped version of the original warped frame through the following process.

(6)$$I^{'}(x,y,t) = I(x+C_x(x,y,t),y+C_y(x,y,t),t),$$

where $I^{'}$ is the unwarped form of the original warped frame $I$.

After the frames of the entire video sequence have been corrected for distortion, they are subsequently inputted for a few additional iterations. This process guarantees an enhanced video output of superior quality.

6. Simulation experiments

An Intel Core i5-8300H CPU operating at 2.30 GHz and having 16 GB of RAM was used to examine the proposed method and it was implemented in MATLAB with C++ MEX code. Experimental tests are done to compare its performance with two state-of-the-art methods namely the Halder method [16] and the Sun method [17]. At first, all three methods are applied to two synthetically distorted video sequences to verify and compare their performances. Later, all the methods were applied to a real-life unstable video sequence and evaluated their stabilization performances.

The two synthetically distorted sequences are created using the standard gray-scale Lena image and the Eye image [23], and named the Lena sequence and the Eye sequence, respectively. Each sequence comprising 80 frames is produced by applying a smoothly varying random distortion over time to the test image. This distortion is simulated by creating a grid of control points in a 13$\times$13 pattern over the entire image. These control points undergo a smoothly evolving random movement around their original positions on the grid, with an average of zero displacements. The random movement of each control point is generated by subjecting a random sequence to low-pass filtering at 5Hz, with a maximum displacement of 5 pixels. This process occurs at a frame rate of 25fps. The standard deviation of the pixels’ displacements over an entire sequence is 1.56 pixels. These specified parameters emulate a reasonably accurate representation of a capture sequence affected by anisoplanatic effects. This simulation is based solely on geometric distortion, with no consideration for the blurring effect.

The real-life sequence referred to as the Tower sequence [24] was acquired from the Council for Scientific and Industrial Research (CSIR), South Africa. It was recorded under real atmospheric turbulence conditions utilizing a 2MP Allied Vision Prosilica camera, which was a custom-designed camera with a 1500mm focal length. Despite the weather being relatively calm, there was significant atmospheric turbulence due to the heat. The video sequence depicts a water tower located about 7km to the south of CSIR’s Scientia campus in Pretoria, which itself is positioned about 1339m above sea level and 2 degrees to the south of the Tropic of Capricorn.

6.1 Quality metrics

The proposed video stabilization method is compared with two earlier methods using several quality metrics such as MSE, PSNR, Structural Similarity (SSIM), and Mutual Information (MI). These metrics need the availability of a distortion-free image, known as the ground-truth image, which can be compared to the image whose quality is to be estimated.

6.1.1 MSE

The MSE is calculated as the sum of the squared differences between the restored frames and the corresponding ground-truth frames. The lower the value of MSE, the lower the error. The MSE in Eq. (1) can be redefined as

(7)$$\text{MSE}(R, T) = \frac{1}{r\times c} \sum_{x=1} ^{r} \sum_{y=1} ^{c} (R(x,y)-T(x,y))^2,$$

where $R$ is the restored frame and $T$ is the ground-truth frame, and $r \times c$ is the dimension of frame.

6.1.2 PSNR

The PSNR measures the ratio between the maximum possible power of a signal and the power of the noise that affects the fidelity of the signal. In the context of image or video quality assessment, PSNR quantifies the level of distortion or loss of information compared to the original reference signal. A higher PSNR value indicates better quality, while a lower PSNR value suggests more significant distortion or loss of information. The PSNR in Eq. (2) can be redefined as

(8)$$\text{PSNR}(R, T)= 10\log_{10}(\frac{(\text{MAX}(T))^2} {\text{MSE}}).$$

6.1.3 SSIM

The SSIM is the preferred statistic for assessing how similar two images are to one another. It is a local quality score that incorporates brightness, contrast, and local image structure. Following brightness and contrast normalization, structures in this measure are patterns of pixel intensities, particularly between adjacent pixels. The SSIM is expressed as [25]

(9)$$\begin{aligned} \text{SSIM}(R, T)= \frac{(2\mu_R \mu_T +C_1)(2\sigma_{RT} + C_2)}{(\mu_R^2 + \mu_T^2 +C_1)(\sigma_R^2 + \sigma_T^2 + C_2)} , \end{aligned}$$

where the local means, standard deviations, and cross-covariance for the restored frame $R$ and the ground-truth frame $T$ are indicated by $\mu _R$, $\mu _T$, $\sigma _R$, $\sigma _T$ and $\sigma _{RT}$, respectively. Also, $C_1$ and $C_2$ are constants and their values can be written as

\begin{aligned} C_1 = (0.01\times {L})^2,\\ C_2 = (0.03\times {L})^2, \end{aligned}

where $L$ is the dynamic range of the pixel values. The SSIM value presented in this paper essentially represents the mean SSIM over the entire image.

6.1.4 MI

The MI between two images essentially measures how much information one image has about the other. Given frame $R$ and frame $T$, the MI is calculated as

(10)$$\text{MI}(R, T) = \sum_{a \in R}\sum_{b \in T} P_{RT}(a,b) \log_{2}(\frac{P_{RT}(a,b)}{P_R(a)P_T(b)})$$

where the joint probability of $R$ and $T$ is $P_{RT}$ and the marginal probability of $R$ and $T$ are $P_R$ and $P_T$, respectively. More specifically, $P_R(a)$ represents the likelihood of a pixel ($x$, $y$) in frame $R$ having a grayscale value of $a$, while $P_T(b)$ denotes the likelihood of a pixel ($x$, $y$) in frame $T$ having a grayscale value of $b$. Additionally, $P_{RT}(a,b)$ indicates the likelihood that a pixel ($x$, $y$) in frame $R$ has a grayscale value of $a$, and the corresponding pixel in frame $T$ has a grayscale value of $b$.

6.2 Results for Lena sequence

The synthetically warped Lena sequence consists of 80 frames, each frame of size 512$\times$512 pixels. Figure 5(a) shows the ground-truth Lena image, which is important for the experimental result to compare with when using a synthetic sequence. To demonstrate the restoration accuracy for a single frame, a distorted frame was selected at random, which is frame number 30, and it is shown in Fig. 5(b). Figure 5(c) presents the reference frame for the Lena sequence that is obtained from the distorted frames. The reference frame appears almost identical to the ground-truth image but exhibits motion-blur as can be seen in it. Despite the reference frame being generated with optimal patches, the blur arises due to the patches having overlapping regions with their neighboring patches. A difference image is computed between the reference frame and the ground-truth to exhibit their dissimilarities on a gray background and it is depicted in Fig. 5(d). In the difference image, the lighter and darker regions correspond to the positive and negative differences, respectively.

Fig. 5. (a) Ground-truth Lena image, (b) warped frame 30, (c) patch-based reference frame, and (d) difference image of reference frame.

Download Full Size | PDF

The performance of the three methods for restoring a single frame (displayed in Fig. 5(b)) is presented in Fig. 6. The restored versions of frame 30 using the Halder method, the Sun method, and the proposed method are shown in Figs. 6(a) to 6(c), respectively. To visualize the structural dissimilarities, their difference images with respect to the ground-truth are also determined and shown in Figs. 6(d) to 6(f), respectively. Even though the restored frames employing all three methods seem to be closer to the ground-truth, upon closer inspection, it becomes evident that the proposed method has fewer artifacts. Moreover, the difference image for the proposed method has lesser lighter and darker regions, indicating that this method restored frame 30 closer to the original Lena image.

Fig. 6. Restored Lena frame 30 using (a) Halder method, (b) Sun method, and (c) proposed method. Corresponding difference image for (d) Halder method, (e) Sun method, and (f) proposed method.

Download Full Size | PDF

After restoring all 80 frames using the methods separately, an averaged frame is reconstructed for each of them. These frames indicate how much the restored videos are stabilized. Figure 7 shows the averaged frames and corresponding difference images. A closer inspection of the images clearly reveals that the proposed method has higher video stabilization accuracy, as indicated by its difference image that possesses fewer artifacts compared to that of the other methods. In order to aid the visual analysis, a quantitative analysis is carried out using four quality metrics. Table 1 shows the values of the MSE, PSNR, SSIM, and MI for the averaged frames. These values also confirm the better restoration efficacy of the proposed method.

Fig. 7. Average of restored Lena frames using (a) Halder method, (b) Sun method, and (c) proposed method. Corresponding difference image for (d) Halder method, (e) Sun method, and (f) proposed method.

Download Full Size | PDF

Table 1. Comparison of quality metrics for the Lena sequence

View Table | View all tables in this article

Additional analyses are conducted to show how stabilized are the video sequences. Firstly, the MSE is computed for all the restored frames using each method. These MSE values are then plotted against the corresponding restored frame number to obtain four MSE curves, one for each method. Figure 8(a) shows the plot of MSE values versus restored frames. From the figure, it can be seen that the MSE values of the restored frames for the proposed method are consistently lower than those of the other methods, except for frame 68. The Halder method yields a lower MSE value for frame 68 since it directly employs this frame as the reference frame. The restoration of this frame involves fewer interpolation operations (for example, Eq. (5) is not required), resulting in a lower processing error.

Fig. 8. Plot of quality metrics for the Lena sequence: (a) MSE, (b) PSNR, (c) SSIM, and (d) MI.

Download Full Size | PDF

In a similar way, PSNR, SSIM, and MI values are plotted and shown in Figs. 8(b) to 8(d), respectively. From these figures, it becomes further clear that the values for the proposed method are higher, with few exceptions, than that of the other methods, confirming higher video stabilization accuracy.

6.3 Results for Eye sequence

The synthetically warped Eye sequence consists of 80 frames, each frame of size 890$\times$512 pixels. All the analyses that were done for the Lena sequence are also followed for the distorted Eye sequence to verify whether the proposed method works the same for different video sequences. The ground-truth Eye image is presented in Fig. 9(a). To demonstrate the restoration accuracy for a single frame, the warped frame 40 is chosen and shown in Fig. 9(b). Figure 9(c) shows the patch-based reference frame, which is required to register the source frames. Its difference image is illustrated in Fig. 9(d), which has higher dissimilarities with respect to the ground-truth if compared with that of the Lena sequence.

Fig. 9. (a) Ground-truth Eye image, (b) warped frame 40, (c) patch-based reference frame, and (d) difference image of reference frame.

Download Full Size | PDF

Figure 10 shows the restored versions of the warped frame 40 and the corresponding difference images for the Eye sequence. When all the restored frames are visually examined, it is clear that the proposed method shows fewer distortions than the other methods. Additionally, the dissimilarities are evident in the difference images, where the lighter areas are more pronounced in the results of the other two methods compared to those of the proposed method.

Fig. 10. Restored Eye frame 40 using (a) Halder method, (b) Sun method, and (c) proposed method. Corresponding difference image for (d) Halder method, (e) Sun method, and (f) proposed method.

Download Full Size | PDF

A final averaged frame is created using all 80 restored frames of the Eye sequence for each method and the results are presented in Fig. 11. Although all the averaged frames look similar and closer to the ground-truth, the difference image of the proposed method has fewer artifacts than the difference images of the other methods. The MSE, PSNR, SSIM, and MI values are also calculated for the averaged frame and listed in Table 2. The quality metrics’ values serve as evidence for the improved accuracy of the proposed restoration method.

Fig. 11. Average of restored Eye frames using (a) Halder method, (b) Sun method, and (c) proposed method. Corresponding difference image for (d) Halder method, (e) Sun method, and (f) proposed method.

Download Full Size | PDF

Table 2. Comparison of quality metrics for the Eye sequence

View Table | View all tables in this article

As with the Lena sequence, the quality metrics are computed for every restored frame of the Eye sequence using all three methods. The obtained values of MSE, PSNR, SSIM, and MI are presented in Fig. 12, displaying the correlation with the corresponding frame numbers. From the graph, it is evident that the proposed method yields significantly lower MSE values compared to the other two methods. Conversely, the PSNR, SSIM, and MI values, with a few exceptions, are higher for the proposed method compared to the other two methods. These results consistently indicate that the proposed method exhibits superior performance.

Fig. 12. Plot of quality metrics for the Eye sequence: (a) MSE, (b) PSNR, (c) SSIM, and (d) MI.

Download Full Size | PDF

6.4 Application on real-life sequence

Once the superior accuracy of the proposed method was confirmed on synthetically warped sequences, it was then employed on a real-life Tower sequence. This sequence comprises 80 frames, each with dimensions of 560$\times$460 pixels. The average frames of the stabilized videos were obtained and depicted in Fig. 13 for all three methods. Since there is no ground-truth available for this particular video sequence, difference images are not provided. The figures clearly demonstrate that the proposed method generated a frame that is slightly sharper, less blurry, and exhibited fewer artifacts compared to the other methods.

Fig. 13. Average of restored Tower frames using (a) Halder method, (b) Sun method, and (c) proposed method.

Download Full Size | PDF

Again, as the real-life sequence does not have any ground-truth, it is not possible to calculate the quality metrics that are calculated for the synthetically warped sequences. For this reason, two blind image quality metrics known as BIQAA [26] and BRISQUE [27], are used. The BIQAA values indicate better quality at higher levels, while lower BRISQUE values indicate better quality. Table 3 shows a numerical comparison of the blind image quality metrics for the average frames restored using the three methods. After analyzing the table, it can be inferred that the proposed method surpasses the other methods in terms of performance for real-life video sequences.

Table 3. Comparison of blind quality metrics for the Tower sequence

View Table | View all tables in this article

Finally, the blind quality metrics are computed and plotted in Fig. 14 for every restored frame of the Tower sequence for the three methods. The graph clearly illustrates that the proposed approach consistently produces higher BIQAA and lower BRISQUE values compared to the other two methods. This consistent pattern of results strongly suggests that the proposed method outperforms the alternatives.

Fig. 14. Plot of blind quality metrics for the Tower sequence: (a) BIQAA and (b) BRISQUE.

Download Full Size | PDF

7. Conclusion

In this research, an effective method for restoring the frames of a geometrically warped video sequence is successfully implemented. Estimating a patch-based frame from the distorted video sequence and using it as the reference frame for image registration is the main concept of this paper. The quality of the reference frame has a key role in how well the proposed method performs. Therefore, the patches having the best quality are carefully picked from all the distorted frames to build the best quality reference frame. Utilizing this frame, the window-based MSSD image registration gave a more accurate estimation of the pixel shiftmaps.

The proposed method has been validated using both synthetically warped and real-life video sequences. Four image quality metrics are employed for the synthetically warped sequences while two non-reference blind image quality measures are used to evaluate the proposed method on the real-world sequence for numerical comparison. In comparison to state-of-the-art methods, the proposed method demonstrates better performance in terms of geometrical correctness, on both types of video sequences. The implementation of the image registration algorithm using Matlab executable (MEX) C++ code significantly reduces the computational burden of the method. This approach offers potential benefits in diverse applications, including machine vision, surveillance, and other related fields. Given that our proposed method primarily addresses geometric distortions, investigating additional atmospheric factors such as blurring due to the widening of the PSF could be a potential avenue for future research. Furthermore, it will be intriguing to observe the method’s performance on video frames containing smooth gradients.

Disclosures

The authors declare no conflicts of interest.

Data availability

Data underlying the results presented in this paper are not publicly available at this time but may be obtained from the authors upon reasonable request.

References

1. D. Li, R. M. Mersereau, and S. Simske, “Atmospheric turbulence degraded image restoration using principal components analysis,” IEEE Geosci. Remote Sens. Lett. 4(3), 340–344 (2007). [CrossRef]

2. X. Zhu and P. Milanfar, “Image reconstruction from videos distorted by atmospheric turbulence,” Proc. SPIE 7543, 75430S (2010). [CrossRef]

3. C. S. Huebner and M. Greco, “Blind deconvolution algorithms for the restoration of atmospherically degraded imagery: a comparative analysis,” Proc. SPIE 7108, 71080M (2008). [CrossRef]

4. K. K. Halder, M. Tahtali, and S. G. Anavatti, “A new image restoration approach for imaging through the atmosphere,” in IEEE International Symposium on Signal Processing and Information Technology (Institute of Electrical and Electronic Engineers, 2013), pp. 350–355.

5. M. A. Vorontsov and G. W. Carhart, “Anisoplanatic imaging through turbulent media: image recovery by local information fusion from a set of short-exposure images,” J. Opt. Soc. Am. A 18(6), 1312–1324 (2001). [CrossRef]

6. D. Fraser, G. Thorpe, and A. Lambert, “Atmospheric turbulence visualization with wide-area motion-blur restoration,” J. Opt. Soc. Am. A 16(7), 1751–1758 (1999). [CrossRef]

7. M. Tahtali, D. Fraser, and A. J. Lambert, “Restoration of nonuniformly warped images using a typical frame as prototype,” in IEEE Region 10 Conference (Institute of Electrical and Electronic Engineers, 2005), pp. 1–6.

8. M. Tahtali, A. J. Lambert, and D. Fraser, “Restoration of nonuniformly warped images using accurate frame by frame shiftmap accumulation,” Proc. SPIE 6316, 631603 (2006). [CrossRef]

9. M. Tahtali, A. J. Lambert, and D. Fraser, “Graphics processing unit restoration of non-uniformly warped images using a typical frame as prototype,” Proc. SPIE 7800, 78000H (2010). [CrossRef]

10. R. Abdoola, G. Noel, B. van Wyk, et al., “Correction of atmospheric turbulence degraded sequences using grid smoothing,” Lect. Notes Comput. Sci. 6754, 317–327 (2011). [CrossRef]

11. Y. Mao and J. Gilles, “Non rigid geometric distortions correction - application to atmospheric turbulence stabilization,” Inverse Probl. Imaging 6(3), 531–546 (2012). [CrossRef]

12. Y. Tian and S. G. Narasimhan, “Seeing through water: Image restoration using model-based tracking,” in IEEE International Conference on Computer Vision (Institute of Electrical and Electronics Engineers, 2009), pp. 2303–2310.

13. O. Oreifej, S. Guang, T. Pace, et al., “A two-stage reconstruction approach for seeing through water,” in IEEE Conference on Computer Vision and Pattern Recognition (Institute of Electrical and Electronic Engineers, 2011), pp. 1153–1160.

14. M. A. Rucci, R. C. Hardie, R. K. Martin, et al., “Atmospheric optical turbulence mitigation using iterative image registration and least squares lucky look fusion,” Appl. Opt. 61(28), 8233–8247 (2022). [CrossRef]

15. Y. Cao, C. Cai, and H. Meng, “Inverted pyramid frame forward and backward prediction for distorted video by water waves,” Appl. Opt. 62(12), 3062–3071 (2023). [CrossRef]

16. K. K. Halder, M. Paul, M. Tahlati, et al., “Correction of geometrically distorted underwater images using shift map analysis,” J. Opt. Soc. Am. A 34(4), 666–673 (2017). [CrossRef]

17. T. Sun, Y. Tang and, and Z. Zhang, “Structural information reconstruction of distorted underwater images using image registration,” Appl. Sci. 10(16), 5670 (2020). [CrossRef]

18. M. H. Alkinani and M. R. El-Sakka, “Patch-based models and algorithms for image denoising: A comparative review between patch-based images denoising methods for additive noise reduction,” EURASIP J. Image Video Process. 2017(1), 58 (2017). [CrossRef]

19. V. Papyan and M. Elad, “Multi-scale patch-based image restoration,” IEEE Trans. on Image Process. 25(1), 249–261 (2016). [CrossRef]

20. Z. Zhang and X. Yang, “Reconstruction of distorted underwater images using robust registration,” Opt. Express 27(7), 9996–10008 (2019). [CrossRef]

21. A. Goldstein and R. Fattal, “Blur-kernel estimation from spectral irregularities,” Lect. Notes Comput. Sci. 7576, 622–635 (2012). [CrossRef]

22. B. Zitova and J. Flusser, “Image registration methods: a survey,” Image Vis. Comput. 21(11), 977–1000 (2003). [CrossRef]

23. J. G. James, P. Agrawal, and A. Rajwade, “Restoration of non-rigidly distorted underwater images using a combination of compressive sensing and local polynomial image representations,” in IEEE/CVF International Conference on Computer Vision (Institute of Electrical and Electronic Engineers, 2019), pp. 7838–7847.

24. P. Robinson, B. Walters, and W. Clarke, “Sharpening and contrast enhancement of atmospheric turbulence degraded video sequences,” in Proceedings of the Twenty-First Annual Symposium of the Pattern Recognition Association of South Africa (Pattern Recognition Association of South Africa, 2010), pp. 245–250.

25. Z. Wang, A. C. Bovik, H. R. Sheikh, et al., “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. on Image Process. 13(4), 600–612 (2004). [CrossRef]

26. S. Gabarda and G. Cristobal, “Blind image quality assessment through anisotropy,” J. Opt. Soc. Am. A 24(12), B42–B51 (2007). [CrossRef]

27. A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. on Image Process. 21(12), 4695–4708 (2012). [CrossRef]

Quality metrics	Halder method	Sun method	Proposed method
MSE (Lower value desired)	8.6922	8.5621	7.2963
PSNR (Higher value desired)	29.6141	29.4792	30.8688
SSIM (Higher value desired)	0.6981	0.6261	0.7635
MI (Higher values desired)	2.8653	2.7823	3.0319

Quality metrics	Halder method	Sun method	Proposed method
MSE (Lower value desired)	21.4957	14.8246	10.7439
PSNR (Higher value desired)	23.4698	26.3536	28.3515
SSIM (Higher value desired)	0.6096	0.6980	0.8239
MI (Higher value desired)	2.7096	2.6909	3.1990

Blind quality metrics	Halder method	Sun method	Proposed method
BIQAA (Higher value desired)	1.5029e-05	2.3017e-05	6.7216e-05
BRISQUE (Lower value desired)	116.1190	116.1170	115.1000

Quality metrics	Halder method	Sun method	Proposed method
MSE (Lower value desired)	8.6922	8.5621	7.2963
PSNR (Higher value desired)	29.6141	29.4792	30.8688
SSIM (Higher value desired)	0.6981	0.6261	0.7635
MI (Higher values desired)	2.8653	2.7823	3.0319

Quality metrics	Halder method	Sun method	Proposed method
MSE (Lower value desired)	21.4957	14.8246	10.7439
PSNR (Higher value desired)	23.4698	26.3536	28.3515
SSIM (Higher value desired)	0.6096	0.6980	0.8239
MI (Higher value desired)	2.7096	2.6909	3.1990

Stabilization of turbulence-degraded video using patch-based reference frame

Abstract

1. Introduction

2. Patch-based reference frame generation

3. Source frame processing

4. Image registration

5. Video stabilization

6. Simulation experiments

6.1 Quality metrics

6.1.1 MSE

6.1.2 PSNR

6.1.3 SSIM

6.1.4 MI

6.2 Results for Lena sequence

6.3 Results for Eye sequence

6.4 Application on real-life sequence

7. Conclusion

Disclosures

Data availability

References

Data availability

Cited By

Figures (14)

Tables (3)

Equations (11)

Optics Continuum