
Superimposed video disambiguation for increased field of view

Open Access

Abstract

Many infrared optical systems in wide-ranging applications such as surveillance and security frequently require large fields of view (FOVs). Often this necessitates a focal plane array (FPA) with a large number of pixels, which, in general, is very expensive. In a previous paper, we proposed a method for increasing the FOV without increasing the pixel resolution of the FPA by superimposing multiple sub-images within a static scene and disambiguating the observed data to reconstruct the original scene. This technique, in effect, allows each sub-image of the scene to share a single FPA, thereby increasing the FOV without compromising resolution. In this paper, we demonstrate the increase of FOVs in a realistic setting by physically generating a superimposed video from a single scene using an optical system employing a beamsplitter and a movable mirror. Without prior knowledge of the contents of the scene, we are able to disambiguate the two sub-images, successfully capturing both large-scale features and fine details in each sub-image. We improve upon our previous reconstruction approach by allowing each sub-image to have slowly changing components, carefully exploiting correlations between sequential video frames to achieve small mean errors and to reduce run times. We show the effectiveness of this improved approach by reconstructing the constituent images of a surveillance camera video.

©2008 Optical Society of America

1. Introduction

The performance of a typical imaging system is characterized by the resolution (the smallest feature that the system can resolve) and the field of view (FOV: the maximum angular extent that can be observed at a given instant). In most electronic imaging systems today, the detector element is a focal plane array (FPA) typically made of semiconductor photodetectors [1, 2, 3, 4]. An FPA performs spatial sampling of the optical intensities at the image plane, with the maximum resolvable spatial frequency being inversely proportional to the center-to-center distance between pixels. Therefore, to obtain a high-resolution image with a given FPA, the optics must provide sufficient magnification, which limits the FOV. There are many applications where this trade-off between resolution and FOV needs to be overcome. A good example is thermal imaging surveillance systems operating at the mid- and long-wave infrared wavelengths (3-20 µm). Because FPAs sensitive to this spectral range remain very expensive, techniques capable of achieving a wide FOV with a small-pixel-count FPA are desired.

Many techniques proposed to date to overcome the FOV-resolution trade-off are based on acquisition of multiple images and their subsequent numerical processing. For example, image mosaicing techniques increase the FOV while retaining the resolution by tiling multiple sequentially captured images corresponding to different portions of the overall FOV [5, 6]. When applied to a video system, these techniques require acquisition of all sub-images for each video frame in order to accurately capture relative motion between adjacent frames. This means that the scanning element must scan through the sub-images at a rate much faster than the video frame rate, which is challenging to implement using conventional video cameras. In another example, super-resolution techniques provide means to overcome the resolution limit imposed by the FPA pixel size. In this technique, multiple images are obtained from a single scene, with each image having a different sub-pixel displacement from the others [7, 8]. The sub-pixel displacements provide additional information about the scene as compared with a single image, which can be exploited to construct an image of the scene with resolution better than that imposed by the FPA pixel size. For these techniques to succeed, the displacements of the low-resolution images need to be known with sub-pixel accuracy, either through precise control of hardware motion or through an accurate image registration algorithm [9, 10]. As the pixel size of FPAs continues to shrink, this requirement translates to micron-level control/registration, which is difficult to maintain in a realistic operating environment subject to vibrations and temperature variations.

Recently, we proposed a numerical method by which the FOV of an imaging system can be increased without compromising its resolution [11]. In our setup, a static scene to be imaged is partitioned into smaller scenes, which are imaged onto a single FPA to form a composite image. We developed an efficient video processing approach to separate the composite image into its constituent images, thus restoring the complete scene corresponding to the overall FOV. To make this otherwise highly ill-posed disambiguation problem tractable, the superimposed sub-images are moved relative to one another between video frames. The disambiguation problem that we considered is similar to the blind source separation problem [12, 13, 14], where the manner in which the sub-images are superimposed is unknown. In our case, we control how the sub-images are superimposed by prescribing the relative motion between the sub-images. Incorporating this knowledge into our proposed optimization algorithms, we can successfully and efficiently differentiate the sub-images to accurately reconstruct the original scene.

In this paper, we significantly extend our previous results in three principal aspects. First, to demonstrate the increase of FOVs in a realistic setting, we physically generate a composite video from a scene using an optical system employing a beamsplitter and a movable mirror. Without considerable prior knowledge of the contents of the scene, we are able to separate the sub-images, successfully capturing both the large-scale features and fine details. Second, we show the effectiveness of the proposed approach by reconstructing with small mean square errors a dynamic scene where, in contrast to the previous demonstration, objects in the scene are moving. Third, we improve upon the previous computational methods by exploiting correlations between sequential video frames, particularly the sparsity in the difference between successive frames, leading to accurate solutions with reduced computational time.

The paper is organized as follows: In Sec. 2, we discuss the concept of our technique and give a detailed description of the proposed architecture. Sec. 3 shows how the video disambiguation problem can be formulated and solved using optimization techniques based on sparse representation algorithms. In Sec. 4, we describe both the physical and numerical experiments. We conclude with a summary of the paper in Sec. 5.

2. Proposed camera architecture for generating a superimposed video

Figure 1(a) schematically shows the basic concept of superimposition and disambiguation. In the superimposition process, multiple sub-images are merged to form a composite image (shown on the right side of Fig. 1(a)) in a straightforward manner; the intensity of each pixel in the composite image is the sum of the intensities of the corresponding pixels in the individual images. However, the inverse process – the disambiguation of the individual sub-images from this composite image – is more challenging. For this, we must determine how the intensity of each pixel in the composite image is distributed over the corresponding pixels in the individual sub-images so that the resulting reconstruction accurately represents the original scene. Our technique achieves this task by measuring a composite video sequence, where the position of each sub-image is slightly altered at each frame. It is the movement of these individual sub-images that allows disambiguation to succeed. For simplicity, we consider the example of superimposing only two sub-images in our experiments, but the approach we describe can be extended to more general cases.

Fig. 1. (a) Basic concept of superimposition and disambiguation. (b) Proposed camera architecture for superimposing two sub-images, seen from the top view. The scene is split into two halves, x_t^{(1)} and x_t^{(2)}. The optical field from the left half propagates directly through the beamsplitter to hit the FPA in the camera. The optical field from the right half hits a movable mirror before propagating to the beamsplitter and being reflected to the FPA in the camera.

Superimposed images which are shifted relative to one another at different frames can easily be recorded using a simple camera architecture, depicted for two sub-images in Fig. 1(b). Constructed using beamsplitters and movable mirrors, the proposed assembly merges the sub-images into a single image and temporally varies the relative position of the two sub-images as they hit the detector. The optical field from the left half of the scene propagates directly through the beamsplitter and hits the FPA in the camera at the same relative position for every frame. The optical field from the right half of the scene, however, is reflected by a movable mirror followed by the beamsplitter before hitting the FPA. When the mirror, mounted on a linear stage, is moved, the right half of the scene is moved correspondingly. The image recorded by the FPA is then the sum of the stationary left sub-image and the right sub-image that is moved for each frame, resulting in a superimposed video sequence.

3. Mathematical model and computational approach for disambiguation

Let {x_t} be a sequence of frames representing a slowly changing scene. The superimposition process (Fig. 1(a)) can be modeled mathematically at the t-th frame as

\[ z_t = A_t x_t + \varepsilon_t, \tag{1} \]

where z_t ∈ ℝ^{m×1} is the observed composite image, x_t ∈ ℝ^{n×1} is the (unknown) scene to be reconstructed, A_t ∈ ℝ^{m×n} is the projection matrix that describes the superimposition, and ε_t is noise at frame t. We assume in this paper that ε_t is zero-mean white Gaussian noise. The disambiguation problem is the inverse problem of solving for x_t given the observations z_t and the matrix A_t. In this setting, n > m, which makes Eq. (1) underdetermined. There are several techniques for approaching this ill-posed statistical inverse problem of disambiguating the sub-images, many of which exploit the sparsity of x_t in one or more bases (cf. [15, 16, 17]). We note that Eq. (1) differs from the formulation in our previous paper [11] in that the scene is now dynamic, i.e., x_t can have (small) changes for each t, whereas previously the scene was static, i.e., x_{t+1} = x_t for all t.

In the camera architecture described in Sec. 2, one sub-image is held stationary relative to the other. If x_t = [x_t^{(1)}; x_t^{(2)}] are the pixel intensities corresponding to the two images, then A_t is the underdetermined matrix [I S_t], where I is the identity matrix and S_t describes the movement of the second sub-image in relation to the first at the t-th frame. Here, we assume that x_t^{(1)} corresponds to the stationary sub-image while x_t^{(2)} corresponds to the sub-image whose shifting is induced by the moving mirror (see Fig. 1(b)). Then the above system can be modeled mathematically as

\[ z_t = \begin{bmatrix} I & S_t \end{bmatrix} \begin{bmatrix} x_t^{(1)} \\ x_t^{(2)} \end{bmatrix} + \varepsilon_t = \tilde{S}_t \tilde{W} \theta_t + \varepsilon_t, \tag{2} \]

where S̃_t = [I S_t]. Here, we write x_t = W̃θ_t, where θ_t denotes the vector of coefficients of the two sub-images in the wavelet basis and W̃ denotes the inverse wavelet transform. (We use the wavelet transform here because of its effectiveness with many natural images, but alternative bases could certainly be used depending on the setting.) We note that this formulation is slightly different from that found in our previous paper [11], where the coefficients for each sub-image are treated separately. In Eq. (2), θ_t contains the wavelet coefficients for the entire image x_t, as opposed to the concatenation of the wavelet coefficients of x_t^{(1)} and x_t^{(2)}, resulting in a more seamless interface between the sub-images.
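To make the forward model concrete, the following is a minimal numerical sketch of the superimposition in Eq. (2) for a single frame, written in Python with NumPy. The integer-pixel horizontal shift s_t, the circular shift, and the function name are illustrative assumptions rather than a description of the actual optical system.

```python
import numpy as np

def superimpose(x1, x2, s_t, noise_std=0.0, rng=None):
    """Form one composite frame z_t = x1 + shift(x2, s_t) + noise (illustrative).

    x1 : 2-D array, stationary left sub-image x_t^{(1)}.
    x2 : 2-D array, sub-image x_t^{(2)} shifted horizontally by s_t pixels
         (a circular shift is used here only for simplicity).
    """
    rng = np.random.default_rng() if rng is None else rng
    shifted = np.roll(x2, s_t, axis=1)            # action of S_t on x_t^{(2)}
    z = x1 + shifted                              # [I  S_t][x_t^{(1)}; x_t^{(2)}]
    if noise_std > 0:
        z = z + noise_std * rng.standard_normal(z.shape)   # epsilon_t
    return z
```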

We formulate the reconstruction problem as a sequence of nonlinear optimization problems, minimizing the norm of the error ‖z_t − S̃_t W̃θ_t‖ along with a regularization term τ‖θ_t‖, for some tuning parameter τ, at each time frame and using the computed minimum as the initial value for the following frame. Since the underlying inverse problem is underdetermined, the regularization term in the objective function is necessary to make the disambiguation problem well-posed. This formulation of the reconstruction problem is similar to the ℓ2-ℓ1 formulation of the compressed sensing problem [18, 19, 20] for suitably chosen norms: using the Euclidean norm for the error term gives the least-squares error, while using the ℓ1 norm for the regularization term induces sparsity in the solution. Sparse solutions in the wavelet domain provide accurate reconstructions of the original signal since the wavelet transform typically retains the majority of natural images’ energy in a relatively small number of basis coefficients. To solve the problem of disambiguating two superimposed images, we thus formulate it as the nonlinear optimization problem

\[ \hat{\theta}_t = \arg\min_{\theta_t} \left\| z_t - \tilde{S}_t \tilde{W} \theta_t \right\|_2^2 + \tau \left\| \theta_t \right\|_1. \tag{3} \]
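We solve problems of this form with the GPSR algorithm (Sec. 4). Purely as an illustration of the ℓ2-ℓ1 objective in (3), the sketch below runs a basic proximal-gradient (ISTA) iteration, with the forward operator S̃_t W̃ and its adjoint supplied as hypothetical function handles; it is not the solver used in our experiments.

```python
import numpy as np

def soft_threshold(v, lam):
    # Proximal operator of lam * ||.||_1 (component-wise soft thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def ista(z, A, At, tau, step, n_iter=200, theta0=None):
    """Minimize ||z - A(theta)||_2^2 + tau*||theta||_1 by proximal gradient.

    A, At : function handles for the forward operator (e.g., S~_t W~) and its adjoint.
    step  : step size; should satisfy step <= 1 / (2 * ||A||^2) for convergence.
    """
    theta = np.zeros_like(At(z)) if theta0 is None else theta0.copy()
    for _ in range(n_iter):
        grad = 2.0 * At(A(theta) - z)                     # gradient of the data-fit term
        theta = soft_threshold(theta - step * grad, step * tau)
    return theta
```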

If we solve the optimization problem (3) for each frame independently, the ℓ1 regularization term can lead to reasonably accurate solutions to an otherwise underdetermined and ill-posed inverse problem, particularly when the true scene is very sparse in the wavelet basis and significant amounts of computation time are devoted to each frame. However, when the scene is stationary or slowly varying relative to the frame rate of the imaging system, subsequent frames of observations can be used simultaneously to achieve significantly better solutions. We describe a family of methods for exploiting inter-frame correlations that differ in the number of frames solved for simultaneously.

1-Frame Method. For a scene that changes only slightly from frame to frame, the reconstruction from a previous frame is often a good approximation to the following frame. In the 1-Frame Method, we use the solution θ̂_t to the optimization problem (3) at the t-th frame to initialize the optimization problem for the (t+1)-th frame.

2-Frame Method. We can improve upon the 1-Frame Method by solving for multiple frames in each optimization problem. In the 2-Frame Method we solve for two successive frames simultaneously. However, rather than solving for θ_t and θ_{t+1}, we solve for θ_t and Δθ_t ≡ θ_{t+1} − θ_t for two main reasons. First, for slowly changing scenes, θ_{t+1} ≈ θ_t, and since both θ_{t+1} and θ_t are already sparse, Δθ_t is even sparser, making Δθ_t even more appropriate for the sparsity-inducing ℓ2-ℓ1 minimization. Second, solving for Δθ_t allows for coupling the frames in an otherwise separable objective function, leading to accurate solutions for both θ_t and Δθ_t. The minimization problem can be formulated as follows:

\[ \hat{\theta}_t^{[2]} \equiv \begin{bmatrix} \hat{\theta}_t \\ \Delta\hat{\theta}_t \end{bmatrix} = \arg\min_{\theta_t,\,\Delta\theta_t} \left\| \begin{bmatrix} z_t \\ z_{t+1} \end{bmatrix} - \begin{bmatrix} \tilde{S}_t & 0 \\ 0 & \tilde{S}_{t+1} \end{bmatrix} \begin{bmatrix} \tilde{W} & 0 \\ \tilde{W} & \tilde{W} \end{bmatrix} \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_2^2 + \tau \left\| \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_1, \tag{4} \]

where S̃_i = [I S_i] for i = t and t+1. The following optimization problem for frame t+1 is initialized using

\[ \begin{bmatrix} \theta_{t+1}^{(0)} \\ \Delta\theta_{t+1}^{(0)} \end{bmatrix} \equiv \begin{bmatrix} \hat{\theta}_t + \Delta\hat{\theta}_t \\ \Delta\hat{\theta}_t \end{bmatrix}, \]

which should already be a good approximation to the solution θ̂_{t+1}^{[2]}. Note that the formulation in (4) is different from that proposed in our previous paper [11], where θ_t corresponds to coefficients of static images, i.e., θ_t = θ_{t+1}, whereas here, we allow for movements within each sub-image. In addition, since Δθ_t is significantly sparser than θ_t, we use a different regularization parameter ρ on ‖Δθ_t‖_1 to encourage very sparse Δθ_t solutions, which leads to the following optimization problem:

\[ \hat{\theta}_t^{[2]} = \arg\min_{\theta_t,\,\Delta\theta_t} \left\| \begin{bmatrix} z_t \\ z_{t+1} \end{bmatrix} - \begin{bmatrix} \tilde{S}_t & 0 \\ 0 & \tilde{S}_{t+1} \end{bmatrix} \begin{bmatrix} \tilde{W} & 0 \\ \tilde{W} & \tilde{W} \end{bmatrix} \begin{bmatrix} \theta_t \\ \Delta\theta_t \end{bmatrix} \right\|_2^2 + \tau \left\| \theta_t \right\|_1 + \rho \left\| \Delta\theta_t \right\|_1. \]

In our experiments, we use ρ = (1.0 × 10³)τ.
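The block operator in (4) need not be formed explicitly. The sketch below, in the same illustrative style as the earlier ISTA example, applies the stacked 2-frame operator and its adjoint through the differential parameterization (θ_t, Δθ_t); all names are placeholders, and the distinct weights τ and ρ would additionally require a correspondingly weighted soft-threshold in the solver.

```python
import numpy as np

def make_two_frame_ops(S_ops, St_ops, W, Wt):
    """Build forward/adjoint handles for the stacked 2-frame system in Eq. (4).

    S_ops, St_ops : [S~_t, S~_{t+1}] and their adjoints (function handles).
    W, Wt         : inverse wavelet transform W~ and its adjoint.
    The unknown is the stacked vector [theta_t; dtheta_t].
    """
    def A(stacked):
        theta, dtheta = np.split(stacked, 2)
        z0 = S_ops[0](W(theta))                # frame t:   S~_t W~ theta_t
        z1 = S_ops[1](W(theta + dtheta))       # frame t+1: S~_{t+1} W~ (theta_t + dtheta_t)
        return np.concatenate([z0, z1])

    def At(residual):
        r0, r1 = np.split(residual, 2)
        g1 = Wt(St_ops[1](r1))                 # adjoint contribution through frame t+1
        g0 = Wt(St_ops[0](r0)) + g1            # theta_t appears in both frames
        return np.concatenate([g0, g1])

    return A, At
```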

4-Frame Method. The 4-Frame Method is very similar to the 2-Frame Method, but we solve for the coefficients using four successive frames instead of two, using the observation vectors z_{t+2} and z_{t+3} and the observation matrices S_{t+2} and S_{t+3}. By coupling more frames, the coefficients are required to satisfy more equations, leading to more accurate solutions. The drawback, however, is that the corresponding linear systems to be solved are larger and require more computation time. The corresponding minimization problem is given by

\[ \hat{\theta}_t^{[4]} = \arg\min_{\bar{\theta}_t} \left\| \bar{z}_t - \bar{S}_t \bar{W} \bar{\theta}_t \right\|_2^2 + \tau \left\| \theta_t \right\|_1 + \rho \sum_{j=0}^{2} \left\| \Delta\theta_{t+j} \right\|_1, \tag{5} \]

where the minimizer θ̂_t^{[4]} = [θ̂_t; Δθ̂_t; Δθ̂_{t+1}; Δθ̂_{t+2}], Δθ_i ≡ θ_{i+1} − θ_i for i = t, …, t+2, and

\[ \bar{z}_t = \begin{bmatrix} z_t \\ z_{t+1} \\ z_{t+2} \\ z_{t+3} \end{bmatrix}, \quad \bar{S}_t = \begin{bmatrix} \tilde{S}_t & & & \\ & \tilde{S}_{t+1} & & \\ & & \tilde{S}_{t+2} & \\ & & & \tilde{S}_{t+3} \end{bmatrix}, \quad \bar{W} = \begin{bmatrix} \tilde{W} & & & \\ \tilde{W} & \tilde{W} & & \\ \tilde{W} & \tilde{W} & \tilde{W} & \\ \tilde{W} & \tilde{W} & \tilde{W} & \tilde{W} \end{bmatrix}, \quad \text{and} \quad \bar{\theta}_t = \begin{bmatrix} \theta_t \\ \Delta\theta_t \\ \Delta\theta_{t+1} \\ \Delta\theta_{t+2} \end{bmatrix}. \]

Here, S̃_i = [I S_i] for i = t, …, t+3. There is another formulation for simultaneously solving for four frames (see [21]). However, results from that paper indicate that solving (5) is more effective in generating accurate solutions. As in the 2-Frame Method, we place the same weights (ρ = (1.0 × 10³)τ) on ‖Δθ_i‖_1 for i = t, …, t+2 to encourage very sparse solutions.

A general n-Frame Method can be defined likewise for simultaneously solving for n frames. In our numerical experiments, we also use the 8- and 12-Frame Methods.
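In all of these methods, the lower-triangular structure of W̄ simply accumulates the differential coefficients: θ_{t+j} = θ_t + Δθ_t + … + Δθ_{t+j−1}. A small sketch of that mapping, with illustrative names, is given below.

```python
def frames_from_differentials(theta_t, dthetas):
    """Recover per-frame wavelet coefficients from the stacked unknowns.

    Implements the action of the block lower-triangular structure of W-bar:
    theta_{t+j} = theta_t + sum_{i<j} dtheta_{t+i}.
    dthetas : sequence of n-1 differential coefficient vectors.
    """
    thetas = [theta_t]
    for d in dthetas:
        thetas.append(thetas[-1] + d)
    return thetas   # one coefficient vector per frame
```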

4. Experimental methods

To demonstrate that the FOV of an optical system can be increased using the proposed superimposition and disambiguation technique, we perform two studies, one physical and one numerical. The physical experiment involves actually building the proposed camera architecture of Sec. 2 to obtain a composite video that is then separated into the original scene. Here, since the scene is static, we use the most successful method from our previous paper [11] for disambiguating stationary scenes. The numerical experiment superimposes the two dynamic sub-images of a surveillance video. This experiment demonstrates that the two sub-images can be successfully disambiguated and that the slowly moving components of the original scene can be captured with the proposed approach by exploiting the inter-frame correlations.

In these experiments, we solve the optimization problems for the various proposed methods using the Gradient Projection for Sparse Reconstruction (GPSR) algorithm of Figueiredo et al. [17]. GPSR is a gradient-based optimization method that is very fast, accurate, and efficient. In addition, GPSR has a debiasing phase, where upon solving the ℓ2-ℓ1 minimization problem, it fixes the non-zero pattern of the optimal θ_t and minimizes the ℓ2 term of the objective function, resulting in a minimal error in the reconstruction while keeping the number of non-zeros in the wavelet coefficients at a minimum. It has been shown to outperform many of the state-of-the-art codes for solving the ℓ2-ℓ1 minimization problem or its equivalent formulations.
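As an illustration of the debiasing idea (not GPSR's actual implementation), the sketch below fixes the support of a solution and re-fits its nonzero entries by least squares; the explicit matrix is used only for brevity.

```python
import numpy as np

def debias(theta, z, A_dense, tol=1e-10):
    """Re-fit the nonzero entries of theta by least squares on the fixed support.

    theta   : sparse coefficient estimate from the l2-l1 minimization.
    A_dense : explicit observation matrix (illustrative; a matrix-free solver
              would be used in practice).
    """
    support = np.abs(theta) > tol
    coef, *_ = np.linalg.lstsq(A_dense[:, support], z, rcond=None)
    theta_debiased = np.zeros_like(theta)
    theta_debiased[support] = coef
    return theta_debiased
```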

Ideally, the extent of [x_t^{(1)}; x_t^{(2)}] should cover the entire scene at all frames. However, as x_t^{(2)} moves, according to either the movement of the mirror in the optical experiment or the prescribed motion in the numerical experiment, there can be a portion of the scene that is not contained in the superimposed image at some frames, creating a “blind zone”. For a frame when the disambiguated image does not cover the entire scene, the result obtained for the blind zone from the previous frame is combined with the disambiguated image to reconstruct the entire scene.
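A minimal sketch of the blind-zone patching described above, assuming a boolean mask marking the pixels covered by the current frame's reconstruction; the mask and the function name are hypothetical.

```python
import numpy as np

def fill_blind_zone(current, previous, covered_mask):
    """Patch pixels not observed at the current frame with the previous result.

    covered_mask : boolean array, True where the current disambiguated image
                   actually covers the scene.
    """
    return np.where(covered_mask, current, previous)
```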

The reconstruction videos for both the physical and numerical experiments are available at http://www.ee.duke.edu/nislab/videos/disambiguation under the names duke-earth-day.avi and surveillance.avi.

4.1. Optical experiment: Duke Earth Day

Fig. 2. (a) Original “Duke Earth Day” scene used in the experiment. The box with a solid red border represents the extent of x^{(1)}, which is stationary during the superimposition process. As the mirror moves in a circular motion in the x-z plane shown in Fig. 1(b), the blue box with a solid border, which represents the moving boundary of x^{(2)} at the object plane, oscillates between the left and right turning points, represented by the blue boxes with dashed and dotted borders, respectively. (b) Superimposed image (left panel) and reconstructed scene (right panel) when the moving boundary is near the mid-point of the oscillation in the superimposed video. (c) Superimposed image (left panel) and reconstructed scene (right panel) when the moving boundary is near the left turning point, where sub-images are not completely disambiguated: the man in the hat and the banner (circled in yellow) partly appear in the left half of the disambiguated image.

As mentioned above, the system shown in Fig. 1(b) is capable of generating a composite video where the sub-image corresponding to the right half of the scene, x^{(2)}, is moved while that corresponding to the left half of the scene, x^{(1)}, remains still. In our experiment, the movement of x^{(2)} was along the x-direction with its position following a sinusoidal function of the frame number. This was achieved by moving the mirror with a motion controller along a circular path in the x-z plane at a constant velocity; the displacement of the mirror along the x-direction causes x^{(2)} to move in the same direction, whereas the motion of the mirror in the z-direction does not create any change in the composite video. To determine S̃_t in Eq. (2) corresponding to a given circular movement of the mirror, we performed a calibration experiment where a scene with a white background contained a black dot on its right half. By tracking the dot in the recorded video, we verified that the movement of the dot in the video was indeed sinusoidal, and also determined its amplitude and period. In the actual experiment, the scene was replaced with a photograph (“Duke Earth Day”) while leaving the rest of the system unaltered. Hence, the amplitude and period of the movement of x^{(2)} are the same as those obtained in the calibration experiment. The phase of the sinusoidal movement was determined for each recording by calculating the mean square difference between adjacent frames to identify the frame at which x^{(2)} moved to the farthest right (or left).
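As a rough illustration of this phase-calibration step, the sketch below locates a turning point of the sinusoidal motion from the mean square difference between adjacent composite frames; near a turning point the moving sub-image is almost stationary, so the inter-frame difference is smallest there. The function name and the use of a simple minimum are simplifying assumptions.

```python
import numpy as np

def estimate_turning_frame(frames):
    """Return the index of the frame nearest a turning point of the motion.

    frames : sequence of 2-D composite images (converted to float for the difference).
    """
    diffs = [np.mean((frames[i + 1].astype(float) - frames[i].astype(float)) ** 2)
             for i in range(len(frames) - 1)]
    return int(np.argmin(diffs))    # smallest inter-frame change ~ turning point
```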

The results of the physical experiment are shown in Fig. 2. Figure 2(a) shows the original scene used in the experiment. Also shown is the extent of x_t^{(1)}, as well as that of x_t^{(2)} at different phases of the sinusoidal oscillation. Figure 2(b) shows the superimposed image and the reconstructed scene when the moving boundary of x_t^{(2)} shown in Fig. 2(a) was near the mid-point of its sinusoidal oscillation. In the right panel of Fig. 2(b), the main figures in the original scene are easily identified and the text is sufficiently clear to be read. Similar results were obtained at other phases of the oscillation except when the moving boundary was near either the left or right turning point. Since the relative velocity of the two sub-images approaches zero at these phases, x_t^{(1)} and x_t^{(2)} cannot be completely disambiguated; the disambiguation result shown in Fig. 2(c) reveals that some objects belonging to x_t^{(2)}, such as the man in the hat and the banner (circled in yellow), partly appear in the left half of the disambiguated image. As shown in Fig. 1(b), the optical path length for x_t^{(2)} is longer than that for x_t^{(1)}, resulting in a small discrepancy in size; in our setup, x_t^{(1)} is 7% larger than x_t^{(2)}. For this reason, the tiling of the two disambiguated sub-images appears less seamless.

4.2. Numerical experiment: Surveillance video

The video used in this numerical experiment is obtained from the Benchmark Data for PETS-ECCV 2004 [22]. Called Fight_OneManDown.mpg, it depicts two people fighting, with one man eventually falling down while the other runs away. The video was filmed using a wide-angle camera lens in the entrance lobby of the INRIA Labs at Grenoble, France. Originally in 384×288 pixel resolution, the color video is rescaled to 512×256 and converted to grayscale for ease of processing. This type of video is appropriate for our application since the scene of the lobby is relatively static, with only some small scene changes corresponding to people moving in the lobby. We only use parts of the video where there is movement in both halves of the scene to test whether our approach can assign each moving component to the proper sub-image (see the objects circled in yellow in Fig. 3(a)). Zero-mean white Gaussian noise is added in our simulations. We add 10 frames between video frames using interpolation to simulate a faster video frame rate.
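A small sketch of the frame-rate upsampling used in this experiment: inserting extra frames between consecutive grayscale frames by interpolation (the function name and the choice of linear interpolation are assumptions about the exact scheme).

```python
import numpy as np

def interpolate_frames(f0, f1, n_extra=10):
    """Insert n_extra linearly interpolated frames between two grayscale frames
    to simulate a higher video frame rate."""
    alphas = np.linspace(0.0, 1.0, n_extra + 2)[1:-1]   # interior blending weights
    return [(1.0 - a) * f0 + a * f1 for a in alphas]
```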

For a fairer comparison between the various methods, we allowed the optimization algorithm to run for specified times (5 and 20 seconds). For example, the 2-Frame Method might lead to more accurate solutions than the 1-Frame Method since it solves for two frames simultaneously; however, it is allowed fewer GPSR iterations than the 1-Frame Method since the computational run time for each 2-Frame optimization iteration is longer than that for the 1-Frame Method. Figures 4(a) and 4(b) show the mean squared error (MSE) values for the various methods described in Sec. 3 with the two specified times (5 and 20 seconds) allotted for GPSR minimization with debiasing. We make the following observations. First, with the exception of the 1-Frame Method, every method benefits from allotting more time to solve the optimization problem (20 seconds vs. 5 seconds). The 1-Frame Method does not benefit from this increase in time allotment because GPSR often converges (i.e., one of the criteria for algorithm termination becomes satisfied) within the 5 second time frame. Second, solving for arbitrarily many frames does not necessarily lead to more accurate solutions. Even though the 12-Frame Method simultaneously solves for many more frames, its MSE performance is worse than that of the 8-Frame Method, especially if only 5 seconds are allotted for each frame. Figure 4(a) shows that the performance of the 12-Frame Method is even worse than that of the 1-Frame Method for most of the frames considered. This poor performance is mainly due to the fact that the resulting linear system for the 12-Frame Method is so large, and the allotted time so short, that only three GPSR iterations were allowed for each frame in our reconstruction method, which is hardly sufficient for the GPSR iterates to be near the solution. However, when the allotted time is increased to 20 seconds, the MSE performance of the 12-Frame Method improves significantly. Third, the reconstructions of the initial frames for the 8-Frame Method are generally worse than those of the 2- and 4-Frame Methods, but the sharp decrease in MSE for the 8-Frame Method indicates that the solutions from the previous frames are being used effectively to initialize the current frame optimization. Fourth, the relatively ragged behavior of the various methods in Fig. 4(a) compared to that in Fig. 4(b), especially for the 8- and 12-Frame Methods, can perhaps be attributed to the difference in time restriction: because of the relatively few GPSR iterations allowed within the 5 second time limit per frame, sufficiently good solutions, relative to the other frames, are not found in some instances. Qualitatively, the disambiguated reconstruction captures both large-scale features and fine details of the original scene. The overall structure of the lobby is correctly depicted, while small details such as those of the several kiosks and handrails, and motions such as those of the man’s arms on the right half of the scene, are reproduced accurately. The ghosting at the bottom of the right half of the reconstruction results from the lack of contrast in regions of high pixel intensity values on the left half of the scene. Yet in spite of this ghosting, details such as the edges of the lobby floor tiles are still distinguishable in the areas where the ghosting occurs.

Fig. 3. (a) Original surveillance video with two moving components (circled in yellow), (b) the observed superimposed sub-images of the left and right half of the scene, and (c) the reconstruction using the 8-Frame Method allowing 20 seconds of GPSR iterations.

Fig. 4. MSE values for the 90 frames allowing 5 seconds (a) and 20 seconds (b) to solve the optimization problems for the different n-Frame Methods for the surveillance video.

4.3. Discussion

While we showed in this section that two superimposed sub-images (and, in previous demonstrations, up to four sub-images [11]) can be successfully disambiguated using our proposed technique, we recognize that there are practical limitations to our method, especially in extending it to disambiguating more sub-images or resolving quickly moving objects. First, the problem of reconstructing a scene becomes even more ill-posed as the number of superimposed sub-images increases. Consequently, it is necessary to ensure that the movements of the sub-images relative to one another do not coincide and are fundamentally different. Thus, the corresponding hardware becomes more complex. However, we note that the camera architecture will only involve more beamsplitters and movable mirrors, which cost substantially less than a larger FPA, especially for a system operating at the mid- and long-wave infrared wavelengths. Second, the motion within each sub-image must remain sufficiently slow. The ability of the proposed techniques to disambiguate superimposed sub-images relies upon the temporal correlations of successive frames. Our reconstruction techniques succeed precisely because the frames are strongly correlated (i.e., the difference between consecutive frames is mostly zero in the wavelet domain) and this sparsity is exploited accordingly. The assumption that strong inter-frame correlations exist can be satisfied by increasing the video frame rate or limiting the application to scenes with relatively slowly moving components. For example, in some surveillance applications in which the activity of interest is very fast and transient, the proposed architecture might not be suitable.

Aside from these challenges associated with hardware and temporal correlations, the mathematical methods described in this paper should extend to handling much larger numbers of sub-images. For example, consider an extreme case in which there is one sub-image for each pixel in the high resolution scene. In this setting, each observation frame would be the sum of a different subset of the pixels (sub-images) in the scene because the shifting of sub-images would make some of them unobservable by the detector at different times. This model is highly analogous to the Rice “Single Pixel Camera” [23], in which each measurement (in time) is the sum of a random collection of pixels. Duarte et al. demonstrate that if sufficiently many measurements are collected over time using this setup, then a static scene can be reconstructed with high accuracy.

We note that the proposed approach is different from classical mosaicing as described in Sec. 1. For example, consider a scene with a quickly moving, transient object in one location. With classical mosaicing, this object would only be observed at half the frame rate (since the other half of the frame rate is used to observe the other half of the scene); however, it would be observed with high spatial resolution. In contrast, the proposed technique would observe every part of the image during each frame acquisition, resulting in very high temporal resolution for detecting transient objects. However, because the disambiguation procedure relies on temporal correlations, the spatial resolution of the reconstructed object would be relatively poor – i.e., it would look blurred.

5. Conclusions

In this paper, we proposed a novel camera architecture for collecting high-resolution, wide field-of-view videos in settings such as infrared imaging where large focal plane arrays are unavailable. This architecture is mechanically robust and easy to calibrate. Associated with this architecture is a fast and accurate technique for disambiguating the composite video image consisting of the superimposition of multiple sub-images. We demonstrated the increase of FOVs in a realistic setting by physically generating a composite video from a single scene using an optical system employing a beamsplitter and a movable mirror and successfully disambiguating the video. Without prior knowledge of the contents of the scene, our approach was able to disambiguate the two sub-images, successfully capturing both large-scale features and fine details in each sub-image. Additionally, we improved upon our previous reconstruction approach by allowing each sub-image to have slowly changing components, carefully exploiting correlations between sequential video frames. Simulation results demonstrate that our optimization approach can reconstruct the constituent images and the moving components with small mean square errors, and that the errors improve when multiple frames are solved for simultaneously.

Acknowledgments

The authors would like to thank Les Todd, assistant director of Duke Photography, for allowing the use of the “Duke Earth Day” photograph in our physical experiments. The authors were partially supported by DARPA Contract No. HR0011-04-C-0111, ONR Grant No. N00014-06-1-0610, and DARPA Contract No. HR0011-06-C-0109.

References and links

1. Y. Hagiwara, “High-density and high-quality frame transfer CCD imager with very low smear, low dark current, and very high blue sensitivity,” IEEE Trans. Electron Devices 43, 2122–2130 (1996). [CrossRef]  

2. H. S. P. Wong, R. T. Chang, E. Crabbe, and P. D. Agnello, “CMOS active pixel image sensors fabricated using a 1.8-V, 0.25-mu m CMOS technology,” IEEE Trans. Electron Devices 45, 889–894 (1998). [CrossRef]  

3. S. D. Gunapala, S. V. Bandara, J. K. Liu, C. J. Hill, S. B. Rafol, J. M. Mumolo, J. T. Trinh, M. Z. Tidrow, and P. D. Le Van, “1024 x 1024 pixel mid-wavelength and long-wavelength infrared QWIP focal plane arrays for imaging applications,” Semicond. Sci. Technol. 20, 473–480 (2005). [CrossRef]  

4. S. Krishna, D. Forman, S. Annamalai, P. Dowd, P. Varangis, T. Tumolillo, A. Gray, J. Zilko, K. Sun, M. G. Liu, J. Campbell, and D. Carothers, “Demonstration of a 320×256 two-color focal plane array using InAs/InGaAs quantum dots in well detectors,” Appl. Phys. Lett. 86, 193501 (2005). [CrossRef]

5. R. Szeliski, “Image mosaicing for tele-reality applications,” in Proc. IEEE Workshop on Applications of Computer Vision, pp. 44–53 (1994).

6. R. A. Hicks, V. T. Nasis, and T. P. Kurzweg, “Programmable imaging with two-axis micromirrors,” Opt. Lett. 32, 1066–1068 (2007). [CrossRef]   [PubMed]  

7. S. C. Park, M. K. Park, and M. G. Kang, “Super-resolution image reconstruction: A technical overview,” IEEE Signal Process. Mag. 20, 21–36 (2003). [CrossRef]  

8. R. C. Hardie, K. J. Barnard, J. G. Bognar, E. E. Armstrong, and E. A. Watson, “High-resolution image reconstruction from a sequence of rotated and translated frames and its application to an infrared imaging system,” Opt. Eng. 37, 247–260 (1998). [CrossRef]  

9. J. C. Gillett, T. M. Stadtmiller, and R. C. Hardie, “Aliasing reduction in staring infrared imagers utilizing subpixel techniques,” Opt. Eng. 34, 3130–3137 (1995). [CrossRef]  

10. M. Irani and S. Peleg, “Improving resolution by image registration,” CVGIP: Graph. Models Image Process. 53, 231–239 (1991). [CrossRef]  

11. R. F. Marcia, C. Kim, J. Kim, D. Brady, and R. M. Willett, “Fast disambiguation of superimposed images for increased field of view,” Accepted to “Proc. IEEE Int. Conf. Image Proc. (ICIP 2008)”.

12. P. D. O’Grady, B. A. Pearlmutter, and S. T. Rickard, “Survey of sparse and non-sparse methods in source separation,” Int. J. Imag. Syst. Tech. 15, 18–33 (2005). [CrossRef]  

13. A. M. Bronstein, M. M. Bronstein, M. Zibulevsky, and Y. Y. Zeevi, “Sparse ICA for blind separation of transmitted and reflected images,” Int. J. Imag. Syst. Tech. 15, 84–91 (2005). [CrossRef]  

14. E. Be’ery and A. Yeredor, “Blind separation of superimposed shifted images using parameterized joint diagonalization,” IEEE Trans. Image Process. 17, 340–353 (2008). [CrossRef]   [PubMed]  

15. J. Bobin, J.-L. Starck, J. Fadili, and Y. Moudden, “Morphological Diversity and Source Separation,” IEEE Trans. Signal Process. 13, 409–412 (2006). [CrossRef]  

16. S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput. 20, 33–61 (electronic) (1998).

17. M. A. T. Figueiredo, R. D. Nowak, and S. J. Wright, “Gradient projection for sparse reconstruction: Application to compressed sensing and other inverse problems,” IEEE Journal of Selected Topics in Signal Processing: Special Issue on Convex Optimization Methods for Signal Processing (To appear). [PubMed]  

18. E. Candès and T. Tao, “Near optimal signal recovery from random projections: Universal encoding strategies,” to be published in IEEE Transactions on Information Theory (2006). http://www.acm.caltech.edu/~emmanuel/papers/OptimalRecovery.pdf. [CrossRef]

19. D. L. Donoho and Y. Tsaig, “Fast solution of ℓ1-norm minimization problems when the solution may be sparse,” Preprint (2006).

20. R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. Roy. Statist. Soc. Ser. B 58, 267–288 (1996).

21. R. F. Marcia and R. M. Willett, “Compressive coded aperture video reconstruction,” Accepted to “Proc. Sixteenth European Signal Processing Conference (EUSIPCO 2008)”.

22. “Benchmark Data for PETS-ECCV 2004,” in Sixth IEEE International Workshop on Performance Evaluation of Tracking and Surveillance (2004). URL http://www-prima.imag.fr/PETS04/caviar_data.html.

23. M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk, “Single-pixel imaging via compressive sampling,” IEEE Signal Process. Mag. 25, 83–91 (2008). [CrossRef]
