
Visualization by P-flow: gradient- and feature-based optical flow and vector fields extracted from image analysis

Abstract

We proposed a method for extracting the optical flow suitable for visualization, pseudo-flow (P-flow), from a natural movie [Exp. Brain Res. 237, 3321 (2019)]. The P-flow algorithm comprises two stages: (1) extraction of a local motion vector field from two successive frames and (2) tracking of vectors between two successive frame pairs. In this study, we show that while P-flow takes a feature (vector) tracking approach, it is also classified as a gradient-based approach that satisfies the brightness constancy constraint. We also incorporate interpolation and a corner detector to address the shortcomings associated with the two approaches.

© 2020 Optical Society of America under the terms of the OSA Open Access Publishing Agreement

1. INTRODUCTION

Optical flow is important information for moving agents, such as creatures or robots, to adapt to their environment. Since these agents have a limited energy supply and limited computational resources, an optimal algorithm is needed to calculate the optical flow efficiently and effectively. In the lower-level visual system of a creature, visual information is processed in a local region about the size of the classical receptive field of retinal, lateral geniculate nucleus (LGN), and primary visual area (V1) cells. Compared to the higher-level visual system, the lower-level visual system is characterized by one-shot calculation under weak physical constraints. Therefore, it can be inferred that the lower-level visual system performs a calculation close to the normal flow, although it is not yet clear how context-dependent computation for motion affects the calculation in the lower-level visual system.

We proposed an algorithm for extracting motion data from natural movies with fewer physical conditions, simulating the lower- to middle-level primate visual system [1]. The extracted motion data are a dynamic vector field (DVF), calculated by simple image processing with spatial–temporal differentiation and normalization. The direction of the DVF is the same as that of the normal flow, but the amplitude is different. Although the DVF itself is not the ground-truth optical flow, the perception of DVF—which is visualized by random-dot motion—is quite similar to that of the corresponding original movie. The random-dot motion elicits neural responses similar to those of the original movie in the motion-sensitive middle temporal (MT) area of the primate brain [1]. Furthermore, we developed a novel optical flow algorithm, pseudo-flow (P-flow), by considering the vectors of the DVF as visual features for tracking [2].

The most standard algorithms for calculating optical flow are the Lucas–Kanade method [3] and the Horn–Schunck method [4]. The Lucas–Kanade method is widely used in a number of research fields because of its ease of implementation. In comparison, the Horn–Schunck method was the first variational algorithm applied to optical flow extraction and underlies the most successful state-of-the-art methods [5]. Both the Lucas–Kanade and the Horn–Schunck methods are classified as gradient-based (GB) approaches to optical flow extraction and solve partial differential equations under the brightness constancy assumption (BCA). Because these algorithms do not provide an appropriate solution for long-range motion spreading over different pixels, they have been modified to adopt a coarse-to-fine framework in which optical flow is estimated from a coarser, downsampled image [6,7]. However, downsampling oversmooths fine structures and causes small-scale, fast-moving objects to disappear. A feature-based (FB) approach [8], such as Scale-Invariant Feature Transform (SIFT) flow [9], matches the corresponding visual feature descriptors across subsequent frames. Since FB approaches are effective for long-range motion, they have been incorporated into the variational approach of state-of-the-art optical flow algorithms to complement the drawbacks of GB approaches [9–13]. However, because the visual feature descriptors must be distinguishable from the background, the optical flow obtained with an FB approach is sparse, and ambiguous features cause false positives. By incorporating an FB approach into the iterative coarse-to-fine framework, it is possible to obtain a precise and dense ground-truth optical flow by solving the GB approach-based variational method under a global physical constraint. Furthermore, the efficiency is increased by integrating a convolutional neural network [5,14–18].

In this study, we show through theoretical analysis that while P-flow is calculated by tracking a visual feature (i.e., FB), it is also classified as a GB approach that satisfies the BCA. As a result, we show that P-flow can compensate for some of the drawbacks of each approach. Nevertheless, P-flow also inherits drawbacks from the two approaches, including obstacles associated with normal flow (the aperture problem). Here, we propose modifications that correct the drawbacks that could not be overcome in the original P-flow. One modification is semi-dense interpolation, and the other incorporates the Harris Corner Detector (HCD) [19].

2. THEORY

A. Optical Flow Consistent with Primate Visual System (P-Flow) Derived by Matching Vectors in Movies

P-flow is the optical flow estimated using the feature matching framework [2]. The extraction algorithm of P-flow consists of two stages: calculation of the DVF from an original movie and feature matching on the DVF. Here, the feature for matching is a vector in the DVF. The algorithm to extract the DVF and P-flow from the original natural movie was described in detail in [1,2].

DVF of a movie is defined as the dynamics of local normal vectors by calculating the normalized local normal vectors and their time derivatives for each pixel in each frame of the movie. The normalized local normal vectors are projected onto the $x {-} y$ plane, onto which the original objects are projected in the movie. A local normal vector at position $(x,y)$ is depicted as

$$\left(- \frac{{\partial\! I(x,y,t)}}{{\partial x}}, - \frac{{\partial\! I(x,y,t)}}{{\partial y}},1\right),$$
where $I(x,y)$ is the pixel value at position $(x,y)$. The $x$ and $y$ components of the normalized local normal vector of the frame image projected onto the $x {-} y$ plane are calculated as
$${n_x} = - \frac{1}{\alpha}\frac{{\partial\! I(x,y,t)}}{{\partial x}} = - \frac{1}{\alpha}{\partial _x}I,$$
$${n_y} = - \frac{1}{\alpha}\frac{{\partial\! I(x,y,t)}}{{\partial y}} = - \frac{1}{\alpha}{\partial _y}I,$$
$$\alpha = \sqrt {{\partial _x}{I^2} + {\partial _y}{I^2} + 1} .$$
Then, DVF is derived by calculating the time derivatives of the projected local normal vectors:
$$\begin{split}{\textbf{v}} &= ({v_x},{v_y}) = \frac{\partial}{{\partial t}}({n_x},{n_y}) = \left(- \frac{1}{\alpha}\frac{{{\partial ^2}I}}{{\partial t\partial x}}, - \frac{1}{\alpha}\frac{{{\partial ^2}I}}{{\partial t\partial y}}\right)\\& = \left(- \frac{1}{\alpha}{\partial _{\textit{tx}}}I, - \frac{1}{\alpha}{\partial _{\textit{ty}}}I \right).\end{split}$$
The normalization term, $\alpha$, is approximated as a constant value at time $t$.
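For concreteness, the DVF of Eq. (5) can be approximated with a few lines of NumPy. The sketch below assumes two successive grayscale frames stored as float arrays and uses central differences via np.gradient; these are our choices for illustration, not the authors' implementation.

```python
import numpy as np

def dvf(frame_t, frame_t1):
    """Hedged sketch of Eq. (5): DVF between two successive frames (dt = 1)."""
    # Spatial gradients d_x I and d_y I of each frame (x = columns, y = rows).
    Ix_t, Iy_t = np.gradient(frame_t, axis=(1, 0))
    Ix_t1, Iy_t1 = np.gradient(frame_t1, axis=(1, 0))
    # Mixed derivatives d_tx I and d_ty I as frame-to-frame differences.
    Itx = Ix_t1 - Ix_t
    Ity = Iy_t1 - Iy_t
    # Normalization term alpha of Eq. (4), treated as constant at time t.
    alpha = np.sqrt(Ix_t**2 + Iy_t**2 + 1.0)
    # DVF components (v_x, v_y).
    return -Itx / alpha, -Ity / alpha
```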

A tentative optical flow is estimated by matching pixels with the same DVF vector between two sequential frame pairs. The vector of each pixel in the DVF is a visual feature, and its direction and magnitude are used for feature tracking. Although the magnitude of the vector of each pixel is not related to the pixel movement itself, the direction of the vector indicates the direction of the pixel movement consistent with the primate visual system [1]. Thus, the search area is restricted to a line along the direction of the vector of the target pixel. If multiple pixels are found to have the same vector in the next frame, the closest pixel to the target pixel is matched:

$$\begin{split}A& = \{{{\textbf{x}}_{t + \Delta t}}|{\textbf{v}}({{\textbf{x}}_{t + \Delta t}}) = {\textbf{v}}({{\textbf{x}}_t}),{{\textbf{x}}_{t + \Delta t}} - {{\textbf{x}}_t} = D{\textbf{v}}({{\textbf{x}}_t}),D \in R\} \\&{\textbf{x}}_{t + \Delta t}^{\rm{track}}({{\textbf{x}}_t}) = \arg \min d({{\textbf{x}}_{t + \Delta t}},{{\textbf{x}}_t}),\quad{{\textbf{x}}_{t + \Delta t}} \in A,\end{split}$$
where ${{\textbf{x}}_t}$ indicates the position $(x,y)$ of the pixel at the $t$th frame, and ${{\textbf{x}}_{t + \Delta t}}$ indicates the candidates for the destination of the pixel at ${{\textbf{x}}_t}$ at the $t + \Delta t$th frame (usually $\Delta t = 1$); ${\textbf{v}}({{\textbf{x}}_t})$ and ${\textbf{v}}({{\textbf{x}}_{t + \Delta t}})$ indicate the vectors of the pixels at ${{\textbf{x}}_t}$ and ${{\textbf{x}}_{t + \Delta t}}$, respectively; $A$ is the group of pixels at the $t + \Delta t$th frame that have the same vector ${\textbf{v}}({{\textbf{x}}_t})$ as the pixel ${{\textbf{x}}_t}$ at the $t$th frame and that lie on the line through ${{\textbf{x}}_t}$ in the direction of ${\textbf{v}}({{\textbf{x}}_t})$; ${\textbf{x}}_{t + \Delta t}^{\rm{track}}({{\textbf{x}}_t})$ indicates the tracked (matching) pixel of pixel ${{\textbf{x}}_t}$ at the $t + \Delta t$th frame; and $d({{\textbf{x}}_{t + \Delta t}},{{\textbf{x}}_t})$ is the Euclidean distance between pixels ${{\textbf{x}}_t}$ and ${{\textbf{x}}_{t + \Delta t}}$. The resultant optical flow (P-flow) is depicted as follows:
$$U({{\textbf{x}}_t}) = ({U_x},{U_y}) = {\textbf{x}}_{t + \Delta t}^{\rm{track}}({{\textbf{x}}_t}) - {{\textbf{x}}_t} \equiv \tilde D{\textbf{v}}({{\textbf{x}}_t}),$$
where $\tilde D$ is the $D$ with the smallest absolute value in group $A$.

The matching algorithm is based on two hypotheses over three neighboring frames: (1) constant luminance in a local spatial region across the three frames and (2) linear motion on a local time scale across the three frames. When no pixel at the $t + \Delta t$th frame fulfills these criteria, P-flow is not determined at that pixel, resulting in a sparse P-flow.
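The matching step of Eqs. (6) and (7) can be sketched per pixel as below, under the two hypotheses above. Here v_prev and v_next stand for the DVFs of the frame pairs (t, t+1) and (t+1, t+2), as returned by a function like the dvf sketch above, and the search radius d_max and tolerance tol are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def track_pixel(x, y, v_prev, v_next, d_max=20, tol=1e-3):
    """Return P-flow U at pixel (x, y), or None if no match is found (Eq. 7)."""
    vx, vy = v_prev[0][y, x], v_prev[1][y, x]
    norm = np.hypot(vx, vy)
    if norm == 0.0:                          # zero vector: nothing to track
        return None
    ux, uy = vx / norm, vy / norm            # unit direction of the search line
    h, w = v_next[0].shape
    # Scan candidates x_t + D v(x_t) along the line, nearest first (smallest |D|).
    for step in sorted(range(-d_max, d_max + 1), key=abs):
        xs, ys = int(round(x + step * ux)), int(round(y + step * uy))
        if step == 0 or not (0 <= xs < w and 0 <= ys < h):
            continue
        if (abs(v_next[0][ys, xs] - vx) < tol and
                abs(v_next[1][ys, xs] - vy) < tol):
            return np.array([xs - x, ys - y])   # U = x_track - x_t
    return None                              # unmatched: P-flow left undetermined
```

Scanning candidates in order of increasing |D| implements the arg min of Eq. (7) directly.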

B. P-Flow in Relation to the First Approximation of the Taylor Series of the Brightness Constancy Assumption

The BCA is a fundamental principle for extracting the ground-truth optical flow and can be represented as follows:

$$I({x,y,t} ) = I ({x + \Delta x,y + \Delta y,t + \Delta t}).$$
If we assume that the movement is small, the image constraint at $I(x,y,t)$ can be approximated with the Taylor series
$$\begin{split}&I(x + \Delta x,y + \Delta y,t + \Delta t) \\&\quad\approx I(x,y,t) + {\partial _x}I\Delta x + {\partial _y}I\Delta y + {\partial _t}I\Delta t,\\&{\partial _x}I\Delta x + {\partial _y}I\Delta y = - {\partial _t}I\Delta t,\\&{\partial _x}I \cdot {V_x} + {\partial _y}I \cdot {V_y} = - {\partial _t}I,\end{split}$$
where ${V_x}$ and ${V_y}$ are the $x$ and $y$ components of the ground truth optical flow of $I(x,y,t)$, respectively.

Here, we further calculate the time derivative of the equation above:

$${\partial _{\textit{tx}}}I \cdot {V_x} + {\partial _{\textit{ty}}}I \cdot {V_y} = - {\partial _{\textit{tt}}}I.$$
Substituting Eq. (5) into Eq. (7) results in
$${v_x} \cdot {V_x} + {v_y} \cdot {V_y} = \frac{1}{\alpha}{\partial _{\textit{tt}}}I,$$
$${U_x} \cdot {V_x} + {U_y} \cdot {V_y} = \frac{{\tilde D}}{\alpha }{\partial _{tt}}I,\quad{\textbf{U}} \cdot {\textbf{V}} = \frac{{\tilde D}}{\alpha }{\partial _{tt}}I.$$
If the direction of P-flow, ${\textbf{U}}$, and that of the ground-truth optical flow, ${\textbf{V}}$, differ by an angle $\theta$ (Appendix A),
$$|{\textbf{V}} |\cos \theta = {\rm sign}(\tilde D)\frac{{{\partial _{\textit{tt}}}I}}{{\sqrt {{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}}}}.$$
Thus, P-flow becomes close to the ground-truth optical flow in a region where the spatial–temporal change of the pixel value is small and where the time-course of the pixel value is far from the inflection point.
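As an illustrative check that is not part of the original derivation, consider a pattern translating rigidly along the $x$ axis, $I(x,y,t) = f(x - {V_x}t)$ with ${V_x} \gt 0$. Then
$$\partial_{tx}I = -V_x f'',\quad \partial_{ty}I = 0,\quad \partial_{tt}I = V_x^2 f'',$$
so the right-hand side of Eq. (11) equals ${\rm sign}(\tilde D)\,{\rm sign}(f'')\,|V_x|$. Because the tracked displacement ${\textbf{U}} = \tilde D{\textbf{v}} = \tilde D({V_x}f''/\alpha ,0)$ equals the true shift $({V_x},0)$ for this rigid translation (with $\Delta t = 1$), $\tilde D f'' \gt 0$, and Eq. (11) reduces to $|{\textbf{V}}|\cos \theta = |{V_x}|$ with $\theta = 0$, as expected.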

C. Analytic Solution to P-Flow

P-flow, ${\textbf{U}}$, is analytically derived in the situation shown in Fig. 1. When a pixel is on a line of a contour edge in a frame of a movie, it is impossible to predict to which position on the line the pixel moves at the next frame unless the ground-truth optical flow, ${\textbf{V}}$, is known. This is called the aperture problem and is represented as Eq. (7) with two unknown parameters, ${V_x}$ and ${V_y}$, in one equation.


Fig. 1. P-flow and the aperture problem. The ground-truth optical flow (black dotted arrow) is unknown without global information from outside the white circle. Because P-flow (black arrow) is normal to the line, the magnitude of the ground-truth optical flow at an angle from P-flow can be analytically estimated. Note that P-flow is not extracted unless the lines move with the same vector. Since “normal flow” is calculated without motion information, it may extract inherently unreasonable optical flow.


Since the direction of the P-flow of the pixel is perpendicular to the line (Appendix B), the norm value of ${\textbf{U}}$ is equivalent to $|{\textbf{V}}| \cos\theta$ for ${-}\pi /{{2}} \lt \theta \lt \pi /{{2}}$ (Fig. 1). Thus (Appendix A),

$$| {\textbf{V}} |\cos \theta = | {\textbf{U}}| = \frac{{| {\tilde D}|}}{\alpha}\sqrt {{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}} .$$
From Eqs. (11) and (12), $\tilde D$ is obtained as follows:
$$\tilde D = \frac{{\alpha {\partial _{\textit{tt}}}I}}{{{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}}}.$$
Finally, we calculate the P-flow from the information of the pixel values with spatial–temporal derivatives:
$${\textbf{U}} = ({U_x},{U_y}) = \tilde D{\textbf{v}}({{\textbf{x}}_t}) = \frac{{- {\partial _{\textit{tt}}}I}}{{{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}}}({\partial _{\textit{tx}}}I,{\partial _{\textit{ty}}}I).$$
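Equation (14) can be evaluated directly from three consecutive frames. The sketch below is an illustrative discretization (forward differences in time, central spatial differences, and a small eps to guard against flat regions), not the authors' code.

```python
import numpy as np

def analytic_pflow(f0, f1, f2, eps=1e-8):
    """Hedged sketch of Eq. (14) from three consecutive grayscale frames."""
    Itx = np.gradient(f2, axis=1) - np.gradient(f1, axis=1)   # d_tx I
    Ity = np.gradient(f2, axis=0) - np.gradient(f1, axis=0)   # d_ty I
    Itt = f2 - 2.0 * f1 + f0                                  # d_tt I
    denom = Itx**2 + Ity**2 + eps        # guard against division by zero
    return -Itt * Itx / denom, -Itt * Ity / denom             # (U_x, U_y)
```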

3. SEMI-DENSE P-FLOW BY INTERPOLATION ON THE DYNAMIC VECTOR FIELD

If there is no pixel in the next frame with a vector matching that of the target pixel, P-flow is not determined for the target pixel. P-flow is selective to prevent false matches, which in some cases results in a sparse solution. To make the flow denser, an interpolation method is used at the cost of matching accuracy. The correspondence field ${{\textbf{U}}_{\textit{SD}}}({{\textbf{x}}_{\rm{vec}}})$ at a pixel ${{\textbf{x}}_{\rm{vec}}}$ where P-flow is not determined but the vector ${\textbf{v}}({{\textbf{x}}_{\rm{vec}}})$ is nonzero is expressed by

$${{\textbf{U}}_{\textit{SD}}}({{\textbf{x}}_{\rm{vec}}}) = \frac{{\sum\nolimits_{d({{\textbf{x}}_m},{{\textbf{x}}_{\rm{vec}}}) \lt R} {{e^{- d({{\textbf{x}}_m},{{\textbf{x}}_{\rm{vec}}})}}{\textbf{U}}({{\textbf{x}}_m})}}}{{\sum\nolimits_{d({{\textbf{x}}_m},{{\textbf{x}}_{\rm{vec}}}) \lt R} {{e^{- d({{\textbf{x}}_m},{{\textbf{x}}_{\rm{vec}}})}}}}},$$
where the summation is taken over the pixels ${{\textbf{x}}_m}$ whose P-flow is determined and that lie within a distance $R$ of ${{\textbf{x}}_{\rm{vec}}}$. If the vector is 0 at a pixel, P-flow is also 0 there. The interpolation method does not provide a fully dense flow, as reported in other studies [20]. Thus, we call the interpolated flow semi-dense P-flow.
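A brute-force sketch of the interpolation in Eq. (15) is given below; it assumes U holds the matched P-flow with NaN at undetermined pixels and vec_mask marks pixels with a nonzero DVF vector, both of which are our conventions for illustration.

```python
import numpy as np

def semi_dense(U, vec_mask, R=30):
    """Hedged sketch of Eq. (15): exponentially weighted interpolation within R."""
    Ux, Uy = U
    matched = ~np.isnan(Ux)
    ys_m, xs_m = np.nonzero(matched)
    out_x, out_y = Ux.copy(), Uy.copy()
    for y, x in zip(*np.nonzero(vec_mask & ~matched)):
        d = np.hypot(xs_m - x, ys_m - y)
        near = d < R
        if not near.any():
            continue
        w = np.exp(-d[near])                 # exponential distance weights
        out_x[y, x] = np.sum(w * Ux[ys_m[near], xs_m[near]]) / w.sum()
        out_y[y, x] = np.sum(w * Uy[ys_m[near], xs_m[near]]) / w.sum()
    return out_x, out_y
```

Pixels with no matched neighbor within R remain undetermined, which is why the result is only semi-dense.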

4. PIXEL SELECTION USING THE HARRIS CORNER DETECTOR TO AVOID THE APERTURE PROBLEM

As in ground-truth optical flow estimation, the P-flow algorithm is affected by the aperture problem, in which the motion of a line viewed through a small aperture cannot be determined except for the component of the ground truth orthogonal to the line, or “normal flow” [21]. Since the direction of P-flow is the same as that of normal flow, P-flow cannot solve the aperture problem. Although it yields a sparser P-flow, we avoid the aperture problem by selecting pixels that are not affected by it, rather than trying to solve it for all pixels where P-flow is calculated.


Fig. 2. Examples of P-flow applied to natural movies. Original movie, P-flow, semi-dense P-flow, P-flow with HCD, and flow using the Lucas–Kanade method are shown from left to right. Each flow is described by lines.


While the motion direction is ambiguous when only an edge or a line is seen in an aperture, it is uniquely determined when a corner is seen. Thus, the P-flow values of pixels that correspond to a corner are unlikely to be affected by the aperture problem. In this study, we adopt the HCD algorithm [19] to detect corners, although any corner detection algorithm can be used. First, the following matrix $M$ is calculated for each pixel where P-flow is determined:

$$M = \sum\limits_{x,y \in W} {\left({\begin{array}{*{20}{c}}{{\partial _x}{I^2}}&\quad{{\partial _x}I{\partial _y}I}\\{{\partial _x}I{\partial _y}I}&\quad{{\partial _y}{I^2}}\end{array}} \right)} ,$$
where $W$ is an image patch consisting of ${{3}} \times {{3}}$ pixels in our case. The pixel is classified as a corner when $r \lt {{0}}$, where $r$ is defined as
$$r = {\lambda _1}{\lambda _2} - 0.04 \cdot {({\lambda _1} + {\lambda _2})^2},$$
where ${\lambda _1}$ and ${\lambda _2}$ are the eigenvalues of $M$.
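The Harris response of Eqs. (16) and (17) can be sketched as below, with the window sum over W implemented as an unnormalized box filter; the input is assumed to be a float grayscale frame, and the selection criterion stated above is then applied to the returned response.

```python
import numpy as np
import cv2

def harris_response(frame, k=0.04, win=3):
    """Harris response r of Eq. (17), with M summed over a win x win patch."""
    Ix = np.gradient(frame, axis=1)
    Iy = np.gradient(frame, axis=0)
    # Window sums of the entries of M in Eq. (16) (normalize=False gives a plain sum).
    Sxx = cv2.boxFilter(Ix * Ix, -1, (win, win), normalize=False)
    Syy = cv2.boxFilter(Iy * Iy, -1, (win, win), normalize=False)
    Sxy = cv2.boxFilter(Ix * Iy, -1, (win, win), normalize=False)
    # det(M) - k * trace(M)^2 = lambda1 * lambda2 - k * (lambda1 + lambda2)^2.
    return (Sxx * Syy - Sxy**2) - k * (Sxx + Syy)**2
```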

5. IMPLEMENTATION AND EXAMPLES OF DYNAMIC VECTOR FIELDS, P-FLOW, SEMI-DENSE P-FLOW, AND P-FLOW WITH HCD

The algorithms for DVF, P-flow, semi-dense P-flow, and P-flow with HCD were implemented in Python 3.6 with OpenCV. A normalized box filter (filter size: 7) was applied to all original movies after conversion to grayscale. Pixels that do not move (i.e., zero vector strength) or that move only by a small magnitude (${\lt}\sqrt {{2}}$ pixels) were excluded from further processing because it is challenging to determine whether such motion is due to background noise (e.g.,  shot noise from an image sensor). For semi-dense P-flow, we set $R$ in Eq. (15) to 30 pixels.
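As a hedged sketch of the preprocessing and masking described above (not the released implementation), grayscale conversion, the size-7 normalized box filter, and the exclusion of sub-threshold flows might look as follows; the exact order and placement of the thresholding stage are assumptions.

```python
import cv2
import numpy as np

def preprocess(frame_bgr):
    """Grayscale conversion and a 7x7 normalized box filter, as described above."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.blur(gray, (7, 7)).astype(np.float64)

def mask_small_flow(Ux, Uy, min_mag=np.sqrt(2.0)):
    """Discard flows below sqrt(2) pixels, which may reflect sensor noise."""
    keep = np.hypot(Ux, Uy) >= min_mag
    return np.where(keep, Ux, np.nan), np.where(keep, Uy, np.nan)
```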

P-flow, semi-dense P-flow, and P-flow with HCD for the natural movies listed below are visualized by dot motion (dot size: three pixels for P-flow and P-flow with HCD, one pixel for semi-dense P-flow), and the DVF (only for “Tiger walking” and “Lavalamp”) is visualized by random-dot motion (dot size: one pixel) following [1] in Visualization 1, Visualization 2, and Visualization 3. Figure 2 shows examples of P-flow, semi-dense P-flow, and P-flow with HCD from Visualization 1, Visualization 2, and Visualization 3, described by lines. The advantages of the proposed algorithm are that there is no constraint on the original movie (e.g.,  any motion of animals, artificial objects, or dynamic scenes) and that the original movies can be of any duration or size. In the original movie of Visualization 3, P-flow was extracted in an area where the smoke is not clearly visible. When we increased the contrast of the original movie, smoke was actually present there, indicating that the dynamic range of P-flow is large. For comparison, the optical flow extracted using the Lucas–Kanade method [3,22] from the same movies is also shown in Visualization 1, Visualization 2, and Visualization 3.
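For reference, a sparse pyramidal Lucas–Kanade baseline of the kind shown in the visualizations can be obtained with OpenCV as sketched below; the corner and window parameters are assumptions and not necessarily the settings used for Visualization 1, Visualization 2, and Visualization 3 [3,22].

```python
import cv2

def lucas_kanade_baseline(gray_prev, gray_next, max_corners=500):
    """Sparse pyramidal Lucas-Kanade flow between two 8-bit grayscale frames."""
    pts = cv2.goodFeaturesToTrack(gray_prev, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=5)
    if pts is None:                          # no trackable features found
        return [], []
    nxt, status, _err = cv2.calcOpticalFlowPyrLK(gray_prev, gray_next, pts, None,
                                                 winSize=(15, 15), maxLevel=2)
    ok = status.ravel() == 1
    # Matched start and end points of the tracked features.
    return pts[ok].reshape(-1, 2), nxt[ok].reshape(-1, 2)
```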

6. DISCUSSION

In this study, we investigated the theoretical background of P-flow, an optical flow estimation algorithm [2]. We showed that while P-flow was originally developed in an FB framework by considering the vector of the DVF as a visual feature for tracking, it is also consistent with a GB approach that follows the BCA, which is a fundamental requirement for optical flow estimation. P-flow calculates an optical flow suitable for visualization but does not estimate the ground-truth optical flow, which most typical algorithms estimate. We quantified the difference between P-flow and the ground-truth optical flow by referring to the BCA. We further modified the algorithm to resolve the sparsity and uncertainty of P-flow. To make P-flow denser, we used an interpolation method guided by the DVF. Uncertainty due to the aperture problem was avoided by deleting pixels that are inappropriate for tracking the ground-truth flow. We did not impose on P-flow the constraints that are necessary for ground-truth estimation, which broadens its applicable cases. Examples of P-flow for formless and flexible objects, such as a liquid or gas, were also presented.

A. Algorithm Characteristics

For tracking a feature in the P-flow algorithm, only two conditions need to be satisfied: (1) constant luminance in a local spatial region and (2) linear motion on a local time scale (at least over three frames). These conditions also underlie the first approximation of the BCA, except that the BCA assumes linear motion over only two frames [3,4]. Under these conditions, the search area for tracking is automatically determined to lie along the direction of the time derivative of the local normal vector at the target pixel. This stems from the observation that random-dot motion that follows the direction of the DVF is perceived in the same manner as the original movie [1], indicating that the target pixel is perceived to move in the direction of the vector. The search area restricted to a line is an advantage of P-flow, for it prevents false matching and reduces the computational requirement, whereas the search area of most feature-matching algorithms is two-dimensional.

The optical flow algorithm is affected by the aperture problem, in which the motion of a line viewed through a small aperture cannot be determined except for “normal flow”: the component of the ground truth that is orthogonal to the line [21]. The mathematical form of the aperture problem is equivalent to the first approximation of the Taylor series of the BCA [Eq. (7)]. Here, one equation with two unknown variables, $V_x$ and $V_y$—the vector components of the ground-truth optical flow—has infinitely many solutions, indicating an ill-posed problem. In ground-truth optical flow estimation, a condition is assumed to solve this ill-posed problem; for example, the optical flow is the same in a small region around the target position for the Lucas–Kanade method [3], or the optical flow changes smoothly for the Horn–Schunck method [4]. The P-flow algorithm does not require any such additional condition (instead, three-frame linear motion is required) and calculates a P-flow that appears similar to the normal flow in regions of consecutive frames where the aperture problem typically occurs. P-flow differs from normal flow in that the former does not extract flow for pixels without the same vector in consecutive frames. Indeed, Eq. (14) for the P-flow solution has the same form as the normal flow, except that the former explicitly contains time-domain information [23]. Unlike P-flow, normal flow is not calculated on the basis of perceptually consistent motion information; therefore, it has no theoretical basis for motion visualization. Selecting pixels that are not placed on a line or edge with a corner detection algorithm helps to avoid the aperture problem for the purpose of extracting the ground truth, as shown in Section 4. Even if an error in extracting the ground truth occurs, it can be quantitatively estimated, as shown in Eq. (11).

The P-flow algorithm involves the computation of second temporal derivatives. Three consecutive frames are necessary to implement the algorithm, unlike other algorithms that use two consecutive frames. Thus, P-flow may be more sensitive to noise caused by approximating second derivatives with discrete numerical calculation. To avoid such noise, we required in the present implementation that the distance between a target feature in the first frame and a candidate feature in the subsequent frame be larger than $\sqrt {{2}}$ pixels.

B. Comparison with Ground-Truth Optical Flow Estimation

Estimation of the ground-truth optical flow has progressed remarkably since Horn and Schunck [4] and Lucas and Kanade [3] proposed GB approaches. In the variational method introduced by Horn and Schunck [4], one estimates the ground-truth optical flow by minimizing an energy function that consists of a data term based on the BCA and a smoothness term, which is derived from the assumption that the motion of neighboring pixels is similar and varies smoothly. The data and smoothness terms are balanced with a coefficient parameter. The variational method can compute the optical flow for all image regions. It is called “dense flow” because the smoothness term can fill in the image regions where no information about flow is available. The variational method has been modified to account for the difficulty in image regions where the condition of constant illumination and full visibility of the surface is not fulfilled (e.g.,  occlusions, motion discontinuities, illumination changes, large displacements). To deal with occlusions and motion discontinuities, robust functions such as total variation norms are adopted instead of quadratic functions for both the data and smoothness terms to reduce the influence of outliers [6,7,24], and effective regularizers are introduced [13,25]. Illumination changes are accommodated by introducing a gradient constancy term into the energy function [24]. A coarse-to-fine approach or warping framework is effective for large displacements [6,7,26]. Adding a feature matching condition (i.e., an FB approach), in which local descriptors should be matched in two subsequent images, to the energy function is efficient for large displacements of a small object that the coarse-to-fine approach cannot handle [9–13]. The variational method is a type of global approach because the energy function to be minimized is defined globally; flows in one region of an image affect the flows in neighboring regions. In contrast, the P-flow algorithm computes the flow in each local region, implying that the P-flow at each pixel is independent. P-flow is estimated only in image regions where information about the DVF is available and vectors in sequential frames are matched. Thus, P-flow is inherently a sparse flow. P-flow is not estimated in regions where occlusions, discontinuities, and illumination changes occur. It visualizes the flow as is (i.e., as a human would perceive it).

The method proposed by Lucas and Kanade [3] is related more closely to P-flow than the variational method because the optical flow is computed in a local region. The Lucas–Kanade method assumes that the motion is constant in a local neighborhood. The optical flow is then estimated by finding the solution of Eq. (7) shared by all pixels in the local neighborhood through least-squares minimization. The size of the neighborhood determines the accuracy and robustness of the flow. If the size is small, the information in the images is not blurred, but the estimate may be susceptible to the aperture problem. A larger size can handle larger motions by integrating over a vast region but may include regions from another motion surface. Choosing an ideal size is not a trivial problem. Moreover, it is challenging to determine the optical flow in a homogeneous region without any texture. There is a need to provide “good features to track” [27]. Unlike the Lucas–Kanade method, in which the estimation of flow at a target pixel requires computation over a finite neighboring image region, P-flow depends only on the pixels required for spatial differentiation. Moreover, P-flow does not depend on any parameter. Time-domain information in three consecutive frames is all that is necessary to determine P-flow. The requirement of additional time-domain information without computation over a spatial region leads to the robustness of P-flow for formless and flexible objects (e.g.,  liquid or gas) when they move linearly over at least three frames. In most cases, the aperture problem becomes less problematic for formless objects because they rarely consist of solid straight lines. Indeed, the algorithm extracts P-flow well from liquid or gas movies, as shown in Visualization 2 and Visualization 3. For the purpose of estimating the ground-truth flow produced by rigid and straight lines, P-flow with corner detection, which is based on the HCD algorithm [19], provides the necessary “good features to track.”

C. Semi-Dense P-Flow

The P-flow algorithm is an FB algorithm that searches for the same feature in sequential frames. In contrast to the variational methods that extract dense optical flow using the filling-in effect, the optical flow extracted from FB algorithms is sparse [4]. In the P-flow algorithm, the feature for matching is a vector in the DVF derived from two consecutive frames in a movie. DVF contains unmatched vectors that could disappear in the next frame due to nonlinear movement, occlusion, illumination change, or discretization of computation. Considering that the unmatched vectors may contain supplementary P-flow information, we approximate the P-flow of the unmatched vectors by interpolating the magnitude of the neighboring matched pixels. The proposed semi-dense P-flow is denser than the original P-flow and perceptually realistic (Visualization 1, Visualization 2, and Visualization 3). To obtain full dense P-flow, it may be necessary to propose conditions that are consistent with neurophysiological or psychophysical knowledge. For example, if a Gestalt law is involved in the calculation [28], the conditions may appear like those used in conventional computer vision. In contrast, novel conditions may be needed to reflect psychological phenomena, such as illusory perception of a visual attribute [29] or superior perception of an ecologically important stimulus (e.g.,  facial expression) [30].

D. P-Flow as Optical Flow Consistent with Primate Visual System

The P-flow estimation algorithm consists of two stages: (1) extraction of the DVF and (2) matching of the vectors between two successive frame pairs of the movie. The DVF of a movie is calculated by simple image processing with spatial–temporal differentiation and normalization [Eq. (5)]. These operations are defined in a local region, and thus global information from the frames is not necessary. In other words, computation in the retina might be sufficient to extract the DVF. Perception of the DVF visualized by random-dot motion is quite similar to that of the corresponding original movies (Visualization 1 and Visualization 2). Likewise, the random-dot motion elicits neural responses similar to those to the original movie in the motion-sensitive MT area of the primate brain [1]. Considering the vectors as a visual feature for tracking, the second stage of the P-flow algorithm estimates the optical flow by tracking the vectors in the fields. The estimated optical flow can be considered analogous to that perceived by humans because the tracked feature is perceptually defined. Studies in computer vision have aimed to derive the ground truth. Recent developments in machine learning have reached a level where computers outperform human experts in several tasks [31,32]. In contrast, biological systems seem to adapt to the environment without the ground truth. Therefore, advanced computer vision by itself is not optimal as an assistive technology. The P-flow estimation algorithm simulates the flexible primate visual system; however, critical differences between the ground truth and primate vision still need clarification [33].

P-flow has an affinity with event-based vision [34] because both are inspired by the biological visual system and both operate with local computation alone. Event-based vision is based on visual information captured by an event-based camera, which is also a bio-inspired visual sensor [35]. The camera measures brightness changes, called events, independently and asynchronously with microsecond resolution. While the event-based camera takes a time derivative, it does not process any spatial information. If it also took a spatial derivative with reference to nearby pixels, the extraction of the DVF and P-flow would be accelerated. Event-based vision algorithms [36] could be combined with P-flow. For example, development of visual sensors that detect brightness changes in both the time and spatial domains could enable P-flow computation with microsecond resolution.

APPENDIX A

The norm of ${\textbf{U}}$ is as follows:

$$|{\textbf{U}}| = \left| {\tilde D \left(- \frac{1}{\alpha}{\partial _{\textit{tx}}}I, - \frac{1}{\alpha}{\partial _{\textit{ty}}}I \right)} \right| = \frac{{| {\tilde D} |}}{\alpha}\sqrt {{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}} .$$

From Eq. (10), the inner product of ${\textbf{U}}$ and ${\textbf{V}}$ is rewritten as

$${\textbf{U}} \cdot {\textbf{V}} = |{\textbf{U}}||{\textbf{V}}|\cos \theta = \frac{{\tilde D}}{\alpha }{\partial _{tt}}I,$$
where $\theta$ is the angle between ${\textbf{U}}$ and ${\textbf{V}}$. Substitution of the norm of ${\textbf{U}}$ to the above equation results in Eq. (11):
$$| {\textbf{V}}|\cos \theta = {\rm sign}(\tilde D)\frac{{{\partial _{\textit{tt}}}I}}{{\sqrt {{\partial _{\textit{tx}}}{I^2} + {\partial _{\textit{ty}}}{I^2}}}}.$$

APPENDIX B

Let $I(x,y)$, the pixel value at position $(x,y)$, be a step function (Fig. 1):

$$I(x,y) = \left\{{\begin{array}{*{20}{c}}{1:a(x - {x_0}) + b(y - {y_0}) \gt 0}\\{0:a(x - {x_0}) + b(y - {y_0}) \le 0}\end{array}} \right.,$$
where vector $(a,b)$ is perpendicular to the line: $l = a(x - {x_0}) + b(y - {y_0}) = 0$.

The gradient of the pixel value is as follows:

$$\begin{split}({\partial _x}I,{\partial _y}I) &= \left(\frac{{\partial I}}{{\partial l}}\frac{{\partial l}}{{\partial x}},\frac{{\partial I}}{{\partial l}}\frac{{\partial l}}{{\partial y}}\right) = (a\delta (l),b\delta (l)) \propto (a,b)\\[-3pt]\therefore {\textbf{v}} &= - \frac{1}{\alpha}\frac{\partial}{{\partial t}}({\partial _x}I,{\partial _y}I) \propto (a,b).\end{split}$$
Thus, the direction of P-flow of the pixel is perpendicular to the line.

Funding

Japan Society for the Promotion of Science (JP18K12016, JP19H00630, JP18H01100).

Acknowledgment

We thank Ueno Zoo for generously allowing us to film the animal videos. We thank Editage (www.editage.com) for English language editing.

Disclosures

The authors declare no conflicts of interest.

REFERENCES

1. W. Suzuki, N. Ichinohe, T. Tani, T. Hayami, N. Miyakawa, S. Watanabe, and H. Takeichi, “Novel method of extracting motion from natural movies,” J. Neurosci. Methods 291, 51–60 (2017). [CrossRef]  

2. W. Suzuki, T. Seno, W. Yamashita, N. Ichinohe, H. Takeichi, and S. Palmisano, “Vection induced by the low-level motion extracted from complex animation films,” Exp. Brain Res. 237, 3321–3332 (2019). [CrossRef]  

3. B. D. Lucas and T. Kanade, “An iterative image registration technique with an application to stereo vision,” in 7th International Joint Conference on Artificial Intelligence (IJCAI) (1981).

4. B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell. 17, 185–203 (1981). [CrossRef]  

5. Z. Tu, W. Xie, D. Zhang, R. Poppe, R. C. Veltkamp, B. Li, and J. Yuan, “A survey of variational and CNN-based optical flow techniques,” Signal Process. Image Commun. 72, 9–24 (2019). [CrossRef]  

6. M. J. Black and P. Anandan, “The robust estimation of multiple motions: parametric and piecewise-smooth flow fields,” Comput. Vis. Image Underst. 63, 75–104 (1996). [CrossRef]  

7. N. Papenberg, A. Bruhn, T. Brox, S. Didas, and J. Weickert, “Highly accurate optic flow computation with theoretically justified warping,” Int. J. Comput. Vis. 67, 141–158 (2006). [CrossRef]  

8. J. Wills and S. Belongie, “A feature-based approach for determining dense long range correspondences,” in European Conference on Computer Vision (ECCV) (2004), Vol. 3023, pp. 170–182.

9. C. Liu, J. Yuen, and A. Torralba, “SIFT flow: dense correspondence across scenes and its applications,” IEEE Trans. Pattern Anal. Mach. Intell. 33, 978–994 (2011). [CrossRef]  

10. T. Brox and J. Malik, “Large displacement optical flow: descriptor matching in variational motion estimation,” IEEE Trans. Pattern Anal. Mach. Intell. 33, 500–513 (2011). [CrossRef]  

11. J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “DeepMatching: hierarchical deformable dense matching,” Int. J. Comput. Vis. 120, 300–323 (2016). [CrossRef]  

12. P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid, “DeepFlow: large displacement optical flow with deep matching,” in IEEE conference on Computer Vision and Pattern Recognition (CVPR) (2013).

13. L. Xu, J. Jia, and Y. Matsushita, “Motion detail preserving optical flow estimation,” IEEE Trans. Pattern Anal. Mach. Intell. 34, 1744–1757 (2012). [CrossRef]  

14. A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox, “FlowNet: learning optical flow with convolutional networks,” in IEEE International Conference on Computer Vision (ICCV) (2015), pp. 2758–2766.

15. T. Hui, X. Tang, and C. C. Loy, “LiteFlowNet: a lightweight convolutional neural network for optical flow estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

16. E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, “FlowNet 2.0: evolution of optical flow estimation with deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

17. A. Ranjan and M. J. Black, “Optical flow estimation using a spatial pyramid network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

18. D. Sun, X. Yang, M. Liu, and J. Kautz, “PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018).

19. C. Harris and M. Stephens, “A combined corner and edge detector,” in 4th Alvey Vision Conference (1988) Vol. 15, p. 50.

20. J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid, “EpicFlow: edge-preserving interpolation of correspondences for optical flow,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015).

21. S. Nishida, T. Kawabe, M. Sawayama, and T. Fukiage, “Motion perception: from detection to interpretation,” Annu. Rev. Vision Sci. 4, 501–523 (2018). [CrossRef]  

22. J. Y. Bouguet, “Pyramidal implementation of the affine Lucas Kanade feature tracker description of the algorithm,” Microprocessor Research Labs Report (Intel Corporation, 2001).

23. A. Bruhn, J. Weickert, and C. Schnörr, “Lucas/Kanade meets Horn/Schunck: combining local and global optic flow methods,” Int. J. Comput. Vis. 61, 1–21 (2005). [CrossRef]  

24. T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in European Conference on Computer Vision (ECCV) (2004), Vol. 3024, pp. 25–36.

25. A. Wedel, D. Cremers, T. Pock, and H. Bischof, “Structure- and motion-adaptive regularization for high accuracy optic flow,” in IEEE International Conference on Computer Vision (ICCV) (2009).

26. Z. Chen, H. Jin, Z. Lin, S. Cohen, and Y. Wu, “Large displacement optical flow from nearest neighbor fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2013), pp. 2443–2450.

27. J. Shi and C. Tomasi, “Good features to track,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (1994).

28. F. Jäkel, M. Singh, F. A. Wichmann, and M. H. Herzog, “An overview of quantitative approaches in Gestalt perception,” Vis. Res. 126, 3–8 (2016). [CrossRef]  

29. R. L. Gregory, Seeing through Illusions (Oxford University, 2009).

30. D. G. Purcell and A. L. Stewart, “The face-detection effect,” Bull. Psychon. Soc. 24, 118–120 (1986). [CrossRef]  

31. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS) (2012).

32. J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: unified, real-time object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016).

33. N. V. K. Medathati, H. Neumann, G. S. Masson, and P. Kornprobst, “Bio-inspired computer vision: towards a synergistic approach of artificial and biological vision,” Comput. Vis. Image Underst. 150, 1–30 (2016). [CrossRef]  

34. G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, “Event-based vision: a survey,” IEEE Trans. Pattern Anal. Mach. Intell. (to be published). [CrossRef]  

35. D. P. Moeys, F. Corradi, C. Li, S. A. Bamford, L. Longinotti, F. F. Voigt, S. Berry, G. Taverni, F. Helmchen, and T. Delbruck, “A sensitive dynamic and active pixel vision sensor for color or neural imaging applications,” IEEE Trans. Biomed. Circuits Syst. 12, 123–136 (2018). [CrossRef]  

36. T. Stoffregen, G. Gallego, T. Drummond, L. Kleeman, and D. Scaramuzza, “Event-based motion segmentation by motion compensation,” in IEEE International Conference on Computer Vision (ICCV) (2019).

Supplementary Material (3)

Name | Description
Visualization 1: Original movie, random-dot movie by DVF, P-flow, semi-dense P-flow, P-flow with HCD, and flow by the Lucas–Kanade method for tiger walking.
Visualization 2: Original movie, random-dot movie by DVF, P-flow, semi-dense P-flow, P-flow with HCD, and flow by the Lucas–Kanade method for Lavalamp.
Visualization 3: Original movie, random-dot movie by DVF, P-flow, semi-dense P-flow, P-flow with HCD, and flow by the Lucas–Kanade method for smoke.
