Simulating binocular vision for no-reference 3D visual quality measurement

Wu-Jie Zhou; Lu Yu; Ming-Wei Wu

doi:10.1364/OE.23.023710

1. Introduction

In recent years, three-dimensional (3D) imaging technology has become an increasingly active research field [1–3]. The practical applications of this field range from the entertainment to consumer electronics industries. Because a variety of 3D visual distortions can be introduced during 3D scene capture, 3D content creation, compression, broadcasting, 3D reconstruction, 3D display, and other similar processes, the resulting 3D visual signals may be unsatisfactory in terms of end-user 3D quality of experience (QoE) [4,5]. Therefore, systems that effectively monitor, control, and improve 3D visual quality are highly desirable [6]. 3D visual quality measurement (VQM) is an ideal solution to address this problem [7], because of its ability to rate 3D visual quality in a manner that approximates human subjective judgments; moreover, its measurements act as feedback for optimizing 3D content production.

Similar to the classification of 2D-VQM metrics [8,9], 3D-VQM metrics can be divided into three categories: full-reference (FR), reduced-reference (RR), and no-reference/blind (NR) metrics if they use full, partial, or no reference information, respectively. FR 3D-VQM provides a useful and effective way to predict quality differences [10–12]. However, the original content is rarely accessible in most practical circumstances, which severely limits the application scope of those metrics. Recently, RR techniques have attracted great interest and achieved fairly good performance for different types of distortions, since partial original content can be made available as side information. By comparison, blind/NR metrics have potentially much broader applicability in real-world scenarios than FR and RR metrics since they can predict the quality of distorted 3D visual signals without any prior knowledge of original content.

Most existing NR methods are composed of two modules: 1) quality-aware features extraction, and 2) quality prediction model learning. In recent decades, many NR methods for two-dimensional (2D) visual signals have been studied in detail, including Distortion Identification-based Image Verity and INtegrity Evaluation (DIIVINE) [13], Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE) [14], BLind Image Integrity Notator using DCT Statistics-II (BLIINDS-II) [15], joint statistics of gradient and Laplacian features-based NR 2D-VQM metrics [16], and others. The problem with NR 3D-VQM is much more complex than that of 2D NR-VQM since the former is affected by the quality of both views, depth perception, visual comfort, and other factors [17–19].

Therefore, NR 3D-VQM is relatively new, only a small number of NR 3D-VQM metrics have been proposed [20–22]. For example, in [21], Chen et al. developed a NR 3D-VQM model by extracting both 2D and 3D features from a distorted 3D visual signal, then training by a machine-learning technique to predict the perceived quality of a distorted 3D visual signal. In [22], Ryu et al. proposed an NR 3D-VQM metric for distorted 3D visual signals using the binocular quality perception model, but only perceptual blurriness and blockiness scores were combined to form the final quality score. However, performance improvement was limited by the metrics’ insufficient consideration of binocular visual perception.

The success of NR metrics depends on what kinds of 3D features are employed to predict quality. To the best of our knowledge, there is no other NR 3D-VQM that has been developed based on local patterns of binocular visual perception. Hence, we have room for further improvement.

In this paper, we aim to extract the quality-aware features of binocular vision, in order to construct a 3D-VQM model that emulates the human vision system (HVS) when measuring the perceptual quality of 3D visual signals. Motivated by deep analysis of the related binocular visual perception properties of simple and complex cells, many new elements are introduced into the proposed metric:

1) Gaussian derivative responses are basic elements that share similarities with the receptive fields of simple cells along the visual pathway, and can be used to characterize various visual semantic structures that are closely related to visual perceptual quality. When the structure information is damaged or distorted, the perceptual statistics of those responses will be changed accordingly. Hence, Gaussian derivative-based features are basic perceptual properties used for measuring the quality of both views.
2) The disparity energy response of complex cells, which is an important visual processing stage in the visual cortex where binocular fusion occurs, provides effective representation of local features. This occurs because important structural information is preserved (there will be a higher energy value for poor binocular fusion; that is, the strength of the disparity energy response determines the fusion quality of human vision). We will attempt to exploit the characteristics of this response, because it plays an important role in measuring the perceptual quality of the binocular fusion of 3D visual signals.
3) The binocular rivalry response of complex cells is a perceptual effect that occurs when the eyes view mismatched stimuli (larger differences between the retinal views result in a more pronounced binocular depth. However, if these differences are too large, the views from two eyes cannot be fused, and instead of being combined, the two views will compete with each other. This is known as binocular rivalry). Binocular rivalry response is one of the major sources of visual discomfort and fatigue in binocular vision. It is sensible to incorporate this observation into NR 3D-VQM.

By using support vector regression (SVR), the quality prediction model is acquired by mapping the extracted binocular quality-aware features to the corresponding subjective quality scores on the training set. With this quality prediction model, we can measure the perceptual quality of 3D visual signals. Experimental results on two publicly available databases indicate that the proposed metric is remarkably consistent with human perception, and performs statistically better than the relevant existing metrics for both symmetric and asymmetric distortions.

2. The proposed metric

Owing to recent advances in neural science and visual cognition theories, many psychophysical and neurological findings have enabled us to more clearly understand the binocular visual perception process. Because the new challenges related to 3D-VQM originate from the binocular interactions between the eyes, a deeper understanding of the binocular vision mechanism is beneficial to the development of effective 3D-VQM metrics. Here, we briefly introduce the findings of binocular vision research (for V1 areas) that are related to this work. Then, various binocular quality-aware features are extracted from the receptive fields of simple cells (one class of V1 neurons), and the receptive fields of complex cells (the other class of V1 neurons); these will change in the presence of distortions. Finally, using support vector regression (SVR), the binocular quality-aware features are mapped to the binocular quality score to learn the quality prediction model. The high-level diagram for the proposed NR 3D-VQM framework is depicted in Fig. 1, in which a database is divided into two randomly non-overlapping subsets: a training set and a test set.

Fig. 1 Proposed NR 3D-VQM framework for 3D visual signals

Download Full Size | PDF

2.1 Binocular vision

Binocular vision is a complex visual process that combines information from multiple sources. These may be HVS-related sources such as retinal disparity and vergence, or sources related to the environment. From a neuro-biological point of view, the HVS covers the entire vision chain from the retina to the brain. The retinal information is transmitted by the optic nerve through the optic chiasm, the optic tract, and finally to the visual cortex. Figure 2 illustrates a simplified hierarchical structure of a primate’s visual cortex. Recently, there has been strong neurophysiological evidence that a generic description (in terms of a variety of visual properties) is determined in areas V1–V4 and the middle temporal (MT/V5). This covers approximately 60% of the visual processing volume in the primate neocortex (more details can be obtained in [23]), especially the visual properties of V1, which are highly correlated with the binocular visual description in the primate neocortex. More specifically, V1 is the first area containing neurons that receive input from both eyes and are able to compute disparity. A prominent model for disparity estimation in V1 is the energy model, which is based on Gabor wavelets with slight phase or positional shifts. Furthermore, dissimilar visual stimuli between the retinal signals trigger rivalry; these findings are related to this work. The detailed functions of other visual areas (i.e. V2–V4 and MT/V5) are outside the scope of this work, and we do not describe them in this section.

Fig. 2 Simplified hierarchical structure of the primate’s visual cortex

Download Full Size | PDF

There are two major classes of neurons in the primary visual cortex (V1): simple cells and complex cells. Once the two retinal signals are captured by the receptive fields of the ganglion cells, they are transferred to the lateral geniculate nucleus (LGN), then to the visual cortex. The simple cells are the first to receive this retinal information at the visual cortex. These cells are characterized by elongated receptive fields having a specific orientation and size. If the stimulus undergoes an impairment that changes the characteristics of the simple cells, the response of the simple cells will decrease depending on the introduced gap. Several physiological experiments demonstrated that these cells can be simulated using Gaussian derivative functions [24]. In particular, Gaussian derivative responses can be used to characterize various visual semantic structures of the visual cortex perception, including lines, edges, corners, and blobs; these relate closely to human subjective judgments [24].

The receptive fields of complex cells do not have the same characteristics as the receptive fields of simple cells, which are responsible for sensory fusion and rivalry in the visual cortex. In order to generate a binocular signal resulting from monocular information, retinal signals are converged until they reach the visual cortex, where they are processed by complex cells. In this visual processing stage, binocular visual systems can perceive the differences between two views at the same retinal location, owing to two important binocular interactions—disparity energy responses and binocular rivalry responses [25]. More specifically, in a disparity energy response, two slightly different retinal signals are perceived by the left and right eyes. These signals continue on to the visual cortex where the binocular fusion occurs; finally, a single visual perceptual impression of the scene is obtained [26]. In a binocular rivalry response, the mismatched inter-view stimuli trigger binocular rivalry [11]. In particular, a binocular rivalry response is a visual perception effect that occurs when both eyes are provided with two mismatched views at the same retinal location in 3D space [27].

In summary, the perceptual properties of binocular vision analyzed in this subsection have a major impact in 3D-VQM. Because the visual cortex of the brain is usually very complex (it must manage non-intuitive interactions between multiple 3D visual cues) and is not well understood, this study attempts to simulate, not model, the characteristics of the perceptual process and the mechanisms of binocular vision in 3D quality measurement.

2.2 Binocular quality-aware features extraction

2.2.1 Binocular quality-aware features for simple cells

From a computational point of view, the responses of the classical cortical receptive fields in the simple cells can be simulated using Gaussian derivative functions [16,24,28,29], including the Gaussian smoothed gradient magnitude (GM) function [28] and the Laplacian of Gaussian (LOG) function [29]. GM and LOG features are basic elements that are commonly used to form visual semantic structures. As we will see, they are also strong quality-aware features for predicting the visual quality of both views. Let ${\overset{⌢}{S}}_{l}$ denote a visual signal for the left view. Its GM and LOG maps can be defined as

{\overset{⌢}{G}}_{l} = \sqrt{{[{\overset{⌢}{S}}_{l} \otimes h_{x}]}^{2} + {[{\overset{⌢}{S}}_{l} \otimes h_{y}]}^{2}}

and

{\overset{⌢}{L}}_{l} = {\overset{⌢}{S}}_{l} \otimes h_{D O G}

respectively. “

\otimes

” is the linear convolution operator;

h_{x}

and

h_{y}

denote the Gaussian partial derivative filters applied along the horizontal (x) and vertical (y) directions, respectively.

h_{D O G}

is the Gaussian kernel. GM and LOG operators can remove a significant number of visual redundancies, whereas certain correlations between neighboring pixels will remain. As in [16,30], joint adaptive normalization (JAN) is used to normalize the GM and LOG coefficients to obtain stable statistical visual representations. Subsequently, the normalized GM features

{\overset{⌣}{G}}_{l} (m, n)

are quantified into K levels

{{\overset{⌣}{g}}_{1}^{l}, {\overset{⌣}{g}}_{2}^{l}, \dots, {\overset{⌣}{g}}_{K}^{l}}

at each location (m,n), and the normalized LOG features

{\overset{⌣}{L}}_{l} (m, n)

are quantified into K levels

{{\overset{⌣}{l}}_{1}^{l}, {\overset{⌣}{l}}_{2}^{l}, \dots, {\overset{⌣}{l}}_{K}^{l}}

at each location (m,n). For conciseness of notation, we denote

{\overset{⌣}{G}}_{l} (m, n)

by

G_{l}

and denote

{\overset{⌣}{L}}_{l} (m, n)

by

L_{l}

. The details of above processes can be obtained in [16]. Then, the normalized histograms of the left view from

G_{l}

and

L_{l}

are represented as follows:

H_{G_{l}} ({\overset{⌣}{g}}_{i}) = \frac{1}{M \times N} \sum_{m = 1, n = 1}^{M, N} f (G_{l} (m, n), {\overset{⌣}{g}}_{i}^{l}), i = 1, 2 \dots, K

and

H_{L_{l}} ({\overset{⌣}{l}}_{i}) = \frac{1}{M \times N} \sum_{m = 1, n = 1}^{M, N} f (L_{l} (m, n), {\overset{⌣}{l}}_{i}^{l})

respectively.

f (s, t) = {\begin{matrix} 1, if s = t \\ 0, otherwise \end{matrix} � s & t \in [1, K]

; M and N denote the size of

{G_{l} (m, n)}

.

The computational process of the right view’s normalized histograms, $H_{G_{r}} ({\overset{⌣}{g}}_{i}^{r})$ and $H_{L_{r}} ({\overset{⌣}{l}}_{i}^{r})$ , is the same as that of the left view’s normalized histograms.

2.2.2 Binocular quality-aware features for complex cells

The manner in which the HVS perceives 3D and 2D visual signal distortions differs significantly, because of the delicate mechanisms in the binocular vision system that manage similarity and dissimilarity information from the retinas.

As a first step in this study, we attempt to analytically formulate partial binocular visual perception by applying the disparity energy response produced by the complex cells, in order to merge retinal information [26]. The disparity energy response depends on the disparity in the input stimulus. If the disparity is changed by slightly shifting the position of the stimulus, the two associated monocular responses will change as well. When two monocular responses belonging to the left and right views do not have the same position, the right monocular response ${\tilde{C}}_{r} (m, n)$ is a shifted version of the left monocular response ${\tilde{C}}_{l} (m, n)$ . Several physiological experiments [26] showed that monocular responses can be modeled using an analytical model such as Gabor filters. Moreover, Gabor filters have nonzero means, and thus they are affected by direct current (DC) components. Log-Gabor filters proposed in [31] are an alternative to Gabor filters. The log-Gabor filters discard the DC components and can overcome the bandwidth limitation of traditional Gabor filters. In this paper, log-Gabor [31] has a Gaussian-shaped response along the logarithmic frequency scale, instead of Gabor filters; this is used to construct the monocular responses of left and right views. By integrating the position shift with two monocular responses, the disparity energy response $\tilde{E}$ can be expressed by

\tilde{E} (m, n) = {‖ {\tilde{C}}_{l} (m, n) + {\tilde{C}}_{r} (m, n) \cdot e^{j Δ φ (m, n)} ‖}^{2}

where

‖ • ‖

denotes the modulus operation.

According to previous studies [10–12], the subjective quality of an asymmetrically distorted 3D visual signal generally cannot be predicted from the average quality of both views. In addition to the receptive fields of the disparity energy response of complex cells, binocular rivalry response (which is the main factor affecting the perceived 3D quality of symmetrically distorted stimuli) is a reasonable explanation for this observation. Binocular rivalry response is a perceptual effect that occurs when both eyes see mismatched left and right views at the same retinal location. In other words, binocular rivalry response is the result of competition between the eyes. Results from previous studies [32] have demonstrated that binocular rivalry response is strongly governed by low-level sensory factors. Inspired by this result, a well-known linear rivalry model is learned. Specifically, the energies of the complex 2D-Gabor filter responses on both views are used to stimulate rivalrousness. Consequently, binocular rivalry response _{$\tilde{B} (m, n)$} is defined as

\tilde{B} (m, n) = {\hat{W}}_{l} (m, n) \cdot {\tilde{M}}_{l} (m, n) + {\hat{W}}_{r} (m + d, n) \cdot {\tilde{M}}_{r} (m + d, n)

where _{${\tilde{M}}_{l}$} and _{${\tilde{M}}_{r}$} respectively denote the monocular response’s magnitudes for the left and right views, which can be defined as _{${\tilde{M}}_{ξ} (m, n) = ‖ {\tilde{C}}_{ξ} (m, n) ‖, ξ \in {l, r}$}. d is the disparity. _{${\hat{W}}_{l}$} and _{${\hat{W}}_{r}$} are weights to represent the binocular rivalry response process.

Overall, disparity energy responses and binocular rivalry responses reflect different aspects of the receptive fields of complex cells.

Motivated by the application of local binary patterns [33,34], we propose “positive” and “negative” modified local binary patterns (MLBPs) in this work. The positive and negative MLBP-based binocular quality-aware features extraction scheme for binocular rivalry responses and disparity energy responses is described as follows.

First, the positive MLBP of binocular rivalry responses can be defined as

M L B P_{P, R}^{P, B} ({\tilde{B}}_{c}) = \sum_{p = 0}^{P - 1} s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) 2^{p}, s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) = {\begin{matrix} 1, if {\tilde{B}}_{p} - {\tilde{B}}_{c} \geq β \\ 0, otherwise \end{matrix}

where

{\tilde{B}}_{c}

and

{\tilde{B}}_{p}

represent the central value and its P-neighbors, respectively.

β

denotes the threshold (

β > 0

). To reduce computational complexity, R and P were set to 1 and 8 in our experiments, respectively. Figure 3 shows the MLBP encoding procedure.

Fig. 3 Encoding procedure of MLBP operator (P = 8, R = 1, and β = 12)

Download Full Size | PDF

According to the principle of the uniform rotation invariant pattern [34] (three patterns are proposed in [34]: a uniform pattern, a rotation invariant pattern, and a uniform rotation invariant pattern. They exhibit similar performance in representing the visual structure information. However, compared with the other two patterns, the uniform rotation invariant pattern requires the lowest number of features to represent structure information. Hence, in this paper, we use a uniform rotation invariant pattern to extract binocular quality-aware features from complex cells), a uniform rotation invariant pattern of the positive MLBP can be defined as

M L B P_{P, R, r i u 2}^{P, B} = {\begin{matrix} \sum_{p = 0}^{P - 1} s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) i f U (M L B P_{P, R}^{P} \leq 2) \\ P + 1 \end{matrix}

\begin{matrix} U (M L B P_{P, R}^{P, B}) = | s ({\tilde{B}}_{P - 1} - {\tilde{B}}_{c}) - s ({\tilde{B}}_{0} - {\tilde{B}}_{c}) | \\ + \sum_{p = 1}^{P - 1} | s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) - s ({\tilde{B}}_{p - 1} - {\tilde{B}}_{c}) | \end{matrix}

where U denotes the frequency of 1-0 and 0-1 transitions in a circular representation of the upper MLBP. The mapping from

M L B P_{P, R}^{P, B}

to

M L B P_{P, R, r i u 2}^{P, B}

(the subscript riu2 represents the use of the uniform rotation invariant patterns with U ≤ 2), which has P + 2 different positive MLBP value types, can be obtained using a look-up table.

The negative MLBP of binocular rivalry responses can be defined as

M L B P_{P, R}^{N, B} ({\tilde{B}}_{c}) = \sum_{p = 0}^{P - 1} s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) 2^{p}, s ({\tilde{B}}_{p} - {\tilde{B}}_{c}) = {\begin{matrix} 1, if {\tilde{B}}_{p} - {\tilde{B}}_{c} \leq - β \\ 0, otherwise \end{matrix}

For the negative MLBP’s uniform rotation invariant pattern

M L B P_{P, R, r i u 2}^{N, B}

of binocular rivalry responses, the computational process of

M L B P_{P, R, r i u 2}^{P, B}

is the same as that of

M L B P_{P, R, r i u 2}^{P, B}

.

Then, the joint empirical probability function of the uniform rotation invariant patterns of the positive and negative MLBPs can be denoted by

τ_{p_{1}, p_{2}} = P (H_{M L B P_{P, R, r i u 2}^{P, B}} = p_{1}, H_{M L B P_{P, R, r i u 2}^{N, B}} = p_{2}), p_{1} & p_{2} \in [0, P + 1]

Instead of using

τ_{p_{1}, p_{2}}

to learn the prediction model, it is preferable to extract a smaller set of binocular quality-aware features from

τ_{p_{1}, p_{2}}

for this task. As the quality-aware features of binocular rivalry responses, the normalized histograms

H_{M L B P_{P, R, r i u 2}^{P, B}}

and

H_{M L B P_{P, R, r i u 2}^{N, B}}

are respectively defined as

{\begin{matrix} H_{M L B P_{P, R, r i u 2}^{P, B}} (M L B P_{P, R, r i u 2}^{P, B} = p_{1}) = \sum_{p_{2} = 0}^{P + 1} τ_{p_{1}, p_{2}} \\ H_{M L B P_{P, R, r i u 2}^{P, B}} (M L B P_{P, R, r i u 2}^{P, B} = p_{2}) = \sum_{p_{1} = 0}^{P + 1} τ_{p_{1}, p_{2}} \end{matrix}

For the quality-aware features of disparity energy responses, the computational process of the disparity energy response’s normalized histograms,

H_{M L B P_{P, R, r i u 2}^{P, E}}

and

H_{M L B P_{P, R, r i u 2}^{N, E}}

, is the same as that of the binocular rivalry responses.

2.2.3 Features analysis

The joint normalized histograms

F = [H_{G_{l}}, H_{L_{l}}, H_{G_{r}}, H_{L_{r}}, H_{M L B P_{P, R, r i u 2}^{P, E}}, H_{M L B P_{P, R, r i u 2}^{N, E}}, H_{M L B P_{P, R, r i u 2}^{P, B}}, H_{M L B P_{P, R, r i u 2}^{N, B}}]

of the distorted 3D visual signal are formed from the various above-mentioned binocular quality-aware features. Figure 4 shows the joint normalized histograms of natural 3D visual signals. We can observe that the shapes of the joint normalized histograms are generally consistent across different natural 3D visual signals. Further, Fig. 5 shows an example of the joint normalized histograms of two distortion types with various visual content. As shown in Fig. 5, it is both very apparent and interesting that 3D visual signals with the same distortion type and close difference mean opinion scores (DMOSs) share similar joint normalized histogram shapes, despite their various 3D visual content. Therefore, we can conclude that the joint normalized histograms behaved in a content-independent manner, and are stable and feasible statistical features for the NR 3D-VQM task. The above observations are the principal motivations of our proposed metric.

Fig. 4 Joint normalized histograms of natural 3D visual signals with different content.

Download Full Size | PDF

Fig. 5 Joint normalized histograms of distorted 3D visual signals for two distortion types with different visual content.

Download Full Size | PDF

2.3 Binocular perceptual quality prediction

In the binocular perceptual quality prediction stage, given the vectors of the joint normalized histograms and the DMOS values of the training 3D visual signals, a statistical regression model is established to learn a feature pooling strategy. SVR is very effective for high-dimensional data pooling and is widely applied in machine learning. With SVR, local statistical features are mapped to quality scores to train the quality prediction model. In this paper, ɛ-SVR [1] is utilized for quality prediction model learning. The training data given are {(x₁,y₁), ..., (x_k, y_k)}, where x_i, i = 1, ..., k, is the feature vector and y_i is the DMOS value. Usually, we first map the input feature vector into high-dimensional feature space Φ(x), and then learn the regression function:

\begin{array}{l} f (x) = 〈 \sum_{i = 1}^{k} t_{i} Φ (x_{i}), Φ (x) 〉 + b \\ = \sum_{i = 1}^{k} t_{i} 〈 Φ (x_{i}), Φ (x) 〉 + b \end{array}

The inner product

〈 Φ (x_{i}), Φ (x) 〉

can be written as a kernel function k(x_i, x) that makes the feature mapping implicit. k(x_i, x) = exp(-γ(|x_i-x|)²) is a radial basis kernel function, γ is the precision parameter, and b is the bias term. More details about ɛ-SVR can be found in [35].

3. Experimental results and analyses

3.1 Protocol

In order to test the prediction accuracy of the proposed metric, two databases are utilized. The LIVE 3D Phase I database [11] consists of 365 symmetrically distorted 3D visual signals, which were generated from 20 natural 3D visual signals by corrupting them with five different distortion types. Four 80-stereopair sets were respectively corrupted with JPEG2000 (JP2K) compression, JPEG compression, white noise (WN), and a simulated fast fading Rayleigh channel (FF); in addition, one 45-stereopair set was corrupted with Gaussian blur (Gblur). The LIVE 3D Phase II database [21] includes the same distortion types as Phase I (JPEG, JP2K, Gblur, WN, and FF). In Phase II, each distortion type was used to corrupt 72 stereopairs. For each distortion type, each of eight natural 3D visual signals was processed to create three symmetrically distorted 3D visual signals and six asymmetrically distorted 3D visual signals; thus, 120 symmetrically distorted 3D visual signals and 240 asymmetrically distorted 3D visual signals were derived from the eight natural 3D visual signals. Each distorted 3D visual signal in these databases was evaluated by human subjects, and assigned a DMOS. A lower DMOS indicates higher visual quality for the 3D visual signal.

In the Phase I and II databases, natural 3D visual signals with sizes of 640 × 360 pixels were obtained by using a high-performance range scanner, and were shot using a 65 mm baseline. For subjective tests, 3D visual signals in the Phase I database were displayed on an iZ3D 22-inch 3D monitor with passive polarized 3D glasses. 3D visual signals in the Phase II database were displayed on a Panasonic 58 inch 3D TV with active shutter glasses. The viewing distance was four times the screen height in both the Phase I and Phase II databases. Each subject reported either normal or corrected-to-normal vision, and no acuity or color test was deemed necessary. There were 32 participants in Phase I; the majority of the participants were males. Six females and 27 males participated in Phase II. Phase I and Phase II are actually different and complementary databases — the Phase I database contains only symmetrically distorted 3D visual signals, while the Phase II database contains both symmetrically and asymmetrically distorted 3D visual signals. Single stimulus continuous quality evaluation (SSCQE) with hidden reference was used as the subjective test methodology on the Phase I and II databases. The following instruction was given to each participant: “Give an overall rating based on your viewing experience when viewing the stereo stimuli.” The ratings were obtained on a continuous scale labeled with equally spaced adjective terms: bad, poor, fair, good, and excellent. Further, the experiment was divided into two stages, each lasting less than 30 min, in order to minimize participant fatigue. A training stage using six stimuli was conducted before the beginning of each study, to verify that the participants were comfortable with the 3D display, and to help familiarize them with the user interface used in the task. For the subjective quality scores, difference opinion scores (DOS) were obtained by subtracting the participant’s reference stimulus ratings from their corresponding test-distorted stimulus ratings. The remaining subjective scores were then normalized to Z-scores and averaged across participants to produce DMOS. Additional information regarding the two databases can be found in [11,21].

Three generally used performance indicators are employed to benchmark the performance of competing 3D-VQM metrics. The first indicator, Pearson’s linear correlation coefficient (PLCC), measures prediction accuracy. The second indicator, Spearman’s rank ordered correlation coefficient (SROCC), serves as a measure of prediction monotonicity. Each PLCC and SROCC can have a maximum value of 1 — the higher the value, the better the objective 3D-IQA metric. Before computing the PLCC and RMSE, the predicted quality scores were processed through a five-parameter logistic regression function

f (x) = a_{1} (\frac{1}{2} - \frac{1}{1 + e^{a_{2} (x - a_{3})}}) + a_{4} x + a_{5}

where are model parameters obtained by using a nonlinear logistic regression [8]. The third indicator, root-mean-squared error (RMSE), evaluates prediction consistency. A RMSE value close to zero indicates a close correlation with human perception.

Because the proposed metric requires a training procedure, a cross-validation test is implemented by randomly partitioning each database into non-overlapping training and test sets. In each train-test procedure, 80% of the database content was selected for training; the remaining content was used for testing. To eliminate performance bias, the train-test procedure was repeated 1000 times; the median PLCC, SROCC, and RMSE values among those 1000 train-test iterations were used for verification. When utilizing the ɛ-SVR [35] to learn the quality prediction model, the ɛ-SVR parameters (C, γ) must be set. We conducted a cross-validation experiment using a grid search engine to select the values of (C, γ). The optimal (C, γ) delivering the best performance are summarized in Table 1.

Table 1. Cross-validation experiment results.

View Table | View all tables in this article

3.2 Feature validity experiment

According to Marr’s theory [36], when a 3D visual signal deteriorates, the joint normalized histograms that result from the inherent dependence between adjacent pixels will change correspondingly. Here, we analyze the validity of the joint normalized histograms. To better illustrate how 3D visual signal distortions affect these histograms, Fig. 6 plots them across different distortion types and levels. The figure shows that the histograms are altered in the presence of distortions. The more severe the distortion, the greater the alteration in the histogram plot. We can conclude that the shapes of the joint normalized histograms depend on the distortion levels. Here, we only present two distortion types; however, similar conclusions can be drawn for other distortion types. Because the relationships between the perceptual quality and the joint normalized histograms are interesting and promising, they can be helpful in designing NR VQM for 3D visual signals.

Fig. 6 Joint normalized histograms of distorted 3D visual signals at different distortion levels.

Download Full Size | PDF

3.3 Overall measurement of performance

In order to investigate the effectiveness of the proposed metric for all distortion types, we compared it with the existing state-of-the-art 3D-VQM metrics, including four FR metrics: SSIM [8], FSIM [9], Lin’s metric [10], and Chen’s metric [11], and three NR metrics: BRISQUE [14], Xue’s metric [16], and Chen’s metric [21]. The overall testing results are listed in Table 2; in terms of PLCC, SROCC, and RMSE on the two 3D databases, the results of the best-performing NR metrics for each database are highlighted in boldface. In order to apply 2D NR metrics (BRISQUE and Xue’s metric) to the 3D case, feature vectors are extracted separately for the left and right views, and weight-averaged to obtain the final feature vectors for training.

Table 2. Overall performance of seven metrics on two databases

View Table | View all tables in this article

For LIVE 3D Phase I, the results in Table 2 demonstrate that, in the case of symmetric distortion of both views, BRISQUE and Xue’s metric can provide comparatively and reasonably accurate quality prediction of 3D visual signals. For LIVE 3D Phase II, binocular rivalry is the main factor that affects the perceptual quality of symmetrically distorted 3D visual signals. The performance of the above metrics [14,16] was poor because this factor was not considered. However, Chen’s metric [21], based on studies of binocular effects, delivers competitive performance. It is clear that the proposed metric is highly consistent with human perception, and is markedly superior to the standard FR and NR metrics for all distortion types. The proposed metric achieves superior performance in terms of quality prediction accuracy, monotonicity, and consistency.

3.4 Performance on individual distortion types

To more comprehensively evaluate the ability of a 3D-VQM metric to predict the perceived quality of 3D visual signals across different types of distortions, we examined the performance of the representative 3D-VQM metrics on specific types of distortions. To save space, only the results of SROCC are presented in Table 3. The NR 3D-VQM metrics producing the highest performance are highlighted in boldface. As shown in Table 3, the proposed metric is among the top NR 3D-VQM metric six times, followed by Xue’s (three times) and BRISQUE (one time). Although some NR metrics may be effective for some individual distortions (e.g., BRISQUE and Xue’s [16] metrics are more effective for WN and JPEG distortions from the LIVE 3D Phase I database, respectively, and Xue’s [16] metric is more effective for JPEG and FF distortions from the LIVE 3D Phase II database), the proposed metric is competitive with the most effective metric for individual distortion types.

Table 3. Performance of seven metrics for each distortion type

View Table | View all tables in this article

3.5 Database independent experiments

For the experiments discussed in Subsections 3.3 and 3.4, the proposed metric was tested on the LIVE 3D IQA Phase I and Phase II databases, respectively. In order to verify that the proposed metric is independent of the database, it is necessary to conduct experiments that apply the regression model trained on one database and tested on another database. Therefore, we conducted the following experiments. First, a prediction model was trained using the LIVE 3D Phase I database, then tested with the LIVE 3D Phase II database; next, the prediction model was trained using the LIVE 3D Phase II database, then tested with the LIVE 3D Phase I database. The SROCC indicator was utilized for the evaluations; the results for SSIM, Xue’s metric, and the proposed metric are reported in Table 4. The table shows that, regardless of which training database is used, the proposed metric works effectively and achieves evaluations that are highly consistent with human perception. The experiment demonstrates that the selection of different training databases or a test database does not affect the evaluation results. Further, once trained on a properly prepared 3D database, the proposed metric can be applied to 3D images with arbitrary distortions, especially when the distortions have been covered in the training stage.

Table 4. Database independent testing

View Table | View all tables in this article

3.6 Contributions of each the quality-aware features in the proposed metric

To obtain deeper insight into how the proposed metric’s prediction performance is improved by considering visual perception properties, we designed two different metrics for performance comparison, referred to as metric-A and metric-B. For metric-A, we only use the quality-aware features of the receptive fields of simple cells to learn the quality prediction model. For metric-B, we learn quality prediction by using the quality-aware features of the receptive fields of simple cells and disparity energy responses. Table 5 shows the prediction performance of metric-A, metric-B, and the proposed metric. From Table 5, it is clear that the prediction performance can be further improved by properly considering the visual perception properties together. More specifically, the performance of metric-A is not very effective, because it only considers the quality-aware features of simple cells. When the quality-aware features of disparity energy responses are added to metric-A, the performance of metric-B is improved to a certain extent. When we add the quality-aware features of binocular rivalry responses to metric-B, the performance improvement of the proposed metric is different for different databases. For example, for the LIVE 3D Phase II database, which contains asymmetrically distorted 3D visual signals, the proposed metric performs significantly better than metric-B when binocular rivalry responses are considered. However, for LIVE 3D phase I, because there is little or no binocular rivalry in symmetrically distorted 3D visual signals, the performance improvement of the proposed metric is relatively minor, compared with its performance improvement with LIVE 3D Phase II. In general, overall performance can be gradually improved by a proposed metric that considers the quality-aware features of simple cells and complex cells simultaneously.

Table 5. Performance of each the quality-aware features in the proposed metric

View Table | View all tables in this article

4. Conclusion

One of the most significant challenges in NR 3D-VQM is calculating the quality measurement metric in a perceptually consistent manner. Owing to developments in visual cognition theories and neural science, many neurophysiological findings have helped to explain the binocular visual perception process. In this paper, an effective NR 3D-VQM is proposed that aims to judge quality in an HVS-like manner by simulating binocular vision, which can manage various scenarios including symmetric and asymmetric distortions. Our goal of quantifying 3D visual signal quality in a manner that conforms to binocular vision forms the theoretical significance of our proposed metric. The novelty of our research lies in the quality-aware features of the primary visual cortex (V1) that are applied to predict the perceptual quality of distorted 3D visual signals. This stands in contrast to existing metrics that attempt to measure 3D visual signals either by using 2D metrics or by exploiting depth information. To be more specific, we first simulated the processes of simple cells and complex cells. We then extracted various binocular quality-aware features from simple cells and complex cells. Finally, we built the mapping relationship between the binocular quality-aware features of a 3D visual signal and the corresponding DMOS. With the prediction model in hand, we can effectively measure 3D visual signal quality. Experimental results show that, in comparison with numerous related existing metrics, the proposed metric can yield results that are, statistically, much more consistent with human subjective judgments.

Although the proposed metric aims to simulate binocular vision to construct a 3D-VQM model for measuring 3D visual quality and shows effective performance, it still has limitations in some aspects: 1) it lacks comprehensive data mining in the higher visual areas; and 2) for learning based methods, human subjective scores for 3D visual signals are not easily acquired. To further advance the performance of the proposed metric, more comprehensive study of other binocular vision configurations (V2, V3, V4, and MT/V5) should be conducted in quality measurement. Furthermore, we will explore NR 3D-VQM approaches that do not require human subjective scores for training.

Acknowledgments

This work was supported by the Natural Science Foundation of China (Grant Nos. 61431015, 61371162, 61302112, 61502429), the Zhejiang Provincial Natural Science Foundation of China (Grant Nos. LQ15F020010, LY13F050005).

References and links

1. D. Zhao, B. Su, G. Chen, and H. Liao, “360 degree viewable floating autostereoscopic display using integral photography and multiple semitransparent mirrors,” Opt. Express 23(8), 9812–9823 (2015). [CrossRef] [PubMed]

2. J. Wang, Y. Song, Z. H. Li, A. Kempf, and A. Z. He, “Multi-directional 3D flame chemiluminescence tomography based on lens imaging,” Opt. Lett. 40(7), 1231–1234 (2015). [CrossRef] [PubMed]

3. Y. Gong, D. Meng, and E. J. Seibel, “Bound constrained bundle adjustment for reliable 3D reconstruction,” Opt. Express 23(8), 10771–10785 (2015). [CrossRef] [PubMed]

4. S. Wei, S. Wang, C. Zhou, K. Liu, and X. Fan, “Binocular vision measurement using Dammann grating,” Appl. Opt. 54(11), 3246–3251 (2015). [CrossRef] [PubMed]

5. Y. Cui, F. Zhou, Y. Wang, L. Liu, and H. Gao, “Precise calibration of binocular vision system used for vision measurement,” Opt. Express 22(8), 9134–9149 (2014). [CrossRef] [PubMed]

6. Z. Liu, X. Li, F. Li, and G. Zhang, “Flexible dynamic measurement method of three-dimensional surface profilometry based on multiple vision sensors,” Opt. Express 23(1), 384–400 (2015). [CrossRef] [PubMed]

7. K. Lee and S. Lee, “3D perception based quality pooling: stereopsis, binocular rivalry and binocular suppression,” IEEE J. Sel. Top. Signal Process. 9(3), 533–545 (2015). [CrossRef]

8. Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process. 13(4), 600–612 (2004). [CrossRef] [PubMed]

9. L. Zhang, L. Zhang, X. Mou, and D. Zhang, “FSIM: a feature similarity index for image quality assessment,” IEEE Trans. Image Process. 20(8), 2378–2386 (2011). [CrossRef] [PubMed]

10. Y. H. Lin and J. L. Wu, “Quality assessment of stereoscopic 3D image compression by binocular integration behaviors,” IEEE Trans. Image Process. 23(4), 1527–1542 (2014). [CrossRef] [PubMed]

11. M. J. Chen, C. C. Su, D. K. Kwon, L. K. Cormack, and A. C. Bovik, “Full-reference quality assessment of stereopairs accounting for rivalry,” Signal Process. Image Commun. 28(9), 1143–1155 (2013). [CrossRef]

12. W. Zhou, G. Jiang, M. Yu, Z. Wang, Z. Peng, and F. Shao, “Reduced reference stereoscopic image quality assessment using digital watermarking,” Comput. Electr. Eng. 40(8), 104–116 (2014). [CrossRef]

13. A. K. Moorthy and A. C. Bovik, “Blind image quality assessment: from natural scene statistics to perceptual quality,” IEEE Trans. Image Process. 20(12), 3350–3364 (2011). [CrossRef] [PubMed]

14. A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Trans. Image Process. 21(12), 4695–4708 (2012). [CrossRef] [PubMed]

15. M. A. Saad, A. C. Bovik, and C. Charrier, “Blind image quality assessment: a natural scene statistics approach in the DCT domain,” IEEE Trans. Image Process. 21(8), 3339–3352 (2012). [CrossRef] [PubMed]

16. W. Xue, X. Mou, L. Zhang, A. C. Bovik, and X. Feng, “Blind image quality assessment using joint statistics of gradient magnitude and Laplacian features,” IEEE Trans. Image Process. 23(11), 4850–4862 (2014). [CrossRef] [PubMed]

17. J. Kim, P. V. Johnson, and M. S. Banks, “Stereoscopic 3D display with color interlacing improves perceived depth,” Opt. Express 22(26), 31924–31934 (2014). [CrossRef] [PubMed]

18. K. C. Huang, Y. H. Chou, L. C. Lin, H. Y. Lin, F. H. Chen, C. C. Liao, Y. H. Chen, K. Lee, and W. H. Hsu, “Investigation of designated eye position and viewing zone for a two-view autostereoscopic display,” Opt. Express 22(4), 4751–4767 (2014). [CrossRef] [PubMed]

19. I. Mehra and N. K. Nishchal, “Image fusion using wavelet transform and its application to asymmetric cryptosystem and hiding,” Opt. Express 22(5), 5474–5482 (2014). [CrossRef] [PubMed]

20. R. Akhter, Z. M. Parvez Sazzad, Y. Horita, and J. Baltes, “No-reference stereoscopic image quality assessment,” Proc. SPIE 7525, 75240T (2010). [CrossRef]

21. M. J. Chen, L. K. Cormack, and A. C. Bovik, “No-reference quality assessment of natural stereopairs,” IEEE Trans. Image Process. 22(9), 3379–3391 (2013). [CrossRef] [PubMed]

22. S. Ryu and K. Sohn, “No-reference quality assessment for stereoscopic images based on binocular quality perception,” IEEE Trans. Circ. Syst. Video Tech. 24(4), 591–602 (2014). [CrossRef]

23. N. Krüger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodríguez-Sánchez, and L. Wiskott, “Deep hierarchies in the primate visual cortex: what can we learn for computer vision?” IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1847–1871 (2013). [CrossRef] [PubMed]

24. A. J. Bell and T. J. Sejnowski, “The “independent components” of natural scenes are edge filters,” Vision Res. 37(23), 3327–3338 (1997). [CrossRef] [PubMed]

25. R. Blake and K. Boothroyd, “The precedence of binocular fusion over binocular rivalry,” Percept. Psychophys. 37(2), 114–124 (1985). [CrossRef] [PubMed]

26. Q. Peng and B. E. Shi, “The changing disparity energy model,” Vision Res. 50(2), 181–192 (2010). [CrossRef] [PubMed]

27. R. Sabesan, L. Zheleznyak, and G. Yoon, “Binocular visual performance and summation after correcting higher order aberrations,” Biomed. Opt. Express 3(12), 3176–3189 (2012). [CrossRef] [PubMed]

28. J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell. PAMI-8(6), 679–698 (1986). [CrossRef] [PubMed]

29. D. Marr and E. Hildreth, “Theory of edge detection,” Proc. R. Soc. Lond. Ser. B. Biol. Sci. 207(1167), 187–217 (1980). [CrossRef]

30. Q. Li and Z. Wang, “Reduced-reference image quality assessment using divisive normalization-based image representation,” IEEE J. Sel. Top. Signal Process. 3(2), 202–211 (2009). [CrossRef]

31. D. J. Field, “Relations between the statistics of natural images and the response properties of cortical cells,” J. Opt. Soc. Am. A 4(12), 2379–2394 (1987). [CrossRef] [PubMed]

32. J. W. Brascamp, P. C. Klink, and W. J. M. Levelt, “The ‘laws’ of binocular rivalry: 50 years of Levelt’s propositions,” Vision Res. 109(Pt A), 20–37 (2015). [CrossRef] [PubMed]

33. M. Zhang, C. Muramatsu, X. Zhou, T. Hara, and H. Fujita, “Blind image quality assessment using the joint statistics of generalized local binary pattern,” IEEE Signal Process. Lett. 22(2), 207–210 (2015). [CrossRef]

34. T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution gray-scale and rotation invariant texture classification with local binary patterns,” IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002). [CrossRef]

35. A. J. Smola and B. Schölkopf, “A tutorial on support vector regression,” Stat. Comput. 14(3), 199–222 (2004). [CrossRef]

36. D. Marr, “A theory of cerebellar cortex,” J. Physiol. 202(2), 437–470 (1969). [CrossRef] [PubMed]

Databases	Indicator	FR				NR
Databases	Indicator	SSIM	FSIM	Lin [10]	Chen [11]	BRISQUE	Xue [16]	Chen [21]	Proposed
LIVE 3D Phase I	PLCC	0.873	0.927	0.865	0.917	0.910	0.932	0.895	0.941
	SROCC	0.877	0.921	0.856	0.916	0.901	0.913	0.891	0.919
	RMSE	8.006	6.162	8.242	6.533	6.793	5.819	7.247	5.520
LIVE 3D Phase II	PLCC	0.803	0.794	0.662	0.900	0.771	0.891	0.895	0.916
	SROCC	0.792	0.774	0.683	0.889	0.770	0.878	0.880	0.905
	RMSE	6.741	6.858	8.462	4.987	7.038	5.129	5.102	4.450

Databases	Distortion	FR				NR
Databases	Distortion	SSIM	FSIM	Lin [10]	Chen [11]	BRISQUE	Xue [16]	Chen [21]	Proposed
LIVE 3D Phase I	JP2K	0.857	0.887	0.839	0.888	0.812	0.903	0.863	0.906
	JPEG	0.435	0.521	0.199	0.530	0.569	0.675	0.617	0.650
	WN	0.940	0.938	0.928	0.948	0.940	0.929	0.919	0.924
	Gblur	0.882	0.916	0.935	0.925	0.860	0.906	0.878	0.918
	FF	0.585	0.748	0.658	0.707	0.784	0.635	0.652	0.785
LIVE 3D Phase II	JP2K	0.704	0.845	0.719	0.814	0.593	0.854	0.867	0.928
	JPEG	0.678	0.839	0.631	0.843	0.769	0.887	0.867	0.866
	WN	0.922	0.951	0.907	0.940	0.846	0.874	0.950	0.958
	Gblur	0.838	0.910	0.711	0.908	0.862	0.903	0.900	0.909
	FF	0.834	0.928	0.701	0.884	0.935	0.946	0.933	0.934

Databases	Metrics	JP2K	JPEG	WN	Gblur	FF	All
Testing on Phase I	SSIM	0.857	0.435	0.940	0.882	0.585	0.877
	Xue [15]	0.909	0.704	0.925	0.920	0.684	0.884
	Proposed	0.895	0.694	0.912	0.915	0.658	0.895
Testing on Phase II	SSIM	0.704	0.678	0.922	0.838	0.834	0.792
	Xue [15]	0.879	0.855	0.757	0.903	0.948	0.798
	Proposed	0.903	0.820	0.780	0.881	0.955	0.825

Databases	Metric-A	Metric-B	Proposed
LIVE 3D Phase I	0.908	0.914	0.919
LIVE 3D Phase II	0.851	0.877	0.905

Databases	Indicator	FR				NR
Databases	Indicator	SSIM	FSIM	Lin [10]	Chen [11]	BRISQUE	Xue [16]	Chen [21]	Proposed
LIVE 3D Phase I	PLCC	0.873	0.927	0.865	0.917	0.910	0.932	0.895	0.941
	SROCC	0.877	0.921	0.856	0.916	0.901	0.913	0.891	0.919
	RMSE	8.006	6.162	8.242	6.533	6.793	5.819	7.247	5.520
LIVE 3D Phase II	PLCC	0.803	0.794	0.662	0.900	0.771	0.891	0.895	0.916
	SROCC	0.792	0.774	0.683	0.889	0.770	0.878	0.880	0.905
	RMSE	6.741	6.858	8.462	4.987	7.038	5.129	5.102	4.450

Simulating binocular vision for no-reference 3D visual quality measurement

Abstract

1. Introduction

2. The proposed metric

2.1 Binocular vision

2.2 Binocular quality-aware features extraction

2.2.1 Binocular quality-aware features for simple cells

2.2.2 Binocular quality-aware features for complex cells

2.2.3 Features analysis

2.3 Binocular perceptual quality prediction

3. Experimental results and analyses

3.1 Protocol

3.2 Feature validity experiment

3.3 Overall measurement of performance

3.4 Performance on individual distortion types

3.5 Database independent experiments

3.6 Contributions of each the quality-aware features in the proposed metric

4. Conclusion

Acknowledgments

References and links

Cited By

Figures (6)

Tables (5)

Equations (15)

Optics Express