
New perspectives in face correlation research: a tutorial

Open Access

Abstract

In recent years, correlation-filter (CF)-based face recognition algorithms have attracted increasing interest in the field of pattern recognition and have achieved impressive results in discrimination, efficiency, location accuracy, and robustness. In this tutorial paper, our goal is to help the reader get a broad overview of CFs in three respects: design, implementation, and application. We review typical face recognition algorithms with implications for the design of CFs. We discuss and compare the numerical and optical implementations of correlators. Some newly proposed implementation schemes and application examples are also presented to verify the feasibility and effectiveness of CFs as a powerful recognition tool.

© 2017 Optical Society of America

Glossary of Acronyms

ACE =

average correlation energy

ACH =

average correlation height

ACT =

adaptive color tracker

ASEF =

average of synthetic exact filters

AFR =

automated face recognition

AMACH =

action maximum average correlation height filter

ANN =

artificial neural network

ARCF =

adaptive robust correlation filter

ASLA model =

adaptive structural local sparse appearance model

ASM =

average similarity measure

AUC =

area under the curve

BPOF =

binary phase-only filter

CCD =

charge-coupled device

CF =

correlation filter

CFLB =

correlation filters with limited boundaries

CHC =

circular harmonic component

CHF =

circular-harmonic filter

CMACE =

correntropy minimum average correlation energy filter

CN =

color names

CNN =

convolutional neural network

CPU =

central processing unit

CT =

compressive tracking

CUDA =

Compute Unified Device Architecture

DCF =

discriminative correlation filter

DCCF =

distance-classifier correlation filter

DFT =

distribution fields tracker

DSST =

discriminative scale space tracker

EDFT =

enhanced distribution field tracker

EEMACH =

eigen-extended maximum average correlation height filter

EER =

equal error rate

EMACH =

extended maximum average correlation height filter

FAR =

false acceptance rate

FC layers =

fully connected layers

FCC =

face class code

FFT =

fast Fourier transform

FNR =

false negative rate

FPR =

false positive rate

FRR =

false rejection rate

FRUE =

face recognition in unconstrained environment

FT =

Fourier transform

GMACH =

generalized maximum average correlation height filter

GPU =

graphics processing unit

HOG =

histograms of oriented gradient

IR =

infrared

ICA =

independent component analysis

IVT =

Incremental Visual Tracking

JTC =

joint transform correlator

KCF =

kernel correlation filter

KCFA =

kernel correlation feature analysis

LBP =

local binary pattern

LBP-UMACE =

local binary pattern–unconstrained minimum average correlation energy filter

LFM =

linear functional model

LPQ =

local phase quantization

MACE =

minimum average correlation energy filter

MACH =

maximum average correlation height filter

MDTCF =

minimum distance transform correlation filter

MEEM =

multiple experts using entropy minimization

MF =

matched filter

MILtrack =

multiple instance learning track

MINACE =

minimum noise and correlation energy

MMCF =

maximum margin correlation filter

MOSSE =

minimum output sum of squared error filter

MVSDF =

minimum variance synthetic discrimination function

OAB =

online AdaBoost

ONV =

output noise variance

OP =

overlap precision

OTB =

online tracking benchmark

OTSDF =

optimal trade-off synthetic discrimination function

OTMACH =

optimal trade-off maximum average correlation height filter

PCA =

principal component analysis

PCE =

peak-to-correlation energy

PCF =

polynomial correlation filter

PCI =

peak correlation intensity

PDCCF =

polynomial distance-classifier correlation filter

PHPID =

Pointing Head Pose Image Database

PIE =

pose illumination and expression dataset

POF =

phase-only filter

POUMACE =

phase-only unconstrained minimum average correlation energy filter

PSR =

peak-to-sidelobe ratio

QCF =

quadratic correlation filter

QPUMACE =

quad phase unconstrained minimum average correlation energy filter

RHC =

radial-harmonic component

RHF =

radial-harmonic filters

RKHS =

reproducing kernel Hilbert space

ROAM =

robust online appearance models

ROC =

receiver operating characteristics

SAMF =

Scale Adaptive with Multiple Features tracker

SCF =

segmented composite filter

SDF =

synthetic discrimination function

SLM =

spatial light modulator

SRDCF =

spatially regularized discriminative correlation filter

SSE =

sum of squared error

SVD =

singular value decomposition

SVM =

support vector machine

TAR =

true acceptance rate

TGPR =

tracker using Gaussian processes regression

TLD =

tracking learning detection

TNR =

true negative rate

TPR =

true positive rate

TRR =

true rejection rate

UMACE =

unconstrained minimum average correlation energy filter

UOTSDF =

unconstrained optimal trade-off synthetic discrimination function

VLC =

Vander Lugt correlator

VOT =

visual object tracking

WMMACH =

wavelet modified maximum average correlation height filter

ZACF =

zero-aliasing correlation filter

1. Introduction

The human face plays a very important role in our social interactions and in conveying a person’s identity. Since Kanade developed the first automated face recognition (AFR; see box) system in 1973 [1], face recognition technologies have received increasing attention due to their wide variety of civilian and military applications.

Automated face recognition (AFR): the main task of AFR is to detect and recognize human faces by using a machine. The ultimate goal of AFR is to replicate the capability of the human brain to perform detection and recognition of faces. AFR techniques can be classified into two related operations. (1) Face detection: this is the initial stage of AFR with the purpose of localizing and extracting facial features from any background. (2) Face recognition: this stage considers the face extracted from the background during the face detection stage and compares it with known faces stored in a specific database. This operation generally involves two steps: identification and verification. During the identification step a test face is compared with a set of faces aiming to find the most likely match. Verification is the process where the test face is compared with a known face in the database in order to make the acceptance or rejection decision.

Over the years, numerous computer vision methods have been applied to develop facial recognition or tracking algorithms with high discrimination and robustness [2], e.g., geometric feature-based methods, eigenface (subspace)-based methods [3,4], neural networks, the support vector machine (SVM), and correlation-based methods [5]. Yet there are still many technical challenges to overcome. For example, face images of the same subject may appear very different due to changes in expression, illumination conditions, head orientation, and digital post-processing [6]. Unexpected noise interference and imposters make it more difficult for a recognition system to make a correct decision. To construct a reliable discrimination algorithm that can cope with all of these changes, almost all existing numerical schemes involve complicated and time-consuming computations [7].

Ever since the first use of an optical correlator for implementing a matched spatial filter [8,9], correlation methods have attracted extensive interest among researchers in the face recognition field because of the following advantages: (1) inherent parallelism, (2) shift invariance, (3) good noise robustness, and (4) high discrimination ability [6]. Many correlation filters (CFs) can be expressed in closed form (see Appendix A) in the frequency domain. Based on these closed-form expressions, many optoelectronic hybrid implementation schemes for CFs have been suggested. To improve discrimination, the design procedures of CFs often involve the optimization of performance metrics such as the average correlation energy (ACE) and the correlation output noise variance (ONV). During the recognition stage, CF methods generally involve multiplying the test image spectrum T (the 2D Fourier transform of the image to be recognized) with a CF H [generated from a single training (reference) image or a set of such images whose identity or class is known] and then performing an inverse Fourier transformation (FT). The correlation output is given as

$$C = \mathrm{FFT}^{-1}\{H^{*} \circ T\},$$
where $\circ$ denotes element-wise array multiplication, $*$ represents the complex conjugate, and $\mathrm{FFT}^{-1}$ stands for the inverse fast FT (FFT) operation. From the correlation output, a series of decisional parameters have been developed to perform localization and classification tasks [10]. Standard metrics are the peak-to-correlation energy (PCE) and the peak-to-sidelobe ratio (PSR) [10–12] (see box PSR).
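For readers who want to experiment numerically, a minimal NumPy sketch of this correlation equation is shown below; the function and variable names are ours, and boundary/centering conventions (e.g., fftshift of the output) are ignored for brevity.

```python
import numpy as np

def correlate(test_image, H):
    """Evaluate C = FFT^{-1}{ H* o T } for a test image and a filter H
    given in the frequency domain (same shape as the image)."""
    T = np.fft.fft2(test_image)          # spectrum of the test image
    C = np.fft.ifft2(np.conj(H) * T)     # element-wise product, then inverse FFT
    return np.real(C)                    # discard numerical imaginary residue
```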

Peak-to-sidelobe ratio (PSR):

$$\mathrm{PSR} = \frac{|E\{y(0)\}|^{2}}{\operatorname{var}\{y(\tau)\}},$$
where $\tau \neq 0$ denotes a temporal delay (far from the origin) in the output correlation (the peak is assumed to occur at the origin). The numerator is the squared magnitude of the average output peak, whereas the denominator is the output variance. The value of $\tau$ is often taken to be $L/2$ when the output correlation is in the range $(-L/2, L/2)$ [10]. The PSR is larger for sharper peaks. For simplicity, we use the one-dimensional form of the PSR and PCE definitions.

Peak-to-correlation energy (PCE):

$$\mathrm{PCE} = \frac{|y(0)|^{2}}{E_{y}},$$
where $|y(0)|$ denotes the peak magnitude and $E_{y}$ represents the correlation plane energy. The PCE can also be used to measure peak sharpness [10].
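As an illustration, the following sketch computes a common 2D variant of both metrics from a correlation plane; the exclusion window around the peak, used to estimate the sidelobe statistics, is an assumption on our part.

```python
import numpy as np

def psr_pce(C, exclude=5):
    """PSR and PCE of a 2D correlation plane C (a common 2D variant of
    the 1D box definitions); `exclude` is the half-width of the window
    masked out around the peak when estimating sidelobe statistics."""
    C = np.abs(C)
    peak = C.max()
    py, px = np.unravel_index(C.argmax(), C.shape)

    mask = np.ones_like(C, dtype=bool)
    mask[max(py - exclude, 0):py + exclude + 1,
         max(px - exclude, 0):px + exclude + 1] = False
    sidelobe = C[mask]

    psr = peak**2 / np.var(sidelobe)   # peak energy over sidelobe variance
    pce = peak**2 / np.sum(C**2)       # peak energy over total plane energy
    return psr, pce
```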

In the ideal case, one obtains a pronounced correlation peak with a high PSR (or PCE, exceeding a threshold value) when a test image of the kth class is correlated with a CF synthesized from training images belonging to the same class. In the case of real-time face recognition, the receiver operating characteristic (ROC) curve is also often used to measure the performance of correlation algorithms (see box ROC).

Robust, efficient, and discriminating face recognition schemes based on correlation algorithms have been reported in the past decade [5,13]. Some of them have been implemented successfully and used in practice. Much work has been done to combine concepts of optical correlation with other spatial-domain recognition methods, aiming to achieve new face recognition algorithms with higher discrimination and robustness. The rapid development of powerful optoelectronic interfaces, such as charge-coupled devices (CCDs) [14] and spatial light modulators (SLMs) [15], further accelerates the integration of optical correlation and computer vision methods in this field.

Figure 1. ROC curve.

Receiver operating characteristics (ROC) curve: the ROC is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold values. In the literature, another variant of ROC is created by plotting the false negative rate (FNR) against the FPR.

True positive rate (TPR): The TPR is calculated as the ratio between the number of positive events truly categorized as positive and the total number of actual positive events, also called the true acceptance rate (TAR).

False negative rate (FNR): The FNR is calculated as the ratio between the number of positive events wrongly categorized as negative and the total number of actual positive events, also called the false rejection rate (FRR).

False positive rate (FPR): The FPR is calculated as the ratio between the number of negative events wrongly categorized as positive and the total number of actual negative events, also called the false acceptance rate (FAR).

True negative rate (TNR): The TNR is calculated as the ratio between the number of negative events truly categorized as negative and the total number of actual negative events, also called the true rejection rate (TRR).

Equal error rate (EER): As shown in Fig. 1, the EER is the horizontal coordinate of the point where the ROC curve intersects the blue dashed line. At this point, the FNR is equal to the FPR. A lower EER value generally indicates better performance of the recognition system.
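A simple way to reproduce these quantities is sketched below: given genuine and impostor score sets from a verification experiment, it sweeps the decision threshold and estimates the EER; the accept-above-threshold convention is an assumption of ours.

```python
import numpy as np

def roc_eer(genuine, impostor):
    """ROC operating points and an EER estimate from two score sets
    (scores >= threshold are accepted; a hypothetical convention)."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    fpr = np.array([(impostor >= t).mean() for t in thresholds])  # FAR
    fnr = np.array([(genuine < t).mean() for t in thresholds])    # FRR
    i = np.argmin(np.abs(fpr - fnr))          # threshold where FNR ~ FPR
    return fpr, fnr, 0.5 * (fpr[i] + fnr[i])  # approximate EER
```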

In this tutorial review, we are interested in the design of face recognition algorithms using correlation theory, and in their implementation and potential applications. We hope that this study will prove useful in helping the reader get a broad overview of CF-based face recognition methods, especially the most recent progress in this area. Before discussing typical correlation-based face recognition algorithms, we first present a brief historical background to help the reader appreciate the importance of this research. Our main focus is on the correlation techniques that have been exploited for face recognition.

Classical matched filter (MF): in the frequency domain, this filter is defined as $H_{\mathrm{MF}}(u,v) = \alpha T^{*}(u,v)/N(u,v)$, where $T$ represents the Fourier transform of the reference image that needs to be detected and located, $*$ denotes the conjugate operator, $N$ denotes the power spectral density of the background noise, $u$ and $v$ are independent frequency-domain variables, and $\alpha$ is a constant [10].

As mentioned earlier, the correlation technique can be considered a filtering operation that extracts relevant information for pattern recognition in a complex 2D scene. The earliest CF is the well-known matched filter (MF; see box MF), which is fabricated from a single reference image [8]. The MF can be implemented optically with a Vander Lugt 4-f setup (VLC) [9]. The VLC 4-f setup consists of two cascaded Fourier-transform stages, each carried out by a lens, as shown in Fig. 2. In the input plane, the test image O is first Fourier transformed by illuminating it with a collimated coherent beam. The Fourier spectrum SO is generated at the rear focal plane of the lens, where an MF, $H_{\mathrm{MF}}$, is inserted to filter the resulting spectrum. By performing a second FT with another lens, the correlation result is recorded on the correlation plane (the rear focal plane of the second lens). An alternative implementation of the MF is the joint transform correlator (JTC) [16–18], which avoids the precise alignment of optical elements and the holographic recording of a complex filter, but the spatial size of the input is severely constrained and much of the energy is consumed by the extra autocorrelation signal.

Figure 2. Principle of the VLC setup.

To improve the light efficiency of the MF, Horner and Gianino proposed the phase-only filter (POF; see box POF). Compared with the MF, the POF produces sharper correlation peaks with improved discrimination capability [19].

Phase-only filter (POF): let $T(u,v) = A(u,v)\exp[j\varphi(u,v)]$ represent the Fourier transform of the reference image to be recognized. The POF is defined as $H_{\mathrm{POF}}(u,v) = \exp[j\varphi(u,v)]$, obtained by setting the amplitude part $A(u,v)$ equal to unity [19].

Binary quantization of the POF leads to another filter, the binary POF (BPOF), which can effectively reduce the accommodation burden on the SLM [20]. Since then, numerous nonlinear numerical methods have been applied to improve the robustness and discrimination of the original MF, resulting in the wavelet MF [21], the joint wavelet transform correlator [22], morphological correlation [23], the adaptive JTC [24], the fringe-adjusted JTC [17], and so on. Furthermore, in order to realize in-plane rotation invariance and scale invariance of the correlation peaks, researchers have devised circular-harmonic filters (CHFs) [25] and radial-harmonic filters (RHFs) [26] by extracting circular harmonic components (CHCs) and radial harmonic components (RHCs) from the training images.
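The following sketch builds the three single-reference filters discussed above (MF, POF, BPOF) in NumPy. Note that conjugation conventions differ across the literature; here the conjugate is baked into each filter so the correlation is a direct product, and the binarization rule shown is just one common choice.

```python
import numpy as np

def single_reference_filters(ref, noise_psd=1.0):
    """Build the MF, POF, and BPOF from one reference image (sketch).
    The conjugate is built into each filter, so the correlation plane
    is simply IFFT{H o T_test}."""
    T = np.fft.fft2(ref)
    H_mf = np.conj(T) / noise_psd          # classical MF; white noise -> constant PSD
    H_pof = np.exp(-1j * np.angle(T))      # phase-only: unit amplitude everywhere
    H_bpof = np.sign(np.cos(np.angle(T)))  # binarized phase (one common scheme)
    return H_mf, H_pof, H_bpof

def apply_filter(H, test_image):
    """Correlation plane for a filter with the conjugate built in."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(test_image)))
```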

Average correlation energy (ACE): $\mathrm{ACE} = \frac{1}{L}\sum_{i=1}^{L}\sum_{m}\sum_{n}|g_{i}(m,n)|^{2}$, where $g_{i}(m,n)$ $(i = 1, 2, \ldots, L)$ is the correlation output produced by the ith training image. In practice, minimizing the ACE helps sharpen the correlation peaks [11].

Correlation output noise variance (ONV): $\mathrm{ONV} = h^{+}\tilde{N}h$, where $+$ denotes the conjugate transpose operator, $h$ represents the CF in column-vector form, and $\tilde{N}$ is a diagonal matrix with the Fourier power spectral density of the noise model (in vector form) along its diagonal. When the noise is white, $\tilde{N}$ becomes a (scaled) identity matrix. Generally, minimizing the ONV enhances noise robustness [11].
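For concreteness, both metrics can be evaluated directly in the frequency domain, as in the sketch below (up to Parseval normalization constants, which depend on the FFT convention; the function name and array layout are ours).

```python
import numpy as np

def ace_onv(h, training_spectra, noise_psd):
    """Frequency-domain ACE and ONV (up to normalization constants).
    `h` and `noise_psd` are flattened d-vectors; the rows of
    `training_spectra` (L x d) are flattened training-image spectra."""
    D = np.mean(np.abs(training_spectra) ** 2, axis=0)  # diagonal of D
    ace = np.real(np.conj(h) @ (D * h)) / h.size        # h+ D h
    onv = np.real(np.conj(h) @ (noise_psd * h))         # h+ N~ h
    return ace, onv
```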

A single MF is vulnerable when a wide variety of face samples of the same subject are considered for recognition [6,14], because it contains feature information from only a single face sample. In the ideal case, a face recognition system needs to exhibit distortion invariance to intra-class changes while maintaining sufficient discrimination capability to reject imposter classes. Several approaches have been proposed in the literature to handle this problem through the design of CFs. They generally fall into two categories. The first approach is to prepare, beforehand, a large number of MFs corresponding to all kinds of changes of each subject. The advantage of this solution is that it does not involve complicated training of filters. However, in order to use it for high-speed recognition tasks with low error rates, one needs a large-volume digital memory device, such as a hard disk drive (HDD) or a holographic optical memory system, to store these filters while ensuring high transfer and processing speeds. If the shape or appearance of the target object is relatively simple, or the capture conditions of the test images are tightly constrained, such recognition systems can work very well; examples include the high-speed holographic correlation system for face recognition in [27] and the road sign recognition system proposed in [28].

In more complicated cases, this solution is clearly impractical because we cannot prepare an MF (or other filters generated from a single facial image of one subject) for each combination of variations. To address this issue, the second approach, based on composite filters (also called advanced CFs), should be used. In many applications, the first solution and the composite filters are often integrated by using different performance optimization strategies [29]. A composite filter is a single filter that carries feature information extracted from two or more training images [6]. Composite filters can be classified into two different categories: (i) linear composite filters and (ii) nonlinear composite filters. Linear composite filters can be expressed as a linear summation of multiple training spectra (i.e., Fourier spectra of training images) in the frequency domain. This class of filters can be further categorized into two classes: linear constrained filters and linear unconstrained filters. The constrained filters are designed to satisfy hard constraints that specify the peak values for the training images. The earliest composite filter, the equal-correlation-peak synthetic discrimination function (ECP-SDF) filter, is a typical linear constrained filter [30], where the correlation peak intensities produced by authentic training images are set to “1” while the peaks for imposter training images are set to “0”. To help resolve the wide-sidelobe problem of the ECP-SDF, Mahalanobis et al. proposed minimum average correlation energy (MACE) filters that can produce sharp correlation peaks at the origin [31]. Unlike the isolated-point (desired peak value) constraints of the ECP-SDF, the MACE attempts to control the entire correlation plane, where output values are suppressed at all points except at the origin. To achieve this, the MACE is designed to minimize the average correlation energy (ACE) resulting from the training images (see box) while constraining the correlation peaks to predefined values. Because the ECP-SDF and MACE were not designed to tolerate input noise, Kumar proposed the minimum variance SDF (MVSDF), which minimizes the correlation output noise variance (ONV) (see box) while satisfying hard constraints on the peak values [32]. To achieve a trade-off between noise tolerance and peak sharpness, Refregier presented the optimal trade-off SDF (OTSDF), which elegantly integrates the design ideas of the MACE and MVSDF by introducing an adjustable trade-off factor [33]. Ravichandran and Casasent introduced another trade-off filter, i.e., the minimum noise and correlation energy (MINACE) filter, where an envelope equal to or larger than the noise is employed at each frequency of the power spectra of the training images [34].
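As an illustration of how such constrained designs are computed, below is a minimal NumPy sketch of the well-known closed-form MACE synthesis $h = D^{-1}X(X^{+}D^{-1}X)^{-1}u$; the function name, the regularization constant, and the default all-ones constraint vector are our assumptions.

```python
import numpy as np

def mace_filter(training_images, u=None, reg=1e-8):
    """Closed-form MACE synthesis, h = D^{-1} X (X+ D^{-1} X)^{-1} u.
    Columns of X are flattened training spectra, D is the diagonal
    average training power spectrum, and u holds the constrained peak
    values (ones for true-class images). `reg` avoids division by zero."""
    X = np.stack([np.fft.fft2(im).ravel() for im in training_images], axis=1)
    u = np.ones(X.shape[1]) if u is None else np.asarray(u, dtype=float)
    D = np.mean(np.abs(X) ** 2, axis=1) + reg   # diagonal of D
    DinvX = X / D[:, None]                      # D^{-1} X
    A = np.conj(X).T @ DinvX                    # X+ D^{-1} X (L x L)
    h = DinvX @ np.linalg.solve(A, u)           # filter in vector form
    return h.reshape(training_images[0].shape)
```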

A drawback of linear constrained filters is that their correlation peaks cannot take on the specified values for non-training images. Moreover, studies have shown that hard constraints on the correlation peak outputs can be detrimental [5,6]. As a result, a wide variety of linear unconstrained filters were obtained by relaxing or removing the abovementioned hard constraints. One example is the maximum average correlation height (MACH) filter, where the average correlation height (ACH) (see box) produced by true-class training images is maximized without any hard constraints on the correlation peaks [35]. In addition to retaining large correlation peaks, the MACH achieves distortion tolerance by minimizing the average similarity measure (ASM) (see box) yielded by the true-class training images. By making an appropriate approximation of the MACH, one obtains the unconstrained MACE (UMACE) filter [35]. Although hard constraints are replaced with softer requirements in UMACE filters, in Kumar’s face recognition tests they exhibited discrimination ability and distortion invariance comparable to those of MACE filters [11]. Optimal trade-off approaches are introduced by relating different correlation metrics (ACE, ONV, ASM, and so on), which results in the unconstrained OTSDF (UOTSDF) [36] and the optimal trade-off MACH (OTMACH) [37,38]. In order to reduce the dependence on the average of the training images, two trade-off variations of MACH filters, termed the extended MACH (EMACH) [39] and the eigen-extended MACH (EEMACH) [40], were also introduced.

Average similarity measure (ASM): let $g_{i}(m,n)$ be the correlation output produced by the ith training image. The ASM is defined as

$$\mathrm{ASM} = \frac{1}{L}\sum_{i=1}^{L}\sum_{m}\sum_{n}\left[g_{i}(m,n) - \bar{g}(m,n)\right]^{2},$$
where $\bar{g}(m,n) = \frac{1}{L}\sum_{i=1}^{L} g_{i}(m,n)$ is the average of the training-image correlation outputs. Ideally, all correlation surfaces produced by a distortion-invariant filter (in response to a valid input image) would be the same, and the ASM would be zero. In practice, minimizing the ASM improves the stability of the CF [11].

Average correlation height (ACH): $\mathrm{ACH} = \frac{1}{L}\sum_{i=1}^{L} x_{i}^{+}h$, where $+$ denotes the conjugate transpose operator, $x_{i}$ $(i = 1, 2, \ldots, L)$ represents the ith training image written in column-vector form, and $h$ is the CF in column-vector form. Maximizing the ACH strengthens the correlation peak height produced by the training images [11].
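Both unconstrained designs mentioned above have simple closed forms in the frequency domain: the UMACE divides the average training spectrum by the average power spectrum, while the MACH divides it by the ASM (deviation) spectrum. A minimal sketch, with a regularizer of our choosing:

```python
import numpy as np

def umace_mach(training_images, eps=1e-8):
    """Unconstrained closed forms: UMACE h = D^{-1} m, MACH h = S^{-1} m,
    where m is the average training spectrum, D the average power
    spectrum (ACE term), and S the ASM (deviation) spectrum."""
    X = np.stack([np.fft.fft2(im) for im in training_images])  # L x H x W
    m = X.mean(axis=0)                       # average training spectrum
    D = np.mean(np.abs(X) ** 2, axis=0)      # average power spectrum
    S = np.mean(np.abs(X - m) ** 2, axis=0)  # ASM spectrum
    return m / (D + eps), m / (S + eps)      # (UMACE, MACH)
```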

An important advantage of the linear composite filters is that they can be computed efficiently in the frequency domain. In addition to the linear composite filters, some nonlinear composite filters that retain the computational efficiency of linear methods in the frequency domain have also been suggested. Many filter designs are based on the judicious combination of linear composite filters and nonlinear optimization procedures, including the correntropy MACE (CMACE) [41], action MACH (AMACH) [42], wavelet modified MACH (WMMACH) [43], phase-only UMACE (POUMACE) [44], quad phase UMACE (QPUMACE) [45], local binary pattern UMACE (LBP-UMACE) [46], and generalized MACH (GMACH) [47]. In two contrasting approaches, researchers designed the quadratic CFs (QCFs) [48,49] and the polynomial CFs (PCFs) [50,51]. The design of QCFs is based on a quadratic discriminating function that can be decomposed into a set of linear correlators. In PCFs, by contrast, a set of linear correlators is first applied to process the input images, and the correlation outputs are then summed. Other important nonlinear optimization approaches for composite filters include the segmented composite filter (SCF) [52], the adaptive robust CF (ARCF) [53], the minimum output sum of squared error (MOSSE) filter [54], and the average of synthetic exact filters (ASEF) [55].

In addition to the CF designs mentioned earlier, several researchers have found other ways to achieve high efficiency, distortion invariance, and discrimination ability by combining CFs with other pattern recognition methods. The combination of CFs and distance classifiers results in the distance-classifier CF (DCCF) [56], where the distance between the obtained correlation array and a prototype correlation array is calculated to make the class decision. By increasing the separation between true-class and false-class correlation outputs, a modified correlation-based classifier, termed the minimum distance transform CF (MDTCF) [57], achieves better discrimination than the earlier DCCF. Alkanhal and Kumar devised the polynomial DCCF (PDCCF) by combining the DCCF and polynomial CFs (PCFs) for human face recognition [58]. Another class of composite filters is obtained by applying subspace-based methods. For example, Alfalou and Brosseau proposed the independent component analysis (ICA)-based composite POF (ICA-CPOF) [59], where ICA (see box ICA) is employed to optimize the representation of the training images. In order to achieve illumination invariance, Datta et al. proposed a class-specific subspace-dependent nonlinear CF, where class-specific subspace analysis is carried out in the formulation of the CFs [5]. Typical cases based on the combination of an SVM with CFs include the linear shift-invariant maximum margin SVM CF (LMM-SVM-CF) [60] and the maximum margin CF (MMCF) [61].

Independent component analysis (ICA): ICA is an unsupervised computational and statistical method for discovering intrinsic hidden factors in data. ICA exploits the higher-order statistical dependencies among the data and provides a generative model for the observed multi-dimensional data. In the ICA model, the observed data variables are assumed to be linear mixtures of some unknown independent sources (independent components). The mixing system is also assumed to be unknown. The independent components are assumed to be non-Gaussian and mutually statistically independent. ICA can be applied for feature extraction from data patterns representing time series, images, or other media [62].
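As a hedged illustration of how ICA might be applied to flattened face images, the snippet below uses scikit-learn's FastICA; the array shapes, component count, and random placeholder data are ours, not the cited authors' setup.

```python
import numpy as np
from sklearn.decomposition import FastICA  # one widely used ICA implementation

# Hypothetical data: rows of `faces` are flattened training face images.
rng = np.random.default_rng(0)
faces = rng.random((40, 64 * 64))

ica = FastICA(n_components=10, random_state=0)
coeffs = ica.fit_transform(faces)   # independent-component coefficients per face
mixing = ica.mixing_                # faces ~= coeffs @ mixing.T + ica.mean_
```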

In what follows, we present some important specially designed CFs for face recognition applications. The MACE, UMACE, and their extensions were among the earliest CFs used for face recognition. In [11], the authors used the MACE and UMACE to study the applicability of composite filters for face verification. Composite CFs can offer very good matching performance in the presence of variations in the facial images, e.g., facial expressions and illumination changes. In another study, Wijaya et al. found that the optimal trade-off SDF (OTSDF) and the MACE can perform illumination-tolerant face verification of compressed target images at low bit rates by using the JPEG2000 wavelet compression model, but the OTSDF was found to be more suitable for this task due to its built-in noise robustness [63]. Moreover, one can greatly improve the discrimination performance of the MACE by applying a logarithmic transformation to the training images [63,64]. In [65], a comparative study of the MACE and the individual eigenface subspace method for face recognition, in terms of margin of separation, was presented. Additionally, [66] described a method to reduce the complexity of the MACE design while retaining its localization and discrimination performance. In [67], a principal component analysis (PCA; see box) is added to preprocess the phase spectrum of the training images because the obtained subspace can represent a larger set of changes, which helps further enhance the PSR of the authentic facial class. Face verification based on the LBP-UMACE can be found in [46], where the local binary pattern (LBP) operator is applied to the training images of the UMACE, aiming at enhancing the recognition rate and reducing the error rate. Simulation tests demonstrate that the LBP-UMACE produces better performance than the traditional UMACE. The correntropy MACE (CMACE) is another nonlinear variant of the MACE used for face recognition applications [41]. The CMACE equations are formulated in a reproducing kernel Hilbert space (RKHS) induced by correntropy. The correntropy function generalizes the concept of correlation by introducing second- and higher-order moments of the signal statistics, which helps enhance the discrimination of the CMACE. In experiments using the pose, illumination, and expression (PIE) dataset of human faces, the CMACE outperforms the conventional MACE in both distortion invariance and rejection capability. Illumination-invariant face recognition with the MINACE filter is presented in [68], where two forms of MINACE filters and two decision criteria, i.e., peak intensity and the PCE ratio, are combined for simulation tests on the PIE database. Satisfactory performance was achieved for both face identification and verification. In [69], several typical composite filters, including the MACE, UMACE, POUMACE, DCCF, and MDTCF, were compared with traditional learning methods on several face and head-pose databases. The experimental results show that the correlation-based classifiers outperform the traditional appearance-based methods (for example, PCA) in robustness and accuracy. The POUMACE filter provided the best performance, with 100% accuracy on the CMU facial expression database and the Yale frontal face illumination database. In other comparison studies, the polynomial DCCF (PDCCF) also provided better discrimination capability than PCA and the numerical LBP method [58].

Principal component analysis (PCA): this is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors form an uncorrelated orthogonal basis set [70].
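A compact way to realize PCA is through the SVD of the centered data matrix, as in this sketch (the observation-per-row layout and function name are our conventions):

```python
import numpy as np

def pca(X, k):
    """PCA via SVD: rows of X are observations (e.g., flattened face
    images); returns the top-k orthonormal components (maximum variance
    first) and the projection coefficients."""
    Xc = X - X.mean(axis=0)                           # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                               # principal directions
    coeffs = Xc @ components.T                        # projections onto them
    return components, coeffs
```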

Thornton et al. found that applying an SVM during the design of the correlation algorithm can enhance robustness against additive white noise and further optimize the relationship between the peak and the sidelobes in the correlation plane of training faces. Their work resulted in the proposal of the LMM-SVM-CF [61]. Another successful combination of an SVM and a CF is Rodriguez et al.’s MMCF [62], where the localization capability of the CF and the generalization capability of the SVM [71,72] are retained simultaneously. The face class code (FCC) method based on an SVM and a CF was also proposed in [73]. This method can perform face recognition when the number of classes is large. Good recognition results can be achieved under different illumination conditions.

A facial feature extraction method using a kernel CF and a cosine distance criterion was reported in [74,75], where satisfactory experimental results were obtained on the FRGC 2.0 dataset. In [76], kernel correlation feature analysis (KCFA) exhibits good representation capability for unseen datasets and better verification and identification properties on the PIE and AR facial datasets. In recent papers, some face trackers based on kernel CFs have been presented, with better accuracy and distortion-tolerance capability obtained for dynamic tracking [77–79]. In another study, 2D correlation feature analysis was extended to the higher-order tensor case. This tensor-based extension obtains better recognition rates on standard face datasets than the previous 2D analysis [80].

In conventional composite filter design, increasing the number of training images leads to a local saturation phenomenon [52]. To help resolve this problem, Alfalou and co-workers proposed the SCF, where the CF is segmented into several independent sections and each section is assigned to a single training spectrum [52,81]. In [82], a comparative study of the SCF and ICA for human face recognition is presented. Simulation results show that a better classification capability is obtained with the SCF on the Pointing Head Pose Image Database (PHPID). Indeed, a judicious combination of ICA and CFs can also be employed to optimize the decision performance. In [59], an ICA-based composite POF performs face classification tasks with an impressively low false recognition rate.

The QCF for face recognition was proposed in [83], where quaternion data are extracted from the wave decomposition of a face image for filter training. The proposed filter achieves a significant improvement in recognition compared with conventional advanced correlators for illumination-invariant recognition on the CMU PIE database. Representing face image features using the structure of quaternion numbers effectively alleviates the negative effect of illumination disturbances on face images. In subsequent work, the QCF was extended into two variants: the phase-only QCF and the separable trade-off QCF [84]. These filters need only one training facial image per subject while retaining illumination-invariant performance.

By introducing zero-aliasing constraints into the training algorithms of many existing composite filters, Fernandez and co-workers proposed zero-aliasing CFs (ZACFs) in order to reduce the wide sidelobes caused by circular correlations [85,86]. Bolme et al.’s average of synthetic exact filters (ASEF) is another innovative approach to optimizing correlation algorithms [55]. ASEFs differ from composite filters, which impose only a limited number of constraints on the correlation output, in that ASEFs specify the entire correlation plane for each training image. ASEFs do not involve complicated computations and can be trained on larger training sets because the over-fitting effect [55] is alleviated by the averaging computation. In addition to the optimization of correlation algorithms, decision optimization can also improve the discrimination capability of a recognition system. In [87], the linear functional model (LFM) [88] and singular value decomposition (SVD) [89] were employed to denoise the correlation plane, aiming at enhancing the recognition accuracy of correlators. Numerical tests demonstrated that this strategy is compatible with the SCF.

Next, considering the close relationship between pattern recognition and visual tracking, we provide a brief discussion of several CFs used for face tracking. As in pattern recognition in a static scene, a tracking algorithm needs to overcome various critical issues associated with robustness, such as deformation, occlusion, illumination variation, and rotation. Even more critical is that visual tracking requires filters to be trained from a single frame and dynamically adapted to changes in the target appearance. Although many conventional composite filters (and even non-composite filters such as the fringe-adjusted JTC) can effectively perform the localization tasks required by visual tracking, their training requirements make them unsuitable for online tracking. Bolme et al.’s MOSSE filter thoroughly changed this situation [54]. The MOSSE filter is an ASEF-like filter that needs very few training images for initialization. The filter can quickly adapt to appearance changes by running a simple average computation. MOSSE filters have often been used for face recognition and comparison purposes [62,86]. Motivated by the design of MOSSE filters, several researchers have suggested numerous CF-based tracking algorithms [77–79,90]. Compared with conventional trackers based on an appearance model, CF-based methods demonstrate better efficiency and robustness. The interested reader may wish to consult Chen et al.’s study [91] for more details.
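To make the MOSSE idea concrete, here is a minimal sketch of its initialization and running-average update, following the closed form reported in [54]; the class name, learning rate, and regularizer values are illustrative assumptions, and preprocessing (windowing, log transform) is omitted.

```python
import numpy as np

class MOSSE:
    """Minimal MOSSE-style filter (sketch of the closed form in [54]).
    `g` is the desired correlation output (e.g., a narrow Gaussian
    centered on the target)."""
    def __init__(self, patch, g, eta=0.125, eps=1e-5):
        F = np.fft.fft2(patch)
        self.G = np.fft.fft2(g)
        self.A = self.G * np.conj(F)   # numerator of H*
        self.B = F * np.conj(F) + eps  # denominator of H*
        self.eta, self.eps = eta, eps

    def respond(self, patch):
        """Correlation response g_hat = IFFT{ F o H* }."""
        F = np.fft.fft2(patch)
        return np.real(np.fft.ifft2(F * self.A / self.B))

    def update(self, patch):
        """Running average over frames so the filter adapts online."""
        F = np.fft.fft2(patch)
        self.A = self.eta * self.G * np.conj(F) + (1 - self.eta) * self.A
        self.B = self.eta * (F * np.conj(F) + self.eps) + (1 - self.eta) * self.B
```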

Despite the fact that CF approaches have achieved significant results for face recognition, detection, and tracking in the past decade, we must point out that these techniques encounter some recurrent problems that need to be addressed to realize their full potential. Although many of these techniques can provide satisfactory discrimination performance in the laboratory or on benchmark datasets, they often struggle to perform well when unexpected factors induce variations in facial appearance, i.e., intrapersonal, interpersonal, and extrinsic factors [5]. Intrapersonal factors include age, facial expression, and facial paraphernalia, such as facial hair, glasses, and cosmetics, that vary the facial appearance of a given subject. Interpersonal factors are responsible for the differences in facial appearance between persons, such as gender and ethnicity. Extrinsic factors include pose, illumination, scale, and some technical parameters such as focus and resolution, which vary the facial appearance through the interaction of light with the observer and the face. The facial appearance variations caused by these factors can potentially deteriorate the performance of a real-time automated face recognition system. Several key factors that may jeopardize the practical application of CFs are the following. (1) Many CF methods can handle recognition tasks only under moderate illumination variations. The performance of these methods decreases noticeably when the variations become very large or when the illumination is not uniform. Although some correlation algorithms exhibit better illumination invariance [11,46,73,83], the experiments are still confined to very few standard datasets. (2) Projection deformations and self-occlusion caused by pose changes (e.g., 3D head rotation) can degrade the recognition performance of CFs. (3) Occlusion (often introduced by sunglasses, caps, veils, and other persons in the same scene) on the upper side of the face can dramatically influence the identification process of CFs due to the loss of facial features such as the eyes and nose. (4) Low image resolution and poor focus quality are also important factors that result in false recognition because of the loss of specific features. (5) Sometimes, extreme changes in facial expression may also result in recognition failure.

Obviously, it is impossible to resolve all the above-mentioned problems by using only one composite CF. This would require an excessive number of training face images to generalize all of these uncontrolled variations for classifying an array representing a test face in high dimensions. Moreover, increasing the number of training images may also seriously deteriorate the recognition performance of CFs because of the saturation effect. Schemes based on multiple composite filters can alleviate these problems to a certain extent. However, this may pose rigorous technical requirements on the computational capacity and speed of the hardware. Moreover, some practical recognition tasks are perhaps very difficult to resolve using only existing CF methods. For example, the changes in the human face over a period of time due to aging are not trivial [5].

Based on the above considerations, we believe that devising new methods by judiciously combining CFs with other face detection and recognition methods is more effective and practical for resolving the deficiencies of the existing CFs. Due to highly complex exterior environmental interference and the varied nature of facial appearance, it is unlikely that all face detection and recognition tasks can be resolved by using a single strategy. Earlier in this review, we presented many examples for which CFs and other techniques can be combined for performance optimization [41–51,56–59,61,62,67,68,71,72,82,87]. Here, we further suggest several possible approaches:

  • (1) Recently, convolutional neural networks [CNNs, a special case of artificial neural networks (ANNs)] have received much attention because of the impressive results these techniques have achieved for face recognition in unconstrained environments (FRUE) [92–98]. Unlike traditional hand-crafted features [e.g., LBPs and local phase quantization (LPQ)], which often degrade dramatically in an unconstrained environment, the features learned by CNNs are more robust to unconstrained variations such as pose, illumination, expression, and occlusion. Outstanding face recognition rates have been achieved by CNN methods on Labeled Faces in the Wild, a FRUE benchmark database [92–94,99]. What can be gained by combining CFs with CNN methods? To answer this question, several researchers recently developed visual tracking methods by combining CFs and CNNs [100,101]. In [100], the authors proposed to use activations from the convolutional layers of a CNN in a CF-based visual tracking framework. Comprehensive experiments show that a CF-based method that uses convolutional features for image description achieves better tracking performance than state-of-the-art methods (including previous CF-based tracking methods). In another study [101], the authors interpret CFs as counterparts of the convolution filters in deep neural networks. They devised a three-layer CNN that directly learns a mapping, as a spatial correlation, between two consecutive frames for visual tracking. By updating the deeply learned CNN models, the long-term memory of the target appearance is effectively maintained when heavy occlusion or out-of-view episodes appear in the sequences. The introduction of CNN models overcomes the sensitivity of conventional CF-based tracking to heavy occlusion and drifting. These studies suggest that CNN-learned features can be viewed as a powerful tool to overcome the deficiencies of existing CFs in handling some uncontrolled variations.
  • (2) Other strategies deal with 3D face recognition techniques [102,103]. Unlike a 2D intensity image of the face, a 3D facial surface is insensitive to illumination, head pose, and cosmetic variations. Thus, additional invariant measures can be extracted from 3D facial data. Although CF methods are mainly based on 2D data analysis, it is still possible to combine them for performance optimization, because 3D facial analyses often involve the processing and conversion of 2D data, e.g., range images where the pixel value reflects the distance between the facial surface and the sensor [5].
  • (3) Face recognition in infrared (IR) images has also been developed [104–106]. Recognition based only on the visible spectrum is sensitive to variations in illumination conditions. Even when the face is well lit, uncontrolled disturbances from glint, shadow, and makeup can lead to recognition errors. In contrast to the visible spectrum, face images obtained in the far-IR are relatively independent of the ambient lighting because they are associated with the heat pattern emitted by the face [5]. Such facial images are very useful under all light conditions, including total darkness, or when the subject is wearing a mask. Several schemes have been proposed to perform illumination-invariant recognition in IR face images. Heo et al. [104] evaluated the performance of face recognition using visual and IR images with composite filters (MACE and OTSDF). These authors found that thermal face recognition shows a higher recognition rate than visual face recognition under various lighting conditions and facial expressions. However, their experiments were conducted using only the Equinox database.

The remainder of this paper is arranged as follows. Section 2 presents some typical correlation algorithms applied to face recognition. In Section 3, we discuss some recent implementation schemes of correlation techniques. Although some of these applications are not strictly devoted to face recognition, they still demonstrate the potential of CF-based techniques to be compatible with standard optical and electronic hardware devices.

2. Correlation-Based Face Recognition Methods

In this section, we first present some classical composite correlation algorithms applied to face recognition, namely, the MACE and UMACE (Subsection 2.1 [11,31]) and the OTSDF (Subsection 2.2 [37,63]). To the best of our knowledge, these are the earliest CFs employed to tackle the basic issues involved in face recognition, such as robustness under different conditions (including facial expression changes, off-plane rotation, illumination variation, noise interference, and image compression), optimization of discrimination ability, and filtering adaptability. These pioneering works laid a solid theoretical foundation for the application of CF methods in this area. We also pay special attention to recently proposed algorithms, such as the ASEF (Subsection 2.3, [55]), MMCF (Subsection 2.6, [62]), LBP-based UMACE (Subsection 2.7, [46]), ICA-based POF (Subsection 2.8, [59]), and ZACF (Subsection 2.9, [85,86]). The simple ASEF design is described in Subsection 2.3. MOSSE [54], a milestone work for correlation-based tracking, is introduced in Subsection 2.4. Another visual tracking method, which uses CNN-learned features in the CF framework, is discussed in Subsection 2.5 [100]. The face recognition schemes described in Subsections 2.6 to 2.8 demonstrate the flexibility of correlation technique design for a wide variety of applications. Subsection 2.9 reports another approach to algorithm optimization, which corrects a commonly neglected fault in the design of conventional composite filters. In Subsection 2.10, we present a decision optimization scheme whose design idea is very different from the above filter optimization algorithms [20]. Finally, in Subsection 2.11, a recognition scheme based on class-specific subspace analysis is introduced to realize illumination-invariant recognition (i.e., Chap. 11 of [5]). Because of space limitations, several correlation algorithms, such as the QCF [83,84], kernel CFs [74–79], DCCFs [56–58], and the correntropy MACE [41], cannot be discussed in this section. Interested readers may consult the related literature.

2.1. Face Verification with MACE Filters

The MACE filter ([31], see Appendix A.2, item 2) aims at minimizing the ACE generated from the training images while constraining the correlation peak intensities to specified values. In contrast to the ECP-SDF [30], which controls only one isolated point of the correlation plane, to obtain a MACE filter one needs to solve the ACE minimization problem subject to the constraints on the correlation peak values. In several studies, well-designed MACE filters exhibit impressive discrimination, with sharp peaks in response to the training images of the authentic class as well as small output values in response to the imposter classes. In spite of this, two drawbacks of conventional MACE filters have been identified. One is the lack of built-in noise tolerance. The other is that the MACE exhibits more sensitivity to intra-class variation than the other types of composite filters [6].

In recent years, verification based on biometric characteristics (face, fingerprint, and iris) has been regarded as an effective alternative to conventional authentication systems using passwords and/or personal identification numbers because (i) these characteristics are physically attached to the authentic subjects and cannot be lost or stolen, and (ii) they differ from one person to another. Motivated by the excellent discrimination ability of the MACE, Kumar et al. used it for face verification under different facial expressions and illumination conditions [11]. In many methods, the recognition decision is based on the observation of a correlation peak that is strongly related to the brightness level of the input image. To avoid this dependence on brightness and to base the recognition decision on a larger region of the correlation plane, the PSR can be used as a decision criterion. The datasets used for the simulation tests were provided by the Advanced Multimedia Processing Laboratory in the Department of Electrical and Computer Engineering at Carnegie Mellon University. Facial images of 13 subjects are collected in this dataset. For each subject, 75 images (64×64 pixels) with different facial expressions were recorded. Figure 3 shows example images. In [11], each subject’s MACE filter was synthesized from three training images (images 1, 21, and 41 of the 75 images). To test the discrimination performance of these MACE filters, the cross correlations of each subject’s filter with every image of the dataset (75 true-class images and 900 false-class images) were calculated.

In Figs. 4 and 5, we show the best PSR values of the MACE filter (for subject 1) and the worst PSR values (for subject 2), respectively. The horizontal axis represents the image index. The solid curves show the PSRs produced by the authentic faces, and the dotted curves correspond to the imposter faces. For subject 1, one finds a distinct separation between the solid and dotted curves, which means perfect discrimination between the authenticated person and the imposters. Although the results for subject 2 show the smallest margin between the true and false classes, the MACE filter for subject 2 still yields a 99.1% verification performance [11].

Figure 3. Typical images for one subject. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.

Figure 4. PSR values generated from the MACE filter fabricated for subject 1. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.

For the 13 MACE filters, each generated from three training images (images 1, 21, and 41) of one subject, an average EER of 0.15% was obtained [11]. As a comparison, an individual eigenface subspace method [107] designed with the same three training images and tested on the same database yields a larger EER of 0.85%. When the number of training images for each subject’s MACE filter is increased to 25, 100% verification accuracy on the facial expression database is obtained [11].

The test and training images used for the illumination tests in [11] were taken from the illumination subset of Carnegie Mellon University’s PIE dataset, which contains 65 subjects under 21 different illumination conditions. Figure 6 displays the 21 images of subject 2. A single MACE filter and a single UMACE filter (see Appendix A) were fabricated for each person by using three training images with extreme lighting variations (images 3, 7, and 16). In Fig. 7, we show the PSR results when the MACE and UMACE filters of subject 2 were cross correlated with the whole database. From Fig. 7, one finds that the PSR values for the authenticated class are larger than those obtained for the imposters. These results demonstrate that by using three linearly independent training images that generalize the extreme lighting variations, one can obtain high PSR values from any face image that lies in the convex envelope of these training images. Moreover, there is no pronounced difference between the PSR values generated by the MACE and UMACE filters. The same simulation tests were repeated with another combination of training images (images 7, 10, and 19, with frontal facial lighting). A verification accuracy of 93.5% at a FAR of 0% was reported in [11]. The built-in tolerance of MACE filters to illumination variations is mainly due to the emphasis of these filters on high spatial frequencies. As a result, low spatial frequencies, which are affected mostly by lighting changes, are attenuated by MACE filters.

Figure 5. PSR values generated from the MACE filter fabricated for subject 2. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.

Figure 6. Images for one subject obtained for the illumination variation tests. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.

Overall, Kumar et al.’s report can be considered a milestone in the design of CFs. Since this seminal work, much research has been done in designing and applying advanced CFs for biometric verification.

2.2. OTSDF for Face Verification in the Presence of Illumination Variance and Image Compression

The OTSDF [37] is designed to optimally trade off the ONV and the ACE. As mentioned earlier, for a MACE filter to achieve sharp correlation peaks with high discrimination, high frequencies are emphasized by minimizing the ACE. In contrast, low frequencies are emphasized by minimizing the ONV, so that the MVSDF [32] (see Appendix A.2, item 3) obtains optimal noise robustness. To maintain the balance between noise robustness and high discrimination, one must solve an optimal trade-off problem between the two criteria, ONV and ACE, subject to the peak constraints. An adjustable trade-off factor α is introduced in the fabrication of the OTSDF (see Appendix A.2, item 4) to meet the needs of a specific application [37]. The MACE filter is actually a special case of the OTSDF with α=1. If α=0, the OTSDF simplifies to the maximally noise-tolerant filter, i.e., the MVSDF.
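One common parameterization of the OTSDF closed form is $h = T^{-1}X(X^{+}T^{-1}X)^{-1}u$ with $T = \alpha D + \sqrt{1-\alpha^{2}}\,C$, which recovers the MACE at α=1 and the MVSDF at α=0, consistent with the text above. A minimal sketch under that assumption:

```python
import numpy as np

def otsdf_filter(training_images, noise_psd, alpha, u=None, reg=1e-8):
    """OTSDF synthesis: h = T^{-1} X (X+ T^{-1} X)^{-1} u, where
    T = alpha*D + sqrt(1 - alpha**2)*C. D: average training power
    spectrum (ACE term); C = noise_psd: flattened noise PSD d-vector
    (all ones for white noise). alpha=1 -> MACE, alpha=0 -> MVSDF."""
    X = np.stack([np.fft.fft2(im).ravel() for im in training_images], axis=1)
    u = np.ones(X.shape[1]) if u is None else np.asarray(u, dtype=float)
    D = np.mean(np.abs(X) ** 2, axis=1)
    Tdiag = alpha * D + np.sqrt(1.0 - alpha ** 2) * noise_psd + reg
    TinvX = X / Tdiag[:, None]                       # T^{-1} X
    h = TinvX @ np.linalg.solve(np.conj(X).T @ TinvX, u)
    return h.reshape(training_images[0].shape)
```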

In many recognition applications, the situation is often encountered in which the training images are uncompressed, but authentication is performed with mobile devices that transmit compressed images at low bit rates. To determine how well the MACE filters and the OTSDFs perform for compressed images at different bit rates, Wijaya et al. carried out a series of simulation tests [63]. The CMU PIE dataset used in [63] consists of 65 subjects captured under 21 different illumination conditions. Figure 8 shows example gray-level images (100×100 pixels) of subject 2 from the dataset. These high-quality uncompressed images do not have background lighting. In [63], two sets of training images were used to design a MACE filter and an OTSDF for each of the 65 persons. Training set 1 includes three facial images captured with extreme lighting variations, and training set 2 uses three images captured under near-frontal lighting, as shown in Fig. 9.

Figure 7. PSR values for subject 2: authentic subjects (top plots) and imposters (bottom plots). Reprinted with permission from [11]. Copyright 2004 Optical Society of America.

Figure 8. Images for subject 2 in the dataset. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.

Figure 9. Training sets 1 (top) and 2 (bottom). Reprinted with permission from [63]. Copyright 2005 Optical Society of America.

These authors tested the performance of these filters on the complete database. At the recognition stage, the test images were first compressed using the JPEG2000 scheme at various bit rates and then reconstructed at the authentication end. In Table 1, we display the average verification rates of the 65 filters (one per person) at 0% false acceptance rate (FAR) [63]. According to these results, the MACE filters synthesized from training set 1 provide better performance. One can achieve 92.8% verification even at 0.5 bpp (bits per pixel) using these filters. The ROC curves of the MACE filters trained with training sets 1 and 2 of subject 2 (see Figs. 10 and 11) were also compared. The MACE filter trained with training set 1 performs better, operating without error at bit rates as low as 0.8 bpp. The recognition performance strongly depends on a suitable enrollment choice of the training set, which must generalize for the authenticated subject. The corresponding average verification rates of the OTSDFs are also given in [63] (see Table 2). The highest recognition rate in each column is shown in bold; one can see that the choice of the parameter α is closely related to the bit rate of the reconstructed images.

Table 1. Average Verification Rate (at 0% FAR) Achieved by Use of Test Images Compressed to Various Bit Rates with MACE Filters Synthesized with the Two Training Schemes [63]


Table 2. Average Verification Rate (at 0% FAR) Achieved by Use of Test Images Compressed at Various Bit Rates with OTSDFs Synthesized with Different Amounts of Noise Tolerance [63]

Figure 10 ROC of the MACE filter obtained with training set 1 of subject 2. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.

It was also found that taking the logarithm of the intensity of the test images was an effective way to improve the performance of the MACE filter fabricated with training set 2, because it reduces the contrast between the dark and illuminated regions. From the results summarized in Table 3, one can conclude that a filter with more built-in noise tolerance works better at lower bit rates, because a lower bit rate means larger compression artifacts [63]. However, more noise tolerance degrades the discrimination ability. One can realize an optimal trade-off between discrimination and noise tolerance by adjusting the parameter α.


Table 3. Average Verification Performance of 65 People (at 0% FAR) with Compressed Logarithm-Transformed Test Images at Various Bit Rates [63]

Environmental noise interference is very common in images captured under low-light conditions. To evaluate the robustness of these filters to noisy test images, Wijaya et al. added synthetic Gaussian noise to the test images at different signal-to-noise ratios (SNRs) [63]. These noisy images were then compressed and reconstructed for authentication tests. They calculated the verification rates for test images with SNRs of 7, 10, and 15 dB. In Table 4, we present the calculated results for an SNR of 7 dB only (i.e., the worst case); for more complete data, see [63]. Table 4 verifies that the noise interference of test images can be resolved to some degree by the use of OTSDFs. By comparing with simulation results obtained for other SNRs, they found that lower SNRs require a larger amount of noise tolerance, that is, smaller values of α, in order to obtain higher recognition rates.


Table 4. Verification Rates (at 0% FAR) of Test Images with 7 dB Additive White Gaussian Noise Compressed at Various Bit Rates and Tested with the OTSDF at Different Amounts of Noise Tolerance [63]

From these simulation results, Wijaya et al. confirmed that the MACE filter and the OTSDF can perform face verification on test images compressed at low bit rates (JPEG2000 scheme) in the presence of illumination variations and additive white Gaussian noise. By selecting suitable training sets and the value of α of the OTSDF, one can achieve high verification rates (above 92% for FAR=0%) at 0.5 bpp compression. To the best of our knowledge, [63] is the only report in which the influences of image compression, illumination variation, and noise interference on the performance of CFs were considered simultaneously in the face verification task.

2.3. ASEFs

In this section, we summarize the main features of the ASEF (see Appendix A.2, item 13), a class of CFs first proposed by Bolme et al. for eye localization tasks in face images [55]. The ASEF has two marked features that differ from many previous composite filters, which specify only a single peak value of the correlation output per training image. First, the ASEF is an over-constrained filter that exploits the convolution theorem to specify the entire desired correlation surface for each training image. This constraint helps deal with structured backgrounds because the resulting filter can learn to ignore other features on the human face. Second, the final ASEF is obtained simply by averaging the filters generated from the different training images, thereby reducing the over-fitting to training images that often appears in many SDFs with hard constraints. Thus, the training of such filters can be performed over large and more inclusive image sets.

In recent literature, ASEFs along with the modified algorithm MOSSE [54] were also selected for performance comparison with other CFs [85,86]. Although some newly proposed CFs exhibited some advantages over ASEF in recognition rates for some face recognition or tracking tasks, we think that ASEFs deserve special mention because of their rapidity and simplicity. Moreover, performing localization of eyes is often regarded as an important first step in face detection and recognition.

According to the explanation provided by the authors of [55], the over-fitting effect found in many previous composite filters, such as the MVSDF, MACE filters, and optimal trade-off filters, arises mainly because there are few constraints relative to the degrees of freedom of the filter. To address this problem, they proposed to formulate a complete “exact filter” for each training image. This exact filter ensures that the corresponding training image produces an entirely predefined correlation output (e.g., a bright peak centered on the target of interest). An additional benefit is that the target of interest does not need to be strictly centered in the training image, because the location of the desired correlation peak can be selected freely. Using the convolution theorem, they solved for the exact filter in the following form:

$$ H_i^*(u,v) = \frac{G_i(u,v)}{F_i(u,v)}, \qquad (2) $$
where Fi, Gi, and Hi are the Fourier transforms of the ith training image, the corresponding desired correlation output, and the exact filter, respectively. The division in Eq. (2) is element-wise. This type of computation is not entirely original; similar computations are used to produce inverse filters or to perform deconvolution. To produce a synthetic filter that generalizes over the entire training set, Bolme et al. averaged the multiple exact filters:
$$ H_{\mathrm{ASEF}}^*(u,v) = \frac{1}{N}\sum_{i=1}^{N} H_i^*(u,v). \qquad (3) $$
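
As a minimal sketch of Eqs. (2) and (3) (our own illustration, not code from [55]), ASEF training reduces to one element-wise division per training image followed by an average; the Gaussian output width sigma and the small eps guarding against division by zero (in the spirit of the constant term added in [54]) are our assumptions.

    import numpy as np

    def train_asef(training_images, peak_coords, sigma=2.0, eps=1e-5):
        """Average of exact filters, Eqs. (2)-(3); returns H*_ASEF."""
        h, w = training_images[0].shape
        ys, xs = np.mgrid[0:h, 0:w]
        H_star = np.zeros((h, w), dtype=complex)
        for f, (py, px) in zip(training_images, peak_coords):
            # Desired output g_i: a bright Gaussian peak centered on the target
            g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2.0 * sigma ** 2))
            F, G = np.fft.fft2(f), np.fft.fft2(g)
            H_star += G / (F + eps)        # exact filter H_i*, Eq. (2)
        return H_star / len(training_images)

    def apply_filter(test_image, H_star):
        # Correlation plane: inverse FFT of the element-wise product F . H*
        return np.real(np.fft.ifft2(np.fft.fft2(test_image) * H_star))
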
Figure 12 shows the results produced by the exact filters and the final ASEF for localization of the left eye. From Fig. 12, one can appreciate how the averaging emphasizes the common features shared by the entire training set. A modified ASEF applicable to visual object tracking has also been devised by Bolme et al., in which the online training need is reduced by adding a small constant term to the denominators of the exact filters [54].

Figure 11 ROC of the MACE filter obtained with training set 2 of subject 2. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.

The performance of the ASEF in face recognition and tracking is discussed in Subsections 2.6 and 2.9. Here, we present simulation results only for eye localization [55]. In these tests, the FERET dataset was employed to compare the localization ability of the ASEF and several other filters. The FERET dataset contains 3368 images of 1204 subjects. The scheme proceeds as follows. First, the dataset is randomly separated into two sets of 602 people and 1699 images. One of these sets is further divided into a training set of 1024 images and a verification set of 675 images, used for training and tuning of the filters, respectively. The other set is used for testing. To make the task more challenging, each of the training images is perturbed eight times by random similarity transforms, resulting in 8192 training images (128×128). In [55], the quantitative parameter used to evaluate localization is the deviation distance normalized by the intraocular distance, and D<0.1 is chosen as the criterion of successful localization. Several experiments were performed under different conditions using several CFs. In Fig. 13, we present the localization results when the approximate position of the eye is not known in advance. Figure 13(a) demonstrates that the ASEF and the UMACE avoid over-fitting when the number of training images is large, while the localization accuracy of the other filters decreases rapidly because of over-fitting. Here, the alphas are adjustment factors of the composite filters, and sigma denotes the standard deviation of the specified Gaussian output of the ASEF. Figure 13(b) displays the correct rates of three filters as functions of the fraction of the intraocular distance, for which the ASEFs provide the best localization accuracy. Bolme et al. attributed the good performance of the ASEF to two causes. First, the ASEF is trained on the entire face image, including the “wrong” eyes, nose, and mouth [55]; the other composite filters, such as the OTSDF and UMACE, are centered on the correct eyes and therefore have no exposure to these or other perturbations. Second, because the ASEF specifies the correlation output surface for the entire training image, it can yield a high response at the correct eye and a low response over the rest of the face [55].

Figure 12 Training images f1,2,3, specified correlation outputs g1,2,3, and exact filters h1,2,3 of ASEF [55].

Although several newer CFs dedicated to face recognition tasks outperform the ASEF, the algorithm is still widely used for feature localization and face recognition.

2.4. MOSSE Filter

In recent years, the MOSSE filter (see Appendix A.2, item 14) proposed by Bolme et al. [54] has been recognized as a typical example of a correlation-based tracker [77–79,90,91]. Unlike its predecessors, which are poorly suited to tracking because of time-consuming training procedures, the MOSSE filter is an adaptive CF that trackers can use with high efficiency to model the current appearance of the target. In a comparative study, Bolme et al. found that trackers based on the MOSSE filter achieve performance comparable to that of other robust tracking schemes at a higher update rate. Moreover, the implementation of a MOSSE-based tracker is much simpler than that of many other trackers, which often require complex appearance models and optimization algorithms [108–110].

Although the original MOSSE filter was aimed specifically at object tracking, it has often been employed as a comparison benchmark for CFs used in object recognition or identification. When Bolme et al. devised the MOSSE filter, their aim was to develop a modified filter that could keep the robustness of the ASEF [55] while requiring fewer training images to meet high-speed tracking needs. The optimization goal of the MOSSE filter is to find a filter that minimizes the sum of squared error (SSE) between the actual correlation outputs of the training images and the desired correlation outputs. Minimizing the SSE is not a new concept, since it was considered in earlier CF designs [35,111]. Unlike previous designs, in which the target was always carefully placed at the center of the training image and the correlation output was fixed for the entire training set, the MOSSE filter customizes the correlation output for each training image without any target-centering requirement. Bolme et al. [54] deduced the following closed form of the MOSSE filter:

$$ H^* = \frac{\sum_i G_i \odot F_i^*}{\sum_i F_i \odot F_i^*}, \qquad (4) $$
where Fi is the FT of the ith training image fi, Gi is the FT of the desired correlation output gi, * denotes the complex conjugate, and ⊙ denotes element-wise multiplication. The division in Eq. (4) is also element-wise. Because the denominator of the MOSSE filter is the sum of the Fourier power spectra of the training images, it rarely takes on small values, which helps ensure the stability of MOSSE filters. Below, we show the improvement achieved by the MOSSE filter over the ASEF and the unconstrained MACE (UMACE) filter. Bolme et al. conducted a simulation test in which the second-frame PSR values of these trackers are compared as the number of images used in the first frame to train the filters is varied. The training images applied for initialization are derived from random affine perturbations. Comparison results are shown in Fig. 14, from which one can see that the MOSSE filter provides sharper correlation peaks when fewer images are used for training. The training needs of MOSSE-based trackers are indeed greatly relaxed compared with the other two CFs. Bolme et al. also found that they could improve the stability of the ASEF and UMACE by adding a small value (a regularization parameter) to each element of the power spectrum of the training images (see [54] for a detailed explanation of this regularization). By varying the regularization parameter, they obtained the PSR curves shown in Fig. 15. In this test, eight random perturbations were applied to train these filters on the first frame of the video.
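
For comparison with the ASEF sketch above, Eq. (4) can be written in a few lines (again our illustration; the eps term plays the role of the regularization parameter discussed in the text):

    import numpy as np

    def train_mosse(training_images, desired_outputs, eps=1e-5):
        """MOSSE filter, Eq. (4): ratio of summed element-wise products."""
        num = np.zeros(training_images[0].shape, dtype=complex)
        den = np.zeros(training_images[0].shape, dtype=float)
        for f, g in zip(training_images, desired_outputs):
            F, G = np.fft.fft2(f), np.fft.fft2(g)
            num += G * np.conj(F)              # sum_i G_i . F_i*
            den += np.real(F * np.conj(F))     # sum_i F_i . F_i*
        return num / (den + eps)               # element-wise division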

Figure 13 Localization accuracy of several CFs when a priori approximate eye localization is unknown. © 2009 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 2105–2112 [55].

Figure 14 Comparison of the second frame PSR values of the MOSSE, ASEF, and UMACE filters. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].

During tracking, the filter can adapt rapidly to keep up with changes of the target's appearance in pose, scale, and rotation, and under different illumination conditions. In [54], the MOSSE filter learned from the ith frame is computed as

$$ H_i^* = \frac{A_i}{B_i}, \qquad A_i = \eta\, G_i \odot F_i^* + (1-\eta)\,A_{i-1}, \qquad B_i = \eta\, F_i \odot F_i^* + (1-\eta)\,B_{i-1}, \qquad (5) $$
where η denotes the learning rate (the value η=0.125 was selected to let the filter adapt quickly while maintaining robustness). In Fig. 16, we show Bolme et al.’s results for three common test videos used for face tracking. In these figures, green indicates good tracking, yellow indicates that the track drifted off center, and red indicates tracking failure. The black lines show the PSR values, which have been clipped to the range [0,20] (see http://youtube.com/users/bolme2008 for details). Interestingly, these results verify the stability of the MOSSE-based tracker. According to [54], these trackers were written in Python and tested on a single processor of a 2.4 GHz Core 2 Duo MacBook Pro. The average update rate of the MOSSE filter was 250 track updates per second for 64×64 tracking windows. By using C to optimize some slow parts of the code, a median frame rate of 669 updates per second was achieved, over 20 times faster than that of other more complex trackers [54].
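
A hedged sketch of the running-average update of Eq. (5), together with a PSR computation of the kind plotted in Fig. 16, is given below; the 11×11 window excluded around the peak is our choice, following common practice rather than a value fixed in [54].

    import numpy as np

    def update_mosse(A_prev, B_prev, frame_patch, g, eta=0.125):
        """Online update, Eq. (5); the current filter is H* = A / B."""
        F, G = np.fft.fft2(frame_patch), np.fft.fft2(g)
        A = eta * G * np.conj(F) + (1.0 - eta) * A_prev
        B = eta * np.real(F * np.conj(F)) + (1.0 - eta) * B_prev
        return A, B

    def psr(corr_plane, exclude=11):
        """Peak-to-sidelobe ratio: (peak - sidelobe mean) / sidelobe std."""
        py, px = np.unravel_index(np.argmax(corr_plane), corr_plane.shape)
        half = exclude // 2
        mask = np.ones(corr_plane.shape, dtype=bool)
        mask[max(0, py - half):py + half + 1, max(0, px - half):px + half + 1] = False
        sidelobe = corr_plane[mask]
        return (corr_plane[py, px] - sidelobe.mean()) / (sidelobe.std() + 1e-12)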

Figure 15 Comparison of the second frame PSR values of the regulated CFs when different regulation parameters are added. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].

Bolme et al. also discussed the capabilities of several contemporary trackers, such as incremental visual tracking (IVT) [109], robust fragments-based tracking (FragTrack) [108], multiple instance learning (MILTrack) [112], robust online appearance models (ROAMs) [113], and Online Ada-Boost (OAB) [114], and compared their results on the same test sequences with those provided by CF-based trackers. They found that, although these state-of-the-art schemes involve heavyweight classifiers, complex appearance models, and stochastic search techniques, they did not exhibit significant advantages over the correlation-based trackers in tracking stability and efficiency [54]. Interested readers can check [109,112], where tracking results of IVT, ROAM, MILTrack, OAB, and FragTrack are presented on the same video sequences for comparison.

MOSSE-based trackers are well suited to visual object tracking. Based on the framework of MOSSE filters, several variants of the MOSSE filter and other CF-based tracking algorithms have recently been developed with great strengths in efficiency and robustness [77–79,90,91]. Reducing training requirements by optimizing existing algorithms is critical for the application of advanced CFs in online, real-time object tracking.

2.5. Correlation-Based Visual Tracking Using Convolutional Features

Shortly after the introduction of the MOSSE filter, histograms of oriented gradients (HOG) [79], color names (CN) [115], and channel representations [116] were also successfully employed in CF-based tracking schemes because such representations demonstrated good results for tracking. Recently, CNN techniques have drawn considerable attention because they achieved promising results for face recognition and detection [92–101]. In a recent paper, Danelljan et al. [100] employed activations from the convolutional layers of CNNs in a CF-based tracking framework. Using the online tracking benchmark (OTB), the Amsterdam library of ordinary videos for tracking (ALOV300++), and the visual object tracking (VOT) challenge 2015, their experiments demonstrated performance superior to other representations [100]. The CNN is a special case of the ANN. These networks perform a sequence of convolution, local normalization, and pooling operations (called layers) on a fixed-size RGB input image; the final layers are fully connected (FC) layers. CNNs are trained on raw pixels with a fixed input size, and training them requires a large amount of labeled data [100]. The deep features contained in the FC layers of a trained CNN are generic and are typically employed in a variety of vision areas. Interestingly, recent articles have shown that activations from the convolutional layers can lead to better classification results than those from the FC layers of the same network [117]. The convolutional layers are discriminative, semantically meaningful, and contain structural information crucial for the localization task. Additionally, the need for task-specific fine-tuning is mitigated by using convolutional layers [117].

In Danelljan et al.’s scheme [100], the spatial structure of these convolutional layers is exploited to learn two different CFs: the discriminating CF (DCF) and the spatially regularized discriminating CF (SRDCF) [118]. The learned DCF and SRDCF can be viewed as the final classification layer of the network. The DCF framework uses the properties of circular correlation for training and applies the resulting correlation classifier to the input feature channels in a sliding-window fashion. In a DCF-based tracking framework, the CF ft is learned from a set of patches xk sampled at each frame (k = 1, 2, …, t, where t denotes the current frame number). The objective is to find a CF that minimizes the following loss:

$$ \varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \lambda \left\| f_t \right\|^2, \qquad (6) $$
where ⋆ denotes circular correlation generalized to multi-channel signals and yk represents the desired correlation output (e.g., a centered Gaussian function). The weights αk control the impact of each training sample, while the parameter λ determines the impact of the regularization term. To locate the target at frame t, the resulting CF is applied to a sample patch zt extracted at the previous location. To obtain an estimate of the target scale, the filter is applied at multiple resolutions, and the update is completed by finding the maximum correlation score over all evaluated locations and scales. In an SRDCF-based tracking framework [118], Eq. (6) becomes
$$ \varepsilon = \sum_{k=1}^{t} \alpha_k \left\| f_t \star x_k - y_k \right\|^2 + \sum_{l=1}^{d} \left\| w \cdot f_t^l \right\|^2, \qquad (7) $$
where w represents the spatial regularization function, which reflects the reliability of visual features depending on their spatial location, and the superscript l denotes the signal channel. In [100], the samples used for training and detection are obtained by extracting the convolutional features at the appropriate image location.
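
For intuition, the single-channel special case of Eq. (6) admits a closed-form solution in the Fourier domain because circular correlation diagonalizes there; the sketch below is our simplification (the multi-channel, spatially regularized loss of Eq. (7) requires an iterative solver instead), and it reduces to a regularized MOSSE filter when all weights are equal.

    import numpy as np

    def train_dcf_single_channel(patches, outputs, weights, lam=1e-2):
        """Minimize sum_k a_k ||f * x_k - y_k||^2 + lam ||f||^2 (one channel)."""
        num = np.zeros(patches[0].shape, dtype=complex)
        den = np.full(patches[0].shape, lam, dtype=float)
        for a, x, y in zip(weights, patches, outputs):
            X, Y = np.fft.fft2(x), np.fft.fft2(y)
            num += a * Y * np.conj(X)
            den += a * np.real(X * np.conj(X))
        return num / den   # detection: np.fft.ifft2(np.fft.fft2(z) * filt) on patch z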

In their experiments, the authors employed the imagenet-vgg-2048 network [119], trained on the ImageNet dataset for image classification. This network contains five convolutional layers and takes a 224×224 RGB image as input. Figure 17 compares the tracking performance when different convolutional layers are employed in the DCF framework on the OTB dataset [120]. The mean overlap precision (OP) is given to measure the tracking performance. OP is calculated as the percentage of frames in a sequence whose intersection-over-union overlap with the ground-truth bounding box is larger than a threshold T ∈ (0,1) (here T = 0.5 is chosen). Layer 0 represents the intensity of the RGB input, used directly as a feature descriptor. The use of convolutional layers significantly improves the tracking performance, and the best results are achieved with the first layer. Figure 18 compares the first convolutional layer with other hand-crafted features [HOG, CN, and image intensity (I)] commonly used in DCF-based trackers. Here, OP is plotted over a range of thresholds, and the area under the curve (AUC) is used to rank the different methods. The convolutional features achieved the best result, with an AUC of 52.1% [100]. On the full OTB dataset (50 sequences), the authors compared their tracker with 15 state-of-the-art trackers: SRDCF [118], the discriminative scale space tracker (DSST) [90], the kernel correlation filter (KCF) [79], the scale adaptive with multiple features (SAMF) tracker [78], the advanced combat tracker (ACT) [115], the tracker using Gaussian processes regression (TGPR) [121], multiple experts using entropy minimization (MEEM) [122], Struck [123], correlation filters with limited boundaries (CFLB) [124], multiple instance learning (MIL) [125], compressive tracking (CT) [126], tracking learning detection (TLD) [127], the distribution fields tracker (DFT) [128], the enhanced distribution field tracker (EDFT) [129], and the adaptive structural local sparse appearance (ASLA) model [130]. The success plot on the full OTB dataset is displayed in Fig. 19, where the AUC scores for the top 10 trackers are given in the caption. Here, the DCF and SRDCF trackers using convolutional features are named DeepDCF and DeepSRDCF in the legend [100]. DeepSRDCF outperforms all the other trackers in terms of AUC score.
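
The OP metric used in these comparisons is straightforward to restate in code (our sketch; bounding boxes are taken as (x, y, width, height) tuples):

    import numpy as np

    def iou(box_a, box_b):
        """Intersection-over-union of two (x, y, w, h) boxes."""
        xa, ya, wa, ha = box_a
        xb, yb, wb, hb = box_b
        ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
        iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
        inter = ix * iy
        union = wa * ha + wb * hb - inter
        return inter / union if union > 0 else 0.0

    def overlap_precision(pred_boxes, gt_boxes, threshold=0.5):
        """Fraction of frames whose IoU with the ground truth exceeds threshold."""
        return np.mean([iou(p, g) > threshold for p, g in zip(pred_boxes, gt_boxes)])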

Figure 16 Performance of three filter-based trackers on three video sequences for face tracking. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].

Figure 17 Performance comparison when using different convolutional layers in the network. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].

Figure 18 Performance comparison of several feature representations in the DCF framework. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].

They also evaluated these trackers using attribute-based analysis. Figure 20 shows the success plots for four different attributes: scale variation, in-plane rotation, fast motion, and occlusion. For clarity, only the top ten trackers are shown in each plot, and the number in each plot title indicates the number of sequences associated with that attribute. The proposed trackers, especially DeepSRDCF, achieve better performance than the other methods. The VOT challenge is a competition among short-term, model-free visual tracking methods [131], in which trackers are evaluated in terms of accuracy and robustness scores. In Table 5, the authors of [100] compare results generated by the VOT2015 toolkit; in each column, the first, second, and third ranks are marked with superscripts a, b, and c, respectively. The proposed DeepSRDCF tracker achieves the best final rank on this dataset. Because of space limitations, we do not show the experimental results on the ALOV300++ dataset [132]; the interested reader may consult [100] for a detailed discussion. The improved results achieved by DeepSRDCF in visual tracking demonstrate the potential of the convolutional features contained in CNNs. The excellent robustness of DeepSRDCF shown in Fig. 20 is also worth noting.


Table 5. Results Generated by the VOT2015 Benchmark Toolkit [100]

Figure 19 Success plot showing a comparison of our trackers with state-of-the-art methods on the OTB dataset containing all 50 videos. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].

2.6. MMCF

In many computer-based pattern recognition applications, the SVM classifier is very popular [71,72]. SVMs are often designed by extracting features from training images and using a feature vector to represent each image. For a two-class problem, the task of the SVM classifier is to find a hyperplane that maximizes the smallest L2-norm distance (called the margin) between the hyperplane and any data sample, subject to specific constraints. The resulting hyperplane is represented by a solution vector (the normal to the hyperplane) and the bias (or offset) from the origin. The solution vector can be expressed as a linear combination of training vectors (features extracted from the training images); the training vectors with non-zero coefficients are called support vectors. One can use the resulting solution vector for simultaneous object localization and classification by cross correlating the template formed from the solution vector with features extracted from the test image. Because SVMs are designed to maximize the margin, they generally exhibit good generalization for the desired class, but they exhibit poor localization because the peaks resulting from the cross correlation of the SVM template with the test image are not sharp [62]. In contrast, CFs produce sharp peaks and thus offer good localization, but they are not explicitly designed for good generalization. Rodriguez et al. devised the MMCF (see Appendix A.2, item 12), which combines the localization capability of CFs with the generalization capability of SVMs [62].
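
To make the localization use of such a template concrete, the following hedged sketch reshapes a learned solution vector into a 2D template and cross correlates it with a test image via the FFT; the zero padding to the full linear-correlation size (cf. Subsection 2.9) and the peak search are our illustration, not the MMCF algorithm of [62] itself.

    import numpy as np

    def localize_with_template(test_image, template):
        """Linear cross correlation of a template with a test image via FFT."""
        H = test_image.shape[0] + template.shape[0] - 1
        W = test_image.shape[1] + template.shape[1] - 1
        F = np.fft.fft2(test_image, s=(H, W))
        T = np.fft.fft2(template, s=(H, W))
        corr = np.real(np.fft.ifft2(F * np.conj(T)))
        peak = np.unravel_index(np.argmax(corr), corr.shape)
        return corr, peak   # peak height scores the class; peak position locates it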

Design principles of SVMs and CFs are simultaneously considered in the fabrication of the MMCF. In [62], the multi-objective function of an MMCF is written as

$$ \min_{h,b}\;\Bigl( h^T h + C\sum_{i=1}^{N}\xi_i,\;\; \sum_{i=1}^{N}\bigl\| h \otimes x_i - g_i \bigr\|_2^2 \Bigr) \quad \text{s.t.} \quad t_i\,(h^T x_i + b) \ge c_i - \xi_i, \qquad (8) $$
where the superscript T denotes transpose. The first function (the margin criterion) corresponds to the objective function of the SVM classifier, where h represents the solution vector, C>0 is a trade-off parameter, N is the total number of training vectors xi (i = 1, 2, …, N), and the sum of the slack variables ξi ≥ 0 is a penalty term that offsets the effect of outliers. The smaller the value of hTh, the larger the margin. The second function (the localization criterion) is a mean square error (MSE) expression, which can be considered the design basis of CFs; here h ⊗ xi denotes the vector form of the 2D cross correlation between the training image and the template (note that the training image and the template are represented by the vectors xi and h, respectively), and gi is the vector form of the desired correlation output, often chosen to be delta-function-like so as to obtain a sharp peak in the correlation plane. The smaller the value of the second function, the sharper the correlation peak. In the inequality constraint, b denotes the bias from the origin and ti ∈ {−1, 1} represents the class label of training vector xi. The solution vector h can be expressed as a linear sum of the xi, with ti = 1 corresponding to positive coefficients and ti = −1 to negative coefficients. In addition, ci = 1 for true-class training images and ci = 0 for false-class training images [62].

The MMCF design can be implemented with a standard SVM solver by making an appropriate data transformation; detailed procedures and analysis of the design can be found in [62]. It is important to note that Rodriguez et al. introduced an adjustment factor λ that flexibly balances the localization and generalization of the MMCF (see Appendix A.2, item 12). To evaluate the performance of the MMCF, Rodriguez et al. conducted a set of computer vision experiments [62]. Here, we present simulation results for face recognition. Five other classifiers or filters, namely, the SVM, ASEF, MOSSE filter, OTSDF, and UOTSDF, were used for comparison.

In their tests, the test and training faces (64×64) were taken from the Multi-PIE database, which contains 337 subjects (persons) with different face poses, expressions, and illumination variations. The images in this database were captured over multiple sessions (five at most). Figure 21 shows different images of one subject. Rodriguez et al. presented results using frontal images with neutral expressions under different illumination (more than 23,000 images) [62]. In their scheme, to build the filters or SVM templates, a set of true-class training images was selected for each subject, while the true-class training images of all the imposters were used as false-class training images. They performed eight experiments, described below.

  • (1) Test 1: one true-class training image per person (frontal illumination) was selected from session 1, and the images from all the other sessions were tested.
  • (2) Test 2: two true-class training images per person (one with illumination from the right and one from the left) were chosen from session 1, and the images from all the other sessions were tested.
  • (3) Test 3: three true-class training images per person (one with frontal illumination, one with illumination from the right, and one with illumination from the left) were chosen from session 1, and the images from all the other sessions were tested.
  • (4) Tests 4, 5, and 6: similar to Tests 1, 2, and 3, respectively, but the images from session 1 were used for testing. In all these tests, the training images were excluded from the testing sets.
  • (5) Test 7: similar to Test 3, but images with different degrees of occlusion were tested.
  • (6) Test 8: similar to Test 6, but images with different degrees of occlusion were tested.

Figure 20 Attribute-based comparison of our trackers with some state-of-the-art methods on the OTB-2013 dataset. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].

In each experiment, each test image was cross correlated with all 337 templates (each template designed to positively classify one subject). Table 6 shows the classification accuracy for the first six tests using the different filters or classifiers [62]. Tests 1–3 are more challenging than Tests 4–6 because the testing session differs from the training session. From these results, the MMCF exhibits higher recognition ability than the other CFs in Tests 1–3. Rodriguez et al. attributed this good performance to the introduction of the SVM concept, which improves the generalization of the filters. The classification accuracy of the MMCF also outperforms that of the SVM classifier. According to Rodriguez et al.’s interpretation, this originates mainly from the delta-function-like requirement that the MMCF places on the correlation output, in which the inner products of the filter with the centered training images are constrained to be 1 and the inner products with the shifted images are constrained to be 0. Thus, the centered training images as well as all their shifted versions are effectively used in designing an MMCF; this implicit increase in the number of training images also contributes to the classification accuracy of the MMCF. In contrast, SVMs constrain only the inner product of the template with the centered training images and do not constrain the other values of the correlation plane.


Table 6. Classification Accuracy (%) [62]

Tests 7 and 8 were carried out to estimate the occlusion performance of these CFs and the SVM. In Fig. 22, three test images with varying degrees of occlusion are displayed for reference. Figures 23 and 24 present the variation of the rank-1 identification accuracy with an increasing percentage of missing pixels in the image (calculated from Tests 7 and 8, respectively) [62]. In Test 7, the MMCF exhibits more robustness than the other CFs and the SVM at low occlusion percentages. However, it was more sensitive to information loss than the OTSDF and the UOTSDF in Test 8. Finally, Table 7 [62] compares the computational complexity of these algorithms on the MATLAB platform, where N (N = 100) denotes the number of training images, d (d = 40×70) represents the total number of pixels in one training image, and ds (ds = 512×640) is the total number of pixels in one test image. Although training an MMCF is more time consuming, its recognition speed is the same as that of the other CFs and the SVM.

Figure 21 Multi-PIE database test images with illumination variations. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].


Table 7. Computational Complexity (O, see box) and Measured Time [62]

Figure 22 Test images with 25%, 50%, and 75% occlusion. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].

Figure 23 Accuracy loss in Test 7 for face recognition. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].

O (also termed Big O) is a notation that describes the limiting behavior of a function when the argument tends toward a particular value or infinity. In computer science, this notation is used to classify algorithms by how they respond to changes in input size, such as how the processing time of an algorithm changes as the problem size becomes extremely large.

It is instructive to enhance the generalization capability of CFs by introducing the design concept of the SVM classifier. The presented simulation results have demonstrated the potential value of such a hybrid scheme in many face recognition applications. We believe that the successful combination of these two classical recognition techniques, the CF and the SVM, will breathe new life into the design of pattern recognition systems.

2.7. UMACE Filters Using an LBP Operator

As mentioned earlier, during the training period, the MACE filter and its generalization, the OTSDF [31,37], are designed to satisfy hard constraints on the correlation peaks in response to the training images. Although OTSDF filters realize a trade-off optimization between noise tolerance and discrimination, such filters still cannot guarantee high correlation peaks in response to unconstrained (non-training) images from the desired class. Moreover, these hard constraints can actually be replaced with softer requirements [35–40]. Recently, Maddah and Mozaffari proposed a UMACE filter based on the LBP for face verification [46]. Simulation results showed that it achieves satisfying recognition performance in comparison with conventional UMACE filters.

The LBP is a powerful texture descriptor that encodes local primitives into a feature histogram [133]. In the LBP operation on an image, the 3×3 neighborhood of each pixel is binarized using the center value as a threshold (see Fig. 25 for an illustration, in which 211 is the gray-level value of the center pixel). The resulting 256-bin histogram of the LBP labels is employed as a texture descriptor. Figure 26 shows the flowchart of the LBP-UMACE, from which one can see that the LBP is used to pre-process the training images before fabrication of the UMACE.
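
A compact sketch of the basic 3×3 LBP operation described above (our illustration; the clockwise bit ordering of the eight neighbors is an arbitrary but common convention):

    import numpy as np

    def lbp_basic(img):
        """Basic 3x3 LBP: threshold the 8 neighbors at the center value."""
        img = img.astype(np.int32)
        center = img[1:-1, 1:-1]
        code = np.zeros_like(center)
        # (dy, dx) offsets of the 8 neighbors, visited clockwise from top-left
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        for bit, (dy, dx) in enumerate(offsets):
            nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
            code += (nb >= center).astype(np.int32) << bit
        return code   # labels in [0, 255]; their histogram is the 256-bin descriptor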

Figure 24 Accuracy loss in Test 8 for face recognition. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].

Figure 25 Basic LBP operator. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.

In simulation tests, the extended Yale Face Database B was used to evaluate the performance of the LBP-UMACE in the presence of illumination variations. This database contains 38 persons (subjects) under 64 different illumination conditions. In [46], the first twelve images of each person were selected to design the LBP-UMACE filters and the conventional UMACE filters. These CFs were then tested on the complete database (64 images of the authentic person and 37×64 imposter images). Figure 27 shows images of one person from the extended Yale B frontal database [46]. Tables 8 and 9 display the corresponding recognition results of both CFs when the PSR threshold is set to 10 (bold numbers represent successful recognition); upward and downward arrows represent increases and decreases relative to the UMACE filter, respectively. The recognition rates yielded by the LBP-UMACE are found to be higher than those of the conventional UMACE filter. The zero error rates obtained for imposter images constitute an attractive feature of this scheme.


Table 8. Recognition Results of UMACE Filters for the First Five Subjects [46]


Table 9. Recognition Data of LBP-UMACE Filters for the First Five Subjects [46]

Figure 26 Flowchart of the LBP-UMACE filter. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.

In Fig. 28, the recognition rates for each person are given for comparison (PSR threshold = 10); the horizontal axis denotes the subject index. Over all subjects, the average recognition rates of the LBP-UMACE filter and the conventional UMACE were found to be 86.06% and 73.43%, respectively [46]. In addition, Maddah and Mozaffari computed the error rates for all subjects to further verify the effectiveness of their scheme (see Fig. 29); the average error rates were 0.46% for the LBP-UMACE and 7.42% for the conventional UMACE.

Figure 27 Typical images of one subject. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.

Figure 28 Comparison of recognition rates. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.

According to Maddah and Mozaffari’s explanation, because the information of adjacent pixels is encoded at each pixel of the LBP output, the output contains richer frequency content than a common image [46]. As a result, a UMACE filter trained on LBP outputs achieves significantly better recognition ability than the conventional UMACE. Indeed, numerical pre-processing of the training images with the LBP operator helps improve the performance of existing MACE filters. However, important questions remain; for example, how well do LBP-UMACE filters perform when other neighborhood sizes (e.g., 4×4 pixels) are used for the LBP operation?

2.8. Face Recognition Based on Correlation and the ICA Model

One of the main drawbacks of existing face recognition methods is their sensitivity to rotation of the target image with respect to the training images. Although many composite filters have been presented to optimize the robustness and discrimination of the correlation output, their performance deteriorates because of saturation effects when the number of training images increases [52]. Motivated by the enhanced face recognition performance of ICA models in a compressed and whitened space, and by their high sensitivity to high-order relationships among pixels, Alfalou and Brosseau proposed a hybrid scheme based on the VLC and ICA [59], in which ICA [134,135] is carried out for optimal pre-processing of the target image and the recognition stage is performed with a POF.

The ICA algorithm provides an optimal signal representation in the MSE sense [136]. In [59], a given face is considered a linear combination of several independent components (ICs) derived from reference facial images contained in a learning base. The learning base is a set of faces of the same person (e.g., subject X) differing in head orientation. In Fig. 30(a), we show the construction procedure of the IC training base. First, the training faces in the learning base are converted to vector form Vi (i = 1, 2, …, n). Then, these vectors are processed by a specific nonlinear function that ensures good convergence of the subsequent fast ICA algorithm [59]. Finally, fast ICA converts the original learning base into a new learning base consisting of ICs. In this study, the authors chose the ICs that provide the best result.

Figure 29 Comparison of error rates. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.

Next, each face Vi of the learning base is correlated with the different POFs Fj, and the PCEij values of the correlation outputs are calculated. Here, each CF is generated from one IC, Cj. Note that Vi and Cj must be transformed back to 2D form to perform the correlation operations. By carrying out this series of correlation filterings, one obtains a reference matrix of PCEs [see Fig. 30(b)]. Once the components Cj and the PCEij values are determined, the recognition algorithm shown in Fig. 30(c) is carried out. In this algorithm, the vector V? representing the target face can be considered a linear combination in terms of the IC base. By correlating V? with the different POFs Fj, one obtains a set of PCE values Val_Corr = (PCE1, PCE2, …, PCEn) used to identify the target face among the reference images. Each PCEj is obtained by introducing the target image into the input plane of a standard VLC with a POF generated from a single IC. By comparison with the PCE matrix, a set of error values is defined and calculated as

$$ \varepsilon_i = |PCE_1 - PCE_{i1}| + |PCE_2 - PCE_{i2}| + \cdots + |PCE_n - PCE_{in}|, \qquad i = 1, 2, \ldots, n. \qquad (9) $$
This error corresponds to the difference between the Val_Corr vector and the ith row of the PCE matrix. Finally, the average of these errors is computed; if the result is smaller than a given threshold Th, the target face is identified as the face of subject X. The correlation computation with the POF can be implemented optically.
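
A hedged sketch of this decision stage follows; the PCE definition used here (peak energy divided by total plane energy) is the usual convention, and the function names are ours.

    import numpy as np

    def pce(corr_plane):
        """Peak-to-correlation energy: |peak|^2 over the total plane energy."""
        energy = np.sum(np.abs(corr_plane) ** 2)
        return np.max(np.abs(corr_plane)) ** 2 / energy

    def identify(val_corr, pce_matrix, threshold):
        """Eq. (9): compare the target's PCE vector with each reference row.

        val_corr   : length-n vector of PCEs of the target face
        pce_matrix : n x n reference matrix, row i holding PCE_i1 ... PCE_in
        """
        errors = np.sum(np.abs(val_corr[None, :] - pce_matrix), axis=1)
        return errors.mean() < threshold, errors   # authentic if the mean error is small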

Figure 31 shows simulation results obtained from images of the PHPID. Three training images, X(9) (right rotation), X(27) (no rotation), and X(45) (left rotation), were chosen in these simulation tests [marked by the red frames in Fig. 31(b)]. The horizontal axis of Fig. 31(a) denotes the face index, and the vertical axis represents the error values obtained with the different faces of person X(i) (1 < i < 53) and with the 32nd face of another person, Y(32), framed in green in Fig. 31(d). From Fig. 31(a), we find that this algorithm performs well; however, an appropriate choice of threshold is critical for a reliable decision. Figure 31(c) shows the ROCs as another test of this algorithm. Each point of the ROC is calculated with a specific threshold value within the range [2000, 2200]. The horizontal and vertical axes of Fig. 31(c) correspond to the false recognition and true recognition rates, respectively. The solid red line represents the reference values for which the true recognition rate is equal to the false recognition rate. By fixing the threshold at 2008, a recognition rate of 83% is obtained with a false recognition rate of 3.7% [59].

Figure 30 Schematics of the algorithm: (a) definition of the independent component learning base, (b) definition of the PCEs matrix, and (c) recognition procedure. Reprinted with permission from [59]. Copyright 2011 Optical Society of America.

Interestingly, Alfalou and Brosseau’s scheme can be implemented optically, which renders the technique suitable for designing improved image processing systems. The introduction of ICA plays an important role in providing better recognition outputs, with a considerably smaller false alarm rate than that obtained from a simple composite filter. Remarkably, this work is a fruitful test of combining the optical correlation method with a conventional numerical pattern recognition method. Moreover, it should be possible to further improve the performance of this scheme by replacing the POF with other optimal filters. Another novelty of this work is that the PCE error is selected as the decision criterion, in contrast with many conventional schemes. How the choice and number of ICs affect the performance of this scheme remains to be studied in more detail.

2.9. Zero-Aliasing CFs for Face Recognition

To help resolve the aliasing effect in many existing CFs, Fernandez and co-workers proposed zero-aliasing CFs (ZACFs) [85,86] (see Appendix A.2, items 15–18), in which a set of ZA constraints is introduced at the design stage of composite filters to eliminate the template’s non-zero tail, which aliases the correlation output. This aliasing, i.e., parts of the correlation output overlapping with other parts of themselves, can seriously reduce the sharpness of correlation peaks. ZA constraints can be flexibly incorporated into the designs of many existing CFs to optimize both localization and recognition rate. Using two databases of face images, the ZA versions of CFs were found to outperform the originals in recognition ability [86].

CFs can perform simultaneous classification and localization of authenticated objects in a given scene without assuming prior segmentation of the objects in test scenes. However, it has been recognized that there exists a potential design defect in the training stage of these filters that can deteriorate the recognition ability of many CFs [85]. To increase computational efficiency when optimizing various metrics (such as the minimization of MSE) during filter design, the cross correlation in the spatial domain is often converted into an element-wise multiplication of the CF [the discrete FT of the 2D CF template array in the spatial domain] with the conjugate of the discrete FT of the training image in the spatial frequency domain. However, the element-wise multiplication of two discrete FTs does not correspond to the intended linear correlation in the spatial domain but results in a circular correlation, i.e., an aliased version of the linear correlation, as shown in Fig. 32 [86]. The wide sidelobes of circular correlation can affect the localization performance of CFs. The optimization and training stages of almost all CFs in use involve the discrete FT-based operations mentioned above, under the incorrect assumption that circular correlation is identical to linear correlation.
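
The distinction is easy to reproduce numerically (our sketch): multiplying same-size FFTs element-wise yields the circular correlation, whereas padding both arrays to the combined support recovers the intended linear correlation.

    import numpy as np

    f = np.random.rand(64, 64)   # test image
    h = np.random.rand(64, 64)   # CF template (spatial domain)

    # Circular correlation: what an unpadded FFT product actually computes
    circ = np.real(np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(h))))

    # Linear correlation: pad both arrays to (64 + 64 - 1) in each dimension
    s = (f.shape[0] + h.shape[0] - 1, f.shape[1] + h.shape[1] - 1)
    lin = np.real(np.fft.ifft2(np.fft.fft2(f, s=s) * np.conj(np.fft.fft2(h, s=s))))

    # The two differ wherever the correlation wraps around the borders. The ZACF
    # goes further: it forces the template's tail to zero at *design* time, which
    # zero padding at test time alone cannot achieve.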

Figure 31 Simulation results illustrating the ICA algorithm. Reprinted with permission from [59]. Copyright 2011 Optical Society of America.

Figure 32 Linear correlation (left) and circular correlation (right). © 2015 IEEE. Reprinted, with permission, from Fernandez et al., IEEE Trans. Pattern Anal. Mach. Intell. 37, 1702–1715 (2015) [86].

Simple zero padding of the test images and the CF template at the test stage cannot resolve the aliasing problem arising at the design stage of CFs. To remove the aliasing effects of circular correlation thoroughly, Fernandez et al. [86] proposed the ZACF, in which ZA constraints that force the template’s tail to zero at the design stage are incorporated into the optimization procedure of CFs. To overcome the intensive computation in the training stage of the ZACF, Fernandez et al. developed two methods, based on reduced-aliasing CFs (RACFs) and on iterative proximal gradient descent. The former provides a computationally more tractable closed-form solution while allowing part of the aliasing to remain; in addition, a proper zero-padding operation must be applied to the training images. The latter is a fast iterative approach that is more efficient from the standpoint of computational memory.

In [86], the authors display the 3D outputs of the conventional MACE and zero-aliasing MACE (ZAMACE) filters (Fig. 33) to illustrate the benefit of the ZACF. The ZAMACE filter yields a sharper but somewhat lower correlation peak, which is more consistent with the original intent of the MACE design. The performance of the ZACF was quantitatively evaluated for several recognition tasks. Here, we report simulation results dedicated to face recognition, using two different databases: the AT&T Database of Faces (ORL dataset) and the face recognition grand challenge (FRGC) dataset.

Figure 33 Conventional MACE (left) and ZAMACE (right). © 2015 IEEE. Reprinted, with permission, from Fernandez et al., IEEE Trans. Pattern Anal. Mach. Intell. 37, 1702–1715 (2015) [86].

The AT&T Database of Faces (ORL dataset) contains facial images (112×92 pixels) of 40 subjects (persons), with 10 different images recorded for each person. A “leave one out” cross-validation approach is employed in the tests: nine training images are selected to build one CF for each person, and the 40 filters are then tested on the remaining 40 images (one per person) for all subjects, i.e., a total of 1600 correlations. These tests are repeated 10 times. The PCE is taken as the decision criterion. In addition, the EER and the rank-1 identification rates are evaluated for different CFs, namely, the OTSDF, MOSSE filter, and MMCF, and their zero-aliasing versions. The corresponding data are summarized in Table 10. The first column corresponds to the original CFs. The results in column 2 are obtained by calculating the closed-form solution of the ZACF, and column 3 shows the results provided by the accelerated proximal gradient descent method. The other columns correspond to various versions of RACFs with different zero paddings of the training images. The recognition rate of the ZACF based on the closed-form expression (column 2) is higher than that of the original CFs (column 1) in terms of both EER and rank-1 ID. However, arriving at the closed-form solution is complicated and time consuming. It is worth noting that the results obtained by the proximal gradient descent method (column 3) are very close to those generated from the closed-form solution.


Table 10. Comparison of Performance for the Baseline CFs and ZACFs Using the ORL Dataset [86]

The FRGC dataset contains face images (128×128) of 410 subjects; the number of images per subject varies from eight to 88. In [86], three training and test splits were considered by randomly choosing 25% of the images of each subject for training and the remaining 75% for testing. One CF was fabricated for each subject, and every CF was correlated with all test images. Table 11 shows the results for the OTSDF, MACE, MOSSE, and MMCF, where an accelerated proximal gradient approach was used to compute the zero-aliasing solutions. The data listed in Table 11 lead to the same conclusion as Table 10. In these tests, it is worth noting that a delta function was used as the desired correlation output in the design of these filters.


Table 11. Comparison of Performance for the Baseline CFs and ZACFs Using the FRGC Dataset [86]

The ZACF proposed by Fernandez et al. [86] presents a solution to the aliasing problem caused by circular correlation. This algorithmic optimization makes the output of composite CFs more consistent with the intended linear correlation and improves recognition ability with sharper correlation peaks. However, to fully realize the potential of this approach, one needs to find a trade-off between its computational complexity and its discrimination ability. Additionally, more extensive tests against noise, occlusion, and illumination variation need to be carried out in future studies.

2.10. Decision Optimization Algorithm Based on the Denoised Decomposition of the Correlation Plane

In the previous subsections, typical CFs proposed in recent years for face recognition or tracking tasks were discussed. All these schemes endeavor to design modified CF algorithms and/or choose specific decision criteria, such as the PCE and PSR, to improve decisional performance. From another perspective, Alfalou et al. [87] proposed a new decisional optimization algorithm based on a denoised decomposition of the correlation plane. Their aim was not to develop a new CF but to apply a post-processing step to the correlation plane before using a decision metric. In their scheme, the linear functional model (LFM) [88] and the SVD method [89] are employed to separate the background noise from the correlation signal. Simulation results verify that the proposal can increase the TPR by a factor of 5, for an FPR set to 0%, for the SCF [52].

The principle of this algorithm is illustrated in Fig. 34, where the correlation plane Pc is modeled as a finite linear combination of weighted regressors plus a residual signal R:

$$ P_c = \sum_{i=1}^{M} \beta_i Y_i + R \quad \text{or, in matrix form,} \quad P_c = Y\beta + R, \qquad (10) $$
where M is a given integer and βi denotes the weight of the regressor Yi; the regressors comprise noise regressors Yinoise and information regressors Yipeak. To realize this decomposition, a sum of weighted regressors must be generated that best fits the real correlation plane. The central aim of the algorithm is to reconstruct the correlation signal while completely discarding the noise components identified in the decomposition, thereby achieving an optimized correlation output Pcopt. The estimation problem amounts to finding specific regressors for both signal and noise. In [87], a set of 3D sine cardinal functions with different characteristics (standard deviation and mean value) is calculated to model different correlation peak shapes Peak1, Peak2, …, Peakk. The simulated correlation planes are then expanded in terms of a set of orthonormal vectors; to achieve this, thin-SVD matrix factorization is used to produce a set of singular vectors Vi (i = 1, 2, …, k). The final signal regressors are
$$ Y_{\mathrm{peak}} = [\,\mathrm{thinSVD}(\mathrm{peak}_1)\;\cdots\;\mathrm{thinSVD}(\mathrm{peak}_k)\,] = [\,V_1^t\;\cdots\;V_k^t\,], \qquad (11) $$
where t denotes the transpose operation. In [87], four subjects were selected from the PHPID dataset for modeling the correlation noise. Each subject [person, denoted P(j)] has 52 facial images with different head orientations. A POF generated from person 1 [hereafter P(1), as shown in Fig. 35(a)] is then correlated with all of these images. To model the noise associated with P(1), the noise planes are retained undisturbed while the information signals are suppressed. With this analysis, the noise regressors and the complete regressor matrix read
$$ Y_{\mathrm{noise}} = [\,\mathrm{noise}_1\;\cdots\;\mathrm{noise}_n\,], \qquad Y = [\,Y_{\mathrm{peak}}\;\;Y_{\mathrm{noise}}\,]. \qquad (12) $$
Using the obtained noise and signal regressors, the target correlation plane Pc is decomposed according to the linear model of Eq. (10) without the residual term. The weight factors are calculated, and the reconstructed correlation plane is given by P̂c = Yβ⁺, where β⁺ denotes the vector of estimated regressor weights. The residual is obtained as the difference between the target and reconstructed correlation planes, R = Pc − P̂c. The optimized correlation plane is then expressed as Pcopt = Ypeak βpeak⁺ + R, where βpeak⁺ contains the weight factors of the signal regressors Ypeak.
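
In matrix terms, the whole post-processing step is an ordinary least-squares fit followed by a partial reconstruction; a minimal sketch (our convention: regressors as columns of Y) is given below.

    import numpy as np

    def denoise_correlation_plane(Pc, Y_peak, Y_noise):
        """LFM-SVD sketch: fit Pc = Y beta, rebuild from signal regressors only.

        Pc      : correlation plane flattened to a length-d vector
        Y_peak  : d x p matrix of signal (peak-shape) regressors
        Y_noise : d x q matrix of noise regressors
        """
        Y = np.hstack([Y_peak, Y_noise])
        beta, *_ = np.linalg.lstsq(Y, Pc, rcond=None)   # beta^+ via pseudo-inverse
        residual = Pc - Y @ beta                        # R = Pc - Y beta^+
        beta_peak = beta[:Y_peak.shape[1]]
        return Y_peak @ beta_peak + residual            # Pc_opt: noise terms dropped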

Figure 34 Illustration of the denoised separation algorithm. Reprinted with permission from [87]. Copyright 2012 Optical Society of America.

Here we summarize the main points and focus on the ROC curves (Fig. 35) related to the simulation tests worked out in [87]. First, the algorithm was tested with a three-reference segmented filter in a VLC configuration; the training images were chosen from P(1), as shown in the inset of Fig. 35(a). The PCE was selected as the final decision criterion in the simulations. Figure 35(a) shows the results obtained by correlating the face images of P(1) and P(3) with the three-reference segmented filter. Figure 35(b) shows the ROC curves obtained by applying the proposed algorithm (12 models from each person) to post-process the correlation outputs. Without the denoising optimization, the TPR is close to 15% with zero false positives; with the optimization, a TPR close to 70% is obtained for an FPR set to 0%. For comparison, Figs. 35(c) and 35(d) show the ROC curves obtained with a five-reference segmented filter and a five-reference optimal trade-off maximum average correlation height (OTMACH) filter, respectively. These filters achieve TPRs close to 50% and 30%, respectively, at 0% FPR. Although the composite filters of Figs. 35(c) and 35(d) are fabricated with five reference persons, their TPR values are clearly lower than that of the optimized scheme based on three reference persons.

The LFM-SVD-based recognition algorithm puts the emphasis on post-processing optimization of the correlation plane rather than on the filter design itself, in strong contrast with many conventional CF optimization methods. This work makes a major contribution to the decisional optimization of CFs. However, it remains to be determined whether or not the effectiveness of this method is confined to SCFs.

2.11. Class-Specific Nonlinear Correlation Filter for Illumination-Invariant Face Recognition

In this section, we briefly present a novel subspace-based face recognition scheme (see Chap. 11 of [5]). The design is based on the subspace-based reconstruction of faces. The proposed CFs change dynamically according to the input face image in order to achieve robust recognition under all illumination variations, which, for a Lambertian surface, lie in a 3D linear subspace. Figure 36 shows the face reconstruction procedure when the class-specific subspace analysis is performed. From Fig. 36, one can see that the test face image is almost perfectly reconstructed when the test face belongs to the class of training faces (person-1 in PIE). When the image of person-3 (imposter) is projected onto the subspace built from person-1, the reconstructed image looks like person-1. By evaluating the difference between the test face and its reconstruction, it is possible to discriminate the authentic face from the imposter.

Figure 35 ROC curves for performance comparison. Reprinted with permission from [87]. Copyright 2012 Optical Society of America.

Within this context, the authors of [5] proposed a class-specific subspace CF. The detailed process of this scheme is presented in Fig. 37. First, by computing the class-specific subspace over the total of M classes of face images, M class-specific subspaces $E_k$ ($k = 1, 2, \ldots, M$) are obtained. Then, a test face image $T_C^j$ (from any jth class, $j \le M$) is projected onto the subspaces, resulting in M reconstructed images $R_k$. The reconstructed correlation filter $H_r^k$ for each class is calculated from the corresponding reconstructed image. The authors consider only its phase component (the amplitude is set to unity) and denote it as $H_{r\varphi}^k$. A phase-only optimal projecting image CF $H_{p\varphi}^j$ is simultaneously calculated from the test face $T_C^j$. This CF is designed by minimizing the energy of the correlation plane and maximizing the correlation peak height. The final correlation output is computed as

Figure 36 Face reconstruction while the subspace analysis is performed over person-1 (PIE dataset). Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].

$$C = \mathrm{FFT}^{-1}\{H_{p\varphi}^j \times H_{r\varphi}^{k*}\},$$
where $\mathrm{FFT}^{-1}$ denotes the inverse fast Fourier transform and × is element-wise multiplication. The test face is assigned to the class for which the PSR value exceeds a predefined threshold. For linear CFs, it is difficult to discriminate between authentic and imposter images whose pixels lie within a span of low gray levels. To overcome this problem, the authors introduce pixel nonlinearities that help the CF achieve a uniform dynamic range.
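
A compact numerical sketch of this pipeline is given below, under simplifying assumptions: a per-class PCA basis stands in for the class-specific subspace, and both filters are reduced to their phase components; the data, sizes, and the psr helper are illustrative choices, not the exact procedure of [5].

```python
# Sketch of a class-specific subspace CF (illustrative rendering of [5], Chap. 11).
import numpy as np

def pca_basis(train, n_comp):
    """Columns of 'train' are vectorized faces of one class."""
    mean = train.mean(axis=1, keepdims=True)
    U, _, _ = np.linalg.svd(train - mean, full_matrices=False)
    return mean[:, 0], U[:, :n_comp]

def phase_only(img2d):
    F = np.fft.fft2(img2d)
    return F / (np.abs(F) + 1e-12)       # unit amplitude, phase retained

def psr(corr, half=5):
    """Peak-to-sidelobe ratio of a correlation plane."""
    iy, ix = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    mask = np.ones(corr.shape, bool)
    mask[max(0, iy - half):iy + half, max(0, ix - half):ix + half] = False
    side = np.abs(corr[mask])
    return (np.abs(corr[iy, ix]) - side.mean()) / (side.std() + 1e-12)

h, w = 32, 32
rng = np.random.default_rng(2)
train = rng.random((h * w, 5))           # five training faces of class k
test = rng.random(h * w)                 # vectorized test face

mean, E = pca_basis(train, 3)
rec = (mean + E @ (E.T @ (test - mean))).reshape(h, w)   # reconstruction R_k
C = np.fft.ifft2(phase_only(test.reshape(h, w)) * np.conj(phase_only(rec)))
print("PSR for class k:", psr(C))        # classify by the largest PSR
```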

Next, we present some experimental results obtained with nonlinear class-specific CFs. In [5], the PIE and extended YaleB datasets are employed to evaluate the illumination-invariant performance of the filters. To demonstrate the improvement in PSR (see Table 12), 10 sets of three randomly selected training images are taken, and the top-left corner image of Subset-5 (YaleB dataset) is chosen for testing. The test images are not included in the training set but belong to the same subject. The authors of [5] also use ROC curves to further check the performance improvement. For this purpose, two images (index 10 and index 19) out of the 21 face images of one subject of the PIE dataset are chosen for training, and two training images are chosen from YaleB Subset-1. The corresponding ROCs are shown in Figs. 38(a) and 38(b). Figures 38(c)–38(f) show the ROCs generated from different sets of training images: from YaleB with (c) two and (d) three randomly chosen images, and from PIE with (e) three and (f) four randomly chosen images. It is worth observing that these ROC curves show a considerable improvement in robustness against illumination variations.

Table 12. PSR Value Comparison of Different Filters [5]

Figure 37 Detailed process of the face recognition method. Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].

2.12. Summary

In summary, we have reviewed typical CF designs for face recognition. To fully convey the potential of these approaches, many correlation algorithms were described in some detail. Overall, these correlation-based schemes can outperform complicated numerical methods in several respects, such as noise tolerance, discrimination ability, and computational cost. Next, we continue our review by showing how to implement these schemes.

3. Implementation and Application of Correlation Methods

3.1. Implementation

As mentioned earlier, spatial matched filters can be implemented optically with either a VLC or a JTC. By adding suitable nonlinear optimization procedures, one can also implement more complicated correlation schemes, such as the POF, CHF, and composite filters, on a hybrid optoelectronic platform. At this point, it is worth recalling that the optical correlation concept originates mainly from the inherent parallelism of optical information processing and the high propagation speed of light. All-optical computing is a major topic within optical information science. However, compared with the rapid advancement of microelectronic devices over the past two decades, the development of all-optical computing devices has encountered serious scientific and technological obstacles. Hence, the optical computing units of current optical correlators are still confined to performing a few optical integral transforms, such as the Fresnel transform, FT, fractional FT, wavelet transform, and gyrator transform. To optically realize these transforms, one has to arrange optical elements carefully under rigorous conditions and address various limitations that do not arise in numerical implementation, e.g., alignment of optical components or unwanted interference. Moreover, numerous optimization strategies for optical correlation must still be completed numerically: this implies additional conversion interfaces in the setup, which bring specific technical problems such as electronic noise and diffraction effects. For numerous civilian applications, optical correlators are still unable to compete with numerical correlators in terms of flexibility, stability, and cost-performance balance. In the following sections, we present and discuss one example of a numerical correlator based on a graphics processing unit (GPU), and two examples of optical correlators based on either a compact VLC configuration or a holographic memory disk. In this section, we focus our attention on the hardware implementation of CFs, not on the algorithms and optimization design.
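
As a point of reference for both platforms, the sketch below shows the core operation that any correlator, optical or numerical, must realize, namely an FFT-based cross-correlation; the scene and reference are synthetic.

```python
# Minimal numerical correlator: FFT-based cross-correlation of a scene with a reference.
import numpy as np

def correlate(scene, reference):
    """Circular cross-correlation of scene with reference via the FFT."""
    S = np.fft.fft2(scene)
    Rf = np.fft.fft2(reference, s=scene.shape)   # zero-pad reference to scene size
    return np.fft.ifft2(S * np.conj(Rf)).real

scene = np.zeros((128, 128))
ref = np.random.default_rng(3).random((16, 16))
scene[40:56, 70:86] = ref                        # embed the target in the scene
corr = correlate(scene, ref)
print("Detected at:", np.unravel_index(np.argmax(corr), corr.shape))  # ~ (40, 70)
```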

3.1a. GPU Implementation of Segmented Composite POF for Face Recognition

Recently, an all-numerical implementation has been studied for face recognition using a segmented composite POF [29,52]. An iterative pre-processing method for the target image is developed to ensure high discrimination while maintaining robustness against distortions of the subject, e.g., head rotation. In addition, a two-level decision tree learning approach is employed to optimize the recognition procedure with a GPU processor (NVIDIA GeForce 8400GS) (see Fig. 39). For more detail about the iterative processing algorithm and the two-level decision tree learning approach, the interested reader may wish to consult [29]. Experimental results show that the proposed scheme can perform the assigned face recognition task efficiently with high discrimination, i.e., the true recognition rate exceeds 85% in less than 120 ms for stationary images from the PHPID, and exceeds 77% on a real video sequence at 2 frames per second for a database containing 100 persons. It is also worth noting that a higher processing speed of 4 frames per second can be obtained using the more recent NVIDIA Quadro FX 770M GPU.

Figure 38 Comparison of ROC plots for different training sets. Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].

The GPU processor shown in Fig. 40 is the core hardware of this all-numerical implementation, in which one or several streaming multiprocessors are integrated to perform intensive computing in parallel. In this scheme, the compute unified device architecture (CUDA; Fig. 41) is used to program the GPU. In CUDA, each kernel is executed by parallel threads that are grouped into blocks. Ouerhani et al. found that the face recognition algorithm running on the GPU achieves impressive speed compared to a conventional CPU. In [29], comparative computing times were reported for four architectures. Figure 42 demonstrates that the influence of image size on computing time is smaller for the GPU family of processors than for CPU architectures.
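
To give a flavor of such GPU programming, the hedged sketch below moves the same FFT correlation onto a GPU with CuPy (assuming a CUDA-capable GPU with CuPy installed); it illustrates the idea only and is not the CUDA code of [29].

```python
# GPU version of the FFT correlation using CuPy (illustrative, not the code of [29]).
import cupy as cp   # assumes a CUDA-capable GPU with CuPy installed

def correlate_gpu(scene, reference):
    """scene, reference: 2D numpy arrays; returns the correlation plane on the host."""
    S = cp.fft.fft2(cp.asarray(scene))
    Rf = cp.fft.fft2(cp.asarray(reference), s=scene.shape)   # zero-padded reference
    # Each FFT launches CUDA kernels executed by blocks of parallel threads
    return cp.asnumpy(cp.fft.ifft2(S * cp.conj(Rf)).real)
```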

Figure 39 Two-level decision tree learning approach. (a) First level, classification; (b) second level, identification [29].

Figure 40 GPU architecture [29].

Figure 41 CUDA architecture [29].

Ouerhani et al.'s GPU-based all-numerical implementation platform, complemented with two-level decision tree learning, provides a high-speed tool for investigating face recognition using computationally intensive composite filters. A high-speed all-numerical implementation of CFs can broaden the technological applications of CF techniques.

3.1b. All-Optical Implementation System of Correlation Based on the Vander Lugt Configuration

Recently, it was reported that an optical FT, performed via Fraunhofer diffraction, can lead to better performance than the numerical FFT. When the target image is available only optically, optical implementation becomes preferable. Figure 43 displays a recent all-optical implementation of a CF [137]. All of the required optical components, such as the laser source, SLM, lens, and reflector, are efficiently integrated in this setup.

Figure 42 Influence of the image size on the run time for four architectures [29].

This setup was employed to implement segmented composite correlation filters [52]. Figure 44 shows an optical input image and the class domains used for segmented filtering. The PCE values plotted in Fig. 45 for composite, multichannel, and segmented filters are given as functions of the number of training images. Although the PCE values decrease rapidly owing to filter saturation, the relatively higher PCE values obtained for the segmented filter suggest that it is more suitable than the other tested filters. To illustrate the intra-class distortion invariance of segmented filtering, the optical outputs in response to training and non-training faces (belonging to the authentic class) are shown in Fig. 46.

Figure 43 All-optical correlation setup [137].

Figure 44 (a) Optical input image and (b) class domains of segmented filtering. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.

Figure 45 PCE versus number of training images for fabrication of the composite filters. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.

3.1c. Optical Implementation of Face Recognition Correlator Using Holographic Memory

This optical correlation implementation system, also known as S-FARCO, was proposed and assembled by Watanabe and Kodate [27,138]. In this design, a filtering-correlation technique and a coaxial holographic optical storage system are combined to realize high-speed face recognition. In S-FARCO, a filtering-correlation optimization algorithm is employed to obtain a binary real-only matched filter that meets the requirements of short calculation time and low false acceptance and rejection rates. During database recording, the matched filters generated from different reference facial subjects are written on an optical disk by a coaxial FT holographic configuration [27] and then optimized on a PC server. Results presented in [27] show the schematic setup of the high-speed optical correlator with a holographic optical disk. The novel disk-shaped holographic memory system can store a huge amount of data in the form of matched filters thanks to its 3D storage capability (typically about 200 GB) [27]. During recognition, an input image at the same position is illuminated by the laser beam, which produces a correlation signal through the matched filter on the output plane. In addition, the parallel readout of holographic optical memory leads to a higher processing speed than that of conventional digital signal processing architectures. One can speed up the correlation process simply by rotating the optical disk at a higher speed. Table 13 summarizes the experimental conditions used in [27]. Preliminary correlation experiments showed that the system can produce a high correlation peak and low recognition error rates at a multiplexing pitch of 10 μm and a rotational speed of 300 rpm (equal error rate of 0% when the threshold is chosen optimally) [27]. The dependence of the recognition error rates (FAR, FRR, and EER) on the threshold has been studied in [27], where the shaded area represents the range of thresholds that produces an EER of 0%. Table 14 summarizes correlation speed values under the different conditions of Watanabe and Kodate's correlation experiments. Notably, a rotation speed of 2400 rpm is equivalent to a data transfer rate of more than 100 Gbps.

Table 13. Experimental Conditions for the Holographic Optical Disk Correlator [27]

Table 14. Correlation Speed of the Outermost Track of a Holographic Optical Disk [27]

Figure 46 (a) Training image and (b) corresponding output; (c) non-training image and (d) corresponding output. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.

Watanabe and Kodate also developed an online numerical correlation face recognition system (FARCO) with high operation speed (less than 10 ms per correlation on a 3 GHz CPU with 2 GB of memory) and high accuracy for low-resolution facial images (64×64 pixels). They tested their system on 30 subjects and obtained 0% FAR and 0% FRR [27]. Finally, it is worth noting that a cell phone face recognition system [27] and a video copyright monitoring system [139] based on numerical correlation have also been assembled for technological applications.

3.1d. Summary

In summary, these implementation schemes show that correlation algorithms can run very fast on either a numerical or an optical hardware platform. Powerful GPU processors provide high-efficiency numerical computing resources for the fabrication and real-time application of composite filters. Building a compact VLC architecture such as that developed in [137] is an extremely important step toward the miniaturization of optical correlators. Finally, a state-of-the-art technique, the holographic optical memory disk, provides an efficient tool for CF databases and data transfer [27].

3.2. Applications

In this section, we describe recent applications of correlation methods, including face recognition, tracking in video surveillance, video copyright monitoring, cell phone face recognition, swimmer tracking, image comparators, underwater mine recognition, and road sign recognition.

3.2a. Correlation-Based Face Recognition for Patient Monitoring Application in Home Video Surveillance

In [140], Elbouz et al. devised a patient monitoring system by combining correlation techniques with a fuzzy logic method. Unlike previous monitoring systems, two basic correlation configurations, namely a VLC and a JTC, are implemented numerically to make (high-level) decisions on the identity of patients and on changes in their 3D behavior, e.g., a sudden fall. Figure 47 shows the overall block diagram of the monitoring system [140]. In this system, the hue, saturation, and value (HSV) color space and moment methods are first employed to detect the subject's head at the entrance of a room. Figure 48 shows the experimental procedure for detecting the head area in a captured image. The target image containing the head information is sent to a numerical VLC equipped with an optimized segmented composite POF for identification. At the same time, a numerical fringe-adjusted JTC (FAJTC) is initialized to guide the tracking procedure, with reference images generated by a stereovision system. To ensure that tracking is reliable and robust against noise, a database of possible head motions is stored beforehand as the reference base of the JTC. Figure 49 shows several example images from the reference database. Moreover, a fuzzy logic method is introduced to analyze the trajectory of the subject's head motion in order to decide whether or not to send an alarm signal.
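
A minimal OpenCV sketch of the HSV head-detection stage (cf. Fig. 48) is given below; the threshold ranges and the moment-based centroid are illustrative assumptions, not the values used in [140].

```python
# Sketch of HSV-based head detection (thresholds are illustrative assumptions).
import cv2
import numpy as np

def detect_head(frame_bgr):
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Keep pixels whose H and S components fall in an assumed skin-tone range
    mask = cv2.inRange(hsv, np.array((0, 40, 60)), np.array((25, 180, 255)))
    m = cv2.moments(mask, binaryImage=True)
    if m["m00"] == 0:
        return None                       # no candidate region found
    # Centroid of the thresholded region = candidate head position (x, y)
    return int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])
```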

Figure 47 Block diagram of the monitoring system [140].

Figure 48 Experimental results for head detection: (a) captured image, (b) H component thresholding, (c) S component thresholding, (d) head-detection results, and (e) result obtained using the proposed method [140].

An experimental platform (see Fig. 50 [140]) was used to check the reliability of the monitoring system. A series of tests was performed using video recordings of the subject's body movements as input to the proposed algorithm. Figure 51 shows the user interface for real-time monitoring, in which the head motion is captured by four webcams and recorded from the front, back, or side. Four video sequences of the scene can be visualized in this interface. The pink frames mark the head position of the identified subject. The software runs on a PC (Intel Dual Core E5300, 2.6 GHz, 2 GB RAM). Logitech webcams provide the tracking video with a frame size of 352×288 pixels. The main focus of this study is fall detection, which corresponds to a rapid decrease of the subject's head position. Videos of the experimental results can be viewed online (https://www.youtube.com/watch?v=3nNFTHOsm0E). The videos demonstrate that the segmented composite POF in the VLC configuration can correctly identify the subject without raising an alarm, and that the numerical FAJTC can efficiently track the trajectory of the subject's head in all experiments, i.e., entering, rapid falling, slow bending, and sitting. The correct alarm rate of this system is close to 89%, which is promising for surveillance applications.
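
The fall criterion itself can be sketched as a simple rate test on the tracked head height; the frame rate and drop threshold below are assumptions for illustration, not the fuzzy rules of [140].

```python
# Illustrative fall test on the tracked head trajectory (thresholds assumed).
def fall_detected(head_y, fps=25.0, drop_px_per_s=150.0):
    """head_y: head row-coordinate (pixels) per frame; image y grows downward."""
    for prev, cur in zip(head_y, head_y[1:]):
        if (cur - prev) * fps > drop_px_per_s:   # head dropping too fast
            return True
    return False
```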

Figure 49 Example of the reference database used in [140].

Figure 50 Experimental platform developed in [140].

3.2b. Monitoring of Video Copyright Based on Optical Correlation

In the past two decades, unauthorized video content on the Internet has grown exponentially, and finding fast and reliable video monitoring techniques for copyright protection has become a very active research area. Using numerical correlation algorithms, Kodate's group developed a powerful online video-matching application called the fast recognition correlation system (FReC) [139] to monitor the copyright status of videos on the Internet. The FReC software can correlate the content of multiple videos with the registered content provided by a copyright holder at very high speed. By patrolling a video-sharing website, FReC can filter out a large amount of unauthorized video content and send automatic warning messages to the website's manager. In 2009, major publishers and television networks in Japan began to use FReC to protect their online video products [139]. To further speed up the video-filtering system, Kodate's group is developing a new optical-correlation-based system called FARCO 3.0 [139]. Unlike the numerical processing of the correlator in FReC, the optical correlator hardware in FARCO 3.0 uses a holographic memory disk, which permits processing correlations more than 100 times faster than the previous FReC system.

Figure 51 User interface of the fuzzy-logic and optical-correlation-based face recognition method, developed in MATLAB and run on an Intel Dual Core E5300 (2.6 GHz, 2 GB RAM) CPU [140].

3.2c. Cellular Phone Face Recognition System for Attendance Management of Students

This is another interesting online application of the numerical correlation method developed by Kodate et al. [27]. In this application, the filtering-correlation optimization algorithm mentioned in Subsection 3.1c is applied to increase the recognition accuracy and the efficiency of the matched filters. The block diagram of the cellular phone face recognition system is shown in [27]; it includes two working units (online registration and recognition). The system consists of the facial recognition software, a control server for pre- and post-processing, and a camera-equipped cellular phone. During registration, the students access the URL provided by the administrator and download a Java application. They then capture their facial image using the Java application on their cell phone and send the image, together with their ID, to the administrator. After checking that the IDs and images match on the server, the facial images are uploaded into the reference dataset. The recognition unit consists of the following steps: (1) a student takes a facial image with the Java application on their cell phone and transmits it, along with their ID, to the recognition server; (2) the face recognition server uses the optimization and correlation algorithm to analyze and recognize the input image; (3) if the input is recognized as the registered person, a one-time password is created and provided to the student; and (4) using this password, the student logs on to the remote lecture server.

This cell phone face recognition system was employed as a lecture attendance management system. It was run 12 times on 30 students over a 20-week period. The D505is and D506i (Mitsubishi Co.) were chosen as the cell phone models for this experiment. The facial images were in JPEG format (120×120 pixels). The experimental error rates of the system, given in Table 15, are very low.

Table 15. Experimental Error Rates of the Cell Phone Face Recognition System [27]

3.2d. Optimized Swimmer Tracking System by a Dynamic Fusion of Correlation and Color Histogram Techniques

In this application, a nonlinear JTC (NL-NZ-JTC) and the color histogram method are combined to perform robust swimmer tracking [141]. The combination is achieved by a dynamic fusion of the correlation plane and the color score map. Figure 52 shows the environment of this application. For the experiments, two video sequences of junior swimmers from the Brest nautical club were captured at 4K resolution (3840×2160 pixels) and 25 frames per second, each lasting 15 s. Three measures [tracking percentage, PCE, and local standard deviation (Local-STD)] were used to evaluate the tracking performance of the system [141]. The tracking percentage is calculated by measuring the Euclidean distance between the coordinates given by the tracking system and ground-truth positions obtained beforehand by manually marking the swimmer's position in the videos. The PCE measures the detection accuracy. The Local-STD estimates the accuracy and robustness of the localization by measuring the noise level locally in the correlation plane.
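
The two correlation-plane metrics, together with one possible form of the dynamic fusion, can be sketched as follows; the window size and mixing weight are illustrative assumptions, not the exact choices of [141].

```python
# Sketches of the PCE and Local-STD metrics and a simple fusion rule.
import numpy as np

def pce(corr):
    """Peak-to-correlation energy of a correlation plane."""
    e = np.abs(corr) ** 2
    return e.max() / e.sum()

def local_std(corr, half=10):
    """Noise level around the peak: std. dev. in a local window (assumed size)."""
    iy, ix = np.unravel_index(np.argmax(np.abs(corr)), corr.shape)
    win = corr[max(0, iy - half):iy + half + 1, max(0, ix - half):ix + half + 1]
    return np.abs(win).std()

def fuse(corr_plane, color_map, alpha=0.5):
    """Pixel-wise fusion of normalized maps; alpha is an assumed mixing weight."""
    c = corr_plane / (corr_plane.max() + 1e-12)
    h = color_map / (color_map.max() + 1e-12)
    fused = alpha * c + (1.0 - alpha) * h
    return np.unravel_index(np.argmax(fused), fused.shape)   # swimmer position
```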

Figures 53 and 54 show the PCE curves and the Local-STD curves of three tracking algorithms (dynamic fusion, NL-NZ-JTC, and color histogram), respectively. In Fig. 53, the higher PCEs of the fusion method demonstrate its improved detection accuracy. Figure 54 shows that the fusion method is as robust as the color-histogram-based tracking method. Table 16 summarizes the results of the tests performed on the two videos. These data lead to conclusions similar to those drawn from Figs. 53 and 54. The superiority of the dynamic fusion method in terms of tracking percentage is also clear.

Table 16. Comparison among NL-NZ-JTC, Color Histogram, and Dynamic Fusion Technique in Terms of Tracking Percentage, PCE, and Local-STD [141]

Figure 52 Shooting environment: (a) Blackmagic 4K camera used for shooting the test videos; (b) example of a frame extracted from a video used in the tests [141].

Figure 53 Comparison among the NL-NZ-JTC correlator, color histograms, and dynamic fusion in terms of PCE, tested on a video sequence (360 frames) of a backstroke competition. A high value of the PCE criterion implies greater confidence in the localization (sharp peak) [141].

3.2e. Image Comparator Based on the Optical JTC

In this application [142], an integrated optical JTC configuration serves as a comparator of dynamic video frames (Fig. 55). Unlike the conventional JTC configuration, the input plane is formed by combining the current frame with the previous one in a video sequence. When the frames from the reference and input sequences match, a correlation peak is generated on the output plane; otherwise, the correlation peak decreases rapidly and eventually vanishes. Any movement of the chosen target can be detected through the movement of this correlation peak, thanks to the shift-invariance property of linear correlation. The JTC setup consists of a digital CMOS camera, a single liquid-crystal-on-silicon SLM, and an FT lens. Interestingly, numerical nonlinear procedures can be introduced in the joint power spectrum plane, allowing its recognition performance and robustness to be tuned and/or optimized. The value of this image comparator has been demonstrated in many computer vision devices [142]. Figure 56 presents the basic configuration of this comparator.
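
The JTC principle is straightforward to reproduce numerically, as the sketch below shows: reference and input patches share one plane, and Fourier transforming the recorded joint power spectrum yields the correlation peaks; the patch sizes and positions are arbitrary choices.

```python
# Numerical sketch of a JTC: joint plane -> joint power spectrum -> correlation.
import numpy as np

rng = np.random.default_rng(4)
ref = rng.random((32, 32))
inp = ref.copy()                       # a matching input frame

plane = np.zeros((128, 128))
plane[48:80, 8:40] = ref               # reference on the left of the input plane
plane[48:80, 88:120] = inp             # input on the right

jps = np.abs(np.fft.fft2(plane)) ** 2  # joint power spectrum (recorded by a CCD)
out = np.abs(np.fft.fftshift(np.fft.fft2(jps)))
# 'out' contains a strong zero-order peak at the center and, for matching frames,
# two symmetric cross-correlation peaks offset by the 80-pixel separation.
```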

Figure 54 Comparison among the NL-NZ-JTC correlator, color histogram, and dynamic fusion in terms of Local-STD, tested on a video sequence (360 frames) of a backstroke competition. A low value of the Local-STD criterion implies greater confidence in the detection (less noisy plane) [141].

Figure 55 Image comparator based on the JTC [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

Figure 57 shows the correlation outputs of the JTC comparator when the reference and input sequences match and when they do not. One can observe a change in the brightness of the correlation signals caused by the mismatch between sequences. Based on the shift invariance of correlation, the image comparator can be used to evaluate the direction of a moving object, e.g., as a jet pilot head tracker. For this purpose, the image comparator detects the spatial translations of the camera view. The working principle of this application is illustrated in Fig. 58, where the distance and displacement of the pilot's head are characterized by the translation of the cross-correlation signals (blue dots). Figure 59 shows experimental results of this head tracker.

Figure 56 Picture showing the configuration of the image comparator studied in [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

Figure 57 Correlation output of the image comparator based on the JTC [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

Figure 58 Head tracker based on the JTC comparator [142]: the blue spots represent the cross-correlation peaks, and the red spot is the auto-correlation peak. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

3.2f. Underwater Mine Detection and Recognition

In this application, an adaptive nonlinear fringe-adjusted JTC is devised to detect and recognize underwater mines [137,143]. To optimize the recognition performance of the correlation filters, a numerical nonlinear procedure is introduced in the Fourier plane of the fringe-adjusted JTC [143]. Moreover, pre-processing is carried out to enhance image quality in the underwater environment; e.g., polarized light is used to denoise the input target image. Figure 60 shows several types of underwater mines used in the detection experiments.
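
A hedged sketch of this Fourier-plane processing is shown below: the joint power spectrum is weighted by a fringe-adjusted filter and then compressed by a power-law nonlinearity of factor k (cf. [17,143]); the constants A, B, and the value of k are assumptions.

```python
# Sketch of a nonlinear fringe-adjusted JTC (A, B, and k are assumed constants).
import numpy as np

def fringe_adjusted_jtc(joint_plane, ref, k=0.5, A=1e-3, B=1.0):
    jps = np.abs(np.fft.fft2(joint_plane)) ** 2          # joint power spectrum
    ref_spec = np.abs(np.fft.fft2(ref, s=joint_plane.shape)) ** 2
    faf = B / (A + ref_spec)                             # fringe-adjusted filter
    jps_nl = (faf * jps) ** k                            # nonlinearity of factor k
    return np.abs(np.fft.fftshift(np.fft.fft2(jps_nl)))  # correlation plane
```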

Figure 59 Experimental results of the head tracker based on the JTC comparator [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

In Fig. 61, detection results of the optimized nonlinear fringe-adjusted JTC are shown to verify its robustness and recognition ability. The horizontal axis of this figure represents the frame index of the video sequence. The vertical axis measures a specific decision criterion [143] that is related to the sharpness of the correlation peaks and takes values between 0 and 1; a large value means that an underwater mine is identified in the video frame. To automate the decision making, a threshold needs to be set properly. At the bottom of Fig. 61, several images show the real detection environment. From Fig. 61, one finds that the optimized nonlinear fringe-adjusted JTC detects most of the frames containing the underwater mine with very few false alarms. Detailed results, in terms of recognition and false-alarm probabilities, are summarized in Table 17, where three types of fringe-adjusted JTC are compared; here k is a nonlinearity factor [143]. The optimized nonlinear fringe-adjusted JTC produces the smallest false-alarm rate among the three JTCs. Compared to the nonlinear fringe-adjusted JTC, the optimized scheme also increases the recognition rate.

Table 17. Robustness Results of the Different JTCs [143]

Figure 60 Three target images showing different types of mine [137].

3.2g. Road Sign Recognition Using Phase-Only Correlation and the VIAPIX Module

Road sign recognition is not only of interest to the computer vision community, but is also fundamentally important for road asset management, navigation assistance, and driver assistance systems. Here we discuss the correlation-based road sign recognition system introduced in [28]. Basically, it combines the VIAPIX device module with a numerical computation of a POF [28]. The VIAPIX module consists of a VIAPIX acquisition device (Fig. 62) and a processing software platform, VIAPIX Exploitation. First, the street images captured by the three cameras of the acquisition unit are combined by the VIAPIX Exploitation module to produce a panoramic image covering a 180° field of view. Figure 63 shows several images captured by the three cameras. The corresponding panoramic image generated by VIAPIX is shown in Fig. 64.

Figure 61 Results of the optimized nonlinear fringe-adjusted JTC [143].

Figure 62 VIAPIX acquisition device [28].

Figure 63 Examples of images used to obtain panoramic images [28].

Then, the road signs appearing in the panoramic image are extracted by a color segmentation method based on the HSV space. Figure 65 shows an example of red segmentation using the HSV representation. The final recognition task is performed by a POF, for which the filters are fabricated offline from different sign images and stored; Fig. 66 shows the schematic diagram of this numerical POF. The PCE is chosen as the decision metric of the recognition method.
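
The recognition step can be sketched as follows: a bank of POFs is built offline from reference sign images, and the PCE of each correlation plane drives the decision; the threshold value is an assumption for illustration only.

```python
# Sketch of the POF recognition step with a PCE decision (threshold assumed).
import numpy as np

def make_pof(ref_sign):
    R = np.fft.fft2(ref_sign)
    return np.conj(R) / (np.abs(R) + 1e-12)    # phase-only filter

def recognize(segmented_patch, pof_bank, threshold=0.05):
    """pof_bank: dict mapping sign names to POFs of the same size as the patch."""
    best, best_pce = None, 0.0
    X = np.fft.fft2(segmented_patch)
    for name, H in pof_bank.items():
        corr = np.fft.ifft2(X * H)
        e = np.abs(corr) ** 2
        score = e.max() / e.sum()              # PCE of the correlation plane
        if score > best_pce:
            best, best_pce = name, score
    return best if best_pce >= threshold else None
```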

Figure 64 Panoramic image obtained using the images from [28].

Figure 65 Red color segmentation using HSV [28].

Figure 67 shows an experimental panoramic image and the corresponding recognition results [28]. In this experiment, the VIAPIX module is installed on the roof of a vehicle traveling at a speed between 30 and 50 km/h, with an acquisition frequency of one image per meter. From Fig. 67, one can see that the POF correctly identifies the road signs appearing in the panoramic images. Overall, this recognition system identified 61 of the 72 road signs contained in a database of 1500 images, i.e., a true recognition rate of 84.7% [28].

Figure 66 Schematic diagram of the POF used in [28].

Figure 67 Panoramic image and identification results of the POF [28].

4. Concluding Remarks and Outlook

From a general perspective, the examples described in this review demonstrate that correlation methods are promising platforms for face and pattern recognition applications. In practice, these correlation-based systems exhibit reliable discrimination and/or real-time tracking performance with very low error rates and high processing speed. However, few applications based on correlation methods have been reported to date. We believe that this is partly because researchers have spent much effort designing correlation schemes that can be optically implemented. However, as discussed above, the optical implementation of correlation schemes is generally complicated and time consuming, and has a low performance-to-cost ratio. To achieve high-efficiency optical recording of a complex-valued filter, a holographic configuration is often introduced, but this renders the implementation more complicated to operate and vulnerable to attackers. Additionally, expensive optoelectronic devices (SLMs and CCDs) must be employed to perform flexible numerical algorithm optimization with a PC, because optical signals can be optimally processed only in the Fourier plane. Consequently, the low resolution, limited modulation ability, and low processing speed of these optoelectronic devices constitute the main limitations of optical correlators. Furthermore, progress in this field has been hampered by the diffraction effects and electronic noise arising from multiple optoelectronic conversions, which deteriorate the final correlation performance. Although researchers have proposed a number of optical correlators in the past three decades, the technical problems mentioned earlier significantly reduce the value of these devices. Advancements similar to the illustrative examples considered here are expected, but all-optical correlators have not yet been realized.

In contrast, numerical implementation can benefit the development and application of correlation algorithms for many reasons. First, a numerical implementation needs only a PC with a high-efficiency processor, thus avoiding both the construction of a complicated optical platform and the use of optoelectronic conversion. Second, many numerical methods can easily be applied to optimize the correlation performance in the input, Fourier, and output planes, e.g., numerical pre-processing of target images for edge detection and correlation post-processing for decision optimization. Third, more reliable decision making on an object's identity, position, and behavior can be obtained by combining correlation methods with numerical pattern recognition approaches, such as fuzzy logic techniques.

Thanks to the rapid development of digital processing techniques, we believe that more and more exciting practical applications will be developed in the field of correlation techniques for face and pattern recognition tasks. The high efficiency and flexibility of digital processing methods can greatly enrich the design of correlation techniques. At least four research directions for the performance optimization of correlation methods are of fundamental importance: (1) extracting the feature components representing the training or test images by using image representations (e.g., ICA, PCA), artificial neural network techniques (such as "deep learning" represented by CNNs), and iterative processing methods before the correlation step; (2) separating or suppressing components that may influence correct decisions in the correlation plane after the correlation step; (3) designing new decision criteria that ensure high recognition rates and robustness against distortion and noisy interference; and (4) combining correlation algorithms with numerical recognition methods. We hope to return to some of these issues in the near future.

Appendix A: Closed-Form Expressions of Composite Correlation Filters

1. Notation

$h$: column vector form of the correlation filter in the frequency domain.

$H$: matrix form of the correlation filter in the frequency domain.

$A = [x_1, x_2, \ldots, x_L]$: $d \times L$ matrix whose columns are the L training image vectors $x_i$ ($i = 1, 2, \ldots, L$; column vectors in the frequency domain; d is the total number of pixels per image).

$c$: column vector of peak constraints; $c(i)$ is the specified peak value for training image $x_i$.

$\tilde{X}_i$: $d \times d$ diagonal matrix form of the ith training image in the frequency domain, with the vector $x_i$ along its diagonal.

$\tilde{D} = \frac{1}{L}\sum_{i=1}^{L}(\tilde{X}_i \tilde{X}_i^*)$: $d \times d$ diagonal matrix with the average power spectrum along its diagonal; * denotes complex conjugation.

$\tilde{N}$: $d \times d$ diagonal matrix with the noise power spectrum along its diagonal. The form of $\tilde{N}$ depends on the noise type; e.g., for white noise, $\tilde{N}$ is the identity matrix.

$\tilde{T} = \lambda\tilde{D} + \sqrt{1-\lambda^2}\,\tilde{N}$ ($0 \le \lambda \le 1$): trade-off matrix used in optimal trade-off filters.

$g_i$: vector representation of the expected 2D correlation output for the ith training image.

$\tilde{G}_i$: $d \times d$ diagonal matrix with $g_i$ along its diagonal.

$m = \frac{1}{L}\sum_{i=1}^{L} x_i$: mean vector of the entire training set in the frequency domain.

$\tilde{M}$: $d \times d$ diagonal matrix with the elements of $m$ along its diagonal.

$\alpha, \beta, \gamma, \lambda$: non-negative optimization parameters.

2. Closed-Form Expression and Description

Composite filters and their closed-form expressions:
1. Equal-correlation-peak synthetic discrimination function (ECP-SDF) [30]: $h_{\text{ECP-SDF}} = A(A^+A)^{-1}c$, where $^+$ and $^{-1}$ denote the conjugate transpose and the inverse, respectively.
2. Minimum average correlation energy (MACE) filter [31]: $h_{\text{MACE}} = \tilde{D}^{-1}A(A^+\tilde{D}^{-1}A)^{-1}c$.
3. Minimum variance SDF (MVSDF) [32]: $h_{\text{MVSDF}} = \tilde{N}^{-1}A(A^+\tilde{N}^{-1}A)^{-1}c$.
4. Optimal trade-off SDF (OTSDF) [33]: $h_{\text{OTSDF}} = \tilde{T}^{-1}A(A^+\tilde{T}^{-1}A)^{-1}c$.
5. Minimum noise and correlation energy (MINACE) filter [68]: uses the diagonal envelope matrix $\max(\alpha\tilde{N}, \sqrt{1-\alpha^2}\,\tilde{D}_1, \ldots, \sqrt{1-\alpha^2}\,\tilde{D}_L)$ in place of $\tilde{T}$ in the OTSDF form.
6. Maximum average correlation height (MACH) filter [35]: $h_{\text{MACH}} = \tilde{S}^{-1}m$, where $\tilde{S} = \frac{1}{L}\sum_{i=1}^{L}(\tilde{X}_i - \tilde{M})(\tilde{X}_i - \tilde{M})^*$.
7. Unconstrained MACE (UMACE) filter [35]: $h_{\text{UMACE}} = \tilde{D}^{-1}m$.
8. Unconstrained OTSDF (UOTSDF) [36]: $h_{\text{UOTSDF}} = (\alpha\tilde{N} + \beta\tilde{D})^{-1}m$.
9. Optimal trade-off MACH (OTMACH) [37,38]: $h_{\text{OTMACH}} = (\alpha\tilde{N} + \beta\tilde{D} + \gamma\tilde{S})^{-1}m$.
10. Extended MACH (EMACH) [39]: the eigenvector of $\{\alpha I + (1-\alpha^2)^{1/2}\tilde{S}^\beta\}^{-1}\tilde{C}^\beta$, where $\tilde{C}$ is a covariance matrix and $\tilde{S}$ is as in MACH; α controls the relative importance of the ONV and ASM, and β controls the dependence of the filter on the average training image.
11. Eigen-extended MACH (EEMACH) [40]: keeps only the dominant eigenvector of $\{\alpha I + (1-\alpha^2)^{1/2}\tilde{S}^\beta\}^{-1}\tilde{C}^\beta$.
12. Maximum margin CF (MMCF) [61]: $h_{\text{MMCF}} = \tilde{T}^{-1}\frac{1}{L}\left(\sum_{i=1}^{L}\tilde{X}_i g_i\right) + \tilde{T}^{-1}A\tilde{Y}a$, where $\tilde{Y}$ is a diagonal matrix with class labels (1 for the authentic class, 0 for imposters) along its diagonal; the vector $a$ is obtained from the sequential minimal optimization technique.
13. Average of synthetic exact filters (ASEF) [55]: $h_{\text{ASEF}} = \frac{1}{L}\sum_{i=1}^{L}\tilde{X}_i^{-1}g_i$.
14. Minimum output sum of squared error (MOSSE) filter [54]: $h_{\text{MOSSE}} = \tilde{D}^{-1}\frac{1}{L}\left(\sum_{i=1}^{L}\tilde{X}_i g_i\right)$.
15. Zero-aliasing MACE (ZAMACE) [86]: $h_{\text{ZAMACE}} = \tilde{D}^{-1}B(B^+\tilde{D}^{-1}B)^{-1}k$, where B is obtained from the A matrix so as to satisfy the zero-aliasing constraints, and k is a column vector extended from the constraint vector c by appending zero elements.
16. Zero-aliasing OTSDF (ZAOTSDF) [86]: $h_{\text{ZAOTSDF}} = \tilde{T}^{-1}B(B^+\tilde{T}^{-1}B)^{-1}k$, where B and k are as in ZAMACE (item 15).
17. Zero-aliasing MOSSE (ZAMOSSE) [86]: $h_{\text{ZAMOSSE}} = [I - \tilde{D}^{-1}Z(Z^+\tilde{D}^{-1}Z)^{-1}Z^+]\,\tilde{D}^{-1}\frac{1}{L}\left(\sum_{i=1}^{L}\tilde{X}_i g_i\right)$, where Z is a matrix satisfying the zero-aliasing constraints.
18. Zero-aliasing MMCF (ZAMMCF) [86]: $h_{\text{ZAMMCF}} = \tilde{T}^{-1}\frac{1}{L}\left(\sum_{i=1}^{L}\tilde{X}_i g_i\right) + \tilde{T}^{-1}A\tilde{Y}a - \tilde{T}^{-1}Z\omega$, where Z is as in ZAMOSSE, $\omega = (Z^+\tilde{T}^{-1}Z)^{-1}Z^+[\tilde{T}^{-1}\frac{1}{L}(\sum_{i=1}^{L}\tilde{X}_i g_i) + \tilde{T}^{-1}A\tilde{Y}a]$, and $\tilde{Y}$ and $a$ are as in MMCF.
19. Adaptive and robust CF (ARCF) [53]: $h_{\text{ARCF}} = (\tilde{D} + \varepsilon I)^{-1}A[A^+(\tilde{D} + \varepsilon I)^{-1}A]^{-1}c$; $\varepsilon = 0$ yields the MACE filter and $\varepsilon = \infty$ yields the ECP-SDF.
20. Correntropy MACE (CMACE) [41]: $h_{\text{CMACE}} = V^{-1}A[A^+V^{-1}A]^{-1}c$, where $V = \frac{1}{L}\sum_{i=1}^{L}V_i$ and $V_i$ is the correntropy matrix.
21. Generalized MACH (GMACH) [47]: $h_{\text{GMACH}} = (\alpha\tilde{N} + \beta\tilde{D} + \gamma\tilde{S} + \delta\Omega)^{-1}m$, where Ω is a $d^2 \times L$ matrix of rank L.
22. Action MACH [42]: a 3D (space-time) extension of the conventional MACH filter.
23. Wavelet-modified MACH [43]: $h_{\text{WMMACH}} = (\tilde{W}^+\tilde{W})\tilde{S}^{-1}m$, where $\tilde{W}$ is a diagonal matrix with a vectorized wavelet filter (e.g., a Mexican hat filter) along its diagonal.
24. Polynomial CF (PCF) [50,51]: $h_{\text{PCF}} = S_c^{-1}m_c$, where $S_c$ is a block matrix of diagonal matrices $S_c^{pq} = \frac{1}{L}\sum_{i=1}^{L}(\tilde{X}_i^p - \tilde{M}^p)(\tilde{X}_i^q - \tilde{M}^q)^*$ and $m_c$ is a block vector whose kth element is the vector $m_c(k) = m^k$. The resulting $h_{\text{PCF}}$ is also a block vector consisting of several sub-filters (in vector form). In addition, the method is not restricted to power-law nonlinearities.
25. Quadratic CF (QCF) [48,49]: the QCFs (in the spatial domain) are the eigenvectors of $R_1 - R_2$, where $R_k = E\{x_i^k x_i^{kT}\}$ is the correlation matrix of the training images of the kth class (k = 1 or 2; class 1 is the authentic class and class 2 the imposter class), $x_i^k$ is the vectorized ith training image of the kth class in the spatial domain, and T denotes transposition.
26. Distance-classifier CF (DCCF) [56]: the optimal solution is obtained from $S_B^{-1}S_A$, where $S_A = \frac{1}{K}\sum_{k=1}^{K}(m_k - m_{\text{global}})(m_k - m_{\text{global}})^+$, $S_B = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{L}\sum_{i=1}^{L}(\tilde{X}_i^k - \tilde{M}_k)(\tilde{X}_i^k - \tilde{M}_k)^*$, $m_k$ is the mean vector of the kth class training images in the frequency domain, $m_{\text{global}}$ is the global mean vector of all training images in the frequency domain, $\tilde{X}_i^k$ is a diagonal matrix with $x_i^k$ (the ith training vector of the kth class in the frequency domain) along its diagonal, and $\tilde{M}_k$ is a diagonal matrix with $m_k$ along its diagonal.
27. Polynomial DCCF (PDCCF): identical in form to the DCCF, but the training images are pre-processed with nonlinear operations as in the polynomial CF (PCF).
28. LBP-based UMACE [46]: identical in form to the UMACE, but the training images are pre-processed by an LBP operation.
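
To show how such closed-form expressions translate into code, the following numpy sketch implements items 2 and 4 of the table for the white-noise case; it is a didactic rendering under the notation above, not an optimized implementation.

```python
# Didactic numpy rendering of the MACE and OTSDF closed forms (white noise, N~ = I).
import numpy as np

def mace_filter(images):
    """images: list of equally sized 2D training images; returns h (2D, frequency domain)."""
    L = len(images)
    shape = images[0].shape
    X = np.stack([np.fft.fft2(im).ravel() for im in images], axis=1)  # A (d x L)
    D = (np.abs(X) ** 2).mean(axis=1)          # diagonal of D~ (average power spectrum)
    c = np.ones(L)                             # unit peak constraints
    Dinv_A = X / D[:, None]                    # D~^{-1} A
    h = Dinv_A @ np.linalg.solve(X.conj().T @ Dinv_A, c)   # D~^{-1}A(A^+D~^{-1}A)^{-1}c
    return h.reshape(shape)

def otsdf_filter(images, lam=0.9):
    """OTSDF: replaces D~ by T~ = lam*D~ + sqrt(1 - lam^2)*N~, with N~ = I here."""
    L = len(images)
    shape = images[0].shape
    X = np.stack([np.fft.fft2(im).ravel() for im in images], axis=1)
    T = lam * (np.abs(X) ** 2).mean(axis=1) + np.sqrt(1.0 - lam ** 2)
    c = np.ones(L)
    Tinv_A = X / T[:, None]
    h = Tinv_A @ np.linalg.solve(X.conj().T @ Tinv_A, c)
    return h.reshape(shape)
```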

Funding

China Scholarship Council (CSC) (201508440010); National Natural Science Foundation of China (NSFC) (61675050); Lab-STICC (UMR CNRS 6285)

References

1. T. Kanade, “Picture processing system by computer complex and recognition of human face,” Ph.D. thesis (Department of Information Science, Kyoto University, 1973).

2. L. Sirovich and M. Kirby, “Low-dimensional procedure for the characterization of human face,” J. Opt. Soc. Am. 4, 519–524 (1987). [CrossRef]  

3. M. Savvides, B. Kumar, and P. Khosla, “Eigenphases vs. eigenfaces,” in Proceedings of the 17th International Conference on Pattern Recognition (ICPR) (IEEE, 2004), Vol. 3.

4. X. Lu, “Image analysis for face recognition,” Department of Computer Science and Engineering; Michigan State University, East Lansing, Michigan 48824 (Private Lecture Notes, 2003).

5. A. Datta, M. Datta, and P. Banerjee, Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015).

6. B. Kumar, “Tutorial survey of composite filter designs for optical correlators,” Appl. Opt. 31, 4773–4801 (1992). [CrossRef]  

7. M. Kaneko and O. Hasegawa, “Processing of face images and its applications,” IEICE Trans. Inf. Syst. E82-D, 589–600 (1999).

8. D. North, “An analysis of the factors which determine signal/noise discriminations in pulsed carrier systems,” Proc. IEEE 51, 1016–1027 (1963). [CrossRef]  

9. A. VanderLugt, “Signal detection by complex spatial filtering,” IEEE Trans. Inf. Theory 10, 139–145 (1964). [CrossRef]  

10. B. Kumar and L. Hassebrook, “Performance measures for correlation filters,” Appl. Opt. 29, 2997–3006 (1990). [CrossRef]  

11. B. Kumar, M. Savvides, C. Xie, K. Venkataramani, J. Thornton, and A. Mahalanobis, “Biometric verification with composite filters,” Appl. Opt. 43, 391–402 (2004). [CrossRef]  

12. A. Mansfield and J. Wayman, “Best practices in testing and reporting performance of biometric devices,” version 2.01 (Reproduced by Permission of the Controller of HMSO, 2002).

13. A. Quaglia and C. M. Epifano, eds. Face Recognition: Methods, Applications and Technology (Nova Science, 2012).

14. D. Wu, X. Zhou, B. Yao, R. Li, Y. Yang, T. Peng, M. Lei, D. Dan, and T. Ye, “Fast frame scanning camera system for light-sheet microscopy,” Appl. Opt. 54, 8632–8636 (2015). [CrossRef]  

15. F. Wang, I. Toselli, and O. Korotkova, “Two spatial light modulator system for laboratory simulation of random beam propagation in random media,” Appl. Opt. 55, 1112–1117 (2016). [CrossRef]  

16. C. Weaver and J. Goodman, “A technique for optically convolving two functions,” Appl. Opt. 5, 1248–1249 (1966). [CrossRef]  

17. M. Alam and M. Karim, “Fringe-adjusted joint transform correlation,” Appl. Opt. 32, 4344–4350 (1993). [CrossRef]  

18. L. Guibert, G. Keryer, A. Servel, M. Attia, H. Mackenzie, P. Pellat-Finet, and J. L. de Bougrenet de la Tocnaye, “On-board optical joint transform correlator for real-time road sign recognition,” Opt. Eng. 34, 101–109 (1995).

19. J. Horner and P. Gianino, “Phase-only matched filtering,” Appl. Opt. 23, 812–816 (1984). [CrossRef]  

20. D. Psaltis, E. Paek, and S. Venkatesh, “Optical image correlation with binary spatial light modulator,” Opt. Eng. 23, 698–704 (1984).

21. D. Roberge and Y. Sheng, “Optical composite wavelet-matched filters,” Opt. Eng. 33, 2290–2295 (1994). [CrossRef]  

22. X. Lu, A. Katz, E. Kanterakis, and N. Caviris, “Joint transform correlator that uses wavelet transforms,” Opt. Lett. 17, 1700–1702 (1992). [CrossRef]  

23. Q. Wang and S. Liu, “Morphological fringe-adjusted joint transform correlation,” Opt. Eng. 45, 087002 (2006). [CrossRef]  

24. F. Lei, M. Iton, and T. Yatagai, “Adaptive binary joint transform correlator for image recognition,” Appl. Opt. 41, 7416–7421 (2002). [CrossRef]  

25. Y. Hsu and H. Arsenault, “Optical character recognition using circular harmonic expansion,” Appl. Opt. 21, 4016–4019 (1982). [CrossRef]  

26. D. Mendlovic, E. Marom, and N. Konforti, “Shift- and scale-invariant pattern recognition using Mellin radial harmonics,” Opt. Commun. 67, 172–176 (1988). [CrossRef]  

27. E. Watanabe and K. Kodate, High Speed Holographic Optical Correlator for Face Recognition, State of the Art in Face Recognition, J. Ponce and A. Karahoca, eds. (InTech, 2009).

28. Y. Ouerhani, M. Desthieux, and A. Alfalou, “Road sign recognition using Viapix module and correlation,” Proc. SPIE 9477, 94770H (2015). [CrossRef]  

29. Y. Ouerhani, M. Jridi, A. Alfalou, C. Brosseau, P. Katz, and M. S. Alam, “Optimized pre-processing input plane GPU implementation of an optical face recognition technique using a segmented phase only composite filter,” Opt. Commun. 289, 33–44 (2013). [CrossRef]  

30. C. Hester and D. Casasent, “Multivariant technique for multiclass pattern recognition,” Appl. Opt. 19, 1758–1761 (1980). [CrossRef]  

31. A. Mahalanobis, B. Kumar, and D. Casasent, “Minimum average correlation energy filters,” Appl. Opt. 26, 3633–3640 (1987). [CrossRef]  

32. B. Kumar, “Minimum variance synthetic discriminant functions,” J. Opt. Soc. Am. A 3, 1579–1584 (1986). [CrossRef]  

33. P. Refregier, “Filter design for optical pattern recognition: multi-criteria optimization approach,” Opt. Lett. 15, 854–856 (1990). [CrossRef]  

34. G. Ravichandran and D. Casasent, “Minimum noise and correlation energy optical CF,” Appl. Opt. 31, 1823–1833 (1992). [CrossRef]  

35. A. Mahalanobis, B. Kumar, S. R. F. Sims, and J. F. Epperson, “Unconstrained CFs,” Appl. Opt. 33, 3751–3759 (1994). [CrossRef]  

36. B. Kumar, D. Carlson, and A. Mahalanobis, “Optimal trade-off synthetic discriminant function filters for arbitrary devices,” Opt. Lett. 19, 1556–1558 (1994). [CrossRef]  

37. P. Refregier, “Optimal trade-off filters for noise robustness, sharpness of the correlation peak, and Horner efficiency,” Opt. Lett. 16, 829–831 (1991). [CrossRef]  

38. O. Johnson, W. Edens, T. Lu, and T. Chao, “Optimization of OT-MACH filter generation for target recognition,” Proc. SPIE 7340, 734008 (2009). [CrossRef]  

39. M. Alkanhal, B. Kumar, and A. Mahalanobis, “Improving the false alarm capabilities of the maximum average correlation height CF,” Opt. Eng. 39, 1133–1141 (2000). [CrossRef]  

40. B. Kumar and M. Alkanhal, “Eigen-extended maximum average correlation height filters for automatic target recognition,” Proc. SPIE 4379, 424–431 (2001). [CrossRef]  

41. K. Jeong, W. Liu, S. Han, E. Hasanbelliu, and J. Principe, “The correntropy MACE filter,” Pattern Recogn. 42, 871–885 (2009). [CrossRef]  

42. M. Rodriguez, J. Ahmed, and M. Shah, “Action MACH a spatio-temporal maximum average correlation height filter for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2008).

43. S. Goyal, N. Nishchal, V. Beri, and A. Gupta, “Wavelet-modified maximum average correlation height filter for rotation invariance that uses chirp encoding in a hybrid digital-optical correlator,” Appl. Opt. 45, 4850–4857 (2006). [CrossRef]  

44. P. K. Banerjee and A. K. Datta, Techniques of Frequency Domain Correlation for Face Recognition and its Photonic Implementation, A. Quaglia and C. M. Epifano, eds. (NOVA, 2012), Chap. 9, pp. 165–186.

45. M. Savvides and B. V. K. Vijaya Kumar, “Quad phase minimum average correlation energy filters for reduced memory illumination tolerant face authentication,” in Proceedings of the 4th International Conference on Audio and Visual Biometrics based Person Authentication (AVBPA), Surrey, UK, 2003.

46. M. Maddah and S. Mozaffari, “Face verification using local binary pattern-unconstrained minimum average correlation energy CFs,” J. Opt. Soc. Am. A 29, 1717–1721 (2012). [CrossRef]  

47. A. Nevel and A. Mahalanobis, “Comparative study of maximum average correlation height filter variants using ladar imagery,” Opt. Eng. 42, 541–550 (2003). [CrossRef]  

48. R. Muise, A. Mahalanobis, R. Mohapatra, X. Li, D. Han, and W. Mikhael, “Constrained quadratic CFs for target detection,” Appl. Opt. 43, 304–314 (2004). [CrossRef]  

49. A. Mahalanobis, R. Muise, S. Stanfill, and A. Nevel, “Design and application of quadratic CFs for target detection,” IEEE Trans. Aerosp. Electron. Syst. 40, 837–850 (2004).

50. K. Al-Mashouq, B. Kumar, and M. Alkanhal, “Analysis of signal-to-noise ratio of polynomial CFs,” Proc. SPIE 3715, 407–413 (1999). [CrossRef]  

51. A. Mahalanobis and B. Kumar, “Polynomial filters for higher-order and multi-input information fusion,” in 11th Euro-American Opto-Electronic Information Processing Workshop (1999), pp. 221–231.

52. A. Alfalou, G. Keryer, and J. L. de Bougrenet de la Tocnaye, “Optical implementation of segmented composite filtering,” Appl. Opt. 38, 6129–6135 (1999). [CrossRef]  

53. H. Lai, V. Ramanathan, and H. Wechsler, “Reliable face recognition using adaptive and robust CFs,” Comput. Vis. Image Underst. 111, 329–350 (2008). [CrossRef]  

54. D. Bolme, J. Beveridge, B. Draper, and Y. Lui, “Visual object tracking using adaptive correlation filters,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2010), pp. 2544–2550.

55. D. Bolme, B. Draper, and J. Beveridge, “Average of synthetic exact filters,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2009), pp. 2105–2112.

56. A. Mahalanobis, B. Kumar, and S. R. F. Sims, “Distance classifier CFs for distortion tolerance, discrimination and clutter rejection,” Proc. SPIE 2026, 325–335 (1993). [CrossRef]  

57. R. Juday, “Optimal realizable filters and the minimum Euclidean distance principle,” Appl. Opt. 32, 5100–5111 (1993). [CrossRef]  

58. M. Alkanhal and B. Kumar, “Polynomial distance classifier CF for pattern recognition,” Appl. Opt. 42, 4688–4708 (2003). [CrossRef]  

59. A. Alfalou and C. Brosseau, “Robust and discriminating method for face recognition based on correlation technique and independent component analysis model,” Opt. Lett. 36, 645–647 (2011). [CrossRef]  

60. J. Thornton, M. Savvides, and B. Kumar, “Linear shift-invariant maximum margin SVM correlation filter,” in Proceedings of the Intelligent Sensors, Sensor Networks and Information Processing Conference (IEEE, 2004), pp. 183–188.

61. A. Rodriguez, V. Boddeti, B. Kumar, and A. Mahalanobis, “Maximum margin CF: a new approach for localization and classification,” IEEE Trans. Image Process. 22, 631–643 (2013). [CrossRef]  

62. R. W. Świniarski and A. Skowron, “Transactions on rough sets I,” in Independent Component Analysis, Principal Component Analysis and Rough Sets in Face Recognition (Springer, 2004), pp. 392–404.

63. S. Wijaya, M. Savvides, and B. Kumar, “Illumination-tolerant face verification of low-bit-rate JPEG2000 wavelet images with advanced CFs for handheld devices,” Appl. Opt. 44, 655–665 (2005). [CrossRef]  

64. M. Savvides and B. Kumar, “Illumination normalization using logarithm transforms for face authentication,” Lect. Notes Comput. Sci. 2688, 549–556 (2003). [CrossRef]  

65. M. Savvides, B. Kumar, and P. Khosla, “Face verification using correlation filters,” in 3rd IEEE Automatic Identification Advanced Technologies (2002), pp. 56–61.

66. M. Savvides and B. Kumar, “Efficient design of advanced CFs for robust distortion-tolerant face recognition,” in Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance (2003).

67. M. Savvides, B. Kumar, and P. Khosla, “Corefaces - robust shift invariant PCA based CF for illumination tolerant face recognition,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2004).

68. R. Patnaik and D. Casasent, “Illumination invariant face recognition and impostor rejection using different MINACE filter algorithms,” Proc. SPIE 5816, 94–104 (2005). [CrossRef]  

69. M. D. Levine and Y. Yu, “Face recognition subject to variations in facial expression, illumination and pose using CFs,” Comput. Vis. Image Underst. 104, 1–15 (2006). [CrossRef]  

70. I. Jolliffe, Principal Component Analysis (Wiley, 2002).

71. B. Boser, I. Guyon, and V. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992), pp. 144–152.

72. C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn. 20, 273–297 (1995).

73. C. Xie and B. Kumar, “Face class code based feature extraction for face recognition,” in Fourth IEEE Workshop on Automatic Identification Advanced Technologies (2005).

74. C. Xie, M. Savvides, and B. Kumar, “Kernel CF based redundant class-dependence feature analysis on FRGC2.0 data,” Lect. Notes Comput. Sci. 3723, 32–43 (2005). [CrossRef]  

75. C. Xie and B. Kumar, “Comparison of kernel class-dependence feature analysis (KCFA) with kernel discriminant analysis (KDA) for face recognition,” in First IEEE International Conference on Biometrics: Theory, Applications, and Systems (BTAS) (IEEE, 2007).

76. R. Abiantun, M. Savvides, and B. Kumar, “Generalized low dimensional feature subspace for robust face recognition on unseen datasets using kernel correlation feature analysis,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2007).

77. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant structure of tracking-by-detection with kernels,” in Computer Vision–European Conference on Computer Vision (ECCV) (Springer, 2012), pp. 702–715.

78. Y. Li and J. Zhu, “A scale adaptive kernel CF tracker with feature integration,” in Computer Vision–European Conference on Computer Vision (ECCV) Workshops (Springer, 2014), pp. 254–265.

79. J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell. 37, 583–596 (2015). [CrossRef]  

80. Y. Yan and Y. Zhang, “Tensor CF based class-dependence feature analysis for face recognition,” Neurocomputing 71, 3434–3438 (2008). [CrossRef]  

81. A. Alfalou, “Implementation of optical multichannel correlation: application to pattern recognition,” Ph.D. thesis (Université de Rennes, 1999).

82. P. Katz, A. Alfalou, C. Brosseau, and M. S. Alam, “Correlation and independent component analysis based approaches for biometric recognition,” in Face Recognition: Methods, Applications, and Technology, A. Quaglia and C. M. Epifano, eds. (Nova Science, 2011).

83. C. Xie, M. Savvides, and B. Kumar, “Quaternion CF for face recognition in wavelet domain,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP) (IEEE, 2005).

84. D. Rizo-Rodríguez, H. Méndez-Vázquez, and E. García-Reyes, “Illumination invariant face recognition using quaternion-based CFs,” J. Math. Imaging Vis. 45, 164–175 (2013). [CrossRef]  

85. J. Fernandez and B. Kumar, “Zero-aliasing CFs,” in International Symposium on Image and Signal Processing and Analysis (2013), pp. 101–106.

86. J. Fernandez, V. Boddeti, A. Rodriguez, and B. Kumar, “Zero-aliasing CFs for object recognition,” IEEE Trans. Pattern Anal. Mach. Intell. 37, 1702–1715 (2015). [CrossRef]  

87. A. Alfalou, C. Brosseau, P. Katz, and M. S. Alam, “Decision optimization for face recognition based on an alternate correlation plane quantification metric,” Opt. Lett. 37, 1562–1564 (2012). [CrossRef]  

88. H. Cardot, F. Ferraty, and P. Sarda, “Linear functional model,” Stat. Probab. Lett. 45, 11–22 (1999). [CrossRef]  

89. M. Wall, A. Rechtsteiner, and L. Rocha, “Singular value decomposition and principal component analysis,” in A Practical Approach to Microarray Data Analysis, D. P. Berrar, W. Dubitzky, and M. Granzow, eds. (Springer, 2003), pp. 91–109.

90. M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in Proceedings of the British Machine Vision Conference (BMVC) (2014).

91. Z. Chen, Z. Hong, and D. Tao, “An experimental survey on correlation filter-based tracking,” arXiv:1509.05520 (2015).

92. E. Zhou, Z. Cao, and Q. Yin, “Naive-deep face recognition: touching the limit of LFW benchmark or not?” arXiv:1501.04690 (2015).

93. Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are sparse, selective, and robust,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 2892–2900.

94. Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Advances in Neural Information Processing Systems (2014), pp. 1988–1996.

95. M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Learning and transferring mid-level image representations using convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014).

96. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

97. K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 1026–1034.

98. S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167 (2015).

99. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: a database for studying face recognition in unconstrained environments,” Technical Report 07-49 (University of Massachusetts, 2007).

100. M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, “Convolutional features for correlation filter based visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66.

101. C. Ma, Y. Xu, B. Ni, and X. Yang, “When correlation filters meet convolutional neural networks for visual tracking,” IEEE Signal Process. Lett. 23, 1454–1458 (2016). [CrossRef]  

102. K. W. Bowyer, K. Chang, and P. Flynn, “A survey of approaches and challenges in 3D and multi-modal 3D + 2D face recognition,” Comput. Vis. Image Underst. 101, 1–15 (2006). [CrossRef]  

103. V. Blanz and T. Vetter, “Face recognition based on fitting a 3D morphable model,” IEEE Trans. Pattern Anal. Mach. Intell. 25, 1063–1074 (2003). [CrossRef]  

104. J. Heo, M. Savvides, and B. Kumar, “Performance evaluation of face recognition using visual and thermal imagery with advanced correlation filters,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) Workshops (IEEE, 2005).

105. M. K. Bhowmik, K. Saha, S. Majumder, G. Majumder, A. Saha, A. Nath Sarma, D. Bhattacharjee, D. K. Basu, and M. Nasipuri, “Thermal infrared face recognition—a biometric identification technique for robust security system,” in Reviews, Refinements and New Ideas in Face Recognition (2011), pp. 113–138.

106. A. Seal, S. Ganguly, D. Bhattacharjee, M. Nasipuri, and D. K. Basu, “Automated thermal face recognition based on minutiae extraction,” Int. J. Comput. Intell. Stud. 2, 133–156 (2013).

107. X. Liu, T. Chen, and B. Kumar, “Face authentication for multiple subjects using eigenflow,” Pattern Recogn. 36, 313–328 (2003). [CrossRef]  

108. A. Adam, E. Rivlin, and I. Shimshoni, “Robust fragments-based tracking using the integral histogram,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (IEEE, 2006), Vol. 1.

109. D. Ross, J. Lim, R. Lin, and M. Yang, “Incremental learning for robust visual tracking,” Int. J. Comput. Vis. 77, 125–141 (2008). [CrossRef]  

110. X. Zhang, W. Hu, S. Maybank, and X. Li, “Graph based discriminative learning for robust and efficient object tracking,” in IEEE 11th International Conference on Computer Vision (ICCV) (IEEE, 2007).

111. B. Kumar, A. Mahalanobis, S. Song, S. Sims, and J. Epperson, “Minimum squared error synthetic discriminant functions,” Opt. Eng. 31, 915–922 (1992). [CrossRef]  

112. B. Babenko, M.-H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in Conference on Computer Vision and Pattern Recognition (CVPR) (2009).

113. A. Jepson, D. Fleet, and T. El-Maraghi, “Robust online appearance models for visual tracking,” IEEE Trans. Pattern Anal. Mach. Intell. 25, 1296–1311 (2003). [CrossRef]  

114. N. C. Oza, “Online ensemble learning,” Ph.D. thesis (University of California, 2001).

115. M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. van de Weijer, “Adaptive color attributes for real-time visual tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1090–1097.

116. M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, “Coloring channel representations for visual tracking,” in Scandinavian Conference on Image Analysis (Springer, 2015), pp. 117–129.

117. M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recognition and segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3828–3836.

118. M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4310–4318.

119. K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: delving deep into convolutional nets,” arXiv:1405.3531 (2014).

120. Y. Wu, J. Lim, and M. H. Yang, “Online object tracking: a benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 2411–2418.

121. J. Gao, H. Ling, W. Hu, and J. Xing, “Transfer learning based visual tracking with Gaussian processes regression,” in European Conference on Computer Vision (Springer, 2014), pp. 188–203.

122. J. Zhang, S. Ma, and S. Sclaroff, “MEEM: robust tracking via multiple experts using entropy minimization,” in European Conference on Computer Vision (Springer, 2014), pp. 188–203.

123. S. Hare, A. Saffari, and P. H. S. Torr, “Struck: structured output tracking with kernels,” in International Conference on Computer Vision (IEEE, 2011), pp. 263–270.

124. H. Kiani Galoogahi, T. Sim, and S. Lucey, “Correlation filters with limited boundaries,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4630–4638.

125. B. Babenko, M. H. Yang, and S. Belongie, “Visual tracking with online multiple instance learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2009), pp. 983–990.

126. K. Zhang, L. Zhang, and M. H. Yang, “Real-time compressive tracking,” in European Conference on Computer Vision (Springer, 2012), pp. 864–877.

127. Z. Kalal, J. Matas, and K. Mikolajczyk, “P-N learning: bootstrapping binary classifiers by structural constraints,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2010), pp. 49–56.

128. L. Sevilla-Lara and E. Learned-Miller, “Distribution fields for tracking,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2012), pp. 1910–1917.

129. M. Felsberg, “Enhanced distribution field tracking using channel representations,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (2013), pp. 121–128.

130. X. Jia, H. Lu, and M. H. Yang, “Visual tracking via adaptive structural local sparse appearance model,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2012), pp. 1822–1829.

131. The Visual Object Tracking (VOT) Challenge, 2015, http://www.votchallenge.net.

132. A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah, “Visual tracking: an experimental survey,” IEEE Trans. Pattern Anal. Mach. Intell. 36, 1442–1468 (2014). [CrossRef]  

133. T. Ojala, M. Pietikäinen, and D. Harwood, “A comparative study of texture measures with classification based on featured distributions,” Pattern Recogn. 29, 51–59 (1996). [CrossRef]  

134. A. Alfalou, M. Farhat, and A. Mansour, “Independent component analysis based approach to biometric recognition,” in 3rd International Conference on Information & Communication Technologies: From Theory to Applications (ICTTA), 7–11 April 2008, pp. 1–4.

135. X. Guan, H. H. Szu, and Z. Markowitz, “Local ICA for the most wanted face recognition,” Proc. SPIE 4056, 539–551 (2000). [CrossRef]  

136. P. Comon, “Independent component analysis, a new concept?” Signal Process. 36, 287–314 (1994). [CrossRef]  

137. A. Alfalou, C. Brosseau, and M. Alam, “Smart pattern recognition,” Proc. SPIE 8748, 874809 (2013). [CrossRef]  

138. E. Watanabe and K. Kodate, “Implementation of high-speed face recognition system using an optical parallel correlator,” Appl. Opt. 44, 666–676 (2005). [CrossRef]  

139. K. Kodate, “Star power in Japan,” SPIE Professional, July 2010, https://spie.org/membership/spie-professional-magazine/spie-professional-archives-and-special-content/july2010-spie-professional/star-power-in-japan, doi:10.1117/2.4201007.13. [CrossRef]  

140. M. Elbouz, A. Alfalou, and C. Brosseau, “Fuzzy logic and optical correlation-based face recognition method for patient monitoring application in home video surveillance,” Opt. Eng. 50, 067003 (2011). [CrossRef]  

141. D. Benarab, T. Napoléon, A. Alfalou, A. Verney, and P. Hellard, “Optimized swimmer tracking system by a dynamic fusion of correlation and color histogram techniques,” Opt. Commun. 356, 256–268 (2015). [CrossRef]  

142. Centre of Molecular Materials for Photonics and Electronics, Department of Engineering, University of Cambridge, “Optical information processing,” http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.

143. I. Leonard, A. Alfalou, M. S. Alam, and A. Arnold-Bos, “Adaptive nonlinear fringe-adjusted joint transform correlator,” Opt. Eng. 51, 098201 (2012). [CrossRef]  

Qu Wang received his B.S. and Ph.D. degrees in optics from Harbin Institute of Technology in 2001 and 2006, respectively. He is an associate professor in the School of Physics and Optoelectronic Engineering at Guangdong University of Technology, China. He currently works in the Vision Lab at ISEN Brest, France, as a visiting scholar supported by the China Scholarship Council. His research interests include optical pattern recognition, optical image encryption, and image hiding.

Ayman Alfalou received his Ph.D. degree in telecommunications and signal processing from the Ecole Nationale Supérieure des Télécommunications de Bretagne (ENSTB), France, and the Université de Rennes 1 in 1999. Since June 2000, he has been a professor of telecommunications and signal processing at the Institut Supérieur de l’Electronique et du Numérique (ISEN) in Brest, where he launched the optical signal and image processing laboratory. His research interests are signal and image processing, telecommunications, optical systems, optical processing, optoelectronics, lasers, and polarization optics. He has published more than 160 refereed journal articles or conference papers on a wide variety of theoretical and experimental topics, has organized or co-organized many conferences and special sessions, and has chaired or served on the scientific committees of many international conferences. He is a senior member of The Optical Society (OSA), IEEE, and SPIE, and an elected member of IoP.

Christian Brosseau received a Ph.D. degree in physics from Fourier University, Grenoble, France, in 1989. After a postdoctoral fellowship at Harvard University, he returned to Fourier University as a Research Associate. He became an associate professor and then professor at the Université de Brest, France, in 1994 and 1997, respectively. He has published more than 210 refereed journal articles on a wide variety of theoretical and experimental topics, presented more than 130 papers at conferences, and written the book Fundamentals of Polarized Light: A Statistical Approach (Wiley, 1998). His current research interests include polarization and coherence of optical fields and electromagnetic wave propagation in complex media. He is a Fellow of IoP, The Optical Society of America (OSA), and the Electromagnetics Academy. He has held several editorial positions: Editorial Board Member of Optics Communications (2005–2008), Optics Letters (2005–2010), the Journal of Applied Physics (2012–present), and Progress in Optics (2008–present). He is the 2017 recipient of the SPIE Stokes Award in recognition of his contributions to the field of polarization optics.

Figures (67)

Figure 1. ROC curve.
Figure 2. Principle of the VLC setup.
Figure 3. Typical images for one subject. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.
Figure 4. PSR values generated from the MACE filter fabricated with subject 1. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.
Figure 5. PSR values generated from the MACE filter fabricated with subject 2. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.
Figure 6. Images for one subject obtained for the illumination variation tests. Reprinted with permission from [11]. Copyright 2004 Optical Society of America.
Figure 7. PSR values for subject 2: authentic subjects (top plots) and imposters (bottom plots). Reprinted with permission from [11]. Copyright 2004 Optical Society of America.
Figure 8. Images for subject 2 in the dataset. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.
Figure 9. Training sets 1 (top) and 2 (bottom). Reprinted with permission from [63]. Copyright 2005 Optical Society of America.
Figure 10. ROC of the MACE filter obtained with training set 1 of subject 2. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.
Figure 11. ROC of the MACE filter obtained with training set 2 of subject 2. Reprinted with permission from [63]. Copyright 2005 Optical Society of America.
Figure 12. Training images f1,2,3, specified correlation outputs g1,2,3, and exact filters h1,2,3 of ASEF [55].
Figure 13. Localization accuracy of several CFs when approximate eye locations are not known a priori. © 2009 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2009), pp. 2105–2112 [55].
Figure 14. Comparison of the second-frame PSR values of the MOSSE, ASEF, and UMACE filters. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].
Figure 15. Comparison of the second-frame PSR values of the regularized CFs when different regularization parameters are added. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].
Figure 16. Performance of three filter-based trackers on three video sequences for face tracking. © 2010 IEEE. Reprinted, with permission, from Bolme et al., IEEE Conference on Computer Vision and Pattern Recognition (2010), pp. 2544–2550 [54].
Figure 17. Performance comparison when using different convolutional layers in the network. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].
Figure 18. Performance comparison of several feature representations in the DCF framework. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].
Figure 19. Success plot comparing the trackers of [100] with state-of-the-art methods on the OTB dataset containing all 50 videos. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].
Figure 20. Attribute-based comparison of the trackers of [100] with some state-of-the-art methods on the OTB-2013 dataset. © 2015 IEEE. Reprinted, with permission, from Danelljan et al., Proceedings of the IEEE International Conference on Computer Vision Workshops (2015), pp. 58–66 [100].
Figure 21. Test images with 25%, 50%, and 75% occlusion. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].
Figure 22. Multi-PIE database test images with illumination variations. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].
Figure 23. Accuracy loss in Test 7 for face recognition. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].
Figure 24. Accuracy loss in Test 8 for face recognition. © 2013 IEEE. Reprinted, with permission, from Rodriguez et al., IEEE Trans. Image Process. 22, 631–643 (2013) [62].
Figure 25. Basic LBP operator. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.
Figure 26. Flowchart of the LBP-UMACE filter. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.
Figure 27. Typical images of one subject. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.
Figure 28. Comparison of recognition rates. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.
Figure 29. Comparison of error rates. Reprinted with permission from [46]. Copyright 2012 Optical Society of America.
Figure 30. Schematics of the algorithm: (a) definition of the independent component learning base, (b) definition of the PCE matrix, and (c) recognition procedure. Reprinted with permission from [59]. Copyright 2011 Optical Society of America.
Figure 31. Simulation results illustrating the ICA algorithm. Reprinted with permission from [59]. Copyright 2011 Optical Society of America.
Figure 32. Linear correlation (left) and circular correlation (right). © 2015 IEEE. Reprinted, with permission, from Fernandez et al., IEEE Trans. Pattern Anal. Mach. Intell. 37, 1702–1715 (2015) [86].
Figure 33. Conventional MACE (left) and ZAMACE (right). © 2015 IEEE. Reprinted, with permission, from Fernandez et al., IEEE Trans. Pattern Anal. Mach. Intell. 37, 1702–1715 (2015) [86].
Figure 34. Illustration of the denoised separation algorithm. Reprinted with permission from [87]. Copyright 2012 Optical Society of America.
Figure 35. ROC curves for performance comparison. Reprinted with permission from [87]. Copyright 2012 Optical Society of America.
Figure 36. Face reconstruction when the subspace analysis is performed over person 1 (PIE dataset). Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].
Figure 37. Detailed process of the face recognition method. Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].
Figure 38. Comparison of ROC plots for different training sets. Reprinted from Face Detection and Recognition: Theory and Practice (Chapman & Hall/CRC Press, 2015) [5].
Figure 39. Two-level decision tree learning approach: (a) first level, classification; (b) second level, identification [29].
Figure 40. GPU architecture [29].
Figure 41. GPU architecture [29].
Figure 42. Influence of the image size on the run time for four architectures [29].
Figure 43. All-optical correlation setup [137].
Figure 44. (a) Optical input image and (b) class domains of segmented filtering. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.
Figure 45. PCE versus number of training images for fabrication of the composite filters. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.
Figure 46. (a) Training image and (b) corresponding output; (c) non-training image and (d) corresponding output. Reprinted with permission from [52]. Copyright 1999 Optical Society of America.
Figure 47. Block diagram of the monitoring system [140].
Figure 48. Experimental results for head detection: (a) captured image, (b) H component thresholding, (c) S component thresholding, (d) head-detection results, and (e) result obtained with the proposed method [140].
Figure 49. Example of the reference database used in our study [140].
Figure 50. Experimental platform developed in [140].
Figure 51. User interface of the fuzzy-logic and optical-correlation-based face recognition method, developed using MATLAB on a CPU (Intel Dual Core E5300, 2.6 GHz, 2 GB RAM) [140].
Figure 52. Shooting environment: (a) Blackmagic 4K camera used for shooting the test videos; (b) example of a frame extracted from a video used in the tests [141].
Figure 53. Comparison among the NL-NZ-JTC correlator, color histograms, and dynamic fusion in terms of PCE, tested on a video sequence (360 frames) of a backstroke competition. A high value of the PCE criterion implies greater confidence in the localization (sharp peak) [141].
Figure 54. Comparison among the NL-NZ-JTC correlator, color histograms, and dynamic fusion in terms of Local-STD, tested on a video sequence (360 frames) of a backstroke competition. A low value of the Local-STD criterion implies greater confidence in the detection (less noisy plane) [141].
Figure 55. Image comparator based on the JTC [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.
Figure 56. Picture showing the configuration of the image comparator studied in [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.
Figure 57. Correlation output of the image comparator based on the JTC [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.
Figure 58. Head tracker based on the JTC comparator [142]: the blue spots represent the cross-correlation peaks, and the red spot is the autocorrelation peak. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.
Figure 59. Experimental results of the head tracker based on the JTC comparator [142]. Available from http://www-g.eng.cam.ac.uk/CMMPE/pattern.html.
Figure 60. Three target images showing different types of mine [137].
Figure 61. Results of the optimized nonlinear fringe-adjusted JTC [143].
Figure 62. VIAPIX acquisition [28].
Figure 63. Examples of images used to obtain panoramic images [28].
Figure 64. Panoramic image obtained using images from [28].
Figure 65. Red color segmentation using HSV [28].
Figure 66. Schematic diagram of the POF used in [28].
Figure 67. Panoramic image and identification results of the POF [28].

Tables (17)

Table 1. Average Verification Rate (at 0% FAR) Achieved by Use of Test Images Compressed to Various Bit Rates with MACE Filters Synthesized with the Two Training Schemes [63]
Table 2. Average Verification Rate (at 0% FAR) Achieved by Use of Test Images Compressed at Various Bit Rates with OTSDFs Synthesized with Different Amounts of Noise Tolerance [63]
Table 3. Average Verification Performance of 65 People (at 0% FAR) with Compressed Logarithm-Transformed Test Images at Various Bit Rates [63]
Table 4. Verification Rates (at 0% FAR) of Test Images with 7 dB Additive White Gaussian Noise Compressed at Various Bit Rates and Tested with the OTSDF at Different Amounts of Noise Tolerance [63]
Table 5. Results Generated by the VOT2015 Benchmark Toolkit [100]
Table 6. Classification Accuracy (%) [62]
Table 7. Computational Complexity (O, see box) and Measured Time [62]
Table 8. Recognition Results of UMACE Filters for the First Five Subjects [46]
Table 9. Recognition Data of LBP-UMACE Filters for the First Five Subjects [46]
Table 10. Comparison of Performance for the Baseline CFs and ZACFs Using the ORL Dataset [86]
Table 11. Comparison of Performance for the Baseline CFs and ZACFs Using the FRGC Dataset [86]
Table 12. PSR Value Comparison of Different Filters [5]
Table 13. Experimental Conditions for the Holographic Optical Disk Correlator [27]
Table 14. Correlation Speed of the Outermost Track of a Holographic Optical Disk [27]
Table 15. Experimental Error Rates of the Cell Phone Face Recognition System [27]
Table 16. Comparison among the NL-NZ-JTC, Color Histogram, and Dynamic Fusion Techniques in Terms of Tracking Percentage, PCE, and Local-STD [141]
Table 17. Robustness Results of the Different JTCs [143]

Equations (16)


$$C = \mathrm{FFT}^{-1}\{H^{*} \cdot T\},$$
$$\mathrm{PSR} = \frac{E\{y(0)\}}{\sqrt{\operatorname{var}\{y(\tau)\}}},$$
$$\mathrm{PCE} = \frac{|y(0)|^{2}}{E_{y}},$$
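
To make these plane-quality metrics concrete, the following is a minimal NumPy sketch (our illustration, not code from any of the referenced implementations) of the frequency-domain correlation and the PSR and PCE measures; the sidelobe window size and the (peak - mean)/std form of the PSR follow the common convention of the MOSSE literature [54] and are implementation choices rather than values fixed by the equations above.

```python
# Illustrative sketch: frequency-domain correlation C = FFT^-1{H* . T}
# and the PSR / PCE correlation-plane quality metrics.
import numpy as np

def correlation_plane(test_img, H):
    """Multiply the filter's conjugate by the test image spectrum T and
    transform back (circular correlation)."""
    T = np.fft.fft2(test_img)
    return np.fft.ifft2(np.conj(H) * T)

def psr(C, window=5):
    """Peak-to-sidelobe ratio in the common (peak - mean)/std form; the
    window masked out around the peak is an implementation choice."""
    c = np.abs(C)
    py, px = np.unravel_index(np.argmax(c), c.shape)
    mask = np.ones_like(c, dtype=bool)
    mask[max(py - window, 0):py + window + 1,
         max(px - window, 0):px + window + 1] = False
    side = c[mask]
    return (c[py, px] - side.mean()) / side.std()

def pce(C):
    """Peak-to-correlation energy: |y(0)|^2 over the plane energy E_y."""
    c = np.abs(C)
    return c.max() ** 2 / np.sum(c ** 2)
```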
$$\mathrm{ASM} = \frac{1}{L}\sum_{i=1}^{L}\sum_{m}\sum_{n}\left[g_{i}(m,n) - \bar{g}(m,n)\right]^{2},$$
$$H_{i}^{*}(u,v) = \frac{G_{i}(u,v)}{F_{i}(u,v)},$$
$$H_{\mathrm{ASEF}}(u,v) = \frac{1}{N}\sum_{i=1}^{N} H_{i}^{*}(u,v).$$
$$H^{*} = \frac{\sum_{i} G_{i} F_{i}^{*}}{\sum_{i} F_{i} F_{i}^{*}},$$
$$H_{i}^{*} = \frac{A_{i}}{B_{i}}, \qquad A_{i} = \eta\, G_{i} F_{i}^{*} + (1-\eta) A_{i-1}, \qquad B_{i} = \eta\, F_{i} F_{i}^{*} + (1-\eta) B_{i-1},$$
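
The ASEF average of exact filters and the MOSSE running update translate almost line for line into code. The sketch below is again only illustrative: the Gaussian goal function, the learning rate eta, and the eps regularizer added for numerical stability are our assumptions, not part of the printed equations.

```python
# Illustrative sketch of ASEF synthesis and the MOSSE online update.
import numpy as np

def gaussian_goal(shape, center, sigma=2.0):
    """Desired output g_i: a Gaussian peak at the annotated target center."""
    y, x = np.mgrid[0:shape[0], 0:shape[1]]
    return np.exp(-((x - center[1])**2 + (y - center[0])**2) / (2 * sigma**2))

def asef_filter(train_imgs, centers, eps=1e-5):
    """H_ASEF = (1/N) sum_i H_i*, with exact filters H_i* = G_i / F_i."""
    Hstar = np.zeros(train_imgs[0].shape, dtype=complex)
    for f, c in zip(train_imgs, centers):
        F = np.fft.fft2(f)
        G = np.fft.fft2(gaussian_goal(f.shape, c))
        Hstar += G / (F + eps)        # exact filter for one training pair
    return Hstar / len(train_imgs)

class MosseFilter:
    """H_i* = A_i / B_i, with A_i and B_i updated by exponential forgetting."""
    def __init__(self, eta=0.125, eps=1e-5):
        self.eta, self.eps = eta, eps
        self.A = None
        self.B = None

    def update(self, frame, center):
        F = np.fft.fft2(frame)
        G = np.fft.fft2(gaussian_goal(frame.shape, center))
        if self.A is None:                      # first frame: plain MOSSE sums
            self.A = G * np.conj(F)
            self.B = F * np.conj(F)
        else:                                   # running averages A_i, B_i
            self.A = self.eta * G * np.conj(F) + (1 - self.eta) * self.A
            self.B = self.eta * F * np.conj(F) + (1 - self.eta) * self.B

    def hstar(self):
        return self.A / (self.B + self.eps)
```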
$$\varepsilon = \sum_{k=1}^{t} \alpha_{k} \left\| f_{t} \star x_{k} - y_{k} \right\|^{2} + \lambda \left\| f_{t} \right\|^{2},$$
$$\varepsilon = \sum_{k=1}^{t} \alpha_{k} \left\| f_{t} \star x_{k} - y_{k} \right\|^{2} + \sum_{l=1}^{d} \left\| w \cdot f_{t}^{l} \right\|^{2},$$
$$\min_{\mathbf{h},\,b}\left( \mathbf{h}^{T}\mathbf{h} + C\sum_{i=1}^{N}\xi_{i},\; \sum_{i=1}^{N}\left\| \mathbf{h} \star \mathbf{x}_{i} - \mathbf{g}_{i} \right\|_{2}^{2} \right) \quad \mathrm{s.t.} \quad t_{i}\left(\mathbf{h}^{T}\mathbf{x}_{i} + b\right) \geq c_{i} - \xi_{i},$$
$$\varepsilon_{i} = \left|\mathrm{PCE}_{1} - \mathrm{PCE}_{i1}\right| + \left|\mathrm{PCE}_{2} - \mathrm{PCE}_{i2}\right| + \cdots + \left|\mathrm{PCE}_{n} - \mathrm{PCE}_{in}\right|, \qquad i = 1, 2, \ldots, n.$$
$$P_{c} = \sum_{i=1}^{M} \beta_{i} Y_{i} + R \quad \text{or} \quad P_{c} = Y\beta + R \ \ (\text{matrix form}),$$
$$Y_{\mathrm{peak}} = \left[\,\mathrm{thinSVD}(\mathrm{peak}_{1}) \ \cdots \ \mathrm{thinSVD}(\mathrm{peak}_{k})\,\right] = \left[\,V_{1}^{t} \ \cdots \ V_{k}^{t}\,\right],$$
$$Y_{\mathrm{noise}} = \left[\,\mathrm{noise}_{1} \ \cdots \ \mathrm{noise}_{n}\,\right], \qquad Y = \left[\,Y_{\mathrm{peak}} \ \ Y_{\mathrm{noise}}\,\right].$$
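
As a rough illustration of the linear functional model above, the following sketch (our construction; the rank r, the column layout of Y, and the least-squares solver are assumptions) builds the dictionary Y from thin SVDs of reference correlation peaks plus noise exemplars and regresses an observed correlation plane P_c on it.

```python
# Illustrative sketch of the linear functional model P_c = Y beta + R.
import numpy as np

def peak_column(plane, r=1):
    """Thin SVD of one reference peak; its rank-r reconstruction is
    vectorized into one column of Y_peak."""
    U, s, Vt = np.linalg.svd(plane, full_matrices=False)
    return ((U[:, :r] * s[:r]) @ Vt[:r, :]).ravel()

def build_dictionary(peak_planes, noise_planes, r=1):
    """Y = [Y_peak  Y_noise], with vectorized planes as columns."""
    Y_peak = np.stack([peak_column(p, r) for p in peak_planes], axis=1)
    Y_noise = np.stack([n.ravel() for n in noise_planes], axis=1)
    return np.hstack([Y_peak, Y_noise])

def lfm_fit(P_c, Y):
    """Least-squares estimate of beta in P_c = Y beta + R; the residual R
    (together with the peak coefficients) drives the decision."""
    beta, *_ = np.linalg.lstsq(Y, P_c.ravel(), rcond=None)
    return beta, P_c.ravel() - Y @ beta
```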
$$C = \mathrm{FFT}^{-1}\left\{ H_{p}^{\varphi_{j}} \times \left(H_{r}^{\varphi_{k}}\right)^{*} \right\},$$