Pixel-level alignment of facial images for high accuracy recognition using ensemble of patches

Open Access

Abstract

The variation of pose, illumination, and expression continues to make face recognition a challenging problem. As a pre-processing step in holistic approaches, faces are usually aligned by the eyes. The proposed method instead performs a pixel-level alignment rather than eye alignment by mapping the geometry of faces to a reference face while keeping their own textures. The proposed geometry alignment not only creates a meaningful correspondence among every pixel of all faces, but also removes expression and pose variations effectively. The geometry alignment is performed pixel-wise, i.e., every pixel of the face corresponds to a pixel of the reference face. In the proposed method, the intensity and geometry information of faces is properly separated, trained by separate classifiers, and finally fused together to recognize human faces. Experimental results show a great improvement using the proposed method in comparison to eye-aligned recognition. For instance, at a false acceptance rate (FAR) of 0.001, the recognition rates are improved by 24% and 33% in the Yale and AT&T datasets, respectively. In the labeled faces in the wild dataset, which is a large, challenging dataset, the improvement is 20% at a FAR of 0.1.

© 2018 Optical Society of America

1. INTRODUCTION

Face recognition is one of the most attractive and practical fields of research in pattern analysis and image processing, receiving much attention from different knowledge backgrounds, including pattern recognition, computer vision, image processing, statistical learning, neural networks, and computer graphics [1].

According to [1], face recognition methods fall into two main categories: feature-based and holistic (whole-pixel) methods. Feature-based methods construct a feature vector from the face for the learning process, whereas holistic methods use all pixels of the face region as raw data for recognition and learning.

Feature-based methods utilize the geometrical and structural features of the face [1]. For instance, in [2], features of head width, distances between eyes, and distances from eyes to mouth are compared. In [3], angles and distances between eye corners, mouth hole, chin top, and the nostrils are used. In [4], face features such as mouth, nose, eyebrows, and face outline are detected using horizontal and vertical gradients. In this method, template matching using correlation is also proposed. In [5,6], the hidden Markov model is used on pixel strips of different parts of the face. Also, recently, a patch-based representation is used in [7] in which each patch tries to learn a transformation dictionary in order to transform the features onto a discriminative subspace. Paper [8] is another feature-based method in which a pyramid of a facial image is created and the patches around five key landmarks in different pyramid levels are concatenated to prepare a high-dimensional feature vector.

Some feature-based methods use both features and whole pixels together in order to enhance recognition performance [1]. One example is eigenmodules [9], in which eigenfaces are combined with eigenmodules of the face such as eigeneyes, eigenmouth, and eigennose. In [10], principal component analysis (PCA) is used in combination with local feature analysis. Some of the methods in this category which seem more promising are based on the "shape-free" face concept. In Refs. [11,12], the active appearance model has been proposed as a method of warping textures of image patches to a specific geometry in an iterative manner. In this method [13–16], the patch of the face is labeled by several landmarks (model points), the texture of the face is projected onto the texture model frame by applying scale and offset to the intensities, and the residual (error) between the projected and previous image patches is iteratively reduced. In [13,16], the shape of the face is also modeled using the active shape model (ASM) [17]. The authors have shown that different weights of eigenvalues can vary the different aspects and parts of face shape models. In [18], several shape-free (neutral) faces create an ensemble and all the faces are approximated by a linear combination of the eigenfaces of the ensemble.

Despite significant advances of feature-based methods, holistic methods are still receiving lots of attention as they use the information of all pixels in the face region. Holistic methods detect and crop the face out of the image and use it as a raw input for classification. Eigenfaces [19,20], Fisherfaces [21], and kernel faces [22,23] are several well-known examples of this category which respectively create a feature space using PCA, Fisher linear discriminant analysis (LDA), and kernel direct discriminant analysis for face classification and recognition. Face recognition using a support vector machine [24] is another method from this category, which formulates face recognition as a two-class problem, one class as dissimilarities between faces of the same person and the other class as dissimilarities between faces of different individuals. A Bayesian classifier [25] can also be mentioned in this category, which has a probabilistic approach toward the similarity of faces. Some other holistic methods of face recognition have used artificial neural networks [1,26]. For instance, the probabilistic decision-based neural network [27] and convolutional neural networks (CNNs) [28–32] can be mentioned. Recently, sparse representation-based classification [33] has been used in order to create a recognition system with robustness to illumination and occlusion.

Both geometrical and intensity features exist in a 2D image of a face, which help the human eye to recognize people from their images. For example, both eye color and the distance between eyes and nose can be inferred from a facial image. Accordingly, in a successful face recognition system both of these categories of features, i.e., geometry and intensity, should be appropriately used. However, whenever eye alignment is used in a holistic method or other approaches, the correspondences between organs other than eyes are disturbed; the intensities of lips in different faces cannot be compared with each other, nor can their positions be compared. That is because eye alignment is a global transformation. This work proposes a local pixel-level alignment of faces in which every pixel corresponds to a specific similar region of face in all aligned facial images. The main contribution of this work is to introduce an alignment method by which the intensity information and geometry information are separated from each other. Each of these pieces of information is then used to train their corresponding classification modules and finally their results are fused together to recognize human faces. Moreover, note that any classification algorithm, such as a Fisher-LDA-based ensemble of patches, which is used in this paper, can be used as the classifier in the proposed method. In particular, the proposed local pixel-level alignment method can align the features as a pre-processing step to CNNs, which are quite robust to global, but not to local, misalignment.

To provide more detail, two major tasks are accomplished using the proposed method:

  • 1. The proposed alignment method places the intensities of similar organs in the same positions in the warped faces.
  • 2. When intensities are properly aligned, using the proposed geometry extraction method, the coordinate of the aligned pixels can be extracted to be used as the corresponding geometry information.

As a result, the proposed pixel alignment provides both intensity and geometry information useful for recognition.

The remainder of this paper is organized as follows. Section 2 details the geometrical alignment as the first part of the proposed method. Thereafter, geometrical information is discussed further in Section 3. Creating feature vectors using an ensemble of patches, using Fisher LDA, and decision fusion are explained in Section 4, which also sums up the proposed method by illustrating its overall structure. The utilized datasets and experimental results are reported in Section 5. Section 6 discusses the alignment of features and the ensemble of patches. Finally, Section 7 concludes the paper and posits possible future directions.

2. GEOMETRICAL ALIGNMENT

Geometrical alignment can be defined as aligning the geometries of faces to a unique geometry while preserving their own textures. In the proposed geometrical alignment method, a reference geometry is defined, and the geometry of all training and test faces is transformed to this geometry. Here, the geometry of a face is defined as the location of the contours of the facial landmarks. Therefore, geometrical alignment is performed by warping a face such that its facial contours coincide with the reference contours.

In the proposed method, any landmark detection method can be used to detect the facial landmarks, such as the ASM [17] or constrained local neural fields (CLNF) [34]. In this work, the CLNF is utilized for this purpose. (The code of the CLNF method can be found in https://github.com/TadasBaltrusaitis/OpenFace.) The landmarks in this work are as follows. There are 17 landmarks around the face region, 14 landmarks for the lips, three landmarks each for the upper and lower teeth, six landmarks for each eye, nine landmarks for the whole nose, and five landmarks for each eyebrow, resulting in 68 landmarks in total.

In the following sections, the different steps of the proposed method are explained in detail.

A. Fitting Face Contours

The CLNF method [34], which is an enhanced constrained local model [35], is used for detecting the landmarks of each training and test face. This method is briefly described in the following; interested readers are referred to [34] for more details.

In the CLNF method, faces are first detected using a tree-based face detection method. The CLNF method introduces patch experts, which are small partitions of pixels around points of interest, such as the face edge, eyes, eyebrows, nose, and lips. The initial patch experts are placed on the image, and a one-layer neural network is then trained for every patch expert. The neural networks try to learn the spatial dependencies of pixels in the patches. Moreover, the CLNF method takes into account the smoothness of the contour of patch experts as well as the sparsity of landmark probability peaks within every patch expert. The log-likelihood of the landmark probabilities is found, and the solution to the optimization problem of maximum probabilities gives the landmarks. Finally, an iterative patch-expert update is performed in which different variances of patches are considered based on the location of landmarks in the facial region [34].

B. Reference Contour

Reference contours are obtained by averaging the contours of landmarks of several neutral faces from the training set. Figure 1 shows an example of reference contours.
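As a concrete illustration, a minimal sketch of this averaging step is given below (Python/NumPy); the function name `reference_contour` and the assumption that each face's landmarks arrive as a (68, 2) array are ours, not the paper's.

```python
import numpy as np

def reference_contour(landmark_sets):
    """Average the 68-point landmark contours of several neutral faces.

    landmark_sets : list of (68, 2) arrays, one per neutral training face
                    (hypothetical input; any landmark detector, e.g. CLNF,
                    could supply these points).
    Returns the (68, 2) reference contour used as the warping target.
    """
    stacked = np.stack(landmark_sets, axis=0)   # shape (N, 68, 2)
    return stacked.mean(axis=0)                 # element-wise average over faces
```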

Fig. 1. Obtaining reference contour by averaging landmark contours of several neutral faces. The facial images are from the Yale dataset [36].

C. Transformation and Pixel-to-Pixel Warping

After fitting the contour of landmarks to the input face, the face is geometrically transformed and warped to match the geometry of the reference face. This step is detailed in this section.

For the geometrical transformation and pixel-to-pixel warping, three interpolations are performed, as depicted in Fig. 2, which are detailed next. As a result of these interpolations, the intensity of each pixel is transformed to its corresponding location on the warped face. This transformation is guided by the transformation between the location of landmarks on the input face and the location of those on the warped face.

Fig. 2. Procedure of geometrical transformation and pixel-to-pixel face warping.

It is important to note that the proposed face warping method differs from the conventional one as the target coordinates for every single pixel of the input face are calculated using the described interpolation procedures.

1. Delaunay Triangulation of Landmarks

According to [37], a triangulation of a finite point set $P \subset \mathbb{R}^2$ is called a Delaunay triangulation if the circumcircle of every triangle is empty, that is, there is no point from $P$ inside the circumcircle of any triangle. Each face is triangulated using the Delaunay method, as depicted in Fig. 3. By performing triangulation, the triangles needed for the affine interpolations are obtained, which are used in the geometrical transformation as described next.
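For illustration, the triangulation step could be realized with SciPy's Delaunay implementation roughly as follows; the placeholder landmark array and the variable names are assumptions of the sketch, not details from the paper.

```python
import numpy as np
from scipy.spatial import Delaunay

# Hypothetical (68, 2) landmark array; in practice these come from the
# landmark detector (e.g., CLNF).
landmarks = np.random.default_rng(0).uniform(0, 1, size=(68, 2)) * [120.0, 140.0]

tri = Delaunay(landmarks)      # Delaunay triangulation of the landmark set
triangles = tri.simplices      # (T, 3) index triplets into `landmarks`

# find_simplex returns the triangle enclosing an arbitrary point, which is
# needed when interpolating the target coordinates of non-landmark pixels.
enclosing = tri.find_simplex(np.array([[60.0, 70.0]]))
```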

Fig. 3. Delaunay triangulation of face landmarks.

2. Transformation of Coordinates

Let $(x, y)$ and $(x', y')$ denote the coordinates of pixels on the input (source) and warped (target) face, respectively, and let $I(x, y)$ and $I'(x', y')$ denote their corresponding intensities. As can be seen in Fig. 4, the $x'$ value of every non-landmark pixel is found using affine interpolation of the $x'$ values of the three surrounding landmarks as

$$x' = f(x, y) = a_0 + a_1 x + a_2 y,$$

where the coefficients $a_0$ to $a_2$ are calculated by solving the linear system given by the three surrounding landmark points. These interpolations are performed for all non-landmark points between the three surrounding landmarks to obtain the $x'$ coordinates of all existing pixels. The same procedure is performed for finding the $y'$ coordinates of the pixels. Finally, the $(x', y')$ of all pixels are obtained and it is known where each input pixel is to be transferred.
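A minimal sketch of this per-triangle affine fit, assuming NumPy arrays of landmark coordinates, might look as follows; the helper names `affine_coeffs` and `interpolate_target` are ours.

```python
import numpy as np

def affine_coeffs(src_tri, dst_vals):
    """Solve a_0 + a_1*x + a_2*y = x' for the three landmarks of a triangle.

    src_tri  : (3, 2) source landmark coordinates (x, y)
    dst_vals : (3,)  target x' (or y') values of the same landmarks
    Returns [a_0, a_1, a_2].
    """
    A = np.column_stack([np.ones(3), src_tri[:, 0], src_tri[:, 1]])
    return np.linalg.solve(A, dst_vals)

def interpolate_target(src_tri, dst_vals, pts):
    """Apply the fitted affine map to non-landmark pixels inside the triangle."""
    a0, a1, a2 = affine_coeffs(src_tri, dst_vals)
    return a0 + a1 * pts[:, 0] + a2 * pts[:, 1]
```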

Fig. 4. Interpolation of $x'$. The black points are the landmarks with known target coordinates and the yellow points are the interpolated target coordinates for the non-landmark pixels.

3. Pixel-to-Pixel Warping

After the x- and y-interpolations, each $(x', y')$ coordinate gets the intensity of its corresponding $(x, y)$ from the input face, i.e.,

$$I'(x', y') = I(x, y).$$

The $I'$ values are then resampled on a uniform grid, e.g., $140 \times 120$ pixels, to create the warped face (see Fig. 5).
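One possible way to realize this resampling is with SciPy's `griddata`, sketched below under the assumption that the warped coordinates and intensities are available as flat arrays; the function name and grid-handling details are ours, not the paper's.

```python
import numpy as np
from scipy.interpolate import griddata

def resample_warped_face(target_xy, intensities, shape=(140, 120)):
    """Resample scattered warped intensities I'(x', y') on a uniform grid.

    target_xy   : (N, 2) warped coordinates (x', y') of the source pixels
    intensities : (N,)   intensities carried over from the source pixels
    shape       : output grid size, e.g. 140x120 as in the paper
    """
    h, w = shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    warped = griddata(target_xy, intensities, (grid_x, grid_y),
                      method='linear', fill_value=0.0)
    return warped
```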

Fig. 5. Intensity interpolation.

For the sake of demonstration, an example of coordinate transformation and pixel-to-pixel warping on a sample face with a small number of landmarks is depicted in Fig. 6. In this figure, the yellow diamond points and red square points are the input and reference landmarks, respectively. The face is warped so that the input landmarks are located precisely at the positions of the reference landmarks, which is the goal of the coordinate transformation. The other pixels are interpolated as explained previously. It is also noteworthy that the three explained interpolations (shown in Fig. 2) are performed as forward mapping. This mapping can also be done in the backward direction, where the x- and y-interpolations are carried out on the target image and the intensity interpolation is performed on the source image side.

Fig. 6. Example of geometrical transformation and pixel-to-pixel warping. The facial image is taken from the Cohn-Kanade dataset [38,39] (©Jeffrey Cohn).

3. GEOMETRICAL INFORMATION

Geometrical information seems to be useful in addition to the intensity information of the warped face. Obviously, the geometry information of each face exists in its unwarped (input) face. By finding the original coordinate (i.e., the coordinate in the unwarped face image) of each pixel of the warped face, the geometry information can be gathered. However, as the x and y coordinates have already been resampled once, their original coordinates cannot be found directly. These coordinates can be obtained by performing two additional resamplings on the same grid as before: one for the original x values and one for the original y values. In other words, two further interpolations are performed in which the x and y source coordinates of each pixel in the warped face are found using interpolation. These two interpolations are exactly the same as the previous intensity interpolation (Fig. 5) but with I replaced by x and y.

For better visualization, the difference of the original x and y coordinates of every pixel from those of its previous pixel is calculated. These differences are denoted as Δx and Δy here, for the x and y information, respectively. Figure 7 illustrates the Δx and Δy information for a sample face from the Yale dataset [36]. The amount of vertical and horizontal transition of each pixel after warping can be seen in this figure. It shows that for this specific face, warping has changed the face more in the horizontal direction than in the vertical one.
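A rough sketch of how these geometry channels could be derived, assuming the resampled source-coordinate maps are available as `orig_x` and `orig_y` arrays, is shown below; differencing each coordinate map along its own axis is our reading of the text, not a detail stated in the paper.

```python
import numpy as np

def geometry_maps(orig_x, orig_y):
    """Derive the Δx and Δy geometry channels from the resampled source
    coordinates of the warped face.

    orig_x, orig_y : HxW arrays from the two extra interpolations above
    Returns (dx, dy), each HxW, holding pixel-to-pixel coordinate differences.
    """
    dx = np.diff(orig_x, axis=1, prepend=orig_x[:, :1])  # horizontal differences
    dy = np.diff(orig_y, axis=0, prepend=orig_y[:1, :])  # vertical differences
    return dx, dy
```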

Fig. 7. Illustration of Δx and Δy information for a sample warped face. (a) Δx information, (b) Δy information.

4. CLASSIFICATION USING ENSEMBLE OF PATCHES

A. Ensemble of Patches and Feature Vectors

Instead of using the whole face, a patch-based approach is used in this work. To do this, an ensemble of patches is created within the face frame. The locations of the patches are selected randomly once, and the same patches are used for all faces of the dataset in both the training and testing phases. The optimal number and size of patches were found through trial and error to be 80 and 30×30 pixels, respectively, across various datasets.
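The patch sampling could be implemented along the following lines; the function name, the default face size, and the fixed random seed are illustrative assumptions of the sketch.

```python
import numpy as np

def sample_patches(face_shape=(140, 120), n_patches=80, patch_size=30, seed=0):
    """Draw the top-left corners of the random patch ensemble once; the same
    corners are then reused for every face in training and testing."""
    rng = np.random.default_rng(seed)
    h, w = face_shape
    tops = rng.integers(0, h - patch_size + 1, size=n_patches)
    lefts = rng.integers(0, w - patch_size + 1, size=n_patches)
    return list(zip(tops, lefts))
```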

For every face, the ensemble of patches is applied to the intensity matrix of its warped face, its Δx matrix, and its Δy matrix. An example of applying the ensemble of patches to these three matrices is depicted in Fig. 8. Note that the Δx and Δy information is equivalent to the x and y information. In order to form the feature vectors of each patch, the matrix coefficients that fall within the patch are reshaped into a vector. In other words, for the pth patch, if the size of the patch is $m \times m$, the feature vectors are obtained as

$$f_p^I = [I(1,1), I(1,2), \ldots, I(m,m)]^T,$$
$$f_p^{\Delta x} = [\Delta x(1,1), \Delta x(1,2), \ldots, \Delta x(m,m)]^T,$$
$$f_p^{\Delta y} = [\Delta y(1,1), \Delta y(1,2), \ldots, \Delta y(m,m)]^T,$$

where $f_p^I$, $f_p^{\Delta x}$, and $f_p^{\Delta y}$ are, respectively, the feature vectors of the pth patch with respect to the intensity, Δx, and Δy matrices. Moreover, $I(k,l)$, $\Delta x(k,l)$, and $\Delta y(k,l)$ denote the coefficients of the intensity, Δx, and Δy matrices that fall in pixel $(k,l)$ of the patch.
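A short sketch of this vectorization step, assuming the warped intensity, Δx, and Δy matrices are available as NumPy arrays, is given below; the helper name `patch_features` is ours.

```python
import numpy as np

def patch_features(matrix, corners, patch_size=30):
    """Reshape the coefficients falling inside each patch into a vector.

    matrix  : HxW array of warped intensities, Δx, or Δy values
    corners : list of (top, left) patch corners from the ensemble
    Returns one (patch_size*patch_size,) feature vector per patch.
    """
    feats = []
    for top, left in corners:
        patch = matrix[top:top + patch_size, left:left + patch_size]
        feats.append(patch.reshape(-1))   # f_p for this matrix
    return feats
```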

Fig. 8. Classification using the ensemble of patches.

B. Fisher Linear Discriminant Analysis

After constructing the feature vectors of the ensemble of patches, three separate Fisher LDA subspaces are trained for every patch. That is, for the pth patch across all training faces, one Fisher LDA subspace is trained using the feature vectors $f_p^I$, one using the feature vectors $f_p^{\Delta x}$, and one using the feature vectors $f_p^{\Delta y}$. In this work, the Fisherface method [21] is used for classification of each patch; however, other, more sophisticated learning methods can be used in future works.

The goal of Fisher LDA is to maximize the ratio

$$W_{\mathrm{opt}} = \arg\max_W \frac{|W^T S_b W|}{|W^T S_w W|},$$

where $S_b$ and $S_w$ are the between- and within-class scatter matrices, respectively [40,41], formulated as

$$S_w = \sum_{i=1}^{C} \sum_{x_k \in X_i} (x_k - \mu_i)(x_k - \mu_i)^T,$$
$$S_b = \sum_{i=1}^{C} N_i (\mu_i - \mu)(\mu_i - \mu)^T,$$

where $\mu_i$ is the mean of the ith class, $\mu$ is the mean of the class means, $N_i$ is the number of samples of the ith class, and $x_k$ is the kth sample of the ith class ($X_i$).

After finding the scatter matrices, a discriminative subspace is created using the eigenvectors of the matrix $S_w^{-1} S_b$. To extract the discriminative features from a feature vector, it is projected onto this subspace. If $C$ denotes the number of classes, this projection also reduces the dimension of the data to $C-1$ [40,41].
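A compact sketch of such a per-patch Fisher LDA, written directly from the scatter-matrix definitions above, is given below; the ridge term used to keep $S_w$ invertible is an assumption of the sketch, as the paper does not discuss singularity handling (the Fisherface method [21] addresses it with a preceding PCA step).

```python
import numpy as np

def fisher_lda(X, y):
    """Fit a Fisher LDA subspace for one patch's feature vectors.

    X : (n_samples, d) stacked feature vectors (f_p^I, f_p^dx, or f_p^dy)
    y : (n_samples,)   identity labels
    Returns a (d, C-1) projection matrix built from the leading eigenvectors
    of S_w^{-1} S_b.
    """
    classes = np.unique(y)
    d = X.shape[1]
    class_means = {c: X[y == c].mean(axis=0) for c in classes}
    mu = np.mean(list(class_means.values()), axis=0)   # mean of class means

    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        diff_w = Xc - class_means[c]
        Sw += diff_w.T @ diff_w                        # within-class scatter
        diff_b = (class_means[c] - mu).reshape(-1, 1)
        Sb += len(Xc) * diff_b @ diff_b.T              # between-class scatter

    # Small ridge keeps S_w invertible for high-dimensional patch vectors
    # (an assumption of this sketch).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-eigvals.real)[: len(classes) - 1]
    return eigvecs[:, order].real

def project(W, f):
    """Project a feature vector onto the discriminative subspace."""
    return W.T @ f
```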

C. Decision Making

Clearly, there are several different sets of features available rather than one, i.e., intensity, Δx, and Δy features for all patches. Hence, in order to obtain the final similarity/distance score between two face images, a fusion needs to be performed. The fusion can be performed before, during, or after classification, known respectively as data-, feature-, and decision-level fusion. In data- and feature-level fusion, the feature vectors are concatenated before and after projecting onto the discriminative subspace, respectively; in decision-level fusion, the resulting scores are fused. Decision-level fusion is found to perform best in this work.

For the pth patch of every face image, each of the feature vectors $f_p^I$, $f_p^{\Delta x}$, and $f_p^{\Delta y}$ is projected onto its corresponding discriminative LDA subspace, obtained as described in Section 4.B. The projections result in the projected feature vectors $\hat{f}_p^I$, $\hat{f}_p^{\Delta x}$, and $\hat{f}_p^{\Delta y}$. In the context of face recognition, it has been shown that the cosine of the angle between two discriminative feature vectors, which is a similarity score, results in better recognition than distance measures such as the Euclidean distance [42,43]. Hence, the cosine similarity is used in this work for matching. The similarity score between two face images i and j is then calculated as follows. First, the similarity scores in the discriminative subspaces related to the pth patch are obtained as

$$\mathrm{sim}_p^I(i,j) = \cos(\hat{f}_{p,i}^I, \hat{f}_{p,j}^I) = \frac{\hat{f}_{p,i}^I \cdot \hat{f}_{p,j}^I}{\lVert \hat{f}_{p,i}^I \rVert \, \lVert \hat{f}_{p,j}^I \rVert},$$
$$\mathrm{sim}_p^{\Delta x}(i,j) = \cos(\hat{f}_{p,i}^{\Delta x}, \hat{f}_{p,j}^{\Delta x}) = \frac{\hat{f}_{p,i}^{\Delta x} \cdot \hat{f}_{p,j}^{\Delta x}}{\lVert \hat{f}_{p,i}^{\Delta x} \rVert \, \lVert \hat{f}_{p,j}^{\Delta x} \rVert},$$
$$\mathrm{sim}_p^{\Delta y}(i,j) = \cos(\hat{f}_{p,i}^{\Delta y}, \hat{f}_{p,j}^{\Delta y}) = \frac{\hat{f}_{p,i}^{\Delta y} \cdot \hat{f}_{p,j}^{\Delta y}}{\lVert \hat{f}_{p,i}^{\Delta y} \rVert \, \lVert \hat{f}_{p,j}^{\Delta y} \rVert},$$

where $\hat{f}_{p,i}^I$, $\hat{f}_{p,i}^{\Delta x}$, and $\hat{f}_{p,i}^{\Delta y}$ are, respectively, the projected feature vectors $\hat{f}_p^I$, $\hat{f}_p^{\Delta x}$, and $\hat{f}_p^{\Delta y}$ of the ith face image.

Then, the final similarity score is simply obtained by a weighted summation of all the scores of patches (decision fusion):

$$\mathrm{sim}(i,j) = \sum_{p=1}^{80} \left( \mathrm{sim}_p^I(i,j) + w \, \mathrm{sim}_p^{\Delta x}(i,j) + w \, \mathrm{sim}_p^{\Delta y}(i,j) \right),$$

where $w$ is the weight associated with the geometrical information, and the weight of the intensity information is set to one for simplicity. The classification using the ensemble of patches is summarized in Fig. 8.
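A minimal sketch of the cosine scoring and decision-level fusion is given below; the dictionary-based data layout and the default weight value are illustrative assumptions of the sketch.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two projected feature vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def fused_similarity(proj_i, proj_j, w=0.2):
    """Decision-level fusion over all patches.

    proj_i, proj_j : dicts mapping patch index p to the triple of projected
                     vectors (f_I, f_dx, f_dy) for faces i and j
    w              : weight of the geometrical scores (0.2 in the experiments)
    """
    total = 0.0
    for p in proj_i:
        fI_i, fdx_i, fdy_i = proj_i[p]
        fI_j, fdx_j, fdy_j = proj_j[p]
        total += cosine(fI_i, fI_j) + w * cosine(fdx_i, fdx_j) + w * cosine(fdy_i, fdy_j)
    return total
```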

D. Analyzing the Effectiveness of Geometrical and Intensity Information Separation

The separation of the intensity and geometry information significantly helps the classification of faces for the following reasons. (1) As the intensity information is obtained from the warped face, where the pixels are aligned with respect to the reference landmarks, pose and expression information does not exist in it, and thus the intensity information is pure and clean. This pure and clean information helps the classifiers of intensity information discriminate the faces solely based on the intensity of faces (such as a star, dot, or any intensity difference in any facial organ between different identities). Moreover, it is important to note that when the intensity information is mixed up with the geometry information and is not properly aligned, learning between-class intensity differences is much more difficult. (2) The geometry information includes three types of information, which are (I) personal face geometry (such as fatness or distance of eyes), (II) pose, and (III) expression, of which only the first is important for face identification. Without the proposed method, this geometry information is mixed with the intensity information, and thus excluding the personal geometry information from the mixture of intensity and geometry information is much harder than doing so from the pure geometry information. The proposed method separates the geometry information from the intensity information and thus helps the classifier discriminate the personal face geometry much more easily. Note that the classifier will consider the personal face geometry as between-class variation and the other two types of geometrical information as within-class variation, and will try to find the projection directions where the between-class variation is maximized while the within-class variation is minimized.

E. Overall Structure of the Proposed Face Recognition Framework

The proposed method is summarized in Fig. 9. In this method, a set of reference contours is constructed, the landmarks of each training/test face are detected using the CLNF method, the faces are transformed geometrically to the reference, warping is performed, and feature vectors are created for classification. In preparing the feature vectors, the ensemble of patches is applied to the matrices of warped intensity, Δx, and Δy. A separate Fisher LDA is trained for every patch in each of these matrices. Finally, in the test phase, the feature vectors are projected onto the corresponding LDA subspaces and the similarity scores are summed to obtain the total score.

Fig. 9. Overall structure of the proposed method.

5. EXPERIMENTAL RESULTS

A. Datasets

Four different datasets are used for evaluating the recognition performance of the proposed alignment method: the Yale [36], AT&T [44,45], Cohn-Kanade [38,39], and labeled faces in the wild (LFW) [46,47] datasets, detailed in the following. In this work, as a pre-processing step, the datasets are eye-aligned and then cropped using the CLNF method [34]. Specifically, the eye locations are found using CLNF, and the faces are eye-aligned by a translation and rotation.
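A possible sketch of such an eye-alignment step, using a similarity transform built from two detected eye centers, is shown below; the target eye positions and the use of `scipy.ndimage.affine_transform` are assumptions of the sketch, not details given in the paper.

```python
import numpy as np
from scipy.ndimage import affine_transform

def eye_align(image, left_eye, right_eye,
              target_left=(40.0, 48.0), target_right=(40.0, 72.0)):
    """Eye-align a gray-scale image by a similarity transform (rotation,
    uniform scale, translation) moving the detected eye centers onto fixed
    target positions given as (row, col).

    The eye centers could be obtained, e.g., by averaging the CLNF eye
    landmarks; the target positions here are arbitrary example values.
    """
    src = np.array([left_eye, right_eye], dtype=float)
    dst = np.array([target_left, target_right], dtype=float)

    v_src, v_dst = src[1] - src[0], dst[1] - dst[0]
    scale = np.linalg.norm(v_dst) / np.linalg.norm(v_src)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = np.cos(angle), np.sin(angle)
    R = scale * np.array([[c, -s], [s, c]])   # maps source coords to target coords
    t = dst[0] - R @ src[0]

    # affine_transform maps *output* coordinates to *input* coordinates,
    # so the inverse transform is passed.
    R_inv = np.linalg.inv(R)
    return affine_transform(image, R_inv, offset=-R_inv @ t, order=1)
```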

1. Yale Face Dataset

The Yale face dataset [36] was created by the Center for Computational Vision and Control at Yale University, New Haven. It consists of 165 gray-scale face images of 15 different persons, with 11 images per person depicting different facial expressions.

2. AT&T Face Dataset

The AT&T face dataset [44,45] was created by AT&T Laboratories Cambridge in 2002. It contains pictures of 40 different persons, each with 10 images showing different facial expressions.

3. Cohn-Kanade Face Dataset

The Cohn-Kanade dataset [38,39] includes 486 face sequences from 97 persons. Every sequence starts with a neutral face and ends with an extreme version of an expression. Different expressions exist in this dataset, such as laughing, surprise, etc. The first version of this dataset is used here. For every person in this dataset, only one neutral face, all middle expressions, and all extreme expressions are utilized in this work to perform the experiments.

4. LFW Face Dataset

The LFW dataset [46,47] is a very large and challenging dataset including 13,233 face images collected from the web. The faces have various poses, expressions, and locations within the images, and the distance of the camera from the person is not necessarily the same across images. There are different numbers of images per subject, from one to sometimes 10. The uncropped, unprocessed version of this dataset is used in this work for the experiments.

B. Warped Faces

In this section, for the sake of visualization, several warped faces which are pixel-aligned are shown and analyzed. Several samples of warped faces from the Yale and AT&T datasets are illustrated in Figs. 10 and 11, respectively. In these figures, the first and second rows are faces before and after warping, respectively. At the right-hand side of these figures, the reference contours are shown.

Fig. 10. Several samples of pixel alignment in the Yale dataset [36].

Fig. 11. Several samples of pixel alignment in the AT&T dataset [44,45].

As seen in Fig. 10, faces (a), (c), and (f) are originally smiling, but the mouths in the corresponding warped faces are closed and the teeth are largely removed. Face (d) in Fig. 10, however, originally shows surprise, while the mouth is completely closed after warping. Similarly, in Fig. 11, faces (b), (d), (e), and (f) have different expressions while their corresponding warped faces have a neutral expression with closed mouths. As shown in these figures, removing the facial expression is clearly one of the results of the proposed pixel-by-pixel alignment method, which can greatly improve the recognition task. Moreover, faces (b), (c), and (e) in Fig. 10 show that this method can also change the pose of faces to the pose of the reference contours. Similarly, in Fig. 11, faces (a), (b), (c), and (d) have a frontal pose after warping. Clearly, in all the warped faces in Figs. 10 and 11, not only are the different organs of the face aligned, but other features of the face are also almost aligned. However, due to the drawback of the landmark detection method in converging to exact landmark points, some features may not become well aligned. For instance, in Fig. 10, the eyes are not completely open in the warped faces (a), (b), (c), (e), and (f).

C. Experiments

In all the experiments reported in this section except Section 5.C.4, the dataset is first shuffled randomly and then fivefold cross validation is performed (the gallery set serves as the training set here). In the following, the impact of patch size is reported and analyzed. Thereafter, classification using the ensemble of patches is compared to classification using the whole face. Finally, the proposed method is examined and compared to eye-aligned classification.

1. Experiment on the Size of Patches

In this experiment, the effect of patch size is reported. Different experiments on the Yale dataset were performed with different patch sizes, namely 10×10, 20×20, 30×30, 40×40, 50×50, and random-sized patches each with one of the mentioned sizes. In these experiments, 80 random patches were utilized, and the optimal weight of the geometrical information was found to be 0.2 through trial and error (it is preferable for the weight of the geometrical information to be smaller than the weight of the intensity information because, as explained previously, the intensity information is purer and cleaner).

In each iteration of the experiments, the similarity score between every pair of gallery and probe images is calculated and the receiver operating characteristic (ROC) curve is plotted using all the scores. The ROC curves of the experiments are depicted in Fig. 12. As is evident in this figure, the size of the patches has an important impact on the recognition rate. According to the curves, 30×30 patches perform best; therefore, only 30×30 patches are used in the subsequent experiments.
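For reference, the ROC curve can be obtained from the pairwise scores roughly as follows; the use of scikit-learn's `roc_curve` is an implementation choice of the sketch, not something specified in the paper.

```python
import numpy as np
from sklearn.metrics import roc_curve

def roc_from_scores(scores, same_identity):
    """Build an ROC curve from pairwise similarity scores.

    scores        : (n_pairs,) fused similarity between gallery/probe pairs
    same_identity : (n_pairs,) boolean array, True for genuine pairs
    Returns (FAR, verification rate, thresholds); the rate at a FAR of 0.001,
    for example, can then be read off the curve.
    """
    far, tpr, thr = roc_curve(same_identity.astype(int), scores)
    return far, tpr, thr
```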

Fig. 12. Effect of size of patches in classification using the ensemble of patches.

2. Patch-Based Recognition Using Eye-Aligned and Pixel-Aligned Faces

Several other experiments were performed to evaluate the effect of using the ensemble of patches for both eye-aligned and pixel-aligned face images. In these experiments, 80 random patches of size 30×30 were utilized. First, classification using the ensemble of patches and classification without patches were tested on eye-aligned images. Note that in classification using the ensemble of patches for eye-aligned faces, the ensemble was applied solely to the intensity matrix of the eye-aligned images, because no warping is performed and thus no geometrical information exists. Figure 13 shows the ROC curves of the two experiments performed on the Yale dataset. As can be seen in this figure, using the ensemble of patches results in overall worse performance than not using patches when the eye-aligned method is utilized.

Fig. 13. Comparison of classification using the whole face or the ensemble of patches.

On the other hand, the same two experiments were performed using pixel-aligned faces rather than eye-aligned ones. The ROC curves of these experiments on the Yale dataset are also depicted in Fig. 13. As is evident in this figure, when pixel-aligned faces are used, patch-based recognition produces superior results compared to not using patches. For instance, at a false acceptance rate (FAR) of 0.001, the verification rates are roughly 99% and 94% with and without the ensemble of patches, respectively. This result verifies the effectiveness of using the ensemble of patches alongside pixel-to-pixel warped faces.

3. Eye-Aligned versus Proposed Method

Eye-aligned face recognition is compared with the proposed pixel-aligned classification method in Fig. 14 and Table 1. This comparison is performed on four datasets: the Yale [36], AT&T [44], Cohn-Kanade [38,39], and LFW [46,47] datasets. The LFW dataset is very challenging and large and includes images that might contain more than one face, yet only one subject label is associated with each image. For this dataset, the faces in each image were detected using the CLNF method [34]. If several faces were detected in an image, the face with the largest area (the product of face height and width) and the minimum distance from the center of the image was extracted as the main face. Thereafter, the main face was cropped out of the image.

Fig. 14. Comparison of the proposed method with eye-aligned face recognition on the (a) Yale dataset, (b) AT&T dataset, (c) Cohn-Kanade dataset, and (d) LFW dataset.

Table 1. Results of the Proposed Method and Eye-Aligned Face Recognition in Specific False Alarm Rates

As can be seen in the ROC curves of Fig. 14 and in Table 1, the proposed method significantly outperforms eye-aligned face recognition, with a substantial improvement on the Yale, AT&T, and LFW datasets. Note that LFW is a very challenging and large dataset, while the Yale and AT&T datasets are two well-known medium-sized datasets. The proposed method performs very well on both large and small datasets, showing its power and effectiveness across different types of datasets.

In the Cohn-Kanade dataset, however, the eye-aligned method performs slightly better than the proposed method, although the ROC curves show that the rate of the proposed method is almost as high as that of eye-aligned face recognition. The reason for this failure is that the CLNF method [34], which was used for warping, did not precisely detect very open mouths in extremely surprised faces. Thus, warping could not be performed successfully because of the imprecisely detected landmarks. Therefore, this failure is not due to a weakness of the proposed method, but due to not having correct and accurate landmarks as input.

4. Comparison with the State of the Art on the LFW Dataset

A standard 10-fold cross-validation protocol exists for the LFW dataset [46], in which match and mismatch pairs of facial images are used for the experiments. This default cross-validation protocol allows works on this dataset to be compared fairly. The proposed alignment method is evaluated on the LFW dataset using this standard cross-validation protocol. The ROC curves are shown in Fig. 15. As expected, the proposed alignment method substantially outperforms eye alignment.

Fig. 15. ROC curves on the LFW dataset using the standard cross-validation protocol of the dataset.

Table 2 compares our work with state-of-the-art results on the challenging LFW dataset. As can be seen in this table, our proposed alignment outperforms the methods of [20,48,49], and [50]. Moreover, compared to the methods of [51,52], and [53], this work achieves a fairly acceptable result. Also note that the main focus of this work is not on the classifier but on the alignment of faces and the separation of intensity and geometry information. A simple patch-based LDA classifier is used in this work for classification, while a more complex classifier could yield better results when applied on top of the proposed alignment method.

Table 2. Comparison of the Proposed Method to the State-of-the-Art Results on the LFW Dataset

6. DISCUSSION

A. Discussion on Alignment of Features

The most important contribution of this work is a method for fine alignment of the intensity features of the face, which as a by-product also yields an accurate extraction of the geometry information of the face. In other words, the proposed alignment method places the intensities of similar organs at the same positions in the warped faces. On the other hand, once the intensities are properly aligned, the coordinates of the aligned pixels can be extracted using the proposed geometry extraction method as the corresponding geometry information. As a result, pixel alignment provides both intensity and geometry information useful for recognition. Note that since alignment of features usually improves recognition significantly, we proposed a possible solution for aligning the intensities of faces whose prerequisite is separating the intensity and geometry information, which also helps the classifiers discriminate much more easily.

B. Discussion on Classification Using Ensemble of Patches

As shown experimentally in Section 5.C.2 and in Fig. 13, for warped faces, classification using the ensemble of patches enhances performance in comparison to classification using the whole face. However, this enhancement is not seen for eye-aligned (not warped) faces. This might be because, in warped faces, every pixel corresponds to a specific region (such as the lip corners) in all faces, whereas this is not true in eye-aligned images. Therefore, in warped faces every patch covers similar and corresponding pixels in all faces, but it may cover unrelated pixels in eye-aligned ones. This explains why using patches worsens the result for eye-aligned faces while improving it for warped faces.

7. CONCLUSION AND FUTURE DIRECTION

In this paper, a pixel-level facial alignment method, i.e., a method to align all pixels of the faces, is proposed. This alignment is achieved by mapping the face geometry onto a reference geometry, where the mapping is guided by contours of facial landmarks which are fitted to each face using landmark detection methods, such as CLNF [34]. The proposed alignment method provides both the aligned intensity information and the corresponding geometry information. The resulting aligned intensity and geometry features produce superior recognition results when used in a patch-based recognition framework.

The experiments were performed on four well-known datasets, and Fisherfaces [21] were used as an instance of a holistic face recognition method. The results showed significantly better performance of patch-based pixel-aligned face recognition in comparison to eye-aligned face recognition on all utilized datasets (except on the Cohn-Kanade dataset, with a slight rate difference). The reason for this exception is that the landmark detection, which is not a contribution of this work, did not work properly for extreme expressions, resulting in poor warping. The proposed method guarantees better performance in comparison to eye-aligned face recognition when the landmarks are detected properly.

As already mentioned, the geometrical information includes personal face geometry as well as pose and expression, the latter two of which are not required, or are even detrimental, for face identification. The pose and expression can be found using any pose/expression detector (e.g., see [54] for pose detection). By finding the pose or expression of a face, this information can be excluded from the geometrical information, further enhancing the recognition rates. This is a possible future direction of our work. Also note that the proposed method, although having acceptable robustness to pose, might face problems with very strong poses. Although landmarks are assumed to be provided as input to the proposed method, another possible future direction is to mirror the intensity/geometry information of the opposite landmarks to compensate for hidden landmarks in strong poses.

Funding

Sharif University of Technology Office of Research.

REFERENCES

1. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: a literature survey,” ACM Comput. Surv. 35, 399–458 (2003). [CrossRef]  

2. M. D. Kelly, Visual Identification of People by Computer (Stanford University of California, 1970).

3. T. Kanade, Computer Recognition of Human Faces (Birkhäuser, 1977), Vol. 47.

4. R. Brunelli and T. Poggio, “Face recognition: features versus templates,” IEEE Trans. Pattern Anal. Mach. Intell. 15, 1042–1052 (1993). [CrossRef]  

5. A. V. Nefian and M. H. Hayes, “Hidden Markov models for face recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (IEEE, 1998), Vol. 5, pp. 2721–2724.

6. F. Samaria and S. Young, “Hmm-based architecture for face identification,” Image Vision Comput. 12, 537–543 (1994). [CrossRef]  

7. C. Ding, C. Xu, and D. Tao, “Multi-task pose-invariant face recognition,” IEEE Trans. Image Process. 24, 980–993 (2015). [CrossRef]  

8. D. Chen, X. Cao, F. Wen, and J. Sun, “Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 3025–3032.

9. A. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for face recognition,” Comput. Vision Pattern Recognit. 94, 84–91 (1994).

10. P. S. Penev and J. J. Atick, “Local feature analysis: a general statistical theory for object representation,” Network 7, 477–500 (1996). [CrossRef]  

11. T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in Proceedings of the European Conference on Computer Vision (Springer, 1998), Vol. 2, pp. 484–498.

12. T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Trans. Pattern Anal. Mach. Intell. 23, 681–685 (2001). [CrossRef]  

13. A. Lanitis, C. J. Taylor, and T. F. Cootes, “Automatic face identification system using flexible appearance models,” Image Vision Comput. 13, 393–401 (1995). [CrossRef]  

14. M. B. Stegmann, “Analysis and segmentation of face images using point annotations and linear subspace techniques,” Tech. Rep. IMM-REP-2002-22 (2002).

15. G. J. Edwards, T. F. Cootes, and C. J. Taylor, “Face recognition using active appearance models,” in European Conference on Computer Vision (Springer, 1998), pp. 581–595.

16. A. Lanitis, C. J. Taylor, and T. F. Cootes, “A unified approach to coding and interpreting face images,” in Proceedings Fifth International Conference on Computer Vision (IEEE, 1995), pp. 368–373.

17. T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Comput. Vis. Image Underst. 61, 38–59 (1995). [CrossRef]  

18. I. Craw and P. Cameron, “Face recognition by computer,” in British Machine Vision Conference (1992), pp. 1–10.

19. M. Turk and A. Pentland, “Eigenfaces for recognition,” J. Cogn. Neurosci. 3, 71–86 (1991).

20. M. A. Turk and A. P. Pentland, “Face recognition using eigenfaces,” in Proceedings IEEE Computer Society Conference on Computer Vision and Pattern Recognition” (IEEE, 1991), pp. 586–591.

21. P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, “Eigenfaces vs. fisherfaces: recognition using class specific linear projection,” IEEE Trans. Pattern Anal. Mach. Intell. 19, 711–720 (1997). [CrossRef]  

22. J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face recognition using kernel direct discriminant analysis algorithms,” IEEE Trans. Neural Networks 14, 117–126 (2003). [CrossRef]  

23. J. Lu, K. Plataniotis, and A. Venetsanopoulos, “Kernel discriminant learning with application to face recognition,” in Support Vector Machines: Theory and Applications (Springer, 2005), pp. 275–296.

24. P. J. Phillips, “Support vector machines applied to face recognition,” in Advances in Neural Information Processing Systems (1999), pp. 803–809.

25. B. Moghaddam, T. Jebara, and A. Pentland, “Bayesian face recognition,” Pattern Recognit. 33, 1771–1782 (2000). [CrossRef]  

26. M. M. Kasar, D. Bhattacharyya, and T.-H. Kim, “Face recognition using neural network: a review,” Int. J. Security Appl. 10, 81–100 (2016). [CrossRef]  

27. S.-H. Lin, S.-Y. Kung, and L.-J. Lin, “Face recognition/detection by probabilistic decision-based neural network,” IEEE Trans. Neural Networks 8, 114–132 (1997). [CrossRef]  

28. M. O. Simón, C. Corneanu, K. Nasrollahi, O. Nikisins, S. Escalera, Y. Sun, H. Li, Z. Sun, T. B. Moeslund, and M. Greitans, “Improved rgb-dt based face recognition,” IET Biometrics 5, 297–303 (2016). [CrossRef]  

29. W. AbdAlmageed, Y. Wu, S. Rawls, S. Harel, T. Hassner, I. Masi, J. Choi, J. Lekust, J. Kim, P. Natarajan, R. Nevatia, and G. Medioni, “Face recognition using deep multi-pose representations,” in IEEE Winter Conference on Applications of Computer Vision (WACV) (IEEE, 2016), pp. 1–9.

30. O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference (2015), Vol. 1, pp. 6.

31. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: closing the gap to human-level performance in face verification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 1701–1708.

32. F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: a unified embedding for face recognition and clustering,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 815–823.

33. J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Robust face recognition via sparse representation,” IEEE Trans. Pattern Anal. Mach. Intell. 31, 210–227 (2009). [CrossRef]  

34. T. Baltrusaitis, P. Robinson, and L.-P. Morency, “Constrained local neural fields for robust facial landmark detection in the wild,” in Proceedings of the IEEE International Conference on Computer Vision Workshops (2013), pp. 354–361.

35. D. Cristinacce and T. F. Cootes, “Feature detection and tracking with constrained local models,” in British Machine Vision Conference (2006), Vol. 1, pp. 3.

36. “Yale Face Dataset,” http://cvc.cs.yale.edu/cvc/projects/yalefaces/yalefaces.html.

37. B. Gärtner and M. Hoffmann, Computational Geometry–Lecture Notes HS 2013 (ETH Zürich University, 2014), Chap. 6.

38. “Cohn-Kanade Face Dataset,” http://www.pitt.edu/~emotion/ck-spread.htm.

39. T. Kanade, J. F. Cohn, and Y. Tian, “Comprehensive database for facial expression analysis,” in Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition (IEEE, 2000), pp. 46–53.

40. T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2002).

41. C. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer, 2007).

42. V. Perlibakas, “Distance measures for PCA-based face recognition,” Pattern Recognit. Lett. 25, 711–724 (2004). [CrossRef]  

43. H. Mohammadzade and D. Hatzinakos, “Projection into expression subspaces for face recognition from single sample per person,” IEEE Trans. Affective Comput. 4, 69–82 (2013). [CrossRef]  

44. “AT&T Face Dataset,” http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.

45. F. S. Samaria and A. C. Harter, “Parameterisation of a stochastic model for human face identification,” in Proceedings of the Second IEEE Workshop on Applications of Computer Vision, 1994 (IEEE, 1994), pp. 138–142.

46. “LFW Face Dataset,” http://vis-www.cs.umass.edu/lfw/.

47. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, “Labeled faces in the wild: a database for studying face recognition in unconstrained environments,” Tech. Rep. 07-49 (University of Massachusetts, 2007).

48. N. Pinto, J. J. DiCarlo, and D. D. Cox, “How far can you get with a modern face recognition test set using only simple features?” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, 2009), pp. 2591–2598.

49. L. Wolf, T. Hassner, and Y. Taigman, “Descriptor based methods in the wild,” in Workshop on Faces in ‘Real-Life’ Images: Detection, Alignment, and Recognition (2008).

50. D. Yi, Z. Lei, and S. Z. Li, “Towards pose robust face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013), pp. 3539–3545.

51. F. Juefei-Xu, K. Luu, and M. Savvides, “Spartans: single-sample periocular-based alignment-robust recognition technique applied to non-frontal scenarios,” IEEE Trans. Image Process. 24, 4780–4795 (2015). [CrossRef]  

52. C. Sagonas, Y. Panagakis, S. Zafeiriou, and M. Pantic, “Robust statistical frontalization of human and animal faces,” Int. J. Comput. Vis. 122, 270–291 (2017). [CrossRef]  

53. H. Li and G. Hua, “Hierarchical-pep model for real-world face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4055–4064.

54. X. Zhang and Y. Gao, “Face recognition across pose: a review,” Pattern Recognit. 42, 2876–2896 (2009). [CrossRef]  
