{"title": "Incremental Boosting Convolutional Neural Network for Facial Action Unit Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 109, "page_last": 117, "abstract": "Recognizing facial action units (AUs) from spontaneous facial expressions is still a challenging problem. Most recently, CNNs have shown promise on facial AU recognition. However, the learned CNNs are often overfitted and do not generalize well to unseen subjects due to limited AU-coded training images. We proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on four benchmark AU databases have demonstrated that the IB-CNN yields significant improvement over the traditional CNN and the boosting CNN without incremental learning, as well as outperforming the state-of-the-art CNN-based methods in AU recognition. The improvement is more impressive for the AUs that have the lowest frequencies in the databases.", "full_text": "Incremental Boosting Convolutional Neural Network\n\nfor Facial Action Unit Recognition\n\nShizhong Han, Zibo Meng, Ahmed Shehab Khan, Yan Tong\n\nDepartment of Computer Science & Engineering, University of South Carolina, Columbia, SC\n\n{han38, mengz, akhan}@email.sc.edu, tongy@cse.sc.edu\n\nAbstract\n\nRecognizing facial action units (AUs) from spontaneous facial expressions is still\na challenging problem. Most recently, CNNs have shown promise on facial AU\nrecognition. However, the learned CNNs are often over\ufb01tted and do not gener-\nalize well to unseen subjects due to limited AU-coded training images. 
We proposed a novel Incremental Boosting CNN (IB-CNN) to integrate boosting into the CNN via an incremental boosting layer that selects discriminative neurons from the lower layer and is incrementally updated on successive mini-batches. In addition, a novel loss function that accounts for errors from both the incremental boosted classifier and individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on four benchmark AU databases have demonstrated that the IB-CNN yields significant improvement over the traditional CNN and the boosting CNN without incremental learning, as well as outperforming the state-of-the-art CNN-based methods in AU recognition. The improvement is more impressive for the AUs that have the lowest frequencies in the databases.

1 Introduction

Facial behavior is a powerful means to express emotions and to perceive the intentions of a human. Developed by Ekman and Friesen [1], the Facial Action Coding System (FACS) describes facial behavior as combinations of facial action units (AUs), each of which is anatomically related to the contraction of a set of facial muscles. In addition to applications in human behavior analysis, an automatic AU recognition system has great potential to advance emerging applications in human-computer interaction (HCI), such as online/remote education, interactive games, and intelligent transportation, as well as to push the frontier of research in psychology.

Recognizing facial AUs from spontaneous facial expressions is challenging because of subtle facial appearance changes, free head movements, and occlusions, as well as limited AU-coded training images. As elaborated in the survey papers [2, 3], a number of approaches have been developed to extract features from videos or static images to characterize facial appearance or geometrical changes caused by target AUs.
Most of them employed hand-crafted features, which, however, are not designed and optimized for facial AU recognition. Most recently, CNNs have achieved incredible success in different applications such as object detection and categorization and video analysis, and have shown promise on facial expression and AU recognition [4, 5, 6, 7, 8, 9, 10].

CNNs contain a large number of parameters, especially as the network becomes deeper. To achieve satisfactory performance, a large number of training images are required, and a mini-batch strategy is used to deal with large training data, where a small batch of images is employed in each iteration. In contrast to the millions of training images employed in object categorization and detection, AU-coded training images are limited and usually collected from a small population, e.g., 48,000 images from 15 subjects in the FERA2015 SEMAINE database [11], and 130,814 images from 27 subjects in the Denver Intensity of Spontaneous Facial Action (DISFA) database [12]. As a result, the learned CNNs are often overfitted and do not generalize well to unseen subjects.

Boosting, e.g., AdaBoost, is a popular ensemble learning technique, which combines many "weak" classifiers and has been demonstrated to yield better generalization performance in AU recognition [13]. Boosting can be integrated into the CNN such that discriminative neurons are selected and activated in each iteration of CNN learning. However, the boosting CNN (B-CNN) can overfit due to the limited training data in each mini-batch. Furthermore, the information captured in the previous iteration/batch cannot be propagated, i.e., a new set of weak classifiers is selected in every iteration and the weak classifiers learned previously are discarded.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: An overview of the Incremental Boosting CNN. An incremental boosted classifier is trained iteratively. Outputs of the FC layer are employed as input features and a subset of features (the blue nodes) are selected by boosting. The selected features in the current iteration are combined with those selected previously (the red nodes) to form an incremental strong classifier. A loss is calculated based on the incremental classifier and propagated backward to fine-tune the CNN parameters. The gray nodes are inactive and thus not selected by the incremental strong classifier. Given a testing image, features are calculated via the CNN and fed to the boosted classifier to predict the AU label. Best viewed in color.

Inspired by incremental learning, we proposed a novel Incremental Boosting CNN (IB-CNN), which aims to accumulate information in B-CNN learning when new training samples appear. As shown in Figure 1, a batch of images is employed in each iteration of CNN learning. The outputs of the fully-connected (FC) layer are employed as features; a subset of features (the blue nodes), which is discriminative for recognizing the target AU in the current batch, is selected by boosting. Then, these selected features are combined with the ones selected previously (the red nodes) to form an incremental strong classifier. The weights of active features, i.e., both the blue and the red nodes, are updated such that the features selected most of the time have higher weights. Finally, a loss, i.e., the overall classification error from both weak classifiers and the incremental strong classifier, is calculated and backpropagated to fine-tune the CNN iteratively.
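The iterative loop just described (select discriminative neurons by boosting on the current batch, merge them with previously selected neurons via a running average, then backpropagate a loss) can be sketched as follows. This is a toy illustration with assumed names: the CNN is replaced by fixed random features and the AdaBoost selection by a simple correlation-based pick, purely to show the bookkeeping of cumulative weights.

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, T, n_select = 8, 16, 5, 3      # features, batch size, batches, picks per batch
alpha = np.zeros(K)                   # cumulative weak-classifier weights

for t in range(1, T + 1):
    X = rng.normal(size=(M, K))       # stand-in for FC-layer activations
    y = rng.choice([-1.0, 1.0], M)    # AU present / absent labels
    # crude stand-in for the boosting step: pick the features most
    # correlated with the labels in this mini-batch
    score = np.abs(X.T @ y)
    picked = np.argsort(score)[-n_select:]
    alpha_hat = np.zeros(K)
    alpha_hat[picked] = 1.0 / n_select
    # incremental merge with previously selected neurons (running average,
    # so weights of neurons re-selected often keep growing)
    alpha = ((t - 1) * alpha + alpha_hat) / t
    # in the real IB-CNN, a loss on the incremental strong classifier
    # would be backpropagated into the CNN here; omitted in this sketch
```

Because each per-batch weight vector sums to 1, the running average keeps the cumulative weights summing to 1 while concentrating mass on neurons that are selected in most batches.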
The proposed IB-CNN has a complex decision boundary due to boosting and is capable of alleviating the overfitting problem for the mini-batches by taking advantage of incremental learning.

In summary, this paper has three major contributions. (1) Feature selection and classification are integrated with CNN optimization in a boosting CNN framework. (2) A novel incremental boosted classifier is updated iteratively by accumulating information from multiple batches. (3) A novel loss function, which considers the overall classification error of the incremental strong classifier and individual classification errors of weak learners, is developed to fine-tune the IB-CNN.

Experimental results on four benchmark AU-coded databases, i.e., the Cohn-Kanade (CK) database [25], the FERA2015 SEMAINE database [11], the FERA2015 BP4D database [11], and the Denver Intensity of Spontaneous Facial Action (DISFA) database [12], have demonstrated that the proposed IB-CNN significantly outperforms the traditional CNN model as well as the state-of-the-art CNN-based methods for AU recognition. Furthermore, the performance improvement on the infrequent AUs is more impressive, which demonstrates that the proposed IB-CNN is capable of improving CNN learning with limited training data. In addition, the performance of the IB-CNN is not sensitive to the number of neurons in the FC layer or the learning rate, which are favored traits in CNN learning.

2 Related Work

As detailed in the survey papers [2, 3], various human-designed features are adopted in recognizing facial expressions and AUs, including Gabor wavelets [13], Local Binary Patterns (LBP) [14], Histogram of Oriented Gradients (HOG) [15], Scale Invariant Feature Transform (SIFT) features [16], Histograms of Local Phase Quantization (LPQ) [17], and their spatiotemporal extensions [17, 18, 19].
Recently, feature learning approaches including sparse coding [20] and deep learning [4, 5, 6, 7, 8, 9, 10, 21] have been devoted to recognizing facial expressions and AUs.

Among the feature learning based methods, CNNs [4, 5, 6, 7, 8, 9, 10] have attracted increasing attention. Gudi et al. [9] used a pre-processing method with local and global contrast normalization to improve the inputs of CNNs. Fasel [4] employed multi-size convolutional filters to learn multi-scale features. Liu et al. [7] extracted spatiotemporal features using the 3D CNN. Jung et al. [8] jointly fine-tuned temporal appearance and geometry features. Jaiswal and Valstar [10] integrated bidirectional long short-term memory neural networks with the CNN to extract temporal features.

Most CNN-based methods make decisions using the inner product of the FC layer. A few approaches developed new objective functions to improve recognition performance. Tang [22, 6] replaced the softmax loss function with an SVM for optimization. Hinton et al. [23] utilized a dropout technique to reduce overfitting by dropping out some neuron activations from the previous layer, which can be seen as an ensemble of networks sharing the same weights. However, the dropout process is random regardless of the discriminative power of individual neurons. In contrast, the proposed IB-CNN effectively selects the more discriminative neurons and drops out noisy or redundant neurons.

Medera and Babinec [24] adopted incremental learning using multiple CNNs trained individually from different subsets, and additional CNNs are trained given new samples. Then, the prediction is calculated by weighted majority-voting of the outputs of all CNNs. However, each CNN may not have sufficient training data, which is especially true with limited AU-coded data.
Different from [24], the IB-CNN has only one CNN trained along with an incremental strong classifier, where weak learners are updated over time by accumulating information from multiple batches. Liu et al. [21] proposed a boosted deep belief network for facial expression recognition, where each weak classifier is learned exclusively from an image patch. In contrast, weak classifiers are selected from an FC layer in the proposed IB-CNN and thus learned from the whole face.

3 Methodology

As illustrated in Figure 1, an IB-CNN model is proposed to integrate boosting with the CNN at the decision layer with an incremental boosting algorithm, which selects and updates weak learners over time as well as constructs an incremental strong classifier in an online learning manner. There are three major steps for incremental boosting: selecting and activating neurons (blue nodes) from the FC layer by boosting, combining the activated neurons from different batches (blue and red nodes) to form an incremental strong classifier, and fine-tuning the IB-CNN by minimizing the proposed loss function. In the following, we start with a brief review of CNNs and then describe the three steps of incremental boosting in detail.

3.1 A Brief Review of CNNs

A CNN consists of a stack of layers such as convolutional layers, pooling layers, rectification layers, FC layers, and a decision layer, and transforms the input data into a highly nonlinear representation. Ideally, learned filters should activate the image patches related to the recognition task, i.e., detecting AUs in this work. Neurons in an FC layer have full connections with all activations in the previous layer. Finally, high-level reasoning is done at the decision layer, where the number of outputs is equal to the number of target classes.
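As a concrete reference point for the decision layer just described, here is a minimal sketch (with assumed toy numbers, not taken from the paper) of the standard score computation: an inner product of the FC activations with one weight vector per target class.

```python
import numpy as np

def decision_layer(fc_activations, W, b):
    """Standard decision layer: one inner-product score per target class."""
    return fc_activations @ W + b

x = np.array([0.2, -1.0, 0.7])        # FC-layer activations (K = 3)
W = np.array([[ 0.5, -0.5],
              [ 0.1,  0.3],
              [-0.2,  0.6]])          # K x C weight matrix for C = 2 classes
scores = decision_layer(x, W, np.zeros(2))
```

The IB-CNN keeps the layers up to the FC layer but swaps this inner-product score for a boosting score, as developed in Section 3.2.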
The score function used by the decision layer is generally the inner product of the activations in the FC layer and the corresponding weights. During CNN training, a loss layer is employed after the decision layer to specify how to penalize the deviations between the predicted and true labels, where different types of loss functions have been employed, such as softmax, SVM, and sigmoid cross entropy. In this paper, we substitute the inner-product score function with a boosting score function to achieve a complex decision boundary.

3.2 Boosting CNN

In a CNN, a mini-batch strategy is often used to handle large training data. Let X = [x_1, ..., x_M] be the activation features of a batch with M training images, where the dimension of the activation feature vector x_i is K, and y = [y_1, ..., y_M], y_i ∈ {−1, 1}, is a vector storing the ground-truth labels. With the boosting algorithm, the prediction is calculated by a strong classifier H(·) that is the weighted summation of weak classifiers h(·) as follows:

H(x_i) = \sum_{j=1}^{K} \alpha_j h(x_{ij}, \lambda_j); \quad h(x_{ij}, \lambda_j) = \frac{f(x_{ij}, \lambda_j)}{\sqrt{f(x_{ij}, \lambda_j)^2 + \eta^2}}    (1)

where x_{ij} ∈ x_i is the jth activation feature of the ith image. Each feature corresponds to a candidate weak classifier h(x_{ij}, λ_j) with output in the range of (−1, 1). The term f(·)/√(f(·)² + η²) is used to simulate a sign(·) function so that the derivative can be computed for gradient descent optimization. In this work, f(x_{ij}, λ_j) ∈ R is defined as a one-level decision tree (a decision stump) with the threshold λ_j, which has been widely used in AdaBoost. The parameter η in Eq. 1 controls the slope of the function f(·)/√(f(·)² + η²) and can be set according to the distribution of f(·) as η = σ/c, where σ is the standard deviation of f(·) and c is a constant. In this work, η is empirically set to σ/2. α_j ≥ 0 is the weight of the jth weak classifier and \sum_{j=1}^{K} α_j = 1. When α_j = 0, the corresponding neuron is inactive and will not go through the feedforward and backpropagation process.

Traditional boosting algorithms only consider the loss of the strong classifier, which can be dominated by some weak classifiers with large weights, potentially leading to overfitting. To account for classification errors from both the strong classifier and the individual classifiers, the loss function is defined as the summation of a strong-classifier loss and a weak-classifier loss as follows:

\epsilon^B = \beta \epsilon^B_{strong} + (1 − \beta) \epsilon_{weak}    (2)

where β ∈ [0, 1] balances the strong-classifier loss and the weak-classifier loss.

The strong-classifier loss is defined as the Euclidean distance between the prediction and the ground-truth label:

\epsilon^B_{strong} = \frac{1}{M} \sum_{i=1}^{M} (H(x_i) − y_i)^2    (3)

The weak-classifier loss is defined as the summation of the individual losses of all weak classifiers:

\epsilon_{weak} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{1 \le j \le K, \; \alpha_j > 0} [h(x_{ij}, \lambda_j) − y_i]^2    (4)

where the constraint α_j > 0 excludes inactive neurons when calculating the loss and N is the number of active weak classifiers.

Driven by the loss ε^B defined in Eq. 2, the B-CNN can be iteratively fine-tuned by backpropagation as illustrated in the top of Figure 2. However, the information captured previously, e.g., the weights and thresholds of the active neurons, is discarded for a new batch. Due to limited data in each mini-batch, the trained B-CNN can be overfitted.

3.3 Incremental Boosting

Figure 2: A comparison of the IB-CNN and the B-CNN structures.
For clarity, the illustration of the IB-CNN or B-CNN starts from the FC layer (the cyan nodes). The blue nodes are active nodes selected in the current iteration; the red nodes are the active nodes selected from previous iterations; and the gray nodes are inactive.

Incremental learning can help to improve the prediction performance and to reduce overfitting. As illustrated in the bottom of Figure 2, both the blue nodes selected in the current iteration and the red nodes selected previously are incrementally combined to form an incremental strong classifier H^t_I at the tth iteration:

H^t_I(x^t_i) = \frac{(t − 1) H^{t−1}_I(x^{t−1}_i) + H^t(x^t_i)}{t}    (5)

where H^{t−1}_I(x^{t−1}_i) is the incremental strong classifier obtained at the (t − 1)th iteration; and H^t(x^t_i) is the boosted strong classifier estimated in the current iteration.

Substituting Eq. 1 into Eq. 5, we have

H^t_I(x^t_i) = \sum_{j=1}^{K} \alpha^t_j h^t(x^t_{ij}; \lambda^t_j); \quad \alpha^t_j = \frac{(t − 1)\alpha^{t−1}_j + \hat{\alpha}^t_j}{t}    (6)

where \hat{\alpha}^t_j is the weak classifier weight calculated in the tth iteration by boosting and α^t_j is the cumulative weight considering previous iterations. As shown in Figure 3, h^{t−1}(·) has been updated to h^t(·) by updating the threshold λ^{t−1}_j to λ^t_j. If the jth weak classifier was not selected before, λ^t_j is estimated in the tth iteration by boosting. Otherwise, λ^t_j will be updated from the previous iteration after backpropagation as follows:

\lambda^t_j = \lambda^{t−1}_j − \gamma \frac{\partial \epsilon_{H^{t−1}_I}}{\partial \lambda^{t−1}_j}    (7)

where γ is the learning rate.

Algorithm 1 Incremental Boosting Algorithm for the IB-CNN
Input: The number of iterations (mini-batches) T and activation features X with the size of M × K, where M is the number of images in a mini-batch and K is the dimension of the activation feature vector for one image.
1: for each input activation j from 1 to K do
2:   α^1_j = 0
3: end for
4: for each mini-batch t from 1 to T do
5:   Feed-forward to the fully connected layer;
6:   Select active features by boosting and calculate weights \hat{α}^t based on the standard AdaBoost;
7:   Update the incremental strong classifier as Eq. 6;
8:   Calculate the overall loss of the IB-CNN as Eq. 8;
9:   Backpropagate the loss based on Eq. 9 and Eq. 10;
10:  Continue backpropagation to lower layers.
11: end for

Then, the incremental strong classifier H^t_I is updated over time. As illustrated in Figure 3, if a neuron is activated in the current iteration, the corresponding weight will increase; otherwise, it will decrease. The summation of the weights of all weak classifiers will be normalized to 1. Hence, the weak classifiers selected most of the time, i.e., effective for most of the mini-batches, will have higher weights. Therefore, the overall loss of the IB-CNN is calculated as

\epsilon^{IB} = \beta \epsilon^{IB}_{strong} + (1 − \beta) \epsilon_{weak}    (8)

where \epsilon^{IB}_{strong} = \frac{1}{M} \sum_{i=1}^{M} (H^t_I(x^t_i) − y^t_i)^2.

Figure 3: An illustration of constructing the incremental strong classifier. Squares represent neuron activations. The gray nodes are inactive; while the blue and red nodes are active nodes selected in the current iteration and previous iterations, respectively.

Compared to the B-CNN, the IB-CNN exploits the information from all mini-batches.
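The pieces above can be condensed into a small numerical sketch. This is our illustrative NumPy code, not the authors' implementation; function names are assumed, and the per-batch boosting weights are taken as given rather than computed by AdaBoost.

```python
import numpy as np

def weak(x, lam, eta):
    """h(x; lambda): decision stump f = x - lambda squashed by
    f / sqrt(f^2 + eta^2), a differentiable surrogate of sign(f) in (-1, 1)."""
    f = x - lam
    return f / np.sqrt(f ** 2 + eta ** 2)

def strong(X, lam, alpha, eta):
    """H(x_i) = sum_j alpha_j * h(x_ij; lambda_j) for a batch X of shape (M, K)."""
    return weak(X, lam, eta) @ alpha

def ib_loss(X, y, lam, alpha, eta, beta=0.5):
    """Eq. 8: beta * strong-classifier loss + (1 - beta) * weak-classifier
    loss; inactive weak classifiers (alpha_j = 0) are excluded (Eq. 4)."""
    strong_err = np.mean((strong(X, lam, alpha, eta) - y) ** 2)
    act = alpha > 0
    weak_err = np.mean((weak(X[:, act], lam[act], eta) - y[:, None]) ** 2)
    return beta * strong_err + (1 - beta) * weak_err

def update_alpha(alpha_prev, alpha_hat, t):
    """Eq. 6: alpha_j^t = ((t - 1) * alpha_j^{t-1} + alpha_hat_j^t) / t,
    a running average over mini-batches; alpha_hat_j = 0 for neurons
    not selected in batch t."""
    return ((t - 1) * alpha_prev + alpha_hat) / t

# Toy run of the weight update over three mini-batches:
alpha = np.zeros(3)
for t, a_hat in enumerate([np.array([0.5, 0.5, 0.0]),   # batch 1 picks neurons 0, 1
                           np.array([0.5, 0.0, 0.5]),   # batch 2 picks neurons 0, 2
                           np.array([0.5, 0.5, 0.0])],  # batch 3 picks neurons 0, 1
                          start=1):
    alpha = update_alpha(alpha, a_hat, t)
```

After the three toy batches, the cumulative weights still sum to 1 and neuron 0, selected every time, carries the largest weight; in the real IB-CNN this update runs once per mini-batch and the loss is backpropagated into the CNN via Eqs. 9 and 10.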
For testing, the IB-CNN uses the incremental strong classifier, while the B-CNN employs the strong classifier learned from the last iteration.

3.4 IB-CNN Fine-tuning

A stochastic gradient descent method is utilized for fine-tuning the IB-CNN, i.e., updating the IB-CNN parameters, by minimizing the loss in Eq. 8. The descent directions for x^t_{ij} and λ^t_j can be calculated as follows:

\frac{\partial \epsilon^{IB}}{\partial x^t_{ij}} = \beta \frac{\partial \epsilon^{IB}_{strong}}{\partial H^t_I(x^t_i)} \frac{\partial H^t_I(x^t_i)}{\partial x^t_{ij}} + (1 − \beta) \frac{\partial \epsilon_{weak}}{\partial h^t(x^t_{ij}; \lambda^t_j)} \frac{\partial h^t(x^t_{ij}; \lambda^t_j)}{\partial x^t_{ij}}    (9)

\frac{\partial \epsilon^{IB}}{\partial \lambda^t_j} = \beta \sum_{i=1}^{M} \frac{\partial \epsilon^{IB}_{strong}}{\partial H^t_I(x^t_i)} \frac{\partial H^t_I(x^t_i)}{\partial \lambda^t_j} + (1 − \beta) \sum_{i=1}^{M} \frac{\partial \epsilon_{weak}}{\partial h^t(x^t_{ij}; \lambda^t_j)} \frac{\partial h^t(x^t_{ij}; \lambda^t_j)}{\partial \lambda^t_j}    (10)

where ∂ε^{IB}/∂x^t_{ij} and ∂ε^{IB}/∂λ^t_j are only calculated for the active nodes of incremental boosting (the red and blue nodes in Figure 3). ∂ε^{IB}/∂x^t_{ij} can be further backpropagated to the lower FC layers and convolutional layers. The incremental boosting algorithm for the IB-CNN is summarized in Algorithm 1.

4 Experiments

To evaluate the effectiveness of the proposed IB-CNN model, extensive experiments have been conducted on four benchmark AU-coded databases. The CK database [25] contains 486 image sequences from 97 subjects and has been widely used for evaluating the performance of AU recognition. In addition, 14 AUs were annotated frame-by-frame [30] for training and evaluation. The FERA2015 SEMAINE database [11] contains 6 AUs and 31 subjects with 93,000 images. The FERA2015 BP4D database [11] has 11 AUs and 41 subjects with 146,847 images.
The DISFA database [12] has 12 labeled AUs and 27 subjects with 130,814 images.

4.1 Pre-Processing

Face alignment is conducted to reduce variation in face scale and in-plane rotation across different facial images. Specifically, the face regions are aligned based on three fiducial points, the centers of the two eyes and the mouth, and scaled to a size of 128 × 96. In order to alleviate face pose variations, especially out-of-plane rotations, face images are further warped to a frontal view based on landmarks that are less affected by facial expressions, including landmarks along the facial contour, the two eye centers, the nose tip, the mouth center, and on the forehead. A total of 23 landmarks that are less affected by facial muscle movements are selected as control points to warp the face region to the mean facial shape calculated from all images.¹

Time sequence normalization is used to reduce identity-related information and highlight appearance and geometrical changes due to the activation of AUs. Particularly, each image is normalized based on the mean and the standard deviation calculated from a short video sequence containing at least 800 continuous frames at a frame rate of 30 fps.²

4.2 CNN Implementation Details

The proposed IB-CNN is implemented based on a modification of cifar10_quick in Caffe [28]. As illustrated in Figure 1, the preprocessed facial images are fed into the network as input. The IB-CNN consists of three stacked convolutional layers with activation functions, two average pooling layers, an FC layer, and the proposed IB layer to predict the AU label. Specifically, the first two convolutional layers have 32 filters with a size of 5 × 5 and a stride of 1. Then, the output feature maps are sent to a rectified layer followed by the average pooling layer with a downsampling stride of 3.
The last convolutional layer has 64 filters with a size of 5 × 5, and the output 9 × 5 feature maps are fed into an FC layer with 128 nodes. The outputs of the FC layer are sent to the proposed IB layer. Stochastic gradient descent, with a momentum of 0.9 and a mini-batch size of 100, is used for training the CNN for each target AU.

4.3 Experimental Results

To demonstrate the effectiveness of the proposed IB-CNN, two baseline methods are employed for comparison. The first method, denoted as CNN, is a traditional CNN model with a sigmoid cross entropy decision layer. The second method, denoted as B-CNN, is the boosting CNN described in Section 3.2. Both CNN and B-CNN have the same architecture as the IB-CNN with different decision layers.

Performance evaluation on the SEMAINE database: All the models compared were trained on the training set and evaluated on the validation set. The training-testing process was repeated 5 times. The mean and standard deviation of the F1 score and the two-alternative forced choice (2AFC) score are calculated from the 5 runs for each target AU. As shown in Table 1, the proposed IB-CNN outperforms the traditional CNN in terms of the average F1 score (0.416 vs 0.347) and the average 2AFC score (0.775 vs 0.735). The IB-CNN also clearly outperforms the B-CNN: the average F1 score increases from 0.310 (B-CNN) to 0.416 (IB-CNN) and the average 2AFC score increases from 0.673 (B-CNN) to 0.775 (IB-CNN), thanks to incremental learning over time. In addition, the IB-CNN considering both strong- and weak-classifier losses outperforms the one with only the strong-classifier loss, denoted as IB-CNN-S. Note that the IB-CNN achieves a significant improvement for recognizing AU28 (lips suck), which has the least number of occurrences (around 1.25% positive samples) in the training set, from 0.280 (CNN) and 0.144 (B-CNN) to 0.490 (IB-CNN) in terms of F1 score.
The performance of the B-CNN is the worst for infrequent AUs due to the limited positive samples in each mini-batch. In contrast, the proposed IB-CNN improves CNN learning significantly with limited training data.

Table 1: Performance comparison of CNN, B-CNN, IB-CNN-S, and IB-CNN on the SEMAINE database in terms of F1 and 2AFC. The format is mean±std. PPos: percentage of positive samples in the training set.

AUs  | PPos  | CNN F1      | CNN 2AFC    | B-CNN F1    | B-CNN 2AFC  | IB-CNN-S F1 | IB-CNN-S 2AFC | IB-CNN F1   | IB-CNN 2AFC
AU2  | 13.5% | 0.314±0.065 | 0.715±0.076 | 0.241±0.073 | 0.646±0.060 | 0.414±0.016 | 0.812±0.010   | 0.410±0.024 | 0.820±0.009
AU12 | 17.6% | 0.508±0.023 | 0.751±0.009 | 0.555±0.007 | 0.746±0.013 | 0.549±0.016 | 0.773±0.007   | 0.539±0.013 | 0.777±0.005
AU17 | 1.9%  | 0.288±0.020 | 0.767±0.014 | 0.204±0.048 | 0.719±0.036 | 0.248±0.048 | 0.767±0.011   | 0.248±0.007 | 0.777±0.012
AU25 | 17.7% | 0.358±0.033 | 0.635±0.011 | 0.407±0.006 | 0.618±0.011 | 0.378±0.009 | 0.638±0.011   | 0.401±0.014 | 0.638±0.003
AU28 | 1.25% | 0.280±0.111 | 0.840±0.076 | 0.144±0.092 | 0.639±0.195 | 0.483±0.069 | 0.898±0.006   | 0.490±0.078 | 0.904±0.011
AU45 | 19.7% | 0.333±0.036 | 0.702±0.022 | 0.311±0.016 | 0.668±0.019 | 0.401±0.009 | 0.738±0.010   | 0.398±0.005 | 0.734±0.005
AVG  | -     | 0.347±0.026 | 0.735±0.014 | 0.310±0.015 | 0.673±0.028 | 0.412±0.018 | 0.771±0.003   | 0.416±0.018 | 0.775±0.004

¹For the CK, SEMAINE, and DISFA databases, 66 landmarks are detected [26] for face alignment and warping.
For the BP4D database, the 49 landmarks provided in the database are used for face alignment.

²Psychological studies show that each AU activation ranges from 48 to 800 frames at 30 fps [27].

Table 2: Performance comparison of CNN, B-CNN, and IB-CNN on the DISFA database in terms of F1 score and 2AFC score. The format is mean±std. PPos: percentage of positive samples in the whole database.

AUs  | PPos  | CNN F1      | CNN 2AFC    | B-CNN F1    | B-CNN 2AFC  | IB-CNN F1   | IB-CNN 2AFC
AU1  | 6.71% | 0.257±0.200 | 0.724±0.116 | 0.259±0.150 | 0.780±0.079 | 0.327±0.204 | 0.773±0.119
AU2  | 5.63% | 0.346±0.226 | 0.769±0.119 | 0.333±0.197 | 0.835±0.085 | 0.394±0.219 | 0.849±0.073
AU4  | 18.8% | 0.515±0.208 | 0.820±0.116 | 0.446±0.186 | 0.793±0.083 | 0.586±0.104 | 0.886±0.060
AU5  | 2.09% | 0.195±0.129 | 0.780±0.154 | 0.184±0.114 | 0.749±0.279 | 0.312±0.153 | 0.887±0.076
AU6  | 14.9% | 0.619±0.072 | 0.896±0.042 | 0.596±0.086 | 0.906±0.040 | 0.624±0.069 | 0.917±0.026
AU9  | 5.45% | 0.340±0.131 | 0.859±0.081 | 0.331±0.115 | 0.895±0.057 | 0.385±0.137 | 0.900±0.057
AU12 | 23.5% | 0.718±0.063 | 0.943±0.028 | 0.686±0.083 | 0.913±0.030 | 0.778±0.047 | 0.953±0.020
AU15 | 6.01% | 0.174±0.132 | 0.586±0.174 | 0.224±0.120 | 0.753±0.091 | 0.135±0.122 | 0.511±0.226
AU17 | 9.88% | 0.281±0.154 | 0.678±0.125 | 0.330±0.132 | 0.763±0.086 | 0.376±0.222 | 0.742±0.148
AU20 | 3.46% | 0.134±0.113 | 0.604±0.155 | 0.184±0.101 | 0.757±0.083 | 0.126±0.069 | 0.628±0.151
AU25 | 35.2% | 0.716±0.111 | 0.890±0.064 | 0.670±0.064 | 0.844±0.049 | 0.822±0.076 | 0.922±0.063
AU26 | 19.1% | 0.563±0.152 | 0.810±0.073 | 0.507±0.131 | 0.797±0.054 | 0.578±0.155 | 0.876±0.039
AVG  | -     | 0.405±0.055 | 0.780±0.036 | 0.398±0.059 | 0.815±0.031 | 0.457±0.067 | 0.823±0.031

Table 3: Performance comparison with the state-of-the-art methods on four benchmark databases in terms of common metrics. ACC: average classification rate.

CK (ACC): AAM [29] 0.955; Gabor+DBN [30] 0.933; LBP [32] 0.949; CNN (baseline) 0.937; IB-CNN 0.951
SEMAINE (F1): LGBP [11] 0.351; DLA-SIFT [16] 0.341; CNN [9] 0.435; CNN (baseline) 0.347; IB-CNN 0.416
BP4D (F1): LGBP [11] 0.580; DLA-SIFT [16] 0.522; CNN [9] 0.591; CNN (baseline) 0.510; IB-CNN 0.578
DISFA (2AFC, ACC): Gabor [12] 0.857, N/A; BGCS [31] 0.868, N/A; LPQ [17] N/A, 0.810; ML-CNN [33] 0.757, 0.846; CNN (baseline) 0.839, 0.780; IB-CNN 0.825, 0.858

Performance evaluation on the DISFA database: A 9-fold cross-validation strategy is employed for the DISFA database, where 8 subsets of 24 subjects were utilized for training and the remaining subset of 3 subjects for testing. For each fold, the training-testing process was repeated 5 times. The mean and standard deviation of the F1 score and the 2AFC score are calculated from the 5 × 9 runs for each target AU and reported in Table 2. As shown in Table 2, the proposed IB-CNN improves the performance from 0.405 (CNN) and 0.398 (B-CNN) to 0.457 (IB-CNN) in terms of the average F1 score and from 0.780 (CNN) and 0.815 (B-CNN) to 0.823 (IB-CNN) in terms of the 2AFC score. Similar to the results on the SEMAINE database, the performance improvement on the infrequent AUs is more impressive. AU5 (upper lid raiser) has the least number of occurrences, i.e., 2.09% positive samples, in the DISFA database.
The recognition performance increases from 0.195 (CNN) and 0.184 (B-CNN) to 0.312 (IB-CNN) in terms of the average F1 score.

Comparison with the state-of-the-art methods: We further compare the proposed IB-CNN with the state-of-the-art methods, especially the CNN-based methods, evaluated on the four benchmark databases using the metrics that are common in those papers.³ As shown in Table 3, the performance of the IB-CNN is comparable with the state-of-the-art methods and, more importantly, outperforms the CNN-based methods.

4.4 Data Analysis

Data analysis of the parameter η: The value of η can affect the slope of the simulated sign(·) function and consequently, the gradient and the optimization process. When η is smaller than 0.5, the simulation is more similar to the real sign(·), but the derivative is near zero for most of the input data, which can cause slow convergence or divergence. An experiment was conducted to analyze the influence of η = σ/c in Eq. 1. Specifically, an average F1 score is calculated from all AUs in the SEMAINE database while varying the value of c. As illustrated in Figure 4, the recognition performance in terms of the average F1 score is robust to the choice of η when c ranges from 0.5 to 16. In our experiment, η is set to half of the standard deviation, i.e., σ/2, empirically.

Figure 4: Recognition performance versus the choice of η (average F1 score over values of c from 0.5 to 16).

Data analysis of the number of input neurons in the IB layer: Selecting an exact number of nodes for the hidden layers remains an open question. An experiment was conducted to demonstrate that the proposed IB-CNN is insensitive to the number of input neurons.
Specifically, a set of IB-CNNs, with 64, 128, 256, 512, 1024, and 2048 input neurons, were trained
and tested on the SEMAINE database. For each IB-CNN, the average F1 score is computed over 5 runs
for each AU. As shown in Figure 5, the B-CNN and, especially, the proposed IB-CNN are more robust
to the number of input neurons than the traditional CNN, since only a small set of neurons is
active, in contrast to the FC layer in the traditional CNN.

³Since the testing sets of the SEMAINE and BP4D databases are not available, the IB-CNN is
compared with the methods reported on the validation sets.

[Figure 5: Recognition performance versus the number of input neurons in the IB layer, shown for
AU2, AU12, AU17, AU25, AU28, and AU45, comparing the CNN, B-CNN, and IB-CNN.]

Data analysis of the learning rate γ: Another issue in CNNs is the choice of the learning rate γ.
The performance of the IB-CNN at different learning rates is depicted in Figure 6 in terms of the
average F1 score calculated from all AUs on the SEMAINE database.
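Such a sensitivity study is a simple log-scale sweep. A minimal sketch, where `train_and_eval` is a hypothetical stand-in for training a model at a given rate and returning its average F1 score:

```python
# Log-scale learning-rate sweep, as in the sensitivity study above.
# train_and_eval is a hypothetical stand-in: a real run would train
# the network at the given rate and return its average F1 score.

def train_and_eval(learning_rate):
    return 0.0  # placeholder

learning_rates = [10.0 ** e for e in range(-11, -6)]  # 1e-11 ... 1e-7
scores = {lr: train_and_eval(lr) for lr in learning_rates}
```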
Compared to the traditional CNN, the proposed IB-CNN is less sensitive to the value of the
learning rate.

[Figure 6: Recognition performance (average F1 score) versus the log learning rate, for the
IB-CNN and the traditional CNN.]

5 Conclusion and Future Work

In this paper, a novel IB-CNN was proposed to integrate boosting classification into a CNN for
the application of AU recognition. To deal with the limited positive samples in a mini-batch, an
incremental boosting algorithm was developed to accumulate information from multiple batches over
time. A novel loss function that accounts for errors from both the incremental strong classifier
and the individual weak classifiers was proposed to fine-tune the IB-CNN. Experimental results on
four benchmark AU databases have demonstrated that the IB-CNN achieves significant improvement
over the traditional CNN, as well as over the state-of-the-art CNN-based methods for AU
recognition. Furthermore, the IB-CNN is more effective in recognizing infrequent AUs with limited
training data. The IB-CNN is a general machine learning method and can be adapted to other
learning tasks, especially those with limited training data. In the future, we plan to extend it
to multitask learning by replacing the binary classifier with a multiclass boosting classifier.

Acknowledgment

This work is supported by the National Science Foundation under CAREER Award IIS-1149787.

References

[1] Ekman, P., Friesen, W.V., Hager, J.C.: Facial Action Coding System: the Manual. Research
Nexus, Div., Network Information Research Corp., Salt Lake City, UT (2002)

[2] Zeng, Z., Pantic, M., Roisman, G.I., Huang, T.S.: A survey of affect recognition methods:
Audio, visual, and spontaneous expressions. IEEE T-PAMI 31(1) (Jan.
2009) 39–58

[3] Sariyanidi, E., Gunes, H., Cavallaro, A.: Automatic analysis of facial affect: A survey of
registration, representation and recognition. IEEE T-PAMI 37(6) (June 2015) 1113–1133

[4] Fasel, B.: Head-pose invariant facial expression recognition using convolutional neural
networks. In: ICMI. (2002) 529–534

[5] Rifai, S., Bengio, Y., Courville, A., Vincent, P., Mirza, M.: Disentangling factors of
variation for facial expression recognition. In: ECCV. (2012) 808–822

[6] Tang, Y.: Deep learning using linear support vector machines. In: ICML. (2013)

[7] Liu, M., Li, S., Shan, S., Wang, R., Chen, X.: Deeply learning deformable facial action parts
model for dynamic expression analysis. In: ACCV. (2014)

[8] Jung, H., Lee, S., Yim, J., Park, S., Kim, J.: Joint fine-tuning in deep neural networks for
facial expression recognition. In: ICCV. (2015) 2983–2991

[9] Gudi, A., Tasli, H.E., den Uyl, T.M., Maroulis, A.: Deep learning based FACS action unit
occurrence and intensity estimation. In: FG. (2015)

[10] Jaiswal, S., Valstar, M.F.: Deep learning the dynamic appearance and shape of facial action
units. In: WACV. (2016)

[11] Valstar, M., Girard, J., Almaev, T., McKeown, G., Mehu, M., Yin, L., Pantic, M., Cohn, J.:
FERA 2015 - second facial expression recognition and analysis challenge. In: FG. (2015)

[12] Mavadati, S.M., Mahoor, M.H., Bartlett, K., Trinh, P., Cohn, J.F.: DISFA: A spontaneous
facial action intensity database. IEEE Trans. on Affective Computing 4(2) (2013) 151–160

[13] Bartlett, M.S., Littlewort, G., Frank, M.G., Lainscsek, C., Fasel, I., Movellan, J.R.:
Recognizing facial expression: Machine learning and application to spontaneous behavior. In:
CVPR. (2005) 568–573

[14] Valstar, M.F., Mehu, M., Jiang, B., Pantic, M.: Meta-analysis of the first facial expression
recognition challenge.
IEEE T-SMC-B 42(4) (2012) 966–979

[15] Baltrusaitis, T., Mahmoud, M., Robinson, P.: Cross-dataset learning and person-specific
normalisation for automatic action unit detection. In: FG. Volume 6. (2015) 1–6

[16] Yuce, A., Gao, H., Thiran, J.: Discriminant multi-label manifold embedding for facial action
unit detection. In: FG. (2015)

[17] Jiang, B., Martinez, B., Valstar, M.F., Pantic, M.: Decision level fusion of domain specific
regions for facial action recognition. In: ICPR. (2014) 1776–1781

[18] Zhao, G., Pietikäinen, M.: Dynamic texture recognition using local binary patterns with an
application to facial expressions. IEEE T-PAMI 29(6) (June 2007) 915–928

[19] Yang, P., Liu, Q., Metaxas, D.N.: Boosting encoded dynamic features for facial expression
recognition. Pattern Recognition Letters 30(2) (Jan. 2009) 132–139

[20] Zafeiriou, S., Petrou, M.: Sparse representations for facial expressions recognition via L1
optimization. In: CVPR Workshops. (2010) 32–39

[21] Liu, P., Han, S., Meng, Z., Tong, Y.: Facial expression recognition via a boosted deep belief
network. In: CVPR. (2014)

[22] Nagi, J., Di Caro, G.A., Giusti, A., Nagi, F., Gambardella, L.M.: Convolutional neural
support vector machines: hybrid visual pattern classifiers for multi-robot systems. In: ICMLA.
(2012) 27–32

[23] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Improving
neural networks by preventing co-adaptation of feature detectors. arXiv preprint (2012)

[24] Medera, D., Babinec, S.: Incremental learning of convolutional neural networks. In: IJCCI.
(2009) 547–550

[25] Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In:
FG. (2000) 46–53

[26] Asthana, A., Zafeiriou, S., Cheng, S., Pantic, M.: Robust discriminative response map fitting
with constrained local models. In: CVPR.
(2013) 3444–3451

[27] Sayette, M.A., Cohn, J.F., Wertz, J.M., Perrott, M.A., Parrott, D.J.: A psychometric
evaluation of the facial action coding system for assessing spontaneous expression. J. Nonverbal
Behavior 25(3) (2001) 167–185

[28] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S.,
Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. In: ACM MM. (2014)
675–678

[29] Lucey, S., Ashraf, A.B., Cohn, J.: Investigating spontaneous facial action recognition
through AAM representations of the face. In Kurihara, K., ed.: Face Recognition Book. Pro
Literatur Verlag, Mammendorf, Germany (April 2007)

[30] Tong, Y., Liao, W., Ji, Q.: Facial action unit recognition by exploiting their dynamic and
semantic relationships. IEEE T-PAMI 29(10) (October 2007) 1683–1699

[31] Song, Y., McDuff, D., Vasisht, D., Kapoor, A.: Exploiting sparsity and co-occurrence
structure for action unit recognition. In: FG. (2015)

[32] Han, S., Meng, Z., Liu, P., Tong, Y.: Facial grid transformation: A novel face registration
approach for improving facial action unit recognition. In: ICIP. (2014)

[33] Ghosh, S., Laksana, E., Scherer, S., Morency, L.: A multi-label convolutional neural network
approach to cross-domain action unit detection. In: ACII. (2015)