{"title": "Learning Generative Models with Visual Attention", "book": "Advances in Neural Information Processing Systems", "page_first": 1808, "page_last": 1816, "abstract": "Attention has long been proposed by psychologists to be important for efficiently dealing with the massive amounts of sensory stimulus in the neocortex. Inspired by the attention models in visual neuroscience and the need for object-centered data for generative models, we propose a deep-learning based generative framework using attention. The attentional mechanism propagates signals from the region of interest in a scene to an aligned canonical representation for generative modeling. By ignoring scene background clutter, the generative model can concentrate its resources on the object of interest. A convolutional neural net is employed to provide good initializations during posterior inference which uses Hamiltonian Monte Carlo. Upon learning images of faces, our model can robustly attend to the face region of novel test subjects. More importantly, our model can learn generative models of new faces from a novel dataset of large images where the face locations are not known.", "full_text": "Learning Generative Models with Visual Attention\n\nYichuan Tang, Nitish Srivastava, Ruslan Salakhutdinov\n\nDepartment of Computer Science\n\n{tang,nitish,rsalakhu}@cs.toronto.edu\n\nUniversity of Toronto\n\nToronto, Ontario, Canada\n\nAbstract\n\nAttention has long been proposed by psychologists to be important for ef\ufb01ciently\ndealing with the massive amounts of sensory stimulus in the neocortex. Inspired\nby the attention models in visual neuroscience and the need for object-centered\ndata for generative models, we propose a deep-learning based generative frame-\nwork using attention. The attentional mechanism propagates signals from the\nregion of interest in a scene to an aligned canonical representation for genera-\ntive modeling. 
By ignoring scene background clutter, the generative model can\nconcentrate its resources on the object of interest. A convolutional neural net is\nemployed to provide good initializations during posterior inference which uses\nHamiltonian Monte Carlo. Upon learning images of faces, our model can robustly\nattend to the face region of novel test subjects. More importantly, our model can\nlearn generative models of new faces from a novel dataset of large images where\nthe face locations are not known.\n\nIntroduction\n\n1\nBuilding rich generative models that are capable of extracting useful, high-level latent represen-\ntations from high-dimensional sensory input lies at the core of solving many AI-related tasks, in-\ncluding object recognition, speech perception and language understanding. These models capture\nunderlying structure in data by de\ufb01ning \ufb02exible probability distributions over high-dimensional data\nas part of a complex, partially observed system. Some of the successful generative models that\nare able to discover meaningful high-level latent representations include the Boltzmann Machine\nfamily of models: Restricted Boltzmann Machines, Deep Belief Nets [1], and Deep Boltzmann Ma-\nchines [2]. Mixture models, such as Mixtures of Factor Analyzers [3] and Mixtures of Gaussians,\nhave also been used for modeling natural image patches [4]. More recently, denoising auto-encoders\nhave been proposed as a way to model the transition operator that has the same invariant distribution\nas the data generating distribution [5].\nGenerative models have an advantage over discriminative models when part of the images are oc-\ncluded or missing. Occlusions are very common in realistic settings and have been largely ignored\nin recent literature on deep learning. 
In addition, prior knowledge can be easily incorporated in\ngenerative models in the forms of structured latent variables, such as lighting and deformable parts.\nHowever, the enormous amount of content in high-resolution images makes generative learning dif-\n\ufb01cult [6, 7]. Therefore, generative models have found most success in learning to model small\npatches of natural images and objects: Zoran and Weiss [4] learned a mixture of Gaussians model\nover 8\u00d78 image patches; Salakhutdinov and Hinton [2] used 64\u00d764 centered and uncluttered stereo\nimages of toy objects on a clear background; Tang et al. [8] used 24\u00d724 images of centered and\ncropped faces. The fact that these models require curated training data limits their applicability on\nusing the (virtually) unlimited unlabeled data.\nIn this paper, we propose a framework to infer the region of interest in a big image for genera-\ntive modeling. This will allow us to learn a generative model of faces on a very large dataset of\n(unlabeled) images containing faces. Our framework is able to dynamically route the relevant infor-\nmation to the generative model and can ignore the background clutter. The need to dynamically and\nselectively route information is also present in the biological brain. Plethora of evidence points to\n\n1\n\n\fthe presence of attention in the visual cortex [9, 10]. Recently, in visual neuroscience, attention has\nbeen shown to exist not only in extrastriate areas, but also all the way down to V1 [11].\nAttention as a form of routing was originally proposed by Anderson and Van Essen [12] and then\nextended by Olshausen et al. [13]. Dynamic routing has been hypothesized as providing a way for\nachieving shift and size invariance in the visual cortex [14, 15]. Tsotsos et al. [16] proposed a model\ncombining search and attention called the Selective Tuning model. 
Larochelle and Hinton [17] pro-\nposed a way of using third-order Boltzmann Machines to combine information gathered from many\nfoveal glimpses. Their model chooses where to look next to find locations that are most informative\nof the object class. Reichert et al. [18] proposed a hierarchical model to show that certain aspects of\ncovert object-based attention can be modeled by Deep Boltzmann Machines. Several other related\nmodels attempt to learn where to look for objects [19, 20] and for video based tracking [21]. Inspired\nby Olshausen et al. [13], we use 2D similarity transformations to implement the scaling, rotation,\nand shift operations required for routing. Our main motivation is to enable the learning of generative\nmodels in big images where the location of the object of interest is unknown a priori.\n2 Gaussian Restricted Boltzmann Machines\nBefore we describe our model, we briefly review the Gaussian Restricted Boltzmann Machine\n(GRBM) [22], as it will serve as the building block for our attention-based model. GRBMs are\na type of Markov Random Field model that has a bipartite structure with real-valued visible vari-\nables v ∈ R^D connected to binary stochastic hidden variables h ∈ {0, 1}^H. The energy of the joint\nconfiguration {v, h} of the Gaussian RBM is defined as follows:\n\nE_GRBM(v, h; Θ) = 1/2 Σ_i (v_i − b_i)²/σ_i² − Σ_j c_j h_j − Σ_ij W_ij v_i h_j,   (1)\n\nwhere Θ = {W, b, c, σ} are the model parameters. The marginal distribution over the visible vector\nv is P(v; Θ) = 1/Z(Θ) Σ_h exp(−E(v, h; Θ)) and the corresponding conditional distributions take\nthe following form:\n\np(h_j = 1|v) = 1/(1 + exp(−Σ_i W_ij v_i − c_j)),   (2)\n\np(v_i|h) = N(v_i; μ_i, σ_i²), where μ_i = b_i + σ_i² Σ_j W_ij h_j.   (3)\n\nObserve that conditioned on the states of the hidden variables (Eq. 3), each visible unit is modeled\nby a Gaussian distribution, whose mean is shifted by the weighted combination of the hidden unit\nactivations. Unlike directed models, an RBM's conditional distribution over hidden nodes is factorial\nand can be easily computed.\nWe can also add a binary RBM on top of the learned GRBM by treating the inferred h as the\n“visible” layer together with a second hidden layer h2. This results in a 2-layer Gaussian Deep\nBelief Network (GDBN) [1] that is a more powerful model of v.\nSpecifically, in a GDBN model, p(h1, h2) is modeled by the energy function of the 2nd-layer RBM,\nwhile p(v1|h1) is given by Eq. 3. Efficient inference can be performed using the greedy approach\nof [1] by treating each DBN layer as a separate RBM model. GDBNs have been applied to various\ntasks, including image classification, video action and speech recognition [6, 23, 24, 25].\n3 The Model\nLet I be a high resolution image of a scene, e.g. a 256×256 image. We want to use attention to\npropagate regions of interest from I up to a canonical representation. For example, in order to learn\na model of faces, the canonical representation could be a 24×24 aligned and cropped frontal face\nimage. Let v ∈ R^D represent this low resolution canonical image. In this work, we focus on a Deep\nBelief Network1 to model v.\nThis is illustrated in the diagrams of Fig. 1. The left panel displays the model of Olshausen et al. 
[13],\nwhereas the right panel shows a graphical diagram of our proposed generative model with an atten-\ntional mechanism. Here, h1 and h2 represent the latent hidden variables of the DBN model, and\nΔx, Δy, Δθ, Δs (position, rotation, and scale) are the parameters of the 2D similarity transforma-\ntion.\n\n1Other generative models can also be used with our attention framework.\n\nFigure 1: Left: The Shifter Circuit, a well-known neuroscience model for visual attention [13]; Right: The\nproposed model uses 2D similarity transformations from geometry and a Gaussian DBN to model canonical\nface images. Associative memory corresponds to the DBN, the object-centered frame corresponds to the visible\nlayer, and the attentional mechanism is modeled by 2D similarity transformations.\n\nThe 2D similarity transformation is used to rotate, scale, and translate the canonical image v onto the\ncanvas that we denote by I. Let p = [x y]T be a pixel coordinate (e.g. [0, 0] or [0, 1]) of the canonical\nimage v. Let {p} be the set of all coordinates of v. For example, if v is 24×24, then {p} ranges\nfrom [0, 0] to [23, 23]. Let the “gaze” variables u ∈ R4 ≡ [Δx, Δy, Δθ, Δs] be the parameters\nof the Similarity transformation. In order to simplify derivations and to make transformations\nlinear w.r.t. the transformation parameters, we can equivalently redefine u = [a, b, Δx, Δy],\nwhere a = s cos(θ) − 1 and b = s sin(θ) (see [26] for details). 
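To make the linearized parameterization concrete, here is a minimal numerical sketch; `warp_points` and `gaze_to_u` are illustrative helper names, not code from the paper:

```python
import numpy as np

def gaze_to_u(s, theta, dx, dy):
    """Map gaze parameters (scale, rotation, shift) to the linear
    parameters u = [a, b, dx, dy] with a = s*cos(theta) - 1, b = s*sin(theta)."""
    return np.array([s * np.cos(theta) - 1.0, s * np.sin(theta), dx, dy])

def warp_points(points, u):
    """Apply the 2D similarity transform to canonical pixel coordinates.
    points: (N, 2) array of [x, y] rows. Returns the warped (N, 2) array."""
    a, b, dx, dy = u
    A = np.array([[1.0 + a, -b],
                  [b, 1.0 + a]])
    return points @ A.T + np.array([dx, dy])
```

With s = 1 and θ = 0 the transform reduces to a pure translation, and u = [0, 0, 0, 0] is the identity.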
We further define a function\nw := w(p, u) → p′ as the transformation function to warp points p to p′:\n\np′ ≜ [x′, y′]T = [1 + a, −b; b, 1 + a] [x, y]T + [Δx, Δy]T.   (4)\n\nWe use the notation I({p}) to denote the bilinear interpolation of I at coordinates {p} with anti-\naliasing. Let x(u) be the extracted low-resolution image at warped locations p′:\n\nx(u) ≜ I(w({p}, u)).   (5)\n\nIntuitively, x(u) is a patch extracted from I according to the shift, rotation and scale parameters\nof u, as shown in Fig. 1, right panel. It is this patch of data that we seek to model generatively. Note\nthat the dimensionality of x(u) is equal to the cardinality of {p}, where {p} denotes the set of pixel\ncoordinates of the canonical image v. Unlike standard generative learning tasks, the data x(u) is\nnot static but changes with the latent variables u. Given v and u, we model the top-down generative\nprocess over2 x with a Gaussian distribution having a diagonal covariance matrix σ²I:\n\np(x|v, u, I) ∝ exp(−1/2 Σ_i (x_i(u) − v_i)²/σ_i²).   (6)\n\nThe fact that we do not seek to model the rest of the regions/pixels of I is by design. By using 2D\nsimilarity transformations to mimic attention, we can discard the complex background of the scene\nand let the generative model focus on the object of interest. The proposed generative model takes\nthe following form:\n\np(x, v, u|I) = p(x|v, u, I)p(v)p(u),   (7)\n\nwhere for p(u) we use a flat prior that is constant for all u, and p(v) is defined by a 2-layer Gaussian\nDeep Belief Network. The conditional p(x|v, u, I) is given by the Gaussian distribution in Eq. 6.\nTo simplify the inference procedure, p(x|v, u, I) and the GDBN model of v, p(v), will share the\nsame noise parameters σ_i.\n\n2We will often omit dependence of x on u for clarity of presentation.\n\n4 Inference\nWhile the generative equations in the last section are straightforward and intuitive, inference in these\nmodels is typically intractable due to the complicated energy landscape of the posterior. During\ninference, we wish to compute the distribution over the gaze variables u and canonical object v given\nthe big image I. Unlike in standard RBMs and DBNs, there are no simplifying factorial assumptions\nabout the conditional distribution of the latent variable u. Having a 2D similarity transformation is\nreminiscent of third-order Boltzmann machines with u performing top-down multiplicative gating\nof the connections between v and I. It is well known that inference in these higher-order models is\nrather complicated.\nOne way to perform inference in our model is to resort to Gibbs sampling by computing the set of\nalternating conditional posteriors: The conditional distribution over the canonical image v takes the\nfollowing form:\n\np(v|u, h1, I) = N((μ + x(u))/2; σ²/2),   (8)\n\nwhere μ_i = b_i + σ_i² Σ_j W_ij h1_j is the top-down influence of the DBN. Note that if we know the\ngaze variable u and the first layer of hidden variables h1, then v is simply defined by a Gaussian\ndistribution, where the mean is given by the average of the top-down influence and bottom-up in-\nformation from x. The conditional distributions over h1 and h2 given v are given by the standard\nDBN inference equations [1]. 
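As a minimal numerical sketch of this Gibbs step (assuming `mu` is the precomputed top-down DBN prediction; names are illustrative, not the paper's code):

```python
import numpy as np

def sample_v(x_u, mu, sigma2, rng):
    """Sample the canonical image v from p(v | u, h1, I): a Gaussian whose
    mean (mu + x(u)) / 2 averages the top-down DBN prediction mu with the
    bottom-up extracted patch x(u), and whose variance is sigma2 / 2."""
    mean = 0.5 * (mu + x_u)
    return mean + rng.standard_normal(mean.shape) * np.sqrt(sigma2 / 2.0)
```

In the zero-noise limit the sample is exactly the average of the two sources of information.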
The conditional posterior over the gaze variables u is given by:\n\np(u|x, v) = p(x|u, v)p(u)/p(x|v),\n\nlog p(u|x, v) ∝ log p(x|u, v) + log p(u) = −1/2 Σ_i (x_i(u) − v_i)²/σ_i² + const.   (9)\n\nUsing Bayes' rule, the unnormalized log probability of p(u|x, v) is defined in Eq. 9. We stress that\nthis equation is atypical in that the random variable of interest u actually affects the conditioning\nvariable x (see Eq. 5). We can explore the gaze variables using the Hamiltonian Monte Carlo (HMC)\nalgorithm [27, 28]. Intuitively, conditioned on the canonical object v that our model has in “mind”,\nHMC searches over the entire image I to find a region x with a good match to v.\nIf the goal is only to find the MAP estimate of p(u|x, v), then we may want to use second-order\nmethods for optimizing u. This would be equivalent to the Lucas-Kanade framework in computer\nvision, developed for image alignment [29]. However, HMC has the advantage of being a proper\nMCMC sampler that satisfies detailed balance and fits nicely with our probabilistic framework.\nThe HMC algorithm first specifies the Hamiltonian over the position variables u and auxiliary\nmomentum variables r: H(u, r) = U(u) + K(r), where the potential function is defined by\nU(u) = 1/2 Σ_i (x_i(u) − v_i)²/σ_i² and the kinetic energy function is given by K(r) = 1/2 Σ_i r_i². The dy-\nnamics of the system is defined by:\n\n∂u/∂t = ∂H/∂r = r,   ∂r/∂t = −∂H/∂u,   (10)\n\n∂H/∂u = (x(u) − v)/σ² · ∂x(u)/∂u,   (11)\n\n∂x(u)/∂u = Σ_i ∂x_i/∂w(p_i, u) · ∂w(p_i, u)/∂u.   (12)\n\nObserve that Eq. 12 decomposes into sums over single coordinate positions p_i = [x y]T. Let us\ndenote p′_i = w(p_i, u) to be the coordinate p_i warped by u. For the first term on the RHS of Eq. 12,\n\n∂x_i/∂w(p_i, u) = ∇I(p′_i),  (dimension 1 by 2)   (13)\n\nwhere ∇I(p′_i) denotes the sampling of the gradient images of I at the warped location p′_i. For the\nsecond term on the RHS of Eq. 12, we note that we can re-write Eq. 4 as:\n\n[x′, y′]T = [x, −y, 1, 0; y, x, 0, 1] [a, b, Δx, Δy]T + [x, y]T,   (14)\n\ngiving us\n\n∂w(p_i, u)/∂u = [x, −y, 1, 0; y, x, 0, 1].   (15)\n\nHMC simulates the discretized system by performing leap-frog updates of u and r using Eq. 10.\nAdditional hyperparameters that need to be specified include the step size ε, the number of leap-frog\nsteps, and the mass of the variables (see [28] for details).\n\n4.1 Approximate Inference\nHMC essentially performs gradient descent with momentum,\ntherefore it is prone to getting stuck at local optima. This\nis especially a problem for our task of finding the best trans-\nformation parameters. While the posterior over u should be\nunimodal near the optimum, many local minima exist away\nfrom the global optimum. For example, in Fig. 2(a), the big\nimage I is enclosed by the blue box, and the canonical image\nv is enclosed by the green box. The current setting of u aligns\ntogether the wrong eyes. However, it is hard to move the green\nbox to the left due to the local optima created by the dark in-\ntensities of the eye. 
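A generic leap-frog sketch of these updates, with a user-supplied potential gradient standing in for Eqs. 11-15; for brevity this returns the proposal directly, whereas a full HMC sampler would add the Metropolis accept/reject step (names are illustrative):

```python
import numpy as np

def hmc_step(u, grad_U, eps, n_leapfrog, rng):
    """One HMC transition for the gaze variables: resample the momentum r,
    then simulate the leap-frog discretization of du/dt = r, dr/dt = -dU/du."""
    r = rng.standard_normal(u.shape)
    r = r - 0.5 * eps * grad_U(u)          # initial half step for momentum
    for i in range(n_leapfrog):
        u = u + eps * r                    # full step for position
        if i < n_leapfrog - 1:
            r = r - eps * grad_U(u)        # full step for momentum
    r = r - 0.5 * eps * grad_U(u)          # final half step for momentum
    return u
```

Here `grad_U` would evaluate Σ_i ((x_i(u) − v_i)/σ_i²) ∇I(p′_i) ∂w(p_i, u)/∂u; a toy quadratic potential can stand in for testing.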
Resampling the momentum variable every\niteration in HMC does not help signi\ufb01cantly because we are\nmodeling real-valued images using a Gaussian distribution as\nthe residual, leading to quadratic costs in the difference be-\ntween x(u) and v (see Eq. 9). This makes the energy barriers\nbetween modes extremely high.\nTo alleviate this problem we need to \ufb01nd good initializations\nof u. We use a Convolutional Network (ConvNet) to per-\nform ef\ufb01cient approximate inference, resulting in good initial\nguesses. Speci\ufb01cally, given v, u and I, we predict the change\nin u that will lead to the maximum log p(u|x, v).\nIn other\nwords, instead of using the gradient \ufb01eld for updating u, we\nlearn a ConvNet to output a better vector \ufb01eld in the space\nof u. We used a fairly standard ConvNet architecture and the standard stochastic gradient descent\nlearning procedure.\nWe note that standard feedforward face detectors seek to model p(u|I), while completely ignoring\nthe canonical face v. In contrast, here we take v into account as well. The ConvNet is used to initial-\nize u for the HMC algorithm. This is important in a proper generative model because conditioning\non v is appealing when multiple faces are present in the scene. Fig. 2(b) is a hypothesized Euclidean\nspace of v, where the black manifold represents canonical faces and the blue manifold represents\ncropped faces x(u). The blue manifold has a low intrinsic dimensionality of 4, spanned by u. At A\nand B, the blue comes close to black manifold. This means that there are at least two modes in the\nposterior over u. By conditioning on v, we can narrow the posterior to a single mode, depending on\nwhom we want to focus our attention. We demonstrate this exact capability in Sec. 6.3.\nFig. 3 demonstrates the iterative process of how approximate inference works in our model. 
Specif-\nically, based on u, the ConvNet takes a window patch around x(u) (72\u00d772) and v (24\u00d724) as input,\nand predicts the output [(cid:52)x,(cid:52)y,(cid:52)\u03b8,(cid:52)s]. In step 2, u is updated accordingly, followed by step 3\nof alternating Gibbs updates of v and h, as discussed in Sec. 4. The process is repeated. For the\ndetails of the ConvNet see the supplementary materials.\n5 Learning\nWhile inference in our framework localizes objects of interest and is akin to object detection, it is not\nthe main objective. Our motivation is not to compete with state-of-the-art object detectors but rather\npropose a probabilistic generative framework capable of generative modeling of objects which are\nat unknown locations in big images. This is because labels are expensive to obtain and are often not\navailable for images in an unconstrained environment.\nTo learn generatively without\nlabels we propose a simple Monte Carlo based Expectation-\nMaximization algorithm. This algorithm is an unbiased estimator of the maximum likelihood objec-\n\nFigure 2:\n(a) HMC can easily get\nstuck at local optima. (b) Importance\nof modeling p(u|v,I). Best in color.\n\n5\n\nAverageAB\fFigure 3:\nInference process: u in step 1 is randomly initialized. The average v and the extracted x(u) form\nthe input to a ConvNet for approximate inference, giving a new u. The new u is used to sample p(v|I, u, h).\nIn step 3, one step of Gibbs sampling of the GDBN is performed. Step 4 repeats the approximate inference\nusing the updated v and x(u).\n\nFigure 4: Example of an inference step. v is 24\u00d724, x is 72\u00d772. Approximate inference quickly \ufb01nds a\ngood initialization for u, while HMC provides further adjustments. Intermediate inference steps on the right\nare subsampled from 10 actual iterations.\n\ntive. During the E-step, we use the Gibbs sampling algorithm developed in Sec. 
4 to draw samples\nfrom the posterior over the latent gaze variables u, the canonical variables v, and the hidden vari-\nables h1, h2 of a Gaussian DBN model. During the M-step, we can update the weights of the\nGaussian DBN by using the posterior samples as its training data. In addition, we can update the\nparameters of the ConvNet that performs approximate inference. Due to the fact that the \ufb01rst E-step\nrequires a good inference algorithm, we need to pretrain the ConvNet using labeled gaze data as\npart of a bootstrap process. Obtaining training data for this initial phase is not a problem as we can\njitter/rotate/scale to create data. In Sec. 6.2, we demonstrate the ability to learn a good generative\nmodel of face images from the CMU Multi-PIE dataset.\n6 Experiments\nWe used two face datasets in our experiments. The \ufb01rst dataset is a frontal face dataset, called\nthe Caltech Faces from 1999, collected by Markus Weber. In this dataset, there are 450 faces of 27\nunique individuals under different lighting conditions, expressions, and backgrounds. We downsam-\npled the images from their native 896 by 692 by a factor of 2. The dataset also contains manually\nlabeled eyes and mouth coordinates, which will serve as the gaze labels. We also used the CMU\nMulti-PIE dataset [30], which contains 337 subjects, captured under 15 viewpoints and 19 illumi-\nnation conditions in four recording sessions for a total of more than 750,000 images. We demon-\nstrate our model\u2019s ability to perform approximate inference, to learn without labels, and to perform\nidentity-based attention given an image with two people.\n6.1 Approximate inference\nWe \ufb01rst investigate the critical inference algorithm of p(u|v,I) on the Caltech Faces dataset. We\nrun 4 steps of approximate inference detailed in Sec. 4.1 and diagrammed in Fig. 3, followed by\nthree iterations of 20 leap-frog steps of HMC. 
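The inference schedule just described (a few ConvNet proposal/Gibbs steps, then short HMC runs) can be sketched as follows, with the three update operations passed in as callables; this is an illustrative skeleton, not the paper's code:

```python
def infer_gaze(u0, v0, convnet_update, gibbs_update, hmc_run,
               n_approx=4, n_hmc=3, n_leapfrog=20):
    """Alternating inference: ConvNet proposals initialize the gaze u while
    Gibbs steps resample the canonical image v; short HMC runs then refine u.
    convnet_update(u, v), gibbs_update(u) and hmc_run(u, v, n) are supplied
    by the caller."""
    u, v = u0, v0
    for _ in range(n_approx):
        u = convnet_update(u, v)   # predict a better gaze from (x(u), v)
        v = gibbs_update(u)        # resample canonical image given u
    for _ in range(n_hmc):
        u = hmc_run(u, v, n_leapfrog)
    return u, v
```

With toy callables the control flow is easy to check: four proposal steps followed by three HMC refinements.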
Since we do not initially know the correct v, we\ninitialize v to be the average face across all subjects.\nFig. 4 shows the image of v and x during inference for a test subject. The initial gaze box is colored\nyellow on the left. Subsequent gaze updates progress from yellow to blue. Once ConvNet-based\napproximate inference gives a good initialization, starting from step 5, five iterations of 20 leap-frog\nsteps of HMC are used to sample from the posterior.\nFig. 5 shows the quantitative results of Intersection over Union (IOU) of the ground truth face box\nand the inferred face box. The results show that inference is very robust to initialization and requires\nonly a few steps of approximate inference to converge. HMC clearly improves model performance,\nresulting in an IOU increase of about 5% for localization. This is impressive given that none of\nthe test subjects were part of the training and the background is different from backgrounds in the\ntraining set.\n\nFigure 5: (a) Accuracy as a function of gaze initialization (pixel offset). Blue curve is the percentage success\nof at least 50% IOU. Red curve is the average IOU. (b) Accuracy as a function of the number of approximate\ninference steps when initializing 50 pixels away. (c) Accuracy improvements of HMC as a function of gaze\ninitializations.\n\nFigure 6: Left: (a) Samples from a 2-layer DBN trained on Caltech. Right: (b) samples from an updated DBN\nafter training on CMU Multi-PIE without labels. Samples highlighted in green are similar to faces from CMU.\n\nWe also compared our inference algorithm to the template matching in the task of face detection.\nWe took the first 5 subjects as test subjects and the rest as training. We can localize with 97%\naccuracy (IOU > 0.5) using our inference algorithm3. In comparison, a near state-of-the-art face\ndetection system from OpenCV 2.4.9 obtains the same 97% accuracy. It uses Haar Cascades, which\nis a form of AdaBoost4. Normalized Cross Correlation [31] obtained 93% accuracy, while Euclidean\ndistance template matching achieved an accuracy of only 78%. However, note that our algorithm\nlooks at a constant number of windows while the other baselines are all based on scanning windows.\n\n              Our method  OpenCV  NCC     template\nIOU > 0.5     97%         97%     93%     78%\n# evaluations O(c)        O(whs)  O(whs)  O(whs)\nTable 1: Face localization accuracy. w: image width; h: image height; s: image scales; c: number of inference\nsteps used.\n\n6.2 Generative learning without labels\n\n               No CMU training  CMU w/o labels  CMU w/ labels\nCaltech Train  617±0.4          627±0.5         569±0.6\nCaltech Valid  512±1.1          503±1.8         494±1.7\nCMU Train      96±0.8           499±0.1         594±0.5\nCMU Valid      85±0.5           387±0.3         503±0.7\nlog Ẑ          454.6            687.8           694.2\nTable 2: Variational lower-bound estimates (in nats) on the log-density of the Gaussian DBNs (higher is\nbetter).\n\nThe main advantage of our model is that it can learn on large images of faces without localization\nlabel information (no manual cropping required). To demonstrate, we use both the Caltech and the\nCMU faces dataset. For the CMU faces, a subset of 2526 frontal faces with ground truth labels are\nused. We split the Caltech dataset into a training and a validation set. For the CMU faces, we first\ntook 10% of the images as training cases for the ConvNet for approximate inference. This is needed\ndue to the completely different backgrounds of the Caltech and CMU datasets. The remaining 90%\nof the CMU faces are split into a training and validation set. 
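For reference, the IOU criterion used in these evaluations can be computed for axis-aligned boxes as follows (a standard formulation, not the paper's exact evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # overlap height
    inter = iw * ih
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)
```

A localization counts as correct in Table 1 when this score exceeds 0.5.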
We first trained a GDBN with 1024 h1 and 256 h2 hidden units on the Caltech training set. We also\ntrained a ConvNet for approximate inference using the Caltech training set and 10% of the CMU\ntraining images.\nTable 2 shows the estimates of the variational lower-bounds on the average log-density (higher is\nbetter) that the GDBN models assign to the ground-truth cropped face images from the training/test\nsets under different scenarios. In the left column, the model is only trained on Caltech faces. Thus it\ngives very low probabilities to the CMU faces. Indeed, GDBNs achieve a variational lower-bound of\nonly 85 nats per test image. In the middle column, we use our approximate inference to estimate the\nlocation of the CMU training faces and further trained the GDBN on the newly localized faces. This\ngives a dramatic increase of the model performance on the CMU Validation set5, achieving a lower-\nbound of 387 nats per test image. The right column gives the best possible results if we can train\nwith the CMU manual localization labels. In this case, GDBNs achieve a lower-bound of 503 nats.\nWe used Annealed Importance Sampling (AIS) to estimate the partition function for the top-layer\nRBM. Details on estimating the variational lower bound are in the supplementary materials.\nFig. 6(a) further shows samples drawn from the Caltech trained DBN, whereas Fig. 6(b) shows\nsamples after training with the CMU dataset using estimated u. Observe that samples in Fig. 6(b)\nshow a more diverse set of faces. We trained GDBNs using a greedy, layer-wise algorithm of [1].\nFor the top layer we use Fast Persistent Contrastive Divergence [32], which substantially improved\ngenerative performance of GDBNs (see supplementary material for more details).\n\n3u is randomly initialized at ±30 pixels, scale range from 0.5 to 1.5.\n4OpenCV detection uses the pretrained model from haarcascade_frontalface_default.xml, scaleFactor=1.1,\nminNeighbors=3 and minSize=30.\n\nFigure 7: Left: Conditioning on different v results in a different Δu. Note that the initial u is exactly the\nsame for the two trials. Right: Additional examples. The only difference between the top and bottom panels is\nthe conditioned v. Best viewed in color.\n\n6.3 Inference with ambiguity\nOur attentional mechanism can also be useful when multiple objects/faces are present in the scene.\nIndeed, the posterior p(u|x, v) is conditioned on v, which means that where to attend is a func-\ntion of the canonical object v the model has in “mind” (see Fig. 2(b)). To explore this, we first\nsynthetically generate a dataset by concatenating together two faces from the Caltech dataset. We\nthen train the approximate inference ConvNet as in Sec. 4.1 and test on the held-out subjects. Indeed,\nas predicted, Fig. 7 shows that depending on which canonical image is conditioned on, the same exact\ngaze initialization leads to two very different gaze shifts. Note that this phenomenon is observed\nacross different scales and locations of the initial gaze. For example, in Fig. 7, right-bottom panel,\nthe initialized yellow box is mostly on the female's face to the left, but because the conditioned\ncanonical face v is that of the right male, attention is shifted to the right.\n7 Conclusion\nIn this paper we have proposed a probabilistic graphical model framework for learning generative\nmodels using attention. 
Experiments on face modeling have shown that ConvNet based approximate\ninference combined with HMC sampling is suf\ufb01cient to explore the complicated posterior distribu-\ntion. More importantly, we can generatively learn objects of interest from novel big images. Future\nwork will include experimenting with faces as well as other objects in a large scene. Currently the\nConvNet approximate inference is trained in a supervised manner, but reinforcement learning could\nalso be used instead.\nAcknowledgements\nThe authors gratefully acknowledge the support and generosity from Samsung, Google, and ONR\ngrant N00014-14-1-0232.\n\n5We note that we still made use of labels coming from the 10% of CMU Multi-PIE training set in order to\n\npretrain our ConvNet. \"w/o labels\" here means that no labels for the CMU Train/Valid images are given.\n\n8\n\n\fReferences\n[1] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Compu-\n\ntation, 18(7):1527\u20131554, 2006.\n\n[2] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS, 2009.\n[3] Geoffrey E. Hinton, Peter Dayan, and Michael Revow. Modeling the manifolds of images of handwritten\n\ndigits. IEEE Transactions on Neural Networks, 8(1):65\u201374, 1997.\n\n[4] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration.\n\nIn ICCV. IEEE, 2011.\n\n[5] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as\n\ngenerative models. In Advances in Neural Information Processing Systems 26, 2013.\n\n[6] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsu-\n\npervised learning of hierarchical representations. In ICML, pages 609\u2013616, 2009.\n\n[7] Marc\u2019Aurelio Ranzato, Joshua Susskind, Volodymyr Mnih, and Geoffrey Hinton. On Deep Generative\n\nModels with Applications to Recognition. 
In CVPR, 2011.
[8] Y. Tang, R. Salakhutdinov, and G. E. Hinton. Deep mixtures of factor analysers. In ICML, 2012.
[9] M. I. Posner and C. D. Gilbert. Attention and primary visual cortex. Proc. of the National Academy of Sciences, 96(6), March 1999.
[10] E. A. Buffalo, P. Fries, R. Landman, H. Liang, and R. Desimone. A backward progression of attentional effects in the ventral stream. PNAS, 107(1):361–365, Jan. 2010.
[11] N. Kanwisher and E. Wojciulik. Visual attention: Insights from brain imaging. Nature Reviews Neuroscience, 1:91–100, 2000.
[12] C. H. Anderson and D. C. Van Essen. Shifter circuits: A computational strategy for dynamic aspects of visual processing. Proc. of the National Academy of Sciences, 84:6297–6301, 1987.
[13] B. A. Olshausen, C. H. Anderson, and D. C. Van Essen. A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. The Journal of Neuroscience, 13(11):4700–4719, 1993.
[14] L. Wiskott. How does our visual system achieve shift and size invariance?, 2004.
[15] S. Chikkerur, T. Serre, C. Tan, and T. Poggio. What and where: a Bayesian inference theory of attention. Vision Research, 50(22):2233–2247, October 2010.
[16] J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. H. Lai, N. Davis, and F. Nuflo. Modeling visual attention via selective tuning. Artificial Intelligence, 78(1-2):507–545, October 1995.
[17] H. Larochelle and G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. In NIPS, pages 1243–1251, 2010.
[18] D. P. Reichert, P. Seriès, and A. J. Storkey. A hierarchical generative model of recurrent object-based attention in the visual cortex. In ICANN (1), volume 6791, pages 18–25. Springer, 2011.
[19] B. Alexe, N.
Heess, Y. W. Teh, and V. Ferrari. Searching for objects driven by context. In NIPS, December 2012.
[20] M. Ranzato. On learning where to look. arXiv:1405.5488, 2014.
[21] M. Denil, L. Bazzani, H. Larochelle, and N. de Freitas. Learning where to attend with deep architectures for image tracking. Neural Computation, 28:2151–2184, 2012.
[22] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
[23] A. Krizhevsky. Learning multiple layers of features from tiny images, 2009.
[24] G. W. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatio-temporal features. In ECCV. Springer, 2010.
[25] A. Mohamed, G. Dahl, and G. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 2011.
[26] R. Szeliski. Computer Vision: Algorithms and Applications. Texts in Computer Science. Springer, 2011.
[27] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
[28] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo (eds S. Brooks, A. Gelman, G. Jones, X.-L. Meng). Chapman and Hall/CRC Press, 2010.
[29] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56:221–255, 2002.
[30] R. Gross, I. Matthews, J. F. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, 2010.
[31] J. P. Lewis. Fast normalized cross-correlation, 1995.
[32] T. Tieleman and G. E. Hinton. Using fast weights to improve persistent contrastive divergence. In ICML, volume 382, page 130. ACM, 2009.
[33] R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In Proceedings of the Intl. Conf.
on Machine Learning, volume 25, 2008.