{"title": "Learning to combine foveal glimpses with a third-order Boltzmann machine", "book": "Advances in Neural Information Processing Systems", "page_first": 1243, "page_last": 1251, "abstract": "We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the \u201cglimpse\u201d at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object. We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images.", "full_text": "

Learning to combine foveal glimpses with a third-order Boltzmann machine

Hugo Larochelle and Geoffrey Hinton
Department of Computer Science, University of Toronto
6 King’s College Rd, Toronto, ON, Canada, M5S 3G4
{larocheh,hinton}@cs.toronto.edu

Abstract

We describe a model based on a Boltzmann machine with third-order connections that can learn how to accumulate information about a shape over several fixations. The model uses a retina that only has enough high resolution pixels to cover a small area of the image, so it must decide on a sequence of fixations and it must combine the “glimpse” at each fixation with the location of the fixation before integrating the information with information from other glimpses of the same object.
We evaluate this model on a synthetic dataset and two image classification datasets, showing that it can perform at least as well as a model trained on whole images.

1 Introduction

Like insects with unmovable compound eyes, most current computer vision systems use images of uniform resolution. Human vision, by contrast, uses a retina in which the resolution falls off rapidly with eccentricity, and it relies on intelligent, top-down strategies for sequentially fixating parts of the optic array that are relevant for the task at hand. This “fixation point strategy” has many advantages:

• It allows the human visual system to achieve invariance to large scale translations by simply translating all the fixation points.
• It allows a reduction in the number of “pixels” that must be processed in parallel yet preserves the ability to see very fine details when necessary. This reduction allows the visual system to apply highly parallel processing to the sensory input produced by each fixation.
• It removes most of the force from the main argument against generative models of perception, which is that they waste time computing detailed explanations for parts of the image that are irrelevant to the task at hand. If task-specific considerations are used to select fixation points for a variable resolution retina, most of the irrelevant parts of the optic array will only ever be represented in a small number of large pixels.

If a system with billions of neurons at its disposal has adopted this strategy, the use of a variable resolution retina and a sequence of intelligently selected fixation points is likely to be even more advantageous for simulated visual systems that have to make do with a few thousand “neurons”.
In this paper we explore the computational issues that arise when the fixation point strategy is incorporated in a Boltzmann machine and demonstrate a small system that can make good use of a variable resolution retina containing very few pixels. There are two main computational issues:

• What-where combination: How can eye positions be combined with the features extracted from the retinal input (glimpses) to allow evidence for a shape to be accumulated across a sequence of fixations?
• Where to look next: Given the results of the current and previous fixations, where should the system look next to optimize its object recognition performance?

Figure 1: A: Illustration of the retinal transformation r(I, (i, j)). The center dot marks the pixel at position (i, j) (pixels are drawn as dotted squares). B: examples of glimpses computed by the retinal transformation, at different positions (visualized through reconstructions). C: Illustration of the multi-fixation RBM.

To tackle these issues, we rely on a special type of restricted Boltzmann machine (RBM) with third-order connections between visible units (the glimpses), hidden units (the accumulated features) and position-dependent units which gate the connections between the visible and hidden units. We describe approaches for training this model to jointly learn and accumulate useful features from the image and control where these features should be extracted, and evaluate it on a synthetic dataset and two image classification datasets.

2 Vision as a sequential process with retinal fixations

Throughout this work, we will assume the following problem framework. We are given a training set of image and label pairs {(I^t, l^t)}_{t=1}^N and the task is to predict the value of l^t (e.g. a class label l^t ∈ {1, ..., C}) given the associated image I^t.
The standard machine learning approach would consist in extracting features from the whole image I^t and from those directly learn to predict l^t. However, since we wish to incorporate the notion of fixation into our problem framework, we need to introduce some constraints on how information from I^t is acquired.

To achieve this, we require that information about an image I (removing the superscript t for simplicity) must be acquired sequentially by fixating (or querying) the image at a series of K positions [(i_1, j_1), ..., (i_K, j_K)]. Given a position (i_k, j_k), which identifies a pixel I(i_k, j_k) in the image, information in the neighborhood of that pixel is extracted through what we refer to as a retinal transformation r(I, (i_k, j_k)). Much like the fovea of the human retina, this transformation extracts high-resolution information (i.e. copies the value of the pixels) from the image only in the neighborhood of pixel I(i_k, j_k). At the periphery of the retina, lower-resolution information is extracted by averaging the values of pixels falling in small hexagonal regions of the image. The hexagons are arranged into a spiral, with the size of the hexagons increasing with the distance from the center (i_k, j_k) of the fixation¹. All of the high-resolution and low-resolution information is then concatenated into a single vector given as output by r(I, (i_k, j_k)). An illustration of this retinal transformation is given in Figure 1. As a shorthand, we will use x_k to refer to the glimpse given by the output of the retinal transformation r(I, (i_k, j_k)).

3 A multi-fixation model

We now describe a system that can predict l from a few glimpses x_1, ..., x_K. We know that this problem is solvable: [1] demonstrated that people can “see” a shape by combining information from multiple glimpses through a hole that is much smaller than the whole shape. He called this “anorthoscopic perception”.
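The retinal transformation described above can be sketched in code. This is a simplified illustration only: it uses a square fovea and square averaged rings rather than the paper's hexagonal spiral, and the function name and parameters (`fovea`, `rings`) are our own.

```python
import numpy as np

def retinal_transform(image, i, j, fovea=5, rings=2):
    """Simplified sketch of the retinal transformation r(I, (i, j)).
    High-resolution pixel values are copied near the fixation (i, j);
    the periphery is summarized by one average per successively larger
    ring (square rings here, not the paper's hexagonal spiral)."""
    h, w = image.shape
    half = fovea // 2
    # Fovea: copy raw pixel values in a small window around the fixation,
    # zero-padding outside the image boundary.
    fov = [image[p, q] if 0 <= p < h and 0 <= q < w else 0.0
           for p in range(i - half, i + half + 1)
           for q in range(j - half, j + half + 1)]
    # Periphery: one averaged value per ring, with ring size doubling,
    # mimicking resolution falling off with eccentricity.
    periph = []
    size = fovea
    for _ in range(rings):
        block = image[max(0, i - size):min(h, i + size + 1),
                      max(0, j - size):min(w, j + size + 1)]
        periph.append(float(block.mean()))
        size *= 2
    # All information is concatenated into a single glimpse vector x_k.
    return np.array(fov + periph)
```

The output vector plays the role of x_k = r(I, (i_k, j_k)) in the text.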
The shape information derived from each glimpse cannot just be added, as implied in [2]. It is the conjunction of the shape of a part and its relative location that provides evidence for the shape of a whole object, and the natural way to deal with this conjunction is to use multiplicative interactions between the “what” and the “where”.

¹A retina with approximately hexagonal pixels produced by a log conformal mapping centered on the current fixation point has an interesting property: it is possible to use weight-sharing over scale and orientation instead of translation, but we do not explore this here.

Learning modules that incorporate multiplicative interactions have recently been developed [3, 4]. These can be viewed as energy-based models with three-way interactions. In this work, we build on [5, 6] who introduced a method of keeping the number of parameters under control when incorporating such high-order interactions in a restricted Boltzmann machine. We start by describing the standard RBM model for classification, and then describe how we adapt it to the multi-fixation framework.

3.1 Restricted Boltzmann Machine for classification

RBMs are undirected generative models which model the distribution of a visible vector v of units using a hidden vector of binary units h. For a classification problem with C classes, the visible layer is composed of an input vector x and a target vector y, where the target vector follows the so-called “1 out of C” representation of the classification label l (i.e. y = e_l, where all the components of e_l are 0 except for the l-th, which is 1).

More specifically, given the following energy function:

E(y, x, h) = −h^⊤ W x − b^⊤ x − c^⊤ h − d^⊤ y − h^⊤ U y    (1)

we define the associated distribution over x, y and h: p(y, x, h) = exp(−E(y, x, h))/Z. Assuming x is a binary vector, it can be shown that this model has the following posteriors:

p(h|y, x) = ∏_j p(h_j|y, x), where p(h_j = 1|y, x) = sigm(c_j + U_{j·} y + W_{j·} x)    (2)

p(x|h) = ∏_i p(x_i|h), where p(x_i = 1|h) = sigm(b_i + h^⊤ W_{·i})    (3)

p(y = e_l|h) = exp(d_l + h^⊤ U_{·l}) / Σ_{l*=1}^{C} exp(d_{l*} + h^⊤ U_{·l*})    (4)

where A_{j·} and A_{·i} respectively refer to the j-th row and i-th column of matrix A. These posteriors make it easy to do inference or sample from the model using Gibbs sampling. For real-valued input vectors, an extension of Equation 1 can be derived to obtain a Gaussian distribution for the conditional distribution over x of Equation 3 [7].

Another useful property of this model is that all hidden units can be marginalized over analytically in order to exactly compute

p(y = e_l|x) = exp(d_l + Σ_j softplus(c_j + U_{jl} + W_{j·} x)) / Σ_{l*=1}^{C} exp(d_{l*} + Σ_j softplus(c_j + U_{jl*} + W_{j·} x))    (5)

where softplus(a) = log(1 + exp(a)). Hence, classification can be performed for some given input x by computing Equation 5 and choosing the most likely class.

3.2 Multi-fixation RBM

At first glance, a very simple way of using the classification RBM of the previous section in the multi-fixation setting would be to set x = x_{1:K} = [x_1, ..., x_K]. However, doing so would completely throw away the information about the position of the fixations.
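The exact class posterior of Equation 5 is cheap to compute; a minimal sketch (the shape conventions below are our own: W is H × R, U is H × C):

```python
import numpy as np

def softplus(a):
    return np.log1p(np.exp(a))

def predict_proba(x, W, U, c, d):
    """Exact p(y = e_l | x) for the classification RBM (Equation 5):
    the binary hidden units are marginalized analytically, giving one
    softplus term per hidden unit and per class."""
    C = U.shape[1]
    # One unnormalized log-score per class l.
    scores = np.array([d[l] + softplus(c + U[:, l] + W @ x).sum()
                       for l in range(C)])
    scores -= scores.max()          # subtract max for numerical stability
    p = np.exp(scores)
    return p / p.sum()
```

Classification then simply picks the class with the highest posterior, `np.argmax(predict_proba(x, W, U, c, d))`.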
Instead, we could redefine the energy function of Equation 1 as follows:

E(y, x_{1:K}, h) = Σ_{k=1}^{K} ( −h^⊤ W(i_k, j_k) x_k − b^⊤ x_k ) − c^⊤ h − d^⊤ y − h^⊤ U y    (6)

where the connection matrix W(i_k, j_k) now depends on the position of the fixation². Such connections are called high-order (here third order) because they can be seen as connecting the hidden units, input units and implicit position units (one for each possible value of positions (i_k, j_k)). Conditioned on the position units (which are assumed to be given), this model is still an RBM satisfying the traditional conditional independence properties between the hidden and visible units.

²To be strictly correct in our notation, we should add the position coordinates (i_1, j_1), ..., (i_K, j_K) as an input of the energy function E(y, x, h). To avoid clutter however, we will consider the position coordinates to be implicitly given by x_1, ..., x_K.

For a given m × m grid of possible fixation positions, the W(i_k, j_k) matrices together contain m²HR parameters, where H is the number of hidden units and R is the size of the retinal transformation. To reduce that number, we parametrize or factorize the W(i_k, j_k) matrices as follows:

W(i_k, j_k) = P diag(z(i_k, j_k)) F    (7)

where F is R × D, P is D × H, z(i_k, j_k) is a (learned) vector associated to position (i_k, j_k) and diag(a) is a matrix whose diagonal is the vector a. Hence, W(i_k, j_k) is now an outer product of the D lower-dimensional bases in F (“filters”) and P (“pooling”), gated by a position-specific vector z(i_k, j_k). Instead of learning a separate matrix W(i_k, j_k) for each possible position, we now only need to learn a separate vector z(i_k, j_k) for each position. Intuitively, the vector z(i_k, j_k) controls which rows of F and columns of P are used to accumulate the glimpse at position (i_k, j_k) into the hidden layer of the RBM. A similar factorization has been used by [8].
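The factorization of Equation 7 can be sketched as follows. Note the shape convention here is our own (F maps glimpses to D filter outputs, P maps D factors to H hidden units, i.e. transposed relative to the text's statement), and the function name is hypothetical:

```python
import numpy as np

def gated_hidden_input(glimpses, positions, F, P, z, c):
    """Total input to the hidden units accumulated over several glimpses,
    using the factorization W(i, j) = P diag(z(i, j)) F of Equation 7.
    Assumed shapes: F is (D, R) "filters", P is (H, D) "pooling",
    z maps a grid position to a (D,) gating vector in [0, 1]."""
    total = c.copy()                   # hidden biases
    for x_k, pos in zip(glimpses, positions):
        f = F @ x_k                    # project glimpse onto D filter outputs
        total += P @ (z[pos] * f)      # gate by position, pool into H hiddens
    return total                       # activation probs would be sigm(total)
```

The point of the factorization is visible in the inner loop: the full H × R matrix W(i, j) is never formed, yet the result matches multiplying by it explicitly.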
We emphasize that z(i_k, j_k) is not stochastic but is a deterministic function of position (i_k, j_k), trained by backpropagation of gradients from the multi-fixation RBM learning cost. In practice, we force the components of z(i_k, j_k) to be in [0, 1]³. The multi-fixation RBM is illustrated in Figure 1.

4 Learning in the multi-fixation RBM

The multi-fixation RBM must learn to accumulate useful features from each glimpse, and it must also learn a good policy for choosing the fixation points. We refer to these two goals as “learning the what-where combination” and “learning where to look”.

4.1 Learning the what-where combination

For now, let’s assume that we are given the sequence of glimpses x^t_{1:K} fed to the multi-fixation RBM for each image I^t. As suggested by [9], we can train the RBM to minimize the following hybrid cost over each input x^t_{1:K} and label l^t:

Hybrid cost:    C_hybrid = −log p(y^t|x^t_{1:K}) − α log p(y^t, x^t_{1:K})    (8)

where y^t = e_{l^t}. The first term in C_hybrid is the discriminative cost and its gradient with respect to the RBM parameters can be computed exactly, since p(y^t|x^t_{1:K}) can be computed exactly (see [9] for more details on how to derive these gradients). The second term is the generative cost and its gradient can only be approximated. Contrastive Divergence [10] based on one full step of Gibbs sampling provides a good enough approximation. The RBM is then trained by doing stochastic or mini-batch gradient descent on the hybrid cost.

In [9], it was observed that there is typically a value of α which yields better performance than using either discriminative or generative costs alone. Putting more emphasis on the discriminative term ensures that more capacity is allocated to predicting the label values than to predicting each pixel value, which is important because there are many more pixels than labels.
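The hybrid cost of Equation 8 can be made concrete on a toy model. In the sketch below we evaluate both terms exactly by brute-force enumeration of the partition function, which is only feasible at toy sizes; in practice the generative term's gradient is approximated with Contrastive Divergence rather than computed this way. All function names are our own.

```python
import numpy as np
from itertools import product

def log_p_joint_exact(y, x, W, U, b, c, d):
    """log p(y, x) for a tiny classification RBM, with the partition
    function computed by brute force over all binary h, x and one-hot y.
    Feasible only at toy sizes; meant to make Equation 8 concrete."""
    H, R = W.shape
    C = U.shape[1]
    def neg_E(yv, xv, hv):            # -E(y, x, h) from Equation 1
        return hv @ W @ xv + b @ xv + c @ hv + d @ yv + hv @ U @ yv
    def log_unnorm(yv, xv):           # sum over hidden configurations
        return np.logaddexp.reduce(
            [neg_E(yv, np.array(xv, float), np.array(h, float))
             for h in product([0, 1], repeat=H)])
    log_Z = np.logaddexp.reduce(
        [log_unnorm(np.eye(C)[l], xv)
         for l in range(C) for xv in product([0, 1], repeat=R)])
    return log_unnorm(y, x) - log_Z

def hybrid_cost(y, x, params, alpha):
    """C_hybrid = -log p(y | x) - alpha * log p(y, x)   (Equation 8)."""
    W, U, b, c, d = params
    C = U.shape[1]
    log_joint = log_p_joint_exact(y, x, W, U, b, c, d)
    # log p(x) = logsumexp over all classes of log p(y', x)
    log_px = np.logaddexp.reduce(
        [log_p_joint_exact(np.eye(C)[l], x, W, U, b, c, d) for l in range(C)])
    return -(log_joint - log_px) - alpha * log_joint
```

Setting alpha = 0 recovers the purely discriminative cost; larger alpha weights the generative, data-dependent regularization term more heavily.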
The generative term acts as a data-dependent regularizer that encourages the RBM to extract features that capture the statistical structure of the input. This is a much better regularizer than the domain-independent priors implemented by L1 or L2 regularization.

We can also take advantage of the following obvious fact: if the sequence x^t_{1:K} is associated with a particular target label y^t, then so are all the subsequences x^t_{1:k} where k < K. Hence, we can also train the multi-fixation RBM on these subsequences using the following “hybrid-sequential” cost:

Hybrid-sequential cost:    C_hybrid−seq = Σ_{k=1}^{K} ( −log p(y^t|x^t_{1:k}) − α log p(y^t, x^t_k|x^t_{1:k−1}) )    (9)

where the second term, which corresponds to negative log-likelihoods under a so-called conditional RBM [8], plays a similar role to the generative cost term of the hybrid cost and encourages the RBM to learn about the statistical structure of the input glimpses. An estimate of the gradient of this term can also be obtained using Contrastive Divergence (see [8] for more details). While being more expensive than the hybrid cost, the hybrid-sequential cost could yield better generalization performance by better exploiting the training data. Both costs are evaluated in Section 6.1.

³This is done by setting z(i_k, j_k) = sigm(z̄(i_k, j_k)) and learning the unconstrained z̄(i_k, j_k) vectors instead. We also use a learning rate 100 times larger for learning those parameters.

4.2 Learning where to look

Now that we have a model for processing the glimpses resulting from fixating at different positions, we need to define a model which will determine where those fixations should be made on the m × m grid of possible positions.

After k − 1 fixations, this model should take as input some vector s_k containing information about the glimpses accumulated so far (e.g.
the current activation probabilities of the multi-fixation RBM hidden layer), and output a score f(s_k, (i_k, j_k)) for each possible fixation position (i_k, j_k). This score should be predictive of how useful fixating at the given position will be. We refer to this model as the controller.

Ideally, the fixation position with highest score under the controller should be the one which maximizes the chance of correctly classifying the input image. For instance, a good controller could be such that

f(s_k, (i_k, j_k)) ∝ log p(y^t|x^t_{1:k−1}, x^t_k = r(I, (i_k, j_k)))    (10)

i.e. its output is proportional to the log-probability the RBM will assign to the true target y^t of the image I^t once it has fixated at position (i_k, j_k) and incorporated the information in that glimpse. In other words, we would like the controller to assign high scores to fixation positions which are more likely to provide the RBM with the necessary information to make a correct prediction of y^t.

A simple training cost for the controller could then be to reduce the absolute difference between its prediction f(s_k, (i_k, j_k)) and the observed value of log p(y^t|x^t_{1:k−1}, x^t_k = r(I, (i_k, j_k))) for the sequences of glimpses generated while training the multi-fixation RBM. During training, these sequences of glimpses can be generated from the controller using the Boltzmann distribution

p_controller((i_k, j_k)|x^t_{1:k−1}) ∝ exp(f(s_k, (i_k, j_k)))    (11)

which ensures that all fixation positions can be sampled, but those which are currently considered more useful by the controller are also more likely to be chosen. At test time however, for each k, the position that is the most likely under the controller is chosen⁴.

In our experiments, we used a linear model for f(s_k, (i_k, j_k)), with separate weights for each possible value of (i_k, j_k). The controller is the same for all k, i.e.
f(s_k, (i_k, j_k)) only depends on the values of s_k and (i_k, j_k) (though one could consider training a separate controller for each k). A constant learning rate of 0.001 was used for training. As for the value taken by s_k, we set it to

s_k = sigm( c + Σ_{k*=1}^{k−1} W(i_{k*}, j_{k*}) x_{k*} ) = sigm( c + Σ_{k*=1}^{k−1} P diag(z(i_{k*}, j_{k*})) F x_{k*} )    (12)

which can be seen as an estimate of the probability vector for each hidden unit of the RBM to be 1, given the previous glimpses x_{1:k−1}. For the special case k = 1, s_1 is computed based on a fixation at the center of the image, but all the information in this initial glimpse is then “forgotten”, i.e. it is only used for choosing the first image-dependent fixation point and is not used by the multi-fixation RBM to accumulate information about the image. We also concatenate to s_k a binary vector of size m² (one component for each possible fixation position), where a component is 1 if the associated position has been fixated. Finally, in order to ensure that a fixation position is never sampled twice, we impose that p_controller((i_k, j_k)|x^t_{1:k−1}) = 0 for all positions previously sampled.

4.3 Putting it all together

Figure 2 summarizes how the multi-fixation RBM and the controller are jointly trained, for either the hybrid cost or the hybrid-sequential cost. Details on gradient computations for both costs are also given in the supplementary material.

⁴While it might not be optimal, this greedy search for the best sequence of fixation positions is simple and worked well in practice.
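The controller's fixation choice, combining the Boltzmann sampling of Equation 11, the exclusion of previously visited positions, and the greedy test-time rule, can be sketched as follows (function and argument names are our own):

```python
import numpy as np

def sample_fixation(scores, visited, rng, greedy=False):
    """Pick the next fixation position from the controller scores
    f(s_k, .) over the flattened m*m grid (Equation 11).  Positions
    already fixated (boolean mask `visited`) get zero probability.
    With greedy=True (test time), the most likely position is chosen."""
    logits = scores.astype(float).copy()
    logits[visited] = -np.inf            # never fixate the same place twice
    if greedy:
        return int(np.argmax(logits))
    logits -= logits[~visited].max()     # stabilize the softmax
    p = np.exp(logits)                   # exp(-inf) = 0 for visited cells
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

During training, sampling keeps exploration alive while still favoring positions the controller currently scores highly; at test time the argmax implements the greedy search described in footnote 4.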
To our knowledge, this is the \ufb01rst implemented system\nfor combining glimpses that jointly trains a recognition component (the RBM) with an attentional\ncomponent (the \ufb01xation controller).\n\n5 Related work\nA vast array of work has been dedicated to modelling the visual search behavior of humans [11, 12,\n13, 14], typically through the computation of saliency maps [15, 16]. Most of such work, however,\nis concerned with the prediction of salient regions in an image, and not with the other parts of a\ntask-oriented vision classi\ufb01er.\nSurprisingly little work has been done on how best to combine multiple glimpses in a recognition\nsystem. SIFT features have been proposed either as a pre\ufb01lter for reducing the number of possible\n\ufb01xation positions [17] or as a way of preprocessing the raw glimpses [13]. [18] used a \ufb01xed and\nhand-tuned saliency map to sample small patches in images of hand-written characters and trained\na recursive neural network from sequences of such patches. By contrast, the model proposed here\ndoes not rely on hand-tuned features or saliency maps and learns from scratch both the where to look\nand what-where combination components. A further improvement on the aforecited work consists\nin separately learning both the where to look and the what-where combination components [19, 20].\nIn this work however, both components are learned jointly, as opposed to being put together only at\ntest time. For instance, [19] use a saliency map based on \ufb01lters previously trained on natural images\nfor the where to look component, and the what-where combination component for recognition is a\nnearest neighbor density estimator. Moreover, their goal is not to avoid \ufb01xating everywhere, but to\nobtain more robust recognition by using a saliency map (whose computation effectively corresponds\nto \ufb01xating everywhere in the image). 
In that respect, our work is orthogonal, as we are treating each\n\ufb01xation as a costly operation (e.g. we considered up to 6 \ufb01xations, while they used 100 \ufb01xations).\n\n6 Experiments\nWe present three experiments on three different image classi\ufb01cation problems. The \ufb01rst is based\non the MNIST dataset and is meant to evaluate the multi-\ufb01xation RBM alone (i.e. without the con-\ntroller). The second is on a synthetic dataset and is meant to analyze the controller learning algorithm\nand its interaction with the multi-\ufb01xation RBM. Finally, results on a facial expression recognition\nproblem are presented.\n\n6.1 Experiment 1: Evaluation of the multi-\ufb01xation RBM\n\nIn order to evaluate the multi-\ufb01xation RBM of Section 3.2 separately from the controller model, we\ntrained a multi-\ufb01xation RBM5 on a \ufb01xed set of 4 \ufb01xations (i.e. the same \ufb01xation positions for all im-\nages). Those \ufb01xations were centered around the pixels at positions {(9, 9), (9, 19), (19, 9), (19, 19)}\n(MNIST images are of size 28 \u00d7 28) and their order was chosen at random for every parameter up-\ndate of the RBM. The retinal transformation had a high-resolution fovea covering 38 pixels and 60\nhexagonal low-resolution regions in the periphery (see Figure 2 for an illustration). We used the\ntraining, validation and test splits proposed by [21], with a training set of 10 000 examples.\nThe results are given in Figure 2, with comparisons with an RBF kernel SVM classi\ufb01er and a single\nhidden layer neural network initialized using unsupervised training of an RBM on the training set\n(those two baselines were trained on the full MNIST images). 
The multi-fixation RBM yields performance comparable to the baselines despite only having four glimpses, and the hybrid-sequential cost function works better than the non-sequential, hybrid cost.

⁵The RBM used H = 500 hidden units and was trained with a constant learning rate of 0.1 (no momentum was used). The learned position vectors z(i_k, j_k) were of size D = 250. Training lasted for 2000 iterations, with a validation set used to keep track of generalization performance and remember the best parameter value of the RBM. We report results when using either the hybrid cost of Equation 8 or the hybrid-sequential cost of Equation 9, with α = 0.1. Mini-batches of size 100 were used.

Pseudocode for training update
· compute s_1 based on center of image
for k from 1 to K do
    · sample (i_k, j_k) from p_controller((i_k, j_k)|x^t_{1:k−1})
    · compute x_k = r(I, (i_k, j_k))
    · update controller with a gradient step for error |f(s_k, (i_k, j_k)) − log p(y|x_{1:k})|
    if using hybrid-sequential cost then
        · accumulate gradient on RBM parameters of k-th term in cost C_hybrid−seq
    end if
    · compute s_{k+1}
end for
if using hybrid-sequential cost then
    · update RBM parameters based on accumulated gradient of hybrid-sequential cost C_hybrid−seq
else {using hybrid cost}
    · update RBM based on gradient of hybrid cost C_hybrid
end if

Model                                     Error
NNet+RBM [22]                             3.17% (± 0.15)
SVM [21]                                  3.03% (± 0.15)
Multi-fixation RBM (hybrid)               3.20% (± 0.15)
Multi-fixation RBM (hybrid-sequential)    2.76% (± 0.14)

Figure 2: A: Pseudocode for the training update of the multi-fixation RBM, using either the hybrid or hybrid-sequential cost. B: illustration of glimpses and results for experiment on MNIST.

6.2 Experiment 2: evaluation of the controller

In this second experiment, we designed a synthetic problem where the optimal fixation policy is known, to validate the proposed training algorithm for the controller. The task is to identify whether there is a horizontal (positive class) or vertical (negative class) 3-pixel white bar somewhere near the edge of a 15 × 15 pixel image. At the center of the image is one of 8 visual symbols, indicating the location of the bar. This symbol conveys no information about the class (the positive and negative classes are equiprobable) but is necessary to identify where to fixate. Figure 3 shows positive and negative examples. There are only 48 possible images and the model is trained on all of them (i.e. we are measuring the capacity of the model to learn this problem perfectly). Since, as described earlier, the input s_1 of the controller contains information about the center of the image, only one fixation decision by the controller suffices to solve this problem.

A multi-fixation RBM was trained jointly with a controller on this problem⁶, with only K = 1 fixation. When trained according to the hybrid cost of Equation 8 (α = 1), the model was able to solve this problem perfectly without errors, i.e. the controller always proposes to fixate at the region containing the white bar and the multi-fixation RBM always correctly recognizes the orientation of the bar. However, using only the discriminative cost (α = 0), it is never able to solve it (i.e. it has an error rate of 50%), even if trained twice as long as for α = 1. This is because the purely discriminative RBM never learns meaningful features for the non-discriminative visual symbol at the center, which are essential for the controller to be able to predict the position of the white bar.

6.3 Experiment 3: facial expression recognition experiment

Finally, we applied the multi-fixation RBM with its controller to a problem of facial expression recognition.
The dataset [23] consists of 4178 images of size 100 × 100, depicting people acting one of seven facial expressions (anger, disgust, fear, happiness, sadness, surprise and neutral; see Figure 3 for examples). Five training, validation and test set splits were generated, ensuring that all images of a given person can only be found in one of the three sets. Pixel values of the images were scaled to the [−0.5, 0.5] interval.

⁶Hyper-parameters: H = 500, D = 250. Stochastic gradient descent was used with a learning rate of 0.001. The controller had the choice of 9 possible fixation positions, each covering either one of the eight regions where bars can be found or the middle region where the visual symbol is. The retinal transformation was such that information from only one of those regions is transferred.

⁷Hyper-parameters: H = 250, D = 250. Stochastic gradient descent was used with a learning rate of 0.01. The RBM was trained with the hybrid cost of Equation 8 with α = 0.001 (the hybrid cost was preferred mainly because it is faster). Also, the matrix P was set to the identity matrix and only F was learned (this removed a matrix multiplication and thus accelerated learning in the model, while still giving good results). The vectors

Figure 3: A: positive and negative examples from the synthetic dataset of experiment 2. B: examples and results for the facial expression recognition dataset.

A multi-fixation RBM learned jointly with a controller was trained on this problem⁷, with K = 6 fixations. Possible fixation positions were laid out every 10 pixels on a 7 × 7 grid, with the top-left position being at pixel (20, 20). The retinal transformation covered around 2000 pixels and didn’t use a periphery⁸ (all pixels were from the fovea).
Moreover, glimpses were passed through a \u201cpre-\nprocessing\u201d hidden layer of size 250, initialized by unsupervised training of an RBM with Gaussian\nvisible units (but without target units) on glimpses from the 7\u00d7 7 grid. During training of the multi-\n\ufb01xation RBM, the discriminative part of its gradient was also passed through the preprocessing\nhidden layer for \ufb01ne-tuning of its parameters.\nResults are reported in Figure 3, where the multi-\ufb01xation RBM is compared to an RBF kernel SVM\ntrained on the full images. The accuracy of the RBM is given after a varying number of \ufb01xations.\nWe can see that after 3 \ufb01xations (i.e. around 60% of the image) the multi-\ufb01xation RBM reaches\na performance that is statistically equivalent to that of the SVM (58.2 \u00b1 1.5%) trained on the full\nimages. Training the SVM on a scaled-down version of the data (48 \u00d7 48 pixels) gives a similar\nperformance of 57.8% (\u00b11.5%). At 5 \ufb01xations, the multi-\ufb01xation RBM now improves on the SVM,\nand gets even better at 6 \ufb01xations, with an accuracy of 62.7% (\u00b11.5%). Finally, we also computed\nthe performance of a linear SVM classi\ufb01er trained on the concatenation of the hidden units from\na unique RBM with Gaussian visible units applied at all 7 \u00d7 7 positions (the same RBM used\nfor initializing the preprocessing layer of the multi-\ufb01xation RBM was used). This convolutional\napproach, which requires 49 \ufb01xations, yields a performance of 61.2% (\u00b11.5%), slightly worse but\nstatistically indistinguishable from the multi-\ufb01xation RBM which only required 6 \ufb01xations.\n\n7 Conclusion\nHuman vision is a sequential sampling process in which only a fraction of the optic array is ever\nprocessed at the highest resolution. Most computer vision work on object recognition ignores this\nfact and can be viewed as modelling tachistoscopic recognition of very small objects that lie entirely\nwithin the fovea. 
We have focused on the other extreme, i.e., recognizing objects by using multiple task-specific fixations of a retina with few pixels, and obtained positive results. We believe that the intelligent choice of fixation points and the integration of multiple glimpses will be essential for making biologically inspired vision systems work well on large images.

Acknowledgments

We thank Marc’Aurelio Ranzato and the reviewers for many helpful comments, and Josh Susskind and Tommy Liu for help with the facial expression dataset. This research was supported by NSERC.

z(i, j) were initialized in a topographic manner (i.e., each component of z(i, j) is ≫ 0 only in a small region of the image). Finally, to avoid overfitting, exponentially decaying averages of the parameters of the model were maintained throughout training and were used as the values of the model at test time.

8 This simplification of the retinal transformation makes it more convenient to estimate the percentage of high-resolution pixels used by the multi-fixation RBM and contrast it with the SVM trained on the full image.

[Figure 3: Experiment 2 (synthetic dataset: positive and negative examples) and Experiment 3 (facial expression recognition dataset: examples and results); accuracy (0.40–0.65) as a function of the number of fixations (1–6) for the multi-fixation RBM vs. the SVM.]