{"title": "Gated Softmax Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1603, "page_last": 1611, "abstract": "We describe a \"log-bilinear\" model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets.", "full_text": "Gated Softmax Classification\n\nRoland Memisevic\nDepartment of Computer Science\nETH Zurich\nSwitzerland\nroland.memisevic@gmail.com\n\nChristopher Zach\nDepartment of Computer Science\nETH Zurich\nSwitzerland\nchzach@inf.ethz.ch\n\nGeoffrey Hinton\nDepartment of Computer Science\nUniversity of Toronto\nCanada\nhinton@cs.toronto.edu\n\nMarc Pollefeys\nDepartment of Computer Science\nETH Zurich\nSwitzerland\nmarc.pollefeys@inf.ethz.ch\n\nAbstract\n\nWe describe a \u201clog-bilinear\u201d model that computes class probabilities by combining an input vector multiplicatively with a vector of binary latent variables. Even though the latent variables can take on exponentially many possible combinations of values, we can efficiently compute the exact probability of each class by marginalizing over the latent variables. This makes it possible to get the exact gradient of the log likelihood. 
The bilinear score-functions are defined using a three-dimensional weight tensor, and we show that factorizing this tensor allows the model to encode invariances inherent in a task by learning a dictionary of invariant basis functions. Experiments on a set of benchmark problems show that this fully probabilistic model can achieve classification performance that is competitive with (kernel) SVMs, backpropagation, and deep belief nets.\n\n1 Introduction\n\nConsider the problem of recognizing an image that contains a single hand-written digit that has been approximately normalized but may have been written in one of a number of different styles. Features extracted from the image often provide much better evidence for a combination of a class and a style than they do for the class alone. For example, a diagonal stroke might be highly compatible with an italic 1 or a non-italic 7. A short piece of horizontal stroke at the top right may be compatible with a very italic 3 or a 5 with a disconnected top. A fat piece of vertical stroke at the bottom of the image near the center may be compatible with a 1 written with a very thick pen or a narrow 8 written with a moderately thick pen so that the bottom loop has merged. If each training image was labeled with both the class and the values of a set of binary style features, it would make sense to use the image features to create a bipartite conditional random field (CRF) which gave low energy to combinations of a class label and a style feature that were compatible with the image feature. This would force the way in which local features were interpreted to be globally consistent about style features such as stroke thickness or \u201citalicness\u201d. But what if the values of the style features are missing from the training data?\n\nWe describe a way of learning a large set of binary style features from training data that are only labeled with the class. 
Our \u201cgated softmax\u201d model allows the 2^K possible combinations of the K learned style features to be integrated out. This makes it easy to compute the posterior probability of a class label on test data and easy to get the exact gradient of the log probability of the correct label on training data.\n\n1\n\n\f1.1 Related work\n\nThe model is related to several models from the literature, which we discuss in the following. [1] describes a bilinear sparse coding model that, similar to our model, can be trained discriminatively to predict classes. Unlike in our case, there is no interpretation as a probabilistic model, and, consequently, no simple learning rule. Furthermore, the model parameters, unlike in our case, are not factorized, and as a result the model cannot extract features which are shared among classes. Feature sharing, as we shall show, greatly improves classification performance as it allows for learning of invariant representations of the input.\n\nOur model is similar to the top layer of the deep network discussed in [2], again, without factorization and feature sharing. We also derive and utilize discriminative gradients that allow for efficient training. Our model can also be viewed as a \u201cdegenerate\u201d special case of the image transformation model described in [3], which replaces the output-image in that model with a \u201cone-hot\u201d encoded class label. The intractable objective function of that model, as a result, collapses into a tractable form, making it possible to perform exact inference.\n\nWe describe the basic model, how it relates to logistic regression, and how to perform learning and inference in the following section. 
We show results on benchmark classification tasks in Section 3 and discuss possible extensions in Section 4.\n\n2 The Gated Softmax Model\n\n2.1 Log-linear models\n\nWe consider the standard classification task of mapping an input vector x \u2208 IRn to a class-label y. One of the most common, and certainly oldest, approaches to solving this task is logistic regression, which is based on a log-linear relationship between inputs and labels (see, for example, [4]). In particular, using a set of linear, class-specific score functions (Eq. 1), we can obtain probabilities over classes by exponentiating and normalizing (Eq. 2). Classification decisions for test-cases x_test are given by arg max_y p(y|x_test). Training amounts to adapting the vectors w_y by maximizing the average conditional log-probability (1/N) sum_\u03b1 log p(y\u03b1|x\u03b1) for a set {(x\u03b1, y\u03b1)}, \u03b1 = 1, . . . , N, of training cases. Since there is no closed form solution, training is typically performed using some form of gradient based optimization. In the case of two or more labels, logistic regression is also referred to as the \u201cmultinomial logit model\u201d or the \u201cmaximum entropy model\u201d [5]. It is possible to include additive \u201cbias\u201d terms b_y in the definition of the score function (Eq. 1) so that class-scores are affine, rather than linear, functions of the input. Alternatively, we can think of the inputs as being in a \u201chomogeneous\u201d representation with an extra constant 1-dimension, in which biases are implemented implicitly.\n\nImportant properties of logistic regression are that (a) the training objective is convex, so there are no local optima, and (b) the model is probabilistic, hence it comes with well-calibrated estimates of uncertainty in the classification decision (cf. Eq. 2) [4]. 
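As a concrete illustration, the score and normalization of Eqs. 1 and 2 amount to only a few lines of code. Below is a minimal NumPy sketch; the function and variable names are ours, not from the paper:

```python
import numpy as np

def log_linear_class_probs(w, x):
    # w: (num_classes, num_inputs) array whose rows are the vectors wy.
    # x: (num_inputs,) input; appending a constant 1 would implement biases implicitly.
    s = w @ x                      # class scores s_y(x) = wy^t x (Eq. 1)
    s = s - s.max()                # shift for numerical stability; p(y|x) is unchanged
    p = np.exp(s)                  # exponentiate (Eq. 2) ...
    return p / p.sum()             # ... and normalize over classes
```

Training then maximizes the average log of these probabilities over the labeled training cases, e.g. by gradient ascent.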
Property (a) is shared with, and property (b) a possible advantage over, margin-maximizing approaches, like support vector machines [4].\n\n2.2 A log-bilinear model\n\nLogistic regression makes the assumption that classes can be separated in the input space with hyperplanes (up to noise). A common way to relax this assumption is to replace the linear separation manifold, and thus, the score function (Eq. 1), with a non-linear one, such as a neural network [4]. Here, we take an entirely different, probabilistic approach. We take the stance that we do not know what form the separation manifold takes on, and instead introduce a set of probabilistic hidden variables which cooperate to model the decision surface jointly. To obtain classification decisions at test-time and for training the model, we then need to marginalize over these hidden variables.\n\n2\n\ns_y(x) = w_y^t x (1)\n\np(y|x) = exp(w_y^t x) / sum_y' exp(w_y'^t x) (2)\n\n\fFigure 1: (a) A log-bilinear model: Binary hidden variables hk can blend in log-linear dependencies that connect input features xi with labels y. (b) Factorization allows for blending in a learned feature space.\n\nMore specifically, we consider the following variation of logistic regression: We introduce a vector h of binary latent variables (h1, . . . , hK) and replace the linear score (Eq. 1) with a bilinear score of x and h:\n\ns_y(x, h) = h^t W_y x. (3)\n\nThe bilinear score combines, quadratically, all pairs of input components xi with hidden variables hk. The score for each class is thus a quadratic product, parameterized by a class-specific matrix Wy. This is in contrast to the inner product, parameterized by class-specific vectors wy, for logistic regression. 
To turn scores into probabilities we can again exponentiate and normalize:\n\np(y, h|x) = exp(h^t W_y x) / sum_{y',h'} exp(h'^t W_y' x). (4)\n\nIn contrast to logistic regression, we obtain a distribution over both the hidden variables h and labels y. We get back the (input-dependent) distributions over labels with an additional marginalization over h:\n\np(y|x) = sum_{h \u2208 {0,1}^K} p(y, h|x). (5)\n\nAs with logistic regression, we thus get a distribution over labels y, conditioned on inputs x. The parameters are the set of class-specific matrices Wy. As before, we can add bias terms to the score, or add a constant 1-dimension to x and h. Note that for any single and fixed instantiation of h in Eq. 3, we obtain the logistic regression score (up to normalization), since the argument in the \u201cexp()\u201d collapses to the class-specific row-vector h^t W_y. Each of the 2^K summands in Eq. 5 is therefore exactly one logistic classifier, showing that the model is equivalent to a mixture of 2^K logistic regressors with shared weights. Because of the weight-sharing, the number of parameters grows linearly, not exponentially, in the number of hidden variables. In the following, we let W denote the three-way tensor of parameters (by \u201cstacking\u201d the matrices Wy).\n\nThe sum over 2^K terms in Eq. 5 seems to preclude any reasonably large value for K. However, similar to the models in [6], [7], [2], the marginalization can be performed in closed form and can be computed tractably by a simple re-arrangement of terms:\n\np(y|x) = sum_h p(y, h|x) \u221d sum_h exp(h^t W_y x) = sum_h exp(sum_{i,k} W_yik x_i h_k) = prod_k (1 + exp(sum_i W_yik x_i)) (6)\n\n3\n\n\fThis shows that the class probabilities decouple into a product of K terms1, each of which is a mixture of a uniform and an input-conditional \u201csoftmax\u201d. 
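To make the tractable marginalization concrete, here is a minimal NumPy sketch of Eq. 6, evaluated in the log-domain, together with an explicit sum over all 2^K hidden configurations for comparison. All names and the (classes, K, inputs) storage layout for W are our own illustrative choices, not from the paper:

```python
import numpy as np
from itertools import product

def gated_softmax_log_prob(W, x):
    # W: (num_classes, K, num_inputs) stack of class-specific matrices Wy.
    # x: (num_inputs,) input vector. Returns log p(y|x) for every class y.
    s = W @ x                              # s[y, k] = sum_i W[y, k, i] * x[i]
    a = np.logaddexp(0.0, s).sum(axis=1)   # a_y = sum_k log(1 + exp(s_yk)), Eq. 6
    a_max = a.max()
    log_Z = a_max + np.log(np.exp(a - a_max).sum())  # log of the normalizer
    return a - log_Z

def brute_force_log_prob(W, x):
    # Same quantity via the explicit sum over all 2**K configurations of h.
    C, K, n = W.shape
    scores = np.array([[np.array(h) @ W[y] @ x
                        for h in product([0.0, 1.0], repeat=K)]
                       for y in range(C)])  # h^t Wy x for every (y, h) pair
    joint = np.exp(scores)                  # unnormalized p(y, h|x)
    return np.log(joint.sum(axis=1) / joint.sum())
```

On small random instances the two functions agree, which is the re-arrangement-of-terms argument in numerical form; only the first function scales to large K.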
The model is thus a product of experts [8] (which is conditioned on input vectors x). It can be viewed also as a \u201cstrange\u201d kind of Gated Boltzmann Machine [9] that models a single discrete output variable y using K binary latent variables. As we shall show, it is the conditioning on the inputs x that renders this model useful.\n\nTypically, training products of experts is performed using approximate, sampling based schemes, because of the lack of a closed form for the data probability [8]. The same is true for most conditional products of experts [9].\n\nNote that in our case, the distribution that the model outputs is a distribution over a countable (and, in particular, fairly small2) number of possible values, so we can efficiently compute the constant \u2126 = sum_y' prod_k (1 + exp(sum_i W_y'ik x_i)), which normalizes the left-hand side in Eq. 6. The same observation was utilized before in [6], [7], [10].\n\n2.3 Sharing features among classes\n\nThe score (or \u201cactivation\u201d) that class label y receives from each of the 2^K terms in Eq. 5 is a linear function of the inputs. A different class y' receives activations from a different, non-overlapping set of functions. The number of parameters is thus: (number of inputs) \u00d7 (number of labels) \u00d7 (number of hidden variables). As we shall show in Section 3, the model can achieve fairly good classification performance.\n\nA much more natural way to define class-dependencies in this model, however, is by allowing for some parameters to be shared between classes. In most natural problems, inputs from different classes share the same domain, and therefore show similar characteristics. Consider, for example, handwritten digits, which are composed of strokes, or human faces, which are composed of facial features. 
The features behave like \u201catoms\u201d that, by themselves, are only weakly indicative of a class; it is the composition of these atoms that is highly class-specific3. Note that parameter sharing would not be possible in models like logistic regression or SVMs, which are based on linear score functions.\n\nIn order to obtain class-invariant features, we factorize the parameter tensor W as follows:\n\nW_yik = sum_{f=1..F} W^x_if W^y_yf W^h_kf (7)\n\nThe model parameters are now given by three matrices W^x, W^y, W^h, and each component W_yik of W is defined as a three-way inner product of column vectors taken from these matrices. This factorization of a three-way parameter tensor was previously used by [3] to reduce the number of parameters in an unsupervised model of images. Plugging the factorized form for the weight tensor into the definition of the probability (Eq. 4) and re-arranging terms yields\n\np(y, h|x) \u221d exp( sum_f (sum_i x_i W^x_if)(sum_k h_k W^h_kf) W^y_yf ) (8)\n\nThis shows that, after factorizing, we obtain a classification decision by first projecting the input vector x (and the vector of hidden variables h) onto F basis functions, or filters. The resulting filter responses are multiplied and combined linearly using class-specific weights W^y_yf. An illustration of the model is shown in Figure 1 (b).\n\nAs before, we need to marginalize over h to obtain class-probabilities. In analogy to Eq. 6, we obtain the final form (here written in the log-domain):\n\nlog p(y|x) = a_y \u2212 log sum_y' exp(a_y') (9)\n\nwhere\n\na_y = sum_k log( 1 + exp( sum_f (sum_i x_i W^x_if) W^h_kf W^y_yf ) ). (10)\n\n1The log-probability thus decouples into a sum over K terms and is the preferred object to compute in a numerically stable implementation.\n\n2We are considering \u201cusual\u201d classification problems, so the number of classes is in the tens, hundreds or possibly even millions, but it is not exponential like in a CRF.\n\n3If this was not the case, then many practical classification problems would be much easier to solve.\n\n4\n\n\fNote that in this model, learning of features (the F basis functions W^x_\u00b7f) is tied in with learning of the classifier itself. In contrast to neural networks and deep learners ([11], [12]), the model does not try to learn a feature hierarchy. Instead, learned features are combined multiplicatively with hidden variables and the results added up to provide the inputs to the class-units. In terms of neural networks nomenclature, the factored model can best be thought of as a single-hidden-layer network. In general, however, the concept of \u201clayers\u201d is not immediately applicable in this architecture.\n\n2.4 Interpretation\n\nAn illustration of the graphical model is shown in Figure 1 (non-factored model on the left, factored model on the right). Each hidden variable hk that is \u201con\u201d contributes a slice W_\u00b7k\u00b7 of the parameter tensor to a blend sum_k h_k W_\u00b7k\u00b7 of at most K matrices. The classification decision is the sum over all possible instantiations of h and thus over all possible such blends. 
A single blend is simply a linear logistic classifier.\n\nAn alternative view is that each output unit y accumulates evidence for or against its class by projecting the input onto K basis functions (the rows of Wy in Eq. 4). Each instantiation of h constitutes one way of combining a subset of basis function responses that are considered to be consistent into a single piece of evidence. Marginalizing over h allows us to express the fact that there can be multiple alternative sets of consistent basis function responses. This is like using an \u201cOR\u201d gate to combine the responses of a set of \u201cAND\u201d gates, or like computing a probabilistic version of a disjunctive normal form (DNF). As an example, consider the task of classifying a handwritten 0 that is roughly centered in the image but rotated by a random angle (see also Section 3): Each of the following combinations: (i) a vertical stroke on the left and a vertical stroke on the right; (ii) a horizontal stroke on the top and a horizontal stroke on the bottom; (iii) a diagonal stroke on the bottom left and a diagonal stroke on the top right, would constitute positive evidence for class 0. The model can accommodate each of these if necessary by making appropriate use of the hidden variables.\n\nThe factored model, where basis function responses are computed jointly for all classes and then weighted differently for each class, can be thought of as accumulating evidence accordingly in the \u201cspatial frequency domain\u201d.\n\n2.5 Discriminative gradients\n\nLike the class-probabilities (Eq. 5), and thus the model\u2019s objective function, the derivative of the log-probability w.r.t. the model parameters is tractable and scales linearly, not exponentially, with K. The derivative w.r.t. 
a single parameter W_\u00afyik of the unfactored form (Section 2.2) takes the form:\n\n\u2202 log p(y|x) / \u2202W_\u00afyik = ( \u03b4_\u00afyy \u2212 p(\u00afy|x) ) \u03c3( sum_i' x_i' W_\u00afyi'k ) x_i, with \u03c3(a) = (1 + exp(\u2212a))^(-1). (11)\n\nTo compute gradients of the factored model (Section 2.3) we use Eq. 11 and the chain rule, in conjunction with Eq. 7:\n\n\u2202 log p(y|x) / \u2202W^x_if = sum_{\u00afy,k} ( \u2202 log p(y|x) / \u2202W_\u00afyik ) ( \u2202W_\u00afyik / \u2202W^x_if ). (12)\n\nSimilarly for W^y_yf and W^h_kf (with the sums running over the remaining indices).\n\nAs with logistic regression, we can thus perform gradient based optimization of the model likelihood for training. Moreover, since we have closed form expressions, it is possible to use conjugate gradients for fast training. However, in contrast to logistic regression, the model\u2019s objective function is non-linear, so it can contain local optima. We discuss this issue in more detail in the following section. Like logistic regression, and in contrast to SVMs, the model computes probabilities and thus provides well-calibrated estimates of uncertainty in its decisions.\n\n5\n\n\f2.6 Optimization\n\nThe log-probability is non-linear and can contain local optima w.r.t. W, so some care has to be taken to obtain good local optima during training. In general we found that simply deploying a general-purpose conjugate gradient solver on random parameter initializations does not reliably yield good local optima (even though it can provide good solutions in some cases). 
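As an aside, the unfactored gradient of Eq. 11 is easy to verify numerically against finite differences. A sketch, assuming W is stored as a (num_classes, K, num_inputs) array; all names are ours, not from the paper:

```python
import numpy as np

def log_prob_and_grad(W, x, y):
    # W: (num_classes, K, num_inputs); x: (num_inputs,); y: the observed label.
    # Returns log p(y|x) and the full gradient d log p(y|x) / dW (Eq. 11).
    s = W @ x                                 # s[ybar, k] = sum_i x_i W[ybar, k, i]
    a = np.logaddexp(0.0, s).sum(axis=1)      # per-class log-scores a_ybar
    p = np.exp(a - a.max())
    p = p / p.sum()                           # p(ybar|x) for all classes ybar
    sigma = 1.0 / (1.0 + np.exp(-s))          # logistic units sigma(s)
    delta = np.zeros_like(p)
    delta[y] = 1.0                            # delta_{ybar, y}
    grad = (delta - p)[:, None, None] * sigma[:, :, None] * x[None, None, :]
    return np.log(p[y]), grad

def finite_diff(W, x, y, idx, eps=1e-6):
    # Central difference of log p(y|x) w.r.t. the single entry W[idx].
    Wp, Wm = W.copy(), W.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    return (log_prob_and_grad(Wp, x, y)[0] - log_prob_and_grad(Wm, x, y)[0]) / (2 * eps)
```

The analytic and finite-difference values agree to high precision on random instances, which is a standard sanity check before handing the gradient to a conjugate gradient solver.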
Similar problems occur when training neural networks.\n\nWhile simple gradient descent tends to yield better results, we adopt the approach discussed in [2] in most of our experiments, which consists of initializing with class-specific optimization: The set of parameters in our proposed model is the same as the ones for an ensemble of class-specific distributions p(x|y) (by simply adjusting the normalization in Eq. 4). More specifically, the distribution p(x|y) of inputs given labels is a factored Restricted Boltzmann machine, which can be optimized using contrastive divergence [3]. We found that performing a few iterations of class-conditional optimization as an initialization reliably yields good local optima of the model\u2019s objective function. We also experimented with alternative approaches to avoiding bad local optima, such as letting parameters grow slowly during the optimization (\u201cannealing\u201d), and found that class-specific pre-training yields the best results. This pre-training is reminiscent of training deep networks, which also rely on a pre-training phase. In contrast, however, here we pre-train class-conditionally, and initialize the whole model at once, rather than layer-by-layer. It is possible to perform a different kind of annealing by adding the class-specific and the model\u2019s actual objective functions, and slowly reducing the class-specific influence using some weighting scheme. We used both the simple and the annealed optimization in some of our experiments, but did not find clear evidence that annealing leads to better local optima. 
We found that, given an initialization near a local optimum of the objective function, conjugate gradients can significantly outperform stochastic gradient descent in terms of the speed at which one can optimize both the model\u2019s own objective function and the cost on validation data.\n\nIn practice, one can add a regularization (or \u201cweight-decay\u201d) penalty \u2212\u03bb||W||^2 to the objective function, as is common for logistic regression and other classifiers, where \u03bb is chosen by cross-validation.\n\n3 Experiments\n\nWe applied the Gated Softmax (GSM) classifier4 on the benchmark classification tasks described in [11]. The benchmark consists of a set of classification problems that are difficult because they contain many subtle, and highly complicated, dependencies of classes on inputs. It was initially introduced to evaluate the performance of deep neural networks. Some example tasks are illustrated in Figure 3. The benchmark consists of 8 datasets, each of which contains several thousand gray-level images of size 28 \u00d7 28 pixels. Training set sizes vary between 1200 and 10000. The test-sets contain 50000 examples each. There are three two-class problems (\u201crectangles\u201d, \u201crectangles-images\u201d and \u201cconvex\u201d) and five ten-class problems (which are variations of the MNIST data-set5). To train the model we make use of the approach described in Section 2.6. We do not make use of any random re-starts or other additional ways to find good local optima of the objective function. For the class-specific initializations, we use a class-specific RBM with binary observables on the datasets \u201crectangles\u201d, \u201cmnist-rot\u201d, \u201cconvex\u201d and \u201cmnist\u201d, because they contain essentially binary inputs (or a heavily-skewed histogram), and Gaussian observables on the others. 
For the Gaussian case, we normalize the data to mean zero and standard-deviation one (independently in each dimension). We also tried \u201chybrid\u201d approaches on some data-sets where we optimize a sum of the RBM and the model objective function, and decrease the influence of the RBM as training progresses.\n\n3.1 Learning task-dependent invariances\n\nThe \u201crectangles\u201d task requires the classification of rectangle images into the classes horizontal vs. vertical (some examples are shown in Figure 3 (a)). Figure 2 (left) shows random sets of 50 rows of the matrix Wy learned by the unfactored model (class horizontal on the top, class vertical on the bottom). Each row of Wy corresponds to a class-specific image filter. We display the filters using gray-levels, where brighter means larger. The plot shows that the hidden units, like \u201cHough-cells\u201d, make it possible to accumulate evidence for the different classes, by essentially counting horizontal and vertical strokes in the images. Interestingly, the classification error is 0.56%, which is about a quarter the number of mis-classifications of the next best performer (SVMs with 2.15% error) and significantly more accurate than all other models on this data-set.\n\n4An implementation of the model is available at http://learning.cs.toronto.edu/\u223crfm/gatedsoftmax/\n5http://yann.lecun.com/exdb/mnist/\n\n6\n\n\fFigure 2: Left: Class-specific filters learned from the rectangle task \u2013 top: filters in support of the label horizontal, bottom: filters in support of the class label vertical. Right: Shared filters learned from rotation-invariant digit classification.\n\nAn example of filters learned by the factored model is shown in Figure 2 (right). The task is classification of rotated digits in this example. Figure 3 (b) shows some example inputs. 
In this task, learning invariances with respect to rotation is crucial for achieving good classification performance. Interestingly, the model achieves rotation-invariance by projecting onto a set of circular or radial Fourier-like components. It is important to note that the model infers these filters to be the optimal input representation entirely from the task at hand. The filters resemble basis functions learned by an image transformation model trained to rotate image patches, described in [3]. Classification performance is 11.75% error, which is comparable with the best results on this dataset.\n\nFigure 3: Example images from four of the \u201cdeep learning\u201d benchmark tasks: (a) Rectangles (2-class): Distinguish horizontal from vertical rectangles; (b) Rotated digits (10-class): Determine the class of the digit; (c) Convex vs. non-convex (2-class): Determine if the image shows a convex or non-convex shape; (d) Rectangles with images (2-class): Like (a), but rectangles are rendered using natural images.\n\n3.2 Performance\n\nClassification performance on all 8 datasets is shown in Figure 4. To evaluate the model we chose the number of hidden units K, the number of factors F and the regularizer \u03bb based on a validation set (typically by taking a fifth of the training set). We varied both K and F between 50 and 1000 on a fairly coarse grid, such as 50, 500, 1000, for most datasets, and for most cases we tried two values for the regularizer (\u03bb = 0.001 and \u03bb = 0.0). A finer grid may improve performance further.\n\nFigure 4 shows that the model performs well on all data-sets (comparison numbers are from [11]). It is among the best (within 0.01 tolerance), or the best performer, in three out of 8 cases. 
For comparison, we also show the error rates achieved with the unfactored model (Section 2.2), which also performs fairly well as compared to deep networks and SVMs, but is significantly weaker in most cases than the factored model.\n\ndataset/model:    SVMRBF  SVMPOL   NNet    RBM   DBN3   SAA3   GSM (unfact.)\nrectangles          2.15    2.15   7.16   4.71   2.60   2.41    0.83 (0.56)\nrect.-images       24.04   24.05  33.20  23.69  22.50  24.05   22.51 (23.17)\nmnistplain          3.03    3.69   4.69   3.94   3.11   3.46    3.70 (3.98)\nconvexshapes       19.13   19.82  32.25  19.92  18.63  18.41   17.08 (21.03)\nmnistbackrand      14.58   16.62  20.04   9.80   6.73  11.28   10.48 (11.89)\nmnistbackimg       22.61   24.01  27.41  16.15  16.31  23.00   23.65 (22.07)\nmnistrotbackimg    55.18   56.41  62.16  52.21  47.39  51.93   55.82 (55.16)\nmnistrot           11.11   15.42  18.11  14.69  10.30  10.30   11.75 (16.15)\n\nFigure 4: Classification error rates on test data (error rates are in %). Models: SVMRBF: SVM with RBF kernels. SVMPOL: SVM with polynomial kernels. NNet: (MLP) Feed-forward neural net. RBM: Restricted Boltzmann Machine. DBN3: Three-layer Deep Belief Net. SAA3: Three-layer stacked auto-associator. GSM: Gated softmax model (in brackets: unfactored model).\n\n4 Discussion/Future work\n\nSeveral extensions of deep learning methods, including deep kernel methods, have been suggested recently (see, for example, [13], [14]), giving similar performance to the networks that we compare to here. Our method differs from these approaches in that it is not a multi-layer architecture. Instead, our model gets its power from the fact that inputs, hidden variables and labels interact in three-way cliques. Factored three-way interactions make it possible to learn task-specific features and to learn transformational invariances inherent in the task at hand.\n\nIt is interesting to note that the model outperforms kernel methods on many of these tasks. 
In contrast to kernel methods, the GSM provides fully probabilistic outputs and can be easily trained online, which makes it directly applicable to very large datasets.\n\nInterestingly, the filters that the model learns (see the previous section; Figure 2) resemble those learned by recent models of image transformations (see, for example, [3]). In fact, learning of invariances in general is typically addressed in the context of learning transformations. Interestingly, most transformation models themselves are also defined via three-way interactions of some kind ([15], [16], [17], [18], [19]). In contrast to a model of transformations, it is the classification task that defines the invariances here, and the model learns the invariant representations from that task only. Combining the explicit examples of transformations provided by video sequences with the implicit information about transformational invariances provided by labels is a promising future direction.\n\nGiven the probabilistic definition of the model, it would be interesting to investigate a fully Bayesian formulation that integrates over model parameters. Note that we trained the model without sparsity constraints and in a fully supervised way. Encouraging the hidden unit activities to be sparse (e.g. using the approach in [20]) and/or training the model semi-supervised are directions for further research. Another direction is the extension to structured prediction problems, for example, by deploying the model as a clique potential in a CRF.\n\nAcknowledgments\n\nWe thank Peter Yianilos and the anonymous reviewers for valuable discussions and comments.\n\n8\n\n\fReferences\n[1] Julien Mairal, Francis Bach, Jean Ponce, Guillermo Sapiro, and Andrew Zisserman. Supervised dictionary learning. In Advances in Neural Information Processing Systems 21. 2009.\n\n[2] Vinod Nair and Geoffrey Hinton. 3D object recognition with deep belief nets. 
In Advances in Neural\n\nInformation Processing Systems 22. 2009.\n\n[3] Roland Memisevic and Geoffrey Hinton. Learning to represent spatial transformations with factored\n\nhigher-order Boltzmann machines. Neural Computation, 22(6):1473\u201392, 2010.\n\n[4] Christopher Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics).\n\nSpringer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.\n\n[5] Adam Berger, Vincent Della Pietra, and Stephen Della Pietra. A maximum entropy approach to natural\n\nlanguage processing. Computational Linguistics, 22(1):39\u201371, 1996.\n\n[6] Geoffrey Hinton. To recognize shapes, \ufb01rst learn to generate images. Technical report, Toronto, 2006.\n[7] Hugo Larochelle and Yoshua Bengio. Classi\ufb01cation using discriminative restricted Boltzmann machines.\nIn ICML \u201908: Proceedings of the 25th international conference on Machine learning, New York, NY,\nUSA, 2008. ACM.\n\n[8] Geoffrey Hinton. Training products of experts by minimizing contrastive divergence. Neural Computa-\n\ntion, 14(8):1771\u20131800, 2002.\n\n[9] Roland Memisevic and Geoffrey Hinton. Unsupervised learning of image transformations. In Proceedings\n\nof IEEE Conference on Computer Vision and Pattern Recognition, 2007.\n\n[10] Vinod Nair and Geoffrey Hinton. Implicit mixtures of restricted Boltzmann machines. In Advances in\n\nNeural Information Processing Systems 21. 2009.\n\n[11] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical\nevaluation of deep architectures on problems with many factors of variation. In ICML \u201907: Proceedings\nof the 24th international conference on Machine learning, New York, NY, USA, 2007. ACM.\n\n[12] Yoshua Bengio and Yann LeCun. Scaling learning algorithms towards ai.\n\nIn L. Bottou, O. Chapelle,\n\nD. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.\n\n[13] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. 
In Advances in Neural Information\n\nProcessing Systems 22. 2009.\n\n[14] Jason Weston, Fr\u00b4ed\u00b4eric Ratle, and Ronan Collobert. Deep learning via semi-supervised embedding. In\nICML \u201908: Proceedings of the 25th international conference on Machine learning, New York, NY, USA,\n2008. ACM.\n\n[15] Bruno Olshausen, Charles Cadieu, Jack Culpepper, and David Warland. Bilinear models of natural im-\n\nages. In SPIE Proceedings: Human Vision Electronic Imaging XII, San Jose, 2007.\n\n[16] Rajesh Rao and Dana Ballard. Ef\ufb01cient encoding of natural time varying images produces oriented space-\n\ntime receptive \ufb01elds. Technical report, Rochester, NY, USA, 1997.\n\n[17] Rajesh Rao and Daniel Ruderman. Learning lie groups for invariant visual perception. In In Advances in\n\nNeural Information Processing Systems 11. MIT Press, 1999.\n\n[18] David Grimes and Rajesh Rao. Bilinear sparse coding for invariant vision. Neural Computation, 17(1):47\u2013\n\n73, 2005.\n\n[19] Joshua Tenenbaum and William Freeman. Separating style and content with bilinear models. Neural\n\nComputation, 12(6):1247\u20131283, 2000.\n\n[20] Honglak Lee, Chaitanya Ekanadham, and Andrew Ng. Sparse deep belief net model for visual area V2.\n\nIn Advances in Neural Information Processing Systems 20. MIT Press, 2008.\n\n9\n\n\f", "award": [], "sourceid": 322, "authors": [{"given_name": "Roland", "family_name": "Memisevic", "institution": null}, {"given_name": "Christopher", "family_name": "Zach", "institution": null}, {"given_name": "Marc", "family_name": "Pollefeys", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}