{"title": "Using Deep Belief Nets to Learn Covariance Kernels for Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 1249, "page_last": 1256, "abstract": "We show how to use unlabeled data and a deep belief net (DBN) to learn a good covariance kernel for a Gaussian process. We first learn a deep generative model of the unlabeled data using the fast, greedy algorithm introduced by Hinton et.al. If the data is high-dimensional and highly-structured, a Gaussian kernel applied to the top layer of features in the DBN works much better than a similar kernel applied to the raw input. Performance at both regression and classification can then be further improved by using backpropagation through the DBN to discriminatively fine-tune the covariance kernel.", "full_text": "Using Deep Belief Nets to Learn Covariance Kernels\n\nfor Gaussian Processes\n\nRuslan Salakhutdinov and Geoffrey Hinton\n\nDepartment of Computer Science, University of Toronto\n\n6 King\u2019s College Rd, M5S 3G4, Canada\n\nrsalakhu,hinton@cs.toronto.edu\n\nAbstract\n\nWe show how to use unlabeled data and a deep belief net (DBN) to learn a good\ncovariance kernel for a Gaussian process. We \ufb01rst learn a deep generative model\nof the unlabeled data using the fast, greedy algorithm introduced by [7]. If the\ndata is high-dimensional and highly-structured, a Gaussian kernel applied to the\ntop layer of features in the DBN works much better than a similar kernel applied\nto the raw input. Performance at both regression and classi\ufb01cation can then be\nfurther improved by using backpropagation through the DBN to discriminatively\n\ufb01ne-tune the covariance kernel.\n\n1 Introduction\nGaussian processes (GP\u2019s) are a widely used method for Bayesian non-linear non-parametric re-\ngression and classi\ufb01cation [13, 16]. 
GP's are based on defining a similarity or kernel function that encodes prior knowledge of the smoothness of the underlying process that is being modeled. Because of their flexibility and computational simplicity, GP's have been successfully used in many areas of machine learning.

Many real-world applications are characterized by high-dimensional, highly-structured data with a large supply of unlabeled data but a very limited supply of labeled data. Applications such as information retrieval and machine vision are examples where unlabeled data is readily available. GP's are discriminative models by nature and within the standard regression or classification scenario, unlabeled data is of no use. Given a set of i.i.d. labeled input vectors Xl = {x_n}, n = 1...N, and their associated target labels {y_n} ∈ R (regression) or {y_n} ∈ {−1, 1} (classification), GP's model p(y_n|x_n) directly. Unless some assumptions are made about the underlying distribution of the input data X = [Xl, Xu], unlabeled data, Xu, cannot be used. Many researchers have tried to use unlabeled data by incorporating a model of p(X). For classification tasks, [11] model p(X) as a mixture Σ_{y_n} p(x_n|y_n)p(y_n) and then infer p(y_n|x_n), [15] attempts to learn covariance kernels based on p(X), and [10] assumes that the decision boundaries should occur in regions where the data density, p(X), is low. When faced with high-dimensional, highly-structured data, however, none of the existing approaches have proved to be particularly successful.

In this paper we exploit two properties of DBN's. First, they can be learned efficiently from unlabeled data and the top-level features generally capture significant, high-order correlations in the data. Second, they can be discriminatively fine-tuned using backpropagation.
We first learn a DBN model of p(X) in an entirely unsupervised way using the fast, greedy learning algorithm introduced by [7] and further investigated in [2, 14, 6]. We then use this generative model to initialize a multi-layer, non-linear mapping F(x|W), parameterized by W, with F : X → Z mapping the input vectors in X into a feature space Z. Typically the mapping F(x|W) will contain millions of parameters. The top-level features produced by this mapping allow fairly accurate reconstruction of the input, so they must contain most of the information in the input vector, but they express this information in a way that makes explicit a lot of the higher-order structure in the input data.

After learning F(x|W), a natural way to define a kernel function is to set K(x_i, x_j) = exp(−||F(x_i|W) − F(x_j|W)||²). Note that the kernel is initialized in an entirely unsupervised way. The parameters W of the covariance kernel can then be fine-tuned using the labeled data by maximizing the log probability of the labels with respect to W. In the final model most of the information for learning a covariance kernel will have come from modeling the input data. The very limited information in the labels will be used only to slightly adjust the layers of features already discovered by the DBN.

2 Gaussian Processes for Regression and Binary Classification

For a regression task, we are given a data set D of i.i.d. labeled input vectors Xl = {x_n}, n = 1...N, and their corresponding target labels {y_n} ∈ R.
We are interested in the following probabilistic regression model:

y_n = f(x_n) + ε,   ε ∼ N(ε|0, σ²)   (1)

A Gaussian process regression places a zero-mean GP prior over the underlying latent function f we are modeling, so that a-priori p(f|Xl) = N(f|0, K), where f = [f(x_1), ..., f(x_N)]ᵀ and K is the covariance matrix, whose entries are specified by the covariance function K_ij = K(x_i, x_j). The covariance function encodes our prior notion of the smoothness of f, or the prior assumption that if two input vectors are similar according to some distance measure, their labels should be highly correlated. In this paper we will use the spherical Gaussian kernel, parameterized by θ = {α, β}:

K_ij = α exp( −(1/(2β)) (x_i − x_j)ᵀ(x_i − x_j) )   (2)

Integrating out the function values f, the marginal log-likelihood takes the form:

L = log p(y|Xl) = −(N/2) log 2π − (1/2) log |K + σ²I| − (1/2) yᵀ(K + σ²I)⁻¹y   (3)

which can then be maximized with respect to the parameters θ and σ. Given a new test point x*, a prediction is obtained by conditioning on the observed data and θ. The distribution of the predicted value y* at x* takes the form:

p(y*|x*, D, θ, σ²) = N( y* | k*ᵀ(K + σ²I)⁻¹y,  k** − k*ᵀ(K + σ²I)⁻¹k* + σ² )   (4)

where k* = K(x*, Xl), and k** = K(x*, x*).

For a binary classification task, we similarly place a zero-mean GP prior over the underlying latent function f, which is then passed through the logistic function g(x) = 1/(1 + exp(−x)) to define a prior p(y_n = 1|x_n) = g(f(x_n)).
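The regression predictions of Eqs. 2–4 can be sketched in NumPy; the hyper-parameter values, toy data, and function names below are illustrative assumptions, not the authors' code:

```python
import numpy as np

def spherical_gaussian_kernel(X1, X2, alpha=1.0, beta=1.0):
    """Eq. 2: K_ij = alpha * exp(-||x_i - x_j||^2 / (2*beta))."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-sq / (2.0 * beta))

def gp_predict(X, y, X_star, alpha=1.0, beta=1.0, sigma2=0.1):
    """Eq. 4: posterior predictive mean and variance at test points X_star."""
    Ky = spherical_gaussian_kernel(X, X, alpha, beta) + sigma2 * np.eye(len(X))
    k_star = spherical_gaussian_kernel(X, X_star, alpha, beta)       # N x M
    k_ss = spherical_gaussian_kernel(X_star, X_star, alpha, beta)
    mean = k_star.T @ np.linalg.solve(Ky, y)
    var = np.diag(k_ss) - np.sum(k_star * np.linalg.solve(Ky, k_star), axis=0) + sigma2
    return mean, var

# Toy usage: noisy samples of sin(x), prediction at x* = 0
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
mean, var = gp_predict(X, y, np.array([[0.0]]))
```

In practice the hyper-parameters θ = {α, β} and σ² would be set by maximizing the marginal log-likelihood of Eq. 3 rather than fixed as here.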
Given a new test point x*, inference is done by first obtaining the distribution over the latent function f* = f(x*):

p(f*|x*, D) = ∫ p(f*|x*, Xl, f) p(f|Xl, y) df   (5)

which is then used to produce a probabilistic prediction:

p(y* = 1|x*, D) = ∫ g(f*) p(f*|x*, D) df*   (6)

The non-Gaussian likelihood makes the integral in Eq. 5 analytically intractable. In our experiments, we approximate the non-Gaussian posterior p(f|Xl, y) with a Gaussian one using expectation propagation [12]. For more thorough reviews and implementation details refer to [13, 16].

3 Learning Deep Belief Networks (DBN's)

In this section we describe an unsupervised way of learning a DBN model of the input data X = [Xl, Xu], which contains both the labeled and unlabeled data sets. A DBN can be trained efficiently by using a Restricted Boltzmann Machine (RBM) to learn one layer of hidden features at a time [7]. Welling et al. [18] introduced a class of two-layer undirected graphical models that generalize RBM's to exponential family distributions. This framework will allow us to model real-valued images of face patches and word-count vectors of documents.

3.1 Modeling Real-valued Data

We use a conditional Gaussian distribution for modeling observed “visible” pixel values x (e.g. images of faces) and a conditional Bernoulli distribution for modeling “hidden” features h (Fig.
1):

p(x_i = x|h) = (1/(√(2π) σ_i)) exp( −(x − b_i − σ_i Σ_j h_j w_ij)² / (2σ_i²) )   (7)

p(h_j = 1|x) = g( b_j + Σ_i w_ij x_i/σ_i )   (8)

Figure 1: Left panel: Markov random field of the generalized RBM. The top layer represents stochastic binary hidden features h and the bottom layer is composed of linear visible units x with Gaussian noise. When using a Constrained Poisson Model, the top layer represents stochastic binary latent topic features h and the bottom layer represents the Poisson visible word-count vector x. Middle panel: Pretraining consists of learning a stack of RBM's. Right panel: After pretraining, the RBM's are used to initialize a covariance function of the Gaussian process, which is then fine-tuned by backpropagation.

where g(x) = 1/(1 + exp(−x)) is the logistic function, w_ij is a symmetric interaction term between input i and feature j, σ_i² is the variance of input i, and b_i, b_j are biases. The marginal distribution over the visible vector x is:

p(x) = Σ_h exp(−E(x, h)) / ∫ Σ_g exp(−E(u, g)) du   (9)

where E(x, h) is an energy term: E(x, h) = Σ_i (x_i − b_i)²/(2σ_i²) − Σ_j b_j h_j − Σ_{i,j} h_j w_ij x_i/σ_i. The parameter updates required to perform gradient ascent in the log-likelihood are obtained from Eq.
9:

∆w_ij = ε ∂log p(x)/∂w_ij = ε( <z_i h_j>_data − <z_i h_j>_model )   (10)

where ε is the learning rate, z_i = x_i/σ_i, <·>_data denotes an expectation with respect to the data distribution and <·>_model is an expectation with respect to the distribution defined by the model. To circumvent the difficulty of computing <·>_model, we use 1-step Contrastive Divergence [5]:

∆w_ij = ε( <z_i h_j>_data − <z_i h_j>_recon )   (11)

The expectation <z_i h_j>_data defines the expected sufficient statistics of the data distribution and is computed as z_i p(h_j = 1|x) when the features are being driven by the observed data from the training set using Eq. 8. After stochastically activating the features, Eq. 7 is used to “reconstruct” real-valued data. Then Eq. 8 is used again to activate the features and compute <z_i h_j>_recon when the features are being driven by the reconstructed data. Throughout our experiments we set the variances σ_i² = 1 for all visible units i, which facilitates learning. The learning rule for the biases is just a simplified version of Eq. 11.

3.2 Modeling Count Data with the Constrained Poisson Model

We use a conditional “constrained” Poisson distribution for modeling observed “visible” word-count data x and a conditional Bernoulli distribution for modeling “hidden” topic features h:

p(x_i = n|h) = Pois( n, N · exp(λ_i + Σ_j h_j w_ij) / Σ_k exp(λ_k + Σ_j h_j w_kj) ),   p(h_j = 1|x) = g( b_j + Σ_i w_ij x_i )   (12)

where Pois(n, λ) = exp(−λ) λⁿ / n!, w_ij is a symmetric interaction term between word i and feature j, N = Σ_i x_i is the total length of the document, λ_i is the bias of the conditional Poisson model for word i, and b_j is the bias of feature j.
The Poisson rate, whose log is shifted by the weighted combination of the feature activations, is normalized and scaled up by N. We call this the “Constrained Poisson Model” since it ensures that the mean Poisson rates across all words sum up to the length of the document. This normalization is significant because it makes learning stable and it deals appropriately with documents of different lengths.

The marginal distribution over visible count vectors x is given in Eq. 9 with an “energy” given by:

E(x, h) = −Σ_i λ_i x_i + Σ_i log(x_i!) − Σ_j b_j h_j − Σ_{i,j} x_i h_j w_ij   (13)

The gradient of the log-likelihood function is:

∆w_ij = ε ∂log p(v)/∂w_ij = ε( <x_i h_j>_data − <x_i h_j>_model )   (14)

3.3 Greedy Recursive Learning of Deep Belief Nets

A single layer of binary features is not the best way to capture the structure in the input data. We now describe an efficient way to learn additional layers of binary features.

After learning the first layer of hidden features we have an undirected model that defines p(v, h) by defining a consistent pair of conditional probabilities, p(h|v) and p(v|h), which can be used to sample from the model distribution. A different way to express what has been learned is p(v|h) and p(h). Unlike a standard, directed model, this p(h) does not have its own separate parameters. It is a complicated, non-factorial prior on h that is defined implicitly by p(h|v) and p(v|h). This peculiar decomposition into p(h) and p(v|h) suggests a recursive algorithm: keep the learned p(v|h) but replace p(h) by a better prior over h, i.e. a prior that is closer to the average, over all the data vectors, of the conditional posterior over h.
So after learning an undirected model, the part we keep is part of a multilayer directed model.

We can sample from this average conditional posterior by simply using p(h|v) on the training data, and these samples are then the “data” that is used for training the next layer of features. The only difference from learning the first layer of features is that the “visible” units of the second-level RBM are also binary [6, 3]. The learning rule provided in the previous section remains the same [5]. We could initialize the new RBM model by simply using the existing learned model but with the roles of the hidden and visible units reversed. This ensures that p(v) in our new model starts out being exactly the same as p(h) in our old one. Provided the number of features per layer does not decrease, [7] show that each extra layer increases a variational lower bound on the log probability of the data. To suppress noise in the learning signal, we use the real-valued activation probabilities for the visible units of every RBM, but to prevent hidden units from transmitting more than one bit of information from the data to its reconstruction, the pretraining always uses stochastic binary values for the hidden units.

The greedy, layer-by-layer training can be repeated several times to learn a deep, hierarchical model in which each layer of features captures strong high-order correlations between the activities of features in the layer below.

4 Learning the Covariance Kernel for a Gaussian Process

After pretraining, the stochastic activities of the binary features in each layer are replaced by deterministic, real-valued probabilities and the DBN is used to initialize a multi-layer, non-linear mapping F(x|W) as shown in Figure 1. We define a Gaussian covariance function, parameterized by θ = {α, β} and W, as:

K_ij = α exp( −(1/(2β)) ||F(x_i|W) − F(x_j|W)||² )   (15)

Note that this covariance function is initialized in an entirely unsupervised way.
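Eq. 15 can be sketched as follows; the two-layer logistic feature map below is only a stand-in for the pretrained DBN mapping F(x|W), and all sizes and weights are hypothetical:

```python
import numpy as np

def feature_map(X, W1, W2):
    """Stand-in for F(x|W): two logistic layers using deterministic,
    real-valued activation probabilities (as after pretraining)."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    return sigmoid(sigmoid(X @ W1) @ W2)

def dbn_covariance(X, W1, W2, alpha=1.0, beta=1.0):
    """Eq. 15: K_ij = alpha * exp(-||F(x_i) - F(x_j)||^2 / (2*beta))."""
    F = feature_map(X, W1, W2)
    sq = ((F[:, None, :] - F[None, :, :]) ** 2).sum(-1)
    return alpha * np.exp(-sq / (2.0 * beta))

# Hypothetical sizes: 8-dimensional inputs, two layers of 5 features
rng = np.random.default_rng(0)
W1 = rng.standard_normal((8, 5))
W2 = rng.standard_normal((5, 5))
X = rng.standard_normal((10, 8))
K = dbn_covariance(X, W1, W2)
```

The resulting K is a valid covariance matrix for a GP because it is an ordinary Gaussian kernel, just applied in the feature space Z instead of the input space.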
We can now maximize the log-likelihood of Eq. 3 with respect to the parameters of the covariance function using the labeled training data [9]. The derivative of the log-likelihood with respect to the kernel function is:

∂L/∂K_y = (1/2)( K_y⁻¹ y yᵀ K_y⁻¹ − K_y⁻¹ )   (16)

where K_y = K + σ²I is the covariance matrix. Using the chain rule we readily obtain the necessary gradients:

∂L/∂θ = (∂L/∂K_y)(∂K_y/∂θ)   and   ∂L/∂W = (∂L/∂K_y)(∂K_y/∂F(x|W))(∂F(x|W)/∂W)   (17)

Figure 2: Top panel A: Randomly sampled examples of the training and test data. Bottom panel B: The same sample of the training and test images but with rectangular occlusions.

Training labels | GPstandard Sph. | GPstandard ARD | GP-DBNgreedy Sph. | GP-DBNgreedy ARD | GP-DBNfine Sph. | GP-DBNfine ARD | GPpca Sph. | GPpca ARD
A 100  | 22.24 | 28.57 | 17.94 | 18.37 | 15.28 | 15.01 | 18.13 (10) | 16.47 (10)
A 500  | 17.25 | 18.16 | 12.71 |  8.96 |  7.25 |  6.84 | 14.75 (20) | 10.53 (80)
A 1000 | 16.33 | 16.36 | 11.22 |  8.77 |  6.42 |  6.31 | 14.86 (20) | 10.00 (160)
B 100  | 26.94 | 28.32 | 23.15 | 19.42 | 19.75 | 18.59 | 25.91 (10) | 19.27 (20)
B 500  | 20.20 | 21.06 | 15.16 | 11.01 | 10.56 | 10.12 | 17.67 (10) | 14.11 (20)
B 1000 | 19.20 | 17.98 | 14.15 | 10.43 |  9.13 |  9.23 | 16.26 (10) | 11.55 (80)

Table 1: Performance results on the face-orientation regression task. The root mean squared error (RMSE) on the test set is shown for each method, using a spherical Gaussian kernel (Sph.) and a Gaussian kernel with ARD hyper-parameters. By row: A) Non-occluded face data, B) Occluded face data.
For the GPpca model, the number of principal components that performs best on the test data is shown in parentheses.

where ∂F(x|W)/∂W is computed using standard backpropagation. We also optimize the observation noise σ². It is necessary to compute the inverse of K_y, so each gradient evaluation has O(N³) complexity where N is the number of labeled training cases. When learning the restricted Boltzmann machines that are composed to form the initial DBN, however, each gradient evaluation scales linearly in time and space with the number of unlabeled training cases. So the pretraining stage can make efficient use of very large sets of unlabeled data to create sensible, high-level features even when the amount of labeled data is small. Then the very limited amount of information in the labels can be used to slightly refine those features rather than to create them.

5 Experimental Results

In this section we present experimental results for several regression and classification tasks that involve high-dimensional, highly-structured data. The first regression task is to extract the orientation of a face from a gray-level image of a large patch of the face. The second regression task is to map images of handwritten digits to a single real value that is as close as possible to the integer represented by the digit in the image. The first classification task is to discriminate between images of odd digits and images of even digits. The second classification task is to discriminate between two different classes of news story, based on the vector of word counts in each story.

5.1 Extracting the Orientation of a Face Patch

The Olivetti face data set contains ten 64×64 images of each of forty different people. We constructed a data set of 13,000 28×28 images by randomly rotating (−90° to +90°), cropping, and subsampling the original 400 images.
The data set was then subdivided into 12,000 training images, which contained the first 30 people, and 1,000 test images, which contained the remaining 10 people. 1,000 randomly sampled face patches from the training set were assigned an orientation label. The remaining 11,000 training images were used as unlabeled data. We also made a more difficult version of the task by occluding part of each face patch with randomly chosen rectangles. Panel A of Figure 2 shows randomly sampled examples from the training and test data.

For training on the Olivetti face patches we used the 784-1000-1000-1000 architecture shown in Figure 1. The entire training set of 12,000 unlabeled images was used for greedy, layer-by-layer training of a DBN model. The 2.8 million parameters of the DBN model may seem excessive for 12,000 training cases, but each training case involves modeling 784 real values rather than just a single real-valued label. Also, we only train each layer of features for a few passes through the training data and we penalize the squared weights.

Figure 3: Left panel shows a scatter plot of the two most relevant features (Feature 992 vs. Feature 312), with each point replaced by the corresponding input test image. For better visualization, overlapped images are not shown. Right panel displays the histogram plots of the learned ARD hyper-parameters log β for the input pixel space and the feature space; smaller log β means more relevant.

After the DBN has been pretrained on the unlabeled data, a GP model was fitted to the labeled data using the top-level features of the DBN model as inputs.
We call this model GP-DBNgreedy. GP-DBNgreedy can be significantly improved by slightly altering the weights in the DBN. The GP model gives error derivatives for its input vectors, which are the top-level features of the DBN. These derivatives can be backpropagated through the DBN to allow discriminative fine-tuning of the weights. Each time the weights in the DBN are updated, the GP model is also refitted. We call this model GP-DBNfine. For comparison, we fitted a GP model that used the pixel intensities of the labeled images as its inputs. We call this model GPstandard. We also used PCA to reduce the dimensionality of the labeled images and fitted several different GP models using the projections onto the first m principal components as the input. Since we only want a lower bound on the error of this model, we simply use the value of m that performs best on the test data. We call this model GPpca. Table 1 shows the root mean squared error (RMSE) of the predicted face orientations using all four types of GP model on varying amounts of labeled data. The results show that both GP-DBNgreedy and GP-DBNfine significantly outperform a regular GP model. Indeed, GP-DBNfine with only 100 labeled training cases outperforms GPstandard with 1000.

To test the robustness of our approach to noise in the input we took the same data set and created artificial rectangular occlusions (see Fig. 2, panel B). The number of rectangles per image was drawn from a Poisson with λ = 2. The top-left location, length and width of each rectangle were sampled from a uniform [0, 25]. The pixel intensity of each occluding rectangle was set to the mean pixel intensity of the entire image.
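The occlusion procedure just described can be sketched as follows; we assume integer-valued rectangle coordinates and sizes drawn from [0, 25] (the text does not specify integer vs. continuous sampling), and the function name is our own:

```python
import numpy as np

def occlude(image, rng):
    """Add random rectangular occlusions to a 28x28 image, as described:
    rectangle count ~ Poisson(2); top-left corner, height and width each
    drawn uniformly from [0, 25]; occluded pixels set to the image mean."""
    out = image.copy()
    mean = image.mean()
    for _ in range(rng.poisson(2)):
        top, left = rng.integers(0, 26, size=2)
        h, w = rng.integers(0, 26, size=2)
        out[top:top + h, left:left + w] = mean
    return out

rng = np.random.default_rng(0)
img = rng.random((28, 28))
occluded = occlude(img, rng)
```

Setting occluded pixels to the image mean (rather than to zero) keeps the occlusions from being trivially separable from the face by their intensity alone.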
Table 1 shows that the performance of all models degrades, but their relative performances remain the same, and GP-DBNfine on occluded data is still much better than GPstandard on non-occluded data.

We have also experimented with using a Gaussian kernel with ARD hyper-parameters, which is a common practice when the input vectors are high-dimensional:

K_ij = α exp( −(1/2) (x_i − x_j)ᵀ D (x_i − x_j) )   (18)

where D is the diagonal matrix with D_ii = 1/β_i, so that the covariance function has a separate length-scale parameter for each dimension. ARD hyper-parameters were optimized by maximizing the marginal log-likelihood of Eq. 3. Table 1 shows that ARD hyper-parameters do not improve GPstandard, but they do slightly improve GP-DBNfine and they strongly improve GP-DBNgreedy and GPpca when there are 500 or 1000 labeled training cases.

The histogram plot of log β in Figure 3 reveals that there are a few extracted features that are very relevant (small β) to our prediction task. The same figure (left panel) shows a scatter plot of the two most relevant features of the GP-DBNgreedy model, with each point replaced by the corresponding input test image.
Clearly, these two features carry a lot of information about the orientation of the face.

Train labels | GPstandard Sph. | GPstandard ARD | GP-DBNgreedy Sph. | GP-DBNgreedy ARD | GP-DBNfine Sph. | GP-DBNfine ARD | GPpca Sph. | GPpca ARD
A 100  | 1.86   | 2.27   | 1.68   | 1.61   | 1.63   | 1.58   | 1.73 (20)   | 2.00 (20)
A 500  | 1.42   | 1.62   | 1.19   | 1.27   | 1.16   | 1.22   | 1.32 (40)   | 1.36 (20)
A 1000 | 1.25   | 1.36   | 1.14   | 1.07   | 1.03   | 1.10   | 1.19 (40)   | 1.22 (80)
B 100  | 0.1087 | 0.0884 | 0.0597 | 0.0528 | 0.0599 | 0.0501 | 0.0785 (10) | 0.0920 (10)
B 500  | 0.0541 | 0.0222 | 0.0161 | 0.0100 | 0.0104 | 0.0055 | 0.0160 (40) | 0.0235 (20)
B 1000 | 0.0385 | 0.0129 | 0.0059 | 0.0058 | 0.0100 | 0.0050 | 0.0091 (40) | 0.0127 (40)

Table 2: Performance results on the digit magnitude regression task (A) and the odd vs. even digit classification task (B). For the regression task, the root mean squared error on the test set is shown for each method. For the classification task, the area under the ROC curve (AUROC) metric is used; for each method we show 1−AUROC on the test set. All methods were tried using both a spherical Gaussian kernel and a Gaussian kernel with ARD hyper-parameters.
For the GPpca model, the number of principal components that performs best on the test data is shown in parentheses.

Number of labeled cases (50% in each class) | GPstandard | GP-DBNgreedy | GP-DBNfine
100  | 0.1295 | 0.1180 | 0.0995
500  | 0.0875 | 0.0793 | 0.0609
1000 | 0.0645 | 0.0580 | 0.0458

Table 3: Performance results using the area under the ROC curve (AUROC) metric on the text classification task. For each method we show 1−AUROC on the test set.

We suspect that the GP-DBNfine model does not benefit as much from the ARD hyper-parameters because the fine-tuning stage is already capable of turning down the activities of irrelevant top-level features.

5.2 Extracting the Magnitude Represented by a Handwritten Digit and Discriminating between Images of Odd and Even Digits

The MNIST digit data set contains 60,000 training and 10,000 test 28×28 images of the ten handwritten digits (0 to 9). 100 randomly sampled training images of each class were assigned a magnitude label. The remaining 59,000 training images were used as unlabeled data. As in the previous experiment, we used the 784-1000-1000-1000 architecture, with the entire training set of 60,000 unlabeled digits being used for greedily pretraining the DBN model. Table 2, panel A, shows that GP-DBNfine and GP-DBNgreedy perform considerably better than GPstandard, both with and without ARD hyper-parameters. The same table, panel B, shows results for the classification task of discriminating between images of odd and images of even digits. We used the same labeled training set, but with each digit categorized into an even or an odd class.
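The 1−AUROC numbers reported in Tables 2 and 3 can be computed from real-valued classifier scores and binary labels via the rank-based (Mann–Whitney) formulation; this sketch is our own, not the authors' evaluation code:

```python
import numpy as np

def one_minus_auroc(scores, labels):
    """1 - AUROC via the Mann-Whitney U statistic: AUROC is the
    probability that a random positive outscores a random negative."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Count positive/negative pairs where the positive scores higher;
    # ties count as half a win.
    wins = ((pos[:, None] > neg[None, :]).sum()
            + 0.5 * (pos[:, None] == neg[None, :]).sum())
    return 1.0 - wins / (len(pos) * len(neg))

# Toy usage: a perfect ranker gives 1 - AUROC = 0
print(one_minus_auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0
```

Lower is better, which is why the GP-DBN models' smaller 1−AUROC values in Tables 2 and 3 indicate better discrimination.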
The same DBN model was used, so the Gaussian covariance function was initialized in exactly the same way for both the regression and classification tasks. The performance of GP-DBNgreedy demonstrates that the greedily learned feature representation captures a lot of structure in the unlabeled input data which is useful for subsequent discrimination tasks, even though these tasks are unknown when the DBN is being trained.

5.3 Classifying News Stories

The Reuters Corpus Volume II is an archive of 804,414 newswire stories. The corpus covers four major groups: Corporate/Industrial, Economics, Government/Social, and Markets. The data was randomly split into 802,414 training and 2,000 test articles. The test set contains 500 articles of each major group. The available data was already in a convenient, preprocessed format, where common stopwords were removed and all the remaining words were stemmed. We only made use of the 2,000 most frequently used word stems in the training data. As a result, each document was represented as a vector containing 2,000 word counts. No other preprocessing was done.

For the text classification task we used a 2000-1000-1000-1000 architecture. The entire unlabeled training set of 802,414 articles was used for learning a multilayer generative model of the text documents. The bottom layer of the DBN was trained using a Constrained Poisson Model. Table 3 shows the area under the ROC curve for classifying documents belonging to the Corporate/Industrial vs. Economics groups. As expected, GP-DBNfine and GP-DBNgreedy work better than GPstandard. The results of binary discrimination between other pairs of document classes are very similar to the results presented in Table 3. Our experiments using a Gaussian kernel with ARD hyper-parameters did not show any significant improvements. Examining the histograms of the length-scale parameters β, we found that most of the input word-counts as well as most of the extracted features were relevant to the classification task.

6 Conclusions and Future Research

In this paper we have shown how to use Deep Belief Networks to greedily pretrain and discriminatively fine-tune a covariance kernel for a Gaussian Process. The discriminative fine-tuning produces an additional improvement in performance that is comparable in magnitude to the improvement produced by using the greedily pretrained DBN. For high-dimensional, highly-structured data, this is an effective way to make use of large unlabeled data sets, especially when labeled training data is scarce. Greedily pretrained DBN's can also be used to provide input vectors for other kernel-based methods, including SVMs [17, 8] and kernel regression [1], and our future research will concentrate on comparing our method to other kernel-based semi-supervised learning algorithms [4, 19].

Acknowledgments

We thank Radford Neal for many helpful suggestions. This research was supported by NSERC, CFI and OTI. GEH is a fellow of CIAR and holds a CRC chair.

References

[1] J. K. Benedetti. On the nonparametric estimation of regression functions. Journal of the Royal Statistical Society, Series B, 39:248-253, 1977.
[2] Y. Bengio and Y. Le Cun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.
[3] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, 2006.
[4] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. MIT Press, 2006.
[5] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1711-1800, 2002.
[6] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313, 2006.
[7] Geoffrey E. Hinton, Simon Osindero, and Yee Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.
[8] F. Lauer, C. Y. Suen, and G. Bloch. A trainable feature extractor for handwritten digit recognition. Pattern Recognition, 40(6):1816-1824, 2007.
[9] N. D. Lawrence and J. Quiñonero Candela. Local distance preservation in the GP-LVM through back constraints. In William W. Cohen and Andrew Moore, editors, ICML, volume 148, pages 513-520. ACM, 2006.
[10] N. D. Lawrence and M. I. Jordan. Semi-supervised learning via Gaussian processes. In NIPS, 2004.
[11] N. D. Lawrence and B. Schölkopf. Estimating a kernel Fisher discriminant in the presence of label noise. In Proc. 18th International Conf. on Machine Learning, pages 306-313. Morgan Kaufmann, San Francisco, CA, 2001.
[12] T. P. Minka. Expectation propagation for approximate Bayesian inference. In Jack Breese and Daphne Koller, editors, UAI, pages 362-369, San Francisco, CA, 2001. Morgan Kaufmann Publishers.
[13] C. E. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[14] R. Salakhutdinov and G. E. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In AI and Statistics, 2007.
[15] M. Seeger. Covariance kernels from Bayesian generative models. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, NIPS, pages 905-912. MIT Press, 2001.
[16] M. Seeger. Gaussian processes for machine learning. Int. J. Neural Syst., 14(2):69-106, 2004.
[17] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[18] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In NIPS 17, pages 1481-1488, Cambridge, MA, 2005. MIT Press.
[19] Xiaojin Zhu, Jaz S. Kandola, Zoubin Ghahramani, and John D. Lafferty. Nonparametric transforms of graph kernels for semi-supervised learning. In NIPS, 2004.
", "award": [], "sourceid": 945, "authors": [{"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": null}]}