{"title": "Kernel Methods for Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 342, "page_last": 350, "abstract": "We introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels and deep belief nets.", "full_text": "Kernel Methods for Deep Learning\n\nYoungmin Cho and Lawrence K. Saul\n\nDepartment of Computer Science and Engineering\nUniversity of California, San Diego\n9500 Gilman Drive, Mail Code 0404\nLa Jolla, CA 92093-0404\n{yoc002,saul}@cs.ucsd.edu\n\nAbstract\n\nWe introduce a new family of positive-definite kernel functions that mimic the computation in large, multilayer neural nets. These kernel functions can be used in shallow architectures, such as support vector machines (SVMs), or in deep kernel-based architectures that we call multilayer kernel machines (MKMs). We evaluate SVMs and MKMs with these kernel functions on problems designed to illustrate the advantages of deep architectures. On several problems, we obtain better results than previous, leading benchmarks from both SVMs with Gaussian kernels and deep belief nets.\n\n1 Introduction\n\nRecent work in machine learning has highlighted the circumstances that appear to favor deep architectures, such as multilayer neural nets, over shallow architectures, such as support vector machines (SVMs) [1]. Deep architectures learn complex mappings by transforming their inputs through multiple layers of nonlinear processing [2]. 
Researchers have advanced several motivations for deep architectures: the wide range of functions that can be parameterized by composing weakly nonlinear transformations, the appeal of hierarchical distributed representations, and the potential for combining unsupervised and supervised methods. Experiments have also shown the benefits of deep learning in several interesting applications [3, 4, 5].\nMany issues surround the ongoing debate over deep versus shallow architectures [1, 6]. Deep architectures are generally more difficult to train than shallow ones: they involve difficult nonlinear optimizations and many heuristics. The challenges of deep learning explain the early and continued appeal of SVMs, which learn nonlinear classifiers via the "kernel trick". Unlike deep architectures, SVMs are trained by solving a simple problem in quadratic programming. However, SVMs seemingly cannot benefit from the advantages of deep learning.\nLike many, we are intrigued by the successes of deep architectures yet drawn to the elegance of kernel methods. In this paper, we explore the possibility of deep learning in kernel machines. Though we share a similar motivation to previous authors [7], our approach is very different. Our paper makes two main contributions. First, we develop a new family of kernel functions that mimic the computation in large neural nets. Second, using these kernel functions, we show how to train multilayer kernel machines (MKMs) that benefit from many advantages of deep learning.\nThe organization of this paper is as follows. In section 2, we describe a new family of kernel functions and experiment with their use in SVMs. Our results on SVMs are interesting in their own right; they also foreshadow certain trends that we observe (and certain choices that we make) for the MKMs introduced in section 3. 
There, we describe a kernel-based architecture with multiple layers of nonlinear transformation; the different layers are trained using a simple combination of supervised and unsupervised methods. Finally, we conclude in section 4 by evaluating the strengths and weaknesses of our approach.\n\n2 Arc-cosine kernels\n\nIn this section, we develop a new family of kernel functions for computing the similarity of vector inputs x, y ∈ R^d. As shorthand, let Θ(z) = (1/2)(1 + sign(z)) denote the Heaviside step function. We define the nth order arc-cosine kernel function via the integral representation:\n\nk_n(x, y) = 2 ∫ dw [e^{−‖w‖²/2} / (2π)^{d/2}] Θ(w·x) Θ(w·y) (w·x)^n (w·y)^n.   (1)\n\nThe integral representation makes it straightforward to show that these kernel functions are positive-semidefinite. The kernel function in eq. (1) has interesting connections to neural computation [8] that we explore further in sections 2.2–2.3. However, we begin by elucidating its basic properties.\n\n2.1 Basic properties\n\nIn the appendix, we show how to evaluate the integral in eq. (1) analytically. The final result is most easily expressed in terms of the angle θ between the inputs:\n\nθ = cos⁻¹( (x·y) / (‖x‖‖y‖) ).   (2)\n\nThe integral in eq. (1) has a simple, trivial dependence on the magnitudes of the inputs x and y, but a complex, interesting dependence on the angle between them. In particular, we can write:\n\nk_n(x, y) = (1/π) ‖x‖^n ‖y‖^n J_n(θ),   (3)\n\nwhere all the angular dependence is captured by the family of functions J_n(θ). 
Evaluating the integral in the appendix, we show that this angular dependence is given by:\n\nJ_n(θ) = (−1)^n (sin θ)^{2n+1} ( (1/sin θ) ∂/∂θ )^n ( (π − θ)/sin θ ).   (4)\n\nFor n = 0, this expression reduces to the supplement of the angle between the inputs. However, for n > 0, the angular dependence is more complicated. The first few expressions are:\n\nJ_0(θ) = π − θ   (5)\nJ_1(θ) = sin θ + (π − θ) cos θ   (6)\nJ_2(θ) = 3 sin θ cos θ + (π − θ)(1 + 2 cos²θ)   (7)\n\nWe describe eq. (3) as an arc-cosine kernel because for n = 0, it takes the simple form k_0(x, y) = 1 − (1/π) cos⁻¹( (x·y)/(‖x‖‖y‖) ). In fact, the zeroth and first order kernels in this family are strongly motivated by previous work in neural computation. We explore these connections in the next section.\nArc-cosine kernels have other intriguing properties. From the magnitude dependence in eq. (3), we observe the following: (i) the n = 0 arc-cosine kernel maps inputs x to the unit hypersphere in feature space, with k_0(x, x) = 1; (ii) the n = 1 arc-cosine kernel preserves the norm of inputs, with k_1(x, x) = ‖x‖²; (iii) higher order (n > 1) arc-cosine kernels expand the dynamic range of the inputs, with k_n(x, x) ∼ ‖x‖^{2n}. Properties (i)–(iii) are shared respectively by radial basis function (RBF), linear, and polynomial kernels. Interestingly, though, the n = 1 arc-cosine kernel is highly nonlinear, also satisfying k_1(x, −x) = 0 for all inputs x. 
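The closed forms in eqs. (2)–(7) make these kernels cheap to evaluate directly. Below is a minimal NumPy sketch (the function name `arccos_kernel` is ours, not from the paper) implementing k_n for n = 0, 1, 2; it can be used to confirm properties (i)–(iii) numerically.

```python
import numpy as np

def arccos_kernel(x, y, n=0):
    """Arc-cosine kernel k_n(x, y) of degree n (eq. 3), using the
    closed-form angular functions J_0, J_1, J_2 from eqs. (5)-(7)."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    # clip guards against round-off pushing the cosine outside [-1, 1]
    theta = np.arccos(np.clip(np.dot(x, y) / (nx * ny), -1.0, 1.0))
    J = {
        0: np.pi - theta,
        1: np.sin(theta) + (np.pi - theta) * np.cos(theta),
        2: 3 * np.sin(theta) * np.cos(theta)
           + (np.pi - theta) * (1 + 2 * np.cos(theta) ** 2),
    }[n]
    return (1.0 / np.pi) * (nx * ny) ** n * J
```

With this sketch, one can check that k_0(x, x) = 1, that k_1(x, x) = ‖x‖², and that k_1(x, −x) = 0, as claimed above.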
As a practical matter, we note that arc-cosine kernels do not have any continuous tuning parameters (such as the kernel width in RBF kernels), which can be laborious to set by cross-validation.\n\n2.2 Computation in single-layer threshold networks\n\nConsider the single-layer network shown in Fig. 1 (left) whose weights W_ij connect the jth input unit to the ith output unit. The network maps inputs x to outputs f(x) by applying an elementwise nonlinearity to the matrix-vector product of the inputs and the weight matrix: f(x) = g(Wx). The nonlinearity is described by the network's so-called activation function. Here we consider the family of one-sided polynomial activation functions g_n(z) = Θ(z) z^n illustrated in the right panel of Fig. 1.\n\nFigure 1: Single-layer network (left) and the step (n=0), ramp (n=1), and quarter-pipe (n=2) activation functions (right).\n\nFor n = 0, the activation function is a step function, and the network is an array of perceptrons. For n = 1, the activation function is a ramp function (or rectification nonlinearity [9]), and the mapping f(x) is piecewise linear. More generally, the nonlinear (non-polynomial) behavior of these networks is induced by thresholding on weighted sums. We refer to networks with these activation functions as single-layer threshold networks of degree n.\nComputation in these networks is closely connected to computation with the arc-cosine kernel function in eq. (1). To see the connection, consider how inner products are transformed by the mapping in single-layer threshold networks. As notation, let the vector w_i denote the ith row of the weight matrix W. Then we can express the inner product between different outputs of the network as:\n\nf(x) · f(y) = Σ_{i=1}^{m} Θ(w_i·x) Θ(w_i·y) (w_i·x)^n (w_i·y)^n,   (8)\n\nwhere m is the number of output units. 
The connection with the arc-cosine kernel function emerges in the limit of very large networks [10, 8]. Imagine that the network has an infinite number of output units, and that the weights W_ij are Gaussian distributed with zero mean and unit variance. In this limit, we see that eq. (8) reduces to eq. (1) up to a trivial multiplicative factor: lim_{m→∞} (2/m) f(x) · f(y) = k_n(x, y). Thus the arc-cosine kernel function in eq. (1) can be viewed as the inner product between feature vectors derived from the mapping of an infinite single-layer threshold network [8].\nMany researchers have noted the general connection between kernel machines and neural networks with one layer of hidden units [1]. The n = 0 arc-cosine kernel in eq. (1) can also be derived from an earlier result obtained in the context of Gaussian processes [8]. However, we are unaware of any previous theoretical or empirical work on the general family of these kernels for degrees n ≥ 0.\nArc-cosine kernels differ from polynomial and RBF kernels in one especially interesting respect. As highlighted by the integral representation in eq. (1), arc-cosine kernels induce feature spaces that mimic the sparse, nonnegative, distributed representations of single-layer threshold networks. Polynomial and RBF kernels do not encode their inputs in this way. In particular, the feature vector induced by polynomial kernels is neither sparse nor nonnegative, while the feature vector induced by RBF kernels resembles the localized output of a soft vector quantizer. Further implications of this difference are explored in the next section.\n\n2.3 Computation in multilayer threshold networks\n\nA kernel function can be viewed as inducing a nonlinear mapping from inputs x to feature vectors Φ(x). The kernel computes the inner product in the induced feature space: k(x, y) = Φ(x)·Φ(y). In this section, we consider how to compose the nonlinear mappings induced by kernel functions. 
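Before turning to composition, we note that the large-m limit from section 2.2 is easy to check numerically: draw Gaussian weights, apply g_n, and compare (2/m) f(x)·f(y) with the closed form. A sketch for n = 1 (the dimensions, sample size, and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 5, 200_000, 1                  # input dim, output units, degree

x = rng.standard_normal(d)
y = rng.standard_normal(d)

# finite single-layer threshold network with g_n(z) = Theta(z) z^n
W = rng.standard_normal((m, d))          # zero-mean, unit-variance weights
f = lambda v: np.maximum(W @ v, 0.0) ** n
estimate = (2.0 / m) * f(x) @ f(y)       # eq. (8), rescaled by 2/m

# closed form k_1(x, y) from eqs. (3) and (6)
nx, ny = np.linalg.norm(x), np.linalg.norm(y)
theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
exact = nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi

print(estimate, exact)                   # the two agree as m grows
```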
Specifically, we show how to derive new kernel functions\n\nk^{(ℓ)}(x, y) = Φ(Φ(...Φ(x))) · Φ(Φ(...Φ(y))),   (9)\n\nwhere the nonlinear mapping Φ(·) is applied ℓ times on each side; these kernels compute the inner product after ℓ successive applications of the nonlinear mapping Φ(·). Our motivation is the following: intuitively, if the base kernel function k(x, y) = Φ(x) · Φ(y) mimics the computation in a single-layer network, then the iterated mapping in eq. (9) should mimic the computation in a multilayer network.\n\nFigure 2: Left: examples from the rectangles-image data set. Right: classification error rates on the test set. SVMs with arc-cosine kernels have error rates from 22.36–25.64%. Results are shown for kernels of varying degree (n) and levels of recursion (ℓ). The best previous results are 24.04% for SVMs with RBF kernels and 22.50% for deep belief nets [11]. See text for details.\n\nWe first examine the results of this procedure for widely used kernels. Here we find that the iterated mapping in eq. (9) does not yield particularly interesting results. Consider the two-fold composition that maps x to Φ(Φ(x)). For linear kernels k(x, y) = x · y, the composition is trivial: we obtain the identity map Φ(Φ(x)) = Φ(x) = x. 
For homogeneous polynomial kernels k(x, y) = (x · y)^d, the composition yields:\n\nΦ(Φ(x)) · Φ(Φ(y)) = (Φ(x) · Φ(y))^d = ((x · y)^d)^d = (x · y)^{d²}.   (10)\n\nThe above result is not especially interesting: the kernel implied by this composition is also polynomial, just of higher degree (d² versus d) than the one from which it was constructed. Likewise, for RBF kernels k(x, y) = e^{−λ‖x−y‖²}, the composition yields:\n\nΦ(Φ(x)) · Φ(Φ(y)) = e^{−λ‖Φ(x)−Φ(y)‖²} = e^{−2λ(1−k(x,y))}.   (11)\n\nThough non-trivial, eq. (11) does not represent a particularly interesting computation. Recall that RBF kernels mimic the computation of soft vector quantizers, with k(x, y) ≪ 1 when ‖x−y‖ is large compared to the kernel width. It is hard to see how the iterated mapping Φ(Φ(x)) would generate a qualitatively different representation than the original mapping Φ(x).\nNext we consider the ℓ-fold composition in eq. (9) for arc-cosine kernel functions. We state the result in the form of a recursion. The base case is given by eq. (3) for kernels of depth ℓ = 1 and degree n. The inductive step is given by:\n\nk_n^{(ℓ+1)}(x, y) = (1/π) [ k_n^{(ℓ)}(x, x) k_n^{(ℓ)}(y, y) ]^{n/2} J_n(θ_n^{(ℓ)}),   (12)\n\nwhere θ_n^{(ℓ)} is the angle between the images of x and y in the feature space induced by the ℓ-fold composition. In particular, we can write:\n\nθ_n^{(ℓ)} = cos⁻¹( k_n^{(ℓ)}(x, y) [ k_n^{(ℓ)}(x, x) k_n^{(ℓ)}(y, y) ]^{−1/2} ).   (13)\n\nThe recursion in eq. (12) is simple to compute in practice. The resulting kernels mimic the computations in large multilayer threshold networks. 
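The recursion in eqs. (12)–(13) amounts to a few lines of code. A NumPy sketch (function names are ours), written to allow kernels of possibly different degrees at each layer:

```python
import numpy as np

def J(n, theta):
    """Angular functions J_n for n = 0, 1, 2 (eqs. 5-7)."""
    return [np.pi - theta,
            np.sin(theta) + (np.pi - theta) * np.cos(theta),
            3 * np.sin(theta) * np.cos(theta)
            + (np.pi - theta) * (1 + 2 * np.cos(theta) ** 2)][n]

def multilayer_arccos(x, y, degrees):
    """Iterated arc-cosine kernel (eqs. 12-13); `degrees` lists the
    degree n used at each layer, e.g. [0, 1, 1] for three layers."""
    kxy, kxx, kyy = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    for n in degrees:
        theta = np.arccos(np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0))
        kxy = (kxx * kyy) ** (n / 2) * J(n, theta) / np.pi
        # diagonal entries use theta = 0, where J(n, 0) is pi, pi, 3*pi
        kxx = kxx ** n * J(n, 0.0) / np.pi
        kyy = kyy ** n * J(n, 0.0) / np.pi
    return kxy
```

A single layer reproduces eq. (3); for example, with orthogonal unit vectors and n = 0, the kernel value is 1 − (π/2)/π = 1/2.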
Above, for simplicity, we have assumed that the arc-cosine kernels have the same degree n at every level (or layer) ℓ of the recursion. We can also use kernels of different degrees at different layers. In the next section, we experiment with SVMs whose kernel functions are constructed in this way.\n\n2.4 Experiments on binary classification\n\nWe evaluated SVMs with arc-cosine kernels on two challenging data sets of 28 × 28 grayscale pixel images. These data sets were specifically constructed to compare deep architectures and kernel machines [11]. In the first data set, known as rectangles-image, each image contains an occluding rectangle, and the task is to determine whether the width of the rectangle exceeds its height; examples are shown in Fig. 2 (left). In the second data set, known as convex, each image contains a white region, and the task is to determine whether the white region is convex; examples are shown
Figure 3: Left: examples from the convex data set. Right: classification error rates on the test set. SVMs with arc-cosine kernels have error rates from 17.15–20.51%. Results are shown for kernels of varying degree (n) and levels of recursion (ℓ). The best previous results are 19.13% for SVMs with RBF kernels and 18.63% for deep belief nets [11]. See text for details.\n\nin Fig. 3 (left). The rectangles-image data set has 12000 training examples, while the convex data set has 8000 training examples; both data sets have 50000 test examples. These data sets have been extensively benchmarked by previous authors [11]. Our experiments in binary classification focused on these data sets because in previously reported benchmarks, they exhibited the biggest performance gap between deep architectures (e.g., deep belief nets) and traditional SVMs.\nWe followed the same experimental methodology as previous authors [11]. SVMs were trained using libSVM (version 2.88) [12], a publicly available software package. For each SVM, we used the last 2000 training examples as a validation set to choose the margin penalty parameter; after choosing this parameter by cross-validation, we then retrained each SVM using all the training examples.\nFor reference, we also report the best results obtained previously from three-layer deep belief nets (DBN-3) and SVMs with RBF kernels (SVM-RBF). 
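Since arc-cosine kernels have no continuous parameters, the entire Gram matrix can be precomputed once and handed to the SVM solver (libSVM, for example, accepts precomputed kernels). A vectorized NumPy sketch for building the Gram matrix (the function name is ours):

```python
import numpy as np

def arccos_gram(X, Y, n=1):
    """Pairwise arc-cosine Gram matrix K[i, j] = k_n(X[i], Y[j]),
    vectorized over rows of X and Y, for use with an SVM solver
    that accepts precomputed kernels."""
    nx = np.linalg.norm(X, axis=1)[:, None]
    ny = np.linalg.norm(Y, axis=1)[None, :]
    theta = np.arccos(np.clip((X @ Y.T) / (nx * ny), -1.0, 1.0))
    Jn = [np.pi - theta,
          np.sin(theta) + (np.pi - theta) * np.cos(theta),
          3 * np.sin(theta) * np.cos(theta)
          + (np.pi - theta) * (1 + 2 * np.cos(theta) ** 2)][n]
    return (nx * ny) ** n * Jn / np.pi
```

The matrix is symmetric when X = Y, and for n = 1 its diagonal recovers the squared input norms, consistent with property (ii) of section 2.1.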
These references appear to be representative of the current state-of-the-art for deep and shallow architectures on these data sets.\nFigures 2 and 3 show the test set error rates from arc-cosine kernels of varying degree (n) and levels of recursion (ℓ). We experimented with kernels of degree n = 0, 1, and 2, corresponding to threshold networks with "step", "ramp", and "quarter-pipe" activation functions. We also experimented with the multilayer kernels described in section 2.3, composed from one to six levels of recursion. Overall, the figures show that many SVMs with arc-cosine kernels outperform traditional SVMs, and a certain number also outperform deep belief nets. In addition to their solid performance, we note that SVMs with arc-cosine kernels are very straightforward to train; unlike SVMs with RBF kernels, they do not require tuning a kernel width parameter, and unlike deep belief nets, they do not require solving a difficult nonlinear optimization or searching over possible architectures.\nOur experiments with multilayer kernels revealed that these SVMs only performed well when arc-cosine kernels of degree n = 1 were used at higher (ℓ > 1) levels in the recursion. Figs. 2 and 3 therefore show only these sets of results; in particular, each group of bars shows the test error rates when a particular kernel (of degree n = 0, 1, 2) was used at the first layer of nonlinearity, while the n = 1 kernel was used at successive layers. We do not have a formal explanation for this effect. However, we hypothesize that only n = 1 arc-cosine kernels preserve sufficient information about the magnitude of their inputs to work effectively in composition with other kernels. 
Recall that only the n = 1 arc-cosine kernel preserves the norm of its inputs: the n = 0 kernel maps all inputs onto a unit hypersphere in feature space, while higher-order (n > 1) kernels may induce feature spaces with severely distorted dynamic ranges.\nFinally, the results on both data sets reveal an interesting trend: the multilayer arc-cosine kernels often perform better than their single-layer counterparts. Though SVMs are (inherently) shallow architectures, this trend suggests that for these problems in binary classification, arc-cosine kernels may be yielding some of the advantages typically associated with deep architectures.\n\n3 Deep learning\n\nIn this section, we explore how to use kernel methods in deep architectures [7]. We show how to train deep kernel-based architectures by a simple combination of supervised and unsupervised methods. Using the arc-cosine kernels from the previous section, these multilayer kernel machines (MKMs) perform very competitively on multiclass data sets designed to foil shallow architectures [11].
3.1 Multilayer kernel machines\n\nWe explored how to train MKMs in stages that involve kernel PCA [13] and feature selection [14] at intermediate hidden layers and large-margin nearest neighbor classification [15] at the final output layer. Specifically, for ℓ-layer MKMs, we considered the following training procedure:\n\n1. Prune uninformative features from the input space.\n2. Repeat ℓ times:\n(a) Compute principal components in the feature space induced by a nonlinear kernel.\n(b) Prune uninformative components from the feature space.\n3. Learn a Mahalanobis distance metric for nearest neighbor classification.\n\nThe individual steps in this procedure are well-established methods; only their combination is new. While many other approaches are worth investigating, our positive results from the above procedure provide a first proof-of-concept. We discuss each of these steps in greater detail below.\nKernel PCA. Deep learning in MKMs is achieved by iterative applications of kernel PCA [13]. This use of kernel PCA was suggested over a decade ago [16] and was more recently inspired by the unsupervised pre-training of deep belief nets. In MKMs, the outputs (or features) from kernel PCA at one layer are the inputs to kernel PCA at the next layer. 
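The kernel PCA step at each layer can be sketched as a standard centered-Gram eigendecomposition; a minimal NumPy version (the helper name `kpca_layer` is ours):

```python
import numpy as np

def kpca_layer(K, num_components):
    """Kernel PCA on a precomputed m x m Gram matrix K: center the
    Gram matrix in feature space, eigendecompose, and return the
    projections of the m training points onto the top components."""
    m = K.shape[0]
    one = np.full((m, m), 1.0 / m)
    Kc = K - one @ K - K @ one + one @ K @ one   # double centering
    evals, evecs = np.linalg.eigh(Kc)            # ascending eigenvalues
    top = np.argsort(evals)[::-1][:num_components]
    lam, V = evals[top], evecs[:, top]
    return V * np.sqrt(np.maximum(lam, 0.0))     # rows: embedded points
```

The rows returned for one layer are the candidate features whose Gram matrix feeds the next layer, before supervised pruning of uninformative components.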
However, we do not strictly transmit each layer's top principal components to the next layer; some components are discarded if they are deemed uninformative. While any nonlinear kernel can be used for the layerwise PCA in MKMs, arc-cosine kernels are natural choices to mimic the computations in large neural nets.\nFeature selection. The layers in MKMs are trained by interleaving a supervised method for feature selection with the unsupervised method of kernel PCA. The feature selection is used to prune away uninformative features at each layer in the MKM (including the zeroth layer, which stores the raw inputs). Intuitively, this feature selection helps to focus the unsupervised learning in MKMs on statistics of the inputs that actually contain information about the class labels. We prune features at each layer by a simple two-step procedure that first ranks them by estimates of their mutual information, then truncates them using cross-validation. More specifically, in the first step, we discretize each real-valued feature and construct class-conditional and marginal histograms of its discretized values; then, using these histograms, we estimate each feature's mutual information with the class label and sort the features in order of these estimates [14]. In the second step, considering only the first w features in this ordering, we compute the error rates of a basic kNN classifier using Euclidean distances in feature space. We compute these error rates on a held-out set of validation examples for many values of k and w and record the optimal values for each layer. The optimal w determines the number of informative features passed on to the next layer; this is essentially the width of the layer. In practice, we varied k from 1 to 15 and w from 10 to 300; though exhaustive, this cross-validation can be done quickly and efficiently by careful bookkeeping. 
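The first step of this pruning, ranking features by a histogram-based plug-in estimate of mutual information with the class label, can be sketched as follows (the bin count and function name are our choices):

```python
import numpy as np

def mutual_info_rank(X, labels, bins=10):
    """Rank features by estimated mutual information with the class
    label, using histograms of each discretized feature (a simple
    plug-in estimate of the first pruning step described above)."""
    scores = []
    for j in range(X.shape[1]):
        edges = np.histogram_bin_edges(X[:, j], bins=bins)
        f = np.clip(np.digitize(X[:, j], edges[1:-1]), 0, bins - 1)
        mi = 0.0
        for c in np.unique(labels):
            for b in range(bins):
                p_fc = np.mean((f == b) & (labels == c))   # joint
                p_f, p_c = np.mean(f == b), np.mean(labels == c)
                if p_fc > 0:
                    mi += p_fc * np.log(p_fc / (p_f * p_c))
        scores.append(mi)
    return np.argsort(scores)[::-1]   # most informative features first
```

Truncating this ordering at the first w features, with w chosen by validation error of a kNN classifier, then gives the layer width.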
Note that this procedure determines the architecture of the network in a greedy, layer-by-layer fashion.\nDistance metric learning. Test examples in MKMs are classified by a variant of kNN classification on the outputs of the final layer. Specifically, we use large margin nearest neighbor (LMNN) classification [15] to learn a Mahalanobis distance metric for these outputs, though other methods are equally viable [17]. The use of LMNN is inspired by the supervised fine-tuning of weights in the training of deep architectures [18]. In MKMs, however, this supervised training only occurs at the final layer (which underscores the importance of feature selection in earlier layers). LMNN learns a distance metric by solving a problem in semidefinite programming; one advantage of LMNN is that the required optimization is convex. Test examples are classified by the energy-based decision rule for LMNN [15], which was itself inspired by earlier work on multilayer neural nets [19].\n\n3.2 Experiments on multiway classification\n\nWe evaluated MKMs on the two multiclass data sets from previous benchmarks [11] that exhibited the largest performance gap between deep and shallow architectures. The data sets were created from the MNIST data set [20] of 28 × 28 grayscale handwritten digits. The mnist-back-rand data set was generated by filling the image background with random pixel values, while the mnist-back-image data set was generated by filling the image background with random image patches; examples are shown in Figs. 4 and 5. Each data set contains 12000 training examples and 50000 test examples.\n\nFigure 4: Left: examples from the mnist-back-rand data set. Right: classification error rates on the test set for MKMs with different kernels and numbers of layers ℓ. MKMs with arc-cosine kernels have error rates from 6.36–7.52%. 
The best previous results are 14.58% for SVMs with RBF kernels and 6.73% for deep belief nets [11].\n\nFigure 5: Left: examples from the mnist-back-image data set. Right: classification error rates on the test set for MKMs with different kernels and numbers of layers ℓ. MKMs with arc-cosine kernels have error rates from 18.43–29.79%. The best previous results are 22.61% for SVMs with RBF kernels and 16.31% for deep belief nets [11].\n\nWe trained MKMs with arc-cosine kernels and RBF kernels in each layer. For each data set, we initially withheld the last 2000 training examples as a validation set. Performance on this validation set was used to determine each MKM's architecture, as described in the previous section, and also to set the kernel width in RBF kernels, following the same methodology as earlier studies [11]. Once these parameters were set by cross-validation, we re-inserted the validation examples into the training set and used all 12000 training examples for feature selection and distance metric learning. For kernel PCA, we were limited by memory requirements to processing only 6000 of the 12000 training examples. We chose these 6000 examples randomly, but repeated each experiment five times to obtain a measure of average performance. The results we report for each MKM are the average performance over these five runs.\nThe right panels of Figs. 4 and 5 show the test set error rates of MKMs with different kernels and numbers of layers ℓ. For reference, we also show the best previously reported results [11] using traditional SVMs (with RBF kernels) and deep belief nets (with three layers). MKMs perform significantly better than shallow architectures such as SVMs with RBF kernels or LMNN with feature selection (reported as the case ℓ = 0). 
Compared to deep belief nets, the leading MKMs obtain slightly lower error rates on one data set and slightly higher error rates on the other.

We can describe the architecture of an MKM by the number of selected features at each layer (including the input layer). The number of features essentially corresponds to the number of units in each layer of a neural net. For the mnist-back-rand data set, the best MKM used an n = 1 arc-cosine kernel and 300-90-105-136-126-240 features at each layer. For the mnist-back-image data set, the best MKM used an n = 0 arc-cosine kernel and 300-50-130-240-160-150 features at each layer.

MKMs worked best with arc-cosine kernels of degree n = 0 and n = 1. The kernel of degree n = 2 performed less well in MKMs, perhaps because multiple iterations of kernel PCA distorted the dynamic range of the inputs (which in turn seemed to complicate the training for LMNN). MKMs with RBF kernels were difficult to train due to their sensitive dependence on the kernel width parameters. It was extremely time-consuming to cross-validate the kernel width at each layer of the MKM.
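By contrast, composing arc-cosine kernels requires no width parameter at any layer: the recursion of section 2.3 maps one Gram matrix to the next in closed form. A minimal sketch of ℓ-fold composition for the n = 1 kernel (an illustrative reconstruction with toy data, not the authors' code):

```python
import numpy as np

def arc_cosine_step(K):
    # One level of recursion (n = 1): K^(l) -> K^(l+1), using
    # k(x, y) = (1/pi) |phi(x)| |phi(y)| (sin(theta) + (pi - theta) cos(theta)),
    # where |phi(x)|^2 = K^(l)(x, x) and theta is the angle in feature space.
    d = np.sqrt(np.clip(np.diag(K), 1e-12, None))
    cos = np.clip(K / np.outer(d, d), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(d, d) * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

X = np.random.RandomState(0).randn(12, 6)   # toy inputs, for illustration only
K = X @ X.T                                 # start from the linear Gram matrix
for _ in range(3):                          # three levels of recursion
    K = arc_cosine_step(K)
```

Note that the n = 1 step leaves the diagonal (the squared norms) unchanged, consistent with the observation that only the n = 1 arc-cosine kernel preserves the magnitude of its inputs.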
We only obtained meaningful results for one- and two-layer MKMs with RBF kernels.

We briefly summarize many results that we lack space to report in full. We also experimented on multiclass data sets using SVMs with single and multi-layer arc-cosine kernels, as described in section 2. For multiclass problems, these SVMs compared poorly to deep architectures (both DBNs and MKMs), presumably because they had no unsupervised training that shared information across examples from all different classes. In further experiments on MKMs, we attempted to evaluate the individual contributions to performance from feature selection and LMNN classification. Feature selection helped significantly on the mnist-back-image data set, but only slightly on the mnist-back-rand data set. Finally, LMNN classification in the output layer yielded consistent improvements over basic kNN classification provided that we used the energy-based decision rule [15].

4 Discussion

In this paper, we have developed a new family of kernel functions that mimic the computation in large, multilayer neural nets. On challenging data sets, we have obtained results that outperform previous SVMs and compare favorably to deep belief nets.
More significantly, our experiments validate the basic intuitions behind deep learning in the altogether different context of kernel-based architectures. A similar validation was provided by recent work on kernel methods for semi-supervised embedding [7]. We hope that our results inspire more work on kernel methods for deep learning.

There are many possible directions for future work. For SVMs, we are currently experimenting with arc-cosine kernel functions of fractional (and even negative) degree n. For MKMs, we are hoping to explore better schemes for feature selection [21, 22] and kernel selection [23]. Also, it would be desirable to incorporate prior knowledge, such as the invariances modeled by convolutional neural nets [24, 4], though it is not obvious how to do so. These issues and others are left for future work.

A Derivation of kernel function

In this appendix, we show how to evaluate the multidimensional integral in eq. (1) for the arc-cosine kernel. Let θ denote the angle between the inputs x and y. Without loss of generality, we can take x to lie along the w1 axis and y to lie in the w1w2-plane. Integrating out the orthogonal coordinates of the weight vector w, we obtain the result in eq. (3), where J_n(θ) is the remaining integral:

  J_n(θ) = ∫ dw1 dw2 e^{-(w1² + w2²)/2} Θ(w1) Θ(w1 cos θ + w2 sin θ) w1^n (w1 cos θ + w2 sin θ)^n.  (14)

Changing variables to u = w1 and v = w1 cos θ + w2 sin θ, we simplify the domain of integration to the first quadrant of the uv-plane:

  J_n(θ) = (1/sin θ) ∫0^∞ du ∫0^∞ dv e^{-(u² + v² - 2uv cos θ)/(2 sin²θ)} u^n v^n.  (15)

The prefactor of (sin θ)^{-1} in eq. (15) is due to the Jacobian. To simplify the integral further, we adopt polar coordinates u = r cos(ψ/2 + π/4) and v = r sin(ψ/2 + π/4). Then, integrating out the radius coordinate r, we obtain:

  J_n(θ) = n! (sin θ)^{2n+1} ∫0^{π/2} dψ cos^n ψ / (1 - cos θ cos ψ)^{n+1}.  (16)

To evaluate eq. (16), we first consider the special case n = 0. The following result can be derived by contour integration in the complex plane [25]:

  ∫0^{π/2} dψ / (1 - cos θ cos ψ) = (π - θ) / sin θ.  (17)

Substituting eq. (17) into our expression for the angular part of the kernel function in eq. (16), we recover our earlier claim that J_0(θ) = π - θ. Related integrals for the special case n = 0 can also be found in earlier work [8]. For the case n > 0, the integral in eq. (16) can be performed by the method of differentiating under the integral sign. In particular, we note that:

  ∫0^{π/2} dψ cos^n ψ / (1 - cos θ cos ψ)^{n+1} = (1/n!) ∂^n/∂(cos θ)^n ∫0^{π/2} dψ / (1 - cos θ cos ψ).  (18)

Substituting eq. (18) into eq. (16), then appealing to the previous result in eq. (17), we recover the expression for J_n(θ) in eq. (4).

References

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. MIT Press, 2007.
[2] G.E. Hinton, S. Osindero, and Y.W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
[3] G.E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.
[4] M.A. Ranzato, F.J. Huang, Y.L. Boureau, and Y. LeCun.
Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Proceedings of the 2007 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-07), pages 1–8, 2007.
[5] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pages 160–167, 2008.
[6] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, to appear, 2009.
[7] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning (ICML-08), pages 1168–1175, 2008.
[8] C.K.I. Williams. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.
[9] R.H.R. Hahnloser, H.S. Seung, and J.J. Slotine. Permitted and forbidden sets in symmetric threshold-linear networks. Neural Computation, 15(3):621–638, 2003.
[10] R.M. Neal. Bayesian Learning for Neural Networks. Springer-Verlag New York, Inc., 1996.
[11] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning (ICML-07), pages 473–480, 2007.
[12] C.C. Chang and C.J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[13] B. Schölkopf, A. Smola, and K. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[14] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157–1182, 2003.
[15] K.Q. Weinberger and L.K. Saul.
Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
[16] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Technical Report 44, Max-Planck-Institut für biologische Kybernetik, 1996.
[17] J. Goldberger, S. Roweis, G.E. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. In L.K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 513–520. MIT Press, 2005.
[18] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.
[19] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the 2005 IEEE Conference on Computer Vision and Pattern Recognition (CVPR-05), pages 539–546, 2005.
[20] Y. LeCun and C. Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
[21] M. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.
[22] A.J. Smola, O.L. Mangasarian, and B. Schölkopf. Sparse kernel feature analysis. Technical Report 99-04, University of Wisconsin, Data Mining Institute, Madison, 1999.
[23] G. Lanckriet, N. Cristianini, P. Bartlett, L.E. Ghaoui, and M.I. Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27–72, 2004.
[24] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
[25] G.F. Carrier, M. Krook, and C.E. Pearson. Functions of a Complex Variable: Theory and Technique. Society for Industrial and Applied Mathematics, 2005.
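As a numerical sanity check (a sketch added for illustration, not part of the original paper), the closed forms J_0(θ) = π - θ and J_1(θ) = sin θ + (π - θ) cos θ recovered in Appendix A can be compared against direct numerical evaluation of the double integral in eq. (15), here using SciPy:

```python
import numpy as np
from scipy.integrate import dblquad

def J_numeric(theta, n):
    # Double integral from eq. (15):
    # J_n(theta) = (1/sin theta) * int_0^inf int_0^inf u^n v^n
    #              exp(-(u^2 + v^2 - 2 u v cos theta) / (2 sin^2 theta)) du dv
    s, c = np.sin(theta), np.cos(theta)
    f = lambda v, u: u**n * v**n * np.exp(-(u*u + v*v - 2*u*v*c) / (2*s*s))
    val, _ = dblquad(f, 0.0, np.inf, 0.0, np.inf)
    return val / s

theta = np.pi / 3
J0_closed = np.pi - theta                                    # J_0, from eq. (17)
J1_closed = np.sin(theta) + (np.pi - theta) * np.cos(theta)  # J_1, from eq. (4)
```

At θ = π/3, both closed forms agree with the numerical integral to within the quadrature tolerance.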