{"title": "Laplace Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 441, "page_last": 448, "abstract": "", "full_text": "Laplace Propagation

Alex J. Smola, S.V.N. Vishwanathan
Machine Learning Group
ANU and National ICT Australia
Canberra, ACT, 0200
{smola, vishy}@axiom.anu.edu.au

Eleazar Eskin
Department of Computer Science
Hebrew University Jerusalem
Jerusalem, Israel, 91904
eeskin@cs.columbia.edu

Abstract

We present a novel method for approximate inference in Bayesian models and regularized risk functionals. It is based on the propagation of mean and variance derived from the Laplace approximation of conditional probabilities in factorizing distributions, much akin to Minka's Expectation Propagation. In the jointly normal case it coincides with the latter and with belief propagation, whereas in the general case it provides an optimization strategy containing Support Vector chunking, the Bayes Committee Machine, and Gaussian Process chunking as special cases.

1 Introduction

Inference via Bayesian estimation can lead to optimization problems over rather large data sets. Exact computation in these cases is often computationally intractable, which has led to many approximation algorithms, such as variational approximation [5] or loopy belief propagation. However, most of these methods still rely on the propagation of exact probabilities (upstream and downstream evidence in the case of belief propagation), rather than an approximation. This approach becomes costly if the random variables are real valued or if the graphical model contains large cliques.

To fill this gap, methods such as Expectation Propagation (EP) [6] have been proposed, with explicit modifications to deal with larger cliques and real-valued variables.
EP works by propagating the sufficient statistics of an exponential family, that is, mean and variance for the normal distribution, between the various factors of the posterior. This is an attractive choice only if we are able to compute the required quantities explicitly, which means that we need to solve an integral in closed form.

Furthermore, computation of the mode of the posterior (MAP approximation) is a legitimate task in its own right; Support Vector Machines (SVM) fall into this category. In the following we develop a cheap version of EP which requires only the Laplace approximation in each step, and we show how this can be applied to SVM and Gaussian Processes.

Outline of the Paper: We describe the basic ideas of LP in Section 2, show how it applies to Gaussian Processes (in particular the Bayes Committee Machine of [9]) in Section 3, prove that SVM chunking is a special case of LP in Section 4, and finally demonstrate the feasibility of LP in experiments (Section 5).

2 Laplace Propagation

Let X be a set of observations and denote by θ a parameter we would like to infer by studying p(θ|X). This goal typically involves computing expectations E_{p(θ|X)}[θ], which can only rarely be computed exactly. Hence we approximate

E_{p(θ|X)}[θ] ≈ argmin_θ [−log p(θ|X)] =: θ̂    (1)

Var_{p(θ|X)}[θ] ≈ [∂²_θ (−log p(θ|X))|_{θ=θ̂}]⁻¹    (2)

This is commonly referred to as the Laplace approximation. It is exact for normal distributions and works best if θ is strongly concentrated around its mean. Solving for θ̂ can be costly. However, if p(θ|X) has special structure, such as being the product of several simple terms, possibly each of them depending only on a small number of variables at a time, computational savings can be gained.
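As a toy illustration of (1) and (2), the mode can be found by numerically minimizing −log p(θ|X) and the variance read off from the inverse Hessian at the mode. The one-dimensional log-posterior below is our own made-up example, not one from the paper:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical unnormalized posterior (an assumption, for illustration only):
# -log p(theta|X) = (theta - 1)^4 + 0.5 * theta^2
def neg_log_posterior(theta):
    t = theta[0]
    return (t - 1.0) ** 4 + 0.5 * t ** 2

# (1): the mode theta_hat approximates E[theta].
theta_hat = minimize(neg_log_posterior, x0=np.zeros(1)).x[0]

# (2): the inverse Hessian of -log p at the mode approximates Var[theta];
# here the second derivative is formed by a central finite difference.
eps = 1e-4
hess = (neg_log_posterior([theta_hat + eps])
        - 2.0 * neg_log_posterior([theta_hat])
        + neg_log_posterior([theta_hat - eps])) / eps ** 2
var_hat = 1.0 / hess  # for this toy problem: theta_hat = 0.5, var_hat = 0.25
```

For this particular objective the stationarity condition 4(θ − 1)³ + θ = 0 has the unique root θ̂ = 0.5, with second derivative 12(θ − 1)² + 1 = 4, so the sketch recovers variance 0.25.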
In the following we present an algorithm to take advantage of this structure by breaking (1) up into smaller pieces and optimizing over them separately.

2.1 Approximate Inference

For the sake of simplicity in notation we drop the explicit dependency of θ on X and, as in [6], we assume that

p(θ) = ∏_{i=1}^N t_i(θ).    (3)

Our strategy relies on the assumption that if we succeed in finding good approximations t̃_i(θ) of each of the terms t_i(θ), we will obtain an approximate maximizer θ̃ of p(θ) by maximizing p̃(θ) := ∏_{i=1}^N t̃_i(θ). The key is a good approximation of each of the t_i at the maximum of p(θ). This is ensured by maximizing

p̃_i(θ) := t_i(θ) ∏_{j=1, j≠i}^N t̃_j(θ)    (4)

and subsequently using the Laplace approximation of t_i(θ) at θ̃_i := argmax_θ p̃_i(θ) as the new estimate t̃_i(θ). This process is repeated until convergence. The following lemma shows that this strategy is valid:

Lemma 1 (Fixed Point of Laplace Propagation) For all second-order fixed points the following holds: θ* is a fixed point of Laplace propagation if and only if it is a local optimum of p(θ).

Proof: Assume that θ* is a fixed point of the above algorithm. Then the first-order optimality conditions require ∂_θ log p̃_i(θ*) = 0 for all i, and the Laplace approximation yields ∂_θ log t̃_i(θ*) = ∂_θ log t_i(θ*) and ∂²_θ log t̃_i(θ*) = ∂²_θ log t_i(θ*). Consequently, up to second order, the derivatives of p̃, p̃_i, and p agree at θ*, which implies that θ* is a local optimum.

Next assume that θ* is locally optimal.
Then again, the ∂_θ log p̃_i(θ*) have to vanish, since the Laplace approximation is exact up to second order. This means that all t̃_i also have an optimum at θ*, which means that θ* is a fixed point.

The next step is to establish methods for updating the approximations t̃_i of t_i. One option is to perform such updates sequentially, thereby improving only one t̃_i at a time. This is advantageous if we can process only one approximation at a time. For parallel processing, however, we will perform several operations at a time, that is, recompute several t̃_i(θ) and merge the new approximations subsequently. We will see that the BCM is a one-step approximation of LP in the parallel case, whereas SV chunking is an exact implementation of LP in the sequential case.

2.2 Message Passing

Message passing [7] has been widely successful for inference in graphical models. Assume that we can split θ into a (not necessarily disjoint) set of coordinates, say θ_{C_1}, ..., θ_{C_N}, such that

p(θ) = ∏_{i=1}^N t_i(θ_{C_i}).    (5)

Then the goal of computing a Laplace approximation of p̃_i reduces to computing a Laplace approximation for the subset of variables θ_{C_i}, since these are the only coordinates t_i depends on. Note that an update in θ_{C_i} means that only terms sharing variables with θ_{C_i} are affected. For directed graphical models, these are the conditional probabilities governing the parents and children of θ_{C_i}. Hence, to carry out calculations we only need to consider local information regarding t̃_i(θ_{C_i}).

In the example in the figure, θ3 depends on (θ1, θ2), and (θ4, θ5) are conditionally independent of θ1 and θ2, given θ3.
Consequently, we may write p(θ) as

p(θ) = p(θ1) p(θ2) p(θ3|θ1, θ2) p(θ4|θ3) p(θ5|θ3).    (6)

[Figure: a directed graphical model in which θ1 and θ2 are the parents of θ3, and θ4 and θ5 are the children of θ3.]

To find the Laplace approximation corresponding to the terms involving θ3 we only need to consider p(θ3|θ1, θ2) itself and its neighbors “upstream” and “downstream” of θ3 containing θ1, θ2, θ3 in their functional form.

This means that LP can be used as a drop-in replacement for exact inference in message passing algorithms. The main difference is that we now propagate mean and variance from the Laplace approximation rather than true probabilities (as in message passing) or true means and variances (as in expectation propagation).

3 Bayes Committee Machine

In this section we show that the Bayes Committee Machine (BCM) [9] corresponds to one step of LP in conjunction with a particular initialization, namely constant t̃_i. As a result, we extend the BCM into an iterative method for improved precision of the estimates.

3.1 The Basic Idea

Let us assume that we are given a set of sets of observations, say Z_1, ..., Z_N, which are conditionally independent of each other given a parameter θ, as depicted in the figure [omitted: a graphical model with θ as the common parent of Z_1, Z_2, ..., Z_N]. Repeated application of Bayes rule allows us to rewrite the conditional density p(θ|Z) as

p(θ|Z) ∝ p(Z|θ) p(θ) = p(θ) ∏_{i=1}^N p(Z_i|θ) ∝ p^{1−N}(θ) ∏_{i=1}^N p(θ|Z_i).    (7)

Finally, Tresp and coworkers [9] find Laplace approximations for p(θ|Z_i) ∝ p(Z_i|θ) p(θ) with respect to θ.
These results are then combined via (7) to come up with an overall estimate of p(θ|Z).

3.2 Rewriting The BCM

The repeated invocation of Bayes rule seems wasteful, yet it was necessary in the context of the BCM formulation to explain how estimates from subsets could be combined in a committee-like fashion. To show the equivalence of the BCM with one step of LP, recall the third term of (7). We have

p(θ|Z) = c · p(θ) ∏_{i=1}^N p(Z_i|θ), with t_0(θ) := c p(θ) and t_i(θ) := p(Z_i|θ),    (8)

where c is a suitable normalizing constant. In Gaussian processes we generally assume that p(θ) is normal, hence −log t_0(θ) is quadratic. This allows us to state the LP algorithm to find the mode and curvature of p(θ|Z):

Algorithm 1 Iterated Bayes Committee Machine
  Initialize t̃_0 ← c p(θ) and t̃_i(θ) ← const.
  repeat
    Compute new approximations t̃_i(θ) in parallel by finding Laplace approximations to p̃_i, as defined in (4). Since t_0 is normal, t̃_0(θ) = t_0(θ). For i ≠ 0 we obtain
      p̃_i = t_i(θ) ∏_{j=0, j≠i}^N t̃_j(θ) = p(θ) p(Z_i|θ) ∏_{j=1, j≠i}^N t̃_j(θ).    (9)
  until convergence
  Return argmax_θ t_0(θ) ∏_{i=1}^N t̃_i(θ).

Note that in the first iteration (9) can be written as p̃_i ∝ p(θ) p(Z_i|θ), since all remaining terms t̃_j are constant. This means that after the first update t̃_i is identical to the estimates obtained from the BCM.

Whereas the BCM stops at this point, we have the liberty to continue the approximation, and also the liberty to choose whether we use a parallel or a sequential update regime, depending on the number of processing units available.
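For intuition, here is a small numerical sketch of one parallel step of this scheme on a model where every factor is Gaussian, so each Laplace approximation is exact; the dimensions, random precision matrices, and zero-mean prior are our own assumptions. A single pass with constant t̃_i then already recovers the exact posterior mode, as the combination rule (7) predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (assumed, not from the paper): a zero-mean normal prior t_0 and
# Gaussian likelihood factors t_i(theta) = p(Z_i|theta), each parameterized by
# a precision matrix Q_i and a mean mu_i.
d, N = 3, 4
prior_prec = np.eye(d)
precs = [np.eye(d) + A @ A.T for A in rng.normal(size=(N, d, d))]
means = [rng.normal(size=d) for _ in range(N)]

# Exact posterior mode of p(theta|Z): sum precisions, precision-weight means.
P_exact = prior_prec + sum(precs)
m_exact = np.linalg.solve(P_exact, sum(Q @ mu for Q, mu in zip(precs, means)))

# One parallel LP step with t~_i = const: each unit independently computes the
# Laplace approximation of p(theta) p(Z_i|theta) (exact here, since Gaussian).
local = [(prior_prec + Q, np.linalg.solve(prior_prec + Q, Q @ mu))
         for Q, mu in zip(precs, means)]

# Combine via (7): p(theta|Z) proportional to p^(1-N)(theta) prod_i p(theta|Z_i).
P_bcm = (1 - N) * prior_prec + sum(P for P, _ in local)
m_bcm = np.linalg.solve(P_bcm, sum(P @ m for P, m in local))

print(np.allclose(m_exact, m_bcm))  # True: one step is exact for Gaussians
```

The cancellation is easy to check by hand: the combined precision is (1 − N)·P_0 + Σ_i (P_0 + Q_i) = P_0 + Σ_i Q_i, exactly the precision of the full posterior.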
As a side-effect, we obtain a simplified proof of the following:

Theorem 2 (Exact BCM [9]) For normal distributions the BCM is exact, that is, the Iterated BCM converges in one step.

Proof: For normal distributions all t̃_i are exact, hence p(θ) = ∏_i t_i(θ) = ∏_i t̃_i(θ) = p̃(θ), which shows that p̃ = p.

Note that [9] formulates the problem as one of classification or regression, that is, Z = (X, Y), where the labels Y are conditionally independent given X and the parameter θ. This, however, does not affect the validity of our reasoning.

4 Support Vector Machines

The optimization goals in Support Vector Machines (SVM) are very similar to those in Gaussian Processes: essentially the negative log posterior −log p(θ|Z) corresponds to the objective function of the SV optimization problem.

This gives hope that LP can be adapted to SVM. In the following we show that SVM chunking [4] and parallel SVM training [2] are special cases of LP. Taking logarithms of (3) and defining π_i(θ) := −log t_i(θ) (and π̃_i(θ) := −log t̃_i(θ) analogously), we obtain the following formulation of LP in log-space.

Algorithm 2 Logarithmic Version of Laplace Propagation
  Initialize π̃_i(θ)
  repeat
    Choose index i ∈ {1, ..., N}
    Minimize π_i(θ) + ∑_{j=1, j≠i}^N π̃_j(θ) and replace π̃_i(θ) by a Taylor approximation at the minimum θ_i of the above expression.
  until all θ_i agree

4.1 Chunking

To show that SV chunking is equivalent to LP in log-space, we briefly review the basic ideas of chunking.
The standard SVM optimization problem is

minimize_{θ,b}  π(θ, b) := (1/2)‖θ‖² + C ∑_{i=1}^m c(x_i, y_i, f(x_i))
subject to  f(x_i) = ⟨θ, Φ(x_i)⟩ + b    (10)

Here Φ(x) is the map into feature space such that k(x, x′) = ⟨Φ(x), Φ(x′)⟩, and c(x, y, f(x)) is a loss function penalizing the deviation between the estimate f(x) and the observation y. We typically assume that c is convex. For the rest of the derivation we let c(x, y, f(x)) = max(0, 1 − y f(x)) (the analysis still holds in the general case; however, it becomes considerably more tedious). The dual of (10) becomes

minimize_α  (1/2) ∑_{i,j=1}^m α_i α_j y_i y_j k(x_i, x_j) − ∑_{i=1}^m α_i  s.t.  ∑_{i=1}^m y_i α_i = 0 and α_i ∈ [0, C]    (11)

The basic idea of chunking is to optimize over only a subset of the vector α at a time. Denote by S_w the set of variables we are using in the current optimization step, let α_w be the corresponding vector, and let α_f be the variables which remain unchanged. Likewise denote by y_w, y_f the corresponding parts of y, and let H = [H_ww, H_wf; H_fw, H_ff] be the quadratic matrix of (11), again split into terms depending on α_w and α_f respectively. Then (11), restricted to α_w, can be written as [4]

minimize_{α_w}  (1/2) α_w⊤ H_ww α_w + α_f⊤ H_fw α_w − ∑_{i∈S_w} α_i  s.t.  y_w⊤ α_w + y_f⊤ α_f = 0, α_i ∈ [0, C]    (12)

4.2 Equivalence to LP

We now show that the correction terms arising from chunking are the same as those arising from LP. Denote by S_1, ..., S_N a partition of {1, ..., m} and define

π_0(θ, b) := (1/2)‖θ‖²  and  π_i(θ, b) := C ∑_{j∈S_i} c(x_j, y_j, f(x_j)).    (13)

Then π̃_0 = π_0, since π_0 is purely quadratic, regardless of where we expand π_0.
As for π_i (with i ≠ 0) we have

π̃_i = ∑_{j∈S_i} y_j β_j ⟨Φ(x_j), θ⟩ + ∑_{j∈S_i} y_j β_j b = ⟨θ_i, θ⟩ + b_i b,    (14)

where β_j ∈ C c′(x_j, y_j, f(x_j)), θ_i := ∑_{j∈S_i} y_j β_j Φ(x_j), and b_i := ∑_{j∈S_i} y_j β_j.¹ In this case minimization over π_i(θ) + ∑_{j≠i} π̃_j(θ) amounts to minimizing

(1/2)‖θ‖² + C ∑_{j∈S_i} c(x_j, y_j, f(x_j)) + ∑_{j≠i} [⟨θ_j, θ⟩ + b_j b]  s.t.  f(x_j) = ⟨θ, Φ(x_j)⟩ + b.

Skipping technical details, the dual optimization problem is given by

minimize_α  (1/2) ∑_{j,l∈S_i} α_j α_l y_j y_l k(x_j, x_l) − ∑_{j∈S_i} α_j − ∑_{j∈S_i, l∉S_i} α_j β_l y_j y_l k(x_j, x_l)    (15)

subject to α_j ∈ [0, C] and ∑_{j∈S_i} y_j α_j − ∑_{j∉S_i} y_j β_j = 0.

The latter is identical to (12), the optimization problem arising from chunking, provided that we perform the substitution α_j = −β_j for all j ∉ S_i. To show this last step, note that at optimality null has to be an element of the subdifferential of π_i(θ) + ∑_{j≠i} π̃_j with respect to θ and b. Taking derivatives of π_i + ∑_{j≠i} π̃_j implies

θ ∈ −C ∑_{j∈S_i} c′(x_j, y_j, f(x_j)) Φ(x_j) − ∑_{j≠i} θ_j.    (16)

Matching up terms in the expansion of θ we immediately obtain β_j = −α_j.

Finally, to start the approximation scheme we need to consider a proper initialization of π̃_i. In analogy to the BCM setting we use π̃_i = 0, which leads precisely to the SVM chunking method, where one optimizes over one subset at a time (denoted by S_i), while the other sets are fixed, taking only their linear contribution into account.

LP does not require that all the updates of t̃_i (or π̃_i) be carried out sequentially.
Instead, we can also consider parallel approximations similar to [2]. There the optimization problem is split into several small parts, each of which is solved independently. Subsequently the estimates are combined by averaging.

This is equivalent to one-step parallel LP: with the initialization π̃_i = 0 for all i ≠ 0 and π̃_0 = π_0 = (1/2)‖θ‖², we minimize π_i + ∑_{j≠i} π̃_j in parallel. This is equivalent to solving the SV optimization problem on the corresponding subset S_i (as we saw in the previous section). Hence, the linear terms θ_i, b_i arising from the approximation π̃_i(θ, b) = C⟨θ_i, θ⟩ + C b_i b lead to the overall approximation

π̃(θ, b) = (1/2)‖θ‖² + ∑_i π̃_i(θ, b) = (1/2)‖θ‖² + C ∑_i [⟨θ_i, θ⟩ + b_i b],    (17)

with the joint minimizer being the average of the individual solutions.

5 Experiments

To test our ideas we performed a set of experiments with the widely available Web and Adult datasets from the UCI repository [1]. All experiments were performed on a 2.4 GHz Intel Xeon machine with 1 GB RAM using MATLAB R13. We used an RBF kernel with σ² = 10 [8] to obtain comparable results.

We first tested the performance of Gaussian process training with Laplace propagation using a logistic loss function. The data was partitioned into chunks of roughly 500 samples each, and the maximum number of columns in the low-rank approximation [3] was set to 750.

¹ Note that we had to replace the equality with set inclusion due to the fact that c is not everywhere differentiable; hence we used subdifferentials instead.

We summarize the performance of our algorithm in Table 1. TFactor refers to the time (in seconds) for computing the low-rank factorization, while TSerial and TParallel denote the training times for the Gaussian process under serial and parallel updates.
We empirically observed that on all datasets the algorithm converges in less than 3 iterations using serial updates and in less than 6 iterations using parallel updates.

Dataset  TFactor  TSerial  TParallel    Dataset  TFactor  TSerial  TParallel
Adult1    16.38    25.72     53.90      Web1      20.33    34.33     93.47
Adult2    20.07    33.02     75.76      Web2      36.27    67.65     88.37
Adult3    24.41    47.05    106.88      Web3      37.09    92.36    212.04
Adult4    36.29    75.71    202.88      Web4      69.90   168.88    251.92
Adult5    56.82    97.57    169.79      Web5      68.15   225.13    249.15
Adult6    89.78   232.45    348.10      Web6     129.86   261.23    663.07
Adult7   119.39   293.45    559.23      Web7     213.54   483.52    838.36

Table 1: Gaussian process training with serial and parallel Laplace propagation (times in seconds).

We conducted another set of experiments to test the speedups obtained by seeding SMO with the values of α obtained by performing one iteration of Laplace propagation on the dataset. As before, we used an RBF kernel with σ² = 10. We partitioned the Adult1 and Web1 datasets into 5 chunks each, while the Adult4 and Web4 datasets were partitioned into 10 chunks each. The freely available SMOBR package was modified and used for our experiments.
For simplicity we use the C-SVM and vary the regularization parameter. TParallel, TSerial, and TNoMod refer to the times required by SMO to converge when seeded with one iteration of parallel, serial, or no LP on the dataset, respectively.

Adult1:                                 Adult4:
C     TParallel  TSerial   TNoMod      C     TParallel  TSerial   TNoMod
0.1      2.84      2.04     7.650      0.1     20.42     13.26    59.935
0.5      5.57      3.99     9.215      0.5     46.29     40.82    63.645
1.0      5.48      7.25    10.885      1.0     80.33     64.37   107.475
5.0    107.37    110.07   307.135      5.0   1921.19   1500.42  1427.925

Table 2: Performance of SMO initialization on the Adult dataset.

Web1:                                   Web4:
C     TParallel  TSerial   TNoMod      C     TParallel  TSerial   TNoMod
0.1     21.36     15.65    27.34       0.1     63.76     77.05    95.10
0.5     34.64     35.66    60.12       0.5    140.61    149.80   156.525
1.0     61.15     38.56    63.745      1.0    254.84    298.59   232.120
5.0    224.15     62.41   519.67       5.0   1959.08   3188.75  2223.225

Table 3: Performance of SMO initialization on the Web dataset.

As can be seen, our initialization significantly speeds up SMO in many cases, sometimes achieving up to a 4-fold speedup, although in some cases (especially for large values of C) our method seems to slow down the convergence of SMO. In general, serial updates seem to perform better than parallel updates.
This is to be expected, since the serial algorithm uses the information from other blocks as soon as it becomes available, while the parallel algorithm ignores the other blocks completely.

6 Summary And Discussion

Laplace propagation fills the gap between Expectation Propagation, which requires exact computation of first and second order moments, and message passing algorithms for optimizing structured density functions. Its main advantage is that it requires only the Laplace approximation in each computational step, while being applicable to a wide range of optimization tasks. In this sense, it complements Minka's Expectation Propagation whenever exact expressions are not available.

As a side effect, we showed that Tresp's Bayes Committee Machine and Support Vector chunking methods are special instances of this strategy, which also sheds light on why simple averaging schemes for SVM, such as the one of Collobert and Bengio, seem to work in practice.

The key point in our proofs was that we split the data into disjoint subsets. By the assumption of independent and identically distributed data it followed that the variable assignments are conditionally independent of each other, given the parameter θ, which led to a favorable factorization property of p(θ|Z). It should be noted that LP allows one to perform chunking-style optimization in Gaussian Processes, which effectively puts an upper bound on the amount of memory required for optimization purposes.

Acknowledgements: We thank Nir Friedman, Zoubin Ghahramani and Adam Kowalczyk for useful suggestions and discussions.

References

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases, 1998.

[2] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of SVMs for very large scale problems. In Advances in Neural Information Processing Systems. MIT Press, 2002.

[3] S. Fine and K. Scheinberg.
Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243-264, Dec 2001. http://www.jmlr.org.

[4] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods: Support Vector Learning, pages 169-184, Cambridge, MA, 1999. MIT Press.

[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105-162. Kluwer Academic, 1998.

[6] T. Minka. Expectation Propagation for approximate Bayesian inference. PhD thesis, MIT Media Lab, Cambridge, USA, 2001.

[7] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

[8] J. C. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report MSR-TR-98-14, Microsoft Research, 1998.

[9] V. Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.
", "award": [], "sourceid": 2444, "authors": [{"given_name": "Eleazar", "family_name": "Eskin", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "S.v.n.", "family_name": "Vishwanathan", "institution": null}]}