{"title": "Regularized Laplacian Estimation and Fast Eigenvector Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 2420, "page_last": 2428, "abstract": "Recently, Mahoney and Orecchia demonstrated that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve certain regularized Semi-Definite Programs (SDPs). In this paper, we extend that result by providing a statistical interpretation of their approximation procedure. Our interpretation will be analogous to the manner in which l2-regularized or l1-regularized l2 regression (often called Ridge regression and Lasso regression, respectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on the coefficient vector of the regression problem. Our framework will imply that the solutions to the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseudoinverse of the graph Laplacian. Conversely, it will imply that the solution to this regularized estimation problem can be computed very quickly by running, e.g., the fast diffusion-based PageRank procedure for computing an approximation to the first nontrivial eigenvector of the graph Laplacian. Empirical results are also provided to illustrate the manner in which approximate eigenvector computation implicitly performs statistical regularization, relative to running the corresponding exact algorithm.", "full_text": "Regularized Laplacian Estimation and\n\nFast Eigenvector Approximation\n\nInformation, Operations, and Management Sciences\n\nPatrick O. Perry\n\nNYU Stern School of Business\n\nNew York, NY 10012\n\npperry@stern.nyu.edu\n\nMichael W. 
Mahoney\n\nDepartment of Mathematics\n\nStanford University\nStanford, CA 94305\n\nmmahoney@cs.stanford.edu\n\nAbstract\n\nRecently, Mahoney and Orecchia demonstrated that popular diffusion-based pro-\ncedures to compute a quick approximation to the \ufb01rst nontrivial eigenvector of\na data graph Laplacian exactly solve certain regularized Semi-De\ufb01nite Programs\n(SDPs). In this paper, we extend that result by providing a statistical interpre-\ntation of their approximation procedure. Our interpretation will be analogous to\nthe manner in which (cid:96)2-regularized or (cid:96)1-regularized (cid:96)2-regression (often called\nRidge regression and Lasso regression, respectively) can be interpreted in terms\nof a Gaussian prior or a Laplace prior, respectively, on the coef\ufb01cient vector of the\nregression problem. Our framework will imply that the solutions to the Mahoney-\nOrecchia regularized SDP can be interpreted as regularized estimates of the pseu-\ndoinverse of the graph Laplacian. Conversely, it will imply that the solution to this\nregularized estimation problem can be computed very quickly by running, e.g.,\nthe fast diffusion-based PageRank procedure for computing an approximation to\nthe \ufb01rst nontrivial eigenvector of the graph Laplacian. Empirical results are also\nprovided to illustrate the manner in which approximate eigenvector computation\nimplicitly performs statistical regularization, relative to running the corresponding\nexact algorithm.\n\n1\n\nIntroduction\n\nApproximation algorithms and heuristic approximations are commonly used to speed up the run-\nning time of algorithms in machine learning and data analysis. In some cases, the outputs of these\napproximate procedures are \u201cbetter\u201d than the output of the more expensive exact algorithms, in\nthe sense that they lead to more robust results or more useful results for the downstream practi-\ntioner. 
Recently, Mahoney and Orecchia formalized these ideas in the context of computing the\n\ufb01rst nontrivial eigenvector of a graph Laplacian [1]. Recall that, given a graph G on n nodes or\nequivalently its n\u00d7 n Laplacian matrix L, the top nontrivial eigenvector of the Laplacian exactly op-\ntimizes the Rayleigh quotient, subject to the usual constraints. This optimization problem can equiv-\nalently be expressed as a vector optimization program with the objective function f (x) = xT Lx,\nwhere x is an n-dimensional vector, or as a Semi-De\ufb01nite Program (SDP) with objective function\nF (X) = Tr(LX), where X is an n \u00d7 n symmetric positive semi-de\ufb01nite matrix. This \ufb01rst non-\ntrivial vector is, of course, of widespread interest in applications due to its usefulness for graph\npartitioning, image segmentation, data clustering, semi-supervised learning, etc. [2, 3, 4, 5, 6, 7].\nIn this context, Mahoney and Orecchia asked the question: do popular diffusion-based procedures\u2014\nsuch as running the Heat Kernel or performing a Lazy Random Walk or computing the PageRank\nfunction\u2014to compute a quick approximation to the \ufb01rst nontrivial eigenvector of L solve some\nother regularized version of the Rayleigh quotient objective function exactly? Understanding this\nalgorithmic-statistical tradeoff is clearly of interest if one is interested in very large-scale applica-\ntions, where performing statistical analysis to derive an objective and then calling a black box solver\nto optimize that objective exactly might be too expensive. Mahoney and Orecchia answered the\nabove question in the af\ufb01rmative, with the interesting twist that the regularization is on the SDP\n\n1\n\n\fformulation rather than the usual vector optimization problem. 
That is, these three diffusion-based\nprocedures exactly optimize a regularized SDP with objective function F (X) + 1\n\u03b7 G(X), for some\nregularization function G(\u00b7) to be described below, subject to the usual constraints.\nIn this paper, we extend the Mahoney-Orecchia result by providing a statistical interpretation of\ntheir approximation procedure. Our interpretation will be analogous to the manner in which (cid:96)2-\nregularized or (cid:96)1-regularized (cid:96)2-regression (often called Ridge regression and Lasso regression,\nrespectively) can be interpreted in terms of a Gaussian prior or a Laplace prior, respectively, on\nthe coef\ufb01cient vector of the regression problem. In more detail, we will set up a sampling model,\nwhereby the graph Laplacian is interpreted as an observation from a random process; we will posit\nthe existence of a \u201cpopulation Laplacian\u201d driving the random process; and we will then de\ufb01ne an\nestimation problem: \ufb01nd the inverse of the population Laplacian. We will show that the maximum a\nposteriori probability (MAP) estimate of the inverse of the population Laplacian leads to a regular-\nized SDP, where the objective function F (X) = Tr(LX) and where the role of the penalty function\nG(\u00b7) is to encode prior assumptions about the population Laplacian. In addition, we will show that\nwhen G(\u00b7) is the log-determinant function then the MAP estimate leads to the Mahoney-Orecchia\nregularized SDP corresponding to running the PageRank heuristic. Said another way, the solutions\nto the Mahoney-Orecchia regularized SDP can be interpreted as regularized estimates of the pseu-\ndoinverse of the graph Laplacian. 
Moreover, by Mahoney and Orecchia's main result, the solution to this regularized SDP can be computed very quickly: rather than solving the SDP with a black-box solver, and rather than computing explicitly the pseudoinverse of the Laplacian, one can simply run the fast diffusion-based PageRank heuristic for computing an approximation to the first nontrivial eigenvector of the Laplacian L.\nThe next section describes some background. Section 3 then describes a statistical framework for graph estimation; and Section 4 describes prior assumptions that can be made on the population Laplacian. These two sections will shed light on the computational implications associated with these prior assumptions; but more importantly they will shed light on the implicit prior assumptions associated with making certain decisions to speed up computations. Then, Section 5 will provide an empirical evaluation, and Section 6 will provide a brief conclusion. Additional discussion is available in the Appendix of the technical report version of this paper [8].\n\n2 Background on Laplacians and diffusion-based procedures\n\nA weighted symmetric graph G is defined by a vertex set V = {1, . . . , n}, an edge set E ⊂ V × V, and a weight function w : E → R+, where w is assumed to be symmetric (i.e., w(u, v) = w(v, u)). In this case, one can construct a matrix, L0 ∈ R^{V×V}, called the combinatorial Laplacian of G:\n\nL0(u, v) = -w(u, v) when u ≠ v, and L0(u, v) = d(u) - w(u, u) otherwise,\n\nwhere d(u) = Σv w(u, v) is called the degree of u. By construction, L0 is positive semidefinite. Note that the all-ones vector, often denoted 1, is an eigenvector of L0 with eigenvalue zero, i.e., L0 1 = 0. For this reason, 1 is often called the trivial eigenvector of L0. 
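As a concrete illustration of this construction, the combinatorial Laplacian can be built directly from a weight matrix; a minimal numpy sketch (the weighted triangle below is an invented example, not from the paper):

```python
import numpy as np

def combinatorial_laplacian(W):
    """L0(u, v) = -w(u, v) for u != v, and d(u) - w(u, u) on the
    diagonal, where d(u) = sum_v w(u, v)."""
    d = W.sum(axis=1)
    return np.diag(d) - W

# Small example: a weighted triangle with one heavier edge.
W = np.array([[0.0, 1.0, 2.0],
              [1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0]])
L0 = combinatorial_laplacian(W)

# L0 is positive semidefinite, and the all-ones vector is its
# trivial eigenvector: L0 @ 1 = 0.
assert np.all(np.linalg.eigvalsh(L0) >= -1e-12)
assert np.allclose(L0 @ np.ones(3), 0.0)
```

The zero row sums are exactly the statement L0 1 = 0 above.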
Letting D be a diagonal matrix with D(u, u) = d(u), one can also define a normalized version of the Laplacian: L = D^{-1/2} L0 D^{-1/2}. Unless explicitly stated otherwise, when we refer to the Laplacian of a graph, we will mean the normalized Laplacian.\nIn many situations, e.g., to perform spectral graph partitioning, one is interested in computing the first nontrivial eigenvector of a Laplacian. Typically, this vector is computed “exactly” by calling a black-box solver; but it could also be approximated with an iteration-based method (such as the Power Method or Lanczos Method) or by running a random walk-based or diffusion-based method to the asymptotic state. These random walk-based or diffusion-based methods assign positive and negative “charge” to the nodes, and then they let the distribution of charge evolve according to dynamics derived from the graph structure. Three canonical evolution dynamics are the following:\n\nHeat Kernel. Here, the charge evolves according to the heat equation ∂Ht/∂t = -LHt. Thus, the vector of charges evolves as Ht = exp(-tL) = Σ_{k=0}^∞ ((-t)^k / k!) L^k, where t ≥ 0 is a time parameter, times an input seed distribution vector.\n\nPageRank. Here, the charge at a node evolves by either moving to a neighbor of the current node or teleporting to a random node. More formally, the vector of charges evolves as\n\nRγ = γ (I - (1 - γ) M)^{-1},   (1)\n\nwhere M is the natural random walk transition matrix associated with the graph and where γ ∈ (0, 1) is the so-called teleportation parameter, times an input seed vector.\n\nLazy Random Walk. 
Here, the charge either stays at the current node or moves to a neighbor. Thus, if M is the natural random walk transition matrix associated with the graph, then the vector of charges evolves as some power of Wα = αI + (1 - α)M, where α ∈ (0, 1) represents the “holding probability,” times an input seed vector.\n\nIn each of these cases, there is a parameter (t, γ, and the number of steps of the Lazy Random Walk) that controls the “aggressiveness” of the dynamics and thus how quickly the diffusive process equilibrates; and there is an input “seed” distribution vector. Thus, e.g., if one is interested in global spectral graph partitioning, then this seed vector could be a vector with entries drawn from {-1, +1} uniformly at random, while if one is interested in local spectral graph partitioning [9, 10, 11, 12], then this vector could be the indicator vector of a small “seed set” of nodes. See Appendix A of [8] for a brief discussion of local and global spectral partitioning in this context.\nMahoney and Orecchia showed that these three dynamics arise as solutions to SDPs of the form\n\nminimize_X Tr(LX) + (1/η) G(X)\nsubject to X ⪰ 0, Tr(X) = 1, XD^{1/2}1 = 0,   (2)\n\nwhere G is a penalty function (shown to be the generalized entropy, the log-determinant, and a certain matrix-p-norm, respectively [1]) and where η is a parameter related to the aggressiveness of the diffusive process [1]. Conversely, solutions to the regularized SDP of (2) for appropriate values of η can be computed exactly by running one of the above three diffusion-based procedures. Notably, when G = 0, the solution to the SDP of (2) is uu′, where u is the smallest nontrivial eigenvector of L. 
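To make the PageRank dynamics of Eqn. (1) concrete, the following numpy sketch forms Rγ on a small invented graph (a 4-cycle; the graph and the value of γ are illustrative assumptions, not from the paper) and checks that it agrees with the geometric series γ Σ_k (1-γ)^k M^k of a random walk with restarts:

```python
import numpy as np

# Natural random-walk transition matrix M = D^{-1} W for a 4-cycle
# (graph and gamma are invented, illustrative choices).
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
M = W / W.sum(axis=1, keepdims=True)

def pagerank_matrix(M, gamma):
    """R_gamma = gamma * (I - (1 - gamma) M)^{-1}, as in Eqn. (1)."""
    n = M.shape[0]
    return gamma * np.linalg.inv(np.eye(n) - (1 - gamma) * M)

gamma = 0.15
R = pagerank_matrix(M, gamma)

# R_gamma equals the geometric series gamma * sum_k (1 - gamma)^k M^k:
# at each step the walker teleports with prob. gamma, else moves along M.
series = sum(gamma * (1 - gamma) ** k * np.linalg.matrix_power(M, k)
             for k in range(200))
assert np.allclose(R, series, atol=1e-10)

# Applied to a seed distribution, it returns a probability distribution.
seed = np.array([1.0, 0.0, 0.0, 0.0])
p = seed @ R
assert np.isclose(p.sum(), 1.0)
```

On a large graph one would of course not form this inverse explicitly; iterating the teleporting walk approximates the same operator, which is the point of the diffusion-based procedures.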
More generally and in this precise sense, the Heat Kernel, PageRank, and Lazy Random Walk dynamics can be seen as “regularized” versions of spectral clustering and Laplacian eigenvector computation. Intuitively, the function G(·) is acting as a penalty function, in a manner analogous to the ℓ2 or ℓ1 penalty in Ridge regression or Lasso regression, and by running one of these three dynamics one is implicitly making assumptions about the form of G(·). In this paper, we provide a statistical framework to make that intuition precise.\n\n3 A statistical framework for regularized graph estimation\n\nHere, we will lay out a simple Bayesian framework for estimating a graph Laplacian. Importantly, this framework will allow for regularization by incorporating prior information.\n\n3.1 Analogy with regularized linear regression\n\nIt will be helpful to keep in mind the Bayesian interpretation of regularized linear regression. In that context, we observe n predictor-response pairs in Rp × R, denoted (x1, y1), . . . , (xn, yn); the goal is to find a vector β such that β′xi ≈ yi. Typically, we choose β by minimizing the residual sum of squares, i.e., F(β) = RSS(β) = Σi ‖yi - β′xi‖², or a penalized version of it. For Ridge regression, we minimize F(β) + λ‖β‖²; while for Lasso regression, we minimize F(β) + λ‖β‖1. The additional terms in the optimization criteria (i.e., λ‖β‖² and λ‖β‖1) are called penalty functions; and adding a penalty function to the optimization criterion can often be interpreted as incorporating prior information about β. For example, we can model y1, . . . , yn as independent random observations with distributions dependent on β. Specifically, we can suppose yi is a Gaussian random variable with mean β′xi and known variance σ². This induces a conditional density for the vector y = (y1, . . . , yn):\n\np(y | β) ∝ exp{-(1/2σ²) F(β)},   (3)\n\nwhere the constant of proportionality depends only on y and σ. Next, we can assume that β itself is random, drawn from a distribution with density p(β). This distribution is called a prior, since it encodes prior knowledge about β. Without loss of generality, the prior density can be assumed to take the form\n\np(β) ∝ exp{-U(β)}.   (4)\n\nSince the two random variables are dependent, upon observing y, we have information about β. This information is encoded in the posterior density, p(β | y), computed via Bayes' rule as\n\np(β | y) ∝ p(y | β) p(β) ∝ exp{-(1/2σ²) F(β) - U(β)}.   (5)\n\nThe MAP estimate of β is the value that maximizes p(β | y); equivalently, it is the value of β that minimizes -log p(β | y). In this framework, we can recover the solution to Ridge regression or Lasso regression by setting U(β) = (λ/2σ²)‖β‖² or U(β) = (λ/2σ²)‖β‖1, respectively. Thus, Ridge regression can be interpreted as imposing a Gaussian prior on β, and Lasso regression can be interpreted as imposing a double-exponential prior on β.\n\n3.2 Bayesian inference for the population Laplacian\n\nFor our problem, suppose that we have a connected graph with n nodes; or, equivalently, that we have L, the normalized Laplacian of that graph. We will view this observed graph Laplacian, L, as a “sample” Laplacian, i.e., as a random object whose distribution depends on a true “population” Laplacian, L. 
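The Ridge/MAP correspondence described here can be verified numerically. In this sketch (the simulated data and the values of λ and σ² are invented for illustration), the closed-form Ridge estimate makes the gradient of the negative log-posterior vanish, so it coincides with the MAP estimate under the Gaussian prior U(β) = (λ/2σ²)‖β‖²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, sigma2 = 50, 3, 2.0, 1.0
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=n)

# Ridge solution: argmin RSS(beta) + lam * ||beta||^2.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Negative log posterior under the Gaussian likelihood and the prior
# U(beta) = (lam / (2 sigma^2)) ||beta||^2, up to additive constants:
def neg_log_post(beta):
    rss = np.sum((y - X @ beta) ** 2)
    return rss / (2 * sigma2) + lam / (2 * sigma2) * np.sum(beta ** 2)

# Its gradient vanishes at the ridge solution, so the MAP estimate
# coincides with ridge regression.
grad = (-(X.T @ (y - X @ beta_ridge)) + lam * beta_ridge) / sigma2
assert np.allclose(grad, 0.0, atol=1e-8)
```

Since the objective is strictly convex, the stationary point is the global minimizer of -log p(β | y).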
As with the linear regression example, this induces a conditional density for L, to be denoted p(L | L). Next, we can assume prior information about the population Laplacian in the form of a prior density, p(L); and, given the observed Laplacian, we can estimate the population Laplacian by maximizing its posterior density, p(L | L).\nThus, to apply the Bayesian formalism, we need to specify the conditional density of L given L. In the context of linear regression, we assumed that the observations followed a Gaussian distribution. A graph Laplacian is not just a single observation; it is a positive semidefinite matrix with a very specific structure. Thus, we will take L to be a random object with expectation L, where L is another normalized graph Laplacian. Although, in general, L can be distinct from L, we will require that the nodes in the population and sample graphs have the same degrees. That is, if d = (d(1), . . . , d(n)) denotes the “degree vector” of the graph, and D = diag(d(1), . . . , d(n)), then we can define\n\nX = {X : X ⪰ 0, XD^{1/2}1 = 0, rank(X) = n - 1},   (6)\n\nin which case the population Laplacian and the sample Laplacian will both be members of X. To model L, we will choose a distribution for positive semi-definite matrices analogous to the Gaussian distribution: a scaled Wishart matrix with expectation L. Note that, although it captures the trait that L is positive semi-definite, this distribution does not accurately model every feature of L. For example, a scaled Wishart matrix does not necessarily have ones along its diagonal. However, the mode of the density is at L, a Laplacian; and for large values of the scale parameter, most of the mass will be on matrices close to L. 
Appendix B of [8] provides a more detailed heuristic justification for the use of the Wishart distribution.\nTo be more precise, let m ≥ n - 1 be a scale parameter, and suppose that L is distributed over X as a (1/m) Wishart(L, m) random variable. Then, E[L | L] = L, and L has conditional density\n\np(L | L) ∝ exp{-(m/2) Tr(LL+)} / |L|^{m/2},   (7)\n\nwhere |·| denotes pseudodeterminant (product of nonzero eigenvalues). The constant of proportionality depends only on L, d, m, and n; and we emphasize that the density is supported on X. Eqn. (7) is analogous to Eqn. (3) in the linear regression context, with 1/m, the inverse of the sample size parameter, playing the role of the variance parameter σ². Next, suppose we know that L is a random object drawn from a prior density p(L). Without loss of generality,\n\np(L) ∝ exp{-U(L)},   (8)\n\nfor some function U, supported on a subset X̄ ⊆ X. Eqn. (8) is analogous to Eqn. (4) from the linear regression example. Upon observing L, the posterior distribution for L is\n\np(L | L) ∝ p(L | L) p(L) ∝ exp{-(m/2) Tr(LL+) + (m/2) log |L+| - U(L)},   (9)\n\nwith support determined by X̄. Eqn. (9) is analogous to Eqn. (5) from the linear regression example. If we denote by ˆL the MAP estimate of L, then it follows that ˆL+ is the solution to the program\n\nminimize_X Tr(LX) + (2/m) U(X+) - log |X|\nsubject to X ∈ X̄ ⊆ X.   (10)\n\nNote the similarity with the Mahoney-Orecchia regularized SDP of (2). 
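A minimal sketch of the conditional model of Eqn. (7): draw L as a (1/m)·Wishart(L, m) matrix by averaging m outer products of N(0, L) vectors (the 4-cycle population Laplacian and the value of m are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Population Laplacian: normalized Laplacian of a 4-cycle (invented example).
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = W.sum(axis=1)
L_pop = np.eye(4) - W / np.sqrt(np.outer(d, d))

def sample_scaled_wishart(V, m, rng):
    """Draw a (1/m) * Wishart(V, m) matrix as the average of m outer
    products z z' with z ~ N(0, V); V may be rank-deficient, as a
    Laplacian is."""
    vals, vecs = np.linalg.eigh(V)
    vals = np.where(vals > 1e-12, vals, 0.0)    # drop the null direction
    A = vecs * np.sqrt(vals)                    # V = A A'
    Z = rng.normal(size=(m, V.shape[0])) @ A.T  # rows ~ N(0, V)
    return Z.T @ Z / m

m = 2000
L_obs = sample_scaled_wishart(L_pop, m, rng)

# E[L_obs | L_pop] = L_pop, so for large m the sample concentrates there,
# and the support condition L_obs D^{1/2} 1 = 0 holds by construction.
assert np.max(np.abs(L_obs - L_pop)) < 0.2
assert np.allclose(L_obs @ np.sqrt(d), 0.0, atol=1e-8)
```

As the text notes, such a draw is symmetric positive semi-definite but need not have ones on its diagonal.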
In particular, if X̄ = {X : Tr(X) = 1} ∩ X, then the two programs are identical except for the factor of log |X| in the optimization criterion.\n\n4 A prior related to the PageRank procedure\n\nHere, we will present a prior distribution for the population Laplacian that will allow us to leverage the estimation framework of Section 3; and we will show that the MAP estimate of L for this prior is related to the PageRank procedure via the Mahoney-Orecchia regularized SDP. Appendix C of [8] presents priors that lead to the Heat Kernel and Lazy Random Walk in an analogous way; in both of these cases, however, the priors are data-dependent in the strong sense that they explicitly depend on the number of data points.\n\n4.1 Prior density\n\nThe prior we will present will be based on neutrality and invariance conditions; and it will be supported on X, i.e., on the subset of positive-semidefinite matrices that was the support set for the conditional density defined in Eqn. (7). In particular, recall that, in addition to being positive semi-definite, every matrix in the support set has rank n - 1 and satisfies XD^{1/2}1 = 0. Note that because the prior depends on the data (via the orthogonality constraint induced by D), this is not a prior in the fully Bayesian sense; instead, the prior can be considered as part of an empirical or pseudo-Bayes estimation procedure.\nThe prior we will specify depends only on the eigenvalues of the normalized Laplacian, or equivalently on the eigenvalues of the pseudoinverse of the Laplacian. Let L+ = τOΛO′ be the spectral decomposition of the pseudoinverse of the normalized Laplacian L, where τ ≥ 0 is a scale factor, O ∈ R^{n×(n-1)} is an orthogonal matrix, and Λ = diag(λ(1), . . . , λ(n - 1)), where Σv λ(v) = 1. Note that the values λ(1), . . . , λ(n - 1) are unordered and that the vector λ = (λ(1), . . . , λ(n - 1)) lies in the unit simplex. If we require that the distribution for λ be exchangeable (invariant under permutations) and neutral (λ(v) independent of the vector (λ(u)/(1 - λ(v)) : u ≠ v), for all v), then the only non-degenerate possibility is that λ is Dirichlet-distributed with parameter vector (α, . . . , α) [13]. The parameter α, to which we refer as the “shape” parameter, must satisfy α > 0 for the density to be defined. In this case,\n\np(L) ∝ p(τ) Π_{v=1}^{n-1} λ(v)^{α-1},   (11)\n\nwhere p(τ) is a prior for τ. Thus, the prior weight on L only depends on τ and Λ. One implication is that the prior is “nearly” rotationally invariant, in the sense that p(P′LP) = p(L) for any rank-(n - 1) projection matrix P satisfying PD^{1/2}1 = 0.\n\n4.2 Posterior estimation and connection to PageRank\n\nTo analyze the MAP estimate associated with the prior of Eqn. (11) and to explain its connection with the PageRank dynamics, the following proposition is crucial.\nProposition 4.1. Suppose the conditional likelihood for L given L is as defined in (7) and the prior density for L is as defined in (11). Define ˆL to be the MAP estimate of L. Then, [Tr(ˆL+)]^{-1} ˆL+ solves the Mahoney-Orecchia regularized SDP (2), with G(X) = -log |X| and η as given in Eqn. (12) below.\nProof. For L in the support set of the posterior, define τ = Tr(L+) and Θ = τ^{-1}L+, so that Tr(Θ) = 1. Further, rank(Θ) = n - 1. Express the prior in the form of Eqn. 
(8) with function U given by\n\nU(L) = -log{p(τ)|Θ|^{α-1}} = -(α - 1) log |Θ| - log p(τ),\n\nwhere, as before, |·| denotes pseudodeterminant. Using (9) and the relation |L+| = τ^{n-1}|Θ|, the posterior density for L given L is\n\np(L | L) ∝ exp{-(mτ/2) Tr(LΘ) + ((m + 2(α - 1))/2) log |Θ| + g(τ)},\n\nwhere g(τ) = (m(n - 1)/2) log τ + log p(τ). Suppose ˆL maximizes the posterior likelihood. Define ˆτ = Tr(ˆL+) and ˆΘ = [ˆτ]^{-1} ˆL+. In this case, ˆΘ must minimize the quantity Tr(LˆΘ) - (1/η) log |ˆΘ|, where\n\nη = mˆτ / (m + 2(α - 1)).   (12)\n\nThus ˆΘ solves the regularized SDP (2) with G(X) = -log |X|.\nMahoney and Orecchia showed that the solution to (2) with G(X) = -log |X| is closely related to the PageRank matrix, Rγ, defined in Eqn. (1). By combining Proposition 4.1 with their result, we get that the MAP estimate of L satisfies ˆL+ ∝ D^{-1/2}RγD^{1/2}; conversely, Rγ ∝ D^{1/2}ˆL+D^{-1/2}. Thus, the PageRank operator of Eqn. (1) can be viewed as a degree-scaled regularized estimate of the pseudoinverse of the Laplacian. Moreover, prior assumptions about the spectrum of the graph Laplacian have direct implications on the optimal teleportation parameter. Specifically, Mahoney and Orecchia's Lemma 2 shows how η is related to the teleportation parameter γ, and Eqn. (12) shows how the optimal η is related to prior assumptions about the Laplacian.\n\n5 Empirical evaluation\n\nIn this section, we provide an empirical evaluation of the performance of the regularized Laplacian estimator, compared with the unregularized estimator. 
To do this, we need a ground truth population\nLaplacian L and a noisily-observed sample Laplacian L. Thus, in Section 5.1, we construct a family\nof distributions for L; importantly, this family will be able to represent both low-dimensional graphs\nand expander-like graphs.\nInterestingly, the prior of Eqn. (11) captures some of the qualitative\nfeatures of both of these types of graphs (as the shape parameter is varied). Then, in Section 5.2,\nwe describe a sampling procedure for L which, super\ufb01cially, has no relation to the scaled Wishart\nconditional density of Eqn. (7). Despite this model misspeci\ufb01cation, the regularized estimator \u02c6L\u03b7\noutperforms L for many choices of the regularization parameter \u03b7.\n5.1 Ground truth generation and prior evaluation\n\nThe ground truth graphs we generate are motivated by the Watts-Strogatz \u201csmall-world\u201d model [14].\nTo generate a ground truth population Laplacian, L\u2014equivalently, a population graph\u2014we start\nwith a two-dimensional lattice of width w and height h, and thus n = wh nodes. Points in the lattice\nare connected to their four nearest neighbors, making adjustments as necessary at the boundary. We\nthen perform s edge-swaps: for each swap, we choose two edges uniformly at random and then\nwe swap the endpoints. For example, if we sample edges i1 \u223c j1 and i2 \u223c j2, then we replace\nthese edges with i1 \u223c j2 and i2 \u223c j1. Thus, when s = 0, the graph is the original discretization\nof a low-dimensional space; and as s increases to in\ufb01nity, the graph becomes more and more like\na uniformly chosen 4-regular graph (which is an expander [15] and which bears similarities with\nan Erd\u02ddos-R\u00b4enyi random graph [16]). Indeed, each edge swap is a step of the Metropolis algorithm\ntoward a uniformly chosen random graph with a \ufb01xed degree sequence. 
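The lattice-plus-edge-swaps generation procedure can be sketched as follows. The text does not spell out how proposed swaps that would create self-loops or duplicate edges are handled, so this sketch simply re-draws them (an assumption); the lattice has no wraparound, consistent with µ = 2wh - w - h:

```python
import numpy as np

def grid_edges(w, h):
    """Edges of a w-by-h lattice, each point joined to its nearest
    neighbors (no wraparound; boundary nodes simply have fewer edges)."""
    node = lambda i, j: i * h + j
    edges = []
    for i in range(w):
        for j in range(h):
            if i + 1 < w:
                edges.append((node(i, j), node(i + 1, j)))
            if j + 1 < h:
                edges.append((node(i, j), node(i, j + 1)))
    return edges

def edge_swap(edges, s, rng):
    """Perform s degree-preserving swaps: pick edges i1~j1 and i2~j2 and
    replace them with i1~j2 and i2~j1.  Proposals creating self-loops or
    duplicate edges are re-drawn (an assumed policy)."""
    edges = [tuple(e) for e in edges]
    have = set(frozenset(e) for e in edges)
    done = 0
    while done < s:
        a, b = rng.choice(len(edges), size=2, replace=False)
        (i1, j1), (i2, j2) = edges[a], edges[b]
        e1, e2 = frozenset((i1, j2)), frozenset((i2, j1))
        if len(e1) < 2 or len(e2) < 2 or e1 in have or e2 in have or e1 == e2:
            continue
        have -= {frozenset(edges[a]), frozenset(edges[b])}
        have |= {e1, e2}
        edges[a], edges[b] = (i1, j2), (i2, j1)
        done += 1
    return edges

rng = np.random.default_rng(2)
w, h = 6, 7
edges = grid_edges(w, h)
assert len(edges) == 2 * w * h - w - h   # mu = 71 for w = 6, h = 7

deg_before = np.zeros(w * h)
for u, v in edges:
    deg_before[u] += 1
    deg_before[v] += 1
swapped = edge_swap(edges, 32, rng)
deg_after = np.zeros(w * h)
for u, v in swapped:
    deg_after[u] += 1
    deg_after[v] += 1
# Edge swaps preserve the degree sequence exactly.
assert np.array_equal(deg_before, deg_after)
```

Each accepted swap is one step of the Metropolis walk toward a uniform random graph with this fixed degree sequence, as described above.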
For the empirical evaluation presented here, h = 7 and w = 6; but the results are qualitatively similar for other values.\nFigure 1 compares the expected order statistics (sorted values) for the Dirichlet prior of Eqn. (11) with the expected eigenvalues of Θ = L+/Tr(L+) for the small-world model. In particular, in Figure 1(a), we show the behavior of the order statistics of a Dirichlet distribution on the (n - 1)-dimensional simplex with scalar shape parameter α, as a function of α. For each value of the shape α, we generated a random (n - 1)-dimensional Dirichlet vector, λ, with parameter vector (α, . . . , α); we computed the n - 1 order statistics of λ by sorting its components; and we repeated this procedure for 500 replicates and averaged the values. Figure 1(b) shows a corresponding plot for the ordered eigenvalues of Θ. For each value of s (normalized, here, by the number of edges µ, where µ = 2wh - w - h = 71), we generated the normalized Laplacian, L, corresponding to the random s-edge-swapped grid; we computed the n - 1 nonzero eigenvalues of Θ; and we performed 1000 replicates of this procedure and averaged the resulting eigenvalues.\nInterestingly, the behavior of the spectrum of the small-world model as the edge-swaps increase is qualitatively quite similar to the behavior of the Dirichlet prior order statistics as the shape parameter α increases. In particular, note that for small values of the shape parameter α the first few order-statistics are well-separated from the rest; and that as α increases, the order statistics become concentrated around 1/(n - 1). Similarly, when the edge-swap parameter s = 0, the top two eigenvalues (corresponding to the width-wise and height-wise coordinates on the grid) are well-separated from the bulk; as s increases, the top eigenvalues quickly merge into the bulk; and eventually, as s goes to infinity, the distribution becomes very close to that of a uniformly chosen 4-regular graph.\n\n(a) Dirichlet distribution order statistics.   (b) Spectrum of the inverse Laplacian.\n\nFigure 1: Analytical and empirical priors. 1(a) shows the Dirichlet distribution order statistics versus the shape parameter; and 1(b) shows the spectrum of Θ as a function of the rewiring parameter.\n\n5.2 Sampling procedure, estimation performance, and optimal regularization behavior\n\nFinally, we evaluate the estimation performance of a regularized estimator of the graph Laplacian and compare it with an unregularized estimate. To do so, we construct the population graph G and its Laplacian L, for a given value of s, as described in Section 5.1. Let µ be the number of edges in G. The sampling procedure used to generate the observed graph G and its Laplacian L is parameterized by the sample size m. (Note that this parameter is analogous to the Wishart scale parameter in Eqn. (7), but here we are sampling from a different distribution.) We randomly choose m edges with replacement from G; and we define the sample graph G and corresponding Laplacian L by setting the weight of i ∼ j equal to the number of times we sampled that edge. Note that the sample graph G over-counts some edges in G and misses others.\nWe then compute the regularized estimate ˆLη, up to a constant of proportionality, by solving (implicitly!) the Mahoney-Orecchia regularized SDP (2) with G(X) = -log |X|. We define the unregularized estimate ˆL to be equal to the observed Laplacian, L. Given a population Laplacian L, we define τ = τ(L) = Tr(L+) and Θ = Θ(L) = τ^{-1}L+. 
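A sketch of this sampling procedure, together with a regularized estimate formed through the PageRank identity of Section 4.2 (ˆL+ ∝ D^{-1/2}RγD^{1/2}, which for Eqn. (1)'s Rγ simplifies to γ(γI + (1 - γ)L)^{-1}) and a relative Frobenius-norm comparison against the unregularized estimate. The ring graph, sample size, and teleportation value γ are invented choices, and the mapping between γ and η (Mahoney and Orecchia's Lemma 2) is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)

# Population graph: a ring on n nodes (an invented stand-in for the
# edge-swapped grid; any connected graph illustrates the mechanics).
n = 12
ring = [(i, (i + 1) % n) for i in range(n)]
mu = len(ring)

def normalized_laplacian(Wmat):
    d = Wmat.sum(axis=1)
    return np.eye(len(d)) - Wmat / np.sqrt(np.outer(d, d))

W_pop = np.zeros((n, n))
for u, v in ring:
    W_pop[u, v] = W_pop[v, u] = 1.0
L_pop = normalized_laplacian(W_pop)

# Sampling procedure: draw m edges with replacement; the weight of i ~ j
# is the number of times that edge was drawn.  (Re-draw if some node ends
# up isolated, so that the normalized Laplacian exists -- an assumption.)
m = 5 * mu
while True:
    W_obs = np.zeros((n, n))
    for k in rng.integers(mu, size=m):
        u, v = ring[k]
        W_obs[u, v] += 1.0
        W_obs[v, u] += 1.0
    if np.all(W_obs.sum(axis=1) > 0):
        break
L_obs = normalized_laplacian(W_obs)

# Theta = L+ / Tr(L+) for population and unregularized estimates.
def theta(L):
    Lp = np.linalg.pinv(L, hermitian=True)
    return Lp / np.trace(Lp)

Theta_pop, Theta_unreg = theta(L_pop), theta(L_obs)

# Regularized estimate via D^{-1/2} R_gamma D^{1/2} = gamma (gamma I +
# (1 - gamma) L)^{-1}, with the trivial direction D^{1/2}1 projected out.
gamma = 0.3
K = gamma * np.linalg.inv(gamma * np.eye(n) + (1 - gamma) * L_obs)
q = np.sqrt(W_obs.sum(axis=1))
q /= np.linalg.norm(q)
P = np.eye(n) - np.outer(q, q)
Theta_reg = P @ K @ P
Theta_reg /= np.trace(Theta_reg)

# Relative Frobenius error of the regularized vs unregularized estimate.
err_reg = np.linalg.norm(Theta_pop - Theta_reg)
err_unreg = np.linalg.norm(Theta_pop - Theta_unreg)
ratio = err_reg / err_unreg
assert abs(np.trace(Theta_reg) - 1.0) < 1e-8
assert np.isfinite(ratio) and ratio > 0
```

Whether the ratio falls below one depends on the noise level and on γ, which is exactly the sweet-spot behavior the experiments below explore.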
We define ˆτη, ˆτ, ˆΘη, and ˆΘ similarly to the population quantities. Our performance criterion is the relative Frobenius error ‖Θ - ˆΘη‖F / ‖Θ - ˆΘ‖F, where ‖·‖F denotes the Frobenius norm (‖A‖F = [Tr(A′A)]^{1/2}). Appendix D of [8] presents similar results when the performance criterion is the relative spectral norm error.\nFigures 2(a), 2(b), and 2(c) show the regularization performance when s = 4 (an intermediate value) for three different values of m/µ. In each case, the mean error and one standard deviation around it are plotted as a function of η/¯τ, as computed from 100 replicates; here, ¯τ is the mean value of τ over all replicates. The implicit regularization clearly improves the performance of the estimator for a large range of η values. (Note that the regularization parameter in the regularized SDP (2) is 1/η, and thus smaller values along the X-axis correspond to stronger regularization.) In particular, when the data are very noisy, e.g., when m/µ = 0.2, as in Figure 2(a), improved results are seen only for very strong regularization; for intermediate levels of noise, e.g., m/µ = 1.0, as in Figure 2(b) (in which case m is chosen such that G and G have the same number of edges counting multiplicity), improved performance is seen for a wide range of values of η; and for low levels of noise, Figure 2(c) illustrates that improved results are obtained for moderate levels of implicit regularization. 
Figures 2(d) and 2(e) illustrate similar results for s = 0 and s = 32.

(a) m/µ = 0.2 and s = 4. (b) m/µ = 1.0 and s = 4. (c) m/µ = 2.0 and s = 4. (d) m/µ = 2.0 and s = 0. (e) m/µ = 2.0 and s = 32. (f) Optimal η∗/τ̄.

Figure 2: Regularization performance. 2(a) through 2(e) plot the relative Frobenius norm error versus the (normalized) regularization parameter η/τ̄.
Shown are plots for various values of the (normalized) number of edges, m/µ, and the edge-swap parameter, s. Recall that the regularization parameter in the regularized SDP (2) is 1/η, and thus smaller values along the X-axis correspond to stronger regularization. 2(f) plots the optimal regularization parameter η∗/τ̄ as a function of sample proportion for different fractions of edge swaps.

As when regularization is implemented explicitly, in all these cases we observe a "sweet spot" where there is an optimal value for the implicit regularization parameter. Figure 2(f) illustrates how the optimal choice of η depends on parameters defining the population and sample Laplacians. In particular, it illustrates how η∗, the optimal value of η (normalized by τ̄), depends on the sampling proportion m/µ and the swaps per edge s/µ. Observe that as the sample size m increases, η∗ converges monotonically to τ̄; and, further, that higher values of s (corresponding to more expander-like graphs) correspond to higher values of η∗. Both of these observations are in direct agreement with Eqn. (12).

6 Conclusion

We have provided a statistical interpretation for the observation that popular diffusion-based procedures to compute a quick approximation to the first nontrivial eigenvector of a data graph Laplacian exactly solve a certain regularized version of the problem. One might be tempted to view our results as "unfortunate," in that it is not straightforward to interpret the priors presented in this paper. Instead, our results should be viewed as making explicit the implicit prior assumptions associated with making certain decisions (that are already made in practice) to speed up computations.

Several extensions suggest themselves.
The most obvious might be to try to obtain Proposition 4.1 with a more natural or empirically-plausible model than the Wishart distribution; to extend the empirical evaluation to much larger and more realistic data sets; to apply our methodology to other widely-used approximation procedures; and to characterize when implicitly regularizing an eigenvector leads to better statistical behavior in downstream applications where that eigenvector is used. More generally, though, we expect that understanding the algorithmic-statistical tradeoffs that we have illustrated will become increasingly important in very large-scale data analysis applications.

References

[1] M. W. Mahoney and L. Orecchia. Implementing regularization implicitly via approximate eigenvector computation. In Proceedings of the 28th International Conference on Machine Learning, pages 121–128, 2011.

[2] D.A. Spielman and S.-H. Teng.
Spectral partitioning works: Planar graphs and finite element meshes. In FOCS '96: Proceedings of the 37th Annual IEEE Symposium on Foundations of Computer Science, pages 96–107, 1996.

[3] S. Guattery and G.L. Miller. On the quality of spectral separators. SIAM Journal on Matrix Analysis and Applications, 19:701–719, 1998.

[4] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[5] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.

[6] T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the 20th International Conference on Machine Learning, pages 290–297, 2003.

[7] J. Leskovec, K.J. Lang, A. Dasgupta, and M.W. Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009. Also available at: arXiv:0810.1355.

[8] P. O. Perry and M. W. Mahoney. Regularized Laplacian estimation and fast eigenvector approximation. Technical report. Preprint: arXiv:1110.1757 (2011).

[9] D.A. Spielman and S.-H. Teng. Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In STOC '04: Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 81–90, 2004.

[10] R. Andersen, F.R.K. Chung, and K. Lang. Local graph partitioning using PageRank vectors. In FOCS '06: Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486, 2006.

[11] F.R.K. Chung. The heat kernel as the pagerank of a graph. Proceedings of the National Academy of Sciences of the United States of America, 104(50):19735–19740, 2007.

[12] M. W. Mahoney, L. Orecchia, and N. K. Vishnoi.
A spectral algorithm for improving graph partitions with applications to exploring data graphs locally. Technical report. Preprint: arXiv:0912.0681 (2009).

[13] J. Fabius. Two characterizations of the Dirichlet distribution. The Annals of Statistics, 1(3):583–587, 1973.

[14] D.J. Watts and S.H. Strogatz. Collective dynamics of small-world networks. Nature, 393:440–442, 1998.

[15] S. Hoory, N. Linial, and A. Wigderson. Expander graphs and their applications. Bulletin of the American Mathematical Society, 43:439–561, 2006.

[16] B. Bollobás. Random Graphs. Academic Press, London, 1985.