{"title": "Spectral Norm Regularization of Orthonormal Representations for Graph Transduction", "book": "Advances in Neural Information Processing Systems", "page_first": 2215, "page_last": 2223, "abstract": "Recent literature~\\cite{ando} suggests that embedding a graph on a unit sphere leads to better generalization for graph transduction. However, the choice of the optimal embedding and an efficient algorithm to compute it remain open. In this paper, we show that orthonormal representations, a class of unit-sphere graph embeddings, are PAC learnable. Existing PAC-based analyses do not apply, as the VC dimension of the function class is infinite. We propose an alternative PAC-based bound, which does not depend on the VC dimension of the underlying function class but is related to the famous Lov\\'{a}sz~$\\vartheta$ function. The main contribution of the paper is SPORE, a SPectral regularized ORthonormal Embedding for graph transduction, derived from the PAC bound. SPORE is posed as the minimization of a non-smooth convex function over an \\emph{elliptope}. Such problems are usually solved as semi-definite programs (SDPs) with time complexity $O(n^6)$. We present Infeasible Inexact Proximal~(IIP): an inexact proximal method that performs a subgradient procedure on an approximate projection, which is not necessarily feasible. IIP is more scalable than SDP, has an $O(\\frac{1}{\\sqrt{T}})$ convergence rate, and is applicable whenever a suitable approximate projection is available. We use IIP to compute SPORE, where the approximate projection step is computed by FISTA, an accelerated gradient descent procedure. We show that the method has a convergence rate of $O(\\frac{1}{\\sqrt{T}})$. The proposed algorithm easily scales to thousands of vertices, while standard SDP computation does not scale beyond a few hundred vertices. 
Furthermore, the analysis presented here easily extends to the multiple-graph setting.", "full_text": "Spectral Norm Regularization of Orthonormal Representations for Graph Transduction

Rakesh Shivanna
Google Inc., Mountain View, CA, USA
rakeshshivanna@google.com

Bibaswan Chatterjee
Dept. of Computer Science & Automation, Indian Institute of Science, Bangalore
bibaswan.chatterjee@csa.iisc.ernet.in

Raman Sankaran, Chiranjib Bhattacharyya
Dept. of Computer Science & Automation, Indian Institute of Science, Bangalore
ramans,chiru@csa.iisc.ernet.in

Francis Bach
INRIA - Sierra Project-team, École Normale Supérieure, Paris, France
francis.bach@ens.fr

Abstract

Recent literature [1] suggests that embedding a graph on a unit sphere leads to better generalization for graph transduction. However, the choice of the optimal embedding and an efficient algorithm to compute it remain open. In this paper, we show that orthonormal representations, a class of unit-sphere graph embeddings, are PAC learnable. Existing PAC-based analyses do not apply, as the VC dimension of the function class is infinite. We propose an alternative PAC-based bound, which does not depend on the VC dimension of the underlying function class but is related to the famous Lovász $\vartheta$ function. The main contribution of the paper is SPORE, a SPectral regularized ORthonormal Embedding for graph transduction, derived from the PAC bound. SPORE is posed as the minimization of a non-smooth convex function over an elliptope. Such problems are usually solved as semi-definite programs (SDPs) with time complexity $O(n^6)$. We present Infeasible Inexact Proximal (IIP): an inexact proximal method that performs a subgradient procedure on an approximate projection, which is not necessarily feasible. 
IIP is more scalable than SDP, has an $O(\frac{1}{\sqrt{T}})$ convergence rate, and is applicable whenever a suitable approximate projection is available. We use IIP to compute SPORE, where the approximate projection step is computed by FISTA, an accelerated gradient descent procedure. We show that the method has a convergence rate of $O(\frac{1}{\sqrt{T}})$. The proposed algorithm easily scales to thousands of vertices, while standard SDP computation does not scale beyond a few hundred vertices. Furthermore, the analysis presented here easily extends to the multiple-graph setting.

1 Introduction

Learning problems on graph-structured data have received significant attention in recent years [11, 17, 20]. We study an instance of graph transduction, the problem of learning labels on vertices of simple graphs¹. A typical example is webpage classification [20], where a very small part of the entire web is manually classified. Even for simple graphs, predicting binary labels of the unlabeled vertices is NP-complete [6].

More formally: let $G = (V, E)$, $V = [n]$ be a simple graph with unknown labels $\mathbf{y} \in \{\pm 1\}^n$. Without loss of generality, let the labels of the first $m \in [n]$ vertices be observable, and let $u := n - m$.

¹A simple graph is an unweighted, undirected graph with no self loops or multiple edges.

Let $\mathbf{y}_S$ and $\mathbf{y}_{\bar S}$ be the labels of $S = [m]$ and $\bar S = V \setminus S$. Given $G$ and $\mathbf{y}_S$, the goal is to learn soft predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ such that $er^{\ell}_{\bar S}[\hat{\mathbf{y}}] := \frac{1}{|\bar S|} \sum_{j \in \bar S} \ell(y_j, \hat y_j)$ is small, where $\ell$ is any loss function. The following formulation has been extensively used [19, 20]:

$$\min_{\hat{\mathbf{y}} \in \mathbb{R}^n} \; er^{\ell}_S[\hat{\mathbf{y}}] + \lambda \hat{\mathbf{y}}^\top K^{-1} \hat{\mathbf{y}}, \qquad (1)$$

where $K$ is a graph-dependent kernel and $\lambda > 0$ is a regularization constant. Let $\hat{\mathbf{y}}^*$ be the solution to (1), given $G$ and $S \subseteq V$, $|S| = m$. 
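For concreteness, with the squared loss $\ell(y, \hat y) = (y - \hat y)^2$, formulation (1) admits a closed-form solution. The following sketch is our own illustration, not the paper's implementation (the paper treats general losses $\ell$, and the function name `transduce` is ours); setting the gradient to zero gives $(J + m\lambda K^{-1})\hat{\mathbf{y}} = J\mathbf{y}$, where $J$ is the 0/1 diagonal selector of the labelled set $S$:

```python
import numpy as np

def transduce(K, y, S, lam):
    """Soft predictions for formulation (1) with squared loss.

    Minimizes (1/m) * sum_{i in S} (y_i - yhat_i)^2 + lam * yhat' K^{-1} yhat,
    whose stationarity condition is (J + m*lam*K^{-1}) yhat = J y,
    with J the 0/1 diagonal selector of the labelled set S.
    """
    n, m = K.shape[0], len(S)
    J = np.zeros((n, n))
    J[S, S] = 1.0  # select labelled vertices
    return np.linalg.solve(J + m * lam * np.linalg.inv(K), J @ y)
```

With $K = I$ (a kernel that ignores the graph), the labelled predictions shrink toward the labels, $\hat y_i = y_i / (1 + m\lambda)$ for $i \in S$, and the unlabelled ones stay at zero, which shows why a graph-dependent $K$ is essential.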
[1] proposed the following generalization bound:

$$\mathbb{E}_{S \subseteq V}\big[ er^{\ell}_{\bar S}[\hat{\mathbf{y}}^*] \big] \le c_1 \inf_{\hat{\mathbf{y}} \in \mathbb{R}^n} \Big[ er^{\ell}_V[\hat{\mathbf{y}}] + \lambda \hat{\mathbf{y}}^\top K^{-1} \hat{\mathbf{y}} \Big] + c_2 \left( \frac{tr_p(K)}{\lambda |S|} \right)^p, \qquad (2)$$

where $c_1, c_2$ depend on $\ell$ and $tr_p(K) = \big( \frac{1}{n} \sum_{i \in [n]} K_{ii}^p \big)^{1/p}$, $p > 0$. [1] argued that $tr_p(K)$ should be a constant, which can be enforced by normalizing the diagonal entries of $K$ to be $1$. This is important advice in graph transduction; however, the set of normalized kernels is quite large, and (2) gives little insight into choosing the optimal kernel.

Normalizing the diagonal entries of $K$ can be viewed geometrically as embedding the graph on a unit sphere. Recently, [16] studied a rich class of unit-sphere graph embeddings, called orthonormal representations [13], and found them to be statistically consistent for graph transduction. However, the choice of the optimal orthonormal embedding is not clear. We study orthonormal representations for the following equivalent [19] kernel learning formulation of (1), with $C = \frac{1}{\lambda m}$,

$$\omega_C(K, \mathbf{y}_S) = \max_{\alpha \in \mathbb{R}^n} \; \sum_{i \in S} \alpha_i - \frac{1}{2} \sum_{i,j \in S} \alpha_i \alpha_j y_i y_j K_{ij} \quad \text{s.t. } 0 \le \alpha_i \le C \;\forall i \in S, \;\; \alpha_j = 0 \;\forall j \notin S, \qquad (3)$$

from a probably approximately correct (PAC) learning point of view. Note that the final predictions are given by $\hat y_i = \sum_{j \in S} K_{ij} \alpha^*_j y_j \;\forall i \in [n]$, where $\alpha^*$ is the optimal solution to (3).

Contributions. 
We make the following contributions:

– Using (3), we show that the class of orthonormal representations is efficiently PAC learnable over a large class of graph families, including power-law and random graphs.

– The above analysis suggests that spectral norm regularization could be beneficial in computing the best embedding. To this end we pose the problem of SPectral norm regularized ORthonormal Embedding (SPORE) for graph transduction, namely that of minimizing a convex function over an elliptope. One could solve such problems as SDPs, which unfortunately do not scale well beyond a few hundred vertices.

– We propose an infeasible inexact proximal (IIP) method, a novel projected subgradient descent algorithm in which the projection is approximated by an inexact proximal method. We suggest a novel approximation criterion which approximates the proximal operator for the support function of the feasible set within a given precision. One can compute an approximation to the projection from the inexact proximal point, which may not be feasible, hence the name IIP. We prove that IIP converges to the optimal minimum of a non-smooth convex function at rate $O(1/\sqrt{T})$ in $T$ iterations.

– The IIP algorithm is then applied to the case where the set of interest is the intersection of two convex sets. The proximal operator for the support function of the set of interest can be obtained using the FISTA algorithm, once we know the proximal operators for the support functions of the individual sets involved.

– Our analysis paves the way for learning labels on multiple graphs by adopting an MKL-style approach on the embeddings. We present both algorithmic and generalization results.

Notations. Let $\|\cdot\|$ and $\|\cdot\|_F$ denote the Euclidean and Frobenius norms respectively. Let $S_n$ and $S^+_n$ denote the sets of $n \times n$ square symmetric and square symmetric positive semi-definite matrices respectively. Let $\mathbb{R}^n_+$ be the non-negative orthant, and let $S^{n-1} = \{ \mathbf{u} \in \mathbb{R}^n_+ \mid \|\mathbf{u}\|_1 = 1 \}$ denote the $(n-1)$-dimensional simplex. Let $[n] := \{1, \ldots, n\}$. For any $M \in S_n$, let $\lambda_1(M) \ge \ldots \ge \lambda_n(M)$ denote its eigenvalues. We denote the adjacency matrix of a graph $G$ by $A$. Let $\bar G$ denote the complement graph of $G$, with adjacency matrix $\bar A = \mathbf{1}\mathbf{1}^\top - I - A$, where $\mathbf{1}$ is the vector of all $1$'s and $I$ is the identity matrix. Let $Y = \{\pm 1\}$ and $\hat Y = \mathbb{R}$ be the label and soft-prediction spaces over $V$. Given $y \in Y$ and $\hat y \in \hat Y$, we use $\ell_{0\text{-}1}(y, \hat y) = \mathbf{1}[y \hat y < 0]$ and $\ell_{hng}(y, \hat y) = (1 - y \hat y)_+$² to denote the 0-1 and hinge losses respectively. The notations $O, o, \Omega, \Theta$ denote standard measures in asymptotic analysis [4].

Related work. The analysis of [1] was restricted to Laplacian matrices, and does not give insight into choosing the optimal unit-sphere embedding. [2] studied graph transduction using the PAC model; however, for graph orthonormal embeddings there is no known sample complexity estimate. [16] showed that working with orthonormal embeddings leads to consistency. However, the choice of the optimal embedding and an efficient algorithm to compute it remain open issues. Furthermore, we show that the sample complexity estimate of [16] is sub-optimal.

Preliminaries. An orthonormal embedding [13] of a simple graph $G = (V, E)$, $V = [n]$, is defined by a matrix $U = [\mathbf{u}_1, \ldots, \mathbf{u}_n] \in \mathbb{R}^{d \times n}$ such that $\mathbf{u}_i^\top \mathbf{u}_j = 0$ whenever $(i, j) \notin E$ and $\|\mathbf{u}_i\| = 1 \;\forall i \in [n]$. Let $Lab(G)$ denote the set of all possible orthonormal embeddings of the graph $G$: $Lab(G) := \{ U \mid U \text{ is an orthonormal embedding} \}$. Recently, [8] showed an interesting 
Recently, [8] showed an interesting\n\nconnection to the set of graph kernel matrices\n\nK(G) :=(cid:8)K \u2208 S +\n\nn | Kii = 1,\u2200i \u2208 [n]; Kij = 0,\u2200(i, j) /\u2208 E(cid:9).\n\nNote that K \u2208 K(G) is positive semide\ufb01nite, and hence there exists U \u2208 Rd\u00d7n such that K =\nU(cid:62)U. Note that Kij = u(cid:62)\ni uj where ui is the i-th column of U. Hence by inspection it is clear\nthat U \u2208 Lab(G). Using a similar argument, we can show that for any U \u2208 Lab(G), the matrix\nK = U(cid:62)U \u2208 K(G). Thus, the two sets, Lab(G) and K(G) are equivalent.\nFurthermore, orthonormal embeddings are associated with an interesting quantity, the Lov\u00b4asz \u03d1\nfunction [13, 7]. However, computing \u03d1 requires solving an SDP, which is impractical.\n\n2 Generalization Bound for Graph Transduction using Orthonormal\n\nEmbeddings\n\nIn this section we derive a generalization bound, used in the sequel for PAC analysis. We derive the\nfollowing error bound, valid for any orthonormal embedding (supplementary material, Section B).\nTheorem 1 (Generalization bound). Let G = (V, E) be a simple graph with unknown binary labels\ny \u2208 Y n on the vertices V . Let K \u2208 K(G). Given G, and labels of a randomly drawn subgraph\n\u2265 1 \u2212 \u03b4 over the choice of S \u2282 V , such that |S| = m\n\nS, let \u02c6y \u2208 (cid:98)Y n be the predictions learnt by \u03c9C(K, yS) in (3). Then, for m \u2264 n/2, with probability\n\n(cid:96)hng(yi, \u02c6yi) + 2C(cid:112)2\u03bb1(K) + O\n\n(cid:16)(cid:114) 1\n\n(cid:17)\n\nlog\n\n1\n\u03b4\n\nm\n\n.\n\n(4)\n\n(cid:88)\n\ni\u2208S\n\n\u00afS [\u02c6y] \u2264 1\ner0-1\n\nm\n\nNote that the above is a high-probability bound, in comparison to the expected analysis in (2). Also,\nthe above result suggests that graph embeddings with low spectral norm and empirical error lead to\nbetter generalization. 
The analysis of [1] in (2) suggests embedding a graph on a unit sphere, but it does not help choose the optimal embedding for graph transduction. Exploiting our analysis in (4), we present a spectral norm regularized algorithm in Section 3.

We would also like to study PAC learnability of orthonormal embeddings, defined as follows: given $G$ and $\mathbf{y}$, does there exist an $\tilde m < n$ such that, with probability $\ge 1 - \delta$ over $S \subset V$, $|S| \ge \tilde m$, the generalization error satisfies $er^{0\text{-}1}_{\bar S} \le \epsilon$? The quantity $\tilde m$ is termed the labelled sample complexity [2]. Existing analysis [2] does not apply to orthonormal embeddings, as discussed in the related work (Section 1). Theorem 1 allows us to derive improved statistical estimates (Section 3).

3 SPORE Formulation and PAC Analysis

Theorem 1 suggests that penalizing the spectral norm of $K$ would lead to better generalization. To this end we propose the following formulation:

$$\Psi_{C,\beta}(G, \mathbf{y}_S) = \min_{K \in K(G)} g(K), \qquad \text{where} \qquad g(K) = \omega_C(K, \mathbf{y}_S) + \beta \lambda_1(K). \qquad (5)$$

²$(a)_+ = \max(a, 0) \;\forall a \in \mathbb{R}$

(5) gives an optimal orthonormal embedding, the optimal $K$, which we refer to as SPORE. In this section we first study the PAC learnability of SPORE and derive a labelled sample complexity estimate. Next, we study efficient computation of SPORE. Though SPORE can be posed as an SDP, we show in Section 4 that it is possible to exploit its structure and solve it efficiently.

Given $G$ and $\mathbf{y}_S$, the function $\omega_C(K, \mathbf{y}_S)$ is convex in $K$, as it is a maximum of affine functions of $K$. The spectral norm $\lambda_1(K)$ is also convex, and hence $g(K)$ is a convex function. Furthermore, $K(G)$ is an elliptope [5], a convex body described by the intersection of a positive semi-definite constraint and affine constraints. It follows that (5) is convex. 
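To make the feasible set $K(G)$ concrete, here is a small sketch of our own (not from the paper; the function names are ours): membership requires a unit diagonal, zeros on non-edges, and positive semi-definiteness, and any feasible $K$ yields an orthonormal embedding $U$ with $K = U^\top U$ via an eigendecomposition.

```python
import numpy as np

def is_orthonormal_embedding(U, edges, n, tol=1e-8):
    """Check the Lab(G) conditions: unit columns, u_i' u_j = 0 for non-edges."""
    E = set(map(tuple, edges)) | {(j, i) for i, j in edges}
    K = U.T @ U
    unit = np.allclose(np.diag(K), 1.0, atol=tol)
    orth = all(abs(K[i, j]) <= tol
               for i in range(n) for j in range(n)
               if i != j and (i, j) not in E)
    return unit and orth

def embedding_from_kernel(K):
    """Recover U with K = U'U via eigendecomposition; column i embeds vertex i."""
    w, V = np.linalg.eigh(K)
    return np.sqrt(np.clip(w, 0.0, None))[:, None] * V.T
```

Note that the identity matrix is always in $K(G)$, for any graph, which is why the elliptope is never empty.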
Such formulations are usually posed as SDPs, which do not scale beyond a few hundred vertices. In Section 4 we derive an efficient first-order method which can solve for thousands of vertices. Let $K^*$ be the optimal embedding computed from (5). Note that once the kernel is fixed, the predictions depend only on $\omega_C(K^*, \mathbf{y}_S)$. Let $\alpha^*$ be the solution to $\omega_C(K^*, \mathbf{y}_S)$ as in (3); then the final predictions of (5) are given by $\hat y_i = \sum_{j \in S} K^*_{ij} \alpha^*_j y_j \;\forall i \in [n]$.

At this point, we derive an interesting graph-dependent error convergence rate. We gather two important results, whose proofs appear in the supplementary material, Section C.

Lemma 2. Given a simple graph $G = (V, E)$, $\max_{K \in K(G)} \lambda_1(K) = \vartheta(\bar G)$.

Lemma 3. Given $G$ and $\mathbf{y}$, for any $S \subseteq V$ and $C > 0$, $\min_{K \in K(G)} \omega_C(K, \mathbf{y}_S) \le \vartheta(G)/2$.

In the standard PAC setting, there is a complete disconnection between the data distribution and the target hypothesis. However, in the presence of unlabeled nodes, and without any assumption on the data, it is impossible to learn labels. Following existing literature [1, 9], we work with similarity graphs, where the presence of an edge means two nodes are similar, and derive the following (supplementary material, Section C).

Theorem 4. Let $G = (V, E)$, $V = [n]$ be a simple graph with unknown binary labels $\mathbf{y} \in Y^n$ on the vertices $V$. Given $G$ and the labels of a randomly drawn subgraph $S \subset V$, $m = |S|$, let $\hat{\mathbf{y}}$ be the predictions learnt by SPORE (5), for parameters $C = \big( \vartheta(G) \big/ \big( 8m\sqrt{2\vartheta(\bar G)} \big) \big)^{1/2}$ and $\beta = \vartheta(G)/\big(4\vartheta(\bar G)\big)$. 
Then, for $m \le n/2$, with probability $\ge 1 - \delta$ over the choice of $S \subset V$ such that $|S| = m$,

$$er^{0\text{-}1}_{\bar S}[\hat{\mathbf{y}}] = O\Big( \big( \tfrac{1}{m} \big( \sqrt{n \vartheta(G)} + \log \tfrac{1}{\delta} \big) \big)^{1/2} \Big). \qquad (6)$$

Proof. (Sketch) Let $K^*$ be the kernel learnt by SPORE (5). Using Theorem 1 and Lemma 2 for $\hat{\mathbf{y}}$,

$$er^{0\text{-}1}_{\bar S}[\hat{\mathbf{y}}] \le \frac{1}{m} \sum_{i \in S} \ell_{hng}(y_i, \hat y_i) + 2C\sqrt{2\vartheta(\bar G)} + O\Big( \sqrt{\tfrac{1}{m} \log \tfrac{1}{\delta}} \Big). \qquad (7)$$

From the primal formulation of (3), using Lemmas 2 and 3, we get

$$C \sum_{i \in S} \ell_{hng}(y_i, \hat y_i) \le \omega_C(K^*, \mathbf{y}_S) \le \Psi_{C,\beta}(G, \mathbf{y}_S) \le \frac{\vartheta(G)}{2} + \beta \vartheta(\bar G).$$

Plugging back into (7), choosing $\beta$ such that $\frac{\beta}{Cm} \vartheta(\bar G) = 2C\sqrt{2\vartheta(\bar G)}$, and optimizing over $C$ gives the choice of parameters as stated. Finally, using $\vartheta(G)\vartheta(\bar G) = n$ [13] proves the result.

In the theorem above, $\bar G$ is the complement graph of $G$. The optimal orthonormal embedding $K^*$ tends to embed vertices to nearby regions if they have connecting edges; hence, the notion of similarity is implicitly captured in the embedding. From (6), for fixed $n$ and $m$, note that the error converges at a faster rate for a dense graph ($\vartheta$ is small) than for a sparse graph ($\vartheta$ is large). Such connections to graph structural properties were previously unavailable [1].

We also estimate the labelled sample complexity by bounding (6) by $\epsilon > 0$, to obtain $\tilde m = \Omega\big( \frac{1}{\epsilon^2} \big( \sqrt{\vartheta(G)\, n} + \log \frac{1}{\delta} \big) \big)$. 
This connection supports the intuition that for a sparse graph one needs a larger number of labelled vertices than for a dense graph. For constants $\epsilon, \delta$, we obtain a fractional labelled sample complexity estimate of $\tilde m / n = \Omega\big( (\vartheta/n)^{1/2} \big)$, a significant improvement over the recently proposed $\Omega\big( (\vartheta/n)^{1/3} \big)$ [16]. The use of the stronger machinery of Rademacher averages (supplementary material, Section C), instead of VC dimension [2], and the specialization to SPORE allow us to improve over existing analyses [1, 16]. The proposed sample complexity estimate is interesting for $\vartheta = o(n)$; examples of such graphs include random graphs ($\vartheta(G(n, p)) = \Theta(\sqrt{n})$) and power-law graphs ($\bar\vartheta = O(\sqrt{n})$).

4 Inexact Proximal Methods for SPORE

In this section, we propose an efficient algorithm to solve SPORE (see (5)). The optimization problem SPORE can be posed as an SDP. Generic SDP solvers have a runtime complexity of $O(n^6)$ and often do not scale well to large graphs. We study first-order methods, such as projected subgradient procedures, as an alternative to SDPs for minimizing $g(K)$. The main computational challenge in developing such procedures is that it is difficult to compute the projection onto the elliptope. One could potentially use the seminal Dykstra's algorithm [3] for finding a feasible point in the intersection of two convex sets. However, that algorithm finds a point in the intersection only asymptotically, a serious disadvantage when it is used as a projection sub-routine. 
It would be useful to have an algorithm which, after a finite number of iterations, yields an approximate projection, such that a subsequent descent procedure still converges. Motivated by SPORE, we study the problem of minimizing non-smooth convex functions where the projection onto the feasible set can be computed only approximately. Recently there has been increasing interest in studying inexact proximal methods [15, 18]. In the sequel we design an inexact proximal method which yields an $O(1/\sqrt{T})$ algorithm to solve (5). The algorithm is based on approximating the prox function by an iterative procedure which satisfies a suitably designed criterion.

4.1 An Infeasible Inexact Proximal (IIP) algorithm

Let $f$ be a convex function with a properly defined sub-differential $\partial f(\mathbf{x})$ at every $\mathbf{x} \in \mathcal{X}$. Consider the following optimization problem:

$$\min_{\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^d} f(\mathbf{x}). \qquad (8)$$

A subgradient projection iteration of the form

$$\mathbf{x}_{k+1} = P_{\mathcal{X}}(\mathbf{x}_k - \alpha_k \mathbf{h}_k), \qquad \mathbf{h}_k \in \partial f(\mathbf{x}_k), \qquad (9)$$

is often used to arrive at an $\epsilon$-accurate solution by running the iterations $O(\frac{1}{\epsilon^2})$ times, where $P_{\mathcal{X}}(\mathbf{v}) = \mathrm{argmin}_{\mathbf{x} \in \mathcal{X}} \frac{1}{2}\|\mathbf{v} - \mathbf{x}\|^2$ is the projection of $\mathbf{v} \in \mathbb{R}^d$ onto $\mathcal{X} \subset \mathbb{R}^d$. In many situations, such as $\mathcal{X} = K(G)$, it is not possible to compute the projection exactly in a finite amount of time, and one may obtain only an approximate projection. 
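Projections onto the sets appearing later can themselves be computed spectrally. As a small sanity check of our own (not from the paper; function names are ours), for the PSD cone $S^+_n$ the projection clips negative eigenvalues, and the classical Moreau identity for a cone, $\mathbf{v} = P_K(\mathbf{v}) + P_{K^\circ}(\mathbf{v})$ with $K^\circ$ the polar cone, decomposes any symmetric matrix into its PSD and NSD parts:

```python
import numpy as np

def proj_psd(M):
    """Projection of a symmetric matrix onto the PSD cone: clip negative eigenvalues."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

def prox_support_psd(M):
    """prox of the support function of the PSD cone.  For a cone, the support
    function is the indicator of the polar cone (here the NSD matrices), so
    its prox is the projection onto that polar cone: clip positive eigenvalues."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, None, 0.0)) @ V.T
```

The two functions sum back to the input matrix, which is exactly the Moreau decomposition exploited in Section 4.1 to turn an (approximate) prox of a support function into an (approximate) projection.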
Using the Moreau decomposition $P_{\mathcal{X}}(\mathbf{v}) + \mathrm{prox}_{\sigma_{\mathcal{X}}}(\mathbf{v}) = \mathbf{v}$ [14], one can compute the projection if one can compute $\mathrm{prox}_{\sigma_{\mathcal{X}}}$, where $\sigma_{\mathcal{X}}(\mathbf{a}) = \max_{\mathbf{x} \in \mathcal{X}} \mathbf{x}^\top \mathbf{a}$ is the support function of $\mathcal{X}$, and $\mathrm{prox}_{g'}$ is the proximal operator for the function $g'$ at $\mathbf{v}$, defined below³:

$$\mathrm{prox}_{g'}(\mathbf{v}) = \mathrm{argmin}_{\mathbf{z} \in \mathrm{Dom}(g')} \; p_{g'}(\mathbf{z}; \mathbf{v}), \qquad p_{g'}(\mathbf{z}; \mathbf{v}) = \frac{1}{2}\|\mathbf{v} - \mathbf{z}\|^2 + g'(\mathbf{z}). \qquad (10)$$

We assume that one can compute $\mathbf{z}^{\epsilon}_{\mathcal{X}}(\mathbf{v})$, not necessarily in $\mathcal{X}$, such that

$$p_{\sigma_{\mathcal{X}}}\big(\mathbf{z}^{\epsilon}_{\mathcal{X}}(\mathbf{v}); \mathbf{v}\big) \le \min_{\mathbf{z} \in \mathbb{R}^n} p_{\sigma_{\mathcal{X}}}(\mathbf{z}; \mathbf{v}) + \epsilon, \qquad \text{and} \qquad P^{\epsilon}_{\mathcal{X}}(\mathbf{v}) = \mathbf{v} - \mathbf{z}^{\epsilon}_{\mathcal{X}}(\mathbf{v}). \qquad (11)$$

Note that $\mathbf{z}^{\epsilon}_{\mathcal{X}}$ is an inexact prox, and the resulting estimate of the projection $P^{\epsilon}_{\mathcal{X}}$ can be infeasible, but hopefully not too far away; $\epsilon = 0$ recovers the exact case. The next theorem confirms that it is possible to converge to the true optimum for a non-zero $\epsilon$ (supplementary material, Section D.5).

Theorem 5. Consider the optimization problem (8). 
Starting from any $\|\mathbf{x}_0 - \mathbf{x}^*\| \le R$, where $\mathbf{x}^*$ is a solution of (8), suppose that for every $k$ we can obtain $P^{\epsilon}_{\mathcal{X}}(\mathbf{y}_k)$ such that $\mathbf{z}_k = \mathbf{y}_k - P^{\epsilon}_{\mathcal{X}}(\mathbf{y}_k)$ satisfies (11), where $\mathbf{y}_k = \mathbf{x}_k - \alpha_k \mathbf{h}_k$, $\alpha_k = \frac{s}{\|\mathbf{h}_k\|}$, $\|\mathbf{h}_k\| \le L$, $\|\mathbf{x}_k - \mathbf{x}^*\| \le R$, and $s = \sqrt{\frac{R^2}{T} + \epsilon}$. Then the iterates

$$\mathbf{x}_{k+1} = P^{\epsilon}_{\mathcal{X}}(\mathbf{x}_k - \alpha_k \mathbf{h}_k), \qquad \mathbf{h}_k \in \partial f(\mathbf{x}_k), \qquad (12)$$

yield

$$f^*_T - f^* \le L \sqrt{\frac{R^2}{T} + \epsilon}, \qquad (13)$$

where $f^*_T = \min_{k=0,\ldots,T} f(\mathbf{x}_k)$.

³A more general definition of the proximal operator is $\mathrm{prox}^{\tau}_{g'}(\mathbf{v}) = \mathrm{argmin}_{\mathbf{z} \in \mathrm{Dom}(g')} \frac{1}{2\tau}\|\mathbf{v} - \mathbf{z}\|^2 + g'(\mathbf{z})$.

Related work on inexact proximal methods: There has been recent interest in deriving inexact proximal methods such as projected gradient descent; see [15, 18] for a comprehensive list of references. To the best of our knowledge, composite functions have been analyzed, but the case where $f$ is non-smooth has not been explored. The results presented here are thus complementary to [15, 18]. Note the subtlety in using the proper approximation criterion. Using a distance criterion between the true projection and the approximate projection, or an approximate optimality criterion on the optimal distance, would lead to a worse bound; using a dual approximate optimality criterion (here through the proximal operator for the support function) is key, as noted in [15, 18] and references therein.

As an immediate consequence of Theorem 5, suppose we have an algorithm to compute $\mathrm{prox}_{\sigma_{\mathcal{X}}}$ which guarantees, after $S$ iterations, that

$$p_{\sigma_{\mathcal{X}}}(\mathbf{z}_S; \mathbf{v}) - \min_{\mathbf{z} \in \mathbb{R}^d} p_{\sigma_{\mathcal{X}}}(\mathbf{z}; \mathbf{v}) \le \frac{\hat R^2}{S^2}, \qquad (14)$$

for a constant $\hat R$ particular to the set over which $p_{\sigma_{\mathcal{X}}}$ is defined. 
We can set $\epsilon = \frac{\hat R^2}{S^2}$ in (13), which suggests using $S = \sqrt{T}$ inner iterations to yield

$$f^*_T - f^* \le \frac{L \bar R}{\sqrt{T}}, \qquad \text{where} \qquad \bar R = \sqrt{R^2 + \hat R^2}. \qquad (15)$$

Remarks: Computational efficiency dictates that the number of projection steps be kept to a minimum. To this end, the number of projection steps needs to be at least $S = \sqrt{T}$ with the current choice of step sizes. Let $c_p$ be the cost of one FISTA iteration and $c_0$ the cost of one outer iteration. The total computational cost can then be estimated as $T^{3/2} \cdot c_p + T \cdot c_0$.

4.2 Applying IIP to compute SPORE

The problem of computing SPORE can be posed as minimizing a non-smooth convex function over the intersection of two sets: $K(G) = S^+_n \cap P(G)$, the intersection of the positive semi-definite cone $S^+_n$ and the polytope of equality constraints $P(G) := \{ M \in S_n \mid M_{ii} = 1, \; M_{ij} = 0 \;\forall (i, j) \notin E \}$. The algorithm described in Theorem 5 readily applies to this setting if the projection can be computed efficiently. The proximal operator for $\sigma_{\mathcal{X}}$ can be derived as⁴

$$\mathrm{prox}_{\sigma_{\mathcal{X}}}(\mathbf{v}) = \mathrm{argmin}_{\mathbf{a}, \mathbf{b} \in \mathbb{R}^d} \; p_{\sigma_{\mathcal{X}}}(\mathbf{a}, \mathbf{b}; \mathbf{v}), \qquad p_{\sigma_{\mathcal{X}}}(\mathbf{a}, \mathbf{b}; \mathbf{v}) = \frac{1}{2}\|(\mathbf{a} + \mathbf{b}) - \mathbf{v}\|^2 + \sigma_A(\mathbf{a}) + \sigma_B(\mathbf{b}). \qquad (16)$$

This means that even if we do not have an efficient procedure for computing $\mathrm{prox}_{\sigma_{\mathcal{X}}}(\mathbf{v})$ directly, we can devise an algorithm to guarantee the approximation (11) if we can compute $\mathrm{prox}_{\sigma_A}(\mathbf{v})$ and $\mathrm{prox}_{\sigma_B}(\mathbf{v})$ efficiently. This can be done by applying the popular FISTA algorithm to (16), which also guarantees (14). 
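To make the overall IIP structure concrete, here is a toy end-to-end sketch of our own (not the authors' code; function names are ours, and a few alternating projections between $S^+_n$ and $P(G)$ stand in for the FISTA-based inexact prox). It minimizes the simple objective $f(K) = \lambda_1(K)$ over the elliptope of a complete graph; the approximate projection may be slightly infeasible, which is precisely the regime Theorem 5 tolerates.

```python
import numpy as np

def proj_affine(M, nonedges):
    """Project onto P(G): unit diagonal, zeros on non-edges."""
    M = M.copy()
    np.fill_diagonal(M, 1.0)
    for i, j in nonedges:
        M[i, j] = M[j, i] = 0.0
    return M

def proj_psd(M):
    """Project onto the PSD cone by clipping negative eigenvalues."""
    w, V = np.linalg.eigh(M)
    return (V * np.clip(w, 0.0, None)) @ V.T

def approx_proj_KG(M, nonedges, inner=20):
    """Approximate projection onto K(G) via alternating projections; the
    output need not be exactly feasible (the 'infeasible' in IIP)."""
    for _ in range(inner):
        M = proj_affine(proj_psd(M), nonedges)
    return M

def iip_min_lambda1(n, nonedges, T=200, s=0.5):
    """Toy IIP: projected subgradient descent on f(K) = lambda_1(K) over K(G)."""
    K = np.ones((n, n))                      # feasible start for a complete graph
    for t in range(T):
        w, V = np.linalg.eigh(K)
        g = np.outer(V[:, -1], V[:, -1])     # subgradient of lambda_1 (unit Frobenius norm)
        K = approx_proj_KG(K - (s / np.sqrt(t + 1)) * g, nonedges)
    return K
```

For the complete graph the only constraints are the unit diagonal, and the minimizer of $\lambda_1$ over that elliptope is the identity (since $\lambda_1 \ge \mathrm{tr}(K)/n = 1$), so the iterates should drive $\lambda_1$ from $n$ down toward $1$.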
Algorithm 1 (detailed in the supplementary material as IIP_FISTA) computes the following simple steps, followed by the usual FISTA variable updates, at each iteration $t$: (a) a gradient descent step on $\mathbf{a}$ and $\mathbf{b}$ with respect to the smooth term $\frac{1}{2}\|(\mathbf{a} + \mathbf{b}) - \mathbf{v}\|^2$, and (b) a proximal step with respect to $\sigma_A$ and $\sigma_B$ using expressions (14) and (21) (supplementary material). Using the tools discussed above, we design Algorithm 1 to solve the SPORE formulation (5) using IIP. The proposed algorithm readily applies to general convex sets; however, we confine ourselves to the specific sets of interest in our problem. The following theorem states the convergence rate of the proposed procedure.

Theorem 6. Consider the optimization problem (8) with $\mathcal{X} = A \cap B$, where $A$ and $B$ are $S^+_n$ and $P(G)$ respectively. Starting from any $K_0 \in A$, the iterates $K_t$ in Algorithm 1 satisfy

$$\min_{t=0,\ldots,T} f(K_t) - f(K^*) \le \frac{L}{\sqrt{T}} \sqrt{R^2 + \hat R^2}.$$

Proof. An immediate extension of Theorem 5; supplementary material, Section D.6.

⁴The derivation is presented in supplementary material, Claim 6.

Algorithm 1 IIP for SPORE
1: function APPROX-PROJ-SUBG($K_0$, $L$, $R$, $\hat R$, $T$)
2:   $s = \sqrt{(R^2 + \hat R^2)/T}$  ▷ compute stepsize
3:   Initialize $t_0 = 1$.
4:   for $t = 1, \ldots, T$ do
5:     compute $\mathbf{h}_{t-1}$  ▷ subgradient of $f(K)$ at $K_{t-1}$, see equation (5)
6:     $\mathbf{v}_t = K_{t-1} - \frac{s}{\|\mathbf{h}_{t-1}\|} \mathbf{h}_{t-1}$
7:     $\tilde K_t = \mathrm{IIP\_FISTA}(\mathbf{v}_t, \sqrt{T})$  ▷ FISTA for $\sqrt{T}$ steps; use Algorithm 1 (supp.)
8:     $K_t = \mathrm{Proj}_A(\tilde K_t) = \tilde K_t - \mathrm{prox}_{\sigma_A}(\tilde K_t)$  ▷ $K_t$ needs to be PSD for the next SVM call; use (14) (supp.)
9:   end for
10: end function
Equating problem (8) with the SPORE problem (5), we have $f(K) = \omega_C(K, \mathbf{y}_S) + \beta \lambda_1(K)$. The set of subgradients of $f$ at iteration $t$ is given by $\partial f(K_t) = \big\{ -\frac{1}{2} Y \alpha_t \alpha_t^\top Y + \beta \mathbf{v}_t \mathbf{v}_t^\top \mid \alpha_t$ is returned by the SVM, and $\mathbf{v}_t$ is the eigenvector corresponding to $\lambda_1(K_t) \big\}$⁵, where $Y$ is a diagonal matrix with $Y_{ii} = y_i$ for $i \in S$, and $0$ otherwise. The step size is calculated using estimates of $L$, $R$ and $\hat R$, which can be derived as $L = nC^2$, $R = n$, $\hat R = n^{2.5}$ for the SPORE problem. See the supplementary material for the derivations.

5 Multiple Graph Transduction

Multiple graph transduction is of recent interest in the multi-view setting, where individual views are expressed by graphs. This includes many practical problems in bioinformatics [17], spam detection [21], etc. We propose an MKL-style extension of SPORE, with improved PAC bounds. Formally, the problem of multiple graph transduction is stated as follows: let $\mathcal{G} = \{G^{(1)}, \ldots, G^{(M)}\}$ be a set of simple graphs $G^{(k)} = (V, E^{(k)})$ defined on a common vertex set $V = [n]$. Given $\mathcal{G}$ and $\mathbf{y}_S$ as before, the goal is to accurately predict $\mathbf{y}_{\bar S}$. 
Following the standard technique of taking convex combinations of graph kernels [16], we propose the following MKL-SPORE formulation:

$$\Phi_{C,\beta}(\mathcal{G}, \mathbf{y}_S) = \min_{K^{(k)} \in K(G^{(k)})} \; \min_{\eta \in S^{M-1}} \; \omega_C\Big( \sum_{k \in [M]} \eta_k K^{(k)}, \mathbf{y}_S \Big) + \beta \max_{k \in [M]} \lambda_1\big(K^{(k)}\big). \qquad (17)$$

Similarly to Theorem 4, we can show (supplementary material, Theorem 8) that

$$er^{0\text{-}1}_{\bar S}[\hat{\mathbf{y}}] = O\Big( \big( \tfrac{1}{m} \big( \sqrt{n \vartheta(\mathcal{G})} + \log \tfrac{1}{\delta} \big) \big)^{1/2} \Big), \qquad \text{where} \qquad \vartheta(\mathcal{G}) \le \min_{k \in [M]} \vartheta\big(G^{(k)}\big). \qquad (18)$$

It immediately follows that combining multiple graphs improves the error convergence rate (see (6)), and hence the labelled sample complexity. The bound also suggests that the presence of at least one "good" graph is sufficient for MKL-SPORE to learn accurate predictions. This motivates the use of the proposed formulation in the presence of noisy graphs (Section 6). The IIP algorithm described in Section 4 can also be applied to solve (17) (supplementary material, Section F).

6 Experiments

We conducted experiments on both real-world and synthetic graphs to illustrate our theoretical observations. 
All experiments were run on a CPU with 2 Xeon Quad-Core processors (2.66 GHz, 12 MB L2 cache) and 16 GB memory, running CentOS 5.3.

⁵$\alpha_t = \mathrm{argmax}_{\alpha \in \mathbb{R}^n_+,\, \|\alpha\|_\infty \le C,\, \alpha_j = 0 \,\forall j \notin S} \; \alpha^\top \mathbf{1} - \frac{1}{2} \alpha^\top Y K_t Y \alpha$ and $\mathbf{v}_t = \mathrm{argmax}_{\mathbf{v} \in \mathbb{R}^n, \|\mathbf{v}\| = 1} \mathbf{v}^\top K_t \mathbf{v}$.

Table 1: SPORE comparison.

Dataset        Un-Lap  N-Lap  KS     SPORE
breast-cancer  88.22   93.33  92.77  96.67
diabetes       68.89   69.33  69.44  73.33
fourclass      70.00   70.00  70.44  78.00
heart          71.97   75.56  76.42  81.97
ionosphere     67.77   68.00  68.11  76.11
sonar          58.81   58.97  59.29  63.92
mnist-1vs2     75.55   80.55  79.66  85.77
mnist-3v8      76.88   81.88  83.33  86.11
mnist-4v9      68.44   72.00  72.22  74.88

Table 2: Large Scale – 2000 Nodes.

Dataset     Un-Lap  N-Lap  KS     SPORE
mnist-1vs2  83.80   96.23  94.95  96.72
mnist-3vs8  55.15   87.35  87.35  91.35
mnist-5vs6  96.30   94.90  92.05  97.35
mnist-1vs7  90.65   96.80  96.55  97.25
mnist-4vs9  65.55   65.05  61.30  87.40

Graph Transduction (SPORE): We use two dataset collections, UCI [12] and MNIST [10]. For the UCI datasets, we use the RBF kernel⁶ thresholded at its mean; for the MNIST datasets, we construct a similarity matrix using the cosine distance on a random sample of 500 nodes and threshold at 0.4, to obtain unweighted graphs. With 10% labelled nodes, we compare SPORE with formulation (3) using the graph kernels Unnormalized Laplacian $(c_1 I + L)^{-1}$, Normalized Laplacian $(c_2 I + D^{-1/2} L D^{-1/2})^{-1}$ and K-Scaling [1], where $L = D - A$ and $D$ is the diagonal matrix of degrees. We choose the parameters $c_1$, $c_2$, $C$ and $\beta$ by cross validation. Table 1 summarizes the results, averaged over 5 different labelled samples, with each entry being accuracy in % w.r.t. the 0-1 loss. As expected from Section 3, SPORE significantly outperforms existing methods. 
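The UCI graph construction above can be sketched in a few lines. This is our own illustration, not the paper's preprocessing code; in particular, interpreting "mean distance" as the root-mean-square pairwise distance is our assumption, and the function name is ours:

```python
import numpy as np

def rbf_graph(X, sigma=None):
    """Build a simple (unweighted, undirected) graph by thresholding an RBF
    kernel at its mean, in the spirit of the UCI experiments."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    if sigma is None:
        # assumption: 'mean distance' read as RMS pairwise distance
        sigma = np.sqrt(D2[D2 > 0].mean())
    W = np.exp(-D2 / (2 * sigma ** 2))                    # RBF kernel
    A = (W > W.mean()).astype(int)                        # threshold at the mean
    np.fill_diagonal(A, 0)                                # simple graph: no self loops
    return A
```

On well-separated clusters this keeps within-cluster edges and drops cross-cluster ones, yielding the kind of similarity graph assumed by Theorem 4.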
We also tackle large-scale graph transduction problems: Table 2 shows the superior performance of Algorithm 1 for a random sample of 2000 nodes, with only 5 outer iterations and 20 inner projections.

Multiple Graph Transduction (MKL-SPORE): We illustrate the effectiveness of combining multiple graphs using a mixture of random graphs $G(p, q)$, $p, q \in [0, 1]$, where we fix $|V| = n = 100$ and the labels $y \in \mathcal{Y}^n$ such that $y_i = 1$ if $i \le n/2$, and $-1$ otherwise. An edge $(i, j)$ is present with probability $p$ if $y_i = y_j$, and with probability $q$ otherwise. We generate three datasets to simulate the homogeneous, heterogeneous and noisy cases, shown in Table 3.

Table 3: Synthetic multiple graphs dataset.

Graph   Homo.        Heter.       Noisy
G(1)    G(0.7, 0.3)  G(0.7, 0.5)  G(0.7, 0.3)
G(2)    G(0.7, 0.3)  G(0.6, 0.4)  G(0.6, 0.4)
G(3)    G(0.7, 0.3)  G(0.5, 0.3)  G(0.5, 0.5)

Table 4: Superior performance of MKL-SPORE.

Graph            Homo.   Heter.  Noisy
G(1)             84.4    69.2    84.4
G(2)             84.8    68.6    68.2
G(3)             86.4    72.0    54.4
Union            85.5    69.3    69.3
Intersection     83.8    67.5    69.0
Majority         93.7    76.9    76.6
Multiple Graphs  95.6    80.6    81.9

MKL-SPORE was compared with the individual graphs, and with the union, intersection and majority graphs (footnote 7). We use SPORE to solve the single-graph transduction problems, and the results were averaged over 10 random samples of 5% labelled nodes. Using the same comparison metric as before, Table 4 shows that combining multiple graphs improves classification accuracy. Furthermore, the noisy case illustrates the robustness of the proposed formulation, a key observation from (18).

7 Conclusion

We show that the class of orthonormal graph embeddings is efficiently PAC learnable. Our analysis motivates a spectral norm regularized formulation, SPORE, for graph transduction.
Using an inexact proximal method, we design an efficient first-order method to solve the proposed formulation. The algorithm and analysis presented readily generalize to the multiple graphs setting.

Acknowledgments
We acknowledge support from a grant from the Indo-French Center for Applied Mathematics (IFCAM).

[Footnote 6] The $(i, j)$th entry of the RBF kernel is given by $\exp\big(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\big)$, where $\sigma$ is set as the mean distance.
[Footnote 7] The majority graph is the graph in which an edge $(i, j)$ is present if a majority of the graphs have the edge $(i, j)$.

References
[1] R. K. Ando and T. Zhang. Learning on graph with Laplacian regularization. In NIPS, 2007.
[2] N. Balcan and A. Blum. An augmented PAC model for semi-supervised learning. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-supervised Learning. MIT Press, Cambridge, 2006.
[3] J. P. Boyle and R. L. Dykstra. A method for finding projections onto the intersection of convex sets in Hilbert spaces. In Advances in Order Restricted Statistical Inference, volume 37 of Lecture Notes in Statistics, pages 28–47. Springer, New York, 1986.
[4] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, volume 2. MIT Press, Cambridge, 2001.
[5] M. Eisenberg-Nagy, M. Laurent, and A. Varvitsiotis. Forbidden minor characterizations for low-rank optimal solutions to semidefinite programs over the elliptope. J. Comb. Theory, Ser. B, 108:40–80, 2014.
[6] A. Erdem and M. Pelillo. Graph transduction as a non-cooperative game. Neural Computation, 24(3):700–723, 2012.
[7] M. X. Goemans. Semidefinite programming in combinatorial optimization. Mathematical Programming, 79(1-3):143–161, 1997.
[8] V. Jethava, A. Martinsson, C. Bhattacharyya, and D. P. Dubhashi. The Lovász ϑ function, SVMs and finding large dense subgraphs.
In NIPS, pages 1169–1177, 2012.
[9] R. Johnson and T. Zhang. On the effectiveness of Laplacian normalization for graph semi-supervised learning. JMLR, 8(7):1489–1517, 2007.
[10] Y. LeCun and C. Cortes. The MNIST database of handwritten digits, 1998.
[11] M. Leordeanu, A. Zanfir, and C. Sminchisescu. Semi-supervised learning and optimization for hypergraph matching. In ICCV, pages 2274–2281. IEEE, 2011.
[12] M. Lichman. UCI machine learning repository, 2013.
[13] L. Lovász. On the Shannon capacity of a graph. IEEE Transactions on Information Theory, 25(1):1–7, 1979.
[14] N. Parikh and S. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):123–231, 2013.
[15] M. Schmidt, N. L. Roux, and F. R. Bach. Convergence rates of inexact proximal-gradient methods for convex optimization. In NIPS, pages 1458–1466, 2011.
[16] R. Shivanna and C. Bhattacharyya. Learning on graphs using orthonormal representation is statistically consistent. In NIPS, pages 3635–3643, 2014.
[17] L. Tran. Application of three graph Laplacian based semi-supervised learning methods to protein function prediction problem. IJBB, 2013.
[18] S. Villa, S. Salzo, L. Baldassarre, and A. Verri. Accelerated and inexact forward-backward algorithms. SIAM Journal on Optimization, 23(3):1607–1633, 2013.
[19] T. Zhang and R. K. Ando. Analysis of spectral kernel design based semi-supervised learning. NIPS, 18:1601, 2005.
[20] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. NIPS, 16(16):321–328, 2004.
[21] D. Zhou and C. J. C. Burges. Spectral clustering and transductive learning with multiple views. In ICML, pages 1159–1166.
ACM, 2007.