{"title": "Learning on graphs using Orthonormal Representation is Statistically Consistent", "book": "Advances in Neural Information Processing Systems", "page_first": 3635, "page_last": 3643, "abstract": "Existing research \\cite{reg} suggests that embedding graphs on a unit sphere can be beneficial in learning labels on the vertices of a graph. However, the choice of optimal embedding remains an open issue. \\emph{Orthonormal representation} of graphs, a class of embeddings over the unit sphere, was introduced by Lov\\'asz \\cite{lovasz_shannon}. In this paper, we show that there exist orthonormal representations which are statistically consistent over a large class of graphs, including power law and random graphs. This result is achieved by extending the notion of consistency designed in the inductive setting to graph transduction. As part of the analysis, we explicitly derive relationships between the Rademacher complexity measure and structural properties of graphs, such as the chromatic number. We further show that the fraction of vertices of a graph $G$, on $n$ nodes, that need to be labelled for the learning algorithm to be consistent, also known as labelled sample complexity, is $\\Omega\\left(\\left(\\frac{\\vartheta(G)}{n}\\right)^{\\frac{1}{4}}\\right)$, where $\\vartheta(G)$ is the famous Lov\\'asz~$\\vartheta$ function of the graph. This, for the first time, relates labelled sample complexity to graph connectivity properties, such as the density of graphs. In the multiview setting, whenever individual views are expressed by a graph, it is a well-known heuristic that a convex combination of Laplacians \\cite{lap_mv1} tends to improve accuracy. 
The analysis presented here easily extends to multiple graph transduction, and helps develop a sound statistical understanding of the heuristic, previously unavailable.", "full_text": "Learning on graphs using Orthonormal Representation is Statistically Consistent\n\nRakesh S\nDepartment of Electrical Engineering\nIndian Institute of Science\nBangalore, 560012, INDIA\nrakeshsmysore@gmail.com\n\nChiranjib Bhattacharyya\nDepartment of CSA\nIndian Institute of Science\nBangalore, 560012, INDIA\nchiru@csa.iisc.ernet.in\n\nAbstract\n\nExisting research [4] suggests that embedding graphs on a unit sphere can be beneficial in learning labels on the vertices of a graph. However, the choice of optimal embedding remains an open issue. Orthonormal representation of graphs, a class of embeddings over the unit sphere, was introduced by Lovász [2]. In this paper, we show that there exist orthonormal representations which are statistically consistent over a large class of graphs, including power law and random graphs. This result is achieved by extending the notion of consistency designed in the inductive setting to graph transduction. As part of the analysis, we explicitly derive relationships between the Rademacher complexity measure and structural properties of graphs, such as the chromatic number. We further show that the fraction of vertices of a graph G, on n nodes, that need to be labelled for the learning algorithm to be consistent, also known as labelled sample complexity, is Ω((ϑ(G)/n)^{1/4}), where ϑ(G) is the famous Lovász ϑ function of the graph. This, for the first time, relates labelled sample complexity to graph connectivity properties, such as the density of graphs. In the multiview setting, whenever individual views are expressed by a graph, it is a well-known heuristic that a convex combination of Laplacians [7] tends to improve accuracy. 
The analysis presented here easily extends to multiple graph transduction, and helps develop a sound statistical understanding of the heuristic, previously unavailable.\n\n1 Introduction\n\nIn this paper we study the problem of graph transduction on a simple, undirected graph G = (V, E), with vertex set V = [n] and edge set E ⊆ V × V. We consider individual vertices to be labelled with binary values, ±1. Without loss of generality we assume that the first fn vertices are labelled, i.e., the set of labelled vertices is given by S = [fn], where f ∈ (0, 1). Let S̄ = V\S be the unlabelled vertex set, and let y_S and y_S̄ be the labels corresponding to the subgraphs S and S̄ respectively. Given G and y_S, the goal of graph transduction is to learn predictions ŷ ∈ Rⁿ such that er^{0-1}_{S̄}[ŷ] = (1/|S̄|) Σ_{j∈S̄} 1[y_j ≠ ȳ_j], with ȳ = sgn(ŷ), is small. To aid further discussion we introduce some notation.\n\nNotation: Let S^{n−1} = {u ∈ Rⁿ | ||u||₂ = 1} denote the (n−1)-dimensional unit sphere. Let D_n, S_n and S⁺_n denote the sets of n × n diagonal, square symmetric, and square symmetric positive semidefinite matrices respectively. Let Rⁿ₊ be the non-negative orthant. Let 1_n ∈ Rⁿ denote the vector of all 1's, and let [n] := {1, . . . , n}. For any M ∈ S_n, let λ1(M) ≥ . . . ≥ λn(M) denote its eigenvalues and M_i denote the i-th row of M, ∀i ∈ [n]. We denote the adjacency matrix of a graph G by A. Let d_i denote the degree of vertex i ∈ [n], d_i := A_i^T 1_n. Let D ∈ D_n, where D_ii = d_i, ∀i ∈ [n]. 
We refer to I − D^{−1/2} A D^{−1/2} as the normalized Laplacian, where I denotes the identity matrix. Let Ḡ denote the complement graph of G, with adjacency matrix Ā = 1_n 1_n^T − I − A. For K ∈ S⁺_n and y ∈ {±1}ⁿ, the dual formulation of the Support Vector Machine (SVM) is given by ω(K, y) = max_{α∈Rⁿ₊} g(α, K, y), where g(α, K, y) = Σ_{i=1}^n α_i − (1/2) Σ_{i,j=1}^n α_i α_j y_i y_j K_ij. Let Y = Ȳ = {±1} and Ŷ ⊆ R be the label, prediction and soft-prediction spaces over V. Given a graph G and labels y ∈ Yⁿ on V, let cut(A, y) := Σ_{y_i≠y_j} A_ij. We use ℓ : Y × Ŷ → R₊ to denote any loss function. In particular, for a ∈ Y, b ∈ Ŷ, let ℓ^{0-1}(a, b) = 1[ab < 0], ℓ^{hinge}(a, b) = (1 − ab)₊¹ and ℓ^{ramp}(a, b) = min(1, (1 − ab)₊) denote the 0-1, hinge and ramp losses respectively. The notations O, o, Ω, Θ denote the standard measures defined in asymptotic analysis [14].\n\nMotivation: The regularization framework is a widely used tool for learning labels on the vertices of a graph [23, 4]:\n\nmin_{ŷ∈Ŷⁿ} (1/|S|) Σ_{i∈S} ℓ(y_i, ŷ_i) + λ ŷ^T K^{−1} ŷ    (1)\n\nwhere K is a kernel matrix and λ > 0 is an appropriately chosen regularization parameter. It was shown in [4] that the optimal ŷ* satisfies the following generalization bound\n\nE_S[er^{0-1}_{S̄}[ŷ*]] ≤ c1 inf_{ŷ∈Ŷⁿ} { er_V[ŷ] + λ ŷ^T K^{−1} ŷ } + c2 tr_p(K)/(λ|S|)\n\nwhere er^{(·)}_H[ŷ] := (1/|H|) Σ_{i∈H} ℓ^{(·)}(y_i, ŷ_i), H ⊆ V²; tr_p(K) = ((1/n) Σ_{i=1}^n K_ii^p)^{1/p}, p > 0; and c1, c2 are dependent on ℓ. [4] argued that for good generalization, tr_p(K) should be a constant, which motivated them to normalize the diagonal entries of K. It is important to note that the set of normalized kernels is quite big, and the above analysis gives little insight into choosing the optimal kernel from such a set. The important problem of consistency (er_S̄ → 0 as n → ∞, to be formally defined in Section 3) of graph transduction algorithms was introduced in [5]. [5] showed that the formulation (1), when used with a Laplacian-dependent kernel, achieves a generalization error of E_S[er_S̄[ŷ*]] = O(√(q/nf)), where q is the number of pure components³. Though [5]'s algorithm is consistent for a small number of pure components, they achieve the above convergence rate by choosing λ dependent on the true labels of the unlabelled nodes, which is not practical [6]. In this paper, we formalize the notion of consistency of graph transduction algorithms and derive novel graph-dependent statistical estimates for the following formulation:\n\nΛ_C(K, y_S) = min_{ȳ_j∈Ȳ, j∈S̄} min_{α∈Rⁿ₊} (1/2) α^T K α + C Σ_{i∈S} ℓ(ŷ_i, y_i) + C Σ_{j∈S̄} ℓ(ŷ_j, ȳ_j)    (2)\n\nwhere ŷ_k = Σ_{i∈S} K_ik y_i α_i + Σ_{j∈S̄} K_jk ȳ_j α_j, ∀k ∈ V. If all the labels are observed, then [22] showed that the above formulation is equivalent to (1). We note that the normalization step considered by [4] is equivalent to finding an embedding of a graph on a sphere. Thus, we study orthonormal representations of graphs [2], which define a rich class of graph embeddings on a unit sphere. We show that the formulation (2), working with orthonormal representations of graphs, is statistically consistent over a large class of graphs, including random and power law graphs. 
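The SVM dual ω(K, y) = max_{α∈Rⁿ₊} g(α, K, y) used throughout the paper can be made concrete with a small NumPy sketch. This is illustrative code, not from the paper; the projected-gradient solver `omega` and its step size are our assumptions, chosen only to make the definition executable.

```python
import numpy as np

def g(alpha, K, y):
    """Dual SVM objective: sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij."""
    v = y * alpha
    return alpha.sum() - 0.5 * v @ K @ v

def omega(K, y, steps=2000, lr=0.01):
    """Approximate omega(K, y) = max over alpha >= 0 of g(alpha, K, y)
    by projected gradient ascent (an illustrative solver, not the paper's)."""
    alpha = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - y * (K @ (y * alpha))      # gradient of g w.r.t. alpha
        alpha = np.maximum(alpha + lr * grad, 0.0)  # project onto the non-negative orthant
    return g(alpha, K, y)
```

For example, with K = I (always a valid graph kernel) the maximizer is α = 1_n, so ω(I, y) = n/2 for any labelling y.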
In the sequel, we apply Rademacher complexity to orthonormal representations of graphs and derive a novel graph-dependent transductive error bound. We also extend our analysis to study multiple graph transduction. More specifically, we make the following contributions.\n\nContributions: The main contribution of this paper is to show that there exist orthonormal representations of graphs that are statistically consistent on a large class of graph families Gc. For a special orthonormal representation, the LS labelling, we show consistency on Erdős Rényi random graphs. Given a graph G ∈ Gc, with a constant fraction of nodes labelled, f = O(1), we derive an error convergence rate of er^{0-1}_{S̄} = O((ϑ(G)/n)^{1/4}), with high probability, where ϑ(G) is the Lovász ϑ function of the graph G. Existing work [5] showed an expected convergence rate of O(√(q/n)); however, q is dependent on the true labels of the unlabelled nodes, hence their bound cannot be computed explicitly [6]. We also apply the Rademacher complexity measure to the function class associated with orthonormal representations and derive a tight bound relating it to χ(G), the chromatic number of the graph G. We show that the Laplacian inverse [4] has O(1) complexity on graphs with high connectivity, whereas the LS labelling exhibits a complexity of Θ(n^{1/4}). Experiments demonstrate superior performance of the LS labelling on several real world datasets. We derive a novel transductive error bound, relating to graph structural measures. \n\n¹(a)₊ = max(a, 0).\n²We drop the argument ŷ when it is implicit from the context.\n³A pure component is a connected subgraph in which all the nodes have the same label.\n\n
Using our analysis, we show that observing labels of an Ω((ϑ(G)/n)^{1/4}) fraction of the nodes is sufficient to achieve consistency. We also propose an efficient Multiple Kernel Learning (MKL) based algorithm, with generalization guarantees, for multiple graph transduction. Experiments demonstrate improved performance from combining multiple graphs.\n\n2 Preliminaries\n\nOrthonormal Representation: [2] introduced the idea of orthonormal representations for the problem of embedding a graph on a unit sphere. More formally, an orthonormal representation of a simple, undirected graph G = (V, E) with V = [n] is a matrix U = [u1, . . . , un] ∈ R^{d×n} such that u_i^T u_j = 0 whenever (i, j) ∉ E, and u_i ∈ S^{d−1} ∀i ∈ [n]. Let Lab(G) denote the set of all possible orthonormal representations of the graph G, given by Lab(G) := {U | U is an orthonormal representation}. [1] recently introduced the notion of graph embedding to the Machine Learning community and showed connections to graph kernel matrices. Consider the set of graph kernels K(G) := {K ∈ S⁺_n | K_ii = 1, ∀i ∈ [n]; K_ij = 0, ∀(i, j) ∉ E}. [1] showed that for every valid kernel K ∈ K(G) there exists an orthonormal representation U ∈ Lab(G); and it is easy to see the other direction, K = U^T U ∈ K(G). Thus, the two sets Lab(G) and K(G) are equivalent. Orthonormal representation is also associated with an interesting quantity, the Lovász number [2], defined as ϑ(G) = 2 (min_{K∈K(G)} ω(K, 1_n)) [1]. 
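The membership conditions defining K(G), and the factorization K = UᵀU linking K(G) to Lab(G), can be checked numerically. A minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

def is_graph_kernel(K, edges, n, tol=1e-8):
    """Check K in K(G): unit diagonal, K_ij = 0 for every non-edge (i != j), PSD."""
    E = {frozenset(e) for e in edges}
    if not np.allclose(np.diag(K), 1.0, atol=tol):
        return False
    for i in range(n):
        for j in range(i + 1, n):
            if frozenset((i, j)) not in E and abs(K[i, j]) > tol:
                return False
    return np.linalg.eigvalsh(K).min() >= -tol

def labelling_from_kernel(K):
    """Recover an orthonormal representation U with U^T U = K via eigendecomposition;
    the columns u_i are then unit vectors orthogonal across non-edges."""
    w, V = np.linalg.eigh(K)
    return np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T
```

Note that the identity matrix is a valid member of K(G) for every graph G, which is why the set is never empty.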
The ϑ function is a fundamental tool in combinatorial optimization and approximation algorithms for graphs.\n\nLovász Sandwich Theorem [2]: Given an undirected graph G = (V, E), I(Ḡ) ≤ ϑ(Ḡ) ≤ χ(G), where I(Ḡ) is the independence number of the complement graph Ḡ.\n\n3 Statistical Consistency of Graph Transduction Algorithms\n\nIn this section, we formalize the notion of consistency of graph transduction algorithms. Given a graph G_n = (V_n, E_n) on n nodes, with the labels of a subgraph S_n ⊆ V_n observable, let er*_{S̄n} := inf_{ỹ∈Ȳⁿ} er_{S̄n}[ỹ] denote the minimal unlabelled node set error. Consistency is a measure of the quality of a learning algorithm A, comparing er_{S̄n}[ŷ] to er*_{S̄n}, where ŷ are the predictions made by A. A related notion of loss consistency has been extensively studied in the literature [3, 12], which only shows that the difference er_{S̄n}[ŷ] − er_{Sn}[ŷ] → 0 as n → ∞ [6]. This does not confirm the optimality of A, that is, er_{S̄n}[ŷ] → er*_{S̄n}. Hence, a notion stronger than loss consistency is needed. Let G_n belong to a graph family G, ∀n. Let Π_f be the uniform distribution over the random draw of the labelled subgraph S_n ⊆ V_n, such that |S_n| = fn, f ∈ (0, 1). As discussed earlier, we want the ℓ-regret, R_{Sn}[A] = er_{S̄n}[ŷ] − er*_{S̄n}, to be small. Since the labelled nodes are drawn randomly, there is a small probability that one gets an unrepresentative subgraph S_n. However, for large n, we want the ℓ-regret to be close to zero with high probability⁴. In other words, for every finite and fixed n, we want an estimate on the ℓ-regret which decreases as n increases. 
We define the following notion of consistency of graph transduction algorithms to capture this requirement.\n\nDefinition 1. Let G be a graph family and f ∈ (0, 1) be fixed. Let V = {(v_i, y_i, E_i)}_{i=1}^∞ be an infinite sequence of labelled nodes, where y_i ∈ Y and E_i is the edge information of node v_i with the previously observed nodes v_1, . . . , v_{i−1}, ∀i ≥ 2. Let V_n be the first n nodes in V, and let G_n ∈ G be the graph defined by (V_n, E_1, . . . , E_n). Let S_n ⊆ V_n, and let y_n, y_{Sn} be the labels of V_n, S_n respectively. A learning algorithm A, which when given G_n and y_{Sn} returns soft-predictions ŷ, is said to be ℓ-consistent w.r.t. G if, when the labelled subgraph S_n is randomly drawn from Π_f, the ℓ-regret converges in probability to zero, i.e., ∀ε > 0,\n\nPr_{Sn∼Πf}[R_{Sn}[A] ≥ ε] → 0 as n → ∞\n\n⁴If G is not deterministic (e.g., Erdős Rényi), then there is a small probability that one gets an unrepresentative graph, in which case we want the ℓ-regret to be close to zero with high probability over G_n ∼ G.\n\nIn Section 6 we show that the kernel learning style algorithm (2), working with orthonormal representations, is consistent on a large class of graph families. To the best of our knowledge, no existing literature provides an explicit empirical error convergence rate and proves consistency of the graph transduction algorithm considered. Before we prove our main result, we gather useful tools: a) a complexity measure which reacts to the structural properties of the graph (Section 4); b) a generalization analysis to bound er_S̄ (Section 5). 
In the interest of space, we defer most of the proofs to the supplementary material⁵.\n\n4 Graph Complexity Measures\n\nIn this section we apply Rademacher complexity to orthonormal representations of graphs, and relate it to the chromatic number. In particular, we study the LS labelling, whose class complexity can be shown to be greater than that of the Laplacian inverse on a large class of graphs. Let (2) be solved for K ∈ K(G), and let U ∈ Lab(G) be the orthonormal representation corresponding to K (Section 2). Then, by the Representer theorem, the classifier learnt by (2) is of the form h = Uβ, β ∈ Rⁿ. We define the Rademacher complexity of the function class associated with orthonormal representations as follows.\n\nDefinition 2 (Rademacher Complexity). Given a graph G = (V, E) with V = [n], let U ∈ Lab(G) and H̄_U = {h | h = Uβ, β ∈ Rⁿ} be the function class associated with U. For p ∈ (0, 1/2], let σ = (σ1, . . . , σn) be a vector of i.i.d. random variables such that σ_i ∈ {+1, −1, 0} w.p. p, p and 1 − 2p respectively. The Rademacher complexity of the graph G defined by U, H̄_U, is given by R(H̄_U, p) = (1/n) E_σ [ sup_{h∈H̄U} Σ_{i=1}^n σ_i ⟨h, u_i⟩ ].\n\nThe above definition is motivated by [9, 3]. This is an empirical complexity measure, suited to the transductive setting. We derive the following novel, tight Rademacher bound.\n\nTheorem 4.1. Let G = (V, E) be a simple, undirected graph with V = [n], U ∈ Lab(G), and K = U^T U ∈ K(G) the graph kernel corresponding to U. Let H_U = {h | h = Uβ, β ∈ Rⁿ, ||β||₂ ≤ tC√n}, C > 0, t ∈ [0, 1], and let p ∈ [1/n, 1/2]. 
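Definition 2 can be estimated by Monte Carlo. For h = Uβ we have ⟨h, u_i⟩ = (U^TUβ)_i = (Kβ)_i, so over the ball ||β||₂ ≤ tC√n used in Theorem 4.1 the supremum has the closed form tC√n · ||Kσ||₂. A sketch under these assumptions (the trial count and helper name are ours):

```python
import numpy as np

def rademacher_mc(K, p, t, C, trials=2000, rng=None):
    """Monte Carlo estimate of R(H_U, p) over the ball ||beta||_2 <= t*C*sqrt(n).
    Since <h, u_i> = (K beta)_i for h = U beta, the sup over the ball equals
    t*C*sqrt(n) * ||K sigma||_2, and R = (1/n) E_sigma[sup]."""
    rng = np.random.default_rng(rng)
    n = K.shape[0]
    B = t * C * np.sqrt(n)
    # sigma_i in {+1, -1, 0} with probabilities p, p, 1 - 2p, as in Definition 2
    sigma = rng.choice([1.0, -1.0, 0.0], size=(trials, n), p=[p, p, 1.0 - 2.0 * p])
    sups = B * np.linalg.norm(sigma @ K, axis=1)  # K symmetric, so sigma @ K = (K sigma)^T
    return sups.mean() / n
```

For K = I the estimate is close to √2·tC√p, i.e., it lands at the upper end of the c₀ range stated in Theorem 4.1.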
The Rademacher complexity of the graph G defined by U is given by R(H_U, p) = c0 tC √(p λ1(K)), where 1/√2 ≤ c0 ≤ √2 is a constant.\n\nThe above result provides a lower bound on the Rademacher complexity for any unit sphere graph embedding. While upper bounds may be available [9, 3], to the best of our knowledge this is the first attempt at establishing lower bounds. The use of orthonormal representations allows us to relate the class complexity measure to graph-structural properties.\n\nCorollary 4.2. For C, t, p = O(1), R(H_U, p) = O(√χ(G)). (Suppl.)\n\nSuch connections between learning theory complexity measures and graph properties were previously unavailable [9, 3]. Corollary 4.2 suggests that there exist graph regularizers with class complexity as large as O(√χ(G)), which motivates us to find substantially better regularizers. In particular, we investigate the LS labelling [16]; given a graph G, the LS labelling K_LS ∈ K(G) is defined as\n\nK_LS = A/ρ + I,  ρ ≥ |λn(A)|    (3)\n\nThe LS labelling has high Rademacher complexity on a large class of graphs; in particular:\n\nCorollary 4.3. For a random graph G(n, q), q ∈ [0, 1), where each edge is present independently w.p. q, and for C, t, q = O(1), the Rademacher complexity of the function class associated with the LS labelling (3) is Θ(n^{1/4}), with high probability. (Suppl.)\n\n⁵mllab.csa.iisc.ernet.in/rakeshs/nips14/suppl.pdf\n\nFor the limiting case of complete graphs, we can show that the Laplacian inverse [4], the most widely used graph regularizer, has O(1) complexity (Claim 2, Suppl.), thus indicating that it may be suboptimal for graphs with high connectivity. 
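Equation (3) is directly implementable: K_LS has a unit diagonal (A has a zero diagonal), vanishes on non-edges, and is PSD because every eigenvalue λ of A satisfies λ/ρ + 1 ≥ 0 when ρ ≥ |λn(A)|. A minimal sketch (the zero-edge guard is our addition; the paper assumes ρ ≥ |λn(A)| > 0):

```python
import numpy as np

def ls_labelling(A):
    """LS labelling (3): K_LS = A/rho + I with rho = |lambda_n(A)|,
    the magnitude of the smallest eigenvalue of the adjacency matrix A."""
    n = A.shape[0]
    rho = abs(np.linalg.eigvalsh(A).min())
    if rho == 0.0:           # edgeless graph: A = 0, so K_LS degenerates to I
        return np.eye(n)
    return A / rho + np.eye(n)
```

For the 4-cycle, λn(A) = −2, so K_LS = A/2 + I, which is PSD with unit diagonal, i.e., a valid member of K(G).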
Experimental results illustrate our observation. We thus derive a class complexity measure for unit sphere graph embeddings, which indicates the richness of the function class and helps the learning algorithm choose an effective embedding.\n\n5 Generalization Error Bound\n\nIn the previous section, we applied Rademacher complexity to orthonormal representations. In this section we derive novel graph-dependent generalization error bounds, which will be used in Section 6. Following a proof technique similar to that of [3], we propose the following error bound.\n\nTheorem 5.1. Given a graph G = (V, E), V = [n], with y ∈ Yⁿ being the unknown binary labels over V, let U ∈ Lab(G) and K ∈ K(G) the corresponding kernel. Let H̃_U = {h | h = Uβ, β ∈ Rⁿ, ||β||∞ ≤ C}, C > 0. Let ℓ be any loss function bounded in [0, B] and L-Lipschitz in its second argument. For f ∈ (0, 1/2]⁶, let the labels of a subgraph S ⊆ V be observable, |S| = nf, and let S̄ = V\S. For any δ > 0 and h ∈ H̃_U, with probability ≥ 1 − δ over S ∼ Π_f,\n\ner_S̄[ŷ] ≤ er_S[ŷ] + LC √(2λ1(K)/(f(1−f))) + (c1 B/(1−f)) √((1/nf) log(1/δ))    (4)\n\nwhere ŷ = U^T h and c1 > 0 is a constant. (Suppl.)\n\nDiscussion: Note that from [2], λ1(K) ≤ χ(G), and χ(G) is in turn bounded by the maximum degree of the graph [21]. Thus, if L, B, f = O(1), then for sparse, degree-bounded graphs and the choice of parameter C = Θ(1/√n), the slack term and the complexity term go to zero as n increases, making the bound useful. Examples include trees, cycles, paths, stars and d-regular graphs (with d = O(1)). Such connections relating generalization error to graph properties were not available before. 
We exploit this novel connection to analyze graph transduction algorithms in Section 6. In Section 7, we extend the above result to the problem of multiple graph transduction.\n\n5.1 Max-margin Orthonormal Representation\n\nTo analyze er^{0-1}_{S̄} in relation to a graph structural measure, the ϑ function, we study the maximum margin induced by any orthonormal representation, in an oracle setting. We study a fully labelled graph G = (V, E, y), where y ∈ Yⁿ are the binary labels on the vertices V. Given any U ∈ Lab(G), the maximum margin classifier is computed by solving ω(K, y) = g(α*, K, y), where K = U^T U ∈ K(G). It is interesting to note that, knowing all the labels, the max-margin orthonormal representation can be computed by solving an SDP. More formally:\n\nDefinition 3. Given a labelled graph G = (V, E, y), where V = [n] and y ∈ Yⁿ are the binary labels on V, let H̄ = ∪_{U∈Lab(G)} H̄_U, where H̄_U = {h | h = Uβ, β ∈ Rⁿ}. Let K ∈ K(G) be the kernel corresponding to U ∈ Lab(G). The max-margin orthonormal representation is given by K_mm = argmin_{K∈K(G)} ω(K, y).\n\nBy definition, K_mm induces the largest margin amongst orthonormal representations, and hence is optimal. The optimal margin has an interesting connection to the Lovász ϑ function:\n\nTheorem 5.2. Given a labelled graph G = (V, E, y), with V = [n] and y ∈ Yⁿ being the binary labels on the vertices, let K_mm be as in Definition 3. Then ω(K_mm, y) = ϑ(G)/2. (Suppl.)\n\nThus, knowing all the labels, computing K_mm is equivalent to solving the ϑ function. However, in the transductive setting, K_mm cannot be computed. Alternatively, we explore the LS labelling (3), which gives a constant factor approximation to the optimal margin on a large class of graphs.\n\nDefinition 4. 
A class of labelled graphs G = {G = (V, E, y)} is said to be a Labelled SVM-ϑ graph family if there exists a constant γ > 1 such that ∀G ∈ G, ω(K_LS, y) ≤ γ ω(K_mm, y).\n\n⁶We can generalize our result to f ∈ (0, 1), but for simplicity of the proof we assume f ∈ (0, 1/2]. This is also true in practice, where the number of labelled examples is usually very small.\n\nAlgorithm 1\nInput: U, y_S and C > 0.\nGet α*, ȳ*_S̄ by solving Λ_C(K, y_S) (2) for ℓ^hinge and K = U^T U.\nReturn: ŷ = U^T h_S, where h_S = UYα*; Y ∈ D_n, Y_ii = y_i if i ∈ S, otherwise ȳ*_i.\n\nSuch classes of graphs are interesting because one can get a constant factor approximation to the optimal margin without knowledge of the true labels, e.g., a mixture of random graphs: G = (V, E, y), with y^T 1_n = 0 and cut(A, y) ≤ c√n for a constant c > 1, where the subgraphs corresponding to the two classes form G(n/2, 1/2) random graphs (Claim 3, Suppl.). We relate the maximum geometric margin induced by orthonormal representations to the ϑ function of the graph. This allows us to derive novel graph-dependent learning theory estimates.\n\n6 Consistency of Orthonormal Representation of Graphs\n\nAggregating the results from Sections 4 and 5, we show that Algorithm 1, working with orthonormal representations of graphs, is consistent on a large class of graph families. For every finite and fixed n, we derive an estimate on er^{0-1}_{S̄n}.\n\nTheorem 6.1. For the setting as in Definition 1, let f ∈ (0, 1/2] be fixed. Let ŷ be the predictions learnt by Algorithm 1 with inputs U_n ∈ Lab(G_n), y_{Sn} and C* = 
(ϑ²(G_n)(1−f)/(2³ n² f ϑ(Ḡ_n)))^{1/4}. Then ∃ U_n ∈ Lab(G_n), ∀G_n, such that with probability at least 1 − 1/n over S_n ∼ Π_f,\n\ner^{0-1}_{S̄n}[ŷ] = O( (ϑ(G_n)/(f³(1−f)n))^{1/4} + (1/(1−f)) √((log n)/(nf)) )\n\nProof. Let K_n ∈ K(G_n) be the max-margin kernel associated with G_n (Definition 3), and let U_n ∈ Lab(G_n) be the corresponding orthonormal representation. Since ℓ^ramp is an upper bound on ℓ^{0-1}, we concentrate on bounding er^{ramp}_{S̄n}[ŷ]. Note that for any C > 0,\n\nC|S_n| · er^{ramp}_{Sn}[ŷ] ≤ C|S_n| · er^{hinge}_{Sn}[ŷ] ≤ Λ_C(K_n, y_{Sn}) ≤ Λ_C(K_n, y_n) ≤ ω(K_n, y_n) = ϑ(G_n)/2\n\nThe last equality follows from Theorem 5.2. Note that for the ramp loss L = B = 1; using Theorem 5.1 with δ = 1/n, it follows that with probability at least 1 − 1/n over the random draw of S_n ∼ Π_f,\n\ner^{ramp}_{S̄n}[ŷ] ≤ ϑ(G_n)/(2Cnf) + C √(2λ1(K_n)/(f(1−f))) + (c1/(1−f)) √((log n)/(nf))    (5)\n\nwhere c1 = O(1). Using λ1(K_n) ≤ ϑ(Ḡ_n) [2] and optimizing the RHS for C, we get C* = (ϑ²(G_n)(1−f)/(2³n²fϑ(Ḡ_n)))^{1/4}. Plugging back C* and using ϑ(G_n)ϑ(Ḡ_n) = n [2] proves the claim.\n\n[5] showed that E_S[er_{S̄n}] = O(√(q/n)). However, as noted in Section 1, the quantity q is dependent on y_{S̄n}, and hence their bound cannot be computed explicitly [6]. We assume that the graph does not contain duplicate nodes with opposite labels, so er*_{S̄n} = 0. Thus, consistency follows from the fact that ϑ(G) ≤ n, and for large families of graphs it is O(n^c) where 0 ≤ c < 1. 
This theorem implies that if f = O(1), then by Definition 1, Algorithm 1 is ℓ^{0-1}-consistent over such classes of graph families. Examples include:\n\nPower-law graphs: Graphs where the degree sequence follows a power law distribution. We show that ϑ(Ḡ) = O(√n) for naturally occurring power law graphs (Claim 4, Suppl.). Thus, working with the complement graph Ḡ makes Algorithm 1 consistent.\n\nRandom graphs: For G(n, q) graphs with q = O(1), with high probability ϑ(G(n, q)) = Θ(√n) [13].\n\nNote that choosing K_n for various graph families is difficult. Alternatively, for the Labelled SVM-ϑ graph family (Definition 4), if the Lovász ϑ function is sub-linear, then for the choice of the LS labelling, Algorithm 1 is ℓ^{0-1}-consistent. Examples include the mixture of random graphs (Section 5.1). Furthermore, we analyze the fraction of labelled nodes that must be observed for Algorithm 1 to be consistent.\n\nCorollary 6.2 (Labelled Sample Complexity). Given a graph family Gc such that ϑ(G_n) = O(n^c), ∀G_n ∈ Gc, where 0 ≤ c < 1; for C = C* as in Theorem 6.1, a (1/2)(ϑ(G_n)/n)^{1/3−ε} fraction of labelled nodes, ε > 0, is sufficient for Algorithm 1 to be ℓ^{0-1}-consistent w.r.t. Gc.\n\nThe proof directly follows from Theorem 6.1. As a consequence of the above result, we can argue that for sparse graphs (ϑ(G) is large) one would need a larger fraction of nodes labelled, whereas for denser graphs (ϑ(G) is small) a smaller fraction of labelled nodes suffices. 
Such connections relating sample complexity to graph properties were not available before. To end this section, we discuss possible extensions to the inductive setting (Claim 5, Suppl.): we can show that the uniform convergence of er_S̄ to er_S in the transductive setting (for f = 1/2) is a necessary and sufficient condition for the uniform convergence of er_S to the generalization error. Thus, the results presented here can be extended to the supervised setting. Furthermore, combining Theorem 5.1 with the results of [9], we can also extend our results to the semi-supervised setting.\n\n7 Multiple Graph Transduction\n\nMany real world problems can be posed as learning on multiple graphs [19, ?]. Existing algorithms for single graph transduction [10, 15] cannot be trivially extended to the new setting. It is a well-known heuristic that taking a convex combination of Laplacians improves classification performance [7]; however, the underlying principle is not well understood. We propose an efficient MKL style algorithm with generalization guarantees. Formally, the problem of multiple graph transduction is:\n\nProblem 1. Given a set G = {G^(1), . . . , G^(m)} of simple, undirected graphs G^(k) = (V, E^(k)) defined on a common vertex set V = [n]. Without loss of generality we assume that the first fn vertices are labelled, i.e., the set of labelled vertices is given by S = [fn], where f ∈ (0, 1). Let S̄ = V\S be the unlabelled node set, and let y_S, y_S̄ be the labels of S, S̄ respectively. Given G and labels y_S, the goal is to accurately predict y_S̄.\n\nLet K = {K^(1), . . . , K^(m)} be the set of kernels corresponding to the graphs in G, with K^(k) ∈ K(G^(k)), ∀k ∈ [m]. 
We propose the following MKL style formulation for multiple graph transduction:\n\nΨ_C(K, y_S) = min_{η∈Rᵐ₊, ||η||₁=1} min_{ȳ_j∈Ȳ, ∀j∈S̄} max_{α∈Rⁿ₊, ||α||∞≤C} g(α, Σ_{k=1}^m η_k K^(k), [y_S, ȳ_S̄])    (6)\n\nExtending our analysis from Section 5, we propose the following error bound.\n\nTheorem 7.1. For the setting as in Problem 1, let f ∈ (0, 1/2]⁷ and K = {K^(1), . . . , K^(m)}, K^(k) ∈ K(G^(k)), ∀k ∈ [m]. Let α*, η*, ȳ*_S̄ be the solution to Ψ_C(K, y_S) (6). Let ŷ = Σ_{k=1}^m η*_k K^(k) Ȳ α*, where Ȳ ∈ D_n, Ȳ_ii = y_i if i ∈ S, otherwise ȳ*_i. Then, for any δ > 0, with probability ≥ 1 − δ over the choice of S ⊆ V such that |S| = nf,\n\ner^{0-1}_{S̄}[ŷ] ≤ Ψ̄(K, y)/(Cnf) + C √(2ϑ(Ḡ∪)/(f(1−f))) + (c1/(1−f)) √((1/nf) log(1/δ))\n\nwhere c1 = O(1), Ψ̄(K, y) = min_{k∈[m]} ω(K^(k), y), and G∪ is the union of the graphs in G⁸. (Suppl.)\n\nThe above result gives us the ability, for the first time, to analyze the generalization performance of multiple graph transduction algorithms. The expression Ψ̄(K, y) suggests that combining multiple graphs should improve performance over considering individual graphs separately. 
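The inner kernel in (6), Σ_k η_k K^(k) with η on the simplex, is itself a valid graph kernel for the union graph G∪: each K^(k) has unit diagonal and vanishes on every non-edge of G∪, and convex combinations preserve positive semidefiniteness. A small sketch of just this combination step (not the full solver for (6); the function name is ours):

```python
import numpy as np

def combine_kernels(kernels, eta):
    """Convex combination sum_k eta_k K^(k) with eta on the simplex
    (eta >= 0, ||eta||_1 = 1). Unit diagonals and PSD-ness are preserved,
    so the result lies in K(G_union)."""
    eta = np.asarray(eta, dtype=float)
    assert np.all(eta >= 0.0) and np.isclose(eta.sum(), 1.0)
    return sum(e * K for e, K in zip(eta, kernels))
```

This is one way to read the Laplacian-combination heuristic the section analyzes: the simplex weights η trade the views off against each other while staying inside a single valid kernel class.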
Similar to Section 6, we can show that if one of the graph families G^(l), l ∈ [m], of G obeys ϑ(G^(l)_n) = O(n^c), 0 ≤ c < 1, G^(l)_n ∈ G^(l), then there exist orthonormal representations K such that the MKL style algorithm optimizing (6) is ℓ0-1-consistent over G (Claim 6, Suppl.). We can also show that combining graphs improves the labelled sample complexity (Claim 7, Suppl.). This is a first attempt at developing a statistical understanding of the problem of multiple graph transduction.

⁷ As in Theorem 5.1, we can generalize our results for f ∈ (0, 1).
⁸ G_∪ = (V, E_∪), where (i, j) ∈ E_∪ if edge (i, j) is present in at least one of the graphs G^(k) ∈ G, k ∈ [m].

8 Experimental results

We conduct two sets of experiments⁹.

Superior performance of LS labelling: We use two datasets: similarity matrices* from [11] and RBF kernels¹⁰ as similarity matrices for the UCI datasets† [8]. We build an unweighted graph by thresholding the similarity matrices about the mean. Let L = D − A. For the regularized formulation (1), with 10% of the labelled nodes observable, we test four types of kernel matrices: LS labelling (LS-lab), (λ1 I + L)^(−1) (Un-Lap), (λ2 I + D^(−1/2) L D^(−1/2))^(−1) (N-Lap) and K-Scaling (KS-Lap) [4]. We choose the parameters λ, λ1 and λ2 by cross validation. Table 1 summarizes the results. Each entry is accuracy in % w.r.t. 0-1 loss, and the results were averaged over 100 iterations.

Table 1: Superior performance of LS labelling.

Dataset           LS-lab  Un-Lap  N-Lap  KS-Lap
AuralSonar*        76.5    69.2    66.7    68.1
Yeast-SW-5-7*      60.4    53.3    52.9    54.1
Yeast-SW-5-12*     78.6    64.3    60.5    61.2
Yeast-SW-7-12*     76.5    63.1    59.5    64.0
Diabetes†          73.1    68.5    68.6    68.3
Fourclass†         73.3    71.8    71.2    69.3

Since we threshold about the mean, the graphs have high connectivity.
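The preprocessing pipeline just described (mean-thresholding a similarity matrix, then forming the two Laplacian-based baselines of Table 1) can be sketched as follows; lam1 and lam2 are illustrative stand-ins for the cross-validated λ1, λ2:

```python
import numpy as np

def threshold_graph(S):
    """Unweighted graph: keep edge (i, j) iff S_ij exceeds the mean similarity."""
    A = (S > S.mean()).astype(float)
    A = np.maximum(A, A.T)        # symmetrize
    np.fill_diagonal(A, 0.0)      # no self loops
    return A

def laplacian_kernels(A, lam1=1.0, lam2=1.0):
    """Un-Lap and N-Lap kernels from Table 1.

    lam1/lam2 stand in for the cross-validated regularization parameters.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                        # combinatorial Laplacian
    un_lap = np.linalg.inv(lam1 * np.eye(n) + L)
    d = np.sqrt(np.where(deg > 0, deg, 1.0))    # guard isolated vertices
    n_lap = np.linalg.inv(lam2 * np.eye(n) + L / np.outer(d, d))
    return un_lap, n_lap
```

Both inverses exist for any λ1, λ2 > 0, since the (normalized) Laplacian is positive semidefinite and the λI shift makes the matrix positive definite.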
Thus, from Corollary 4.3, the function class associated with LS labelling is rich and expressive, and hence it outperforms the previously proposed regularizers.

Graph transduction across multiple views: Learning on multi-view data has been of recent interest [18]. Following a similar line of attack, we pose the problem of classification on multi-view data as multiple graph transduction. We investigate the recently launched Google dataset [17], which contains multiple views of video game YouTube videos, consisting of 13 feature types with auditory (Aud), visual (Vis) and textual (Txt) descriptions. Each video is labelled with one of 30 classes. For each of the views we construct similarity matrices using the cosine distance and threshold about the mean to obtain unweighted graphs. We considered 20% of the data to be labelled. We show results on pair-wise classification for the first four classes. As natural ways of combining graphs, we compared our algorithm (6) (MV) with the union (Unn), intersection (Int) and majority (Maj)¹¹ of graphs. We used LS labelling as the graph-kernel, and (2) was used to solve single graph transduction. Table 2 summarizes the results (each entry is accuracy in %), averaged over 20 iterations. We also state the top accuracy in each of the views for comparison.
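The three graph-combination baselines (Unn, Int, Maj) reduce to simple elementwise rules on 0/1 adjacency matrices; a minimal sketch (the function name is ours):

```python
import numpy as np

def combine_graphs(adjs, mode="union"):
    """Combine 0/1 adjacency matrices of graphs on a common vertex set.

    union:        edge present in at least one graph (G_union of footnote 8)
    intersection: edge present in every graph
    majority:     edge present in more than half the graphs (footnote 11)
    """
    counts = np.sum(adjs, axis=0)   # per-edge presence counts
    m = len(adjs)
    if mode == "union":
        return (counts >= 1).astype(int)
    if mode == "intersection":
        return (counts == m).astype(int)
    if mode == "majority":
        return (counts > m / 2).astype(int)
    raise ValueError(f"unknown mode: {mode}")
```

The combined adjacency matrix can then be fed to any single-graph transduction routine, which is exactly how the Unn, Int and Maj rows of Table 2 are produced.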
As expected from our analysis in Theorem 7.1, we observe that combining multiple graphs significantly improves classification accuracy.

Table 2: Multiple graph transduction. Each entry is accuracy in %.

Graph  1vs2  1vs3  1vs4  2vs3  2vs4  3vs4
Aud    62.8  64.8  68.3  59.3  50.8  61.5
Vis    68.9  65.6  68.9  69.1  70.3  75.1
Txt    68.7  59.2  64.8  64.6  60.9  65.4
Unn    69.7  60.3  52.7  62.7  67.4  62.5
Maj    72.7  75.2  80.5  65.4  62.6  77.4
Int    80.6  83.6  86.0  90.9  75.3  91.8
MV     98.9  93.4  95.6  97.7  87.7  98.8

9 Conclusion

For the problem of graph transduction, we show that there exist orthonormal representations that are consistent over a large class of graphs. We also note that the Laplacian inverse regularizer is suboptimal on graphs with high connectivity, and alternatively show that LS labelling is not only consistent, but also exhibits high Rademacher complexity on a large class of graphs. Using our analysis, we also develop a sound statistical understanding of the improved classification performance obtained by combining multiple graphs.

⁹ Relevant resources at: mllab.csa.iisc.ernet.in/rakeshs/nips14
¹⁰ The (i, j)th entry of an RBF kernel is given by exp(−‖x_i − x_j‖² / 2σ²). We set σ to the mean distance.
¹¹ The majority graph is the graph in which an edge (i, j) is present if a majority of the graphs have the edge (i, j).

References
[1] V. Jethava, A. Martinsson, C. Bhattacharyya, and D. P. Dubhashi. The Lovász ϑ function, SVMs and finding large dense subgraphs. Neural Information Processing Systems, pages 1169–1177, 2012.
[2] L. Lovász. On the Shannon capacity of a graph. IEEE Transactions on Information Theory, 25(1):1–7, 1979.
[3] R. El-Yaniv and D. Pechyony. Transductive Rademacher complexity and its applications. In Learning Theory, pages 151–171. Springer, 2007.
[4] R. Ando and T. Zhang. Learning on graph with Laplacian regularization. Neural Information Processing Systems, 2007.
[5] R.
Johnson and T. Zhang. On the effectiveness of Laplacian normalization for graph semi-supervised learning. Journal of Machine Learning Research, 8(4), 2007.
[6] R. El-Yaniv and D. Pechyony. Transductive Rademacher complexity and its applications. Journal of Machine Learning Research, 35(1):193, 2009.
[7] A. Argyriou, M. Herbster, and M. Pontil. Combining graph Laplacians for semi-supervised learning. Neural Information Processing Systems, 2005.
[8] A. Asuncion and D. Newman. UCI machine learning repository. 2000.
[9] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2003.
[10] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. International Conference on Machine Learning, pages 19–26. Morgan Kaufmann Publishers Inc., 2001.
[11] Y. Chen, M. Gupta, and B. Recht. Learning kernels from indefinite similarities. International Conference on Machine Learning, pages 145–152. ACM, 2009.
[12] C. Cortes, M. Mohri, D. Pechyony, and A. Rastogi. Stability analysis and learning bounds for transductive regression algorithms. arXiv preprint arXiv:0904.0814, 2009.
[13] A. Coja-Oghlan. The Lovász number of random graphs. Combinatorics, Probability and Computing, 14(04):439–465, 2005.
[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 3rd edition, 2009.
[15] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random walks. Neural Information Processing Systems, 14:945–952, 2002.
[16] C. J. Luz and A. Schrijver. A convex quadratic characterization of the Lovász theta number. SIAM Journal on Discrete Mathematics, 19(2):382–387, 2005.
[17] O. Madani, M. Georg, and D. A. Ross. On using nearly-independent feature families for high precision and confidence.
Machine Learning, 92:457–477, 2013.
[18] W. Tang, Z. Lu, and I. S. Dhillon. Clustering with multiple graphs. International Conference on Data Mining, pages 1016–1021. IEEE, 2009.
[19] K. Tsuda, H. Shin, and B. Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(suppl 2):ii59–ii65, 2005.
[20] V. N. Vapnik and A. J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280. SIAM, 1971.
[21] D. J. A. Welsh and M. B. Powell. An upper bound for the chromatic number of a graph and its application to timetabling problems. The Computer Journal, 10(1):85–86, 1967.
[22] T. Zhang and R. Ando. Analysis of spectral kernel design based semi-supervised learning. Neural Information Processing Systems, 18:1601, 2006.
[23] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Neural Information Processing Systems, 16(16):321–328, 2008.