{"title": "Generalized Matrix Means for Semi-Supervised Learning with Multilayer Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 14877, "page_last": 14886, "abstract": "We study the task of semi-supervised learning on multilayer graphs by taking into account both labeled and unlabeled observations together with the information encoded by each individual graph layer. We propose a regularizer based on the generalized matrix mean, which is a one-parameter family of matrix means that includes the arithmetic, geometric and harmonic means as particular cases. We analyze it in expectation under a Multilayer Stochastic Block Model and verify numerically that it outperforms state of the art methods. Moreover, we introduce a matrix-free numerical scheme based on contour integral quadratures and Krylov subspace solvers that scales to large sparse multilayer graphs.", "full_text": "Generalized Matrix Means for Semi-Supervised\n\nLearning with Multilayer Graphs\n\nPedro Mercado1, Francesco Tudisco2 and Matthias Hein1\n\n1University of T\u00fcbingen, Germany\n2Gran Sasso Science Institute, Italy\n\nAbstract\n\nWe study the task of semi-supervised learning on multilayer graphs by taking into\naccount both labeled and unlabeled observations together with the information\nencoded by each individual graph layer. We propose a regularizer based on the\ngeneralized matrix mean, which is a one-parameter family of matrix means that\nincludes the arithmetic, geometric and harmonic means as particular cases. We\nanalyze it in expectation under a Multilayer Stochastic Block Model and verify\nnumerically that it outperforms state of the art methods. 
Moreover, we introduce a matrix-free numerical scheme based on contour integral quadratures and Krylov subspace solvers that scales to large sparse multilayer graphs.

1 Introduction

The task of graph-based Semi-Supervised Learning (SSL) is to build a classifier that takes into account both labeled and unlabeled observations, together with the information encoded by a given graph [4, 27]. A common and successful approach is to take a suitable loss function on the labeled nodes and a regularizer which encodes the information provided by the graph [2, 15, 30, 32, 35]. Whereas this task is well studied, traditionally these methods assume that the graph is composed of interactions of one single kind, i.e. only one graph is available.

For the case where multiple graphs, or equivalently, multiple layers are available, the challenge is to boost the classification performance by merging the information encoded in each graph. Arguably the most popular approach to this task consists of finding some form of convex combination of graph matrices, where more informative graphs receive a larger weight [1, 13, 14, 23, 28, 29, 31, 33]. Note that a convex combination of graph matrices can be seen as a weighted arithmetic mean of graph matrices. In the context of multilayer graph clustering, previous studies [19-21] have shown that weighted arithmetic means are suboptimal under certain benchmark generative graph models, whereas other matrix means, such as the geometric [20] and harmonic [19] means, are able to discover clustering structures that the arithmetic mean overlooks.

In this paper we study the task of semi-supervised learning with multilayer graphs with a novel regularizer based on the power mean Laplacian.
The power mean Laplacian is a one-parameter family of Laplacian matrix means that includes as special cases the arithmetic, geometric and harmonic means of Laplacian matrices. We show that, in expectation under a Multilayer Stochastic Block Model, our approach provably correctly classifies unlabeled nodes in settings where state of the art approaches fail. In particular, a limit case of our method is provably robust against noise, yielding good classification performance as long as one layer is informative and the remaining layers are potentially just noise. We verify the analysis in expectation with extensive experiments on random graphs, showing that our approach compares favorably with state of the art methods, yielding a good classification performance in several relevant settings where state of the art approaches fail.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

name     | minimum   | harmonic mean     | geometric mean | arithmetic mean | maximum
p        | p → −∞    | p = −1            | p → 0          | p = 1           | p → ∞
m_p(a,b) | min{a, b} | 2(1/a + 1/b)^{−1} | √(ab)          | (a + b)/2       | max{a, b}

Table 1: Particular cases of scalar power means

Moreover, our approach scales to large datasets: even though the computation of the power mean Laplacian is in general prohibitive for large graphs, we present a matrix-free numerical scheme based on contour integral quadrature methods and Krylov subspace solvers which allows us to apply the power mean Laplacian regularizer to large sparse graphs. Finally, we perform numerical experiments on real world datasets and verify that our approach is competitive with state of the art approaches.

2 The Power Mean Laplacian

In this section we introduce our multilayer graph regularizer based on the power mean Laplacian. We define a multilayer graph G with T layers as the set G = {G(1), . . .
, G(T)}, with each graph layer defined as G(t) = (V, W(t)), where V = {v_1, . . . , v_n} is the node set and W(t) ∈ R^{n×n} is the corresponding adjacency matrix, which we assume symmetric and nonnegative. We further denote the layers' normalized Laplacians as L(t)_sym = I − (D(t))^{−1/2} W(t) (D(t))^{−1/2}, where D(t) is the diagonal degree matrix with (D(t))_ii = Σ_{j=1}^n W(t)_ij.

The scalar power mean is a one-parameter family of scalar means defined as

m_p(x_1, . . . , x_T) = ( (1/T) Σ_{i=1}^T x_i^p )^{1/p}

where x_1, . . . , x_T are nonnegative scalars and p is a real parameter. Particular choices of p yield specific means such as the arithmetic, geometric and harmonic means, as illustrated in Table 1.

The Power Mean Laplacian, introduced in [19], is a matrix extension of the scalar power mean applied to the Laplacians of a multilayer graph, proposed as a more robust way to blend the information encoded across the layers. It is defined as

L_p = ( (1/T) Σ_{i=1}^T (L(i)_sym)^p )^{1/p}

where A^{1/p} is the unique positive definite solution of the matrix equation X^p = A. For the case p ≤ 0 a small diagonal shift ε > 0 is added to each Laplacian, i.e. we replace L(i)_sym with L(i)_sym + εI, to ensure that L_p is well defined, as suggested in [3]. In what follows all the proofs hold for an arbitrary shift. Following [19], we set ε = log10(1 + |p|) + 10^{−6} for p ≤ 0 in the numerical experiments.

3 Multilayer Semi-Supervised Learning with the Power Mean Laplacian

In this paper we consider the following optimization problem for the task of semi-supervised learning in multilayer graphs: Given k classes r = 1, . . .
, k and membership vectors Y(r) ∈ R^n defined by Y(r)_i = 1 if node v_i belongs to class r and Y(r)_i = 0 otherwise, we let

f(r) = arg min_{f ∈ R^n} ‖f − Y(r)‖^2 + λ f^T L_p f .    (1)

The final class assignment for an unlabeled node v_i is y_i = arg max_r {f(1)_i, . . . , f(k)_i}. Note that the solution f of (1), for a particular class r, is such that (I + λL_p)f = Y(r). Equation (1) has two terms: the first term is a loss function based on the labeled nodes, whereas the second term is a regularization term based on the power mean Laplacian L_p, which accounts for the multilayer graph structure. It is worth noting that the Local-Global approach of [32] is a particular case of our approach when only one layer (T = 1) is considered. Moreover, note that when p = 1 we obtain a regularizer based on the arithmetic mean of Laplacians L_1 = (1/T) Σ_{i=1}^T L(i)_sym. In the following section we analyze our proposed approach (1) under the Multilayer Stochastic Block Model.

4 Multilayer Stochastic Block Model

In this section we provide an analysis of semi-supervised learning for multilayer graphs with the power mean Laplacian as a regularizer under the Multilayer Stochastic Block Model (MSBM). The MSBM is a generative model for graphs showing certain prescribed cluster/class structures via a set of membership parameters p(t)_in and p(t)_out, t = 1, . . . , T. These parameters designate the edge probabilities: given nodes v_i and v_j, the probability of observing an edge between them on layer t is p(t)_in (resp. p(t)_out) if v_i and v_j belong to the same (resp. different) cluster/class. Note that, unlike the Labeled Stochastic Block Model [11], the MSBM allows multiple edges between the same pair of nodes across the layers. For SSL with one layer under the SBM we refer the reader to [12, 22, 26]. We present an analysis in expectation.
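For small graphs, the regularizer of Section 2 and the classifier (1) can be assembled directly with dense linear algebra. The sketch below is our own illustration (helper names are ours, and this is not the authors' scalable matrix-free solver of Section 5); it assumes symmetric nonnegative adjacency matrices and uses the shift ε = log10(1 + |p|) + 10^{−6} for p ≤ 0, as in the text:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def normalized_laplacian(W):
    """L_sym = I - D^{-1/2} W D^{-1/2} for a symmetric adjacency matrix W."""
    d = W.sum(axis=1)
    d_inv_sqrt = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def power_mean_laplacian(laplacians, p, shift=None):
    """Dense L_p = ((1/T) sum_i (L_i + eps*I)^p)^(1/p); the shift is needed for p <= 0."""
    n = laplacians[0].shape[0]
    if shift is None:
        eps = np.log10(1 + abs(p)) + 1e-6 if p <= 0 else 0.0
    else:
        eps = shift
    S = sum(fractional_matrix_power(L + eps * np.eye(n), p) for L in laplacians)
    S = S / len(laplacians)
    return fractional_matrix_power(S, 1.0 / p).real

def ssl_classify(adjacencies, Y, p=-1, lam=1.0):
    """Solve (I + lam*L_p) f = Y[:, r] per class and assign by argmax, as in Eq. (1)."""
    n = adjacencies[0].shape[0]
    Lp = power_mean_laplacian([normalized_laplacian(W) for W in adjacencies], p)
    F = np.linalg.solve(np.eye(n) + lam * Lp, Y)  # one column of f per class
    return F.argmax(axis=1)
```

For the limit case p → 0 (geometric mean) and for large sparse graphs this dense route is impractical; Section 5 describes the contour-integral, matrix-free approach used instead.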
We consider k clusters/classes C_1, . . . , C_k of equal size |C| = n/k. We denote with calligraphic letters the layers of a multilayer graph in expectation E(G) = {E(G(1)), . . . , E(G(T))}, i.e. W(t) is the expected adjacency matrix of the t-th layer. We assume that our multilayer graphs are unweighted, i.e. edges are zero or one, and hence we have W(t)_ij = p(t)_in (resp. W(t)_ij = p(t)_out) for nodes v_i, v_j belonging to the same (resp. different) cluster/class.

In order to grasp how different methods classify the nodes in multilayer graphs following the MSBM we analyze two different settings. In the first setting (Section 4.1) all layers have the same class structure and we study the conditions for different regularizers L_p to correctly predict class labels. We further show that our approach is robust against the presence of noise layers, in the sense that it achieves a small classification error when at least one layer is informative and the remaining layers are potentially just noise. In this setting we distinguish the case where each class has the same amount of initial labels and the case where different classes have different numbers of labels. In the second setting (Section 4.2) we consider the case where each layer taken alone would lead to a large classification error whereas considering all the layers together can lead to a small classification error.

4.1 Complementary Information Layers

A common assumption in multilayer semi-supervised learning is that at least one layer encodes relevant information for the label prediction task. The next theorem discusses the classification error of the expected power mean Laplacian regularizer in this setting.

Theorem 1. Let E(G) be the expected multilayer graph with T layers following the multilayer SBM with k classes C_1, . . . , C_k of equal size and parameters (p(t)_in, p(t)_out)_{t=1}^T. Assume the same number of labeled nodes is available per class.
Then, the solution of (1) yields zero test error if and only if

m_p(ρ_ε) < 1 + ε ,    (2)

where (ρ_ε)_t = 1 − (p(t)_in − p(t)_out)/(p(t)_in + (k − 1)p(t)_out) + ε, for t = 1, . . . , T.

This theorem shows that the power mean Laplacian regularizer correctly classifies the nodes if p is such that condition (2) holds. In order to better understand how this condition changes when p varies, we analyze in the next corollary the limit cases p → ±∞.

Corollary 1. Let E(G) be an expected multilayer graph as in Theorem 1. Then,
• For p → ∞, the test error is zero if and only if p(t)_out < p(t)_in for all t = 1, . . . , T.
• For p → −∞, the test error is zero if and only if there exists a t ∈ {1, . . . , T} such that p(t)_out < p(t)_in.

This corollary implies that the limit case p → ∞ requires that all layers convey information regarding the clustering/class structure of the multilayer graph, whereas the case p → −∞ requires that at least one layer encodes clustering/class information; hence the conditions for the limit p → −∞ are less restrictive than those for the limit case p → ∞. The next corollary shows that the smaller the power parameter p is, the less restrictive are the conditions that yield zero test error.

Corollary 2. Let E(G) be an expected multilayer graph as in Theorem 1. Let p ≤ q.
If L_q yields zero test error, then L_p yields zero test error.

The previous results show the effectiveness of the power mean Laplacian regularizer in expectation. We now present a numerical evaluation based on Theorem 1 and Corollaries 1 and 2 on random graphs sampled from the SBM.

Figure 1: Average classification error under the Stochastic Block Model computed from 100 runs. Top row, particular cases of the power mean Laplacian: (a) L−10, (b) L−1, (c) L0, (d) L1, (e) L10. Bottom row, state of the art models: (f) SMACD, (g) AGML, (h) TLMV, (i) SGMI, (j) TSS.

The corresponding results are presented in Fig. 1 for classification with regularizers L−10, L−1, L0, L1, L10 and λ = 1. We first describe the setting we consider: we generate random multilayer graphs with two layers (T = 2) and two classes (k = 2), each composed of 100 nodes (|C| = 100). For each parameter configuration (p(1)_in, p(1)_out, p(2)_in, p(2)_out) we generate 10 random multilayer graphs and 10 random samples of labeled nodes, yielding a total of 100 runs per parameter configuration, and report the average test error. Our goal is to evaluate the classification performance under different SBM parameters and different amounts of labeled nodes. To this end, we fix the first layer G(1) to be informative of the class structure (p(1)_in − p(1)_out = 0.08), i.e. one can achieve a low classification error by taking this layer alone, provided sufficiently many labeled nodes are given. The second layer goes from non-informative (noisy) configurations (p(2)_in < p(2)_out, left half of the x-axis) to informative configurations (p(2)_in > p(2)_out, right half of the x-axis), with p(t)_in + p(t)_out = 0.1 for both layers. Moreover, we consider different amounts of labeled nodes, going from 1% to 50% (y-axis).
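The random-graph setup just described can be reproduced in a few lines. The sketch below is our own (the function name is hypothetical, not from the paper's code) and samples one unweighted, undirected SBM layer at a time:

```python
import numpy as np

def sample_sbm_layer(sizes, p_in, p_out, rng):
    """One undirected, unweighted SBM layer: edge prob. p_in within a class, p_out across."""
    n = sum(sizes)
    labels = np.repeat(np.arange(len(sizes)), sizes)
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    U = rng.random((n, n))
    A = np.triu((U < P).astype(float), k=1)  # sample the upper triangle, no self-loops
    return A + A.T

rng = np.random.default_rng(0)
# informative first layer, noisy second layer (p_in == p_out), as in the left half of Fig. 1;
# the values 0.09/0.01 realize p_in - p_out = 0.08 with p_in + p_out = 0.1
layers = [sample_sbm_layer([100, 100], 0.09, 0.01, rng),
          sample_sbm_layer([100, 100], 0.05, 0.05, rng)]
```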
The corresponding results are presented in Figs. 1a, 1b, 1c, 1d and 1e.

In general one can expect a low classification error when both layers G(1) and G(2) are informative (right half of the x-axis). We can see that this is the case for all power mean Laplacian regularizers considered here (see top row of Fig. 1). In particular, we can see in Fig. 1e that L10 performs well only when both layers are informative and completely fails when the second layer is not informative, regardless of the amount of labeled nodes. On the other hand, we can see in Fig. 1a that L−10 achieves in general a low classification error, regardless of the configuration of the second layer G(2), i.e. whenever G(1) or G(2) is informative. Moreover, we can see that overall the areas with low classification error (dark blue) increase as the parameter p decreases, verifying the result of Corollary 2. In the bottom row of Fig. 1 we present the performance of state of the art methods. We can observe that most of them present a classification performance that resembles that of the power mean Laplacian regularizer L1. In general their classification performance drops when the level of noise increases, i.e. for non-informative configurations of the second layer G(2), and they are outperformed by the power mean Laplacian regularizer for small values of p.

Unbalanced Class Proportions in Labeled Data. In the previous analysis we assumed that we had the same amount of labeled nodes per class. We now consider the case where the number of labeled nodes per class differs. This setting was considered in [35], where the goal was to overcome unbalanced class proportions in labeled nodes. To this end, they propose a Class Mass Normalization (CMN) strategy, whose performance was also tested in [34].
In the following result we show that, provided the ground truth classes have the same size, different amounts of labeled nodes per class affect the conditions in expectation for zero classification error of (1). For simplicity, we consider here only the case of two classes.

Figure 2: Different class weighted loss strategies. Left to right: (a) uniform loss, (b) weighted loss, and (c) Class Mass Normalization.

Theorem 2. Let E(G) be the expected multilayer graph with T layers following the multilayer SBM with two classes C_1, C_2 of equal size and parameters (p(t)_in, p(t)_out)_{t=1}^T. Assume n_1, n_2 nodes from C_1, C_2 are labeled, respectively. Let λ = 1. Then (1) yields zero test error if

m_p(ρ_ε) < min{ n_1/n_2 , n_2/n_1 } ,    (3)

where (ρ_ε)_t = 1 − (p(t)_in − p(t)_out)/(p(t)_in + (k − 1)p(t)_out) + ε, and t = 1, . . . , T.

Observe that Theorem 2 provides only a sufficient condition. A necessary and sufficient condition for zero test error in terms of p, n_1 and n_2 is given in the supplementary material.

A different objective function can be employed for the case of classes with different numbers of labels per class. Let C be the diagonal matrix defined by C_ii = n/n_r if node v_i has been labeled as belonging to class C_r. Consider the following modification of (1):

arg min_{f ∈ R^n} ‖f − CY‖^2 + λ f^T L_p f .    (4)

The next theorem shows that using (4) in place of (1) allows us to retrieve the same condition of Theorem 1 for zero test error in expectation in the setting where the numbers of labeled nodes per class are not equal.

Theorem 3.
Let E(G) be the expected multilayer graph with T layers following the multilayer SBM with k classes C_1, . . . , C_k of equal size and parameters (p(t)_in, p(t)_out)_{t=1}^T. Let n_1, . . . , n_k be the number of labeled nodes per class. Let C ∈ R^{n×n} be a diagonal matrix with C_ii = n/n_r for v_i ∈ C_r. The solution to (4) yields zero test classification error if and only if

m_p(ρ_ε) < 1 + ε ,    (5)

where (ρ_ε)_t = 1 − (p(t)_in − p(t)_out)/(p(t)_in + (k − 1)p(t)_out) + ε, and t = 1, . . . , T.

In Figs. 2a, 2b and 2c we present a numerical experiment with random graphs illustrating our analysis in expectation. We consider the following setting: we generate multilayer graphs with two layers (T = 2) and two classes (k = 2), each composed of 100 nodes (|C| = 100). We fix p(1)_in − p(1)_out = 0.08 and p(2)_in − p(2)_out = 0, with p(t)_in + p(t)_out = 0.1 for both layers. We fix the total amount of labeled nodes to n_1 + n_2 = 50 and let n_1 = 1, . . . , 49. For each setting we generate 10 multilayer graphs and 10 sets of labeled nodes, yielding a total of 100 runs per setting, and report the average test classification error. In Fig. 2a we can see the performance of the power mean Laplacian regularizer without modifications. We can observe how different proportions of labeled nodes per class affect the performance. In Fig. 2b, we present the performance of the modified approach (4) and observe that it yields better performance across different class label proportions. Finally, in Fig.
2c we present the performance based on Class Mass Normalization¹, where we can see that its effect is slightly skewed towards one class and its overall classification error is higher than that of the proposed approach.

¹We follow the authors' implementation: http://pages.cs.wisc.edu/~jerryzhu/pub/harmonic_function.m

4.2 Information-Independent Layers

In the previous section we considered the case where at least one layer had enough information to correctly estimate node class labels. In this section we consider the case where single layers taken alone obtain a large classification error, whereas when all the layers are taken together it is possible to obtain a good classification performance. For this setting we consider multilayer graphs with 3 layers (T = 3) and three classes (k = 3) C_1, C_2, C_3, each composed of 100 nodes (|C| = 100), with the following expected adjacency matrix per layer:

W(t)_{i,j} = p_in, if v_i, v_j ∈ C_t or v_i, v_j ∈ V \ C_t;  W(t)_{i,j} = p_out, otherwise,    (6)

for t = 1, 2, 3, i.e. layer G(t) is informative of class C_t but not of the remaining classes, and hence any classification method using one single layer will provide a poor classification performance.

Figure 4: Average test error under the SBM for a multilayer graph with 3 layers and 3 classes. Top row, particular cases of the power mean Laplacian: (a) L−10, (b) L−1, (c) L0, (d) L1, (e) L10. Bottom row, state of the art models: (f) SMACD, (g) AGML, (h) TLMV, (i) SGMI, (j) TSS.

In Fig. 4 we present numerical experiments: for each parameter setting (p_in, p_out) we generate 5 multilayer graphs together with 5 samples of labeled nodes, yielding a total of 25 runs per setting, and report the average test classification error.
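The expected adjacency matrices in (6) can be written down directly. The sketch below is our own reading of the model (layer t treats class C_t and its complement as the two planted groups); the function name is hypothetical:

```python
import numpy as np

def expected_layer(labels, t, p_in, p_out):
    """Expected adjacency of layer t in the information-independent setting:
    p_in when both nodes are in C_t, or both are outside C_t; p_out otherwise."""
    in_t = (labels == t)
    same_side = in_t[:, None] == in_t[None, :]  # both in C_t, or both in its complement
    return np.where(same_side, p_in, p_out)

labels = np.repeat([0, 1, 2], 4)  # three classes of four nodes each (toy sizes)
Ws = [expected_layer(labels, t, 0.1, 0.05) for t in range(3)]
```

Note that within layer t all classes other than C_t are merged into one block, which is why no single layer can separate all three classes.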
Also in this case we observe that the power mean Laplacian regularizer does identify the global class structure and that it leverages the information provided by labeled nodes, particularly for smaller values of p. On the other hand, this is not the case for all other state of the art methods. In fact, we can see that SGMI and TSS perform similarly to L10, which has the largest classification error. Moreover, we can see that AGML and TLMV perform similarly to the arithmetic mean of Laplacians L1, which in turn is outperformed by the power mean Laplacian regularizer L−10. Please see the supplementary material for a more detailed comparison.

5 A Scalable Matrix-free Numerical Method for the System (I + λL_p)f = Y

In this section we introduce a matrix-free method for the solution of the system (I + λL_p)f = Y based on contour integrals and Krylov subspace methods. The method exploits the sparsity of the Laplacians of each layer and is matrix-free in the sense that it only requires computing matrix-vector products of the form L(i)_sym v, without storing the matrices. Thus, when the layers are sparse, the method scales to large datasets. Observe that this is a critical requirement, as L_p is in general a dense matrix, even for very sparse layers, and thus computing and storing L_p is prohibitive for large multilayer graphs. We present a method for negative integer values p < 0, leaving aside the limit case p → 0 as it requires particular treatment. The following is a brief overview of the proposed approach. Further details are available in the supplementary material.

Let A_1, . . . , A_T be symmetric positive definite matrices, φ : C → C defined by φ(z) = z^{1/p}, and L_p = T^{−1/p} φ(S_p), where S_p = A_1^p + · · · + A_T^p.
The proposed method consists of three main steps:\n\n1 + \u00b7\u00b7\u00b7 + Ap\n\n6\n\n 0 0.700.050.10255000.050.10255000.050.10255000.050.10255000.050.10255000.050.10255000.050.10255000.050.10255000.050.10255000.050.102550\fFigure 5: Mean execution time of 10 runs for different meth-\nods. L\u22121(ours) stands for the power mean Laplacian reg-\nularizer together with our proposed matrix-free contour in-\ntegral based method. We generate multilayer graphs with\ntwo layers, each with two classes of same size with param-\neters pin = 0.05 and pin = 0.025 and graphs of of sizes\n[0.5, 1, 2, 4, 8] \u00d7 104. Observe that our matrix free approach\nfor L\u22121 (solid blue curve) is competitive to state of the art ap-\nproaches as TSS[28], outperforming AGML[23], SGMI[13]\nand SMACD[9]. For TLMV[33] and SGMI we use our own\nimplementation.\n\n(cid:110)(cid:80)N\n\ni=1 \u03b2i(z2\n\ni I \u2212 Sp)\u22121Y\n\n(cid:111)\n\n,\n\n1. We solve the system (I + \u03bbLp)\u22121Y via a Krylov method (e.g. PCG or GMRES) with convergence\nrate O(( \u03ba2\u22121\n\u03ba2 )h/2) [25], where \u03ba = \u03bbmax(Lp)/\u03bbmin(Lp). At iteration h, this method projects\nthe problem onto the Krylov subspace spanned by {Y, \u03bbLpY, (\u03bbLp)2Y, . . . , (\u03bbLp)hY }, and\nef\ufb01ciently solve the projected problem.\n2. The previous step requires the matrix-vector product LpY = T \u22121/p\u03d5(Sp)Y which we compute\nby approximating the Cauchy integral form of the function \u03d5 with the trapezoidal rule in the\ncomplex plane [10]. Taking N suitable contour points and coef\ufb01cients \u03b20, . . . , \u03b2N , we have\n\n\u03d5N (Sp)Y = \u03b20Sp Im\n\np Y }. Since Sp =(cid:80)T\n\n(7)\nwhich has geometric convergence [10]: (cid:107)\u03d5(Sp)Y \u2212 \u03d5N (Sp)Y (cid:107) = O(e\u22122\u03c02N/(ln(M/m)+6)),\nwhere m, M are such that M \u2265 \u03bbmax(Sp) and m \u2264 \u03bbmin(Sp).\n3. The previous step requires to solve linear systems of the form (zI \u2212 Sp)\u22121Y . 
We solve each of these systems via a Krylov subspace method, projecting, at each iteration h, onto the subspace spanned by {Y, S_pY, S_p²Y, . . . , S_p^hY}. Since S_p = Σ_{i=1}^T A_i^{−|p|}, this problem reduces to computing |p| linear systems with A_i as coefficient matrix, for i = 1, . . . , T. Provided that A_1, . . . , A_T are sparse matrices, this is done efficiently using pcg with incomplete Cholesky preconditioners.

Notice that the method allows a high level of parallelism. In fact, the N (resp. |p|) linear system solves at step 2 (resp. 3) are independent and can be run in parallel. Moreover, note that the main task of the method is solving linear systems with Laplacian matrices, which can be solved linearly in the number of edges of the corresponding adjacency matrix. Hence, the proposed approach scales to large sparse graphs and is highly parallelizable. A time execution analysis is provided in Fig. 5, where we can see that the execution time of our approach is competitive with the state of the art such as TSS [28], outperforming AGML [23], SGMI [13] and SMACD [9].

6 Experiments on Real Datasets

In this section we compare the performance of the proposed approach with state of the art methods on real world datasets. We consider the following datasets: 3-sources [16], which consists of news articles that were covered by the news sources BBC, Reuters and Guardian; BBC [7] and BBC Sports [8] news articles; a dataset of Wikipedia articles with ten different classes [24]; the handwritten UCI digits dataset with six different sets of features; and the citation datasets CiteSeer [17], Cora [18] and WebKB (Texas) [5]. For each dataset we build the corresponding layer adjacency matrices by taking the symmetric k-nearest neighbour graph using the Pearson linear correlation as similarity measure (i.e. we take the k neighbours with highest correlation), and take the unweighted version of it.
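The graph construction described above can be sketched as follows. This is our own code, not the authors'; in particular, symmetrizing by taking the union of directed kNN edges is an assumption, since the text only says "symmetric k-nearest neighbour graph":

```python
import numpy as np

def knn_graph_from_correlation(X, k=10):
    """Unweighted symmetric kNN graph: connect each sample (row of X) to the k
    samples with highest Pearson correlation, then symmetrize by union (OR)."""
    C = np.corrcoef(X)            # Pearson correlation between rows of X
    np.fill_diagonal(C, -np.inf)  # exclude self-similarity from the neighbour search
    n = len(C)
    A = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(C[i])[-k:]  # the k most correlated neighbours of sample i
        A[i, nbrs] = 1.0
    return np.maximum(A, A.T)         # symmetric, unweighted adjacency matrix
```

One such adjacency matrix per feature view then serves as one layer of the multilayer graph.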
Datasets\nCiteSeer, Cora and WebKB have only two layers, where the \ufb01rst one is a \ufb01xed precomputed citation\nlayer, and the second one is the corresponding k-nearest neighbour graph built from document\nfeatures.\nAs baseline methods we consider: TSS [28] which identi\ufb01es an optimal linear combination of graph\nLaplacians, SGMI [13] which performs label propagation by sparse integration, TLMV [33] which is\na weighted arithmetic mean of adjacency matrices, CGL [1] which is a convex combination of the\npseudo inverse Laplacian kernel, AGML [23] which is a parameter-free method for optimal graph\nlayer weights, ZooBP [6] which is a fast approximation of Belief Propagation, and SMACD [9] which\nis a tensor factorization method designed for semi-supervised learning. Finally we set parameters\nfor TSS to (c = 10, c0 = 0.4), SMACD (\u03bb = 0.01)2, TLMV (\u03bb = 1), SGMI (\u03bb1 = 1, \u03bb2 = 10\u22123)\n\n2this is the default value in the code released by the authors: https://github.com/egujr001/SMACD\n\n7\n\n0.51248104100101102103104Mean time (sec.)\f3sources\n\nBBC\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\n1%\n29.8\n50.2\n91.5\n23.9\n31.0\n29.8\n34.4\n33.5\n28.4\n40.9\n\n1%\n25.6\n79.2\n77.8\n34.6\n33.8\n23.9\n31.9\n29.9\n23.8\n48.7\n\n1%\n28.9\n81.8\n73.6\n25.3\n30.8\n24.0\n36.0\n31.3\n30.5\n57.0\n\n1%\n46.0\n85.5\n75.6\n54.7\n54.7\n38.8\n57.3\n50.7\n43.2\n62.0\n\n5% 10% 15% 20% 25%\n16.5\n21.5\n19.8\n45.5\n91.3\n91.1\n22.0\n26.3\n15.3\n21.9\n23.9\n35.0\n17.9\n26.6\n14.6\n23.9\n20.0\n17.9\n29.1\n14.7\n\n20.8\n36.4\n91.2\n33.9\n21.3\n33.1\n25.4\n23.4\n21.8\n21.9\n\n15.5\n23.8\n90.7\n26.1\n15.0\n34.8\n19.1\n15.6\n17.2\n14.8\n\n20.3\n30.6\n90.9\n33.3\n19.8\n34.6\n24.4\n20.1\n22.0\n19.3\n\nBBCS\n\n5% 10% 15% 20% 
25%\n5.4\n12.6\n12.7\n51.6\n98.3\n80.6\n5.4\n17.4\n13.9\n6.2\n12.2\n13.2\n12.1\n19.6\n7.2\n15.0\n5.1\n11.6\n22.5\n6.1\n\n10.5\n34.9\n82.4\n12.1\n11.3\n14.1\n16.6\n13.5\n8.7\n14.2\n\n7.5\n23.4\n96.4\n7.0\n8.8\n12.3\n15.5\n10.6\n6.3\n9.1\n\n6.4\n16.5\n98.4\n6.0\n7.6\n13.1\n14.8\n8.7\n5.8\n7.8\n\nUCI\n\n5% 10% 15% 20% 25%\n12.7\n20.4\n46.7\n64.0\n81.9\n81.0\n17.2\n12.0\n13.0\n21.7\n15.6\n17.6\n48.8\n44.4\n13.2\n23.8\n11.9\n17.1\n33.8\n13.4\n\n16.3\n54.6\n90.0\n15.2\n17.6\n16.6\n50.9\n18.7\n13.8\n23.7\n\n13.7\n46.7\n86.2\n12.5\n14.1\n15.8\n50.2\n14.4\n12.3\n15.3\n\n14.4\n49.1\n90.0\n13.2\n15.1\n15.9\n50.4\n15.6\n12.6\n17.6\n\nCora\n\n5% 10% 15% 20% 25%\n20.6\n34.1\n40.0\n70.1\n76.7\n87.1\n16.5\n36.0\n26.2\n38.0\n27.7\n19.1\n38.5\n47.7\n25.6\n38.2\n31.8\n17.2\n22.3\n46.3\n\n28.8\n56.5\n78.7\n25.4\n32.9\n24.1\n43.0\n33.4\n24.5\n35.4\n\n22.5\n44.2\n81.0\n18.1\n27.6\n20.0\n40.1\n28.2\n18.8\n25.2\n\n25.8\n49.1\n78.7\n20.7\n30.2\n21.5\n41.8\n31.2\n21.1\n29.4\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\nTLMV\nCGL\n\nSMACD\nAGML\nZooBP\nTSS\nSGMI\n\nL1\nL-1\nL-10\n\n1%\n29.0\n72.5\n74.4\n60.0\n31.1\n40.4\n37.6\n31.3\n31.0\n51.6\n\n1%\n65.7\n87.3\n85.4\n71.3\n67.6\n87.7\n69.3\n68.2\n59.1\n66.9\n\n1%\n51.5\n89.3\n90.7\n47.3\n63.6\n58.5\n59.4\n56.3\n52.4\n68.6\n\n1%\n58.6\n80.4\n87.3\n56.5\n52.0\n60.9\n44.9\n58.5\n49.9\n52.3\n\n5% 10% 15% 20% 25%\n8.8\n19.3\n17.1\n52.3\n72.4\n73.5\n9.5\n34.2\n9.1\n20.1\n26.1\n19.7\n19.3\n28.9\n8.9\n22.8\n8.7\n17.0\n26.9\n9.5\n\n9.3\n22.0\n72.5\n11.0\n10.0\n19.8\n20.7\n10.2\n9.2\n10.3\n\n13.2\n36.1\n72.8\n18.6\n15.0\n20.9\n24.9\n17.4\n11.5\n16.6\n\n11.1\n27.4\n72.6\n13.1\n12.2\n20.1\n22.8\n13.5\n10.5\n12.8\n\nWikipedia\n\n5% 10% 15% 20% 
25%\n39.2\n56.8\n83.0\n83.0\n90.0\n85.6\n37.3\n66.6\n58.0\n39.8\n81.4\n84.7\n82.8\n84.8\n42.3\n61.1\n34.1\n52.3\n57.2\n34.9\n\n40.8\n83.0\n86.8\n38.4\n41.2\n82.3\n83.2\n44.1\n35.1\n36.3\n\n46.4\n82.5\n85.4\n48.1\n47.0\n83.3\n84.5\n53.6\n40.2\n43.2\n\n43.1\n82.2\n85.3\n42.1\n43.8\n81.9\n83.8\n48.3\n36.3\n38.7\n\nCiteseer\n\n5% 10% 15% 20% 25%\n30.3\n39.4\n40.9\n71.8\n68.9\n90.4\n32.3\n27.0\n32.2\n41.9\n38.4\n49.5\n39.2\n46.8\n34.7\n44.1\n29.5\n39.0\n54.6\n37.2\n\n31.6\n44.5\n66.8\n27.5\n33.8\n39.8\n40.5\n36.1\n30.9\n39.7\n\n36.5\n58.0\n67.0\n29.6\n38.7\n45.9\n44.0\n41.2\n35.6\n48.5\n\n33.7\n49.8\n65.5\n28.2\n35.8\n42.1\n42.3\n38.5\n32.6\n43.0\n\nWebKB\n\n5% 10% 15% 20% 25%\n48.2\n49.4\n89.2\n82.4\n87.2\n87.8\n46.8\n50.3\n33.5\n45.0\n48.7\n51.0\n39.7\n52.5\n44.4\n49.0\n45.5\n40.3\n39.5\n41.9\n\n47.6\n82.7\n87.8\n47.6\n36.4\n49.2\n40.3\n44.5\n39.9\n36.8\n\n47.2\n86.9\n87.4\n44.7\n38.5\n47.3\n34.9\n44.3\n39.5\n38.1\n\n45.6\n84.4\n87.2\n46.8\n38.7\n50.5\n41.9\n44.8\n40.7\n38.0\n\nTable 2: Experiments in real datasets. Notation: best performances are marked with bold fonts and\ngray background and second best performances with only gray background.\n\nand \u03bb = 0.1 for L1 and \u03bb = 10 for L\u22121 and L\u221210. We do not perform cross validation in our\nexperimental setting due to the large execution time in some of the methods here considered. Hence\nwe \ufb01x the parameters for each method in all experiments.\nWe \ufb01x nearest neighbourhood size to k = 10 and generate 10 samples of labeled nodes, where the\npercentage of labeled nodes per class is in the range {1%, 5%, 10%, 15%, 20%, 25%}. The average\ntest errors are presented in table 2, where the best (resp. second best ) performances are marked\nwith bold fonts and gray background (resp. with only gray background). 
We can see that the first and second best positions are in general taken by the power mean Laplacian regularizers L1, L−1, L−10; this is clear for all datasets except 3-sources. Moreover, in 77% of all cases L−1 achieves either the best or the second best performance, further verifying that our proposed approach based on the power mean Laplacian for semi-supervised learning on multilayer graphs is a competitive alternative to state of the art methods3.

3Communications with the authors of [9] could not clarify the bad performance of SMACD.

Acknowledgement P.M. and M.H. are supported by the DFG Cluster of Excellence "Machine Learning – New Perspectives for Science", EXC 2064/1, project number 390727645.

References

[1] A. Argyriou, M. Herbster, and M. Pontil. Combining graph Laplacians for semi-supervised learning. In NeurIPS, 2006.

[2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In COLT, 2004.

[3] K. V. Bhagwat and R. Subramanian. Inequalities between means of positive operators. Mathematical Proceedings of the Cambridge Philosophical Society, 83(3):393–401, 1978.

[4] O. Chapelle, B. Schölkopf, and A. Zien. Semi-Supervised Learning. The MIT Press, 2010.

[5] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In AAAI, 1998.

[6] D. Eswaran, S. Günnemann, C. Faloutsos, D. Makhija, and M. Kumar. ZooBP: Belief propagation for heterogeneous networks. In VLDB, 2017.

[7] D. Greene and P. Cunningham. Producing accurate interpretable clusters from high-dimensional data. In PKDD, 2005.

[8] D. Greene and P. Cunningham. A matrix factorization approach for integrating multiple data views. In ECML PKDD, 2009.

[9] E. Gujral and E. E. Papalexakis.
SMACD: Semi-supervised multi-aspect community detection. In SDM, 2018.

[10] N. Hale, N. J. Higham, and L. N. Trefethen. Computing A^α, log(A), and related matrix functions by contour integrals. SIAM Journal on Numerical Analysis, 46(5):2505–2523, 2008.

[11] S. Heimlicher, M. Lelarge, and L. Massoulié. Community detection in the labelled stochastic block model. arXiv:1209.2910, 2012.

[12] V. Kanade, E. Mossel, and T. Schramm. Global and local information in clustering labeled block models. IEEE Transactions on Information Theory, 62(10):5906–5917, 2016.

[13] M. Karasuyama and H. Mamitsuka. Multiple graph label propagation by sparse integration. IEEE Transactions on Neural Networks and Learning Systems, 24(12):1999–2012, 2013.

[14] T. Kato, H. Kashima, and M. Sugiyama. Robust label propagation on multiple networks. IEEE Transactions on Neural Networks, 20(1):35–44, Jan. 2009.

[15] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.

[16] J. Liu, C. Wang, J. Gao, and J. Han. Multi-view clustering via joint nonnegative matrix factorization. In SDM, 2013.

[17] Q. Lu and L. Getoor. Link-based classification. In ICML, 2003.

[18] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.

[19] P. Mercado, A. Gautier, F. Tudisco, and M. Hein. The power mean Laplacian for multilayer graph clustering. In AISTATS, 2018.

[20] P. Mercado, F. Tudisco, and M. Hein. Clustering signed networks with the geometric mean of Laplacians. In NeurIPS, 2016.

[21] P. Mercado, F. Tudisco, and M. Hein. Spectral clustering of signed graphs via matrix power means. In ICML, 2019.

[22] E. Mossel and J. Xu. Local algorithms for block models with side information. In ITCS, 2016.

[23] F. Nie, J. Li, and X. Li.
Parameter-free auto-weighted multiple graph learning: A framework for multiview clustering and semi-supervised classification. In IJCAI, 2016.

[24] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle, G. R. Lanckriet, R. Levy, and N. Vasconcelos. A new approach to cross-modal multimedia retrieval. In ACM Multimedia, 2010.

[25] Y. Saad and M. H. Schultz. GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing, 7(3):856–869, 1986.

[26] A. Saade, F. Krzakala, M. Lelarge, and L. Zdeborová. Fast randomized semi-supervised clustering. Journal of Physics: Conference Series, 1036:012015, 2018.

[27] A. Subramanya and P. P. Talukdar. Graph-Based Semi-Supervised Learning. Morgan & Claypool Publishers, 2014.

[28] K. Tsuda, H. Shin, and B. Schölkopf. Fast protein classification with multiple networks. Bioinformatics, 21(2):59–65, 2005.

[29] K. Viswanathan, S. Sachdeva, A. Tomkins, and S. Ravi. Improved semi-supervised learning with multiple graphs. In AISTATS, 2019.

[30] Z. Yang, W. W. Cohen, and R. Salakhutdinov. Revisiting semi-supervised learning with graph embeddings. In ICML, 2016.

[31] J. Ye and L. Akoglu. Robust semi-supervised learning on multiple networks with noise. In PKDD, 2018.

[32] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NeurIPS, 2003.

[33] D. Zhou and C. J. Burges. Spectral clustering and transductive learning with multiple views. In ICML, 2007.

[34] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, 2002.

[35] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions.
In ICML, 2003.