{"title": "See the Tree Through the Lines: The Shazoo Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 1584, "page_last": 1592, "abstract": "Predicting the nodes of a given graph is a fascinating theoretical problem with applications in several domains. Since graph sparsification via spanning trees retains enough information while making the task much easier, trees are an important special case of this problem. Although it is known how to predict the nodes of an unweighted tree in a nearly optimal way, in the weighted case a fully satisfactory algorithm is not available yet. We fill this hole and introduce an efficient node predictor, Shazoo, which is nearly optimal on any weighted tree. Moreover, we show that Shazoo can be viewed as a common nontrivial generalization of both previous approaches for unweighted trees and weighted lines. Experiments on real-world datasets confirm that Shazoo performs well in that it fully exploits the structure of the input tree, and gets very close to (and sometimes better than) less scalable energy minimization methods.", "full_text": "See the Tree Through the Lines:\n\nThe Shazoo Algorithm\u2217\n\nFabio Vitale\n\nDSI, University of Milan, Italy\nfabio.vitale@unimi.it\n\nNicol`o Cesa-Bianchi\n\nDSI, University of Milan, Italy\n\nnicolo.cesa-bianchi@unimi.it\n\nClaudio Gentile\n\nDICOM, University of Insubria, Italy\n\nclaudio.gentile@uninsubria.it\n\nGiovanni Zappella\n\nDept. of Mathematics, Univ. of Milan, Italy\n\ngiovanni.zappella@unimi.it\n\nAbstract\n\nPredicting the nodes of a given graph is a fascinating theoretical problem with ap-\nplications in several domains. Since graph sparsi\ufb01cation via spanning trees retains\nenough information while making the task much easier, trees are an important\nspecial case of this problem. 
Although it is known how to predict the nodes of an unweighted tree in a nearly optimal way, in the weighted case a fully satisfactory algorithm is not available yet. We fill this hole and introduce an efficient node predictor, SHAZOO, which is nearly optimal on any weighted tree. Moreover, we show that SHAZOO can be viewed as a common nontrivial generalization of both previous approaches for unweighted trees and weighted lines. Experiments on real-world datasets confirm that SHAZOO performs well in that it fully exploits the structure of the input tree, and gets very close to (and sometimes better than) less scalable energy minimization methods.\n\n1 Introduction\n\nPredictive analysis of networked data is a fast-growing research area whose application domains include document networks, online social networks, and biological networks. In this work we view networked data as weighted graphs, and focus on the task of node classification in the transductive setting, i.e., when the unlabeled graph is available beforehand. Standard transductive classification methods, such as label propagation [2, 3, 18], work by optimizing a cost or energy function defined on the graph, which includes the training information as labels assigned to training nodes. Although these methods perform well in practice, they are often computationally expensive, and have performance guarantees that require statistical assumptions on the selection of the training nodes.\nA general approach to sidestep the above computational issues is to sparsify the graph to the largest possible extent, while retaining much of its spectral properties —see, e.g., [5, 6, 12, 16]. Inspired by [5, 6], this paper reduces the problem of node classification from graphs to trees by extracting suitable spanning trees of the graph, which can be done quickly in many cases. 
The advantage of performing this reduction is that node prediction is much easier on trees than on graphs. This fact has recently led to the design of very scalable algorithms with nearly optimal performance guarantees in the online transductive model, which comes with no statistical assumptions. Yet, the current results in node classification on trees are not satisfactory. The TREEOPT strategy of [5] is optimal to within constant factors, but only on unweighted trees. No equivalent optimality results are available for general weighted trees. To the best of our knowledge, the only other comparable result is WTA by [6], which is optimal (within log factors) only on weighted lines. In fact, WTA can still be applied to weighted trees by exploiting an idea contained in [9]. This is based on linearizing the tree via a depth-first visit. Since linearization loses most of the structural information of the tree, this approach yields suboptimal mistake bounds. This theoretical drawback is also confirmed by empirical performance: throwing away the tree structure negatively affects the practical behavior of the algorithm on real-world weighted graphs.\nThe importance of weighted graphs, as opposed to unweighted ones, is suggested by many practical scenarios where the nodes carry more information than just labels, e.g., vectors of feature values. A natural way of leveraging this side information is to set the weight on the edge linking two nodes to be some function of the similarity between the vectors associated with these nodes.\n\n∗This work was supported in part by Google Inc. through a Google Research Award, and by the PASCAL2 Network of Excellence under EC grant 216886. This publication only reflects the authors' views. 
In this work, we\nbridge the gap between the weighted and unweighted cases by proposing a new prediction strategy,\ncalled SHAZOO, achieving a mistake bound that depends on the detailed structure of the weighted\ntree. We carry out the analysis using a notion of learning bias different from the one used in [6] and\nmore appropriate for weighted graphs. More precisely, we measure the regularity of the unknown\nnode labeling via the weighted cutsize induced by the labeling on the tree (see Section 3 for a precise\nde\ufb01nition). This replaces the unweighted cutsize that was used in the analysis of WTA. When the\nweighted cutsize is used, a cut edge violates this inductive bias in proportion to its weight. This\nmodi\ufb01ed bias does not prevent a fair comparison between the old algorithms and the new one:\nSHAZOO specializes to TREEOPT in the unweighted case, and to WTA when the input tree is a\nweighted line. By specializing SHAZOO\u2019s analysis to the unweighted case we recover TREEOPT\u2019s\noptimal mistake bound. When the input tree is a weighted line, we recover WTA\u2019s mistake bound\nexpressed through the weighted cutsize instead of the unweighted one. The effectiveness of SHAZOO\non any tree is guaranteed by a corresponding lower bound (see Section 3).\nSHAZOO can be viewed as a common nontrivial generalization of both TREEOPT and WTA. Obtain-\ning this generalization while retaining and extending the optimality properties of the two algorithms\nis far from being trivial from a conceptual and technical standpoint. Since SHAZOO works in the\nonline transductive model, it can easily be applied to the more standard train/test (or \u201cbatch\u201d) trans-\nductive setting: one simply runs the algorithm on an arbitrary permutation of the training nodes, and\nobtains a predictive model for all test nodes. However, the implementation might take advantage\nof knowing the set of training nodes beforehand. 
For this reason, we present two implementations\nof SHAZOO, one for the online and one for the batch setting. Both implementations result in fast\nalgorithms. In particular, the batch one is linear in |V |. This is achieved by a fast algorithm for\nweighted cut minimization on trees, a procedure which lies at the heart of SHAZOO.\nFinally, we test SHAZOO against WTA, label propagation, and other competitors on real-world\nweighted graphs. In almost all cases (as expected), we report improvements over WTA due to the\nbetter sensitivity to the graph structure. In some cases, we see that SHAZOO even outperforms stan-\ndard label propagation methods. Recall that label propagation has a running time per prediction\nwhich is proportional to |E|, where E is the graph edge set. On the contrary, SHAZOO can typically\nbe run in constant amortized time per prediction by using Wilson\u2019s algorithm for sampling random\nspanning trees [17]. By disregarding edge weights in the initial sampling phase, this algorithm is\nable to draw a random (unweighted) spanning tree in time proportional to |V | on most graphs. Our\nexperiments reveal that using the edge weights only in the subsequent prediction phase causes in\npractice only a minor performance degradation.\n2 Preliminaries and basic notation\nLet T = (V, E, W ) be an undirected and weighted tree with |V | = n nodes, positive edge weights\nWi,j > 0 for (i, j) \u2208 E, and Wi,j = 0 for (i, j) /\u2208 E. A binary labeling of T is any assignment\ny = (y1, . . . , yn) \u2208 {\u22121, +1}n of binary labels to its nodes. We use (T, y) to denote the resulting\nlabeled weighted tree. The online learning protocol for predicting (T, y) is de\ufb01ned as follows. The\nlearner is given T while y is kept hidden. The nodes of T are presented to the learner one by one,\naccording to an unknown and arbitrary permutation i1, . . . , in of V . At each time step t = 1, . . . 
, n, node i_t is presented and the learner must issue a prediction ŷ_{i_t} ∈ {−1, +1} for the label y_{i_t}. Then y_{i_t} is revealed and the learner knows whether a mistake occurred. The learner's goal is to minimize the total number of prediction mistakes.\nFollowing previous works [10, 9, 5, 6], we measure the regularity of a labeling y of T in terms of φ-edges, where a φ-edge for (T, y) is any (i, j) ∈ E such that y_i ≠ y_j. The overall amount of irregularity in a labeled tree (T, y) is the weighted cutsize Φ^W = Σ_{(i,j)∈E_φ} W_{i,j}, where E_φ ⊆ E is the subset of φ-edges in the tree. We use the weighted cutsize as our learning bias, that is, we want to design algorithms whose predictive performance scales with Φ^W. Unlike the φ-edge count Φ = |E_φ|, which is a good measure of regularity for unweighted graphs, the weighted cutsize takes the edge weight W_{i,j} into account when measuring the irregularity of a φ-edge (i, j). In the sequel, when we measure the distance between any pair of nodes i and j on the input tree T we always use the resistance distance metric d, that is, d(i, j) = Σ_{(r,s)∈π(i,j)} 1/W_{r,s}, where π(i, j) is the unique path connecting i to j.\n\n3 A lower bound for weighted trees\n\nIn this section we show that the weighted cutsize can be used as a lower bound on the number of online mistakes made by any algorithm on any tree. In order to do so (and unlike previous papers on this specific subject —see, e.g., [6]), we need to introduce a more refined notion of adversarial "budget". 
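The two quantities defined above can be made concrete with a short sketch. This is an illustrative example, not the authors' code; the tree representation and function names are our own.

```python
# Illustrative sketch: the weighted cutsize Phi_W and the resistance distance
# d(i, j) on a small labeled tree. Edges are stored as {(i, j): weight} with
# i < j; these names and data structures are assumptions of this example.

def weighted_cutsize(edges, labels):
    """Phi_W: total weight of phi-edges, i.e. edges whose endpoints disagree."""
    return sum(w for (i, j), w in edges.items() if labels[i] != labels[j])

def resistance_distance(edges, path):
    """d(i, j) = sum of 1/W_{r,s} over the unique path pi(i, j), given here
    explicitly as a list of consecutive nodes."""
    return sum(1.0 / edges[tuple(sorted((r, s)))]
               for r, s in zip(path, path[1:]))

# A 4-node path 0-1-2-3 with weights 2, 1, 4 and labels +1, +1, -1, -1:
edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 4.0}
labels = {0: +1, 1: +1, 2: -1, 3: -1}
print(weighted_cutsize(edges, labels))        # only (1, 2) is a phi-edge: 1.0
print(resistance_distance(edges, [0, 1, 2]))  # 1/2 + 1/1 = 1.5
```

Note how a light φ-edge (weight 1.0) contributes little to Φ^W but makes its endpoints far apart in resistance distance, which is exactly the bias the paper exploits.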
Given T = (V, E, W), let ξ(M) be the maximum number of edges of T such that the sum of their weights does not exceed M, that is, ξ(M) = max{ |E'| : E' ⊆ E, Σ_{(i,j)∈E'} W_{i,j} ≤ M }. We have the following simple lower bound (all proofs are omitted from this extended abstract).\n\nTheorem 1 For any weighted tree T = (V, E, W) there exists a randomized label assignment to V such that any algorithm can be forced to make at least ξ(M)/2 online mistakes in expectation, while Φ^W ≤ M.\n\nSpecializing [6, Theorem 1] to trees gives the lower bound K/2 under the constraint Φ ≤ K ≤ |V|. The main difference between the two bounds is the measure of label regularity being used: Whereas Theorem 1 uses Φ^W, which depends on the weights, [6, Theorem 1] uses the weight-independent quantity Φ. This dependence of the lower bound on the edge weights is consistent with our learning bias, stating that a heavy φ-edge violates the bias more than a light one. Since ξ is nondecreasing, the lower bound implies a number of mistakes of at least ξ(Φ^W)/2. Note that ξ(Φ^W) ≥ Φ for any labeled tree (T, y). Hence, whereas a constraint K on Φ implies forcing at least K/2 mistakes, a constraint M on Φ^W allows the adversary to force a potentially larger number of mistakes.\nIn the next section we describe an algorithm whose mistake bound nearly matches the above lower bound on any weighted tree when using ξ(Φ^W) as the measure of label regularity.\n\n4 The Shazoo algorithm\n\nIn this section we introduce the SHAZOO algorithm, and relate it to previously proposed methods for online prediction on unweighted trees (TREEOPT from [5]) and weighted line graphs (WTA from [6]). In fact, SHAZOO is optimal on any weighted tree, and reduces to TREEOPT on unweighted trees and to WTA on weighted line graphs. 
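Since ξ(M) maximizes the number of edges under a total-weight budget, it can be computed greedily by taking the lightest edges first. The following is a minimal sketch under our own naming, not part of the paper:

```python
# A hedged sketch of the adversarial budget xi(M): the largest number of tree
# edges whose total weight stays within M. Any edge subset is allowed, so
# greedily taking the lightest edges first is optimal.

def xi(weights, M):
    total, count = 0.0, 0
    for w in sorted(weights):
        if total + w > M:
            break
        total += w
        count += 1
    return count

weights = [0.5, 0.5, 3.0, 10.0]  # edge weights of some tree (toy example)
print(xi(weights, 4.0))  # 0.5 + 0.5 + 3.0 <= 4.0, so 3 edges fit
```

On a tree whose φ-edges are light, ξ(Φ^W) can greatly exceed the φ-edge count Φ, which is why the adversary can force more mistakes under the weighted constraint.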
Since TREEOPT and WTA are optimal on any unweighted tree\nand any weighted line graph, respectively, SHAZOO necessarily contains elements of both of these\nalgorithms.\nIn order to understand our algorithm, we now de\ufb01ne some relevant structures of the input tree T . See\nFigure 1 (left) for an example. These structures evolve over time according to the set of observed\nlabels. First, we call revealed a node whose label has already been observed by the online learner;\notherwise, a node is unrevealed. A fork is any unrevealed node connected to at least three different\nrevealed nodes by edge-disjoint paths. A hinge node is either a revealed node or a fork. A hinge\ntree is any component of the forest obtained by removing from T all edges incident to hinge nodes;\nhence any fork or labeled node forms a 1-node hinge tree. When a hinge tree H contains only one\nhinge node, a connection node for H is the node contained in H. In all other cases, we call a\nconnection node for H any node outside H which is adjacent to a node in H. A connection fork is\na connection node which is also a fork. Finally, a hinge line is any path connecting two hinge nodes\nsuch that no internal node is a hinge node.\nGiven an unrevealed node i and a label value y \u2208 {\u22121, +1}, the cut function cut(i, y) is the value\nof the minimum weighted cutsize of T over all labelings y \u2208 {\u22121, +1}n consistent with the labels\nseen so far and such that yi = y. De\ufb01ne \u2206(i) = cut(i,\u22121) \u2212 cut(i, +1) if i is unrevealed, and\n\u2206(i) = yi, otherwise. The algorithm\u2019s pseudocode is given in Algorithm 1. At time t, in order\nto predict the label yit of node it, SHAZOO calculates \u2206(i) for all connection nodes i of H(it),\nwhere H(it) is the hinge tree containing it. 
Then the algorithm predicts y_{i_t} using the label of the connection node i of H(i_t) which is closest to i_t and such that ∆(i) ≠ 0 (recall from Section 2 that all distances/lengths are measured using the resistance metric). Ties are broken arbitrarily. If ∆(i) = 0 for all connection nodes i in H(i_t) then SHAZOO predicts a default value (−1 in the pseudocode). If i_t is a fork (which is also a hinge node), then H(i_t) = {i_t}. In this case, i_t is a connection node of H(i_t), and obviously the one closest to itself. Hence, in this case SHAZOO predicts y_{i_t} simply by ŷ_{i_t} = sgn(∆(i_t)). See Figure 1 (middle) for an example.\n\nFigure 1: Left: An input tree. Revealed nodes are dark grey, forks are doubly circled, and hinge lines have thick black edges. The hinge trees not containing hinge nodes (i.e., the ones that are not singletons) are enclosed by dotted lines. The dotted arrows point to the connection node(s) of such hinge trees. Middle: The predictions of SHAZOO on the nodes of a hinge tree. The numbers on the edges denote edge weights. At a given time t, SHAZOO uses the value of ∆ on the two hinge nodes (the doubly circled ones, which are also forks in this case), and is required to issue a prediction on node i_t (the black node in this figure). Since i_t is between a positive ∆ hinge node and a negative ∆ hinge node, SHAZOO goes with the one which is closer in resistance distance, hence predicting ŷ_{i_t} = −1. Right: A simple example where the mincut prediction strategy does not work well in the weighted case. In this example, mincut mispredicts all labels, yet Φ = 1, and the ratio of Φ^W to the total weight of all edges is about 1/|V|. The labels to be predicted are presented according to the numbers on the left of each node. Edge weights are also displayed, where a is a very small constant. 
Algorithm 1: SHAZOO\nfor t = 1 . . . n\n  Let C(H(i_t)) be the set of the connection nodes i of H(i_t) for which ∆(i) ≠ 0\n  if C(H(i_t)) ≠ ∅\n    Let j be the node of C(H(i_t)) closest to i_t\n    Set ŷ_{i_t} = sgn(∆(j))\n  else\n    Set ŷ_{i_t} = −1 (default value)\n\nOn unweighted trees, computing ∆(i) for a connection node i reduces to the Fork Label Estimation Procedure in [5, Lemma 13]. On the other hand, predicting with the label of the connection node closest to i_t in resistance distance is reminiscent of the nearest-neighbor prediction of WTA on weighted line graphs [6]. In fact, as in WTA, this enables the algorithm to take advantage of labelings whose φ-edges are lightly weighted. An important limitation of WTA is that it linearizes the input tree. On the one hand, this greatly simplifies the analysis of nearest-neighbor prediction; on the other hand, it prevents exploiting the structure of T, thereby causing logarithmic slacks in the upper bound of WTA. The TREEOPT algorithm, instead, performs better when the unweighted input tree is very different from a line graph (more precisely, when the input tree cannot be decomposed into long edge-disjoint paths, e.g., a star graph). Indeed, TREEOPT's upper bound does not suffer from logarithmic slacks, and is tight up to constant factors on any unweighted tree. Similar to TREEOPT, SHAZOO does not linearize the input tree and extends to the weighted case TREEOPT's superior performance, as also confirmed by the experimental comparison reported in Section 6.\nIn Figure 1 (right) we show an example that highlights the importance of using the ∆ function to compute the fork labels. 
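The prediction step of Algorithm 1 can be sketched in a few lines, assuming the ∆ values and resistance distances of the connection nodes are already available (they are the expensive part). The dictionary layout below is our own illustration, not the authors' implementation:

```python
# A sketch of SHAZOO's prediction rule: among the connection nodes of the
# hinge tree H(i_t) with Delta != 0, copy the sign of Delta at the closest
# one (in resistance distance); if none exists, predict a default value.
# `connection` maps each connection node to a (delta, distance) pair.

def predict(connection, default=-1):
    voters = {i: (d, dist) for i, (d, dist) in connection.items() if d != 0}
    if not voters:
        return default
    j = min(voters, key=lambda i: voters[i][1])  # closest node, ties arbitrary
    return 1 if voters[j][0] > 0 else -1

# i_t lies between a positive fork at distance 2.0 and a negative fork at
# distance 0.5, as in Figure 1 (middle): the closer, negative fork wins.
print(predict({'f1': (+3.0, 2.0), 'f2': (-1.0, 0.5)}))  # -1
```

Note that the magnitude of ∆ does not enter the rule, only its sign and the resistance distance to i_t.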
Since ∆ predicts a fork i_t with the label that minimizes the weighted cutsize of T consistent with the revealed labels, one may wonder whether computing ∆ through mincut based on the number of φ-edges (rather than their weighted sum) could be an effective prediction strategy. Figure 1 (right) illustrates an example of a simple tree where such a ∆ mispredicts the labels of all nodes, even when both Φ^W and Φ are small.\n\nRemark 1 We would like to stress that SHAZOO can also be used to predict the nodes of an arbitrary graph by first drawing a random spanning tree T of the graph, and then predicting optimally on T —see, e.g., [5, 6]. The resulting mistake bound is simply the expected value of SHAZOO's mistake bound over the random draw of T. By using a fast spanning tree sampler [17], the involved computational overhead amounts to constant amortized time per node prediction on "most" graphs.\n\nRemark 2 In certain real-world input graphs, the presence of an edge linking two nodes may also carry information about the extent to which the two nodes are dissimilar, rather than similar. This information can be encoded by the sign of the weight, and the resulting network is called a signed graph. The regularity measure is naturally extended to signed graphs by counting the weight of frustrated edges (e.g., [7]), where (i, j) is frustrated if y_i y_j ≠ sgn(w_{i,j}). Many of the existing algorithms for node classification [18, 9, 10, 5, 8, 6] can in principle be run on signed graphs. However, the computational cost may not always be preserved. For example, mincut [4] is in general NP-hard when the graph is signed [13]. Since our algorithm sparsifies the graph using trees, it can be run efficiently even in the signed case. 
We just need to re-define the ∆ function as ∆(i) = fcut(i, −1) − fcut(i, +1), where fcut is the minimum total weight of frustrated edges consistent with the labels seen so far. The argument contained in Section 5 for positive edge weights (see, e.g., Eq. (1) therein) allows us to show that also this version of ∆ can be computed efficiently. The prediction rule has to be re-defined as well: we count the parity of the number z of negative-weighted edges along the path connecting i_t to the closest node j ∈ C(H(i_t)), i.e., ŷ_{i_t} = (−1)^z sgn(∆(j)).\n\nRemark 3 In [5] the authors note that TREEOPT approximates a version space (Halving) algorithm on the set of tree labelings. Interestingly, SHAZOO is also an approximation to a more general Halving algorithm for weighted trees. This generalized Halving gives a weight to each labeling consistent with the labels seen so far and with the sign of ∆(f) for each fork f. These weighted labelings, which depend on the weights of the φ-edges generated by each labeling, are used for computing the predictions. One can show (details omitted due to space limitations) that this generalized Halving algorithm has a mistake bound within a constant factor of SHAZOO's.\n\n5 Mistake bound analysis and implementation\n\nWe now show that SHAZOO is nearly optimal on every weighted tree T. We obtain an upper bound in terms of Φ^W and the structure of T, nearly matching the lower bound of Theorem 1. We now give some auxiliary notation that is strictly needed for stating the mistake bound.\nGiven a labeled tree (T, y), a cluster is any maximal subtree whose nodes have the same label. An in-cluster line graph is any line graph that is entirely contained in a single cluster. Finally, given a line graph L, we set R^W_L = Σ_{(i,j)∈L} 1/W_{i,j}, i.e., the (resistance) distance between its terminal nodes.\n\nTheorem 2 For any labeled and weighted tree (T, y), there exists a set L_T of O(ξ(Φ^W)) edge-disjoint in-cluster line graphs such that the number of mistakes made by SHAZOO is at most of the order of\n\nΣ_{L∈L_T} min{ |L|, 1 + ⌊log(1 + Φ^W R^W_L)⌋ }.\n\nThe above mistake bound depends on the tree structure through L_T. The sum contains O(ξ(Φ^W)) terms, each one being at most logarithmic in the scale-free products Φ^W R^W_L. The bound is governed by the same key quantity ξ(Φ^W) occurring in the lower bound of Theorem 1. However, Theorem 2 also shows that SHAZOO can take advantage of trees that cannot be covered by long line graphs. For example, if the input tree T is a weighted line graph, then it is likely to contain long in-cluster lines. Hence, the factor multiplying ξ(Φ^W) may be of the order of log(1 + Φ^W R^W_L). If, instead, T has constant diameter (e.g., a star graph), then the in-cluster lines can only contain a constant number of nodes, and the number of mistakes can never exceed O(ξ(Φ^W)). This is a log factor improvement over WTA which, by its very nature, cannot exploit the structure of the tree it operates on.¹\nAs for the implementation, we start by describing a method for calculating cut(v, y) for any unlabeled node v and label value y. Let T^v be the maximal subtree of T rooted at v, such that no internal node is revealed. For any node i of T^v, let T^v_i be the subtree of T^v rooted at i. Let Φ^v_i(y) be the minimum weighted cutsize of T^v_i consistent with the revealed nodes and such that y_i = y. Since ∆(v) = cut(v, −1) − cut(v, +1) = Φ^v_v(−1) − Φ^v_v(+1), our goal is to compute Φ^v_v(y). It is easy to see by induction that the quantity Φ^v_i(y) can be recursively defined as follows, where C^v_i is the set of all children of i in T^v, and Y_j ≡ {y_j} if y_j is revealed, and Y_j ≡ {−1, +1} otherwise:²\n\nΦ^v_i(y) = Σ_{j∈C^v_i} min_{y'∈Y_j} ( Φ^v_j(y') + I{y' ≠ y} w_{i,j} )  if i is an internal node of T^v, and Φ^v_i(y) = 0 otherwise.  (1)\n\nNow, Φ^v_v(y) can be computed through a simple depth-first visit of T^v. In all backtracking steps of this visit the algorithm uses (1) to compute Φ^v_i(y) for each node i, the values Φ^v_j(y) for all children j of i being calculated during the previous backtracking steps. The total running time is therefore linear in the number of nodes of T^v.\nNext, we describe the basic implementation of SHAZOO for the online setting. A batch learning implementation will be given at the end of this section. The online implementation is made up of three steps.\n1. Find the hinge nodes of subtree T^{i_t}. Recall that a hinge node is either a fork or a revealed node. Observe that a fork is incident to at least three nodes lying on different hinge lines. Hence, in this step we perform a depth-first visit of T^{i_t}, marking each node lying on a hinge line. In order to accomplish this task, it suffices to single out all forks by marking each labeled node and, recursively, each parent of a marked node of T^{i_t}. At the end of this process we are able to single out the forks by counting the number of edges (i, j) of each marked node i such that j has been marked, too. The remaining hinge nodes are the leaves of T^{i_t} whose labels have currently been revealed.\n2. Compute sgn(∆(i)) for all connection forks of H(i_t). From the previous step we can easily find the connection node(s) of H(i_t). Then, we simply exploit the above-described technique for computing the cut function, obtaining sgn(∆(i)) for all connection forks i of H(i_t).\n3. Propagate the labels of the nodes of C(H(i_t)) (only if i_t is not a fork). We perform a visit of H(i_t) starting from every node r ∈ C(H(i_t)). During these visits, we mark each node j of H(i_t) with the label of r computed in the previous step, together with the length of π(r, j), which is what we need for predicting any label of H(i_t) at the current time step.\nThe overall running time is dominated by the first step and the calculation of ∆(i). Hence the worst case running time is proportional to Σ_{t≤|V|} |V(T^{i_t})|. This quantity can be quadratic in |V|, though this is rarely encountered in practice if the node presentation order is not adversarial. For example, it is easy to show that in a line graph, if the node presentation order is random, then the total time is of the order of |V| log |V|.\n\n¹One might wonder whether an arbitrarily large gap between upper (Theorem 2) and lower (Theorem 1) bounds exists due to the extra factors depending on Φ^W R^W_L. One way to get around this is to follow the analysis of WTA in [6]. Specifically, we can adapt here the more general analysis from that paper (see Lemma 2 therein) that allows us to drop, for any integer K, the resistance contribution of K arbitrary non-φ edges of the line graphs in L_T (thereby reducing R^W_L for any L containing any of these edges) at the cost of increasing the mistake bound by K. The details will be given in the full version of this paper. 
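The depth-first computation of cut(v, y) via Eq. (1) can be sketched directly. This is a minimal illustration under our own naming (adjacency dicts, a `revealed` map), not the authors' implementation; revealed nodes act as leaves of T^v, matching the definition above:

```python
# A minimal sketch of Eq. (1): computing cut(v, y) = Phi^v_v(y) by a
# depth-first visit, and from it Delta(v) = cut(v, -1) - cut(v, +1).
# `adj` is {node: {neighbor: weight}}; `revealed` maps nodes to seen labels.

def cut(adj, revealed, v, y):
    def phi(i, parent, y_i):
        total = 0.0
        for j, w in adj[i].items():
            if j == parent:
                continue
            # Y_j = {y_j} if revealed, {-1, +1} otherwise; revealed nodes
            # are leaves of T^v, so the recursion stops there (phi = 0).
            Yj = [revealed[j]] if j in revealed else [-1, +1]
            total += min((0.0 if j in revealed else phi(j, i, yp))
                         + (w if yp != y_i else 0.0) for yp in Yj)
        return total
    return phi(v, None, y)

def delta(adj, revealed, v):
    return cut(adj, revealed, v, -1) - cut(adj, revealed, v, +1)

# Star with center 0: two heavy +1 leaves and one light -1 leaf revealed.
adj = {0: {1: 2.0, 2: 2.0, 3: 1.0}, 1: {0: 2.0}, 2: {0: 2.0}, 3: {0: 1.0}}
revealed = {1: +1, 2: +1, 3: -1}
print(cut(adj, revealed, 0, +1))  # only the -1 leaf disagrees: 1.0
print(cut(adj, revealed, 0, -1))  # both +1 leaves disagree: 4.0
print(delta(adj, revealed, 0))    # 3.0 > 0, so the fork is predicted +1
```

The sketch recomputes subtrees on every call; the linear-time schedule described in the text instead reuses the child values gathered during the backtracking steps of a single visit.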
For a star graph the total time complexity is always linear in |V|, even on adversarial orders.\nIn many real-world scenarios, one is interested in the more standard problem of predicting the labels of a given subset of test nodes based on the available labels of another subset of training nodes. Building on the above online implementation, we now derive an implementation of SHAZOO for this train/test (or "batch learning") setting. We first show that computing |Φ^i_i(+1)| and |Φ^i_i(−1)| for all unlabeled nodes i in T takes O(|V|) time. This allows us to compute sgn(∆(v)) for all forks v in O(|V|) time, and then use the first and the third steps of the online implementation. Overall, we show that predicting all labels in the test set takes O(|V|) time.\nConsider tree T^i as rooted at i. Given any unlabeled node i, we perform a visit of T^i starting at i. During the backtracking steps of this visit we use (1) to calculate Φ^i_j(y) for each node j in T^i and label y ∈ {−1, +1}. Observe now that for any pair i, j of adjacent unlabeled nodes and any label y ∈ {−1, +1}, once we have obtained Φ^i_i(y), Φ^i_j(+1) and Φ^i_j(−1), we can compute Φ^j_i(y) in constant time, as Φ^j_i(y) = Φ^i_i(y) − min_{y'∈{−1,+1}} ( Φ^i_j(y') + I{y' ≠ y} w_{i,j} ). In fact, all children of j in T^i are descendants of i, while the children of i in T^i (but j) are descendants of j in T^j. SHAZOO computes Φ^j_i(y) for all child nodes j of i in T^i, and uses this value for computing Φ^j_j(y). Generalizing this argument, it is easy to see that in the next phase we can compute Φ^k_k(y) in constant time for all nodes k of T^i such that, for all ancestors u of k and all y ∈ {−1, +1}, the values of Φ^u_u(y) have previously been computed.\nThe time for computing Φ^s_s(y) for all nodes s of T^i and any label y is therefore linear in the time of performing a breadth-first (or depth-first) visit of T^i, i.e., linear in the number of nodes of T^i. Since each labeled node with degree d is part of at most d trees T^i for some i, we have that the total number of nodes of all distinct (edge-disjoint) trees T^i across i ∈ V is linear in |V|.\nFinally, we need to propagate the connection node labels of each hinge tree as in the third step of the online implementation. Since also this last step takes linear time, we conclude that the total time for predicting all labels is linear in |V|.\n\n²The recursive computations contained in this section are reminiscent of the sum-product algorithm [11].\n\n6 Experiments\n\nWe tested our algorithm on a number of real-world weighted graphs from different domains (character recognition, text categorization, bioinformatics, Web spam detection) against the following baselines:\nOnline Majority Vote (OMV). An intuitive and fast algorithm for sequentially predicting the node labels is a weighted majority vote over the labels of the adjacent nodes seen so far. Namely, OMV predicts y_{i_t} through the sign of Σ_s y_{i_s} w_{i_s,i_t}, where s ranges over s < t such that (i_s, i_t) ∈ E. Both the total time and space required by OMV are Θ(|E|).\nLabel Propagation (LABPROP). LABPROP [18, 2, 3] is a batch transductive learning method computed by solving a system of linear equations, which requires total time of the order of |E| × |V|. This relatively high computational cost should be taken into account when comparing LABPROP to faster online algorithms. 
Recall that OMV can be viewed as a fast "online approximation" to LABPROP.\nWeighted Tree Algorithm (WTA). As explained in the introductory section, WTA can be viewed as a special case of SHAZOO. When the input graph is not a line, WTA turns it into a line by first extracting a spanning tree of the graph, and then linearizing it. The implementation described in [6] runs in constant amortized time per prediction whenever the spanning tree sampler runs in time Θ(|V|).\nThe Graph Perceptron algorithm [10] is another readily available baseline. This algorithm has been excluded from our comparison because it does not seem to be very competitive in terms of performance (see, e.g., [6]), and is also computationally expensive.\nIn our experiments, we combined SHAZOO and WTA with spanning trees generated in different ways (note that OMV and LABPROP do not need to extract spanning trees from the input graph).\nRandom Spanning Tree (RST). Following Ch. 4 of [12], we draw a weighted spanning tree with probability proportional to the product of its edge weights. We also tested our algorithms combined with random spanning trees generated uniformly at random ignoring the edge weights (i.e., the weights were only used to compute predictions on the randomly generated tree) —we call these spanning trees NWRST (no-weight RST). On most graphs, this procedure can be run in time linear in the number of nodes [17]. Hence, the combinations SHAZOO+NWRST and WTA+NWRST run in O(|V|) time on most graphs.\nMinimum Spanning Tree (MST). This is just the minimum-weight spanning tree, where the weight of a spanning tree is the sum of its edge weights. This is the tree that best approximates the original graph in terms of the trace norm distance of the corresponding Laplacian matrices.\nFollowing [10, 6], we also ran SHAZOO and WTA using committees of spanning trees, and then aggregated their predictions via a majority vote. 
The resulting algorithms are denoted by k*SHAZOO and k*WTA, where k is the number of spanning trees in the aggregation. We used either k = 7, 11 or k = 3, 7, depending on the dataset size.

For our experiments, we used five datasets: RCV1, USPS, KROGAN, COMBINED, and WEBSPAM. WEBSPAM is a big dataset (110,900 nodes and 1,836,136 edges) of inter-host links created for the Web Spam Challenge 2008 [15].^3 KROGAN (2,169 nodes and 6,102 edges) and COMBINED (2,871 nodes and 6,407 edges) are high-throughput protein-protein interaction networks of budding yeast taken from [14] (see [6] for a more complete description). Finally, USPS and RCV1 are graphs obtained from the USPS handwritten characters dataset (all ten categories) and the first 10,000 documents in chronological order of Reuters Corpus Vol. 1 (the four most frequent categories), respectively. In both cases, we used Euclidean 10-Nearest Neighbor to create edges, each weight w_{i,j} being equal to e^{−‖x_i − x_j‖² / σ²_{i,j}}, where σ²_{i,j} = ½ (σ²_i + σ²_j) and σ²_i is the average squared distance between i and its 10 nearest neighbours.

^3 We do not compare our results to those obtained within the challenge, since we are only exploiting the (weighted) graph topology here, disregarding content features.

Following previous experimental settings [6], we associate binary classification tasks with the five datasets/graphs via a standard one-vs-all reduction. Each error rate is obtained by averaging over ten randomly chosen training sets (and ten different trees in the case of RST and NWRST).
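The adaptive Gaussian weighting above can be sketched as follows. This is a naive O(n²) illustration (names are ours; a real implementation would use a k-d tree or similar index for the neighbour search):

```python
import math

def knn_gaussian_weights(points, k=10):
    """Build k-NN graph weights w_ij = exp(-||xi - xj||^2 / sigma2_ij),
    where sigma2_ij = (sigma2_i + sigma2_j) / 2 and sigma2_i is the
    average squared distance from point i to its k nearest neighbours."""
    n = len(points)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nn, sigma2 = [], []
    for i in range(n):
        # k nearest neighbours of i, by squared Euclidean distance
        nearest = sorted(
            (sqdist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        nn.append([j for _, j in nearest])
        sigma2.append(sum(d for d, _ in nearest) / len(nearest))

    weights = {}
    for i in range(n):
        for j in nn[i]:
            s2 = 0.5 * (sigma2[i] + sigma2[j])
            weights[tuple(sorted((i, j)))] = math.exp(
                -sqdist(points[i], points[j]) / s2
            )
    return weights
```

Note that the resulting graph is the union of each node's k-NN lists, so nodes may end up with degree larger than k.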
WEBSPAM is natively a binary classification problem, and we used the same train/test split provided with the dataset: 3,897 training nodes and 1,993 test nodes (the remaining nodes being unlabeled).

The table below shows the macro-averaged classification error rates (percentages) achieved by the various algorithms on the first four datasets mentioned in the main text. For each dataset, we trained ten times over a random subset of 5%, 10%, and 25% of the total number of nodes and tested on the remaining ones. Excluding LABPROP, which is used as a "yardstick" comparison, the lowest error rate in each column is achieved by SHAZOO+MST on USPS and by an 11*SHAZOO variant on the other datasets, except for COMBINED at 25%, where 11*WTA+RST is lowest. Standard deviations averaged over the binary problems are small: most of the time less than 0.5%.

Predictors        |        USPS        |        RCV1         |       KROGAN        |      COMBINED
                  |   5%    10%   25%  |   5%    10%    25%  |   5%    10%    25%  |   5%    10%    25%
SHAZOO+RST           3.62   2.82  2.02   21.72  18.70  15.68   18.11  17.68  17.10   17.77  17.24  17.34
SHAZOO+NWRST         3.88   3.03  2.18   21.97  19.21  15.95   18.11  18.14  17.32   17.22  17.21  17.53
SHAZOO+MST           1.07   0.96  0.80   17.71  14.87  11.73   17.46  16.92  16.30   16.79  16.64  17.15
WTA+RST              5.34   4.23  3.02   25.53  22.66  19.05   21.82  21.05  20.08   21.76  21.38  20.26
WTA+NWRST            5.74   4.45  3.26   25.50  22.70  19.24   21.90  21.28  20.18   21.58  21.42  20.64
WTA+MST              1.81   1.60  1.21   21.07  17.94  13.92   21.41  20.63  19.61   21.74  21.20  20.32
7*SHAZOO+RST         1.68   1.28  0.97   16.33  13.52  11.07   15.54  15.58  15.46   15.12  15.24  15.84
7*SHAZOO+NWRST       1.89   1.38  1.06   16.49  13.98  11.37   15.61  15.62  15.50   15.02  15.12  15.80
7*WTA+RST            2.10   1.56  1.14   17.44  14.74  12.15   16.75  16.64  15.88   16.42  16.09  15.72
7*WTA+NWRST          2.33   1.73  1.24   17.69  15.18  12.53   16.71  16.60  16.00   16.24  16.13  15.79
11*SHAZOO+RST        1.52   1.17  0.89   15.82  13.04  10.59   15.36  15.40  15.29   14.91  15.06  15.61
11*SHAZOO+NWRST      1.70   1.27  0.98   15.95  13.42  10.93   15.40  15.33  15.32   14.87  14.99  15.67
11*WTA+RST           1.84   1.36  1.01   16.40  13.95  11.42   16.20  16.15  15.53   15.90  15.58  15.30
11*WTA+NWRST         2.04   1.51  1.12   16.70  14.28  11.68   16.22  16.05  15.50   15.74  15.57  15.33
OMV                 24.79  12.34  2.10   31.65  22.35  11.79   43.13  38.75  29.84   44.72  40.86  33.24
LABPROP              1.95   1.11  0.82   16.28  12.99  10.00   15.56  14.98  15.23   14.79  14.93  15.18

The above table supports a specific comparison among SHAZOO, WTA, and LABPROP: when SHAZOO and WTA use a single minimum spanning tree (the best-performing tree type for both algorithms), SHAZOO consistently outperforms WTA.

We then report the results on WEBSPAM. SHAZOO and WTA use only non-weighted random spanning trees (NWRST) to optimize scalability. Since this dataset is extremely unbalanced (5.4% positive labels), we use the average test set F-measure instead of the error rate.

SHAZOO   WTA     OMV     LABPROP   3*SHAZOO   3*WTA   7*SHAZOO   7*WTA
0.954    0.947   0.706   0.931     0.964      0.967   0.968      0.968

Our empirical results can be briefly summarized as follows:
1. Without committees, SHAZOO outperforms WTA on all datasets, irrespective of the type of spanning tree being used. With committees, SHAZOO works better than WTA almost always, although the gap between the two shrinks.
2. The predictive performance of SHAZOO+MST is comparable to, and sometimes better than, that of LABPROP, though the latter algorithm is slower.
3. k*SHAZOO, with k = 11 (or k = 7 on WEBSPAM), seems to be especially effective, outperforming LABPROP with a small (e.g., 5%) training set size.
4. NWRST does not offer the same theoretical guarantees as RST, but it is extremely fast to generate (linear in |V| on most graphs; see, e.g., [1]), and in our experiments it is only slightly inferior to RST.

References
[1] N. Alon, C. Avin, M. Koucký, G. Kozma, Z. Lotker, and M.R. Tuttle. Many random walks are faster than one.
In Proceedings of the 20th ACM Symposium on Parallelism in Algorithms and Architectures, pages 119–128. ACM Press, 2008.

[2] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and semi-supervised learning on large graphs. In Proceedings of the 17th Annual Conference on Learning Theory, pages 624–638. Springer, 2004.

[3] Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In Semi-Supervised Learning, pages 193–216. MIT Press, 2006.

[4] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the 18th International Conference on Machine Learning. Morgan Kaufmann, 2001.

[5] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction of a labeled tree. In Proceedings of the 22nd Annual Conference on Learning Theory, 2009.

[6] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[7] G. Iacono and C. Altafini. Monotonicity, frustration, and ordered response: an analysis of the energy landscape of perturbed large-scale biological networks. BMC Systems Biology, 4(83), 2010.

[8] M. Herbster and G. Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In Proceedings of the 22nd Annual Conference on Learning Theory. Omnipress, 2009.

[9] M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Advances in Neural Information Processing Systems 22. MIT Press, 2009.

[10] M. Herbster, M. Pontil, and S. Rojas-Galeano. Fast prediction on a tree. In Advances in Neural Information Processing Systems 22. MIT Press, 2009.

[11] F.R. Kschischang, B.J. Frey, and H.A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.

[12] R. Lyons and Y. Peres. Probability on trees and networks.
Manuscript, 2008.

[13] S.T. McCormick, M.R. Rao, and G. Rinaldi. Easy and difficult objective functions for max cut. Mathematical Programming, 94(2-3):459–466, 2003.

[14] G. Pandey, M. Steinbach, R. Gupta, T. Garg, and V. Kumar. Association analysis-based transformations for protein interaction networks: a function prediction case study. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 540–549. ACM Press, 2007.

[15] Yahoo! Research (Barcelona) and Laboratory of Web Algorithmics (Univ. of Milan). Web spam collection. URL: barcelona.research.yahoo.net/webspam/datasets/.

[16] D.A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing (STOC 2008). ACM Press, 2008.

[17] D.B. Wilson. Generating random spanning trees more quickly than the cover time. In Proceedings of the 28th ACM Symposium on the Theory of Computing, pages 296–303. ACM Press, 1996.

[18] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning, 2003.