{"title": "Online Prediction on Large Diameter Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 649, "page_last": 656, "abstract": "Current on-line learning algorithms for predicting the labelling of a graph have an important limitation in the case of large diameter graphs; the number of mistakes made by such algorithms may be proportional to the square root of the number of vertices, even when tackling simple problems. We overcome this problem with an efficient algorithm which achieves a logarithmic mistake bound. Furthermore, current algorithms are optimised for data which exhibits cluster-structure; we give an additional algorithm which performs well locally in the presence of cluster structure and on large diameter graphs.", "full_text": "Online Prediction on Large Diameter Graphs\n\nMark Herbster, Guy Lever, Massimiliano Pontil\n\nGower Street, London WC1E 6BT, England, UK\n\n{m.herbster, g.lever, m.pontil}@cs.ucl.ac.uk\n\nDepartment of Computer Science\n\nUniversity College London\n\nAbstract\n\nWe continue our study of online prediction of the labelling of a graph. We show a\nfundamental limitation of Laplacian-based algorithms: if the graph has a large di-\nameter then the number of mistakes made by such algorithms may be proportional\nto the square root of the number of vertices, even when tackling simple problems.\nWe overcome this drawback by means of an ef\ufb01cient algorithm which achieves\na logarithmic mistake bound. It is based on the notion of a spine, a path graph\nwhich provides a linear embedding of the original graph. In practice, graphs may\nexhibit cluster structure; thus in the last part, we present a modi\ufb01ed algorithm\nwhich achieves the \u201cbest of both worlds\u201d: it performs well locally in the presence\nof cluster structure, and globally on large diameter graphs.\n\n1 Introduction\nWe study the problem of predicting the labelling of a graph in the online learning framework. 
Consider the following game for predicting the labelling of a graph: nature presents a graph; nature queries a vertex v_{i_1}; the learner predicts ŷ_1 ∈ {−1, 1}, the label of the vertex; nature presents a label y_1; nature queries a vertex v_{i_2}; the learner predicts ŷ_2; and so forth. The learner's goal is to minimise the total number of mistakes M = |{t : ŷ_t ≠ y_t}|. If nature is adversarial, the learner will always mispredict, but if nature is regular or simple, there is hope that a learner may make only a few mispredictions. Thus, a central goal of online learning is to design algorithms whose total mispredictions can be bounded relative to the complexity of nature's labelling. In [9, 8, 7], the cut size (the number of edges between disagreeing labels) was used as a measure of the complexity of a graph's labelling, and mistake bounds relative to this and the graph diameter were derived.\n\nThe strength of the methods in [8, 7] is in the case when the graph exhibits \"cluster structure\". The apparent deficiency of these methods is that they have poor bounds when the graph diameter is large relative to the number of vertices. We observe that this weakness is not due to insufficiently tight bounds, but is a problem in their performance. In particular, we discuss an example of an n-vertex labelled graph with a single edge between disagreeing label sets. On this graph, sequential prediction using the common method based upon minimising the Laplacian semi-norm of a labelling, subject to constraints, incurs θ(√n) mistakes (see Theorem 3). The expectation is that the number of mistakes incurred by an optimal online algorithm is bounded by O(ln n).\n\nWe solve this problem by observing that there exists an approximate structure-preserving embedding of any graph into a path graph. In particular the cut-size of any labelling is increased by no more than a factor of two. 
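The prediction game described above is easy to make concrete in code. The Python sketch below is purely illustrative (the game loop and the placeholder "majority so far" learner are our own names, not an algorithm from this paper); it only demonstrates the protocol and the mistake count M = |{t : ŷ_t ≠ y_t}|:

```python
# Sketch of the online graph-labelling game: nature queries vertices,
# the learner predicts a label in {-1, +1}, the true label is revealed,
# and mistakes are counted. The learner here is a trivial placeholder.

def play_online_game(trial_sequence, predict, receive):
    """trial_sequence: list of (vertex, true_label) pairs revealed one per trial."""
    mistakes = 0
    for vertex, label in trial_sequence:
        y_hat = predict(vertex)      # learner commits to a prediction
        if y_hat != label:
            mistakes += 1
        receive(vertex, label)       # feedback: nature reveals the true label
    return mistakes

class MajoritySoFar:
    """Illustrative learner: predict the majority label seen so far (ties -> +1)."""
    def __init__(self):
        self.pos, self.neg = 0, 0
    def predict(self, vertex):
        return 1 if self.pos >= self.neg else -1
    def receive(self, vertex, label):
        if label == 1:
            self.pos += 1
        else:
            self.neg += 1

learner = MajoritySoFar()
trials = [(0, 1), (1, 1), (2, -1), (3, -1), (4, -1)]
M = play_online_game(trials, learner.predict, learner.receive)  # M == 3 here
```

Any of the graph-based predictors discussed in this paper can be plugged in for the placeholder learner; only the `predict`/`receive` interface matters to the game.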
We call this embedding a spine of the graph. The spine is the foundation on which we build two algorithms. Firstly, we predict directly on the spine with the 1-nearest-neighbour algorithm. We demonstrate that this is equivalent to the Bayes-optimal classifier for a particular Markov random field. A logarithmic mistake bound for learning on a path graph follows by the Halving algorithm analysis. Secondly, we use the spine of the graph as a foundation to add a binary support tree to the original graph. This enables us to prove a bound which is the \"best of both worlds\": if the predicted set of vertices has cluster structure we will obtain a bound appropriate for that case, but if instead the predicted set exhibits a large diameter we will obtain a polylogarithmic bound.\n\nPrevious work. The seminal approach to semi-supervised learning over graphs in [3] is to predict with a labelling which is consistent with a minimum label-separating cut. More recently, the graph Laplacian has emerged as a key object in semi-supervised learning; for example, the semi-norm induced by the Laplacian is commonly either directly minimised subject to constraints, or used as a regulariser [14, 2]. In [8, 7] the online graph labelling problem was studied. An aim of those papers was to provide a natural interpretation of the bound on the cumulative mistakes of the kernel perceptron when the kernel is the pseudoinverse of the graph Laplacian, bounds in this case being relative to the cut and (resistance) diameter of the graph. In this paper we necessarily build directly on the very recent results in [7], as those results depend on the resistance diameter of the predicted vertex set as opposed to the whole graph [8]. The online graph labelling problem is also studied in [13], and here the graph structure is not given initially. 
A slightly weaker logarithmic bound for the online graph labelling problem has also been independently derived via a connection to an online routing problem in the very recent [5].\n\n2 Preliminaries\n\nWe study the process of predicting a labelling defined on the vertices of a graph. Following the classical online learning framework, a sequence of labelled vertices {(v_{i_1}, y_1), (v_{i_2}, y_2), . . .}, the trial sequence, is presented to a learning algorithm such that, on sight of each vertex v_{i_t}, the learner makes a prediction ŷ_t for the label value, after which the correct label is revealed. This feedback information is then used by the learning algorithm to improve its performance on further examples. We analyse the performance of a learning algorithm in the mistake bound framework [12]: the aim is to minimise the maximum possible cumulative number of mistakes made on the training sequence.\n\nA graph G = (V, E) is a collection of vertices V = {v_1, . . . , v_n} joined by connecting (possibly weighted) edges. Denote i ∼ j whenever v_i and v_j are connected, so that E = {(i, j) : i ∼ j} is the set of unordered pairs of connected vertex indices. Associated with each edge (i, j) ∈ E is a weight A_ij, so that A is the n × n symmetric adjacency matrix. We say that G is unweighted if A_ij = 1 for every (i, j) ∈ E and A_ij = 0 otherwise. In this paper, we consider only connected graphs, that is, graphs such that there exists a path between any two vertices. The Laplacian G of a graph G is the n × n matrix G = D − A, where D is the diagonal degree matrix such that D_ii = Σ_j A_ij. The quadratic form associated with the Laplacian relates to the cut size of graph labellings.\n\nDefinition 1. 
Given a labelling u ∈ ℝ^n of G = (V, E) we define the cut size of u by\n\nΦ_G(u) = (1/4) u^T G u = (1/4) Σ_{(i,j)∈E} A_ij (u_i − u_j)^2.  (1)\n\nIn particular, if u ∈ {−1, 1}^n we say that a cut occurs on edge (i, j) if u_i ≠ u_j, and Φ_G(u) measures the number of cuts.\n\nWe evaluate the performance of prediction algorithms in terms of the cut size and the resistance diameter of the graph. There is an established natural connection between graphs and resistive networks where each edge (i, j) ∈ E is viewed as a resistor with resistance 1/A_ij [4]. Thus the effective resistance r_G(v_i, v_j) between vertices v_i and v_j is the potential difference needed to induce a unit current flow between v_i and v_j. The effective resistance may be computed by the formula [11]\n\nr_G(v_i, v_j) = (e_i − e_j)^T G^+ (e_i − e_j),  (2)\n\nwhere \"+\" denotes the pseudoinverse and e_1, . . . , e_n are the canonical basis vectors of ℝ^n. The resistance diameter of a graph, R_G := max_{v_i,v_j∈V} r_G(v_i, v_j), is the maximum effective resistance between any pair of vertices on the graph.\n\n3 Limitations of online minimum semi-norm interpolation\n\nAs we will show, it is possible to develop online algorithms for predicting the labelling of a graph which have a mistake bound that is a logarithmic function of the number of vertices. Conversely, we first highlight a deficiency in a standard Laplacian-based method for predicting a graph labelling. Given a partially labelled graph G = (V, E) with |V| = n (that is, such that for some ℓ ≤ n, y_ℓ ∈ {−1, 1}^ℓ is a labelling defined on the ℓ vertices V_ℓ = {v_{i_1}, v_{i_2}, . . . , v_{i_ℓ}}), the minimum semi-norm interpolant is defined by\n\nȳ = argmin{u^T G u : u ∈ ℝ^n, u_{i_k} = y_k, k = 1, . . . , ℓ}.\n\nWe then predict using ŷ_i = sgn(ȳ_i), for i = 1, . . . , n.\n\nThe common justification behind the above learning paradigm [14, 2] is that minimising the cut (1) encourages neighbouring vertices to be similarly labelled. However, we now demonstrate that in the online setting such a regime will perform poorly on certain graph constructions: there exists a trial sequence on which the method will make at least θ(√n) mistakes.\n\nDefinition 2. An octopus graph of size d is defined to be d path graphs (the tentacles) of length d (that is, with d + 1 vertices) all adjoined at a common end vertex, to which a further single head vertex is attached, so that n = |V| = d^2 + 2. This corresponds to the graph O_{1,d,d} discussed in [8].\n\nTheorem 3. Let G = (V, E) be an octopus graph of size d and y = (y_1, . . . , y_{|V|}) the labelling such that y_i = 1 if v_i is the head vertex and y_i = −1 otherwise. There exists a trial sequence for which online minimum semi-norm interpolation makes θ(√|V|) mistakes.\n\nProof. Let the first query vertex be the head vertex, and let the end vertex of a tentacle be queried at each subsequent trial. We show that this strategy forces at least d mistakes. The solution to the minimum semi-norm interpolation with boundary values problem is precisely the harmonic solution ȳ [4] (that is, for every unlabelled vertex v_j, Σ_{i=1}^{n} A_ij(ȳ_i − ȳ_j) = 0). If the graph is connected, ȳ is unique and the graph labelling problem is identical to that of identifying the potential at each vertex of a resistive network defined on the graph where each edge corresponds to a resistor of 1 unit; the harmonic principle corresponds to Kirchhoff's current law in this case. Using this analogy, suppose that the end points of k < d tentacles are labelled and that the end vertex v_q of an unlabelled tentacle is queried. Suppose a current of kλ flows from the head to the body of the graph. 
By Kirchhoff's law, a current of λ flows along each labelled tentacle (in order to obey the harmonic principle at every vertex it is clear that no current flows along the unlabelled tentacles). By Ohm's law, λ = 2/(d + k). Minimum semi-norm interpolation therefore results in the solution\n\nȳ_q = 1 − 2k/(d + k) ≥ 0 iff k ≤ d.\n\nHence the minimum semi-norm solution predicts incorrectly whenever k < d and the algorithm makes at least d mistakes.\n\nThe above demonstrates a limitation in the method of online Laplacian minimum semi-norm interpolation for predicting a graph labelling: the mistake bound can be proportional to the square root of the number of data points. We solve these problems in the following section.\n\n4 A linear graph embedding\n\nWe demonstrate a method of embedding data represented as a connected graph G into a path graph, which we call a spine of G, and which partially preserves the structure of G. Let P_n be the set of path graphs with n vertices. We would like to find a path graph with the same vertex set as G, which solves\n\nmin_{P∈P_n} max_{u∈{−1,1}^n} Φ_P(u) / Φ_G(u).\n\nIf a Hamiltonian path H of G (a path on G which visits each vertex precisely once) exists, then the approximation ratio satisfies Φ_H(u)/Φ_G(u) ≤ 1. The problem of finding a Hamiltonian path is NP-complete, however, and such a path is not guaranteed to exist. As we shall see, a spine S of G may be found efficiently and satisfies Φ_S(u)/Φ_G(u) ≤ 2.\n\nWe now detail the construction of a spine of a graph G = (V, E), with |V| = n. Starting from any node, G is traversed in the manner of a depth-first search (that is, each vertex is fully explored before backtracking to the last unexplored vertex), and an ordered list V_L = {v_{l_1}, v_{l_2}, . . . , v_{l_{2m+1}}} of the vertices (m ≤ |E|) in the order that they are visited is formed, allowing repetitions when a vertex is visited more than once. Note that each edge in E_G is traversed no more than twice when forming V_L. Define an edge multiset E_L = {(l_1, l_2), (l_2, l_3), . . . , (l_{2m}, l_{2m+1})}, the set of pairs of consecutive vertices in V_L. Let u be an arbitrary labelling of G and denote, as usual, Φ_G(u) = (1/4) Σ_{(i,j)∈E_G} (u_i − u_j)^2 and Φ_L(u) = (1/4) Σ_{(i,j)∈E_L} (u_i − u_j)^2. Since the multiset E_L contains every element of E_G no more than twice, Φ_L(u) ≤ 2Φ_G(u).\n\nWe then take any subsequence V'_L of V_L containing every vertex in V exactly once. A spine S = (V, E_S) is a graph formed by connecting each vertex in V to its immediate neighbours in the subsequence V'_L with an edge. Since a cut occurs between connected vertices v_i and v_j in S only if a cut occurs on some edge in E_L located between the corresponding vertices in the list V_L, we have\n\nΦ_S(u) ≤ Φ_L(u) ≤ 2Φ_G(u).  (3)\n\nThus we have reduced the problem of learning the cut on a generic graph to that of learning the cut on a path graph. In the following we see that the 1-nearest-neighbour (1-NN) algorithm is a Bayes-optimal algorithm for this problem. Note that the 1-NN algorithm does not perform well on general graphs; on the octopus graph discussed above, for example, it can make at least θ(√n) mistakes, and even θ(n) mistakes on a related graph construction [8].\n\n5 Predicting with a spine\n\nWe consider implementing the 1-NN algorithm on a path graph and demonstrate that it achieves a mistake bound which is logarithmic in the length of the line. Let G = (V, E) be a path graph, where V = {v_1, v_2, . . . , v_n} is the set of vertices and E = {(1, 2), (2, 3), . . . , (n − 1, n)}. 
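The spine construction of Section 4, together with the nearest-neighbour prediction rule used in this section, can be sketched in Python. This is an illustrative sketch under our own naming (`spine_order`, `one_nn_on_spine` and the toy two-triangle graph are assumptions made for the example, not the authors' implementation); it also checks property (3), Φ_S(u) ≤ 2Φ_G(u), on the toy graph:

```python
from collections import defaultdict

def spine_order(adj, root):
    """Order vertices by first visit in a depth-first traversal: a spine of G."""
    order, seen, stack = [], set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(reversed(sorted(adj[v])))  # deterministic DFS order
    return order  # positions along the path graph S

def cut_size(edges, u):
    """Phi(u): number of edges whose endpoints disagree, for u in {-1,+1}^n."""
    return sum(1 for i, j in edges if u[i] != u[j])

def one_nn_on_spine(order, trials):
    """Online 1-NN on the spine: predict with the label of the closest labelled vertex."""
    pos = {v: k for k, v in enumerate(order)}
    labelled, mistakes = {}, 0
    for v, y in trials:
        if labelled:
            nearest = min(labelled, key=lambda w: abs(pos[w] - pos[v]))
            y_hat = labelled[nearest]
        else:
            y_hat = 1  # arbitrary first prediction
        mistakes += (y_hat != y)
        labelled[v] = y
    return mistakes

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the single edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = defaultdict(list)
for i, j in edges:
    adj[i].append(j)
    adj[j].append(i)
u = {v: (1 if v <= 2 else -1) for v in range(6)}  # a single cut, on edge (2,3)

order = spine_order(adj, 0)
spine_edges = list(zip(order, order[1:]))
assert cut_size(spine_edges, u) <= 2 * cut_size(edges, u)  # property (3)
M = one_nn_on_spine(order, [(0, 1), (5, -1), (3, -1)])
```

On this toy trial sequence the 1-NN predictor errs only once (on the first vertex of the "−1" triangle), consistent with the logarithmic bound proved below for a labelling of unit cut size.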
The nearest neighbour algorithm, in the standard online learning framework described above, attempts to predict a graph labelling by producing, for each query vertex v_{i_t}, the prediction ŷ_t which is consistent with the label of the closest labelled vertex (and predicts randomly in the case of a tie).\n\nTheorem 4. Given the task of predicting the labelling of any unweighted, n-vertex path graph P in the online framework, the number of mistakes, M, incurred by the 1-NN algorithm satisfies\n\nM ≤ Φ_P(u) log_2((n − 1)/Φ_P(u)) + Φ_P(u)/ln 2 + 1,  (4)\n\nwhere u ∈ {−1, 1}^n is any labelling consistent with the trial sequence.\n\nProof. We shall prove the result by noting that the Halving algorithm [1] (under certain conditions on the probabilities assigned to each hypothesis) implements the nearest neighbour algorithm on a path graph. Given any input space X and finite binary concept class C ⊂ {−1, 1}^{|X|}, the Halving algorithm learns any target concept c* ∈ C as follows. Each hypothesis c ∈ C is given an associated probability p(c). A sequence of labelled examples {(x_1, y_1), . . . , (x_{t−1}, y_{t−1})} ⊂ X × {−1, 1} is revealed in accordance with the usual online framework. Let F_t be the set of feasible hypotheses at trial t; F_t = {c : c(x_s) = y_s ∀s < t}. Given an unlabelled example x_t ∈ X at trial t, the predicted label ŷ_t is that which agrees with the majority vote, that is, such that\n\nΣ_{c∈F_t : c(x_t)=ŷ_t} p(c) / Σ_{c∈F_t} p(c) > 1/2\n\n(and it predicts randomly if this is equal to 1/2). It is well known [1] that the Halving algorithm makes at most M_H mistakes with\n\nM_H ≤ log_2(1/p(c*)).  (5)\n\nWe now define a probability distribution over the space of all labellings u ∈ {−1, 1}^n of P such that the Halving algorithm with these probabilities implements the nearest neighbour algorithm. Let a cut occur on any given edge with probability α, independently of all other cuts; Prob(u_{i+1} ≠ u_i) = α ∀i < n. The position of all cuts fixes the labelling up to flipping every label, and each of these two resulting possible arrangements is equally likely. This recipe associates with each possible labelling u ∈ {−1, 1}^n a probability p(u) which is a function of the labelling's cut size,\n\np(u) = (1/2) α^{Φ_P(u)} (1 − α)^{n−1−Φ_P(u)}.  (6)\n\nThis induces a full joint probability distribution on the space of vertex labels. In fact (6) is a Gibbs measure and as such defines a Markov random field over the space of vertex labels [10]. The mass function p therefore satisfies the Markov property\n\np(u_i = γ | u_j = γ_j ∀j ≠ i) = p(u_i = γ | u_j = γ_j ∀j ∈ N_i),  (7)\n\nwhere here N_i is the set of vertices neighbouring v_i, those connected to v_i by an edge. We will give an equivalent Markov property which allows a more general conditioning to reduce to that over boundary vertices.\n\nDefinition 5. Given a path graph P = (V, E), a set of vertices V' ⊂ V and a vertex v_i ∈ V, we define the boundary vertices v_ℓ, v_r (either of which may be vacuous) to be the two vertices in V' that are closest to v_i in each direction along the path; its nearest neighbours in each direction.\n\nThe distribution induced by (6) satisfies the following Markov property: given a partial labelling of P defined on a subset V' ⊂ V, the label of any vertex v_i is independent of all labels on V' except those on the vertices v_ℓ, v_r (either of which could be vacuous),\n\np(u_i = γ | u_j = γ_j, ∀j : v_j ∈ V') = p(u_i = γ | u_ℓ = γ_ℓ, u_r = γ_r).  (8)\n\nGiven the construction of the probability distribution formed by independent cuts on graph edges, we can evaluate conditional probabilities. For example, p(u_j = γ | u_k = γ) is the probability of an even number of cuts between vertex v_j and vertex v_k. Since cuts occur with probability α and there are (|k − j| choose s) possible arrangements of s cuts, we have\n\np(u_j = γ | u_k = γ) = Σ_{s even} (|k − j| choose s) α^s (1 − α)^{|k−j|−s} = (1/2)(1 + (1 − 2α)^{|k−j|}).  (9)\n\nLikewise we have that\n\np(u_j ≠ γ | u_k = γ) = Σ_{s odd} (|k − j| choose s) α^s (1 − α)^{|k−j|−s} = (1/2)(1 − (1 − 2α)^{|k−j|}).  (10)\n\nNote also that for any single vertex we have p(u_i = γ) = 1/2 for γ ∈ {−1, 1}.\n\nLemma 6. Given the task of predicting the labelling of an n-vertex path graph online, the Halving algorithm, with a probability distribution over the labellings defined as in (6) and such that 0 < α < 1/2, implements the nearest neighbour algorithm.\n\nProof. 
Suppose that t − 1 trials have been performed, so that we have a partial labelling of a subset V' ⊂ V, {(v_{i_1}, y_1), (v_{i_2}, y_2), . . . , (v_{i_{t−1}}, y_{t−1})}. Suppose the label of vertex v_{i_t} is queried, so that the Halving algorithm makes the following prediction ŷ_t for vertex v_{i_t}: ŷ_t = y if p(u_{i_t} = y | u_{i_j} = y_j ∀ 1 ≤ j < t) > 1/2, ŷ_t = −y if p(u_{i_t} = y | u_{i_j} = y_j ∀ 1 ≤ j < t) < 1/2 (and it predicts randomly if this probability is equal to 1/2). We first consider the case where the conditional labelling includes vertices on both sides of v_{i_t}. We have, by (8), that\n\np(u_{i_t} = y | u_{i_j} = y_j ∀ 1 ≤ j < t) = p(u_{i_t} = y | u_ℓ = y_{τ(ℓ)}, u_r = y_{τ(r)})\n= p(u_ℓ = y_{τ(ℓ)} | u_r = y_{τ(r)}, u_{i_t} = y) p(u_r = y_{τ(r)}, u_{i_t} = y) / p(u_ℓ = y_{τ(ℓ)}, u_r = y_{τ(r)})\n= p(u_ℓ = y_{τ(ℓ)} | u_{i_t} = y) p(u_r = y_{τ(r)} | u_{i_t} = y) / p(u_ℓ = y_{τ(ℓ)} | u_r = y_{τ(r)}),  (11)\n\nwhere v_ℓ and v_r are the boundary vertices and τ(ℓ) and τ(r) are the trials at which vertices v_ℓ and v_r are queried, respectively. We can evaluate the right hand side of this expression using (9, 10). To show equivalence with the nearest neighbour method whenever α < 1/2, we have from (9, 10, 11)\n\np(u_{i_t} = y | u_ℓ = y, u_r ≠ y) = (1 + (1 − 2α)^{|ℓ−i_t|})(1 − (1 − 2α)^{|r−i_t|}) / (2(1 − (1 − 2α)^{|ℓ−r|})),\n\nwhich is greater than 1/2 if |ℓ − i_t| < |r − i_t| and less than 1/2 if |ℓ − i_t| > |r − i_t|. Hence, this produces predictions exactly in accordance with the nearest neighbour scheme. We also have, more simply, that for all i_t, ℓ and r, and α < 1/2,\n\np(u_{i_t} = y | u_ℓ = y, u_r = y) > 1/2, and p(u_{i_t} = y | u_ℓ = y) > 1/2.\n\nThis proves the lemma for all cases.\n\nA direct application of the Halving algorithm mistake bound (5) now gives\n\nM ≤ log_2(1/p(u)) = log_2(2 / (α^{Φ_P(u)} (1 − α)^{n−1−Φ_P(u)})),\n\nwhere u is any labelling consistent with the trial sequence. We choose α = min(Φ_P(u)/(n − 1), 1/2) (note that the bound is vacuous when Φ_P(u)/(n − 1) > 1/2, since M is necessarily upper bounded by n), giving\n\nM ≤ Φ_P(u) log_2((n − 1)/Φ_P(u)) + (n − 1 − Φ_P(u)) log_2(1 + Φ_P(u)/(n − 1 − Φ_P(u))) + 1\n ≤ Φ_P(u) log_2((n − 1)/Φ_P(u)) + Φ_P(u)/ln 2 + 1.\n\nThis proves the theorem.\n\nThe nearest neighbour algorithm can predict the labelling of any graph G = (V, E) by first transferring the data representation to that of a spine S of G, as presented in Section 4. We now apply the above argument to this method and immediately deduce our first main result.\n\nTheorem 7. Given the task of predicting the labelling of any unweighted, connected, n-vertex graph G = (V, E) in the online framework, the number of mistakes, M, incurred by the nearest neighbour algorithm operating on a spine S of G satisfies\n\nM ≤ 2Φ_G(u) max[0, log_2((n − 1)/(2Φ_G(u)))] + 2Φ_G(u)/ln 2 + 1,  (12)\n\nwhere u ∈ {−1, 1}^n is any labelling consistent with the trial sequence.\n\nProof. Theorem 4 gives bound (4) for predicting on any path, hence M ≤ Φ_S(u) log_2((n − 1)/Φ_S(u)) + Φ_S(u)/ln 2 + 1. Since this is an increasing function of Φ_S(u) for Φ_S(u) ≤ n − 1 and is vacuous at Φ_S(u) ≥ n − 1 (M is necessarily upper bounded by n), we obtain the upper bound by substituting Φ_S(u) ≤ 2Φ_G(u) (equation (3)).\n\nWe observe that predicting with the spine is a minimax improvement over Laplacian minimal semi-norm interpolation. Recall Theorem 3: there we showed that there exists a trial sequence such that Laplacian minimal semi-norm interpolation incurs θ(√n) mistakes. In fact this trivially generalises to θ(√(Φ_G(u)n)) mistakes by creating a colony of Φ_G(u) octopi and then identifying each previously separate head vertex as a single central vertex. The upper bound (12) is smaller than the prior lower bound.\n\nThe computational complexity of this algorithm is O(|E| + |V| ln |V|) time. We compute the spine in O(|E|) time by simply listing vertices in the order in which they are first visited during a depth-first search traversal of G. Using online 1-NN requires O(|V| ln |V|) time to predict an arbitrary vertex sequence using a self-balancing binary search tree (e.g., a red-black tree), as the insertion of each vertex into the tree and the determination of the nearest left and right neighbour is O(ln |V|).\n\n6 Prediction with a binary support tree\n\nThe Pounce online label prediction algorithm [7] is designed to exploit the cluster structure of a graph G = (V, E) and achieves the following mistake bound\n\nM ≤ N(X, ρ, r_G) + 4Φ_G(u)ρ + 1,  (13)\n\nfor any ρ > 0. Here, u ∈ ℝ^n is any labelling consistent with the trial sequence, X = {v_{i_1}, v_{i_2}, . . .} ⊆ V is the set of inputs, and N(X, ρ, r_G) is a covering number: the minimum number of balls of resistance diameter ρ (see Section 2) required to cover X. 
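To see how the bound (13) trades the number of covering balls against their resistance diameter, one can tabulate its right-hand side over a family of covers and take the best ρ. The covering profile below is invented purely for illustration (halving ρ doubles the number of balls, roughly as for inputs spread along a path of resistance diameter 8); only the formula N(X, ρ, r_G) + 4Φ_G(u)ρ + 1 comes from the text:

```python
def pounce_bound(cover_number, phi, rho):
    """Right-hand side of the Pounce mistake bound (13): N(X, rho, r_G) + 4*Phi*rho + 1."""
    return cover_number + 4 * phi * rho + 1

phi = 2  # hypothetical cut size of the target labelling
# Hypothetical covering profile: pairs (rho, N(X, rho, r_G)).
profile = [(8.0, 1), (4.0, 2), (2.0, 4), (1.0, 8), (0.5, 16)]

bounds = [pounce_bound(N, phi, rho) for rho, N in profile]
best = min(bounds)  # the tightest bound over this family of covers
```

Neither the single coarse cover (large ρ term) nor the finest one (large covering number) is optimal here; an intermediate ρ minimises the bound, which is exactly the trade-off the min over ρ in Theorem 11 below exploits.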
The mistake bound (13) can be preferable to (12) whenever the inputs are sufficiently clustered and so have a cover of small diameter sets. For example, consider two (m + 1)-cliques, one labelled \"+1\", one \"−1\", with cm arbitrary interconnecting edges (c ≥ 1); here the bound (12) is vacuous while (13) is M ≤ 8c + 3 (with ρ = 2/m, N(X, ρ, r_G) = 2, and Φ_G(u) = cm). An input space V may have both local cluster structure yet have a large diameter. Imagine a \"universe\" such that points are distributed into many dense clusters such that some sets of clusters are tightly packed but overall the distribution is quite diffuse. A given \"problem\" X ⊆ V may then be centred on a few clusters or alternatively encompass the entire space. Thus, for practical purposes, we would like a prediction algorithm which achieves the \"best of both worlds\", that is, a mistake bound which is no greater, in order of magnitude, than the maximum of (12) and (13). The rest of this paper is directed toward this goal. We now introduce the notion of a binary support tree, formalise the Pounce method in the support tree setting and then prove the desired result.\n\nDefinition 8. Given a graph G = (V, E), with |V| = n, and spine S, we define a binary support tree of G to be any binary tree T = (V_T, E_T) of least possible depth, D, whose leaves are the vertices of S, in order. Note that D < log_2(n) + 1.\n\nWe show that there is a weighting of the support tree which ensures that the resistance diameter of the support tree is small, but also such that any labelling of the leaf vertices can be extended to the support tree such that its cut size remains small. This enables effective learning via the support tree. A related construction has been used to build preconditioners for solving linear systems [6].\n\nLemma 9. Given any spine graph S = (V, E) with |V| = n, and labelling u ∈ {−1, 1}^n, with support tree T = (V_T, E_T), there exists a weighting A of T, and a labelling ū ∈ [−1, 1]^{|V_T|} of T, such that ū and u are identical on V, Φ_T(ū) < Φ_S(u) and R_T ≤ (log_2 n + 1)(log_2 n + 4)(log_2(log_2 n + 2))^2.\n\nProof. Let v_r be the root vertex of T. Suppose each edge (i, j) ∈ E_T has a weight A_ij which is a function of the edge's depth d = max{d_T(v_i, v_r), d_T(v_j, v_r)}, A_ij = W(d), where d_T(v, v') is the number of edges in the shortest path from v to v'. Consider the unique labelling ū such that, for 1 ≤ i ≤ n, we have ū_i = u_i, and such that for every other vertex v_p ∈ V_T, with child vertices v_{c_1}, v_{c_2}, we have ū_p = (ū_{c_1} + ū_{c_2})/2, or ū_p = ū_c in the case where v_p has only one child, v_c. Suppose the edges (p, c_1), (p, c_2) ∈ E_T are at some depth d in T, and let V' ⊂ V correspond to the leaf vertices of T descended from v_p. Define Φ_S(u_{V'}) to be the cut of u restricted to vertices in V'. If ū_{c_1} = ū_{c_2} then (ū_p − ū_{c_1})^2 + (ū_p − ū_{c_2})^2 = 0 ≤ 2Φ_S(u_{V'}), and if ū_{c_1} ≠ ū_{c_2} then (ū_p − ū_{c_1})^2 + (ū_p − ū_{c_2})^2 ≤ 2 ≤ 2Φ_S(u_{V'}). Hence\n\nW(d)((ū_p − ū_{c_1})^2 + (ū_p − ū_{c_2})^2) ≤ 2W(d)Φ_S(u_{V'})  (14)\n\n(a similar inequality is trivial in the case that v_p has only one child). Since the sets of leaf descendants of all vertices at depth d form a partition of V, summing (14) first over all parent nodes at a given depth and then over all integers d ∈ [1, D] gives\n\n4Φ_T(ū) ≤ 2 Σ_{d=1}^{D} W(d) Φ_S(u).  (15)\n\nWe then choose\n\nW(d) = 1 / ((d + 1)(log_2(d + 1))^2)  (16)\n\nand note that Σ_{d=1}^{∞} 1/((d + 1)(log_2(d + 1))^2) ≤ 1/2 + ln^2(2) ∫_2^∞ dx/(x ln^2 x) = 1/2 + ln 2 < 2. Further, R_T = 2 Σ_{d=1}^{D} (d + 1)(log_2(d + 1))^2 ≤ D(D + 3)(log_2(D + 1))^2, and so D ≤ log_2 n + 1 gives the resistance bound.\n\nDefinition 10. Given the task of predicting the labelling of an unweighted graph G = (V, E), the augmented Pounce algorithm proceeds as follows: an augmented graph Ḡ = (V̄, Ē) is formed by attaching a binary support tree of G, with weights defined as in (16), to G; formally, let T = (V_T, E_T) be such a binary support tree of G, then Ḡ = (V_T, E ∪ E_T). The Pounce algorithm is then used to predict the (partial) labelling defined on Ḡ.\n\nTheorem 11. Given the task of predicting the labelling of any unweighted, connected, n-vertex graph G = (V, E) in the online framework, the number of mistakes, M, incurred by the augmented Pounce algorithm satisfies\n\nM ≤ min_{ρ>0} {N(X, ρ, r_G) + 12Φ_G(u)ρ} + 1,  (17)\n\nwhere N(X, ρ, r_G) is the covering number of the input set X = {v_{i_1}, v_{i_2}, . . .} ⊆ V relative to the resistance distance r_G of G, and u ∈ ℝ^n is any labelling consistent with the trial sequence. Furthermore,\n\nM ≤ 12Φ_G(u)(log_2 n + 1)(log_2 n + 4)(log_2(log_2 n + 2))^2 + 2.  (18)\n\nProof. Let u be some labelling consistent with the trial sequence. By (3) we have that Φ_S(u) ≤ 2Φ_G(u) for any spine S of G. 
Moreover, by the arguments in Lemma 9, there exists some labelling ū of the weighted support tree T of G, consistent with u on V, such that Φ_T(ū) < Φ_S(u). We then have\n\nΦ_Ḡ(ū) = Φ_T(ū) + Φ_G(u) < 3Φ_G(u).  (19)\n\nBy Rayleigh's monotonicity law the addition of the support tree does not increase the resistance between any vertices of G, hence\n\nN(X, ρ, r_Ḡ) ≤ N(X, ρ, r_G).  (20)\n\nCombining inequalities (19) and (20) with the Pounce bound (13) for predicting ū on Ḡ yields\n\nM ≤ N(X, ρ, r_Ḡ) + 4Φ_Ḡ(ū)ρ + 1 ≤ N(X, ρ, r_G) + 12Φ_G(u)ρ + 1,\n\nwhich proves (17). We prove (18) by covering Ḡ with a single ball, so that M ≤ 4Φ_Ḡ(ū)R_Ḡ + 2 ≤ 12Φ_G(u)R_T + 2, and the result follows from the bound on R_T in Lemma 9.\n\n7 Conclusion\n\nWe have explored a deficiency with existing online techniques for predicting the labelling of a graph. As a solution, we have presented an approximate cut-preserving embedding of any graph G = (V, E) into a simple path graph, which we call a spine, such that an implementation of the 1-nearest-neighbours algorithm is an efficient realisation of a Bayes-optimal classifier. This therefore achieves a mistake bound which is logarithmic in the size of the vertex set for any graph, and the complexity of our algorithm is O(|E| + |V| ln |V|). We further applied the insights gained to a second algorithm, an augmentation of the Pounce algorithm, which achieves a polylogarithmic performance guarantee but can further take advantage of clustered data, in which case its bound is relative to any cover of the graph.\n\nReferences\n\n[1] J. M. Barzdin and R. V. Frievald. On the prediction of general recursive functions. Soviet Math. Doklady, 13:1224-1228, 1972.\n[2] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56:209-239, 2004.\n[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning, pages 19-26. Morgan Kaufmann, San Francisco, CA, 2001.\n[4] P. Doyle and J. Snell. Random Walks and Electric Networks. Mathematical Association of America, 1984.\n[5] J. Fakcharoenphol and B. Kijsirikul. Low congestion online routing and an improved mistake bound for online prediction of graph labeling. CoRR, abs/0809.2075, 2008.\n[6] K. Gremban, G. Miller, and M. Zagha. Performance evaluation of a new parallel preconditioner. In Proceedings of the International Parallel Processing Symposium, page 65, 1995.\n[7] M. Herbster. Exploiting cluster-structure to predict the labeling of a graph. In The 19th International Conference on Algorithmic Learning Theory, pages 54-69, 2008.\n[8] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 577-584. MIT Press, Cambridge, MA, 2007.\n[9] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 305-312, New York, NY, USA, 2005. ACM.\n[10] R. Kinderman and J. L. Snell. Markov Random Fields and Their Applications. Amer. Math. Soc., Providence, RI, 1980.\n[11] D. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81-95, 1993.\n[12] N. Littlestone. Learning when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318, 1988.\n[13] K. Pelckmans and J. A. Suykens. An online algorithm for learning a labeling of a graph. In Proceedings of the 6th International Workshop on Mining and Learning with Graphs, 2008.\n[14] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In 20th International Conference on Machine Learning (ICML-2003), pages 912-919, 2003.\n", "award": [], "sourceid": 1041, "authors": [{"given_name": "Mark", "family_name": "Herbster", "institution": null}, {"given_name": "Guy", "family_name": "Lever", "institution": null}, {"given_name": "Massimiliano", "family_name": "Pontil", "institution": null}]}