{"title": "Online Prediction at the Limit of Zero Temperature", "book": "Advances in Neural Information Processing Systems", "page_first": 2935, "page_last": 2943, "abstract": "We design an online algorithm to classify the vertices of a graph. Underpinning the algorithm is the probability distribution of an Ising model isomorphic to the graph. Each classification is based on predicting the label with maximum marginal probability in the limit of zero-temperature with respect to the labels and vertices seen so far. Computing these classifications is unfortunately based on a $\\#P$-complete problem. This motivates us to develop an algorithm for which we give a sequential guarantee in the online mistake bound framework. Our algorithm is optimal when the graph is a tree matching the prior results in [1].For a general graph, the algorithm exploits the additional connectivity over a tree to provide a per-cluster bound. The algorithm is efficient as the cumulative time to sequentially predict all of the vertices of the graph is quadratic in the size of the graph.", "full_text": "Online Prediction at the Limit of Zero Temperature\n\nMark Herbster\n\nStephen Pasteris\n\nDepartment of Computer Science\n\nUniversity College London\n\nLondon WC1E 6BT, England, UK\n\n{m.herbster,s.pasteris}@cs.ucl.ac.uk\n\nShaona Ghosh\n\nECS\n\nUniversity of Southampton\nSouthampton, UK SO17 1BJ\n\nghosh.shaona@gmail.com\n\nAbstract\n\nWe design an online algorithm to classify the vertices of a graph. Underpinning\nthe algorithm is the probability distribution of an Ising model isomorphic to the\ngraph. Each classi\ufb01cation is based on predicting the label with maximum marginal\nprobability in the limit of zero-temperature with respect to the labels and vertices\nseen so far. Computing these classi\ufb01cations is unfortunately based on a #P -\ncomplete problem. This motivates us to develop an algorithm for which we give\na sequential guarantee in the online mistake bound framework. 
Our algorithm is optimal when the graph is a tree, matching the prior results in [1]. For a general graph, the algorithm exploits the additional connectivity over a tree to provide a per-cluster bound. The algorithm is efficient, as the cumulative time to sequentially predict all of the vertices of the graph is quadratic in the size of the graph.

1 Introduction

Semi-supervised learning is now a standard methodology in machine learning. A common approach in semi-supervised learning is to build a graph [2] from a given set of labeled and unlabeled data with each datum represented as a vertex. The hope is that the constructed graph will capture either the cluster [3] or manifold [4] structure of the data. Typically, an edge in this graph indicates the expectation that the joined data points are more likely to have the same label. One method to exploit this representation is to use the semi-norm induced by the Laplacian of the graph [5, 4, 6, 7]. A shared idea of the Laplacian semi-norm based approaches is that the smoothness of a boolean labeling of the graph is measured via the "cut", which is just the number of edges that connect disagreeing labels. In practice the semi-norm is then used as a regularizer in which the optimization problem is relaxed from boolean to real values. Our approach also uses the "cut", but unrelaxed, to define an Ising distribution over the vertices of the graph.

Predicting with the vertex marginals of an Ising distribution in the limit of zero temperature was shown to be optimal in the mistake bound model [1, Section 4.1] when the graph is a tree. The exact computation of marginal probabilities in the Ising model is intractable on non-trees [8]. However, in the limit of zero temperature, a rich combinatorial structure called the Picard-Queyranne graph [9] emerges.
We exploit this structure to give an algorithm which 1) is optimal on trees, 2) has a quadratic cumulative computational complexity, and 3) has a mistake bound on generic graphs that is stronger than previous bounds in many natural cases.

The paper is organized as follows. In the remainder of this section, we introduce the Ising model and lightly review previous work in the online mistake bound model for predicting the labeling of a graph. In Section 2 we review our key technical tool, the Picard-Queyranne graph [9], and explain the required notation. In the body of Section 3 we provide a mistake bound analysis of our algorithm as well as of the intractable 0-Ising algorithm, and then conclude with a detailed comparison to the state of the art. In the appendices we provide proofs as well as preliminary experimental results.

Ising model in the limit of zero temperature. In our setting, the parameters of the Ising model are an $n$-vertex graph $G = (V(G), E(G))$ and a temperature parameter $\tau > 0$, where $V(G) = \{1, \dots, n\}$ denotes the vertex set and $E(G)$ denotes the edge set. Each vertex of this graph may be labeled with one of two states $\{0, 1\}$ and thus a labeling of a graph may be denoted by a vector $u \in \{0,1\}^n$, where $u_i$ denotes the label of vertex $i$. The cutsize of a labeling $u$ is defined as $\phi_G(u) := \sum_{(i,j) \in E(G)} |u_i - u_j|$. The Ising probability distribution over labelings of $G$ is then defined as $p^G_\tau(u) \propto \exp\left(-\frac{1}{\tau}\phi_G(u)\right)$, where $\tau > 0$ is the temperature parameter. In our online setting, at the beginning of trial $t+1$ we will have already received an example sequence $S_t$ of $t$ vertex-label pairs $(i_1, y_1), \dots, (i_t, y_t)$, where each pair $(i, y) \in V(G) \times \{0,1\}$. We use $p^G_\tau(u_v = y \,|\, S_t) := p^G_\tau(u_v = y \,|\, u_{i_1} = y_1, \dots, u_{i_t} = y_t)$ to denote the marginal probability that vertex $v$ has label $y$ given the previously labeled vertices of $S_t$.
For convenience we also define the marginalized cutsize $\phi_G(u|S_t)$ to be equal to $\phi_G(u)$ if $u_{i_1} = y_1, \dots, u_{i_t} = y_t$ and to be undefined otherwise. Our prediction $\hat y_{t+1}$ of vertex $i_{t+1}$ is then the label with maximal marginal probability in the limit of zero temperature, thus

$$\hat y^{0I}_{t+1}(i_{t+1}|S_t) := \operatorname*{argmax}_{y \in \{0,1\}} \; \lim_{\tau \to 0} p^G_\tau(u_{i_{t+1}} = y \,|\, u_{i_1} = y_1, \dots, u_{i_t} = y_t)\,. \qquad \text{[0-Ising]} \qquad (1)$$

Note the prediction is undefined if the labels are equally probable. At low temperatures the mass of the marginal is dominated by the labelings consistent with $S_t$ and the proposed label of vertex $i_{t+1}$ of minimal cut; as we approach zero, $\hat y_{t+1}$ is the label consistent with the maximum number of labelings of minimal cut. Thus if $k := \min_{u \in \{0,1\}^n} \phi_G(u|S)$ then we have that

$$\hat y^{0I}(v|S) = \begin{cases} 0 & |\{u \in \{0,1\}^n : \phi_G(u|(S,(v,0))) = k\}| > |\{u \in \{0,1\}^n : \phi_G(u|(S,(v,1))) = k\}| \\ 1 & |\{u \in \{0,1\}^n : \phi_G(u|(S,(v,0))) = k\}| < |\{u \in \{0,1\}^n : \phi_G(u|(S,(v,1))) = k\}|\,. \end{cases}$$

The problem of counting minimum label-consistent cuts was shown to be #P-complete in [10] and, further, computing $\hat y^{0I}(v|S)$ is also NP-hard (see Appendix G). In Section 2.1 we introduce the Picard-Queyranne graph [9], which captures the combinatorial structure of the set of minimum-cuts. We then use this simplifying structure as a basis to design a heuristic approximation to $\hat y^{0I}(v|S)$ with a mistake bound guarantee.

Predicting the labelling of a graph in the mistake bound model. We prove performance guarantees for our method in the mistake bound model introduced by Littlestone [11]. On the graph this model corresponds to the following game.
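Before turning to the online game, prediction (1) can be made concrete by a brute-force sketch that enumerates all $2^n$ labelings, keeps those consistent with the examples, and compares the counts at the minimum cutsize $k$. This is feasible only for toy graphs and is illustrative only (the function name and the test graph are our own, not the paper's):

```python
from itertools import product

def zero_ising_predict(n, edges, examples, v):
    """Brute-force 0-Ising prediction (exponential in n; intuition only).

    examples: dict vertex -> label in {0, 1}; v: vertex to predict.
    Returns 0 or 1, or None when the two counts tie (prediction undefined)."""
    def cut(u):
        return sum(u[i] != u[j] for i, j in edges)
    # labelings consistent with the example sequence
    consistent = [u for u in product((0, 1), repeat=n)
                  if all(u[i] == y for i, y in examples.items())]
    k = min(cut(u) for u in consistent)   # label-consistent minimum-cutsize
    counts = [sum(1 for u in consistent if cut(u) == k and u[v] == y)
              for y in (0, 1)]
    if counts[0] == counts[1]:
        return None
    return 0 if counts[0] > counts[1] else 1
```

On a 4-vertex path with its endpoints labeled 0 and 1, the vertex nearer the 0-endpoint is predicted 0 and the vertex nearer the 1-endpoint is predicted 1, matching the nearest-neighbor intuition; the middle vertex of a 3-vertex path ties and is undefined.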
Nature presents a graph $G$; nature queries a vertex $i_1 \in V(G) = \mathbb{N}_n$; the learner predicts the label of the vertex, $\hat y_1 \in \{0,1\}$; nature presents a label $y_1$; nature queries a vertex $i_2$; the learner predicts $\hat y_2$; and so forth. The learner's goal is to minimize the total number of mistakes $M = |\{t : \hat y_t \ne y_t\}|$. If nature is adversarial, the learner will always make a "mistake", but if nature is regular or simple, there is hope that a learner may incur only a few mistakes. Thus, a central goal of online learning is to design algorithms whose total mistakes can be bounded relative to the complexity of nature's labeling. The graph labeling problem has been studied extensively in the online literature. Here we provide a rough discussion of the two main approaches for graph label prediction, and in Section 3.3 we provide a more detailed comparison. The first approach is based on the graph Laplacian [12, 13, 14]; it provides bounds that utilize the additional connectivity of non-tree graphs, which are particularly strong when the graph contains uniformly-labeled clusters of small (resistance) diameter. The drawbacks of this approach are that the bounds are weaker on graphs with large diameter and that the computation times are slower. The second approach is to approximate the original graph with an appropriately selected tree or "path" graph [15, 16, 1, 17]; this leads to faster computation times, and bounds that are better on graphs with large diameters. The algorithm treeOpt [1] is optimal on trees. These algorithms may be extended to non-tree graphs by first selecting a spanning tree uniformly at random [16] and then applying the algorithm to the sampled tree.
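The online game above can be exercised with a small driver that plays any predictor against a trial sequence and counts mistakes. The nearest-neighbour predictor below is only an illustrative baseline under our own assumed helper names, not one of the cited algorithms:

```python
from collections import deque

def bfs_dist(adj, src):
    """Hop distances from src via breadth-first search."""
    dist = {src: 0}
    q = deque([src])
    while q:
        x = q.popleft()
        for nb in adj[x]:
            if nb not in dist:
                dist[nb] = dist[x] + 1
                q.append(nb)
    return dist

def online_mistakes(adj, trial_sequence, predict):
    """Play the online game: predict each queried vertex, then see its label."""
    seen = {}        # vertex -> revealed label
    mistakes = 0
    for v, y in trial_sequence:
        if predict(adj, seen, v) != y:
            mistakes += 1
        seen[v] = y
    return mistakes

def nearest_neighbour(adj, seen, v):
    """Predict with the label of the closest labeled vertex (default 1)."""
    if not seen:
        return 1
    d = bfs_dist(adj, v)
    best = min(seen, key=lambda w: d.get(w, float("inf")))
    return seen[best]
```

For example, on a 4-vertex path queried endpoint-first, the first two trials are unavoidable mistakes and the two interior vertices are then predicted correctly.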
This randomized approach enables expected mistake bounds which exploit the cluster structure in the graph.

The bounds we prove for the NP-hard 0-Ising prediction and our heuristic are most similar to the "small p" bounds proven for the p-seminorm interpolation algorithm [14]. Although these bounds are not strictly comparable, a key strength of our approach is that the new bounds often improve when the graph contains uniformly-labeled clusters of varying diameters. Furthermore, when the graph is a tree we match the optimal bounds of [1]. Finally, the cumulative time required to compute the complete labeling of a graph is quadratic in the size of the graph for our algorithm, while [14] requires the minimization of a non-strongly convex function (on every trial) which is not differentiable as $p \to 1$.

2 Preliminaries

An (undirected) graph $G$ is a pair of sets $(V, E)$ such that $E$ is a set of unordered pairs of distinct elements from $V$. We say that $R$ is a subgraph, $R \subseteq G$, iff $V(R) \subseteq V(G)$ and $E(R) = \{(i,j) : i, j \in V(R), (i,j) \in E(G)\}$. Given any subgraph $R \subseteq G$, we define its boundary (or inner border) $\partial_0(R)$ and its neighbourhood (or exterior border) $\partial_e(R)$ respectively as $\partial_0(R) := \{j : i \not\in V(R), j \in V(R), (i,j) \in E(G)\}$ and $\partial_e(R) := \{i : i \not\in V(R), j \in V(R), (i,j) \in E(G)\}$, and its exterior edge border as $\partial^E_e(R) := \{(i,j) : i \not\in V(R), j \in V(R), (i,j) \in E(G)\}$. The length of a subgraph $P$ is denoted by $|P| := |E(P)|$ and we denote the diameter of a graph by $D(G)$. A pair of vertices $v, w \in V(G)$ are $\kappa$-connected if there exist $\kappa$ edge-disjoint paths connecting them. The connectivity of a graph, $\kappa(G)$, is the maximal value of $\kappa$ such that every pair of points in $G$ is $\kappa$-connected. The atomic number $N_\kappa(G)$ of a graph at connectivity level $\kappa$ is the minimum cardinality $c$ of a partition of $G$ into subgraphs $\{R_1, \dots, R_c\}$ such that $\kappa(R_i) \ge \kappa$ for all $1 \le i \le c$.

Our results also require the use of directed-, multi-, and quotient-graphs. Every undirected graph also defines a directed graph where each undirected edge $(i,j)$ is represented by the directed edges $(i,j)$ and $(j,i)$. An orientation of an undirected graph is an assignment of a direction to each edge, turning the initial graph into a directed graph. In a multi-graph the edge set is a multi-set and thus there may be multiple edges between two vertices. A quotient-graph $\mathcal{G}$ is defined from a graph $G$ and a partition of its vertex set $\{V_i\}_{i=1}^N$ so that $V(\mathcal{G}) := \{V_i\}_{i=1}^N$ (we often call these vertices super-vertices to emphasize that they are sets) and the multiset $E(\mathcal{G}) := \{(I,J) : I, J \in V(\mathcal{G}), I \ne J, i \in I, j \in J, (i,j) \in E(G)\}$. We commonly construct a quotient-graph $\mathcal{G}$ by "merging" a collection of super-vertices; for example, in Figure 2 from 2a to 2b, where $v_6$ and $v_9$ are merged to "6/9", and also in the five merges that transform 2c to 2d.

The set of all label-consistent minimum-cuts in a graph with respect to an example sequence $S$ is $U^*_G(S) := \operatorname*{argmin}_{u \in \{0,1\}^n} \phi_G(u|S)$. The minimum is typically non-unique. For example, in Figure 2a the vertex sets $\{v_1, \dots, v_4\}, \{v_5, \dots, v_{12}\}$ correspond to one label-consistent minimum-cut and $\{v_1, \dots, v_5, v_7, v_8\}, \{v_6, v_9, \dots, v_{12}\}$ to another (the cutsize is 3). The (uncapacitated) maximum flow is the number of edge-disjoint paths between a source and target vertex.
Thus in Figure 2b, between vertex "1" and vertex "6/9" there are at most 3 simultaneously edge-disjoint paths; these are also not unique, as one path must pass through either vertices $\langle v_{11}, v_{12} \rangle$ or vertices $\langle v_{11}, v_{10}, v_{12} \rangle$. Figure 2c illustrates one such flow $F$ (just the directed edges). For convenience it is natural to view the maximum flow or the label-consistent minimum-cut as being with respect to only two vertices, as in Figure 2a transformed to Figure 2b so that $H \leftarrow \mathrm{merge}(G, \{v_6, v_9\})$. The "flow" and the "cut" are related by Menger's theorem, which states that the minimum-cut with respect to a source and target vertex is equal to the max flow between them. Given a connected graph $H$ and source and target vertices $s, t$, the Ford-Fulkerson algorithm [18] can find $k$ edge-disjoint paths from $s$ to $t$ in time $O(k|E(H)|)$, where $k$ is the value of the max flow.

2.1 The Picard-Queyranne graph

Given a set of labels there may be multiple label-consistent minimum-cuts as well as multiple maximum flows in a graph. The Picard-Queyranne (PQ) graph [9] reduces this multiplicity as far as is possible with respect to the indeterminacy of the maximum flow. The vertices of the PQ-graph are defined as a super-vertex set on a partition of the original graph's vertex set. Two vertices are contained in the same super-vertex iff they have the same label in every label-consistent minimum-cut. An edge between two vertices defines an analogous edge between two super-vertices iff that edge is conserved in every maximum flow.
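The max-flow computation just mentioned can be sketched with unit capacities: by Menger's theorem the value of the $s$-$t$ flow equals the number of edge-disjoint $s$-$t$ paths. This is a minimal Ford-Fulkerson-style sketch (BFS augmenting paths; our own illustrative code, written for clarity rather than the $O(k|E(H)|)$ bound):

```python
from collections import deque

def max_edge_disjoint_paths(edges, s, t):
    """Value of the unit-capacity s-t max flow on an undirected graph,
    i.e. the number of edge-disjoint s-t paths (= s-t min-cut, by Menger)."""
    cap = {}                       # residual capacity on directed arcs
    for i, j in edges:             # each undirected edge gives two unit arcs
        cap[(i, j)] = cap.get((i, j), 0) + 1
        cap[(j, i)] = cap.get((j, i), 0) + 1
    flow = 0
    while True:
        parent = {s: None}         # BFS for an augmenting path
        q = deque([s])
        while q and t not in parent:
            x = q.popleft()
            for (a, b), c in cap.items():
                if a == x and c > 0 and b not in parent:
                    parent[b] = a
                    q.append(b)
        if t not in parent:
            return flow
        b = t                      # augment one unit along the found path
        while parent[b] is not None:
            a = parent[b]
            cap[(a, b)] -= 1
            cap[(b, a)] += 1
            b = a
        flow += 1
```

For instance, a 4-cycle between opposite corners carries two edge-disjoint paths, while a simple path carries one.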
Furthermore, the edges between super-vertices strictly orient the labels in any label-consistent minimum-cut, as may be seen in the formal definition that follows.

First we introduce the following useful notations: let $k_{G,S} := \min\{\phi_G(u|S) : u \in \{0,1\}^n\}$ denote the minimum-cutsize of $G$ with respect to $S$; let $i \overset{S}{\sim} j$ denote an equivalence relation between vertices in $V(G)$ where $i \overset{S}{\sim} j$ iff $\forall u \in U^*_G(S) : u_i = u_j$; and then we define,

Definition 1 ([9]). The Picard-Queyranne graph $\mathcal{G}(G,S)$ is derived from graph $G$ and non-trivial example sequence $S$. The graph is an orientation of the quotient graph derived from the partition $\{\bot, I_2, \dots, I_{N-1}, \top\}$ of $V(G)$ induced by $\overset{S}{\sim}$. The edge set of $\mathcal{G}$ is constructed of $k_{G,S}$ edge-disjoint paths starting at source vertex $\bot$ and terminating at target vertex $\top$. A labeling $u \in \{0,1\}^n$ is in $U^*_G(S)$ iff

1. $i \in \bot$ implies $u_i = 0$ and $i \in \top$ implies $u_i = 1$
2. $i, j \in H$ implies $u_i = u_j$
3. $i \in I$, $j \in J$, $(I, J) \in E(\mathcal{G})$, and $u_i = 1$ implies $u_j = 1$

where $\bot$ and $\top$ are the source and target vertices and $H, I, J \in V(\mathcal{G})$.

As $\mathcal{G}(G,S)$ is a DAG it naturally defines a partial order $(V(\mathcal{G}), \le_{\mathcal{G}})$ on the vertex set, where $I \le_{\mathcal{G}} J$ if there exists a path starting at $I$ and ending at $J$. The least and greatest elements of the partial order are $\bot$ and $\top$. The notations $\uparrow R$ and $\downarrow R$ denote the up set and down set of $R$. Given the set $U^*$ of all label-consistent minimum-cuts, if $u \in U^*$ then there exists an antichain $A \subseteq V(\mathcal{G}) \setminus \{\top\}$ such that $u_i = 0$ when $i \in I \in \downarrow A$ and otherwise $u_i = 1$; furthermore, for every antichain there exists a label-consistent minimum-cut. The simple structure of $\mathcal{G}(G,S)$ was utilized by [9] to enable the efficient algorithmic enumeration of minimum-cuts.
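As a toy illustration of this correspondence, label-consistent minimum-cuts are exactly the down-closed 0-regions of the PQ DAG, so they can be counted by brute force over subsets of the inner super-vertices. The DAG below mirrors the shape of the PQ-graph in Figure 2d; the vertex names, edge list, and function are our own illustrative assumptions:

```python
from itertools import chain, combinations

def count_min_cuts(pq_vertices, pq_edges):
    """Count label-consistent minimum-cuts of a PQ DAG by brute force:
    each corresponds to a down-closed 0-region containing bot but not top."""
    inner = [v for v in pq_vertices if v not in ("bot", "top")]
    def down_closed(zero_set):
        # for every PQ edge I -> J: if J is labeled 0, so must I be
        return all(i == "bot" or i in zero_set
                   for i, j in pq_edges if j in zero_set)
    subsets = chain.from_iterable(combinations(inner, r)
                                  for r in range(len(inner) + 1))
    return sum(1 for s in subsets if down_closed(set(s)))
```

On a DAG whose inner super-vertices form the chain $A \to B \to C$, the valid 0-regions are $\emptyset, \{A\}, \{A,B\}, \{A,B,C\}$, so the count is 4. Brute force is the honest option here: as noted next, exact counting is #P-complete in general.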
However, the cardinality of this set of all label-consistent minimum-cuts is potentially exponential in the size of the PQ-graph, and the exact computation of the cardinality was later shown #P-complete in [10]. In Figure 1 we give the algorithm from [9, 19] to compute a PQ-graph.

PicardQueyranneGraph(graph: $G$; example sequence: $S = (v_k, y_k)_{k=1}^t$)
1. $(H, s, t) \leftarrow$ SourceTargetMerge$(G, S)$
2. $F \leftarrow$ MaxFlow$(H, s, t)$
3. $I \leftarrow (V(I), E(I))$ where $V(I) := V(H)$ and $E(I) := \{(i,j) : (i,j) \in E(H), (j,i) \not\in F\}$
4. $G_0 \leftarrow$ QuotientGraph(StronglyConnectedComponents$(I)$, $H$)
5. $E(\mathcal{G}) \leftarrow E(G_0)$; $V(\mathcal{G}) \leftarrow V(G_0)$ except $\bot(\mathcal{G}) \leftarrow \bot(G_0) \cup \{v_k : k \in \mathbb{N}_t, y_k = 0\}$ and $\top(\mathcal{G}) \leftarrow \top(G_0) \cup \{v_k : k \in \mathbb{N}_t, y_k = 1\}$
Return: directed graph $\mathcal{G}$

Figure 1: Computing the Picard-Queyranne graph

(a) Graph $G$ and $S = \langle (v_1, 0), (v_6, 1), (v_9, 1) \rangle$; (b) Graph $H$ (step 1 in Figure 1); (c) Graph $I$ (step 3 in Figure 1); (d) PQ graph $\mathcal{G}$ (step 4 in Figure 1)

Figure 2: Building a Picard-Queyranne graph

We illustrate the computation in Figure 2. The algorithm operates first on $(G, S)$ (step 1) by "merging" all vertices which share the same label in $S$ to create $H$. In step 2 a max flow graph $F \subseteq H$ is computed by the Ford-Fulkerson algorithm. It is well known in the case of unweighted graphs that a max flow graph $F$ may be output as a DAG of $k$ edge-disjoint paths, where $k$ is the value of the flow. In step 3 all edges in the flow become directed edges, creating $I$. The graph $G_0$ is then created in step 4 from $I$, where the strongly connected components become the super-vertices of $G_0$ and the super-edges correspond to a subset of flow edges from $F$.
Finally, in step 5, we create the PQ-graph $\mathcal{G}$ by "fixing" the source and target vertices so that they also have as elements the original labeled vertices from $S$ which were merged in step 1. The correctness of the algorithm follows from arguments in [9]; we provide an independent proof in Appendix B.

Theorem 2 ([9]). The algorithm in Figure 1 computes the unique Picard-Queyranne graph $\mathcal{G}(G,S)$ derived from graph $G$ and non-trivial example sequence $S$.

3 Mistake Bounds Analysis

In this section we analyze the mistakes incurred by the intractable 0-Ising strategy (see (1)) and the strategy longest-path (see Figure 3). Our analysis splits into two parts. Firstly, we show (Section 3.1, Theorem 4), for a sufficiently regular graph label prediction algorithm, that we may analyze independently the mistake bound of each uniformly-labeled cluster (connected subgraph). Secondly, the per-cluster analysis then separates into three cases, the result of which is summarized in Theorem 10. For a given cluster $C$, when its internal connectivity is larger than the number of edges in the boundary ($\kappa(C) > |\partial^E_e(C)|$) we will incur no more than one mistake in that cluster. On the other hand, for smaller connectivity clusters ($\kappa(C) \le |\partial^E_e(C)|$) we incur up to quadratically many mistakes via the edge boundary size. When $C$ is a tree we incur $O(|\partial^E_e(C)| \log D(C))$ mistakes.

The analysis of smaller connectivity clusters separates into two parts. First, a sequence of trials in which the label-consistent minimum-cut does not increase we call a PQ-game (Section 3.2), as in essence it is played on a PQ-graph. We give a mistake bound for a PQ-game for the intractable 0-Ising prediction and a comparable bound for the strategy longest-path in Theorem 8. Second, when the label-consistent minimum-cut increases, the current PQ-game ends and a new one begins, leading to a sequence of PQ-games. The mistakes incurred over a sequence of PQ-games are addressed in the aforementioned Theorem 10, and finally Section 3.3 concludes with a discussion of the combined bounds of Theorems 4 and 10 with respect to other graph label prediction algorithms.

3.1 Per-cluster mistake bounds for regular graph label prediction algorithms

An algorithm is called regular if it is permutation-invariant, label-monotone, and Markov. An algorithm is permutation-invariant if the prediction at any time $t$ does not depend on the order of the examples up to time $t$; label-monotone if, for every example sequence, when we insert an example "between" examples $t$ and $t+1$ with label $y$ then the prediction at time $t+1$ is unchanged or changed to $y$; and Markov with respect to a graph $G$ if for any disjoint vertex sets $P$ and $Q$ and separating set $R$ the predictions in $P$ are independent of the labels in $Q$ given the labels of $R$. A subgraph is uniformly-labeled with respect to an example sequence iff the label of each vertex is the same and these labels are consistent with the example sequence. The following definition characterizes the "worst-case" example sequences for regular algorithms with respect to uniformly-labeled clusters.

Definition 3. Given an online algorithm $A$ and a uniformly-labeled subgraph $C \subseteq G$, then $B_A(C; G)$ denotes the maximal mistakes made only in $C$ for the presentation of any permutation of examples in $\partial_e(C)$, each with label $y$, followed by any permutation of examples in $C$, each with label $1-y$.

The following theorem enables us to analyze the mistakes incurred in each uniformly-labeled subgraph $C$ independently of each other and independently of the remaining graph structure, excepting the subgraph's exterior border $\partial_e(C)$.

Theorem 4 (Proof in Appendix D).
Given an online permutation-invariant, label-monotone, Markov algorithm $A$ and a graph $G$ which is covered by uniformly-labeled subgraphs $C_1, \dots, C_c$, the mistakes incurred by the algorithm may be bounded by $M \le \sum_{i=1}^c B_A(C_i; G)$.

The above theorem paired with Theorem 10 completes the mistake bound analysis of our algorithms.

3.2 PQ-games

Given a PQ-graph $\mathcal{G} = \mathcal{G}(G, S)$, the derived online PQ-game is played between a player and an adversary. The aim of the player is to minimize their mistaken predictions; for the adversary it is to maximize the player's mistaken predictions. Thus to play, the adversary proposes a vertex $z \in Z \in V(\mathcal{G})$, the player then predicts a label $\hat y \in \{0,1\}$, then the adversary returns a label $y \in \{0,1\}$, and either a mistake is incurred or not. The only restriction on the adversary is to not return a label which increases the label-consistent minimum-cut. As long as the adversary does not give an example $(z \in \bot, 1)$ or $(z \in \top, 0)$, the label-consistent minimum-cut does not increase no matter the value of $y$; this also implies the player has a trivial strategy to predict the label of $z \in \bot \cup \top$. After the example is given, we have an updated PQ-graph with new source and target super-vertices, as seen in the proposition below.

Proposition 5. If $\mathcal{G}(G,S)$ is a PQ-graph and $(z, y = 0)$ (respectively $(z, y = 1)$) is an example with $z \in Z \in V(\mathcal{G})$ and $z \not\in \top$ ($z \not\in \bot$), then let $\mathcal{Z} = \downarrow\{Z\}$ ($\mathcal{Z} = \uparrow\{Z\}$); then $\mathcal{G}(G, \langle S, (z,y) \rangle) = \mathrm{merge}(\mathcal{G}(G,S), \mathcal{Z})$.

Thus, given the PQ-graph $\mathcal{G}$, the PQ-game is independent of $G$ and $S$, since a "play" $z \in V(G)$ induces a "play" $Z \in V(\mathcal{G})$ (with $z \in Z$).

Mistake bounds for PQ-games.
Given a single PQ-game, in the following we will discuss the three strategies fixed-paths, 0-Ising, and longest-path that the player may adopt, for which we prove online mistake bounds. The first strategy, fixed-paths, is merely motivational: it can be used to play a single PQ-game, but not a sequence. The second strategy, 0-Ising, is computationally infeasible. Finally, the longest-path strategy is "dynamically" similar to fixed-paths but is also permutation-invariant. Common to all our analyses is a $k$-path cover $P$ of PQ-graph $\mathcal{G}$, which is a partitioning of the edge-set of $\mathcal{G}$ into $k$ edge-disjoint directed paths $P := \{p_1, \dots, p_k\}$ from $\bot$ to $\top$. Note that the cover is not necessarily unique; for example, in Figure 2d we have the two unique path covers $P_1 := \{(\bot, A, \top), (\bot, A, B, \top), (\bot, B, C, \top)\}$ and $P_2 := \{(\bot, A, \top), (\bot, A, B, C, \top), (\bot, B, \top)\}$. We denote the set of all path covers as $\mathcal{P}$, and thus we have for Figure 2d that $\mathcal{P} := \{P_1, P_2\}$. This cover motivates a simple mistake bound and strategy. Suppose we had a single path of length $|p|$ where the first and last vertex are the "source" and "target" vertices. Then the minimum label-consistent cut-size is "1", a natural strategy is simply to predict with the "nearest-neighbor" revealed label, and trivially our mistake bound is $\log |p|$. Generalizing to multiple paths we have the following strategy.

Strategy fixed-paths($\tilde P$): Given a PQ-graph, choose a path cover $\{\tilde p_1, \dots, \tilde p_k\} = \tilde P \in \mathcal{P}(\mathcal{G})$. If the path cover is also vertex-disjoint except for the source and target vertex, we may directly use the "nearest-neighbor" strategy detailed above, achieving the mistake upper bound $M \le \sum_{i=1}^k \log |\tilde p_i|$. Unsurprisingly, in the vertex-disjoint case it is a mistake-bound optimal [11] algorithm. If, however, $\tilde P$ is not vertex-disjoint and we need to predict a vertex $V$, we may select a path in $\tilde P$ containing $V$ and predict with the nearest neighbour, and also obtain the bound above. In this case, however, the bound may not be "optimal." Essentially the same technique was used in [20] in a related setting for learning "directed cuts." A limitation of the fixed-paths strategy is that it does not seem possible to extend it into a strategy that can play a sequence of PQ-games and still meet the regularity properties, particularly permutation-invariance as required by Theorem 4.

Strategy 0-Ising: The predictions of the Ising model in the limit of zero temperature (cf. (1)) are equivalent to those of the well-known Halving algorithm [21, 22], where the hypothesis class $U^*$ is the set of label-consistent minimum-cuts. The mistake upper bound of the Halving algorithm is just $M \le \log |U^*|$, where this bound follows from the observation that whenever a mistake is made at least "half" of the concepts in $U^*$ are no longer consistent. We observe that we may upper bound $|U^*| \le \min_{P \in \mathcal{P}(\mathcal{G})} \prod_{i=1}^k |p_i|$, since the product of path lengths from any path cover $P$ is an upper bound on the cardinality of $U^*$, and hence we have the bound in (2). In fact this bound may be a significant improvement over the fixed-paths strategy's bound, as seen in the following proposition.

Proposition 6 (Proof in Appendix C).
For every $c \ge 2$ there exists a PQ-graph $\mathcal{G}_c$, with a path cover $P' \in \mathcal{P}(\mathcal{G}_c)$ and a PQ-game example sequence, such that the mistakes $M_{\text{fixed-paths}(P')} = \Omega(c^2)$, while for all PQ-game example sequences on $\mathcal{G}_c$ the mistakes $M_{\text{0-Ising}} = O(c)$.

Unfortunately the 0-Ising strategy has the drawback that counting label-consistent minimum-cuts is #P-complete and computing the prediction (see (1)) is NP-hard (see Appendix G).

Strategy longest-path: In our search for an efficient and regular prediction strategy it seems natural to attempt to "dynamize" the fixed-paths approach and predict with a nearest neighbor along a dynamic path. Two such permutation-invariant methods are the longest-path and shortest-path strategies. The strategy shortest-path predicts the label of a super-vertex $Z$ in a PQ-game $\mathcal{G}$ as 0 iff the shortest directed path $(\bot, \dots, Z)$ is shorter than the shortest directed path $(Z, \dots, \top)$. The strategy longest-path predicts the label of a super-vertex $Z$ in a PQ-game $\mathcal{G}$ as 0 iff the longest directed path $(\bot, \dots, Z)$ is shorter than the longest directed path $(Z, \dots, \top)$. The strategy shortest-path seems to be intuitively favored over longest-path, as it is just the "nearest-neighbor" prediction with respect to the geodesic distance. However, the following proposition shows that it is strictly worse than any fixed-paths strategy in the worst case.

Input: Graph: $G$, Example sequence: $S = \langle (i_1, 0), (i_2, 1), (i_3, y_3), \dots, (i_\ell, y_\ell) \rangle \in (\mathbb{N}_n \times \{0,1\})^\ell$
Initialization: $\mathcal{G}_3$ = PicardQueyranneGraph$(G, S_2)$
for $t = 3, \dots, \ell$ do
    Receive: $i_t \in \{1, \dots, n\}$
    $I_t = V \in V(\mathcal{G}_t)$ with $i_t \in V$
    Predict (longest-path): $\hat y_t = 0$ if $|\text{longest-path}(\mathcal{G}_t, \bot_t, I_t)| \le |\text{longest-path}(\mathcal{G}_t, I_t, \top_t)|$, and $1$ otherwise
    Predict (0-Ising): $\hat y_t = \hat y^{0I}(i_t|S_{t-1})$  % as per equation (1)
    Receive: $y_t$
    if ($i_t \not\in \bot_t$ or $y_t \ne 1$) and ($i_t \not\in \top_t$ or $y_t \ne 0$) then  % cut unchanged
        $\mathcal{G}_{t+1} = \mathrm{merge}(\mathcal{G}_t, \downarrow\{I_t\})$ if $y_t = 0$; $\mathcal{G}_{t+1} = \mathrm{merge}(\mathcal{G}_t, \uparrow\{I_t\})$ if $y_t = 1$
    else  % cut increases
        $\mathcal{G}_{t+1}$ = PicardQueyranneGraph$(G, S_t)$
    end

Figure 3: Longest-path and 0-Ising online prediction

Proposition 7 (Proof in Appendix C). For every $c \ge 4$ there exists a PQ-graph $\mathcal{G}_c$ and a PQ-game example sequence such that the mistakes $M_{\text{shortest-path}} = \Omega(c^2 \log(c))$, while for every path cover $P \in \mathcal{P}(\mathcal{G}_c)$ and for all PQ-game example sequences on $\mathcal{G}_c$ the mistakes $M_{\text{fixed-paths}(P)} = O(c^2)$.

In contrast, for the strategy longest-path, in the proof of Theorem 8 we show that there always exists some retrospective path cover $P^{\mathrm{lp}} \in \mathcal{P}(\mathcal{G})$ such that $M_{\text{longest-path}} \le \sum_{i=1}^k \log |p^{\mathrm{lp}}_i|$. Computing the "longest path" has time complexity linear in the number of edges in a DAG. Summarizing the mistake bounds for the three PQ-game strategies for a single PQ-game, we have the following theorem.

Theorem 8 (Proof in Appendix C). The mistakes, $M$, of an online PQ-game for player strategies fixed-paths($\tilde P$), 0-Ising, and longest-path on PQ-graph $\mathcal{G}$ and $k$-path cover $\tilde P \in \mathcal{P}(\mathcal{G})$ are bounded by

$$M \le \begin{cases} \sum_{i=1}^k \log |\tilde p_i| & \text{fixed-paths}(\tilde P) \\ \min_{P \in \mathcal{P}(\mathcal{G})} \sum_{i=1}^k \log |p_i| & \text{0-Ising} \\ \max_{P \in \mathcal{P}(\mathcal{G})} \sum_{i=1}^k \log |p_i| & \text{longest-path}\,. \end{cases} \qquad (2)$$

3.3 Global analysis of prediction at zero temperature

In Figure 3 we summarize the prediction protocol for 0-Ising and longest-path. We claim the regularity properties of our strategies in the following theorem.

Theorem 9 (Proof in Appendix E). The strategies 0-Ising and longest-path are permutation-invariant, label-monotone, and Markov.

The technical hurdle here is to prove that label-monotonicity holds over a sequence of PQ-games. For this we need an analog of Proposition 5 to describe how the PQ-graph changes when the label-consistent minimum-cut increases (see Proposition 19). The application of the following theorem along with Theorem 4 implies we may bound the mistakes of each uniformly-labeled cluster in potentially three ways.

Theorem 10 (Proof in Appendix D). Given either the 0-Ising or longest-path strategy $A$, the mistakes on a uniformly-labeled subgraph $C \subseteq G$ are bounded by

$$B_A(C;G) \in \begin{cases} O(1) & \kappa(C) > |\partial^E_e(C)| \\ O\big(|\partial^E_e(C)|(1 + |\partial^E_e(C)| - \kappa(C)) \log N(C)\big) & \kappa(C) \le |\partial^E_e(C)| \\ O(|\partial^E_e(C)| \log D(C)) & C \text{ is a tree} \end{cases} \qquad (3)$$

with the atomic number $N(C) := N_{|\partial^E_e(C)|+1}(C) \le |V(C)|$.

First, if the internal connectivity of the cluster is high, we will only make a single mistake in that cluster. Second, if the cluster is a tree, then we pay the external connectivity of the cluster $|\partial^E_e(C)|$ times the log of the cluster diameter. Finally, in the remaining case we pay quadratically in the external connectivity and logarithmically in the "atomic number" of the cluster. The atomic number captures
The atomic number captures the fact that even a poorly connected cluster may have sub-regions of high internal connectivity.

Computational complexity. If G is a graph and S an example sequence with a label-consistent minimum-cut of φ, then we may implement the longest-path strategy so that it has a cumulative computational complexity of O(max(φ, n)|E(G)|). This follows because if on a trial the “cut” does not increase, we may implement prediction and update in O(|E(G)|) time. On the other hand, if the “cut” increases by φ′, we pay O(φ′|E(G)|) time. To do so we implement an online “Ford-Fulkerson” algorithm [18] which starts from the previous “residual” graph, to which it then adds the additional φ′ flow paths with φ′ steps of size O(|E(G)|).

Discussion. There are essentially five dominating mistake bounds for the online graph labeling problem: (I) the bound of treeOpt [1] on trees; (II) the bound in expectation of treeOpt on a random spanning tree sampled from a graph [1]; (III) the bound of p-seminorm interpolation [14] tuned for “sparsity” (p < 2); (IV) the bound of p-seminorm interpolation as tuned to be equivalent to online label propagation [5] (p = 2); (V) this paper's longest-path strategy. The algorithm treeOpt was shown to be optimal on trees. In Appendix F we show that longest-path also obtains the same optimal bound on trees. Algorithm (II) applies to generic graphs and is obtained from (I) by sampling a random spanning tree (RST). It is not directly comparable to the other algorithms, as its bound holds only in expectation with respect to the RST. We use [14, Corollary 10] to compare (V) to (III) and (IV). We introduce the following simplifying notation to compare bounds. Let C1, . . . , Cc denote uniformly-labeled clusters (connected subgraphs) which cover the graph, and set κr := κ(Cr) and φr := |∂E_e(Cr)|.
We define Dr(i) to be the wide diameter at connectivity level i of cluster Cr. The wide diameter Dr(i) is the minimum value such that for all pairs of vertices v, w ∈ Cr there exist i edge-disjoint paths from v to w of length at most Dr(i) in Cr (and if i > κr then Dr(i) := +∞). Thus Dr(1) is the diameter of cluster Cr, and Dr(1) ≤ Dr(2) ≤ · · · . Let φ denote the minimum label-consistent cutsize and observe that if the cardinality of the cover |{C1, . . . , Cc}| is minimized then we have that 2φ = Σ_{r=1}^c φr.

Thus using [14, Corollary 10] we have the following upper bounds for (III): (φ/κ∗)² log D∗ + c, and for (IV): (φ/κ∗)D∗ + c, where κ∗ := min_r κr and D∗ := max_r Dr(κ∗). In comparison we have (V): [Σ_{r=1}^c max(0, φr − κr + 1) φr log Nr] + c, with atomic numbers Nr := N_{φr+1}(Cr). To contrast the bounds, consider a double lollipop labeled-graph: first create a lollipop, which is a path of n/4 vertices attached to a clique of n/4 vertices. Label these vertices 1. Second, clone the lollipop except with labels 0. Finally, join the two cliques with n/8 edges arbitrarily. For (III) and (IV) the bounds are O(n), independent of the choice of clusters, whereas an upper bound for (V) is the exponentially smaller O(log n), obtained by choosing a four-cluster cover consisting of the two paths and the two cliques. This emphasizes the generic problem of (III) and (IV): the parameters κ∗ and D∗ are defined by the worst clusters, whereas (V) is truly a per-cluster bound.
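The double lollipop is easy to build explicitly. The sketch below (our own illustrative construction; the vertex numbering and the particular choice of the n/8 joining edges are arbitrary) assembles the graph and checks that the labeling cuts exactly the n/8 clique-to-clique edges:

```python
import itertools

def double_lollipop(n):
    """Two mirrored lollipops (a path of n//4 vertices attached to a clique
    of n//4 vertices), labeled 1 and 0, with the cliques joined by n//8
    edges.  Returns (edge set, vertex -> label map)."""
    assert n % 8 == 0
    q = n // 4
    edges, labels = set(), {}
    for side, label in ((0, 1), (1, 0)):
        base = side * (n // 2)
        path = list(range(base, base + q))            # path vertices
        clique = list(range(base + q, base + 2 * q))  # clique vertices
        for v in path + clique:
            labels[v] = label
        edges.update(zip(path, path[1:]))                # path edges
        edges.update(itertools.combinations(clique, 2))  # clique edges
        edges.add((path[-1], clique[0]))                 # attach path to clique
    for i in range(n // 8):                              # join the two cliques
        edges.add((q + 1 + i, n // 2 + q + 1 + i))
    return edges, labels

edges, labels = double_lollipop(32)
cut = sum(1 for u, v in edges if labels[u] != labels[v])  # label-consistent cut
```

Every edge inside a lollipop joins two identically-labeled vertices, so only the n/8 joining edges are cut (here `cut == 4` for n = 32), while the four clusters (two paths, two cliques) are each uniformly labeled.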
We consider the previous “constructed” example to be representative of a generic case where the graph contains clusters of many resistance diameters as well as sparse interconnecting “background” vertices.

On the other hand, there are cases in which (III, IV) improve on (V). For a graph with only small-diameter clusters, and if the cutsize exceeds the cluster connectivity, then (IV) improves on (III, V) given the linear versus quadratic dependence on the cutsize. The log-diameter may be arbitrarily smaller than the log-atomic-number ((III) improves on (V)), and also vice versa. Other subtleties not accounted for in the above comparison include the facts that a) the wide diameter is a crude upper bound for the resistance diameter (cf. [14, Theorem 1]) and b) the clusters of (III, IV) are not required to be uniformly-labeled. Regarding “a)”, replacing “wide” with “resistance” does not change the fact that the bound then holds with respect to the worst resistance diameter, and the example above is still problematic. Regarding “b)”, it is a nice property, but we do not know how to exploit it to give an example that significantly improves (III) or (IV) over a slightly more detailed analysis of (V). Finally, (III, IV) depend on a correct choice of the tunable parameter p.

Thus, in summary, (V) matches the optimal bound of (I) on trees, and can often improve on (III, IV) when a graph is naturally covered by label-consistent clusters of different diameters. However, (III, IV) may improve on (V) in a number of cases, including when the log-diameter is significantly smaller than the log-atomic-number of the clusters.

References

[1] Nicolò Cesa-Bianchi, Claudio Gentile, and Fabio Vitale. Fast and optimal prediction on a labeled tree. In Proceedings of the 22nd Annual Conference on Learning Theory. Omnipress, 2009.
[2] Avrim Blum and Shuchi Chawla.
Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 19–26, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[3] Olivier Chapelle, Jason Weston, and Bernhard Schölkopf. Cluster kernels for semi-supervised learning. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 601–608. MIT Press, 2003.
[4] Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Mach. Learn., 56(1-3):209–239, 2004.
[5] Xiaojin Zhu, Zoubin Ghahramani, and John D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912–919, 2003.
[6] Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In NIPS, 2003.
[7] Martin Szummer and Tommi Jaakkola. Partially labeled classification with Markov random walks. In NIPS, pages 945–952, 2001.
[8] Leslie Ann Goldberg and Mark Jerrum. The complexity of ferromagnetic Ising with local fields. Combinatorics, Probability & Computing, 16(1):43–61, 2007.
[9] Jean-Claude Picard and Maurice Queyranne. On the structure of all minimum cuts in a network and applications. In V. J. Rayward-Smith, editor, Combinatorial Optimization II, volume 13 of Mathematical Programming Studies, pages 8–16. Springer Berlin Heidelberg, 1980.
[10] J. Scott Provan and Michael O. Ball. The complexity of counting cuts and of computing the probability that a graph is connected. SIAM Journal on Computing, 12(4):777–788, 1983.
[11] Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, April 1988.
[12] Mark Herbster, Massimiliano Pontil, and Lisa Wainer.
Online learning over graphs. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 305–312, New York, NY, USA, 2005. ACM.
[13] Mark Herbster. Exploiting cluster-structure to predict the labeling of a graph. In Proceedings of the 19th International Conference on Algorithmic Learning Theory, pages 54–69, 2008.
[14] Mark Herbster and Guy Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT '09), 2009.
[15] Mark Herbster, Guy Lever, and Massimiliano Pontil. Online prediction on large diameter graphs. In Advances in Neural Information Processing Systems (NIPS 22), pages 649–656. MIT Press, 2009.
[16] Nicolò Cesa-Bianchi, Claudio Gentile, Fabio Vitale, and Giovanni Zappella. Random spanning trees and the prediction of weighted graphs. In Proceedings of the 27th International Conference on Machine Learning (27th ICML), pages 175–182, 2010.
[17] Fabio Vitale, Nicolò Cesa-Bianchi, Claudio Gentile, and Giovanni Zappella. See the tree through the lines: The shazoo algorithm. In John Shawe-Taylor, Richard S. Zemel, Peter L. Bartlett, Fernando C. N. Pereira, and Kilian Q. Weinberger, editors, NIPS, pages 1584–1592, 2011.
[18] L. R. Ford and D. R. Fulkerson. Maximal flow through a network. Canadian Journal of Mathematics, 8:399–404, 1956.
[19] Michael O. Ball and J. Scott Provan. Calculating bounds on reachability and connectedness in stochastic networks. Networks, 13(2):253–278, 1983.
[20] Thomas Gärtner and Gemma C. Garriga. The cost of learning directed cuts. In Proceedings of the 18th European Conference on Machine Learning, 2007.
[21] J. M. Barzdin and R. V. Frievald. On the prediction of general recursive functions. Soviet Math. Doklady, 13:1224–1228, 1972.
[22] Nick Littlestone and Manfred K. Warmuth.
The weighted majority algorithm. Inf. Comput., 108(2):212–261, 1994.