{"title": "Label Selection on Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 691, "page_last": 699, "abstract": "We investigate methods for selecting sets of labeled vertices for use in predicting the labels of vertices on a graph. We specifically study methods which choose a single batch of labeled vertices (i.e. offline, non sequential methods). In this setting, we find common graph smoothness assumptions directly motivate simple label selection methods with interesting theoretical guarantees. These methods bound prediction error in terms of the smoothness of the true labels with respect to the graph. Some of these bounds give new motivations for previously proposed algorithms, and some suggest new algorithms which we evaluate. We show improved performance over baseline methods on several real world data sets.", "full_text": "Label Selection on Graphs\n\nAndrew Guillory\n\nDepartment of Computer Science\n\nUniversity of Washington\n\nJeff Bilmes\n\nDepartment of Electrical Engineering\n\nUniversity of Washington\n\nguillory@cs.washington.edu\n\nbilmes@ee.washington.edu\n\nAbstract\n\nWe investigate methods for selecting sets of labeled vertices for use in predicting\nthe labels of vertices on a graph. We speci\ufb01cally study methods which choose\na single batch of labeled vertices (i.e. of\ufb02ine, non sequential methods). In this\nsetting, we \ufb01nd common graph smoothness assumptions directly motivate simple\nlabel selection methods with interesting theoretical guarantees. These methods\nbound prediction error in terms of the smoothness of the true labels with respect\nto the graph. Some of these bounds give new motivations for previously proposed\nalgorithms, and some suggest new algorithms which we evaluate. We show im-\nproved performance over baseline methods on several real world data sets.\n\n1 Introduction\n\nIn this work we consider learning on a graph. 
Assume we have an undirected graph of n nodes given by a symmetric weight matrix W. The ith node in the graph has a label yi ∈ {0, 1} stored in a vector of labels y ∈ {0, 1}^n. We want to predict all of y from the labels yL for a labeled subset L ⊂ V = [n]. V is the set of all vertices. We use ŷ ∈ {0, 1}^n to denote our predicted labels. The number of incorrect predictions is ||y − ŷ||^2.\n
Graph-based learning is an interesting alternative to traditional feature-based learning. In many problems, graph representations are more natural than feature vector representations. When classifying web pages, for example, edge weights in the graph may incorporate information about hyperlinks. Even when the original data is represented as feature vectors, transforming the data into a graph (for example using a Gaussian kernel to compute weights between points) can be convenient for exploiting properties of a data set.\n
In order to bound prediction error, we assume that the labels are smoothly varying with respect to the underlying graph. The simple smoothness assumption we use is that Σ_{i,j} Wi,j|yi − yj| is small. Here |·| denotes absolute value, but the labels are binary so we can equivalently use squared difference. This smoothness assumption has been used by graph-based semi-supervised learning algorithms which compute ŷ using a labeled set L chosen uniformly at random from V [Blum and Chawla, 2001, Hanneke, 2006, Pelckmans et al., 2007, Bengio et al., 2006] and by online graph labeling methods that operate on an adversarially ordered stream of vertices [Pelckmans and Suykens, 2008, Brautbar, 2009, Herbster et al., 2008, 2005, Herbster, 2008].\n
In this work we consider methods that make use of the smoothness assumption and structure of the graph in order to both select L as well as make predictions. 
Our hope is to achieve higher prediction accuracy as compared to random label selection and other methods for choosing L. We are particularly interested in batch offline methods which select L up front, receive yL, and then predict ŷ. The single batch, offline label selection problem is important in many real-world applications because it is often the case that problem constraints make requesting more than one batch of labels very costly. For example, if requesting a label involves a time consuming, expensive experiment (potentially involving human subjects), it may be significantly less costly to run a single batch of experiments in parallel as compared to running experiments in series.\n
We give several methods which, under the assumption that Σ_{i,j} Wi,j|yi − yj| is small, guarantee the prediction error ||y − ŷ||^2 will also be small. Some of the bounds provide interesting justifications for previously used methods, and we show improved performance over random label selection and baseline submodular maximization methods on several real world data sets.\n\n2 General Worst Case Bound\n\nWe first give a simple worst case bound on prediction error in terms of label smoothness using few assumptions about the method used to select labels or make predictions. In fact, the only assumption we make is that the predictions are consistent with the set of labeled points (i.e. ŷL = yL). The bound motivates an interesting method for selecting labeled points and provides a new motivation for a standard prediction method of Blum and Chawla [2001] when used with arbitrarily selected L. The bound also forms the basis of the other bounds we derive which make additional assumptions.\n
Define the graph cut function Γ(A, B) ≜ Σ_{i∈A, j∈B} Wi,j. 
Let\n\nΨ(L) ≜ min_{∅ ≠ T ⊆ (V \ L)} Γ(T, V \ T) / |T|\n\nNote this function is different from normalized cut (also called sparsest cut). In this function, the denominator is simply |T| while for normalized cut the denominator is min(|T|, |V \ T|). This difference is important: computing normalized cut is NP-hard, but we will show Ψ(L) can be computed in polynomial time. Ψ(L) measures how easily we can cut a large portion of the graph away from L. If Ψ(L) is small, then we can separate many nodes from L without cutting very many edges. We show that Ψ(L), where L is the set of labeled vertices, measures to what extent prediction error can be high relative to label smoothness. This makes intuitive sense because if Ψ(L) is small then there is a large set of unlabeled nodes which are weakly connected to the remainder of the graph (including L).\n
Theorem 1. For any ŷ consistent with a labeled set L\n\n||y − ŷ||^2 ≤ (1/(2Ψ(L))) Σ_{i,j} Wi,j (|yi − yj| ⊕ |ŷi − ŷj|) ≤ (1/(2Ψ(L))) (Σ_{i,j} Wi,j|yi − yj| + Σ_{i,j} Wi,j|ŷi − ŷj|)\n\nwhere ⊕ is the XOR operator.\n
Proof. Let I be the set of incorrectly classified points. First note that I ∩ L = ∅ (none of the labeled points are incorrectly classified).\n\n|I| = Γ(I, V \ I) · (|I| / Γ(I, V \ I)) ≤ Γ(I, V \ I) / Ψ(L)\n\nNote that for all of the edges (i, j) counted in Γ(I, V \ I), ŷi = ŷj implies yi ≠ yj and ŷi ≠ ŷj implies yi = yj. 
Then\n\n|I| ≤ (1/(2Ψ(L))) Σ_{i,j} Wi,j (|yi − yj| ⊕ |ŷi − ŷj|)\n\nThe 1/2 term is introduced because the sum double counts edges.\n
This bound is tight when the set of incorrectly classified points I is one of the sets minimizing min_{∅ ≠ T ⊆ (V \ L)} Γ(T, V \ T)/|T|.\n
This bound provides an interesting justification for the algorithm in Blum and Chawla [2001] and related methods when used with arbitrarily selected labeled sets. The term involving the predicted labels, Σ_{i,j} Wi,j|ŷi − ŷj|, is the objective function minimized under the constraint ŷL = yL by the algorithm of Blum and Chawla [2001]. When this is used to compute ŷ, the bound simplifies.\n
Lemma 1. If\n\nŷ = argmin_{ŷ ∈ {0,1}^n : ŷL = yL} Σ_{i,j} Wi,j|ŷi − ŷj|\n\nfor a labeled set L then\n\n||y − ŷ||^2 ≤ (1/Ψ(L)) Σ_{i,j} Wi,j|yi − yj|\n\n
ComputeCut(L)\n  T′ ← V \ L\n  repeat\n    T ← T′\n    λ ← Γ(T, V \ T) / |T|\n    T′ ← argmin_{A ⊆ (V \ L)} Γ(A, V \ A) − λ|A|\n  until Γ(T′, V \ T′) − λ|T′| = 0\n  return T\n\nMaximizeΨ(L, k)\n  L ← ∅\n  repeat\n    T ← ComputeCut(L)\n    i ← random vertex in T\n    L ← L ∪ {i}\n  until |L| = k\n  return L\n\nFigure 1: Left: Algorithm for computing Ψ(L). Right: Heuristic for maximizing Ψ(L).\n
Proof. When we choose ŷ in this way, Σ_{i,j} Wi,j|ŷi − ŷj| ≤ Σ_{i,j} Wi,j|yi − yj| and the lemma follows from Theorem 1.\n
Label propagation solves a version of this problem in which ŷ is real valued [Bengio et al., 2006]. The bound also motivates a simple label selection method. In particular, we would like to select a labeled set L that maximizes Ψ(L). 
We first describe how to compute Ψ(L) for a fixed L. Computing Ψ(L) is related to computing\n\nmin_{T ⊆ (V \ L)} Γ(T, V \ T) − λ|T|   (1)\n\nwith parameter λ > 0. The following result is paraphrased from Fujishige [2005] (pages 248-249).\n
Theorem 2. λ′ = min_T f(T)/g(T) if and only if\n\nmin_T f(T) − λg(T) = 0 for all λ ≤ λ′\n\nand\n\nmin_T f(T) − λg(T) < 0 for all λ > λ′\n\nWe can compute Equation 1 for all λ via a parametric maxflow/mincut computation (it is known there are no more than n − 1 distinct solutions). This gives a polynomial time algorithm for computing Ψ(L). Note this theorem is for unconstrained minimization over T, but restricting T ∩ L = ∅ does not change the result: this constraint simply removes elements from the ground set. In practice, this constraint can be enforced by contracting the graph used in the flow computations or by giving certain edges infinite capacity.\n
As an alternative to solving the parametric flow problem, we can find the desired λ value through an iterative method [Cunningham, 1985]. The left of Figure 1 shows this approach. The algorithm takes in a set L and computes argmin_{∅ ≠ T ⊆ (V \ L)} Γ(T, V \ T)/|T|. The correctness proof is simple. When the algorithm terminates, we know λ ≥ λ′ = min_{∅ ≠ T ⊆ (V \ L)} Γ(T, V \ T)/|T| because we set λ to be Γ(T, V \ T)/|T| for a particular T. By Theorem 2 and the termination condition, we also know λ ≤ λ′ and can conclude λ = λ′ and the set T returned achieves this minimum. One can also show the algorithm terminates in at most |V| iterations [Cunningham, 1985].\n
Having shown how to compute Ψ(L), we now consider methods for maximizing it. 
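For intuition, Ψ(L) can also be evaluated directly from its definition by enumerating subsets. The following is a minimal brute-force sketch (the helper names are ours, and the enumeration is exponential in n, so this is only for checking intuition on toy graphs; the parametric flow and iterative methods described above are the practical routes):

```python
from itertools import combinations

def cut_weight(W, T):
    """Total weight of edges crossing from T to its complement."""
    n = len(W)
    Tset = set(T)
    return sum(W[i][j] for i in Tset for j in range(n) if j not in Tset)

def psi(W, L):
    """Psi(L): minimum of cut(T)/|T| over nonempty T contained in V \\ L."""
    n = len(W)
    rest = [v for v in range(n) if v not in set(L)]
    best = float("inf")
    for size in range(1, len(rest) + 1):
        for T in combinations(rest, size):
            best = min(best, cut_weight(W, T) / len(T))
    return best
```

On a unit-weight path 0-1-2-3 with L = {0}, the minimizing set is T = {1, 2, 3} (one edge cut, three vertices), so Ψ({0}) = 1/3, matching the intuition that a small Ψ means many nodes can be separated from L cheaply.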
Ψ is neither submodular nor supermodular. This seems to rule out straightforward set function optimization. In our experiments, we try a simple heuristic based on the following observation: for any L, if Ψ(L′) > Ψ(L) then it must be the case that L′ intersects one of the cuts minimizing min_{∅ ≠ T ⊆ (V \ L)} Γ(T, V \ T)/|T|. In other words, in order to increase Ψ(L) we must necessarily include a point from the current cut. Our heuristic is then to simply add a random element from this cut to L. The right of Figure 1 shows this method.\n
Several issues remain. First, although we have proposed a reasonable heuristic for maximizing Ψ(L), we do not have methods for maximizing it exactly or with guaranteed approximation. Aside from knowing the function is not submodular or supermodular, we also do not know the hardness of the problem. In the next section, we describe a lower bound on the Ψ function based on a notion of graph covering. This lower bound can be maximized approximately via a simple algorithm and has a well understood hardness of approximation. Second, we have found in experimenting with our heuristic for maximizing Ψ(L) that the function can be prone to imbalanced cuts; the computed cuts sometimes contain all or most of the unselected points V \ L and other times focus on small sets of outliers. We give a third bound on error which attempts to address some of this sensitivity.\n\n3 Graph Covering Algorithm\n\nThe method we consider in this section uses a notion of graph covering. We say a set L α-covers the graph if ∀i ∈ V either i ∈ L or Σ_{j∈L} Wi,j ≥ α. In other words, every node in the graph is either in L or connected with total weight at least α to nodes in L (or both). This is a simple real valued extension of dominating sets. 
A dominating set is a set L ⊆ V such that ∀i ∈ V either i ∈ L or a neighbor of i is in L (or both). This notion of covering is related to the Ψ function discussed in the previous section. In particular, if a set L α-covers a graph then it is necessarily the case that Ψ(L) ≥ α. The converse does not hold, however. In other words, α is a lower bound on Ψ(L). Then, α can replace Ψ(L) in the bound in the previous section for a looser upper bound on prediction error. Although the bound is looser, compared to maximizing Ψ(L) we better understand the complexity of computing an α-cover.\n
Corollary 1. For any ŷ consistent with a labeled set L that is an α-cover\n\n||y − ŷ||^2 ≤ (1/(2α)) Σ_{i,j} Wi,j (|yi − yj| ⊕ |ŷi − ŷj|) ≤ (1/(2α)) (Σ_{i,j} Wi,j|yi − yj| + Σ_{i,j} Wi,j|ŷi − ŷj|)\n\nwhere ⊕ is the XOR operator.\n
Similar to Lemma 1, by making additional assumptions concerning the prediction method used we can derive a slightly simpler bound. In particular, for a labeled set L that is an α-cover, we assume unlabeled nodes are labeled with the weighted majority vote of neighbors in L. In other words, set ŷi = yi for i ∈ L, and set ŷi = y′ for i ∉ L with y′ such that Σ_{j∈L : yj = y′} Wi,j ≥ Σ_{j∈L : yj ≠ y′} Wi,j. With this prediction method we get the following bound.\n
Lemma 2. If L is an α-cover and V \ L is labeled according to majority vote\n\n||y − ŷ||^2 ≤ (1/α) Σ_{i,j} Wi,j|yi − yj|(1 − |ŷi − ŷj|) ≤ (1/α) Σ_{i,j} Wi,j|yi − yj|\n\nProof. The right hand side follows immediately from the middle expression, so we focus on the first inequality. 
For every incorrectly labeled node i, there is a set of nodes Li = {j ∈ L : ŷi = ŷj} which satisfies yi ≠ yj ∀j ∈ Li, and Σ_{j∈Li} Wi,j ≥ α/2. We then have for every incorrectly labeled node a unique set of edges with total weight at least α/2 included inside the summation in the middle expression.\n
In computing an α-cover, we want to solve\n\nmin_{L⊆V} |L| : F(L) ≥ α\n\nwhere\n\nF(L) ≜ min_{i∈V\L} Σ_{j∈L} Wi,j = F′(L) ≜ min_{i∈V} Σ_{j∈L} W′i,j\n\nwhere W′i,j = Wi,j for i ≠ j and W′i,i = ∞. F′ is the minimum of a set of modular functions. F′ is neither supermodular nor submodular. However, we can still compute an approximately minimal α-cover using a trick introduced by Krause et al. [2008]. In particular, Krause et al. [2008] point out that\n\nΣ_{j∈L} W′i,j ≥ α ⇔ min(Σ_{j∈L} W′i,j, α) ≥ α\n\nAlso, min(Σ_{j∈L} W′i,j, α) is submodular, and the sum of submodular functions is submodular. Then, we can replace F′ with\n\nF′α(L) = (1/n) Σ_i min(Σ_{j∈L} W′i,j, α)\n\nand solve\n\nmin_{L⊆V} |L| : F′α(L) ≥ α\n\nThis is a submodular set cover problem. The greedy algorithm has approximation guarantees for this problem for integer valued functions [Krause et al., 2008]. For binary weight graphs the approximation is O(log n). For real valued functions, it's possible to round the function values to get an approximation guarantee. 
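A minimal sketch of this greedy construction follows (the function names are ours; it greedily maximizes the truncated coverage F′α, scaled by n, and stops once every node is either in L or covered at level α, with the W′i,i = ∞ diagonal handled by treating members of L as fully covered):

```python
def greedy_alpha_cover(W, alpha):
    """Greedy submodular set cover sketch for building an alpha-cover."""
    n = len(W)

    def coverage(L):
        # n * F'_alpha(L): each node contributes min(weight to L, alpha);
        # nodes already in L count as fully covered (the infinite diagonal).
        total = 0.0
        for i in range(n):
            if i in L:
                total += alpha
            else:
                total += min(sum(W[i][j] for j in L), alpha)
        return total

    L = set()
    # stop when coverage reaches alpha * n, i.e. every node is covered
    while coverage(L) < alpha * n - 1e-12:
        best = max((v for v in range(n) if v not in L),
                   key=lambda v: coverage(L | {v}))
        L.add(best)
    return L
```

On a unit-weight binary graph with α = 1 this reduces to the greedy dominating set algorithm, consistent with the hardness discussion that follows.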
In practice, we apply the greedy algorithm directly.\n
As previously mentioned, α-covers can be seen as real valued generalizations of dominating sets. In particular, an α-cover is a dominating set for binary weight graphs and α = 1. The hardness of approximation results for finding a minimum size dominating set then carry over to the more general α-cover problem. The next theorem shows that the α-cover problem is NP-hard and in fact the greedy algorithm for computing an α-cover is optimal up to constant factors for α = 1 and binary weight graphs. It is based on the well known connection between finding a minimum dominating set and finding a minimum set cover.\n
Theorem 3. Finding the smallest dominating set L in a binary weight graph is NP-complete. Furthermore, if there is some ε > 0 such that a polynomial time algorithm approximates the smallest dominating set within (1 − ε) ln(n/2) then NP ⊂ TIME(n^O(log log n)).\n
We have so far discussed computing a small α-cover for a fixed α. If we instead have a fixed label budget and want to maximize α, we can do so by performing binary search over α. This is the approach used by Krause et al. [2008] and gives a bi-approximation.\n\n4 Normalized Cut Algorithm\n\nIn this section we consider an algorithm that clusters the data set and replaces the Ψ function with a normalized cut value. The normalized cut value for a set T ⊂ V is\n\nΓ(T, V \ T) / min(|T|, |V \ T|)\n\nIn other words, normalized cut is the ratio between the cut value for T and the minimum of the size of T and its complement. 
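On toy graphs the normalized cut value can be checked by brute force against this definition (a sketch with our own helper names, exponential in n, meant only to illustrate how the denominator differs from the |T| denominator used by Ψ):

```python
from itertools import combinations

def cut_weight(W, T):
    """Total weight of edges crossing between T and its complement."""
    n = len(W)
    Tset = set(T)
    return sum(W[i][j] for i in Tset for j in range(n) if j not in Tset)

def min_normalized_cut(W):
    """Brute-force min over proper nonempty T of cut(T)/min(|T|, |V \\ T|)."""
    n = len(W)
    best = float("inf")
    # the objective is symmetric in T and its complement, so |T| <= n/2
    # suffices, and then min(|T|, |V \\ T|) = |T|
    for size in range(1, n // 2 + 1):
        for T in combinations(range(n), size):
            best = min(best, cut_weight(W, T) / size)
    return best
```

For two unit-weight triangles joined by a single edge, the minimizer is one triangle: one edge cut over three vertices, giving a value of 1/3.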
Computing the minimum normalized cut for a graph is NP-hard.\n
Consider the following method: 1) partition the set of nodes V into clusters S1, S2, ..., Sk, 2) for each cluster request sufficient labels to estimate the majority class with probability at least 1 − δ/k, and 3) label all nodes in each cluster with the majority label for that cluster. Here the probability 1 − δ/k is with respect to the choice of the labeled nodes used to estimate the majority class for each cluster.\n
Theorem 4. Let S1, S2, ..., Sk be a partition of V, and assume we have estimates of the majority class of each Sl, each of which is accurate with probability at least 1 − δ/k. If ŷ labels every i ∈ Sl according to the estimated majority label for Sl then with probability at least 1 − δ\n\n||y − ŷ||^2 ≤ Σ_l (1/(2φl)) Σ_{i,j∈Sl} Wi,j|yi − yj| ≤ (1/(2φ)) Σ_{i,j} Wi,j|yi − yj|\n\nwhere\n\nφl = min_{T⊂Sl} Γ(T, Sl \ T) / min(|T|, |Sl \ T|)\n\nand\n\nφ = min_l φl\n\nProof. By the union bound, the estimated majority labels for all of the clusters are correct with probability at least 1 − δ. Let I be the set of incorrectly labeled nodes (errors). We consider the intersection of I with each of the clusters. Let Il ≜ I ∩ Sl, so I = ∪_{l=1}^k Il. Note that |Il| ≤ |Sl \ Il| since we labeled each cluster according to the majority label for the cluster. Then\n\n|I| = Σ_l |Il| = Σ_l Γ(Il, Sl \ Il) · min(|Il|, |Sl \ Il|) / Γ(Il, Sl \ Il) ≤ Σ_l Γ(Il, Sl \ Il) / φl\n\nFor any i, j with i ∈ Il and j ∈ Sl \ Il, we must have yi ≠ yj. Also, for any i, j with yi ≠ yj and i, j ∈ Sl, either i ∈ Il or j ∈ Il. 
In other words, there is a one-to-one correspondence between 1) edges i, j for which i, j ∈ Sl and either i ∈ Il or j ∈ Il and 2) edges i, j for which i, j ∈ Sl and yi ≠ yj. The desired result then follows.\n
Note in practice we only label the unlabeled nodes in each cluster using the majority label estimates. Using the true labels for the labeled nodes only decreases error, so the theorem still holds.\n
In this bound, φ is a measure of the density of the clusters. Computing φl for a particular cluster is NP-hard, but there are approximation algorithms. However, we are not aware of approximation algorithms for computing a partition such that φ is maximized. This is different from the standard normalized cut clustering problem; we do not care if clusters are strongly connected to each other, only that each cluster is internally dense. In our experiments, we try several standard clustering algorithms and achieve good real world performance, but it remains an interesting open question to design a clustering algorithm for directly maximizing φ. An approach we have not yet tried is to use the error bound to choose between the results of different clustering algorithms.\n
We now consider the problem of estimating the majority class for a cluster. If we uniformly sample labels from a cluster, standard results give that the probability of incorrectly estimating the majority decreases exponentially with the number of labels if the fraction of nodes in the minority class is bounded away from 1/2 by a constant. We now show that if the labels are sufficiently smooth and the cluster is sufficiently dense then the fraction of nodes in the minority class is small.\n
Theorem 5. The fraction of nodes in the minority class of S is at most\n\n(Σ_{i,j∈S} Wi,j|yi − yj|) / (φ|S|)\n\nwhere\n\nφ = min_{T⊂S} Γ(T, S \ T) / min(|T|, |S \ T|)\n\nProof. 
Let S− be the set of nodes belonging to the minority class and S+ be the set of nodes belonging to the other class. Let f be the fraction of nodes in the minority class.\n\nf = |S−|/|S| = min(|S+|, |S−|)/|S| = (min(|S+|, |S−|)/Γ(S+, S−)) · (Γ(S+, S−)/|S|) ≤ Γ(S+, S−)/(φ|S|) ≤ (Σ_{i,j∈S} Wi,j|yi − yj|) / (φ|S|)\n\n
If we have an estimate of the smoothness of the labels in a cluster, we can use this bound and an approximation of φ to determine the number of labels needed to estimate the majority class with high confidence. In our experiments, we simply request a single label per cluster.\n\n
Dataset/labels  Spectral       k-Cut          METIS          Ψ              Baseline\n
Digit1/10       9.54 (4.42)    50.02 (1.04)   4.93 (4.05)    49.92 (3.18)   20.90 (15.67)\n
Text/10         37.64 (8.64)   50.03 (0.3)    34.76 (6.05)   50.05 (0.06)   45.91 (7.96)\n
BCI/10          50.13 (2.16)   50.16 (0.64)   49.68 (2.63)   50.32 (0.55)   50.12 (1.32)\n
USPS/10         15.22 (6.22)   31.53 (23.65)  8.15 (5.51)    20.07 (2.70)   15.87 (4.82)\n
g241c/10        39.63 (5.67)   50.03 (0.03)   29.18 (7.28)   50.29 (0.07)   47.26 (5.19)\n
g241d/10        22.31 (7.06)   50.02 (0.23)   22.57 (7.26)   50.01 (0.09)   48.46 (3.39)\n
Digit1/100      4.47 (1.35)    50.07 (1.46)   3.24 (0.76)    2.60 (0.83)    2.57 (0.67)\n
Text/100        31.67 (2.41)   50.26 (2.73)   32.57 (1.88)   48.34 (0.67)   26.82 (3.88)\n
BCI/100         47.37 (2.80)   50.14 (0.5)    45.35 (1.91)   48.17 (1.87)   47.48 (2.99)\n
USPS/100        6.23 (1.49)    31.13 (26.31)  9.28 (1.38)    10.17 (0.39)   6.33 (2.46)\n
g241c/100       44.31 (2.09)   50.02 (0.18)   37.47 (2.13)   52.48 (0.37)   42.86 (4.50)\n
g241d/100       41.70 (2.44)   50.03 (0.18)   35.96 (1.99)   50.33 (0.21)   41.56 (4.34)\n\n
Table 1: Error rate mean (standard deviation) for different data set, label count, method combinations.\n\n
Figure 2: Left: Points selected by the Ψ function maximization 
method. Right: Points selected by the spectral clustering method.\n\n5 Experiments\n\nWe experimented with a method based on Lemma 1. We use the randomized method for maximizing Ψ and then predict with min-cuts [Blum and Chawla, 2001]. We also tried a method based on Theorem 4. We cluster the data then label each cluster according to a single randomly chosen point. We chose the number of clusters to be equal to the number of labeled points, observing that if a cluster is split evenly amongst the two classes then we will have a high error rate regardless of how well we estimate the majority class. We tried three clustering algorithms: a spectral clustering method [Ng et al., 2001], the METIS package for graph partitioning [Karypis and Kumar, 1999], and a k-cut approximation algorithm [Saran and Vazirani, 1995, Gusfield, 1990]. As a baseline we use random label selection and prediction using the label propagation method of Bengio et al. [2006] with ε = 10^−6 and µ = 10^−6 and class mass normalization. We also experimented with a method motivated by the graph covering bound, but for lack of space we omit these results.\n
We used six benchmark data sets [Chapelle et al., 2006]. We use graphs constructed with a Gaussian kernel with standard deviation chosen to be the average distance to the k1th nearest neighbor divided by 3 (a similar heuristic is used by Chapelle et al. [2006]). We then make this graph sparse by removing the edge between nodes i and j unless i is one of j's k2 nearest neighbors or j is one of i's k2 nearest neighbors. We use 10 and 100 labels. We set k1 and k2 for each data set and label count to be the parameters which give the lowest average error rate for label propagation, averaging over 100 trials and choosing from the set {5, 10, 50, 100}. 
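The graph construction just described can be sketched as follows (pure Python with our own function name; a practical implementation would use numpy/scipy, and the exact bandwidth heuristic here is our reading of the description above):

```python
import math

def build_graph(points, k1, k2):
    """Gaussian-kernel weight matrix with symmetric kNN sparsification:
    keep edge (i, j) only if i is in j's k2 nearest neighbors or vice versa;
    sigma is the average k1-th nearest-neighbor distance divided by 3."""
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    # sorted(dist[i])[0] is the distance to self (0.0), so index k1 is
    # the distance to the k1-th nearest neighbor
    knn_dists = [sorted(dist[i])[k1] for i in range(n)]
    sigma = sum(knn_dists) / (3 * n)
    # each node's k2 nearest neighbors, excluding itself
    neighbors = [set(sorted(range(n), key=lambda j: dist[i][j])[1:k2 + 1])
                 for i in range(n)]
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and (j in neighbors[i] or i in neighbors[j]):
                W[i][j] = math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
    return W
```

The resulting W is symmetric by construction (the neighbor test and the distance are both symmetric in i and j), as required for the undirected-graph setting.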
We tune the graph construction parameters to give low error for the baseline method to ensure any bias is in favor of the baseline as opposed to the new methods we propose. We then report average error over 1000 trials in the 10 label case and 100 trials in the 100 label case for each combination of data set and algorithm.\n
Table 1 shows these results. We find that the Ψ function method does not perform well. We found on most of the data sets the cuts found by the method included all or almost all of V \ L. In this case the points selected are essentially random. However, on the USPS data set and on some synthetic data sets we have tried, we have also observed the opposite behavior where the cuts are very small and seem to focus on small sets of outliers. Figure 2 shows an example of this. The k-cut method also did not perform well. We've found this method has similar problems with outliers. We think these outlier sensitive methods are impractical for graphs constructed from real world data.\n
The results for the spectral clustering and METIS clustering methods, however, are quite encouraging. These methods performed well, matching or beating the baseline method on the 10 label trials and in some cases significantly improving performance. The METIS method seems particularly robust. On the 100 label trials, performance was not as good. In general, we expect label selection to help more when learning from very few labels. The choice of clustering method seems to be of great practical importance. The clustering methods which work best seem to be methods which minimize normalized cut like objectives. 
This is not surprising given the presence of the normalized cut term in Theorem 4, but it is an open problem to give a clustering method for directly minimizing the bound.\n
We finally note that the numbers we report for our baseline method are in some cases significantly different than the published numbers [Chapelle et al., 2006]. This seems to be because of a variety of factors including differences in implementation as well as significant differences in experiment set up. We have also experimented with several heuristic modifications to our methods and compared our methods to simple greedy methods. One modification we tried is to use label propagation for prediction in conjunction with our label selection methods. We omit these results for lack of space.\n\n6 Related Work\n\nPrevious work has also used clustering, covering, and other graph properties to guide label selection on graphs. We are, however, the first to our knowledge to give bounds which relate prediction error to label smoothness for single batch label selection methods. Most previous work on label selection methods for learning on graphs has considered active (i.e. sequential) label selection [Zhu and Lafferty, 2003, Pucci et al., 2007, Zhao et al., 2008, Wang et al., 2007, Afshani et al., 2007]. Afshani et al. [2007] show in this setting that O(c log(n/c)) labels, where c = Σ_{i,j} Wi,j|yi − yj|, are sufficient and necessary to learn the labeling exactly under some balance assumptions. Without balance assumptions they show O(c log(1/ε) + c log(n/c)) labels are sufficient to achieve an ε error rate. In some cases, our bounds are better despite considering only non-sequential label selection. Consider the case where c grows linearly with n so c/n = a for some constant a > 0. In this case, with the bound of Afshani et al. [2007] the number of labels required to achieve a fixed error rate ε also grows linearly with n. 
In comparison, our graph covering bound needs an α-cover with α = a/ε. For some graph topologies, the size of such a cover can grow sublinearly with n (for example if the graph contains large, dense clusters). Afshani et al. [2007] also use a kind of dominating set in their method, and it could be interesting to see if portions of their analysis could be adapted to the offline setting. Zhao et al. [2008] also use a clustering algorithm to select initial labels.\n
Other work has given generalization error bounds in terms of label smoothness [Pelckmans et al., 2007, Hanneke, 2006, Blum et al., 2004] for transductive learning from randomly selected L. These bounds are PAC style which typically show that, roughly, the error rate decreases with O(Σ_{i,j} Wi,j|yi − yj|/(b|L|)) where b is the minimum 2-cut of the graph. Depending on the graph structure, our bounds can be significantly better. For example, if a binary weight graph contains c cliques of size n/c, then we can find an α-cover of size cα log(cα), giving an error rate of O(Σ_{i,j} Wi,j|yi − yj|/(nα)). This is better if c log(cα) < n/b.\n
A line of work has examined mistake bounds in terms of label smoothness for online learning on graphs [Pelckmans and Suykens, 2008, Brautbar, 2009, Herbster et al., 2008, 2005, Herbster, 2008]. These mistake bounds hold no matter how the sequence of vertices is chosen. Herbster [2008] also considers how cluster structure can improve mistake bounds in this setting and gives a mistake bound similar to our graph covering bound on prediction error. Herbster et al. [2005] discusses using an active learning method for the first several steps of an online algorithm. Our work differs from this previous work by considering prediction error bounds for offline learning as opposed to mistake bounds for online learning. 
The mistake bound setting is significantly different as the prediction method receives feedback after every prediction.\n\nAcknowledgments\n\nThis material is based upon work supported by the National Science Foundation under grant IIS-0535100.\n\nReferences\n
P. Afshani, E. Chiniforooshan, R. Dorrigiv, A. Farzan, M. Mirzazadeh, N. Simjour, and H. Zarrabi-Zadeh. On the complexity of finding an unknown cut via vertex queries. In COCOON, 2007.\n
Y. Bengio, O. Delalleau, and N. Le Roux. Label propagation and quadratic criterion. In O. Chapelle, B. Schölkopf, and A. Zien, editors, Semi-Supervised Learning. MIT Press, 2006.\n
A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In ICML, 2001.\n
A. Blum, J. Lafferty, M. R. Rwebangira, and R. Reddy. Semi-supervised learning using randomized mincuts. In ICML, 2004.\n
M. Brautbar. Online learning a labeling of a graph. In Mining and Learning with Graphs, 2009.\n
O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, 2006.\n
W. Cunningham. Optimal attack and reinforcement of a network. Journal of the ACM, 1985.\n
S. Fujishige. Submodular Functions and Optimization. Elsevier Science, 2005.\n
D. Gusfield. Very simple methods for all pairs network flow analysis. SIAM Journal on Computing, 1990.\n
S. Hanneke. An analysis of graph cut size for transductive learning. In ICML, 2006.\n
M. Herbster. Exploiting cluster-structure to predict the labeling of a graph. In ALT, 2008.\n
M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In ICML, 2005.\n
M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In NIPS, 2008.\n
G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 1999.\n
A. Krause, H. B. McMahan, C. Guestrin, and A. Gupta. Robust submodular observation selection. JMLR, 2008.\n
A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, 2001.\n
K. Pelckmans and J. Suykens. An online algorithm for learning a labeling of a graph. In Mining and Learning with Graphs, 2008.\n
K. Pelckmans, J. Shawe-Taylor, J. Suykens, and B. De Moor. Margin based transductive graph cuts using linear programming. 2007.\n
A. Pucci, M. Gori, and M. Maggini. Semi-supervised active learning in graphical domains. In Mining and Learning with Graphs, 2007.\n
H. Saran and V. V. Vazirani. Finding k cuts within twice the optimal. SIAM Journal on Computing, 1995.\n
M. Wang, X. Hua, Y. Song, J. Tang, and L. Dai. Multi-concept multi-modality active learning for interactive video annotation. In International Conference on Semantic Computing, 2007.\n
W. Zhao, J. Long, E. Zhu, and Y. Liu. A scalable algorithm for graph-based active learning. In Frontiers in Algorithmics, 2008.\n
X. Zhu and J. Lafferty. Combining active learning and semi-supervised learning using gaussian fields and harmonic functions. In ICML, 2003.\n", "award": [], "sourceid": 936, "authors": [{"given_name": "Andrew", "family_name": "Guillory", "institution": null}, {"given_name": "Jeff", "family_name": "Bilmes", "institution": null}]}