{"title": "Distributed Information Regularization on Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 297, "page_last": 304, "abstract": null, "full_text": "Distributed Information Regularization on Graphs\n\nAdrian Corduneanu\nCSAIL MIT\nCambridge, MA 02139\nadrianc@csail.mit.edu\n\nTommi Jaakkola\nCSAIL MIT\nCambridge, MA 02139\ntommi@csail.mit.edu\n\n\nAbstract\n\nWe provide a principle for semi-supervised learning based on optimizing the rate of\ncommunicating labels for unlabeled points with side information. The side information is\nexpressed in terms of identities of sets of points or regions with the purpose of biasing the\nlabels in each region to be the same. The resulting regularization objective is convex, has a\nunique solution, and the solution can be found with a pair of local propagation operations on\ngraphs induced by the regions. We analyze the properties of the algorithm and demonstrate its\nperformance on document classification tasks.\n\n\n1    Introduction\n\nA number of approaches and algorithms have been proposed for semi-supervised learning\nincluding parametric models [1], random field/walk models [2, 3], or discriminative (kernel\nbased) approaches [4]. The basic intuition underlying these methods is that the labels\nshould not change within clusters of points, where the definition of a cluster may vary from\none method to another.\n\nWe provide here an alternative information theoretic criterion and associated algorithms\nfor solving semi-supervised learning problems. 
Our formulation, an extension of [5, 6],\nis based on the idea of minimizing the number of bits required to communicate labels for\nunlabeled points, and involves no parametric assumptions. The communication scheme\ninherent to the approach is defined in terms of regions, weighted sets of points, that are\nshared between the sender and the receiver. The regions are important in capturing the\ntopology over the points to be labeled, and, through the communication criterion, bias the\nlabels to be the same within each region.\n\nWe start by defining the communication game and the associated regularization problem,\nanalyze properties of the regularizer, derive distributed algorithms for finding the unique\nsolution to the regularization problem, and demonstrate the method on a document\nclassification task.\n\n[Figure 1 diagram: regions R_1, ..., R_m with prior P(R), linked by weights P(x|R) to points x_1, x_2, ..., x_{n-1}, x_n, each with a conditional Q(y|x)]\n\nFigure 1: The topology imposed by the set of regions (squares) on unlabeled points (circles)\n\n\n2    The communication problem\n\nLet S = {x_1, ..., x_n} be the set of unlabeled points and Y the set of possible labels.\nWe assume that target labels are available only for a small subset S_l ⊂ S of the unlabeled\npoints. The objective here is to find a conditional distribution Q(y|x) over the labels at each\nunlabeled point x ∈ S. 
The estimation is made possible by a regularization criterion over\nthe conditionals which we define here through a communication problem. The communication\nscheme relies on a set of regions ℛ = {R_1, ..., R_m}, where each region R ∈ ℛ is\na subset of the unlabeled points S (cf. Figure 1). The weights of points within each region\nare expressed in terms of a conditional distribution P(x|R), with Σ_{x∈R} P(x|R) = 1, and each\nregion has an a priori probability P(R). We require only that Σ_{R∈ℛ} P(x|R)P(R) = 1/n\nfor all x ∈ S. (Note: in our overloaded notation \"R\" stands both for the set of points and\nits identity as a set).\n\nThe regions and the membership probabilities are set in an application specific manner. For\nexample, in a document classification setting we might define regions as sets of documents\ncontaining each word. The probabilities P(R) and P(x|R) could be subsequently derived\nfrom a word frequency representation of documents: if f(w|x) is the frequency of word\nw in document x, then for each pair of w and the corresponding region R we can set\nP(R) = Σ_{x∈S} f(w|x)/n and P(x|R) = f(w|x)/(nP(R)).\n\nFor any fixed conditionals {Q(y|x)} we define the communication problem as follows.\nThe sender selects a region R ∈ ℛ with probability P(R) and a point within the region\naccording to P(x|R). Since Σ_{R∈ℛ} P(x|R)P(R) = 1/n, each point x is overall equally\nlikely to be selected. The label y is sampled from Q(y|x) and communicated to the receiver\noptimally using a coding scheme tailored to the region R (based on knowing P(x|R) and\nQ(y|x), x ∈ R). 
The receiver has access to x, R, and the region specific coding scheme\nto reproduce y. The rate of information needed to be sent to the receiver in this scheme is\ngiven by\n\n    J_c(Q; ℛ) = Σ_{R∈ℛ} P(R) I_R(x; y)\n              = Σ_{R∈ℛ} P(R) Σ_{x∈R} Σ_{y∈Y} P(x|R) Q(y|x) log [Q(y|x) / Q(y|R)]        (1)\n\nwhere Q(y|R) = Σ_{x∈R} P(x|R)Q(y|x) is the overall probability of y within the region.\n\n\n3    The regularization problem\n\nWe use J_c(Q; ℛ) to regularize the conditionals. This regularizer biases the conditional\ndistributions to be constant within each region so as to minimize the communication cost\nI_R(x; y). Put another way, by introducing a region R we bias the points in the region\nto be labeled the same. By adding the cost of encoding the few available labeled points,\nexpressed here in terms of the empirical distribution P̂(y, x) whose support lies in S_l, the\noverall regularization criterion is given by\n\n    J(Q; λ) = - Σ_{x∈S_l} Σ_{y∈Y} P̂(y, x) log Q(y|x) + λ J_c(Q; ℛ)        (2)\n\nwhere λ > 0 is a regularization parameter. 
The following lemma guarantees that the\nsolution is always unique:\n\nLemma 1  J(Q; λ) for λ > 0 is a strictly convex function of the conditionals {Q(y|x)}\nprovided that 1) each point x ∈ S belongs to at least one region containing at least two\npoints, and 2) the membership probabilities P(x|R) and P(R) are all non-zero.\n\nThe proof follows immediately from the strict convexity of mutual information [7] and the\nfact that the two conditions guarantee that each Q(y|x) appears non-trivially in at least one\nmutual information term.\n\n\n4    Regularizer and the number of labelings\n\nWe consider here a simple setting where the labels are hard and binary, Q(y|x) ∈ {0, 1},\nand seek to bound the number of possible binary labelings consistent with a cap on the\nregularizer.\n\nWe assume for simplicity that points in a region have uniform weights P(x|R). Let N(I)\nbe the number of labelings of S consistent with an upper bound I on the regularizer\nJ_c(Q; ℛ). The goal is to show that N(I) is significantly less than 2^n and N(I) → 2\nas I → 0.\n\nTheorem 2  log_2 N(I) ≤ C(I) + I · n · t(ℛ) / min_R P(R), where C(I) → 1 as I → 0,\nand t(ℛ) is a property of ℛ.\n\nProof  Let f(R) be the fraction of positive samples in region R. Because the labels are\nbinary, I_R(x; y) is given by H(f(R)), where H is the entropy. If Σ_R P(R)H(f(R)) ≤ I\nthen certainly H(f(R)) ≤ I/P(R). Since the binary entropy is concave and symmetric\nw.r.t. 0.5, this is equivalent to f(R) ≤ g_R(I) or f(R) ≥ 1 - g_R(I), where g_R(I) is the\ninverse of H at I/P(R). We say that a region is mainly negative if the former condition\nholds, or mainly positive if the latter.\n\nIf two regions R_1 and R_2 overlap by a large amount, they must be mainly positive or mainly\nnegative together. 
Specifically this is the case if |R_1 ∩ R_2| > g_{R_1}(I)|R_1| + g_{R_2}(I)|R_2|.\nConsider a graph with vertices the regions, and edges whenever the above condition holds.\nThen regions in a connected component must be all mainly positive or mainly negative\ntogether. Let C(I) be the number of connected components in this graph, and note that\nC(I) → 1 as I → 0.\n\nWe upper bound the number of labelings of the points spanned by a given connected\ncomponent C, and subsequently combine the bounds. Consider the case in which all regions in\nC are mainly negative. For any subset C' of C that still covers all the points spanned by C,\n\n    f(C) ≤ (1/|C|) Σ_{R∈C'} g_R(I)|R| ≤ max_R g_R(I) · (Σ_{R∈C'} |R|) / |C|        (3)\n\nThus f(C) ≤ t(C) max_R g_R(I), where t(C) = min_{C'⊆C, C' cover} (Σ_{R∈C'} |R|) / |C| is the minimum\naverage number of times a point in C is necessarily covered.\n\nThere are at most 2^{n f(R) log_2(2/f(R))} labelings of a set of points of which at most n f(R) are\npositive (footnote 1). 
Thus the number of feasible labelings of the connected component C is upper\nbounded by 2^{1 + n t(C) max_R g_R(I) log_2(2/(t(C) max_R g_R(I)))}, where the 1 is because C can be either\nmainly positive or mainly negative. By cumulating the bounds over all connected\ncomponents and upper bounding the entropy-like term with I/P(R) we achieve the stated result. □\n\nNote that t(ℛ), the average number of times a point is covered by a minimal subcovering\nof ℛ, normally does not scale with |ℛ| and is a covering dependent constant.\n\n\n5    Distributed propagation algorithm\n\nWe introduce here a local propagation algorithm for minimizing J(Q; λ) that is both easy to\nimplement and provably convergent. The algorithm can be seen as a variant of the\nBlahut-Arimoto algorithm in rate-distortion theory [8], adapted to the more structured context here.\nWe begin by rewriting each mutual information term I_R(x; y) in the criterion\n\n    I_R(x; y) = Σ_{x∈R} Σ_{y∈Y} P(x|R) Q(y|x) log [Q(y|x) / Q(y|R)]        (4)\n              = min_{Q_R(·)} Σ_{x∈R} Σ_{y∈Y} P(x|R) Q(y|x) log [Q(y|x) / Q_R(y)]        (5)\n\nwhere the variational distribution Q_R(y) can be chosen independently from Q(y|x) but the\nunique minimum is attained when Q_R(y) = Q(y|R) = Σ_{x∈R} P(x|R)Q(y|x). We can\nextend the regularizer over both {Q(y|x)} and {Q_R(y)} by defining\n\n    J_c(Q, Q_R; ℛ) = Σ_{R∈ℛ} P(R) Σ_{x∈R} Σ_{y∈Y} P(x|R) Q(y|x) log [Q(y|x) / Q_R(y)]        (6)\n\nso that J_c(Q; ℛ) = min_{Q_R(·), R∈ℛ} J_c(Q, Q_R; ℛ) recovers the original regularizer.\n\nThe local propagation algorithm follows from optimizing each Q(y|x) based on fixed\n{Q_R(y)} and subsequently finding each Q_R(y) given fixed {Q(y|x)}. We omit the\nstraightforward derivation and provide only the resulting algorithm: for all points x ∈\nS ∖ S_l (not labeled) and for all regions R ∈ ℛ we perform the following complementary\naveraging updates\n\n    Q(y|x) ← (1/Z_x) exp( Σ_{R: x∈R} [nP(R)P(x|R)] log Q_R(y) )        (7)\n\n    Q_R(y) ← Σ_{x∈R} P(x|R) Q(y|x)        (8)\n\nwhere Z_x is a normalization constant. In other words, Q(y|x) is obtained by taking\na weighted geometric average of the distributions associated with the regions, whereas\nQ_R(y) is (as before) a weighted arithmetic average of the conditionals within each\nregion. 
In terms of the document classification example discussed earlier, the weight\n[nP(R)P(x|R)] appearing in the geometric average reduces to f(w|x), the frequency of\nword w identified with region R in document x.\n\nFootnote 1: The result follows from Σ_{i=0}^{k} binom(n, i) ≤ (2n/k)^k.\n\nUpdating Q(y|x) for each labeled point x ∈ S_l involves minimizing\n\n    - Σ_{y∈Y} P̂(y, x) log Q(y|x) - (λ/n) H(Q(·|x))\n        - λ Σ_{y∈Y} Q(y|x) Σ_{R: x∈R} P(R)P(x|R) log Q_R(y)        (9)\n\nwhere H(Q(·|x)) is the Shannon entropy of the conditional. While the objective is strictly\nconvex, the solution cannot be written in closed form and has to be found iteratively (e.g.,\nvia Newton-Raphson or simple bracketing when the labels are binary). A much simpler\nupdate Q(y|x) = δ(y, ŷ_x), where ŷ_x is the observed label for x, may suffice in practice.\nThis update results from taking the limit of small λ and approximates the iterative solution.\n\n\n6    Extensions\n\n6.1    Structured labels and generalized propagation steps\n\nHere we extend the regularization framework to the case where the labels represent\nmore structured annotations of objects. Let y be a vector of elementary labels y =\n[y_1, ..., y_k] associated with a single object x. We assume that the distribution Q(y|x) =\nQ(y_1, ..., y_k|x), for any x, can be represented as a tree structured graphical model, where\nthe structure is the same for all x ∈ S. The model is appropriate, e.g., in the context of\nassigning topics to documents. While the regularization principle applies directly if we leave\nQ(y|x) unconstrained, the calculations would be potentially infeasible due to the number\nof elementary labels involved, and inefficient as we would not explicitly make use of the\nassumed structure. Consequently, we seek to extend the regularization framework to handle\ndistributions of the form\n\n    Q_T(y|x) = Π_{i=1}^{k} Q_i(y_i|x) Π_{(i,j)∈T} [Q_ij(y_i, y_j|x) / (Q_i(y_i|x) Q_j(y_j|x))]        (10)\n\nwhere T defines the edge set of the tree. The regularization problem will be formulated\nover {Q_i(y_i|x), Q_ij(y_i, y_j|x)} rather than unconstrained Q(y|x).\n\nThe difficulty in this case arises from the fact that the arithmetic average (mixing) in eq\n(8) is not structure preserving (tree structured models are not mean flat). We can, however,\nalso constrain Q_R(y) to factor according to the same tree structure. By restricting the class\nof variational distributions Q_R(y) that we consider, we necessarily obtain an upper bound\non the original information criterion. 
If we minimize this upper bound with respect to\n{Q_R(y)}, under the factorization constraint\n\n    Q_{R,T}(y) = Π_{i=1}^{k} Q_{R,i}(y_i) Π_{(i,j)∈T} [Q_{R,ij}(y_i, y_j) / (Q_{R,i}(y_i) Q_{R,j}(y_j))],        (11)\n\ngiven fixed {Q_T(y|x)}, we can replace eq (8) with simple \"moment matching\" updates\n\n    Q_{R,ij}(y_i, y_j) ← Σ_{x∈R} P(x|R) Q_ij(y_i, y_j|x)        (12)\n\nThe geometric update of Q(y|x) in eq (7) is structure preserving in the sense that if\nQ_{R,T}(y), R ∈ ℛ, share the same tree structure, then so will the resulting conditional.\nThe new updates will result in a monotonically decreasing bound on the original criterion.\n\nFigure 2: Clusters correctly separated by information regularization given one label from\neach class\n\n\n6.2    Complementary sets of regions\n\nIn many cases the points to be labeled may have alternative feature representations, each\nleading to a different set of natural regions ℛ^(k). 
For example, in web page classification\nboth the content of the page and the type of documents that link to that page should be\ncorrelated with its topic. The relationship between these heterogeneous features may be\ncomplex, with some features more relevant to the classification task than others.\n\nLet J_c(Q; ℛ^(k)) denote the regularizer from the kth feature representation. Since the\nregularizers are on a common scale we can combine them linearly:\n\n    J_c(Q; K, α) = Σ_{k=1}^{K} α_k J_c(Q; ℛ^(k)) = Σ_{k=1}^{K} Σ_{R∈ℛ^(k)} α_k P_k(R) I_R(x; y)        (13)\n\nwhere α_k ≥ 0 and Σ_k α_k = 1. The result is a regularizer with regions K = ∪_k ℛ^(k)\nand adjusted a priori weights α_k P_k(R) over the regions. The solution can therefore be\nfound as before provided that {α_k} are known. When {α_k} are unknown, we set them\ncompetitively. In other words, we minimize the worst information rate across the available\nrepresentations. This gives rise to the following regularization problem:\n\n    max_{α_k ≥ 0, Σ_k α_k = 1}  min_{Q(y|x)}  J(Q; λ, α)        (14)\n\nwhere J(Q; λ, α) is the overall objective that uses J_c(Q; K, α) as the regularizer. The\nmaximum is well-defined since the objective is concave in {α_k}. This follows immediately\nas the objective is a minimum of a collection of linear functions J(Q; λ, α) (linear in {α_k}).\n\nAt the optimum all J_c(Q; ℛ^(k)) for which α_k > 0 have the same value (the same\ninformation rate). 
Other feature sets, those with α_k = 0, do not contribute to the overall solution\nas their information rates are dominated by others.\n\n\n7    Experiments\n\nFigure 3: Ability of information regularization to correct the output of a prior classifier\n(left: before, right: after)\n\nWe first illustrate the performance of information regularization on two generated binary\nclassification tasks in the plane. Here we can derive a region covering from the Euclidean\nmetric as spheres of a certain radius centered at each data point. On the data set in Figure 2,\ninspired by [3], the method correctly propagates the labels to the clusters starting\nfrom a single labeled point in each class. 
In the example in Figure 3 we demonstrate that\ninformation regularization can be used as a post-processing step for supervised classification\nand improve error rates by taking advantage of the topology of the space. All points are a priori\nlabeled by a linear classifier that is non-optimal and places a decision boundary through the\nnegative and positive clusters. Information regularization (on a Euclidean region covering\ndefined as circles around each data point) is able to correct the mislabeling of the clusters.\n\nNext we test the algorithm on a web document classification task, the WebKB data set of\n[1]. The data consists of 1051 pages collected from the websites of four universities. This\nparticular subset of WebKB is a binary classification task into 'course' and 'non-course'\npages. 22% of the documents are positive ('course'). The dataset is interesting because\napart from the documents' contents we have information about the link structure of the\ndocuments. The two sources of information can illustrate the capability of information\nregularization to combine heterogeneous unlabeled representations.\n\nBoth 'text' and 'link' features used here are a bag-of-words representation of documents.\nTo obtain 'link' features we collect text that appears under all links that link to that page\nfrom other pages, and produce its bag-of-words representation. We employ no stemming\nor stop-word processing, but restrict the vocabulary to 2000 text words and 500 link words.\nThe experimental setup consists of 100 random selections of 3 positive labeled documents,\n9 negative labeled documents, and the rest unlabeled. The test set includes all unlabeled\ndocuments. We report a naive Bayes baseline based on the model that features of different\nwords are independent given the document class. The naive Bayes algorithm can be run on\ntext features, link features, or combine the two feature sets by assuming independence. 
We also quote the\nperformance of the semi-supervised method obtained by combining naive Bayes with the\nEM algorithm as in [9].\n\nWe measure the performance of the algorithms by the F-score equal to 2pr/(p+r), where p\nand r are the precision and recall. A high F-score indicates that the precision and recall are\nhigh and also close to each other. To compare algorithms independently of the probability\nthreshold that decides between positive and negative samples, the results reported are the\nbest F-scores for all possible settings of the threshold.\n\nThe key issue in applying information regularization is the derivation of a sound region\ncovering ℛ. For document classification we obtained the best results by grouping all\ndocuments that share a certain word into the same region; thus each region is in fact a word,\nand there are as many regions as the size of the vocabulary. Regions are weighted equally,\nas well as the words belonging to the same region. The choice of λ is also task dependent.\nHere cross-validation selected an optimal value λ = 90. When running information\nregularization with both text and link features we combined the coverings with a weight of 0.5\nrather than optimizing it in a min-max fashion.\n\nTable 1: Web page classification comparison (F-scores) between naive Bayes, information\nregularization, and semi-supervised naive Bayes+EM on text, link, and joint features\n\n            naive Bayes    inforeg    naive Bayes+EM\n    text       82.85        85.10         93.69\n    link       65.64        82.85         67.18\n    both       83.33        86.15         91.01\n\nAll results are reported in Table 1. 
We observe that information regularization performs\nbetter than naive Bayes on all types of features, that combining text and link features improves\nperformance of the regularization method, and that on link features the method performs\nbetter than the semi-supervised naive Bayes+EM. Most likely the results do not reflect the\nfull potential of information regularization due to the ad-hoc choice of regions based on the\nvocabulary used by naive Bayes.\n\n\n8    Discussion\n\nThe regularization principle introduced here provides a general information theoretic\napproach to exploiting unlabeled points. The solution implied by the principle is unique and\ncan be found efficiently with distributed algorithms, performing complementary averages,\non the graph induced by the regions. The propagation algorithms also extend to more\nstructured settings. Our preliminary theoretical analysis concerning the number of possible\nlabelings with bounded regularizer is suggestive but rather loose (tighter results can be\nfound). The effect of the choice of the regions (sets of points that ought to be labeled the\nsame) is critical in practice but not yet well understood.\n\n\nReferences\n\n[1] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In\n     Proceedings of the 1998 Conference on Computational Learning Theory, 1998.\n\n[2] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian\n     fields and harmonic functions. In Machine Learning: Proceedings of the Twentieth\n     International Conference, 2003.\n\n[3] M. Szummer and T. Jaakkola. Partially labeled classification with Markov random\n     walks. In Advances in Neural Information Processing Systems 14, 2001.\n\n[4] O. Chapelle, J. Weston, and B. Schoelkopf. Cluster kernels for semi-supervised\n     learning. In Advances in Neural Information Processing Systems 15, 2002.\n\n[5] M. Szummer and T. Jaakkola. 
Information regularization with partially labeled data.\n     In NIPS'2002, volume 15, 2003.\n\n[6] A. Corduneanu and T. Jaakkola. On information regularization. In Proceedings of the\n     19th UAI, 2003.\n\n[7] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley & Sons, New\n     York, 1991.\n\n[8] R. E. Blahut. Computation of channel capacity and rate distortion functions. In IEEE\n     Trans. Inform. Theory, volume 18, pages 460-473, July 1972.\n\n[9] K. Nigam, A.K. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled\n     and unlabeled documents using EM. Machine Learning, 39:103-134, 2000.\n", "award": [], "sourceid": 2632, "authors": [{"given_name": "Adrian", "family_name": "Corduneanu", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}