{"title": "Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 593, "abstract": "Little work has been done to directly combine the outputs of multiple supervised and unsupervised models. However, it can increase the accuracy and applicability of ensemble methods. First, we can boost the diversity of classification ensemble by incorporating multiple clustering outputs, each of which provides grouping constraints for the joint label predictions of a set of related objects. Secondly, ensemble of supervised models is limited in applications which have no access to raw data but to the meta-level model outputs. In this paper, we aim at calculating a consolidated classification solution for a set of objects by maximizing the consensus among both supervised predictions and unsupervised grouping constraints. We seek a global optimal label assignment for the target objects, which is different from the result of traditional majority voting and model combination approaches. We cast the problem into an optimization problem on a bipartite graph, where the objective function favors smoothness in the conditional probability estimates over the graph, as well as penalizes deviation from initial labeling of supervised models. We solve the problem through iterative propagation of conditional probability estimates among neighboring nodes, and interpret the method as conducting a constrained embedding in a transformed space, as well as a ranking on the graph. 
Experimental results on three real applications demonstrate the benefits of the proposed method over existing alternatives.", "full_text": "Graph-based Consensus Maximization among\nMultiple Supervised and Unsupervised Models\n\nJing Gao\u2020, Feng Liang\u2020, Wei Fan\u2021, Yizhou Sun\u2020, and Jiawei Han\u2020\n\n\u2020University of Illinois at Urbana-Champaign, IL USA\n\u2021IBM TJ Watson Research Center, Hawthorn, NY USA\n\n\u2020{jinggao3,liangf,sun22,hanj}@illinois.edu, \u2021weifan@us.ibm.com\n\nAbstract\n\nEnsemble classi\ufb01ers such as bagging, boosting and model averaging are known\nto have improved accuracy and robustness over a single model. Their potential,\nhowever, is limited in applications which have no access to raw data but to the\nmeta-level model output. In this paper, we study ensemble learning with output\nfrom multiple supervised and unsupervised models, a topic where little work has\nbeen done. Although unsupervised models, such as clustering, do not directly\ngenerate label prediction for each individual, they provide useful constraints for\nthe joint prediction of a set of related objects. We propose to consolidate a classi-\n\ufb01cation solution by maximizing the consensus among both supervised predictions\nand unsupervised constraints. We cast this ensemble task as an optimization prob-\nlem on a bipartite graph, where the objective function favors the smoothness of the\nprediction over the graph, as well as penalizing deviations from the initial labeling\nprovided by supervised models. We solve this problem through iterative propaga-\ntion of probability estimates among neighboring nodes. Our method can also be\ninterpreted as conducting a constrained embedding in a transformed space, or a\nranking on the graph. Experimental results on three real applications demonstrate\nthe bene\ufb01ts of the proposed method over existing alternatives1.\n\n1 Introduction\n\nWe seek to integrate knowledge from multiple information sources. 
Traditional ensemble methods\nsuch as bagging, boosting and model averaging are known to have improved accuracy and robustness\nover a single model. Their potential, however, is limited in applications which have no access to raw\ndata but to the meta-level model output. For example, due to privacy, companies or agencies may\nnot be willing to share their raw data but their final models. So information fusion needs to be\nconducted at the decision level. Furthermore, different data sources may have different formats, for\nexample, web video classification based on image, audio and text features. In these scenarios, we\nhave to combine incompatible information sources at the coarser level (predicted class labels) rather\nthan learn the joint model from raw data.\nIn this paper, we consider the general problem of combining output of multiple supervised and\nunsupervised models to improve prediction accuracy. Although unsupervised models, such as clustering,\ndo not directly generate label predictions, they provide useful constraints for the classification task.\nThe rationale is that objects that are in the same cluster should be more likely to receive the same\nclass label than the ones in different clusters. Furthermore, incorporating the unsupervised clustering\nmodels into classification ensembles improves the base model diversity, and thus has the potential\nof improving prediction accuracy.\n\n1 More information, data and codes are available at http://ews.uiuc.edu/~jinggao3/nips09bgcm.htm\n\nFigure 1: Groups\n\nFigure 2: Bipartite Graph\n\nFigure 3: Position of Consensus Maximization\n\nSuppose we have a set of data points X = {x1, x2, . . . , xn} from c classes. There are m models\nthat provide information about the classification of X, where the first r of them are (supervised)\nclassifiers, and the remaining are (unsupervised) clustering algorithms. Consider an example where\nX = {x1, . . . , x7}, c = 3 and m = 4. 
The output of the four models are:\nM1 = {1, 1, 1, 2, 3, 3, 2} M2 = {1, 1, 2, 2, 2, 3, 1} M3 = {2, 2, 1, 3, 3, 1, 3} M4 = {1, 2, 3, 1, 2, 1, 1}\nwhere M1 and M2 assign each object a class label, whereas M3 and M4 simply partition the objects\ninto three clusters and assign each object a cluster ID. Each model, no matter it is supervised or\nunsupervised, partitions X into groups, and objects in the same group share either the same predicted\nclass label or the same cluster ID. We summarize the data, models and the corresponding output by\na bipartite graph. In the graph, nodes at the left denote the groups output by the m models with\nsome labeled ones from the supervised models, nodes at the right denote the n objects, and a group\nand an object are connected if the object is assigned to the group by one of the models. For the\naforementioned toy example, we show the groups obtained from a classi\ufb01er M1 and a clustering\nmodel M3 in Figure 1, as well as the group-object bipartite graph in Figure 2.\nThe objective is to predict the class label of xi \u2208 X, which agrees with the base classi\ufb01ers\u2019 pre-\ndictions, and meanwhile, satis\ufb01es the constraints enforced by the clustering models, as much as\npossible. To reach maximum consensus among all the models, we de\ufb01ne an optimization problem\nover the bipartite graph whose objective function penalizes deviations from the base classi\ufb01ers\u2019 pre-\ndictions, and discrepancies of predicted class labels among nearby nodes. In the toy example, the\nconsensus label predictions for X should be {1, 1, 1, 2, 2, 3, 2}.\nRelated Work. We summarize various learning problems in Figure 3, where one dimension repre-\nsents the goal \u2013 from unsupervised to supervised, and the other dimension represents the method \u2013\nsingle models, ensembles at the raw data, or ensembles at the output level. 
Our proposed method is\na semi-supervised ensemble working at the output level, where little work has been done.\nMany efforts have been devoted to develop single-model learning algorithms, such as Support Vector\nMachines and logistic regression for classi\ufb01cation, K-means and spectral clustering for clustering.\nRecent studies reveal that unsupervised information can also be utilized to improve the accuracy of\nsupervised learning, which leads to semi-supervised [29, 8] and transductive learning [21]. Although\nour proposed algorithm works in a transductive setting, existing semi-supervised and transductive\nlearning methods cannot be easily applied to our problem setting and we discuss this in more de-\ntail at the end of Section 2. Note that all methods listed in Figure 3 are for single task learning.\nOn the contrary, multi-task learning [6, 9] deals with multiple tasks simultaneously by exploiting\ndependence among tasks, which has a different problem setting and thus is not discussed here.\nIn Figure 3, we divide ensemble methods into two categories depending on whether they require\naccess to raw data. In unsupervised learning, many clustering ensemble methods [12, 17, 25, 26]\nhave been developed to \ufb01nd a consensus clustering from multiple partitionings without accessing the\nfeatures. In supervised learning, however, only majority voting type algorithms work on the model\noutput level, and most well-known classi\ufb01cation ensemble approaches [2, 11, 19] (eg. bagging,\nboosting, bayesian model averaging) involve training diversi\ufb01ed classi\ufb01ers from raw data. Methods\nsuch as mixture of experts [20] and stacked generalization [27] try to obtain a meta-learner on\ntop of the model output, however, they still need the labels of the raw data as feedbacks, so we\nposition them as an intermediate between raw data ensemble and output ensemble. 
In multi-view learning [4, 13], a joint model is learnt from both labeled and unlabeled data from multiple sources.\nTherefore, it can be regarded as a semi-supervised ensemble requiring access to the raw data.\nSummary. The proposed consensus maximization problem is a challenging problem that cannot\nbe solved by simple majority voting. To achieve maximum agreement among various models, we\nmust seek a global optimal prediction for the target objects. In Section 2, we formally define the\ngraph-based consensus maximization problem and propose an iterative algorithm to solve it. The\nproposed solution propagates labeled information among neighboring nodes until stabilization. We\nalso present two different interpretations of the proposed method in Section 3, and discuss how\nto incorporate feedbacks obtained from a few labeled target objects into the framework in Section\n4. An extensive experimental study is carried out in Section 5, where the benefits of the proposed\napproach are illustrated on 20 Newsgroup, Cora research papers, and DBLP publication data sets.\n\n2 Methodology\nSuppose we have the output of r classification algorithms and (m \u2212 r) clustering algorithms on a\ndata set X. 
For the sake of simplicity, we assume that each point is assigned to only one class or\ncluster in each of the m algorithms, and the number of clusters in each clustering algorithm is c,\nthe same as the number of classes. Note that cluster ID z may not be related to class z. So each base\nalgorithm partitions X into c groups and there are v = mc groups in total, where the first s = rc\ngroups are generated by classifiers and the remaining v \u2212 s groups are from clustering algorithms.\nBefore proceeding further, we introduce some notation that will be used in the following discussion:\nBn\u00d7m denotes an n \u00d7 m matrix with bij representing the (i, j)-th entry, and $\\vec{b}_{i\\cdot}$ and $\\vec{b}_{\\cdot j}$ denote vectors\nof row i and column j, respectively. See Table 1 for a summary of important symbols.\nWe represent the objects and groups in a bipartite graph as shown in Figure 2, where the object nodes\nx1, . . . , xn are on the right and the group nodes g1, . . . , gv are on the left. The affinity matrix An\u00d7v of\nthis graph summarizes the output of the m algorithms on X:\n\n$$a_{ij} = \\begin{cases} 1, & \\text{if } x_i \\text{ is assigned to group } g_j \\text{ by one of the algorithms;} \\\\ 0, & \\text{otherwise.} \\end{cases}$$\n\nWe aim at estimating the conditional probability of each object node xi belonging to each of the c classes. As\na nuisance parameter, the conditional probabilities at each group node gj are also estimated. These\nconditional probabilities are denoted by Un\u00d7c for object nodes and Qv\u00d7c for group nodes:\n\n$$u_{iz} = \\hat{P}(y = z \\mid x_i) \\quad \\text{and} \\quad q_{jz} = \\hat{P}(y = z \\mid g_j).$$\n\nSince the first s = rc groups are obtained from supervised learning models, they have some initial\nclass label estimates denoted by Yv\u00d7c where\n\n$$y_{jz} = \\begin{cases} 1, & \\text{if } g_j\\text{'s predicted label is } z, \\; j = 1, \\ldots, s; \\\\ 0, & \\text{otherwise.} \\end{cases}$$\n\n
Let $k_j = \\sum_{z=1}^{c} y_{jz}$, and we formulate the consensus agreement as the following optimization\nproblem on the graph:\n\n$$\\min_{Q,U} f(Q, U) = \\min_{Q,U} \\Big( \\sum_{i=1}^{n} \\sum_{j=1}^{v} a_{ij} \\|\\vec{u}_{i\\cdot} - \\vec{q}_{j\\cdot}\\|^2 + \\alpha \\sum_{j=1}^{v} k_j \\|\\vec{q}_{j\\cdot} - \\vec{y}_{j\\cdot}\\|^2 \\Big) \\qquad (1)$$\n\n$$\\text{s.t.} \\quad \\vec{u}_{i\\cdot} \\geq \\vec{0}, \\; |\\vec{u}_{i\\cdot}| = 1, \\; i = 1 : n; \\qquad \\vec{q}_{j\\cdot} \\geq \\vec{0}, \\; |\\vec{q}_{j\\cdot}| = 1, \\; j = 1 : v$$\n\nwhere $\\|\\cdot\\|$ and $|\\cdot|$ denote a vector\u2019s L2 and L1 norm respectively. The first term ensures that if an\nobject xi is assigned to group gj by one of the algorithms, their conditional probability estimates\nmust be close. When j = 1, . . . , s, the group node gj is from a classifier, so kj = 1 and the second\nterm puts the constraints that a group gj\u2019s consensus class label estimate should not deviate much\nfrom its initial class label prediction. \u03b1 is the shadow price payment for violating the constraints.\nWhen j = s + 1, . . . , v, gj is a group from an unsupervised model with no such constraints, and\nthus kj = 0 and the weight of the constraint is 0. Finally, $\\vec{u}_{i\\cdot}$ and $\\vec{q}_{j\\cdot}$ are probability vectors, and\ntherefore each component must be greater than or equal to 0 and the components must sum to 1.\nWe propose to solve this problem using block coordinate descent methods as shown in Algorithm 1.\nAt the t-th iteration, if we fix the value of U, the objective function is a summation of v quadratic\ncomponents with respect to $\\vec{q}_{j\\cdot}$. The corresponding Hessian matrix is diagonal with entries equal to\n$\\sum_{i=1}^{n} a_{ij} + \\alpha k_j > 0$. Therefore it is strictly convex and $\\nabla_{\\vec{q}_{j\\cdot}} f(Q, U^{(t-1)}) = 0$ gives the unique\nglobal minimum of the cost function with respect to $\\vec{q}_{j\\cdot}$ in Eq. (2). 
Similarly, fixing Q, the unique\nglobal minimum with respect to $\\vec{u}_{i\\cdot}$ is also obtained.\n\n$$\\vec{q}_{j\\cdot}^{\\,(t)} = \\frac{\\sum_{i=1}^{n} a_{ij} \\vec{u}_{i\\cdot}^{\\,(t-1)} + \\alpha k_j \\vec{y}_{j\\cdot}}{\\sum_{i=1}^{n} a_{ij} + \\alpha k_j}, \\qquad \\vec{u}_{i\\cdot}^{\\,(t)} = \\frac{\\sum_{j=1}^{v} a_{ij} \\vec{q}_{j\\cdot}^{\\,(t)}}{\\sum_{j=1}^{v} a_{ij}} \\qquad (2)$$\n\nTable 1: Important Notations\n\nSymbol | Definition\n1, . . . , c | class indexes\n1, . . . , n | object indexes\n1, . . . , s | indexes of groups from supervised models\ns + 1, . . . , v | indexes of groups from unsupervised models\nAn\u00d7v = [aij] | aij: indicator of object i in group j\nUn\u00d7c = [uiz] | uiz: probability of object i wrt class z\nQv\u00d7c = [qjz] | qjz: probability of group j wrt class z\nYv\u00d7c = [yjz] | yjz: indicator of group j predicted as class z\n\nAlgorithm 1 BGCM algorithm\nInput: group-object affinity matrix A, initial labeling matrix Y; parameters \u03b1 and \u03b5;\nOutput: consensus matrix U;\nAlgorithm:\n  Initialize $U^0$, $U^1$ randomly\n  t \u2190 1\n  while $\\|U^t - U^{t-1}\\| > \\epsilon$ do\n    $Q^t = (D_v + \\alpha K_v)^{-1}(A^T U^{t-1} + \\alpha K_v Y)$\n    $U^t = D_n^{-1} A Q^t$\n  return $U^t$\n\nThe update formulas in matrix form are given in Algorithm 1. $D_v = \\mathrm{diag}\\{(\\sum_{i=1}^{n} a_{ij})\\}_{v \\times v}$ and\n$D_n = \\mathrm{diag}\\{(\\sum_{j=1}^{v} a_{ij})\\}_{n \\times n}$ act as the normalization factors. $K_v = \\mathrm{diag}\\{(\\sum_{z=1}^{c} y_{jz})\\}_{v \\times v}$ indicates\nthe existence of constraints on the group nodes. 
During each iteration, the probability estimate\nat each group node (i.e., Q) receives the information from its neighboring object nodes while retains\nits initial value Y , and in return the updated probability estimates at group nodes propagate the in-\nformation back to its neighboring object nodes when updating U. It is straightforward to prove that\n(Q(t), U (t)) converges to a stationary point of the optimization problem [3].\nIn [14], we proposed a heuristic method to combine heterogeneous information sources. In this pa-\nper, we bring up the concept of consensus maximization and solve the problem over a bipartite graph\nrepresentation. Our proposed method is related to graph-based semi-supervised learning (SSL). But\nexisting SSL algorithms only take one supervised source (i.e., the labeled objects) and one unsu-\npervised source (i.e., the similarity graph) [29, 8], and thus cannot be applied to combine multiple\nmodels. Some SSL methods [16] can incorporate results from an external classi\ufb01er into the graph,\nbut obviously they cannot handle multiple classi\ufb01ers and multiple unsupervised sources. To apply\nSSL algorithms on our problem, we must \ufb01rst fuse all supervised models into one by some ensem-\nble approach, and fuse all unsupervised models into one by de\ufb01ning a similarity function. Such a\ncompression may lead to information loss, whereas the proposed method retains all the information\nand thus consensus can be reached among all the based model output.\n\n3 Interpretations\n\nIn this part, we explain the proposed method from two independent perspectives.\nConstrained Embedding. Now we focus on the \u201chard\u201d consensus solution, i.e., each point is\nassigned to exactly one class. So U and Q are indicator matrices: uiz = 1 if the ensemble assigns\nxi to class z, and 0 otherwise; similar for qjz\u2019s. 
For group nodes from classification algorithms,\nwe will treat their entries in Q as known since they have been assigned a class label by one of the\nclassifiers, that is, qjz = yjz for 1 \u2264 j \u2264 s.\nBecause U represents the consensus, we should let group gj correspond to class z if the majority of the\nobjects in group gj correspond to class z in the consensus solution. The optimization is thus:\n\n$$\\min_{Q,U} \\sum_{j=1}^{v} \\sum_{z=1}^{c} \\left| q_{jz} - \\frac{\\sum_{i=1}^{n} a_{ij} u_{iz}}{\\sum_{i=1}^{n} a_{ij}} \\right| \\qquad (3)$$\n\n$$\\text{s.t.} \\quad \\sum_{z=1}^{c} u_{iz} = 1 \\;\\; \\forall i \\in \\{1, \\ldots, n\\}, \\quad \\sum_{z=1}^{c} q_{jz} = 1 \\;\\; \\forall j \\in \\{s+1, \\ldots, v\\}, \\quad u_{iz} \\in \\{0, 1\\}, \\quad q_{jz} \\in \\{0, 1\\} \\qquad (4)$$\n\n$$q_{jz} = 1 \\;\\; \\forall j \\in \\{1, \\ldots, s\\} \\text{ if } g_j\\text{'s label is } z, \\qquad q_{jz} = 0 \\;\\; \\forall j \\in \\{1, \\ldots, s\\} \\text{ if } g_j\\text{'s label is not } z \\qquad (5)$$\n\nHere, the two indicator matrices U and Q can be viewed as embedding x1, . . . , xn (object nodes)\nand g1, . . . , gv (group nodes) into a c-dimensional cube. Due to the constraints in Eq. (4), $\\vec{u}_{i\\cdot}$ and\n$\\vec{q}_{j\\cdot}$ reside on the boundary of the (c \u2212 1)-dimensional hyperplane in the cube. $\\vec{a}_{\\cdot j}$ denotes the\nobjects group gj contains, $\\vec{q}_{j\\cdot}$ can be regarded as the group representative in this new space, and\nthus it should be close to the group mean: $\\frac{\\sum_{i=1}^{n} a_{ij} \\vec{u}_{i\\cdot}}{\\sum_{i=1}^{n} a_{ij}}$. For the s groups obtained from classification\nalgorithms, we know their \u201cideal\u201d embedding, as represented in the constraints in Eq. (5).\nWe now relate this problem to the optimization framework discussed in Section 2. aij can only take\na value of 0 or 1, and thus Eq. (3) just depends on the cases when aij = 1. 
When aij = 1, regardless of whether\nqjz is 1 or 0, we have $|q_{jz} \\sum_{i=1}^{n} a_{ij} - \\sum_{i=1}^{n} a_{ij} u_{iz}| = \\sum_{i=1}^{n} |a_{ij}(q_{jz} - u_{iz})|$. Therefore,\n\n$$\\sum_{j: a_{ij}=1} \\sum_{z=1}^{c} \\left| q_{jz} - \\frac{\\sum_{i=1}^{n} a_{ij} u_{iz}}{\\sum_{i=1}^{n} a_{ij}} \\right| = \\sum_{j: a_{ij}=1} \\sum_{z=1}^{c} \\frac{\\left| q_{jz} \\sum_{i=1}^{n} a_{ij} - \\sum_{i=1}^{n} a_{ij} u_{iz} \\right|}{\\sum_{i=1}^{n} a_{ij}} = \\sum_{j: a_{ij}=1} \\sum_{z=1}^{c} \\frac{\\sum_{i=1}^{n} |a_{ij}(q_{jz} - u_{iz})|}{\\sum_{i=1}^{n} a_{ij}}$$\n\nSuppose the groups found by the base models have balanced size, i.e., $\\sum_{i=1}^{n} a_{ij} = \\gamma$ where \u03b3 is a\nconstant for all j. Then the objective function can be approximated as:\n\n$$\\sum_{j: a_{ij}=1} \\sum_{z=1}^{c} \\sum_{i=1}^{n} |a_{ij}(q_{jz} - u_{iz})| = \\sum_{i=1}^{n} \\sum_{j: a_{ij}=1} \\sum_{z=1}^{c} |q_{jz} - u_{iz}| = \\sum_{i=1}^{n} \\sum_{j=1}^{v} a_{ij} \\sum_{z=1}^{c} |q_{jz} - u_{iz}|$$\n\nTherefore, when the classification and clustering algorithms generate balanced groups, with the same\nset of constraints in Eq. (4) and Eq. (5), the constrained embedding problem in Eq. (3) is equivalent\nto: $\\min_{Q,U} \\sum_{i=1}^{n} \\sum_{j=1}^{v} a_{ij} \\sum_{z=1}^{c} |q_{jz} - u_{iz}|$. It is obvious that this is the same as the optimization\nproblem we propose in Section 2 with two relaxations: 1) We transform hard constraints in Eq. (5)\nto soft constraints where the ideal embedding is expressed in the initial labeling matrix Y and the\nprice for violating the constraints is set to \u03b1. 2) uiz and qjz are relaxed to have values between 0 and\n1, instead of either 0 or 1, and quadratic cost functions replace the L1 norms. 
So they are probability\nestimates rather than class membership indicators, and we can embed them anywhere on the plane.\nEven with these relaxations, we build connections between the constrained embedding framework\ndiscussed in this section and the one proposed in Section 2. Therefore, we can view our proposed\nmethod as embedding both object nodes and group nodes into a hyperplane so that object nodes are\nclose to the group nodes they link to. The constraints are put on the group nodes from supervised\nmodels to penalize the embeddings that are far from the \u201cideal\u201d ones.\nRanking on Consensus Structure. Our method can also be viewed as conducting ranking with\nrespect to each class on the bipartite graph, where group nodes from supervised models act as queries.\nSuppose we wish to know the probability of any group gj belonging to class 1, which can be regarded\nas the relevance score of gj with respect to example queries from class 1. Let $w_j = \\sum_{i=1}^{n} a_{ij}$. In\nAlgorithm 1, the relevance scores of all the groups are learnt using the following equation:\n\n$$\\vec{q}_{\\cdot 1} = (D_v + \\alpha K_v)^{-1}(A^T D_n^{-1} A \\vec{q}_{\\cdot 1} + \\alpha K_v \\vec{y}_{\\cdot 1}) = D_\\lambda (D_v^{-1} A^T D_n^{-1} A) \\vec{q}_{\\cdot 1} + D_{1-\\lambda} \\vec{y}_{\\cdot 1}$$\n\nwhere the v \u00d7 v diagonal matrices $D_\\lambda$ and $D_{1-\\lambda}$ have (j, j) entries $\\frac{w_j}{w_j + \\alpha k_j}$ and $\\frac{\\alpha k_j}{w_j + \\alpha k_j}$, respectively.\nConsider collapsing the original bipartite graph into a graph with group nodes only, then $A^T A$ is its\naffinity matrix. After normalizing it to be a probability matrix, we have pij in $P = D_v^{-1} A^T D_n^{-1} A$\nrepresent the probability of jumping to node j from node i. 
The groups that are predicted to be in\nclass 1 by one of the supervised models have 1 at the corresponding entries in $\\vec{y}_{\\cdot 1}$, therefore these\ngroup nodes are \u201cqueries\u201d and we wish to rank the group nodes according to their relevance to them.\nComparing our ranking model with the PageRank model [24], there are the following relationships: 1)\nIn PageRank, a uniform vector with entries all equal to 1 replaces $\\vec{y}_{\\cdot 1}$. In our model, we use $\\vec{y}_{\\cdot 1}$ to\nshow our preference towards the query nodes, so the resulting scores would be biased to reflect the\nrelevance regarding class 1. 2) In PageRank, the weights $D_\\lambda$ and $D_{1-\\lambda}$ are fixed constants \u03bb and\n1 \u2212 \u03bb, whereas in our model $D_\\lambda$ and $D_{1-\\lambda}$ give personalized damping factors, where each group\nhas a damping factor $\\lambda_j = \\frac{w_j}{w_j + \\alpha k_j}$. 3) In PageRank, the values of link-votes are normalized by the\nnumber of outlinks at each node, whereas our ranking model does not normalize pij on its outlinks,\nand thus can be viewed as an un-normalized version of personalized PageRank [18, 28]. 
When each\nbase model generates balanced groups, both \u03bbj and outlinks at each node become constants, and the\nproposed method simulates the standard personalized PageRank.\nThe relevance scores with respect to class 1 for group and object nodes will converge to\n\n$$\\vec{q}_{\\cdot 1} = (I_v - D_\\lambda D_v^{-1} A^T D_n^{-1} A)^{-1} D_{1-\\lambda} \\vec{y}_{\\cdot 1} \\quad \\text{and} \\quad \\vec{u}_{\\cdot 1} = (I_n - D_n^{-1} A D_\\lambda D_v^{-1} A^T)^{-1} D_n^{-1} A D_{1-\\lambda} \\vec{y}_{\\cdot 1}$$\n\nrespectively. Iv and In are identity matrices with size v \u00d7 v and n \u00d7 n. The above arguments hold\nfor the other classes as well, and thus each column in U and Q represents the ranking of the nodes\nwith respect to each class. 
Because each row sums up to 1, they are conditional probability estimates\nof the nodes belonging to one of the classes.\n\nTable 2: Data Sets Description\n\nData | ID | Category Labels | #target | #labeled\nNewsgroup | 1 | comp.graphics comp.os.ms-windows.misc sci.crypt sci.electronics | 1408 | 160\nNewsgroup | 2 | rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey | 1428 | 160\nNewsgroup | 3 | sci.crypt sci.electronics sci.med sci.space | 1413 | 160\nNewsgroup | 4 | misc.forsale rec.autos rec.motorcycles talk.politics.misc | 1324 | 160\nNewsgroup | 5 | rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics | 1424 | 160\nNewsgroup | 6 | alt.atheism rec.sport.baseball rec.sport.hockey soc.religion.christian | 1352 | 160\nCora | 1 | Operating Systems Programming Data Structures Algorithms and Theory | 603 | 60\nCora | 2 | Databases Hardware and Architecture Networking Human Computer Interaction | 897 | 80\nCora | 3 | Distributed Memory Management Agents Vision and Pattern Recognition | 1368 | 100\nCora | 4 | Graphics and Virtual Reality Object Oriented Planning Robotics Compiler Design Software Development | 875 | 100\nDBLP | 1 | Databases Data Mining Machine Learning Information Retrieval | 3836 | 400\n\n4 Incorporating Labeled Information\n\nThus far, we propose to combine the output of supervised and unsupervised models by consensus.\nWhen the true labels of the objects are unknown, this is a reliable approach. However, incorporating\nlabels from even a small portion of the objects may greatly refine the final hypothesis. We assume\nthat labels of the first l objects are known, which is encoded in an n \u00d7 c matrix F :\n\n$$f_{iz} = \\begin{cases} 1, & \\text{if } x_i\\text{'s observed label is } z, \\; i = 1, \\ldots, l; \\\\ 0, & \\text{otherwise.} \\end{cases}$$\n\nWe modify the objective function in Eq. 
(1) to penalize the deviation of $\\vec{u}_{i\\cdot}$ of labeled objects from\nthe observed label:\n\n$$f(Q, U) = \\sum_{i=1}^{n} \\sum_{j=1}^{v} a_{ij} \\|\\vec{u}_{i\\cdot} - \\vec{q}_{j\\cdot}\\|^2 + \\alpha \\sum_{j=1}^{v} k_j \\|\\vec{q}_{j\\cdot} - \\vec{y}_{j\\cdot}\\|^2 + \\beta \\sum_{i=1}^{n} h_i \\|\\vec{u}_{i\\cdot} - \\vec{f}_{i\\cdot}\\|^2 \\qquad (6)$$\n\nwhere $h_i = \\sum_{z=1}^{c} f_{iz}$. When i = 1, . . . , l, hi = 1, so we enforce the constraints that an object xi\u2019s\nconsensus class label estimate should be close to its observed label with a shadow price \u03b2. When\ni = l + 1, . . . , n, xi is unlabeled. Therefore, hi = 0 and the constraint term is eliminated from the\nobjective function. To update the conditional probability for the objects, we incorporate their prior\nlabeled information:\n\n$$\\vec{u}_{i\\cdot}^{\\,(t)} = \\frac{\\sum_{j=1}^{v} a_{ij} \\vec{q}_{j\\cdot}^{\\,(t)} + \\beta h_i \\vec{f}_{i\\cdot}}{\\sum_{j=1}^{v} a_{ij} + \\beta h_i} \\qquad (7)$$\n\nIn matrix form, it would be $U^t = (D_n + \\beta H_n)^{-1}(A Q^t + \\beta H_n F)$ with $H_n = \\mathrm{diag}\\{(\\sum_{z=1}^{c} f_{iz})\\}_{n \\times n}$. 
Note that the initial conditional probability of a labeled object is 1 at its\nobserved class label, and 0 at all the others. However, this optimistic estimate will be changed\nduring the updates, with the rationale that the observed labels are just random samples from some\nmultinomial distribution. Thus we only use the observed labels to bias the updating procedure,\ninstead of totally relying on them.\n\n5 Experiments\n\nWe evaluate the proposed algorithms on eleven classi\ufb01cation tasks from three real world applica-\ntions.\nIn each task, we have a target set on which we wish to predict class labels. Clustering\nalgorithms are performed on this target set to obtain the grouping results. On the other hand, we\nlearn classi\ufb01cation models from some training sets that are in the same domain or a relevant domain\nwith respect to the target set. These classi\ufb01cation models are applied to the target set as well. The\nproposed algorithm generates a consolidated classi\ufb01cation solution for the target set based on both\nclassi\ufb01cation and clustering results. 
We elaborate details of each application in the following.\n\nTable 3: Classification Accuracy Comparison on a Series of Data Sets\n\nMethods | 20 Newsgroups 1 | 2 | 3 | 4 | 5 | 6 | Cora 1 | 2 | 3 | 4 | DBLP 1\nM1 | 0.7967 | 0.8855 | 0.8557 | 0.8826 | 0.8765 | 0.8880 | 0.7745 | 0.8858 | 0.8671 | 0.8841 | 0.9337\nM2 | 0.7721 | 0.8611 | 0.8134 | 0.8676 | 0.8358 | 0.8563 | 0.7797 | 0.8594 | 0.8508 | 0.8879 | 0.8766\nM3 | 0.8056 | 0.8796 | 0.8658 | 0.8983 | 0.8716 | 0.9020 | 0.7779 | 0.8833 | 0.8646 | 0.8813 | 0.9382\nM4 | 0.7770 | 0.8571 | 0.8149 | 0.8467 | 0.8543 | 0.8578 | 0.7476 | 0.8594 | 0.7810 | 0.9016 | 0.7949\nMCLA | 0.7592 | 0.8173 | 0.8253 | 0.8686 | 0.8295 | 0.8546 | 0.8703 | 0.8388 | 0.8892 | 0.8716 | 0.8953\nHBGF | 0.8199 | 0.9244 | 0.8811 | 0.9152 | 0.8991 | 0.9125 | 0.7834 | 0.9111 | 0.8481 | 0.8943 | 0.9357\nBGCM | 0.8128 | 0.9101 | 0.8608 | 0.9125 | 0.8864 | 0.9088 | 0.8687 | 0.9155 | 0.8965 | 0.9090 | 0.9417\n2-L | 0.7981 | 0.9040 | 0.8511 | 0.8728 | 0.8830 | 0.8977 | 0.8066 | 0.8798 | 0.8932 | 0.8951 | 0.9054\n3-L | 0.8188 | 0.9206 | 0.8820 | 0.9158 | 0.8989 | 0.9121 | 0.8557 | 0.9086 | 0.9202 | 0.9141 | 0.9332\nBGCM-L | 0.8316 | 0.9197 | 0.8859 | 0.9240 | 0.9016 | 0.9177 | 0.8891 | 0.9181 | 0.9246 | 0.9206 | 0.9480\nSTD | 0.0040 | 0.0038 | 0.0037 | 0.0040 | 0.0027 | 0.0030 | 0.0096 | 0.0027 | 0.0052 | 0.0044 | 0.0020\n\n20 Newsgroup categorization. We construct six learning tasks, each of which involves four classes.\nThe objective is to classify newsgroup messages according to topics. We used the version [1] where\nthe newsgroup messages are sorted by date, and separated into training and test sets. The test sets\nare our target sets. We learn logistic regression [15] and SVM models [7] from the training sets, and\napply these models, as well as K-means and min-cut clustering algorithms [22] on the target sets.\nCora research paper classification. We aim at classifying a set of research papers into their areas\n[23]. 
We extract four target sets, each of which includes papers from around four areas. 
The train-\ning sets contain research papers that are different from those in the target sets. Both training and\ntarget sets have two views, the paper abstracts, and the paper citations. We apply logistic regression\nclassi\ufb01ers and K-means clustering algorithms on the two views of the target sets.\nDBLP data. We retrieve around 4,000 authors from DBLP network [10], and try to predict their\nresearch areas. The training sets are drawn from a different domain, i.e., the conferences in each\nresearch \ufb01eld. There are also two views for both training and target sets, the publication network, and\nthe textual content of the publications. The amount of papers an author published in the conference\ncan be regarded as link feature, whereas the pool of titles that an author published is the text feature.\nLogistic regression and K-means clustering algorithms are used to derive the predictions on the\ntarget set. We manually label the target set for evaluation.\nThe details of each learning task are summarized in Table 2. On each target set, we apply four\nmodels M1 to M4, where the \ufb01rst two are classi\ufb01cation models and the remaining two are clus-\ntering models. We denote the proposed method as Bipartite Graph-based Consensus Maximization\n(BGCM), which combines the output of the four models. As shown in Figure 3, only clustering\nensembles, majority voting methods, and the proposed BGCM algorithm work at the meta output\nlevel where raw data are discarded and only prediction results from multiple models are available.\nHowever, majority voting can not be applied when there are clustering models because the cor-\nrespondence between clusters and classes is unknown. Therefore, we compare BGCM with two\nclustering ensemble approaches (MCLA [26] and HBGF [12]), which ignore the label information\nfrom supervised models, regard all the base models as unsupervised clustering, and integrate the\noutput of the base models. 
They therefore give only clustering solutions, not classification results.\nTo evaluate classification accuracy, we map the output of all the clustering algorithms (the base\nmodels and the ensembles) to the best possible class predictions with the help of the Hungarian method\n[5], where cluster IDs are matched with class labels. Strictly speaking, this is \u201ccheating\u201d because the true class\nlabels are used to do the mapping, so it yields the best accuracy attainable from\nthese unsupervised models. As discussed in Section 4, we can incorporate a few labeled objects,\nwhich are drawn from the same domain as the target set, into the framework and improve accuracy.\nThis improved version of the BGCM algorithm is denoted as BGCM-L, and the number of labeled\nobjects used in each task is shown in Table 2. On each task, we repeat the experiments 50 times,\neach with randomly chosen target and labeled objects, and report the average accuracy. Due\nto space limits, we only show the standard deviation (STD) for the BGCM-L method. The baselines\nhave standard deviations very similar to the reported one on each task.\n\nFigure 4: Sensitivity Analysis\n\nAccuracy. In Table 3, we summarize the classification accuracy of all the baselines and the\nproposed approach on the target sets of eleven tasks. The two single classifiers (M1 and M2) and the\ntwo single clustering models (M3 and M4) usually have low accuracy. By combining all the base\nmodels, the clustering ensemble approaches (MCLA and HBGF) can improve the performance over\neach single model. However, these two methods are not designed for classification, and the reported\naccuracy is the upper bound of their \u201ctrue\u201d accuracy. The proposed BGCM method always\noutperforms the base models, and achieves better or comparable performance compared with the upper\nbound of the baseline ensembles. 
By incorporating a small portion (around 10%) of labeled objects,\nthe BGCM-L method further improves the performance. A consistent increase in accuracy can\nbe observed in all the tasks, where the margin between the accuracy of the best single model and\nthat of the BGCM-L method ranges from 2% to 10%. Even when taking variance into consideration, the\nresults demonstrate the power of consensus maximization in improving accuracy.\nSensitivity. As shown in Figure 4 (a) and (b), the proposed BGCM-L method is not sensitive to the\nparameters \u03b1 and \u03b2. To keep the plots clear, we show only the performance on the first task of each\napplication. \u03b1 and \u03b2 are the shadow prices paid for deviating from the estimated labels of groups\nand the observed labels of objects, so they should be greater than 0. They represent the confidence\nof our belief in the labels of the groups and objects relative to 1, the weight of the smoothness term.\nThe labels of group nodes are\nobtained from supervised models and may not be correct; therefore, a smaller \u03b1 usually achieves\nbetter performance. On the other hand, the labels of objects can be regarded as ground truth, and\nthus the larger \u03b2 the better. In experiments, we find that good performance can be achieved when\n\u03b1 is below 4 and \u03b2 is greater than 4. We let \u03b1 = 2 and \u03b2 = 8 to get the experimental results shown\nin Table 3. Also, we fix the target set as 80% of all the objects, and use 1% to 20% as the labeled\nobjects to see how the performance varies; the results are summarized in Figure 4 (c). In general,\nmore labeled objects help the classification task, and the improvements are most visible on\nthe Cora data set. When the percentage reaches 10%, BGCM-L\u2019s performance becomes stable.\nNumber of Models. We vary the number of base models incorporated into the consensus\nframework. 
The BGCM-L method on two models is denoted as 2-L, where we average the performance\nof the combined model obtained by randomly choosing one classifier and one clustering algorithm.\nSimilarly, the BGCM-L method on three models is denoted as 3-L. From Table 3, we can see that\nthe BGCM-L method using all four models outperforms the method incorporating only two or three\nmodels. When the base models are independent and each of them obtains reasonable accuracy,\ncombining more models is more beneficial because the chances of reducing independent errors\nincrease. However, when a new model cannot provide additional information to the current pool\nof models, incorporating it may no longer improve the performance. In the future, we plan to\nidentify this upper bound through experiments with more input sources.\n\n6 Conclusions\n\nIn this work, we take advantage of the complementary predictive powers of multiple supervised\nand unsupervised models to derive a consolidated label assignment for a set of objects jointly. We\npropose to summarize base model output in a group-object bipartite graph, and maximize the\nconsensus by promoting smoothness of label assignment over the graph and consistency with the initial\nlabeling. The problem is solved by iteratively propagating label information between group and object nodes\nthrough their links. The proposed method can be interpreted as conducting an embedding\nof object and group nodes into a new space, as well as an un-normalized personalized PageRank.\nWhen a few labeled objects are available, the proposed method uses them to guide the propagation\nand refine the final hypothesis. In the experiments on 20 Newsgroups, Cora and DBLP data, the\nproposed consensus maximization method improves the best base model accuracy by 2% to 10%.\nAcknowledgement The work was supported in part by the U.S. 
National Science Foundation grants\nIIS-08-42769, IIS-09-05215 and DMS-07-32276, and the Air Force Office of Scientific Research\nMURI award FA9550-08-1-0265.\n\n[Figure 4 panels, each plotting Accuracy for Newsgroup1, Cora1 and DBLP1: (a) Performance w.r.t. \u03b1; (b) Performance w.r.t. \u03b2; (c) Performance w.r.t. % Labeled Objects]\n\nReferences\n[1] 20 Newsgroups Data Set. http://people.csail.mit.edu/jrennie/20Newsgroups/.\n[2] E. Bauer and R. Kohavi. An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants. Machine Learning, 36:105\u2013139, 2004.\n[3] Dimitri P. Bertsekas. Non-Linear Programming (2nd Edition). Athena Scientific, 1999.\n[4] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-training. In Proc. of COLT\u2019 98, pages 92\u2013100, 1998.\n[5] N. Borlin. Implementation of Hungarian Method. http://www.cs.umu.se/~niclas/matlab/assignprob/.\n[6] R. Caruana. Multitask Learning. Machine Learning, 28:41\u201375, 1997.\n[7] C.-C. Chang and C.-J. Lin. LibSVM: a Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.\n[8] O. Chapelle, B. Sch\u00f6lkopf and A. Zien (eds). Semi-Supervised Learning. MIT Press, 2006.\n[9] K. Crammer, M. Kearns and J. Wortman. Learning from Multiple Sources. Journal of Machine Learning Research, 9:1757\u20131774, 2008.\n[10] DBLP Bibliography. http://www.informatik.uni-trier.de/~ley/db/.\n[11] T. Dietterich. Ensemble Methods in Machine Learning. In Proc. of MCS\u2019 00, pages 1\u201315, 2000.\n[12] X. Z. Fern and C. E. Brodley. Solving Cluster Ensemble Problems by Bipartite Graph Partitioning. In Proc. of ICML\u2019 04, pages 281\u2013288, 2004.\n[13] K. Ganchev, J. Graca, J. Blitzer, and B. Taskar. 
Multi-view Learning over Structured and Non-identical Outputs. In Proc. of UAI\u2019 08, pages 204\u2013211, 2008.\n[14] J. Gao, W. Fan, Y. Sun, and J. Han. Heterogeneous source consensus learning via decision propagation and negotiation. In Proc. of KDD\u2019 09, pages 339\u2013347, 2009.\n[15] A. Genkin, D. D. Lewis, and D. Madigan. BBR: Bayesian Logistic Regression Software. http://stat.rutgers.edu/~madigan/BBR/.\n[16] A. Goldberg and X. Zhu. Seeing stars when there aren\u2019t many stars: Graph-based semi-supervised learning for sentiment categorization. In HLT-NAACL 2006 Workshop on Textgraphs.\n[17] A. Gionis, H. Mannila, and P. Tsaparas. Clustering Aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), 2007.\n[18] T. Haveliwala. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering, 15(4):1041-4347, 2003.\n[19] J. Hoeting, D. Madigan, A. Raftery, and C. Volinsky. Bayesian Model Averaging: a Tutorial. Statistical Science, 14:382\u2013417, 1999.\n[20] R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton. Adaptive Mixtures of Local Experts. Neural Computation, 3:79\u201387, 1991.\n[21] T. Joachims. Transductive Learning via Spectral Graph Partitioning. In Proc. of ICML\u2019 03, pages 290\u2013297, 2003.\n[22] G. Karypis. CLUTO \u2013 Family of Data Clustering Software Tools. http://glaros.dtc.umn.edu/gkhome/views/cluto.\n[23] A. McCallum, K. Nigam, J. Rennie, and K. Seymore. Automating the Construction of Internet Portals with Machine Learning. Information Retrieval Journal, 3:127\u2013163, 2000.\n[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, 1999.\n[25] V. Singh, L. Mukherjee, J. Peng, and J. Xu. Ensemble Clustering using Semidefinite Programming. In Proc. of NIPS\u2019 07, 2007.\n[26] A. Strehl and J. 
Ghosh. Cluster Ensembles \u2013 a Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 3:583\u2013617, 2003.\n[27] D. Wolpert. Stacked Generalization. Neural Networks, 5:241\u2013259, 1992.\n[28] D. Zhou, J. Weston, A. Gretton, O. Bousquet and B. Sch\u00f6lkopf. Ranking on Data Manifolds. In Proc. of NIPS\u2019 03, pages 169\u2013176, 2003.\n[29] X. Zhu. Semi-supervised Learning Literature Survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.\n", "award": [], "sourceid": 609, "authors": [{"given_name": "Jing", "family_name": "Gao", "institution": null}, {"given_name": "Feng", "family_name": "Liang", "institution": null}, {"given_name": "Wei", "family_name": "Fan", "institution": null}, {"given_name": "Yizhou", "family_name": "Sun", "institution": null}, {"given_name": "Jiawei", "family_name": "Han", "institution": null}]}