{"title": "Forging The Graphs: A Low Rank and Positive Semidefinite Graph Learning Approach", "book": "Advances in Neural Information Processing Systems", "page_first": 2960, "page_last": 2968, "abstract": "In many graph-based machine learning and data mining approaches, the quality of the graph is critical. However, in real-world applications, especially in semi-supervised learning and unsupervised learning, the evaluation of the quality of a graph is often expensive and sometimes even impossible, due the cost or the unavailability of ground truth. In this paper, we proposed a robust approach with convex optimization to ``forge'' a graph: with an input of a graph, to learn a graph with higher quality. Our major concern is that an ideal graph shall satisfy all the following constraints: non-negative, symmetric, low rank, and positive semidefinite. We develop a graph learning algorithm by solving a convex optimization problem and further develop an efficient optimization to obtain global optimal solutions with theoretical guarantees. With only one non-sensitive parameter, our method is shown by experimental results to be robust and achieve higher accuracy in semi-supervised learning and clustering under various settings. As a preprocessing of graphs, our method has a wide range of potential applications machine learning and data mining.", "full_text": "Forging The Graphs: A Low Rank and Positive\n\nSemide\ufb01nite Graph Learning Approach\n\nDijun Luo, Chris Ding, Heng Huang, Feiping Nie\nDepartment of Computer Science and Engineering\n\nThe University of Texas at Arlington\n\ndijun.luo@gmail.com, chqding@uta.edu\n\nheng@uta.edu, feipingnie@gmail.com\n\nAbstract\n\nIn many graph-based machine learning and data mining approaches, the quality\nof the graph is critical. 
However, in real-world applications, especially in semi-supervised and unsupervised learning, evaluating the quality of a graph is often expensive and sometimes even impossible, due to the cost or the unavailability of ground truth. In this paper, we propose a robust approach with convex optimization to \u201cforge\u201d a graph: given an input graph, we learn a graph of higher quality. Our central premise is that an ideal graph should satisfy all of the following constraints: non-negative, symmetric, low rank, and positive semidefinite. We develop a graph learning algorithm by solving a convex optimization problem and further develop an efficient optimization to obtain global optimal solutions with theoretical guarantees. With only one non-sensitive parameter, our method is shown by experimental results to be robust and to achieve higher accuracy in semi-supervised learning and clustering under various settings. As a preprocessing of graphs, our method has a wide range of potential applications in machine learning and data mining.\n\n1 Introduction\n\nMany machine learning algorithms use graphs as input, such as clustering [16, 14], manifold-based dimensionality reduction [2, 15], and graph-based semi-supervised learning [23, 22]. In these approaches, we are particularly interested in the similarity among objects. However, observed similarity graphs often contain noise that can mislead the learning algorithm, especially in unsupervised and semi-supervised learning. Deriving graphs of high quality has therefore become an attractive topic in machine learning and data mining research.\n\nA robust and stable graph learning algorithm is especially desirable in unsupervised and semi-supervised learning, because of the unavailability or high cost of ground truth in real-world applications. 
In this paper, we develop a novel graph learning algorithm based on convex optimization, which leads to robust and competitive results.\n\n1.1 Motivation and Main Problem\n\nIn this section, the properties of the similarity matrix are revisited from the point of view of normalized cut clustering [19]. Given a symmetric similarity matrix W \\in R^{n \\times n} on n objects, normalized cut solves the following optimization problem [10],\n\n\\min_{H \\ge 0} \\mathrm{tr}\\, H^\\top (D - W) H \\quad \\text{s.t.}\\ H^\\top D H = I,   (1)\n\nwhere H \\in \\{0, 1\\}^{n \\times K} is the cluster indicator matrix, or equivalently,\n\n\\max_{F \\ge 0} \\mathrm{tr}\\, F^\\top \\tilde{W} F \\quad \\text{s.t.}\\ F^\\top F = I,   (2)\n\nwhere F = [f_1, f_2, \\cdots, f_K], H = [h_1, h_2, \\cdots, h_K], f_k = D^{1/2} h_k / \\|D^{1/2} h_k\\|, 1 \\le k \\le K, \\tilde{W} = D^{-1/2} W D^{-1/2}, D = \\mathrm{diag}(d_1, d_2, \\cdots, d_n), d_i = \\sum_{j=1}^n W_{ij}, I is the identity matrix, and K is the number of groups. Eq. (2) can be further rewritten as\n\n\\min_{F \\ge 0} \\|\\tilde{W} - F F^\\top\\|_F \\quad \\text{s.t.}\\ F^\\top F = I,   (3)\n\nwhere \\|\\cdot\\|_F denotes the Frobenius norm. We notice that\n\n\\|\\tilde{W} - G + G - F F^\\top\\|_F \\le \\|\\tilde{W} - G\\|_F + \\|G - F F^\\top\\|_F,   (4)\n\nfor any G \\in R^{n \\times n}. Our goal is to minimize the LHS (left-hand side); instead, we can minimize the RHS, which is an upper bound of the LHS.\n\nThus we need to find the intermediate matrix G, i.e., we learn a surrogate graph which is close but not identical to \\tilde{W}. Our upper-bounding approach offers flexibility which allows us to impose certain desirable properties. Note that the matrix F F^\\top has the following properties: (P1) symmetric, (P2) nonnegative, (P3) low rank, and (P4) positive semidefinite. This suggests a convex graph learning problem,\n\n\\min_G \\|G - \\tilde{W}\\|_F^2 \\quad \\text{s.t.}\\ 
G \\succeq 0, \\quad \\|G\\|_* \\le c, \\quad G = G^\\top, \\quad G \\ge 0,   (5)\n\nwhere G \\succeq 0 denotes the positive semidefinite constraint, \\|\\cdot\\|_* denotes the trace norm, i.e., the sum of the singular values [8], and c is a model parameter which controls the rank of G. The constraint G \\ge 0 forces the similarities to be non-negative, as they naturally are. By intuition, one might impose a hard low rank constraint rank(G) \\le c, but this leads to a non-convex optimization problem, which is undesirable in unsupervised and semi-supervised learning. Following matrix completion methods [5], the trace norm constraint in Eq. (5) is a good convex surrogate for the low rank constraint. For notational convenience, the normalized similarity matrix \\tilde{W} is denoted as W in the rest of the paper.\n\nBy solving Eq. (5), we are actually seeking a similarity matrix which satisfies all the properties of a perfect similarity matrix (P1-P4) and which is close to the original input matrix W. The rest of this paper is dedicated to solving Eq. (5) and to demonstrating the usefulness of its optimal solution in clustering and semi-supervised learning with both theoretical and empirical evidence.\n\n1.2 Related Work\n\nOur method can be viewed as preprocessing for a similarity matrix, and a large number of machine learning and data mining approaches require a similarity matrix (interpreted as a weighted graph) as input. For example, in unsupervised clustering, Normalized Cut [19], Ratio Cut [11], and Cheeger Cut [3] have been widely applied in various real-world applications. Graphical models for relational data, e.g., Mixed Membership Block models [1], can also be interpreted as generative models on the similarity matrices among objects. Thus a similarity matrix preprocessing model can be widely applied.\n\nA large number of approaches have been developed to learn similarity matrices with different emphases. 
Local Linear Embedding (LLE) [17, 18] and Linear Label Propagation [21] can be viewed as obtaining a similarity matrix using sparse coding. Another way to perform similarity matrix preprocessing is to take a graph as input and obtain a refined graph by learning, such as bi-stochastic graph learning [13]. Our method falls in this category. We will compare our method with these methods in the experimental section.\n\nOn optimization techniques for problems with multiple constraints, there is also much related research. von Neumann proved that the successive projection method converges to a feasible solution in convex optimization with multiple constraints; this result was employed in the paper by Liu et al. [13]. In this paper, we develop a novel optimization algorithm to solve the optimization problem with multiple convex constraints (including the inequality constraints), which is guaranteed to find the global solution. More explicitly, we develop a variant of the inexact Augmented Lagrangian Multiplier method to handle inequality constraints. We also develop a useful lemma to handle the subproblems with trace norm constraints in the main algorithm. Interestingly, one of the derived subproblems is the \u21131 ball projection problem, which can be solved elegantly by simple thresholding.\n\n[Figure 1 graphic: panels (a), (b), (c1), (c2) show similarity matrices as heat maps; panel (d) plots Eigenvalues vs. Sorting index for the solutions of Eq. (5) and Eq. (6).]\n\nFigure 1: A toy example of low rank and positive semidefinite graph learning. (a): A perfect similarity matrix. (b): Adding noise to (a). (c1): the optimal solution of Eq. (5) using the matrix in (b) as input. (c2): the optimal solution of Eq. (6) using the matrix in (b) as input. (d): sorted eigenvalues for the two solutions of Eq. (5) and Eq. 
(6).\n\n2 A Toy Example\n\nWe first emphasize the usefulness of the positive semidefinite and low rank constraints in the problem of Eq. (5) using a toy example. For contrast, we also solve the following problem,\n\n\\min_G \\|G - W\\|_F^2 \\quad \\text{s.t.}\\ G = G^\\top,\\ Ge = e,\\ G \\ge 0,   (6)\n\nwhere e = [1, 1, \\cdots, 1]^\\top; the positive semidefinite and low rank constraints of Eq. (5) are removed and a bi-stochastic constraint (Ge = e) is applied instead. Notice that the model defined in Eq. (6) is the one used in bi-stochastic graph learning [13]. We solve Eqs. (5) and (6) for the same input and compare the solutions to see the effect of the positive semidefinite and low rank constraints.\n\nIn the toy example, we first generate a perfect similarity matrix W in which W_{ij} = 1 if data points i, j are in the same group and W_{ij} = 0 otherwise. Three groups of data points (10 data points in each group) are considered. W is shown in Figure 1 (a), with black denoting zero values. We then randomly add positive noise to W, as shown in Figure 1 (b). We then solve Eqs. (5) and (6); the resulting matrices G are shown in Figure 1 (c1) and (c2). The observation is that Eq. (5) recovers the perfect similarity much more accurately than Eq. (6). The reason is that the model of Eq. (6) ignores the low rank and positive semidefinite constraints, and its result deviates from the ground truth.\n\nWe show the eigenvalue distributions of the solutions G in Figure 1 (d) for both Eqs. (5) and (6). One can observe that the solution of Eq. (5) is low rank and positive semidefinite, while the solution of Eq. (6) is full rank and has negative eigenvalues.\n\nSince the solution of Eq. 
(5) is always non-negative, symmetric, low rank, and positive semidefinite, we call our solution the Non-negative Low-rank Kernel (NLK).\n\n2.1 NLK for Semi-supervised Learning\n\nAlthough NLK is mainly developed for unsupervised learning, it can be easily extended to incorporate label information in semi-supervised learning [23]. Assume we are given a set of data X = [x_1, x_2, \\cdots, x_\\ell, x_{\\ell+1}, \\cdots, x_n] where the first \\ell data points are labeled as [y_1, y_2, \\cdots, y_\\ell]. Then we have more information with which to learn a better similarity matrix. Here we add constraints to Eq. (5) enforcing the similarity to be zero for those pairs of data points in different classes, i.e., G_{ij} = 0 if y_i \\ne y_j, 1 \\le i, j \\le \\ell. Considering all the constraints, we optimize the following,\n\n\\min_G \\|G - W\\|_F^2 \\quad \\text{s.t.}\\ G \\succeq 0,\\ \\|G\\|_* \\le c,\\ G = G^\\top,\\ G \\ge 0,\\ G_{ij} = 0\\ \\forall y_i \\ne y_j.   (7)\n\nWe will demonstrate the advantage of these semi-supervision constraints in the experimental section. The computational algorithm is given in \u00a73.3.\n\n3 Optimization\n\nThe optimization problems in Eqs. (5) and (7) are non-trivial since there are multiple constraints, including both equality and inequality constraints. Our strategy is to introduce two extra copies X and Y of the optimization variable to split the constraints into several directly solvable subproblems:\n\n\\min_G \\|G - W\\|_F^2, \\quad \\text{s.t.}\\ G \\ge 0,   (8a)\n\\min_X \\|X - W\\|_F^2, \\quad \\text{s.t.}\\ X \\succeq 0, \\text{ with } X = G,   (8b)\n\\min_Y \\|Y - W\\|_F^2, \\quad \\text{s.t.}\\ \\|Y\\|_* \\le c, \\text{ with } Y = G.   (8c)\n\nMore formally, we solve the following problem,\n\n\\min_G \\|G - W\\|_F^2   (9a)\n\\text{s.t.}\\ G \\ge 0,   (9b)\nX = G,\\ X \\succeq 0,   (9c)\nY = G,\\ \\|Y\\|_* \\le c.   (9d)\n\nOne should notice that the problem in Eqs. (9a)-(9d) is equivalent to our main problem in Eq. 
(5). In the rest of this section, we employ a variant of the Augmented Lagrangian Multiplier (ALM) method to solve Eqs. (9a)-(9d).\n\n3.1 Seeking Global Solutions: A Variant of ALM\n\nThe augmented Lagrangian function of Eqs. (9a)-(9d) is\n\n\\Phi(G, X, Y) = \\|G - W\\|_F^2 - \\langle \\Lambda, X - G \\rangle + \\frac{\\mu}{2}\\|G - X\\|_F^2 - \\langle \\Sigma, Y - G \\rangle + \\frac{\\mu}{2}\\|G - Y\\|_F^2,   (10)\n\nwith constraints G \\ge 0, X \\succeq 0, and \\|Y\\|_* \\le c, where \\Lambda, \\Sigma are the Lagrangian multipliers. The ALM method then leads to the following updating steps,\n\nG \\leftarrow \\arg\\min_{G \\ge 0} \\Phi(G, X, Y)   (11a)\nX \\leftarrow \\arg\\min_{X \\succeq 0} \\Phi(G, X, Y)   (11b)\nY \\leftarrow \\arg\\min_{\\|Y\\|_* \\le c} \\Phi(G, X, Y)   (11c)\n\\Lambda \\leftarrow \\Lambda - \\mu (G - X)   (11d)\n\\Sigma \\leftarrow \\Sigma - \\mu (G - Y)   (11e)\n\\mu \\leftarrow \\gamma\\mu, \\quad t \\leftarrow t + 1.   (11f)\n\nNotice that the symmetric constraint is removed here. We will later show that given a symmetric input W, the output of our algorithm automatically satisfies the symmetric constraint.\n\n3.2 Solving the Subproblems in ALM\n\nThe X and Y updates in Eqs. (11b) and (11c) contain eigenvalue constraints, which appear complicated. Fortunately they have closed form solutions. To show this, we first introduce the following useful lemma.\n\nLemma 3.1. Consider the following problem,\n\n\\min_X \\|X - A\\|_F^2, \\quad \\text{s.t.}\\ \\phi_i(X) \\le c_i,\\ 1 \\le i \\le m,   (12)\n\nwhere each \\phi_i(X) \\le c_i is a constraint on the eigenvalues of X, i = 1, 2, \\cdots, m, and m is the number of constraints. Then there exists a diagonal matrix S such that U S U^\\top is an optimizer of Eq. (12), where U D U^\\top = A is the eigenvector decomposition of A, and S depends on the eigenvalues D = \\mathrm{diag}(d_1, \\cdots, d_n) and on the constraints.\n\nProof. 
Let V S V^\\top = X and U D U^\\top = A be the eigenvector decompositions of X and A, respectively, with eigenvalues in matching order. By applying von Neumann's trace inequality, the following holds for any such X and A,\n\n\\mathrm{tr}\\, X^\\top A \\le \\mathrm{tr}\\, S D.   (13)\n\nMoreover,\n\n\\mathrm{tr}\\, (U S U^\\top)^\\top A = \\mathrm{tr}\\, U S U^\\top U D U^\\top = \\mathrm{tr}\\, S D \\ge \\mathrm{tr}\\, X^\\top A,   (14)\n\nwhich leads to\n\n\\|U S U^\\top - A\\|_F^2 \\le \\|V S V^\\top - A\\|_F^2 = \\|X - A\\|_F^2.   (15)\n\nNow assume X = V S V^\\top is a minimizer of Eq. (12). Comparing the two solutions X = V S V^\\top and Z = U S U^\\top, one should notice (a) that Z satisfies all the constraints \\phi_i(Z) = \\phi_i(X) \\le c_i, 1 \\le i \\le m, of Eq. (12), since Z and X have the same eigenvalues, and (b) that by Eq. (15), Z gives an equal or smaller objective value. Thus Z = U S U^\\top is also a minimizer of Eq. (12).\n\nLemma 3.1 shows an interesting property of matrix approximation under eigenvalue or singular value constraints: the optimal solution shares the eigenvector subspace of the input matrix. This is useful, because once the subspace is determined, the whole optimization becomes much easier. The lemma thus provides a powerful mathematical tool for optimization problems with eigenvalue and singular value constraints. Here, we apply Lemma 3.1 to the updates of X and Y in \u00a73.2.2-3.2.3.\n\n3.2.1 Updating G\n\nBy ignoring the terms irrelevant to G, we can rewrite Eq. (11a) as\n\nG \\leftarrow \\arg\\min_{G \\ge 0} \\|(2 + 2\\mu)G - (2W + \\mu(X + Y) + \\Lambda + \\Sigma)\\|_F^2 + \\mathrm{const}   (16)\n = \\max\\left( \\frac{2W + \\mu(X + Y) + \\Lambda + \\Sigma}{2 + 2\\mu},\\ 0 \\right).   (17)\n\n3.2.2 Updating X\n\nFor Eq. (11b), we need to solve the following subproblem,\n\n\\min_X \\|X - P\\|_F^2, \\quad X \\succeq 0, \\quad \\text{where } P = G + \\Lambda/\\mu.   (18)\n\nNotice that X \\succeq 0 is a constraint on the eigenvalues of X. We can therefore directly apply Lemma 3.1: X can be written as U S U^\\top and Eq. (18) becomes\n\n\\min_S \\|U S U^\\top - U D U^\\top\\|_F^2, \\quad \\text{s.t.}\\ 
S \\ge 0,   (19)\n\nwhere U D U^\\top = P is the eigenvector decomposition of P. Let S = \\mathrm{diag}(s_1, s_2, \\cdots, s_n) and D = \\mathrm{diag}(d_1, d_2, \\cdots, d_n). Then Eq. (19) can be further rewritten as\n\n\\min_{s_1, s_2, \\cdots, s_n} \\sum_{i=1}^n (s_i - d_i)^2, \\quad \\text{s.t.}\\ s_i \\ge 0,\\ i = 1, 2, \\cdots, n.   (20)\n\nEq. (20) has the simple closed form solution s_i = \\max(d_i, 0), i = 1, 2, \\cdots, n.\n\n3.2.3 Updating Y\n\nEq. (11c) can be rewritten as\n\n\\min_Y \\|Y - Q\\|_F^2, \\quad \\|Y\\|_* \\le c,   (21)\n\nwhere Q = G + \\Sigma/\\mu. The corresponding Lagrangian function is\n\nL(Y, \\lambda) = \\|Y - Q\\|_F^2 + \\lambda (\\|Y\\|_* - c).   (22)\n\nSince we do not know the true Lagrangian multiplier \\lambda, we cannot directly apply the singular value thresholding technique [4]. However, we find Lemma 3.1 useful again. We notice that Y is symmetric, so the constraint \\|Y\\|_* \\le c becomes a constraint on the eigenvalues of Y. Let Y = U S U^\\top; by directly applying Lemma 3.1, Eq. (21) can be further written as\n\n\\min_S \\|U S U^\\top - U D U^\\top\\|_F^2, \\quad \\text{s.t.}\\ \\sum_{i=1}^n |s_i| \\le c,   (23)\n\nor,\n\n\\min_s \\|s - d\\|^2, \\quad \\text{s.t.}\\ \\sum_{i=1}^n |s_i| \\le c,   (24)\n\nwhere S = \\mathrm{diag}(s), s = [s_1, s_2, \\cdots, s_n]^\\top, D = \\mathrm{diag}(d), and d = [d_1, d_2, \\cdots, d_n]^\\top. Interestingly, the above problem is a standard \u21131 ball projection problem, which has been studied for a long time; Duchi et al. have recently provided a simple and elegant solution [7]. 
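This projection amounts to soft-thresholding the eigenvalues. A minimal NumPy sketch of the sort-based \u21131-ball projection in the style of Duchi et al. [7] (the function name and vectorized layout are ours, not from the paper; we assume c > 0):

```python
import numpy as np

def project_l1_ball(d, c):
    """Project a vector d onto the l1 ball of radius c (assumed c > 0).

    Finds the smallest theta >= 0 with sum_i max(|d_i| - theta, 0) <= c,
    then soft-thresholds each entry by theta.
    """
    d = np.asarray(d, dtype=float)
    if np.abs(d).sum() <= c:
        return d.copy()                      # already feasible, no shrinkage
    u = np.sort(np.abs(d))[::-1]             # |d_i| sorted in descending order
    css = np.cumsum(u)
    # largest index rho with u[rho] > (css[rho] - c) / (rho + 1)
    rho = np.nonzero(u > (css - c) / np.arange(1, d.size + 1))[0][-1]
    theta = (css[rho] - c) / (rho + 1.0)
    return np.sign(d) * np.maximum(np.abs(d) - theta, 0.0)
```

In the Y update of Eq. (11c), this would be applied to the eigenvalues d of G + \Sigma/\mu, after which Y is reassembled as U diag(s) U^\top.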
The final solution is to search for the least \\theta \\ge 0 such that \\sum_i \\max(|d_i| - \\theta, 0) \\le c, i.e.,\n\n\\theta = \\arg\\min_\\theta \\theta \\quad \\text{s.t.}\\ \\sum_{i=1}^n \\max(|d_i| - \\theta, 0) \\le c.   (25)\n\nThis can easily be done by sorting the |d_i| and trying \\theta values between consecutive sorted |d_i|. The solution is then\n\ns_i = \\mathrm{sign}(d_i) \\max(|d_i| - \\theta, 0).   (26)\n\nNotice that each step of the algorithm has a closed form solution and that the output G is always symmetric, which indicates that the constraint G = G^\\top is automatically satisfied at each step.\n\n3.3 NLK Algorithm for Semi-supervised Learning\n\nIn many real-world settings, we know part of the data class labels and hope to further utilize such information, as described in Eq. (7). Fortunately, the corresponding optimization problem remains convex. The augmented Lagrangian function is\n\n\\Phi(G, X, Y) = \\|G - W\\|_F^2 - \\langle \\Lambda, X - G \\rangle + \\frac{\\mu}{2}\\|X - G\\|_F^2 - \\langle \\Sigma, Y - G \\rangle + \\frac{\\mu}{2}\\|Y - G\\|_F^2 + \\sum_{(i,j) \\in T} \\left( \\frac{\\mu}{2} G_{ij}^2 - \\Omega_{ij} G_{ij} \\right).   (27)\n\nThis is identical to Eq. (10), except that we add \\Omega as an additional Lagrangian multiplier for the semi-supervised constraints, i.e., the desired similarity G_{ij} = 0 for pairs (i, j) with different known class labels. Here T = \\{(i, j) : y_i \\ne y_j,\\ i, j = 1, 2, \\cdots, \\ell\\}.\n\nWe modify the algorithm of Eqs. (11a)-(11f) to solve this problem. The updating of X and Y remains the same as in the NLK algorithm described previously. 
To update G, we set \\partial\\Phi(G, X, Y)/\\partial G = 0 and obtain\n\nG_{ij} \\leftarrow \\max\\left( \\frac{2W_{ij} + \\mu(X_{ij} + Y_{ij}) + \\Lambda_{ij} + \\Sigma_{ij} + \\Omega_{ij}}{2 + 3\\mu},\\ 0 \\right) \\text{ if } y_i \\ne y_j, \\qquad G_{ij} \\leftarrow \\max\\left( \\frac{2W_{ij} + \\mu(X_{ij} + Y_{ij}) + \\Lambda_{ij} + \\Sigma_{ij}}{2 + 2\\mu},\\ 0 \\right) \\text{ otherwise.}   (28)\n\nFor the Lagrangian multiplier \\Omega, the corresponding update is\n\n\\Omega_{ij} \\leftarrow \\Omega_{ij} - \\mu G_{ij}, \\quad \\forall y_i \\ne y_j.   (29)\n\nThus the semi-supervised learning algorithm is nearly identical to the unsupervised learning algorithm, which is one strength of our unified NLK approach.\n\nWe summarize the NLK algorithms for unsupervised and semi-supervised learning in Algorithm 1. In the algorithm, Lines 4 and 9 are used for semi-supervised learning, while the other lines are shared.\n\nAlgorithm 1 NLK Algorithm for Unsupervised and Semi-supervised Learning\nRequire: Weighted graph W, model parameter c, optimization parameter \\gamma, partial labels y for semi-supervised learning.\n1: Initialization: G = W, \\Lambda = 0, \\Sigma = 0, \\Omega = 0, \\mu = 1.\n2: while not converged do\n3:   For unsupervised learning, G \\leftarrow \\max\\left( \\frac{2W + \\mu(X + Y) + \\Lambda + \\Sigma}{2 + 2\\mu},\\ 0 \\right).\n4:   For semi-supervised learning, update G using Eq. (28).\n5:   X \\leftarrow U D_+ U^\\top where U D U^\\top = G + \\Lambda/\\mu and D_+ = \\max(D, 0).\n6:   Y \\leftarrow U S U^\\top where U D U^\\top = G + \\Sigma/\\mu and S is computed by Eq. (26).\n7:   \\Lambda \\leftarrow \\Lambda - \\mu (X - G)\n8:   \\Sigma \\leftarrow \\Sigma - \\mu (Y - G)\n9:   For semi-supervised learning, \\Omega_{ij} \\leftarrow \\Omega_{ij} - \\mu G_{ij}, \\forall y_i \\ne y_j.\n10:  \\mu \\leftarrow \\gamma\\mu.\n11: end while\n12: return G\n\n3.4 Theoretical Analysis of the Algorithm\n\nSince the objective function and all the constraints are convex, we have the following [12].\n\nTheorem 3.2. Algorithm 1 converges to the global solution of Eq. (5) or Eq. 
(7).\n\nNotice that this conclusion is stronger than those in related graph learning papers [13].\n\n4 Experimental Validation\n\nAs mentioned in the introduction, the optimization result of NLK (Eq. (5)) can be used as preprocessing for any graph-based method. Here we evaluate NLK on several state-of-the-art graph-based learning models, including Normalized Cut (Ncut) [19] for unsupervised learning, and Gaussian Fields and Harmonic Functions (GFHF) and Local and Global Consistency learning (LGC) for semi-supervised learning. For clustering, we compare both clustering accuracy and normalized mutual information (NMI). For the semi-supervised learning model (Eq. (7)), we evaluate our models on the GFHF and LGC models and measure classification accuracy. We verify the algorithms on four data sets: AT&T (n = 400, p = 644, K = 40), BinAlpha (n = 1404, p = 320, K = 36), Segment (n = 2310, p = 19, K = 7), and Vehicle (n = 946, p = 18, K = 4) from UCI data [9], where n, p, and K are the numbers of data points, features, and classes, respectively.\n\n4.1 Experimental Settings\n\nFor clustering, we compare three similarity matrices: (1) the original Gaussian kernel matrix, w_{ij} = \\exp(-\\|x_i - x_j\\|^2 / 2\\sigma^2), where \\sigma is set to the average pairwise distance among all the data points; (2) BBS (Bregmanian Bi-Stochastication) [20]; and (3) our method (NLK). The clustering algorithm of Normalized Cut [19] is applied to the three similarity matrices, giving three clustering approaches in total: Normalized Cut (Ncut), BBS+Ncut, and NLK+Ncut. For each clustering method, we try 100 random trials with different clustering initializations. For semi-supervised learning, we test three basic graph-based semi-supervised learning models: 
Gaussian Fields and Harmonic Functions (GFHF) [23], Local and Global Consistency learning (LGC) [22], and Green's function (Green) [6]. We compare 4 types of similarity matrices: the original Gaussian kernel matrix, as discussed before, BBS, NLK, and NLK with semi-supervised constraints (the model in Eq. (7), denoted NLK Semi). We therefore have 3 \\times 4 = 12 methods in total. For each method, we randomly split the data 30%/70%, where 30% is used as labeled data and the other 70% as testing data. We try 100 random splits and report the averages and standard deviations.\n\n4.2 Parameter Settings\n\nFor all the similarity learning approaches (BBS, NLK, and NLK Semi), we set the convergence criterion as follows: if \\|G_{t+1} - G_t\\|_F^2 / \\|G_t\\|_F^2 < 10^{-10}, we stop the algorithm. For our methods (NLK and NLK Semi), there is one model parameter c, which is always set to c = 0.5\\|W\\|_*, where W is the input similarity matrix.\n\nTable 1: Clustering accuracy and NMI comparison over 3 methods, Normalized Cut (Ncut), BBS+Ncut, and NLK+Ncut, on 4 data sets. The best results are highlighted in bold.\n\n            Accuracy                                         NMI\n            Ncut           BBS+Ncut       NLK+Ncut          Ncut           BBS+Ncut       NLK+Ncut\nAT&T        0.607 ± 0.022  0.686 ± 0.021  0.767 ± 0.006     0.785 ± 0.025  0.836 ± 0.026  0.873 ± 0.025\nBinAlpha    0.431 ± 0.018  0.444 ± 0.022  0.490 ± 0.009     0.618 ± 0.013  0.629 ± 0.015  0.673 ± 0.011\nSegment     0.613 ± 0.018  0.593 ± 0.009  0.616 ± 0.002     0.528 ± 0.016  0.579 ± 0.013  0.538 ± 0.002\nVehicle     0.383 ± 0.001  0.383 ± 0.000  0.426 ± 0.000     0.121 ± 0.001  0.122 ± 0.000  0.184 ± 0.000\n\n4.3 Experimental Results\n\nWe show the clustering results in Table 1, where we compare both measurements (accuracy, NMI) over the 3 methods on the 4 data sets. For each method, we report the average performance and the corresponding standard deviation. 
Out of the 4 data sets, our method outperforms all the other methods on all measurements on 3 data sets (AT&T, BinAlpha, and Vehicle).\n\nWe also test the semi-supervised learning performance of the 12 methods on the 4 data sets. For each method on each data set, we show the original performance values with dots, together with the average accuracies and the corresponding standard deviations. Out of the 4 data sets, our methods (NLK and NLK Semi) outperform the other methods.\n\n[Figure 2 graphic: dot plots for panels (a) GFHF, (b) LGC, and (c) Green, each comparing the similarity matrices Original, BBS, NLK, and NLK_Semi on AT&T, BinAlpha, Segmentation, and Vehicle, with the mean accuracy ± standard deviation annotated per method.]\n\nFigure 2: Semi-supervised learning performance over the 12 methods on 4 data sets. The original accuracy value for each random split is plotted with dots. Shown are also the average accuracies and the corresponding standard deviations.\n\n5 Conclusions and Discussion\n\nIn this paper, we derive a similarity learning model based on convex optimization. We demonstrate that the low rank and positive semidefinite constraints are natural for similarity matrices. Furthermore, we develop an efficient algorithm to obtain the global solution with theoretical guarantees. We also develop optimization techniques that are potentially useful in related problems with eigenvalue or singular value constraints. The presented model is verified in extensive experiments, and the results show that our method enhances the quality of the similarity matrix significantly, in both clustering and semi-supervised learning.\n\nAcknowledgement This research was partially supported by NSF-CCF 0830780, NSF-DMS 0915228, NSF-CCF 0917274, NSF-IIS 1117965.\n\nReferences\n\n[1] E. Airoldi, D. Blei, E. Xing, and S. Fienberg. A latent mixed membership model for relational data. In Proceedings of the 3rd International Workshop on Link Discovery, pages 82-89. ACM, 2005.\n\n[2] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.\n\n[3] T. Bühler and M. Hein. Spectral clustering based on the graph p-Laplacian. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 81-88. ACM, 2009.\n\n[4] J. Cai, E. Candes, and Z. Shen. A singular value thresholding algorithm for matrix completion. IEEE Trans. Inform. Theory, 56(5):2053-2080, 2008.\n\n[5] E. Candes and Y. Plan. 
Matrix completion with noise. Proceedings of the IEEE, 98(6):925-936, 2010.\n\n[6] C. Ding, R. Jin, T. Li, and H. Simon. A learning framework using Green's function and kernel regularization with application to recommender system. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 260-269. ACM, 2007.\n\n[7] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra. Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272-279. ACM, 2008.\n\n[8] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.\n\n[9] A. Frank and A. Asuncion. UCI machine learning repository, 2010.\n\n[10] M. Gu, H. Zha, C. Ding, X. He, H. Simon, and J. Xia. Spectral relaxation models and structure analysis for k-way graph clustering and bi-clustering. UC Berkeley Math Dept Tech Report, 2001.\n\n[11] L. Hagen and A. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 11(9):1074-1085, 1992.\n\n[12] R. Lewis and V. Torczon. A globally convergent augmented Lagrangian pattern search algorithm for optimization with general constraints and simple bounds. SIAM Journal on Optimization, 12(4):1075-1089, 2002.\n\n[13] W. Liu and S. Chang. Robust multi-class transductive learning with graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.\n\n[14] D. Luo, C. Ding, and H. Huang. Graph evolution via social diffusion processes. In Machine Learning and Knowledge Discovery in Databases, pages 390-404, 2011.\n\n[15] D. Luo, C. Ding, F. Nie, and H. Huang. Cauchy graph embedding. In ICML, pages 553-560, 2011.\n\n[16] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849-856, 2002.\n\n[17] S. 
Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323-2326, 2000.\n\n[18] H. Seung and D. Lee. The manifold ways of perception. Science, 290(5500):2268-2269, 2000.\n\n[19] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.\n\n[20] F. Wang, P. Li, and A. König. Learning a bi-stochastic data similarity matrix. In 2010 IEEE International Conference on Data Mining, pages 551-560. IEEE, 2010.\n\n[21] F. Wang and C. Zhang. Label propagation through linear neighborhoods. IEEE Transactions on Knowledge and Data Engineering, pages 55-67, 2007.\n\n[22] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems 16, pages 595-602, 2004.\n\n[23] X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, 2003.\n", "award": [], "sourceid": 1344, "authors": [{"given_name": "Dijun", "family_name": "Luo", "institution": null}, {"given_name": "Heng", "family_name": "Huang", "institution": null}, {"given_name": "Feiping", "family_name": "Nie", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}]}