{"title": "Extracting Certainty from Uncertainty: Transductive Pairwise Classification from Pairwise Similarities", "book": "Advances in Neural Information Processing Systems", "page_first": 262, "page_last": 270, "abstract": "In this work, we study the problem of transductive pairwise classification from pairwise similarities~\\footnote{The pairwise similarities are usually derived from some side information instead of the underlying class labels.}. The goal of transductive pairwise classification from pairwise similarities is to infer the pairwise class relationships, to which we refer as pairwise labels, between all examples given a subset of class relationships for a small set of examples, to which we refer as labeled examples. We propose a very simple yet effective algorithm that consists of two simple steps: the first step is to complete the sub-matrix corresponding to the labeled examples and the second step is to reconstruct the label matrix from the completed sub-matrix and the provided similarity matrix. Our analysis exhibits that under several mild preconditions we can recover the label matrix with a small error, if the top eigen-space that corresponds to the largest eigenvalues of the similarity matrix covers well the column space of label matrix and is subject to a low coherence, and the number of observed pairwise labels is sufficiently enough. We demonstrate the effectiveness of the proposed algorithm by several experiments.", "full_text": "Extracting Certainty from Uncertainty: Transductive\n\nPairwise Classi\ufb01cation from Pairwise Similarities\n\nTianbao Yang\u2020, Rong Jin\u2021(cid:92)\n\n\u2020The University of Iowa, Iowa City, IA 52242\n\n\u2021Michigan State University, East Lansing, MI 48824\n\n(cid:92)Alibaba Group, Hangzhou 311121, China\n\ntianbao-yang@uiowa.edu, rongjin@msu.edu\n\nAbstract\n\nIn this work, we study the problem of transductive pairwise classi\ufb01cation from\npairwise similarities 1. 
The goal of transductive pairwise classification from pairwise similarities is to infer the pairwise class relationships, which we refer to as pairwise labels, between all examples, given a subset of class relationships for a small set of examples, which we refer to as labeled examples. We propose a very simple yet effective algorithm that consists of two simple steps: the first step is to complete the sub-matrix corresponding to the labeled examples, and the second step is to reconstruct the label matrix from the completed sub-matrix and the provided similarity matrix. Our analysis shows that, under several mild preconditions, we can recover the label matrix with a small error if the top eigenspace corresponding to the largest eigenvalues of the similarity matrix covers the column space of the label matrix well and has low coherence, and the number of observed pairwise labels is sufficiently large. We demonstrate the effectiveness of the proposed algorithm by several experiments.

1 Introduction

Pairwise classification aims to determine whether two examples belong to the same class. It has been studied in several different contexts, depending on what prior information is provided. In this paper, we tackle the pairwise classification problem provided with a pairwise similarity matrix and a small set of true pairwise labels. We refer to the problem as transductive pairwise classification from pairwise similarities. The problem has many applications in real-world situations. For example, in network science [17], an interesting task is to predict whether a link between two nodes is likely to occur given a snapshot of a network and certain similarities between the nodes. In computational biology [16], an important problem is to predict whether two protein sequences belong to the same family based on their sequence similarities, with some partial knowledge about protein families available. 
In computer vision, a notable application is face verification [5], which aims to verify whether two face images belong to the same identity given some pairs of training images.

The challenge in solving the problem arises from the uncertainty of the given pairwise similarities in reflecting the pairwise labels. Therefore, the naive approach of binarizing the similarity values with a threshold performs poorly. One common approach is to cast the problem into a clustering problem and derive the pairwise labels from the clustering results. Many algorithms have been proposed to cluster the data using the pairwise similarities and a subset of pairwise labels. However, the success of these algorithms usually depends on how many pairwise labels are provided, as well as on how well the pairwise similarities reflect the true pairwise labels.

¹The pairwise similarities are usually derived from some side information instead of the underlying class labels.

In this paper, we focus on the theoretical analysis of the problem. Essentially, we answer the question of what property the similarity matrix should satisfy and how many pre-determined pairwise labels are sufficient in order to recover the true pairwise labels between all examples. We base our analysis on a very simple scheme composed of two steps: (i) the first step recovers the sub-matrix of the label matrix from the pre-determined entries by matrix completion, which has been studied extensively and can be solved efficiently; (ii) the second step estimates the full label matrix by simple matrix products based on the top eigenspace of the similarity matrix and the completed sub-matrix. 
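As a concrete illustration of step (i), the sub-matrix can be recovered with any low-rank matrix completion solver. Below is a minimal numpy sketch using a simple hard-impute (truncated-SVD refitting) iteration as an assumed stand-in for the nuclear-norm formulation analyzed later; the function name and the `rank` parameter (the assumed number of hidden classes r) are our illustrative choices:

```python
import numpy as np

def complete_label_submatrix(Z_obs, mask, rank, iters=200):
    """Fill in the unobserved pairwise labels of a low-rank sub-matrix.

    A minimal hard-impute sketch (truncated-SVD refitting), used here as a
    simple stand-in for the nuclear-norm completion step of the paper.
    Z_obs: (m, m) array whose observed entries hold the 0/1 pairwise labels.
    mask:  (m, m) boolean array, True where the pairwise label is observed.
    rank:  assumed number of hidden classes r.
    """
    M = np.where(mask, Z_obs, 0.0).astype(float)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M, full_matrices=False)
        # Project onto the set of rank-r matrices ...
        M = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # ... then restore agreement with the observed entries.
        M[mask] = Z_obs[mask]
    return M
```

Any dedicated nuclear-norm solver could replace this iteration; the point is only that the observed entries pin down the whole low-rank sub-matrix.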
Our empirical studies demonstrate that the proposed algorithm can be more effective than spectral clustering and kernel alignment approaches in exploiting the pre-determined labels and the provided similarities.

To summarize our theoretical results: under some appropriate preconditions, namely that the distribution of data over the underlying classes in hindsight is well balanced, the labeled data are uniformly sampled from all data, and the pre-determined pairwise labels are uniformly sampled from all pairs between the labeled examples, we can recover the label matrix with a small error if (i) the top eigenspace corresponding to the s largest eigenvalues of the similarity matrix covers the column space of the label matrix well and has low coherence, and (ii) the number of pre-determined pairwise labels N on m labeled examples satisfies N ≥ Ω(m log²(m)) with m ≥ Ω(µs s log s), where µs is a coherence measure of the top eigenspace of the similarity matrix.

2 Related Work

The transductive pairwise classification problem is closely related to semi-supervised clustering, where a set of pairwise labels is provided together with pairwise similarities or feature vectors to cluster a set of data points. We focus our attention on works where the pairwise similarities, rather than the feature vectors, serve as inputs.

Spectral clustering [19] and kernel k-means [7] are probably the most widely applied clustering algorithms given a similarity matrix or a kernel matrix. In spectral clustering, one first computes the top eigenvectors of a similarity matrix (or bottom eigenvectors of a Laplacian matrix), and then clusters the rows of the eigenvector matrix into a pre-defined number of clusters. Kernel k-means is a variant of k-means that computes the distances using the kernel similarities. 
One can easily derive pairwise labels from the clustering results by assuming that two data points assigned to the same cluster belong to the same class, and vice versa. To utilize some pre-determined pairwise labels, one can normalize the similarities and replace the entries corresponding to the observed pairs with the provided labels.

There also exist some works that learn a parametric or non-parametric kernel from the pre-determined pairwise labels and the pairwise similarities. Hoi et al. [13] proposed to learn a parametric kernel, characterized by a combination of the top eigenvectors of a (kernel) similarity matrix, by maximizing a kernel alignment measure over the combination weights. Other works [2, 6] that exploit pairwise labels for clustering operate on feature vector representations of the data points. However, all of these works lack theoretical analysis of the algorithms, which is important from a theoretical point of view. There also exists a large body of research on preference learning and ranking in the semi-supervised or transductive setting [1, 14]. We do not compare with these because the ground-truth function over a pair of data points that we analyze, denoted by h(u, v), is symmetric, i.e., h(u, v) = h(v, u), while in preference learning the function h(u, v) is asymmetric.

Our theoretical analysis builds on several previous studies of matrix completion and matrix reconstruction by random sampling. Candès and Recht [3] developed a theory of matrix completion from partial observations that provides a theoretical guarantee of perfect recovery of a low-rank matrix under appropriate conditions on the matrix and the number of observations. Several works [23, 10, 15, 28] analyzed the approximation error of the Nyström method, which approximates a kernel matrix by sampling a small number of columns. 
All of these analyses exploit an important measure of an orthogonal matrix, the matrix coherence, which also plays an important role in our analysis.

It has been brought to our attention that two recent works [29, 26] are closely related to the present work, but with notable differences. Both works present a matrix completion theory with side information. Yi et al. [29] aim to complete the pairwise label matrix given partially observed entries for semi-supervised clustering. Under the assumption that the column space of the symmetric pairwise label matrix to be completed is spanned by the top left singular vectors of the data matrix, they show that their algorithm can perfectly recover the pairwise label matrix with high probability. In [26], the authors assume that the column and row spaces of the matrix to be completed are given a priori, and show that the number of observations required to perfectly complete the matrix can be reduced substantially. There are two notable differences between [29, 26] and our work: (i) we target a transductive setting, in which the observed partial entries are not uniformly sampled from the whole matrix, so their algorithms are not applicable; (ii) we prove a small reconstruction error even when the assumption that the column space of the pairwise label matrix is spanned by the top eigenvectors of the pairwise similarity matrix fails.

3 The Problem and A Simple Algorithm

We first describe the problem of transductive pairwise classification from pairwise similarities, and then present a simple algorithm.

3.1 Problem Definition

Let Dn = {o1, . . . , on} be a set of n examples. We are given a pairwise similarity matrix S ∈ Rn×n, with each entry Sij measuring the similarity between oi and oj, a set of m random samples denoted by D̂m = {ô1, . . .
, ôm} ⊆ Dn, and a subset of pre-determined pairwise labels, each being either 1 or 0, randomly sampled from all pairs between the examples in D̂m. The problem is to recover the pairwise labels of all remaining pairs between examples in Dn. Note that the key difference between our problem and previous matrix completion problems is that the partially observed entries are randomly distributed only over D̂m × D̂m instead of Dn × Dn.

We are interested in the case where the pairwise labels indicate the pairwise class relationships, i.e., a pairwise label between two examples equal to 1 indicates that they belong to the same class, and equal to 0 indicates that they belong to different classes. We denote by r the number of underlying classes. We introduce a label matrix Z ∈ {0,1}n×n to represent the pairwise labels between all examples, and similarly denote by Ẑ ∈ {0,1}m×m the pairwise labels between any two labeled examples² in D̂m. To capture the subset of pre-determined pairwise labels for the labeled data, we introduce a set Σ ⊂ [m] × [m] to indicate the subset of observed entries in Ẑ, i.e., the pairwise label Ẑi,j, (i, j) ∈ Σ, is observed if and only if the pairwise label between ôi and ôj is pre-determined. We denote by ẐΣ the partially observed label matrix, i.e.,

[ẐΣ]i,j = Ẑi,j if (i, j) ∈ Σ, and N/A if (i, j) ∉ Σ.

3.2 A Simple Algorithm

The goal of transductive pairwise classification from pairwise similarities is to estimate the pairwise label matrix Z ∈ {0,1}n×n for all examples in Dn using (i) the pairwise similarities in S and (ii) the partially observed label matrix ẐΣ.

In order to estimate the label matrix Z, the proposed algorithm consists of two steps. 
The first step is to recover the sub-matrix Ẑ, and the second step is to estimate the label matrix Z using the recovered Ẑ and the provided similarity matrix S.

Recover the sub-matrix Ẑ. First, we note that the label matrix Z and the sub-matrix Ẑ are of low rank under the assumption that the number of hidden classes r is small. To see this, let gk ∈ {1,0}n and ĝk ∈ {1,0}m denote the assignments to the k-th hidden class of all data and of the labeled data, respectively. It is straightforward to show that

Z = ∑_{k=1}^r gk gk⊤,   Ẑ = ∑_{k=1}^r ĝk ĝk⊤   (1)

which clearly indicates that both Z and Ẑ are of low rank if r is significantly smaller than m. As a result, we can apply the matrix completion algorithm [20] to recover Ẑ by solving the following optimization problem:

min_{M ∈ Rm×m} ‖M‖tr   s.t. Mi,j = Ẑi,j ∀(i, j) ∈ Σ   (2)

where ‖M‖tr denotes the nuclear norm of a matrix.

²The labeled examples refer to the examples in D̂m on which the pre-determined pairwise labels are provided.

Algorithm 1 A Simple Algorithm for Transductive Pairwise Classification by Matrix Completion
1: Input:
   • ẐΣ: the subset of observed pairwise labels for labeled examples in D̂m
   • S: a pairwise similarity matrix between all examples in Dn
   • s < m: the number of eigenvectors used for estimating Z
2: Compute the first s eigenvectors of the similarity matrix S // Preparation
3: Estimate Ẑ by solving the optimization problem in (2) // Step 1: recover the sub-matrix Ẑ
4: Estimate the label matrix Z using (5) // Step 2: estimate the label matrix Z
5: Output: Z
Estimate the label matrix Z. The second step is to estimate the remaining entries in the label matrix Z. In the sequel, for ease of analysis, we obtain an estimate of the full matrix Z, from which one can read off the pairwise labels of all remaining pairs.

We first describe the motivation of the second step and then present the details of the computation. Assume that there exists an orthogonal matrix Us = (u1, · · · , us) ∈ Rn×s whose column space subsumes the column space of the label matrix Z, where s ≥ r. Then there exist ak ∈ Rs, k = 1, . . . , r, such that

gk = Us ak,   k = 1, . . . , r.   (3)

Considering the formulation of Z and Ẑ in (1), the second step works as follows: we first compute an estimate of ∑_{k=1}^r ak ak⊤ from the completed sub-matrix Ẑ, and then compute an estimate of Z based on the estimate of ∑_{k=1}^r ak ak⊤. To this end, we construct the following optimization problems for k = 1, . . . , r:

âk = arg min_a ‖ĝk − Ûs a‖²₂ = (Ûs⊤ Ûs)† Ûs⊤ ĝk   (4)

where Ûs ∈ Rm×s is the sub-matrix of Us ∈ Rn×s whose row indices correspond to the global indices of the labeled examples in D̂m with respect to Dn. Then, since Z = ∑_{k=1}^r gk gk⊤ = Us (∑_{k=1}^r ak ak⊤) Us⊤, we can estimate ∑_{k=1}^r ak ak⊤ and Z by

∑_{k=1}^r âk âk⊤ = (Ûs⊤ Ûs)† Ûs⊤ Ẑ Ûs (Ûs⊤ Ûs)†,
Z′ = Us (Ûs⊤ Ûs)† Ûs⊤ Ẑ Ûs (Ûs⊤ Ûs)† Us⊤   (5)
In order to complete the algorithm, we need to specify how to construct the orthogonal matrix Us = (u1, · · · , us). Inspired by previous studies on spectral clustering [18, 19], we construct Us from the first s eigenvectors, corresponding to the s largest eigenvalues, of the provided similarity matrix. A justification of this practice is that if the similarity graph induced by a similarity matrix has r connected components, then the eigenspace of the similarity matrix corresponding to the r largest eigenvalues is spanned by the indicator vectors of the components. Ideally, if the similarity graph were equivalent to the label matrix Z, the indicator vectors of the connected components would be exactly g1, · · · , gr. Finally, we present the detailed steps of the proposed algorithm in Algorithm 1.

Remarks on the Algorithm. The performance of the proposed algorithm relies on two factors. First, how accurately the sub-matrix Ẑ is recovered by matrix completion. According to our later analysis, as long as the number of observed entries is sufficiently large (e.g., |Σ| ≥ Ω(m log² m)), one can exactly recover the sub-matrix Ẑ. Second, how well the top eigenspace of S covers the column space of the label matrix Z. As shown in Section 4, if the two are close enough, the estimate of Z has a small error provided the number of labeled examples m is sufficiently large (e.g., m ≥ Ω(µs s log s)), where µs is a coherence measure of the top eigenspace of S.

It is interesting to compare the proposed algorithm to the spectral clustering algorithm [19] and the spectral kernel learning algorithm [13], since all three algorithms exploit the top eigenvectors of a similarity matrix. 
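The second step of Algorithm 1, i.e., equations (4) and (5), amounts to a few matrix products. A minimal numpy sketch, assuming the completed sub-matrix from step 1 is given; the function name and the use of `np.linalg.eigh`/`np.linalg.pinv` are our illustrative choices:

```python
import numpy as np

def estimate_label_matrix(S, Z_hat, labeled_idx, s):
    """Step 2: estimate the full label matrix Z' from the completed
    sub-matrix Z_hat and the similarity matrix S, per Eq. (5).

    S: (n, n) symmetric similarity matrix.
    Z_hat: (m, m) completed sub-matrix of pairwise labels.
    labeled_idx: global indices of the m labeled examples within Dn.
    s: number of top eigenvectors of S to use.
    """
    # U_s: eigenvectors of S for the s largest eigenvalues
    # (eigh returns eigenvalues in ascending order).
    _, V = np.linalg.eigh(S)
    Us = V[:, -s:]
    Us_hat = Us[labeled_idx, :]          # rows of U_s for labeled examples
    # (Û_s^T Û_s)^† Û_s^T, the least-squares operator of Eq. (4).
    P = np.linalg.pinv(Us_hat.T @ Us_hat) @ Us_hat.T
    M = P @ Z_hat @ P.T                  # estimate of sum_k a_k a_k^T
    return Us @ M @ Us.T                 # Z' = U_s M U_s^T, Eq. (5)
```

In the ideal case of Theorem 2 (the column space of Z lies in the span of Us and every class is represented among the labeled examples), this reproduces Z exactly.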
The spectral clustering algorithm employs k-means to cluster the top eigenvector matrix. The spectral kernel learning algorithm optimizes a diagonal matrix Λ = diag(λ1, · · · , λs) to learn a kernel matrix K = Us Λ Us⊤ by maximizing the kernel alignment with the pre-determined labels. In contrast, we estimate the pairwise label matrix by Z′ = Us M Us⊤, where the matrix M is learned from the recovered sub-matrix Ẑ and the provided similarity matrix S. The recovered sub-matrix Ẑ serves as supervised information and the similarity matrix S serves as the input data for estimating the label matrix Z (c.f. equation 4). It is the first step, which exploits the low-rank structure of Ẑ, that allows us to gain more useful information for the estimation in the second step. In our experiments, we observe improved performance of the proposed algorithm compared with the spectral clustering and spectral kernel learning algorithms.

4 Theoretical Results

In this section, we present theoretical results regarding the reconstruction error of the proposed algorithm, which essentially answer the questions of what property the similarity matrix should satisfy and how many labeled data and pre-determined pairwise labels are required for a good or perfect recovery of the label matrix Z.

Before stating the theoretical results, we first introduce some notation. Let pi denote the percentage of all examples in Dn that belong to the i-th class. To facilitate our presentation and analysis, we also introduce a coherence measure µs of the orthogonal matrix Us = (u1, · · · , us) ∈ Rn×s, defined by

µs = (n/s) max_{1≤i≤n} ∑_{j=1}^s Uij²   (6)

The coherence measure has been exploited in many studies of matrix completion [29, 26] and matrix reconstruction [23, 10]. 
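For reference, the coherence measure (6) is cheap to evaluate exactly for a given orthonormal basis; a minimal numpy sketch:

```python
import numpy as np

def coherence(Us):
    """Coherence mu_s of an n-by-s matrix with orthonormal columns, Eq. (6):
    mu_s = (n/s) * max_i sum_j U_ij^2, i.e. the largest statistical
    leverage score rescaled by n/s."""
    n, s = Us.shape
    leverage = np.sum(Us ** 2, axis=1)   # squared row norms
    return (n / s) * np.max(leverage)
```

For example, µs attains its maximum n/s when Us consists of canonical basis vectors, and equals 1 when all rows of Us have equal norm.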
It is notable that [4] defined a coherence measure of a complete orthogonal matrix U = (u1, · · · , un) ∈ Rn×n by µ = √n max_{1≤i≤n,1≤j≤n} |Uij|. It is not difficult to see that µs ≤ µ² ≤ n. The coherence measure in (6) is also known as the largest statistical leverage score. Drineas et al. [8] proposed a fast approximation algorithm to compute the coherence of an arbitrary matrix. Intuitively, the coherence measures the degree to which the eigenvectors in Us or U are correlated with the canonical bases. The purpose of introducing the coherence measure is to quantify how large the number of sampled labeled examples m must be in order to guarantee that the sub-matrix Ûs ∈ Rm×s has full column rank. We defer the detailed statement to the supplementary material.

We begin with the recovery of the sub-matrix Ẑ. The theorem below states that if the distribution of the data over the r hidden classes is not skewed, then Ω(r²m log² m) pairwise labels between the labeled examples suffice for a perfect recovery of the sub-matrix Ẑ.

Theorem 1. Suppose the entries at (i, j) ∈ Σ are sampled uniformly at random from [m] × [m], and the examples in D̂m are sampled uniformly at random from Dn. Then with probability at least 1 − ∑_{i=1}^r exp(−mpi/8) − 2m⁻², Ẑ is the unique solution to (2) if

|Σ| ≥ [512 / min_{1≤i≤r} pi²] m log²(2m).

Next, we present a theorem stating that if the column space of Z is spanned by the orthogonal vectors u1, · · · , us and m ≥ Ω(µs s ln(m²s)), the estimated matrix Z′ is equal to the underlying true matrix Z.

Theorem 2. Suppose the entries at (i, j) ∈ Σ are sampled uniformly at random from [m] × [m], and the objects in D̂m are sampled uniformly at random from Dn. 
If the column space of Z is spanned by u1, · · · , us, m ≥ 8µs s log(m²s), and |Σ| ≥ [512 / min_{1≤i≤r} pi²] m log²(2m), then with probability at least 1 − ∑_{i=1}^r exp(−mpi/8) − 3m⁻², we have Z′ = Z, where Z′ is computed by (5).

Similar to other matrix reconstruction algorithms [4, 29, 26, 23, 10], the theorem above indicates that a low coherence measure µs plays a pivotal role in the success of the proposed algorithm. Indeed, several previous works [23, 11], as well as our experiments, have studied the coherence measure of real data sets and demonstrated that it is not rare to have an incoherent similarity matrix, i.e., one with a small coherence measure. We now consider a more realistic scenario where some of the column vectors of Z do not lie in the subspace spanned by the top s eigenvectors of the similarity matrix. To quantify the gap between the column space of Z and the top eigenspace of the pairwise similarity matrix, we define the quantity

ε = ∑_{k=1}^r ‖gk − PUs gk‖²₂,

where PUs = Us Us⊤ is the projection matrix that projects a vector onto the space spanned by the columns of Us. The following theorem shows that if ε is small, so is the error of the solution Z′ given in (5).

Theorem 3. Suppose the entries at (i, j) ∈ Σ are sampled uniformly at random from [m] × [m], and the objects in D̂m are sampled uniformly at random from Dn. If the conditions on m and |Σ| in Theorem 2 are satisfied,
then, with probability at least 1 − ∑_{i=1}^r exp(−mpi) − 3m⁻², we have

‖Z′ − Z‖F ≤ ε (1 + 2n/m + 2√(2n)/√(mε)) ≤ O(nε/m + √(nε/m)).

Sketch of Proofs. Before ending this section, we present a sketch of the proofs; the details are deferred to the supplementary material. The proof of Theorem 1 relies on the matrix completion theory of Recht [20], which guarantees perfect recovery of the low-rank matrix Ẑ provided the number of observed entries is sufficiently large. The key to the proof is to show that the coherence measure of the sub-matrix Ẑ is bounded, using concentration inequalities. To prove Theorem 2, we resort to convex optimization theory and Lemma 1 in [10], which shows that the sub-sampled matrix Ûs ∈ Rm×s has full column rank if m ≥ Ω(µs s log(s)). Since Z = Us (∑_{k=1}^r ak ak⊤) Us⊤ and Z′ = Us (∑_{k=1}^r âk âk⊤) Us⊤, proving Z′ = Z is equivalent to showing âk = ak, k ∈ [r], i.e., that ak, k ∈ [r], are the unique minimizers of the problems in (4). It is sufficient to show that the optimization
It is suf\ufb01cient to show the optimization\n\nk = gk\u2212g\n\n(cid:107)\nk, where g\n\nk=1 aka(cid:62)\n\nk\n\nk + g\n\n(cid:107)\nk g\nk\n\n(cid:62)\n\n(cid:107)\nk.\ng\n\n(cid:17)\n\nU(cid:62)\n\nk\n\n5 Experimental Results\n\nIn this section, we present an empirical evaluation of our proposed simple algorithm for Transductive\nPairwise Classi\ufb01cation by Matrix Completion (TPCMC for short) on one synthetic data set and three\nreal-world data sets.\n\n5.1 Synthetic Data\n\nWe \ufb01rst generate a synthetic data set of 1000 examples evenly distributed over 4 classes, each of\nwhich contains 250 data points. Then we generate a pairwise similarity matrix S by \ufb01rst constructing\na pairwise label matrix Z \u2208 {0, 1}1000\u00d71000, and then adding a noise term \u03b4ij to Zij where \u03b4ij \u2208\n(0, 0.5) follows a uniform distribution. We use S as the input pairwise similarity matrix of our\nproposed algorithm. The coherence measure of the top eigen-vectors of S is a small value as shown\nin Figure 1. According to the random perturbation matrix theory [22], the top eigen-space of S is\nclose to the column space of the label matrix Z. We choose s = 20, which yields roughly \u00b5s = 2.\nof the 160 \u00d7 160 sub-matrix are fed into the algorithm. In other words, roughly 0.5% entries out\nof the whole pairwise label matrix Z \u2208 {0, 1}1000\u00d71000 are observed. We show the ground-truth\npairwise label matrix, the similarity matrix and the estimated label matrix in Figure 1, which clearly\ndemonstrates that the recovered label matrix is more accurate than the perturbed similarities.\n\nWe randomly select m = 4s\u00b5s = 160 data to form (cid:98)Dm, out of with |\u03a3| = 2mr2 = 5120 entries\n\n6\n\n\fFigure 1: from left to right: \u00b5s vs s, the true pairwise label matrix, the perturbed similarity matrix,\nthe recovered pairwise label matrix. 
The error of the estimated matrix is reduced by a factor of two: ‖Z − Z′‖F /‖Z − S‖F = 0.5.

5.2 Real Data

We further evaluate the performance of our algorithm on three real-world data sets: splice [24]³, gisette [12]⁴ and citeseer [21]⁵. The splice data is a DNA sequence data set for recognizing splice junctions. The gisette data is a perturbed image data set for handwritten digit recognition, originally constructed for feature selection. The citeseer data is a paper citation data set, which has been used for link prediction. We emphasize that we do not intend these data sets to be comprehensive, but rather illustrative case studies representative of a much wider range of applications. The statistics of the three data sets are summarized in Table 1. Given a data set of size n, we randomly choose m = 20%n, 30%n, . . . , 90%n examples, where 10% of the entries of the m × m label matrix are observed. We design the experiments in this way because, according to Theorem 1, the number of observed entries |Σ| increases as m increases. For each given m, we repeat the experiments ten times with random selections and report the performance scores averaged over the ten trials. We construct a similarity matrix S with each entry equal to the cosine similarity of two examples based on their feature vectors. We set s = 50 in our algorithm and in the other algorithms as well. The corresponding coherence measures µs of the three data sets are shown in the last column of Table 1.

We compare with two state-of-the-art algorithms that utilize the pre-determined pairwise labels and the provided similarity matrix in different ways (c.f. the discussion at the end of Section 3): Spectral Clustering (SC) [19] and Spectral Kernel Learning (SKL) [13], for the task of clustering. 
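The two similarity constructions used in the experiments (the noisy synthetic S of Section 5.1 and the cosine-similarity S for the real data sets) can be sketched as follows; numpy is assumed, the function names are ours, and the symmetrization of the synthetic noise is our assumption since the paper does not state how symmetry is maintained:

```python
import numpy as np

def make_synthetic_similarity(n=1000, r=4, noise=0.5, seed=0):
    """Sketch of the Section 5.1 setup: a block pairwise label matrix Z for
    r equal-sized classes, and S = Z + delta with delta ~ Uniform(0, noise).
    Symmetrizing delta (our assumption) keeps S symmetric."""
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(r), n // r)
    Z = (labels[:, None] == labels[None, :]).astype(float)
    delta = rng.uniform(0.0, noise, size=(n, n))
    return Z, Z + (delta + delta.T) / 2, labels

def cosine_similarity_matrix(X, eps=1e-12):
    """Cosine similarities between the rows of a feature matrix X, as used
    to build S for the real data sets (the eps guard against zero rows is
    our addition)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)
    return Xn @ Xn.T
```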
To attain a clustering from the proposed algorithm, we apply a similarity-based clustering algorithm to group the data into clusters based on the estimated label matrix; here we use spectral clustering [19] for simplicity and fair comparison. For SC, to utilize the pre-determined pairwise labels we substitute the entries corresponding to the observed pairs with 1 if the two examples are known to be in the same class and 0 if they are known to belong to different classes. For SKL, we also apply the spectral clustering algorithm to cluster the data based on the learned kernel matrix. The comparison to SC and SKL verifies the effectiveness of the proposed algorithm in exploiting the pre-determined labels and the provided similarities. After obtaining the clusters, we calculate three well-known metrics, namely normalized mutual information [9], pairwise F-measure [27] and accuracy [25], which measure the degree to which the obtained clusters match the ground truth.

Figures 2–4 show the performance of the different algorithms on the three data sets. First, the performance of all three algorithms generally improves as the ratio m/n increases, which is consistent with our theoretical result in Theorem 3. Second, our proposed TPCMC performs best in all cases as measured by all three evaluation metrics, verifying its reliable performance. SKL generally performs better than SC, indicating that simply using the observed pairwise labels to directly modify the similarity matrix cannot fully utilize the label information. TPCMC is better than SKL, meaning that the proposed algorithm is more effective in mining knowledge from the pre-determined labels and the similarity matrix.

6 Conclusions

In this paper, we have presented a simple algorithm for transductive pairwise classification from pairwise similarities based on matrix completion and matrix products. 
3 http://www.cs.toronto.edu/~delve/data/datasets.html
4 http://www.nipsfsc.ecs.soton.ac.uk/datasets/
5 http://www.cs.umd.edu/projects/linqs/projects/lbc/

Table 1: Statistics of the data sets

name      # examples   # classes   coherence (µ50)
splice    3175         2           1.97
gisette   7000         2           4.17
citeseer  3312         6           2.22

Figure 2: Performance on the splice data set.
Figure 3: Performance on the gisette data set.
Figure 4: Performance on the citeseer data set.

The algorithm consists of two simple steps: recovering the sub-matrix of pairwise labels given partially pre-determined pairwise labels, and estimating the full label matrix from the recovered sub-matrix and the provided pairwise similarities. The theoretical analysis establishes the conditions on the similarity matrix, the number of labeled examples and the number of pre-determined pairwise labels under which the pairwise label matrix estimated by the proposed algorithm recovers the true one exactly, or with a small error, with an overwhelming probability.
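As a toy illustration of these two steps, here is a heavily hedged sketch: the completion step below is a naive truncated-SVD imputation and the second step is a Nyström-style extension through the top-s eigenvectors of the similarity matrix. Neither is claimed to match the paper's exact estimators, and all names are our own.

```python
import numpy as np

def complete_submatrix(Z_obs, mask, rank, iters=100):
    """Naive low-rank imputation of the partially observed label
    sub-matrix: alternate rank-`rank` SVD truncation with re-imposing
    the observed entries (a stand-in for a real matrix-completion solver)."""
    Z = np.where(mask, Z_obs, Z_obs[mask].mean())
    for _ in range(iters):
        U, sv, Vt = np.linalg.svd(Z, full_matrices=False)
        Z = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
        Z[mask] = Z_obs[mask]           # keep observed entries fixed
    return Z

def extend_labels(Z_hat, S, labeled, s):
    """Nystrom-style extension: express the completed sub-matrix in the
    top-s eigen-space of S, then propagate to all n examples."""
    _, eigvecs = np.linalg.eigh(S)
    U = eigvecs[:, -s:]                 # top-s eigenvectors of S
    P = np.linalg.pinv(U[labeled])      # pseudo-inverse of the labeled rows
    return U @ (P @ Z_hat @ P.T) @ U.T  # estimated n x n label matrix

# Toy check: two classes (sizes 12 and 8); similarities = true labels.
y = np.repeat([0, 1], [12, 8])
Z_true = (y[:, None] == y[None, :]).astype(float)   # rank-2 label matrix
labeled = np.arange(14)                 # 12 from class 0, 2 from class 1
mask = np.ones((14, 14), dtype=bool)
mask[0, 3] = mask[3, 0] = mask[12, 13] = mask[13, 12] = False  # hidden pairs
Z_hat = complete_submatrix(Z_true[np.ix_(labeled, labeled)], mask, rank=2)
Z_est = extend_labels(Z_hat, Z_true, labeled, s=2)
err = np.linalg.norm(Z_est - Z_true) / np.linalg.norm(Z_true)
```

In this noiseless toy setting the top eigen-space of S spans the column space of the label matrix exactly, so the relative error is tiny; with real similarities one would expect a nuclear-norm completion solver and a nonzero residual governed by the conditions in the analysis.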
Preliminary empirical evaluations have verified the potential of the proposed algorithm.

Acknowledgement

The work of Rong Jin was supported in part by the National Science Foundation (IIS-1251031) and the Office of Naval Research (N000141210431).

References

[1] N. Ailon. An active learning algorithm for ranking from pairwise preferences with an almost optimal query complexity. JMLR, 13:137–164, 2012.
[2] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of SIGKDD, pages 59–68, 2004.
[3] E. J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6):717–772, 2009.
[4] E. J. Candès and T. Tao. The power of convex relaxation: Near-optimal matrix completion. IEEE Trans. Inf. Theor., 56:2053–2080, 2010.
[5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of CVPR, pages 539–546, 2005.
[6] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon. Information-theoretic metric learning.
In Proceedings of ICML, pages 209–216, 2007.
[7] I. S. Dhillon, Y. Guan, and B. Kulis. Kernel k-means: spectral clustering and normalized cuts. In Proceedings of SIGKDD, pages 551–556, 2004.
[8] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. In Proceedings of ICML, 2012.
[9] A. Fred and A. Jain. Robust data clustering. In Proceedings of IEEE CVPR, volume 2, 2003.
[10] A. Gittens. The spectral norm errors of the naive Nyström extension. CoRR, abs/1110.5305, 2011.
[11] A. Gittens and M. W. Mahoney. Revisiting the Nyström method for improved large-scale machine learning. CoRR, abs/1303.1849, 2013.
[12] I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror. Result analysis of the NIPS 2003 feature selection challenge. In NIPS, 2004.
[13] S. C. H. Hoi, M. R. Lyu, and E. Y. Chang. Learning the unified kernel machines for classification. In Proceedings of SIGKDD, pages 187–196, 2006.
[14] E. Hüllermeier and J. Fürnkranz. Learning from label preferences. In Proceedings of ALT, page 38, 2011.
[15] R. Jin, T. Yang, M. Mahdavi, Y.-F. Li, and Z.-H. Zhou. Improved bounds for the Nyström method with application to kernel classification. IEEE Transactions on Information Theory, 59(10):6939–6949, 2013.
[16] A. Kelil, S. Wang, R. Brzezinski, and A. Fleury. CLUSS: Clustering of protein sequences based on a new similarity measure. BMC Bioinformatics, 8, 2007.
[17] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. J. Am. Soc. Inf. Sci. Technol., 58:1019–1031, 2007.
[18] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395–416, 2007.
[19] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In NIPS, pages 849–856, 2001.
[20] B. Recht. A simpler approach to matrix completion.
JMLR, 12:3413–3430, 2011.
[21] P. Sen, G. M. Namata, M. Bilgic, L. Getoor, B. Gallagher, and T. Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93–106, 2008.
[22] G. W. Stewart and J.-G. Sun. Matrix Perturbation Theory. Academic Press, 1990.
[23] A. Talwalkar and A. Rostamizadeh. Matrix coherence and the Nyström method. In Proceedings of UAI, pages 572–579, 2010.
[24] G. G. Towell and J. W. Shavlik. Interpretation of artificial neural networks: Mapping knowledge-based neural networks into rules. In NIPS, pages 977–984, 1991.
[25] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In NIPS, volume 15, pages 505–512, 2002.
[26] M. Xu, R. Jin, and Z.-H. Zhou. Speedup matrix completion with side information: Application to multi-label learning. In NIPS, pages 2301–2309, 2013.
[27] T. Yang, R. Jin, Y. Chi, and S. Zhu. Combining link and content for community detection: a discriminative approach. In Proceedings of SIGKDD, pages 927–936, 2009.
[28] T. Yang, Y. Li, M. Mahdavi, R. Jin, and Z. Zhou. Nyström method vs random Fourier features: A theoretical and empirical comparison. In NIPS, pages 485–493, 2012.
[29] J. Yi, L. Zhang, R. Jin, Q. Qian, and A. K. Jain. Semi-supervised clustering by input pattern assisted pairwise similarity matrix completion. In Proceedings of ICML, pages 1400–1408, 2013.