{"title": "Selective Labeling via Error Bound Minimization", "book": "Advances in Neural Information Processing Systems", "page_first": 323, "page_last": 331, "abstract": "In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study the following problem: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the generalization error of Laplacian Regularized Least Squares (LapRLS). In particular, we derive a deterministic generalization error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.", "full_text": "Selective Labeling via Error Bound Minimization

Quanquan Gu†, Tong Zhang‡, Chris Ding§, Jiawei Han†
†Department of Computer Science, University of Illinois at Urbana-Champaign
‡Department of Statistics, Rutgers University
§Department of Computer Science & Engineering, University of Texas at Arlington
qgu3@illinois.edu, tzhang@stat.rutgers.edu, chqding@uta.edu, hanj@cs.uiuc.edu

Abstract

In many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. This motivates us to study the following problem: given a label budget, how to select data points to label such that the learning performance is optimized. We propose a selective labeling method by analyzing the out-of-sample error of Laplacian Regularized Least Squares (LapRLS). 
In particular, we derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, and propose to select a subset of data points to label by minimizing this upper bound. Since the minimization is a combinatorial problem, we relax it into the continuous domain and solve it by projected gradient descent. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

1 Introduction

The performance of (semi-)supervised learning methods typically depends on the amount of labeled data; roughly speaking, the more labeled data, the better the learning performance. However, in many practical machine learning problems, the acquisition of labeled data is often expensive and/or time consuming. To overcome this problem, active learning [9, 10] was proposed, which iteratively queries the oracle (labeler) to obtain the labels of new data points. Representative methods include support vector machine (SVM) active learning [19, 18] and agnostic active learning [2, 5, 14]. Due to the close interaction between the learner and the oracle, active learning can be advantageous for achieving better learning performance. Nevertheless, in many real-world applications, such an interaction may not be feasible. For example, when one turns to Amazon Mechanical Turk1 to label data, the interaction between the learner and the labeling workers is very limited. Therefore, standard active learning is not very practical in this case.

Another potential solution to the label deficiency problem is semi-supervised learning [7, 22, 21, 4], which aims at combining a small number of labeled data and a large amount of unlabeled data to improve the learning performance. In a typical setting of semi-supervised learning, a small set of labeled data is assumed to be given at hand or randomly generated in practice. 
However, randomly selecting (uniformly sampling) data points to label is unwise, because not all data points are equally informative. It is desirable to obtain a labeled subset which is most beneficial for semi-supervised learning.

In this paper, based on the above motivation, we investigate the following problem: given a fixed label budget, how to select a subset of data points to label such that the learning performance is optimized. We refer to this problem as selective labeling, in contrast to conventional random labeling. To achieve the goal of selective labeling, it is crucial to consider the out-of-sample error of a specific learner. We choose Laplacian Regularized Least Squares (LapRLS) [4] as the learner because it is a state-of-the-art semi-supervised learning method, and takes many linear regression methods as special cases (e.g., ridge regression [15]). We derive a deterministic out-of-sample error bound for LapRLS trained on subsampled data, which suggests selecting the data points to label by minimizing this upper bound. The resulting selective labeling method is a combinatorial optimization problem. In order to optimize it effectively and efficiently, we relax it into a continuous optimization problem, and solve it by a projected gradient descent algorithm followed by discretization. Experiments on benchmark datasets show that the proposed method outperforms the state-of-the-art methods.

1https://www.mturk.com/

The remainder of this paper is organized as follows. In Section 2, we briefly review manifold regularization and LapRLS. In Section 3, we derive an out-of-sample error bound for LapRLS on subsampled data, and present a selective labeling criterion that minimizes this bound, followed by its optimization algorithm. We discuss the connections between the proposed method and several existing experimental design approaches in Section 4. The experiments are presented in Section 5. 
We conclude this paper in Section 6.

2 Review of Laplacian Regularized Least Squares

Given a data set {(x_1, y_1), . . . , (x_n, y_n)} where x_i ∈ R^d and y_i ∈ {±1}, Laplacian Regularized Least Squares (LapRLS) [4] aims to learn a linear function f(x) = w^T x. In order to estimate and preserve the geometrical and topological properties of the data, LapRLS [4] assumes that if two data points x_i and x_j are close in the intrinsic geometry of the data distribution, the labels of these two points are also close to each other. Let f(x) be a function that maps an original data point x in a compact submanifold M to R. We use ||f||_M^2 = ∫_{x ∈ M} ||∇_M f||^2 dx to measure the smoothness of f along the geodesics in the intrinsic manifold of the data, where ∇_M f is the gradient of f along the manifold M. Recent study on spectral graph theory [8] has demonstrated that ||f||_M^2 can be discretely approximated through a nearest neighbor graph on a set of data points. Given an affinity matrix W ∈ R^{n×n} of the graph, ||f||_M^2 is approximated as

||f||_M^2 ≈ (1/2) Σ_{ij} (f_i − f_j)^2 W_ij = f^T L f,   (1)

where f_i is shorthand for f(x_i), f = [f_1, . . . , f_n]^T, D is a diagonal matrix, called the degree matrix, with D_ii = Σ_{j=1}^n W_ij, and L = D − W is the combinatorial graph Laplacian [8]. Eq. (1) is called manifold regularization. Intuitively, the regularization incurs a heavy penalty if neighboring points x_i and x_j are mapped far apart.

Based on manifold regularization, LapRLS solves the following optimization problem,

arg min_w ||X^T w − y||_2^2 + (λ_A/2) ||w||_2^2 + (λ_I/2) w^T X L X^T w,   (2)

where λ_A, λ_I > 0 are positive regularization parameters, X = [x_1, . . . , x_n] is the design matrix, y = [y_1, . . . , y_n]^T is the response vector, ||w||_2^2 is the ℓ2 regularization of the linear function, and w^T X L X^T w is the manifold regularization of f(x) = w^T x. When λ_I = 0, LapRLS reduces to ridge regression [15]. A bias term b can be incorporated by expanding the weight vector and input feature vector as w ← [w; b] and x ← [x; 1]. Note that Eq. (2) is a supervised version of LapRLS, because only labeled data are used in manifold regularization. Although our derivations are based on this version in the rest of the paper, the results can be extended to the semi-supervised version of LapRLS straightforwardly.

3 The Proposed Method

3.1 Problem Formulation

The generic problem of selective labeling is as follows. Given a set of data points X = {x_1, . . . , x_n}, namely the pool of candidate data points, our goal is to find a subsample L ⊂ {1, . . . , n} which contains the most informative |L| = l points.

To derive a selective labeling approach for LapRLS, we first derive an out-of-sample error bound of LapRLS.

3.2 Out-of-Sample Error Bound of LapRLS

We define the function class of LapRLS as follows.

Definition 1. The function class of LapRLS is F_B = {x → w^T x | λ_A ||w||_2^2 + λ_I w^T X L X^T w ≤ B}, where X = [x_1, . . . , x_n] and B > 0 is a constant.

Consider the following linear regression model,

y = X^T w* + ε,   (3)

where X = [x_1, . . . , x_n] is the design matrix, y = [y_1, . . . , y_n]^T is the response vector, w* is the true weight vector, which is unknown, and ε = [ε_1, . . . , ε_n]^T is the noise vector, with each ε_i an unknown noise term with zero mean. We assume that the noises of different observations are independent, with equal variance σ^2. Moreover, we assume that the true weight vector w* satisfies

λ_A ||w*||_2^2 + λ_I (w*)^T X L X^T w* ≤ B,   (4)

which implies that the true hypothesis belongs to the function class of LapRLS in Definition 1. 
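Before turning to the subsampled setting, it may help to see the estimator in code. The following is a minimal sketch on hypothetical toy data (not the authors' implementation) of the closed-form minimizer of Eq. (2), w = (X X^T + λ_A I + λ_I X L X^T)^{-1} X y, with constant factors absorbed into λ_A and λ_I; the subsampled solution in Eq. (5) has the same form with X, y and L replaced by their subsampled counterparts.

```python
import numpy as np

def laprls_fit(X, y, L, lam_A=1e-2, lam_I=1e-2):
    """Closed-form LapRLS estimate: w = (X X^T + lam_A I + lam_I X L X^T)^{-1} X y.
    X is d x n (columns are data points, the paper's convention), y has length n,
    and L is the n x n graph Laplacian. Constant factors are absorbed into lam_A, lam_I."""
    d = X.shape[0]
    M = X @ X.T + lam_A * np.eye(d) + lam_I * X @ L @ X.T
    return np.linalg.solve(M, X @ y)

# Toy check on hypothetical data: a chain graph over n points.
rng = np.random.default_rng(0)
d, n = 3, 20
X = rng.standard_normal((d, n))
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true + 0.01 * rng.standard_normal(n)      # y = X^T w* + eps, as in Eq. (3)
W = np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)  # chain affinity matrix
L = np.diag(W.sum(axis=1)) - W                        # combinatorial Laplacian L = D - W
w_hat = laprls_fit(X, y, L, lam_A=1e-3, lam_I=1e-3)
print(np.round(w_hat, 2))
```

With lam_I = 0 this reduces to ridge regression, matching the remark after Eq. (2).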
In this case, the approximation error vanishes and the excess error equals the estimation error. Note that this assumption can be relaxed with more effort, under which we can derive a similar error bound as below. For simplicity, the following derivations are built upon the assumption in Eq. (4).

In selective labeling, we are interested in estimating w* using LapRLS in Eq. (2) from a subsample L ⊂ {1, . . . , n}. Denote the subsample of X by X_L, the subsample of y by y_L, and the subsample of ε by ε_L. The solution of LapRLS is given by

ŵ_L = (X_L X_L^T + λ_A I + λ_I X_L L_L X_L^T)^{-1} X_L y_L,   (5)

where I is an identity matrix and L_L is the graph Laplacian computed based on X_L, which is a principal submatrix of L.

In the following, we present a deterministic out-of-sample error bound for LapRLS trained on the subsampled data, which is among the main contributions of this paper.

Theorem 2. For any fixed V = [v_1, . . . , v_m] and X = [x_1, . . . , x_n], and a subsample L of X, the expected error of LapRLS trained on L in predicting the true response V^T w* is upper bounded as

E||V^T ŵ_L − V^T w*||_2^2 ≤ (B + σ^2) tr( V^T (X_L X_L^T + λ_A I + λ_I X_L L_L X_L^T)^{-1} V ).   (6)

Proof. Let M_L = λ_A I + λ_I X_L L_L X_L^T. Given L, the expected error (where the expectation is w.r.t. ε_L) is given by

E||V^T ŵ_L − V^T w*||_2^2
= E||V^T (X_L X_L^T + M_L)^{-1} X_L y_L − V^T w*||_2^2
= ||V^T (X_L X_L^T + M_L)^{-1} X_L X_L^T w* − V^T w*||_2^2 + E||V^T (X_L X_L^T + M_L)^{-1} X_L ε_L||_2^2,   (7)

where the second equality follows from y_L = X_L^T w* + ε_L; denote the two terms on the right hand side by A_1 and A_2, respectively. Now we bound these two terms.

The first term is bounded by

A_1 = ||V^T (X_L X_L^T + M_L)^{-1} M_L w*||_2^2
  ≤ ||V^T (X_L X_L^T + M_L)^{-1} M_L^{1/2}||_F^2 ||M_L^{1/2} w*||_2^2
  ≤ B tr( V^T (X_L X_L^T + M_L)^{-1} M_L (X_L X_L^T + M_L)^{-1} V )
  ≤ B tr( V^T (X_L X_L^T + M_L)^{-1} V ),   (8)

where the first inequality is due to the Cauchy-Schwarz inequality, the second uses ||M_L^{1/2} w*||_2^2 ≤ B from Eq. (4), and the last follows from dropping the negative term. Similarly, the second term can be bounded by

A_2 ≤ σ^2 tr( V^T (X_L X_L^T + M_L)^{-1} X_L X_L^T (X_L X_L^T + M_L)^{-1} V )
  ≤ σ^2 tr( V^T (X_L X_L^T + M_L)^{-1} V ),   (9)

where the first inequality uses E[ε_L ε_L^T] ≤ σ^2 I, and becomes an equality if the ε_i are independent and identically distributed (i.i.d.). Combining Eqs. (8) and (9) completes the proof.

Note that in the above theorem, the sample V could be either the same as or different from the sample X. Sometimes, we are also interested in the expected estimation error of w*, as follows.

Theorem 3. For any fixed X, and a subsample L of X, the expected error of LapRLS trained on L in estimating the true weight vector w* is upper bounded as

E||ŵ_L − w*||_2^2 ≤ (B + σ^2) tr( (X_L X_L^T + λ_A I + λ_I X_L L_L X_L^T)^{-1} ).   (10)

The proof of this theorem follows derivations similar to those of Theorem 2.

3.3 The Criterion of Selective Labeling

From Theorem 2, we can see that given a subsample L of X, the expected prediction error of LapRLS on V is upper bounded by Eq. (6). In addition, the right hand side of Eq. (6) does not depend on the labels, i.e., y. More importantly, the error bound derived in this paper is deterministic, unlike probabilistic error bounds derived from Rademacher complexity [3] or algorithmic stability [6]. Since probabilistic error bounds only hold for an i.i.d. sample rather than a particular sample, they cannot provide a criterion for choosing a subsample to label, due to the correlation between the pool of candidate points and the i.i.d. sample. On the contrary, the deterministic error bound does not suffer from this kind of problem. Therefore, it provides a natural criterion for selective labeling.

In detail, given a pool of candidate data points, i.e., X, we propose to find a subsample L of {1, . . . , n} by minimizing the following objective function

arg min_{L ⊂ {1,...,n}} tr( X^T (X_L X_L^T + λ_I X_L L_L X_L^T + λ_A I)^{-1} X ),   (11)

where we simply assume V = X. The above problem is a combinatorial optimization problem, and finding the global optimal solution is NP-hard. One potential way to solve it is greedy forward (or backward) selection, but this is inefficient. Here we propose an efficient algorithm that solves its continuous relaxation.

3.4 Reformulation

We introduce a selection matrix S ∈ R^{n×l}, which is defined as

S_ij = 1 if x_i is selected as the j-th point in L, and S_ij = 0 otherwise.   (12)

It is easy to check that each column of S has one and only one 1, and each row has at most one 1. The constraint set for S can be defined as

S_1 = {S | S ∈ {0, 1}^{n×l}, S^T 1 = 1, S 1 ≤ 1},   (13)

where 1 is a vector of all ones, or equivalently,

S_2 = {S | S ∈ {0, 1}^{n×l}, S^T S = I},   (14)

where I is an identity matrix.

Based on S, we have X_L = X S and L_L = S^T L S. Thus, Eq. (11) can be equivalently reformulated as

arg min_{S ∈ S_2} tr( X^T (X S S^T X^T + λ_I X S S^T L S S^T X^T + λ_A I)^{-1} X )
= arg min_{S ∈ S_2} tr( X^T (X S S^T L' S S^T X^T + λ_A I)^{-1} X ),   (15)

where L' = I + λ_I L. The above optimization problem is still a discrete optimization. Let

S_3 = {S | S ≥ 0, S^T S = I},   (16)

where we relax the binary constraint on S into a nonnegativity constraint. Note that S_3 is a matching polytope [17]. Then we solve the following continuous optimization,

arg min_{S ∈ S_3} tr( X^T (X S S^T L' S S^T X^T + λ_A I)^{-1} X ).   (17)

We derive a projected gradient descent algorithm to find a local optimum of Eq. (17). We first ignore the nonnegativity constraint on S. Since S^T S = I, we introduce a Lagrange multiplier Λ ∈ R^{l×l}; thus the Lagrangian function is

L(S) = tr( X^T (X S S^T L' S S^T X^T + λ_A I)^{-1} X ) + tr( Λ (S^T S − I) ).   (18)

The derivative of L(S) with respect to S is2

∂L/∂S = −2 (X^T B X S S^T L' S + L' S S^T X^T B X S) + 2 S Λ,   (19)

where B = A^{-1} (X X^T) A^{-1} and A = X S S^T L' S S^T X^T + λ_A I. Note that the computational burden of the derivative lies in A^{-1}, the inverse of a d × d matrix. To overcome this problem, we use the Woodbury matrix identity [12], by which A^{-1} can be computed as

A^{-1} = (1/λ_A) I − (1/λ_A^2) X S ( (S^T L' S)^{-1} + (1/λ_A) S^T X^T X S )^{-1} S^T X^T,   (20)

where S^T L' S is an l × l matrix whose inverse can be computed efficiently when l ≪ d.

To determine the Lagrange multiplier Λ, left-multiplying Eq. (19) by S^T and using the fact that S^T S = I, we obtain

Λ = S^T X^T B X S S^T L' S + S^T L' S S^T X^T B X S.   (21)

Substituting the Lagrange multiplier Λ back into Eq. (19), we obtain a derivative that depends only on S. Thus we can use projected gradient descent to find a local optimal solution of Eq. (17): in each iteration, it takes a step proportional to the negative of the gradient at the current point, followed by a projection back onto the nonnegative set.

3.5 Discretization

Till now, we have obtained a local optimal solution S* by projected gradient descent. However, this S* contains continuous values; in other words, S* ∈ S_3. In order to determine which l data points to select, we need to project S* into S_1. We use a simple greedy procedure to conduct the discretization: we first find the largest element in S* (if there exist multiple largest elements, we choose any one of them) and mark its row and column; then from the unmarked rows and columns we find the largest element and mark it as well; this procedure is repeated until we have found l elements.

4 Related Work

We notice that our proposed method shares a similar spirit with optimal experimental design3 in statistics [1, 20, 16], whose intent is to select the most informative data points to learn a function which has minimum variance of estimation, or minimum variance of prediction.

For example, A-Optimal Design (AOD) minimizes the expected variance of the model parameter. In particular, for ridge regression, it optimizes the following criterion,

arg min_{L ⊂ {1,...,n}} tr( (X_L X_L^T + λ_A I)^{-1} ),   (22)

where I is an identity matrix. 
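To make these trace criteria concrete, the sketch below (hypothetical toy data and helper names; the paper itself optimizes a continuous relaxation rather than this baseline) evaluates the criterion of Eq. (11) and runs the greedy forward selection mentioned in Section 3.3; setting lam_I = 0 turns the inner matrix into the ridge form underlying the criterion of Eq. (22).

```python
import numpy as np

def knn_graph_laplacian(X, k=5):
    """Unnormalized graph Laplacian L = D - W from a symmetrized k-NN graph.
    X is d x n (columns are data points), following the paper's convention."""
    n = X.shape[1]
    sq = np.sum(X**2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2 * X.T @ X   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]          # nearest neighbors, skipping self
        W[i, nbrs] = 1.0
    W = np.maximum(W, W.T)                         # symmetrize the affinity matrix
    return np.diag(W.sum(axis=1)) - W

def bound_criterion(X, L_full, subset, lam_A=1e-2, lam_I=1e-2):
    """tr(X^T (X_L X_L^T + lam_I X_L L_L X_L^T + lam_A I)^{-1} X), as in Eq. (11)."""
    idx = np.array(subset)
    XL = X[:, idx]
    LL = L_full[np.ix_(idx, idx)]                  # principal submatrix of L
    d = X.shape[0]
    M = XL @ XL.T + lam_I * XL @ LL @ XL.T + lam_A * np.eye(d)
    return float(np.trace(X.T @ np.linalg.solve(M, X)))

def greedy_select(X, L_full, l, **kw):
    """Greedy forward selection under the bound criterion, the simple but
    inefficient baseline mentioned in Section 3.3 (not the paper's relaxation)."""
    chosen, pool = [], list(range(X.shape[1]))
    for _ in range(l):
        best = min(pool, key=lambda j: bound_criterion(X, L_full, chosen + [j], **kw))
        chosen.append(best)
        pool.remove(best)
    return chosen

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))                   # toy pool: d = 5, n = 40
L_full = knn_graph_laplacian(X)
picked = greedy_select(X, L_full, l=4)
print(picked, bound_criterion(X, L_full, picked))
```

Because each added point contributes a positive semidefinite term to the inner matrix, the criterion decreases monotonically as the subset grows.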
We can recover this criterion by setting λ_I = 0 in Theorem 3. However, the pitfall of AOD is that it does not characterize the quality of predictions on the data, which is essential for classification or regression.

To overcome this shortcoming of A-optimal design, Yu et al. [20] proposed the Transductive Experimental Design (TED) approach. TED selects the samples which minimize the expected predictive variance of ridge regression on the data,

arg min_{L ⊂ {1,...,n}} tr( X^T (X_L X_L^T + λ_A I)^{-1} X ).   (23)

2The calculation of the derivative is non-trivial; please refer to the supplementary material for details.
3Some literature also calls this active learning, while our understanding is that there is no adaptive interaction between the learner and the oracle within optimal experimental design. Therefore, it is better to call it nonadaptive active learning.

Although TED is motivated by minimizing the variance of the prediction, it is very interesting that the above criterion coincides with minimizing the out-of-sample error bound in Theorem 2 with λ_I = 0. The reason is that for ridge regression, the upper bounds of the bias and variance terms share the common factor tr( X^T (X_L X_L^T + λ_A I)^{-1} X ). This is an important observation, because it explains why TED performs very well even though its criterion minimizes the variance of the prediction. Furthermore, TED can be seen as a special case of our proposed method.

He et al. [16] proposed Laplacian Optimal Design (LOD), which selects data points that minimize the expected predictive variance of Laplacian regularized least squares [4] on the data,

arg min_{L ⊂ {1,...,n}} tr( X^T (λ_I X L X^T + X_L X_L^T + λ_A I)^{-1} X ),   (24)

where the graph Laplacian L is computed on all the data points in the pool, i.e., X. 
LOD selects the points through X_L X_L^T while leaving the graph Laplacian term X L X^T fixed. In contrast, our method selects the points through X_L X_L^T as well as through the graph Laplacian term X_L L_L X_L^T. This difference is essential, because our criterion has a strong theoretical foundation, i.e., it minimizes the out-of-sample error bound of LapRLS; it also explains the insignificant improvement of LOD over TED. Admittedly, the term X_L L_L X_L^T in our method raises a challenge for optimization, yet it is well handled by the projected gradient descent algorithm derived in the previous section.

We also notice that a similar problem was studied for graphs [13]. However, that method cannot be applied to our setting, because its input is restricted to the adjacency matrix of a graph.

5 Experiments

In this section, we evaluate the proposed method on both synthetic and real-world datasets, and compare it with the state-of-the-art methods. All the experiments are conducted in Matlab.

5.1 Compared Methods

To demonstrate the effectiveness of our proposed method, we compare it with the following baseline approaches. Random Sampling (Random) uniformly selects data points from the pool as training data; it is the simplest baseline for label selection. A-Optimal Design (AOD) is a classic experimental design method from the statistics community; it has a parameter λ_A to be tuned. Transductive Experimental Design (TED) [20] is the state-of-the-art (non-adaptive) active learning method; it has a parameter λ_A to be tuned. Laplacian Optimal Design (LOD) [16] is an extension of TED which incorporates the manifold structure of the data. Selective Labeling via Error Bound Minimization (Bound) is the proposed method. There are two tunable parameters, λ_A and λ_I, in both LOD and Bound.

Both LOD and Bound use the graph Laplacian. To compute it, we first normalize each data point into a vector with unit ℓ2-norm. 
Then we construct a 5-NN graph and use the cosine distance to measure the similarity between data points throughout our experiments.

Note that the problem setting of our study is to select a batch of data points to label without training a classifier. Therefore, we do not compare our method with typical active learning methods such as SVM active learning [19, 18] and agnostic active learning [2].

After selecting the data points by the above methods, we train a LapRLS [4] as the learner to do classification. There are two parameters in LapRLS, i.e., λ_A and λ_I.

5.2 Synthetic Dataset

To get an intuitive picture of how the above methods (except random sampling, which is trivial) work differently, we show their experimental results on a synthetic dataset in Figure 1. This dataset contains two circles, each of which constitutes a class; it has a strong manifold structure. We let the compared methods select 8 data points. As can be seen, the data points selected by AOD are concentrated on the inner circle (belonging to one class), and are thus unable to train a classifier. The data points selected by TED, LOD and Bound are distributed over both the inner and outer circles (belonging to different classes), which makes them well suited for training a learner. Furthermore, the 8 data points selected by Bound are uniformly distributed over the two circles, four from the inner circle and four from the outer circle, which better represents the original data.

Figure 1: Selected points (the red marks) on the two circles dataset by (a) AOD; (b) TED; (c) LOD; and (d) Bound.

5.3 Real Datasets & Parameter Settings

In the following, we use three real-world benchmark datasets to evaluate the compared methods. wdbc is the Wisconsin Diagnostic Breast Cancer data set, which is from the UCI machine learning repository4. 
It aims at predicting whether a breast tumor is benign or malignant based on digitized images. There are 357 positive samples and 212 negative samples, and each sample has 32 attributes.

The ORL face database5 contains 10 images for each of 40 human subjects, taken at different times and varying in lighting, facial expression and facial details. The original images (with 256 gray levels) have size 92 × 112 and are resized to 32 × 32 for efficiency.

Isolet was first used in [11]. It contains 150 people who spoke each letter of the alphabet twice. The speakers are grouped into sets of 30 speakers each, and we use the first group, referred to as Isolet1. Each sample is represented by a 617-dimensional feature vector.

For each data set, we randomly select 20% of the data as a held-out set for model selection, and use the remaining 80% as the work set. In order to randomize the experiments, in each run we restrict the training data (the pool of candidate data points) to a random sample of 50% of the work set (40% of the total data). The remaining half of the work set (40% of the total data) is used as the test set. Once the labeled data are selected, we train a semi-supervised version of LapRLS, which uses both labeled and unlabeled data (all the training data) for manifold regularization, and report the classification result on the test set. This random split is repeated 10 times, so that we can compute the mean and standard deviation of the classification accuracy.

The parameters of the compared methods (see Section 5.1) are tuned by 2-fold cross validation on the held-out set. For the parameters of LapRLS, we use the same parameters as LOD (or Bound). For the wdbc dataset, the chosen parameters are λ_A = 0.001, λ_I = 0.01; for ORL, λ_A = 0.0001, λ_I = 0.001; for Isolet1, λ_A = 0.01, λ_I = 0.001.

For wdbc, we let the compared methods incrementally choose {2, 4, . . . , 20} points to label; for ORL, we incrementally choose {80, 90, . . . , 150} points; and for Isolet1, we choose {30, 40, . . . , 120} points to query.

5.4 Results on Real Datasets

The experimental results are shown in Figure 2. In all subfigures, the x-axis represents the number of labeled points, while the y-axis is the classification accuracy on the test data averaged over 10 runs. To show some concrete results, we also list the accuracy and running time (in seconds) of all the compared methods on the three datasets with 2, 80 and 30 labeled data points, respectively, in Table 1.

4http://archive.ics.uci.edu/ml/
5http://www.cl.cam.ac.uk/Research/DTG/attarchive:pub/data

Figure 2: Comparison of different methods on (a) wdbc; (b) ORL; and (c) Isolet1 using LapRLS.

Table 1: Classification accuracy (%) and running time (in seconds) of compared methods on the three datasets.

Method  | wdbc (2 labeled)        | ORL (80 labeled)       | Isolet1 (30 labeled)
        | Acc          time       | Acc          time      | Acc          time
Random  | 69.47±14.56  --         | 72.00±4.05   --        | 44.36±3.09   --
AOD     | 68.59±12.46  0.0        | 65.17±3.14   32.2      | 40.27±2.24   7.4
TED     | 68.33±10.68  0.0        | 80.33±2.94   39.6      | 55.98±2.54   41.1
LOD     | 63.48±8.38   0.1        | 80.25±2.64   41.7      | 57.79±1.87   41.5
Bound   | 88.68±2.82   0.3        | 83.25±3.17   23.4      | 61.99±2.14   17.4

For each dataset, we performed paired t-tests between the proposed method and the other methods at the 95% confidence level. 
If the proposed method is significantly better than all the other methods, the corresponding entry of Bound is bolded.

We observe that the proposed selective labeling method greatly outperforms the other methods in most cases. AOD is usually worse than random sampling; the reason is that minimizing the variance of the model parameter does not guarantee the quality of predictions on the data. TED performs very well: as mentioned before, the criterion of TED coincides with minimizing the out-of-sample error bound of ridge regression, which explains its good empirical performance. The performance of LOD is slightly better than that of TED, because LOD incorporates the geometric structure into TED. The superior performance of our method is attributed to its theoretical foundation, which guarantees that the learner (LapRLS) can achieve small error on the test data. In addition, the running time of our method is comparable to, or even less than, that of the other methods.

One may argue that the above comparison is unfair because we use LapRLS as the learner, which tends to favor the proposed method. Therefore, we also compared the different methods using ridge regression (RR) as the learner, and found that our proposed method is again much better than the other methods. Due to the space limit, we omit these results here and put them in the supplementary material.

6 Conclusions

The main contributions of this paper are: (1) we present a deterministic out-of-sample error bound for LapRLS; (2) we present a selective labeling method that minimizes this upper bound; and (3) we present a simple yet effective algorithm to optimize the criterion for selective labeling.

Acknowledgement

The work was supported in part by U.S. National Science Foundation grants IIS-0905215 and CNS-0931975, the U.S. Army Research Laboratory under Cooperative Agreement No. W911NF-09-2-0053 (NS-CTA), the U.S. Air Force Office of Scientific Research MURI award FA9550-08-1-0265, and MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC. We would like to thank the anonymous reviewers for their helpful comments.

References

[1] A. Atkinson, A. Donev, and R. Tobias. Optimum Experimental Designs. Oxford University Press, 2007.
[2] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In ICML, pages 65-72, 2006.
[3] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[4] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7:2399-2434, 2006.
[5] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, pages 199-207, 2010.
[6] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499-526, 2002.
[7] O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
[8] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, February 1997.
[9] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201-221, 1994.
[10] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. In NIPS, pages 705-712, 1994.
[11] M. A. Fanty and R. A. Cole. Spoken letter recognition. In NIPS, pages 220-226, 1990.
[12] G. H. Golub and C. F. Van Loan. Matrix Computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA, 1996.
[13] A. Guillory and J. Bilmes. Active semi-supervised learning using submodular functions. In UAI, pages 274-282, 2011.
[14] S. Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333-361, 2011.
[15] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, 2001.
[16] X. He, W. Min, D. Cai, and K. Zhou. Laplacian optimal design for image retrieval. In SIGIR, pages 119-126, 2007.
[17] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer Publishing Company, Incorporated, 4th edition, 2007.
[18] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In ICML, pages 839-846, 2000.
[19] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In ICML, pages 999-1006, 2000.
[20] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In ICML, pages 1081-1088, 2006.
[21] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2003.
[22] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In ICML, pages 912-919, 2003.
", "award": [], "sourceid": 180, "authors": [{"given_name": "Quanquan", "family_name": "Gu", "institution": null}, {"given_name": "Tong", "family_name": "Zhang", "institution": null}, {"given_name": "Jiawei", "family_name": "Han", "institution": null}, {"given_name": "Chris", "family_name": "Ding", "institution": null}]}