{"title": "On Semi-Supervised Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 721, "page_last": 728, "abstract": null, "full_text": "On Semi-Supervised Classification\n\nBalaji Krishnapuram, David Williams, Ya Xue, Alex Hartemink, Lawrence Carin\nDuke University, USA\n\nMário A. T. Figueiredo\nInstituto de Telecomunicações, Instituto Superior Técnico, Portugal\n\nAbstract\n\nA graph-based prior is proposed for parametric semi-supervised classification. The prior utilizes both labelled and unlabelled data; it also integrates features from multiple views of a given sample (e.g., multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff between the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is performed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors.\n\n1 Introduction\n\nIn many pattern classification problems, the acquisition of labelled training data is costly and/or time consuming, whereas unlabelled samples can be obtained easily. Semi-supervised algorithms that learn from both labelled and unlabelled samples have been the focus of much research in the last few years; a comprehensive review up to 2001 can be found in [13], while more recent references include [1, 2, 6, 7, 16-18].\n\nMost recent semi-supervised learning algorithms work by formulating the assumption that \"nearby\" points, and points in the same structure (e.g., cluster), should have similar labels [6, 7, 16]. This can be seen as a form of regularization, pushing the class boundaries toward regions of low data density. 
This regularization is often implemented by associating the vertices of a graph to all the (labelled and unlabelled) samples, and then formulating the problem on the vertices of the graph [6, 16-18].\n\nWhile current graph-based algorithms are inherently transductive -- i.e., they cannot be used directly to classify samples not present when training -- our classifier is parametric and the learned classifier can be used directly on new samples. Furthermore, our algorithm is trained discriminatively by maximizing a concave objective function; thus we avoid thorny local maxima issues that plague many earlier methods.\n\nUnlike existing methods, our algorithm automatically learns the relative importance of the labelled and unlabelled data. When multiple views of the same sample are provided (e.g., features from different sensors), we develop a new Bayesian form of co-training [4]. In addition, we also show how to exploit the unlabelled data and the redundant views of the sample (from co-training) in order to improve active label query selection [15].\n\nThe paper is organized as follows. Sec. 2 briefly reviews multinomial logistic regression. Sec. 3 describes the priors for semi-supervised learning and co-training. The EM algorithm derived to learn the classifiers is presented in Sec. 4. Active label selection is discussed in Sec. 5. Experimental results are shown in Sec. 6, followed by conclusions in Sec. 7.\n\n2 Multinomial Logistic Regression\n\nIn an m-class supervised learning problem, one is given a labelled training set D_L = {(x_1, y_1), ..., (x_L, y_L)}, where x_i ∈ R^d is a feature vector and y_i the corresponding class label. In \"1-of-m\" encoding, y_i = [y_i^(1), ..., y_i^(m)] is a binary vector, such that y_i^(c) = 1 and y_i^(j) = 0, for j ≠ c, indicates that sample i belongs to class c. 
In multinomial logistic regression [5], the posterior class probabilities are modelled as\n\nlog P(y^(c) = 1 | x) = x^T w^(c) - log sum_{k=1}^{m} exp(x^T w^(k)), for c = 1, ..., m, (1)\n\nwhere w^(c) ∈ R^d is the class-c weight vector. Notice that since sum_{c=1}^{m} P(y^(c) = 1 | x) = 1, one of the weight vectors is redundant; we arbitrarily choose to set w^(m) = 0, and consider the (d(m-1))-dimensional vector w = [(w^(1))^T, ..., (w^(m-1))^T]^T. Estimation of w may be achieved by maximizing the log-likelihood (with Y ≡ {y_1, ..., y_L}) [5]\n\nl(w) ≡ log P(Y | w) = sum_{i=1}^{L} sum_{c=1}^{m} y_i^(c) [ x_i^T w^(c) - log sum_{j=1}^{m} exp(x_i^T w^(j)) ]. (2)\n\nIn the presence of a prior p(w), we seek a maximum a posteriori (MAP) estimate, ŵ = arg max_w { l(w) + log p(w) }. Actually, if the training data is separable, l(w) is unbounded, and a prior is crucial.\n\nAlthough we focus on linear classifiers, we may see the d-dimensional feature vectors x as having resulted from some deterministic, maybe nonlinear, transformation of an input raw feature vector r; e.g., in a kernel classifier, x_i = [1, K(r_i, r_1), ..., K(r_i, r_L)] (d = L + 1).\n\n3 Graph-Based Data-Dependent Priors\n\n3.1 Graph Laplacians and Regularization for Semi-Supervised Learning\n\nConsider a scalar function f = [f_1, ..., f_|V|]^T, defined on the set V = {1, 2, ..., |V|} of vertices of an undirected graph (V, E). Each edge of the graph, joining vertices i and j, is given a weight k_ij = k_ji ≥ 0, and we collect all the weights in a |V| × |V| matrix K. A natural way to measure how much f varies across the graph is by the quantity\n\nsum_i sum_j k_ij (f_i - f_j)^2 = 2 f^T Δ f, (3)\n\nwhere Δ ≡ diag{ sum_j k_1j, ..., sum_j k_|V|j } - K is the so-called graph Laplacian [2]. Notice that k_ij ≥ 0 (for all i, j) guarantees that Δ is positive semi-definite and also that Δ has (at least) one null eigenvalue (1^T Δ 1 = 0, where 1 has all elements equal to one).\n\nIn semi-supervised learning, in addition to D_L, we are given U unlabelled samples D_U = {x_{L+1}, . . . 
, x_{L+U}}. To use (3) for semi-supervised learning, the usual choice is to assign one vertex of the graph to each sample in X = [x_1, . . . , x_{L+U}]^T (thus |V| = L + U), and to let k_ij represent some (non-negative) measure of \"similarity\" between x_i and x_j. A Gaussian random field (GRF) is defined on the vertices of V (with inverse variance λ)\n\np(f) ∝ exp{ -λ f^T Δ f / 2 },\n\nin which configurations that vary more (according to (3)) are less probable. Most graph-based approaches estimate the values of f, given the labels, using p(f) (or some modification thereof) as a prior. Accordingly, they work in a strictly transductive manner.\n\n3.2 Non-Transductive Semi-Supervised Learning\n\nWe first consider two-class problems (m = 2, thus w ∈ R^d). In contrast to previous uses of graph-based priors, we define f as the real function f (defined over the entire observation space) evaluated at the graph nodes. Specifically, f is defined as a linear function of x, and at the graph node i, f_i ≡ f(x_i) = w^T x_i. Then, f = [f_1, ..., f_|V|]^T = Xw, and p(f) induces a Gaussian prior on w, with precision matrix A ≡ X^T Δ X,\n\np(w) ∝ exp{ -(λ/2) w^T X^T Δ X w } = exp{ -(λ/2) w^T A w }. (4)\n\nNotice that since Δ is singular, A may also be singular, and the corresponding prior may therefore be improper. This is no problem for MAP estimation of w because (as is well known) the normalization factor of the prior plays no role in this estimate. 
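As a concrete sketch of the construction in (3)-(4) (our own variable names and a numpy implementation, not the authors' code), the Laplacian and the induced prior precision can be computed as:

```python
import numpy as np

def graph_laplacian(K):
    # Delta = diag(K 1) - K for a symmetric, non-negative weight matrix K
    return np.diag(K.sum(axis=1)) - K

def prior_precision(X, K, lam=1.0):
    # A = lam * X^T Delta X: precision of the Gaussian prior induced on w
    # by the Gaussian random field over the graph nodes (f = Xw)
    return lam * X.T @ graph_laplacian(K) @ X

# Toy check: if f = Xw is constant over the graph, the prior energy
# w^T A w (proportional to sum_ij k_ij (f_i - f_j)^2) vanishes.
X = np.ones((3, 1))                  # three identical 1-D samples
K = np.ones((3, 3)) - np.eye(3)      # fully connected graph, unit weights
A = prior_precision(X, K)
w = np.array([2.0])
```

As the paper notes, the null eigenvalue of Δ makes A (and hence the prior) potentially improper, which the extra regularization of (5) repairs.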
If we include extra regularization, by adding a non-negative diagonal matrix Λ to λ_0 A, the prior becomes\n\np(w) ∝ exp{ -(1/2) w^T (λ_0 A + Λ) w }, (5)\n\nwhere we may choose Λ = diag{λ_1, ..., λ_d}, Λ = λ_1 I, or even Λ = 0.\n\nFor m > 2, we define (m-1) identical independent priors, one for each w^(c), c = 1, ..., m-1. The joint prior on w = [(w^(1))^T, ..., (w^(m-1))^T]^T is then\n\np(w|λ) ∝ prod_{c=1}^{m-1} exp{ -(1/2) (w^(c))^T (λ_0^(c) A + Λ^(c)) w^(c) } = exp{ -(1/2) w^T Λ(λ) w }, (6)\n\nwhere λ is a vector containing all the λ parameters, Λ^(c) = diag{λ_1^(c), ..., λ_d^(c)}, and\n\nΛ(λ) = diag{λ_0^(1), ..., λ_0^(m-1)} ⊗ A + block-diag{Λ^(1), ..., Λ^(m-1)}. (7)\n\nFinally, since all the λ's are inverses of variances, the conjugate priors are Gamma [3]: p(λ_0^(c) | α_0, β_0) = Ga(λ_0^(c) | α_0, β_0), and p(λ_i^(c) | α_1, β_1) = Ga(λ_i^(c) | α_1, β_1), for c = 1, ..., m-1 and i = 1, ..., d. Usually, α_0, β_0, α_1, and β_1 are given small values indicating diffuse priors. In the zero limit, we obtain scale-invariant (improper) Jeffreys hyper-priors.\n\nSummarizing, our model for semi-supervised learning includes the log-likelihood (2), a prior (6), and Gamma hyper-priors. In Section 4, we present a simple and computationally efficient expectation-maximization (EM) algorithm for obtaining the MAP estimate of w.\n\n3.3 Exploiting Features from Multiple Sensors: The Co-Training Prior\n\nIn some applications several sensors are available, each providing a different set of features. For simplicity, we assume two sensors s ∈ {1, 2}, but everything discussed here is easily extended to any number of sensors. Denote the features from sensor s, for sample i, as x_i^(s), and S_s as the set of sample indices for which we have features from sensor s (S_1 ∪ S_2 = {1, ..., L + U}). Let O = S_1 ∩ S_2 be the indices for which both sensors are available, and O_U = O ∩ {L + 1, ..., L + U} the unlabelled subset of O.\n\nBy using the samples in S_1 and S_2 as two independent training sets, we may obtain two separate classifiers (denoted w_1 and w_2). 
However, we can coordinate the information from both sensors by using an idea known as co-training [4]: on the O_U samples, classifiers w_1 and w_2 should agree as much as possible. Notice that, in a logistic regression framework, the disagreement between the two classifiers on the O_U samples can be measured by\n\nsum_{i ∈ O_U} [(w_1)^T x_i^(1) - (w_2)^T x_i^(2)]^2 = ω^T C ω, (8)\n\nwhere ω = [(w_1)^T (w_2)^T]^T and C = sum_{i ∈ O_U} [(x_i^(1))^T (-x_i^(2))^T]^T [(x_i^(1))^T (-x_i^(2))^T]. This suggests the \"co-training prior\" (where λ_co is an inverse variance):\n\np(w_1, w_2) = p(ω) ∝ exp{ -(λ_co/2) ω^T C ω }. (9)\n\nThis Gaussian prior can be combined with two smoothness Gaussian priors on w_1 and w_2 (obtained as described in Section 3.2); this leads to a prior which is still Gaussian,\n\np(w_1, w_2) = p(ω) ∝ exp{ -(1/2) ω^T ( λ_co C + block-diag{Λ_1, Λ_2} ) ω }, (10)\n\nwhere Λ_1 and Λ_2 are the two graph-based precision matrices (see (7)) for w_1 and w_2. We can again adopt a Gamma hyper-prior for λ_co. Under this prior, and with a logistic regression likelihood as above, estimates of w_1 and w_2 can easily be found using minor modifications to the EM algorithm described in Section 4. Computationally, this is only slightly more expensive than separately training the two classifiers.\n\n4 Learning Via EM\n\nTo find the MAP estimate ŵ, we use the EM algorithm, with λ as missing data, which is equivalent to integrating λ out of the full posterior before maximization [8]. For simplicity, we will only describe the single sensor case (no co-training).\n\nE-step: We compute the expected value of the complete log-posterior, given Y and the current parameter estimate ŵ: Q(w|ŵ) ≡ E[log p(w, λ|Y) | ŵ]. Since\n\nlog p(w, λ|Y) = log P(Y|w) - (1/2) w^T Λ(λ) w + K (11)\n\n(where K collects all terms independent of w) is linear w.r.t. all the λ parameters (see (6) and (7)), we just have to plug their conditional expectations into (11):\n\nQ(w|ŵ) = log P(Y|w) - (1/2) w^T E[Λ(λ)|ŵ] w = l(w) - (1/2) w^T Λ̄(ŵ) w. 
(12)\n\nWe consider several different choices for the structure of the Λ matrix. The necessary expectations have well-known closed forms, due to the use of conjugate Gamma hyper-priors [3]. For example, if the λ_0^(c) are m - 1 free non-negative parameters, we have\n\nE[λ_0^(c)|ŵ] = (2α_0 + d) [2β_0 + (ŵ^(c))^T A ŵ^(c)]^{-1},\n\nfor c = 1, ..., m - 1. For λ_0^(c) = λ_0, we still have a simple closed-form expression for E[λ_0|ŵ], and the same is true for the λ_i^(c) parameters, for i > 0. Finally, Λ̄(ŵ) ≡ E[Λ(λ)|ŵ] results from replacing the λ's in (7) by the corresponding conditional expectations.\n\nM-step: Given matrix Λ̄(ŵ), the M-step reduces to a logistic regression problem with a quadratic regularizer, i.e., maximizing (12). To this end, we adopt the bound optimization approach (see details in [5, 11]). Let B be a positive definite matrix such that -B bounds below (in the matrix sense) the Hessian of l(w), which is negative definite, and let g(ŵ) be the gradient of l(w) at ŵ. Then, we have the following lower bound on Q(w|ŵ):\n\nQ(w|ŵ) ≥ l(ŵ) + (w - ŵ)^T g(ŵ) - [(w - ŵ)^T B (w - ŵ) + w^T Λ̄(ŵ) w]/2.\n\nThe maximizer of this lower bound, w_new = (B + Λ̄(ŵ))^{-1} (Bŵ + g(ŵ)), is guaranteed to increase the Q-function, Q(w_new|ŵ) ≥ Q(ŵ|ŵ), and we thus obtain a monotonic generalized EM algorithm [5, 11]. This (maybe costly) matrix inversion can be avoided by a sequential approach where we only maximize w.r.t. one element of w at a time, preserving the monotonicity of the procedure. The sequential algorithm visits one particular element of w, say w_u, and updates its estimate by maximizing the bound derived above, while keeping all other variables fixed at their previous values. This leads to\n\nw_u^new = ŵ_u + [g_u(ŵ) - (Λ̄(ŵ)ŵ)_u] / (B + Λ̄(ŵ))_{uu}, (13)\n\nand w_v^new = ŵ_v, for v ≠ u. 
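For intuition, the sequential update (13) can be sketched in the binary (two-class) case, where Böhning's bound gives B = X^T X / 4; this is a simplified sketch with our own names and a fixed penalty matrix, not the authors' multinomial implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sequential_m_step(X, y, w, Lam, n_sweeps=200):
    """Coordinate-wise bound optimization for binary logistic regression
    with a fixed quadratic penalty Lam (a sketch of the update in Eq. 13).
    B = X^T X / 4 lower-bounds the negative Hessian of the log-likelihood
    (in the matrix sense), so every coordinate update is monotonic."""
    B = X.T @ X / 4.0
    denom = np.diag(B + Lam)
    for _ in range(n_sweeps):
        for u in range(w.size):
            g = X.T @ (y - sigmoid(X @ w))   # gradient of l(w), labels in {0,1}
            w[u] += (g[u] - (Lam @ w)[u]) / denom[u]
    return w

# Tiny example: 1-D data with an intercept column
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
Lam = 0.1 * np.eye(2)
w = sequential_m_step(X, y, np.zeros(2), Lam)
```

In the paper's EM, the penalty matrix would be Λ̄(ŵ), re-estimated at each E-step rather than held fixed as here.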
The total time required by a full sweep for all u = 1, ..., d is O(md(L + d)); this may be much better than the O((dm)^3) cost of the matrix inversion.\n\n5 Active Label Selection\n\nIf we are allowed to obtain the label for one of the unlabelled samples, the following question arises: which sample, if labelled, would provide the most information?\n\nConsider the MAP estimate ŵ provided by EM. Our approach uses a Laplace approximation of the posterior, p(w|Y) ≈ N(w|ŵ, H^{-1}), where H is the posterior precision matrix, i.e., the Hessian of minus the log-posterior, H = ∇^2 (-log p(w|Y)). This approximation is known to be accurate for logistic regression under a Gaussian prior [14]. By treating Λ̄(ŵ) (the expectation of Λ(λ)) as deterministic, we obtain an evidence-type approximation [14]\n\nH = ∇^2 [-log( P(Y|ŵ) p(ŵ|Λ̄(ŵ)) )] = Λ̄(ŵ) + sum_{i=1}^{L} (diag{p_i} - p_i p_i^T) ⊗ x_i x_i^T,\n\nwhere p_i is the (m - 1)-dimensional vector computed from (1), the c-th element of which indicates the probability that sample x_i belongs to class c.\n\nNow let x ∈ D_U be an unlabelled sample and y its label. Assume that the MAP estimate ŵ remains unchanged after including y. In Sec. 7 we will discuss the merits and shortcomings of this assumption, which is only strictly valid when L → ∞. Accepting it implies that after labeling x, and regardless of y, the posterior precision changes to\n\nH' = H + (diag{p} - p p^T) ⊗ x x^T. (14)\n\nSince the entropy of a Gaussian with precision H is (-1/2) log |H| (up to an additive constant), the mutual information (MI) between y and w (i.e., the expected decrease in entropy of w when y is observed) is I(w; y) = (1/2) log{ |H'| / |H| }. Our criterion is then: the best sample to label is the one that maximizes I(w; y). Further insight into I(w; y) can be obtained in the binary case (where p is a scalar); here, the matrix identity |H + p(1 - p) x x^T| = |H| (1 + p(1 - p) x^T H^{-1} x) yields\n\nI(w; y) = (1/2) log(1 + p(1 - p) x^T H^{-1} x). 
(15)\n\nThis MI is larger when p ≈ 0.5, i.e., for samples with uncertain classifications. On the other hand, with p fixed, I(w; y) grows with x^T H^{-1} x, i.e., it is large for samples with high variance of the corresponding class probability estimate. Summarizing, (15) favors samples with uncertain class labels and high uncertainty in the class probability estimate.\n\n6 Experimental Results\n\nWe begin by presenting two-dimensional synthetic examples to visually illustrate our semi-supervised classifier. Fig. 1 shows the utility of using unlabelled data to improve the decision boundary in linear and non-linear (kernel) classifiers (see figure caption for details).\n\nFigure 1: Synthetic two-dimensional examples. (a) Comparison of the supervised logistic linear classifier (boundary shown as dashed line) learned only from the labelled data (shown in color) with the proposed semi-supervised classifier (boundary shown as solid line) which also uses the unlabelled samples (shown as dots). (b) An RBF kernel classifier obtained by our algorithm, using two labelled samples (shaded circles) and many unlabelled samples.\n\nFigure 2: (a)-(c) Accuracy (on UCI datasets) of the proposed method, the supervised SVM, and the other semi-supervised classifiers mentioned in the text; a subset of samples is labelled and the others are treated as unlabelled samples. In (d), a separate holdout set is used to evaluate the accuracy of our method versus the amount of labelled and unlabelled data.\n\nNext we show results with linear classifiers on three UCI benchmark datasets. Results with nonlinear kernels are similar, and therefore omitted to save space. We compare our method against state-of-the-art semi-supervised classifiers: the GRF method of [18], the SGT method of [10], and the transductive SVM (TSVM) of [9]. For reference, we also present results for a standard SVM. 
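The binary selection rule (15), used in the active-learning experiments reported here, takes only a few lines; the following is a sketch with hypothetical helper names, not the authors' code:

```python
import numpy as np

def mi_score(x, p, H_inv):
    # I(w; y) = (1/2) log(1 + p(1-p) x^T H^{-1} x)  -- Eq. (15), binary case
    return 0.5 * np.log1p(p * (1.0 - p) * (x @ H_inv @ x))

def select_query(X_unlab, probs, H_inv):
    # query the label of the unlabelled sample with the largest MI score
    scores = [mi_score(x, p, H_inv) for x, p in zip(X_unlab, probs)]
    return int(np.argmax(scores))
```

For fixed x the score peaks at p = 0.5 and vanishes as p approaches 0 or 1, matching the discussion of (15) above.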
To avoid unduly helping our method, we always use a k = 5 nearest neighbors graph, though our algorithm is not very sensitive to k. To avoid disadvantaging other methods that do depend on such parameters, we use their best settings. Since these adjustments cannot be made in practice, the difference between our algorithm and the others is under-represented. Each point on the plots in Fig. 2(a)-(c) is an average of 20 trials: we randomly select 20 labelled sets which are used by every method. All remaining samples are used as unlabelled by the semi-supervised algorithms.\n\nFigs. 2(a)-(c) are transductive, in the sense that the unlabelled and test data are the same. Our logistic GRF is non-transductive: after being trained, it may be applied to classify new data without re-training. In Fig. 2(d) we present non-transductive results for the Ionosphere data. Training took place using labelled and unlabelled data, and testing was performed on 200 new unseen samples. The results suggest that semi-supervised classifiers are most relevant when the labelled set is small relative to the unlabelled set (as is often the case).\n\nOur final set of results addresses co-training (Sec. 3.3) and active learning (Sec. 5), applied to airborne sensing data for the detection of surface and subsurface land mines. Two sensors were used: (1) a 70-band hyper-spectral electro-optic (EOIR) sensor; (2) an X-band synthetic aperture radar (SAR). A simple (energy) \"prescreener\" detected potential targets; for each of these, two feature vectors were extracted, of sizes 420 and 9, for the EOIR and SAR sensors, respectively. 123 samples have features from the EOIR sensor alone, 398 from the SAR sensor alone, and 316 from both. This data will be made available upon request.\n\nFigure 3: (a) Land mine detection ROC curves of classifiers designed using only hyper-spectral (EOIR) features, only SAR features, and both. (b) Number of landmines detected during the active querying process (dotted lines), for active training and random selection (for the latter the bars reflect one standard deviation about the mean). ROC curves (solid) are for the learned classifier as applied to the remaining samples.\n\nWe first consider supervised and semi-supervised classification. For the purely supervised case, a sparseness prior is used (as in [14]). In both cases a linear classifier is employed. For the data for which only one sensor is available, 20% of it is labelled (selected randomly). For the data for which both sensors are available, 80% is labelled (again selected randomly). The results presented in Fig. 3(a) show that, in general, the semi-supervised classifiers outperform the corresponding supervised ones, and the classifier learned from both sensors is markedly superior to classifiers learned from either sensor alone.\n\nIn a second illustration, we use the active-learning algorithm (Sec. 5) to acquire only the 100 most informative labels. For comparison, we also show average results over 100 independent realizations for random label query selection (error bars indicate one standard deviation). The results in Fig. 3(b) are plotted in two stages: first, mines and clutter are selected during the labeling process (dashed curves); then, the 100 labelled examples are used to build the final semi-supervised classifier, for which the ROC curve is obtained using the remaining unlabelled data (solid curves). Interestingly, the active-learning algorithm finds almost half of the mines while querying for labels. 
Due to physical limitations of the sensors, the rate at which mines are detected drops precipitously after approximately 90 mines are detected -- i.e., the remaining mines are poorly matched to the sensor physics.\n\n7 Discussion\n\n7.1 Principal Contributions\n\nSemi-supervised vs. transductive: Unlike most earlier methods, after the training stage our algorithm can directly classify new samples without computationally expensive re-training.\n\nTradeoff between labelled and unlabelled data: By automatically addressing the inherent tradeoff between their relative contributions, we have ensured that even a small amount of labelled data does not get overlooked because of an abundance of unlabelled samples.\n\nBayesian co-training: Using the proposed prior, classifiers for all sensors are improved using: (a) the label information provided on the other types of data, and (b) samples drawn from the joint distribution of features from multiple sensors.\n\nActive label acquisition: We explicitly account for the knowledge of the unlabelled data and the co-training information while computing the well-known mutual information criterion.\n\n7.2 Quality of Assumptions and Empirically Observed Shortcomings\n\nThe assumption that the mode of the posterior distribution of the classifier remains unchanged after seeing an additional label is clearly not true at the beginning of the active learning procedure. However, we have empirically found it a very good approximation after the active learning procedure has yielded as few as 15 labels. 
This assumption allows a tremendous saving in computational cost, since it helps us avoid repeated re-training of classifiers in the active label acquisition process while evaluating candidate queries.\n\nA disturbing fact that has been reported in the literature (e.g., in [12]) and that we have confirmed (in unreported experiments) is that the error rate of the active query selection increases slightly when the number of labelled samples grows beyond an optimal number. We conjecture that this may be caused by keeping the hyper-prior parameters α_0, β_0, α_1, β_1 fixed at the same value; in all of our experiments we have set them to 10^{-4}, corresponding to an almost uninformative hyper-prior.\n\nReferences\n\n[1] M. Belkin, I. Matveeva, and P. Niyogi. Regularization and regression on large graphs. In Proc. Computational Learning Theory COLT'04, Banff, Canada, 2004.\n\n[2] M. Belkin and P. Niyogi. Using manifold structure for partially labelled classification. In NIPS 15, MIT Press, Cambridge, MA, 2003.\n\n[3] J. Bernardo and A. Smith. Bayesian Theory. J. Wiley & Sons, Chichester, UK, 1994.\n\n[4] A. Blum and T. Mitchell. Combining labelled and unlabelled data with co-training. In Proc. Computational Learning Theory COLT'98, Madison, WI, 1998.\n\n[5] D. Böhning. Multinomial logistic regression algorithm. Annals Inst. Stat. Math., vol. 44, pp. 197-200, 1992.\n\n[6] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS 15, MIT Press, Cambridge, MA, 2003.\n\n[7] A. Corduneanu and T. Jaakkola. On information regularization. In Proc. Uncertainty in Artificial Intelligence UAI'03, Acapulco, Mexico, 2003.\n\n[8] M. Figueiredo. Adaptive sparseness using Jeffreys' prior. In NIPS 14, MIT Press, 2002.\n\n[9] T. Joachims. Transductive inference for text classification using support vector machines. In Int. Conf. Machine Learning ICML'99, 1999.\n\n[10] T. Joachims. 
Transductive learning via spectral graph partitioning. In ICML'03, 2003.\n\n[11] K. Lange, D. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. J. Computational and Graphical Statistics, vol. 9, pp. 1-59, 2000.\n\n[12] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In Int. Conf. Machine Learning ICML'00, 2000.\n\n[13] M. Seeger. Learning with labelled and unlabelled data. Tech. Rep., Institute for Adaptive and Neural Computation, University of Edinburgh, UK, 2001.\n\n[14] M. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Research, vol. 1, pp. 211-244, 2001.\n\n[15] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Research, vol. 2, pp. 45-66, 2001.\n\n[16] D. Zhou, O. Bousquet, T. Lal, J. Weston, and B. Schölkopf. Semi-supervised learning by maximizing smoothness. J. Mach. Learn. Research, 2004 (submitted).\n\n[17] X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML'03 Workshop on The Continuum from Labelled to Unlabelled Data in Mach. Learning, 2003.\n\n[18] X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning: From Gaussian fields to Gaussian processes. Tech. Rep. CMU-CS-03-175, School of CS, CMU, 2003.\n", "award": [], "sourceid": 2719, "authors": [{"given_name": "Balaji", "family_name": "Krishnapuram", "institution": null}, {"given_name": "David", "family_name": "Williams", "institution": null}, {"given_name": "Ya", "family_name": "Xue", "institution": null}, {"given_name": "Lawrence", "family_name": "Carin", "institution": null}, {"given_name": "M\u00e1rio", "family_name": "Figueiredo", "institution": null}, {"given_name": "Alexander", "family_name": "Hartemink", "institution": null}]}