{"title": "Kernel Expansions with Unlabeled Examples", "book": "Advances in Neural Information Processing Systems", "page_first": 626, "page_last": 632, "abstract": null, "full_text": "Kernel expansions with unlabeled examples\n\nMartin Szummer\nMIT AI Lab & CBCL\nCambridge, MA\nszummer@ai.mit.edu\n\nTommi Jaakkola\nMIT AI Lab\nCambridge, MA\ntommi@ai.mit.edu\n\nAbstract\n\nModern classification applications necessitate supplementing the few available labeled examples with unlabeled examples to improve classification performance. We present a new tractable algorithm for exploiting unlabeled examples in discriminative classification. This is achieved essentially by expanding the input vectors into longer feature vectors via both labeled and unlabeled examples. The resulting classification method can be interpreted as a discriminative kernel density estimate and is readily trained via the EM algorithm, which in this case is both discriminative and achieves the optimal solution. We provide, in addition, a purely discriminative formulation of the estimation problem by appealing to the maximum entropy framework. We demonstrate that the proposed approach requires very few labeled examples for high classification accuracy.\n\n1 Introduction\n\nIn many modern classification problems such as text categorization, very few labeled examples are available but a large number of unlabeled examples can be readily acquired. Various methods have recently been proposed to take advantage of unlabeled examples to improve classification performance. Such methods include the EM algorithm with naive Bayes models for text classification [1], the co-training framework [2], transduction [3, 4], and maximum entropy discrimination [5].
\n\nThese approaches are divided primarily on the basis of whether they employ generative modeling or are motivated by robust classification. Unfortunately, the computational effort scales exponentially with the number of unlabeled examples for exact solutions in discriminative approaches such as transduction [3, 5]. Various approximations are available [4, 5] but their effect remains unclear.\n\nIn this paper, we formulate a complementary discriminative approach to exploiting unlabeled examples, effectively by using them to expand the representation of examples. This approach has several advantages, including the ability to represent the true Bayes optimal decision boundary and making explicit use of the density over the examples. It is also computationally feasible as stated.\n\nThe paper is organized as follows. We start by discussing the kernel density estimate and providing a smoothness condition, assuming labeled data only. We subsequently introduce unlabeled data, define the expansion and formulate the EM algorithm for discriminative training. In addition, we provide a purely discriminative version of the parameter estimation problem and formalize it as a maximum entropy discrimination problem. We then demonstrate experimentally that various concerns about the approach are not warranted.\n\n2 Kernel density estimation and classification\n\nWe start by assuming a large number of labeled examples $D = \\{(x_1, \\tilde{y}_1), \\ldots, (x_N, \\tilde{y}_N)\\}$, where $\\tilde{y}_i \\in \\{-1, 1\\}$ and $x_i \\in R^d$. A joint kernel density estimate can be written as\n\n$$P(x, y) = \\frac{1}{N} \\sum_{i=1}^{N} \\delta(y, \\tilde{y}_i) K(x, x_i) \\qquad (1)$$\n\nwhere $\\int K(x, x_i) \\, d\\mu(x) = 1$ for each $i$. With an appropriately chosen kernel $K$, a function of $N$, $P(x, y)$ will be consistent in the sense of converging to the joint density as $N \\to \\infty$.
\n\nGiven a fixed number of examples, the kernel functions $K(x, x_i)$ may be viewed as conditional probabilities $P(x|i)$, where $i$ indexes the observed points. For the purposes of this paper, we assume a Gaussian form $K(x, x_i) = N(x; x_i, \\sigma^2 I)$. The labels $\\tilde{y}_i$ assigned to the sampled points $x_i$ may themselves be noisy and we incorporate $P(y|i)$, a location-specific probability of labels. The resulting joint density model is\n\n$$P(x, y) = \\frac{1}{N} \\sum_{i=1}^{N} P(y|i) P(x|i)$$\n\nInterpreting $1/N$ as a prior probability of the index variable $i = 1, \\ldots, N$, the resulting model conforms to a graphical model in which $x$ and $y$ are conditionally independent given $i$. This is reminiscent of the aspect model for clustering of dyadic data [6]. There are two main differences. First, the number of aspects here equals the number of examples and the model is not suitable for clustering. Second, we do not search for the probabilities $P(x|i)$ (kernels); instead, they are associated with each observed example and are merely adjusted in terms of scale (kernel width). This restriction yields a significant computational advantage in classification, which is the objective in this paper.\n\nThe posterior probability of the label $y$ given an example $x$ is given by $P(y|x) = \\sum_i P(y|i) P(i|x)$, where $P(i|x) \\propto P(x|i) / P(x)$ as $P(i)$ is assumed to be uniform. The quality of the posterior probability depends both on how accurately $P(y|i)$ are known and on the properties of the membership probabilities $P(i|x)$ (always known), which must be relatively smooth.\n\nHere we provide a simple condition on the membership probabilities $P(i|x)$ so that any noise in the sampled labels for the available examples would not preclude accurate decisions. In other words, we wish to ensure that the conditional probabilities $P(y|x)$ can be evaluated accurately on the basis of the sampled estimate in Eq. (1). Removing the label noise provides an alternative way of setting the width parameter $\\sigma$ of the Gaussian kernels.
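
As a minimal sketch of the posterior computation described above, the following Python fragment evaluates the membership probabilities P(i|x) for Gaussian kernels of a common width and combines them with given label probabilities P(y=1|i). The function names, the toy points and the values of P(y=1|i) are illustrative assumptions and are not taken from the experiments.

```python
import math

def memberships(x, centers, sigma):
    # P(i|x) is proportional to exp(-||x - x_i||^2 / (2 sigma^2)) under a uniform P(i).
    sq = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centers]
    m = min(sq)  # shift by the smallest distance for numerical stability
    w = [math.exp(-(s - m) / (2.0 * sigma ** 2)) for s in sq]
    z = sum(w)
    return [v / z for v in w]

def posterior_pos(x, centers, p_pos, sigma):
    # P(y=1|x) = sum_i P(y=1|i) P(i|x)
    return sum(p * q for p, q in zip(p_pos, memberships(x, centers, sigma)))

# Hypothetical toy data: two points near the origin, two near (3, 3).
centers = [[0.0, 0.0], [0.2, 0.1], [3.0, 3.0], [3.1, 2.8]]
p_pos = [0.9, 0.9, 0.1, 0.1]  # assumed values of P(y=1|i)
print(posterior_pos([0.1, 0.0], centers, p_pos, sigma=0.5))
```

A smaller sigma concentrates P(i|x) on the nearest points, while a larger sigma spreads it out; this is the sense in which the kernel width controls the smoothness of the membership probabilities.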
\n\nThe simple lemma below, obtained via standard large deviation methods, ties the appropriate choice of the kernel width $\\sigma$ to the squared norm of the membership probabilities $P(i|x_j)$.\n\nLemma 1 Let $I_N = \\{1, \\ldots, N\\}$. Given any $\\delta > 0$, $\\epsilon > 0$, and any collection of distributions $p_{i|k} \\ge 0$, $\\sum_{i \\in I_N} p_{i|k} = 1$ for $k \\in I_N$, such that $\\|p_{\\cdot|k}\\|_2 \\le \\epsilon / \\sqrt{2 \\log(2N/\\delta)}$, $\\forall k \\in I_N$, and independent samples $\\tilde{y}_i \\in \\{-1, 1\\}$ from some $P(y|i)$, $i \\in I_N$, then $P(\\exists k \\in I_N : |\\sum_{i=1}^{N} \\tilde{y}_i p_{i|k} - \\sum_{i=1}^{N} w_i p_{i|k}| > \\epsilon) \\le \\delta$, where $w_i = P(y = 1|i) - P(y = -1|i)$ and the probability is taken over the independent samples.\n\nThe lemma applies to our case by setting $p_{i|k} = P(i|x_k)$, letting $\\{\\tilde{y}_i\\}$ represent the sampled labels for the examples, and noting that the sign of $\\sum_i w_i P(i|x)$ is the MAP decision rule from our model, i.e., the sign of $P(y = 1|x) - P(y = -1|x)$. The lemma states that as long as the membership probabilities have appropriately bounded squared norm, the noise in the labeling is inconsequential for the classification decisions. Note, for example, that a distribution $p_{i|k} = 1/N$ has $\\|p_{\\cdot|k}\\|_2 = 1/\\sqrt{N}$, implying that the conditions are achievable for large $N$. The squared norm of $P(i|x)$ is directly controlled by the kernel width $\\sigma^2$ and thus the lemma ties the kernel width with the accuracy of estimating the conditional probabilities $P(y|x)$. Algorithms for adjusting the kernel width(s) on the basis of this will be presented in a longer version of the paper.\n\n3 The expansion and EM estimation\n\nA useful way to view the resulting kernel density estimate is that each example $x$ is represented by a vector of membership probabilities $P(i|x)$, $i = 1, \\ldots, N$. Such mixture distance representations have been used extensively; the representation can also be viewed as a Fisher score vector computed with respect to adjustable weighting $P(i)$. The examples in this new representation are classified by associating $P(y|i)$ with each component and computing $P(y|x) = \\sum_i P(y|i) P(i|x)$. 
An alternative approach to exploiting kernel density estimates in classification is given by [7].\n\nWe now assume that we have labels for only a few examples, and our training data is $\\{(x_1, \\tilde{y}_1), \\ldots, (x_L, \\tilde{y}_L), x_{L+1}, \\ldots, x_N\\}$. In this case, we may continue to use the model defined above and estimate the free parameters, $P(y|i)$, $i = 1, \\ldots, N$, from the few labeled examples. In other words, we can maximize the conditional log-likelihood\n\n$$\\sum_{l=1}^{L} \\log P(\\tilde{y}_l | x_l) = \\sum_{l=1}^{L} \\log \\sum_{i=1}^{N} P(\\tilde{y}_l | i) P(i | x_l) \\qquad (2)$$\n\nwhere the first summation is only over the labeled examples and $L \\ll N$. Since $P(i|x_l)$ are fixed, this objective function is jointly concave in the free parameters and lends itself to a unique maximum value. The concavity also guarantees that this optimization is easily performed via the EM algorithm [8].\n\nLet $p_{i|l}$ be the soft assignment for component $i$ given $(x_l, \\tilde{y}_l)$, i.e., $p_{i|l} = P(i | x_l, \\tilde{y}_l) \\propto P(\\tilde{y}_l | i) P(i | x_l)$. The EM algorithm iterates between the E-step, where $p_{i|l}$ are recomputed from the current estimates of $P(y|i)$, and the M-step, where we update $P(y|i) \\leftarrow \\sum_{l: \\tilde{y}_l = y} p_{i|l} / \\sum_l p_{i|l}$.\n\nThis procedure may have to be adjusted in cases where the overall frequency of different labels in the (labeled) training set deviates significantly from uniform. A simple rescaling $P(y|i) \\leftarrow P(y|i) / L_y$ by the frequencies $L_y$ and renormalization after each M-step would probably suffice.\n\nThe runtime of this algorithm is $O(LN)$. The discriminative formulation suggests that EM will provide reasonable parameter estimates $P(y|i)$ for classification purposes. The quality of the solution, as well as the potential for overfitting, is contingent on the smoothness of the kernels or, equivalently, smoothness of the membership probabilities $P(i|x)$. Note, however, that whether or not $P(y|i)$ will converge to the extreme values 0 or 1 is not an indication of overfitting. 
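
The E-step and M-step described above can be sketched as follows. This is only an illustrative fragment under stated assumptions: the function name, the fixed number of iterations and the toy membership values are hypothetical, the membership probabilities P(i|x_l) are assumed to be precomputed, and the optional rescaling for imbalanced label frequencies is omitted.

```python
def em_label_probs(member, labels, n_iter=50):
    # member[l][i] = P(i|x_l) for labeled example l; labels[l] is -1 or +1.
    n = len(member[0])
    p_pos = [0.5] * n  # P(y=1|i), started from an uninformative value
    for _ in range(n_iter):
        num = [0.0] * n  # sum of p_{i|l} over examples labeled +1
        den = [0.0] * n  # sum of p_{i|l} over all labeled examples
        for row, y in zip(member, labels):
            # E-step: p_{i|l} is proportional to P(y_l|i) P(i|x_l)
            lik = [(p if y == 1 else 1.0 - p) * m for p, m in zip(p_pos, row)]
            z = sum(lik)
            if z == 0.0:
                continue
            for i, v in enumerate(lik):
                r = v / z
                den[i] += r
                if y == 1:
                    num[i] += r
        # M-step: P(y=1|i) <- sum_{l: y_l = 1} p_{i|l} / sum_l p_{i|l}
        p_pos = [num[i] / den[i] if den[i] > 0.0 else p_pos[i] for i in range(n)]
    return p_pos

# Hypothetical memberships for two labeled examples over four kernel centres.
member = [[0.5, 0.45, 0.03, 0.02], [0.02, 0.03, 0.45, 0.5]]
print(em_label_probs(member, [1, -1]))
```

Each iteration touches every pair of labeled example and kernel centre once, which matches the stated cost of order L times N per pass.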
\n\nActual classification decisions for unlabeled examples $x_i$ (included in the expansion) need to be made on the basis of $P(y|x_i)$ and not on the basis of $P(y|i)$, which function as parameters.\n\n4 Discriminative estimation\n\nAn alternative discriminative formulation is also possible, one that is more sensitive to the decision boundary rather than probability values associated with the labels. To this end, consider the conditional probability $P(y|x) = \\sum_i P(y|i) P(i|x)$. The decisions are made on the basis of the sign of the discriminant function\n\n$$f(x) = P(y = 1|x) - P(y = -1|x) = \\sum_{i=1}^{N} w_i P(i|x) \\qquad (3)$$\n\nwhere $w_i = P(y = 1|i) - P(y = -1|i)$. This is similar to a linear classifier and there are many ways of estimating the weights $w_i$ discriminatively. The weights should remain bounded, however, i.e., $w_i \\in [-1, 1]$, so long as we wish to maintain the kernel density interpretation. Estimation algorithms with Euclidean norm regularization such as SVMs would not be appropriate in this sense. Instead, we employ the maximum entropy discrimination (MED) framework [5] and rely on the relation $w_i = E\\{y_i\\} = \\sum_{y_i = \\pm 1} y_i P(y)$ to estimate the distribution $P(y)$ over all the labels $y = [y_1, \\ldots, y_N]$. Here $y_i$ is a parameter associated with the $i$th example and should be distinguished from any observed labels. We can show that in this case the maximum entropy solution factors across the examples, $P(y_1, \\ldots, y_N) = \\prod_i P_i(y_i)$, and we can formulate the estimation problem directly in terms of the marginals $P_i(y_i)$.\n\nThe maximum entropy formalism encodes the principle that label assignments $P_i(y_i)$ for the examples should remain uninformative to the extent possible given the classification objective. More formally, given a set of $L$ labeled examples $(x_1, \\tilde{y}_1), \\ldots, (x_L, \\tilde{y}_L)$, we maximize $\\sum_{i=1}^{N} H(y_i) - C \\sum_l \\xi_l$ subject to the classification constraints\n\n$$\\sum_{i=1}^{N} \\tilde{y}_l w_i P(i|x_l) + \\xi_l \\ge \\gamma, \\quad l = 1, \\ldots, L \\qquad (4)$$\n\nwhere $H(y_i)$ is the entropy of $y_i$ relative to the marginal $P_i(y_i)$. Here $\\gamma$ specifies the target separation ($\\gamma \\in [0, 1]$) and the slack variables $\\xi_l \\ge 0$ permit deviations from the target to ensure that a solution always exists. The solution is not very sensitive to these parameters, and $\\gamma = 0.1$ and $C = 40$ worked well for many problems. The advantage of this formulation is that effort is spent only on those training examples whose classification is uncertain. Examples already classified correctly with a margin larger than $\\gamma$ are effectively ignored. The optimization problem and algorithms are explained in the appendix.\n\n5 Discussion of the expanded representation\n\nThe kernel expansion enables us to represent the Bayes optimal decision boundary provided that the kernel density estimate is sufficiently accurate. With this representation, the EM and MED algorithms actually estimate decision boundaries that are sensitive to the density $P(x)$. For example, labeled points in high-density regions will influence the boundary more than those in low-density regions. The boundary will partly follow the density, but unlike in unsupervised methods, will adhere strongly to the labeled points. Moreover, our estimation techniques limit the effect of outliers, as all points have a bounded weight $w_i \\in [-1, 1]$ (spurious unlabeled points do not adversely affect the boundary).\n\nAs we impose smoothness constraints on the membership probabilities $P(i|x)$, we also guarantee that the capacity of the resulting classifier need not increase with the number of unlabeled examples (in the fat shattering sense). Also, in the context of the maximum entropy formulation, if a point is not helpful for the classification constraints, then entropy is maximized for $P_i(y = \\pm 1) = 0.5$, implying $w_i = 0$, and the point has no effect on the boundary.
\n\nIf we dispense with the conditional probability interpretation of the kernels $K$, we are free to choose them from a more general class of functions. For example, the kernels no longer have to integrate to 1. An expansion of $x$ in terms of these kernels can still be meaningful; as a special case, when linear kernels are chosen, the expansion reduces to weighting distances between points by the covariance of the data. Distinctions along high variance directions then become easier to make, which is helpful when between-class scatter is greater than within-class scatter.\n\nThus, even though the probabilistic interpretation is missing, a simple preprocessing step can still help other classifiers, e.g., support vector machines, to take advantage of unlabeled data: we can expand the inputs $x$ in terms of kernels $G$ from labeled and unlabeled points as in $\\phi(x) = \\frac{1}{Z} [G(x, x_1), \\ldots, G(x, x_N)]$, where $Z$ optionally normalizes the feature vector.\n\n6 Results\n\nWe first address the potential concern that the expanded representation may involve too many degrees of freedom and result in poor generalization. Figure 1a) demonstrates that this is not the case and, instead, the test classification error approaches the limiting asymptotic rate exponentially fast. The problem considered was a DNA splice site classification problem with 500 examples for which $d = 100$. Varying sizes of random subsets were labeled and all the examples were used in the expansion as unlabeled examples. The error rate was computed on the basis of the remaining $500 - L$ examples without labels, where $L$ denotes the number of labeled examples. The results in the figure were averaged across 20 independent runs. The exponential rate of convergence towards the limiting rate is evidenced by the linear trend in the semilog Figure 1a). 
The mean test errors shown in Figure 1b) indicate that the purely discriminative training (MED) can contribute substantially to the accuracy. The kernel width in these experiments was simply fixed to the median distance to the 5th nearest neighbor from the opposite class. Results from other methods of choosing the kernel width (the squared norm, adaptive) will be discussed in the longer version of the paper.\n\nAnother concern is perhaps that the formulation is valid only in cases where we have a large number of unlabeled examples. In principle, the method could deteriorate rapidly once the kernel density estimate can no longer be assumed to give reasonable estimates. Figure 2 illustrates that this is not a valid interpretation. The problem here is to classify DNA microarray experiments on the basis of the leukemia types that the tissues used in the array experiments corresponded to. Each input vector for the classifier consists of the expression levels of over 7000 genes that were included as probes in the arrays. The number of examples available was 38 for training and 34 for testing. We included all examples as unlabeled points in the expansion and randomly selected subsets of labeled training examples, and measured the performance only on the test examples (which were of slightly different type and hence more appropriate for assessing generalization). Figure 2 shows rapid convergence for EM and the discriminative MED formulation. The \"asymptotic\" level here corresponds to about one classification error among the 34 test examples. The results were averaged over 20 independent runs.\n\n7 Conclusion\n\nWe have provided a complementary framework for exploiting unlabeled examples in discriminative classification problems. The framework involves a combination of the ideas of kernel density estimation and representational expansion of input vectors. A simple EM 
A simple EM \n\n\fo 35 ,----~-~-~-~-~_____, \n\na) \n\no 050L-~-----,\"10--1C=-5 -~2~0 -~25,--------,J30 \n\nlabeled examples \n\nb) \n\nFigure 1: A semilog plot of the test error rate for the EM formulation less the asymptotic \nrate as a function of labeled examples. The linear trend in the figure implies that the error \nrate approaches the asymptotic error exponentially fast. b) The mean test errors for EM, \nMED and SVM as a function of the number of labeled examples. SVM does not use \nunlabeled examples. \n\no 35 ,------~-,------~--,----______, \n\n03 \n\n025 g \n\u2022 02 \n~ \nma 15 \nE \n\n01 \n\n005 \n\n%~-~-~170 -~1 5-~270 -~25' \n\nnumber of labeled exam ples \n\nFigure 2: The mean test errors for the leukemia classification problem as a function of the \nnumber of randomly chosen labeled examples. Results are given for both EM (lower line) \nand MED (upper line) formulations. \n\nalgorithm is sufficient for finding globally optimal parameter estimates but we have shown \nthat a purely discriminative formulation can yield substantially better results within the \nframework. \n\nPossible extensions include using the kernel expansions with transductive algorithms that \nenforce margin constraints also for the unlabeled examples [5] . Such combination can be \nparticularly helpful in terms of capturing the lower dimensional structure of the data. Other \nextensions include analysis of the framework similarly to [9]. \n\nAcknowledgments \n\nThe authors gratefully acknowledge support from NTT and NSF. Szummer would also like \nto thank Thomas Minka for many helpful discussions and insights. \n\nReferences \n\n[1] Nigam K. , McCallum A. , Thrun S., and Mitchell T. (2000) Text classification from \n\nlabeled and unlabeled examples. Machine Learning 39 (2):103-134. \n\n[2] Blum A., Mitchell T. (1998) Combining Labeled and Unlabeled Data with Co(cid:173)\nTraining. In Proc. 11th Annual Con! Computational Learning Theo ry, pp. 92-100. \n\n\f[3] Vapnik V. 
(1998) Statistical learning theory. John Wiley & Sons.\n\n[4] Joachims T. (1999) Transductive inference for text classification using support vector machines. International Conference on Machine Learning.\n\n[5] Jaakkola T., Meila M., and Jebara T. (1999) Maximum entropy discrimination. In Advances in Neural Information Processing Systems 12.\n\n[6] Hofmann T., Puzicha J. (1998) Unsupervised Learning from Dyadic Data. International Computer Science Institute, TR-98-042.\n\n[7] Tong S., Koller D. (2000) Restricted Bayes Optimal Classifiers. Proceedings AAAI.\n\n[8] Miller D., Uyar T. (1996) A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data. In Advances in Neural Information Processing Systems 9, pp. 571-577.\n\n[9] Castelli V., Cover T. (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory 42(6):2102-2117.\n\nA Maximum entropy solution\n\nThe unique solution to the maximum entropy estimation problem is found via introducing Lagrange multipliers $\\{\\lambda_l\\}$ for the classification constraints. The multipliers satisfy $\\lambda_l \\in [0, C]$, where the lower bound comes from the inequality constraints and the upper bound from the linear margin penalties being minimized. To represent the solution and find the optimal setting of $\\lambda_l$ we must evaluate the partition function\n\n$$Z(\\lambda) = e^{-\\gamma \\sum_l \\lambda_l} \\sum_{y_1, \\ldots, y_N} \\prod_{i=1}^{N} e^{\\sum_l \\tilde{y}_l \\lambda_l y_i P(i|x_l)} \\qquad (5)$$\n\n$$= e^{-\\gamma \\sum_l \\lambda_l} \\prod_{i=1}^{N} \\Big( e^{\\sum_l \\tilde{y}_l \\lambda_l P(i|x_l)} + e^{-\\sum_l \\tilde{y}_l \\lambda_l P(i|x_l)} \\Big) \\qquad (6)$$\n\nthat normalizes the maximum entropy distribution. Here $\\tilde{y}_l$ denote the observed labels. Minimizing the jointly convex log-partition function $\\log Z(\\lambda)$ with respect to the Lagrange multipliers leads to the optimal setting $\\{\\lambda_l^*\\}$. This optimization is readily done via an axis-parallel line search (e.g., the bisection method). 
The required gradients are given by\n\n$$\\frac{\\partial \\log Z(\\lambda)}{\\partial \\lambda_k} = -\\gamma + \\sum_{i=1}^{N} \\tanh\\Big( \\sum_{l=1}^{L} \\tilde{y}_l \\lambda_l P(i|x_l) \\Big) \\tilde{y}_k P(i|x_k) \\qquad (7)$$\n\n$$= -\\gamma + \\tilde{y}_k \\sum_{i=1}^{N} E_{P^*}\\{y_i\\} P(i|x_k) \\qquad (8)$$\n\n(this is essentially the classification constraint). The expectation is taken with respect to the maximum entropy distribution $P^*(y_1, \\ldots, y_N) = P_1^*(y_1) \\cdots P_N^*(y_N)$, where the components are $P_i^*(y_i) \\propto \\exp\\{\\sum_l \\tilde{y}_l \\lambda_l y_i P(i|x_l)\\}$. The label averages $w_i^* = E_{P^*}\\{y_i\\} = \\sum_{y_i} y_i P_i^*(y_i)$ are needed for the decision rule as well as in the optimization. We can identify these from above as $w_i^* = \\tanh(\\sum_l \\tilde{y}_l \\lambda_l P(i|x_l))$ and they are readily evaluated. Finding the solution involves $O(L^2 N)$ operations.\n\nOften the numbers of positive and negative training labels are imbalanced. The MED formulation (analogously to SVMs) can be adjusted by defining the margin penalties as $C^+ \\sum_{l: \\tilde{y}_l = 1} \\xi_l + C^- \\sum_{l: \\tilde{y}_l = -1} \\xi_l$, where, for example, $L^+ C^+ = L^- C^-$ equalizes the mean penalties. The coefficients $C^+$ and $C^-$ can also be modified adaptively during the estimation process to balance the rate of misclassification errors across the two classes.\n", "award": [], "sourceid": 1928, "authors": [{"given_name": "Martin", "family_name": "Szummer", "institution": null}, {"given_name": "Tommi", "family_name": "Jaakkola", "institution": null}]}