{"title": "Maximum Margin Semi-Supervised Learning for Structured Variables", "book": "Advances in Neural Information Processing Systems", "page_first": 33, "page_last": 40, "abstract": "", "full_text": "Maximum Margin Semi-Supervised\nLearning for Structured Variables\n\nY. Altun, D. McAllester\n\nM. Belkin\n\nTTI at Chicago\n\nChicago, IL 60637\n\naltun,mcallester@tti-c.org\n\nDepartment of Computer Science\n\nUniversity of Chicago\n\nChicago, IL 60637\n\nmisha@cs.uchicago.edu\n\nAbstract\n\nMany real-world classi\ufb01cation problems involve the prediction of\nmultiple inter-dependent variables forming some structural depen-\ndency. Recent progress in machine learning has mainly focused on\nsupervised classi\ufb01cation of such structured variables. In this paper,\nwe investigate structured classi\ufb01cation in a semi-supervised setting.\nWe present a discriminative approach that utilizes the intrinsic ge-\nometry of input patterns revealed by unlabeled data points and we\nderive a maximum-margin formulation of semi-supervised learning\nfor structured variables. Unlike transductive algorithms, our for-\nmulation naturally extends to new test points.\n\n1 Introduction\n\nDiscriminative methods, such as Boosting and Support Vector Machines have sig-\nni\ufb01cantly advanced the state of the art for classi\ufb01cation. However, traditionally\nthese methods do not exploit dependencies between class labels where more than\none label is predicted. Many real-world classi\ufb01cation problems, on the other hand,\ninvolve sequential or structural dependencies between multiple labels. For example\nlabeling the words in a sentence with their part-of-speech tags involves sequential\ndependency between part-of-speech tags; \ufb01nding the parse tree of a sentence in-\nvolves a structural dependency among the labels in the parse tree. 
Recently, there has been growing interest in generalizing kernel methods to predict structured and inter-dependent variables in a supervised learning setting, e.g. the dual perceptron [7], SVMs [2, 15, 14] and kernel logistic regression [1, 11]. These techniques combine the efficiency of dynamic programming methods with the advantages of state-of-the-art learning methods. In this paper, we investigate classification of structured objects in a semi-supervised setting.

The goal of semi-supervised learning is to improve learning from a small sample of labeled inputs by exploiting a large sample of unlabeled data. This idea has recently attracted considerable interest due to the ubiquity of unlabeled data. In many applications, from data mining to speech recognition, it is easy to produce large amounts of unlabeled data, while labeling is often manual and expensive. This is also the case for many structured classification problems. A variety of methods have been proposed, ranging from Naive Bayes [12] and Co-training [4] to Transductive SVMs [9], cluster kernels [6] and graph-based approaches ([3] and references therein). The intuition behind many of these methods is that the classification/regression function should be smooth with respect to the geometry of the data, i.e. the labels of two inputs x and x̄ are likely to be the same if x and x̄ are similar. This idea is often stated as the cluster assumption or the manifold assumption: the unlabeled points reveal the intrinsic structure of the data, which is then utilized by the classification algorithm. A discriminative approach to semi-supervised learning was developed by Belkin, Sindhwani and Niyogi [3, 13], where the Laplacian operator associated with the unlabeled data is used as an additional penalty (regularizer) on the space of functions in a Reproducing Kernel Hilbert Space.
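The Laplacian penalty just mentioned can be sketched in a few lines (a minimal illustration under our own assumptions, not the authors' code): build a nearest-neighbor graph over all inputs, labeled and unlabeled, and charge a candidate function f the quadratic form fᵀLf, which is small exactly when f changes little across graph edges. The k-NN construction with 0/1 weights below is an assumption for illustration; any similarity graph works the same way.

```python
import numpy as np

def knn_graph(X, k=2):
    """Symmetric k-nearest-neighbor adjacency (0/1 weights) over rows of X.
    An assumed construction for illustration; weighted graphs work similarly."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:  # skip self (distance 0)
            W[i, j] = 1.0
    return np.maximum(W, W.T)                # symmetrize

def laplacian_penalty(f, W):
    """f^T L f with L = D - W; equals (1/2) * sum_ij W_ij (f_i - f_j)^2,
    so it is small iff f is smooth over the graph built from the inputs."""
    L = np.diag(W.sum(axis=1)) - W           # unnormalized graph Laplacian
    return float(f @ L @ f)
```

A function that is constant on each cluster of the data incurs zero penalty, while one that oscillates within a cluster is charged heavily; adding this term to a standard RKHS objective is what yields the graph-regularized kernel discussed next.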
The additional regularization from the unlabeled data can be represented as a new kernel, a "graph-regularized" kernel.

In this paper, building on [3, 13], we present a discriminative semi-supervised learning formulation for problems that involve structured and inter-dependent outputs, and give experimental results on max-margin semi-supervised structured classification using graph-regularized kernels. The solution of the optimization problem that utilizes both labeled and unlabeled data is a linear combination of the graph-regularized kernel evaluated at the parts of the labeled inputs only, leading to a large reduction in the number of parameters. It is important to note that our classification function is defined on all input points, whereas some previous work is defined only on the input points in the (labeled and unlabeled) training sample, since it uses standard graph kernels, which by definition are restricted to in-sample data points.

There is an extensive literature on semi-supervised learning and a growing number of studies on learning structured and inter-dependent variables. Delalleau et al. [8] propose a semi-supervised learning method for standard classification that extends to out-of-sample points. Brefeld et al. [5] is one of the first studies investigating the semi-supervised structured learning problem in a discriminative framework. The most relevant previous work is the transductive structured learning proposed by Lafferty et al. [11].

2 Supervised Learning for Structured Variables

In structured learning, the goal is to learn a mapping h : X → Y from structured inputs to structured response values, where the inputs and response values form a dependency structure.
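As a toy instance of such a mapping, consider sequence labeling (our illustration only: the state set, observation alphabet and part weights below are invented, and we search the output space by brute force where one would use dynamic programming). The predictor scores an output by summing the weights of its local "parts", here adjacent label pairs and label-observation pairs, and returns the highest-scoring label sequence.

```python
from itertools import product

STATES = ("A", "B")  # toy hidden-state set (an assumption for illustration)
# Weight of each local "part": (state, next_state) transitions and
# (state, observation) emissions; the observations here are "x" and "y".
f = {("A", "A"): 1.0, ("B", "B"): 1.0, ("A", "B"): -1.0, ("B", "A"): -1.0,
     ("A", "x"): 2.0, ("B", "y"): 3.0}

def score(x, y):
    """Part-based compatibility of input sequence x with label sequence y:
    the sum of part weights, counted as often as each part occurs."""
    s = sum(f.get((yi, xi), 0.0) for yi, xi in zip(y, x))              # emissions
    s += sum(f.get((y[i], y[i + 1]), 0.0) for i in range(len(y) - 1))  # transitions
    return s

def h(x):
    """Predict by maximizing the score over all feasible outputs, i.e. all
    label sequences of the same length as x. Brute force for clarity; a
    Viterbi-style dynamic program computes the same argmax efficiently."""
    return max(product(STATES, repeat=len(x)), key=lambda y: score(x, y))
```

With these weights, the prediction for an observation sequence trades off matching each observation against forming rewarded label transitions, exactly the structure the following definitions formalize.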
For each input x, there is a set of feasible outputs, Y(x) ⊆ Y. For simplicity, let us assume that Y(x) is finite for all x ∈ X, which is the case in many real-world problems and in all our examples. We denote the set of feasible input-output pairs by Z ⊆ X × Y.

It is common to construct a discriminant function F : Z → ℝ which maps each feasible input-output pair to a compatibility score. To make a prediction for x, this score is maximized over the set of feasible outputs,

h(x) = argmax_{y ∈ Y(x)} F(x, y).    (1)

The score of a pair ⟨x, y⟩ is computed from local fragments, or "parts", of ⟨x, y⟩. In Markov random fields, x is a graph, y is a labeling of the nodes of x, and a local fragment (a part) of ⟨x, y⟩ is a clique in x together with its labeling in y. In parsing with probabilistic context-free grammars, the parts of ⟨x, y⟩ consist of the branches of the tree y, where a branch is an internal node of y together with its children, plus all pairs of a leaf node in y with the word in x labeled by that node. Note that a given branch structure, such as NP → Det N, can occur more than once in a given parse tree.

In general, we let P be a set of (all possible) parts. We assume a "counting function" c such that for p ∈ P and ⟨x, y⟩ ∈ Z, c(p, ⟨x, y⟩) gives the number of times the part p occurs in the pair ⟨x, y⟩ (the count of p in ⟨x, y⟩). For a Mercer kernel k : P × P → ℝ on P, there is an associated RKHS H_k of functions f : P → ℝ, where f measures the goodness of a part p. For any f ∈ H_k, we define a function F_f on Z as

F_f(x, y) = Σ_{p ∈ P} c(p, ⟨x, y⟩) f(p).    (2)

Consider a simple chain example. Let Γ be a set of possible observations and Σ be a set of possible hidden states. We take the input x to be a sequence x_1, ..., x_ℓ with x_i ∈ Γ, and we take Y(x) to be the set of all sequences y_1, ..., y_ℓ of the same length as x with y_i ∈ Σ. We can take P to be the set of all pairs ⟨s, s̄⟩ plus all pairs ⟨s, u⟩ with s, s̄ ∈ Σ and u ∈ Γ. Often Σ is taken to be a finite set of "states" and Γ =

The results in pitch accent prediction show the advantage of a sequence model over a non-structured model: STR consistently performs better than SVM. We also observe the usefulness of unlabeled data in both the structured and unstructured models: as U, the number of unlabeled points, increases, so does the accuracy. The improvements from unlabeled data and from structured classification can be considered additive. The small difference between the accuracy on in-sample unlabeled data and on the test data indicates that our framework extends naturally to new data points.

Table 1: Per-label accuracy for Pitch Accent.

1 For more complicated parts, different measures can apply. For example, in sequence classification, if the classifier is evaluated with respect to the correctly classified individual labels in the sequence, W can be such that W_{p,p'} = Σ_{u ∈ p, u' ∈ p'} δ(y(u), y(u')) s̃(u, u'), where s̃ denotes some similarity measure such as the heat kernel. If the evaluation is over segments of the sequence, the similarity can be W_{p,p'} = δ(y(p), y(p')) Σ_{u ∈ p, u' ∈ p'} s̃(u, u'), where y(p) denotes all the label nodes in the part p.

In OCR, on the other hand, STR does not improve over SVM. Even though unlabeled data improves accuracy, performing sequence classification is not helpful due to the sparsity of structural information.
Since |Σ| = 15 and there are only 10 labeled sequences with average length 8.3, the statistics of the label-label dependencies are quite noisy.

Table 2: OCR.

        U:0     U:412
SVM     43.62   49.96
        -       47.56
STR     49.25   49.91
        -       49.65

7 Conclusions

We presented a discriminative approach to semi-supervised learning of structured and inter-dependent response variables. In this framework, we derived a maximum margin formulation and presented experiments for a simple chain model. Our approach naturally extends to the classification of unobserved structured inputs, which is supported by our empirical results showing similar accuracy on in-sample unlabeled data and out-of-sample test data.

References

[1] Y. Altun, T. Hofmann, and A. Smola. Gaussian process classification for segmenting and annotating sequences. In ICML, 2004.

[2] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In ICML, 2003.

[3] M. Belkin, P. Niyogi, and V. Sindhwani. Manifold regularization: a geometric framework for learning from examples. Technical Report 06, University of Chicago CS, 2004.

[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In COLT, 1998.

[5] U. Brefeld, C. Büscher, and T. Scheffer. Multi-view discriminative sequential learning. In ECML, 2005.

[6] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. In NIPS, 2002.

[7] M. Collins and N. Duffy. Convolution kernels for natural language. In NIPS, 2001.

[8] O. Delalleau, Y. Bengio, and N. Le Roux. Efficient non-parametric function induction in semi-supervised learning. In AISTATS, 2005.

[9] T. Joachims. Transductive inference for text classification using support vector machines. In ICML, pages 200–209, 1999.

[10] G. Kimeldorf and G. Wahba.
Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and Applications, 33:82–95, 1971.

[11] J. Lafferty, Y. Liu, and X. Zhu. Kernel conditional random fields: Representation, clique selection, and semi-supervised learning. In ICML, 2004.

[12] K. Nigam, A. K. McCallum, S. Thrun, and T. M. Mitchell. Learning to classify text from labeled and unlabeled documents. In AAAI, pages 792–799, Madison, US, 1998.

[13] V. Sindhwani, P. Niyogi, and M. Belkin. Beyond the point cloud: from transductive to semi-supervised learning. In ICML, 2005.

[14] B. Taskar, C. Guestrin, and D. Koller. Max-margin Markov networks. In NIPS, 2004.

[15] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.

[16] T. Zhang. Personal communication.
", "award": [], "sourceid": 2850, "authors": [{"given_name": "Y.", "family_name": "Altun", "institution": null}, {"given_name": "D.", "family_name": "McAllester", "institution": null}, {"given_name": "M.", "family_name": "Belkin", "institution": null}]}