{"title": "Advice Refinement in Knowledge-Based SVMs", "book": "Advances in Neural Information Processing Systems", "page_first": 1728, "page_last": 1736, "abstract": "Knowledge-based support vector machines (KBSVMs) incorporate advice from domain experts, which can improve generalization significantly. A major limitation that has not been fully addressed occurs when the expert advice is imperfect, which can lead to poorer models. We propose a model that extends KBSVMs and is able to not only learn from data and advice, but also simultaneously improve the advice. The proposed approach is particularly effective for knowledge discovery in domains with few labeled examples. The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. Experimental results demonstrate that these algorithms yield useful refinements to expert advice, as well as improve the performance of the learning algorithm overall.", "full_text": "Advice Re\ufb01nement in Knowledge-Based SVMs\n\nGautam Kunapuli\n\nRichard Maclin\n\nUniv. of Wisconsin-Madison\n\nUniv. of Minnesota, Duluth\n\n1300 University Avenue\n\nMadison, WI 53705\n\n1114 Kirby Drive\nDuluth, MN 55812\n\nJude W. Shavlik\n\nUniv. of Wisconsin-Madison\n\n1300 University Avenue\n\nMadison, WI 53705\n\nkunapuli@wisc.edu\n\nrmaclin@d.umn.edu\n\nshavlik@cs.wisc.edu\n\nAbstract\n\nKnowledge-based support vector machines (KBSVMs) incorporate advice from\ndomain experts, which can improve generalization signi\ufb01cantly. A major limita-\ntion that has not been fully addressed occurs when the expert advice is imperfect,\nwhich can lead to poorer models. We propose a model that extends KBSVMs\nand is able to not only learn from data and advice, but also simultaneously im-\nproves the advice. The proposed approach is particularly effective for knowledge\ndiscovery in domains with few labeled examples. 
The proposed model contains bilinear constraints, and is solved using two iterative approaches: successive linear programming and a constrained concave-convex approach. Experimental results demonstrate that these algorithms yield useful refinements to expert advice, as well as improve the performance of the learning algorithm overall.

1 Introduction

We are primarily interested in learning in domains where there is only a small amount of labeled data but advice can be provided by a domain expert. In such scenarios, the goal is to refine this advice, which is usually only approximately correct, during learning, so as to produce interpretable models that generalize better and aid knowledge discovery. For learning in complex environments, a number of researchers have shown that incorporating prior knowledge from experts can greatly improve the generalization of the model learned, often with many fewer labeled examples. Such approaches have been shown in rule-learning methods [16], artificial neural networks (ANNs) [21] and support vector machines (SVMs) [10, 17]. One limitation of these methods concerns how well they adapt when the knowledge provided by the expert is inexact or partially correct. Many of the rule-learning methods focus on rule refinement to learn better rules, while ANNs form the rules as portions of the network which are refined by backpropagation. Further, ANN methods have been paired with rule-extraction methods [3, 20] to try to understand the resulting learned network and provide rules that are easily interpreted by domain experts.

We consider the framework of knowledge-based support vector machines (KBSVMs), introduced by Fung et al. [6]. KBSVMs have been extensively studied, and in addition to linear classification, they have been extended to incorporate kernels [5], nonlinear advice [14] and kernel approximation [13]. Recently, Kunapuli et al.
derived an online version of KBSVMs [9], while other approaches, such as that of Le et al. [11], modify the hypothesis space rather than the optimization problem. Extensive empirical results from this prior work establish that expert advice can be effective, especially for biomedical applications such as breast-cancer diagnosis. KBSVMs are an attractive methodology for knowledge discovery as they can produce good models that generalize well with a small amount of labeled data.

Advice tends to be rule-of-thumb, based on the expert's accumulated experience in the domain, and it may not always be accurate. Rather than simply ignoring or heavily penalizing inaccurate rules, the effectiveness of the advice can be improved through refinement. There are two main reasons for this: first, refined rules improve the overall generalization; and second, if the refinements to the advice are interpretable by the domain experts, they help the experts understand the phenomena underlying their applications, and consequently greatly facilitate the knowledge-discovery process. This is the motivation behind this work.

Figure 1: (left) Standard SVM, trades off complexity and loss wrt the data; (center) Knowledge-based SVM, also trades off loss wrt advice. A piece of advice set 1 extends over the margin, and is penalized as the advice error. No part of advice set 2 touches the margin, i.e., none of the rules in advice set 2 are useful as support constraints. (right) SVM that refines advice in two ways: (1) advice set 1 is refined so that no part of it is on the wrong side of the optimal hyperplane, minimizing advice error; (2) advice set 2 is expanded until it touches the optimal margin, thus maximizing coverage of input space.

KBSVMs already have several desirable properties that make them an ideal target for refinement.
First, advice is specified as polyhedral regions in input space, whose constraints on the features are easily interpretable by non-experts. Second, it is well-known that KBSVMs can learn to generalize well with small data sets [9], and can even learn from advice alone [6]. Finally, owing to the simplicity of the formulation, advice-refinement terms for the rules can be incorporated directly into the model.

We further motivate advice refinement in KBSVMs with the following example. Figure 1 (left) shows an SVM, which trades off regularization with the data error. Figure 1 (center) illustrates KBSVMs in their standard form as shown in [6]. As mentioned before, expert rules are specified in the KBSVM framework as polyhedral advice regions in input space. They introduce a bias to focus the learner on a model that also includes advice of the form ∀x, (x ∈ advice region i) ⇒ class(x) = 1. Advice regarding the regions for which class(x) = −1 can be specified similarly.

In the KBSVM (Figure 1, center), each advice region contributes to the final hypothesis via its advice vector, u^1 and u^2 (as introduced in [6]; also see Section 2). The individual constraints that touch or intersect the margin have non-zero components u^i_j. As a piece of advice region 1 extends beyond the margin, u^1 ≠ 0; furthermore, analogous to data error, this overlap is penalized as the advice error. As no part of advice set 2 touches the margin, u^2 = 0 and none of its rules contribute anything to the final classifier. Again, analogous to support vectors, rules with non-zero components u^i_j are called support constraints [6]. Consequently, in the final classifier the advice sets are incorporated with advice error (advice set 1) or are completely ignored (advice set 2). Even though the rules are inaccurate, they are able to improve generalization compared to the SVM.
However, simply penalizing advice that introduces errors can make learning difficult, as the user must carefully trade off between optimizing the data loss and the advice loss.

Now, consider an SVM that is capable of refining inaccurate advice (Figure 1, right). When advice is inaccurate and intersects the hyperplane, it is truncated so as to minimize the advice error. Advice that was originally ignored is extended to cover as much of the input space as is feasible. The optimal classifier has now minimized the error with respect to the data and the refined advice, and is able to further improve upon the performance of not just the SVM but also the KBSVM. Thus, the goal is to refine potentially inaccurate expert advice during learning so as to learn a model with the best generalization.

Our approach generalizes the work of Maclin et al. [12] to produce a model that corrects the polyhedral advice regions of KBSVMs. The resulting mathematical program is no longer a linear or quadratic program owing to bilinear correction factors in the constraints. We propose two algorithmic techniques to solve the resulting bilinear program, one based on successive linear programming [12], and the other based on a concave-convex procedure [24]. Before we describe advice refinement, we briefly introduce our notation and KBSVMs.

We wish to learn a linear classifier (w′x = b) given ℓ labeled examples (x_j, y_j), j = 1, ..., ℓ, with x_j ∈ R^n and labels y_j ∈ {±1}. Data are collected row-wise in the matrix X ∈ R^{ℓ×n}, while Y = diag(y) is the diagonal matrix of the labels. We assume that m advice sets (D_i, d_i, z_i), i = 1, ..., m, are given in addition to the data (see Section 2); if the i-th advice set has k_i constraints, we have D_i ∈ R^{k_i×n}, d_i ∈ R^{k_i} and z_i ∈ {±1}. The absolute value of a scalar y is denoted |y|, the 1-norm of a vector x is denoted ||x||_1 = Σ_{i=1}^n |x_i|, and the entrywise 1-norm of a p × q matrix A ∈ R^{p×q} is denoted ||A||_1 = Σ_{i=1}^p Σ_{j=1}^q |A_{ij}|. Finally, e is a vector of ones of appropriate dimension.

2 Knowledge-Based Support Vector Machines

In KBSVMs, advice can be specified about every potential data point in the input space that satisfies certain advice constraints. For example, consider the task of learning to diagnose diabetes based on features such as age, blood pressure, body mass index (bmi), plasma glucose concentration (gluc), etc. The National Institutes of Health (NIH) provide the following guidelines to establish risk for Type-2 Diabetes1: a person who is obese (bmi ≥ 30) with gluc ≥ 126 is at strong risk for diabetes, while a person who is at normal weight (bmi ≤ 25) with gluc ≤ 100 is unlikely to have diabetes. This leads to two advice sets, one for each class:

(bmi ≥ 30) ∧ (gluc ≥ 126) ⇒ diabetes,    (bmi ≤ 25) ∧ (gluc ≤ 100) ⇒ ¬diabetes,   (1)

where ¬ is the negation operator. In general, rules such as the ones above define a polyhedral region of the input space and are expressed as the implication

D_i x ≤ d_i ⇒ z_i(w′x − b) ≥ 1,   (2)

where the advice label z_i = +1 indicates that all points x that satisfy the constraints of the i-th advice set, D_i x ≤ d_i, belong to class +1, while z_i = −1 indicates the same for the other class. The standard linear SVM formulation (without incorporating advice) for binary classification optimizes model complexity + λ data loss:

min_{w, b, ξ≥0}  ||w||_1 + λe′ξ,   s.t.  Y(Xw − eb) + ξ ≥ e.   (3)

The implications (2), for i = 1, ..., m, can be incorporated into (3) using the nonhomogeneous Farkas theorem of the alternative [6], which introduces advice vectors u^i.
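As an aside, the advice encoding (1)–(2) and the linear program (3) are simple enough to prototype directly. The sketch below is illustrative only and is not the authors' implementation: it assumes a two-feature ordering [gluc, bmi], uses SciPy's `linprog`, and handles the 1-norm objective with the standard split w = p − q, p, q ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

# Advice sets (1) in the polyhedral form D x <= d of implication (2).
# Feature order [gluc, bmi] is an assumption made for this illustration.
# (gluc >= 126) and (bmi >= 30) => diabetes (z = +1)
D1 = np.array([[-1.0, 0.0], [0.0, -1.0]]); d1 = np.array([-126.0, -30.0]); z1 = +1
# (gluc <= 100) and (bmi <= 25) => not diabetes (z = -1)
D2 = np.array([[1.0, 0.0], [0.0, 1.0]]); d2 = np.array([100.0, 25.0]); z2 = -1

def in_region(D, d, x):
    """True iff x satisfies all advice constraints D x <= d."""
    return bool(np.all(D @ x <= d))

def svm_lp(X, y, lam=10.0):
    """1-norm SVM (3): min ||w||_1 + lam*e'xi  s.t.  Y(Xw - eb) + xi >= e.
    Split w = p - q with p, q >= 0; decision variables are [p, q, b, xi]."""
    l, n = X.shape
    c = np.concatenate([np.ones(2 * n), [0.0], lam * np.ones(l)])
    YX = np.diag(y) @ X
    # Y(Xw - eb) + xi >= e   rewritten as   -YX p + YX q + y b - xi <= -e
    A_ub = np.hstack([-YX, YX, y.reshape(-1, 1), -np.eye(l)])
    b_ub = -np.ones(l)
    bounds = [(0, None)] * (2 * n) + [(None, None)] + [(0, None)] * l
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    w = res.x[:n] - res.x[n:2 * n]
    return w, res.x[2 * n]
```

On a small linearly separable sample (positives with high gluc/bmi, negatives with low values) this LP recovers a separating (w, b); the advice machinery of (4) below would add the variables u^i, η^i, ζ_i on top of this LP.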
The advice vectors perform the same role as the dual multipliers α in the classical SVM. Recall that points with non-zero α's are the support vectors, which additively contribute to w. Similarly, the constraints of an advice set which have non-zero u^i's are called support constraints. The resulting formulation is the KBSVM, which optimizes model complexity + λ data loss + µ advice loss:

min_{w, b, (ξ, u^i, η^i, ζ_i)≥0}  ||w||_1 + λe′ξ + µ Σ_{i=1}^m (e′η^i + ζ_i)
s.t.  Y(Xw − eb) + ξ ≥ e,
      −η^i ≤ D_i′u^i + z_i w ≤ η^i,
      −d_i′u^i − z_i b + ζ_i ≥ 1,   i = 1, ..., m.   (4)

In the case of inaccurate advice, the advice errors η^i and ζ_i soften the advice constraints, analogous to the data errors ξ. Returning to Figure 1: for advice set 1, η^1, ζ_1 and u^1 are non-zero, while for advice set 2, u^2 = 0. The influence of data and advice is determined by the choice of the parameters λ and µ, which reflect the user's trust in the data and the advice respectively.

3 Advice-Refining Knowledge-based Support Vector Machines

Previously, Maclin et al. [12] formulated a model to refine advice in KBSVMs. However, their model is limited in that only the terms d_i are refined, which, as we discuss below, greatly restricts the types of refinements that are possible. They only consider refinement terms f^i for the right-hand side of the i-th advice set, and attempt to refine each rule such that

D_i x ≤ (d_i − f^i) ⇒ z_i(w′x − b) ≥ 1,   i = 1, ..., m.   (5)

The resulting formulation adds refinement terms to the KBSVM model (4) in the advice constraints, as well as in the objective. The latter allows the overall extent of the refinement to be controlled by the refinement parameter ν > 0.
This formulation was called the Refining-Rules Support Vector Machine (RRSVM):

min_{w, b, f^i, (ξ, u^i, η^i, ζ_i)≥0}  ||w||_1 + λe′ξ + µ Σ_{i=1}^m (e′η^i + ζ_i) + ν Σ_{i=1}^m ||f^i||_1
s.t.  Y(Xw − eb) + ξ ≥ e,
      −η^i ≤ D_i′u^i + z_i w ≤ η^i,
      −(d_i − f^i)′u^i − z_i b + ζ_i ≥ 1,   i = 1, ..., m.   (6)

1 http://diabetes.niddk.nih.gov/DM/pubs/∼riskfortype2

This problem is no longer an LP owing to the bilinear terms f^i′u^i, which make the refinement constraints non-convex. Maclin et al. solve this problem using successive linear programming (SLP), wherein linear programs arising from alternately fixing either the advice vectors u^i or the refinement terms f^i are solved iteratively.

We consider a full generalization of the RRSVM approach and develop a model in which it is possible to refine the entire advice region Dx ≤ d. This allows for much more flexibility in refining the advice based on the data, while still retaining interpretability of the resulting refined advice. In addition to the terms f^i, we propose the introduction of additional refinement terms F_i into the model, so that we can refine the rules in as general a manner as possible:

(D_i − F_i)x ≤ (d_i − f^i) ⇒ z_i(w′x − b) ≥ 1,   i = 1, ..., m.   (7)

Recall that for each advice set we have D_i ∈ R^{k_i×n} and d_i ∈ R^{k_i}, i.e., the i-th advice set contains k_i constraints. The corresponding refinement terms F_i and f^i have the same dimensions as D_i and d_i respectively.
The formulation (6) now includes the additional refinement terms F_i, and optimizes:

min_{w, b, F_i, f^i, (ξ, u^i, η^i, ζ_i)≥0}  ||w||_1 + λe′ξ + µ Σ_{i=1}^m (e′η^i + ζ_i) + ν Σ_{i=1}^m (||F_i||_1 + ||f^i||_1)
s.t.  Y(Xw − eb) + ξ ≥ e,
      −η^i ≤ (D_i − F_i)′u^i + z_i w ≤ η^i,
      −(d_i − f^i)′u^i − z_i b + ζ_i ≥ 1,   i = 1, ..., m.   (8)

The objective function of (8) trades off the effect of refinement in each of the advice sets via the refinement parameter ν. This is the Advice-Refining KBSVM (arkSVM); it improves upon the work of Maclin et al. in two important ways. First, refining d alone is highly restrictive as it allows only for the translation of the boundaries of the polyhedral advice; the generalized refinement offered by arkSVMs allows for much more flexibility, as the boundaries of the advice can be translated and rotated (see Figure 2). Second, the newly added refinement terms F_i′u^i are also bilinear, and do not make the overall problem more complex; in addition to the successive linear programming approach of [12], we also propose a concave-convex procedure that leads to an approach based on successive quadratic programming. We provide details of both approaches next.

3.1 arkSVMs via Successive Linear Programming

One approach to solving bilinear programming problems is to solve a sequence of linear programs while alternately fixing the bilinear variables. This approach is called successive linear programming, and has been used to solve various machine learning formulations, for instance [1, 2]. In this approach, which was also adopted by [12], we solve the LPs arising from alternately fixing the sources of bilinearity: (F_i, f^i), i = 1, ..., m, and {u^i}, i = 1, ..., m. Algorithm 1 describes this approach.
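To make the alternation concrete before detailing the two steps, here is the same successive-LP template on a toy bilinear program (illustrative only; unrelated to the arkSVM experiments, and SciPy's `linprog` is assumed): minimize xy subject to x + y ≥ 2 and 0 ≤ x, y ≤ 3. Fixing either variable leaves an LP in the other, just as fixing (F_i, f^i) or {u^i} leaves an LP in (8).

```python
import numpy as np
from scipy.optimize import linprog

def slp_toy(y_hat=1.0, iters=10):
    """Successive linear programming on: min x*y  s.t.  x + y >= 2, 0 <= x, y <= 3.
    Each sub-problem fixes one bilinear variable and solves an LP in the other."""
    x_hat = y_hat
    for _ in range(iters):
        # Fix y = y_hat: min y_hat * x  s.t.  -x <= -(2 - y_hat), 0 <= x <= 3.
        x_hat = linprog([y_hat], A_ub=[[-1.0]], b_ub=[-(2.0 - y_hat)],
                        bounds=[(0.0, 3.0)]).x[0]
        # Fix x = x_hat: the same LP with the roles of x and y swapped.
        y_hat = linprog([x_hat], A_ub=[[-1.0]], b_ub=[-(2.0 - x_hat)],
                        bounds=[(0.0, 3.0)]).x[0]
    return x_hat, y_hat
```

Started from y = 1, the iterates settle at (x, y) = (1, 1) with objective value 1, a stationary point but not the global minimum (x = 0, y = 2 is feasible with objective 0); successive linear programming in general carries exactly this local-solution caveat.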
At the t-th iteration, the algorithm alternates between the following steps:

• (Estimation Step) When the refinement terms (F̂_i^t, f̂^{i,t}), i = 1, ..., m, are fixed, the resulting LP becomes a standard KBSVM which attempts to find a data-driven estimate of the advice vectors {u^i} using the current refinement of the advice region: (D_i − F̂_i^t)x ≤ (d_i − f̂^{i,t}).

• (Refinement Step) When the advice-estimate terms {û^{i,t}} are fixed, the resulting LP solves for (F_i, f^i), i = 1, ..., m, and attempts to further refine the advice regions based on the estimates computed from the data in the previous step.

Proposition 1 I. For ε = 0, the sequence of objective values converges to the value ||w̄||_1 + λe′ξ̄ + µ Σ_{i=1}^m (e′η̄^i + ζ̄_i) + ν Σ_{i=1}^m (||F̄_i||_1 + ||f̄^i||_1), where the data and advice errors (ξ̄, η̄^i, ζ̄_i) are computed from any accumulation point (w̄, b̄, ū^i, F̄_i, f̄^i) of the sequence of iterates (ŵ^t, b̂^t, û^{i,t}, F̂_i^t, f̂^{i,t}), t = 1, 2, ..., generated by Algorithm 1.

II. Such an accumulation point satisfies the local-minimum condition

(w̄, b̄) ∈ arg min_{w, b, (ξ, η^i, ζ_i)≥0} min_{u^i≥0}  ||w||_1 + λe′ξ + µ Σ_{i=1}^m (e′η^i + ζ_i)
subject to  Y(Xw − eb) + ξ ≥ e,
            −η^i ≤ (D_i − F̄_i)′u^i + z_i w ≤ η^i,
            −(d_i − f̄^i)′u^i − z_i b + ζ_i ≥ 1,   i = 1, ..., m.

Algorithm 1  arkSVM via Successive Linear Programming (arkSVM-sla)
1: initialize: t = 1, F̂_i^1 = 0, f̂^{i,1} = 0
2: while feasible do
3:   if x is not feasible for (D_i − F̂_i^t)x ≤ (d_i − f̂^{i,t}), return failure
4:   (estimation step) solve for {û^{i,t+1}}: the KBSVM LP (4), with D_i and d_i replaced by the current refined advice D_i − F̂_i^t and d_i − f̂^{i,t}
5:   (refinement step) solve for (F̂_i^{t+1}, f̂^{i,t+1}): the LP obtained from (8) by fixing the advice vectors at u^i = û^{i,t+1}
6:   (termination test) if Σ_{i=1}^m (||F̂_i^t − F̂_i^{t+1}|| + ||f̂^{i,t} − f̂^{i,t+1}||) ≤ ε, return solution
7:   (continue) t = t + 1
8: end while

Algorithm 2  arkSVM via Successive Quadratic Programming (arkSVM-sqp)
1: initialize: t = 1, F̂_i^1 = 0, f̂^{i,1} = 0
2: while feasible do
3:   if x is not feasible for (D_i − F̂_i^t)x ≤ (d_i − f̂^{i,t}), return failure
4:   solve for (û^{i,t+1}, F̂_i^{t+1}, f̂^{i,t+1}): the QCLP obtained from (8) by replacing its bilinear constraints with the convexified constraints (10–12), linearized at (F̂_i^t, f̂^{i,t}, û^{i,t}), for i = 1, ..., m and j = 1, ..., n
5:   (termination test) if Σ_{i=1}^m (||F̂_i^t − F̂_i^{t+1}|| + ||f̂^{i,t} − f̂^{i,t+1}||) ≤ ε, return solution
6:   (continue) t = t + 1
7: end while

3.2 arkSVMs via Successive Quadratic Programming

In addition to the above approach, we introduce another algorithm (Algorithm 2) that is based on successive quadratic programming. In the constraint (D_i − F_i)′u^i + z_i w − η^i ≤ 0, only the refinement term F_i′u^i is bilinear; the rest of the constraint is linear. Denote the j-th components of w and η^i by w_j and η^i_j, and the j-th columns of D_i and F_i by D_{ij} and F_{ij}. A general bilinear term r′s, which is non-convex, can be written as the difference of two convex terms: r′s = (1/4)||r + s||² − (1/4)||r − s||². Thus, we have the equivalent constraint

D_{ij}′u^i + z_i w_j − η^i_j + (1/4)||F_{ij} − u^i||² ≤ (1/4)||F_{ij} + u^i||²,   (9)

and both sides of this constraint are convex and quadratic. We can linearize the right-hand side of (9) around a current estimate of the bilinear variables (F̂_{ij}^t, û^{i,t}):

D_{ij}′u^i + z_i w_j − η^i_j + (1/4)||F_{ij} − u^i||² ≤ (1/4)||F̂_{ij}^t + û^{i,t}||² + (1/2)(F̂_{ij}^t + û^{i,t})′((F_{ij} − F̂_{ij}^t) + (u^i − û^{i,t})).   (10)

Similarly, the constraint −(D_i − F_i)′u^i − z_i w − η^i ≤ 0 can be replaced by

−D_{ij}′u^i − z_i w_j − η^i_j + (1/4)||F_{ij} + u^i||² ≤ (1/4)||F̂_{ij}^t − û^{i,t}||² + (1/2)(F̂_{ij}^t − û^{i,t})′((F_{ij} − F̂_{ij}^t) − (u^i − û^{i,t})),   (11)

Figure 2: Toy data set (Section 4.1) using (left) RRSVM, (center) arkSVM-sla and (right) arkSVM-sqp. Orange and green unhatched regions show the original advice.
The dashed lines show the margin, ||w||_∞. For each method, we show the refined advice: vertically hatched for Class +1, and diagonally hatched for Class −1.

while the constraint d_i′u^i − f^i′u^i + z_i b + 1 − ζ_i ≤ 0 is replaced by

d_i′u^i + z_i b + 1 − ζ_i + (1/4)||f^i − u^i||² ≤ (1/4)||f̂^{i,t} + û^{i,t}||² + (1/2)(f̂^{i,t} + û^{i,t})′((f^i − f̂^{i,t}) + (u^i − û^{i,t})).   (12)

The right-hand sides in (10–12) are affine, and hence the entire set of constraints is now convex. Replacing the original bilinear non-convex constraints of (8) with these convexified relaxations results in a quadratically-constrained linear program (QCLP). The quadratic constraints are more restrictive than their non-convex counterparts, which makes the feasible set of this problem a subset of that of the original problem. Now, we can iteratively solve the resulting QCLP: at the t-th iteration, the restricted problem uses the current estimate to construct a new feasible point, and iterating this procedure produces a sequence of feasible points with decreasing objective values. The approach described here is essentially the constrained concave-convex procedure (CCCP), which has been discovered and rediscovered several times. Most recently, the approach was described in the context of machine learning by Yuille and Rangarajan [24], and by Smola and Vishwanathan [19], who also derived conditions under which the algorithm converges to a local solution.
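Two elementary facts drive this construction, and both are easy to verify numerically: the exact decomposition r′s = (1/4)||r + s||² − (1/4)||r − s||², and the fact that a convex function dominates its first-order expansion, which is why the linearized right-hand sides in (10–12) can only shrink the feasible region. A quick check (NumPy assumed; not part of the original paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def g(r, s):
    """Convex part of the decomposition r's = g(r,s) - (1/4)||r - s||^2."""
    return 0.25 * np.dot(r + s, r + s)

def g_tangent(r, s, r0, s0):
    """First-order expansion of g around (r0, s0), as used in (10)-(12);
    the gradient of g in each argument is (r + s)/2."""
    return g(r0, s0) + 0.5 * (r0 + s0) @ ((r - r0) + (s - s0))

for _ in range(1000):
    r, s, r0, s0 = (rng.standard_normal(5) for _ in range(4))
    # Exact bilinear decomposition.
    assert abs(r @ s - (g(r, s) - 0.25 * np.dot(r - s, r - s))) < 1e-9
    # Convexity: g dominates its tangent plane, so replacing the right-hand
    # side of (9) with the tangent tightens the constraint.
    assert g(r, s) >= g_tangent(r, s, r0, s0) - 1e-9
```

The second assertion is exactly why each convexified QCLP is a restriction of (8): any point feasible for the linearized constraint is also feasible for the original one.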
The following convergence theorem is due to [19].

Proposition 2 For Algorithm 2, the sequence of objective values converges to the value ||w̄||_1 + λe′ξ̄ + µ Σ_{i=1}^m (e′η̄^i + ζ̄_i) + ν Σ_{i=1}^m (||F̄_i||_1 + ||f̄^i||_1), where (w̄, b̄, ū^i, F̄_i, f̄^i, ξ̄, η̄^i, ζ̄_i) is the local-minimum solution of (8), provided that the constraints (10–12), in conjunction with the convex constraints Y(Xw − eb) + ξ ≥ e, ξ ≥ 0, u^i ≥ 0, ζ_i ≥ 0, satisfy suitable constraint qualifications at the point of convergence of the algorithm.

Both Algorithms 1 and 2 produce local-minimum solutions to the arkSVM formulation (8). For either solution, the following proposition holds, which shows that either algorithm produces a refinement of the original polyhedral advice regions. The proof is a direct consequence of Proposition 2.1 of [13].

Proposition 3 Let (w̄, b̄, ū^i, F̄_i, f̄^i, ξ̄, η̄^i, ζ̄_i) be the local-minimum solution produced by Algorithm 1 or Algorithm 2.
Then, the following refinement to the advice sets holds:

(D_i − F̄_i)x ≤ (d_i − f̄^i) ⇒ z_i(w̄′x − b̄) ≥ −η̂^i′x − ζ̄_i,

where −η̄^i ≤ η̂^i ≤ η̄^i is such that D_i′ū^i + w̄ + η̂^i = 0.

4 Experiments

We present the results of several experiments that compare the performance of three algorithms, RRSVMs (which only refine the d term in Dx ≤ d), arkSVM-sla (successive linear programming) and arkSVM-sqp (successive quadratic programming), with that of standard SVMs and KBSVMs. The LPs were solved using QSOPT2, while the QCLPs were solved using SDPT-3 [22].

4.1 Toy Example

We geometrically illustrate the behavior of the advice-refinement algorithms discussed previously, using a simple 2-dimensional example (Figure 2). This toy data set consists of 200 points separated by x1 + x2 = 2. There are two advice sets: {S1 : (x1, x2) ≥ 0 ⇒ z = +1} and {S2 : (x1, x2) ≤ 0 ⇒ z = −1}.

2 http://www2.isye.gatech.edu/∼wcook/qsopt/

[Figure 3, left panel: testing error (%) vs. number of training examples for svm, kbsvm, rrsvm, arksvm-sla and arksvm-sqp.]
Figure 3: Diabetes data set, Section 4.2; (left) Results averaged over 10 runs on a hold-out test set of 412 points, with parameters selected by five-fold cross validation; (right) An approximate decision-tree representation of Diabetes Rule 6 before and after refinement. The left branch is chosen if the query at a node is true, and the right branch otherwise. The leaf nodes classify the data point according to ?diabetes.

Both arkSVMs are able to refine the knowledge sets such that no part of S1 lies on the wrong side of the final hyperplane.
In addition, the refinement terms allow for sufficient modification of the advice sets Dx ≤ d so that they fill the input space as much as possible without violating the margin. Compared with RRSVMs, we see that their refinement is restrictive, because corrections are applied only to part of the advice sets rather than fully correcting the advice.

4.2 Case Study 1: PIMA Indians Diabetes Diagnosis

The Pima Indians Diabetes data set [4] has been studied for several decades and is used as a standard benchmark to test many machine learning algorithms. The goal is to predict the onset of diabetes in 768 Pima Indian women within the next 5 years based on current indicators (eight features): number of times pregnant, plasma glucose concentration (gluc), diastolic blood pressure, triceps skin fold test, 2-hour serum insulin, body mass index (bmi), diabetes pedigree function (pedf) and age. Studies [15] show that diabetes incidence among the Pima Indians is significantly higher among subjects with bmi ≥ 30. In addition, a person with impaired glucose tolerance is at a significant risk for, or worse, may already have undiagnosed diabetes [8]. This leads to the following expert rules:

(Diabetes Rule 1) (gluc ≤ 126) ⇒ ¬diabetes,
(Diabetes Rule 2) (gluc ≥ 126) ∧ (gluc ≤ 140) ∧ (bmi ≤ 30) ⇒ ¬diabetes,
(Diabetes Rule 3) (gluc ≥ 126) ∧ (gluc ≤ 140) ∧ (bmi ≥ 30) ⇒ diabetes,
(Diabetes Rule 4) (gluc ≥ 140) ⇒ diabetes.

The diabetes pedigree function was developed by Smith et al. [18], and uses genetic information from family relatives to provide a measure of the expected genetic influence (heredity) on the subject's diabetes risk. The function also takes into account the age of relatives who do have diabetes; on average, Pima Indians are only 36 years old3 when diagnosed with diabetes.
A subject with high heredity who is at least 31 is at a significantly increased risk for diabetes in the next five years:

(Diabetes Rule 5) (pedf ≤ 0.5) ∧ (age ≤ 31) ⇒ ¬diabetes,
(Diabetes Rule 6) (pedf ≥ 0.5) ∧ (age ≥ 31) ⇒ diabetes.

Figure 3 (left) shows that unrefined advice does help initially, especially with as few as 30 data points. However, as more data points become available, the effect of the advice diminishes. In contrast, the advice-refining methods are able to generalize much better with few data points, and eventually converge to a better solution. Finally, Figure 3 (right) shows an approximate tree representation of Diabetes Rule 6 after refinement. This tree was constructed by sampling the space around the refined advice region uniformly, and then training a decision tree that covers as many of the sampled points as possible. This naive approach to rule extraction from refined advice is shown here only to illustrate that it is possible to produce very useful domain-expert-interpretable rules from refinement. More efficient and accurate rule-extraction techniques, inspired by SVM-based rule extraction (for example, [7]), are currently under investigation.

[Figure 4, right panel: testing error (%) vs. number of training examples for svm, kbsvm, rrsvm, arksvm-sla and arksvm-sqp.]
Figure 4: Wargus data set, Section 4.3; (left) An example Wargus scenario; (right) Results using 5-fold cross validation on a hold-out test set of 1000 points.

4.3 Case Study 2: Refining GUI-Collected Human Advice in a Wargus Task

Wargus4 is a real-time strategy game in which two or more players gather resources, build bases and control units in order to conquer opposing players.
It has been widely used to study and evaluate various machine learning and planning algorithms. We evaluate our algorithms on a classification task in the Wargus domain developed by Walker et al. [23] called tower-defense (Figure 4, left). Advice for this task was collected from humans via a graphical human-computer interface (HCI), as detailed in [23]. Each scenario (example) in tower-defense consists of a single tower being attacked by a group of enemy units, and the task is to predict whether the tower will survive the attack and defeat the attackers, given the size and composition of the latter, as well as other factors such as the environment. The data set consists of 80 features including information about units (e.g., archers, ballista, peasants), unit properties (e.g., map location, health), group properties (e.g., #archers, #footmen) and environmental factors (e.g., ?hasMoat).

Walker et al. [23] used this domain to study the feasibility of learning from human teachers. To this end, human players were first trained to identify whether a tower would fall given a particular scenario. Once the humans had learned this task, they were asked to provide advice via a GUI-based interface based on specific examples. This setting lends itself very well to refinement, as the advice collected from human experts represents the sum of their experiences with the domain, but is by no means perfect or exact. The following are some rules provided by human "domain experts":

(Wargus Rule 1) (#footmen ≥ 3) ∧ (?hasMoat = 0) ⇒ falls,
(Wargus Rule 2) (#archers ≥ 5) ⇒ falls,
(Wargus Rule 3) (#ballistas ≥ 1) ⇒ falls,
(Wargus Rule 4) (#ballistas = 0) ∧ (#archers = 0) ∧ (?hasMoat = 1) ⇒ stands.

Figure 4 (right) shows the performance of the various algorithms on the Wargus data set.
As with the previous case study, the arkSVM methods are able not only to learn very effectively from a small data set, but also to improve significantly on the performance of standard knowledge-based SVMs (KBSVMs) and rule-refining SVMs (RRSVMs).

5 Conclusions and Future Work
We have presented two novel knowledge-discovery methods, arkSVM-sla and arkSVM-sqp, that allow SVM learners not only to make use of advice provided by human experts, but also to refine that advice using labeled data. These methods are an advance over previous knowledge-based SVM methods, which either did not refine advice [6] or could refine only simple aspects of the advice [12]. Experimental results demonstrate that our arkSVM methods can take inaccurate advice and revise it to better fit the data. A significant aspect of these learning methods is that the system not only produces a classifier but also produces human-inspectable changes to the user-provided advice, and can do so using small data sets. In terms of future work, we plan to explore several avenues of research, including extending this approach to the nonlinear case for more complex models, better optimization algorithms for improved efficiency, and interpretation of refined rules for non-AI experts.

³http://diabetes.niddk.nih.gov/dm/pubs/pima/kiddis/kiddis.htm
⁴http://wargus.sourceforge.net/index.shtml

Acknowledgements
The authors gratefully acknowledge the support of the Defense Advanced Research Projects Agency under DARPA grant FA8650-06-C-7606 and the National Institutes of Health under NLM grant R01-LM008796. Views and conclusions contained in this document are those of the authors and do not necessarily represent the official opinions or policies, either expressed or implied, of the US government or of DARPA.

References
[1] K. P. Bennett and E. J. Bredensteiner.
A parametric optimization method for machine learning. INFORMS Journal on Computing, 9(3):311–318, 1997.

[2] K. P. Bennett and O. L. Mangasarian. Bilinear separation of two sets in n-space. Computational Optimization and Applications, 2:207–227, 1993.

[3] M. W. Craven and J. W. Shavlik. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, volume 8, pages 24–30, 1996.

[4] A. Frank and A. Asuncion. UCI machine learning repository, 2010.

[5] G. Fung, O. L. Mangasarian, and J. W. Shavlik. Knowledge-based nonlinear kernel classifiers. In Sixteenth Annual Conference on Learning Theory, pages 102–113, 2003.

[6] G. Fung, O. L. Mangasarian, and J. W. Shavlik. Knowledge-based support vector classifiers. In Advances in Neural Information Processing Systems, volume 15, pages 521–528, 2003.

[7] G. Fung, S. Sandilya, and R. B. Rao. Rule extraction from linear support vector machines. In Proc. Eleventh ACM SIGKDD Intl. Conference on Knowledge Discovery in Data Mining, pages 32–40, 2005.

[8] M. I. Harris, K. M. Flegal, C. C. Cowie, M. S. Eberhardt, D. E. Goldstein, R. R. Little, H. M. Wiedmeyer, and D. D. Byrd-Holt. Prevalence of diabetes, impaired fasting glucose, and impaired glucose tolerance in U.S. adults. Diabetes Care, 21(4):518–524, 1998.

[9] G. Kunapuli, K. P. Bennett, A. Shabbeer, R. Maclin, and J. W. Shavlik. Online knowledge-based support vector machines. In Proc. of the European Conference on Machine Learning, pages 145–161, 2010.

[10] F. Lauer and G. Bloch. Incorporating prior knowledge in support vector machines for classification: A review. Neurocomputing, 71(7–9):1578–1594, 2008.

[11] Q. V. Le, A. J. Smola, and T. Gärtner. Simpler knowledge-based support vector machines.
In Proceedings of the Twenty-Third International Conference on Machine Learning, pages 521–528, 2006.

[12] R. Maclin, E. W. Wild, J. W. Shavlik, L. Torrey, and T. Walker. Refining rules incorporated into knowledge-based support vector learners via successive linear programming. In AAAI Twenty-Second Conference on Artificial Intelligence, pages 584–589, 2007.

[13] O. L. Mangasarian, J. W. Shavlik, and E. W. Wild. Knowledge-based kernel approximation. Journal of Machine Learning Research, 5:1127–1141, 2004.

[14] O. L. Mangasarian and E. W. Wild. Nonlinear knowledge-based classification. IEEE Transactions on Neural Networks, 19(10):1826–1832, 2008.

[15] M. E. Pavkov, R. L. Hanson, W. C. Knowler, P. H. Bennett, J. Krakoff, and R. G. Nelson. Changing patterns of Type 2 diabetes incidence among Pima Indians. Diabetes Care, 30(7):1758–1763, 2007.

[16] M. Pazzani and D. Kibler. The utility of knowledge in inductive learning. Mach. Learn., 9:57–94, 1992.

[17] B. Schölkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In Advances in Neural Information Processing Systems, volume 10, pages 640–646, 1998.

[18] J. W. Smith, J. E. Everhart, W. C. Dickson, W. C. Knowler, and R. S. Johannes. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proc. of the Symposium on Comp. Apps. and Medical Care, pages 261–265. IEEE Computer Society Press, 1988.

[19] A. J. Smola and S. V. N. Vishwanathan. Kernel methods for missing variables. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 325–332, 2005.

[20] S. Thrun. Extracting rules from artificial neural networks with distributed representations. In Advances in Neural Information Processing Systems, volume 8, 1995.

[21] G. G. Towell and J. W. Shavlik. Knowledge-based artificial neural networks.
Artificial Intelligence, 70(1–2):119–165, 1994.

[22] R. H. Tütüncü, K. C. Toh, and M. J. Todd. Solving semidefinite-quadratic-linear programs using SDPT3. Mathematical Programming, 95(2), 2003.

[23] T. Walker, G. Kunapuli, N. Larsen, D. Page, and J. W. Shavlik. Integrating knowledge capture and supervised learning through a human-computer interface. In Proc. Fifth Intl. Conf. Knowl. Capture, 2011.

[24] A. L. Yuille and A. Rangarajan. The concave-convex procedure (CCCP). In Advances in Neural Information Processing Systems, volume 13, 2001.