{"title": "Multiple Instance Learning via Disjunctive Programming Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 65, "page_last": 72, "abstract": "", "full_text": "Multiple Instance Learning via Disjunctive Programming Boosting\n\nStuart Andrews\n\nDepartment of Computer Science\n\nBrown University, Providence, RI, 02912\n\nstu@cs.brown.edu\n\nThomas Hofmann\n\nDepartment of Computer Science\n\nBrown University, Providence, RI, 02912\n\nth@cs.brown.edu\n\nAbstract\n\nLearning from ambiguous training data is highly relevant in many\napplications. We present a new learning algorithm for classification\nproblems where labels are associated with sets of patterns instead\nof individual patterns. This encompasses multiple instance learning\nas a special case. Our approach is based on a generalization\nof linear programming boosting and uses results from disjunctive\nprogramming to generate successively stronger linear relaxations of\na discrete non-convex problem.\n\n1 Introduction\n\nIn many applications of machine learning, it is inherently difficult or prohibitively\nexpensive to generate large amounts of labeled training data. However, it is often\nconsiderably less challenging to provide weakly labeled data, where labels or annotations\n$y$ are associated with sets of patterns or bags $X$ instead of individual patterns\n$x \\in X$. These bags reflect a fundamental ambiguity about the correspondence of\npatterns and the associated label, which can be expressed logically as a disjunction\nof the form $\\bigvee_{x \\in X}$ ($x$ is an example of class $y$). In plain English, each labeled bag\ncontains at least one pattern (but possibly more) belonging to this class, but the\nidentities of these patterns are unknown.\n\nA special case of particular relevance is known as multiple instance learning [5]\n(MIL). In MIL labels are binary and the ambiguity is asymmetric in the sense that\nbags with negative labels are always of size one. 
Hence the label uncertainty is\nrestricted to members of positive bags. There are many interesting problems where\ntraining data of this kind arises quite naturally, including drug activity prediction\n[5], content-based image indexing [10] and text categorization [1]. The ambiguity\ntypically arises because of polymorphisms allowing multiple representations, e.g. a\nmolecule which can be in different conformations, or because of a part/whole\nambiguity, e.g. annotations may be associated with images or documents where they\nshould be attached to objects in an image or passages in a document. Notice also\nthat there are two intertwined objectives: the goal may be to learn a pattern-level\nclassifier from ambiguous training examples, but sometimes one may be primarily\ninterested in classifying new bags without necessarily resolving the ambiguity for\nindividual patterns.\n\nA number of algorithms have been developed for MIL, including special purpose\nalgorithms using axis-parallel rectangular hypotheses [5], diverse density [10, 14],\nneural networks [11], and kernel methods [6]. In [1] two versions of a maximum-margin\nlearning architecture for solving the multiple instance learning problem have\nbeen presented. Because of the combinatorial nature of the problem, a simple\noptimization heuristic was used in [1] to learn discriminant functions. In this paper,\nwe take a more principled approach by carefully analyzing the nature of the resulting\noptimization problem and by deriving a sequence of successively stronger relaxations\nthat can be used to compute lower and upper bounds on the objective. 
Since it\nturns out that exploiting sparseness is a crucial aspect, we have focused on a linear\nprogramming formulation by generalizing the LPBoost algorithm [7, 12, 4]. We call\nthe resulting method Disjunctive Programming Boosting (DPBoost).\n\n2 Linear Programming Boosting\n\nLPBoost is a linear programming approach to boosting, which aims at learning\nensemble classifiers of the form $G(x) = \\operatorname{sgn} F(x)$ with $F(x) = \\sum_k \\alpha_k h_k(x)$, where\n$h_k : \\mathbb{R}^d \\to \\{-1, 1\\}$, $k = 1, \\dots, n$ are the so-called base classifiers, weak hypotheses,\nor features and $\\alpha_k \\ge 0$ are combination weights. The ensemble margin of a labeled\nexample $(x, y)$ is defined as $y F(x)$.\n\nGiven a set of labeled training examples $\\{(x_1, y_1), \\dots, (x_m, y_m)\\}$, LPBoost formulates\nthe supervised learning problem using the 1-norm soft margin objective\n\n$$\\min_{\\alpha, \\xi} \\; \\sum_{k=1}^{n} \\alpha_k + C \\sum_{i=1}^{m} \\xi_i \\quad \\text{s.t.} \\quad y_i F(x_i) \\ge 1 - \\xi_i, \\; \\xi_i \\ge 0, \\; \\forall i, \\quad \\alpha_k \\ge 0, \\; \\forall k. \\qquad (1)$$\n\nHere $C > 0$ controls the tradeoff between the hinge loss and the $L_1$ regularization\nterm. Notice that this formulation remains meaningful even if all training examples\nare just negative or just positive [13].\n\nFollowing [4] the dual program of Eq. (1) can be written as\n\n$$\\max_{u} \\; \\sum_{i=1}^{m} u_i \\quad \\text{s.t.} \\quad \\sum_{i=1}^{m} u_i y_i h_k(x_i) \\le 1, \\; \\forall k, \\quad 0 \\le u_i \\le C, \\; \\forall i. \\qquad (2)$$\n\nIt is useful to take a closer look at the KKT complementary conditions\n\n$$u_i \\left( y_i F(x_i) + \\xi_i - 1 \\right) = 0 \\quad \\text{and} \\quad \\alpha_k \\left( \\sum_{i=1}^{m} u_i y_i h_k(x_i) - 1 \\right) = 0. \\qquad (3)$$\n\nSince the optimal values of the slack variables are implicitly determined by $\\alpha$ as\n$\\xi_i(\\alpha) = [1 - y_i F(x_i)]_+$, the first set of conditions states that $u_i = 0$ whenever\n$y_i F(x_i) > 1$. 
Since $u_i$ can be interpreted as the \u201cmisclassification\u201d cost, this implies\nthat only instances with tight margin constraints may have non-vanishing associated\ncosts. The second set of conditions ensures that $\\alpha_k = 0$ if $\\sum_{i=1}^{m} u_i y_i h_k(x_i) < 1$,\nwhich states that a weak hypothesis $h_k$ is never included in the ensemble if its\nweighted score $\\sum_i u_i y_i h_k(x_i)$ is strictly below the maximum score of 1. So a typical\nLPBoost solution may be sparse in two ways: (i) only a small number of weak\nhypotheses with $\\alpha_k > 0$ may contribute to the ensemble and (ii) the solution may\nonly depend on a subset of the training data, i.e. those instances with $u_i > 0$.\n\nLPBoost exploits the sparseness of the ensemble by incrementally selecting columns\nfrom the simplex tableau and optimizing the smaller tableau. This amounts to\nfinding in each round a hypothesis $h_k$ for which the constraint in Eq. (2) is violated,\nadding it to the ensemble and re-optimizing the tableau with the selected columns.\nAs a column selection heuristic the authors of [4] propose to use the magnitude of\nthe violation, i.e. pick the weak hypothesis $h_k$ with maximal score $\\sum_i u_i y_i h_k(x_i)$.\n\n3 Disjunctive Programming Boosting\n\nIn order to deal with pattern ambiguity, we employ the disjunctive programming\nframework [2, 9]. In the spirit of transductive large margin methods [8, 3], we\npropose to estimate the parameters $\\alpha$ of the discriminant function in a way that\nachieves a large margin for at least one of the patterns in each bag. Applying this\nprinciple, we can compile the training data into a set of disjunctive constraints on\n$\\alpha$. To that end, let us define the following polyhedra\n\n$$H_i(x) \\equiv \\Big\\{ (\\alpha, \\xi) : y_i \\sum_k \\alpha_k h_k(x) + \\xi_i \\ge 1 \\Big\\}, \\qquad Q \\equiv \\{ (\\alpha, \\xi) : \\alpha, \\xi \\ge 0 \\}. \\qquad (4)$$\n\nThen we can formulate the following disjunctive program:\n\n$\\min_{\\alpha, \\xi} \\; \\sum_{k=1}^{n} \\alpha_k + C \\sum_{i=1}^{m} \\xi_i$, s.t. 
$(\\alpha, \\xi) \\in Q \\cap \\bigcap_i \\bigcup_{x \\in X_i} H_i(x)$. (5)\n\nNotice that if $|X_i| \\ge 2$ then the constraint imposed by $X_i$ is highly non-convex,\nsince it is defined via a union of halfspaces. However, for trivial bags with $|X_i| = 1$,\nthe resulting constraints are the same as in Eq. (1). Since we will handle these two\ncases quite differently in the sequel, let us introduce index sets $I = \\{i : |X_i| \\ge 2\\}$\nand $J = \\{j : |X_j| = 1\\}$.\n\nA suitable way to define a relaxation to this non-convex optimization problem is\nto replace the disjunctive set in Eq. (5) by its convex hull. As shown in [2], a\nwhole hierarchy of such relaxations can be built, using the fundamental fact that\n$\\text{cl-conv}(A) \\cap \\text{cl-conv}(B) \\supseteq \\text{cl-conv}(A \\cap B)$, where $\\text{cl-conv}(A)$ denotes the closure of\nthe convex hull of the limiting points of $A$. This means a tighter convex relaxation\nis obtained if we intersect as many sets as possible before taking their convex hull.\nSince repeated intersections of disjunctive sets with more than one element each\nlead to a combinatorial blow-up in the number of constraints, we propose to intersect\nevery ambiguous disjunctive constraint with every non-ambiguous constraint\nas well as with $Q$. This is also called a parallel reduction step [2]. It results in the\nfollowing convex relaxation of the constraints in Eq. (5)\n\n$$(\\alpha, \\xi) \\in \\bigcap_{i \\in I} \\text{cl-conv} \\Big[ \\bigcup_{x \\in X_i} \\Big( H_i(x) \\cap Q \\cap \\bigcap_{j \\in J} H_j(x_j) \\Big) \\Big], \\qquad (6)$$\n\nwhere we have abused the notation slightly and identified $X_j = \\{x_j\\}$ for bags with\none pattern. 
The rationale in using this relaxation is that the resulting convex\noptimization problem is tractable and may provide a reasonably accurate approximation\nto the original disjunctive program, which can be further strengthened by\nusing it in combination with branch-and-bound search.\n\nThere is a lift-and-project representation of the convex hulls in Eq. (6), i.e. one\ncan characterize the feasible set as a projection of a higher dimensional polyhedron\nwhich can be explicitly characterized [2].\n\nProposition 1. Assume a set of non-empty linear constraints $H_i \\equiv \\{z : A_i z \\ge b_i\\} \\ne \\emptyset$\nis given. Then $z \\in \\text{cl-conv} \\bigcup_i H_i$ if and only if there exist $z^j$ and $\\eta^j \\ge 0$\nsuch that\n\n$$z = \\sum_j z^j, \\qquad \\sum_j \\eta^j = 1, \\qquad A_j z^j \\ge \\eta^j b_j.$$\n\nProof. [2]\n\nLet us pause here briefly and recapitulate what we have achieved so far. We have\nderived an LP relaxation of the original disjunctive program for boosting with ambiguity.\nThis relaxation was obtained by a linearization of the original non-convex\nconstraints. Furthermore, we have demonstrated how this relaxation can be improved\nusing parallel reduction steps.\n\nApplying this linearization to every convex hull in Eq. (6) individually, notice that\none needs to introduce duplicates $\\alpha^x, \\xi^x$ of the parameters $\\alpha$ and slack variables $\\xi$,\nfor every $x \\in X_i$. In addition to the constraints $\\alpha^x_k, \\xi^x_i, \\xi^x_j, \\eta^x_i \\ge 0$ and $\\sum_{x \\in X_i} \\eta^x_i = 1$,\nthe relevant constraint set for ambiguous bag $X_i$ for $i \\in I$ of the resulting LP can\nbe written as\n\n$$\\forall x \\in X_i : \\quad y_i \\sum_k \\alpha^x_k h_k(x) + \\xi^x_i \\ge \\eta^x_i, \\qquad (7a)$$\n\n$$\\forall x \\in X_i, \\forall j \\in J : \\quad y_j \\sum_k \\alpha^x_k h_k(x_j) + \\xi^x_j \\ge \\eta^x_i, \\qquad (7b)$$\n\n$$\\forall k, \\forall j \\in I \\cup J : \\quad \\alpha_k = \\sum_{x \\in X_i} \\alpha^x_k, \\qquad \\xi_j = \\sum_{x \\in X_i} \\xi^x_j. \\qquad (7c)$$\n\nThe first margin constraint in Eq. 
(7a) is the one associated with the specific pattern\n$x$, while the second set of margin constraints in Eq. (7b) stems from the parallel\nreduction performed with unambiguous bags. One can calculate the dual LP of\nthe above relaxation, the derivation of which can be found in the appendix. The\nresulting program has a more complicated bound structure on the $u$-variables and\nthe following crucial constraints involving the data\n\n$$\\forall i, \\forall x \\in X_i : \\quad y_i u^x_i h_k(x) + \\sum_{j \\in J} y_j u^x_j h_k(x_j) \\le \\rho_{ik}, \\qquad \\sum_{i \\in I} \\rho_{ik} = 1. \\qquad (8)$$\n\nHowever, the size of the resulting problem is significant. As a result of linearization\nand parallel reductions, the number of parameters in the primal LP is now $O(q \\cdot n + q \\cdot r)$,\nwhere $q, r \\le m$ denote the number of patterns in ambiguous and unambiguous\nbags, compared to $O(n + m)$ of the standard LPBoost. The number of constraints\n(variables in the dual) has also been inflated significantly from $O(m)$ to $O(q \\cdot r + p \\cdot n)$,\nwhere $p \\le q$ is the number of ambiguous bags.\n\nIn order to maintain the spirit of LPBoost in dealing efficiently with a large-scale\nlinear program, we propose to maintain the column selection scheme of selecting\none or more $\\alpha^x_k$ in every round. Notice that the column selection cannot proceed\nindependently because of the equality constraints $\\sum_{x \\in X_i} \\alpha^x_k = \\alpha_k$ for all $X_i$; in\nparticular, $\\alpha^x_k > 0$ implies $\\alpha_k > 0$, so that $\\alpha^z_k > 0$ for at least some $z \\in X_i$ for each\n$X_i$, $i \\in I$. We hence propose to simultaneously add all columns $\\{\\alpha^x_k : x \\in X_i, i \\in I\\}$\ninvolving the same weak hypothesis and to prune those back after each boosting\nround in order to exploit the expected sparseness of the solution. 
In order to select\na feature $h_k$, we compute the following score\n\n$$S(k) = \\sum_i \\bar{\\rho}_{ik} - 1, \\qquad \\bar{\\rho}_{ik} \\equiv \\max_x \\Big[ y_i u^x_i h_k(x) + \\sum_{j \\in J} y_j u^x_j h_k(x_j) \\Big]. \\qquad (9)$$\n\nNotice that due to the block structure of the tableau, working with a reduced set of\ncolumns also eliminates a large number of inequalities (rows). However, the large\nset of $q \\cdot r$ inequalities for the parallel reductions is still prohibitive.\n\nIn order to address this problem, we propose to perform incremental row selection\nin an outer loop. Once we have converged to a column basis for the current relaxed\nLP, we add a subset of rows corresponding to the most useful parallel reductions.\nOne can use the magnitude of the margin violation as a heuristic to perform this\nrow selection. Hence we propose to use the following score\n\n$$T(x, j) = \\eta^x_i - y_j \\sum_k \\alpha^x_k h_k(x_j), \\quad \\text{where } x \\in X_i, \\; i \\in I, \\; j \\in J. \\qquad (10)$$\n\nThis means that for current values of the duplicated ensemble weights $\\alpha^x_k$, one\nselects the parallel reduction margin constraint associated with ambiguous pattern\n$x$ and unambiguous pattern $j$ that is violated most strongly.\n\nAlthough the margin constraints imposed by unambiguous training instances\n$(x_j, y_j)$ are redundant after we performed the parallel reduction step in Eq. (6),\nwe add them to the problem, because this will give us a better starting point with\nrespect to the row selection process, and may lead to a sparser solution. We hence\nadd the following constraints to the primal\n\n$$y_j \\sum_k \\alpha_k h_k(x_j) + \\xi_j \\ge 1, \\quad \\forall j \\in J, \\qquad (11)$$\n\nwhich will introduce additional dual variables $u_j$, $j \\in J$. Notice that in the worst\ncase where all inequalities imposed by ambiguous training instances $X_i$ are vacuous,\nthis will make sure that one recovers the standard LPBoost formulation on the\nunambiguous examples. 
One can then think of the row generation process as a way\nof deriving useful information from ambiguous examples. This information takes\nthe form of linear inequalities in the high dimensional representation of the convex\nhull and will sequentially reduce the version space, i.e. the set of feasible $(\\alpha, \\xi)$ pairs.\n\nAlgorithm 1 DPBoost Algorithm\ninitialize $H = \\emptyset$, $C = \\{\\xi_i : i \\in I \\cup J\\}$, $R = \\{u^x_i : x \\in X_i, i \\in I\\} \\cup \\{u_j : j \\in J\\}$\n$u_j = 1/|J|$, $u^x_i = 0$, $\\xi_i = 0$\nrepeat\n  repeat\n    column selection: select $h_k \\notin H$ with maximal $S(k)$\n    $H = H \\cup \\{h_k\\}$\n    $C = C \\cup \\{\\alpha_k\\} \\cup \\{\\alpha^x_k : \\forall x \\in X_i, \\forall i \\in I\\}$\n    solve LP$(C, R)$\n  until $\\max S(k) < \\epsilon$\n  row selection: select a set $S$ of pairs $(x, j) \\notin R$ with maximal $T(x, j) > 0$\n  $R = R \\cup \\{u^x_j : (x, j) \\in S\\}$, $C = C \\cup \\{\\xi^x_j : (x, j) \\in S\\}$\n  solve LP$(C, R)$\nuntil $\\max T(x, j) < \\epsilon$\n\nFigure 1: (Left) Normalized intensity plot used to generate synthetic data sets.\n(Right) Performance relative to the degree of label ambiguity. Mean and standard\ndeviation of the pattern-level classification accuracy plotted versus $\\lambda$, for perfect-knowledge\n(solid), perfect-selector (dotted), DPBoost (dashed), and naive (dash-dot)\nalgorithms. The three plots correspond to data sets of size $|I| = 10, 20, 30$.\n\n4 Experiments\n\nWe generated a set of synthetic weakly labeled data sets to evaluate DPBoost on a\nsmall scale. These were multiple-instance data sets, where the label uncertainty was\nasymmetric; the only ambiguous bags ($|X_i| > 1$) were positive. 
More specifically, we\ngenerated instances $x \\in [0, 1] \\times [0, 1]$ sampled uniformly at random from the white\n($y_i = 1$) and black ($y_i = -1$) regions of Figure 1, leaving the intermediate gray\narea as a separating margin. The degree of ambiguity was controlled by generating\nambiguous bags of size $k \\sim \\text{Poisson}(\\lambda)$ having only one positive and $k - 1$ negative\npatterns. To control data set size, we generated a pre-specified number of ambiguous\nbags, and the same number of singleton unambiguous bags.\n\nAs a proof of concept benchmark, we compared the classification performance of\nDPBoost with three other LPBoost variants: perfect-knowledge, perfect-selector, and\nnaive algorithms. All variants use LPBoost as their base algorithm and have slightly\ndifferent preprocessing steps to accommodate the MIL data sets. The first corresponds\nto the supervised LPBoost algorithm; i.e. the true pattern-level labels are used.\nSince this algorithm does not have to deal with ambiguity, it will perform better\nthan DPBoost. The second uses the true pattern-level labels to prune the negative\nexamples from ambiguous bags and solves the smaller supervised problem with\nLPBoost as above. This algorithm provides an interesting benchmark, since its\nperformance is the best we can hope for from DPBoost. At the other extreme, the\nthird variant assumes the ambiguous pattern labels are equal to their respective\nbag labels. For all algorithms, we used thresholded \u201cRBF-like\u201d features.\n\nFigure 2 shows the discriminant boundary (black line), learned by each of the four\nalgorithms for a data set generated with $\\lambda = 3$ and having 20 ambiguous bags\n(i.e. $|I| = 20$, no. ambig. = 71, no. total = 91). The ambiguous patterns are\nmarked by \u201co\u201d, unambiguous ones \u201cx\u201d, and the background is shaded to indicate\nthe value of the ensemble $F(x)$ (clamped to $[-3, 3]$). 
It is clear from the shading that\nthe ensemble has a small number of active features for DPBoost, perfect-selector\nand perfect-knowledge algorithms. For each classifier, we report the pattern-level\nclassification accuracy for a uniform grid (21 x 21) of points. The sparsity of the dual\nvariables was also verified; less than 20 percent of the dual variables and reductions\nwere active.\n\nWe ran 5-fold cross-validation on the synthetic data sets for $\\lambda = 1, 3, 5, 7$ and for\ndata sets having $|I| = 10, 20, 30$. Figure 1 (right side) shows the mean pattern-level\nclassification accuracy with error bars showing one standard deviation, as a function\nof the parameter $\\lambda$.\n\nFigure 2: Discriminant boundaries learned by naive (accuracy = 53.3 %), DPBoost\n(85.3 %), perfect-selector (86.6 %) and perfect-knowledge (92.7 %) algorithms.\n\n5 Conclusion\n\nWe have presented a new learning algorithm for classification problems where labels\nare associated with sets of patterns instead of individual patterns. Using synthetic\ndata, the expected behaviour of the algorithm has been demonstrated. Our current\nimplementation could not handle large data sets, so improvements, together with\na large-scale validation and comparison to other algorithms using benchmark MIL\ndata sets, are left for future work.\n\nAcknowledgments\n\nWe thank David Musicant for making his CPLEX MEX interface available online, and\nIoannis Tsochantaridis and Keith Hall for useful discussion and advice. This work\nwas sponsored by an NSF-ITR grant, award number IIS-0085836.\n\nReferences\n\n[1] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. Support vector machines\nfor multiple-instance learning. In Advances in Neural Information Processing\nSystems, volume 15. MIT Press, 2003.\n\n[2] Egon Balas. Disjunctive programming and a hierarchy of relaxations for discrete\noptimization problems. 
SIAM Journal on Algebraic and Discrete Methods, 6(3):466\u2013486,\nJuly 1985.\n\n[3] A. Demiriz and K. Bennett. Optimization approaches to semisupervised learning.\nIn M. Ferris, O. Mangasarian, and J. Pang, editors, Applications and Algorithms of\nComplementarity. Kluwer Academic Publishers, Boston, 2000.\n\n[4] Ayhan Demiriz, Kristin P. Bennett, and John Shawe-Taylor. Linear programming\nboosting via column generation. Machine Learning, 46(1-3):225\u2013254, 2002.\n\n[5] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance\nproblem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31\u201371, 1997.\n\n[6] T. G\u00e4rtner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In\nProc. 19th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco,\nCA, 2002.\n\n[7] A.J. Grove and D. Schuurmans. Boosting in the limit: Maximizing the margin of\nlearned ensembles. In Proceedings of the Fifteenth National Conference on Artificial\nIntelligence, 1998.\n\n[8] T. Joachims. Transductive inference for text classification using support vector machines.\nIn Proceedings 16th International Conference on Machine Learning, pages\n200\u2013209. Morgan Kaufmann, San Francisco, CA, 1999.\n\n[9] Sangbum Lee and Ignacio E. Grossmann. New algorithms for nonlinear generalized\ndisjunctive programming. Computers and Chemical Engineering Journal, 24(9-10):2125\u20132141,\nOctober 2000.\n\n[10] O. Maron and A. L. Ratan. Multiple-instance learning for natural scene classification.\nIn Proc. 15th International Conf. on Machine Learning, pages 341\u2013349. Morgan\nKaufmann, San Francisco, CA, 1998.\n\n[11] J. Ramon and L. De Raedt. Multi instance neural networks. In Proceedings of ICML-2000,\nWorkshop on Attribute-Value and Relational Learning, 2000.\n\n[12] G. R\u00e4tsch, T. Onoda, and K.-R. M\u00fcller. Soft margins for AdaBoost. 
Technical Report\nNC-TR-1998-021, Department of Computer Science, Royal Holloway, University of\nLondon, Egham, UK, 1998.\n\n[13] Gunnar R\u00e4tsch, Sebastian Mika, Bernhard Sch\u00f6lkopf, and Klaus-Robert M\u00fcller. Constructing\nboosting algorithms from SVMs: an application to one-class classification.\nIEEE Transactions on Pattern Analysis and Machine Intelligence, 24(9):1184\u20131199,\n2002.\n\n[14] Qi Zhang and Sally A. Goldman. EM-DD: An improved multiple-instance learning\ntechnique. In Advances in Neural Information Processing Systems, volume 14. MIT\nPress, 2002.\n\nAppendix\n\nThe primal variables are $\\alpha_k$, $\\alpha^x_k$, $\\xi_i$, $\\xi^x_i$, $\\xi^x_j$, and $\\eta^x_i$. The dual variables are $u^x_i$ and\n$u^x_j$ for the margin constraints, and $\\rho_{ik}$, $\\sigma_i$, and $\\theta_i$ for the equality constraints on $\\alpha_k$,\n$\\xi$ and $\\eta$, respectively.\n\nThe Lagrangian is given by\n\n$$L = \\sum_k \\alpha_k + C \\Big( \\sum_i \\xi_i + \\sum_j \\xi_j \\Big)\n- \\sum_i \\sum_{x \\in X_i} u^x_i \\Big( y_i \\sum_k \\alpha^x_k h_k(x) + \\xi^x_i - \\eta^x_i \\Big)\n- \\sum_i \\sum_{x \\in X_i} \\sum_j u^x_j \\Big( y_j \\sum_k \\alpha^x_k h_k(x_j) + \\xi^x_j - \\eta^x_i \\Big)\n- \\sum_{i,k} \\rho_{ik} \\Big( \\alpha_k - \\sum_{x \\in X_i} \\alpha^x_k \\Big)\n- \\sum_i \\sigma_i \\Big( \\xi_i - \\sum_{x \\in X_i} \\xi^x_i \\Big)\n- \\sum_{i,j} \\sigma_{ij} \\Big( \\xi_j - \\sum_{x \\in X_i} \\xi^x_j \\Big)\n+ \\sum_i \\theta_i \\Big( 1 - \\sum_{x \\in X_i} \\eta^x_i \\Big)\n- \\sum_i \\sum_{x \\in X_i} \\sum_k \\tilde{\\alpha}^x_k \\alpha^x_k\n- \\sum_i \\sum_{x \\in X_i} \\tilde{\\xi}^x_i \\xi^x_i\n- \\sum_i \\sum_{x \\in X_i} \\sum_j \\tilde{\\xi}^x_j \\xi^x_j\n- \\sum_i \\sum_{x \\in X_i} \\tilde{\\eta}^x_i \\eta^x_i.$$\n\nTaking derivatives w.r.t. 
primal variables leads to the following dual\n\n$$\\max \\; \\sum_i \\theta_i \\quad \\text{s.t.} \\quad \\theta_i \\le u^x_i + \\sum_j u^x_j, \\quad u^x_i \\le C, \\quad u^x_j \\le \\sigma_{ij}, \\quad \\sum_i \\sigma_{ij} \\le C,$$\n\n$$y_i u^x_i h_k(x) + \\sum_j y_j u^x_j h_k(x_j) \\le \\rho_{ik}, \\qquad \\sum_i \\rho_{ik} = 1.$$\n", "award": [], "sourceid": 2478, "authors": [{"given_name": "Stuart", "family_name": "Andrews", "institution": null}, {"given_name": "Thomas", "family_name": "Hofmann", "institution": null}]}