{"title": "beta-risk: a New Surrogate Risk for Learning from Weakly Labeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 4365, "page_last": 4373, "abstract": "During the past few years, the machine learning community has paid attention to developing new methods for learning from weakly labeled data. This field covers different settings like semi-supervised learning, learning with label proportions, multi-instance learning, noise-tolerant learning, etc. This paper presents a generic framework to deal with these weakly labeled scenarios. We introduce the beta-risk as a generalized formulation of the standard empirical risk based on surrogate margin-based loss functions. This risk allows us to express the reliability of the labels and to derive different kinds of learning algorithms. We specifically focus on SVMs and propose a soft margin beta-SVM algorithm which behaves better than the state of the art.", "full_text": "β-risk: a New Surrogate Risk for Learning from Weakly Labeled Data

Valentina Zantedeschi∗, Rémi Emonet, Marc Sebban
firstname.lastname@univ-st-etienne.fr
Univ Lyon, UJM-Saint-Etienne, CNRS, Institut d'Optique Graduate School,
Laboratoire Hubert Curien UMR 5516, F-42023, SAINT-ETIENNE, France

Abstract

During the past few years, the machine learning community has paid attention to developing new methods for learning from weakly labeled data. This field covers different settings like semi-supervised learning, learning with label proportions, multi-instance learning, noise-tolerant learning, etc. This paper presents a generic framework to deal with these weakly labeled scenarios. We introduce the β-risk as a generalized formulation of the standard empirical risk based on surrogate margin-based loss functions. This risk allows us to express the reliability of the labels and to derive different kinds of learning algorithms. 
We specifically focus on SVMs and propose a soft margin β-SVM algorithm which behaves better than the state of the art.

1 Introduction

The growing amount of data available nowadays has allowed us to increase the confidence in the models induced by machine learning methods. On the other hand, it has also caused several issues, especially in supervised classification, regarding the availability of labels and their reliability. Because it may be expensive and tricky to assign a reliable and unique label to each training instance, the data at our disposal for the application at hand are often weakly labeled. Learning from weak supervision has received considerable attention over the past few years [14, 12]. This research field includes different settings: only a fraction of the labels are known (Semi-Supervised Learning [22]); we can access only the proportions of the classes (Learning with Label Proportions [19] and Multi-Instance Learning [8]); the labels are uncertain or noisy (Noise-Tolerant Learning [1, 18, 16]); discordant labels are given to the same instance by different experts (Multi-Expert Learning [21]); labels are completely unknown (Unsupervised Learning [11]). As a consequence, the data provided in all these situations cannot be fully exploited using supervised techniques, at the risk of drastically reducing the performance of the learned models. To address this issue, numerous machine learning methods have been developed to deal with each of the previous specific situations. However, all these weakly labeled learning tasks share common features mainly relying on the confidence in the labels, opening the door to the development of generic frameworks. Unfortunately, only a few attempts have tried to address several settings with the same approach. 
The most interesting one has been presented in [14], where the authors propose WellSVM, which is designed to deal with three different weakly labeled learning scenarios: semi-supervised learning, multi-instance learning and clustering. However, WellSVM focuses specifically on Support Vector Machines and requires deriving a new optimization problem for each new task. Even though WellSVM constitutes a step further towards general models, it stops midway, constraining the learner to use SVMs.
This paper aims to bridge this gap by presenting a generic framework for learning from weakly labeled data. Our approach is based on the derivation of the β-risk, a new surrogate empirical risk defined as a strict generalization of the standard empirical risk relying on surrogate margin-based loss functions. The main interesting property of the β-risk comes from its ability to exploit the information given by the weakly supervised setting, encoded as a β matrix reflecting the supervision on the labels. Moreover, the instance-specific weights β let one integrate into classical methods the side information provided by the setting. This is the key difference w.r.t. [18, 16]: in both papers, the proposed losses are defined using class-dependent weights (fixed to 1/2 in the first paper, and dependent on the class noise rate in the latter), while in our approach the weights are provided for each instance, which gives a more flexible formulation. Making use of this β-risk, we design a generic algorithm devoted to addressing different kinds of aforementioned weakly labeled settings. To allow a comparison with the state of the art, we instantiate it with a learner that takes the form of an SVM algorithm. 

∗http://vzantedeschi.com/

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
In this context, we derive a soft margin β-SVM algorithm and show that it outperforms WellSVM.
The remainder of this paper is organized as follows: in Section 2, we define the empirical surrogate β-risk and show under which conditions it can be used to learn without explicitly accessing the labels; we also show how to instantiate β according to the weakly labeled learning setting at hand; in Section 3, we present our generic iterative algorithm for learning with weakly labeled data; in Section 4, we exploit our new framework to derive a novel formulation of the Support Vector Machine problem, the β-SVM; finally, we report experiments in semi-supervised learning and learning with label noise, conducted on classical datasets from the UCI repository [15], in order to compare our algorithm with state-of-the-art approaches.

2 From Classical Surrogate Losses and Surrogate Risks to the β-risk

In this section, we first provide reminders about surrogate losses and then exploit the characteristics of popular loss functions to introduce the empirical surrogate β-risk. The β-risk formulation allows us to tackle the problem of learning with weakly labeled data. We show under which conditions it can be used instead of the standard empirical surrogate risk (defined in a fully supervised context). Those conditions give insight on how to design algorithms that learn from weak supervision. 
We restrain our study to the context of binary classification.

2.1 Preliminaries

In statistical learning, a common approach for choosing the optimal hypothesis $h^*$ from a hypothesis class $H$ is to select the classifier that minimizes the expected risk over the joint space $Z = X \times Y$, where $X$ is the feature space and $Y$ the label space, expressed as
$$R_\ell(h) = \int_{X \times Y} \ell(yh(x))\, p(x, y)\, dx\, dy$$
with $\ell : H \times Z \to \mathbb{R}^+$ a margin-based loss function.
Since the true distribution of the data $p(x, y)$ is usually unknown, machine learning algorithms typically minimize the empirical version of the risk, computed over a finite set $S$ composed of $m$ instances $(x_i, y_i)$ drawn i.i.d. from a distribution over $X \times \{-1, 1\}$:
$$R_\ell(S, h) = \frac{1}{m} \sum_{i=1}^m \ell(y_i h(x_i)).$$
The most natural loss function is the so-called 0-1 loss. As this function is not convex, not differentiable and has zero gradient, other loss functions are commonly employed instead. These losses, such as the logistic loss (e.g., for logistic regression [6]), the exponential loss (e.g., for boosting techniques [10]) and the hinge loss (e.g., for the SVM [7]), are convex and smooth relaxations of the 0-1 loss. Theoretical studies on the characteristics and behavior of such surrogate losses can be found in [17, 2, 20]. In particular, [17] showed that each commonly used surrogate loss can be characterized by a permissible function $\phi$ (see below) and rewritten as
$$F_\phi(x) = \frac{\phi^*(-x) - a_\phi}{b_\phi}$$
where $\phi^*(x) = \sup_a (xa - \phi(a))$ is the Legendre conjugate of $\phi$ (for more details, see [4]), $a_\phi = -\phi(0) = -\phi(1) \geq 0$ and $b_\phi = -\phi(\frac{1}{2}) - a_\phi > 0$. As presented by the authors of [13] and [17], a permissible function is a function $f : [0, 1] \to \mathbb{R}^-$, symmetric about $\frac{1}{2}$, differentiable on $]0, 1[$ and strictly convex. For instance, the permissible function $\phi_{log}$ related to the logistic loss $F_\phi(x) = \log(1 + \exp(-x))$ is:
$$\phi_{log}(x) = x \log(x) + (1 - x) \log(1 - x)$$
with $a_\phi = 0$ and $b_\phi = \log(2)$.
As detailed in [17], considering a surrogate loss $F_\phi$, the empirical surrogate risk of a hypothesis $h : X \to \mathbb{R}$ w.r.t. $S$ can be expressed as:
$$R_\phi(S, h) = \frac{1}{m} \sum_{i=1}^m D_\phi\left(y_i, \nabla^{-1}_\phi(h(x_i))\right) = \frac{b_\phi}{m} \sum_{i=1}^m F_\phi(y_i h(x_i))$$
with $D_\phi$ the Bregman divergence
$$D_\phi(x, y) = \phi(x) - \phi(y) - (x - y)\nabla\phi(y).$$
In order to evaluate such a risk $R_\phi(S, h)$, it is mandatory to provide the labels $y$ for all the instances. In addition, it is not possible to take into account possible uncertainties on the given labels. Consequently, $R_\phi$ is defined in a totally supervised context, where the labels $y$ are known and considered to be true. In order to face the numerous situations where training data may be weakly labeled, we claim that there is a need to fill the gap by defining a new empirical surrogate risk that can deal with such settings. 
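As an editorial illustration, the Legendre-conjugate construction above can be checked numerically for $\phi_{log}$. The sketch below (our own, not from the paper) uses natural logarithms, so that $b_\phi = \ln 2$ and the recovered surrogate is $F_\phi(x) = \log_2(1 + e^{-x})$; it also verifies the identity $\phi^*(-x) = \phi^*(x) - x$ used later in the proof of Lemma 2.1:

```python
import numpy as np

def phi_log(a):
    # permissible function of the logistic loss: a*ln(a) + (1-a)*ln(1-a)
    a = np.clip(a, 1e-12, 1 - 1e-12)  # avoid log(0); phi_log(0) = phi_log(1) = 0
    return a * np.log(a) + (1 - a) * np.log(1 - a)

def legendre_conjugate(phi, x, grid=np.linspace(1e-6, 1 - 1e-6, 200001)):
    # phi*(x) = sup_{a in [0,1]} (x*a - phi(a)), approximated on a fine grid
    return np.max(x * grid - phi(grid))

a_phi = -phi_log(0.0)          # = 0 for phi_log
b_phi = -phi_log(0.5) - a_phi  # = ln(2) for phi_log

for x in [-2.0, -0.5, 0.0, 1.0, 3.0]:
    F = (legendre_conjugate(phi_log, -x) - a_phi) / b_phi
    # closed form: phi_log*(z) = ln(1 + e^z), hence F_phi(x) = log2(1 + e^{-x})
    assert abs(F - np.log2(1 + np.exp(-x))) < 1e-4
    # identity for permissible functions: phi*(-x) = phi*(x) - x
    assert abs(legendre_conjugate(phi_log, -x)
               - (legendre_conjugate(phi_log, x) - x)) < 1e-4
```

The grid-based supremum is only an approximation of the true conjugate, but it is accurate enough here to confirm both the closed form and the symmetry identity.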
In the following section, we propose a generalization of the empirical surrogate risk, called the empirical surrogate β-risk, which can be employed in the context of weakly labeled data instead of the standard one, under some linear conditions on the margin.

2.2 The Empirical Surrogate β-risk

Before defining the empirical surrogate β-risk for any loss $F_\phi$ and hypothesis $h \in H$, let us rewrite the definition of $R_\phi$ introducing a new set of variables named β, which can be laid out as a $2 \times m$ matrix.

Lemma 2.1. For any $S$, $\phi$ and $h$, and for any non-negative real coefficients $\beta^{-1}_i$ and $\beta^{+1}_i$ defined for each instance $x_i \in S$ such that $\beta^{-1}_i + \beta^{+1}_i = 1$, the empirical surrogate risk $R_\phi(S, h)$ can be rewritten as
$$R_\phi(S, h) = R_\phi(S, h, \beta)$$
where
$$R_\phi(S, h, \beta) = \frac{b_\phi}{m} \sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \beta^\sigma_i F_\phi(\sigma h(x_i)) + \frac{1}{m} \sum_{i=1}^m \beta^{-y_i}_i (-y_i h(x_i)).$$

The coefficient $\beta^{+1}_i$ (resp. $\beta^{-1}_i$) for an instance $x_i$ can be interpreted here as the degree of confidence in (or the probability of) the label +1 (resp. -1) assigned to $x_i$.

Proof.
$$R_\phi(S, h) = \frac{b_\phi}{m} \sum_{i=1}^m F_\phi(y_i h(x_i))$$
$$= \frac{b_\phi}{m} \sum_{i=1}^m \left(\beta^{y_i}_i F_\phi(y_i h(x_i)) + \beta^{-y_i}_i F_\phi(y_i h(x_i))\right) \quad (1)$$
$$= \frac{b_\phi}{m} \sum_{i=1}^m \left(\beta^{y_i}_i F_\phi(y_i h(x_i)) + \beta^{-y_i}_i \left(F_\phi(-y_i h(x_i)) - \frac{y_i h(x_i)}{b_\phi}\right)\right) \quad (2)$$
$$= \frac{b_\phi}{m} \sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \beta^\sigma_i F_\phi(\sigma h(x_i)) + \frac{1}{m} \sum_{i=1}^m \beta^{-y_i}_i (-y_i h(x_i)). \quad (3)$$
Eq. (1) is because $\beta^{-1}_i + \beta^{+1}_i = 1$; Eq. (2) is due to the fact that $\phi^*(-x) = \phi^*(x) - x$ (see the supplementary material) for any permissible function $\phi$, so that $F_\phi(x) = \frac{\phi^*(-x) - a_\phi}{b_\phi} = \frac{\phi^*(x) - a_\phi - x}{b_\phi} = F_\phi(-x) - \frac{x}{b_\phi}$.
From Eq. (3), and considering that the sample $S$ is composed of the finite set of features $\mathcal{X}$ and labels $\mathcal{Y}$, we can write that
$$R_\phi(S, h) = R_\phi(S, h, \beta) = R^\beta_\phi(\mathcal{X}, h) - \frac{1}{m} \sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) \quad (4)$$
where
$$R^\beta_\phi(\mathcal{X}, h) = \frac{b_\phi}{m} \sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \beta^\sigma_i F_\phi(\sigma h(x_i))$$
is the empirical surrogate β-risk for a matrix $\beta = [\beta^{+1}_1, ..., \beta^{+1}_m \,|\, \beta^{-1}_1, ..., \beta^{-1}_m]$.
It is worth noticing that $R_\phi(S, h, \beta)$ is expressed as the sum of two terms: the second one takes into account the labels of the data, while the first one, the β-risk, focuses on the loss suffered by $h$ over $\mathcal{X}$ without explicitly needing the labels $\mathcal{Y}$.
The empirical β-risk is a generalization of the empirical risk: setting $\beta^{y_i}_i = 1$ (and thus $\beta^{-y_i}_i = 0$) for each instance, the second term vanishes and we retrieve the classical formulation of the empirical risk. Additionally, as developed in Section 2.3, the introduction of β makes it possible to inject some side-information about the labels. 
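Lemma 2.1 can be sanity-checked numerically: for the logistic surrogate, where $b_\phi F_\phi(x) = \ln(1 + e^{-x})$, the standard risk and its β-decomposition coincide for any β satisfying $\beta^{-1}_i + \beta^{+1}_i = 1$. A minimal self-contained sketch (random scores, labels and β; all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
h = rng.normal(size=m)             # scores h(x_i)
y = rng.choice([-1, 1], size=m)    # labels y_i
beta_pos = rng.uniform(size=m)     # beta_i^{+1}, arbitrary in [0, 1]
beta = {+1: beta_pos, -1: 1.0 - beta_pos}  # enforces beta_i^{-1} + beta_i^{+1} = 1

def bF(x):
    # b_phi * F_phi(x) for the logistic surrogate: ln(1 + e^{-x})
    return np.log1p(np.exp(-x))

# standard empirical surrogate risk: (b_phi / m) * sum_i F_phi(y_i h(x_i))
risk = np.mean(bF(y * h))

# Lemma 2.1 decomposition: beta-risk term + label-dependent term
beta_my = np.where(y == 1, beta[-1], beta[+1])  # beta_i^{-y_i}
decomposed = (np.mean(beta[+1] * bF(+h) + beta[-1] * bF(-h))
              + np.mean(beta_my * (-y * h)))
assert abs(risk - decomposed) < 1e-8
```

The check holds for any draw of β, which is exactly the point of the lemma: the label-free β-risk term plus the label-dependent correction always reproduces the supervised risk.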
For this reason, we claim that the β-risk is suited to deal with classification in the context of weakly labeled data.
Let us now focus on the conditions allowing the empirical β-risk (i) to be a surrogate of the 0-1 loss-based empirical risk and (ii) to be sufficient to learn with a weak supervision on the labels. From (4), we deduce:
$$R^\beta_\phi(\mathcal{X}, h) = R_\phi(S, h, \beta) + \frac{1}{m} \sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) \geq R_{0/1}(S, h) + \frac{1}{m} \sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) \quad (5)$$
where $R_{0/1}(S, h)$ is the empirical risk related to the 0-1 loss, and Eq. (5) holds because $b_\phi F_\phi(x) \geq F_{0/1}(x)$ (for any surrogate loss).
It is possible to ensure that the β-risk is both a convex upper-bound of the 0-1 loss-based risk and a relaxation as tight as the traditional risk (i.e., that we have $R_{0/1}(S, h) \leq R^\beta_\phi(\mathcal{X}, h) \leq R_\phi(S, h)$): it suffices to force the following constraint: $\sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) = 0$.
Unfortunately, the constraint $\sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) = 0$ still depends on the vector $y$ of labels, which is not always provided and most likely uncertain or inaccurate in a weakly labeled data setting. We will show in Section 3 that this issue can be overcome by means of an iterative 2-step learning procedure, which first learns a classifier minimizing the β-risk, possibly violating the constraint, and then learns a new matrix β that fulfills the constraint.

2.3 Instantiating β for Different Weakly Supervised Settings

The β-risk can be used as the basis for handling different learning settings, including weakly labeled learning. This can be achieved by fixing the β values, choosing their initial values or putting a prior on them. 
We have already seen that fully supervised learning can be obtained by fixing all β values to 1 for the assigned class and to 0 for the opposite class. The current section provides guidance on how β could be instantiated to handle various weakly labeled settings.
In a semi-supervised setting, as detailed in the experimental section, we propose to initialize the β of unlabeled points to 0.5 and then to automatically refine them in an iterative process. Going further, and if we are ready to integrate spatial or topological information in the process, the β values of each unlabeled point could be initialized using a density estimation procedure (e.g., by considering the label proportions of the k nearest labeled neighbors). In the context of multi-expert learning, the experts' votes for each instance $i$ can simply be averaged to produce the $\beta_i$ values (or their initialization, or a prior). The case of learning with label proportions is especially useful for privacy-preserving data processing: the training points are grouped into bags and, for each bag, the proportions of the labels are given. One way of handling such supervision is to initialize, for each bag, all the β with the same value, corresponding to the provided proportion of labels. Noise-tolerant learning aims at learning in the presence of label noise, where labels are given but can be wrong. For any possibly noisy point, a direct approach is to use lower β values (instead of 1 as in the supervised case) and refine them as in the semi-supervised setting. β can also be initialized using the label proportions of the k nearest labeled examples (as done in the experimental section). 
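The initializations described above are straightforward to write down. A minimal sketch for the semi-supervised and k-nearest-neighbor (noise-tolerant) strategies, with function and variable names of our own choosing:

```python
import numpy as np

def init_beta_semi_supervised(y, labeled_mask):
    """beta[sigma][i] = 1/0 for labeled points, 0.5 for unlabeled ones."""
    m = len(y)
    beta = {+1: np.full(m, 0.5), -1: np.full(m, 0.5)}
    for s in (+1, -1):
        beta[s][labeled_mask] = (y[labeled_mask] == s).astype(float)
    return beta

def init_beta_knn(X, y, k=4):
    """Noise-tolerant case: beta[sigma][i] = proportion of the k nearest
    (other) points carrying label sigma, as in the 4-nn strategy of Sec. 5.2."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude the point itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    beta_pos = (y[nn] == 1).mean(axis=1)
    return {+1: beta_pos, -1: 1.0 - beta_pos}

# tiny usage example: two well-separated clusters
X = np.array([[0.0, 0], [0.1, 0], [0.2, 0], [5.0, 0], [5.1, 0], [5.2, 0]])
y = np.array([1, 1, 1, -1, -1, -1])
labeled = np.array([True, False, True, True, False, True])
b_semi = init_beta_semi_supervised(y, labeled)
b_knn = init_beta_knn(X, y, k=2)
```

In both cases the constraint $\beta^{-1}_i + \beta^{+1}_i = 1$ holds by construction, so the resulting matrices are valid starting points for the iterative algorithm of Section 3.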
The case of Multiple Instance Learning (MIL) is trickier: in a typical MIL setting, instances are grouped in bags and the supervision is given as a single label per bag, which is positive if the bag contains at least one positive instance (negative bags contain only negative instances). A straightforward solution would be to recast the MIL supervision as "learning with label proportions" (e.g., considering exactly one positive instance in each bag). This is not fully satisfying, and a more promising solution would be to consider, within each bag, the set of $\beta^{+1}$ variables and put a sparsity-inducing prior on them. This approach would be a less-constrained version of the relaxation proposed in WellSVM [14] (where it is supposed that there is exactly one positive instance per positive bag) and could be achieved by an l1 penalty or using a Dirichlet prior (with low α to promote sparsity).

3 An Iterative Algorithm for Weakly-labeled Learning

As explained in Section 2, a sufficient condition for guaranteeing that the β-risk is a convex upper-bound of the 0-1 loss-based risk and is not worse than the traditional risk is to fix $\sum_{i=1}^m \beta^{-y_i}_i y_i h(x_i) = 0$. However, the previous constraint depends on the labels. We overcome this problem by (i) iteratively learning a classifier minimizing the β-risk, most likely violating the constraint, and then (ii) learning a new matrix β that fulfills it. The algorithm is generic: it can be used in different weakly labeled settings and can be instantiated with different losses and regularizations, as we will do in the next section with SVMs.
As the process is iterative, let ${}^t\beta$ be the estimate of β at iteration $t$. At each iteration, our algorithm consists of two steps. 
We first learn a hypothesis $h$ for the following problem P1:
$$h_{t+1} = P1(\mathcal{X}, {}^t\beta) = \arg\min_h\; c\, R^{{}^t\beta}_\phi(\mathcal{X}, h) + N(h)$$
which boils down to minimizing the $N$-regularized empirical surrogate β-risk over the training sample $\mathcal{X}$ of size $m$, where $N$, for instance, can take the form of an L1 or an L2 norm.
Then, we find the optimal β of the following problem P2 for the points of $\mathcal{X}$:
$${}^{t+1}\beta = P2(\mathcal{X}, h_{t+1}) = \arg\min_\beta\; R^\beta_\phi(\mathcal{X}, h_{t+1})$$
$$\text{s.t.}\quad \sum_{i=1}^m \beta^{-y_i}_i (-y_i\, h_{t+1}(x_i)) = 0$$
$$\beta^{-1}_i + \beta^{+1}_i = 1,\quad \beta^{-1}_i \geq 0,\quad \beta^{+1}_i \geq 0 \quad \forall i = 1..m.$$
For this step, a vector of labels is required. We choose to re-estimate it at each iteration according to the current value of β: we assign to an instance the most probable label, i.e. the σ corresponding to the biggest $\beta^\sigma_i$. The matrix β has to be initialized at the beginning of the algorithm according to the problem setting (see Section 2.3). The two steps are repeated while some stabilization criterion exceeds a given threshold ε.

4 Soft-margin β-SVM

A major advantage of the empirical surrogate β-risk is that it can be plugged into numerous learning settings without radically modifying the original formulations. As an example, in this section we derive a new version of the Support Vector Machine problem, using the empirical surrogate β-risk, that takes into account the knowledge provided for each training instance (through the matrix β). The soft-margin β-SVM optimization problem is a direct generalization of a standard soft-margin SVM and is defined as follows:
$$\arg\min_{\theta, b, \xi}\; \frac{1}{2}\|\theta\|^2_2 + c \sum_{i=1}^m \left(\beta^{-1}_i \xi^{-1}_i + \beta^{+1}_i \xi^{+1}_i\right)$$
$$\text{s.t.}\quad \sigma(\theta^T \mu(x_i) + b) \geq 1 - \xi^\sigma_i \quad \forall i = 1..m,\; \sigma \in \{-1, 1\}$$
$$\xi^\sigma_i \geq 0 \quad \forall i = 1..m,\; \sigma \in \{-1, 1\}$$
where $\theta \in X'$ is the vector defining the margin hyperplane and $b$ its offset, $\mu : X \to X'$ a mapping function and $c \in \mathbb{R}$ a tuned hyper-parameter. In the rest of the paper, we will refer to $K : X \times X \to \mathbb{R}$ as the kernel function corresponding to $\mu$, i.e. $K(x_i, x_j) = \mu(x_i) \cdot \mu(x_j)$.
The corresponding Lagrangian dual problem is given by (the complete derivation is provided in the supplementary material):
$$\max_\alpha\; -\frac{1}{2} \sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \sum_{j=1}^m \sum_{\sigma' \in \{-1,+1\}} \alpha^\sigma_i \sigma\, \alpha^{\sigma'}_j \sigma'\, K(x_i, x_j) + \sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \alpha^\sigma_i$$
$$\text{s.t.}\quad 0 \leq \alpha^\sigma_i \leq c\,\beta^\sigma_i \quad \forall i = 1..m,\; \sigma \in \{-1, 1\}$$
$$\sum_{i=1}^m \sum_{\sigma \in \{-1,+1\}} \alpha^\sigma_i \sigma = 0$$
which is concave w.r.t. α, as for the standard SVM.
The β-SVM formulation differs from the SVM one in two points: first, the number of Lagrange multipliers is doubled, because we consider both positive and negative labels for each instance; second, the upper-bounds for α are not the same for all instances but depend on the given matrix β. Like the coefficient $c$ in the classical formulation of SVM, those upper-bounds play the role of a trade-off between under-fitting and over-fitting: the smaller they are, the more robust to outliers the learner is, but the less it adapts to the data. 
It is then logical that the upper-bound for an instance $i$ depends on $\beta^\sigma_i$, because it reflects the reliability of the label σ for that instance: if the label σ is unlikely, its corresponding $\alpha^\sigma_i$ will be constrained to be null (and its adversary will have more chance to be selected as a support vector, since $\beta^\sigma_i + \beta^{-\sigma}_i = 1$). Also, those points for which no label is more probable than the other ($\beta^\sigma_i \to 0.5$) will have less importance in the learning process compared to those for which a label is almost certain. In order to fully exploit the advantages of our formulation, $c$ has to be finite and strictly positive. As a matter of fact, when $c \to \infty$ or $c \to 0$, the constraints become exactly those of the original formulation.

5 Experimental Results

In the first part of this section, we present some experimental results obtained by adapting the iterative algorithm presented in Section 3 to semi-supervised learning and combining it with the previously derived β-SVM. Note that some approaches based on SVMs have already been presented in the literature to address the problem of semi-supervised learning. Among them, TransductiveSVM [5] iteratively learns a separator with the labeled instances, classifies a subset of the unlabeled instances and adds it to the training set. On the other hand, WellSVM [14] combines the classical SVM with a label generation strategy that allows one to learn the optimal separator, even when the training sample is not completely labeled, by convexly relaxing the original Mixed-Integer Programming problem. In [14], WellSVM has been shown to be very effective and better than TransductiveSVM and the state of the art. For this reason, we compare in this section β-SVM to WellSVM. 
In the second subsection, we present some preliminary results in the noise-tolerant learning setting, showing how β-SVM behaves when facing data with label noise.

5.1 Iterative β-SVM for semi-supervised learning

We compare our method's performance to that of WellSVM, which has been shown in [14] to perform on average better than state-of-the-art semi-supervised learning methods based on SVM, as well as the standard SVM. In a semi-supervised context, a set $\mathcal{X}_l$ of labeled instances of size $m_l$ and a set $\mathcal{X}_u$ of unlabeled instances of size $m_u$ are provided. The matrix β is initialized as follows:
$$\forall i = 1..m_l \text{ and } \forall \sigma \in \{-1, 1\},\quad {}^0\beta^\sigma_i = 1 \text{ if } \sigma = y_i,\; 0 \text{ otherwise},$$
$$\forall i = m_l+1..m_l+m_u \text{ and } \forall \sigma \in \{-1, 1\},\quad {}^0\beta^\sigma_i = 0.5$$
and we learn an optimal separator:
$$h_{t+1} = P1(\mathcal{X}_l \cup \mathcal{X}_u, {}^t\beta) = \arg\min_h\; c_1 R^{{}^t\beta}_\phi(\mathcal{X}_l, h) + c_2 R^{{}^t\beta}_\phi(\mathcal{X}_u, h) + N(h).$$
Here $c_1$ and $c_2$ are balance constants between the labeled and unlabeled sets: when the number of unlabeled instances becomes greater than the number of labeled instances, we need to reduce the importance of the unlabeled set in the learning procedure, because of the risk that the labeled set will be ignored. We consider the provided labels to be correct, so we keep the corresponding ${}^l\beta$ fixed during the iterations of the algorithm and estimate ${}^u\beta$ by optimizing $P2(\mathcal{X}_u, h_{t+1})$. The iterative algorithm with β-SVM is implemented in Python using Cvxopt (for optimizing β-SVM) and Cvxpy² with its Ecos solver [9].
For each dataset, we show in Figure 1 the accuracy of the two methods with an increasing proportion of labeled data. The different approaches are compared on the same kernel, either the linear or the gaussian one, whichever gives higher overall accuracy. 
As a matter of fact, the choice of the kernel depends on the geometry of the data, not on the learning method.
For each proportion of labeled data, we perform a 4-fold cross-validation and we show the average accuracy over 10 iterations. Concerning the hyper-parameters of the different methods, we fix $c_2$ of β-SVM to $c_1 \frac{m_l}{m}$, $c_1$ of WellSVM to 1 as explained in [14], and all the other hyper-parameters ($c_1$ for β-SVM and $c_2$ for WellSVM) are tuned by cross-validation through grid search. As for the stopping criteria, we fix ε of β-SVM to $10^{-5} + 10^{-3}\|h\|_F$ and ε of WellSVM to $10^{-3}$, and the maximal number of iterations to 20 for both methods. When using the gaussian kernel, the γ in $K(x_i, x_j) = \exp(-\|x_i - x_j\|^2_2/\gamma)$ is fixed to the mean distance between instances.
Our method performs better than WellSVM, with few exceptions, and is more efficient in terms of CPU time: for the Australian dataset, the biggest dataset in number of features and instances, WellSVM is on average 30 times slower than our algorithm (without particular optimization efforts).

5.2 Preliminary results under label noise

We briefly tackle another setting of the weakly labeled data field: noise-tolerant learning, the task of learning from data that have noisy or uncertain labels. It has been shown in [3] that SVM learning is extremely sensitive to outliers, especially the ones lying next to the boundary. We study the sensitivity of β-SVM to label noise artificially introduced on the Ionosphere dataset. We consider two initialization strategies for β: the standard one, where $\beta^{y_i} = 1$ and $\beta^{-y_i} = 0$, and the 4-nn one, where $\beta^\sigma$ is set to the proportion of neighboring instances with label σ. In Figure 2, we draw the mean accuracy over 4 repetitions w.r.t. 
an increasing percentage (as a proportion of the smallest dataset) of two kinds of noise: symmetric noise, introduced by swapping the labels of instances belonging to different classes, and asymmetric noise, introduced by gradually changing the labels of the instances of one class.

²http://cvxopt.org/ and http://www.cvxpy.org/

[Figure 1: Comparison of the mean accuracies of WellSVM and β-SVM versus the percentage of labeled data on different UCI datasets. Panels: (a) Ionosphere, gaussian kernel; (b) Heart-statlog, linear kernel; (c) Liver, linear kernel; (d) Australian, gaussian kernel; (e) Pima, linear kernel; (f) Sonar, linear kernel; (g) Splice, gaussian kernel.]

[Figure 2: Comparison of the mean accuracy versus the percentage of noise of iterative β-SVM with different initializations of β. Panels: (a) Symmetric Noise; (b) Asymmetric Noise. The standard curve refers to the initialization $\beta^{y_i} = 1$ and $\beta^{-y_i} = 0$, and the 4-nn curve to the initialization of $\beta^\sigma$ to the proportion of neighboring instances with label σ.]

These preliminary results are encouraging and show that locally estimating the conditional class density to initialize the β matrix improves the robustness of our method to label noise.

6 Conclusion

This paper focuses on the problem of learning from weakly labeled data. 
We introduced the β-risk, which generalizes the standard empirical risk while allowing the integration of weak supervision. From the expression of the β-risk, we derived a generic algorithm for weakly labeled data and specialized it in an SVM-like context. The resulting β-SVM algorithm has been applied in two different weakly labeled settings, namely semi-supervised learning and learning with label noise, showing the advantages of the approach.
The perspectives of this work are numerous and of two main kinds: covering new weakly labeled settings and studying theoretical guarantees. As proposed in Section 2.3, the β-risk can be used in various weakly labeled scenarios. This requires using different strategies for the initialization and the refinement of β, and also proposing proper priors for these parameters. Generalizing the proposed β-risk to a multi-class setting is a natural extension, as β is already a matrix of class probabilities. Another broad direction involves deriving robustness and convergence bounds for the algorithms built on the β-risk.

7 Acknowledgments

We thank the reviewers for their valuable remarks. We also thank the ANR projects SOLSTICE (ANR-13-BS02-01) and LIVES (ANR-15-CE230026-03).

References
[1] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.
[2] S. Ben-David, D. Loker, N. Srebro, and K. Sridharan. Minimizing the misclassification error rate using a surrogate convex loss. In Proceedings of the 29th International Conference on Machine Learning, ICML. icml.cc / Omnipress, 2012.
[3] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152. ACM, 1992.
[4] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[5] L. 
Bruzzone, M. Chi, and M. Marconcini. A novel transductive SVM for semisupervised classification of remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, 44(11):3363–3373, 2006.
[6] M. Collins, R. E. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.
[7] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[8] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
[9] A. Domahidi, E. Chu, and S. Boyd. ECOS: An SOCP solver for embedded systems. In Control Conference (ECC), 2013 European, pages 3071–3076. IEEE, 2013.
[10] Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996.
[11] T. Hastie, R. Tibshirani, and J. Friedman. Unsupervised learning. Springer, 2009.
[12] A. Joulin and F. Bach. A convex relaxation for weakly supervised classifiers. arXiv preprint arXiv:1206.6413, 2012.
[13] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the twenty-eighth annual ACM symposium on Theory of computing, pages 459–468. ACM, 1996.
[14] Y.-F. Li, I. W. Tsang, J. T. Kwok, and Z.-H. Zhou. Convex and scalable weakly labeled SVMs. The Journal of Machine Learning Research, 14(1):2151–2188, 2013.
[15] M. Lichman. UCI machine learning repository, 2013.
[16] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems, pages 1196–1204, 2013.
[17] R. Nock and F. Nielsen. Bregman divergences and surrogates for learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(11):2048–2059, 2009.
[18] G. 
Patrini, F. Nielsen, R. Nock, and M. Carioni. Loss factorization, weakly supervised learning and label noise robustness. arXiv preprint arXiv:1602.02450, 2016.
[19] G. Patrini, R. Nock, T. Caetano, and P. Rivera. (Almost) no label no cry. In Advances in Neural Information Processing Systems, pages 190–198, 2014.
[20] L. Rosasco, E. De Vito, A. Caponnetto, M. Piana, and A. Verri. Are loss functions all the same? Neural Computation, 16(5):1063–1076, 2004.
[21] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622. ACM, 2008.
[22] X. Zhu. Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison, 2005.
", "award": [], "sourceid": 2154, "authors": [{"given_name": "Valentina", "family_name": "Zantedeschi", "institution": "UJM Saint-Etienne"}, {"given_name": "R\u00e9mi", "family_name": "Emonet", "institution": "Hubert Curien Lab."}, {"given_name": "Marc", "family_name": "Sebban", "institution": "University Jean Monnet"}]}