{"title": "Learning with Symmetric Label Noise: The Importance of Being Unhinged", "book": "Advances in Neural Information Processing Systems", "page_first": 10, "page_last": 18, "abstract": "Convex potential minimisation is the de facto approach to binary classification. However, Long and Servedio [2008] proved that under symmetric label noise (SLN), minimisation of any convex potential over a linear function class can result in classification performance equivalent to random guessing. This ostensibly shows that convex losses are not SLN-robust. In this paper, we propose a convex, classification-calibrated loss and prove that it is SLN-robust. The loss avoids the Long and Servedio [2008] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong l2 regularisation makes most standard learners SLN-robust. Experiments confirm the unhinged loss\u2019 SLN-robustness.", "full_text": "Learning with Symmetric Label Noise: The\n\nImportance of Being Unhinged\n\nBrendan van Rooyen\u2217,\u2020\n\nAditya Krishna Menon\u2020,\u2217\n\nRobert C. Williamson\u2217,\u2020\n\n\u2217The Australian National University\n\n\u2020National ICT Australia\n\n{ brendan.vanrooyen, aditya.menon, bob.williamson }@nicta.com.au\n\nAbstract\n\nConvex potential minimisation is the de facto approach to binary classi\ufb01cation.\nHowever, Long and Servedio [2010] proved that under symmetric label noise\n(SLN), minimisation of any convex potential over a linear function class can re-\nsult in classi\ufb01cation performance equivalent to random guessing. This ostensibly\nshows that convex losses are not SLN-robust. In this paper, we propose a convex,\nclassi\ufb01cation-calibrated loss and prove that it is SLN-robust. 
The loss avoids the Long and Servedio [2010] result by virtue of being negatively unbounded. The loss is a modification of the hinge loss, where one does not clamp at zero; hence, we call it the unhinged loss. We show that the optimal unhinged solution is equivalent to that of a strongly regularised SVM, and is the limiting solution for any convex potential; this implies that strong ℓ2 regularisation makes most standard learners SLN-robust. Experiments confirm that the unhinged loss' SLN-robustness is borne out in practice. So, with apologies to Wilde [1895], while the truth is rarely pure, it can be simple.

1 Learning with symmetric label noise

Binary classification is the canonical supervised learning problem. Given an instance space X, and samples from some distribution D over X × {±1}, the goal is to learn a scorer s : X → R with low misclassification error on future samples drawn from D. Our interest is in the more realistic scenario where the learner observes samples from some corruption D̄ of D, where labels have some constant probability of being flipped, and the goal is still to perform well with respect to D. This problem is known as learning from symmetric label noise (SLN learning) [Angluin and Laird, 1988].

Long and Servedio [2010] showed that there exist linearly separable D where, when the learner observes some corruption D̄ with symmetric label noise of any nonzero rate, minimisation of any convex potential over a linear function class results in classification performance on D that is equivalent to random guessing. Ostensibly, this establishes that convex losses are not "SLN-robust" and motivates the use of non-convex losses [Stempfel and Ralaivola, 2009, Masnadi-Shirazi et al., 2010, Ding and Vishwanathan, 2010, Denchev et al., 2012, Manwani and Sastry, 2013].

In this paper, we propose a convex loss and prove that it is SLN-robust. 
The loss avoids the result of Long and Servedio [2010] by virtue of being negatively unbounded. The loss is a modification of the hinge loss where one does not clamp at zero; thus, we call it the unhinged loss. This loss has several appealing properties, such as being the unique convex loss satisfying a notion of "strong" SLN-robustness (Proposition 5), being classification-calibrated (Proposition 6), being consistent when minimised on D̄ (Proposition 7), and having a simple optimal solution that is the difference of two kernel means (Equation 8). Finally, we show that this optimal solution is equivalent to that of a strongly regularised SVM (Proposition 8), and is the limiting solution of any twice-differentiable convex potential (Proposition 9), implying that strong ℓ2 regularisation endows most standard learners with SLN-robustness.

The classifier resulting from minimising the unhinged loss is not new [Devroye et al., 1996, Chapter 10], [Schölkopf and Smola, 2002, Section 1.2], [Shawe-Taylor and Cristianini, 2004, Section 5.1]. However, establishing this classifier's (strong) SLN-robustness, the uniqueness thereof, and its equivalence to a highly regularised SVM solution is, to our knowledge, novel.

2 Background and problem setup

Fix an instance space X. We denote by D a distribution over X × {±1}, with random variables (X, Y) ~ D. Any D may be expressed via the class-conditionals (P, Q) = (P(X | Y = 1), P(X | Y = −1)) and base rate π = P(Y = 1), or via the marginal M = P(X) and class-probability function η : x ↦ P(Y = 1 | X = x). We interchangeably write D as D_{P,Q,π} or D_{M,η}.

2.1 Classifiers, scorers, and risks

A scorer is any function s : X → R. A loss is any function ℓ : {±1} × R → R. We use ℓ_{−1}, ℓ_1 to refer to ℓ(−1, ·) and ℓ(1, ·). 
The ℓ-conditional risk L_ℓ : [0, 1] × R → R is defined as L_ℓ : (η, v) ↦ η · ℓ_1(v) + (1 − η) · ℓ_{−1}(v). Given a distribution D, the ℓ-risk of a scorer s is defined as

L^D_ℓ(s) := E_{(X,Y)~D} [ℓ(Y, s(X))],   (1)

so that L^D_ℓ(s) = E_{X~M} [L_ℓ(η(X), s(X))]. For a set S, L^D_ℓ(S) is the set of ℓ-risks for all scorers in S. A function class is any F ⊆ R^X. Given some F, the set of restricted Bayes-optimal scorers for a loss ℓ are those scorers in F that minimise the ℓ-risk:

S^{D,F,*}_ℓ := Argmin_{s∈F} L^D_ℓ(s).

The set of (unrestricted) Bayes-optimal scorers is S^{D,*}_ℓ = S^{D,F,*}_ℓ for F = R^X. The restricted ℓ-regret of a scorer is its excess risk over that of any restricted Bayes-optimal scorer:

regret^{D,F}_ℓ(s) := L^D_ℓ(s) − inf_{t∈F} L^D_ℓ(t).

Binary classification is concerned with the zero-one loss, ℓ_01 : (y, v) ↦ ⟦yv < 0⟧ + (1/2) · ⟦v = 0⟧. A loss ℓ is classification-calibrated if all its Bayes-optimal scorers are also optimal for zero-one loss: (∀D) S^{D,*}_ℓ ⊆ S^{D,*}_01. A convex potential is any loss ℓ : (y, v) ↦ φ(yv), where φ : R → R+ is convex, non-increasing, differentiable with φ′(0) < 0, and φ(+∞) = 0 [Long and Servedio, 2010, Definition 1]. All convex potentials are classification-calibrated [Bartlett et al., 2006, Theorem 2.1].

2.2 Learning with symmetric label noise (SLN learning)

The problem of learning with symmetric label noise (SLN learning) is the following [Angluin and Laird, 1988, Kearns, 1998, Blum and Mitchell, 1998, Natarajan et al., 2013]. 
For some notional "clean" distribution D, which we would like to observe, we instead observe samples from some corrupted distribution SLN(D, ρ), for some ρ ∈ [0, 1/2). The distribution SLN(D, ρ) is such that the marginal distribution of instances is unchanged, but each label is independently flipped with probability ρ. The goal is to learn a scorer from these corrupted samples such that L^D_01(s) is small. For any quantity in D, we denote its corrupted counterparts in SLN(D, ρ) with a bar, e.g. M̄ for the corrupted marginal distribution, and η̄ for the corrupted class-probability function; additionally, when ρ is clear from context, we will occasionally refer to SLN(D, ρ) by D̄. It is easy to check that the corrupted marginal distribution M̄ = M, and [Natarajan et al., 2013, Lemma 7]

(∀x ∈ X) η̄(x) = (1 − 2ρ) · η(x) + ρ.   (2)

3 SLN-robustness: formalisation

We consider learners (ℓ, F) for a loss ℓ and a function class F, with learning being the search for some s ∈ F that minimises the ℓ-risk. Informally, (ℓ, F) is "robust" to symmetric label noise (SLN-robust) if minimising ℓ over F gives the same classifier on both the clean distribution D, which the learner would like to observe, and SLN(D, ρ) for any ρ ∈ [0, 1/2), which the learner actually observes. We now formalise this notion, and review what is known about SLN-robust learners.

3.1 SLN-robust learners: a formal definition

For some fixed instance space X, let Δ denote the set of distributions on X × {±1}. 
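Equation 2 is easy to verify by simulation. The following numpy sketch (ours, not from the paper; the values of ρ and η are illustrative assumptions) flips labels at a fixed instance and compares the empirical corrupted class-probability with the formula:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.3   # assumed noise rate, for illustration
eta = 0.8   # assumed clean P(Y = 1 | X = x) at some fixed instance x

# Draw clean labels at x, then flip each independently with probability rho.
n = 200_000
y_clean = np.where(rng.random(n) < eta, 1, -1)
flip = rng.random(n) < rho
y_corrupt = np.where(flip, -y_clean, y_clean)

# Equation 2 predicts the corrupted class-probability (1 - 2*rho)*eta + rho.
eta_corrupt = np.mean(y_corrupt == 1)
predicted = (1 - 2 * rho) * eta + rho
print(eta_corrupt, predicted)  # the two agree up to sampling error
```

The same simulation with any other (ρ, η) pair behaves identically, since the flip is independent of the label's identity.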
Given a notional "clean" distribution D, N_sln : Δ → 2^Δ returns the set of possible corrupted versions of D the learner may observe, where labels are flipped with unknown probability ρ:

N_sln : D ↦ { SLN(D, ρ) | ρ ∈ [0, 1/2) }.   (3)

Equipped with this, we define our notion of SLN-robustness.

Definition 1 (SLN-robustness). We say that a learner (ℓ, F) is SLN-robust if

(∀D ∈ Δ) (∀D̄ ∈ N_sln(D)) L^D_01(S^{D,F,*}_ℓ) = L^D_01(S^{D̄,F,*}_ℓ).

That is, SLN-robustness requires that for any level of label noise in the observed distribution D̄, the classification performance (with respect to D) of the learner is the same as if the learner directly observes D. Unfortunately, a widely adopted class of learners is not SLN-robust, as we will now see.

3.2 Convex potentials with linear function classes are not SLN-robust

Fix X = R^d, and consider learners with a convex potential ℓ, and a function class of linear scorers

F_lin = {x ↦ ⟨w, x⟩ | w ∈ R^d}.

This captures e.g. the linear SVM and logistic regression, which are widely studied in theory and applied in practice. Disappointingly, these learners are not SLN-robust: Long and Servedio [2010, Theorem 2] give an example where, when learning under symmetric label noise, for any convex potential ℓ, the corrupted ℓ-risk minimiser over F_lin has classification performance equivalent to random guessing on D. This implies that (ℓ, F_lin) is not SLN-robust¹ as per Definition 1.

Proposition 1 (Long and Servedio [2010, Theorem 2]). Let X = R^d for any d ≥ 2. Pick any convex potential ℓ. 
Then, (ℓ, F_lin) is not SLN-robust.

3.3 The fallout: what learners are SLN-robust?

In light of Proposition 1, there are two ways to proceed in order to obtain SLN-robust learners: either we change the class of losses ℓ, or we change the function class F.

The first approach has been pursued in a large body of work that embraces non-convex losses [Stempfel and Ralaivola, 2009, Masnadi-Shirazi et al., 2010, Ding and Vishwanathan, 2010, Denchev et al., 2012, Manwani and Sastry, 2013]. While such losses avoid the conditions of Proposition 1, this does not automatically imply that they are SLN-robust when used with F_lin. In Appendix B, we present evidence that some of these losses are in fact not SLN-robust when used with F_lin.

The second approach is to consider a suitably rich F that contains the Bayes-optimal scorer for D, e.g. by employing a universal kernel. With this choice, one can still use a convex potential loss, and in fact, owing to Equation 2, any classification-calibrated loss.

Proposition 2. Pick any classification-calibrated ℓ. Then, (ℓ, R^X) is SLN-robust.

Both approaches have drawbacks. The first approach has a computational penalty, as it requires optimising a non-convex loss. The second approach has a statistical penalty, as estimation rates with a rich F will require a larger sample size. Thus, it appears that SLN-robustness involves a computational-statistical tradeoff. However, there is a variant of the first option: pick a loss that is convex, but not a convex potential. Such a loss would afford the computational and statistical advantages of minimising convex risks with linear scorers. Manwani and Sastry [2013] demonstrated that square loss, ℓ(y, v) = (1 − yv)², is one such loss. We will show that there is a simpler loss that is convex and SLN-robust, but is not in the class of convex potentials by virtue of being negatively unbounded. 
To derive this loss, we first re-interpret robustness via a noise-correction procedure.

¹Even if we were content with a difference of ε ∈ [0, 1/2] between the clean and corrupted minimisers' performance, Long and Servedio [2010, Theorem 2] implies that in the worst case ε = 1/2.

4 A noise-corrected loss perspective on SLN-robustness

We now re-express SLN-robustness to reason about optimal scorers on the same distribution, but with two different losses. This will help characterise a set of "strongly SLN-robust" losses.

4.1 Reformulating SLN-robustness via noise-corrected losses

Given any ρ ∈ [0, 1/2), Natarajan et al. [2013, Lemma 1] showed how to associate with a loss ℓ a noise-corrected counterpart ℓ̄ such that L^{D̄}_{ℓ̄}(s) = L^D_ℓ(s). The loss ℓ̄ is defined as follows.

Definition 2 (Noise-corrected loss). Given any loss ℓ and ρ ∈ [0, 1/2), the noise-corrected loss ℓ̄ is

(∀y ∈ {±1}) (∀v ∈ R) ℓ̄(y, v) = ((1 − ρ) · ℓ(y, v) − ρ · ℓ(−y, v)) / (1 − 2ρ).   (4)

Since ℓ̄ depends on the unknown parameter ρ, it is not directly usable to design an SLN-robust learner. Nonetheless, it is a useful theoretical device, since, by construction, for any F, S^{D̄,F,*}_{ℓ̄} = S^{D,F,*}_ℓ. This means that a sufficient condition for (ℓ, F) to be SLN-robust is for S^{D̄,F,*}_ℓ = S^{D̄,F,*}_{ℓ̄}. Ghosh et al. [2015, Theorem 1] proved a sufficient condition on ℓ such that this holds, namely,

(∃C ∈ R)(∀v ∈ R) ℓ_1(v) + ℓ_{−1}(v) = C.   (5)

Interestingly, Equation 5 is necessary for a stronger notion of robustness, which we now explore.

4.2 Characterising a stronger notion of SLN-robustness

As the first step towards a stronger notion of robustness, we rewrite (with a slight abuse of notation)

L^D_ℓ(s) = E_{(X,Y)~D} [ℓ(Y, s(X))] = E_{(Y,S)~R(D,s)} [ℓ(Y, S)] := L_ℓ(R(D, s)),

where R(D, s) is a distribution over labels and scores. Standard SLN-robustness requires that label noise does not change the ℓ-risk minimisers, i.e. that if s is such that L_ℓ(R(D, s)) ≤ L_ℓ(R(D, s′)) for all s′, the same relation holds with D̄ in place of D. Strong SLN-robustness strengthens this notion by requiring that label noise does not affect the ordering of all pairs of joint distributions over labels and scores. (This of course trivially implies SLN-robustness.) As with the definition of D̄, given a distribution R over labels and scores, let R̄ be the corresponding distribution where labels are flipped with probability ρ. Strong SLN-robustness can then be made precise as follows.

Definition 3 (Strong SLN-robustness). Call a loss ℓ strongly SLN-robust if for every ρ ∈ [0, 1/2),

(∀R, R′) L_ℓ(R) ≤ L_ℓ(R′) ⟺ L_ℓ(R̄) ≤ L_ℓ(R̄′).

We now re-express strong SLN-robustness using a notion of order equivalence of loss pairs, which simply requires that two losses order all distributions over labels and scores identically.

Definition 4 (Order equivalent loss pairs). 
Call a pair of losses (ℓ, ℓ̃) order equivalent if

(∀R, R′) L_ℓ(R) ≤ L_ℓ(R′) ⟺ L_{ℓ̃}(R) ≤ L_{ℓ̃}(R′).

Clearly, order equivalence of (ℓ, ℓ̄) implies S^{D̄,F,*}_ℓ = S^{D̄,F,*}_{ℓ̄}, which in turn implies SLN-robustness. It is thus not surprising that we can relate order equivalence to strong SLN-robustness of ℓ.

Proposition 3. A loss ℓ is strongly SLN-robust iff for every ρ ∈ [0, 1/2), (ℓ, ℓ̄) are order equivalent.

This connection now lets us exploit a classical result in decision theory about order equivalent losses being affine transformations of each other. Combined with the definition of ℓ̄, this lets us conclude that the sufficient condition of Equation 5 is also necessary for strong SLN-robustness of ℓ.

Proposition 4. A loss ℓ is strongly SLN-robust if and only if it satisfies Equation 5.

We now return to our original goal, which was to find a convex ℓ that is SLN-robust for F_lin (and ideally more general function classes). The above suggests that to do so, it is reasonable to consider those losses that satisfy Equation 5. Unfortunately, it is evident that if ℓ is convex, non-constant, and bounded below by zero, then it cannot possibly be admissible in this sense. But we now show that removing the boundedness restriction allows for the existence of a convex admissible loss.

5 The unhinged loss: a convex, strongly SLN-robust loss

Consider the following simple, but non-standard convex loss:

ℓ^unh_1(v) = 1 − v and ℓ^unh_{−1}(v) = 1 + v.

Compared to the hinge loss, the loss does not clamp at zero, i.e. it does not have a hinge. (Thus, peculiarly, it is negatively unbounded, an issue we discuss in §5.3.) 
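As a quick numerical sanity check (a numpy sketch of ours, not from the paper), the two partial losses above sum to the constant C = 2, so Equation 5 holds, and symmetric flipping applies only an affine, order-preserving transformation to the expected loss at every margin:

```python
import numpy as np

def unhinged(y, v):
    """The loss above: 1 - y*v, with no clamp at zero."""
    return 1.0 - y * v

v = np.linspace(-5, 5, 101)   # a grid of scores

# Equation 5 with C = 2: the partial losses sum to a constant.
assert np.allclose(unhinged(1, v) + unhinged(-1, v), 2.0)

# Flipping the label with probability rho gives an expected loss that is an
# affine function of the clean loss, with positive slope (1 - 2*rho); affine
# maps preserve risk orderings, which is the robustness mechanism at work.
rho = 0.2
corrupted = (1 - rho) * unhinged(1, v) + rho * unhinged(-1, v)
assert np.allclose(corrupted, (1 - 2 * rho) * unhinged(1, v) + 2 * rho)
print("constant-sum and affine-under-noise checks passed")
```

The same two identities fail for the hinge loss, whose partial losses do not sum to a constant once either side is clamped at zero.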
Thus, we call this the unhinged loss². The loss has a number of attractive properties, the most immediate being its SLN-robustness.

5.1 The unhinged loss is strongly SLN-robust

Since ℓ^unh_1(v) + ℓ^unh_{−1}(v) = 2, Proposition 4 implies that ℓ^unh is strongly SLN-robust, and thus that (ℓ^unh, F) is SLN-robust for any F. Further, the following uniqueness property is not hard to show.

Proposition 5. Pick any convex loss ℓ. Then,

(∃C ∈ R) ℓ_1(v) + ℓ_{−1}(v) = C ⟺ (∃A, B, D ∈ R) ℓ_1(v) = −A · v + B, ℓ_{−1}(v) = A · v + D.

That is, up to scaling and translation, ℓ^unh is the only convex loss that is strongly SLN-robust.

Returning to the case of linear scorers, the above implies that (ℓ^unh, F_lin) is SLN-robust. This does not contradict Proposition 1, since ℓ^unh is not a convex potential, as it is negatively unbounded. Intuitively, this property allows the loss to offset the penalty incurred by instances that are misclassified with high margin by awarding a "gain" for instances that are correctly classified with high margin.

5.2 The unhinged loss is classification calibrated

SLN-robustness is by itself insufficient for a learner to be useful. For example, a loss that is uniformly zero is strongly SLN-robust, but is useless as it is not classification-calibrated. Fortunately, the unhinged loss is classification-calibrated, as we now establish. For technical reasons (see §5.3), we operate with F_B = [−B, +B]^X, the set of scorers with range bounded by B ∈ [0, ∞).

Proposition 6. Fix ℓ = ℓ^unh. For any D_{M,η} and B ∈ [0, ∞), S^{D,F_B,*}_ℓ = {x ↦ B · sign(2η(x) − 1)}.

Thus, for every B ∈ [0, ∞), the restricted Bayes-optimal scorer over F_B has the same sign as the Bayes-optimal classifier for 0-1 loss. In the limiting case where F = R^X, the optimal scorer is
In the limiting case where F = RX, the optimal scorer is\nattainable if we operate over the extended reals R \u222a {\u00b1\u221e}, so that (cid:96)unh is classi\ufb01cation-calibrated.\n\n(cid:96)\n\n5.3 Enforcing boundedness of the loss\n\nWhile the classi\ufb01cation-calibration of (cid:96)unh is encouraging, Proposition 6 implies that its (unre-\nstricted) Bayes-risk is \u2212\u221e. Thus, the regret of every non-optimal scorer s is identically +\u221e, which\nhampers analysis of consistency.\nIn orthodox decision theory, analogous theoretical issues arise\nwhen attempting to establish basic theorems with unbounded losses [Ferguson, 1967, pg. 78].\nWe can side-step this issue by restricting attention to bounded scorers, so that (cid:96)unh is effectively\nbounded. By Proposition 6, this does not affect the classi\ufb01cation-calibration of the loss. In the con-\n\u221a\ntext of linear scorers, boundedness of scorers can be achieved by regularisation: instead of work-\n\u03bb}, where \u03bb > 0, so\ning with Flin, one can instead use Flin,\u03bb = {x (cid:55)\u2192 (cid:104)w, x(cid:105) | ||w||2 \u2264 1/\n\u03bb for R = supx\u2208X ||x||2. Observe that as ((cid:96)unh, F) is SLN-robust for any F,\nthat Flin,\u03bb \u2286 F\n((cid:96)unh, Flin,\u03bb) is SLN-robust for any \u03bb > 0. As we shall see in \u00a76.3, working with Flin,\u03bb also lets us\nestablish SLN-robustness of the hinge loss when \u03bb is large.\n\n\u221a\n\nR/\n\n5.4 Unhinged loss minimisation on corrupted distribution is consistent\n\nUsing bounded scorers makes it possible to establish a surrogate regret bound for the unhinged loss.\nThis shows classi\ufb01cation consistency of unhinged loss minimisation on the corrupted distribution.\n\n2This loss has been considered in Sriperumbudur et al. [2009], Reid and Williamson [2011] in the context\nof maximum mean discrepancy; see the Appendix. The analysis of its SLN-robustness is to our knowledge\nnovel.\n\n5\n\n\fProposition 7. 
Fix ℓ = ℓ^unh. Then, for any D, ρ ∈ [0, 1/2), B ∈ [1, ∞), and scorer s ∈ F_B,

regret^D_01(s) ≤ regret^{D,F_B}_ℓ(s) = (1/(1 − 2ρ)) · regret^{D̄,F_B}_ℓ(s).

Standard rates of convergence via generalisation bounds are also trivial to derive; see the Appendix.

6 Learning with the unhinged loss and kernels

We now show that the optimal solution for the unhinged loss, when employing regularisation and kernelised scorers, has a simple form. This sheds further light on SLN-robustness and regularisation.

6.1 The centroid classifier optimises the unhinged loss

Consider minimising the unhinged risk over the class of kernelised scorers F_{H,λ} = {s : x ↦ ⟨w, Φ(x)⟩_H | ||w||_H ≤ 1/√λ} for some λ > 0, where Φ : X → H is a feature mapping into a reproducing kernel Hilbert space H with kernel k. Equivalently, given a distribution³ D, we want

w*_{unh,λ} = argmin_{w∈H} E_{(X,Y)~D} [1 − Y · ⟨w, Φ(X)⟩] + (λ/2) · ⟨w, w⟩_H.   (6)

The first-order optimality condition implies that

w*_{unh,λ} = (1/λ) · E_{(X,Y)~D} [Y · Φ(X)],   (7)

which is the kernel mean map of D [Smola et al., 2007], and thus the optimal unhinged scorer is

s*_{unh,λ} : x ↦ (1/λ) · E_{(X,Y)~D} [Y · k(X, x)] = x ↦ (1/λ) · (π · E_{X~P} [k(X, x)] − (1 − π) · E_{X~Q} [k(X, x)]).   (8)

From Equation 8, the unhinged solution is equivalent to a nearest centroid classifier [Manning et al., 2008, pg. 181] [Tibshirani et al., 2002] [Shawe-Taylor and Cristianini, 2004, Section 5.1]. 
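Equation 8 is straightforward to instantiate from a sample. The following numpy sketch (ours; the Gaussian toy data and λ = 1 are illustrative assumptions) estimates the centroid solution with a linear kernel, and then empirically checks the scaling relation stated in Equation 9 below, namely that label noise only shrinks the solution by (1 − 2ρ):

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 1.0   # regularisation strength; it only rescales scores (see Section 6.2)

# Toy clean sample: two Gaussian classes in R^2 with means +/- (1.0, 0.5).
n = 100_000
y = rng.choice([-1, 1], size=n)
X = rng.normal(0, 1, (n, 2)) + np.outer(y, [1.0, 0.5])

# Equation 8 with a linear kernel: the optimal unhinged scorer is determined
# by the empirical difference of class centroids, w = mean(y_i * x_i) / lam.
w_clean = (y[:, None] * X).mean(axis=0) / lam

# Corrupt labels at rate rho and recompute the centroid solution.
rho = 0.3
flip = rng.random(n) < rho
y_noisy = np.where(flip, -y, y)
w_noisy = (y_noisy[:, None] * X).mean(axis=0) / lam

# Noise shrinks the solution by roughly (1 - 2*rho) = 0.4, so every score
# keeps its sign and the induced classifier is unchanged.
print(w_noisy / w_clean)
```

Nothing in the check depends on the two-Gaussian assumption: the same componentwise shrinkage appears for any clean sample, since flipping is independent of the instance.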
Equation 8 gives a simple way to understand the SLN-robustness of (ℓ^unh, F_{H,λ}), as the optimal scorers on the clean and corrupted distributions only differ by a scaling (see the Appendix):

(∀x ∈ X) E_{(X,Y)~D} [Y · k(X, x)] = (1/(1 − 2ρ)) · E_{(X,Y)~D̄} [Y · k(X, x)].   (9)

Interestingly, Servedio [1999, Theorem 4] established that a nearest centroid classifier (which they termed "AVERAGE") is robust to a general class of label noise, but required the assumption that M is uniform over the unit sphere. Our result establishes that SLN-robustness of the classifier holds without any assumptions on M. In fact, Ghosh et al. [2015, Theorem 1] lets one quantify the unhinged loss' performance under a more general noise model; see the Appendix for discussion.

6.2 Practical considerations

We note several points relating to practical usage of the unhinged loss with kernelised scorers. First, cross-validation is not required to select λ, since changing λ only changes the magnitude of scores, not their sign. Thus, for the purposes of classification, one can simply use λ = 1.

Second, we can easily extend the scorers to use a bias regularised with strength 0 < λ_b ≠ λ. Tuning λ_b is equivalent to computing s*_{unh,λ} as per Equation 8, and tuning a threshold on a holdout set.

Third, when H = R^d for d small, we can store w*_{unh,λ} explicitly, and use this to make predictions. For high (or infinite) dimensional H, we can either make predictions directly via Equation 8, or use random Fourier features [Rahimi and Recht, 2007] to (approximately) embed H into some low-dimensional R^d, and then store w*_{unh,λ} as usual. (The latter requires a translation-invariant kernel.)

We now show that under some assumptions, w*_{unh,λ} coincides with the solution of two established methods; the Appendix discusses some further relationships, e.g. to the maximum mean discrepancy.

³Given a training sample S ~ D^n, we can use plugin estimates as appropriate.

6.3 Equivalence to a highly regularised SVM and other convex potentials

There is an interesting equivalence between the unhinged solution and that of a highly regularised SVM. This has been noted in e.g. Hastie et al. [2004, Section 6], which showed how SVMs approach a nearest centroid classifier, which is of course the optimal unhinged solution.

Proposition 8. Pick any D and Φ : X → H with R = sup_{x∈X} ||Φ(x)||_H < ∞. For any λ > 0, let

w*_{hinge,λ} = argmin_{w∈H} E_{(X,Y)~D} [max(0, 1 − Y · ⟨w, Φ(X)⟩_H)] + (λ/2) · ⟨w, w⟩_H

be the soft-margin SVM solution. Then, if λ ≥ R², w*_{hinge,λ} = w*_{unh,λ}.

Since (ℓ^unh, F_{H,λ}) is SLN-robust, it follows that for ℓ_hinge : (y, v) ↦ max(0, 1 − yv), (ℓ_hinge, F_{H,λ}) is similarly SLN-robust provided λ is sufficiently large. That is, strong ℓ2 regularisation (and a bounded feature map) endows the hinge loss with SLN-robustness⁴. Proposition 8 can be generalised to show that w*_{unh,λ} is the limiting solution of any twice differentiable convex potential. This shows that strong ℓ2 regularisation endows most learners with SLN-robustness. Intuitively, with strong regularisation, one only considers the behaviour of a loss near zero; since a convex potential φ has φ′(0) < 0, it will behave similarly to its linear approximation around zero, viz. 
the unhinged loss.

Proposition 9. Pick any D, bounded feature mapping Φ : X → H, and twice differentiable convex potential φ with φ″([−1, 1]) bounded. Let w*_{φ,λ} be the minimiser of the regularised φ-risk. Then,

lim_{λ→∞} || w*_{φ,λ} / ||w*_{φ,λ}||_H − w*_{unh,λ} / ||w*_{unh,λ}||_H ||²_H = 0.

6.4 Equivalence to Fisher Linear Discriminant with whitened data

For binary classification on D_{M,η}, the Fisher Linear Discriminant (FLD) finds a weight vector proportional to the minimiser of square loss ℓ_sq : (y, v) ↦ (1 − yv)² [Bishop, 2006, Section 4.1.5],

w*_{sq,λ} = (E_{X~M} [XX^T] + λI)^{−1} · E_{(X,Y)~D} [Y · X].   (10)

By Equation 9, and the fact that the corrupted marginal M̄ = M, w*_{sq,λ} is only changed by a scaling factor under label noise. This provides an alternate proof of the fact that (ℓ_sq, F_lin) is SLN-robust⁵ [Manwani and Sastry, 2013, Theorem 2]. Clearly, the unhinged loss solution w*_{unh,λ} is equivalent to the FLD and square loss solution w*_{sq,λ} when the input data is whitened, i.e. E_{X~M} [XX^T] = I. With a well-specified F, e.g. with a universal kernel, both the unhinged and square loss asymptotically recover the optimal classifier, but the unhinged loss does not require a matrix inversion. With a misspecified F, one cannot in general argue for the superiority of the unhinged loss over square loss, or vice-versa, as there is no universally good surrogate to the 0-1 loss [Reid and Williamson, 2010, Appendix A]; the Appendix illustrates examples where both losses may underperform.
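The whitening equivalence can be checked numerically. The sketch below (ours; the correlated toy distribution, sample size, and λ are illustrative assumptions) whitens a sample so that the empirical second moment is the identity, then compares the direction of the regularised square-loss solution of Equation 10 with the unhinged solution E[Y · X]:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, lam = 50_000, 3, 0.1

# Toy data with correlated features and a class-dependent mean shift.
y = rng.choice([-1, 1], size=n)
A = rng.normal(size=(d, d))
X = rng.normal(size=(n, d)) @ A + np.outer(y, [1.0, -0.5, 0.25])

# Whiten: after this transform the empirical E[X X^T] is the identity.
Sigma = X.T @ X / n
L = np.linalg.cholesky(np.linalg.inv(Sigma))
Xw = X @ L

# Equation 10 (square loss / FLD) vs. the unhinged direction E[Y * X],
# both computed on the whitened sample.
m = (y[:, None] * Xw).mean(axis=0)
w_sq = np.linalg.solve(Xw.T @ Xw / n + lam * np.eye(d), m)
w_unh = m

cos = w_sq @ w_unh / (np.linalg.norm(w_sq) * np.linalg.norm(w_unh))
print(cos)  # ~1: on whitened data the two solutions agree up to scale
```

Since the whitened second moment is the identity, the solve reduces to dividing by (1 + λ), which is exactly the "only a scaling" relationship described above.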
7 SLN-robustness of unhinged loss: empirical illustration

We now illustrate that the unhinged loss' SLN-robustness is empirically manifest. We reiterate that with high regularisation, the unhinged solution is equivalent to an SVM (and in the limit any classification-calibrated loss) solution. Thus, we do not aim to assert that the unhinged loss is "better" than other losses, but rather, to demonstrate that its SLN-robustness is not purely theoretical.

We first show that the unhinged risk minimiser performs well on the example of Long and Servedio [2010] (henceforth LS10). Figure 1 shows the distribution D, where X = {(1, 0), (γ, 5γ), (γ, −γ)} ⊂ R², with marginal distribution M = {1/2, 1/4, 1/4} and all three instances deterministically positive. We pick γ = 1/2. The unhinged minimiser perfectly classifies all three points, regardless of the level of label noise (Figure 1). The hinge minimiser is perfect when there is no noise, but with even a small amount of noise, achieves a 50% error rate.

⁴Long and Servedio [2010, Section 6] show that ℓ1 regularisation does not endow SLN-robustness.
⁵Square loss escapes the result of Long and Servedio [2010] since it is not monotone decreasing.

[Figure 1: LS10 dataset, showing the separators learnt by the unhinged minimiser and by the hinge minimiser at 0% and 1% noise.]

          Hinge          t-logistic     Unhinged
ρ = 0     0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
ρ = 0.1   0.15 ± 0.27    0.00 ± 0.00    0.00 ± 0.00
ρ = 0.2   0.21 ± 0.30    0.00 ± 0.00    0.00 ± 0.00
ρ = 0.3   0.38 ± 0.37    0.22 ± 0.08    0.00 ± 0.00
ρ = 0.4   0.42 ± 0.36    0.22 ± 0.08    0.00 ± 0.00
ρ = 0.49  0.47 ± 0.38    0.39 ± 0.23    0.34 ± 0.48

Table 1: Mean and standard deviation of the 0-1 error over 125 trials on LS10. 
Grayed cells denote the best performer at that noise rate.

We next consider empirical risk minimisers from a random training sample: we construct a training set of 800 instances, injected with varying levels of label noise, and evaluate classification performance on a test set of 1000 instances. We compare the hinge, t-logistic (for t = 2) [Ding and Vishwanathan, 2010] and unhinged minimisers using a linear scorer without a bias term, and regularisation strength λ = 10^-16. From Table 1, even at 40% label noise, the unhinged classifier is able to find a perfect solution. By contrast, both other losses suffer at even moderate noise rates.

We next report results on some UCI datasets, where we additionally tune a threshold so as to ensure the best training set 0-1 accuracy. Table 2 summarises results on a sample of four datasets. (The Appendix contains results with more datasets, performance metrics, and losses.) Even at noise close to 50%, the unhinged loss is often able to learn a classifier with some discriminative power.

(a) iris
           Hinge          t-Logistic     Unhinged
ρ = 0      0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
ρ = 0.1    0.01 ± 0.03    0.01 ± 0.03    0.00 ± 0.00
ρ = 0.2    0.06 ± 0.12    0.04 ± 0.05    0.00 ± 0.01
ρ = 0.3    0.17 ± 0.20    0.09 ± 0.11    0.02 ± 0.07
ρ = 0.4    0.35 ± 0.24    0.24 ± 0.16    0.13 ± 0.22
ρ = 0.49   0.60 ± 0.20    0.49 ± 0.20    0.45 ± 0.33

(b) housing
           Hinge          t-Logistic     Unhinged
ρ = 0      0.05 ± 0.00    0.05 ± 0.00    0.05 ± 0.00
ρ = 0.1    0.06 ± 0.01    0.07 ± 0.02    0.05 ± 0.00
ρ = 0.2    0.06 ± 0.01    0.08 ± 0.03    0.05 ± 0.00
ρ = 0.3    0.08 ± 0.04    0.11 ± 0.05    0.05 ± 0.01
ρ = 0.4    0.14 ± 0.10    0.24 ± 0.13    0.09 ± 0.10
ρ = 0.49   0.45 ± 0.26    0.49 ± 0.16    0.46 ± 0.30

(c) usps0v7
           Hinge          t-Logistic     Unhinged
ρ = 0      0.00 ± 0.00    0.00 ± 0.00    0.00 ± 0.00
ρ = 0.1    0.10 ± 0.08    0.11 ± 0.02    0.00 ± 0.00
ρ = 0.2    0.19 ± 0.11    0.15 ± 0.02    0.00 ± 0.00
ρ = 0.3    0.31 ± 0.13    0.22 ± 0.03    0.01 ± 0.00
ρ = 0.4    0.39 ± 0.13    0.33 ± 0.04    0.02 ± 0.02
ρ = 0.49   0.50 ± 0.16    0.48 ± 0.04    0.34 ± 0.21

(d) splice
           Hinge          t-Logistic     Unhinged
ρ = 0      0.05 ± 0.00    0.04 ± 0.00    0.19 ± 0.00
ρ = 0.1    0.15 ± 0.03    0.24 ± 0.00    0.19 ± 0.01
ρ = 0.2    0.21 ± 0.03    0.24 ± 0.00    0.19 ± 0.01
ρ = 0.3    0.25 ± 0.03    0.24 ± 0.00    0.19 ± 0.03
ρ = 0.4    0.31 ± 0.05    0.24 ± 0.00    0.22 ± 0.05
ρ = 0.49   0.48 ± 0.09    0.40 ± 0.24    0.45 ± 0.08

Table 2: Mean and standard deviation of the 0-1 error over 125 trials on UCI datasets.

8 Conclusion and future work

We proposed a convex, classification-calibrated loss, proved that it is robust to symmetric label noise (SLN-robust), showed it is the unique loss that satisfies a notion of strong SLN-robustness, established that it is optimised by the nearest centroid classifier, and showed that most convex potentials, such as the SVM, are also SLN-robust when highly regularised. So, with apologies to Wilde [1895]:

While the truth is rarely pure, it can be simple.

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program. The authors thank Cheng Soon Ong for valuable comments on a draft of this paper.

References

Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

Peter L. Bartlett, Michael I. Jordan, and Jon D. McAuliffe.
Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156, 2006.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag New York, Inc., 2006.

Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Conference on Computational Learning Theory (COLT), pages 92–100, 1998.

Vasil Denchev, Nan Ding, Hartmut Neven, and S.V.N. Vishwanathan. Robust classification with adiabatic quantum optimization. In International Conference on Machine Learning (ICML), pages 863–870, 2012.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, 1996.

Nan Ding and S.V.N. Vishwanathan. t-logistic regression. In Advances in Neural Information Processing Systems (NIPS), pages 514–522. Curran Associates, Inc., 2010.

Thomas S. Ferguson. Mathematical Statistics: A Decision Theoretic Approach. Academic Press, 1967.

Aritra Ghosh, Naresh Manwani, and P. S. Sastry. Making risk minimization tolerant to label noise. Neurocomputing, 160:93–107, 2015.

Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5:1391–1415, December 2004.

Michael Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983–1006, November 1998.

Philip M. Long and Rocco A. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78(3):287–304, 2010.

Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA, 2008.

Naresh Manwani and P. S. Sastry. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, June 2013.

Hamed Masnadi-Shirazi, Vijay Mahadevan, and Nuno Vasconcelos. On the design of robust classifiers for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep D. Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS), pages 1196–1204, 2013.

Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), pages 1177–1184, 2007.

Mark D. Reid and Robert C. Williamson. Composite binary losses. Journal of Machine Learning Research, 11:2387–2422, December 2010.

Mark D. Reid and Robert C. Williamson. Information, divergence and risk for binary experiments. Journal of Machine Learning Research, 12:731–817, March 2011.

Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

Rocco A. Servedio. On PAC learning using Winnow, Perceptron, and a Perceptron-like algorithm. In Conference on Computational Learning Theory (COLT), 1999.

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory (ALT), 2007.

Bharath K. Sriperumbudur, Kenji Fukumizu, Arthur Gretton, Gert R. G. Lanckriet, and Bernhard Schölkopf. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems (NIPS), 2009.

Guillaume Stempfel and Liva Ralaivola. Learning SVMs from sloppily labeled data. In Artificial Neural Networks (ICANN), volume 5768, pages 884–893. Springer Berlin Heidelberg, 2009.

Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences, 99(10):6567–6572, 2002.

Oscar Wilde. The Importance of Being Earnest, 1895.
", "award": [], "sourceid": 9, "authors": [{"given_name": "Brendan", "family_name": "van Rooyen", "institution": "NICTA"}, {"given_name": "Aditya", "family_name": "Menon", "institution": "NICTA"}, {"given_name": "Robert", "family_name": "Williamson", "institution": "NICTA"}]}