{"title": "Relaxed Clipping: A Global Training Method for Robust Regression and Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 2532, "page_last": 2540, "abstract": "Robust regression and classification are often thought to require non-convex loss functions that prevent scalable, global training. However, such a view neglects the possibility of reformulated training methods that can yield practically solvable alternatives. A natural way to make a loss function more robust to outliers is to truncate loss values that exceed a maximum threshold. We demonstrate that a relaxation of this form of ``loss clipping'' can be made globally solvable and applicable to any standard loss while guaranteeing robustness against outliers. We present a generic procedure that can be applied to standard loss functions and demonstrate improved robustness in regression and classification problems.", "full_text": "Relaxed Clipping: A Global Training Method\n\nfor Robust Regression and Classi\ufb01cation\n\nYaoliang Yu, Min Yang, Linli Xu, Martha White, Dale Schuurmans\n\nUniversity of Alberta, Dept. Computing Science, Edmonton AB T6G 2E8, Canada\n\n{yaoliang,myang2,linli,whitem,dale}@cs.ualberta.ca\n\nAbstract\n\nRobust regression and classi\ufb01cation are often thought to require non-convex loss\nfunctions that prevent scalable, global training. However, such a view neglects\nthe possibility of reformulated training methods that can yield practically solvable\nalternatives. A natural way to make a loss function more robust to outliers is\nto truncate loss values that exceed a maximum threshold. We demonstrate that\na relaxation of this form of \u201closs clipping\u201d can be made globally solvable and\napplicable to any standard loss while guaranteeing robustness against outliers. 
We\npresent a generic procedure that can be applied to standard loss functions and\ndemonstrate improved robustness in regression and classi\ufb01cation problems.\n\n1\n\nIntroduction\n\nRobust statistics is a well established \ufb01eld that analyzes the sensitivity of common estimators to out-\nliers and provides alternative estimators that achieve improved robustness [11, 13, 17, 23]. Outliers\nare understood to be observations that have been corrupted, incorrectly measured, mis-recorded,\ndrawn under different conditions than those intended, or so atypical as to require separate model-\ning. The main goal of classical robust statistics is to make estimators invariant, or nearly invariant,\nto arbitrary changes made to a non-trivial fraction of the sample data\u2014a goal that is equally rele-\nvant to machine learning research given that data sets are often collected with limited or no quality\ncontrol, making outliers ubiquitous. Unfortunately, the state-of-the-art in robust statistics relies on\nnon-convex training criteria that have yet to yield ef\ufb01cient global solution methods [13, 17, 23].\n\nAlthough many robust regression methods have been proposed in the classical literature, M-\nestimators continue to be a dominant approach [13, 17]. These correspond to the standard machine\nlearning approach of minimizing a sum of prediction errors under a given loss function (assuming\na \ufb01xed scaling). M-estimation is reasonably well understood, analytically tractable, and provides\na simple framework for trading off between robustness against outliers and data ef\ufb01ciency on in-\nliers [13, 17]. Unfortunately, robustness in this context comes with a cost: when minimizing a\nconvex loss, even a single data point can dominate the result. That is, any (non-constant) convex\nloss function exhibits necessarily unbounded sensitivity to even a single outlier [17, \u00a75.4.1]. 
Al-\nthough unbounded sensitivity can obviously be mitigated by imposing prior bounds on the domain\nand range of the data [5, 6], such is not always possible in practice. Instead, the classical literature\nachieves bounded outlier sensitivity by considering redescending loss functions (see [17, \u00a72.2] for a\nde\ufb01nition), or more restrictively, bounded loss functions, both of which are inherently non-convex.\nRobust regression has also been extensively investigated in computer vision [2, 26], where a similar\nconclusion has been reached that bounded loss functions are necessary to counteract the types of\noutliers created by edge discontinuities, multiple motions, and specularities in image data.\n\nFor classi\ufb01cation the story is similar. The attempt to avoid outlier sensitivity has led many to propose\nbounded loss functions [8, 15, 18, 19, 25] to replace the standard convex, unbounded losses deployed\nin support vector machines and boosting [9] respectively. In fact, [16] has shown that minimizing\n\n1\n\n\fany convex margin loss cannot achieve robustness to random misclassi\ufb01cation noise. The conclusion\nreached in the classi\ufb01cation literature, as in the regression literature, is therefore that non-convexity\nis necessary to ensure robustness against outliers\u2014creating an apparent dilemma: one can achieve\nglobal training via convexity or outlier robustness via boundedness, but not both.\n\nIn this paper we present a counterpoint to these pessimistic conclusions. In particular, we present a\ngeneral model for bounding any convex loss function, via a process of \u201closs clipping\u201d, that ensures\nbounded sensitivity to outliers. Although the resulting optimization problem is not, by itself, con-\nvex, we demonstrate an ef\ufb01cient convex relaxation and rounding procedure that guarantees bounded\nresponse to data\u2014a guarantee that cannot be established for any convex loss minimization on its\nown. 
The approach we propose is generic and can be applied to any standard loss function, be it\nfor regression or classi\ufb01cation. Our work is inspired by a number of studies that have investigated\nrobust estimators in computer vision and machine learning [2, 26, 27, 30]. However, these previ-\nous attempts were either hampered by local optimization or restricted to special cases; none had\nguarantees of global training and outlier insensitivity.\n\nBefore proceeding it is important to realize that there are many alternative conceptions of \u201crobust-\nness\u201d in the literature that do not correspond to the notion we are investigating. For example, work on\n\u201crobust optimization\u201d [28, 29] considers minimizing the worst case loss achieved given prespeci\ufb01ed\nbounds on the maximum data deviation that will be considered. Although interesting, these results\ndo not directly bear on the question at hand since we explicitly do not bound the magnitude of the\noutliers (i.e. the degree of leverage [23, \u00a71.1], nor the size of response deviations). Another notion\nof robustness is algorithmic stability under leave-one-out perturbation [3]. Although loosely related,\nalgorithmic stability addresses the analysis of given learning procedures rather than describing how\na stable algorithm might be generally achieved in the presence of arbitrary outliers. We also do not\nfocus on asymptotic or in\ufb01nitesimal notions from robust statistics, such as in\ufb02uence functions [11],\nnor impose boundedness assumptions on the domain and range of the data or the predictor [5, 6].\n\n2 Background\n\nWe consider the standard supervised setting where one is given an input matrix X and output targets\ny, with the goal of learning a predictor h :\u211cm \u2192\u211c. Each row of X gives the feature representation\nfor one training example, denoted Xi:, with corresponding target yi. 
We will assume the predictor can be written as a generalized linear model; that is, the predictions are given by ŷi = f(Xi:θ) for a fixed transfer function f (possibly identity) and a vector of parameters θ. For training, we will consider the standard L2 regularized loss minimization problem

min_θ γ/2 ‖θ‖₂² + Σ_{i=1}^n L(yi, ŷi) = min_θ γ/2 ‖θ‖₂² + Σ_{i=1}^n L(yi, f(Xi:θ))    (1)

where L denotes the loss function, γ is the regularization constant, and n denotes the number of training examples. Normally the loss function L is chosen to be convex in θ so that the minimization problem can be solved efficiently. Although convexity is important for computational tractability, it has the undesired side-effect of causing unbounded outlier sensitivity, as mentioned. Obviously, the severity of the problem will range from minimal to extreme depending on the nature of the distribution over (x, y). Nevertheless, our goal in this paper will be to eliminate unbounded sensitivity for convex loss functions while retaining a scalable computational approach.¹

Standard Convex Loss Functions: Our general construction applies to arbitrary convex losses, but we will demonstrate our methods on standard loss functions employed in regression and classification. A standard example is Bregman divergences, which are defined by taking a strongly convex differentiable potential Φ and then taking the difference between the potential and its first-order Taylor approximation, obtaining a loss L_Φ(ŷ‖y) = Φ(ŷ) − Φ(y) − φ(y)(ŷ − y), where φ(y) = Φ′(y) [1, 14]. 
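For the squared loss with identity transfer, the regularized objective (1) is quadratic and has a closed-form minimizer; the following is an illustrative sketch (the function name `ridge_fit` is ours, not from the paper):

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize gamma/2 * ||theta||^2 + sum_i (y_i - X_i: theta)^2 / 2.

    For the squared loss with identity transfer, the objective in (1)
    is quadratic, so the minimizer solves the normal equations
    (gamma * I + X^T X) theta = X^T y.
    """
    m = X.shape[1]
    return np.linalg.solve(gamma * np.eye(m) + X.T @ X, X.T @ y)

# Tiny example: exactly linear data is recovered as gamma -> 0.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true
theta_hat = ridge_fit(X, y, gamma=1e-8)
```

As the paper notes, it is exactly this kind of convex fit that a single outlier can dominate, which motivates the clipping construction of Section 3.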
Several natural loss functions can be defined this way, including least squares L_Φ(ŷ‖y) = (ŷ − y)²/2, using the potential Φ(y) = y²/2, and forward KL-divergence L_Φ(ŷ‖y) = ŷ ln(ŷ/y) + (1 − ŷ) ln((1 − ŷ)/(1 − y)), using the potential Φ(y) = y ln y + (1 − y) ln(1 − y) for 0 ≤ y ≤ 1.

¹All results in this paper extend to reproducing kernel Hilbert spaces via the representer theorem [24], but for clarity of presentation we will use an explicit feature representation X even though it is not a requirement.

A related construction is matching losses [14], which are determined by taking a strictly increasing differentiable transfer function f to be used in prediction via ŷ = f(z) where z = x⊤θ. Then, given a transfer f, a loss can be defined by LF(ẑ‖z) = ∫_z^ẑ (f(ζ) − f(z)) dζ = F(ẑ) − F(z) − f(z)(ẑ − z) such that F satisfies F′(z) = f(z). By definition, matching losses are also Bregman divergences, since F is differentiable and the assumptions on f imply that F is strongly convex. These two loss constructions are related by the equality L_Φ(y‖ŷ) = LF(ẑ‖z) where F is the Legendre-Fenchel conjugate of Φ [4, §3.3], z = f⁻¹(y) = φ(y) and ẑ = f⁻¹(ŷ) = φ(ŷ) [1, 14]. For example, the post-prediction KL-divergence y ln(y/ŷ) + (1 − y) ln((1 − y)/(1 − ŷ)) is equal to the convex pre-prediction loss LF(ẑ‖z) = ln(eẑ + 1) − ln(ez + 1) − σ(z)(ẑ − z) via the transfer ŷ = σ(ẑ) = (1 + e⁻ẑ)⁻¹. 
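The equality between the post-prediction KL-divergence and the pre-prediction matching loss under the sigmoid transfer can be checked numerically; a small sketch (function names are ours, not from the paper):

```python
import numpy as np

def kl_bernoulli(y, y_hat):
    # Post-prediction KL-divergence between Bernoulli parameters y, y_hat.
    return y * np.log(y / y_hat) + (1 - y) * np.log((1 - y) / (1 - y_hat))

def matching_loss_logistic(z_hat, z):
    # L_F(z_hat || z) = F(z_hat) - F(z) - f(z)(z_hat - z),
    # with transfer f = sigmoid and potential F(z) = ln(e^z + 1).
    F = lambda t: np.log1p(np.exp(t))
    sigma = lambda t: 1.0 / (1.0 + np.exp(-t))
    return F(z_hat) - F(z) - sigma(z) * (z_hat - z)

# The identity L_Phi(y || y_hat) = L_F(z_hat || z) holds under the
# change of variables z = logit(y), z_hat = logit(y_hat).
logit = lambda p: np.log(p / (1 - p))
y, y_hat = 0.8, 0.3
```

This mirrors the text's Legendre-Fenchel correspondence: the Bregman divergence on the potential equals the matching loss on its conjugate, with the arguments mapped through f⁻¹ = φ.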
Such\nlosses are prevalent in regression and probabilistic classi\ufb01cation settings.\nFor discrete classi\ufb01cation it is also natural to work with a continuous pre-prediction space \u02c6z = x\u22a4\u03b8,\nrecovering discrete post-predictions \u02c6y \u2208 {\u22121, 1} via a step transfer \u02c6y = sign(z). Although a step\ntransfer does not admit the matching loss construction, a surrogate margin loss can be obtained by\ntaking a nonincreasing function l such that limm\u2192\u221e l(m) = 0, then de\ufb01ning Ll(\u02c6y, y) = l(y \u02c6y).\nHere y \u02c6y is known as the classi\ufb01cation margin. Standard examples include misclassi\ufb01cation loss,\nLl(\u02c6y, y) = 1(y \u02c6y<0), support vector machine (hinge) loss, Ll(\u02c6y, y) = max(0, 1 \u2212 y \u02c6y), binomial\ndeviance loss, Ll(\u02c6y, y) = ln(1 + e\u2212y \u02c6y) [12], and Adaboost loss, Ll(\u02c6y, y) = e\u2212y \u02c6y [9]. If the margin\nloss is furthermore chosen to be convex, ef\ufb01cient minimization can be attained.\nTo unify our presentation below we will simply denote all loss functions by \u2113(y, x\u22a4\u03b8), with the\nunderstanding that \u2113(y, x\u22a4\u03b8) = L\u03a6(x\u22a4\u03b8ky) if the loss is Bregman divergence on potential \u03a6;\n\u2113(y, x\u22a4\u03b8) = LF (x\u22a4\u03b8kf\u22121(y)) if the loss is a matching loss with transfer f; and \u2113(y, x\u22a4\u03b8) =\nl(yx\u22a4\u03b8) if the loss is a margin loss with margin function l. In each case, the loss is convex in the\nparameters \u03b8. Note that by their very convexity these losses cannot be robust: all admit unbounded\nsensitivity to a single outlier (the same is also true for L1 loss when applied to regression).\n\nBounded loss functions: As observed, non-convex loss functions are necessary to bound the ef-\nfects of outliers [17]. 
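Before turning to bounded alternatives, the standard margin losses listed above are simple to state in code; a brief sketch (function names are ours):

```python
import numpy as np

# Margin losses l(m) evaluated on the classification margin m = y * z,
# where z = x^T theta; all but misclassification are convex surrogates.
def misclassification(m):
    return (np.asarray(m) < 0).astype(float)

def hinge(m):
    # Support vector machine loss max(0, 1 - m).
    return np.maximum(0.0, 1.0 - np.asarray(m))

def binomial_deviance(m):
    # Logistic-regression loss ln(1 + e^{-m}).
    return np.log1p(np.exp(-np.asarray(m)))

def adaboost(m):
    # Exponential loss e^{-m}.
    return np.exp(-np.asarray(m))
```

Each is unbounded (or, for misclassification, non-convex), which is exactly the tension the clipping construction below addresses.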
Black and Rangarajan [2] provide a useful catalog of bounded and redescending loss functions for robust regression, of which a representative example is the Geman and McClure loss L(y, ŷ) = (ŷ − y)²/(τ + (ŷ − y)²) for τ > 0; see Figure 1. Unfortunately, as Figure 1 makes plain, boundedness implies non-convexity (for any non-constant function). It therefore appears that bounded loss functions achieve robustness at the cost of losing global training guarantees. Our goal is to show that robustness and efficient global training are not mutually exclusive. Despite extensive research on regression and classification, almost no work we are aware of (save perhaps [30] in a limited way) attempts to reconcile robustness to outliers with global training algorithms.

3 Loss Clipping

Adapting the ideas of [2, 27, 30], given any convex loss ℓ(y, x⊤θ) define the clipped loss as

ℓc(y, x⊤θ) = min(1, ℓ(y, x⊤θ)).    (2)

Figure 1 demonstrates loss clipping for some standard loss functions. Given a clipped loss, a robust form of training problem (1) can be written as

min_θ γ/2 ‖θ‖₂² + Σ_{i=1}^n ℓc(yi, Xi:θ).    (3)

Clearly such a training objective bounds the influence of any one training example on the final result. Unfortunately, the formulation (3) is not computationally convenient because the optimization problem it poses is neither convex nor smooth. To make progress on the computational question we exploit a key observation: for any loss function, its corresponding clipped loss can be indirectly expressed by an auxiliary optimization of a smooth objective (if the original loss function itself was smooth). That is, given a loss ℓ(y, x⊤θ), define the corresponding ρ-relaxed loss to be

ℓρ(y, x⊤θ) = ρℓ(y, x⊤θ) + 1 − ρ    (4)

for 0 ≤ ρ ≤ 1; see Figure 1. This construction is an instance of an outlier process as described in [2] and is motivated by a special case hinge-loss construction originally proposed in [30].

Figure 1: Comparing standard losses (dashed) with corresponding "clipped" losses (solid), ρ-relaxed losses (dotted), and non-convex robust losses (dash-dotted). Left: squared loss (dashed), clipped (solid), 1/3-relaxed (dotted), robust Geman and McClure loss [2] (dash-dotted). Center: SVM hinge loss (dashed), clipped [27, 30] (solid), 1/2-relaxed (upper dotted), robust 1 − tanh(yŷ) loss [19] (dash-dotted). Right: Adaboost exponential loss (dashed), clipped (solid), 1/2-relaxed (upper dotted), robust 1 − tanh(yŷ) loss [19] (dash-dotted).

The ρ-relaxation provides a convenient characterization of any clipped loss, since it can be shown in general that minimizing a corresponding ρ-relaxed loss is equivalent to minimizing the clipped loss.

Proposition 1 For any loss function ℓ(y, x⊤θ), we have ℓc(y, x⊤θ) = min_{0≤ρ≤1} ℓρ(y, x⊤θ).

(The proof is straightforward, but it is given in the supplement for completeness.) Proposition 1 now allows us to reformulate (3) as a smooth optimization using the fact that the optimization is completely separable between the ρi variables:

(3) = min_θ min_{0≤ρ≤1} γ/2 ‖θ‖₂² + Σ_{i=1}^n (ρi ℓ(yi, Xi:θ) + 1 − ρi).    (5)

Unfortunately, the resulting problem is not jointly convex in ρ and θ even though it is convex in each given the other. 
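Proposition 1 and the clipping construction (2)-(4) are easy to verify numerically; an illustrative sketch (assuming nothing beyond the definitions above):

```python
import numpy as np

def clipped(loss):
    # Clipped loss (2): truncate loss values at 1.
    return np.minimum(1.0, loss)

def rho_relaxed(loss, rho):
    # rho-relaxed loss (4): rho * loss + 1 - rho, for 0 <= rho <= 1.
    return rho * loss + 1.0 - rho

# Proposition 1: minimizing the relaxed loss over rho in [0, 1]
# recovers the clipped loss; the minimum sits at rho = 1 when the
# loss is below 1 and at rho = 0 when it exceeds 1.
losses = np.array([0.2, 1.0, 7.5])
rhos = np.linspace(0.0, 1.0, 1001)
inner_min = np.min(rho_relaxed(losses[:, None], rhos[None, :]), axis=1)
```

The discrete 0/1 pattern of the inner minimizer is precisely why naive alternating minimization over ρ and θ gets trapped, as discussed next.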
Such marginal convexity might suggest an alternating minimization strategy; however, the proof of Proposition 1 shows that each minimization over ρ will result in ρi = 0 for losses greater than 1, or ρi = 1 for losses less than 1. Such discrete assignments immediately cause the search to get trapped in local minima, so a more sophisticated approach must be considered.

4 A Convex Relaxation

One contribution of this paper is to derive an exact reformulation of (5) that admits a convex relaxation and rounding scheme that retains bounded sensitivity to outliers. We first show how the relaxation can be efficiently solved by a scalable algorithm that eliminates any need for semidefinite programming, then provide a guarantee of bounded outlier sensitivity in Section 5.

Reformulation: To ease the notational burden, let us rewrite (5) in matrix-vector form

(5) = min_{0≤ρ≤1} min_θ R(ρ, θ)    (6)

where

R(ρ, θ) = γ/2 ‖θ‖² + ρ⊤ℓ(y, Xθ) + 1⊤(1 − ρ).    (7)

Here 1 denotes the vector of all 1s, and it is understood that ℓ(y, Xθ) refers to the n × 1 vector of individual training losses. Given that ℓ(·,·) is convex in its second argument, we will be able to exploit Fenchel duality to re-express the min-min form (6) as a min-max form that will serve as the basis for the subsequent relaxation. In particular, consider the definition

ℓ∗(y, α) = sup_θ αx⊤θ − ℓ(y, x⊤θ).    (8)

By construction, ℓ∗(y, α) is guaranteed to be convex in α since it is a pointwise maximum over linear functions [4, §3.2].

Lemma 1 For any convex differentiable loss function ℓ(y, x⊤θ) such that the level sets of ℓα(v) = αx⊤(θ − v) + ℓ(y, x⊤v) are bounded, we have

ℓ(y, x⊤θ) = sup_α αx⊤θ − ℓ∗(y, α).    (9)

(This is a standard result, but a proof is given in the supplement for completeness.) For standard losses ℓ∗(y, α) can be computed explicitly [1, 7]. For example, if ℓ(y, x⊤θ) = (y − x⊤θ)²/2 then ℓ∗(y, α) = α²/2 + αy. Now let ∆(α) denote putting α in the main diagonal of a square matrix and let ℓ∗(y, α) refer to the n × 1 vector of dual values over training examples. We can then express the main reformulation as follows.

Theorem 1 Let K = XX⊤ denote the kernel matrix over input data. Then

(6) = min_{ν : −(1/√(n+1))1 ≤ ν ≤ (1/√(n+1))1, ν1 = 1/√(n+1), ‖ν‖ = 1}  sup_α −(n + 1) ν⊤T(α)ν    (10)

where ν is an (n + 1) × 1 variable, α is an n × 1 variable, and the matrix T(α) is given by

T(α) = 1/(8γ) [1⊤; I] ∆(α)K∆(α) [1, I] + 1/4 [2(1⊤ℓ∗(y, α) − n), (ℓ∗(y, α) + 1)⊤; ℓ∗(y, α) + 1, 0].    (11)

The proof consists in first dualizing θ in (6) via Lemma 1, which establishes the key relationship

θ = −(1/γ) X⊤∆(ρ)α.    (12)

The remainder of the proof is merely algebra: given a solution ν to (10), the corresponding solution ρ to (6) can be recovered via ρ = (1/2)(1 + ν2:n+1 √(n+1)). See the supplement for full details.

Note that the formulation (10) given in Theorem 1 is exact. No approximation to the problem (6) has been introduced to this point. Unfortunately, as in (6), the formulation (10) is still not directly amenable to an efficient algorithm: the objective is concave in α, conveniently, but it is not convex in ν. The advantage attained by (10), however, is that we can now derive an effective relaxation.

Relaxation: Let δ(M) denote the main diagonal vector of the square matrix M and let tr(M) denote the trace. Consider the following relaxation

(10) ≥ min_{M⪰0, δ(M)=(1/(n+1))1} sup_α −(n + 1) tr(M T(α))    (13)

= sup_α min_{M⪰0, δ(M)=(1/(n+1))1} −(n + 1) tr(M T(α))    (14)

where we used strong minimax duality to obtain (14) from (13): since the constraint region on M is compact and the inner objective is concave and convex in α and M respectively, Sion's minimax theorem is applicable [22, §37]. 
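The conjugate (8) and the recovery in Lemma 1 can be sanity-checked for the squared-loss example above; a numerical sketch (the grid-based suprema are our simplification of the exact optimizations):

```python
import numpy as np

def sq_loss(y, z):
    # ell(y, z) = (y - z)^2 / 2, writing z = x^T theta.
    return 0.5 * (y - z) ** 2

def sq_conjugate(y, alpha):
    # Closed form for the squared loss: ell*(y, alpha) = alpha^2/2 + alpha*y.
    return 0.5 * alpha ** 2 + alpha * y

y, alpha, z = 1.5, 0.7, -2.0
grid = np.linspace(-50.0, 50.0, 200001)

# (8): ell*(y, alpha) = sup_z alpha*z - ell(y, z), approximated on a grid.
conj_numeric = np.max(alpha * grid - sq_loss(y, grid))

# Lemma 1 / (9): the loss is recovered as sup_alpha alpha*z - ell*(y, alpha).
loss_numeric = np.max(grid * z - sq_conjugate(y, grid))
```

Both suprema are attained well inside the grid (at z = y + α and α = z − y respectively), so the numerical and closed forms agree tightly.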
Next enforce the constraint δ(M) = (1/(n+1))1 with a Lagrange multiplier λ:

(14) = sup_{α, λ} min_{M⪰0, tr(M)=1} −(n + 1) tr(M T(α)) + λ⊤(1 − (n + 1)δ(M))    (15)

= sup_{α, λ} λ⊤1 − (n + 1) max_{M⪰0, tr(M)=1} tr[M(T(α) + ∆(λ))].    (16)

This relaxed formulation (16) is now amenable to efficient global optimization: the outer problem is jointly concave in α and λ, since it is a pointwise minimum of concave functions. The inner optimization with respect to M can now be simplified by exploiting the well known result [21]:

max_{M⪰0, tr(M)=1} tr[M(T(α) + ∆(λ))] = max_{‖ν‖=1} ν⊤[T(α) + ∆(λ)]ν.    (17)

Therefore, given α and λ, the inner problem is solved by the maximum eigenvector of T(α) + ∆(λ).

Optimization Procedure: Given training data, an outer maximization can be executed jointly over α and λ to maximize (16). This outer problem is concave in α and λ, hence no local maxima exist. Although the outer problem is not smooth, many effective methods exist for nonsmooth convex optimization [20, 31]. Each outer function evaluation (and subgradient calculation) requires the inner problem (17) to be solved. Fortunately, a simple power method [10] can be used to efficiently compute a maximum eigenvector solution to the inner problem by only performing matrix-vector multiplications on the individual factors of the two low-rank matrices making up T(α), meaning the inner problem can be solved without ever forming a large n × n matrix T(α). That is, if X is n × m, each inner iteration requires at most O(nm) computation.

Solution Recovery: At a solution, the values of (13)–(16) are equal, and all provide a lower bound on the original objective (6). 
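The maximum-eigenvector computation in (17) can be sketched with a basic power iteration; note the positive spectral shift, which the power method needs in order to track the largest algebraic (rather than largest-magnitude) eigenvalue, is our own detail, and its size is an assumption the caller must supply:

```python
import numpy as np

def max_eigvec(matvec, n, shift, iters=500, seed=0):
    """Power iteration for a maximum-eigenvalue eigenvector of a
    symmetric matrix available only through matrix-vector products.

    The shift must satisfy shift >= |lambda_min| so that the dominant
    eigenvalue of A + shift*I is the one with largest algebraic value
    (an assumption of this sketch, not specified in the text).
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = matvec(v) + shift * v
        v = w / np.linalg.norm(w)
    return v

# Toy check against a known spectrum.
A = np.diag([3.0, 1.0, -2.0])
v = max_eigvec(lambda u: A @ u, 3, shift=2.0)
rayleigh = v @ (A @ v)
```

Since T(α) is a sum of low-rank and diagonal pieces, `matvec` can be applied in O(nm) without materializing the (n + 1) × (n + 1) matrix, as the text notes.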
Ideally, given a maximizer α∗ for (14) one would recover a prediction model θ via (12). However, (12) requires ρ to be acquired first, which could be obtained from a ν that solves (10). Unfortunately, the relaxation step taken in (13) means that the solution to (14) (recovered from the ν that solves (17)) does not necessarily solve (10): the inner solution ν in (17) might not be unique. If it is unique, we immediately have the optimal solution to (10), hence an exact solution to the original problem (6). More typically, however, the maximum eigenvector is not unique at (α∗, λ∗), meaning that a gap has been introduced; this occurs if and only if the inner solution M∗ to (14) is not rank 1. In such cases we need to use a rounding procedure to recover an effective rank 1 approximation.

Rounding Method: Given the inner maximizer (α∗, λ∗) of (16) we do not need to explicitly construct the outer minimizer M∗. Instead, it suffices to construct a basis for M∗ by collecting the set of maximum eigenvectors Ṽ = {ν̃1, ..., ν̃k} of T(α∗) + ∆(λ∗) in (17) (note that k is usually much smaller than n + 1). A solution can then be indirectly obtained by solving a small semidefinite program to recover a k × k matrix C∗ that satisfies C∗ ⪰ 0 and δ(Ṽ C∗Ṽ⊤) = (1/(n+1))1. Note that C∗ = Q∗Σ∗Q∗⊤ for some orthonormal Q∗ and diagonal Σ∗, where σ∗j ≥ 0 and Σ_{j=1}^k σ∗j = 1, hence M∗ = V∗Σ∗V∗⊤ such that V∗ = {ν∗1, ..., ν∗k} = ṼQ∗. Given V∗ and Σ∗, a rounded solution for ρ̂ can be recovered simply by computing ν̄∗ = Σ_{j=1}^k σ∗j ν∗j and then setting ρ̂ = (1/2)(1 + ν̄∗2:n+1 √(n+1)). From the constraints on C∗ it follows that −1/√(n+1) ≤ ν̄∗j ≤ 1/√(n+1), hence 0 ≤ ρ̂j ≤ 1 for all j (details in the supplement). Finally, instead of relying on (12) to recover the model parameters θ̂ from ρ̂, we explicitly minimize the ρ̂-relaxed loss (7) given ρ̂ to recover θ̂ via θ̂ = arg min_θ R(ρ̂, θ).

Although the rounding step has introduced an approximation, we establish that bounded outlier sensitivity can still be retained, even after the above relaxation and rounding processes, and demonstrate experimentally that the gap from optimality is generally not too large.

5 Bounding Outlier Sensitivity

Thus far we have proposed a robust training objective, provided an efficient convex relaxation that establishes a lower bound, and proposed a simple rounding method for recovering an approximate solution. The question remains as to whether the approximate solution retains bounded sensitivity to outliers (or to leverage points [23, §1.1]). Let (ρ∗, θ∗) denote the joint minimizer of (6) and let (ρ̂, θ̂) denote the approximate solution obtained from the procedure above.

First, observe that an upper bound on the approximation error can be easily computed by subtracting the lower bound value obtained in (14)–(16) from R(ρ̂, θ̂). Our experiments below show that reasonable gaps are obtained in this way. 
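For the squared loss, the final refitting step θ̂ = arg min_θ R(ρ̂, θ) is a ρ̂-weighted ridge regression with a closed form; an illustrative sketch (not the authors' implementation):

```python
import numpy as np

def refit_theta(X, y, rho_hat, gamma):
    """Recover theta_hat = argmin_theta R(rho_hat, theta) for the squared
    loss ell = (y - X theta)^2 / 2: a rho-weighted ridge regression whose
    normal equations are
        (gamma I + X^T diag(rho) X) theta = X^T diag(rho) y.
    """
    m = X.shape[1]
    W = X.T * rho_hat  # X^T diag(rho_hat), via broadcasting over columns
    return np.linalg.solve(gamma * np.eye(m) + W @ X, W @ y)

# Down-weighted examples (rho ~ 0) have almost no influence on theta_hat.
X = np.array([[1.0], [2.0], [100.0]])   # last row is a leverage point
y = np.array([1.0, 2.0, -500.0])        # and an outlier in y
rho = np.array([1.0, 1.0, 0.0])
theta_hat = refit_theta(X, y, rho, gamma=1e-6)
```

This makes concrete how the adaptive ρ̂ bounds the influence of outliers and leverage points: the contaminated third example is simply switched off.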
Nevertheless one would still like to guarantee that the gap\nstays bounded in the presence of arbitrary outliers and leverage points.\n\nTheorem 2 \u02c6R(\u02c6\u03c1, \u03b1\u2217) \u2264 2R(\u03c1\u2217, \u03b8\u2217) \u2264 2n, where \u02c6R(\u02c6\u03c1, \u03b1\u2217) is the value of (10) at the rounded\nsolution \u02c6\u03c1. Furthermore, if the unclipped loss \u2113(y, \u02c6y) is b-Lipschitz in \u02c6y for b < \u221e and either y or\nK remains bounded, then there exists a c < \u221e such that R(\u02c6\u03c1, \u02c6\u03b8) \u2264 c.\nThat is, the \u03c1-relaxed loss obtained by the rounded solution stays bounded in this case, even when\naccounting for the proposed relaxation and rounding procedure and data perturbation. The complete\nproof takes some space, however the key steps are to show that \u2212(n+1)tr(M\u2217T (\u03b1\u2217)) \u2264 R(\u03c1\u2217, \u03b8\u2217),\nand then use this to establish that \u02c6R(\u02c6\u03c1, \u03b1\u2217) \u2264 2R(\u03c1\u2217, \u03b8\u2217) and R(\u02c6\u03c1, \u02c6\u03b8) \u2264 c, respectively (full details\nin the supplement). Thus, (\u02c6\u03c1, \u02c6\u03b8) will not chase outliers or leverage points arbitrarily in this situation.\nNote that the proposed method cannot be characterized by minimizing a \ufb01xed convex loss. That is,\nthe tightest convex upper bound for any convex loss function is simply given by the function itself,\nwhich corresponds to setting \u03c1i = 1 for every training example. By contrast, our approximation\nmethod does not choose a constant \u03c1i = 1 for every training example, but instead adaptively chooses\n\u03c1i values, closer to 1 for inliers and closer to 0 for outliers. The resulting upper bound on the clipped\nloss (hence on the misclassi\ufb01cation error in the margin loss case) is much tighter than that achieved\nby simply minimizing a convex loss. 
This outcome is demonstrated clearly in our experiments.

Figure 2: Comparison on three demonstration data sets. [Three panels (a)–(c) plot y versus x, showing the data together with the models found by ClipAlt (local), L1, L2, and ClipRelax.]

Loss            | p = 0.0        | p = 0.2        | p = 0.4
L2              | 2.53 ± 0.0015  | 25.11 ± 13.78  | 19.04 ± 15.62
L1              | 2.53 ± 0.0015  | 26.52 ± 16.09  | 27.14 ± 22.40
HuberM          | 2.52 ± 0.0015  | 12.02 ± 5.33   | 12.30 ± 5.87
GM (local)      | 2.53 ± 0.0015  | 2.60 ± 0.10    | 2.62 ± 0.09
ClipAlt (local) | 2.53 ± 0.0019  | 2.75 ± 0.27    | 2.81 ± 0.27
ClipRelax       | 2.53 ± 0.0016  | 2.68 ± 0.12    | 2.53 ± 0.87
OptimGap        | 1.65% ± 0.31%  | 0.10% ± 0.22%  | 0.70% ± 1.31%

Table 1: Synthetic experiment with n = 200, m = 5, and t = 500. Test error rates (RMSE) on clean data (average ± standard deviations) at different outlier probabilities p, 20 repeats. The bottom row shows the relative gap obtained between the ρ-relaxed loss of the rounded solution and the computed lower bound (16).

6 Experimental Results

In this section, we experimentally evaluate the preceding technical developments on synthetic and real data for both regression and classification.

Regression: We first illustrate the behavior of the various regression techniques by a simple demonstration. In Figure 2 (a) and (b), we generate a cluster of linearly related data y = x in a small interval about the origin, then add outliers. 
In Figure 2 (c) the target linear model is mixed with another more dispersed model. We compare the behaviours of standard regression losses: least-squares (L2), L1 (L1), the Huber minimax loss (HuberM) [13, 17], and the robust Geman and McClure loss (GM) [2]. To these we compare the proposed relaxed method (ClipRelax), along with an alternating minimizer of the clipped loss (ClipAlt). (In this problem the value of γ has little effect, and is simply set to 0.1.) Figure 2 demonstrates that the three convex losses, L2, L1 and HuberM, are dominated by outliers. By contrast, ClipRelax successfully found the correct linear model in each case. Note that the robust GM loss finds two different minima, corresponding to those of L2 and ClipRelax respectively, hence it was not depicted in the plot. ClipAlt also gets trapped in local minima as expected: it finds the correct model in Figure 2 (a) but incorrect models in Figure 2 (b) and (c).

In our second synthetic regression experiment we consider larger problems. Here a target weight vector θ is drawn from N(0, I), with inputs Xi: sampled uniformly from [0, 1]^m, m = 5. The outputs yi are computed as yi = Xi:θ + εi, εi ~ N(0, 1/4). We then seed the data set with outliers by randomly re-sampling each yi (and Xi:) from N(0, 10^5) and N(0, 10^2) respectively, governed by an outlier probability p. Here 200 of the 700 examples are randomly chosen as the training set and the rest used for testing. We compare the same six methods: L2, L1, HuberM, GM, ClipAlt and ClipRelax. The regularization parameter γ was set on a separate validation set. These experiments are repeated 20 times and average (Huber loss) test errors on clean data are reported (with standard deviations) in Table 1. Clearly, the outliers significantly affect the performance of least squares. In this case the proposed relaxation performs comparably to the non-convex GM loss. 
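The contamination scheme of this synthetic experiment can be sketched as follows; whether N(0, 10^5) and N(0, 10^2) denote variances or standard deviations is ambiguous in the text, so treating them as variances is our assumption:

```python
import numpy as np

def make_outlier_data(n, m, p, seed=0):
    """Sample the synthetic regression setup described above
    (an illustrative sketch; constants follow the text):
    theta ~ N(0, I), X ~ Uniform[0, 1]^m, y = X theta + eps with
    eps ~ N(0, 1/4); each example becomes an outlier with probability p,
    resampling y_i from N(0, 10^5) and X_i: from N(0, 10^2), where the
    second parameters are read as variances (our assumption)."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(m)
    X = rng.uniform(0.0, 1.0, size=(n, m))
    y = X @ theta + rng.normal(0.0, 0.5, size=n)      # std = sqrt(1/4)
    mask = rng.random(n) < p                           # outlier indicator
    X[mask] = rng.normal(0.0, 10.0, size=(mask.sum(), m))
    y[mask] = rng.normal(0.0, np.sqrt(1e5), size=mask.sum())
    return X, y, theta, mask

X, y, theta, mask = make_outlier_data(200, 5, 0.2)
```

Note the outliers are unbounded in both response and leverage, which is exactly the regime in which Theorem 2's guarantee applies.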
Interestingly, this experiment shows that the relative gap between the \u03c1-robust loss obtained by the proposed method and the lower bound on the optimal \u03c1-robust loss (16) remains remarkably small, indicating that our robust relaxation (almost) optimally minimizes the original non-convex clipped loss.\n\nLoss             | Astronomy (1, 46, 46) | Cal-housing (8, 100, 1000) | Pumadyn (32, 500, 1000)\nL2               | 2.484 | 804.5 \u00b1 892.5 | 1.300 \u00d7 10^5 \u00b1 68.29\nL1               | 0.170 | 0.325 \u00b1 0.046 | 5.133 \u00b1 0.056\nHuberM           | 0.149 | 0.306 \u00b1 0.050 | 5.377 \u00b1 0.007\nGM (local)       | 0.166 | 0.329 \u00b1 0.048 | 4.399 \u00b1 0.003\nClipAlt (local)  | 0.176 | 0.329 \u00b1 0.048 | 4.075 \u00b1 1 \u00d7 10^-6\nClipRelax        | 0.131 | 0.136 \u00b1 0.155 | 4.075 \u00b1 1 \u00d7 10^-6\n\nTable 2: Error rates (average root mean squared error, \u00b1 standard deviations) for the different regression estimators on various data sets. The values of (m, n, tt) are indicated for each data set, where m is the number of features, n is the number of training examples, and tt is the number of testing samples.\n\nLoss             | p = 0.02     | p = 0.05    | p = 0.1      | p = 0.2\nLogit            | 4.88 \u00b1 6.17  | 1.61 \u00b1 7.23 | 17.67 \u00b1 4.00 | 19.53 \u00b1 2.91\n1-tanh (local)   | 0.91 \u00b1 1.93  | 2.30 \u00b1 2.85 | 6.49 \u00b1 4.32  | 13.96 \u00b1 3.38\nClipAlt (local)  | 0.46 \u00b1 0.64  | 1.51 \u00b1 1.45 | 4.27 \u00b1 2.57  | 11.32 \u00b1 3.48\nClipRelax        | 0.26 \u00b1 0.34  | 0.78 \u00b1 0.78 | 2.49 \u00b1 3.38  | 10.10 \u00b1 8.21\n\nTable 3: Misclassification error rates on clean data (average error, \u00b1 standard deviations) on the Long-Servedio problem [16] with increasing noise levels p.\n\nFinally, we investigated the behavior of the regression methods on a few real data sets. We chose three data sets: astronomy data containing outliers from [23], and two UCI data sets, which we seeded with outliers. Test results are reported on clean data to avoid skewing the reported results. 
For UCI data, outliers were added by resampling Xi: and yi from N(0, 1000), with 5% outliers. The regularization parameter \u03b3 was chosen through 10-fold cross validation on the training set. Note that in real regression problems one needs an estimate of the scale, given by the true standard deviation of the noise in the data. Here we estimated the scale using the mean absolute deviation, a robust approach commonly used in the robust statistics literature [17]. In Table 2, one can see that on these data sets ClipRelax clearly outperformed the other methods. L2 is clearly skewed by the outliers. Unsurprisingly, the classical robust loss functions, L1 and HuberM, perform better than L2 in the presence of outliers, but not as well as ClipRelax.\n\nClassification: We investigated the well-known case study from [16] and compared the proposed method to logistic regression (i.e. the logit, or binomial deviance, loss [12]) and the robust 1 \u2212 tanh loss [19] in a classification context. Here 200 examples were drawn from the target distribution with label noise applied at various levels. The experiment was repeated 50 times to obtain average results and standard deviations. Table 3 shows the test error performance on clean data of the different methods. From these results one can conclude that ClipRelax is more robust than standard logit training. Training with the logit loss is slightly, though not significantly, better than the 1 \u2212 tanh loss algorithm in terms of training loss. It is interesting to see that when the prediction error is measured on clean labels, ClipRelax generalizes significantly better than the robust 1 \u2212 tanh loss. 
This implies that the classification model produced by ClipRelax is closer to the true model despite the presence of outliers, demonstrating that the proposed method can be robust in a simple classification context.\n\n7 Conclusion\n\nWe have proposed a robust estimation method for regression and classification based on a notion of \u201closs-clipping\u201d. Although the method is not as fast as standard convex training, it is scalable to problems of moderate size. The key benefit is competitive (or better) estimation quality than the state-of-the-art in robust estimation, while ensuring provable robustness to outliers and computable bounds on the optimality gap. To the best of our knowledge, these two properties have not previously been achieved simultaneously. It would be interesting to investigate whether the techniques developed here can also be applied to other forms of robust estimators from the classical literature, including GM, MM, L, R and S estimators [11, 13, 17, 23]. Connections with algorithmic stability [3] and influence-function-based analysis [5, 6, 11] merit further investigation. Obtaining tighter bounds on approximation quality that would enable a proof of consistency also remains an important challenge.\n\nReferences\n\n[1] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 6:1705\u20131749, 2005.\n\n[2] M. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1):57\u201391, 1996.\n\n[3] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2, 2002.\n\n[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004.\n\n[5] A. Christmann and I. Steinwart. On robustness properties of convex risk minimization methods for pattern recognition. 
Journal of Machine Learning Research, 5:1007\u20131034, 2004.\n\n[6] A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799\u2013819, 2007.\n\n[7] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48, 2002.\n\n[8] Y. Freund. A more robust boosting algorithm, 2009. arXiv.org:0905.2138.\n\n[9] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119\u2013139, 1997.\n\n[10] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins U. Press, 1996.\n\n[11] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.\n\n[12] T. Hastie, R. Tibshirani, and J. Friedman. Elements of Statistical Learning. Springer, 2nd edition, 2009.\n\n[13] P. Huber and E. Ronchetti. Robust Statistics. Wiley, 2nd edition, 2009.\n\n[14] J. Kivinen and M. Warmuth. Relative loss bounds for multidimensional regression problems. Machine Learning, 45:301\u2013329, 2001.\n\n[15] N. Krause and Y. Singer. Leveraging the margin more carefully. In Proceedings of the International Conference on Machine Learning (ICML), 2004.\n\n[16] P. Long and R. Servedio. Random classification noise defeats all convex potential boosters. Machine Learning, 78:287\u2013304, 2010.\n\n[17] R. Maronna, R. D. Martin, and V. Yohai. Robust Statistics: Theory and Methods. Wiley, 2006.\n\n[18] H. Masnadi-Shirazi and N. Vasconcelos. On the design of loss functions for classification: theory, robustness to outliers, and SavageBoost. In Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1049\u20131056, 2008.\n\n[19] L. Mason, J. Baxter, P. Bartlett, and M. Frean. 
Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 2000.\n\n[20] Y. Nesterov. Introductory Lectures on Convex Optimization. Kluwer, 2004.\n\n[21] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62(2):321\u2013357, 1993.\n\n[22] R. Rockafellar. Convex Analysis. Princeton U. Press, 1970.\n\n[23] P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 1987.\n\n[24] B. Schoelkopf and A. Smola. Learning with Kernels. MIT Press, 2002.\n\n[25] X. Shen, G. Tseng, X. Zhang, and W.-H. Wong. On \u03c8-learning. Journal of the American Statistical Association, 98(463):724\u2013734, 2003.\n\n[26] C. Stewart. Robust parameter estimation in computer vision. SIAM Review, 41(3), 1999.\n\n[27] Y. Wu and Y. Liu. Robust truncated hinge loss support vector machines. Journal of the American Statistical Association, 102(479):974\u2013983, 2007.\n\n[28] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. In Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1801\u20131808, 2008.\n\n[29] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485\u20131510, 2009.\n\n[30] L. Xu, K. Crammer, and D. Schuurmans. Robust support vector machine training via convex outlier ablation. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2006.\n\n[31] J. Yu, S. Vishwanathan, S. G\u00fcnter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization problems in machine learning. Journal 
of Machine Learning Research, 11:1145\u20131200, 2010.\n", "award": [], "sourceid": 899, "authors": [{"given_name": "Min", "family_name": "Yang", "institution": null}, {"given_name": "Linli", "family_name": "Xu", "institution": null}, {"given_name": "Martha", "family_name": "White", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Yao-liang", "family_name": "Yu", "institution": null}]}