{"title": "A Polynomial-time Form of Robust Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 2483, "page_last": 2491, "abstract": "Despite the variety of robust regression methods that have been developed, current regression formulations are either NP-hard, or allow unbounded response to even a single leverage point. We present a general formulation for robust regression --Variational M-estimation--that unifies a number of robust regression methods while allowing a tractable approximation strategy. We develop an estimator that requires only polynomial-time, while achieving certain robustness and consistency guarantees. An experimental evaluation demonstrates the effectiveness of the new estimation approach compared to standard methods.", "full_text": "A Polynomial-time Form of Robust Regression\n\nDepartment of Computing Science, University of Alberta, Edmonton AB T6G 2E8, Canada\n\nYaoliang Yu, \u00a8Ozlem Aslan and Dale Schuurmans\n\n{yaoliang,ozlem,dale}@cs.ualberta.ca\n\nAbstract\n\nDespite the variety of robust regression methods that have been developed, cur-\nrent regression formulations are either NP-hard, or allow unbounded response\nto even a single leverage point. We present a general formulation for robust\nregression\u2014Variational M-estimation\u2014that uni\ufb01es a number of robust regression\nmethods while allowing a tractable approximation strategy. We develop an esti-\nmator that requires only polynomial-time, while achieving certain robustness and\nconsistency guarantees. An experimental evaluation demonstrates the effective-\nness of the new estimation approach compared to standard methods.\n\nIntroduction\n\n1\nIt is well known that outliers have a detrimental effect on standard regression estimators. Even a\nsingle erroneous observation can arbitrarily affect the estimates produced by methods such as least\nsquares. 
Unfortunately, outliers are prevalent in modern data analysis, as large data sets are automatically gathered without the benefit of manual oversight. Thus the need for regression estimators that are both scalable and robust is increasing.\nAlthough the field of robust regression is well established, it has not considered computational complexity analysis to be one of its central concerns. Consequently, none of the standard regression estimators in the literature are both robust and tractable, even in a weak sense: it has been shown that standard robust regression formulations with non-zero breakdown are NP-hard [1, 2], while any estimator based on minimizing a convex loss cannot guarantee bounded response to even a single leverage point [3] (definitions given below). Surprisingly, there remain no standard regression formulations that guarantee both polynomial run-time and bounded response to even single outliers.\nIt is important to note that robustness and tractability can be achieved under restricted conditions. For example, if the domain is bounded, then any estimator based on minimizing a convex and Lipschitz-continuous loss achieves high breakdown [4]. Such results have been extended to kernel-based regression under the analogous assumption of a bounded kernel [5, 6]. Unfortunately, these results no longer hold when the domain or kernel is unbounded: in such a case arbitrary leverage can occur [4, 7], and no (non-constant) convex loss, even a Lipschitz-continuous one, can ensure robustness against even a single outlier [3]. Our main motivation therefore is to extend these existing results to the case of an unbounded domain. 
Unfortunately, the inapplicability of convex losses in this situation means that computational tractability becomes a major challenge, and new computational strategies are required to achieve tractable robust estimators.\nThe main contribution of this paper is to develop a new robust regression strategy that can guarantee both polynomial run-time and bounded response to individual outliers, including leverage points. Although such an achievement is modest, it is based on two developments of interest. The first is a general formulation of adaptive M-estimation, Variational M-estimation, that unifies a number of robust regression formulations, including convex and bounded M-estimators with certain subset-selection estimators such as Least Trimmed Loss [7]. By incorporating Tikhonov regularization, these estimators can be extended to reproducing kernel Hilbert spaces (RKHSs). The second development is a convex relaxation scheme that ensures bounded outlier influence on the final estimator.\nThe overall estimation procedure is guaranteed to be tractable, robust to single outliers with unbounded leverage, and consistent under non-trivial conditions. An experimental evaluation of the proposed estimator demonstrates effective performance compared to standard robust estimators.\nThe closest previous works are [8], which formulated variational representations of certain robust losses, and [9], which formulated a convex relaxation of bounded loss minimization. Unfortunately, [8] did not offer a general characterization, while [9] did not prove their final estimator was robust, nor was any form of consistency established. 
The formulation we present in this paper generalizes [8], while the convex relaxation scheme we propose is simpler and tighter than that of [9]; we are thus able to establish non-trivial forms of both robustness and consistency while maintaining tractability.\nThere are many other notions of \u201crobust\u201d estimation in the machine learning literature that do not correspond to the specific notion addressed in this paper. Work on \u201crobust optimization\u201d [10-12], for example, considers minimizing the worst-case loss achieved under bounds on the maximum data deviation that will be considered. Such results are not relevant to the present investigation because we explicitly do not bound the magnitude of the outliers. Another notion of robustness is algorithmic stability under leave-one-out perturbation [13], which analyzes specific learning procedures rather than describing how a stable algorithm might be generally achieved.\n\n2 Preliminaries\n\nWe start by considering the standard linear regression model\n\ny = x^T θ* + u,    (1)\n\nwhere x is an R^p-valued random variable, u is a real-valued random noise term, and θ* ∈ Θ ⊆ R^p is an unknown deterministic parameter vector. Assume we are given a sample of n independent identically distributed (i.i.d.) observations represented by an n × p matrix X and an n × 1 vector y, where each row X_i: is drawn from some unknown marginal probability measure P_x, and y_i is generated according to (1). Our task is to estimate the unknown deterministic parameter θ* ∈ Θ. Clearly, this is a well-studied problem in statistics and machine learning. If the noise distribution has a known density p(·), then a standard estimator is given by maximum likelihood:\n\nθ̂_ML ∈ arg min_{θ∈Θ} (1/n) Σ_{i=1}^n -log p(y_i - X_i:θ) = arg min_{θ∈Θ} (1/n) Σ_{i=1}^n -log p(r_i),    (2)\n\nwhere r_i = y_i - X_i:θ is the ith residual. When the noise distribution is unknown, one can replace the negative log-likelihood with a loss function ρ(·) and use the estimator\n\nθ̂_M ∈ arg min_{θ∈Θ} (1/n) 1^T ρ(y - Xθ),    (3)\n\nwhere ρ(r) denotes the vector of losses obtained by applying the loss componentwise to each residual, hence 1^T ρ(r) = Σ_{i=1}^n ρ(r_i). Such a procedure is known as M-estimation in the robust statistics literature, and empirical risk minimization in the machine learning literature.1\n1 Generally one has to introduce an additional scale parameter σ and allow rescaling of the residuals via r_i/σ, to preserve parameter equivariance [3, 4]. However, we will initially assume a known scale.\nAlthough uncommon in robust regression, it is conventional in machine learning to include a regularizer. In particular, we will use Tikhonov (\u201cridge\u201d) regularization by adding a squared penalty:\n\nθ̂_MR ∈ arg min_{θ∈Θ} (1/n) 1^T ρ(y - Xθ) + (λ/2)‖θ‖²  for λ ≥ 0.    (4)\n\nThe significance of Tikhonov regularization is that it ensures θ̂_MR = X^T α for some α ∈ R^n [14]. More generally, under Tikhonov regularization, the regression problem can be conveniently expressed in a reproducing kernel Hilbert space (RKHS). If we let H denote the RKHS corresponding to a positive semidefinite kernel κ : X × X → R, then f(x) = ⟨κ(x,·), f⟩_H for any f ∈ H by the reproducing property [14, 15]. We consider the generalized regression model\n\ny = f*(x) + u,    (5)\n\nwhere x is an X-valued random variable, u is a real-valued random noise term as above, and f* ∈ H is an unknown deterministic function. Given a sample of n i.i.d. observations (x_1, y_1), ..., (x_n, y_n), where each x_i is drawn from some unknown marginal probability measure P_x and y_i is generated according to (5),2 the task is then to estimate the unknown deterministic function f* ∈ H. To do so we can express the estimator (4) more generally as\n\nf̂_MR ∈ arg min_{f∈H} (1/n) Σ_{i=1}^n ρ(y_i - f(x_i)) + (λ/2)‖f‖²_H.    (6)\n\nBy the representer theorem [14], the solution to (6) can be expressed as f̂_MR(x) = Σ_{i=1}^n α_i κ(x_i, x) for some α ∈ R^n, and therefore (6) can be recovered by solving the finite dimensional problem\n\nα̂_MR ∈ arg min_α (1/n) 1^T ρ(y - Kα) + (λ/2) α^T Kα  such that K_ij = κ(x_i, x_j).    (7)\n\nOur interest is in understanding the tractability, robustness and consistency aspects of such estimators.\nConsistency: Much is known about the consistency properties of estimators expressed as regularized empirical risk minimizers. For example, the ML-estimator (2) and the M-estimator (3) are both known to be parameter consistent under general conditions [16].3 The regularized M-estimator in RKHSs (6) is loss consistent under some general assumptions on the kernel, loss and training distribution.4 Furthermore, a weak form of f-consistency has also been established in [6]. For a bounded kernel and bounded Lipschitz losses, one can similarly prove the loss consistency of the regularized M-estimator (6) (in RKHS). 
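As a concrete special case of (7), taking the squared loss ρ(r) = r² yields a kernel ridge regression problem with a closed-form solution: the stationarity condition K((2/n)(Kα - y) + λα) = 0 is satisfied by α = (K + (nλ/2)I)⁻¹y. The sketch below is our illustration, not code from the paper; the RBF kernel and all constants are arbitrary choices:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gram matrix K_ij = exp(-gamma * ||a_i - b_j||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def solve_alpha(K, y, lam):
    # Stationarity of (1/n) sum (y_i - (K a)_i)^2 + (lam/2) a^T K a
    # holds when (K + (n * lam / 2) I) a = y.
    n = len(y)
    return np.linalg.solve(K + 0.5 * n * lam * np.eye(n), y)

rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(50)
K = rbf_kernel(X, X)
alpha = solve_alpha(K, y, lam=1e-3)
f_hat = K @ alpha   # fitted values f(x_i) = sum_j alpha_j kappa(x_j, x_i)
```

This is exactly the kind of convex-loss baseline whose robustness the paper goes on to question; it serves only as a reference instance of (7).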
See Appendix C.1 of the supplement for more discussion.\nGenerally speaking, any estimator that can be expressed as a regularized empirical loss minimizer is consistent under \u201creasonable\u201d conditions. That is, one can consider regularized loss minimization to be a (generally) sound principle for formulating regression estimators, at least from the perspective of consistency. However, this is no longer the case when we consider robustness and tractability; here sharp distinctions begin to arise within this class of estimators.\nRobustness: Although robustness is an intuitive notion, it has not been given a unique technical definition in the literature. Several definitions have been proposed, with distinct advantages and disadvantages [4]. Some standard definitions consider the asymptotic invariance of estimators to an infinitesimal but arbitrary perturbation of the underlying distribution, e.g. the influence function [4, 17]. Although these analyses can be useful, we will focus on finite sample notions of robustness since these are most related to concerns of computational tractability. In particular, we focus on the following definition related to the finite sample breakdown point [18, 19].\nDefinition 1 (Bounded Response). Assuming the parameter set Θ is metrizable, an estimator has bounded response if for any finite data sample its output remains in a bounded interior subset of the closed parameter set Θ (or respectively H), no matter how a single observation pair is perturbed.\nThis is a much weaker definition than having a non-zero breakdown point: a breakdown of ε requires that bounded response be guaranteed when any ε fraction of the data is perturbed arbitrarily. Bounded response is obviously a far more modest requirement. 
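To make the failure of bounded response concrete, consider Tikhonov-regularized least squares with a single corrupted pair (x_1, y_1): as the leverage and the corrupted response grow together, the estimate is dragged arbitrarily far from the fit on clean data. A small numeric sketch (our illustration; the particular magnitudes are arbitrary):

```python
import numpy as np

def ridge(X, y, lam=0.1):
    # Tikhonov-regularized least squares: (X^T X + lam I) theta = X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.uniform(size=(100, 1))
y = X[:, 0] * 2.0 + 0.1 * rng.standard_normal(100)   # true slope is 2

fits = []
for leverage in [1e2, 1e4, 1e6]:
    Xc, yc = X.copy(), y.copy()
    Xc[0, 0] = leverage          # one perturbed input: ||x_1|| grows
    yc[0] = -leverage ** 2       # with a grossly corrupted response
    fits.append(ridge(Xc, yc)[0])
# The corrupted estimate tracks y_1 / x_1 = -leverage, i.e. it is unbounded.
```

The single corrupted point dominates both X^T X and X^T y, so the fitted slope approaches y_1 / x_1 and can be made arbitrarily large in magnitude.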
However, importantly, the definition of bounded response allows the possibility of arbitrary leverage; that is, no bound is imposed on the magnitude of a perturbed input (i.e. ‖x_1‖ → ∞ or κ(x_1, x_1) → ∞). Surprisingly, we find that even such a weak robustness property is difficult to achieve while retaining computational tractability.\nComputational Dilemma: The goals of robustness and computational tractability raise a dilemma: it is easy to achieve robustness (i.e. bounded response) or tractability (i.e. polynomial run-time) in a consistent estimator, but apparently not both.\nConsider, for example, using a convex loss function. These are the best known class of functions that admit computationally efficient polynomial-time minimization [20] (see also [21]). It is sufficient that the objective be polynomial-time evaluable, along with its first and second derivatives, and that the objective be self-concordant [20].5 Since a Tikhonov regularizer is automatically self-concordant, the minimization problems outlined above can all be solved in polynomial time with Newton-type algorithms, provided ρ(r), ρ′(r), and ρ′′(r) can all be evaluated in polynomial time for a self-concordant ρ [22, Ch.9]. Standard loss functions, such as squared error or Huber's loss, satisfy these conditions, hence the corresponding estimators are polynomial-time.\nUnfortunately, loss minimization with a (non-constant) convex loss yields unbounded response to even a single outlier [3, Ch.5]. We extend this result to also account for regularization and RKHSs.\n2 We are obviously assuming X is equipped with an appropriate σ-algebra, and R with the standard Borel σ-algebra, such that the joint distribution P over X × R is well defined and κ(·,·) is measurable.\n3 In particular, let M_n(θ) = (1/n) Σ_{i=1}^n ρ(y_i - x_i^T θ), let M(θ) = E(ρ(y_1 - x_1^T θ)), and equip the parameter space Θ with the uniform metric ‖·‖_Θ. Then θ̂_M^(n) → θ*, provided ‖M_n - M‖_Θ → 0 in outer probability (adopted to avoid measurability issues) and M(θ*) > sup_{θ ∉ G} M(θ) for every open set G that contains θ*. The latter assumption is satisfied in particular when M : Θ → R is upper semicontinuous with a unique maximum at θ*. It is also possible to derive asymptotic convergence rates for general M-estimators [16].\n4 Specifically, let ρ* = inf_{f∈H} E[ρ(y_1 - f(x_1))]. Then [6] showed that (1/n) Σ_{i=1}^n ρ(y_i - f̂_MR(x_i)) → ρ* provided the regularization constant λ_n → 0 and λ_n² n → ∞, the loss ρ is convex and Lipschitz-continuous, and the RKHS H (induced by some bounded measurable kernel κ) is separable and dense in L_1(P) (the space of P-integrable functions) for all distributions P on X. Also, Y ⊂ R is required to be closed, where y ∈ Y.\nTheorem 1. 
Empirical risk minimization based on a (non-constant) convex loss cannot have bounded response if the domain (or kernel) is unbounded, even under Tikhonov regularization. (Proof given in Appendix B of the supplement.)\nBy contrast, consider the case of a (non-constant) bounded loss function.6 Bounded loss functions are a common choice in robust regression because they not only ensure bounded response, trivially, they can also ensure a high breakdown point of (n - p)/(2n) [3, Ch.5]. Unfortunately, estimators based on bounded losses are inherently intractable.\nTheorem 2. Bounded (non-constant) loss minimization is NP-hard. (Proof given in Appendix E.)\nThese difficulties with empirical risk minimization have led the field of robust statistics to develop a variety of alternative estimators [4, Ch.7]. For example, [7] recommends subset-selection based regression estimators, such as Least Trimmed Loss:\n\nθ̂_LTL ∈ arg min_{θ∈Θ} Σ_{i=1}^{n′} ρ(r_[i]).    (8)\n\nHere r_[i] denotes sorted residuals r_[1] ≤ ··· ≤ r_[n], and n′ < n is the number of terms to consider. Traditionally ρ(r) = r² is used. These estimators are known to have high breakdown [7],7 and obviously demonstrate bounded response to single outliers. Unfortunately, (8) is NP-hard [1].\n\n3 Variational M-estimation\n\nTo address the dilemma, we first adopt a general form of adaptive M-estimator that allows flexibility while admitting a general approximation strategy. The key construction is a variational representation of M-estimation that can express a number of standard robust (and non-robust) methods in a common framework. 
In particular, consider the following adaptive form of loss function:\n\nρ(r) = min_{0≤η≤1} η ℓ(r) + ψ(η),    (9)\n\nwhere r is a residual value, ℓ is a closed convex base loss, η is an adaptive weight on the base loss, and ψ is a convex auxiliary function. The weight can choose to ignore the base loss if ℓ(r) is large, but this is balanced against a prior penalty ψ(η). Different choices of base loss and auxiliary function will yield different results, and one can represent a wide variety of loss functions ρ in this way [8]. For example, any convex loss ρ can be trivially represented in the form (9) by setting ℓ = ρ and ψ(η) = δ_{1}(η).8 Bounded loss functions can also be represented in this way, for example:\n\nρ(r) = r²/(1+r²), with ℓ(r) = r², ψ(η) = (√η - 1)²  (Geman-McClure) [8]    (10)\nρ(r) = |r|/(1+|r|), with ℓ(r) = |r|, ψ(η) = (√η - 1)²  (Geman-Reynolds) [8]    (11)\nρ(r) = 1 - exp(-ℓ(r)), with ℓ(·) convex, ψ(η) = η log η - η + 1  (LeClerc) [8]    (12)\nρ(r) = min(1, ℓ(r)), with ℓ(·) convex, ψ(η) = 1 - η  (Clipped loss) [9]    (13)\n\nAppendix D in the supplement demonstrates how one can represent general functions ρ in the form (9), not just specific examples, significantly extending [8] with a general characterization.\n5 A function ρ is self-concordant if |ρ′′′(r)| ≤ 2ρ′′(r)^{3/2}; see e.g. [22, Ch.9].\n6 A bounded function obviously cannot be convex over an unbounded domain unless it is constant.\n7 When n′ approaches n/2 the breakdown of (8) approaches 1/2 [7].\n8 We use δ_C(η) to denote the indicator for the point set C; i.e., δ_C(η) = 0 if η ∈ C, otherwise δ_C(η) = ∞.\nTherefore, all of the previous forms of regularized empirical risk minimization, whether with a convex or bounded loss ρ, can be easily expressed using only convex base losses ℓ and convex auxiliary functions ψ, as follows:\n\nθ̂_VM ∈ arg min_{θ∈Θ} min_{0≤η≤1} η^T ℓ(y - Xθ) + 1^T ψ(η) + (λ/2)‖η‖_1 ‖θ‖²    (14)\nf̂_VM ∈ arg min_{f∈H} min_{0≤η≤1} Σ_{i=1}^n {η_i ℓ(y_i - f(x_i)) + ψ(η_i)} + (λ/2)‖η‖_1 ‖f‖²_H    (15)\nα̂_VM ∈ arg min_α min_{0≤η≤1} η^T ℓ(y - Kα) + 1^T ψ(η) + (λ/2)‖η‖_1 α^T Kα    (16)\n\nNote that we have introduced the factor ‖η‖_1 into the regularizer, which increases robustness by encouraging the η weights to prefer small values (but adaptively increase on indices with small loss). This particular form of regularization has two advantages: (i) it is a smooth function of η on 0 ≤ η ≤ 1 (since ‖η‖_1 = 1^T η in this case), and (ii) it enables a tight convex approximation strategy, as we will see below.\nNote that other forms of robust regression can be expressed in a similar framework. For example, generalized M-estimation (GM-estimation) can be formulated simply by forcing each η_i to take on a specific value determined by ‖x_i‖ or r_i [7], ignoring the auxiliary function ψ. 
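The variational identities above are easy to check numerically: minimizing η ℓ(r) + ψ(η) over a fine grid of η ∈ [0, 1] recovers the stated bounded losses. A sketch (our illustration) for the Geman-McClure and clipped losses:

```python
import numpy as np

def variational_rho(r, base_loss, psi):
    # rho(r) = min_{0 <= eta <= 1} eta * l(r) + psi(eta), via a fine grid
    etas = np.linspace(1e-9, 1.0, 200001)
    return float(np.min(etas * base_loss(r) + psi(etas)))

for r in [-3.0, -0.5, 0.0, 0.7, 2.0]:
    # Geman-McClure: l(r) = r^2, psi(eta) = (sqrt(eta) - 1)^2 -> r^2/(1+r^2)
    gm = variational_rho(r, lambda t: t ** 2, lambda e: (np.sqrt(e) - 1) ** 2)
    assert abs(gm - r ** 2 / (1 + r ** 2)) < 1e-4
    # Clipped loss: psi(eta) = 1 - eta -> min(1, l(r))
    cl = variational_rho(r, lambda t: t ** 2, lambda e: 1 - e)
    assert abs(cl - min(1.0, r ** 2)) < 1e-4
```

For the Geman-McClure case the inner minimizer is η = 1/(1+r²)², which the grid search locates to high accuracy; for the clipped loss the minimizer is simply η → 0 when ℓ(r) > 1 and η = 1 otherwise.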
Least Trimmed Loss (8) can be expressed in the form (9) provided only that we add a shared constraint over η:\n\nθ̂_LTL ∈ arg min_{θ∈Θ} min_{0≤η≤1: 1^T η = n′} η^T ℓ(r) + 1^T ψ(η),    (17)\n\nwhere ψ(η_i) = 1 - η_i and n′ < n specifies the number of terms to consider in the sum of losses. Since η ∈ {0, 1}^n at a solution (see e.g. [9]), (17) is equivalent to (8) when ψ is the auxiliary function of the clipped loss (13).\nThese formulations are all convex in the parameters given the auxiliary weights, and vice versa. However, they are not jointly convex in the optimization variables (i.e. in θ and η, or in α and η). Therefore, one is not assured that the problems (14)-(16) have only global minima; in fact local minima exist and global minima cannot be easily found (or even verified).\n\n4 Computationally Efficient Approximation\n\nWe present a general approximation strategy for the variational regression estimators above that can guarantee polynomial run-time while ensuring certain robustness and consistency properties. The approximation is significantly tighter than the existing work [9], which allows us to achieve stronger guarantees while providing better empirical performance. In developing our estimator we follow standard methodology from combinatorial optimization: given an intractable optimization problem, first formulate a (hopefully tight) convex relaxation that provides a lower bound on the objective, then round the relaxed minimizer back to the feasible space, hopefully verifying that the rounded solution preserves desirable properties, and finally re-optimize the rounded solution to refine the result; see e.g. [23].\nTo maintain generality, we formulate the approximate estimator in the RKHS setting. 
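Since (14) is convex in θ for fixed η and separable in η for fixed θ, the naive strategy is to alternate block updates; with the clipped-loss auxiliary ψ(η) = 1 - η the η-step reduces to a 0/1 threshold per point. The sketch below (our illustration, with squared base loss) corresponds to the kind of alternating scheme later evaluated as AltBndL2, and it only guarantees a local minimum:

```python
import numpy as np

def alternate(X, y, lam=1e-3, iters=50):
    # Block-coordinate descent on (14) with l(r) = r^2, psi(eta) = 1 - eta.
    n, p = X.shape
    eta = np.ones(n)                    # start by trusting every point
    theta = np.zeros(p)
    for _ in range(iters):
        # theta-step: weighted ridge regression given the current eta
        W = X.T * eta                   # X^T diag(eta)
        A = W @ X + 0.5 * lam * eta.sum() * np.eye(p)
        theta = np.linalg.solve(A, W @ y)
        # eta-step: the coefficient of eta_i is l(r_i) - 1 + (lam/2)||theta||^2,
        # so the per-point minimizer is either 0 or 1
        keep = (y - X @ theta) ** 2 + 0.5 * lam * theta @ theta < 1.0
        eta = keep.astype(float)
        if eta.sum() == 0:              # degenerate: every point rejected
            break
    return theta, eta

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.standard_normal(300)
y[:5] += 50.0                           # five gross outliers
theta, eta = alternate(X, y)
```

From this particular start the outliers end with η = 0 and θ lands near (1, -2), but as the text emphasizes, nothing prevents such alternation from stopping at a poor local minimum on other inputs.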
Consider (16). Although the problem is obviously convex in α given η, and vice versa, it is not jointly convex (recall the assumption that ℓ and ψ are both convex functions). This suggests an obvious computational strategy for computing the estimator (16): alternate between α and η optimizations (or use heuristic methods [2]), but this cannot guarantee anything other than local solutions (and thus may not even achieve any of the desired theoretical properties associated with the estimator).\nReformulation: We first need to reformulate the problem to allow a tight relaxation. Let Δ(η) denote putting a vector η on the main diagonal of a square matrix, and let ∘ denote componentwise multiplication. Since ℓ is closed and convex by assumption, we know that ℓ(r) = sup_ν νr - ℓ*(ν), where ℓ* is the Fenchel conjugate of ℓ [22]. This allows (16) to be reformulated as follows.\nLemma 1.\n\nmin_{0≤η≤1} min_α η^T ℓ(y - Kα) + 1^T ψ(η) + (λ/2)‖η‖_1 α^T Kα    (18)\n= min_{0≤η≤1} sup_ν 1^T ψ(η) - η^T (ℓ*(ν) - Δ(y)ν) - (1/2λ) ν^T (K ∘ (η‖η‖_1^{-1} η^T)) ν,    (19)\n\nwhere the function evaluations are componentwise. (Proof given in Appendix A of the supplement.)\nAlthough no relaxation has been introduced, the new form (19) has a more convenient structure.\nRelaxation: Let N = η‖η‖_1^{-1} η^T and note that, since 0 ≤ η ≤ 1, N must satisfy a number of useful properties. We can summarize these by formulating a constraint set N ∈ N_η given by:\n\nN_η = {N : N ⪰ 0, N1 = η, rank(N) = 1}    (20)\nM_η = {M : M ⪰ 0, M1 = η, tr(M) ≤ 1}.    (21)\n\nUnfortunately, the set N_η is not convex because of the rank constraint. However, relaxing this constraint leads to a set M_η ⊇ N_η which preserves much of the key structure, as we verify below.\nLemma 2.\n\n(19) = min_{0≤η≤1} min_{N∈N_η} sup_ν 1^T ψ(η) - η^T (ℓ*(ν) - Δ(y)ν) - (1/2λ) ν^T (K ∘ N) ν    (22)\n≥ min_{0≤η≤1} min_{M∈M_η} sup_ν 1^T ψ(η) - η^T (ℓ*(ν) - Δ(y)ν) - (1/2λ) ν^T (K ∘ M) ν,    (23)\n\nusing the fact that N_η ⊆ M_η. (Proof given in Appendix A of the supplement.)\nCrucially, the constraint set {(η, M) : 0 ≤ η ≤ 1, M ∈ M_η} is jointly convex in η and M, thus (23) is a convex-concave min-max problem. To see why, note that the inner objective function is jointly convex in η and M, and concave in ν. Since a pointwise maximum of convex functions is convex, the problem is convex in (η, M) [22, Ch.3]. We conclude that all local minima in (η, M) are global. Therefore, (23) provides the foundation for an efficiently solvable relaxation.\nRounding: Unfortunately, the solution for M in (23) does not allow direct recovery of an estimator α achieving the same objective value in (18), unless M satisfies rank(M) = 1. In general we first need to round M to a rank-1 solution. Fortunately, a trivial rounding procedure is available: we simply use η (ignoring M) and re-solve for α in (18). This is equivalent to replacing M with the rank-1 matrix Ñ = η‖η‖_1^{-1} η^T ∈ N_η, which restores feasibility in the original problem. 
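The rounding step can be sanity-checked numerically: for any weights 0 ≤ η ≤ 1 (with η ≠ 0), the matrix Ñ = ηη^T/‖η‖_1 is rank one, positive semidefinite, has row sums η, and has trace at most one, so it lies in N_η (and hence in the relaxed set as well). A quick check (our illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
eta = rng.uniform(size=6)                   # any weights 0 <= eta <= 1
N = np.outer(eta, eta) / eta.sum()          # rank-1 rounding candidate

assert np.linalg.matrix_rank(N) == 1                 # rank constraint holds
assert np.allclose(N @ np.ones(6), eta)              # row sums recover eta
assert np.all(np.linalg.eigvalsh(N) >= -1e-12)       # N is PSD
assert np.trace(N) <= 1.0 + 1e-12                    # trace bound: eta <= 1
```

The trace bound follows because tr(Ñ) = η^T η / ‖η‖_1 ≤ 1 whenever every η_i ≤ 1, which is why the trace constraint is a valid relaxation of the rank-one structure.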
Of course, such a rounding step will generally increase the objective value.\nReoptimization: Finally, the rounded solution can be locally improved by alternating between η and α updates in (18) (or using any other local optimization method), yielding the final estimate α̃.\n\n5 Properties\n\nAlthough a tight a priori bound on the size of the optimality gap is difficult to achieve, a rigorous bound on the optimality gap can be recovered post hoc once the re-optimized estimator is computed. Let R0 denote the minimum value of (18) (not efficiently computable); let R1 denote the minimum value of (23) (the relaxed solution); let R2 denote the value of (18) achieved by freezing η from the relaxed solution but re-optimizing α (the rounded solution); and finally let R3 denote the value of (18) achieved by re-optimizing η and α from the rounded solution (the re-optimized solution). Clearly we have the relationships R1 ≤ R0 ≤ R3 ≤ R2. An upper bound on the relative optimality gap of the final solution (R3) can be determined by (R3 - R0)/R3 ≤ (R3 - R1)/R3, since R1 and R3 are both known quantities.\nTractability: Under mild assumptions on ℓ and ψ, computation of the approximate estimator (solving the relaxed problem, rounding, then re-optimizing) admits a polynomial-time solution; see Appendix E in the supplement. (Appendix E also provides details for an efficient implementation for solving the relaxation (23).) Once η is recovered from the relaxed solution, the subsequent optimizations of (18) can be solved efficiently under weak assumptions on ℓ and ψ; namely, that they both satisfy the self-concordance and polynomial-time computation properties discussed in Section 2.\nRobustness: Despite the approximation, the relaxation remains sufficiently tight to preserve some of the robustness properties of bounded loss minimization. To establish the robustness (and consistency) properties, we will need to make use of a specific technical definition of outliers and inliers.\nDefinition 2 (Outliers and Inliers). For an L-Lipschitz loss ℓ, an outlier is a point (x_i, y_i) that satisfies ℓ(y_i) > L² K_ii/(2λ) - ψ′(0), while an inlier satisfies ℓ(y_i) + L² K_ii/(2λ) < -ψ′(1).\nTheorem 3. Assume the loss ρ is bounded and has a variational representation (9) such that ℓ is Lipschitz-continuous and ψ′ is bounded. Also assume there is at least one (unperturbed) inlier, and consider the perturbation of a single data point (y_1, x_1). 
Under the following conditions, the rounded (re-optimized) estimator maintains bounded response:\n(i) If either y_1 remains bounded, or κ(x_1, x_1) remains bounded.\n(ii) If |y_1| → ∞, κ(x_1, x_1) → ∞ and ℓ(y_1)/κ(x_1, x_1) → ∞.\n(Proof given in Appendix B of the supplement.)\n\nTable 1: RMSE on clean test data for an artificial data set with 5 features and 100 training points, with outlier probability p, and 10000 test data points. Results are averaged over 10 repetitions. Standard deviations are given in parentheses.\n\nMethod    | p = 0.4      | p = 0.2      | p = 0.0\nL2        | 43.5 (13)    | 57.6 (21.21) | 0.52 (0.01)\nL1        | 4.89 (2.81)  | 3.6 (2.04)   | 0.52 (0.01)\nHuber     | 4.89 (2.81)  | 3.62 (2.02)  | 0.52 (0.01)\nLTS       | 6.72 (7.37)  | 8.65 (14.11) | 0.52 (0.01)\nGemMc     | 0.53 (0.03)  | 0.52 (0.02)  | 0.52 (0.01)\n[9]       | 0.52 (0.01)  | 0.52 (0.01)  | 0.52 (0.01)\nAltBndL2  | 0.52 (0.01)  | 0.52 (0.01)  | 0.52 (0.02)\nAltBndL1  | 0.73 (0.12)  | 0.74 (0.16)  | 0.52 (0.01)\nCvxBndL2  | 0.52 (0.01)  | 0.52 (0.01)  | 0.52 (0.01)\nCvxBndL1  | 0.53 (0.02)  | 0.55 (0.05)  | 0.52 (0.01)\n\nNote that the latter condition causes any convex loss ℓ to demonstrate unbounded response (see the proof of Theorem 5 in Appendix B). Therefore, the approximate estimator is strictly more robust (in terms of bounded response) than regularized empirical risk minimization with a convex loss ℓ.\nConsistency: Finally, we can establish consistency of the approximate estimator in a limited albeit non-trivial setting, although we have yet to establish it generally.\nTheorem 4. Assume ℓ is Lipschitz-continuous and ψ(η) = 1 - η. Assume that the data is generated from a mixture of inliers and outliers, where P(inlier) > P(outlier). Then the estimate θ̂ produced by the rounded (re-optimized) method is loss consistent. (Proof given in Appendix C.2.)\n\n6 Experimental Evaluation\n\nWe conducted a set of experiments to evaluate the effectiveness of the proposed method compared to standard methods from the literature. Our experimental evaluation was conducted in two parts: first a synthetic experiment where we could control data generation, then an experiment on real data.\nThe first synthetic experiment was conducted as follows. A target weight vector θ was drawn from N(0, I), with X_i: sampled uniformly from [0, 1]^m, m = 5, and outputs y_i computed as y_i = X_i:θ + ε_i, ε_i ∼ N(0, 1/2). We then seeded the data set with outliers by randomly re-sampling each y_i and X_i: from N(0, 10^8) and N(0, 10^4) respectively, governed by an outlier probability p. Then we randomly sampled 100 points as the training set, and another 10000 samples were used for testing.\nWe implemented the proposed method with two different base losses, L2 and L1, referring to these as CvxBndL2 and CvxBndL1 respectively. We compared to standard L2 and L1 loss minimization, as well as minimizing the Huber minimax loss (Huber) [4]. We also considered standard methods from the robust statistics literature, including the least trimmed squares method (LTS) [7, 24] and bounded loss minimization based on the Geman-McClure loss (GemMc) [8]. Finally, we also compared to the alternating minimization strategies outlined at the end of Section 3 (AltBndL2 and AltBndL1 for the L2 and L1 losses respectively), and implemented the strategy described in [9]. We added Tikhonov regularization to each method, and the regularization parameter λ was selected (optimally for each method) on a separate validation set. Note that LTS has an extra parameter n′, the number of inliers. The ideal setting n′ = (1 - p)n was granted to LTS. 
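The synthetic design above is straightforward to reproduce. A sketch (our illustration; we read N(0, 1/2) as noise with standard deviation 1/2, matching the "optimal RMSE 1/2" reported in the results, and N(0, 10^8), N(0, 10^4) as variances):

```python
import numpy as np

def make_synthetic(n, m=5, p_out=0.2, seed=4):
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(m)                 # target weights from N(0, I)
    X = rng.uniform(size=(n, m))                   # rows X_i: uniform on [0, 1]^m
    y = X @ theta + rng.normal(0.0, 0.5, size=n)   # noise with std 1/2 (assumed)
    out = rng.uniform(size=n) < p_out              # outlier indicators
    y[out] = rng.normal(0.0, np.sqrt(1e8), size=out.sum())
    X[out] = rng.normal(0.0, np.sqrt(1e4), size=(out.sum(), m))
    return X, y, theta, out

X, y, theta, out = make_synthetic(10000)
```

Under this reading, roughly a p fraction of the rows carry both gross leverage (corrupted X) and a gross response (corrupted y), which is the regime in which Table 1 separates the robust and non-robust methods.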
We also tried 30 random restarts for LTS and picked the best result.
All experiments were repeated 10 times, and the average root mean squared errors (RMSE), with standard deviations, on the clean test data are reported in Table 1. For p = 0 (i.e., no outliers), all methods perform well; their RMSEs are close to optimal (1/2, the standard deviation of εi). However, once outliers start to appear, the least squares result is significantly skewed, while the classic robust statistics methods, Huber, L1 and LTS, indeed prove more robust than least squares but are nevertheless still significantly affected. Both implementations of the new method perform comparably to the non-convex Geman-McClure loss while substantially improving on the alternating strategy under the L1 loss. The latter improvement clearly demonstrates that alternating can be trapped in poor local minima. The proposal from [9] was not effective in this setting (which differed from the one investigated there).

Methods      cal-housing        abalone          pumadyn           bank-8fh
L2           1185   (124.59)    7.93   (0.67)    1.24   (0.42)     18.21  (6.57)
L1           1303   (244.85)    7.30   (0.40)    1.29   (0.42)     6.54   (3.09)
Huber        1221   (119.18)    7.73   (0.49)    1.24   (0.42)     7.37   (3.18)
LTS          533    (398.92)    755.1  (126)     0.32   (0.41)     10.96  (6.67)
GemMc        28     (88.45)     2.30   (0.01)    0.12   (0.12)     0.93   (0.80)
[9]          967    (522.40)    8.39   (0.54)    0.81   (0.77)     3.91   (6.18)
AltBndL2     967    (522.40)    8.39   (0.54)    0.81   (0.77)     7.74   (9.40)
AltBndL1     1005   (603.00)    7.30   (0.40)    1.29   (0.42)     1.61   (2.51)
CvxBndL2     9      (0.64)      7.60   (0.86)    0.07   (0.07)     0.20   (0.05)
CvxBndL1     8      (0.28)      2.98   (0.08)    0.08   (0.07)     0.10   (0.07)
Gap(Cvx2)    2e-12  (3e-12)     3e-9   (4e-9)    0.025  (0.052)    0.001  (0.003)
Gap(Cvx1)    0.005  (0.01)      0.001  (0.001)   0.267  (0.269)    0.011  (0.028)

Table 2: RMSE on clean test data for 108 training data points and 1000 test data points, with 10 repeats. Standard deviations are shown in parentheses. The mean gap values of CvxBndL2 and CvxBndL1, Gap(Cvx2) and Gap(Cvx1) respectively, are given in the last two rows.

Next, we conducted an experiment on four real datasets taken from the StatLib repository^9 and DELVE.^10 For each data set, we randomly selected 108 points as the training set, and another random 1000 points as the test set. Here the regularization constant was tuned by 10-fold cross validation. To seed outliers, 5% of the training set were randomly chosen and their X and y values multiplied by 100 and 10000, respectively. All of these data sets have 8 features, except pumadyn, which has 32 features. We also estimated the scale factor on the training set by the mean absolute deviation method, a common method in robust statistics [3]. Again, the ideal parameter n′ = (1 − 5%)n was granted to LTS and 30 random restarts were performed.
The RMSEs on the test set for all methods are reported in Table 2. It is clear that all methods based on convex losses (L2, L1, Huber) suffer significantly from the added outliers. The method proposed in this paper consistently outperforms all other methods by a noticeable margin, except on the abalone data set, where GemMc performs slightly better.^11 Again, we observe evidence that the alternating strategy can be trapped in poor local minima, while the method from [9] was less effective. We also measured the relative optimality gaps for the approximate CvxBnd procedures. The gaps were quite small in most cases (the gaps were very close to zero in the synthetic case, and so are not shown), demonstrating the tightness of the proposed approximation scheme.

7 Conclusion
We have developed a new robust regression method that can guarantee a form of robustness (bounded response) while ensuring tractability (polynomial run-time).
The estimator has been proved consistent under some restrictive but non-trivial conditions, although we have not established general consistency. Nevertheless, an empirical evaluation reveals that the method meets or surpasses the generalization ability of state-of-the-art robust regression methods in experimental studies. Although the method is more computationally involved than standard approaches, it achieves reasonable scalability in real problems. We are investigating whether the proposed estimator achieves stronger robustness properties, such as high breakdown or bounded influence. It would be interesting to extend the approach to also estimate scale in a robust and tractable manner. Finally, we continue to investigate whether other techniques from the robust statistics and machine learning literatures can be incorporated in the general framework while preserving the desired properties.

Acknowledgements
Research supported by AICML and NSERC.

9. http://lib.stat.cmu.edu/datasets/
10. http://www.cs.utoronto.ca/~delve/data/summaryTable.html
11. Note that we obtain different results than [9], arising from a very different outlier process.

References
[1] T. Bernholt. Robust estimators are hard to compute. Technical Report 52/2005, SFB475, U. Dortmund, 2005.
[2] R. Nunkesser and O. Morell. An evolutionary algorithm for robust regression. Computational Statistics and Data Analysis, 54:3242–3248, 2010.
[3] R. Maronna, R. Martin, and V. Yohai. Robust Statistics: Theory and Methods. Wiley, 2006.
[4] P. Huber and E. Ronchetti. Robust Statistics. Wiley, 2nd edition, 2009.
[5] A. Christmann and I. Steinwart. Consistency and robustness of kernel-based regression in convex risk minimization. Bernoulli, 13(3):799–819, 2007.
[6] A. Christmann, A. Van Messem, and I. Steinwart. On consistency and robustness properties of support vector machines for heavy-tailed distributions.
Statistics and Its Interface, 2:311–327, 2009.
[7] P. Rousseeuw and A. Leroy. Robust Regression and Outlier Detection. Wiley, 1987.
[8] M. Black and A. Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. International Journal of Computer Vision, 19(1):57–91, 1996.
[9] Y. Yu, M. Yang, L. Xu, M. White, and D. Schuurmans. Relaxed clipping: A global training method for robust regression and classification. In Advances in Neural Information Processing Systems (NIPS), 2010.
[10] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton Series in Applied Mathematics. Princeton University Press, October 2009.
[11] H. Xu, C. Caramanis, and S. Mannor. Robust regression and Lasso. In Advances in Neural Information Processing Systems (NIPS), volume 21, pages 1801–1808, 2008.
[12] H. Xu, C. Caramanis, and S. Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10:1485–1510, 2009.
[13] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Learning theory: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193, 2006.
[14] G. Kimeldorf and G. Wahba. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41(2):495–502, 1970.
[15] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.
[16] A. van der Vaart and J. Wellner. Weak Convergence and Empirical Processes. Springer, 1996.
[17] F. Hampel, E. Ronchetti, P. Rousseeuw, and W. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.
[18] D. Donoho and P. Huber. The notion of breakdown point. In A Festschrift for Erich L. Lehmann, pages 157–184. Wadsworth, 1983.
[19] P. Davies and U. Gather. The breakdown point—examples and counterexamples. REVSTAT Statistical Journal, 5(1):1–17, 2007.
[20] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Methods in Convex Programming. SIAM, 1994.
[21] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, 2003.
[22] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004.
[23] J. Peng and Y. Wei. Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization, 18(1):186–205, 2007.
[24] P. Rousseeuw and K. Van Driessen. Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12(1):29–45, 2006.
[25] R. Horn and C. Johnson. Matrix Analysis. Cambridge, 1985.", "award": [], "sourceid": 1197, "authors": [{"given_name": "Yao-liang", "family_name": "Yu", "institution": ""}, {"given_name": "\u00d6zlem", "family_name": "Aslan", "institution": ""}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": ""}]}