{"title": "Robust Logistic Regression and Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 253, "page_last": 261, "abstract": "We consider logistic regression with arbitrary outliers in the covariate matrix. We propose a new robust logistic regression algorithm, called RoLR, that estimates the parameter through a simple linear programming procedure. We prove that RoLR is robust to a constant fraction of adversarial outliers. To the best of our knowledge, this is the first result on estimating logistic regression model when the covariate matrix is corrupted with any performance guarantees. Besides regression, we apply RoLR to solving binary classification problems where a fraction of training samples are corrupted.", "full_text": "Robust Logistic Regression and Classi\ufb01cation\n\nJiashi Feng\n\njshfeng@berkeley.edu\n\nEECS Department & ICSI\n\nUC Berkeley\n\nHuan Xu\n\nME Department\n\nNational University of Singapore\n\nmpexuh@nus.edu.sg\n\nShie Mannor\nEE Department\n\nTechnion\n\nshie@ee.technion.ac.il\n\nShuicheng Yan\nECE Department\n\nNational University of Singapore\n\neleyans@nus.edu.sg\n\nAbstract\n\nWe consider logistic regression with arbitrary outliers in the covariate matrix. We\npropose a new robust logistic regression algorithm, called RoLR, that estimates\nthe parameter through a simple linear programming procedure. We prove that\nRoLR is robust to a constant fraction of adversarial outliers. To the best of our\nknowledge, this is the \ufb01rst result on estimating logistic regression model when the\ncovariate matrix is corrupted with any performance guarantees. 
Besides regression, we apply RoLR to solving binary classification problems where a fraction of\ntraining samples are corrupted.\n\n1 Introduction\n\nLogistic regression (LR) is a standard probabilistic statistical classification model that has been\nextensively used across disciplines such as computer vision, marketing, and social sciences, to name a\nfew. Different from linear regression, the outcome of LR on one sample is the probability that it is\npositive or negative, where the probability depends on a linear measure of the sample. Therefore,\nLR is actually widely used for classification. More formally, for a sample xi \u2208 Rp whose label is\ndenoted as yi, the probability of yi being positive is predicted to be $P\\{y_i = +1\\} = \\frac{1}{1 + e^{-\\beta^\\top x_i}}$, given\nthe LR model parameter \u03b2. In order to obtain a parameter that performs well, often a set of labeled\nsamples {(x1, y1), . . . , (xn, yn)} is collected to learn the LR parameter \u03b2 which maximizes the\ninduced likelihood function over the training samples.\nHowever, in practice, the training samples x1, . . . , xn are usually noisy and some of them may\neven contain adversarial corruptions. Here by \u201cadversarial\u201d, we mean that the corruptions can be\narbitrary, unbounded and are not from any specific distribution. For example, in the image/video\nclassification task, some images or videos may be corrupted unexpectedly due to sensor errors\nor severe occlusions of the contained objects. Those corrupted samples, which are called\noutliers, can skew the parameter estimation severely and hence destroy the performance of LR.\nTo see the sensitivity of LR to outliers more intuitively, consider a simple example where all\nthe samples xi are from the one-dimensional space R, as shown in Figure 1. Using only the inlier\nsamples provides a correct LR parameter (we here show the induced function curve) which explains\nthe inliers well. 
However, when only one sample is corrupted (which is originally negative but now\ncloser to the positive samples), the resulting regression curve is pulled far away from the ground\ntruth one and the label predictions on the concerned inliers are completely wrong. This demonstrates\nthat LR is indeed fragile to sample corruptions. More rigorously, the non-robustness of LR can be\nshown via calculating its influence function [7] (detailed in the supplementary material).\n\nFigure 1: The estimated logistic regression curve (red solid) is far away from the correct one (blue\ndashed) due to the existence of just one outlier (red circle).\n\nAs Figure 1 demonstrates, the maximal-likelihood estimate of LR is extremely sensitive to the presence\nof anomalous data in the sample. Pregibon also observed this non-robustness of LR in [14].\nTo solve this important issue of LR, Pregibon [14], Cook and Weisberg [4] and Johnson [9] proposed\nprocedures to identify observations which are influential for estimating \u03b2 based on certain\noutlyingness measures. Stefanski et al. [16, 10] and Bianco et al. [2] also proposed robust estimators\nwhich, however, require robustly estimating the covariate matrix or boundedness of the outliers.\nMoreover, the breakdown point\u00b9 of those methods is generally inversely proportional to the sample\ndimensionality and diminishes rapidly for high-dimensional samples.\nWe propose a new robust logistic regression algorithm, called RoLR, which optimizes a robustified\nlinear correlation between response y and linear measure \u27e8\u03b2, x\u27e9 via an efficient linear programming-based\nprocedure. We demonstrate that the proposed RoLR achieves robustness to arbitrary covariate\ncorruptions. Even when a constant fraction of the training samples are corrupted, RoLR is still\nable to learn the LR parameter with a non-trivial upper bound on the error. 
Besides this theoretical\nguarantee of RoLR on the parameter estimation, we also provide the empirical and population risk\nbounds for RoLR. Moreover, RoLR only needs to solve a linear programming problem and thus is\nscalable to large-scale data sets, in sharp contrast to previous LR optimization algorithms which typically\nresort to (computationally expensive) iteratively reweighted methods [11]. The proposed RoLR\ncan be easily adapted to solving binary classification problems where corrupted training samples\nare present. We also provide a theoretical classification performance guarantee for RoLR. Due to the\nspace limitation, we defer all the proofs to the supplementary material.\n\n2 Related Works\n\nSeveral previous works have investigated multiple approaches to robustify the logistic regression\n(LR) [15, 13, 17, 16, 10]. The majority of them are M-estimator based: minimizing a complicated\nand more robust loss function than the standard loss function (negative log-likelihood) of LR. For\nexample, Pregibon [15] proposed the following M-estimator:\n\n$$\\hat{\\beta} = \\arg\\min_{\\beta} \\sum_{i=1}^{n} \\rho(\\ell_i(\\beta)),$$\n\nwhere \u2113i(\u00b7) is the negative log-likelihood of the ith sample xi and \u03c1(\u00b7) is a Huber type function [8]\nsuch as\n\n$$\\rho(t) = \\begin{cases} t, & \\text{if } t \\le c, \\\\ 2\\sqrt{tc} - c, & \\text{if } t > c, \\end{cases}$$\n\nwith c a positive parameter. However, the result from such an estimator is not robust to outliers with\nhigh leverage covariates as shown in [5].\n\n\u00b9 It is defined as the percentage of corrupted points that can make the output of an algorithm arbitrarily bad.\n\nRecently, Ding et al. [6] introduced the T-logistic regression as a robust alternative to the standard\nLR, which replaces the exponential distribution in LR by the t-exponential distribution family. 
However,\nT-logistic regression only guarantees that the output parameter converges to a local optimum of the\nloss function instead of converging to the ground truth parameter.\nOur work is largely inspired by the following two recent works [3, 13] on robust sparse regression.\nIn [3], Chen et al. proposed to replace the standard vector inner product by a trimmed one, and\nobtained a novel linear regression algorithm which is robust to unbounded covariate corruptions. In\nthis work, we also utilize this simple yet powerful operation to achieve robustness. In [13], a convex\nprogramming method for estimating the sparse parameters of logistic regression model is proposed:\n\n$$\\max_{\\beta} \\sum_{i=1}^{m} y_i \\langle x_i, \\beta \\rangle, \\quad \\text{s.t. } \\|\\beta\\|_1 \\le \\sqrt{s},\\ \\|\\beta\\| \\le 1,$$\n\nwhere s is the sparseness prior parameter on \u03b2. However, this method is not robust to a corrupted\ncovariate matrix. A few or even one corrupted sample may dominate the correlation in the objective\nfunction and yield arbitrarily bad estimations. In this work, we propose a robust algorithm to remedy\nthis issue.\n\n3 Robust Logistic Regression\n\n3.1 Problem Setup\n\nWe consider the problem of logistic regression (LR). Let $S^{p-1}$ denote the unit sphere and $B_2^p$ denote\nthe Euclidean unit ball in $\\mathbb{R}^p$. Let \u03b2\u2217 be the groundtruth parameter of the LR model. We assume\nthe training samples are covariate-response pairs $\\{(x_i, y_i)\\}_{i=1}^{n+n_1} \\subset \\mathbb{R}^p \\times \\{-1, +1\\}$, which, if not\ncorrupted, would obey the following LR model:\n\n$$P\\{y_i = +1\\} = \\tau(\\langle \\beta^*, x_i \\rangle + v_i), \\qquad (1)$$\n\nwhere the function \u03c4(\u00b7) is defined as $\\tau(z) = \\frac{1}{1 + e^{-z}}$. The additive noise $v_i \\sim N(0, \\sigma_e^2)$ is an i.i.d.\nGaussian random variable with zero mean and variance $\\sigma_e^2$. In particular, when we consider the\nnoiseless case, we assume $\\sigma_e^2 = 0$. Since LR only depends on \u27e8\u03b2\u2217, xi\u27e9, we can always scale the\nsamples xi to make the magnitude of \u03b2\u2217 less than 1. Thus, without loss of generality, we assume\nthat $\\beta^* \\in S^{p-1}$.\nOut of the n + n1 samples, a constant number (n1) of the samples may be adversarially corrupted,\nand we make no assumptions on these outliers. Throughout the paper, we use $\\lambda \\triangleq n_1/n$ to denote the\noutlier fraction. We call the remaining n non-corrupted samples \u201cauthentic\u201d samples, which obey\nthe following standard sub-Gaussian design [12, 3].\nDefinition 1 (Sub-Gaussian design). We say that a random matrix $X = [x_1, \\ldots, x_n] \\in \\mathbb{R}^{p \\times n}$ is\nsub-Gaussian with parameter $(\\frac{1}{n}\\Sigma_x, \\frac{1}{n}\\sigma_x^2)$ if: (1) each column $x_i \\in \\mathbb{R}^p$ is sampled independently\nfrom a zero-mean distribution with covariance $\\frac{1}{n}\\Sigma_x$, and (2) for any unit vector $u \\in \\mathbb{R}^p$, the random\nvariable $u^\\top x_i$ is sub-Gaussian with parameter\u00b2 $\\frac{1}{\\sqrt{n}}\\sigma_x$.\n\nThe above sub-Gaussian random variables have several nice concentration properties, one of which\nis stated in the following lemma [12].\nLemma 1 (Sub-Gaussian Concentration [12]). Let X1, . . . , Xn be n i.i.d. zero-mean sub-Gaussian\nrandom variables with parameter $\\sigma_x/\\sqrt{n}$ and variance at most $\\sigma_x^2/n$. Then we have\n$\\left|\\sum_{i=1}^{n} X_i^2 - \\sigma_x^2\\right| \\le c_1 \\sigma_x^2 \\sqrt{\\frac{\\log p}{n}}$, with probability of at least $1 - p^{-2}$ for some absolute constant c1.\nBased on the above concentration property, we can obtain the following bound on the magnitude of a\ncollection of sub-Gaussian random variables [3].\nLemma 2. Suppose X1, . . . , Xn are n independent sub-Gaussian random variables with parameter\n$\\sigma_x/\\sqrt{n}$. Then we have $\\max_{i=1,\\ldots,n} |X_i| \\le 4\\sigma_x\\sqrt{(\\log n + \\log p)/n}$ with probability of at least\n$1 - p^{-2}$.\n\n\u00b2 Here, the parameter means the sub-Gaussian norm of the random variable Y, $\\|Y\\|_{\\psi_2} = \\sup_{q \\ge 1} q^{-1/2} (E|Y|^q)^{1/q}$.\n\nAlso, this lemma provides a rough bound on the magnitude of inlier samples, and this bound serves\nas a threshold for pre-processing the samples in the following RoLR algorithm.\n\n3.2 RoLR Algorithm\n\nWe now proceed to introduce the details of the proposed Robust Logistic Regression (RoLR) algorithm.\nBasically, RoLR first removes the samples with overly large magnitude and then maximizes\na trimmed correlation of the remaining samples with the estimated LR model. The intuition behind\nRoLR maximizing the trimmed correlation is: if the outliers have overly large magnitude, they will\nnot contribute to the correlation and thus not affect the LR parameter learning. Otherwise, they have\na bounded effect on the LR learning (which actually can be bounded by the inlier samples due to our\nadopting the trimmed statistic). Algorithm 1 gives the implementation details of RoLR.\n\nAlgorithm 1 RoLR\n\nInput: Contaminated training samples {(x1, y1), . . . , (xn+n1, yn+n1)}, an upper bound on the\nnumber of outliers n1, number of inliers n and sample dimension p.\nInitialization: Set $T = 4\\sqrt{\\log p/n + \\log n/n}$.\nPreprocessing: Remove samples (xi, yi) whose magnitude satisfies \u2016xi\u2016 \u2265 T.\nSolve the following linear programming problem (see Eqn. (3)):\n\n$$\\hat{\\beta} = \\arg\\max_{\\beta \\in B_2^p} \\sum_{i=1}^{n} [y \\langle \\beta, x \\rangle]_{(i)}.$$\n\nOutput: \u03b2\u0302.\n\nNote that, within the RoLR algorithm, we need to optimize the following sorted statistic:\n\n$$\\max_{\\beta \\in B_2^p} \\sum_{i=1}^{n} [y \\langle \\beta, x \\rangle]_{(i)}, \\qquad (2)$$\n\nwhere [\u00b7](i) is a sorted statistic such that [z](1) \u2264 [z](2) \u2264 . . . 
\u2264 [z](n), and z denotes the involved\nvariable. The problem in Eqn. (2) is equivalent to minimizing the summation of top n variables,\nwhich is a convex one and can be solved by an off-the-shelf solver (such as CVX). Here, we note that\nit can also be converted to the following linear programming problem (with a quadratic constraint),\nwhich enjoys higher computational efficiency. To see this, we first introduce auxiliary variables\nti \u2208 {0, 1} as indicators of whether the corresponding terms yi\u27e8\u03b2, \u2212xi\u27e9 fall in the smallest n ones.\nThen, we write the problem in Eqn. (2) as\n\n$$\\max_{\\beta \\in B_2^p} \\min_{t_i} \\sum_{i=1}^{n+n_1} t_i \\cdot y_i \\langle \\beta, x_i \\rangle, \\quad \\text{s.t. } \\sum_{i=1}^{n+n_1} t_i \\le n,\\ 0 \\le t_i \\le 1.$$\n\nHere the constraints $\\sum_{i=1}^{n+n_1} t_i \\le n$, $0 \\le t_i \\le 1$ are from the standard reformulation of $\\sum_{i=1}^{n+n_1} t_i = n$, $t_i \\in \\{0, 1\\}$. Now, the above problem becomes a max-min linear program. To decouple the\nvariables \u03b2 and ti, we turn to solving the dual form of the inner minimization problem. Let \u03bd and\n\u03bei be the Lagrange multipliers for the constraints $\\sum_{i=1}^{n+n_1} t_i \\le n$ and $t_i \\le 1$ respectively. Then the\ndual form w.r.t. ti of the above problem is:\n\n$$\\max_{\\beta, \\nu, \\xi_i} \\; -\\nu \\cdot n - \\sum_{i=1}^{n+n_1} \\xi_i, \\quad \\text{s.t. } y_i \\langle \\beta, x_i \\rangle + \\nu + \\xi_i \\ge 0,\\ \\beta \\in B_2^p,\\ \\nu \\ge 0,\\ \\xi_i \\ge 0. \\qquad (3)$$\n\nReformulating logistic regression into a linear programming problem as above significantly enhances\nthe scalability of LR in handling large-scale datasets, a property very appealing in practice,\nsince linear programming is known to be computationally efficient and has no problem dealing with\nup to $1 \\times 10^6$ variables on a standard PC.\n\n3.3 Performance Guarantee for RoLR\n\nIn contrast to traditional LR algorithms, RoLR does not perform a maximal likelihood estimation.\nInstead, RoLR maximizes the correlation yi\u27e8\u03b2, xi\u27e9. This strategy reduces the computational complexity\nof LR, and more importantly enhances the robustness of the parameter estimation, using\nthe fact that the authentic samples usually have positive correlation between yi and \u27e8\u03b2, xi\u27e9, as\ndescribed in the following lemma.\nLemma 3. Fix $\\beta \\in S^{p-1}$. Suppose that the sample (x, y) is generated by the model described in\n(1). The expectation of the product y\u27e8\u03b2, x\u27e9 is computed as:\n\n$$E\\, y \\langle \\beta, x \\rangle = E\\, \\mathrm{sech}^2(g/2),$$\n\nwhere $g \\sim N(0, \\sigma_x^2 + \\sigma_e^2)$ is a Gaussian random variable and $\\sigma_e^2$ is the noise level in (1). Furthermore,\nthe above expectation can be bounded as follows,\n\n$$\\varphi^+(\\sigma_e^2, \\sigma_x^2) \\le E\\, y \\langle \\beta, x \\rangle \\le \\varphi^-(\\sigma_e^2, \\sigma_x^2),$$\n\nwhere $\\varphi^+(\\sigma_e^2, \\sigma_x^2)$ and $\\varphi^-(\\sigma_e^2, \\sigma_x^2)$ are positive. In particular, they can take the form of\n$\\varphi^+(\\sigma_e^2, \\sigma_x^2) = \\frac{\\sigma_x^2}{3} \\mathrm{sech}^2\\left(\\frac{1 + \\sigma_e^2}{2}\\right)$ and $\\varphi^-(\\sigma_e^2, \\sigma_x^2) = \\frac{\\sigma_x^2}{3} + \\frac{\\sigma_x^2}{6} \\mathrm{sech}^2\\left(\\frac{1 + \\sigma_e^2}{2}\\right)$.\n\nThe following lemma shows the difference of correlations is an effective surrogate for the difference\nof the LR parameters. Thus we can always minimize the difference \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016 through maximizing\n$\\sum_i y_i \\langle \\hat{\\beta}, x_i \\rangle$.\nLemma 4. Fix $\\beta \\in S^{p-1}$ as the groundtruth parameter in (1) and $\\beta' \\in B_2^p$. Denote $\\eta = E\\, y \\langle \\beta, x \\rangle$.\nThen\n\n$$E\\, y \\langle \\beta', x \\rangle = \\eta \\langle \\beta, \\beta' \\rangle,$$\n\nand thus,\n\n$$E\\left[ y \\langle \\beta, x \\rangle - y \\langle \\beta', x \\rangle \\right] = \\eta (1 - \\langle \\beta, \\beta' \\rangle) \\ge \\frac{\\eta}{2} \\|\\beta - \\beta'\\|_2^2.$$\n\nBased on these two lemmas, along with some concentration properties of the inlier samples (shown\nin the supplementary material), we have the following performance guarantee of RoLR on LR model\nparameter recovery.\nTheorem 1 (RoLR for recovering LR parameter). Let $\\lambda \\triangleq n_1/n$ be the outlier fraction, \u03b2\u0302 be the\noutput of Algorithm 1, and \u03b2\u2217 be the ground truth parameter. Suppose that there are n authentic\nsamples generated by the model described in (1). Then we have, with probability larger than\n$1 - 4\\exp(-c_2 n/8)$,\n\n$$\\|\\hat{\\beta} - \\beta^*\\| \\le 2\\lambda \\frac{\\varphi^-(\\sigma_e^2, \\sigma_x^2)}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} + \\frac{2(\\lambda + 4 + 5\\sqrt{\\lambda})}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sqrt{\\frac{p}{n}} + \\frac{8\\lambda}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sigma_x^2 \\sqrt{\\frac{\\log p}{n} + \\frac{\\log n}{n}}.$$\n\nHere c2 is an absolute constant.\nRemark 1. To make the above results more explicit, we consider the asymptotic case where p/n \u2192\n0. Thus the above bound becomes\n\n$$\\|\\hat{\\beta} - \\beta^*\\| \\le 2\\lambda \\frac{\\varphi^-(\\sigma_e^2, \\sigma_x^2)}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)},$$\n\nwhich holds with probability larger than $1 - 4\\exp(-c_2 n/8)$. In the noiseless case, i.e., \u03c3e = 0, and\nassuming $\\sigma_x^2 = 1$, we have $\\varphi^+(\\sigma_e^2, \\sigma_x^2) = \\frac{1}{3} \\mathrm{sech}^2\\left(\\frac{1}{2}\\right) \\approx 0.2622$ and $\\varphi^-(\\sigma_e^2, \\sigma_x^2) = \\frac{1}{3} + \\frac{1}{6} \\mathrm{sech}^2\\left(\\frac{1}{2}\\right) \\approx 0.4644$.\nThe ratio is $\\varphi^-/\\varphi^+ \\approx 1.7715$. Thus the bound is simplified to:\n\n$$\\|\\hat{\\beta} - \\beta^*\\| \\lesssim 3.54\\lambda.$$\n\nRecall that $\\hat{\\beta}, \\beta^* \\in S^{p-1}$ and the maximal value of \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016 is 2. Thus, for the above result to be\nnon-trivial, we need 3.54\u03bb \u2264 2, namely \u03bb \u2264 0.56. In other words, in the noiseless case, RoLR\nis able to estimate the LR parameter with a non-trivial error bound (also known as a \u201cbreakdown\npoint\u201d) with up to 0.56/1.56 \u00d7 100% \u2248 36% of the samples being outliers.\n\n4 Empirical and Population Risk Bounds of RoLR\n\nBesides the parameter recovery, we are also concerned about the prediction performance of the\nestimated LR model in practice. The standard prediction loss function \u2113(\u00b7, \u00b7) of LR is a non-negative\nand bounded function, and is defined as:\n\n$$\\ell((x_i, y_i), \\beta) = \\frac{1}{1 + \\exp\\{-y_i \\beta^\\top x_i\\}}. \\qquad (4)$$\n\nThe goodness of an LR predictor \u03b2 is measured by its population risk:\n\n$$R(\\beta) = E_{P(X,Y)}\\, \\ell((x, y), \\beta),$$\n\nwhere P(X, Y) describes the joint distribution of covariate X and response Y. However, the population\nrisk rarely can be calculated directly as the distribution P(X, Y) is usually unknown. 
In\npractice, we often consider the empirical risk, which is calculated over the provided training samples\nas follows:\n\n$$R_{\\mathrm{emp}}(\\beta) = \\frac{1}{n} \\sum_{i=1}^{n} \\ell((x_i, y_i), \\beta).$$\n\nNote that the empirical risk is computed only over the authentic samples, hence cannot be directly\noptimized when outliers exist.\nBased on the bound of \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016 provided in Theorem 1, we can easily obtain the following empirical\nrisk bound for RoLR as the LR loss function given in Eqn. (4) is Lipschitz continuous.\nCorollary 1 (Bound on the empirical risk). Let \u03b2\u0302 be the output of Algorithm 1, and \u03b2\u2217 be the optimal\nparameter minimizing the empirical risk. Suppose that there are n authentic samples generated by\nthe model described in (1). Define $X \\triangleq 4\\sigma_x\\sqrt{(\\log n + \\log p)/n}$. Then we have, with probability\nlarger than $1 - 4\\exp(-c_2 n/8)$, the empirical risk of \u03b2\u0302 is bounded by,\n\n$$R_{\\mathrm{emp}}(\\hat{\\beta}) - R_{\\mathrm{emp}}(\\beta^*) \\le X \\left\\{ 2\\lambda \\frac{\\varphi^-(\\sigma_e^2, \\sigma_x^2)}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} + \\frac{2(\\lambda + 4 + 5\\sqrt{\\lambda})}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sqrt{\\frac{p}{n}} + \\frac{8\\lambda \\sigma_x^2}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sqrt{\\frac{\\log p}{n} + \\frac{\\log n}{n}} \\right\\}.$$\n\nGiven the empirical risk bound, we can readily obtain the bound on the population risk by referring\nto standard generalization results in terms of various function class complexities. Some widely used\ncomplexity measures include the VC-dimension [18] and the Rademacher and Gaussian complexity [1].\nCompared with the Rademacher complexity which is data dependent, the VC-dimension is\nmore universal although the resulting generalization bound can be slightly loose. 
Here, we adopt the\nVC-dimension to measure the function complexity and obtain the following population risk bound.\nCorollary 2 (Bound on the population risk). Let \u03b2\u0302 be the output of Algorithm 1, and \u03b2\u2217 be the optimal\nparameter. Suppose the parameter space $S^{p-1} \\ni \\beta$ has finite VC dimension d, and that there are n authentic\nsamples generated by the model described in (1). Define $X \\triangleq 4\\sigma_x\\sqrt{(\\log n + \\log p)/n}$.\nThen we have, with probability larger than $1 - 4\\exp(-c_2 n/8) - \\delta$, the population risk\nof \u03b2\u0302 is bounded by,\n\n$$R(\\hat{\\beta}) - R(\\beta^*) \\le X \\left\\{ 2\\lambda \\frac{\\varphi^-(\\sigma_e^2, \\sigma_x^2)}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} + \\frac{2(\\lambda + 4 + 5\\sqrt{\\lambda})}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sqrt{\\frac{p}{n}} + \\frac{8\\lambda \\sigma_x^2}{\\varphi^+(\\sigma_e^2, \\sigma_x^2)} \\sqrt{\\frac{\\log p}{n} + \\frac{\\log n}{n}} \\right\\} + 2c_3 \\sqrt{\\frac{d + \\ln(1/\\delta)}{n}}.$$\n\nHere both c2 and c3 are absolute constants.\n\n5 Robust Binary Classification\n\n5.1 Problem Setup\n\nDifferent from the sample generation model for LR, in the standard binary classification setting,\nthe label yi of a sample xi is deterministically determined by the sign of the linear measure of the\nsample \u27e8\u03b2\u2217, xi\u27e9. Namely, the samples are generated by the following model:\n\n$$y_i = \\mathrm{sign}(\\langle \\beta^*, x_i \\rangle + v_i). \\qquad (5)$$\n\nHere vi is a Gaussian noise as in Eqn. (1). Since yi is deterministically related to \u27e8\u03b2\u2217, xi\u27e9, the\nexpected correlation E y\u27e8\u03b2, x\u27e9 achieves the maximal value in this setup (ref. Lemma 5), which\nensures that RoLR also performs well for classification. 
We again assume that the training\nsamples contain n authentic samples and at most n1 outliers.\n\n5.2 Performance Guarantee for Robust Classification\n\nLemma 5. Fix $\\beta \\in S^{p-1}$. Suppose the sample (x, y) is generated by the model described in (5).\nThe expectation of the product y\u27e8\u03b2, x\u27e9 is computed as:\n\n$$E\\, y \\langle \\beta, x \\rangle = \\sqrt{\\frac{2\\sigma_x^4}{\\pi(\\sigma_x^2 + \\sigma_e^2)}}.$$\n\nComparing the above result with the one in Lemma 3, here for the binary classification, we can\nexactly calculate the expectation of the correlation, and this expectation is always larger than that of\nthe LR setting. The correlation depends on the signal-to-noise ratio \u03c3x/\u03c3e. In the noiseless case, \u03c3e =\n0 and the expected correlation is $\\sigma_x\\sqrt{2/\\pi}$, which is the well-known mean of the half-normal distribution.\nSimilarly to analyzing RoLR for LR, based on Lemma 5, we can obtain the following performance\nguarantee for RoLR in solving classification problems.\nTheorem 2. Let \u03b2\u0302 be the output of Algorithm 1, and \u03b2\u2217 be the optimal parameter minimizing the\nempirical risk. Suppose there are n authentic samples generated by the model described by (5).\nThen we have, with probability larger than $1 - 4\\exp(-c_2 n/8)$,\n\n$$\\|\\hat{\\beta} - \\beta^*\\|_2 \\le 2\\lambda + 2(\\lambda + 4 + 5\\sqrt{\\lambda}) \\sqrt{\\frac{(\\sigma_e^2 + \\sigma_x^2)\\pi p}{2\\sigma_x^4 n}} + 8\\lambda \\sqrt{\\frac{(\\sigma_e^2 + \\sigma_x^2)\\pi}{2}} \\sqrt{\\frac{\\log p}{n} + \\frac{\\log n}{n}}.$$\n\nThe proof of Theorem 2 is similar to that of Theorem 1. Also, similar to the LR case, based on\nthe above parameter error bound, it is straightforward to obtain the empirical and population risk\nbounds of RoLR for classification. 
Due to the space limitation, here we only sketch how to obtain\nthe risk bounds.\nFor the classification problem, the most natural loss function is the 0\u20131 loss. However, the 0\u20131\nloss function is non-convex and non-smooth, and we cannot get a non-trivial function value bound in\nterms of \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016 as we did for the logistic loss function. Fortunately, several convex surrogate\nloss functions for the 0\u20131 loss have been proposed and achieve good classification performance, which\ninclude the hinge loss, exponential loss and logistic loss. These loss functions are all Lipschitz\ncontinuous and thus we can bound their empirical and then population risks as for logistic regression.\n\n6 Simulations\n\nIn this section, we conduct simulations to verify the robustness of RoLR along with its applicability\nfor robust binary classification. We compare RoLR with standard logistic regression which estimates\nthe model parameter through maximizing the log-likelihood function.\nWe randomly generated the samples according to the model in Eqn. (1) for the logistic regression\nproblem. In particular, we first sample the model parameter \u03b2 \u223c N(0, Ip) and normalize it as\n\u03b2 := \u03b2/\u2016\u03b2\u20162. Here p is the dimension of the parameter, which is also the dimension of samples.\nThe samples are drawn i.i.d. from xi \u223c N(0, \u03a3x) with \u03a3x = Ip, and the Gaussian noise is sampled\nas vi \u223c N(0, \u03c3e). Then, the sample label yi is generated according to P{yi = +1} = \u03c4(\u27e8\u03b2, xi\u27e9 + vi)\nfor the LR case. For the classification case, the sample labels are generated by yi = sign(\u27e8\u03b2, xi\u27e9 + vi)\nand an additional nt = 1,000 authentic samples are generated for testing. The entries of outliers xo are\ni.i.d. random variables from the uniform distribution on [\u2212\u03c3o, \u03c3o] with \u03c3o = 10. 
The labels of outliers are\ngenerated by yo = sign(\u27e8\u2212\u03b2, xo\u27e9). That is, outliers follow a model with the opposite sign to that of the inliers,\nwhich, according to our experiments, is the most adversarial outlier model. The ratio of outliers to\ninliers is denoted as \u03bb = n1/n, where n1 is the number of outliers and n is the number of inliers.\nWe fix n = 1,000 and vary \u03bb from 0 to 1.2, with a step of 0.1.\n\nFigure 2: Performance comparison between RoLR, ordinary LR and LR with the thresholding preprocessing\nas in RoLR (LR+P) for (a) regression parameter estimation and (b) classification, under\nthe setting of \u03c3e = 0.5, \u03c3o = 10, p = 20 and n = 1,000. The simulation is repeated 10 times.\n\nWe repeat the simulations under each outlier fraction setting 10 times and plot the performance\n(including the average and the variance) of RoLR and ordinary LR versus the ratio of outliers to\ninliers in Figure 2. In particular, for the task of logistic regression, we measure the performance\nby the parameter prediction error \u2016\u03b2\u0302 \u2212 \u03b2\u2217\u2016. For classification, we use the classification error rate\non test samples, #(\u0177i \u2260 yi)/nt, as the performance measure. Here \u0177i = sign(\u03b2\u0302\u22a4xi) is the\npredicted label for sample xi and yi is the ground truth sample label. The results, shown in Figure 2,\nclearly demonstrate that RoLR performs much better than standard LR for both tasks. Even when\nthe outlier fraction is small (\u03bb = 0.1), RoLR already outperforms LR by a large margin. From\nFigure 2(a), we observe that when \u03bb \u2265 0.3, the parameter estimation error of LR reaches around\n1.3, which is unsatisfactory since simply outputting the trivial solution \u03b2\u0302 = 0 has an error of\n1 (recall \u2016\u03b2\u2217\u20162 = 1). 
In contrast, RoLR keeps the estimation error around 0.5, even\nwhen \u03bb = 0.8, i.e., around 45% of the samples are outliers. To see the role of preprocessing in\nRoLR, we also apply such preprocessing to LR and plot its performance as \u201cLR+P\u201d in the figure. It\ncan be seen that the preprocessing step indeed helps remove certain outliers with large magnitudes.\nHowever, when the fraction of outliers increases to \u03bb = 0.5, more outliers with magnitudes smaller\nthan the pre-defined threshold enter the remaining samples and increase the error of \u201cLR+P\u201d to be\nlarger than 1. This demonstrates that maximizing the correlation is more essential than the thresholding\nfor the robustness gain of RoLR. From the results for classification, shown in Figure 2(b), we observe\nthat again from \u03bb = 0.2, LR starts to break down. The classification error rate of LR reaches 0.8,\nwhich is even worse than random guessing. In contrast, RoLR still achieves satisfactory classification\nperformance with a classification error rate around 0.4 even with \u03bb \u2192 1. But when \u03bb > 1, RoLR also\nbreaks down as outliers dominate the training samples.\nWhen there are no outliers, with the same inliers ($n = 1 \\times 10^3$ and p = 20), the error of LR in logistic\nregression estimation is 0.06 while the error of RoLR is 0.13. Such performance degradation in\nRoLR arises because RoLR maximizes the linear correlation statistics instead of the likelihood as in\nLR in inferring the regression parameter. This is the price RoLR needs to pay for the robustness.\nWe provide more investigations and also results for real large data in the supplementary material.\n\n7 Conclusions\n\nWe investigated the problem of logistic regression (LR) under a practical case where the covariate\nmatrix is adversarially corrupted. Standard LR methods were shown to fail in this case. We proposed\na novel LR method, RoLR, to solve this issue. 
We theoretically and experimentally demonstrated\nthat RoLR is robust to the covariate corruptions. Moreover, we devised a linear programming algorithm\nto solve RoLR, which is computationally efficient and can scale to large problems. We further\napplied RoLR to successfully learn classifiers from corrupted training samples.\n\nAcknowledgments\n\nThe work of H. Xu was partially supported by the Ministry of Education of Singapore through\nAcRF Tier Two grant R-265-000-443-112. The work of S. Mannor was partially funded by the Intel\nCollaborative Research Institute for Computational Intelligence (ICRI-CI) and by the Israel Science\nFoundation (ISF under contract 920/12).\n\nReferences\n\n[1] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3:463\u2013482, 2003.\n[2] Ana M Bianco and V\u00edctor J Yohai. Robust estimation in the logistic regression model. Springer, 1996.\n[3] Yudong Chen, Constantine Caramanis, and Shie Mannor. Robust sparse regression under adversarial corruption. In ICML, 2013.\n[4] R Dennis Cook and Sanford Weisberg. Residuals and influence in regression. 1982.\n[5] JB Copas. Binary regression models for contaminated data. Journal of the Royal Statistical Society. Series B (Methodological), pages 225\u2013265, 1988.\n[6] Nan Ding, SVN Vishwanathan, Manfred Warmuth, and Vasil S Denchev. T-logistic regression for binary and multiclass classification. Journal of Machine Learning Research, 5:1\u201355, 2013.\n[7] Frank R Hampel. The influence curve and its role in robust estimation. 
Journal of the American Statistical Association, 69(346):383\u2013393, 1974.\n[8] Peter J Huber. Robust statistics. Springer, 2011.\n[9] Wesley Johnson. Influence measures for logistic regression: Another point of view. Biometrika, 72(1):59\u201365, 1985.\n[10] Hans R K\u00fcnsch, Leonard A Stefanski, and Raymond J Carroll. Conditionally unbiased bounded-influence estimation in general regression models, with applications to generalized linear models. Journal of the American Statistical Association, 84(406):460\u2013466, 1989.\n[11] Su-In Lee, Honglak Lee, Pieter Abbeel, and Andrew Y Ng. Efficient L1 regularized logistic regression. In AAAI, 2006.\n[12] Po-Ling Loh and Martin J Wainwright. High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. Annals of Statistics, 40(3):1637, 2012.\n[13] Yaniv Plan and Roman Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. Information Theory, IEEE Transactions on, 59(1):482\u2013494, 2013.\n[14] Daryl Pregibon. Logistic regression diagnostics. The Annals of Statistics, pages 705\u2013724, 1981.\n[15] Daryl Pregibon. Resistant fits for some commonly used logistic models with medical applications. Biometrics, pages 485\u2013498, 1982.\n[16] Leonard A Stefanski, Raymond J Carroll, and David Ruppert. Optimally bounded score functions for generalized linear models with applications to logistic regression. Biometrika, 73(2):413\u2013424, 1986.\n[17] Julie Tibshirani and Christopher D Manning. Robust logistic regression using shift parameters. arXiv preprint arXiv:1305.4987, 2013.\n[18] Vladimir N Vapnik and A Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. 
Theory of Probability & Its Applications, 16(2):264\u2013280, 1971.\n", "award": [], "sourceid": 184, "authors": [{"given_name": "Jiashi", "family_name": "Feng", "institution": "UC Berkeley"}, {"given_name": "Huan", "family_name": "Xu", "institution": "NUS"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}, {"given_name": "Shuicheng", "family_name": "Yan", "institution": "National University of Singapore"}]}