{"title": "Sufficient Conditions for Generating Group Level Sparsity in a Robust Minimax Framework", "book": "Advances in Neural Information Processing Systems", "page_first": 2577, "page_last": 2585, "abstract": "Regularization technique has become a principle tool for statistics and machine learning research and practice. However, in most situations, these regularization terms are not well interpreted, especially on how they are related to the loss function and data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms are essentially corresponding to different distortions to the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further gave mathematically exact definition for a novel representation called sparse grouping representation (SGR), and proved sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. This SGR is essentially different from group lasso in the way of using class or group information, and it outperforms group lasso when there appears group label noise. We also gave out some generalization bounds in a classification setting.", "full_text": "Suf\ufb01cient Conditions for Generating Group Level\n\nSparsity in a Robust Minimax Framework\n\nHongbo Zhou and Qiang Cheng\n\nComputer Science department,\n\nSouthern Illinois University Carbondale, IL, 62901\nhongboz@siu.edu, qcheng@cs.siu.edu\n\nAbstract\n\nRegularization technique has become a principled tool for statistics and machine\nlearning research and practice. 
However, in most situations, these regularization terms are not well interpreted, especially with respect to how they are related to the loss function and the data. In this paper, we propose a robust minimax framework to interpret the relationship between data and regularization terms for a large class of loss functions. We show that various regularization terms essentially correspond to different distortions of the original data matrix. This minimax framework includes ridge regression, lasso, elastic net, fused lasso, group lasso, local coordinate coding, multiple kernel learning, etc., as special cases. Within this minimax framework, we further give a mathematically exact definition for a novel representation called the sparse grouping representation (SGR), and prove a set of sufficient conditions for generating such group level sparsity. Under these sufficient conditions, a large set of consistent regularization terms can be designed. The SGR is essentially different from group lasso in the way it uses class or group information, and it outperforms group lasso in the presence of group label noise. We also provide some generalization bounds in a classification setting.

1 Introduction

A general form of estimating a quantity w ∈ R^n from an empirical measurement set X by minimizing a regularized or penalized functional is

ŵ = argmin_w {L(I_w(X)) + λJ(w)},   (1)

where I_w(X) ∈ R^m expresses the relationship between w and the data X, L(·) : R^m → R^+ is a loss function, J(·) : R^n → R^+ is a regularization term, and λ ∈ R is a weight. Positive integers n, m represent the dimensions of the associated Euclidean spaces.
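As a concrete instance of Equation (1), take the ridge case: L the squared Euclidean norm, I_w(X) = y − Aw, and J(w) = ||w||_2^2. The following numpy sketch (our own illustration; the matrix sizes, seed, and variable names are ours, not the paper's) computes the closed-form minimizer and checks its stationarity.

```python
import numpy as np

# A minimal numeric sketch of Equation (1) for the ridge case:
#   w_hat = argmin_w ||y - A w||_2^2 + lam * ||w||_2^2,
# whose closed form is w_hat = (A^T A + lam I)^{-1} A^T y.

rng = np.random.default_rng(0)
p, n = 20, 5                      # here m = p: the residual lives in R^p
A = rng.standard_normal((p, n))   # data matrix
y = rng.standard_normal(p)        # measurement
lam = 0.1                         # the weight lambda in Equation (1)

w_hat = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)

# Sanity check: w_hat is a stationary point of the penalized objective,
# so the gradient 2 A^T (A w - y) + 2 lam w vanishes.
grad = 2 * A.T @ (A @ w_hat - y) + 2 * lam * w_hat
assert np.allclose(grad, 0, atol=1e-8)
```

The same template applies to the other penalties discussed below, except that the nonsmooth ones (l1, group penalties) have no closed form and require an iterative solver.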
Depending on the specific application, the loss function L takes many forms; the most often used are those induced by the squared Euclidean norm or squared Hilbertian norms (A is induced by B means that B is the core part of A). Empirically, the functional J is often interpreted as a smoothing function, a model bias, or an uncertainty. Although Equation (1) has been widely used, it is difficult to establish a general, mathematically exact relationship between L and J. This directly encumbers the interpretability of parameters in model selection. It would be desirable if we could represent Equation (1) in the simpler form

ŵ = argmin_w L′(I′_w(X)).   (2)

Obviously, Equation (2) provides better interpretability for the regularization term in Equation (1) by explicitly expressing the model bias or uncertainty as a variable of the relationship functional. In this paper, we introduce a minimax framework and show that for a large family of Euclidean norm induced loss functions, an equivalence relationship between Equation (1) and Equation (2) can be established. Moreover, the model bias or uncertainty will be expressed as distortions associated with certain functional spaces. We give a series of corollaries showing that the well-studied lasso, group lasso, local coordinate coding, multiple kernel learning, etc., are all special cases of this novel framework. As a result, we shall see that the various regularization terms associated with lasso, group lasso, etc., can be interpreted as distortions belonging to different distortion sets.

Within this framework, we further investigate a large family of distortion sets which can generate a special type of group level sparsity, which we call the sparse grouping representation (SGR). Instead of merely designing one specific regularization term, we give sufficient conditions on the distortion sets to generate the SGR.
Under these sufficient conditions, a large set of consistent regularization terms can be designed. Compared with the well-known group lasso, which uses group distribution information in a supervised learning setting, the SGR is unsupervised and thus essentially different from the group lasso. In a novel fault-tolerant classification application, where class or group label noise is present, we show that the SGR outperforms the group lasso. This is not surprising, because the class or group label information is used as a core part of the group lasso, while the group sparsity produced by the SGR is intrinsic, in that the SGR does not need the class label information as a prior. Finally, we also note that group level sparsity is of great interest due to its wide applications in various supervised learning settings.

In this paper, we state our results in a classification setting. In Section 2 we review some closely related work, and in Section 3 we introduce the robust minimax framework. In Section 4, we define the sparse grouping representation and prove a set of sufficient conditions for generating group level sparsity. An experimental verification on a low resolution face recognition task is reported in Section 5.

2 Related Work

In this paper, we mainly work with the penalized linear regression problem, and we review some closely related work here. For penalized linear regression, several well-studied regularization procedures are ridge regression or Tikhonov regularization [15], bridge regression [10], lasso [19] and subset selection [5], fused lasso [20], elastic net [27], group lasso [25], multiple kernel learning [3, 2], local coordinate coding [24], etc.
The lasso has at least three prominent features that make it a principled tool among these procedures: simultaneous continuous shrinkage and automatic variable selection, computational tractability (it can be solved by linear programming methods), and the ability to induce sparsity. Recent results show that the lasso can recover the solution of l0 regularization under certain regularity conditions [8, 6, 7]. Recent advances such as fused lasso [20], elastic net [27], group lasso [25] and local coordinate coding [24] are motivated by the lasso [19].

Two concepts closely related to our work are the elastic net, or grouping effect, observed by [27], and the group lasso [25]. The elastic net model hybridizes lasso and ridge regression to preserve some redundancy for variable selection; it can be viewed as a stabilized version of the lasso [27] and hence is still biased. The group lasso can produce group level sparsity [25, 2], but it requires the group label information as a prior. We shall see that in a novel classification application where class label noise is present [22, 18, 17, 26], the group lasso fails. We will discuss the differences among various regularization procedures in a classification setting. We will use the basic schema of the sparse representation classification (SRC) algorithm proposed in [21], with different regularization procedures replacing the lasso in the SRC.

The proposed framework reveals a fundamental connection between robust linear regression and various regularization techniques using regularization terms of l0, l1, l2, etc.
Although [11] first introduced a robust model for the least squares problem with uncertain data, and [23] discussed a robust model for the lasso, our results allow for using any positive regularization function and a large family of loss functions.

3 Minimax Framework for Robust Linear Regression

In this section, we start by taking the loss function L to be the squared Euclidean norm, and we generalize the results to other loss functions in Section 3.4.

3.1 Notations and Problem Statement

In a general M-class (M > 1) classification setting, we are given a training dataset T = {(x_i, g_i)}_{i=1}^n, where x_i ∈ R^p is the feature vector and g_i ∈ {1, · · · , M} is the class label of the ith observation. A data (observation) matrix is formed as A = [x_1, · · · , x_n] of size p × n. Given a test example y, the goal is to determine its class label.

3.2 Distortion Models

Assume that the jth class C_j has n_j observations x_1^(j), · · · , x_{n_j}^(j). If x belongs to the jth class, then x ∈ span{x_1^(j), · · · , x_{n_j}^(j)}. We approximate y by a linear combination of the training examples:

y = Aw + η,   (3)

where w = [w_1, w_2, · · · , w_n]^T is a vector of combining coefficients, and η ∈ R^p represents a vector of additive zero-mean noise. We assume a Gaussian model η ∼ N(0, σ^2 I) for this additive noise, so a least squares estimator can be used to compute the combining coefficients.

The observed training dataset T may have undergone various noise or distortions. We define the following two classes of distortion models.

Definition 1: A random matrix ΔA is called bounded example-wise (or attribute) distortion (BED) with a bound λ, denoted as BED(λ), if ΔA := [d_1, · · · , d_n], d_k ∈ R^p, ||d_k||_2 ≤ λ, k = 1, · · · , n,
where λ is a positive parameter.

This distortion model assumes that each observation (signal) is distorted independently of the other observations, and that the distortion has a uniformly upper bounded energy ("uniformity" refers to the fact that all the examples have the same bound). BED includes the attribute noise defined in [22, 26]; some examples of BED are Gaussian noise and sampling noise in face recognition.

Definition 2: A random matrix ΔA is called bounded coefficient distortion (BCD) with bound f, denoted as BCD(f), if ||ΔAw||_2 ≤ f(w), ∀w ∈ R^p, where f(w) ∈ R^+.

The above definition allows for any distortion with or without inter-observation dependency. For example, we can take f(w) = λ||w||_2, and Definition 2 with this f(w) means that the maximum singular value of ΔA is upper bounded by λ. This can be easily seen as follows. Denote the maximum singular value of ΔA by σ_max(ΔA). Then we have

σ_max(ΔA) = sup_{u,v≠0} (u^T ΔA v)/(||u||_2 ||v||_2) = sup_{u≠0} ||ΔAu||_2 / ||u||_2,

which is a standard result from the singular value decomposition (SVD) [12]. That is, the condition ||ΔAw||_2 ≤ λ||w||_2 is equivalent to the condition that the maximum singular value of ΔA is upper bounded by λ. In fact, BED is a subset of BCD, as can be seen by using the triangle inequality and taking special forms of f(w).
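The characterization of BCD with f(w) = λ||w||_2 as a bound on the largest singular value can be checked numerically. A small numpy sketch (our own; the matrix size and seed are arbitrary choices):

```python
import numpy as np

# Checks the claim after Definition 2: for every w,
#   ||dA @ w||_2 <= sigma_max(dA) * ||w||_2,
# with equality attained in the direction of the top right singular vector,
# so sigma_max(dA) = sup_{w != 0} ||dA w||_2 / ||w||_2.

rng = np.random.default_rng(1)
dA = rng.standard_normal((8, 5))          # a distortion matrix Delta-A
sigma_max = np.linalg.svd(dA, compute_uv=False)[0]

# The bound holds for random directions ...
for _ in range(100):
    w = rng.standard_normal(5)
    assert np.linalg.norm(dA @ w) <= sigma_max * np.linalg.norm(w) + 1e-12

# ... and is attained at the top right singular vector v1.
v1 = np.linalg.svd(dA)[2][0]
attained = np.linalg.norm(dA @ v1) / np.linalg.norm(v1)
assert np.isclose(attained, sigma_max)
```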
We will use D := BCD to represent the distortion model.

Besides the additive residue η generated from fitting models, to account for the above distortion models we shall consider multiplicative noise by extending Equation (3) as follows:

y = (A + ΔA)w + η,   (4)

where ΔA ∈ D represents a possible distortion imposed on the observations.

3.3 Fundamental Theorem of Distortion

Now, with the above refined linear model that incorporates a distortion model, we estimate the model parameters w by minimizing the variance of the Gaussian residues under the worst distortion within a permissible distortion set D. Thus our robust model is

min_{w∈R^p} max_{ΔA∈D} ||y − (A + ΔA)w||_2.   (5)

The above minimax estimation will be used in our robust framework.

An advantage of this model is that it considers additive noise as well as multiplicative noise within a class of allowable noise models. As the optimal estimate of the model parameter in Equation (5), w*, is derived for the worst distortion in D, w* will be insensitive to any deviation from the underlying (unknown) noise-free examples, provided the deviation is limited to the tolerance level given by D. The estimate w* is thus applicable to any A + ΔA with ΔA ∈ D. In brief, the robustness of our framework comes from modeling possible multiplicative noise, together with the consequent insensitivity of the estimated parameter to any deviations (within D) from the noise-free underlying (unknown) data. Moreover, this model can seamlessly incorporate example-wise noise, class noise, or both.

Equation (5) provides a clear interpretation of the robust model. In the following, we give a theorem showing an equivalence relationship between the robust minimax model of Equation (5) and a general form of regularized linear regression.

Theorem 1.
Equation (5) with distortion set D(f) is equivalent to the following generalized regularized minimization problem:

min_{w∈R^p} ||y − Aw||_2 + f(w).   (6)

Sketch of the proof: Fix w = w* and establish equality between the upper and lower bounds. By the triangle inequality of norms,

||y − (A + ΔA)w*||_2 ≤ ||y − Aw*||_2 + ||ΔAw*||_2 ≤ ||y − Aw*||_2 + f(w*).

If y − Aw* ≠ 0, define u = (y − Aw*)/||y − Aw*||_2 and take ΔA* = −u f(w*) t(w*)^T / k, where t(w*)_i = 1/w*_i for w*_i ≠ 0, t(w*)_i = 0 for w*_i = 0, and k is the number of non-zero entries of w* (note that w* is fixed, so t(w*) is well defined). Then ΔA* w* = −u f(w*) t(w*)^T w* / k = −u f(w*), so ||y − (A + ΔA*)w*||_2 = ||y − Aw*||_2 + f(w*); since the maximum over ΔA ∈ D is at least the value attained at ΔA*, the upper bound is actually attained. It is easily verified that the expression is also valid if y − Aw* = 0.

Theorem 1 gives an equivalence relationship between general regularized least squares problems and robust regression under certain distortions. It should be noted that Equation (6) involves min ||·||_2, while the standard form of the least squares problem uses min ||·||_2^2. It is known that these two coincide up to a change of the regularization coefficient, so the following conclusions are valid for both of them.
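The attainment step in the proof sketch can be checked numerically: with the construction ΔA* = −u f(w*) t(w*)^T / k, the worst-case residual norm equals ||y − Aw*||_2 + f(w*) exactly. A numpy sketch of this check (our own; we pick f(w) = ||w||_1 as one admissible choice, and the data are synthetic):

```python
import numpy as np

# Numeric check of the attainment step in the proof of Theorem 1:
# with u = (y - A w*) / ||y - A w*||_2,  t(w*)_i = 1/w*_i on the support,
# k = #nonzeros, and  dA* = -u f(w*) t(w*)^T / k,  one gets
#   dA* w* = -u f(w*),
# hence ||y - (A + dA*) w*||_2 = ||y - A w*||_2 + f(w*).

rng = np.random.default_rng(2)
p, n = 6, 4
A = rng.standard_normal((p, n))
y = rng.standard_normal(p)
w_star = np.array([1.5, 0.0, -0.7, 0.3])   # some fixed w* with a zero entry
f_val = np.sum(np.abs(w_star))             # f(w*) for f(w) = ||w||_1

r = y - A @ w_star
u = r / np.linalg.norm(r)
nz = w_star != 0
t = np.where(nz, 1.0 / np.where(nz, w_star, 1.0), 0.0)   # t(w*)
k = np.count_nonzero(w_star)
dA_star = -np.outer(u, t) * f_val / k

lhs = np.linalg.norm(y - (A + dA_star) @ w_star)
rhs = np.linalg.norm(y - A @ w_star) + f_val
assert np.isclose(lhs, rhs)    # the upper bound is attained
```

The key identity is t(w*)^T w* = k, which makes ΔA* w* collapse to −u f(w*).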
Several corollaries related to l0, l1, l2, elastic net, group lasso, local coordinate coding, etc., can be derived based on Theorem 1.

Corollary 1: l0 regularized regression is equivalent to taking a distortion set D(f^l0) where f^l0(w) = t(w)w^T, with t(w_i) = 1/w_i for w_i ≠ 0 and t(w_i) = 0 for w_i = 0.

Corollary 2: l1 regularized regression (lasso) is equivalent to taking a distortion set D(f^l1) where f^l1(w) = λ||w||_1.

Corollary 3: Ridge regression (l2) is equivalent to taking a distortion set D(f^l2) where f^l2(w) = λ||w||_2.

Corollary 4: Elastic net regression [27] (l2 + l1) is equivalent to taking a distortion set D(f^e) where f^e(w) = λ_1||w||_1 + λ_2||w||_2^2, with λ_1 > 0, λ_2 > 0.

Corollary 5: Group lasso [25] (grouped l1 of l2) is equivalent to taking a distortion set D(f^gl1) where f^gl1(w) = Σ_{j=1}^m d_j ||w_j||_2, where d_j is the weight for the jth group and m is the number of groups.

Corollary 6: Local coordinate coding [24] is equivalent to taking a distortion set D(f^lcc) where f^lcc(w) = Σ_{i=1}^n |w_i| ||x_i − y||_2^2, where x_i is the ith basis, n is the number of bases, and y is the test example.

Similar results can be derived for multiple kernel learning [3, 2], overlapped group lasso [16], etc.

3.4 Generalization to Other Loss Functions

From the proof of Theorem 1, we can see that the Euclidean norm used in Theorem 1 can be generalized to other loss functions as well. We only require that the loss function be a proper norm in a normed vector space. Thus, we have the following theorem for the general form of Equation (1).

Theorem 2.
Given the relationship function I_w(X) = y − Aw and J ∈ R^+ in a normed vector space, if the loss functional L is a norm, then Equation (1) is equivalent to the following minimax estimation with distortion set D(J):

min_{w∈R^p} max_{ΔA∈D(J)} L(y − (A + ΔA)w).   (7)

4 Sparse Grouping Representation

4.1 Definition of SGR

We consider a classification application where class noise is present. The class noise can be viewed as inter-example distortions. The following novel representation is proposed to deal with such distortions.

Definition 3. Assume all examples are standardized with zero mean and unit variance. Let ρ_ij = x_i^T x_j be the correlation of any two examples x_i, x_j ∈ T. Given a test example y, w ∈ R^n is defined as a sparse grouping representation for y if both of the following conditions are satisfied:
(a) If w_i ≥ ε and ρ_ij > δ, then |w_i − w_j| → 0 (as δ → 1) for all i and j.
(b) If w_i < ε and ρ_ij > δ, then w_j → 0 (as δ → 1) for all i and j.
Here ε is the sparsity threshold and δ is the grouping threshold.

This definition requires that if two examples are highly correlated, then the resulting coefficients tend to be identical. Condition (b) produces sparsity by requiring that small coefficients be automatically thresholded to zero. Condition (a) preserves grouping effects [27] by selecting all coefficients that are larger than a certain threshold. In the following we provide sufficient conditions for the distortion set D(J) to produce this group level sparsity.

4.2 Group Level Sparsity

As is known, D(l1), or lasso, can select only one, arbitrary, example from many identical candidates [27].
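The contrast between lasso's arbitrary selection and the grouping behavior required by Definition 3 can be seen even in the simplest l2-penalized case. The numpy sketch below (our own toy construction) duplicates a column and shows that the ridge solution assigns the two perfectly correlated examples identical coefficients:

```python
import numpy as np

# Grouping effect illustration: with an l2-type distortion set (ridge),
# two identical columns receive identical coefficients, whereas D(l1)
# (lasso) may put all of the shared weight on either copy arbitrarily.

rng = np.random.default_rng(3)
p = 30
x = rng.standard_normal(p)
x /= np.linalg.norm(x)
A = np.column_stack([x, x, rng.standard_normal(p)])  # columns 0 and 1 identical
y = rng.standard_normal(p)
mu = 0.5

# Ridge solution: w = (A^T A + mu I)^{-1} A^T y.
w = np.linalg.solve(A.T @ A + mu * np.eye(3), A.T @ y)
assert np.isclose(w[0], w[1])   # perfectly correlated -> identical coefficients
```

The equality w[0] = w[1] follows from the strict convexity and permutation symmetry of the ridge objective in the two duplicated coordinates; the lasso objective lacks this strict convexity, which is exactly the arbitrariness discussed next.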
This leads to sensitivity to class noise, as the example the lasso chooses may be mislabeled. As a consequence, the sparse representation classification (SRC), a lasso based classification schema [21], is not suitable for applications in the presence of class noise. The group lasso can produce group level sparsity, but it uses group label information to restrict the distribution of the coefficients. When there is group label noise or class noise, the group lasso will fail because it cannot correctly determine the group. Definition 3 shows that the SGR is defined by example correlations, and thus it is not affected by class noise.

In the general situation where the examples are not identical but have high within-class correlations, we give the following theorem to show that the grouping is robust in terms of data correlation. From now on, for a distortion set D(f(w)), we require that f(w) = 0 for w = 0, and we use a special form of f(w), namely a sum of components f_j(w_j):

f(w) = μ Σ_{j=1}^n f_j(w_j).

Theorem 3. Assume all examples are standardized. Let ρ_ij = x_i^T x_j be the correlation of any two examples. For a given test example y, if both f_i ≠ 0 and f_j ≠ 0 have first order derivatives, we have

|f′_i − f′_j| ≤ (2||y||_2 / μ) √(2(1 − ρ_ij)).   (8)

Sketch of the proof: Differentiating ||y − Aw||_2^2 + μ Σ f_j with respect to w_i and w_j respectively, we have −2x_i^T(y − Aw) + μf′_i = 0 and −2x_j^T(y − Aw) + μf′_j = 0. The difference of these two equations is

f′_i − f′_j = (2/μ)(x_i − x_j)^T r,

where r = y − Aw is the residual vector. Since all examples are standardized, we have ||x_i − x_j||_2^2 = 2(1 − ρ_ij), where ρ_ij = x_i^T x_j.
For the particular value w = 0 we have ||r||_2 = ||y||_2, and since the optimal w achieves an objective no larger than that at w = 0 while f(0) = 0, we get ||r||_2 ≤ ||y||_2 at the optimal value of w. Combining this bound on r with ||x_i − x_j||_2, we have proved Theorem 3.

This theorem differs from Theorem 1 in [27] in the following aspects: a) we place no restrictions on the signs of w_i or w_j; b) we use a family of functions, which gives more choices for bounding the coefficients; as noted above, f_i need not be the same as f_j, and we can even use different growth rates for different components; and c) f′_i(w_i) does not have to be w_i; a monotonic function with a very small growth rate would be enough.

As an illustrative example, we can choose f_i(w_i) or f_j(w_j) to be a second order function of w_i or w_j. Then the resulting |f′_i − f′_j| is the difference of the coefficients, λ|w_i − w_j|, for a constant λ. If the two examples are highly correlated and μ is sufficiently large, then we can conclude that the difference of the coefficients will be close to zero.

Sparsity implies an automatic thresholding ability by which all small estimated coefficients are shrunk to zero; that is, f(w) has to be singular at the point w = 0 [9]. Incorporating this requirement with Theorem 3, we can achieve group level sparsity: if some of the group coefficients are small and automatically thresholded to zero, all other coefficients within this group will be set to zero too. This correlation based group level sparsity does not require any prior information on the distribution of group labels.

To make a good estimator, there are still two properties we have to consider: continuity and unbiasedness [9].
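The bound of Theorem 3, in its quadratic special case f_j(w_j) = w_j^2 just described, can be checked numerically. A numpy sketch (our own; the sizes, seed, and μ are arbitrary choices):

```python
import numpy as np

# Numeric check of Theorem 3 for f_j(w_j) = w_j^2 (so f'_j = 2 w_j).
# The penalized problem  min_w ||y - A w||_2^2 + mu * sum_j w_j^2
# has closed form w = (A^T A + mu I)^{-1} A^T y, and the theorem asserts
#   |f'_i - f'_j| <= (2 ||y||_2 / mu) * sqrt(2 (1 - rho_ij)).

rng = np.random.default_rng(4)
p, n = 50, 6
A = rng.standard_normal((p, n))
A -= A.mean(axis=0)                 # standardize: zero mean ...
A /= np.linalg.norm(A, axis=0)      # ... and unit norm columns
y = rng.standard_normal(p)
mu = 0.3

w = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ y)
fprime = 2 * w
rho = A.T @ A                       # rho_ij = x_i^T x_j

for i in range(n):
    for j in range(n):
        gap = max(0.0, 2 * (1 - rho[i, j]))   # guard against tiny negatives
        bound = (2 * np.linalg.norm(y) / mu) * np.sqrt(gap)
        assert abs(fprime[i] - fprime[j]) <= bound + 1e-9
```

As the proof predicts, columns with correlation near 1 are forced to carry nearly identical coefficients once μ is moderately large.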
In short, to avoid instability, we always require the resulting estimator for w to be a continuous function, and a sufficient condition for unbiasedness is that f′(|w|) = 0 when |w| is large. Generally, the requirement of stability conflicts with that of sparsity: smoothness determines the stability, while singularity at zero measures the degree of sparsity. As an extreme example, l1 can produce sparsity while l2 cannot, because l1 is singular at zero while l2 is smooth there; at the same time, l2 is more stable than l1. More details on these conditions can be found in [1, 9].

4.3 Sufficient Conditions for SGR

Based on the above discussion, we can readily construct a sparse grouping representation based on Equation (5), where we only need to specify a distortion set D(f*(w)) satisfying the following sufficient conditions.

Lemma 1: Sufficient conditions for SGR.
(a) f*″_j ∈ R^+ wherever f*′_j ≠ 0.
(b) f*_j is continuous and singular at zero with respect to w_j, for all j.
(c) f*′_j(|w_j|) = 0 for large |w_j|, for all j.
Proof: Together with Theorem 3, this is easily verified.

As we can see, the regularization term λl1 + (1 − λ)l2^2 proposed by [27] satisfies conditions (a) and (b), but it fails to comply with (c), so it may become biased for large |w|. Based on these conditions, we can easily construct regularization terms f* that generate the sparse grouping representation. We call these f* core functions for producing the SGR. As concrete examples, we can construct a large family of clipped μ_1 l_q + μ_2 l2^2 penalties, where 0 < q ≤ 1, by restricting f′_i = w_i I(|w_i| < ε) + c for some constants ε and c. Also, SCAD [9] satisfies all three conditions, so it belongs to f*.
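The SCAD penalty of Fan and Li [9] is one concrete core function f*. Its derivative, for t = |w| > 0 and a > 2 (commonly a = 3.7), is λ on [0, λ], tapers linearly on (λ, aλ), and is exactly zero beyond aλ, which is precisely the combination of singularity at zero (sparsity) and vanishing derivative for large |w| (unbiasedness) required above. A small sketch (function and variable names are ours):

```python
import numpy as np

# Derivative of the SCAD penalty [Fan and Li, 9]:
#   p'(t) = lam                      if t <= lam
#         = (a*lam - t) / (a - 1)    if lam < t < a*lam
#         = 0                        if t >= a*lam.

def scad_derivative(t, lam=1.0, a=3.7):
    t = np.abs(t)
    mid = np.clip((a * lam - t) / (a - 1), 0.0, None)  # tapering region, floored at 0
    return np.where(t <= lam, lam, mid)

assert scad_derivative(0.5) == 1.0    # positive slope lam near zero: sparsity
assert scad_derivative(2.0) > 0.0     # tapering region between lam and a*lam
assert scad_derivative(10.0) == 0.0   # zero slope for large |w|: unbiasedness
```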
This gives more theoretical justification for the previous empirical success of SCAD.

4.4 Generalization Bounds in the Presence of Class Noise

We follow the algorithm given in [21] and merely replace the lasso with the SGR or group lasso. After estimating the (minimax) optimal combining coefficient vector w* by the SGR or group lasso, we may calculate the distance from the new test data y to the projected point in the subspace spanned by class C_i:

d_i(A, w*|C_i) = d_i(A|C_i, w*) = ||y − Aw*|C_i||_2,   (9)

where w*|C_i represents restricting w* to the ith class C_i; that is, (w*|C_i)_j = w*_j 1(x_j ∈ C_i), where 1(·) is an indicator function; similarly, A|C_i represents restricting A to the ith class C_i.

A decision rule may be obtained by choosing the class with the minimum distance:

î = argmin_{i∈{1,···,M}} {d_i}.   (10)

Based on these notations, we now have the following generalization bound for the SGR in the presence of class noise in the training data.

Theorem 4. All examples are standardized to zero mean and unit variance. For an arbitrary class C_i of N examples, let p (p < 0.5) percent (the fault level) of the labels be misclassified into a class C_k ≠ C_i. We assume w is a sparse grouping representation for any test example y, and ρ_ij > δ (δ as in Definition 3) for any two examples. Under the distance function d(A|C_i, w) = d(A, w|C_i) = ||y − Aw|C_i||_2 and f′_j = w for all j, we have a confidence threshold τ for giving the correct estimate î for y, where

τ ≤ (1 − p) × N × (w_0)^2 / d,

where w_0 is a constant and the confidence threshold is defined as τ = d_i(A|C_i) − d_i(A|C_k).

Sketch of the proof: Assume y is in class C_i. The correctly labeled (mislabeled, respectively) subset of C_i is C_i^1
(C_i^2, respectively), and the size of C_i^1 is larger than that of C_i^2. We use A_1w to denote Aw|C_i^1 and A_2w to denote Aw|C_i^2. By the triangle inequality, we have

τ = ||y − Aw|C_i^1||_2 − ||y − Aw|C_i^2||_2 ≤ ||A_1w − A_2w||_2.

For each k ∈ C_i^1, we differentiate with respect to w_k and follow the same procedure as in the proof of Theorem 3. We then sum all the equalities for C_i^1, repeat the same procedure for each i ∈ C_i^2, and finally subtract the summation over C_i^2 from the summation over C_i^1. Using the conditions that w is a sparse grouping representation and ρ_ij > δ, and combining Definition 3, all w_k in class C_i should equal a constant w_0 while the others → 0. Taking the l2-norm of both sides, we have ||A_1w − A_2w||_2 ≤ (1 − p)N(w_0)^2 / d.

This theorem gives an upper bound on the fault tolerance against class noise. By this theorem, we can see that the class noise must be smaller than a certain value to guarantee a given fault correction confidence level τ.

5 Experimental Verification

In this section, we compare several methods on a challenging low-resolution face recognition task (multi-class classification) in the presence of class noise. We use the Yale database [4], which consists of 165 gray scale images of 15 individuals (each person is a class). There are 11 images per subject, one per facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink. Starting from the original 64 × 64 images, all images are down-sampled to a dimension of 49.
A training/test data set is generated by uniformly selecting 8 images per individual to form the training set; the rest of the database is used as the test set. This procedure is repeated to generate five random split copies of training/test data sets. Five class noise levels are tested. Class noise level = p means that p percent of the labels (uniformly drawn from all labels of each class) are mislabeled for each class.

For SVM, we use the standard implementation of multi-class (one-vs-all) LibSVM in MatlabArsenal¹. For the lasso based SRC, we use the CVX software [13, 14] to solve the corresponding convex optimization problems. The group lasso based classifier is implemented in the same way as the SRC. We use a clipped λl1 + (1 − λ)l2 as an illustrative example of the SGR, and the corresponding classifier is denoted as SGRC. For the lasso, group lasso and SGR based classifiers, we run through λ ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.2} and report the best results for each classifier. Figure 1(b) shows the parameter range of λ that is appropriate for the lasso, group lasso and SGR based classifiers. Figure 1(a) shows that the SGR based classifier is more robust than the lasso or group lasso based classifiers with respect to class noise. These results verify that in a novel application where class noise exists in the training data, the SGR is more suitable than group lasso for generating group level sparsity.

6 Conclusion

Towards a better understanding of various regularized procedures in robust linear regression, we introduce a robust minimax framework which considers both additive and multiplicative noise or distortions.
Within this unified framework, various regularization terms correspond to different distortions to the original data matrix. We further investigate a novel sparse grouping representation (SGR) and prove sufficient conditions for generating such group level sparsity. We also provide a generalization bound for the SGR. In a novel classification application where class noise exists in the training examples, we show that the SGR is more robust than group lasso. SCAD and the clipped elastic net are special instances of the SGR.

¹A Matlab classification package, which can be downloaded from http://www.informedia.cs.cmu.edu/yanrong/MATLABArsenal/MATLABArsenal.htm.

Figure 1: (a) Comparison of SVM, SRC (lasso), SGRC and group lasso based classifiers on the low resolution Yale face database. At each level of class noise, the error rate is averaged over five copies of training/test datasets for each classifier. For each classifier, the variance bars at each class noise level are plotted. (b) Illustration of the paths for SRC (lasso), SGRC and group lasso. λ is the weight of the regularization term. All data points are averaged over five copies with the same class noise level of 0.2.

References

[1] A. Antoniadis and J. Fan. Regularization of wavelet approximations. J.
the American Statistical Association, 96:939–967, 2001.

[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179–1225, 2008.

[3] F. Bach, G. R. G. Lanckriet, and M. I. Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In Proceedings of the Twenty-first International Conference on Machine Learning, 2004.

[4] P. N. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intelligence, 19(7):711–720, 1997.

[5] L. Breiman. Heuristics of instability and stabilization in model selection. Ann. Statist., 24:2350–2383, 1996.

[6] E. Candès, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Comm. on Pure and Applied Math, 59(8):1207–1233, 2006.

[7] E. Candès and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Information Theory, 52(12):5406–5425, 2006.

[8] D. Donoho. For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution. Comm. on Pure and Applied Math, 59(6):797–829, 2006.

[9] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Ass., 96:1348–1360, 2001.

[10] I. Frank and J. Friedman. A statistical view of some chemometrics regression tools. Technometrics, 35:109–148, 1993.

[11] L. El Ghaoui and H. Lebret. Robust solutions to least-squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18:1035–1064, 1997.

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

[13] M. Grant and S. Boyd.
Graph implementations for nonsmooth convex programs. In Recent Advances in Learning and Control, Lecture Notes in Control and Information Sciences, pages 95–110, 2008.

[14] M. Grant and S. Boyd. CVX: Matlab software for disciplined convex programming, 2009.

[15] A. Hoerl and R. Kennard. Ridge regression. Encyclopedia of Statistical Sciences, 8:129–136, 1988.

[16] L. Jacob, G. Obozinski, and J.-P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the Twenty-sixth International Conference on Machine Learning, pages 433–440, 2009.

[17] J. Maletic and A. Marcus. Data cleansing: Beyond integrity analysis. In Proceedings of the Conference on Information Quality, 2000.

[18] K. Orr. Data quality and systems theory. Communications of the ACM, 41(2):66–71, 1998.

[19] R. Tibshirani. Regression shrinkage and selection via the lasso. J. R. Statist. Soc. B, 58:267–288, 1996.

[20] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. J. R. Statist. Soc. B, 67:91–108, 2005.

[21] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma. Robust face recognition via sparse representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 210–227, 2009.

[22] X. Wu. Knowledge Acquisition from Databases. Ablex Publishing Corp, Greenwich, CT, USA, 1995.

[23] H. Xu, C. Caramanis, and S. Mannor. Robust regression and lasso. In NIPS, 2008.

[24] K. Yu, T. Zhang, and Y. Gong. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems, volume 22, 2009.

[25] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society Series B, 68(1):49–67, 2006.

[26] X. Zhu, X. Wu, and S. Chen. Eliminating class noise in large datasets.
In Proceedings of the 20th International Conference on Machine Learning (ICML), Washington D.C., USA, 2003.

[27] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. J. R. Statist. Soc. B, 67(2):301–320, 2005.
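To make the class-noise protocol of the experiments concrete, the following is a minimal sketch of injecting a class noise level of p (here p is a fraction, e.g. 0.5 for 50 percent): within each class, p percent of the examples are drawn uniformly and relabeled with a different, uniformly chosen class. This is our illustration, not the authors' code; the function name and setup are ours.

```python
import numpy as np

def inject_class_noise(labels, p, seed=None):
    """Mislabel a fraction p of the examples in each class.

    Corrupted indices are drawn uniformly within each class (using the
    clean labels), and each corrupted label is replaced by a different
    class chosen uniformly at random."""
    rng = np.random.default_rng(seed)
    orig = np.asarray(labels)
    noisy = orig.copy()
    classes = np.unique(orig)
    for c in classes:
        idx = np.flatnonzero(orig == c)       # members of class c in the clean labels
        n_flip = int(round(p * idx.size))     # number of labels to corrupt in this class
        flip = rng.choice(idx, size=n_flip, replace=False)
        for i in flip:
            noisy[i] = rng.choice(classes[classes != c])  # any class but the true one
    return noisy

# Yale-style training set: 15 individuals, 8 training images each
clean = np.repeat(np.arange(15), 8)
noisy = inject_class_noise(clean, 0.5, seed=0)
```

With p = 0.5 and 8 training images per individual, exactly 4 labels per class are flipped, matching the "uniformly drawn from all labels of each class" description above.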