{"title": "Generalized Dantzig Selector: Application to the k-support norm", "book": "Advances in Neural Information Processing Systems", "page_first": 1934, "page_last": 1942, "abstract": "We propose a Generalized Dantzig Selector (GDS) for linear models, in which any norm encoding the parameter structure can be leveraged for estimation. We investigate both computational and statistical aspects of the GDS. Based on conjugate proximal operator, a flexible inexact ADMM framework is designed for solving GDS. Thereafter, non-asymptotic high-probability bounds are established on the estimation error, which rely on Gaussian widths of the unit norm ball and the error set. Further, we consider a non-trivial example of the GDS using k-support norm. We derive an efficient method to compute the proximal operator for k-support norm since existing methods are inapplicable in this setting. For statistical analysis, we provide upper bounds for the Gaussian widths needed in the GDS analysis, yielding the first statistical recovery guarantee for estimation with the k-support norm. The experimental results confirm our theoretical analysis.", "full_text": "Generalized Dantzig Selector:\n\nApplication to the k-support norm\n\nSoumyadeep Chatterjee\u2217\n\nSheng Chen\u2217\n\nArindam Banerjee\n\nDept. of Computer Science & Engg.\nUniversity of Minnesota, Twin Cities\n\n{chatter,shengc,banerjee}@cs.umn.edu\n\nAbstract\n\nWe propose a Generalized Dantzig Selector (GDS) for linear models, in which any\nnorm encoding the parameter structure can be leveraged for estimation. We inves-\ntigate both computational and statistical aspects of the GDS. Based on conjugate\nproximal operator, a \ufb02exible inexact ADMM framework is designed for solving\nGDS. Thereafter, non-asymptotic high-probability bounds are established on the\nestimation error, which rely on Gaussian widths of the unit norm ball and the error\nset. 
Further, we consider a non-trivial example of the GDS using k-support norm.\nWe derive an efficient method to compute the proximal operator for k-support\nnorm since existing methods are inapplicable in this setting. For statistical\nanalysis, we provide upper bounds for the Gaussian widths needed in the GDS analysis,\nyielding the first statistical recovery guarantee for estimation with the k-support\nnorm. The experimental results confirm our theoretical analysis.\n\n1 Introduction\n\nThe Dantzig Selector (DS) [3, 5] provides an alternative to regularized regression approaches such as\nLasso [19, 22] for sparse estimation. While DS does not consider a regularized maximum likelihood\napproach, [3] has established clear similarities between the estimates from DS and Lasso. While\nnorm regularized regression approaches have been generalized to more general norms [14, 2], the\nliterature on DS has primarily focused on the sparse L1 norm case, with a few notable exceptions\nwhich have considered extensions to sparse group-structured norms [11].\nIn this paper, we consider linear models of the form y = X\u03b8\u2217 + w, where y \u2208 R^n is a set of\nobservations, X \u2208 R^{n\u00d7p} is a design matrix with i.i.d. standard Gaussian entries, and w \u2208 R^n\nis i.i.d. standard Gaussian noise. For any given norm R(\u00b7), the parameter \u03b8\u2217 is assumed to be\nstructured in terms of having a low value of R(\u03b8\u2217). For this setting, we propose the following\nGeneralized Dantzig Selector (GDS) for parameter estimation:\n\n$$\\hat{\\theta} = \\operatorname*{argmin}_{\\theta \\in \\mathbb{R}^p} R(\\theta) \\quad \\text{s.t.} \\quad R^*\\big(X^T (y - X\\theta)\\big) \\le \\lambda_p , \\tag{1}$$\n\nwhere R\u2217(\u00b7) is the dual norm of R(\u00b7), and \u03bb_p is a suitable constant. If R(\u00b7) is the L1 norm, (1)\nreduces to standard DS [5]. A key novel aspect of GDS is that the constraint is in terms of the dual\nnorm R\u2217(\u00b7) of the original structure inducing norm R(\u00b7).\n
It is instructive to contrast GDS with\nthe recently proposed atomic norm based estimation framework [6] which, unlike GDS, considers\nconstraints based on the L2 norm of the error \u2016y \u2212 X\u03b8\u2016_2.\nIn this paper, we consider both computational and statistical aspects of the GDS. For the L1-norm\nDantzig selector, [5] proposed a primal-dual interior point method since the optimization is a linear\nprogram. DASSO and its generalization proposed in [10, 9] focused on homotopy methods, which\nprovide a piecewise linear solution path through a sequential simplex-like algorithm. However, none\nof the algorithms above can be immediately extended to our general formulation. In recent work,\nthe Alternating Direction Method of Multipliers (ADMM) has been applied to the L1-norm Dantzig\nselection problem [12, 21], and the linearized version in [21] proved to be efficient. Motivated\nby such results for DS, we propose a general inexact ADMM [20] framework for GDS where the\nprimal update steps, interestingly, turn out respectively to be proximal updates involving R(\u03b8) and\nits convex conjugate, the indicator of R\u2217(x) \u2264 \u03bb_p. As a result, by Moreau decomposition, it suffices\nto develop an efficient proximal update for either R(\u03b8) or its conjugate. On the statistical side, we\nestablish non-asymptotic high-probability bounds on the estimation error \u2016\u03b8\u0302 \u2212 \u03b8\u2217\u2016_2. Interestingly,\nthe bound depends on the Gaussian width of the unit norm ball of R(\u00b7) as well as the Gaussian width\nof the intersection of the error cone and the unit sphere [6, 16].\n\n\u2217Both authors contributed equally.\n\nAs a non-trivial example of the GDS framework, we consider estimation using the recently proposed\nk-support norm [1, 13]. We show that proximal operators for k-support norm can be efficiently\ncomputed in O(p log p + log k log(p \u2212 k)), and hence the estimation can be done efficiently.\n
Note\nthat existing work [1, 13] on k-support norm has focused on the proximal operator for the square of\nthe k-support norm, which is not directly applicable in our setting. On the statistical side, we provide\nupper bounds for the Gaussian widths of the unit norm ball and the error cone as needed in the GDS\nframework, yielding the first statistical recovery guarantee for estimation with the k-support norm.\nThe rest of the paper is organized as follows: We establish general optimization and statistical\nrecovery results for GDS for any norm in Section 2. In Section 3, we present efficient algorithms\nand estimation error bounds for the k-support norm. We present experimental results in Section 4\nand conclude in Section 5. All technical analysis and proofs can be found in [7].\n\n2 General Optimization and Statistical Recovery Guarantees\n\nThe problem in (1) is a convex program, and a suitable choice of \u03bb_p ensures that the feasible set\nis not empty. We start the section with an inexact ADMM framework for solving problems of the\nform (1), and then present bounds on the estimation error establishing statistical consistency of GDS.\n\n2.1 General Optimization Framework using Inexact ADMM\n\nFor convenience, we temporarily drop the subscript p of \u03bb_p. We let A = X^T X, b = X^T y, and\ndefine the set C_\u03bb = {v : R\u2217(v) \u2264 \u03bb}. The optimization problem is equivalent to\n\n$$\\min_{\\theta, v} R(\\theta) \\quad \\text{s.t.} \\quad b - A\\theta = v, \\; v \\in C_\\lambda . \\tag{2}$$\n\nDue to the nonsmoothness of both R and R\u2217, solving (2) can be quite challenging and a generally\napplicable algorithm is the Alternating Direction Method of Multipliers (ADMM) [4]. The augmented\nLagrangian function for (2) is given as\n\n$$L_R(\\theta, v, z) = R(\\theta) + \\langle z, A\\theta + v - b \\rangle + \\frac{\\rho}{2} \\| A\\theta + v - b \\|_2^2 , \\tag{3}$$\n\nwhere z is the Lagrange multiplier and \u03c1 controls the penalty introduced by the quadratic term.\n
The\niterative updates of the variables (\u03b8, v, z) in standard ADMM are given by\n\n$$\\theta^{k+1} \\leftarrow \\operatorname*{argmin}_{\\theta} L_R(\\theta, v^k, z^k) , \\tag{4}$$\n$$v^{k+1} \\leftarrow \\operatorname*{argmin}_{v \\in C_\\lambda} L_R(\\theta^{k+1}, v, z^k) , \\tag{5}$$\n$$z^{k+1} \\leftarrow z^k + \\rho (A\\theta^{k+1} + v^{k+1} - b) . \\tag{6}$$\n\nNote that update (4) amounts to a norm regularized least squares problem for \u03b8, which can be\ncomputationally expensive. Thus we use an inexact update for \u03b8 instead, which can alleviate the\ncomputational cost and lead to a quite simple algorithm. Inspired by [21, 20], we consider a simpler\nsubproblem for the \u03b8-update which minimizes\n\n$$\\tilde{L}_R^k(\\theta, v^k, z^k) = R(\\theta) + \\langle z^k, A\\theta + v^k - b \\rangle + \\frac{\\rho}{2} \\Big( \\| A\\theta^k + v^k - b \\|_2^2 + 2 \\big\\langle \\theta - \\theta^k, A^T (A\\theta^k + v^k - b) \\big\\rangle + \\frac{\\mu}{2} \\| \\theta - \\theta^k \\|_2^2 \\Big) , \\tag{7}$$\n\nwhere \u00b5 is a user-defined parameter. The function L\u0303_R^k(\u03b8, v^k, z^k) can be viewed as an approximation of\nL_R(\u03b8, v^k, z^k) with the quadratic term linearized at \u03b8^k. Then the update (4) is replaced by\n\n$$\\theta^{k+1} \\leftarrow \\operatorname*{argmin}_{\\theta} \\tilde{L}_R^k(\\theta, v^k, z^k) = \\operatorname*{argmin}_{\\theta} \\Big\\{ \\frac{2 R(\\theta)}{\\rho\\mu} + \\frac{1}{2} \\Big\\| \\theta - \\Big( \\theta^k - \\frac{2}{\\mu} A^T \\big( A\\theta^k + v^k - b + \\frac{z^k}{\\rho} \\big) \\Big) \\Big\\|_2^2 \\Big\\} . \\tag{8}$$\n\nSimilarly the update of v in (5) can be recast as\n\n$$v^{k+1} \\leftarrow \\operatorname*{argmin}_{v \\in C_\\lambda} L_R(\\theta^{k+1}, v, z^k) = \\operatorname*{argmin}_{v \\in C_\\lambda} \\frac{1}{2} \\Big\\| v - \\big( b - A\\theta^{k+1} - \\frac{z^k}{\\rho} \\big) \\Big\\|_2^2 . \\tag{9}$$\n\nIn fact, the updates of both \u03b8 and v are to compute certain proximal operators [15]. In general, the\nproximal operator prox_h(\u00b7) of a closed proper convex function h : R^p \u2192 R \u222a {+\u221e} is defined as\n\n$$\\operatorname{prox}_h(x) = \\operatorname*{argmin}_{w \\in \\mathbb{R}^p} \\Big\\{ \\frac{1}{2} \\| w - x \\|_2^2 + h(w) \\Big\\} .$$\n\nHence it is easy to see that (8) and (9) correspond to prox_{2R/(\u03c1\u00b5)}(\u00b7) and prox_{I_{C_\u03bb}}(\u00b7), respectively, where\nI_{C_\u03bb}(\u00b7) is the indicator function of the set C_\u03bb given by\n\n$$I_{C_\\lambda}(x) = \\begin{cases} 0 & \\text{if } x \\in C_\\lambda \\\\ +\\infty & \\text{otherwise} \\end{cases} .$$\n\nAlgorithm 1 ADMM for Generalized Dantzig Selector\nInput: A = X^T X, b = X^T y, \u03c1, \u00b5\nOutput: Optimal \u03b8\u0302 of (1)\n1: Initialize (\u03b8, v, z)\n2: while not converged do\n3:   \u03b8^{k+1} \u2190 prox_{2R/(\u03c1\u00b5)}(\u03b8^k \u2212 (2/\u00b5) A^T (A\u03b8^k + v^k \u2212 b + z^k/\u03c1))\n4:   v^{k+1} \u2190 prox_{I_{C_\u03bb}}(b \u2212 A\u03b8^{k+1} \u2212 z^k/\u03c1)\n5:   z^{k+1} \u2190 z^k + \u03c1(A\u03b8^{k+1} + v^{k+1} \u2212 b)\n6: end while\n\nIn Algorithm 1, we provide our general ADMM for the GDS. For the ADMM to work, we need two\nsubroutines that can efficiently compute the proximal operators for the functions in Lines 3 and 4\nrespectively.\n
The simplicity of the proposed approach stems from the fact that we need only\none subroutine, for either one of the functions, since the functions are conjugates of each other.\n\nProposition 1 Given \u03b2 > 0 and a norm R(\u00b7), the two functions f(x) = \u03b2R(x) and g(x) = I_{C_\u03b2}(x)\nare convex conjugate to each other, thus giving the following identity,\n\n$$x = \\operatorname{prox}_f(x) + \\operatorname{prox}_g(x) . \\tag{10}$$\n\nProof: Proposition 1 follows directly from the definitions of the convex conjugate and the dual norm,\nand (10) is just the Moreau decomposition provided in [15].\n\nThe decomposition enables conversion of the two types of proximal operator to each other at\nnegligible cost (i.e., vector subtraction). Thus we have the flexibility in Algorithm 1 to focus on the\nproximal operator that is efficiently computable, and the other can be simply obtained through (10).\nRemark on convergence: Note that Algorithm 1 is a special case of the inexact Bregman ADMM\nproposed in [20], which matches the case of linearizing the quadratic penalty term by using\nB_{\u03d5\u2032_\u03b8}(\u03b8, \u03b8^k) = (1/2)\u2016\u03b8 \u2212 \u03b8^k\u2016_2^2 as the Bregman divergence. In order to converge, the algorithm requires \u00b5/2 to be larger than\nthe spectral radius of A^T A, and the convergence rate is O(1/T) according to Theorem 2 in [20].\n\n2.2 Statistical Recovery for Generalized Dantzig Selector\n\nOur goal is to provide non-asymptotic bounds on \u2016\u03b8\u0302 \u2212 \u03b8\u2217\u2016_2 between the true parameter \u03b8\u2217 and\nthe minimizer \u03b8\u0302 of (1). Let the error vector be defined as \u2206\u0302 = \u03b8\u0302 \u2212 \u03b8\u2217. For any set \u2126 \u2286 R^p, we\nmeasure the size of this set using its Gaussian width [17, 6], which is defined as \u03c9(\u2126) =\nE_g[sup_{z\u2208\u2126} \u27e8g, z\u27e9], where g is a vector of i.i.d. standard Gaussian entries.\n
We also consider the\nerror cone T_R(\u03b8\u2217), generated by the set of possible error vectors \u2206 and containing \u2206\u0302, defined as\n\n$$T_R(\\theta^*) := \\operatorname{cone}\\{ \\Delta \\in \\mathbb{R}^p : R(\\theta^* + \\Delta) \\le R(\\theta^*) \\} . \\tag{11}$$\n\nNote that this set contains a restricted set of directions and does not in general span the entire space\nof R^p. With these definitions, we obtain our main result.\n\nTheorem 1 Suppose the design matrix X consists of i.i.d. Gaussian entries with zero mean and variance\n1, and we solve the optimization problem (1) with\n\n$$\\lambda_p \\ge c \\, E\\big[ R^*(X^T w) \\big] . \\tag{12}$$\n\nThen, with probability at least (1 \u2212 \u03b7_1 exp(\u2212\u03b7_2 n)), we have\n\n$$\\| \\hat{\\theta} - \\theta^* \\|_2 \\le \\frac{4 c \\Psi_R \\, \\omega(\\Omega_R)}{\\kappa_L \\sqrt{n}} , \\tag{13}$$\n\nwhere \u03c9(T_R(\u03b8\u2217) \u2229 S^{p\u22121}) is the Gaussian width of the intersection of T_R(\u03b8\u2217) and the unit spherical\nshell S^{p\u22121}, \u03c9(\u2126_R) is the Gaussian width of the unit norm ball, \u03ba_L > 0 is the gain given by\n\n$$\\kappa_L = \\frac{1}{n} \\big( \\ell_n - \\omega(T_R(\\theta^*) \\cap S^{p-1}) \\big)^2 , \\tag{14}$$\n\n\u03a8_R = sup_{\u2206\u2208T_R} R(\u2206)/\u2016\u2206\u2016_2 is a norm compatibility factor, \u2113_n is the expected length of a length n\ni.i.d. standard Gaussian vector with n/\u221a(n+1) < \u2113_n < \u221an, and c > 1, \u03b7_1, \u03b7_2 > 0 are constants.\n\nRemark: The choice of \u03bb_p is also intimately connected to the notion of Gaussian width. Note that\nX^T w/\u2016w\u2016_2 = z is an i.i.d. standard Gaussian vector for any w. Therefore the right hand side of (12)\ncan be written as\n\n$$E\\big[ R^*(X^T w) \\big] = E\\Big[ R^*\\Big( \\frac{X^T w}{\\|w\\|_2} \\Big) \\|w\\|_2 \\Big] = E\\Big[ \\sup_{u : R(u) \\le 1} \\langle u, z \\rangle \\Big] E\\big[ \\|w\\|_2 \\big] \\le \\sqrt{n} \\cdot \\omega(\\{ u : R(u) \\le 1 \\}) ,$$\n\nwhich is the Gaussian width of the unit ball of the norm R(\u00b7) scaled up by a factor of \u221an.\nExample: L1-norm Dantzig Selector When R(\u00b7) is chosen to be L1 norm, the dual norm is the\nL\u221e norm, and (1) is reduced to the standard DS, given by\n\n$$\\hat{\\theta} = \\operatorname*{argmin}_{\\theta \\in \\mathbb{R}^p} \\|\\theta\\|_1 \\quad \\text{s.t.} \\quad \\| X^T (y - X\\theta) \\|_\\infty \\le \\lambda . \\tag{15}$$\n\nWe know that prox_{\u03b2\u2016\u00b7\u2016_1}(\u00b7) is given by the elementwise soft-thresholding operation\n\n$$\\big[ \\operatorname{prox}_{\\beta \\|\\cdot\\|_1}(x) \\big]_i = \\operatorname{sign}(x_i) \\cdot \\max(0, |x_i| - \\beta) . \\tag{16}$$\n\nBased on Proposition 1, the ADMM updates in Algorithm 1 can be instantiated as\n\n$$\\theta^{k+1} \\leftarrow \\operatorname{prox}_{\\frac{2 \\|\\cdot\\|_1}{\\rho\\mu}} \\Big( \\theta^k - \\frac{2}{\\mu} A^T \\big( A\\theta^k + v^k - b + \\frac{z^k}{\\rho} \\big) \\Big) ,$$\n$$v^{k+1} \\leftarrow \\big( b - A\\theta^{k+1} - \\frac{z^k}{\\rho} \\big) - \\operatorname{prox}_{\\lambda \\|\\cdot\\|_1} \\big( b - A\\theta^{k+1} - \\frac{z^k}{\\rho} \\big) ,$$\n$$z^{k+1} \\leftarrow z^k + \\rho (A\\theta^{k+1} + v^{k+1} - b) ,$$\n\nwhere the update of v leverages the decomposition (10). Similar updates were used in [21] for\nL1-norm Dantzig selector.\n\nFor statistical recovery, we assume that \u03b8\u2217 is s-sparse, i.e., contains s non-zero entries, and that\n\u2016\u03b8\u2217\u2016_2 = 1, so that \u2016\u03b8\u2217\u2016_1 \u2264 s. It was shown in [6] that the Gaussian width of the set (T_{L1}(\u03b8\u2217) \u2229\nS^{p\u22121}) is upper bounded as \u03c9(T_{L1}(\u03b8\u2217) \u2229 S^{p\u22121})^2 \u2264 2s log(p/s) + (5/4)s. Also note that E[R\u2217(X^T w)] =\nE[\u2016w\u2016_2] E[\u2016g\u2016_\u221e] \u2264 \u221a(n log p), where g is a vector of i.i.d. standard Gaussian entries [5]. Further,\n[14] has shown that \u03a8_R = \u221as. Therefore, if we solve (15) with \u03bb_p = c\u221a(n log p), then\n\n$$\\| \\hat{\\theta} - \\theta^* \\|_2 \\le \\frac{4 c \\sqrt{s \\log p}}{\\kappa_L \\sqrt{n}} = O\\Big( \\sqrt{\\frac{s \\log p}{n}} \\Big) \\tag{17}$$\n\nwith high probability, which agrees with known results for DS [3, 5].\n\n3 Dantzig Selection with k-support norm\nWe first introduce some notations. Given any \u03b8 \u2208 R^p, let |\u03b8| denote its absolute-valued counterpart\nand \u03b8\u2193 denote the permutation of \u03b8 with its elements arranged in decreasing order. In previous\nwork [1, 13], the k-support norm has been defined as\n\n$$\\|\\theta\\|_k^{sp} = \\min \\Big\\{ \\sum_{I \\in \\mathcal{G}^{(k)}} \\|v_I\\|_2 \\; : \\; \\operatorname{supp}(v_I) \\subseteq I, \\; \\sum_{I \\in \\mathcal{G}^{(k)}} v_I = \\theta \\Big\\} , \\tag{18}$$\n\nwhere G^{(k)} denotes the set of subsets of {1, . . . , p} of cardinality at most k. The unit ball of this\nnorm is the set C_k = conv{\u03b8 \u2208 R^p : \u2016\u03b8\u2016_0 \u2264 k, \u2016\u03b8\u2016_2 \u2264 1}. The dual norm of the k-support norm\nis given by\n\n$$\\|\\theta\\|_k^{sp*} = \\max \\big\\{ \\|\\theta_G\\|_2 : G \\in \\mathcal{G}^{(k)} \\big\\} = \\Big( \\sum_{i=1}^{k} \\big( |\\theta|^\\downarrow_i \\big)^2 \\Big)^{\\frac{1}{2}} . \\tag{19}$$\n\nNote that k = 1 gives the L1 norm and its dual norm is L\u221e norm. The k-support norm was\nproposed in order to overcome some of the empirical shortcomings of the elastic net [23] and the\n(group)-sparse regularizers. It was shown in [1] to behave similarly as the elastic net in the sense\nthat the unit norm ball of the k-support norm is within a constant factor of \u221a2 of the unit elastic net\nball.\n
Although multiple papers have reported good empirical performance of the k-support norm on\nselecting correlated features, where L1 regularization fails, there exists no statistical analysis of the\nk-support norm. Besides, current computational methods consider the square of the k-support norm in their\nformulation, which might fail to work out in certain cases.\nIn the rest of this section, we focus on GDS with R(\u03b8) = \u2016\u03b8\u2016_k^{sp} given as\n\n$$\\hat{\\theta} = \\operatorname*{argmin}_{\\theta \\in \\mathbb{R}^p} \\|\\theta\\|_k^{sp} \\quad \\text{s.t.} \\quad \\| X^T (y - X\\theta) \\|_k^{sp*} \\le \\lambda_p . \\tag{20}$$\n\nFor the indicator function I_{C_\u03bb}(\u00b7) of the dual norm, we present a fast algorithm for computing its\nproximal operator by exploiting the structure of its solution, which can be directly plugged in\nAlgorithm 1 to solve (20). Further, we prove statistical recovery bounds for k-support norm Dantzig\nselection, which hold even for a high-dimensional scenario, where n < p.\n\n3.1 Computation of Proximal Operator\n\nIn order to solve (20), either prox_{\u03bb\u2016\u00b7\u2016_k^{sp}}(\u00b7) or prox_{I_{C_\u03bb}}(\u00b7) for \u2016\u00b7\u2016_k^{sp\u2217} should be efficiently\ncomputable. Existing methods [1, 13] are inapplicable to our scenario since they compute the proximal\noperator for squared k-support norm, from which prox_{I_{C_\u03bb}}(\u00b7) cannot be directly obtained. In\nTheorem 2, we show that prox_{I_{C_\u03bb}}(\u00b7) can be efficiently computed, and thus Algorithm 1 is applicable.\n\nTheorem 2 Given \u03bb > 0 and x \u2208 R^p, if \u2016x\u2016_k^{sp\u2217} \u2264 \u03bb, then w\u2217 = prox_{I_{C_\u03bb}}(x) = x. If \u2016x\u2016_k^{sp\u2217} > \u03bb,\ndefine A_{sr} = \u2211_{i=s+1}^{r} |x|\u2193_i and B_s = \u2211_{i=1}^{s} (|x|\u2193_i)^2, in which 0 \u2264 s < k and k \u2264 r \u2264 p, and construct\nthe nonlinear equation of \u03b2,\n\n$$(k - s) A_{sr}^2 \\Big[ \\frac{1 + \\beta}{r - s + (k - s)\\beta} \\Big]^2 - \\lambda^2 (1 + \\beta)^2 + B_s = 0 . \\tag{21}$$\n\nLet \u03b2_{sr} be given by\n\n$$\\beta_{sr} = \\begin{cases} \\text{nonnegative root of (21)} & \\text{if } s > 0 \\text{ and the root exists} \\\\ 0 & \\text{otherwise} \\end{cases} . \\tag{22}$$\n\nThen the proximal operator w\u2217 = prox_{I_{C_\u03bb}}(x) is given by\n\n$$|w^*|^\\downarrow_i = \\begin{cases} \\frac{1}{1 + \\beta_{s^* r^*}} |x|^\\downarrow_i & \\text{if } 1 \\le i \\le s^* \\\\ \\sqrt{\\frac{\\lambda^2 - B_{s^*}}{k - s^*}} & \\text{if } s^* < i \\le r^* \\text{ and } \\beta_{s^* r^*} = 0 \\\\ \\frac{A_{s^* r^*}}{r^* - s^* + (k - s^*) \\beta_{s^* r^*}} & \\text{if } s^* < i \\le r^* \\text{ and } \\beta_{s^* r^*} > 0 \\\\ |x|^\\downarrow_i & \\text{if } r^* < i \\le p \\end{cases} , \\tag{23}$$\n\nwhere the indices s\u2217 and r\u2217 with computed |w\u2217|\u2193 satisfy the following two inequalities:\n\n$$|w^*|^\\downarrow_{s^*} > |w^*|^\\downarrow_k , \\tag{24}$$\n$$|x|^\\downarrow_{r^*+1} \\le |w^*|^\\downarrow_k < |x|^\\downarrow_{r^*} . \\tag{25}$$\n\nThere might be multiple pairs of (s, r) satisfying the inequalities (24)-(25), and we choose the pair\nwith the smallest \u2016|x|\u2193 \u2212 |w|\u2193\u2016_2. Finally, w\u2217 is obtained by sign-changing and reordering |w\u2217|\u2193 to\nconform to x.\n\nRemark: The nonlinear equation (21) is quartic, for which we can use the general formula to get all the\nroots [18]. In addition, if it exists, the nonnegative root is unique, as shown in the proof [7].\nTheorem 2 indicates that computing prox_{I_{C_\u03bb}}(\u00b7) requires sorting of entries in |x| and a\ntwo-dimensional search of s\u2217 and r\u2217. Hence the total time complexity is O(p log p + k(p \u2212 k)).\n
However,\na more careful observation can reduce the search complexity from O(k(p \u2212 k)) to\nO(log k log(p \u2212 k)), which is motivated by Theorem 3.\nTheorem 3 In search of (s\u2217, r\u2217) defined in Theorem 2, there can be only one r\u0303 for a given candidate\ns\u0303 of s\u2217, such that the inequality (25) is satisfied. Moreover if such r\u0303 exists, then for any r < r\u0303, the\nassociated |w\u0303|\u2193_k violates the first part of (25), and for r > r\u0303, |w\u0303|\u2193_k violates the second part of (25).\nOn the other hand, based on the r\u0303, we have the following assertion of s\u2217,\n\n$$s^* \\begin{cases} > \\tilde{s} & \\text{if } \\tilde{r} \\text{ does not exist} \\\\ \\ge \\tilde{s} & \\text{if } \\tilde{r} \\text{ exists and corresponding } |\\tilde{w}|^\\downarrow_k \\text{ satisfies (24)} \\\\ < \\tilde{s} & \\text{if } \\tilde{r} \\text{ exists but corresponding } |\\tilde{w}|^\\downarrow_k \\text{ violates (24)} \\end{cases} . \\tag{26}$$\n\nBased on Theorem 3, the accelerated search procedure for finding (s\u2217, r\u2217) is to execute a\ntwo-dimensional binary search, and Algorithm 2 gives the details. Therefore the total time complexity\nbecomes O(p log p + log k log(p \u2212 k)). Compared with previous proximal operators for squared\nk-support norm, this complexity is better than that in [1], and roughly the same as the one in [13].\n\n3.2 Statistical Recovery Guarantees for k-support norm\n\nThe analysis of the generalized Dantzig Selector for k-support norm consists of addressing two key\nchallenges. First, note that Theorem 1 requires an appropriate choice of \u03bb_p. Second, one needs\nto compute the Gaussian width of the subset of the error set T_R(\u03b8\u2217) \u2229 S^{p\u22121}. For the k-support\nnorm, we can get upper bounds to both of these quantities. We start by defining some notation. Let\nG\u2217 \u2286 G^{(k)} be the set of groups intersecting with the support of \u03b8\u2217, and let S be the union of groups\nin G\u2217, such that s = |S|.\n
Then, we have the following bounds which are used for choosing \u03bb_p, and\nbounding the Gaussian width.\nTheorem 4 For the k-support norm Generalized Dantzig Selection problem (20), we obtain\n\n$$E\\big[ R^*(X^T w) \\big] \\le \\sqrt{n} \\Big( \\sqrt{2k \\log\\Big( \\frac{pe}{k} \\Big)} + \\sqrt{k} \\Big) , \\tag{27}$$\n$$\\omega(T_A(\\theta^*) \\cap S^{p-1})^2 \\le \\Big( \\sqrt{2k \\log\\Big( p - k - \\Big\\lceil \\frac{s}{k} \\Big\\rceil + 2 \\Big)} + \\sqrt{k} \\Big)^2 \\cdot \\Big\\lceil \\frac{s}{k} \\Big\\rceil + s . \\tag{28}$$\n\nAlgorithm 2 Algorithm for computing prox_{I_{C_\u03bb}}(\u00b7) of \u2016\u00b7\u2016_k^{sp\u2217}\nInput: x, k, \u03bb\nOutput: w\u2217 = prox_{I_{C_\u03bb}}(x)\n1: if \u2016x\u2016_k^{sp\u2217} \u2264 \u03bb then\n2:   w\u2217 := x\n3: else\n4:   l := 0, u := k \u2212 1, and sort |x| to get |x|\u2193\n5:   while l \u2264 u do\n6:     s\u0303 := \u230a(l + u)/2\u230b, and binary search for r\u0303 that satisfies (25) and compute w\u0303 based on (23)\n7:     if r\u0303 does not exist then\n8:       l := s\u0303 + 1\n9:     else if r\u0303 exists and (24) is satisfied then\n10:      w\u2217 := w\u0303, l := s\u0303 + 1\n11:    else if r\u0303 exists but (24) is not satisfied then\n12:      u := s\u0303 \u2212 1\n13:    end if\n14:  end while\n15: end if\n\nOur analysis technique for these bounds follows [16]. Similar results were obtained in [8] in the\ncontext of calculating norms of Gaussian vectors, and our bounds are of the same order as those\nof [8]. Choosing \u03bb_p = \u221an (\u221a(2k log(pe/k)) + \u221ak), and under the assumptions of Theorem 1, we\nobtain the following result on the error bound for the minimizer of (20).\n\nCorollary 1 Suppose that we obtain i.i.d. Gaussian measurements X, and the noise w is i.i.d.\nN(0, 1).\n
If we solve (20) with \u03bb_p chosen as above, then, with high probability, we obtain\n\n$$\\| \\hat{\\theta} - \\theta^* \\|_2 \\le \\frac{4 c \\Psi_R \\Big( \\sqrt{2k \\log\\big( \\frac{pe}{k} \\big)} + \\sqrt{k} \\Big)}{\\kappa_L \\sqrt{n}} . \\tag{29}$$\n\nRemark The error bound provides a natural interpretation for the two special cases of the k-support\nnorm, viz. k = 1 and k = p. First, for k = 1 the k-support norm is exactly the same as the L1 norm,\nand the error bound obtained will be O(\u221a(s log p / n)), the same as known results of DS, and shown in\nSection 2.2. Second, for k = p, the k-support norm is equal to the L2 norm, and the error cone (11)\nis then simply a half space (there is no structural constraint) and the error bound scales as O(\u221a(p/n)).\n\n4 Experimental Results\nOn the optimization side, we focus on the efficiency of different proximal operators related to the\nk-support norm. On the statistical side, we concentrate on the behavior and performance of GDS with\nk-support norm. All experiments are implemented in MATLAB.\n\n4.1 Efficiency of Proximal Operator\nWe tested four proximal operators related to k-support norm, which are the normal prox_{I_{C_\u03bb}}(\u00b7) in\nTheorem 2 and the accelerated one in Theorem 3, prox_{(1/(2\u03b2))(\u2016\u00b7\u2016_k^{sp})^2}(\u00b7) in [1], and prox_{(\u03bb/2)\u2016\u00b7\u2016_\u0398^2}(\u00b7) in [13].\nThe dimension p of vector varied from 1000 to 10000, and the ratio p/k = {200, 100, 50, 20}. As\nillustrated in Figure 1, the speedup of accelerated prox_{I_{C_\u03bb}}(\u00b7) is considerable compared with the\nnormal one and prox_{(1/(2\u03b2))(\u2016\u00b7\u2016_k^{sp})^2}(\u00b7). It is also slightly better than the prox_{(\u03bb/2)\u2016\u00b7\u2016_\u0398^2}(\u00b7).\n\nFigure 1: Efficiency of proximal operators. Diamond: prox_{I_{C_\u03bb}}(\u00b7) in Theorem 2, Square: prox_{(1/(2\u03b2))(\u2016\u00b7\u2016_k^{sp})^2}(\u00b7)\nin [1], Downward-pointing triangle: prox_{(\u03bb/2)\u2016\u00b7\u2016_\u0398^2}(\u00b7) in [13], Upward-pointing triangle: accelerated prox_{I_{C_\u03bb}}(\u00b7)\nin Theorem 3. For each (p, k), 200 vectors are randomly generated for testing. Time is measured in seconds.\n\n4.2 Statistical Recovery on Synthetic Data\nData generation We fixed p = 600, and\n\n$$\\theta^* = ( \\underbrace{10, \\ldots, 10}_{10}, \\underbrace{10, \\ldots, 10}_{10}, \\underbrace{10, \\ldots, 10}_{10}, \\underbrace{0, 0, \\ldots, 0}_{570} )$$\n\nthroughout the experiment, in which nonzero entries were divided equally into three groups. The\ndesign matrix X was generated from a normal distribution such that the entries in the same group\nhave the same mean sampled from N(0, 1). X was normalized afterwards. The response vector y\nwas given by y = X\u03b8\u2217 + 0.01 \u00d7 N(0, 1). The number of samples n is specified later.\nROC curves with different k We fixed n = 400 to obtain the ROC plot for k = {1, 10, 50} as\nshown in Figure 2(a). \u03bb_p ranged from 10^{\u22122} to 10^3. Small k gets better ROC curve.\n\n(a) ROC curves\n\n(b) L2 error vs. n\n\n(c) L2 error vs. k\n\nFigure 2: (a) The true positive rate reaches 1 quite early for k = 1, 10. When k = 50, the ROC gets worse\ndue to the strong smoothing effect introduced by large k. (b) For each k, the L2 error is large when the sample\nis inadequate. As n increases, the error decreases dramatically for k = 1, 10 and becomes stable afterwards,\nwhile the decrease is not that significant for k = 50 and the error remains relatively large. (c) Both mean and\nstandard deviation of L2 error are decreasing as k increases until it exceeds the number of nonzero entries in\n\u03b8\u2217, and then the error goes up for larger k.\n\nL2 error vs. n We investigated how the L2 error \u2016\u03b8\u0302 \u2212 \u03b8\u2217\u2016_2 of Dantzig selector changes as the\nnumber of samples increases, where k = {1, 10, 50} and n = {30, 60, 90, . . . , 300}. k = 1, 10 give\nsmall errors when n is sufficiently large.\nL2 error vs. k We also looked at the L2 error with different k. We again fixed n = 400 and\nvaried k from 1 to 39. For each k, we repeated the experiment 100 times, and obtained the mean and\nstandard deviation plot in Figure 2(c). The result shows that the k-support-norm GDS with suitable\nk outperforms the L1-norm DS (i.e. k = 1) when correlated variables are present in data.\n\n5 Conclusions\nIn this paper, we introduced the GDS, which generalizes the standard L1-norm Dantzig selector to\nestimation with any norm, such that structural information encoded in the norm can be efficiently\nexploited. A flexible framework based on inexact ADMM is proposed for solving the GDS, which\nonly requires one of the conjugate proximal operators to be efficiently solved. Further, we provide a\nunified statistical analysis framework for the GDS, which utilizes Gaussian widths of certain\nrestricted sets for proving consistency. In the non-trivial example of k-support norm, we showed that\nthe proximal operators used in the inexact ADMM can be computed more efficiently compared to\npreviously proposed variants. Our statistical analysis for the k-support norm provides the first result\nof consistency of this structured norm.\n
Further, experimental results provided sound support to the\ntheoretical development in the paper.\nAcknowledgements\nThe research was supported by NSF grants IIS-1447566, IIS-1422557, CCF-1451986,\nCNS-1314560, IIS-0953274, IIS-1029711, and by NASA grant NNX12AQ39A.\n\n[Figure 1 panels: log(time) vs. p for p/k = 200, 100, 50, 20. Figure 2 panels: (a) TPR vs. FPR, (b) L2 error vs. n, (c) L2 error vs. k.]\n\nReferences\n\n[1] Andreas Argyriou, Rina Foygel, and Nathan Srebro. Sparse prediction with the k-support norm. In NIPS, pages 1466\u20131474, 2012.\n\n[2] Arindam Banerjee, Sheng Chen, Farideh Fazayeli, and Vidyashankar Sivakumar. Estimation with norm regularization. In NIPS, 2014.\n\n[3] Peter J Bickel, Ya\u2019acov Ritov, and Alexandre B Tsybakov. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705\u20131732, 2009.\n\n[4] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning, 3(1):1\u2013122, 2011.\n\n[5] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313\u20132351, December 2007.\n\n[6] Venkat Chandrasekaran, Benjamin Recht, Pablo A Parrilo, and Alan S Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12(6):805\u2013849, 2012.\n\n[7] Soumyadeep Chatterjee, Sheng Chen, and Arindam Banerjee.\n
Generalized Dantzig selector: Application to the k-support norm. ArXiv e-prints, June 2014.\n\n[8] Yehoram Gordon, Alexander E. Litvak, Shahar Mendelson, and Alain Pajor. Gaussian averages of interpolated bodies and applications to approximate reconstruction. Journal of Approximation Theory, 149(1):59\u201373, 2007.\n\n[9] Gareth M. James and Peter Radchenko. A generalized Dantzig selector with shrinkage tuning. Biometrika, 96(2):323\u2013337, 2009.\n\n[10] Gareth M. James, Peter Radchenko, and Jinchi Lv. DASSO: connections between the Dantzig selector and lasso. Journal of the Royal Statistical Society Series B, 71(1):127\u2013142, 2009.\n\n[11] Han Liu, Jian Zhang, Xiaoye Jiang, and Jun Liu. The group Dantzig selector. In AISTATS, pages 461\u2013468, 2010.\n\n[12] Zhaosong Lu, Ting Kei Pong, and Yong Zhang. An alternating direction method for finding Dantzig selectors. Computational Statistics & Data Analysis, 56(12):4037\u20134046, 2012.\n\n[13] Andrew M. McDonald, Massimiliano Pontil, and Dimitris Stamos. New perspectives on k-support and cluster norms. ArXiv e-prints, March 2014.\n\n[14] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, Bin Yu, et al. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science, 27(4):538\u2013557, 2012.\n\n[15] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127\u2013239, 2014.\n\n[16] Nikhil S Rao, Ben Recht, and Robert D Nowak. Universal measurement bounds for structured sparse signal recovery. In AISTATS, pages 942\u2013950, 2012.\n\n[17] Mark Rudelson and Roman Vershynin. On sparse reconstruction from Fourier and Gaussian measurements. Communications on Pure and Applied Mathematics, 61(8):1025\u20131045, 2008.\n\n[18] Ian Stewart. Galois Theory, Third Edition. Chapman Hall/CRC Mathematics Series. Taylor & Francis, 2003.\n\n[19] Robert Tibshirani. 
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267\u2013288, 1996.\n\n[20] Huahua Wang and Arindam Banerjee. Bregman alternating direction method of multipliers. In NIPS, 2014.\n\n[21] Xiangfeng Wang and Xiaoming Yuan. The linearized alternating direction method of multipliers for Dantzig selector. SIAM Journal on Scientific Computing, 34(5), 2012.\n\n[22] Peng Zhao and Bin Yu. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541\u20132563, 2006.\n\n[23] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301\u2013320, 2005.", "award": [], "sourceid": 1065, "authors": [{"given_name": "Soumyadeep", "family_name": "Chatterjee", "institution": "University of Minnesota, Twin Cities"}, {"given_name": "Sheng", "family_name": "Chen", "institution": "University of Minnesota"}, {"given_name": "Arindam", "family_name": "Banerjee", "institution": "University of Minnesota, Twin Cites"}]}