{"title": "Sparse Estimation Using General Likelihoods and Non-Factorial Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 2071, "page_last": 2079, "abstract": "Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient, globally-convergent reweighted $\\ell_1$ minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted $\\ell_1$-norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints. 
As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function.", "full_text": "Sparse Estimation Using General Likelihoods and Non-Factorial Priors

David Wipf and Srikantan Nagarajan*
Biomagnetic Imaging Lab, UC San Francisco
{david.wipf, sri}@mrsc.ucsf.edu

Abstract

Finding maximally sparse representations from overcomplete feature dictionaries frequently involves minimizing a cost function composed of a likelihood (or data fit) term and a prior (or penalty function) that favors sparsity. While typically the prior is factorial, here we examine non-factorial alternatives that have a number of desirable properties relevant to sparse estimation and are easily implemented using an efficient and globally-convergent, reweighted $\ell_1$-norm minimization procedure. The first method under consideration arises from the sparse Bayesian learning (SBL) framework. Although based on a highly non-convex underlying cost function, in the context of canonical sparse estimation problems, we prove uniform superiority of this method over the Lasso in that, (i) it can never do worse, and (ii) for any dictionary and sparsity profile, there will always exist cases where it does better. These results challenge the prevailing reliance on strictly convex penalty functions for finding sparse solutions. We then derive a new non-factorial variant with similar properties that exhibits further performance improvements in some empirical tests. For both of these methods, as well as traditional factorial analogs, we demonstrate the effectiveness of reweighted $\ell_1$-norm algorithms in handling more general sparse estimation problems involving classification, group feature selection, and non-negativity constraints.
As a byproduct of this development, a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine) is derived that, unlike the original, involves no approximation steps and descends a well-defined objective function.

1 Introduction

With the advent of compressive sensing and other related applications, there has been growing interest in finding sparse signal representations from redundant dictionaries [3, 5]. The canonical form of this problem is given by

$\min_x \|x\|_0, \quad \text{s.t. } y = \Phi x,$   (1)

where $\Phi \in \mathbb{R}^{n \times m}$ is a matrix whose columns $\phi_i$ represent an overcomplete or redundant basis (i.e., $\mathrm{rank}(\Phi) = n$ and $m > n$), $x \in \mathbb{R}^m$ is a vector of unknown coefficients to be learned, and $y$ is the signal vector. The cost function being minimized represents the $\ell_0$ norm of $x$ (i.e., a count of the number of nonzero elements in $x$). If measurement noise or modeling errors are present, we instead solve the alternative problem

$\min_x \|y - \Phi x\|_2^2 + \lambda \|x\|_0, \quad \lambda > 0,$   (2)

noting that in the limit as $\lambda \to 0$, the two problems are equivalent (the limit must be taken outside of the minimization). From a Bayesian perspective, optimization of either problem can be viewed, after an $\exp[-(\cdot)]$ transformation, as a challenging MAP estimation task with a quadratic likelihood function and a prior that is both improper and discontinuous. Unfortunately, an exhaustive search for the optimal representation requires the solution of up to $\binom{m}{n}$ linear systems of size $n \times n$, a prohibitively expensive procedure for even modest values of $m$ and $n$. Consequently, in practical situations there is a need for approximate methods that efficiently solve (1) or (2) with high probability.

*This research was supported by NIH grants R01DC04855 and R01DC006435.
Moreover, we would ideally like these methods to generalize to other likelihood functions and priors for applications such as non-negative sparse coding, classification, and group variable selection.

One common strategy is to replace $\|x\|_0$ with a more manageable penalty function $g(x)$ (or prior) that still favors sparsity. Typically this replacement is a concave, non-decreasing function of $|x| \triangleq [|x_1|, \ldots, |x_m|]^T$. It is also generally assumed to be factorial, meaning $g(x) = \sum_i g(x_i)$. Given this selection, a recent, very successful optimization technique involves iterative reweighted $\ell_1$ minimization, a process that produces more focal estimates with each passing iteration [3, 19]. To implement this procedure, at the $(k+1)$-th iteration we compute

$x^{(k+1)} \to \arg\min_x \|y - \Phi x\|_2^2 + \lambda \sum_i w_i^{(k)} |x_i|,$   (3)

where $w_i^{(k)} \triangleq \partial g(x_i^{(k)}) / \partial |x_i^{(k)}|$. As discussed in [6], these updates are guaranteed to converge to a local minimum of the underlying cost function by satisfying the conditions of the Global Convergence Theorem (see for example [24]). Moreover, empirical evidence from [3] suggests that generally only a few iterations, which can be readily computed using standard convex programming packages, are required. Note that a single iteration with unit weights is equivalent to the traditional Lasso estimator [14]. However, given an appropriate selection for $g(\cdot)$, e.g., $g(x_i) = \log(|x_i| + \alpha)$ with $\alpha > 0$, subsequent iterations have been shown to exhibit substantial improvements over the Lasso in approximating the solution of (1) or (2) [3].

While certainly successful in practice, there remain fundamental limitations as to what can be achieved using factorial penalties to approximate $\|x\|_0$.
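As a concrete sketch of the procedure above (not the authors' implementation), update (3) with the factorial penalty $g(x_i) = \log(|x_i| + \alpha)$ can be prototyped with a simple ISTA inner solver for each weighted Lasso subproblem; all problem sizes and parameter values below are illustrative.

```python
import numpy as np

def weighted_lasso_ista(Phi, y, w, lam, n_iter=3000):
    """Approximately solve one reweighted step (3):
    min_x ||y - Phi x||_2^2 + lam * sum_i w_i |x_i|, via ISTA."""
    L = 2 * np.linalg.norm(Phi, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - 2 * Phi.T @ (Phi @ x - y) / L    # gradient step on the data fit
        x = np.sign(z) * np.maximum(np.abs(z) - lam * w / L, 0.0)  # weighted soft-threshold
    return x

def reweighted_l1(Phi, y, lam=0.01, alpha=0.1, n_outer=4):
    """Iterative reweighted l1 with the factorial penalty g(x_i) = log(|x_i| + alpha) [3]."""
    w = np.ones(Phi.shape[1])                    # unit weights: first pass is the Lasso
    for _ in range(n_outer):
        x = weighted_lasso_ista(Phi, y, w, lam)
        w = 1.0 / (np.abs(x) + alpha)            # w_i = dg / d|x_i|
    return x
```

Standard convex programming packages can of course replace the ISTA inner loop; the sketch only makes the outer reweighting explicit.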
Perhaps counterintuitively, it has been shown in [19] that by considering the wider class of non-factorial penalties, more effective surrogates for $\|x\|_0$ can be obtained, potentially leading to better approximate solutions of either (1) or (2). In this paper we consider two non-factorial methods that rely on the same basic iterative reweighted $\ell_1$ minimization procedure outlined above. In Section 2, we briefly introduce the non-factorial penalty function first proposed in [19] (based on a dual-form interpretation of sparse Bayesian learning) and then derive a new iterative reweighted $\ell_1$ implementation that builds upon these ideas. We then demonstrate that this algorithm satisfies two desirable properties pertaining to problem (1): (i) each iteration can only improve the sparsity, and (ii) for any $\Phi$ and sparsity profile, there will always exist cases where performance improves over standard $\ell_1$ minimization, which represents the best convex approximation to (1). Together, these results imply that this reweighting scheme can never do worse than the Lasso (assuming $w_i^{(0)} = 1, \forall i$), and that there will always be cases where improvement over the Lasso is achieved. To a large extent, this removes much of the stigma commonly associated with using non-convex sparsity penalties.
Later in Section 3, we derive a second promising non-factorial variant by starting with a plausible $\ell_1$ reweighting scheme and then working backwards to determine the form and properties of the underlying penalty function.

In general, iterative reweighted $\ell_1$ procedures of any kind are attractive for our purposes because they can easily be augmented to handle other likelihoods and priors, provided convexity of the update (3) is preserved (of course the overall cost function being minimized will be non-convex). For example, to address the extensions mentioned above, in Section 4 we explore adding constraints such as $x_i \geq 0$, replacing $|x_i|$ with a norm on groups of variables, and using a logistic instead of quadratic likelihood term for classification. The latter extension leads to a rigorous reformulation of sparse Bayesian classification (e.g., the relevance vector machine [15]) that, unlike the original, involves no approximation steps and descends a well-defined objective function. Finally, Section 5 contains empirical comparisons while Section 6 provides brief concluding remarks.

2 Non-Factorial Methods Based on Sparse Bayesian Learning

A particularly useful non-factorial penalty emerges from a dual-space view [19] of sparse Bayesian learning (SBL) [15], which is based on the notion of automatic relevance determination (ARD) [10]. SBL assumes a Gaussian likelihood function $p(y|x) = \mathcal{N}(y; \Phi x, \lambda I)$, consistent with the data fit term from (2). The basic ARD prior incorporated by SBL is $p(x; \gamma) = \mathcal{N}(x; 0, \mathrm{diag}[\gamma])$, where $\gamma \in \mathbb{R}_+^m$ is a vector of $m$ non-negative hyperparameters governing the prior variance of each unknown coefficient. These hyperparameters are estimated from the data by first marginalizing over the coefficients $x$ and then performing what is commonly referred to as evidence maximization or type-II maximum likelihood [10, 15]. Mathematically, this is equivalent to minimizing

$\mathcal{L}(\gamma) \triangleq -\log \int p(y|x)\, p(x; \gamma)\, dx = -\log p(y; \gamma) \equiv \log |\Sigma_y| + y^T \Sigma_y^{-1} y,$   (4)

where $\Sigma_y \triangleq \lambda I + \Phi \Gamma \Phi^T$ and $\Gamma \triangleq \mathrm{diag}[\gamma]$. Once some $\gamma_* = \arg\min_\gamma \mathcal{L}(\gamma)$ is computed, an estimate of the unknown coefficients can be obtained by setting $x_{\mathrm{SBL}}$ to the posterior mean computed using $\gamma_*$:

$x_{\mathrm{SBL}} = E[x|y; \gamma_*] = \Gamma_* \Phi^T \Sigma_{y*}^{-1} y.$   (5)

Note that if any $\gamma_{*,i} = 0$, as often occurs during the learning process, then $x_{\mathrm{SBL},i} = 0$ and the corresponding feature is effectively pruned from the model. The resulting coefficient vector $x_{\mathrm{SBL}}$ is therefore sparse, with nonzero elements corresponding with the 'relevant' features.

It is not immediately apparent how the SBL procedure, which requires optimizing a cost function in $\gamma$-space and is based on a factorial prior $p(x; \gamma)$, relates to solving/approximating (1) and/or (2) via a non-factorial penalty in $x$-space. However, it has been shown in [19] that $x_{\mathrm{SBL}}$ satisfies

$x_{\mathrm{SBL}} = \arg\min_x \|y - \Phi x\|_2^2 + \lambda g_{\mathrm{SBL}}(x),$   (6)

where

$g_{\mathrm{SBL}}(x) \triangleq \min_{\gamma \geq 0}\ x^T \Gamma^{-1} x + \log |\alpha I + \Phi \Gamma \Phi^T|,$   (7)

assuming $\alpha = \lambda$. While not discussed in [19], $g_{\mathrm{SBL}}(x)$ is a general penalty function that need only have $\alpha = \lambda$ to obtain equivalence with SBL; other selections may lead to better performance (more on this in Section 4 below).

The analysis in [19] reveals that replacing $\|x\|_0$ with $g_{\mathrm{SBL}}(x)$ and $\alpha \to 0$ leaves the globally minimizing solution to (1) unchanged but drastically reduces the number of local minima (more so than any possible factorial penalty function).
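For reference, the posterior mean (5) is straightforward to compute once $\gamma$ is in hand; the minimal sketch below (illustrative sizes only, not the authors' code) also makes the pruning effect of zero-valued hyperparameters explicit.

```python
import numpy as np

def sbl_posterior_mean(Phi, y, gamma, lam):
    """Posterior mean (5): x = Gamma Phi^T Sigma_y^{-1} y,
    with Sigma_y = lam * I + Phi Gamma Phi^T and Gamma = diag[gamma]."""
    n = Phi.shape[0]
    Sigma_y = lam * np.eye(n) + (Phi * gamma) @ Phi.T   # Phi * gamma scales columns
    return gamma * (Phi.T @ np.linalg.solve(Sigma_y, y))
```

Because the result is multiplied elementwise by $\gamma$, any coefficient with $\gamma_i = 0$ is exactly zero, i.e., pruned from the model.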
While space precludes the details here, these ideas can be extended significantly to form conditions, which again are only satisfiable by a non-factorial penalty, whereby all local minima are smoothed away [21]. Note that while basic $\ell_1$-norm minimization also has no local minima, the global minimum need not always correspond with the global solution to (1), unlike when using $g_{\mathrm{SBL}}(x)$.

It can also be shown that $g_{\mathrm{SBL}}(x)$ is a non-decreasing, concave function of $|x|$ (see Appendix), a desirable property of sparsity-promoting penalties. Importantly, as a direct consequence of this concavity, (6) can be optimized using a reweighted $\ell_1$ algorithm (in an analogous fashion to the factorial case) using

$w_i^{(k+1)} = \left. \frac{\partial g_{\mathrm{SBL}}(x)}{\partial |x_i|} \right|_{x = x^{(k+1)}}.$   (8)

Although this quantity is not available in closed form (except for the special case where $\alpha \to 0$), it can be estimated by executing: Step I - Initialize by setting $w^{(k+1)} \to w^{(k)}$, the $k$-th vector of weights; Step II - Repeat until convergence

$w_i^{(k+1)} \to \left[ \phi_i^T \left( \alpha I + \Phi \widetilde{W}^{(k+1)} \widetilde{X}^{(k+1)} \Phi^T \right)^{-1} \phi_i \right]^{\frac{1}{2}},$   (9)

where $\widetilde{W}^{(k+1)} \triangleq \mathrm{diag}[w^{(k+1)}]^{-1}$ and $\widetilde{X}^{(k+1)} \triangleq \mathrm{diag}[|x^{(k+1)}|]$. The derivation is shown in the Appendix, while further details and analyses are deferred to [20]. Note that cost function descent is guaranteed with only a single iteration, so we need not execute (9) until convergence. In fact, it can be shown that a more rudimentary form of reweighted $\ell_1$ applied to this model in [19] amounts to performing exactly one such iteration. However, repeated execution of (9) is cheap computationally since it scales as $O(nm\|x^{(k+1)}\|_0)$, where typically $\|x^{(k+1)}\|_0 \leq n$, and is substantially less intensive than the subsequent $\ell_1$ step given by (3).

From a theoretical standpoint, $\ell_1$ reweighting applied to $g_{\mathrm{SBL}}(x)$ is guaranteed to aid performance in the sense described by the following two results, which apply in the case where $\lambda \to 0, \alpha \to 0$. Before proceeding, we define $\mathrm{spark}(\Phi)$ as the smallest number of linearly dependent columns in $\Phi$ [5]. It follows then that $2 \leq \mathrm{spark}(\Phi) \leq n + 1$.

Theorem 1. When applying iterative reweighted $\ell_1$ using (9) and $w_i^{(1)} \neq 0, \forall i$, the solution sparsity satisfies $\|x^{(k+1)}\|_0 \leq \|x^{(k)}\|_0$ (i.e., continued iteration can never do worse).

Theorem 2. Assume that $\mathrm{spark}(\Phi) = n + 1$ and consider any instance where standard $\ell_1$ minimization fails to find some $x^*$ drawn from support set $S$ with cardinality $|S| < \frac{(n+1)}{2}$. Then there exists a set of signals $y$ (with non-zero measure) generated from $S$ such that non-factorial reweighted $\ell_1$, with $\widetilde{W}^{(k+1)}$ updated using (9), always succeeds but standard $\ell_1$ always fails.

Note that Theorem 2 does not in any way indicate what the best non-factorial reweighting scheme is in practice (for example, in our limited experience with empirical simulations, the selection $\alpha \to 0$ is not necessarily always optimal). However, it does suggest that reweighting with non-convex, non-factorial penalties is potentially very effective, motivating other selections as discussed next.
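One pass of the inner update (9) amounts to a single linear solve; a minimal numerical sketch (illustrative sizes, not the authors' code) follows.

```python
import numpy as np

def sbl_weight_update(Phi, x, w, alpha):
    """One pass of (9): w_i <- [phi_i^T (alpha I + Phi W~ X~ Phi^T)^{-1} phi_i]^{1/2},
    where W~ = diag[w]^{-1} and X~ = diag[|x|]."""
    n = Phi.shape[0]
    A = alpha * np.eye(n) + (Phi * (np.abs(x) / w)) @ Phi.T  # Phi W~ X~ Phi^T
    return np.sqrt(np.einsum('ij,ij->j', Phi, np.linalg.solve(A, Phi)))
```

The `einsum` line computes all quadratic forms $\phi_i^T A^{-1} \phi_i$ at once from a single solve against $\Phi$, consistent with the $O(nm\|x\|_0)$ scaling noted above.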
Taken together, Theorems 1 and 2 challenge the prevailing reliance on strictly convex cost functions, since they ensure that we can never do worse than the Lasso (which uses the tightest convex approximation to the $\ell_0$ norm), and that there will always be cases where improvement over the Lasso is obtained.

3 Bottom-Up Construction of Non-Factorial Penalty

In the previous section, we described what amounts to a top-down formulation of a non-factorial penalty function that emerges from a particular hierarchical Bayesian model. Based on the insights gleaned from this procedure (and its distinction from factorial penalties), it is possible to stipulate alternative penalty functions from the bottom up by creating plausible, non-factorial reweighting schemes. The following is one such possibility.

Assume for simplicity that $\lambda \to 0$. The Achilles heel of standard, factorial penalties is that if we want to retain a global minimum similar to that of (1), we require a highly concave penalty on each $x_i$ [21]. However, this implies that almost all basic feasible solutions (BFS) to $y = \Phi x$, defined as solutions with $\|x\|_0 \leq n$, will form local minima of the penalty function constrained to the feasible region. This is a very undesirable property since there are on the order of $\binom{m}{n}$ BFS with $\|x\|_0 = n$, which is equal to the signal dimension and not very sparse. We would really like to find degenerate BFS, where $\|x\|_0$ is strictly less than $n$. Such solutions are exceedingly rare and difficult to find. Consequently, we would like to utilize a non-factorial, yet highly concave penalty that explicitly favors degenerate BFS. We can accomplish this by constructing a reweighting scheme designed to avoid non-degenerate BFS whenever possible.

Now consider the covariance-like quantity $\alpha I + \Phi (\widetilde{X}^{(k+1)})^2 \Phi^T$, where $\alpha$ may be small, and then construct weights using the projection of each basis vector $\phi_i$ as defined via

$w_i^{(k+1)} \to \phi_i^T \left( \alpha I + \Phi (\widetilde{X}^{(k+1)})^2 \Phi^T \right)^{-1} \phi_i.$   (10)

Ideally, if at iteration $k+1$ we are at a bad or non-degenerate BFS, we do not want the newly computed $w^{(k+1)}$ to favor the present position at the next iteration of (3) by assigning overly large weights to the zero-valued $x_i$. In such a situation, the factor $\Phi (\widetilde{X}^{(k+1)})^2 \Phi^T$ in (10) will be full rank and so all weights will be relatively modest sized. In contrast, if a rare, degenerate BFS is found, then $\Phi (\widetilde{X}^{(k+1)})^2 \Phi^T$ will no longer be full rank, and the weights associated with zero-valued coefficients will be set to large values, meaning this solution will be favored in the next iteration. In some sense, the distinction between (10) and its factorial counterparts, such as the method of Candès et al. [3] which uses $w_i^{(k+1)} \to 1/(|x_i^{(k+1)}| + \alpha)$, can be summarized as follows: the factorial methods assign the largest weight whenever the associated coefficient goes to zero; with (10) the largest weight is only assigned when the associated coefficient goes to zero and $\|x^{(k+1)}\|_0 < n$.

The reweighting option (10), which bears some resemblance to (9), also has some very desirable properties beyond the intuitive justification given above. First, since we are utilizing (10) in the context of reweighted $\ell_1$ minimization, it would be productive to know what cost function, if any, we are minimizing when we compute each iteration. Using the fundamental theorem of calculus for line integrals (or the gradient theorem), it follows that the bottom-up (BU) penalty function associated with (10) is

$g_{\mathrm{BU}}(x) \triangleq \int_0^1 \mathrm{trace}\left[ \widetilde{X} \Phi^T \left( \alpha I + \Phi (\nu \widetilde{X})^2 \Phi^T \right)^{-1} \Phi \right] d\nu.$   (11)

Moreover, because each weight $w_i$ is a non-increasing function of each $x_j, \forall j$, from Kachurovskii's theorem [12] it directly follows that (11) is concave and non-decreasing in $|x|$, and thus naturally promotes sparsity. Additionally, for $\alpha$ sufficiently small, it can be shown that the global minimum of (11) on the constraint $y = \Phi x$ must occur at a degenerate BFS (Theorem 1 from above also holds when using (10); Theorem 2 may as well, although we have not formally shown this). And finally, regarding implementational issues and interpretability, (10) avoids any recursive weight assignments or inner-loop optimization as when using (9).

4 Extensions

One of the motivating factors for using iterative reweighted $\ell_1$ optimization is that it is very easy to incorporate alternative likelihoods and priors. This section addresses three such examples.

Non-Negative Sparse Coding: Numerous applications require sparse solutions where all coefficients $x_i$ are constrained to be non-negative [2]. By adding the constraint $x \geq 0$ to (3) at each iteration, we can easily compute such solutions using $g_{\mathrm{SBL}}(x)$, $g_{\mathrm{BU}}(x)$, or any other appropriate penalty function. Note that in the original SBL formulation, this is not a possibility since the integrals required to compute the associated cost function or update rules no longer have closed-form expressions.

Group Feature Selection: Another common generalization is to seek sparsity at the level of groups of features, e.g., the group Lasso [23].
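The rank-dependent behavior of the bottom-up weights (10) described in Section 3 is easy to check numerically; the sketch below (illustrative sizes, small $\alpha$, not the authors' code) contrasts a degenerate BFS with a non-degenerate one.

```python
import numpy as np

def bottom_up_weights(Phi, x, alpha):
    """Bottom-up update (10): w_i <- phi_i^T (alpha I + Phi X~^2 Phi^T)^{-1} phi_i,
    with X~ = diag[|x|]."""
    n = Phi.shape[0]
    A = alpha * np.eye(n) + (Phi * x**2) @ Phi.T   # Phi X~^2 Phi^T via column scaling
    return np.einsum('ij,ij->j', Phi, np.linalg.solve(A, Phi))
```

With a degenerate solution ($\|x\|_0 < n$) the matrix inside the inverse is nearly singular, so the zero-valued coefficients receive very large weights; with $\|x\|_0 = n$ it is full rank and all weights stay modest, exactly as argued above.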
The simultaneous sparse approximation problem [17] is a particularly useful adaptation of this idea relevant to compressive sensing [18], manifold learning [13], and neuroimaging [22]. In this situation, we are presented with $r$ signals $Y \triangleq [y_{\cdot 1}, y_{\cdot 2}, \ldots, y_{\cdot r}]$ that were produced by coefficient vectors $X \triangleq [x_{\cdot 1}, x_{\cdot 2}, \ldots, x_{\cdot r}]$ characterized by the same sparsity profile or support, meaning that the coefficient matrix $X$ is row sparse. Here we adopt the notation that $x_{\cdot j}$ represents the $j$-th column of $X$ while $x_{i \cdot}$ represents the $i$-th row of $X$. The sparse recovery problems (1) and (2) then become

$\min_X d(X), \ \text{s.t. } Y = \Phi X, \quad \text{and} \quad \min_X \|Y - \Phi X\|_F^2 + \lambda d(X), \ \lambda > 0,$   (12)

where $d(X) \triangleq \sum_{i=1}^m I[\|x_{i \cdot}\| > 0]$ and $I[\cdot]$ is an indicator function. $d(X)$ favors row sparsity and is a natural extension of the $\ell_0$ norm to the simultaneous approximation problem.

As before, the combinatorial nature of each optimization problem renders them intractable and so approximate procedures are required. All of the algorithms discussed herein can naturally be expanded to this domain essentially by substituting the scalar coefficient magnitudes from a given iteration $|x_i^{(k)}|$ with some row-vector penalty, such as a norm. If we utilize $\|x_{i \cdot}\|_2$, then the coefficient matrix update analogous to (3) requires the solution of the more complicated weighted second-order cone (SOC) program

$X^{(k+1)} \to \arg\min_X \|Y - \Phi X\|_F^2 + \lambda \sum_i w_i^{(k)} \|x_{i \cdot}\|_2.$   (13)

Other selections such as the $\ell_1$ norm are possible as well, providing added generality.

Sparse Classifier Design: At a high level, sparse classifiers can be trained by simply substituting a (preferably) convex likelihood function for the quadratic term in (2). For example, to perform sparse logistic regression we would solve

$\min_x\ -\sum_j \left[ y_j \log \sigma(\phi_{j \cdot} x) + (1 - y_j) \log\left(1 - \sigma(\phi_{j \cdot} x)\right) \right] + \lambda g(x),$   (14)

where now $y_j \in \{0, 1\}$, $\sigma(\cdot)$ denotes the logistic sigmoid function, $\phi_{j \cdot}$ is the $j$-th row of $\Phi$, and $g(x)$ is an arbitrary, concave-in-$|x|$ penalty. This can be implemented by iteratively solving an $\ell_1$-norm penalized logistic regression problem, which can be efficiently accomplished using a simple majorization-minimization approach [7]. Note that cost function descent does not require that we compute the full reweighted $\ell_1$ solution; the iterations from [7] naturally lend themselves to an efficient partial (or greedy) update before recomputing the weights.

It is very insightful to compare this methodology with the original SBL (or relevance vector machine) classifier derived in [15]. When the Gaussian likelihood $p(y|x)$ is replaced with a Bernoulli distribution (which leads to the logistic data fit term above), it is no longer possible to compute the marginalization (4) or the posterior distribution $p(x|y; \gamma)$, which is used both for optimization purposes and to make predictive statements about test data. Consequently, a heuristic Laplace approximation is adopted, which requires a second-order Newton inner-loop to fit a Gaussian about the mode of $p(x|y; \gamma)$. This Gaussian is then used to transform the classification problem into a standard regression one with data-dependent (heteroscedastic) noise, and then whatever approach is used to minimize (4), either the MacKay update rules [15] or a greedy constructive method [16], can be used in the outer-loop.
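An illustrative sketch of (14) with a factorial log penalty is given below; a plain proximal-gradient inner solver stands in for the majorization scheme of [7], and all parameter values are arbitrary choices, not the authors' settings.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def weighted_l1_logistic(Phi, y, w, lam, n_iter=2000):
    """Weighted l1-penalized logistic regression via proximal gradient descent."""
    step = 4.0 / np.linalg.norm(Phi, 2) ** 2           # 1/L for the logistic loss
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        z = x - step * Phi.T @ (sigmoid(Phi @ x) - y)  # gradient of the data fit in (14)
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam * w, 0.0)
    return x

def sparse_logistic(Phi, y, lam=0.1, alpha=0.1, n_outer=4):
    """Reweighted l1 for (14) with the factorial penalty g(x_i) = log(|x_i| + alpha)."""
    w = np.ones(Phi.shape[1])
    for _ in range(n_outer):
        x = weighted_l1_logistic(Phi, y, w, lam)
        w = 1.0 / (np.abs(x) + alpha)
    return x
```

Swapping in $g_{\mathrm{SBL}}$ or $g_{\mathrm{BU}}$ only changes the weight recomputation line, which is precisely the modularity argued for in the text.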
When (if) a fixed point $\gamma_*$ is reached, the corresponding classifier coefficients are chosen as the mode of $p(x|y; \gamma_*)$.

While demonstrably effective in a wide variety of empirical classification tests, the problem with this formulation of SBL is threefold. First, there are no convergence guarantees of any kind, regardless of which method is used for the outer-loop. Secondly, it is completely unclear what, if any, cost function is being descended (even approximately) to obtain the classifier coefficients, making it difficult to explore the model for enhancements or analytical purposes. Thirdly, in certain applications it has been observed that SBL achieves extreme sparsity at the expense of classification accuracy [4, 11]. There is currently no flexibility in the model to remedy this problem.

These issues are directly addressed by dispensing with the Bayesian hierarchical derivation of SBL altogether and considering classification in light of (14). Both the MacKay and greedy SBL updates are equivalent to minimizing (14) with $g(x) = g_{\mathrm{SBL}}(x)$, and assuming $\alpha = \lambda = 1$, using coordinate descent over a set of auxiliary functions (details provided in a forthcoming paper). Unfortunately however, because these auxiliary functions are based in part on a second-order Laplace approximation, they do not form a strict upper bound and so provable convergence (or even descent) is not possible. Of course we can always substitute the reweighted $\ell_1$ scheme discussed above to avoid this issue, since the underlying cost function in $x$-space is the same. Perhaps more importantly, to properly regulate sparsity, when we deviate from the original Bayesian inspiration for this model, we are free to adjust $\alpha$ and/or $\lambda$.
For example, with $\alpha$ small, the penalty $g_{\mathrm{SBL}}(x)$ is more highly concave, favoring sparsity, while in the limit as $\alpha$ becomes large, it acts like a standard $\ell_1$ norm, still favoring sparsity but not exceedingly so (the same phenomenon occurs when using the penalty (11)). Likewise, $\lambda$ acts as a natural trade-off parameter balancing the contribution from the two terms in (6) or (14). Both $\alpha$ and $\lambda$ can be tuned via cross-validation if desired.

There is one additional concern regarding SBL that involves marginal likelihood (sometimes called evidence) calculations. In the standard regression case where marginalization was possible, the optimized quantity $-\log p(y; \gamma)$ represents an approximation to $-\log p(y)$ that can be used, among other things, for model comparison. This notion is completely lost when we move to the classification case under consideration. While space precludes the details, if we are willing to substitute a probit likelihood function for the logistic, it is possible to revert (14) back to the original hierarchical, $\gamma$-dependent Bayesian model and obtain a rigorous upper bound on $-\log p(y; \gamma)$. Finally, detailed empirical simulations with both logistic- and probit-based classifiers are an area of future research; preliminary results are promising.

5 Empirical Comparisons

To further examine the algorithms discussed herein, we performed simulations similar to those in [3]. In the first experiment, each trial consisted of generating a $100 \times 256$ dictionary $\Phi$ with iid Gaussian entries and a sparse vector $x^*$ with 60 nonzero, non-negative (truncated Gaussian) coefficients. A signal is then computed using $y = \Phi x^*$.
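In this noise-free, non-negative setting, each weighted $\ell_1$ step becomes the linear program $\min_x \sum_i w_i x_i$ s.t. $\Phi x = y$, $x \geq 0$; a minimal sketch using `scipy.optimize.linprog` (our illustration with arbitrary sizes, not the authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def nonneg_weighted_l1(Phi, y, w):
    """Non-negative weighted l1 step (lambda -> 0):
    min_x sum_i w_i x_i  s.t.  Phi x = y, x >= 0."""
    res = linprog(c=w, A_eq=Phi, b_eq=y,
                  bounds=[(0, None)] * Phi.shape[1], method="highs")
    assert res.success, res.message
    return res.x
```

Reweighting then proceeds exactly as before, e.g., with the non-factorial weights (9) or (10), or the factorial weights of Candès et al.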
We then attempted to recover $x^*$ by applying non-negative $\ell_1$ reweighting strategies with four different penalty functions: (i) $g_{\mathrm{SBL}}(x)$ implemented using a single iteration of (9), referred to as SBL-I (equivalent to the method from [19]); (ii) $g_{\mathrm{SBL}}(x)$ implemented using multiple iterations of (9) as discussed in Section 2, referred to as SBL-II; (iii) $g_{\mathrm{BU}}(x)$; and finally (iv) $g(x) = \sum_i \log(|x_i| + \alpha)$, the factorial method of Candès et al., which represents the current state-of-the-art in reweighted $\ell_1$ algorithms. In all cases $\alpha$ was chosen via coarse cross-validation. Additionally, since we are working with a noise-free signal, we assume $\lambda \to 0$ and so the requisite coefficient update (3) with $x_i \geq 0$ reduces to a standard linear program. Given $w_i^{(0)} = 1, \forall i$ for each algorithm, the first iteration amounts to the non-negative minimum $\ell_1$-norm solution (i.e., the Lasso). Average results from 1000 random trials are displayed in Figure 1 (left), which plots the empirical probability of success in recovering $x^*$ versus the iteration number. We observe that standard non-negative $\ell_1$ never succeeds (see first iteration results); however, with only a few reweighted iterations drastic improvement is possible, especially for the bottom-up approach. By 10 iterations, the non-factorial variants have all exceeded the method of Candès et al. (There was no appreciable improvement by any method after 10 iterations.) This shows both the efficacy of non-factorial reweighting and the ability to handle constraints on $x$.

For the second experiment, we used a randomly generated $50 \times 100$ dictionary for each trial with iid Gaussian entries as above, and created 5 coefficient vectors $X^* = [x^*_{\cdot 1}, \ldots, x^*_{\cdot 5}]$ with matching sparsity profile and iid Gaussian nonzero coefficients. We then generate the signal matrix $Y = \Phi X^*$ and attempt to learn $X^*$ using various group-level reweighting schemes. In this experiment we varied the row sparsity of $X^*$ from $d(X^*) = 30$ to $d(X^*) = 40$; in general, the more nonzero rows, the harder the recovery problem becomes. A total of five algorithms modified for the simultaneous sparse approximation problem were tested using an $\ell_2$-norm penalty on each coefficient row: the four methods from above (executed for 5 iterations each) plus the standard group Lasso (equivalent to a single iteration of any of the other algorithms). Results are presented in Figure 1 (right), where the performance gap between the factorial and non-factorial approaches is very significant. Additionally, we have successfully applied this methodology to large neuroimaging data sets [22], obtaining significant improvements over existing convex approaches such as the group Lasso, consistent with the results in Figure 1. Other related simulation results are contained in [20].

[Figure 1: two panels plotting $p(\mathrm{success})$; legends: SBL-I, SBL-II, Bottom-Up, Candès et al., and (right only) Group Lasso.]

Figure 1: Left: Probability of success recovering sparse non-negative coefficients as a function of reweighted $\ell_1$ iterations.
Right: Iterative reweighted results using 5 simultaneous signal vectors. Probability of success recovering sparse coefficients for different row sparsity values, i.e., $d(X^*)$.

6 Conclusion
In this paper we have examined concave, non-factorial priors (which previously have received little attention) for the purpose of estimating sparse coefficients. When coupled with general likelihood models and minimized using efficient iterative reweighted $\ell_1$ methods, these priors offer a powerful alternative to existing state-of-the-art sparse estimation techniques. We have also shown (for the first time) exactly what the underlying cost function associated with the SBL classifier is and provided a more principled algorithm for minimizing it.

Appendix
Concavity of $g_{SBL}(x)$ and derivation of weight updates (9): Because $\log|\alpha I + \Phi \Gamma \Phi^T|$ is concave and non-decreasing with respect to $\gamma \geq 0$, we can express it as
$$\log|\alpha I + \Phi \Gamma \Phi^T| = \min_{z \geq 0} \; z^T \gamma - h^*(z), \tag{15}$$
where $h^*(z)$ is defined as the concave conjugate of $h(\gamma) \triangleq \log|\alpha I + \Phi \Gamma \Phi^T|$ [1]. We can then express $g_{SBL}(x)$ via
$$g_{SBL}(x) = \min_{\gamma \geq 0} \; x^T \Gamma^{-1} x + \log|\alpha I + \Phi \Gamma \Phi^T| = \min_{\gamma, z \geq 0} \; \sum_i \left( \frac{x_i^2}{\gamma_i} + z_i \gamma_i \right) - h^*(z). \tag{16}$$
Minimizing over $\gamma$ for fixed $x$ and $z$, we get
$$\gamma_i = z_i^{-1/2} |x_i|, \; \forall i. \tag{17}$$
Substituting this expression into (16) gives the representation
$$g_{SBL}(x) = \min_{z \geq 0} \; \sum_i \left( \frac{x_i^2}{z_i^{-1/2}|x_i|} + z_i z_i^{-1/2} |x_i| \right) - h^*(z) = \min_{z \geq 0} \; \sum_i 2 z_i^{1/2} |x_i| - h^*(z), \tag{18}$$
which implies that $g_{SBL}(x)$ can be represented as a minimum of upper-bounding hyperplanes with respect to $|x|$, and thus must be concave and non-decreasing since $z \geq 0$ [1]. We also observe that for fixed $z$, solving (6) is a weighted $\ell_1$ minimization problem.
To derive the weight update (9), we only need the optimal value of each $z_i$, which from basic convex analysis will satisfy
$$z_i^{1/2} = \frac{\partial g_{SBL}(x)}{2\,\partial |x_i|}. \tag{19}$$
Since this quantity is not available in closed form, we can instead iteratively minimize (16) over $\gamma$ and $z$. We start by initializing $z_i^{1/2} \rightarrow w_i^{(k)}, \forall i$, and then minimize over $\gamma$ using (17). We then compute the optimal $z$ for fixed $\gamma$, which can be done analytically using
$$z = \nabla_\gamma \log\left|\alpha I + \Phi \Gamma \Phi^T\right| = \mathrm{diag}\left[ \Phi^T \left( \alpha I + \Phi \Gamma \Phi^T \right)^{-1} \Phi \right]. \tag{20}$$
By substituting (17) into (20) and defining $w_i^{(k+1)} \triangleq z_i^{1/2}$, we obtain the weight update (9). This procedure is guaranteed to converge to a solution satisfying (19) [20] although, as mentioned previously, only one iteration is actually required for the overall algorithm. $\blacksquare$

Proof of Theorem 1: Before we begin, we should point out that for $\alpha \rightarrow 0$, the weight update (9) is still well-specified regardless of the value of the diagonal matrix $\widetilde{W}^{(k+1)} \widetilde{X}^{(k+1)}$. If $\phi_i$ is not in the span of $\Phi \widetilde{W}^{(k+1)} \widetilde{X}^{(k+1)} \Phi^T$, then $w_i^{(k+1)} \rightarrow \infty$ and the corresponding coefficient $x_i$ can be set to zero for all future iterations. Otherwise $w_i^{(k+1)}$ can be computed efficiently using the Moore-Penrose pseudoinverse and will be strictly nonzero.
For simplicity we will now assume that $\mathrm{spark}(\Phi) = n + 1$, which is equivalent to requiring that each subset of $n$ columns of $\Phi$ forms a basis in $\mathbb{R}^n$. The extension to the more general case is discussed in [20]. From basic linear programming [8], at any iteration the coefficients will satisfy $\|x^{(k)}\|_0 \leq n$ for arbitrary weights $\widetilde{W}^{(k-1)}$. Given our simplifying assumptions, there exist only two possibilities. If $\|x^{(k)}\|_0 = n$, then we will automatically satisfy $\|x^{(k+1)}\|_0 \leq \|x^{(k)}\|_0$ at the next iteration regardless of $\widetilde{W}^{(k)}$. In contrast, if $\|x^{(k)}\|_0 < n$, then $\mathrm{rank}\big[\widetilde{W}^{(k)}\big] \leq \|x^{(k)}\|_0$ for all evaluations of (9) with $\alpha \rightarrow 0$, enforcing $\|x^{(k+1)}\|_0 \leq \|x^{(k)}\|_0$. $\blacksquare$

Proof of Theorem 2: For a fixed dictionary $\Phi$ and coefficient vector $x^*$, we are assuming that $\|x^*\|_0 < \frac{n+1}{2}$. Now consider a second coefficient vector $x'$ with support and sign pattern equal to $x^*$, and define $x'_{(i)}$ as the $i$-th largest coefficient magnitude of $x'$. Then there exists a set of $\|x^*\|_0 - 1$ scaling constants $\nu_i \in (0, 1]$ (i.e., strictly greater than zero) such that, for any signal $y$ generated via $y = \Phi x'$ with $x'_{(i+1)} \leq \nu_i x'_{(i)}$, $i = 1, \ldots, \|x^*\|_0 - 1$, the minimization problem
$$\hat{x} \triangleq \arg\min_x \; g_{SBL}(x), \quad \text{s.t. } \Phi x' = \Phi x, \; \alpha \rightarrow 0, \tag{21}$$
is unimodal and has a unique minimizing stationary point which satisfies $\hat{x} = x'$. This result follows from [21] and the dual-space characterization of the penalty $g_{SBL}(x)$ from [19]. Note that (21) is equivalent to (6) with $\lambda \rightarrow 0$, so the reweighted non-factorial update (9) can be applied. Furthermore, based on the global convergence of these updates discussed above, the sequence of estimates is guaranteed to satisfy $x^{(k)} \rightarrow \hat{x} = x'$. So we will necessarily learn the generative $x'$.
Let $x_{\ell_1} \triangleq \arg\min_x \|x\|_1$, subject to $\Phi x^* = \Phi x$.
By assumption we know that $x_{\ell_1} \neq x^*$. Moreover, we can conclude using [9, Theorem 6] that if $x_{\ell_1}$ fails for some $x^*$, it will fail for any other $x$ with matching support and sign pattern; it will therefore fail for any $x'$ as defined above. Finally, by construction, the set of feasible $x'$ will have nonzero measure over the support $S$ since each $\nu_i$ is strictly nonzero. Note also that this result can likely be extended to the case where $\mathrm{spark}(\Phi) < n + 1$ and to any $x^*$ that satisfies $\|x^*\|_0 < \mathrm{spark}(\Phi) - 1$. The more specific case addressed above was only assumed to allow direct application of [9, Theorem 6]. $\blacksquare$

References
[1] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, 2004.
[2] A. Bruckstein, M. Elad, and M. Zibulevsky, "A non-negative and sparse enough solution of an underdetermined linear system of equations is unique," IEEE Trans. Information Theory, vol. 54, no. 11, pp. 4813–4820, Nov. 2008.
[3] E. Candès, M. Wakin, and S. Boyd, "Enhancing sparsity by reweighted $\ell_1$ minimization," J. Fourier Anal. Appl., vol. 14, no. 5, pp. 877–905, 2008.
[4] G. Cawley and N. Talbot, "Gene selection in cancer classification using sparse logistic regression with Bayesian regularization," Bioinformatics, vol. 22, no. 19, pp. 2348–2355, 2006.
[5] D. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, 2003.
[6] M. Fazel, H. Hindi, and S. Boyd, "Log-Det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices," Proc. American Control Conf., vol. 3, pp. 2156–2162, June 2003.
[7] B. Krishnapuram, L. Carin, M. Figueiredo, and A. Hartemink, "Sparse multinomial logistic regression: Fast algorithms and generalization bounds," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, pp. 957–968, 2005.
[8] D. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, Reading, Massachusetts, second edition, 1984.
[9] D. Malioutov, M. Çetin, and A. S. Willsky, "Optimal sparse representations in general overcomplete bases," IEEE Int. Conf. Acoust., Speech, and Sig. Proc., vol. 2, pp. II-793–796, 2004.
[10] R. Neal, Bayesian Learning for Neural Networks, Springer-Verlag, New York, 1996.
[11] Y. Qi, T. Minka, R. Picard, and Z. Ghahramani, "Predictive automatic relevance determination by expectation propagation," Int. Conf. Machine Learning (ICML), pp. 85–92, 2004.
[12] R. Showalter, "Monotone operators in Banach space and nonlinear partial differential equations," Mathematical Surveys and Monographs 49, AMS, Providence, RI, 1997.
[13] J. Silva, J. Marques, and J. Lemos, "Selecting landmark points for sparse manifold learning," Advances in Neural Information Processing Systems 18, pp. 1241–1248, 2006.
[14] R. Tibshirani, "Regression shrinkage and selection via the Lasso," Journal of the Royal Statistical Society, vol. 58, no. 1, pp. 267–288, 1996.
[15] M. Tipping, "Sparse Bayesian learning and the relevance vector machine," J. Machine Learning Research, vol. 1, pp. 211–244, 2001.
[16] M. Tipping and A. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," Ninth Int. Workshop Artificial Intelligence and Statistics, Jan. 2003.
[17] J. Tropp, "Algorithms for simultaneous sparse approximation. Part II: Convex relaxation," Signal Processing, vol. 86, pp. 589–602, April 2006.
[18] M. Wakin, M. Duarte, S. Sarvotham, D. Baron, and R. Baraniuk, "Recovery of jointly sparse signals from a few random projections," Advances in Neural Information Processing Systems 18, pp. 1433–1440, 2006.
[19] D. Wipf and S. Nagarajan, "A new view of automatic relevance determination," Advances in Neural Information Processing Systems 20, pp. 1625–1632, 2008.
[20] D. Wipf and S. Nagarajan, "Iterative reweighted $\ell_1$ and $\ell_2$ methods for finding sparse solutions," Submitted, 2009.
[21] D. Wipf and S. Nagarajan, "Latent variable Bayesian models for promoting sparsity," Submitted, 2009.
[22] D. Wipf, J. Owen, H. Attias, K. Sekihara, and S. Nagarajan, "Robust Bayesian estimation of the location, orientation, and time course of multiple correlated neural sources using MEG," NeuroImage, vol. 49, no. 1, pp. 641–655, Jan. 2010.
[23] M. Yuan and Y. Lin, "Model selection and estimation in regression with grouped variables," J. R. Statist. Soc. B, vol. 68, pp. 49–67, 2006.
[24] W. Zangwill, Nonlinear Programming: A Unified Approach, Prentice Hall, New Jersey, 1969.