{"title": "Comparing the Effects of Different Weight Distributions on Finding Sparse Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 1521, "page_last": 1528, "abstract": null, "full_text": "Comparing the Effects of Different Weight\n\nDistributions on Finding Sparse Representations\n\nDavid Wipf and Bhaskar Rao \u2217\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego, CA 92093\ndwipf@ucsd.edu, brao@ece.ucsd.edu\n\nAbstract\n\nGiven a redundant dictionary of basis vectors (or atoms), our goal is to\n\ufb01nd maximally sparse representations of signals. Previously, we have\nargued that a sparse Bayesian learning (SBL) framework is particularly\nwell-suited for this task, showing that it has far fewer local minima than\nother Bayesian-inspired strategies. In this paper, we provide further evi-\ndence for this claim by proving a restricted equivalence condition, based\non the distribution of the nonzero generating model weights, whereby the\nSBL solution will equal the maximally sparse representation. We also\nprove that if these nonzero weights are drawn from an approximate Jef-\nfreys prior, then with probability approaching one, our equivalence con-\ndition is satis\ufb01ed. Finally, we motivate the worst-case scenario for SBL\nand demonstrate that it is still better than the most widely used sparse rep-\nresentation algorithms. These include Basis Pursuit (BP), which is based\non a convex relaxation of the \u21130 (quasi)-norm, and Orthogonal Match-\ning Pursuit (OMP), a simple greedy strategy that iteratively selects basis\nvectors most aligned with the current residual.\n\nIntroduction\n\n1\nIn recent years, there has been considerable interest in \ufb01nding sparse signal representations\nfrom redundant dictionaries [1, 2, 3, 4, 5]. The canonical form of this problem is given by,\n\nmin\nw kwk0,\n\ns.t. 
t = \u03a6w,\n\n(1)\n\nwhere \u03a6 \u2208 RN\u00d7M is a matrix whose columns represent an overcomplete or redundant\nbasis (i.e., rank(\u03a6) = N and M > N), w \u2208 RM is the vector of weights to be learned,\nand t is the signal vector. The cost function being minimized represents the \u21130 (quasi)-norm\nof w (i.e., a count of the nonzero elements in w).\n\nUnfortunately, an exhaustive search for the optimal representation requires the solution of\n\nup to (cid:0)M\nN(cid:1) linear systems of size N \u00d7 N, a prohibitively expensive procedure for even\nmodest values of M and N. Consequently, in practical situations there is a need for ap-\nproximate procedures that ef\ufb01ciently solve (1) with high probability. To date, the two most\nwidely used choices are Basis Pursuit (BP) [1] and Orthogonal Matching Pursuit (OMP)\n[5]. BP is based on a convex relaxation of the \u21130 norm, i.e., replacing kwk0 with kwk1,\nwhich leads to an attractive, unimodal optimization problem that can be readily solved via\nlinear programming. In contrast, OMP is a greedy strategy that iteratively selects the basis\n\n\u2217This work was supported by DiMI grant 22-8376, Nissan, and NSF grant DGE-0333451.\n\n\fvector most aligned with the current signal residual. At each step, a new approximant is\nformed by projecting t onto the range of all the selected dictionary atoms.\n\nPreviously [9], we have demonstrated an alternative algorithm for solving (1) using a sparse\nBayesian learning (SBL) framework [6] that maintains several signi\ufb01cant advantages over\nother, Bayesian-inspired strategies for \ufb01nding sparse solutions [7, 8]. 
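For concreteness, the greedy selection loop just described can be sketched in a few lines of numpy. This is our own illustrative code, not an implementation from the cited works; `Phi`, `t`, and the iteration count `K` are hypothetical inputs of the sketch:

```python
import numpy as np

def omp(Phi, t, K):
    """Greedy OMP sketch: select K atoms, re-projecting t at each step."""
    N, M = Phi.shape
    support, r = [], t.copy()
    for _ in range(K):
        # select the atom most aligned with the current residual
        i = int(np.argmax(np.abs(Phi.T @ r)))
        support.append(i)
        # project t onto the range of all selected atoms
        sub = Phi[:, support]
        coef, *_ = np.linalg.lstsq(sub, t, rcond=None)
        r = t - sub @ coef
    w = np.zeros(M)
    w[support] = coef
    return w
```

After K steps the residual is orthogonal to every selected atom, which is what makes the projection step "orthogonal" matching pursuit.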
The most basic formulation begins with an assumed likelihood model of the signal t given weights w,\n\np(t|w) = (2πσ²)^{−N/2} exp( −(1/(2σ²)) ||t − Φw||² ).  (2)\n\nTo provide a regularizing mechanism, SBL uses the parameterized weight prior\n\np(w; γ) = ∏_{i=1}^{M} (2πγi)^{−1/2} exp( −wi²/(2γi) ),  (3)\n\nwhere γ = [γ1, . . . , γM]^T is a vector of M hyperparameters controlling the prior variance of each weight. These hyperparameters can be estimated from the data by marginalizing over the weights and then performing ML optimization. The cost function for this task is\n\nL(γ) = −log ∫ p(t|w) p(w; γ) dw ∝ log |Σ_t| + t^T Σ_t^{−1} t,  (4)\n\nwhere Σ_t ≜ σ²I + ΦΓΦ^T and we have introduced the notation Γ ≜ diag(γ). This procedure, which can be implemented via the EM algorithm (or some other technique), is referred to as evidence maximization or type-II maximum likelihood [6]. Once γ has been estimated, a closed-form expression for the posterior weight distribution is available.\n\nAlthough SBL was initially developed in a regression context, it can be easily adapted to handle (1) in the limit as σ² → 0. To accomplish this we must reexpress the SBL iterations to handle the low noise limit. Applying various matrix identities to the EM algorithm-based update rules for each iteration, we arrive at the modified update [9]\n\nγ_(new) = diag( ŵ_(old) ŵ_(old)^T + [ I − Γ_(old)^{1/2} (ΦΓ_(old)^{1/2})† Φ ] Γ_(old) ),\nŵ_(new) = Γ_(new)^{1/2} (ΦΓ_(new)^{1/2})† t,  (5)\n\nwhere (·)† denotes the Moore-Penrose pseudo-inverse. 
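As an illustration of update (5), the following numpy sketch iterates the rule in the low noise limit. This is our own reconstruction under the initialization Γ = I_M, not the authors' implementation; the small floor on γ is a numerical convenience we add:

```python
import numpy as np

def sbl_noiseless(Phi, t, iters=50):
    """Sketch of the sigma^2 -> 0 SBL fixed-point update (5)."""
    N, M = Phi.shape
    gamma = np.ones(M)                      # Gamma initialized to I_M
    for _ in range(iters):
        G = np.sqrt(gamma)                  # Gamma^{1/2}, stored as a vector
        B = Phi * G                         # Phi @ Gamma^{1/2} (column scaling)
        A = np.linalg.pinv(B)               # (Phi Gamma^{1/2})^dagger
        w = G * (A @ t)                     # w-hat; feasible: t = Phi @ w
        # diagonal of w w^T + [I - Gamma^{1/2} (Phi Gamma^{1/2})^dagger Phi] Gamma,
        # using diag(B^dagger B) = Gamma^{1/2} (A Phi) restricted to the diagonal
        gamma = w**2 + gamma * (1.0 - np.einsum('ij,ji->i', A, B))
        gamma = np.maximum(gamma, 1e-12)    # numerical floor we add for stability
    return w
```

By construction, each ŵ satisfies t = Φŵ whenever all γi remain nonzero, mirroring the feasibility property of the exact update.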
Given that t ∈ range(Φ) and assuming γ is initialized with all nonzero elements, feasibility is enforced at every iteration, i.e., t = Φŵ. We will henceforth refer to wSBL as the solution of this algorithm when initialized at Γ = I_M and ŵ = Φ†t.1 In [9] (which extends work in [10]), we have argued why wSBL should be considered a viable candidate for solving (1).\n\nIn comparing BP, OMP, and SBL, we would ultimately like to know in what situations a particular algorithm is likely to find the maximally sparse solution. A variety of results stipulate rigorous conditions whereby BP and OMP are guaranteed to solve (1) [1, 4, 5]. All of these conditions depend explicitly on the number of nonzero elements contained in the optimal solution. Essentially, if this number is less than some Φ-dependent constant κ, the BP/OMP solution is proven to be equivalent to the minimum ℓ0-norm solution. Unfortunately however, κ turns out to be restrictively small and, for a fixed redundancy ratio M/N, grows very slowly as N becomes large [3]. But in practice, both approaches still perform well even when these equivalence conditions have been grossly violated. To address this issue, a much looser bound has recently been produced for BP, dependent only on M/N. This bound holds for "most" dictionaries in the limit as N becomes large [3], where "most" is with respect to dictionaries composed of columns drawn uniformly from the surface of an N-dimensional unit hypersphere. \n\n1Based on EM convergence properties, the algorithm will converge monotonically to a fixed point.\n
For example, with M/N = 2, it is argued that BP is capable of resolving sparse solutions with roughly 0.3N nonzero elements with probability approaching one as N → ∞.\n\nTurning to SBL, we have neither a convenient convex cost function (as with BP) nor a simple, transparent update rule (as with OMP); however, we can nonetheless come up with an alternative type of equivalence result that is neither unequivocally stronger nor weaker than those existing results for BP and OMP. This condition is dependent on the relative magnitudes of the nonzero elements embedded in optimal solutions to (1). Additionally, we can leverage these ideas to motivate which sparse solutions are the most difficult to find. Later, we provide empirical evidence that SBL, even in this worst-case scenario, can still outperform both BP and OMP.\n\n2 Equivalence Conditions for SBL\n\nIn this section, we establish conditions whereby wSBL will minimize (1). To state these results, we require some notation. First, we formally define a dictionary Φ = [φ1, . . . , φM] as a set of M unit ℓ2-norm vectors (atoms) in R^N, with M > N and rank(Φ) = N. We say that a dictionary satisfies the unique representation property (URP) if every subset of N atoms forms a basis in R^N. We define w(i) as the i-th largest weight magnitude and w̄ as the ||w||_0-dimensional vector containing all the nonzero weight magnitudes of w. The set of optimal solutions to (1) is W∗ with cardinality |W∗|. The diversity (or anti-sparsity) of each w∗ ∈ W∗ is defined as D∗ ≜ ||w∗||_0.\n\nResult 1. For a fixed dictionary Φ that satisfies the URP, there exists a set of M − 1 scaling constants νi ∈ (0, 1] (i.e., strictly greater than zero) such that, for any t = Φw′ generated with\n\nw′(i+1) ≤ νi w′(i),  i = 1, . . . , M − 1,  (6)\n\nSBL will produce a solution that satisfies ||wSBL||_0 = min(N, ||w′||_0) and wSBL ∈ W∗.\n\nDue to space limitations, the proof has been deferred to [11]. The basic idea is that, as the magnitude differences between weights increase, at any given scale, the covariance Σ_t embedded in the SBL cost function is dominated by a single dictionary atom such that problematic local minima are removed. The unique, global minimum in turn achieves the stated result.2 The most interesting case occurs when ||w′||_0 < N, leading to the following:\n\nCorollary 1. Given the additional restriction ||w′||_0 < N, then wSBL = w′ ∈ W∗ and |W∗| = 1, i.e., SBL will find the unique, maximally sparse representation of the signal t.\n\nSee [11] for the proof. These results are restrictive in the sense that the dictionary-dependent constants νi significantly confine the class of signals t that we may represent. Moreover, we have not provided any convenient means of computing what the different scaling constants might be. But we have nonetheless solidified the notion that SBL is most capable of recovering weights of different scales (and it must still find all D∗ nonzero weights no matter how small some of them may be). Additionally, we have specified conditions whereby we will find the unique w∗ even when the diversity is as large as D∗ = N − 1. The tighter BP/OMP bound from [1, 4, 5] scales as O(N^{−1/2}), although this latter bound is much more general in that it is independent of the magnitudes of the nonzero weights.\n\nIn contrast, neither BP nor OMP satisfies a comparable result; in both cases, simple 3D counterexamples suffice to illustrate this point.3 We begin with OMP. 
Assume the following:\n\nw∗ = [1, ε, 0, 0]^T,\n\nΦ = [φ1 φ2 φ3 φ4] =\n[ 0  1/√2  0  1/√1.01\n  0  0     1  0.1/√1.01\n  1  1/√2  0  0 ],\n\nt = Φw∗ = [ε/√2, 0, 1 + ε/√2]^T,  (7)\n\nwhere Φ satisfies the URP and has columns φi of unit ℓ2 norm. Given any ε ∈ (0, 1), we will now show that OMP will necessarily fail to find w∗. Provided ε < 1, at the first iteration OMP will select φ1, which solves max_i |t^T φi|, leaving the residual vector\n\nr1 = (I − φ1 φ1^T) t = [ε/√2, 0, 0]^T.  (8)\n\nNext, φ4 will be chosen since it has the largest value in the top position, thus solving max_i |r1^T φi|. The residual is then updated to become\n\nr2 = (I − [φ1 φ4][φ1 φ4]^T) t = (ε/(101√2)) [1, −10, 0]^T.  (9)\n\nFrom the remaining two columns, r2 is most highly correlated with φ3. Once φ3 is selected, we obtain zero residual error, yet we did not find w∗, which involves only φ1 and φ2. So for all ε ∈ (0, 1), the algorithm fails. \n\n2Because we have effectively shown that the SBL cost function must be unimodal, etc., any proven descent method could likely be applied in place of (5) to achieve the same result.\n\n3While these examples might seem slightly nuanced, the situations being illustrated can occur frequently in practice and the requisite column normalization introduces some complexity.\n
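The failure mode above is easy to check numerically. The following sketch (our own code, replaying the selections of (7)-(9) for a representative ε) confirms that OMP never touches φ2:

```python
import numpy as np

eps = 0.5  # any value in (0, 1) exhibits the failure
Phi = np.array([
    [0.0, 1/np.sqrt(2), 0.0, 1/np.sqrt(1.01)],
    [0.0, 0.0,          1.0, 0.1/np.sqrt(1.01)],
    [1.0, 1/np.sqrt(2), 0.0, 0.0],
])
w_star = np.array([1.0, eps, 0.0, 0.0])
t = Phi @ w_star

selected, r = [], t.copy()
for _ in range(3):
    # greedy selection followed by re-projection, as in OMP
    i = int(np.argmax(np.abs(Phi.T @ r)))
    selected.append(i)
    sub = Phi[:, selected]
    r = t - sub @ np.linalg.lstsq(sub, t, rcond=None)[0]

# OMP picks atoms 1, 4, then 3 (indices 0, 3, 2), reaching zero residual
# without ever selecting atom 2 -- so w_star is not recovered.
```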
As such, there can be no fixed constant ν > 0 such that if w∗(2) ≡ ε ≤ νw∗(1) ≡ ν, we are guaranteed to obtain w∗ (unlike with SBL).\n\nWe now give an analogous example for BP, where we present a feasible solution with smaller ℓ1 norm than the maximally sparse solution. Given\n\nw∗ = [1, ε, 0, 0]^T,\n\nΦ = [φ1 φ2 φ3 φ4] =\n[ 0  1  0.1/√1.02   0.1/√1.02\n  0  0  −0.1/√1.02  0.1/√1.02\n  1  0  1/√1.02     1/√1.02 ],\n\nt = Φw∗ = [ε, 0, 1]^T,  (10)\n\nit is clear that ||w∗||_1 = 1 + ε. However, for all ε ∈ (0, 0.1), if we form a feasible solution using only φ1, φ3, and φ4, we obtain the alternate solution w = [(1 − 10ε), 0, 5√1.02 ε, 5√1.02 ε]^T with ||w||_1 ≈ 1 + 0.1ε. Since this has a smaller ℓ1 norm for all ε in the specified range, BP will necessarily fail and so again, we cannot reproduce the result for a similar reason as before.\n\nAt this point, it remains unclear what probability distributions are likely to produce weights that satisfy the conditions of Result 1. It turns out that the Jeffreys prior, given by p(x) ∝ 1/x, is appropriate for this task. This distribution has the unique property that the probability mass assigned to any given scaling is equal. More explicitly, for any s ≥ 1,\n\nP( x ∈ [s^i, s^{i+1}] ) ∝ log(s)  ∀i ∈ Z.  (11)\n\nFor example, the probability that x is between 1 and 10 equals the probability that it lies between 10 and 100 or between 0.01 and 0.1. Because this is an improper density, we define an approximate Jeffreys prior with range parameter a ∈ (0, 1]. 
Specifically, we say that x ∼ J(a) if\n\np(x) = −1 / (2 log(a) x)  for x ∈ [a, 1/a].  (12)\n\nWith this definition in mind, we present the following result.\n\nResult 2. For a fixed Φ that satisfies the URP, let t be generated by t = Φw′, where w′ has magnitudes drawn iid from J(a). Then as a approaches zero, the probability that we obtain a w′ such that the conditions of Result 1 are satisfied approaches unity.\n\nAgain, for space considerations, we refer the reader to [11]. However, on a conceptual level this result can be understood by considering the distribution of order statistics. For example, given M samples from a uniform distribution between zero and some θ, with probability approaching one, the distance between the k-th and (k + 1)-th order statistic can be made arbitrarily large as θ moves towards infinity. Likewise, with the J(a) distribution, the relative scaling between order statistics can be increased without bound as a decreases towards zero, leading to the stated result.\n\nCorollary 2. Assume that all but D′ < N randomly selected elements of w′ are set to zero. Then as a approaches zero, the probability that we satisfy the conditions of Corollary 1 approaches unity.\n\nIn conclusion, we have shown that a simple, (approximate) noninformative Jeffreys prior leads to sparse inverse problems that are optimally solved via SBL with high probability. Interestingly, it is this same Jeffreys prior that forms the implicit weight prior of SBL (see [6], Section 5.1). However, it is worth mentioning that other Jeffreys prior-based techniques, e.g., direct maximization of p(w) ∝ ∏_i 1/|wi| subject to t = Φw, do not provide any SBL-like guarantees. Although several algorithms do exist that can perform such a task (e.g., [7, 8]), they perform poorly with respect to (1) because of convergence to local minima as shown in [9, 10]. This is especially true if the weights are highly scaled, and no nontrivial equivalence results are known to exist for these procedures.\n\n3 Worst-Case Scenario\n\nIf the best-case scenario occurs when the nonzero weights are all of very different scales, it seems reasonable that the most difficult sparse inverse problem may involve weights of the same or even identical scale, e.g., w̄∗1 = w̄∗2 = . . . = w̄∗D∗. This notion can be formalized somewhat by considering the w̄∗ distribution that is furthest from the Jeffreys prior. First, we note that both the SBL cost function and update rules are independent of the overall scaling of the generating weights, meaning αw̄∗ is functionally equivalent to w̄∗ provided α is nonzero. This invariance must be taken into account in our analysis. Therefore, we assume the weights are rescaled such that Σ_i w̄∗i = 1. Given this restriction, we will find the distribution of weight magnitudes that is most different from the Jeffreys prior.\n\nUsing the standard procedure for changing the parameterization of a probability density, the joint density of the constrained variables can be computed simply as\n\np(w̄∗1, . . . , w̄∗D∗) ∝ 1 / ∏_{i=1}^{D∗} w̄∗i  for  Σ_{i=1}^{D∗} w̄∗i = 1, w̄∗i ≥ 0, ∀i.  (13)\n\nFrom this expression, it is easily shown that w̄∗1 = w̄∗2 = . . . = w̄∗D∗ achieves the global minimum. Consequently, equal weights are the absolute least likely to occur from the Jeffreys prior. 
Hence, we may argue that the distribution that assigns w̄∗i = 1/D∗ with probability one is furthest from the constrained Jeffreys prior.\n\nNevertheless, because of the complexity of the SBL framework, it is difficult to prove axiomatically that w̄∗ ∼ 1 is overall the most problematic distribution with respect to sparse recovery. We can however provide additional motivation for why we should expect it to be unwieldy. As proven in [9], the global minimum of the SBL cost function is guaranteed to produce some w∗ ∈ W∗. This minimum is achieved with the hyperparameters γ∗i = (w∗i)², ∀i. We can think of this solution as forming a collapsed, or degenerate, covariance Σ∗t = ΦΓ∗Φ^T that occupies a proper D∗-dimensional subspace of N-dimensional signal space. Moreover, this subspace must necessarily contain the signal vector t. Essentially, Σ∗t ascribes infinite density to t, leading to the globally minimizing solution.\n\nNow consider an alternative covariance Σ⋄t that, although still full rank, is nonetheless ill-conditioned (flattened), containing t within its high density region. Furthermore, assume that Σ⋄t is not well aligned with the subspace formed by Σ∗t. The mixture of two flattened, yet misaligned covariances naturally leads to a more voluminous (less dense) form as measured by the determinant |αΣ∗t + βΣ⋄t|. Thus, as we transition from Σ⋄t to Σ∗t, we necessarily reduce the density at t, thereby increasing the cost function L(γ). So if SBL converges to Σ⋄t it has fallen into a local minimum.\n\nSo the question remains, what values of w̄∗ are likely to create the most situations where this type of local minimum occurs? 
The issue is resolved when we again consider the D∗-dimensional subspace determined by Σ∗t. The volume of the covariance within this subspace is given by |Φ̄∗Γ̄∗Φ̄∗T|, where Φ̄∗ and Γ̄∗ are the basis vectors and hyperparameters associated with w̄∗. The larger this volume, the higher the probability that other basis vectors will be suitably positioned so as to both (i), contain t within the high density portion and (ii), maintain a sufficient component that is misaligned with the optimal covariance.\n\nThe maximum volume of |Φ̄∗Γ̄∗Φ̄∗T| under the constraints Σ_i w̄∗i = 1 and γ̄∗i = (w̄∗i)² occurs with γ̄∗i = 1/(D∗)², i.e., all the w̄∗i are equal. Consequently, geometric considerations support the notion that deviance from the Jeffreys prior leads to difficulty recovering w∗. Moreover, empirical analysis (not shown) of the relationship between volume and local minimum avoidance provides further corroboration of this hypothesis.\n\n4 Empirical Comparisons\n\nThe central purpose of this section is to present empirical evidence that supports our theoretical analysis and illustrates the improved performance afforded by SBL. As previously mentioned, others have established deterministic equivalence conditions, dependent on D∗, whereby BP and OMP are guaranteed to find the unique w∗. Unfortunately, the relevant theorems are of little value in assessing practical differences between algorithms. 
This is because, in the cases we have tested where BP/OMP equivalence is provably known to hold (e.g., via results in [1, 4, 5]), SBL always converges to w∗ as well.\n\nAs such, we will focus our attention on the insights provided by Sections 2 and 3 as well as probabilistic comparisons with [3]. Given a fixed distribution for the nonzero elements of w∗, we will assess which algorithm is best (at least empirically) for most dictionaries relative to a uniform measure on the unit sphere as discussed.\n\nTo this effect, a number of Monte Carlo simulations were conducted, each consisting of the following: First, a random, overcomplete N × M dictionary Φ is created whose columns are each drawn uniformly from the surface of an N-dimensional unit hypersphere. Next, sparse weight vectors w∗ are randomly generated with D∗ nonzero entries. Nonzero amplitudes w̄∗ are drawn iid from an experiment-dependent distribution. Response values are then computed as t = Φw∗. Each algorithm is presented with t and Φ and attempts to estimate w∗. In all cases, we ran 1000 independent trials and compared the number of times each algorithm failed to recover w∗. Under the specified conditions for the generation of Φ and t, all other feasible solutions w almost surely have a diversity greater than D∗, so our synthetically generated w∗ must be maximally sparse. Moreover, Φ will almost surely satisfy the URP.\n\nWith regard to particulars, there are essentially four variables with which to experiment: (i) the distribution of w̄∗, (ii) the diversity D∗, (iii) N, and (iv) M. In Figure 1, we display results from an array of testing conditions. 
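A single trial of the kind just described can be sketched as follows. This is our own illustrative code; the J(a) draw uses the inverse-CDF transform x = a^(1−2u) implied by (12), and `make_trial` is a hypothetical helper name:

```python
import numpy as np

def random_dictionary(N, M, rng):
    """Columns drawn uniformly from the surface of the unit hypersphere."""
    Phi = rng.standard_normal((N, M))
    return Phi / np.linalg.norm(Phi, axis=0)

def sample_jeffreys(a, size, rng):
    """Magnitudes from the approximate Jeffreys prior J(a) on [a, 1/a]."""
    u = rng.uniform(size=size)
    return a ** (1.0 - 2.0 * u)   # inverse CDF of p(x) = -1/(2 log(a) x)

def make_trial(N, M, D, a, rng):
    """Generate (Phi, t, w_star) for one Monte Carlo trial."""
    Phi = random_dictionary(N, M, rng)
    w_star = np.zeros(M)
    support = rng.choice(M, size=D, replace=False)
    # random signs; the sign pattern is irrelevant for random dictionaries
    w_star[support] = sample_jeffreys(a, D, rng) * rng.choice([-1, 1], size=D)
    return Phi, Phi @ w_star, w_star

rng = np.random.default_rng(0)
Phi, t, w_star = make_trial(N=50, M=100, D=16, a=0.001, rng=rng)
```

Each candidate algorithm would then be handed `(Phi, t)` and judged on whether it returns `w_star`.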
In each row of the figure, w̄∗i is drawn iid from a fixed distribution for all i; the first row uses w̄∗i = 1, the second has w̄∗i ∼ J(a = 0.001), and the third uses w̄∗i ∼ N(0, 1), i.e., a unit Gaussian. In all cases, the signs of the nonzero weights are irrelevant due to the randomness inherent in the basis vectors.\n\nThe columns of Figure 1 are organized as follows: The first column is based on the values N = 50, D∗ = 16, while M is varied from N to 5N, testing the effects of an increasing level of dictionary redundancy, M/N. The second fixes N = 50 and M = 100 while D∗ is varied from 10 to 30, exploring the ability of each algorithm to resolve an increasing number of nonzero weights. Finally, the third column fixes M/N = 2 and D∗/N ≈ 0.3 while N, M, and D∗ are increased proportionally. This demonstrates how performance scales with larger problem sizes.\n\n[Figure 1: a 3 × 3 grid of error-rate plots with legend OMP, BP, SBL. Column titles: Redundancy Test (N = 50, D∗ = 16); Diversity Test (N = 50, M = 100); Signal Size Test (M/N = 2, D∗/N = 0.32). Row y-axes: error rate with unit weights, Jeffreys weights, and Gaussian weights. X-axes: Redundancy Ratio (M/N), Diversity (D∗), Signal Size (N).]\n\nFigure 1: Empirical results comparing the probability that OMP, BP, and SBL fail to find w∗ under various testing conditions. Each data point is based on 1000 independent trials. The distribution of the nonzero weight amplitudes is labeled on the far left for each row, while the values for N, M, and D∗ are included on the top of each column. Independent variables are labeled along the bottom of the figure.\n\nThe first row of plots essentially represents the worst-case scenario for SBL per our previous analysis, and yet performance is still consistently better than both BP and OMP. In contrast, the second row of plots approximates the best-case performance for SBL, where we see that SBL is almost infallible. The handful of failure events that do occur are because a is not sufficiently small and therefore J(a) was not sufficiently close to a true Jeffreys prior to achieve perfect equivalence (see center plot). Although OMP also does well here, the parameter a can generally never be adjusted such that OMP always succeeds. Finally, the last row of plots, based on Gaussian distributed weight amplitudes, reflects a balance between these two extremes. Nonetheless, SBL still holds a substantial advantage.\n\nIn general, we observe that SBL is capable of handling more redundant dictionaries (column one) and resolving a larger number of nonzero weights (column two). Also, column three illustrates that both BP and SBL are able to resolve a number of weights that grows linearly in the signal dimension (≈ 0.3N), consistent with the analysis in [3] (which applies only to BP). In contrast, OMP performance begins to degrade in some cases (see the upper right plot), a potential limitation of this approach. 
Of course, additional study is necessary to fully compare the relative performance of these methods on large-scale problems.\n\nFinally, by comparing rows one, two, and three, we observe that the performance of BP is roughly independent of the weight distribution, with performance slightly below the worst-case SBL performance. Like SBL, OMP results are highly dependent on the distribution; however, as the weight distribution approaches the unit (all-equal) case, performance is unsatisfactory. In summary, while the relative proficiency between OMP and BP is contingent on experimental particulars, SBL is uniformly superior in the cases we have tested (including examples not shown, e.g., results with other dictionary types).\n\n5 Conclusions\n\nIn this paper, we have related the ability to find maximally sparse solutions to the particular distribution of amplitudes that compose the nonzero elements. At first glance, it may seem reasonable that the most difficult sparse inverse problems occur when some of the nonzero weights are extremely small, making them difficult to estimate. Perhaps surprisingly then, we have shown that the exact opposite is true with SBL: The more diverse the weight magnitudes, the better the chances we have of learning the optimal solution. In contrast, unit weights offer the most challenging task for SBL. Nonetheless, even in this worst-case scenario, we have shown that SBL outperforms the current state-of-the-art; the overall assumption here being that, if worst-case performance is superior, then it is likely to perform better in a variety of situations.\n\nFor a fixed dictionary and diversity D∗, successful recovery of unit weights does not absolutely guarantee that any alternative weighting scheme will necessarily be recovered as well. 
However, a weaker result does appear to be feasible: For fixed values of N, M, and D∗, if the success rate recovering unity weights approaches one for most dictionaries, where most is defined as in Section 1, then the success rate recovering weights of any other distribution (assuming they are distributed independently of the dictionary) will also approach one. While a formal proof of this conjecture is beyond the scope of this paper, it seems to be a very reasonable result that is certainly borne out by experimental evidence, geometric considerations, and the arguments presented in Section 3. Nonetheless, this remains a fruitful area for further inquiry.\n\nReferences\n\n[1] D. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," Proc. Nat. Acad. Sci., vol. 100, no. 5, pp. 2197–2202, March 2003.\n\n[2] R. Gribonval and M. Nielsen, "Sparse representations in unions of bases," IEEE Transactions on Information Theory, vol. 49, pp. 3320–3325, Dec. 2003.\n\n[3] D. Donoho, "For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution," Stanford University Technical Report, September 2004.\n\n[4] J.J. Fuchs, "On sparse representations in arbitrary redundant bases," IEEE Transactions on Information Theory, vol. 50, no. 6, pp. 1341–1344, June 2004.\n\n[5] J.A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Transactions on Information Theory, vol. 50, no. 10, pp. 2231–2242, October 2004.\n\n[6] M.E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, pp. 211–244, 2001.\n\n[7] I.F. Gorodnitsky and B.D. 
Rao, \u201cSparse signal reconstruction from limited data using FOCUSS:\nA re-weighted minimum norm algorithm,\u201d IEEE Transactions on Signal Processing, vol. 45, no.\n3, pp. 600\u2013616, March 1997.\n\n[8] M.A.T. Figueiredo, \u201cAdaptive sparseness using Jeffreys prior,\u201d Advances in Neural Information\n\nProcessing Systems 14, pp. 697\u2013704, 2002.\n\n[9] D.P. Wipf and B.D. Rao, \u201c\u21130-norm minimization for basis selection,\u201d Advances in Neural\n\nInformation Processing Systems 17, pp. 1513\u20131520, 2005.\n\n[10] D.P. Wipf and B.D. Rao, \u201cSparse Bayesian learning for basis selection,\u201d IEEE Transactions on\n\nSignal Processing, vol. 52, no. 8, pp. 2153\u20132164, 2004.\n\n[11] D.P. Wipf, To appear in Bayesian Methods for Sparse Signal Representation, PhD Dissertation,\n\nUC San Diego, 2006 (estimated). http://dsp.ucsd.edu/\u223cdwipf/\n\n\f", "award": [], "sourceid": 2771, "authors": [{"given_name": "Bhaskar", "family_name": "Rao", "institution": null}, {"given_name": "David", "family_name": "Wipf", "institution": null}]}