{"title": "Maximal Sparsity with Deep Networks?", "book": "Advances in Neural Information Processing Systems", "page_first": 4340, "page_last": 4348, "abstract": "The iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal $\\ell_0$-norm representations in regimes where existing methods fail. 
The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.", "full_text": "Maximal Sparsity with Deep Networks?\n\nBo Xin1,2, Yizhou Wang1, Wen Gao1, Baoyuan Wang3, David Wipf2\n\n1Peking University  2Microsoft Research, Beijing  3Microsoft Research, Redmond\n\n{yizhou.wang, wgao}@pku.edu.cn  {boxin, baoyuanw, davidwip}@microsoft.com\n\nAbstract\n\nThe iterations of many sparse estimation algorithms are comprised of a fixed linear filter cascaded with a thresholding nonlinearity, which collectively resemble a typical neural network layer. Consequently, a lengthy sequence of algorithm iterations can be viewed as a deep network with shared, hand-crafted layer weights. It is therefore quite natural to examine the degree to which a learned network model might act as a viable surrogate for traditional sparse estimation in domains where ample training data is available. While the possibility of a reduced computational budget is readily apparent when a ceiling is imposed on the number of layers, our work primarily focuses on estimation accuracy. In particular, it is well-known that when a signal dictionary has coherent columns, as quantified by a large RIP constant, then most tractable iterative algorithms are unable to find maximally sparse representations. In contrast, we demonstrate both theoretically and empirically the potential for a trained deep network to recover minimal \u21130-norm representations in regimes where existing methods fail. 
The resulting system, which can effectively learn novel iterative sparse estimation algorithms, is deployed on a practical photometric stereo estimation problem, where the goal is to remove sparse outliers that can disrupt the estimation of surface normals from a 3D scene.\n\n1 Introduction\n\nOur launching point is the optimization problem\n\nmin_x \u2016x\u2016_0  s.t.  y = \u03a6x,   (1)\n\nwhere y \u2208 Rn is an observed vector, \u03a6 \u2208 Rn\u00d7m is some known, overcomplete dictionary of feature/basis vectors with m > n, and \u2016\u00b7\u2016_0 denotes the \u21130 norm of a vector, or a count of the number of nonzero elements. Consequently, (1) can be viewed as the search for a maximally sparse feasible vector x\u2217 (or approximately feasible if the constraint is relaxed). Unfortunately however, direct assault on (1) involves an intractable, combinatorial optimization process, and therefore efficient alternatives that return a maximally sparse x\u2217 with high probability in restricted regimes are sought. Popular examples with varying degrees of computational overhead include convex relaxations such as \u21131-norm minimization [2, 5, 21], greedy approaches like orthogonal matching pursuit (OMP) [18, 22], and many flavors of iterative hard-thresholding (IHT) [3, 4].\n\nVariants of these algorithms find practical relevance in numerous disparate domains, including feature selection [7, 8], outlier removal [6, 13], compressive sensing [5], and source localization [1, 16]. However, a fundamental weakness underlies them all: if the Gram matrix \u03a6\u22a4\u03a6 has significant off-diagonal energy, indicative of strong coherence between columns of \u03a6, then estimation of x\u2217 may be extremely poor. 
Loosely speaking this occurs because, as higher correlation levels are present, the null-space of \u03a6 is more likely to include large numbers of approximately sparse vectors that tend to distract existing algorithms in the feasible region, an unavoidable nuisance in many practical applications.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn this paper we consider recent developments in the field of deep learning as an entry point for improving the performance of sparse recovery algorithms. Although seemingly unrelated at first glance, the layers of a deep neural network (DNN) can be viewed as iterations of some algorithm that have been unfolded into a network structure [9, 11]. In particular, iterative thresholding approaches such as IHT mentioned above typically involve an update rule comprised of a fixed, linear filter followed by a non-linear activation function that promotes sparsity. Consequently, algorithm execution can be interpreted as passing an input through an extremely deep network with constant weights (dependent on \u03a6) at every layer. This \u2018unfolding\u2019 viewpoint immediately suggests that we consider substituting discriminatively learned weights in place of those inspired by the original sparse recovery algorithm. For example, it has been argued that, given access to a sufficient number of {x\u2217, y} pairs, a trained network may be capable of producing quality sparse estimates with only a few layers. 
This in turn can lead to a dramatically reduced computational burden relative to purely optimization-based approaches [9, 19, 23] or to enhanced non-linearities for use with traditional iterative algorithms [15].\n\nWhile existing empirical results are promising, especially in terms of the reduction in computational footprint, there is as of yet no empirical demonstration of a learned deep network that can unequivocally recover maximally sparse vectors x\u2217 with greater accuracy than conventional, state-of-the-art optimization-based algorithms, especially with a highly coherent \u03a6. Nor is there supporting theoretical evidence elucidating the exact mechanism whereby learning may be expected to improve the estimation accuracy, especially in the presence of coherent dictionaries. This paper attempts to fill in some of these gaps, and our contributions can be distilled to the following points:\n\nQuantifiable Benefits of Unfolding: We rigorously dissect the benefits of unfolding conventional sparse estimation algorithms to produce trainable deep networks. This includes a precise characterization of exactly how different architecture choices can affect the ability to improve so-called restricted isometry property (RIP) constants, which measure the degree of disruptive correlation in \u03a6. 
This helps to quantify the limits of shared layer weights, which are the standard template of existing methods [9, 19, 23], and motivates more flexible network constructions reminiscent of LSTM cells [12] that account for multi-resolution structure in \u03a6 in a previously unexplored fashion. Note that we defer all proofs, as well as many additional analyses and problem details, to a longer companion paper [26].\n\nIsolation of Important Factors: Based on these theoretical insights, and a better understanding of the essential factors governing performance, we establish the degree to which it is favorable to diverge from strict conformity to any particular unfolded algorithmic script. In particular, we argue that layer-wise independent weights and/or activations are essential, while retention of the original thresholding non-linearities and squared-error loss implicit to many sparse algorithms is not. We also recast the core problem as deep multi-label classification given that optimal support pattern recovery is the primary concern. This allows us to adopt a novel training paradigm that is less sensitive to the specific distribution encountered during testing. Ultimately, we develop the first, ultra-fast sparse estimation algorithm (or more precisely a learning procedure that produces such an algorithm) that can effectively deal with coherent dictionaries and adversarial RIP constants.\n\nState-of-the-Art Empirical Performance: We apply the proposed system to a practical photometric stereo computer vision problem, where the goal is to estimate the 3D geometry of an object using only 2D photos taken from a single camera under different lighting conditions. In this context, shadows and specularities represent sparse outliers that must be simultaneously removed from \u223c 10^4\u201310^6 surface points. 
We achieve state-of-the-art performance using only weak supervision despite a minuscule computational budget appropriate for real-time mobile environments.\n\n2 From Iterative Hard Thresholding (IHT) to Deep Neural Networks\n\nAlthough any number of iterative algorithms could be adopted as our starting point, here we examine IHT because it is representative of many other sparse estimation paradigms and is amenable to theoretical analysis. With knowledge of an upper bound on the true cardinality, solving (1) can be replaced by the equivalent problem\n\nmin_x (1/2)\u2016y \u2212 \u03a6x\u2016_2^2  s.t.  \u2016x\u2016_0 \u2264 k.   (2)\n\nIHT attempts to minimize (2) using what can be viewed as computationally-efficient projected gradient iterations [3]. Let x(t) denote the estimate of some maximally sparse x\u2217 after t iterations. The aggregate IHT update computes\n\nx(t+1) = Hk[x(t) \u2212 \u00b5\u03a6\u22a4(\u03a6x(t) \u2212 y)],   (3)\n\nwhere \u00b5 is a step-size parameter and Hk[\u00b7] is a hard-thresholding operator that sets all but the k largest values (in magnitude) of a vector to zero. For the vanilla version of IHT, the step-size \u00b5 = 1 leads to a number of recovery guarantees whereby iterating (3), starting from x(0) = 0, is guaranteed to reduce (2) at each step before eventually converging to the globally optimal solution. 
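As a concrete illustration (our own sketch, not code from the paper; `hard_threshold` and `iht` are hypothetical helper names), the update (3) can be written in a few lines of NumPy:

```python
import numpy as np

def hard_threshold(x, k):
    """H_k[x]: keep the k largest-magnitude entries of x, zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def iht(y, Phi, k, mu=1.0, n_iter=100):
    """Iterate x <- H_k[x - mu * Phi^T (Phi x - y)], starting from x = 0."""
    x = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        x = hard_threshold(x - mu * Phi.T @ (Phi @ x - y), k)
    return x
```

With a well-conditioned (near-orthogonal) Phi and a small enough k, these iterations recover the generating sparse vector, consistent with the guarantees cited above.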
These results hinge on properties of \u03a6 which relate to the coherence structure of dictionary columns as encapsulated by the following definition.\n\nDefinition 1 (Restricted Isometry Property) A dictionary \u03a6 satisfies the Restricted Isometry Property (RIP) with constant \u03b4k[\u03a6] < 1 if\n\n(1 \u2212 \u03b4k[\u03a6])\u2016x\u2016_2^2 \u2264 \u2016\u03a6x\u2016_2^2 \u2264 (1 + \u03b4k[\u03a6])\u2016x\u2016_2^2   (4)\n\nholds for all {x : \u2016x\u2016_0 \u2264 k}.\n\nIn brief, the smaller the value of the RIP constant \u03b4k[\u03a6], the closer any sub-matrix of \u03a6 with k columns is to being orthogonal (i.e., it has less correlation structure). It is now well-established that dictionaries with smaller values of \u03b4k[\u03a6] lead to sparse recovery problems that are inherently easier to solve. For example, in the context of IHT, it has been shown [3] that if y = \u03a6x\u2217, with \u2016x\u2217\u2016_0 \u2264 k and \u03b43k[\u03a6] < 1/\u221a32, then at iteration t of (3) we will have \u2016x(t) \u2212 x\u2217\u2016_2 \u2264 2^\u2212t \u2016x\u2217\u2016_2. It follows that as t \u2192 \u221e, x(t) \u2192 x\u2217, meaning that we recover the true, generating x\u2217. Moreover, it can be shown that this x\u2217 is also the unique, optimal solution to (1) [5].\n\nThe success of IHT in recovering maximally sparse solutions crucially depends on the RIP-based condition that \u03b43k[\u03a6] < 1/\u221a32, which heavily constrains the degree of correlation structure in \u03a6 that can be tolerated. While dictionaries with columns drawn independently and uniformly from the surface of a unit hypersphere (or with elements drawn iid from N(0, 1/n)) will satisfy this condition with high probability provided k is small enough [6], for many/most practical problems of interest we cannot rely on this type of IHT recovery guarantee. 
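For intuition, Definition 1 can be checked by brute force on tiny dictionaries (our own illustration; as noted below, this is infeasible beyond toy sizes): \u03b4k[\u03a6] is the largest deviation from 1 of the eigenvalues of S\u22a4S over all k-column submatrices S of \u03a6.

```python
import numpy as np
from itertools import combinations

def rip_constant(Phi, k):
    """Brute-force delta_k[Phi]: max over all k-column submatrices S of Phi
    of the deviation of the eigenvalues of S^T S from 1."""
    m = Phi.shape[1]
    delta = 0.0
    for cols in combinations(range(m), k):
        S = Phi[:, cols]
        w = np.linalg.eigvalsh(S.T @ S)  # sorted eigenvalues
        delta = max(delta, abs(w[0] - 1.0), abs(w[-1] - 1.0))
    return delta
```

For example, an orthonormal dictionary gives \u03b4k = 0, while a dictionary containing two identical unit-norm columns gives \u03b42 = 1, the worst case.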
In fact, except for randomized dictionaries in high dimensions where tight bounds exist, we cannot even compute the value of \u03b43k[\u03a6], which requires calculating the spectral norm of (m choose 3k) subsets of dictionary columns.\n\nThere are many ways nature might structure a dictionary such that IHT (or most any other existing sparse estimation algorithm) will fail. Here we examine one of the most straightforward forms of dictionary coherence that can easily disrupt performance. Consider the situation where \u03a6 = [\u03b5A + uv\u22a4]N, where columns of A \u2208 Rn\u00d7m and u \u2208 Rn are drawn iid from the surface of a unit hypersphere, while v \u2208 Rm is arbitrary. Additionally, \u03b5 > 0 is a scalar and N is a diagonal normalization matrix that scales each column of \u03a6 to have unit \u21132 norm. It then follows that if \u03b5 is sufficiently small, the rank-one component begins to dominate, and there is no value of 3k such that \u03b43k[\u03a6] < 1/\u221a32. In this type of problem we hypothesize that DNNs provide a potential avenue for improvement to the extent that they might be able to compensate for disruptive correlations in \u03a6. For example, at the most basic level we might consider general networks with layer t defined by\n\nx(t+1) = f(\u03a8x(t) + \u0393y),   (5)\n\nwhere f : Rm \u2192 Rm is a non-linear activation function, and \u03a8 \u2208 Rm\u00d7m and \u0393 \u2208 Rm\u00d7n are arbitrary. Moreover, given access to training pairs {x\u2217, y}, where x\u2217 is a sparse vector such that y = \u03a6x\u2217, we can optimize \u03a8 and \u0393 using traditional stochastic gradient descent just like any other DNN structure. We will first precisely characterize the extent to which this adaptation affords any benefit over IHT where f(\u00b7) = Hk[\u00b7]. 
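To make the correspondence explicit, here is a minimal sketch (our own illustration, with hypothetical names) verifying that one layer of the form (5) reduces to an IHT step (3) under the particular choice \u03a8 = I \u2212 \u00b5\u03a6\u22a4\u03a6 and \u0393 = \u00b5\u03a6\u22a4:

```python
import numpy as np

def hard_threshold(x, k):
    """H_k[x]: keep the k largest-magnitude entries, zero out the rest."""
    out = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]
    out[keep] = x[keep]
    return out

def unfolded_layer(x, y, Psi, Gamma, f):
    """One generic network layer x(t+1) = f(Psi x(t) + Gamma y), as in (5)."""
    return f(Psi @ x + Gamma @ y)

# IHT as a special case: Psi = I - mu*Phi^T Phi, Gamma = mu*Phi^T recover (3).
rng = np.random.default_rng(0)
n, m, k, mu = 5, 12, 2, 0.5
Phi = rng.standard_normal((n, m))
x, y = rng.standard_normal(m), rng.standard_normal(n)

layer_out = unfolded_layer(x, y, np.eye(m) - mu * Phi.T @ Phi, mu * Phi.T,
                           lambda z: hard_threshold(z, k))
iht_step = hard_threshold(x - mu * Phi.T @ (Phi @ x - y), k)
```

The learned variants discussed in the text simply let \u03a8 and \u0393 depart from this hand-crafted assignment.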
Later we will consider flexible, layer-specific non-linearities f(t) and parameters {\u03a8(t), \u0393(t)}.\n\n3 Analysis of Adaptable Weights and Activations\n\nFor simplicity in this section we restrict ourselves to the fixed hard-threshold operator Hk[\u00b7] across all layers; however, many of the conclusions borne out of our analysis nonetheless carry over to a much wider range of activation functions f. In general it is difficult to analyze how arbitrary \u03a8 and \u0393 may improve upon the fixed parameterization from (3), where \u03a8 = I \u2212 \u03a6\u22a4\u03a6 and \u0393 = \u03a6\u22a4 (assuming \u00b5 = 1). Fortunately though, we can significantly collapse the space of potential weight matrices by including the natural requirement that if x\u2217 represents the true, maximally sparse solution, then it must be a fixed-point of (5). Indeed, without this stipulation the iterations could diverge away from the globally optimal value of x, something IHT itself will never do. These considerations lead to the following:\n\nProposition 1 Consider a generalized IHT-based network layer given by (5) with f(\u00b7) = Hk[\u00b7] and let x\u2217 denote any unique, maximally sparse feasible solution to y = \u03a6x with \u2016x\u2016_0 \u2264 k. Then to ensure that any such x\u2217 is a fixed point it must be that \u03a8 = I \u2212 \u0393\u03a6.\n\nAlthough \u0393 remains unconstrained, this result has restricted \u03a8 to be a rank-n factor, parameterized by \u0393, subtracted from an identity matrix. Certainly this represents a significant contraction of the space of \u2018reasonable\u2019 parameterizations for a general IHT layer. 
In light of Proposition 1, we may then further consider whether the added generality of \u0393 (as opposed to the original fixed assignment \u0393 = \u03a6\u22a4) affords any further benefit to the revised IHT update\n\nx(t+1) = Hk[(I \u2212 \u0393\u03a6)x(t) + \u0393y].   (6)\n\nFor this purpose we note that (6) can be interpreted as a projected gradient descent step for solving\n\nmin_x (1/2)x\u22a4\u0393\u03a6x \u2212 x\u22a4\u0393y  s.t.  \u2016x\u2016_0 \u2264 k.   (7)\n\nHowever, if \u0393\u03a6 is not positive semi-definite, then this objective is no longer even convex, and combined with the non-convex constraint is likely to produce an even wider constellation of troublesome local minima with no clear affiliation with the global optimum of our original problem from (2). Consequently it does not immediately appear that \u0393 \u2260 \u03a6\u22a4 is likely to provide any tangible benefit. However, there do exist important exceptions. The first indication of how learning a general \u0393 might help comes from the following result:\n\nProposition 2 Suppose that \u0393 = D\u03a6\u22a4WW\u22a4, where W is an arbitrary matrix of appropriate dimension and D is a full-rank diagonal that jointly solve\n\n\u03b4\u2217_3k[\u03a6] \u225c inf_{W,D} \u03b43k[W\u03a6D].   (8)\n\nMoreover, assume that \u03a6 is substituted with \u03a6D in (6), meaning we have simply replaced \u03a6 with a new dictionary that has scaled columns. Given these qualifications, if y = \u03a6x\u2217, with \u2016x\u2217\u2016_0 \u2264 k and \u03b4\u2217_3k[\u03a6] < 1/\u221a32, then at iteration t of (6)\n\n\u2016D^\u22121 x(t) \u2212 D^\u22121 x\u2217\u2016_2 \u2264 2^\u2212t \u2016D^\u22121 x\u2217\u2016_2.   (9)\n\nIt follows that as t \u2192 \u221e, x(t) \u2192 x\u2217, meaning that we recover the true, generating x\u2217. 
Additionally, it can be guaranteed that after a finite number of iterations, the correct support pattern will be discovered. And it should be emphasized that rescaling \u03a6 by some known diagonal D is a common prescription for sparse estimation (e.g., column normalization) that does not alter the optimal \u21130-norm support pattern.1\n\nBut the real advantage over regular IHT comes from the fact that \u03b4\u2217_3k[\u03a6] \u2264 \u03b43k[\u03a6], and in many practical cases, \u03b4\u2217_3k[\u03a6] \u226a \u03b43k[\u03a6], which implies success can be guaranteed across a much wider range of RIP conditions. For example, if we revisit the dictionary \u03a6 = [\u03b5A + uv\u22a4]N, an immediate benefit can be observed. More concretely, for \u03b5 sufficiently small we argued that \u03b43k[\u03a6] > 1/\u221a32 for all k, and consequently convergence to the optimal solution may fail. In contrast, it can be shown that \u03b4\u2217_3k[\u03a6] will remain quite small, satisfying \u03b4\u2217_3k[\u03a6] \u2248 \u03b43k[A], implying that performance will nearly match that of an equivalent recovery problem using A (and as we discussed above, \u03b43k[A] is likely to be relatively small per its unique, randomized design). The following result generalizes a sufficient regime whereby this is possible:\n\nCorollary 1 Suppose \u03a6 = [\u03b5A + \u2206r]N, where elements of A are drawn iid from N(0, 1/n), \u2206r is any arbitrary matrix with rank[\u2206r] = r < n, and N is a diagonal matrix (e.g., one that enforces unit \u21132 column norms). 
Then\n\nE(\u03b4\u2217_3k[\u03a6]) \u2264 E(\u03b43k[\u00c3]),   (10)\n\nwhere \u00c3 denotes the matrix A with any r rows removed.\n\n1 Inclusion of this diagonal factor D can be equivalently viewed as relaxing Proposition 1 to hold under some fixed rescaling of \u03a6, i.e., an operation that preserves the optimal support pattern.\n\nAdditionally, as the size of \u03a6 grows proportionally larger, it can be shown that with overwhelming probability \u03b4\u2217_3k[\u03a6] \u2264 \u03b43k[\u00c3]. Overall, these results suggest that we can essentially annihilate any potentially disruptive rank-r component \u2206r at the cost of implicitly losing r measurements (linearly independent rows of A, and implicitly the corresponding elements of y). Therefore, at least provided that r is sufficiently small such that \u03b43k[\u00c3] \u2248 \u03b43k[A], we can indeed be confident that a modified form of IHT can perform much like a system with an ideal RIP constant. And of course in practice we may not know how \u03a6 decomposes as some \u03a6 \u2248 [\u03b5A + \u2206r]N; however, to the extent that this approximation can possibly hold, the RIP constant can be improved nonetheless.\n\nIt should be noted that globally solving (8) is non-differentiable and intractable, but this is the whole point of incorporating a DNN network to begin with. If we have access to a large number of training pairs {x\u2217, y} generated using the true \u03a6, then during the course of the learning process a useful W and D can be implicitly estimated such that a maximal number of sparse vectors can be successfully recovered. Of course we will experience diminishing marginal returns as more non-ideal components enter the picture. 
In fact, it is not difficult to describe a slightly more sophisticated scenario such that use of layer-wise constant weights and activations is no longer capable of lowering \u03b43k[\u03a6] significantly at all, portending failure when it comes to accurate sparse recovery.\n\nOne such example is a clustered dictionary model (which we describe in detail in [26]), whereby columns of \u03a6 are grouped into a number of tight clusters with minimal angular dispersion. While the clusters themselves may be well-separated, the correlation within clusters can be arbitrarily large. In some sense this model represents the simplest partitioning of dictionary column correlation structure into two scales: the inter- and intra-cluster structures. Assuming the number of such clusters is larger than n, then layer-wise constant weights and activations are unlikely to provide adequate relief, since the implicit \u2206r factor described above will be full rank.\n\nFortunately, simple adaptations of IHT, which are reflective of many generic DNN structures, can remedy the problem. The core principle is to design a network such that earlier layers/iterations are tasked with exposing the correct support at the cluster level, without concern for accuracy within each cluster. Once the correct cluster support has been obtained, later layers can then be charged with estimating the fine-grain details of within-cluster support. We believe this type of multi-resolution sparse estimation is essential when dealing with highly coherent dictionaries. This can be accomplished with the following adaptations to IHT:\n\n1. The hard-thresholding operator is generalized to \u2018remember\u2019 previously learned cluster-level sparsity patterns, in much the same way that LSTM gates allow long term dependencies to propagate [12] or highway networks [20] facilitate information flow unfettered to deeper layers. 
Practically speaking this adaptation can be computed by passing the prior layer\u2019s activations x(t) through linear filters followed by indicator functions, again reminiscent of how DNN gating functions are typically implemented.\n\n2. We allow the layer weights {\u03a8(t), \u0393(t)} to vary from iteration to iteration t, sequencing through a fixed set akin to layers of a DNN.\n\nIn [26] we show that hand-crafted versions of these changes allow IHT to provably recover maximally sparse vectors x\u2217 in situations where existing algorithms fail.\n\n4 Discriminative Multi-Resolution Sparse Estimation\n\nAs implied previously, guaranteed success for most existing sparse estimation strategies hinges on the dictionary \u03a6 having columns drawn (approximately) from a uniform distribution on the surface of a unit hypersphere, or some similar condition to ensure that subsets of columns behave approximately like an orthogonal basis. Essentially this confines the structure of the dictionary to operate on a single universal scale. The clustered dictionary model described in the previous section considers a dictionary built on two different scales, with a cluster-level distribution (coarse) and tightly-packed within-cluster details (fine). But practical dictionaries may display structure operating across a variety of scales that interleave with one another, forming a continuum among multiple levels.\n\nWhen the scales are clearly demarcated, we have argued that it is possible to manually define a multi-resolution IHT-inspired algorithm that guarantees success in recovering the optimal support pattern; and indeed, IHT could be extended to handle a clustered dictionary model with nested structures across more than two scales. However, without clearly partitioned scales it is much less obvious how one would devise an optimal IHT modification. 
It is in this context that learning flexible algorithm iterations is likely to be most advantageous. In fact, the situation is not at all unlike many computer vision scenarios whereby handcrafted features such as SIFT may work optimally in confined, idealized domains, while learned CNN-based features are often more effective otherwise.\n\nGiven a sufficient corpus of {x\u2217, y} pairs linked via some fixed \u03a6, we can replace manual filter construction with a learning-based approach. On this point, although we view our results from Section 3 as a convincing proof of concept, it is unlikely that there is anything intrinsically special about the specific hard-threshold operator and layer-wise construction we employed per se, as long as we allow for deep, adaptable layers that can account for structure at multiple scales. For example, we expect that it is more important to establish a robust training pipeline that avoids stalling at the hand of vanishing gradients in a deep network, than to preserve the original IHT template analogous to existing learning-based methods. It is here that we propose several deviations:\n\nMulti-Label Classification Loss: We exploit the fact that in producing a maximally sparse vector x\u2217, the main challenge is estimating supp[x\u2217]. Once the support is obtained, computing the actual nonzero coefficients just boils down to solving a least squares problem. But any learning system will be unaware of this and could easily expend undue effort in attempting to match coefficient magnitudes at the expense of support recovery. Certainly the use of a data fit penalty of the form \u2016y \u2212 \u03a6x\u2016_2^2, as is adopted by nearly all sparse recovery algorithms, will expose us to this issue. Therefore we instead formulate sparse recovery as a multi-label classification problem. 
More specifically, instead of directly estimating x\u2217, we attempt to learn s\u2217 = [s\u2217_1, . . . , s\u2217_m]\u22a4, where s\u2217_i equals the indicator function I[x\u2217_i \u2260 0]. For this purpose we may then incorporate a traditional multi-label classification loss function via a final softmax output layer, which forces the network to only concern itself with learning support patterns. This substitution is further justified by the fact that even with traditional IHT, the support pattern will be accurately recovered before the iterations converge exactly to x\u2217. Therefore we may expect that fewer layers (as well as training data) are required if all we seek is a support estimate, opening the door for weaker forms of supervision.\n\nInstruments for Avoiding Bad Local Solutions: Given that IHT can take many iterations to converge on challenging problems, we may expect that a relatively deep network structure will be needed to obtain exact support recovery. We must therefore take care to avoid premature convergence to local minima or areas with vanishing gradient by incorporating several recent countermeasures proposed in the DNN community. For example, the adaptive variant of IHT described previously is reminiscent of highway networks or LSTM cells, which have been proposed to allow longer range flow of gradient information to improve convergence through the use of gating functions. An even simpler version of this concept involves direct, un-gated connections that allow much deeper \u2018residual\u2019 networks to be trained [10] (which is even suggestive of the residual factor embedded in the original IHT iterations). We deploy this tool, along with batch-normalization [14] to aid convergence, for our basic feedforward pipeline, along with an alternative structure based on recurrent LSTM cells. 
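The support-recovery reformulation can be sketched as follows (our own illustration; the paper describes its output layer and loss only at a high level, so the element-wise sigmoid cross-entropy below is a hedged stand-in for a generic multi-label classification loss, and all names are ours):

```python
import numpy as np

def support_labels(x_star):
    """Multi-label target s* with s*_i = I[x*_i != 0]."""
    return (x_star != 0).astype(float)

def multilabel_loss(logits, s):
    """Mean element-wise sigmoid cross-entropy over support indicators."""
    p = 1.0 / (1.0 + np.exp(-logits))
    eps = 1e-12
    return -np.mean(s * np.log(p + eps) + (1.0 - s) * np.log(1.0 - p + eps))

x_star = np.array([0.0, 0.4, 0.0, -0.2, 0.0])
s = support_labels(x_star)  # -> [0., 1., 0., 1., 0.]
```

Note how the target depends only on the support pattern, not on the coefficient amplitudes, which is exactly the point of the reformulation.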
Note that unfolded LSTM networks frequently receive a novel input for every time step, whereas here y is applied unaltered at every layer (more on this in [26]). We also replace the non-integrable hard-threshold operator with simple rectilinear (ReLU) units [17], which are functionally equivalent to one-sided soft-thresholding; this convex selection likely reduces the constellation of sub-optimal local minima during the training process.\n\n5 Experiments and Applications\n\nSynthetic Tests with Correlated Dictionaries: We generate a dictionary matrix \u03a6 \u2208 Rn\u00d7m using \u03a6 = \u2211_{i=1}^{n} (1/i\u00b2) u_i v_i\u22a4, where u_i \u2208 Rn and v_i \u2208 Rm have iid elements drawn from N(0, 1). We also rescale each column of \u03a6 to have unit \u21132 norm. \u03a6 generated in this way has super-linearly decaying singular values (indicating correlation between the columns) but is not constrained to any specific structure. Many dictionaries in real applications have such a property. As a basic experiment, we generate N = 700000 ground truth samples x\u2217 \u2208 Rm by randomly selecting d nonzero entries, with nonzero amplitudes drawn iid from the uniform distribution U[\u22120.5, 0.5], excluding the interval [\u22120.1, 0.1] to avoid small, relatively inconsequential contributions to the support pattern. We then create y \u2208 Rn via y = \u03a6x\u2217. As d increases, the estimation problem becomes more difficult. In fact, guaranteeing success with such correlated data (and a high RIP constant) requires evaluating on the order of (m choose n) linear systems of size n \u00d7 n, which is infeasible even for small values, indicative of how challenging it can be to solve sparse inverse problems of any size. We set n = 20 and m = 100.\n\nFigure 1: Average support recovery accuracy. Left: Uniformly distributed nonzero elements. Mid: Different network variants. Right: Different training and testing distributions. 
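The synthetic setup above can be reproduced roughly as follows (a sketch under stated assumptions: variable names are ours, and we draw a single sample rather than N = 700000):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 20, 100, 4

# Phi = sum_{i=1}^{n} (1/i^2) u_i v_i^T: super-linearly decaying singular
# values induce correlated columns; then rescale columns to unit l2 norm.
Phi = sum((1.0 / i**2) * np.outer(rng.standard_normal(n), rng.standard_normal(m))
          for i in range(1, n + 1))
Phi /= np.linalg.norm(Phi, axis=0, keepdims=True)

# d-sparse ground truth: amplitudes uniform on [-0.5, -0.1] U [0.1, 0.5].
x_star = np.zeros(m)
support = rng.choice(m, size=d, replace=False)
x_star[support] = rng.choice([-1.0, 1.0], size=d) * rng.uniform(0.1, 0.5, size=d)
y = Phi @ x_star
```

Pairs (x_star, y) generated this way constitute the training and test corpus for the experiments described next.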
(LSTM-Net results).\n\nWe used N1 = 600000 samples for training and the remaining N2 = 100000 for testing. Echoing our arguments in Section 4, we explored both a feedforward network with residual connections [10] and a recurrent network with vanilla LSTM cells [12]. To evaluate the performance, we check whether the d ground truth nonzeros are aligned with the predicted top-d values produced by our network, a common all-or-nothing metric in the compressive sensing literature. Detailed network design, optimization setup, and alternative metrics can be found in [26].\n\nFigure 1 (left) shows comparisons against a battery of existing algorithms, both learning- and optimization-based. These include standard \u21131 minimization via ISTA iterations [2], IHT [3] (supplied with the ground truth number of nonzeros), an ISTA-based network [9], and an IHT-inspired network [23]. For both the ISTA- and IHT-based networks, we used the exact same training data described above. Note that given the correlated \u03a6 matrix, the recovery performance of IHT, and to a lesser degree \u21131 minimization using ISTA, is rather modest as expected given that the associated RIP constant will be quite large by construction. In contrast our two methods achieve uniformly higher accuracy, including over other learning-based methods trained with the same data. This improvement is likely the result of three significant factors: (i) Existing learning methods initialize using weights derived from the original sparse estimation algorithms, but such an initialization will be associated with locally optimal solutions in most cases with correlated dictionaries. (ii) As described in Section 3, constant weights across layers have limited capacity to unravel multi-resolution dictionary structure, especially one that is not confined to only possess some low-rank correlating component. 
(iii) The quadratic loss function used by existing methods does not adequately focus resources on the crux of the problem, which is accurate support recovery. In contrast, we adopt an initialization motivated by DNN-based training considerations, unique layer weights to handle a multi-resolution dictionary, and a multi-label classification output layer to focus on support recovery.

To further isolate essential factors affecting performance, we next consider the following changes: (1) We remove the residual connections from Res-Net. (2) We replace ReLU with hard-threshold activations; in particular, we utilize the so-called HELU$_\sigma$ function introduced in [23], which is a continuous and piecewise linear approximation of the scalar hard-threshold operator. (3) We use a quadratic penalty layer instead of a multi-label classification loss layer, i.e., the loss function is changed to $\sum_{i=1}^{N_1} \| a^{(i)} - y^{(i)} \|_2^2$ (where $a$ is the output of the last fully-connected layer) during training. Figure 1 (middle) displays the associated recovery percentages, where we observe that in each case performance degrades. Without the residual design, and also with the inclusion of a rigid, non-convex hard-threshold operator, local minima during training appear to be a likely culprit, consistent with observations from [10]. Likewise, use of a least-squares loss function is likely to over-emphasize the estimation of coefficient amplitudes rather than focusing on support recovery.

Finally, from a practical standpoint we may expect that the true amplitude distribution will at times deviate from that of the original training set. To explore robustness to such mismatch, as well as to different amplitude distributions, we consider two sets of candidate data: the original data, and similarly-generated data but with the uniform distribution of nonzero elements replaced with the Gaussians $\mathcal{N}(\pm 0.3, 0.1)$, where the mean is selected with equal probability as either $-0.3$ or $0.3$, thus avoiding tiny magnitudes with high probability. Figure 1 (right) reports accuracies under different distributions for both training and testing, including mismatched cases. (The results are obtained using LSTM-Net, but Res-Net showed a similar pattern.) The label 'U2U' refers to training and testing with the uniformly distributed amplitudes, while 'U2N' uses a uniform training set and a Gaussian test set. Analogous definitions apply for 'N2N' and 'N2U'. In all cases we note that the performance is quite stable across training and testing conditions. We would argue that our recasting of the problem as multi-label classification contributes, at least in part, to this robustness. The application example described next demonstrates further tolerance of training-testing set mismatches.

(a) GT. (b) LS (E = 12.1, T = 4.1). (c) $\ell_1$ (E = 7.1, T = 33.7). (d) Ours (E = 1.5, T = 1.2).

Figure 2: Reconstruction error maps. Angular error in degrees (E) and runtime in seconds (T) are provided.

Practical Application - Photometric Stereo: Suppose we have q observations of a given surface point from a Lambertian scene under different lighting directions.
Then the resulting measurements from a standard calibrated photometric stereo design (linear camera response function, an orthographic camera projection, and known directional light sources), denoted $o \in \mathbb{R}^q$, can be expressed as $o = \rho L n$, where $n \in \mathbb{R}^3$ denotes the true 3D surface normal, each row of $L \in \mathbb{R}^{q \times 3}$ defines a lighting direction, and $\rho$ is the diffuse albedo, acting here as a scalar multiplier [24]. If specular highlights, shadows, or other gross outliers are present, then the observations are more realistically modeled as $o = \rho L n + e$, where $e$ is an unknown sparse vector [13, 25]. It is apparent that, since $n$ is unconstrained, $e$ need not compensate for any component of $o$ in the range of $L$. Given that $\mathrm{null}[L^\top]$ is the orthogonal complement of $\mathrm{range}[L]$, we may consider the following problem

$$\min_e \; \|e\|_0 \quad \text{s.t.} \quad \mathrm{Proj}_{\mathrm{null}[L^\top]}(o) = \mathrm{Proj}_{\mathrm{null}[L^\top]}(e), \tag{11}$$

which ultimately collapses to our canonical sparse estimation problem from (1), where lighting-hardware-dependent correlations may be unavoidable in the implicit dictionary.

Following [13], we use 32-bit HDR gray-scale images of the object Bunny (256×256) with foreground masks under different lighting conditions whose directions, or rows of $L$, are randomly selected from a hemisphere with the object placed at the center. To apply our method, we first compute $\Phi$ using the appropriate projection operator derived from the lighting matrix $L$. As real-world training data is expensive to acquire, we instead use weak supervision by synthetically generating a training set as follows. First, we draw a support pattern for $e$ randomly, with cardinality d sampled uniformly from the range $[d_1, d_2]$; the values of $d_1$ and $d_2$ can be tuned in practice. Nonzero values of $e$ are assigned iid random values from a Gaussian distribution whose mean and variance are also tunable.
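The projection underlying (11) and this style of synthetic outlier generation can be sketched as follows. This is a minimal NumPy sketch, not code from [26]: the function names are ours, $L$ is assumed to have full column rank, and the Gaussian mean/standard deviation defaults are placeholders for the tunable values mentioned above.

```python
import numpy as np

def null_projector(L):
    # Orthogonal projector onto null(L^T): P = I - L (L^T L)^{-1} L^T.
    # Under o = rho * L n + e it satisfies P o = P e, which is exactly the
    # constraint in (11); L (q x 3) must have full column rank.
    q = L.shape[0]
    return np.eye(q) - L @ np.linalg.solve(L.T @ L, L.T)

def make_training_pair(L, d_range, rng, mean=0.5, std=0.25):
    # Weakly supervised training sample: a sparse outlier vector e whose
    # cardinality is drawn uniformly from d_range = (d1, d2), with Gaussian
    # nonzeros, paired with the corresponding network input P e.
    q = L.shape[0]
    d = rng.integers(d_range[0], d_range[1] + 1)
    e = np.zeros(q)
    support = rng.choice(q, size=d, replace=False)
    e[support] = rng.normal(mean, std, size=d)
    return null_projector(L) @ e, e
```

Note that only $P e$, never the clean observation $\rho L n$, is needed to build the training set, which is what makes the weak supervision inexpensive.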
Beyond this, no attempt was made to match the true outlier distributions encountered in applications of photometric stereo. Finally, for each $e$ we can naturally compute observations via the linear constraint in (11), which serve as candidate network inputs.

Given synthetic training data acquired in this way, we learn a network with the exact same structure and optimization parameters as in Section 5; no application-specific tuning was introduced. We then deploy the resulting network on the gray-scale Bunny images. For each surface point, we use our DNN model to approximately solve (11). Since the network output will be a probability map for the outlier support set rather than the actual values of $e$, we choose the 4 indices with the lowest probability as inliers and use them to compute $n$ via least squares.

We compare our method against the baseline least-squares estimate from [24] and $\ell_1$-norm minimization, deferring more quantitative comparisons to [26]. In Figure 2, we illustrate the recovered surface normal error maps for the hardest case (fewest lighting directions). Here we observe that our DNN estimates lead to far fewer regions of significant error, and the runtime is orders of magnitude faster. Overall, this application example illustrates that weak supervision with mismatched synthetic training data can, at least for some problem domains, be sufficient to learn a quite useful sparse estimation DNN; here, one that facilitates real-time 3D modeling in mobile environments.

Discussion: In this paper we have shown that deep networks with hand-crafted, multi-resolution structure can provably solve certain specific classes of sparse recovery problems where existing algorithms fail.
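For completeness, the inference step described above, keeping the lowest-probability indices as inliers and recovering the normal by least squares, can be sketched as follows. This is our own illustrative helper (not from [26]); it assumes the network's per-observation outlier probabilities are given, and absorbs the albedo $\rho$ by normalizing the least-squares solution.

```python
import numpy as np

def estimate_normal(o, L, outlier_prob, k=4):
    # Keep the k indices the network deems LEAST likely to be outliers,
    # solve o ~= rho * L n over those rows by least squares, and return
    # the unit surface normal (rho is absorbed by the normalization).
    inliers = np.argsort(outlier_prob)[:k]
    rho_n, *_ = np.linalg.lstsq(L[inliers], o[inliers], rcond=None)
    return rho_n / np.linalg.norm(rho_n)
```

With k = 4 and three unknowns, the inlier system is only slightly overdetermined, so the accuracy of the support estimate dominates the quality of the recovered normal.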
However, much like CNN-based features can often outperform SIFT on many computer vision tasks, we argue that a discriminative approach can outperform manual structuring of layers/iterations and compensate for dictionary coherence under more general conditions.

Acknowledgements: This work was done while the first author was an intern at Microsoft Research, Beijing. It is also funded by 973-2015CB351800, NSFC-61231010, NSFC-61527804, NSFC-61421062, NSFC-61210005 and the MOE-Microsoft Key Laboratory, Peking University.

References

[1] S. Baillet, J.C. Mosher, and R.M. Leahy. Electromagnetic brain mapping. IEEE Signal Processing Magazine, pages 14-30, Nov. 2001.

[2] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2(1), 2009.

[3] T. Blumensath and M.E. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3), 2009.

[4] T. Blumensath and M.E. Davies. Normalized iterative hard thresholding: Guaranteed stability and performance. IEEE J. Selected Topics Signal Processing, 4(2), 2010.

[5] E. Candès, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Information Theory, 52(2):489-509, 2006.

[6] E. Candès and T. Tao. Decoding by linear programming. IEEE Trans. Information Theory, 51(12), 2005.

[7] S.F. Cotter and B.D. Rao. Sparse channel estimation via matching pursuit with application to equalization. IEEE Trans. on Communications, 50(3), 2002.

[8] M.A.T. Figueiredo. Adaptive sparseness using Jeffreys prior. NIPS, 2002.

[9] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In ICML, 2010.

[10] K. He, X. Zhang, S. Ren, and J. Sun.
Deep residual learning for image recognition. CVPR, 2016.

[11] J.R. Hershey, J. Le Roux, and F. Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574v4, 2014.

[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8), 1997.

[13] S. Ikehata, D.P. Wipf, Y. Matsushita, and K. Aizawa. Robust photometric stereo using sparse regression. In CVPR, 2012.

[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[15] U. Kamilov and H. Mansour. Learning optimal nonlinearities for iterative thresholding algorithms. arXiv preprint arXiv:1512.04754, 2015.

[16] D.M. Malioutov, M. Çetin, and A.S. Willsky. Sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Trans. Signal Processing, 53(8), 2005.

[17] V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. ICML, 2010.

[18] Y.C. Pati, R. Rezaiifar, and P.S. Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In 27th Asilomar Conference on Signals, Systems and Computers, 1993.

[19] P. Sprechmann, A.M. Bronstein, and G. Sapiro. Learning efficient sparse and low rank models. IEEE Trans. Pattern Analysis and Machine Intelligence, 37(9), 2015.

[20] R.K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. NIPS, 2015.

[21] R. Tibshirani. Regression shrinkage and selection via the lasso. J. of the Royal Statistical Society, 1996.

[22] J.A. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Trans. Information Theory, 50(10):2231-2242, October 2004.

[23] Z. Wang, Q. Ling, and T. Huang. Learning deep $\ell_0$ encoders. arXiv preprint arXiv:1509.00153v2, 2015.

[24] R.J. Woodham.
Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1), 1980.

[25] L. Wu, A. Ganesh, B. Shi, Y. Matsushita, Y. Wang, and Y. Ma. Robust photometric stereo via low-rank matrix completion and recovery. Asian Conference on Computer Vision, 2010.

[26] B. Xin, Y. Wang, W. Gao, and D. Wipf. Maximal sparsity with deep networks? arXiv preprint arXiv:1605.01636, 2016.