{"title": "High Dimensional Structured Superposition Models", "book": "Advances in Neural Information Processing Systems", "page_first": 3691, "page_last": 3699, "abstract": "High dimensional superposition models characterize observations using parameters which can be written as a sum of multiple component parameters, each with its own structure, e.g., sum of low rank and sparse matrices. In this paper, we consider general superposition models which allow sum of any number of component parameters, and each component structure can be characterized by any norm. We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize sample complexity of the estimator, and give non-asymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable spherical caps.", "full_text": "High Dimensional Structured Superposition Models\n\nQilong Gu\n\nArindam Banerjee\n\nDept of Computer Science & Engineering\n\nUniversity of Minnesota, Twin Cities\n\nDept of Computer Science & Engineering\n\nUniversity of Minnesota, Twin Cities\n\nguxxx396@cs.umn.edu\n\nbanerjee@cs.umn.edu\n\nAbstract\n\nHigh dimensional superposition models characterize observations using parameters\nwhich can be written as a sum of multiple component parameters, each with its\nown structure, e.g., sum of low rank and sparse matrices, sum of sparse and rotated\nsparse vectors, etc. In this paper, we consider general superposition models which\nallow sum of any number of component parameters, and each component structure\ncan be characterized by any norm. 
We present a simple estimator for such models, give a geometric condition under which the components can be accurately estimated, characterize the sample complexity of the estimator, and give high-probability non-asymptotic bounds on the componentwise estimation error. We use tools from empirical processes and generic chaining for the statistical analysis, and our results, which substantially generalize prior work on superposition models, are in terms of Gaussian widths of suitable sets.

1 Introduction

For high-dimensional structured estimation problems [3, 15], considerable advances have been made in accurately estimating a sparse or structured parameter $\theta \in \mathbb{R}^p$ even when the sample size $n$ is far smaller than the ambient dimensionality of $\theta$, i.e., $n \ll p$. Instead of a single structure, such as sparsity or low rank, recent years have seen interest in parameter estimation when the parameter $\theta$ is a superposition or sum of multiple different structures, i.e., $\theta = \sum_{i=1}^k \theta_i$, where $\theta_1$ may be sparse, $\theta_2$ may be low rank, and so on [1, 6, 7, 9, 11, 12, 13, 23, 24].

In this paper, we substantially generalize the non-asymptotic estimation error analysis for such superposition models such that (i) the parameter $\theta$ can be the superposition of any number of component parameters $\theta_i$, and (ii) the structure in each $\theta_i$ can be captured by any suitable norm $R_i(\theta_i)$. We will analyze the following linear measurement based superposition model

$$y = X \sum_{i=1}^k \theta_i + \omega, \qquad (1)$$

where $X \in \mathbb{R}^{n \times p}$ is a random sub-Gaussian design or compressive matrix, $k$ is the number of components, $\theta_i$ is one component of the unknown parameters, $y \in \mathbb{R}^n$ is the response vector, and $\omega \in \mathbb{R}^n$ is random noise independent of $X$.
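As a concrete illustration, here is a minimal sketch of sampling from model (1) with $k = 2$ components, one sparse vector plus one vector that is sparse after an orthogonal rotation (the MCA setting revisited in Section 6); the dimensions, sparsity level, and noise scale below are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 500                        # sample size and ambient dimension

# Component 1: a 5-sparse vector; component 2: sparse after rotation by Q.
theta1 = np.zeros(p)
theta1[:5] = 1.0
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))   # random orthogonal matrix
theta2 = Q.T @ theta1                              # so Q @ theta2 is sparse

X = rng.standard_normal((n, p))        # sub-Gaussian (here Gaussian) design
omega = 0.1 * rng.standard_normal(n)   # noise independent of X
y = X @ (theta1 + theta2) + omega      # observation model (1) with k = 2
```

Note that $\theta_1 + \theta_2$ itself is dense in general, which is what makes componentwise recovery nontrivial.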
The structure in each component $\theta_i$ is captured by any suitable norm $R_i(\cdot)$, such that $R_i(\theta_i)$ has a small value, e.g., sparsity captured by $\|\theta_i\|_1$, or low rank (for matrix $\theta_i$) captured by the nuclear norm $\|\theta_i\|_*$. Popular models such as Morphological Component Analysis (MCA) [10] and Robust PCA [6, 9] can be viewed as special cases of this framework (see Section D).

The superposition estimation problem can be posed as follows: Given $(y, X)$ generated following (1), estimate component parameters $\{\hat{\theta}_i\}$ such that all the componentwise estimation errors $\Delta_i = \hat{\theta}_i - \theta_i^*$, where $\theta_i^*$ is the population mean, are small. Ideally, we want to obtain high-probability non-asymptotic bounds on the total componentwise error measured as $\sum_{i=1}^k \|\hat{\theta}_i - \theta_i^*\|_2$, with the bound improving (getting smaller) with increase in the number $n$ of samples.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

We propose the following estimator for the superposition model in (1):

$$\min_{\{\theta_1, \ldots, \theta_k\}} \Big\| y - X \sum_{i=1}^k \theta_i \Big\|_2^2 \quad \text{s.t.} \quad R_i(\theta_i) \le \alpha_i, \quad i = 1, \ldots, k, \qquad (2)$$

where the $\alpha_i$ are suitable constants. In this paper, we focus on the case where $\alpha_i = R_i(\theta_i^*)$, e.g., if $\theta_i^*$ is $s$-sparse with $\|\theta_i^*\|_2 = 1$ and $R_i(\cdot) = \|\cdot\|_1$, then $\alpha_i = \sqrt{s}$ so that $R_i(\theta_i^*) \le \sqrt{s}$, noting that recent advances [16] can be used to extend our results to more general settings.

The superposition estimator in (2) succeeds if a certain geometric condition, which we call structural coherence (SC), is satisfied by certain sets (cones) associated with the component norms $R_i(\cdot)$.
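A minimal projected-gradient sketch of estimator (2) for the sparse plus rotated-sparse case may help make the constrained least-squares form concrete. The $\ell_1$-ball projection below is the standard sorting-based routine, and the step size $1/(4\|X\|_2^2)$ is one valid choice for this joint objective; all problem sizes are illustrative assumptions, and this is not the paper's own algorithm:

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection onto {u : ||u||_1 <= radius} (sorting-based)."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(1)
n, p, s = 100, 150, 3
theta1_true = np.zeros(p); theta1_true[:s] = 1.0
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
theta2_true = Q.T @ theta1_true                # sparse after rotation by Q
X = rng.standard_normal((n, p))
y = X @ (theta1_true + theta2_true) + 0.05 * rng.standard_normal(n)

alpha1 = np.abs(theta1_true).sum()             # alpha_i = R_i(theta_i^*)
alpha2 = np.abs(Q @ theta2_true).sum()

theta1 = np.zeros(p); theta2 = np.zeros(p)
step = 1.0 / (4 * np.linalg.norm(X, 2) ** 2)   # 1/L for the joint objective
for _ in range(2000):
    grad = -2 * X.T @ (y - X @ (theta1 + theta2))  # same gradient in each block
    theta1 = project_l1_ball(theta1 - step * grad, alpha1)
    # {theta : ||Q theta||_1 <= alpha} is projected via the isometry Q
    theta2 = Q.T @ project_l1_ball(Q @ (theta2 - step * grad), alpha2)
```

Since $Q$ is orthogonal, projecting onto $\{\theta : \|Q\theta\|_1 \le \alpha\}$ reduces to an $\ell_1$-ball projection in the rotated coordinates, as the last line exploits.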
Since the estimate $\hat{\theta}_i = \theta_i^* + \Delta_i$ is in the feasible set of the optimization problem (2), the error vector $\Delta_i$ satisfies the constraint $R_i(\theta_i^* + \Delta_i) \le \alpha_i$ where $\alpha_i = R_i(\theta_i^*)$. The SC condition is a geometric relationship between the corresponding error cones $C_i = \text{cone}\{\Delta_i \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}$. If SC is satisfied, then we can show that the sum of componentwise estimation errors can be bounded with high probability, and the bound takes the form:

$$\sum_{i=1}^k \|\hat{\theta}_i - \theta_i^*\|_2 \le c\, \frac{\max_i w(C_i \cap B^p) + \sqrt{\log k}}{\sqrt{n}}, \qquad (3)$$

where $n$ is the sample size, $k$ is the number of components, and $w(C_i \cap B^p)$ is the Gaussian width [3, 8, 22] of the intersection of the error cone $C_i$ with the unit Euclidean ball $B^p \subseteq \mathbb{R}^p$. Interestingly, the estimation error decreases at the rate of $1/\sqrt{n}$, similar to the case of single parameter estimators [15, 3], and depends only logarithmically on the number of components $k$. Further, while the dependence of the error on the Gaussian width of the error cone has been established in recent results involving a single parameter [3, 22], the bound in (3) depends on the maximum of the Gaussian widths of the individual error cones, not their sum. The analysis thus gives a general way to construct estimators for superposition problems along with high-probability non-asymptotic upper bounds on the sum of componentwise errors. To show the generality of our work, we review and compare related work in Appendix B.

Notation: In this paper, we use $\|\cdot\|$ to denote a vector norm, and $|||\cdot|||$ to denote an operator norm.
For example, $\|\cdot\|_2$ is the Euclidean norm of a vector or matrix, and $|||\cdot|||_*$ is the nuclear norm of a matrix. We denote by $\text{cone}\{E\}$ the smallest closed cone that contains a given set $E$, and by $\langle \cdot, \cdot \rangle$ the inner product.

The rest of this paper is organized as follows: We start with a deterministic estimation error bound in Section 2, while laying down the key geometric and statistical quantities involved in the analysis. In Section 3, we discuss the geometry of the structural coherence (SC) condition, and in Section 4 we show that the geometric SC condition implies the statistical restricted eigenvalue (RE) condition. In Section 5, we develop the main error bound on the sum of componentwise errors, which holds with high probability for sub-Gaussian designs and noise. We apply our error bound to practical problems in Section 6, and present experimental results in Section 7. We conclude in Section 8. In the Appendix, we compare an estimator using "infimal convolution" [18] of norms with our estimator (2) for the noiseless case, and provide some additional examples and experiments. The proofs of all technical results are also in the Appendix.

2 Error Structure and Recovery Guarantees

In this section, we start with some basic results and, under suitable assumptions, provide a deterministic bound for the componentwise estimation error in superposition models. Subsequently, we will show that the assumptions made here hold with high probability as long as a purely geometric non-probabilistic condition characterized by structural coherence (SC) is satisfied.

Let $\{\hat{\theta}_i\}$ be a solution to the superposition estimation problem in (2), and let $\{\theta_i^*\}$ be the optimal (population) parameters involved in the true data generation process. Let $\Delta_i = \hat{\theta}_i - \theta_i^*$ be the error vector for component $i$ of the superposition.
Our goal is to provide a preliminary understanding of the structure of the error sets where the $\Delta_i$ live, identify conditions under which a bound on the total componentwise error $\sum_{i=1}^k \|\hat{\theta}_i - \theta_i^*\|_2$ will hold, and provide a preliminary version of such a bound, which will be subsequently refined to the form in (3) in Section 5. Since $\hat{\theta}_i = \theta_i^* + \Delta_i$ lies in the feasible set of (2), as discussed in Section 1, the error vectors $\Delta_i$ will lie in the error sets $E_i = \{\Delta_i \in \mathbb{R}^p \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}$ respectively. For the analysis, we will be focusing on the cone of such error sets, given by

$$C_i = \text{cone}\{\Delta_i \in \mathbb{R}^p \,|\, R_i(\theta_i^* + \Delta_i) \le R_i(\theta_i^*)\}. \qquad (4)$$

Let $\theta^* = \sum_{i=1}^k \theta_i^*$, $\hat{\theta} = \sum_{i=1}^k \hat{\theta}_i$, and $\Delta = \sum_{i=1}^k \Delta_i$, so that $\Delta = \hat{\theta} - \theta^*$. From the optimality of $\hat{\theta}$ as a solution to (2), we have

$$\|y - X\hat{\theta}\|_2^2 \le \|y - X\theta^*\|_2^2 \;\Rightarrow\; \|X\Delta\|_2^2 \le 2\omega^T X\Delta, \qquad (5)$$

using $\hat{\theta} = \theta^* + \Delta$ and $y = X\theta^* + \omega$. In order to establish recovery guarantees, under suitable assumptions we construct a lower bound on $\|X\Delta\|_2^2$, the left hand side of (5). The lower bound is a generalized form of the restricted eigenvalue (RE) condition studied in the literature [4, 5, 17].
We also construct an upper bound on $\omega^T X\Delta$, the right hand side of (5), which requires carefully analyzing the noise-design (ND) interaction, i.e., the interaction between the noise $\omega$ and the design $X$.

We start by assuming that a generalized form of the RE condition is satisfied by the superposition of errors: there exists a constant $\kappa > 0$ such that for all $\Delta_i \in C_i$, $i = 1, 2, \ldots, k$:

$$\frac{1}{\sqrt{n}} \Big\| X \sum_{i=1}^k \Delta_i \Big\|_2 \ge \kappa \sum_{i=1}^k \|\Delta_i\|_2. \qquad \text{(RE)} \quad (6)$$

The above RE condition considers the following set:

$$H = \Big\{ \textstyle\sum_{i=1}^k \Delta_i \,:\, \Delta_i \in C_i,\; \sum_{i=1}^k \|\Delta_i\|_2 = 1 \Big\}, \qquad (7)$$

which involves all the $k$ error cones, and the lower bound is over the sum of the norms of the componentwise errors. If $k = 1$, the RE condition in (6) simplifies to the widely studied RE condition in the current literature on Lasso-type and Dantzig-type estimators [4, 17, 3], where only one error cone is involved. If we set all components but $\Delta_i$ to zero, then (6) becomes the RE condition for component $i$ alone. We also note that the general RE condition as explicitly stated in (6) has been implicitly used in [1] and [24]. For subsequent analysis, we introduce the set $\bar{H}$ defined as

$$\bar{H} = \Big\{ \textstyle\sum_{i=1}^k \Delta_i \,:\, \Delta_i \in C_i,\; \sum_{i=1}^k \|\Delta_i\|_2 \le 1 \Big\}, \qquad (8)$$

noting that $H \subset \bar{H}$.

The general RE condition in (6) depends on the random design matrix $X$, and is hence an inequality which will hold with certain probability depending on $X$ and the set $H$.
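A rough Monte Carlo probe of condition (6): draw random sparse directions as stand-ins for members of two error cones (sparse, and sparse in a rotated basis), and track the smallest observed ratio. This is only a heuristic lower-bound estimate of $\kappa$ over the sampled points, not a bound over all of $H$, and the sizes are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, s = 200, 100, 3
X = rng.standard_normal((n, p))
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))

ratios = []
for _ in range(500):
    d1 = np.zeros(p)
    d1[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    z = np.zeros(p)
    z[rng.choice(p, s, replace=False)] = rng.standard_normal(s)
    d2 = Q.T @ z                                   # sparse in the rotated basis
    lhs = np.linalg.norm(X @ (d1 + d2)) / np.sqrt(n)
    rhs = np.linalg.norm(d1) + np.linalg.norm(d2)
    ratios.append(lhs / rhs)

kappa_hat = min(ratios)   # empirical proxy for kappa over the sampled directions
```

With a random rotation $Q$ the two sets of directions are well separated, so the sampled ratios stay bounded away from zero; replacing $Q$ by $-I$ would let $d_1 + d_2$ nearly cancel and drive the minimum ratio toward zero, previewing the role of structural coherence below.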
For superposition problems, the probabilistic RE condition in (6) is intimately related to the following deterministic structural coherence (SC) condition on the interaction of the different component cones $C_i$, without any explicit reference to the random design matrix $X$: there is a constant $\rho > 0$ such that for all $\Delta_i \in C_i$, $i = 1, \ldots, k$,

$$\Big\| \sum_{i=1}^k \Delta_i \Big\|_2 \ge \rho \sum_{i=1}^k \|\Delta_i\|_2. \qquad \text{(SC)} \quad (9)$$

If $k = 1$, the SC condition is trivially satisfied with $\rho = 1$. Since most existing literature on high-dimensional structured models focuses on the $k = 1$ setting [4, 17, 3], there was no reason to study the SC condition carefully. For $k > 1$, the SC condition (9) implies a non-trivial relationship among the component cones. In particular, if the SC condition is true, then the sum $\sum_{i=1}^k \Delta_i$ being zero implies that each component $\Delta_i$ must also be zero. As presented in (9), the SC condition comes across as an algebraic condition. In Section 3, we present a geometric characterization of the SC condition [13], and illustrate that the condition is both necessary and sufficient for accurate recovery of each component. In Section 4, we show that for sub-Gaussian design matrices $X$, the SC condition in (9) in fact implies that the RE condition in (6) will hold with high probability, once the number of samples crosses a certain sample complexity, which depends on the Gaussian widths of the component cones. For now, we assume the RE condition in (6) to hold, and proceed with the error bound analysis.

To establish the recovery guarantee, following (5), we need an upper bound on the interaction between the noise $\omega$ and the design $X$ [3, 14].
In particular, we consider the noise-design (ND) interaction

$$s_n(\gamma) = \inf_{s > 0} \Big\{ s \,:\, \sup_{u \in sH} \frac{1}{\sqrt{n}}\, \omega^T X u \le \frac{\gamma s^2}{\sqrt{n}} \Big\}, \qquad \text{(ND)} \quad (10)$$

Figure 1: Geometry of the SC condition when $k = 2$. The error sets $E_1$ and $E_2$ are respectively shown as blue and green squares, and the corresponding error cones are $C_1$ and $C_2$ respectively. $-C_1$ is the reflection of the error cone $C_1$. If $-C_1$ and $C_2$ do not share a ray, i.e., the angle $\alpha$ between the cones is larger than 0, then $\delta_0 < 1$, and the SC condition will hold.

where $\gamma > 0$ is a constant, and $sH$ is the scaled version of $H$ with scaling factor $s > 0$. Here, $s_n(\gamma)$ denotes the minimal scaling needed on $H$ such that one obtains a uniform bound over $\Delta \in sH$ of the form $\frac{1}{\sqrt{n}} \omega^T X\Delta \le \frac{\gamma s_n^2(\gamma)}{\sqrt{n}}$. Then, from the basic inequality in (5), with the bounds implied by the RE condition and the ND interaction, we have

$$\frac{1}{\sqrt{n}} \|X\Delta\|_2 \le \sqrt{\frac{1}{\sqrt{n}}\, \omega^T X\Delta} \;\Rightarrow\; \kappa \sum_{i=1}^k \|\Delta_i\|_2 \le \sqrt{\gamma}\, s_n(\gamma), \qquad (11)$$

which implies a bound on the componentwise error. The main deterministic bound below states the result formally:

Theorem 1 (Deterministic bound) Assume that the RE condition in (6) is satisfied in $H$ with parameter $\kappa$. Then, if $\kappa^2 > \gamma$, we have $\sum_{i=1}^k \|\Delta_i\|_2 \le 2 s_n(\gamma)$.

The above bound is deterministic and holds only when the RE condition in (6) is satisfied with a constant $\kappa$ such that $\kappa^2 > \gamma$. In the sequel, we first give a geometric characterization of the SC condition in Section 3, and show that the SC condition implies the RE condition with high probability in Section 4.
Further, we give a high probability characterization of $s_n(\gamma)$ based on the noise $\omega$ and design $X$ in terms of the Gaussian widths of the component cones, and also illustrate how one can choose $\gamma$, in Section 5. With these characterizations, we will obtain the desired componentwise error bound of the form (3).

3 Geometry of Structural Coherence

In this section, we give a geometric characterization of the structural coherence (SC) condition in (9). We start with the simplest case of two vectors $x, y$. If they are not reflections of each other, i.e., $x \ne -y$, then the following relationship holds:

Proposition 2 If there exists a $\delta < 1$ such that $-\langle x, y \rangle \le \delta \|x\|_2 \|y\|_2$, then

$$\|x + y\|_2 \ge \sqrt{\tfrac{1-\delta}{2}}\, \big( \|x\|_2 + \|y\|_2 \big). \qquad (12)$$

Next, we generalize the condition of Proposition 2 to vectors in two different cones $C_1$ and $C_2$. Given the cones, define

$$\delta_0 = \sup_{x \in C_1 \cap S^{p-1},\; y \in C_2 \cap S^{p-1}} -\langle x, y \rangle. \qquad (13)$$

By construction, $-\langle x, y \rangle \le \delta_0 \|x\|_2 \|y\|_2$ for all $x \in C_1$ and $y \in C_2$. If $\delta_0 < 1$, then (12) continues to hold for all $x \in C_1$ and $y \in C_2$ with constant $\sqrt{(1-\delta_0)/2} > 0$. Note that this corresponds to the SC condition with $k = 2$ and $\rho = \sqrt{(1-\delta_0)/2}$. We can interpret this geometrically as follows: first reflect the cone $C_1$ to get $-C_1$; then $\delta_0$ is the cosine of the minimum angle between $-C_1$ and $C_2$. If $\delta_0 = 1$, then $-C_1$ and $C_2$ share a ray, and structural coherence does not hold.
Otherwise, $\delta_0 < 1$, implying $-C_1 \cap C_2 = \{0\}$, i.e., the two cones intersect only at the origin, and structural coherence holds.

For the general case involving $k$ cones, denote

$$\delta_i = \sup_{u \in -C_i \cap S^{p-1},\; v \in \sum_{j \ne i} C_j \cap S^{p-1}} \langle u, v \rangle. \qquad (14)$$

In recent work, [13] concluded that if $\delta_i < 1$ for each $i = 1, \ldots, k$, then $-C_i$ and $\sum_{j \ne i} C_j$ do not share a ray, and the original signal can be recovered in the noiseless case. We show that the condition above in fact implies $\rho > 0$ for the SC condition in (9), which is sufficient for accurate recovery even in the noisy case. In particular, with $\delta := \max_i \delta_i$, we have the following result:

Theorem 3 (Structural Coherence (SC) Condition) Let $\delta := \max_i \delta_i$ with $\delta_i$ as defined in (14). If $\delta < 1$, then there exists a $\rho > 0$ such that for any $\Delta_i \in C_i$, $i = 1, \ldots, k$, the SC condition in (9) holds, i.e.,

$$\Big\| \sum_{i=1}^k \Delta_i \Big\|_2 \ge \rho \sum_{i=1}^k \|\Delta_i\|_2. \qquad (15)$$

Thus, the SC condition is satisfied in the general case as long as the reflection $-C_i$ of any cone $C_i$ does not intersect, i.e., share a ray, with the Minkowski sum $\sum_{j \ne i} C_j$ of the other cones.

4 Restricted Eigenvalue Condition for Superposition Models

Assuming that the SC condition is satisfied by the error cones $\{C_i\}$, $i = 1, \ldots, k$, in this section we show that the general RE condition in (6) will be satisfied with high probability when the number of samples $n$ in the sub-Gaussian design matrix $X \in \mathbb{R}^{n \times p}$ crosses the sample complexity $n_0$.
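Returning to the two-cone case, the inequality in Proposition 2 is easy to sanity-check numerically: for random pairs $(x, y)$, take $\delta$ to be the exact value of $-\langle x, y\rangle / (\|x\|_2 \|y\|_2)$, the smallest $\delta$ valid for that pair, and verify (12). This is purely illustrative, not part of the paper's analysis:

```python
import numpy as np

rng = np.random.default_rng(2)
p = 50
for _ in range(1000):
    x = rng.standard_normal(p)
    y = rng.standard_normal(p)
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    delta = -x @ y / (nx * ny)           # smallest delta valid for this pair
    lhs = np.linalg.norm(x + y)
    rhs = np.sqrt((1.0 - delta) / 2.0) * (nx + ny)
    assert lhs >= rhs - 1e-9             # inequality (12)
```

The check passes for every pair because (12) follows from expanding $\|x+y\|_2^2$ and the elementary bound $\tfrac{1+\delta}{2}(\|x\|_2 - \|y\|_2)^2 \ge 0$.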
We give a precise characterization of the sample complexity $n_0$ in terms of the Gaussian width of the set $H$. Our analysis is based on the results and techniques in [20, 14], and we note that [3] has related results using mildly different techniques. We start with a restricted eigenvalue condition on $C$. For a random vector $Z \in \mathbb{R}^p$, we define the marginal tail function for an arbitrary set $E$ as

$$Q_\xi(E; Z) = \inf_{u \in E} P\big( |\langle Z, u \rangle| \ge \xi \big), \qquad (16)$$

noting that it is deterministic given the set $E \subseteq \mathbb{R}^p$. Let $\epsilon_i$, $i = 1, \ldots, n$, be independent Rademacher random variables, i.e., random variables taking values $+1$ or $-1$ with probability $\frac{1}{2}$ each, and let $X_i$, $i = 1, \ldots, n$, be independent copies of $Z$. We define the empirical width of $E$ as

$$W_n(E; Z) = \sup_{u \in E} \langle h, u \rangle, \quad \text{where } h = \frac{1}{\sqrt{n}} \sum_{i=1}^n \epsilon_i X_i. \qquad (17)$$

With this notation, we recall the following result from [20]:

Lemma 1 Let $X \in \mathbb{R}^{n \times p}$ be a random design matrix with each row an independent copy of the sub-Gaussian random vector $Z$. Then for any $\xi, \rho, t > 0$, we have

$$\inf_{u \in H} \|Xu\|_2 \ge \rho \xi \sqrt{n}\, Q_{2\rho\xi}(H; Z) - 2 W_n(H; Z) - \rho \xi t \qquad (18)$$

with probability at least $1 - e^{-t^2/2}$.

In order to obtain a lower bound on $\kappa$ in the RE condition (6), we need to lower bound $Q_{2\rho\xi}(H; Z)$ and upper bound $W_n(H; Z)$. To lower bound $Q_{2\rho\xi}(H; Z)$, we consider the spherical cap

$$A = \Big( \sum_{i=1}^k C_i \Big) \cap S^{p-1}. \qquad (19)$$

From [20, 14], one can obtain a lower bound on $Q_\xi(A; Z)$ based on the Paley-Zygmund inequality, which lower bounds the tail distribution of a random variable by its second moment.
Let $u$ be an arbitrary vector; we use the following version of the inequality:

$$P\big( |\langle Z, u \rangle| \ge 2\xi \big) \ge \frac{\big[ E|\langle Z, u \rangle| - 2\xi \big]_+^2}{E|\langle Z, u \rangle|^2}. \qquad (20)$$

In the current context, the following result is a direct consequence of the SC condition; it shows that $Q_{2\rho\xi}(H; Z)$ is lower bounded by $Q_\xi(A; Z)$, which in turn is strictly bounded away from 0. The proof of Lemma 2 is given in Appendix H.1.

Lemma 2 Let the sets $H$ and $A$ be as defined in (7) and (19) respectively. If the SC condition in (9) holds, then the marginal tail functions of the two sets satisfy

$$Q_{\rho\xi}(H; Z) \ge Q_\xi(A; Z). \qquad (21)$$

Next, we discuss how to upper bound the empirical width $W_n(H; Z)$. Let the set $E$ be arbitrary, and let $g \sim N(0, I_p)$ be a standard Gaussian random vector in $\mathbb{R}^p$. The Gaussian width [3] of $E$ is defined as

$$w(E) = E \sup_{u \in E} \langle g, u \rangle. \qquad (22)$$

The empirical width $W_n(H; Z)$ can be seen as the supremum of a stochastic process. One way to upper bound the supremum of a stochastic process is by generic chaining [19, 3, 20]; using generic chaining, we can upper bound it in terms of a Gaussian process, which yields the Gaussian width.

Since we can bound $Q_{2\rho\xi}(H; Z)$ and $W_n(H; Z)$, we arrive at the conclusion on the RE condition. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix where each row is an independent copy of the sub-Gaussian random vector $Z \in \mathbb{R}^p$, and where $Z$ has sub-Gaussian norm $|||Z|||_{\psi_2} \le \sigma_x$ [21]. Let $\alpha = \inf_{u \in S^{p-1}} E[|\langle Z, u \rangle|]$ so that $\alpha > 0$ [14, 20]. We have the following lower bound for the RE condition.
The proof of Theorem 4 is based on the proof of [20, Theorem 6.3], and we give it in Appendix H.2.

Theorem 4 (Restricted Eigenvalue Condition) Let $X$ be a sub-Gaussian design matrix that satisfies the assumptions above. If the SC condition (9) holds with a $\rho > 0$, then with probability at least $1 - \exp(-t^2/2)$, we have

$$\inf_{u \in H} \|Xu\|_2 \ge c_1 \rho \sqrt{n} - c_2 w(H) - c_3 \rho t, \qquad (23)$$

where $c_1$, $c_2$ and $c_3$ are positive constants determined by $\sigma_x$, $\sigma_\omega$ and $\alpha$.

To get a $\kappa > 0$ in (6), one can simply choose $t = (c_1 \rho \sqrt{n} - c_2 w(H))/2c_3\rho$. Then, as long as $n > c_4 w^2(H)/\rho^2$ for $c_4 = c_2^2/c_1^2$, we have

$$\kappa = \inf_{u \in H} \frac{1}{\sqrt{n}} \|Xu\|_2 \ge \frac{1}{2} \Big( c_1 \rho - c_2 \frac{w(H)}{\sqrt{n}} \Big) > 0$$

with high probability.

From the discussion above, if the SC condition holds and the sample size $n$ is large enough, then for a sub-Gaussian matrix $X$ the RE condition holds with high probability. Conversely, once there is a matrix $X$ such that the RE condition holds, we can show that SC must also be true. The proof is given in Appendix H.3.

Proposition 5 If $X$ is a matrix such that the RE condition (6) holds for $\Delta_i \in C_i$, then the SC condition (9) holds.

Proposition 5 demonstrates that the SC condition is necessary for the RE condition to hold. If the SC condition does not hold, then there exist $\{\Delta_i\}$ such that $\Delta_i \ne 0$ for some $i = 1, \ldots, k$, but $\|\sum_{i=1}^k \Delta_i\|_2 = 0$, which implies $\sum_{i=1}^k \Delta_i = 0$.
Then for every matrix $X$ we have $X \sum_{i=1}^k \Delta_i = 0$, and the RE condition is not possible.

5 General Error Bound

Recall that the error bound in Theorem 1 is given in terms of the noise-design (ND) interaction

$$s_n(\gamma) = \inf_{s > 0} \Big\{ s \,:\, \sup_{u \in sH} \frac{1}{\sqrt{n}}\, \omega^T X u \le \frac{\gamma s^2}{\sqrt{n}} \Big\}. \qquad (24)$$

In this section, we give a characterization of the ND interaction, which yields the final bound on the componentwise error as long as $n \ge n_0$, i.e., the sample complexity is satisfied.

Let $\omega$ be a centered sub-Gaussian random vector with sub-Gaussian norm $|||\omega|||_{\psi_2} \le \sigma_\omega$, and let $X$ be a row-wise i.i.d. sub-Gaussian random matrix whose rows $Z$ have sub-Gaussian norm $|||Z|||_{\psi_2} \le \sigma_x$. The ND interaction can be bounded by the following result; the proof of Lemma 3 is given in Appendix I.1.

Lemma 3 Let the design $X \in \mathbb{R}^{n \times p}$ be a row-wise i.i.d. sub-Gaussian random matrix, and the noise $\omega \in \mathbb{R}^n$ be a centered sub-Gaussian random vector. Then $s_n(\gamma) \le c\, \frac{w(\bar{H})}{\sqrt{n}}$ for some constant $c > 0$, with probability at least $1 - c_1 \exp(-c_2 w^2(\bar{H})) - c_3 \exp(-c_4 n)$. The constant $c$ depends on $\sigma_x$ and $\sigma_\omega$.

In Lemma 3 and Theorem 6, we need the Gaussian widths of $\bar{H}$ and $H$ respectively. By definition, both $\bar{H}$ and $H$ involve all the component cones; therefore, bounding the widths of $\bar{H}$ and $H$ directly may be difficult. We have the following bounds on $w(H)$ and $w(\bar{H})$ in terms of the widths of the component spherical caps. The proof of Lemma 4 is given in Appendix I.2.

Lemma 4 (Gaussian width bound) Let $H$ and $\bar{H}$ be as defined in (7) and (8) respectively.
Then, we have $w(H) = O\big(\max_i w(C_i \cap S^{p-1}) + \sqrt{\log k}\big)$ and $w(\bar{H}) = O\big(\max_i w(C_i \cap B^p) + \sqrt{\log k}\big)$.

By applying Lemma 4, we can derive the error bound using the Gaussian width of each individual error cone. From our conclusion on the deterministic bound in Theorem 1, we can choose an appropriate $\gamma$ such that $\kappa^2 > \gamma$. Then, by combining the results of Theorem 1, Theorem 4, Lemma 3 and Lemma 4, we have the final form of the bound, as originally discussed in (3):

Theorem 6 For the estimator (2), let $C_i = \text{cone}\{\Delta : R_i(\theta_i^* + \Delta) \le R_i(\theta_i^*)\}$, let the design $X$ be a random matrix with each row an independent copy of the sub-Gaussian random vector $Z$, let the noise $\omega$ be a centered sub-Gaussian random vector, and let $B^p \subseteq \mathbb{R}^p$ be the centered unit Euclidean ball. If the sample size $n > c (\max_i w^2(C_i \cap S^{p-1}) + \log k)/\rho^2$, then with probability at least $1 - \eta_1 k \exp(-\eta_2 \max_i w^2(C_i \cap S^{p-1})) - \eta_3 \exp(-\eta_4 n)$, we have

$$\sum_{i=1}^k \|\hat{\theta}_i - \theta_i^*\|_2 \le C\, \frac{\max_i w(C_i \cap B^p) + \sqrt{\log k}}{\rho^2 \sqrt{n}}, \qquad (25)$$

for constants $c, C > 0$ that depend on the sub-Gaussian norms $|||Z|||_{\psi_2}$ and $|||\omega|||_{\psi_2}$.

Thus, assuming the SC condition in (9) is satisfied, the sample complexity and the error bound of the estimator depend on the largest Gaussian width, rather than the sum of Gaussian widths. The result can be viewed as a direct generalization of existing results for $k = 1$, where the SC condition is always satisfied, and the sample complexity and error are given by $w^2(C_1 \cap S^{p-1})$ and $w(C_1 \cap B^p)$ [3, 8].

6 Application of General Bound

In this section, we instantiate the general error bounds on Morphological Component Analysis (MCA), and on low-rank and sparse matrix decomposition.
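Before instantiating the bounds, note that the Gaussian width (22), which drives both the sample complexity and the error in Theorem 6, is easy to estimate by Monte Carlo for simple sets. A sketch contrasting the unit $\ell_2$ ball, where the supremum is $\|g\|_2$ and $w \approx \sqrt{p}$, with the unit $\ell_1$ ball, where the supremum is $\|g\|_\infty$ and $w \approx \sqrt{2 \log p}$; the gap is the usual intuition for why sparsity-inducing norms yield small widths (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
p, trials = 400, 2000
G = rng.standard_normal((trials, p))   # i.i.d. draws of g ~ N(0, I_p)

# w(E) = E sup_{u in E} <g, u>; the sup has a closed form for these balls.
w_l2_ball = np.linalg.norm(G, axis=1).mean()   # sup over B^p is ||g||_2 ~ sqrt(p)
w_l1_ball = np.abs(G).max(axis=1).mean()       # sup over B_1 is ||g||_inf ~ sqrt(2 log p)
```

For error cones intersected with the sphere, as in Theorem 6, the supremum has no closed form, but the same Monte Carlo idea applies once membership in the cone can be optimized over.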
The comprehensive results are provided in Appendix D.

6.1 Morphological Component Analysis

In Morphological Component Analysis [10], we consider the following linear model

$$y = X(\theta_1^* + \theta_2^*) + \omega, \qquad (26)$$

where the vector $\theta_1^*$ is sparse and $\theta_2^*$ is sparse under a rotation $Q$. Consider the following estimator

$$\min_{\theta_1, \theta_2} \|y - X(\theta_1 + \theta_2)\|_2^2 \quad \text{s.t.} \quad \|\theta_1\|_1 \le \|\theta_1^*\|_1, \; \|Q\theta_2\|_1 \le \|Q\theta_2^*\|_1, \qquad (27)$$

where the vector $y \in \mathbb{R}^n$ is the observation, the vectors $\theta_1, \theta_2 \in \mathbb{R}^p$ are the parameters we want to estimate, the matrix $X \in \mathbb{R}^{n \times p}$ is a sub-Gaussian random design, and the matrix $Q \in \mathbb{R}^{p \times p}$ is orthogonal. We assume $\theta_1^*$ and $Q\theta_2^*$ are $s_1$-sparse and $s_2$-sparse vectors respectively. The function $\|Q \cdot\|_1$ is still a norm. In general, we can derive the following error bound from Theorem 6:

$$\|\theta_1 - \theta_1^*\|_2 + \|\theta_2 - \theta_2^*\|_2 = O\left( \max\left\{ \sqrt{\tfrac{s_1 \log p}{n}},\; \sqrt{\tfrac{s_2 \log p}{n}} \right\} \right).$$

6.2 Low-rank and Sparse Matrix Decomposition

To recover a sparse matrix and a low-rank matrix from their sum [6, 9], one can use the $L_1$ norm to induce sparsity and the nuclear norm to induce low rank. These two norms ensure that the sparsity and the rank of the estimated matrices are small. Suppose we have a rank-$r$ matrix $L^*$ and a sparse matrix $S^*$ with $s$ nonzero entries, $S^*, L^* \in \mathbb{R}^{d_1 \times d_2}$. Our observations come from the following model

$$Y_i = \langle X_i, L^* + S^* \rangle + E_i, \quad i = 1, \ldots, n,$$

where each $X_i \in \mathbb{R}^{d_1 \times d_2}$ is a sub-Gaussian random design matrix, and $E_i$ is the noise.
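For the decomposition above, a minimal sketch in the identity-design special case ($Y = L^* + S^* + E$, i.e., one full observation of the sum) can be written with alternating projected gradient steps, projecting onto the nuclear-norm ball (an $\ell_1$ projection of the singular values) and the entrywise $\ell_1$ ball. The sizes, the 0.5 step size, and the solver itself are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection of a flat vector onto the l1 ball of given radius."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def project_nuclear(M, radius):
    """Project onto the nuclear-norm ball by l1-projecting the singular values."""
    U, sv, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * project_l1(sv, radius)) @ Vt

rng = np.random.default_rng(4)
d1, d2, r = 30, 30, 2
L_true = rng.standard_normal((d1, r)) @ rng.standard_normal((r, d2)) / np.sqrt(r)
S_true = np.zeros((d1, d2))
S_true.flat[rng.choice(d1 * d2, 20, replace=False)] = 3.0 * rng.standard_normal(20)
Y = L_true + S_true + 0.01 * rng.standard_normal((d1, d2))

alpha_L = np.linalg.svd(L_true, compute_uv=False).sum()   # |||L*|||_*
alpha_S = np.abs(S_true).sum()                            # ||S*||_1

L = np.zeros((d1, d2)); S = np.zeros((d1, d2))
for _ in range(300):
    grad = -(Y - L - S)    # gradient of 0.5 * ||Y - L - S||_F^2 in each block
    L = project_nuclear(L - 0.5 * grad, alpha_L)
    S = project_l1((S - 0.5 * grad).ravel(), alpha_S).reshape(d1, d2)
```

The singular-value route to the nuclear-ball projection mirrors how the nuclear norm is the $\ell_1$ norm of the spectrum; for the general compressive design $\langle X_i, \cdot \rangle$ the gradient step changes but the projections are identical.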
The estimator takes the form:

$$\min_{L, S} \sum_{i=1}^n \big( Y_i - \langle X_i, L + S \rangle \big)^2 \quad \text{s.t.} \quad |||L|||_* \le |||L^*|||_*, \; \|S\|_1 \le \|S^*\|_1. \qquad (28)$$

By using Theorem 6, and existing results on Gaussian widths, the error bound is given by

$$\|L - L^*\|_2 + \|S - S^*\|_2 = O\left( \max\left\{ \sqrt{\tfrac{s \log(d_1 d_2)}{n}},\; \sqrt{\tfrac{r(d_1 + d_2 - r)}{n}} \right\} \right).$$

7 Experimental Results

In this section, we confirm the theoretical results of this paper with some simple experiments under different settings. In our experiments we focus on MCA with $k = 2$. The design matrix $X$ is generated from a Gaussian distribution such that every entry of $X$ follows $N(0, 1)$. The noise $\omega$ is generated from a Gaussian distribution such that every entry of $\omega$ follows $N(0, 1)$. We implement our Algorithm 1 in MATLAB. We use synthetic data in all our experiments, and let the true signal be

$$\theta_1 = (\underbrace{1, \ldots, 1}_{s_1}, 0, \ldots, 0), \qquad Q\theta_2 = (\underbrace{1, \ldots, 1}_{s_2}, 0, \ldots, 0).$$

Figure 2: (a) Effect of the parameter $\rho$ on the estimation error when the noise $\omega \ne 0$. We choose $\rho$ to be $0$, $1/\sqrt{2}$, and the value given by a random sample. (b) Effect of the dimension $p$ on the fraction of successful recoveries in the noiseless case. The dimension $p$ varies in $\{20, 40, 80, 160\}$.

We generate our data in different ways for our three experiments.

7.1 Recovery From Noisy Observation

In our first experiment, we test the impact of $\rho$ on the estimation error. We choose three different matrices $Q$, and $\rho$ is determined by the choice of $Q$.
The first $Q$ is given by random sampling: we sample a random orthogonal matrix $Q$ such that $Q_{ij} > 0$, and $\rho$ is lower bounded by (42). The second and third $Q$ are given by the identity matrix $I$ and its negative $-I$; therefore $\rho = 1/\sqrt{2}$ and $\rho = 0$, respectively. We choose dimension $p = 1000$, and let $s_1 = s_2 = 1$. The number of samples $n$ varies between 1 and 1000. The observation $y$ is given by $y = X(\theta_1^* + \theta_2^*) + \omega$. In this experiment, given $Q$, for each $n$ we generate 100 pairs of $X$ and $\omega$. For each $(X, \omega)$ pair, we obtain a solution $\hat{\theta}_1$ and $\hat{\theta}_2$, and we average $\|\hat{\theta}_1 - \theta_1^*\|_2 + \|\hat{\theta}_2 - \theta_2^*\|_2$ over all pairs. Figure 2(a) plots the number of samples against the average error. From Figure 2(a), we can see that the error curve given by the random $Q$ lies between the curves given by the two extreme cases, and a larger $\rho$ gives a lower curve. In Appendix E, we provide an additional experiment using the k-support norm [2].

7.2 Recovery From Noiseless Observation

In our second experiment, we test how the dimension $p$ affects successful recovery of the true value. We choose $p = 20$, $p = 40$, $p = 80$, and $p = 160$, and let $s_1 = s_2 = 1$. To avoid the impact of $\rho$, for each sample size $n$ we sample 100 random orthogonal matrices $Q$. The observation $y$ is given by $y = X(\theta_1^* + \theta_2^*)$. For each solution $\hat{\theta}_1$, $\hat{\theta}_2$ of (41), we calculate the proportion of $Q$ such that $\|\hat{\theta}_1 - \theta_1^*\|_2 + \|\hat{\theta}_2 - \theta_2^*\|_2 \le 10^{-4}$. We increase $n$ from 1 to 40; the resulting plot is Figure 2(b).
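The random orthogonal matrices $Q$ in this experiment can be drawn, for instance, via a QR decomposition of a Gaussian matrix. The paper does not spell out its sampling routine, so the sketch below is one standard (Haar-uniform) choice, with the true signals set as described in Section 7:

```python
import numpy as np

def random_orthogonal(p, rng):
    """Sample a p x p orthogonal matrix via QR of a Gaussian matrix.

    Fixing the signs of R's diagonal makes the draw uniform (Haar)
    over the orthogonal group.
    """
    A = rng.standard_normal((p, p))
    Q, R = np.linalg.qr(A)
    return Q * np.sign(np.diag(R))  # flip column signs to fix the convention

rng = np.random.default_rng(0)
p, s1, s2 = 20, 1, 1
Q = random_orthogonal(p, rng)

# True signals as in the experiments: theta_1 is s1-sparse, Q theta_2 is s2-sparse
theta1 = np.zeros(p)
theta1[:s1] = 1.0
theta2 = Q.T @ np.concatenate([np.ones(s2), np.zeros(p - s2)])
```

Repeating this draw 100 times per sample size, solving the estimator, and counting recoveries reproduces the setup of the success-rate experiment.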
From Figure 2(b), we can see that the sample complexity required to recover $\theta_1^*$ and $\theta_2^*$ increases with the dimension $p$.

8 Conclusions

We present a simple estimator for general superposition models and give a purely geometric characterization, based on structural coherence, of when accurate estimation of each component is possible. Further, we establish the sample complexity of the estimator and upper bounds on the componentwise estimation error, and show that both, interestingly, depend on the largest Gaussian width among the spherical caps induced by the error cones corresponding to the component norms. Going forward, it will be interesting to investigate specific component structures which satisfy structural coherence, and also to extend our results to allow more general measurement models.

Acknowledgements: The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560, IIS-0953274, IIS-1029711, NASA grant NNX12AQ39A, and gifts from Adobe, IBM, and Yahoo.

References

[1] A. Agarwal, S. Negahban, and M. J. Wainwright. Noisy matrix decomposition via convex relaxation: Optimal rates in high dimensions. The Annals of Statistics, 40(2):1171-1197, 2012.

[2] A. Argyriou, R. Foygel, and N. Srebro. Sparse prediction with the k-support norm. In Advances in Neural Information Processing Systems, 2012.

[3] A. Banerjee, S. Chen, F. Fazayeli, and V. Sivakumar. Estimation with norm regularization. In Advances in Neural Information Processing Systems, 2014.

[4] P. J. Bickel, Y. 
Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705-1732, 2009.

[5] P. Buhlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. Springer, 2011.

[6] E. J. Candès, X. Li, Y. Ma, and J. Wright. Robust principal component analysis? J. ACM, 58(3):1-37, 2011.

[7] V. Chandrasekaran, P. A. Parrilo, and A. S. Willsky. Latent variable graphical model selection via convex optimization. The Annals of Statistics, 40(4):1935-1967, 2012.

[8] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky. The convex geometry of linear inverse problems. Foundations of Computational Mathematics, 12:805-849, 2012.

[9] V. Chandrasekaran, S. Sanghavi, P. A. Parrilo, and A. S. Willsky. Rank-sparsity incoherence for matrix decomposition. SIAM Journal on Optimization, 21(2):572-596, 2011.

[10] D. L. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845-2862, 2001.

[11] R. Foygel and L. Mackey. Corrupted sensing: Novel guarantees for separating structured signals. IEEE Transactions on Information Theory, 60(2):1223-1247, 2014.

[12] D. Hsu, S. M. Kakade, and T. Zhang. Robust matrix decomposition with sparse corruptions. IEEE Transactions on Information Theory, 57(11):7221-7234, 2011.

[13] M. B. McCoy and J. A. Tropp. The achievable performance of convex demixing. arXiv, 2013.

[14] S. Mendelson. Learning without concentration. J. ACM, 62(3):21:1-21:25, 2015.

[15] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538-557, 2012.

[16] S. Oymak, B. Recht, and M. Soltanolkotabi. 
Sharp time-data tradeoffs for linear inverse problems. arXiv, 2015.

[17] G. Raskutti, M. J. Wainwright, and B. Yu. Restricted eigenvalue properties for correlated Gaussian designs. Journal of Machine Learning Research, 11:2241-2259, 2010.

[18] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.

[19] M. Talagrand. Upper and Lower Bounds for Stochastic Processes. A Series of Modern Surveys in Mathematics. Springer-Verlag Berlin Heidelberg, 2014.

[20] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. arXiv, 2014.

[21] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. Eldar and G. Kutyniok, editors, Compressed Sensing, pages 210-268. Cambridge University Press, Cambridge, 2012.

[22] R. Vershynin. Estimation in high dimensions: a geometric perspective. Sampling Theory, a Renaissance, pages 3-66, 2015.

[23] J. Wright, A. Ganesh, K. Min, and Y. Ma. Compressive principal component pursuit. IEEE International Symposium on Information Theory, pages 1276-1280, 2012.

[24] E. Yang and P. Ravikumar. Dirty statistical models. Advances in Neural Information Processing Systems, pages 1-9, 2012.