{"title": "Sample Complexity of Learning Mixture of Sparse Linear Regressions", "book": "Advances in Neural Information Processing Systems", "page_first": 10532, "page_last": 10541, "abstract": "In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work.
Our techniques are quite different from those in the previous work: for the noiseless case, we rely on a property of sparse polynomials and for the noisy case, we provide new connections to learning Gaussian mixtures and use ideas from the theory of error correcting codes.", "full_text": "Sample Complexity of Learning Mixtures of Sparse Linear Regressions

Akshay Krishnamurthy
Microsoft Research, NYC
akshay@cs.umass.edu

Andrew McGregor
UMass Amherst
mcgregor@cs.umass.edu

Arya Mazumdar
UMass Amherst
arya@cs.umass.edu

Soumyabrata Pal
UMass Amherst
spal@cs.umass.edu

Abstract

In the problem of learning mixtures of linear regressions, the goal is to learn a collection of signal vectors from a sequence of (possibly noisy) linear measurements, where each measurement is evaluated on an unknown signal drawn uniformly from this collection. This setting is quite expressive and has been studied both in terms of practical applications and for the sake of establishing theoretical guarantees. In this paper, we consider the case where the signal vectors are sparse; this generalizes the popular compressed sensing paradigm. We improve upon the state-of-the-art results as follows: In the noisy case, we resolve an open question of Yin et al. (IEEE Transactions on Information Theory, 2019) by showing how to handle collections of more than two vectors and present the first robust reconstruction algorithm, i.e., if the signals are not perfectly sparse, we still learn a good sparse approximation of the signals. In the noiseless case, as well as in the noisy case, we show how to circumvent the need for a restrictive assumption required in the previous work.
Our techniques are quite different from those in the previous work: for the noiseless case, we rely on a property of sparse polynomials and for the noisy case, we provide new connections to learning Gaussian mixtures and use ideas from the theory of error correcting codes.

1 Introduction

Learning mixtures of linear regressions is a natural generalization of the basic linear regression problem. In the basic problem, the goal is to learn the best linear relationship between the scalar responses (i.e., labels) and the explanatory variables (i.e., features). In the generalization, each scalar response is stochastically generated by picking a function uniformly from a set of L unknown linear functions, evaluating this function on the explanatory variables and possibly adding noise; the goal is to learn the set of L unknown linear functions. The problem was introduced by De Veaux [11] over thirty years ago and has recently attracted growing interest [8, 14, 22, 24, 25, 27]. Recent work focuses on a query-based scenario in which the input to the randomly chosen linear function can be specified by the learner. The sparse setting, in which each linear function depends on only a small number of variables, was recently considered by Yin et al. [27], and can be viewed as a generalization of the well-studied compressed sensing problem [7, 13]. The problem has numerous applications in modelling heterogeneous data arising in medical applications, behavioral health, and music perception [27].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Formal Problem Statement. There are L unknown distinct vectors β₁, β₂, …, β_L ∈ ℝⁿ and each is k-sparse, i.e., the number of non-zero entries in each β_i is at most k, where k is some known parameter.
We define an oracle O which, when queried with a vector x ∈ ℝⁿ, returns the noisy output y ∈ ℝ:

y = ⟨x, β⟩ + η,    (1)

where η is a random variable with Eη = 0 that represents the measurement noise and β is chosen uniformly¹ from the set B = {β₁, β₂, …, β_L}. The goal is to recover all vectors in B by making a set of queries x₁, x₂, …, x_m to the oracle. We refer to the values returned by the oracle given these queries as samples. Note that the case of L = 1 corresponds to the problem of compressed sensing. Our primary focus is on the sample complexity of the problem, i.e., minimizing the number of queries that suffices to recover the sparse vectors up to some tolerable error.

Related Work. The most relevant previous work is by Yin et al. [27]. For the noiseless case, i.e., η = 0, they show that O(kL log(kL)) queries are sufficient to recover all vectors in B with high probability. However, their result requires a restrictive assumption on the set of vectors and does not hold for an arbitrary set of sparse vectors. Specifically, they require that for any β, β′ ∈ B,

β_j ≠ β′_j  for each  j ∈ supp(β) ∩ supp(β′).    (2)

Their approach depends crucially on this assumption and this limits its applicability. Note that our results will not depend on such an assumption. For the noisy case, the approach taken by Yin et al. only handles the L = 2 case and they state the case of L > 2 as an important open problem. Resolving this open problem will be another one of our contributions.

More generally, both compressed sensing [7, 13] and learning mixtures of distributions [10, 23] are immensely popular topics across statistics, signal processing and machine learning with a large body of prior work. Mixture of linear regressions is a natural synthesis of mixture models and linear regression, a very basic machine learning primitive [11].
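The oracle model of Eq. (1) is easy to simulate. The following minimal sketch (the class name, dimensions, and signal values are our own illustrative choices, not from the paper) draws a hidden β uniformly at random on every query:

```python
import numpy as np

class MixtureOracle:
    """Toy simulator of the query oracle in Eq. (1): each query returns
    <x, beta> + eta for a beta drawn uniformly from the hidden collection."""
    def __init__(self, betas, sigma, seed=0):
        self.betas = betas            # hidden collection B
        self.sigma = sigma            # noise standard deviation
        self.rng = np.random.default_rng(seed)

    def query(self, x):
        beta = self.betas[self.rng.integers(len(self.betas))]
        return float(x @ beta) + self.sigma * self.rng.normal()

# Two 1-sparse signals in R^3; noiseless oracle (sigma = 0).
b1, b2 = np.array([1.0, 0.0, 0.0]), np.array([0.0, -2.0, 0.0])
oracle = MixtureOracle([b1, b2], sigma=0.0)
responses = {oracle.query(np.ones(3)) for _ in range(50)}
assert responses == {1.0, -2.0}   # only the two inner products can appear
```

Repeating the same query many times reveals the (noiseless) inner products with every member of the collection, which is exactly the property the algorithms below exploit.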
Most of the work on the problem has considered learning generic vectors, i.e., not necessarily sparse, and proposes a variety of algorithmic techniques to obtain polynomial sample complexity [8, 14, 19, 24, 26]. To the best of our knowledge, Städler et al. [22] were the first to impose sparsity on the solutions. However, many of the earlier papers on mixtures of linear regressions essentially consider the queries to be fixed, i.e., part of the input, whereas in this paper, and in Yin et al. [27], we are interested in designing the queries in such a way as to minimize their number.

Our Results and Techniques. We present results for both the noiseless and noisy cases. The latter is significantly more involved and is the main technical contribution of this paper.

Noiseless Case: In the case where there is no noise and the L unknown vectors are k-sparse, we show that O(kL log(kL)) queries suffice and that Ω(kL) queries are necessary. The upper bound matches the query complexity of the result by Yin et al. but our result applies to all k-sparse vectors rather than just those satisfying the assumption in Eq. (2). The approach we take is as follows: In compressed sensing, exact recovery of k-sparse vectors is possible by taking samples with an m × n matrix in which any 2k columns are linearly independent. Such matrices exist with m = 2k (such as Vandermonde matrices) and are called MDS matrices. We use the rows of such a matrix repeatedly to generate samples. Since there are L different vectors in the mixture, with O(L log L) measurements with a row we will be able to see the samples corresponding to each of the L vectors with that row. However, even if this is true for measurements with each row, we will still not be able to align measurements across the rows.
For example, even though we will obtain ⟨x, β_ℓ⟩ for all ℓ ∈ [L] and for all x that are rows of an MDS matrix, we will be unable to identify the samples corresponding to β₁. To tackle this problem, we propose using a special type of MDS matrix that allows us to align measurements corresponding to the same βs. After that, we just use the sparse recovery property of the MDS matrix to individually recover each of the vectors.

Noisy Case: We assume that the noise η is a Gaussian random variable with zero mean. Going forward, we write N(μ, σ²) to denote a Gaussian distribution with mean μ and variance σ². Furthermore, we will no longer assume vectors in B are necessarily sparse. From the noisy samples, our objective is to recover an estimate β̂ for each β ∈ B such that

‖β − β̂‖ ≤ c‖β − β*‖,    (3)

where c is an absolute constant and β* is the best k-sparse approximation of β, i.e., β with all except the largest (by absolute value) k coordinates set to 0. The norms in the above equation can be arbitrary, defining the strength of the guarantee; e.g., when we refer to an ℓ₁/ℓ₁ guarantee both norms are ‖·‖₁. Our results should be contrasted with [27], where results hold only for L = 2 and under assumption (2), and the vectors are also strictly k-sparse. However, like [27], we assume ε-precision of the unknown vectors, i.e., the value in each coordinate of each β ∈ B is an integer multiple of ε.²

Notice that in this model the noise is additive and not multiplicative. Hence, it is possible to increase the ℓ₂ norm of the queries arbitrarily so that the noise becomes inconsequential. However, in a real setting, this cannot be allowed since increasing the strength (norm) of the queries has a cost and it is in our interest to minimize the cost.

¹Many of our results can be generalized to non-uniform distributions but we will assume a uniform distribution throughout for the sake of clarity.
Suppose the algorithm designs the ith query vector by first choosing a distribution Q_i and subsequently sampling a query vector x_i ∼ Q_i. Let us now define the signal to noise ratio as follows:

SNR = max_i min_ℓ E_{x_i∼Q_i}|⟨x_i, β_ℓ⟩|² / Eη².    (4)

Our objective in the noisy setting is to recover the unknown vectors β₁, β₂, …, β_L ∈ ℝⁿ while minimizing the number of queries and the SNR at the same time. In this setting, assuming that all the unknown vectors have unit norm, we show that O(k log³ n · exp((σ/ε)^{2/3})) queries with SNR = O(1/σ²) suffice to reconstruct the L = O(1) vectors in B with the approximation guarantees given in Eq. (3) with high probability if the noise η is a zero-mean Gaussian with variance σ². This is equivalent to stating that O(k log³ n · exp(1/(ε√SNR)^{2/3})) queries suffice to recover the L unknown vectors with high probability.

Note that in the previous work ε√SNR is assumed to be at least constant and, if this is the case, our result is optimal up to polynomial factors since Ω(k) queries are required even if L = 1. More generally, the dependence upon ε√SNR in our result improves upon the dependence in the result by Yin et al. Note that we assumed L = O(1) in our result because the dependence of the sample complexity on L is complicated, as it is implicit in the signal-to-noise ratio.

As in the noiseless case, our approach is to use a compressed sensing matrix and use its rows multiple times as queries to the oracle. In the first step, we would like to separate out the different βs from their samples with the same rows. Unlike the noiseless case, even this turns out to be a difficult task. Under the assumption of Gaussian noise, however, we are able to show that this is equivalent to learning a mixture of Gaussians with different means.
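The separation step can be pictured on a one-dimensional toy instance. The sketch below is a simple stand-in for the paper's exact estimator (which is described later via Lemma 2): it clusters repeated responses with Lloyd-style iterations and then snaps each centre to the ε-grid, which yields the means exactly once every centre lands within ε/2 of the truth. All numeric values here are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, sigma, L = 0.5, 0.05, 3
true_means = np.array([-1.0, 0.0, 1.5])       # each an integer multiple of eps
samples = rng.choice(true_means, size=3000) + sigma * rng.normal(size=3000)

# Lloyd-style clustering (a practical stand-in, not the paper's
# complex-analytic method), initialized at evenly spaced sample quantiles.
centres = np.quantile(samples, (np.arange(L) + 0.5) / L)
for _ in range(50):
    labels = np.argmin(np.abs(samples[:, None] - centres[None, :]), axis=1)
    centres = np.array([samples[labels == j].mean() for j in range(L)])

# Snap each centre to the eps-grid: exact recovery of the means.
recovered = np.sort(eps * np.round(centres / eps))
assert np.allclose(recovered, np.sort(true_means))
```

This illustrates why the ε-grid assumption is powerful: approximate mean estimates can be rounded into exact ones, after which the problem behaves like the noiseless case.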
In this case, the means of the Gaussians belong to an "ε-grid", because of the assumption on the precision of the βs. This is not a standard setting in the literature on learning Gaussian mixtures, e.g., [1, 16, 20]. Note that this is possible if the vector that we are sampling with has integer entries; as we will see, a binary-valued compressed sensing matrix will do the job for us. We will rely on a novel complex-analytic technique to exactly learn the means of a mixture of Gaussians with means belonging to an ε-grid. This technique is paralleled by recent developments in trace reconstruction, where similar methods were used for learning a mixture of binomials [18, 21].

Once, for each query, the samples are separated, we are still tasked with aligning them so that we know the samples produced by the same β across different queries. The method for the noiseless case fails to work here. Instead, we use a new method motivated by error-correcting codes. In particular, we perform several redundant queries that help us to do this alignment. For example, in addition to the pair of queries x_i, x_j, we also perform the queries defined by x_i + x_j and x_i − x_j.

After the alignment, we use compressed sensing recovery to estimate the unknown vectors. For this, we must start with a matrix that, with a minimal number of rows, will allow us to recover any vector with a guarantee such as (3). On top of this, we also need the matrix to have integer entries so that we can use our method of learning a mixture of Gaussians with means on an ε-grid. Fortunately, a random binary ±1 matrix satisfies all the requirements [3].
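The redundant-query alignment idea can be sketched on a noiseless toy instance with L = 2 (the signals and query vectors below are our own hand-picked examples, and we use the full sum query v₁ + v₂ rather than the halved versions that appear later in Algorithm 2):

```python
import numpy as np

n = 10
beta1 = np.zeros(n); beta1[[0, 3]] = [1.0, -2.0]
beta2 = np.zeros(n); beta2[[1, 4]] = [0.5, 1.5]
v1 = np.ones(n)                                   # first +/-1 query vector
v2 = np.array([(-1.0) ** i for i in range(n)])    # second +/-1 query vector

# Unordered denoised means for each query (what mixture learning returns),
# plus the responses to the redundant sum query used as a parity check.
a = sorted([v1 @ beta1, v1 @ beta2])
b = sorted([v2 @ beta1, v2 @ beta2])
sums = {round(v1 @ bt + v2 @ bt, 6) for bt in (beta1, beta2)}

# b[j] aligns with a[0] exactly when their sum appears among the responses
# to the query v1 + v2 (the paper additionally checks v1 - v2).
j = 0 if round(a[0] + b[0], 6) in sums else 1
aligned = [(a[0], b[j]), (a[1], b[1 - j])]
for x, y in aligned:
    assert any(np.isclose(x, v1 @ bt) and np.isclose(y, v2 @ bt)
               for bt in (beta1, beta2))
```

Each aligned pair now consists of two inner products with the same hidden signal, which is what the compressed sensing decoder needs.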
Putting together these three steps of mixture learning, aligning and compressed sensing lets us arrive at our results.

While we concentrate on sample complexity in this paper, our algorithm for the noiseless case is computationally efficient, and the only computationally inefficient step in the general noisy case is that of learning Gaussian mixtures. However, in practice one can perform a simple clustering (such as Lloyd's algorithm) to learn the means of the mixture.

²Note that we do not assume ε-precision in the noiseless case.

Algorithm 1 Noiseless Recovery: The algorithm for recovering vectors via queries to the oracle in the noiseless setting.
Require: Number of unknown sparse vectors L, dimension n, sparsity k.
1: Let t ∈_R {0, 1, 2, …, k²L² − 1} and define α₁, α₂, …, α_{2k} where α_j = (2kt + j)/(2k³L²).
2: for i = 1, 2, …, 2k do
3:   Make L log(Lk²) oracle queries with the vector [1 α_i α_i² … α_i^{n−1}]. Refer to these as a batch.
4: end for
5: for i = 1, 2, …, 2k do
6:   For each batch of query responses corresponding to the same query vector, retain unique values and sort them in ascending order. Refer to this as the processed batch.
7: end for
8: Set a matrix Q of dimension 2k × L such that its jth row is the processed batch corresponding to the query vector [1 α_j α_j² … α_j^{n−1}].
9: for i = 1, 2, …, L do
10:   Decode the ith column of the matrix Q to recover β_i.
11: end for
12: Return β₁, β₂, …, β_L.

Organization and Notation. In Section 2, we present our results for the noiseless case. In Section 3.1 we consider the case with noise when L = 2 and then consider noise and general L in Section 3.2. Most proofs are deferred to the appendix in the supplementary material. Throughout, we write x ∈_R X to denote taking an element x from a finite set X uniformly at random. For n ∈ ℕ, let [n] := {1, 2, .
. . , n}.

2 Exact sparse vectors and noiseless samples

To begin, we deal with the case of a uniform mixture of exactly sparse vectors with the oracle returning noiseless answers when queried with a vector. For this case, our scheme is provided in Algorithm 1. The main result for this section is the following.

Theorem 1. For a collection of L vectors β₁, β₂, …, β_L ∈ ℝⁿ such that ‖β_i‖₀ ≤ k for all i ∈ [L], one can recover all of them exactly with probability at least 1 − 3/k with a total of 2kL log(Lk²) oracle queries. See Algorithm 1.

A Vandermonde matrix is a matrix such that the entries in each row are in geometric progression, i.e., for an m × n dimensional Vandermonde matrix the (i, j)th entry is α_i^j, where α₁, α₂, …, α_m ∈ ℝ are distinct values. We will use the following useful property of Vandermonde matrices; see, e.g., [15, Section XIII.8] for the proof.

Lemma 1. The rank of any m × m square submatrix of a Vandermonde matrix is m, assuming α₁, α₂, …, α_m are distinct and positive.

This implies that, with the samples from a 2k × n Vandermonde matrix, a k-sparse vector can be exactly recovered. This is because, for any two unknown vectors β and β̂, the same set of responses for all the 2k rows of the Vandermonde matrix would imply that a 2k × 2k square submatrix of the Vandermonde matrix is not full rank, which is a contradiction to Lemma 1.

We are now ready to prove Theorem 1.

Proof. For the case of L = 1, note that the setting is the same as the well-known compressed sensing problem. Furthermore, suppose a 2k × n matrix has the property that any 2k × 2k submatrix is full rank; then using the rows of this matrix as queries is sufficient to recover any k-sparse vector. By Lemma 1, any 2k × n Vandermonde matrix has the necessary property.

Let β₁, β₂, …, β_L be the set of unknown k-sparse vectors.
Notice that a particular row of the Vandermonde matrix looks like [1 z z² z³ … z^{n−1}] for some value of z ∈ ℝ. Therefore, for some vector β_i and a particular row of the Vandermonde matrix, the inner product of the two can be interpreted as a polynomial of degree at most n − 1 evaluated at z, such that the coefficients of the polynomial form the vector β_i. More formally, the inner product can be written as f_{β_i}(z) = Σ_{j=0}^{n−1} β_{ij} z^j, where f_{β_i} is the polynomial corresponding to the vector β_i. For any value z ∈ ℝ, we can define an ordering over the L polynomials f_{β₁}, f_{β₂}, …, f_{β_L} such that f_{β_i} > f_{β_j} iff f_{β_i}(z) > f_{β_j}(z).

For two distinct indices i, j ∈ [L], we will call the polynomial f_{β_i} − f_{β_j} a difference polynomial. Each difference polynomial has at most 2k non-zero coefficients and therefore has at most 2k positive roots by Descartes' rule of signs [9]. Since there are at most L(L − 1)/2 distinct difference polynomials, the total number of distinct values that are roots of at least one difference polynomial is less than kL². Note that if an interval does not include any of these roots, then the ordering of f_{β₁}, …, f_{β_L} remains consistent for any point in that interval. In particular, consider the intervals (0, δ], (δ, 2δ], …, (1 − δ, 1] where δ = 1/(k²L²). At most kL² of these intervals include a root of a difference polynomial and hence, if we pick a random interval, then with probability at least 1 − 1/k the ordering of f_{β₁}, …, f_{β_L} is consistent throughout the interval. If the interval chosen is (tδ, (t + 1)δ] then set α_j = tδ + jδ/(2k) for j = 1, …, 2k.

Now for each value of α_i, define the vector x_i ≡ [1 α_i α_i² α_i³ … α_i^{n−1}]. For each i ∈ [2k], the vector x_i will be used as a query to the oracle repeatedly, L log(Lk²) times. We will call the set of query responses from the oracle for a fixed query vector x_i a batch.
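A toy end-to-end run of this pipeline can be sketched as follows. The dimensions, signals, and Vandermonde nodes are hand-picked for the demo (the nodes are fixed inside one root-free interval instead of being drawn randomly), and `decode_1sparse` is our own demo-sized decoder for k = 1, not the paper's general 2k-row decoder:

```python
import numpy as np
from itertools import combinations

n, k, L = 6, 1, 2
beta_a = np.zeros(n); beta_a[2] = 1.0          # f(z) = z^2
beta_b = np.zeros(n); beta_b[4] = -1.0         # f(z) = -z^4
betas = [beta_a, beta_b]

# 2k Vandermonde query vectors with hand-picked nodes in (0.5, 0.53].
alphas = np.array([0.51, 0.52])
V = np.vander(alphas, n, increasing=True)

# Lemma 1 numerically: every 2k x 2k column submatrix is full rank.
assert all(np.linalg.matrix_rank(V[:, c]) == 2 * k
           for c in combinations(range(n), 2 * k))

# Dedupe and sort each batch; sorted positions align across batches
# because the polynomial ordering is constant on the chosen interval.
Q = np.array([sorted({float(V[i] @ b) for b in betas}) for i in range(2 * k)])

def decode_1sparse(V, y):
    # Demo-sized decoder for k = 1: brute-force the single support index.
    for j in range(V.shape[1]):
        c = y[0] / V[0, j]
        if np.allclose(V[:, j] * c, y):
            out = np.zeros(V.shape[1]); out[j] = c
            return out

recovered = [decode_1sparse(V, Q[:, col]) for col in range(L)]
assert any(np.allclose(r, beta_a) for r in recovered)
assert any(np.allclose(r, beta_b) for r in recovered)
```

Each sorted position across batches collects 2k measurements of one hidden vector, which the Vandermonde property then decodes uniquely.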
For a fixed batch and j,

Pr(β_j is not sampled by the oracle in the batch) ≤ (1 − 1/L)^{L log(Lk²)} ≤ e^{−log(Lk²)} = 1/(Lk²).

Taking a union bound over all the vectors (L of them) and all the batches (2k of them), we get that in every batch every vector β_j for j ∈ [L] is sampled with probability at least 1 − 2/k. Now, for each batch, we will retain the unique values (there should be exactly L of them with high probability) and sort the values in each batch. Since the ordering of the polynomials remains the same, after sorting, all the values in a particular position in each batch correspond to the same vector β_j for some unknown index j ∈ [L]. We can aggregate the query responses of all the batches in each position and, since there are 2k linear measurements corresponding to the same vector, we can recover all the unknown vectors β_j using Lemma 1. The failure probability of this algorithm is at most 3/k.

The following theorem establishes that our method is almost optimal in terms of sample complexity.

Theorem 2. At least 2Lk oracle queries are necessary to recover an arbitrary set of L vectors that are k-sparse.

3 Noisy Samples and Sparse Approximation

We now consider the more general setting where the oracle is noisy and the vectors β₁, …, β_L are not necessarily sparse. We assume L is an arbitrary constant, i.e., it does not grow with n or k, and that the unknown vectors have ε precision, i.e., each entry is an integer multiple of ε. The noise will be Gaussian with zero mean and variance σ², i.e., η ∼ N(0, σ²). Our main result of this section is the following.

Theorem 3. It is possible to recover approximations with the ℓ₁/ℓ₁ guarantee in Eq. (3), with probability at least 1 − 2/n, of all the unknown vectors β_ℓ ∈ {0, ±ε, ±2ε, ±3ε, …}ⁿ, ℓ = 1, . . .
, L, with O(k log³ n · exp((σ/ε)^{2/3})) oracle queries, where SNR = O(1/σ²).

Before we proceed with the ideas of the proof, it is useful to recall the restricted isometry property (RIP) of matrices in the context of the recovery guarantees of (3). A matrix Φ ∈ ℝ^{m×n} satisfies the (k, δ)-RIP if for any vector z ∈ ℝⁿ with ‖z‖₀ ≤ k,

(1 − δ)‖z‖₂² ≤ ‖Φz‖₂² ≤ (1 + δ)‖z‖₂².    (5)

It is known that if a matrix is (2k, δ)-RIP with δ < √2 − 1, then the guarantee of (3) (in particular, the ℓ₁/ℓ₁ guarantee and also an ℓ₂/ℓ₁ guarantee) is achievable [6] with the basis pursuit algorithm, an efficient algorithm based on linear programming. It is also known that a random ±1 matrix (with normalized columns) satisfies the property with c_s k log n rows, where c_s is an absolute constant [3].

There are several key ideas in the proof. Since the case of L = 2 is simpler to handle, we start with that and then provide the extra steps necessary for the general case subsequently.

Algorithm 2 Noisy Recovery for L = 2: The algorithm for recovering the best k-sparse approximation of vectors via queries to the oracle in the noisy setting.
Require: SNR = 1/σ², precision ε of the unknown vectors, and the constant c_s such that c_s k log n rows are sufficient for RIP in binary matrices.
1: for i = 1, 2, …, c_s k log(n/k) do
2:   Call SampleAndRecover(v_i) where v_i ∈_R {+1, −1}ⁿ.
3: end for
4: for i ∈ [log n] and j ∈ [c_s k log(n/k)] with j ≠ i do
5:   Call SampleAndRecover((v_i + v_j)/2) and SampleAndRecover((v_i − v_j)/2).
6: end for
7: Choose a vector v from {v₁, v₂, …, v_{log n}} such that ⟨v, β₁⟩ ≠ ⟨v, β₂⟩.
8: for i = 1, 2, …, k log(n/k) and v_i ≠ v do
9:   Label one of ⟨v_i, β₁⟩, ⟨v_i, β₂⟩ to be ⟨v, β₁⟩ if their sum is in the pair ⟨(v_i + v)/2, β₁⟩, ⟨(v_i + v)/2, β₂⟩ and their difference is in the pair ⟨(v − v_i)/2, β₁⟩, ⟨(v − v_i)/2, β₂⟩. Label the other ⟨v, β₂⟩.
10: end for
11: Aggregate all (query, denoised query response) pairs labelled ⟨v, β₁⟩ and ⟨v, β₂⟩ separately and multiply all denoised query responses by a factor of 1/√(c_s k log(n/k)).
12: Return the best k-sparse approximation of β₁ and β₂ by using the basis pursuit algorithm on each aggregated cluster of (query, denoised query response) pairs.
13: function SampleAndRecover(v)
14:   Issue T = c₂ exp((σ/ε)^{2/3}) queries to the oracle with v.
15:   Return ⟨v, β₁⟩, ⟨v, β₂⟩ via the min-distance estimator (Gaussian mixture learning, Lemma 2).
16: end function

3.1 Gaussian Noise: Two vectors

Algorithm 2 addresses the setting with only two unknown vectors. We will assume ‖β₁‖₂ = ‖β₂‖₂ = 1, so that we can subsequently show that the SNR is simply 1/σ². This assumption is not necessary, but we make it for ease of presentation. The assumption of ε-precision for β was made in Yin et al. [27], and we stick to the same assumption. On the other hand, Yin et al. require further assumptions that we do not need to make. Furthermore, the result of Yin et al. is restricted to exactly sparse vectors, whereas our result holds for general sparse approximation.

For the two-vector case the result we aim to show is the following.

Theorem 4. Algorithm 2 uses O(k log³ n · exp((σ/ε)^{2/3})) queries to recover both the vectors β₁ and β₂ with an ℓ₁/ℓ₁ guarantee in Eq. (3) with probability at least 1 − 2/n.

This result is directly comparable with [27]. On the statistical side, we improve their result in several ways: (1) we improve the dependence on σ/ε in the sample complexity from exp(σ/ε) to exp((σ/ε)^{2/3}),³ (2) our result applies to dense vectors, recovering the best k-sparse approximations, and (3) we do not need the overlap assumption (eq. 
(2)) used in their work.

Once we show SNR = 1/σ², Theorem 4 trivially implies Theorem 3 in the case L = 2. Indeed, from Algorithm 2, notice that we have used vectors v sampled uniformly at random from {+1, −1}ⁿ as query vectors. We must have E_v|⟨v, β_ℓ⟩|²/Eη² = ‖β_ℓ‖₂²/σ² = 1/σ² for ℓ = 1, 2. Further, we have used the sum and difference query vectors, which have the form (v₁ + v₂)/2 and (v₁ − v₂)/2 respectively, where v₁, v₂ are sampled uniformly and independently from {+1, −1}ⁿ. Therefore, we must have, for ℓ = 1, 2, E_{v₁,v₂}|⟨(v₁ ± v₂)/2, β_ℓ⟩|²/Eη² = 1/(2σ²). According to our definition of SNR, we have that SNR = 1/σ².

A description of Algorithm 2 that leads to the proof of Theorem 4 can be found in Appendix B. We provide a short sketch here and state an important lemma that we will use in the more general case. The main insight is that for a fixed sensing vector v, if we repeatedly query with v, we obtain samples from a mixture of Gaussians ½N(⟨v, β₁⟩, σ²) + ½N(⟨v, β₂⟩, σ²). If we can exactly recover the means of these Gaussians, we essentially reduce to the noiseless case from the previous section. The first key step upper bounds the sample complexity for exactly learning the parameters of a mixture of Gaussians.

Lemma 2 (Learning Gaussian mixtures). Let M = (1/L) Σ_{i=1}^L N(μ_i, σ²) be a uniform mixture of L univariate Gaussians, with known shared variance σ² and with means μ_i ∈ εℤ. Then, for some constant c > 0 and some t = ω(L), there exists an algorithm that requires ctL² exp((σ/ε)^{2/3}) samples from M and exactly identifies the parameters {μ_i}_{i=1}^L with probability at least 1 − 2e^{−2t}.

If we sense with v ∈ {−1, +1}ⁿ then ⟨v, β₁⟩, ⟨v, β₂⟩ ∈ εℤ, so appealing to the above lemma, we can proceed assuming we know these two values exactly.

³Note that [27] treat σ/ε as constant in their theorem statement, but the dependence can be extracted from their proof.
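Once the means are known and aligned, the final step is standard sparse recovery. The paper uses basis pursuit; the sketch below substitutes orthogonal matching pursuit as a simpler stand-in (our own substitution, with illustrative dimensions), decoding a k-sparse signal from noiseless aligned inner products with a normalized ±1 matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k, m = 32, 3, 128                   # m ~ c_s * k * log n rows
Phi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)  # normalized columns
beta = np.zeros(n); beta[[4, 11, 20]] = [2.0, -1.5, 1.0]
y = Phi @ beta                         # aligned, denoised inner products

# Orthogonal matching pursuit as a stand-in for the basis pursuit decoder:
# greedily pick the column most correlated with the residual, then
# re-project onto the selected columns.
support, residual = [], y.copy()
for _ in range(k):
    support.append(int(np.argmax(np.abs(Phi.T @ residual))))
    coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
    residual = y - Phi[:, support] @ coef
beta_hat = np.zeros(n)
beta_hat[support] = coef
assert np.allclose(beta_hat, beta, atol=1e-8)
```

With a well-conditioned ±1 matrix and noiseless measurements, the greedy decoder recovers the support and coefficients exactly; basis pursuit would give the ℓ₁/ℓ₁ robustness guarantee of Eq. (3) in the general dense case.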
Unfortunately, the sensing vectors here are more restricted (we must maintain bounded SNR, and our technique of mixture learning requires that the means have finite precision), so we cannot simply appeal to our noiseless results for the alignment step. Instead we design a new alignment strategy, inspired by error correcting codes. Given two query vectors v₁, v₂ and the exact means ⟨v_i, β_j⟩, i, j ∈ {1, 2}, we must identify which values correspond to β₁ and β₂. In addition to sensing with any pair v₁ and v₂, we sense with (v₁ ± v₂)/2, and we use these two additional measurements to identify which recovered means correspond to β₁ and which correspond to β₂. Intuitively, we can check if our alignment is correct via these reference measurements.

Therefore, we can obtain aligned, denoised inner products with each of the two parameter vectors. At this point we can apply a standard compressed sensing result, as mentioned at the start of this section, to obtain the sparse approximations of the vectors.

3.2 General value of L

In this setting, we will have L > 2 unknown vectors β₁, β₂, …, β_L ∈ ℝⁿ, of unit norm each, from which the oracle can sample with equal probability. We assume that L does not grow with n or k and, as before, all the elements of the unknown vectors lie on an ε-grid. Here, we will build on the ideas for the special case of L = 2. The main result of this section is the following.

Theorem 5. Algorithm 3 uses O(k (log n)³ exp((σ/ε)^{2/3})) queries with SNR = O(1/σ²) to recover all the vectors β₁, …, β_L with ℓ₁/ℓ₁ guarantees in Eq. (3) with probability at least 1 − 2/n.

Theorem 3 follows as a corollary of this result. The analysis of Algorithm 3 and the proofs of Theorems 3 and 5 are provided in detail in Appendix D. Below we sketch some of the main points of the proof.

There are two main hurdles in extending the steps explained for L = 2. For a query vector v, we define
For a query vector v, we de\ufb01ne\nthe denoised query means to be the set of elements {hv, ii}L\ni=1. Recall that a query vector v is de-\n\ufb01ned to be good if all the elements in the set of denoised query means {hv, 1i,hv, 2i, . . . ,hv, Li}\nare distinct. For L = 2, the probability of a query vector v being good for L = 2 is at least 1/2 but\nfor a value of L larger than 2, it is not possible to obtain such guarantees without further assumptions.\nFor a more concrete example, consider L 4 and the unknown vectors 1, 2, . . . , L to be such\nthat i has 1 in the ith position and zero everywhere else. If v is sampled from {+1,1}n as before,\nthen hv, ii can take values only in {1, 0, +1} and therefore it is not possible that all the values\nhv, ii are distinct. Secondly, even if we have a good query vector, it is no longer trivial to extend\nthe clustering or alignment step. Hence a number of new ideas are necessary to solve the problem for\nany general value of L.\n\nWe need to de\ufb01ne a few constants which are used in the algorithm. Let < p2 1 be a constant\n\n(we need a that allow k-sparse approximation given a (2k, )-RIP matrix). Let c0 be a large positive\nconstant such that\n\nSecondly, let \u21b5? be another positive constant that satis\ufb01es the following for a given value of c0,\n\n\u21b5? = maxn\u21b5 :\n\n\u21b5\u21b5\n\n(\u21b5 1)\u21b51 < exp\u21e3 2\n\n16 \n\n3\n48 \n\n1\n\nc0\u2318o.\n\n2\n16 \n\n3\n48 \n\n1\nc0\n\n> 0.\n\n7\n\n(A)\n\n(B)\n\n\fand precision of unknown vectors as \u270f.\n\nAlgorithm 3 Noisy Recovery for any constant L The algorithm for recovering best k-sparse\napproximation of vectors via queries to oracle in noisy setting.\nRequire: c0,\u21b5 ?, z? as de\ufb01ned in equations (A), (B) and (C) respectively, Variance of noise E\u23182 = 2\n1: for i = 1, 2, . . . ,p\u21b5? 
log n + c0\u21b5?k log(n/k) do\n2:\n3: Make c2 exp((/\u270f)2/3) queries to the oracle using each of the vectors (qi 1)ri, vi + qiri\nt=1 by using min-distance\n4:\n\nLet vi 2R {+1,1}n, ri 2R {2z?,2z? + 1, . . . , 2z?}n, qi 2R {1, 2, . . . , 4z? + 1}\nand vi + ri.\nt=1,{hvi + ri, ti}L\nRecover h{(qi 1)ri, ti}L\nestimator (Gaussian mixture learning, lemma 2).\n\nt=1,{hvi + qiri, ti}L\n\n5: end for\n6: for i 2 [p\u21b5? log n] and j 2 [\u21b5?k log(nk)] do\n7: Make c2 exp((/\u270f)2/3) queries to the oracle using the vector ri+j + ri.\n8:\n\nRecover {hri+j + ri, ti}L\nLemma 2).\n\nt=1, by using the min-distance estimator (Gaussian mixture learning,\n\n9: end for\n10: Choose vector (v?, r?, q?) from {(vt, rt, qt)}\n\nsuch that (v? + r?, (q 1)r?, v? + q?r?)\nis good. Call a triplet (v + r, (q 1)r, v + qr) to be good if no element in {hv + qr, ii}L\ncan be written in two possible ways as sum of two elements, one each from {hv + r, ii}L\nand {h(q 1)r, ii}L\n\np\u21b5? log n\nt=1\n\ni=1.\n\ni=1\n\ni=1\n\n11: Initialize Sj = for j = 1, . . . , L\n12: for i = p\u21b5? log n + 1, 2, . . . ,p\u21b5? log n + c0\u21b5?k log n\n13:\n\nk do\n\nif (vi + ri, (qi 1)ri, vi + qri) is matching good with respect to (v? + r?, (q 1)r?, v? +\nq?r?) (Call a triplet (v0 + r0, (q0 1)r0, v0 + q0r0) to be matching good w.r.t a good triplet\n(v? + r?, (q? 1)r?, v? + q?r?) if (v0 + r0, (q0 1)r0, v0 + q0r0) and (r0, r?, r0 + r?) are\ngood. ) then\n\nLabel the elements in {hvi, ti}L\nfor j = 1, 2, . . . , L do\n\nt=1 as described in Lemma 18\n\nSj = Sj [ {hvi, ti} if label of hvi, ti is hr?, ji\n\nend for\n\n14:\n15:\n16:\n17:\nend if\n18:\n19: end for\n20: for j = 1, 2, . . . , L do\n21:\n22:\n23: end for\n24: Return 1, 2, . . . , L.\n\nAggregate the elements of Sj and scale them by a factor of 1/c0k log(n/k).\nRecover the vector j by using basis pursuit algorithms (compressed sensing decoding).\n\nFinally, for a given value of \u21b5? and L, let z? 
be the smallest integer that satisfies the following:

$$z^\star = \min\Big\{z \in \mathbb{Z} : 1 - L^3\Big(\frac{3}{4z+1} - \frac{1}{4z^2+1}\Big) \ge \frac{1}{\sqrt{\alpha^\star}}\Big\}. \qquad \text{(C)}$$

The Denoising Step. In each step of the algorithm, we sample a vector $v$ uniformly at random from $\{+1,-1\}^n$, another vector $r$ uniformly at random from $G \triangleq \{-2z^\star, -2z^\star+1, \ldots, 2z^\star-1, 2z^\star\}^n$, and a number $q$ uniformly at random from $\{1, 2, \ldots, 4z^\star+1\}$. We then use a batch of queries corresponding to the vectors $v+r$, $(q-1)r$ and $v+qr$. We define a triplet of query vectors $(v_1, v_2, v_3)$ to be good if for all triplets of indices $i, j, k \in [L]$ such that $i, j, k$ are not all identical,

$$\langle v_1, \beta_i\rangle + \langle v_2, \beta_j\rangle \ne \langle v_3, \beta_k\rangle.$$

We show that the query vector triplet $(v+r, (q-1)r, v+qr)$ is good with probability bounded away from zero. This implies that if we choose $O(\log n)$ such triplets of query vectors, then at least one of the triplets is good with probability $1 - 1/n$. It turns out that, for a good triplet of vectors $(v+r, (q-1)r, v+qr)$, we can obtain $\langle v, \beta_i\rangle$ for all $i \in [L]$.

Furthermore, it follows from Lemma 2 that for a query vector $v$ with integral entries, a batch size of $T > c_3\log n\cdot\exp((\sigma/\epsilon)^{2/3})$, for some constant $c_3 > 0$, is sufficient to recover the denoised query responses $\langle v, \beta_1\rangle, \langle v, \beta_2\rangle, \ldots, \langle v, \beta_L\rangle$ for all the queries with probability at least $1 - 1/\mathrm{poly}(n)$.

The Alignment Step. Let a particular good query vector triplet be $(v^\star+r^\star, (q^\star-1)r^\star, v^\star+q^\star r^\star)$. From now on, we consider the $L$ elements $\{\langle r^\star, \beta_i\rangle\}_{i=1}^L$ to be labels, and for a vector $u$ we associate a label with every element in $\{\langle u, \beta_i\rangle\}_{i=1}^L$. The labelling is correct if, for all $i \in [L]$, the element labelled $\langle r^\star, \beta_i\rangle$ also corresponds to the same unknown vector $\beta_i$. Notice that we can label the elements $\{\langle v^\star, \beta_i\rangle\}_{i=1}^L$ correctly because the triplet $(v^\star+r^\star, (q^\star-1)r^\star, v^\star+q^\star r^\star)$ is good. Consider another good query vector triplet $(v'+r', (q'-1)r', v'+q'r')$.
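The claim that a good triplet $(v+r, (q-1)r, v+qr)$ yields $\langle v, \beta_i\rangle$ for all $i$ can be made concrete. Since $(v+r) + (q-1)r = v+qr$, every denoised mean of the third query is a sum of one mean from each of the first two; goodness makes that decomposition unique, and each matched pair gives $\langle v, \beta_i\rangle = \langle v+r, \beta_i\rangle - \langle (q-1)r, \beta_i\rangle/(q-1)$. Below is a minimal sketch of this decomposition step, not the paper's full algorithm: the sizes $n, L, k, z$ are illustrative, the denoised means are assumed to be exact (noiseless), and $q \ge 2$ is imposed for simplicity.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, L, k, z = 20, 3, 2, 3  # illustrative sizes; z stands in for z*

# Unknown k-sparse integer signals beta_1, ..., beta_L (ground truth).
betas = []
for _ in range(L):
    b = np.zeros(n, dtype=int)
    b[rng.choice(n, size=k, replace=False)] = rng.integers(1, 5, size=k)
    betas.append(b)

def is_good(A, B, C):
    # Good triplet: every c in C decomposes as a + b in exactly one way.
    return all(sum(a + b == c for a, b in product(A, B)) == 1 for c in C)

# Resample (v, r, q) until the triplet (v+r, (q-1)r, v+qr) is good.
while True:
    v = rng.choice([-1, 1], size=n)              # v in {+1,-1}^n
    r = rng.integers(-2 * z, 2 * z + 1, size=n)  # r in {-2z, ..., 2z}^n
    q = int(rng.integers(2, 4 * z + 2))          # q >= 2 for simplicity
    # Denoised query means, known only as *unordered* sets after denoising.
    A = {int((v + r) @ b) for b in betas}        # <v+r, beta_i>
    B = {int((q - 1) * (r @ b)) for b in betas}  # <(q-1)r, beta_i>
    C = {int((v + q * r) @ b) for b in betas}    # <v+qr, beta_i>
    if len(C) == L and is_good(A, B, C):
        break

# The unique pairing a + b = c aligns the three sets component-wise,
# so <v, beta_i> = a - b/(q-1) for the matched pair (a, b).
recovered = sorted(a - b // (q - 1)
                   for c in C for a, b in product(A, B) if a + b == c)
assert recovered == sorted(int(v @ b) for b in betas)
```

In the noisy setting the same matching is performed after the min-distance estimator has produced the denoised means from a batch of queries; the decomposition step itself is unchanged.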
This matches with the earlier query triplet if, additionally, the vector triplet $(r', r^\star, r'+r^\star)$ is also good. Such a matching pair of good triplets exists and can be found by random choice with some probability. We show that matching good triplets allow us to carry out the alignment in the case of general $L > 2$. At this point we would again like to appeal to the standard compressed sensing results; however, we need to show that the matching good vectors themselves form a matrix that has the required RIP property. As our final step, we establish this fact.

Remark 3 (Refinement and adaptive queries). It is possible to obtain a sample complexity of $O\big(k(\log n)^2\log k\cdot\exp\big((\epsilon\sqrt{\mathrm{SNR}})^{-2/3}\big)\big)$ in Theorem 3, but with a probability of $1 - 1/\mathrm{poly}(k)$. It is also possible to shave off another $\log n$ factor from the sample complexity if the queries can be made adaptive.

Acknowledgements: This research is supported in part by NSF Grants CCF 1642658, 1618512, 1909046, 1908849 and 1934846.