{"title": "Compressed Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1713, "page_last": 1720, "abstract": "Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. In this paper we study a variant of this problem where the original $n$ input variables are compressed by a random linear transformation to $m \\ll n$ examples in $p$ dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for $\\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called ``sparsistence.'' In addition, we show that $\\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called ``persistence.'' Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero.", "full_text": "Compressed Regression\n\nShuheng Zhou\u2217 John Lafferty\u2217\u2020 Larry Wasserman\u2021\u2020\n\n\u2217Computer Science Department\n\n\u2021Department of Statistics\n\n\u2020Machine Learning Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nAbstract\n\nRecent research has studied the role of sparsity in high dimensional regression and\nsignal reconstruction, establishing theoretical limits for recovering sparse models\nfrom sparse data.\nIn this paper we study a variant of this problem where the\noriginal n input variables are 
compressed by a random linear transformation to m ≪ n examples in p dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for ℓ1-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that ℓ1-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the rate of information communicated between the compressed and uncompressed data that decay to zero.\n\n1 Introduction\n\nTwo issues facing the use of statistical learning methods in applications are scale and privacy. Scale is an issue in storing, manipulating and analyzing extremely large, high dimensional data. Privacy is, increasingly, a concern whenever large amounts of confidential data are manipulated within an organization. It is often important to allow researchers to analyze data without compromising the privacy of customers or leaking confidential information outside the organization. In this paper we show that sparse regression for high dimensional data can be carried out directly on a compressed form of the data, in a manner that can be shown to guard privacy in an information theoretic sense.\n\nThe approach we develop here compresses the data by a random linear or affine transformation, reducing the number of data records exponentially, while preserving the number of original input variables. 
These compressed data can then be made available for statistical analyses; we focus on the problem of sparse linear regression for high dimensional data. Informally, our theory ensures that the relevant predictors can be learned from the compressed data as well as they could be from the original uncompressed data. Moreover, the actual predictions based on new examples are as accurate as they would be had the original data been made available. However, the original data are not recoverable from the compressed data, and the compressed data effectively reveal no more information than would be revealed by a completely new sample. At the same time, the inference algorithms run faster and require fewer resources than the much larger uncompressed data would require. The original data need not be stored; they can be transformed “on the fly” as they come in.\n\nIn more detail, the data are represented as an n × p matrix X. Each of the p columns is an attribute, and each of the n rows is the vector of attributes for an individual record. The data are compressed by a random linear transformation X ↦ X̃ ≡ ΦX, where Φ is a random m × n matrix with m ≪ n. It is also natural to consider a random affine transformation X ↦ X̃ ≡ ΦX + Δ, where Δ is a random m × p matrix. Such transformations have been called “matrix masking” in the privacy literature [6]. The entries of Φ and Δ are taken to be independent Gaussian random variables, but other distributions are possible. We think of X̃ as “public,” while Φ and Δ are private and only needed at the time of compression. 
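As a concrete illustration of this matrix-masking step, here is a minimal NumPy sketch (our own, not the authors' code; the function name compress and all sizes are illustrative). It forms the compressed data X̃ = ΦX and Ỹ = ΦY with Φ drawn once, with i.i.d. N(0, 1/n) entries as in the paper's setup, and then discarded:

```python
import numpy as np

def compress(X, Y, m, rng=None):
    """Matrix masking: project n records down to m compressed rows.

    X : (n, p) data matrix, Y : (n,) response vector.
    Returns (X_tilde, Y_tilde) = (Phi @ X, Phi @ Y), where Phi has
    i.i.d. N(0, 1/n) entries. Phi is used once and never stored or
    published; only the compressed data are released.
    """
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    Phi = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
    return Phi @ X, Phi @ Y

# Illustrative usage: n = 1000 records compressed to m = 100 rows.
rng = np.random.default_rng(0)
n, p, m = 1000, 50, 100
X = rng.normal(size=(n, p))
Y = X @ np.r_[2.0, -1.5, np.zeros(p - 2)] + rng.normal(size=n)
X_t, Y_t = compress(X, Y, m, rng=1)
print(X_t.shape, Y_t.shape)  # (100, 50) (100,)
```

Note that recovering X from X̃ would require solving an m × n linear system with m ≪ n, which is highly under-determined.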
However, even with Δ = 0 and Φ known, recovering X from X̃ requires solving a highly under-determined linear system, and comes with information theoretic privacy guarantees, as we demonstrate.\n\nIn standard regression, a response variable Y = Xβ + ε ∈ R^n is associated with the input variables, where the ε_i are independent, mean zero additive noise variables. In compressed regression, we assume that the response is also compressed, resulting in the transformed response Ỹ ∈ R^m given by Y ↦ Ỹ ≡ ΦY = ΦXβ + Φε = X̃β + ε̃. Note that under compression, the ε̃_i, i ∈ {1, . . . , m}, in the transformed noise ε̃ = Φε are no longer independent. In the sparse setting, the parameter β ∈ R^p is sparse, with a relatively small number s = ‖β‖_0 of nonzero coefficients. The method we focus on is ℓ1-regularized least squares, also known as the lasso [17]. We study the ability of the compressed lasso estimator to identify the correct sparse set of relevant variables and to predict well.\n\nWe omit details and technical assumptions in the following theorems for clarity. Our first result shows that the lasso is sparsistent under compression, meaning that the correct sparse set of relevant variables is identified asymptotically.\n\nSparsistence (Theorem 3.3): If the number of compressed examples m satisfies C_1 s^2 log(nps) ≤ m ≤ √(C_2 n / log n), and the regularization parameter λ_m satisfies λ_m → 0 and mλ_m^2 / log p → ∞, then the compressed lasso estimator β̃_m = arg min_β (1/2m)‖Ỹ − X̃β‖_2^2 + λ_m‖β‖_1 is sparsistent: P(supp(β̃_m) = supp(β)) → 1 as m → ∞, where supp(β) = {j : β_j ≠ 0}.\n\nOur second result shows that the lasso is persistent under compression. Roughly speaking, persistence [10] means that the procedure predicts well, as measured by the predictive risk R(β) = E(Y − β^T X)^2, where X ∈ R^p is a new input vector and Y is the associated response. Persistence is a weaker condition than sparsistency, and in particular does not assume that the true model is linear.\n\nPersistence (Theorem 4.1): Given a sequence of sets of estimators B_{n,m} ⊂ R^p such that B_{n,m} = {β : ‖β‖_1 ≤ L_{n,m}}, with log^2(np) ≤ m ≤ n, the sequence of compressed lasso estimators β̃_{n,m} = arg min_{‖β‖_1 ≤ L_{n,m}} ‖Ỹ − X̃β‖_2^2 is persistent with respect to B_{n,m} for the predictive risk R(β) = E(Y − β^T X)^2 over uncompressed data, meaning that R(β̃_{n,m}) − inf_{‖β‖_1 ≤ L_{n,m}} R(β) converges to 0 in probability as n → ∞, in case L_{n,m} = o((m / log(np))^{1/4}).\n\nOur third result analyzes the privacy properties of compressed regression. We evaluate privacy in information theoretic terms by bounding the average mutual information I(X̃; X)/np per matrix entry in the original data matrix X, which can be viewed as a communication rate. Bounding this mutual information is intimately connected with the problem of computing the channel capacity of certain multiple-antenna wireless communication systems [13].\n\nInformation Resistance (Propositions 5.1 and 5.2): The rate at which information about X is revealed by the compressed data X̃ satisfies r_{n,m} = sup I(X; X̃)/np = O(m/n) → 0, where the supremum is over distributions on the original data X.\n\nAs summarized by these results, compressed regression is a practical procedure for sparse learning in high dimensional data that has provably good properties. Connections with related literature are briefly reviewed in Section 2. Analyses of sparsistence, persistence and privacy properties appear in Sections 3–5. 
Simulations for sparsistence and persistence of the compressed lasso are presented in Section 6. The proofs are included in the full version of the paper, available at http://arxiv.org/abs/0706.0534.\n\n2 Background and Related Work\n\nIn this section we briefly review related work in high dimensional statistical inference, compressed sensing, and privacy, to place our work in context.\n\nSparse Regression. An estimator that has received much attention in the recent literature is the lasso β̂_n [17], defined as β̂_n = arg min (1/2n)‖Y − Xβ‖_2^2 + λ_n‖β‖_1, where λ_n is a regularization parameter. In [14] it was shown that the lasso is consistent in the high dimensional setting under certain assumptions. Sparsistency proofs for high dimensional problems have appeared recently in [20] and [19]. The results and method of analysis of Wainwright [19], where X comes from a Gaussian ensemble and the ε_i are i.i.d. Gaussian, are particularly relevant to the current paper. We describe this Gaussian ensemble result, and compare our results to it, in Sections 3 and 6. Given that under compression the noise ε̃ = Φε is not i.i.d., one cannot simply apply this result to the compressed case. Persistence for the lasso was first defined and studied by Greenshtein and Ritov in [10]; we review their result in Section 4.\n\nCompressed Sensing. Compressed regression has close connections to, and draws motivation from, compressed sensing [4, 2]. However, in a sense, our motivation is the opposite of compressed sensing. While compressed sensing of X allows a sparse X to be reconstructed from a small number of random measurements, our goal is to reconstruct a sparse function of X. Indeed, from the point of view of privacy, approximately reconstructing X, which compressed sensing shows is possible if X is sparse, should be viewed as undesirable; we return to this point in Section 5. 
Several authors have considered variations on compressed sensing for statistical signal processing tasks [5, 11]. They focus on certain hypothesis testing problems under sparse random measurements, and a generalization to classification of a signal into two or more classes. Here one observes y = Φx, where y ∈ R^m, x ∈ R^n and Φ is a known random measurement matrix. The problem is to select between the hypotheses H̃_i : y = Φ(s_i + ε). The proofs use concentration properties of random projection, which underlie the celebrated Johnson-Lindenstrauss lemma. The compressed regression problem we introduce can be considered as a more challenging statistical inference task, where the problem is to select from an exponentially large set of linear models, each with a certain set of relevant variables with unknown parameters, or to predict as well as the best linear model in some class.\n\nPrivacy. Research on privacy in statistical data analysis has a long history, going back at least to [3]. We refer to [6] for discussion and further pointers into this literature; recent work includes [16]. The work of [12] is closely related to our work at a high level, in that it considers low rank random linear transformations of either the row space or column space of the data X. The authors note the Johnson-Lindenstrauss lemma, and argue heuristically that data mining procedures that exploit correlations or pairwise distances in the data are just as effective under random projection. The privacy analysis is restricted to observing that recovering X from X̃ requires solving an under-determined linear system. We are not aware of previous work that analyzes the asymptotic properties of a statistical estimator under random projection in the high dimensional setting, giving information-theoretic guarantees, although an information-theoretic quantification of privacy was proposed in [1]. We cast privacy in terms of the rate of information communicated about X through X̃, maximizing over all distributions on X, and identify this with the problem of bounding the Shannon capacity of a multi-antenna wireless channel, as modeled in [13]. Finally, it is important to mention the active area of cryptographic approaches to privacy from the theoretical computer science community, for instance [9, 7]; however, this line of work is quite different from our approach.\n\n3 Compressed Regression is Sparsistent\n\nIn the standard setting, X is an n × p matrix, Y = Xβ + ε is a vector of noisy observations under a linear model, and p is considered to be a constant. In the high-dimensional setting we allow p to grow with n. The lasso refers to the following problem: (P1) min ‖Y − Xβ‖_2^2 such that ‖β‖_1 ≤ L. In Lagrangian form, this becomes: (P2) min (1/2n)‖Y − Xβ‖_2^2 + λ_n‖β‖_1. For an appropriate choice of the regularization parameter λ = λ(Y, L), the solutions of these two problems coincide.\n\nIn compressed regression we project each column X_j ∈ R^n of X to a subspace of m dimensions, using an m × n random projection matrix Φ. Let X̃ = ΦX be the compressed design matrix, and let Ỹ = ΦY be the compressed response. Thus, the transformed noise ε̃ is no longer i.i.d. 
The compressed lasso is the following optimization problem, for Ỹ = ΦXβ + Φε = X̃β + ε̃, with Ω̃_m being the set of optimal solutions:\n\n(a) (P̃2) min_{β ∈ R^p} (1/2m)‖Ỹ − X̃β‖_2^2 + λ_m‖β‖_1, (b) Ω̃_m = arg min_{β ∈ R^p} (1/2m)‖Ỹ − X̃β‖_2^2 + λ_m‖β‖_1. (1)\n\nAlthough sparsistency is the primary goal in selecting the correct variables, our analysis establishes conditions for the stronger property of sign consistency:\n\nDefinition 3.1. (Sign Consistency) A set of estimators Ω_n is sign consistent with the true β if P(∃ β̂_n ∈ Ω_n s.t. sgn(β̂_n) = sgn(β)) → 1 as n → ∞, where sgn(·) is given by sgn(x) = 1, 0, and −1 for x >, =, or < 0 respectively. As a shorthand, denote the event that a sign consistent solution exists by E(sgn(β̂_n) = sgn(β*)) := {∃ β̂ ∈ Ω_n such that sgn(β̂) = sgn(β*)}.\n\nClearly, if a set of estimators is sign consistent then it is sparsistent.\n\nAll recent work establishing results on sparsity recovery assumes some form of incoherence condition on the data matrix X. To formulate such a condition, it is convenient to introduce an additional piece of notation. Let S = {j : β_j ≠ 0} be the set of relevant variables and let S^c = {1, . . . , p} ∖ S be the set of irrelevant variables. Then X_S and X_{S^c} denote the corresponding sets of columns of the matrix X. We will impose the following incoherence condition; related conditions are used by [18] in a deterministic setting. Let ‖A‖_∞ = max_i Σ_{j=1}^p |A_{ij}| denote the matrix ∞-norm.\n\nDefinition 3.2. (S-Incoherence) Let X be an n × p matrix and let S ⊂ {1, . . . , p} be nonempty. We say that X is S-incoherent in case\n\n‖(1/n) X_{S^c}^T X_S‖_∞ + ‖(1/n) X_S^T X_S − I_{|S|}‖_∞ ≤ 1 − η, for some η ∈ (0, 1]. (2)\n\nAlthough not explicitly required, we only apply this definition to X such that the columns of X satisfy ‖X_j‖_2^2 = Θ(n), ∀ j ∈ {1, . . . , p}. We can now state our main result on sparsistency.\n\nTheorem 3.3. Suppose that, before compression, Y = Xβ* + ε, where each column of X is normalized to have ℓ2-norm n, and ε ∼ N(0, σ^2 I_n). Assume that X is S-incoherent, where S = supp(β*), and define s = |S| and ρ_m = min_{i ∈ S} |β*_i|. We observe, after compression, Ỹ = X̃β* + ε̃, where Ỹ = ΦY, X̃ = ΦX, and ε̃ = Φε, with Φ_{ij} ∼ N(0, 1/n). Let β̃_m ∈ Ω̃_m as in (1b). If\n\n(16 C_1 s^2 / η^2 + 4 C_2 s / η)(ln p + 2 log n + log 2(s + 1)) ≤ m ≤ √(n / (16 log n)), (3)\n\nwith C_1 = 4e/√(6π) ≈ 2.5044 and C_2 = √8 e ≈ 7.6885, and λ_m → 0 satisfies (a) mη^2 λ_m^2 / log(p − s) → ∞, and (b)\n\n(1/ρ_m)(√(log s / m) + λ_m ‖((1/n) X_S^T X_S)^{−1}‖_∞) → 0, (4)\n\nthen the compressed lasso is sparsistent: P(supp(β̃_m) = supp(β)) → 1 as m → ∞.\n\n4 Compressed Regression is Persistent\n\nPersistence (Greenshtein and Ritov [10]) is a weaker condition than sparsistency. In particular, the assumption that E(Y|X) = β^T X is dropped. Roughly speaking, persistence implies that a procedure predicts well. 
We review the arguments in [10] first; we then adapt them to the compressed case.\n\nUncompressed Persistence. Consider a new pair (X, Y) and suppose we want to predict Y from X. The predictive risk using predictor β^T X is R(β) = E(Y − β^T X)^2. Note that this is a well-defined quantity even though we do not assume that E(Y|X) = β^T X. It is convenient to rewrite the risk in the following way: define Q = (Y, X_1, . . . , X_p) and γ = (−1, β_1, . . . , β_p)^T; then\n\nR(β) = γ^T Σ γ, where Σ = E(Q Q^T). (5)\n\nLet Q = (Q†_1 Q†_2 · · · Q†_n)^T, where the Q†_i = (Y_i, X_{1i}, . . . , X_{pi})^T ∼ Q, ∀ i = 1, . . . , n, are i.i.d. random vectors, and the training error is\n\nR̂_n(β) = (1/n) Σ_{i=1}^n (Y_i − X_i^T β)^2 = γ^T Σ̂_n γ, where Σ̂_n = (1/n) Q^T Q. (6)\n\nGiven B_n = {β : ‖β‖_1 ≤ L_n} for L_n = o((n / log n)^{1/4}), we define the oracle predictor β_{*,n} = arg min_{‖β‖_1 ≤ L_n} R(β), and the uncompressed lasso estimator β̂_n = arg min_{‖β‖_1 ≤ L_n} R̂_n(β).\n\nAssumption 1. Suppose that, for each j and k, E(|Z|^q) ≤ q! M^{q−2} s / 2, for every q ≥ 2 and some constants M and s, where Z = Q_j Q_k − E(Q_j Q_k), and Q_j, Q_k denote elements of Q.\n\nFollowing arguments in [10], it can be shown that under Assumption 1 and given a sequence of sets of estimators B_n = {β : ‖β‖_1 ≤ L_n} for L_n = o((n / log n)^{1/4}), the sequence of uncompressed lasso estimators β̂_n = arg min_{β ∈ B_n} R̂_n(β) is persistent, i.e., R(β̂_n) − R(β_{*,n}) → 0 in probability.\n\nCompressed Persistence. For the compressed case, again we want to predict (X, Y), but now the estimator β̂_{n,m} is based on the lasso from the compressed data of size m_n. Let γ = (−1, β_1, . . . , β_p)^T as before, and replace R̂_n with\n\nR̂_{n,m}(β) = γ^T Σ̂_{n,m} γ, where Σ̂_{n,m} = (1/m_n) Q^T Φ^T Φ Q. (7)\n\nGiven compressed sample size m_n, let B_{n,m} = {β : ‖β‖_1 ≤ L_{n,m}}, where L_{n,m} = o((m_n / log(n p_n))^{1/4}). We define the compressed oracle predictor β_{*,n,m} = arg min_{‖β‖_1 ≤ L_{n,m}} R(β) and the compressed lasso estimator β̂_{n,m} = arg min_{‖β‖_1 ≤ L_{n,m}} R̂_{n,m}(β).\n\nTheorem 4.1. Under Assumption 1, we further assume that there exists a constant M_1 > 0 such that E(Q_j^2) < M_1, ∀ j, where Q_j denotes the jth element of Q. For any sequence B_{n,m} ⊂ R^p with log^2(n p_n) ≤ m_n ≤ n, where B_{n,m} consists of all coefficient vectors β such that ‖β‖_1 ≤ L_{n,m} = o((m_n / log(n p_n))^{1/4}), the sequence of compressed lasso procedures β̂_{n,m} = arg min_{β ∈ B_{n,m}} R̂_{n,m}(β) is persistent: R(β̂_{n,m}) − R(β_{*,n,m}) → 0 in probability, when p_n = O(e^{n^c}) for c < 1/2.\n\nThe main difference between the sequence of compressed lasso estimators and the original uncompressed sequence is that n and m_n together define the sequence of estimators for the compressed data. Here m_n is allowed to grow from Ω(log^2(np)) to n; hence for each fixed n, {β̂_{n,m}, ∀ m_n such that log^2(np) < m_n ≤ n} defines a subsequence of estimators. In Section 6 we illustrate the persistence of the compressed lasso via simulations that compare the empirical risks with the oracle risks on such a subsequence for a fixed n.\n\n5 Information Theoretic Analysis of Privacy\n\nNext we derive bounds on the rate at which the compressed data X̃ reveal information about the uncompressed data X. 
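The risk functionals in (5)–(7) translate directly into code. The sketch below is our own (the helper names empirical_risk and compressed_empirical_risk and all dimensions are illustrative, not from the paper); it checks numerically that the compressed training error γᵀΣ̂_{n,m}γ concentrates around its uncompressed counterpart γᵀΣ̂_n γ:

```python
import numpy as np

def empirical_risk(beta, Q):
    """Training error R_hat_n(beta) = gamma^T Sigma_hat_n gamma, where
    Q stacks the rows (Y_i, X_i1, ..., X_ip) and gamma = (-1, beta),
    as in equations (5)-(6)."""
    n = Q.shape[0]
    gamma = np.concatenate(([-1.0], beta))
    return gamma @ (Q.T @ Q / n) @ gamma

def compressed_empirical_risk(beta, Q, Phi):
    """Compressed version R_hat_{n,m}(beta) = gamma^T Sigma_hat_{n,m} gamma,
    with Sigma_hat_{n,m} = (1/m) Q^T Phi^T Phi Q, as in equation (7)."""
    m = Phi.shape[0]
    gamma = np.concatenate(([-1.0], beta))
    S = Phi @ Q
    return gamma @ (S.T @ S / m) @ gamma

rng = np.random.default_rng(3)
n, p, m = 5000, 5, 500
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
Y = X @ beta_true + rng.normal(size=n)
Q = np.column_stack([Y, X])
Phi = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))

b = np.array([0.9, -0.9, 0.0, 0.0, 0.0])
print(empirical_risk(b, Q), compressed_empirical_risk(b, Q, Phi))
# The two agree up to random-projection error; both should be near
# the predictive risk E(Y - b^T X)^2 = ||beta_true - b||^2 + 1 = 1.02.
```

Note that γᵀΣ̂γ is just the mean squared residual, since Qγ = Xβ − Y; the compressed version replaces Q by ΦQ and averages over m rows instead of n.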
Our general approach is to consider the mapping X ↦ ΦX + Δ as a noisy communication channel, where the channel is characterized by multiplicative noise Φ and additive noise Δ. Since the number of symbols in X is np, we normalize by this effective block length to define the information rate r_{n,m} per symbol as r_{n,m} = sup_{p(X)} I(X; X̃)/np. Thus, we seek bounds on the capacity of this channel. A privacy guarantee is given in terms of bounds on the rate r_{n,m} → 0 decaying to zero. Intuitively, if the mutual information satisfies I(X; X̃) = H(X) − H(X | X̃) ≈ 0, then the compressed data X̃ reveal, on average, no more information about the original data X than could be obtained from an independent sample.\n\nThe underlying channel is equivalent to the multiple antenna model for wireless communication [13], where there are n transmitter and m receiver antennas in a Rayleigh flat-fading environment. The propagation coefficients between pairs of transmitter and receiver antennas are modeled by the matrix entries Φ_{ij}; they remain constant for a coherence interval of p time periods. Computing the channel capacity over multiple intervals requires optimization of the joint density of pn transmitted signals, the problem studied in [13]. Formally, the channel is modeled as Z = ΦX + γΔ, where γ > 0, Δ_{ij} ∼ N(0, 1), Φ_{ij} ∼ N(0, 1/n), and (1/n) Σ_{i=1}^n E[X_{ij}^2] ≤ P, where the latter is a power constraint.\n\nTheorem 5.1. Suppose that E[X_{ij}^2] ≤ P and the compressed data are formed by Z = ΦX + γΔ, where Φ is m × n with independent entries Φ_{ij} ∼ N(0, 1/n) and Δ is m × p with independent entries Δ_{ij} ∼ N(0, 1). Then the information rate r_{n,m} satisfies r_{n,m} = sup_{p(X)} I(X; Z)/np ≤ (m/n) log(1 + P/γ^2).\n\nThis result is implicitly contained in [13]. When Δ = 0, or equivalently γ = 0, which is the case assumed in our sparsistence and persistence results, the above analysis yields the trivial bound r_{n,m} ≤ ∞. We thus derive a separate bound for this case; however, the resulting asymptotic order of the information rate is the same.\n\nTheorem 5.2. Suppose that E[X_{ij}^2] ≤ P and the compressed data are formed by Z = ΦX, where Φ is m × n with independent entries Φ_{ij} ∼ N(0, 1/n). Then the information rate r_{n,m} satisfies r_{n,m} = sup_{p(X)} I(X; Z)/np ≤ (m/2n) log(2πeP).\n\nUnder our sparsistency lower bound on m, the above upper bounds are r_{n,m} = O(log(np)/n). We note that these bounds may not be the best possible, since they are obtained assuming knowledge of the compression matrix Φ, when in fact the privacy protocol requires that Φ and Δ are not public.\n\n6 Experiments\n\nIn this section, we report results of simulations designed to validate the theoretical analysis presented in previous sections. We first present results showing that the compressed lasso is comparable to the uncompressed lasso in recovering the sparsity pattern of the true linear model. We then show results on persistence that are in close agreement with the theoretical results of Section 4. We only include Figures 1–2 here; additional plots are included in the full version.\n\nSparsistency. Here we run simulations to compare the compressed lasso with the uncompressed lasso in terms of the probability of success in recovering the sparsity pattern of β*. We use random matrices for both X and Φ, and reproduce the experimental conditions of [19]. 
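The information-rate bounds of Theorems 5.1 and 5.2 are simple closed forms and easy to evaluate numerically. The sketch below is our own (the function names and the m-schedule are illustrative); it shows the per-entry rate of Theorem 5.2 decaying as n grows while m stays in the sparsistency regime m = O(log² n):

```python
import math

def rate_bound_affine(m, n, P, gamma):
    """Theorem 5.1 bound: r_{n,m} <= (m/n) * log(1 + P / gamma^2)."""
    return (m / n) * math.log(1.0 + P / gamma ** 2)

def rate_bound_linear(m, n, P):
    """Theorem 5.2 bound (Delta = 0): r_{n,m} <= (m / (2n)) * log(2*pi*e*P)."""
    return (m / (2.0 * n)) * math.log(2.0 * math.pi * math.e * P)

# With m growing only polylogarithmically in n, m/n -> 0 and so does
# the per-entry rate at which the compressed data reveal X.
for n in [10 ** 3, 10 ** 4, 10 ** 5]:
    m = int(math.log(n) ** 2)  # a schedule with log^2 n <= m << n
    print(n, rate_bound_linear(m, n, P=1.0))
```

Each printed bound is an upper limit on bits of information per original matrix entry, maximized over all distributions on X; the sequence shrinks toward zero as n grows.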
A design parameter is the compression factor f = n/m, which indicates how much the original data are compressed. The results show that when the compression factor f is large enough, the thresholding behaviors specified in (8) and (9) for the uncompressed lasso carry over to the compressed lasso when X is drawn from a Gaussian ensemble. In general, the compression factor f is well below the requirement that we have in Theorem 3.3 in case X is deterministic. In more detail, we consider the Gaussian ensemble for the projection matrix Φ, where the Φ_{ij} ∼ N(0, 1/n) are independent. The noise is ε ∼ N(0, σ^2), where σ^2 = 1. We consider Gaussian ensembles for the design matrix X with both diagonal and Toeplitz covariance. In the Toeplitz case, the covariance is given by T(ρ)_{ij} = ρ^{|i−j|}; we use ρ = 0.1. [19] shows that when X comes from a Gaussian ensemble under these conditions, there exist fixed constants θ_ℓ and θ_u such that for any ν > 0 and s = |supp(β)|, if\n\nn > 2(θ_u + ν) s log(p − s) + s + 1, (8)\n\nthen the lasso identifies the true variables with probability approaching one. Conversely, if\n\nn < 2(θ_ℓ − ν) s log(p − s) + s + 1, (9)\n\nthen the probability of recovering the true variables using the lasso approaches zero. In the following simulations, we carry out the lasso using the procedure lars(Y, X) that implements the LARS algorithm of [8] to calculate the full regularization path. For the uncompressed case, we run lars(Y, X) with Y = Xβ* + ε, and for the compressed case we run lars(ΦY, ΦX) with ΦY = ΦXβ* + Φε. The regularization parameter is λ_m = c √((log(p − s) log s)/m). The results show that the behavior under compression is close to the uncompressed case.\n\nPersistence. 
Here we solve the ℓ1-constrained optimization problem β̃ = arg min_{‖β‖_1 ≤ L} ‖Y − Xβ‖_2 directly, based on algorithms described by [15]. We constrain the solution to lie in the ball B_n = {‖β‖_1 ≤ L_n}, where L_n = n^{1/4}/√(log n). By [10], the uncompressed lasso estimator β̂_n is persistent over B_n. For the compressed lasso, given n and p_n, and a varying compressed sample size m, we take the ball B_{n,m} = {β : ‖β‖_1 ≤ L_{n,m}}, where L_{n,m} = m^{1/4}/√(log(n p_n)). The compressed lasso estimator β̂_{n,m}, for log^2(n p_n) ≤ m ≤ n, is persistent over B_{n,m} by Theorem 4.1. The simulations confirm this behavior.\n\n[Figure 1 appears here; its panels plot the probability of success against the compressed dimension m (left, Σ = T(0.1)) and against the control parameter θ (right, for Σ = I and Σ = T(0.1) with p = 1024).]\n\nFigure 1: Plots of the number of samples versus the probability of success for recovering sgn(β*). Each point on a curve for a particular θ or m, where m = 2θσ^2 s log(p − s) + s + 1, is an average over 200 trials; for each trial, we randomly draw X_{n×p}, Φ_{m×n}, and ε ∈ R^n. The covariance Σ = (1/n) E(X^T X) and the model β* are fixed across all curves in the plot. The sparsity level is s(p) = 0.2 p^{1/2}. The four sets of curves in the left plot are for p = 128, 256, 512 and 1024, with dashed lines marking m for θ = 1 and s = 2, 3, 5 and 6 respectively. In the plots on the right, each curve has a compression factor f ∈ {5, 10, 20, 40, 80, 120} for the compressed lasso, thus n = f m; dashed lines mark θ = 1. For Σ = I, θ_u = θ_ℓ = 1, while for Σ = T(0.1), θ_u ≈ 1.84 and θ_ℓ ≈ 0.46 [19], for the uncompressed lasso in (8) and in (9).\n\n[Figure 2 appears here; it plots risk against the compressed dimension m for n = 9000, p = 128, comparing the uncompressed predictive risk with the compressed predictive and empirical risks.]\n\nFigure 2: Risk versus compressed dimension. We fix n = 9000 and p = 128, and set s(p) = 3 and L_n = 2.6874. The model is β* = (−0.9, −1.7, 1.1, 1.3, −0.5, 2, −1.7, −1.3, −0.9, 0, . . . , 0)^T, so that ‖β*‖_1 > L_n and β* ∉ B_n, and the uncompressed oracle predictive risk is R = 9.81. For each value of m, a data point corresponds to the mean empirical risk, defined in (7), over 100 trials, and each vertical bar shows one standard deviation. For each trial, we randomly draw X_{n×p} with i.i.d. row vectors x_i ∼ N(0, T(0.1)), and Y = Xβ* + ε.\n\n7 Acknowledgments\n\nThis work was supported in part by NSF grant CCF-0625879.\n\nReferences\n\n[1] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the 20th Symposium on Principles of Database Systems, May 2001.\n\n[2] E. Candès, J. Romberg, and T. Tao. 
Stable signal recovery from incomplete and inaccurate measurements.\n\nCommunications in Pure and Applied Mathematics, 59(8):1207\u20131223, August 2006.\n\n[3] T. Dalenius. Towards a methodology for statistical disclosure control. Statistik Tidskrift, 15:429\u2013444,\n\n1977.\n\n[4] D. Donoho. Compressed sensing. IEEE Trans. Info. Theory, 52(4):1289\u20131306, April 2006.\n[5] M. Duarte, M. Davenport, M. Wakin, and R. Baraniuk. Sparse signal detection from incoherent projections.\n\nIn Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, 2006.\n\n[6] G. Duncan and R. Pearson. Enhancing access to microdata while protecting con\ufb01dentiality: Prospects for\n\nthe future. Statistical Science, 6(3):219\u2013232, August 1991.\n\n[7] C. Dwork. Differential privacy.\n\nIn 33rd International Colloquium on Automata, Languages and\n\nProgramming\u2013ICALP 2006, pages 1\u201312, 2006.\n\n[8] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407\u2013\n\n499, 2004.\n\n[9] J. Feigenbaum, Y. Ishai, T. Malkin, K. Nissim, M. J. Strauss, and R. N. Wright. Secure multiparty compu-\n\ntation of approximations. ACM Trans. Algorithms, 2(3):435\u2013472, 2006.\n\n[10] E. Greenshtein and Y. Ritov. Persistency in high dimensional linear predictor-selection and the virtue of\n\nover-parametrization. Journal of Bernoulli, 10:971\u2013988, 2004.\n\n[11] J. Haupt, R. Castro, R. Nowak, G. Fudge, and A. Yeh. Compressive sampling for signal classi\ufb01cation. In\n\nProc. Asilomar Conference on Signals, Systems, and Computers, October 2006.\n\n[12] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplicative data perturbation for privacy\n\npreserving distributed data mining. IEEE Trans. on Knowl. and Data Engin., 18(1), Jan. 2006.\n\n[13] T. L. Marzetta and B. M. Hochwald. Capacity of a mobile multiple-antenna communication link in\n\nRayleigh \ufb02at fading. IEEE Trans. Info. 
Theory, 45(1):139–157, January 1999.\n\n[14] N. Meinshausen and B. Yu. Lasso-type recovery of sparse representations for high-dimensional data. Technical Report 720, Department of Statistics, UC Berkeley, 2006.\n\n[15] M. Osborne, B. Presnell, and B. Turlach. On the lasso and its dual. J. Comp. and Graph. Stat., 9(2):319–337, 2000.\n\n[16] A. P. Sanil, A. Karr, X. Lin, and J. P. Reiter. Privacy preserving regression modelling via distributed computation. In Proceedings of Tenth ACM SIGKDD, 2004.\n\n[17] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.\n\n[18] J. Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231–2242, 2004.\n\n[19] M. Wainwright. Sharp thresholds for high-dimensional and noisy recovery of sparsity. Technical Report 709, Department of Statistics, UC Berkeley, May 2006.\n\n[20] P. Zhao and B. Yu. On model selection consistency of lasso. J. Mach. Learn. Research, 7:2541–2567, 2007.", "award": [], "sourceid": 195, "authors": [{"given_name": "Shuheng", "family_name": "Zhou", "institution": null}, {"given_name": "Larry", "family_name": "Wasserman", "institution": null}, {"given_name": "John", "family_name": "Lafferty", "institution": null}]}