{"title": "A Unified Near-Optimal Estimator For Dimension Reduction in $l_\\alpha$ ($0<\\alpha\\leq 2$) Using Stable Random Projections", "book": "Advances in Neural Information Processing Systems", "page_first": 905, "page_last": 912, "abstract": null, "full_text": "A Unified Near-Optimal Estimator For Dimension Reduction in l\u03b1 (0 < \u03b1 \u2264 2) Using Stable Random Projections\n\nPing Li\nDepartment of Statistical Science\nFaculty of Computing and Information Science\nCornell University\npingli@cornell.edu\n\nTrevor J. Hastie\nDepartment of Statistics\nDepartment of Health, Research and Policy\nStanford University\nhastie@stanford.edu\n\nAbstract\n\nMany tasks (e.g., clustering) in machine learning only require the l\u03b1 distances instead of the original data. For dimension reductions in the l\u03b1 norm (0 < \u03b1 \u2264 2), the method of stable random projections can efficiently compute the l\u03b1 distances in massive datasets (e.g., the Web or massive data streams) in one pass of the data. The estimation task for stable random projections has been an interesting topic. We propose a simple estimator based on the fractional power of the samples (projected data), which is surprisingly near-optimal in terms of the asymptotic variance. In fact, it achieves the Cram\u00e9r-Rao bound when \u03b1 = 2 and \u03b1 = 0+. This new result will be useful when applying stable random projections to distance-based clustering, classifications, kernels, massive data streams etc.\n\n1 Introduction\n\nDimension reductions in the l\u03b1 norm (0 < \u03b1 \u2264 2) have numerous applications in data mining, information retrieval, and machine learning. In modern applications, the data can be way too large for the physical memory or even the disk; and sometimes only one pass of the data can be afforded for building statistical learning models [1, 2, 5].\n
We abstract the data as a data matrix A \u2208 R^{n\u00d7D}. In many applications, it is often the case that we only need the l\u03b1 properties (norms or distances) of A. The method of stable random projections [9, 18, 22] is a useful tool for efficiently computing the l\u03b1 (0 < \u03b1 \u2264 2) properties in massive data using a small (memory) space.\n\nDenote the leading two rows in the data matrix A by u1, u2 \u2208 R^D. The l\u03b1 distance d(\u03b1) is\n\n$$d_{(\\alpha)} = \\sum_{i=1}^{D} |u_{1,i} - u_{2,i}|^{\\alpha}. \\tag{1}$$\n\nThe choice of \u03b1 is beyond the scope of this study; but basically, we can treat \u03b1 as a tuning parameter. In practice, the most popular choice, i.e., the \u03b1 = 2 norm, often does not work directly on the original (unweighted) data, as it is well-known that truly large-scale datasets (especially Internet data) are ubiquitously \u201cheavy-tailed.\u201d In machine learning, it is often crucial to carefully term-weight the data (e.g., taking logarithm or tf-idf) before applying subsequent learning algorithms using the l2 norm. As commented in [12, 21], the term-weighting procedure is often far more important than fine-tuning the learning parameters. Instead of weighting the original data, an alternative scheme is to choose an appropriate norm. For example, the l1 norm has become popular recently, e.g., LASSO, LARS, 1-norm SVM [23], Laplacian radial basis kernel [4], etc. But other norms are also possible. For example, [4] proposed a family of non-Gaussian radial basis kernels for SVM in the form $K(x, y) = \\exp\\left(-\\rho \\sum_i |x_i - y_i|^{\\alpha}\\right)$, where x and y are data points in high-dimensions; and [4] showed that \u03b1 \u2264 1 (even \u03b1 = 0) in some cases produced better results in histogram-based image classifications. The l\u03b1 norm with \u03b1 < 1, which may initially appear strange, is now well-understood to be a natural measure of sparsity [6].\n
In the extreme case, when \u03b1 \u2192 0+, the l\u03b1 norm approaches the Hamming norm (i.e., the number of non-zeros in the vector).\n\nTherefore, there is the natural demand in science and engineering for dimension reductions in the l\u03b1 norm other than l2. While the method of normal random projections for the l2 norm [22] has become very popular recently, we have to resort to more general methodologies for 0 < \u03b1 < 2.\n\nThe idea of stable random projections is to multiply A with a random projection matrix R \u2208 R^{D\u00d7k} (k \u226a D). The matrix B = A \u00d7 R \u2208 R^{n\u00d7k} will be much smaller than A. The entries of R are (typically) i.i.d. samples from a symmetric \u03b1-stable distribution [24], denoted by S(\u03b1, 1), where \u03b1 is the index and 1 is the scale. We can then discard the original data matrix A because the projected matrix B now contains enough information to recover the original l\u03b1 properties approximately.\n\nA symmetric \u03b1-stable random variable is denoted by S(\u03b1, d), where d is the scale parameter. If x \u223c S(\u03b1, d), then its characteristic function (Fourier transform of the density function) would be\n\n$$E\\left(\\exp\\left(\\sqrt{-1}\\, x t\\right)\\right) = \\exp\\left(-d |t|^{\\alpha}\\right), \\tag{2}$$\n\nwhose inverse does not have a closed-form except for \u03b1 = 2 (i.e., normal) or \u03b1 = 1 (i.e., Cauchy).\n\nApplying stable random projections on u1 \u2208 R^D, u2 \u2208 R^D yields, respectively, v1 = R^T u1 \u2208 R^k and v2 = R^T u2 \u2208 R^k. By the properties of Fourier transforms, the projected differences, v1,j \u2212 v2,j, j = 1, 2, ..., k, are i.i.d. samples of the stable distribution S(\u03b1, d(\u03b1)), i.e.,\n\n$$x_j = v_{1,j} - v_{2,j} \\sim S\\left(\\alpha, d_{(\\alpha)}\\right), \\qquad j = 1, 2, ..., k. \\tag{3}$$\n\nThus, the task is to estimate the scale parameter from k i.i.d. samples xj \u223c S(\u03b1, d(\u03b1)).\n
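\n\nThe distributional claim in (3) follows from (2) in a few lines. As a brief sketch, write r_{ij} for the (i, j) entry of R (a symbol introduced here only for illustration) and use the independence of the entries, each of which is S(\u03b1, 1):\n\n```latex\n\\begin{aligned}\nE\\left(\\exp\\left(\\sqrt{-1}\\, x_j t\\right)\\right) &= E\\left(\\exp\\left(\\sqrt{-1}\\, t \\sum_{i=1}^{D} r_{ij}\\,(u_{1,i} - u_{2,i})\\right)\\right) \\\\\n&= \\prod_{i=1}^{D} \\exp\\left(-\\left|t\\,(u_{1,i} - u_{2,i})\\right|^{\\alpha}\\right) \\\\\n&= \\exp\\left(-d_{(\\alpha)}\\, |t|^{\\alpha}\\right).\n\\end{aligned}\n```\n\nThe last expression is the characteristic function of S(\u03b1, d(\u03b1)), so the l\u03b1 distance appears as the scale parameter of each projected difference.\n\n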
Because no closed-form density functions are available except for \u03b1 = 1, 2, the estimation task is challenging when we seek estimators that are both accurate and computationally efficient.\n\nFor general 0 < \u03b1 < 2, a widely used estimator is based on the sample inter-quantiles [7, 20], which can be simplified to be the sample median estimator by choosing the 0.75 - 0.25 sample quantiles and using the symmetry of S(\u03b1, d(\u03b1)). That is\n\n$$\\hat{d}_{(\\alpha),me} = \\frac{\\mathrm{median}\\{|x_j|^{\\alpha},\\ j = 1, 2, ..., k\\}}{\\mathrm{median}\\{S(\\alpha, 1)\\}^{\\alpha}}. \\tag{4}$$\n\nIt is well known that the sample median estimator is not accurate, especially when the sample size k is not too large. Recently, [13] proposed various estimators based on the geometric mean and the harmonic mean of the samples. The harmonic mean estimator only works for small \u03b1. The geometric mean estimator has nice properties including closed-form variances, closed-form tail bounds in exponential forms, and very importantly, an analog of the Johnson-Lindenstrauss (JL) Lemma [10] for dimension reduction in l\u03b1. The geometric mean estimator, however, can still be improved for certain \u03b1, especially for large samples (e.g., as k \u2192 \u221e).\n\n1.1 Our Contribution: the Fractional Power Estimator\n\nThe fractional power estimator, with a simple unified format for all 0 < \u03b1 \u2264 2, is (surprisingly) near-optimal in the Cram\u00e9r-Rao sense (i.e., when k \u2192 \u221e, its variance is close to the Cram\u00e9r-Rao lower bound). In particular, it achieves the Cram\u00e9r-Rao bound when \u03b1 = 2 and \u03b1 \u2192 0+.\n\nThe basic idea is straightforward. We first obtain an unbiased estimator of $d_{(\\alpha)}^{\\lambda}$, denoted by $\\hat{R}_{(\\alpha),\\lambda}$. We then estimate d(\u03b1) by $\\left(\\hat{R}_{(\\alpha),\\lambda}\\right)^{1/\\lambda}$, which can be improved by removing the $O\\left(\\frac{1}{k}\\right)$ bias (this consequently also reduces the variance) using Taylor expansions. We choose \u03bb = \u03bb\u2217(\u03b1) to minimize the theoretical asymptotic variance. We prove that \u03bb\u2217(\u03b1) is the solution to a simple convex program, i.e., \u03bb\u2217(\u03b1) can be pre-computed and treated as a constant for every \u03b1. The main computation involves only $\\left(\\sum_{j=1}^{k} |x_j|^{\\lambda^* \\alpha}\\right)^{1/\\lambda^*}$; and hence this estimator is also computationally efficient.\n\n1.2 Applications\n\nThe method of stable random projections is useful for efficiently computing the l\u03b1 properties (norms or distances) in massive data, using a small (memory) space.\n\n\u2022 Data stream computations\n\nMassive data streams are fundamental in many modern data processing applications [1, 2, 5, 9]. It is common practice to store only a very small sketch of the streams to efficiently compute the l\u03b1 norms of the individual streams or the l\u03b1 distances between a pair of streams. For example, in some cases, we only need to visually monitor the time history of the l\u03b1 distances; and approximate answers often suffice.\n\nOne interesting special case is to estimate the Hamming norms (or distances) using the fact that, when \u03b1 \u2192 0+, $d_{(\\alpha)} = \\sum_{i=1}^{D} |u_{1,i} - u_{2,i}|^{\\alpha}$ approaches the total number of non-zeros in $\\{|u_{1,i} - u_{2,i}|\\}_{i=1}^{D}$, i.e., the Hamming distance [5]. One may ask why not just (binary) quantize the data and then apply normal random projections to the binary data. [5] considered that the data are dynamic (i.e., frequent addition/subtraction) and hence pre-quantizing the data would not work. With stable random projections, we only need to update the corresponding sketches whenever the data are updated.\n\n\u2022 Computing all pairwise distances\n\nIn many applications including distance-based clustering, classifications, and kernels, e.g.,\n
for SVM, we only need the pairwise distances. Computing all pairwise distances of A \u2208 R^{n\u00d7D} would cost O(n^2 D), which can be significantly reduced to O(nDk + n^2 k) by stable random projections. The cost reduction will be more considerable when the original datasets are too large for the physical memory.\n\n\u2022 Estimating l\u03b1 distances online\n\nWhile it is often infeasible to store the original matrix A in the memory, it is also often infeasible to materialize all pairwise distances in A. Thus, in applications such as online learning, databases, search engines, online recommendation systems, and online market-basket analysis, it is often more efficient if we store B \u2208 R^{n\u00d7k} in the memory and estimate any pairwise distance in A on the fly only when it is necessary.\n\nWhen we treat \u03b1 as a tuning parameter, i.e., re-computing the l\u03b1 distances for many different \u03b1, stable random projections will be even more desirable as a cost-saving device.\n\n2 Previous Estimators\n\nWe assume k i.i.d. samples xj \u223c S(\u03b1, d(\u03b1)), j = 1, 2, ..., k.\n
We list several previous estimators.\n\n\u2022 The geometric mean estimator is recommended in [13] for \u03b1 < 2.\n\n$$\\hat{d}_{(\\alpha),gm} = \\frac{\\prod_{j=1}^{k} |x_j|^{\\alpha/k}}{\\left[\\frac{2}{\\pi} \\Gamma\\left(\\frac{\\alpha}{k}\\right) \\Gamma\\left(1 - \\frac{1}{k}\\right) \\sin\\left(\\frac{\\pi}{2} \\frac{\\alpha}{k}\\right)\\right]^{k}}. \\tag{5}$$\n\n$$\\mathrm{Var}\\left(\\hat{d}_{(\\alpha),gm}\\right) = d_{(\\alpha)}^{2} \\left( \\frac{\\left[\\frac{2}{\\pi} \\Gamma\\left(\\frac{2\\alpha}{k}\\right) \\Gamma\\left(1 - \\frac{2}{k}\\right) \\sin\\left(\\pi \\frac{\\alpha}{k}\\right)\\right]^{k}}{\\left[\\frac{2}{\\pi} \\Gamma\\left(\\frac{\\alpha}{k}\\right) \\Gamma\\left(1 - \\frac{1}{k}\\right) \\sin\\left(\\frac{\\pi}{2} \\frac{\\alpha}{k}\\right)\\right]^{2k}} - 1 \\right) \\tag{6}$$\n\n$$= d_{(\\alpha)}^{2} \\left\\{ \\frac{1}{k} \\frac{\\pi^{2}}{12} \\left(\\alpha^{2} + 2\\right) \\right\\} + O\\left(\\frac{1}{k^{2}}\\right). \\tag{7}$$\n\n\u2022 The harmonic mean estimator is recommended in [13] for 0 < \u03b1 \u2264 0.344.\n\n$$\\hat{d}_{(\\alpha),hm} = \\frac{-\\frac{2}{\\pi} \\Gamma(-\\alpha) \\sin\\left(\\frac{\\pi}{2} \\alpha\\right)}{\\sum_{j=1}^{k} |x_j|^{-\\alpha}} \\left( k - \\left( \\frac{-\\pi \\Gamma(-2\\alpha) \\sin(\\pi\\alpha)}{\\left[\\Gamma(-\\alpha) \\sin\\left(\\frac{\\pi}{2} \\alpha\\right)\\right]^{2}} - 1 \\right) \\right), \\tag{8}$$\n\n$$\\mathrm{Var}\\left(\\hat{d}_{(\\alpha),hm}\\right) = d_{(\\alpha)}^{2} \\frac{1}{k} \\left( \\frac{-\\pi \\Gamma(-2\\alpha) \\sin(\\pi\\alpha)}{\\left[\\Gamma(-\\alpha) \\sin\\left(\\frac{\\pi}{2} \\alpha\\right)\\right]^{2}} - 1 \\right) + O\\left(\\frac{1}{k^{2}}\\right). \\tag{9}$$\n\n\u2022 For \u03b1 = 2, the arithmetic mean estimator, $\\frac{1}{k} \\sum_{j=1}^{k} |x_j|^{2}$, is commonly used, which has variance $\\frac{2}{k} d_{(2)}^{2}$. It can be improved by taking advantage of the marginal l2 norms [17].\n\n3 The Fractional Power Estimator\n\nThe fractional power estimator takes advantage of the following statistical result in Lemma 1.\n\nLemma 1 Suppose $x \\sim S\\left(\\alpha, d_{(\\alpha)}\\right)$.\n
Then for \u22121 < \u03bb < \u03b1,\n\n$$E\\left(|x|^{\\lambda}\\right) = d_{(\\alpha)}^{\\lambda/\\alpha} \\frac{2}{\\pi} \\Gamma\\left(1 - \\frac{\\lambda}{\\alpha}\\right) \\Gamma(\\lambda) \\sin\\left(\\frac{\\pi}{2} \\lambda\\right). \\tag{10}$$\n\nIf \u03b1 = 2, i.e., x \u223c S(2, d(2)) = N(0, 2d(2)), then for \u03bb > \u22121,\n\n$$E\\left(|x|^{\\lambda}\\right) = d_{(2)}^{\\lambda/2} \\frac{2}{\\pi} \\Gamma\\left(1 - \\frac{\\lambda}{2}\\right) \\Gamma(\\lambda) \\sin\\left(\\frac{\\pi}{2} \\lambda\\right) = d_{(2)}^{\\lambda/2} \\frac{2 \\Gamma(\\lambda)}{\\Gamma\\left(\\frac{\\lambda}{2}\\right)}. \\tag{11}$$\n\nProof: For 0 < \u03b1 \u2264 2 and \u22121 < \u03bb < \u03b1, (10) can be inferred directly from [24, Theorem 2.6.3]. For \u03b1 = 2, the moment $E\\left(|x|^{\\lambda}\\right)$ exists for any \u03bb > \u22121. (11) can be shown by directly integrating the Gaussian density (using the integral formula [8, 3.381.4]). The Euler\u2019s reflection formula $\\Gamma(1 - z)\\Gamma(z) = \\frac{\\pi}{\\sin(\\pi z)}$ and the duplication formula $\\Gamma(z)\\Gamma\\left(z + \\frac{1}{2}\\right) = 2^{1-2z} \\sqrt{\\pi}\\, \\Gamma(2z)$ are handy.\n\nThe fractional power estimator is defined in Lemma 2. See the proof in Appendix A.\n\nLemma 2 Denoted by $\\hat{d}_{(\\alpha),fp}$, the fractional power estimator is defined as\n\n$$\\hat{d}_{(\\alpha),fp} = \\left( \\frac{\\frac{1}{k} \\sum_{j=1}^{k} |x_j|^{\\lambda^*\\alpha}}{\\frac{2}{\\pi} \\Gamma(1 - \\lambda^*) \\Gamma(\\lambda^*\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda^*\\alpha\\right)} \\right)^{1/\\lambda^*} \\times \\left( 1 - \\frac{1}{k} \\frac{1}{2\\lambda^*} \\left(\\frac{1}{\\lambda^*} - 1\\right) \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda^*) \\Gamma(2\\lambda^*\\alpha) \\sin(\\pi\\lambda^*\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda^*) \\Gamma(\\lambda^*\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda^*\\alpha\\right)\\right]^{2}} - 1 \\right) \\right), \\tag{12}$$\n\nwhere\n\n$$\\lambda^* = \\underset{-\\frac{1}{2\\alpha} < \\lambda < \\frac{1}{2}}{\\mathrm{argmin}}\\ g(\\lambda; \\alpha), \\qquad g(\\lambda; \\alpha) = \\frac{1}{\\lambda^{2}} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right). \\tag{13}$$\n\nAsymptotically (i.e., as k \u2192 \u221e), the bias and variance of $\\hat{d}_{(\\alpha),fp}$ are\n\n$$E\\left(\\hat{d}_{(\\alpha),fp}\\right) - d_{(\\alpha)} = O\\left(\\frac{1}{k^{2}}\\right), \\tag{14}$$\n\n$$\\mathrm{Var}\\left(\\hat{d}_{(\\alpha),fp}\\right) = d_{(\\alpha)}^{2} \\frac{1}{\\lambda^{*2}} \\frac{1}{k} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda^*) \\Gamma(2\\lambda^*\\alpha) \\sin(\\pi\\lambda^*\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda^*) \\Gamma(\\lambda^*\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda^*\\alpha\\right)\\right]^{2}} - 1 \\right) + O\\left(\\frac{1}{k^{2}}\\right). \\tag{15}$$\n\nNote that in calculating $\\hat{d}_{(\\alpha),fp}$, the real computation only involves $\\left(\\sum_{j=1}^{k} |x_j|^{\\lambda^*\\alpha}\\right)^{1/\\lambda^*}$, because all other terms are basically constants and can be pre-computed.\n\nFigure 1(a) plots g(\u03bb; \u03b1) as a function of \u03bb for many different values of \u03b1. Figure 1(b) plots the optimal \u03bb\u2217 as a function of \u03b1.\n
We can see that g(\u03bb; \u03b1) is a convex function of \u03bb and \u22121 < \u03bb\u2217 < 1/2 (except for \u03b1 = 2), which will be proved in Lemma 3.\n\n[Figure 1 (plots not reproduced). Left panel: variance factor versus \u03bb for \u22121 \u2264 \u03bb \u2264 1, with curves labeled \u03b1 = 2e\u221216, 0.3, 0.5, 0.8, 1, 1.2, 1.5, 1.9, 1.95, 1.999, 2. Right panel: \u03bb opt versus \u03b1 for 0 \u2264 \u03b1 \u2264 2.]\n\nFigure 1: Left panel plots the variance factor g(\u03bb; \u03b1) as functions of \u03bb for different \u03b1, illustrating that g(\u03bb; \u03b1) is a convex function of \u03bb and that the optimal solutions (lowest points on the curves) are between \u22121 and 0.5 (\u03b1 < 2). Note that there is a discontinuity between \u03b1 \u2192 2\u2212 and \u03b1 = 2. Right panel plots the optimal \u03bb\u2217 as a function of \u03b1. Since \u03b1 = 2 is not included, we only see \u03bb\u2217 < 0.5 in the figure.\n\n3.1 Special cases\n\nThe discontinuity, \u03bb\u2217(2\u2212) = 0.5 and \u03bb\u2217(2) = 1, reflects the fact that, for x \u223c S(\u03b1, d), $E\\left(|x|^{\\lambda}\\right)$ exists for \u22121 < \u03bb < \u03b1 when \u03b1 < 2 and exists for any \u03bb > \u22121 when \u03b1 = 2.\n\nWhen \u03b1 = 2, since \u03bb\u2217(2) = 1, the fractional power estimator becomes $\\frac{1}{k} \\sum_{j=1}^{k} |x_j|^{2}$, i.e., the arithmetic mean estimator.\n
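\n\nThe \u03b1 = 2 case can be checked against Lemma 2 with a short calculation (a sketch added for illustration, not part of the original derivation). At \u03bb\u2217 = 1 the factor (1/\u03bb\u2217 \u2212 1) in the bias correction vanishes, and Euler\u2019s reflection formula removes the apparent singularities in the variance factor:\n\n```latex\n\\begin{aligned}\n\\Gamma(1 - 2\\lambda) \\sin(2\\pi\\lambda) &= \\frac{\\pi}{\\Gamma(2\\lambda)}, \\qquad \\Gamma(1 - \\lambda) \\sin(\\pi\\lambda) = \\frac{\\pi}{\\Gamma(\\lambda)}, \\\\\ng(\\lambda; 2) &= \\frac{1}{\\lambda^{2}} \\left( \\frac{2\\,\\Gamma(4\\lambda) / \\Gamma(2\\lambda)}{\\left[2\\,\\Gamma(2\\lambda) / \\Gamma(\\lambda)\\right]^{2}} - 1 \\right), \\\\\ng(1; 2) &= \\frac{2 \\cdot 6 / 1}{\\left[2 \\cdot 1 / 1\\right]^{2}} - 1 = 2.\n\\end{aligned}\n```\n\nThe resulting asymptotic variance factor of 2 agrees with the variance quoted for the arithmetic mean estimator in Section 2.\n\n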
We will from now on only consider 0 < \u03b1 < 2.\n\nWhen \u03b1 \u2192 0+, since \u03bb\u2217(0+) = \u22121, the fractional power estimator approaches the harmonic mean estimator, which is asymptotically optimal when \u03b1 = 0+ [13].\n\nWhen \u03b1 \u2192 1, since \u03bb\u2217(1) = 0 in the limit, the fractional power estimator has the same asymptotic variance as the geometric mean estimator.\n\n3.2 The Asymptotic (Cram\u00e9r-Rao) Efficiency\n\nFor an estimator $\\hat{d}_{(\\alpha)}$, its variance, under certain regularity conditions, is lower-bounded by the Information inequality (also known as the Cram\u00e9r-Rao bound) [11, Chapter 2], i.e., $\\mathrm{Var}\\left(\\hat{d}_{(\\alpha)}\\right) \\geq \\frac{1}{k I(\\alpha)}$.\n\nThe Fisher Information I(\u03b1) can be approximated by computationally intensive procedures [19]. When \u03b1 = 2, it is well-known that the arithmetic mean estimator attains the Cram\u00e9r-Rao bound. When \u03b1 = 0+, [13] has shown that the harmonic mean estimator is also asymptotically optimal. Therefore, our fractional power estimator achieves the Cram\u00e9r-Rao bound, exactly when \u03b1 = 2, and asymptotically when \u03b1 = 0+.\n\nThe asymptotic (Cram\u00e9r-Rao) efficiency is defined as the ratio of $\\frac{1}{k I(\\alpha)}$ to the asymptotic variance of $\\hat{d}_{(\\alpha)}$ (d(\u03b1) = 1 for simplicity). Figure 2 plots the efficiencies for all estimators we have mentioned, illustrating that the fractional power estimator is near-optimal in a wide range of \u03b1.\n\n[Figure 2 (plot not reproduced). Efficiency (vertical axis, 0.4 to 1) versus \u03b1 (horizontal axis, 0 to 2) for the Fractional, Geometric, Harmonic, and Median estimators.]\n\nFigure 2: The asymptotic Cram\u00e9r-Rao efficiencies of various estimators for 0 < \u03b1 < 2, which are the ratios of $\\frac{1}{k I(\\alpha)}$ to the asymptotic variances of the estimators. Here k is the sample size and I(\u03b1) is the Fisher Information (we use the numeric values in [19]). The asymptotic variance of the sample median estimator $\\hat{d}_{(\\alpha),me}$ is computed from known statistical theory for sample quantiles. We can see that the fractional power estimator $\\hat{d}_{(\\alpha),fp}$ is close to optimal in a wide range of \u03b1; and it always outperforms both the geometric mean and the harmonic mean estimators. Note that since we only consider \u03b1 < 2, the efficiency of $\\hat{d}_{(\\alpha),fp}$ does not achieve 100% when \u03b1 \u2192 2\u2212.\n\n3.3 Theoretical Properties\n\nWe can show that, when computing the fractional power estimator $\\hat{d}_{(\\alpha),fp}$, finding the optimal \u03bb\u2217 only involves searching for the minimum on a convex curve in the narrow range $\\lambda^* \\in \\left(\\max\\left\\{-1, -\\frac{1}{2\\alpha}\\right\\}, 0.5\\right)$. These properties theoretically ensure that the new estimator is well-defined and is numerically easy to compute. The proof of Lemma 3 is briefly sketched in Appendix B.\n\nLemma 3 Part 1:\n\n$$g(\\lambda; \\alpha) = \\frac{1}{\\lambda^{2}} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right), \\tag{16}$$\n\nis a convex function of \u03bb.\n\nPart 2: For 0 < \u03b1 < 2, the optimal $\\lambda^* = \\underset{-\\frac{1}{2\\alpha} < \\lambda < \\frac{1}{2}}{\\mathrm{argmin}}\\ g(\\lambda; \\alpha)$ satisfies \u22121 < \u03bb\u2217 < 0.5.\n\n3.4 Comparing Variances at Finite Samples\n\nIt is also important to understand the small sample performance of the estimators. Figure 3 plots the empirical mean square errors (MSE) from simulations for the fractional power estimator, the harmonic mean estimator, and the sample median estimator.\n
The MSE for the geometric mean estimator can be computed exactly without simulations.\n\nFigure 3 indicates that the fractional power estimator $\\hat{d}_{(\\alpha),fp}$ also has good small sample performance unless \u03b1 is close to 2. After k \u2265 50, the advantage of $\\hat{d}_{(\\alpha),fp}$ becomes noticeable even when \u03b1 is very close to 2. It is also clear that the sample median estimator has poor small sample performance; but even at very large k, its performance is not that good except when \u03b1 is about 1.\n\n[Figure 3 (plots not reproduced). Four panels, k = 10, k = 50, k = 100, and k = 500, each showing the mean square error (MSE) versus \u03b1 (0 to 2) for the Fractional, Geometric, Harmonic, and Median estimators.]\n\nFigure 3: We simulate the mean square errors (MSE) (10^6 simulations at every \u03b1 and k) for the harmonic mean estimator (0 < \u03b1 \u2264 0.344 only) and the fractional power estimator. We compute the MSE exactly for the geometric mean estimator (for 0.344 < \u03b1 < 2). The fractional power estimator has good accuracy (small MSE) at reasonable sample sizes (e.g., k \u2265 50).\n
But even at small samples (e.g., k = 10), it is quite accurate except when \u03b1 approaches 2.\n\n4 Discussion\n\nThe fractional power estimator $\\hat{d}_{(\\alpha),fp} \\propto \\left(\\sum_{j=1}^{k} |x_j|^{\\lambda^*\\alpha}\\right)^{1/\\lambda^*}$ can be treated as a linear estimator because the power 1/\u03bb\u2217 is just a constant. However, $\\sum_{j=1}^{k} |x_j|^{\\lambda^*\\alpha}$ is not a metric because \u03bb\u2217\u03b1 < 1, as shown in Lemma 3. Thus our result does not conflict with the celebrated impossibility result [3], which proved that there is no hope to recover the original l1 distances using linear projections and linear estimators without incurring large errors.\n\nAlthough the fractional power estimator achieves near-optimal asymptotic variance, analyzing its tail bounds does not appear straightforward. In fact, when \u03b1 approaches 2, this estimator does not have finite moments much higher than the second order, suggesting poor tail behavior. Our additional simulations (not included in this paper) indicate that $\\hat{d}_{(\\alpha),fp}$ still has tail probability behavior comparable to that of the geometric mean estimator when \u03b1 \u2264 1.\n\nFinally, we should mention that the method of stable random projections does not take advantage of the data sparsity while high-dimensional data (e.g., text data) are often highly sparse.\n
A new method called Conditional Random Sampling (CRS) [14\u201316] may be preferable for highly sparse data.\n\n5 Conclusion\n\nIn massive datasets such as the Web and massive data streams, dimension reductions are often critical for many applications including clustering, classifications, recommendation systems, and Web search, because the data size may be too large for the physical memory or even for the hard disk and sometimes only one pass of the data can be afforded for building statistical learning models.\n\nWhile there are already many papers on dimension reductions in the l2 norm, this paper focuses on the l\u03b1 norm for 0 < \u03b1 \u2264 2 using stable random projections, as it has become increasingly popular in machine learning to consider the l\u03b1 norm other than l2. It is also possible to treat \u03b1 as an additional tuning parameter and re-run the learning algorithms many times for better performance.\n\nOur main contribution is the fractional power estimator for stable random projections. This estimator, with a unified format for all 0 < \u03b1 \u2264 2, is computationally efficient and (surprisingly) is also near-optimal in terms of the asymptotic variance. We also prove some important theoretical properties (variance, convexity, etc.) to show that this estimator is well-behaved. We expect that this work will help advance the state-of-the-art of dimension reductions in the l\u03b1 norms.\n\nA Proof of Lemma 2\n\nBy Lemma 1, we first seek an unbiased estimator of $d_{(\\alpha)}^{\\lambda}$, denoted by $\\hat{R}_{(\\alpha),\\lambda}$,\n\n$$\\hat{R}_{(\\alpha),\\lambda} = \\frac{\\frac{1}{k} \\sum_{j=1}^{k} |x_j|^{\\lambda\\alpha}}{\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)}, \\qquad -1/\\alpha < \\lambda < 1,$$\n\nwhose variance is\n\n$$\\mathrm{Var}\\left(\\hat{R}_{(\\alpha),\\lambda}\\right) = \\frac{d_{(\\alpha)}^{2\\lambda}}{k} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right), \\qquad -\\frac{1}{2\\alpha} < \\lambda < \\frac{1}{2}.$$\n\nA biased estimator of d(\u03b1) would be simply $\\left(\\hat{R}_{(\\alpha),\\lambda}\\right)^{1/\\lambda}$, which has $O\\left(\\frac{1}{k}\\right)$ bias. This bias can be removed to an extent by Taylor expansions [11, Theorem 6.1.1]. While it is well-known that bias-corrections are not always beneficial because of the bias-variance trade-off phenomenon, in our case, it is a good idea to conduct the bias-correction because the function $f(x) = x^{1/\\lambda}$ is convex for x > 0. Note that $f'(x) = \\frac{1}{\\lambda} x^{1/\\lambda - 1}$ and $f''(x) = \\frac{1}{\\lambda} \\left(\\frac{1}{\\lambda} - 1\\right) x^{1/\\lambda - 2} > 0$, assuming $-\\frac{1}{2\\alpha} < \\lambda < \\frac{1}{2}$. Because f(x) is convex, removing the $O\\left(\\frac{1}{k}\\right)$ bias will also lead to a smaller variance.\n\nWe call this new estimator the \u201cfractional power\u201d estimator:\n\n$$\\hat{d}_{(\\alpha),fp,\\lambda} = \\left(\\hat{R}_{(\\alpha),\\lambda}\\right)^{1/\\lambda} - \\frac{\\mathrm{Var}\\left(\\hat{R}_{(\\alpha),\\lambda}\\right)}{2} \\frac{1}{\\lambda} \\left(\\frac{1}{\\lambda} - 1\\right) \\left(d_{(\\alpha)}^{\\lambda}\\right)^{1/\\lambda - 2}$$\n\n$$= \\left( \\frac{\\frac{1}{k} \\sum_{j=1}^{k} |x_j|^{\\lambda\\alpha}}{\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)} \\right)^{1/\\lambda} \\left( 1 - \\frac{1}{k} \\frac{1}{2\\lambda} \\left(\\frac{1}{\\lambda} - 1\\right) \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right) \\right),$$\n\nwhere we plug in the estimated $d_{(\\alpha)}^{\\lambda}$. The asymptotic variance would be\n\n$$\\mathrm{Var}\\left(\\hat{d}_{(\\alpha),fp,\\lambda}\\right) = \\mathrm{Var}\\left(\\hat{R}_{(\\alpha),\\lambda}\\right) \\left( \\frac{1}{\\lambda} \\left(d_{(\\alpha)}^{\\lambda}\\right)^{1/\\lambda - 1} \\right)^{2} + O\\left(\\frac{1}{k^{2}}\\right)$$\n\n$$= d_{(\\alpha)}^{2} \\frac{1}{\\lambda^{2} k} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right) + O\\left(\\frac{1}{k^{2}}\\right).$$\n\nThe optimal \u03bb, denoted by \u03bb\u2217, is then\n\n$$\\lambda^* = \\underset{-\\frac{1}{2\\alpha} < \\lambda < \\frac{1}{2}}{\\mathrm{argmin}} \\left\\{ \\frac{1}{\\lambda^{2}} \\left( \\frac{\\frac{2}{\\pi} \\Gamma(1 - 2\\lambda) \\Gamma(2\\lambda\\alpha) \\sin(\\pi\\lambda\\alpha)}{\\left[\\frac{2}{\\pi} \\Gamma(1 - \\lambda) \\Gamma(\\lambda\\alpha) \\sin\\left(\\frac{\\pi}{2} \\lambda\\alpha\\right)\\right]^{2}} - 1 \\right) \\right\\}.$$\n\nB Proof of Lemma 3\n\nWe sketch the basic steps; and we direct readers to the additional supporting material for more detail.\n\nWe use the infinite-product representations of the Gamma and sine functions [8, 8.322, 1.431.1],\n\n$$\\Gamma(z) = \\frac{\\exp(-\\gamma_e z)}{z} \\prod_{s=1}^{\\infty} \\left(1 + \\frac{z}{s}\\right)^{-1} \\exp\\left(\\frac{z}{s}\\right), \\qquad \\sin(z) = z \\prod_{s=1}^{\\infty} \\left(1 - \\frac{z^{2}}{s^{2}\\pi^{2}}\\right),$$\n\nto re-write g(\u03bb; \u03b1) as\n\n$$g(\\lambda; \\alpha) = \\frac{1}{\\lambda^{2}} \\left(M(\\lambda; \\alpha) - 1\\right) = \\frac{1}{\\lambda^{2}} \\left( \\prod_{s=1}^{\\infty} f_s(\\lambda; \\alpha) - 1 \\right),$$\n\n$$f_s(\\lambda; \\alpha) = \\left(1 - \\frac{\\lambda\\alpha}{s}\\right) \\left(1 + \\frac{2\\lambda\\alpha}{s}\\right)^{-1} \\left(1 - \\frac{\\lambda}{s}\\right)^{2} \\left(1 + \\frac{\\lambda\\alpha}{s}\\right)^{3} \\left(1 - \\frac{\\lambda^{2}\\alpha^{2}}{4s^{2}}\\right)^{-2} \\left(1 - \\frac{2\\lambda}{s}\\right)^{-1}.$$\n\nWith respect to \u03bb, the first two derivatives of g(\u03bb; \u03b1) are\n\n$$\\frac{\\partial g}{\\partial \\lambda} = \\frac{1}{\\lambda^{2}} \\left( -\\frac{2}{\\lambda} (M - 1) + \\sum_{s=1}^{\\infty} \\frac{\\partial \\log f_s}{\\partial \\lambda} M \\right),$$\n\n$$\\frac{\\partial^{2} g}{\\partial \\lambda^{2}} = \\frac{M}{\\lambda^{2}} \\left( \\frac{6}{\\lambda^{2}} + \\sum_{s=1}^{\\infty} \\frac{\\partial^{2} \\log f_s}{\\partial \\lambda^{2}} + \\left( \\sum_{s=1}^{\\infty} \\frac{\\partial \\log f_s}{\\partial \\lambda} \\right)^{2} - \\frac{4}{\\lambda} \\sum_{s=1}^{\\infty} \\frac{\\partial \\log f_s}{\\partial \\lambda} \\right) - \\frac{6}{\\lambda^{4}}.$$\n\nAlso,\n\n$$\\sum_{s=1}^{\\infty} \\frac{\\partial \\log f_s}{\\partial \\lambda} = 2\\lambda \\sum_{s=1}^{\\infty} \\left[ \\frac{1}{s^{2} - 3s\\lambda + 2\\lambda^{2}} + \\alpha^{2} \\left( \\frac{2}{4s^{2} - \\lambda^{2}\\alpha^{2}} + \\frac{1}{s^{2} + 3s\\lambda\\alpha + 2\\lambda^{2}\\alpha^{2}} - \\frac{1}{s^{2} - \\lambda^{2}\\alpha^{2}} \\right) \\right],$$\n\n$$\\sum_{s=1}^{\\infty} \\frac{\\partial^{2} \\log f_s}{\\partial \\lambda^{2}} = \\sum_{s=1}^{\\infty} \\left[ \\frac{-2}{(s - \\lambda)^{2}} + \\frac{4}{(s - 2\\lambda)^{2}} + \\frac{2\\alpha^{2}}{(2s - \\lambda\\alpha)^{2}} - \\frac{\\alpha^{2}}{(s - \\lambda\\alpha)^{2}} - \\frac{3\\alpha^{2}}{(s + \\lambda\\alpha)^{2}} + \\frac{4\\alpha^{2}}{(s + 2\\lambda\\alpha)^{2}} + \\frac{2\\alpha^{2}}{(2s + \\lambda\\alpha)^{2}} \\right],$$\n\n$$\\sum_{s=1}^{\\infty} \\frac{\\partial^{3} \\log f_s}{\\partial \\lambda^{3}} = \\sum_{s=1}^{\\infty} \\left[ \\frac{-4}{(s - \\lambda)^{3}} + \\frac{16}{(s - 2\\lambda)^{3}} + 2\\alpha^{3} \\left( \\frac{2}{(2s - \\lambda\\alpha)^{3}} - \\frac{1}{(s - \\lambda\\alpha)^{3}} + \\frac{3}{(s + \\lambda\\alpha)^{3}} - \\frac{8}{(s + 2\\lambda\\alpha)^{3}} - \\frac{2}{(2s + \\lambda\\alpha)^{3}} \\right) \\right],$$\n\n$$\\sum_{s=1}^{\\infty} \\frac{\\partial^{4} \\log f_s}{\\partial \\lambda^{4}} = \\sum_{s=1}^{\\infty} \\left[ \\frac{-12}{(s - \\lambda)^{4}} + \\frac{96}{(s - 2\\lambda)^{4}} + 6\\alpha^{4} \\left( \\frac{2}{(2s - \\lambda\\alpha)^{4}} - \\frac{1}{(s - \\lambda\\alpha)^{4}} - \\frac{3}{(s + \\lambda\\alpha)^{4}} + \\frac{16}{(s + 2\\lambda\\alpha)^{4}} + \\frac{2}{(2s + \\lambda\\alpha)^{4}} \\right) \\right].$$\n\nTo show $\\frac{\\partial^{2} g}{\\partial \\lambda^{2}} > 0$, it suffices to show $\\lambda^{4} \\frac{\\partial^{2} g}{\\partial \\lambda^{2}} > 0$, which can\n
shown based on its own second derivative (and hence we need $\\sum_{s=1}^{\\infty}\\frac{\\partial^4\\log f_s}{\\partial\\lambda^4}$). Here we consider $\\lambda \\neq 0$ to avoid triviality. To complete the proof, we use some properties of the Riemann zeta function and the infinite countability.\n\nNext, we show that $\\lambda^* < -1$ does not satisfy $\\left.\\frac{\\partial g(\\lambda;\\alpha)}{\\partial\\lambda}\\right|_{\\lambda^*} = 0$, which is equivalent to $h(\\lambda^*) = 1$, i.e.,\n\n$$h(\\lambda^*) = M(\\lambda^*)\\left(1 - \\frac{\\lambda^*}{2}\\sum_{s=1}^{\\infty}\\left.\\frac{\\partial\\log f_s}{\\partial\\lambda}\\right|_{\\lambda^*}\\right) = 1.$$\n\nWe show that when $\\lambda < -1$, $\\frac{\\partial h}{\\partial\\lambda} > 0$, i.e., $h(\\lambda) < h(-1)$. We then show $\\frac{\\partial h(-1)}{\\partial\\alpha} < 0$ for $0 < \\alpha < 0.5$; and hence $h(-1;\\alpha) < h(-1;0+) = 1$. Therefore, we must have $\\lambda^* > -1$.\n\nReferences\n[1] C. Aggarwal, editor. Data Streams: Models and Algorithms. Springer, New York, NY, 2007.\n[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 1\u201316, 2002.\n[3] B. Brinkman and M. Charikar. On the impossibility of dimension reduction in l1. Journal of ACM, 52(2):766\u2013788, 2005.\n[4] O. Chapelle, P. Haffner, and V. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055\u20131064, 1999.\n[5] G. Cormode, M. Datar, P. Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). In VLDB, 335\u2013345, 2002.\n[6] D. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289\u20131306, 2006.\n[7] E. Fama and R. Roll. Parameter estimates for symmetric stable distributions. JASA, 66(334):331\u2013338, 1971.\n[8] I. Gradshteyn and I. Ryzhik. Table of Integrals, Series, and Products. Academic Press, New York, fifth edition, 1994.\n[9] P. Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of ACM, 53(3):307\u2013323, 2006.\n[10] W. Johnson and J. Lindenstrauss. Extensions of Lipschitz mapping into Hilbert space. Contemporary Mathematics, 26:189\u2013206, 1984.\n[11] E. Lehmann and G. Casella. Theory of Point Estimation. Springer, New York, NY, second edition, 1998.\n[12] E. Leopold and J. Kindermann. Text categorization with support vector machines. How to represent texts in input space? Machine Learning, 46(1-3):423\u2013444, 2002.\n[13] P. Li. Estimators and tail bounds for dimension reduction in l\u03b1 (0 < \u03b1 \u2264 2) using stable random projections. In SODA, 2008.\n[14] P. Li and K. Church. Using sketches to estimate associations. In HLT/EMNLP, 708\u2013715, 2005.\n[15] P. Li and K. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 33(3):305\u2013354, 2007.\n[16] P. Li, K. Church, and T. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, 873\u2013880, 2007.\n[17] P. Li, T. Hastie, and K. Church. Improving random projections using marginal information. In COLT, 635\u2013649, 2006.\n[18] P. Li, T. Hastie, and K. Church. Nonlinear estimators and tail bounds for dimensional reduction in l1 using Cauchy random projections. Journal of Machine Learning Research (To appear).\n[19] M. Matsui and A. Takemura. Some improvements in numerical evaluation of symmetric stable density and its derivatives. Communications on Statistics-Theory and Methods, 35(1):149\u2013172, 2006.\n[20] J. McCulloch. Simple consistent estimators of stable distribution parameters. Communications on Statistics-Simulation, 15(4):1109\u20131136, 1986.\n[21] J. Rennie, L. Shih, J. Teevan, and D. Karger. Tackling the poor assumptions of naive Bayes text classifiers. In ICML, 616\u2013623, 2003.\n[22] S. Vempala. The Random Projection Method. American Mathematical Society, Providence, RI, 2004.\n[23] J. Zhu, S. 
Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In NIPS, Vancouver, 2003.\n[24] V. M. Zolotarev. One-dimensional Stable Distributions. American Mathematical Society, Providence, RI, 1986.", "award": [], "sourceid": 3225, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}