{"title": "Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 880, "abstract": null, "full_text": "Conditional Random Sampling: A Sketch-based Sampling Technique for Sparse Data\n\nPing Li\nDepartment of Statistics\nStanford University\nStanford, CA 94305\npingli@stat.stanford.edu\n\nKenneth W. Church\nMicrosoft Research\nOne Microsoft Way\nRedmond, WA 98052\nchurch@microsoft.com\n\nTrevor J. Hastie\nDepartment of Statistics\nStanford University\nStanford, CA 94305\nhastie@stanford.edu\n\nAbstract\n\nWe1 develop Conditional Random Sampling (CRS), a technique particularly suitable for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.\n\n1 Introduction\n\nConditional Random Sampling (CRS) is a sketch-based sampling technique that effectively exploits data sparsity. In modern applications in learning, data mining, and information retrieval, the datasets are often very large and also highly sparse. For example, the term-document matrix is often more than 99% sparse [7]. Sampling large-scale sparse data is challenging. Conventional random sampling (i.e., randomly picking a small fraction) often performs poorly when most of the samples are zeros.
Also, in heavy-tailed data, the estimation errors of random sampling can be very large.\n\nAs alternatives to random sampling, various sketching algorithms have become popular, e.g., random projections [17] and min-wise sketches [6]. Sketching algorithms are designed for approximating specific summary statistics. For a specific task, a sketching algorithm often outperforms random sampling. On the other hand, random sampling is much more flexible. For example, we can use the same set of random samples to estimate any lp pairwise distances and multi-way associations. Conditional Random Sampling (CRS) combines the advantages of both sketching and random sampling.\n\nMany important applications concern only the pairwise distances, e.g., distance-based clustering and classification, multi-dimensional scaling, and kernels. For a large training set (e.g., at Web scale), computing pairwise distances exactly is often too time-consuming or even infeasible.\n\nLet A be a data matrix of n rows and D columns. For example, A can be the term-document matrix with n as the total number of word types and D as the total number of documents. In modern search engines, n ≈ 10^6 to 10^7 and D ≈ 10^10 to 10^11. In general, n is the number of data points and D is the number of features. Computing all pairwise associations AA^T, also called the Gram matrix in machine learning, costs O(n^2 D), which can be daunting for large n and D. Various sampling methods have been proposed for approximating Gram matrices and kernels [2, 8]. For example, using (normal) random projections [17], we approximate AA^T by (AR)(AR)^T, where the entries of R ∈ R^{D×k} are i.i.d. N(0, 1). This reduces the cost to O(nDk + n^2 k), where k ≪ min(n, D).\n\n1 The full version [13]: www.stanford.edu/~pingli98/publications/CRS tr.pdf\n\nSampling techniques can be critical in databases and information retrieval.
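The projection step described above (approximating AA^T by (AR)(AR)^T, or equivalently all pairwise squared l2 distances) is easy to sketch in code. The following is a minimal NumPy illustration on dense toy data; the matrix sizes, seed, and helper name `sq_dists` are our own choices, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, k = 50, 5000, 400            # n data points, D features, k << D

A = rng.standard_normal((n, D))    # dense toy data; real data would be sparse

# Normal random projections: B = A R, with R_ij ~ N(0, 1) i.i.d.
R = rng.standard_normal((D, k))
B = A @ R

def sq_dists(M):
    # All pairwise squared l2 distances between the rows of M.
    sq = (M * M).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (M @ M.T)

d_exact = sq_dists(A)              # exact Gram-based distances: O(n^2 D)
d_est = sq_dists(B) / k            # unbiased estimate: O(nDk + n^2 k)

iu = np.triu_indices(n, 1)         # distinct pairs only
rel_err = np.abs(d_est[iu] - d_exact[iu]) / d_exact[iu]
print(rel_err.mean())              # a few percent for k = 400
```

The estimator is unbiased with variance 2[d(2)]^2/k per pair (Section 4.1), so the typical relative error shrinks like sqrt(2/k) regardless of D.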
For example, the database query optimizer seeks highly efficient techniques to estimate the intermediate join sizes in order to choose an “optimum” execution path for multi-way joins.\n\nConditional Random Sampling (CRS) can be applied to estimating pairwise distances (in any norm) as well as multi-way associations. CRS can also be used for estimating joint histograms (two-way and multi-way). While this paper focuses on estimating pairwise l2 and l1 distances and inner products, we refer readers to the technical report [13] for estimating joint histograms. Our early work [11, 12] concerned estimating two-way and multi-way associations in boolean (0/1) data. We will compare CRS with normal random projections for approximating l2 distances and inner products, and with Cauchy random projections for approximating l1 distances. In boolean data, CRS bears some similarity to Broder's sketches [6], with some important distinctions. [12] showed that in boolean data, CRS improves Broder's sketches by roughly halving the estimation variances.\n\n2 The Procedures of CRS\n\nConditional Random Sampling is a two-stage procedure. In the sketching stage, we scan the data matrix once and store a fraction of the non-zero elements of each data point as “sketches.” In the estimation stage, we generate conditional random samples online, pairwise (for two-way) or group-wise (for multi-way); hence the name Conditional Random Sampling (CRS).\n\n2.1 The Sampling/Sketching Procedure\n\n[Figure 1: A global view of the sketching stage. Panels: (a) Original, (b) Permuted, (c) Postings, (d) Sketches.]\n\nFigure 1 provides a global view of the sketching stage. The columns of a sparse data matrix (a) are first randomly permuted (b).
Then only the non-zero entries are considered, called postings (c). Sketches are simply the front of the postings (d). Note that in the actual implementation, we only need to maintain a permutation mapping on the column IDs.\n\n[Figure 2: (a) A data matrix with two rows and D = 15:\nu1 = 0 1 0 2 0 1 0 0 1 2 1 0 1 0 2\nu2 = 1 3 0 0 1 2 0 1 0 0 3 0 0 2 1\nIf the column IDs are random, the first Ds = 10 columns constitute a random sample; ui denotes the ith row. (b) Postings consist of tuples “ID (Value)”:\nP1: 2 (1), 4 (2), 6 (1), 9 (1), 10 (2), 11 (1), 13 (1), 15 (2)\nP2: 1 (1), 2 (3), 5 (1), 6 (2), 8 (1), 11 (3), 14 (2), 15 (1)\n(c) Sketches are the first ki entries of the postings, sorted ascending by ID:\nK1: 2 (1), 4 (2), 6 (1), 9 (1), 10 (2)\nK2: 1 (1), 2 (3), 5 (1), 6 (2), 8 (1), 11 (3)\nIn this example, k1 = 5, k2 = 6, and Ds = min(10, 11) = 10. Excluding 11 (3) in K2, we obtain the same samples as if we had directly sampled the first Ds = 10 columns of the data matrix.]\n\nApparently sketches are not uniformly random samples, which may make the estimation task difficult. We show, in Figure 2, that sketches are almost random samples pairwise (or group-wise). Figure 2(a) constructs conventional random samples from a data matrix; we then show that one can generate (retrospectively) the same random samples from the sketches in Figure 2(b)(c).\n\nIn Figure 2(a), when the columns are randomly permuted, we can construct random samples by simply taking the first Ds columns of the data matrix of D columns (Ds ≪ D in real applications). For sparse data, we only store the non-zero elements in the form of tuples “ID (Value),” a structure called postings. We denote the postings by Pi for each row ui. Figure 2(b) shows the postings for the data matrix in Figure 2(a). The tuples are sorted ascending by their IDs.
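The sketch construction and the conditioning on Ds can be spelled out in a few lines of Python on the two example rows of Figure 2. This is a toy sketch of the procedure under the assumption that the column IDs have already been randomly permuted; the function and variable names are ours:

```python
# CRS sketching stage on the Figure 2 example rows (column IDs
# assumed already randomly permuted).
u1 = [0, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 0, 2]
u2 = [1, 3, 0, 0, 1, 2, 0, 1, 0, 0, 3, 0, 0, 2, 1]
D = len(u1)

def postings(u):
    # Non-zero entries as (ID, value) tuples, sorted ascending by ID.
    return [(i + 1, v) for i, v in enumerate(u) if v != 0]

def sketch(u, k):
    # A sketch keeps the k postings with the smallest IDs.
    return postings(u)[:k]

K1, K2 = sketch(u1, 5), sketch(u2, 6)

# Effective sample size: Ds = min(max ID in K1, max ID in K2).
Ds = min(K1[-1][0], K2[-1][0])
print(Ds)  # 10

# Conditional random samples: drop sketch entries with ID > Ds, then
# estimate the inner product by scaling the sample sum up by D / Ds.
s1 = {i: v for i, v in K1 if i <= Ds}
s2 = {i: v for i, v in K2 if i <= Ds}
a_hat = (D / Ds) * sum(v * s2[i] for i, v in s1.items() if i in s2)
print(a_hat)  # 7.5; the exact inner product over all 15 columns is 10
```

Only the entry 11 (3) of K2 is discarded, exactly as in the Figure 2 caption, so the surviving entries coincide with a conventional random sample of the first Ds = 10 columns.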
A sketch, Ki, of postings Pi, is the first ki entries (i.e., the smallest ki IDs) of Pi, as shown in Figure 2(c).\n\nThe central observation is that if we exclude all elements of sketches whose IDs are larger than\n\nDs = min( max(ID(K1)), max(ID(K2)) ),   (1)\n\nwe obtain exactly the same samples as if we had directly sampled the first Ds columns from the data matrix in Figure 2(a). This way, we convert sketches into random samples by conditioning on Ds, which differs from pair to pair and is not known beforehand.\n\n2.2 The Estimation Procedure\n\nThe estimation task for CRS can be extremely simple. After we construct the conditional random samples from sketches K1 and K2 with the effective sample size Ds, we can compute any distances (l2, l1, or inner products) from the samples and multiply them by D/Ds to scale back to the original space. (Later, we will show how to improve the estimates by taking advantage of the marginal information.) We use ũ1,j and ũ2,j (j = 1 to Ds) to denote the conditional random samples (of size Ds) obtained by CRS. For example, in Figure 2, we have Ds = 10, and the non-zero ũ1,j and ũ2,j are\n\nũ1,2 = 1, ũ1,4 = 2, ũ1,6 = 1, ũ1,9 = 1, ũ1,10 = 2,\nũ2,1 = 1, ũ2,2 = 3, ũ2,5 = 1, ũ2,6 = 2, ũ2,8 = 1.\n\nDenote the inner product, squared l2 distance, and l1 distance by a, d(2), and d(1), respectively:\n\na = Σ_{i=1}^D u1,i u2,i,   d(2) = Σ_{i=1}^D |u1,i − u2,i|^2,   d(1) = Σ_{i=1}^D |u1,i − u2,i|.   (2)\n\nOnce we have the random samples, we can use the following simple linear estimators:\n\nâMF = (D/Ds) Σ_{j=1}^{Ds} ũ1,j ũ2,j,   d̂(2)MF = (D/Ds) Σ_{j=1}^{Ds} (ũ1,j − ũ2,j)^2,   d̂(1)MF = (D/Ds) Σ_{j=1}^{Ds} |ũ1,j − ũ2,j|.   (3)\n\n2.3 The Computational Cost\n\nThe sketching stage requires generating a random permutation mapping of length D and one linear scan of all the non-zeros. Therefore, generating sketches for A ∈ R^{n×D} costs O(Σ_{i=1}^n fi), where fi is the number of non-zeros in the ith row, i.e., fi = |Pi|. In the estimation stage, we need only linearly scan the sketches. While the conditional sample size Ds might be large, the cost of estimating the distance for one pair of data points is only O(k1 + k2) instead of O(Ds).\n\n3 The Theoretical Variance Analysis of CRS\n\nWe give some theoretical analysis of the variances of CRS. For simplicity, we ignore the “finite population correction factor,” (D − Ds)/(D − 1), due to sampling without replacement.\n\nWe first consider âMF = (D/Ds) Σ_{j=1}^{Ds} ũ1,j ũ2,j. By assuming sampling with replacement, the samples (ũ1,j ũ2,j), j = 1 to Ds, are i.i.d. conditional on Ds. Thus\n\nE(ũ1,1 ũ2,1) = (1/D) Σ_{i=1}^D u1,i u2,i = a/D,   E((ũ1,1 ũ2,1)^2) = (1/D) Σ_{i=1}^D (u1,i u2,i)^2,   (4)\n\nVar(âMF | Ds) = (D/Ds)^2 Ds Var(ũ1,1 ũ2,1) = (D^2/Ds) ( E((ũ1,1 ũ2,1)^2) − E^2(ũ1,1 ũ2,1) ),   (5)\n\nVar(âMF | Ds) = (D^2/Ds) ( (1/D) Σ_{i=1}^D (u1,i u2,i)^2 − (a/D)^2 ) = (D/Ds) ( Σ_{i=1}^D u1,i^2 u2,i^2 − a^2/D ).\n\nThe unconditional variance is then simply\n\nVar(âMF) = E( Var(âMF | Ds) ) = E(D/Ds) ( Σ_{i=1}^D u1,i^2 u2,i^2 − a^2/D ),   (6)\n\nas Var(X̂) = E(Var(X̂ | Ds)) + Var(E(X̂ | Ds)) = E(Var(X̂ | Ds)) when X̂ is conditionally unbiased. No closed-form expression is known for E(D/Ds); but we know E(D/Ds) ≥ max(f1/k1, f2/k2) (similar to Jensen's inequality). Asymptotically (as k1 and k2 increase), the inequality becomes an equality:\n\nE(D/Ds) ≈ max( (f1 + 1)/k1, (f2 + 1)/k2 ) ≈ max( f1/k1, f2/k2 ),   (7)\n\nwhere f1 and f2 are the numbers of non-zeros in u1 and u2, respectively. See [13] for the proof. Extensive simulations in [13] verify that the errors of (7) are usually within 5% when k1, k2 > 20. We similarly derive the variances for d̂(2)MF and d̂(1)MF. In summary, we obtain (when k1 = k2 = k)\n\nVar(âMF) = E(D/Ds) ( Σ_{i=1}^D u1,i^2 u2,i^2 − a^2/D ) ≈ (max(f1, f2)/(kD)) ( D Σ_{i=1}^D u1,i^2 u2,i^2 − a^2 ),   (8)\n\nVar(d̂(2)MF) = E(D/Ds) ( d(4) − [d(2)]^2/D ) ≈ (max(f1, f2)/(kD)) ( D d(4) − [d(2)]^2 ),   (9)\n\nVar(d̂(1)MF) = E(D/Ds) ( d(2) − [d(1)]^2/D ) ≈ (max(f1, f2)/(kD)) ( D d(2) − [d(1)]^2 ),   (10)\n\nwhere we denote d(4) = Σ_{i=1}^D (u1,i − u2,i)^4. The sparsity term max(f1, f2)/D reduces the variances significantly: if max(f1, f2)/D = 0.01, the variances can be reduced by a factor of 100, compared to conventional random coordinate sampling.\n\n4 A Brief Introduction to Random Projections\n\nWe give a brief introduction to random projections, with which we compare CRS. (Normal) random projections [17] are widely used in learning and data mining [2-4].\n\nRandom projections multiply the data matrix A ∈ R^{n×D} by a random matrix R ∈ R^{D×k} to generate a compact representation B = AR ∈ R^{n×k}. For estimating l2 distances, R typically consists of i.i.d. entries in N(0, 1); hence we call these normal random projections. For l1, R consists of i.i.d. Cauchy C(0, 1) [9].
However, the recent impossibility result [5] has ruled out estimators that could be metrics for dimension reduction in l1.\n\nDenote by v1, v2 ∈ R^k the two rows of B corresponding to the original data points u1, u2 ∈ R^D. We also introduce notation for the marginal l2 norms: m1 = ‖u1‖^2, m2 = ‖u2‖^2.\n\n4.1 Normal Random Projections\n\nIn this case, R consists of i.i.d. N(0, 1). It is easy to show that the following linear estimators of the inner product a and the squared l2 distance d(2) are unbiased:\n\nâNRP,MF = (1/k) v1^T v2,   d̂(2)NRP,MF = (1/k) ‖v1 − v2‖^2,   (11)\n\nwith variances [15, 17]\n\nVar(âNRP,MF) = (1/k) (m1 m2 + a^2),   Var(d̂(2)NRP,MF) = 2[d(2)]^2 / k.   (12)\n\nAssuming that the margins m1 = ‖u1‖^2 and m2 = ‖u2‖^2 are known, [15] provides a maximum likelihood estimator, denoted by âNRP,MLE, whose (asymptotic) variance is\n\nVar(âNRP,MLE) = (1/k) (m1 m2 − a^2)^2 / (m1 m2 + a^2) + O(k^{-2}).   (13)\n\n4.2 Cauchy Random Projections for Dimension Reduction in l1\n\nIn this case, R consists of i.i.d. entries in Cauchy C(0, 1). [9] proposed an estimator based on the absolute sample median. Recently, [14] proposed a variety of nonlinear estimators, including a bias-corrected sample median estimator, a bias-corrected geometric mean estimator, and a bias-corrected maximum likelihood estimator.
An analog of the Johnson-Lindenstrauss (JL) lemma for dimension reduction in l1 is also proved in [14], based on the bias-corrected geometric mean estimator. We list only the maximum likelihood estimator derived in [14], because it is the most accurate one:\n\nd̂(1)CRP,MLE,c = (1 − 1/k) d̂(1)CRP,MLE,   (14)\n\nwhere d̂(1)CRP,MLE solves the nonlinear MLE equation\n\n−k / d̂(1)CRP,MLE + Σ_{j=1}^k 2 d̂(1)CRP,MLE / ( (v1,j − v2,j)^2 + (d̂(1)CRP,MLE)^2 ) = 0.   (15)\n\n[14] shows that\n\nVar(d̂(1)CRP,MLE,c) = 2[d(1)]^2 / k + 3[d(1)]^2 / k^2 + O(1/k^3).   (16)\n\n4.3 General Stable Random Projections for Dimension Reduction in lp (0 < p ≤ 2)\n\n[10] generalized the bias-corrected geometric mean estimator to general stable random projections for dimension reduction in lp (0 < p ≤ 2), and provided the theoretical variances and exponential tail bounds. Of course, CRS can also be applied to approximating any lp distances.\n\n5 Improving CRS Using Marginal Information\n\nIt is often reasonable to assume that we know marginal information such as marginal l2 norms, numbers of non-zeros, or even marginal histograms. This often leads to (much) sharper estimates, obtained by maximizing the likelihood under marginal constraints. In the boolean data case, we can express the MLE solution explicitly and derive a closed-form (asymptotic) variance. For general real-valued data, the joint likelihood is not available; we propose an approximate MLE solution.\n\n5.1 Boolean (0/1) Data\n\nIn 0/1 data, estimating the inner product becomes estimating a two-way contingency table, which has four cells. Because of the margin constraints, there is only one degree of freedom.
Therefore, it is not hard to show that the MLE of a is the solution, denoted by â0/1,MLE, to the cubic equation\n\ns11/a − s10/(f1 − a) − s01/(f2 − a) + s00/(D − f1 − f2 + a) = 0,   (17)\n\nwhere s11 = #{j : ũ1,j = ũ2,j = 1}, s10 = #{j : ũ1,j = 1, ũ2,j = 0}, s01 = #{j : ũ1,j = 0, ũ2,j = 1}, s00 = #{j : ũ1,j = 0, ũ2,j = 0}, for j = 1, 2, ..., Ds.\n\nThe (asymptotic) variance of â0/1,MLE is proved [11-13] to be\n\nVar(â0/1,MLE) = E(D/Ds) · 1 / ( 1/a + 1/(f1 − a) + 1/(f2 − a) + 1/(D − f1 − f2 + a) ).   (18)\n\n5.2 Real-valued Data\n\nA practical solution is to assume some parametric form of the (bivariate) data distribution based on prior knowledge, and then solve an MLE under the various constraints. Suppose the samples (ũ1,j, ũ2,j) are i.i.d. bivariate normal with moments determined by the population moments, i.e.,\n\n[ ṽ1,j ; ṽ2,j ] = [ ũ1,j − ū1 ; ũ2,j − ū2 ] ~ N( [ 0 ; 0 ], Σ̃ ),   (19)\n\nΣ̃ = (1/D) [ ‖u1‖^2 − D ū1^2,  u1^T u2 − D ū1 ū2 ;  u1^T u2 − D ū1 ū2,  ‖u2‖^2 − D ū2^2 ] = (1/Ds) [ m̈1, ä ; ä, m̈2 ],   (20)\n\nwhere ū1 = Σ_{i=1}^D u1,i / D and ū2 = Σ_{i=1}^D u2,i / D are the population means, m̈1 = (Ds/D)(‖u1‖^2 − D ū1^2), m̈2 = (Ds/D)(‖u2‖^2 − D ū2^2), and ä = (Ds/D)(u1^T u2 − D ū1 ū2). Suppose that ū1, ū2, m1 = ‖u1‖^2, and m2 = ‖u2‖^2 are known; then an MLE for a = u1^T u2, denoted by âMLE,N, is\n\nâMLE,N = (D/Ds) ä̂ + D ū1 ū2,   (21)\n\nwhere, similar to Lemma 2 of [15], ä̂ is the solution to the cubic equation\n\nä^3 − ä^2 (ṽ1^T ṽ2) + ä ( −m̈1 m̈2 + m̈1 ‖ṽ2‖^2 + m̈2 ‖ṽ1‖^2 ) − m̈1 m̈2 (ṽ1^T ṽ2) = 0.   (22)\n\nâMLE,N is fairly robust, although sometimes we observe that the biases are quite noticeable. In general, this is a good bias-variance trade-off (especially when k is not too large). Intuitively, the reason why this (seemingly crude) assumption of bivariate normality works well is that, once we have fixed the margins, we have removed, to a large extent, the non-normal component of the data.\n\n6 Theoretical Comparisons of CRS With Random Projections\n\nAs reflected by their variances, for general data types, whether CRS is better than random projections depends on two competing factors: data sparsity and data heavy-tailedness. However, in the following two important scenarios, CRS outperforms random projections.\n\n6.1 Boolean (0/1) Data\n\nIn this case, the marginal norms are the same as the numbers of non-zeros, i.e., mi = ‖ui‖^2 = fi. Figure 3 plots the ratio Var(âMF) / Var(âNRP,MF), verifying that CRS is (considerably) more accurate:\n\nVar(âMF) / Var(âNRP,MF) = [ max(f1, f2) / (f1 f2 + a^2) ] · [ 1 / ( 1/a + 1/(D − a) ) ] ≤ max(f1, f2) a / (f1 f2 + a^2) ≤ 1.\n\nFigure 4 plots Var(â0/1,MLE) / Var(âNRP,MLE). In most of the possible range of the data, this ratio is less than 1. When u1 and u2 are very close (e.g., a ≈ f2 ≈ f1), random projections appear more accurate.
However, when this does occur, the absolute variances are so small (even zero) that their ratio does not matter.\n\n[Figure 3: four panels (f2/f1 = 0.2, 0.5, 0.8, 1) of the variance ratio versus a/f2. Caption: The variance ratios, Var(âMF)/Var(âNRP,MF), show that CRS has smaller variances than random projections when no marginal information is used. We let f1 ≥ f2 and f2 = αf1 with α = 0.2, 0.5, 0.8, 1.0. For each α, we plot from f1 = 0.05D to f1 = 0.95D, spaced at 0.05D.]\n\n[Figure 4: four panels (f2/f1 = 0.2, 0.5, 0.8, 1) of the variance ratio versus a/f2. Caption: The ratios, Var(â0/1,MLE)/Var(âNRP,MLE), show that CRS usually has smaller variances than random projections, except when f1 ≈ f2 ≈ a.]\n\n6.2 Nearly Independent Data\n\nSuppose two data points u1 and u2 are independent (or, less strictly, uncorrelated to the second order); it is easy to show that
the variance of CRS is always smaller:\n\nVar(âMF) ≤ ( max(f1, f2)/D ) · ( m1 m2 / k ) ≤ Var(âNRP,MF) = ( m1 m2 + a^2 ) / k,   (23)\n\neven if we ignore the data sparsity factor max(f1, f2)/D. Therefore, CRS will be much better for estimating inner products in nearly independent data. Once we have obtained the inner products, we can easily infer the l2 distances via d(2) = m1 + m2 − 2a, since the margins m1 and m2 are easy to obtain exactly. In high dimensions, it is often the case that most of the data points are only very weakly correlated.\n\n6.3 Comparing the Computational Efficiency\n\nAs previously mentioned, the cost of constructing sketches for A ∈ R^{n×D} is O(nD) (or, more precisely, O(Σ_{i=1}^n fi)). The cost of (normal) random projections is O(nDk), which can be reduced to O(nDk/3) using sparse random projections [1]. Therefore, it is possible that CRS is considerably more efficient than random projections in the sampling stage.2\n\nIn the estimation stage, CRS costs O(2k) to compute the sample distance for each pair. This cost is only O(k) for random projections. Since k is very small, the difference should not be a concern.\n\n7 Empirical Evaluations\n\nWe compare CRS with random projections (RP) using real data, including n = 100 randomly sampled documents from the NSF data [7] (sparsity ≈ 1%), n = 100 documents from the NEWSGROUP data [4] (sparsity ≈ 1%), and one class of the COREL image data (n = 80, sparsity ≈ 5%). We estimate all pairwise inner products and l1 and l2 distances, using both CRS and RP. For each pair, we conduct 50 runs and average the absolute errors. We compare the median errors and the percentage of pairs for which CRS does better than random projections.\n\nThe results are presented in Figures 5, 6, and 7. In each panel, the dashed curve indicates that we sample each data point with equal sample size (k).
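The margin trick of Section 6.2 (estimate the inner product, then infer d(2) = m1 + m2 − 2a from exact margins) is simple to apply. Below is a toy Python illustration on the two example rows of Figure 2, where â = 7.5 is the value the plain CRS estimator gives with k1 = 5, k2 = 6 on that example; the variable names are ours:

```python
# Infer the squared l2 distance from exact margins plus an estimated
# inner product: d(2) = m1 + m2 - 2a (Section 6.2), on the Figure 2 rows.
u1 = [0, 1, 0, 2, 0, 1, 0, 0, 1, 2, 1, 0, 1, 0, 2]
u2 = [1, 3, 0, 0, 1, 2, 0, 1, 0, 0, 3, 0, 0, 2, 1]

m1 = sum(v * v for v in u1)          # exact margin ||u1||^2 = 17
m2 = sum(v * v for v in u2)          # exact margin ||u2||^2 = 30

a_exact = sum(x * y for x, y in zip(u1, u2))           # exact a = 10
d2_exact = sum((x - y) ** 2 for x, y in zip(u1, u2))   # exact d(2)

a_hat = 7.5                          # CRS estimate of the inner product
d2_hat = m1 + m2 - 2 * a_hat         # inferred squared l2 distance

print(d2_exact, m1 + m2 - 2 * a_exact)  # identical: 27 27
print(d2_hat)                           # 32.0
```

With exact margins, all of the estimation error in d̂(2) comes from the single scalar â, which is why this route is often sharper than estimating d(2) directly.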
For CRS, we can adjust the sample size according to the sparsity, reflected by the solid curves. We adjust sample sizes only roughly: the data points are divided into 3 groups according to sparsity, and data in different groups are assigned different sample sizes for CRS. For random projections, we use the average sample size.\n\nFor both the NSF and NEWSGROUP data, CRS overwhelmingly outperforms RP for estimating inner products and l2 distances (both using the marginal information). CRS also outperforms RP for approximating l1 and l2 distances (without using the margins).\n\nFor the COREL data, CRS still outperforms RP for approximating inner products and l2 distances (using the margins). However, RP considerably outperforms CRS for approximating l1 distances and l2 distances (without using the margins). Note that the COREL image data are not too sparse and are considerably more heavy-tailed than the NSF and NEWSGROUP data [13].\n\n[Figure 5: eight panels (inner product, l1 distance, l2 distance, l2 distance with margins; error ratios on top, percentages on the bottom) versus sample size k.]\n\nFigure 5: NSF data. Upper four panels: ratios (CRS over RP (random projections)) of the median absolute errors; values < 1 indicate that CRS does better.
Bottom four panels: percentage of pairs for which CRS has smaller errors than RP; values > 0.5 indicate that CRS does better. Dashed curves correspond to fixed sample sizes, while solid curves indicate that we (crudely) adjust sketch sizes in CRS according to data sparsity. In this case, CRS is overwhelmingly better than RP for approximating inner products and l2 distances (both using margins).\n\n8 Conclusion\n\nThere are many applications of l1 and l2 distances on large sparse datasets. We propose a new sketch-based method, Conditional Random Sampling (CRS), which is provably better than random projections, at least for the important special cases of boolean data and nearly independent data. For general non-boolean data, CRS compares favorably, both theoretically and empirically, especially when we take advantage of the margins (which are easier to compute than distances).\n\n2 [16] proposed very sparse random projections to reduce the cost O(nDk) down to O(n√D k).\n\n[Figure 6: eight panels in the same layout as Figure 5, versus sample size k.]\n\nFigure 6: NEWSGROUP data.
The results are quite similar to those in Figure 5 for the NSF data. In this case, it is more obvious that adjusting sketch sizes helps CRS.\n\n[Figure 7: eight panels in the same layout as Figure 5, versus sample size k.]\n\nFigure 7: COREL image data.\n\nAcknowledgment\n\nWe thank Chris Burges, David Heckerman, Chris Meek, Andrew Ng, Art Owen, and Robert Tibshirani for various helpful conversations, comments, and discussions. We thank Ella Bingham, Inderjit Dhillon, and Matthias Hein for the datasets.\n\nReferences\n\n[1] D. Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671-687, 2003.\n[2] D. Achlioptas, F. McSherry, and B. Schölkopf. Sampling techniques for kernel methods. In NIPS, pages 335-342, 2001.\n[3] R. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 63(2):161-182, 2006.\n[4] E. Bingham and H. Mannila. Random projection in dimensionality reduction: Applications to image and text data. In KDD, pages 245-250, 2001.\n[5] B. Brinkman and M. Charikar.
On the impossibility of dimension reduction in l1. Journal of the ACM, 52(2):766-788, 2005.\n[6] A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences, pages 21-29, 1997.\n[7] I. Dhillon and D. Modha. Concept decompositions for large sparse text data using clustering. Machine Learning, 42(1-2):143-175, 2001.\n[8] P. Drineas and M. Mahoney. On the Nystrom method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153-2175, 2005.\n[9] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In FOCS, pages 189-197, 2000.\n[10] P. Li. Very sparse stable random projections, estimators and tail bounds for stable random projections. Technical report, http://arxiv.org/PS cache/cs/pdf/0611/0611114.pdf, 2006.\n[11] P. Li and K. Church. Using sketches to estimate associations. In HLT/EMNLP, pages 708-715, 2005.\n[12] P. Li and K. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, to appear.\n[13] P. Li, K. Church, and T. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. Technical Report 2006-08, Department of Statistics, Stanford University, 2006.\n[14] P. Li, K. Church, and T. Hastie. Nonlinear estimators and tail bounds for dimension reduction in l1 using Cauchy random projections. http://arxiv.org/PS cache/cs/pdf/0610/0610155.pdf, 2006.\n[15] P. Li, T. Hastie, and K. Church. Improving random projections using marginal information. In COLT, pages 635-649, 2006.\n[16] P. Li, T. Hastie, and K. Church. Very sparse random projections. In KDD, pages 287-296, 2006.\n[17] S. Vempala. The Random Projection Method.
American Mathematical Society, Providence, RI, 2004.\n\n\f", "award": [], "sourceid": 2980, "authors": [{"given_name": "Ping", "family_name": "Li", "institution": null}, {"given_name": "Kenneth", "family_name": "Church", "institution": null}, {"given_name": "Trevor", "family_name": "Hastie", "institution": null}]}