{"title": "One sketch for all: Theory and Application of Conditional Random Sampling", "book": "Advances in Neural Information Processing Systems", "page_first": 953, "page_last": 960, "abstract": "Conditional Random Sampling (CRS) was originally proposed for efficiently computing pairwise ($l_2$, $l_1$) distances, in static, large-scale, and sparse data sets such as text and Web data. It was previously presented using a heuristic argument. This study extends CRS to handle dynamic or streaming data, which much better reflect the real-world situation than assuming static data. Compared with other known sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is ``one-sketch-for-all.'' In particular, we demonstrate that CRS can be applied to efficiently compute the $l_p$ distance and the Hilbertian metrics, both are popular in machine learning. Although a fully rigorous analysis of CRS is difficult, we prove that, with a simple modification, CRS is rigorous at least for an important application of computing Hamming norms. A generic estimator and an approximate variance formula are provided and tested on various applications, for computing Hamming norms, Hamming distances, and $\\chi^2$ distances.", "full_text": "One Sketch For All: Theory and Application of\n\nConditional Random Sampling\n\nPing Li\n\nDept. of Statistical Science\n\nCornell University\n\nKenneth W. Church\nMicrosoft Research\nMicrosoft Corporation\n\nTrevor J. Hastie\nDept. of Statistics\nStanford University\n\npingli@cornell.edu\n\nchurch@microsoft.com\n\nhastie@stanford.edu\n\nAbstract\n\nConditional Random Sampling (CRS) was originally proposed for ef\ufb01ciently\ncomputing pairwise (l2, l1) distances, in static, large-scale, and sparse data. This\nstudy modi\ufb01es the original CRS and extends CRS to handle dynamic or stream-\ning data, which much better re\ufb02ect the real-world situation than assuming static\ndata. 
Compared with many other sketching algorithms for dimension reductions such as stable random projections, CRS exhibits a significant advantage in that it is "one-sketch-for-all." In particular, we demonstrate the effectiveness of CRS in efficiently computing the Hamming norm, the Hamming distance, the lp distance, and the χ2 distance. A generic estimator and an approximate variance formula are also provided, for approximating any type of distances.
We recommend CRS as a promising tool for building highly scalable systems, in machine learning, data mining, recommender systems, and information retrieval.

1 Introduction
Learning algorithms often assume a data matrix A ∈ R^{n×D} with n observations and D attributes and operate on the data matrix A through pairwise distances. The task of computing and maintaining distances becomes non-trivial when the data (both n and D) are large and possibly dynamic.
For example, if A denotes a term-doc matrix at Web scale with each row representing one Web page, then n ≈ O(10^10) (which may be verified by querying "A" or "The" in a search engine). Assuming 10^5 English words, the simplest uni-gram model requires the dimension D ≈ O(10^5); and a bi-gram model can boost the dimension to D ≈ O(10^10). The Google book search program currently provides data sets on indexed digital books up to five-grams. Note that the term-doc matrix is "transposable," meaning that one can treat either documents or terms as features, depending on applications.
Another example is the image data. The Caltech 256 benchmark contains n = 30,608 images, provided by two commercial firms. Using pixels as features, a 1024 × 1024 color image can be represented by a vector of dimension D = 1024^2 × 3 = 3,145,728. 
Using histogram-based features (e.g., [3]), D = 256^3 = 16,777,216 is possible if one discretizes the RGB space into 256^3 scales.
Text data are large and sparse, as most terms appear only in a small fraction of documents. For example, a search engine reports 10^7 pagehits for the query "NIPS," a term not common to the general audience. Out of 10^10 pages, 10^7 pagehits indicate a sparsity of 99.9%. (We define sparsity as the percentage of zero elements.) In absolute magnitude, however, 10^7 is still very large.
Not all large-scale data are sparse. Image data are usually sparse when features are represented by histograms; they are, however, dense when pixel-based features are used.

1.1 Pairwise Distances Used in Machine Learning
The lp distance and the χ2 distance are both popular. Denote by u1 and u2 the leading two rows in A ∈ R^{n×D}. The lp distance (raised to the pth power) and the χ2 distance are, respectively,

d_p(u_1, u_2) = \sum_{i=1}^{D} |u_{1,i} - u_{2,i}|^p,    d_{\chi^2}(u_1, u_2) = \sum_{i=1}^{D} \frac{(u_{1,i} - u_{2,i})^2}{u_{1,i} + u_{2,i}},    (with the convention 0/0 = 0).

The χ2 distance is only a special case of the Hilbertian metrics, defined as

d_{H,\alpha,\beta}(u_1, u_2) = \sum_{i=1}^{D} \frac{2^{1/\beta}\left(u_{1,i}^{\alpha} + u_{2,i}^{\alpha}\right)^{1/\alpha} - 2^{1/\alpha}\left(u_{1,i}^{\beta} + u_{2,i}^{\beta}\right)^{1/\beta}}{2^{1/\alpha} - 2^{1/\beta}},    \alpha \in [1, \infty), \beta \in [1/2, \alpha] or \beta \in [-\infty, -1].

Hilbertian metrics are defined over probability spaces [7] and hence are suitable for data generated from histograms, e.g., the "bag-of-words" model. 
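As a concrete reference point, the distances above are straightforward to compute directly from two rows (a minimal illustration of the definitions, not from the paper; the function names are ours, and the Hilbertian metric assumes strictly positive entries so that a negative β is well-defined):

```python
def d_p(u1, u2, p):
    """l_p distance raised to the p-th power: sum_i |u1_i - u2_i|^p."""
    return sum(abs(a - b) ** p for a, b in zip(u1, u2))

def d_chi2(u1, u2):
    """chi^2 distance, with the convention 0/0 = 0."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(u1, u2) if a + b != 0)

def d_hilbertian(u1, u2, alpha, beta):
    """The (alpha, beta) family of Hilbertian metrics from Section 1.1.
    Entries must be strictly positive when beta < 0."""
    c_a, c_b = 2 ** (1 / alpha), 2 ** (1 / beta)
    return sum(
        (c_b * (a ** alpha + b ** alpha) ** (1 / alpha)
         - c_a * (a ** beta + b ** beta) ** (1 / beta)) / (c_a - c_b)
        for a, b in zip(u1, u2))
```

Note that d_hilbertian(u, u, α, β) is zero for any admissible (α, β), as a metric must be.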
For applications of SVMs in text and images, empirical studies have demonstrated the superiority of Hilbertian metrics over lp distances [3, 7, 9].
More generally, we are interested in any linear summary statistic that can be written in the form

d_g(u_1, u_2) = \sum_{i=1}^{D} g(u_{1,i}, u_{2,i}),    (1)

for a generic function g. An efficient method for computing (1) for any g would be desirable.

1.2 Bottleneck in Distance/Kernel-based Learning Algorithms
A ubiquitous task in learning is to compute, store, update, and retrieve various types of distances [17]. For popular kernel SVM solvers, including the SMO algorithm [16], storing and computing kernels is the major bottleneck [2]: computing kernels is expensive, and, more seriously, storing the full kernel matrix in memory is infeasible when the number of observations n > 10^5.
One popular strategy is to evaluate kernels on the fly [2]. This works well for low-dimensional data (i.e., relatively small D). With high-dimensional data, however, either computing distances on demand becomes too slow or the data matrix A ∈ R^{n×D} itself may not fit in memory.
We should emphasize that this challenge is a universal issue in distance-based methods, not limited to SVMs. For example, popular clustering and multi-dimensional scaling algorithms require frequent access to a (dis)similarity matrix, which is usually distance-based.
In addition to computing and storing distances, another general issue is that, in many real-world applications, entries of the data matrix may be frequently updated, as in data streams [15]. There have been considerable studies on learning from dynamic data, e.g., [5, 1]. 
Since streaming\ndata are often not stored (even on disks), computing and updating distances becomes challenging.\n\n1.3 Contributions and Paper Organization\nConditional Random Sampling (CRS)[12, 13] was originally proposed for ef\ufb01ciently computing\npairwise (l2 and l1) distances, in large-scale static data. The contributions of this paper are:\n\n1. We extend CRS to handle dynamic data. For example, entries of a matrix may vary over\ntime, or the data matrix may not be stored at all. We illustrate that CRS has the one-sketch-\nfor-all property, meaning that the same set of samples/sketches can be used for computing\nany linear summary statistics (1). This is a signi\ufb01cant advantage over many other dimen-\nsion reduction or data stream algorithms.\nFor example, the method of stable random\nprojections (SRP)[8, 10, 14] was designed for estimating the lp norms/distances for a \ufb01xed\np with 0 < p \u2264 2. Recently, a new method named Compressed Counting[11] is able to\nvery ef\ufb01ciently approximate the lp moments of data streams when p \u2248 1.\n\n2. We introduce a modi\ufb01cation to the original CRS and theoretically justify that this mod-\ni\ufb01cation makes CRS rigorous, at least for computing the Hamming norm, an important\napplication in databases. We point out the original CRS was based on a heuristic argument.\n3. We apply CRS for computing Hilbertian metrics[7], a popular family of distances for con-\nstructing kernels in SVM. We focus on a special case, by demonstrating that CRS is effec-\ntive in approximating the \u03c72 distance.\n\nSection 2 reviews the original CRS. Section 3 extends CRS to dynamic/streaming data. Section 4\nfocuses on using CRS to estimate the Hamming norm of a single vector, based on which Section 5\nprovides a generic estimation procedure for CRS, for estimating any linear summary statistics, with\nthe focus on the Hamming distance and the \u03c72 distance. 
Finally, Section 6 concludes the paper.

2 Conditional Random Sampling (CRS), the Original Version
Conditional Random Sampling (CRS) [12, 13] is a local sampling strategy. Since distances are local (i.e., one pair at a time), there is no need to consider the whole matrix at one time.
As the first step, CRS applies a random permutation to the columns of A ∈ R^{n×D}. Figure 1(a) provides an example of a column-permuted data matrix. The next step of CRS is to construct a sketch for each row of the data matrix. A sketch can be viewed as a linked list which stores a small fraction of the non-zero entries from the front of each row. Figure 1(b) demonstrates three sketches corresponding to the three rows of the (column-)permuted data matrix in Figure 1(a).

(a) Permuted data matrix (column IDs 1 to 16):
u1: 5 0 0 1 0 7 0 0 0 8  0 1 0 8  0 2
u2: 0 9 2 0 6 0 0 7 0 5  0 0 4 0  0 13
u3: 0 4 0 0 2 0 0 0 8 0  0 3 0 0 12 0

(b) Sketches:
K1: 1{5} 4{1} 6{7} 10{8}
K2: 2{9} 3{2} 5{6} 8{7}
K3: 2{4} 5{2} 9{8} 12{3}

Figure 1: (a): A data matrix with three rows and D = 16 columns. We assume the columns are already permuted. (b): Sketches are the first ki non-zero entries, ascending by ID (here ki = 4).

In Figure 1, the sketch for row ui is denoted by Ki. Each element of Ki is a tuple "ID{val}," where "ID" is the column ID after the permutation and "{val}" is the value of that entry.
Consider two rows u1 and u2. The last (largest) IDs of sketches K1 and K2 are max(ID(K1)) = 10 and max(ID(K2)) = 8, respectively. Here, "ID(K)" stands for the vector of IDs in the sketch K. It is clear that K1 and K2 contain all information about u1 and u2 from columns 1 to min(10, 8) = 8. Had we directly taken the first Ds = 8 columns from the permuted data matrix, we would obtain the same non-zero entries as in K1 and K2, after excluding elements in K1 and K2 whose IDs exceed Ds = 8; in this example, the element 10{8} in sketch K1 is excluded.
On the other hand, since the columns are already permuted, any Ds columns constitute a random sample of size Ds. 
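The construction is easy to state in code. The following minimal sketch (ours, not from the paper) builds the Figure 1 sketches and recovers the first-Ds-columns sample; the permutation step is omitted because the columns in Figure 1 are assumed already permuted:

```python
def sketch(row, k):
    """First k non-zero entries of an (already column-permuted) row,
    as (ID, value) tuples with 1-based IDs, as in Figure 1."""
    nonzeros = [(i + 1, v) for i, v in enumerate(row) if v != 0]
    return nonzeros[:k]

# The first two (already permuted) rows of Figure 1(a).
u1 = [5, 0, 0, 1, 0, 7, 0, 0, 0, 8, 0, 1, 0, 8, 0, 2]
u2 = [0, 9, 2, 0, 6, 0, 0, 7, 0, 5, 0, 0, 4, 0, 0, 13]

K1, K2 = sketch(u1, 4), sketch(u2, 4)
Ds = min(K1[-1][0], K2[-1][0])        # min(10, 8) = 8 in this example
# Restricting both sketches to IDs <= Ds reproduces the non-zero entries
# of the first Ds columns exactly; 10{8} in K1 is the one excluded element.
sample1 = {i: v for i, v in K1 if i <= Ds}
sample2 = {i: v for i, v in K2 if i <= Ds}
```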
This means that, by only looking at sketches K1 and K2, one can obtain a "random" sample of size Ds. By standard statistical theory, one can easily obtain an unbiased estimate of any linear summary statistic from a random sample. Since Ds is unknown until we look at K1 and K2 together, [13] viewed this as a random sample conditioning on Ds.
Note that Ds varies from pair to pair. When considering the rows u1 and u3, the sketches K1 and K3 suggest Ds = min(max(ID(K1)), max(ID(K3))) = min(10, 12) = 10.

In this study, we point out that, although the "conditioning" argument appears intuitive, it is only a (good) heuristic. There are two ways to understand why this argument is not strictly correct.
Consider a true random sample of size Ds, directly obtained from the first Ds columns of the permuted data matrix. Assuming sparse data, the element in the Ds-th column is most likely zero. In the "conditional random sample" obtained from CRS, however, at least one element in the Ds-th column is non-zero. Thus, the estimates of the original CRS are, strictly speaking, biased.
For a more obvious example, consider two rows, each with exactly one non-zero entry, located in the same column. The original CRS cannot obtain an unbiased estimate unless Ds = D.

3 CRS for Dynamic Data and Introduction to Stable Random Projections
The original CRS was proposed for static data. In reality, the "data matrix" may be frequently updated. When data arrive in a streaming fashion, they often will not be stored, even on disks [15]. Thus, a one-pass algorithm is needed to compute and update distances for training. Learning with dynamic (or incremental) data has become an active topic of research, e.g., [5, 1].

3.1 Dynamic/Streaming Data
We first consider a single data vector u of length D (viewed as one row in the data matrix). 
At each time t, an input stream element st = (it, It), it ∈ [1, D], updates u (denoted by ut) by

u_t[i_t] = H(u_{t-1}[i_t], I_t),

where It is the increment/decrement at time t and H is an updating function. The so-called Turnstile model [15] is extremely popular; it assumes a linear updating function H, i.e.,

u_t[i_t] = u_{t-1}[i_t] + I_t.    (2)

For example, ut[it] can represent the number of orders a "user" i has placed up to time t, where a user may be identified by his/her IP address (i.e., i ∈ [1, D = 2^64]); It is the number of orders the user i places (It > 0) or cancels (It < 0) at time t.
In terms of the data matrix A ∈ R^{n×D}, we can view it as a collection of n data streams.

3.2 CRS for Streaming Data
For each stream ut, we maintain a sketch K with length (i.e., capacity) k. Each entry of K is a tuple "ID{val}." Initially, all entries are empty. The procedure for sketch construction works as follows:

1. Generate a random permutation π : [1, D] → [1, D].
2. For each st = (it, It), if π[it] > max(ID(K)) and the capacity of K is reached, do nothing.
3. Suppose π[it] ≤ max(ID(K)) or the capacity of K is not reached. If an entry with ID = π[it] does not exist, insert a new entry. Otherwise, update that entry according to H.¹
4. Apply the procedure to each data stream using the same random permutation mapping π.

Once sketches are constructed, the estimation procedure is the same regardless of whether the original data are dynamic or static. 
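The four steps above can be sketched in a few lines (a toy illustration, not the authors' implementation; in particular, when a new entry with a small permuted ID arrives at full capacity we evict the entry with the largest ID so the sketch keeps the k smallest, which is our reading of step 3; the paper leaves such details to the application):

```python
import random

class StreamingSketch:
    """A capacity-k CRS sketch maintained over a Turnstile stream."""

    def __init__(self, D, k, seed=0):
        rng = random.Random(seed)      # the same seed is shared across rows
        self.pi = list(range(D))
        rng.shuffle(self.pi)           # step 1: random permutation
        self.k = k
        self.entries = {}              # permuted ID -> accumulated value

    def update(self, i, inc):
        pid = self.pi[i]
        full = len(self.entries) >= self.k
        if full and pid > max(self.entries):
            return                     # step 2: beyond the sketch range
        # step 3: insert a new entry, or apply the linear update H of eq. (2)
        self.entries[pid] = self.entries.get(pid, 0) + inc
        if len(self.entries) > self.k:
            self.entries.pop(max(self.entries))   # keep the k smallest IDs
```

With insertion-only streams, the resulting sketch coincides with the static sketch of the aggregated vector, since once the sketch is full its largest ID can only decrease.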
Thus, we will use static data to verify some estimators of CRS.

3.3 (Symmetric) Stable Random Projections (SRP)
Since the method of (symmetric) stable random projections (SRP) [8, 10] has become a standard algorithm for data stream computations, we very briefly introduce SRP for the sake of comparison.
The procedure of SRP is to multiply the data matrix A ∈ R^{n×D} by a random matrix R ∈ R^{D×k}, whose entries are i.i.d. samples from a standard (symmetric) stable distribution S(p, 1), 0 < p ≤ 2. Consider two rows, u1 and u2, in A. By properties of stable distributions, the projected vectors v1 = R^T u1 and v2 = R^T u2 have i.i.d. stable entries, i.e., for j = 1 to k,

v_{1,j} \sim S\left(p, \; F_p = \sum_{i=1}^{D} |u_{1,i}|^p\right),    v_{1,j} - v_{2,j} \sim S\left(p, \; d_p = \sum_{i=1}^{D} |u_{1,i} - u_{2,i}|^p\right).

Thus, one can estimate an individual norm or distance from k samples. SRP is applicable to dynamic/streaming data, provided the data follow the Turnstile model in (2). Because the Turnstile model is linear and matrix multiplication is also linear, one can conduct A × R incrementally.
Compared with Conditional Random Sampling (CRS), SRP has an elegant mathematical derivation, with various interesting estimators and rigorous sample complexity bounds; i.e., k can be pre-determined in a fully rigorous fashion. The accuracy of SRP is not affected by heavy-tailed data.
CRS, however, exhibits certain advantages over SRP:

• CRS is "one-sketch-for-all." The same sketch of CRS can approximate any linear summary statistic (1). SRP is limited to the lp norm and distance with 0 < p ≤ 2. 
One has to conduct SRP 10 times (and store 10 sets of sketches) if 10 different p values are needed.
• CRS allows "term-weighting" in dynamic data. In machine learning, distances are often computed on weighted data (e.g., √u1,i or log(1 + u1,i)), which is critical for good performance. For static data, one can first term-weight the data before applying SRP. For dynamic data, however, there is no way to trace back the original data after projections.
• CRS is not restricted to the Turnstile model.
• CRS is not necessarily less accurate, especially for sparse or binary data.

4 Approximating Hamming Norms in Dynamic Data
Counting the Hamming norm (i.e., the number of non-zeros) in an exceptionally long, dynamic vector has important applications [4, 15]. For example, if a vector ut records the numbers of items users have ordered, one meaningful question to ask may be "how many distinct users are there?"
The purpose of this section is three-fold. (1) This is the case in which we can rigorously analyze CRS and propose a truly unbiased estimator. (2) The analysis brings better insights and more reasonable estimators for pairs of data vectors. (3) In this case, despite its simplicity, CRS theoretically achieves accuracy similar to that of stable random projections (SRP); empirically, CRS (slightly) outperforms SRP.

¹We leave it to particular applications to decide whether an entry updated to zero should be discarded or kept in the sketch. In reality, this case does not occur often. For example, the most important type of data streams [15] is "insertion-only," meaning that the values never decrease.

4.1 The Proposed (Unbiased) Estimator and Variance
Suppose we have obtained the sketch K. For example, consider the first row in Figure 1: D = 16, k = 4, and the number of non-zeros f = 7. 
Lemma 1 (whose proof is omitted) proposes an unbiased estimator of f, denoted by f̂, and a biased estimator based on the maximum likelihood, f̂_mle.

Lemma 1

\hat{f} = \frac{D(k-1)}{Z-1},    Z = \max(\mathrm{ID}(K)),    E(\hat{f}) = f,    D \geq f \geq k > 1,

V_L^f < \mathrm{Var}(\hat{f}) < V_U^f,

where

V_U^f = \frac{D}{D-1}\left(\frac{f^2 - f}{k-2} - \frac{(D-f)f}{D-1}\right)    (k > 2),

V_L^f = V_U^f - \frac{(k-1)f(f-1)(f-2)D}{(k-2)(k-3)(D-1)(D-2)}    (k > 3).

The maximum likelihood estimator is \hat{f}_{mle} = \frac{k(D+1)}{Z} - 1.

Assume f/D is small and k/f is also small; then Var(f̂)/f² ≈ 1/k, independent of the data. Note that, since Var(f̂)/f² ≈ 1/k, the estimator f̂ actually has a worst-case complexity bound similar to that of SRP [10], although the precise constant is not easy to obtain.

4.2 The Approximation Using the Conditioning Argument
Interestingly, this estimator, f̂ = D(k−1)/(max(ID(K))−1), appears to be the estimator for a hypergeometric random sample of size Ds = max(ID(K)) − 1. That is, suppose we randomly pick Ds balls (without replacement) from a pool of D balls and observe that k' of them are red; then a natural (and unbiased) estimator of the total number of red balls would be (D/Ds) k'; here k' = k − 1.
This seems to imply that the "conditioning" argument of the original CRS in Section 2 becomes "correct" after a simple modification: use a sample size equal to the original Ds minus 1. 
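The unbiasedness claim of Lemma 1 admits a quick Monte Carlo check (our own illustration; a fresh random permutation per trial is simulated by re-sampling the positions of the f non-zeros uniformly without replacement):

```python
import random

def f_hat(D, k, nonzero_ids):
    """The estimator of Lemma 1: f_hat = D(k-1)/(Z-1), where Z is the largest
    (1-based) ID in the size-k sketch, i.e., the k-th smallest non-zero ID."""
    Z = sorted(nonzero_ids)[k - 1]
    return D * (k - 1) / (Z - 1)

rng = random.Random(0)
D, f, k, trials = 1000, 100, 10, 20000
mean = sum(
    f_hat(D, k, rng.sample(range(1, D + 1), f)) for _ in range(trials)
) / trials
# mean should be very close to f = 100, with spread roughly f / sqrt(k)
```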
While this is what we will recommend as the modified CRS, it is only a close approximation.
Let f̂_app = f̂, where we now treat f̂_app as the estimator from a hypergeometric sample of size Ds = Z − 1; then E(f̂_app | Ds) = f and

\mathrm{Var}\left(\hat{f}_{app}\right) = E\left(\mathrm{Var}\left(\hat{f}_{app} \mid D_s\right)\right) = E\left(\frac{D^2}{D_s^2} \, D_s \frac{f}{D}\left(1 - \frac{f}{D}\right)\frac{D - D_s}{D - 1}\right) = \frac{Df}{D-1}\left(1 - \frac{f}{D}\right) E\left(\frac{D}{D_s} - 1\right) = \frac{Df}{D-1}\left(1 - \frac{f}{D}\right)\left(\frac{f}{k-1} - 1\right),    (3)

using E(D/(Z − 1)) = f/(k − 1), which follows from the unbiasedness of f̂ in Lemma 1.

4.3 Comparisons with Stable Random Projections (SRP)
Based on the observation that f = lim_{p→0+} Σ_{i=1}^{D} |u_i|^p, [4] proposed using SRP to approximate the lp norm with very small p, as an approximation to f. For p → 0+, the recent work on SRP [10] proposed the harmonic mean estimator. Recall that after projection, v = R^T u ∈ R^k consists of i.i.d. stable samples with scale parameter F_p = Σ_{i=1}^{D} |u_i|^p. The harmonic mean estimator is

\hat{F}_{p,hm} = \frac{-\frac{2}{\pi}\Gamma(-p)\sin\left(\frac{\pi}{2}p\right)}{\sum_{j=1}^{k}|v_j|^{-p}}\left(k - \left(\frac{-\pi\Gamma(-2p)\sin(\pi p)}{\left[\Gamma(-p)\sin\left(\frac{\pi}{2}p\right)\right]^2} - 1\right)\right),

\mathrm{Var}\left(\hat{F}_{p,hm}\right) = \frac{F_p^2}{k}\left(\frac{-\pi\Gamma(-2p)\sin(\pi p)}{\left[\Gamma(-p)\sin\left(\frac{\pi}{2}p\right)\right]^2} - 1\right) + O\left(\frac{1}{k^2}\right),

where

\lim_{p\to 0+} -\frac{2}{\pi}\Gamma(-p)\sin\left(\frac{\pi}{2}p\right) = 1,    \lim_{p\to 0+} \left(\frac{-\pi\Gamma(-2p)\sin(\pi p)}{\left[\Gamma(-p)\sin\left(\frac{\pi}{2}p\right)\right]^2} - 1\right) = 1.

Denote this estimator by f̂_srp (using p as small as possible); its variance is Var(f̂_srp) ≈ f²/k, roughly equivalent to the variance of f̂, the unbiased estimator for CRS.
We empirically compared CRS with SRP. Four word vectors were selected; the entries of each vector record the number of occurrences of the word in D = 2^16 Web pages. The data are very heavy-tailed. The percentage of zero elements (i.e., sparsity) varies from 58% to 95%.
Figure 2 presents the comparisons. (1): It is possible that CRS may outperform SRP non-negligibly. (2): The variance (3) based on the approximate "conditioning" argument is very accurate. (3): The unbiased estimator f̂ is more accurate than f̂_mle; the latter actually uses one more sample.

Figure 2: Comparing CRS with SRP for approximating Hamming norms in Web crawl data (four word vectors: THIS, HAVE, ADDRESS, CUSTOMER), using the mean square error (MSE) normalized by f². "CRS" and "CRS+mle" respectively correspond to f̂ and f̂_mle, derived in Lemma 1. 
\u201dSRP\u201d corresponds to the\nharmonic mean estimator of SRP using p = 0.04. \u201c1/k\u201d is the theoretical asymptotic variance of\nboth CRS and SRP. The curve labeled \u201dApprox. Var\u201d is the approximate variance in (3).\n\n5 The Modi\ufb01ed CRS Estimation Procedure\nThe modi\ufb01ed CRS estimation procedure is based on the theoretical analysis for using CRS to ap-\nproximate Hamming norms. Suppose we are interested in the distance between rows u1 and u2 and\nwe have access to sketches K1 and K2. Our suggested \u201cequivalent\u201d sample size Ds would be\n\nDs = min{Z1 \u2212 1, Z2 \u2212 1},\n\nZ1 = max(ID(K1), Z2 = max(ID(K2).\n\n(4)\n\nWe should not include elements in K1 and K2 whose IDs are larger than Ds\nConsider K1 and K2 in Figure 1, the modi\ufb01ed CRS adopts Ds = min(10\u2212 1, 8\u2212 1) = min(9, 7) =\n7. Removing 10{8} from K1 and 8{7} from K2, we obtain a sample for u1 and u2:\n\u02dcu2,2 = 9, \u02dcu2,3 = 2, \u02dcu2,5 = 6.\n\n\u02dcu1,1 = 5, \u02dcu1,4 = 1, \u02dcu1,6 = 7,\n\nAll other sample entries are zero: \u02dcu1,2 = \u02dcu1,3 = \u02dcu1,5 = \u02dcu1,7 = 0, \u02dcu2,1 = \u02dcu2,4 = \u02dcu2,6 = \u02dcu2,7 = 0.\n\n(cid:80)D\ni=1 g (u1,i, u2,i), and assume that, conditioning on Ds, the sample {\u02dcu1,j, \u02dcu2,j}Ds\n\n5.1 A Generic Estimator and Approximate Variance\nRigorous theoretical analysis on one pair of sketches is dif\ufb01cult. We resort to the approximate\n\u201cconditioning\u201d argument using the modi\ufb01ed Ds in (4). We consider a generic distance dg(u1, u2) =\nj=1 is exactly\nequivalent to the sample from randomly selected Ds columns without replacement. 
Under this assumption, an "unbiased" estimator of d_g(u_1, u_2) (and two special cases) would be

\hat{d}_g(u_1, u_2) = \frac{D}{D_s}\sum_{j=1}^{D_s} g(\tilde{u}_{1,j}, \tilde{u}_{2,j}),    \hat{d}_p = \frac{D}{D_s}\sum_{j=1}^{D_s} |\tilde{u}_{1,j} - \tilde{u}_{2,j}|^p,    \hat{d}_{\chi^2} = \frac{D}{D_s}\sum_{j=1}^{D_s} \frac{(\tilde{u}_{1,j} - \tilde{u}_{2,j})^2}{\tilde{u}_{1,j} + \tilde{u}_{2,j}}.

A generic (approximate) variance formula can be obtained as follows. Conditioning on Ds,

\mathrm{Var}\left(\hat{d}_g(u_1, u_2) \mid D_s\right) = \frac{D^2}{D_s^2}\, D_s \frac{D - D_s}{D - 1}\left(\frac{1}{D}\sum_{i=1}^{D} g^2(u_{1,i}, u_{2,i}) - \left(\frac{1}{D}\sum_{i=1}^{D} g(u_{1,i}, u_{2,i})\right)^2\right) = \frac{D}{D-1}\left(\frac{D}{D_s} - 1\right)\left(d_{g^2} - \frac{d_g^2}{D}\right),

where d_{g^2} = Σ_{i=1}^{D} g²(u_{1,i}, u_{2,i}). Therefore,

\mathrm{Var}\left(\hat{d}_g(u_1, u_2)\right) \approx \frac{D}{D-1}\left(E\left(\max\left\{\frac{D}{Z_1 - 1}, \frac{D}{Z_2 - 1}\right\}\right) - 1\right)\left(d_{g^2} - \frac{d_g^2}{D}\right) \approx \frac{D}{D-1}\left(\max\left\{\frac{f_1}{k_1 - 1}, \frac{f_2}{k_2 - 1}\right\} - 1\right)\left(d_{g^2} - \frac{d_g^2}{D}\right).    (5)

Here, k1 and k2 are the sketch sizes of K1 and K2, respectively, and f1 and f2 are the numbers of non-zeros in the original data vectors u1 and u2, respectively. We have used the results in Lemma 1 and a common statistical approximation: E(max(x, y)) ≈ max(E(x), E(y)).

From (5), we know the variance is affected by two factors. If the data are very sparse, i.e., max{f1/(k1 − 1), f2/(k2 − 1)} is small, then the variance also tends to be small. If the data are heavy-tailed, i.e., D d_{g²} ≫ d_g², then the variance tends to be large. Text data are often highly sparse and heavy-tailed, but machine learning applications often use weighted data (e.g., after taking logarithms or binary quantization). This is why we expect CRS to be successful in real applications, although in general it does not have worst-case performance guarantees.
The next two subsections apply CRS to estimating the Hamming distance and the χ2 distance. Empirical studies [3, 7, 9] have demonstrated that, in text and image data, using the Hamming distance or the χ2 distance for kernel SVMs achieves good performance.

5.2 Estimating the Hamming Distance
Following the definition of the Hamming distance in [4], h(u_1, u_2) = Σ_{i=1}^{D} 1{u_{1,i} − u_{2,i} ≠ 0}, we estimate h using the modified CRS procedure, denoted by ĥ. The approximate variance (5) becomes

\mathrm{Var}(\hat{h}) \approx \frac{D}{D-1}\left(\max\left\{\frac{f_1}{k_1 - 1}, \frac{f_2}{k_2 - 1}\right\} - 1\right)\left(h - \frac{h^2}{D}\right).    (6)

We also apply SRP using small p and its most accurate harmonic mean estimator [10]. 
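The modified estimation procedure is compact enough to state in full (a minimal sketch, ours; it reuses the Figure 1 sketches and the Ds of eq. (4), with the generic g supplied as a function):

```python
def crs_estimate(K1, K2, D, g):
    """Modified CRS estimate of d_g = sum_i g(u1_i, u2_i) from two sketches.
    K1, K2 are lists of (ID, value) tuples with 1-based IDs, sorted by ID."""
    Ds = min(K1[-1][0] - 1, K2[-1][0] - 1)   # eq. (4)
    s1 = {i: v for i, v in K1 if i <= Ds}    # drop entries with ID > Ds
    s2 = {i: v for i, v in K2 if i <= Ds}
    total = sum(g(s1.get(j, 0), s2.get(j, 0)) for j in range(1, Ds + 1))
    return D / Ds * total                    # scale the sample sum up to D

hamming = lambda a, b: 1 if a != b else 0
chi2 = lambda a, b: (a - b) ** 2 / (a + b) if a + b != 0 else 0

# Sketches of Figure 1 (D = 16): Ds = min(9, 7) = 7.
K1 = [(1, 5), (4, 1), (6, 7), (10, 8)]
K2 = [(2, 9), (3, 2), (5, 6), (8, 7)]
h_hat = crs_estimate(K1, K2, 16, hamming)
```

Any other g, e.g., an lp kernel on term-weighted values, plugs into the same sketches, which is the "one-sketch-for-all" property in miniature.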
The empirical comparisons in Figure 3 verify two points. (1): CRS can be considerably more accurate than SRP for estimating Hamming distances as defined in [4]. (2): The approximate variance formula (6) is very accurate.

Figure 3: Approximating Hamming distances (h) using two pairs of words (THIS-HAVE and ADDRESS-CUSTOMER). The results are presented in terms of the MSE normalized by h². The curves labeled "Approx. Var" correspond to the approximate variance of CRS in (6).

In this example, the seemingly impressive improvement of CRS over SRP is actually due to our use of the definition of the Hamming distance in [4]. An alternative definition of the Hamming distance is h(u_1, u_2) = Σ_{i=1}^{D} [1{u_{1,i} ≠ 0 and u_{2,i} = 0} + 1{u_{1,i} = 0 and u_{2,i} ≠ 0}], which is basically the lp distance after binary term-weighting. As we have commented, when using SRP on dynamic data, term-weighting is not possible; thus we only experimented with the definition in [4].

5.3 Estimating the χ2 Distance
We apply CRS to estimating the χ2 distance between u1 and u2: d_{χ2}(u_1, u_2) = Σ_{i=1}^{D} (u_{1,i} − u_{2,i})²/(u_{1,i} + u_{2,i}). According to (5), the estimation variance should be approximately

\mathrm{Var}\left(\hat{d}_{\chi^2}\right) \approx \frac{D}{D-1}\left(\max\left\{\frac{f_1}{k_1 - 1}, \frac{f_2}{k_2 - 1}\right\} - 1\right)\left(\sum_{i=1}^{D} \frac{(u_{1,i} - u_{2,i})^4}{(u_{1,i} + u_{2,i})^2} - \frac{d_{\chi^2}^2}{D}\right),    (7)

which is affected only by the second moments, because Σ_{i=1}^{D} (u_{1,i} − u_{2,i})⁴/(u_{1,i} + u_{2,i})² ≤ Σ_{i=1}^{D} (u_{1,i} + u_{2,i})².
There are proven negative results [6] showing that, in the worst case, no efficient algorithms exist for approximating χ2 distances. CRS does not provide any worst-case guarantees; its performance relies on the assumption that, in machine learning applications, the data are often reasonably sparse and the second moments are reasonably bounded.
Figure 4 presents an empirical study, using the same four words plus the UCI Dexter data. Even though the four words are fairly common (i.e., not very sparse) and heavy-tailed (no term-weighting was applied), CRS still achieved good performance in terms of the normalized MSE (e.g., ≤ 0.1) at reasonably small k. And again, the approximate variance formula (7) is accurate.
Results on the Dexter data set (which is more realistic for machine learning) are encouraging: only about k = 10 is needed to achieve small MSE.

Figure 4: Left two panels: CRS for approximating the χ2 distance using two pairs of words (THIS-HAVE and ADDRESS-CUSTOMER; D = 2^16). The curves report the normalized MSE and the approximate variance in (7). Right-most panel: the Dexter data, D = 20000, with 300 data points. We estimate all pairwise (i.e., 44850 pairs) χ2 distances using CRS. The three curves report the quantiles of the normalized MSEs.

6 Conclusion
The ubiquitous phenomenon of massive, high-dimensional, and possibly dynamic data has brought serious challenges. It is highly desirable to achieve compact data representations and to efficiently compute and retrieve summary statistics, in particular various types of distances. Conditional Random Sampling (CRS) provides a simple and effective mechanism for achieving this goal.
Compared with other "mainstream" sketching algorithms such as stable random projections (SRP), the major advantage of CRS is that it is "one-sketch-for-all," meaning that the same set of sketches can approximate any linear summary statistics. 
This would be very convenient in practice.
The major disadvantage of CRS is that it relies heavily on data sparsity, and on the assumption that, in machine learning applications, the "worst-case" data distributions are often avoided (e.g., through term-weighting). Also, the theoretical analysis is difficult, even though the algorithm itself is simple.
Originally based on a heuristic argument, the preliminary version of CRS was proposed as a tool for computing pairwise l2 and l1 distances in static data. This paper provides a partial theoretical justification of CRS, along with various modifications, to make the algorithm more rigorous and to extend CRS to handling dynamic/streaming data. We demonstrate, empirically and theoretically, the effectiveness of CRS in approximating the Hamming norms/distances and the χ² distances.
Acknowledgement
Ping Li is partially supported by grant DMS-0808864 from the National Science Foundation, and a gift from Microsoft. Trevor Hastie was partially supported by grant DMS-0505676 from the National Science Foundation, and grant 2R01 CA 72028-07 from the National Institutes of Health.
References
[1] Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. On demand classification of data streams. In KDD, pages 503–508, 2004.
[2] Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors. Large-Scale Kernel Machines. The MIT Press, 2007.
[3] Olivier Chapelle, Patrick Haffner, and Vladimir N. Vapnik. Support vector machines for histogram-based image classification. IEEE Trans. Neural Networks, 10(5):1055–1064, 1999.
[4] Graham Cormode, Mayur Datar, Piotr Indyk, and S. Muthukrishnan. Comparing data streams using hamming norms (how to zero in). IEEE Transactions on Knowledge and Data Engineering, 15(3):529–540, 2003.
[5] Carlotta Domeniconi and Dimitrios Gunopulos. Incremental support vector machine construction.
In ICDM, pages 589–592, 2001.
[6] Sudipto Guha, Piotr Indyk, and Andrew McGregor. Sketching information divergence. In COLT, pages 424–438, 2007.
[7] M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136–143, 2005.
[8] Piotr Indyk. Stable distributions, pseudorandom generators, embeddings, and data stream computation. Journal of the ACM, 53(3):307–323, 2006.
[9] Yugang Jiang, Chongwah Ngo, and Jun Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494–501, 2007.
[10] Ping Li. Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections. In SODA, 2008.
[11] Ping Li. Compressed Counting. In SODA, 2009.
[12] Ping Li and Kenneth W. Church. A sketch algorithm for estimating two-way and multi-way associations. Computational Linguistics, 33(3):305–354, 2007. Preliminary results appeared in HLT/EMNLP, 2005.
[13] Ping Li, Kenneth W. Church, and Trevor J. Hastie. Conditional random sampling: A sketch-based sampling technique for sparse data. In NIPS, pages 873–880, 2007.
[14] Ping Li. Computationally efficient estimators for dimension reductions using stable random projections. In ICDM, 2008.
[15] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005.
[16] John C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In NIPS, pages 557–563, 1998.
[17] Bernhard Schölkopf and Alexander J. Smola. Learning with Kernels. The MIT Press, 2002.