Part of Advances in Neural Information Processing Systems 19 (NIPS 2006)
Ping Li, Kenneth Church, Trevor Hastie
We1 develop Conditional Random Sampling (CRS), a technique particularly suit- able for sparse data. In large-scale applications, the data are often highly sparse. CRS combines sketching and sampling in that it converts sketches of the data into conditional random samples online in the estimation stage, with the sample size determined retrospectively. This paper focuses on approximating pairwise l2 and l1 distances and comparing CRS with random projections. For boolean (0/1) data, CRS is provably better than random projections. We show using real-world data that CRS often outperforms random projections. This technique can be applied in learning, data mining, information retrieval, and database query optimizations.