{"title": "Learning from Infinite Data in Finite Time", "book": "Advances in Neural Information Processing Systems", "page_first": 673, "page_last": 680, "abstract": null, "full_text": "Learning from Infinite Data \n\nin Finite Time \n\nPedro Domingos \n\nGeoff H ulten \n\nDepartment of Computer Science and Engineering \n\nUniversity of Washington \n\nSeattle, WA 98185-2350, U.S.A. \n\n{pedrod, ghulten} @cs.washington.edu \n\nAbstract \n\nWe propose the following general method for scaling learning \nalgorithms to arbitrarily large data sets. Consider the model \nMii learned by the algorithm using ni examples in step i (ii = \n(nl , ... ,nm)) , and the model Moo that would be learned using in(cid:173)\nfinite examples. Upper-bound the loss L(Mii' M oo ) between them \nas a function of ii, and then minimize the algorithm's time com(cid:173)\nplexity f(ii) subject to the constraint that L(Moo , Mii ) be at most \nf with probability at most 8. We apply this method to the EM \nalgorithm for mixtures of Gaussians. Preliminary experiments on \na series of large data sets provide evidence of the potential of this \napproach. \n\n1 An Approach to Large-Scale Learning \n\nLarge data sets make it possible to reliably learn complex models. On the other \nhand, they require large computational resources to learn from. While in the past \nthe factor limiting the quality of learnable models was typically the quantity of data \navailable, in many domains today data is super-abundant, and the bottleneck is t he \ntime required to process it. Many algorithms for learning on large data sets have \nbeen proposed, but in order to achieve scalability they generally compromise the \nquality of the results to an unspecified degree. We believe this unsatisfactory state \nof affairs is avoidable, and in this paper we propose a general method for scaling \nlearning algorithms to arbitrarily large databases without compromising the quality \nof the results. 
Our method makes it possible to learn in finite time a model that is essentially indistinguishable from the one that would be obtained using infinite data. \n\nConsider the simplest possible learning problem: estimating the mean of a random variable $x$. If we have a very large number of samples, most of them are probably superfluous. If we are willing to accept a probability of at most $\delta$ that the error exceeds $\epsilon$, Hoeffding bounds [4] (for example) tell us that, irrespective of the distribution of $x$, only $n = \frac{1}{2}(R/\epsilon)^2 \ln(2/\delta)$ samples are needed, where $R$ is $x$'s range. We propose to extend this type of reasoning beyond learning single parameters, to learning complex models. The approach we propose consists of three steps: \n\n1. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm. \n\n2. Derive an upper bound on the time complexity of the learning algorithm, as a function of the number of samples used in each step. \n\n3. Minimize the time bound (via the number of samples used in each step) subject to target limits on the loss. \n\nIn this paper we exemplify this approach using the EM algorithm for mixtures of Gaussians. In earlier papers we applied it (or an earlier version of it) to decision tree induction [2] and k-means clustering [3]. Despite its wide use, EM has long been criticized for its inefficiency (see discussion following Dempster et al. [1]), and has been considered unsuitable for large data sets [8]. Many approaches to speeding it up have been proposed (see Thiesson et al. [6] for a survey). Our method can be seen as an extension of progressive sampling approaches like Meek et al. 
[5]: rather than minimize the total number of samples needed by the algorithm, we minimize the number needed by each step, leading to potentially much greater savings; and we obtain guarantees that do not depend on unverifiable extrapolations of learning curves. \n\n2 A Loss Bound for EM \n\nIn a mixture of Gaussians model, each $D$-dimensional data point $x_j$ is assumed to have been independently generated by the following process: 1) randomly choose a mixture component $k$; 2) randomly generate a point from it according to a Gaussian distribution with mean $\mu_k$ and covariance matrix $\Sigma_k$. In this paper we will restrict ourselves to the case where the number $K$ of mixture components and the probability of selection $P(\mu_k)$ and covariance matrix for each component are known. Given a training set $S = \{x_1, \ldots, x_N\}$, the learning goal is then to find the maximum-likelihood estimates of the means $\mu_k$. The EM algorithm [1] accomplishes this by, starting from some set of initial means, alternating until convergence between estimating the probability $p(\mu_k|x_j)$ that each point was generated by each Gaussian (the E step), and computing the ML estimates of the means $\hat{\mu}_k = \sum_{j=1}^{N} w_{jk} x_j / \sum_{j=1}^{N} w_{jk}$ (the M step), where $w_{jk} = p(\mu_k|x_j)$ from the previous E step. In the basic EM algorithm, all $N$ examples in the training set are used in each iteration. The goal in this paper is to speed up EM by using only $n_i < N$ examples in the $i$th iteration, while guaranteeing that the means produced by the algorithm do not differ significantly from those that would be obtained with arbitrarily large $N$. \n\nLet $M_{\vec{n}} = (\hat{\mu}_1, \ldots, \hat{\mu}_K)$ be the vector of mean estimates obtained by the finite-data EM algorithm (i.e., using $n_i$ examples in iteration $i$), and let $M_\infty = (\mu_1, \ldots, \mu_K)$ be the vector obtained using infinite examples at each iteration. 
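For concreteness, the E and M steps just described, for this restricted setting (known $K$, known $P(\mu_k)$, known covariances), can be sketched as follows. This is our own illustrative code, not the authors' implementation; the spherical-covariance simplification and all parameter names are our assumptions.

```python
import numpy as np

def em_known_cov(X, mu0, priors, sigma, n_iters=100, tol=1e-6):
    """EM for a mixture of spherical Gaussians where only the means are
    unknown (K, the selection probabilities P(mu_k), and the common
    coordinate standard deviation sigma are given).
    X: (N, D) data; mu0: (K, D) initial means; priors: (K,) P(mu_k)."""
    mu = mu0.copy()
    for _ in range(n_iters):
        # E step: w[j, k] = p(mu_k | x_j) via Bayes' rule on Gaussian densities.
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        log_p = -sq / (2.0 * sigma ** 2) + np.log(priors)          # unnormalized
        log_p -= log_p.max(axis=1, keepdims=True)                  # numerical stability
        w = np.exp(log_p)
        w /= w.sum(axis=1, keepdims=True)
        # M step: each mean is the weighted average of all examples.
        new_mu = (w.T @ X) / w.sum(axis=0)[:, None]
        if ((new_mu - mu) ** 2).sum() <= tol:
            return new_mu
        mu = new_mu
    return mu
```

The paper's modification replaces the full pass over all $N$ examples with a subsample of $n_i$ examples in iteration $i$.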
In order to proceed, we need to quantify the difference between $M_{\vec{n}}$ and $M_\infty$. A natural choice is the sum of the squared errors between corresponding means, which is proportional to the negative log-likelihood of the finite-data means given the infinite-data ones: \n\n$$L(M_{\vec{n}}, M_\infty) = \sum_{k=1}^{K} \|\hat{\mu}_k - \mu_k\|^2 = \sum_{k=1}^{K} \sum_{d=1}^{D} |\hat{\mu}_{kd} - \mu_{kd}|^2 \quad (1)$$ \n\nwhere $\hat{\mu}_{kd}$ is the $d$th coordinate of $\hat{\mu}_k$, and similarly for $\mu_{kd}$. \n\nAfter any given iteration of EM, $|\hat{\mu}_{kd} - \mu_{kd}|$ has two components. One, which we call the sampling error, derives from the fact that $\hat{\mu}_{kd}$ is estimated from a finite sample, while $\mu_{kd}$ is estimated from an infinite one. The other component, which we call the weighting error, derives from the fact that, due to sampling errors in previous iterations, the weights $w_{jk}$ used to compute the two estimates may differ. Let $\mu_{kdi}$ be the infinite-data estimate of the $d$th coordinate of the $k$th mean produced in iteration $i$, $\hat{\mu}_{kdi}$ be the corresponding finite-data estimate, and $\tilde{\mu}_{kdi}$ be the estimate that would be obtained if there were no weighting errors in that iteration. Then the sampling error at iteration $i$ is $|\tilde{\mu}_{kdi} - \mu_{kdi}|$, the weighting error is $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|$, and the total error is $|\hat{\mu}_{kdi} - \mu_{kdi}| \le |\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}| + |\tilde{\mu}_{kdi} - \mu_{kdi}|$. \n\nGiven bounds on the total error of each coordinate of each mean after iteration $i-1$, we can derive a bound on the weighting error after iteration $i$ as follows. Bounds on $\mu_{kd,i-1}$ for each $d$ imply bounds on $p(x_j|\mu_{ki})$ for each example $x_j$, obtained by substituting the maximum and minimum allowed distances between $x_{jd}$ and $\mu_{kd,i-1}$ into the expression of the Gaussian distribution. Let $p^+_{jki}$ be the upper bound on $p(x_j|\mu_{ki})$, and $p^-_{jki}$ be the lower bound. Then the weight of example $x_j$ in mean $\mu_{ki}$ can be bounded from below by $w^-_{jki} = p^-_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^+_{jk'i} P(\mu_{k'})$, and from above by $w^+_{jki} = \min\{p^+_{jki} P(\mu_k) / \sum_{k'=1}^{K} p^-_{jk'i} P(\mu_{k'}), 1\}$. 
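This weight-bracketing step can be sketched as follows (our own illustrative code; the vectorized form and argument names are our assumptions, following the $w^-_{jki}$ and $w^+_{jki}$ formulas above):

```python
import numpy as np

def weight_bounds(p_lo, p_hi, priors):
    """Bracket each posterior weight w_jk = p(mu_k | x_j).
    p_lo, p_hi: (N, K) lower/upper bounds on p(x_j | mu_k);
    priors: (K,) known selection probabilities P(mu_k).
    Lower bound: shrink the numerator while inflating every term of the
    denominator; upper bound: the reverse, clipped at 1."""
    num_lo = p_lo * priors
    num_hi = p_hi * priors
    w_lo = num_lo / num_hi.sum(axis=1, keepdims=True)
    w_hi = np.minimum(num_hi / num_lo.sum(axis=1, keepdims=True), 1.0)
    return w_lo, w_hi
```

When the density bounds coincide ($p^- = p^+$), both weight bounds collapse to the exact posterior, as expected.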
Let $w^{(+)}_{jki} = w^+_{jki}$ if $x_{jd} \ge 0$ and $w^{(+)}_{jki} = w^-_{jki}$ otherwise, and let $w^{(-)}_{jki} = w^-_{jki}$ if $x_{jd} \ge 0$ and $w^{(-)}_{jki} = w^+_{jki}$ otherwise. Then \n\n$$|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}| \le \max\left\{ \left|\tilde{\mu}_{kdi} - \frac{\sum_{j=1}^{n_i} w^{(+)}_{jki} x_{jd}}{\sum_{j=1}^{n_i} w^-_{jki}}\right|, \left|\tilde{\mu}_{kdi} - \frac{\sum_{j=1}^{n_i} w^{(-)}_{jki} x_{jd}}{\sum_{j=1}^{n_i} w^+_{jki}}\right| \right\} \quad (2)$$ \n\nA corollary of Hoeffding's [4] Theorem 2 is that, with probability at least $1 - \delta$, the sampling error is bounded by \n\n$$|\tilde{\mu}_{kdi} - \mu_{kdi}| \le \sqrt{\frac{R_d^2 \ln(2/\delta)}{2} \cdot \frac{\sum_{j=1}^{n_i} w_{jki}^2}{\left(\sum_{j=1}^{n_i} w_{jki}\right)^2}} \quad (3)$$ \n\nwhere $R_d$ is the range of the $d$th coordinate of the data (assumed known\u00b9). This bound is independent of the distribution of the data, which will ensure that our results are valid even if the data was not truly generated by a mixture of Gaussians, as is often the case in practice. On the other hand, the bound is more conservative than distribution-dependent ones, requiring more samples to reach the same guarantees. \n\nThe initialization step is error-free, assuming the finite- and infinite-data algorithms are initialized with the same means. Therefore the weighting error in the first iteration is zero, and Equation 3 bounds the total error. From this we can bound the weighting error in the second iteration according to Equation 2, and therefore bound the total error by the sum of Equations 2 and 3, and so on for each iteration until the algorithms converge. If the finite- and infinite-data EM converge in the same number of iterations $m$, the loss due to finite data is $L(M_{\vec{n}}, M_\infty) = \sum_{k=1}^{K} \sum_{d=1}^{D} |\hat{\mu}_{kdm} - \mu_{kdm}|^2$ (see Equation 1). Assume that the convergence criterion is $\sum_{k=1}^{K} \|\mu_{ki} - \mu_{k,i-1}\|^2 \le \gamma$. In general \n\n\u00b9 Although a normally distributed variable has infinite range, our experiments show
that assuming a sufficiently wide finite range does not significantly affect the results. \n\n(with probability specified below), infinite-data EM converges at one of the iterations for which the minimum possible change in mean positions is below $\gamma$, and is guaranteed to converge at the first iteration for which the maximum possible change is below $\gamma$. More precisely, it converges at one of the iterations for which $\sum_{k=1}^{K} \sum_{d=1}^{D} (\max\{|\hat{\mu}_{kd,i-1} - \hat{\mu}_{kdi}| - |\hat{\mu}_{kd,i-1} - \mu_{kd,i-1}| - |\hat{\mu}_{kdi} - \mu_{kdi}|, 0\})^2 \le \gamma$, and is guaranteed to converge at the first iteration for which $\sum_{k=1}^{K} \sum_{d=1}^{D} (|\hat{\mu}_{kd,i-1} - \hat{\mu}_{kdi}| + |\hat{\mu}_{kd,i-1} - \mu_{kd,i-1}| + |\hat{\mu}_{kdi} - \mu_{kdi}|)^2 \le \gamma$. To obtain a bound for $L(M_{\vec{n}}, M_\infty)$, finite-data EM must be run until the latter condition holds. Let $I$ be the set of iterations at which infinite-data EM could have converged. Then we finally obtain \n\n(4) \n\nwhere $m$ is the total number of iterations carried out. This bound holds if all of the Hoeffding bounds (Equation 3) hold. Since each of these bounds fails with probability at most $\delta$, the bound above fails with probability at most $\delta^* = KDm\delta$ (by the union bound). As a result, the growth with $K$, $D$ and $m$ of the number of examples required to reach a given loss bound with a given probability is only $O(\sqrt{\ln KDm})$. \n\nThe bound we have just derived utilizes run-time information, namely the distance of each example to each mean along each coordinate in each iteration. This allows it to be tighter than a priori bounds. Notice also that it would be trivial to modify the treatment for any other loss criterion that depends only on the terms $|\hat{\mu}_{kdm} - \mu_{kdm}|$ (e.g., absolute loss). \n\n3 A Fast EM Algorithm \n\nWe now apply the previous section's result to reduce the number of examples used by EM at each iteration while keeping the loss bounded. We call the resulting algorithm VFEM. 
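For concreteness, the per-coordinate sampling-error bound of Equation 3, evaluated with the per-bound confidence $\delta = \delta^*/(KDm)$ from the union-bound argument, can be computed as in the following sketch (our own illustrative code, not the authors'):

```python
import math

def sampling_error_bound(weights, R_d, delta_star, K, D, m):
    """Hoeffding-style bound on |mu_tilde - mu| for one coordinate of one
    mean in one iteration (Equation 3). Uses delta = delta*/(K*D*m) so
    that all K*D*m individual bounds hold simultaneously with
    probability at least 1 - delta* (union bound)."""
    delta = delta_star / (K * D * m)
    s1 = sum(weights)                     # sum of the n_i weights w_jki
    s2 = sum(w * w for w in weights)      # sum of squared weights
    return R_d * math.sqrt(math.log(2.0 / delta) * s2 / 2.0) / s1
```

With uniform weights the expression reduces to the familiar unweighted Hoeffding bound $R\sqrt{\ln(2/\delta)/(2n)}$.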
The goal is to learn in minimum time a model whose loss relative to EM applied to infinite data is at most $\epsilon^*$ with probability at least $1 - \delta^*$. (The reason to use $\epsilon^*$ instead of $\epsilon$ will become apparent below.) Using the notation of the previous section, if $n_i$ examples are used at each iteration then the running time of EM is $O(KD \sum_{i=1}^{m} n_i)$, and can be minimized by minimizing $\sum_{i=1}^{m} n_i$. Assume for the moment that the number of iterations $m$ is known. Then, using Equation 1, we can state the goal more precisely as follows. \n\nGoal: Minimize $\sum_{i=1}^{m} n_i$, subject to the constraint that $\sum_{k=1}^{K} \|\hat{\mu}_{km} - \mu_{km}\|^2 \le \epsilon^*$ with probability at least $1 - \delta^*$. \n\nA sufficient condition for $\sum_{k=1}^{K} \|\hat{\mu}_{km} - \mu_{km}\|^2 \le \epsilon^*$ is that $\forall k\ \|\hat{\mu}_{km} - \mu_{km}\| \le \sqrt{\epsilon^*/K}$. We thus proceed by first minimizing $\sum_{i=1}^{m} n_i$ subject to $\|\hat{\mu}_{km} - \mu_{km}\| \le \sqrt{\epsilon^*/K}$ separately for each mean.\u00b2 In order to do this, we need to express $\|\hat{\mu}_{km} - \mu_{km}\|$ as a function of the $n_i$'s. By the triangle inequality, $\|\hat{\mu}_{ki} - \mu_{ki}\| \le \|\hat{\mu}_{ki} - \tilde{\mu}_{ki}\| + \|\tilde{\mu}_{ki} - \mu_{ki}\|$. By Equation 3, $\|\tilde{\mu}_{ki} - \mu_{ki}\| \le \sqrt{\frac{1}{2} R^2 \ln(2/\delta) \sum_{j=1}^{n_i} w_{jki}^2 / (\sum_{j=1}^{n_i} w_{jki})^2}$, where $R^2 = \sum_{d=1}^{D} R_d^2$ and $\delta = \delta^*/KDm$ per the discussion following Equation 4. \n\n\u00b2 This will generally lead to a suboptimal solution; improving it is a matter for future work. \n\nThe $(\sum_{j=1}^{n_i} w_{jki})^2 / \sum_{j=1}^{n_i} w_{jki}^2$ term is a measure of the diversity of the weights, being equal to $1/(1 - \mathrm{Gini}(\vec{w}'_{ki}))$, where $\vec{w}'_{ki}$ is the vector of normalized weights $w'_{jki} = w_{jki} / \sum_{j'=1}^{n_i} w_{j'ki}$. It attains a minimum of 1 when all the weights but one are zero, and a maximum of $n_i$ when all the weights are equal and non-zero. However, we would like to have a measure whose maximum is independent of $n_i$, so that it remains approximately constant whatever the value of $n_i$ chosen (for sufficiently large $n_i$). The measure will then depend only on the underlying distribution of the data. Thus we define $\beta_{ki} = (\sum_{j=1}^{n_i} w_{jki})^2 / (n_i \sum_{j=1}^{n_i} w_{jki}^2)$, obtaining $\|\tilde{\mu}_{ki} - \mu_{ki}\| \le \sqrt{R^2 \ln(2/\delta) / (2\beta_{ki} n_i)}$. 
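The diversity measure $\beta_{ki}$ just defined can be sketched as follows (our own illustrative code):

```python
def beta(weights):
    """Weight-diversity measure beta = (sum w)^2 / (n * sum w^2).
    Equals 1 when all n weights are equal and nonzero, and 1/n when all
    the weight falls on a single example, so its range is independent
    of the sample size n_i."""
    n = len(weights)
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1) / (n * s2)
```

A larger $\beta$ (more evenly spread weights) makes the sampling-error bound $\sqrt{R^2 \ln(2/\delta)/(2\beta_{ki} n_i)}$ tighter for the same $n_i$.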
Also, $\|\hat{\mu}_{ki} - \tilde{\mu}_{ki}\| = \sqrt{\sum_{d=1}^{D} |\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|^2}$, with $|\hat{\mu}_{kdi} - \tilde{\mu}_{kdi}|$ bounded by Equation 2. To keep the analysis tractable, we upper-bound this term by a function proportional to $\|\hat{\mu}_{k,i-1} - \mu_{k,i-1}\|$. This captures the notion that the weighting error in one iteration should increase with the total error in the previous one. Combining this with the bound for $\|\tilde{\mu}_{ki} - \mu_{ki}\|$, we obtain \n\n$$\|\hat{\mu}_{ki} - \mu_{ki}\| \le \alpha_{ki} \|\hat{\mu}_{k,i-1} - \mu_{k,i-1}\| + \sqrt{\frac{R^2 \ln(2/\delta)}{2\beta_{ki} n_i}} \quad (5)$$ \n\nwhere $\alpha_{ki}$ is the proportionality constant. Given this equation and $\|\hat{\mu}_{k0} - \mu_{k0}\| = 0$, it can be shown by induction that \n\n$$\|\hat{\mu}_{km} - \mu_{km}\| \le \sum_{i=1}^{m} \frac{r_{ki}}{\sqrt{n_i}} \quad (6)$$ \n\nwhere \n\n$$r_{ki} = \sqrt{\frac{R^2 \ln(2/\delta)}{2\beta_{ki}}} \prod_{j=i+1}^{m} \alpha_{kj} \quad (7)$$ \n\nThe target bound will thus be satisfied by minimizing $\sum_{i=1}^{m} n_i$ subject to $\sum_{i=1}^{m} (r_{ki}/\sqrt{n_i}) = \sqrt{\epsilon^*/K}$.\u00b3 Finding the $n_i$'s by the method of Lagrange multipliers yields \n\n$$n_i = \frac{K}{\epsilon^*} \left( \sum_{j=1}^{m} \sqrt[3]{r_{ki} r_{kj}^2} \right)^2 \quad (8)$$ \n\nThis equation will produce a required value of $n_i$ for each mean. To guarantee the desired $\epsilon^*$, it is sufficient to make $n_i$ equal to the maximum of these values. \n\nThe VFEM algorithm consists of a sequence of runs of EM, with each run using more examples than the last, until the bound $L(M_{\vec{n}}, M_\infty) \le \epsilon^*$ is satisfied, with $L(M_{\vec{n}}, M_\infty)$ bounded according to Equation 4. In the first run, VFEM postulates a maximum number of iterations $m$, and uses it to set $\delta = \delta^*/KDm$. If $m$ is exceeded, for the next run it is set to 50% more than the number needed in the current run. (A new run will be carried out if either the $\delta^*$ or $\epsilon^*$ target is not met.) The number of examples used in the first run of EM is the same for all iterations, and is set to $1.1(K/2)(R/\epsilon^*)^2 \ln(2/\delta)$. This is 10% more than the number of examples that would theoretically be required in the best possible case (no weighting errors in the last \n\n\u00b3 This may lead to a suboptimal solution for the $n_i$'s, in the unlikely case that $\|\hat{\mu}_{km} - \mu_{km}\|$ increases with them. 
iteration, leading to a pure Hoeffding bound, and a uniform distribution of examples among mixture components). The numbers of examples for subsequent runs are set according to Equation 8. For iterations beyond the last one in the previous run, the number of examples is set as for the first run. A run of EM is terminated when $\sum_{k=1}^{K} \sum_{d=1}^{D} (|\hat{\mu}_{kd,i-1} - \hat{\mu}_{kdi}| + |\hat{\mu}_{kd,i-1} - \mu_{kd,i-1}| + |\hat{\mu}_{kdi} - \mu_{kdi}|)^2 \le \gamma$ (see discussion preceding Equation 4), or two iterations after $\sum_{k=1}^{K} \|\hat{\mu}_{ki} - \hat{\mu}_{k,i-1}\|^2 \le \gamma/3$, whichever comes first. The latter condition avoids overly long unproductive runs. If the user target bound is $\epsilon$, $\epsilon^*$ is set to $\min\{\epsilon, \gamma/3\}$, to facilitate meeting the first criterion above. When the convergence threshold for infinite-data EM was not reached even when using the whole training set, VFEM reports that it was unable to find a bound; otherwise the bound obtained is reported. \n\nVFEM ensures that the total number of examples used in one run is always at least twice the number $n$ used in the previous run. This is done by, if $\sum n_i < 2n$, setting the $n_i$'s instead to $n'_i = 2n(n_i / \sum n_i)$. If at any point $\sum n_i > mN$, where $m$ is the number of iterations carried out and $N$ is the size of the full training set, $\forall i\ n_i = N$ is used. Thus, assuming that the number of iterations does not decrease with the number of examples, VFEM's total running time is always less than three times the time taken by the last run of EM. (The worst case occurs when the one-but-last run is carried out on almost the full training set.) \n\nThe run-time information gathered in one run is used to set the $n_i$'s for the next run. We compute each $\alpha_{ki}$ as $\|\hat{\mu}_{ki} - \tilde{\mu}_{ki}\| / \|\hat{\mu}_{k,i-1} - \mu_{k,i-1}\|$. The approximations made in the derivation will be good, and the resulting $n_i$'s accurate, if the means' paths in the current run are similar to those in the previous run. 
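The sample-size allocation of Equation 8 can be sketched as follows (our own illustrative code; `r` stands for the $r_{ki}$ coefficients of one mean, which in VFEM are estimated from the previous run):

```python
import math

def allocate_samples(r, eps_star, K):
    """Per-iteration sample sizes minimizing sum(n_i) subject to
    sum_i r_i / sqrt(n_i) = sqrt(eps*/K), by Lagrange multipliers.
    Equivalently: n_i = (K/eps*) * r_i^(2/3) * (sum_j r_j^(2/3))^2."""
    return [(K / eps_star) * sum((ri * rj * rj) ** (1.0 / 3.0) for rj in r) ** 2
            for ri in r]
```

By construction the allocation meets the loss constraint with equality, and iterations with larger $r_{ki}$ (i.e., whose error propagates more strongly to the final means) receive more examples. Like the $\alpha_{ki}$ estimates it depends on, the allocation is only as accurate as the assumption that the means' paths are similar from one run to the next.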
This may not be true in the earlier runs, but their running time will be negligible compared to that of later runs, where the assumption of path similarity from one run to the next should hold. \n\n4 Experiments \n\nWe conducted a series of experiments on large synthetic data sets to compare VFEM with EM. All data sets were generated by mixtures of spherical Gaussians with means $\mu_k$ in the unit hypercube. Each data set was generated according to three parameters: the dimensionality $D$, the number of mixture components $K$, and the standard deviation $\sigma$ of each coordinate in each component. The means were generated one at a time by sampling each dimension uniformly from the range $(2\sigma, 1 - 2\sigma)$. This ensured that most of the data points generated were within the unit hypercube. The range of each dimension in VFEM was set to one. Rather than discard points outside the unit hypercube, we left them in to test VFEM's robustness to outliers. Any $\mu_k$ that was less than $(\sqrt{D}/K)\sigma$ away from a previously generated mean was rejected and regenerated, since problems with very close means are unlikely to be solvable by either EM or VFEM. Examples were generated by choosing one of the means $\mu_k$ with uniform probability, and setting the value of each dimension of the example by randomly sampling from a Gaussian distribution with mean $\mu_{kd}$ and standard deviation $\sigma$. We compared VFEM to EM on 64 data sets of 10 million examples each, generated by using every possible combination of the following parameters: $D \in \{4, 8, 12, 16\}$; $K \in \{3, 4, 5, 6\}$; $\sigma \in \{0.01, 0.03, 0.05, 0.07\}$. In each run the two algorithms were initialized with the same means, randomly selected with the constraint that no two be less than $\sqrt{D}/(2K)$ apart. VFEM was allowed to converge before EM's guaranteed convergence criterion was met (see discussion preceding Equation 4). 
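The synthetic data generation just described can be sketched as follows (our own illustrative code; function and parameter names are ours):

```python
import numpy as np

def make_mixture_data(D, K, sigma, N, seed=0):
    """Generate N points from a mixture of K spherical Gaussians.
    Means are sampled uniformly from (2*sigma, 1 - 2*sigma)^D, rejecting
    any mean closer than (sqrt(D)/K)*sigma to a previously accepted one.
    Points falling outside the unit hypercube are deliberately kept."""
    rng = np.random.default_rng(seed)
    means = []
    while len(means) < K:
        mu = rng.uniform(2 * sigma, 1 - 2 * sigma, size=D)
        if all(np.linalg.norm(mu - m) >= (np.sqrt(D) / K) * sigma for m in means):
            means.append(mu)
    means = np.array(means)
    # Each example: pick a component uniformly, then sample every
    # coordinate from a Gaussian around that component's mean.
    ks = rng.integers(0, K, size=N)
    X = rng.normal(means[ks], sigma)
    return X, means
```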
All experiments were run on a 1 GHz Pentium III machine under Linux, with $\gamma = 0.0001DK$, $\delta^* = 0.05$, and $\epsilon^* = \min\{0.01, \gamma\}$. \n\nTable 1: Experimental results. Values are averages over the number of runs shown. Times are in seconds, and #EA is the total number of example accesses made by the algorithm, in millions. \n\nRuns     | Algorithm | #Runs | Time | #EA   | Loss | D    | K   | sigma \nBound    | VFEM      | 40    | 217  | 1.21  | 2.51 | 10.5 | 4.2 | 0.029 \nBound    | EM        | 40    | 3457 | 19.75 | 2.51 | 10.5 | 4.2 | 0.029 \nNo bound | VFEM      | 24    | 7820 | 43.19 | 1.20 | 9.1  | 4.9 | 0.058 \nNo bound | EM        | 24    | 4502 | 27.91 | 1.20 | 9.1  | 4.9 | 0.058 \nAll      | VFEM      | 64    | 3068 | 16.95 | 2.02 | 10   | 4.5 | 0.04 \nAll      | EM        | 64    | 3849 | 22.81 | 2.02 | 10   | 4.5 | 0.04 \n\nThe results are shown in Table 1. Losses were computed relative to the true means, with the best match between true means and empirical ones found by greedy search. Results for runs in which VFEM achieved and did not achieve the required $\epsilon^*$ and $\delta^*$ bounds are reported separately. VFEM achieved the required bounds and was able to stop early on 62.5% of its runs. When it found a bound, it was on average 16 times faster than EM. When it did not, it was on average 73% slower. The losses of the two algorithms were virtually identical in both situations. VFEM was more likely to converge rapidly for higher $D$'s and lower $K$'s and $\sigma$'s. When achieved, the average loss bound for VFEM was 0.006554, and for EM it was 0.000081. In other words, the means produced by both algorithms were virtually identical to those that would be obtained with infinite data.\u2074 \n\nWe also compared VFEM and EM on a large real-world data set, obtained by recording a week of Web page requests from the entire University of Washington campus. The data is described in detail in Wolman et al. [7], and the preprocessing carried out for these experiments is described in Domingos & Hulten [3]. 
The goal was to cluster patterns of Web access in order to support distributed caching. On a dataset with $D = 10$ and 20 million examples, with $\delta^* = 0.05$, $\gamma = 0.001$, $\epsilon^* = 1/3$, $K = 3$, and $\sigma = 0.01$, VFEM achieved a loss bound of 0.00581 and was two orders of magnitude faster than EM (62 seconds vs. 5928), while learning essentially the same means. \n\nVFEM's speedup relative to EM will generally approach infinity as the data set size approaches infinity. The key question is thus: what are the data set sizes at which VFEM becomes worthwhile? The tentative evidence from these experiments is that they will be in the millions. Databases of this size are now common, and their growth continues unabated, auguring well for the use of VFEM. \n\n5 Conclusion \n\nLearning algorithms can be sped up by minimizing the number of examples used in each step, under the constraint that the loss between the resulting model and the one that would be obtained with infinite data remain bounded. In this paper we applied this method to the EM algorithm for mixtures of Gaussians, and observed the resulting speedups on a series of large data sets. \n\n\u2074 The much higher loss values relative to the true means, however, indicate that infinite-data EM would often find only local optima (unless the greedy search itself only found a suboptimal match). \n\nAcknowledgments \n\nThis research was partly supported by NSF CAREER and IBM Faculty awards to the first author, and by a gift from the Ford Motor Company. \n\nReferences \n\n[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977. \n\n[2] P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 71-80, Boston, MA, 2000. ACM Press. \n\n[3] P. Domingos and G. Hulten. 
A general method for scaling up machine learning algorithms and its application to clustering. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 106-113, Williamstown, MA, 2001. Morgan Kaufmann. \n\n[4] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963. \n\n[5] C. Meek, B. Thiesson, and D. Heckerman. The learning-curve method applied to clustering. Technical Report MSR-TR-01-34, Microsoft Research, Redmond, WA, 2000. \n\n[6] B. Thiesson, C. Meek, and D. Heckerman. Accelerating EM for large databases. Technical Report MSR-TR-99-31, Microsoft Research, Redmond, WA, 2001. \n\n[7] A. Wolman, G. Voelker, N. Sharma, N. Cardwell, M. Brown, T. Landray, D. Pinnel, A. Karlin, and H. Levy. Organization-based analysis of Web-object sharing and caching. In Proceedings of the Second USENIX Conference on Internet Technologies and Systems, pp. 25-36, Boulder, CO, 1999. \n\n[8] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pp. 103-114, Montreal, Canada, 1996. ACM Press. \n", "award": [], "sourceid": 2064, "authors": [{"given_name": "Pedro", "family_name": "Domingos", "institution": null}, {"given_name": "Geoff", "family_name": "Hulten", "institution": null}]}