{"title": "Using Tarjan's Red Rule for Fast Dependency Tree Construction", "book": "Advances in Neural Information Processing Systems", "page_first": 825, "page_last": 833, "abstract": null, "full_text": "Using Tarjan\u2019s Red Rule for Fast Dependency\n\nTree Construction\n\nDan Pelleg and Andrew Moore\n\nSchool of Computer Science\nCarnegie-Mellon University\nPittsburgh, PA 15213 USA\n\ndpelleg@cs.cmu.edu, awm@cs.cmu.edu\n\nAbstract\n\nWe focus on the problem of ef\ufb01cient learning of dependency trees. It\nis well-known that given the pairwise mutual information coef\ufb01cients,\na minimum-weight spanning tree algorithm solves this problem exactly\nand in polynomial time. However, for large data-sets it is the construc-\ntion of the correlation matrix that dominates the running time. We have\ndeveloped a new spanning-tree algorithm which is capable of exploiting\npartial knowledge about edge weights. The partial knowledge we main-\ntain is a probabilistic con\ufb01dence interval on the coef\ufb01cients, which we\nderive by examining just a small sample of the data. The algorithm is\nable to \ufb02ag the need to shrink an interval, which translates to inspec-\ntion of more data for the particular attribute pair. Experimental results\nshow running time that is near-constant in the number of records, with-\nout signi\ufb01cant loss in accuracy of the generated trees. Interestingly, our\nspanning-tree algorithm is based solely on Tarjan\u2019s red-edge rule, which\nis generally considered a guaranteed recipe for bad performance.\n\n1\n\nIntroduction\n\nBayes\u2019 nets are widely used for data modeling. However, the problem of constructing\nBayes\u2019 nets from data remains a hard one, requiring search in a super-exponential space of\npossible graph structures. Despite recent advances [1], learning network structure from big\ndata sets demands huge computational resources. 
We therefore turn to a simpler model,\nwhich is easier to compute while still being expressive enough to be useful. Namely, we\nlook at dependency trees, which are belief networks that satisfy the additional constraint\nthat each node has at most one parent.\nIn this simple case it has been shown [2] that\n\ufb01nding the tree that maximizes the data likelihood is equivalent to \ufb01nding a minimum-\nweight spanning tree in the attribute graph, where edge weights are derived from the mutual\ninformation of the corresponding attribute pairs.\n\nDependency trees are interesting in their own right, but also as initializers for Bayes\u2019 Net\nsearch, as mixture components [3], or as components in classi\ufb01ers [4]. It is our intent to\neventually apply the technology introduced in this paper to the full problem of Bayes\u2019 Net\nstructure search.\n\nOnce the weight matrix is constructed, executing a minimum spanning tree (MST) algo-\nrithm is fast. The time-consuming part is the population of the weight matrix, which takes\ntime O(Rn\u00b2) for R records and n attributes. This becomes expensive when considering\ndatasets with hundreds of thousands of records and hundreds of attributes.\n\nTo overcome this problem, we propose a new way of interleaving the spanning tree con-\nstruction with the operations needed to compute the mutual information coef\ufb01cients. We\ndevelop a new spanning-tree algorithm, based solely on Tarjan\u2019s [5] red-edge rule. This\nalgorithm is capable of using partial knowledge about edge weights and of signaling the\nneed for more accurate information regarding a particular edge. The partial information we\nmaintain is in the form of probabilistic con\ufb01dence intervals on the edge weights; an interval\nis derived by looking at a sub-sample of the data for a particular attribute pair. Whenever\nthe algorithm signals that a currently-known interval is too wide, we inspect more data\nrecords in order to shrink it. 
Once the interval is small enough, we may be able to prove\nthat the corresponding edge is not a part of the tree. Whenever such an edge can be elim-\ninated without looking at the full data-set, the work associated with the remainder of the\ndata is saved. This is where performance is potentially gained.\n\nWe have implemented the algorithm for numeric and categorical data and tested it on real\nand synthetic data-sets containing hundreds of attributes and millions of records. We show\nexperimental results of up to 5,000-fold speed improvements over the traditional algorithm.\nThe resulting trees are, in most cases, of near-identical quality to the ones grown by the\nnaive algorithm.\n\nUse of probabilistic bounds to direct structure-search appears in [6] for classi\ufb01cation and\nin [7] for model selection. In a sequence of papers, Domingos et al. have demonstrated the\nusefulness of this technique for decision trees [8], K-means clustering [9], and mixtures-\nof-Gaussians EM [10]. In the context of dependency trees, Meila [11] discusses the discrete\ncase that frequently comes up in text-mining applications, where the attributes are sparse in\nthe sense that only a small fraction of them is true for any record. In this case it is possible\nto exploit the sparseness and accelerate the Chow-Liu algorithm.\n\nThroughout the paper we use the following notation. The number of data records is R, the\nnumber of attributes n. When x is an attribute, xi is the value it takes for the i-th record. We\ndenote by \u03c1xy the correlation coef\ufb01cient between attributes x and y, and omit the subscript\nwhen it is clear from the context.\n\n2 A slow minimum-spanning tree algorithm\n\nWe begin by describing our MST algorithm\u00b9. Although in its given form it can be applied\nto any graph, it is asymptotically slower than established algorithms (as predicted in [5] for\nall algorithms in its class). 
We then proceed to describe its use in the case where some edge\nweights are known not exactly, but rather only to lie within a given interval. In Section 4\nwe will show how this property of the algorithm interacts with the data-scanning step to\nproduce an ef\ufb01cient dependency-tree algorithm.\n\nIn the following discussion we assume we are given a complete graph with n nodes, and\nthe task is to \ufb01nd a tree connecting all of its nodes such that the total tree weight (de\ufb01ned\nto be the sum of the weights of its edges) is minimized. This problem has been extremely\nwell studied and numerous ef\ufb01cient algorithms for it exist.\n\nWe start with a rule to eliminate edges from consideration for the output tree. Following [5],\nwe state the so-called \u201cred-edge\u201d rule:\n\nTheorem 1: The heaviest edge in any cycle in the graph is not part of the minimum\nspanning tree.\n\n\u00b9To be precise, we will use it as a maximum spanning tree algorithm. The two are interchangeable,\nrequiring just a reversal of the edge weight comparison operator.\n\n1. T \u2190 an arbitrary spanning set of n \u2212 1 edges; L \u2190 the empty set.\n2. While |L\u0304| > n \u2212 1 do:\n   Pick an arbitrary edge e \u2208 L\u0304 \u2216 T.\n   Let e\u2032 be the heaviest edge on the path in T between the endpoints of e.\n   If e is heavier than e\u2032: L \u2190 L \u222a {e};\n   otherwise: L \u2190 L \u222a {e\u2032}; T \u2190 (T \u222a {e}) \u2216 {e\u2032}.\n3. Output T.\n\nFigure 1: The MIST algorithm. At each step of the iteration, T contains the current \u201cdraft\u201d\ntree. L contains the set of edges that have been proven to not be in the MST and so L\u0304\ncontains the set of edges that still have some chance of being in the MST. T never contains\nan edge in L.\n\nTraditionally, MST algorithms use this rule in conjunction with a greedy \u201cblue-edge\u201d rule,\nwhich chooses edges for inclusion in the tree. In contrast, we will repeatedly use the\nred-edge rule until all but n \u2212 1 edges have been eliminated. 
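The procedure of Figure 1 operates only through weight comparisons, so it is straightforward to sketch in code. The following is a minimal Python sketch of the exact-weights version (the function and variable names are ours, not the paper's; the initial tree is an arbitrary star, and ties are broken by eliminating the candidate edge):

```python
def mist(n, weight):
    """Minimum Incremental Spanning Tree (sketch of Figure 1).

    n: number of nodes of a complete graph.
    weight: dict mapping frozenset({u, v}) -> edge weight.
    Repeatedly applies Tarjan's red-edge rule: the heaviest edge on a
    cycle (a candidate edge plus a tree path) cannot be in the MST.
    """
    edges = set(weight)                           # L-bar: edges not yet eliminated
    T = {frozenset({0, v}) for v in range(1, n)}  # arbitrary initial spanning tree (a star)

    def tree_path(u, v):
        # Depth-first search in T for the edge path between u and v.
        adj = {x: [] for x in range(n)}
        for e in T:
            a, b = tuple(e)
            adj[a].append(b)
            adj[b].append(a)
        stack, prev = [u], {u: None}
        while stack:
            x = stack.pop()
            if x == v:
                break
            for y in adj[x]:
                if y not in prev:
                    prev[y] = x
                    stack.append(y)
        path, x = [], v
        while prev[x] is not None:
            path.append(frozenset({x, prev[x]}))
            x = prev[x]
        return path

    while len(edges) > n - 1:
        e = next(iter(edges - T))                 # arbitrary candidate edge outside T
        u, v = tuple(e)
        e0 = max(tree_path(u, v), key=weight.__getitem__)  # heaviest path edge
        if weight[e] >= weight[e0]:
            edges.discard(e)                      # red rule eliminates the candidate
        else:
            edges.discard(e0)                     # red rule eliminates the tree edge,
            T.remove(e0)                          # and the candidate replaces it,
            T.add(e)                              # preserving Invariant 2
    return T                                      # exactly the n - 1 surviving edges
```

The loop body runs once per eliminated edge, plus one path search per iteration; on a complete graph this is asymptotically slower than the classical blue-rule algorithms, as the text notes.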
The proof that this results in a\nminimum spanning tree follows from [5].\n\nLet E be the original set of edges. Denote by L the set of edges that have already been\neliminated, and let L\u0304 = E \u2216 L. As a way to guide our search for edges to eliminate we\nmaintain the following invariant:\n\nInvariant 2: At any point there is a spanning tree T , which is composed of edges in L\u0304.\n\nIn each step, we arbitrarily choose some edge e in L\u0304 \u2216 T and try to eliminate it using the\nred-edge rule. Let P be the path in T between e\u2019s endpoints. The cycle we will apply the\nred-edge rule to will be composed of e and P . It is clear we only need to compare e with\nthe heaviest edge in P . If e is heavier, we can eliminate it by the red-edge rule. However, if\nit is lighter, then we can eliminate the tree edge by the same rule. We do so and add e to the\ntree to preserve Invariant 2. The algorithm, which we call Minimum Incremental Spanning\nTree (MIST), is listed in Figure 1.\n\nThe MIST algorithm can be applied directly to a graph where the edge weights are known\nexactly. And like many other MST algorithms, it can also be used in the case where just\nthe relative order of the edge weights is given. Now imagine a different setup, where edge\nweights are not given, and instead an oracle exists, which knows the exact values of the edge\nweights. When asked about the relative order of two edges, it may either respond with\nthe correct answer, or it may give an inconclusive answer. Furthermore, a constant fee is\ncharged for each query. In this setup, MIST is still suited for \ufb01nding a spanning tree while\nminimizing the number of queries issued. In step 2, we go to the oracle to determine the\norder. If the answer is conclusive, the algorithm proceeds as described. 
Otherwise, it just\nignores the \u201cif\u201d clause altogether and iterates (possibly with a different edge e).\nFor the moment, this setup may seem contrived, but in Section 4, we go back to the MIST\nalgorithm and put it in a context very similar to the one described here.\n\n3 Probabilistic bounds on mutual information\n\nWe now concentrate once again on the speci\ufb01c problem of determining the mutual infor-\nmation between a pair of attributes. We show how to compute it given the complete data,\nand how to derive probabilistic con\ufb01dence intervals for it, given just a sample of the data.\n\nAs shown in [12], the mutual information for two jointly Gaussian numeric attributes X\nand Y is:\n\nI(X; Y) = \u2212(1/2) ln(1 \u2212 \u03c1\u00b2)\n\nwhere the correlation coef\ufb01cient is \u03c1 = \u03c1XY = (1/R) \u03a3_{i=1..R} (x_i \u2212 \u03bcx)(y_i \u2212 \u03bcy) / \u221a(\u03c3\u0302\u00b2X \u03c3\u0302\u00b2Y),\nwith \u03bcx, \u03bcy, \u03c3\u0302\u00b2X and \u03c3\u0302\u00b2Y being the sample means and variances for attributes X and Y.\nSince the log function is monotonic, I(X; Y) must be monotonic in |\u03c1|. This is a suf\ufb01cient\ncondition for the use of |\u03c1| as the edge weight in a MST algorithm. Consequently, the\nsample correlation can be used in a straightforward manner when the complete data is\navailable. Now consider the case where just a sample of the data has been observed.\n\nLet x and y be two data attributes. We are trying to estimate \u03a3_{i=1..R} x_i \u00b7 y_i given the partial\nsum \u03a3_{i=1..r} x_i \u00b7 y_i for some r < R. To derive a con\ufb01dence interval, we use the Central Limit\nTheorem\u00b2. It states that given samples of the random variable Z (where for our purposes\nZ_i = x_i \u00b7 y_i), the sum \u03a3_i Z_i can be approximated by a Normal distribution with mean\nand variance closely related to the distribution mean and variance. Furthermore, for large\nsamples, the sample mean and variance can be substituted for the unknown distribution\nparameters. Note in particular that the central limit theorem does not require us to make\nany assumption about the Gaussianity of Z. We thus can derive a two-sided con\ufb01dence\ninterval for \u03a3_i Z_i = \u03a3_i x_i \u00b7 y_i with probability 1 \u2212 \u03b4 for some user-speci\ufb01ed \u03b4, typically\n1%. Given this interval, computing an interval for \u03c1 is straightforward. Categorical data\ncan be treated similarly; for lack of space we refer the reader to [13] for the details.\n\n4 The full algorithm\n\nAs we argued, the MIST algorithm is capable of using partial information about edge\nweights. We have also shown how to derive con\ufb01dence intervals on edge weights. We\nnow combine the two and give an ef\ufb01cient dependency-tree algorithm.\n\nWe largely follow the MIST algorithm as listed in Figure 1. We initialize the tree T in\nthe following heuristic way: \ufb01rst we take a small sub-sample of the data, and derive point\nestimates for the edge weights from it. Then we feed the point estimates to a MST algorithm\nand obtain a tree T.\nWhen we come to compare edge weights, we generally need to deal with two intervals. If\nthey do not intersect, then the points in one of them are all smaller in value than any point\nin the other, in which case we can determine which represents a heavier edge. We apply\nthis logic to all comparisons, where the goal is to determine the heaviest path edge e\u2032 and\nto compare it to the candidate e. If we are lucky enough that all of these comparisons are\nconclusive, the amount of work we save is related to how much data was used in computing\nthe con\ufb01dence intervals \u2014 the rest of the data for the attribute-pair that is represented by\nthe eliminated edge can be ignored.\n\nHowever, there is no guarantee that the intervals are separated and allow us to draw\nmeaningful conclusions. 
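The interval derivation just described can be sketched directly. The following Python fragment is a minimal sketch (the helper name and the exact form of the interval width are ours; the paper's implementation may differ); it computes a two-sided CLT interval for the full-data sum of the products z_i = x_i * y_i from a sub-sample of r < R records:

```python
import math
from statistics import NormalDist, fmean, pvariance

def sum_confidence_interval(z_sample, R, delta=0.01):
    """CLT-based two-sided interval for sum_{i=1..R} z_i.

    z_sample: observed products z_i = x_i * y_i for r < R records.
    Returns (lo, hi) containing the full-data sum with probability
    about 1 - delta; no Gaussianity of z is assumed, only the CLT.
    """
    r = len(z_sample)
    mean = fmean(z_sample)                        # sample mean substitutes the true mean
    sd = math.sqrt(pvariance(z_sample, mean))     # sample s.d. substitutes the true s.d.
    z_crit = NormalDist().inv_cdf(1 - delta / 2)  # e.g. about 2.58 for delta = 1%
    half_width = z_crit * R * sd / math.sqrt(r)   # error of the mean, scaled up to the sum
    return (R * mean - half_width, R * mean + half_width)
```

An interval for the correlation coefficient then follows by propagating the two endpoints through the correlation formula above.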
If they do not, then we have a situation similar to the inconclusive\noracle answers in Section 2. The price we need to pay here is looking at more data to\nshrink the con\ufb01dence intervals. We do this by choosing one edge \u2014 either a tree-path edge\nor the candidate edge \u2014 for \u201cpromotion\u201d, and doubling the sample size used to compute\nthe suf\ufb01cient statistics for it. After doing so we try to eliminate again (since we can do\nthis at no additional cost). If we fail to eliminate we iterate, possibly choosing a different\ncandidate edge (and the corresponding tree path) this time. The choice of which edge to\npromote is heuristic, and depends on the expected success of resolution once the interval\nhas shrunk. The details of these heuristics are omitted due to space constraints.\n\n\u00b2One can use the weaker Hoeffding bound instead, and our implementation supports it as well,\nalthough it is generally much less powerful.\n\nAnother heuristic we employ goes as follows. Consider the comparison of the path-heaviest\nedge to an estimate of a candidate edge. The candidate edge\u2019s con\ufb01dence interval may be\nvery small, and yet still intersect the interval that is the heavy edge\u2019s weight (this would\nhappen if, for example, both attribute-pairs have the same distribution). We may be able\nto reduce the amount of work by pretending the interval is narrower than it really is. We\ntherefore trim the interval by a constant, parameterized by the user as \u03b5, before performing\nthe comparison. This use of \u03b4 and \u03b5 is analogous to their use in \u201cProbably Approximately\nCorrect\u201d analysis: on each decision, with high probability (1 \u2212 \u03b4) we will make at worst a\nsmall mistake (\u03b5).\n\n5 Experimental results\n\nIn the following description of experiments, we vary different parameters for the data and\nthe algorithm. 
Unless otherwise speci\ufb01ed, these are the default values for the parameters.\nWe set \u03b4 to 1% and \u03b5 to 0.05 (on either side of the interval, totaling 0.1). The initial sample\nsize is \ufb01fty records. There are 100,000 records and 100 attributes. The data is numeric.\nThe data-generation process \ufb01rst generates a random tree, then draws points for each node\nfrom a normal distribution with the node\u2019s parent\u2019s value as the mean. In addition, any data\nvalue is set to random noise with probability 0.15.\nTo construct the correlation matrix from the full data, each of the R records needs to be\nconsidered for each of the n(n \u2212 1)/2 attribute pairs. We evaluate the performance of our\nalgorithm by adding the number of records that were actually scanned for all the attribute-pairs,\nand dividing the total by R \u00b7 n(n \u2212 1)/2. We call this number the \u201cdata usage\u201d of our algorithm. The\ncloser it is to zero, the more ef\ufb01cient our sampling is, while a value of one means the same\namount of work as for the full-data algorithm.\n\nWe \ufb01rst demonstrate the speed of our algorithm as compared with the full O(Rn\u00b2) scan.\nFigure 2 shows that the amount of data the algorithm examines is a constant that does not\ndepend on the size of the data-set. This translates to relative run-times of 0.7% (for the\n37,500-record set) to 0.02% (for the 1,200,000-record set) as compared with the full-data\nalgorithm. The latter number translates to a 5,000-fold speedup. Note that the reported\nusage is an average over the number of attributes. However this does not mean that the\nsame amount of data was inspected for every attribute-pair \u2014 the algorithm determines\nhow much effort to invest in each edge separately. We return to this point below.\n\nThe running time is plotted against the number of data attributes in Figure 3. 
A linear\nrelation is clearly seen, meaning that (at least for this particular data-generation scheme)\nthe algorithm is successful in doing work that is proportional to the number of tree edges.\n\nClearly speed has to be traded off. For our algorithm the risk is making the wrong decision\nabout which edges to include in the resulting tree. For many applications this is an accept-\nable risk. However, there might be a simpler way to grow estimate-based dependency trees,\none that does not involve complex red-edge rules. In particular, we can just run the original\nalgorithm on a small sample of the data, and use the generated tree. It would certainly be\nfast, and the only question is how well it performs.\n\nFigure 2: Data usage (indicative of absolute running time), in attribute-pair units per attribute.\n\nFigure 3: Running time as a function of the number of attributes.\n\nFigure 4: Relative log-likelihood vs. the sample-based algorithm. The log-likelihood difference\nis divided by the number of records.\n\nFigure 5: Relative log-likelihood vs. the sample-based algorithm, drawn against the fraction\nof data scanned.\n\nTo examine this effect we have generated data as above, then ran a 30-fold cross-validation\ntest for the trees our algorithm generated. We also ran a sample-based algorithm on each of\nthe folds. This variant behaves just like the full-data algorithm, but instead examines just\nthe fraction of it that adds up to the total amount of data used by our algorithm. Results for\nmultiple data-sets are in Figure 4. We see that our algorithm outperforms the sample-based\nalgorithm, even though they are both using the same total amount of data. The reason is\nthat using the same amount of data for all edges assumes all attribute-pairs have the same\nvariance. This is in contrast to our algorithm, which determines the amount of data for each\nedge independently. Apparently for some edges this decision is very easy, requiring just a\nsmall sample. These \u201csavings\u201d can be used to look at more data for high-variance edges.\nThe sample-based algorithm would not put more effort into those high-variance edges,\neventually making the wrong decision. In Figure 5 we show the log-likelihood difference\nfor a particular (randomly generated) set. Here, multiple runs with different \u03b4 and \u03b5 values\nwere performed, and the result is plotted against the fraction of data used. The baseline (0)\nis the log-likelihood of the tree grown by the original algorithm using the full data. Again\nwe see that MIST is better over a wide range of data utilization ratios.\n\nKeep in mind that the sample-based algorithm has been given an unfair advantage, com-\npared with MIST: it knows how much data it needs to look at. This parameter is implicitly\npassed to it from our algorithm, and represents an important piece of information about\nthe data. Without it, there would need to be a preliminary stage to determine the sample\nsize. 
The alternative is to use a \ufb01xed amount (speci\ufb01ed either as a fraction or as an absolute\ncount), which is likely to be too much or too little.\n\nTo test our algorithm on real-life data, we used various data-sets from [14, 15], as well\nas analyzed data derived from astronomical observations taken in the Sloan Digital Sky\nSurvey. On each data-set we ran a 30-fold cross-validation test as described above.\n\nTable 1: Results, relative to the sample-based algorithm, on real data. \u201cType\u201d means nu-\nmerical or categorical data.\n\nNAME | ATTR. | RECORDS | TYPE | DATA USAGE | MIST BETTER? | SAMPLE BETTER?\nCENSUS-HOUSE | 129 | 22784 | N | 1.0% | \u2717 | \u2713\nCOLORHISTOGRAM | 32 | 68040 | N | 0.5% | \u2713 | \u2717\nCOOCTEXTURE | 16 | 68040 | N | 4.6% | \u2717 | \u2713\nABALONE | 8 | 4177 | N | 21.0% | \u2717 | \u2717\nCOLORMOMENTS | 10 | 68040 | N | 0.6% | \u2717 | \u2713\nCENSUS-INCOME | 678 | 99762 | C | 0.05% | \u2713 | \u2717\nCOIL2000 | 624 | 5822 | C | 0.9% | \u2713 | \u2717\nIPUMS | 439 | 88443 | C | 0.06% | \u2713 | \u2717\nKDDCUP99 | 214 | 303039 | C | 0.02% | \u2713 | \u2717\nLETTER | 16 | 20000 | N | 1.5% | \u2713 | \u2717\nCOVTYPE | 151 | 581012 | C | 0.009% | \u2717 | \u2713\nPHOTOZ | 23 | 2381112 | N | 0.008% | \u2713 | \u2717\n\nFor each training fold, we ran our algorithm, followed by a sample-based algorithm that uses\nas much data as our algorithm did. Then the log-likelihoods of both trees were computed\nfor the test fold. Table 1 shows whether the 99% con\ufb01dence interval for the log-likelihood\ndifference indicates that either of the algorithms outperforms the other. In seven cases\nthe MIST-based algorithm was better, while the sample-based version won in four, and\nthere was one tie. Remember that the sample-based algorithm takes advantage of the \u201cdata\nusage\u201d quantity computed by our algorithm. 
Without it, it would be weaker or slower,\ndepending on how conservative the sample size was.\n\n6 Conclusion and future work\n\nWe have presented an algorithm that applies a \u201cprobably approximately correct\u201d approach\nto dependency-tree construction for numeric and categorical data. Experiments on sets with\nup to millions of records and hundreds of attributes show it is capable of processing massive\ndata-sets in time that is constant in the number of records, with just a minor loss in output\nquality.\n\nFuture work includes embedding our algorithm in a framework for fast Bayes\u2019 Net structure\nsearch.\n\nAn additional issue we would like to tackle is disk access. One advantage the full-data\nalgorithm has is that it is easily executed with a single sequential scan of the data \ufb01le.\nWe will explore the ways in which this behavior can be attained or approximated by our\nalgorithm.\n\nWhile we have derived formulas for both numeric and categorical data, we currently do not\nallow both types of attributes to be present in a single network.\n\nAcknowledgments\n\nWe would like to thank Mihai Budiu, Scott Davies, Danny Sleator and Larry Wasserman\nfor helpful discussions, and Andy Connolly for providing access to data.\n\nReferences\n\n[1] Nir Friedman, Iftach Nachman, and Dana Pe\u2019er. Learning Bayesian network struc-\nture from massive datasets: The \u201csparse candidate\u201d algorithm. In Proceedings of the\n15th Conference on Uncertainty in Arti\ufb01cial Intelligence (UAI-99), pages 206\u2013215,\nStockholm, Sweden, 1999.\n\n[2] C. K. Chow and C. N. Liu. Approximating discrete probability distributions with\ndependence trees. IEEE Transactions on Information Theory, 14:462\u2013467, 1968.\n\n[3] Marina Meila. Learning with Mixtures of Trees. PhD thesis, Massachusetts Institute\nof Technology, 1999.\n\n[4] N. Friedman, M. Goldszmidt, and T. J. Lee. 
Bayesian Network Classi\ufb01cation with\nContinuous Attributes: Getting the Best of Both Discretization and Parametric Fitting.\nIn Jude Shavlik, editor, International Conference on Machine Learning, 1998.\n\n[5] Robert Endre Tarjan. Data structures and network algorithms, volume 44 of CBMS-\nNSF Reg. Conf. Ser. Appl. Math. SIAM, 1983.\n\n[6] Oded Maron and Andrew W. Moore. Hoeffding races: Accelerating model selec-\ntion search for classi\ufb01cation and function approximation. In Jack D. Cowan, Gerald\nTesauro, and Joshua Alspector, editors, Advances in Neural Information Processing\nSystems, volume 6, pages 59\u201366, Denver, Colorado, 1994. Morgan Kaufmann.\n\n[7] Andrew W. Moore and Mary S. Lee. Ef\ufb01cient algorithms for minimizing cross valida-\ntion error. In Proceedings of the 11th International Conference on Machine Learning\n(ICML-94), pages 190\u2013198. Morgan Kaufmann, 1994.\n\n[8] Pedro Domingos and Geoff Hulten. Mining high-speed data streams. In Raghu Ra-\nmakrishnan, Sal Stolfo, Roberto Bayardo, and Ismail Parsa, editors, Proceedings of\nthe 6th ACM SIGKDD International Conference on Knowledge Discovery and Data\nMining (KDD-00), pages 71\u201380, N. Y., August 20\u201323 2000. ACM Press.\n\n[9] Pedro Domingos and Geoff Hulten. A general method for scaling up machine learning\nalgorithms and its application to clustering. In Carla Brodley and Andrea Danyluk,\neditors, Proceedings of the 17th International Conference on Machine Learning, San\nFrancisco, CA, 2001. Morgan Kaufmann.\n\n[10] Pedro Domingos and Geoff Hulten. Learning from in\ufb01nite data in \ufb01nite time. In\nProceedings of the 14th Neural Information Processing Systems (NIPS-2001), Van-\ncouver, British Columbia, Canada, 2001.\n\n[11] Marina Meila. An accelerated Chow and Liu algorithm: \ufb01tting tree distributions to\nhigh dimensional sparse data. 
In Proceedings of the Sixteenth International Confer-\nence on Machine Learning (ICML-99), Bled, Slovenia, 1999.\n\n[12] Fazlollah Reza. An Introduction to Information Theory, pages 282\u2013283. Dover Pub-\nlications, New York, 1994.\n\n[13] Dan Pelleg and Andrew Moore. Using Tarjan\u2019s red rule for fast dependency tree\nconstruction. Technical Report CMU-CS-02-116, Carnegie-Mellon University, 2002.\n\n[14] C.L. Blake and C.J. Merz. UCI repository of machine learning databases, 1998.\nhttp://www.ics.uci.edu/~mlearn/MLRepository.html.\n\n[15] S. Hettich and S. D. Bay. The UCI KDD archive, 1999. http://kdd.ics.uci.edu.\n", "award": [], "sourceid": 2281, "authors": [{"given_name": "Dan", "family_name": "Pelleg", "institution": null}, {"given_name": "Andrew", "family_name": "Moore", "institution": null}]}