Dan Pelleg, Andrew Moore
We focus on the problem of efﬁcient learning of dependency trees. It is well-known that given the pairwise mutual information coefﬁcients, a minimum-weight spanning tree algorithm solves this problem exactly and in polynomial time. However, for large data-sets it is the construc- tion of the correlation matrix that dominates the running time. We have developed a new spanning-tree algorithm which is capable of exploiting partial knowledge about edge weights. The partial knowledge we main- tain is a probabilistic conﬁdence interval on the coefﬁcients, which we derive by examining just a small sample of the data. The algorithm is able to ﬂag the need to shrink an interval, which translates to inspec- tion of more data for the particular attribute pair. Experimental results show running time that is near-constant in the number of records, with- out signiﬁcant loss in accuracy of the generated trees. Interestingly, our spanning-tree algorithm is based solely on Tarjan’s red-edge rule, which is generally considered a guaranteed recipe for bad performance.