{"title": "Unsupervised On-line Learning of Decision Trees for Hierarchical Data Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 520, "abstract": null, "full_text": "Unsupervised On-Line Learning of \nDecision Trees for Hierarchical Data \n\nAnalysis \n\nMarcus Held and Joachim M. Buhmann \n\nRheinische Friedrich-Wilhelms-U niversitat \nInstitut fUr Informatik III, ROmerstraBe 164 \n\nWWW: http://www-dbv.cs.uni-bonn.de \n\nD-53117 Bonn, Germany \n\nemail: {held.jb}.cs.uni-bonn.de \n\nAbstract \n\nAn adaptive on-line algorithm is proposed to estimate hierarchical \ndata structures for non-stationary data sources. The approach \nis based on the principle of minimum cross entropy to derive a \ndecision tree for data clustering and it employs a metalearning idea \n(learning to learn) to adapt to changes in data characteristics. Its \nefficiency is demonstrated by grouping non-stationary artifical data \nand by hierarchical segmentation of LANDSAT images. \n\n1 \n\nIntroduction \n\nUnsupervised learning addresses the problem to detect structure inherent in un(cid:173)\nlabeled and unclassified data. The simplest, but not necessarily the best ap(cid:173)\nproach for extracting a grouping structure is to represent a set of data samples \nX = {Xi E Rdli = 1, ... ,N} by a set of prototypes y = {Ya E Rdlo = 1, .. . ,K}, \nK \u00ab N. The encoding usually is represented by an assignment matrix M = (Mia), \nwhere Mia = 1 if and only if Xi belongs to cluster 0, and Mia = 0 otherwise. Accord-\ning to this encoding scheme, the cost function 1i (M, Y) = ~ L:~1 MiaV (Xi, Ya) \nmeasures the quality of a data partition, Le., optimal assignments and prototypes \n(M,y)OPt = argminM,y1i (M,Y) minimize the inhomogeneity of clusters w.r.t. a \ngiven distance measure V. For reasons of simplicity we restrict the presentation \nto the ' sum-of-squared-error criterion V(x, y) = !Ix - YI12 in this paper. 
To facilitate this minimization, a deterministic annealing approach was proposed in [5] which maps the discrete optimization problem, i.e., how to determine the data assignments, via the Maximum Entropy Principle [2] to a continuous parameter estimation problem. Deterministic annealing introduces a Lagrange multiplier \beta to control the approximation of H(M, Y) in a probabilistic sense. Equivalently, to maximize the entropy at fixed expected K-means costs we minimize the free energy

    F = -(1/\beta) \sum_{i=1}^{N} ln( \sum_{\alpha=1}^{K} exp(-\beta V(x_i, y_\alpha)) )

w.r.t. the prototypes y_\alpha. The assignments M_{i\alpha} are treated as random variables, yielding a fuzzy centroid rule

    y_\alpha = \sum_{i=1}^{N} <M_{i\alpha}> x_i / \sum_{i=1}^{N} <M_{i\alpha}>,    (1)

where the expected assignments <M_{i\alpha}> are given by Gibbs distributions

    <M_{i\alpha}> = exp(-\beta V(x_i, y_\alpha)) / \sum_{\mu=1}^{K} exp(-\beta V(x_i, y_\mu)).    (2)

For a more detailed discussion of the DA approach to data clustering cf. [1, 3, 5]. In addition to assigning data to clusters (1, 2), hierarchical clustering provides the partitioning of data space with a tree structure. Each data sample x is sequentially assigned to a nested structure of partitions which hierarchically cover the data space R^d. This sequence of special decisions is encoded by decision rules which are attached to nodes along a path in the tree (see also fig. 1).
Therefore, learning a decision tree requires determining a tree topology, the accompanying assignments, the inner node labels S and the prototypes Y at the leaves. The search for such a hierarchical partition of the data space should be guided by an optimization criterion, i.e., minimal distortion costs.
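One deterministic annealing step, assuming the Euclidean distance V(x, y) = ||x - y||^2 used in this paper, can be sketched as follows (illustrative NumPy code; the max-shift inside the exponential is a standard numerical-stability device and leaves the Gibbs distribution unchanged):

```python
import numpy as np

def gibbs_assignments(X, Y, beta):
    """Expected assignments <M_ia> of eq. (2) at inverse temperature beta."""
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # subtracting the row minimum does not change the normalized distribution
    A = np.exp(-beta * (D - D.min(axis=1, keepdims=True)))
    return A / A.sum(axis=1, keepdims=True)

def centroid_step(X, Y, beta):
    """Fuzzy centroid rule (1): prototypes as assignment-weighted data means."""
    P = gibbs_assignments(X, Y, beta)
    return (P.T @ X) / P.sum(axis=0)[:, None]
```

For beta -> 0 every data point is assigned uniformly to all clusters; for large beta the rule approaches hard nearest-prototype assignment.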
This problem is solvable by a two-stage approach, which on the one hand minimizes the distortion costs at the leaves given the tree structure, and on the other hand optimizes the tree structure given the leaf-induced partition of R^d. This approach, due to Miller & Rose [3], is summarized in section 2. The extensions for adaptive on-line learning and experimental results are described in sections 3 and 4, respectively.

Figure 1: Right: Topology of a decision tree. Left: Induced partitioning of the data space (positions of the letters also indicate the positions of the prototypes). Decisions are made according to the nearest neighbor rule.

2 Unsupervised Learning of Decision Trees

Deterministic annealing of hierarchical clustering treats the assignments of data to inner nodes of the tree in a probabilistic way, analogous to the expected assignments of data to leaf prototypes. Based on the maximum entropy principle, the probability \phi_{i,j} that data point x_i reaches inner node s_j is recursively defined by (see [3]):

    \phi_{i,root} := 1,   \phi_{i,j} = \phi_{i,parent(j)} \pi_{i,j},   \pi_{i,j} = exp(-\gamma V(x_i, s_j)) / \sum_{k \in siblings(j)} exp(-\gamma V(x_i, s_k)),    (3)

where the Lagrange multiplier \gamma controls the fuzziness of all the transitions \pi_{i,j}. On the other hand, given the tree topology and the prototypes at the leaves, the maximum entropy principle naturally recommends an ideal probability \phi^L_{i,\alpha} at leaf y_\alpha, resp. \phi^L_{i,j} at an inner node s_j,

    \phi^L_{i,\alpha} = exp(-\beta V(x_i, y_\alpha)) / \sum_{\mu \in Y} exp(-\beta V(x_i, y_\mu))   and   \phi^L_{i,j} = \sum_{k \in descendants(j)} \phi^L_{i,k}.    (4)

We apply the principle of minimum cross entropy for the calculation of the prototypes at the leaves given a priori the probabilities for the parents of the leaves.
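The recursion (3) can be sketched in a few lines (an illustrative implementation under assumed data structures, not the authors' code: the tree is a dict mapping each node id to the list of its children, and s maps node ids to their test vectors):

```python
import numpy as np

def reach_probabilities(x, children, s, gamma, root=0):
    """phi_j of eq. (3): probability that data point x reaches node j.

    phi_root = 1; phi_j = phi_parent(j) * pi_j, where pi_j is a softmax
    over the siblings of j at fuzziness gamma.
    """
    phi = {root: 1.0}
    stack = [root]
    while stack:
        j = stack.pop()
        kids = children.get(j, [])
        if not kids:
            continue
        d = np.array([((x - s[k]) ** 2).sum() for k in kids])
        pi = np.exp(-gamma * (d - d.min()))   # shifted for numerical stability
        pi /= pi.sum()
        for k, p in zip(kids, pi):
            phi[k] = phi[j] * p               # phi_k = phi_j * pi_k
            stack.append(k)
    return phi
```

By construction, the reach probabilities of the leaves sum to one for every data point, so each level of the tree induces a (soft) partition of the data space.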
Minimization of the cross entropy with fixed expected costs <H_i> = \sum_\alpha <M_{i\alpha}> V(x_i, y_\alpha) for the data point x_i yields the expression

    min_{ {<M_{i\alpha}>} } I( {<M_{i\alpha}>} || {\phi_{i,parent(\alpha)} / K} ) = min_{ {<M_{i\alpha}>} } \sum_\alpha <M_{i\alpha}> ln( <M_{i\alpha}> / \phi_{i,parent(\alpha)} ),    (5)

where I denotes the Kullback-Leibler divergence and K defines the degree of the inner nodes. The tilted distribution

    <M_{i\alpha}> = \phi_{i,parent(\alpha)} exp(-\beta V(x_i, y_\alpha)) / \sum_\mu \phi_{i,parent(\mu)} exp(-\beta V(x_i, y_\mu))    (6)

generalizes the probabilistic assignments (2). In the case of Euclidean distances we again obtain the centroid formula (1) as the minimum of the free energy F = -(1/\beta) \sum_{i=1}^{N} ln[ \sum_{\alpha \in Y} \phi_{i,parent(\alpha)} exp(-\beta V(x_i, y_\alpha)) ]. Constraints induced by the tree structure are incorporated in the assignments (6). For the optimization of the hierarchy, Miller and Rose in a second step propose the minimization of the distance between the hierarchical probabilities \phi_{i,j} and the ideal probabilities \phi^L_{i,j}, the distance being measured by the Kullback-Leibler divergence

    \sum_{s_j \in parent(Y)} I( {\phi^L_{i,j}} || {\phi_{i,j}} ) = \sum_{s_j \in parent(Y)} \sum_{i=1}^{N} \phi^L_{i,j} ln( \phi^L_{i,j} / \phi_{i,j} ).    (7)

Equation (7) describes the minimization of the sum of cross entropies between the probability densities \phi^L_{i,j} and \phi_{i,j} over the parents of the leaves. Calculating the gradients for the inner nodes s_j and the Lagrange multiplier \gamma, we obtain

    -2\gamma \sum_{i=1}^{N} (x_i - s_j) { \phi^L_{i,j} - \phi^L_{i,parent(j)} \pi_{i,j} } := -2\gamma \sum_{i=1}^{N} \Delta_1(x_i, s_j),    (8)

    \sum_{i=1}^{N} \sum_{j \in S} V(x_i, s_j) { \phi^L_{i,j} - \phi^L_{i,parent(j)} \pi_{i,j} } := \sum_{i=1}^{N} \sum_{j \in S} \Delta_2(x_i, s_j).    (9)
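The tilted distribution (6) differs from (2) only by the prior weight phi_{i,parent(alpha)}. A sketch (illustrative NumPy code; phi_parent[i, a] is our name for the probability that x_i reaches the parent of leaf a):

```python
import numpy as np

def tilted_assignments(X, Y, phi_parent, beta):
    """Eq. (6): <M_ia> proportional to phi_{i,parent(a)} * exp(-beta V(x_i, y_a)).

    With a uniform phi_parent this reduces to the flat Gibbs assignments (2).
    """
    D = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # row-wise shift of D leaves the normalized distribution unchanged
    W = phi_parent * np.exp(-beta * (D - D.min(axis=1, keepdims=True)))
    return W / W.sum(axis=1, keepdims=True)
```

A leaf whose parent is unreachable for x_i (phi_parent zero) receives zero assignment weight, which is how the tree structure constrains the clustering.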
The first gradient is a weighted average of the difference vectors (x_i - s_j), where the weights measure the mismatch between the probability \phi^L_{i,j} and the probability induced by the transition \pi_{i,j}. The second gradient (9) measures the scale - V(x_i, s_j) - on which the transition probabilities are defined, and weights them with the mismatch between the ideal probabilities. This procedure yields an algorithm which starts at a small value \beta with a complete tree and identical test vectors attached to all nodes. The prototypes at the leaves are optimized according to (6) and the centroid rule (1), and the hierarchy is optimized by (8) and (9). After convergence one increases \beta and optimizes the hierarchy and the prototypes at the leaves again. The increment of \beta leads to phase transitions where test vectors separate from each other and the formerly completely degenerated tree evolves its structure. For a detailed description of this algorithm see [3].

3 On-Line Learning of Decision Trees

Learning of decision trees is refined in this paper to deal with unbalanced trees and on-line learning of trees. Updating identical nodes according to the gradients (9) with assignments (6) weighs parameters of unbalanced tree structures in an unsatisfactory way. A detailed analysis reveals that degenerated test vectors, i.e., test vectors with identical components, still contribute to the assignments and to the evolution of \gamma. This artefact is overcome by using dynamic tree topologies instead of a predefined topology with indistinguishable test vectors. On the other hand, the development of an on-line algorithm makes it possible to process huge data sets and non-stationary data.
For this setting there exists the need for on-line learning rules for the prototypes at the leaves, the test vectors at the inner nodes, and the parameters \gamma and \beta. Unbalanced trees also require rules for splitting and merging nodes.
Following Buhmann and Kühnel [1] we use an expansion of order O(1/N) of (1) to estimate the prototypes for the Nth data point

    y_\alpha^N ≈ y_\alpha^{N-1} + \eta_\alpha ( <M_{N\alpha}> / (M p_\alpha^{N-1}) ) ( x_N - y_\alpha^{N-1} ),    (10)

where p_\alpha^N ≈ p_\alpha^{N-1} + (1/M) ( <M_{N\alpha}> - p_\alpha^{N-1} ) denotes the probability of the occurrence of class \alpha. The parameters M and \eta_\alpha are introduced in order to take the possible non-stationarity of the data source into account. M denotes the size of the data window, and \eta_\alpha is a node-specific learning rate.
Adaptation of the inner nodes and of the parameter \gamma is performed by stochastic approximation using the gradients (8) and (9):

    s_j^N = s_j^{N-1} + \eta_j \Delta_1(x_N, s_j^{N-1}),    (11)

    \gamma^N = \gamma^{N-1} + \eta_\gamma \sum_{j \in S} \Delta_2(x_N, s_j^{N-1}).    (12)

For an appropriate choice of the learning rates \eta, the learning to learn approach of Murata et al. [4] suggests the learning algorithm

    w^N = w^{N-1} + \eta^{N-1} f(x_N, w^{N-1}).    (13)

The flow f in parameter space determines the change of w^{N-1} given a new data point x_N. Murata et al. derive the following update scheme for the learning rate:

    r^N = (1 - \delta) r^{N-1} + \delta f(x_N, w^{N-1}),    (14)
    \eta^N = \eta^{N-1} + \nu_1 \eta^{N-1} ( \nu_2 ||r^N|| - \eta^{N-1} ),    (15)

where \nu_1, \nu_2 and \delta are control parameters to balance the tradeoff between accuracy and convergence rate. r^N denotes the leaky average of the flow at time N.
The adaptation of \beta has to observe the necessary condition for a phase transition, \beta > \beta_crit ≡ 1/(2 \lambda_max), \lambda_max being the largest eigenvalue of the covariance matrix [3]

    \Sigma_\alpha = \sum_{i=1}^{M} (x_i - y_\alpha)(x_i - y_\alpha)^t <M_{i\alpha}> / \sum_{i=1}^{M} <M_{i\alpha}>.    (16)

Rules for splitting and merging nodes of the tree are introduced to deal with unbalanced trees and non-stationary data. Simple rules measure the distortion costs at the prototypes of the leaves.
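One step of the learning-to-learn scheme (13)-(15) can be sketched as follows (illustrative code; the default values of delta, nu_1 and nu_2 are placeholders, since the paper only states that these control parameters are chosen empirically):

```python
import numpy as np

def metalearning_step(w, r, eta, flow, x, delta=0.1, v1=0.02, v2=2.0):
    """One 'learning to learn' update following eqs. (13)-(15).

    The leaky average r of the flow f controls the learning rate eta:
    a large persistent flow (non-stationarity) increases eta, a vanishing
    flow (convergence) lets eta decay.
    """
    f = flow(x, w)                          # flow f(x_N, w^{N-1})
    w_new = w + eta * f                     # (13) stochastic approximation step
    r_new = (1 - delta) * r + delta * f     # (14) leaky average of the flow
    eta_new = eta + v1 * eta * (v2 * np.linalg.norm(r_new) - eta)  # (15)
    return w_new, r_new, eta_new
```

With the simple flow f(x, w) = x - w the parameter w tracks the (possibly drifting) mean of the data stream, which illustrates the intended adaptive behaviour.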
According to these costs the leaf with highest distortion costs is split. The merging criterion combines neighboring leaves with minimal distance in a greedy fashion. The parameter M (10), the typical time scale for changes in the data distribution, is used to fix the time between splitting resp. merging nodes and the update of \beta. Therefore, M controls the time scale for changes of the tree topology. The learning parameters for the learning to learn rules (13)-(15) are chosen empirically and are kept constant for all experiments.

4 Experiments

The first experiment demonstrates how a drifting two-dimensional data source can be tracked. This data source is generated by a fixed tree augmented with transition probabilities at the edges and with Gaussians at the leaves. By descending the tree structure this generates an i.i.d. random variable X ∈ R^2, which is rotated around the origin of R^2 to obtain a random variable T(N) = R(\omega, N) X. R is an orthogonal matrix, N denotes the number of the actual data point, and \omega denotes the angular velocity; M = 500. Figure 2 shows 45 degree snapshots of the learning of this non-stationary data source. We start to take these snapshots after the algorithm has developed its final tree topology (after ≈ 8000 data points). Apart from fluctuations of the test vectors at the leaves, the whole tree structure is stable while tracking the rotating data source.

Additional experiments with higher dimensional data sources confirm the robustness of the algorithm w.r.t. the dimension of the data space, i.e., similar tracking performances for different dimensions are observed, where differences are explained as differences in the data sources (figure 3). This performance is measured by the variance of the mean of the distances between the data source trajectory and the trajectories of the test vectors at the nodes of the tree.
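The drifting source can be sketched as a small generator (illustrative code under assumed parameters: the prototype means, the noise scale, and uniform branch probabilities are our assumptions; the paper only specifies the rotation T(N) = R(\omega, N) X):

```python
import numpy as np

def rotated_sample(rng, means, w, N):
    """Draw one sample of the drifting source T(N) = R(w, N) X.

    X is Gaussian noise around a uniformly chosen prototype mean,
    then rotated about the origin by the angle w * N.
    """
    a = rng.choice(len(means))                   # pick a leaf of the mixture
    X = means[a] + 0.05 * rng.standard_normal(2)
    c, s = np.cos(w * N), np.sin(w * N)
    R = np.array([[c, -s], [s, c]])              # orthogonal rotation matrix
    return R @ X
```

Since R is orthogonal, the cluster geometry is preserved while the whole configuration rotates, which is exactly the non-stationarity the tree has to track.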
Figure 2: 45 degree snapshots of the learning of a data source which rotates with a velocity \omega = 2\pi/30000 (360 degrees per 30000 data samples).

A second experiment demonstrates the learning of a switching data source. The results confirm a good performance concerning the restructuring of the tree (see figure 4). In this experiment the algorithm learns a given data source and after 10000 data points we switch to a different source.

As a real-world example of on-line learning of huge data sources the algorithm is applied to the hierarchical clustering of 6-dimensional LANDSAT data. The heat

Figure 3: Tracking performance for different dimensions (curves for 2, 4, 12 and 18 dimensions). As data sources we use d-dimensional Gaussians which are attached to a unit sphere. To the components of every random sample X we add sin(\omega N) in order to introduce non-stationarity. The first 8000 samples are used for the development of the tree topology.

[Figure 4: panels A) and B)]