{"title": "DTs: Dynamic Trees", "book": "Advances in Neural Information Processing Systems", "page_first": 634, "page_last": 640, "abstract": null, "full_text": "DTs: Dynamic Trees \n\nChristopher K. I. Williams \n\nNicholas J. Adams \n\nInstitute for Adaptive and Neural Computation \n\nDivision of Informatics, 5 Forrest Hill \n\nEdinburgh, EH1 2QL, UK. \nckiw@dai.ed.ac.uk \n\nhttp://www.anc.ed.ac.uk/ \nnicka@dai.ed.ac.uk \n\nAbstract \n\nIn this paper we introduce a new class of image models, which we \ncall dynamic trees or DTs. A dynamic tree model specifies a prior \nover a large number of trees, each one of which is a tree-structured \nbelief net (TSBN). Experiments show that DTs are capable of \ngenerating images that are less blocky, and the models have better \ntranslation invariance properties than a fixed, \"balanced\" TSBN. \nWe also show that Simulated Annealing is effective at finding trees \nwhich have high posterior probability. \n\n1 Introduction \n\nIn this paper we introduce a new class of image models, which we call dynamic \ntrees or DTs. A dynamic tree model specifies a prior over a large number of trees, \neach one of which is a tree-structured belief net (TSBN). Our aim is to retain \nthe advantages of tree-structured belief networks, namely the hierarchical structure \nof the model and (in part) the efficient inference algorithms, while avoiding the \n\"blocky\" artifacts that derive from a single, fixed TSBN structure. One use for \nDTs is as prior models over labellings for image segmentation problems. \n\nSection 2 of the paper gives the theory of DTs, and experiments are described in \nSection 3. \n\n2 Theory \n\nThere are two essential components that make up a dynamic tree network: (i) the \ntree architecture and (ii) the nodes and conditional probability tables (CPTs) in \nthe given tree. We consider the architecture question first. 
\n\nFigure 1: (a) \"Naked\" nodes, (b) the \"balanced\" tree architecture, (c) a sample \nfrom the prior over Z, (d) data generated from the tree in (c). \n\nConsider a number of nodes arranged into layers, as in Figure 1(a). We wish \nto construct a tree structure so that any child node in a particular layer will be \nconnected to a parent in the layer above. We also allow there to be a null parent for \neach layer, so that any child connected to it will become a new root. (Technically we \nare constructing a forest rather than a tree.) An example of a structure generated \nusing this method is shown in Figure 1(c). \n\nThere are a number of ways of specifying a prior over trees. If we denote by $Z_i$ the \nindicator vector which shows to which parent node $i$ belongs, then the tree structure \nis specified by a matrix $Z$ whose columns are the individual $Z_i$ vectors (one for each \nnode). The scheme that we have investigated so far is to set $P(Z) = \\prod_i P(Z_i)$. \nIn our work we have specified $P(Z_i)$ as follows. Each child node is considered to \nhave a \"natural\" parent, namely its parent in the balanced structure shown in Figure 1(b). \nEach node in the parent layer is assigned an \"affinity\" for each child node, and \nthe \"natural\" parent has the highest affinity. Denote the affinity of node $k$ in the \nparent layer by $a_k$. Then we choose $P(Z_i = e_k) = e^{\\beta a_k} / \\sum_{j \\in \\mathrm{Pa}_i} e^{\\beta a_j}$, where $\\beta$ is \nsome positive constant and $e_k$ is the unit vector with a 1 in position $k$. Note that \nthe \"null\" parent is included in the sum, and has affinity $a_{\\mathrm{null}}$ associated with it, \nwhich affects the relative probability of \"orphans\". We have named this prior the \n\"full-time-node-employment\" prior as all the nodes participate in the creation of \nthe tree structure to some degree. 
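\n\nAs a worked illustration of this prior (a sketch under assumed values, borrowing the settings that Section 3.1 later uses for the 1-d experiments), suppose that child $i$ can choose between its natural parent ($a = 1$), two neighbouring parents ($a = 0$ each) and the null parent ($a_{\\mathrm{null}} = 0$), with $\\beta = 1.25$. The normaliser is then \n\n$$\\sum_{j \\in \\mathrm{Pa}_i} e^{\\beta a_j} = e^{1.25} + 3 e^{0} \\approx 3.49 + 3 = 6.49,$$ \n\nso that \n\n$$P(\\text{natural parent}) \\approx 3.49 / 6.49 \\approx 0.54, \\qquad P(\\text{each other choice}) \\approx 1 / 6.49 \\approx 0.15.$$ \n\nIn this assumed setting the child keeps its natural parent a little over half of the time and becomes a new root (an \"orphan\") with probability of about 0.15; a larger $\\beta$ would concentrate more of the mass on the natural parent. 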
\n\nHaving specified the prior over architectures, we now need to translate this into a \nTSBN. The units in the tree are taken to be C-class multinomial random variables. \nEach layer $l$ of the structure has associated with it a prior probability vector $\\pi_l$ \nand CPT $M_l$. Given a particular $Z$ matrix which specifies a forest structure, the \nprobability of a particular instantiation of all of the random variables is simply \nthe product of the probabilities of all of the trees, where the appropriate root \nprobabilities and CPTs are picked up from the $\\pi_l$'s and $M_l$'s. A sample generated \nfrom the tree structure in Figure 1(c) is shown in Figure 1(d). \n\nOur intuition as to why DTs may be useful image models is based on the idea that \nmost pixels in an image are derived from a single object. We think of an object as \nbeing described by a root of a tree, with the scale of the object being determined \nby the level in the tree at which the root occurs. In this interpretation the CPTs \nwill have most of their probability mass on the diagonal. \n\nGiven some data at the bottom layer of units, we can form a posterior over the tree \nstructures and node instantiations of the layers above. This is rather like obtaining \na set of parses for a number of sentences using a context-free grammar¹. \n\nIn the DT model as described above different examples are explained by different \ntrees. This is an important difference from the usual priors over belief networks as \nused, e.g., in Bayesian averaging over model structures. Also, in the usual case of \nmodel averaging, there is normally no restriction to TSBN structures, or to tying \nthe parameters ($\\pi_l$'s and $M_l$'s) between different structures. \n\n2.1 Inference in DTs \n\nWe now consider the problem of inference in DTs, i.e. 
obtaining the posterior \n$P(Z, X_h | X_v)$ where $Z$ denotes the tree-structure, $X_v$ the visible units (the image \nclamped on the lowest level) and $X_h$ the hidden units. In fact, we shall concentrate \non obtaining the posterior marginal $P(Z | X_v)$, as we can obtain samples from \n$P(X_h | X_v, Z)$ using standard techniques for TSBNs. \n\nThere are a very large number of possible structures; in fact for a set of nodes created \nfrom a balanced tree with branching factor $b$ and depth $D$ (with the top level \nindexed by 1) there are $\\prod_{d=2}^{D} (b^{(d-2)} + 1)^{b^{(d-1)}}$ possible forest structures. Our objective \nwill be to obtain the maximum a posteriori (MAP) state from the posterior \n$P(Z | X_v) \\propto P(Z) P(X_v | Z)$ using Simulated Annealing². This is possible because \nthe two components $P(Z)$ and $P(X_v | Z)$ are readily evaluated. $P(X_v | Z)$ can be computed \nfrom $\\prod_r \\left( \\sum_{x_r} \\lambda(x_r) \\pi(x_r) \\right)$, where $\\lambda(x_r)$ and $\\pi(x_r)$ are the Pearl-style vectors \nof each root $r$ of the forest. \n\nAn alternative to sampling from the posterior $P(Z, X_h | X_v)$ is to use approximate \ninference. One possibility is to use a mean-field-type approximation to the posterior \nof the form $Q_Z(Z) Q_h(X_h)$ (Zoubin Ghahramani, personal communication, 1998). \n\n2.2 Comparing DTs to other image models \n\nFixed-structure TSBNs have been used by a number of authors as models of images \n(Bouman and Shapiro, 1994), (Luettgen and Willsky, 1995). They have an attractive \nmulti-scale structure, but suffer from problems due to the fixed tree structure, \nwhich can lead to very \"blocky\" segmentations. Markov Random Field (MRF) \nmodels are also popular image models; however, one of their main limitations is \nthat inference in an MRF is NP-hard. Also, they lack a hierarchical structure. 
On \nthe other hand, stationarity of the process they define can be easily ensured, which \nis not the case for fixed-structure TSBNs. One strategy to overcome the fixed structure \nof TSBNs is to break away from the tree structure, and use belief networks \nwith cross connections, e.g. (Dayan et al., 1995). However, this means losing the \nlinear-time belief-propagation algorithms that can be used in trees (Pearl, 1988) \nand using approximate algorithms. While it is true that inference over DTs is also \nNP-hard, we do retain a \"clean\" semantics based on the fact that we expect that \neach pixel should belong to one object, which may lead to useful approximation \nschemes. \n\n¹ CFGs have an $O(n^3)$ algorithm to infer the MAP parse; however, this algorithm depends \ncrucially on the one-dimensional ordering of the inputs. We believe that the possibility of \ncrossed links in the DT architecture means that this kind of algorithm is not applicable to \nthe DT case. Also, the DT model can be applied to 2-d images, where the $O(n^3)$ algorithm \nis not applicable. \n\n² It is also possible to sample from the posterior using, e.g., Gibbs Sampling. \n\n3 Experiments \n\nIn this section we describe two experiments conducted on the DT models. The first \nhas been designed to compare the translation performance of DTs with that of the \nbalanced TSBN structure and is described in Section 3.1. In Section 3.2 we generate \n2-d images from the DT model, find the MAP Dynamic Tree for these images, and \ncompare their log posteriors with those of the balanced TSBN. \n\n3.1 Comparing DTs with the balanced TSBN \n\nWe consider a 5-layer binary tree with 16 leaf nodes, as shown in Figure 1. Each node \nin the tree is a binary variable, taking on values of white/black. The $\\pi_l$'s, $M_l$'s and \naffinities were set to be equal in each layer. 
The values used were $\\pi = (0.75, 0.25)$ \nwith 0.75 referring to white, and $M$ had values 0.99 on the diagonal and 0.01 off-diagonal. \nThe affinities³ were set as 1 for the natural parent, 0 for the nearest \nneighbour(s) of the natural parent, $-\\infty$ for non-nearest neighbours and $a_{\\mathrm{null}} = 0$, \nwith $\\beta = 1.25$. \n\n³ The affinities are defined up to the addition of an arbitrary constant. \n\nFigure 2: Plots of the unnormalised log posterior vs position of the input pattern \nfor (a) the 5-black-nodes pattern and (b) the 4-black-nodes pattern. \n\nTo illustrate the effects of translation, we have taken a stimulus made up of a bar \nof five black pixels, and moved it across the image. The unnormalised log posterior \nfor a particular $Z$ configuration is $\\log P(Z) + \\log P(X_v | Z)$. This is computed for \nthe balanced TSBN architecture, and compared to the highest value that can be \nfound by conducting a search over $Z$. These results are plotted in Figure 2(a). \nThe x-axis denotes the position of the left hand end of the bar (running from 1 to \n12), and the y-axis shows the unnormalised log posterior. Note that due to symmetries \nthere are in reality fewer than 12 distinct configurations. Figure 2(a) shows clearly \nthat the balanced TSBN is a poor model for this stimulus, and that much better \ninterpretations can be found using DTs, even though the \"natural parent\" idea \nensures that $\\log P(Z)$ is always larger for the balanced tree. 
\n\nNotice also how the balanced TSBN displays greater sensitivity of the log posterior \nwith respect to position than the DT model. Figure 2 shows both the \"optimal\" \nlog posterior (found \"by hand\", using intuitions as to the best trees), and those \nof the MAP models discovered by Simulated Annealing. Annealing started from \na temperature of 1.0, which was decreased exponentially by a factor of 0.9 at each step. \nAt each temperature up to 2000 proposals could be made, although transition to \nthe next temperature would occur after 200 accepted steps. The run was deemed to \nhave converged after five successive temperature steps were made without accepting \na single step. We also show the log posterior of trees found by Gibbs sampling, for \nwhich we report the best configuration found from four separate runs (with different \nrandom starting positions), each of which was run for 25,000 sweeps through all of \nthe nodes. \n\nIn Figure 2(b) we have shown the log posterior for a stimulus made up of four black \nnodes⁴. In this case the balanced TSBN is even more sensitive to the stimulus \nlocation, as the four black nodes fit exactly under one sub-tree when they are \nin positions 1, 5, 9 or 13. By contrast, the dynamic tree is less sensitive to the \nalignment, although it does retain a preference for the configuration most favoured \nby the balanced TSBN. This is due to the concept of a \"natural\" parent built into \nthe (current) architecture (but see Section 4 for further discussion). \n\nClearly these results are somewhat sensitive to settings of the parameters. One of \nthe most important parameters is the diagonal entry in the CPT. This controls the \nrelative desirability of having a disconnection against a transition in the tree that \ninvolves a colour change. For example, if the diagonal entry in the CPT is reduced \nto 0.95, the gap between the optimal and balanced trees in Figure 2(b) is decreased. 
\nWe have experimented with CPT entries of 0.90, 0.95 and 0.99, but otherwise have \nnot needed to explore the parameter space to obtain the results shown. \n\n⁴ The parameters are the same as above, except that $a_{\\mathrm{null}}$ in level 3 was set to 10.0 to \nencourage disconnections at this level. \n\n3.2 Generating from the prior and finding the MAP Tree in 2-d \n\nWe now turn our attention to 2-d images. Considering a 5-layer quad-tree node \narrangement gives a total of 256 leaf nodes or a 16x16 pixel image. A structural \nplot of such a tree generated from the prior is shown in Figure 3. \n\nEach sub-plot is a slice through the tree showing the nodes on successive levels. \nThe boxes represent a single node on the current level and their shading indicates \nthe tree to which they belong. Nodes in the parent layer above are superimposed \nas circles and the lines emanating from them show their connectivity. Black circles \nwith a smaller white circle inside are used to indicate root nodes. Thus in the \nexample of Figure 3 we see that the forest consists of five trees, four of whose roots lie \nat level 3 (which between them account for most of the black in the image, Figure \n3(f)), while the root node at level 1 is responsible for the background. \n\nFigure 3: Plot of the MAP Dynamic Tree of the accompanying image (f). \n\nBroadly speaking the parameters for the 2-d DTs were set to be similar to the 1-d \ntrees of the previous section, except that the disconnection affinities were set to \nfavour disconnections higher up the tree, and to values for the leaf level such that \nleaf disconnection probabilities tend to zero. In practice this resulted in all leaves \nbeing connected to parent nodes (which is desirable as we believe that single-pixel \nobjects are unlikely). The $\\beta$ values increase with tree depth so that lower-level \nnodes choose parents from a tighter neighbourhood. 
The $\\pi_l$ and $M_l$ values were \nunchanged, and again we consider binary valued nodes. \n\nA suite of 600 images was created by sampling DTs from the above prior and then \ngenerating 5 images from each. Figure 3(f) shows an example of an image generated \nby the DT and it can be seen that the \"blockiness\" exhibited by balanced TSBNs \nis not present. \n\nFigure 4: (a) Comparison of the MAP DT log posterior against that of the quad-tree \nfor 600 images, (b) tree generated from the \"part-time-node-employment\" prior. \n\nThe MAP Dynamic Tree for each of these images was found by Simulated Annealing \nusing the same exponential strategy described earlier, and their log posteriors are \ncompared with those of the balanced TSBN in Figure 4(a). The line denotes the \nboundary of equal log posterior and the location of all the points above this clearly \nshows that in every case the MAP tree found has a higher posterior. \n\n4 Discussion \n\nAbove we have demonstrated that DT models have greater translation invariance \nand do not exhibit the blockiness of the balanced TSBN model. We also see that \nSimulated Annealing methods are successful at finding trees that have high posterior \nprobability. \n\nWe now discuss some extensions to the model. In the work above we have kept the \nbalanced tree arrangement of nodes. However, this could be relaxed, giving rise to \nroughly equal numbers of nodes at the various levels (cf. stationary wavelets). This \nwould be useful (a) for providing better translation invariance and (b) to avoid \nslight shortages of hidden units that can occur when patterns that are \"misaligned\" \nwith respect to the balanced tree are presented. In this case the prior over $Z$ would need to be \nadjusted to ensure a high proportion of tree-like structures, by generating the z's \nand x's in layers, so that the z's can be contingent on the states of the units in the \nlayer above. 
We have devised a prior of this nature and called it the \"part-time-employment\" \nprior as the nodes can decide whether or not they wish to be employed \nin the tree structure or remain redundant and inactive. An example tree generated \nfrom this prior is shown in Figure 4(b); we plan to explore this direction further \nin on-going research. Other research directions include the learning of parameters \nin the networks (e.g. using EM), and the introduction of additional information \nat the nodes; for example one might use real-valued variables in addition to the \nmultinomial variables considered above. These additional variables might be used to \nencode information such as that concerning the instantiation parameters of objects. \n\nAcknowledgements \n\nThis work stems from a conversation between CW and Zoubin Ghahramani at the Isaac \nNewton Institute in October 1997. We thank Zoubin Ghahramani, Geoff Hinton and Peter \nDayan for helpful conversations, and the Isaac Newton Institute for Mathematical Sciences \n(Cambridge, UK) for hospitality during the \"Neural Networks and Machine Learning\" programme. \nNJA is supported by an EPSRC research studentship, and the work of CW is \npartially supported by EPSRC grant GR/L03088, Combining Spatially Distributed Predictions \nFrom Neural Networks. \n\nReferences \n\nBouman, C. A. and M. Shapiro (1994). A Multiscale Random Field Model for Bayesian \nImage Segmentation. IEEE Transactions on Image Processing 3(2), 162-177. \n\nDayan, P., G. E. Hinton, R. M. Neal, and R. S. Zemel (1995). The Helmholtz Machine. \nNeural Computation 7(5), 889-904. \n\nLuettgen, M. R. and A. S. Willsky (1995). Likelihood Calculation for a Class of \nMultiscale Stochastic Models, with Application to Texture Discrimination. IEEE \nTransactions on Image Processing 4(2), 194-207. \n\nPearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible \nInference. San Mateo, CA: Morgan Kaufmann. 
\n\n\f", "award": [], "sourceid": 1535, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Nicholas", "family_name": "Adams", "institution": null}]}