{"title": "Generative Affine Localisation and Tracking", "book": "Advances in Neural Information Processing Systems", "page_first": 1505, "page_last": 1512, "abstract": null, "full_text": " Generative Affine Localisation and Tracking\n\n\n John Winn Andrew Blake\n Microsoft Research Cambridge\n Roger Needham Building\n 7 J. J. Thomson Avenue\n Cambridge CB3 0FB, U.K\n http://research.microsoft.com/mlp\n\n\n\n Abstract\n\n We present an extension to the Jojic and Frey (2001) layered sprite model\n which allows for layers to undergo affine transformations. This extension\n allows for affine object pose to be inferred whilst simultaneously learn-\n ing the object shape and appearance. Learning is carried out by applying\n an augmented variational inference algorithm which includes a global\n search over a discretised transform space followed by a local optimisa-\n tion. To aid correct convergence, we use bottom-up cues to restrict the\n space of possible affine transformations. We present results on a number\n of video sequences and show how the model can be extended to track an\n object whose appearance changes throughout the sequence.\n\n\n1 Introduction\n\nGenerative models provide a powerful and intuitive way to analyse images or video se-\nquences. Because such models directly represent the process of image generation, it is\nstraightforward to incorporate prior knowledge about the imaging process and to interpret\nresults. Since the entire data set is modelled, generative models can give improved accu-\nracy and reliability over feature-based approaches and they also allow for selection between\nmodels using Bayesian model comparison. Finally, it is possible to sample from generative\nmodels, for example, for the purposes of image or video editing.\nOne popular type of generative model represents images as a composition of layers [1]\nwhere each layer corresponds to the appearance and shape of an individual object or the\nbackground. 
If the generative model is expressed probabilistically, Bayesian learning and\ninference techniques can then be applied to reverse the imaging process and infer the shape\nand appearance of individual objects in an unsupervised fashion [2].\nThe difficulty with generative models is how to apply Bayesian inference efficiently. In a\nlayered model, inference involves localising the pose of the layers in each image, which\nis hard because of the large space of possible object transformations that needs to be ex-\nplored. Previously, this has been dealt with by imposing restrictions on the space of object\ntransformations, such as allowing only similarity transformations [3]. Alternatively, if the\nimages are known to belong to a video sequence, tracking constraints can be used to fo-\ncus the search on a small area of transformation space consistent with a dynamic model of\nobject motion [4]. However, even in a video sequence, this technique relies on the object\n\n\f\nremaining in frame and moving relatively slowly.\nIn this paper, we extend the work of [3] and present an approach to object localisation which\nallows objects to undergo planar affine transformations and works well both in the frames of\na video sequence and in unordered sets of images. A two-layer generative model is defined\nand inference performed using a factorised variational approximation, including a global\nsearch over a discretised transform space followed by a local optimisation using conjugate\ngradients. Additionally, we exploit bottom-up cues to constrain the space of transforms\nbeing explored. Finally, we extend our generative model to allow the object appearance in\none image to depend on its appearance in the previous one. Tracking appearance in this\nway gives improved performance for objects whose appearance changes slowly over time\n(e.g. 
objects undergoing non-planar rotation).
If the images are not frames of a video, or the object is out-of-frame or occluded in the previous image, then the system automatically reverts to using a learned foreground appearance model.

2 The generative image model

This section describes the generative image model, which is illustrated in the Bayesian network of Figure 1. This model consists of two layers: a foreground layer containing a single object and a background layer.
We denote our image set as {x1, . . . , xN}, where xi is a vector of the pixel intensities in the ith image. The background layer is assumed to be stationary and so its appearance vector b is set to be the same size as the image. A mask mi has binary elements that indicate which pixels of the ith image are foreground. The mask is set to be slightly larger than the image to allow the foreground object to overlap the edge of the image.

Figure 1: The Bayesian network for the generative image model. The rounded rectangle is a plate, indicating that there are N copies of each contained node (one for each image). Common to all images are the background b, foreground object appearance f and mask prior π. An affine transform T gives the position and pose of the object in each image. The binary mask m defines the area of support of the foreground object and has a prior given by a transformed π. The observed image x is generated by adding noise separately to the transformed foreground appearance and the background and composing them together using the mask. For illustration, the images underneath each node of the graph represent the inferred value of that node given a data set of hand images. A priori, the appearance and mask of the object are not known.

The foreground layer is represented by an appearance image vector f and a prior over its mask π, both of which are to be inferred from the image set. 
The elements of π are real numbers in the range [0, 1] which indicate the probability that the corresponding mask pixels are on, as suggested in [5]. The object appearance and mask prior are defined in a canonical, normalised pose; the actual position and pose of the object in the ith image is given by an affine transformation Ti. With our images in vector form, we can consider a transformation T to be a sparse matrix where the jth row defines the linear interpolation of pixels that gives the jth pixel in the transformed image. For example, a translation of an integer number of pixels is represented as a matrix whose entries Tjk are 1 if the translation of location k in the source image is location j in the destination image, and 0 otherwise. Hence, the transformed foreground appearance is given by Tf and the transformed mask prior by Tπ. Given the transformed mask prior, the conditional distribution for the kth mask pixel is

    P(mk = 1 | π, T) = (Tπ)k.    (1)

The observed image x is generated by a composition of the transformed foreground appearance and the background plus some noise. The conditional distribution for the kth image pixel is given by

    P(xk | b, f, m, T, λ) = N(xk | (Tf)k, λf⁻¹)^mk · N(xk | bk, λb⁻¹)^(1−mk)    (2)

where λ = (λf, λb) are the noise precisions for the foreground layer and the background layer respectively. The elements of both b and f are given broad Gaussian priors and the prior on each λ is a broad Gamma distribution. The prior on each element of π is a Beta distribution.

3 Factorised variational inference

Given the above model and a set of images {x1, . . . , xN}, the inference task is to learn a posterior distribution over all other variables, including the background, the foreground appearance and mask prior, the transformation and mask for each image, and the noise precisions. Direct application of Bayes's theorem is intractable because this would require integrating over all unobserved variables. 
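The transform-as-matrix view of the generative process can be sketched in a few lines of NumPy. This is a minimal illustration under our own naming, not the authors' Matlab implementation: it handles only integer-pixel translation with zero padding, builds the matrix densely (in practice it would be sparse), and composes a noise-free version of (2).

```python
import numpy as np

def translation_matrix(h, w, dy, dx):
    """Transform T for a (dy, dx) integer-pixel translation of an h-by-w
    image stored as a length h*w vector: T[j, k] = 1 iff source pixel k
    translates to destination pixel j (zero padding at the borders).
    Built densely here for clarity; in practice T is sparse."""
    T = np.zeros((h * w, h * w))
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx  # source pixel for destination (y, x)
            if 0 <= sy < h and 0 <= sx < w:
                T[y * w + x, sy * w + sx] = 1.0
    return T

def compose(b, f, m, T):
    """Noise-free composite: masked transformed foreground over background."""
    Tf = T @ f
    return m * Tf + (1 - m) * b
```

A single foreground pixel at location (0, 0), translated by (1, 2), lands at destination index 1*w + 2, and the mask then selects which layer each pixel is drawn from.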
Instead, we turn to the approximate inference technique of variational inference [6].
Variational inference involves defining a factorised variational distribution Q and then optimising it to minimise the Kullback-Leibler divergence between Q and the true posterior distribution. The motivation behind this methodology is that we expect the posterior to be unimodal and tightly peaked, so that it can be well approximated by a separable distribution. In this paper, we choose our variational distribution to be factorised with respect to each element of b, f, m and π, and also with respect to λf and λb. The factor of Q corresponding to each of these variables has the same form as the prior over that variable. For example, the factor for the kth element of π is a Beta distribution Q(πk) = Beta(πk | aπk, bπk). The choice of approximation to the posterior over the affine transform Q(T) is a more complex one, and will be discussed below.
The optimisation of the Q distribution is achieved by firstly initialising the parameters of all factors and then iteratively updating each factor in turn so as to minimise the KL divergence whilst keeping all other factors fixed. If we define H to be the set of all hidden variables, then the factor for the ith member Hi is updated using

    log Q(Hi) = ⟨log P({x1, . . . , xN}, H)⟩¬Q(Hi) + const.    (3)

[Figure 2, a message-passing diagram showing the messages exchanged between the nodes b, f, λ, π, T, m and x, appears here.]

Figure 2: The messages passed when VMP is applied to the generative model. The messages to or from T are not shown (see text). 
Where a message is shown as leaving the N plate, the destination node receives a set of N messages, one from each copy of the nodes within the plate. Where a message is shown entering the N plate, the message is sent to all copies of the destination node. All expectations are with respect to the variational distribution Q.

where ⟨·⟩¬Q(Hi) means the expectation under the distribution given by the product of all factors of Q except Q(Hi).
When the model is a Bayesian network, this optimisation procedure can be carried out in a modular fashion by applying Variational Message Passing (VMP) [7, 8]. Using VMP makes it very much simpler and quicker to extend, modify, combine or compare probabilistic models; it gives the same results as applying factorised variational inference by hand and places no additional constraints on the form of the model. In VMP, messages consisting of vectors of real numbers are sent to each node from its parent and children in the graph. In our model, the messages to and from all nodes (except T) are shown in Figure 2. By expressing each variational factor as an exponential family distribution, the `natural parameter vector' [8] of that distribution can be optimised using (3) by adding messages received at the corresponding node. For example, if the prior over b is N(b | μ, β⁻¹), the parameter vector of the factor Q(b) = N(b | μ′, β′⁻¹) is updated from the messages received at b using

    natural param. vector     prior          received messages
                                             N
    [ β′μ′  ]             [ βμ   ]           Σ  [ ⟨λb⟩⟨1 − mi⟩ ∘ xi  ]
    [ −β′/2 ]         =   [ −β/2 ]     +    i=1 [ −½⟨λb⟩⟨1 − mi⟩    ]    (4)

The form of the natural parameter vector varies for different exponential family distributions (Gaussian, Gamma, Beta, discrete . . . ) but the update equation remains the same. Following this update, the message being sent from b is recomputed to reflect the new parameters of Q(b). 
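The natural-parameter bookkeeping in (4) can be sketched for a single background pixel. The function names and the scalar (per-pixel) simplification are ours; the paper operates on whole-image vectors.

```python
import numpy as np

def gauss_natural(mu, beta):
    """Natural parameter vector (beta*mu, -beta/2) of N(x | mu, 1/beta)."""
    return np.array([beta * mu, -beta / 2.0])

def gauss_moments(eta):
    """Recover (mu, beta) from a Gaussian natural parameter vector."""
    beta = -2.0 * eta[1]
    return eta[0] / beta, beta

def update_Qb(prior_eta, messages):
    """VMP update for one pixel of Q(b): the prior's natural parameters
    plus the sum of the messages received from each frame's observation,
    each of the form (lam_b * <1 - m_i> * x_i, -lam_b * <1 - m_i> / 2)."""
    return prior_eta + np.sum(messages, axis=0)

# One frame, observed value 2.0, fully background (<1 - m_i> = 1),
# background precision lam_b = 1, prior N(0, 1):
prior = gauss_natural(0.0, 1.0)
msg = np.array([[1.0 * 1.0 * 2.0, -1.0 * 1.0 / 2.0]])
mu_new, beta_new = gauss_moments(update_Qb(prior, msg))
```

This reproduces the familiar conjugate Gaussian update: a unit-precision prior at 0 combined with a unit-precision observation at 2 gives posterior mean 1 and precision 2.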
For details of the derivation of this update equation and how to determine VMP messages for a given model, see [8].
Where a set of similar messages are sent corresponding to the pixels of an image, it is convenient to think instead of a single message where each element is itself an image. It is efficient to structure the implementation in this way because message computation and parameter updates can then be carried out using block operations on entire images.

4 Learning the object transformation

Following [3], we decompose the layer transformation into a product of transformations and define a variational distribution that is separable over each. To allow for affine transformations, we choose to decompose T into three transformations applied sequentially,

    T = Txy Trs Ta.    (5)

In this expression, Txy is a two-dimensional translation belonging to a finite set of translations 𝒯xy. Similarly, Trs is a rotation and uniform scaling, so the space of transforms is also two-dimensional and is discretised to form a finite set 𝒯rs. The third transformation Ta is a freeform (non-discretised) affine transform. The variational distribution over the combined transform T is given by

    Q(T) = Q(Txy) Q(Trs) Q(Ta).    (6)

Because Txy and Trs are discretised, Q(Txy) and Q(Trs) are defined to be discrete distributions. We can apply (3) to determine the update equations for these distributions,

    log Q(Txy) = ⟨m⟩ · (Txy⟨Trs⟩⟨Ta⟩⟨log π⟩) + ⟨1 − m⟩ · (Txy⟨Trs⟩⟨Ta⟩⟨log(1 − π)⟩)
                 + ⟨λf⟩ [ (⟨m⟩ ∘ x) · (Txy⟨Trs⟩⟨Ta⟩⟨f⟩) − ½ ‖⟨m⟩ ∘ Txy⟨Trs⟩⟨Ta⟩⟨f⟩‖² ] + zxy    (7)

    log Q(Trs) = ⟨Txy⟩⁻¹⟨m⟩ · (Trs⟨Ta⟩⟨log π⟩) + ⟨Txy⟩⁻¹⟨1 − m⟩ · (Trs⟨Ta⟩⟨log(1 − π)⟩)
                 + ⟨λf⟩ [ ⟨Txy⟩⁻¹(⟨m⟩ ∘ x) · (Trs⟨Ta⟩⟨f⟩) − ½ ‖⟨Txy⟩⁻¹⟨m⟩ ∘ Trs⟨Ta⟩⟨f⟩‖² ] + zrs    (8)

where zxy and zrs are constants which can be found by normalisation.
As described in [3], the evaluation of (7) and (8) for all Txy ∈ 𝒯xy and all Trs ∈ 𝒯rs can be carried out efficiently using Fast Fourier Transforms in either Cartesian or log-polar co-ordinate systems. 
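Each term in (7) is a dot product between a fixed image and a translated one, and such dot products can be evaluated for every integer shift at once via the correlation theorem. A minimal sketch with our own function names, using cyclic rather than zero-padded shifts:

```python
import numpy as np

def dot_for_all_shifts(a, v):
    """S[dy, dx] = sum over pixels of a * (v translated by (dy, dx)),
    for every cyclic integer translation, in O(HW log HW) via FFTs."""
    A = np.fft.fft2(a)
    V = np.fft.fft2(v)
    return np.real(np.fft.ifft2(A * np.conj(V)))

def translation_posterior(log_q):
    """Normalise an image of unnormalised log Q(Txy) values into a
    discrete distribution over translations (the z_xy constant)."""
    p = np.exp(log_q - log_q.max())
    return p / p.sum()
```

If v is a delta image at the origin, the shift (dy, dx) that aligns it with a delta in a at (1, 2) is exactly (1, 2), so S peaks there; summing several such terms (mask, log-prior and appearance terms) gives the full log Q(Txy) surface.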
The use of FFTs allows us to make both 𝒯xy and 𝒯rs large: we set 𝒯xy to contain all translations of a whole number of pixels and 𝒯rs to contain 360 rotations (at 1° intervals) and 50 scalings (where each scaling represents a 1.5% increase in length scale). FFTs can be used within the VMP framework as both (7) and (8) involve quantities that are contained in messages to T (see Figure 2).
Finally, we define the variational distribution over Ta to be a delta function,

    Q(Ta) = δ(Ta − Ta*).    (9)

Unlike all the other variational factors, this cannot be optimised analytically. To minimise the KL divergence, we need to find the value of Ta* that maximises

    Fa = ⟨Trs⟩⁻¹⟨Txy⟩⁻¹⟨m⟩ · (Ta⟨log π⟩) + ⟨Trs⟩⁻¹⟨Txy⟩⁻¹⟨1 − m⟩ · (Ta⟨log(1 − π)⟩)
         + ⟨λf⟩ [ ⟨Trs⟩⁻¹⟨Txy⟩⁻¹(⟨m⟩ ∘ x) · (Ta⟨f⟩) − ½ ‖⟨Trs⟩⁻¹⟨Txy⟩⁻¹⟨m⟩ ∘ Ta⟨f⟩‖² ].    (10)

This local maximisation is achieved efficiently by using a trust-region Newton method. The assumption is that the search through 𝒯xy and 𝒯rs has located the correct posterior mode in transform space and that it is only necessary to use gradient methods to find the peak of that mode. This assumption appeared valid for the image sequences used in our experiments, even when the transformation of the foreground layer was not well approximated by a similarity transform alone.
Inference in this model is made harder by an inherent non-identifiability problem. The pose of the learned appearance and mask prior is undefined, so applying a transform to f and π and the inverse of the transform to each Ti results in an unchanged joint distribution. When applying a variational technique, such non-identifiability leads to many more local minima in the KL divergence. We partially resolve this issue by adding a constraint to this model that the expected mask prior ⟨π⟩ is centred, so that its centre of gravity is in the middle of the latent image. 
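The centring step can be sketched as follows. This is our own minimal version: it uses cyclic integer shifts via np.roll, and the matching shifts of Q(f) and each Q(Ti) are left out.

```python
import numpy as np

def centre_mask_prior(pi):
    """Shift the expected mask prior so that its centre of gravity sits
    at the centre of the latent image. Returns the shifted image and the
    (dy, dx) integer shift, which must also be applied to the appearance
    and compensated for in each per-frame transform."""
    h, w = pi.shape
    ys, xs = np.indices(pi.shape)
    cy = (ys * pi).sum() / pi.sum()   # current centre of gravity
    cx = (xs * pi).sum() / pi.sum()
    dy = int(round((h - 1) / 2.0 - cy))
    dx = int(round((w - 1) / 2.0 - cx))
    return np.roll(pi, (dy, dx), axis=(0, 1)), (dy, dx)
```

For example, a mask prior concentrated at the top-left corner of a 5 × 5 latent image is moved to the central pixel (2, 2) by the shift (2, 2).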
This constraint is applied by shifting the parameters of Q(π) directly following each update (and also shifting Q(f) and each Q(T) appropriately).

Figure 3: Tracking a hand undergoing extreme affine transformation. The first column shows the learned background and masked object appearance. The second and third columns contain two frames from the sequence along with the foreground segmentation for each. The final column shows each frame transformed by the inverse of the inferred object transform. In each image the red outline surrounds the area where the transformed mask prior is greater than 0.5.

4.1 Using bottom-up information to improve inference

Given that ⟨π⟩ is centred, we can significantly improve convergence by using bottom-up information about the translation of the object. For example, the inferred mask ⟨mi⟩ for each frame is very informative about the location of the object in that frame. Using sufficient data, we could learn a conditional model P(Txy | ⟨mi⟩) and bound 𝒯xy by only considering translations with non-negligible posterior mass under this conditional model. Instead, we use a conservative, hand-constructed bound on 𝒯xy based on the assumption that, during inference, the most probable mask under Q(mi) consists of a (noisy) subset of the true mask pixels. Suppose the true mask contains M non-zero pixels with second moment of area IM and the current most probable mask contains V non-zero pixels (V ≤ M) with second moment of area IV. A bound on c, the position of the centre of the inferred mask relative to the centre of the true mask, is given by

    diag(ccᵀ) ≤ (M − V) diag(IM/V − IV/M).    (11)

We can gain a conservative estimate of M and IM by using the maximum values of V and IV across all frames, multiplied by a constant factor of 1.2. 
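The moment computation and the bound (11) can be sketched as follows. The function names are ours, and moments are taken about each mask's own centre of gravity; this is an illustration of the bound's form, not the authors' implementation.

```python
import numpy as np

def mask_moments(mask):
    """Pixel count V, centre of gravity c, and central second-moment
    matrix I of a binary mask."""
    ys, xs = np.nonzero(mask)
    V = len(ys)
    c = np.array([ys.mean(), xs.mean()])
    d = np.stack([ys - c[0], xs - c[1]])
    return V, c, d @ d.T

def centre_offset_bound(M, I_M, V, I_V):
    """Right-hand side of (11): an elementwise bound on diag(c c^T),
    the squared offset between the partial-mask and true-mask centres."""
    return (M - V) * np.diag(I_M / V - I_V / M)

# A 2x3 "true" mask and a 1x3 partial mask (a subset of its pixels):
true_mask = np.zeros((4, 5)); true_mask[1:3, 1:4] = 1
part_mask = np.zeros((4, 5)); part_mask[1, 1:4] = 1
M, cM, I_M = mask_moments(true_mask)
V, cV, I_V = mask_moments(part_mask)
bound = centre_offset_bound(M, I_M, V, I_V)
offset_sq = (cV - cM) ** 2   # the quantity the bound dominates
```

Here the partial mask's centre is offset by (0.5, 0) pixels from the true centre, and the computed bound dominates the squared offset in each coordinate, as (11) requires.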
The bound is deliberately constructed to be conservative; its purpose is to discard settings of Txy that have negligible probability under the model and so avoid local minima in the variational optimisation. The bound is updated at each iteration and applied by setting Q(Txy) = 0 for values of Txy outside the bound. Q(Txy) is then re-normalised.
The use of this bound on 𝒯xy is intended as a very simple example of incorporating bottom-up information to improve inference within a generative model. In future work, we intend to investigate using more informative bottom-up cues, such as optical flow or tracked interest points, to propose probable transformations within this model. Incorporating such proposals or bounds into a variational inference framework both speeds convergence and helps avoid local minima.

5 Experimental results

We present results on two video sequences. The first is of a hand rotating both parallel to the image plane and around its own axis, whilst also translating in three dimensions. The sequence consists of 59 greyscale frames, each of size 160 × 120 pixels (excluding the border). Our Matlab implementation took about a minute per frame to analyse the sequence, over half of which was spent on the conjugate gradient optimisation step.

Figure 4: Affine tracking of a semi-transparent object.

Figure 5: Tracking an object with changing appearance. A person is tracked throughout a sequence despite their appearance changing dramatically between the first and last frames. The blue outline shows the inferred mask ⟨m⟩, which differs slightly from π due to the object changing shape.

Figure 3 shows the expected values of the background and foreground layers under the optimised variational distribution, along with foreground segmentation results for two frames of the sequence. The right-hand column gives another indication of the accuracy of the inferred transformations: applying the inverse transformation to the entire frame shows that the hand then has a consistent normalised position and size. In a video of the hand showing the tracked outline,1 the outline appears to move smoothly and follow the hand with a high degree of accuracy, despite the system not using any temporal constraints.
Results for a second sequence showing a cyclist are given in Figure 4. Although the cyclist and her shadow are tracked correctly, the learned appearance is slightly inaccurate as the system is unable to capture the perspective foreshortening of the bicycle. This could be corrected by allowing Ta to include projective transformations.

6 Tracking objects with changing appearance

The model described so far makes the assumption that the appearance of the object does not change significantly from frame to frame. If the set of images is actually a set of frames from a video, we can model objects whose appearance changes slowly by allowing the model to use the object appearance in the previous frame as the basis for its appearance in the current frame. However, we may not know if the images are video frames and, even if we do, the object may be occluded or out-of-frame in the previous image. We can cope with this uncertainty by inferring automatically whether to use the previous frame or the learned appearance f. Switching between two methods in this way is similar to [9].
The model is extended by introducing a binary variable si for each frame and defining a new appearance variable gi = si f + (1 − si) T⁻¹i−1 xi−1. Hence gi either equals the foreground appearance f (if si = 1) or the transform-normalised previous frame (if si = 0). 
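The switch between the two appearance sources can be sketched per image vector as follows (our naming; T_prev_inv stands for the inverse-transform matrix of the previous frame, and with si in [0, 1] the same expression gives the expectation of gi under Q(si)):

```python
import numpy as np

def appearance(s, f, T_prev_inv, x_prev):
    """g_i = s_i * f + (1 - s_i) * T_{i-1}^{-1} x_{i-1}: the learned
    appearance when s_i = 1, the normalised previous frame when s_i = 0."""
    return s * f + (1 - s) * (T_prev_inv @ x_prev)
```

With the identity as the previous transform, s = 1 recovers f exactly and s = 0 recovers the previous frame, which is the behaviour the switch variable is inferred to select between.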
For the first frame, we fix s1 = 1. We then replace f with gi in (2) and apply VMP within the resulting Bayesian network.
The extended model is able to track an object even when its appearance changes significantly throughout the image sequence (see Figure 5). The binary variable si is found to have an expected value ≈ 0 for all frames (except the first). Using the tracked appearance allows the foreground segmentation of each frame to be accurate even though the object is poorly modelled by the inferred appearance image. If we introduce an abrupt change into the sequence, for example by reversing the second half of the sequence, si is found to be ≈ 1 for the frame following the change. In other words, the system has detected that it should not use the previous frame at this point, but should revert to using the latent appearance image f.

1Videos of results are available from http://johnwinn.org/Research/affine.

7 Discussion

We have proposed a method for localising an object undergoing affine transformation whilst simultaneously learning its shape and appearance. The power of this method has been demonstrated by tracking moving objects in several real videos, including cases where the appearance of the object changes significantly from start to end. The system makes no assumptions about the speed of motion of the object, requires no special initialisation and is robust to the object being temporarily occluded or moving out of frame.
A natural extension to this work is to allow multiple layers, with each layer having its own latent shape and appearance and set of affine transformations. Unfortunately, as the number of latent variables increases, the inference problem becomes correspondingly harder and an exhaustive search becomes less practical. Instead, we are investigating performing inference in a simpler model where a subset of the variables has been approximately marginalised out. 
The results of using this simpler model can then be used to guide inference in the full model. A further interesting addition to the model would be to allow layers to be grouped into rigid or articulated three-dimensional objects.

Acknowledgments

The authors would like to thank Nebojsa Jojic for suggesting the use of a binary switch variable for tracking and Tom Minka for helpful discussions.

References

[1] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. In IEEE Transactions on Image Processing, volume 3, pages 625–638, 1994.
[2] N. Jojic and B. Frey. Learning flexible sprites in video layers. In Proc. of IEEE Conf. on Computer Vision and Pattern Recognition, 2001.
[3] B. Frey and N. Jojic. Fast, large-scale transformation-invariant clustering. In Advances in Neural Information Processing Systems 14, 2001.
[4] M. K. Titsias and C. K. I. Williams. Fast unsupervised greedy learning of multiple objects and parts from video. 2004. To appear in Proc. Generative-Model Based Vision Workshop, Washington DC, USA.
[5] C. K. I. Williams and M. K. Titsias. Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16(5):1039–1062, 2004.
[6] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. In M. I. Jordan, editor, Learning in Graphical Models, pages 105–162. Kluwer, 1998.
[7] C. M. Bishop, J. M. Winn, and D. Spiegelhalter. VIBES: A variational inference engine for Bayesian networks. In Advances in Neural Information Processing Systems, volume 15, 2002.
[8] J. M. Winn and C. M. Bishop. Variational Message Passing. 2004. To appear in Journal of Machine Learning Research. Available from http://johnwinn.org.
[9] A. Jepson, D. Fleet, and T. El-Maraghi. Robust online appearance models for visual tracking. In Proc. IEEE Conf. 
Computer Vision and Pattern Recognition, volume I, pages 415–422, 2001.
", "award": [], "sourceid": 2658, "authors": [{"given_name": "John", "family_name": "Winn", "institution": null}, {"given_name": "Andrew", "family_name": "Blake", "institution": null}]}