{"title": "Learning About Multiple Objects in Images: Factorial Learning without Factorial Search", "book": "Advances in Neural Information Processing Systems", "page_first": 1415, "page_last": 1422, "abstract": "", "full_text": "Learning about Multiple Objects in Images:\nFactorial Learning without Factorial Search\n\nChristopher K. I. Williams\n\nand Michalis K. Titsias\n\nSchool of Informatics, University of Edinburgh, Edinburgh EH1 2QL, UK\nc.k.i.williams@ed.ac.uk\nM.Titsias@sms.ed.ac.uk\n\nAbstract\n\nWe consider data which are images containing views of multiple objects.\nOur task is to learn about each of the objects present in the images. This\ntask can be approached as a factorial learning problem, where each image\nmust be explained by instantiating a model for each of the objects present\nwith the correct instantiation parameters. A major problem with learning\na factorial model is that as the number of objects increases, there is a\ncombinatorial explosion of the number of con\ufb01gurations that need to be\nconsidered. We develop a method to extract object models sequentially\nfrom the data by making use of a robust statistical method, thus avoid-\ning the combinatorial explosion, and present results showing successful\nextraction of objects from real images.\n\n1 Introduction\n\nIn this paper we consider data which are images containing views of multiple objects.\nOur task is to learn about each of the objects present in the images. Previous approaches\n(discussed in more detail below) have approached this as a factorial learning problem,\nwhere each image must be explained by instantiating a model for each of the objects present\nwith the correct instantiation parameters. A serious concern with the factorial learning\nproblem is that as the number of objects increases, there is a combinatorial explosion of the\nnumber of con\ufb01gurations that need to be considered. 
Suppose there are L possible objects, and that there are J possible values that the instantiation parameters of any one object can take on; we will need to consider O(J^L) combinations to explain any image. In contrast, in our approach we find one object at a time, thus avoiding the combinatorial explosion.\n\nIn unsupervised learning we aim to identify regularities in data such as images. One fairly simple unsupervised learning model is clustering, which can be viewed as a mixture model where there are a finite number of types of object, and data is produced by choosing one of these objects and then generating the data conditional on this choice. As a model of objects in images, standard clustering approaches are limited as they do not take into account the variability that can arise due to the transformations that can take place, described by instantiation parameters such as translation, rotation etc. of the object. Suppose that there are d different instantiation parameters; then a single object will sweep out a d-dimensional manifold in the image space. Learning about objects taking this regularity into account has been called transformation-invariant clustering by Frey and Jojic (1999, 2002). However, this work is still limited to finding a single object in each image.\n\nhttp://anc.ed.ac.uk\n\nA more general model for data is that where the observations are explained by multiple causes; in our example this will be that in each image there are L objects. The approach of Frey and Jojic (1999, 2002) can be extended to this case by explicitly considering the simultaneous instantiation of all L objects (Jojic and Frey, 2001). However, this gives rise to a large search problem over the instantiation parameters of all objects simultaneously, and approximations such as variational methods are needed to carry out the inference. 
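The scaling argument above can be made concrete with a minimal sketch (illustrative numbers of our own choosing, not taken from the paper): a joint factorial search must score on the order of J^L configurations, whereas finding one object at a time scores on the order of L x J.

```python
# Illustrative comparison of joint factorial search vs. sequential search.
# L = number of objects, J = number of instantiation-parameter settings
# (e.g. one translation per pixel position).  Hypothetical sizes only.
L, J = 3, 76 * 96

joint = J ** L        # configurations a full joint search must consider
sequential = L * J    # configurations when objects are found one at a time

print(joint, sequential)
```

Even for this modest toy setting the joint search is intractable while the sequential one is trivial, which is the motivation for the approach developed below.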
In our method, by contrast, we discover the objects one at a time using a robust statistical method. Sequential object discovery is possible because multiple objects combine by occluding each other.\n\nThe general problem of factorial learning has a longer history; see, for example, Barlow (1989), Hinton and Zemel (1994), and Ghahramani (1995). However, Frey and Jojic made the important step for image analysis problems of using explicit transformations of object models, which allows the incorporation of prior knowledge about these transformations and leads to good interpretability of the results.\n\nA related line of research is that concerned with discovering part decompositions of objects. Lee and Seung (1999) described a non-negative matrix factorization method addressing this problem, although their work does not deal with parts undergoing transformations. There is also work on learning parts by Shams and von der Malsburg (1999), which is compared and contrasted with our work in section 4.\n\nThe structure of the remainder of this paper is as follows. In section 2 we describe the model, first for images containing only a single object (section 2.1) and then for images containing multiple objects (section 2.2). In section 3 we present experimental results for up to five objects appearing against stationary and non-stationary backgrounds. We conclude with a discussion in section 4.\n\n2 Theory\n\n2.1 Learning one object\n\nIn this section we consider the problem of learning about one object which can appear at various locations in an image. The object is in the foreground, with a background behind it. This background can either be fixed for all training images, or vary from image to image. The two key issues that we must deal with are (i) the notion of a pixel being modelled as foreground or background, and (ii) the problem of transformations of the object. 
We consider first the foreground/background issue.\n\nConsider an image x containing P = P_1 x P_2 pixels, arranged as a length-P vector. Our aim is to learn appearance-based representations of the foreground f and the background b. As the object will be smaller than P pixels, we will need to specify which pixels belong to the background and which to the foreground; this is achieved by a vector of binary latent variables s, one for each pixel. Each binary variable in s is drawn independently from the corresponding entry in a vector of probabilities π. For pixel i, if π_i ≃ 0, then the pixel will be ascribed to the background with high probability, and if π_i ≃ 1, it will be ascribed to the foreground with high probability. We sometimes refer to π as a mask. Pixel x_i is modelled by a mixture distribution:\n\np(x_i) = π_i N(x_i; f_i, σ_f^2) + (1 - π_i) N(x_i; b_i, σ_b^2),   (1)\n\nwhere σ_f^2 and σ_b^2 are respectively the foreground and background variances. Thus, ignoring transformations, we obtain p(x) = Π_i p(x_i).\n\nThe second issue that we must deal with is that of transformations. Below we consider only translations, although the ideas can be extended to deal with other transformations such as scaling and rotation (see e.g. Jojic and Frey (2001)). Each possible transformation (e.g. translations in units of one pixel) is represented by a corresponding transformation matrix, so that matrix T_j corresponds to transformation j, and T_j f is the transformed foreground model. In our implementation the translations use wrap-around, so that each T_j is in fact a permutation matrix. The semantics of foreground and background mean that the mask must also be transformed, so that we obtain\n\np(x | j) = Π_i [ (T_j π)_i N(x_i; (T_j f)_i, σ_f^2) + (1 - (T_j π)_i) N(x_i; b_i, σ_b^2) ].   (2)\n\nNotice that the foreground f and mask π are transformed by T_j, but the background b is not. In order for equation 2 to make sense, each element of T_j π must be a valid probability (lying in [0, 1]). This is certainly true for the case when T_j is a permutation matrix (and can be true more generally).\n\nTo complete the model we place a prior probability P_j on each transformation j; this is taken to be uniform over all possibilities, so that P_j = 1/J, where J is the number of possible transformations. Given a data set x^1, ..., x^N, we can adapt the parameters θ = (f, b, π, σ_f^2, σ_b^2) by maximizing the log likelihood L(θ) = Σ_n log p(x^n). This can be achieved through using the EM algorithm to handle the missing data, which is the transformation j and the binary variables s. The model developed in this section is similar to Jojic and Frey (2001), except that our mask has probabilistic semantics, which means that an exact M-step can be used as opposed to the generalized M-step used by Jojic and Frey.\n\n2.2 Coping with multiple objects\n\nIf there are L foreground objects, one natural approach is to consider models with L latent variables, each taking on the J values of the possible transformations. We also need to account for object occlusions. By assuming that the L objects can arbitrarily occlude one another (and this occlusion ordering can change in different images), there are L! possible arrangements. A model that accounts for multiple objects is described in Jojic and Frey (2001), where the occlusion ordering of the objects is taken as being fixed since they assume that each object is ascribed to a global layer. A full search over the parameters (assuming unknown occlusion ordering for each image) must consider J^L L! possibilities, which scales exponentially with L. An alternative is to consider approximations; Ghahramani (1995) suggests mean field and Gibbs sampling approximations and Jojic and Frey (2001) use approximate variational inference.\n\nOur goal is to find one object at a time in the images. We describe two methods for doing this. The first uses random initializations, and on different runs can find different objects; we denote this RANDOM STARTS. The second method (denoted GREEDY) removes objects found in earlier iterations and looks for as-yet-undiscovered objects in what remains.\n\nFor both methods we need to adapt the model presented in section 2.1. The problem is that occlusion can occur of both the foreground and the background. 
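Before turning to occlusion, the single-object likelihood of equation 2 can be sketched in code. The following is a minimal 1-D illustration (assuming NumPy; the array names and toy numbers are ours, not the paper's) in which the transformations T_j are wrap-around shifts, applied to the foreground and mask but not to the background:

```python
import numpy as np

def gaussian(x, mean, var):
    # pixelwise Gaussian density N(x; mean, var)
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def log_lik_given_shift(x, f, b, pi, var_f, var_b, shift):
    # log p(x | j) for a 1-D image under a wrap-around translation T_j
    # (a cyclic shift, hence a permutation matrix), as in equation 2.
    tf = np.roll(f, shift)    # T_j f : transformed foreground
    tpi = np.roll(pi, shift)  # T_j pi: transformed mask (b is NOT transformed)
    pix = tpi * gaussian(x, tf, var_f) + (1 - tpi) * gaussian(x, b, var_b)
    return np.log(pix).sum()

# Toy 1-D 'image': a bright 2-pixel object on a dark background, shifted by 3.
f = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
pi = np.array([0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01])
b = np.zeros(8)
x = np.roll(f, 3)
scores = [log_lik_given_shift(x, f, b, pi, 0.01, 0.01, s) for s in range(8)]
print(int(np.argmax(scores)))
```

The true shift obtains the highest log likelihood, which is the quantity the EM algorithm uses when computing transformation responsibilities.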
For a foreground pixel, a different object to the one being modelled may be interposed between the camera and our object, thus perturbing the pixel value. This can be modelled with a mixture distribution α N(x_i; f_i, σ_f^2) + (1 - α) U(x_i), where α is the fraction of times a foreground pixel is not occluded and the robustifying component U(x_i) is a uniform distribution common to all image pixels. Such robust models have been used for image matching tasks by a number of authors, notably Black and colleagues (Black and Jepson, 1996).\n\nSimilarly for the background, a different object from the one being modelled may be interposed between the background and the camera, so that we again have a mixture model β N(x_i; b_i, σ_b^2) + (1 - β) U(x_i), with similar semantics for the parameter β. (If the background has high variability then this robustness may not be required, but it will be in the case that the background is fixed while the objects move.)\n\n2.2.1 Finding the first object\n\nWith this robust model we can now apply the RANDOM STARTS algorithm by maximizing the likelihood of a set of images with respect to the model using the EM algorithm. The expected complete data log likelihood is given by\n\nQ(θ) = Σ_n Σ_j Q_n(j) { s^n(j) · [ r^n(j) ⊙ log N(x^n; f_j, σ_f^2) + (e - r^n(j)) ⊙ log U(x^n) + log π_j ] + (e - s^n(j)) · [ ρ^n ⊙ log N(x^n; b, σ_b^2) + (e - ρ^n) ⊙ log U(x^n) + log(e - π_j) ] },   (3)\n\nwhere log N(x^n; f_j, σ_f^2) denotes the vector of pixelwise Gaussian log densities, ⊙ defines the element-wise product between two vectors, T_j π is written as π_j (and T_j f as f_j) for compactness, and e denotes the P-dimensional vector containing ones. The expected values of several latent variables are as follows: Q_n(j) is the transformation responsibility for image n and transformation j; s^n(j) is a P-dimensional vector associated with the binary variables s, with each element storing the probability that the corresponding pixel belongs to the foreground in image n under transformation j; r^n(j) is the vector containing the robust responsibilities for the foreground on image n using transformation j; and similarly the vector ρ^n defines the robust responsibilities of the background. Note that the latter responsibilities do not depend on the transformation j, since the background is not transformed.\n\nAll of the above expected values of the missing variables are estimated in the E-step using the current parameter values. In the M-step we maximise the Q function with respect to the model parameters f, b, π, σ_f^2 and σ_b^2. We do not have space to show all of the updates, but for example\n\nf = [ Σ_n Σ_j Q_n(j) T_j^T (s^n(j) ⊙ r^n(j) ⊙ x^n) ] ⊘ [ Σ_n Σ_j Q_n(j) T_j^T (s^n(j) ⊙ r^n(j)) ],   (4)\n\nwhere ⊘ stands for the element-wise division between two vectors. This update is quite intuitive. Consider the case when Q_n(j) is 1 for the best-fitting transformation and 0 otherwise. For pixels which are ascribed to the foreground (i.e. s_i^n(j) r_i^n(j) ≃ 1), the values in x^n are transformed by T_j^T (which is T_j^{-1}, as the transformations are permutation matrices). This removes the effect of the transformation and thus allows the foreground pixels found in each training image to be averaged to produce f.\n\nOn different runs we hope to discover different objects. However, this is rather inefficient, as the basins of attraction for the different objects may be very different in size given the initialization. 
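The E-step can be sketched for the simplified, non-robust single-object model (a sketch with NumPy; variable and function names are ours). It computes the transformation responsibilities Q(j) and, per transformation, the posterior probability that each pixel belongs to the foreground:

```python
import numpy as np

def pixel_density(x, mean, var):
    # pixelwise Gaussian density N(x; mean, var)
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(x, f, b, pi, var_f, var_b):
    # E-step for one 1-D image under cyclic-shift transformations:
    # returns the transformation responsibilities Q(j) (uniform prior,
    # so proportional to p(x | j)) and the expected segmentation s(j),
    # i.e. P(pixel is foreground | x, j).  Non-robust version.
    J = len(f)
    log_q = np.empty(J)
    seg = np.empty((J, J))
    for j in range(J):                    # T_j = cyclic shift by j pixels
        fg = np.roll(pi, j) * pixel_density(x, np.roll(f, j), var_f)
        bg = (1 - np.roll(pi, j)) * pixel_density(x, b, var_b)
        log_q[j] = np.log(fg + bg).sum()  # log p(x | j) up to a constant
        seg[j] = fg / (fg + bg)
    q = np.exp(log_q - log_q.max())       # normalize stably
    return q / q.sum(), seg

f = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
pi = np.array([0.99, 0.99, 0.01, 0.01, 0.01, 0.01])
x = np.roll(f, 2)                         # object translated by 2 pixels
Q, seg = e_step(x, f, np.zeros(6), pi, 0.01, 0.01)
print(int(np.argmax(Q)))
```

With a clean toy image the responsibility mass concentrates on the true translation, which is the behaviour the footnote below relies on when picking the single best transformation per image.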
Thus we describe the GREEDY algorithm next.\n\n2.2.2 The GREEDY algorithm\n\nWe assume that we have run the RANDOM STARTS algorithm and have learned a foreground model f^1 and mask π^1. We wish to remove from consideration the pixels of the learned object (in each training image) in order to find a new object by applying the same algorithm. For each example image x^n we can use the responsibilities Q_n(j) to find the most likely transformation j_n^* (footnote 1). Now note that the transformed mask T_{j_n^*} π^1 obtains values close to 1 for all object pixels; however, some of these pixels might be occluded by other not-yet-discovered objects, and we do not wish to remove them from consideration. Thus we consider the vector a^n = (T_{j_n^*} π^1) ⊙ r^n, where r^n contains the foreground responsibilities. According to the semantics of the robust model, a^n will roughly give values close to 1 only for the non-occluded object pixels. To further explain all pixels we introduce a new foreground model f^2 and mask π^2; then, for each transformation j of model 2, we obtain\n\np(x_i | j) = a_i^n N(x_i; (T_{j_n^*} f^1)_i, σ_1^2) + (1 - a_i^n) [ (T_j π^2)_i (α N(x_i; (T_j f^2)_i, σ_2^2) + (1 - α) U(x_i)) + (1 - (T_j π^2)_i) (β N(x_i; b_i, σ_b^2) + (1 - β) U(x_i)) ].   (5)\n\nNote that we have dropped the robustifying component U(x_i) from model 1, since the parameters of this object have already been learned. By summing out over the possible transformations we can maximize the likelihood with respect to f^2, π^2, σ_2^2, b and σ_b^2. The above expression says that each image pixel x_i is modelled by a three-component mixture distribution: the pixel can belong to the first object with probability a_i^n; it does not belong to the first object and belongs to the second one with probability (1 - a_i^n)(T_j π^2)_i; while with the remaining probability it is background. Thus, the search for a new object involves only the pixels that are not accounted for by model 1 (i.e. those for which a_i^n ≃ 0).\n\nThis process can be continued, so that after finding a second model, the remaining background is searched for a third model, and so on. The formula for L objects becomes\n\np(x_i | j) = Σ_{k=1}^{L-1} [ Π_{m=1}^{k-1} (1 - a_i^{n,m}) ] a_i^{n,k} N(x_i; (T_{j_n^{*,k}} f^k)_i, σ_k^2) + [ Π_{m=1}^{L-1} (1 - a_i^{n,m}) ] [ (T_j π^L)_i (α N(x_i; (T_j f^L)_i, σ_L^2) + (1 - α) U(x_i)) + (1 - (T_j π^L)_i) (β N(x_i; b_i, σ_b^2) + (1 - β) U(x_i)) ].   (6)\n\nThis is an (L+1)-component mixture at each pixel, where the (L+1)th component is the background. If k = 1 then the term Π_{m=1}^{k-1} (1 - a_i^{n,m}) is defined to be equal to 1. Note that all parameters of the first L-1 components are kept fixed (learned in previous stages). 
We always deal with only one object at a time, and thus with one transformation latent variable. This approach can be viewed as approximating the full factorial model by sequentially learning each factor (object). A crucial point is that the algorithm is not assumed to extract layers in images, ordered from the nearest layer to the furthest one. In fact, in the next section we show a two-object example of a video sequence where we learn first the occluded object.\n\nSpace limitations do not permit us to show the Q function and updates for the parameters, but these are very similar to those for RANDOM STARTS, since we also learn only the parameters of one object plus the background, while keeping fixed all the parameters of previously discovered objects.\n\n1 It would be possible to make a \u201csofter\u201d version of this, where the transformations are weighted by their posterior probabilities, but in practice we have found that these probabilities are usually 1 for the best-fitting transformation and 0 otherwise after learning f^1 and π^1.\n\nFigure 1: Learning two objects against a stationary background. Panel (a) displays some frames of the training images, and (b) shows the two objects and background found by the GREEDY algorithm. (For each object, the learned mask and the product foreground * mask are shown, together with the background.)\n\n3 Experiments\n\nWe describe three experiments extracting objects from images including up to five movable objects, using stationary as well as non-stationary backgrounds. In these experiments the uniform distribution U(x_i) is based on the maximum and minimum pixel values of all training image pixels. In all the experiments reported below α and β were set to fixed values. Also, we assume that the total number of objects L that appear in the images is known; thus the GREEDY algorithm terminates when we discover the L-th object.\n\nThe learning algorithm also requires the initialization of the foreground and background appearances f and b, the mask π and the variances σ_f^2 and σ_b^2. Each element of the mask π is initialised to 0.5, the background appearance b is set to the mean of the training images, and the variances are initialized to equal large values (larger than the overall variance of all image pixels). For the foreground appearance f we compute the pixelwise mean of the training images and add independent Gaussian noise with equal variance at each pixel, where the variance is set to be large enough so that the range of pixel values found in the training images can be explored.\n\nIn the GREEDY algorithm, each time we add a new object the parameters f, π, σ_f^2 and σ_b^2 are initialized as described above. 
This means that the background b is reset to the mean of the training images; this is done to avoid local maxima, since the background found by considering only some of the objects in the images can be very different from the true background.\n\nFigure 1 illustrates the detection of two objects against a stationary background.2 Some examples of the training images (excluding the black border) are shown in Figure 1(a), and results are shown in Figure 1(b). For both objects we show both the learned mask and the elementwise product of the learned foreground and mask. In most runs the person with the lighter shirt (Jojic) is discovered first, even though he is occluded and the person with the striped shirt (Frey) is not. Video sequences of the raw data and the extracted objects can be viewed at http://www.dai.ed.ac.uk/homes/s0129556/lmo.html .\n\nIn Figure 2 five objects are learned against a stationary background. Notice the large amount of occlusion in some of the training images shown in Figure 2(a). Results are shown in Figure 2(b) for the GREEDY algorithm.\n\n2These data are used in Jojic and Frey (2001). We thank N. Jojic and B. Frey for making available these data via http://www.psi.toronto.edu/layers.html.\n\nFigure 2: Learning five objects against a stationary background. Panel (a) displays some of the training images and (b) shows the objects learned by the GREEDY algorithm. (For each of the five objects, the learned mask and the product foreground * mask are shown.)\n\nFigure 3: Two objects are learned from a set of images with non-stationary background. Panel (a) displays some examples of the training images, and (b) shows the objects found by the GREEDY algorithm.\n\nIn Figure 3 we consider learning objects against a non-stationary background. In fact three different backgrounds were used, as can be seen in the example images shown in Figure 3(a). Using the RANDOM STARTS algorithm the CD was found in 9 out of 10 runs. The results with the GREEDY algorithm are shown in Figure 3(b). The background found is approximately the average of the three backgrounds.\n\nOverall we conclude that the RANDOM STARTS algorithm is not very effective at finding multiple objects in images; it needs many runs from different initial conditions, and sometimes fails entirely to find all objects. In contrast, the GREEDY algorithm is very effective.\n\n4 Discussion\n\nShams and von der Malsburg (1999) obtained candidate parts by matching images in a pairwise fashion, trying to identify corresponding regions in the two images. These candidate image patches were then clustered to compensate for the effect of occlusions. 
We make four observations: (i) instead of directly learning the models, they match each image against all others (with complexity O(N^2) for N images), as compared to the linear scaling with N of our method; (ii) in their method the background must be removed, otherwise it would give rise to large match regions; (iii) they do not define a probabilistic model for the images (with all its attendant benefits); (iv) their data (although based on realistic CAD-type models) is synthetic, and designed to focus learning on shape-related features by eliminating complicating factors such as background, surface markings etc.\n\nIn our work the model for each pixel is a mixture of Gaussians. There is some previous work on pixelwise mixtures of Gaussians (see, e.g., Rowe and Blake, 1995) which can, for example, be used to achieve background subtraction and highlight moving objects against a stationary background. Our work extends beyond this by gathering the foreground pixels into objects, and it also allows us to learn objects in the more difficult non-stationary background case. For the stationary background case, pixelwise mixtures of Gaussians might be a useful way to create candidate objects.\n\nThe GREEDY algorithm has shown itself to be an effective factorial learning algorithm for image data. We are currently investigating issues such as dealing with richer classes of transformations, detecting L automatically, and allowing objects not to appear in all images. Furthermore, although we have described this work in relation to image modelling, it can be applied to other domains. For example, one can make a model for sequence data by having hidden Markov models (HMMs) for a \u201cforeground\u201d pattern and the \u201cbackground\u201d. Faced with sequences containing multiple foreground patterns, one could extract these patterns sequentially using a similar algorithm to that described above. 
It is true that for sequence data it would be possible to train a compound HMM consisting of L + 1 HMM components simultaneously, but there may be severe local minima problems in the search space, so that the sequential approach might be preferable.\n\nAcknowledgements: CW thanks Geoff Hinton for helpful discussions concerning the idea of learning one object at a time.\n\nReferences\n\nBarlow, H. (1989). Unsupervised Learning. Neural Computation, 1:295\u2013311.\n\nBlack, M. J. and Jepson, A. (1996). EigenTracking: Robust matching and tracking of articulated objects using a view-based representation. In Buxton, B. and Cipolla, R., editors, Proceedings of the Fourth European Conference on Computer Vision, ECCV\u201996, pages 329\u2013342. Springer-Verlag.\n\nFrey, B. J. and Jojic, N. (1999). Estimating mixture models of images and inferring spatial transformations using the EM algorithm. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1999. IEEE Computer Society Press. Ft. Collins, CO.\n\nFrey, B. J. and Jojic, N. (2002). Transformation Invariant Clustering and Linear Component Analysis Using the EM Algorithm. Revised manuscript under review for IEEE PAMI.\n\nGhahramani, Z. (1995). Factorial Learning and the EM Algorithm. In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 617\u2013624. Morgan Kaufmann, San Mateo, CA.\n\nHinton, G. E. and Zemel, R. S. (1994). Autoencoders, minimum description length, and Helmholtz free energy. In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6. Morgan Kaufmann.\n\nJojic, N. and Frey, B. J. (2001). Learning Flexible Sprites in Video Layers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 2001. IEEE Computer Society Press. Kauai, Hawaii.\n\nLee, D. D. and Seung, H. S. (1999). 
Learning the parts of objects by non-negative matrix factorization. Nature, 401:788\u2013791.\n\nRowe, S. and Blake, A. (1995). Statistical Background Modelling for Tracking with a Virtual Camera. In Pycock, D., editor, Proceedings of the 6th British Machine Vision Conference, volume 2, pages 423\u2013432. BMVA Press.\n\nShams, L. and von der Malsburg, C. (1999). Are object shape primitives learnable? Neurocomputing, 26-27:855\u2013863.\n", "award": [], "sourceid": 2288, "authors": [{"given_name": "Christopher", "family_name": "Williams", "institution": null}, {"given_name": "Michalis", "family_name": "Titsias", "institution": null}]}