{"title": "Occlusive Components Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 1069, "page_last": 1077, "abstract": "We study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. 
The model and the learning algorithm thus connect research on occlusion with the research field of multiple-cause component extraction methods.", "full_text": "Occlusive Components Analysis\n\nJörg Lücke\nFrankfurt Institute for Advanced Studies\nGoethe-University Frankfurt, Germany\nluecke@fias.uni-frankfurt.de\n\nRichard Turner\nGatsby Computational Neuroscience Unit, UCL\n17 Queen Square, London WC1N 3AR, UK\nturner@gatsby.ucl.ac.uk\n\nManeesh Sahani\nGatsby Computational Neuroscience Unit, UCL\n17 Queen Square, London WC1N 3AR, UK\nmaneesh@gatsby.ucl.ac.uk\n\nMarc Henniges\nFrankfurt Institute for Advanced Studies\nGoethe-University Frankfurt, Germany\nhenniges@fias.uni-frankfurt.de\n\nAbstract\n\nWe study unsupervised learning in a probabilistic generative model for occlusion. The model uses two types of latent variables: one indicates which objects are present in the image, and the other how they are ordered in depth. This depth order then determines how the positions and appearances of the objects present, specified in the model parameters, combine to form the image. We show that the object parameters can be learnt from an unlabelled set of images in which objects occlude one another. Exact maximum-likelihood learning is intractable. However, we show that tractable approximations to Expectation Maximization (EM) can be found if the training images each contain only a small number of objects on average. In numerical experiments it is shown that these approximations recover the correct set of object parameters. Experiments on a novel version of the bars test using colored bars, and experiments on more realistic data, show that the algorithm performs well in extracting the generating causes. 
Experiments based on the standard bars benchmark test for object learning show that the algorithm performs well in comparison to other recent component extraction approaches. The model and the learning algorithm thus connect research on occlusion with the research field of multiple-causes component extraction methods.\n\n1 Introduction\n\nA long-standing goal of unsupervised learning on images is to be able to learn the shape and form of objects from unlabelled scenes. Individual images usually contain only a small subset of all possible objects. This observation has motivated the construction of algorithms—such as sparse coding (SC; [1]) or non-negative matrix factorization (NMF; [2]) and its sparse variants—based on learning in latent-variable models, where each possible object, or part of an object, is associated with a variable controlling its presence or absence in a given image. Any individual "hidden cause" is rarely active, corresponding to the small number of objects present in any one image. Despite this plausible motivation, these algorithms make severe approximations. Perhaps the most crucial is that in the underlying latent variable models, objects, or parts thereof, combine linearly to form the image. In real images the combination of individual objects depends on their relative distance from the camera or eye. If two objects occupy the same region in planar space, the nearer one occludes the other, i.e., the hidden causes non-linearly compete to determine the pixel values in the region of overlap.\n\nIn this paper we extend multiple-causes models such as SC or NMF to handle occlusion. The idea of using many hidden "cause" variables to control the presence or absence of objects is retained, but these variables are augmented by another set of latent variables which determine the relative depth of the objects, much as in the z-buffer employed by computer graphics. 
In turn, this enables the simplistic linear combination rule to be replaced by one in which nearby objects occlude those that are more distant. One of the consequences of moving to a richer, more complex model is that inference and learning become correspondingly harder. One of the main contributions of this paper is to show how to overcome these difficulties.\n\nThe problem of occlusion has been addressed in different contexts [3, 4, 5, 6]. Prominent probabilistic approaches [3, 4] assign pixels in multiple images taken from the same scene to a fixed number of image layers. The approach is most frequently applied to automatically remove foreground and background objects. Those models are in many aspects more general than the approach discussed here. However, in contrast to our approach, they model data in which objects maintain a fixed position in depth relative to the other objects.\n\n2 A Generative Model for Occlusion\n\nThe occlusion model contains three important elements. The first is a set of variables which controls the presence or absence of objects in a particular image (this part will be analogous, e.g., to NMF). The second is a variable which controls the relative depths of the objects that are present. The third is the combination rule which describes how closer active objects occlude more distant ones.\n\nTo model the presence or absence of an object we use H binary hidden variables s_1, ..., s_H. We assume that the presence of one object is independent of the presence of the others and assume, for simplicity, equal probabilities π for objects to be present:\n\np(~s | π) = ∏_{h=1}^H Bernoulli(s_h; π) = ∏_{h=1}^H π^{s_h} (1 − π)^{1−s_h} .   (1)\n\nObjects in a real image can be ordered by their depth and it is this ordering which determines which of two overlapping objects occludes the other. The depth-ordering is captured in the model by randomly and uniformly choosing a member σ̂ of the set G(|~s|) which contains all permutation functions σ̂ : {1, ..., |~s|} → {1, ..., |~s|}, with |~s| = Σ_h s_h. More formally, the probability of σ̂ given ~s is defined by:\n\np(σ̂ | ~s) = 1 / |~s|!  with  σ̂ ∈ G(|~s|) .   (2)\n\nNote that we could have defined the order in depth independently of ~s, by choosing from G(H) with p(σ̂) = 1/H!. But then, because the depth of absent objects (s_h = 0) is irrelevant, no more than |~s|! distinct choices of σ̂ would have resulted in different images.\n\nFigure 1: A Illustration of how two object masks and features combine to generate an image (generation without noise). B Graphical model of the generation process with hidden permutation variable σ̂.\n\nThe final stage of the generative model describes how to produce the image given a selection of active causes and an ordering in relative depth of these causes. One approach would be to choose the closest object and to set the image equal to the feature vector associated with this object. However, this would mean that every image generated from the model would comprise just one object; the closest. What is missing from this description is a notion of the extent of an object and the fact that it might only contribute to a local selection of pixels in an image. For this reason, our model contains two sets of parameters. One set of parameters, W ∈ R^{H×D}, describes what contribution an object makes to each pixel (D is the number of pixels). The vector (W_{h1}, ..., W_{hD}) is therefore described as the mask of object h. If an object is highly localized, this vector will contain many zero elements. 
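As a concrete illustration, the two latent sampling stages (1) and (2) can be sketched in a few lines of Python. This is a sketch only; the function name `sample_latents` and the representation of σ̂ as a dictionary of depth ranks are our own choices, not part of the paper:

```python
import numpy as np

def sample_latents(H, pi, rng):
    """Draw a latent state S = (s, sigma) of the occlusion model.

    s is the binary presence vector of equation (1); sigma is a depth
    ordering of the present objects, drawn uniformly from the permutation
    group G(|s|) as in equation (2).
    """
    s = (rng.random(H) < pi).astype(int)      # s_h ~ Bernoulli(pi)
    present = np.flatnonzero(s)               # indices h with s_h = 1
    # sigma maps each present object to a depth rank in {1, ..., |s|}
    ranks = rng.permutation(len(present)) + 1
    sigma = dict(zip(present.tolist(), ranks.tolist()))
    return s, sigma

rng = np.random.default_rng(0)
s, sigma = sample_latents(H=8, pi=2 / 8, rng=rng)
```

With π = 2/8 and H = 8 this reproduces the roughly two-objects-per-image regime used in the experiments later in the paper.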
The other set of parameters, T ∈ R^{H×C}, represents the features of the objects. A feature vector ~T_h ∈ R^C describing object h might, for instance, be the object's rgb-color (C = 3 in that case). Fig. 1A illustrates the combination of masks and features, and Fig. 1B shows the graphical model of the generation process.\n\nLet us formalize how an image is generated given the parameters Θ = (W, T) and given the hidden variables S = (~s, σ̂). Before we consider observation noise, we define the generation of a noiseless image ~T(S, Θ) to be given by:\n\n~T_d(S, Θ) = W_{h_0 d} ~T_{h_0}  where  h_0 = argmax_h {τ(S, h) W_{hd}} ,\nτ(S, h) = 0 if s_h = 0;  τ(S, h) = 3/2 if s_h = 1 and |~s| = 1;  τ(S, h) = (σ̂(h) − 1)/(|~s| − 1) + 1 otherwise.   (3)\n\nIn (3) the order in depth is represented by the mapping τ whose specific form will facilitate later algebraic steps. To illustrate the combination rule (3) and the mapping τ consider Fig. 1A and Fig. 2. Let us assume that the mask values W_{hd} are zero or one (although we will later also allow for continuous values). As depicted in Fig. 1A an object h with s_h = 1 occupies all image pixels with W_{hd} = 1 and does not occupy pixels with W_{hd} = 0. For all pixels with W_{hd} = 1 the vector ~T_h sets the pixels' values to a specific feature, e.g., to a specific color. The function τ maps all causes h with s_h = 0 to zero while all other causes are mapped to values within the interval [1, 2] (see Fig. 2). τ assigns a proximity value τ(S, h) > 0 to each present object.\n\nFigure 2: Visualization of the mapping τ. A and B show the two possible mappings for two causes, and C shows one possible mapping for four causes.\n\nFor a given pixel d the combination rule (3) simply states that of all objects with W_{hd} = 1, the most proximal is used to set the pixel property. Given the latent variables and the noiseless image ~T(S, Θ), we take the observed variables Y = (~y_1, ..., ~y_D) to be drawn independently from a Gaussian distribution (which is the usual choice for component extraction systems):\n\np(Y | S, Θ) = ∏_{d=1}^D p(~y_d | ~T_d(S, Θ)),   p(~y | ~t) = N(~y; ~t, σ²·1) .   (4)\n\nEquations (1) to (4) represent a generative model for occlusion.\n\n3 Maximum Likelihood\n\nOne approach to learning the parameters Θ = (W, T) of this model from data Y = {Y^(n)}_{n=1,...,N} is to use Maximum Likelihood learning, that is,\n\nΘ* = argmax_Θ {L(Θ)}  with  L(Θ) = log p(Y^(1), ..., Y^(N) | Θ) .   (5)\n\nHowever, as there is usually a large number of objects that can potentially be present in the training images, and as the likelihood involves summing over all combinations of objects and associated orderings, the computation of (5) is typically intractable. Moreover, even if it were tractably computable, optimization of the likelihood is made problematic by an analytical intractability arising from the fact that the occlusion non-linearity is non-differentiable. The following section describes how to side-step the computational intractability within the standard Expectation Maximization (EM) formalism for maximum likelihood learning, using a truncated expansion of sums for the sufficient statistics. 
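For concreteness, the noiseless combination rule (3) with the proximity mapping τ can be sketched as follows. This is a non-authoritative sketch with our own function names; following the bars-test data described later, pixels covered by no present object are rendered black:

```python
import numpy as np

def tau(s, sigma, h):
    """Proximity mapping tau(S, h) of equation (3): absent objects map to 0,
    present objects map into [1, 2], with closer objects mapped higher."""
    if s[h] == 0:
        return 0.0
    n = int(np.sum(s))
    if n == 1:
        return 1.5
    return (sigma[h] - 1) / (n - 1) + 1

def render(s, sigma, W, T):
    """Noiseless image of equation (3): at each pixel d, the object h0
    maximizing tau(S, h) * W[h, d] sets the pixel's feature vector."""
    H, D = W.shape
    prox = np.array([tau(s, sigma, h) for h in range(H)])
    scores = prox[:, None] * W                  # (H, D) proximity-weighted masks
    h0 = np.argmax(scores, axis=0)              # winning object per pixel
    img = W[h0, np.arange(D), None] * T[h0]     # W_{h0,d} * T_{h0}, shape (D, C)
    img[scores[h0, np.arange(D)] <= 0] = 0.0    # pixels covered by nothing stay black
    return img
```

For two overlapping binary masks, the object with the larger τ value (the nearer one, under the dictionary-of-ranks convention above) wins the overlap region, which is exactly the occlusion behavior the rule is meant to express.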
Furthermore, as the M-Step of EM requires gradients to be computed, the section also describes how to side-step the analytical intractability by an approximate version of the model's non-linearity.\n\nTo find the parameters Θ* at least approximately, we use the variational EM formalism (e.g., [7]) and introduce the free-energy function F(Θ, q) which is a function of Θ and an unknown distribution q(S^(1), ..., S^(N)) over the hidden variables. F(Θ, q) is a lower bound of the likelihood L(Θ). Approximations introduced later on can be interpreted as choosing specific functions q, although (for brevity) we will not make this relation explicit. In the model described above, in which each image is drawn independently and identically, q(S^(1), ..., S^(N)) = ∏_n q_n(S^(n), Θ′), which is taken to be parameterized by Θ′. The free-energy can thus be written as:\n\nF(Θ, q) = Σ_{n=1}^N [ Σ_S q_n(S, Θ′) ( log p(Y^(n) | S, Θ) + log p(S | Θ) ) ] + H(q) ,   (6)\n\nwhere the function H(q) = −Σ_n Σ_S q_n(S, Θ′) log(q_n(S, Θ′)) (the Shannon entropy) is independent of Θ. Note that Σ_S in (6) sums over all possible states of S = (~s, σ̂), i.e., over all binary vectors and all associated permutations in depth. This is the source of the computational intractability. In the EM scheme F(Θ, q) is maximized alternately with respect to the distribution, q, in the E-step (while the parameters, Θ, are kept fixed) and with respect to parameters, Θ, in the M-step (while q is kept fixed). It can be shown that an EM iteration increases the likelihood or leaves it unchanged. In practical applications EM is found to increase the likelihood to likelihood maxima, although these can be local.\n\nM-Step. 
The M-Step of EM, in which the free-energy, F, is optimized with respect to the parameters, is canonically derived by taking derivatives of F with respect to the parameters. Unfortunately, this standard procedure is not directly applicable because of the non-linear nature of occlusion as reflected by the combination rule (3). However, it is possible to approximate the combination rule by the differentiable function,\n\n~T^ρ_d(S, Θ) := ( Σ_{h=1}^H (τ(S, h) W_{hd})^ρ W_{hd} ~T_h ) / ( Σ_{h=1}^H (τ(S, h) W_{hd})^ρ ) .   (7)\n\nNote that for ρ → ∞ the function ~T^ρ_d(S, Θ) is equal to the combination rule in (3). ~T^ρ_d(S, Θ) is differentiable w.r.t. the parameters W_{hd} and T^c_h (c ∈ {1, ..., C}), and for large ρ the following approximations apply:\n\n∂/∂W_{id} ~T^ρ_d(S, Θ) ≈ A^ρ_{id}(S, W) ~T_i ,   ∂/∂T^c_i ~T^ρ_d(S, Θ) ≈ A^ρ_{id}(S, W) W_{id} ~e_c ,\nwith  A^ρ_{id}(S, W) := (τ(S, i) W_{id})^ρ / Σ_{h=1}^H (τ(S, h) W_{hd})^ρ ,   A_{id}(S, W) := lim_{ρ→∞} A^ρ_{id}(S, W) ,   (8)\n\nwhere ~e_c is a unit vector in feature space with entry 1 at position c and zero elsewhere (the approximations on the left-hand-side above become equalities for ρ → ∞). We can now compute approximations to the derivatives of F(Θ, q). 
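The softened combination rule (7) and the responsibilities A^ρ_id of (8) admit a compact sketch. This is an illustration under our own naming (`smoothed_render`, with `prox[h]` standing for τ(S, h)); as ρ grows, the softmax-like weighting tends to a hard per-pixel winner:

```python
import numpy as np

def smoothed_render(prox, W, T, rho):
    """Differentiable approximation (7) of the occlusive combination rule.

    A[i, d] holds the terms A^rho_id(S, W) of equation (8): a normalized,
    softmax-like responsibility that approaches a hard argmax as
    rho -> infinity.
    """
    scores = np.power(prox[:, None] * W, rho)            # (tau * W)^rho, (H, D)
    A = scores / np.maximum(scores.sum(axis=0), 1e-12)   # normalize per pixel d
    T_rho = (A * W).T @ T                                # equation (7), (D, C)
    return T_rho, A
```

At overlap pixels the nearer object's responsibility dominates exponentially in ρ, which is why the gradients (8) concentrate on the occluding cause.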
For large values of ρ the following holds:\n\n∂/∂W_{id} F(Θ, q) ≈ Σ_{n=1}^N [ Σ_S q_n(S, Θ′) ( ∂/∂W_{id} ~T^ρ_d(S, Θ) )^T ~f(~y^(n)_d, ~T^ρ_d(S, Θ)) ] ,   (9)\n\n∂/∂T^c_i F(Θ, q) ≈ Σ_{n=1}^N [ Σ_S q_n(S, Θ′) Σ_{d=1}^D ( ∂/∂T^c_i ~T^ρ_d(S, Θ) )^T ~f(~y^(n)_d, ~T^ρ_d(S, Θ)) ] ,   (10)\n\nwhere  ~f(~y^(n), ~t) := ∂/∂~t log p(~y^(n) | ~t) = σ^{−2} (~y^(n) − ~t) .\n\nSetting the derivatives (9) and (10) to zero and inserting equations (8) yields the following necessary conditions for a maximum of the free energy that hold in the limit ρ → ∞:\n\nW_{id} = ( Σ_n ⟨A_{id}(S, W)⟩_{q_n} ~T^T_i ~y^(n)_d ) / ( Σ_n ⟨A_{id}(S, W)⟩_{q_n} ~T^T_i ~T_i ) ,\n~T_i = ( Σ_n Σ_d ⟨A_{id}(S, W)⟩_{q_n} W_{id} ~y^(n)_d ) / ( Σ_n Σ_d ⟨A_{id}(S, W)⟩_{q_n} (W_{id})² ) .   (11)\n\nNote that equations (11) are not straight-forward update rules. However, we can use them in the fixed-point sense and approximate the parameters which appear on the right-hand-side of the equations using the values from the previous iteration.\n\nEquations (11), together with the exact posterior q_n(S, Θ′) = p(S | ~y^(n), Θ′), represent a maximum-likelihood based learning algorithm for the generative model (1) to (4). Note, however, that due to the multiplication of the weights and the mask, W_{hd} ~T_h in (3), there is degeneracy in the parameters: given h the combination ~T_d remains unchanged for the operation ~T_h → α ~T_h and W_{hd} → W_{hd}/α with α ≠ 0. To remove the degeneracy we set after each iteration:\n\nW^new_{hd} = W_{hd} / W̄_h ,  ~T^new_h = W̄_h ~T_h ,  where  W̄_h = (1/|I|) Σ_{d∈I} W_{hd}  with  I = {d | W_{hd} > 0.5} .   (12)\n\nFor reasons that will briefly be discussed later, the use of W̄_h instead of, e.g., W^max_h = max_d{W_{hd}} is advantageous for some data, although for many other types of data W^max_h works equally well.\n\nE-Step. The crucial entities that have to be computed for update equations (11) are the sufficient statistics ⟨A_{id}(S, W)⟩_{q_n}, i.e., the expectation of the function A_{id}(S, W) in (8) over the distribution of hidden states S. In order to derive a computationally tractable learning algorithm the expectation ⟨A_{id}(S, W)⟩_{q_n} is re-written and approximated as follows,\n\n⟨A_{id}(S, W)⟩_{q_n} = ( Σ_S p(S, Y^(n) | Θ′) A_{id}(S, W) ) / ( Σ_{S̃} p(S̃, Y^(n) | Θ′) )\n≈ ( Σ_{S, |~s|≤χ} p(S, Y^(n) | Θ′) A_{id}(S, W) ) / ( Σ_{S̃, |~s̃|≤χ} p(S̃, Y^(n) | Θ′) ) .   (13)\n\nThat is, in order to approximate (13), the problematic sums in the numerator and denominator have been truncated. We only sum over states ~s with no more than χ non-zero entries. Approximation (13) replaces the intractable exact E-step by one whose computational cost scales only polynomially with H (roughly cubically for χ = 3). As for other approximate EM approaches, there is no guarantee that this approximation will always result in an increase of the data likelihood. For data points that were generated by a small number of causes on average we can, however, expect the approximation to match an exact E-step with increasing accuracy the closer we get to the optimum. For reasons highlighted earlier, such data will be typical in image modelling. A truncation approach similar to (13) has successfully been used in the context of the maximal causes generative model in [8]. 
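A minimal sketch of the truncated sum (13): the state space is restricted to at most χ active causes, so its size grows only polynomially in H. Here `log_joint` and `A_fn` are hypothetical callables standing in for log p(S, Y^(n) | Θ′) and A_id(S, W); the function names are our own:

```python
import itertools
import numpy as np

def truncated_states(H, chi):
    """Enumerate all S = (s, sigma) with at most chi active causes,
    pairing every subset of present objects with every depth ordering."""
    for k in range(chi + 1):
        for subset in itertools.combinations(range(H), k):
            s = np.zeros(H, dtype=int)
            s[list(subset)] = 1
            for ranks in itertools.permutations(range(1, k + 1)):
                yield s, dict(zip(subset, ranks))

def truncated_expectation(H, chi, log_joint, A_fn):
    """<A>_{qn} of equation (13): joint-weighted average of A_fn over the
    truncated state space, normalized by the truncated evidence."""
    states = list(truncated_states(H, chi))
    logp = np.array([log_joint(s, sigma) for s, sigma in states])
    w = np.exp(logp - logp.max())         # stabilized unnormalized weights
    w /= w.sum()                          # truncated posterior over states
    return sum(wi * A_fn(s, sigma) for wi, (s, sigma) in zip(w, states))
```

For H = 8 and χ = 3 the truncated space has Σ_{k≤3} C(8, k)·k! = 401 states, versus the full sum over all 2^8 presence vectors with all their depth permutations.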
Also in the case of occlusion we will later see that in numerical experiments using approximation (13) the true generating causes are indeed recovered.\n\n4 Experiments\n\nIn order to evaluate the algorithm it has been applied to artificial data, where its performance can be compared to ground truth, and to more realistic visual data. In all the experiments we use image pixels as input variables ~y_d. The entries of the observed variables ~y_d are set by the pixels' rgb-color vector, ~y_d ∈ [0, 1]³. In all trials of all experiments the initial values of the mask parameters W_{hd} and the feature parameters T^c_h were independently and uniformly drawn from the interval [0, 1].\n\nLearning and annealing. The free-energy landscape traversed by EM algorithms is often multimodal. Therefore EM algorithms can converge to local optima. However, this problem can be alleviated using deterministic annealing as described in [9, 10]. For the model under consideration here annealing amounts to the substitutions π → π^β, (1 − π) → (1 − π)^β, and (1/σ²) → (β/σ²), with β = 1/T̂ in the E-step equations. During learning, the 'temperature' parameter T̂ is decreased from an initial value T̂^init to 1. To update the parameters W and T we applied the M-step equations (11). For the sufficient statistics ⟨A_{id}(S, W)⟩_{q_n} we used approximation (13) with A^ρ_{id}(S, W) in (8) instead of A_{id}(S, W), and with χ = 3 if not stated otherwise. The parameter ρ was increased during learning with ρ = 1/(1 − β) (with a maximum of ρ = 20 to avoid numerical instabilities). In all experiments we used 100 EM iterations and decreased T̂ linearly except for 10 initial iterations at T̂ = T̂^init and 20 final iterations at T̂ = 1. 
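The schedule just described can be sketched as follows. The defaults mirror the colored-bars setting reported below (100 iterations, T̂^init = D/2 = 8 for D = 16); the function and its argument names are our own:

```python
def annealing_schedule(n_iter=100, T_init=8.0, n_start=10, n_end=20, rho_max=20.0):
    """Per-iteration (T_hat, beta, rho): T_hat is held at T_init for the
    first n_start iterations, decreased linearly to 1, and held at 1 for
    the last n_end iterations; beta = 1 / T_hat, and rho = 1 / (1 - beta)
    is capped at rho_max to avoid numerical instabilities."""
    ramp = n_iter - n_start - n_end
    schedule = []
    for it in range(n_iter):
        if it < n_start:
            T_hat = T_init
        elif it < n_start + ramp:
            frac = (it - n_start + 1) / ramp
            T_hat = T_init + frac * (1.0 - T_init)   # linear decrease to 1
        else:
            T_hat = 1.0
        beta = 1.0 / T_hat
        rho = rho_max if beta >= 1.0 else min(1.0 / (1.0 - beta), rho_max)
        schedule.append((T_hat, beta, rho))
    return schedule
```

At high temperature the posterior is flattened (small β) and the combination rule is strongly smoothed (small ρ); both sharpen together as T̂ approaches 1.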
In addition to annealing, a small amount of independent and identically distributed Gaussian noise (standard deviation 0.01) was added to the masks and the features, W_{hd} and T^c_h, to help escape local optima. This parameter noise was linearly decreased to zero during the last 20 iterations of each trial.\n\nThe colored bars test. The component extraction capabilities of the model were tested using the colored bars test. This test is a generalization of the classical bars test [11] which has become a popular benchmark task for non-linear component extraction. In the standard bars test with H = 8 bars the input data are 16-dimensional vectors, representing a 4 × 4 grid of pixels, i.e., D = 16. The single bars appear at the 4 vertical and 4 horizontal positions. For the colored bars test, the bars have colors ~T^gen_h which are independently and uniformly drawn from the rgb-color-cube [0, 1]³. Once chosen, they remain fixed for the generation of the data set. For each image a bar appears independently with a probability π = 2/8 which results in two bars per image on average (the standard value in the literature). For the bars active in an image, a ranking in depth is randomly and uniformly chosen from the permutation group. The color of each pixel is determined by the least distant bar and is black if the pixel is occupied by no bar. N = 500 images were generated for learning and Fig. 3A shows a random selection of 13 examples. The learning algorithm was applied to the colored bars test with H = 8 hidden units and D = 16 input units. The observation noise was set\n\nFigure 3: Application to the colored bars test. A Selection of 13 of the N = 500 data points used for learning. B Changes of the parameters W and T for the algorithm with H = 8 hidden units. Each row shows W and T for the specified EM iteration. 
C Feature vectors at the iterations in B displayed as points in color space (for visualization we used the 2-D hue and saturation plane of the HSV color space). Crosses are the real generating values, black circles the current model values ~T_h, and grey circles those of the previous iterations.\n\nto σ = 0.05 and learning was initialized with T̂^init = (1/2)D. The inferred approximate maximum-likelihood parameters converged to values close to the generating parameters in 44 of 50 trials. In 6 trials the algorithm represented 7 of the 8 causes. Its success rate, or reliability, is thus 88%. Fig. 3B shows the time-course of a typical trial during learning. As can be observed, the mask values W and the feature values T converged to values close to the generating ones. For data with added Gaussian pixel noise (σ^gen = σ = 0.05) the algorithm converges to values representing all causes in 48 of 50 trials (96% reliability). A higher average number of causes per input reduced reliability. A maximum of three causes (on average) was used for the noiseless bars test. This is considered a difficult task in the standard bars test. With otherwise the same parameters our algorithm had a reliability of 26% (50 trials) on this data. Performance seemed limited by the difficulty of the data rather than by the limitations of the used approximation. We could not increase the reliability of the algorithm when we increased the accuracy of (13) by setting χ = 4 (instead of χ = 3). Reliability seemed much more affected by changes to parameters for annealing and parameter noise, i.e., by changes to those parameters that affect the additional mechanisms to avoid local optima.\n\nThe standard bars test. Instead of choosing the bar colors randomly as above, they can also be set to specific values. In particular, if all bar colors are white, ~T_h = (1, 1, 1)^T, the classical version of the bars test is recovered. 
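The bars-test data generation described above can be sketched as below (function names are our own; painter's-style rendering from far to near stands in for the depth ordering). Replacing the random `colors` by all-ones vectors recovers the standard test:

```python
import numpy as np

def make_colored_bars(N=500, grid=4, pi=2 / 8, seed=0):
    """Generate colored-bars images as described in the text: H = 2 * grid
    bars (horizontal and vertical), each present with probability pi at a
    random depth; every pixel takes the color of its least distant bar,
    and stays black where no bar is present."""
    rng = np.random.default_rng(seed)
    H, D = 2 * grid, grid * grid
    # bar masks: first `grid` horizontal bars, then `grid` vertical ones
    masks = np.zeros((H, grid, grid))
    for i in range(grid):
        masks[i, i, :] = 1.0
        masks[grid + i, :, i] = 1.0
    masks = masks.reshape(H, D)
    colors = rng.random((H, 3))              # fixed generating colors T_h^gen
    images = np.zeros((N, D, 3))
    for n in range(N):
        present = np.flatnonzero(rng.random(H) < pi)
        for h in rng.permutation(present):   # paint far-to-near; nearest wins
            images[n, masks[h] > 0] = colors[h]
    return images, masks, colors
```

With grid = 4 and π = 2/8 this matches the eight-bar, two-bars-per-image-on-average setting; grid = 5 and π = 2/10 give the ten-bar variant.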
Note that the learning algorithm can be applied to this standard form without modification. When the generating parameters were as above (eight bars, probability of a bar to be present 2/8, N = 500), all bars were successfully extracted in 42 of 50 trials (84% reliability). For a bars test with ten bars, D = 5 × 5, a probability of 2/10 for each bar to be present, and N = 500 data points, the algorithm with model parameters as above extracted all bars in 43 of 50 trials (86% reliability; mean number of extracted bars 9.5). Reliability for this test increased when we increased the number of training images. For N = 1000 instead of 500 reliability increased to 94% (50 trials; mean number of extracted bars 9.9). The bars test with ten bars is probably the one most frequently found in the literature. Linear and non-linear component extraction approaches are compared, e.g., in [12, 8] and usually achieve lower reliability values than the presented algorithm. Classical ICA and PCA algorithms investigated in [13] never succeeded in extracting all bars. Relatively recent approaches can achieve reliability values higher than 90% but often only by introducing additional constraints (compare R-MCA [8], or constrained forms of NMF [14]).\n\nMore realistic input. One possible criticism of the bars tests above is that the bars are relatively simple objects. The purpose of this section is, therefore, to demonstrate the performance of the algorithm when images contain more complicated objects. Six objects were taken from the COIL-100 dataset [15] with relatively uniform color distribution (objects 2, 4, 47, 78, 94, 97; all with zero degree rotation). The images were scaled down to 15 × 15 pixels and randomly placed on a black background image of 25 × 25 pixels. Downscaling introduced blurred object edges and to remove this effect dark pixels were set to black. 
The training images were generated with each object being present with probability 2/6 and at a random depth. N = 500 such images were generated. Example images¹ are given in Fig. 4A. We applied the learning algorithm with H = 6, an initial temperature for annealing of T̂^init = (1/4)D, and parameters as above otherwise. Fig. 4B shows the development of parameter values during learning. As can be observed, the mask values converged to represent the different objects, and the feature vectors converged to values representing the mean object color. Note that the model is not matched to the dataset as each object has a fixed distribution of color values which is a poor match to a Gaussian distribution with a constant color mean. The model reacted by assigning part of the real color distribution to the mask values which are responsible for the 3-dimensional appearance of the masks (see Fig. 4B). Note that the normalization (12) was motivated by this observation because it can better tolerate high mask value variances. We ran 50 trials using different sets of N = 500 images generated as above. In 42 of the trials (84%) the algorithm converged to values representing all six objects together with appropriate values for their mean colors. In seven trials the algorithm converged to a local optimum (average number of extracted objects was 5.8). In 50 trials with 8 objects (we added objects 36 and 77 of the COIL-100 database) an algorithm with same parameters but H = 8 extracted all objects in 40 of the trials (reliability 80%, average number of extracted objects 7.7).\n\nFigure 4: Application to images of cluttered objects. A Selection of 14 of the N = 500 data points. B Parameter change displayed as in Fig. 3. C Change of feature vectors displayed as in Fig. 3.\n\n5 Discussion\n\nWe have studied learning in the generative model of occlusion (1) to (4). 
Parameters can be optimized given a collection of N images in which different sets of causes are present at different positions in depth. As briefly discussed earlier, the problem of occlusion has been addressed by other systems before. E.g., the approach in [3, 4] uses a fixed number of layers, so-called sprites, to model an order in depth. The approach assigns, to each pixel, probabilities that it has been generated by a specific sprite. Typically, the algorithms are applied to data which consist of images that have a small number of foreground objects (usually one or two) on a static or slowly changing background. Typical applications of the approach are figure-ground separation and the automatic removal of the background or foreground objects. The approach using sprites is in many aspects more general than the model presented in this paper. It includes, for instance, variable estimation for illumination and, importantly, addresses the problem of invariance by modeling object transformations. Regarding the modelling of object arrangements, our approach is, however, more general. The additional hidden variable used for object arrangements allows our model to be applied to images of cluttered scenes. The approach in [3, 4] assumes a fixed object arrangement, i.e., it assumes that each object has the same depth position in all training images. Our approach therefore addresses an aspect of visual data that is complementary to the aspects modeled in [3, 4]. Models that combine the advantages of\n\n¹ Note that this appears much easier for a human observer because he/she can also make use of object knowledge, e.g., of the gestalt law of proximity. 
The difficulty of the data would become obvious if all pixels in each image of the data set were permuted by a fixed permutation map.\n\nboth approaches thus promise interesting advancements, e.g., towards systems that can learn from video data in which objects change their positions in depth.\n\nAnother interesting aspect of the model presented in this work is its close connection to component extraction methods. Algorithms such as SC, NMF or maximal causes analysis (MCA; [8]) use superpositions of elementary components to explain the data. ICA and SC have prominently been applied to explain neural response properties, and NMF is a popular approach to learn components for visual object recognition [e.g. 14, 16]. Our model follows these multiple-causes methods by assuming the data to consist of independently generated components. It distinguishes itself, however, by the way in which these components are assumed to combine. ICA, SC, NMF and many other models assume linear superposition, MCA uses a max-function instead of the sum, and other systems use noisy-or combinations. In the class of multiple-causes approaches our model is the first to generalize the combination rule to one that models occlusion explicitly. This required an additional variable for depth and the introduction of two sets of parameters: masks and features. Note that in the context of multiple-causes models, masks have recently been introduced in conjunction with ICA [17] in order to model local contrast correlation in image patches. For our model, the combination of masks and vectorial feature parameters allows for applications to more general sets of data than those used for classical component extraction. In numerical experiments we have used color images for instance. However, we can apply our algorithm also to grey-level data such as used for other algorithms. 
This allows for a direct quantitative comparison of the novel algorithm with state-of-the-art component extraction approaches. The reported results for the standard bars test show the competitiveness of our approach despite its larger set of parameters [compare, e.g., 12, 8]. A limitation of the training method used is its assumption of relatively sparsely active hidden causes. This limitation is to some extent shared, e.g., with SC or sparse versions of NMF. Experiments with higher χ values in (13) indicate, however, that the performance of the algorithm is limited not so much by the accuracy of the E-step as by the more challenging likelihood landscape for less sparse data.

For applications to visual data, color is the most straightforward feature to model. Possible alternatives are, however, Gabor feature vectors, which model object textures (see, e.g., [18] and references therein), SIFT features [19], or vectors using combinations of color and texture [e.g. 6]. Depending on the choice of feature vectors and the application domain, it might be necessary to generalize the model. It is, for instance, straightforward to introduce more complex feature vectors. Although one feature, e.g. one color, per cause can be a suitable model for many applications, for other applications it can also make sense to use multiple feature vectors per cause. In the extreme case, as many feature vectors as pixels could be used, i.e., ~Th → ~Thd. The derivation of update rules for such features would proceed along the same lines as the derivations for single features ~Th. Furthermore, individual prior parameters for the frequency of object appearances could be introduced. Such parameters could be trained with an approach similar to the one in [8]. Additional parameters could also be introduced to model different prior probabilities for different arrangements in depth.
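As a concrete illustration, the kind of occlusive generative process described in this paper can be sketched as follows. This is a minimal sketch under simplified assumptions, not the paper's exact model: cause presence is drawn from independent Bernoulli priors, the depth order is a uniform permutation, masks are hard-thresholded, and each cause carries a single color feature. All variable names, dimensions, and the pixel-wise compositing rule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): H causes, D pixels, C color channels.
H, D, C = 4, 25, 3

# Hypothetical parameters: per-cause pixel masks m[h, d] in [0, 1]
# and one feature (color) vector T[h] per cause.
masks = rng.random((H, D))
features = rng.random((H, C))
background = np.zeros((D, C))   # simple black background
pi = 0.3                        # prior probability that a cause is present

def sample_image():
    """Sample one image from a simplified occlusive generative process:
    causes appear independently, a uniform depth order is drawn, and each
    pixel takes the feature of the front-most present cause whose mask
    covers it (here: mask > 0.5, a crude hard threshold)."""
    present = rng.random(H) < pi      # which causes appear in this image
    depth = rng.permutation(H)        # random depth order: first = front-most
    image = background.copy()
    # Composite back-to-front so nearer causes overwrite farther ones.
    for h in depth[::-1]:
        if present[h]:
            covered = masks[h] > 0.5
            image[covered] = features[h]
    return image, present, depth

img, present, depth = sample_image()
print(img.shape)   # (25, 3): D pixels, each with a C-dimensional feature
```

Alterations such as those discussed above, e.g. fixing one hidden unit to the most distant depth position or learning per-cause appearance priors, would correspond to constraining `depth` or replacing the shared `pi` in this sketch.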
An easy alteration would be, for instance, to always map one specific hidden unit to the most distant position in depth in order to model a background. Finally, the most interesting, but also most challenging, direction of generalization would be the inclusion of invariance principles. In its current form, the model shares with state-of-the-art component extraction algorithms the assumption that component locations are fixed. Especially for images of objects, changes in planar component positions have to be addressed in general. Possible approaches that have been used in the literature can, for instance, be found in [3, 4] in the context of occlusion modeling, in [20] in the context of NMF, and in [18] in the context of object recognition. Potential future application domains for our approach would, however, also include data sets in which component positions are fixed. E.g., in many benchmark databases for face recognition, faces are already in a normalized position. For component extraction, faces can be regarded as combinations of a background face 'occluded' by mouth, nose, and eye textures, which can themselves be occluded by beards, sunglasses, or hats.

In summary, the studied occlusion model advances generative modeling approaches to visual data by explicitly modeling object arrangements in depth. The approach complements established approaches to occlusion modeling in the literature by generalizing standard approaches to multiple-cause component extraction.

Acknowledgements. We gratefully acknowledge funding by the German Federal Ministry of Education and Research (BMBF) in the project 01GQ0840 (Bernstein Focus Neurotechnology Frankfurt), the Gatsby Charitable Foundation, and the Honda Research Institute Europe GmbH.

References

[1] B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images.
Nature, 381:607-609, 1996.

[2] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788-791, 1999.

[3] N. Jojic and B. Frey. Learning flexible sprites in video layers. Conf. on Computer Vision and Pattern Recognition, 1:199-206, 2001.

[4] C. K. I. Williams and M. K. Titsias. Greedy learning of multiple objects in images using robust statistics and factorial learning. Neural Computation, 16(5):1039-1062, 2004.

[5] K. Fukushima. Restoring partly occluded patterns: a neural network model. Neural Networks, 18(1):33-43, 2005.

[6] C. Eckes, J. Triesch, and C. von der Malsburg. Analysis of cluttered scenes using an elastic matching approach for stereo images. Neural Computation, 18(6):1441-1471, 2006.

[7] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer, 1998.

[8] J. Lücke and M. Sahani. Maximal causes for non-linear component extraction. Journal of Machine Learning Research, 9:1227-1267, 2008.

[9] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2):271-282, 1998.

[10] M. Sahani. Latent variable models for neural data analysis. PhD Thesis, Caltech, 1999.

[11] P. Földiák. Forming sparse representations by local anti-Hebbian learning. Biol Cybern, 64:165-170, 1990.

[12] M. W. Spratling. Learning image components for object recognition. Journal of Machine Learning Research, 7:793-815, 2006.

[13] S. Hochreiter and J. Schmidhuber. Feature extraction through LOCOCODE. Neural Computation, 11:679-714, 1999.

[14] P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. Journal of Machine Learning Research, 5:1457-1469, 2004.

[15] S. A. Nene, S. K. Nayar, and H. Murase.
Columbia object image library (COIL-100). Technical report, cucs-006-96, 1996.

[16] H. Wersing and E. Körner. Learning optimized features for hierarchical models of invariant object recognition. Neural Computation, 15(7):1559-1588, 2003.

[17] U. Köster, J. T. Lindgren, M. Gutmann, and A. Hyvärinen. Learning natural image structure with a horizontal product model. In Int. Conf. on Independent Component Analysis and Signal Separation (ICA), pages 507-514, 2009.

[18] P. Wolfrum, C. Wolff, J. Lücke, and C. von der Malsburg. A recurrent dynamic model for correspondence-based face recognition. Journal of Vision, 8(7):1-18, 2008.

[19] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[20] J. Eggert, H. Wersing, and E. Körner. Transformation-invariant representation and NMF. In Int. J. Conf. on Neural Networks (IJCNN), pages 2535-2539, 2004.