{"title": "Layered image motion with explicit occlusions, temporal consistency, and depth ordering", "book": "Advances in Neural Information Processing Systems", "page_first": 2226, "page_last": 2234, "abstract": "Layered models are a powerful way of describing natural scenes containing smooth surfaces that may overlap and occlude each other. For image motion estimation, such models have a long history but have not achieved the wide use or accuracy of non-layered methods. We present a new probabilistic model of optical flow in layers that addresses many of the shortcomings of previous approaches. In particular, we define a probabilistic graphical model that explicitly captures: 1) occlusions and disocclusions; 2) depth ordering of the layers; 3) temporal consistency of the layer segmentation. Additionally the optical flow in each layer is modeled by a combination of a parametric model and a smooth deviation based on an MRF with a robust spatial prior; the resulting model allows roughness in layers. Finally, a key contribution is the formulation of the layers using an image-dependent hidden field prior based on recent models for static scene segmentation. The method achieves state-of-the-art results on the Middlebury benchmark and produces meaningful scene segmentations as well as detected occlusion regions.", "full_text": "Layered Image Motion with Explicit Occlusions,\n\nTemporal Consistency, and Depth Ordering\n\nDeqing Sun, Erik B. Sudderth, and Michael J. Black\nDepartment of Computer Science, Brown University\n{dqsun,sudderth,black}@cs.brown.edu\n\nAbstract\n\nLayered models are a powerful way of describing natural scenes containing\nsmooth surfaces that may overlap and occlude each other. For image motion es-\ntimation, such models have a long history but have not achieved the wide use or\naccuracy of non-layered methods. We present a new probabilistic model of optical\n\ufb02ow in layers that addresses many of the shortcomings of previous approaches. 
In\nparticular, we de\ufb01ne a probabilistic graphical model that explicitly captures: 1)\nocclusions and disocclusions; 2) depth ordering of the layers; 3) temporal con-\nsistency of the layer segmentation. Additionally the optical \ufb02ow in each layer is\nmodeled by a combination of a parametric model and a smooth deviation based\non an MRF with a robust spatial prior; the resulting model allows roughness in\nlayers. Finally, a key contribution is the formulation of the layers using an image-\ndependent hidden \ufb01eld prior based on recent models for static scene segmentation.\nThe method achieves state-of-the-art results on the Middlebury benchmark and\nproduces meaningful scene segmentations as well as detected occlusion regions.\n\n1 Introduction\n\nLayered models of scenes offer signi\ufb01cant bene\ufb01ts for optical \ufb02ow estimation [8, 11, 25]. Splitting\nthe scene into layers enables the motion in each layer to be de\ufb01ned more simply, and the estimation\nof motion boundaries to be separated from the problem of smooth \ufb02ow estimation. Layered models\nalso make reasoning about occlusion relationships easier. In practice, however, none of the current\ntop performing optical \ufb02ow methods use a layered approach [2]. The most accurate approaches\nare single-layered, and instead use some form of robust smoothness assumption to cope with \ufb02ow\ndiscontinuities [5]. This paper formulates a new probabilistic, layered motion model that addresses\nthe key problems of previous layered approaches. At the time of writing, it achieves the lowest\naverage error of all tested approaches on the Middlebury optical \ufb02ow benchmark [2]. In particular,\nthe accuracy at occlusion boundaries is signi\ufb01cantly better than previous methods. 
By segmenting the observed scene, our model also identifies occluded and disoccluded regions.

Layered models provide a segmentation of the scene and this segmentation, because it corresponds to scene structure, should persist over time. However, this persistence is not a benefit if one is only computing flow between two frames; this is one reason that multi-layer models have not surpassed their single-layer competitors on two-frame benchmarks. Without loss of generality, here we use three-frame sequences to illustrate our method. In practice, these three frames can be constructed from an image pair by computing both the forward and backward flow. The key is that this gives two segmentations of the scene, one at each time instant, both of which must be consistent with the flow. We formulate this temporal layer consistency probabilistically. Note that the assumption of temporal layer consistency is much more realistic than previous assumptions of temporal motion consistency [4]; while the scene motion can change rapidly, scene structure persists.

One of the main motivations for layered models is that, conditioned on the segmentation into layers, each layer can employ affine, planar, or other strong models of optical flow. By applying a single smooth motion across the entire layer, these models combine information over long distances and interpolate behind occlusions. Such rigid parametric assumptions, however, are too restrictive for real scenes. Instead one can model the flow within each layer as smoothly varying [26]. While the resulting model is more flexible than traditional parametric models, we find that it is still not as accurate as robust single-layer models. 
Consequently, we formulate a hybrid model that combines a\nbase af\ufb01ne motion with a robust Markov random \ufb01eld (MRF) model of deformations from af\ufb01ne [6].\nThis roughness in layers model, which is similar in spirit to work on plane+parallax [10, 14, 19],\nencourages smooth \ufb02ow within layers but allows signi\ufb01cant local deviations.\n\nBecause layers are temporally persistent, it is also possible to reason about their relative depth or-\ndering. In general, reliable recovery of depth order requires three or more frames. Our probabilistic\nformulation explicitly orders layers by depth, and we show that the correct order typically produces\nmore probable (lower energy) solutions. This also allows explicit reasoning about occlusions, which\nour model predicts at locations where the layer segmentations for consecutive frames disagree.\n\nMany previous layered approaches are not truly \u201clayered\u201d: while they segment the image into mul-\ntiple regions with distinct motions, they do not model what is in front of what. For example, widely\nused MRF models [27] encourage neighboring pixels to occupy the same region, but do not capture\nrelationships between regions. In contrast, building on recent state-of-the-art results in static scene\nsegmentation [21], our model determines layer support via an ordered sequence of occluding binary\nmasks. These binary masks are generated by thresholding a series of random, continuous functions.\nThis approach uses image-dependent Gaussian random \ufb01eld priors and favors partitions which ac-\ncurately match the statistics of real scenes [21]. Moreover, the continuous layer support functions\nplay a key role in accurately modeling temporal layer consistency. 
The resulting model produces\naccurate layer segmentations that improve \ufb02ow accuracy at occlusion boundaries, and recover mean-\ningful scene structure.\n\nAs summarized in Figure 1, our method is based on a principled, probabilistic generative model\nfor image sequences. By combining recent advances in dense \ufb02ow estimation and natural image\nsegmentation, we develop an algorithm that simultaneously estimates accurate \ufb02ow \ufb01elds, detects\nocclusions and disocclusions, and recovers the layered structure of realistic scenes.\n\n2 Previous Work\n\nLayered approaches to motion estimation have long been seen as elegant and promising, since spatial\nsmoothness is separated from the modeling of discontinuities and occlusions. Darrell and Pentland\n[7, 8] provide the \ufb01rst full approach that incorporates a Bayesian model, \u201csupport maps\u201d for seg-\nmentation, and robust statistics. Wang and Adelson [25] clearly motivate layered models of image\nsequences, while Jepson and Black [11] formalize the problem using probabilistic mixture models.\nA full review of more recent methods is beyond our scope [1, 3, 12, 13, 16, 17, 20, 24, 27, 29].\n\nEarly methods, which use simple parametric models of image motion within layers, are not highly\naccurate. Observing that rigid parametric models are too restrictive for real scenes, Weiss [26] uses a\nmore \ufb02exible Gaussian process to describe the motion within each layer. Even using modern imple-\nmentation methods [22] this approach does not achieve state-of-the-art results. Allocating a separate\nlayer for every small surface discontinuity is impractical and fails to capture important global scene\nstructure. 
Our approach, which allows "roughness" within layers rather than "smoothness," provides a compromise that captures coarse scene structure as well as fine within-layer details.

Figure 1: Left: Graphical representation for the proposed layered model. Right: Illustration of variables from the graphical model for the "Schefflera" sequence. Labeled sub-images correspond to nodes in the graph. The left column shows the flow fields for three layers, color coded as in [2]. The g and s images illustrate the reasoning about layer ownership (see text). The composite flow field (u, v) and layer labels (k) are also shown.

One key advantage of layered models is their ability to realistically model occlusion boundaries. To do this properly, however, one must know the relative depth order of the surfaces. Performing inference over the combinatorial range of possible occlusion relationships is challenging and, consequently, only a few layered flow models explicitly encode relative depth [12, 30]. Recent work revisits the layered model to handle occlusions [9], but does not explicitly model the layer ordering or achieve state-of-the-art performance on the Middlebury benchmark. While most current optical flow methods are "two-frame," layered methods naturally extend to longer sequences [12, 29, 30].

Layered models all have some way of making either a hard or soft assignment of pixels to layers. Weiss and Adelson [27] introduce spatial coherence to these layer assignments using a spatial MRF model. However, the Ising/Potts MRF they employ assigns low probability to typical segmentations of natural scenes [15]. 
Adapting recent work on static image segmentation by Sudderth and Jordan [21], we instead generate spatially coherent, ordered layers by thresholding a series of random continuous functions. As in the single-image case, this approach realistically models the size and shape properties of real scenes. For motion estimation there are additional advantages: it allows accurate reasoning about occlusion relationships and modeling of temporal layer consistency.

3 A Layered Motion Model

Building on this long sequence of prior work, our generative model of layered image motion is summarized in Figure 1. Below we describe how the generative model captures piecewise smooth deviation of the layer motion from parametric models (Sec. 3.1), depth ordering and temporal consistency of layers (Sec. 3.2), and regions of occlusion and disocclusion (Sec. 3.3).

3.1 Roughness in Layers

Our approach is inspired by Weiss's model of smoothness in layers [26]. Given a sequence of images I_t, 1 \le t \le T, we model the evolution from the current frame I_t to the subsequent frame I_{t+1} via K locally smooth, but potentially globally complex, flow fields. Let u_{tk} and v_{tk} denote the horizontal and vertical flow fields, respectively, for layer k at time t. The corresponding flow vector for pixel (i, j) is then denoted by (u^{ij}_{tk}, v^{ij}_{tk}).

Each layer's flow field is drawn from a distribution chosen to encourage piecewise smooth motion. For example, a pairwise Markov random field (MRF) would model the horizontal flow field as

p(u_{tk}) \propto \exp\{-E_{mrf}(u_{tk})\} = \exp\Big\{ -\frac{1}{2} \sum_{(i,j)} \sum_{(i',j') \in \Gamma(i,j)} \rho_s\big(u^{ij}_{tk} - u^{i'j'}_{tk}\big) \Big\}.   (1)

Here, \Gamma(i, j) is the set of neighbors of pixel (i, j), often its four nearest neighbors. 
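To make Eq. (1) concrete, below is a minimal NumPy sketch of the pairwise MRF energy over a 4-connected grid. The generalized Charbonnier penalty used as the robust function is our own hedged choice (its exact constants appear only in the Supplemental Material, so `sigma` and `a` here are illustrative):

```python
import numpy as np

def charbonnier(x, sigma=1.0, a=0.45):
    # Generalized Charbonnier penalty: rho(x) = (x^2/sigma^2 + 1)^a.
    # The constants sigma and a are hypothetical placeholders.
    return (x**2 / sigma**2 + 1.0) ** a

def mrf_energy(u, rho=charbonnier):
    """E_mrf(u) in the spirit of Eq. (1): sum of robust penalties over
    4-connected neighbor differences of a flow field u (H x W array).
    Each unordered neighbor pair is counted once, which absorbs the
    1/2 factor over ordered pairs in Eq. (1)."""
    dx = u[:, 1:] - u[:, :-1]   # horizontal neighbor differences
    dy = u[1:, :] - u[:-1, :]   # vertical neighbor differences
    return rho(dx).sum() + rho(dy).sum()
```

As expected, a smoothly varying field receives lower energy than the same field corrupted by noise.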
The potential \rho_s(\cdot) is some robust function [5] that encourages smoothness, but allows occasional significant deviations from it. The vertical flow field v_{tk} can then be modeled via an independent MRF prior as in Eq. (1), as justified by the statistics of natural flow fields [18].

While such MRF priors are flexible, they capture very little dependence between pixels separated by even moderate image distances. In contrast, real scenes exhibit coherent motion over large scales, due to the motion of (partially) rigid objects in the world. To capture this, we associate an affine (or planar) motion model, with parameters \theta_{tk}, to each layer k. We then use an MRF to allow piecewise smooth deformations from the globally rigid assumptions of affine motion:

E_{aff}(u_{tk}, \theta_{tk}) = \frac{1}{2} \sum_{(i,j)} \sum_{(i',j') \in \Gamma(i,j)} \rho_s\big( (u^{ij}_{tk} - \bar{u}^{ij}_{\theta_{tk}}) - (u^{i'j'}_{tk} - \bar{u}^{i'j'}_{\theta_{tk}}) \big).   (2)

Here, \bar{u}^{ij}_{\theta_{tk}} denotes the horizontal motion predicted for pixel (i, j) by an affine model with parameters \theta_{tk}. Unlike classical models that assume layers are globally well fit by a single affine motion [6, 25], this prior allows significant, locally smooth deviations from rigidity. Unlike the basic smoothness prior of Eq. (1), this semiparametric construction allows effective global reasoning about non-contiguous segments of partially occluded objects. More sophisticated flow deformation priors may also be used, such as those based on robust non-local terms [22, 28].

3.2 Layer Support and Spatial Contiguity

The support for whether or not a pixel belongs to a given layer k is defined using a hidden random field g_k. 
We associate each of the first K - 1 layers at time t with a random continuous function g_{tk}, defined over the same domain as the image. This hidden support field is illustrated in Figure 1. We assume a single, unique layer is observable at each location and that the observed motion of that pixel is determined by its assigned layer. Analogous to level set representations, the discrete support of each layer is determined by thresholding g_{tk}: pixel (i, j) is considered visible when g_{tk}(i, j) \ge 0. Let s_{tk}(i, j) equal one if layer k is visible at pixel (i, j), and zero otherwise; note that \sum_k s_{tk}(i, j) = 1. For pixels (i, j) for which g_{tk}(i, j) < 0, we necessarily have s_{tk}(i, j) = 0.

We define the layers to be ordered with respect to the camera, so that layer k occludes layers k' > k. Given the full set of support functions g_{tk}, the unique layer k^{ij}_{t*} for which s_{t k^{ij}_{t*}}(i, j) = 1 is then

k^{ij}_{t*} = \min\big( \{ k \mid 1 \le k \le K - 1,\ g_{tk}(i, j) \ge 0 \} \cup \{K\} \big).   (3)

Note that layer K is essentially a background layer that captures all pixels not assigned to the first K - 1 layers. For this reason, only K - 1 hidden fields g_{tk} are needed (see Figure 1).

Our use of thresholded, random continuous functions to define layer support is partially motivated by known shortcomings of discrete Ising/Potts MRF models for image partitions [15]. They also provide a convenient framework for modeling the temporal and spatial coherence observed in real motion sequences. 
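For concreteness, the depth-ordered visibility rule of Eq. (3) reduces to a per-pixel "first thresholded layer wins" computation. The following is a minimal NumPy sketch (0-based layer indices and array shapes are our own conventions, not the paper's):

```python
import numpy as np

def visible_layer(g):
    """Front-most visible layer per pixel, as in Eq. (3).
    g: (K-1, H, W) support functions for the first K-1 layers.
    Returns indices in {0, ..., K-1}; index K-1 is the background layer."""
    _, H, W = g.shape
    # Layer k is a candidate wherever g_k >= 0; the background always is.
    candidates = np.concatenate([g >= 0, np.ones((1, H, W), bool)], axis=0)
    # argmax over axis 0 returns the FIRST True, i.e. the minimum qualifying k.
    return candidates.argmax(axis=0)

def support_masks(g):
    """One-hot masks s_tk of shape (K, H, W): one visible layer per pixel."""
    k_star = visible_layer(g)
    K = g.shape[0] + 1
    return (np.arange(K)[:, None, None] == k_star[None]).astype(float)
```

By construction exactly one mask is active at each pixel, matching \sum_k s_{tk}(i, j) = 1.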
Spatial coherence is captured via a Gaussian conditional random field in which edge weights are modulated by local differences in Lab color vectors, I^c_t(i, j):

E_{space}(g_{tk}) = \frac{1}{2} \sum_{(i,j)} \sum_{(i',j') \in \Gamma(i,j)} w^{ij}_{i'j'} \big( g_{tk}(i, j) - g_{tk}(i', j') \big)^2,   (4)

w^{ij}_{i'j'} = \max\Big\{ \exp\Big( -\frac{1}{2\sigma_c^2} \| I^c_t(i, j) - I^c_t(i', j') \|^2 \Big),\ \delta_c \Big\}.   (5)

The threshold \delta_c > 0 adds robustness to large color changes in internal object texture. Temporal coherence of surfaces is then encouraged via a corresponding Gaussian MRF:

E_{time}(g_{tk}, g_{t+1,k}, u_{tk}, v_{tk}) = \sum_{(i,j)} \big( g_{tk}(i, j) - g_{t+1,k}(i + u^{ij}_{tk},\ j + v^{ij}_{tk}) \big)^2.   (6)

Critically, this energy function uses the corresponding flow field to non-rigidly align the layers at subsequent frames. By allowing smooth deformation of the support functions g_{tk}, we allow layer support to evolve over time, as opposed to transforming a single rigid template [12].

Our model of layer coherence is inspired by a recent method for image segmentation, based on spatially dependent Pitman-Yor processes [21]. That work makes connections between layered occlusion processes and stick-breaking representations of nonparametric Bayesian models. By assigning appropriate stochastic priors to layer thresholds, the Pitman-Yor model captures the power law statistics of natural scene partitions and infers an appropriate number of segments for each image. Existing optical flow benchmarks employ artificially constructed scenes that may have different layer-level statistics. Consequently our experiments in this paper employ a fixed number of layers K.

3.3 Depth Ordering and Occlusion Reasoning

The preceding generative process defines a set of K ordered layers, with corresponding flow fields u_{tk}, v_{tk} and segmentation masks s_{tk}. 
Recall that the layer assignment masks s are a deterministic function (threshold) of the underlying continuous layer support functions g (see Eq. (3)). To consistently reason about occlusions, we examine the layer assignments s_{tk}(i, j) and s_{t+1,k}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) at locations corresponded by the underlying flow fields. This leads to a far richer occlusion model than standard spatially independent outlier processes: geometric consistency is enforced via the layered sequence of flow fields.

Let I^s_t(i, j) denote an observed image feature for pixel (i, j); we work with a filtered version of the intensity images to provide some invariance to illumination changes. If s_{tk}(i, j) = s_{t+1,k}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) = 1, the visible layer for pixel (i, j) at time t remains unoccluded at time t + 1, and the image observations are modeled using a standard brightness (or, here, feature) constancy assumption. Otherwise, that pixel has become occluded, and is instead generated from a uniform distribution. The image likelihood model can then be written as

p(I^s_t \mid I^s_{t+1}, u_t, v_t, g_t, g_{t+1}) \propto \exp\{ -E_{data}(u_t, v_t, g_t, g_{t+1}) \}
= \exp\Big\{ -\sum_k \sum_{(i,j)} \Big( \rho_d\big( I^s_t(i, j) - I^s_{t+1}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) \big)\, s_{tk}(i, j)\, s_{t+1,k}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) + \lambda_d\, s_{tk}(i, j) \big( 1 - s_{t+1,k}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) \big) \Big) \Big\}

where \rho_d(\cdot) is a robust potential function and the constant \lambda_d arises from the difference of the log normalization constants for the robust and uniform distributions. With algebraic simplifications, the data error term can be written as

E_{data}(u_t, v_t, g_t, g_{t+1}) = \sum_k \sum_{(i,j)} \Big( \rho_d\big( I^s_t(i, j) - I^s_{t+1}(i + u^{ij}_{tk}, j + v^{ij}_{tk}) \big) - \lambda_d \Big)\, s_{tk}(i, j)\, s_{t+1,k}(i + u^{ij}_{tk}, j + v^{ij}_{tk})   (7)

up to an additive, constant multiple of \lambda_d. 
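A discrete sketch of the data term in Eq. (7) may help fix ideas. This hypothetical NumPy version rounds the flow to integer displacements and clips at the image border; the paper's actual implementation uses subpixel warping, so this is an illustration only:

```python
import numpy as np

def data_energy(I_t, I_t1, u, v, s_t, s_t1, rho_d, lam_d):
    """Sketch of E_data (Eq. 7) with nearest-neighbor warping (an assumption).
    I_t, I_t1: (H, W) feature images at times t and t+1.
    u, v:      (K, H, W) per-layer horizontal/vertical flow fields.
    s_t, s_t1: (K, H, W) layer visibility masks at the two frames."""
    K, H, W = s_t.shape
    jj, ii = np.meshgrid(np.arange(W), np.arange(H))
    E = 0.0
    for k in range(K):
        # Round flow to integer displacements and clip to the image (sketch only).
        i2 = np.clip(np.rint(ii + v[k]).astype(int), 0, H - 1)
        j2 = np.clip(np.rint(jj + u[k]).astype(int), 0, W - 1)
        match = s_t[k] * s_t1[k, i2, j2]   # visible in both frames under layer k
        resid = I_t - I_t1[i2, j2]         # feature-constancy residual
        E += ((rho_d(resid) - lam_d) * match).sum()
    return E
```

Note that pixels whose layer visibility disagrees across frames (occlusions) contribute nothing here, exactly as the shifted potential in Eq. (7) prescribes.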
The shifted potential function (\rho_d(\cdot) - \lambda_d) represents the change in energy when a pixel transitions from an occluded to an unoccluded configuration. Note that occlusions have higher likelihood only for sufficiently large discrepancies in matched image features and can only occur via a corresponding change in layer visibility.

4 Posterior Inference from Image Sequences

Considering the full generative model defined in Sec. 3, maximum a posteriori (MAP) estimation for a T-frame image sequence is equivalent to minimization of the following energy function:

E(u, v, g, \theta) = \sum_{t=1}^{T-1} \Big\{ E_{data}(u_t, v_t, g_t, g_{t+1}) + \sum_{k=1}^{K} \lambda_a \big( E_{aff}(u_{tk}, \theta_{tk}) + E_{aff}(v_{tk}, \theta_{tk}) \big) + \sum_{k=1}^{K-1} \big[ \lambda_b E_{space}(g_{tk}) + \lambda_c E_{time}(g_{tk}, g_{t+1,k}, u_{tk}, v_{tk}) \big] \Big\} + \sum_{k=1}^{K-1} \lambda_b E_{space}(g_{Tk}).   (8)

Here \lambda_a, \lambda_b, and \lambda_c are weights controlling the relative importance of the affine, spatial, and temporal terms respectively. Simultaneously inferring flow fields, layer support maps, and depth ordering is a challenging process; our approach is summarized below.

4.1 Relaxation of the Layer Assignment Process

Due to the non-differentiability of the threshold process that determines assignments of regions to layers, direct minimization of Eq. (8) is challenging. For a related approach to image segmentation, a mean field variational method has been proposed [21]. However, that segmentation model is based on a much simpler, spatially factorized likelihood model for color and texture histogram features. Generalization to the richer flow likelihoods considered here raises significant complications.

Instead, we relax the hard threshold assignment process using the logistic function \sigma(g) = 1/(1 + \exp(-g)). Applied to Eq. 
(3), this induces the following soft layer assignments:

\tilde{s}_{tk}(i, j) = \begin{cases} \sigma(\lambda_e g_{tk}(i, j)) \prod_{k'=1}^{k-1} \sigma(-\lambda_e g_{tk'}(i, j)), & 1 \le k < K, \\ \prod_{k'=1}^{K-1} \sigma(-\lambda_e g_{tk'}(i, j)), & k = K. \end{cases}   (9)

Note that \sigma(-g) = 1 - \sigma(g), and \sum_{k=1}^{K} \tilde{s}_{tk}(i, j) = 1 for any g_{tk} and constant \lambda_e > 0.

Substituting these soft assignments \tilde{s}_{tk}(i, j) for s_{tk}(i, j) in Eq. (7), we obtain a differentiable energy function that can be optimized via gradient-based methods. A related relaxation underlies the classic backpropagation algorithm for neural network training.

Figure 2: Results on the "Venus" sequence with 4 layers. The two background layers move faster than the two foreground layers, and the solution with the correct depth ordering has lower energy and smaller error. (a) First frame. (b-d) Fast-to-slow ordering: EPE 0.252 and energy -1.786 x 10^6. Left to right: motion segmentation, estimated flow field, and absolute error of estimated flow field. (e-g) Slow-to-fast ordering: EPE 0.195 and energy -1.808 x 10^6. Darker indicates larger flow field errors in (d) and (g).

4.2 Gradient-Based Energy Minimization

We estimate the hidden fields for all the frames together, while fixing the flow fields, by optimizing an objective involving the relevant E_{data}(\cdot), E_{space}(\cdot), and E_{time}(\cdot) terms. We then estimate the flow fields u_t, v_t for each frame, while fixing those of neighboring frames and the hidden fields, via the E_{data}(\cdot), E_{aff}(\cdot), and E_{time}(\cdot) terms. For flow estimation, we use a standard coarse-to-fine, warping-based technique as described in [22]. 
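The soft assignments of Eq. (9) compose a per-layer logistic "on" probability with the probability that all nearer layers are "off". A minimal NumPy sketch (array conventions are our own; \lambda_e = 2 matches the experimental setting in Sec. 5):

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

def soft_assignments(g, lam_e=2.0):
    """Soft layer assignments of Eq. (9).
    g: (K-1, H, W) support functions; returns s_tilde of shape (K, H, W).
    lam_e controls how sharply the relaxation approximates the threshold."""
    front = sigmoid(lam_e * g)        # probability that layer k is "on"
    off = sigmoid(-lam_e * g)         # = 1 - front
    s = []
    unclaimed = np.ones_like(g[0])    # probability all nearer layers are "off"
    for k in range(g.shape[0]):
        s.append(front[k] * unclaimed)
        unclaimed = unclaimed * off[k]
    s.append(unclaimed)               # background layer K absorbs the rest
    return np.stack(s)
```

The stick-breaking form telescopes, so the assignments are strictly positive and sum to one over layers for any g, as noted above.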
For hidden \ufb01eld estimation, we use an implementation of\nconjugate gradient descent with backtracking and line search. See Supplemental Material for details.\n\n5 Experimental Results\n\nWe apply the proposed model to two-frame sequences and compute both the forward and backward\n\ufb02ow \ufb01elds. This enables the use of the temporal consistency term by treating one frame as both\nthe previous and the next frame of the other1. We obtain an initial \ufb02ow \ufb01eld using the Classic+NL\nmethod [22], cluster the \ufb02ow vectors into K groups (layers), and convert the initial segmentation\ninto the corresponding hidden \ufb01elds. We then use a two-level Gaussian pyramid (downsampling\nfactor 0.8) and perform a fairly standard incremental estimation of the \ufb02ow \ufb01elds for each layer. At\neach level, we perform 20 incremental warping steps and during each step alternately solve for the\nhidden \ufb01elds and the \ufb02ow estimates. In the end, we threshold the hidden \ufb01elds to compute a hard\nsegmentation, and obtain the \ufb01nal \ufb02ow \ufb01eld by selecting the \ufb02ow \ufb01eld from the appropriate layers.\n\nOccluded regions are determined by inconsistencies between the hard segmentations at subsequent\nframes, as matched by the \ufb01nal \ufb02ow \ufb01eld. We would ideally like to compare layer initializations\nbased on all permutations of the initial \ufb02ow vector clusters, but this would be computationally inten-\nsive for large K. Instead we compare two orders: a fast-to-slow order appropriate for rigid scenes,\nand an opposite slow-to-fast order (for variety and robustness). We illustrate automatic selection of\nthe preferred order for the \u201cVenus\u201d sequence in Figure 2.\nThe parameters for all experiments are set to \u03bba = 3, \u03bbb = 30, \u03bbc = 4, \u03bbd = 9, \u03bbe = 2,\n\u03c3i = 12, and \u03b4c = 0.004. 
A generalized Charbonnier function is used for \rho_s(\cdot) and \rho_d(\cdot) (see Supplemental Material). Optimization takes about 5 hours for the two-frame "Urban" sequence using our MATLAB implementation.

5.1 Results on the Middlebury Benchmark

Training Set  As a baseline, we implement the smoothness-in-layers model [26] using modern techniques, and obtain an average training end-point error (EPE) of 0.487. This is reasonable but not competitive with state-of-the-art methods. The proposed model with 1 to 4 layers produces average EPEs of 0.248, 0.212, 0.200, and 0.194, respectively (see Table 1). The one-layer model is similar to the Classic+NL method, but has a less sophisticated (more local) model of the flow within that layer. It thus performs worse than the Classic+NL initialization; the performance improvements allowed by additional layers demonstrate the benefits of a layered model.

1 Our model works for longer sequences. We use two frames here for fair comparison with other methods.

Table 1: Average end-point error (EPE) on the Middlebury optical flow benchmark training set.

                          Venus  Dimetrodon  Hydrangea  RubberWhale  Grove2  Grove3  Urban2  Urban3  Avg. EPE
Weiss [26]                0.510  0.179       0.249      0.236        0.221   0.608   0.614   1.276   0.487
Classic++                 0.271  0.128       0.153      0.081        0.139   0.614   0.336   0.555   0.285
Classic+NL                0.238  0.131       0.152      0.073        0.103   0.468   0.220   0.384   0.221
1layer                    0.243  0.144       0.175      0.095        0.125   0.504   0.279   0.422   0.248
2layers                   0.219  0.147       0.169      0.081        0.098   0.376   0.236   0.370   0.212
3layers                   0.212  0.149       0.173      0.073        0.090   0.343   0.220   0.338   0.200
4layers                   0.197  0.148       0.159      0.068        0.088   0.359   0.230   0.300   0.194
3layers w/ WMF            0.211  0.150       0.161      0.067        0.086   0.331   0.210   0.345   0.195
3layers w/ WMF, C++Init   0.212  0.151       0.161      0.066        0.087   0.339   0.210   0.396   0.203

Table 2: Average end-point error (EPE) on the Middlebury optical flow benchmark test set.

EPE                        Rank  Average  Army  Mequon  Schefflera  Wooden  Grove  Urban  Yosemite  Teddy
Layers++                   4.3   0.270    0.08  0.19    0.20        0.13    0.48   0.47   0.15      0.46
Classic+NL                 6.5   0.319    0.08  0.22    0.29        0.15    0.64   0.52   0.16      0.49
EPE in boundary regions
Layers++                         0.560    0.21  0.56    0.40        0.58    0.70   1.01   0.14      0.88
Classic+NL                       0.689    0.23  0.74    0.65        0.73    0.93   1.12   0.13      0.98

Accuracy is improved by applying a 15 x 15 weighted median filter (WMF) [22] to the flow fields of each layer during the iterative warping step (EPE for 1 to 4 layers: 0.231, 0.204, 0.195, and 0.193). Weighted median filtering can be interpreted as a non-local spatial smoothness term in the energy function that integrates flow field information over a larger spatial neighborhood.

The "correct" number of layers for a real scene is not well defined (consider the "Grove3" sequence, for example). We use a restricted number of layers, and model the remaining complexity of the flow within each layer via the roughness-in-layers spatial term and the WMF. As the number of layers increases, the complexity of the flow within each layer decreases, and consequently the need for WMF also decreases; note that the difference in EPE for the 4-layer model with and without WMF is insignificant. 
For the remaining experiments we use the version with WMF.\n\nTo test the sensitivity of the result to the initialization, we also initialized with Classic++ (\u201cC++Init\u201d\nin Table 1), a good, but not top, non-layered method [22]. The average EPE for 1 to 4 layers increases\nto 0.248, 0.206, 0.203, and 0.198, respectively. While the one-layer method gets stuck in poor local\nminima on the \u201cGrove3\u201d and \u201cUrban3\u201d sequences, models with additional layers are more robust to\nthe initialization. For more details and full EPE results, see the Supplemental Material.\n\nTest Set For evaluation, we focus on a model with 3 layers (denoted \u201cLayers++\u201d in the Middlebury\npublic table). On the Middlebury test set it has an average EPE of 0.270 and average angular error\n(AAE) of 2.556; this is the lowest among all tested methods [2] at the time of writing (Oct. 2010).\nTable 2 summarizes the results for individual test sequences. The layered model is particularly\naccurate at motion boundaries, probably due to the use of layer-speci\ufb01c motion models, and the\nexplicit modeling of occlusion in Edata (Eq. (7)). For more extensive results, see the Supplemental\nMaterial.\n\nVisual Comparison Figure 3 shows results for the 3-layer model on several training and test\nsequences. Notice that the layered model produces a motion segmentation that captures the major\nstructure of the scene, and the layer boundaries correspond well to static image edges. It detects most\nocclusion regions and interpolates their motion reasonably well. Several sequences show signi\ufb01cant\nimprovement due to the global reasoning provided by the layered model. On the training \u201cGrove3\u201d\nsequence, the proposed method correctly identi\ufb01es many holes between the branches and leaves as\nbackground. 
It also associates the branch at the bottom right corner with branches in the center. As the branch moves beyond the image boundary, the layered model interpolates its motion using long-range correlation with the branches in the center. In contrast, the single-layered approach incorrectly interpolates from local background regions. The "Schefflera" result illustrates how the layered method can separate foreground objects from the background (e.g., the leaves in the top right corner), and thereby reduce errors made by single-layer approaches such as Classic+NL.

Figure 3: Results on some Middlebury training (rows 1 to 3) and test (rows 4 to 6) sequences. Top to bottom: "RubberWhale", "Grove3", "Urban3", "Mequon", "Schefflera", and "Grove". Left to right: First image frame, initial flow field from "Classic+NL", final flow field, motion segmentation (green front, blue middle, red back), and detected occlusions. Best viewed in color and enlarged to allow comparison of detailed motions.

6 Conclusion and Discussion

We have described a new probabilistic formulation for layered image motion that explicitly models occlusion and disocclusion, depth ordering of layers, and the temporal consistency of the layer segmentation. The approach allows the flow field in each layer to have piecewise smooth deformation from a parametric motion model. Layer support is modeled using an image-dependent hidden field prior that supports a model of temporal layer continuity over time. The image data error term takes into account layer occlusion relationships, resulting in increased flow accuracy near motion boundaries. 
Our method achieves state-of-the-art results on the Middlebury optical flow benchmark while producing meaningful segmentation and occlusion detection results.

Future work will address better inference methods, especially a better scheme to infer the layer order, and the automatic estimation of the number of layers. Computational efficiency has not been addressed, but will be important for inference on long sequences. Currently our method does not capture transparency, but this could be supported using a soft layer assignment and a different generative model. Additionally, the parameters of the model could be learned [23], but this may require more extensive and representative training sets. Finally, the parameters of the model, especially the number of layers, should adapt to the motions in a given sequence.

Acknowledgments   DS and MJB were supported in part by the NSF Collaborative Research in Computational Neuroscience Program (IIS-0904875) and a gift from Intel Corp.

References

[1] S. Ayer and H. S. Sawhney. Layered representation of motion video using robust maximum-likelihood estimation of mixture models and MDL encoding. In ICCV, pages 777–784, Jun 1995.
[2] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. IJCV, to appear.
[3] S. Birchfield and C. Tomasi. Multiway cut for stereo and motion with slanted surfaces. In ICCV, pages 489–495, 1999.
[4] M. J. Black and P. Anandan. Robust dynamic motion estimation over time. In CVPR, pages 296–302, 1991.
[5] M. J. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 63:75–104, 1996.
[6] M. J. Black and A. D. Jepson. Estimating optical flow in segmented images using variable-order parametric models with local deformations.
PAMI, 18(10):972–986, October 1996.
[7] T. Darrell and A. Pentland. Robust estimation of a multi-layered motion representation. In Workshop on Visual Motion, pages 173–178, 1991.
[8] T. Darrell and A. Pentland. Cooperative robust estimation using layers of support. PAMI, 17(5):474–487, 1995.
[9] B. Glocker, T. H. Heibel, N. Navab, P. Kohli, and C. Rother. TriangleFlow: Optical flow with triangulation-based higher-order likelihoods. In ECCV, pages 272–285, 2010.
[10] M. Irani, P. Anandan, and D. Weinshall. From reference frames to reference planes: Multi-view parallax geometry and applications. In ECCV, 1998.
[11] A. Jepson and M. J. Black. Mixture models for optical flow computation. In CVPR, 1993.
[12] N. Jojic and B. Frey. Learning flexible sprites in video layers. In CVPR, pages I:199–206, 2001.
[13] A. Kannan, B. Frey, and N. Jojic. A generative model of dense optical flow in layers. Technical Report TR PSI-2001-11, University of Toronto, Aug. 2001.
[14] R. Kumar, P. Anandan, and K. Hanna. Shape recovery from multiple views: A parallax based approach. In Proc. 12th ICPR, 1994.
[15] R. D. Morris, X. Descombes, and J. Zerubia. The Ising/Potts model is not well suited to segmentation tasks. In Proceedings of the IEEE Digital Signal Processing Workshop, 1996.
[16] M. Nicolescu and G. Medioni. Motion segmentation with accurate boundaries - a tensor voting approach. In CVPR, pages 382–389, 2003.
[17] M. P. Kumar, P. H. Torr, and A. Zisserman. Learning layered motion segmentations of video. IJCV, 76(3):301–319, 2008.
[18] S. Roth and M. J. Black. On the spatial statistics of optical flow. IJCV, 74(1):33–50, August 2007.
[19] H. S. Sawhney. 3D geometry from planar parallax. In CVPR, pages 929–934, 1994.
[20] T. Schoenemann and D. Cremers. High resolution motion layer decomposition using dual-space graph cuts.
In CVPR, pages 1–7, June 2008.
[21] E. Sudderth and M. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In NIPS, pages 1585–1592, 2009.
[22] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In CVPR, 2010.
[23] D. Sun, S. Roth, J. P. Lewis, and M. J. Black. Learning optical flow. In ECCV, pages 83–97, 2008.
[24] P. Torr, R. Szeliski, and P. Anandan. An integrated Bayesian approach to layer extraction from image sequences. PAMI, 23(3):297–303, Mar 2001.
[25] J. Y. A. Wang and E. H. Adelson. Representing moving images with layers. IEEE Transactions on Image Processing, 3(5):625–638, Sept. 1994.
[26] Y. Weiss. Smoothness in layers: Motion segmentation using nonparametric mixture estimation. In CVPR, pages 520–526, Jun 1997.
[27] Y. Weiss and E. Adelson. A unified mixture framework for motion segmentation: Incorporating spatial coherence and estimating the number of models. In CVPR, pages 321–326, Jun 1996.
[28] M. Werlberger, T. Pock, and H. Bischof. Motion estimation with non-local total variation regularization. In CVPR, 2010.
[29] H. Yalcin, M. J. Black, and R. Fablet. The dense estimation of motion and appearance in layers. In IEEE Workshop on Image and Video Registration, pages 777–784, Jun 2004.
[30] Y. Zhou and H. Tao. Background layer model for object tracking through occlusion. In ICCV, volume 2, pages 1079–1085, 2003.