{"title": "Simultaneous Sampling and Multi-Structure Fitting with Adaptive Reversible Jump MCMC", "book": "Advances in Neural Information Processing Systems", "page_first": 540, "page_last": 548, "abstract": "Multi-structure model fitting has traditionally taken a two-stage approach: First, sample a (large) number of model hypotheses, then select the subset of hypotheses that optimise a joint fitting and model selection criterion. This disjoint two-stage approach is arguably suboptimal and inefficient - if the random sampling did not retrieve a good set of hypotheses, the optimised outcome will not represent a good fit. To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and gives superior computational efficiency over other methods.", "full_text": "Simultaneous Sampling and Multi-Structure Fitting\n\nwith Adaptive Reversible Jump MCMC\n\nTrung Thanh Pham, Tat-Jun Chin, Jin Yu and David Suter\n\nSchool of Computer Science, The University of Adelaide, South Australia\n{trung,tjchin,jin.yu,dsuter}@cs.adelaide.edu.au\n\nAbstract\n\nMulti-structure model \ufb01tting has traditionally taken a two-stage approach: First,\nsample a (large) number of model hypotheses, then select the subset of hypotheses\nthat optimise a joint \ufb01tting and model selection criterion. This disjoint two-stage\napproach is arguably suboptimal and inef\ufb01cient \u2014 if the random sampling did not\nretrieve a good set of hypotheses, the optimised outcome will not represent a good\n\ufb01t. 
To overcome this weakness we propose a new multi-structure fitting approach based on Reversible Jump MCMC. Instrumental in raising the effectiveness of our method is an adaptive hypothesis generator, whose proposal distribution is learned incrementally and online. We prove that this adaptive proposal satisfies the diminishing adaptation property crucial for ensuring ergodicity in MCMC. Our method effectively conducts hypothesis sampling and optimisation simultaneously, and yields superior computational efficiency over previous two-stage methods.

1 Introduction

Multi-structure model fitting is concerned with estimating the multiple instances (or structures) of a geometric model embedded in the input data. The task manifests in applications such as mixture regression [21], motion segmentation [27, 10], and multi-projective estimation [29]. Such a problem is known for its \u201cchicken-and-egg\u201d nature: Both data-to-structure assignments and structure parameters are unavailable, but given the solution of one subproblem, the solution of the other can be easily derived. In practical settings the number of structures is usually unknown beforehand, thus model selection is required in conjunction with fitting. This makes the problem very challenging.

A common framework is to optimise a robust goodness-of-fit function jointly with a model selection criterion. For tractability most methods [25, 19, 17, 26, 18, 7, 31] take a \u201chypothesise-then-select\u201d approach: First, randomly sample from the parameter space a large number of putative model hypotheses, then select a subset of the hypotheses (structures) that optimise the combined objective function. The hypotheses are typically fitted on minimal subsets [9] of the input data. 
Depending on the specific definition of the cost functions, a myriad of strategies have been proposed to select the best structures, namely tabu search [25], branch-and-bound [26], linear programming [19], Dirichlet mixture clustering [17], message passing [18], graph cut [7], and quadratic programming [31].

While sampling is crucial for tractability, a disjoint two-stage approach raises an awkward situation: If the sampled hypotheses are inaccurate, or worse, if not all valid structures are sampled, the selection or optimisation step will be affected. The concern is palpable especially for higher-order geometric models (e.g., fundamental matrices in motion segmentation [27]) where enormous sampling effort is required before hitting good hypotheses (those fitted on all-inlier minimal subsets). Thus two-stage approaches are highly vulnerable to sampling inadequacies, even with theoretical assurances on the optimisation step (e.g., globally optimal over the sampled hypotheses [19, 7, 31]).

The issue above can be viewed as the lack of a stopping criterion for the sampling stage. If there is only one structure, we can easily evaluate the sample quality (e.g., consensus size) on-the-fly and stop as soon as the prospect of obtaining a better sample becomes insignificant [9]. Under multi-structure data, it is unknown what a suitable stopping criterion is (apart from solving the overall fitting and model selection problem itself). One can consider iterative local refinement of the structures or re-sampling after data assignment [7], but the fact remains that if the initial hypotheses are inaccurate, the results of the subsequent fitting and refinement will be affected.

Clearly, an approach that simultaneously samples and optimises is more appropriate. To this end we propose a new method for multi-structure fitting and model selection based on Reversible Jump Markov Chain Monte Carlo (RJMCMC) [12]. 
By design MCMC techniques directly optimise via sampling. Despite their popularity [3], such techniques have not been fully explored in multi-structure fitting (a few authors have applied Monte Carlo techniques for robust estimation [28, 8], but mostly to enhance hypothesis sampling on single-structure data). We show how to exploit the reversible jump mechanism to provide a simple and effective framework for multi-structure model selection.

The bane of MCMC, however, is the difficulty in designing efficient proposal distributions. Adaptive MCMC techniques [4, 24] promise to alleviate this difficulty by learning the proposal distribution on-the-fly. Instrumental in raising the efficiency of our RJMCMC approach is a recently proposed hypothesis generator [6] that progressively updates the proposal distribution using generated hypotheses. Care must be taken in introducing such adaptive schemes, since a chain propagated based on a non-stationary proposal is non-Markovian, and unless the proposal satisfies certain properties [4, 24], this generally means a loss of asymptotic convergence to the target distribution.

Clearing these technical hurdles is one of our major contributions: Using emerging theory from adaptive MCMC [23, 4, 24, 11], we prove that the adaptive proposal, despite its origins in robust estimation [6], satisfies the properties required for convergence, most notably diminishing adaptation.

The rest of the paper is organised as follows: Sec. 2 formulates our goal within a clear optimisation framework, and outlines our RJMCMC approach. Sec. 3 describes the adaptive hypothesis proposal used in our method, and proves that it is a valid adaptive MCMC sampler. We present our experimental results in Sec. 4 and draw conclusions in Sec. 
5.

2 Multi-Structure Fitting and Model Selection

Given input data X = {xi}_{i=1}^N, usually with outliers, our goal is to recover the instances or structures θk = {θc}_{c=1}^k of a geometric model M embedded in X. The number of valid structures k is unknown beforehand and must also be estimated from the data. The problem domain is therefore the joint space of structure quantity and parameters {k, θk}. Such a problem is typically solved by jointly minimising fitting error and model complexity. Similar to [25, 19, 26], we use the AIC [1]

{k*, θ*_{k*}} = arg min_{k, θk}  −2 log L(θk) + 2α n(θk).

Here L(θk) is the robust data likelihood and n(θk) the number of parameters to define θk. We include a positive constant α to allow reweighting of the two components. Assuming i.i.d. Gaussian noise with known variance σ, the above problem is equivalent to minimising the function

f(k, θk) = Σ_{i=1}^{N} ρ( min_c r_ic / (1.96σ) ) + α n(θk),   (1)

where r_ic = g(xi, θc) is the absolute residual of xi to the c-th structure θc in θk. The residuals are subjected to a robust loss function ρ(·) to limit the influence of outliers; we use the biweight function [16]. Minimising a function like (1) over a vast domain {k, θk} is a formidable task.

2.1 A reversible jump simulated annealing approach

Simulated annealing has proven to be effective for difficult model selection problems [2, 5]. The idea is to propagate a Markov chain for the Boltzmann distribution encapsulating (1),

bT(k, θk) ∝ exp(−f(k, θk)/T),   (2)

where temperature T is progressively lowered until the samples from bT(k, θk) converge to the global minima of f(k, θk). Algorithm 1 shows the main body of the algorithm. 
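The objective (1) can be sketched concretely in a few lines. The following is a hypothetical toy mock-up, not the authors' implementation: data and structures are 1-D scalars, the residual is g(x, θ) = |x − θ|, each structure costs one parameter, and a unit-tuned Tukey biweight stands in for ρ(·):

```python
import numpy as np

def biweight(r, c=1.0):
    # Tukey biweight loss: grows quadratically near 0 and saturates at c**2/6
    # for |r| >= c, which bounds the influence of gross outliers.
    r = np.minimum(np.abs(r), c)
    return (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)

def f(X, thetas, sigma, alpha=0.1):
    # Eq. (1): each datum is scored against its nearest structure, residuals
    # are normalised by 1.96*sigma, and the complexity penalty alpha*n(theta_k)
    # is added; here n(theta_k) = k (one parameter per scalar structure).
    R = np.abs(X[:, None] - np.asarray(thetas)[None, :])   # residuals r_ic
    return biweight(R.min(axis=1) / (1.96 * sigma)).sum() + alpha * len(thetas)

X = np.array([0.0, 0.1, -0.1, 5.0, 5.1, 4.9])   # two 1-D clusters
# Explaining both clusters beats explaining one, despite the extra parameter:
assert f(X, [0.0, 5.0], sigma=0.5) < f(X, [0.0], sigma=0.5)
```

The saturation of the biweight is what keeps the comparison honest: the cluster near 5 contributes a bounded penalty when only θ = 0 is hypothesised, rather than a quadratic one.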
Under weak regularity assumptions, there exist cooling schedules [5] that will guarantee that as T tends to zero the samples from the chain will concentrate around the global minima.

To simulate bT(k, θk) we adopt a mixture of kernels MCMC approach [2]. This involves in each iteration the execution of a randomly chosen type of move to update {k, θk}. Algorithm 2 summarises the idea. We make available 3 types of moves: birth, death and local update. Birth and death moves change the number of structures k. These moves effectively cause the chain to jump across parameter spaces θk of different dimensions. It is crucial that these trans-dimensional jumps are reversible to produce correct limiting behaviour of the chain. The following subsections explain.

Algorithm 1 Simulated annealing for multi-structure fitting and model selection
1: Initialise temperature T and state {k, θk}.
2: Simulate Markov chain for bT(k, θk) until convergence.
3: Lower temperature T and repeat from Step 2 until T ≈ 0.

Algorithm 2 Reversible jump mixture of kernels MCMC to simulate bT(k, θk)
Require: Last visited state {k, θk} of previous chain, probability β (Sec. 4 describes setting β).
1: Sample a ∼ U[0,1].
2: if a ≤ β then
3:   With probability rB(k), attempt birth move, else attempt death move.
4: else
5:   Attempt local update.
6: end if
7: Repeat from Step 1 until convergence (e.g., last V moves all rejected).

2.1.1 Birth and death moves

The birth move propagates {k, θk} to {k′, θ′k′}, with k′ = k + 1. 
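Putting Algorithms 1 and 2 together, the overall control flow can be sketched as below. This is a schematic, not the authors' code: birth, death, local_update and accept are hypothetical stubs standing in for the moves and the Metropolis-Hastings tests detailed in this and the next subsection, and the β = T schedule is the one used later in Sec. 4:

```python
import random

def anneal(state, birth, death, local_update, accept,
           T=1.0, T_min=1e-3, cool=0.99, inner=50, k_max=10):
    # Algorithm 1: outer simulated-annealing loop with geometric cooling.
    while T > T_min:
        beta = T   # Sec. 4: beta = T favours trans-dimensional jumps early on
        # Algorithm 2: mixture-of-kernels chain at fixed temperature T.
        for _ in range(inner):
            k = len(state)
            if random.random() <= beta:
                # r_B(k): always propose birth at k = 1, never at k = k_max,
                # otherwise choose birth or death with equal probability.
                r_b = 1.0 if k == 1 else (0.0 if k == k_max else 0.5)
                proposal = birth(state) if random.random() < r_b else death(state)
            else:
                proposal = local_update(state)
            if accept(state, proposal, T):   # Metropolis-Hastings test, Eqs. (3)-(4)
                state = proposal
        T *= cool
    return state
```

With stub moves over a list of scalar 'structures' and any Metropolis acceptance test, the loop returns a state whose size stays within [1, k_max], mirroring the range restriction in Step 3 of Algorithm 2.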
Applying Green's [12, 22] seminal theorems on RJMCMC, the move is reversible if it is accepted with probability min{1, A}, where

A = [ bT(k′, θ′k′) (1 − rB(k′)) / k′ ] / [ bT(k, θk) rB(k) q(u) ] · | ∂θ′k′ / ∂(θk, u) |.   (3)

The probability of proposing the birth move is rB(k), where rB(k) = 1 for k = 1, rB(k) = 0.5 for k = 2, ..., kmax − 1, and rB(kmax) = 0. In other words, any move that attempts to move k beyond the range [1, kmax] is disallowed in Step 3 of Algorithm 2. The death move is proposed with probability 1 − rB(k). An existing structure is chosen randomly and deleted from θk. The death move is accepted with probability min{1, A^{−1}}, with obvious changes to the notations in A^{−1}.

In the birth move, the extra degrees of freedom required to specify the new item in θ′k′ are given by auxiliary variables u, which are in turn proposed by q(u). Following [18, 7, 31], we estimate parameters of the new item by fitting the geometric model M onto a minimal subset of the data. Thus u is a minimal subset of X. The size p of u is the minimum number of data required to instantiate M, e.g., p = 4 for planar homographies, and p = 7 or 8 for fundamental matrices [15]. Our approach is equivalently minimising (1) over collections {k, θk} of minimal subsets of X, where now θk ≡ {uc}_{c=1}^k. Taking this view the Jacobian ∂θ′k′/∂(θk, u) is simply the identity matrix.

Considering only minimal subsets somewhat simplifies the problem, but there are still a colossal number of possible minimal subsets. Obtaining good overall performance thus hinges on the ability of proposal q(u) to propose minimal subsets that are relevant, i.e., those fitted purely on inliers of valid structures in the data. One way is to learn q(u) incrementally using generated hypotheses. We describe such a scheme [6] in Sec. 3 and prove that the adaptive proposal preserves ergodicity.

2.1.2 Local update

A local update does not change the model complexity k. The move involves randomly choosing a structure θc in θk to update, making only local adjustments to its minimal subset uc. The outcome is a revised minimal subset u′c, and the move is accepted with probability min{1, A}, where

A = [ bT(k, θ′k) q(uc | θ′c) ] / [ bT(k, θk) q(u′c | θc) ].   (4)

As shown in the above, our local update is also accomplished with the adaptive proposal q(u|θ), but this time conditioned on the selected structure θc. Sec. 3 describes and analyses q(u|θ).

3 Adaptive MCMC for Multi-Structure Fitting

Our work capitalises on the hypothesis generation scheme of Chin et al. called Multi-GS [6], originally proposed for robust geometric fitting. The algorithm maintains a series of sampling weights which are revised incrementally as new hypotheses are generated. This bears similarity to the pioneering Adaptive Metropolis (AM) method of Haario et al. [13]. Here, we prove that our adaptive proposals q(u) and q(u|θ) based on Multi-GS satisfy conditions required to preserve ergodicity.

3.1 The Multi-GS algorithm

Let {θm}_{m=1}^M aggregate the set of hypotheses fitted on the minimal subsets proposed thus far in all birth and local update moves in Algorithm 1. To build the sampling weights, first for each xi ∈ X we compute its absolute residuals as measured to the M hypotheses, yielding the residual vector

r(i) := [ r(i)_1  r(i)_2  ···  r(i)_M ].

We then find the permutation

a(i) := [ a(i)_1  a(i)_2  ···  a(i)_M ]

that sorts the elements in r(i) in non-descending order. 
The permutation a(i) essentially ranks the M hypotheses according to the preference of xi; the higher a hypothesis is ranked, the more likely xi is an inlier to it. The weight wi,j between the pair xi and xj is obtained as

wi,j = Ih(xi, xj) := (1/h) | a(i)_h ∩ a(j)_h |,   (5)

where a(i)_h contains the first h elements of a(i), and |a(i)_h ∩ a(j)_h| is the number of identical elements shared by the first-h elements of a(i) and a(j). Clearly wi,j is symmetric with respect to the input pair xi and xj, and wi,i = 1 for all i. To ensure technical consistency in our later proofs, we add a small positive offset γ to the weight1, or

wi,j = max(Ih(xi, xj), γ),   (6)

hence γ ≤ wi,j ≤ 1. The weight wi,j measures the correlation of the top h preferences of xi and xj, and this value is typically high iff xi and xj are inliers from the same structure; Figs. 1(c)–(g) illustrate. Parameter h controls the discriminative power of wi,j, and is typically set as a fixed ratio k of M, i.e., h = ⌈kM⌉. Experiments suggest that k = 0.1 provides generally good performance [6].

Multi-GS exploits the preference correlations to sample the next minimal subset u = {x_{st}}_{t=1}^p, where x_{st} ∈ X and st ∈ {1, ..., N} indexes the particular datum from X; henceforth we regard u ≡ {st}_{t=1}^p. The first datum s1 is chosen purely randomly. Beginning from t = 2, the selection of the t-th member st considers the weights related to the data s1, ..., st−1 already present in u. More specifically, the index st is sampled according to the probabilities

Pt(i) ∝ Π_{z=1}^{t−1} w_{sz,i},   for i = 1, ..., N,   (7)

i.e., if Pt(i) > Pt(j) then i is more likely than j to be chosen as st. A new hypothesis θM+1 is then fitted on u and the weights are updated in consideration of θM+1. 
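The weight construction (5)-(6) and the conditional sampling (7) can be mocked up as follows. This is only an illustration of the mechanics under assumed values (random residuals standing in for real hypothesis fits, γ = 10⁻³), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, p = 40, 500, 4          # data size, hypotheses so far, minimal subset size
res = rng.random((N, M))      # synthetic residuals of each datum to each hypothesis
h = int(np.ceil(0.1 * M))     # h = ceil(k*M) with ratio k = 0.1

# Eqs. (5)-(6): w_ij = normalised overlap of top-h rankings, floored at gamma.
top = [set(np.argsort(res[i])[:h]) for i in range(N)]
gamma = 1e-3
W = np.array([[max(len(top[i] & top[j]) / h, gamma) for j in range(N)]
              for i in range(N)])

def sample_minimal_subset(W, p, rng):
    # Eq. (7): s_1 uniform; then P_t(i) proportional to the product of the
    # weights w_{s_z, i} over the members already in u, with already-chosen
    # indices zeroed out (sampling without replacement).
    N = W.shape[0]
    u = [int(rng.integers(N))]
    probs = np.ones(N)
    for _ in range(p - 1):
        probs = probs * W[u[-1]]
        probs[u] = 0.0
        u.append(int(rng.choice(N, p=probs / probs.sum())))
    return u

u = sample_minimal_subset(W, p, rng)   # indices of one proposed minimal subset
```

On real data the weight matrix is block-structured (inliers of one structure prefer the same hypotheses), so the running product in Eq. (7) rapidly concentrates the remaining draws on data from the structure the first draw fell into.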
Experiments comparing sampling efficiency (e.g., all-inlier minimal subsets produced per unit time) show that Multi-GS is superior over previous guided sampling schemes, especially on multi-structure data; see [6] for details.

3.2 Is Multi-GS a valid adaptive MCMC proposal?

Our RJMCMC scheme in Algorithm 2 depends on the Multi-GS-inspired adaptive proposals qM(u) and qM(u|θ), where we now add the subscript M to make explicit their dependency on the set of aggregated hypotheses {θm}_{m=1}^M as well as the weights {wi,j}_{i,j=1}^N they induce. The probability of proposing a minimal subset u = {st}_{t=1}^p from qM(u) can be calculated as

qM(u) = (1/N) Π_{a<b} w_{sa,sb} [ Π_{d=1}^{p−1} 1T( w_{s1} ⊙ ··· ⊙ w_{sd} ) ]^{−1},   (8)

where w_i := [wi,1 ··· wi,N] collects the weights attached to xi, 1 is a vector of N ones, and ⊙ denotes the elementwise (Hadamard) product.

Our proof of validity draws on the following convergence result for adaptive chains (see [23, 4, 24, 11]): Let {Zn, n ≥ 0} be a stochastic process on a compact state space Ξ evolving according to a collection of transition kernels

Tn(z, z′) = pr(Zn+1 = z′ | Zn = z, Zn−1 = zn−1, ..., Z0 = z0),

and let p(z) be the distribution of Zn. Suppose for every n and z0, ..., zn−1 ∈ Ξ and for some distribution π(z) on Ξ,

Σ_{zn} π(zn) Tn(zn, zn+1) = π(zn+1),   (10)

|Tn+k(z, z′) − Tn(z, z′)| ≤ an ck,  an = O(n^{−r1}),  ck = O(k^{−r2}),  r1, r2 > 0,   (11)

Tn(z, z′) ≥ ε π(z′),  ε > 0,   (12)

where ε does not depend on n, z0, ..., zn−1. Then, for any initial distribution p(z0) for Z0,

sup_{zn} |p(zn) − π(zn)| → 0  as n → ∞.

Diminishing adaptation. Eq. (11) dictates that the transition kernel, and thus the proposal distribution in the Metropolis-Hastings updates in Eqs. (3) and (4), must converge to a fixed distribution, i.e., the adaptation must diminish. To see that this occurs naturally in qM(u), first we show that wi,j for all i, j converges as M increases. 
Without loss of generality assume that b new hypotheses are generated between successive weight updates wi,j and w′i,j. Then,

lim_{M→∞} |w′i,j − wi,j|
  = lim_{M→∞} | |a′(i)_{k(M+b)} ∩ a′(j)_{k(M+b)}| / (k(M+b)) − |a(i)_{kM} ∩ a(j)_{kM}| / (kM) |
  ≤ lim_{M→∞} | ( |a(i)_{kM} ∩ a(j)_{kM}| ± b(k+1) ) / (k(M+b)) − |a(i)_{kM} ∩ a(j)_{kM}| / (kM) |
  = lim_{M→∞} | ( |a(i)_{kM} ∩ a(j)_{kM}|/M ± b(k+1)/M ) / (k + kb/M) − ( |a(i)_{kM} ∩ a(j)_{kM}|/M ) / k |
  = 0,

where a′(i) is the revised preference of xi in consideration of the b new hypotheses. The result is based on the fact that the extension of b hypotheses will only perturb the overlap between the top-k percentile of any two preference vectors by at most b(k+1) items. It should also be noted that the result is not due to w′i,j and wi,j simultaneously vanishing with increasing M; in general

lim_{M→∞} |a(i)_{kM} ∩ a(j)_{kM}| / M ≠ 0,

since a(i) and a(j) are extended and revised as M increases and this may increase their mutual overlap. Figs. 1(c)–(g) illustrate the convergence of wi,j as M increases. Using the above result, it can be shown that the product of any two weights also converges:

lim_{M→∞} |w′i,j w′p,q − wi,j wp,q|
  = lim_{M→∞} |w′i,j (w′p,q − wp,q) + wp,q (w′i,j − wi,j)|
  ≤ lim_{M→∞} |w′i,j| |w′p,q − wp,q| + |wp,q| |w′i,j − wi,j|
  = 0.

This result is readily extended to the product of any number of weights. To show the convergence of the normalisation terms in (8), we first observe that the sum of weights is bounded away from 0:

1T wi ≥ L,  L > 0,  for all i,

due to the offsetting (6) and the constant element wi,i = 1 in wi (although wi,i will be set to zero to enforce sampling without replacement [6]). It can thus be established that

lim_{M→∞} | 1/(1T w′i) − 1/(1T wi) |
  = lim_{M→∞} | (1T w′i − 1T wi) / ((1T w′i)(1T wi)) |
  ≤ lim_{M→∞} | (1T w′i − 1T wi) / L² |
  = 0,

since the sum of the weights also converges. The result is readily extended to the inverse of the sum of any number of Hadamard products of weights, since we have also previously established that the product of any number of weights converges. Finally, since Eq. (8) involves only multiplications of convergent quantities, qM(u) will converge to a fixed distribution as the update progresses.

Invariance. Eq. 
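The diminishing adaptation argument lends itself to a quick numerical illustration. The toy check below (random residuals and hypothetical parameter values, not the paper's data) tracks the adaptation step |w′ − w| for one pair of data as the hypothesis count M grows, with b new hypotheses per update:

```python
import numpy as np

rng = np.random.default_rng(0)
k, b = 0.1, 10                 # top-h ratio and hypotheses added per update
res = rng.random((2, 400))     # residuals of x_i, x_j to a growing hypothesis pool

def w(res, M):
    # Eq. (5) for the pair (i, j) = (0, 1), using only the first M hypotheses.
    h = int(np.ceil(k * M))
    top0 = set(np.argsort(res[0, :M])[:h])
    top1 = set(np.argsort(res[1, :M])[:h])
    return len(top0 & top1) / h

# Adaptation step size |w' - w| after b new hypotheses, at increasing M.
diffs = [abs(w(res, M + b) - w(res, M)) for M in (20, 50, 100, 200, 390)]
```

Since each update perturbs the top-h overlap by at most b(k+1) items, the analysis above bounds the step size by roughly b(k+1)/(kM), so the adaptation dies out as M grows.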
(10) requires that transition probabilities based on qM(u) admit an invariant distribution individually for all M. Since we propose and accept based on the Metropolis-Hastings algorithm, detailed balance is satisfied by construction [3], which means that a Markov chain propagated based on qM(u) will asymptotically sample from the target distribution.

Uniform ergodicity. Eq. (12) requires that qM(u) for all M be individually ergodic, i.e., the resulting chain using qM(u) is aperiodic and irreducible. Again, since we simulate the target using Metropolis-Hastings, every proposal has a chance of being rejected, thus implying aperiodicity [3]. Irreducibility is satisfied by the offsetting in (6) and renormalising [20], since this implies that there is always a non-zero probability of reaching any state (minimal subset) from the current state.

The above results apply for the local update proposal qM(u|θ), which differs from qM(u) only in the (stationary) probability to select the first index s1. Hence qM(u|θ) is also a valid adaptive proposal.

4 Experiments

We compare our approach (ARJMC) against state-of-the-art methods: message passing [18] (FLOSS), energy minimisation with graph cut [7] (ENERGY), and quadratic programming based on a novel preference feature [31] (QP-MF). We exclude older methods with known weaknesses, e.g., computational inefficiency [19, 17, 26], low accuracy due to greedy search [25], or vulnerability to outliers [17]. All methods are run in MATLAB except ENERGY, which is available in C++2.

For ARJMC, the standard deviation σ in (1) is set as t/1.96, where t is the inlier threshold [9] obtained using ground truth model fitting results; the same t is provided to the competitors. 
In Algorithm 1, temperature T is initialised as 1 and we apply the geometric cooling schedule Tnext = 0.99T. In Algorithm 2, probability β is set equal to the current temperature T, thus allowing more global exploration of the parameter space initially before concentrating on local refinement subsequently. Such a helpful strategy is not naturally practicable in disjoint two-stage approaches.

4.1 Two-view motion segmentation

The goal is to segment point trajectories X matched across two views into distinct motions [27]. Trajectories of a particular motion can be related by a distinct fundamental matrix F ∈ R3×3 [15]. Our task is thus to estimate the number of motions k and the fundamental matrices {Fc}_{c=1}^k corresponding to the motions embedded in data X. Note that X may contain false trajectories (outliers). We estimate fundamental matrix hypotheses from minimal subsets of size p = 8 using the 8-point method [14]. The residual g(xi, F) is computed as the Sampson distance [15].

We test the methods on publicly available two-view motion segmentation datasets [30]. In particular we test on the 3- and 4-motion datasets provided, namely breadtoycar, carchipscube, toycubecar, breadcubechips, biscuitbookbox, cubebreadtoychips and breadcartoychips; see the dataset homepage for more details. Correspondences were established via SIFT matching and manual filtering was done to obtain ground truth segmentation. Examples are shown in Figs. 
1(a) and 1(b).

2 http://vision.csd.uwo.ca/code/#Multi-label optimization

Figure 1: (a) breadtoycar dataset with 3 motions (37, 39 and 34 inliers, 56 outliers); (b) cubebreadtoychips dataset with 4 motions (71, 49, 38 and 81 inliers, 88 outliers). Colours show ground truth labelling; to minimise clutter, lines joining false matches are not drawn. (c)–(g) show the evolution of the matrix of pairwise weights (5) computed from (b) as the number of hypotheses M is increased (M = 50, 100, 1000, 5000, 10000). For presentation the data are arranged according to their structure membership, which gives rise to a 4-block pattern. Observe that the block pattern, hence the weights, converge as M increases. (h) and (i) respectively show the value of the function f(k, θk) and the segmentation error (see text) of four methods on the dataset in (b) (best viewed in colour). (j)–(m) show the evolution of the labelling result of ARJMC as M increases (M = 100, 200, 500, 1000; only one view is shown).

Figs. 1(c)–(g) show the evolution of the pairwise weights (5) as M increases until 10,000 for the data in Fig. 1(b). The matrices exhibit a four-block pattern, indicating strong mutual preference among inliers from the same structure. This phenomenon allows accurate selection of minimal subsets in Multi-GS [6]. More pertinently, as we predicted in Sec. 3.2, the weights converge as M increases, as evidenced by the stabilising block pattern. Note that only a small number of weights are actually computed in Multi-GS [6]; the full matrix of weights is calculated here for illustration only.

We run ARJMC and record the following performance measures: the value of the objective function f(k, θk) in Eq. (1), and the segmentation error. 
The latter involves assigning each datum xi ∈ X to the nearest structure in θk if the residual is less than the threshold t; else xi is labelled as an outlier. The overall labelling error is then obtained. The measures are recorded at time intervals corresponding to the instances when M = 100, 200, ..., 1000 hypotheses have been generated so far in Algorithm 1. Median results over 20 repetitions on the data in Fig. 1(b) are shown in Figs. 1(h) and 1(i). Figs. 1(j)–1(m) depict the evolution of the segmentation result of ARJMC as M increases.

For objective comparisons the competing two-stage methods were tested as follows: First, M = 100, 200, ..., 1000 hypotheses are cumulatively generated (using both uniform random sampling [9] and Multi-GS [6]). A new instance of each method is invoked on each set of M hypotheses. We ensure that each method returns the true number of structures for all M; this represents an advantage over ARJMC, since the “online learning” nature of ARJMC means the number of structures is not discovered until closer to convergence. Results are also shown in Figs. 1(h) and 1(i).

Firstly, it is clear that the performance of the two-stage methods on both measures is improved dramatically with the application of Multi-GS for hypothesis generation. From Fig. 
1(h) ARJMC is the most efficient in minimising the function f(k, θk); it converges to a low value in significantly less time. It should be noted, however, that the other methods are not directly minimising AIC or f(k, θk). The segmentation error (which no method here directly minimises) therefore represents a more objective performance measure. From Fig. 1(i), it can be seen that the initial error of ARJMC is much higher than that of all the other methods, a direct consequence of not yet having estimated the true number of structures. The error is eventually minimised as ARJMC converges. Table 1, which summarises the results on the other datasets (all using Multi-GS), conveys a similar picture. Further results on multi-homography detection also yield similar outcomes (see supplementary material).

breadtoycar (3 structures): 37, 39 and 34 inliers, 56 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    25.22   31.74  24.78  68.70
  200    14.13   26.74  18.91  61.96
  300    10.43   33.48  18.70  54.13
  400     9.57   27.83  18.26  48.48
  500     9.57   27.39  26.30  10.87
  600     8.70   25.87  20.43   8.48
  700     8.91   30.43  21.30   7.17
  800     7.83   21.09  22.17   6.52
  900     7.39   25.22  26.74   6.52
 1000     7.17   20.43  25.22   6.52
 Time    12.88    9.40  21.57   5.44

carchipscube (3 structures): 19, 33 and 53 inliers, 60 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    21.82   29.70  23.64  52.73
  200    15.76   36.97  30.30  58.18
  300    12.73   24.24  26.67  49.09
  400    10.30   32.73  28.48  24.24
  500    10.30   30.91  27.27  13.33
  600     9.09   28.48  23.03   9.70
  700     8.48   22.42  27.88   9.70
  800    10.30   26.67  25.45   9.70
  900     8.48   36.36  26.06   9.70
 1000     9.09   28.48  23.64   9.70
 Time     9.57    7.02  16.23   5.16

toycubecar (3 structures): 45, 69 and 14 inliers, 72 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    31.75   26.25  29.00  81.50
  200    23.00   27.25  19.25  75.75
  300    22.75   25.25  18.00  65.00
  400    22.00   26.25  22.50  52.75
  500    22.50   22.50  23.00  45.75
  600    21.75   26.50  20.75  37.75
  700    17.50   26.50  23.00  23.50
  800    21.50   26.50  20.00  18.50
  900    18.75   20.75  15.75  19.75
 1000    15.50   23.00  18.25  19.50
 Time    11.73    8.14  18.94   4.95

breadcubechip (3 structures): 34, 57 and 58 inliers, 81 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    23.49   21.08  24.10  81.93
  200    16.27   13.25  15.06  78.92
  300    12.65   10.84  18.07  70.48
  400    13.86   11.45  14.46  48.80
  500    12.05   13.25  13.25  37.95
  600    12.05   12.05  12.05  11.45
  700    10.84   11.45   9.04   9.64
  800    10.84   12.05  11.45   9.64
  900    10.84   10.24  10.24   7.83
 1000    10.84   10.84  10.84   8.43
 Time     9.57    6.96  16.38   4.47

breadcartoychip (4 structures): 33, 23, 41 and 58 inliers, 82 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    36.92   35.86  32.07  54.01
  200    28.90   27.00  20.04  61.60
  300    19.41   21.30  17.09  61.18
  400    17.51   20.88  15.19  56.54
  500    13.92   18.56  13.50  21.94
  600    11.81   19.83  13.92  18.99
  700    10.76   15.18  12.66  18.14
  800    10.55   18.56  12.24  10.97
  900    10.34   14.55  11.39   9.70
 1000     9.70   15.18  11.60   9.70
 Time    13.40    9.86  22.46   5.39

biscuitbookbox (3 structures): 67, 41 and 54 inliers, 97 outliers
    M    FLOSS  ENERGY  QP-MF  ARJMC
  100    17.57   25.87  18.15  49.03
  200    11.00   17.95  17.76  31.85
  300     7.92   17.95   9.27   6.95
  400     8.49   14.86  13.51   6.37
  500     7.92   18.73  10.04   4.44
  600     5.79   17.18  11.39   5.21
  700     5.79   18.92  14.67   4.83
  800     5.79   16.60  13.51   5.21
  900     5.79   18.53  12.36   5.21
 1000     5.79   13.71  13.13   5.79
 Time    15.46   10.66  24.36   5.47

Table 1: Median segmentation error (%) at different numbers of hypotheses M. The last row of each block gives the time elapsed (in seconds) at M = 1000.

5 Conclusions

By design, since our algorithm conducts hypothesis sampling, geometric fitting and model selection simultaneously, it minimises wastage in the sampling process and converges faster than previous two-stage approaches. This is evident from the experimental results. Underpinning our novel Reversible Jump MCMC method is an efficient hypothesis generator whose proposal distribution is learned online.
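As a rough illustration of what "learned online" with diminishing adaptation can mean, the sketch below maintains a categorical proposal whose weights are blended with newly observed hypothesis scores at a step size that decays roughly as 1/t. The class name, the scoring interface, and the blending rule are illustrative assumptions for this sketch, not the exact update used in our method.

```python
import random

class AdaptiveProposal:
    """Categorical proposal over items, refined online (hypothetical scheme).

    The step size decays as roughly 1/t, so later observations perturb the
    proposal less and less -- the intuition behind diminishing adaptation.
    """

    def __init__(self, n_items, seed=0):
        self.w = [1.0 / n_items] * n_items   # start from a uniform proposal
        self.t = 0                           # number of updates so far
        self.rng = random.Random(seed)

    def sample(self):
        # Draw an index with probability proportional to its current weight.
        r = self.rng.random() * sum(self.w)
        acc = 0.0
        for i, wi in enumerate(self.w):
            acc += wi
            if r <= acc:
                return i
        return len(self.w) - 1

    def update(self, scores):
        # Blend current weights with the normalised new scores; the shrinking
        # step size eps keeps the total adaptation bounded over time.
        self.t += 1
        eps = 1.0 / (self.t + 1)
        total = sum(scores)
        for i, s in enumerate(scores):
            self.w[i] = (1.0 - eps) * self.w[i] + eps * (s / total)
```

Because the step size shrinks, the proposal changes less and less as sampling proceeds, which is the property the ergodicity argument below relies on.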
Drawing from new theory on Adaptive MCMC, we prove that our efficient hypothesis generator satisfies the properties crucial to ensure convergence to the correct target distribution. Our work thus links the latest developments from MCMC optimisation and geometric model fitting.

Acknowledgements. The authors would like to thank Anders Eriksson for his insightful comments. This work was partly supported by the Australian Research Council grant DP0878801.

References

[1] H. Akaike. A new look at the statistical model identification. IEEE Trans. on Automatic Control, 19(6):716-723, 1974.
[2] C. Andrieu, N. de Freitas, and A. Doucet. Robust full Bayesian learning for radial basis networks. Neural Computation, 13:2359-2407, 2001.
[3] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50:5-43, 2003.
[4] C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4), 2008.
[5] S. P. Brooks, N. Friel, and R. King. Classical model selection via simulated annealing. J. R. Statist. Soc. B, 65(2):503-520, 2003.
[6] T.-J. Chin, J. Yu, and D. Suter. Accelerated hypothesis generation for multi-structure robust fitting. In European Conf. on Computer Vision, 2010.
[7] A. Delong, A. Osokin, H. Isack, and Y. Boykov. Fast approximate energy minimization with label costs. In Computer Vision and Pattern Recognition, 2010.
[8] L. Fan and T. Pylvänäinen. Adaptive sample consensus for efficient random optimisation. In Int. Symposium on Visual Computing, 2009.
[9] M. A. Fischler and R. C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Comm. of the ACM, 24:381-395, 1981.
[10] S. Gaffney and P. Smyth. Trajectory clustering with mixtures of regression models. In ACM SIG on Knowledge Discovery and Data Mining, 1999.
[11] P. Giordani and R. Kohn. Adaptive independent Metropolis-Hastings by fast estimation of mixtures of normals. Journal of Computational and Graphical Statistics, 19(2):243-259, 2010.
[12] P. J. Green. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711-732, 1995.
[13] H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7(2):223-242, 2001.
[14] R. Hartley. In defense of the eight-point algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(6):580-593, 1997.
[15] R. Hartley and A. Zisserman. Multiple View Geometry. Cambridge University Press, 2004.
[16] P. J. Huber. Robust Statistics. John Wiley & Sons Inc., 2009.
[17] Y.-D. Jian and C.-S. Chen. Two-view motion segmentation by mixtures of Dirichlet process with model selection and outlier removal. In International Conference on Computer Vision, 2007.
[18] N. Lazic, I. Givoni, B. Frey, and P. Aarabi. FLoSS: Facility location for subspace segmentation. In IEEE Int. Conf. on Computer Vision, 2009.
[19] H. Li. Two-view motion segmentation from linear programming relaxation. In Computer Vision and Pattern Recognition, 2007.
[20] D. Nott and R. Kohn. Adaptive sampling for Bayesian variable selection. Biometrika, 92:747-763, 2005.
[21] N. Quadrianto, T. S. Caetano, J. Lim, and D. Schuurmans. Convex relaxation of mixture regression with efficient algorithms. In Advances in Neural Information Processing Systems, 2010.
[22] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components. J. R. Statist. Soc. B, 59(4):731-792, 1997.
[23] G. O. Roberts and J. S. Rosenthal. Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44:458-475, 2007.
[24] G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349-367, 2009.
[25] K. Schindler and D. Suter. Two-view multibody structure-and-motion with outliers through model selection. IEEE Trans. on Pattern Analysis and Machine Intelligence, 28(6):983-995, 2006.
[26] N. Thakoor and J. Gao. Branch-and-bound hypothesis selection for two-view multiple structure and motion segmentation. In Computer Vision and Pattern Recognition, 2008.
[27] P. H. S. Torr. Motion segmentation and outlier detection. PhD thesis, Dept. of Engineering Science, University of Oxford, 1995.
[28] P. H. S. Torr and C. H. Davidson. IMPSAC: Synthesis of importance sampling and random sample consensus. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(3):354-364, 2003.
[29] E. Vincent and R. Laganière. Detecting planar homographies in an image pair. In International Symposium on Image and Signal Processing and Analysis, 2001.
[30] H. S. Wong, T.-J. Chin, J. Yu, and D. Suter. Dynamic and hierarchical multi-structure geometric model fitting. In International Conference on Computer Vision, 2011.
[31] J. Yu, T.-J. Chin, and D. Suter. A global optimization approach to robust multi-model fitting. In Computer Vision and Pattern Recognition, 2011.