{"title": "Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching", "book": "Advances in Neural Information Processing Systems", "page_first": 985, "page_last": 992, "abstract": null, "full_text": "Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching

Chien-Ping Lu and Eric Mjolsness
Department of Computer Science
Yale University
New Haven, CT 06520-8285

Abstract

We present a Mean Field Theory method for locating two-dimensional objects that have undergone rigid transformations. The resulting algorithm is a form of coarse-to-fine correlation matching. We first consider problems of matching synthetic point data, and derive a point matching objective function. A tractable line segment matching objective function is derived by considering each line segment as a dense collection of points and approximating it by a sum of Gaussians. The algorithm is tested on real images from which line segments are extracted and matched.

1 Introduction

Assume that an object in a scene can be viewed as an instance of the model placed in space by some spatial transformation, and that object recognition is achieved by discovering an instance of the model in the scene. Two tightly coupled subproblems need to be solved for locating and recognizing the model: the correspondence problem (how are scene features put into correspondence with model features?) and the localization problem (what is the transformation that acceptably relates the model features to the scene features?). If the correspondence is known, the transformation can be determined easily by least-squares procedures. Similarly, for a known transformation, the correspondence can be found by aligning the model with the scene, or the problem becomes an assignment problem if the scene feature locations are jittered by noise.
Several approaches have been proposed to solve this problem. Some tree-pruning methods [1, 3] make hypotheses concerning the correspondence by searching over a tree in which each node represents a partial match. Each partial match is then evaluated through the pose that best fits it. In the generalized Hough transform or, equivalently, template matching approach [7, 3], optimal transformation parameters are computed for each possible pairing of a model feature and a scene feature, and these "optimal" parameters then "vote" for the closest candidate in the discretized transformation space.

By contrast with the tree-pruning methods and the generalized Hough transform, we propose to formulate the problem as an objective function and optimize it directly by using Mean Field Theory (MFT) techniques from statistical physics, adapted as necessary to produce effective algorithms in the form of analog neural networks.

2 Point Matching

Consider the problem of locating a two-dimensional "model" object that is believed to appear in the "scene". Assume first that the scene and the model are represented by sets of "points", {x_i} and {y_a} respectively. The problem is to recover the actual transformation (translation and rotation) that relates the two sets of points. It can be solved by minimizing the following objective function

E_{\mathrm{match}}(M, \theta, t) = \sum_{ia} M_{ia} \|x_i - R_\theta y_a - t\|^2    (1)

where {M_{ia}} = M is a 0/1-valued "match matrix" representing the unknown correspondence, R_\theta is a rotation matrix with rotation angle \theta, and t is a translation vector.

2.1 Constraints on match variables

We need to enforce some constraints on the correspondence (match) variables M_{ia}; otherwise all M_{ia} = 0 in (1).
Here, we use the following constraint

\sum_{ia} M_{ia} = N, \quad \forall ia:\; M_{ia} \geq 0,    (2)

implying that there are exactly N matches among all possible matches, where N is the number of model features. Summing over permutation matrices obeying this constraint, the effective objective function is approximately [5]

F(\theta, t, \beta) = -\frac{1}{\beta} \sum_{ia} e^{-\beta \|x_i - R_\theta y_a - t\|^2},    (3)

which has the same fixed points as

E_{\mathrm{penalty}}(M, \theta, t) = E_{\mathrm{match}}(M, \theta, t) + \frac{1}{\beta} \sum_{ia} M_{ia} (\log M_{ia} - 1),    (4)

where M_{ia} is treated as a continuous variable and is subject to the penalty function x(\log x - 1).

Figure 1: Assume that there is only translation between the model and the scene, each containing 20 points. The objective functions at different temperatures (\beta^{-1}): 0.0512 (top left), 0.0128 (top right), 0.0032 (bottom left) and 0.0008 (bottom right), are plotted as energy surfaces over the x and y components of the translation.

Now, let \beta = 1/2\sigma^2 and write

E_{\mathrm{point}}(\theta, t) = \sum_{ia} e^{-\frac{1}{2\sigma^2} \|x_i - R_\theta y_a - t\|^2}.    (5)

The problem then becomes that of maximizing E_{\mathrm{point}}, which in turn can be interpreted as minimizing the Euclidean distance between two Gaussian-blurred images, one containing the scene points x_i and the other a transformed version of the model points y_a. Tracking the local maximum of the objective function from large \sigma to small \sigma, as in deterministic annealing and other continuation methods, corresponds to a coarse-to-fine correlation matching. See Figure 1 for a demonstration of a simpler case in which only translation is applied to the model.
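The coarse-to-fine behavior of (3)-(5) can be sketched for the translation-only case of Figure 1. The following is a minimal illustration of deterministic annealing on E_point with a direct solve for t; the annealing schedule, iteration counts, and toy data are our own choices, not the paper's implementation:

```python
import numpy as np

def soft_match_translation(x, y, sigmas, iters=50):
    """Estimate the translation t relating model points y to scene points x by
    maximizing E_point = sum_ia exp(-||x_i - y_a - t||^2 / (2 s^2)), annealing
    s from coarse to fine. Each step re-solves t as the average displacement
    weighted by the soft correspondences m_ia."""
    t = np.zeros(2)
    for s in sigmas:                                    # coarse-to-fine schedule
        for _ in range(iters):
            d = x[:, None, :] - y[None, :, :] - t       # residuals x_i - y_a - t
            m = np.exp(-(d ** 2).sum(axis=2) / (2.0 * s ** 2))  # soft matches m_ia
            w = m / (m.sum() + 1e-300)                  # normalize; guard underflow
            t = t + (w[:, :, None] * d).sum(axis=(0, 1))  # fixed-point update of t
    return t

rng = np.random.default_rng(0)
y = rng.uniform(0.0, 1.0, size=(20, 2))    # 20 model points, as in Figure 1
t_true = np.array([0.3, -0.2])
x = y + t_true                             # scene = exactly translated model
t_est = soft_match_translation(x, y, sigmas=[0.5, 0.2, 0.05, 0.02])
```

On this noiseless toy problem the weighted-mean update recovers the true translation; with outliers, the annealing from large to small s is what keeps the estimate from being trapped by spurious local maxima.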
2.2 The descent dynamics

A gradient descent dynamics for finding the saddle point of the effective objective function F is

\dot t = \kappa \sum_{ia} m_{ia} (x_i - R_\theta y_a - t), \qquad \dot\theta = \kappa \sum_{ia} m_{ia} (x_i - R_\theta y_a - t)^T (R_{\theta+\pi/2}\, y_a),    (6)

where m_{ia} = \langle M_{ia} \rangle_\beta = e^{-\beta \|x_i - R_\theta y_a - t\|^2} is the "soft correspondence" associated with M_{ia}. Instead of updating t by descent dynamics, we can also solve for t directly.

3 The Vernier Network

Though the effective objective is non-convex over translation at low temperatures, its dependence on rotation is non-convex even at relatively high temperatures.

3.1 Hierarchical representation of variables

We propose overcoming this problem by applying Mean Field Theory (MFT) to a hierarchical representation of rotation resulting from the change of variables [4]

\theta = \sum_{b=0}^{B-1} X_b (\theta_b + \phi_b), \quad \phi_b \in [-\epsilon, \epsilon],    (7)

where \epsilon = \pi/2B, the \theta_b = (b + \tfrac{1}{2})\Delta (with \Delta = 2\epsilon) are the constant centers of the intervals, and the \phi_b are fine-scale "vernier" variables. The X_b are binary variables (so X_b \in \{0, 1\}) that satisfy the winner-take-all (WTA) constraint \sum_b X_b = 1.

The essential reason that this hierarchical representation of \theta has fewer spurious local minima than the conventional analog representation is that the change of variables also changes the connectivity of the network's state space: big jumps in \theta can be achieved by local variations of X.

3.2 Vernier optimization dynamics

E_{\mathrm{point}} can be transformed as (see [6, 4])^1

E_{\mathrm{point}}(\theta, t) = E\Big(\sum_b X_b(\theta_b + \phi_b), \sum_b X_b t_b\Big) = \sum_b X_b\, E(\theta_b + \phi_b, t_b)    (8)

^1 Notation for coordinate descent with a 2-phase clock \Psi(t):
  \oplus for clocked sum;
  \bar{x} for a clamped variable;
  x^A for a set of variables to be optimized analytically;
  (v, u)^H for Hopfield/Grossberg dynamics;
  E(x, y)^\oplus for coordinate descent/ascent on x, then y, iterated if necessary. Nested angle brackets correspond to nested loops.
\stackrel{\mathrm{MFT}}{\approx} \Big[ \sum_b X_b E(\theta_b + v_b, t_b) + \frac{1}{\beta} \sum_b \Big(u_b v_b - \log \frac{\sinh(\epsilon u_b)}{u_b}\Big) + \mathrm{WTA}(X, \beta) \Big]^{\langle\langle (v, u)^H,\, t^A \rangle,\, X^A \rangle \oplus}    (9)

Each bin-specific rotation angle v_b can be found by the following fixed point equations

u_b = -\beta \frac{\partial E(\theta_b + v_b, t_b)}{\partial v_b}, \qquad v_b = \epsilon \coth(\epsilon u_b) - \frac{1}{u_b}.    (10)

The algorithm is illustrated in Figure 2.

Figure 2: Shown here is an example of matching a 20-point model to a scene with 66.7% spurious outliers. The model is represented by circles. The set of square dots is an instance of the model in the scene. All other dots are outliers. From left to right are configurations at annealing steps 1, 10, and 51, respectively.

4 Line Segment Matching

In many vision problems, representation of images by line segments has the advantage of compactness and subpixel accuracy along the direction transverse to the line. However, such a representation of an object may vary substantially from image to image due to occlusions and different illumination conditions.

4.1 Indexing points on line segments

The problem of matching line segments can be thought of as a point matching problem in which each line segment is treated as a dense collection of points.
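This reduction can be sketched numerically: sample points uniformly along each segment and sum the Gaussian correlation of (5) over all point pairs, weighted by segment length. The quadrature below is our own brute-force discretization on toy data, not the paper's closed-form method; the sample count and \sigma are arbitrary choices:

```python
import numpy as np

def e_seg(scene_segs, model_segs, theta, t, sigma=0.1, n=16):
    """Approximate the segment-matching objective by treating each segment as
    n midpoint-rule sample points and summing the Gaussian correlation of
    eq. (5) over all point pairs, weighted by the segment lengths."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    u = (np.arange(n) + 0.5)[:, None] / n              # midpoint samples in [0, 1]
    total = 0.0
    for p, p2 in scene_segs:
        xs = p + u * (p2 - p)                          # samples of s_i(u)
        li = np.linalg.norm(p2 - p)
        for q, q2 in model_segs:
            ys = (q + u * (q2 - q)) @ R.T + t          # R_theta m_a(v) + t
            la = np.linalg.norm(q2 - q)
            d2 = ((xs[:, None, :] - ys[None, :, :]) ** 2).sum(-1)
            total += li * la * np.exp(-d2 / (2.0 * sigma ** 2)).mean()
    return total

# Toy check: one model segment; scene = the same segment rotated and shifted.
theta_true, t_true = 0.5, np.array([0.2, -0.1])
R_true = np.array([[np.cos(theta_true), -np.sin(theta_true)],
                   [np.sin(theta_true),  np.cos(theta_true)]])
model = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]))]
scene = [(model[0][0] @ R_true.T + t_true, model[0][1] @ R_true.T + t_true)]
e_true = e_seg(scene, model, theta_true, t_true)   # objective at the true pose
e_far = e_seg(scene, model, 0.0, np.array([2.0, 2.0]))  # objective at a wrong pose
```

The objective is large at the true pose and vanishes for a distant one, which is the behavior the continuation scheme exploits.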
Assume now that both the scene and the model are represented by sets of line segments, {s_i} and {m_a} respectively. Both the model and the scene line segments are represented by their endpoints as s_i = (p_i, p'_i) and m_a = (q_a, q'_a), where p_i, p'_i and q_a, q'_a are the endpoints of the ith scene segment and the ath model segment, respectively. The locations of the points on each scene segment and model segment can be parameterized as

x_i = s_i(u) = p_i + u(p'_i - p_i), \quad u \in [0, 1], and    (11)
y_a = m_a(v) = q_a + v(q'_a - q_a), \quad v \in [0, 1].    (12)

Now the model points and the scene points can be thought of as indexed by \hat\imath = (i, u) and \hat a = (a, v). Using this indexing, we have \sum_{\hat\imath} \propto \sum_i l_i \int_0^1 du and \sum_{\hat a} \propto \sum_a l_a \int_0^1 dv, where l_i = \|p_i - p'_i\| and l_a = \|q_a - q'_a\|. The point matching objective function (5) can be specialized to line segment matching as [5]

E_{\mathrm{seg}}(\theta, t) = \sum_{ia} l_i l_a \int_0^1 \int_0^1 e^{-\frac{1}{2\sigma^2} \|s_i(u) - R_\theta m_a(v) - t\|^2} \, du \, dv.    (13)

As a special case of the point matching objective function, (13) can readily be transformed to the vernier network previously developed for the point matching problem.

4.2 Gaussian sum approximation

Figure 3: Approximating \Theta(t) by a sum of 3 Gaussians.

Note that, as in Figure 3 and [5],

\Theta(t) = \begin{cases} 1 & \text{if } t \in [0, 1] \\ 0 & \text{otherwise} \end{cases} \;\approx\; \sum_{k=1}^{3} A_k \exp\Big(-\frac{1}{2}\frac{(c_k - t)^2}{\sigma_k^2}\Big),    (14)

where, by numerical minimization of the Euclidean distance between these two functions of t, the parameters may be chosen as A_1 = A_3 = 0.800673, A_2 = 1.09862, \sigma_1 = \sigma_3 = 0.0929032, \sigma_2 = 0.237033, c_1 = 1 - c_3 = 0.116807, and c_2 = 0.5. Using this approximation, each finite double integral in (13) can be replaced by

\sum_{k,l=1}^{3} A_k A_l \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} e^{-\frac{1}{2\sigma_k^2}(c_k - u)^2} \, e^{-\frac{1}{2\sigma_l^2}(c_l - v)^2} \, e^{-\frac{1}{2\sigma^2} \|s_i(u) - R_\theta m_a(v) - t\|^2} \, du \, dv.    (15)
Each of these nine Gaussian integrals can be done exactly. Defining

v_{iakl} = s_i(c_k) - R_\theta m_a(c_l) - t,    (16)
\bar p_i = p'_i - p_i, \quad \bar q_a = R_\theta (q'_a - q_a),    (17)

(15) becomes, up to a multiplicative prefactor,

\exp\left( -\frac{1}{2} \, \frac{\|v_{iakl}\|^2 \sigma^2 + (v_{iakl} \times \bar p_i)^2 \sigma_k^2 + (v_{iakl} \times \bar q_a)^2 \sigma_l^2}{(\sigma^2 + \bar p_i^2 \sigma_k^2)(\sigma^2 + \bar q_a^2 \sigma_l^2) - \sigma_k^2 \sigma_l^2 (\bar p_i \cdot \bar q_a)^2} \right),    (18)

as was calculated by Garrett [2, 5]. From the Gaussian sum approximation, we get a closed-form objective function which can be readily optimized to give a solution to the line segment matching problem.

Figure 4: The model line segments, which are transformed with the optimal parameter found by the matching algorithm, are overlaid on the scene image. The algorithm has successfully located the model object in the scene.

5 Results and Discussion

The line segment matching algorithm described in this paper was tested on scenes captured by a CCD camera producing 640 x 480 images, which were then processed by an edge detector. Line segments were extracted using a polygonal approximation to the edge images. The model line segments were extracted from a scene containing a canonically positioned model object (Figure 4, left). They were then matched to those extracted from a scene containing a differently positioned and partially occluded model object (Figure 4, right). The result of matching is shown in Figure 5.

Our approach is based on a scale-space continuation scheme derived from an application of Mean Field Theory to the match variables. It provides a means to avoid trapping by local extrema and is more efficient than stochastic searches such as simulated annealing. The estimation of location parameters based on continuously improved "soft correspondences" and scale-space is often more robust than estimation based on crisp (but usually inaccurate) correspondences.
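The quality of the three-Gaussian fit (14), with the parameters quoted in Section 4.2, can be checked numerically. The vectorized helper below is our own; the constants are taken from the text:

```python
import numpy as np

# Parameters of the 3-Gaussian fit to the indicator of [0, 1], eq. (14)
A = np.array([0.800673, 1.09862, 0.800673])
sig = np.array([0.0929032, 0.237033, 0.0929032])
c = np.array([0.116807, 0.5, 1.0 - 0.116807])

def boxcar_approx(t):
    """Evaluate sum_k A_k exp(-(c_k - t)^2 / (2 sigma_k^2)) at the given t's."""
    t = np.atleast_1d(np.asarray(t, dtype=float))[:, None]
    return np.sum(A * np.exp(-0.5 * ((c - t) / sig) ** 2), axis=1)

inside = boxcar_approx([0.3, 0.5, 0.7])   # stays near 1 inside the interval
outside = boxcar_approx([-0.3, 1.3])      # decays toward 0 outside it
```

The approximation ripples by roughly ten percent inside the interval and falls off rapidly outside, which is what makes it an adequate smooth surrogate for the finite integration limits in (13).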
The vernier optimization dynamics arises from an application of Mean Field Theory to a hierarchical representation of the rotation, which turns the original unconstrained optimization problem over the rotation \theta into several constrained optimization problems over smaller \theta intervals. Such a transformation results in a Hopfield-style dynamics on the rotation \theta, which effectively coordinates the dynamics of rotation and translation during the optimization. The algorithm tends to find a roughly correct translation first, and then tunes up the rotation.

Figure 5: Shows how the model line segments (gray) and the scene segments (black) are matched. The model line segments, which are transformed with the optimal parameter found by the matching algorithm, are overlaid on the scene line segments with which they are matched. Most of the endpoints and the lengths of the line segments are different. Furthermore, one long segment frequently corresponds to several short ones. However, the matching algorithm is robust enough to uncover the underlying rigid transformation from the incomplete and ambiguous data.

6 Acknowledgements

This work was supported under grant N00014-92-J-4048 from ONR/DARPA.

References

[1] H. S. Baird. Model-Based Image Matching Using Location. The MIT Press, Cambridge, Massachusetts, first edition, 1984.

[2] C. Garrett, 1990. Private communication to Eric Mjolsness.

[3] W. E. L. Grimson and T. Lozano-Perez. Localizing overlapping parts by searching the interpretation tree. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:469-482, 1987.

[4] C.-P. Lu and E. Mjolsness. Mean field point matching by vernier network and by generalized Hough transform. In World Congress on Neural Networks, pages 674-684, 1993.

[5] E. Mjolsness. Bayesian inference on visual grammars by neural nets that optimize.
In SPIE Science of Artificial Neural Networks, pages 63-85, April 1992.

[6] E. Mjolsness and W. L. Miranker. Greedy Lagrangians for neural networks: Three levels of optimization in relaxation dynamics. Technical Report YALEU/DCS/TR-945, Yale Computer Science Department, January 1993.

[7] G. Stockman. Object recognition and localization via pose clustering. Computer Vision, Graphics, and Image Processing, (40), 1987.
", "award": [], "sourceid": 866, "authors": [{"given_name": "Chien-Ping", "family_name": "Lu", "institution": null}, {"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}