{"title": "Improving Convergence in Hierarchical Matching Networks for Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 401, "page_last": 408, "abstract": null, "full_text": "Improving Convergence in Hierarchical \n\nMatching Networks for Object \n\nRecognition \n\nJoachim Utans* \n\nGene Gindit \n\nDepartment of Electrical Engineering \n\nYale University \n\nP. O. Box 2157 Yale Station \n\nNew Haven, CT 06520 \n\nAbstract \n\nWe are interested in the use of analog neural networks for recog(cid:173)\nnizing visual objects. Objects are described by the set of parts \nthey are composed of and their structural relationship. Struc(cid:173)\ntural models are stored in a database and the recognition prob(cid:173)\nlem reduces to matching data to models in a structurally consis(cid:173)\ntent way. The object recognition problem is in general very diffi(cid:173)\ncult in that it involves coupled problems of grouping, segmentation \nand matching. We limit the problem here to the simultaneous la(cid:173)\nbelling of the parts of a single object and the determination of \nanalog parameters. This coupled problem reduces to a weighted \nmatch problem in which an optimizing neural network must min(cid:173)\nimize E(M, p) = LO'i MO'i WO'i(p), where the {MO'd are binary \nmatch variables for data parts i to model parts a and {Wai(P)} \nare weights dependent on parameters p . In this work we show that \nby first solving for estimates p without solving for M ai , we may \nobtain good initial parameter estimates that yield better solutions \nfor M and p. \n\n*Current address: \n\nInternational Computer Science Institute, 1947 Center Street, \n\nSuite 600, Berkeley, CA 94704, utans@icsi.berkeley.edu \n\ntCurrent address: SUNY Stony Brook, Department of Electrical Engineering, Stony \n\nBrook, NY 11784 \n\n401 \n\n\f402 \n\nUtans and Gindi \n\nFigure 1: Stored Model for a 3-Level Compositional Hierarchy (compare Figure 3) . 
\n\n1 Recognition via Stochastic Forward Models \n\nThe Frameville object recognition system introduced by Mjolsness et al [5, 6, 1] \nmakes use of a compositional hierarchy to represent stored models. The recognition \nproblem is formulated as the minimization of an objective function. Mjolsness [3,4] \nhas proposed to derive the objective function describing the recognition problem \nin a principled way from a stochastic model that describes the objects the system \nis designed to recognize (stochastic visual grammar). The description mirrors the \ndata representation as a compositional hierarchy, at each stage the description of \nthe object becomes more detailed as parts are added. \n\nThe stochastic model assigns a probability distribution at each stage of that process. \nThus at each level of the hierarchy a more detailed description of parts in terms of \ntheir subparts is given by specifying a probability distribution for the coordinates of \nthe subparts. Explicitly specifying these distributions allows for finer control over \nindividual part descriptions than the rather general parameter error terms used \nbefore [1, 8]. The goal is to derive a joint probability distribution for an instance \nof an object and its parts as it appears in the scene. This gives the probability of \nobserving such an object prior to the arrival of the data. Given an observed image, \nthe recognition problem can be stated as a Bayesian inference problem that the \nneural network solves. \n\n1.1 3-Level Stochastic Model \n\nFor example, consider the model shown in Figure 1 and 3. The object and its parts \nare represented as line segments (sticks), the parameters were p = (x, y, I, ())T with \nx , y denoting position, I the length of a stick and () its orientation. The model \nconsiders only a rigid translation of an object in the image. \nOnly one model is stored. 
From a central position p = (x, y, l, θ), itself chosen from a uniform density, the N_β parts at the first level are placed. Their structural relationship is stored as coordinates u_β in an object-centered coordinate frame, i.e. relative to p. While placing the parts, Gaussian distributed noise with mean 0 is added to the position coordinates to capture the notion of natural variation of the object's shape. The variance is coordinate specific, but we assume the same distribution for the x and y coordinates, σ²_{βx} = σ²_{βy}; σ²_{βl} is the variance for the length component and σ²_{βθ} for the relative angle. In addition, here we assume for simplicity that all parts are independently distributed. Each of the parts β is composed of subparts. For simplicity of notation, we assume that each part β is composed from the same number of subparts N_m (note that the index γ in Figure 2 here corresponds to the double index βm to keep track of which part β subpart βm belongs to on the model side, i.e. the index βm denotes the mth subpart of part β). The next step models the unordering of parts in the image via a permutation matrix M, chosen with probability P(M), by which their identity is lost. If this step were omitted, the recognition problem would reduce to the problem of estimating part parameters because the parts would already be labeled. \n\nFrom the grammar we compute the final joint probability distribution (all constant terms are collected in a constant C): \n\nP(M, {p_{βm}}, {p_β}, p) = C P(M) Π_β p(p_β | p) Π_{β,m} p(p_{βm} | p_β) \n\n1.2 Frameville Architecture for Part Labelling within a Single Object \n\nThe stochastic forward model for the part labelling problem with only a single object present in the scene translates into a reduced Frameville architecture as depicted in Figure 2. 
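The generative process of Section 1.1 can be sketched in a few lines; the following is a minimal Python sketch of the two placement steps and the final permutation, for position coordinates only (the part offsets, noise levels, and function name are illustrative assumptions, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_object(center, u_parts, u_subparts, sigma_part, sigma_sub):
    """Forward-sample the 2-level placement: parts are placed around the
    object center with Gaussian noise, subparts around each part, and the
    subpart identities are then lost through a random permutation M."""
    parts = [center + u + sigma_part * rng.standard_normal(2) for u in u_parts]
    subparts = np.array([part + u + sigma_sub * rng.standard_normal(2)
                         for part, subs in zip(parts, u_subparts)
                         for u in subs])
    perm = rng.permutation(len(subparts))  # the permutation step of the grammar
    return subparts[perm], perm

# illustrative model: N_beta = 2 parts with N_m = 2 subparts each
u_parts = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
u_subparts = [[np.array([0.0, 0.5]), np.array([0.0, -0.5])]] * 2
data, perm = sample_object(np.array([5.0, 5.0]), u_parts, u_subparts, 0.1, 0.05)
```

Because the returned array is permuted, a recognizer receiving `data` must recover both the labelling `perm` and the analog parameters, which is exactly the coupled problem the network solves.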
The compositional hierarchy parallels the steps in the stochastic model as parts are added at each level. Match variables appear only at the lowest level, corresponding to the permutation step of the grammar. Parts in the image must be matched to model parts, and parts found to belong to the stored object must be grouped together. \n\nThe single match neuron M_{αi} at the highest level can be set to unity since we assume we know the object's identity and only a single object is present. Similarly, all grouping terms ina_{ij} from the first to the second level can be set to unity for the correct grouping since the grouping is known at this point from the forward model description. In addition, at the intermediate (second) level, we may set M_{βj} = 1 for β = j and M_{βj} = 0 otherwise with no loss of generality. These mid-level frames may be matched ahead of time, but their parameters must be computed from data. Introducing a part permutation at the intermediate levels thus is redundant. Given this, an additional simplification of the ina grouping variables at the lowest (third) level is possible. Since parts are pre-matched at all but the lowest level, ina_{jk} can be expressed in terms of the part match M_{γk} as ina_{jk} = Σ_{γβ} M_{γk} INA_{γβ} M_{βj}, and explicitly representing ina_{jk} as variables is not necessary. \n\nThe input to the system are the {p_k}; recognition involves finding the parameters. \n\nFrom the stochastic model one can derive a new objective function that does not depend on M and the parameters {p_j} by integrating over the {p_j} and summing over all possible permutation matrices M: \n\nP(p, {p_k}) = Σ_{M | M is a permutation} ∫ d{p_j} P(p, {p_j}, {p_k}, M)   (4) \n\nThis formulation leads to an Elastic Net type network [9, 7]. However, this implementation of a separate network for the bootstrap computations is expensive. 
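Equation (4) can be illustrated with a toy one-dimensional version: the sum runs over all N! permutations assigning data points to model offsets, which is what makes a direct implementation costly for realistic part counts (the offsets, variance, and data below are made-up illustrations, not values from the paper):

```python
import math
from itertools import permutations

def marginal_likelihood(p, xs, us, sigma=0.5):
    """Toy 1-D analogue of equation (4): marginalize the match M out of
    the joint by brute force, summing a Gaussian likelihood over all
    permutations that assign data points x_k to model offsets u_b.
    The sum has len(us)! terms, hence the factorial cost."""
    total = 0.0
    for perm in permutations(range(len(us))):
        like = 1.0
        for k, b in enumerate(perm):
            like *= math.exp(-((xs[k] - p - us[b]) ** 2) / (2 * sigma ** 2))
        total += like
    return total / math.factorial(len(us))  # uniform P(M)

xs = [4.1, 6.0, 4.9]   # observed 1-D positions
us = [-1.0, 0.0, 1.0]  # model offsets relative to the center p
# the marginal peaks near the true center, here around p = 5
assert marginal_likelihood(5.0, xs, us) > marginal_likelihood(8.0, xs, us)
```

With only 10 parts the sum already has 3,628,800 terms, which motivates the cheaper sample-average bootstrap described next.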
\n\nHere we use simpler computation where the coarse scale parameters are estimated \nby computing sample averages, corresponding to finding the solution for the Elastic \nNet in the high temperature limit [7]. For the position x we find, after integrating \nover the {xi}, \n\nx \n\nL M{3mkXk \nL{3m 1/(O\"~xO\"~mx) 13m k O\"~xO\"~mx \n\n1 \n\n_ \n\n1 L U{3x \nO\"~x \n\nL{3 1/ O\"~x (3 \n\nand similarly for y. Since the assignment M{3m k of subparts k on the data side \nto subparts fJm on the model side is not known at this point, the first term in \nequations (5) cannot be evaluated. After approximating the actual variance with \n\n(5) \n\n\fImproving Convergence in Hierarchical Matching Networks for Object Recognition \n\n407 \n\nan average variance, these equations reduce to \n\nx \n\n111 \n\nN N L Xk - N N L uf3mx - N L uf3x \n\n13 m \n\nk \n\n13 m 13m \n\n13 \n\n13 \n\n(6) \n\nIn terms of the objective function this translates into assuming that here the error \nterms for all parts are weighted equally. Since these weights would depend on the \nactual part match, this just corresponds to our ignorance regarding identity of the \nparts. This approximation assumes that the variances do not differ by a large \namount, otherwise the approximation p will not be close to the true values. Since \nthe model can be designed such that the part primitives used at the lowest level \nof the grammar are not highly specialized as would be the case for abstractions \nat higher levels of the model, the approximation proved sufficient for the problems \nstudied here. \n\nThe neural network can be used to perform the calculation. The Elastic Net for(cid:173)\nmulation assigns approximately equal weights to all possible assignments at high \ntemperatures. Thus, this behavior can be expressed in the original network with \nmatch variables by choosing Mf3mk = l/{Nf3Nm ) V i,j. This leads to the following \ntwo-pass bootstrap computation. 
Using this specific choice for M, only the analog variables need to be updated to compute the coarse scale estimates. The network with constant M is just the neural network implementation for computing x̄ from equation (6). After these have converged, x̄ can be used to compute x_β = x̄ + u_{βx}. Thus, the parameters for the intermediate levels can be hypothesized from the coarse scale estimate x̄ by adding the known transformation (recall that for intermediate levels, the part identity is preserved and no permutation step takes place; see Figure 2). Then the network is restarted with random values for the match variables to compute the correct labelling and the correct parameters. \n\n2.3 Simulation Results \n\nThe bootstrap procedure has been implemented for a 3-level hierarchical model. The model describes a \"gingerbread man\" as shown in Figure 3. The incorrect solutions observed did not, in the vast majority of cases, violate the permutation matrix constraint, i.e. the assignment was unique. However, even though the assignment is unique, parts were not always assigned correctly. Most commonly, the identity of neighboring parts was interchanged, in particular for cases with large variance. \n\nThe advantage of using the bootstrap initialization is clear from Figure 5. For the simulation, σ₁² = 2σ₂²; the noise variance was identical for all parts. The network computed the solution reliably for large noise variances. In such cases the performance of the network without initialization deteriorates rapidly. Only one set of 10 experiments was used for the graph, but in all simulations performed, the network with initialization consistently outperformed the network without initialization. Figure 5 (right) shows the time, measured in the number of iterations, necessary for the network to converge; it is almost unaffected by the increase in the noise variance. 
This is because the initial values derived from the data are still close to the final solution. While in some cases the random starting point happens to be close to the correct solution and the network without initialization converges rapidly, Figure 5 reflects the typical behavior and demonstrates the advantage of computing approximate initial values. \n\nFigure 5: Results comparing the network without and with initialization (solid line). Left: The success rate indicates the rate at which the network converged to the correct solutions. σ₁² denotes the noise variance at the intermediate level of the model and σ₂² the noise variance at the lowest level. Only one set of 10 experiments was used for the graph, but in all simulations performed, the network with initialization consistently outperformed the network without initialization. Right: The graph shows the average time it takes for the network to converge (as measured by the number of iterations), averaged over 10 experiments. Only simulations where the network converged to the correct solution are used to compute the average time for convergence. The stopping criterion used required all the match neurons to assume values M_{βmk} > 0.95 or M_{βmk} < 0.05. The error bars denote the standard deviation. \n\nAcknowledgements \n\nThis work was supported in part by AFOSR grant AFOSR 90-0224. We thank E. Mjolsness and A. Rangarajan for many helpful discussions. \n\nReferences \n\n[1] G. Gindi, E. Mjolsness, and P. Anandan. Neural networks for model based recognition. In Neural Networks: Concepts, Applications and Implementations, pages 144-173. 
\nPrentice-Hall, 1991. \n\n[2] David Marr. Vision. W. H. Freeman and Co., New York, 1982. \n[3] E. Mjolsness. Bay~sian inference on visual grammars by neural nets that optimize. \nTechnical Report YALEU-DCS-TR-854, Yale University, Dept. of Computer Science, \n1991. \n\n[4] E. Mj~lsness. Visual grammars and their neural nets. In R.P. Lippmann J.E. Moody, \nS.J. Hanson, editor, Advances in Neural Information Processing Systems 4. Morgan \nKaufmann Publishers, San Mateo, CA, 1992. \n\n[5] Eric Mjolsness, Gene Gindi, and P. Anandan. Optimization in model matching and \n\nperceptual organization: A first look. Research report yaleu/dcs/rr-634, Yale Univer(cid:173)\nsity, Department of Computer Science, 1988. \n\n[6] Eric Mjolsness, Gene R. Gindi, and P. Anandan. Optimization in model matching and \n\nperceptual organization. Neural Computation, vol. 1, no. 2, 1989. \n\n[7] Joachim Utans. Neural Networks for Object Recognition within Compositional Hierar(cid:173)\n\nchies. PhD thesis, Department of Electrical Engineering, Yale University, New Haven, \nCT 06520, 1992. \n\n[8] Joachim Utans, Gene R. Gindi, Eric Mjolsness, and P. Anandan. Neural networks \n\nfor object recognition within compositional hierarchies: Initial experiments. Techni(cid:173)\ncal report 8903, Yale University, Center for Systems Science, Department Electrical \nEngineering, 1989. \n\n[9] A. L. Yuille. Generalized deformable models, statistical physics, and matching prob(cid:173)\n\nlems. Neural Computation, 2(2):1-24, 1990. \n\n\f", "award": [], "sourceid": 661, "authors": [{"given_name": "Joachim", "family_name": "Utans", "institution": null}, {"given_name": "Gene", "family_name": "Gindi", "institution": null}]}