{"title": "A Cortically-Plausible Inverse Problem Solving Method Applied to Recognizing Static and Kinematic 3D Objects", "book": "Advances in Neural Information Processing Systems", "page_first": 59, "page_last": 66, "abstract": null, "full_text": "A Cortically-Plausible Inverse Problem \nSolving Method Applied to Recognizing \n\nStatic and Kinematic 3D Objects \n\nDavid W. Arathorn \n\nCenter for Computational Biology, \n\nMontana State University \n\nBozeman, MT 59717 \n\ndwa@cns . montana . edu \n\nGeneral Intelligence Corporation \n\ndwa@giclab . com \n\nAbstract \n\nRecent neurophysiological evidence suggests the ability to interpret \nbiological motion is facilitated by a neuronal \"mirror system\" \nwhich maps visual inputs to the pre-motor cortex. If the common \narchitecture and circuitry of the cortices is taken to imply a \ncommon computation across multiple perceptual and cognitive \nmodalities, this visual-motor interaction might be expected to have \na unified computational basis. Two essential tasks underlying such \nvisual-motor cooperation are shown here to be simply expressed \nand directly solved as transformation-discovery inverse problems: \n(a) discriminating and determining the pose of a primed 3D object \nin a real-world scene, and (b) interpreting the 3D configuration of \nan articulated kinematic object in an image. The recently developed \nmap-seeking method provides \ntractable, \ncortically-plausible solution to these and a variety of other inverse \nproblems which can be posed as the discovery of a composition of \ntransformations between two patterns. The method relies on an \nordering property of superpositions and on decomposition of the \ntransformation spaces inherent in the generating processes of the \nproblem. \n\na mathematically \n\n1 Introduction \nA variety of \"brain tasks\" can be tersely posed as transformation-discovery \nproblems. Vision is replete with such problems, as is limb control. 
The problem of recognizing the 2D projection of a known 3D object is an inverse problem of finding both the visual and pose transformations relating the image and the 3D model of the object. When the object in the image may be one of many known objects, another step is added to the inverse problem, because there are multiple candidates, each of which must be mapped to the input image with possibly different transformations. When the known object is not rigid, the determination of articulations and/or morphings is added to the inverse problem. This includes the general problem of recognition of biological articulation and motion, a task recently attributed to a neuronal mirror-system linking visual and motor cortical areas [1]. \n\nThough the aggregate transformation space implicit in such problems is vast, a recently developed method for exploring vast transformation spaces has allowed some significant progress with a simple unified approach. The map-seeking method [2,4] is a general purpose mathematical procedure for finding the decomposition of the aggregate transformation between two patterns, even when that aggregate transformation space is vast and no prior information is available to restrict the search space. The problem of concurrently searching a large collection of memories can be treated as a subset of the transformation problem, and consequently the same method can be applied to find the best transformation between an input image and a collection of memories (numbering at least thousands in practice to date) during a single convergence. In the last several years the map-seeking method has been applied to a variety of practical problems, most of them related to vision, a few related to kinematics, and some which do not correspond to usual categories of \"brain functions.\" The generality of the method is due to the fact that only the mappings are specialized to the task. 
The mathematics of the search, whether expressed in an algorithm or in a neuronal or electronic circuit, do not change. From an evolutionary biological point of view this is a satisfying characteristic for a model of cortical function because only the connectivity which implements the mappings must be varied to specialize a cortex to a task. All the rest - organization and dynamics - would remain the same across cortical areas. \n\n[Figure 1 schematic: forward mappings t and backward (adjoint) mappings t' through layers L1-L3, converging on memory patterns w1...wn] \n\nFigure 1. Data flow in map-seeking circuit \n\nCortical neuroanatomy offers emphatic hints about the characteristics of its solution in the vast neuronal resources allocated to creating reciprocal top-down and bottom-up pathways. More specifically, recent evidence suggests this reciprocal pathway architecture appears to be organized with reciprocal, co-centered fan-outs in the opposing directions [3], quite possibly implementing inverse mappings. The data flow of map-seeking computations, seen in Figure 1, is architecturally compatible with these features of cortical organization. Though not within the scope of this discussion, it has been demonstrated [4] that the mathematical expression of the map-seeking method, seen in equations 6-9 below, has an isomorphic implementation in neuronal circuitry with reasonably realistic dendritic architecture and dynamics (e.g. compatible with [5]) and oscillatory dynamics. \n\n2 The basis for tractable transformation-discovery \n\nThe related problems of recognition/interpretation of 2D images of static and articulated kinematic 3D objects illustrate how cleanly significant vision problems may be posed and solved as transformation-discovery inverse problems. The visual and pose (in the sense of orientation) transformations, t^visual and t^pose, between a given 3D model m_1 and the extent of an input image containing a 2D projection P(o_1) of an object o_1 mappable to m_1 can be expressed \n\nP(o_1) = t_j^visual ∘ t_k^pose (m_1), t_j^visual ∈ T^visual, t_k^pose ∈ T^pose    eq. 1 \n\nIf we now consider that the model m_1 may be constructed by the one-to-many mapping of a base vector or feature e, and that arbitrarily other models m_i may be similarly constructed by different mappings, then the formation transformation corresponding to the correct \"memory\" converts the memory database search problem into another transformation-discovery problem with one more composed transformation(1) \n\nP(o_1) = t_j^visual ∘ t_k^pose ∘ t_{m_1}^formation (e), t_{m_l}^formation ∈ T^formation, t_{m_l}^formation (e) = m_l, m_l ∈ M    eq. 2 \n\nFinally, if we allow a morphable object to be \"constructed\" by a generative model, whose various configurations or articulations may be generated by a composition of transformations t^generative of some root or seed feature e, the problem of explicitly recognizing the particular configuration or morph becomes a transformation-discovery problem of the form \n\nP(C(o)) = t^visual ∘ t^pose ∘ t^generative (e), t^generative ∈ T^generative    eq. 3 \n\nThese unifying formulations are only useful, however, if there is a tractable method of solving for the various transformations. That is what the map-seeking method provides. Abstractly the problem is the discovery of a composition of transformations between two patterns. 
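As a toy illustration of the formulation in eq. 1 (a hypothetical 1-D sketch with illustrative names and data, not the author's implementation), recognition amounts to searching the product of the component transformation spaces for the composition that best maps the model onto the image:

```python
import itertools
import numpy as np

# Toy sketch of eq. 1 in one dimension: an image is generated as
# P(o) = t_visual ∘ t_pose (m), and recognition inverts the composition
# by brute-force search over the *product* of the two spaces.

def t_visual(v, k):            # toy "visual" transform: circular shift by k
    return np.roll(v, k)

def t_pose(v, f):              # toy "pose" transform: optional reflection
    return v[::-1] if f else v

model = np.array([0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 0.0, 0.0])
image = t_visual(t_pose(model, True), 3)      # the unknown composition

T_visual, T_pose = range(len(model)), (False, True)
best = max(itertools.product(T_visual, T_pose),
           key=lambda c: float(t_visual(t_pose(model, c[1]), c[0]) @ image))
print(best)                    # the recovered (shift, reflection) pair
```

Even in this toy the search cost is the product |T_visual| x |T_pose|; the map-seeking method described next replaces that exhaustive product search with resources proportional to the sum of the layer sizes.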
In general the transformations express the generating process of the problem. Define correspondence c between vectors r and w through a composition of L transformations t_{j_1}^1, t_{j_2}^2, ..., t_{j_L}^L, where t_{j_m}^m ∈ {t_1^m, t_2^m, ..., t_{n_m}^m}, \n\nc(j) = ⟨ ∘_{i=1..L} t_{j_i}^i (r), w ⟩    eq. 4 \n\nwhere the composition operator is defined \n\n∘_{i=1..L} t_{j_i}^i (r) = t_{j_L}^L ∘ t_{j_{L-1}}^{L-1} ∘ ... ∘ t_{j_1}^1 (r) \n\nLet C be an L-dimensional matrix of values of c(j) whose dimensions are n_1 ... n_L. The problem, then, is to find \n\nx = argmax_j c(j)    eq. 5 \n\nThe indices x specify the sequence of transformations that best establishes correspondence between vectors r and w. The problem is that C is too large a space to search for x by conventional means. Instead, a continuous embedding of C permits a search with resources proportional to the sum of sizes of the dimensions of C instead of their product. C is embedded in a superposition dot product space Q defined \n\nQ(G) = ⟨ Σ_{x_L=1..n_L} g_{x_L}^L t_{x_L}^L ( ... Σ_{x_1=1..n_1} g_{x_1}^1 t_{x_1}^1 (r) ... ), w ⟩    eq. 6 \n\nwhere G = [g_{x_m}^m], m = 1...L, x_m = 1...n_m, n_m is the number of t in layer m, g_{x_m}^m ∈ [0,1], and t'_i^m is the adjoint of t_i^m. In Q space, the solution to eq. 5 lies along a single axis in the set of axes represented by each row of G. That is, g^m = <0, ..., u_{x_m}, ..., 0>, u_{x_m} > 0, which corresponds to the best fitting transformation t_{x_m}^m, where x_m is the mth index in x in eq. 5. \n\n(1) This illustrates that forming a superposition of memories is equivalent to forming superpositions of transformations. The first is a more practical realization, as seen in Figure 1. Though not demonstrated in this paper, the multi-memory architecture has proved robust with 1000 or more memory patterns from real-world datasets. \n\n
This state is reached from an initial state G = [1] by a process termed superposition culling, in which the components of grad Q are used to compute a path in steps Δg, \n\nΔg_{x_m}^m ∝ ∂Q/∂g_{x_m}^m    eq. 7 \n\ng := f(g + Δg)    eq. 8 \n\nThe function f preserves the maximal component and reduces the others: in neuronal terms, lateral inhibition. The resulting path along the surface Q can be thought of as a \"high traverse\" in contrast to the gradient ascent or descent usual in optimization methods. The price for moving the problem into superposition dot product space is that collusions of components of the superpositions can result in better matches for incorrect mappings than for the mappings of the correct solution. If this occurs it is almost always a temporary state early in the convergence. This is a consequence of the ordering property of superpositions (OPS) [2,4], which, as applied here, describes the characteristics of the surface Q. For example, let superpositions r = Σ_{i=1..n} u_i, s = Σ_{j=1..m} v_j and s' = Σ_{k=1..m} v'_k be formed from three sets of sparse vectors u_i ∈ R, v_j ∈ S and v'_k ∈ S', where R ∩ S = ∅ and R ∩ S' = {v_q}. Then the following relationship expresses the OPS: \n\ndefine p_correct = p(r·s' > r·s), p_incorrect = p(r·s' ≤ r·s); \nthen p_correct > p_incorrect, i.e. p_correct > 0.5, \nand as n, m → 1, p_correct → 1.0 \n\nApplied to eq. 8, this means that for superpositions composed of vectors which satisfy the distribution properties of sparse, decorrelating encodings(2) (a biologically plausible assumption [6]), the probability of the maximum components of grad Q moving the solution in the correct direction is always greater than 0.5 and increases toward 1.0 as G becomes sparser. In other words, the probability of the occurrence of collusion decreases with the decrease in the number of contributing components in the superposition(s), and/or the decrease in their gating coefficients. 
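The ordering property can be checked numerically with a small Monte-Carlo sketch (hypothetical dimensions and sparsity; random draws only approximate the disjointness R ∩ S = ∅ assumed above):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 400                       # vector dimension (assumed, for illustration)

def sparse_vec():             # a sparse binary code: 5 active units of 400
    v = np.zeros(D)
    v[rng.choice(D, size=5, replace=False)] = 1.0
    return v

def trial(n):
    # r shares exactly one component v_q with s' and (with high probability,
    # for random sparse draws) none with s, approximating the OPS setup.
    v_q = sparse_vec()
    r  = v_q + sum(sparse_vec() for _ in range(n - 1))
    s  = sum(sparse_vec() for _ in range(n))
    s_ = v_q + sum(sparse_vec() for _ in range(n - 1))
    return float(r @ s_) > float(r @ s)

p_correct = {n: np.mean([trial(n) for _ in range(500)]) for n in (8, 4, 2)}
print(p_correct)   # each estimate exceeds 0.5; thinner superpositions sit nearer 1
```

This mirrors the claim above: the correct pairing wins with probability above chance for any superposition size, and approaches certainty as the superpositions are culled.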
\n\n3 The map-seeking method and application \n\nA map-seeking circuit (MSC) is composed of several transformation or mapping layers between the input at one end and a memory layer at the other, as seen in Figure 1. The compositional structure is evident in the simplicity of the equations (eqs. 9-12 below) which define a circuit of any dimension. In a multi-layer circuit of L layers plus memory, with n_l mappings in layer l, the forward path signal for layer m is computed \n\nf^m = Σ_{j=1..n_m} g_j^m t_j^m (f^{m-1}) for m = 1...L    eq. 9 \n\nThe backward path signal for layer m is computed \n\nb^m = Σ_{i=1..n_m} g_i^m t'_i^m (b^{m+1}) for m = 1...L, and b^{L+1} = Σ_{k=1..n_w} g_k^{L+1} w_k or w for m = L+1    eq. 10 \n\nThe mapping coefficients g are updated by the recurrence \n\ng_i^m := K( g_i^m, t_i^m (f^{m-1}) · b^{m+1} ) for m = 1...L, i = 1...n_m \ng_k^{L+1} := K( g_k^{L+1}, f^L · w_k ) for k = 1...n_w (optional)    eq. 11 \n\nwhere the match operator u · v = q, q is a scalar measure of goodness-of-match between u and v, and may be non-linear. When · is a dot product, the second argument of K is the same as ∂Q/∂g in eq. 7. The competition function K is a realization of the lateral inhibition function f in eq. 8. It may optionally be applied to the memory layer, as seen in eq. 11. \n\n(2) A restricted case of the superposition ordering property using non-sparse representation is exploited by HRR distributed memory. See [7] for an analysis which is also applicable here. \n\nK(g_i, q_i) = max[ 0, g_i - k (1 - q_i / max_j q_j) ]    eq. 12 \n\nThresholds are normally applied to q and g, below which they are set to zero to speed convergence. In the above, f^0 is the input signal, t_i^m and t'_i^m are the ith forward and backward mappings for the mth layer, w_k is the kth memory pattern, and z() is a non-linearity applied to the response of each memory. 
g^m is the set of mapping coefficients g_i^m for the mth layer, each of which is associated with mapping t_i^m and is modified over time by the competition function K(). \n\nRecognizing 2D projections of 3D objects under real operating conditions \n\nFigure 2. Recognizing target among distractor vehicles. (a) M60 3D memory model; (b) source image, Fort Carson Data Set; (c) Gaussian blurred input image; (d-f) isolation of target in layer 0, iterations 1, 3, 12; (g) pose determination in final iteration, layer 4 backward - presented left-right mirrored to reflect mirroring determined in layer 3. M-60 model courtesy Colorado State University. \n\nReal world problems of the form expressed in eq. 1 often present objects at distances or in conditions which so limit the resolution that there are no alignable features other than the shape of the object itself, which is sufficiently blurred as to prevent generating reliable edges in a feed-forward manner (e.g. Fig. 2c). In the map-seeking approach, however, the top-down (in biological parlance) inverse mappings of the 3D model are used to create a set of edge hypotheses on the backward path out of layer 1 into layer 0. In layer 0 these hypotheses are used to gate the input image. As convergence proceeds, the edge hypotheses are reduced to a single edge hypothesis that best fits the grayscale input image. Figure 2 shows this process applied to one of a set of deliberately blurred images from the Fort Carson Imagery Data Set. 
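A minimal one-layer instance of the update loop in eqs. 9-12 can be sketched as follows (toy data and illustrative names, not the circuit used for the experiments above): the mappings t_j are circular shifts of the input, and the memory layer holds two candidate patterns.

```python
import numpy as np

# One-layer map-seeking sketch of eqs. 9-12 (toy, illustrative only).
D, kappa = 16, 0.3
r = np.zeros(D); r[3:6] = [1.0, 2.0, 0.5]           # input pattern
w0 = np.roll(r, 5)                                   # memory 0 = shifted input
w1 = np.zeros(D); w1[[0, 4, 8, 12]] = 0.8            # memory 1 = distractor
mems = np.stack([w0, w1])
g  = np.ones(D)    # mapping coefficients, one per shift (initial G = [1])
gm = np.ones(2)    # memory coefficients

def K(g, q):       # eq. 12: competition (lateral inhibition), k = kappa
    return np.maximum(0.0, g - kappa * (1.0 - q / max(q.max(), 1e-9)))

for _ in range(25):
    f1 = sum(g[j] * np.roll(r, j) for j in range(D))       # eq. 9, forward
    gm = K(gm, mems @ f1)                                  # eq. 11, memory layer
    b1 = gm @ mems                                         # eq. 10, backward
    g  = K(g, np.array([np.roll(r, j) @ b1 for j in range(D)]))  # eq. 11

print(int(np.argmax(g)), int(np.argmax(gm)))   # surviving shift and memory
```

During convergence the coefficients of poorly matching shifts and of the distractor memory are culled, while the correct shift (5) and the correct memory (0) retain their gain, which is the superposition-culling behavior described in Section 2.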
The MSC used four layers of visual transformations: 14,400 translational, 31 rotational, 41 scaling, 481 3D projection. The MSC had no difficulty distinguishing the location and orientation of the tank, despite distractors and background clutter: in all tests in the dataset the target was correctly located. In effect, once primed with a top-down expectation, attentional behavior is an emergent property of application of the map-seeking method to vision [8]. \n\nAdapting generative models by transformation \n\n\"[The direct-matching hypothesis of the interpretation of biological motion] holds that we understand actions when we map the visual representation of the observed action onto our motor representation of the same action.\" [1] This mapping, attributed to a neuronal mirror-system for which there is gathering neurobiological evidence (as reviewed in [1]), requires a mechanism for projecting between the visual space and the constrained skeletal joint parameter (kinematic) space to disambiguate the 2D projection of body structure [4]. Though this problem has been solved to various degrees by other computational methods, a review of which is beyond the scope of this discussion, to the author's knowledge none of these have biological plausibility. The present purpose is to show how simply the problem can be expressed by the generative model interpretation problem introduced in eq. 3 and solved by map-seeking circuits. An idealized example is the problem of interpreting the shape of a featureless \"snake\" articulated into any configuration, as appears in Fig. 3. \n\nFigure 3. Projection between visual and kinematic spaces with two map-seeking circuits. (a) input view, (b) top view, (c) projection of 3D occluding contours, (d,e) projections of relationship of occluding contours to generating spine. \n\nThe solution to this problem involves two coupled map-seeking circuits. 
The kinematic circuit layers model the multiple degrees of freedom (here two angles, variable length and optionally variable radius from spine to surface) of each of the connected spine segments. The other circuit determines the visual transformations, as seen in the earlier example. The surface of the articulated cylinder is mapped from an axial spine. The points where that surface is tangent to the viewpoint vectors define the occluding contours which, projected in 2D, become the object silhouette. The problem is to find the articulations, segment lengths (and optionally segment diameter) which account for the occluding contour matching the silhouette in the input image. In the MSC solution, in the initial state all possible articulations of the snake spine are superposed, and all the occluding contours from a range of viewing angles are projected into 2D. The latter superposition serves as the backward input to the visual space map-seeking circuit. Since the snake surface is determined by all of the layers of the kinematic circuit, these are projected in parallel to form the backward (biologically top-down) 2D input to the visual transformation-discovery circuit. A matching operation between the contributors to the 2D occluding contour superposition and the forward transformations of the input image modulates the gain of each mapping in the kinematic circuit via a_m^VK in eqs. 13, 14 (modified from eq. 11). In eqs. 13, 14 the superscript K indicates the kinematic circuit, V the visual circuit, and comp is the competition function of eq. 12. \n\ng_i^{K,m} := comp( g_i^{K,m}, a_m^{VK} t_i^{K,m} (f^{K,m-1}) · b^{K,m+1} ) for m = 1...L, i = 1...n_m    eq. 13 \n\na_m^{VK} = f^{V,L} · t_{3D→2D} ∘ t_surface ∘ t^{K,m} (b^{K,m+1})    eq. 14 \n\nThe process converges concurrently in both circuits to a solution, as seen in Figure 3. 
The match of the occluding contours and the input image, Figure 3a, is seen in Figure 3b,c, and its three-dimensional structure is clarified in Figure 3d. Figure 3e shows a view of the 3D structure as determined directly from the mapping parameters defining the snake \"spine\" after convergence. \n\n4 Conclusion \n\nThe investigations reported here expand the envelope of vision-related problems amenable to a pure transformation-discovery approach implemented by the map-seeking method. The recognition of static 3D models, as seen in Figure 2, and other problems [9] solved by MSC have been well tested with real-world input. Numerous variants of Figure 3 have demonstrated the applicability of MSC to recognizing generative models of high dimensionality, and the principle has recently been applied successfully to real-world domains. Consequently, the research to date does suggest that a single cortical computational mechanism could span a significant range of the brain's visual and kinematic computing. \n\nReferences \n\n[1] G. Rizzolatti, L. Fogassi, V. Gallese, Neurophysiological mechanisms underlying the understanding and imitation of action, Nature Reviews Neuroscience, 2, 2001, pp 661-670 \n[2] D. Arathorn, Map-Seeking: Recognition Under Transformation Using A Superposition Ordering Property, Electronics Letters, 37(3), 2001, pp 164-165 \n[3] A. Angelucci, B. Levitt, E. Walton, J.M. Hupe, J. Bullier, J. Lund, Circuits for Local and Global Signal Integration in Primary Visual Cortex, Journal of Neuroscience, 22(19), 2002, pp 8633-8646 \n[4] D. Arathorn, Map-Seeking Circuits in Visual Cognition, Palo Alto, Stanford Univ Press, 2002 \n[5] A. Polsky, B. Mel, J. Schiller, Computational Subunits in Thin Dendrites of Pyramidal Cells, Nature Neuroscience, 7(6), 2004, pp 621-627 \n[6] B.A. Olshausen, D.J. 
Field, Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images, Nature, 381, 1996, pp 607-609 \n[7] T. Plate, Holographic Reduced Representation, CSLI Publications, Stanford, California, 2003 \n[8] D. Arathorn, Memory-driven visual attention: an emergent behavior of map-seeking circuits, in Neurobiology of Attention, Eds Itti L, Rees G, Tsotsos J, Academic/Elsevier, 2005 \n[9] C. Vogel, D. Arathorn, A. Parker, A. Roorda, Retinal motion tracking in adaptive optics scanning laser ophthalmoscopy, Proceedings of OSA Conference on Signal Recovery and Synthesis, Charlotte NC, June 2005. \n", "award": [], "sourceid": 2936, "authors": [{"given_name": "David", "family_name": "Arathorn", "institution": null}]}