{"title": "Compositionality, MDL Priors, and Object Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 844, "abstract": null, "full_text": "Compositionality, MDL Priors, and \n\nObject Recognition \n\nElie Bienenstock (elie@dam.brown.edu) \nStuart Geman (geman@dam.brown.edu) \n\nDaniel Potter (dfp@dam.brown.edu) \n\nDivision of Applied Mathematics, \n\nBrown University, Providence, RI 02912 USA \n\nAbstract \n\nImages are ambiguous at each of many levels of a contextual hi(cid:173)\nerarchy. Nevertheless, the high-level interpretation of most scenes \nis unambiguous, as evidenced by the superior performance of hu(cid:173)\nmans. This observation argues for global vision models, such as de(cid:173)\nformable templates. Unfortunately, such models are computation(cid:173)\nally intractable for unconstrained problems. We propose a composi(cid:173)\ntional model in which primitives are recursively composed, subject \nto syntactic restrictions, to form tree-structured objects and object \ngroupings. Ambiguity is propagated up the hierarchy in the form \nof multiple interpretations, which are later resolved by a Bayesian, \nequivalently minimum-description-Iength, cost functional. \n\n1 Bayesian decision theory and compositionaiity \n\nIn his Essay on Probability, Laplace (1812) devotes a short chapter-his \"Sixth \nPrinciple\" -to what we call today the Bayesian decision rule. Laplace observes \nthat we interpret a \"regular combination,\" e.g., an arrangement of objects that \ndisplays some particular symmetry, as having resulted from a \"regular cause\" rather \nthan arisen by chance. It is not, he argues, that a symmetric configuration is less \nlikely to happen by chance than another arrangement. 
Rather, it is that among all possible combinations, which are equally favored by chance, there are very few of the regular type: \"On a table we see letters arranged in this order, Constantinople, and we judge that this arrangement is not the result of chance, not because it is less possible than the others, for if this word were not employed in any language we should not suspect it came from any particular cause, but this word being in use amongst us, it is incomparably more probable that some person has thus arranged the aforesaid letters than that this arrangement is due to chance.\" In this example, regularity is not a mathematical symmetry. Rather, it is a convention shared among language users, whereby Constantinople is a word, whereas Jpctneolnosant, a string containing the same letters but arranged in a random order, is not. \n\nCentral to Laplace's argument is the observation that the number of words in the language is smaller, indeed \"incomparably\" smaller, than the number of possible arrangements of letters. Indeed, if the collection of 14-letter words in a language made up, say, half of all 14-letter strings - a rich language indeed - we would, upon seeing the string Constantinople on the table, be far less inclined to deem it a word, and far more inclined to accept it as a possible coincidence. The sparseness of allowed combinations can be observed at all linguistic articulations (phonetic-syllabic, syllabic-lexical, lexical-syntactic, syntactic-pragmatic, to use broadly defined levels), and may be viewed as a form of redundancy - by analogy to error-correcting codes. This redundancy was likely devised by evolution to ensure efficient communication in spite of the ambiguity of elementary speech signals. 
The hierarchical compositional structure of natural visual scenes can also be thought of as redundant: the rules that govern the composition of edge elements into object boundaries, of intensities into surfaces, etc., all the way to the assembly of 2-D projections of named objects, amount to a collection of drastic combinatorial restrictions. Arguably, this is why in all but a few - generally hand-crafted - cases, natural images have a unique high-level interpretation in spite of pervasive low-level ambiguity, as amply demonstrated by the performance of our brains. \n\nIn sum, compositionality appears to be a fundamental aspect of cognition (see also von der Malsburg 1981, 1987; Fodor and Pylyshyn 1988; Bienenstock 1991, 1994, 1996; Bienenstock and Geman 1995). We propose here to account for mental computation in general, and scene interpretation in particular, in terms of elementary composition operations, and describe a mathematical framework that we have developed to this effect. The present description is a cursory one, and some notions are illustrated on two simple examples rather than formally defined - for a detailed account, see Geman et al. (1996) and Potter (1997). The binary-image example refers to an N x N array of binary-valued pixels, while the Laplace-Table example refers to a one-dimensional array of length N, where each position can be filled with one of the 26 letters of the alphabet or remain blank. \n\n2 Labels and composition rules \n\nThe objects operated upon are denoted ω_i, i = 1, 2, ..., k. Each composite object ω carries a label, l = L(ω), and the list of its constituents, (ω_1, ω_2, ...). These uniquely determine ω, so we write ω = l(ω_1, ω_2, ...). A scene S is a collection of primitive objects. In the binary-image case, a scene S consists of a collection of black pixels in the N x N array. 
All these primitives carry the same label, L(ω) = p (for \"Point\"), and a parameter π(ω), which is the position in the image. In Laplace's Table, a scene S consists of an arrangement of characters on the table. There are 26 primitive labels, \"A\", \"B\", ..., \"Z\", and the parameter of a primitive ω is its position 1 ≤ π(ω) ≤ N (all primitives in such a scene must have different positions). \n\nAn example of a composite ω in the binary-image case is an arrangement composed of a black pixel at any position except on the rightmost column and another black pixel to the immediate right of the first one. The label is \"Horizontal Linelet,\" denoted L(ω) = hl, and there are N(N - 1) possible horizontal linelets. Another non-primitive label, \"Vertical Linelet,\" or vl, is defined analogously. An example of a composite ω for Laplace's Table is an arrangement of 14 neighboring primitives carrying the labels \"C\", \"O\", \"N\", \"S\", ..., \"E\" in that order, wherever that arrangement will fit. We then have L(ω) = Constantinople, and there are N - 13 possible Constantinople objects. \n\nThe composition rule for label type l consists of a binding function, B_l, and a set of allowed binding-function values, or binding support, S_l: denoting by Ω the set of all objects in the model, we have, for any ω_1, ..., ω_k ∈ Ω, B_l(ω_1, ..., ω_k) ∈ S_l ⇔ l(ω_1, ..., ω_k) ∈ Ω. In the binary-image example, B_hl(ω_1, ω_2) = B_vl(ω_1, ω_2) = (L(ω_1), L(ω_2), π(ω_2) - π(ω_1)), S_hl = {(p, p, (1, 0))} and S_vl = {(p, p, (0, 1))} define the hl- and vl-composition rules, p + p → hl and p + p → vl. In Laplace's Table, C + O + ... + E → Constantinople is an example of a 14-ary composition rule, where we must check the label and position of each constituent. 
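The binding-function machinery for the binary-image example can be sketched in code. The following is a minimal illustration of ours, not the authors' implementation; the names `Obj`, `binding_value`, `SUPPORT`, and `compose` are our own, and positions are taken as (x, y) pairs with x increasing to the right and y increasing downward.

```python
# Sketch of the rules p + p -> hl and p + p -> vl: the binding function reports
# the constituent labels and their displacement; the binding support lists the
# values for which the composition is allowed.

from typing import NamedTuple, Optional, Tuple

class Obj(NamedTuple):
    label: str                # "p", "hl", or "vl"
    pos: Tuple[int, int]      # position of the first (reference) primitive

def binding_value(w1: Obj, w2: Obj):
    """B_hl = B_vl: the labels of the constituents plus their displacement."""
    dx = w2.pos[0] - w1.pos[0]
    dy = w2.pos[1] - w1.pos[1]
    return (w1.label, w2.label, (dx, dy))

# Binding supports S_l: the allowed binding-function values for each label.
SUPPORT = {
    "hl": {("p", "p", (1, 0))},   # second pixel immediately to the right
    "vl": {("p", "p", (0, 1))},   # second pixel immediately below
}

def compose(label: str, w1: Obj, w2: Obj) -> Optional[Obj]:
    """Return the composite l(w1, w2) if the rule allows it, else None."""
    if binding_value(w1, w2) in SUPPORT[label]:
        return Obj(label, w1.pos)
    return None

a, b = Obj("p", (3, 5)), Obj("p", (4, 5))
assert compose("hl", a, b) == Obj("hl", (3, 5))   # horizontally adjacent pair
assert compose("vl", a, b) is None                # wrong displacement for vl
```

The 14-ary Constantinople rule would follow the same pattern, with a binding value collecting all 14 labels and the 13 displacements from the first constituent.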
One way to define the binding function and support for this rule is: B(ω_1, ..., ω_14) = (L(ω_1), ..., L(ω_14), π(ω_2) - π(ω_1), π(ω_3) - π(ω_1), ..., π(ω_14) - π(ω_1)) and S = {(C, ..., E, 1, 2, ..., 13)}. \n\nWe now introduce recursive labels and composition rules: the label of the composite object is identical to the label of one or more of its constituents, and the rule may be applied an arbitrary number of times, to yield objects of arbitrary complexity. In the binary-image case, we use a recursive label c, for Curve, and an associated binding function which creates objects of the form hl + p → c, vl + p → c, c + p → c, p + hl → c, p + vl → c, p + c → c, and c + c → c. The reader may easily fill in the details, i.e., define a binding function and binding support which result in \"c\"-objects being precisely curves in the image, where a curve is of length at least 3 and may be self-intersecting. In the previous examples, primitives were composed into compositions; here compositions are further composed into more complex compositions. In general, an object ω is a labeled tree, where each vertex carries the name of an object, and each leaf is associated with a primitive (the association is not necessarily one-to-one, as in the case of a self-intersecting curve). \n\nLet M be a model - i.e., a collection of labels with their binding functions and binding supports - and Ω the set of all objects in M. We say that object ω ∈ Ω covers S if S is precisely the set of primitives that make up ω's leaves. An interpretation I of S is any finite collection of objects in Ω such that the union of the sets of primitives they cover is S. We use the convention that, for all M and S, I_0 denotes the trivial interpretation, defined as the collection of (unbound) primitives in S. In most cases of interest, a model M will allow many interpretations for a scene S. 
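The labeled-tree view of composite objects can be made concrete with a small sketch. This is our own toy representation, not the paper's formalism; the names `Primitive`, `Composite`, and `cover` are hypothetical.

```python
# Composite objects as labeled trees: each internal vertex carries a label,
# each leaf is a primitive, and cover(w) collects the primitives at w's leaves.

from typing import NamedTuple, Tuple, Union

class Primitive(NamedTuple):
    label: str
    pos: int

class Composite(NamedTuple):
    label: str
    parts: Tuple["Tree", ...]

Tree = Union[Primitive, Composite]

def cover(w: Tree) -> set:
    """The set of primitives making up w's leaves."""
    if isinstance(w, Primitive):
        return {w}
    out = set()
    for part in w.parts:
        out |= cover(part)
    return out

# A curve built recursively: first hl + p -> c, then c + p -> c.
p1, p2, p3, p4 = (Primitive("p", i) for i in range(4))
hl = Composite("hl", (p1, p2))
c = Composite("c", (Composite("c", (hl, p3)), p4))
assert cover(c) == {p1, p2, p3, p4}
```

An interpretation of a scene is then any finite collection of such trees whose covers jointly exhaust the scene's primitives.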
For instance, given a long curve in the binary-image model, there will be many ways to recursively construct a \"c\"-labeled tree that covers exactly that curve. \n\n3 The MDL formulation \n\nIn Laplace's Table, a scene consisting of the string Constantinople admits, in addition to I_0, the interpretation I_1 = {ω_1}, where ω_1 is a \"Constantinople\"-object. We wish to define a probability distribution D on interpretations such that D(I_1) >> D(I_0), in order to realize Laplace's \"incomparably more probable\". Our definition of D will be motivated by the following use of the Minimum Description Length (MDL) principle (Rissanen 1989). Consider a scene S and pretend we want to transmit S as quickly as possible through a noiseless channel; hence we seek to encode it as efficiently as possible, i.e., with the shortest possible binary code c. We can always use the trivial interpretation I_0: the codeword c(I_0) is a mere list of n locations in S. We need not specify labels, since there is only one primitive label in this example. The length, or cost, of this code for S is |c(I_0)| = n log2(N^2). \n\nNow, however, we want to take advantage of regularities, in the sense of Laplace, that we expect to be present in S. We are specifically interested in compositional regularities, where some arrangements that occur more frequently than by chance can be interpreted advantageously using an appropriate compositional model M. Interpretation I is advantageous if |c(I)| < |c(I_0)|. An example in the binary-image case is a linelet scene S. The trivial encoding of this scene costs us |c(I_0)| = 2[log2 3 + log2(N^2)] bits, whereas the cost of the compositional interpretation I_1 = {ω_1} is |c(I_1)| = log2 3 + log2(N(N - 1)), where ω_1 is an hl or vl object, as the case may be. The first log2 3 bits encode the label L(ω_1) ∈ {p, hl, vl}, and the rest encodes the position in the image. 
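The two encoding costs just quoted can be checked numerically. The following is a small sketch of ours (function names are our own) that evaluates the trivial and compositional costs for a linelet scene and confirms that the saving grows like 2 log2 N.

```python
# Encoding costs for a two-pixel linelet scene under the {p, hl, vl} model:
# the trivial interpretation pays a label (one of 3) and a position (one of N^2)
# per pixel; the linelet interpretation pays one label and one of N(N-1) placements.

import math

def trivial_cost(N: int) -> float:
    # two unbound primitives: 2 * [log2(3) + log2(N^2)] bits
    return 2 * (math.log2(3) + math.log2(N * N))

def linelet_cost(N: int) -> float:
    # one hl (or vl) object: log2(3) + log2(N*(N-1)) bits
    return math.log2(3) + math.log2(N * (N - 1))

N = 256
gain = trivial_cost(N) - linelet_cost(N)
# The gain is about 2*log2(N) bits (plus a small label term), as stated above.
assert abs(gain - 2 * math.log2(N)) < 2
```

For N = 256 the saving is roughly 17.6 bits against 2 log2 N = 16, i.e., the compositional code wins by about 2 log2 N bits, up to a constant.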
The compositional {p, hl, vl} model is therefore advantageous for a linelet scene, since it affords us a gain in encoding cost of about 2 log2 N bits. In general, the gain realized by encoding {ω} = {l(ω_1, ω_2)} instead of {ω_1, ω_2} may be viewed as a binding energy, measuring the affinity that ω_1 and ω_2 exhibit for each other as they assemble into ω. This binding energy is ε_l = |c(ω_1)| + |c(ω_2)| - |c(l(ω_1, ω_2))|, and an efficient M is one that contains judiciously chosen cost-saving composition rules. In effect, if, say, linelets were very rare, we would be better off with the trivial model. The inclusion of non-primitive labels would force us to add at least one bit to the code of every object - to specify its label - and this would increase the average encoding cost, since the infrequent use of non-primitive labels would not balance the extra small cost incurred on primitives. In practical applications, the construction of a sound M is no trivial issue. Note, however, the simple rationale for including a rule such as p + p → hl: giving ourselves the label hl renders redundant the independent encoding of the positions of horizontally adjacent pixels. In general, a good model should allow one to hierarchically compose with each other frequently occurring arrangements of objects. \n\nThis use of MDL leads in a straightforward way to an equivalent Bayesian formulation. Setting P'(ω) = 2^{-|c(ω)|} / Σ_{ω' ∈ Ω} 2^{-|c(ω')|} yields a probability distribution P' on Ω for which c is approximately a Shannon code (Cover and Thomas 1991). With this definition, the decision to include the label hl - or the label Constantinople - would be viewed, in principle, as a statement about the prior probability of finding horizontal linelets - or Constantinople strings - in the scene to be interpreted. 
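The codeword-to-probability correspondence above is easy to illustrate. The snippet below is a toy sketch of ours (the code lengths are made-up numbers, not values from the paper) showing that P'(ω) = 2^{-|c(ω)|}/Z is a genuine distribution for which the given lengths are, up to the constant log2 Z, ideal Shannon codeword lengths.

```python
# From code lengths to a prior: P'(w) = 2^{-|c(w)|} / Z normalizes the
# exponentiated negative code lengths into a probability distribution.

import math

code_lengths = {"p": 2.0, "hl": 9.0, "vl": 9.0}   # hypothetical |c(w)| in bits
Z = sum(2.0 ** -l for l in code_lengths.values())
P = {w: (2.0 ** -l) / Z for w, l in code_lengths.items()}

assert abs(sum(P.values()) - 1.0) < 1e-12          # P' is a distribution
# An ideal Shannon code for P' has length -log2 P'(w) = |c(w)| + log2(Z).
for w, l in code_lengths.items():
    assert abs(-math.log2(P[w]) - (l + math.log2(Z))) < 1e-9
```

Shorter codewords thus translate directly into higher prior probability, which is the sense in which including a label like hl is a prior statement about linelets.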
\n\n4 The observable-measure formulation \n\nThe MDL formulation, however, has a number of shortcomings; in particular, computing the binding energy for composite objects can be problematic. We outline now an alternative approach (Geman et al. 1996, Potter 1997), where a probability distribution P(ω) on Ω is defined through a family of observable measures Q_l. These measures assign probabilities to each possible binding-function value, s ∈ S_l, and also to the primitives. We require Σ_{l ∈ M} Σ_{s ∈ S_l} Q_l(s) = 1, where the notion of binding function has been extended to primitives via B_prim(ω) = π(ω) for primitive ω. The probabilities induced on Ω by Q_l are given by P(ω) = Q_prim(B_prim(ω)) for a primitive ω, and P(ω) = Q_l(B_l(ω_1, ω_2)) P^2(ω_1, ω_2 | B_l(ω_1, ω_2)) for a composite object ω = l(ω_1, ω_2).[1] Here P^2 = P x P is the product probability, i.e., the free, or not-bound, distribution for the pair (ω_1, ω_2) ∈ Ω^2. For instance, with C + ... + E → Constantinople, P^14(ω_1, ω_2, ..., ω_14 | B_Cons...(ω_1, ..., ω_14) = (C, ..., E, 1, ..., 13)) is the conditional probability of observing a particular string Constantinople, under the free distribution, given that (ω_1, ..., ω_14) constitutes such a string. With the reasonable assumption that, under Q, primitives are uniformly distributed over the table, this conditional probability is simply the inverse of the number of possible Constantinople strings, i.e., 1/(N - 13). 
\n\nThe binding energy, defined by analogy to the MDL approach as ε_l = log2(P(ω)/(P(ω_1)P(ω_2))), now becomes ε_l = log2(Q_l(B_l(ω_1, ω_2))) - log2((P x P)(B_l(ω_1, ω_2))). Finally, if 𝓘 is the collection of all finite interpretations I ⊂ Ω, we define the probability of I ∈ 𝓘 as D(I) = Π_{ω ∈ I} P(ω) / Z, with Z = Σ_{I' ∈ 𝓘} Π_{ω ∈ I'} P(ω). Thus, the probability of an interpretation containing several free objects is obtained by assuming that these objects occurred in the scene independently of each other. Given a scene S, recognition is formulated as the task of maximizing D over all the I's in 𝓘 that are interpretations of S. \n\nWe now illustrate the use of D on our two examples. In the binary-image example with model M = {p, hl, vl}, we use a parameter q, 0 ≤ q ≤ 1, to adjust the prior probability of linelets. Thus, Q_prim(B_prim(ω)) = (1 - q)/N^2 for primitives, and Q_hl((p, p, (1, 0))) = Q_vl((p, p, (0, 1))) = q/2 for linelets. It is easily seen that, regardless of the normalizing constant Z, the binding energy of two adjacent pixels into a linelet is ε_hl = ε_vl = log2(q/2) - log2[((1 - q)/N^2)^2 N(N - 1)]. Interestingly, as long as q ≠ 0 and q ≠ 1, the binding energy, for large N, is approximately 2 log2 N, which is independent of q. Thus, the linelet interpretation is \"incomparably\" more likely than the independent occurrence of two primitives at neighboring positions. We leave it to the reader to construct a prior P for the model {p, hl, vl, c}, e.g., by distributing the Q-mass evenly between all composition rules. Finally, in Laplace's Table, if there are M equally likely non-primitive labels - say city names - and q is their total mass, the binding energy for Constantinople is ε_Cons = log2[q/(M(N - 13))] - log2[(1 - q)/(26N)]^14, and the \"regular\" cause is again \"incomparably\" more likely. \n\nThere are several advantages to this reformulation from codewords into probabilities using the Q-parameters. 
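The q-independence of the linelet binding energy can be verified numerically. The following is a sketch of ours, assuming, as in the binary-image example, Q_prim = (1-q)/N^2 per primitive and Q_hl = q/2 for the linelet value, with N(N-1) possible placements.

```python
# Linelet binding energy eps = log2(q/2) - log2[((1-q)/N^2)^2 * N*(N-1)];
# for large N this is approximately 2*log2(N), whatever q in (0, 1).

import math

def linelet_binding_energy(q: float, N: int) -> float:
    p_composite = (q / 2) * (1.0 / (N * (N - 1)))   # Q_hl times placement probability
    p_free_pair = ((1 - q) / N**2) ** 2              # two independent primitives
    return math.log2(p_composite) - math.log2(p_free_pair)

N = 1024
for q in (0.01, 0.5, 0.99):
    eps = linelet_binding_energy(q, N)
    # within a bounded (q-dependent but N-independent) offset of 2*log2(N)
    assert abs(eps - 2 * math.log2(N)) < 14
```

The offset from 2 log2 N depends only on q, so as N grows the bound interpretation dominates "incomparably," just as the text argues.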
First, the Q-parameters can in principle be adjusted to better account for a particular world of images. Second, we get an explicit formula for the binding energy (namely log2(Q / (P x P))). Of course, we need to evaluate the product probability P x P, and this can be highly non-trivial - one approach is through sampling, as demonstrated in Potter (1997). Finally, this formulation is well-suited for parameter estimation: the Q's, which are the parameters of the distribution P, are indeed observables, i.e., directly available empirically. \n\n[1] This is actually an implicit definition. Under reasonable conditions, it is well defined - see Geman et al. (1996). \n\n5 Concluding remarks \n\nThe approach described here was applied by X. Xing to the recognition of \"on-line\" handwritten characters, using a binary-image-type model as above, enriched with higher-level labels including curved lines, straight lines, angles, crossings, T-junctions, L-junctions (right angles), and the 26 letters of the alphabet. In such a model, the search for an optimal solution cannot be done exhaustively. We experimented with a number of strategies, including a two-step algorithm which first generates all possible objects in the scene, and then selects the \"best\" objects, i.e., the objects with highest total binding energy, using a greedy method, to yield a final scene interpretation. (The total binding energy of ω is the sum of the binding energies ε_l over all the composition rules l used in the composition of ω. 
Equivalently, \nthe total binding energy is the log-likelihood ratio log2{P{w}/IIi P{Wi)), where the \nproduct is taken over all the primitives Wi covered by w.} \n\nThe first step of the algorithm typically results in high-level objects partly over(cid:173)\nlapping on the set of primitives they cover, i.e., competing for the interpretation of \nshared primitives. Ambiguity is thus propagated in a \"bottom-up\" fashion. The \nambiguity is resolved in the second \"top-down\" pass, when high-level composition \nrules are used to select the best compositions, at all levels including the lower ones. \nA detailed account of our experiments will be given elsewhere. We found the re(cid:173)\nsults quite encouraging, particularly in view of the potential scope of the approach. \nIn effect, we believe that this approach is in principle capable of addressing unre(cid:173)\nstricted vision problems, where images are typically very ambiguous at lower levels \nfor a variety of reasons-including occlusion and mutual overlap of objects-hence \npurely bottom-up segmentation is impractical. \n\nTurning now to biological implications, note that dynamic binding in the nervous \nsystem has been a subject of intensive research and debate in the last decade. Most \ninteresting in the present context is the suggestion, first clearly articulated by von \nder Malsburg {1981}, that composition may be performed thanks to a dual mech(cid:173)\nanism of accurate synchronization of spiking activity-not necessarily relying on \nperiodic firing-and fast reversible synaptic plasticity. If there is some neurobio(cid:173)\nlogical truth to the model described in the present paper, the binding mechanism \nproposed by von der Malsburg would appear to be an attractive implementation. \nIn effect, the use of fine temporal structure of neural activity opens up a large realm \nof possible high-order codes in networks of neurons. 
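The greedy second step of the two-step strategy can be sketched as follows. This is our own simplification, not the authors' code: candidates are scored by total binding energy, and objects are accepted in decreasing score order provided they do not compete for primitives already claimed.

```python
# Greedy selection of "best" objects: sort candidates by total binding energy
# and keep each one whose covered primitives are disjoint from those already used.

def greedy_interpretation(candidates):
    """candidates: list of (binding_energy, frozenset_of_covered_primitives)."""
    chosen, covered = [], set()
    for energy, cover in sorted(candidates, key=lambda c: -c[0]):
        if energy > 0 and not (cover & covered):   # no competition for primitives
            chosen.append((energy, cover))
            covered |= cover
    return chosen

# Two competing curve hypotheses share primitive 3; the stronger one wins,
# and the non-conflicting third candidate is also kept.
cands = [(9.0, frozenset({1, 2, 3})), (7.5, frozenset({3, 4, 5})),
         (4.0, frozenset({6, 7}))]
sel = greedy_interpretation(cands)
assert [c[1] for c in sel] == [frozenset({1, 2, 3}), frozenset({6, 7})]
```

This mirrors the bottom-up/top-down flow described above: step 1 proposes overlapping composites, and the greedy pass resolves the competition for shared primitives.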
\n\nIn the present model, constituents always bind in the service of a new object, an operation one may refer to as triangular binding. Composite objects can engage in further composition, thus giving rise to arbitrarily deep tree-structured constructs. Physiological evidence of triangular binding in the visual system can be found in Sillito et al. (1994); Damasio (1989) describes an approach derived from neuroanatomical data and lesion studies that is largely consistent with the formalism described here. \n\nAn important requirement for the neural representation of the tree-structured objects used in our model is that the doing and undoing of links operating on some constituents, say ω_1 and ω_2, while affecting in some useful way the high-order patterns that represent these objects, leaves these patterns, as representations of ω_1 and ω_2, intact. A family of biologically plausible patterns that would appear to satisfy this requirement is provided by synfire patterns (Abeles 1991). We hypothesized elsewhere (Bienenstock 1991, 1994, 1996) that synfire chains could be dynamically bound via weak synaptic couplings; such couplings would synchronize the wave-like activities of two synfire chains, in much the same way as coupled oscillators lock their phases. Recursiveness of compositionality could, in principle, arise from the further binding of these composite structures. \n\nAcknowledgements \n\nSupported by the Army Research Office (DAAL03-92-G-0115), the National Science Foundation (DMS-9217655), and the Office of Naval Research (N00014-96-1-0647). \n\nReferences \n\nAbeles, M. (1991) Corticonics: Neuronal circuits of the cerebral cortex, Cambridge University Press. \n\nBienenstock, E. 
(1991) Notes on the growth of a composition machine, in Proceedings of the Royaumont Interdisciplinary Workshop on Compositionality in Cognition and Neural Networks - I, D. Andler, E. Bienenstock, and B. Laks, Eds., pp. 25-43. (1994) A Model of Neocortex. Network: Computation in Neural Systems, 6:179-224. (1996) Composition, in Brain Theory: Biological Basis and Computational Principles, A. Aertsen and V. Braitenberg, Eds., Elsevier, pp. 269-300. \n\nBienenstock, E., and Geman, S. (1995) Compositionality in Neural Systems, in The Handbook of Brain Theory and Neural Networks, M.A. Arbib, Ed., M.I.T./Bradford Press, pp. 223-226. \n\nCover, T.M., and Thomas, J.A. (1991) Elements of Information Theory, Wiley and Sons, New York. \n\nDamasio, A.R. (1989) Time-locked multiregional retroactivation: a systems-level proposal for the neural substrates of recall and recognition, Cognition, 33:25-62. \n\nFodor, J.A., and Pylyshyn, Z.W. (1988) Connectionism and cognitive architecture: a critical analysis, Cognition, 28:3-71. \n\nGeman, S., Potter, D., and Chi, Z. (1996) Compositional Systems, Technical Report, Division of Applied Mathematics, Brown University. \n\nLaplace, P.S. (1812) Essai philosophique sur les probabilités. Translation of Truscott and Emory, New York, 1902. \n\nPotter, D. (1997) Compositional Pattern Recognition, PhD Thesis, Division of Applied Mathematics, Brown University, in preparation. \n\nRissanen, J. (1989) Stochastic Complexity in Statistical Inquiry, World Scientific Co., Singapore. \n\nSillito, A.M., Jones, H.E., Gerstein, G.L., and West, D.C. (1994) Feature-linked synchronization of thalamic relay cell firing induced by feedback from the visual cortex, Nature, 369:479-482. \n\nvon der Malsburg, C. (1981) The correlation theory of brain function. Internal report 81-2, Max-Planck Institute for Biophysical Chemistry, Dept. 
of Neurobiology, Göttingen, Germany. (1987) Synaptic plasticity as a basis of brain organization, in The Neural and Molecular Bases of Learning (J.P. Changeux and M. Konishi, Eds.), John Wiley and Sons, pp. 411-432. \n", "award": [], "sourceid": 1327, "authors": [{"given_name": "Elie", "family_name": "Bienenstock", "institution": null}, {"given_name": "Stuart", "family_name": "Geman", "institution": null}, {"given_name": "Daniel", "family_name": "Potter", "institution": null}]}