{"title": "Information Factorization in Connectionist Models of Perception", "book": "Advances in Neural Information Processing Systems", "page_first": 45, "page_last": 51, "abstract": null, "full_text": "Information Factorization in Connectionist Models of Perception \n\nJavier R. Movellan \nDepartment of Cognitive Science \nInstitute for Neural Computation \nUniversity of California San Diego \n\nJames L. McClelland \nCenter for the Neural Bases of Cognition \nDepartment of Psychology \nCarnegie Mellon University \n\nAbstract \n\nWe examine a psychophysical law that describes the influence of stimulus and context on perception. According to this law, choice probability ratios factorize into components independently controlled by stimulus and context. It has been argued that this pattern of results is incompatible with feedback models of perception. In this paper we examine this claim using neural network models defined via stochastic differential equations. We show that the law is related to a condition named channel separability and has little to do with the existence of feedback connections. In essence, channels are separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels. Implications of the analysis for cognitive and computational neuroscience are discussed. \n\n1 Introduction \n\nWe examine a psychophysical law, named the Morton-Massaro law, and its implications for connectionist models of perception and neural information processing. For an example of the type of experiments covered by the Morton-Massaro law, consider an experiment by Massaro and Cohen (1983) in which subjects had to identify synthetic consonant sounds presented in the context of other phonemes. 
There were two response alternatives, seven stimulus conditions, and four context conditions. The response alternatives were /l/ and /r/; the stimuli were synthetic sounds generated by varying the onset frequency of the third formant, followed by the vowel /i/. Each of the 7 stimuli was placed after each of four different context consonants, /v/, /s/, /p/, and /t/. Morton (1969) and Massaro independently showed that in a remarkable range of experiments of this type, the influence of stimulus and context on response probabilities can be accounted for with a factorized version of Luce's strength model (Luce, 1959): \n\nP(R = k | S = i, C = j) = η_s(i,k) η_c(j,k) / Σ_l η_s(i,l) η_c(j,l), for (i,j,k) ∈ S × C × R. (1) \n\nHere S, C and R are random variables representing the stimulus, context and the subject's response, S, C and R are the sets of stimulus, context and response alternatives, η_s(i,k) > 0 represents the support of stimulus i for response k, and η_c(j,k) > 0 the support of context j for response k. Assuming no strength parameter is exactly zero, (1) is equivalent to \n\nP(R = k | S = i, C = j) / P(R = l | S = i, C = j) = (η_s(i,k) / η_s(i,l)) (η_c(j,k) / η_c(j,l)), for all (i,j,k) ∈ S × C × R. (2) \n\nThis says that response probability ratios factorize into two components, one which is affected by the stimulus but unaffected by the context and one affected by the context but unaffected by the stimulus. \n\n2 Diffusion Models of Perception \n\nMassaro (1989) conjectured that the Morton-Massaro law may be incompatible with feedback models of perception. This conjecture was based on the idea that in networks with feedback connections the stimulus can have an effect on the context units and the context can have an effect on the stimulus units, making it impossible to factorize the influence of information sources. 
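To make the factorized Luce model of equations (1) and (2) concrete, here is a small numerical sketch. The strength parameters below are illustrative random values, not parameters fitted to Massaro and Cohen's data; the sizes match the experiment described above (7 stimuli, 4 contexts, 2 responses).

```python
import numpy as np

# Hypothetical strength parameters: eta_s[i, k] is the support of stimulus i
# for response k; eta_c[j, k] the support of context j for response k.
rng = np.random.default_rng(0)
n_stim, n_ctx, n_resp = 7, 4, 2
eta_s = rng.uniform(0.1, 1.0, size=(n_stim, n_resp))
eta_c = rng.uniform(0.1, 1.0, size=(n_ctx, n_resp))

def choice_prob(i, j, k):
    """Factorized Luce strength model, equation (1)."""
    num = eta_s[i, k] * eta_c[j, k]
    den = sum(eta_s[i, l] * eta_c[j, l] for l in range(n_resp))
    return num / den

def ratio(i, j):
    """Response probability ratio, the left-hand side of equation (2)."""
    return choice_prob(i, j, 0) / choice_prob(i, j, 1)

# Equation (2): the ratio factorizes, so comparing two contexts gives a
# quantity that is independent of which stimulus was presented.
for i in range(n_stim):
    assert np.isclose(
        ratio(i, 0) / ratio(i, 1),
        (eta_c[0, 0] / eta_c[0, 1]) / (eta_c[1, 0] / eta_c[1, 1]))
```

The loop at the end checks the factorization property directly: the context effect on the probability ratio is the same for every stimulus condition.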
In this paper we analyze this conjecture and show that, surprisingly, the Morton-Massaro law has little to do with the existence of feedback and lateral connections. We ground our analysis on continuous stochastic versions of recurrent neural networks.¹ We call these models diffusion (neural) networks, for they are stochastic diffusion processes defined by adding Brownian motion to the standard recurrent neural network dynamics. Diffusion networks are defined by the following stochastic differential equation \n\ndY_i(t) = μ_i(Y(t), X) dt + σ dB_i(t), for i ∈ {1, ..., n}, (3) \n\nwhere Y_i(t) is a random variable representing the internal potential at time t of the ith unit, Y(t) = (Y_1(t), ..., Y_n(t))′, X represents the external input, which consists of stimulus and context, and B_i is Brownian motion, which acts as a stochastic driving term. The constant σ > 0, known as the dispersion, controls the amount of noise injected into each unit. The function μ_i, known as the drift, determines the average instantaneous change of activation and is borrowed from the standard recurrent neural network literature: this change is modulated by a matrix w of connections between units, and a matrix v that controls the influence of the external inputs onto each unit: \n\nμ_i(Y(t), X) = (1/κ_i(Y_i(t))) (Ȳ_i(t) − Y_i(t)), for all i ∈ {1, ..., n}, (4) \n\nwhere 1/κ_i is a positive function, named the capacitance, controlling the speed of processing, and \n\nȲ_i(t) = Σ_j w_{i,j} Z_j(t) + Σ_k v_{i,k} X_k, for all i ∈ {1, ..., n}, (5) \n\nZ_j(t) = φ_j(Y_j(t)) = φ(α_j Y_j(t)) = 1/(1 + e^{−α_j Y_j(t)}). (6) \n\nHere w_{i,j}, an element of the connection matrix w, is the weight from unit j to unit i, v_{i,k} is an element of the matrix v, φ is the logistic activation function, and the α_j > 0 terms are gain parameters that control the sharpness of the activation functions. 
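A minimal simulation sketch of the dynamics defined by equations (3)-(6), using the discrete-time Euler-Maruyama approximation discussed below. The network size, the random weights, and the constant capacitance are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                         # number of units (illustrative)
w = rng.normal(0, 0.5, (n, n))
w = (w + w.T) / 2             # symmetric weights (the case analyzed below)
v = rng.normal(0, 0.5, (n, 2))  # two external inputs: stimulus and context
alpha = np.full(n, 1.0)       # gain parameters
sigma = 1.0                   # dispersion
kappa = np.full(n, 1.0)       # capacitance, taken constant for simplicity

def simulate(x, dt=0.01, steps=5000):
    """Euler-Maruyama integration of dY = mu(Y, X) dt + sigma dB."""
    y = np.zeros(n)
    for _ in range(steps):
        z = 1.0 / (1.0 + np.exp(-alpha * y))   # firing rates, eq. (6)
        y_bar = w @ z + v @ x                  # net input, eq. (5)
        mu = (y_bar - y) / kappa               # drift, eq. (4)
        y = y + mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n)
    return 1.0 / (1.0 + np.exp(-alpha * y))

# one sample of the firing rates after a long settling period
z_eq = simulate(np.array([1.0, -1.0]))
```

Running `simulate` many times for a fixed input yields samples from the distribution of activations; at long settling times these approximate the stochastic equilibrium distribution studied in the rest of the paper.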
\nFor large values of α_i the activation function of unit i converges to a step function. The variable Z_j(t) represents a short-time mean firing rate (the activation) of unit j scaled to the (0,1) range. Intuition for equation (3) can be achieved by thinking of it as the limit of a discrete-time difference equation; in that case \n\nY_i(t + Δt) = Y_i(t) + μ_i(Y(t), X) Δt + σ √Δt N_i(t), (7) \n\nwhere the N_i(t) are independent standard Gaussian random variables. For a fixed state at time t there are two forces controlling the change in activation: the drift, which is deterministic, and the dispersion, which is stochastic. This results in a distribution of states at time t + Δt. As Δt goes to zero, the solution to the difference equation (7) converges to the diffusion process defined in (3). In this paper we focus on the behavior of diffusion networks at stochastic equilibrium, i.e., we assume the network is given enough time to approximate stochastic equilibrium before its response is sampled. \n\n¹For an analysis grounded on discrete-time networks with binary states, see McClelland (1991). \n\n3 Channel Separability \n\nIn this section we show that the Morton-Massaro law is related to an architectural constraint named channel separability, which has nothing to do with the existence of feedback connections. In order to define channel separability it is useful to characterize the function of different units using the following categories: 1) Response specification units: A unit is a response specification unit if, when the state of all the other units in the network is fixed, changing the state of this unit affects the probability distribution of overt responses. 
2) Stimulus units: A unit belongs to the stimulus channel if: a) it is not a response unit, and b) when the state of the response units is fixed, the probability distribution of the activations of this unit is affected by the stimulus. 3) Context units: A unit belongs to the context channel if: a) it is not a response unit, and b) when the states of the response units are fixed, the probability distribution of the activations of this unit can be affected by the context. Given the above definitions, we say that a network has separable stimulus and context channels if the stimulus and context units are disjoint: no unit simultaneously belongs to the stimulus and context channels. In essence, channels are structurally separable if they converge into the response units without direct lateral connections to other channels and if their sensors are not directly contaminated by external inputs to the other channels (see Figure 1). \n\nIn the rest of the paper we show that if a diffusion network is structurally separable, the Morton-Massaro law can be approximated with arbitrary precision regardless of the existence of feedback connections. For simplicity we examine the case in which the weight matrix is symmetric. In that case, each state has an associated goodness function that greatly simplifies the analysis. In a later section we discuss how the results generalize to the non-symmetric case. \n\nLet y ∈ ℝⁿ represent the internal potential of a diffusion network. Let z_i = φ(α_i y_i) for i = 1, ..., n represent the firing rates corresponding to y. Let z^s, z^c and z^r represent the components of z for the units in the stimulus channel, context channel and response specification module. Let x be a vector representing an input and let x^s, x^c be the components of x for the external stimulus and context. 
Figure 1: A network with separable context and stimulus processing channels. The stimulus sensor and stimulus relay units make up the stimulus channel units, and the context sensor and context relay units make up the context channel units. Note that any of the modules can be empty except the response module. \n\nLet α = (α_1, ..., α_n) be a fixed gain vector and Z^α(t) a random vector representing the firing rates at time t of a network with gain vector α. Let Z^α = lim_{t→∞} Z^α(t) represent the firing rates at stochastic equilibrium. In Movellan (1998) it is shown that if the weights are symmetric, i.e., w = w′, and 1/κ_i(x) = dφ_i(x)/dx, then the equilibrium probability density of Z^α is as follows: \n\np_{Z^α|X}(z^s, z^c, z^r | x^s, x^c) = (1/K_α(x^s, x^c)) exp((2/σ²) G_α(z^s, z^c, z^r | x^s, x^c)), (8) \n\nwhere \n\nK_α(x^s, x^c) = ∫ exp((2/σ²) G_α(z | x^s, x^c)) dz, (9) \n\nG_α(z | x) = H(z | x) − Σ_{i=1}^n S_{α_i}(z_i), (10) \n\nH(z | x) = z′ w z / 2 + z′ v x, (11) \n\nS_{α_i}(z_i) = α_i (log(z_i) + log(1 − z_i)) + (1/α_i) (z_i log(z_i) + (1 − z_i) log(1 − z_i)). (12) \n\nWithout loss of generality, hereafter we set σ² = 2. When there are no direct connections between the stimulus and context units there are no terms in the goodness function in which x^s or z^s occur jointly with x^c or z^c. Because of this, the goodness can be separated into three additive terms: one that depends on x^s, one that depends on x^c, and a third that depends only on the response units: \n\nG_α(z^s, z^c, z^r | x^s, x^c) = G_α^s(z^s, z^r | x^s) + G_α^c(z^c, z^r | x^c) + G_α^r(z^r), (13) \n\nwhere \n\nG_α^s(z^s, z^r | x^s) = (z^s)′ w^{s,s} z^s / 2 + (z^s)′ w^{s,r} z^r + (z^s)′ v^{s,s} x^s + (z^r)′ v^{r,s} x^s − Σ_i S_{α_i}(z_i^s), (14) \n\nG_α^c(z^c, z^r | x^c) = (z^c)′ w^{c,c} z^c / 2 + (z^c)′ w^{c,r} z^r + (z^c)′ v^{c,c} x^c + (z^r)′ v^{r,c} x^c − Σ_i S_{α_i}(z_i^c), (15) \n\nG_α^r(z^r) = (z^r)′ w^{r,r} z^r / 2 − Σ_i S_{α_i}(z_i^r), (16) \n\nwhere w^{s,r} is a submatrix of w connecting the stimulus and response units. 
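The additive decomposition of the goodness into stimulus, context, and response terms can be checked numerically. The sketch below builds a random network with separable channels (illustrative unit counts and weights) and verifies that the total goodness equals the sum of the three channel terms; note that the additivity depends only on there being no weights between stimulus and context units, not on the exact form of the per-unit penalty term.

```python
import numpy as np

rng = np.random.default_rng(2)
ns, nc, nr = 3, 3, 2               # stimulus, context, response units
n = ns + nc + nr
S_idx, C_idx, R_idx = slice(0, ns), slice(ns, ns + nc), slice(ns + nc, n)

w = rng.normal(0, 0.5, (n, n)); w = (w + w.T) / 2
w[S_idx, C_idx] = 0.0; w[C_idx, S_idx] = 0.0   # separability: no s<->c links

# the stimulus input drives only stimulus and response units; likewise context
v = np.zeros((n, 2))
v[S_idx, 0] = rng.normal(0, 0.5, ns)
v[C_idx, 1] = rng.normal(0, 0.5, nc)
v[R_idx, 0] = rng.normal(0, 0.5, nr)
v[R_idx, 1] = rng.normal(0, 0.5, nr)

def S_unit(z, a=1.0):
    # per-unit penalty; the split below holds for any per-unit penalty
    return a * (np.log(z) + np.log(1 - z)) + (z*np.log(z) + (1-z)*np.log(1-z)) / a

def G(z, x):
    return z @ w @ z / 2 + z @ v @ x - np.sum(S_unit(z))

z = rng.uniform(0.05, 0.95, n); x = np.array([0.7, -0.3])
zs, zc, zr = z[S_idx], z[C_idx], z[R_idx]

Gs = zs @ w[S_idx, S_idx] @ zs / 2 + zs @ w[S_idx, R_idx] @ zr \
     + zs @ v[S_idx, 0] * x[0] + zr @ v[R_idx, 0] * x[0] - np.sum(S_unit(zs))
Gc = zc @ w[C_idx, C_idx] @ zc / 2 + zc @ w[C_idx, R_idx] @ zr \
     + zc @ v[C_idx, 1] * x[1] + zr @ v[R_idx, 1] * x[1] - np.sum(S_unit(zc))
Gr = zr @ w[R_idx, R_idx] @ zr / 2 - np.sum(S_unit(zr))

assert np.isclose(G(z, x), Gs + Gc + Gr)
```

The final assertion is exact up to floating-point error for any state and input, precisely because the zeroed weight blocks remove every term that couples stimulus and context.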
Similar notation is used for the other submatrices of w and v. It follows that we can write the ratio of the joint probability densities of two states z and z̃ as follows: \n\np_{Z^α|X}(z^s, z^c, z^r | x^s, x^c) / p_{Z^α|X}(z̃^s, z̃^c, z̃^r | x^s, x^c) = exp(G_α^s(z^s, z^r | x^s) + G_α^c(z^c, z^r | x^c) + G_α^r(z^r)) / exp(G_α^s(z̃^s, z̃^r | x^s) + G_α^c(z̃^c, z̃^r | x^c) + G_α^r(z̃^r)), (17) \n\nwhich factorizes as desired. To get probability densities for the response units, we integrate over the states of all the other units: \n\np_{Z^r_α|X}(z^r | x^s, x^c) = ∫∫ p_{Z^α|X}(z^s, z^c, z^r | x^s, x^c) dz^s dz^c, (18) \n\nand after rearranging terms \n\np_{Z^r_α|X}(z^r | x^s, x^c) = (1/K_α(x^s, x^c)) (∫ exp(G_α^s(z^s, z^r | x^s) + G_α^r(z^r)) dz^s) (∫ exp(G_α^c(z^c, z^r | x^c)) dz^c), (19) \n\nwhich also factorizes. All that is left is mapping continuous states of the response units to discrete external responses. To do so we partition the space of the response specification units into discrete regions. The probability of a response becomes the integral of the probability density over the region corresponding to that response. The problem is that the integral of probability densities does not necessarily factorize even though the densities factorize at every point. \n\nFortunately there are two important cases for which the law holds, at least as a good approximation. The first case is when the response regions are small and thus we can approximate the integral over a region by the density at a point times the volume of the region. In such a case the ratio of the integrals can be approximated by the ratio of the probability densities of the corresponding individual states. The second case applies to models, like McClelland and Rumelhart's (1981) interactive activation model, in which each response is associated with a distinct response unit. These models typically have negative connections amongst the response units, so that at equilibrium one unit tends to be active while the others are inactive. 
In such a case a common response policy picks the response corresponding to the active unit. We now show that such a policy can approximate the Morton-Massaro law to an arbitrary level of precision as the gain parameter of the response units is increased. Let z represent the joint state of a network and let the first r components of z be the states of the response specification units. Let z^{(1)} = (1, 0, 0, ..., 0)′ and z^{(2)} = (0, 1, 0, ..., 0)′ be two r-dimensional vectors representing states of the response specification units. For i ∈ {1, 2} and Δ ∈ (0, 1) let \n\nz^{(i)}_Δ = (1 − z^{(i)}) Δ + z^{(i)} (1 − Δ), (20) \n\nR^{(i)}_Δ = {x ∈ ℝ^r : x_j ∈ ((1 − Δ) z^{(i)}_j, Δ + (1 − Δ) z^{(i)}_j), for j = 1, ..., r}. (21) \n\nThe sets R^{(1)}_Δ and R^{(2)}_Δ are regions of the [0,1]^r space mapping into two distinct external responses. We now investigate the convergence of the probability ratio of these two responses as we let Δ → 0, i.e., as the response regions collapse into corners of [0,1]^r: \n\nlim_{Δ→0} P(Z^r_α ∈ R^{(2)}_Δ | X = x) / P(Z^r_α ∈ R^{(1)}_Δ | X = x) = lim_{Δ→0} (∫_{R^{(2)}_Δ} p_{Z^r_α|X}(u | x) du) / (∫_{R^{(1)}_Δ} p_{Z^r_α|X}(u | x) du) (22) \n\n= lim_{Δ→0} (Δ^r p_{Z^r_α|X}(z^{(2)}_Δ | x)) / (Δ^r p_{Z^r_α|X}(z^{(1)}_Δ | x)) = lim_{Δ→0} (∫∫ e^{G_α(z^{(2)}_Δ, z^s, z^c | x)} dz^s dz^c) / (∫∫ e^{G_α(z^{(1)}_Δ, z^s, z^c | x)} dz^s dz^c). (23) \n\nTable 1: Predictions by the Morton-Massaro law (left side) versus diffusion network (square brackets) for subject 7 of Massaro and Cohen (1983), Experiment 2. Each prediction of the diffusion network is based on 100 random samples. 
\n\nContext \n\nStimulus \n\nV \n\n0 \n1 \n2 \n3 \n4 \n5 \n6 \n\n0.0017 \n0.0126 \n0.1105 \n0.5463 \n0.9827 \n0.9999 \n0.9999 \n\n0.01 \n0.00 \n0.19 \n0.54 \n1.00 \n1.00 \n1.00 \n\nS \n\n0.0000 \n0.0000 \n0.0008 \n0.0079 \n0.2756 \n0.9924 \n0.9924 \n\n0.00 \n0.00 \n0.00 \n0.00 \n0.30 \n0.99 \n1.00 \n\nP \n\nT \n\n0.0152 \n0.1008 \n0.5208 \n0.9133 \n0.9980 \n0.9999 \n0.9999 \n\n0.03 \n0.10 \n0.45 \n0.91 \n1.00 \n1.00 \n1.00 \n\n0.9000 \n0.9849 \n0.9984 \n0.9998 \n0.9999 \n1.0000 \n1.0000 \n\n0.91 \n0.97 \n1.00 \n1.00 \n1.00 \n1.00 \n1.00 \n\nNow note that \n\nGo(z~), Z8, ZC I x) = H(z~), Z8, ZC I x) - L So; (Z~!i) - L So; (zt) - L SOj (zj), \n\nr \n\ni=1 \n\ni \n\nj \n\n(24) \n\nand since E~=l So; (Z~!i) = E;=1 So; (Z~!i)' it follows that \n\n. P(Z~ E R~) I X = x) _ J J eH(z~),z\u00b7,ze I x)-E;Sa;(zi)-E j Saj(zj)dz 8 dz c \nhm \n~-+o P(Z~ E R~ I X = x) J J eH(za.. ,Z',ze I x)-E; Sai(zt)-E j Saj(Zj)dz 8 dzc \n\n(1) \n\n\u2022 \n\n-\n\n(1) \n\n(25) \nIt is easy to show that this ratio factorizes. Moreover, for all .6. > 0 if we let \n0:1 = ... = O:r = 0:, where 0: > 0 then \n\nlim P(Z~ E [.6.,1 - .6.t) = 0, \n\n0-+00 \n\n(26) \n\nsince as the gain of the response units increases So; decreases very fast at the corners \nof (0, 1 y. Thus as 0: -4 00 the random variable Z~ converges in distribution to a \ndiscrete random variable with mass at the corner of the [0, It hypercube and with \nfactorized probability ratios as expressed on (25). Since the indexing ofthe response \nunits is arbitrary the argument applies to all the responses. \n\no \n\n4 Discussion \n\nOur analysis establishes that in diffusion networks the Morton-Massaro law is not \nincompatible with the presence of feedback and lateral connections. Surprisingly, \neven though in diffusion networks with feedback connections stimulus and context \nunits are interdependent, it is still possible to factorize the effect of stimulus and \ncontext on response probabilities. 
\n\nThe analysis shows that the Morton-Massaro can be arbitrarily approximated as \nthe sharpness of the response units is increased. In practice we have found very \ngood approximations with relatively small values of the sharpness parameter (see \nTable 1 for an example). The analysis assumed that the weights were symmetric. \nMathematical analysis of the general case with non-symmetric weights is difficult. \n\n\fInformation Factorization \n\n51 \n\nHowever useful approximations exist (Movellan & McClelland, 1995) showing that \nif the noise parameter (7 is relatively small or if the activation function c.p is approx(cid:173)\nimately linear, symmetric weights are not needed to exhibit the Morton-Massaro \nlaw. \nThe analysis presented here has potential applications to investigate models of per(cid:173)\nception and the functional architecture of the brain. For example the interactive \nactivation model of word perception has a separable architecture and thus, diffusion \nversions of it adhere to the Morton Massaro law. The analysis also points to po(cid:173)\ntential applications in computational neuroscience. It would be of interest to study \nwhether the Morton-Massaro holds at the level of neural responses. For example, \nwe may excite a neuron with two different sources of information and observe its \nshort term average response to combination of stimuli. If the observed distribution \nof responses exhibits the Morton-Massaro law, this would be consistent with the \nexistence of separable channels converging into that neuron. Otherwise, it would \nindicate that the channels from the two input areas to the response may not be \nstructurally separable. \n\nReferences \n\nLuce, R. D. (1959). Individual choice behavior. New York: Wiley. \nMassaro, D. W. (1989). Testing between the TRACE Model and the fuzzy logical \n\nmodel of speech perception. Cognitive Psychology, 21, 398-42l. \n\nMassaro, D. W. (1998). Perceiving Talking Faces. 
Cambridge, Massachusetts: MIT \n\nPress. \n\nMassaro, D. W. & Cohen, M. M. (1983a). Phonological constraints in speech per(cid:173)\n\nception. Perception and Psychophysics, 94, 338-348. \n\nMcClelland, J. L. (1991). Stochastic interactive activation and the effect of context \n\non perception. Cognitive Psychology, 29, 1-44. \n\nMorton, J. (1969). The interaction of information in word recognition. Psychological \n\nReview, 76, 165-178. \n\nMovellan, J. R. (1998). A Learning Theorem for Networks at Detailed Stochastic \n\nEquilibrium. Neural Computation, 10(5), 1157-1178. \n\nMovellan, J. R. & McClelland, J. L. (1995) . Stochastic interactive processing, chan(cid:173)\nnel separability and optimal perceptual inference: an examination of Mor(cid:173)\nton's law. Technical Report PDP.CNS.95A, Available at http://cnbc.cmu.edu, \nCarnegie Mellon University. \n\n\f", "award": [], "sourceid": 1678, "authors": [{"given_name": "Javier", "family_name": "Movellan", "institution": null}, {"given_name": "James", "family_name": "McClelland", "institution": null}]}