{"title": "Sensory Modality Segregation", "book": "Advances in Neural Information Processing Systems", "page_first": 913, "page_last": 920, "abstract": "", "full_text": "Sensory Modality Segregation\n\nVirginia R. de Sa\n\nDepartment of Cognitive Science\nUniversity of California, San Diego\nLa Jolla, CA 92093-0515\ndesa@ucsd.edu\n\nAbstract\n\nWhy are sensory modalities segregated the way they are? In this paper we show that sensory modalities are well designed for self-supervised cross-modal learning. Using the Minimizing-Disagreement algorithm on an unsupervised speech categorization task with visual (moving lips) and auditory (sound signal) inputs, we show that very informative auditory dimensions actually harm performance when moved to the visual side of the network. It is better to throw them away than to consider them part of the \u201cvisual input\u201d. We explain this finding in terms of the statistical structure in sensory inputs.\n\n1 Introduction\n\nIn previous work [1, 2] we developed a simple neural network algorithm that learned categories from co-occurrences of patterns to different sensory modalities. Using only the co-occurring patterns of lip motion and acoustic signal, the network learned separate visual and auditory networks (subnets) to distinguish 5 consonant-vowel utterances. It performed almost as well as the corresponding supervised algorithm (where the utterance label is given) on the same data, and significantly better than a strategy of separate unsupervised clustering in each modality followed by clustering of these clusters (this strategy is used to initialize our algorithm).\n\nIn this paper we show that the success of this biologically motivated algorithm depends crucially on the statistics of features derived from different sensory modalities. 
We do this by examining the performance when the two \u201cnetwork-modalities\u201d or pseudo-modalities are made up of inputs from the different sensory modalities.\n\nThe Minimizing-Disagreement Algorithm\n\nThe Minimizing-Disagreement (M-D) algorithm is designed to allow two (or more) modalities (or subnets) to simultaneously train each other by finding a local minimum of the number of times the individual modalities disagree on their classification decision (see Figure 1). The modalities are essentially trained by running Kohonen\u2019s LVQ2.1 algorithm [3], but with the target class set by the output of the subnet of the other modality (receiving a co-occurring pattern), not by a supervisory external signal. The steps of the algorithm are as follows.\n\n[Figure 1 diagram: visual and auditory modality subnets, each with hidden units projecting to shared \u201cClass\u201d units in a multi-sensory object area; the class picked by each modality is fed back as the training signal for the other.]\n\nFigure 1: The network for the Minimizing-Disagreement algorithm. The weights from the hidden units to the output units determine the \u201clabels\u201d of the hidden units. These weights are updated throughout training to allow hidden units to change classes if needed. During training each modality creates an output label for the other as shown on the right side of the figure. After training, each modality subnet is tested separately.\n\n1. Initialize hidden unit weight vectors in each modality (unsupervised clustering)\n\n2. Initialize hidden unit labels using unsupervised clustering of the activity patterns across the hidden units from both modalities\n\n3. 
Repeat for each presentation of input patterns X1(n) and X2(n) to their respective modalities:\n\n\u2022 For each modality, find the two hidden unit weight vectors nearest to the respective input pattern.\n\n\u2022 Find the hypothesized output class in each modality (as given by the label of the hidden unit with the closest weight vector). The label of a hidden unit is the output unit to which it projects most strongly.\n\n\u2022 For each modality, update the hidden unit weight vectors according to the LVQ2.1 rule (only the rules for modality 1 are given below). Updates are performed only if the current pattern X1(n) falls within c(n) of the border between two hidden units of different classes (one of them agreeing with the output from the other modality). In this case\n\nw1i*(n) = w1i*(n-1) + e(n) (X1(n) - w1i*(n-1)) / ||X1(n) - w1i*(n-1)||\n\nw1j*(n) = w1j*(n-1) - e(n) (X1(n) - w1j*(n-1)) / ||X1(n) - w1j*(n-1)||\n\nwhere e(n) is the learning rate, w1i* is the weight vector of the hidden unit whose label matches the other modality\u2019s output, and w1j* is the weight vector of the hidden unit with another label.\n\n\u2022 Update the labeling weights using Hebbian learning between the winning hidden unit and the output of the other modality.\n\nIn order to discourage runaway to one of the trivial global minima of disagreement, where both modalities only ever output one class, the weights to the output class neurons are renormalized at each step. This normalization means that the algorithm is not modifying the output weights to minimize the disagreement, but is instead clustering the hidden unit representation using the output class given by the other modality.\n\n
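The per-pattern update in step 3 can be sketched in Python. This is our own minimal framing, not code from the paper: the function name, the standard LVQ2.1 window test (ratio of the two distances against (1-c)/(1+c)), and the in-place array update are all assumptions about how c(n) and e(n) are applied.

```python
import numpy as np

def md_update(x, W, labels, other_class, eps, c):
    """One LVQ2.1-style Minimizing-Disagreement step for one modality (a sketch).

    x           -- this modality's input pattern
    W           -- (n_hidden, dim) hidden-unit weight vectors
    labels      -- current class label of each hidden unit
    other_class -- class output by the co-occurring modality (the "target")
    eps, c      -- learning rate and window width, as in LVQ2.1
    """
    d = np.linalg.norm(W - x, axis=1)        # distance to each hidden unit
    i, j = np.argsort(d)[:2]                 # the two nearest hidden units
    # Update only when the two nearest units carry different labels, one of
    # them agrees with the other modality, and x lies inside the window.
    if labels[i] != labels[j] and other_class in (labels[i], labels[j]):
        if min(d[i], d[j]) / max(d[i], d[j]) > (1 - c) / (1 + c):
            win = i if labels[i] == other_class else j   # agreeing unit
            lose = j if win == i else i                  # disagreeing unit
            # Move the agreeing unit toward x and the other away, each along
            # a unit vector scaled by eps (the normalized rule in the text).
            W[win] += eps * (x - W[win]) / np.linalg.norm(x - W[win])
            W[lose] -= eps * (x - W[lose]) / np.linalg.norm(x - W[lose])
    return W
```

On a toy pattern the unit whose label matches the other modality's output moves toward the input while the disagreeing unit moves away, which is exactly the cross-modal substitution for the supervised LVQ2.1 target.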
This objective is better for these weights as it balances the goal of agreement with the desire to avoid the trivial solution of all hidden units having the same label.\n\n[Figure 2 diagram: the visual input is a time series of motion vectors over image areas, split into Vx and Vy; the auditory input is a time series of frequency vectors over frequency channels, split into Ax and Ay.]\n\nFigure 2: An example auditory and visual pattern vector. The figure shows which dimensions went into each of Ax, Ay, Vx, and Vy.\n\n2 Experiments\n\n2.1 Creation of Sub-Modalities\n\nThe original auditory and visual data were collected using an 8mm camcorder and a directional microphone. The speaker spoke 118 repetitions of /ba/, /va/, /da/, /ga/, and /wa/. The first 98 samples of each utterance class formed the training set and the remaining 20 the test set. The auditory feature vector was encoded using a 24-channel mel code (see footnote 1) over 20 msec windows overlapped by 10 msec. This is a coarse short-time frequency encoding, which crudely approximates peripheral auditory processing. Each feature vector was linearly scaled so that all dimensions lie in the range [-1,1]. The final auditory code is a 216-dimensional (24 \u00d7 9) vector for each utterance. An example auditory feature vector is shown in Figure 2 (bottom).\n\nThe visual data were processed using software designed and written by Ramprasad Polana [4]. Visual frames were digitized as 64 \u00d7 64 8-bit gray-level images using the Datacube MaxVideo system. Segments were taken as 6 frames before the acoustically determined utterance offset and 4 after. The normal flow was computed using differential techniques between successive frames. Each pair of frames was averaged, these averaged frames were divided into 25 equal areas (5 \u00d7 5), and the motion magnitudes were averaged within each area. The final visual feature vector, of dimension 125 (5 frames \u00d7 25 areas), was linearly normalized as for the auditory vectors.\n\n
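The linear scaling into [-1,1] applied to both feature vectors can be written as a short sketch. The function name is ours, and scaling by per-dimension extrema of the data matrix is an assumption; the paper only states that each dimension is linearly scaled into the range.

```python
import numpy as np

def scale_to_unit(X):
    """Linearly rescale each dimension (column) of a data matrix into [-1, 1].

    A sketch: assumes the scaling uses each dimension's min/max over the
    batch, which the paper does not specify in detail.
    """
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # guard against constant dimensions
    return 2.0 * (X - lo) / span - 1.0

# e.g. a batch of auditory vectors: 24 mel channels x 9 time windows = 216 dims
A = scale_to_unit(np.random.default_rng(0).random((98, 24 * 9)))
```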
An example visual feature vector is shown in Figure 2 (top).\n\nThe original auditory and visual feature vectors were divided into two parts (called Ax, Ay and Vx, Vy as shown in Figure 2). The partition was arbitrarily determined as a compromise between wanting a similar number of dimensions and similar information content in each part. (We did not search over partitions; the experiments below were performed only for this partition.) Our goal is to combine them in different ways and observe the performance of the minimizing-disagreement algorithm.\n\nWe first benchmarked the divided \u201csub-modalities\u201d to see how useful they were for the task. For this, we ran a supervised algorithm on each subset. The performance measurements are shown in Table 1.\n\nFootnote 1: linear spacing below 1000 Hz and logarithmic above 1000 Hz.\n\nSub-Modality | Supervised Performance\nAx | 89 \u00b1 2\nAy | 91 \u00b1 2\nVx | 83 \u00b1 2\nVy | 77 \u00b1 3\n\nTable 1: Supervised performance of each of the sub-modalities. All numbers give percent correct classifications on independent test sets \u00b1 standard deviations.\n\n2.2 Creation of Pseudo-Modalities\n\nPseudo-modalities were created by combining all combinations (of 3 or fewer) of Ax, Ay, Vx and Vy; thus Ax+Vx+Vy (Ax+V) would be a pseudo-modality. The idea is to test all possible combinations of pseudo-modalities and compare the resulting performance of the final individual subnets with what a supervised algorithm could do with the same dimensions.\n\n2.3 Pseudo-Modality Experiments\n\nIn order to allow a fair comparison, appropriate parameters were found for each modality division. The data were divided into 75% training and 25% test data. Optimal parameters were selected by observing performance on the training data, and performance is reported on the test data.\n\nThe results for all possible divisions are presented in Figure 3. Each network has the following key.\n\n
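The pseudo-modality divisions just described can be enumerated mechanically: every split of {Ax, Ay, Vx, Vy} into two non-empty sides of 3 or fewer parts each, counting complementary pairs once, which yields the seven networks of Figure 3. A small sketch (the variable names are ours):

```python
from itertools import combinations

parts = ["Ax", "Ay", "Vx", "Vy"]

# Every division of the four sub-modalities into two non-empty
# pseudo-modalities (each side gets 1-3 parts). Taking r = 1 and r = 2
# covers all divisions, since an r = 3 side is the complement of an
# r = 1 side; 2/2 divisions are de-duplicated by tuple comparison.
splits = []
for r in (1, 2):
    for side1 in combinations(parts, r):
        side2 = tuple(p for p in parts if p not in side1)
        if r == 2 and side1 > side2:   # skip the mirror of a 2/2 division
            continue
        splits.append((side1, side2))
```

Running this gives the four singleton divisions (e.g. Ax vs. Ay+Vx+Vy) plus the three balanced ones (Ax+Ay vs. Vx+Vy, Ax+Vx vs. Ay+Vy, Ay+Vx vs. Ax+Vy), matching the seven panels of Figure 3.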
The light gray bar and number represent the test-set performance of the pseudo-modality consisting of the sub-modalities listed below it. The darker bar and number represent the test-set performance of the other pseudo-modality. The black outlines (and the numbers above the outlines) give the performance of the corresponding supervised algorithm (LVQ2.1) with the same data. Thus, the empty area between the shaded area and the black outline represents the loss from lack of supervision.\n\nLooking at the figure, one can make several comparisons. For each sub-modality, we can ask: to get the best performance of a subnet using those dimensions, where should one put the other sub-modalities in an M-D network? For instance, to answer that question for Ax, one would compare the performance of the Ax subnet in the Ax/Ay+Vx+Vy network with that of the Ax+Ay subnet in the Ax+Ay/Vx+Vy network, with that of the Ax+Vx+Vy subnet in the Ax+Vx+Vy/Ay network, etc. The subnet containing Ax that performs best is the Ax+Ay subnet (trained with co-modality Vx+Vy). In fact, it turns out that for each sub-modality, the architecture giving optimal post-training performance of the subnet containing that sub-modality is to put the dimensions from the same \u201creal\u201d modality on the same side and those from the other modality on the other side.\n\nThis raises the question: is performance better for the Ax+Ay/Vx+Vy network than for the Ax/Ay+Vx+Vy network because the benefit of having Ay with Ax is greater than that of having Ay with Vx and Vy (in other words, are there some higher-order relationships between dimensions in Ax and those in Ay that require both to be learned by the same subnet), OR is it actually harmful to have Ay on the opposite side from Ax? We can answer this question by comparing the performance of the Ax/Ay+Vx+Vy network with that of the Ax/Vx+Vy network, as shown in Figure 4. 
For that particular division, the results are not significantly different (even though we have removed the most useful dimensions), but for all the other divisions, performance is improved when dimensions are removed so that only dimensions from one \u201creal\u201d sensory modality are on one side. For example, the two graphs in the second column show that it is actually harmful to include the very useful Ax dimensions on the visual side of the network \u2013 we do better when we throw them away. Note that this is true even though a supervised network with Ax+Vx+Vy does much better than a supervised network with Vx+Vy \u2014 this is not a simple feature selection result.\n\n[Figure 3 bar charts: test-set performance for each pseudo-modality division (Ax/Ay,Vx,Vy; Ay/Ax,Vx,Vy; Ax,Ay,Vy/Vx; Ax,Ay,Vx/Vy; Ax,Ay/Vx,Vy; Ax,Vx/Ay,Vy; Ay,Vx/Ax,Vy), with shaded bars for the self-supervised subnets and black outlines for the corresponding supervised performance.]\n\nFigure 3: Self-supervised and supervised performances for the various pseudo-modality divisions. Standard errors for the self-supervised performance means are \u00b11. Those for the supervised performances are \u00b1.5.\n\n2.4 Correlational structure is important\n\nWhy do we get these results? The answer is that the results are very dependent on the statistical structure between dimensions within and between different sensory modalities.\n\nConsider a simpler system of two 1-dimensional modalities and two classes of objects. Assume that the sensation detected by each modality has a probability density given by a Gaussian with a different mean for each class. The densities seen by each modality are shown in Figure 5. In part B) of the figure, the joint density for the stimuli to both modalities is shown for the case of conditionally uncorrelated stimuli (within each class, the inputs are uncorrelated). 
Parts C) and D) show the changing joint density as the sensations to the two modalities become more correlated within each class. Notice that the density changes from a \u201ctwo blob\u201d structure to more of a \u201cridge\u201d structure. As it does this, the projection of the joint density gives less indication of the underlying bimodal structure, and the local minimum of the Minimizing-Disagreement energy function gets shallower and narrower. This means that the M-D algorithm would be less likely to find the correct boundary.\n\nA more intuitive explanation is shown in Figure 6. In the figure, imagine that there are two classes of objects, with densities given by the thick curve and the thin curve, and that this marginal density is the same in each one-dimensional modality. The line drawing below the densities shows two possible scenarios for how the \u201cmodalities\u201d may be related. In the top case, the modalities are conditionally independent. Given that a \u201cthick\u201d object is present, the particular pattern to each modality is independent. The lines represent a possible sampling of data (where points are joined if they co-occurred). The minimizing-disagreement algorithm wants to find a line from top to bottom that crosses the fewest lines \u2013 within the pattern space, disagreement is minimized for the dashed line shown.\n\n[Figure 4 bar charts: the four single-sub-modality divisions from Figure 3 (top row) compared with the same networks after the dimensions sharing a real modality with the single sub-modality are removed from the opposite pseudo-modality (bottom row).]\n\nFigure 4: This figure shows the benefits of having a pseudo-modality composed of dimensions from only ONE real modality (even if this means throwing away useful dimensions). 
Standard errors for the self-supervised performance means are \u00b11. Those for the supervised performances are \u00b1.5.\n\n[Figure 5 panels: A) example densities for two classes in one modality; B) joint density with r = 0; C) joint density with r = 0.2; D) joint density with r = 0.5.]\n\nFigure 5: Different joint densities with the same marginal densities.\n\n[Figure 6 diagram: co-occurring patterns in two imaginary 1-D modalities, drawn for a conditionally independent case (top) and a highly correlated case (bottom).]\n\nFigure 6: Lines are joined between co-occurring patterns in two imaginary 1-D modalities (as shown at top). The M-D algorithm wants to find a partition that crosses the fewest lines.\n\n[Figure 7 panels: conditional information I(X;Y|Class) between feature pairs (with diagonal zeroed), and within-class correlation coefficients (averaged over each class).]\n\nFigure 7: Statistical structure of our data.\n\nIn the bottom case, the modalities are strongly dependent. In this case there are many local minima of the disagreement that are not closely related to the class boundary. It is easy for the networks to minimize the disagreement between the outputs of the modalities without paying attention to the class. Having two very strongly dependent variables, one on each side of the network, means that the network can minimize disagreement by simply listening to those units.\n\nTo verify that our auditory-visual results were due to statistical differences between the dimensions, we examined the statistical structure of our data. It turns out that, within a class, the correlation coefficient between most pairs of dimensions is fairly low.\n\n
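The toy densities of Figure 5 can be reproduced numerically to see the within-class correlation effect directly. This is a sketch of the setup described in the text; the class means at -1 and +1 and unit variances are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sample_two_modalities(n, r, rng):
    """Draw n co-occurring 1-D sensations for two classes with within-class
    correlation r between the modalities (a sketch of Figure 5's densities)."""
    cov = [[1.0, r], [r, 1.0]]
    cls = rng.integers(0, 2, size=n)               # class of each co-occurrence
    means = np.where(cls[:, None] == 0, -1.0, 1.0) # both modalities shift together
    x = rng.multivariate_normal([0.0, 0.0], cov, size=n) + means
    return x, cls

rng = np.random.default_rng(1)
x0, c0 = sample_two_modalities(20000, 0.0, rng)    # "two blob" joint structure
x9, c9 = sample_two_modalities(20000, 0.9, rng)    # "ridge" joint structure
```

Both samples have the same marginal density in each modality; only the within-class correlation differs, which is exactly the change that flattens the Minimizing-Disagreement minimum.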
However, for related auditory features (similar time and frequency band) correlations are high, and likewise for related visual features. This is shown in Figure 7. We also computed the conditional mutual information between each pair of features given the class, I(X;Y|Class). This is also shown in Figure 7. This value is 0 if and only if the two features are conditionally independent given the class. The graphs show that many of the auditory dimensions are highly dependent on each other (even given the class), as are many of the visual dimensions. This makes them unsuitable for serving on the other side of an M-D network.\n\n2.5 Discussion\n\nThe minimizing-disagreement algorithm was initially developed as a model of self-supervised cortical learning, and the importance of conditionally uncorrelated structure was mentioned in [5]. Since then, similar partly-supervised algorithms have been used to deal with limited labeled data in machine learning problems [6, 7], and these works have also emphasized the importance of conditional independence between the two sides of the input. However, in the co-training-style algorithms, inputs that are conditionally dependent are not helpful, but neither are they as harmful. Because the self-supervised algorithm depends on the class structure being evident in the joint space as its only source of supervision, it is very sensitive to conditionally dependent relationships between the modalities.\n\nWe have shown that different sensory modalities are ideally suited for teaching each other. Sensory modalities are also composed of submodalities (e.g. color and motion for the visual modality), which are also likely to be conditionally independent (and indeed may be actively kept so [8, 9, 10]). 
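The conditional dependence measure used in Figure 7 can be estimated from samples. The sketch below discretizes each feature into histogram bins and averages the per-class mutual information; the 8-bin discretization and the function name are our assumptions, since the paper does not specify its estimator.

```python
import numpy as np

def cond_mutual_info(x, y, cls, bins=8):
    """Histogram estimate of I(X;Y|Class) in bits (a sketch, not the
    paper's estimator): zero iff X and Y are conditionally independent
    given the class, up to discretization and sampling bias."""
    xd = np.digitize(x, np.histogram_bin_edges(x, bins=bins)[1:-1])
    yd = np.digitize(y, np.histogram_bin_edges(y, bins=bins)[1:-1])
    mi = 0.0
    for c in np.unique(cls):
        m = cls == c
        joint = np.zeros((bins, bins))
        np.add.at(joint, (xd[m], yd[m]), 1.0)   # class-conditional joint counts
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)   # P(x | class)
        py = joint.sum(axis=0, keepdims=True)   # P(y | class)
        nz = joint > 0
        mi += m.mean() * np.sum(joint[nz] * np.log2(joint[nz] / (px @ py)[nz]))
    return mi
```

For two features that only co-vary through the class label the estimate is near zero, while a feature paired with itself gives a large value, mirroring the contrast between cross-modal and within-modal feature pairs in Figure 7.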
We suggest that brain connectivity may be constrained not only by volume limits, but also because limiting connectivity may be beneficial for learning.\n\nAcknowledgements\n\nA preliminary version of this work appeared in a book chapter [5] in the book Psychology of Learning and Motivation. This work is supported by NSF CAREER grant 0133996.\n\nReferences\n\n[1] Virginia R. de Sa. Learning classification with unlabeled data. In J.D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems 6, pages 112\u2013119. Morgan Kaufmann, 1994.\n\n[2] Virginia R. de Sa and Dana H. Ballard. Category learning through multimodality sensing. Neural Computation, 10(5):1097\u20131117, 1998.\n\n[3] Teuvo Kohonen. Improved versions of learning vector quantization. In IJCNN International Joint Conference on Neural Networks, volume 1, pages I-545\u2013I-550, 1990.\n\n[4] Ramprasad Polana. Temporal Texture and Activity Recognition. PhD thesis, Department of Computer Science, University of Rochester, 1994.\n\n[5] Virginia R. de Sa and Dana H. Ballard. Perceptual learning from cross-modal feedback. In R. L. Goldstone, P.G. Schyns, and D. L. Medin, editors, Psychology of Learning and Motivation, volume 36, pages 309\u2013351. Academic Press, San Diego, CA, 1997.\n\n[6] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT-98), pages 92\u2013100, 1998.\n\n[7] Ion Muslea, Steve Minton, and Craig Knoblock. Active + semi-supervised learning = robust multi-view learning. In Proceedings of the 19th International Conference on Machine Learning (ICML 2002), pages 435\u2013442, 2002.\n\n[8] C. McCollough. Color adaptation of edge-detectors in the human visual system. Science, 149:1115\u20131116, 1965.\n\n[9] P.C. Dodwell and G.K. Humphrey. 
A functional theory of the McCollough effect. Psychological Review, 1990.\n\n[10] F. H. Durgin and D.R. Proffitt. Combining recalibration and learning accounts of contingent aftereffects. In Proceedings of the Annual Meeting of the Psychonomic Society.\n", "award": [], "sourceid": 2524, "authors": [{"given_name": "Virginia", "family_name": "Sa", "institution": null}]}