{"title": "Learning Classification with Unlabeled Data", "book": "Advances in Neural Information Processing Systems", "page_first": 112, "page_last": 119, "abstract": null, "full_text": "Learning Classification with Unlabeled Data \n\nVirginia R. de Sa \n\ndesa@cs.rochester.edu \n\nDepartment of Computer Science \nUniversity of Rochester \nRochester, NY 14627 \n\nAbstract \n\nOne of the advantages of supervised learning is that the final error metric is available during training. For classifiers, the algorithm can directly reduce the number of misclassifications on the training set. Unfortunately, when modeling human learning or constructing classifiers for autonomous robots, supervisory labels are often unavailable or too expensive. In this paper we show that we can substitute for the labels by making use of the structure between the pattern distributions to different sensory modalities. We show that minimizing the disagreement between the outputs of networks processing patterns from these different modalities is a sensible approximation to minimizing the number of misclassifications in each modality, and leads to similar results. Using the Peterson-Barney vowel dataset we show that the algorithm performs well in finding appropriate placement for the codebook vectors, particularly when the confusable classes are different for the two modalities. \n\n1 INTRODUCTION \n\nThis paper addresses the question of how a human or autonomous robot can learn to classify new objects without experience with previous labeled examples. We represent objects with n-dimensional pattern vectors and consider piecewise-linear classifiers consisting of a collection of (labeled) codebook vectors in the space of the input patterns (see Figure 1). The classification boundaries are given by the Voronoi tessellation of the codebook vectors. 
\nPatterns are said to belong to the class (given by the label) of the codebook vector to which they are closest. \n\nFigure 1: A piecewise-linear classifier in a 2-dimensional input space. The circles represent data samples from two classes (filled (A) and not filled (B)). The X's represent codebook vectors (they are labeled according to their class, A or B). Future patterns are classified according to the label of the closest codebook vector. \n\nIn [de Sa and Ballard, 1993] we showed that the supervised algorithm LVQ2.1 [Kohonen, 1990] moves the codebook vectors to minimize the number of misclassified patterns. The power of this algorithm lies in the fact that it directly minimizes its final error measure (on the training set). The codebook vectors are placed not to approximate the probability distributions but to decrease the number of misclassifications. \n\nUnfortunately, in many situations labeled training patterns are either unavailable or expensive. The classifier cannot measure its classification performance while learning (and hence cannot directly maximize it). One unsupervised alternative, Competitive Learning [Grossberg, 1976; Kohonen, 1982; Rumelhart and Zipser, 1986], has unlabeled codebook vectors that move to minimize a measure of the reconstruction cost. Even with subsequent labeling, the codebook vectors are not well suited for classification because they have not been positioned to induce optimal borders. 
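The nearest-codebook classification rule described above can be sketched in a few lines. This is an illustrative example only (NumPy is assumed, and the codebook positions and labels are invented for the demonstration), not code from the paper.

```python
import numpy as np

def classify(x, codebook, labels):
    """Assign x the label of its nearest codebook vector.

    The decision regions are the Voronoi cells of the codebook
    vectors, which gives a piecewise-linear classifier.
    """
    d = np.linalg.norm(codebook - x, axis=1)
    return labels[np.argmin(d)]

# Hypothetical 2-D example with two classes, as in Figure 1.
codebook = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
labels = np.array(["A", "B", "A", "B"])

print(classify(np.array([0.2, 0.1]), codebook, labels))  # nearest is (0, 0) -> A
print(classify(np.array([1.9, 1.8]), codebook, labels))  # nearest is (2, 2) -> B
```

Moving a codebook vector moves the piecewise-linear borders of its Voronoi cell; the rest of the paper is about where to move the vectors without labels.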
\n\n[Figure 2 graphic: three panels contrasting Supervised learning (implausible \"cow\" label), Unsupervised learning (limited power), and Self-Supervised learning, which derives its label from a co-occurring input to another modality.] \n\nFigure 2: The idea behind the algorithm. \n\nThis paper presents a new measure for piecewise-linear classifiers receiving unlabeled patterns from two or more sensory modalities. Minimizing the new measure is an approximation to minimizing the number of misclassifications directly. It takes advantage of the structure available in natural environments, which results in sensations to different sensory modalities (and sub-modalities) that are correlated. For example, hearing \"mooing\" and seeing cows tend to occur together. So, although the sight of a cow does not come with an internal homuncular \"cow\" label, it does co-occur with an instance of a \"moo\". The key is to process the \"moo\" sound to obtain a self-supervised label for the network processing the visual image of the cow, and vice-versa. See Figure 2. \n\nFigure 3: This figure shows an example world as sensed by two different modalities. If modality 1 receives a pattern from its Class A distribution, modality 2 receives a pattern from its own Class A distribution (and the same for Class B). Without receiving information about which class the patterns came from, the modalities must try to determine appropriate placement of the boundaries b_1 and b_2. P(C_i) is the prior probability of Class i and p(x_j|C_i) is the conditional density of Class i for modality j. 
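The disagreement measure can be made concrete with a small simulation. The sketch below is illustrative only (the Gaussian class-conditional distributions and boundary values are invented): it estimates, from paired unlabeled samples, the fraction of pairs on which two single-threshold classifiers, one per modality, output different classes. Moving either boundary away from the crossing point of its class-conditional densities increases the disagreement, and that increase is the error signal the algorithm exploits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical paired data: the world draws a hidden class; each modality
# then receives a sample from its own class-conditional distribution.
n = 10_000
cls = rng.random(n) < 0.5                                   # hidden class, never observed
x1 = np.where(cls, rng.normal(1.0, 0.5, n), rng.normal(-1.0, 0.5, n))
x2 = np.where(cls, rng.normal(2.0, 0.5, n), rng.normal(0.0, 0.5, n))

def disagreement(b1, b2):
    """Fraction of pairs on which the two threshold classifiers differ."""
    return np.mean((x1 > b1) != (x2 > b2))

# Disagreement is lowest near the true class boundaries (0 and 1 here),
# even though no labels were used to compute it.
print(disagreement(0.0, 1.0))   # small
print(disagreement(0.8, 1.0))   # larger: modality 1's boundary is misplaced
```

No label enters the computation; the agreement structure of the paired samples alone identifies good boundary placements.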
\n\n2 USING MULTI-MODALITY INFORMATION \n\nOne way to make use of the cross-modality structure is to derive labels for the codebook vectors (after they have been positioned either by random initialization or an unsupervised algorithm). The labels can be learnt with a competitive learning algorithm using a network such as that shown in Figure 4. In this network the hidden layer competitive neurons represent the codebook vectors. Their weights from the input neurons represent their positions in the respective input spaces. Presentation of the paired patterns results in activation of the closest codebook vectors in each modality (and 0's elsewhere). Co-occurring codebook vectors will then increase their weights to the same competitive output neuron. After several iterations the codebook vectors are given the (arbitrary) label of the output neuron to which they have the strongest weight. We will refer to this as the \"labeling algorithm\". \n\n2.1 MINIMIZING DISAGREEMENT \n\nA more powerful use of the extra information is for better placement of the codebook vectors themselves. \n\nIn [de Sa, 1994] we derive an algorithm that minimizes(1) the disagreement between the outputs of two modalities. The algorithm is originally derived not as a piecewise-linear classifier but as a method of moving boundaries for the case of two classes and an agent with two 1-dimensional sensing modalities, as shown in Figure 3. \n\nEach class has a particular probability distribution for the sensation received by each modality. If modality 1 experiences a sensation from its pattern A distribution, modality 2 experiences a sensation from its own pattern A distribution. 
That is, the world presents patterns from the 2-D joint distribution shown in Figure 5a), but each modality can only sample its 1-D marginal distribution (shown in Figure 3 and Figure 5a). \n\n(1) The goal is actually to find a non-trivial local minimum (for details see [de Sa, 1994]). \n\n[Figure 4 diagram: Input (X) feeds a hidden layer of codebook vectors (W), which feeds the Output (Class) layer, for Modality/Network 1 and Modality/Network 2.] \n\nFigure 4: This figure shows a network for learning the labels of the codebook vectors. The weight vectors of the hidden layer neurons represent the codebook vectors, while the weight vectors of the connections from the hidden layer neurons to the output neurons represent the output class that each codebook vector currently represents. In this example there are 3 output classes and two modalities, each of which has 2-D input patterns and 5 codebook vectors. \n\nWe show [de Sa, 1994] that minimizing the disagreement error -- the proportion of pairs of patterns for which the two modalities output different labels -- \n\nE(b_1, b_2) = Pr{x_1 < b_1 & x_2 > b_2} + Pr{x_1 > b_1 & x_2 < b_2}   (1) \n\n= \u222b_{b_2}^{\u221e} \u222b_{-\u221e}^{b_1} f(x_1, x_2) dx_1 dx_2 + \u222b_{-\u221e}^{b_2} \u222b_{b_1}^{\u221e} f(x_1, x_2) dx_1 dx_2   (2) \n\n(where f(x_1, x_2) = P(C_A)p(x_1|C_A)p(x_2|C_A) + P(C_B)p(x_1|C_B)p(x_2|C_B) is the joint probability density for the two modalities) in the above problem results in an algorithm that corresponds to the optimal supervised algorithm except that the \"label\" for each modality's pattern is the hypothesized output of the other modality. \n\nConsider the example illustrated in Figure 5. In the supervised case (Figure 5a)) the labels are given, allowing sampling of the actual marginal distributions. For each modality, the number of misclassifications can be minimized by setting the boundary at the crossing point of its marginal distributions. \n\nHowever, in the self-supervised system the labels are not available. 
Instead, we are given the output of the other modality. Consider the system from the point of view of modality 2. Its patterns are labeled according to the outputs of modality 1. This labels the patterns in Class A as shown in Figure 5b). Thus, from the actual Class A patterns, the second modality sees the \"labeled\" distributions shown. Letting a be the fraction of misclassified patterns from Class A, the resulting distributions are given by (1-a)P(C_A)p(x_2|C_A) and (a)P(C_A)p(x_2|C_A). \n\nSimilarly, Figure 5c) shows the effect on the patterns in Class B. Letting b be the fraction of Class B patterns misclassified, the distributions are given by (1-b)P(C_B)p(x_2|C_B) and (b)P(C_B)p(x_2|C_B). Combining the effects on both classes results in the \"labeled\" distributions shown in Figure 5d). The \"apparent Class A\" distribution is given by (1-a)P(C_A)p(x_2|C_A) + (b)P(C_B)p(x_2|C_B) and the \"apparent Class B\" distribution by (a)P(C_A)p(x_2|C_A) + (1-b)P(C_B)p(x_2|C_B). Notice that even though the approximated distributions may be discrepant, if a \u2248 b the crossing point will be close. \n\nSimultaneously, the second modality is labeling the patterns for the first modality. At each iteration of the algorithm both borders move according to the samples from the \"apparent\" marginal distributions. \n\nFigure 5: This figure shows an example of the joint and marginal distributions (for better visualization the scale of the joint distribution is twice that of the marginal distributions) for the example problem introduced in Figure 3. The darker gray represents patterns labeled \"A\", while the lighter gray are labeled \"B\". The dark and light curves are the corresponding marginal distributions, with bold and regular labels respectively. a) shows the labeling for the supervised case. 
b), c) and d) reflect the labels given by modality 1 and the corresponding marginal distributions seen by modality 2. See text for more details. \n\n2.2 Self-Supervised Piecewise-Linear Classifier \n\nThe above ideas have been extended [de Sa, 1994] to rules for moving the codebook vectors in a piecewise-linear classifier. Codebook vectors are initially chosen randomly from the data patterns. In order to complete the algorithm, the codebook vectors need to be given initial labels (the derivation assumes that the current labels are correct). In LVQ2.1 the initial codebook vectors are chosen from among the data patterns that are consistent with their neighbours (according to a k-nearest-neighbour algorithm); their labels are then taken as the labels of the data patterns. In order to keep our algorithm unsupervised, the \"labeling algorithm\" mentioned earlier is used to derive labels for the initial codebook vectors. \n\nAlso, because the codebook vectors may cross borders or may not be accurately labeled in the initialization stage, they are updated throughout the algorithm by increasing the weight to the output class hypothesized by the other modality, from the neuron representing the closest codebook vector. The final algorithm is given in Figure 6. \n\n1. Randomly choose initial codebook vectors from data vectors \n2. Initialize labels of codebook vectors using the labeling algorithm described in text \n3. 
Repeat for each presentation of input patterns x_1(n) and x_2(n) to their respective modalities: \n\n\u2022 Find the two nearest codebook vectors in modality 1 -- w_{1,i1}, w_{1,i2} -- and modality 2 -- w_{2,k1}, w_{2,k2} -- to the respective input patterns \n\n\u2022 Find the hypothesized output class (C_A or C_B) in each modality (as given by the label of the closest codebook vector) \n\n\u2022 For each modality update the weights according to the following rules (only the rules for modality 1 are given). If neither or both of w_{1,i1}, w_{1,i2} have the same label as w_{2,k1}, or x_1(n) does not lie within c(n) of the border between them, no updates are done; otherwise \n\nw_{1,i*}(n) = w_{1,i*}(n-1) + \u03b1(n) (x_1(n) - w_{1,i*}(n-1)) / ||x_1(n) - w_{1,i*}(n-1)|| \n\nw_{1,j*}(n) = w_{1,j*}(n-1) - \u03b1(n) (x_1(n) - w_{1,j*}(n-1)) / ||x_1(n) - w_{1,j*}(n-1)|| \n\nwhere w_{1,i*} is the codebook vector with the same label, and w_{1,j*} is the codebook vector with another label. \n\n\u2022 Update the labeling weights \n\nFigure 6: The self-supervised piecewise-linear classifier algorithm \n\n3 EXPERIMENTS \n\nThe following experiments were all performed using the Peterson and Barney vowel formant data(2). The dataset consists of the first and second formants for ten vowels in a /hVd/ context from 75 speakers (32 males, 28 females, 15 children) who repeated each vowel twice(3). \n\nTo enable performance comparisons, each modality received patterns from the same dataset. This is because the final classification performance within a modality depends not only on the difficulty of the measured modality but also on that of the other \"labeling\" modality. Accuracy was measured individually (on the training set) for both modalities and averaged. These results were then averaged over 60 runs. The results described below are also tabulated in Table 1. \n\n(2) Obtained from Steven Nowlan. \n(3) 3 speakers were missing one vowel, and the raw data was linearly transformed to have zero mean and fall within the range [-3, 3] in both components. \n\nTable 1: Tabulation of performance figures (mean percent correct and sample standard deviation over 60 trials and 2 modalities). The heading i-j refers to performance measured after the j-th step during the i-th iteration. (Note that Step 1 is not repeated during the multi-iteration runs.) \n\n[Table 1 rows: same-paired vowels; random pairing -- the numerical entries were not recoverable; the values are given in the text.] \n\nIn the first experiment, the classes were paired so that the modalities received patterns from the same vowel class. If modality 1 received an [a] vowel, so did modality 2, and likewise for all the vowel classes (i.e. p(x_1|C_j) = p(x_2|C_j) for all j). After the labeling algorithm stage, the accuracy was 60\u00b15%, as the initial random placement of the codebook vectors does not induce a good classifier. After application of the third step in Figure 6 (the minimizing-disagreement part of the algorithm) the accuracy was 75\u00b14%. At this point the codebook vectors are much better suited to defining appropriate classification boundaries. \n\nIt was discovered that all stages of the algorithm tended to produce better results on the runs that started with better random initial configurations. Thus, for each run, steps 2 and 3 were repeated with the final codebook vectors. Average performance improved (73\u00b14% after step 2 and 76\u00b14% after step 3). Steps 2 and 3 were repeated several more times with no further significant increase in performance. \n\nThe power of using the cross-modality information to move the codebook vectors can be seen by comparing these results to those obtained with unsupervised competitive learning within modalities followed by an optimal supervised labeling algorithm, which gave a performance of 72%. 
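The minimizing-disagreement update used in these experiments (step 3 of the algorithm) can be sketched as follows. This is an illustrative reimplementation, not the author's code: `alpha` and `window` are assumed hyperparameters, and the test on the difference of the two distances stands in for the paper's condition that the input lie within c(n) of the border between the two nearest codebook vectors.

```python
import numpy as np

def update_modality1(x1, other_label, codebook, labels, alpha=0.05, window=0.3):
    """One self-supervised update of modality 1's codebook vectors (sketch).

    `other_label` is the class hypothesized by the other modality's closest
    codebook vector; it stands in for the missing supervisory label.
    """
    d = np.linalg.norm(codebook - x1, axis=1)
    i, j = np.argsort(d)[:2]                    # two nearest codebook vectors
    # No update unless exactly one of the two carries the hypothesized label.
    if (labels[i] == other_label) == (labels[j] == other_label):
        return
    if labels[j] == other_label:                # make i the same-label vector
        i, j = j, i
    # Crude border test (an assumption): skip if x1 is not near the midplane.
    if abs(d[i] - d[j]) > window:
        return
    # Pull the same-label vector toward x1, push the other-label vector away.
    codebook[i] += alpha * (x1 - codebook[i]) / np.linalg.norm(x1 - codebook[i])
    codebook[j] -= alpha * (x1 - codebook[j]) / np.linalg.norm(x1 - codebook[j])

# Toy example: input near the border, hypothesized class "A".
cb = np.array([[0.0, 0.0], [1.0, 0.0]])
lb = np.array(["A", "B"])
update_modality1(np.array([0.45, 0.0]), "A", cb, lb)
print(cb)   # the "A" vector has moved toward the input, "B" away from it
```

Each modality runs this update with the other modality's hypothesized class as its label, so both codebooks are shaped by the agreement signal alone.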
\n\nOne of the features of multi-modality information is that classes that are easily confusable in one modality may be well separated in another. This should improve the performance of the algorithm, as the \"labeling\" signal for separating the overlapping classes will be more reliable. In order to demonstrate this, more tests were conducted with random pairing of the vowels for each run. For example, presentation of [a] vowels to one modality would be paired with presentation of [i] vowels to the other. That is, p(x_1|C_j) = p(x_2|C_{a_j}) for a random permutation a_1, a_2, ..., a_10. For the labeling stage the performance was as before (60\u00b14%), as the difficulty within each modality has not changed. However, after the minimizing-disagreement algorithm the results were better, as expected. After 1 and 2 iterations of the algorithm, 77\u00b13% and 79\u00b12% were classified correctly. These results are close to the 80% obtained with the related supervised algorithm LVQ2.1. \n\n4 DISCUSSION \n\nIn summary, appropriate classification borders can be learnt without an explicit external labeling or supervisory signal. For the particular vowel recognition problem, the performance of this \"self-supervised\" algorithm is almost as good as that achieved with supervised algorithms. This algorithm would be ideal for tasks in which signals for two or more modalities are available, but labels are either not available or expensive to obtain. \n\nOne specific task is learning to classify speech sounds from images of the lips and the acoustic signal. Stork et al. [1992] performed this task with a supervised algorithm, but one of the main limitations for data collection was the manual labeling of the patterns [David Stork, personal communication, 1993]. 
This task also has the feature that the speech sounds that are confusable acoustically are not confusable visually, and vice-versa [Stork et al., 1992]. This complementarity helps the performance of this classifier, as the other modality provides more reliable labeling where it is needed most. \n\nThe algorithm could also be used for learning to classify signals to a single modality, where the signal to the other \"modality\" is a temporally close sample. As the world changes slowly over time, signals close in time are likely to come from the same class. This approach should be more powerful than that of [F\u00f6ldi\u00e1k, 1991], as signals close in time need not be mapped to the same codebook vector but only to the closest codebook vector of the same class. \n\nAcknowledgements \n\nI would like to thank Steve Nowlan for making the vowel formant data available to me. Many thanks also to Dana Ballard, Geoff Hinton and Jeff Schneider for their helpful conversations and suggestions. A preliminary version of parts of this work appears in greater depth in [de Sa, 1994]. \n\nReferences \n\n[de Sa, 1994] Virginia R. de Sa, \"Minimizing disagreement for self-supervised classification,\" in M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, pages 300-307. Erlbaum Associates, 1994. \n\n[de Sa and Ballard, 1993] Virginia R. de Sa and Dana H. Ballard, \"A note on learning vector quantization,\" in C. L. Giles, S. J. Hanson, and J. D. Cowan, editors, Advances in Neural Information Processing Systems 5, pages 220-227. Morgan Kaufmann, 1993. \n\n[F\u00f6ldi\u00e1k, 1991] Peter F\u00f6ldi\u00e1k, \"Learning Invariance from Transformation Sequences,\" Neural Computation, 3(2):194-200, 1991. \n\n[Grossberg, 1976] Stephen Grossberg, \"Adaptive Pattern Classification and Universal Recoding: I. Parallel Development and Coding of Neural Feature Detectors,\" Biological Cybernetics, 23:121-134, 1976. 
\n\n[Kohonen, 1982] Teuvo Kohonen, \"Self-Organized Formation of Topologically Correct Feature Maps,\" Biological Cybernetics, 43:59-69, 1982. \n\n[Kohonen, 1990] Teuvo Kohonen, \"Improved Versions of Learning Vector Quantization,\" in IJCNN International Joint Conference on Neural Networks, volume 1, pages I-545\u2013I-550, 1990. \n\n[Rumelhart and Zipser, 1986] D. E. Rumelhart and D. Zipser, \"Feature Discovery by Competitive Learning,\" in David E. Rumelhart, James L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 151-193. MIT Press, 1986. \n\n[Stork et al., 1992] David G. Stork, Greg Wolff, and Earl Levine, \"Neural network lipreading system for improved speech recognition,\" in IJCNN International Joint Conference on Neural Networks, volume 2, pages II-286\u2013II-295, 1992. \n", "award": [], "sourceid": 831, "authors": [{"given_name": "Virginia", "family_name": "de Sa", "institution": null}]}