{"title": "Unsupervised Classification of 3D Objects from 2D Views", "book": "Advances in Neural Information Processing Systems", "page_first": 949, "page_last": 956, "abstract": null, "full_text": "Unsupervised Classification of 3D Objects from 2D Views \n\nSatoshi Suzuki and Hiroshi Ando \n\nATR Human Information Processing Research Laboratories \n2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan \nsatoshi@hip.atr.co.jp, ando@hip.atr.co.jp \n\nAbstract \n\nThis paper presents an unsupervised learning scheme for categorizing 3D objects from their 2D projected images. The scheme exploits an auto-associative network's ability to encode each view of a single object into a representation that indicates its view direction. We propose two models that employ different classification mechanisms: the first model selects an auto-associative network whose recovered view best matches the input view, and the second model is based on a modular architecture whose additional network classifies the views by splitting the input space nonlinearly. We demonstrate the effectiveness of the proposed classification models through simulations using 3D wire-frame objects. \n\n1 INTRODUCTION \n\nThe human visual system can recognize various 3D (three-dimensional) objects from their 2D (two-dimensional) retinal images, although the images vary significantly as the viewpoint changes. Recent computational models have explored how to learn to recognize 3D objects from their projected views (Poggio & Edelman, 1990). Most existing models are, however, based on supervised learning, i.e., during training a teacher tells which object each view belongs to. The model proposed by Weinshall et al. (1990) also requires a signal that segregates different objects during training. This paper, on the other hand, discusses unsupervised aspects of 3D object recognition, where the system discovers categories by itself. 
\n\nThis paper presents an unsupervised classification scheme for categorizing 3D objects from their 2D views. The scheme consists of a mixture of 5-layer auto-associative networks, each of which identifies an object by nonlinearly encoding its views into a representation that describes the transformation of a rigid object. A mixture model with linear networks was also studied by Williams et al. (1993) for classifying objects under affine transformations. We propose two models that employ different classification mechanisms. The first model classifies a given view by selecting the auto-associative network whose recovered view best matches the input view. The second model is based on the modular architecture proposed by Jacobs et al. (1991), in which an additional 3-layer network classifies the views by directly splitting the input space. Simulations using 3D wire-frame objects demonstrate that both models effectively learn to classify each view as a 3D object. \n\nThis paper is organized as follows. Section 2 describes in detail the proposed models for unsupervised classification of 3D objects. Section 3 describes the simulation results using 3D wire-frame objects. In these simulations, we test the performance of the proposed models and examine what internal representations are acquired in the hidden layers. Finally, Section 4 concludes this paper. \n\n2 THE NETWORK MODELS \n\nThis section describes an unsupervised scheme that classifies 2D views into 3D objects. We initially examined classical unsupervised clustering schemes, such as the k-means method and the vector quantization method, to see whether such methods can solve this problem (Duda & Hart, 1973). Through simulations using the wire-frame objects described in the next section, we found that these methods do not yield satisfactory performance. We therefore propose a new unsupervised learning scheme for classifying 3D objects. 
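As a concrete reference point for the clustering baseline mentioned above, the following is a minimal k-means sketch (our illustration, not the authors' implementation; the deterministic seeding and the name `kmeans` are our choices). Applied to raw view vectors, such distance-based clustering tends to group views by image similarity rather than by object, which is consistent with the unsatisfactory performance reported here.

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means on the rows of X; returns (labels, centers)."""
    # Deterministic seeding: spread the initial centers across the data.
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].astype(float).copy()
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

Here `X` would hold one flattened 2D view per row (e.g., the 12 image coordinates of six vertices used in Section 3).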
\nThe proposed scheme exploits an auto-associative network for identifying an object. An auto-associative network finds an identity mapping through a bottleneck in the hidden layer, i.e., the network approximates functions F: R^n → R^m and F^{-1}: R^m → R^n, with m < n, such that F^{-1}(F(x)) ≈ x. The network thus compresses the input into a low-dimensional representation by eliminating redundancy. If we use a five-layer perceptron network, the network can perform nonlinear dimensionality reduction, which is a nonlinear analogue of principal component analysis (Oja, 1991; DeMers & Cottrell, 1993). \n\nThe proposed classification scheme consists of a mixture of five-layer auto-associative networks, which we call the identification networks, or I-Nets. In the case where the inputs are the projected views of a rigid object, the minimum dimension that constrains the input variation is the degree of freedom of the rigid object, which is six in the most general case: three for rotation and three for translation. Thus, a single I-Net can compress the views of an object into a representation whose dimension is its degree of freedom. The proposed scheme categorizes each view of a number of 3D objects into its class by selecting an appropriate I-Net. We present the following two models, which differ in their selection and learning methods. \n\nModel I: The model I selects the I-Net whose output best fits the input (see Fig. 1). Specifically, we assume a classifier whose output vector is given by the softmax function of the negative squared difference between the input and the output of the I-Nets, i.e., \n\nc_i = exp[-||y* - y_i||^2] / Σ_j exp[-||y* - y_j||^2]   (1) \n\nFigure 1: Model I and Model II. 
Each I-Net (identification net) is a 5-layer auto-associative network and the C-Net (classification net) is a 3-layer network. \n\nwhere y* and y_i denote the input and the output of the i-th I-Net, respectively. Therefore, if only one of the I-Nets has an output that best matches the input, the output value of the corresponding unit in the classifier becomes nearly one and the output values of the other units become nearly zero. For training the network, we maximize the following objective function: \n\nln( Σ_i exp[-α||y* - y_i||^2] / Σ_i exp[-||y* - y_i||^2] )   (2) \n\nwhere α (>1) denotes a constant. This function forces the output of at least one I-Net to fit the input, and it also forces the rest of the I-Nets to increase the error between the input and the output. Since it is difficult for a single I-Net to learn more than one object, we expect that the network will eventually converge to a state where each I-Net identifies only one object. \n\nModel II: The model II, on the other hand, employs an additional network, which we call the classification network or the C-Net, as illustrated in Fig. 1. The C-Net classifies the given views by directly partitioning the input space. This type of modular architecture has been proposed by Jacobs et al. (1991) based on a stochastic model (see also Jordan & Jacobs, 1992). In this architecture, the final output, y, is given by \n\ny = Σ_i g_i y_i   (3) \n\nwhere y_i denotes the output of the i-th I-Net, and g_i is given by the softmax function \n\ng_i = exp[s_i] / Σ_j exp[s_j]   (4) \n\nwhere s_i is the weighted sum arriving at the i-th output unit of the C-Net. \nFor the C-Net, we use a three-layer perceptron, since a simple two-layer perceptron did not provide good performance for the objects used in our simulations (see Section 3). The results suggest that classification of such objects is not a linearly separable problem. 
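The model I machinery described above — a five-layer bottleneck I-Net together with the softmax selection of equations (1) and (2) — can be sketched numerically. This is a minimal numpy illustration, not the authors' implementation; the layer sizes follow the simulations of Section 3, and the function names (`init_inet`, `forward`, `model1_classifier`, `model1_objective`) are our own.

```python
import numpy as np

def init_inet(n_in, n_hidden=20, n_code=2, rng=None):
    """Random weights for a 5-layer auto-associative I-Net:
    n_in -> n_hidden -> n_code -> n_hidden -> n_in."""
    if rng is None:
        rng = np.random.default_rng(0)
    dims = [n_in, n_hidden, n_code, n_hidden, n_in]
    return [(rng.normal(0.0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """Forward pass: tanh on hidden layers, linear output.
    Returns the reconstruction and the low-dimensional bottleneck code."""
    h, code = x, None
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:   # nonlinearity on all but the output layer
            h = np.tanh(h)
        if i == 1:                # third (bottleneck) layer
            code = h
    return h, code

def model1_classifier(y_star, y_nets):
    """Eq. (1): softmax over negative squared reconstruction errors."""
    errs = np.array([np.sum((y_star - y) ** 2) for y in y_nets])
    e = np.exp(-errs)
    return e / e.sum()

def model1_objective(y_star, y_nets, alpha=50.0):
    """Eq. (2): ln( sum_i exp(-alpha*e_i) / sum_i exp(-e_i) ), alpha > 1."""
    errs = np.array([np.sum((y_star - y) ** 2) for y in y_nets])
    return np.log(np.sum(np.exp(-alpha * errs)) / np.sum(np.exp(-errs)))
```

When one I-Net reconstructs a view well and the others do not, its classifier unit approaches one, and maximizing (2) rewards exactly that configuration.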
Instead of using an MLP (multi-layer perceptron), we could use other types of networks for the C-Net, such as RBF (radial basis function) networks (Poggio & Edelman, 1990). We maximize the objective function \n\nln Σ_i g_i σ^{-1} exp[-||y* - y_i||^2 / (2σ^2)]   (5) \n\nwhere σ^2 is the variance. This function forces the C-Net to select only one I-Net and, at the same time, the selected I-Net to encode and decode the input information. \nNote that the model I can be interpreted as a modified version of the model II, since maximizing (2) is essentially equivalent to maximizing (5) if we replace s_i of the C-Net in (4) with the negative squared difference between the input and the output of the i-th I-Net, i.e., s_i = -||y* - y_i||^2. Although the model I is a more direct classification method that exploits auto-associative networks, it is interesting to examine what information can be extracted from the input for classification in the model II (see Section 3.2). \n\n3 SIMULATIONS \n\nWe implemented the network models described in the previous section to evaluate their performance. The 3D objects that we used for our simulations are 5-segment wire-frame objects whose six vertices are randomly selected in a unit cube, as shown in Fig. 2 (a) (see also Poggio & Edelman, 1990). Various views of the objects are obtained by orthographically projecting the objects onto an image plane whose position covers a sphere around the object (see Fig. 2 (b)). The view position is defined by two parameters, θ and φ. In the simulations, we used the x, y image coordinates of the six vertices of three wire-frame objects as the inputs to the network. \nThe models contain three I-Nets, whose number is set equal to the number of the objects. The number of units in the third layer of the five-layer I-Nets is set equal to the number of view parameters, which is two in our simulations. We used twenty units in the second and fourth layers. 
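The model II machinery of equations (3)–(5) can likewise be sketched. This is a hedged numpy illustration of those three equations, not the authors' code; the function names (`gates`, `mixture_output`, `model2_objective`) are our own.

```python
import numpy as np

def gates(s):
    """Eq. (4): softmax over the C-Net output sums s_i."""
    e = np.exp(s - np.max(s))   # shift for numerical stability
    return e / e.sum()

def mixture_output(g, y_nets):
    """Eq. (3): final output y = sum_i g_i * y_i."""
    return np.sum(g[:, None] * np.asarray(y_nets, dtype=float), axis=0)

def model2_objective(y_star, y_nets, s, sigma=0.1):
    """Eq. (5): ln sum_i g_i * sigma^-1 * exp(-||y* - y_i||^2 / (2 sigma^2))."""
    g = gates(s)
    errs = np.array([np.sum((y_star - y) ** 2) for y in y_nets])
    return np.log(np.sum(g * np.exp(-errs / (2.0 * sigma ** 2)) / sigma))
```

Setting s_i = -||y* - y_i||^2 in `gates` recovers model I-style selection, mirroring the equivalence between (2) and (5) noted in Section 2.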
To train the network efficiently, we initially limited the ranges of θ and φ to π/8 and π/4, and gradually increased the range until it covered the whole sphere. During training, objects were randomly selected among the three and their views were randomly selected within the view range. The steepest ascent method was used for maximizing the objective functions (2) and (5) in our simulations, but more efficient methods, such as the conjugate gradient method, can also be used. \n\nFigure 2: (a) 3D wire-frame objects. (b) Viewpoint defined by two parameters, θ and φ. \n\n3.1 SIMULATIONS USING THE MODEL I \n\nThis section describes the simulation results using the model I. As described in Section 2, the classifier of this model selects the I-Net that produces the minimum error between the output and the input. We test the classification performance of the model and examine the internal representations of the I-Nets after training the networks. The constant α in the objective function (2) was set to 50 during training. \nFig. 3 shows the output of the classifier plotted over the view directions when the views of an object are used as the inputs. The output value of one unit is almost equal to one over the entire range of the view direction, and the outputs of the other two units are nearly zero. This indicates that the network effectively classifies a given view into an object regardless of the view direction. We obtained satisfactory classification results when more than five units were used in the second and fourth layers of the I-Nets. \nFig. 4 shows examples of the input views of an object and the views recovered by the corresponding I-Net. The recovered views are strikingly similar to the input views, indicating that each auto-associative I-Net can successfully compress and recover the views of an object. 
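The wire-frame stimuli and viewing geometry described above can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the paper does not specify the exact rotation convention for (θ, φ), so the z-then-y rotation order and the name `project_view` are our choices.

```python
import numpy as np

def project_view(vertices, theta, phi):
    """Orthographic 2D view of 3D vertices from direction (theta, phi):
    rotate by phi about z, then by theta about y, and keep x, y
    (projection along the rotated depth axis)."""
    Rz = np.array([[np.cos(phi), -np.sin(phi), 0.0],
                   [np.sin(phi),  np.cos(phi), 0.0],
                   [0.0, 0.0, 1.0]])
    Ry = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(theta), 0.0, np.cos(theta)]])
    return (vertices @ (Ry @ Rz).T)[:, :2]

rng = np.random.default_rng(0)
obj = rng.uniform(0.0, 1.0, size=(6, 3))              # six vertices in a unit cube
inp = project_view(obj, np.pi / 8, np.pi / 4).ravel() # 12-dim network input
```

Flattening the six projected (x, y) pairs yields the 12-dimensional input vector used throughout the simulations.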
In fact, as shown in Fig. 5, the squared error between the input and the output of an I-Net is nearly zero for only one of the objects. This indicates that each I-Net can be used for identifying an object over almost the entire view range. \n\nFigure 3: Outputs of the classifier in the model I. The output value of the second unit is almost equal to one over the full view range, and the outputs of the other two units are nearly zero for one of the 3D objects. \n\nFigure 4: Examples of the input and recovered views of an object. The recovered views are strikingly similar to the input views. \n\nWe further analyzed what information is encoded in the third layer of the I-Nets. Fig. 6 (a) illustrates the outputs of the third layer units plotted as a function of the view direction (θ, φ) of an object. Fig. 6 (b) shows the view direction (θ, φ) plotted as a function of the outputs of the third layer units. Both figures exhibit single-valued functions, i.e., the view direction of the object uniquely determines the outputs of the hidden units, and at the same time the outputs of the hidden units uniquely determine the view direction. Thus, each I-Net encodes a given view of an object into a representation that has a one-to-one correspondence with the view direction. This result is expected from the condition that the dimension of the third layer is set equal to the degree of freedom of a rigid object. \n\nFigure 5: Error between the input view and the recovered view of an I-Net for each object. The figures show that the I-Net recovers only the views of Object 3. \n\nFigure 6: (a) Outputs of the third layer units of an I-Net plotted over the view direction (θ, φ) of an object. 
(b) The view direction plotted over the outputs of the third layer units. Figure (b) was obtained by inversely replotting Figure (a). \n\n3.2 SIMULATIONS USING THE MODEL II \n\nIn this section, we show the simulation results using the model II. The C-Net in this model learns to classify the views by splitting the input space nonlinearly. We examine the internal representations of the C-Net that lead to view-invariant classification in its output. \n\nIn the simulations, we used the same 3 wire-frame objects as in the previous simulations. The C-Net contains 20 units in the hidden layer. The parameter σ in the objective function (5) was set to 0.1. Fig. 7 (a) illustrates the values of an output unit in the C-Net for an object. As in the case of the model I, the model correctly classified the views into their original object over almost the entire view range. Fig. 7 (b) illustrates the outputs of two of the hidden units as examples, showing that each hidden unit has a limited view range where its output is nearly one. The C-Net thus combines these partially invariant representations in the hidden layer to achieve full view invariance at the output layer. \nTo examine the generalization ability of the model, we limited the view range during the training period and tested the network using images from the full view range. Fig. 8 (a) and (b) show the values of an output unit of the C-Net and the error of the corresponding I-Net plotted over the entire view range. The region surrounded by a rectangle indicates the range of view directions where the training was done. The figures show that the correct classification and the small recovery error are not restricted to the training range but spread beyond it, suggesting that the network exhibits a satisfactory capability of generalization. We obtained similar generalization results for the model I as well. 
We also trained the networks with a sparse set of views rather than randomly selected views. The results show that classification is nearly perfect regardless of the viewpoint if we use at least 16 training views evenly spaced within the full view range. \n\nFigure 7: (a) Output values of an output unit of the C-Net when the views of an object are given (cf. Fig. 3). (b) Output values of two hidden units of the C-Net for the same object. \n\nFigure 8: (a) Output values of an output unit of the C-Net. (b) Errors between the input views and the recovered views of the corresponding I-Net. The region surrounded by a rectangle indicates the view range where the training was done. \n\n4 CONCLUSIONS \n\nWe have presented an unsupervised classification scheme that classifies 3D objects from their 2D views. The scheme consists of a mixture of non-linear auto-associative networks, each of which identifies an object by encoding an input view into a representation that indicates its view direction. The simulations using 3D wire-frame objects demonstrated that the scheme effectively clusters the given views into their original objects with no explicit identification of the object classes being provided to the networks. We presented two models that utilize different classification mechanisms. In particular, the model I employs a novel classification and learning strategy that forces only one network to reconstruct the input view, whereas the model II is based on a conventional modular architecture which requires training of a separate classification network. 
Although we assumed in the simulations that feature points are already identified in each view and that their correspondence between the views is also established, the scheme does not, in principle, require the identification and correspondence of features, because the scheme is based solely on the existence of non-linear mappings between a set of images of an object and its degrees of freedom. Therefore, we are currently investigating how the proposed scheme can be used to classify real gray-level images of 3D objects. \n\nAcknowledgments \n\nWe would like to thank Mitsuo Kawato for extensive discussions and continuous encouragement, and Hiroaki Gomi and Yasuharu Koike for helpful comments. We are also grateful to Tommy Poggio for insightful discussions. \n\nReferences \n\nDeMers, D. and Cottrell, G. (1993). Non-linear dimensionality reduction. In Hanson, S. J., Cowan, J. D. & Giles, C. L. (eds), Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. 580-587. \n\nDuda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons, NY. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J. and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3, 79-87. \n\nJordan, M. I. and Jacobs, R. A. (1992). Hierarchies of adaptive experts. In Moody, J. E., Hanson, S. J. & Lippmann, R. P. (eds), Advances in Neural Information Processing Systems 4. Morgan Kaufmann Publishers, San Mateo, CA. 985-992. \n\nOja, E. (1991). Data compression, feature extraction, and autoassociation in feedforward neural networks. In Kohonen, T. et al. (eds), Artificial Neural Networks. Elsevier Science Publishers B.V., North-Holland. \n\nPoggio, T. and Edelman, S. (1990). A network that learns to recognize three-dimensional objects. Nature, 343, 263. \n\nWeinshall, D., Edelman, S. and Bülthoff, H. H. (1990). A self-organizing multiple-view representation of 3D objects. 
In Touretzky, D. S. (ed), Advances in Neural Information Processing Systems 2. Morgan Kaufmann Publishers, San Mateo, CA. 274-281. \n\nWilliams, C. K. I., Zemel, R. S. and Mozer, M. C. (1993). Unsupervised learning of object models. AAAI Fall 1993 Symposium on Machine Learning in Computer Vision. \n", "award": [], "sourceid": 910, "authors": [{"given_name": "Satoshi", "family_name": "Suzuki", "institution": null}, {"given_name": "Hiroshi", "family_name": "Ando", "institution": null}]}