{"title": "A Constructive Learning Algorithm for Discriminant Tangent Models", "book": "Advances in Neural Information Processing Systems", "page_first": 786, "page_last": 792, "abstract": null, "full_text": "A Constructive Learning Algorithm for \n\nDiscriminant Tangent Models \n\nDiego Sona Alessandro Sperduti Antonina Starita \n\nDipartimento di Informatica, U niversita di Pisa \n\nCorso Italia, 40, 56125 Pisa, Italy \n\nemail: {sona.perso.starita}di.unipi.it \n\nAbstract \n\n(HSS) developed an algo(cid:173)\n\nTo reduce the computational complexity of classification systems \nusing tangent distance, Hastie et al. \nrithm to devise rich models for representing large subsets of the \ndata which computes automatically the \"best\" \nassociated tan(cid:173)\ngent subspace. Schwenk & Milgram proposed a discriminant mod(cid:173)\nular classification system (Diabolo) based on several autoassociative \nmultilayer perceptrons which use tangent distance as error recon(cid:173)\nstruction measure. \nWe propose a gradient based constructive learning algorithm for \nbuilding a tangent subspace model with discriminant capabilities \nwhich combines several of the the advantages of both HSS and \nDiabolo: devised tangent models hold discriminant capabilities, \nspace requirements are improved with respect to HSS since our \nalgorithm is discriminant and thus it needs fewer prototype models, \ndimension of the tangent subspace is determined automatically by \nthe constructive algorithm, and our algorithm is able to learn new \ntransformations. \n\n1 \n\nIntroduction \n\nTangent distance is a well known technique used for transformation invariant pat(cid:173)\ntern recognition. State-of-the-art accuracy can be achieved on an isolated hand(cid:173)\nwritten character task using tangent distance as the classification metric within a \nnearest neighbor algorithm [SCD93]. 
However, this approach has a quite high computational complexity, owing to the inefficient search and the large number of Euclidean and tangent distances that need to be calculated. Several researchers have shown how this time complexity can be reduced [Sim94, SS95], at the cost of increased space complexity.

A different approach to the problem was taken by Hastie et al. [HSS95] and Schwenk & Milgram [SM95b, SM95a]. Both used learning algorithms to reduce classification time and space requirements while trying to preserve the same accuracy. Hastie et al. [HSS95] developed rich models for representing large subsets of the prototypes. These models are learned from a training set through a Singular Value Decomposition based algorithm which minimizes the average 2-sided tangent distance from a subset of the training images. A nice feature of this algorithm is that it automatically computes the \"best\" tangent subspace associated with the prototypes. Schwenk & Milgram [SM95b] proposed a modular classification system (Diabolo) based on several autoassociative multilayer perceptrons which use tangent distance as the error reconstruction measure. This original model was later improved by adding discriminant capabilities to the system [SM95a].

Comparing the Hastie et al. algorithm (HSS) with the discriminant version of Diabolo, we observe that: Diabolo seems to require less memory than HSS, but learning is faster in HSS; Diabolo is discriminant while HSS is not; the number of hidden units in Diabolo's autoassociators must be decided heuristically through a trial and error procedure, while the dimension of the tangent subspaces in HSS can be controlled more easily; Diabolo uses predefined transformations, while HSS is able to learn new transformations (such as style transformations).
\n\nIn this paper, we introduce the tangent distance neuron (TO-neuron), which imple(cid:173)\nments the I-sided version of the tangent distance, and we devise a gradient based \nconstructive learning algorithm for building a tangent subspace model with dis(cid:173)\ncriminant capabilities. In this way, we are able to combine the advantages of both \nHSS and Diabolo: the model holds discriminant capabilities, learning is just a bit \nslower than HSS, space requirements are improved with respect to HSS since the \nTO-neuron is discriminant and thus it needs fewer prototype models, the dimension \nof the tangent subspace is determined automatically by the constructive algorithm, \nand TO-neuron is able to learn new transformations. \n\n2 Tangent Distance \n\nIn several pattern recognition problems Euclidean distance fails to give a satis(cid:173)\nfactory solution since it is unable to account for invariant transformations of the \npatterns. Simard et al. [SCD93) suggested dealing with this problem by generating \na parameterized 7-dimensional manifold for each image, where each parameter ac(cid:173)\ncounts for one such invariance. The underlying idea consists in approximating the \nconsidered transformations locally through a linear model. \n\nFor the sake of exposition, consider rotation. Given a digitalized image Xi of a \npattern i, the rotation operation can be approximated by Xi(O) = Xi + Tx,O, \nwhere 0 is the rotation angle, and T x, is the tangent vector to the rotation curve \ngenerated by the rotation operator for Xi. The tangent vector T x, can easily be \ncomputed by finite difference. Now, instead of measuring the distance between two \nimages as D( Xi, X j) = IIX i-X j II for any norm 11\u00b711. Simard et al. proposed using \nthe tangent distance DT(Xi,Xj ) = min9.,9, IIXi(Oi) - Xj(Oj)lI. \nIf k types of transformations are considered, there will be k different tangent vectors \nper pattern. If II . 
II is the Euclidean norm, computing the tangent distance is a \nsimple least-squares problem. A solution for this problem l can be found in Simard \net al. \n[SCD93], where the authors used DT to drive a I-NN classification rule. \n\n1 A special case of \n\nthe one sided \n\ntangent distance \n\nD~-\u00b7;ded(x\" X J ) = mine; IIX,((J,) - Xjll, can be computed more efficiently [SS95]. \n\ntangent distance, \n\ni.e., \n\n\f788 \n\nD. Sona, A. Sperduti and A. Starita \n\nFigure 1: Geometric interpretation of equation 1. Note that net = (D~-8ided ):l. \n\nUnfortunately, 1-NN is expensive. To reduce the complexity ofthe above approach, \nHastie et al. \n[HSS95] proposed an algorithm for the generation of rich models \nrepresenting large subsets of patterns. This algorithm computes for each class a \nprototype (the centroid), and an associated subspace (described by the tangent \nvectors), such that the total tangent distance of the centroid with respect to the \nprototypes in the training set is minimised. Note that the associated subspace is \nnot predefined as in the case of standard tangent distance, but is computed on the \nbasis of the training set . \n\n3 Tangent Distance Neuron \n\nIn this section we define the Tangent ~istance neuron (TO-neuron), which is the \ncomputational model studied in this paper. A TO-neuron is characterized by a set \nof n + 1 vectors, of the same dimension as the input vectors (in our case, images) . \nOne of these vectors, W is used as reference vector (centroid), while the remaining \nvectors, Ti (i = 1, \u2022\u2022\u2022 , n), are used as tangent vectors. Moreover, the set of tangent \nvectors constitutes an ortho-normal basis. \n\nGiven an input vector I the input net of the TO-neuron is computed as the square \nof the I-sided tangent distance between I and the tangent model {W, T 1 , \u2022 \u2022 \u2022 , Tn} \n(see Figure 1) \n\nn \n\nn \n\nwhere we have used the fact that the tangent vectors constitute an ortho-normal \nbasis. 
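The computation of net can be sketched as follows (a minimal NumPy sketch; the function name and the layout with tangent vectors as rows of T are illustrative choices of ours, not details fixed by the paper):

```python
import numpy as np

def td_net(I, W, T):
    # Squared 1-sided tangent distance between input I and the tangent
    # model {W, T_1, ..., T_n}; the rows of T are assumed to form an
    # ortho-normal basis of the tangent subspace.
    d = I - W            # difference between input and centroid
    gamma = T @ d        # projections of d onto the tangent vectors
    return float(d @ d - gamma @ gamma)
```

With an empty tangent basis (n = 0) this reduces to the squared Euclidean distance to the centroid, which is exactly the starting configuration used by the constructive algorithm of Section 5.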
For the sake of notation, d denotes the difference between the input pattern and the centroid, and the projection of d onto the i-th tangent vector is denoted by γ_i. Note that, by definition, net is non-negative.

The output o of the TD-neuron is then computed by transforming the net through a nonlinear monotone function f. In our experiments, we have used the following function

o = f(α, net) = 1 / (1 + α net)     (2)

where α controls the steepness of the function. Note that o is always positive and lies within the range (0, 1], since net is non-negative.

4 Learning

The TD-neuron can be trained to discriminate between patterns belonging to two different classes through a gradient descent technique. Thus, given a training set {(I_1, t_1), ..., (I_N, t_N)}, where t_i ∈ {0, 1} is the i-th desired output and N is the total number of patterns in the training set, we can define the error function as

E = (1/2) Σ_{k=1}^{N} (t_k - o_k)^2     (3)

where o_k is the output of the TD-neuron for the k-th input pattern.

Using equations (1-2), it is trivial to compute the changes for α, the centroid W and the tangent vectors T_i:

Δα = -η_α (∂E/∂α) = -η_α Σ_{k=1}^{N} net_k (t_k - o_k) o_k^2     (4)

ΔW = -η (∂E/∂W) = 2αη Σ_{k=1}^{N} (t_k - o_k) o_k^2 (d_k - Σ_{i=1}^{n} γ_{ik} T_i)     (5)

ΔT_i = -η (∂E/∂T_i) = 2αη Σ_{k=1}^{N} (t_k - o_k) o_k^2 γ_{ik} d_k     (6)

where η and η_α are learning parameters.

The learning algorithm initializes the centroid W to the average of the patterns with target 1, i.e., W = (1/N_1) Σ_{k=1}^{N_1} I_k, where N_1 is the number of patterns with target equal to 1, and the tangent vectors to random vectors with small modulus. Then α, the centroid W and the tangent vectors T_i are changed according to equations (4-6). Moreover, since the tangent vectors must constitute an ortho-normal basis, after each epoch of training the vectors T_i are ortho-normalized.
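One epoch of this learning procedure can be sketched in NumPy as follows; the batchwise accumulation of the updates, the learning-rate values, and the use of a QR factorization for the ortho-normalization step are illustrative choices of ours, not details fixed by the paper:

```python
import numpy as np

def train_epoch(X, t, W, T, alpha, eta=0.05, eta_alpha=0.001):
    # One epoch of gradient descent for the TD-neuron.
    # X: (N, m) input patterns; t: (N,) targets in {0, 1};
    # W: (m,) centroid; T: (n, m) tangent vectors (rows); alpha > 0.
    dW = np.zeros_like(W)
    dT = np.zeros_like(T)
    d_alpha = 0.0
    for I_k, t_k in zip(X, t):
        d = I_k - W
        gamma = T @ d                          # projections gamma_i
        net = d @ d - gamma @ gamma            # equation (1)
        o = 1.0 / (1.0 + alpha * net)          # equation (2)
        e = (t_k - o) * o * o                  # common error factor
        d_alpha -= eta_alpha * e * net                  # equation (4)
        dW += 2 * alpha * eta * e * (d - T.T @ gamma)   # equation (5)
        dT += 2 * alpha * eta * e * np.outer(gamma, d)  # equation (6)
    W, T, alpha = W + dW, T + dT, alpha + d_alpha
    if T.shape[0] > 0:
        Q, _ = np.linalg.qr(T.T)   # re-orthonormalize the tangent
        T = Q.T                    # vectors after each epoch
    return W, T, alpha
```

The QR step plays the role of the Gram-Schmidt ortho-normalization: it keeps the rows of T an ortho-normal basis of the current tangent subspace without changing the subspace itself.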
\n\n5 The Constructive Algorithm \n\nBefore training the TD-neuron using equations (4-6), we have to set the tangent \nsubspace dimension. The same problem is present in HSS and Diabolo (i.e., number \nof hidden units). To solve this problem we have developed a constructive algorithm \nwhich adds tangent vectors one by one according to the computational needs. \n\nThe key idea is based on the observation that a typical run of the learning algorithm \ndescribed in Section 4 leads to the sequential convergence of the vectors according to \ntheir relative importance. This means that the tangent vectors all remain random \nvectors while the centroid converges first. \n\nThen one of the tangent vectors converges to the most relevant transformation \n(while the remaining tangent vectors are still immature), and so on till all the \ntangent vectors converge, one by one, to less and less relevant transformations . \n\nThis behavior suggests starting the training using only the centroid (i .e., without \ntangent vectors) and allow it to converge. Then, as in other constructive algorithms, \nthe centroid is frozen and one random tangent vector T 1 is added. Learning is \nresumed till changes in T 1 become irrelevant. During learning, however, T, is \nnormalized after each epoch. At convergence, T 1 is frozen, a new random tangent \nvector T 2 is added, and learning is resumed. New tangent vectors are iteratively \nadded till changes in the classification accuracy becomes irrelevant. \n\n\f790 \n\nD. Sona, A. Sperduti and A. Starita \n\n# Tang. \n\n0 \n1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n\nHSS \n\n% Cor % Err \n-\n78 . 74 \n79.10 \n79 .94 \n81.47 \n76 . 87 \n71 .29 \n-\n\n-\n21.26 \n20.90 \n20.06 \n18.53 \n23.13 \n28 . 71 \n-\n\nTD-neuron \n\n% Cor % Rej % Err \n73 . 78 \n18.98 \n72 . 06 \n17.46 \n13.96 \n77 . 99 \n11.69 \n81.14 \n10 .48 \n82 .68 \n10.12 \n84 .25 \n85. 21 \n9 .65 \n9 .08 \n86.16 \n86.37 \n8 . 74 \n\n7.24 \n10 .\u2022 8 \n8 .05 \n7.17 \n6 .8. 
\n5.63 \n5 .14 \n4.76 \n4 .89 \n\nTable 1; The results obtained by the HSS algorithm and the TO-neuron. \n\n6 Results \n\nWe have tested our constructive algorithm versus the HSS algorithm (which uses \nthe 2-sided tangent distance) on 10587 binary digits from the NIST-3 dataset . The \nbinary I28xI28 digits were transformed into a 64-grey level I6xI6 format by a \nsimple local counting procedure. No other pre-processing transformation \nwas performed. The training set consisted of 3000 randomly chosen digits, while \nthe remaining digits where used in the test set. A single tangent model for each \nclass of digit was computed using both algorithms. The classification of the test \ndigits was performed using the label of the closest model for HSS and the output \nof the TO-neurons for our system . The TO-neurons used a rejection criterion with \nparameters adapted during training. \n\nIn Table 1 we have reported the performances on the test set of both HSS and our \nsystem. Oifferent numbers of tangent vectors were tested for both of them. From the \nresults it is clear that the models generated by HSS reach a peak in performance with \n4 tangent vectors and then a sharp degradation of the generalization is observed \nby adding more tangent vectors. On the contrary, the TO-neurons are able to \nsteadly increase the performance with an increasing number of tangent vectors. \nThe improvement in the performance, however, seems to saturate when using many \ntangent vectors. Table 2 presents the confusion matrix obtained by the TO-neurons \nwith 8 tangent vectors. \n\nFor comparison, we display some of the tangent models computed by HSS and \nby our algorithm in Figure 2. Note how tangent models developed by the HSS \nalgorithm tend to be more blurred than the ones developed by our algorithm. 
This is due to the lack of discriminant capabilities of the HSS algorithm, and it is the main cause of the degradation in performance observed when using more than 4 tangent vectors.

Figure 2: The tangent models obtained for digits '1' and '3' by the HSS algorithm (rows 1 and 3, respectively) and by our TD-neuron (rows 2 and 4, respectively). The centroids are shown in the first column.

It must be pointed out that, for a fixed number of tangent vectors, the HSS algorithm is faster than ours, because it needs only a fraction of the training examples (those of a single class). However, our algorithm is remarkably more efficient when a family of tangent models with an increasing number of tangent vectors must be generated(2). Moreover, since a TD-neuron uses the one-sided tangent distance, it is faster in computing the output.

(2) The tangent model computed by HSS depends on the number of tangent vectors.

7 Conclusion

We introduced the tangent distance neuron (TD-neuron), which implements the 1-sided version of the tangent distance, and gave a constructive learning algorithm for building a tangent subspace with discriminant capabilities. As stated in the introduction, there are many advantages in using the proposed computational model with respect to other techniques such as HSS and Diabolo. Specifically, we believe that the proposed approach is particularly useful in those applications where it is very important to have a classification system which is both discriminant and semantically transparent, in the sense that it is very easy to understand how it works. One such application is the classification of ancient book scripts. In fact, the description, the comparison, and the classification of forms are the main tasks of paleographers.
Until now, however, these tasks have generally been performed without the aid of a universally accepted and quantitatively based method or technique. Consequently, it is very often impossible to reach a definitive date attribution of a document to within 50 years. In this field, it is very important to have a system which is both discriminant and explanatory, so that paleographers can learn from it which are the relevant features of the script of a given epoch. These requirements rule out systems like Diabolo, which is not easily interpretable, and also the tangent models developed by HSS, which are not discriminant. In Figure 3 we report some preliminary results we obtained within this field.

Perhaps most importantly, our work suggests a number of research avenues. We used just a single TD-neuron; presumably, having several TD-neurons arranged as an adaptive pre-processing layer within a standard feed-forward neural network could yield a remarkable increase in the transformation invariant features of the network.

Table 2: The confusion matrix for the TD-neurons with 8 tangent vectors. Overall performance on the test set: 86.37% correct, 4.89% rejected, 8.74% errors.
", "award": [], "sourceid": 1234, "authors": [{"given_name": "Diego", "family_name": "Sona", "institution": null}, {"given_name": "Alessandro", "family_name": "Sperduti", "institution": null}, {"given_name": "Antonina", "family_name": "Starita", "institution": null}]}