{"title": "The Canonical Distortion Measure in Feature Space and 1-NN Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 245, "page_last": 251, "abstract": null, "full_text": "The Canonical Distortion Measure in Feature Space and 1-NN Classification \n\nJonathan Baxter* and Peter Bartlett \nDepartment of Systems Engineering \nAustralian National University \nCanberra 0200, Australia \n{jon,bartlett}@syseng.anu.edu.au \n\nAbstract \n\nWe prove that the Canonical Distortion Measure (CDM) [2, 3] is the optimal distance measure to use for 1-nearest-neighbour (1-NN) classification, and show that it reduces to squared Euclidean distance in feature space for function classes that can be expressed as linear combinations of a fixed set of features. PAC-like bounds are given on the sample complexity required to learn the CDM. An experiment is presented in which a neural network CDM was learnt for a Japanese OCR environment and then used to do 1-NN classification. \n\n1 INTRODUCTION \n\nLet X be an input space, P a distribution on X, F a class of functions mapping X into Y (called the \"environment\"), Q a distribution on F, and σ a function σ: Y × Y → [0, M]. The Canonical Distortion Measure (CDM) between two inputs x, x' is defined to be: \n\nρ(x, x') = ∫_F σ(f(x), f(x')) dQ(f).   (1) \n\nThroughout this paper we will be considering real-valued functions and squared loss, so Y = ℝ and σ(y, y') := (y − y')². The CDM was introduced in [2, 3], where it was analysed primarily from a vector quantization perspective. In particular, the CDM was proved to be the optimal distortion measure to use in vector quantization, in the sense of producing the best approximations to the functions in the environment F. In [3] some experimental results were also presented (in a toy domain) showing how the CDM may be learnt. 
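As a concrete illustration of definition (1) with squared loss, the integral over Q can be approximated by Monte Carlo averaging over sampled functions. The sketch below uses an invented toy quadratic environment (the function class, weight distribution and sample size are illustrative choices, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_f(rng):
    """Draw one function from a toy environment Q: f(x) = w1*x + w2*x^2."""
    w = rng.normal(size=2)
    return lambda x: w[0] * x + w[1] * x ** 2

def cdm(x, x_prime, m=10_000):
    """Monte Carlo estimate of rho(x, x') = E_f [(f(x) - f(x'))^2] over Q."""
    fs = [sample_f(rng) for _ in range(m)]
    return float(np.mean([(f(x) - f(x_prime)) ** 2 for f in fs]))

# Inputs that no f in the environment can distinguish get CDM zero; for this
# toy Q a short calculation gives rho(0, 1) = E[(w1 + w2)^2] = 2.
assert cdm(0.5, 0.5) == 0.0
assert abs(cdm(0.0, 1.0) - 2.0) < 0.3
```

Section 4 replaces this direct averaging, which requires fresh function samples per query, with a learnt approximation g trained once from sampled data.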
\n\nThe purpose of this paper is to investigate the utility of the CDM as a classification tool. In Section 2 we show how the CDM for a class of functions possessing a common feature set reduces, via a change of variables, to squared Euclidean distance in feature space. A lemma is then given showing that the CDM is the optimal distance measure to use for 1-nearest-neighbour (1-NN) classification. Thus, for functions possessing a common feature set, optimal 1-NN classification is achieved by using squared Euclidean distance in feature space. \n\nIn general the CDM will be unknown, so in Section 4 we present a technique for learning the CDM by minimizing squared loss, and give PAC-like bounds on the sample size required for good generalisation. In Section 5 we present some experimental results in which a set of features was learnt for a machine-printed Japanese OCR environment, and then squared Euclidean distance was used to do 1-NN classification in feature space. The experiments provide strong empirical support for the theoretical results in a difficult real-world application. \n\n*The first author was supported in part by EPSRC grants #K70366 and #K70373. \n\n2 THE CDM IN FEATURE SPACE \n\nSuppose each f ∈ F can be expressed as a linear combination of a fixed set of features Φ := (φ_1, …, φ_k). That is, for all f ∈ F, there exists w := (w_1, …, w_k) such that f = w · Φ = Σ_{i=1}^k w_i φ_i. In this case the distribution Q over the environment F is a distribution over the weight vectors w. Measuring the distance between function values by σ(y, y') := (y − y')², the CDM (1) becomes: \n\nρ(x, x') = ∫_{ℝ^k} [w · Φ(x) − w · Φ(x')]² dQ(w) = (Φ(x) − Φ(x'))W(Φ(x) − Φ(x'))',   (2) \n\nwhere W = ∫ w'w dQ(w) is a k × k matrix. Making the change of variable Φ → Φ√W, we have ρ(x, x') = ‖Φ(x) − Φ(x')‖². 
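The identity in (2) and the whitening change of variable can be checked numerically. The sketch below uses an invented three-dimensional feature map and second-moment matrix W; none of these particular choices come from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 3

def phi(x):
    """A hypothetical fixed feature map: Phi(x) = (x, x^2, sin x)."""
    return np.array([x, x ** 2, np.sin(x)])

# Draw weight vectors as w = zA with z standard normal, so W = E[w'w] = A'A.
A = rng.normal(size=(k, k))
W = A.T @ A

x, xp = 0.3, 1.7
d = phi(x) - phi(xp)

# Monte Carlo estimate of (2): the integral of (w.Phi(x) - w.Phi(x'))^2 dQ(w).
w = rng.normal(size=(200_000, k)) @ A
rho_mc = np.mean((w @ d) ** 2)

# Closed form from (2): (Phi(x) - Phi(x'))' W (Phi(x) - Phi(x')).
rho = d @ W @ d
assert np.isclose(rho_mc, rho, rtol=0.05)

# Change of variable Phi -> Phi sqrt(W): CDM becomes plain squared distance.
L = np.linalg.cholesky(W)               # any square root of W works: W = L L'
psi = lambda x: phi(x) @ L
assert np.isclose(rho, np.sum((psi(x) - psi(xp)) ** 2))
```

Any matrix square root of W serves for the change of variable, since only L L' = W is used when expanding the squared norm.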
Thus, the assumption that the functions in the environment can be expressed as linear combinations of a fixed set of features means that the CDM is simply squared Euclidean distance in a feature space related to the original by a linear transformation. \n\n3 1-NN CLASSIFICATION AND THE CDM \n\nSuppose the environment F consists of classifiers, i.e. {0, 1}-valued functions. Let f be some function in F and z := (x_1, f(x_1)), …, (x_n, f(x_n)) a training set of examples of f. In 1-NN classification the classification of a novel x is computed as f(x*), where x* = argmin_{x_i} d(x, x_i); i.e., the classification of x is the classification of the nearest training point to x under some distance measure d. If both f and x are chosen at random, the expected misclassification error of the 1-NN scheme using d and the training points x := (x_1, …, x_n) is \n\ner(x, d) := E_F E_X [f(x) − f(x*)]²,   (3) \n\nwhere x* is the nearest neighbour to x from {x_1, …, x_n}. The following lemma is now immediate from the definitions. \n\nLemma 1. For all sequences x = (x_1, …, x_n), er(x, d) is minimized if d is the CDM ρ. \n\nRemarks. Lemma 1 combined with the results of the last section shows that for function classes possessing a common feature set, optimal 1-NN classification is achieved by using squared Euclidean distance in feature space. In Section 5 some experimental results on Japanese OCR are presented supporting this conclusion. \n\nThe property of optimality of the CDM for 1-NN classification may not be stable to small perturbations. That is, if we learn an approximation g to ρ, then even if E_{X×X} (g(x, x') − ρ(x, x'))² is small, it may not be the case that the 1-NN classification error using g is also small. 
\nHowever, one can show that stability is maintained for classifier environments in which positive examples of different functions do not overlap significantly (as is the case for the Japanese OCR environment of Section 5, face recognition environments, speech recognition environments and so on). We are currently investigating the general conditions under which stability is maintained. \n\n4 LEARNING THE CDM \n\nFor most environments encountered in practice (e.g. speech recognition or image recognition), ρ will be unknown. In this section it is shown how ρ may be estimated or learnt using function approximation techniques (e.g. feedforward neural networks). \n\n4.1 SAMPLING THE ENVIRONMENT \n\nTo learn the CDM ρ, the learner is provided with a class of functions (e.g. neural networks) G, where each g ∈ G maps X × X → [0, M]. The goal of the learner is to find a g such that the error between g and the CDM ρ is small. For the sake of argument this error will be measured by the expected squared loss: \n\ner_ρ(g) := E_{X×X} [g(x, x') − ρ(x, x')]²,   (4) \n\nwhere the expectation is with respect to P². \n\nOrdinarily the learner would be provided with training data in the form (x, x', ρ(x, x')) and would use this data to minimize an empirical version of (4). However, ρ is unknown, so to generate data of this form ρ must be estimated for each training pair x, x'. Hence to generate training sets for learning the CDM, both the distribution Q over the environment F and the distribution P over the input space X must be sampled. So let f := (f_1, …, f_m) be m i.i.d. samples from F according to Q and let x := (x_1, …, x_n) be n i.i.d. samples from X according to P. For any pair x_i, x_j an estimate of ρ(x_i, x_j) is given by \n\nρ̂(x_i, x_j) := (1/m) Σ_{k=1}^m σ(f_k(x_i), f_k(x_j)).   (5) \n\nThis gives n(n − 1)/2 training triples, \n\n{(x_i, x_j, ρ̂(x_i, x_j)), 1 ≤ i < j ≤ n}, \n\nwhich can be used as data to generate an empirical estimate of er_ρ(g): \n\nêr_{x,f}(g) := (2/(n(n − 1))) Σ_{1≤i<j≤n} (g(x_i, x_j) − ρ̂(x_i, x_j))².   (6) \n\nOnly n(n − 1)/2 of the possible n² training triples are used because the functions g ∈ G are assumed to already be symmetric and to satisfy g(x, x) = 0 for all x (if this is not the case then set g'(x, x') := (g(x, x') + g(x', x))/2 if x ≠ x' and g'(x, x) = 0, and use G' := {g' : g ∈ G} instead). \n\nIn [3] an experiment was presented in which G was a neural network class and (6) was minimized directly by gradient descent. In Section 5 we present an alternative technique in which a set of features is first learnt for the environment and then an estimate of ρ in feature space is constructed explicitly. \n\n4.2 UNIFORM CONVERGENCE \n\nWe wish to ensure good generalisation from a g minimizing êr_{x,f}, in the sense that (for small ε, δ) \n\nPr{x, f : sup_{g∈G} |êr_{x,f}(g) − er_ρ(g)| > ε} < δ. \n\nThe following theorem shows that this occurs if both the number of functions m and the number of input samples n are sufficiently large. Some exotic (but nonetheless benign) measurability restrictions have been ignored in the statement of the theorem. In the statement of the theorem, N(ε, G) denotes the size of the smallest ε-cover of G under the L_1(P²) norm, where {g_1, …, g_N} is an ε-cover of G if for all g ∈ G there exists g_i such that ‖g_i − g‖ ≤ ε. \n\nTheorem 2. Assume the range of the functions in the environment F is no more than [−√B/2, √B/2] and in the class G (used to approximate the CDM) is no more than [0, √B]. For all ε > 0 and 0 < δ ≤ 1, if \n\nm > (32B⁴/ε²) log(4/δ)   (7) \n\nand \n\nn ≥ (512B²/ε²) (log N(ε/(48B²), G) + log(512B²/ε²) + log(8/δ)),   (8) \n\nthen \n\nPr{x, f : sup_{g∈G} |êr_{x,f}(g) − er_ρ(g)| > ε} < δ.   (9) \n\nProof. For each g ∈ G, define \n\nêr_x(g) := (2/(n(n − 1))) Σ_{1≤i<j≤n} (g(x_i, x_j) − ρ(x_i, x_j))².   (10) \n\nIf, for any x = (x_1, …, x_n), \n\nPr{f : sup_{g∈G} |êr_{x,f}(g) − êr_x(g)| > ε/2} ≤ δ/2   (11) \n\nand \n\nPr{x : sup_{g∈G} |êr_x(g) − er_ρ(g)| > ε/2} ≤ δ/2,   (12) \n\nthen by the triangle inequality (9) will hold. We treat (11) and (12) separately. \n\nEquation (11). To simplify the notation let g_ij, ρ_ij and ρ̂_ij denote g(x_i, x_j), ρ(x_i, x_j) and ρ̂(x_i, x_j) respectively. The difference |Σ_{1≤i<j≤n} (g_ij − ρ̂_ij)² − Σ_{1≤i<j≤n} (g_ij − ρ_ij)²| is controlled, uniformly over g, by the deviations |ρ̂_ij − ρ_ij|; since each ρ̂_ij is an average of m independent random variables σ(f_k(x_i), f_k(x_j)) ∈ [0, B] with mean ρ_ij, Hoeffding's inequality and the choice of m in (7) give (11). \n\nEquation (12). Splitting the sum defining êr_x(g) into sums over ⌊n/2⌋ independent pairs, applying Hoeffding's inequality to each, and taking a union bound over an (ε/(48B²))-cover of G gives \n\nPr{x : sup_{g∈G} |êr_x(g) − er_ρ(g)| > ε/2} ≤ 4(n − 1) N(ε/(48B²), G) exp(−nε²/(512B⁴)). \n\nSetting n as in the statement of the theorem ensures this is less than δ/2. ∎ \n\nRemark. The bound on m (the number of functions that need to be sampled from the environment) is independent of the complexity of the class G. This should be contrasted with related bias learning (or equivalently, learning to learn) results [1], in which the number of functions does depend on the complexity. The heuristic explanation is that here we are only learning a distance function on the input space (the CDM), whereas in bias learning we are learning an entire hypothesis space that is appropriate for the environment. However, we shall see in the next section how for certain classes of problems the CDM can also be used to learn the functions in the environment. Hence in these cases learning the CDM is a more effective method of learning to learn. \n\n5 EXPERIMENT: JAPANESE OCR \n\nTo verify the optimality of the CDM for 1-NN classification, and also to show how it can be learnt in a non-trivial domain (only a toy example was given in [3]), the CDM was learnt for a Japanese OCR environment. Specifically, there were 3018 functions in the environment F, each one a classifier for a different Kanji character. 
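The sampling scheme of Section 4.1 — estimate ρ̂ by (5) from m sampled functions, then form the n(n − 1)/2 training triples and the empirical loss (6) — can be sketched as follows, using an invented toy environment and illustrative sample sizes:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
m, n = 500, 20

# m i.i.d. functions from a toy Q (f(x) = w1*x + w2*x^2), n i.i.d. inputs from P.
ws = rng.normal(size=(m, 2))
xs = rng.uniform(-1.0, 1.0, size=n)

def rho_hat(xi, xj):
    """Estimate (5): average squared disagreement over the m sampled functions."""
    fi = ws[:, 0] * xi + ws[:, 1] * xi ** 2
    fj = ws[:, 0] * xj + ws[:, 1] * xj ** 2
    return np.mean((fi - fj) ** 2)

# The n(n-1)/2 training triples of Section 4.1.
triples = [(xi, xj, rho_hat(xi, xj)) for xi, xj in combinations(xs, 2)]
assert len(triples) == n * (n - 1) // 2

def er_hat(g):
    """Empirical loss (6) for a candidate distance function g."""
    return 2.0 / (n * (n - 1)) * sum((g(xi, xj) - r) ** 2 for xi, xj, r in triples)

# Under this Q the true CDM is rho(xi, xj) = (xi-xj)^2 + (xi^2-xj^2)^2,
# so it incurs a much smaller empirical loss than, say, g = 0.
good = lambda xi, xj: (xi - xj) ** 2 + (xi ** 2 - xj ** 2) ** 2
assert er_hat(good) < er_hat(lambda xi, xj: 0.0)
```

In the paper, g is instead a trained model (a neural network, or the explicit feature-space construction of Section 5) chosen to drive (6) down over a whole class G.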
A database containing 90,918 segmented, machine-printed Kanji characters scanned from various sources was purchased from the CEDAR group at the State University of New York, Buffalo. The quality of the images ranged from clean to very degraded (see http://www.cedar.buffalo.edu/Databases/JOCR/). \n\nThe main reason for choosing Japanese OCR rather than English OCR as a test-bed was the large number of distinct characters in Japanese. Recall from Theorem 2 that to get good generalisation from a learnt CDM, sufficiently many functions must be sampled from the environment. If the environment just consisted of English characters then it is likely that \"sufficiently many\" characters would mean all characters, and so it would be impossible to test the learnt CDM on novel characters not seen in training. \n\nInstead of learning the CDM directly by minimizing (6), it was learnt implicitly by first learning a set of neural network features for the functions in the environment. The features were learnt using the method outlined in [1], which essentially involves learning a set of classifiers with a common final hidden layer. The features were learnt on 400 out of the 3018 classifiers in the environment, using 90% of the data in training and 10% in testing. Each resulting classifier was a linear combination of the neural network features. The average error of the classifiers was 2.85% on the test set (which is an accurate estimate, as there were 9092 test examples). \n\nRecall from Section 2 that if all f ∈ F can be expressed as f = w · Φ for a fixed feature set Φ, then the CDM reduces to ρ(x, x') = (Φ(x) − Φ(x'))W(Φ(x) − Φ(x'))', where W = ∫ w'w dQ(w). The result of the learning procedure above is a set of features Φ̂ and 400 weight vectors w_1, …, w_400, such that for each of the character classifiers f_i used in training, f_i ≈ w_i · Φ̂. 
Thus, g(x, x') := (Φ̂(x) − Φ̂(x'))Ŵ(Φ̂(x) − Φ̂(x'))' is an empirical estimate of the true CDM, where Ŵ := Σ_{i=1}^{400} w_i'w_i. With a linear change of variable Φ̂ → Φ̂√Ŵ, g becomes g(x, x') = ‖Φ̂(x) − Φ̂(x')‖². This g was used to do 1-NN classification on the test examples in two different experiments. \n\nIn the first experiment, all testing and training examples that were not an example of one of the 400 training characters were lumped into an extra category for the purpose of classification. All test examples were then given the label of their nearest neighbour in the training set under g (i.e., initially all training examples were mapped into feature space to give {Φ̂(x_1), …, Φ̂(x_n)}; then each test example x was mapped into feature space and assigned the same label as argmin_{x_i} ‖Φ̂(x) − Φ̂(x_i)‖²). The total misclassification error was 2.2%, which can be directly compared with the 2.85% misclassification error of the original classifiers. The CDM does better because it uses both the training data explicitly and the information stored in the network to make a comparison, whereas the classifiers only use the information in the network. The learnt CDM was also used to do k-NN classification with k > 1; however, this afforded no improvement. For example, the error of the 3-NN classifier was 2.54% and the error of the 20-NN classifier was 3.99%. This provides an indication that the CDM may not be the optimal distortion measure to use if k-NN classification (k > 1) is the aim. \n\nIn the second experiment g was again used to do 1-NN classification on the test set, but this time all 3018 characters were distinguished. So in this case the learnt CDM was being asked to distinguish between 2618 characters that were treated as a single character when it was being trained. The misclassification error was a surprisingly low 7.5%. 
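The two-stage procedure used in both experiments — map the training examples into feature space once, then label each test example by its nearest neighbour under squared Euclidean distance — can be sketched as follows (the feature map and the tiny data set below are stand-ins for the learnt network features and the Kanji database):

```python
import numpy as np

def phi(x):
    """Stand-in feature map; in the experiment this is the learnt hidden layer."""
    return np.array([x, x ** 2])

def one_nn_labels(train_x, train_y, test_x):
    """1-NN classification using squared Euclidean distance in feature space."""
    train_feats = np.stack([phi(x) for x in train_x])    # map training set once
    labels = []
    for x in test_x:
        d = np.sum((train_feats - phi(x)) ** 2, axis=1)  # ||phi(x) - phi(x_i)||^2
        labels.append(train_y[int(np.argmin(d))])
    return labels

labels = one_nn_labels([-1.0, 0.0, 1.0], ["a", "b", "c"], [0.1, -0.9])
assert labels == ["b", "a"]
```

Mapping the training set into feature space up front means each query costs one feature-map evaluation plus n squared-distance computations, regardless of how the features were learnt.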
The 7.5% error compares favourably with the 4.8% error achieved on the same data by the CEDAR group, using a carefully selected feature set and a hand-tailored nearest-neighbour routine [5]; in our case the distance measure was learnt from raw-data input, and has not been the subject of any optimization or tweaking. \n\nFigure 1: Six Kanji characters (first character in each row) and examples of their four nearest neighbours (remaining four characters in each row). \n\nAs a final, more qualitative assessment, the learnt CDM was used to compute the distance between every pair of testing examples, and then the distance between each pair of characters (an individual character being represented by a number of testing examples) was computed by averaging the distances between their constituent examples. The nearest neighbours of each character were then calculated. With this measure, every character turned out to be its own nearest neighbour, and in many cases the next-nearest neighbours bore a strong subjective similarity to the original. Some representative examples are shown in Figure 1. \n\n6 CONCLUSION \n\nWe have shown that the Canonical Distortion Measure (CDM) is the optimal distortion measure for 1-NN classification, and that for environments in which all the functions can be expressed as linear combinations of a fixed set of features, the Canonical Distortion Measure is squared Euclidean distance in feature space. A technique for learning the CDM was presented, and PAC-like bounds on the sample complexity required for good generalisation were proved. \n\nExperimental results were presented in which the CDM for a Japanese OCR environment was learnt by first learning a common set of features for a subset of the character classifiers in the environment. 
The learnt CDM was then used as a distance measure in 1-NN classification, and performed remarkably well, both on the characters used to train it and on entirely novel characters. \n\nReferences \n\n[1] Jonathan Baxter. Learning Internal Representations. In Proceedings of the Eighth International Conference on Computational Learning Theory, pages 311-320. ACM Press, 1995. \n\n[2] Jonathan Baxter. The Canonical Metric for Vector Quantisation. NeuroColt Technical Report 047, Royal Holloway College, University of London, July 1995. \n\n[3] Jonathan Baxter. The Canonical Distortion Measure for Vector Quantization and Function Approximation. In Proceedings of the Fourteenth International Conference on Machine Learning, July 1997. To appear. \n\n[4] W. S. Lee, P. L. Bartlett, and R. C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. IEEE Transactions on Information Theory, 1997. \n\n[5] S. N. Srihari, T. Hong, and Z. Shi. Cherry Blossom: A System for Reading Unconstrained Handwritten Page Images. In Symposium on Document Image Understanding Technology (SDIUT), 1997. \n", "award": [], "sourceid": 1357, "authors": [{"given_name": "Jonathan", "family_name": "Baxter", "institution": null}, {"given_name": "Peter", "family_name": "Bartlett", "institution": null}]}