{"title": "Margin Analysis of the LVQ Algorithm", "book": "Advances in Neural Information Processing Systems", "page_first": 479, "page_last": 486, "abstract": null, "full_text": "Margin Analysis of the LVQ Algorithm\n\nKoby Crammer\n\nkobics@cs.huji.ac.il\n\nRan Gilad-Bachrach\n\nranb@cs.huji.ac.il\n\nAmir Navot\n\nanavot@cs.huji.ac.il\n\nNaftali Tishby\n\ntishby@cs.huji.ac.il\n\nSchool of Computer Science and Engineering and\nInterdisciplinary Center for Neural Computation\n\nThe Hebrew University, Jerusalem, Israel\n\nAbstract\n\nPrototypes based algorithms are commonly used to reduce the computa-\ntional complexity of Nearest-Neighbour (NN) classi\ufb01ers. In this paper\nwe discuss theoretical and algorithmical aspects of such algorithms. On\nthe theory side, we present margin based generalization bounds that sug-\ngest that these kinds of classi\ufb01ers can be more accurate then the 1-NN\nrule. Furthermore, we derived a training algorithm that selects a good set\nof prototypes using large margin principles. We also show that the 20\nyears old Learning Vector Quantization (LVQ) algorithm emerges natu-\nrally from our framework.\n\n1\n\nIntroduction\n\nThough \ufb01fty years have passed since the introduction of One Nearest Neighbour (1-NN) [1]\nit is still a popular algorithm. 1-NN is a simple and intuitive algorithm but at the same time\nachieves state of the art results [2]. However in large, high dimensional data set it often\nbecome infeasible. One approach to face this computational problem is to approximate\nthe nearest neighbour [3] using various techniques. Alternative approach is to choose a\nsmall data-set (aka prototypes) which represents the original training sample, and apply the\nnearest neighbour rule only with respect to this small data-set. This solution maintains the\n\u201cspirit\u201d of the original algorithm, while making it feasible. 
Moreover, it might improve accuracy by reducing noise over-fitting.

In this setting, the goal of the learning stage is to choose the prototypes wisely, i.e., in a way that will yield good generalization 1. In this paper we use the Maximal Margin principle [4, 5] for this purpose. The training data is used to measure the margin of each proposed positioning of the prototypes. We combine these measurements to calculate a risk for each prototype set and select the prototypes that minimize the risk.

Roughly speaking, margins measure the level of confidence a classifier has in its decisions. This tool has become a primary method in machine learning during the last decade. Two of the most powerful algorithms in the field, Support Vector Machines (SVM) [4] and AdaBoost [5], are motivated and analyzed by margins. Since the introduction of these algorithms, dozens of papers have been published on different aspects of margins in supervised learning [6, 7, 8].

1 Good generalization means that the probability of misclassifying a new example is small.

Learning Vector Quantization (LVQ) [9] is a well-known algorithm that deals with the same problem of selecting prototypes. LVQ iterates over the training data and updates the prototype positions. Although it has been known for more than 20 years, and in spite of its popularity, no adequate generalization bounds or theory have been suggested for this algorithm. In this paper we show that algorithms derived from the maximal-margin principle contain LVQ as a special case. We use this result to present generalization bounds and insights for the LVQ algorithm.

Buckingham and Geva [10] were the first to explore the relation between the maximal-margin principle and LVQ. They presented a variant named LMVQ and analyzed it. As in most of the literature about LVQ, they view the algorithm as trying to estimate a density function (or a function of the density) at each point.
After estimating the density, the Bayesian decision rule is used. We take a different point of view and look at the geometry of the decision boundary induced by the decision rule. Note that in order to generate a good classification rule, the only significant factor is where the decision boundary lies (it is a well-known fact that classification is easier than density estimation [11]).

Summary of the Results  In section 2 we present the model and outline the LVQ family of algorithms. A discussion and definition of margin is provided in section 3. The two fundamental results are a bound on the generalization error and a theoretical grounding for the LVQ family of algorithms. In section 4 we present a bound on the gap between the empirical and the generalization accuracy. This provides a guarantee on the performance over unseen instances based on the empirical evidence. Although LVQ was designed as an approximation to nearest neighbour, the theorem suggests that the former is more accurate in many cases. Indeed, a simple experiment shows this prediction to be true. In section 5 we show how the LVQ family of algorithms emerges from the generalization bound. These algorithms minimize the bound using gradient descent. The different variants correspond to different tradeoffs between opposing quantities. In practice the tradeoff is controlled by loss functions.

2 Problem Setting and the LVQ algorithm

The framework we are interested in is supervised learning for classification problems. In this framework the task is to find a map from R^n into a finite set of labels Y. We focus on classification functions of the following form: the classifiers are parameterized by a set of points μ_1, …, μ_k ∈ R^n which we refer to as prototypes. Each prototype is associated with a label y ∈ Y.
Given a new instance x ∈ R^n, we predict that it has the same label as the closest prototype, as in the 1-nearest-neighbour rule (1-NN). We denote the label predicted using a set of prototypes {μ_j}_{j=1}^k by μ(x). The goal of the learning process in this model is to find a set of prototypes which accurately predicts the labels of unseen instances.

The Learning Vector Quantization (LVQ) family of algorithms works in this model. The algorithm gets as input a labelled sample S = {(x_l, y_l)}_{l=1}^m, where x_l ∈ R^n and y_l ∈ Y, and uses it to find a good set of prototypes. All the variants of LVQ share the following common scheme. The algorithm maintains a set of prototypes, each assigned a predefined label which is kept constant during the learning process. It cycles through the training data S and on each iteration modifies the set of prototypes according to one instance (x_t, y_t). If the prototype μ_j has the same label as y_t it is attracted to x_t, but if the label of μ_j is different it is repelled from it. Hence LVQ updates the closest prototypes to x_t according to the rule:

    μ_j ← μ_j ± α_t (x_t − μ_j) ,                                        (1)

where the sign is positive if the labels of x_t and μ_j agree, and negative otherwise. The parameter α_t is updated using a predefined scheme and controls the rate of convergence of the algorithm. The variants of LVQ differ in which prototypes they choose to update in each iteration and in the specific scheme used to modify α_t.

For instance, LVQ1 and OLVQ1 update only the closest prototype to x_t in each iteration. Another example is LVQ2.1, which modifies the two closest prototypes μ_i and μ_j to x_t. It uses the same update rule (1) but applies it only if the following two conditions hold:

1. Exactly one of the prototypes has the same label as x_t, i.e. y_t.
2.
The ratio of their distances from x_t falls in a window: 1/s ≤ ‖x_t − μ_i‖ / ‖x_t − μ_j‖ ≤ s, where s is the window size.

More variants of LVQ can be found in [9].

3 Margins

Margin plays an important role in current machine-learning research. It measures the confidence of a classifier in its predictions. One approach is to define the margin as the distance between an instance and the decision boundary induced by the classification rule, as illustrated in figure 1(a). Support Vector Machines [4] are based on this definition of margin, which we refer to as Sample-Margin. An alternative definition, Hypothesis Margin, exists: here the margin is the distance that the classifier can travel without changing the way it labels any of the sample points. Note that this definition requires a distance measure between classifiers. This type of margin is used in AdaBoost [5] and is illustrated in figure 1(b).

It is possible to apply these two types of margin in the context of LVQ. Recall that in our model a classifier is defined by a set of labeled prototypes. Such a classifier generates a decision boundary via the Voronoi tessellation. Although the sample margin is the more natural first choice, it turns out that this type of margin is both hard to compute and numerically unstable in our context, since small relocations of the prototypes might lead to a dramatic change in the sample margin.
Hence we focus on the hypothesis margin, and thus have to define a distance measure between two classifiers. We choose to define it as the maximal distance between prototype pairs, as illustrated in figure 2. Formally, let μ = {μ_j}_{j=1}^k and μ̂ = {μ̂_j}_{j=1}^k define two classifiers; then

    ρ(μ, μ̂) = max_{i=1,…,k} ‖μ_i − μ̂_i‖_2 .

Note that this definition is not invariant to permutations of the prototypes, but it upper bounds the invariant definition. Furthermore, the induced margin is easy to compute (lemma 1) and lower bounds the sample-margin (lemma 2).

Figure 1: Sample Margin (figure 1(a)) measures how much an instance can travel before it hits the decision boundary. Hypothesis Margin (figure 1(b)) measures how much the hypothesis can travel before it hits an instance.

Lemma 1  Let μ = {μ_j}_{j=1}^k be a set of prototypes and x a sample point. Then the hypothesis margin of μ with respect to x is θ = (1/2)(‖μ_j − x‖ − ‖μ_i − x‖), where μ_i (μ_j) is the closest prototype to x with the same (alternative) label.

Lemma 2  Let S = {x_l}_{l=1}^m be a sample and μ = (μ_1, …, μ_k) be a set of prototypes. Then

    sample-margin_S(μ) ≥ hypothesis-margin_S(μ)

Lemma 2 shows that if we find a set of prototypes with a large hypothesis margin, then it has a large sample margin as well.

4 Margin Based Generalization Bound

In this section we present a bound on the generalization error of LVQ-type classifiers.

When a classifier is applied to training data, it is natural to use the training error as a prediction of the generalization error (the probability of misclassifying an unseen instance).
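The per-instance hypothesis margin of lemma 1 is straightforward to compute; the sketch below is our own minimal rendering (function and variable names are not from the paper):

```python
import numpy as np

def hypothesis_margin(prototypes, proto_labels, x, y):
    """Hypothesis margin of a prototype set at a labelled point (x, y),
    following lemma 1: theta = (||mu_j - x|| - ||mu_i - x||) / 2, where
    mu_i is the closest prototype sharing x's label and mu_j is the
    closest prototype carrying any other label."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    same = proto_labels == y
    d_same = dists[same].min()       # closest correct prototype
    d_other = dists[~same].min()     # closest incorrect prototype
    return 0.5 * (d_other - d_same)  # positive iff x is classified correctly
```

The sign of the returned value encodes correctness of the 1-NN prediction: a positive margin means the nearest prototype has the right label.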
In prototype-based hypotheses the classifier assigns a confidence level, i.e. a margin, to its predictions. Taking the margin into account by counting instances with small margin as mistakes gives a better prediction and provides a bound on the generalization error. This bound is given in terms of the number of prototypes, the sample size, the margin and the margin-based empirical error. The following theorem states this result formally.

Figure 2: The distance measure on the LVQ hypothesis class. The distance between the white and black prototype sets is the maximal distance between prototype pairs.

Theorem 1  In the following setting:

• Let S = {x_i, y_i}_{i=1}^m ∈ {R^n × Y}^m be a training sample drawn from some underlying distribution D.
• Assume that ‖x_i‖ ≤ R for all i.
• Let μ be a set of prototypes with k prototypes from each class.
• Let 0 < θ < 1/2.
• Let α_S^θ(μ) = (1/m) |{i : margin_μ(x_i) < θ}|.
• Let e_D(μ) be the generalization error: e_D(μ) = Pr_{(x,y)∼D} [μ(x) ≠ y].
• Let δ > 0.

Then with probability 1 − δ over the choice of the training data, for all μ:

    e_D ≤ α_S^θ(μ) + sqrt( (8/m) ( d log²(32m/θ²) + log(4/δ) ) )         (2)

where d is the VC dimension:

    d = min( n + 1, 64R²/θ² ) 2k|Y| log(ek²)                             (3)

This theorem leads to a few observations. First, note that the bound is dimension free, in the sense that the generalization error is bounded independently of the input dimension (n), much as in SVM. Hence it makes sense to apply these algorithms with kernels.

Second, note that the VC dimension grows as the number of prototypes grows (3).
This suggests that using too many prototypes might result in poor performance; there is therefore a non-trivial optimal number of prototypes. One should not be surprised by this result, as it is a realization of the Structural Risk Minimization (SRM) principle [4]. Indeed, a simple experiment supports this prediction. Hence prototype-based methods are not only faster than Nearest Neighbour, they are more accurate as well. Due to space limitations, proofs are provided in the full version of this paper only.

5 Maximizing Hypothesis Margin Through Loss Function

Once margin is properly defined, it is natural to ask for an algorithm that maximizes it. We will show that this is exactly what LVQ does. Before going any further we have to understand why maximizing the margin is a good idea.

In theorem 1 we saw that the generalization error can be bounded by a function of the margin θ and the empirical θ-error (α). Therefore it is natural to seek prototypes that obtain a small θ-error for a large θ. We are faced with two contradicting goals: small θ-error versus large θ. A natural way to resolve this tension is through the use of a loss function.

Loss functions are a common technique in machine learning for finding the right balance between opposing quantities [12]. The idea is to associate a margin-based loss (a "cost") with each hypothesis with respect to a sample. More formally, let L be a function such that:

1. For every θ: L(θ) ≥ 0.
2. For every θ < 0: L(θ) ≥ 1.

We use L to compute the loss of a hypothesis with respect to one instance.
When a training set is available we sum the loss over the instances: L(μ) = Σ_l L(θ_l), where θ_l is the margin of the l'th instance in the training data. The two axioms of loss functions guarantee that L(μ) bounds the empirical error. It is common to place further restrictions on the loss function, such as requiring that L be non-increasing. However, the only assumption we make here is that the loss function L is differentiable.

Figure 3: Different loss functions. SVM, LVQ1 and OLVQ1 use the "hinge" loss: (1 − θ)+. LVQ2.1 uses the broken linear loss: min(2, (1 − 2θ)+). AdaBoost uses the exponential loss (e^{−θ}).

Different algorithms use different loss functions [12]. AdaBoost uses the exponential loss function L(θ) = e^{−βθ}, while SVM uses the "hinge" loss L(θ) = (1 − βθ)+, where β > 0 is a scaling factor. See figure 3 for an illustration of these loss functions.

Once a loss function is chosen, the goal of the learning algorithm is to find a hypothesis that minimizes it. Gradient descent is a natural and simple choice for this task. Recall that in our case θ_l = (‖x_l − μ_i‖ − ‖x_l − μ_j‖)/2, where μ_j and μ_i are the closest prototypes to x_l with the correct and incorrect labels respectively.
Hence we have that 2

    dθ_l/dμ_r = S_l(r) (x_l − μ_r) / ‖x_l − μ_r‖

where S_l(r) is a sign function such that

    S_l(r) =  1   if μ_r is the closest prototype with the correct label,
             −1   if μ_r is the closest prototype with the incorrect label,
              0   otherwise.

2 Note that if x_l = μ_j the derivative is not defined. This extreme case does not affect our conclusions, hence for the sake of clarity we avoid the treatment of such extreme cases in this paper.

Algorithm 1 Online Loss Minimization.
Recall that L is a loss function, and γ_t decays to zero as the algorithm proceeds.

1. Choose initial positions for the prototypes {μ_j}_{j=1}^k.
2. For t = 1 : T (or ∞)
   (a) Receive a labelled instance (x_t, y_t).
   (b) Compute the closest correct and incorrect prototypes to x_t, μ_j and μ_i, and the margin of x_t, i.e. θ_t = (1/2)(‖x_t − μ_i‖ − ‖x_t − μ_j‖).
   (c) Apply the update rule for r = i, j:

       μ_r ← μ_r − γ_t (dL(θ_t)/dθ) S_l(r) (x_t − μ_r) / ‖x_t − μ_r‖

Taking the derivative of L with respect to μ_r using the chain rule, we obtain

    dL/dμ_r = Σ_l (dL(θ_l)/dθ_l) S_l(r) (x_l − μ_r) / ‖x_l − μ_r‖        (4)

By setting the derivative to zero, we get that the optimal solution is achieved when μ_r = Σ_l w_l^r x_l, where α_l^r = (dL(θ_l)/dθ_l) S_l(r) / ‖x_l − μ_r‖ and w_l^r = α_l^r / Σ_l α_l^r. This leads to two conclusions. First, the optimal solution is in the span of the training instances. Furthermore, from its definition it is clear that w_l^r ≠ 0 only for the closest prototypes to x_l. In other words, w_l^r ≠ 0 if and only if μ_r is either the closest prototype to x_l which has the same label as x_l, or the closest prototype to x_l with an alternative label.
Therefore the notion of support vectors [4] applies here as well.

5.1 Minimizing The Loss

Using (4) we can find a local minimum of the loss function with a gradient descent algorithm. The iteration at time t computes:

    μ_r(t+1) ← μ_r(t) − γ_t Σ_l (dL(θ_l)/dθ) S_l(r) (x_l − μ_r(t)) / ‖x_l − μ_r(t)‖

where γ_t approaches zero as t increases. This computation can be carried out iteratively, where in each step we update μ_r only with respect to one sample point x_l. This leads to the following basic update step:

    μ_r ← μ_r − γ_t (dL(θ_l)/dθ) S_l(r) (x_l − μ_r) / ‖x_l − μ_r‖

Note that S_l(r) differs from zero only for the closest correct and incorrect prototypes to x_l; therefore a simple online algorithm is obtained, presented as algorithm 1.

5.2 LVQ1 and OLVQ1

The online loss minimization (algorithm 1) is a general algorithm applicable with different choices of loss function. We now apply it with a couple of loss functions and see how LVQ emerges. First let us consider the "hinge" loss. Recall that the hinge loss is defined to be L(θ) = (1 − βθ)+. The derivative 3 of this loss function is

    dL(θ)/dθ = 0 if θ > 1/β, and −β otherwise.

If β is chosen to be large enough, the update rule in the online loss minimization is

    μ_r ← μ_r ± γ_t β (x_t − μ_r) / ‖x_t − μ_r‖

3 The "hinge" loss has no derivative at the point θ = 1/β. Again, as in other cases in this paper, this fact is neglected.

Figure 4: The "hinge" loss function (Σ_l (1 − θ_l)+) vs. the number of iterations of OLVQ1. One can clearly see that it decreases.

This is the same update rule as in the LVQ1 and OLVQ1 algorithms [9], apart from the extra factor of β / ‖x_t − μ_r‖. However, this is a minor difference, since β / ‖x_t − μ_r‖ is just a normalizing factor. A demonstration of the effect of OLVQ1 on the "hinge" loss function is provided in figure 4. We applied the algorithm to a simple toy problem consisting of three classes and a training set of 800 points. We allowed the algorithm 10 prototypes. As expected, the loss decreases as the algorithm proceeds. For this purpose we used the LVQ_PAK package [13].

5.3 LVQ2.1

The idea behind the definition of margin, and especially hypothesis margin, is that a minor change in the hypothesis cannot change the way it labels an instance which had a large margin. Hence when making small updates (i.e. small γ_t) one should focus only on the instances whose margins are close to zero. The same idea appears in Freund's boost-by-majority algorithm [14].

Kohonen adapted this idea in his LVQ2.1 algorithm [9]. The major difference between the LVQ1 and LVQ2.1 algorithms is that LVQ2.1 updates μ_r only if the margin of x_t falls inside a certain window. The suitable loss function for LVQ2.1 is the broken linear loss function (see figure 3), defined to be L(θ) = min(2, (1 − βθ)+). Note that for |θ| > 1/β the loss is constant (i.e. the derivative is zero); this causes the learning algorithm to overlook instances with too high or too low margin.
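The online loss minimization scheme with the hinge and broken-linear losses can be sketched as follows. This is our own toy rendering, not the paper's implementation: all names and the step-size schedule are ours, and the update sign is fixed so that same-label prototypes are attracted, as in LVQ1.

```python
import numpy as np

def dloss_hinge(theta, beta=1.0):
    # derivative of the "hinge" loss (1 - beta*theta)_+  (LVQ1 / OLVQ1)
    return -beta if theta < 1.0 / beta else 0.0

def dloss_broken_linear(theta, beta=1.0):
    # derivative of min(2, (1 - beta*theta)_+): zero outside the window
    # |theta| < 1/beta, which recovers LVQ2.1's window rule
    return -beta if abs(theta) < 1.0 / beta else 0.0

def online_loss_minimization(X, y, protos, proto_labels, dloss,
                             gamma0=0.1, epochs=5):
    """Sketch of the online scheme: for each instance, update the closest
    correct and incorrect prototypes along the margin gradient."""
    protos = protos.copy()
    t = 0
    for _ in range(epochs):
        for x, lab in zip(X, y):
            t += 1
            gamma = gamma0 / (1 + 0.01 * t)  # step size decaying to zero
            d = np.linalg.norm(protos - x, axis=1)
            same = proto_labels == lab
            j = np.flatnonzero(same)[d[same].argmin()]    # closest correct
            i = np.flatnonzero(~same)[d[~same].argmin()]  # closest incorrect
            theta = 0.5 * (d[i] - d[j])
            g = dloss(theta)
            # S(r) = +1 for the correct prototype, -1 for the incorrect one;
            # since g = dL/dtheta <= 0, the correct prototype is attracted
            # to x and the incorrect one is repelled from it.
            if d[j] > 0:
                protos[j] -= gamma * g * (x - protos[j]) / d[j]
            if d[i] > 0:
                protos[i] += gamma * g * (x - protos[i]) / d[i]
    return protos
```

Swapping `dloss_hinge` for `dloss_broken_linear` changes nothing but which instances trigger an update: the broken-linear derivative vanishes outside the window, so large-margin (and very negative-margin) instances are skipped, mirroring LVQ2.1.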
There exist several differences between LVQ2.1 and the online loss minimization presented here; however, these differences are minor.

6 Conclusions and Further Research

In this paper we used the maximal-margin principle together with loss functions to derive algorithms for prototype positioning. We saw that LVQ can be considered a special case of this general algorithm. We also provide generalization bounds for any prototype-based classifier.

This formulation allows the derivation of new algorithms in several different ways. The first is to use other loss functions, such as the exponential loss. A second way is to use other classification rules, such as k-NN or Parzen windows. The proper way to adapt the algorithm to the chosen rule is to define the margin accordingly, and to modify the minimization process in the training stage. We have carried out some basic experiments using the k-NN rule. The performance of the modified classifier did not exceed that of the 1-NN rule. We suggest the following explanation of these results: usually the k-NN rule performs better than the 1-NN rule as it filters noise better, and in our setting the noise filtering is already achieved by using a small number of prototypes.

Another extension is to use a different distance measure instead of the l2 norm. This may result in a more complicated formula for the derivative of the loss function, but may improve the results significantly in some cases. One specific interesting distance measure is the Tangent Distance [2].

We also presented a generalization guarantee for prototype-based classifiers that is based on the margin training error. The bound is dimension free, and thus a kernel version of the algorithm may yield good performance. This modification is straightforward, as the algorithm can be expressed as a function of inner products only.
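To illustrate why such a kernelization is straightforward: since an optimal prototype lies in the span of the training instances (section 5), its squared feature-space distance to a point can be computed from kernel evaluations alone. The sketch below is our own illustration under that assumption; the function names and weight vector `w` are hypothetical.

```python
import numpy as np

def sq_dist_to_prototype(x, Xtrain, w, kernel):
    """Squared feature-space distance ||phi(x) - sum_l w_l phi(x_l)||^2,
    for a prototype expressed as a weighted sum of training instances,
    computed from kernel evaluations only:
    K(x,x) - 2 sum_l w_l K(x,x_l) + sum_{l,m} w_l w_m K(x_l,x_m)."""
    k_xx = kernel(x, x)
    k_xl = np.array([kernel(x, xl) for xl in Xtrain])
    K = np.array([[kernel(a, b) for b in Xtrain] for a in Xtrain])
    return k_xx - 2.0 * w @ k_xl + w @ K @ w
```

With the linear kernel this reduces to the ordinary squared Euclidean distance to the prototype Σ_l w_l x_l; any PSD kernel (e.g. Gaussian) yields the smooth non-linear decision boundaries discussed below.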
We performed preliminary experiments with a kernelized version of the algorithm. It seems that it improves the accuracy when used with a small number of prototypes. However, allowing more prototypes to the standard version achieves the same improvement.

A possible explanation of this phenomenon is the following. Recall that a classifier is parameterized by a set of labelled prototypes that define a Voronoi tessellation. The decision boundary of such a classifier is built from some of the lines of the Voronoi tessellation. In the standard version these lines are straight; in the kernel version they are smooth non-linear curves. As the number of prototypes grows, the decision boundary consists of more, and shorter, lines. If we recall that any smooth curve can be approximated by a broken linear one, we come to the conclusion that any classifier that can be generated by the kernel version can be approximated by one generated by the standard version applied with more prototypes.

Acknowledgement  We thank Yoram Singer and Gal Chechik for their helpful remarks.

References

[1] E. Fix and J. Hodges. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Technical Report 4, USAF School of Aviation Medicine, 1951.

[2] P. Y. Simard, Y. A. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems, volume 5, pages 50–58, 1993.

[3] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the 30th ACM Symposium on the Theory of Computing, pages 604–613, 1998.

[4] V. Vapnik. The Nature Of Statistical Learning Theory. Springer-Verlag, 1995.

[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119–139, 1997.

[6] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. Annals of Statistics, 1998.

[7] L. Mason, P. Bartlett, and J. Baxter. Direct optimization of margins improves generalization in combined classifiers. Advances in Neural Information Processing Systems, 11:288–294, 1999.

[8] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In International Conference on Machine Learning, 2000.

[9] T. Kohonen. Self-Organizing Maps. Springer-Verlag, 1995.

[10] L. Buckingham and S. Geva. LVQ is a maximum margin algorithm. In Pacific Knowledge Acquisition Workshop PKAW'2000, 2000.

[11] L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

[12] Y. Singer and D. D. Lewis. Machine learning for information retrieval: Advanced techniques. Presented at ACM SIGIR 2000, 2000.

[13] T. Kohonen, J. Hynninen, J. Kangas, J. Laaksonen, and K. Torkkola. LVQ_PAK: The learning vector quantization program package. http://www.cis.hut.fi/research/lvq_pak, 1995.

[14] Y. Freund. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.
", "award": [], "sourceid": 2261, "authors": [{"given_name": "Koby", "family_name": "Crammer", "institution": null}, {"given_name": "Ran", "family_name": "Gilad-bachrach", "institution": null}, {"given_name": "Amir", "family_name": "Navot", "institution": null}, {"given_name": "Naftali", "family_name": "Tishby", "institution": null}]}