{"title": "On-line Learning of Dichotomies", "book": "Advances in Neural Information Processing Systems", "page_first": 303, "page_last": 310, "abstract": null, "full_text": "On-line Learning of Dichotomies \n\nN. Barkai \nRacah Institute of Physics \nThe Hebrew University \nJerusalem, Israel 91904 \nnaama@fiz.huji.ac.il \n\nH. S. Seung \nAT&T Bell Laboratories \nMurray Hill, NJ 07974 \nseung@physics.att.com \n\nH. Sompolinsky \nRacah Institute of Physics \nThe Hebrew University \nJerusalem, Israel 91904 \nand AT&T Bell Laboratories \nhaim@fiz.huji.ac.il \n\nAbstract \n\nThe performance of on-line algorithms for learning dichotomies is studied. In on-line learning, the number of examples P is equivalent to the learning time, since each example is presented only once. The learning curve, or generalization error as a function of P, depends on the schedule at which the learning rate is lowered. For a target that is a perceptron rule, the learning curve of the perceptron algorithm can decrease as fast as P^{-1} if the schedule is optimized. If the target is not realizable by a perceptron, the perceptron algorithm does not generally converge to the solution with lowest generalization error. For the case of unrealizability due to a simple output noise, we propose a new on-line algorithm for a perceptron yielding a learning curve that can approach the optimal generalization error as fast as P^{-1/2}. We then generalize the perceptron algorithm to any class of thresholded smooth functions learning a target from that class. For \"well-behaved\" input distributions, if this algorithm converges to the optimal solution, its learning curve can decrease as fast as P^{-1}. \n\n1 Introduction \n\nMuch work on the theory of learning from examples has focused on batch learning, in which the learner is given all examples simultaneously, or is allowed to cycle through them repeatedly.
In many situations, it is more natural to consider on-line learning paradigms, in which at each time step a new example is chosen. The examples are never recycled, and the learner is not allowed to simply store them (see, e.g., Heskes, 1991; Hansen, 1993; Radons, 1993). Stochastic approximation theory (Kushner, 1978) provides a framework for understanding the local convergence properties of on-line learning of smooth functions. This paper addresses the problem of on-line learning of dichotomies, for which no similarly complete theory yet exists. \n\n\f304 \n\nN. Barkai, H. S. Seung, H. Sompolinsky \n\nWe begin with on-line learning of perceptron rules. Since its introduction in the early 60's, the perceptron algorithm has been used as a simple model of learning a binary classification rule. The algorithm has been proven to converge in finite time and to yield a half plane separating any set of linearly separable examples. The perceptron algorithm, however, is not efficient in the sense of distribution-free PAC learning (Valiant, 1984), since one can construct input distributions that require an arbitrarily long convergence time. In a recent paper, Baum (1990) proved that the perceptron algorithm applied in an on-line mode converges as P^{-1/3} when learning a half space under a uniform input distribution, where P is the number of presented examples drawn at random. For on-line learning, P is also the number of time steps. Baum also generalized his result to any \"non-malicious\" distribution. Kabashima (1994) has found the same power law for learning a two-layer parity machine with non-overlapping inputs, using an on-line least-action algorithm. \nIf efficiency is measured only by the number of examples used (disregarding time), these particular on-line algorithms are much worse than batch algorithms.
Any batch algorithm that correctly classifies a given set of P examples will converge as P^{-1} (Vapnik, 1982; Amari, 1992; Seung, 1992). In this paper, we construct on-line algorithms that achieve the same power law as batch algorithms, demonstrating that the results of Baum and Kabashima do not reflect a fundamental limitation of on-line learning. \nIn Section 3, we study on-line algorithms for perceptron learning of a target rule that is not realizable by a perceptron. Here it is nontrivial to construct an algorithm that even converges to the optimal solution, let alone to optimize the rate of convergence. For the special case of a target rule that is a perceptron corrupted by output noise this can be done. In Section 4, our results are generalized to dichotomies generated by thresholding smooth functions. In Section 5 we summarize the results. \n\n2 On-line learning of a perceptron rule \n\nWe consider a half-space rule generated by a normalized teacher perceptron W_0 in R^N, W_0 · W_0 = 1, such that any vector S in R^N is given a label σ_0(S) = sgn(W_0 · S). We study the case of a Gaussian input distribution centered at zero with unit variance in each direction: \n\nP(S) = Π_{i=1}^N (1/√(2π)) exp(-S_i²/2) (1) \n\nAverages over this input distribution will be written with angle brackets ⟨·⟩. A student perceptron W is trained by an on-line perceptron algorithm. At each time step, an input S in R^N is drawn at random according to the distribution of Eq. (1), and the student's output σ(S) = sgn(W · S) is calculated. The student is then updated according to the perceptron rule: \n\nW' = W + (η/N) ε(S; W) σ_0(S) S (2) \n\nand is then normalized so that W · W = 1 at all times. The factor ε(S; W) indicates an error of the student perceptron on the input S: ε = 1 if σ(S)σ_0(S) = -1, and 0 otherwise. The learning rate η sets the magnitude of the change of the weights at each time step.
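The on-line rule of Eq. (2) is simple to simulate directly. The following sketch (ours, not the authors' code) trains a normalized student on Gaussian inputs at finite N and anneals the rate with the 1/α schedule discussed below; the cap on the rate for α < 1 is our own choice to avoid a divergent early rate.

```python
import numpy as np

# Illustration of Eq. (2): on-line perceptron with normalization, Gaussian
# inputs of Eq. (1), and the annealed rate eta = 2*sqrt(2*pi)/alpha.
rng = np.random.default_rng(0)
N = 50                                   # input dimension (finite-N example)
w0 = rng.standard_normal(N)
w0 /= np.linalg.norm(w0)                 # normalized teacher W_0
w = rng.standard_normal(N)
w /= np.linalg.norm(w)                   # normalized student W

def gen_error(w, w0):
    # Eq. (3): disagreement probability for isotropic Gaussian inputs
    return np.arccos(np.clip(w @ w0, -1.0, 1.0)) / np.pi

err_start = gen_error(w, w0)
P = 200 * N                              # number of examples = time steps
for t in range(1, P + 1):
    s = rng.standard_normal(N)           # input drawn from Eq. (1)
    sigma0 = np.sign(w0 @ s)             # teacher label
    if np.sign(w @ s) != sigma0:         # eps(S; W) = 1 only on errors
        alpha = t / N
        eta = 2.0 * np.sqrt(2.0 * np.pi) / max(alpha, 1.0)  # z = 1 schedule
        w = w + (eta / N) * sigma0 * s   # Eq. (2)
        w /= np.linalg.norm(w)           # keep W on the unit sphere
err_end = gen_error(w, w0)
```

With this schedule the measured error at α = 200 should lie close to the asymptotic prediction of order 1/α, far below its initial value of roughly 1/2 for a random start.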
It is scaled by N to ensure that the change in the overlap R = W · W_0 is of order 1/N. Thus, a change of O(1) occurs only after presentation of P = O(N) examples. \nThe performance of the student is measured by the generalization error ε_g = ⟨ε(S; W)⟩, defined as the probability of disagreement between the student and the teacher on an arbitrary input. In the present case, \n\nε_g = cos^{-1}(R)/π (3) \n\nAlthough for simplicity we analyze below the performance of the perceptron rule (2) only for large N, our results apply to finite N as well. Multiplying Eq. (2) by W_0 after incorporation of the normalization operation, and averaging with respect to the input distribution (1), yields the following differential equation for R(α), where α = P/N: \n\ndR/dα = (η/√(2π)) (1 - R²) - (η²/2) R ε_g (4) \n\nHere terms of order η/N have been neglected. \nThe evolution of the overlap R, and thus of the generalization error, depends on the schedule at which the learning rate η decreases. We consider two cases, a constant η and a time-dependent η. \nConstant learning rate: When η is held fixed, Eq. (4) has a stable fixed point at R < 1, and hence ε_g converges to an η-dependent nonzero value ε_∞(η). For η ≪ 1, 1 - R_∞(η) ∝ η², and ε_g ∝ √(1 - R) is therefore proportional to η: \n\nε_∞(η) = η/√(2π³) (5) \n\nThe convergence to this value is exponential in α, ε_g(α) - ε_∞(η) ~ exp(-ηα/√(2π)). \nTime-dependent learning rate: Convergence to ε_g = 0 can be achieved if η decreases slowly enough with α. We study the limiting behaviour of the system for η decreasing with time as η = η_0 √(2π) α^{-z}: \nz > 1: The rate is reduced too fast, before a sufficient number of examples have been seen. As a result, R does not converge to 1 but instead to a smaller value that depends on its initial value. \nz < 1: The system follows the change in η adiabatically.
Hence, to first order in α^{-1}, ε_g(α) = ε_∞(η(α)). Thus ε_g converges to zero at the asymptotic rate ε_g(α) ~ α^{-z}. \nz = 1: The behaviour of the system depends on the prefactor η_0: \n\nε_g ≃ η_0²/(π(η_0 - 1)) α^{-1} for η_0 > 1; ε_g ~ √(log α)/α for η_0 = 1; ε_g ~ A α^{-η_0} for η_0 < 1 (6) \n\nwhere A depends on the initial condition. Thus the optimal asymptotic schedule is η = 2√(2π)/α, in which case the error behaves asymptotically as ε_g(α) ≈ 1.27/α. This is not far from the batch asymptotics (Seung, 1992), ε_g(α) ≈ 0.625/α. We have confirmed these results by numerical simulation of the algorithm of Eq. (2). Figure 1 presents the results of the optimal learning schedule, i.e., η = 2√(2π)/α. The numerical results are in excellent agreement with the predicted asymptotic behavior ε_g(α) = 1.27/α. Finally, we note that our analysis of the time-dependent case is similar to that of Kabashima and Shinomoto for a different on-line learning problem (Kabashima, 1993). \n\n3 On-line learning of a perceptron with output noise \n\nIn the case discussed above, the task can be fully realized by a perceptron, i.e., there is a perceptron W such that ε_g = 0. In more realistic situations a perceptron will only provide an approximation of the target function, so that the minimal value of ε_g is greater than zero. Such cases are called unrealizable tasks. A drawback of the above on-line algorithm is that, for a general unrealizable task, it does not converge to the optimal perceptron, i.e., it does not approach the minimum of ε_g. To illustrate this fact we consider a perceptron rule corrupted by output noise. The label of an input S is σ_0(S), where σ_0(S) = sgn(W_0 · S) with probability 1 - p, and -sgn(W_0 · S) with probability p. We assume 0 ≤ p ≤ 1/2.
For reasons which will become clear later, the input distribution is taken as a Gaussian centered at U: \n\nP(S) = Π_{i=1}^N (1/√(2π)) exp(-(S_i - U_i)²/2) (7) \n\nIn this case ε_g is given by \n\nε_g = p + (1 - 2p) [ ∫_{-∞}^{-q} Dy H(-(q_0 + Ry)/√(1 - R²)) + ∫_{-q}^{∞} Dy H((q_0 + Ry)/√(1 - R²)) ] (8) \n\nwhere q_0 = U · W_0 denotes the overlap between the center of the distribution and the teacher perceptron, and q = U · W is the overlap between the center of the distribution and W. The integrals in Eq. (8) are with respect to the Gaussian measure Dy = exp(-y²/2) dy/√(2π), and H(x) = ∫_x^∞ Dy. Note that the optimal perceptron is the teacher W = W_0, i.e., R = 1 and q = q_0, which yields the minimal error ε_min = p. \nFirst, we consider training with the normalized perceptron rule (2). In this case, we obtain differential equations for two variables, R and q. Solving these equations we find that, in general, W converges to a vector whose direction lies in the plane of W_0 and U but does not point in the direction of W_0, even in the limit η → 0. Here we present the result for the limit η → 0 and a small noise level, p ≪ 1. In this case, we obtain for ε_∞(η = 0) \n\nε_∞(0) = p + p (1 - 2H(q_0)) (u² - q_0²)/(1 + u² - q_0²) + O(p²) (9) \n\nwhere u = |U| is the magnitude of the center of the input distribution. For p = 0, the only solution is R = 1 and q = q_0, in agreement with the previous results. For p > 0 the optimal solution is retrieved only in the following special cases: (i) the input distribution is isotropic, i.e., q_0 = u = 0; (ii) U is parallel to W_0, i.e., u = q_0; and (iii) U is orthogonal to W_0, i.e., q_0 = 0. This holds also for large values of p. In these special cases, the symmetry of the input distribution relative to the teacher vector guarantees that the deviations from W = W_0 incurred by the inputs that come with the wrong label cancel each other on average.
According to Eq. (9), for other directions of U, ε_g lies above the optimal value. Note that the additional term in ε_g is of the same order of magnitude, O(p), as the minimal error. \nIn the following we suggest a modified on-line algorithm for learning a perceptron rule with output noise. The student weights are changed according to \n\nW' = W + (η/N) ε(S; W) σ_0(S) (S - T(S)) (10) \n\nfollowed by a normalization of W. This algorithm differs from the perceptron algorithm in that the change in W is proportional not to the present input but to a shifted vector. The shifting vector T(S) is determined by the requirement that the teacher W_0 be a fixed point of the algorithm in the limit η → 0. This is equivalent to the condition \n\n⟨ε_0(S) σ_0(S) (S - T(S))⟩ = 0 (11) \n\nwhere ε_0(S) is the error function for S when W = W_0. This condition does not determine T uniquely. A simple choice is one for which T is independent of S. This leads to \n\nT = ⟨σ_0 S⟩/⟨σ_0⟩ = ⟨sgn(W_0 · S) S⟩/⟨sgn(W_0 · S)⟩ (12) \n\nwhere we used the fact that for any S, ε_0(S)σ_0(S) equals -sgn(W_0 · S) with probability p, and zero with probability 1 - p. This uniform shift is possible only when ⟨σ_0⟩ ≠ 0, namely when the average frequencies of +1 and -1 labels are not equal. If this is not the case, one has to choose nonuniform forms of T(S). \nNote that in general T has to be learned, so that Eq. (10) has to be supplemented by appropriate equations for changing T. In the case of Eq. (12), one can easily learn the numerator and denominator separately, by running averages of σ_0 S and σ_0, respectively. We have studied the above algorithm analytically for the case of the Gaussian input distribution of Eq. (7), in the limit of large N.
The shifting vector is given by \n\nT = U + W_0 √(2/π) exp(-q_0²/2)/(1 - 2H(q_0)) (13) \n\nThe differential equations for the overlaps R and q in the neighborhood of the point R = 1, q = q_0 are of the form \n\ndδR/dα = -a η δR + b η², dδq/dα = -c η δq + d η² (14) \n\nwhere δR = 1 - R, δq = q_0 - q, and a, b, c, d are positive constants depending on p, u, and q_0. In the limit η → 0, R = 1 and q = q_0 is indeed a stable fixed point of the algorithm, so that the student converges to the optimal perceptron W_0, and hence the generalization error converges to its minimal value ε_min = p. Since, unlike Eq. (4), the coefficient of the η² term in Eq. (14) does not vanish at the fixed point, the residual deviation scales as δR_∞(η) ∝ η. With the schedule η = η_0/α, the generalization error then converges to its minimum as \n\nε_g(α) - p ~ α^{-η_0/2}, η_0 < 1 (17) \n\nwhile for η_0 > 1 it decreases as α^{-1/2}, with the optimal prefactor \n\nε_g(α) - p ≈ √(2p/(πα)) (18) \n\nwhich is achieved for η_0 = 2. \nWe have successfully tested this algorithm by simulations of learning a perceptron rule with output noise for several input distributions, including the Gaussian of Eq. (7). Figure 2 presents the generalization error as a function of α for the Gaussian distribution, with p = 0.2 and η_0 = 2. The error converges to the optimal value 0.2 as α^{-1/2}, in agreement with the theory. For comparison, the result of the usual perceptron algorithm is also presented. That algorithm converges to ε_g ≈ 0.32, clearly larger than the optimal value. \n\n4 On-line learning of thresholded smooth functions \n\nOur results for the realizable perceptron can be extended to a more general class of dichotomies, namely thresholded smooth functions. They are defined as dichotomies of the form \n\nσ(S; W) = sgn(f(S; W)) (19) \n\nwhere f is a differentiable function of a set of parameters, denoted by W, and S is the input vector. We consider here the case of a realizable task, where the examples are given with labels σ_0 corresponding to a target machine W_0 which is in the W space.
For this task we propose the following generalization of the perceptron rule (2): \n\nW' = W + η ε(S; W) σ_0(S) ∇f(S; W) (20) \n\nwhere ∇ denotes a gradient with respect to W. Then, as we argue below, the vector W_0 is a stable fixed point in the limit η → 0. Furthermore, for constant small η the residual error scales as ε_∞ ∝ η. To see this, consider first the case of a single parameter w, whose probability distribution evolves according to the Markov equation \n\nP(w, n + 1) = ∫ dw' W(w|w') P(w', n) \n\nwhere W(w|w') is the transition rate from w' to w. In the limit of small fixed η, the equilibrium distribution P_∞ can be shown to have the following scaling form, \n\nP_∞(w; η) = (1/η) F(δw/η) (23) \n\nwhere δw = w - w_0 and F(x) obeys the difference equation \n\nLF(x) ≡ Σ_{σ=±1} g((f_0' + σx) f_0') |f_0' + σx| F(x + σ f_0') - |x| F(x) = 0 (24) \n\nwhere f_0' is the value of the gradient ∂f(w_0, s)/∂w at the decision boundary of f(w_0, s), namely at the point s obeying f(w_0, s) = 0. Note that since we are interested in normalizable solutions of Eq. (24), F(x) has to vanish for all |x| > |f_0'|. This result is valid provided the input distribution is smooth and nonvanishing near the decision boundary. Furthermore, ∂f/∂w at w_0 may not vanish on the decision boundary. Under the same conditions, it can be shown that the error is homogeneous in δw with degree 1; hence it scales linearly with η, i.e., ε_∞ ∝ η. It should be noted that, unlike in other on-line learning problems (Heskes, 1991; Hansen, 1993; Radons, 1993), the equilibrium distribution in our case is not Gaussian. \nFor a time-dependent η of the form η = η_0 n^{-z}, z < 1, P(w, n) at long times is of the form \n\nP(w, n) = (1/η(n)) [F(δw/η(n)) + n^{z-1} G(δw/η(n))] (25) \n\nwhere F is the stationary distribution given by Eq. (24), and the coefficient of the correction, G, solves the inhomogeneous equation \n\nzx dF/dx + zF(x) = η_0 LG(x) (26) \n\nwhere the linear operator L is defined in Eq. (24).
Thus, to leading order in inverse time, the system adiabatically follows the finite-η stationary distribution, yielding an ε_g(n) that vanishes asymptotically as ε_g(n) ∝ η(n) ~ n^{-z}. The optimal schedule is obtained for z = 1. In this case, P(w, n) = η^{-1}(n) F̂(δw/η(n)), where F̂(x) solves the homogeneous equation \n\nx dF̂/dx + F̂(x) = η_0 LF̂(x) (27) \n\nFor sufficiently large η_0, this equation has a solution, implying that ε_g ∝ n^{-1}. \nSimilarly, the results of Section 3 can be extended to the case of thresholded smooth functions with a probability p of an error due to isotropic output noise. In this case, the optimal choice is again η ∝ n^{-1}, yielding ε_g - p ∝ √η. It should be noted that for this case the probability distribution for small η does reduce to a Gaussian distribution in δw/√η. Using a multidimensional Markov equation, it is straightforward to extend these results to higher dimensions. The small-η limit yields equations similar to Eqs. (24)-(26) that involve integration over the decision boundary of f(W, S). \n\n5 Summary and Discussion \n\nWe have found that the perceptron rule (2) with normalization can lead to a variety of learning curves, depending on the schedule at which the learning rate is decreased. The optimal schedule leads to an inverse power law learning curve, ε_g ~ α^{-1}. Baum's result (Baum, 1990) for a non-normalized perceptron with a constant learning rate can be viewed as a special case of the above analysis. In the non-normalized perceptron algorithm, the magnitude of the student's weights grows with α as |W| ~ α^{1/3}. The time evolution of the overlap R, and thus of the generalization error, is governed by the effective learning rate η_eff = η/|W|, leading via Eq. (6) to the result ε_g ~ α^{-1/3}. Similar results apply to the two-layer parity machine studied in (Kabashima, 1994).
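The self-annealing mechanism behind Baum's power law can be seen in a short simulation. This sketch (our illustration, with dimension, rate, and run length chosen here) runs the non-normalized perceptron at constant η: the weight norm grows, so the effective rate η/|W| decays without any explicit schedule.

```python
import numpy as np

# Non-normalized perceptron with constant rate: |W| grows during learning,
# so eta_eff = eta/|W| falls on its own and the error still decreases.
rng = np.random.default_rng(2)
N = 50
w0 = rng.standard_normal(N)
w0 /= np.linalg.norm(w0)                 # normalized teacher
w = rng.standard_normal(N)               # student, NOT normalized during training

def gen_error(w, w0):
    # Eq. (3) applied to the direction of W only
    c = (w @ w0) / np.linalg.norm(w)
    return np.arccos(np.clip(c, -1.0, 1.0)) / np.pi

err_start = gen_error(w, w0)
norm_start = np.linalg.norm(w)
for t in range(10000):
    s = rng.standard_normal(N)
    sigma0 = np.sign(w0 @ s)
    if np.sign(w @ s) != sigma0:         # constant eta = 1, update on errors only
        w = w + sigma0 * s               # no normalization step
err_end = gen_error(w, w0)
norm_end = np.linalg.norm(w)
```

After many examples the error has dropped well below its initial value while the norm has grown substantially, consistent with the η_eff = η/|W| picture (verifying the precise α^{1/3} growth exponent would require averaging over many runs).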
\nOur analysis, leading to the equations of motion (4) and (14), was based on the limit of large N and P, such \nthat 0 = PIN remains finite. We would like to stress however, that this limit is only necessary in deriving \nthe full form of the learning curve, i.e., R(o) for all o. On the other hand, our results for the large P \nasymptote of the learning curve for smailT/ are valid for finite N as well, as implied by the general treatment \nof the previous section. \nUnrealizable percept ron rules present a more complicated problem. We have presented here a modified \nperceptron algorithm that converges to the optimal solution in the special case of an isotropic output noise. \n\n\fOn-line Learning of Dichotomies \n\n309 \n\nIn this case, the convergence to the optimal error is as 01- 1/ 2 . This is the same power law as obtained in \nthe standard sample complexity upper bounds (Vapnik, 1982) and in the approximate replica symmetric \ncalculations (Seung, 1992) for batch learning of unrealizable rules. It should be stressed however, that the \nsuccess of the modified algorithm in the case of an output noise depends on the fact that the errors made \nby the optimal solution are uncorrelated with the input. Thus, finding an on-line algorithm that can cope \nwith other types of unrealizability remains an important problem. \nThe learning algorithms for the perceptron rule, without and with output noise, can be generalized to learning \nthresholded smooth functions, assuming certain reasonable properties of the input distribution are present, \nas shown in Section 4. The dependence of the learning curve on the learning rate schedule remains roughly \nthe same as in the percept ron case. This implies that on-line learning of realizable dichotomies, with possible \noutput noise, can achieve the same power laws in the number of examples that is typical of batch learning \nof such rules. 
Furthermore, the on-line formulation has the theoretical virtue of addressing time as well as sample complexity, so that the same power laws imply a polynomial relationship between the time and the achieved error level. The above conclusions assume that the equilibrium state at small learning rates is unique, which in general need not be the case; the issue of overcoming local minima in on-line learning is a difficult problem (Heskes, 1992). Finally, the theoretical results for on-line learning have the important advantage of not requiring the use of the often problematic replica formalism. \n\nAcknowledgements \n\nWe are grateful for helpful discussions with Y. Freund, M. Kearns, R. Schapire, and E. Shamir, and thank Y. Kabashima for bringing his paper to our attention. HS is partially supported by the Fund for Basic Research of the Israeli Academy of Arts and Sciences. \n\nReferences \n\nS. Amari, N. Fujita, and S. Shinomoto. Four types of learning curves. Neural Comput., 4:605-618, 1992. \nE. B. Baum. The perceptron algorithm is fast for nonmalicious distributions. Neural Comput., 2:248-260, 1990. \nL. K. Hansen, R. Pathria, and P. Salamon. Stochastic dynamics of supervised learning. J. Phys., A26:63-71, 1993. \nT. Heskes and B. Kappen. Learning processes in neural networks. Phys. Rev., A44:2718-2762, 1991. \nT. Heskes, E. T. P. Slijpen, and B. Kappen. Learning in neural networks with local minima. Phys. Rev., A46:5221-5231, 1992. \nY. Kabashima. Perfect loss of generalization due to noise in k = 2 parity machines. J. Phys., A27:1917-1927, 1994. \nY. Kabashima and S. Shinomoto. Incremental learning with and without queries in binary choice problems. In Proc. of IJCNN, 1993. \nH. J. Kushner and D. S. Clark. Stochastic approximation methods for constrained and unconstrained systems. Springer, Berlin, 1978. \nG. Radons. On stochastic dynamics of supervised learning. J. Phys., A26:3455-3461, 1993. \nH. S. Seung, H. Sompolinsky, and N.
Tishby. Statistical mechanics of learning from examples. Phys. Rev., A45:6056-6091, 1992. \nL. G. Valiant. A theory of the learnable. Commun. ACM, 27:1134-1142, 1984. \nN. G. Van Kampen. Stochastic processes in physics and chemistry. North-Holland, Amsterdam, 1981. \nV. N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982. \n\nFigure 1: Asymptotic performance of a realizable perceptron. Simulation results for η_0 = 2 and N = 50 (solid curve) are compared with the theoretical prediction ε_g = 1.27/α (dashed curve). \n\nFigure 2: Simulation results for on-line learning of a perceptron with output noise. Here η_0 = 2, p = 0.2, N = 250, u = 4, and q_0 = -1.95. The regular perceptron learning (dashed curve) is compared with the modified algorithm (solid curve). The dashed line shows the theoretical prediction of Eq. (18). \n", "award": [], "sourceid": 976, "authors": [{"given_name": "N.", "family_name": "Barkai", "institution": null}, {"given_name": "H.", "family_name": "Seung", "institution": null}, {"given_name": "H.", "family_name": "Sompolinsky", "institution": null}]}