{"title": "Extensions of a Theory of Networks for Approximation and Learning: Outliers and Negative Examples", "book": "Advances in Neural Information Processing Systems", "page_first": 750, "page_last": 756, "abstract": null, "full_text": "Extensions of a Theory of Networks for \n\nApproximation and Learning: Outliers and \n\nNegative Examples \n\nFederico Girosi \nAI Lab. M.I.T. \n\nTomaso Poggio \nAl Lab. M.LT. \n\nCambridge, MA 02139 \n\nCambridge, MA 021:39 \n\nBruno Caprile \n\nI.R.S.T . \n\nPovo, Italy, 38050 \n\nAbstract \n\nLearning an input-output mapping from a set of examples can be regarded \nas synthesizing an approximation of a multi-dimensional function. From \nthis point of view, this form of learning is closely related to regularization \ntheory, and we have previously shown (Poggio and Girosi, 1990a, 1990b) \nthe equivalence between reglilari~at.ioll and a. class of three-layer networks \nthat we call regularization networks. In this note, we ext.end the theory \nby introducing ways of <lealing with t.wo aspect.s of learning: learning in \npresence of unreliable examples or outliel\u00b7s, an<llearning from positive and \nnegative examples. \n\n1 \n\nIntroduction \n\nIn previous papers (Poggio and Girosi, 1990a, 1990b) we have shown the equivalence \nbetween certain regularization techniques and a. cla'3s of tlll\u00b7ee-layer networks - that \nwe call regularization networks - which are relat.ed to the Ra<lial Basis Functions \ninterpolation method (Powell, 1987). In this not.e we indicat.e how it is possible \nto extend our theory of learning in order t.o deal with 1) occurence of unreliable \nexamples, 2) negative examples. Both problems are also interesting from the point \nof view of classical approximation theory: \n\n1. discounting \"bad\" examples cOlTesponds to discarding, in the approximation \n\nof a function, data points that are outliel\u00b7s. \n\n2. learning by using negative examples - in addition to positive ones - corresponds \nto approximating a function considering not only points which the function \n\n750 \n\n\fExtensions of a Theory of Networks for Approximation and Learning \n\n751 \n\nought to be close to, but also point.s - or regions - that the functioll must \navoid. \n\n2 Unreliable data \nSuppose that a set 9 = {(Xi, yd E Rn x R}f:l of data has been obtained by randomly \nsampling a function f, defined in Rn, in presence of noise, in a way that we can \nwrite \n\nYi = f (xd + f i , \n\ni = 1, ... , N \n\nwhere fi are independent random variables. \n\\Ve arc interested in recovering an \nestimate of the function f from the set of data [I. Taking a probabilistic approach, \nwe can regard the function I as the realization of a random field with specified \nprior probability distribut.ion. Consequelltly, the data 9 and the function I are nOll \nindependent random variables, and, by using Bayes rule, it is possible to express \nthe conditional probability P[flg] of t.he function I, given the examples g, in terms \nof the prior probability of f, P[t], and the conditional probability of 9 given f, \nP[glf]: \n\nP[tlg] ex P[gll] P[t]. \n\n(1) \n\nA common choice (Marroquin et. al., 1987) for the prior probability distribut.ion \nP[f] is \n\nwhere P is a differential operator (the so called sta bili:er), 11\u00b711 is the L2 norm, and .x \nis a positive real number. 
If the noise is Gaussian, the probability P[g|f] can be written as:

    P[g|f] ∝ e^{-Σ_{i=1}^N β_i (y_i - f(x_i))^2}    (3)

where β_i = 1/(2σ_i^2) and σ_i is the variance of the noise related to the i-th data point. The values of the variances are usually assumed to be equal to some known value σ that reflects the accuracy of the measurement apparatus. However, in many cases we do not have access to such information, and weaker assumptions have to be made. A fairly natural and general one consists in regarding the variances of the noise, as well as the function f, as random variables. Of course, some a priori knowledge about these variables, represented by an appropriate prior probability distribution, is needed. Let us denote by β the set of random variables {β_i}_{i=1}^N. By means of Bayes rule we can compute the joint probability of the variables f and β. Assuming that the field f and the set β are conditionally independent we obtain:

    P[f, β|g] ∝ P[g|f, β] P[f] P[β]    (4)

where P[β] is the prior probability of the set of variances β and P[g|f, β] is the same as in eq. (3). Given the posterior probability (4) we are mainly interested in computing an estimate of f. Thus what we really need to compute is the marginal posterior probability of f, P_m[f], that is obtained by integrating equation (4) over the variables β_i:

    P_m[f] = ∫ dβ P[f, β|g].    (5)

A simple way to obtain an estimate of the function f from the probability distribution (5) consists in computing the so-called MAP (Maximum A Posteriori) estimate, that is, the function that maximizes the posterior probability P_m[f]. The problem of recovering the function f from the set of data g, with partial information about the amount of Gaussian noise affecting the data, is therefore equivalent to solving an appropriate variational problem. The specific form of the functional that has to be maximized - or minimized - depends on the probability distributions P[f] and P[β].

Here we consider the following situation: we have knowledge that a given percentage, (1 - ε), of the data is characterized by a Gaussian noise distribution of variance σ_1 = (2β_1)^{-1/2}, whereas for the rest of the data the variance of the noise is a very large number σ_2 = (2β_2)^{-1/2} (we will call these data "outliers"). This situation yields the following probability distribution:

    P[β] = Π_{i=1}^N [(1 - ε) δ(β_i - β_1) + ε δ(β_i - β_2)].    (6)

In this case, choosing P[f] as in eq. (2), we can show that P_m[f] ∝ e^{-H[f]}, where

    H[f] = Σ_{i=1}^N V(Δ_i) + λ ||Pf||^2    (7)

with Δ_i = y_i - f(x_i). Here V represents the effective potential

    V(x) = -ln [(1 - ε) e^{-β_1 x^2} + ε e^{-β_2 x^2}],    (8)

depicted in fig. (1) for different values of β_2.

Figure 1: The effective potential V(x) for ε = 0.1, β_1 = 3.0 and three different values of β_2: 0.1, 0.03, 0.001.
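(An illustrative sketch, ours and not part of the original paper.) It evaluates the effective potential of eq. (8) for the parameter values of fig. (1): near the origin V(x) is approximately quadratic, with curvature set by β_1, while for large |x| it approaches -ln(ε) + β_2 x^2, essentially a constant when β_2 is small, so that distant points exert almost no force on the solution:

import numpy as np

def effective_potential(x, eps=0.1, b1=3.0, b2=0.001):
    # eq. (8): quadratic well near zero, nearly flat tails for small b2
    return -np.log((1 - eps) * np.exp(-b1 * x**2) + eps * np.exp(-b2 * x**2))

x = np.linspace(-4.0, 4.0, 9)
for b2 in (0.1, 0.03, 0.001):
    print("beta2 =", b2, "->", np.round(effective_potential(x, b2=b2), 2))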
The MAP estimate is therefore obtained by minimizing the functional (7). The first term enforces closeness to the data, while the second term enforces smoothness of the solution, the trade-off between these two opposite tendencies being controlled by the parameter λ. Looking at fig. (1) we notice that, in the limit β_2 → 0, the effective potential V is quadratic if the absolute value of its argument is smaller than a threshold, and constant otherwise. Therefore, data points are taken into account when the interpolation error is smaller than a threshold, and their contribution is neglected otherwise.

If β_1 = β_2 = β̄, that is, if the distribution of the variables β_i is a delta function centered on some value β̄, the effective potential V(x) = β̄ x^2 is obtained. Therefore, this method becomes equivalent to the so-called "regularization technique" (Tikhonov and Arsenin, 1977) that has been extensively used to solve ill-posed problems, of which the one we have just outlined is a particular example (Poggio and Girosi, 1990a, 1990b). Suitable choices of the distribution P[β] result in other effective potentials (for example, the potential V(x) = √(1 + x^2) can be obtained), and the corresponding estimators turn out to be similar to the well-known robust smoothing splines (Eubank, 1988).

The functional (7), with the choice expressed by eq. (2), admits a simple physical interpretation. Let us consider for simplicity a function defined on a one-dimensional lattice. The value of the function f(x_i) at site i is regarded as the position of a particle that can move only in the vertical direction. The particle is attracted - according to a spring-like potential V - towards the data point and the neighboring particles as well. The natural trend of the system will be to minimize its total energy which, in this scheme, is expressed by the functional (7): the first term is associated with the springs connecting the particles to the data points, and the second one, being associated with the springs connecting neighboring particles, enforces the smoothness of the final configuration. Notice that the potential energy of the springs connecting the particles to the data points is not quadratic, as it is for "standard" springs, and this results in a non-linear relationship between the force and the elongation. The potential energy becomes constant when the elongation is larger than a fixed threshold, and the force (which is proportional to the first derivative of the potential energy) goes to zero. In this sense we can say that the springs "break" when we try to stretch them too much (Geiger and Girosi, 1990). A sketch of this mechanical picture is given below.
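(An illustrative sketch, ours and not from the paper.) It renders the mechanical picture above as gradient descent on a discretized version of the functional (7) for a one-dimensional lattice, with the effective potential (8) acting as the data springs and ordinary quadratic springs coupling neighboring particles. The grossly wrong data value 5.0, the value of λ, the zero starting configuration and the step size are our own choices, made so that the breaking of one spring is easy to see:

import numpy as np

eps, b1, b2, lam = 0.1, 3.0, 0.001, 0.5

def dV(x):
    # force of a data spring: derivative of the effective potential (8)
    a = (1 - eps) * np.exp(-b1 * x**2)
    b = eps * np.exp(-b2 * x**2)
    return (2 * b1 * x * a + 2 * b2 * x * b) / (a + b)

xs = np.linspace(-1.0, 1.0, 7)
y = np.cos(xs); y[3] = 5.0         # one grossly wrong data point
f = np.zeros_like(y)               # particle positions, all starting at zero
for _ in range(300):               # plain gradient descent on the energy
    grad = dV(f - y)                                        # data springs
    grad[1:-1] += 2 * lam * (2 * f[1:-1] - f[:-2] - f[2:])  # neighbor springs
    grad[0] += 2 * lam * (f[0] - f[1])
    grad[-1] += 2 * lam * (f[-1] - f[-2])
    f -= 0.05 * grad
# f follows the data except at site 3, whose spring operates in the flat
# ("broken") region of V and cannot pull the particle up to 5.0:
print(np.round(f, 2))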
3 Negative examples

In many situations, further information about a function may consist in knowing that its value at some given point has to be far from a given value (which, in this context, can be considered as a "negative example"). We shall account for the presence of negative examples by adding to the functional (7) a quadratic repulsive term for each negative example (for a related trick, see Kass et al., 1987). However, the introduction of such a "repulsive spring" may make the functional (7) unbounded from below, because the repulsive terms tend to push the value of the function to infinity. The simplest way to prevent this occurrence is either to allow the spring constant to decrease with increasing elongation or, in the extreme case, to let the spring break at some point. Hence, we can use the same model of nonlinear spring of the previous section, and just reverse the sign of the associated potential. If {(t_a, y_a) ∈ R^n × R}_{a=1}^K is the set of negative examples, and if we define Δ_a = y_a - f(t_a), the functional (7) becomes:

    H[f] = Σ_{i=1}^N V(Δ_i) - Σ_{a=1}^K V(Δ_a) + λ ||Pf||^2.

4 Solution of the variational problem

An exhaustive discussion of the solution of the variational problem associated with the functional (7) cannot be given here. We refer the reader to the papers of Poggio and Girosi (1990a, 1990b) and Girosi, Poggio and Caprile (1990), and just sketch the form of the solution. In both cases of unreliable and negative data, it can be shown that the solution of the variational problem always has the form

    f*(x) = Σ_{i=1}^N c_i G(x; x_i) + Σ_{i=1}^k a_i φ_i(x)    (9)

where G is the Green's function of the operator P̂P (P̂ denoting the adjoint operator of P), {φ_i(x)}_{i=1}^k is a basis of functions for the null space of P (usually polynomials of low degree), and {c_i}_{i=1}^N and {a_i}_{i=1}^k are coefficients to be computed. Substituting the expansion (9) in the functional (7), the function H*(c, a) = H[f*] is defined. The vectors c and a can then be found by minimizing the function H*(c, a).

We shall finally notice that the solution (9) has a simple interpretation in terms of feedforward networks with one layer of hidden units, of the same class as the regularization networks introduced in previous papers (Poggio and Girosi, 1990a, 1990b). The only difference between these networks and the regularization networks previously introduced consists in the function that has to be minimized in order to find the weights of the network.

5 Experimental Results

In this section we report two examples of the application of these techniques to very simple one-dimensional problems.

5.1 Unreliable data

The data set consisted of seven examples, randomly taken within the interval [-1, 1] from the graph of f(x) = cos(x). In order to create an outlier in the data set, the value of the fourth point was substituted with the value 1.5. The Green's function of the problem was a Gaussian of variance σ = 0.3, the parameter ε was set to 0.1, the value of the regularization parameter λ was 10^{-2}, and the parameters β_1 and β_2 were set respectively to 10.0 and 0.003. With this choice of the parameters the effective potential was approximately constant for values of its argument larger than 1. In figure (2a) we show the result that is obtained after only 10 iterations of gradient descent: the spring of the outlier breaks, and it does not influence the solution any more. The "hole" that the solution shows near the outlier is a combined effect of the fact that the variance of the Gaussian Green's function is small (σ = 0.3), and of the lack of data next to the outlier itself. A sketch of this procedure is given below.
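(An illustrative sketch, ours and not the authors' code.) It reproduces the setup of this experiment qualitatively: the solution is expanded as in eq. (9) with Gaussian Green's functions (the null-space term is omitted for simplicity) and H*(c) is minimized by plain gradient descent, with λ c^T G c standing in for the smoothness term. The model parameters are those quoted above; the data locations, initialization, step size and iteration count are our own assumptions:

import numpy as np

X = np.linspace(-1.0, 1.0, 7)
y = np.cos(X); y[3] = 1.5          # the fourth point is the outlier

sigma, eps, lam, b1, b2 = 0.3, 0.1, 1e-2, 10.0, 0.003
G = np.exp(-(X[:, None] - X[None, :])**2 / (2 * sigma**2))  # Green's matrix

def dV(r):
    # derivative of the effective potential, eq. (8)
    a = (1 - eps) * np.exp(-b1 * r**2)
    b = eps * np.exp(-b2 * r**2)
    return (2 * b1 * r * a + 2 * b2 * r * b) / (a + b)

c = np.zeros(7)
for _ in range(5000):                       # gradient descent on H*(c)
    r = G @ c - y                           # the residuals -Delta_i
    grad = G.T @ dV(r) + 2 * lam * (G @ c)  # data term + smoothness term
    c -= 0.01 * grad

# the residual is expected to remain largest at the outlier, whose spring
# operates in the flat region of the effective potential:
print(np.round(G @ c - y, 2))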
5.2 Negative examples

Again, the data to be approximated came from a random sampling of the function f(x) = cos(x) in the interval [-1, 1]. The fourth data point was selected as the negative example, and the parameters were set in such a way that its spring would break when the elongation exceeded the value 1. In figure (2b) we show a result obtained with 500 iterations of a stochastic gradient descent algorithm, with a Gaussian Green's function of variance σ = 0.4.

Acknowledgements. We thank Cesare Furlanello for useful discussions and for a critical reading of the manuscript.

Figure 2: (a) Approximation in the presence of an outlier (the data point whose value is 1.5). (b) Approximation in the presence of a negative example.

References

[1] R. L. Eubank. Spline Smoothing and Nonparametric Regression, volume 90 of Statistics: Textbooks and Monographs. Marcel Dekker, Inc., New York, 1988.

[2] D. Geiger and F. Girosi. Parallel and deterministic algorithms for MRFs: surface reconstruction and integration. In O. Faugeras, editor, Lecture Notes in Computer Science, Vol. 427: Computer Vision - ECCV 90. Springer-Verlag, Berlin, 1990.

[3] F. Girosi, T. Poggio, and B. Caprile. Extensions of a theory of networks for approximation and learning: outliers and negative examples. A.I. Memo 1220, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1990.

[4] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: active contour models. In Proceedings of the First International Conference on Computer Vision, London, 1987. IEEE Computer Society Press, Washington, D.C.

[5] J. L. Marroquin, S. Mitter, and T. Poggio. Probabilistic solution of ill-posed problems in computational vision. J. Amer. Stat. Assoc., 82:76-89, 1987.

[6] T. Poggio and F. Girosi. A theory of networks for learning. Science, 247:978-982, 1990a.

[7] T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990b.

[8] M. J. D. Powell. Radial basis functions for multivariable interpolation: a review. In J. C. Mason and M. G. Cox, editors, Algorithms for Approximation. Clarendon Press, Oxford, 1987.

[9] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-posed Problems. W. H. Winston, Washington, D.C., 1977.
", "award": [], "sourceid": 343, "authors": [{"given_name": "Federico", "family_name": "Girosi", "institution": null}, {"given_name": "Tomaso", "family_name": "Poggio", "institution": null}, {"given_name": "Bruno", "family_name": "Caprile", "institution": null}]}