{"title": "Learning, Regularization and Ill-Posed Inverse Problems", "book": "Advances in Neural Information Processing Systems", "page_first": 1145, "page_last": 1152, "abstract": null, "full_text": " Learning, Regularization and Ill-Posed Inverse Problems

Lorenzo Rosasco, DISI, Università di Genova, Genova, I, rosasco@disi.unige.it
Andrea Caponnetto, DISI, Università di Genova, Genova, I, caponnetto@disi.unige.it
Ernesto De Vito, Dipartimento di Matematica, Università di Modena, and INFN, Sezione di Genova, Genova, I, devito@unimo.it
Umberto De Giovannini, DISI, Università di Genova, Genova, I, umberto.degiovannini@fastwebnet.it
Francesca Odone, DISI, Università di Genova, Genova, I, odone@disi.unige.it

Abstract

Many works have shown that strong connections relate learning from examples to regularization techniques for ill-posed inverse problems. Nevertheless, until now there has been no formal evidence either that learning from examples can be seen as an inverse problem, or that theoretical results in learning theory can be derived independently using tools from regularization theory. In this paper we provide a positive answer to both questions. Indeed, considering the square loss, we translate the learning problem into the language of regularization theory and show that consistency results and an optimal regularization parameter choice can be derived by discretizing the corresponding inverse problem.

1 Introduction

The main goal of learning from examples is to infer an estimator, given a finite sample of data drawn according to a fixed but unknown probabilistic input-output relation. The desired property of the selected estimator is that it performs well on new data, i.e. it should generalize.
The fundamental works of Vapnik and further developments [16], [8], [5] show that the key to obtaining a meaningful solution to the above problem is to control the complexity of the solution space. Interestingly, as noted by [12], [8], [2], this is the idea underlying regularization techniques for ill-posed inverse problems [15], [7]. In that context, to avoid undesired oscillating behavior of the solution, we have to restrict the solution space.

Not surprisingly, the form of the algorithms proposed in the two theories is strikingly similar. Nevertheless, a careful analysis shows that a rigorous connection between learning and regularization for inverse problems is not straightforward. In this paper we consider the square loss and show that the problem of learning can be translated into a convenient inverse problem, and that consistency results can be derived in a general setting. When a generic loss is considered, the analysis immediately becomes more complicated.

Some previous works on this subject considered the special case in which the elements of the input space are fixed and not probabilistically drawn [11], [9]. Some weaker results in the same spirit as those presented in this paper can be found in [13], where, however, the connection with inverse problems is not discussed. Finally, our analysis is close to the idea of stochastic inverse problems discussed in [16]. The plan of the paper follows. After recalling the main concepts and notation of learning and of inverse problems, in Section 4 we develop a formal connection between the two theories. In Section 5 the main results are stated and discussed. Finally, in Section 6 we conclude with some remarks and open problems.

2 Learning from examples

We briefly recall some basic concepts of learning theory [16], [8]. In the framework of learning, there are two sets of variables: the input space X, a compact subset of Rⁿ, and the output space Y, a compact subset of R.
The relation between the input x ∈ X and the output y ∈ Y is described by a probability distribution ρ(x, y) = ν(x)ρ(y|x) on X × Y. The distribution ρ is known only through a sample z = (x, y) = ((x₁, y₁), . . . , (x_ℓ, y_ℓ)), called the training set, drawn i.i.d. according to ρ. The goal of learning is, given the sample z, to find a function f_z : X → R such that f_z(x) is an estimate of the output y when a new input x is given. The function f_z is called the estimator, and the rule that, given a sample z, provides us with f_z is called the learning algorithm.

Given a measurable function f : X → R, the ability of f to describe the distribution ρ is measured by its expected risk, defined as

    I[f] = ∫_{X×Y} (f(x) − y)² dρ(x, y).

The regression function

    g(x) = ∫_Y y dρ(y|x)

is the minimizer of the expected risk over the set of all measurable functions, and it always exists since Y is compact. Usually the regression function cannot be reconstructed exactly, since we are given only a finite, possibly small, set of examples z. To overcome this problem, in the regularized least squares algorithm a hypothesis space H is fixed and, given λ > 0, an estimator f_z^λ is defined as the solution of the regularized least squares problem

    min_{f∈H} { (1/ℓ) Σ_{i=1}^{ℓ} (f(x_i) − y_i)² + λ‖f‖²_H }.   (1)

The regularization parameter has to be chosen depending on the available data, λ = λ(ℓ, z), in such a way that, for every ε > 0,

    lim_{ℓ→+∞} P[ I[f_z^{λ(ℓ,z)}] − inf_{f∈H} I[f] > ε ] = 0.   (2)

We note that in general inf_{f∈H} I[f] is larger than I[g] and represents a sort of irreducible error associated with the choice of the space H. The above convergence in probability is usually called consistency of the algorithm [16], [14].

3 Ill-Posed Inverse Problems and Regularization

In this section we give a very brief account of linear inverse problems and regularization theory [15], [7]. Let H and K be two Hilbert spaces and A : H → K a linear bounded operator.
Consider the equation

    Af = g   (3)

where g, g_δ ∈ K and ‖g − g_δ‖_K ≤ δ. Here g represents the exact, unknown data and g_δ the available, noisy data. Finding the function f satisfying the above equation, given A and g_δ, is the linear inverse problem associated with Eq. (3). The above problem is, in general, ill-posed: a solution may not exist, may not be unique, and may not depend continuously on the data. Uniqueness can be restored by introducing the Moore-Penrose generalized solution f† = A†g, defined as the minimum norm solution of the problem

    min_{f∈H} ‖Af − g‖²_K.   (4)

However, the generalized inverse A† is usually not bounded, so, in order to ensure a continuous dependence of the solution on the data, the following Tikhonov regularization scheme can be considered¹

    min_{f∈H} { ‖Af − g_δ‖²_K + λ‖f‖²_H },   (5)

whose unique minimizer is given by

    f_λ^δ = (A*A + λI)⁻¹ A* g_δ,   (6)

where A* denotes the adjoint of A.

A crucial step in the above algorithm is the choice of the regularization parameter λ = λ(δ, g_δ), as a function of the noise level δ and the data g_δ, in such a way that

    lim_{δ→0} ‖ f_{λ(δ,g_δ)}^δ − f† ‖_H = 0,   (7)

that is, the regularized solution f_{λ(δ,g_δ)}^δ converges to the generalized solution f† = A†g (f† exists if and only if Pg ∈ Range(A), where P is the projection on the closure of the range of A, and in that case Af† = Pg) when the noise goes to zero.

The similarity between the regularized least squares algorithm (1) and Tikhonov regularization (5) is apparent. However, several difficulties emerge. First, to treat the problem of learning in the setting of ill-posed inverse problems, we have to define a direct problem by means of a suitable operator A. Second, in the context of learning, the nature of the noise δ is not clear. Finally, we have to clarify the relation between consistency (2) and the kind of convergence expressed by (7).
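The Tikhonov scheme (5)-(6) is easy to probe numerically. The sketch below is our own construction, not from the paper: the operator A, the true solution, and the noise level are illustrative choices. It builds an ill-conditioned finite-dimensional A and compares the regularized solution (6), with λ of the order of the noise level δ, against the unregularized least squares solution, which is destabilized by the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Ill-conditioned operator: orthogonal factors with fast-decaying singular values
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = 2.0 ** -np.arange(n)            # singular values from 1 down to ~1.8e-15
A = U @ np.diag(s) @ V.T

f_true = V[:, :5] @ np.ones(5)      # true solution supported on the top modes
g = A @ f_true                      # exact data
delta = 1e-6
g_delta = g + delta * rng.standard_normal(n)  # noisy data, ||g_delta - g|| ~ delta

def tikhonov(A, g, lam):
    # f_lambda = (A* A + lam I)^{-1} A* g, as in Eq. (6)
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ g)

f_naive = np.linalg.lstsq(A, g_delta, rcond=None)[0]  # unregularized inversion
f_reg = tikhonov(A, g_delta, lam=delta)               # lam chosen ~ noise level

err_naive = np.linalg.norm(f_naive - f_true)
err_reg = np.linalg.norm(f_reg - f_true)
```

On this example the regularized reconstruction error stays small, while the naive inversion amplifies the noise through the tiny singular values; this is exactly the failure of continuous dependence that motivates (5).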
In the following sections we will show a possible way to tackle these problems.

4 Learning as an Inverse Problem

We can now show how the problem of learning can be rephrased in a framework close to the one presented in the previous section. We assume that the hypothesis space H is a reproducing kernel Hilbert space [1] with a continuous kernel K : X × X → R. If x ∈ X, we let K_x(s) = K(s, x) and, if ν is the marginal distribution of ρ on X, we define the bounded linear operator A : H → L²(X, ν) as

    (Af)(x) = ⟨f, K_x⟩_H = f(x),

that is, A is the canonical injection of H into L²(X, ν).

¹In the framework of inverse problems, many other regularization procedures have been introduced [7]. For simplicity we treat only Tikhonov regularization.

In particular, for all f ∈ H, the expected risk becomes

    I[f] = ‖Af − g‖²_{L²(X,ν)} + I[g],

where g is the regression function [2]. The above equation clarifies that if the expected risk admits a minimizer f_H on the hypothesis space H, then f_H is exactly the generalized solution² f† = A†g of the problem

    Af = g.   (8)

Moreover, given a training set z = (x, y), we get a discretized version A_x : H → E of A, that is,

    (A_x f)_i = ⟨f, K_{x_i}⟩_H = f(x_i),

where E = R^ℓ is the finite dimensional Euclidean space endowed with the scalar product

    ⟨y, y′⟩_E = (1/ℓ) Σ_{i=1}^{ℓ} y_i y′_i.

It is straightforward to check that

    (1/ℓ) Σ_{i=1}^{ℓ} (f(x_i) − y_i)² = ‖A_x f − y‖²_E,

so that the estimator f_z^λ given by the regularized least squares algorithm is the regularized solution of the discrete problem

    A_x f = y.   (9)

At this point it is useful to remark on the following two facts. First, in learning from examples we are not interested in finding an approximation of the generalized solution of the discretized problem (9); rather, we want to find a stable approximation of the solution of the exact problem (8) (compare with [9]).
Second, we notice that in learning theory the consistency property (2) involves the control of the quantity

    I[f_z^λ] − inf_{f∈H} I[f] = ‖A f_z^λ − g‖²_{L²(X,ν)} − inf_{f∈H} ‖A f − g‖²_{L²(X,ν)}.   (10)

If P is the projection on the closure of the range of A, the definition of P gives

    I[f_z^λ] − inf_{f∈H} I[f] = ‖A f_z^λ − P g‖²_{L²(X,ν)}   (11)

(the above equality strongly depends on the fact that the loss function is the square loss). In the inverse problem setting, the square root of the above quantity is called the residue of the solution f_z^λ. Hence consistency is controlled by the residue of the estimator, instead of the reconstruction error ‖f_z^λ − f†‖_H (as in inverse problems). In particular, consistency is a weaker condition than the one required by (7) and does not require the existence of the generalized solution f_H.

5 Regularization, Stochastic Noise and Consistency

To apply the framework of ill-posed inverse problems of Section 3 to the formulation of learning proposed above, we note that the operator A_x in the discretized problem (9) differs from the operator A in the exact problem (8), so a measure of the difference between A_x and A is required. Moreover, the noisy data y ∈ E and the exact data g ∈ L²(X, ν) belong to different spaces, so the notion of noise has to be modified.
Given the above premise, our derivation of consistency results is developed in two steps: we first study the residue of the solution by means of a measure of the noise due to discretization, and then we show a possible way to give a probabilistic evaluation of the noise previously introduced.

²The fact that f_H is the minimal norm solution of (4) is ensured by the assumption that the support of the measure ν is X, since in this case the operator A is injective.

5.1 Bounding the Residue of the Regularized Solution

We recall that the regularized solutions of problems (9) and (8) are given by

    f_z^λ = (A_x* A_x + λI)⁻¹ A_x* y,
    f^λ = (A* A + λI)⁻¹ A* g.

The above equations show that f_z^λ and f^λ depend only on A_x*A_x and A*A, which are operators from H into H, and on A_x*y and A*g, which are elements of H, so that the space E disappears. This observation suggests that the noise levels could be ‖A_x*A_x − A*A‖_{L(H)} and ‖A_x*y − A*g‖_H, where ‖·‖_{L(H)} is the uniform operator norm. To this purpose, for every δ = (δ₁, δ₂) ∈ R²₊ we define the collection of training sets

    U_δ = { z ∈ (X × Y)^ℓ | ‖A_x*y − A*g‖_H ≤ δ₁, ‖A_x*A_x − A*A‖_{L(H)} ≤ δ₂, ℓ ∈ N },

and we let M = sup{ |y| : y ∈ Y }. The next theorem is the central result of the paper.

Theorem 1 If λ > 0, the following inequalities hold:

    1. for any training set z ∈ U_δ,

       ‖A f_z^λ − P g‖_{L²(X,ν)} − ‖A f^λ − P g‖_{L²(X,ν)} ≤ (M/(4λ)) δ₂ + (1/(2√λ)) δ₁;

    2. if P g ∈ Range(A), for any training set z ∈ U_δ,

       ‖f_z^λ − f†‖_H − ‖f^λ − f†‖_H ≤ (M/(2λ^{3/2})) δ₂ + (1/(2λ)) δ₁.

Moreover, if we choose λ = λ(δ, z) in such a way that

    lim_{δ→0} sup_{z∈U_δ} λ(δ, z) = 0,
    lim_{δ→0} sup_{z∈U_δ} δ₁/√(λ(δ, z)) = 0,   (12)
    lim_{δ→0} sup_{z∈U_δ} δ₂/λ(δ, z) = 0,

then

    lim_{δ→0} sup_{z∈U_δ} ‖A f_z^{λ(δ,z)} − P g‖_{L²(X,ν)} = 0.   (13)

We omit the complete proof and refer to [3]. Briefly, the idea is to note that

    ‖A f_z^λ − P g‖_{L²(X,ν)} − ‖A f^λ − P g‖_{L²(X,ν)} ≤ ‖A f_z^λ − A f^λ‖_{L²(X,ν)} = ‖(A*A)^{1/2} (f_z^λ − f^λ)‖_H,

where the last equality follows from the polar decomposition of the operator A.
Moreover, a simple algebraic computation gives

    f_z^λ − f^λ = (A*A + λI)⁻¹(A*A − A_x*A_x)(A_x*A_x + λI)⁻¹ A_x*y + (A*A + λI)⁻¹(A_x*y − A*g),

where the relevant quantities for the definition of the noise appear.

The first item in the above theorem quantifies the difference between the residues of the regularized solutions of the exact and discretized problems in terms of the noise level δ = (δ₁, δ₂). As mentioned before, this is exactly the kind of result needed to derive consistency. On the other hand, the last part of the theorem gives sufficient conditions on the parameter λ to ensure convergence of the residue to zero as the noise level decreases. The above results were obtained by introducing the collection U_δ of training sets compatible with a certain noise level δ. It remains to quantify the noise level corresponding to a training set of cardinality ℓ. This will be achieved in a probabilistic setting in the next section.

5.2 Stochastic Evaluation of the Noise

In this section we estimate the discretization noise δ = (δ₁, δ₂).

Theorem 2 Let η₁, η₂ > 0 and κ = sup_{x∈X} √K(x, x); then

    P[ ‖A*g − A_x*y‖_H ≤ Mκ/√ℓ + η₁, ‖A*A − A_x*A_x‖_{L(H)} ≤ κ²/√ℓ + η₂ ] ≥ 1 − e^{−ℓη₁²/(2κ²M²)} − e^{−ℓη₂²/(2κ⁴)}.   (14)

The proof is given in [3] and is based on the McDiarmid inequality [10] applied to the random variables

    F(z) = ‖A_x*y − A*g‖_H,   G(z) = ‖A_x*A_x − A*A‖_{L(H)}.

Other estimates of the noise can be given using, for example, union bounds and Hoeffding's inequality.
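The noise level δ₁ = ‖A_x*y − A*g‖_H can also be probed numerically. Since ⟨K_a, K_b⟩_H = K(a, b) by the reproducing property, the H-norm reduces to Gram-matrix sums, and the exact term A*g = E[y K_x] can be approximated by a large Monte-Carlo reference sample. The sketch below is entirely illustrative: the distribution, kernel, and sample sizes are our choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(2)

def K(a, b, sigma=0.5):
    # Gaussian kernel K(s, x) = exp(-|s - x|^2 / (2 sigma^2)) on 1-D inputs
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * sigma ** 2))

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + 0.05 * rng.standard_normal(n)
    return x, y

# Large reference sample standing in for the exact data A* g = E[y K_x]
m = 2000
xr, yr = sample(m)
ref_term = (yr @ K(xr, xr) @ yr) / m ** 2   # computed once, reused below

def delta1(l):
    # ||A_x* y - A* g||_H^2 expanded into three Gram-matrix sums via
    # <K_a, K_b>_H = K(a, b); max(., 0) guards against round-off
    x, y = sample(l)
    t = (y @ K(x, x) @ y) / l ** 2 \
        - 2 * (y @ K(x, xr) @ yr) / (l * m) \
        + ref_term
    return np.sqrt(max(t, 0.0))

small = np.mean([delta1(25) for _ in range(20)])
large = np.mean([delta1(400) for _ in range(20)])
```

Averaged over repetitions, δ₁ shrinks as ℓ grows, consistent with the 1/√ℓ concentration rate behind Theorem 2.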
However, rather than providing a tight analysis, our concern was to find a natural, explicit and easy to prove estimate of δ.

5.3 Consistency and Regularization Parameter Choice

Combining Theorems 1 and 2, we easily derive the following corollary.

Corollary 1 Given 0 < η < 1, with probability greater than 1 − η,

    ‖A f_z^λ − P g‖_{L²(X,ν)} − ‖A f^λ − P g‖_{L²(X,ν)} ≤ (Mκ/√ℓ) ( κ/(4λ) + 1/(2√λ) ) ( 1 + √(2 log(4/η)) )   (15)

for all λ > 0.

Recalling (10) and (11), it is straightforward to check that the above inequality can easily be restated in the usual learning notation; in fact, we obtain

    I[f_z^λ] ≤ ( S(ℓ, λ, η) + ‖A f^λ − P g‖_{L²(X,ν)} )² + inf_{f∈H} I[f],

where S(ℓ, λ, η) = (Mκ/√ℓ)(κ/(4λ) + 1/(2√λ))(1 + √(2 log(4/η))) plays the role of the sample error, ‖A f^λ − P g‖_{L²(X,ν)} is the approximation error, and inf_{f∈H} I[f] is the irreducible error.

If we choose the regularization parameter so that λ = λ(ℓ, z) = O(ℓ^{−b}), with 0 < b < 1/2, the sample error converges in probability to zero with order O(ℓ^{−(1−2b)/2}) when ℓ → ∞. On the other hand, the second term represents the approximation error, and it is possible to show, using standard results from spectral theory, that it vanishes as λ goes to zero [7]. Finally, the last term represents the minimum attainable risk once the hypothesis space H has been chosen. From the above observations it is clear that consistency is ensured once the parameter λ is chosen according to the aforementioned conditions. Nonetheless, to provide convergence rates it is necessary to control the convergence rate of the approximation error. Unfortunately, it is well known that this can be accomplished only by making some assumptions on the underlying probability distribution (see for example [2]). It can be shown that if the explicit dependence of the approximation error on λ is not available, we cannot determine an optimal a priori (data independent) dependency λ = λ(ℓ) for the regularization parameter.
Nevertheless, a posteriori (data dependent) choices λ = λ(ℓ, z) can be considered to automatically achieve the optimal convergence rate [5], [6]. With respect to this last fact, we notice that the set of samples such that inequality (14) holds depends on ℓ and η, but does not depend on λ, so that we can consider a posteriori parameter choices without any further effort (compare with [4], [5]).

Finally, we notice that the estimate (15) is the result of two different procedures: Theorem 1, which is of functional type, gives the dependence of the bound on the regularization parameter λ and on the noise levels ‖A_x*A_x − A*A‖_{L(H)} and ‖A_x*y − A*g‖_H, whereas Theorem 2, which is of probabilistic nature, relates the noise levels to the number ℓ of data and the confidence level η.

6 Conclusions

In this paper we defined a direct and an inverse problem suitable for the learning problem and derived consistency results for the regularized least squares algorithm. Though our analysis formally explains the connections between learning theory and linear inverse problems, its main limitation is that we considered only the square loss. We briefly sketch how the arguments presented in the paper extend to general loss functions. For the sake of simplicity we consider a differentiable loss function V. It is easy to see that the minimizer f_H of the expected risk satisfies the equation

    S f_H = 0,   (16)

where S = L_K ∘ O, L_K is the integral operator with kernel K, that is,

    (L_K f)(x) = ∫_X K(x, s) f(s) dν(s),

and O is the operator defined by

    (O f)(x) = ∫_Y V′(y, f(x)) dρ(y|x).

If we consider a generic differentiable loss, the operator O, and hence S, is nonlinear, and estimating f_H becomes an ill-posed nonlinear inverse problem. It is well known that the theory for this kind of problem is much less developed than the corresponding theory for linear problems.
Moreover, since in general I[f] does not define a metric, the relation between the expected risk and the residue is not clear. It appears evident that the attempt to extend our results to a wider class of loss functions is not straightforward. A possible way to tackle the problem, further developing our analysis, might pass through the exploitation of a natural convexity assumption on the loss function. Future work also aims to derive tighter probabilistic bounds on the noise using recently proposed data dependent techniques.

Acknowledgments

We would like to thank M. Bertero, C. De Mol, M. Piana, T. Poggio, G. Talenti and A. Verri for useful discussions and suggestions. This research has been partially funded by the INFM Project MAIA, the FIRB Project ASTA2 and the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778.

References

[1] N. Aronszajn. Theory of reproducing kernels. Trans. Amer. Math. Soc., 68:337-404, 1950.

[2] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bull. Amer. Math. Soc. (N.S.), 39(1):1-49 (electronic), 2002.

[3] E. De Vito, A. Caponnetto, and L. Rosasco. Discretization error analysis for Tikhonov regularization. Submitted to Inverse Problems, 2004. Available at http://www.disi.unige.it/person/RosascoL/publications/discre iop.pdf.

[4] E. De Vito, A. Caponnetto, and L. Rosasco. Model selection for the regularized least-squares algorithm in learning theory. To appear in Journal of Machine Learning Research, 2004.

[5] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.

[6] E. Schock and S. V. Pereverzev. On the adaptive selection of the parameter in regularization of ill-posed problems. Technical report, University of Kaiserslautern, 2003.

[7] Heinz W. Engl, Martin Hanke, and Andreas Neubauer.
Regularization of Inverse Problems, volume 375 of Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.

[8] Theodoros Evgeniou, Massimiliano Pontil, and Tomaso Poggio. Regularization networks and support vector machines. Adv. Comput. Math., 13(1):1-50, 2000.

[9] Vera Kurkova. Supervised learning as an inverse problem. Technical Report 960, Institute of Computer Science, Academy of Sciences of the Czech Republic, April 2004.

[10] Colin McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, 1989 (Norwich, 1989), volume 141 of London Math. Soc. Lecture Note Ser., pages 148-188. Cambridge Univ. Press, Cambridge, 1989.

[11] S. Mukherjee, P. Niyogi, T. Poggio, and R. Rifkin. Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Technical Report CBCL Paper 223, Massachusetts Institute of Technology, January 2004 revision.

[12] T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE, 78:1481-1497, 1990.

[13] Cynthia Rudin. A different type of convergence for statistical learning algorithms. Technical report, Program in Applied and Computational Mathematics, Princeton University, 2004.

[14] I. Steinwart. Consistency of support vector machines and other regularized kernel machines. IEEE Transactions on Information Theory, 2004. (Accepted.)

[15] Andrey N. Tikhonov and Vasiliy Y. Arsenin. Solutions of Ill-Posed Problems. V. H. Winston & Sons, Washington, D.C.; John Wiley & Sons, New York, 1977. Translated from the Russian; preface by translation editor Fritz John; Scripta Series in Mathematics.

[16] Vladimir N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York, 1998.
A Wiley-Interscience Publication.\n\n\f\n", "award": [], "sourceid": 2722, "authors": [{"given_name": "Lorenzo", "family_name": "Rosasco", "institution": null}, {"given_name": "Andrea", "family_name": "Caponnetto", "institution": null}, {"given_name": "Ernesto", "family_name": "Vito", "institution": null}, {"given_name": "Francesca", "family_name": "Odone", "institution": null}, {"given_name": "Umberto", "family_name": "Giovannini", "institution": null}]}