{"title": "Learning How to Teach or Selecting Minimal Surface Data", "book": "Advances in Neural Information Processing Systems", "page_first": 364, "page_last": 371, "abstract": null, "full_text": "Learning How To Teach \n\nor \n\nSelecting Minimal Surface Data \n\nDavi Geiger \n\nSiemens Corporate Research, Inc. \n755 College Rd. East \nPrinceton, NJ 08540 \nUSA \n\nRicardo A. Marques Pereira \n\nDipartimento di Informatica \nUniversita di Trento \nVia Inama 7, Trento, TN 38100 \nITALY \n\nAbstract \n\nLearning a map from an input set to an output set is similar to the problem of reconstructing hypersurfaces from sparse data (Poggio and Girosi, 1990). In this framework, we discuss the problem of automatically selecting \"minimal\" surface data. The objective is to be able to approximately reconstruct the surface from the selected sparse data. We show that this problem is equivalent to the one of compressing information by data removal and to the one of learning how to teach. Our key step is to introduce a process that statistically selects the data according to the model. During the process of data selection (learning how to teach) our system (teacher) is capable of predicting the new surface, the approximated one provided by the selected data. We concentrate on piecewise smooth surfaces, e.g. images, and use mean field techniques to obtain a deterministic network that is shown to compress image data. \n\n1 Learning and surface reconstruction \n\nGiven dense input data that represent a hypersurface, how can we automatically select very few data points such that these fewer data points (sparse data) suffice to approximately reconstruct the hypersurface? We will use the term surface to refer to a hypersurface (a surface in multiple dimensions) throughout the paper.
\nIt has been shown (Poggio and Girosi, 1990) that the problem of reconstructing a surface from sparse and noisy data is equivalent to the problem of learning from examples. For instance, learning how to add numbers can be cast as finding the map from X = {pair of numbers} to F = {sum} from a set of noisy examples. The surface is F(X) and the sparse and noisy data are the set of N examples {(x_i, d_i)}, where i = 0, 1, ..., N−1 and x_i = (a_i, b_i) ∈ X, such that a_i + b_i = d_i + η_i (η_i being the noise term). Some a priori information about the surface, e.g. smoothness, is necessary for reconstruction. \nConsider a set of N input-output examples, {(x_i, d_i)}, and a form ‖Pf‖² for the cost of the deviation of f, the approximated surface, from smoothness. P is a differential operator and ‖·‖ is a norm (usually L²). To find the surface f that best fits (i) the data and (ii) the smoothness criterion is to solve the problem of minimizing the functional \n\nV(f) = Σ_{i=0}^{N−1} (d_i − f(x_i))² + μ ‖Pf‖² \n\nDifferent methods of minimizing the functional can yield different types of networks. In particular, using the Green's function method gives supervised backpropagation-type networks (Poggio and Girosi, 1990), and using optimization techniques (like gradient descent) we obtain unsupervised (with feedback) networks. \n\n2 Learning how to teach arithmetic operations \n\nThe problem of learning how to add and multiply is a simple one, yet it provides insight into our approach of selecting the minimum set of examples. \n\nLearning arithmetic operations The surface given by the addition of two numbers, namely f(x, y) = x + y, is a plane passing through the origin. The multiplication surface, f(x, y) = xy, is hyperbolic.
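Minimizing a functional of this data-plus-smoothness form can be illustrated numerically. The sketch below is our construction, not code from the paper: a 1-D analogue in which P is taken as the second-difference operator, so that setting the gradient of V(f) to zero gives the linear system (S + μ PᵀP) f = S d, with S a 0/1 diagonal mask marking observed samples.

```python
import numpy as np

# Hypothetical 1-D analogue of V(f) = sum_i (d_i - f(x_i))^2 + mu * ||P f||^2,
# with P the second-difference operator standing in for the smoothness prior.
# The minimizer solves (S + mu * P^T P) f = S d, where S masks observed points.

def reconstruct(d, observed, mu=1.0):
    """Reconstruct a smooth 1-D 'surface' f from sparse samples d."""
    n = len(d)
    P = np.zeros((n - 2, n))
    for i in range(n - 2):                       # rows encode f[i] - 2 f[i+1] + f[i+2]
        P[i, i], P[i, i + 1], P[i, i + 2] = 1.0, -2.0, 1.0
    S = np.diag(observed.astype(float))          # data term acts only at observed points
    A = S + mu * P.T @ P
    return np.linalg.solve(A, S @ d)

if __name__ == "__main__":
    x = np.linspace(0, 1, 50)
    true = 2 * x + 1                             # a slice of an 'addition-like' planar surface
    observed = np.zeros(50, dtype=bool)
    observed[::7] = True                         # sparse examples
    d = np.where(observed, true, 0.0)
    f = reconstruct(d, observed, mu=0.1)
    print(float(np.abs(f - true).max()))
```

Because the second-difference operator annihilates linear functions, a linear "surface" is recovered essentially exactly from very few examples, which is the effect the prior is meant to deliver.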
The a priori knowledge of the addition and multiplication surfaces can be expressed as a minimum of the functional \n\nV(f) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ‖∇²f(x, y)‖ dx dy \n\nwhere \n\n∇²f(x, y) = (∂²/∂x² + ∂²/∂y²) f(x, y) \n\nOther functions also minimize V(f), like f(x, y) = x² − y², and so a few examples are necessary to learn how to add and multiply given the above prior knowledge. If the prior assumption considers a larger class of basis functions, then more examples will be required. Given p input-output examples, {(x_i, y_i); d_i}, the learning problem of adding and multiplying can be cast as the optimization of \n\nV(f) = Σ_{i=0}^{p−1} (f(x_i, y_i) − d_i)² + μ ∫_{−∞}^{∞} ∫_{−∞}^{∞} ‖∇²f(x, y)‖ dx dy \n\nWe now consider the problem of selecting the examples from the full surface data. \n\nA sparse process for selecting data Let us assume that the full set of data is given on a 2-dimensional lattice, so we have a finite amount of data (N² data points), with the input-output set being {(x_i, y_j); d_ij}, where i, j = 0, 1, ..., N−1. To select p examples we introduce a sparse process s that selects out data by modifying the cost function according to \n\nV = Σ_{i,j=0}^{N−1} (1 − s_ij)(f(x_i, y_j) − d_ij)² + μ ∫_{−∞}^{∞} ∫_{−∞}^{∞} ‖∇²f(x, y)‖ dx dy + λ (p − Σ_{i,j=0}^{N−1} (1 − s_ij))² \n\nwhere s_ij = 1 selects out the data, and we have added the last term to assure that p examples are selected. The data term forces noisy data to be thrown out first; the second order smoothness of f reduces the need for many examples (p ≈ 10) to learn these arithmetic operations. Learning s is equivalent to learning how to select the examples, or learning how to teach. The system (teacher) has to learn a set of examples (sparse data) that contains all the \"relevant\" information. The redundant information can be \"filled in\" by the prior knowledge.
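The effect of the sparse process with a budget of p examples can be illustrated with a simple greedy stand-in. This is our illustration, not the paper's algorithm: on 1-D data, a sample that the smoothness prior predicts well from its kept neighbours (here, by linear interpolation) is redundant and is selected out first, until only p examples remain.

```python
import numpy as np

# Illustrative greedy selection (not the paper's mean field algorithm): repeatedly
# select out the sample that is best predicted by linear interpolation between its
# nearest kept neighbours, i.e. the most redundant sample under a smoothness prior.

def select_examples(d, p):
    """Return a boolean mask keeping p informative samples of d (p >= 2)."""
    n = len(d)
    keep = np.ones(n, dtype=bool)
    kept = n
    while kept > p:
        idx = np.flatnonzero(keep)
        best, best_err = None, None
        for j in idx[1:-1]:                     # endpoints anchor the reconstruction
            left = idx[idx < j][-1]
            right = idx[idx > j][0]
            t = (j - left) / (right - left)
            pred = (1 - t) * d[left] + t * d[right]   # prediction if sample j were removed
            err = abs(d[j] - pred)
            if best_err is None or err < best_err:
                best, best_err = j, err
        keep[best] = False
        kept -= 1
    return keep

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 40)
    d = np.where(x < 0.5, x, x + 1.0)           # piecewise smooth data with a step
    keep = select_examples(d, p=6)
    print(np.flatnonzero(keep))
```

On piecewise smooth data the surviving examples cluster at the endpoints and at the discontinuity, matching the intuition that redundant smooth-region data can be "filled in" by the prior.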
Once the teacher has learned these selected examples, he, she or it (a machine) presents them to the student, who, with the a priori knowledge about surfaces, is able to approximately learn the full input-output map (surface). \n\n3 Teaching piecewise smooth surfaces \n\nWe first briefly introduce the weak membrane model, a coupled Markov random field for modeling piecewise smooth surfaces. Then we lay down the framework for learning to teach this surface. \n\n3.1 Weak membrane model \n\nWithin the Bayes approach, the a priori knowledge that surfaces are smooth (first order smoothness) but not at the discontinuities has been analyzed by (Geman and Geman, 1984) (Blake and Zisserman, 1987) (Mumford and Shah, 1985) (Geiger and Girosi, 1991). If we consider the noise to be white Gaussian, the final posterior probability becomes P(f, l|g) = (1/Z) e^{−βV(f,l)}, where \n\nV(f, l) = Σ_{i,j} [(f_ij − g_ij)² + μ ‖∇f‖²_ij (1 − l_ij) + γ_ij l_ij], \n\n(1) \n\nWe represent surfaces by f_ij at pixel (i, j), and discontinuities by l_ij. The input data is g_ij, and ‖∇f‖_ij is the norm of the gradient at pixel (i, j). Z is a normalization constant, known as the partition function. β is a global parameter of the model, inspired by thermodynamics, and μ and γ_ij are parameters to be estimated. This model, when used for image segmentation, has been shown to give a good pattern of discontinuities and to eliminate noise, suggesting that the piecewise smoothness assumption is valid for images. \n\n3.2 Redundant data \n\nWe have assumed the surface to be smooth and therefore there is redundant information within smooth regions. We then propose a model that selects the \"relevant\" information according to two criteria. \n\n1.
Discontinuity data: Discontinuities usually capture relevant information, and it is possible to roughly approximate surfaces using edge data alone (see Geiger and Pereira, 1990). A limitation of using only edge data is that an oversmoothed surface is represented. \n\n2. Texture data: Data points that have significant gradients (not large enough to be a discontinuity) are here considered texture data. Keeping texture data allows us to distinguish between flat surfaces, as for example a clean sky in an image, and textured surfaces, as for example the leaves of a tree (see figure 2). \n\n3.3 The sparse process \n\nAgain, our proposal is first to extend the weak membrane model by including an additional binary field - the sparse process s - that is 1 when data is selected out and 0 otherwise. There are natural connections between the process s and robust statistics (Huber, 1981), as discussed in (Geiger and Yuille, 1990) and (Geiger and Pereira, 1991). We modify (1) by considering (see also Geiger and Pereira, 1990) \n\nV(f, l, s) = Σ_{i,j} [(1 − s_ij)(f_ij − g_ij)² + μ ‖∇f‖²_ij (1 − l_ij) + η_ij s_ij + γ_ij l_ij], \n\n(2) \n\nwhere we have introduced the term η_ij s_ij to keep some data; otherwise s_ij = 1 everywhere. If the data term is too large, the process s = 1 can suppress it. We will now assume that the data is noise-free, or that the noise has already been smoothed out. We then want to find which data points (s = 0) must be kept to reconstruct f. \n\n3.4 Mean field equations and unsupervised networks \n\nTo impose the discontinuity data constraint we use the hard constraint technique (Geiger and Yuille, 1990 and its references). We do not allow states that throw out data (s_ij = 1) at an edge location (l_ij = 1). More precisely, within the statistical framework we reduce the possible states for the processes s and l to s_ij l_ij = 0, therefore excluding the state (s_ij = 1, l_ij = 1).
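The hard constraint leaves three admissible states per site, and their relative Boltzmann weights determine the mean field values of s and l. A numerical sketch (our illustration; the per-state costs follow equation (2) and the parameter names follow the text, but the function itself is not code from the paper):

```python
import math

# The hard constraint s_ij * l_ij = 0 leaves three admissible states per site,
# with costs read off equation (2) (grad2 stands for ||grad f||^2 at the site):
#   (l=1, s=0): gamma + (f - g)^2          keep the datum, mark a discontinuity
#   (l=0, s=1): mu * grad2 + eta           select the datum out (redundant)
#   (l=0, s=0): mu * grad2 + (f - g)^2     keep the datum on a smooth patch

def mean_field_sl(f, g, grad2, mu, gamma, eta, beta):
    """Single-site mean field values (s_bar, l_bar) from the three state weights."""
    residual = (f - g) ** 2
    w_edge = math.exp(-beta * (gamma + residual))        # state (l=1, s=0)
    w_sparse = math.exp(-beta * (mu * grad2 + eta))      # state (l=0, s=1)
    w_keep = math.exp(-beta * (mu * grad2 + residual))   # state (l=0, s=0)
    z = w_edge + w_sparse + w_keep
    return w_sparse / z, w_edge / z

if __name__ == "__main__":
    # smooth site where discarding (cost eta) is cheaper than fitting: s_bar near 1
    print(mean_field_sl(f=1.2, g=1.0, grad2=0.01, mu=1.0, gamma=1.0, eta=0.001, beta=100.0))
    # strong gradient with a moderate edge cost gamma: l_bar near 1
    print(mean_field_sl(f=1.0, g=1.0, grad2=1.0, mu=1.0, gamma=0.5, eta=0.001, beta=100.0))
```

At large β the weights concentrate on the cheapest state, so s and l saturate to 0 or 1, which is the deterministic-network limit used below.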
Applying the saddle point approximation, a well known mean field technique (Geiger and Girosi, 1989 and its references), to the field f, we can compute the partition function \n\nZ = Σ_{f=(0,...,255)^{N²}} Σ_{s,l=(0,1)^{N²}; sl=0} e^{−βV(f,l,s)} ≈ Σ_{s,l=(0,1)^{N²}; sl=0} e^{−βV(f̄,l,s)} ≈ Π_{i,j} Z_ij \n\nZ_ij = e^{−β[γ_ij + (f̄_ij − g_ij)²]} + e^{−β[μ‖∇f̄‖²_ij + η_ij]} + e^{−β[μ‖∇f̄‖²_ij + (f̄_ij − g_ij)²]} \n\n(3) \n\nwhere f̄ maximizes Z. After applying mean field techniques we obtain the mean field equations for the processes l and s \n\n(4) \n\nand, using the definition ‖∇f‖²_ij = (f_{i,j+1} − f_{i+1,j})² + (f_{i+1,j+1} − f_{i,j})², the mean field self-consistent equation for f̄ (Geiger and Pereira, 1991) becomes \n\n−μ { K_ij (1 − l̄_ij) + K_{i−1,j−1} (1 − l̄_{i−1,j−1}) + M_{i−1,j} (1 − l̄_{i−1,j}) + M_{i,j−1} (1 − l̄_{i,j−1}) } \n\n(5) \n\nwhere K_ij = (f_{i+1,j+1} − f_{i,j})² and M_ij = (f_{i+1,j} − f_{i,j+1})². The set of coupled equations (5) and (4) can be mapped to an unsupervised network, which we call a minimal surface representation network (MSRN), and can be solved efficiently on a massively parallel machine. Notice that s_ij + l_ij ≤ 1 because of the hard constraint, and in the limit β → ∞ the processes s and l become either 0 or 1. In order to throw away redundant (smooth) data while keeping some of the texture, we adapt the cost η_ij according to the gradient of the surface. More precisely, we set \n\n(6) \n\nwhere (Δx_ij g)² = (g_{i+1,j} − g_{i−1,j})² and (Δy_ij g)² = (g_{i,j+1} − g_{i,j−1})². The smoother the data, the lower the cost of discarding it (s_ij = 1). In the limit η → 0 only edge data (l_ij = 1) is kept, since from (4) lim_{η→0} s_ij = 1 − l_ij. \n\n3.5 Learning how to teach and the approximated surface \n\nWith the mean field equations we compute the approximated surface f̄ simultaneously with s and l. Thus, while learning the process s (the selected data), the system also predicts the approximated surface f̄ that the student will learn from the selected examples. By changing the parameters, say μ and η, the teacher can choose the optimal parameters so as to select less data while preserving the quality of the approximated surface. Once s has been learned, the system feeds only the selected data points to the learner machinery. We actually relax this condition and feed the learner with the selected data and the corresponding discontinuity map (l). Notice that in the limit η → 0 the selected data points coincide with the discontinuities (l = 1). \n\n4 Results: Image compression \n\nWe show the results of the algorithm learning the minimal representation of images. The algorithm is capable of image compression, and one advantage over the cosine transform (the traditional method) is that it does not have the problem of breaking the images into blocks. However, a more careful comparison is needed. \n\n4.1 Learning s, f, and l \n\nTo analyze the quality of the surface approximation, we show in figure 1 the performance of the network as we vary the threshold η. We first show a face image and the line process, and then the predicted approximated surfaces together with the corresponding sparse process s. \n\n4.2 Reconstruction, Generalization or \"The student performance\" \n\nWe can now test how the student learns from the selected examples, or how good the surface reconstruction from the selected data is. We reconstruct the approximate surfaces by running (5) again, but with the selected surface data points (s_ij = 0) and the discontinuities (l_ij = 1) given from the previous step. We show in figure 2f that indeed we obtain the predicted surfaces (the student has learned). \n\nReferences \n\nE. B. Baum and Y. Lyuu. 1991.
The transition to perfect generalization in perceptrons, Neural Computation, vol. 3, no. 3, pp. 386-401. \n\nA. Blake and A. Zisserman. 1987. Visual Reconstruction, MIT Press, Cambridge, Mass. \n\nD. Geiger and F. Girosi. 1989. Coupled Markov random fields and mean field theory, Advances in Neural Information Processing Systems 2, Morgan Kaufmann, D. Touretzky (ed.). \n\nD. Geiger and A. Yuille. 1991. A common framework for image segmentation, Int. Jour. Comp. Vis., vol. 6:3, pp. 227-243. \n\nD. Geiger and F. Girosi. 1991. Parallel and deterministic algorithms for MRFs: surface reconstruction, PAMI, May 1991, vol. PAMI-13, no. 5, pp. 401-412. \n\nD. Geiger and R. M. Pereira. 1991. The outlier process, IEEE Workshop on Neural Networks for Signal Processing, Princeton, NJ. \n\nS. Geman and D. Geman. 1984. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, PAMI, vol. PAMI-6, pp. 721-741. \n\nJ. J. Hopfield. 1984. Neural networks and physical systems with emergent collective computational abilities, Proc. Nat. Acad. Sci., 79, pp. 2554-2558. \n\nP. J. Huber. 1981. Robust Statistics, John Wiley and Sons, New York. \n\nD. Mumford and J. Shah. 1985. Boundary detection by minimizing functionals, I, Proc. IEEE Conf. on Computer Vision & Pattern Recognition, San Francisco, CA. \n\nT. Poggio and F. Girosi. 1990. Regularization algorithms for learning that are equivalent to multilayer networks, Science, vol. 247, pp. 978-982. \n\nD. E. Rumelhart, G. Hinton and R. J. Williams. 1986. Learning internal representations by error backpropagation, Nature, 323, 533. \n\nFigure 1: (a) 8-bit image of 128 x 128 pixels. (b) The edge map for μ ≈ 1.0, γ_ij ≈ 100.0, after 200 iterations and final β ≈ 25. (c) The approximated image for μ ≈ 0.01, γ_ij ≈ 1.0 and η ≈ 0.0009. (d) The corresponding sparse process. (e) The approximated image for μ ≈ 0.01, γ_ij ≈ 1.0 and η ≈ 0.0001. (f) The corresponding sparse process.", "award": [], "sourceid": 559, "authors": [{"given_name": "Davi", "family_name": "Geiger", "institution": null}, {"given_name": "Ricardo", "family_name": "Pereira", "institution": null}]}