{"title": "Learning to Find Pre-Images", "book": "Advances in Neural Information Processing Systems", "page_first": 449, "page_last": 456, "abstract": "", "full_text": "Learning to Find Pre-Images\n\nG\u00a8okhan H. Bak\u0131r, Jason Weston and Bernhard Sch\u00a8olkopf\n\nMax Planck Institute for Biological Cybernetics\nSpemannstra\u00dfe 38, 72076 T\u00a8ubingen, Germany\n\n{gb,weston,bs}@tuebingen.mpg.de\n\nAbstract\n\nWe consider the problem of reconstructing patterns from a feature map.\nLearning algorithms using kernels to operate in a reproducing kernel\nHilbert space (RKHS) express their solutions in terms of input points\nmapped into the RKHS. We introduce a technique based on kernel princi-\npal component analysis and regression to reconstruct corresponding pat-\nterns in the input space (aka pre-images) and review its performance in\nseveral applications requiring the construction of pre-images. The intro-\nduced technique avoids dif\ufb01cult and/or unstable numerical optimization,\nis easy to implement and, unlike previous methods, permits the compu-\ntation of pre-images in discrete input spaces.\n\n1\n\nIntroduction\n\nWe denote by Hk the RKHS associated with the kernel k(x, y) = \u03c6(x)>\u03c6(y), where\n\u03c6(x) : X \u2192 Hk is a possible nonlinear mapping from input space X (assumed to be a\nnonempty set) to the possible in\ufb01nite dimensional space Hk. The pre-image problem is\nde\ufb01ned as follows: given a point \u03a8 in Hk, \ufb01nd a corresponding pattern x \u2208 X such that\n\u03a8 = \u03c6(x). Since Hk is usually a far larger space than X , this is often not possible (see Fig.\n??). 
In these cases, the (approximate) pre-image z is chosen such that the squared distance of Ψ and φ(z) is minimized,

z = argmin_z ||Ψ − φ(z)||².   (1)

This has a significant range of applications in kernel methods: for reduced set methods [1], for denoising and compression using kernel principal components analysis (kPCA), and for kernel dependency estimation (KDE), where one finds a mapping between paired sets of objects. The techniques used so far to solve this nonlinear optimization problem often employ gradient descent [1] or nonlinear iteration methods [2]. Unfortunately, this approach suffers from (i) being a difficult nonlinear optimization problem with local minima, requiring restarts and other numerical measures, (ii) being computationally inefficient, given that the problem is solved individually for each test example, (iii) not being the optimal approach (e.g., we may be interested in minimizing a classification error rather than a distance in feature space), and (iv) not being applicable to pre-images which are objects with discrete variables.

In this paper we propose a method which can resolve all four difficulties: the simple idea is to estimate the function (1) by learning the map Ψ → z from examples (φ(z), z). Depending on the learning technique used, this can mean that, after training, each use of the function (each pre-image found) can be computed very efficiently, and there are no longer issues with complex optimization code. Note that this problem is unusual in that it is possible to produce an infinite amount of training data (and thus expect to get good performance) by generating points in H_k and labeling them using (1).
However, we often have knowledge about the distribution over the pre-images, e.g., when denoising digits with kPCA, one expects as a pre-image something that looks like a digit, and an estimate of this distribution is actually given by the original data. Taking this distribution into account, it is conceivable that a learning method could outperform the naive method, that of equation (1), by producing pre-images that are subjectively preferable to the minimizers of (1). Finally, learning to find pre-images can also be applied to objects with discrete variables, such as string outputs as in part-of-speech tagging or protein secondary structure prediction.

The remainder of the paper is organized as follows: in Section 2 we review kernel methods requiring the use of pre-images: kPCA and KDE. Then, in Section 3 we describe our approach for learning pre-images. In Section 4 we verify our method experimentally in the above applications, and in Section 5 we conclude with a discussion.

2 Methods Requiring Pre-Images

2.1 Kernel PCA Denoising and Compression

Given data points {x_i}_{i=1}^m ⊂ X, kPCA constructs an orthogonal set of feature extractors in the RKHS. The constructed orthogonal system P = {v_1, ..., v_r} lies in the span of the data points, i.e., P = (Σ_{i=1}^m α¹_i φ(x_i), ..., Σ_{i=1}^m α^r_i φ(x_i)). It is obtained by solving the eigenvalue problem mλα^i = Kα^i for 1 ≤ i ≤ r, where K_ij = k(x_i, x_j) is the kernel matrix and r ≤ m is the number of nonzero eigenvalues.¹ Once built, the orthogonal system P can be used for nonlinear feature extraction.
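As an illustration, the eigen-decomposition and normalization described above can be sketched in a few lines of numpy. This is our own minimal sketch, not the authors' code: a Gaussian kernel is assumed, all names are ours, and centering of test kernel vectors (cf. the footnote) is omitted for brevity.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed pairwise.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def kpca_fit(X, sigma):
    # Eigen-decompose the centered kernel matrix; keep nonzero eigenvalues.
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m          # centering matrix
    Kc = H @ gaussian_kernel(X, X, sigma) @ H
    w, V = np.linalg.eigh(Kc)
    w, V = w[::-1], V[:, ::-1]                   # sort by decreasing eigenvalue
    keep = w > 1e-6 * w[0]
    # Scale so each extractor v_i = sum_s alphas[s, i] phi(x_s) has unit RKHS norm.
    alphas = V[:, keep] / np.sqrt(w[keep])
    return alphas

def kpca_project(alphas, X, x, sigma):
    # Nonlinear principal components of a test point x
    # (test-point centering omitted here for brevity).
    kx = gaussian_kernel(x[None, :], X, sigma).ravel()
    return alphas.T @ kx
```

With this scaling the extractors are orthonormal in the RKHS, i.e. αᵀ K_c α is the identity.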
Let x denote a test point; then the nonlinear principal components can be extracted via Pφ(x) = (Σ_{i=1}^m α¹_i k(x_i, x), ..., Σ_{i=1}^m α^r_i k(x_i, x)), where k(x_i, x) is substituted for φ(x_i)ᵀφ(x). See ([3], [4] chapter 14) for details. Besides serving as a feature extractor, kPCA has been proposed as a denoising and compression procedure, both of which require the calculation of input patterns x from feature space points Pφ(x).

Denoising. Denoising is a technique used to reconstruct patterns corrupted by noise. We are given data points {x_i}_{i=1}^m and the orthogonal system P = (v_1, ..., v_a, ..., v_r) obtained by kPCA. Assuming that the orthogonal system is sorted by decreasing variance, we write φ(x) = Pφ(x) = P_a φ(x) + P_a^⊥ φ(x), where P_a denotes the projection on the span of (v_1, ..., v_a). The hope is that P_a φ(x) retains the main structure of x, while P_a^⊥ φ(x) contains noise. If this is the case, then we should be able to construct a denoised input pattern as the pre-image of P_a φ(x). This denoised pattern z can be obtained as the solution to the problem

z = argmin_z ||P_a φ(x) − φ(z)||².   (2)

For an application of kPCA denoising see [2].

Compression. Consider a sender-receiver scenario, where the sender S wants to transmit information to the receiver R. If S and R have the same projection matrix P serving as a vocabulary, then S could use P_a to encode x and send P_a φ(x) ∈ R^a instead of x ∈ R^n. This corresponds to a lossy compression, and is useful if a ≪ n. R would obtain the corresponding pattern x by minimizing (2) again. Therefore kPCA would serve as encoder and the pre-image technique as decoder.

¹We assume that the φ(x_i) are centered in feature space. This can be achieved by centering the kernel matrix, K_c = (I − (1/m)11ᵀ)K(I − (1/m)11ᵀ), where 1 ∈ R^m is the vector with every entry equal to 1. Test patterns must be centered with the same center obtained from the training stage.

2.2 Kernel Dependency Estimation

Kernel Dependency Estimation (KDE) is a novel algorithm [5] which is able to learn general mappings between an input set X and an output set Y, given definitions of kernels k and l (with feature maps φ_k and φ_l) which serve as similarity measures on X and Y, respectively. To learn the mapping from data {x_i, y_i}_{i=1}^m ⊂ X × Y, KDE performs two steps.

1) Decomposition of outputs. First a kPCA is performed in H_l associated with kernel l. This results in r principal axes v_1, ..., v_r in H_l. Having obtained the principal axes, one is able to obtain the principal components (φ_l(y)ᵀv_1, ..., φ_l(y)ᵀv_r) of any object y.

2) Learning the map. Next, we learn the map from φ_k(x) to (φ_l(y)ᵀv_1, ..., φ_l(y)ᵀv_r). To this end, for each principal axis v_j we solve the problem

argmin_{β^j} Σ_{i=1}^m (φ_l(y_i)ᵀv_j − g(x_i, β^j))² + γ||β^j||²,   (3)

where γ||β^j||² acts as a regularization term (with γ > 0), g(x_i, β^j) = Σ_{s=1}^m β^j_s k(x_s, x_i), and β ∈ R^{m×r}. Let P ∈ R^{m×r} with P_ij = φ_l(y_i)ᵀv_j, j = 1, ..., r, and let K ∈ R^{m×m} be the kernel matrix with entries K_st = k(x_s, x_t), s, t = 1, ..., m. Problem (3) can then be minimized, for example via kernel ridge regression, yielding

β = (KᵀK + γI)⁻¹KP.   (4)

3) Testing phase. Using the learned map from input patterns to principal components, predicting an output y_0 for a new pattern x_0 requires solving the pre-image problem

y_0 = argmin_y ||(φ_l(y)ᵀv_1, ..., φ_l(y)ᵀv_r) − (k(x_1, x_0), ..., k(x_m, x_0))β||².   (5)

Thus y_0 is the approximate pre-image of the estimated point φ_l(y_0) in H_l.

3 Learning Pre-Images

We shall now argue that by mainly being concerned with (1), the methods that have been used for this task in the past disregard an important piece of information. Let us first summarize the state of the art (for details, see [4]).

Exact pre-images. One can show that if an exact pre-image exists, and if the kernel can be written as k(x, x_0) = f_k(xᵀx_0) with an invertible function f_k (e.g., k(x, x_0) = (xᵀx_0)^d with odd d), then one can compute the pre-image analytically as z = Σ_{i=1}^N f_k⁻¹(Σ_{j=1}^m α_j k(x_j, e_i)) e_i, where {e_1, ..., e_N} is any orthonormal basis of input space. However, if one tries to apply this method in practice, it usually works less well than the approximate pre-image methods described below. This is due to the fact that exact pre-images usually do not exist.

General approximation methods. These methods are based on the minimization of (1). Whilst there are certain cases where the minimizer of (1) can be found by solving an eigenvalue problem (for k(x, x_0) = (xᵀx_0)²), people in general resort to methods of nonlinear optimization. For instance, if the kernel is differentiable, one can multiply out (1) to express it in terms of the kernel, and then perform gradient descent [1]. The drawback of these methods is that the optimization procedure is expensive and will in general only find a local optimum. Alternatively, one can select the k best input points from some training set and use them in combination to minimize the distance (1); see [6] for details.

Iteration schemes for particular kernels. For particular types of kernels, such as radial basis functions, one can devise fixed-point iteration schemes which allow faster minimization of (1).
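For concreteness, such a fixed-point scheme for the Gaussian kernel (in the spirit of [2]) can be sketched as follows. This is our own illustration with our own names; Ψ = Σ_i γ_i φ(x_i) is assumed to be given through its expansion coefficients γ.

```python
import numpy as np

def rbf_preimage_fixed_point(X, gamma, sigma=1.0, z0=None, n_iter=100):
    """Fixed-point iteration for an approximate pre-image of
    Psi = sum_i gamma_i phi(x_i) under k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
    Converges at best to a local optimum and can be unstable."""
    z = X.mean(axis=0) if z0 is None else z0.copy()
    for _ in range(n_iter):
        # Kernel-weighted contribution of every training point to the update.
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / (2 * sigma ** 2))
        s = w.sum()
        if abs(s) < 1e-12:          # degenerate step: weights cancelled out
            break
        z = (w[:, None] * X).sum(axis=0) / s
    return z
```

As a sanity check, if Ψ = φ(x_0) exactly (γ a one-hot vector), the iteration jumps to x_0 in one step.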
Again, there is no guarantee that this leads to a global optimum.

One aspect shared by all these methods is that they do not explicitly make use of the fact that we have labeled examples of the unknown pre-image map: specifically, for any point x ∈ X, we know that the pre-image of φ(x) is simply x.² Below, we describe a method which makes heavy use of this information. Specifically, we use kernel regression to estimate the pre-image map from data. As a data set, we consider the training data {x_i}_{i=1}^m that we are given in our original learning problem (kPCA, KDE, etc.).

3.1 Estimation of the Pre-Image Map

We seek to estimate a function Γ: H_k → X with the property that, at least approximately, Γ(φ(x_i)) = x_i for i = 1, ..., m. If we were to use regression with the kernel k corresponding to H_k, then we would simply look for weight vectors w_j ∈ H_k, j = 1, ..., dim X, such that Γ_j(Ψ) = w_jᵀΨ, and use the kernel trick to evaluate Γ. However, in general we may want to use a kernel κ which is different from k, and thus we cannot perform our computations implicitly by the use of a kernel. This looks like a problem, but there is a way to handle it. It is based on the well-known observation that although the data in H_k may live in an infinite-dimensional space, any finite data set spans a subspace of finite dimension. A convenient way of working in that subspace is to choose a basis and to work in coordinates, e.g., using a kPCA basis. Let P_n Ψ = Σ_{i=1}^n (Ψᵀv_i)v_i denote the projection that maps a point onto its coordinates in the PCA basis v_1, ..., v_n, i.e., into the subspace where the training set has nonzero variance.
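The resulting estimation of the pre-image map can be sketched as generic kernel ridge regression in these coordinates, assuming X = R^N and an RBF kernel κ. This is our own sketch under those assumptions (all names are ours); `P_coords` stands for the kPCA coordinates P_n φ(x_i) of the training points.

```python
import numpy as np

def rbf(A, B, sigma):
    # kappa(u, v) = exp(-||u - v||^2 / (2 sigma^2)), pairwise.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_preimage_map(P_coords, X, kappa_sigma=1.0, lam=1.0):
    # P_coords: (m, n) coordinates P_n phi(x_i); X: (m, N) known pre-images x_i.
    # Kernel ridge regression: one column of coefficients per output dimension.
    K = rbf(P_coords, P_coords, kappa_sigma)
    beta = np.linalg.solve(K + lam * np.eye(K.shape[0]), X)
    return beta

def apply_preimage_map(beta, P_coords, P_test, kappa_sigma=1.0):
    # Gamma_j(P_n Psi) = sum_r beta[r, j] * kappa(P_n Psi, P_n phi(x_r)).
    k = rbf(P_test, P_coords, kappa_sigma)
    return k @ beta
```

With a small ridge, the learned map reproduces the training pre-images almost exactly, which is the defining property Γ(φ(x_i)) ≈ x_i.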
We then learn the pre-image map Γ_j: R^n → X by solving the learning problem

Γ_j = argmin_{Γ_j} Σ_{i=1}^m l(x_i, Γ(P_n φ(x_i))) + λΩ(Γ).   (6)

Here, Ω is a regularizer, and λ ≥ 0. If X is the vector space R^N, we can consider (6) as a standard regression problem for the m training points x_i and use kernel ridge regression with a kernel κ. This yields a pre-image mapping Γ_j(P_n φ(x)) = Σ_{r=1}^m β^j_r κ(P_n φ(x), P_n φ(x_r)), j = 1, ..., N, which can be solved like (3).

Note that the general learning setup of (6) allows the use of any suitable loss function, incorporating invariances and a priori knowledge. For example, if the pre-images are (natural) images, a psychophysically motivated loss function could be used, which would allow the algorithm to ignore differences that cannot be perceived.

3.2 Pre-Images for Complex Objects

In methods such as KDE one is interested in finding pre-images for general sets of objects, e.g., one may wish to find a string which is the pre-image of a representation using a string kernel [7, 8]. Using gradient descent techniques this is not possible, as the objects have discrete variables (the elements of the string). However, using function estimation techniques, as long as it is possible to learn to find pre-images even for such objects, the problem can be approached by decomposition into several learning subtasks. This should be possible whenever there is structure in the object one is trying to predict. In the case of strings, one can predict each character of the string independently given the estimate φ_l(y_0). This is made particularly tractable in fixed-length string prediction problems such as part-of-speech tagging or protein secondary structure prediction, because the length is known (it is the same length as the input).
Otherwise the task is more difficult, but one could still predict the length of the output string before predicting each element of it. As an example, we now describe in depth a method for finding pre-images for known-length strings.

The task is to predict a string y given a string x and a set of paired examples (x_i, y_i) ∈ ∪_{p=1}^∞ (Σ_x)^p × ∪_{p=1}^∞ (Σ_y)^p. Note that |x_i| = |y_i| for all i, i.e., the lengths of any paired input and output strings are the same. This is the setting of part-of-speech tagging, where Σ_x are words and Σ_y are parts of speech, and also of secondary structure prediction, where Σ_x are amino acids of a protein sequence and Σ_y are classes of structure that the sequence folds into, e.g., helix, sheet or coil.

It is possible to use KDE (Section 2.2) to solve this task directly. One has to define an appropriate similarity function for both sets of objects using a kernel function, giving two implicit maps φ_k(x) and φ_l(y) using string kernels. KDE then learns a map between the two feature spaces, and for a new test string x one must find the pre-image of the estimate φ_l(y_0) as in equation (5). One can find this pre-image by predicting each character of the string independently given the estimate φ_l(y_0), as it has known length given the input x. One can thus learn a function b_p = f(φ_l(y_0), α_p), where b_p is the pth element of the output and α_p = (a_{p−n/2} a_{p−n/2+1} ... a_{p+n/2}) is a window of length n + 1 centered at position p in the input string. One computes the entire output string as b = (f(φ_l(y_0), α_1), ..., f(φ_l(y_0), α_{|x|})); window elements outside of the string can be encoded with a special terminal character.

²It may not be the only pre-image, but this does not matter as long as it minimizes the value of (1).
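The windowing and per-position decoding just described can be sketched as follows. These are our own helpers: '#' is an arbitrary choice of terminal character, and f stands for any trained classifier taking (φ_l(y_0), α_p).

```python
def input_windows(x, n=2, pad="#"):
    # Windows alpha_p of length n + 1 centered at each position p of x,
    # with positions outside the string encoded by a terminal character.
    s = pad * (n // 2) + x + pad * (n // 2)
    return [s[p:p + n + 1] for p in range(len(x))]

def decode_string(f, phi_y0, x, n=2):
    # b_p = f(phi_l(y0), alpha_p) for every position p; |output| = |x|.
    return "".join(f(phi_y0, w) for w in input_windows(x, n))
```

For example, `input_windows("abc", n=2)` yields the three centered windows `"#ab"`, `"abc"`, `"bc#"`.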
The function f can be trained with any multi-class classification algorithm to predict one of the elements of the alphabet; the approach can thus be seen as a generalization of the traditional approach, which learns a function f given only a window on the input (the second parameter). Our approach first estimates the output using global information from the input and with respect to the loss function of interest on the outputs; it only decodes this global prediction in the final step. Note that problems such as secondary structure prediction often have loss functions dependent on the complete outputs, not on individual elements of the output string [9].

4 Experiments

In the following we demonstrate the pre-image learning technique on the applications we have introduced.

Figure 1: Denoising USPS digits (columns: Gaussian noise, PCA, kPCA+grad.desc., kPCA+learn-pre.): linear PCA fails on this task, while learning to find pre-images for kPCA performs at least as well as finding pre-images by gradient descent.

KPCA Denoising. We performed a similar experiment to the one in [2] for demonstration purposes: we denoised USPS digits using linear PCA and kPCA. We added Gaussian noise with variance 0.5 and selected 100 randomly chosen non-noisy digits for training and a further 100 noisy digits for testing, 10 from each class. As in [2] we chose a nonlinear map via a Gaussian kernel with σ = 8, and we selected 80 principal components for kPCA. We found pre-images using the Matlab function fminsearch, and compared this to our pre-image-learning method (RBF kernel k(x, x_0) = exp(−||x − x_0||²/(2σ²)) with σ = 1, and regularization parameter λ = 1). Figure 1 shows the results: our approach appears to perform better than the gradient descent approach. As in [2], linear PCA visually fails for this problem: we show its best results, using 32 components.
Note that the mean squared error performance of the algorithms is not precisely in accordance with the loss of interest to the user. This can be seen in that PCA has an MSE of 13.8±0.4, versus 31.6±1.7 for gradient descent and 29.2±1.8 for learnt pre-images: PCA has the lowest MSE, but as can be seen in Figure 1 it does not give satisfactory visual results in terms of denoising.

Note that some of the digits shown are actually denoised incorrectly as the wrong class. This is of course possible, as choosing the correct digit is a problem which is harder than a standard digit classification problem because the images are noisy. Moreover, kPCA is not a classifier per se and could not be expected to classify digits as well as Support Vector Machines. In this experiment, we also took a rather small number of training examples, because otherwise the fminsearch code for the gradient descent was very slow; this also allowed us to compare our results more easily.

KPCA Compression. For the compression experiment we used a video sequence consisting of 1000 graylevel images, where every frame has a 100 × 100 pixel resolution. The video sequence shows a famous science fiction figure turning his head 180 degrees. For training we used every 20th frame, resulting in a video sequence of 50 frames with 3.6 degrees orientation difference per image. The motivation is to store only these 50 frames and to reconstruct all frames in between.

We applied kPCA to all 50 frames with a Gaussian kernel with kernel parameter σ_1. The 50 feature vectors v_1, ..., v_50 ∈ R^50 were then used to learn the interpolation between the timelines of the 50 principal components v_ij, where i is the time index, j the principal component number, and 1 ≤ i, j ≤ 50. A kernel ridge regression with Gaussian kernel, kernel parameter σ_2 and ridge r_1 was used for this task.
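This interpolation step is plain kernel ridge regression over the time index, one smooth curve per principal-component timeline. The sketch below is our own generic illustration (names and default values are ours, not the parameter values used in the experiment):

```python
import numpy as np

def interpolate_components(times, comps, t_query, sigma, ridge):
    # Kernel ridge regression over the time axis: fit one coefficient vector
    # per principal-component timeline, then evaluate at the query times.
    K = np.exp(-(times[:, None] - times[None, :]) ** 2 / (2 * sigma ** 2))
    coef = np.linalg.solve(K + ridge * np.eye(len(times)), comps)   # (m, r)
    kq = np.exp(-(t_query[:, None] - times[None, :]) ** 2 / (2 * sigma ** 2))
    return kq @ coef
```

Evaluating at times between the training frames gives the interpolated component vectors, which the learned pre-image map then turns back into frames.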
Finally, the pre-image map Γ was learned from projections onto the v_i to frames, using kernel ridge regression with kernel parameter σ_3 and ridge r_2. All parameters σ_1, σ_2, σ_3, r_1, r_2 were selected in a loop such that the newly synthesized frames looked subjectively best. This led to the values σ_1 = 2.5, σ_2 = 1, σ_3 = 0.15 and the ridge parameters r_1 = 10⁻¹³, r_2 = 10⁻⁷. Figure 2 shows the original and reconstructed video sequences.

Note that the pre-image mechanism could possibly be adapted to take into account invariances and a priori knowledge, such as the geometry of standard heads, to reduce blending effects, making it more powerful than gradient descent or plain linear interpolation of frames. For an application of classical pre-image methods to face modelling, see [10].

String Prediction with Kernel Dependency Estimation. In the following we present a simple string mapping problem to show the potential of the approach outlined in Section 3.2. We construct an artificial problem with |Σ_x| = 3 and |Σ_y| = 2. Output strings are generated by the following algorithm: start in a random state (1 or 2) corresponding to one of the output symbols. The next symbol is either the same or, with probability 1/5, the state switches (this tends to make neighboring characters the same symbol). The length of the string is randomly chosen between 10 and 20 symbols.
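The output-string generator just described can be sketched directly (our own sketch; symbol and parameter names are ours):

```python
import random

def sample_output_string(p_switch=0.2, min_len=10, max_len=20, rng=None):
    # Markov chain over the two output symbols {1, 2}: keep the current
    # symbol with probability 1 - p_switch, otherwise switch to the other.
    rng = rng or random.Random()
    length = rng.randint(min_len, max_len)
    state = rng.choice([1, 2])
    out = [state]
    for _ in range(length - 1):
        if rng.random() < p_switch:
            state = 3 - state        # toggle between 1 and 2
        out.append(state)
    return out
```

Over many samples the empirical switch rate between neighboring characters concentrates around 1/5, reproducing the bias toward runs of identical symbols.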
Each input string is generated with equal probability from one of two models, starting randomly in state a, b or c and using the following transition matrices (rows: current input state a, b, c; columns: next input state a, b, c), depending on the current output state:

Model 1, Output 1:        Model 1, Output 2:
a:  0    0    1           a:  1/2  1/2  0
b:  0    0    1           b:  1/2  1/2  0
c:  1    0    0           c:  0    1    0

Model 2, Output 1:        Model 2, Output 2:
a:  1/2  1/2  0           a:  1    0    0
b:  1/2  1/2  0           b:  0    1    0
c:  0    1/2  1/2         c:  0    1    0

Figure 2: Kernel PCA compression used to learn intermediate images. Top: subsequence of the original video sequence (the first and last frames are in the training set); bottom: subsequence of the synthesized video sequence. The pre-images are in a 100 × 100 dimensional space, making gradient-based descent impracticable.

As the model of the string can be better predicted from the complete string, a global method could in principle be better than a window-based method. We use a string kernel called the spectrum kernel [11] to define the kernel on input strings. This method builds a representation which is a frequency count of all possible contiguous subsequences of length p. This produces a mapping with features φ_k(x) = ⟨ Σ_{i=1}^{|x|−p+1} [(x_i, ..., x_{i+p−1}) = α] : α ∈ (Σ_x)^p ⟩, where [x = y] is 1 if x = y, and 0 otherwise. To define a feature space for outputs, we count the number of contiguous subsequences of length p on the input that, if starting in position q, have the same element of the alphabet at position q + (p − 1)/2 in the output, for odd values of p. That is, φ_l(x, y) = ⟨ Σ_{i=1}^{|x|−p+1} [(x_i, ..., x_{i+p−1}) = α][y_{i+(p−1)/2} = b] : α ∈ (Σ_x)^p, b ∈ Σ_y ⟩. We can then learn pre-images using a window also of size p, as described in Section 3.2, e.g., using k-NN as the learner.
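The p-spectrum feature map for the inputs is easy to sketch (our own illustration; the output feature map φ_l would additionally attach the alignment indicator on y):

```python
from collections import Counter

def spectrum_features(x, p=3):
    # Frequency count of all contiguous subsequences of length p (the p-spectrum).
    return Counter(x[i:i + p] for i in range(len(x) - p + 1))

def spectrum_kernel(x, y, p=3):
    # k(x, y) = <phi_k(x), phi_k(y)>: sum over shared p-mers of the count products.
    fx, fy = spectrum_features(x, p), spectrum_features(y, p)
    return sum(c * fy[s] for s, c in fx.items())
```

For instance, the 2-spectrum of "abab" is {"ab": 2, "ba": 1}, and strings sharing no p-mer have kernel value 0.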
Note that the output kernel is defined on both the inputs and the outputs: such an approach is also used in [12], where it is called "joint kernels". In their approach the calculation of pre-images is also required, so they only consider specific kernels for computational reasons. In fact, our approach could also be of benefit if used in their algorithm.

We normalized the input and output kernel matrices such that a matrix S is normalized as S ← D⁻¹SD⁻¹, where D is a diagonal matrix with D_ii = Σ_i S_ii. We also used a nonlinear map for KDE, via an RBF kernel, i.e., K(x, x_0) = exp(−d(x, x_0)), where the distance d is induced by the input string kernel defined above, and we set λ = 1.

We give the results on this toy problem using the classification error (fraction of symbols misclassified) in the table below, with 50 strings, using 10-fold cross validation. We compare to k-nearest neighbor using a window size of 3; in our method we used p = 3 to generate the string kernels, and k-NN to learn the pre-image, so we quote different k for both methods. Larger window sizes only made the results worse.

        1-NN        3-NN        5-NN        7-NN        9-NN
KDE     0.182±0.03  0.169±0.03  0.162±0.03  0.164±0.03  0.163±0.03
k-NN    0.251±0.03  0.243±0.03  0.249±0.03  0.250±0.03  0.248±0.03

5 Conclusion

We introduced a method to learn the pre-image of a vector in an RKHS. Compared to classical approaches, the new method has the advantage that it is not numerically unstable, it is much faster to evaluate, and it is better suited for high-dimensional input spaces. It was demonstrated to be applicable when the input space is discrete and gradients do not exist. However, as a learning approach, it requires that the patterns used during training reasonably well represent the points for which we subsequently want to compute pre-images.
Otherwise, it can fail; an example is a reduced set application (see [1]), where one needs pre-images of linear combinations of mapped points in H_k which can be far away from the training points, making generalization of the estimated pre-image map impossible. Indeed, preliminary experiments (not described in this paper) showed that whilst the method can be used to compute reduced sets, it seems inferior to classical methods in that domain.

Finally, the learning of the pre-image can probably be augmented with mechanisms for incorporating a priori knowledge to enhance the performance of pre-image learning, making it more flexible than a pure optimization approach. Future research directions include the inference of pre-images in structures like graphs and the incorporation of a priori knowledge in the pre-image learning stage.

Acknowledgement. The authors would like to thank Kwang In Kim for fruitful discussions, and the anonymous reviewers for their comments.

References

[1] C. J. C. Burges. Simplified support vector decision rules. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning, pages 71-77, San Mateo, CA, 1996. Morgan Kaufmann.

[2] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, and G. Rätsch. Kernel PCA and de-noising in feature spaces. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 536-542, Cambridge, MA, 1999. MIT Press.

[3] B. Schölkopf, A. J. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10:1299-1319, 1998.

[4] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[5] J. Weston, O. Chapelle, A. Elisseeff, B. Schölkopf, and V. Vapnik. Kernel dependency estimation. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, Cambridge, MA, 2002. MIT Press.

[6] J. T. Kwok and I. W. Tsang. Finding the pre-images in kernel principal component analysis. In NIPS'2002 Workshop on Kernel Machines, 2002.

[7] D. Haussler. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, Computer Science Department, University of California at Santa Cruz, 1999.

[8] H. Lodhi, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification using string kernels. Technical Report 2000-79, NeuroCOLT, 2000. Published in: T. K. Leen, T. G. Dietterich and V. Tresp (eds.), Advances in Neural Information Processing Systems 13, MIT Press, 2001, as well as in JMLR 2:419-444, 2002.

[9] S. Hua and Z. Sun. A novel method of protein secondary structure prediction with high segment overlap measure: SVM approach. Journal of Molecular Biology, 308:397-407, 2001.

[10] S. Romdhani, S. Gong, and A. Psarrou. A multi-view nonlinear active shape model using kernel PCA. In Proceedings of BMVC, pages 483-492, Nottingham, UK, 1999.

[11] C. Leslie, E. Eskin, and W. S. Noble. The spectrum kernel: A string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing, 2002.

[12] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In 20th International Conference on Machine Learning (ICML), 2003.
", "award": [], "sourceid": 2417, "authors": [{"given_name": "Jason", "family_name": "Weston", "institution": null}, {"given_name": "Bernhard", "family_name": "Sch\u00f6lkopf", "institution": null}, {"given_name": "G\u00f6khan", "family_name": "Bakir", "institution": null}]}