{"title": "Semi-supervised Regression using Hessian energy with an application to semi-supervised dimensionality reduction", "book": "Advances in Neural Information Processing Systems", "page_first": 979, "page_last": 987, "abstract": "Semi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and from the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression, which overcomes both of these problems. In particular, if the data lies on or close to a low-dimensional submanifold in the feature space, the Hessian energy prefers functions which vary \"linearly\" with respect to the natural parameters in the data. This property also makes it particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find the natural parameters in the data based on a few labeled points. The experimental results suggest that our method is superior to semi-supervised regression using Laplacian regularization and standard supervised methods and is particularly suited for semi-supervised dimensionality reduction.", "full_text": "Semi-supervised Regression using Hessian Energy with an Application to Semi-supervised Dimensionality Reduction\n\nKwang In Kim (1), Florian Steinke (2,3), and Matthias Hein (1)\n\n(1) Department of Computer Science, Saarland University, Saarbr\u00fccken, Germany\n\n(2) Siemens AG Corporate Technology, Munich, Germany\n\n(3) MPI for Biological Cybernetics, Germany\n\n{kimki,hein}@cs.uni-sb.de, Florian.Steinke@siemens.com\n\nAbstract\n\nSemi-supervised regression based on the graph Laplacian suffers from the fact that the solution is biased towards a constant and from the lack of extrapolating power. Based on these observations, we propose to use the second-order Hessian energy for semi-supervised regression, which overcomes both of these problems. 
If the data lies on or close to a low-dimensional submanifold in feature space, the Hessian energy prefers functions whose values vary linearly with respect to geodesic distance. We first derive the Hessian energy for smooth manifolds and then give a stable estimation procedure for the common case where only samples of the underlying manifold are given. The preference for \u201clinear\u201d functions on manifolds renders the Hessian energy particularly suited for the task of semi-supervised dimensionality reduction, where the goal is to find, given some labeled points, a user-defined embedding function which varies smoothly (and ideally linearly) along the manifold. The experimental results suggest superior performance of our method compared with semi-supervised regression using Laplacian regularization or standard supervised regression techniques applied to this task.\n\n1 Introduction\n\nCentral to semi-supervised learning is the question of how unlabeled data can help in either classification or regression. A large class of methods for semi-supervised learning is based on the manifold assumption, that is, the data points do not fill the whole feature space but are concentrated around a low-dimensional submanifold. Under this assumption unlabeled data points can be used to build adaptive regularization functionals which penalize variation of the regression function only along the underlying manifold.\n\nOne of the main goals of this paper is to propose an appropriate regularization functional on a manifold, the Hessian energy, and to show that it has favourable properties for semi-supervised regression compared to the well-known Laplacian regularization [2, 12]. In contrast to the Laplacian regularizer, the Hessian energy allows functions that extrapolate, i.e. functions whose values are not limited to the range of the training outputs. 
Particularly if only few labeled points are available, we show that this extrapolation capability leads to significant improvements. The second property of the proposed Hessian energy is that it favors functions which vary linearly along the manifold, so-called geodesic functions, which are defined later. By linearity we mean that the output values of the functions change linearly along geodesics in the input manifold. This property makes it particularly useful as a tool for semi-supervised dimensionality reduction [13], where the task is to construct user-defined embeddings based on a given subset of labels. These user-guided embeddings are supposed to vary smoothly or even linearly along the manifold, where the latter case corresponds to a setting where the user tries to recover a low-distortion parameterization of the manifold. Moreover, due to the user-defined labels, the interpretability of the resulting parameterization is significantly improved over unsupervised methods like Laplacian [1] or Hessian [3] eigenmaps. The proposed Hessian energy is motivated by the recently proposed Eells energy for mappings between manifolds [11], which contains as a special case the regularization of real-valued functions on a manifold. In flavour, it is also quite similar to the operator constructed in Hessian eigenmaps [3]. However, we will show that their operator, due to problems in the estimation of the Hessian, leads to useless results when used as a regularizer for regression. 
In contrast, our novel estimation procedure turns out to be more stable for regression and, as a side effect, also leads to a better estimation of the eigenvectors used in Hessian eigenmaps. We present experimental results on several datasets, which show that our method for semi-supervised regression is often superior to other semi-supervised and supervised regression techniques.\n\n2 Regression on manifolds\n\nOur approach for regression is based on regularized empirical risk minimization. First, we will discuss the problem and the regularizer in the ideal case where we know the manifold exactly, corresponding to the case where we have access to an unlimited number of unlabeled data. In the following we denote by M the m-dimensional data-submanifold in $\\mathbb{R}^d$. The supervised regression problem for a training set of l points $(X_i, Y_i)_{i=1}^{l}$ can then be formulated as\n\n$$\\arg\\min_{f \\in C^\\infty(M)} \\; \\frac{1}{l} \\sum_{i=1}^{l} L(Y_i, f(X_i)) + \\lambda\\, S(f),$$\n\nwhere $C^\\infty(M)$ is the set of smooth functions on M, $L : \\mathbb{R} \\times \\mathbb{R} \\to \\mathbb{R}$ is the loss function and $S : C^\\infty(M) \\to \\mathbb{R}$ is the regularization functional. For simplicity we use the squared loss $L(y, f(x)) = (y - f(x))^2$, but the framework can be easily extended to other convex loss functions.\n\nNaturally, we do not know the manifold M the data is lying on. However, we have unlabeled data which can be used to estimate it, or more precisely we can use the unlabeled data to build an estimate $\\hat{S}(f)$ of the true regularizer $S(f)$. The proper estimation of $S(f)$ will be the topic of the next section. For the moment we just want to discuss regularization functionals in the ideal case, where we know the manifold. However, we would like to stress already here that for our framework to work it does not matter if the data lies on or close to a low-dimensional manifold. Even the dimension can change from point to point. 
The only assumption we make is that the data generating process does not fill the whole space but is concentrated on a low-dimensional structure.\n\nRegularization on manifolds. Our main goal is to construct a regularization functional on manifolds which is particularly suited for semi-supervised regression and semi-supervised dimensionality reduction. We follow here the framework of [11], who discuss regularization of mappings between manifolds, where we are interested in the special case of real-valued output. They propose to use the so-called Eells energy $S_{\\mathrm{Eells}}(f)$, which can be written for real-valued functions, $f : M \\to \\mathbb{R}$, as\n\n$$S_{\\mathrm{Eells}}(f) = \\int_M \\|\\nabla_a \\nabla_b f\\|^2_{T_x^* M \\otimes T_x^* M} \\, dV(x),$$\n\nwhere $\\nabla_a \\nabla_b f$ is the second covariant derivative of f and $dV(x)$ is the natural volume element, see [7]. Note that the energy is by definition independent of the coordinate representation and depends only on the properties of M. For details we refer to [11]. This energy functional looks quite abstract. However, in a special coordinate system on M, the so-called normal coordinates, one can evaluate it quite easily. In sloppy terms, normal coordinates at a given point p are coordinates on M such that the manifold looks as Euclidean as possible (up to second order) around p. Thus in normal coordinates $x^r$ centered at p,\n\n$$\\nabla_a \\nabla_b f \\Big|_p = \\sum_{r,s=1}^{m} \\frac{\\partial^2 f}{\\partial x^r \\partial x^s}\\Big|_p \\, dx^r_a \\otimes dx^s_b \\;\\Longrightarrow\\; \\|\\nabla_a \\nabla_b f\\|^2_{T_p^* M \\otimes T_p^* M} = \\sum_{r,s=1}^{m} \\Big( \\frac{\\partial^2 f}{\\partial x^r \\partial x^s}\\Big|_p \\Big)^2, \\qquad (1)$$\n\nso that at p the norm of the second covariant derivative is just the Frobenius norm of the Hessian of f in normal coordinates. Therefore we call the resulting functional the Hessian regularizer $S_{\\mathrm{Hess}}(f)$.\n\nFigure 1: Difference between semi-supervised regression using Laplacian and Hessian regularization for fitting two points on the one-dimensional spiral. The Laplacian regularization always has a bias towards the constant function (for a non-zero regularization parameter it will not fit the data exactly) and the extrapolation beyond data points to the boundary of the domain is always constant. The non-linearity of the fitted function between the data points arises due to the non-uniform sampling of the spiral. In contrast, the Hessian regularization fits the data perfectly and extrapolates nicely to unseen data, since its null space contains functions which vary linearly with the geodesic distance.\n\nBefore we discuss the discretization, we would like to discuss some properties of this regularizer, in particular how it differs from the regularizer $S_\\Delta(f)$ using the Laplacian,\n\n$$S_\\Delta(f) = \\int_M \\|\\nabla f\\|^2 \\, dV(x),$$\n\nproposed by Belkin and Niyogi [2] for semi-supervised classification and in the meantime also adopted for semi-supervised regression [12]. While this regularizer makes sense for classification, it is of limited use for regression. The problem is that the null space $N_S = \\{f \\in C^\\infty(M) \\mid S(f) = 0\\}$ of $S_\\Delta$, that is, the set of functions which are not penalized, consists only of the constant functions on M. 
The following adaptation of a result in [4] shows that the Hessian regularizer has a richer null space.\n\nProposition 1 (Eells, Lemaire [4]) A function $f : M \\to \\mathbb{R}$ with $f \\in C^\\infty(M)$ has zero second derivative, $\\nabla_a \\nabla_b f \\big|_x = 0$ for all $x \\in M$, if and only if for any geodesic $\\gamma : (-\\varepsilon, \\varepsilon) \\to M$ parameterized by arc length s, there exists a constant $c_\\gamma$ depending only on $\\gamma$ such that\n\n$$\\frac{\\partial}{\\partial s} f\\big(\\gamma(s)\\big) = c_\\gamma, \\qquad \\forall\\, -\\varepsilon < s < \\varepsilon.$$\n\nWe call functions f which fulfill $\\frac{\\partial}{\\partial s} f(\\gamma(s)) = \\mathrm{const.}$ geodesic functions. They correspond to linear maps in Euclidean space and encode a constant variation with respect to the geodesic distance of the manifold. It is however possible that, apart from the trivial case f = const., no other geodesic functions exist on M. What is the implication of these results for regression? First, the use of Laplacian regularization always leads to a bias towards the constant function and does not extrapolate beyond data points. In contrast, Hessian regularization is not biased towards constant functions if geodesic functions exist and extrapolates \u201clinearly\u201d (if possible) beyond data points. These crucial differences are illustrated in Figure 1, where we compare Laplacian regularization using the graph Laplacian as in [2] to Hessian regularization as introduced in the next section for a densely sampled spiral. Since the spiral is isometric to a subset of $\\mathbb{R}$, it allows \u201cgeodesic\u201d functions.\n\n3 Semi-supervised regression using Hessian energy\n\nAs discussed in the last section, unlabeled data provides us with valuable information about the data manifold. We use this information to construct normal coordinates around each unlabeled point, which requires the estimation of the local structure of the manifold. 
Subsequently, we employ the normal coordinates to estimate the Hessian regularizer using the simple form of the second covariant derivative provided in Equation (1). It turns out that these two parts of our construction are similar to the ones used in Hessian eigenmaps [3]. However, their estimate of the regularizer has stability problems when applied to semi-supervised regression, as is discussed below. In contrast, the proposed method does not suffer from this shortcoming and leads to significantly better performance. The solution of the semi-supervised regression problem is obtained by solving a sparse linear system. In the following, capital letters $X_i$ correspond to sample points and $x^r$ denote normal coordinates.\n\n[Figure 1 plot: estimated output versus geodesic distance along the spiral, for Laplacian regularization and Hessian regularization.]\n\nConstruction of local normal coordinates. The estimation of local normal coordinates can be done using the set of k nearest neighbors (NN) $N_k(X_i)$ of point $X_i$. The cardinality k will be chosen later on by cross-validation. In order to estimate the local tangent space $T_{X_i} M$ (seen as an m-dimensional affine subspace of $\\mathbb{R}^d$), we perform PCA on the points in $N_k(X_i)$. The m leading eigenvectors then correspond to an orthogonal basis of $T_{X_i} M$. In the ideal case, where one has a densely sampled manifold, the number of dominating eigenvalues should be equal to the dimension m. However, for real-world datasets like images the sampling is usually not dense enough, so that the dimension of the manifold cannot be detected automatically. Therefore the number of dimensions has to be provided by the user using prior knowledge about the problem or alternatively, and this is the approach we choose in this paper, by cross-validation.\n\nHaving the exact tangent space $T_{X_i} M$, one can determine the normal coordinates $x^r$ of a point $X_j \\in N_k(X_i)$ as follows. 
Let $\\{u_r\\}_{r=1}^{m}$ be the m leading PCA eigenvectors, which have been normalized. Then the normal coordinates $\\{x^r\\}_{r=1}^{m}$ of $X_j$ are given as\n\n$$x^r(X_j) = \\langle u_r, X_j - X_i \\rangle \\, \\sqrt{\\frac{d_M(X_j, X_i)^2}{\\sum_{s=1}^{m} \\langle u_s, X_j - X_i \\rangle^2}},$$\n\nwhere the first factor is just the projection of the difference vector, $X_j - X_i$, onto the basis vector $u_r \\in T_{X_i} M$ and the second factor is just a rescaling to fulfill the property of normal coordinates that the distance of a point $X_j \\in M$ to the origin (corresponding to $X_i$) is equal to the geodesic distance $d_M(X_j, X_i)$ of $X_j$ to $X_i$ on M, that is, $\\|x(X_j)\\|^2 = \\sum_{r=1}^{m} x^r(X_j)^2 = d_M(X_j, X_i)^2$. The rescaling makes sense only if local geodesic distances can be accurately estimated. In our experiments, this was only the case for the 1D toy dataset of Figure 1. For all other datasets we therefore use $x^r(X_j) = \\langle u_r, X_j - X_i \\rangle$ as normal coordinates. In [11] it is shown that this replacement yields an error of order $O(\\|\\nabla_a f\\|^2 \\kappa^2)$ in the estimation of $\\|\\nabla_a \\nabla_b f\\|^2$, where $\\kappa$ is the maximal principal curvature (the curvature of M with respect to the ambient space $\\mathbb{R}^d$).\n\nEstimation of the Hessian energy. The Hessian regularizer, the squared norm of the second covariant derivative, $\\|\\nabla_a \\nabla_b f\\|^2$, corresponds to the Frobenius norm of the Hessian of f in normal coordinates, see Equation (1). Thus, given normal coordinates $x^r$ at $X_i$ we would like to have an operator H which, given the function values $f(X_j)$ on $N_k(X_i)$, estimates the Hessian of f at $X_i$,\n\n$$\\frac{\\partial^2 f}{\\partial x^r \\partial x^s}\\Big|_{X_i} \\approx \\sum_{j=1}^{k} H^{(i)}_{rsj} \\, f(X_j).$$\n\nThis can be done by fitting a second-order polynomial $p(x)$ in normal coordinates to $\\{f(X_j)\\}_{j=1}^{k}$,\n\n$$p^{(i)}(x) = f(X_i) + \\sum_{r=1}^{m} B_r x^r + \\sum_{r=1}^{m} \\sum_{s=r}^{m} A_{rs} x^r x^s, \\qquad (2)$$\n\nwhere the zeroth-order term is fixed at $f(X_i)$. In the limit as the neighborhood size tends to zero, $p^{(i)}(x)$ becomes the second-order Taylor expansion of f around $X_i$, that is,\n\n$$B_r = \\frac{\\partial f}{\\partial x^r}\\Big|_{X_i}, \\qquad A_{rs} = \\frac{1}{2} \\frac{\\partial^2 f}{\\partial x^r \\partial x^s}\\Big|_{X_i}, \\qquad (3)$$\n\nwith $A_{rs} = A_{sr}$. In order to fit the polynomial we use standard linear least squares,\n\n$$\\arg\\min_{w \\in \\mathbb{R}^P} \\sum_{j=1}^{k} \\Big( \\big(f(X_j) - f(X_i)\\big) - (\\Phi w)_j \\Big)^2,$$\n\nwhere $\\Phi \\in \\mathbb{R}^{k \\times P}$ is the design matrix with $P = m + \\frac{m(m+1)}{2}$. The corresponding basis functions $\\phi$ are the monomials, $\\phi = [x^1, \\ldots, x^m, x^1 x^1, x^1 x^2, \\ldots, x^m x^m]$, of the normal coordinates (centered at $X_i$) of $X_j \\in N_k(X_i)$ up to second order. The solution $w \\in \\mathbb{R}^P$ is $w = \\Phi^+ f$, where $f \\in \\mathbb{R}^k$ and $f_j = f(X_j)$ with $X_j \\in N_k(X_i)$ and $\\Phi^+$ denotes the pseudo-inverse of $\\Phi$.\n\nNote that the last $\\frac{m(m+1)}{2}$ components of w correspond to the coefficients $A_{rs}$ of the polynomial (up to rescaling for the diagonal components) and thus with Equation (3) we obtain the desired form $H^{(i)}_{rsj}$. An estimate of the Frobenius norm of the Hessian of f at $X_i$ is thus given as\n\n$$\\|\\nabla_a \\nabla_b f\\|^2 \\approx \\sum_{r,s=1}^{m} \\Big( \\sum_{\\alpha=1}^{k} H^{(i)}_{rs\\alpha} f_\\alpha \\Big)^2 = \\sum_{\\alpha,\\beta=1}^{k} f_\\alpha f_\\beta B^{(i)}_{\\alpha\\beta},$$\n\nwhere $B^{(i)}_{\\alpha\\beta} = \\sum_{r,s=1}^{m} H^{(i)}_{rs\\alpha} H^{(i)}_{rs\\beta}$. Finally, the total estimated Hessian energy $\\hat{S}_{\\mathrm{Hess}}(f)$ is the sum over all data points, where n denotes the number of unlabeled and labeled points,\n\n$$\\hat{S}_{\\mathrm{Hess}}(f) = \\sum_{i=1}^{n} \\sum_{r,s=1}^{m} \\Big( \\frac{\\partial^2 f}{\\partial x^r \\partial x^s}\\Big|_{X_i} \\Big)^2 = \\sum_{i=1}^{n} \\sum_{\\alpha \\in N_k(X_i)} \\sum_{\\beta \\in N_k(X_i)} f_\\alpha f_\\beta B^{(i)}_{\\alpha\\beta} = \\langle f, Bf \\rangle,$$\n\nwhere B is the accumulated matrix summing up all the matrices $B^{(i)}$. Note that B is sparse since each point $X_i$ has only contributions from its neighbors.\n\nMoreover, since we sum up the energy over all points, the squared norm of the Hessian is actually weighted with the local density of the points, leading to a stronger penalization of the Hessian in densely sampled regions. The same holds for the estimate $\\hat{S}_\\Delta(f)$ of Laplacian regularization, $\\hat{S}_\\Delta(f) = \\sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$, where one also sums up the contributions of all data points (the rigorous connection between $\\hat{S}_\\Delta(f)$ and $S_\\Delta(f)$ has been established in [2, 5]).\n\nThe effect of non-uniform sampling can be observed in Figure 1. There the samples of the spiral are generated by uniform sampling of the angle, leading to a more densely sampled \u201cinterior\u201d region, which leads to the non-linear behavior of the function for the Laplacian regularization. For the Hessian energy this phenomenon cannot be seen in this example, since the Hessian of a \u201cgeodesic\u201d function is zero everywhere and therefore it does not matter if it is weighted with the density. On the other hand, for non-geodesic functions the weighting also matters for the Hessian energy. We did not try to enforce a weighting with respect to the uniform density. However, it would be no problem to compensate for the effects of non-uniform sampling by using a weighted form of the Hessian energy.\n\nFinal algorithm. Using the ideas of the previous paragraphs, the final algorithmic scheme for semi-supervised regression can now be immediately stated. 
We have to solve\n\n$$\\arg\\min_{f \\in \\mathbb{R}^n} \\; \\frac{1}{l} \\sum_{i=1}^{l} (Y_i - f(X_i))^2 + \\lambda \\langle f, Bf \\rangle, \\qquad (4)$$\n\nwhere for notational simplicity we assume that the data is ordered such that the first l points are labeled. The solution is obtained by solving the following sparse linear system,\n\n$$(I' + l \\lambda B) f = Y,$$\n\nwhere $I'$ is the diagonal matrix with $I'_{ii} = 1$ if i is labeled and zero otherwise, and $Y_i = 0$ if i is not labeled. The sparsity structure of B mainly influences the complexity of solving this linear system. However, the number of non-zero entries of B is between $O(nk)$ and $O(nk^2)$ depending on how well behaved the neighborhoods are (the latter case corresponds basically to random neighbors) and thus grows linearly with the number of data points.\n\nStability of estimation procedure of Hessian energy. Since we optimize the objective in Equation (4) for any possible assignment of function values f on the data points, we have to ensure that the estimation of the Hessian is accurate for any possible function. However, the quality of the estimate of the Hessian energy depends on the quality of the local fit of $p^{(i)}$ for each data point $X_i$. Clearly, there are function assignments where the estimation goes wrong. If k < P (P is the number of parameters of the polynomial) p can overfit the function and if k > P then p generally underfits. In both cases, the Hessian estimation is inaccurate. Most dangerous are the cases where the norm of the Hessian is underestimated, in particular if the function is heavily oscillating. Note that during the estimation of the local Hessian, we do not use the full second-order polynomial but fix its zeroth-order term at the value of f (i.e. $p^{(i)}(X_i) = f(X_i)$; cf. Eq. (2)). 
The reason for this is that underfitting is much more likely if one fits a full second-order polynomial, since the additional flexibility in fitting the constant term always reduces the Hessian estimate. In the worst case a function which is heavily oscillating can even have zero Hessian energy if it allows a linear fit at each point, see Figure 2. If such a function fits the data well we get useless regression results^1, see Fig. 2. While fixing the constant term does not completely rule out such undesired behavior, we did not observe such irregular solutions in any experiment. In the appendix we discuss a modification of Eq. (4) which rules out irregular solutions for sure, but since it did not lead to significantly better experimental results and requires an additional parameter to tune, we do not recommend using it.\n\nOur estimation procedure of the Hessian has a similar motivation to the one used in Hessian eigenmaps [3]. However, in their approach they do not fix the zeroth-order term. This seems to be suitable for Hessian eigenmaps as they do not use the full Hessian, but only its (m + 1)-dimensional null space (where m is the intrinsic dimension of the manifold). Apparently, this resolves the issues discussed above so that the null space can still be well estimated with their procedure. However, using their estimator for semi-supervised regression leads to useless results, see Fig. 3. Moreover, we would like to note that with our estimator not only the eigenvectors of the null space but also eigenvectors corresponding to higher eigenvalues can be well estimated, see Fig. 3.\n\n^1 For the full second-order polynomial even cross-validation does not rule out these irregular solutions.\n\nFigure 2: Fitting two points on the spiral revisited (see Fig. 1): The left image shows the regression result f using the Hessian energy estimated by fitting a full polynomial in normal coordinates. The Hessian energy of this heavily oscillating function is 0, since every local fit is linear (an example is shown in the right image; green curve). However, fixing the zeroth-order term yields a high Hessian energy as desired (the local fit is shown as the red curve in the right image).\n\nFigure 3: Sinusoid on the spiral: The left two images show the result of semi-supervised regression using the Hessian estimate of [3] and the corresponding smallest eigenvectors of the Hessian \u201cmatrix\u201d. One observes heavy oscillations, due to the bad estimation of the Hessian. The right two images show the result of our method. Note that in particular the third eigenvector, corresponding to a non-zero eigenvalue of B, is much better behaved.\n\n4 Experiments\n\nWe test our semi-supervised regression method using Hessian regularization on one synthetic and two real-world data sets. We compare with the results obtained using Laplacian-based regularization and kernel ridge regression (KRR) trained only with the labeled examples. The free parameters for our method are the number of neighbors k for k-NN, the dimensionality of the PCA subspace, and the regularization parameter $\\lambda$, while the parameters for the Laplacian regularization-based regression are k for k-NN, the regularization parameter and the width of the Gaussian kernel. For KRR we also used the Gaussian kernel with the width as free parameter. These parameters were chosen for each method using 5-fold cross-validation on the labeled examples. 
For the digit and figure datasets, the experiments were repeated with 5 different assignments of labeled examples.\n\n[Plots for the two spiral figures: panel title \u201cRegression results using full polynomial\u201d; legends: f, full polynomial, fixed zeroth-order; 1st eigenvector, 2nd eigenvector, 3rd eigenvector.]\n\nFigure 2: Results on the digit 1 dataset. First row: the 2D-embedding of the digits obtained by regression for the rotation and thickness parameter with 100 labels. Second row: 21 digit images sampled at regular intervals in the estimated parameter spaces: two reference points (inverted images) are sampled in the ground truth parameter space and then in the corresponding estimated embedding. Then, 19 points are sampled in the estimated parameter spaces based on linear inter/extrapolation of the parameters. The shown image samples are the ones which have parameters closest to the interpolated ones. In each parameter space the interpolated points, the corresponding closest data points and the reference points are marked with red dots, blue circles and cyan circles, respectively.\n\nTable 1: Results on digits: mean squared error (standard deviation), both in units of $10^{-3}$.\n\n50 labeled points\nMethod    h-trans.      v-trans.      rotation       thickness\nK         0.78 (0.13)   0.85 (0.14)   45.49 (7.20)   0.02 (0.01)\nL         2.41 (0.26)   3.91 (0.59)   64.56 (3.90)   0.39 (0.02)\nH         0.34 (0.03)   0.88 (0.07)   4.03 (1.15)    0.15 (0.02)\n\n100 labeled points\nMethod    h-trans.      v-trans.      rotation       thickness\nK         0.39 (0.10)   0.48 (0.08)   26.02 (2.98)   0.01 (0.00)\nL         1.17 (0.13)   2.20 (0.22)   30.73 (6.05)   0.34 (0.01)\nH         0.16 (0.03)   0.39 (0.07)   1.48 (0.26)    0.06 (0.01)\n\n^2 In this figure, each parameter is normalized to lie in the unit interval while the regression was performed in the original scale. The point (0.5, 0.5) corresponds roughly to the origin in the original parameters.\n\nDigit Dataset. In the first set of experiments, we generated 10000 random samples of artificially generated images (size 28 \u00d7 28) of the digit 1. There are four variations in the data: translation (two variations), rotation and line thickness. For this dataset we are doing semi-supervised dimensionality reduction since the task is to estimate the natural parameters which were used to generate the digits. This is done based on 50 and 100 labeled images. Each of the variations then corresponds to a separate regression problem, and we finally combine the results to get an embedding into four dimensions. Note that this dataset is quite challenging since translation of the digit leads to huge Euclidean distances between digits although they look visually very similar. Fig. 2 and Table 1 summarize the results. As observed in the first row of Fig. 2, KRR (K) and Hessian (H) regularization recover well the two parameters of line width and rotation (all other embeddings can be found in the supplementary material). As discussed previously, the Laplacian (L) regularization tends to shrink the estimated parameters towards a constant as it penalizes the \u201cgeodesic\u201d functions. This results in the poor estimation of parameters, especially the line-thickness parameter.^2 Although KRR estimates the thickness parameter well, it fails for the rotation parameter (cf. the second row of Fig. 2 where we show the images corresponding to equidistant inter/extrapolation in the estimated parameter space between two fixed digits (inverted image)). 
The Hessian regularization provided a moderate level of accuracy in recovering the thickness parameter and performed best on the remaining ones.\n\nFigure Dataset. The second dataset consists of 2500 views of a toy figure (see Fig. 3) sampled based on regular intervals in zenith and azimuth angles on the upper hemisphere around the centered object [10]. Fig. 3 shows the results of regression for three parameters: the zenith angle, and the azimuth angle transformed into Euclidean x,y coordinates.^3 Both Laplacian and Hessian regularizers provided significantly better estimation of the parameters in comparison to KRR, which demonstrates the effectiveness of semi-supervised regression. However, the Laplacian again shows contracting behavior, which is observed in the top view of the hemisphere. Note that for our method this does not occur and the spacing of the points in the parameter space is much more regular, which again stresses the effectiveness of our proposed regularizer.\n\n^3 Although the underlying manifold is two dimensional, the parametrization cannot be directly found based on regression as the azimuth angle is periodic. This results in contradicting assignments of ground truth labels.\n\nFigure 3: Results of regression on the figure dataset. First row: embedding in the three-dimensional space with 50 labels. Second row: Left: some example images of the dataset, Right: error plots for each regression variable for different numbers of labeled points.\n\nFigure 4: Example of image colorization with 30 labels. Panels: original image, KRR, L, H, Levin et al. [8]. KRR failed in reconstructing (the color of) the red pepper at the lower-right corner, while the Laplacian regularizer produced an overall greenish image. Levin et al.\u2019s method recovered the lower central part well but failed to reconstruct the upper central pepper. Despite the slight diffusion of red color at the upper-left corner, overall, the result of Hessian regularization looks best, which is also confirmed by the reconstruction error.\n\nImage Colorization. Image colorization refers to the task of estimating the color components of a given gray level image. Often, this problem is approached based on the color information of a subset of pixels in the image, which is specified by a user (cf. [8] for more details). This is essentially a semi-supervised regression problem where the user-specified color components correspond to the labels. To facilitate quantitative evaluation, we adopted 20 color images, sampled a subset of pixels in each image as labels, and used the corresponding gray levels as inputs. The number of labeled points was 30 and 100 for each image, which we regard as a moderate level of user intervention. As error measure, we use the mean square distance between the original image and the corresponding reconstruction in the RGB space. During the colorization, we go over to the YUV color model such that the Y components, containing the gray level values, are used as the input, based on which the U and V components are estimated. The estimated U-V components are then combined with the Y component and converted into RGB format. For the regression, for each pixel, we use as features the 3 \u00d7 3 image patch centered at the pixel of interest plus the 2-dimensional coordinate value of that pixel. The coordinate values are weighted by 10 such that the contribution of coordinate values and gray levels is balanced. For comparison, we performed experiments with the method of Levin et al. [8] as one of the state-of-the-art methods.^4 Figure 4 shows an example and Table 2 summarizes the results. The Hessian regularizer clearly outperformed the KRR and the Laplacian-based regression and produced slightly better results than those of Levin et al. [8]. 
We expect that\nthe performance can be further improved by exploiting a priori knowledge of the structure of natural\nimages (e.g., by exploiting the segmentation information (cf. [9, 6]) in the NN structure).\n\nFootnote 4: Code is available at: http://www.cs.huji.ac.il/~yweiss/Colorization/.\n\nTable 2: Results on colorization: mean squared error (standard deviation), both in units of 10^{-3}. K: KRR, L: Laplacian regularization, H: Hessian regularization.\n\n# labels | K | L | H | Levin et al. [8]\n30 | 1.18 (1.10) | 0.83 (0.64) | 0.64 (0.50) | 0.74 (0.61)\n100 | 0.66 (0.65) | 0.50 (0.33) | 0.32 (0.25) | 0.37 (0.26)\n\n[Figure 3 plots. Embedding panels: ground truth, KRR, Laplacian regularization, Hessian regularization. Error plots: error versus number of labeled points (10, 25, 50, 100) for the x coord., y coord., and zenith coord., each comparing Laplacian regularization, KRR, and Hessian regularization.]\n\nReferences\n\n[1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373\u20131396, 2003.\n\n[2] M. Belkin and P. Niyogi. Semi-supervised learning on manifolds. Machine Learning, 56:209\u2013239, 2004.\n\n[3] D. Donoho and C. Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. of the National Academy of Sciences, 100(10):5591\u20135596, 2003.\n\n[4] J. Eells and L. Lemaire. Selected topics in harmonic maps. AMS, Providence, RI, 1983.\n\n[5] M. Hein. Uniform convergence of adaptive graph-based regularization. In G. Lugosi and H. Simon, editors, Proc. of the 19th Conf. on Learning Theory (COLT), pages 50\u201364, Berlin, 2006. Springer.\n\n[6] R. Irony, D. Cohen-Or, and D. Lischinski. Colorization by example. In Proc. Eurographics Symposium on Rendering, pages 201\u2013210, 2005.\n\n[7] J. M. Lee. 
Riemannian Manifolds - An introduction to curvature. Springer, New York, 1997.\n\n[8] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In Proc. SIGGRAPH, pages 689\u2013694, 2004.\n\n[9] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum. Natural image colorization. In Proc. Eurographics Symposium on Rendering, pages 309\u2013320, 2007.\n\n[10] G. Peters. Efficient pose estimation using view-based object representations. Machine Vision and Applications, 16(1):59\u201363, 2004.\n\n[11] F. Steinke and M. Hein. Non-parametric regression between Riemannian manifolds. In Advances in Neural Information Processing Systems, pages 1561\u20131568, 2009.\n\n[12] J. J. Verbeek and N. Vlassis. Gaussian fields for semi-supervised regression and correspondence learning. Pattern Recognition, 39:1864\u20131875, 2006.\n\n[13] X. Yang, H. Fu, H. Zha, and J. Barlow. Semi-supervised nonlinear dimensionality reduction. In Proc. of the 23rd international conference on Machine learning, pages 1065\u20131072, New York, NY, USA, 2006. ACM.\n", "award": [], "sourceid": 847, "authors": [{"given_name": "Kwang", "family_name": "Kim", "institution": null}, {"given_name": "Florian", "family_name": "Steinke", "institution": null}, {"given_name": "Matthias", "family_name": "Hein", "institution": null}]}