{"title": "Supervised Learning with Similarity Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 215, "page_last": 223, "abstract": "We address the problem of general supervised learning when data can only be accessed through an (indefinite) similarity function between data points. Existing work on learning with indefinite kernels has concentrated solely on binary/multiclass classification problems. We propose a model that is generic enough to handle any supervised learning task and also subsumes the model previously proposed for classification. We give a ''goodness'' criterion for similarity functions w.r.t. a given supervised learning task and then adapt a well-known landmarking technique to provide efficient algorithms for supervised learning using ''good'' similarity functions. We demonstrate the effectiveness of our model on three important supervised learning problems: a) real-valued regression, b) ordinal regression and c) ranking where we show that our method guarantees bounded generalization error. Furthermore, for the case of real-valued regression, we give a natural goodness definition that, when used in conjunction with a recent result in sparse vector recovery, guarantees a sparse predictor with bounded generalization error. 
Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs.", "full_text": "Supervised Learning with Similarity Functions\n\nPurushottam Kar\n\nIndian Institute of Technology\n\nKanpur, INDIA\n\npurushot@cse.iitk.ac.in\n\nPrateek Jain\n\nMicrosoft Research Lab\n\nBangalore, INDIA\n\nprajain@microsoft.com\n\nAbstract\n\nWe address the problem of general supervised learning when data can only be ac-\ncessed through an (inde\ufb01nite) similarity function between data points. Existing\nwork on learning with inde\ufb01nite kernels has concentrated solely on binary/multi-\nclass classi\ufb01cation problems. We propose a model that is generic enough to handle\nany supervised learning task and also subsumes the model previously proposed for\nclassi\ufb01cation. We give a \u201cgoodness\u201d criterion for similarity functions w.r.t. a given\nsupervised learning task and then adapt a well-known landmarking technique to\nprovide ef\ufb01cient algorithms for supervised learning using \u201cgood\u201d similarity func-\ntions. We demonstrate the effectiveness of our model on three important super-\nvised learning problems: a) real-valued regression, b) ordinal regression and c)\nranking where we show that our method guarantees bounded generalization error.\nFurthermore, for the case of real-valued regression, we give a natural goodness\nde\ufb01nition that, when used in conjunction with a recent result in sparse vector re-\ncovery, guarantees a sparse predictor with bounded generalization error. 
Finally, we report results of our learning algorithms on regression and ordinal regression tasks using non-PSD similarity functions and demonstrate the effectiveness of our algorithms, especially that of the sparse landmark selection algorithm that achieves significantly higher accuracies than the baseline methods while offering reduced computational costs.

1 Introduction

The goal of this paper is to develop an extended framework for supervised learning with similarity functions. Kernel learning algorithms [1] have become the mainstay of discriminative learning, with an enormous amount of effort invested from both the theoretician's and the practitioner's side. However, these algorithms typically require the similarity function to be a positive semi-definite (PSD) function, which can be a limiting factor for several applications. There are several reasons for this: 1) Mercer's condition is a formal statement that is hard to verify, 2) several natural notions of similarity that arise in practical scenarios are not PSD, and 3) it is not clear why an artificial constraint like PSD-ness should limit the usability of a kernel.

Several recent papers have demonstrated that indefinite similarity functions can indeed be successfully used for learning [2, 3, 4, 5]. However, most of the existing work focuses on classification tasks and provides specialized techniques for the same, albeit with little or no theoretical guarantees. A notable exception is the line of work by [6, 7, 8] that defines a goodness criterion for a similarity function and then provides an algorithm that can exploit this goodness criterion to obtain provably accurate classifiers.
However, their definitions are yet again restricted to the problem of classification as they take a "margin"-based view of the problem that requires positive points to be more similar to positive points than to negative points by at least a constant margin.

In this work, we instead take a "target-value" point of view and require that target values of similar points be similar. Using this view, we propose a generic goodness definition that also admits the goodness definition of [6] for classification as a special case. Furthermore, our definition can be seen as imposing the existence of a smooth function over a generic space defined by similarity functions, rather than over a Hilbert space as required by typical goodness definitions of PSD kernels.

We then adapt the landmarking technique of [6] to provide an efficient algorithm that reduces learning tasks to corresponding learning problems over a linear space. The main technical challenge at this stage is to show that such reductions are able to provide good generalization error bounds for the learning tasks at hand. To this end, we consider three specific problems: a) regression, b) ordinal regression, and c) ranking. For each problem, we define appropriate surrogate loss functions, and show that our algorithm is able to, for each specific learning task, guarantee bounded generalization error with polynomial sample complexity. Moreover, by adapting a general framework given by [9], we show that our goodness definitions are not overly restrictive, in that they admit all good PSD kernels as well.

For the problem of real-valued regression, we additionally provide a goodness definition that captures the intuition that usually, only a small number of landmarks are influential w.r.t. the learning task.
However, to recover these landmarks, the uniform sampling technique would require sampling a large number of landmarks, thus increasing the training/test time of the predictor. We address this issue by applying a sparse vector recovery algorithm given by [10] and show that the resulting sparse predictor still has bounded generalization error.

We also address an important issue faced by algorithms that use landmarking as a feature construction step, viz. [6, 7, 8], namely that they typically assume separate landmark and training sets for ease of analysis. In practice, however, one usually tries to overcome paucity of training data by reusing training data as landmark points as well. We use an argument outlined in [11] to theoretically justify such "double dipping" in our case. The details of the argument are given in Appendix B.

We perform several experiments on benchmark datasets that demonstrate significant performance gains for our methods over the baseline of kernel regression. Our sparse landmark selection technique provides significantly better predictors that are also more efficient at test time.

Related Work: Existing approaches to extend kernel learning algorithms to indefinite kernels can be classified into three broad categories: a) those that use indefinite kernels directly with existing kernel learning algorithms, resulting in non-convex formulations [2, 3]; b) those that convert a given indefinite kernel into a PSD one by either projecting onto the PSD-cone [4, 5] or performing other spectral operations [12]. The second approach is usually expensive due to the spectral operations involved, and it also makes the method inherently transductive. Moreover, any domain knowledge stored in the original kernel is lost due to these task-oblivious operations and consequently, no generalization guarantees can be given.
c) those that use notions of "task-kernel alignment" or, equivalently, notions of "goodness" of a kernel, to give learning algorithms [6, 7, 8]. This approach enjoys several advantages over the other approaches listed above. These models are able to use the indefinite kernel directly with existing PSD kernel learning techniques, all the while retaining the ability to give generalization bounds that quantitatively parallel those of PSD kernel learning models. In this paper, we adopt the third approach for the general supervised learning problem.

2 Problem formulation and Preliminaries

The goal in similarity-based supervised learning is to closely approximate a target predictor y : X → Y over some domain X using a hypothesis f̂(· ; K) : X → Y that restricts its interaction with data points to computing similarity values given by K. Now, if the similarity function K is not discriminative enough for the given task then we cannot hope to construct a predictor out of it that enjoys good generalization properties. Hence, it is natural to define the "goodness" of a given similarity function with respect to the learning task at hand.

Definition 1 (Good similarity function: preliminary). Given a learning task y : X → Y over some distribution D, a similarity function K : X × X → R is said to be (ε0, B)-good with respect to this task if there exists some bounded weighting function w : X → [−B, B] such that for at least a (1 − ε0) D-fraction of the domain, we have y(x) = E_{x′~D}[w(x′) y(x′) K(x, x′)].

The above definition is inspired by the definition of a "good" similarity function with respect to classification tasks given in [6]. However, their definition is tied to class labels and thus applies only to classification tasks. Similar to [6], the above definition calls a similarity function K "good" if the target value y(x) of a given point x can be approximated in terms of (a weighted combination of) the target values of the K-"neighbors" of x. Also, note that this definition automatically enforces a smoothness prior on the framework.

However, the above definition is too rigid. Moreover, it defines goodness in terms of violations, a non-convex loss function. To remedy this, we propose an alternative definition that incorporates an arbitrary (but in practice always convex) loss function.

Definition 2 (Good similarity function: final). Given a learning task y : X → Y over some distribution D, a similarity function K is said to be (ε0, B)-good with respect to a loss function ℓ_S : R × Y → R if there exists some bounded weighting function w : X → [−B, B] such that if we define a predictor as f(x) := E_{x′~D}[w(x′) K(x, x′)], then we have E_{x~D}[ℓ_S(f(x), y(x))] ≤ ε0.

Algorithm 1 Supervised learning with Similarity functions
Input: A target predictor y : X → Y over a distribution D, an (ε0, B)-good similarity function K, labeled training points sampled from D: T = {(x^t_1, y_1), . . . , (x^t_n, y_n)}, loss function ℓ_S : R × Y → R+.
Output: A predictor f̂ : X → R with bounded true loss over D
1: Sample d unlabeled landmarks from D: L = {x^l_1, . . . , x^l_d} // Else subsample d landmarks from T (see Appendix B for details)
2: Ψ_L : x ↦ (1/√d) (K(x, x^l_1), . . . , K(x, x^l_d)) ∈ R^d
3: ŵ = arg min_{w ∈ R^d : ‖w‖_2 ≤ B} Σ^n_{i=1} ℓ_S(⟨w, Ψ_L(x^t_i)⟩, y_i)
4: return f̂ : x ↦ ⟨ŵ, Ψ_L(x)⟩

Note that Definition 2 reduces to Definition 1 for ℓ_S(a, b) = 1_{a≠b}.
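To make Definition 2 concrete, the following sketch empirically approximates the predictor f(x) = E_{x′~D}[w(x′) K(x, x′)] by a sample average over draws from D. The distribution, similarity and weight function below are illustrative assumptions (chosen so that the expectation provably equals the target), not an example from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: D is uniform on [-1, 1]^2, the target is linear,
# and K (PSD here purely for simplicity) and w are picked so that
# f(x) = E_{x'~D}[w(x') K(x, x')] reproduces y(x) exactly in expectation.
def K(x, xp):                 # linear similarity
    return float(x @ xp)

def y(x):                     # target predictor
    return 2.0 * x[0] - x[1]

def w(xp):                    # bounded weighting function
    return 3.0 * (2.0 * xp[0] - xp[1])

sample = rng.uniform(-1, 1, size=(20000, 2))   # draws from D

def f(x):
    """Empirical version of f(x) = E_{x'~D}[w(x') K(x, x')]."""
    return np.mean([w(xp) * K(x, xp) for xp in sample])

# E[w(x')K(x,x')] = 3(2*x1*E[x1'^2] - x2*E[x2'^2]) = 2*x1 - x2 = y(x),
# since E[xi'^2] = 1/3 and E[x1'x2'] = 0 under D.
x = np.array([0.5, -0.25])
print(f(x), y(x))             # the two values should nearly agree
```

With 20,000 samples the Monte Carlo estimate matches the target up to a few percent, illustrating how "goodness" of K amounts to the existence of such a weighting function.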
Moreover, for the case of binary classification where y ∈ {−1, +1}, if we take ℓ_S(a, b) = 1_{ab ≤ Bγ}, then we recover the (ε0, γ)-goodness definition of a similarity function given in Definition 3 of [6]. Also note that, assuming sup_{x∈X} {|y(x)|} < ∞, we can w.l.o.g. merge w(x′) y(x′) into a single term w(x′).

Having given this definition, we must make sure that "good" similarity functions allow the construction of effective predictors (Utility property). Moreover, we must make sure that the definition does not exclude commonly used PSD kernels (Admissibility property). Below, we formally define these two properties and, in later sections, show that for each of the learning tasks considered, our goodness definition satisfies these two properties.

2.1 Utility

Definition 3 (Utility). A similarity function K is said to be ε0-useful w.r.t. a loss function ℓ_actual(·, ·) if the following holds: there exists a learning algorithm A that, for any ε1, δ > 0, when given poly(1/ε1, log(1/δ)) "labeled" and "unlabeled" samples from the input distribution D, with probability at least 1 − δ, generates a hypothesis f̂(x; K) s.t. E_{x~D}[ℓ_actual(f̂(x), y(x))] ≤ ε0 + ε1. Note that f̂(x; K) is restricted to access the data solely through K.

Here, the ε0 term captures the misfit or the bias of the similarity function with respect to the learning problem. Notice that the above utility definition allows for learning from unlabeled data points and thus puts our approach in the semi-supervised learning framework.

All our utility guarantees proceed by first using unlabeled samples as landmarks to construct a landmarked space. Next, using the goodness definition, we show the existence of a good linear predictor in the landmarked space.
This guarantee is obtained in two steps as outlined in Algorithm 1: first of all, we choose d unlabeled landmark points and construct a map Ψ : X → R^d (see Step 1 of Algorithm 1) and show that there exists a linear predictor over R^d that closely approximates the predictor f used in Definition 2 (see Lemma 15 in Appendix A). In the second step, we learn a predictor (over the landmarked space) using ERM over a fresh labeled training set (see Step 3 of Algorithm 1). We then use individual task-specific arguments and Rademacher average-based generalization bounds [13], thus proving the utility of the similarity function.

2.2 Admissibility

In order to show that our models are not too rigid, we shall prove that they admit good PSD kernels. The notion of a good PSD kernel for us will be one that corresponds to a prevalent large margin technique for the given problem. In general, most notions correspond to the existence of a linear operator in the RKHS of the kernel that has small loss at large margin. More formally,

Definition 4 (Good PSD Kernel). Given a learning task y : X → Y over some distribution D, a PSD kernel K : X × X → R with associated RKHS H_K and canonical feature map Φ_K : X → H_K is said to be (ε0, γ)-good with respect to a loss function ℓ_K : R × Y → R if there exists W* ∈ H_K such that ‖W*‖ = 1 and

E_{x~D}[ℓ_K(⟨W*, Φ_K(x)⟩ / γ, y(x))] < ε0.

We will show, for all the learning tasks considered, that every (ε0, γ)-good PSD kernel, when treated as simply a similarity function with no consideration of its RKHS, is also (ε0 + ε1, B)-good for arbitrarily small ε1 with B = h(γ, ε1) for some function h.
To prove these results we will adapt techniques introduced in [9] with certain modifications and task-dependent arguments.

3 Applications

We will now instantiate the general learning model described above to real-valued regression, ordinal regression and ranking by providing utility and admissibility guarantees. Due to lack of space, we relegate all proofs as well as the discussion on ranking to the supplementary material (Appendix F).

3.1 Real-valued Regression

Real-valued regression is a quintessential learning problem [1] that has received a lot of attention in the learning literature. In the following we shall present algorithms for performing real-valued regression using non-PSD similarity measures. We consider the problem with ℓ_actual(a, b) = |a − b| as the true loss function. For the surrogates ℓ_S and ℓ_K, we choose the ε-insensitive loss function [1] defined as follows:

ℓ_ε(a, b) = ℓ_ε(a − b) = { 0, if |a − b| < ε; |a − b| − ε, otherwise. }

The above loss function automatically gives us notions of good kernels and similarity functions by appealing to Definitions 4 and 2 respectively. It is easy to transfer error bounds in terms of absolute error to those in terms of mean squared error (MSE), a commonly used performance measure for real-valued regression. See Appendix D for further discussion on the choice of the loss function.

Using the landmarking strategy described in Section 2.1, we can reduce the problem of real-valued regression to that of a linear regression problem in the landmarked space. More specifically, the ERM step in Algorithm 1 becomes the following:

arg min_{w ∈ R^d : ‖w‖_2 ≤ B} Σ_i ℓ_ε(⟨w, Ψ_L(x_i)⟩ − y_i).

There exist solvers (for instance [14]) to efficiently solve the above problem on linear spaces.
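As an illustration, the reduction just described — construct landmark features via K, then minimize ε-insensitive loss over a norm-bounded linear predictor — can be sketched as follows. This is a minimal sketch, not the solver used in the experiments: the similarity function, landmark count, and the plain projected-subgradient loop are illustrative assumptions.

```python
import numpy as np

def landmark_features(X, landmarks, K):
    """Map x to (1/sqrt(d)) * [K(x, l_1), ..., K(x, l_d)]."""
    d = len(landmarks)
    return np.array([[K(x, l) for l in landmarks] for x in X]) / np.sqrt(d)

def eps_insensitive_fit(Phi, y, B=10.0, eps=0.01, lr=0.05, iters=500):
    """Minimize sum_i max(|<w, phi_i> - y_i| - eps, 0) subject to
    ||w||_2 <= B, via projected subgradient descent (illustrative)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        resid = Phi @ w - y
        active = np.abs(resid) > eps          # points outside the tube
        grad = Phi[active].T @ np.sign(resid[active])
        w -= lr * grad / max(active.sum(), 1)
        norm = np.linalg.norm(w)
        if norm > B:                          # project back onto the B-ball
            w *= B / norm
    return w

# Toy run with an indefinite (negative Manhattan) similarity.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5])
K = lambda a, b: -np.abs(a - b).sum()
landmarks = X[rng.choice(len(X), size=30, replace=False)]
Phi = landmark_features(X, landmarks, K)
w = eps_insensitive_fit(Phi, y)
print(np.mean(np.abs(Phi @ w - y)))   # mean absolute training error
```

In practice a dedicated linear solver (such as the one cited as [14]) replaces the subgradient loop; the point of the sketch is only the shape of the reduction.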
Using proof techniques sketched in Section 2.1, along with specific arguments for the ε-insensitive loss, we can prove generalization guarantees and hence utility guarantees for the similarity function.

Theorem 5. Every similarity function that is (ε0, B)-good for a regression problem with respect to the insensitive loss function ℓ_ε(·, ·) is (ε0 + ε)-useful with respect to absolute loss as well as (Bε0 + Bε)-useful with respect to mean squared error. Moreover, both the dimensionality of the landmarked space as well as the labeled sample complexity can be bounded by O((B²/ε1²) log(1/δ)).

We are also able to prove the following (tight) admissibility result:

Theorem 6. Every PSD kernel that is (ε0, γ)-good for a regression problem is, for any ε1 > 0, (ε0 + ε1, O(1/(γ²ε1²)))-good as a similarity function as well. Moreover, for any ε1 < 1/2 and any γ < 1, there exists a regression instance and a corresponding kernel that is (0, γ)-good for the regression problem but only (ε1, B)-good as a similarity function for B = Ω(1/(γ²ε1²)).

3.2 Sparse regression models

An artifact of a random choice of landmarks is that very few of them might turn out to be "informative" with respect to the prediction problem at hand. For instance, in a network, there might exist hubs or authoritative nodes that yield rich information about the learning problem. If the relative abundance of such nodes is low then random selection would compel us to choose a large number of landmarks before enough "informative" ones have been collected.

However, this greatly increases training and testing times due to the increased costs of constructing the landmarked space. Thus, the ability to prune away irrelevant landmarks would speed up training and test routines.
We note that this issue has been addressed before in literature [8, 12] by way of landmark selection heuristics. In contrast, we guarantee that our predictor will select a small number of landmarks while incurring bounded generalization error. However, this requires a careful restructuring of the learning model to incorporate the "informativeness" of landmarks.

Definition 7. A similarity function K is said to be (ε0, B, τ)-good for a real-valued regression problem y : X → R if for some bounded weight function w : X → [−B, B] and choice function R : X → {0, 1} with E_{x~D}[R(x)] = τ, the predictor f : x ↦ E_{x′~D}[w(x′) K(x, x′) | R(x′)] has bounded ε-insensitive loss, i.e. E_{x~D}[ℓ_ε(f(x), y(x))] < ε0.

The role of the choice function is to single out informative landmarks, while τ specifies the relative density of informative landmarks. Note that the above definition is similar in spirit to the goodness definition presented in [15]. While the motivation behind [15] was to give an improved admissibility result for binary classification, we squarely focus on the utility guarantees, with the aim of accelerating our learning algorithms via landmark pruning.

We prove the utility guarantee in three steps as outlined in Appendix D. First, we use the usual landmarking step to project the problem onto a linear space. This step guarantees the following:

Theorem 8. Given a similarity function that is (ε0, B, τ)-good for a regression problem, there exists a randomized map Ψ : X → R^d for d = O((B²/(τε1²)) log(1/δ)) such that with probability at least 1 − δ, there exists a linear operator f̃ : x ↦ ⟨w, x⟩ over R^d such that ‖w‖_1 ≤ B with ε-insensitive loss bounded by ε0 + ε1. Moreover, with the same confidence we have ‖w‖_0 ≤ 3dτ/2.

Our proof follows that of [15]; however, we additionally prove sparsity of w as well. The number of landmarks required here is a Ω(1/τ) fraction greater than that required by Theorem 5. This formally captures the intuition presented earlier of a small fraction of dimensions (read landmarks) being actually relevant to the learning problem. So, in the second step, we use the Forward Greedy Selection algorithm given in [10] to learn a sparse predictor. The use of this learning algorithm necessitates the use of a different generalization bound in the final step to complete the utility guarantee given below. We refer the reader to Appendix D for the details of the algorithm and its utility analysis.

Theorem 9. Every similarity function that is (ε0, B, τ)-good for a regression problem with respect to the insensitive loss function ℓ_ε(·, ·) is (ε0 + ε)-useful with respect to absolute loss as well, with the dimensionality of the landmarked space being bounded by O((B²/(τε1²)) log(1/δ)) and the labeled sample complexity being bounded by O((B²/ε1²) log(B/ε1)). Moreover, this utility can be achieved by an O(τ)-sparse predictor on the landmarked space.

We note that the improvements obtained here by using the sparse learning methods of [10] provide a Ω(τ) increase in sparsity. We now prove admissibility results for this sparse learning model. We do this by showing that the dense model analyzed in Theorem 5 and that given in Definition 7 can be interpreted in terms of each other for an appropriate selection of parameters. The guarantees in Theorem 6 can then be invoked to conclude the admissibility proof.

Theorem 10. Every (ε0, B)-good similarity function K is also (ε0, B, w̄/B)-good where w̄ = E_{x~D}[|w(x)|]. Moreover, every (ε0, B, τ)-good similarity function K is also (ε0, B/τ)-good.

Using Theorem 6, we immediately have the following corollary:

Corollary 11. Every PSD kernel that is (ε0, γ)-good for a regression problem is, for any ε1 > 0, (ε0 + ε1, O(1/(γ²ε1²)), 1)-good as a similarity function as well.

3.3 Ordinal Regression

The problem of ordinal regression requires an accurate prediction of (discrete) labels coming from a finite ordered set [r] = {1, 2, . . . , r}. The problem is similar to both classification and regression, but has some distinct features due to which it has received independent attention [16, 17] in domains such as product ratings. The most popular performance measure for this problem is the absolute loss, which is the absolute difference between the predicted and the true labels.

A natural and rather tempting way to solve this problem is to relax the problem to real-valued regression and threshold the output of the learned real-valued predictor using predefined thresholds b_1, . . . , b_r to get discrete labels. Although this approach has been prevalent in literature [17], as the discussion in the supplementary material shows, this leads to poor generalization guarantees in our model. More specifically, a goodness definition constructed around such a direct reduction is only able to ensure (ε0 + 1)-utility, i.e. the absolute error rate is always greater than 1.

One of the reasons for this is the presence of the thresholding operation that makes it impossible to distinguish between instances that would not be affected by small perturbations to the underlying real-valued predictor and those that would. To remedy this, we enforce a (soft) margin with respect to thresholding that makes the formulation more robust to noise.
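Operationally, the threshold-based prediction rule described above (predict label i when the real-valued score falls between b_i and b_{i+1}) can be sketched as follows; the thresholds, scores and true labels here are illustrative values, not data from the paper.

```python
import numpy as np

def to_ordinal(scores, thresholds):
    """Assign label i+1 when the score exceeds thresholds[0..i]; with r-1
    increasing thresholds this yields labels in {1, ..., r}."""
    scores = np.asarray(scores)[:, None]
    return 1 + (scores > np.asarray(thresholds)[None, :]).sum(axis=1)

thresholds = [0.5, 1.5, 2.5]             # equi-spaced, as in the experiments
scores = np.array([0.2, 0.9, 2.4, 3.7])  # outputs of a real-valued predictor
labels = to_ordinal(scores, thresholds)
print(labels)                            # [1 2 3 4]

# Absolute (ordinal regression) loss against some true labels:
abs_err = np.abs(labels - np.array([1, 2, 4, 4])).mean()
print(abs_err)                           # 0.25
```

The margin-based goodness criterion that follows asks for more than this hard rule: scores should land a margin γ inside their threshold interval, which is what makes the formulation robust to small perturbations of the predictor.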
More formally, we expect that if a point belongs to the label i, then in addition to being sandwiched between the thresholds b_i and b_{i+1}, it should be separated from these by a margin as well, i.e. b_i + γ ≤ f(x) ≤ b_{i+1} − γ. This is a direct generalization of the margin principle in classification, where we expect w⊤x > b + γ for positively labeled points and w⊤x < b − γ for negatively labeled points. Of course, whereas classification requires a single threshold, we require several, depending upon the number of labels.

For any x ∈ R, let [x]_+ = max{x, 0}. Thus, if we define the γ-margin loss function to be [x]_γ := [γ − x]_+ (note that this is simply the well-known hinge loss function scaled by a factor of γ), we can define our goodness criterion as follows:

Definition 12. A similarity function K is said to be (ε0, B)-good for an ordinal regression problem y : X → [r] if for some bounded weight function w : X → [−B, B] and some (unknown but fixed) set of thresholds {b_i}^r_{i=1} with b_1 = −∞, the predictor f : x ↦ E_{x′~D}[w(x′) K(x, x′)] satisfies

E_{x~D}[[f(x) − b_{y(x)}]_γ + [b_{y(x)+1} − f(x)]_γ] < ε0.

We now give utility guarantees for our learning model. We shall give guarantees on both the misclassification error as well as the absolute error of our learned predictor. We say that a set of points x_1, . . . , x_i, . . . is Δ-spaced if min_{i≠j} {|x_i − x_j|} ≥ Δ. Define the function ζ(x) = x/(x + 1).

Theorem 13. Let K be a similarity function that is (ε0, B)-good for an ordinal regression problem with respect to Δ-spaced thresholds and γ-margin loss. Let γ̄ = max{γ, 1}. Then K is (ε0/γ̄)-useful with respect to ordinal regression error (absolute loss). Moreover, K is (ε0/γ̄)/ζ(Δ/γ̄)-useful with respect to the zero-one mislabeling error as well. We can bound both the dimensionality of the landmarked space as well as the labeled sample complexity by O((B²/ε1²) log(1/δ)). Notice that for ε0 < 1 and large enough d, n, we can ensure that the ordinal regression error rate is also bounded above by 1, since sup_{x∈[0,1], Δ>0} x · ζ(Δ) = 1. This is in contrast with the direct reduction to real-valued regression, which has ordinal regression error rate bounded below by 1. This indicates the advantage of the present model over a naive reduction to regression.

We can show that our definition of a good similarity function admits all good PSD kernels as well. The kernel goodness criterion we adopt corresponds to the large margin framework proposed by [16]. We refer the reader to Appendix E.3 for the definition and give the admissibility result below.

Theorem 14. Every PSD kernel that is (ε0, γ)-good for an ordinal regression problem is also (γ1ε0 + ε1, O(γ1²/(γ²ε1²)))-good as a similarity function with respect to the γ1-margin loss for any γ1, ε1 > 0. Moreover, for any ε1 < 1/2, there exists an ordinal regression instance and a corresponding kernel that is (0, γ)-good for the ordinal regression problem but only (ε1, B)-good as a similarity function with respect to the γ1-margin loss function for B = Ω(γ1²/(γ²ε1²)).

(a) Mean squared error for landmarking (RegLand), sparse landmarking (RegLand-Sp) and kernel regression (KR)

(b) Avg.
absolute error for landmarking (ORLand) and kernel regression (KR) on ordinal regression datasets

Figure 1: Performance of landmarking algorithms with increasing number of landmarks on real-valued regression (Figure 1a) and ordinal regression (Figure 1b) datasets.

(a) Mean squared error for real regression

Dataset                          Sigmoid KR         Sigmoid Land-Sp    Manhattan KR       Manhattan Land-Sp
Abalone [18] (N=4177, d=8)       2.1e-02 (8.3e-04)  6.2e-03 (8.4e-04)  1.7e-02 (7.1e-04)  6.0e-03 (3.7e-04)
Bodyfat [19] (N=252, d=14)       4.6e-04 (6.5e-05)  9.5e-05 (1.3e-04)  3.9e-04 (2.2e-05)  3.5e-05 (1.3e-05)
CAHousing [19] (N=20640, d=8)    5.9e-02 (2.3e-04)  1.6e-02 (6.2e-04)  5.8e-02 (1.9e-04)  1.5e-02 (1.4e-04)
CPUData [20] (N=8192, d=12)      4.1e-02 (1.6e-03)  1.4e-03 (1.7e-04)  4.3e-02 (1.6e-03)  1.2e-03 (3.2e-05)
PumaDyn-8 [20] (N=8192, d=8)     2.3e-01 (4.6e-03)  1.4e-02 (4.5e-04)  2.3e-01 (4.5e-03)  1.4e-02 (4.8e-04)
PumaDyn-32 [20] (N=8192, d=32)   1.8e-01 (3.6e-03)  1.4e-02 (3.7e-04)  1.8e-01 (3.6e-03)  1.4e-02 (3.1e-04)

(b) Mean absolute error for ordinal regression

Dataset                          Sigmoid KR         Sigmoid ORLand     Manhattan KR       Manhattan ORLand
Wine-Red [18] (N=1599, d=11)     6.8e-01 (2.8e-02)  4.2e-01 (3.8e-02)  6.7e-01 (3.0e-02)  4.5e-01 (3.2e-02)
Wine-White [18] (N=4898, d=11)   6.2e-01 (2.0e-02)  8.9e-01 (8.5e-01)  6.2e-01 (2.0e-02)  4.9e-01 (1.5e-02)
Bank-8 [20] (N=8192, d=8)        2.9e+0 (6.2e-02)   6.1e-01 (4.4e-02)  2.7e+0 (6.6e-02)   6.3e-01 (1.7e-02)
Bank-32 [20] (N=8192, d=32)      2.7e+0 (1.2e-01)   1.6e+0 (2.3e-02)   2.6e+0 (8.1e-02)   1.6e+0 (9.4e-02)
House-8 [20] (N=22784, d=8)      2.8e+0 (9.3e-03)   1.5e+0 (2.0e-02)   2.7e+0 (1.0e-02)   1.4e+0 (1.2e-02)
House-16 [20] (N=22784, d=16)    2.7e+0 (2.0e-02)   1.5e+0 (1.0e-02)   2.8e+0 (2.0e-02)   1.4e+0 (2.3e-02)

Table 1: Performance of landmarking-based
algorithms (with 50 landmarks) vs. baseline kernel regression (KR). Values in parentheses indicate standard deviations. Values in the first column indicate dataset source (in brackets), size (N) and dimensionality (d).

Due to lack of space we refer the reader to Appendix F for a discussion on ranking models that includes utility and admissibility guarantees with respect to the popular NDCG loss.

4 Experimental Results

In this section we present an empirical evaluation of our learning models for the problems of real-valued regression and ordinal regression on benchmark datasets taken from a variety of sources [18, 19, 20]. In all cases, we compare our algorithms against kernel regression (KR), a well-known technique [21] for non-linear regression, whose predictor is of the form:

f : x ↦ (Σ_{x_i∈T} y(x_i) K(x, x_i)) / (Σ_{x_i∈T} K(x, x_i)),

where T is the training set. We selected KR as the baseline as it is a popular regression method that does not require similarity functions to be PSD. For ordinal regression problems, we rounded off the result of the KR predictor to get a discrete label. We implemented all our algorithms as well as the baseline KR method in Matlab. In all our experiments we report results across 5 random splits on the (indefinite) Sigmoid kernel K(x, y) = tanh(a⟨x, y⟩ + r) and the Manhattan kernel K(x, y) = −‖x − y‖₁. Following standard practice, we fixed r = −1 and a = 1/d_orig for the Sigmoid kernel, where d_orig is the dimensionality of the dataset.

Real-valued regression: For this experiment, we compare our methods (RegLand and RegLand-Sp) with the KR method. For RegLand, we constructed the landmarked space as specified in Algorithm 1 and learned a linear predictor using the LIBLINEAR package [14] that minimizes ε-insensitive loss. In the second algorithm (RegLand-Sp), we used the sparse learning algorithm of [10] on the landmarked space to learn the best predictor for a given sparsity level.
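For reference, the KR baseline and the two indefinite similarities can be sketched as follows. This is a minimal sketch under illustrative assumptions (random data, default parameter values, and no guard against the similarity weights summing to zero), not the Matlab implementation used in the experiments.

```python
import numpy as np

def sigmoid_kernel(A, B, a=None, r=-1.0):
    """Indefinite sigmoid similarity tanh(a<x,y> + r); a defaults to
    1/d where d is the input dimensionality (defaults are illustrative)."""
    if a is None:
        a = 1.0 / A.shape[1]
    return np.tanh(a * (A @ B.T) + r)

def manhattan_kernel(A, B):
    """Indefinite Manhattan similarity -||x - y||_1."""
    return -np.abs(A[:, None, :] - B[None, :, :]).sum(axis=2)

def kernel_regression(X_train, y_train, X_test, kernel):
    """Nadaraya-Watson style baseline: similarity-weighted average of
    training labels, f(x) = sum_i y_i K(x, x_i) / sum_i K(x, x_i)."""
    S = kernel(X_test, X_train)
    return (S @ y_train) / S.sum(axis=1)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = np.sin(X[:, 0])
preds = kernel_regression(X, y, X[:5], sigmoid_kernel)
print(preds.shape)   # (5,)
```

Note the caveat baked into the KR form: because the similarities are indefinite, the denominator can in principle approach zero, which is one practical argument for the landmarking approach over this baseline.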
Due to its simplicity and good convergence properties, we implemented the Fully Corrective version of the Forward Greedy Selection algorithm with squared loss as the surrogate.
We evaluated all methods using Mean Squared Error (MSE) on the test set. Figure 1a shows the MSE incurred by our methods, along with reference values of the accuracies obtained by KR, as the number of landmarks increases. The plots clearly show that our methods incur significantly less error than KR. Moreover, RegLand-Sp learns more accurate predictors using the same number of landmarks. For instance, when learning with the Sigmoid kernel on the CPUData dataset at 20 landmarks, RegLand is able to guarantee an MSE of 0.016 whereas RegLand-Sp offers an MSE of less than 0.02; MLKR is only able to guarantee an MSE of 0.04 on this dataset. In Table 1a, we compare the accuracies of the two algorithms, given 50 landmark points, with those of KR for the Sigmoid and Manhattan kernels. We find that in all cases, RegLand-Sp gives superior accuracies to KR. Moreover, the Manhattan kernel seems to match or outperform the Sigmoid kernel on all the datasets.
Ordinal regression: Here, we compare our method with the baseline KR method on benchmark datasets. As mentioned in Section 3.3, our method uses the EXC formulation of [16] along with the landmarking scheme given in Algorithm 1. We implemented a gradient descent-based solver (ORLand) to solve the primal formulation of EXC and used fixed equi-spaced thresholds instead of learning them as suggested by [16]. Of the six datasets considered here, the two Wine datasets are ordinal regression datasets where the quality of the wine is to be predicted on a scale from 1 to 10. The remaining four datasets are regression datasets whose labels were subjected to equi-frequency binning to obtain ordinal regression datasets [16]. We measured the average absolute error (AAE) for each method.
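The two data-handling steps just described, equi-frequency binning of regression labels and prediction with fixed equi-spaced thresholds, can be sketched as follows. This is not the paper's EXC solver; the bin count, threshold range, and stand-in scores are illustrative assumptions.

```python
import numpy as np

def equifreq_bins(y, n_bins):
    # Equi-frequency binning: real labels -> ordinal labels in {1, ..., n_bins}
    edges = np.quantile(y, np.linspace(0, 1, n_bins + 1)[1:-1])
    return 1 + np.searchsorted(edges, y)

def ordinal_predict(score, thresholds):
    # With fixed thresholds t_1 < ... < t_{k-1}, the predicted label is
    # 1 + (number of thresholds the real-valued score exceeds)
    return 1 + int(np.sum(score > thresholds))

rng = np.random.default_rng(0)
y_real = rng.normal(size=1000)         # labels of a regression dataset
y_ord = equifreq_bins(y_real, 10)      # ordinal labels on a 1..10 scale

thresholds = np.linspace(-2.0, 2.0, 9) # 9 equi-spaced thresholds for 10 labels
scores = y_real                        # stand-in for learned predictor scores
y_pred = np.array([ordinal_predict(s, thresholds) for s in scores])
aae = np.mean(np.abs(y_pred - y_ord))  # average absolute error (AAE)
```

In the experiments the scores come from a predictor learned in the landmarked space; here a perfect scorer stands in so the snippet is self-contained.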
Figure 1b compares ORLand with KR as the number of landmarks increases. Table 1b compares the accuracies of ORLand, given 50 landmark points, with those of KR for the Sigmoid and Manhattan kernels. In almost all cases, ORLand gives much better performance than KR. The Sigmoid kernel seems to outperform the Manhattan kernel on a couple of datasets.
We refer the reader to Appendix G for additional experimental results.

5 Conclusion

In this work we considered the general problem of supervised learning using non-PSD similarity functions. We provided a goodness criterion for similarity functions w.r.t. various learning tasks. This allowed us to construct efficient learning algorithms with provable generalization error bounds. At the same time, we were able to show, for each learning task, that our criterion is not too restrictive in that it admits all good PSD kernels. We then focused on the problem of identifying influential landmarks with the aim of learning sparse predictors. We presented a model that formalizes the intuition that typically only a small fraction of the landmarks is influential for a given learning problem. We adapted existing sparse vector recovery algorithms within our model to learn provably sparse predictors with bounded generalization error. Finally, we empirically evaluated our learning algorithms on benchmark regression and ordinal regression tasks. In all cases, our learning methods, especially the sparse recovery algorithm, consistently outperformed the kernel regression baseline.
An interesting direction for future research would be learning good similarity functions à la metric learning or kernel learning. It would also be interesting to conduct large-scale experiments on real-world data such as social networks that naturally capture the notion of similarity amongst nodes.

Acknowledgments
P. K. is supported by a Microsoft Research India Ph.D. fellowship award. Part of this work was done while P. K.
was an intern at Microsoft Research Labs India, Bangalore.

References
[1] Bernhard Schölkopf and Alex J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
[2] Bernard Haasdonk. Feature Space Interpretation of SVMs with Indefinite Kernels. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(4):482–492, 2005.
[3] Cheng Soon Ong, Xavier Mary, Stéphane Canu, and Alexander J. Smola. Learning with non-positive Kernels. In 21st Annual International Conference on Machine Learning, 2004.
[4] Yihua Chen, Maya R. Gupta, and Benjamin Recht. Learning Kernels from Indefinite Similarities. In 26th Annual International Conference on Machine Learning, pages 145–152, 2009.
[5] Ronny Luss and Alexandre d'Aspremont. Support Vector Machine Classification with Indefinite Kernels. In 21st Annual Conference on Neural Information Processing Systems, 2007.
[6] Maria-Florina Balcan and Avrim Blum. On a Theory of Learning with Similarity Functions. In 23rd Annual International Conference on Machine Learning, pages 73–80, 2006.
[7] Liwei Wang, Cheng Yang, and Jufu Feng. On Learning with Dissimilarity Functions. In 24th Annual International Conference on Machine Learning, pages 991–998, 2007.
[8] Purushottam Kar and Prateek Jain. Similarity-based Learning via Data Driven Embeddings. In 25th Annual Conference on Neural Information Processing Systems, 2011.
[9] Nathan Srebro. How Good Is a Kernel When Used as a Similarity Measure? In 20th Annual Conference on Computational Learning Theory, pages 323–335, 2007.
[10] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading Accuracy for Sparsity in Optimization Problems with Sparsity Constraints. SIAM Journal on Optimization, 20(6):2807–2832, 2010.
[11] Shai Ben-David, Ali Rahimi, and Nathan Srebro.
Generalization Bounds for Indefinite Kernel Machines. In NIPS 2008 Workshop: New Challenges in Theoretical Machine Learning, 2008.
[12] Yihua Chen, Eric K. Garcia, Maya R. Gupta, Ali Rahimi, and Luca Cazzanti. Similarity-based Classification: Concepts and Algorithms. Journal of Machine Learning Research, 10:747–776, 2009.
[13] Sham M. Kakade, Karthik Sridharan, and Ambuj Tewari. On the Complexity of Linear Prediction: Risk Bounds, Margin Bounds, and Regularization. In 22nd Annual Conference on Neural Information Processing Systems, 2008.
[14] Chia-Hua Ho and Chih-Jen Lin. Large-scale Linear Support Vector Regression. http://www.csie.ntu.edu.tw/~cjlin/papers/linear-svr.pdf, retrieved on May 18, 2012.
[15] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro. Improved Guarantees for Learning via Similarity Functions. In 21st Annual Conference on Computational Learning Theory, pages 287–298, 2008.
[16] Wei Chu and S. Sathiya Keerthi. Support Vector Ordinal Regression. Neural Computation, 19(3):792–815, 2007.
[17] Shivani Agarwal. Generalization Bounds for Some Ordinal Regression Algorithms. In 19th International Conference on Algorithmic Learning Theory, pages 7–21, 2008.
[18] A. Frank and Arthur Asuncion. UCI Machine Learning Repository. http://archive.ics.uci.edu/ml, 2010. University of California, Irvine, School of Information and Computer Sciences.
[19] StatLib Dataset Repository. http://lib.stat.cmu.edu/datasets/. Carnegie Mellon University.
[20] Delve Dataset Repository. http://www.cs.toronto.edu/~delve/data/datasets.html. University of Toronto.
[21] Kilian Q. Weinberger and Gerald Tesauro. Metric Learning for Kernel Regression.
In 11th International Conference on Artificial Intelligence and Statistics, pages 612–619, 2007.