{"title": "A Consistent Regularization Approach for Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 4412, "page_last": 4420, "abstract": "We propose and analyze a regularization approach for structured prediction problems. We characterize a large class of loss functions that allows us to naturally embed structured outputs in a linear space. We exploit this fact to design learning algorithms using a surrogate loss approach and regularization techniques. We prove universal consistency and finite sample bounds characterizing the generalization properties of the proposed method. Experimental results are provided to demonstrate the practical usefulness of the proposed approach.", "full_text": "A Consistent Regularization Approach for Structured Prediction\n\nCarlo Ciliberto \u2217,1\ncciliber@mit.edu\n\nAlessandro Rudi \u2217,1,2\nale_rudi@mit.edu\n\nLorenzo Rosasco 1,2\nlrosasco@mit.edu\n\n1 Laboratory for Computational and Statistical Learning - Istituto Italiano di Tecnologia, Genova, Italy &\nMassachusetts Institute of Technology, Cambridge, MA 02139, USA.\n\n2 Universit\u00e0 degli Studi di Genova, Genova, Italy.\n\n\u2217Equal contribution\n\nAbstract\n\nWe propose and analyze a regularization approach for structured prediction problems.\nWe characterize a large class of loss functions that allows us to naturally\nembed structured outputs in a linear space. We exploit this fact to design learning\nalgorithms using a surrogate loss approach and regularization techniques. We\nprove universal consistency and finite sample bounds characterizing the generalization\nproperties of the proposed method. 
Experimental results are provided to\ndemonstrate the practical usefulness of the proposed approach.\n\n1 Introduction\n\nMany machine learning applications require dealing with datasets having complex structures, e.g.\nnatural language processing, image segmentation, reconstruction or captioning, pose estimation,\nand protein folding prediction, to name a few [1, 2, 3]. Structured prediction problems pose a challenge\nfor classic off-the-shelf learning algorithms for regression or binary classification. This has motivated\nthe extension of methods such as support vector machines to structured problems [4]. Dealing\nwith structured prediction problems is also a challenge for learning theory. While the theory of\nempirical risk minimization provides a very general statistical framework, in practice it needs to be\ncomplemented with an ad-hoc analysis for each specific setting. Indeed, in the last few years, an\neffort has been made to analyze specific structured problems, such as multiclass classification [5],\nmulti-labeling [6], ranking [7] or quantile estimation [8]. A natural question is whether a unifying\nlearning framework can be developed to address a wide range of problems from theory to algorithms.\nThis paper takes a step in this direction, proposing and analyzing a general regularization approach\nto structured prediction. Our starting observation is that for a large class of these problems, we can\ndefine a natural embedding of the associated loss functions into a linear space. This allows us to define\na (least squares) surrogate problem for the original structured one, which is cast within a multi-output\nregularized learning framework [9, 10, 11]. We prove that by solving the surrogate, we are able to\nrecover the exact solution of the original structured problem. The corresponding algorithm essentially\ngeneralizes approaches considered in [12, 13, 14, 15, 16]. 
We study the generalization properties of\nthe proposed approach, establishing universal consistency as well as finite sample bounds.\nThe rest of this paper is organized as follows: in Sec. 2 we introduce the structured prediction problem\nin its generality and present our algorithm to approach it. In Sec. 3 we introduce and discuss a\nsurrogate framework for structured prediction, from which we derive our algorithm. In Sec. 4, we\nanalyze the theoretical properties of the proposed algorithm. In Sec. 5 we draw connections with\nprevious work in structured prediction. Sec. 6 reports promising experimental results on a variety of\nstructured prediction problems. Sec. 7 concludes the paper outlining relevant directions for future\nresearch.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n2 A Regularization Approach to Structured Prediction\n\nThe goal of supervised learning is to learn functional relations f : X \u2192 Y between two sets X , Y,\ngiven a finite number of examples. In particular, in this work we are interested in structured prediction,\nnamely the case where Y is a set of structured outputs (such as histograms, graphs, time sequences,\npoints on a manifold, etc.). Moreover, structure on Y can be implicitly induced by a suitable loss\n\u25b3 : Y \u00d7 Y \u2192 R (such as edit distance, ranking error, geodesic distance, indicator function of a\nsubset, etc.). Then, the problem of structured prediction becomes\n\nminimize_{f : X \u2192 Y} E(f ), with E(f ) = \u222b_{X \u00d7 Y} \u25b3(f (x), y) d\u03c1(x, y)   (1)\n\nand the goal is to find a good estimator for the minimizer of the above equation, given a finite number\nof (training) points {(xi, yi)}_{i=1}^n sampled from an unknown probability distribution \u03c1 on X \u00d7 Y. In\nthe following we introduce an estimator \u02c6f : X \u2192 Y to approach Eq. (1). 
The rest of this paper is\ndevoted to proving that \u02c6f is a consistent estimator for a minimizer of Eq. (1).\nOur Algorithm for Structured Prediction. In this paper we propose and analyze the following\nestimator\n\n\u02c6f (x) = argmin_{y \u2208 Y} \u2211_{i=1}^n \u03b1i(x) \u25b3(y, yi) with \u03b1(x) = (K + n\u03bbI)^{\u22121} Kx \u2208 R^n   (Alg. 1)\n\ngiven a positive definite kernel k : X \u00d7 X \u2192 R and training set {(xi, yi)}_{i=1}^n. In the above expression,\n\u03b1i(x) is the i-th entry of \u03b1(x), K \u2208 R^{n\u00d7n} is the kernel matrix Ki,j = k(xi, xj), Kx \u2208 R^n the vector\nwith entries (Kx)i = k(x, xi), \u03bb > 0 a regularization parameter and I the identity matrix.\nFrom a computational perspective, the procedure in Alg. 1 is divided into two steps: a learning step\nwhere input-dependent weights \u03b1i(\u00b7) are computed (which essentially consists in solving a kernel\nridge regression problem) and a prediction step where the \u03b1i(x)-weighted linear combination in\nAlg. 1 is optimized, leading to a prediction \u02c6f (x) given an input x. The idea of a similar two-step\nstrategy goes back to standard approaches for structured prediction and was originally proposed\nin [17], where a \u201cscore\u201d function F (x, y) was learned to estimate the \u201clikelihood\u201d of a pair (x, y)\nsampled from \u03c1, and then used in \u02c6f (x) = argmin_{y \u2208 Y} \u2212F (x, y), to predict the best \u02c6f (x) \u2208 Y given\nx \u2208 X . This strategy was extended in [4] for the popular SVMstruct and adopted also in a variety of\napproaches for structured prediction [1, 12, 14].\nIntuition. While providing a principled derivation of Alg. 
1 for a large class of loss functions is a\nmain contribution of this work, it is useful to first consider the special case where \u25b3 is induced by a\nreproducing kernel h : Y \u00d7 Y \u2192 R on the output set, such that\n\n\u25b3(y, y\u2032) = h(y, y) \u2212 2h(y, y\u2032) + h(y\u2032, y\u2032).   (2)\n\nThis choice of \u25b3 was originally considered in Kernel Dependency Estimation (KDE) [18]. In\nparticular, for the special case of normalized kernels (i.e. h(y, y) = 1 \u2200y \u2208 Y), Alg. 1 essentially\nreduces to [12, 13, 14] and recalling their derivation is insightful. Note that, since a kernel can be\nwritten as h(y, y\u2032) = \u27e8\u03c8(y), \u03c8(y\u2032)\u27e9_{HY}, with \u03c8 : Y \u2192 HY a non-linear map into a feature space\nHY [19], Eq. (2) can be rewritten as\n\n\u25b3(f (x), y\u2032) = \u2016\u03c8(f (x)) \u2212 \u03c8(y\u2032)\u2016^2_{HY}.   (3)\n\nDirectly minimizing the equation above with respect to f is generally challenging due to the\nnonlinearity of \u03c8. A possibility is to replace \u03c8 \u25e6 f by a function g : X \u2192 HY that is easier to optimize. We\ncan then consider the regularized problem\n\nminimize_{g \u2208 G} (1/n) \u2211_{i=1}^n \u2016g(xi) \u2212 \u03c8(yi)\u2016^2_{HY} + \u03bb\u2016g\u2016^2_G   (4)\n\nwith G a space of functions g : X \u2192 HY of the form g(x) = \u2211_{i=1}^n k(x, xi)ci with ci \u2208 HY and k\na reproducing kernel (footnote 1: G is the reproducing kernel Hilbert space for vector-valued functions [9] with inner product\n\u27e8k(xi, \u00b7)ci, k(xj, \u00b7)cj\u27e9_G = k(xi, xj)\u27e8ci, cj\u27e9_{HY}). Indeed, in this case the solution to Eq. (4) is\n\n\u02c6g(x) = \u2211_{i=1}^n \u03b1i(x)\u03c8(yi) with \u03b1(x) = (K + n\u03bbI)^{\u22121} Kx \u2208 R^n   (5)\n\nwhere the \u03b1i are the same as in Alg. 1. 
Since we replaced \u25b3(f (x), y) by \u2016g(x) \u2212 \u03c8(y)\u2016^2_{HY}, a natural\nquestion is how to recover an estimator \u02c6f from \u02c6g. In [12] it was proposed to consider\n\n\u02c6f (x) = argmin_{y \u2208 Y} \u2016\u03c8(y) \u2212 \u02c6g(x)\u2016^2_{HY} = argmin_{y \u2208 Y} h(y, y) \u2212 2 \u2211_{i=1}^n \u03b1i(x)h(y, yi),   (6)\n\nwhich corresponds to Alg. 1 when h is a normalized kernel.\nThe discussion above provides an intuition on how Alg. 1 is derived but also raises a few questions.\nFirst, it is not clear if and how the same strategy could be generalized to loss functions that do not\nsatisfy Eq. (2). Second, the above reasoning hinges on the idea of replacing \u02c6f with \u02c6g (and then\nrecovering \u02c6f by Eq. (6)); however, it is not clear whether this approach can be justified theoretically.\nFinally, we can ask what the statistical properties of the resulting algorithm are. We address the\nfirst two questions in the next section, while the rest of the paper is devoted to establishing universal\nconsistency and generalization bounds for Alg. 1.\n\n3 Surrogate Framework and Derivation\n\nTo derive Alg. 1 we consider ideas from surrogate approaches [20, 21, 7] and in particular [5]. The\nidea is to tackle Eq. (1) by substituting \u25b3(f (x), y) with a \u201crelaxation\u201d L(g(x), y) on a space HY,\nthat is easy to optimize. The corresponding surrogate problem is\n\nminimize_{g : X \u2192 HY} R(g), with R(g) = \u222b_{X \u00d7 Y} L(g(x), y) d\u03c1(x, y),   (7)\n\nand the question is how a solution g\u2217 for the above problem can be related to a minimizer f\u2217 of\nEq. (1). 
This is made possible by the requirement that there exists a decoding d : HY \u2192 Y, such that\n\nFisher Consistency: E(d \u25e6 g\u2217) = E(f\u2217),   (8)\nComparison Inequality: E(d \u25e6 g) \u2212 E(f\u2217) \u2264 \u03d5(R(g) \u2212 R(g\u2217)),   (9)\n\nhold for all g : X \u2192 HY, where \u03d5 : R \u2192 R is such that \u03d5(s) \u2192 0 for s \u2192 0. Indeed, given an\nestimator \u02c6g for g\u2217, we can \u201cdecode\u201d it considering \u02c6f = d \u25e6 \u02c6g and use the excess risk R(\u02c6g) \u2212 R(g\u2217)\nto control E( \u02c6f ) \u2212 E(f\u2217) via the comparison inequality in Eq. (9). In particular, if \u02c6g is a data-dependent\npredictor trained on n points and R(\u02c6g) \u2192 R(g\u2217) when n \u2192 +\u221e, we automatically\nhave E( \u02c6f ) \u2192 E(f\u2217). Moreover, if \u03d5 in Eq. (9) is known explicitly, generalization bounds for \u02c6g are\nautomatically extended to \u02c6f.\nProvided with this perspective on surrogate approaches, here we revisit the discussion of Sec. 2 for\nthe case of a loss function induced by a kernel h. Indeed, by assuming the surrogate L(g(x), y) =\n\u2016g(x) \u2212 \u03c8(y)\u2016^2_{HY}, Eq. (4) becomes the empirical version of the surrogate problem at Eq. (7)\nand leads to an estimator \u02c6g of g\u2217 as in Eq. (5). Therefore, the approach in [12, 14] to recover\n\u02c6f (x) = argmin_y L(g(x), y) can be interpreted as the result \u02c6f (x) = d \u25e6 \u02c6g(x) of a suitable decoding\nof \u02c6g(x). An immediate question is whether the above framework satisfies Eq. (8) and (9). Moreover,\nwe can ask if the same idea could be applied to more general loss functions.\nIn this work we identify conditions on \u25b3 that are satisfied by a large family of functions and moreover\nallow us to design a surrogate framework for which we prove Eq. (8) and (9). The first step in this\ndirection is to introduce the following assumption.\nAssumption 1. 
There exists a separable Hilbert space HY with inner product \u27e8\u00b7, \u00b7\u27e9_{HY}, a continuous\nembedding \u03c8 : Y \u2192 HY and a bounded linear operator V : HY \u2192 HY, such that\n\n\u25b3(y, y\u2032) = \u27e8\u03c8(y), V \u03c8(y\u2032)\u27e9_{HY}   \u2200y, y\u2032 \u2208 Y   (10)\n\nAsm. 1 is similar to Eq. (3) and in particular to the definition of a reproducing kernel. Note however\nthat by not requiring V to be positive semidefinite (or even symmetric), we allow for a surprisingly\nwide range of functions beyond kernel functions. Indeed, below we give some examples of functions\nthat satisfy Asm. 1 (see supplementary material Sec. C for more details):\nExample 1. The following functions of the form \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1:\n\n1. Any loss on Y of finite cardinality. Several problems belong to this setting, such as Multi-Class\nClassification, Multi-labeling, Ranking, predicting Graphs (e.g. protein foldings).\n\n2. Regression and Classification Loss Functions. Least-squares, Logistic, Hinge, \u03b5-insensitive,\n\u03c4-Pinball.\n\n3. Robust Loss Functions. Most loss functions used for robust estimation [22] such as the\nabsolute value, Huber, Cauchy, Geman-McClure, \u201cFair\u201d and L2 \u2212 L1. See [22] or the\nsupplementary material for their explicit formulation.\n\n4. KDE. Loss functions \u25b3 induced by a kernel such as in Eq. (2).\n5. Distances on Histograms/Probabilities. The \u03c7^2 and the Hellinger distances.\n6. Diffusion distances on Manifolds. The squared diffusion distance induced by the heat kernel\n(at time t > 0) on a compact Riemannian manifold without boundary [23].\n\nThe Least Squares Loss Surrogate Framework. Asm. 1 implicitly defines the space HY similarly\nto Eq. (3). The following result motivates the choice of the least squares surrogate and moreover\nsuggests a possible choice for the decoding.\nLemma 1. 
Let \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1 with \u03c8 : Y \u2192 HY bounded. Then the expected risk\nin Eq. (1) can be written as\n\nE(f ) = \u222b_X \u27e8\u03c8(f (x)), V g\u2217(x)\u27e9_{HY} d\u03c1X (x)   (11)\n\nfor all f : X \u2192 Y, where g\u2217 : X \u2192 HY minimizes\n\nR(g) = \u222b_{X \u00d7 Y} \u2016g(x) \u2212 \u03c8(y)\u2016^2_{HY} d\u03c1(x, y).   (12)\n\nLemma 1 shows how Eq. (12) arises naturally as a surrogate problem. In particular, Eq. (11) suggests\nchoosing the decoding\n\nd(h) = argmin_{y \u2208 Y} \u27e8\u03c8(y), V h\u27e9_{HY}   \u2200h \u2208 HY ,   (13)\n\nsince d \u25e6 g\u2217(x) = argmin_{y \u2208 Y} \u27e8\u03c8(y), V g\u2217(x)\u27e9 and therefore E(d \u25e6 g\u2217) \u2264 E(f ) for any measurable\nf : X \u2192 Y, leading to Fisher Consistency. We formalize this in the following result.\nTheorem 2. Let \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1 with Y a compact set. Then, for every measurable\ng : X \u2192 HY and d : HY \u2192 Y satisfying Eq. (13), the following holds\n\nE(d \u25e6 g\u2217) = E(f\u2217)   (14)\nE(d \u25e6 g) \u2212 E(f\u2217) \u2264 c\u25b3 \u221a(R(g) \u2212 R(g\u2217)).   (15)\n\nwith c\u25b3 = \u2016V \u2016 max_{y \u2208 Y} \u2016\u03c8(y)\u2016_{HY}.\nThm. 2 shows that for all \u25b3 satisfying Asm. 1, the corresponding surrogate framework identified\nby the surrogate in Eq. (12) and decoding Eq. (13) satisfies Fisher consistency Eq. (14) and the\ncomparison inequality in Eq. (15). We recall that a finite set Y is always compact, and moreover,\nassuming the discrete topology on Y, we have that any \u03c8 : Y \u2192 HY is continuous. Therefore,\nThm. 2 applies in particular to any structured prediction problem on Y with finite cardinality.\nThm. 2 suggests approaching structured prediction by first learning \u02c6g and then decoding it to recover\n\u02c6f = d \u25e6 \u02c6g. 
A natural question is how to choose \u02c6g in order to compute \u02c6f in practice. In the rest of this\nsection we propose an approach to this problem.\nDerivation for Alg. 1. Minimizing R in Eq. (12) corresponds to a vector-valued regression problem\n[9, 10, 11]. In this work we adopt an empirical risk minimization approach to learn \u02c6g as in Eq. (4).\nThe following result shows that combining \u02c6g with the decoding in Eq. (13) leads to the \u02c6f in Alg. 1.\nLemma 3. Let \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1 with Y a compact set. Let \u02c6g : X \u2192 HY be the\nminimizer of Eq. (4). Then, for all x \u2208 X\n\nd \u25e6 \u02c6g(x) = argmin_{y \u2208 Y} \u2211_{i=1}^n \u03b1i(x) \u25b3(y, yi),   \u03b1(x) = (K + n\u03bbI)^{\u22121} Kx \u2208 R^n   (16)\n\nLemma 3 concludes the derivation of Alg. 1. An interesting observation is that computing \u02c6f does not\nrequire explicit knowledge of the embedding \u03c8 and the operator V , which are implicitly encoded\nwithin the loss \u25b3 by Asm. 1. In analogy to the kernel trick [24] we informally refer to this assumption\nas the \u201closs trick\u201d. We illustrate this effect with an example.\nExample 2 (Ranking). In ranking problems the goal is to predict ordered sequences of a fixed number\n\u2113 of labels. For these problems, Y corresponds to the set of all ordered sequences of \u2113 labels and has\ncardinality |Y| = \u2113!, which is typically dramatically larger than the number n of training examples\n(e.g. for \u2113 = 15, \u2113! \u2243 10^12). Therefore, given an input x \u2208 X , directly computing \u02c6g(x) \u2208 R^{|Y|} is\nimpractical. By contrast, the loss trick allows us to express d \u25e6 \u02c6g(x) only in terms of the n weights\n\u03b1i(x) in Alg. 1, making the computation of the argmin easier to approach in general. For details on\nthe rank loss \u25b3rank and the corresponding optimization over Y, we refer to the empirical analysis of\nSec. 
6.\n\nIn this section we have shown a derivation for the structured prediction algorithm proposed in this\nwork. In Thm. 2 we have shown how the expected risk of the proposed estimator \u02c6f is related to an\nestimator \u02c6g via a comparison inequality. In the following we will make use of these results to prove\nconsistency and generalization bounds for Alg. 1.\n\n4 Statistical Analysis\n\nIn this section we study the statistical properties of Alg. 1 exploiting the relation between the\nstructured and surrogate problems characterized by the comparison inequality in Thm. 2. We begin\nour analysis by proving that Alg. 1 is universally consistent.\nTheorem 4 (Universal Consistency). Let \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1, X and Y be compact sets\nand k : X \u00d7 X \u2192 R a continuous universal reproducing kernel (see footnote 2). For any n \u2208 N and any distribution\n\u03c1 on X \u00d7 Y let \u02c6fn : X \u2192 Y be obtained by Alg. 1 with {(xi, yi)}_{i=1}^n training points independently\nsampled from \u03c1 and \u03bbn = n^{\u22121/4}. Then,\n\nlim_{n \u2192 +\u221e} E( \u02c6fn) = E(f\u2217) with probability 1   (17)\n\nThm. 4 shows that, when \u25b3 satisfies Asm. 1, Alg. 1 approximates a solution f\u2217 to Eq. (1)\narbitrarily well, given a sufficient number of training examples. To the best of our knowledge this is\nthe first consistency result for structured prediction in the general setting considered in this work and\ncharacterized by Asm. 1, in particular for the case of Y with infinite cardinality (dense or discrete).\nThe No Free Lunch Theorem [25] states that it is not possible to prove uniform convergence rates for\nEq. (17). However, by imposing suitable assumptions on the regularity of g\u2217 it is possible to prove\ngeneralization bounds for \u02c6g and then, using Thm. 2, extend them to \u02c6f. 
To show this, it is sufficient\nto require that g\u2217 belongs to G, the reproducing kernel Hilbert space used in the ridge regression of\nEq. (4). Note that in the proofs of Thm. 4 and Thm. 5, our analysis on \u02c6g borrows ideas from [10] and\nextends their result to our setting for the case of HY infinite dimensional (i.e. when Y has infinite\ncardinality). Indeed, note that in this case [10] cannot be applied to the estimator \u02c6g considered in this\nwork (see supplementary material Sec. B.3, Lemma 18 for details).\nTheorem 5 (Generalization Bound). Let \u25b3 : Y \u00d7 Y \u2192 R satisfy Asm. 1, Y be a compact set and\nk : X \u00d7 X \u2192 R a bounded continuous reproducing kernel. Let \u02c6fn denote the solution of Alg. 1 with\nn training points and \u03bb = n^{\u22121/2}. If the surrogate risk R defined in Eq. (12) admits a minimizer\ng\u2217 \u2208 G, then\n\nE( \u02c6fn) \u2212 E(f\u2217) \u2264 c \u03c4^2 n^{\u22121/4}   (18)\n\nholds with probability 1 \u2212 8e^{\u2212\u03c4} for any \u03c4 > 0, with c a constant not depending on n and \u03c4.\nThe bound in Thm. 5 is of the same order as the generalization bounds available for the least squares\nbinary classifier [26]. Indeed, in Sec. 5 we show that in classification settings Alg. 1 reduces to least\nsquares classification. This opens the way to possible improvements, as we discuss in the following.\n\nFootnote 2: This is a standard assumption for universal consistency (see [21]). An example of a continuous universal\nkernel is the Gaussian k(x, x\u2032) = exp(\u2212\u03b3\u2016x \u2212 x\u2032\u2016^2), with \u03b3 > 0.\n\nRemark 1 (Better Comparison Inequality). The generalization bounds for the least squares classifier\ncan be improved by imposing regularity conditions on \u03c1 via the Tsybakov condition [26]. This was\nobserved in [26] for binary classification with the least squares surrogate, where a tighter comparison\ninequality than the one in Thm. 
2 was proved. Therefore, a natural question is whether the inequality\nof Thm. 2 could be similarly improved, consequently leading to better rates for Thm. 5. Promising\nresults in this direction can be found in [5], where the Tsybakov condition was generalized to the\nmulti-class setting and led to a tight comparison inequality analogous to the one for the binary setting.\nHowever, this question deserves further investigation. Indeed, it is not clear how the approach in [5]\ncould be further generalized to the case where Y has infinite cardinality.\nRemark 2 (Other Surrogate Frameworks). In this paper we focused on a least squares surrogate\nloss function and corresponding framework. A natural question is to ask whether other loss functions\ncould be considered to approach the structured prediction problem, sharing the same or possibly even\nbetter properties. This question is related also to Remark 1, since different surrogate frameworks\ncould lead to sharper comparison inequalities. This seems an interesting direction for future work.\n\n5 Connection with Previous Work\n\nBinary and Multi-class Classification. It is interesting to note that in classification settings, Alg. 1\ncorresponds to the least squares classifier [26]. Indeed, let Y = {1, . . . , \u2113} be a set of labels and\nconsider the misclassification loss \u25b3(y, y\u2032) = 1 for y \u2260 y\u2032 and 0 otherwise. Then \u25b3(y, y\u2032) =\ne_y^\u22a4 V e_{y\u2032} with ei \u2208 R^\u2113 the i-th element of the canonical basis of R^\u2113 and V = 1 \u2212 I, where I is the\n\u2113 \u00d7 \u2113 identity matrix and 1 the matrix with all entries equal to 1. In the notation of surrogate methods\nadopted in this work, HY = R^\u2113 and \u03c8(y) = ey. Note that both least squares classification and our\napproach solve the surrogate problem at Eq. 
(4)\n\n(1/n) \u2211_{i=1}^n \u2016g(xi) \u2212 e_{yi}\u2016^2_{R^\u2113} + \u03bb \u2016g\u2016^2_G   (19)\n\nto obtain a vector-valued predictor \u02c6g : X \u2192 R^\u2113 as in Eq. (5). Then, the least squares classifier \u02c6c and\nthe decoding \u02c6f = d \u25e6 \u02c6g are respectively obtained by\n\n\u02c6c(x) = argmax_{i=1,...,\u2113} \u02c6g(x)     \u02c6f (x) = argmin_{i=1,...,\u2113} V \u02c6g(x).   (20)\n\nHowever, since V = 1 \u2212 I, it is easy to see that \u02c6c(x) = \u02c6f (x) for all x \u2208 X .\nKernel Dependency Estimation. In Sec. 2 we discussed the relation between KDE [18, 12] and\nAlg. 1. In particular, we have observed that if \u25b3 is induced by a kernel h : Y \u00d7 Y \u2192 R as in Eq. (2)\nand h is normalized, i.e. h(y, y) = \u03ba \u2200y \u2208 Y, with \u03ba > 0, then algorithm Eq. (6) proposed in [12]\nleads to the same predictor as Alg. 1. Therefore, we can apply Thm. 4 and 5 to prove universal\nconsistency and generalization bounds for methods such as [12, 14]. Some theoretical properties of\nKDE have been previously studied in [15] from a PAC Bayesian perspective. However, the obtained\nbounds do not allow one to control the excess risk or establish consistency of the method. Moreover, note\nthat when the kernel h is not normalized, the \u201cdecoding\u201d in Eq. (6) is not equivalent to Alg. 1. In\nparticular, given the surrogate solution g\u2217, applying Eq. (6) leads to predictors that do not minimize\nEq. (1). As a consequence, approaches in [12, 13, 14] are not consistent in the general case.\nSupport Vector Machines for Structured Output. A popular approach to structured prediction\nis the Support Vector Machine for Structured Outputs (SVMstruct) [4], which extends ideas from the\nwell-known SVM algorithm to the structured setting. 
One of the main advantages of SVMstruct is\nthat it can be applied to a variety of problems since it does not impose strong assumptions on the loss.\nIn this view, our approach shares similar properties, and in particular allows us to consider Y of infinite\ncardinality. Moreover, we note that generalization studies for SVMstruct are available [3] (Ch. 11).\nHowever, it seems that these latter results do not allow one to derive universal consistency of the method.\n\n6 Experiments\n\nIn this section we report on preliminary experiments showing the performance of the proposed\napproach on simulated as well as real structured prediction problems.\n\nMethod            Rank Loss\nLinear [7]        0.430 \u00b1 0.004\nHinge [27]        0.432 \u00b1 0.008\nLogistic [28]     0.432 \u00b1 0.012\nSVM Struct [4]    0.451 \u00b1 0.008\nAlg. 1            0.396 \u00b1 0.003\n\nTable 1: Normalized \u25b3rank for ranking\nmethods on the MovieLens dataset [29].\n\nLoss    KDE [18] (Gaussian)    Alg. 1 (Hellinger)\n\u25b3G      0.149 \u00b1 0.013          0.172 \u00b1 0.011\n\u25b3H      0.736 \u00b1 0.032          0.647 \u00b1 0.017\n\u25b3R      0.294 \u00b1 0.012          0.193 \u00b1 0.015\n\nTable 2: Digit reconstruction using Gaussian\n(KDE [18]) and Hellinger loss.\n\nRanking Movies. We considered the problem of ranking movies in the MovieLens dataset [29]\n(ratings (from 1 to 5) of 1682 movies by 943 users). The goal was to predict preferences of a given\nuser, i.e. an ordering of the 1682 movies, according to the user\u2019s partial ratings. We applied Alg. 1 to\nthe ranking problem using the rank loss [7]\n\n\u25b3rank(y, y\u2032) = (1/2) \u2211_{i,j=1}^M \u03b3(y\u2032)ij (1 \u2212 sign(yi \u2212 yj)),   (21)\n\nwhere M is the number of movies and y is a re-ordering of the sequence 1, . . . , M. The scalar \u03b3(y)ij\ndenotes the cost (or reward) of having movie j ranked higher than movie i. 
Similarly to [7], we set\n\u03b3(y)ij equal to the difference of ratings provided by the user associated with y (from 1 to 5). We chose as k\nin Alg. 1 a linear kernel on features similar to those proposed in [7], which were computed based on\nusers\u2019 profession, age, similarity of previous ratings, etc. Since solving Alg. 1 for \u25b3rank is NP-hard\n(see [7]) we adopted the Feedback Arc Set approximation (FAS) proposed in [30] to approximate the\n\u02c6f (x) of Alg. 1. Results are reported in Tab. 1 comparing Alg. 1 (Ours) with surrogate ranking methods\nusing a Linear [7], Hinge [27] or Logistic [28] loss and Struct SVM [4]. We randomly sampled\nn = 643 users for training and tested on the remaining 300. We performed 5-fold cross-validation for\nmodel selection. We report the normalized \u25b3rank, averaged over 10 trials to account for statistical\nvariability. Interestingly, our approach appears to outperform all competitors, suggesting that Alg. 1\nis a viable approach to ranking.\nImage Reconstruction with Hellinger Distance. We considered the USPS digits reconstruction\nexperiment originally proposed in [18]. The goal is to predict the lower half of an image depicting a\ndigit, given the upper half of the same image in input. The standard approach is to use a Gaussian\nkernel kG on images in input and adopt KDE methods such as [18, 12, 14] with loss \u25b3G(y, y\u2032) =\n1 \u2212 kG(y, y\u2032). Here we take a different approach and, following [31], we interpret an image depicting\na digit as a histogram and normalize it to sum up to 1. Therefore, Y is the unit simplex in R^128\n(16 \u00d7 16 images) and we adopt the Hellinger distance \u25b3H\n\n\u25b3H (y, y\u2032) = \u2211_{i=1}^M |(yi)^{1/2} \u2212 (y\u2032i)^{1/2}| for y = (yi)_{i=1}^M   (22)\n\nto measure distances on Y. We used the kernel kG on the input space and compared Alg. 1 using\nrespectively \u25b3H and \u25b3G. For \u25b3G Alg. 1 corresponds to [12]. 
We performed digit reconstruction\nexperiments by training on 1000 examples evenly distributed among the 10 digits of USPS and\ntested on 5000 images. We performed 5-fold cross-validation for model selection. Tab. 2 reports\nthe performance of Alg. 1 and the KDE methods averaged over 10 runs. Performance is reported\naccording to the Gaussian loss \u25b3G and Hellinger loss \u25b3H. Unsurprisingly, methods trained with\nrespect to a specific loss perform better than the competitor with respect to such loss. Therefore, as a\nfurther measure of performance we also introduced the \u201cRecognition\u201d loss \u25b3R. This loss is\nintended as a measure of how \u201cwell\u201d a predictor was able to correctly reconstruct an image for digit\nrecognition purposes. To this end, we trained an automatic digit classifier and defined \u25b3R to be the\nmisclassification error of such classifier when tested on images reconstructed by the two prediction\nalgorithms. This automatic classifier was trained using a standard SVM [24] on a separate subset of\nUSPS images and achieved an average 0.04% error rate on the true 5000 test sets. In this case a clear\ndifference in performance can be observed between using two different loss functions, suggesting\nthat \u25b3H is more suited for the reconstruction problem.\n\nn       Alg. 1         RNW            KRR\n50      0.39 \u00b1 0.17    0.45 \u00b1 0.18    0.62 \u00b1 0.13\n100     0.21 \u00b1 0.04    0.29 \u00b1 0.04    0.47 \u00b1 0.09\n200     0.12 \u00b1 0.02    0.24 \u00b1 0.03    0.33 \u00b1 0.04\n500     0.08 \u00b1 0.01    0.22 \u00b1 0.02    0.31 \u00b1 0.03\n1000    0.07 \u00b1 0.01    0.21 \u00b1 0.02    0.19 \u00b1 0.02\n\nFigure 1: Robust estimation on the regression problem in Sec. 6 by minimizing the Cauchy loss with\nAlg. 1 (Ours) or Nadaraya-Watson (Nad). KRLS as a baseline predictor. Left. Example of one run of\nthe algorithms. Right. 
Average distance of the predictors to the actual function (without noise and\noutliers) over 100 runs with respect to training sets of increasing size.\n\nRobust Estimation. We considered a regression problem with many outliers and evaluated Alg. 1\nusing the Cauchy loss (see Example 1 - (3)) for robust estimation. In this setting, Y =\n[\u2212M, M] \u2282 R is not structured, but the non-convexity of \u25b3 can be an obstacle to the learning\nprocess. We generated a dataset according to the model y = sin(6\u03c0x) + \u03b5 + \u03b6, where x was sampled\nuniformly on [\u22121, 1] and \u03b5 according to a zero-mean Gaussian with variance 0.1. The term \u03b6 modeled the\noutliers and was sampled according to a zero-mean random variable that was 0 with probability\n0.9 and a value drawn uniformly at random from [\u22123, 3] with probability 0.1. We compared Alg. 1 with the\nNadaraya-Watson robust estimator (RNW) [32] and kernel ridge regression (KRR) with a Gaussian\nkernel as baseline. To train Alg. 1 we used a Gaussian kernel on the input and performed predictions\n(i.e. solved Eq. (16)) using the Matlab fminunc function for unconstrained minimization. Experiments\nwere performed with training sets of increasing size (100 repetitions each) and a test set of 1000\nexamples, with 5-fold cross-validation for model selection. Results are reported in Fig. 1, showing that\nour estimator significantly outperforms the others. Moreover, our method appears to benefit greatly\nfrom training sets of increasing size.\n\n7 Conclusions and Future Work\n\nIn this work we considered the problem of structured prediction from a Statistical Learning Theory\nperspective. We proposed a learning algorithm for structured prediction that is split into a learning\nand a prediction step, similarly to previous methods in the literature. We studied the statistical properties\nof the proposed algorithm by adopting a strategy inspired by surrogate methods. 
In particular, we\nidentified a large family of loss functions for which it is natural to identify a corresponding surrogate\nproblem. This perspective provides a derivation of the algorithm proposed in this work.\nMoreover, by exploiting a comparison inequality relating the original and surrogate problems, we\nwere able to prove universal consistency and generalization bounds under mild assumptions. In\nparticular, the bounds proved in this work recover those already known for least squares classification,\nof which our approach can be seen as a generalization. We supported our theoretical analysis with\nexperiments showing promising results on a variety of structured prediction problems.\nA few questions were left open. First, we ask whether the comparison inequality can be improved\n(under suitable hypotheses) to obtain faster generalization bounds for our algorithm. Second, the\nsurrogate problem in our work consists of a vector-valued regression (in a possibly infinite dimensional\nHilbert space). We solved this problem by plain kernel ridge regression, but it is natural to ask whether\napproaches from the multi-task learning literature could lead to substantial improvements in this\nsetting. Finally, an interesting question is whether alternative surrogate frameworks could be derived\nfor the setting considered in this work, possibly leading to tighter comparison inequalities. We will\ninvestigate these questions in the future.\n\nReferences\n[1] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. PAMI, IEEE Transactions on, 32(9):1627\u20131645, 2010.\n[2] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on CVPR, pages 3128\u20133137, 2015.\n[3] G\u00f6khan Bakir, Thomas Hofmann, Bernhard Sch\u00f6lkopf, Alexander J. Smola, Ben Taskar, and S.V.N. Vishwanathan. Predicting structured data. MIT press, 2007.\n[4] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun. Large margin methods for structured and interdependent output variables. In JMLR, pages 1453\u20131484, 2005.\n[5] Youssef Mroueh, Tomaso Poggio, Lorenzo Rosasco, and Jean-Jacques Slotine. Multiclass learning with simplex coding. In NIPS, pages 2798\u20132806, 2012.\n[6] Wei Gao and Zhi-Hua Zhou. On the consistency of multi-label learning. Artificial Intelligence, 2013.\n[7] John C Duchi, Lester W Mackey, and Michael I Jordan. On the consistency of ranking algorithms. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 327\u2013334, 2010.\n[8] Ingo Steinwart, Andreas Christmann, et al. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211\u2013225, 2011.\n[9] Charles A Micchelli and Massimiliano Pontil. Kernels for multi\u2013task learning. In Advances in Neural Information Processing Systems, pages 921\u2013928, 2004.\n[10] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331\u2013368, 2007.\n[11] M. \u00c1lvarez, N. Lawrence, and L. Rosasco. Kernels for vector-valued functions: a review. Foundations and Trends in Machine Learning, 4(3):195\u2013266, 2012. See also http://arxiv.org/abs/1106.6251.\n[12] Corinna Cortes, Mehryar Mohri, and Jason Weston. A general regression technique for learning transductions. In Proceedings of the 22nd international conference on Machine learning, 2005.\n[13] P. Geurts, L. Wehenkel, and F. d\u2019Alch\u00e9 Buc. Kernelizing the output of tree-based methods. In ICML, 2006.\n[14] H. Kadri, M. Ghavamzadeh, and P. Preux. 
A generalized kernel approach to structured output learning. Proc. International Conference on Machine Learning (ICML), 2013.\n[15] S. Gigu\u00e8re, M. Marchand, K. Sylla, and F. Laviolette. Risk bounds and learning algorithms for the regression approach to structured output prediction. In ICML. JMLR Workshop and Conference Proceedings, 2013.\n[16] C. Brouard, M. Szafranski, and F. d\u2019Alch\u00e9 Buc. Input output kernel regression: Supervised and semi-supervised structured output prediction with operator-valued kernels. JMLR, 17(176):1\u201348, 2016.\n[17] Michael Collins. Discriminative training methods for hidden markov models: Theory and experiments with perceptron algorithms. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing-Volume 10, pages 1\u20138. Association for Computational Linguistics, 2002.\n[18] Jason Weston, Olivier Chapelle, Vladimir Vapnik, Andr\u00e9 Elisseeff, and Bernhard Sch\u00f6lkopf. Kernel dependency estimation. In Advances in neural information processing systems, pages 873\u2013880, 2002.\n[19] Alain Berlinet and Christine Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.\n[20] Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138\u2013156, 2006.\n[21] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Information Science and Statistics. Springer New York, 2008.\n[22] Peter J Huber and Elvezio M Ronchetti. Robust statistics. Springer, 2011.\n[23] Richard Schoen and Shing-Tung Yau. Lectures on differential geometry, volume 2. International press Boston, 1994.\n[24] Bernhard Sch\u00f6lkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2002.\n[25] David H Wolpert. 
The lack of a priori distinctions between learning algorithms. Neural computation, 8(7):1341\u20131390, 1996.\n[26] Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289\u2013315, 2007.\n[27] Ralf Herbrich, Thore Graepel, and Klaus Obermayer. Large margin rank boundaries for ordinal regression. Advances in neural information processing systems, pages 115\u2013132, 1999.\n[28] Ofer Dekel, Yoram Singer, and Christopher D Manning. Log-linear models for label ranking. In Advances in neural information processing systems, 2004.\n[29] F Maxwell Harper and Joseph A Konstan. The movielens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2015.\n[30] Peter Eades, Xuemin Lin, and William F Smyth. A fast and effective heuristic for the feedback arc set problem. Information Processing Letters, 47(6):319\u2013323, 1993.\n[31] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292\u20132300, 2013.\n[32] Wolfgang H\u00e4rdle. Robust regression function estimation. Journal of Multivariate Analysis, 14(2):169\u2013180, 1984.", "award": [], "sourceid": 2170, "authors": [{"given_name": "Carlo", "family_name": "Ciliberto", "institution": "MIT"}, {"given_name": "Lorenzo", "family_name": "Rosasco", "institution": "University of Genova- MIT - IIT"}, {"given_name": "Alessandro", "family_name": "Rudi", "institution": "University of Genova"}]}