{"title": "Manifold Structured Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 5610, "page_last": 5621, "abstract": "Structured prediction provides a general framework to deal with supervised problems where the outputs have semantically rich structure. While classical approaches consider finite, albeit potentially huge, output spaces, in this paper we discuss how structured prediction can be extended to a continuous scenario. Specifically, we study a structured prediction approach to manifold-valued regression. We characterize a class of problems for which the considered approach is statistically consistent and study how geometric optimization can be used to compute the corresponding estimator. Promising experimental results on both simulated and real data complete our study.", "full_text": "Manifold Structured Prediction\n\nAlessandro Rudi \u2022,1 Carlo Ciliberto \u2022,\u2217 ,2 Gian Maria Marconi 3 Lorenzo Rosasco 3,4\n\nalessandro.rudi@inria.fr\n\nc.ciliberto@imperial.ac.uk\n\ngian.maria.marconi@iit.it\n\nlrosasco@mit.edu\n\n1INRIA - D\u00e9partement d\u2019informatique, \u00c9cole Normale Sup\u00e9rieure - PSL Research University, Paris, France.\n\n2Department of Electrical and Electronic Engineering, Imperial College, London, UK.\n\n3Universit\u00e0 degli studi di Genova & Istituto Italiano di Tecnologia, Genova, Italy.\n\n4Massachusetts Institute of Technology, Cambridge, USA.\n\n\u2022 Equal Contribution\n\nAbstract\n\nStructured prediction provides a general framework to deal with supervised prob-\nlems where the outputs have semantically rich structure. While classical approaches\nconsider \ufb01nite, albeit potentially huge, output spaces, in this paper we discuss how\nstructured prediction can be extended to a continuous scenario. Speci\ufb01cally, we\nstudy a structured prediction approach to manifold valued regression. 
We character-\nize a class of problems for which the considered approach is statistically consistent\nand study how geometric optimization can be used to compute the corresponding\nestimator. Promising experimental results on both simulated and real data complete\nour study.\n\n1\n\nIntroduction\n\nRegression and classi\ufb01cation are probably the most classical machine learning problems and cor-\nrespond to estimating a function with scalar and binary values, respectively. In practice, it is often\ninteresting to estimate functions with more structured outputs. When the output space can be assumed\nto be a vector space, many ideas from regression can be extended, think for example to multivariate\n[20] or functional regression [32]. However, a lack of a natural vector structure is a feature of many\npractically interesting problems, such as ranking [18], quantile estimation [26] or graph prediction\n[38]. In this latter case, the outputs are typically provided only with some distance or similarity func-\ntion that can be used to design appropriate loss function. Knowledge of the loss is suf\ufb01cient to analyze\nan abstract empirical risk minimization approach within the framework of statistical learning theory,\nbut deriving approaches that are at the same time statistically sound and computationally feasible\nis a key challenge. While ad-hoc solutions are available for many speci\ufb01c problems [15, 37, 24, 7],\nstructured prediction [5] provides a unifying framework where a variety of problems can be tackled\nas special cases.\nClassically, structured prediction considers problems with \ufb01nite, albeit potentially huge, output spaces.\nIn this paper, we study how these ideas can be applied to non-discrete output spaces. In particular, we\nconsider the case where the output space is a Riemannian manifold, that is the problem of manifold\nstructured prediction (also called manifold valued regression [46]). 
While also in this case ad-hoc methods are available [47], in this paper we adopt and study a structured prediction approach starting from the framework proposed in [13]. Within this framework, it is possible to derive a statistically sound, and yet computationally feasible, structured prediction approach as long as the loss function satisfies suitable structural assumptions [14, 17, 29, 25, 12, 36]. Moreover, we can guarantee that the computed prediction is always an element of the manifold.

Our main technical contribution is a characterization of loss functions for manifold structured prediction satisfying such a structural assumption. In particular, we consider the case where the

∗ Work performed while C.C. was at the University College London.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Riemannian metric is chosen as a loss function. As a byproduct of these results, we derive a manifold structured learning algorithm that is universally consistent, together with corresponding finite sample bounds. From a computational point of view, the proposed algorithm requires solving a linear system (at training time) and a minimization problem over the output manifold (at test time). To tackle this latter problem, we investigate the application of geometric optimization methods, and in particular Riemannian gradient descent [1]. We consider both numerical simulations and benchmark datasets, reporting promising performances. The rest of the paper is organized as follows. In Section 2, we define the problem and describe the proposed algorithm. In Section 3 we state and prove the theoretical results of this work.
In Section 4 we explain how to compute the proposed estimator and show the performance of our method on synthetic and real data.

2 Structured Prediction for Manifold Valued Regression

The goal of supervised learning is to find a functional relation between an input space $\mathcal{X}$ and an output space $\mathcal{Y}$ given a finite set of observations. Traditionally, the output space is either a linear space (e.g. $\mathcal{Y} = \mathbb{R}^M$) or a discrete set (e.g. $\mathcal{Y} = \{0,1\}$ in binary classification). In this paper, we consider the problem of manifold structured prediction [47], in which output data lie on a manifold $\mathcal{M} \subset \mathbb{R}^d$. In this context, statistical learning corresponds to solving

$$\min_{f : \mathcal{X} \to \mathcal{Y}} \mathcal{E}(f) \quad \text{with} \quad \mathcal{E}(f) = \int_{\mathcal{X} \times \mathcal{Y}} \triangle(f(x), y)\, d\rho(x, y), \qquad (1)$$

where $\mathcal{Y}$ is a subset of the manifold $\mathcal{M}$ and $\rho$ is an unknown distribution on $\mathcal{X} \times \mathcal{Y}$. Here, $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a loss function that measures prediction errors for points estimated on the manifold. The minimization is meant over the set of all measurable functions from $\mathcal{X}$ to $\mathcal{Y}$. The distribution is fixed but unknown, and a learning algorithm seeks an estimator $\hat{f} : \mathcal{X} \to \mathcal{Y}$ that approximately solves Eq. (1), given a set of training points $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.

A concrete example of loss function that we will consider in this paper is $\triangle = d^2$, the squared geodesic distance $d : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ [27]. The geodesic distance is the natural metric on a Riemannian manifold (it corresponds to the Euclidean distance when $\mathcal{M} = \mathbb{R}^d$) and is a natural loss function in the context of manifold regression [46, 47, 19, 23, 21].

2.1 Manifold Valued Regression via Structured Prediction

In this paper we consider a structured prediction approach to manifold valued regression following ideas in [13]. Given a training set $(x_i, y_i)_{i=1}^n$, an estimator for problem Eq. (1) is defined by

$$\hat{f}(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x)\, \triangle(y, y_i) \qquad (2)$$

for any $x \in \mathcal{X}$. The coefficients $\alpha(x) = (\alpha_1(x), \dots, \alpha_n(x))^\top \in \mathbb{R}^n$ are obtained by solving a linear system for a problem akin to kernel ridge regression (see Section 2.2): given a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ [4] over $\mathcal{X}$, we have

$$\alpha(x) = (\alpha_1(x), \dots, \alpha_n(x))^\top = (K + n\lambda I)^{-1} K_x, \qquad (3)$$

where $K \in \mathbb{R}^{n \times n}$ is the empirical kernel matrix with $K_{i,j} = k(x_i, x_j)$, and $K_x \in \mathbb{R}^n$ is the vector whose $i$-th entry is $(K_x)_i = k(x, x_i)$. Here, $\lambda \in \mathbb{R}_+$ is a regularization parameter and $I \in \mathbb{R}^{n \times n}$ denotes the identity matrix.

Computing the estimator in Eq. (2) can be divided into two steps. During a training step the score function $\alpha : \mathcal{X} \to \mathbb{R}^n$ is learned, while during the prediction step the output $\hat{f}(x) \in \mathcal{Y}$ is estimated on a new test point $x \in \mathcal{X}$. This last step requires minimizing the linear combination of distances $\triangle(y, y_i)$ between a candidate $y \in \mathcal{Y}$ and the training outputs $(y_i)_{i=1}^n$, weighted by the corresponding scores $\alpha_i(x)$. Next, we recall the derivation of the above estimator following [13].

2.2 Derivation of the Proposed Estimator

The derivation of $\hat{f}$ in Eq. (2) is based on the following key structural assumption on the loss.

Definition 1 (Structure Encoding Loss Function (SELF)). Let $\mathcal{Y}$ be a compact set. A function $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a Structure Encoding Loss Function if there exist a separable Hilbert space $\mathcal{H}$, a continuous feature map $\psi : \mathcal{Y} \to \mathcal{H}$ and a continuous linear operator $V : \mathcal{H} \to \mathcal{H}$ such that for all $y, y' \in \mathcal{Y}$

$$\triangle(y, y') = \langle \psi(y), V \psi(y') \rangle_{\mathcal{H}}. \qquad (4)$$

Intuitively, the SELF definition requires a loss function to be "bi-linearizable" over the space $\mathcal{H}$. This is similar to, but more general than, requiring the loss to be a kernel, since it also allows one to consider distances (which are not positive definite) or even non-symmetric loss functions. As observed in [13], a wide range of loss functions often used in machine learning are SELF. In Section 3 we study how the above assumption applies to manifold structured loss functions, including the squared geodesic distance.

We first recall how the estimator in Eq. (2) can be obtained assuming $\triangle$ to be SELF. We begin by rewriting the expected risk in Eq. (1) as

$$\mathcal{E}(f) = \int_{\mathcal{X}} \Big\langle \psi(f(x)),\, V \int_{\mathcal{Y}} \psi(y)\, d\rho(y|x) \Big\rangle_{\mathcal{H}}\, d\rho_{\mathcal{X}}(x), \qquad (5)$$

where we have conditioned $\rho(y, x) = \rho(y|x)\,\rho_{\mathcal{X}}(x)$ and used the linearity of the integral and the inner product. Therefore, any function $f^* : \mathcal{X} \to \mathcal{Y}$ minimizing the above functional must satisfy the following condition

$$f^*(x) = \operatorname*{argmin}_{y \in \mathcal{Y}} \langle \psi(y), V g^*(x) \rangle_{\mathcal{H}} \quad \text{where} \quad g^*(x) = \int_{\mathcal{Y}} \psi(y)\, d\rho(y|x), \qquad (6)$$

where we have introduced the function $g^* : \mathcal{X} \to \mathcal{H}$ that maps each point $x \in \mathcal{X}$ to the conditional expectation of $\psi(y)$ given $x$. We cannot compute $g^*$ explicitly, but noting that it minimizes the expected least squares error

$$\int \|\psi(y) - g(x)\|_{\mathcal{H}}^2\, d\rho(x, y) \qquad (7)$$

suggests that a least squares estimator can be considered.
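The training-time computation behind Eq. (3) is a single linear solve against the kernel matrix. The following is a minimal NumPy sketch, not the authors' code: the function names, the choice of a Gaussian kernel and all hyperparameter values are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def score_weights(X_train, x, lam=1e-3, sigma=1.0):
    """Scores alpha(x) = (K + n*lambda*I)^{-1} K_x as in Eq. (3)."""
    n = X_train.shape[0]
    K = gaussian_kernel(X_train, X_train, sigma)       # empirical kernel matrix
    Kx = gaussian_kernel(X_train, x[None, :], sigma)[:, 0]  # kernel vs. test point
    return np.linalg.solve(K + n * lam * np.eye(n), Kx)
```

At test time these scores weight the losses $\triangle(y, y_i)$ in the minimization of Eq. (2).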
We first illustrate this idea for $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{H} = \mathbb{R}^k$. In this case we can consider a ridge regression estimator

$$\hat{g}(x) = \widehat{W}^\top x \quad \text{with} \quad \widehat{W} = \operatorname*{argmin}_{W \in \mathbb{R}^{d \times k}} \frac{1}{n} \|X W - \psi(Y)\|_F^2 + \lambda \|W\|_F^2, \qquad (8)$$

where $X = (x_1, \dots, x_n)^\top \in \mathbb{R}^{n \times d}$ and $\psi(Y) = (\psi(y_1), \dots, \psi(y_n))^\top \in \mathbb{R}^{n \times k}$ are the matrices whose $i$-th rows correspond respectively to the training sample $x_i \in \mathcal{X}$ and the (mapped) training output $\psi(y_i) \in \mathcal{H}$. We have denoted by $\|\cdot\|_F^2$ the squared Frobenius norm of a matrix, namely the sum of all its squared entries. The ridge regression solution can be obtained in closed form as $\widehat{W} = (X^\top X + n\lambda I)^{-1} X^\top \psi(Y)$. For any $x \in \mathcal{X}$ we have

$$\hat{g}(x) = \psi(Y)^\top X (X^\top X + n\lambda I)^{-1} x = \psi(Y)^\top \alpha(x) = \sum_{i=1}^n \alpha_i(x)\, \psi(y_i), \qquad (9)$$

where we have introduced the coefficients $\alpha(x) = X(X^\top X + n\lambda I)^{-1} x \in \mathbb{R}^n$. By substituting $\hat{g}$ for $g^*$ in Eq. (6) we have

$$\hat{f}(x) = \operatorname*{argmin}_{y \in \mathcal{M}} \Big\langle \psi(y),\, V \Big( \sum_{i=1}^n \alpha_i(x)\, \psi(y_i) \Big) \Big\rangle = \operatorname*{argmin}_{y \in \mathcal{M}} \sum_{i=1}^n \alpha_i(x)\, \triangle(y, y_i), \qquad (10)$$

where we have used the linearity of the sum and the inner product to move the coefficients $\alpha_i$ outside of the inner product. Since the loss is SELF, we then obtain $\langle \psi(y), V \psi(y_i) \rangle = \triangle(y, y_i)$ for any $y_i$ in the training set. This recovers the estimator $\hat{f}$ introduced in Eq. (2), as desired.

We end by noting how the above idea can be extended. First, we can consider $\mathcal{X}$ to be any set and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ a positive definite kernel. Then $\hat{g}$ can be computed by kernel ridge regression (see e.g. [43]) to obtain the scores $\alpha(x) = (K + n\lambda I)^{-1} K_x$, see Eq. (3). Second, the above discussion applies if $\mathcal{H}$ is infinite dimensional. Indeed, thanks to the SELF assumption, $\hat{f}$ does not depend on explicit knowledge of the space $\mathcal{H}$ but only on the loss function.

We next discuss the main results of the paper, showing that a large class of loss functions for manifold structured prediction are SELF. This will allow us to prove consistency and learning rates for the manifold structured estimator considered in this work.

3 Characterization of SELF Functions on Manifolds

In this section we provide sufficient conditions for a wide class of functions on manifolds to satisfy the definition of SELF. A key example will be the case of the squared geodesic distance. To this end we will make the following assumptions on the manifold $\mathcal{M}$ and the output space $\mathcal{Y} \subseteq \mathcal{M}$ where the learning problem takes place.

Assumption 1. $\mathcal{M}$ is a complete $d$-dimensional smooth connected Riemannian manifold, without boundary, with Ricci curvature bounded below and positive injectivity radius.

The assumption above imposes basic regularity conditions on the output manifold. In particular, we require the manifold to be locally diffeomorphic to $\mathbb{R}^d$ and the tangent space of $\mathcal{M}$ at any $p \in \mathcal{M}$ to vary smoothly with respect to $p$. This assumption avoids pathological manifolds and is satisfied for instance by any smooth compact manifold (e.g. the sphere, torus, etc.) [27]. Other notable examples are the statistical manifold (without boundary) [3] and any open bounded sub-manifold of the cone of positive definite matrices, which is often studied in geometric optimization settings [1]. This assumption will be instrumental to guarantee the existence of a space of functions $\mathcal{H}$ on $\mathcal{M}$ rich enough to contain the squared geodesic distance.

Assumption 2. $\mathcal{Y}$ is a compact geodesically convex subset of the manifold $\mathcal{M}$.

A subset $\mathcal{Y}$ of a manifold is geodesically convex if for any two points in $\mathcal{Y}$ there exists one and only one minimizing geodesic curve connecting them. The effect of Asm. 2 is twofold. On one hand it guarantees a generalized notion of convexity for the space $\mathcal{Y}$ on which we will solve the optimization problem in Eq. (2). On the other hand it prevents the geodesic distance from having singularities on $\mathcal{Y}$ (which is key to our main result below). For a detailed introduction to most definitions and results reviewed in this section we refer the interested reader to standard references for differential and Riemannian geometry (see e.g. [27]). We are ready to state the main result of this work.

Theorem 1 (Smooth Functions are SELF). Let $\mathcal{M}$ satisfy Asm. 1 and $\mathcal{Y} \subseteq \mathcal{M}$ satisfy Asm. 2. Then, any smooth function $h : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is SELF on $\mathcal{Y}$.

Sketch of the proof (Thm. 1). The complete proof of Thm. 1 is reported in ??. The proof hinges on the following key steps:

Step 1: If there exists an RKHS $\mathcal{H}$ on $\mathcal{M}$, then any $h \in \mathcal{H} \otimes \mathcal{H}$ is SELF. Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) [4] of functions on $\mathcal{M}$ with associated bounded kernel $k : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$. Let $\mathcal{H} \otimes \mathcal{H}$ denote the RKHS of functions $h : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ with associated kernel $\bar{k}$ such that $\bar{k}((y,z),(y',z')) = k(y,y')\,k(z,z')$ for any $y, y', z, z' \in \mathcal{M}$. Let $h : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ be such that $h \in \mathcal{H} \otimes \mathcal{H}$. Recall that $\mathcal{H} \otimes \mathcal{H}$ is isometric to the space of Hilbert-Schmidt operators from $\mathcal{H}$ to itself. Let $V_h : \mathcal{H} \to \mathcal{H}$ be the operator corresponding to $h$ via this isometry. We show that the SELF definition is satisfied with $V = V_h$ and $\psi(y) = k(y, \cdot) \in \mathcal{H}$ for any $y \in \mathcal{M}$.
In particular, we have $\|V\| \le \|V\|_{HS} = \|h\|_{\mathcal{H} \otimes \mathcal{H}}$, with $\|V\|_{HS}$ denoting the Hilbert-Schmidt norm of $V$.

Step 2: Under Asm. 2, $C^\infty_0(\mathcal{M}) \otimes C^\infty_0(\mathcal{M})$ "contains" $C^\infty(\mathcal{Y} \times \mathcal{Y})$. If $\mathcal{Y}$ is compact and geodesically convex, then it is diffeomorphic to a compact set of $\mathbb{R}^d$. Using this fact, we prove that any function in $C^\infty(\mathcal{Y} \times \mathcal{Y})$, the space of smooth functions on $\mathcal{Y} \times \mathcal{Y}$, admits an extension in $C^\infty_0(\mathcal{M} \times \mathcal{M})$, the space of smooth functions on $\mathcal{M} \times \mathcal{M}$ vanishing at infinity (this is well defined since $\mathcal{M}$ is diffeomorphic to $\mathbb{R}^d$, see ?? in the supplementary material), and that $C^\infty_0(\mathcal{M} \times \mathcal{M}) = C^\infty_0(\mathcal{M}) \otimes C^\infty_0(\mathcal{M})$, with $\otimes$ the canonical topological tensor product [50].

Step 3: Under Asm. 1, there exists an RKHS on $\mathcal{M}$ containing $C^\infty_0(\mathcal{M})$. Under Asm. 1, the Sobolev space $\mathcal{H} = H^2_s(\mathcal{M})$ of square integrable functions with smoothness $s$ is an RKHS for any $s > d/2$ (see [22] for a definition of Sobolev spaces on Riemannian manifolds).

The proof proceeds as follows: from Step 1, we see that to guarantee $h$ to be SELF it is sufficient to prove the existence of an RKHS $\mathcal{H}$ such that $h \in \mathcal{H} \otimes \mathcal{H}$. The rest of the proof is therefore devoted to showing that for smooth functions this is satisfied by $\mathcal{H} = H^2_s(\mathcal{M})$. Since $h$ is smooth, by Step 2 we have that under Asm. 2 there exists $\bar{h} \in C^\infty_0(\mathcal{M}) \otimes C^\infty_0(\mathcal{M})$ whose restriction $\bar{h}|_{\mathcal{Y} \times \mathcal{Y}}$ to $\mathcal{Y} \times \mathcal{Y}$ corresponds to $h$. Now, denote by $H^2_s(\mathcal{M})$ the Sobolev space of square integrable functions on $\mathcal{M}$ with smoothness index $s > 0$. By construction (see [22]), for any $s > 0$ we have $C^\infty_0(\mathcal{M}) \subseteq H^2_s(\mathcal{M})$. In particular, $\bar{h} \in C^\infty_0(\mathcal{M}) \otimes C^\infty_0(\mathcal{M}) \subseteq H^2_s(\mathcal{M}) \otimes H^2_s(\mathcal{M})$. Finally, Step 3 guarantees that under Asm. 1, $\mathcal{H} = H^2_s(\mathcal{M})$ with $s > d/2$ is an RKHS, showing that $h \in \mathcal{H} \otimes \mathcal{H}$ as desired.

Interestingly, Thm. 1 shows that the SELF estimator proposed in Eq. (2) can tackle any manifold valued learning problem in the form of Eq. (1) with a smooth loss function. In the following we study the specific case of the squared geodesic distance.

Theorem 2 ($d^2$ is SELF). Let $\mathcal{M}$ satisfy Asm. 1 and $\mathcal{Y} \subseteq \mathcal{M}$ satisfy Asm. 2. Then, the squared geodesic distance $\triangle = d^2 : \mathcal{M} \times \mathcal{M} \to \mathbb{R}$ is smooth on $\mathcal{Y}$. Therefore $\triangle$ is SELF on $\mathcal{Y}$.

The proof of the result above is reported in the supplementary material. The main technical step is to show that the regularity provided by Asm. 2 guarantees the squared geodesic distance to be smooth. The fact that $\triangle$ is SELF is then an immediate corollary of Thm. 1.

3.1 Statistical Properties of Manifold Structured Prediction

In this section, we characterize the generalization properties of the manifold structured estimator in Eq. (2) in light of Thm. 1 and Thm. 2.

Theorem 3 (Universal Consistency). Let $\mathcal{M}$ satisfy Asm. 1 and $\mathcal{Y} \subseteq \mathcal{M}$ satisfy Asm. 2. Let $\mathcal{X}$ be a compact set and $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded continuous universal kernel². For any $n \in \mathbb{N}$ and any distribution $\rho$ on $\mathcal{X} \times \mathcal{Y}$, let $\hat{f}_n : \mathcal{X} \to \mathcal{Y}$ be the manifold structured estimator in Eq. (2) for a learning problem with smooth loss function $\triangle$, with $(x_i, y_i)_{i=1}^n$ training points independently sampled from $\rho$ and $\lambda_n = n^{-1/4}$. Then

$$\lim_{n \to \infty} \mathcal{E}(\hat{f}_n) = \mathcal{E}(f^*) \quad \text{with probability } 1. \qquad (11)$$

The result above follows from Thm. 4 in [13] combined with our result in Thm. 1.
It guarantees that the algorithm considered in this work finds a consistent estimator for the manifold structured problem when the loss function is smooth (thus also in the case of the squared geodesic distance). As is standard in statistical learning theory, we can impose regularity conditions on the learning problem in order to also derive generalization bounds for $\hat{f}$. In particular, if we denote by $\mathcal{F}$ the RKHS associated to the kernel $k$, we will require $g^*$ to belong to the same space $\mathcal{H} \otimes \mathcal{F}$ in which the estimator $\hat{g}$ introduced in Eq. (9) is learned. In the simplified case discussed in Section 2.2, with a linear kernel on $\mathcal{X} = \mathbb{R}^d$ and $\mathcal{H} = \mathbb{R}^k$ finite dimensional, we have $\mathcal{F} = \mathbb{R}^d$ and this assumption corresponds to requiring the existence of a matrix $W_*^\top \in \mathbb{R}^{k \times d} = \mathcal{H} \otimes \mathcal{F}$ such that $g^*(x) = W_*^\top x$ for any $x \in \mathcal{X}$. In the general case, the space $\mathcal{H} \otimes \mathcal{F}$ extends to the notion of reproducing kernel Hilbert space for vector-valued functions (see e.g. [30, 2]), but the same intuition applies [28, 10, 49].

Theorem 4 (Generalization Bounds). Let $\mathcal{M}$ satisfy Asm. 1 and $\mathcal{Y} \subseteq \mathcal{M}$ satisfy Asm. 2. Let $\mathcal{H} = H^2_s(\mathcal{M})$ with $s > d/2$ and let $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ be a bounded continuous reproducing kernel with associated RKHS $\mathcal{F}$. For any $n \in \mathbb{N}$, let $\hat{f}_n$ denote the manifold structured estimator in Eq. (2) for a learning problem with smooth loss $\triangle : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ and $\lambda_n = n^{-1/2}$. If the conditional mean $g^*$ belongs to $\mathcal{H} \otimes \mathcal{F}$, then

$$\mathcal{E}(\hat{f}_n) - \mathcal{E}(f^*) \le c_\triangle\, q\, \tau^2\, n^{-1/4} \qquad (12)$$

holds with probability not less than $1 - 8e^{-\tau}$ for any $\tau > 0$, with $c_\triangle = \|\triangle\|_{\mathcal{H} \otimes \mathcal{H}}$ and $q$ a constant not depending on $n$, $\tau$ or the loss $\triangle$.

The generalization bound of Thm. 4 is obtained by adapting Thm. 5 of [13] to our results in Thm. 1, as detailed in the supplementary material. To our knowledge these are the first results characterizing in such generality the generalization properties of an estimator for manifold structured learning with a generic smooth loss function. We conclude with a remark on a key quantity in the bound of Thm. 4.

Remark 1 (The constant $c_\triangle$). We comment on the role played in the learning rate by $c_\triangle$, the norm of the loss function $\triangle$ seen as an element of the Hilbert space $\mathcal{H} \otimes \mathcal{H}$. From the discussion of Thm. 1 we have seen that any smooth function on $\mathcal{Y}$ is SELF and belongs to the set $\mathcal{H} \otimes \mathcal{H}$ with $\mathcal{H} = H^2_s(\mathcal{M})$, the Sobolev space of square integrable functions for $s > d/2$. Following this interpretation, we see that the bound in Thm. 4 can improve significantly (in terms of the constants) depending on the regularity of the loss function: smoother loss functions will result in "simpler" learning problems, and vice-versa. In particular, when $\triangle$ corresponds to the squared geodesic distance, the more "regular" the manifold $\mathcal{M}$ is, the simpler the learning problem will be. A refined quantitative characterization of $c_\triangle$ in terms of the Ricci curvature and the injectivity radius of the manifold is left to future work.

² This is a standard assumption for universal consistency (see [48]). An example of a continuous universal kernel on $\mathcal{X} = \mathbb{R}^d$ is the Gaussian $k(x, x') = \exp(-\|x - x'\|^2/\sigma)$, for $\sigma > 0$.

Table 1: Structured loss, gradient of the structured loss and retraction for $P^m_{++}$ and $S^{d-1}$. $Z_i \in P^m_{++}$ and $z_i \in S^{d-1}$ are the training set points. $I \in \mathbb{R}^{d \times d}$ is the identity matrix.

Sphere ($S^{d-1}$):
  $F(y) = \sum_{i=1}^n \alpha_i \arccos(\langle z_i, y \rangle)^2$
  $\nabla_{\mathcal{M}} F(y) = 2 \sum_{i=1}^n \alpha_i\, (y y^\top - I)\, \frac{\arccos(\langle z_i, y \rangle)}{\sqrt{1 - \langle z_i, y \rangle^2}}\, z_i$
  $R_y(v) = v / \|v\|$

Positive definite matrix manifold ($P^m_{++}$):
  $F(Y) = \sum_{i=1}^n \alpha_i\, \|\log(Y^{-1/2} Z_i Y^{-1/2})\|_F^2$
  $\nabla_{\mathcal{M}} F(Y) = 2 \sum_{i=1}^n \alpha_i\, Y^{1/2} \log(Y^{1/2} Z_i^{-1} Y^{1/2})\, Y^{1/2}$
  $R_Y(v) = Y^{1/2} \exp(Y^{-1/2} v\, Y^{-1/2})\, Y^{1/2}$

4 Manifold Structured Prediction Algorithm and Experiments

In this section we recall geometric optimization algorithms that can be adopted to perform the estimation of $\hat{f}$ on a novel test point $x$. We then evaluate the performance of the proposed method in practice, reporting numerical results on simulated and real data.

The algorithm presented in this paper consists of two steps. In the training phase, the matrix $C = (K + n\lambda I)^{-1}$ is computed, requiring a computational cost of essentially $O(n^3)$ in time and $O(n^2)$ in space (see Eq. (3)). In the inference phase, given a test point $x$, the coefficients in Eq. (3) are computed as $\alpha(x) := C K_x$, requiring essentially $O(n)$ in time, and then the optimization problem in Eq. (2) is solved (see the next subsection for more details). Note that it is possible to reduce the computational complexity of the training and evaluation of the coefficients, while retaining the statistical guarantees of the proposed method.
Indeed, the computation of the coefficients consists essentially in solving a kernel ridge regression problem [10], as analyzed in [13], for which methods based on random projections, such as Nyström [44] or random features [39], reduce the complexity to $O(n\sqrt{n})$ in time and $O(n)$ in space, while guaranteeing the same statistical properties [40, 42, 9, 11, 41].

4.1 Optimization on Manifolds

We begin by discussing the computational aspects related to evaluating the manifold structured estimator. In particular, we discuss how to address the optimization problem in Eq. (2) in specific settings. Given a test point $x \in \mathcal{X}$, this process consists in solving a minimization over $\mathcal{Y}$, namely

$$\min_{y \in \mathcal{Y}} F(y), \qquad (13)$$

where $F(y)$ corresponds to the linear combination of $\triangle(y, y_i)$ weighted by the scores $\alpha_i(x)$ computed according to Eq. (3). If $\mathcal{Y}$ is a linear manifold or a subset of $\mathcal{M} = \mathbb{R}^d$, this problem can be solved by means of gradient-based minimization algorithms, such as Gradient Descent (GD):

$$y_{t+1} = y_t - \eta_t \nabla F(y_t) \qquad (14)$$

for a step size $\eta_t \in \mathbb{R}$. This algorithm can be extended to Riemannian gradient descent (RGD) [52] on manifolds, as

$$y_{t+1} = \mathrm{Exp}_{y_t}\!\big(-\eta_t \nabla_{\mathcal{M}} F(y_t)\big), \qquad (15)$$

where $\nabla_{\mathcal{M}} F$ is the gradient defined with respect to the Riemannian metric (see [1]) and $\mathrm{Exp}_y : T_y\mathcal{M} \to \mathcal{M}$ denotes the exponential map at $y \in \mathcal{Y}$, mapping a vector from the tangent space $T_y\mathcal{M}$ to the associated point on the manifold according to the Riemannian metric [27]. For completeness, the algorithm is recalled in ??. For this family of gradient-based algorithms it is possible to substitute the exponential map with a retraction $R_y : T_y\mathcal{M} \to \mathcal{M}$, which is a first order approximation of the exponential map. Retractions are often faster to compute and still offer convergence guarantees [1]. In the following experiments we will use both retractions and exponential maps.
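For the sphere, instantiating the RGD iteration of Eq. (15) with the gradient and the normalization retraction of Table 1 gives a short solver for Eq. (2). A minimal NumPy sketch; the step size, iteration count and clipping constant are illustrative choices, not the paper's:

```python
import numpy as np

def rgd_sphere(Z, alpha, y0, eta=0.1, iters=300):
    """RGD for F(y) = sum_i alpha_i * arccos(<z_i, y>)^2 on the unit sphere.

    Z: (n, d) array of training outputs z_i; alpha: (n,) scores; y0: start point.
    Uses the retraction R_y(v) = v/||v|| from Table 1 in place of the exponential map."""
    y = y0 / np.linalg.norm(y0)
    for _ in range(iters):
        u = np.clip(Z @ y, -1 + 1e-9, 1 - 1e-9)          # inner products <z_i, y>
        coef = 2 * alpha * np.arccos(u) / np.sqrt(1 - u ** 2)
        egrad = -coef @ Z                                 # Euclidean gradient of F
        rgrad = egrad - (egrad @ y) * y                   # project onto tangent space
        y = y - eta * rgrad                               # descent step ...
        y = y / np.linalg.norm(y)                         # ... followed by retraction
    return y
```

With a single training output the minimizer is that output itself, which gives a quick correctness check.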
We mention that the step size $\eta_t$ can be found with a line search over the validation set; for more details see [1]. Note also that stochastic optimization algorithms have been generalized to the Riemannian setting, such as R-SGD [8]. Interestingly, methods such as R-SGD can be advantageous in our setting. Indeed, since the inference in Eq. (2) requires minimizing a sum over $n$ elements, it might be favorable from the computational perspective to adopt a stochastic method that at each iteration minimizes the functional with respect to a single (or a mini-batch of) randomly sampled elements of the entire sum.

Table 2: Simulation experiment: average squared loss (first two columns) and $\triangle_{PD}$ loss (last two columns) of the proposed structured prediction (SP) approach and the KRLS baseline on learning the inverse of a PD matrix for increasing matrix dimension.

Dim | Squared loss SP | Squared loss KRLS | $\triangle_{PD}$ SP | $\triangle_{PD}$ KRLS
5  | 0.72±0.08 | 0.89±0.08 | 0.94±0.06 | 111±64
10 | 0.81±0.03 | 0.92±0.05 | 1.24±0.06 | 44±8.3
15 | 0.83±0.03 | 0.91±0.06 | 1.25±0.05 | 56±10
20 | 0.85±0.02 | 0.91±0.03 | 1.33±0.03 | 59±12
25 | 0.87±0.01 | 0.91±0.02 | 1.44±0.03 | 72±9
30 | 0.88±0.01 | 0.91±0.02 | 1.55±0.03 | 67±7.2

Table 1 reports gradients and retraction maps for the geodesic distance of the two problems of interest considered in this work: the positive definite manifold and the sphere. See Sections 4.2 and 4.3 for more details on the related manifolds.

We point out that using optimization algorithms that comply with the geometry of the manifold, such as RGD, guarantees that the computed value is an element of the manifold. This is in contrast with algorithms that compute a solution in a linear space containing $\mathcal{M}$ and then need to project the computed solution onto $\mathcal{M}$.
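The R-SGD variant mentioned above can be sketched by sampling one term of the sum per iteration. The step-size schedule and the rescaling by $n$ (to keep the sampled gradient an unbiased estimate of the full one) are our own illustrative choices, not the exact recipe of the paper or of [8]:

```python
import numpy as np

def rsgd_sphere(Z, alpha, y0, eta0=0.1, iters=500, seed=0):
    """R-SGD for F(y) = sum_i alpha_i * arccos(<z_i, y>)^2 on the unit sphere,
    using one randomly sampled summand per step and a decaying step size."""
    rng = np.random.default_rng(seed)
    y = y0 / np.linalg.norm(y0)
    n = len(alpha)
    for t in range(iters):
        i = rng.integers(n)                                  # sample one term of the sum
        u = float(np.clip(Z[i] @ y, -1 + 1e-9, 1 - 1e-9))
        coef = 2 * n * alpha[i] * np.arccos(u) / np.sqrt(1 - u ** 2)
        egrad = -coef * Z[i]                                 # stochastic Euclidean gradient
        rgrad = egrad - (egrad @ y) * y                      # tangent-space projection
        y = y - (eta0 / (1 + 0.01 * t)) * rgrad              # decaying step
        y = y / np.linalg.norm(y)                            # retraction onto the sphere
    return y
```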
We next discuss empirical evaluations of the proposed manifold structured estimator on both synthetic and real datasets.

4.2 Synthetic Experiments: Learning Positive Definite Matrices

We consider the problem of learning a function $f : \mathbb{R}^d \to \mathcal{Y} = P^m_{++}$, where $P^m_{++}$ denotes the cone of positive definite (PD) $m \times m$ matrices. Note that $P^m_{++}$ is a manifold with squared geodesic distance $\triangle_{PD}$ between any two PD matrices $Z, Y \in P^m_{++}$ defined as

$$\triangle_{PD}(Z, Y) = \|\log(Y^{-1/2}\, Z\, Y^{-1/2})\|_F^2, \qquad (16)$$

where, for any $M \in P^m_{++}$, the matrices $M^{1/2}$ and $\log(M)$ have the same eigenvectors as $M$ but, respectively, the square root and the logarithm of the eigenvalues of $M$. In Table 1 we show the structured loss, the gradient of the structured loss and the exponential map for the PD matrix manifold. We refer the reader to [31, 6] for a more detailed introduction to the manifold of positive definite matrices.

For the experiments reported in the following, we compared the performance of the manifold structured estimator minimizing the loss $\triangle_{PD}$ with a Kernel Regularized Least Squares (KRLS) baseline (see Appendix ??), both trained using the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|^2 / 2\sigma^2)$. The matrices predicted by the KRLS estimator are projected onto the PD manifold by setting the negative eigenvalues to a small positive constant ($10^{-12}$). For the manifold structured estimator, the optimization problem in Eq. (2) was solved with the Riemannian Gradient Descent (RGD) algorithm [1]. We refer to [52] regarding the implementation of RGD in the case of the geodesic distance on the PD cone.

Learning the Inverse of a Positive Definite Matrix.
We consider the problem of learning the function $f : P^m_{++} \to P^m_{++}$ such that $f(X) = X^{-1}$ for any $X \in P^m_{++}$. Input matrices are generated as $X_i = U \Sigma U^\top \in P^m_{++}$, with $U$ a random orthonormal matrix sampled from the Haar distribution [16] and $\Sigma \in P^m_{++}$ a diagonal matrix with entries randomly sampled from the uniform distribution on $[0, 10]$. We generated datasets of increasing dimension $m$ from 5 to 50, each with 1000 points for training, 100 for validation and 100 for testing. The kernel bandwidth $\sigma$ and the regularization parameter $\lambda$ were selected by cross-validation, respectively in the ranges 0.1 to 1000 and $10^{-6}$ to 1 (logarithmically spaced).

Table 2 reports the performance of the manifold structured estimator (SP) and the KRLS baseline with respect to both the $\triangle_{PD}$ loss and the least squares loss (normalized with respect to the number of dimensions). Note that the KRLS estimator targets the least squares (Frobenius) loss and is not designed to capture the geometry of the PD cone. We notice that the proposed approach significantly outperforms the KRLS baseline with respect to the $\triangle_{PD}$ loss. This is expected: $\triangle_{PD}$ especially penalizes matrices with very different eigenvalues, and our method cannot predict matrices with non-positive eigenvalues, as opposed to KRLS, which computes a linear solution in $\mathbb{R}^{d^2}$ and then projects it onto the manifold. However, the two methods perform comparably with respect to the squared loss.

Figure 1: (Left) Fingerprint reconstruction: average absolute error (in degrees) for the manifold structured estimator (SP), the manifold regression (MR) approach in [47] and the KRLS baseline: KRLS 26.9 ± 5.4, MR [47] 22 ± 6, SP (ours) 18.8 ± 3.9. (Right) Fingerprint reconstruction of a single image, where the structured predictor achieves 15.7 average error while KRLS achieves 25.3.
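The loss of Eq. (16) used throughout this experiment reduces to two eigendecompositions. A minimal NumPy sketch for illustration (not the authors' implementation):

```python
import numpy as np

def pd_geodesic_sq(Z, Y):
    """Squared geodesic distance on the PD cone, Eq. (16):
    ||log(Y^{-1/2} Z Y^{-1/2})||_F^2, via two symmetric eigendecompositions."""
    w, U = np.linalg.eigh(Y)
    Y_inv_sqrt = (U / np.sqrt(w)) @ U.T       # Y^{-1/2} from Y's eigenpairs
    M = Y_inv_sqrt @ Z @ Y_inv_sqrt           # congruence transform; SPD if Z, Y are
    lam = np.linalg.eigvalsh(M)               # eigenvalues of M are positive
    return float(np.sum(np.log(lam) ** 2))
```

Despite the asymmetric-looking formula, this distance is symmetric in its arguments and is the affine-invariant metric mentioned in the references on the PD manifold.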
This behavior is consistent with the fact that our estimator is aware of the natural structure of the output space and uses it profitably during learning.

4.3 Fingerprint Reconstruction

We consider the fingerprint reconstruction application of [47] in the context of manifold regression. Given a partial image of a fingerprint, the goal is to reconstruct the contour lines in output. Each fingerprint image is interpreted as a separate structured prediction problem, where training input points correspond to the 2D positions x ∈ R^2 of valid contour lines and the output is the local orientation of the contour line, interpreted as a point on the circle S^1. The space S^1 is a manifold with squared geodesic distance Δ_{S^1} between two points z, y ∈ S^1 given by

    Δ_{S^1}(z, y) = arccos(⟨z, y⟩)^2    (17)

where arccos is the inverse cosine function. In Table 1 we show the computation of the structured loss, the gradient of the structured loss and the chosen retraction for the sphere manifold. We compared the performance of the manifold structured estimator proposed in this paper with the manifold regression approach of [47] on the FVC fingerprint verification challenge dataset³. The dataset consists of 48 fingerprint pictures, each with ∼1400 points for training, ∼1000 points for validation and the rest (∼25000) for testing.

Fig. 1 reports the average absolute error (in degrees) between the true contour orientation and the one estimated by our structured prediction approach (SP), the manifold regression (MR) of [47] and the KRLS baseline. Our method outperforms the MR competitor by a significant margin. As expected, the KRLS baseline is not able to capture the geometry of the output space and has a significantly larger error than the two other approaches. This is also observed in the qualitative plot in Fig.
1 (Right), where the predictions of our SP approach and the KRLS baseline are compared with the ground truth on a single fingerprint. Output orientations are reported for each pixel with a color depending on the orientation (from 0 to π). While the KRLS predictions are quite inconsistent, our estimator is very accurate and even "smoother" than the ground truth.

4.4 Multilabel Classification on the Statistical Manifold

We evaluated our algorithm on multilabel prediction problems. In this context the output is an m-dimensional histogram, i.e. a discrete probability distribution over m points. We consider as manifold the space of probability distributions over m points, that is the m-dimensional simplex Δ^m endowed with the Fisher information metric [3]. We consider Y = Δ^m_ε, where we require y_1, ..., y_m ≥ ε for some ε > 0; in the experiments we set ε = 1e−5. The geodesic distance induced by the Fisher metric is

    d(y, y') = arccos( Σ_{i=1}^m √(y_i y'_i) )    [35]

This distance is obtained by applying the map π : Δ^m → S^{m−1}, π(y) = (√y_1, ..., √y_m), to the points {y_i}^n_{i=1} ∈ Δ^m. The resulting points belong to the intersection of the positive quadrant R^m_++ and the sphere S^{m−1}. We can therefore use the geodesic distance on the sphere and the gradient and retraction map described in Table 1. We test our approach on some of the benchmark multilabel datasets described in [51] and compare the results with the KRLS baseline.

³ http://bias.csr.unibo.it/fvc2004

Table 3: Area under the curve (AUC) on multilabel benchmark datasets [51] for KRLS and SP.

              KRLS    SP (ours)
    Emotions  0.63    0.73
    CAL500    0.92    0.92
    Scene     0.62    0.73
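The sphere geometry used here and in the fingerprint experiment can likewise be sketched in NumPy (our illustration; the gradient and retraction below are standard choices — tangent projection and renormalization — which we assume match the quantities of Table 1, not reproduced here):

```python
import numpy as np

def sphere_dist2(z, y):
    # Squared geodesic distance on the sphere, Eq. (17): arccos(<z, y>)^2.
    return np.arccos(np.clip(np.dot(z, y), -1.0, 1.0)) ** 2

def sphere_rgrad(z, y):
    # Riemannian gradient of sphere_dist2(z, .) at y: the Euclidean
    # gradient projected onto the tangent space {v : <v, y> = 0}.
    c = np.clip(np.dot(z, y), -1.0, 1.0)
    g = -2.0 * np.arccos(c) / np.sqrt(1.0 - c * c) * z
    return g - np.dot(g, y) * y

def retract(y, v):
    # Metric-projection retraction: tangent step followed by renormalization.
    w = y + v
    return w / np.linalg.norm(w)

def simplex_to_sphere(y, eps=1e-5):
    # pi(y) = (sqrt(y_1), ..., sqrt(y_m)): under this map the Fisher
    # geodesic distance arccos(sum_i sqrt(y_i y'_i)) becomes the sphere
    # distance; eps clipping keeps the point in the interior, as for Y = Δ^m_ε.
    y = np.clip(y, eps, None)
    return np.sqrt(y / y.sum())

# A few Riemannian gradient descent steps on S^1 toward a target point z:
z, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
for _ in range(25):
    y = retract(y, -0.2 * sphere_rgrad(z, y))
```

In practice the optimization in Eq. (2) minimizes a weighted combination of such distances over the training outputs, but the update is the same: a gradient step in the tangent space followed by a retraction.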
We cross-validate λ and σ taking values, respectively, in the intervals [10^-6, 10^-1] and [0.1, 10]. We use the area under the curve (AUC) [45] metric to evaluate the quality of the predictions; results are shown in Table 3.

4.5 Additional Example

We conclude this section with a further relevant example of our proposed approach, in the setting of relational knowledge, which we plan to investigate in future work. In particular, we consider settings where the output space M corresponds to the hyperbolic space (or Poincaré disk), which has recently been adopted by the knowledge representation community to learn embeddings of relational data that encode discrete semantic/hierarchical information [34, 33]. The embedding is such that symbolic objects (e.g. words, entities, concepts) with high semantic or functional similarity are mapped to points with small hyperbolic geodesic distance. Typically, learning the embedding is a time-consuming process that requires training from scratch on the whole dataset whenever a new example is provided. To address this issue, with our manifold regression approach we could learn f : X → M, with x ∈ X the observed entity and f(x) its predicted embedding. This would allow transferring the embedding learned with techniques such as those in [34] to new points, without retraining the entire system from scratch. Interestingly, our theory applies to this setting, since M satisfies Asm. 1.

5 Conclusions

In this paper we studied a structured prediction approach for manifold valued learning problems. In particular, we characterized a wide class of loss functions (including the geodesic distance) for which we proved the considered algorithm to be statistically consistent, additionally providing finite sample bounds under standard regularity assumptions.
Our experiments show promising results on synthetic and real data using two common manifolds: the cone of positive definite matrices and the sphere. With the latter we considered applications to fingerprint reconstruction and multilabel classification. The proposed method leaves some open questions. From a statistical point of view, it is of interest to understand how invariants of the manifold explicitly affect the learning rates; see Remark 1. From a more computational perspective, even though our algorithm achieves good results experimentally, we did not investigate its convergence guarantees from an optimization standpoint.

Acknowledgments.

A. R. acknowledges the support of the European Research Council (grant SEQUOIA 724063). L. R. acknowledges the support of the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826. The work of L. R. and G. M. M. is supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216, and the Italian Institute of Technology. We gratefully acknowledge the support of NVIDIA Corporation for the donation of the Titan Xp GPUs and the Tesla K40 GPU used for this research. This work was supported in part by EPSRC grant EP/P009069/1.

References

[1] P-A Absil, Robert Mahony, and Rodolphe Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.

[2] Mauricio A Alvarez, Lorenzo Rosasco, Neil D Lawrence, et al. Kernels for vector-valued functions: A review. Foundations and Trends® in Machine Learning, 4(3):195–266, 2012.

[3] Shun-ichi Amari and Hiroshi Nagaoka. Methods of Information Geometry, volume 191. American Mathematical Society, 2007.

[4] Nachman Aronszajn. Theory of reproducing kernels.
Transactions of the American Mathematical Society, 68(3):337–404, 1950.

[5] GH Bakir, T Hofmann, B Schölkopf, AJ Smola, B Taskar, and SVN Vishwanathan. Predicting Structured Data. Neural Information Processing series, MIT Press, 2007.

[6] Rajendra Bhatia. Positive Definite Matrices. Princeton University Press, 2009.

[7] Veli Bicer, Thanh Tran, and Anna Gossen. Relational kernel machines for learning from graph-structured RDF data. In Extended Semantic Web Conference, pages 47–62. Springer, 2011.

[8] Silvere Bonnabel. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.

[9] Raffaello Camoriano, Tomás Angles, Alessandro Rudi, and Lorenzo Rosasco. NYTRO: When subsampling meets early stopping. In Artificial Intelligence and Statistics, pages 1403–1411, 2016.

[10] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.

[11] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. Advances in Neural Information Processing Systems, 2018.

[12] Carlo Ciliberto, Francis Bach, and Alessandro Rudi. Localized structured prediction. arXiv preprint arXiv:1806.02402, 2018.

[13] Carlo Ciliberto, Lorenzo Rosasco, and Alessandro Rudi. A consistent regularization approach for structured prediction. Advances in Neural Information Processing Systems 29 (NIPS), pages 4412–4420, 2016.

[14] Carlo Ciliberto, Alessandro Rudi, Lorenzo Rosasco, and Massimiliano Pontil. Consistent multitask learning with nonlinear output relations. In Advances in Neural Information Processing Systems, pages 1986–1996, 2017.

[15] Harold Charles Daume and Daniel Marcu. Practical structured learning techniques for natural language processing. Citeseer, 2006.

[16] Joe Diestel and Angela Spalsbury.
The Joys of Haar Measure. American Mathematical Society, 2014.

[17] Moussab Djerrab, Alexandre Garcia, Maxime Sangnier, and Florence d'Alché-Buc. Output Fisher embedding regression. Machine Learning, pages 1–28, 2018.

[18] John C Duchi, Lester W Mackey, and Michael I Jordan. On the consistency of ranking algorithms. In ICML, pages 327–334, 2010.

[19] P Thomas Fletcher. Geodesic regression and the theory of least squares on Riemannian manifolds. International Journal of Computer Vision, 105(2):171–185, 2013.

[20] Wolfgang Härdle and Léopold Simar. Applied Multivariate Statistical Analysis. Springer, 2007.

[21] Søren Hauberg, Oren Freifeld, and Michael J Black. A geometric take on metric learning. In Advances in Neural Information Processing Systems, pages 2024–2032, 2012.

[22] Emmanuel Hebey. Nonlinear Analysis on Manifolds: Sobolev Spaces and Inequalities, volume 5. American Mathematical Society, 2000.

[23] Jacob Hinkle, Prasanna Muralidharan, P Thomas Fletcher, and Sarang Joshi. Polynomial regression on Riemannian manifolds. In European Conference on Computer Vision, pages 1–14. Springer, 2012.

[24] Mohammed Waleed Kadous and Claude Sammut. Classification of multivariate time series and structured data using constructive induction. Machine Learning, 58(2):179–216, 2005.

[25] Anna Korba, Alexandre Garcia, and Florence d'Alché-Buc. A structured prediction approach for label ranking. Advances in Neural Information Processing Systems, 2018.

[26] Quoc V Le, Tim Sears, and Alexander J Smola. Nonparametric quantile estimation. 2005.

[27] John M Lee. Smooth manifolds. In Introduction to Smooth Manifolds, pages 1–29. Springer, 2003.

[28] Junhong Lin, Alessandro Rudi, Lorenzo Rosasco, and Volkan Cevher. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces.
Applied and Computational Harmonic Analysis, 2018.

[29] Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. Advances in Neural Information Processing Systems, 2018.

[30] Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions. Neural Computation, 17(1):177–204, 2005.

[31] Maher Moakher and Philipp G Batchelor. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields, pages 285–298. Springer, 2006.

[32] Jeffrey S Morris. Functional regression. Annual Review of Statistics and Its Application, 2:321–359, 2015.

[33] Maximilian Nickel and Douwe Kiela. Learning continuous hierarchies in the Lorentz model of hyperbolic geometry. International Conference on Machine Learning, 2018.

[34] Maximilian Nickel and Douwe Kiela. Poincaré embeddings for learning hierarchical representations. In Advances in Neural Information Processing Systems, pages 6338–6347, 2017.

[35] Frank Nielsen and Ke Sun. Clustering in Hilbert simplex geometry. arXiv preprint arXiv:1704.00454, 2017.

[36] Alex Nowak-Vila, Francis Bach, and Alessandro Rudi. Sharp analysis of learning with discrete losses. arXiv preprint arXiv:1810.06839, 2018.

[37] Sebastian Nowozin, Christoph H Lampert, et al. Structured learning and prediction in computer vision. Foundations and Trends® in Computer Graphics and Vision, 6(3–4):185–365, 2011.

[38] Benjamin Paaßen, Christina Göpfert, and Barbara Hammer. Time series prediction for graphs in kernel and dissimilarity spaces. Neural Processing Letters, pages 1–21, 2017.

[39] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines.
In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.

[40] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, pages 1657–1665, 2015.

[41] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. FALKON: An optimal large scale kernel method. In Advances in Neural Information Processing Systems, pages 3888–3898, 2017.

[42] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3215–3225, 2017.

[43] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[44] Alex J Smola and Bernhard Schölkopf. Sparse greedy matrix approximation for machine learning. 2000.

[45] Ashwin Srinivasan. Note on the location of optimal classifiers in n-dimensional ROC space. 1999.

[46] Florian Steinke and Matthias Hein. Non-parametric regression between manifolds. In Advances in Neural Information Processing Systems, pages 1561–1568, 2009.

[47] Florian Steinke, Matthias Hein, and Bernhard Schölkopf. Nonparametric regression between general Riemannian manifolds. SIAM Journal on Imaging Sciences, 3(3):527–563, 2010.

[48] Ingo Steinwart and Andreas Christmann. Support Vector Machines. Springer Science & Business Media, 2008.

[49] Ingo Steinwart, Don R Hush, Clint Scovel, et al. Optimal rates for regularized least squares regression. In COLT, 2009.

[50] François Treves. Topological Vector Spaces, Distributions and Kernels: Pure and Applied Mathematics, volume 25. Elsevier, 2016.

[51] Grigorios Tsoumakas, Ioannis Katakis, and Ioannis Vlahavas. Mining multi-label data. In Data Mining and Knowledge Discovery Handbook, pages 667–685.
Springer, 2009.

[52] Hongyi Zhang and Suvrit Sra. First-order methods for geodesically convex optimization. In Conference on Learning Theory, pages 1617–1638, 2016.