{"title": "Joint quantile regression in vector-valued RKHSs", "book": "Advances in Neural Information Processing Systems", "page_first": 3693, "page_last": 3701, "abstract": "To give a more complete picture than the average relationship provided by standard regression, a novel framework for estimating and predicting several conditional quantiles simultaneously is introduced. The proposed methodology leverages kernel-based multi-task learning to curb the embarrassing phenomenon of quantile crossing, with a one-step estimation procedure and no post-processing. Moreover, this framework comes with theoretical guarantees and an efficient coordinate descent learning algorithm. Numerical experiments on benchmark and real datasets highlight the enhancements of our approach regarding the prediction error, the crossing occurrences and the training time.", "full_text": "Joint quantile regression in vector-valued RKHSs

Maxime Sangnier, Olivier Fercoq, Florence d'Alché-Buc
LTCI, CNRS, Télécom ParisTech, Université Paris-Saclay
75013, Paris, France
{maxime.sangnier, olivier.fercoq, florence.dalche}@telecom-paristech.fr

Abstract

To give a more complete picture than the average relationship provided by standard regression, a novel framework for estimating and predicting several conditional quantiles simultaneously is introduced. The proposed methodology leverages kernel-based multi-task learning to curb the embarrassing phenomenon of quantile crossing, with a one-step estimation procedure and no post-processing. Moreover, this framework comes with theoretical guarantees and an efficient coordinate descent learning algorithm. 
Numerical experiments on benchmark and real datasets highlight the enhancements of our approach regarding the prediction error, the crossing occurrences and the training time.

1 Introduction

Given a couple (X, Y) of random variables, where Y takes scalar values, a common aim in statistics and machine learning is to estimate the conditional expectation E[Y | X = x] as a function of x. In this setting, called regression, one assumes that the main information in Y is a scalar value corrupted by a centered noise. However, in some applications such as medicine, economics, social sciences and ecology, a more complete picture than an average relationship is required to deepen the analysis. Expectiles and quantiles are two kinds of quantities able to achieve this goal.

This paper deals with the latter setting, called (conditional) quantile regression. This topic was championed by Koenker and Bassett [16] as the minimization of the pinball loss (see [15] for an extensive presentation) and brought to the attention of the machine learning community by Takeuchi et al. [26]. Ever since then, several studies have built upon this framework, the most recent ones including the regression of a single quantile of a random vector [12]. On the contrary, we are interested in estimating and predicting simultaneously several quantiles of a scalar-valued random variable Y|X (see Figure 1), a problem called joint quantile regression. For this purpose, we focus on non-parametric hypotheses from a vector-valued Reproducing Kernel Hilbert Space (RKHS).

Since the quantiles of a distribution are closely related, joint quantile regression falls within the field of multi-task learning [3]. As a consequence, vector-valued kernel methods are appropriate for such a task. They have already been used for various applications, such as structured classification [10] and prediction [7], manifold regularization [21, 6] and functional regression [14]. 
Quantile regression is a new opportunity for vector-valued RKHSs to perform in a multi-task problem, along with a loss that is different from the ℓ2 cost predominantly used in the previous references. In addition, such a framework offers a novel way to curb the phenomenon of quantile curve crossing, while preserving the so-called quantile property (which may not be true for current approaches). The latter guarantees that the ratio of observations lying below a predicted quantile is close to the quantile level of interest.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In a nutshell, the contributions of this work are (following the outline of the paper): i) a novel methodology for joint quantile regression, based on vector-valued RKHSs; ii) enhanced predictions thanks to a multi-task approach, along with a limited appearance of crossing curves; iii) theoretical guarantees regarding the generalization of the model; iv) an efficient coordinate descent algorithm, able to handle the intercept of the model in a manner that is simple and different from Sequential Minimal Optimization (SMO). Besides these novelties, the enhancements of the proposed method and the efficiency of our learning algorithm are supported by numerical experiments on benchmark and real datasets.

2 Problem definition

2.1 Quantile regression

Let Y ⊂ R be a compact set, X be an arbitrary input space and (X, Y) ∈ X × Y a pair of random variables following an unknown joint distribution. For a given probability τ ∈ (0, 1), the conditional τ-quantile of (X, Y) is the function µ_τ : X → R such that µ_τ(x) = inf{µ ∈ R : P(Y ≤ µ | X = x) ≥ τ}. Thus, given a training set {(x_i, y_i)}_{i=1}^n ∈ (X × Y)^n, the quantile regression problem aims at estimating this conditional τ-quantile function µ_τ. Following Koenker [15], this can be achieved by minimizing the pinball loss ℓ_τ(r) = max(τr, (τ − 1)r), where r ∈ R is a residual. Using such a loss first arose from the observation that the location parameter µ that minimizes the ℓ1-loss Σ_{i=1}^n |y_i − µ| is an estimator of the unconditional median [16].

Now focusing on the estimation of a conditional quantile, one can show that the target function µ_τ is a minimizer of the τ-quantile risk R_τ(h) = E[ℓ_τ(Y − h(X))] [17]. However, since the joint probability distribution of (X, Y) is unknown and we are only provided with an independent and identically distributed (iid) sample of observations {(x_i, y_i)}_{i=1}^n, we resort to minimizing the empirical risk R_τ^emp(h) = (1/n) Σ_{i=1}^n ℓ_τ(y_i − h(x_i)) within a class H ⊂ R^X of functions, calibrated in order to overcome the shift from the true risk to the empirical one. In particular, when H has the form H = {h = f + b : b ∈ R, f ∈ R^X, ψ(f) ≤ c}, with ψ : R^X → R a convex function and c > 0 a constant, Takeuchi et al. [26] proved that (similarly to the unconditional case) the quantile property is satisfied: for any estimator ĥ obtained by minimizing R_τ^emp in H, the ratio of observations lying below ĥ (i.e. y_i < ĥ(x_i)) equals τ up to a small error (the ratio of observations exactly equal to ĥ(x_i)). Moreover, under some regularity assumptions, this quantity converges to τ when the sample grows. Note that these properties hold because the intercept b is unconstrained.

2.2 Multiple quantile regression

In many real problems (such as medical reference charts), one is interested in estimating not a single quantile curve but several of them. Thus, denoting N_p the range of integers between 1 and p, for several quantile levels τ_j (j ∈ N_p) and functions h_j ∈ H, the empirical loss to be minimized can be written as the following separable function: R_τ^emp(h_1, ..., h_p) = (1/n) Σ_{i=1}^n Σ_{j=1}^p ℓ_{τ_j}(y_i − h_j(x_i)), where τ denotes the p-dimensional vector of quantile levels.

A nice feature of multiple quantile regression is thus to extract slices of the conditional distribution of Y|X. However, when quantiles are estimated independently, an embarrassing phenomenon often appears: quantile functions cross, thus violating the basic principle that the cumulative distribution function should be monotonically non-decreasing. We refer to that pitfall as the crossing problem.

In this paper, we propose to prevent curve crossing by considering the problem of multiple quantile regression as a vector-valued regression problem where outputs are not independent. An interesting feature of our method is to preserve the quantile property, while most other approaches lose it when struggling with the crossing problem.

2.3 Related work

Going beyond linear and spline-based models, quantile regression in RKHSs was introduced a decade ago [26, 17]. In [26], the authors proposed to minimize the pinball loss in a scalar-valued RKHS and to add hard constraints on the training points in order to prevent the crossing problem. Our work can be legitimately seen as an extension of [26] to multiple quantile regression, using a vector-valued RKHS and structural constraints against curve crossing thanks to an appropriate matrix-valued kernel.

Another related work is [27], which first introduced the idea of multi-task learning for quantile regression. 
In [27], linear quantile curves are estimated jointly with a common feature subspace shared across the tasks, based on multi-task feature learning [3]. In addition, the authors showed that for such linear regressors, a common representation shared across infinitely many tasks can be computed, thus estimating simultaneously the conditional quantiles for all possible quantile levels. Both previous approaches will be considered in the numerical experiments.

Quantile regression has been investigated from many other perspectives, including different losses leading to an approximate quantile property (ε-insensitive [25], re-weighted least squares [22]), along with models and estimation procedures to curb the crossing problem: location-scale models with a multi-step strategy [13], tensor product spline surfaces [22], non-negative valued kernels [18], hard non-crossing constraints [26, 28, 5], inversion and monotonization of a conditional distribution estimation [9] and rearrangement of quantile estimations [8], to cite only a few references. Let us remark that some solutions, such as the non-crossing constraints of [26], theoretically lose the quantile property because of constraining the intercept.

In comparison to the literature, we propose a novel methodology, based on vector-valued RKHSs, with a one-step estimation, no post-processing, and which keeps the quantile property while dealing with curve crossing. We also provide an efficient learning algorithm and theoretical guarantees.

3 Vector-valued RKHS for joint quantile regression

3.1 Joint estimation

Given a vector τ ∈ (0, 1)^p of quantile levels, multiple quantile regression is now considered as a joint estimation in (R^p)^X of the target function x ∈ X ↦ (µ_{τ_1}(x), ..., µ_{τ_p}(x)) ∈ R^p of conditional quantiles. Thus, let now ψ be a convex regularizer on (R^p)^X and H = {h = f + b : b ∈ R^p, f ∈ (R^p)^X, ψ(f) ≤ c} be the hypothesis set. Similarly to the single-quantile case, joint quantile regression aims at minimizing R_τ^emp(h) = (1/n) Σ_{i=1}^n ℓ_τ(y_i 1 − h(x_i)), where 1 stands for the all-ones vector, ℓ_τ(r) = Σ_{j=1}^p ℓ_{τ_j}(r_j), and h is in H, which is to be appropriately chosen in order to estimate the p conditional quantiles while enhancing predictions and avoiding curve crossing. It is worth remarking that, independently of the choice of ψ, the quantile property is still verified for a vector-valued estimator, since the loss is separable and the intercept is unconstrained. Similarly, the vector-valued function whose components are the conditional τ_j-quantiles is still a minimizer of the τ-quantile risk R_τ(h) = E[ℓ_τ(Y 1 − h(X))].

In this context, the constraint ψ does not necessarily apply independently to each coordinate function h_j but can impose dependencies between them. The theory of vector-valued RKHSs seems especially well suited for this purpose when ψ is the norm associated to such a space. In this situation, the choice of the kernel does not only influence the nature of the hypotheses (linear, non-linear, universal approximators) but also the way the estimation procedure is regularized.
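Before specifying ψ, the separability of ℓ_τ and the quantile property of Section 2.1 can be checked numerically for a single level τ (a minimal, self-contained sketch on synthetic data; all numerical choices are illustrative):

```python
import numpy as np

def pinball_loss(r, tau):
    # ℓ_τ(r) = max(τ·r, (τ − 1)·r), applied elementwise to the residuals r.
    return np.maximum(tau * r, (tau - 1.0) * r)

rng = np.random.default_rng(0)
y = rng.normal(size=10_000)
tau = 0.9

# Minimize the empirical pinball risk over a constant location parameter µ (grid search).
mus = np.linspace(-3.0, 3.0, 2001)
risks = np.array([pinball_loss(y - mu, tau).mean() for mu in mus])
mu_hat = float(mus[int(np.argmin(risks))])

# Quantile property: the ratio of observations lying below the minimizer is close to τ.
ratio = float(np.mean(y < mu_hat))
print(mu_hat, ratio)
```

As expected, mu_hat lands close to the empirical 0.9-quantile of the sample and ratio is close to 0.9; since ℓ_τ(r) = Σ_j ℓ_{τ_j}(r_j) is separable, the same argument applies to each component of a vector-valued estimator, whatever the regularizer ψ.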
In particular, the kernel critically operates on the output space by encoding structural constraints on the outputs.

3.2 Matrix-valued kernel

Let us denote by ·ᵀ the transpose operator and by L(R^p) the set of linear and bounded operators from R^p to itself. In our (finite-dimensional) case, L(R^p) comes down to the set of p × p real-valued matrices. A matrix-valued kernel is a function K : X × X → L(R^p) that is symmetric and positive [20]: ∀(x, x') ∈ X × X, K(x, x') = K(x', x)ᵀ, and ∀m ∈ N, ∀{(α_i, β_i)}_{1≤i≤m} ∈ (X × R^p)^m, Σ_{1≤i,j≤m} ⟨β_i | K(α_i, α_j) β_j⟩_{ℓ2} ≥ 0.

Let K be such a kernel and, for any x ∈ X, let K_x : y ∈ R^p ↦ K_x y ∈ (R^p)^X be the linear operator such that ∀x' ∈ X: (K_x y)(x') = K(x', x) y. There exists a unique Hilbert space of functions K_K ⊂ (R^p)^X (with an inner product and a norm respectively denoted ⟨· | ·⟩_K and ‖·‖_K), called the RKHS associated to K, such that for all x ∈ X [20]: K_x spans the space K_K (∀y ∈ R^p : K_x y ∈ K_K), K_x is a bounded operator (sup_{y ∈ R^p, ‖y‖_{ℓ2} ≤ 1} ‖K_x y‖_K < ∞) and ∀f ∈ K_K : f(x) = K_x* f (reproducing property), where ·* is the adjoint operator.

From now on, we assume that we are provided with a matrix-valued kernel K and we limit the hypothesis space to H = {f + b : b ∈ R^p, f ∈ K_K, ‖f‖_K ≤ c} (i.e. ψ = ‖·‖_K). Though several candidates are available [1], we focus on one of the simplest and most efficiently computable kernels, called the decomposable kernel: K : (x, x') ↦ k(x, x')B, where k : X × X → R is a scalar-valued kernel and B is a p × p symmetric Positive Semi-Definite (PSD) matrix. In this particular case, the matrix B encodes the relationship between the components f_j and thus the link between the different conditional quantile estimators. A rational choice is to consider B = (exp(−γ(τ_i − τ_j)²))_{1≤i,j≤p}.

To explain it, let us consider two extreme cases (see also Figure 1). First, when γ = 0, B is the all-ones matrix. Since K_K is the closure of the space span{K_x y : (x, y) ∈ X × R^p}, any f ∈ K_K then has all its components equal. Consequently, the quantile estimators h_j = f_j + b_j are parallel (and non-crossing) curves. In this case, the regressor is said to be homoscedastic. Second, when γ → +∞, then B → I (the identity matrix). In this situation, it is easy to show that the components of f ∈ K_K are independent from each other and that ‖f‖²_K = Σ_{j=1}^p ‖f_j‖²_{K'} (where ‖·‖_{K'} is the norm coming with the RKHS associated to k) is separable. Thus, each quantile function is learned independently from the others. Regressors are then said to be heteroscedastic. It appears clearly that, between these two extreme cases, there is room for learning a non-homoscedastic and non-crossing quantile regressor (while preserving the quantile property).

Figure 1: Estimated (plain lines) and true (dashed lines) conditional quantiles of Y|X (synthetic dataset), from homoscedastic regressors (γ = 0) to heteroscedastic ones (γ → +∞).

4 Theoretical analysis

This section is intended to give a few theoretical insights into the expected behavior of our hypotheses. Here, we do assume working in an RKHS, but not specifically with a decomposable kernel.
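Returning for a moment to the decomposable case of Section 3.2, the role of γ in B = (exp(−γ(τ_i − τ_j)²)) can be made concrete with a short computation (a sketch; the γ values are illustrative):

```python
import numpy as np

taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def output_matrix(taus, gamma):
    # B = (exp(-gamma * (tau_i - tau_j)**2))_{i,j}: a Gaussian kernel on the quantile
    # levels, hence a symmetric PSD matrix for any gamma >= 0.
    d = taus[:, None] - taus[None, :]
    return np.exp(-gamma * d ** 2)

B0 = output_matrix(taus, 0.0)     # all-ones matrix: tied components, parallel (homoscedastic) curves
Binf = output_matrix(taus, 1e6)   # numerically the identity: decoupled (heteroscedastic) estimators
B = output_matrix(taus, 10.0)     # intermediate coupling between neighboring quantile levels

# Symmetry and positive semi-definiteness of each candidate B.
for M in (B0, Binf, B):
    assert np.allclose(M, M.T)
    assert float(np.min(np.linalg.eigvalsh(M))) > -1e-10
```

Between the two extremes, the off-diagonal entries of B decay smoothly with |τ_i − τ_j|, which is precisely what couples neighboring quantile estimators.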
First, we aim at providing a uniform generalization bound. For this purpose, let F = {f ∈ K_K : ‖f‖_K ≤ c}, let tr(·) be the trace operator, let ((X_i, Y_i))_{1≤i≤n} ∈ (X × Y)^n be an iid sample, and denote by R̂_n(h) = (1/n) Σ_{i=1}^n ℓ_τ(Y_i 1 − h(X_i)) the random variable associated to the empirical risk of a hypothesis h.

Theorem 4.1 (Generalization). Let a ∈ R+ be such that sup_{y∈Y} |y| ≤ a, b ∈ Y^p and H = {f + b : f ∈ F} be the class of hypotheses. Moreover, assume that there exists κ ≥ 0 such that sup_{x∈X} tr(K(x, x)) ≤ κ. Then, with probability at least 1 − δ (for δ ∈ (0, 1]):

∀h ∈ H: R_τ(h) ≤ R̂_n(h) + 2√2 c √(pκ/n) + (2pa + c√(pκ)) √(log(1/δ)/(2n)).

Sketch of proof (full derivation in Appendix A.1). We start with a concentration inequality for scalar-valued functions [4] and use a vector-contraction property [19]. The bound on the Rademacher complexity of [24, Theorem 3.1] concludes the proof.

The uniform bound in Theorem 4.1 states that, with high probability, all the hypotheses of interest have a true risk that is less than the empirical risk up to an additive bias in O(1/√n). Let us remark that it makes use of the output dimension p. However, there exist non-uniform generalization bounds for operator-valued kernel-based hypotheses which do not depend on the output dimension [14] and are thus well suited for infinite-dimensional output spaces. Yet, those results only hold for optimal solutions ĥ of the learning problem, which we never obtain in practice.

As a second theoretical insight, Theorem 4.2 gives a bound on the quantile property, similar to the one provided in [26] for scalar-valued functions. It states that E[P(Y ≤ h_j(X) | X)] does not deviate too much from τ_j.

Theorem 4.2 (Quantile deviation). Let us consider that the assumptions of Theorem 4.1 hold. Moreover, let ε > 0 be an artificial margin, let Γ+_ε : r ∈ R ↦ proj_[0,1](1 − r/ε) and Γ−_ε : r ∈ R ↦ proj_[0,1](−r/ε) be two ramp functions, and let j ∈ N_p and δ ∈ (0, 1]. Then, with probability at least 1 − δ:

∀h ∈ H: (1/n) Σ_{i=1}^n Γ−_ε(Y_i − h_j(X_i)) − Δ ≤ E[P(Y ≤ h_j(X) | X)] ≤ (1/n) Σ_{i=1}^n Γ+_ε(Y_i − h_j(X_i)) + Δ,

where both empirical averages are approximately τ_j (by the quantile property) and Δ = (2c/ε) √(κ/n) + √(log(2/δ)/(2n)).

Sketch of proof (full derivation in Appendix A.2). The proof is similar to that of Theorem 4.1, remarking that Γ+_ε and Γ−_ε are 1/ε-Lipschitz continuous.

5 Optimization algorithm

In order to finalize the M-estimation of a non-parametric function, we need a way to jointly solve the optimization problem of interest and compute the estimator. For ridge regression in vector-valued RKHSs, representer theorems make it possible to reformulate the hypothesis f and to derive algorithms based on matrix inversion [20, 6] or on a Sylvester equation [10]. Since the optimization problem we are tackling is quite different, those methods do not apply. Yet, deriving a dual optimization problem makes it possible to hit the mark.

Quantile estimation, as presented in this paper, comes down to minimizing a regularized empirical risk defined by the pinball loss ℓ_τ. Since this loss function is non-differentiable, we introduce slack variables ξ and ξ* to get the following primal formulation, with a regularization parameter C to be tuned:

minimize_{f ∈ K_K, b ∈ R^p, ξ, ξ* ∈ (R^p)^n}  (1/2) ‖f‖²_K + C Σ_{i=1}^n (⟨τ | ξ_i⟩_{ℓ2} + ⟨1 − τ | ξ*_i⟩_{ℓ2})
s.t.  ∀i ∈ N_n: ξ_i ≽ 0, ξ*_i ≽ 0, y_i 1 − f(x_i) − b = ξ_i − ξ*_i,     (1)

where ≽ denotes a pointwise inequality. A dual formulation of Problem (1) is (see Appendix B):

minimize_{α ∈ (R^p)^n}  (1/2) Σ_{i,j=1}^n ⟨α_i | K(x_i, x_j) α_j⟩_{ℓ2} − Σ_{i=1}^n y_i ⟨α_i | 1⟩_{ℓ2}
s.t.  Σ_{i=1}^n α_i = 0_{R^p}, ∀i ∈ N_n: C(τ − 1) ≼ α_i ≼ Cτ,     (2)

where the linear constraints come from considering an intercept b. The Karush-Kuhn-Tucker (KKT) conditions of Problem (1) indicate that a minimizer f̂ of (1) can be recovered from a solution α̂ of (2) with the formula f̂ = Σ_{i=1}^n K_{x_i} α̂_i. Moreover, b̂ can also be obtained thanks to the KKT conditions. However, as we deal with a numerical approximate solution α, in practice b is computed by solving Problem (1) with f fixed, which boils down to taking b_j as the τ_j-quantile of (y_i − f_j(x_i))_{1≤i≤n}.

Problem (2) is a common quadratic program that can be solved with off-the-shelf solvers. However, since we are essentially interested in decomposable kernels K(·, ·) = k(·, ·)B, the quadratic part of the objective function is defined by the np × np matrix K ⊗ B, where ⊗ is the Kronecker product and K = (k(x_i, x_j))_{1≤i,j≤n}. 
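For a small synthetic problem, this Kronecker structure can be materialized directly (a sketch; the bandwidth and γ below are illustrative, not the values selected in the experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
x = rng.uniform(0.0, 1.5, size=(n, 1))
taus = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

# Scalar Gaussian Gram matrix K = (k(x_i, x_j)) and output matrix B.
K = np.exp(-(x - x.T) ** 2 / (2 * 0.3 ** 2))
B = np.exp(-10.0 * (taus[:, None] - taus[None, :]) ** 2)

# Quadratic part of the dual (2): the np x np matrix K ⊗ B.
Q = np.kron(K, B)
print(Q.shape, Q.nbytes / 1e6)  # (1000, 1000) and 8.0 MB in double precision
```

Already at n = 2000 and p = 5 this matrix would hold 10^8 entries (about 800 MB in double precision), which is one motivation for a coordinate-wise scheme.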
Storing this matrix explicitly is likely to be time and memory expensive. In order to improve the estimation procedure, ad hoc algorithms can be derived. For instance, regression with a decomposable kernel boils down to solving a Sylvester equation (which can be done efficiently) [10], and a vector-valued Support Vector Machine (SVM) without intercept can be learned with a coordinate descent algorithm [21]. However, these methods cannot be used in our setting, since the loss function is different and considering the intercept is necessary for the quantile property. Yet, coordinate descent could theoretically be extended to an SMO technique able to handle the linear constraints introduced by the intercept. However, SMO usually works with a single linear constraint and needs heuristics to run efficiently, which are quite difficult to find (even though an implementation exists for two linear constraints [25]).

Therefore, for the sake of efficiency, we propose to use a Primal-Dual Coordinate Descent (PDCD) technique, recently introduced in [11]. This algorithm (which is proved to converge) is able to deal with the linear constraints coming from the intercept and is thus perfectly workable for the problem at hand. Moreover, PDCD has been shown to be favorably competitive with SMO for SVMs.

Table 1: Empirical pinball loss and crossing loss ×100 (the lower, the better). Bullets (resp. circles) indicate statistically significant (resp. non-significant) differences of each competitor from JQR. The proposed method is JQR; for both losses, the columns report the mean ± standard deviation of IND., IND. (NC), MTFL and JQR on each dataset: caution, ftcollinssnow, highway, heights, sniffer, snowgeese, ufc, birthwt, crabs, GAGurine, geyser, gilgais, topo, BostonHousing, CobarOre, engel, mcycle, BigMac2003, UN3 and cpus. [The per-dataset values of the original two-column table were scrambled during extraction and are omitted here.]

PDCD is described in Algorithm 1, where, for α = (α_i)_{1≤i≤n} ∈ (R^p)^n, α^j ∈ R^n denotes its jth row vector and α^j_i its ith component, diag is the operator mapping a vector 
to a diagonal matrix, and proj_1 and proj_{[C(τ_l−1), Cτ_l]} are respectively the projectors onto the vector 1 and onto the compact set [C(τ_l − 1), Cτ_l]. PDCD uses dual variables θ ∈ (R^p)^n (which are updated during the descent) and has two sets of parameters, ν ∈ (R^p)^n and µ ∈ (R^p)^n, that verify (∀(i, l) ∈ N_n × N_p): µ^l_i < 1/((K(x_i, x_i))_{l,l} + ν^l_i). In practice, we kept the same parameters as in [11]: ν^l_i = 10 (K(x_i, x_i))_{l,l} and µ^l_i equal to 0.95 times the bound. Moreover, as is standard for coordinate descent methods, our implementation uses efficient updates for the computation of both Σ_{j=1}^n K(x_i, x_j) α_j and θ̄^l.

Algorithm 1 Primal-Dual Coordinate Descent.
  Initialize α_i, θ_i ∈ R^p (∀i ∈ N_n).
  repeat
    Choose (i, l) ∈ N_n × N_p uniformly at random.
    Set θ̄^l ← proj_1(θ^l + diag(ν^l) α^l).
    Set d^l_i ← Σ_{j=1}^n (K(x_i, x_j) α_j)_l − y_i + 2 θ̄^l_i − θ^l_i.
    Set ᾱ^l_i ← proj_{[C(τ_l−1), Cτ_l]}(α^l_i − µ^l_i d^l_i).
    Update coordinate (i, l): α^l_i ← ᾱ^l_i, θ^l_i ← θ̄^l_i, and keep the other coordinates unchanged.
  until the duality gap (1)-(2) is small enough

Table 2: CPU time (s) for training a model.

  Size | QP                | AUG. LAG.       | PDCD
  250  | 8.73 ± 0.34       | 261.11 ± 46.69  | 18.69 ± 3.54
  500  | 75.53 ± 2.98      | 865.86 ± 92.26  | 61.30 ± 7.05
  1000 | 621.60 ± 30.37    | –               | 266.50 ± 41.16
  2000 | 3416.55 ± 104.41  | –               | 958.93 ± 107.80

6 Numerical experiments

Two sets of experiments are presented, respectively aimed at assessing the ability of our methodology to predict quantiles and at comparing an implementation of Algorithm 1 with an off-the-shelf solver and an augmented Lagrangian scheme. Following the previous sections, a decomposable kernel K(x, x') = k(x, x')B is used, where B = (exp(−γ(τ_i − τ_j)²))_{1≤i,j≤p} and k(x, x') = exp(−‖x − x'‖²_{ℓ2}/(2σ²)), with σ being the 0.7-quantile of the pairwise distances of the training data {x_i}_{1≤i≤n}. The quantile levels of interest are τ = (0.1, 0.3, 0.5, 0.7, 0.9).

6.1 Quantile regression

Quantile regression is assessed with two criteria: the pinball loss (1/n) Σ_{i=1}^n ℓ_τ(y_i 1 − h(x_i)), which is the quantity minimized to build the proposed estimator, and the crossing loss [(1/n) Σ_{j=1}^{p−1} Σ_{i=1}^n max(0, h_{j+1}(x_i) − h_j(x_i))], which, assuming that τ_j > τ_{j+1}, quantifies how far h_j goes below h_{j+1}, while h_j is expected to always stay above h_{j+1}. More experiments are reported in Appendix D.1.

This study focuses on three non-parametric models based on the RKHS theory. Other linear and spline-based models have been dismissed, since Takeuchi et al. [26] have already provided a comparison of these with kernel methods. First, we considered an independent estimation of the quantile regressors (IND.), which boils down to setting B = I (this approach could be set up without vector-valued RKHSs, with scalar-valued kernels only). Second, hard non-crossing constraints on the training data have been imposed (IND. (NC)), as proposed in [26]. Third, the proposed joint estimator (JQR) uses the Gaussian matrix B presented above.

Quantile regression with multi-task feature learning (MTFL), as proposed in [27], is also included. For a fair comparison, each point is mapped with ψ(x) = (k(x, x_1), ..., k(x, x_n)) and the estimator h(x) = Wᵀψ(x) + b (W ∈ R^{n×p}) is learned jointly with the PSD matrix D ∈ R^{n×n} of the regularizer ψ(h) = tr(Wᵀ D^{−1} W). This comes down to alternating our approach (with B = I and k(·, ·) = ⟨· | D ·⟩_{ℓ2}) and the update D ← (W Wᵀ)^{1/2} / tr((W Wᵀ)^{1/2}).

To present an honorable comparison of these four methods, we did not choose datasets for the benefit of our method but considered the ones used in [26]. These 20 datasets (whose names are indicated in Table 1) come from the UCI repository and from three R packages: quantreg, alr3 and MASS. The sample sizes vary from 38 (CobarOre) to 1375 (heights) and the numbers of explanatory variables vary from 1 (5 sets) to 12 (BostonHousing). The datasets were standardized coordinate-wise to have zero mean and unit variance. Results are given in Table 1 as the mean and the standard deviation of the test losses recorded on 20 random train-test splits with ratio 0.7-0.3. The best result of each line is boldfaced and the bullets indicate the significant differences of each competitor from JQR (based on a Wilcoxon signed-rank test with significance level 0.05).

The parameter C is chosen by cross-validation (minimizing the pinball loss) inside a logarithmic grid (10^−5, 10^−4, ..., 10^5) for all methods and datasets. For our approach (JQR), the parameter γ is chosen in the same grid as C, with the extra candidates 0 and +∞. Finally, for a balanced comparison, the dual optimization problems corresponding to each approach are solved with CVXOPT [2].

Regarding the pinball loss, joint quantile regression compares favorably to independent and hard non-crossing constraint estimations for 12 vs 8 datasets (5 vs 1 significantly different). 
These results bear out the assumption concerning the relationship between conditional quantiles and the usefulness of multiple-output methods for quantile regression. Prediction is also enhanced compared to MTFL on 15 vs 5 datasets (11 vs 1 significantly different).
The crossing loss clearly shows that joint regression alleviates the crossing problem, in comparison to independent estimation and to hard non-crossing constraints (18 vs 1 favorable datasets and 9 vs 0 significantly different). Results are similar compared to MTFL (16 vs 3, 12 vs 1). Note that for IND. (NC), the crossing loss is null on the training data by construction, but not necessarily on the test data. In addition, let us remark that model selection (in particular for γ, which tunes the trade-off between heteroscedastic and homoscedastic regressors) has been performed based on the pinball loss only. It seems that, in a way, the pinball loss embraces the crossing loss as a subcriterion.

6.2 Learning algorithms

This section compares three algorithmic implementations for estimating joint quantile regressors (solving Problem (2)) with respect to their running (CPU) time. First, the off-the-shelf solver (based on an interior-point method) included in CVXOPT [2] (QP) is applied to Problem (2) turned into a standard form of linearly constrained quadratic program. Second, an augmented Lagrangian scheme (AUG. LAG.) is used in order to get rid of the linear constraints and to make a coordinate descent approach possible (detailed procedure in Appendix C). In this scheme, the inner solver is Algorithm 1 with the intercept dismissed, which boils down to the algorithm proposed in [23]. The last approach (PDCD) is Algorithm 1.
We use a synthetic dataset (the same as in Figure 1), for which X ∈ [0, 1.5]. The target Y is computed as a sine curve at 1 Hz modulated by a sine envelope at 1/3 Hz and mean 1. 
Moreover, this pattern is distorted by a centered Gaussian noise whose standard deviation decreases linearly from 1.2 at X = 0 to 0.2 at X = 1.5. The model parameters are (C, γ) = (10^2, 10^{-2}).
To compare the three implementations, we first run QP, with a relative tolerance set to 10^{-2}, and store the optimal objective value. The two other methods (AUG. LAG. and PDCD) are then launched and stopped as soon as they pass the objective value reached by QP (optimal objective values are reported in Appendix D.2). Table 2 gives the mean and standard deviation of the CPU time required by each method over 10 random datasets and several sample sizes. Some statistics are missing because AUG. LAG. ran out of time.
As expected, it appears that for a not too tight tolerance and for big datasets, the implementation of Algorithm 1 outperforms the two other competitors. Let us remark that QP is also more expensive in memory than coordinate-based algorithms like ours. Moreover, the training time may seem high in comparison to usual SVMs. However, recall first that we jointly learn p regressors; a fair comparison should thus involve an SVM applied to an np × np matrix, instead of an n × n one. In addition, quantile regression exhibits no sample sparsity, which would otherwise speed up SVM training.
Last but not least, in order to illustrate the use of our algorithm, we have run it on two 2000-point datasets from economics and medicine: the U.S. 2000 Census data, consisting of the annual salary and 9 related features of workers, and the 2014 National Center for Health Statistics data, regarding girl birth weight and 16 statistics on the parents.¹ The parameters (C, γ) have been set to (1, 100) and (0.1, 1) respectively for the Census and NCHS datasets (determined by cross-validation). 
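To make the experimental setup above concrete, the synthetic benchmark and an intercept-free inner solver can be sketched in a few lines of Python. The generator below follows our interpretation of the description in the text (sine at 1 Hz modulated by a sine envelope at 1/3 Hz with mean 1, plus heteroscedastic Gaussian noise); the solver is a simplified dual coordinate ascent for a single quantile level, in the spirit of Algorithm 1 with the intercept dismissed, and does not reproduce the full primal-dual update (the θ variables and the proj_1 step are omitted; function names are ours):

```python
import numpy as np

def make_sine_data(n, seed=0):
    """Sine at 1 Hz modulated by a sine envelope at 1/3 Hz with mean 1,
    plus centered Gaussian noise whose standard deviation decreases
    linearly from 1.2 (at x = 0) to 0.2 (at x = 1.5)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.5, n)
    envelope = 1.0 + np.sin(2.0 * np.pi * x / 3.0)
    sigma = 1.2 - (1.0 / 1.5) * x
    return x, envelope * np.sin(2.0 * np.pi * x) + sigma * rng.normal(size=n)

def quantile_cd(K, y, tau, C, n_iter=30000, seed=0):
    """Dual coordinate ascent for one quantile level (no intercept):
    maximize -0.5 a'Ka + y'a  s.t.  C(tau - 1) <= a_i <= C tau."""
    rng = np.random.default_rng(seed)
    a = np.zeros(len(y))
    for _ in range(n_iter):
        i = rng.integers(len(y))
        a_new = a[i] + (y[i] - K[i] @ a) / K[i, i]  # exact coordinate step
        a[i] = np.clip(a_new, C * (tau - 1.0), C * tau)
    return a

x, y = make_sine_data(200)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1) + 1e-8 * np.eye(len(x))
alpha = quantile_cd(K, y, tau=0.9, C=100.0)
q90 = K @ alpha  # estimated conditional 0.9-quantile at the training points
```

With a weak regularization (large C), the fitted curve should leave roughly a fraction τ of the targets below it, which is the quantile property mentioned earlier.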
Figure 2 depicts 9 estimated conditional quantiles of the salary with respect to education (17 levels from no schooling completed to doctorate degree) and of the birth weight (in grams) with respect to the mother's pre-pregnancy weight (in pounds). As expected, the Census data reveal an increasing and heteroscedastic trend, while the new-born's weight does not seem correlated to the mother's weight.

Figure 2: Estimated conditional quantiles for the Census (left, salary vs education) and the NCHS data (right, birth weight vs mother's pre-pregnancy weight).

7 Conclusion

This paper introduces a novel framework for joint quantile regression based on vector-valued RKHSs. It comes along with theoretical guarantees and an efficient learning algorithm. Moreover, this methodology, which keeps the quantile property, enjoys few curve crossings and enhanced performance compared to independent estimations and hard non-crossing constraints.
Looking forward, let us remark that this framework benefits from all the tools now associated with vector-valued RKHSs, such as manifold learning for the semi-supervised setting, multiple kernel learning for measuring feature importance, and random Fourier features for very large scale applications. Moreover, extensions of our methodology to multivariate output variables are to be investigated, given that this requires choosing among the various definitions of multivariate quantiles.

Acknowledgments

This work was supported by the industrial chair "Machine Learning for Big Data".

¹ Data are available at www.census.gov/census2000/PUMS5.html and www.nber.org/data/vital-statistics-natality-data.html.

References

[1] M.A. Alvarez, L. Rosasco, and N.D. Lawrence. Kernels for Vector-Valued Functions: a Review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012. arXiv:1106.6251.

[2] M.S. Anderson, J. Dahl, and L. Vandenberghe. 
CVXOPT: A Python package for convex optimization, version 1.1.5, 2012.

[3] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73(3):243–272, 2008.

[4] P.L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.

[5] H.D. Bondell, B.J. Reich, and H. Wang. Noncrossing quantile regression curve estimation. Biometrika, 97(4):825–838, 2010.

[6] C. Brouard, F. d'Alché-Buc, and M. Szafranski. Semi-supervised Penalized Output Kernel Regression for Link Prediction. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[7] C. Brouard, M. Szafranski, and F. d'Alché-Buc. Input Output Kernel Regression: Supervised and Semi-Supervised Structured Output Prediction with Operator-Valued Kernels. Journal of Machine Learning Research, 17(176):1–48, 2016.

[8] V. Chernozhukov, I. Fernández-Val, and A. Galichon. Quantile and Probability Curves Without Crossing. Econometrica, 78(3):1093–1125, 2010.

[9] H. Dette and S. Volgushev. Non-crossing non-parametric estimates of quantile curves. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(3):609–627, 2008.

[10] F. Dinuzzo, C.S. Ong, P. Gehler, and G. Pillonetto. Learning Output Kernels with Block Coordinate Descent. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[11] O. Fercoq and P. Bianchi. A Coordinate Descent Primal-Dual Algorithm with Large Step Size and Possibly Non Separable Functions. arXiv:1508.04625 [math], 2015.

[12] M. Hallin and M. Šiman. Elliptical multiple-output quantile regression and convex optimization. Statistics & Probability Letters, 109:232–237, 2016.

[13] X. He. Quantile Curves without Crossing. The American Statistician, 51(2):186–192, 1997.

[14] H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-valued Kernels for Learning from Functional Response Data. Journal of Machine Learning Research, 16:1–54, 2015.

[15] R. Koenker. Quantile Regression. Cambridge University Press, Cambridge, New York, 2005.

[16] R. Koenker and G. Bassett. Regression Quantiles. Econometrica, 46(1):33–50, 1978.

[17] Y. Li, Y. Liu, and J. Zhu. Quantile Regression in Reproducing Kernel Hilbert Spaces. Journal of the American Statistical Association, 102(477):255–268, 2007.

[18] Y. Liu and Y. Wu. Simultaneous multiple non-crossing quantile regression estimation using kernel constraints. Journal of Nonparametric Statistics, 23(2):415–437, 2011.

[19] A. Maurer. A vector-contraction inequality for Rademacher complexities. In Proceedings of the 27th International Conference on Algorithmic Learning Theory, 2016.

[20] C.A. Micchelli and M.A. Pontil. On Learning Vector-Valued Functions. Neural Computation, 17:177–204, 2005.

[21] H.Q. Minh, L. Bazzani, and V. Murino. A Unifying Framework in Vector-valued Reproducing Kernel Hilbert Spaces for Manifold Regularization and Co-Regularized Multi-view Learning. Journal of Machine Learning Research, 17(25):1–72, 2016.

[22] S.K. Schnabel and P.H.C. Eilers. Simultaneous estimation of quantile curves using quantile sheets. AStA Advances in Statistical Analysis, 97(1):77–87, 2012.

[23] S. Shalev-Shwartz and T. Zhang. Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization. Journal of Machine Learning Research, 14:567–599, 2013.

[24] V. Sindhwani, M.H. Quang, and A.C. Lozano. Scalable Matrix-valued Kernel Learning for High-dimensional Nonlinear Multivariate Regression and Granger Causality. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.

[25] I. Takeuchi and T. Furuhashi. Non-crossing quantile regressions by SVM. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, 2004.

[26] I. Takeuchi, Q.V. Le, T.D. Sears, and A.J. Smola. Nonparametric Quantile Estimation. Journal of Machine Learning Research, 7:1231–1264, 2006.

[27] I. Takeuchi, T. Hongo, M. Sugiyama, and S. Nakajima. Parametric Task Learning. In Advances in Neural Information Processing Systems 26, pages 1358–1366. Curran Associates, Inc., 2013.

[28] Y. Wu and Y. Liu. Stepwise multiple quantile regression estimation using non-crossing constraints. Statistics and Its Interface, 2:299–310, 2009.