{"title": "Greedy Feature Construction", "book": "Advances in Neural Information Processing Systems", "page_first": 3945, "page_last": 3953, "abstract": "We present an effective method for supervised feature construction. The main goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity -- large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. We achieve this goal with a greedy procedure that constructs features by empirically fitting squared error residuals. The proposed constructive procedure is consistent and can output a rich set of features. The effectiveness of the approach is evaluated empirically by fitting a linear ridge regression model in the constructed feature space and our empirical results indicate a superior performance of our approach over competing methods.", "full_text": "Greedy Feature Construction\n\nDino Oglic\u2020 \u2021\n\ndino.oglic@uni-bonn.de\n\n\u2020Institut f\u00fcr Informatik III\nUniversit\u00e4t Bonn, Germany\n\nThomas G\u00e4rtner \u2021\n\nthomas.gaertner@nottingham.ac.uk\n\n\u2021School of Computer Science\n\nThe University of Nottingham, UK\n\nAbstract\n\nWe present an effective method for supervised feature construction. The main\ngoal of the approach is to construct a feature representation for which a set of\nlinear hypotheses is of suf\ufb01cient capacity \u2013 large enough to contain a satisfactory\nsolution to the considered problem and small enough to allow good generalization\nfrom a small number of training examples. We achieve this goal with a greedy\nprocedure that constructs features by empirically \ufb01tting squared error residuals.\nThe proposed constructive procedure is consistent and can output a rich set of\nfeatures. 
The effectiveness of the approach is evaluated empirically by fitting a linear ridge regression model in the constructed feature space, and our empirical results indicate superior performance of the approach over competing methods.\n\n1 Introduction\n\nEvery supervised learning algorithm with the ability to generalize from training examples to unseen data points has some type of inductive bias [5]. The bias can be defined as a set of assumptions that together with the training data explain the predictions at unseen points [25]. In order to simplify the theoretical analysis of learning algorithms, the inductive bias is often represented by a choice of a hypothesis space (e.g., the inductive bias of linear regression models is the assumption that the relationship between inputs and outputs is linear). The fundamental limitation of learning procedures with an a priori specified hypothesis space (e.g., linear models or kernel methods with a preselected kernel) is that they can learn good concept descriptions only if the hypothesis space selected beforehand is large enough to contain a good solution to the considered problem and small enough to allow good generalization from a small number of training examples. As finding a good hypothesis space is equivalent to finding a good set of features [5], we propose an effective supervised feature construction method to tackle this problem. The main goal of the approach is to embed the data into a feature space for which a set of linear hypotheses is of sufficient capacity. The motivation for this choice of hypotheses is the desire to exploit the scalability of existing algorithms for training linear models. It is for their scalability that these models are frequently the method of choice for learning on large scale data sets (e.g., the implementation of linear SVM [13] has won the large scale learning challenge at ICML 2008 and KDD CUP 2010). 
However, as the set of linear hypotheses defined on a small or moderate number of input features is usually of low capacity, these methods often learn inaccurate descriptions of target concepts. The proposed approach surmounts this limitation: it exploits the scalability of existing algorithms for training linear models while overcoming their low capacity on input features. The latter is achieved by harnessing the information contained in the labeled training data and constructing features by empirically fitting squared error residuals.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nWe draw motivation for our approach by considering the minimization of the expected squared error using functional gradient descent (Section 2.1). In each step of the descent, the current estimator is updated by moving in the direction of the residual function. We want to mimic this behavior by constructing a feature representation incrementally so that for each step of the descent we add a feature which approximates well the residual function. In this constructive process, we select our features from a predetermined set of basis functions which can be chosen so that a high capacity set of linear hypotheses corresponds to the constructed feature space (Section 2.2). In our theoretical analysis of the approach, we provide a convergence rate for this constructive procedure (Section 2.3) and give a generalization bound for the empirical fitting of residuals (Section 2.4). The latter is needed because the feature construction is performed based on an independent and identically distributed sample of labeled examples. The approach, presented in Section 2.5, is highly flexible and allows for an extension of a feature representation without complete re-training of the model. 
As it performs similarly to gradient descent, a stopping criterion based on an accuracy threshold can be devised, and the algorithm can then be run without specifying the number of features a priori. In this way, the algorithm can terminate sooner than alternative approaches for simple hypotheses. The method is easy to implement and can be scaled to millions of instances with a parallel implementation.\nTo evaluate the effectiveness of our approach empirically, we compare it to other related approaches by training linear ridge regression models in the feature spaces constructed by these methods. Our empirical results indicate superior performance of the proposed approach over competing methods. The results are presented in Section 3 and the approaches are discussed in Section 4.\n\n2 Greedy feature construction\n\nIn this section, we present our feature construction approach. We start with an overview where we introduce the problem setting and motivate our approach by considering the minimization of the expected squared error using functional gradient descent. Following this, we define a set of features and demonstrate that the approach can construct a rich set of hypotheses. We then show that our greedy constructive procedure converges and give a generalization bound for the empirical fitting of residuals. The section concludes with a pseudo-code description of our approach.\n\n2.1 Overview\n\nWe consider a learning problem with the squared error loss function where the goal is to find a mapping from a Euclidean space to the set of reals. In these problems, it is typically assumed that a sample z = ((x_1, y_1), ..., (x_m, y_m)) of m examples is drawn independently from a Borel probability measure ρ defined on Z = X × Y, where X is a compact subset of a finite dimensional Euclidean space with the dot product ⟨·,·⟩ and Y ⊂ R. 
For every x ∈ X let ρ(y | x) be the conditional probability measure on Y and ρ_X be the marginal probability measure on X. For the sake of brevity, when it is clear from the context, we will write ρ instead of ρ_X. Let f_ρ(x) = ∫ y dρ(y | x) be the bounded target/regression function of the measure ρ. Our goal is to construct a feature representation such that there exists a linear hypothesis on this feature space that approximates well the target function. For an estimator f of the function f_ρ we measure the goodness of fit with the expected squared error in ρ, E_ρ(f) = ∫ (f(x) − y)² dρ. The empirical counterpart of the error, defined over a sample z ∈ Z^m, is denoted with E_z(f) = (1/m) Σ_{i=1}^m (f(x_i) − y_i)².\nHaving defined the problem setting, we proceed to motivate our approach by considering the minimization of the expected squared error using functional gradient descent. For that, we first review the definition of the functional gradient. For a functional F defined on a normed linear space and an element p from this space, the functional gradient ∇F(p) is the principal linear part of a change in F after it is perturbed in the direction of q, F(p + q) = F(p) + ψ(q) + ϵ‖q‖, where ψ(q) is the linear functional with ∇F(p) as its principal linear part, and ϵ → 0 as ‖q‖ → 0 [e.g., see Section 3.2 in 16]. In our case, the normed space is the Hilbert space of square integrable functions L²_ρ(X), and for the expected squared error functional on this space it holds that\n\nE_ρ(f + ϵq) − E_ρ(f) = ⟨2(f − f_ρ), ϵq⟩_{L²_ρ(X)} + O(ϵ²).\n\nHence, an algorithm for the minimization of the expected squared error using functional gradient descent on this space could be specified as\n\nf_{t+1} = ν f_t + 2(1 − ν)(f_ρ − f_t),\n\nwhere 0 ≤ ν ≤ 1 denotes the learning rate and f_t is the estimate at step t. The functional gradient direction 2(f_ρ − f_t) is the residual function at step t, and the main idea behind our approach is to iteratively refine our feature representation by extending it with a new feature that matches the current residual function. In this way, for a suitable choice of learning rate ν, the functional descent would be performed through a convex hull of features and in each step we would have an estimate of the target function f_ρ expressed as a convex combination of the constructed features.\n\n2.2 Greedy features\n\nWe now introduce a set of features parameterized with a ridge basis function and hyperparameters controlling the smoothness of these features. As each subset of features corresponds to a set of hypotheses, in this way we specify a family of possible hypothesis spaces. For a particular choice of ridge basis function we argue below that the approach outlined in the previous section can construct a highly expressive feature representation (i.e., a hypothesis space of high capacity).\nLet C(X) be the Banach space of continuous functions on X with the uniform norm. 
For a Lipschitz continuous function φ : R → R, ‖φ‖_∞ ≤ 1, and constants r, s, t > 0 let F_Θ ⊂ C(X), Θ = (φ, r, s, t), be a set of ridge-wave functions defined on the set X,\n\nF_Θ = { a φ(⟨w, x⟩ + b) | w ∈ R^d, a, b ∈ R, |a| ≤ r, ‖w‖₂ ≤ s, |b| ≤ t }.\n\nFrom this definition, it follows that for all g ∈ F_Θ it holds ‖g‖_∞ ≤ r. As a ridge-wave function g ∈ F_Θ is bounded and Lipschitz continuous, it is also square integrable in the measure ρ and g ∈ L²_ρ(X). Therefore, F_Θ is a subset of the Hilbert space of square integrable functions defined on X with respect to the probability measure ρ, i.e., F_Θ ⊂ L²_ρ(X).\nTaking φ(·) = cos(·) in the definition of F_Θ we obtain a set of cosine-wave features\n\nF_cos = { a cos(⟨w, x⟩ + b) | w ∈ R^d, a, b ∈ R, |a| ≤ r, ‖w‖₂ ≤ s, |b| ≤ t }.\n\nFor this set of features the approach outlined in Section 2.1 can construct a rich set of hypotheses. To demonstrate this we make a connection to shift-invariant reproducing kernel Hilbert spaces and show that the approach can approximate any bounded function from any shift-invariant reproducing kernel Hilbert space. This means that a set of linear hypotheses defined by cosine features can be of high capacity and our approach can overcome the problems with the low capacity of linear hypotheses defined on few input features. A proof of the following theorem is provided in Appendix B.\nTheorem 1. Let H_k be a reproducing kernel Hilbert space corresponding to a continuous shift-invariant and positive definite kernel k defined on a compact set X. Let μ be the positive and bounded spectral measure whose Fourier transform is the kernel k. 
For any probability measure ρ defined on X, it is possible to approximate any bounded function f ∈ H_k using a convex combination of n ridge-wave functions from F_cos such that the approximation error in ‖·‖_ρ decays with rate O(1/√n).\n\n2.3 Convergence\n\nFor the purpose of this paper, it suffices to show the convergence of ϵ-greedy sequences of functions (see Definition 1) in Hilbert spaces. We, however, choose to provide a stronger result which holds for ϵ-greedy sequences in uniformly smooth Banach spaces. In the remainder of the paper, co(S) and cl(S) will be used to denote the convex hull of elements from a set S and the closure of S, respectively.\nDefinition 1. Let B be a Banach space with norm ‖·‖ and let S ⊆ B. An incremental sequence is any sequence {f_n}_{n≥1} of elements of B such that f_1 ∈ S and for each n ≥ 1 there is some g ∈ S so that f_{n+1} ∈ co({f_n, g}). An incremental sequence is greedy with respect to an element f ∈ cl(co(S)) if for all n ∈ N it holds ‖f_{n+1} − f‖ = inf { ‖h − f‖ | h ∈ co({f_n, g}), g ∈ S }. Given a positive sequence of allowed slack terms {ϵ_n}_{n≥1}, an incremental sequence {f_n}_{n≥1} is called ϵ-greedy with respect to f if for all n ∈ N it holds ‖f_{n+1} − f‖ < inf { ‖h − f‖ | h ∈ co({f_n, g}), g ∈ S } + ϵ_n.\nHaving introduced the notion of an ϵ-greedy incremental sequence of functions, let us now relate it to our feature construction approach. In the outlined constructive procedure (Section 2.1), we proposed to select new features corresponding to the functional gradient at the current estimate of the target function. 
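To make the constructive procedure of Sections 2.1 and 2.2 concrete, the following minimal numpy sketch fits cosine ridge-wave features to squared-error residuals on a toy one-dimensional problem. It replaces the paper's optimization over (w, b) with a crude random search, so the candidate count, the toy data, and all names are illustrative assumptions, not the actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional regression problem standing in for samples from rho.
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + 0.1 * rng.normal(size=200)

def best_cosine_feature(X, residual, n_candidates=500):
    """Crude stand-in for a greedy step: among random cosine ridge-wave
    functions cos(<w, x> + b), keep the one whose one-dimensional
    least-squares fit to the current residual has the smallest error."""
    best, best_err = None, np.inf
    for _ in range(n_candidates):
        w = rng.normal(scale=3.0, size=X.shape[1])
        b = rng.uniform(0.0, 2.0 * np.pi)
        phi = np.cos(X @ w + b)
        a = float(phi @ residual / (phi @ phi))   # least-squares amplitude
        err = float(np.sum((a * phi - residual) ** 2))
        if err < best_err:
            best, best_err = (a, w, b), err
    return best

f = np.zeros(len(y))                 # current estimate f_t at the sample points
errors = []
for t in range(25):
    a, w, b = best_cosine_feature(X, y - f)
    f = f + a * np.cos(X @ w + b)    # move in the direction of the fitted residual
    errors.append(float(np.mean((f - y) ** 2)))

print(round(errors[0], 4), round(errors[-1], 4))
```

Since each step subtracts a least-squares projection of the residual, the empirical squared error is non-increasing; note that the sketch adds features additively, whereas Algorithm 1 below additionally re-weights the previous estimate through the coefficients c′ and c″.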
Now, if at each step of the functional gradient descent there exists a ridge-wave function from our set of features which approximates well the residual function (w.r.t. f_ρ), then this sequence of functions defines a descent through co(F_Θ) which is an ϵ-greedy incremental sequence of functions with respect to f_ρ ∈ cl(co(F_Θ)). In Section 2.1, we have also demonstrated that F_Θ is a subset of the Hilbert space L²_ρ(X), and this is by definition a Banach space.\nIn accordance with Definition 1, we now consider under what conditions an ϵ-greedy sequence of functions from this space converges to a target function f_ρ ∈ cl(co(F_Θ)). Note that this relates to Theorem 1, which confirms the strength of the result by showing that the capacity of cl(co(F_Θ)) is large. Before we show the convergence of our constructive procedure, we need to prove that an ϵ-greedy incremental sequence of functions/features can be constructed in our setting. For that, we characterize the Banach spaces in which it is always possible to construct such sequences of functions/features.\nDefinition 2. Let B be a Banach space, B* the dual space of B, and f ∈ B, f ≠ 0. A peak functional for f is a bounded linear operator F ∈ B* such that ‖F‖_{B*} = 1 and F(f) = ‖f‖_B. The Banach space B is said to be smooth if for each f ∈ B, f ≠ 0, there is a unique peak functional.\nThe existence of at least one peak functional for all f ∈ B, f ≠ 0, is guaranteed by the Hahn-Banach theorem [27]. For a Hilbert space H, for each element f ∈ H, f ≠ 0, there exists a unique peak functional F = ⟨f, ·⟩_H / ‖f‖_H. Thus, every Hilbert space is a smooth Banach space. Donahue et al. [12, Theorem 3.1] have shown that in smooth Banach spaces, and in particular in the Hilbert space L²_ρ(X), an ϵ-greedy incremental sequence of functions can always be constructed. However, not every such sequence of functions converges to the function with respect to which it was constructed. For the convergence to hold, a stronger notion of smoothness is needed.\nDefinition 3. The modulus of smoothness of a Banach space B is a function τ : R⁺₀ → R⁺₀ defined as τ(r) = (1/2) sup_{‖f‖=‖g‖=1} (‖f + rg‖ + ‖f − rg‖) − 1, where f, g ∈ B. The Banach space B is said to be uniformly smooth if τ(r) ∈ o(r) as r → 0.\nWe need to observe now that every Hilbert space is a uniformly smooth Banach space [12]. For the sake of completeness, we provide a proof of this proposition in Appendix B.\nProposition 2. For any Hilbert space the modulus of smoothness is equal to τ(r) = √(1 + r²) − 1.\nHaving shown that Hilbert spaces are uniformly smooth Banach spaces, we proceed with two results giving a convergence rate of an ϵ-greedy incremental sequence of functions. What is interesting about these results is the fact that a feature does not need to match exactly the residual function in a greedy descent step (Section 2.1); it is only required that condition (ii) from the next theorem is satisfied.\nTheorem 3 (Donahue et al. [12]). Let B be a uniformly smooth Banach space having modulus of smoothness τ(u) ≤ γu^t, with γ being a constant and t > 1. Let S be a bounded subset of B and let f ∈ cl(co(S)). Let K > 0 be chosen such that ‖f − g‖ ≤ K for all g ∈ S, and let ϵ > 0 be a fixed slack value. 
If the sequences {f_n}_{n≥1} ⊂ co(S) and {g_n}_{n≥1} ⊂ S are chosen recursively so that: (i) f_1 ∈ S, (ii) F_n(g_n − f) ≤ 2γ((K + ϵ)^t − K^t) / (n^{t−1} ‖f_n − f‖^{t−1}), and (iii) f_{n+1} = (n/(n+1)) f_n + (1/(n+1)) g_n, where F_n is the peak functional of f_n − f, then it holds\n\n‖f_n − f‖ ≤ ((2γt)^{1/t} (K + ϵ) / n^{1−1/t}) [ 1 + ((t − 1) log₂ n) / (2tn) ]^{1/t}.\n\nThe following corollary gives a convergence rate for an ϵ-greedy incremental sequence of functions constructed according to Theorem 3 with respect to f_ρ ∈ cl(co(F_Θ)). As this result (a proof is given in Appendix B) holds for all such sequences of functions, it also holds for our constructive procedure.\nCorollary 4. Let {f_n}_{n≥1} ⊂ co(F_Θ) be an ϵ-greedy incremental sequence of functions constructed according to the procedure described in Theorem 3 with respect to a function f ∈ cl(co(F_Θ)). Then, it holds\n\n‖f_n − f‖_ρ ≤ ((K + ϵ)/√n) √(2 + (log₂ n)/(2n)).\n\n2.4 Generalization bound\n\nIn step t + 1 of the empirical residual fitting, based on a sample {(x_i, y_i − f_t(x_i))}_{i=1}^m, the approach selects a ridge-wave function from F_Θ that approximates well the residual function (f_ρ − f_t). In the last section, we have specified in which cases such ridge-wave functions can be constructed and provided a convergence rate for this constructive procedure. As the convergence result is not limited to target functions from F_Θ and co(F_Θ), we give a bound on the generalization error for hypotheses from F = cl(co(F_Θ)), where the closure is taken with respect to C(X).\nBefore we give a generalization bound, we show that our hypothesis space F is a convex and compact set of functions. 
The choice of a compact hypothesis space is important because it guarantees that a minimizer of the expected squared error E_ρ and its empirical counterpart E_z exists. In particular, a continuous function attains its minimum and maximum value on a compact set and this guarantees the existence of minimizers of E_ρ and E_z. Moreover, for a hypothesis space that is both convex and compact, the minimizer of the expected squared error is unique as an element of L²_ρ(X). A simple proof of the uniqueness of such a minimizer in L²_ρ(X) and the continuity of the functionals E_ρ and E_z can be found in [9]. For the sake of completeness, we provide a proof in Appendix A as Proposition A.2. The following proposition (a proof is given in Appendix B) shows that our hypothesis space is a convex and compact subset of the metric space C(X).\n\nAlgorithm 1 GREEDYDESCENT\nInput: sample z = {(x_i, y_i)}_{i=1}^s, initial estimates at sample points {f_{0,i}}_{i=1}^s, ridge basis function φ, maximum number of descent steps p, regularization parameter λ, and precision ϵ\n1: W ← ∅\n2: for k = 1, 2, ..., p do\n3:   w_k, c_k ← argmin_{w, c=(c′,c″)} Σ_{i=1}^s (c′ f_{k−1,i} + c″ φ(wᵀx_i) − y_i)² + λ Ω(c, w)\n4:   W ← W ∪ {w_k} and f_{k,i} ← c′_k f_{k−1,i} + c″_k φ(w_kᵀx_i), i = 1, ..., s\n5:   if |E_z(f_k) − E_z(f_{k−1})| / max{E_z(f_k), E_z(f_{k−1})} < ϵ then exit for loop end if\n6: end for\n7: return W\n\nProposition 5. The hypothesis space F is a convex and compact subset of the metric space C(X). Moreover, the elements of this hypothesis space are Lipschitz continuous functions.\nHaving established that the hypothesis space is a compact set, we can now give a generalization bound for learning with this hypothesis space. 
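Line 3 of Algorithm 1 couples the choice of the spectrum w with the coefficients c = (c′, c″). As a rough illustration, the sketch below solves a simplified version of that step by plain gradient descent on the penalized squared error with Ω(c, w) = ‖c‖₂² (the choice used later in the experiments); the optimizer, step sizes, and synthetic data are assumptions, not the paper's hyperparameter-optimization scheme:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data; f_prev holds the previous estimate f_{k-1} at the sample
# points (initialised here like the bias feature of Algorithm 2).
X = rng.uniform(-1.0, 1.0, size=(100, 2))
y = X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=100)
f_prev = np.full(100, y.mean())

def objective(c, w, lam=1e-3):
    """Penalized squared error of line 3 with Omega(c, w) = ||c||_2^2."""
    pred = c[0] * f_prev + c[1] * np.cos(X @ w)
    return float(np.sum((pred - y) ** 2) + lam * (c @ c))

def greedy_step(lam=1e-3, lr=0.05, iters=2000):
    """Joint gradient descent on (c, w): a simplified stand-in for the
    solver used in the paper, which casts line 3 as a hyperparameter
    optimization problem."""
    c = np.array([1.0, 0.1])
    w = rng.normal(size=X.shape[1])
    for _ in range(iters):
        phi, dphi = np.cos(X @ w), -np.sin(X @ w)
        r = c[0] * f_prev + c[1] * phi - y           # pointwise residual
        grad_c = 2.0 * np.array([r @ f_prev, r @ phi]) + 2.0 * lam * c
        grad_w = 2.0 * c[1] * (X.T @ (r * dphi))     # chain rule through cos
        c = c - lr * grad_c / len(y)
        w = w - lr * grad_w / len(y)
    return c, w

c, w = greedy_step()
f_new = c[0] * f_prev + c[1] * np.cos(X @ w)
```

The new feature's spectrum w would be appended to W, and f_new would replace the estimates at the sample points, exactly as in lines 4 and 5 of the pseudo-code.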
The fact that the hypothesis space is compact implies that it is also a totally bounded set [27], i.e., for all ϵ > 0 there exists a finite ϵ-net of F. This, on the other hand, allows us to derive a sample complexity bound by using the ϵ-covering number of a space as a measure of its capacity [21]. The following theorem and its corollary (proofs are provided in Appendix B) give a generalization bound for learning with the hypothesis space F.\nTheorem 6. Let M > 0 be such that, for all f ∈ F, |f(x) − y| ≤ M almost surely. Then, for all ϵ > 0,\n\nP[ E_ρ(f_z) − E_ρ(f*) ≤ ϵ ] ≥ 1 − N(F, ϵ/24M, ‖·‖_∞) exp(−mϵ/288M²),\n\nwhere f_z and f* are the minimizers of E_z and E_ρ on the set F, z ∈ Z^m, and N(F, ϵ, ‖·‖_∞) denotes the ϵ-covering number of F w.r.t. C(X).\nCorollary 7. For all ϵ > 0 and all δ > 0, with probability 1 − δ, a minimizer of the empirical squared error on the hypothesis space F is (ϵ, δ)-consistent when the number of samples m ∈ Ω( r(Rs + t)L_φ / ϵ² + (1/ϵ) ln(1/δ) ). Here, R is the radius of a ball containing the set of instances X in its interior, L_φ is the Lipschitz constant of the function φ, and r, s, and t are hyperparameters of F_Θ.\n\n2.5 Algorithm\n\nAlgorithm 1 is a pseudo-code description of the outlined approach. To construct a feature space with a good set of linear hypotheses, the algorithm takes as input a set of labeled examples and an initial empirical estimate of the target function. 
A dictionary of features is specified with a ridge basis function and the smoothness of individual features is controlled with a regularization parameter. Other parameters of the algorithm are the maximum allowed number of descent steps and a precision term that defines the convergence of the descent. As outlined in Sections 2.1 and 2.3, the algorithm works by selecting a feature that matches the residual function at the current estimate of the target function. For each selected feature the algorithm also chooses a suitable learning rate and performs a functional descent step (note that we are inferring the learning rate instead of setting it to 1/(n+1) as in Theorem 3). To avoid solving these two problems separately, we have coupled both tasks into a single optimization problem (line 3): we fit a linear model to a feature representation consisting of the current empirical estimate of the target function and a ridge function parameterized with a d-dimensional vector w. The regularization term Ω is chosen to control the smoothness of the new feature and avoid over-fitting. The optimization problem over the coefficients of the linear model and the spectrum of the ridge basis function is solved by casting it as a hyperparameter optimization problem [20]. For the sake of completeness, we have provided a detailed derivation in Appendix C.\nWhile the hyperparameter optimization problem is in general non-convex, Theorem 3 indicates that a globally optimal solution is not (necessarily) required and instead specifies a weaker condition. To account for the non-convex nature of the problem and compensate for the sequential generation of features, we propose to parallelize the feature construction process by running several instances of the greedy descent simultaneously. A pseudo-code description of this parallelized approach is given in Algorithm 2. 
The algorithm takes as input the parameters required for running the greedy descent and some parameters specific to the parallelization scheme: the number of data passes and available machines/cores, the regularization parameter for the fitting of linear models in the constructed feature space, and the cut-off parameter for the elimination of redundant features.\n\nAlgorithm 2 GREEDY FEATURE CONSTRUCTION (GFC)\nInput: sample z = {(x_i, y_i)}_{i=1}^m, ridge basis function φ, number of data passes T, maximum number of greedy descent steps p, number of machines/cores M, regularization parameters λ and ν, precision ϵ, and feature cut-off threshold η\n1: W ← {0} and f_{0,k} ← (1/m) Σ_{i=1}^m y_i, k = 1, ..., m\n2: for i = 1, ..., T do\n3:   for j = 1, 2, ..., M parallel do\n4:     S_j ∼ U{1, 2, ..., m} and W ← W ∪ GREEDYDESCENT({(x_k, y_k)}_{k∈S_j}, {f_{i−1,k}}_{k∈S_j}, φ, p, λ, ϵ)\n5:   end for\n6:   a* ← argmin_a Σ_{k=1}^m ( Σ_{l=1}^{|W|} a_l φ(w_lᵀx_k) − y_k )² + ν ‖a‖₂²\n7:   W ← W \\ { w_l ∈ W | |a*_l| < η, 1 ≤ l ≤ |W| } and f_{i,k} ← Σ_{l=1}^{|W|} a*_l φ(w_lᵀx_k), k = 1, ..., m\n8: end for\n9: return (W, a*)\n\nThe whole process is started by adding a bias feature and setting the initial empirical estimates at sample points to the mean value of the outputs (line 1). Following this, the algorithm mimics stochastic gradient descent and makes a specified number of passes through the data (line 2). In the first step of each pass, the algorithm performs greedy functional descent in parallel using a pre-specified number of machines/cores M (lines 3-5). 
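The overall structure of Algorithm 2 (subsample, run greedy descent per worker, pool the spectra, fit amplitudes by closed-form ridge regression, prune by amplitude) can be sketched as follows. The greedy descent is replaced by a random-search stand-in and the parallel loop is run serially, so all names, sizes, and data below are illustrative assumptions rather than the reference implementation:

```python
import numpy as np

rng = np.random.default_rng(2)

X = rng.uniform(-1.0, 1.0, size=(400, 3))
y = np.sin(X @ np.array([2.0, -1.0, 0.5])) + 0.1 * rng.normal(size=400)

def greedy_descent(Xs, res, p=5):
    """Stand-in for Algorithm 1: p greedy steps on a chunk, each picking
    the best of 200 random cosine features for the current residual."""
    ws, r = [], res.copy()
    for _ in range(p):
        cands = rng.normal(scale=2.0, size=(200, Xs.shape[1]))
        feats = np.cos(Xs @ cands.T)                   # candidate features
        a = (feats * r[:, None]).sum(0) / (feats ** 2).sum(0)
        errs = ((a * feats - r[:, None]) ** 2).sum(0)
        j = int(np.argmin(errs))
        ws.append(cands[j])
        r = r - a[j] * feats[:, j]
    return ws

T, M, nu, eta = 3, 4, 1e-2, 1e-3          # passes, workers, ridge, cut-off
W = []                                     # pool of spectra w_l
f = np.full(len(y), y.mean())              # bias initialisation (line 1)
for _ in range(T):
    for _ in range(M):                     # 'parallel' loop, run serially here
        S = rng.choice(len(y), size=150, replace=False)
        W.extend(greedy_descent(X[S], (y - f)[S]))
    Phi = np.cos(X @ np.array(W).T)        # design matrix over pooled features
    a = np.linalg.solve(Phi.T @ Phi + nu * np.eye(len(W)), Phi.T @ y)  # line 6
    keep = np.abs(a) >= eta                # prune redundant features (line 7)
    W = [w for w, k in zip(W, keep) if k]
    f = Phi[:, keep] @ a[keep]
```

As in the paper, the closed-form ridge fit plays the role of averaging the greedy approximations constructed on different chunks of the data.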
This step is similar to the splitting step in parallelized stochastic gradient descent [32]. Greedy descent is performed on each of the machines for a maximum number of iterations p and the estimated parameter vectors are added to the set of constructed features W (line 4). After the features have been learned, the algorithm fits a linear model to obtain the amplitudes (line 6). To fit the linear model, we use least squares regression penalized with the l2-norm because it can be solved in closed form and cross-validation of the capacity parameter involves optimizing a 1-dimensional objective function [20]. Fitting of the linear model can be understood as averaging of the greedy approximations constructed on different chunks of the data. At the end of each pass, the empirical estimates at sample points are updated and redundant features are removed (line 7).\nOne important detail in the implementation of Algorithm 1 is the data splitting between the training and validation samples for the hyperparameter optimization. In particular, during the descent we are more interested in obtaining a good spectrum than a good amplitude because a linear model will be fit in Algorithm 2 over the constructed features and the amplitude values will be updated. For this reason, during the hyperparameter optimization over a k-fold splitting in Algorithm 1, we choose a single fold as the training sample and a batch of folds as the validation sample.\n\n3 Experiments\n\nIn this section, we assess the performance of our approach (see Algorithm 2) by comparing it to other feature construction approaches on synthetic and real-world data sets. We evaluate the effectiveness of the approach with the set of cosine-wave features introduced in Section 2.2. For this set of features, our approach is directly comparable to random Fourier features [26] and à la carte [31]. The implementation details of the three approaches are provided in Appendix C. 
We address here the choice of the regularization term in Algorithm 1: to control the smoothness of newly constructed features, we penalize the objective in line 3 so that solutions with a small L²_ρ(X) norm are preferred. For this choice of regularization term and cosine-wave features, we empirically observe that the optimization objective is almost exclusively penalized by the l2-norm of the coefficient vector c. Following this observation, we have simulated the greedy descent with Ω(c, w) = ‖c‖₂².\nWe now briefly describe the data sets and the experimental setting. The experiments were conducted on three groups of data sets. The first group contains four UCI data sets on which we performed parameter tuning of all three algorithms (Table 1, data sets 1-4). The second group contains the data sets with more than 5000 instances available from Luís Torgo [28]. The idea is to use this group of data sets to test the generalization properties of the considered algorithms (Table 1, data sets 5-10). The third group contains two artificial and very noisy data sets that are frequently used in regression tree benchmark tests. For each considered data set, we split the data into 10 folds; we refer to these splits as the outer cross-validation folds. In each step of the outer cross-validation, we use nine folds as the training sample and one fold as the test sample. For the purpose of the hyperparameter tuning we split the training sample into five folds; we refer to these splits as the inner cross-validation folds. We run all algorithms on identical outer cross-validation folds and construct feature representations with 100 and 500 features. 
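The evaluation protocol used below (outputs normalized to range one, root mean squared error scaled by 100, 10 outer cross-validation folds) can be sketched for a fixed random cosine feature map; the data, the feature map, and the ridge parameter are illustrative assumptions, not the tuned models from the experiments:

```python
import numpy as np

rng = np.random.default_rng(3)

def kfold_indices(n, k, rng):
    """Split n indices into k folds, as in the outer cross-validation."""
    return np.array_split(rng.permutation(n), k)

# Toy stand-in for one data set; outputs are normalised to range one.
X = rng.uniform(-1, 1, size=(500, 4))
y = np.sin(X @ np.array([1.0, -2.0, 0.5, 1.5])) + 0.1 * rng.normal(size=500)
y = (y - y.min()) / (y.max() - y.min())

W = rng.normal(size=(4, 100))               # fixed random spectra (n = 100)
b = rng.uniform(0, 2 * np.pi, size=100)
Phi = np.cos(X @ W + b)

scores = []
for fold in kfold_indices(len(y), 10, rng):
    train = np.setdiff1d(np.arange(len(y)), fold)
    A = Phi[train].T @ Phi[train] + 1e-2 * np.eye(100)
    a = np.linalg.solve(A, Phi[train].T @ y[train])   # ridge fit on 9 folds
    rmse = np.sqrt(np.mean((Phi[fold] @ a - y[fold]) ** 2))
    scores.append(100 * rmse)               # 'percentage' error w.r.t. the range

print(f"{np.mean(scores):.2f} (+/- {np.std(scores):.2f})")
```

The printed mean and standard deviation correspond to the error entries reported in Table 1; the inner five-fold splitting for hyperparameter tuning is omitted here for brevity.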
The performance of the algorithms is assessed by comparing the root mean squared error of linear ridge regression models trained in the constructed feature spaces and the average time needed for the outer cross-validation of one fold.

Table 1: To facilitate the comparison between data sets we have normalized the outputs so that their range is one. The accuracy of the algorithms is measured using the root mean squared error, multiplied by 100 to mimic percentage error (w.r.t. the range of the outputs). The mean and standard deviation of the error are computed after performing 10-fold cross-validation. The reported walltime is the average time it takes a method to cross-validate one fold. To assess whether a method performs statistically significantly better than the other on a particular data set we perform the paired Welch t-test [29] with p = 0.05. The significantly better results for the considered settings are marked in bold.

                                      ------------------ n = 100 ------------------    ------------------ n = 500 ------------------
DATASET                  m       d    GFC ERROR     WALLTIME   ALC ERROR     WALLTIME   GFC ERROR     WALLTIME   ALC ERROR     WALLTIME
parkinsons tm (total)    5875    21   2.73 (±0.19)  00:03:49   0.78 (±0.13)  00:05:19   2.20 (±0.27)  00:04:15   0.31 (±0.17)  00:27:15
ujindoorloc (latitude)   21048   527  3.17 (±0.15)  00:21:39   6.19 (±0.76)  01:21:58   3.04 (±0.19)  00:36:49   6.99 (±0.97)  02:23:15
ct-slice                 53500   380  2.93 (±0.10)  00:52:05   3.82 (±0.64)  03:31:25   2.59 (±0.10)  01:24:41   2.73 (±0.29)  06:11:12
Year Prediction MSD      515345  90   10.06 (±0.09) 01:20:12   9.94 (±0.08)  05:29:14   10.01 (±0.08) 01:30:28   9.92 (±0.07)  11:58:41
delta-ailerons           7129    5    3.82 (±0.24)  00:01:23   3.73 (±0.20)  00:05:13   3.79 (±0.25)  00:01:57   3.73 (±0.24)  00:25:14
kinematics               8192    8    5.18 (±0.09)  00:04:02   5.03 (±0.23)  00:11:28   4.65 (±0.11)  00:04:44   5.01 (±0.76)  00:38:53
cpu-activity             8192    21   2.65 (±0.12)  00:04:23   2.68 (±0.27)  00:09:24   2.60 (±0.16)  00:04:24   2.62 (±0.15)  00:25:13
bank                     8192    32   9.83 (±0.27)  00:01:39   9.84 (±0.30)  00:12:48   9.83 (±0.30)  00:02:01   9.87 (±0.42)  00:49:48
pumadyn                  8192    32   3.44 (±0.10)  00:02:24   3.24 (±0.07)  00:13:17   3.30 (±0.06)  00:02:27   3.42 (±0.15)  00:57:33
delta-elevators          9517    6    5.26 (±0.17)  00:00:57   5.28 (±0.18)  00:07:07   5.24 (±0.17)  00:01:04   5.23 (±0.18)  00:32:30
ailerons                 13750   40   4.67 (±0.18)  00:02:56   4.89 (±0.43)  00:16:34   4.51 (±0.12)  00:02:11   4.77 (±0.40)  01:05:07
pole-telecom             15000   26   7.34 (±0.29)  00:10:45   7.16 (±0.55)  00:20:34   5.55 (±0.15)  00:11:37   5.20 (±0.51)  01:39:22
elevators                16599   18   3.34 (±0.08)  00:03:16   3.37 (±0.55)  00:21:20   3.12 (±0.20)  00:04:06   3.13 (±0.24)  01:20:58
cal-housing              20640   8    11.55 (±0.24) 00:05:49   12.69 (±0.47) 00:11:14   11.17 (±0.25) 00:06:16   12.70 (±1.01) 01:01:37
breiman                  40768   10   4.01 (±0.03)  00:02:46   4.06 (±0.04)  00:13:52   4.01 (±0.03)  00:03:04   4.03 (±0.03)  01:04:16
friedman                 40768   10   3.29 (±0.09)  00:06:07   3.37 (±0.46)  00:18:43   3.16 (±0.03)  00:07:04   3.25 (±0.09)  01:39:37

An extensive summary containing the results of experiments with the random Fourier features approach (corresponding to Gaussian, Laplace, and Cauchy kernels) and different configurations of á la carte is provided in Appendix D. As the best performing configuration of á la carte on the development data sets is the one with Q = 5 components, we report in Table 1 the error and walltime for this configuration.
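As a small illustration of the error metric reported in Table 1 (root mean squared error on range-normalized outputs, scaled by 100 to mimic a percentage error), one might compute it as follows; the function name is ours.

```python
# Sketch of the error metric in Table 1: outputs are rescaled so their
# range is one, and the root mean squared error is multiplied by 100 to
# mimic a percentage error with respect to the output range.

import math

def range_normalized_rmse(y_true, y_pred):
    lo, hi = min(y_true), max(y_true)
    scale = hi - lo  # normalize so the output range is one
    mse = sum(((t - p) / scale) ** 2 for t, p in zip(y_true, y_pred))
    return 100.0 * math.sqrt(mse / len(y_true))

# e.g., predictions each off by 2.5% of the output range give an error of 2.5
y = [0.0, 10.0, 20.0]
print(range_normalized_rmse(y, [0.5, 10.5, 20.5]))  # -> 2.5
```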
From the walltime numbers we see that our approach is in both considered\nsettings \u2013 with 100 and 500 features \u2013 always faster than \u00e1 la carte. Moreover, the proposed approach\nis able to generate a feature representation with 500 features in less time than required by \u00e1 la carte\nfor a representation of 100 features. In order to compare the performance of the two methods with\nrespect to accuracy, we use the Wilcoxon signed rank test [30, 11]. As our approach with 500 features\nis on all data sets faster than \u00e1 la carte with 100 features, we \ufb01rst compare the errors obtained in these\nexperiments. For 95% con\ufb01dence, the threshold value of the Wilcoxon signed rank test with 16 data\nsets is T = 30 and from our results we get the T-value of 28. As the T-value is below the threshold,\nour algorithm can with 95% con\ufb01dence generate in less time a statistically signi\ufb01cantly better feature\nrepresentation than \u00e1 la carte. For the errors obtained in the settings where both methods have the\nsame number of features, we obtain the T-values of 60 and 42. While in the \ufb01rst case for the setting\nwith 100 features the test is inconclusive, in the second case our approach is with 80% con\ufb01dence\nstatistically signi\ufb01cantly more accurate than \u00e1 la carte. To evaluate the performance of the approaches\non individual data sets, we perform the paired Welch t-test [29] with p = 0.05. Again, the results\nindicate a good/competitive performance of our algorithm compared to \u00e1 la carte.\n\n4 Discussion\n\nIn this section, we discuss the advantages of the proposed method over the state-of-the-art baselines\nin learning fast shift-invariant kernels and other related approaches.\nFlexibility. The presented approach is a highly \ufb02exible supervised feature construction method. 
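The Wilcoxon signed-rank comparison above can be sketched as follows: paired per-data-set errors for two methods yield T = min(W+, W-), which is compared against the critical value (T = 30 for 16 data sets at the 95% confidence level), with a T-value below the threshold indicating a significant difference. The error values in the example are hypothetical, chosen only to illustrate the computation.

```python
# Compute the Wilcoxon signed-rank statistic T = min(W+, W-) for paired
# per-data-set errors.  Zero differences are dropped and tied absolute
# differences receive the average of their ranks.

def wilcoxon_T(errors_a, errors_b):
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    # rank the absolute differences, averaging ranks over ties
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and abs(diffs[ordered[j + 1]]) == abs(diffs[ordered[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[ordered[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return min(w_plus, w_minus)

# hypothetical errors on 16 data sets; method B is better on most of them
a = [2.75, 3.25, 3.0, 10.0, 3.75, 5.25, 2.75, 9.75, 3.5, 5.25, 4.75, 7.25, 3.25, 11.5, 4.0, 3.25]
b = [2.25, 3.0, 2.5, 9.75, 3.5, 5.0, 3.0, 9.5, 3.25, 5.0, 4.5, 5.5, 3.0, 11.25, 4.25, 3.0]
T = wilcoxon_T(a, b)
print(T, T < 30)  # -> 14.0 True
```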
In contrast to an approach based on random Fourier features [26], the proposed method does not require a spectral measure to be specified a priori. In the experiments (details can be found in Appendix D), we have demonstrated that the choice of spectral measure is important as, for the considered measures (corresponding to Gaussian, Laplace, and Cauchy kernels), the random Fourier features approach is outperformed on all data sets. The second competing method, á la carte, is more flexible when it comes to the choice of spectral measure and works by approximating it with a mixture of Gaussians. However, the number of components and features per component needs to be specified beforehand or cross-validated. In contrast, our approach mimics functional gradient descent and can be simulated without specifying the size of the feature representation beforehand. Instead, a stopping criterion (see, e.g., Algorithm 1) based on the successive decay of the error can be devised. As a result, the proposed approach terminates sooner than the alternative approaches for simple concepts/hypotheses. The proposed method is also easy to implement (for the sake of completeness, the hyperparameter gradients are provided in Appendix C.1) and allows us to extend the existing feature representation without complete re-training of the model. We note that the approaches based on random Fourier features are also simple to implement and can be re-trained efficiently with the increase in the number of features [10]. Á la carte, on the other hand, is less flexible in this regard – due to the number of hyperparameters and the complexity of gradients it is not straightforward to implement this method.
Scalability. The fact that our greedy descent can construct a feature in time linear in the number of instances m and dimension of the problem d makes the proposed approach highly scalable.
In particular, the complexity of the proposed parallelization scheme is dominated by the cost of fitting a linear model and the whole algorithm runs in time O(T(n³ + n²m + nmd)), where T denotes the number of data passes (i.e., linear model fits) and n the number of constructed features. To scale this scheme to problems with millions of instances, it is possible to fit linear models using parallelized stochastic gradient descent [32]. As for the choice of T, the standard setting in simulations of stochastic gradient descent is 5-10 data passes. Thus, the presented approach is quite robust and can be applied to large scale data sets. In contrast to this, the cost of performing a gradient step in the hyperparameter optimization of á la carte is O(n³ + n²m + nmd). In our empirical evaluation using an implementation with 10 random restarts, the approach needed at least 20 steps per restart to learn an accurate model. The required number of gradient steps and the cost of computing them hinder the application of á la carte to large scale data sets. In learning with random Fourier features, which also runs in time O(n³ + n²m + nmd), the main cost is the fitting of linear models – one for each pair of considered spectral and regularization parameters.
Other approaches. Beside fast kernel learning approaches, the presented method is also related to neural networks parameterized with a single hidden layer. These approaches can be seen as feature construction methods jointly optimizing over the whole feature representation. A detailed study of the approximation properties of a hypothesis space of a single layer network with the sigmoid ridge function has been provided by Barron [4]. In contrast to these approaches, we construct features incrementally by fitting residuals and we do this with a set of non-monotone ridge functions as a dictionary of features.
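To make the incremental residual-fitting scheme concrete, here is a highly simplified sketch: each round fits one cosine ridge feature cos(w·x + b) to the current squared-error residuals and then refits the linear ridge model on all features constructed so far (the stepwise readjustment discussed above). Random search stands in for the gradient-based feature fitting of Algorithm 1, and all names and constants are ours; this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

def refit_ridge(Phi, y, lam=1e-3):
    """Least squares with l2 regularization on the current feature matrix."""
    n = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(n), Phi.T @ y)

def greedy_features(X, y, n_features=10, n_candidates=200, seed=0):
    rng = np.random.default_rng(seed)
    m, d = X.shape
    feats = [np.ones(m)]  # start from a bias column
    for _ in range(n_features):
        Phi = np.column_stack(feats)
        # squared-error residuals of the current linear model
        residual = y - Phi @ refit_ridge(Phi, y)
        # fit one cosine ridge feature cos(w.x + b) to the residuals;
        # random search stands in for the gradient-based greedy step
        best, best_err = None, np.inf
        for _ in range(n_candidates):
            w = rng.normal(scale=2.0, size=d)
            b = rng.uniform(0.0, 2.0 * np.pi)
            phi = np.cos(X @ w + b)
            a = (phi @ residual) / (phi @ phi)  # 1-d least squares amplitude
            err = np.sum((residual - a * phi) ** 2)
            if err < best_err:
                best, best_err = phi, err
        feats.append(best)  # stepwise: all coefficients are refit next round
    Phi = np.column_stack(feats)
    return Phi, refit_ridge(Phi, y)
```

One round costs O(md) per candidate feature plus one linear model fit, which mirrors the cost accounting above.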
Regarding our generalization bound, we note that the past work on single layer neural networks contains similar results but in the context of monotone ridge functions [1].
As the goal of our approach is to construct a feature space for which linear hypotheses will be of sufficient capacity, the presented method is also related to linear models working with low-rank kernel representations. For instance, Fine and Scheinberg [14] investigate a training algorithm for SVMs using low-rank kernel representations. The difference between our approach and this method is that the low-rank decomposition is performed without considering the labels. Side knowledge and labels are considered by Kulis et al. [22] and Bach and Jordan [3] in their approaches to construct a low-rank kernel matrix. However, these approaches do not select features from a set of ridge functions, but find a subspace of a preselected kernel feature space with a good set of hypotheses.
From the perspective of the optimization problem considered in the greedy descent (Algorithm 1), our approach can be related to single index models (SIM) where the goal is to learn a regression function that can be represented as a single monotone ridge function [19, 18]. In contrast to these models, our approach learns target/regression functions from the closure of the convex hull of ridge functions. Typically, these target functions cannot be written as single ridge functions. Moreover, our ridge functions do not need to be monotone and are more general than the ones considered in SIM models.
In addition to these approaches and considered baseline methods, the presented feature construction approach is also related to methods optimizing expected loss functions using functional gradient descent [23]. However, while Mason et al.
[23] focus on classification problems and hypothesis spaces with finite VC dimension, we focus on the estimation of regression functions in spaces with infinite VC dimension (e.g., see Section 2.2). In contrast to that work, we provide a convergence rate for our approach. Similarly, Friedman [15] has proposed a gradient boosting machine for greedy function estimation. In their approach, the empirical functional gradient is approximated by a weak learner which is then combined with previously constructed learners following a stagewise strategy. This is different from the stepwise strategy followed in our approach, where previously constructed estimators are readjusted when new features are added. The approach in [15] is investigated mainly in the context of regression trees, but it can be adapted to feature construction. To the best of our knowledge, theoretical and empirical properties of this approach in the context of feature construction and shift-invariant reproducing kernel Hilbert spaces have not been considered so far.

Acknowledgment: We are grateful for access to the University of Nottingham High Performance Computing Facility. A part of this work was also supported by the German Science Foundation (grant number GA 1615/1-1).

References

[1] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[2] Nachman Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68, 1950.
[3] Francis R. Bach and Michael I. Jordan. Predictive low-rank decomposition for kernel methods. In Proceedings of the 22nd International Conference on Machine Learning, 2005.
[4] Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 1993.
[5] Jonathan Baxter. A model of inductive bias learning.
Journal of Artificial Intelligence Research, 12, 2000.
[6] Alain Bertinet and Thomas C. Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.
[7] Salomon Bochner. Vorlesungen über Fouriersche Integrale. Akademische Verlagsgesellschaft, 1932.
[8] Bernd Carl and Irmtraud Stephani. Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, 1990.
[9] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39, 2002.
[10] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, 2014.
[11] Janez Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 2006.
[12] Michael J. Donahue, Christian Darken, Leonid Gurvits, and Eduardo Sontag. Rates of convex approximation in non-Hilbert spaces. Constructive Approximation, 13(2), 1997.
[13] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9, 2008.
[14] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2, 2002.
[15] Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 2001.
[16] Israel M. Gelfand and Sergei V. Fomin. Calculus of Variations. Prentice-Hall Inc., 1963.
[17] Marc G. Genton. Classes of kernels for machine learning: A statistics perspective. Journal of Machine Learning Research, 2, 2002.
[18] Sham M. Kakade, Varun Kanade, Ohad Shamir, and Adam T. Kalai.
Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems 24, 2011.
[19] Adam T. Kalai and Ravi Sastry. The Isotron algorithm: High-dimensional isotonic regression. In Proceedings of the Conference on Learning Theory, 2009.
[20] Sathiya Keerthi, Vikas Sindhwani, and Olivier Chapelle. An efficient method for gradient-based adaptation of hyperparameters in SVM models. In Advances in Neural Information Processing Systems 19, 2006.
[21] Andrey N. Kolmogorov and Vladimir M. Tikhomirov. ε-entropy and ε-capacity of sets in function spaces. Uspehi Matematicheskikh Nauk, 14(2), 1959.
[22] Brian Kulis, Mátyás Sustik, and Inderjit Dhillon. Learning low-rank kernel matrices. In Proceedings of the 23rd International Conference on Machine Learning, 2006.
[23] Llew Mason, Jonathan Baxter, Peter L. Bartlett, and Marcus Frean. Functional gradient techniques for combining hypotheses. In Advances in Large Margin Classifiers. MIT Press, 1999.
[24] Sebastian Mayer, Tino Ullrich, and Jan Vybiral. Entropy and sampling numbers of classes of ridge functions. Constructive Approximation, 42(2), 2015.
[25] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, 2008.
[27] Walter Rudin. Functional Analysis. International Series in Pure and Applied Mathematics. McGraw-Hill Inc., 1991.
[28] Luís Torgo. Repository with regression data sets. http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html, accessed September 22, 2016.
[29] Bernard L. Welch. The generalization of Student's problem when several different population variances are involved. Biometrika, 34(1/2), 1947.
[30] Frank Wilcoxon. Individual comparisons by ranking methods.
Biometrics Bulletin, 1(6), 1945.
[31] Zichao Yang, Alexander J. Smola, Le Song, and Andrew G. Wilson. Á la carte—Learning fast kernels. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.
[32] Martin A. Zinkevich, Alex J. Smola, Markus Weimer, and Lihong Li. Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems 23, 2010.