{"title": "Learning Bounds for Greedy Approximation with Explicit Feature Maps from Multiple Kernels", "book": "Advances in Neural Information Processing Systems", "page_first": 4690, "page_last": 4701, "abstract": "Nonlinear kernels can be approximated using finite-dimensional feature maps for efficient risk minimization. Due to the inherent trade-off between the dimension of the (mapped) feature space and the approximation accuracy, the key problem is to identify promising (explicit) features leading to a satisfactory out-of-sample performance. In this work, we tackle this problem by efficiently choosing such features from multiple kernels in a greedy fashion. Our method sequentially selects these explicit features from a set of candidate features using a correlation metric. We establish an out-of-sample error bound capturing the trade-off between the error in terms of explicit features (approximation error) and the error due to spectral properties of the best model in the Hilbert space associated to the combined kernel (spectral error). The result verifies that when the (best) underlying data model is sparse enough, i.e., the spectral error is negligible, one can control the test error with a small number of explicit features, that can scale poly-logarithmically with data. 
Our empirical results show that given a fixed number of explicit features, the method can achieve a lower test error with a smaller time cost, compared to the state-of-the-art in data-dependent random features.", "full_text": "Learning Bounds for Greedy Approximation with Explicit Feature Maps from Multiple Kernels\n\nShahin Shahrampour\nDepartment of Industrial and Systems Engineering\nTexas A&M University\nCollege Station, TX 77843\nshahin@tamu.edu\n\nVahid Tarokh\nDepartment of Electrical and Computer Engineering\nDuke University\nDurham, NC 27708\nvahid.tarokh@duke.edu\n\nAbstract\n\nNonlinear kernels can be approximated using finite-dimensional feature maps for efficient risk minimization. Due to the inherent trade-off between the dimension of the (mapped) feature space and the approximation accuracy, the key problem is to identify promising (explicit) features leading to a satisfactory out-of-sample performance. In this work, we tackle this problem by efficiently choosing such features from multiple kernels in a greedy fashion. Our method sequentially selects these explicit features from a set of candidate features using a correlation metric. We establish an out-of-sample error bound capturing the trade-off between the error in terms of explicit features (approximation error) and the error due to spectral properties of the best model in the Hilbert space associated to the combined kernel (spectral error). The result verifies that when the (best) underlying data model is sparse enough, i.e., the spectral error is negligible, one can control the test error with a small number of explicit features, that can scale poly-logarithmically with data. 
Our empirical results show that given a fixed number of explicit features, the method can achieve a lower test error with a smaller time cost, compared to the state-of-the-art in data-dependent random features.\n\n1 Introduction\n\nKernel methods are powerful tools for describing nonlinear representations of data. Mapping the inputs to a high-dimensional feature space, kernel methods compute their inner products without recourse to the explicit form of the feature map (kernel trick). Unfortunately, calculating the kernel matrix for the training stage requires a prohibitive computational cost scaling quadratically with data. To address this shortcoming, recent years have witnessed an intense interest in the approximation of kernels using low-rank surrogates [1, 2, 3]. Such techniques can turn the kernel formulation into a linear problem, which is potentially solvable in linear time with respect to data (see e.g. [4] for linear Support Vector Machines (SVM)) and thus applicable to large data sets. In the approximation of kernels via their corresponding finite-dimensional feature maps, regardless of whether the approximation is deterministic [5] or random [3], it is critical that (i) we can compute the feature maps efficiently and (ii) we can (hopefully) represent the data in a sparse fashion. The challenge is that finding feature maps with these characteristics is generally hard.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nIt is well-known that any Mercer kernel can be represented as a (potentially infinite-dimensional) inner product of its feature maps, and thus, it can be approximated with an inner product in a lower dimension. As an example, the explicit feature map (also called Taylor feature map) of the Gaussian kernel is derived in [6] via Taylor expansion. 
In supervised learning, the key problem is to identify the explicit features\u00b9 that lead to low out-of-sample error, as there is an inherent trade-off between the computational complexity and the approximation accuracy. This turns the learning problem at hand into an optimization with sparsity constraints, which is generally NP-hard.\n\nIn this paper, our objective is to present a method for efficiently \u201cchoosing\u201d explicit features associated to a number of base positive semi-definite kernels. Motivated by the success of greedy methods in sparse approximation [7, 8], we propose a method to select promising features from multiple kernels in a greedy fashion. Our method, dubbed Multi Feature Greedy Approximation (MFGA), has access to a set of candidate features. Exploring these features sequentially, the algorithm maintains an active set and adds one explicit feature to it per step. The selection criterion is the correlation of the gradient of the empirical risk with the standard bases.\n\nWe provide non-asymptotic guarantees for MFGA, characterizing its out-of-sample performance via three types of errors, one of which (spectral error) relates to spectral properties of the best model in the Hilbert space associated to the combined kernel. Our theoretical result suggests that if the underlying data model is sparse enough, i.e., the spectral error is negligible, one can achieve a low out-of-sample error with a small number of features that can scale poly-logarithmically with data. Recent findings in [9] show that in approximating square integrable functions with smooth radial kernels, the coefficient decay is nearly exponential (small spectral error). In light of these results, our method has potential in constructing sparse representations for a rich class of functions.\n\nWe further provide empirical evidence (Section 5) that explicit feature maps can be efficient tools for sparse representation. 
In particular, compared to the state-of-the-art in data-dependent random features, MFGA requires a smaller number of features to achieve a certain test error on a number of datasets, while spending fewer computational resources. Our work is related to several lines of research in the literature, namely random and deterministic kernel approximation, sparse approximation, and multiple kernel learning. Due to the variety of these works, we postpone the detailed discussion of the related literature to Section 4, after presenting the preliminaries, formulation, and results.\n\n2 Problem Formulation\n\nPreliminaries: Throughout the paper, the vectors are all in column format. We denote by $[N]$ the set of positive integers $\\{1, \\ldots, N\\}$, by $\\langle x, x' \\rangle$ the inner product of vectors $x$ and $x'$ (in potentially infinite dimension), by $\\|\\cdot\\|_p$ the $p$-norm operator, by $L^2(\\mathcal{X})$ the set of square integrable functions on the domain $\\mathcal{X}$, and by $\\Delta_P$ the $P$-dimensional probability simplex, respectively. The support of vector $\\theta \\in \\mathbb{R}^d$ is $\\mathrm{supp}(\\theta) \\triangleq \\{i \\in [d] : \\theta_i \\neq 0\\}$. $\\lceil\\cdot\\rceil$ and $\\lfloor\\cdot\\rfloor$ denote the ceiling and floor functions, respectively. We make use of the following definitions:\n\nDefinition 1. (strong convexity) A differentiable function $g(\\cdot)$ is called $\\mu$-strongly convex on the domain $\\mathcal{X}$ with respect to $\\|\\cdot\\|_2$, if for all $x, x' \\in \\mathcal{X}$ and some $\\mu > 0$,\n\n$$g(x) \\geq g(x') + \\langle \\nabla g(x'), x - x' \\rangle + \\frac{\\mu}{2} \\|x - x'\\|_2^2.$$\n\nDefinition 2. (smoothness) A differentiable function $g(\\cdot)$ is called $\\gamma$-smooth on the domain $\\mathcal{X}$ with respect to $\\|\\cdot\\|_2$, if for all $x, x' \\in \\mathcal{X}$ and some $\\gamma > 0$,\n\n$$g(x) \\leq g(x') + \\langle \\nabla g(x'), x - x' \\rangle + \\frac{\\gamma}{2} \\|x - x'\\|_2^2.$$\n\n2.1 Supervised Learning with Explicit Feature Maps\n\nIn supervised learning, a training set $\\{(x_n, y_n)\\}_{n=1}^{N}$ in the form of input-output pairs is given to the learner. The (input-output) samples are generated independently from an unknown distribution $P_{XY}$. For $n \\in [N]$, we have $x_n \\in \\mathcal{X} \\subset \\mathbb{R}^d$. 
In the case of regression, the output variable $y_n \\in \\mathcal{Y} \\subseteq [-1, 1]$, whereas in the case of classification $y_n \\in \\{-1, 1\\}$. The ultimate objective is to find a target function $f : \\mathcal{X} \\to \\mathbb{R}$, to be employed in mapping (unseen) inputs to correct outputs. This goal may be achieved through minimizing a risk function $R(f)$, defined as\n\n$$R(f) \\triangleq \\mathbb{E}_{P_{XY}}[L(f(x), y)], \\qquad \\hat{R}(f) \\triangleq \\frac{1}{N} \\sum_{n=1}^{N} L(f(x_n), y_n), \\qquad (1)$$\n\nwhere $L(\\cdot,\\cdot)$ is a loss function depending on the task (e.g., quadratic for regression, hinge loss for SVM). Since the distribution $P_{XY}$ is unknown, in lieu of the true risk $R(f)$, we minimize the empirical risk $\\hat{R}(f)$. To solve the problem, one needs to consider a function class for $f(\\cdot)$ to minimize the empirical risk over that class. For example, consider a positive semi-definite kernel $K(\\cdot,\\cdot)$\u00b2 and consider functions of the form $f(\\cdot) = \\sum_{n=1}^{N} \\alpha_n K(x_n, \\cdot)$. Kernel methods minimize the empirical risk $\\hat{R}(f)$ over this class of functions by solving for optimal values of parameters $\\{\\alpha_n\\}_{n=1}^{N}$. While being theoretically well-justified, this approach is not practically applicable to large datasets, as $O(N^2)$ computations are required just to set up the training problem.\n\nWe now face two important questions: (i) can we reduce the computation time using a suitable approximation of the kernel? (ii) how does the choice of kernel affect the prediction of unseen data (generalization performance)? There is a large body of literature addressing these two questions. We provide an extensive discussion of the related works in Section 4, and here, we focus on presenting our method aiming to tackle the challenges above.\n\nConsider a set of base positive semi-definite kernels $\\{K_1, \\ldots, K_P\\}$, such that $K_p(x, x') = \\langle \\phi_p(x), \\phi_p(x') \\rangle$ for $p \\in [P]$. The feature map $\\phi_p : x \\mapsto \\mathcal{F}_{K_p}$ maps the points in $\\mathcal{X}$ to $\\mathcal{F}_{K_p}$, the Reproducing Kernel Hilbert Space (RKHS) associated to kernel $K_p$. Let $\\theta = [\\theta_{1,1}, \\ldots, \\theta_{1,M_1}, \\ldots, \\theta_{P,1}, \\ldots, \\theta_{P,M_P}]^\\top$ and $\\nu = [\\nu_1, \\ldots, \\nu_P]^\\top$, such that $\\sum_{p=1}^{P} M_p = M$. Define\n\n$$\\hat{\\mathcal{F}}_M \\triangleq \\left\\{ f(x) = \\sum_{p=1}^{P} \\sum_{m=1}^{M_p} \\theta_{p,m} \\sqrt{\\nu_p}\\, \\phi_{p,m}(x) : \\|\\theta\\|_2 \\leq C,\\ \\nu \\in \\Delta_P,\\ \\sum_{p=1}^{P} M_p = M \\right\\}, \\qquad (2)$$\n\nwhere $\\phi_{p,m}(\\cdot)$ is the $m$-th component of the explicit feature map associated to $K_p$. The use of explicit feature maps has proved to be beneficial in learning with significantly smaller computational burden (see e.g. [6] for approximation of Gaussian kernel in training SVM and [5] for explicit form of feature maps for several practical kernels). We use $\\nu$ for normalization purposes, and we are not concerned with learning a rule to optimize it. Instead, given a fixed value of $\\nu$, we are interested in including promising $\\phi_{p,m}(\\cdot)$\u2019s in $\\hat{\\mathcal{F}}_M$, i.e., the ones improving generalization. We can always optimize the performance over $\\nu$. It is actually well-known that Multiple Kernel Learning (MKL) can potentially improve the generalization; however, it comes at the cost of solving expensive optimization problems [10].\n\nNote that the set $\\hat{\\mathcal{F}}_M$ is a rich class of functions. It consists of $M$-term approximations of the class\n\n$$\\mathcal{F} \\triangleq \\left\\{ f(x) = \\sum_{p=1}^{P} \\sum_{m=1}^{\\infty} \\theta_{p,m} \\sqrt{\\nu_p}\\, \\phi_{p,m}(x) : \\sum_{p=1}^{P} \\sum_{m=1}^{\\infty} \\theta_{p,m}^2 \\leq C,\\ \\nu \\in \\Delta_P \\right\\}, \\qquad (3)$$\n\nusing multiple feature maps. Focusing on one kernel ($P = 1$), we know by Parseval\u2019s theorem [11] that for a function in $L^2(\\mathcal{X})$ the $i$-th coefficient must decay faster than $O(1/\\sqrt{i})$ when the bases are orthonormal. Interestingly, it has recently been proved that for approximation with smooth radial kernels, the coefficient decay is nearly exponential [9]. Therefore, for functions in $L^2(\\mathcal{X})$, most of the energy content comes from the initial coefficients, and we can hope to keep $M \\ll N$ for computationally efficient training. Such solutions also offer $O(M)$ computations in the test phase as opposed to $O(N)$ in traditional kernel methods.\n\n\u00b9 In this paper, our focus is on \u201cexplicit features\u201d, and whenever it is clear from the context, we simply use \u201cfeatures\u201d instead.\n\n\u00b2 A symmetric function $K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$ is positive semi-definite if $\\sum_{i,j=1}^{N} \\alpha_i \\alpha_j K(x_i, x_j) \\geq 0$ for $\\alpha \\in \\mathbb{R}^N$.\n\n2.2 Multi Feature Greedy Approximation\n\nWe now propose an algorithm that carefully chooses the (approximated) kernel to attain a low out-of-sample error. The algorithm has access to a set of $M_0$ candidate (explicit) features $\\phi_{p,m}(\\cdot)$, i.e., $\\sum_{p,m} 1 = M_0$. Starting with an empty set, it maintains an active set of selected features by exploring the candidate features. At each step, the algorithm calculates the correlation of the gradient (of the empirical risk) with standard bases of $\\mathbb{R}^{M_0}$. The feature $\\phi_{p,m}(\\cdot)$ whose index coincides with the largest absolute correlation is added to the active set, and next, the empirical risk is minimized over a more general model including the chosen feature. In the case of regression, if we let $\\boldsymbol{\\phi}_{p,m}^\\top = [\\phi_{p,m}(x_1) \\cdots \\phi_{p,m}(x_N)]$, the algorithm selects a $\\phi_{p,m}$ such that $\\boldsymbol{\\phi}_{p,m}$ has the largest absolute correlation with the residual (the method is known as Orthogonal Matching Pursuit (OMP) [12, 13]). The algorithm can proceed for $M$ rounds or until a termination condition is met (e.g. the risk is small enough). 
Denoting by $e_j$ the $j$-th standard basis in $\\mathbb{R}^{M_0}$, we outline the method in Algorithm 1.\n\nAlgorithm 1 Multi Feature Greedy Approximation (MFGA)\nInitialize: $I^{(1)} = \\emptyset$, $\\theta^{(0)} = 0 \\in \\mathbb{R}^{M_0}$\n1: for $t \\in [M]$ ($M < M_0$) do\n2:   Let $J^{(t)} = \\mathrm{argmax}_{j \\in [M_0]} \\left| \\left\\langle \\nabla \\hat{R}(\\theta^{(t-1)}), e_j \\right\\rangle \\right|$.\n3:   Let $I^{(t+1)} = I^{(t)} \\cup \\{J^{(t)}\\}$.\n4:   Solve $\\theta^{(t)} = \\mathrm{argmin}_{f \\in \\hat{\\mathcal{F}}_{M_0}} \\{\\hat{R}(f)\\}$ subject to $\\mathrm{supp}(\\theta) = I^{(t+1)}$.\n5: end for\nOutput: $\\hat{f}_{\\mathrm{MFGA}}(\\cdot) = \\sum_{p=1}^{P} \\sum_{m=1}^{M_p} \\theta^{(M)}_{p,m} \\sqrt{\\nu_p}\\, \\phi_{p,m}(\\cdot)$.\n\nAssuming that repetitive features are not selected, at each iteration of the algorithm, a linear regression or classification is solved over a variable of size $t$. If the time cost of the task is $C(t)$, the training cost of MFGA would be $\\sum_{t=1}^{M} C(t)$. However, in practice, we can select multiple features at each iteration to decrease the runtime of the algorithm. In the case of regression, this amounts to Generalized OMP [14]. While in general this rule might be sub-optimal, the authors of [14] have shown that the method is quite competitive with the original OMP where one element is selected per iteration.\n\n3 Theoretical Guarantees\n\nRecall that our objective is to evaluate the out-of-sample performance (generalization) of our proposed method. To begin, we quantify the richness of the class (2) in Lemma 1 using the notion of Rademacher complexity, defined below:\n\nDefinition 3. (Rademacher complexity) For a finite-sample set $\\{x_i\\}_{i=1}^{N}$, the empirical Rademacher complexity of a class $\\mathcal{F}$ is defined as\n\n$$\\hat{\\mathfrak{R}}(\\mathcal{F}) \\triangleq \\frac{1}{N} \\mathbb{E}_{P_\\sigma} \\left[ \\sup_{f \\in \\mathcal{F}} \\sum_{i=1}^{N} \\sigma_i f(x_i) \\right],$$\n\nwhere the expectation is taken over $\\{\\sigma_i\\}_{i=1}^{N}$ that are independent samples uniformly distributed on the set $\\{-1, 1\\}$. The Rademacher complexity is then $\\mathfrak{R}(\\mathcal{F}) \\triangleq \\mathbb{E}_{P_X} \\hat{\\mathfrak{R}}(\\mathcal{F})$.\n\nAssumption 1. For all $p \\in [P]$, $K_p$ is a positive semi-definite kernel and $\\sup_{x \\in \\mathcal{X}} K_p(x, x) \\leq B^2$.\n\nLemma 1. Given Assumption 1, the Rademacher complexity of the function class (2) is bounded as\n\n$$\\mathfrak{R}(\\hat{\\mathcal{F}}_M) \\leq BC \\sqrt{\\frac{3 \\lceil \\log P \\rceil}{N}}.$$\n\nThe bound above exhibits mild dependence on the number of base kernels $P$, akin to the results in [15]. To derive our theoretical guarantees, we rely on the following assumptions:\n\nAssumption 2. The loss function $L(y, y') = L(yy')$ is $\\gamma$-smooth and $G$-Lipschitz in the first argument.\n\nA notable example of a loss function satisfying the assumption above is the logistic loss $L(y, y') = \\log(1 + \\exp(-yy'))$ for binary classification.\n\nAssumption 3. The empirical risk $\\hat{R}$ is $\\mu$-strongly convex with respect to $\\theta$.\n\nIn case the empirical risk is weakly convex, strong convexity can be achieved via adding a Tikhonov regularizer. We are now ready to present our main theoretical result, which decomposes the out-of-sample error into three components:\n\nTheorem 2. Define $f^\\star(\\cdot) \\triangleq \\mathrm{argmin}_{f \\in \\mathcal{F}} R(f) = \\sum_{p=1}^{P} \\sum_{m=1}^{\\infty} \\theta^\\star_{p,m} \\sqrt{\\nu_p}\\, \\phi_{p,m}(\\cdot)$. Let Assumptions 1-3 hold and $\\theta^{(t)} \\in \\{\\theta \\in \\mathbb{R}^{M_0} : \\|\\theta\\|_2 < C\\}$ for $t \\in [M]$. Then, after $M$ iterations of Algorithm 1, the output satisfies\n\n$$R(\\hat{f}_{\\mathrm{MFGA}}) - \\min_{f \\in \\mathcal{F}} R(f) \\leq \\mathcal{E}_{\\mathrm{est}} + \\mathcal{E}_{\\mathrm{app}} + \\mathcal{E}_{\\mathrm{spec}},$$\n\nwith probability at least $1 - \\delta$ over data, where\n\n$$\\mathcal{E}_{\\mathrm{est}} = O\\left( \\frac{\\sqrt{\\lceil \\log P \\rceil} + \\sqrt{-\\log \\delta}}{\\sqrt{N}} \\right), \\qquad \\mathcal{E}_{\\mathrm{app}} = O\\left( \\exp\\left( -\\lfloor M^{1-\\varepsilon} \\rfloor \\left\\lceil \\frac{\\gamma}{\\mu} \\right\\rceil^{-1} \\right) \\right), \\qquad \\mathcal{E}_{\\mathrm{spec}} = O\\left( \\sqrt{ \\sum_{p=1}^{P} \\sum_{m = \\lfloor \\lfloor M^{\\varepsilon} \\rfloor / P \\rfloor}^{\\infty} \\theta^{\\star 2}_{p,m} } \\right),$$\n\nfor any $\\varepsilon \\in (0, 1)$.\n\nOur error bound consists of three terms: estimation error $\\mathcal{E}_{\\mathrm{est}}$, approximation error $\\mathcal{E}_{\\mathrm{app}}$, and spectral error $\\mathcal{E}_{\\mathrm{spec}}$. As the bound holds for $\\varepsilon \\in (0, 1)$, it can be optimized over the choice of $\\varepsilon$ in theory. The $O(1/\\sqrt{N})$ estimation error with respect to the sample size is quite standard in supervised learning. 
It was also shown in [15] that one cannot improve upon the $\\sqrt{\\log P}$ dependence due to the selection of multiple kernels. The approximation error shows that the decay is exponential with respect to the number of features, i.e., to get an $O(1/\\sqrt{N})$ error, we only need $O((\\log N)^{\\frac{1}{1-\\varepsilon}})$ features. The exponential decay (expected from the greedy methods [8, 16]) dramatically reduces the number of features compared to non-greedy, randomized techniques at the cost of more computation. The \u201cspectral\u201d error characterizes the spectral properties of the best model in the class (3). Since the 2-norm of the coefficient sequence is bounded, $\\mathcal{E}_{\\mathrm{spec}} \\to 0$ as $M \\to \\infty$, but the rate depends on the tail of the coefficient sequence. For example, if for all $p \\in [P]$, $K_p$ is a smooth radial kernel, the coefficient decay is nearly exponential [9].\n\nRemark 1. The quadratic loss $L(y, y') = (y - y')^2$ does not satisfy Assumption 2 in the sense that $L(y, y') \\neq L(yy')$, but with an analysis similar to that of Theorem 2, we can prove that the same error bound holds with slightly different constants (see the supplementary material).\n\nRemark 2. Using Theorem 2.8 in [17], our result can be extended to $\\ell_2$-regularized risk (see [17], Remark 2.1). In case of an $\\ell_1$-penalty, due to non-differentiability, we should work with alternatives (e.g. $\\log[\\cosh(\\cdot)]$).\n\nRemark 3. There is an interesting connection between our result and reconstruction bounds in greedy methods (e.g. [8]), where using $M$ bases, the error decay is a function of both $M$ and \u201cthe best reconstruction\u201d with $M$ bases. Similarly here, $\\mathcal{E}_{\\mathrm{app}}$ and $\\mathcal{E}_{\\mathrm{spec}}$ capture these two notions, respectively. Both errors go to zero as $M \\to \\infty$ and there is a trade-off between the two, given $\\varepsilon > 0$. An important issue is that \u201cthe best reconstruction\u201d depends on the initial candidate (explicit features) set. 
That error is small if the good explicit features are in the candidate set, and in a Fourier analogy, a signal should be \u201cband-limited\u201d to be approximated well with finite bases.\n\n4 Related Literature\n\nOur work is related to several strands of literature reviewed below:\n\nKernel approximation: Since the kernel matrix is $N \\times N$, the computational cost of kernel methods scales at least quadratically with respect to data. To overcome this problem, a large body of literature has focused on approximation of kernels using low-rank surrogates [1, 2]. Examples include the celebrated Nystr\u00f6m method [18, 19] which samples a subset of training data, approximates a surrogate kernel matrix, and then transforms the data using the approximated kernel. Shifting focus to explicit feature maps, in [20, 21], the authors have proposed low-dimensional Taylor expansions of Gaussian kernel for speeding up learning. Moreover, Vedaldi et al. [22] provide explicit feature maps for additive homogeneous kernels and quantify the approximation error using this approach. The major difference of our work with this literature is that we are concerned with selecting \u201cgood\u201d feature maps in a greedy fashion for improved generalization.\n\nRandom features: An elegant idea to improve the efficiency of kernel approximation is to use randomized features [3, 23]. In this approach, the kernel function can be approximated as\n\n$$K(x, x') = \\int_{\\Omega} \\phi(x, \\omega) \\phi(x', \\omega)\\, dP_{\\Omega}(\\omega) \\approx \\frac{1}{M} \\sum_{m=1}^{M} \\phi(x, \\omega_m) \\phi(x', \\omega_m), \\qquad (4)$$\n\nusing Monte Carlo sampling of random features $\\{\\omega_m\\}_{m=1}^{M}$ from the support set $\\Omega$. A wide variety of kernels can be written in the form above. Examples include shift-invariant kernels approximated by Monte Carlo [3] or Quasi Monte Carlo [24] sampling as well as dot product (e.g. polynomial) kernels [25]. Various methods have been developed to decrease the time and space complexity of kernel approximation (see e.g. 
Fast-food [26] and Structured Orthogonal Random Features [27]) using properties of dense Gaussian random matrices. In general, random features reduce the computational complexity of traditional kernel methods. It has been shown recently in [28] that to achieve $O(1/\\sqrt{N})$ learning error, we require only $M = O(\\sqrt{N} \\log N)$ random features. Also, the authors of [29] have shown that by $\\ell_1$-regularization (using a randomized coordinate descent approach) random features can be made more efficient. In particular, to achieve $\\epsilon$-precision on risk, $O(1/\\epsilon)$ random features would be sufficient (as opposed to $O(1/\\epsilon^2)$).\n\nAnother line of research has focused on data-dependent choice of random features. In [30, 31, 32, 33], data-dependent random features have been studied for the approximation of shift-invariant/translation-invariant kernels. On the other hand, in [34, 35, 36, 37], the focal point is on the improvement of the out-of-sample error. Sinha and Duchi [34] propose a pre-processing optimization to re-weight random features, whereas Shahrampour et al. [35] introduce a data-dependent score function to select random features. Furthermore, Bullins et al. [37] focus on approximating translation-invariant/rotation-invariant kernels and maximizing kernel alignment in the Fourier domain. They provide analytic results on classification by solving the SVM dual with a no-regret learning scheme, and an improvement is also achieved in terms of using multiple kernels. The distinction of our work from this literature is that our method is greedy rather than randomized, and our focus is on explicit feature maps. Additionally, another significant difference between our framework and that of [37] is that we work with differentiable loss functions, whereas [37] focuses on SVM. 
We will compare our work with [23, 34, 35].\n\nGreedy approximation: Over the past few decades, greedy methods such as Matching Pursuit (MP) [38, 7] and Orthogonal Matching Pursuit (OMP) [12, 13, 8] have attracted the attention of several communities due to their success in sparse approximation. In the machine learning community, Vincent et al. [39] have proposed MP and OMP with kernels as elements. In a similar spirit is the work of [40], which concentrates on sparse regression and classification models using Mercer kernels, as well as the work of [41] that considers sparse regression with multiple kernels. Though traditional MP and OMP were developed for regression, they have been further extended to logistic regression [42] and smooth loss functions [43]. Moreover, in [44], a greedy reconstruction technique has been developed for regression by empirically fitting squared error residuals. Unlike most of the prior art, our focus is on explicit feature maps rather than kernels to save significant computational costs. Our algorithm can be thought of as an extension of the fully corrective greedy approach in [17] to nonlinear features from multiple kernels where we optimize the risk over the class (2). However, in MFGA, we work with the empirical risk (rather than the true risk in [17]), which happens in practice as we do not know $P_{XY}$.\n\nMultiple kernel learning: The main objective of MKL is to identify a good kernel using a data-dependent procedure. In supervised learning, these methods may consider optimizing a convex, linear, or nonlinear combination of a number of base kernels with respect to some measure (e.g. kernel alignment) to select an ideal kernel [45, 46, 47]. It is also possible to optimize the kernel as well as the empirical risk simultaneously [48, 49]. 
On the positive side, there are many theoretical guarantees for MKL [15, 50], but unfortunately, these methods often involve computationally expensive steps, such as eigen-decomposition of the Gram matrix (see [10] for a comprehensive survey). The major difference of this work from MKL is that we consider a combination of explicit feature maps (rather than kernels), and more importantly, we do not optimize the weights (as mentioned in Section 2.1, we do not optimize the class (2) over $\\nu$) to avoid computational cost. Instead, our goal is to greedily choose promising features for a fixed value of $\\nu$.\n\nWe finally remark that data-dependent learning has been explored in the context of boosting and deep learning [51, 52, 53]. Here, our main focus is on sparse representation for shallow networks.\n\n5 Empirical Evaluations\n\nWe now evaluate our method on several datasets from the UCI Machine Learning Repository.\n\nBenchmark algorithms: We compare MFGA to the state-of-the-art in randomized kernel approximation as well as traditional kernel methods:\n1) RKS [23], with approximated Gaussian kernel: $\\phi = \\cos(x^\\top \\omega_m + b_m)$ in (4), $\\{\\omega_m\\}_{m=1}^{M}$ are sampled from a Gaussian distribution, and $\\{b_m\\}_{m=1}^{M}$ are sampled from the uniform distribution on $[0, 2\\pi)$.\n2) LKRF [34], with approximated Gaussian kernel: $\\phi = \\cos(x^\\top \\omega_m + b_m)$ in (4), but instead of $M$, a larger number $M_0$ of random features are sampled and then re-weighted by solving a kernel alignment optimization. The top $M$ random features are used in the training.\n3) EERF [35], with approximated Gaussian kernel: $\\phi = \\cos(x^\\top \\omega_m + b_m)$ in (4), and again $M_0$ random features are sampled and then re-weighted according to a score function. The top $M$ random features are used in the training. 
See Table 2a-2b for values of $M$ and $M_0$.\n4) GK, the standard Gaussian kernel.\n5) GLK, which is a sum of a Gaussian and a linear kernel.\n\nThe selection of the baselines above allows us to investigate the time-vs-accuracy tradeoff in kernel approximation. Ideally, we would like to outperform randomized approaches, while being competitive with kernel methods at a significantly lower computational cost.\n\nPractical considerations: To determine the width of the Gaussian kernel $K(x, x') = \\exp(-\\|x - x'\\|^2 / 2\\sigma^2)$, we choose the value of $\\sigma$ for each dataset to be the mean distance of the 50th $\\ell_2$ nearest neighbor. Though a rule of thumb, this choice has exhibited good generalization performance [30]. Notice that for randomized approaches, this amounts to sampling random features from $\\sigma^{-1} \\mathcal{N}(0, I_d)$. Of course, optimizing over $\\sigma$ (e.g. using cross-validation, jackknife, or their approximate surrogates [54, 55, 56]) may provide better results. For our method as well as GLK, we do not optimize over the convex combination weights (uniform weights are assigned). This is possible using MKL, but our goal is to evaluate the trade-off between approximation and accuracy, rather than proposing a rule to learn the best possible weights for the kernel. For classification, we let the number of candidate features $M_0 = 2d + 1$, consisting of the first order Taylor features of the Gaussian kernel combined with features of linear kernel, whereas for regression, we let $M_0 = \\binom{d}{2} + 2d + 1$, approximating the Gaussian kernel up to second order. In the experiments, we replace the 2-norm constraint of (2) by a quadratic regularizer [23], tune the regularization parameter over the set $\\{10^{-5}, 10^{-4}, \\ldots, 10^{5}\\}$, and report the best result for each method. As noted in Section 2.2, we select multiple features at each iteration of MFGA, which is suboptimal but decreases the runtime of the algorithm. 
We use a logistic regression model for classification to be able to compute the gradient needed in MFGA.\n\nDatasets: In Table 1, we report the number of training samples ($N_{\\mathrm{train}}$) and test samples ($N_{\\mathrm{test}}$) used for each dataset. If the training and test samples are not provided separately for a dataset, we split it randomly. We standardize the data in the following sense: we scale the features to have zero mean and unit variance and the responses in regression to be inside $[-1, 1]$.\n\nTable 1: Input dimension, number of training samples, and number of test samples are denoted by $d$, $N_{\\mathrm{train}}$, and $N_{\\mathrm{test}}$, respectively.\n\nDataset | Task | d | Ntrain | Ntest\nYear prediction | Regression | 90 | 46371 | 5163\nOnline news popularity | Regression | 58 | 26561 | 13083\nAdult | Classification | 122 | 32561 | 16281\nEpileptic seizure recognition | Classification | 178 | 8625 | 2875\n\n[Figure 1 plots omitted. Panels: Year prediction, Online news popularity, Adult, Epileptic seizure recognition. x-axis: Number of explicit feature maps (M); y-axis: Test error; curves: RKS, LKRF, EERF, This work, GK, GLK.]\n\nFigure 1: Comparison of the test error of MFGA (this work) versus the randomized features baselines RKS, LKRF, and EERF, as well as Gaussian Kernel (GK) and Gaussian+Linear Kernel (GLK).\n\nComparison with 
random features: For datasets in Table 1, we report our empirical findings in Figure 1. On \u201cYear prediction\u201d and \u201cAdult\u201d, our method consistently improves the test error compared to the state-of-the-art, i.e., MFGA requires a smaller number of features to achieve a certain test error threshold. The key is to select \u201cgood\u201d features to learn the subspace, and MFGA does so by greedily searching among the candidate features that are explicit feature maps of the linear+Gaussian kernel (up to second order Taylor expansion). As the number of features $M$ increases, all methods tend to generalize better in the regime shown in Figure 1. On \u201cOnline news popularity\u201d our method eventually achieves a smaller test error, whereas on \u201cEpileptic seizure recognition\u201d it is superior for $M \\leq 14$ while being dominated by EERF afterwards.\n\nTable 2a-2b tabulates the test error and time cost for the largest $M$ (for each dataset) in Figure 1. Since RKS is fully randomized and data-independent, it has the smallest training time. However, in order to compare the time cost of LKRF, EERF, and our work, we need additional details as the comparison may not be immediate. In the pre-processing phase, LKRF and EERF draw $M_0$ samples from the Gaussian distribution and incur $O(dNM_0)$ computational cost. Additionally, LKRF solves an optimization with $O(M_0 \\log \\epsilon^{-1})$ time to reach the $\\epsilon$-optimal solution, and EERF sorts an array of size $M_0$ with average $O(M_0 \\log M_0)$ time. On the other hand, when approximating the Gaussian kernel by a second order Taylor expansion, our method forms $O(d^2)$ features and incurs $O(Nd^2)$ computations, which is less than the other two in case $d \\ll M_0$. On all data sets except \u201cYear prediction\u201d, observe that our method spends drastically less pre-processing time to achieve a competitive result after evaluating a smaller number of candidate features (i.e., smaller $M_0$). 
To compare the training cost, if\nthe time cost of the related task (regression or classification) with M features is C(M), LKRF and\nEERF simply spend that budget. However, running K iterations of our method (with M an integer\nmultiple of K), assuming that repeated features are not selected, the training cost of MFGA would\nbe \u2211_{k=1}^{K} C(kM/K), which is more than LKRF and EERF. Furthermore, notice that the choice of\nexplicit or random feature maps would also affect the training time. For example, in regression, this\ndirectly governs the condition number of the M \u00d7 M matrix that is to be inverted. As a result, there\nexist hidden constants in C that are different across algorithms. Overall, looking at the sum of training\nand pre-processing time from Table 2a-2b, we observe that our algorithm can achieve competitive\nresults by spending less time compared to data-dependent methods. For example, on \u201cOnline news\u201d,\nwe reduce the error of EERF from 1.63% to 0.57% (\u2248 65% decrease) in a (1.22+0.92)/(10.6+0.15) \u2248 0.2 time ratio (\u2248 80%\ndecrease).\nIn general, the comparison of our method to LKRF and EERF is equivalent to the comparison of\n(data-dependent) explicit-vs-randomized feature maps. In comparison of vanilla (data-independent)\nexplicit-vs-randomized feature maps, as discussed in the experiments of [6] for the Gaussian kernel,\nneither clearly dominates the other in performance. 
Essentially, the Gaussian kernel can be (roughly) seen\nas (a countable) sum of polynomial kernels as well as (an uncountable) sum of cosine feature maps.\nOur theoretical bound, which holds for countable sums, suggests that for \u201cgood\u201d explicit feature\nmaps, the coefficients may vanish fast (small Espec), i.e., there exists a sparse representation, but of\ncourse, such a feature map is unknown before the learning process.\nComparison with kernel methods: As we observe in Table 2a-2b, our method outperforms GK and\nGLK on \u201cYear prediction\u201d and \u201cAdult\u201d. For \u201cYear prediction\u201d, our tpp + ttrain divided by the training\ntime of GK is (5.33 + 4.25)/139.5 \u2248 0.068. The same number for \u201cAdult\u201d is \u2248 0.036, exhibiting\na dramatic decrease in the runtime. Noticing that (except for \u201cEpileptic seizure recognition\u201d) we\nused a subsample of training data for kernel methods (due to computational cost), the actual runtime\ndecrease is even more remarkable (2 to 3 orders of magnitude). For \u201cOnline news popularity\u201d and\n\u201cEpileptic seizure recognition\u201d, our method is outperformed in terms of accuracy but still saves\nsignificant computational cost while being competitive with kernel methods.\n\nTable 2: Comparison of the error and time cost of our algorithm versus other baselines. M0 is the number of\ncandidate features and M is the number of features used for training and testing. tpp and ttrain, respectively,\nrepresent pre-processing and training time (seconds). For kernel methods, we use a subsample N0 of the training\nset. 
For all methods, the test error (%) is reported with standard errors in parentheses for randomized approaches.\n\n(a) Results on regression: Year prediction (left) and Online news (right)\n\nYear prediction\nMethod     M    M0    N0/N  tpp   ttrain  error (%)\nRKS        400  \u2013     \u2013     \u2013     0.63    8.27 (4e-2)\nLKRF       400  4000  \u2013     3.5   0.62    8.51 (8e-2)\nEERF       400  4000  \u2013     3.3   0.64    8.76 (6e-2)\nThis work  400  4186  \u2013     5.33  4.25    4.78\nGK         \u2013    \u2013     0.5   \u2013     139.5   5.7\nGLK        \u2013    \u2013     0.5   \u2013     150.6   5.08\n\nOnline news\nMethod     M    M0     N0/N  tpp   ttrain  error (%)\nRKS        200  \u2013      \u2013     \u2013     0.13    3.08 (5e-2)\nLKRF       200  20000  \u2013     9.8   0.14    2.07 (5e-2)\nEERF       200  20000  \u2013     10.6  0.15    1.63 (4e-2)\nThis work  200  1770   \u2013     1.22  0.92    0.57\nGK         \u2013    \u2013      0.3   \u2013     240.9   0.23\nGLK        \u2013    \u2013      0.3   \u2013     257.6   0.14\n\n(b) Results on classification: Adult (left) and Epileptic seizure recognition (right)\n\nAdult\nMethod     M    M0    N0/N  tpp   ttrain  error (%)\nRKS        100  \u2013     \u2013     \u2013     0.87    17.7 (6e-2)\nLKRF       100  2000  \u2013     1.4   0.91    16.46 (3e-2)\nEERF       100  2000  \u2013     2     1.38    16.15 (2e-2)\nThis work  100  245   \u2013     0.19  0.69    15.10\nGK         \u2013    \u2013     0.25  \u2013     24.07   15.70\nGLK        \u2013    \u2013     0.25  \u2013     77.09   15.22\n\nEpileptic seizure recognition\nMethod     M   M0    N0/N  tpp   ttrain  error (%)\nRKS        20  \u2013     \u2013     \u2013     0.06    6.21 (9e-2)\nLKRF       20  2000  \u2013     4.2   0.06    5.24 (4e-2)\nEERF       20  2000  \u2013     6.8   0.07    4.46 (4e-2)\nThis work  20  357   \u2013     0.08  0.32    4.73\nGK         \u2013   \u2013     1     \u2013     12.95   2.82\nGLK        \u2013   \u2013     1     \u2013     73.02   3.41\n\nAcknowledgements\n\nWe gratefully acknowledge the support of DARPA Grant W911NF1810134.\n\nReferences\n[1] Alex J Smola and Bernhard Sch\u00f6lkopf. Sparse greedy matrix approximation for machine learning. 
In Proceedings of the Seventeenth International Conference on Machine Learning, pages 911\u2013918, 2000.\n\n[2] Shai Fine and Katya Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2(Dec):243\u2013264, 2001.\n\n[3] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, 2007.\n\n[4] Thorsten Joachims. Training linear SVM\u2019s in linear time. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 217\u2013226, 2006.\n\n[5] Ha Quang Minh, Partha Niyogi, and Yuan Yao. Mercer\u2019s theorem, feature maps, and smoothing. In International Conference on Computational Learning Theory, pages 154\u2013168. Springer, 2006.\n\n[6] Andrew Cotter, Joseph Keshet, and Nathan Srebro. Explicit approximations of the gaussian kernel. arXiv preprint arXiv:1109.4603, 2011.\n\n[7] St\u00e9phane G Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397\u20133415, 1993.\n\n[8] Joel A Tropp. Greed is good: Algorithmic results for sparse approximation. IEEE Transactions on Information Theory, 50(10):2231\u20132242, 2004.\n\n[9] Mikhail Belkin. Approximation beats concentration? an approximation view on inference with smooth radial kernels. arXiv preprint arXiv:1801.03437, 2018.\n\n[10] Mehmet G\u00f6nen and Ethem Alpayd\u0131n. Multiple kernel learning algorithms. Journal of Machine Learning Research, 12(Jul):2211\u20132268, 2011.\n\n[11] Alan V Oppenheim, Alan S Willsky, and S Hamid Nawab. Signals & Systems. Pearson Educaci\u00f3n, 1998.\n\n[12] Yagyensh Chandra Pati, Ramin Rezaiifar, and Perinkulam Sambamurthy Krishnaprasad. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. 
In 1993 Conference Record of The Twenty-Seventh Asilomar Conference on Signals, Systems and Computers, pages 40\u201344. IEEE, 1993.\n\n[13] Geoffrey M Davis, Stephane G Mallat, and Zhifeng Zhang. Adaptive time-frequency decompositions. Optical Engineering, 33(7):2183\u20132192, 1994.\n\n[14] Jian Wang, Seokbeop Kwon, and Byonghyo Shim. Generalized orthogonal matching pursuit. IEEE Transactions on Signal Processing, 60(12):6202\u20136216, 2012.\n\n[15] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Generalization bounds for learning kernels. In International Conference on Machine Learning, pages 247\u2013254, 2010.\n\n[16] R\u00e9mi Gribonval and Pierre Vandergheynst. On the exponential convergence of matching pursuits in quasi-incoherent dictionaries. IEEE Transactions on Information Theory, 52(1):255\u2013261, 2006.\n\n[17] Shai Shalev-Shwartz, Nathan Srebro, and Tong Zhang. Trading accuracy for sparsity in optimization problems with sparsity constraints. SIAM Journal on Optimization, 20(6):2807\u20132832, 2010.\n\n[18] Christopher Williams and Matthias Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In Advances in Neural Information Processing Systems, 2001.\n\n[19] Petros Drineas and Michael W Mahoney. On the Nystr\u00f6m method for approximating a gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153\u20132175, 2005.\n\n[20] Changjiang Yang, Ramani Duraiswami, and Larry Davis. Efficient kernel machines using the improved fast gauss transform. In Proceedings of the 17th International Conference on Neural Information Processing Systems, pages 1561\u20131568, 2004.\n\n[21] Jian-Wu Xu, Puskal P Pokharel, Kyu-Hwa Jeong, and Jose C Principe. An explicit construction of a reproducing gaussian kernel Hilbert space. In IEEE International Conference on Acoustics, Speech and Signal Processing, volume 5, 2006.\n\n[22] Andrea Vedaldi and Andrew Zisserman. 
Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3):480\u2013492, 2012.\n\n[23] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313\u20131320, 2009.\n\n[24] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-monte carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, pages 485\u2013493, 2014.\n\n[25] Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In International conference on Artificial Intelligence and Statistics, pages 583\u2013591, 2012.\n\n[26] Quoc Le, Tam\u00e1s Sarl\u00f3s, and Alex Smola. Fastfood-approximating kernel expansions in loglinear time. In International Conference on Machine Learning, volume 85, 2013.\n\n[27] Felix X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Advances in Neural Information Processing Systems, pages 1975\u20131983, 2016.\n\n[28] Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. In Advances in Neural Information Processing Systems, pages 3218\u20133228, 2017.\n\n[29] Ian En-Hsu Yen, Ting-Wei Lin, Shou-De Lin, Pradeep K Ravikumar, and Inderjit S Dhillon. Sparse random feature algorithm as coordinate descent in hilbert space. In Advances in Neural Information Processing Systems, pages 2456\u20132464, 2014.\n\n[30] Felix X Yu, Sanjiv Kumar, Henry Rowley, and Shih-Fu Chang. Compact nonlinear maps and circulant extensions. arXiv preprint arXiv:1503.03893, 2015.\n\n[31] Zichao Yang, Andrew Wilson, Alex Smola, and Le Song. A la carte\u2013learning fast kernels. 
In Artificial Intelligence and Statistics, pages 1098\u20131106, 2015.\n\n[32] Junier B Oliva, Avinava Dubey, Andrew G Wilson, Barnab\u00e1s P\u00f3czos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In Artificial Intelligence and Statistics, pages 1078\u20131086, 2016.\n\n[33] Wei-Cheng Chang, Chun-Liang Li, Yiming Yang, and Barnabas Poczos. Data-driven random fourier features using stein effect. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17), 2017.\n\n[34] Aman Sinha and John C Duchi. Learning kernels with random features. In Advances In Neural Information Processing Systems, pages 1298\u20131306, 2016.\n\n[35] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh. On data-dependent random features for improved generalization in supervised learning. In AAAI Conference on Artificial Intelligence, 2018.\n\n[36] Shahin Shahrampour, Ahmad Beirami, and Vahid Tarokh. Supervised learning using data-dependent random features with application to seizure detection. In IEEE Conference on Decision and Control, 2018.\n\n[37] Brian Bullins, Cyril Zhang, and Yi Zhang. Not-so-random features. International Conference on Learning Representations, 2018.\n\n[38] Jerome H Friedman and Werner Stuetzle. Projection pursuit regression. Journal of the American Statistical Association, 76(376):817\u2013823, 1981.\n\n[39] Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165\u2013187, 2002.\n\n[40] Prasanth B Nair, Arindam Choudhury, and Andy J Keane. Some greedy learning algorithms for sparse regression and classification with mercer kernels. Journal of Machine Learning Research, 3(Dec):781\u2013801, 2002.\n\n[41] Vikas Sindhwani and Aur\u00e9lie C Lozano. Non-parametric group orthogonal matching pursuit for sparse learning with multiple kernels. 
In Advances in Neural Information Processing Systems, pages 2519\u20132527, 2011.\n\n[42] Aurelie Lozano, Grzegorz Swirszcz, and Naoki Abe. Group orthogonal matching pursuit for logistic regression. In Artificial Intelligence and Statistics, pages 452\u2013460, 2011.\n\n[43] Francesco Locatello, Rajiv Khanna, Michael Tschannen, and Martin Jaggi. A unified optimization view on generalized matching pursuit and frank-wolfe. In Artificial Intelligence and Statistics, pages 860\u2013868, 2017.\n\n[44] Dino Oglic and Thomas G\u00e4rtner. Greedy feature construction. In Advances in Neural Information Processing Systems, pages 3945\u20133953, 2016.\n\n[45] Jaz Kandola, John Shawe-Taylor, and Nello Cristianini. Optimizing kernel alignment over combinations of kernel. 2002.\n\n[46] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Learning non-linear combinations of kernels. In Advances in Neural Information Processing Systems, pages 396\u2013404, 2009.\n\n[47] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernels based on centered alignment. Journal of Machine Learning Research, 13(Mar):795\u2013828, 2012.\n\n[48] Marius Kloft, Ulf Brefeld, S\u00f6ren Sonnenburg, and Alexander Zien. Lp-norm multiple kernel learning. Journal of Machine Learning Research, 12(Mar):953\u2013997, 2011.\n\n[49] Gert RG Lanckriet, Nello Cristianini, Peter Bartlett, Laurent El Ghaoui, and Michael I Jordan. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5(Jan):27\u201372, 2004.\n\n[50] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463\u2013482, 2002.\n\n[51] Corinna Cortes, Mehryar Mohri, and Umar Syed. Deep boosting. 
In International Conference on Machine Learning, pages 1179\u20131187, 2014.\n\n[52] Corinna Cortes, Xavier Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Adanet: Adaptive structural learning of artificial neural networks. In International Conference on Machine Learning, pages 874\u2013883, 2017.\n\n[53] Furong Huang, Jordan Ash, John Langford, and Robert Schapire. Learning deep resnet blocks sequentially using boosting theory. In International Conference on Machine Learning, 2018.\n\n[54] Ahmad Beirami, Meisam Razaviyayn, Shahin Shahrampour, and Vahid Tarokh. On optimal generalizability in parametric learning. In Advances in Neural Information Processing Systems, pages 3455\u20133465, 2017.\n\n[55] Shuaiwen Wang, Wenda Zhou, Haihao Lu, Arian Maleki, and Vahab Mirrokni. Approximate leave-one-out for fast parameter tuning in high dimensions. arXiv preprint arXiv:1807.02694, 2018.\n\n[56] Ryan Giordano, Will Stephenson, Runjing Liu, Michael I Jordan, and Tamara Broderick. Return of the infinitesimal jackknife. arXiv preprint arXiv:1806.00550, 2018.\n\n[57] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.\n", "award": [], "sourceid": 2284, "authors": [{"given_name": "Shahin", "family_name": "Shahrampour", "institution": "Texas A&M University"}, {"given_name": "Vahid", "family_name": "Tarokh", "institution": "Duke University"}]}