{"title": "Efficient Sampling for Learning Sparse Additive Models in High Dimensions", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 522, "abstract": "We consider the problem of learning sparse additive models, i.e., functions of the form: $f(\\vecx) = \\sum_{l \\in S} \\phi_{l}(x_l)$, $\\vecx \\in \\matR^d$ from point queries of $f$. Here $S$ is an unknown subset of coordinate variables with $\\abs{S} = k \\ll d$. Assuming $\\phi_l$'s to be smooth, we propose a set of points at which to sample $f$ and an efficient randomized algorithm that recovers a \\textit{uniform approximation} to each unknown $\\phi_l$. We provide a rigorous theoretical analysis of our scheme along with sample complexity bounds. Our algorithm utilizes recent results from compressive sensing theory along with a novel convex quadratic program for recovering robust uniform approximations to univariate functions, from point queries corrupted with arbitrary bounded noise. Lastly we theoretically analyze the impact of noise -- either arbitrary but bounded, or stochastic -- on the performance of our algorithm.", "full_text": "Ef\ufb01cient Sampling for Learning Sparse Additive\n\nModels in High Dimensions\n\nHemant Tyagi\nETH Z\u00a8urich\n\nhtyagi@inf.ethz.ch\n\nAndreas Krause\n\nETH Z\u00a8urich\n\nkrausea@ethz.ch\n\nBernd G\u00a8artner\n\nETH Z\u00a8urich\n\ngaertner@inf.ethz.ch\n\nAbstract\n\nform: f(x) = (cid:80)\n\nWe consider the problem of learning sparse additive models, i.e., functions of the\nl\u2208S \u03c6l(xl), x \u2208 Rd from point queries of f. Here S is an un-\nknown subset of coordinate variables with |S| = k (cid:28) d. Assuming \u03c6l\u2019s to be\nsmooth, we propose a set of points at which to sample f and an ef\ufb01cient random-\nized algorithm that recovers a uniform approximation to each unknown \u03c6l. We\nprovide a rigorous theoretical analysis of our scheme along with sample complex-\nity bounds. 
Our algorithm utilizes recent results from compressive sensing theory along with a novel convex quadratic program for recovering robust uniform approximations to univariate functions from point queries corrupted with arbitrary bounded noise. Lastly, we theoretically analyze the impact of noise – either arbitrary but bounded, or stochastic – on the performance of our algorithm.

1 Introduction

Several problems in science and engineering require estimating a real-valued, non-linear (and often non-convex) function f defined on a compact subset of R^d in high dimensions. This challenge arises, e.g., when characterizing complex engineered or natural (e.g., biological) systems [1, 2, 3]. The numerical solution of such problems involves learning the unknown f from point evaluations (x_i, f(x_i))_{i=1}^n. Unfortunately, if the only assumption on f is mere smoothness, then the problem is in general intractable. For instance, it is well known [4] that if f is C^s-smooth, then n = Ω((1/δ)^{d/s}) samples are needed for uniformly approximating f within error 0 < δ < 1. This exponential dependence on d is referred to as the curse of dimensionality.

Fortunately, many functions arising in practice are much better behaved, in the sense that they are intrinsically low-dimensional, i.e., depend on only a small subset of the d variables. Estimating such functions has received much attention and has led to a considerable amount of theory, along with algorithms that do not suffer from the curse of dimensionality (cf. [5, 6, 7, 8]). Here we focus on the problem of learning one such class of functions, assuming f possesses the sparse additive structure:

f(x_1, x_2, ..., x_d) = Σ_{l∈S} φ_l(x_l);  S ⊂ {1, ..., d}, |S| = k ≪ d.   (1.1)

Functions of the form (1.1) are referred to as sparse additive models (SPAMs); they generalize sparse linear models, to which they reduce if each φ_l is linear. 
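For concreteness, a toy instance of (1.1) (with hypothetical component functions chosen by us, not taken from the paper) can be written in a few lines; note that f depends only on the k = 3 active coordinates, and each component is centered on [−1, 1] as the paper later assumes:

```python
import numpy as np

# A hypothetical SPAM: f depends only on the active set S = {3, 17, 42}.
# Each component integrates to zero over [-1, 1] (matching Assumption 2 later).
phis = {
    3:  lambda t: np.sin(np.pi * t),      # odd function, hence centered
    17: lambda t: t**3,                   # odd function, hence centered
    42: lambda t: t**2 - 1.0 / 3.0,       # ∫_{-1}^{1} (t^2 - 1/3) dt = 0
}

def f(x):
    """Sparse additive model f(x) = sum_{l in S} phi_l(x_l)."""
    return sum(phi(x[l]) for l, phi in phis.items())

d = 100
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=d)

# Perturbing an inactive coordinate leaves f unchanged ...
y = x.copy(); y[5] += 0.3
assert np.isclose(f(x), f(y))

# ... while perturbing an active coordinate changes f.
z = x.copy(); z[17] += 0.3
print(f(x), f(z))
```

This is exactly the structure the paper's sampling scheme exploits: only k coordinates carry information, so the gradient of f is k-sparse everywhere.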
The problem of estimating SPAMs has received considerable attention in the regression setting (cf. [9, 10, 11] and references within), where (x_i, f(x_i))_{i=1}^n are typically i.i.d. samples from some unknown probability measure P. This setting, however, does not consider the possibility of sampling f at specifically chosen points, tailored to the additive structure of f. In this paper, we propose a strategy for querying f, together with an efficient recovery algorithm, with much stronger guarantees than known in the regression setting. In particular, we provide the first results guaranteeing uniformly accurate recovery of each individual component φ_l of the SPAM. This can be crucial in applications where the goal is not merely to approximate f, but to gain insight into its structure.

Related work. SPAMs have been studied extensively in the regression setting, with observations being corrupted with random noise. [9] proposed the COSSO method, which is an extension of the Lasso to the reproducing kernel Hilbert space (RKHS) setting. A similar extension was considered in [10]. In [12], the authors propose a least squares method regularized with smoothness, with each φ_l lying in an RKHS, and derive error rates for estimating f in the L2(P) norm¹. [13, 14] propose methods based on least squares loss regularized with sparsity and smoothness constraints. [13] proves consistency of its method in terms of mean squared risk, while [14] derives error rates for estimating f in the empirical L2(Pn) norm¹. [11] considers the setting where each φ_l lies in an RKHS. They propose a convex program for estimating f and derive error rates for the same in the L2(P) and L2(Pn) norms. Furthermore, they establish the minimax optimality of their method for the L2(P) norm. For instance, they derive an error rate of O((k log d)/n + k n^{−2s/(2s+1)}) in the L2(P) norm for estimating C^s smooth SPAMs. An estimator similar to the one in [11] was also considered by [15]. They derive similar error rates as in [11], albeit under stronger assumptions on f. [16] proposes a method based on the adaptive group Lasso which asymptotically recovers S as n increases. They also derive L2(P) error rates for the individual components of the SPAM.

There is further related work in approximation theory, where it is assumed that f can be sampled at a desired set of points. [5] considers a setting more general than (1.1), with f simply assumed to depend on an unknown subset of k ≪ d coordinate variables. They construct a set of sampling points of size O(c^k log d) for some constant c > 0, and present an algorithm that recovers a uniform approximation² to f. This model is generalized in [8], with f assumed to be of the form f(x) = g(Ax) for unknown A ∈ R^{k×d}; each row of A is assumed to be sparse. [7] generalizes this by removing the sparsity assumption on A. While the methods of [5, 8, 7] could be employed for learning SPAMs, their sampling sets would be of size exponential in k, and hence sub-optimal. Furthermore, while these methods derive uniform approximations to f, they are unable to recover the individual φ_l's.

Our contributions. Our contributions are threefold:

1. We propose an efficient algorithm that queries f at O(k log d) locations and recovers: (i) the active set S exactly, along with (ii) a uniform approximation to each φ_l, l ∈ S. In contrast, the existing error bounds in the statistics community [11, 12, 15] are in the much weaker L2(P) sense. Furthermore, the existing results in approximation theory provide explicit error bounds for recovering f and not the individual φ_l's.

2. An important component of our algorithm is a novel convex quadratic program for estimating an unknown univariate function from point queries corrupted with arbitrary bounded noise. 
We derive rigorous error bounds for this program in the L∞ norm that demonstrate the robustness of the solution returned. We also explicitly demonstrate the effect of noise, sampling density and the curvature of the function on the solution returned.

3. We theoretically analyze the impact of additive noise in the point queries on the performance of our algorithm, for two noise models: arbitrary bounded noise and stochastic (i.i.d.) noise. In particular, for additive Gaussian noise, we show that our algorithm recovers a robust uniform approximation to each φ_l with at most O(k³(log d)²) point queries of f. We also provide simulation results that validate our theoretical findings.

2 Problem statement

For any function g we denote its pth derivative by g^(p) when p is large; else we use the appropriate number of prime symbols. ∥g∥_{L∞[a,b]} denotes the L∞ norm of g in [a, b]. For a vector x we denote its ℓ_q norm, 1 ≤ q ≤ ∞, by ∥x∥_q. We consider approximating functions f : R^d → R from point queries. In particular, for some unknown active set S ⊂ {1, ..., d} with |S| = k ≪ d, we assume f to be of the additive form f(x_1, ..., x_d) = Σ_{l∈S} φ_l(x_l). Here the φ_l : R → R are the individual univariate components of the model. Our goal is to query f at suitably chosen points in its domain in order to recover an estimate φ_est,l of φ_l in a compact subset Ω ⊂ R for each l ∈ S. We measure the approximation error in the L∞ norm. For simplicity, we assume that Ω = [−1, 1], meaning that we guarantee an upper bound on ∥φ_est,l − φ_l∥_{L∞[−1,1]}, l ∈ S.

¹ ∥f∥²_{L2(P)} = ∫ |f(x)|² dP(x) and ∥f∥²_{L2(Pn)} = (1/n) Σ_i f²(x_i).
² This means in the L∞ norm.

Furthermore, we assume that we can query f from a slight enlargement [−(1 + r), (1 + r)]^d of [−1, 1]^d for³ some small r > 0. As will be seen later, the enlargement r can be made arbitrarily close to 0. We now list our main assumptions for this problem.

1. Each φ_l is assumed to be sufficiently smooth. In particular, we assume that φ_l ∈ C⁵[−(1 + r), (1 + r)], where C⁵ denotes five times continuous differentiability. Since [−(1 + r), (1 + r)] is compact, this implies that there exist constants B_1, ..., B_5 ≥ 0 so that

max_{l∈S} ∥φ_l^(p)∥_{L∞[−(1+r),(1+r)]} ≤ B_p;  p = 1, ..., 5.   (2.1)

2. We assume each φ_l to be centered in the interval [−1, 1], i.e., ∫_{−1}^{1} φ_l(t) dt = 0, l ∈ S. Such a condition is necessary for unique identification of the φ_l. Otherwise one could simply replace each φ_l with φ_l + a_l for a_l ∈ R with Σ_l a_l = 0, and unique identification would not be possible.

3. We require that for each φ_l there exists a connected interval I_l ⊆ [−1, 1] with μ(I_l) ≥ δ so that |φ'_l(x)| ≥ D for all x ∈ I_l. Here μ(I) denotes the Lebesgue measure of I, and δ, D > 0 are constants assumed to be known to the algorithm. This assumption essentially enables us to detect the active set S. If, say, φ'_l were zero or close to zero throughout [−1, 1] for some l ∈ S, then due to Assumption 2 this would imply that φ_l itself is zero or close to zero.

We remark that it suffices to use estimates for our problem parameters instead of exact values. In particular, we can use upper bounds for k and B_p, p = 1, ..., 5, and lower bounds for the parameters D and δ. 
Our methods and results stated in the coming sections will remain unchanged.

3 Our sampling scheme and algorithm

In this section, we first motivate and describe our sampling scheme for querying f. We then outline our algorithm and explain the intuition behind its different stages. Consider the Taylor expansion of f at any point ξ ∈ R^d along the direction v ∈ R^d with step size ε > 0. For any C^p smooth f, p ≥ 2, we obtain for ζ = ξ + θv, for some 0 < θ < ε, the following expression:

(f(ξ + εv) − f(ξ))/ε = ⟨v, ∇f(ξ)⟩ + (1/2) ε v^T ∇²f(ζ) v.   (3.1)

Note that (3.1) can be interpreted as taking a noisy linear measurement of ∇f(ξ), with measurement vector v and the noise being the Taylor remainder term. Importantly, due to the sparse additive form of f, we have φ_l ≡ 0 for l ∉ S, implying that ∇f(ξ) = [φ'_1(ξ_1) φ'_2(ξ_2) ... φ'_d(ξ_d)] is at most k-sparse. Hence (3.1) actually represents a noisy linear measurement of the k-sparse vector ∇f(ξ). For any fixed ξ, we know from compressive sensing (CS) [17, 18] that ∇f(ξ) can be recovered (with high probability) using few random linear measurements⁴.

³ In case f : [a, b]^d → R, we can define g : [−1, 1]^d → R where g(x) = f((b−a)x/2 + (b+a)/2) = Σ_{l∈S} φ̃_l(x_l), with φ̃_l(x_l) = φ_l((b−a)x_l/2 + (b+a)/2). We then sample g from within [−(1 + r), (1 + r)]^d for some small r > 0 by querying f, and estimate φ̃_l in [−1, 1], which in turn gives an estimate of φ_l in [a, b].
⁴ Estimating sparse gradients via compressive sensing has been considered previously by Fornasier et al. [8], albeit for a substantially different function class than ours. Hence their sampling scheme differs considerably from ours, and is not tailored for learning SPAMs.

This motivates the following sets of points using which we query f, as illustrated in Figure 1. For integers mx, mv > 0 we define

X := { ξ_i = (i/mx)(1, 1, ..., 1)^T ∈ R^d : i = −mx, ..., mx },   (3.2)
V := { v_j ∈ R^d : v_{j,l} = ±1/√mv w.p. 1/2 each; j = 1, ..., mv and l = 1, ..., d }.   (3.3)

Using (3.1) at each ξ_i ∈ X and v_j ∈ V for i = −mx, ..., mx and j = 1, ..., mv leads to:

y_{i,j} := (f(ξ_i + εv_j) − f(ξ_i))/ε = ⟨v_j, ∇f(ξ_i)⟩ + (1/2) ε v_j^T ∇²f(ζ_{i,j}) v_j,   (3.4)

where x_i := ∇f(ξ_i) = [φ'_1(i/mx) φ'_2(i/mx) ... φ'_d(i/mx)] is k-sparse, and n_{i,j} := (1/2) ε v_j^T ∇²f(ζ_{i,j}) v_j. Let us denote V = [v_1 ... v_{mv}]^T, y_i = [y_{i,1} ... y_{i,mv}] and n_i = [n_{i,1} ... n_{i,mv}]. Then for each i, we can write (3.4) in the succinct form:

y_i = V x_i + n_i.   (3.5)

Here V ∈ R^{mv×d} represents the linear measurement matrix, y_i ∈ R^{mv} denotes the measurement vector at ξ_i, and n_i represents "noise" on account of the non-linearity of f. Note that we query f at |X|(|V| + 1) = (2mx + 1)(mv + 1) many points. Given y_i and V, we can recover a robust approximation to x_i via ℓ1 minimization [17, 18]. On account of the structure of ∇f, we thus recover noisy estimates of φ'_l at equispaced points along the interval [−1, 1]. 
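The CS stage can be sketched numerically. The following toy example (hypothetical component functions and illustrative sizes chosen by us; the constants are not the paper's) forms the finite-difference measurements of (3.4) at a single sampling point and recovers the sparse gradient by ℓ1 minimization, written here as the standard basis-pursuit linear program:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
d, m_v, eps = 30, 16, 1e-3   # illustrative dimensions and step size

# Toy SPAM with active set S = {4, 11} (hypothetical components).
f = lambda x: np.sin(x[4]) + x[11] ** 2

def l1_min(V, y):
    """Basis pursuit: min ||z||_1 s.t. V z = y, as an LP with z = u - w, u, w >= 0."""
    m, n = V.shape
    c = np.ones(2 * n)
    A_eq = np.hstack([V, -V])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n), method="highs")
    return res.x[:n] - res.x[n:]

# One sampling point xi = (i/mx)(1,...,1); here i/mx = 0.5 for illustration.
xi = 0.5 * np.ones(d)
V = rng.choice([-1.0, 1.0], size=(m_v, d)) / np.sqrt(m_v)   # as in (3.3)
y = np.array([(f(xi + eps * v) - f(xi)) / eps for v in V])  # as in (3.4)

x_hat = l1_min(V, y)   # estimate of the 2-sparse gradient ∇f(xi)
# True nonzero gradient entries: cos(0.5) at index 4 and 2*0.5 = 1.0 at index 11.
print(np.round(x_hat[[4, 11]], 4))
```

With m_v = O(k log d) measurements the two large entries stand out by orders of magnitude against the Taylor-remainder noise, which is what the thresholding step of the algorithm relies on.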
We are now in a position to formally present our algorithm for learning SPAMs.

Our algorithm for learning SPAMs. The steps involved in our learning scheme are outlined in Algorithm 1. Steps 1-4 involve the CS-based recovery stage, wherein we use the aforementioned sampling sets to formulate our problem as a CS one. Step 4 involves a simple thresholding procedure where an appropriate threshold τ is employed to recover the unknown active set S. In Section 4 we provide precise conditions on our sampling parameters which guarantee exact recovery, i.e., Ŝ = S. Step 5 leverages a convex quadratic program (P), which uses the noisy estimates of φ'_l(i/mx), i.e., x̂_{i,l} for each l ∈ Ŝ and i = −mx, ..., mx, to return a cubic spline estimate φ̃'_l for each l ∈ Ŝ. This program and its theoretical properties are explained in Section 4. Finally, in Step 6 we derive our final estimate φ_est,l via piecewise integration of φ̃'_l. Hence our final estimate of φ_l is a spline of degree 4. The performance of Algorithm 1 for recovering S and the individual φ_l's is presented in Theorem 1, which is also our first main result. All proofs are deferred to the appendix.

Figure 1: The points ξ_i ∈ X (blue disks) and ξ_i + εv_j (red arrows) for v_j ∈ V.

Algorithm 1 Algorithm for learning φ_l in the SPAM f(x) = Σ_{l∈S} φ_l(x_l)
1: Choose mx, mv and construct sampling sets X and V as in (3.2), (3.3).
2: Choose step size ε > 0. Query f at f(ξ_i), f(ξ_i + εv_j) for i = −mx, ..., mx and j = 1, ..., mv.
3: Construct y_i where y_{i,j} = (f(ξ_i + εv_j) − f(ξ_i))/ε for i = −mx, ..., mx and j = 1, ..., mv.
4: Set x̂_i := argmin_{y_i = Vz} ∥z∥₁. For τ > 0 compute Ŝ = ∪_{i=−mx}^{mx} {l ∈ {1, ..., d} : |x̂_{i,l}| > τ}.
5: For each l ∈ Ŝ, run (P) as defined in Section 4 using (x̂_{i,l})_{i=−mx}^{mx}, τ and some smoothing parameter γ ≥ 0, to obtain φ̃'_l.
6: For each l ∈ Ŝ, set φ_est,l to be the piecewise integral of φ̃'_l, as explained in Section 4.

Theorem 1. There exist constants C, C1 > 0 such that if mx ≥ 1/δ, mv ≥ C1 k log d, 0 < ε < D√mv/(CkB₂) and τ = CεkB₂/(2√mv), then with high probability Ŝ = S, and for any γ ≥ 0 the estimate φ_est,l returned by Algorithm 1 satisfies for each l ∈ S:

∥φ_est,l − φ_l∥_{L∞[−1,1]} ≤ [59(1 + γ)] (CεkB₂/√mv) + (87/(64 mx⁴)) ∥φ_l^(5)∥_{L∞[−1,1]}.   (3.6)

Recall that k, B₂, D, δ are our problem parameters introduced in Section 2, while ε is the step size parameter from (3.4). We see that with O(k log d) point queries of f and with ε < D√mv/(CkB₂), the active set is recovered exactly. The error bound in (3.6) holds for all such choices of ε. It is a sum of two terms: the first arises during the estimation of ∇f in the CS stage; the second is the interpolation error bound for interpolating φ'_l from its samples in the noise-free setting. We note that our point queries lie in [−(1 + (ε/√mv)), (1 + (ε/√mv))]^d. For the stated condition on ε in Theorem 1 we have ε/√mv < D/(CkB₂), which can be made arbitrarily close to zero by choosing an appropriately small ε. 
Hence we sample from only a small enlargement of [−1, 1]^d.

4 Analyzing the algorithm

We now describe and analyze in more detail the individual stages of Algorithm 1. We first analyze Steps 1-4, which constitute the compressive sensing (CS) based recovery stage. Next, we analyze Step 5, where we also introduce our convex quadratic program. Lastly, we analyze Step 6, where we derive our final estimate φ_est,l.

Compressive sensing-based recovery stage. This stage of Algorithm 1 involves solving a sequence of linear programs for recovering estimates of x_i = [φ'_1(i/mx) ... φ'_d(i/mx)] for i = −mx, ..., mx. We note that the measurements y_i are noisy linear measurements of x_i, with the noise being arbitrary and bounded. For such a noise model, it is known that ℓ1 minimization results in robust recovery of the sparse signal [19]. Using this result in our setting allows us to quantify the recovery error ∥x̂_i − x_i∥₂, as specified in Lemma 1.

Lemma 1. There exist constants c'₃ ≥ 1 and C, c'₁ > 0 such that for mv satisfying c'₃ k log d < mv < d/(log 6)², we have with probability at least 1 − e^{−c'₁ mv} − e^{−√(mv d)} that x̂_i satisfies ∥x̂_i − x_i∥₂ ≤ CεkB₂/(2√mv) for all i = −mx, ..., mx. Furthermore, given that this holds and mx ≥ 1/δ is satisfied, we then have for any ε < D√mv/(CkB₂) that the choice τ = CεkB₂/(2√mv) implies that Ŝ = S.

Thus, upon using ℓ1 minimization based decoding at the 2mx + 1 points, we recover robust estimates x̂_i of x_i, which immediately gives us estimates φ̂'_l(i/mx) = x̂_{i,l} of φ'_l(i/mx) for i = −mx, ..., mx and l = 1, ..., d. In order to recover the active set S, we first note that the spacing between consecutive samples in X is 1/mx. Therefore the condition mx ≥ 1/δ implies, on account of Assumption 3, that the sample spacing is fine enough to ensure that for each l ∈ S there exists a sample i for which |φ'_l(i/mx)| ≥ D holds. The stated choice of the step size ε essentially guarantees for all l ∉ S and all i that |φ̂'_l(i/mx)| lies within a sufficiently small neighborhood of the origin, in turn enabling detection of the active set. Therefore, after this stage of Algorithm 1, we have at hand the active set S along with the estimates (φ̂'_l(i/mx))_{i=−mx}^{mx} for each l ∈ S. Furthermore, it is easy to see that |φ̂'_l(i/mx) − φ'_l(i/mx)| ≤ τ = CεkB₂/(2√mv) for all l ∈ S and all i.

Robust estimation via cubic splines. Our aim now is to recover a smooth, robust estimate of φ'_l by using the noisy samples (φ̂'_l(i/mx))_{i=−mx}^{mx}. Note that the noise here is arbitrary and bounded by τ = CεkB₂/(2√mv). To this end, we choose to use cubic splines as our estimates, which are essentially piecewise cubic polynomials that are C² smooth [20]. There is a considerable amount of literature in the statistics community devoted to the problem of estimating univariate functions from noisy samples via cubic splines (cf. [21, 22, 23, 24]), albeit under the setting of random noise. 
Cubic splines have also been studied extensively in the approximation theoretic setting for interpolating samples (cf. [20, 25, 26]).

We introduce our solution to this problem in a more general setting. Consider a smooth function g : [t1, t2] → R and a uniform mesh⁵ Π : t1 = x_0 < x_1 < ··· < x_{n−1} < x_n = t2 with x_i − x_{i−1} = h. We have at hand noisy samples ĝ_i = g(x_i) + e_i, with the noise e_i being arbitrary and bounded: |e_i| ≤ τ. In the noiseless scenario, the problem would be an interpolation one, for which a popular class of cubic splines are the "not-a-knot" cubic splines [25]. These achieve optimal O(h⁴) error rates for C⁴ smooth g without using any higher order information about g as boundary conditions. Let H²[t1, t2] denote the space of cubic splines defined on [t1, t2] w.r.t. Π. We then propose finding the cubic spline estimate as a solution of the following convex optimization problem (in the 4n coefficients of the n cubic polynomials) for some parameter γ ≥ 0:

(P)   min_{L ∈ H²[t1,t2]} ∫_{t1}^{t2} L''(x)² dx   (4.1)
s.t.  ĝ_i − γτ ≤ L(x_i) ≤ ĝ_i + γτ;  i = 0, ..., n,   (4.2)
      L'''(x_1⁻) = L'''(x_1⁺),  L'''(x_{n−1}⁻) = L'''(x_{n−1}⁺).   (4.3)

⁵ We consider uniform meshes for clarity of exposition. The results in this section can be easily generalized to non-uniform meshes.

Note that (P) is a convex QP with linear constraints. The objective function can be verified to be a positive definite quadratic form in the spline coefficients⁶. 
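To make the structure of (P) concrete, here is a discretized toy analogue (our illustration, not the paper's exact program): it optimizes mesh values rather than spline coefficients, replaces ∫ L''² by squared second differences, and drops the not-a-knot conditions; the box constraints mirror (4.2):

```python
import numpy as np
from scipy.optimize import minimize

# Discrete analogue of (P): choose values L_i on a uniform mesh minimizing
# total squared curvature (second differences) subject to |L_i - g_i| <= gamma*tau.
rng = np.random.default_rng(0)
n, tau, gamma = 25, 0.05, 1.0
t = np.linspace(-1, 1, n)
g = np.sin(np.pi * t)                       # hypothetical ground truth
g_noisy = g + rng.uniform(-tau, tau, n)     # arbitrary bounded noise, |e_i| <= tau

h = t[1] - t[0]
def curvature(L):
    d2 = (L[2:] - 2 * L[1:-1] + L[:-2]) / h**2   # discrete second derivative
    return np.sum(d2**2) * h

band = gamma * tau
res = minimize(curvature, g_noisy,
               bounds=list(zip(g_noisy - band, g_noisy + band)))
L = res.x
print(np.max(np.abs(L - g)))   # uniform error of the smoothed estimate
```

The interplay described in the text is visible here: γ = 0 forces interpolation of the noisy samples, while larger γ widens the feasible band and lets the optimizer pick a flatter (lower-curvature) solution.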
Specifically, the objective measures the total curvature of a feasible cubic spline in [t1, t2]. Each of the constraints (4.2)-(4.3), along with the implicit continuity constraints of L^(p), p = 0, 1, 2, at the interior points of Π, are linear equalities/inequalities in the coefficients of the piecewise cubic polynomials. (4.3) refers to the not-a-knot boundary conditions [25], which are also linear equalities in the spline coefficients. These conditions imply that L''' is continuous⁷ at the knots x_1, x_{n−1}. Thus, (P) searches amongst the space of all not-a-knot cubic splines such that L(x_i) lies within a ±γτ interval of ĝ_i, and returns the smoothest solution, i.e., the one with the least total curvature. The parameter γ ≥ 0 controls the degree of smoothness of the solution. Clearly, γ = 0 implies interpolating the noisy samples (ĝ_i)_{i=0}^n. As γ increases, the search interval [ĝ_i − γτ, ĝ_i + γτ] becomes larger for all i, leading to smoother feasible cubic splines. The following theorem formally describes the estimation properties of (P) and is also our second main result.

Theorem 2. For g ∈ C⁴[t1, t2], let L* : [t1, t2] → R be a solution of (P) for some parameter γ ≥ 0. We then have that

∥L* − g∥∞ ≤ [118(1 + γ)/3] τ + (29/64) h⁴ ∥g^(4)∥∞.   (4.4)

We show in the appendix that if ∫_{t1}^{t2} (L*''(x))² dx > 0, then L* is unique. Note that the error bound (4.4) is a sum of two terms. The first term is proportional to the external noise bound τ, indicating that the solution is robust to noise. The second term is the error that would arise even if the perturbation were absent, i.e., τ = 0. Intuitively, if γτ is large enough, then we would expect the solution returned by (P) to be a line. 
Indeed, a larger value of γτ would imply a larger search interval in (4.2), which, if sufficiently large, would allow a line (which has zero curvature) to lie in the feasible region. More formally, we show in the appendix sufficient conditions, τ = Ω(n^{1/2} ∥g''∥∞ / (γ − 1)) and γ > 1, which if satisfied imply that the solution returned by (P) is a line. This indicates that if either n is small or g has small curvature, then moderately large values of τ and/or γ will cause the solution returned by (P) to be a line. If an estimate of ∥g''∥∞ is available, then one could, for instance, use the upper bound 1 + O(n^{1/2} ∥g''∥∞ / τ) to restrict the range of values of γ within which (P) is used.

Theorem 2 has the following corollary for the estimation of C⁴ smooth φ'_l in the interval [−1, 1]. The proof simply involves replacing g with φ'_l, n + 1 with 2mx + 1, h with 1/mx, and τ with CεkB₂/(2√mv). As the perturbation τ is directly proportional to the step size ε, we show in the appendix that if additionally ε = Ω(√(mx mv) ∥φ'''_l∥∞ / (γ − 1)) and γ > 1 hold, then the corresponding estimate φ̃'_l will be a line.

Corollary 1. Let (P) be employed for each l ∈ S using noisy samples (φ̂'_l(i/mx))_{i=−mx}^{mx}, with step size ε satisfying 0 < ε < D√mv/(CkB₂). Denoting φ̃'_l as the corresponding solution returned by (P), we then have for any γ ≥ 0 that:

∥φ̃'_l − φ'_l∥_{L∞[−1,1]} ≤ [59(1 + γ)/3] (CεkB₂/√mv) + (29/(64 mx⁴)) ∥φ_l^(5)∥_{L∞[−1,1]}.   (4.5)

The final estimate. We now derive the final estimate φ_est,l of φ_l for each l ∈ S. Denote x_0(= −1) < x_1 < ··· < x_{2mx−1} < x_{2mx}(= 1) as our equispaced set of points on [−1, 1]. Since φ̃'_l : [−1, 1] → R returned by (P) is a cubic spline, we have φ̃'_l(x) = φ̃'_{l,i}(x) for x ∈ [x_i, x_{i+1}], where φ̃'_{l,i} is a polynomial of degree at most 3. We then define φ_est,l(x) := φ̃_{l,i}(x) + F_i for x ∈ [x_i, x_{i+1}] and i = 0, ..., 2mx − 1. Here φ̃_{l,i} is an antiderivative of φ̃'_{l,i} and the F_i's are constants of integration. Denoting F_0 = F, we have that φ_est,l is continuous at x_1, ..., x_{2mx−1} for F_i = φ̃_{l,0}(x_1) + Σ_{j=1}^{i−1} (φ̃_{l,j}(x_{j+1}) − φ̃_{l,j}(x_j)) − φ̃_{l,i}(x_i) + F = F'_i + F; 1 ≤ i ≤ 2mx − 1. Hence, by denoting ψ_{l,i}(·) := φ̃_{l,i}(·) + F'_i, we obtain φ_est,l(·) = ψ_l(·) + F, where ψ_l(x) = ψ_{l,i}(x) for x ∈ [x_i, x_{i+1}].

⁶ Shown in the appendix.
⁷ f(x⁻) = lim_{h→0⁻} f(x + h) and f(x⁺) = lim_{h→0⁺} f(x + h) denote the left and right hand limits, respectively.
Now, on account of Assumption 2, we require φ_est,l to also be centered, implying F = −(1/2) ∫_{−1}^{1} ψ_l(x) dx. Hence we output our final estimate of φ_l to be:

φ_est,l(x) := ψ_l(x) − (1/2) ∫_{−1}^{1} ψ_l(x) dx;  x ∈ [−1, 1].   (4.6)

Since φ_est,l is by construction continuous in [−1, 1], is a piecewise combination of polynomials of degree at most 4, and since φ'_est,l is a cubic spline, φ_est,l is a spline function of order 4. Lastly, we show in the proof of Theorem 1 that ∥φ_est,l − φ_l∥_{L∞[−1,1]} ≤ 3 ∥φ̃'_l − φ'_l∥_{L∞[−1,1]} holds. Using Corollary 1, this provides us with the error bounds stated in Theorem 1.

5 Impact of noise on performance of our algorithm

Our third main contribution involves analyzing the more realistic scenario, when the point queries are corrupted with additive external noise z'. Thus, querying f in Step 2 of Algorithm 1 results in the noisy values f(ξ_i) + z'_i and f(ξ_i + εv_j) + z'_{i,j}, respectively. This changes (3.5) to the noisy linear system y_i = V x_i + n_i + z_i, where z_{i,j} = (z'_{i,j} − z'_i)/ε for i = −mx, ..., mx and j = 1, ..., mv. Notice that the external noise gets scaled by 1/ε, while |n_{i,j}| scales linearly with ε.

Arbitrary bounded noise. In this model, the external noise is arbitrary but bounded, so that |z'_i|, |z'_{i,j}| < κ for all i, j. It can be verified along the lines of the proof of Lemma 1 that ∥n_i + z_i∥₂ ≤ (2κ/ε + εkB₂/(2mv)) √mv. Observe that unlike the noiseless setting, ε cannot be made arbitrarily close to 0, as it would blow up the impact of the external noise. 
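The resulting trade-off in ε can be checked numerically; the sketch below (with illustrative constants chosen by us) sweeps the per-measurement worst-case bound 2κ/ε + εkB₂/(2mv) and compares the empirical minimizer with the closed-form ε* = 2√(κmv/(kB₂)) obtained by balancing the two terms:

```python
import numpy as np

# Worst-case per-measurement perturbation in the bounded-noise model:
# the external term 2*kappa/eps blows up as eps -> 0, while the Taylor
# term eps*k*B2/(2*m_v) grows with eps. Constants below are illustrative.
kappa, k, B2, m_v = 1e-4, 3, 1.0, 64

def perturbation(eps):
    return 2 * kappa / eps + eps * k * B2 / (2 * m_v)

eps_grid = np.logspace(-4, 1, 200)
eps_star = eps_grid[np.argmin(perturbation(eps_grid))]

# Closed form of the minimizer of a/eps + b*eps with a = 2*kappa, b = k*B2/(2*m_v):
print(eps_star, 2 * np.sqrt(kappa * m_v / (k * B2)))
```

The bound is minimized at an interior ε, consistent with the observation that ε can no longer be taken arbitrarily small once external noise is present.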
The following theorem shows that if $\kappa$ is small relative to $D^2 < |\phi'_l(x)|^2$, $\forall x \in I_l, l \in S$, then$^8$ there exists an interval for choosing $\epsilon$ within which Algorithm 1 recovers the active set $S$ exactly. This condition has the natural interpretation that if the signal-to-'external noise' ratio in $I_l$ is sufficiently large, then $S$ can be detected exactly.

Theorem 3. There exist constants $C, C_1 > 0$ such that if $\kappa < D^2/(16C^2kB_2)$, $m_x \ge (1/\delta)$, and $m_v \ge C_1 k \log d$ hold, then for any $\epsilon \in \frac{D\sqrt{m_v}}{2CkB_2}[1-A, 1+A]$, where $A := \sqrt{1 - (16C^2kB_2\kappa)/D^2}$ and $\tau = \sqrt{m_v}\left(\frac{2\kappa}{\epsilon} + \frac{C\epsilon kB_2}{2m_v}\right)$, we have in Algorithm 1, with high probability, that $\hat{S} = S$ and, for any $\gamma \ge 0$, for each $l \in S$:

$$\|\phi_{est,l} - \phi_l\|_{L_\infty[-1,1]} \le [59(1+\gamma)]\left(\frac{4C\sqrt{m_v}\,\kappa}{\epsilon} + \frac{C\epsilon kB_2}{\sqrt{m_v}}\right) + \frac{87}{64m_x^4}\,\|\phi^{(5)}_l\|_{L_\infty[-1,1]}. \qquad (5.1)$$

Stochastic noise. In this model, the external noise is assumed to be i.i.d. Gaussian, so that $z'_i, z'_{i,j} \sim \mathcal{N}(0, \sigma^2)$, i.i.d. $\forall i,j$. In this setting we consider resampling $f$ at each query point $N$ times and then averaging the noisy samples, in order to reduce $\sigma$. Given this, we now have $z'_i, z'_{i,j} \sim \mathcal{N}(0, \sigma^2/N)$, i.i.d. $\forall i,j$. Using standard tail bounds for Gaussians, we can show for any $\kappa > 0$ that if $N$ is chosen large enough, then $|z'_{i,j} - z'_i| \le 2\kappa$, $\forall i,j$, with high probability. Hence the external noise is bounded with high probability, and the analysis for Theorem 3 can be used in a straightforward manner.
Of course, an advantage that we have in this setting is that $\kappa$ can be chosen arbitrarily close to zero by choosing a correspondingly large value of $N$. We state all this formally in the form of the following theorem.

Theorem 4. There exist constants $C, C_1 > 0$ such that for $\kappa < D^2/(16C^2kB_2)$, $m_x \ge (1/\delta)$, and $m_v \ge C_1 k \log d$, if we re-sample each query in Step 2 of Algorithm 1 $N > \frac{\sigma^2}{\kappa^2}\log\left(\frac{\sqrt{2}\,\sigma|X||V|}{\kappa p}\right)$ times for $0 < p < 1$, and average the values, then for any $\epsilon \in \frac{D\sqrt{m_v}}{2CkB_2}[1-A, 1+A]$, where $A := \sqrt{1 - (16C^2kB_2\kappa)/D^2}$ and $\tau = \sqrt{m_v}\left(\frac{2\kappa}{\epsilon} + \frac{C\epsilon kB_2}{2m_v}\right)$, we have in Algorithm 1, with probability at least $1 - p - o(1)$, that $\hat{S} = S$ and, for any $\gamma \ge 0$, for each $l \in S$:

$$\|\phi_{est,l} - \phi_l\|_{L_\infty[-1,1]} \le [59(1+\gamma)]\left(\frac{4C\sqrt{m_v}\,\kappa}{\epsilon} + \frac{C\epsilon kB_2}{\sqrt{m_v}}\right) + \frac{87}{64m_x^4}\,\|\phi^{(5)}_l\|_{L_\infty[-1,1]}. \qquad (5.2)$$

8 $I_l$ is the "critical" interval defined in Assumption 3 for detecting $l \in S$.

Note that we now query $f$ a total of $N|X|(|V|+1)$ times. Also, $|X| = (2m_x + 1) = \Theta(1)$ and $\kappa = O(k^{-1})$, as $D, C, B_2, \delta$ are constants. Hence the choice $|V| = O(k\log d)$ gives us $N = O(k^2\log(p^{-1}k^2\log d))$ and leads to an overall query complexity of $O(k^3\log d\,\log(p^{-1}k^2\log d))$ when the samples are corrupted with additive Gaussian noise. Choosing $p = O(d^{-c})$ for any constant $c > 0$ gives a sample complexity of $O(k^3(\log d)^2)$ and ensures that the result holds with high probability. The $o(1)$ term goes to zero exponentially fast as $d \to \infty$.

Simulation results. We now provide simulation results on synthetic data to support our theoretical findings.
We consider the noisy setting, with the point queries corrupted by Gaussian noise. For $d = 1000$, $k = 4$ and $S = \{2, 105, 424, 782\}$, consider $f : \mathbb{R}^d \to \mathbb{R}$ where $f = \phi_2(x_2) + \phi_{105}(x_{105}) + \phi_{424}(x_{424}) + \phi_{782}(x_{782})$ with: $\phi_2(x) = \sin(\pi x)$, $\phi_{105}(x) = \exp(-2x)$, $\phi_{424}(x) = (1/3)\cos^3(\pi x) + 0.8x^2$, $\phi_{782}(x) = 0.5x^4 - x^2 + 0.8x$. We choose $\delta = 0.3$, $D = 0.2$, which can be verified to be valid parameters for the above $\phi_l$'s. Furthermore, we choose $m_x = \lceil 2/\delta \rceil = 7$ and $m_v = \lceil 2k\log d \rceil = 56$ to satisfy the conditions of Theorem 4. Next, we choose the constants $C = 0.2$, $B_2 = 35$ and $\kappa = 0.95\,\frac{D^2}{16C^2kB_2} = 4.24 \times 10^{-4}$ as required by Theorem 4. For the choice $\epsilon = \frac{D\sqrt{m_v}}{2CkB_2} = 0.0267$, we then query $f$ at $(2m_x+1)(m_v+1) = 855$ points. The function values are corrupted with Gaussian noise $\mathcal{N}(0, \sigma^2/N)$ for $\sigma = 0.01$ and $N = 100$; this is equivalent to resampling and averaging the point queries $N$ times. Importantly, the sufficient condition on $N$ stated in Theorem 4 is $\lceil \frac{\sigma^2}{\kappa^2}\log(\frac{\sqrt{2}\,\sigma|X||V|}{\kappa p}) \rceil = 6974$ for $p = 0.1$; thus we consider a significantly undersampled regime. Lastly, we select the threshold $\tau = \sqrt{m_v}\left(\frac{2\kappa}{\epsilon} + \frac{C\epsilon kB_2}{2m_v}\right) = 0.2875$ as stated by Theorem 4, and employ Algorithm 1 for different values of the smoothing parameter $\gamma$.

Figure 2: Estimates $\phi_{est,l}$ of $\phi_l$ (black) for: $\gamma = 0.3$ (red), $\gamma = 1$ (blue) and $\gamma = 5$ (green). Panels: (a) estimates of $\phi_2$; (b) estimates of $\phi_{105}$; (c) estimates of $\phi_{424}$; (d) estimates of $\phi_{782}$.

The results are shown in Figure 2. Over 10 independent runs of the algorithm we observed that $S$ was recovered exactly each time. Furthermore, we see from Figure 2 that the recovery is quite accurate for $\gamma = 0.3$.
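The quoted parameter values can be re-derived with a few lines of arithmetic. The closed forms used below for $\epsilon$, $\tau$ and the resampling bound on $N$ are our reading of Theorem 4 (assumptions, checked against the printed numbers), not output of the paper's code:

```python
import math

# Experiment constants from this section
d, k, delta, D = 1000, 4, 0.3, 0.2
C, B2, sigma, p = 0.2, 35.0, 0.01, 0.1

m_x = math.ceil(2 / delta)                   # sampling points per side
m_v = math.ceil(2 * k * math.log(d))         # number of random directions
kappa = 0.95 * D**2 / (16 * C**2 * k * B2)   # external-noise level
eps = D * math.sqrt(m_v) / (2 * C * k * B2)  # step size (assumed closed form)
tau = math.sqrt(m_v) * (2 * kappa / eps
                        + C * eps * k * B2 / (2 * m_v))  # threshold (assumed form)

n_queries = (2 * m_x + 1) * (m_v + 1)        # distinct query points
X, V = 2 * m_x + 1, m_v                      # |X| and |V|
# sufficient resampling count from Theorem 4 (assumed form of the log term)
N_req = math.ceil(sigma**2 / kappa**2
                  * math.log(math.sqrt(2) * sigma * X * V / (kappa * p)))

assert (m_x, m_v) == (7, 56)
assert round(kappa, 6) == 0.000424
assert round(eps, 4) == 0.0267
assert round(tau, 4) == 0.2875
assert n_queries == 855
assert N_req == 6974    # vs. N = 100 actually used: undersampled regime
```

All asserted values match the numbers reported in the text, which is why the simulation with $N = 100$ is described as significantly undersampled relative to the sufficient condition.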
For $\gamma = 1$ we notice that the search interval $\gamma\tau = 0.2875$ becomes large enough to cause the estimates $\phi_{est,424}, \phi_{est,782}$ to become relatively smoother. For $\gamma = 5$, the search interval $\gamma\tau = 1.4375$ becomes wide enough for a line to fit in the feasible region for $\phi'_{424}, \phi'_{782}$, which results in $\phi_{est,424}, \phi_{est,782}$ being quadratic functions. In the case of $\phi'_2, \phi'_{105}$, the search interval is not sufficiently wide for a line to lie in the feasible region, even for $\gamma = 5$. However, we notice that the estimates $\phi_{est,2}, \phi_{est,105}$ become relatively smoother, as expected.

6 Conclusion

We proposed an efficient sampling scheme for learning SPAMs. In particular, we showed that with only a few queries, we can derive uniform approximations to each underlying univariate function of the SPAM. A crucial component of our approach is a novel convex QP for robust estimation of univariate functions via cubic splines, from samples corrupted with arbitrary bounded noise. Lastly, we showed how our algorithm can handle noisy point queries for both (i) arbitrary bounded and (ii) i.i.d. Gaussian noise models. An important direction for future work would be to determine the optimality of our sampling bounds by deriving corresponding lower bounds on the sample complexity.

Acknowledgments. This research was supported in part by SNSF grant 200021 137528 and a Microsoft Research Faculty Fellowship.

References

[1] Th. Müller-Gronbach and K. Ritter. Minimal errors for strong and weak approximation of stochastic differential equations.
Monte Carlo and Quasi-Monte Carlo Methods, pages 53–82, 2008.

[2] M.H. Maathuis, M. Kalisch, and P. Bühlmann. Estimating high-dimensional intervention effects from observational data. Ann. Statist., 37(6A):3133–3164, 2009.

[3] M.J. Wainwright. Information-theoretic limits on sparsity recovery in the high-dimensional and noisy setting. IEEE Trans. Inform. Theory, 55(12):5728–5741, 2009.

[4] J.F. Traub, G.W. Wasilkowski, and H. Wozniakowski. Information-Based Complexity. Academic Press, New York, 1988.

[5] R. DeVore, G. Petrova, and P. Wojtaszczyk. Approximation of functions of few variables in high dimensions. Constr. Approx., 33:125–143, 2011.

[6] A. Cohen, I. Daubechies, R.A. DeVore, G. Kerkyacharian, and D. Picard. Capturing ridge functions in high dimensions from point queries. Constr. Approx., pages 1–19, 2011.

[7] H. Tyagi and V. Cevher. Active learning of multi-index function models. Advances in Neural Information Processing Systems 25, pages 1475–1483, 2012.

[8] M. Fornasier, K. Schnass, and J. Vybíral. Learning functions of few arbitrary linear parameters in high dimensions. Foundations of Computational Mathematics, 12(2):229–262, 2012.

[9] Y. Lin and H.H. Zhang. Component selection and smoothing in multivariate nonparametric regression. Ann. Statist., 34(5):2272–2297, 2006.

[10] M. Yuan. Nonnegative garrote component selection in functional anova models. In AISTATS, volume 2, pages 660–666, 2007.

[11] G. Raskutti, M.J. Wainwright, and B. Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. J. Mach. Learn. Res., 13(1):389–427, 2012.

[12] V. Koltchinskii and M. Yuan. Sparse recovery in large ensembles of kernel machines. In COLT, pages 229–238, 2008.

[13] P. Ravikumar, J. Lafferty, H. Liu, and L. Wasserman. Sparse additive models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(5):1009–1030, 2009.

[14] L. Meier, S. Van De Geer, and P. Bühlmann. High-dimensional additive modeling. Ann. Statist., 37(6B):3779–3821, 2009.

[15] V. Koltchinskii and M. Yuan. Sparsity in multiple kernel learning. Ann. Statist., 38(6):3660–3695, 2010.

[16] J. Huang, J.L. Horowitz, and F. Wei. Variable selection in nonparametric additive models. Ann. Statist., 38(4):2282–2313, 2010.

[17] E.J. Candès, J.K. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.

[18] D.L. Donoho. Compressed sensing. IEEE Trans. Inform. Theory, 52(4):1289–1306, 2006.

[19] P. Wojtaszczyk. ℓ1 minimization with noisy data. SIAM J. Numer. Anal., 50(2):458–467, 2012.

[20] J.H. Ahlberg, E.N. Nilson, and J.L. Walsh. The theory of splines and their applications. Academic Press, New York, 1967.

[21] I.J. Schoenberg. Spline functions and the problem of graduation. Proceedings of the National Academy of Sciences, 52(4):947–950, 1964.

[22] C.M. Reinsch. Smoothing by spline functions. Numer. Math., 10:177–183, 1967.

[23] G. Wahba. Smoothing noisy data with spline functions. Numer. Math., 24(5):383–393, 1975.

[24] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numer. Math., 31(4):377–403, 1978.

[25] C. de Boor. A Practical Guide to Splines. Springer-Verlag, New York, 1978.

[26] C.A. Hall and W.W. Meyer. Optimal error bounds for cubic spline interpolation. J.
Approx. Theory, 16(2):105–122, 1976.