{"title": "Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space", "book": "Advances in Neural Information Processing Systems", "page_first": 2456, "page_last": 2464, "abstract": "In this paper, we propose a Sparse Random Features algorithm, which learns a sparse non-linear predictor by minimizing an $\\ell_1$-regularized objective function over the Hilbert space induced from a kernel function. By interpreting the algorithm as Randomized Coordinate Descent in the infinite-dimensional space, we show that the proposed approach converges to a solution within $\\epsilon$-precision of the exact kernel method by drawing $O(1/\\epsilon)$ random features, in contrast to the $O(1/\\epsilon^2)$-type convergence achieved by Monte-Carlo analyses in the current Random Features literature. In our experiments, the Sparse Random Features algorithm obtains a sparse solution that requires less memory and prediction time while maintaining comparable performance on regression and classification tasks. Meanwhile, as an approximate solver for the infinite-dimensional $\\ell_1$-regularized problem, the randomized approach converges to a better solution than the Boosting approach when the greedy step of Boosting cannot be performed exactly.", "full_text": "Sparse Random Features Algorithm as Coordinate Descent in Hilbert Space

Ian E.H. Yen (1), Ting-Wei Lin (2), Shou-De Lin (2), Pradeep Ravikumar (1), Inderjit S. Dhillon (1)
Department of Computer Science
1: University of Texas at Austin, 2: National Taiwan University
1: {ianyen,pradeepr,inderjit}@cs.utexas.edu, 2: {b97083,sdlin}@csie.ntu.edu.tw

Abstract

In this paper, we propose a Sparse Random Features algorithm, which learns a sparse non-linear predictor by minimizing an ℓ1-regularized objective function over the Hilbert space induced from a kernel function. 
By interpreting the algorithm as Randomized Coordinate Descent in an infinite-dimensional space, we show the proposed approach converges to a solution within ϵ-precision of that using an exact kernel method, by drawing O(1/ϵ) random features, in contrast to the O(1/ϵ²) convergence achieved by current Monte-Carlo analyses of Random Features. In our experiments, the Sparse Random Features algorithm obtains a sparse solution that requires less memory and prediction time, while maintaining comparable performance on regression and classification tasks. Moreover, as an approximate solver for the infinite-dimensional ℓ1-regularized problem, the randomized approach also enjoys better convergence guarantees than a Boosting approach in the setting where the greedy Boosting step cannot be performed exactly.

1 Introduction

Kernel methods have become standard for building non-linear models from simple feature representations, and have proven successful in problems ranging across classification, regression, structured prediction and feature extraction [16, 20]. A caveat, however, is that they are not scalable as the number of training samples increases. In particular, the size of the models produced by kernel methods scales linearly with the number of training samples, even for sparse kernel methods like support vector machines [17]. This makes the corresponding training and prediction computationally prohibitive for large-scale problems.

A line of research has thus been devoted to kernel approximation methods that aim to preserve predictive performance while maintaining computational tractability. Among these, Random Features has attracted considerable recent interest due to its simplicity and efficiency [2, 3, 4, 5, 10, 6]. 
Since first proposed in [2], and extended by several works [3, 4, 5, 10], the Random Features approach is a sampling-based approximation to the kernel function: by drawing D features from the distribution induced by the kernel function, one can guarantee uniform convergence of the approximation error to the order of O(1/√D). On the flip side, such a rate of convergence suggests that in order to achieve high precision, one might need a large number of random features, which might lead to model sizes even larger than that of the vanilla kernel method.

One approach to remedy this problem would be to employ feature selection techniques to prevent the model size from growing linearly with D. A simple way to do so would be to add ℓ1-regularization to the objective function, so that one can simultaneously increase the number of random features D while selecting a compact subset of them with non-zero weight. However, the resulting algorithm cannot be justified by existing analyses of Random Features, since the Representer theorem does not hold for the ℓ1-regularized problem [15, 16]. In other words, since the prediction cannot be expressed as a linear combination of kernel evaluations, a small error in approximating the kernel function cannot correspondingly guarantee a small prediction error.

In this paper, we propose a new interpretation of Random Features that justifies its usage with ℓ1-regularization, yielding the Sparse Random Features algorithm. In particular, we show that the Sparse Random Features algorithm can be seen as Randomized Coordinate Descent (RCD) in the Hilbert space induced from the kernel, and that by taking D steps of coordinate descent, one can achieve a solution comparable to exact kernel methods within O(1/D) precision in terms of the objective function. 
Note that the surprising facet of this analysis is that in the finite-dimensional case, the iteration complexity of RCD increases with the number of dimensions [18], which would trivially yield a bound going to infinity for our infinite-dimensional problem. In our experiments, the Sparse Random Features algorithm obtains a sparse solution that requires less memory and prediction time, while maintaining comparable performance on regression and classification tasks with various kernels. Note that our technique is complementary to that proposed in [10], which aims to reduce the cost of evaluating and storing basis functions, while our goal is to reduce the number of basis functions in a model.

Another interesting aspect of our algorithm is that our infinite-dimensional ℓ1-regularized objective is also considered in the literature of Boosting [7, 8], which can be interpreted as greedy coordinate descent in the infinite-dimensional space. As an approximate solver for the ℓ1-regularized problem, we compare our randomized approach to the Boosting approach in theory and also in experiments. As we show, for basis functions that do not allow exact greedy search, a randomized approach enjoys better guarantees.

2 Problem Setup

We are interested in estimating a prediction function f : X → Y from a training data set D = {(x_n, y_n)}_{n=1}^N, (x_n, y_n) ∈ X × Y, by solving an optimization problem over some Reproducing Kernel Hilbert Space (RKHS) H:

f^* = \arg\min_{f \in H} \ \frac{\lambda}{2}\|f\|_H^2 + \frac{1}{N}\sum_{n=1}^N L(f(x_n), y_n),   (1)

where L(z, y) is a convex loss function with Lipschitz-continuous derivative satisfying |L'(z_1, y) - L'(z_2, y)| \le \beta|z_1 - z_2|, which includes several standard loss functions such as the square loss L(z, y) = \frac{1}{2}(z - y)^2, the squared hinge loss L(z, y) = \max(1 - zy, 0)^2, and the logistic loss L(z, y) = \log(1 + \exp(-yz)).

2.1 Kernel and Feature Map

There are two ways in practice to specify the space H. One is via specifying a positive-definite kernel k(x, y) that encodes similarity between instances, where H can be expressed as the completion of the space spanned by {k(x, ·)}_{x∈X}, that is,

H = \left\{ f(\cdot) = \sum_{i=1}^K \alpha_i k(x_i, \cdot) \ \middle|\ \alpha_i \in \mathbb{R},\ x_i \in X \right\}.

The other way is to find an explicit feature map {φ̄_h(x)}_{h∈H}, where each h ∈ H defines a basis function φ̄_h(x) : X → R. The RKHS H can then be defined as

H = \left\{ f(\cdot) = \int_{h \in H} w(h)\,\bar{\phi}_h(\cdot)\,dh = \langle w, \bar{\phi}(\cdot) \rangle_H \ \middle|\ \|f\|_H^2 < \infty \right\},   (2)

where w(h) is a weight distribution over the basis functions {φ_h(x)}_{h∈H}. By Mercer's theorem [1], every positive-definite kernel k(x, y) has a decomposition

k(x, y) = \int_{h \in H} p(h)\,\phi_h(x)\,\phi_h(y)\,dh = \langle \bar{\phi}(x), \bar{\phi}(y) \rangle_H,   (3)

where p(h) ≥ 0 and φ̄_h(·) = √(p(h)) φ_h(·), denoted as φ̄ = √p ∘ φ. However, the decomposition is not unique. One can derive multiple decompositions from the same kernel k(x, y) based on different sets of basis functions {φ_h(x)}_{h∈H}. For example, in [2], the Laplacian kernel k(x, y) = exp(−γ‖x − y‖₁) can be decomposed through both the Fourier basis and the Random Binning basis, while in [7], the Laplacian kernel can be obtained from integrating an infinite number of decision trees. On the other hand, multiple kernels can be derived from the same set of basis functions via different distributions p(h). 
For example, in [2, 3], a general decomposition method using Fourier basis functions {φ_ω(x) = cos(ωᵀx)}_{ω∈R^d} was proposed to find a feature map for any shift-invariant kernel of the form k(x − y), where the feature maps (3) of different kernels k(Δ) differ only in the distribution p(ω) obtained from the Fourier transform of k(Δ). Similarly, [5] proposed a decomposition based on the polynomial basis for any dot-product kernel of the form k(⟨x, y⟩).

2.2 Random Features as Monte-Carlo Approximation

The standard kernel method, often referred to as the "kernel trick," solves problem (1) through the Representer Theorem [15, 16], which states that the optimal decision function f^* ∈ H lies in the span of training samples H_D = { f(·) = Σ_{n=1}^N α_n k(x_n, ·) | α_n ∈ R, (x_n, y_n) ∈ D }, which reduces the infinite-dimensional problem (1) to a finite-dimensional problem with N variables {α_n}_{n=1}^N. However, it is known that even for loss functions with dual sparsity (e.g. hinge loss), the number of non-zero α_n increases linearly with the data size [17].

Random Features has been proposed as a kernel approximation method [2, 3, 10, 5], where a Monte-Carlo approximation

k(x_i, x_j) = E_{p(h)}[\phi_h(x_i)\phi_h(x_j)] \approx \frac{1}{D}\sum_{k=1}^D \phi_{h_k}(x_i)\,\phi_{h_k}(x_j) = z(x_i)^T z(x_j)   (4)

is used to approximate (3), so that the solution to (1) can be obtained by

w_{RF} = \arg\min_{w \in \mathbb{R}^D} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{N}\sum_{n=1}^N L(w^T z(x_n), y_n).   (5)

The corresponding approximation error

\left| w_{RF}^T z(x) - f^*(x) \right| = \left| \sum_{n=1}^N \alpha_n^{RF}\, z(x_n)^T z(x) - \sum_{n=1}^N \alpha_n^*\, k(x_n, x) \right|,   (6)

as proved in [2, Appendix B], can be bounded by ϵ given D = Ω(1/ϵ²) random features, which is a direct consequence of the uniform convergence of the sampling approximation (4). Unfortunately, this rate of convergence suggests that to achieve a small approximation error ϵ, one needs a significant number of random features, and since the model size of (5) grows linearly with D, such an algorithm might not obtain a sparser model than the kernel method. On the other hand, the ℓ1-regularized Random Features algorithm we propose aims to minimize the loss with a selected subset of random features that grows linearly with neither D nor N. However, (6) does not hold under ℓ1-regularization, and thus one cannot transfer a guarantee on the kernel approximation (4) to the learned decision function.

3 Sparse Random Features as Coordinate Descent

In this section, we present the Sparse Random Features algorithm and analyze its convergence by interpreting it as a fully-corrective randomized coordinate descent in a Hilbert space. 
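As a concrete illustration of the Monte-Carlo approximation (4) from the previous section, the following is a minimal pure-Python sketch using random Fourier features for the Gaussian RBF kernel; the function names and the choice gamma = 0.5 are ours, for illustration only, and are not part of the paper:

```python
import math
import random

def gaussian_rbf(x, y, gamma=0.5):
    """Exact Gaussian RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sample_fourier_features(d, D, gamma=0.5, rng=random):
    """Draw D random Fourier features for the Gaussian RBF kernel.

    By Bochner's theorem, for k(x - y) = exp(-gamma * ||x - y||^2) the frequency
    distribution p(omega) is Gaussian with variance 2 * gamma per coordinate.
    """
    omegas = [[rng.gauss(0.0, math.sqrt(2.0 * gamma)) for _ in range(d)]
              for _ in range(D)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    return omegas, offsets

def z(x, omegas, offsets):
    """Feature map z(x) whose inner product z(x).z(y) estimates k(x, y), cf. (4)."""
    D = len(omegas)
    scale = math.sqrt(2.0 / D)
    return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(omegas, offsets)]

random.seed(0)
x, y = [0.3, -1.2, 0.7], [0.1, -0.9, 1.0]
omegas, offsets = sample_fourier_features(d=3, D=5000)
approx = sum(a * b for a, b in zip(z(x, omegas, offsets), z(y, omegas, offsets)))
exact = gaussian_rbf(x, y)
print(abs(approx - exact))  # small: the error decays like O(1/sqrt(D))
```

The O(1/√D) decay of this pointwise error is exactly the Monte-Carlo rate that the coordinate-descent analysis below improves upon, in terms of the objective value, for the ℓ1-regularized problem.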
Given a feature map of orthogonal basis functions {φ̄_h(x) = √(p(h)) φ_h(x)}_{h∈H}, the optimization problem (1) can be written as the infinite-dimensional problem

\min_{w \in H} \ \frac{\lambda}{2}\|w\|_2^2 + \frac{1}{N}\sum_{n=1}^N L(\langle w, \bar{\phi}(x_n) \rangle_H, y_n).   (7)

Instead of directly minimizing (7), the Sparse Random Features algorithm optimizes the related ℓ1-regularized problem defined as

\min_{\bar{w} \in H} \ F(\bar{w}) = \lambda \|\bar{w}\|_1 + \frac{1}{N}\sum_{n=1}^N L(\langle \bar{w}, \phi(x_n) \rangle_H, y_n),   (8)

where φ̄(x) = √p ∘ φ(x) is replaced by φ(x) and ‖w̄‖₁ is defined as the ℓ1-norm in function space, ‖w̄‖₁ = ∫_{h∈H} |w̄(h)| dh. The whole procedure is depicted in Algorithm 1. At each iteration, we draw R coordinates h_1, h_2, ..., h_R from the distribution p(h), add them into a working set A_t, and minimize (8) w.r.t. the working set A_t as

\min_{\bar{w}(h),\, h \in A_t} \ \lambda \sum_{h \in A_t} |\bar{w}(h)| + \frac{1}{N}\sum_{n=1}^N L\Big(\sum_{h \in A_t} \bar{w}(h)\,\phi_h(x_n),\ y_n\Big).   (9)

At the end of each iteration, the algorithm removes features with zero weight to maintain a compact working set.

Algorithm 1 Sparse Random Features Algorithm
  Initialize w̄_0 = 0, working set A^(0) = {}, and t = 0.
  repeat
    1. Sample h_1, h_2, ..., h_R i.i.d. from distribution p(h).
    2. Add h_1, h_2, ..., h_R to the set A^(t).
    3. Obtain w̄_{t+1} by solving (9).
    4. A^(t+1) = A^(t) \ { h | w̄_{t+1}(h) = 0 }.
    5. t ← t + 1.
  until t = T

3.1 Convergence Analysis

In this section, we analyze the convergence behavior of Algorithm 1. The analysis comprises two parts. 
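Before the analysis, the loop of Algorithm 1 can be sketched in code. This is a toy pure-Python version under simplifying assumptions: squared loss, one-dimensional inputs, random cosine features h = (frequency, phase); the values of λ, R, T, the frequency scale 10.0, and all function names are illustrative choices of ours, not the paper's settings:

```python
import math
import random

random.seed(1)

# Toy regression data: y = sin(2*pi*x) sampled on [0, 1).
X = [i / 40.0 for i in range(40)]
Y = [math.sin(2.0 * math.pi * x) for x in X]
N = len(X)

def objective(active, coef, feats, lam):
    """lam * ||w||_1 + (1/N) * sum_n 0.5 * (f(x_n) - y_n)^2, cf. objective (8)."""
    loss = 0.0
    for n in range(N):
        pred = sum(coef[h] * feats[h][n] for h in active)
        loss += 0.5 * (pred - Y[n]) ** 2
    return lam * sum(abs(coef[h]) for h in active) + loss / N

def solve_subproblem(active, coef, feats, lam, sweeps=30):
    """Step 3: coordinate descent with soft-thresholding on the working set, cf. (9)."""
    resid = [Y[n] - sum(coef[h] * feats[h][n] for h in active) for n in range(N)]
    for _ in range(sweeps):
        for h in active:
            fh, old = feats[h], coef[h]
            # Exact minimization over the single coordinate h (lasso update).
            num = sum(fh[n] * (resid[n] + old * fh[n]) for n in range(N)) / N
            denom = sum(v * v for v in fh) / N
            new = math.copysign(max(abs(num) - lam, 0.0), num) / denom
            if new != old:
                for n in range(N):
                    resid[n] -= (new - old) * fh[n]
                coef[h] = new
    return coef

lam, R, T = 0.01, 4, 6
active, coef, feats = [], {}, {}
start_obj = objective(active, coef, feats, lam)
for t in range(T):
    for _ in range(R):  # steps 1-2: draw R random coordinates and add to working set
        h = (random.gauss(0.0, 10.0), random.uniform(0.0, 2.0 * math.pi))
        active.append(h)
        coef[h] = 0.0
        feats[h] = [math.cos(h[0] * x + h[1]) for x in X]
    coef = solve_subproblem(active, coef, feats, lam)    # step 3
    active = [h for h in active if coef[h] != 0.0]       # step 4: prune zero weights
    coef = {h: coef[h] for h in active}
    feats = {h: feats[h] for h in active}
final_obj = objective(active, coef, feats, lam)
```

Because every coordinate update is an exact one-dimensional minimization and pruning only removes zero-weight features, the objective is non-increasing across iterations, mirroring the descent argument below.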
First, we estimate the number of iterations Algorithm 1 takes to produce a solution w_t that is at most ϵ away from some arbitrary reference solution w_ref on the ℓ1-regularized program (8). Then, by taking w_ref as the optimal solution w^* of (7), we obtain an approximation guarantee for w_t with respect to w^*. The proofs of most lemmas and corollaries are given in the appendix.

Lemma 1. Suppose the loss function L(z, y) has β-Lipschitz-continuous derivative and |φ_h(x)| ≤ B, ∀h ∈ H, ∀x ∈ X. The loss term Loss(w̄; φ) = (1/N) Σ_{n=1}^N L(⟨w̄, φ(x_n)⟩, y_n) in (8) satisfies

Loss(\bar{w} + \eta\delta_h; \phi) - Loss(\bar{w}; \phi) \le g_h\,\eta + \frac{\gamma}{2}\eta^2,

where δ_h = δ(‖x − h‖) is a Dirac function centered at h, g_h = ∇_w̄ Loss(w̄; φ)(h) is the Fréchet derivative of the loss term evaluated at h, and γ = βB².

The above lemma states the smoothness of the loss term, which is essential to guarantee the descent obtained by taking a coordinate descent step. In particular, we aim to express the expected progress made by Algorithm 1 as the proximal-gradient magnitude of F̄(w) = F(√p ∘ w), defined as

\bar{F}(w) = \lambda\|\sqrt{p} \circ w\|_1 + \frac{1}{N}\sum_{n=1}^N L(\langle w, \bar{\phi}(x_n)\rangle, y_n).   (10)

Let g = ∇_w̄ Loss(w̄, φ) and ḡ = ∇_w Loss(w, φ̄) be the gradients of the loss terms in (8) and (10) respectively, and let ρ ∈ ∂(λ‖w̄‖₁). We have the following relations between (8) and (10):

\bar{g} = \sqrt{p} \circ g, \qquad \bar{\rho} := \sqrt{p} \circ \rho \in \partial\big(\lambda\|\sqrt{p} \circ w\|_1\big),   (11)

which follow by simple applications of the chain rule. We then analyze the progress made by each iteration of Algorithm 1. 
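For intuition, the smoothness constant γ = βB² in Lemma 1 comes from a one-dimensional Taylor-type bound: perturbing w̄ along the single coordinate h changes each prediction z_n = ⟨w̄, φ(x_n)⟩ by η φ_h(x_n). A sketch of this step, using only the β-Lipschitz derivative of L and |φ_h(x)| ≤ B:

```latex
\begin{aligned}
\mathrm{Loss}(\bar{w} + \eta\delta_h;\phi) - \mathrm{Loss}(\bar{w};\phi)
  &= \frac{1}{N}\sum_{n=1}^N \Big[ L\big(z_n + \eta\,\phi_h(x_n),\, y_n\big) - L(z_n, y_n) \Big] \\
  &\le \frac{1}{N}\sum_{n=1}^N \Big[ L'(z_n, y_n)\,\phi_h(x_n)\,\eta + \frac{\beta}{2}\,\phi_h(x_n)^2\,\eta^2 \Big] \\
  &\le g_h\,\eta + \frac{\beta B^2}{2}\,\eta^2,
  \qquad g_h = \frac{1}{N}\sum_{n=1}^N L'(z_n, y_n)\,\phi_h(x_n),
\end{aligned}
```

so the quadratic term is controlled uniformly over h, with γ = βB².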
Recalling that we used R to denote the number of samples drawn in step 1 of our algorithm, we will first assume R = 1, and then show that the same result also holds for R > 1.

Theorem 1 (Descent Amount). The expected descent of the iterates of Algorithm 1 satisfies

E[F(\bar{w}_{t+1})] - F(\bar{w}_t) \le -\frac{\gamma\|\bar{\eta}_t\|^2}{2},   (12)

where η̄ is the proximal gradient of (10), that is,

\bar{\eta} = \arg\min_{\eta} \ \lambda\|\sqrt{p} \circ (w_t + \eta)\|_1 - \lambda\|\sqrt{p} \circ w_t\|_1 + \langle \bar{g}, \eta \rangle + \frac{\gamma}{2}\|\eta\|^2,   (13)

and ḡ = ∇_w Loss(w_t, φ̄) is the derivative of the loss term w.r.t. w.

Proof. Let g_h = ∇_w̄ Loss(w̄_t, φ)(h). By Corollary 1, we have

F(\bar{w}_t + \eta\delta_h) - F(\bar{w}_t) \le \lambda|\bar{w}_t(h) + \eta| - \lambda|\bar{w}_t(h)| + g_h\eta + \frac{\gamma}{2}\eta^2.   (14)

Minimizing the RHS w.r.t. η, the minimizer η_h should satisfy

g_h + \rho_h + \gamma\eta_h = 0   (15)

for some sub-gradient ρ_h ∈ ∂(λ|w̄_t(h) + η_h|). Then by the definition of sub-gradient and (15) we have

\lambda|\bar{w}_t(h) + \eta_h| - \lambda|\bar{w}_t(h)| + g_h\eta_h + \frac{\gamma}{2}\eta_h^2 \le \rho_h\eta_h + g_h\eta_h + \frac{\gamma}{2}\eta_h^2   (16)

= -\gamma\eta_h^2 + \frac{\gamma}{2}\eta_h^2 = -\frac{\gamma}{2}\eta_h^2.   (17)

Note that equality holds in (16) if w̄_t(h) = 0 or the optimal η_h = 0, which is true for Algorithm 1. 
Since w̄_{t+1} minimizes (9) over a block A_t containing h, we have F(w̄_{t+1}) ≤ F(w̄_t + η_h δ_h). Combining (14) and (16), and taking the expectation over h on both sides, we have

E[F(\bar{w}_{t+1})] - F(\bar{w}_t) \le -\frac{\gamma}{2} E[\eta_h^2] = -\frac{\gamma}{2}\|\sqrt{p} \circ \eta\|^2 = -\frac{\gamma}{2}\|\bar{\eta}\|^2.

It then remains to verify that η̄ = √p ∘ η is the proximal gradient (13) of F̄(w_t), which is true since η̄ satisfies the optimality condition of (13):

\bar{g} + \bar{\rho} + \gamma\bar{\eta} = \sqrt{p} \circ (g + \rho + \gamma\eta) = 0,

where the first equality is from (11) and the second is from (15).

Theorem 2 (Convergence Rate). Given any reference solution w_ref, the sequence {w_t}_{t=1}^∞ satisfies

E[\bar{F}(w_t)] \le \bar{F}(w_{ref}) + \frac{2\gamma\|w_{ref}\|^2}{k},   (18)

where k = max{t − c, 0} and c = 2(F̄(0) − F̄(w_ref)) / (γ‖w_ref‖²) is a constant.

Proof. First, equality actually holds in inequality (16), since for h ∉ A^(t−1) we have w_t(h) = 0, which implies λ|w_t(h) + η| − λ|w_t(h)| = ρη with ρ ∈ ∂(λ|w_t(h) + η|), and for h ∈ A^(t−1) we have η̄_h = 0, which gives 0 on both the LHS and the RHS. Therefore, we have

-\frac{\gamma}{2}\|\bar{\eta}\|^2 = \min_{\eta} \ \lambda\|\sqrt{p} \circ (w_t + \eta)\|_1 - \lambda\|\sqrt{p} \circ w_t\|_1 + \bar{g}^T\eta + \frac{\gamma}{2}\|\eta\|^2.   (19)

Note that the minimization in (19) is separable over coordinates. For h ∈ A^(t−1), the weight w_t(h) is already optimal at the beginning of iteration t, so we have ρ̄_h + ḡ_h = 0 for some ρ̄_h ∈ ∂(λ|√(p(h)) w(h)|). 
Therefore, η_h = 0 for h ∈ A^(t−1) is optimal both for the term (λ|√(p(h))(w(h) + η_h)| + ḡ_h η_h) and for the term (γ/2)η_h². Setting η_h = 0 for those coordinates, we have

-\frac{\gamma}{2}\|\bar{\eta}\|^2 = \min_{\eta} \ \lambda\|\sqrt{p}\circ(w_t+\eta)\|_1 - \lambda\|\sqrt{p}\circ w_t\|_1 + \langle\bar{g},\eta\rangle + \frac{\gamma}{2}\int_{h\notin A^{(t-1)}}\eta_h^2\,dh

\le \min_{\eta} \ \bar{F}(w_t+\eta) - \bar{F}(w_t) + \frac{\gamma}{2}\int_{h\notin A^{(t-1)}}\eta_h^2\,dh

from the convexity of F̄(w). Considering solutions of the form η = α(w_ref − w_t), we have

-\frac{\gamma}{2}\|\bar{\eta}\|^2 \le \min_{\alpha\in[0,1]} \ \bar{F}\big(w_t + \alpha(w_{ref}-w_t)\big) - \bar{F}(w_t) + \frac{\gamma\alpha^2}{2}\int_{h\notin A^{(t-1)}}\big(w_{ref}(h)-w_t(h)\big)^2\,dh

\le \min_{\alpha\in[0,1]} \ -\alpha\big(\bar{F}(w_t) - \bar{F}(w_{ref})\big) + \frac{\gamma\alpha^2}{2}\int_{h\notin A^{(t-1)}} w_{ref}(h)^2\,dh

\le \min_{\alpha\in[0,1]} \ -\alpha\big(\bar{F}(w_t) - \bar{F}(w_{ref})\big) + \frac{\gamma\alpha^2}{2}\|w_{ref}\|^2,

where the second inequality results from w_t(h) = 0 for h ∉ A^(t−1). 
Minimizing the last expression w.r.t. α, we have α^* = min{ (F̄(w_t) − F̄(w_ref)) / (γ‖w_ref‖²), 1 }, and therefore

-\frac{\gamma}{2}\|\bar{\eta}\|^2 \le \begin{cases} -\big(\bar{F}(w_t)-\bar{F}(w_{ref})\big)^2 / \big(2\gamma\|w_{ref}\|^2\big), & \text{if } \bar{F}(w_t)-\bar{F}(w_{ref}) < \gamma\|w_{ref}\|^2, \\ -\frac{\gamma}{2}\|w_{ref}\|^2, & \text{otherwise.} \end{cases}   (20)

Note that, since the function values {F̄(w_t)}_{t=1}^∞ are non-increasing, only iterations at the beginning fall into the second case of (20), and the number of such iterations is at most c = ⌈2(F̄(0) − F̄(w_ref)) / (γ‖w_ref‖²)⌉. For t > c, we have

E[\bar{F}(w_{t+1})] - \bar{F}(w_t) \le -\frac{\gamma\|\bar{\eta}_t\|^2}{2} \le -\frac{\big(\bar{F}(w_t) - \bar{F}(w_{ref})\big)^2}{2\gamma\|w_{ref}\|^2}.   (21)

The recursion then leads to the result.

Note that the above bound does not yield a useful result if ‖w_ref‖² → ∞. Fortunately, the optimal solution of our target problem (7) has finite ‖w^*‖₂ as long as λ > 0 in (7), so it always gives a useful bound when plugged into (18), as the following corollary shows.

Corollary 1 (Approximation Guarantee). The output of Algorithm 1 satisfies

E\Big[\lambda\|\bar{w}^{(D)}\|_1 + Loss(\bar{w}^{(D)}; \phi)\Big] \le \Big\{\lambda\|w^*\|_2 + Loss(w^*; \bar{\phi})\Big\} + \frac{2\gamma\|w^*\|_2^2}{D'},   (22)

with D' = max{D − c, 0}, where w^* is the optimal solution of problem (7) and c is the constant defined in Theorem 2.

The following two corollaries extend the guarantee (22) to any R ≥ 1, and to a bound that holds with high probability. The latter is a direct result of [18, Theorem 1] applied to the recursion (21).

Corollary 2. The bound (22) holds for any R ≥ 1 in Algorithm 1, where if there are T iterations then D = TR.

Corollary 3. For D \ge \frac{2\gamma\|w^*\|_2^2}{\epsilon}\big(1 + \log\frac{1}{\rho}\big) + 2 + c, the output of Algorithm 1 has

\lambda\|\bar{w}^{(D)}\|_1 + Loss(\bar{w}^{(D)}; \phi) \le \Big\{\lambda\|w^*\|_2 + Loss(w^*; \bar{\phi})\Big\} + \epsilon   (23)

with probability 1 − ρ, where c is as defined in Theorem 2 and w^* is the optimal solution of (7).

3.2 Relation to the Kernel Method

Our result (23) states that, for D large enough, the Sparse Random Features algorithm achieves either a loss comparable to that of the vanilla kernel method, or a model complexity (measured in ℓ1-norm) less than that of the kernel method (measured in ℓ2-norm). Furthermore, since w^* is not the optimal solution of the ℓ1-regularized program (8), it is possible for the LHS of (23) to be much smaller than the RHS. On the other hand, since any w^* of finite ℓ2-norm can serve as the reference solution w_ref, the λ used in solving the ℓ1-regularized problem (8) can be different from the λ used in the kernel method. The tightest bound is achieved by minimizing the RHS of (23), which is equivalent to minimizing (7) with some unknown λ̃(λ) due to the difference between ‖w‖₁ and ‖w‖₂². In practice, we can follow a regularization path to find a λ small enough to yield comparable predictive performance while keeping the model as compact as possible. Note that, when using different sampling distributions p(h) from the decomposition (3), our analysis provides different bounds (23) for Randomized Coordinate Descent in Hilbert space. 
This is in contrast to the analysis in the finite-dimensional case, where RCD with different sampling distributions converges to the same solution [18].

3.3 Relation to the Boosting Method

Boosting is a well-known approach to minimizing infinite-dimensional problems with ℓ1-regularization [8, 9], and in this setting it performs greedy coordinate descent on (8). At each iteration t, the algorithm finds the coordinate h^(t) yielding the steepest descent in the loss term,

h^{(t)} = \arg\min_{h \in H} \ \frac{1}{N}\sum_{n=1}^N L'_n\,\phi_h(x_n),   (24)

to add into a working set A_t, and then minimizes (8) w.r.t. A_t. When the greedy step (24) can be solved exactly, Boosting converges quickly to the optimal solution of (8) [13, 14]. On the contrary, randomized coordinate descent can only converge to a sub-optimal solution in finite time when there are an infinite number of dimensions. However, in practice, only a very limited class of basis functions allows the greedy step (24) to be performed exactly. For most basis functions (weak learners), such as perceptrons and decision trees, the greedy step (24) can only be solved approximately. In such cases, Boosting might have no convergence guarantee, while the randomized approach is still guaranteed to find a solution comparable to that of the kernel method. In our experiments, we found that randomized coordinate descent performs considerably better than approximate Boosting with perceptron basis functions (weak learners), where, as adopted in the Boosting literature [19, 8], a convex surrogate loss is used to solve (24) approximately.

4 Experiments

In this section, we compare Sparse Random Features (Sparse-RF) to the existing Random Features algorithm (RF) and the kernel method (Kernel) on regression and classification problems, with kernels set to the Gaussian RBF, Laplacian RBF [2], and Perceptron kernel [7].¹ 
For the Gaussian and Laplacian RBF kernels, we use Fourier basis functions with the corresponding distribution p(h) derived in [2]; for the Perceptron kernel, we use perceptron basis functions with p(h) uniform over the unit sphere, as shown in [7]. For regression, we solve kernel ridge regression (1) and RF regression (5) in closed form as in [10], using Eigen, a standard C++ library for numerical linear algebra. For Sparse-RF, we solve the LASSO sub-problem (9) by a standard RCD algorithm. For classification, we use LIBSVM² as the solver for the kernel method, and use the Newton-CG method and the Coordinate Descent method in LIBLINEAR [12] to solve the RF approximation (5) and the Sparse-RF sub-problem (9), respectively. We set λ_N = Nλ = 1 for the kernel and RF methods, and for Sparse-RF we choose the λ_N ∈ {1, 10, 100, 1000} that gives the RMSE (accuracy) closest to the RF method, in order to compare sparsity and efficiency. The results are shown in Tables 1 and 2, where the cost of the kernel method grows at least quadratically with the number of training samples. For YearPred, we use D = 5000 to maintain tractability of the RF method. Note that for the Covtype dataset, the ℓ2-norm ‖w^*‖₂ from the kernel machine is significantly larger than for the other datasets, so according to (22), a larger number of random features D is required to obtain similar performance, as shown in Figure 1.

In Figure 1, we compare Sparse-RF (randomized coordinate descent) to Boosting (greedy coordinate descent) and the bound (23) obtained from the SVM with the Perceptron kernel and basis function (weak learner). 
The \ufb01gure shows that Sparse-RF always converges to a solution comparable to that of\nthe kernel method, while Boosting with approximate greedy steps (using convex surrogate loss)\nconverges to a higher objective value, due to bias from the approximation.\n\nAcknowledgement\n\nS.-D.Lin acknowledges the support of Telecommunication Lab., Chunghwa Telecom Co., Ltd via TL-103-\n8201, AOARD via No. FA2386-13-1-4045, Ministry of Science and Technology, National Taiwan University\nand Intel Co. via MOST102-2911-I-002-001, NTU103R7501, 102-2923-E-002-007-MY2, 102-2221-E-002-\n170, 103-2221-E-002-104-MY2. P.R. acknowledges the support of ARO via W911NF-12-1-0390 and NSF via\nIIS-1149803, IIS-1320894, IIS-1447574, and DMS-1264033. This research was also supported by NSF grants\nCCF-1320746 and CCF-1117055.\n\n2Data set for classi\ufb01cation can be downloaded from LIBSVM data set web page, and data set for regression can be found at UCI Machine\n\nLearning Repository and Ali Rahimi\u2019s page for the paper [2].\n\n2We follow the FAQ page of LIBSVM to replace hinge-loss by square-hinge-loss for comparison.\n\n7\n\n\fTable 1: Results for Kernel Ridge Regression. 
Fields in each entry are model size (# of support vectors, # of random features, or # of non-zero weights, respectively), testing RMSE, training time (Ttr), prediction time (Tt), and memory usage during training (Mem); "#" denotes entries not reported for the kernel method.

CPU (Ntr = 6554, Nt = 819, d = 21)
  Gaussian RBF:  Kernel: SV=6554, RMSE=0.038, Ttr=154 s, Tt=2.59 s, Mem=1.36 G | RF: D=10000, 0.037, 875 s, 6 s, 4.71 G | Sparse-RF: NZ=57, 0.032, 22 s, 0.04 s, 0.069 G
  Laplacian RBF: Kernel: SV=6554, 0.034, 157 s, 3.13 s, 1.35 G | RF: D=10000, 0.035, 803 s, 6.99 s, 4.71 G | Sparse-RF: NZ=289, 0.027, 43 s, 0.18 s, 0.095 G
  Perceptron:    Kernel: SV=6554, 0.026, 151 s, 2.48 s, 1.36 G | RF: D=10000, 0.038, 776 s, 6.37 s, 4.71 G | Sparse-RF: NZ=251, 0.027, 27 s, 0.13 s, 0.090 G

Census (Ntr = 18186, Nt = 2273, d = 119)
  Gaussian RBF:  Kernel: SV=18186, 0.029, 2719 s, 74 s, 10 G | RF: D=10000, 0.032, 1615 s, 80 s, 8.2 G | Sparse-RF: NZ=1174, 0.030, 229 s, 8.6 s, 0.55 G
  Laplacian RBF: Kernel: SV=18186, 0.146, 3268 s, 68 s, 10 G | RF: D=10000, 0.168, 1633 s, 88 s, 8.2 G | Sparse-RF: NZ=5269, 0.179, 225 s, 38 s, 1.7 G
  Perceptron:    Kernel: SV=18186, 0.010, 2674 s, 67.45 s, 10 G | RF: D=10000, 0.016, 1587 s, 76 s, 8.2 G | Sparse-RF: NZ=976, 0.016, 185 s, 6.7 s, 0.49 G

YearPred (Ntr = 463715, Nt = 51630, d = 90)
  Gaussian RBF:  Kernel: # | RF: D=5000, 0.103, 7697 s, 697 s, 76.7 G | Sparse-RF: NZ=1865, 0.104, 1618 s, 97 s, 45.6 G
  Laplacian RBF: Kernel: # | RF: D=5000, 0.286, 9417 s, 715 s, 76.6 G | Sparse-RF: NZ=3739, 0.273, 1453 s, 209 s, 54.3 G
  Perceptron:    Kernel: # | RF: D=5000, 0.105, 8636 s, 688 s, 76.7 G | Sparse-RF: NZ=896, 0.105, 680 s, 51 s, 38.1 G

Table 2: Results for Kernel Support Vector Machine. 
Fields in each entry are model size (# of support vectors, # of random features, or # of non-zero weights, respectively), testing accuracy, training time, prediction time, and memory usage during training.

Cod-RNA (Ntr = 59535, Nt = 10000, d = 8)
  Gaussian RBF:  Kernel: SV=14762, Acc=0.966, Ttr=95 s, Tt=15 s, Mem=3.8 G | RF: D=10000, 0.964, 214 s, 56 s, 9.5 G | Sparse-RF: NZ=180, 0.964, 180 s, 0.61 s, 0.66 G
  Laplacian RBF: Kernel: SV=13769, 0.971, 89 s, 15 s, 3.6 G | RF: D=10000, 0.969, 290 s, 46 s, 9.6 G | Sparse-RF: NZ=1195, 0.970, 137 s, 6.41 s, 1.8 G
  Perceptron:    Kernel: SV=15201, 0.967, 57.34 s, 7.01 s, 3.6 G | RF: D=10000, 0.964, 197 s, 71.9 s, 9.6 G | Sparse-RF: NZ=1148, 0.963, 131 s, 3.81 s, 1.4 G

IJCNN (Ntr = 127591, Nt = 14100, d = 22)
  Gaussian RBF:  Kernel: SV=16888, 0.991, 636 s, 34 s, 12 G | RF: D=10000, 0.989, 601 s, 88 s, 20 G | Sparse-RF: NZ=1392, 0.989, 292 s, 11 s, 7.5 G
  Laplacian RBF: Kernel: SV=16761, 0.995, 988 s, 34 s, 12 G | RF: D=10000, 0.992, 379 s, 86 s, 20 G | Sparse-RF: NZ=2508, 0.992, 566 s, 25 s, 9.9 G
  Perceptron:    Kernel: SV=26563, 0.991, 634 s, 16 s, 11 G | RF: D=10000, 0.987, 381 s, 77 s, 20 G | Sparse-RF: NZ=1530, 0.988, 490 s, 11 s, 7.8 G

Covtype (Ntr = 464810, Nt = 116202, d = 54)
  Gaussian RBF:  Kernel: SV=335606, 0.849, 74891 s, 3012 s, 78.5 G | RF: D=10000, 0.829, 9909 s, 735 s, 74.7 G | Sparse-RF: NZ=3421, 0.836, 6273 s, 132 s, 28.1 G
  Laplacian RBF: Kernel: SV=224373, 0.954, 64172 s, 2004 s, 80.8 G | RF: D=10000, 0.888, 10170 s, 635 s, 74.6 G | Sparse-RF: NZ=3141, 0.869, 2788 s, 175 s, 56.5 G
  Perceptron:    Kernel: SV=358174, 0.905, 79010 s, 1774 s, 80.5 G | RF: D=10000, 0.835, 6969 s, 664 s, 74.7 G | Sparse-RF: NZ=1401, 0.836, 1706 s, 70 s, 44.4 G

Figure 1: The ℓ1-regularized objective (8) (top) and error rate (bottom) achieved by Sparse Random Features (randomized coordinate descent) and Boosting (greedy coordinate descent) using perceptron basis functions (weak learners). 
The dashed line shows the ℓ2-norm-plus-loss value achieved by the kernel method (the right-hand side of (22)) and the corresponding error rate, using the perceptron kernel [7].

[Figure 1 panels: objective vs. training time (top row) and test error vs. training time (bottom row) on Cod-RNA, IJCNN, and Covtype, with curves for Boosting, Sparse-RF, and the Kernel baseline.]

References

[1] Mercer, J. Functions of positive and negative type and their connection with the theory of integral equations. Royal Society London, A 209:415-446, 1909.

[2] Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In NIPS 20, 2007.

[3] Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS 21, 2008.

[4] Vedaldi, A. and Zisserman, A. Efficient additive kernels via explicit feature maps. In CVPR, 2010.

[5] Kar, P. and Karnick, H. Random feature maps for dot product kernels. In Proceedings of AISTATS, pages 583-591, 2012.

[6] Yang, T., Li, Y.-F., Mahdavi, M., Jin, R., and Zhou, Z.-H. Nyström method vs. random Fourier features: A theoretical and empirical comparison. In NIPS, 2012.

[7] Lin, H.-T. and Li, L. Support vector machinery for infinite ensemble learning. JMLR, 2008.

[8] Rosset, S., Zhu, J., and Hastie, T. Boosting as a regularized path to a maximum margin classifier. JMLR, 2004.

[9] Rosset, S., Swirszcz, G., Srebro, N., and Zhu, J. ℓ1-regularization in infinite dimensional feature spaces.
In Learning Theory: 20th Annual Conference on Learning Theory, 2007.

[10] Le, Q., Sarlos, T., and Smola, A. J. Fastfood: Approximating kernel expansions in loglinear time. In the 30th International Conference on Machine Learning, 2013.

[11] Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2011.

[12] Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., and Lin, C.-J. LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research, 9:1871-1874, 2008.

[13] Rätsch, G., Mika, S., and Warmuth, M. K. On the convergence of leveraging. In NIPS, 2001.

[14] Telgarsky, M. The fast convergence of boosting. In NIPS, 2011.

[15] Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. Annals of Mathematical Statistics, 41:495-502, 1970.

[16] Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, Cambridge, MA, 2002.

[17] Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.

[18] Richtárik, P. and Takáč, M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. Tech. Rep., School of Mathematics, University of Edinburgh, 2011.

[19] Chen, S.-T., Lin, H.-T., and Lu, C.-J. An online boosting algorithm with theoretical justifications. In ICML, 2012.

[20] Taskar, B., Guestrin, C., and Koller, D. Max-margin Markov networks. In NIPS 16, 2004.

[21] Song, G. et al. Reproducing kernel Banach spaces with the ℓ1 norm.
Journal of Applied and Computational Harmonic Analysis, 2011.
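For readers unfamiliar with the random-feature baseline (the RF columns of Tables 1 and 2), the following is a minimal sketch of the random Fourier feature construction of Rahimi and Recht [2] for the Gaussian RBF kernel. It is illustrative only, not the code used in the experiments; the function name `rff_map` and all sizes and parameters here are assumptions for the example.

```python
import numpy as np

def rff_map(X, D, gamma, rng):
    """Map data X (n x d) to D random Fourier features whose inner
    products approximate the Gaussian kernel
        k(x, y) = exp(-gamma * ||x - y||^2)."""
    n, d = X.shape
    # Bochner sampling for the Gaussian kernel: frequencies W ~ N(0, 2*gamma*I),
    # random phases b ~ Uniform[0, 2*pi).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    # z(x) = sqrt(2/D) * cos(W^T x + b), so E[z(x)^T z(y)] = k(x, y).
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Z = rff_map(X, D=5000, gamma=0.5, rng=rng)

# Compare the approximate kernel Z Z^T against the exact Gaussian kernel.
K_exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_approx = Z @ Z.T
print(np.abs(K_exact - K_approx).max())  # shrinks as D grows, at rate O(1/sqrt(D))
```

The O(1/sqrt(D)) Monte-Carlo error of this plain estimator is what the paper's coordinate-descent view improves upon: the Sparse-RF columns keep only the NZ features with non-zero weight under the ℓ1 penalty, which is why their prediction time and memory are far below the dense RF baseline at comparable accuracy.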