{"title": "Scalable Non-linear Learning with Adaptive Polynomial Expansions", "book": "Advances in Neural Information Processing Systems", "page_first": 2051, "page_last": 2059, "abstract": "Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.", "full_text": "Scalable Nonlinear Learning with\nAdaptive Polynomial Expansions\n\nAlekh Agarwal\nMicrosoft Research\n\nalekha@microsoft.com\n\nAlina Beygelzimer\n\nYahoo! Labs\n\nbeygel@yahoo-inc.com\n\nDaniel Hsu\n\nColumbia University\n\ndjhsu@cs.columbia.edu\n\nJohn Langford\n\nMicrosoft Research\n\njcl@microsoft.com\n\nMatus Telgarsky\u2217\nRutgers University\n\nmtelgars@cs.ucsd.edu\n\nAbstract\n\nCan we effectively learn a nonlinear representation in time comparable to linear\nlearning? We describe a new algorithm that explicitly and adaptively expands\nhigher-order interaction features over base linear representations. The algorithm\nis designed for extreme computational ef\ufb01ciency, and an extensive experimental\nstudy shows that its computation/prediction tradeoff ability compares very favor-\nably against strong baselines.\n\n1\n\nIntroduction\n\nWhen faced with large datasets, it is commonly observed that using all the data with a simpler\nalgorithm is superior to using a small fraction of the data with a more computationally intense but\npossibly more effective algorithm. The question becomes: What is the most sophisticated algorithm\nthat can be executed given a computational constraint?\nAt the largest scales, Na\u00efve Bayes approaches offer a simple, easily distributed single-pass algo-\nrithm. 
A more computationally difficult, but commonly better-performing approach is large scale\nlinear regression, which has been effectively parallelized in several ways on real-world large scale\ndatasets [1, 2]. Is there a modestly more computationally difficult approach that allows us to\ncommonly achieve superior statistical performance?\nThe approach developed here starts with a fast parallelized online learning algorithm for linear models,\nand explicitly and adaptively adds higher-order interaction features over the course of training,\nusing the learned weights as a guide. The resulting space of polynomial functions increases the\napproximation power over the base linear representation at a modest increase in computational cost.\nSeveral natural folklore baselines exist. For example, it is common to enrich feature spaces with\nn-grams or low-order interactions. These approaches are naturally computationally appealing, because\nthese nonlinear features can be computed on-the-fly avoiding I/O bottlenecks. With I/O bottlenecked\ndatasets, this can sometimes even be done so efficiently that the additional computational complexity\nis negligible, so improving over this baseline is quite challenging.\nThe design of our algorithm is heavily influenced by considerations for computational efficiency, as\ndiscussed further in Section 2. Several alternative designs are plausible but fail to provide adequate\ncomputation/prediction tradeoffs or even outperform the aforementioned folklore baselines. An\nextensive experimental study in Section 3 compares efficient implementations of these baselines with\nthe proposed mechanism and gives strong evidence of the latter\u2019s dominant computation/prediction\ntradeoff ability (see Figure 1 for an illustrative summary).\n\n\u2217This work was performed while MT was visiting Microsoft Research, NYC.\n\nFigure 1: Computation/prediction tradeoff points using non-adaptive polynomial expansions and\nadaptive polynomial expansions (apple). The markers are positioned at the coordinate-wise median\nof (relative error, relative time) over 30 datasets, with bars extending to 25th and 75th\npercentiles. See Section 3 for definition of relative error and relative time used here.\n\nAlthough it is notoriously difficult to analyze nonlinear algorithms, it turns out that two aspects\nof this algorithm are amenable to analysis. First, we prove a regret bound showing that we can\neffectively compete with a growing feature set. Second, we exhibit simple problems where this\nalgorithm is effective, and discuss a worst-case consistent variant. We point the reader to the full\nversion [3] for more details.\n\nRelated work. This work considers methods for enabling nonlinear learning directly in a highly-scalable\nlearning algorithm. Starting with a fast algorithm is desirable because it more naturally\nallows one to improve statistical power by spending more computational resources until a computational\nbudget is exhausted. In contrast, many existing techniques start with a (comparably) slow\nmethod (e.g., kernel SVM [4], batch PCA [5], batch least-squares regression [5]), and speed it up by\nsacrificing statistical power, often just to allow the algorithm to run at all on massive data sets.\nA standard alternative to explicit polynomial expansions is to employ polynomial kernels with the\nkernel trick [6]. While kernel methods generally have computation scaling at least quadratically\nwith the number of training examples, a number of approximation schemes have been developed to\nenable a better tradeoff. The Nystr\u00f6m method (and related techniques) can be used to approximate\nthe kernel matrix while permitting faster training [4]. 
However, these methods still suffer from the\ndrawback that the model size after n examples is typically O(n). As a result, even single pass online\nimplementations [7] typically suffer from O(n^2) training and O(n) testing time complexity.\nAnother class of approximation schemes for kernel methods involves random embeddings into a\nhigh (but finite) dimensional Euclidean space such that the standard inner product there approximates\nthe kernel function [8\u201311]. Recently, such schemes have been developed for polynomial kernels\n[9\u201311] with computational scaling roughly linear in the polynomial degree. However, for many\nsparse, high-dimensional datasets (such as text data), the embedding of [10] creates dense, high-dimensional\nexamples, which leads to a substantial increase in computational complexity. Moreover,\nneither of the embeddings from [9, 10] exhibits good statistical performance unless combined with\ndense linear dimension reduction [11], which again results in dense vector computations. Such feature\nconstruction schemes are also typically unsupervised, while the method proposed here makes\nuse of label information.\nAmong methods proposed for efficiently learning polynomial functions [12\u201316], all but [13] are\nbatch algorithms. The method of [13] uses online optimization together with an adaptive rule for\ncreating interaction features. A variant of this is discussed in Section 2 and is used in the experimental\nstudy in Section 3 as a baseline.\n\nAlgorithm 1 Adaptive Polynomial Expansion (apple)\ninput Initial features S_1 = {x_1, . . . , x_d}, expansion sizes (s_k), epoch schedule (\u03c4_k), stepsizes (\u03b7_t).\n1: Initial weights w_1 := 0, initial epoch k := 1, parent set P_1 := \u2205.\n2: for t = 1, 2, . . . do\n3:   Receive stochastic gradient g_t.\n4:   Update weights: w_{t+1} := w_t \u2212 \u03b7_t [g_t]_{S_k},\n     where [\u00b7]_{S_k} denotes restriction to monomials in the feature set S_k.\n5:   if t = \u03c4_k then\n6:     Let M_k \u2286 S_k be the top s_k monomials m(x) \u2208 S_k such that m(x) \u2209 P_k, ordered from\n       highest-to-lowest by the weight magnitude in w_{t+1}.\n7:     Expand feature set:\n       S_{k+1} := S_k \u222a {x_i \u00b7 m(x) : i \u2208 [d], m(x) \u2208 M_k}, and\n       P_{k+1} := P_k \u222a {m(x) : m(x) \u2208 M_k}.\n8:     k := k + 1.\n9:   end if\n10: end for\n\n2 Adaptive polynomial expansions\n\nThis section describes our new learning algorithm, apple.\n\n2.1 Algorithm description\n\nThe pseudocode is given in Algorithm 1. The algorithm proceeds as stochastic gradient descent over\nthe current feature set to update a weight vector. At specified times \u03c4k, the feature set Sk is expanded\nto Sk+1 by taking the top monomials in the current feature set, ordered by weight magnitude in the\ncurrent weight vector, and creating interaction features between these monomials and x. Care is\nexercised to not repeatedly pick the same monomial for creating higher-order monomials by tracking\na parent set Pk, the set of all monomials for which higher degree terms have been expanded. We\nprovide more intuition for our choice of this feature growing heuristic in Section 2.3.\nThere are two benefits to this staged process. Computationally, the stages allow us to amortize the\ncost of adding monomials\u2014which is implemented as an expensive dense operation\u2014over\nseveral other (possibly sparse) operations. Statistically, using stages guarantees that the monomials\nadded in the previous stage have an opportunity to have their corresponding parameters converge.\nWe have found it empirically effective to set sk := average \u2016[gt]S1\u20160, and to update the feature set\nat a constant number of equally-spaced times over the entire course of learning. 
In this case, the\nnumber of updates (plus one) bounds the maximum degree of any monomial in the final feature set.\n\n2.2 Shifting comparators and a regret bound for regularized objectives\n\nStandard regret bounds compare the cumulative loss of an online learner to the cumulative loss of a\nsingle predictor (comparator) from a fixed comparison class. Shifting regret is a more general notion\nof regret, where the learner is compared to a sequence of comparators u1, u2, . . . , uT .\nExisting shifting regret bounds can be used to loosely justify the use of online gradient descent\nmethods over expanding feature spaces [17]. These bounds are roughly of the form\n$\\sum_{t=1}^{T} f_t(w_t) - f_t(u_t) \\lesssim \\sqrt{T \\sum_{t<T} \\|u_t - u_{t+1}\\|}$, where ut is allowed to use the same features available to wt,\nand ft is the convex cost function in step t. This suggests a relatively high cost for a substantial\ntotal change in the comparator, and thus in the feature space. Given a budget, one could either do a\nliberal expansion a small number of times, or opt for including a small number of carefully chosen\nmonomials more frequently. We have found that the computational cost of carefully picking a small\nnumber of high quality monomials is often quite high. With computational considerations at the\nforefront, we will prefer a more liberal but infrequent expansion. 
This also effectively exposes the\nlearning algorithm to a large number of nonlinearities quickly, allowing their parameters to jointly\nconverge between the stages.\nIt is natural to ask if better guarantees are possible under some structure on the learning problem.\nHere, we consider the stochastic setting (rather than the harsher adversarial setting of [17]), and\nfurther assume that our objective takes the form\n\n$f(w) := \\mathbb{E}[\\ell(\\langle w, xy \\rangle)] + \\lambda \\|w\\|^2 / 2,$ (1)\n\nwhere the expectation is under the (unknown) data generating distribution D over (x, y) \u2208 S \u00d7 R,\nand \u2113 is some convex loss function on which suitable restrictions will be placed. Here S is such\nthat S1 \u2286 S2 \u2286 . . . \u2286 S, based on the largest degree monomials we intend to expand. We assume\nthat in round t, we observe a stochastic gradient of the objective f, which is typically done by first\nsampling (xt, yt) \u223c D and then evaluating the gradient of the regularized objective on this sample.\nThis setting has some interesting structural implications over the general setting of online learning\nwith shifting comparators. First, the fixed objective f gives us a more direct way of tracking the\nchange in comparator through f (ut) \u2212 f (ut+1), which might often be milder than \u2016ut \u2212 ut+1\u2016.\nIn particular, if ut = arg min_{u \u2208 Sk} f (u) in epoch k, for a nested subspace sequence Sk, then we\nimmediately obtain f (ut+1) \u2264 f (ut). Second, the strong convexity of the regularized objective\nenables the possibility of faster O(1/T ) rates than prior work [17]. Indeed, in this setting, we obtain\nthe following stronger result. We use the shorthand Et[\u00b7] to denote the conditional expectation at\ntime t, conditioning over the data from rounds 1, . . . , t \u2212 1.\nTheorem 1. 
Let a distribution over (x, y), twice differentiable convex loss \u2113 with \u2113 \u2265 0 and\nmax{\u2113\u2032, \u2113\u2032\u2032} \u2264 1, and a regularization parameter \u03bb > 0 be given. Recall the definition (1) of\nthe objective f. Let (wt, gt)t\u22651 be as specified by apple with step size \u03b7t := 1/(\u03bb(t + 1)), where\nEt([gt]S(t) ) = [\u2207f (wt)]S(t) and S(t) is the support set corresponding to epoch kt at time t in\napple. Then for any comparator sequence (ut)_{t=1}^{\u221e} satisfying ut \u2208 S(t), for any fixed T \u2265 1,\n\n$\\mathbb{E}\\left( f(w_{T+1}) - \\frac{\\sum_{t=1}^{T} (t+2) f(u_t)}{\\sum_{t=1}^{T} (t+2)} \\right) \\le \\frac{1}{T+1} \\left( \\frac{(X^2 + \\lambda)(X + \\lambda D)^2}{2\\lambda^2} \\right),$\n\nwhere X \u2265 maxt \u2016xtyt\u2016 and D \u2265 maxt max{\u2016wt\u2016, \u2016ut\u2016}.\nQuite remarkably, the result exhibits no dependence on the cumulative shifting of the comparators\nunlike existing bounds [17]. This is the first result of this sort amongst shifting bounds to the best of\nour knowledge, and the only one that yields 1/T rates of convergence even with strong convexity. Of\ncourse, we limit ourselves to the stochastic setting, and prove expected regret guarantees on the final\npredictor wT as opposed to a bound on $\\sum_{t=1}^{T} f(w_t)/T$. A curious distinction is our comparator,\nwhich is a weighted average of f (ut) as opposed to the more standard uniform average. Recalling\nthat f (ut+1) \u2264 f (ut) in our setting, this is a strictly harder benchmark than an unweighted average\nand overemphasizes the later comparator terms which are based on larger support sets. 
Indeed, this\nis a nice compromise between competing against uT , which is the hardest yardstick, and u1, which\nis what a standard non-shifting analysis compares to. Indeed our improvement can be partially\nattributed to the stability of the averaged f values as opposed to just f (uT ) (more details in [3]).\nOverall, this result demonstrates that in our setting, while there is generally a cost to be paid for\nshifting the comparator too much, it can still be effectively controlled in favorable cases. One\nproblem for future work is to establish these fast 1/T rates also with high probability.\nNote that the regret bound offers no guidance on how or when to select new monomials to add.\n\n2.3 Feature expansion heuristics\n\nPrevious work on learning sparse polynomials [13] suggests that it is possible to anticipate the utility\nof interaction features before even evaluating them. For instance, one of the algorithms from [13]\norders monomials m(x) by an estimate of E[r(x)^2 m(x)^2]/E[m(x)^2], where r(x) = E[y|x] \u2212 f\u0302(x)\nis the residual of the current predictor f\u0302 (for least-squares prediction of the label y). Such an index\nis shown to be related to the potential error reduction by polynomials with m(x) as a factor. We call\nthis the SSM heuristic (after the authors of [13], though it differs from their original algorithm).\nAnother plausible heuristic, which we use in Algorithm 1, simply orders the monomials in Sk by\ntheir weight magnitude in the current weight vector. We can justify this weight heuristic in the\nfollowing simple example. Suppose a target function E[y|x] is just a single monomial in x, say,\n$m(x) := \\prod_{i \\in M} x_i$ for some M \u2286 [d], and that x has a product distribution over {0, 1}^d with 0 <\nE[xi] =: p \u2264 1/2 for all i \u2208 [d]. Suppose we repeatedly perform 1-sparse regression with the current\nfeature set Sk, and pick the top weight magnitude monomial for inclusion in the parent set Pk+1. 
It\nis easy to show that the weight on a degree \u2113 sub-monomial of m(x) in this regression is p^{|M|\u2212\u2113}, and\nthe weight is strictly smaller for any term which is not a proper sub-monomial of m(x). Thus we\nrepeatedly pick the largest available sub-monomial of m(x) and expand it, eventually discovering\nm(x). After k stages of the algorithm, we have at most kd features in our regression here, and\nhence we find m(x) with a total of d|M| variables in our regression, as opposed to d^{|M|} which\ntypical feature selection approaches would need. This intuition can be extended more generally to\nscenarios where we do not necessarily do a sparse regression and beyond product distributions, but\nwe find that even this simplest example illustrates the basic motivations underlying our choice\u2014we\nwant to parsimoniously expand on top of a base feature set, while still making progress towards a\ngood polynomial for our data.\n\n2.4 Fall-back risk-consistency\n\nNeither the SSM heuristic nor the weight heuristic is rigorously analyzed (in any generality). Despite\nthis, the basic algorithm apple can be easily modified to guarantee a form of risk consistency,\nregardless of which feature expansion heuristic is used. Consider the following variant of the support\nupdate rule in the algorithm apple. Given the current feature budget sk, we add sk \u2212 1 monomials\nordered by weight magnitudes as in Step 7. We also pick a monomial m(x) of the smallest degree\nsuch that m(x) \u2209 Pk. Intuitively, this ensures that all degree 1 terms are in Pk after d stages,\nall degree 2 terms are in Pk after k = O(d^2) stages and so on. In general, it is easily seen that\nk = O(d^{\u2113\u22121}) ensures that all degree \u2113 \u2212 1 monomials are in Pk and hence all degree \u2113 monomials\nare in Sk. 
For ease of exposition, let us assume that sk is set to be a constant s independent of k.\nThen the total number of monomials in Pk when k = O(d^{\u2113\u22121}) is O(s d^{\u2113\u22121}), which means the total\nnumber of features in Sk is O(s d^{\u2113}).\nSuppose we were interested in competing with all \u03b3-sparse polynomials of degree \u2113. The most direct\napproach would be to consider the explicit enumeration of all monomials of degree up to \u2113, and then\nperform \u2113_1-regularized regression [18] or a greedy variable selection method such as OMP [19] as\nmeans of enforcing sparsity. This ensures consistent estimation with n = O(\u03b3 log d^{\u2113}) = O(\u03b3\u2113 log d)\nexamples. In contrast, we might need n = O(\u03b3(\u2113 log d + log s)) examples in the worst case using\nthis fall back rule, a minor overhead at best. However, in favorable cases, we stand to gain a lot when\nthe heuristic succeeds in finding good monomials rapidly. Since this is really an empirical question,\nwe will address it with our empirical evaluation.\n\n3 Experimental study\n\nWe now describe our empirical evaluation of apple.\n\n3.1 Implementation, experimental setup, and performance metrics\n\nIn order to assess the effectiveness of our algorithm, it is critical to build on top of an efficient\nlearning framework that can handle large, high-dimensional datasets. To this end, we implemented\napple in the Vowpal Wabbit (henceforth VW) open source machine learning software (footnote 1). VW is a\ngood framework for us, since it also natively supports quadratic and cubic expansions on top of the\nbase features. These expansions are done dynamically at run-time, rather than being stored and read\nfrom disk in the expanded form for computational considerations. To deal with these dynamically\nenumerated features, VW uses hashing to associate features with indices, mapping each feature to a\nb-bit index, where b is a parameter. 
The core learning algorithm is an online algorithm as assumed\nin apple, but uses refinements of the basic stochastic gradient descent update (e.g., [20\u201323]).\nWe implemented apple such that the total number of epochs was always 6 (meaning 5 rounds of\nadding new features). At the end of each epoch, the non-parent monomials with largest magnitude\nweights were marked as parents. Recall that the number of parents is modulated at s^\u03b1 for some\n\u03b1 > 0, with s being the average number of non-zero features per example in the dataset so far. We\nwill present experimental results with different choices of \u03b1, and we found \u03b1 = 1 to be a reliable\ndefault. Upon seeing an example, the features are enumerated on-the-fly by recursively expanding\nthe marked parents, taking products with base monomials. These operations are done in a way to\nrespect the sparsity (in terms of base features) of examples which many of our datasets exhibit.\n\nFootnote 1: Please see https://github.com/JohnLangford/vowpal_wabbit and the associated git\nrepository, where -stage_poly and related command line options execute apple.\n\nFigure 2: Dataset CDFs across all 30 datasets: (a) relative test error, (b) relative training time (log\nscale). {apple, ssm} refer to the \u03b1 = 1 default; {apple, ssm}-best picks best \u03b1 per dataset.\n\nSince the benefits of nonlinear learning over linear learning themselves are very dataset dependent,\nand furthermore can vary greatly for different heuristics based on the problem at hand, we found it\nimportant to experiment with a large testbed consisting of a diverse collection of medium and\nlarge-scale datasets. 
To this end, we compiled a collection of 30 publicly available datasets, across a number\nof KDDCup challenges, UCI repository and other common resources (detailed in the appendix).\nFor all the datasets, we tuned the learning rate for each learning algorithm based on the progressive\nvalidation error (which is typically a reliable bound on test error) [24]. The number of bits in hashing\nwas set to 18 for all algorithms, apart from cubic polynomials, where using 24 bits for hashing was\nfound to be important for good statistical performance. For each dataset, we performed a random\nsplit with 80% of the data used for training and the remainder for testing. For all datasets, we used\nsquared-loss to train, and 0-1/squared-loss for evaluation in classification/regression problems. We\nalso experimented with \u2113_1 and \u2113_2 regularization, but these did not help much. The remaining settings\nwere left to their VW defaults.\nFor aggregating performance across 30 diverse datasets, it was important to use error and running\ntime measures on a scale independent of the dataset. Let \u2113, q and c refer to the test errors of linear,\nquadratic and cubic baselines respectively (with lin, quad, and cubic used to denote the baseline\nalgorithms themselves). For an algorithm alg, we compute the relative (test) error:\n\n$\\text{rel err}(\\text{alg}) = \\frac{\\text{err}(\\text{alg}) - \\min(\\ell, q, c)}{\\max(\\ell, q, c) - \\min(\\ell, q, c)},$ (2)\n\nwhere min(\u2113, q, c) is the smallest error among the three baselines on the dataset, and max(\u2113, q, c)\nis similarly defined. We also define the relative (training) time as the ratio to running time of lin:\nrel time(alg) = time(alg)/time(lin). With these definitions, the aggregated plots of relative\nerrors and relative times for the various baselines and our methods are shown in Figure 2. 
For each\nmethod, the plots show a cumulative distribution function (CDF) across datasets: an entry (a, b)\non the left plot indicates that the relative error for b datasets was at most a. The plots include the\nbaselines lin, quad, cubic, as well as a variant of apple (called ssm) that replaces the weight\nheuristic with the SSM heuristic, as described in Section 2.3. For apple and ssm, the plot shows\nthe results with the fixed setting of \u03b1 = 1, as well as the best setting chosen per dataset from\n\u03b1 \u2208 {0.125, 0.25, 0.5, 0.75, 1} (referred to as apple-best and ssm-best).\n\nFigure 3: Dataset CDFs across 13 datasets where time(quad) \u2265 2 time(lin): (a) relative test error,\n(b) relative training time (log scale).\n\n3.2 Results\n\nIn this section, we present some aggregate results. Detailed results with full plots and tables are\npresented in the appendix. In Figure 2(a), the relative error for all of lin, quad and cubic is\nalways to the right of 0 (due to the definition of rel err). In this plot, a curve enclosing a larger area\nindicates, in some sense, that one method uniformly dominates another. Since apple uniformly\ndominates ssm statistically (with only slightly longer running times), we restrict the remainder of\nour study to comparing apple to the baselines lin, quad and cubic. We found that on 12 of the\n30 datasets, the relative error was negative, meaning that apple beats all the baselines. A relative\nerror of 0.5 indicates that we cover at least half the gap between min(\u2113, q, c) and max(\u2113, q, c). We\nfind that we are below 0.5 on 27 out of 30 datasets for apple-best, and 26 out of the 30 datasets for\nthe setting \u03b1 = 1. 
This is particularly striking since the error min(\u2113, q, c) is attained by cubic on\na majority of the datasets (17 out of 30), where the relative error of cubic is 0. Hence, statistically\napple often outperforms even cubic, while typically using a much smaller number of features. To\nsupport this claim, we include in the appendix a plot of the average number of features per example\ngenerated by each method, for all datasets. Overall, we find the statistical performance of apple\nfrom Figure 2 to be quite encouraging across this large collection of diverse datasets.\nThe running time performance of apple is also extremely good. Figure 2(b) shows that the running\ntime of apple is within a factor of 10 of lin for almost all datasets, which is quite impressive\nconsidering that we generate a potentially much larger number of features. The gap between lin\nand apple is particularly small for several large datasets, where the examples are sparse and\nhigh-dimensional. In these cases, all algorithms are typically I/O-bottlenecked, which is the same for all\nalgorithms due to the dynamic feature expansions used. It is easily seen that the statistically efficient\nbaseline of cubic is typically computationally infeasible, with the relative time often being as large\nas 10^2 and 10^5 on the biggest dataset. Overall, the statistical performance of apple is competitive\nwith and often better than min(\u2113, q, c), and offers a nice intermediate in computational complexity.\nA surprise in Figure 2(b) is that quad appears to computationally outperform apple for a relatively\nlarge number of datasets, at least in aggregate. This is due to the extremely efficient implementation\nof quad in VW: on 17 of 30 datasets, the running time of quad is less than twice that of lin. 
While\nwe often statistically outperform quad on many of these smaller datasets, we are primarily interested\nin the larger datasets where the relative cost of nonlinear expansions (as in quad) is high.\nIn Figure 3, we restrict attention to the 13 datasets where time(quad)/time(lin) \u2265 2. On these\nlarger datasets, our statistical performance seems to dominate all the baselines (at least in terms\nof the CDFs, more on individual datasets will be said later). In terms of computational time, we\nsee that we are often much better than quad, and cubic is essentially infeasible on most of these\ndatasets. This demonstrates our key intuition that such adaptively chosen monomials are key to\neffective nonlinear learning in large, high-dimensional datasets.\nWe also experimented with picky algorithms of the sort mentioned in Section 2.2. We tried the\noriginal algorithm from [13], which tests a candidate monomial before adding it to the feature set Sk,\nrather than just testing candidate parent monomials for inclusion in Pk; and also a picky algorithm\nbased on our weight heuristic. Both algorithms were extremely computationally expensive, even\nwhen implemented using VW as a base: the explicit testing for inclusion in Sk (on a per-example\nbasis) caused too much overhead. We ruled out other baselines such as polynomial kernels for\nsimilar computational reasons.\n\nFigure 4: Comparison of different methods on the top 6 datasets by non-zero features per example:\n(a) relative test errors, (b) relative training times.\n\n                      lin       lin + apple   bigram    bigram + apple\nTest AUC              0.81664   0.81712       0.81757   0.81796\nTraining time (in s)  1282      2727          2755      7378\n\nTable 1: Test error and training times for different methods in a large-scale distributed setting. For\n{lin, bigram} + apple, we used \u03b1 = 0.25.\n\nTo provide more intuition, we also show individual results for the top 6 datasets with the highest\naverage number of non-zero features per example\u2014a key factor determining the computational cost\nof all approaches. In Figure 4, we show the performance of the lin, quad, cubic baselines, as well\nas apple with 5 different parameter settings in terms of relative error (Figure 4(a)) and relative time\n(Figure 4(b)). The results are overall quite positive. We see that on 3 of the datasets, we improve\nupon all the baselines statistically, and even on the other 3 the performance is quite close to the best of\nthe baselines with the exception of the cup98 dataset. In terms of running time, we find cubic to be\nextremely expensive in all the cases. We are typically faster than quad, and in the few cases where\nwe take longer, we also obtain a statistical improvement for the slight increase in computational\ncost. In conclusion, on larger datasets, the performance of our method is quite desirable.\nFinally, we also implemented a parallel version of our algorithm, building on the repeated averaging\napproach [2, 25], using the built-in AllReduce communication mechanism of VW, and ran an experiment\nusing an internal advertising dataset consisting of approximately 690M training examples,\nwith roughly 318 non-zero features per example. The task is the prediction of click/no-click\nevents. The data was stored in a large Hadoop cluster, split over 100 partitions. We implemented the\nlin baseline, using 5 passes of online learning with repeated averaging on this dataset, but could\nnot run full quad or cubic baselines due to the prohibitive computational cost. 
As an intermediate,\nwe generated bigram features, which only doubles the number of non-zero features per example.\nWe parallelized apple as follows. In the first pass over the data, each one of the 100 nodes locally\nselects the promising features over 6 epochs, as in our single-machine setting. We then take the\nunion of all the parents locally found across all nodes, and freeze that to be the parent set for the rest\nof training. The remaining 4 passes are now done with this fixed feature set, repeatedly averaging\nlocal weights. We then ran apple, on top of both lin as well as bigram as the base features to\nobtain maximally expressive features. The test error was measured in terms of the area under ROC\ncurve (AUC), since this is a highly imbalanced dataset. The error and time results, reported in\nTable 1, show that using nonlinear features does lead to non-trivial improvements in AUC, albeit at\nan increased computational cost. Once again, this should be put in perspective with the full quad\nbaseline, which did not finish in over a day on this dataset.\nAcknowledgements: We thank Leon Bottou, Rob Schapire and Dean Foster for helpful discussions.\n\nReferences\n[1] I. Mukherjee, K. Canini, R. Frongillo, and Y. Singer. Parallel boosting with momentum. In Proceedings\nof the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery\nin Databases, 2013.\n\n[2] A. Agarwal, O. Chapelle, M. Dud\u00edk, and J. Langford. A reliable effective terascale linear learning system.\nJournal of Machine Learning Research, 15(Mar):1111\u20131133, 2014.\n\n[3] A. Agarwal, A. Beygelzimer, D. Hsu, J. Langford, and M. Telgarsky. 
Scalable nonlinear learning with adaptive polynomial expansions. 2014. arXiv:1410.0440 [cs.LG].\n\n[4] C. Williams and M. Seeger. Using the Nystr\u00f6m method to speed up kernel machines. In Advances in Neural Information Processing Systems 13, 2001.\n\n[5] M. W. Mahoney. Randomized algorithms for matrices and data. Foundations and Trends in Machine Learning, 3(2):123\u2013224, 2011.\n\n[6] B. Sch\u00f6lkopf and A. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.\n\n[7] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579\u20131619, 2005.\n\n[8] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems 20, 2008.\n\n[9] P. Kar and H. Karnick. Random feature maps for dot product kernels. In AISTATS, 2012.\n\n[10] N. Pham and R. Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.\n\n[11] R. Hamid, A. Gittens, Y. Xiao, and D. Decoste. Compact random feature maps. In ICML, 2014.\n\n[12] A. G. Ivakhnenko. Polynomial theory of complex systems. Systems, Man and Cybernetics, IEEE Transactions on, SMC-1(4):364\u2013378, 1971.\n\n[13] T. D. Sanger, R. S. Sutton, and C. J. Matheus. Iterative construction of sparse polynomial approximations. In Advances in Neural Information Processing Systems 4, 1992.\n\n[14] A. T. Kalai, A. Samorodnitsky, and S.-H. Teng. Learning and smoothed analysis. In FOCS, 2009.\n\n[15] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning sparse polynomial functions. In SODA, 2014.\n\n[16] A. G. Dimakis, A. Klivans, M. Kocaoglu, and K. Shanmugam. A smoothed analysis for learning sparse polynomials. CoRR, abs/1402.3902, 2014.\n\n[17] M. Zinkevich. 
Online convex programming and generalized infinitesimal gradient ascent. In ICML, 2003.\n\n[18] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267\u2013288, 1996.\n\n[19] J. A. Tropp and A. C. Gilbert. Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory, 53(12):4655\u20134666, December 2007.\n\n[20] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121\u20132159, 2011.\n\n[21] H. B. McMahan and M. J. Streeter. Adaptive bound optimization for online convex optimization. In COLT, pages 244\u2013256, 2010.\n\n[22] N. Karampatziakis and J. Langford. Online importance weight aware updates. In UAI, pages 392\u2013399, 2011.\n\n[23] S. Ross, P. Mineiro, and J. Langford. Normalized online learning. In UAI, 2013.\n\n[24] A. Blum, A. Kalai, and J. Langford. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, 1999.\n\n[25] K. Hall, S. Gilpin, and G. Mann. Mapreduce/bigtable for distributed optimization. In Workshop on Learning on Cores, Clusters, and Clouds, 2010.\n\n[26] S. Bubeck. Theory of convex optimization for machine learning. 2014. arXiv:1405.4980 [math.OC].\n\n[27] O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. 
In ICML, 2013.", "award": [], "sourceid": 1109, "authors": [{"given_name": "Alekh", "family_name": "Agarwal", "institution": "Microsoft Research"}, {"given_name": "Alina", "family_name": "Beygelzimer", "institution": "Yahoo"}, {"given_name": "Daniel", "family_name": "Hsu", "institution": "Columbia University"}, {"given_name": "John", "family_name": "Langford", "institution": "Microsoft Research"}, {"given_name": "Matus", "family_name": "Telgarsky", "institution": "University of Michigan"}]}