{"title": "Algorithms and hardness results for parallel large margin learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1314, "page_last": 1322, "abstract": "We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown gamma-margin halfspace over n dimensions using poly(n,1/gamma) processors and runs in time ~O(1/gamma) + O(log n). In contrast, naive parallel algorithms that learn a gamma-margin halfspace in time that depends polylogarithmically on n have Omega(1/gamma^2) runtime dependence on gamma. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.", "full_text": "Algorithms and hardness results\nfor parallel large margin learning\n\nPhilip M. Long\n\nGoogle\n\nplong@google.com\n\nRocco A. Servedio\nColumbia University\n\nrocco@cs.columbia.edu\n\nAbstract\n\nWe study the fundamental problem of learning an unknown large-margin half-\nspace in the context of parallel computation.\nOur main positive result is a parallel algorithm for learning a large-margin half-\nspace that is based on interior point methods from convex optimization and fast\nparallel algorithms for matrix computations. 
We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ.

Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.

1 Introduction

In this paper we consider large-margin halfspace learning in the PAC model: there is a target halfspace f(x) = sign(w · x), where w is an unknown unit vector, and an unknown probability distribution D over the unit ball Bn = {x ∈ Rn : ‖x‖₂ ≤ 1} which has support on {x ∈ Bn : |w · x| ≥ γ}. (Throughout this paper we refer to such a combination of target halfspace f and distribution D as a γ-margin halfspace.) The learning algorithm is given access to labeled examples (x, f(x)) where each x is independently drawn from D, and it must with high probability output a hypothesis h : Rn → {−1, 1} that satisfies Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε. Learning a large-margin halfspace is a fundamental problem in machine learning; indeed, one of the most famous algorithms in machine learning is the Perceptron algorithm [25] for this problem. 
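As a concrete reminder of the sequential baseline, here is a minimal Perceptron sketch. This is our own illustrative code, not code from the paper, and the γ-margin data construction below is a hypothetical example (target halfspace sign(x₁), margin 0.2):

```python
import random

def perceptron(examples):
    """Perceptron on labeled examples (x, y), y in {-1, +1}.

    If some unit vector u satisfies y * (u . x) >= gamma for every example
    (a gamma-margin halfspace with ||x|| <= 1), the classic mistake bound
    guarantees at most 1/gamma**2 updates in total.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    mistakes = 0
    changed = True
    while changed:                      # sweep until no example is misclassified
        changed = False
        for x, y in examples:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]   # update on a mistake
                mistakes += 1
                changed = True
    return w, mistakes

# Hypothetical gamma-margin data: target halfspace sign(x_1), margin 0.2.
rng = random.Random(0)
gamma = 0.2
examples = []
for _ in range(60):
    a = rng.choice((-1.0, 1.0)) * rng.uniform(gamma, 0.9)
    b = rng.uniform(-0.4, 0.4)          # keeps ||x|| <= sqrt(0.81 + 0.16) < 1
    examples.append(([a, b], 1 if a > 0 else -1))

w_final, num_mistakes = perceptron(examples)
```

With γ = 0.2 the mistake bound guarantees at most 1/γ² = 25 updates regardless of the number of examples; it is this 1/γ² that the naive parallelizations discussed below inherit as a number of sequential stages.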
PAC algorithms based on the Perceptron [17] run in poly(n, 1/γ, 1/ε) time, use O(1/(εγ²)) labeled examples in Rn, and learn an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε.

A motivating question: achieving Perceptron's performance in parallel? The last few years have witnessed a resurgence of interest in highly efficient parallel algorithms for a wide range of computational problems in many areas including machine learning [33, 32]. So a natural goal is to develop an efficient parallel algorithm for learning γ-margin halfspaces that matches the performance of the Perceptron algorithm. A well-established theoretical notion of efficient parallel computation is that an efficient parallel algorithm for a problem with input size N is one that uses poly(N) processors and runs in parallel time polylog(N); see e.g. [12]. Since the input to the Perceptron algorithm is a sample of poly(1/γ, 1/ε) labeled examples in Rn, we naturally arrive at the following:

Algorithm | Number of processors | Running time
naive parallelization of Perceptron | poly(n, 1/γ) | Õ(1/γ²) + O(log n)
naive parallelization of [27] | poly(n, 1/γ) | Õ(1/γ²) + O(log n)
polynomial-time linear programming [2] | 1 | poly(n, log(1/γ))
This paper | poly(n, 1/γ) | Õ(1/γ) + O(log n)

Table 1: Bounds on various parallel algorithms for learning a γ-margin halfspace over Rn.

Main Question: Is there a learning algorithm that uses poly(n, 1/γ, 1/ε) processors and runs in time poly(log n, log 1/γ, log 1/ε) to learn an unknown n-dimensional γ-margin halfspace to accuracy 1 − ε?

(See [31] for a detailed definition of parallel learning algorithms; here we only recall that an efficient parallel learning algorithm's hypothesis must be efficiently 
evaluatable in parallel.) As Freund [10] has largely settled how the resources required by parallel algorithms scale with the accuracy parameter ε (see Lemma 6 below), our focus in this paper is on γ and n, leading to the following:

Main Question (simplified): Is there a learning algorithm that uses poly(n, 1/γ) processors and runs in time poly(log n, log 1/γ) to learn an unknown n-dimensional γ-margin halfspace to accuracy 9/10?

This question, which we view as a fundamental open problem, inspired the research reported here.

Prior results. Table 1 summarizes the running time and number of processors used by various parallel algorithms to learn a γ-margin halfspace over Rn. The naive parallelization of Perceptron in the first line of the table is an algorithm that runs for O(1/γ²) stages; in each stage it processes all of the O(1/γ²) examples simultaneously in parallel, identifies one that causes the Perceptron algorithm to update its hypothesis vector, and performs this update. We do not see how to obtain parallel time bounds better than O(1/γ²) from recent analyses of other algorithms based on gradient descent (such as [7, 8, 4]), some of which use assumptions incomparable in strength to the γ-margin condition studied here. The second line of the table corresponds to a similar naive parallelization of the boosting-based algorithm of [27] that achieves Perceptron-like performance for learning a γ-margin halfspace. It boosts for O(1/γ²) stages over an O(1/γ²)-size sample; using one processor for each coordinate of each example, the running time bound is Õ(1/γ²) · log n, using poly(n, 1/γ) processors. (For both this algorithm and the Perceptron the time bound can be improved to Õ(1/γ²) + O(log n) as claimed in the table by using an initial random projection step; we explain how to do this in Section 2.) 
The third line of the table, included for comparison, is simply a standard sequential algorithm for learning a halfspace based on polynomial-time linear programming executed on one processor, see e.g. [2, 14].

Efficient parallel algorithms have been developed for some simpler PAC learning problems such as learning conjunctions, disjunctions, and symmetric Boolean functions [31]. [6] gave efficient parallel PAC learning algorithms for some geometric constant-dimensional concept classes.

In terms of negative results for parallel learning, [31] shows that (under a complexity-theoretic assumption) there is no parallel algorithm using poly(n) processors and polylog(n) time that constructs a halfspace hypothesis that is consistent with a given linearly separable data set of n-dimensional labeled examples. This does not give a negative answer to the Main Question for several reasons: the Main Question allows any hypothesis representation (that can be efficiently evaluated in parallel), allows the number of processors to grow inverse polynomially with the margin parameter γ, and allows the final hypothesis to err on up to (say) 5% of the points in the data set.

Our results. Our main positive result is a parallel algorithm that uses poly(n, 1/γ) processors to learn γ-margin halfspaces in parallel time Õ(1/γ) + O(log n) (see Table 1). We believe ours is the first algorithm that runs in time polylogarithmic in n and subquadratic in 1/γ. Our analysis can be modified to establish similar positive results for other formulations of the large-margin learning problem, including ones (see [28]) that have been tied closely to weak learnability (these modifications are not presented due to space constraints). In contrast, our main negative result is an information-theoretic argument that suggests that such positive parallel learning results cannot be obtained by boosting alone. 
We show that if the weak learner must be called as an oracle, boosting cannot be parallelized: any parallel booster must perform Ω(1/γ²) sequential stages of boosting a "black-box" γ-advantage weak learner in the worst case. This extends an earlier lower bound of Freund [10] for standard (sequential) boosters that can only call the weak learner once per stage.

2 A parallel algorithm for learning γ-margin halfspaces over Bn

Our parallel algorithm is an amalgamation of existing tools from high-dimensional geometry, convex optimization, parallel algorithms for linear algebra, and learning theory. Roughly speaking the algorithm works as follows: given a data set of m = Õ(1/γ²) labeled examples from Bn × {−1, 1}, it begins by randomly projecting the examples down to d = Õ(1/γ²) dimensions. This essentially preserves the geometry, so the resulting d-dimensional labeled examples are still linearly separable with margin Θ(γ). The algorithm then uses a variant of a linear programming algorithm of Renegar [24, 21] which, roughly speaking, solves linear programs with m constraints to high accuracy using (essentially) √m stages of Newton's method. Within Renegar's algorithm we employ fast parallel algorithms for linear algebra [22] to carry out each stage of Newton's method in polylog(1/γ) parallel time steps. This suffices to learn the unknown halfspace to high constant accuracy (say 9/10); to get a (1 − ε)-accurate hypothesis we combine the above procedure with Freund's approach [10] for boosting accuracy that was mentioned in the introduction. The above sketch omits many details, including crucial issues of precision in solving the linear programs to adequate accuracy. 
In the rest of this section we address the necessary details in full and prove the following theorem:

Theorem 1 There is a parallel algorithm with the following performance guarantee: Let f, D define an unknown γ-margin halfspace over Bn as described in the introduction. The algorithm is given as input ε, δ > 0 and access to labeled examples (x, f(x)) that are drawn independently from D. It runs in O(((1/γ)polylog(1/γ) + log(n)) log(1/ε) + log log(1/δ)) time, uses poly(n, 1/γ, 1/ε, log(1/δ)) processors, and with probability 1 − δ it outputs a hypothesis h satisfying Pr_{x∼D}[h(x) ≠ f(x)] ≤ ε.

We assume that the value of γ is "known" to the algorithm, since otherwise the algorithm can use a standard "guess and check" approach trying γ = 1, 1/2, 1/4, etc., until it finds a value that works. We first describe the tools from the literature that are used in the algorithm.

Random projection. We say that a random projection matrix is a matrix A chosen uniformly from {−1, 1}^{n×d}. Given such an A and a unit vector w ∈ Rn (recall that the target halfspace f is f(x) = sign(w · x)), let w′ denote (1/√d)wA. After transformation by A the distribution D over Bn is transformed to a distribution D′ over Rd in the natural way: a draw x′ from D′ is obtained by making a draw x from D and setting x′ = (1/√d)xA. 
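A minimal sketch of this projection step (our own illustrative code; `random_projection_matrix` and `project` are names we introduce, not from the paper):

```python
import math
import random

def random_projection_matrix(n, d, rng):
    """A uniformly random n x d matrix with entries in {-1, +1}."""
    return [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]

def project(x, A, d):
    """Map x in R^n to x' = (1/sqrt(d)) * x A in R^d."""
    scale = 1.0 / math.sqrt(d)
    return [scale * sum(xi * A[i][j] for i, xi in enumerate(x)) for j in range(d)]

rng = random.Random(1)
n, d = 20, 8
A = random_projection_matrix(n, d, rng)
e1 = [1.0] + [0.0] * (n - 1)        # a unit vector in R^n
e1_proj = project(e1, A, d)
```

Note that for a standard basis vector such as e₁ every coordinate of the image is ±1/√d, so its norm is exactly 1; for general inputs the scaling (1/√d) only preserves norms and margins approximately, with high probability, which is the content of Lemma 1 below.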
We will use the following lemma from [1]:

Lemma 1 [1] Let f(x) = sign(w · x) and D define a γ-margin halfspace as described in the introduction. For d = O((1/γ²) log(1/γ)), a random n × d projection matrix A will with probability 99/100 induce D′ and w′ as described above such that

Pr_{x′∼D′}[ |(w′/‖w′‖) · x′| < γ/2 or ‖x′‖₂ > 2 ] ≤ γ⁴.

Convex optimization. We recall some tools we will use from convex optimization over Rd [24, 3]. Let F be the convex barrier function F(u) = Σ_{i=1}^d log(1/((u_i − a_i)(b_i − u_i))) (we specify the values a_i < b_i below). Let g(u) be the gradient of F at u; note that g(u)_i = 1/(b_i − u_i) − 1/(u_i − a_i). Let H(u) be the Hessian of F at u, let ‖v‖_u = √(vᵀH(u)v), and let n(u) = −H(u)⁻¹g(u) be the Newton step at u. For a linear subspace L of Rd, let F|_L be the restriction of F to L, i.e. the function that evaluates to F on L and ∞ everywhere else.

We will apply interior point methods to approximately solve problems of the following form, where a_1, ..., a_d, b_1, ..., b_d ∈ [−2, 2], |b_i − a_i| ≥ 2 for all i, and L is a subspace of Rd:

minimize −u_1 such that u ∈ L and a_i ≤ u_i ≤ b_i for all i.   (1)

Let z ∈ Rd be the minimizer, and let opt be the optimal value of (1).

The algorithm we analyze minimizes F_η(u) := −ηu_1 + F|_L(u) for successively larger values of η. Let z(η) be the minimizer of F_η, let opt_η = F_η(z(η)), and let n_η(u) be its Newton step. 
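To make the barrier concrete, the following small numerical sketch (our own; the interval choices are arbitrary examples satisfying the constraints above) evaluates F and g, and checks that the gradient vanishes at the box midpoint m = ((a₁ + b₁)/2, ..., (a_d + b_d)/2):

```python
import math

def barrier(u, a, b):
    """F(u) = sum_i log(1 / ((u_i - a_i) * (b_i - u_i))), on the open box."""
    return sum(math.log(1.0 / ((ui - ai) * (bi - ui)))
               for ui, ai, bi in zip(u, a, b))

def gradient(u, a, b):
    """g(u)_i = 1/(b_i - u_i) - 1/(u_i - a_i)."""
    return [1.0 / (bi - ui) - 1.0 / (ui - ai)
            for ui, ai, bi in zip(u, a, b)]

# Example intervals with a_i, b_i in [-2, 2] and |b_i - a_i| >= 2.
a = [-2.0, -1.0, -2.0]
b = [2.0, 1.0, 2.0]
mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
g_mid = gradient(mid, a, b)   # the gradient is zero at the midpoint
```

The barrier blows up as u approaches the boundary of the box, which is what keeps the central-path iterates strictly feasible.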
(To keep the notation clean, the dependence on L is suppressed from the notation.)

As in [23], we periodically round intermediate solutions to keep the bit complexity under control. The analysis of such rounding in [23] requires a problem transformation which does not preserve the large-margin condition that we need for our analysis, so we give a new analysis, using tools from [24], and a simpler algorithm. It is easier to analyze the effect of the rounding on the quality of the solution than on the progress measure used in [24]. Fortunately, [3] describes an algorithm that can go from an approximately optimal solution to a solution with a good measure of progress while controlling the bit complexity of the output. The algorithm repeatedly finds the direction of the Newton step, and then performs a line search to find the approximately optimal step size.

Lemma 2 ([3, Section 9.6.4]) There is an algorithm Abt with the following property. Suppose for any η > 0, Abt is given u with rational components such that F_η(u) − opt_η ≤ 2. Then after constantly many iterations of Newton's method and back-tracking line search, Abt returns a u⁺ that (i) satisfies ‖n_η(u⁺)‖_{u⁺} ≤ 1/9; and (ii) has rational components that have bit-length bounded by a polynomial in d, the bit length of u, and the bit length of the matrix A such that L = {v : Av = 0}.¹

We analyze the following variant of the usual central path algorithm for linear programming, which we call Acpr. 
It takes as input a precision parameter α and outputs the final u^(k).

• Set η₁ = 1, β = 1 + 1/(8√(2d)), and ϵ = 1/⌈2√d(5d/α + √d · 2^{10d/α+2d+1})⌉.
• Given u as input, run Abt starting with u to obtain u^(1) such that ‖n_{η₁}(u^(1))‖_{u^(1)} ≤ 1/9.
• For k from 2 to 1 + ⌈log(4d/α)/log(β)⌉ perform the following steps (i)–(iv): (i) set η_k = βη_{k−1}; (ii) set w^(k) = u^(k−1) + n_{η_k}(u^(k−1)) (i.e. do one step of Newton's method); (iii) form r^(k) by rounding each component of w^(k) to the nearest multiple of ϵ, and then projecting back onto L; (iv) run Abt starting with r^(k) to obtain u^(k) such that ‖n_{η_k}(u^(k))‖_{u^(k)} ≤ 1/9.

The following lemma, implicit² in [3, 24], bounds the quality of the solutions in terms of the progress measure ‖n_{η_k}(u)‖_u.

Lemma 3 If u ∈ L and ‖n_{η_k}(u)‖_u ≤ 1/9, then F_{η_k}(u) − opt_{η_k} ≤ ‖n_{η_k}(u)‖²_u and −u₁ − opt ≤ 4d/η_k + 1.

The following key lemma shows that rounding intermediate solutions does not do too much harm:

Lemma 4 For any k, if F_{η_k}(w^(k)) ≤ opt_{η_k} + 1/9, then F_{η_k}(r^(k)) ≤ opt_{η_k} + 1.

Proof: Fix k, and note that η_k = β^{k−1} ≤ 5d/α. We henceforth drop k from all notation. First, we claim that

κ := min_i min{|a_i − w_i|, |b_i − w_i|} ≥ 2^{−2η−2d−1/9}.   (2)

Let m = ((a₁ + b₁)/2, ..., (a_d + b_d)/2). Since F_η(w) ≤ opt_η + 1/9, we have F_η(w) ≤ F_η(m) + 
But minimizing each term of F\u03b7 separately, we get F\u03b7(w) \u2265 log(cid:0) 1\n\n(cid:1) \u2212 2d \u2212 \u03b7.\n\n{|ai \u2212 wi|,|bi \u2212 wi|} \u2265 2\u22122\u03b7\u22122d\u22121/9.\n\n\u03ba = min\n\n+ 1.\n\n\u03b7k\n\n.\n\ni\n\n\u03ba\n\n\u221a\n\nCombining this with the previous inequality and solving for \u03ba yields (2).\nSince ||w \u2212 r|| \u2264 \u0001\n\nd, recalling that \u0001 \u2264\n{|ai \u2212 ri|,|bi \u2212 ri|} \u2265 2\u22122\u03b7\u22122d\u22121/9 \u2212 \u0001\n\n\u221a\n1\nd210d/\u03b1+2d+1)\n\nd(5d/\u03b1+\n\nmin\n\n\u221a\n2\n\n\u221a\n\n, we have\n\ni\n\n(3)\n1We note for the reader\u2019s convenience that \u03bb(u) in [3] is the same as our ||n(u+)||u+. The analysis on pages\n503-505 of [3] shows that a constant number of iterations suf\ufb01ce. Each step is a projection of H(u)\u22121g(u)\nonto L, which can be seen to have bit-length bounded by a polynomial in the bit-length of u. Composing\npolynomials constantly many times yields a polynomial, which gives the claimed bit-length bound for u+.\n2The \ufb01rst inequality is (9.50) from [3]. The last line of p. 46 of [24] proves that ||n\u03b7k (u)||u \u2264 1/9 implies\n||u \u2212 z(\u03b7)||z(\u03b7) \u2264 1/5 from which the second inequality follows by (2.14) of [24], using the fact that \u03d1 = 2d\n(proved on page 35 of [24]).\n\nd \u2265 2\u22122\u03b7\u22122d\u22121.\n\n4\n\n\fNow, de\ufb01ne \u03c8 : R \u2192 R by \u03c8(t) = F\u03b7\n\n(cid:16)\n\nw + t r\u2212w||r\u2212w||\n\n. We have\n\n(cid:17)\n(cid:90) ||r\u2212w||\n\n0\n\nF\u03b7(r) \u2212 F\u03b7(w) = \u03c8(||r \u2212 w||) \u2212 \u03c8(0) =\n\n(4)\nLet S be the line segment between w and r. 
Since for each t ∈ [0, ‖r − w‖] the value ψ′(t) is a directional derivative of F_η at some point of S, (4) implies that, for the gradient g_η of F_η,

F_η(r) − F_η(w) ≤ ‖w − r‖ · max{‖g_η(s)‖ : s ∈ S}.   (5)

However (3) and (2) imply that min{|a_i − s_i|, |b_i − s_i|} ≥ 2^{−2η−2d−1} for all s ∈ S. Recalling that g(u)_i = 1/(b_i − u_i) − 1/(u_i − a_i), this means that ‖g_η(s)‖ ≤ η + √d · 2^{2η+2d+1}, so that applying (5) we get F_η(r) − F_η(w) ≤ ‖w − r‖(η + √d · 2^{2η+2d+1}). Since ‖w − r‖ ≤ ϵ√d, we have F_η(r) − F_η(w) ≤ ϵ√d(η + √d · 2^{2η+2d+1}) ≤ ϵ√d(5d/α + √d · 2^{10d/α+2d+1}) ≤ 1/2, and the lemma follows.

Fast parallel linear algebra: inverting matrices. We will use an algorithm due to Reif [22]:

Lemma 5 ([22]) There is a polylog(d, L)-time, poly(d, L)-processor parallel algorithm which, given as input a d × d matrix A with rational entries of total bit-length L, outputs A⁻¹.

Learning theory: boosting accuracy. The following is implicit in the analysis of Freund [10].

Lemma 6 ([10]) Let D be a distribution over (unlabeled) examples. Let A be a parallel learning algorithm such that for all D′ with support(D′) ⊆ support(D), given draws (x, f(x)) from D′, with probability 9/10 A outputs a hypothesis with accuracy 9/10 (w.r.t. D′) using P processors in T time. Then there is a parallel algorithm B that with probability 1 − δ constructs a (1 − ε)-accurate hypothesis (w.r.t. 
D) in O(T log(1/ε) + log log(1/δ)) time using poly(P, 1/ε, log(1/δ)) processors.

2.1 Proof of Theorem 1

As described at the start of this section, due to Lemma 6, it suffices to prove the theorem in the case that ε = 1/10 and δ = 1/10. We assume w.l.o.g. that γ = 1/integer.

The algorithm first selects an n × d random projection matrix A where d = O(log(1/γ)/γ²). This defines a transformation Φ_A : Bn → Rd as follows: given x ∈ Bn, the vector Φ_A(x) ∈ Rd is obtained by (i) rounding each x_i to the nearest integer multiple of 1/(4⌈√(n/γ)⌉); then (ii) setting x′ = (1/(2√d))xA; and finally (iii) rounding each x′_i to the nearest multiple of 1/(8⌈d/γ⌉). Given x it is easy to compute Φ_A(x) using O(n log(1/γ)/γ²) processors in O(log(n/γ)) time. Let D′ denote the distribution over Rd obtained by applying Φ_A to D. Across all coordinates D′ is supported on rational numbers with the same poly(1/γ) common denominator. By Lemma 1, with probability 99/100 over A, the target-distribution pair (w′ = (1/√d)wA, D′) satisfies

Pr_{x′∼D′}[ |x′ · (w′/‖w′‖)| < γ′ := γ/8 or ‖x′‖₂ > 1 ] ≤ γ⁴.   (6)

The algorithm next draws m = c log(1/γ)/γ² labeled training examples (Φ_A(x), f(x)) from D′; this can be done in O(log(n/γ)) time using O(n) · poly(1/γ) processors as noted above. It then applies Acpr to find a d-dimensional halfspace h that classifies all m examples correctly (more on this below). 
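The discretization in Φ_A can be sketched as follows. This is our own illustrative code, not the paper's implementation; `phi_A` and `round_to_multiple` are names we introduce, and the concrete parameter values are an arbitrary hypothetical example:

```python
import math
import random

def round_to_multiple(v, step):
    """Round v to the nearest integer multiple of step."""
    return round(v / step) * step

def phi_A(x, A, n, d, gamma):
    """Sketch of Phi_A: round each input coordinate to a coarse rational
    grid, project via (1/(2*sqrt(d))) * xA, then round each output
    coordinate, so all outputs share a poly(1/gamma) common denominator."""
    step_in = 1.0 / (4 * math.ceil(math.sqrt(n / gamma)))
    step_out = 1.0 / (8 * math.ceil(d / gamma))
    xr = [round_to_multiple(xi, step_in) for xi in x]
    scale = 1.0 / (2.0 * math.sqrt(d))
    xp = [scale * sum(xr[i] * A[i][j] for i in range(n)) for j in range(d)]
    return [round_to_multiple(v, step_out) for v in xp]

rng = random.Random(2)
n, d, gamma = 10, 4, 0.25
A = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
x = [rng.uniform(-0.3, 0.3) for _ in range(n)]
y = phi_A(x, A, n, d, gamma)
```

With d = 4 and γ = 1/4 every output coordinate is an integer multiple of 1/128, which is the "common poly(1/γ) denominator" property used to control bit complexity inside Acpr.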
By (6), with probability at least (say) 29/30 over the random draw of (Φ_A(x₁), y₁), ..., (Φ_A(x_m), y_m), we have that y_t((w′/‖w′‖) · Φ_A(x_t)) ≥ γ′ and ‖Φ_A(x_t)‖ ≤ 1 for all t = 1, . . . , m. Now the standard VC bound for halfspaces [30] applied to h and D′ implies that since h classifies all m examples correctly, with overall probability at least 9/10 its accuracy is at least 9/10 with respect to D′, i.e. Pr_{x∼D}[h(Φ_A(x)) ≠ f(x)] ≤ 1/10. So the hypothesis h ∘ Φ_A has accuracy 9/10 with respect to D with probability 9/10 as required by Lemma 6.

It remains to justify the above claim about Acpr classifying all examples correctly, and analyze the running time. More precisely we show that given m = O(log(1/γ)/γ²) training examples in Bd with rational components that all have coordinates with a common denominator that is poly(1/γ) and are separable with a margin γ′ = γ/8, Acpr can be used to construct a d-dimensional halfspace that classifies them all correctly in Õ(1/γ) parallel time using poly(1/γ) processors.

Given (x′₁, y₁), ..., (x′_m, y_m) ∈ Bd × {−1, 1} satisfying the above conditions, we will apply algorithm Acpr to the following linear program, called LP, with α = γ′/2: "minimize −s such that y_t(v · x′_t) − s_t = s and 0 ≤ s_t ≤ 2 for all t ∈ [m]; −1 ≤ v_i ≤ 1 for all i ∈ [d]; and −2 ≤ s ≤ 2." Intuitively, s is the minimum margin over all examples, and s_t is the difference between each example's margin and s. The subspace L is defined by the equality constraints y_t(v · x′_t) − s_t = s.

Our analysis will conclude by applying the following lemma, with an initial solution of s = −1, v = 0, and s_t = 1 for all t. 
(Note that u₁ corresponds to s.)

Lemma 7 Given any d-dimensional linear program in the form (1), and an initial solution u ∈ L such that min{|u_i − a_i|, |u_i − b_i|} ≥ 1 for all i, Algorithm Acpr approximates the optimal solution to an additive ±α. It runs in √d · polylog(d/α) parallel time and uses poly(1/α, d) processors.

The LP constraints enforce that all examples are classified correctly with a margin of at least s. The feasible solution in which v is w′/‖w′‖, s equals γ′ and s_t = y_t(v · x′_t) − s shows that the optimum solution of LP has value at most −γ′. So approximating the optimum to an additive ±α = ±γ′/2 ensures that all examples are classified correctly, and it is enough to prove Lemma 7.

Proof of Lemma 7: First, we claim that, for all k, ‖n_{η_k}(u^(k))‖_{u^(k)} ≤ 1/9; given this, since the final value of η_k is at least 4d/α, Lemma 3 implies that the solution is α-close to optimal. We induct on k. For k = 1, since initially min_i{|u_i − a_i|, |u_i − b_i|} ≥ 1, we have F(u) ≤ 0, and, since η₁ = 1 and u₁ ≥ −1 we have F_{η₁}(u) ≤ 1 and opt_{η₁} ≥ −1. So we can apply Lemma 2 to get the base case.

Now, for the induction step, suppose ‖n_{η_k}(u^(k))‖_{u^(k)} ≤ 1/9. It then follows³ from [24, page 46] that ‖n_{η_{k+1}}(w^(k+1))‖_{w^(k+1)} ≤ 1/9. Next, Lemmas 3 and 4 imply that F_{η_{k+1}}(r^(k+1)) − opt_{η_{k+1}} ≤ 1. Then Lemma 2 gives ‖n_{η_{k+1}}(u^(k+1))‖_{u^(k+1)} ≤ 1/9 as required.

Next, we claim that the bit-length of all intermediate solutions is at most poly(d, 1/γ). 
This holds for r^(k), and follows for u^(k) and w^(k) because each of them is obtained from some r^(k) by performing a constant number of operations each of which blows up the bit length at most polynomially (see Lemma 2). Since each intermediate solution has polynomial bit length, the matrix inverses can be computed in polylog(d, 1/γ) time using poly(d, 1/γ) processors, by Lemma 5. The time bound then follows from the fact that there are at most O(√d · log(d/α)) iterations.

3 Lower bound for parallel boosting in the oracle model

Boosting is a widely used method for learning large-margin halfspaces. In this section we consider the question of whether boosting algorithms can be efficiently parallelized. We work in the original PAC learning setting [29, 16, 26] in which a weak learning algorithm is provided as an oracle that is called by the boosting algorithm, which must simulate a distribution over labeled examples for the weak learner. Our main result for this setting is that boosting is inherently sequential; being able to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of sequential boosting stages that are required. In fact we show this in a very strong sense, by proving that a boosting algorithm that runs arbitrarily many copies of the weak learner in parallel in each stage cannot save even one stage over a sequential booster that runs the weak learner just once in each stage. This lower bound is unconditional and information-theoretic.

Below we first define the parallel boosting framework and give some examples of parallel boosters. We then state and prove our lower bound on the number of stages required by parallel boosters. 
A consequence of our lower bound is that Ω(log(1/ε)/γ²) stages of parallel boosting are required in order to boost a γ-advantage weak learner to achieve classification accuracy 1 − ε, no matter how many copies of the weak learner are used in parallel in each stage.

Our definition of weak learning is standard in PAC learning, except that for our discussion it suffices to consider a single target function f : X → {−1, 1} over a domain X.

Definition 1 A γ-advantage weak learner L is an algorithm that is given access to a source of independent random labeled examples drawn from an (unknown and arbitrary) probability distribution P over labeled examples {(x, f(x))}_{x∈X}. L must⁴ return a weak hypothesis h : X → {−1, 1} that satisfies Pr_{(x,f(x))←P}[h(x) = f(x)] ≥ 1/2 + γ. Such an h is said to have advantage γ w.r.t. P.

³Noting that ϑ ≤ 2d [24, page 35].

We fix P to henceforth denote the initial distribution over labeled examples, i.e. P is a distribution over {(x, f(x))}_{x∈X} where the marginal distribution P_X may be an arbitrary distribution over X. Intuitively, a boosting algorithm runs the weak learner repeatedly on a sequence of carefully chosen distributions to obtain a sequence of weak hypotheses, and combines the weak hypotheses to obtain a final hypothesis that has high accuracy under P. We give a precise definition below, but first we give some intuition to motivate our definition. In stage t of a parallel booster the boosting algorithm may run the weak learner many times in parallel using different probability distributions. 
The probability\nweight of a labeled example (x, f(x)) under a distribution constructed at the t-th stage of boosting\nmay depend on the values of all the weak hypotheses from previous stages and on the value of\nf(x), but may not depend on any of the weak hypotheses generated by any of the calls to the weak\nlearner in stage t. No other dependence on x is allowed, since intuitively the only interface that\nthe boosting algorithm should have with each data point is through its label and the values of the\nweak hypotheses from earlier stages. We further observe that since the distribution P is the only\nsource of labeled examples, a booster should construct the distributions at each stage by somehow\n\u201c\ufb01ltering\u201d examples (x, f(x)) drawn from P based only on the value of f(x) and the values of the\nweak hypotheses from previous stages. We thus de\ufb01ne a parallel booster as follows:\n\nDe\ufb01nition 2 (Parallel booster) A T -stage parallel boosting algorithm with N-fold parallelism is\nde\ufb01ned by T N functions {\u03b1t,k}t\u2208[T ],k\u2208[N ] and a (randomized) Boolean function h, where \u03b1t,k :\n{\u22121, 1}(t\u22121)N +1 \u2192 [0, 1] and h : {\u22121, 1}T N \u2192 {\u22121, 1}. In the t-th stage of boosting the weak\nlearner is run N times in parallel. For each k \u2208 [N], the distribution Pt,k over labeled examples\nthat is given to the k-th run of the weak learner is as follows: a draw from Pt,k is made by drawing\n(x, f(x)) from P and accepting (x, f(x)) as the output of the draw from Pt,k with probability\npx = \u03b1t,k(h1,1(x), . . . , ht\u22121,N (x), f(x)) (and rejecting it and trying again otherwise). In stage t,\nfor each k \u2208 [N] the booster gives the weak learner access to Pt,k as de\ufb01ned above and the weak\nlearner generates a hypothesis ht,k that has advantage at least \u03b3 w.r.t. 
P_{t,k}.

After T stages, TN weak hypotheses {h_{t,k}}_{t∈[T],k∈[N]} have been obtained from the weak learner. The final hypothesis of the booster is H(x) := h(h_{1,1}(x), ..., h_{T,N}(x)), and its accuracy is min Pr_{(x,f(x))←P}[H(x) = f(x)], where the min is taken over all sequences of TN weak hypotheses subject to the condition that each h_{t,k} has advantage at least γ w.r.t. P_{t,k}.

The parameter N above corresponds to the number of processors that the parallel booster is using; we get a sequential booster when N = 1. Many of the most common PAC-model boosters in the literature are sequential boosters, such as [26, 10, 9, 11, 27, 5] and others. In [10] Freund gave a boosting algorithm and showed that after T stages of boosting, his algorithm generates a final hypothesis that is guaranteed to have error at most

vote(γ, T) := ∑_{j=0}^{⌊T/2⌋} (T choose j) (1/2 + γ)^j (1/2 − γ)^{T−j}

(see Theorem 2.1 of [10]). Freund also gave a matching lower bound by showing (see his Theorem 2.4) that any T-stage sequential booster must have error at least as large as vote(γ, T), and so consequently any sequential booster that generates a (1 − ε)-accurate final hypothesis must run for Ω(log(1/ε)/γ²) stages. Our Theorem 2 below extends this lower bound to parallel boosters.

Several parallel boosting algorithms have been given in the literature, including branching program [20, 13, 18, 19] and decision tree [15] boosters. All of these boosters take Ω(log(1/ε)/γ²) stages to learn to accuracy 1 − ε; our theorem below implies that any parallel booster must run for Ω(log(1/ε)/γ²) stages no matter how many parallel calls to the weak learner are made per stage.

Theorem 2 Let B be any T-stage parallel boosting algorithm with N-fold parallelism.
Then for any 0 < γ < 1/2, when B is used to boost a γ-advantage weak learner the resulting final hypothesis may have error as large as vote(γ, T) (see the discussion after Definition 2).

We emphasize that Theorem 2 holds for any γ and any N, which may depend on γ in an arbitrary way.

⁴The usual definition of a weak learner would allow L to fail with probability δ. This probability can be made exponentially small by running L multiple times, so for simplicity we assume there is no failure probability.

The theorem is proved as follows: fix any 0 < γ < 1/2 and fix B to be any T-stage parallel boosting algorithm. We will exhibit a target function f and a distribution P over {(x, f(x))}_{x∈X}, and describe a strategy that a weak learner W can use to generate weak hypotheses h_{t,k} that each have advantage at least γ with respect to the distributions P_{t,k}. We show that with this weak learner W, the resulting final hypothesis H that B outputs will have accuracy at most 1 − vote(γ, T).

We begin by describing the desired f and P. The domain X of f is X = Z × Ω, where Z = {−1, 1} and Ω is the set of all ω = (ω_1, ω_2, ...) where each ω_i belongs to {−1, 1}. The target function f is simply f(z, ω) = z. The distribution P = (P^X, P^Y) over {(x, f(x))}_{x∈X} is defined as follows. A draw from P is obtained by drawing x = (z, ω) from P^X and returning (x, f(x)). A draw of x = (z, ω) from P^X is obtained by first choosing a uniform random value in {−1, 1} for z, and then choosing ω_i ∈ {−1, 1} to equal z with probability 1/2 + γ, independently for each i.
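As a sanity check on the quantity vote(γ, T) (this sketch is ours, not part of the paper; function names are illustrative), the closed form from [10] can be computed directly and compared, by exact enumeration, against the error of a T-bit majority vote in which each bit independently agrees with the ±1 label z with probability 1/2 + γ, exactly as in the distribution P just described:

```python
from itertools import product
from math import comb

def vote(gamma: float, T: int) -> float:
    # Freund's bound: probability of at most floor(T/2) "heads" in
    # T independent (1/2 + gamma)-biased coin tosses.
    return sum(comb(T, j) * (0.5 + gamma) ** j * (0.5 - gamma) ** (T - j)
               for j in range(T // 2 + 1))

def majority_error(gamma: float, T: int) -> float:
    # Exact error of Maj(omega_1, ..., omega_T) under P: each omega_i
    # independently equals the label z with probability 1/2 + gamma.
    # Enumerates all agreement patterns; ties (even T) count as errors.
    p = 0.5 + gamma
    err = 0.0
    for pattern in product((True, False), repeat=T):  # True = omega_i agrees with z
        agree = sum(pattern)
        if agree <= T // 2:  # majority vote disagrees with z
            err += p ** agree * (1 - p) ** (T - agree)
    return err
```

For example, vote(0.1, 1) = 0.4, and vote(γ, T) shrinks as T grows, consistent with the Ω(log(1/ε)/γ²) stage bound.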
Note that under P, given the label z = f(x) of a labeled example (x, f(x)), each coordinate ω_i of x is correct in predicting the value of f(x) with probability 1/2 + γ, independently of all other ω_j's.

We next describe a way that a weak learner W can generate a γ-advantage weak hypothesis each time it is invoked by B. Fix any t ∈ [T] and any k ∈ [N]. When W is invoked with P_{t,k} it replies as follows (recall that for x ∈ X we have x = (z, ω) as described above): (i) if Pr_{(x,f(x))←P_{t,k}}[ω_t = f(x)] ≥ 1/2 + γ then the weak hypothesis h_{t,k}(x) is the function "ω_t," i.e. the (t + 1)-st coordinate of x. Otherwise, (ii) the weak hypothesis h_{t,k}(x) is "z," i.e. the first coordinate of x. (Note that since f(x) = z for all x, this weak hypothesis has zero error under any distribution.)

It is clear that each weak hypothesis h_{t,k} generated as described above indeed has advantage at least γ w.r.t. P_{t,k}, so the above is a legitimate strategy for W. The following lemma will play a key role:

Lemma 8 If W never uses option (ii) then Pr_{(x,f(x))←P}[H(x) ≠ f(x)] ≥ vote(γ, T).

Proof: If the weak learner never uses option (ii) then H depends only on the variables ω_1, ..., ω_T and hence is a (randomized) Boolean function over these variables. Recall that for (x = (z, ω), f(x) = z) drawn from P, each coordinate ω_1, ..., ω_T independently equals z with probability 1/2 + γ. Hence the optimal (randomized) Boolean function H over inputs ω_1, ..., ω_T that maximizes the accuracy Pr_{(x,f(x))←P}[H(x) = f(x)] is the (deterministic) function H(x) = Maj(ω_1, ..., ω_T) that outputs the majority vote of its input bits. (This can be easily verified using Bayes' rule in the usual "Naive Bayes" calculation.)
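The Bayes-rule step just invoked, namely that no Boolean function of ω_1, ..., ω_T beats the majority vote, can also be checked by brute force for small T. The sketch below is our own check (not from the paper): it enumerates every deterministic h : {−1, 1}³ → {−1, 1} and confirms that none attains higher accuracy than Maj; since a randomized function is a mixture of deterministic ones, this covers the randomized case too.

```python
from itertools import product

def accuracy(h, gamma: float, T: int) -> float:
    # Exact Pr[h(omega_1,...,omega_T) = z] when z is uniform in {-1, 1}
    # and each omega_i independently equals z with probability 1/2 + gamma.
    p = 0.5 + gamma
    acc = 0.0
    for z in (1, -1):
        for omega in product((1, -1), repeat=T):
            agree = sum(1 for w in omega if w == z)
            if h(omega) == z:
                acc += 0.5 * p ** agree * (1 - p) ** (T - agree)
    return acc

def maj(omega):
    # Majority vote of +/-1 bits (use odd T, so there are no ties).
    return 1 if sum(omega) > 0 else -1

def best_deterministic_accuracy(gamma: float, T: int) -> float:
    # Max accuracy over all 2^(2^T) deterministic Boolean functions,
    # represented by their truth tables. A randomized function is a
    # mixture of deterministic ones, so it cannot exceed this maximum.
    inputs = list(product((1, -1), repeat=T))
    best = 0.0
    for outputs in product((1, -1), repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        best = max(best, accuracy(table.__getitem__, gamma, T))
    return best
```

For T = 3 and γ = 0.1, Maj achieves accuracy 0.648 = 1 − vote(0.1, 3), and the exhaustive search finds nothing better.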
The error rate of this H is precisely the probability that at most ⌊T/2⌋ "heads" are obtained in T independent (1/2 + γ)-biased coin tosses, which equals vote(γ, T).

Thus it suffices to prove the following lemma, which we prove by induction on t:

Lemma 9 W never uses option (ii) (i.e. Pr_{(x,f(x))←P_{t,k}}[ω_t = f(x)] ≥ 1/2 + γ always).

Proof: Base case (t = 1). For any k ∈ [N], since t = 1 there are no weak hypotheses from previous stages, so the value of p_x is determined by the bit f(x) = z (see Definition 2). Hence P_{1,k} is a convex combination of two distributions which we call D_1 and D_{−1}. For b ∈ {−1, 1}, a draw of (x = (z, ω); f(x) = z) from D_b is obtained by setting z = b and independently setting each coordinate ω_i equal to z with probability 1/2 + γ. Thus in the convex combination P_{1,k} of D_1 and D_{−1}, we also have that ω_1 equals z (i.e. f(x)) with probability 1/2 + γ. So the base case is done.

Inductive step (t > 1). Fix any k ∈ [N]. The inductive hypothesis and the weak learner's strategy together imply that for each labeled example (x = (z, ω), f(x) = z), since h_{s,ℓ}(x) = ω_s for s < t, the rejection sampling parameter p_x = α_{t,k}(h_{1,1}(x), ..., h_{t−1,N}(x), f(x)) is determined by ω_1, ..., ω_{t−1} and z and does not depend on ω_t, ω_{t+1}, .... Consequently the distribution P_{t,k} over labeled examples is some convex combination of 2^t distributions which we denote D_b, where b ranges over {−1, 1}^t corresponding to conditioning on all possible values of ω_1, ..., ω_{t−1}, z. For each b = (b_1, ..., b_t) ∈ {−1, 1}^t, a draw of (x = (z, ω); f(x) = z) from D_b is obtained by setting z = b_t, setting (ω_1, ..., ω_{t−1}) = (b_1, ..
., b_{t−1}), and independently setting each other coordinate ω_j (j ≥ t) equal to z with probability 1/2 + γ. In particular, because ω_t is conditionally independent of ω_1, ..., ω_{t−1} given z, Pr(ω_t = z | ω_1 = b_1, ..., ω_{t−1} = b_{t−1}) = Pr(ω_t = z) = 1/2 + γ. Thus in the convex combination P_{t,k} of the different D_b's, we also have that ω_t equals z (i.e. f(x)) with probability 1/2 + γ. This concludes the proof of the lemma and the proof of Theorem 2.

References

[1] R. Arriaga and S. Vempala. An algorithmic theory of learning: Robust concepts and random projection. In Proc. 40th FOCS, pages 616–623, 1999.
[2] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
[3] S. P. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[4] J. Bradley, A. Kyrola, D. Bickson, and C. Guestrin. Parallel coordinate descent for l1-regularized loss minimization. In ICML, 2011.
[5] Joseph K. Bradley and Robert E. Schapire. FilterBoost: Regression and classification on large datasets. In NIPS, 2007.
[6] N. Bshouty, S. Goldman, and H.D. Mathias. Noise-tolerant parallel learning of geometric concepts. Inf. and Comput., 147(1):89–110, 1998.
[7] Michael Collins, Robert E. Schapire, and Yoram Singer. Logistic regression, AdaBoost and Bregman distances. Machine Learning, 48(1-3):253–285, 2002.
[8] O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao. Optimal distributed online prediction. In ICML, 2011.
[9] C. Domingo and O. Watanabe. MadaBoost: a modified version of AdaBoost. In Proc. 13th COLT, pages 180–189, 2000.
[10] Y. Freund. Boosting a weak learning algorithm by majority. Inf. and Comput., 121(2):256–285, 1995.
[11] Y. Freund. An adaptive version of the boost-by-majority algorithm. Mach.
Learn., 43(3):293–318, 2001.
[12] R. Greenlaw, H.J. Hoover, and W.L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York, 1995.
[13] A. Kalai and R. Servedio. Boosting in the presence of noise. Journal of Computer & System Sciences, 71(3):266–290, 2005.
[14] N. Karmarkar. A new polynomial time algorithm for linear programming. Combinatorica, 4:373–395, 1984.
[15] M. Kearns and Y. Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the Twenty-Eighth Annual Symposium on Theory of Computing, pages 459–468, 1996.
[16] M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, Cambridge, MA, 1994.
[17] N. Littlestone. From online to batch learning. In Proc. 2nd COLT, pages 269–284, 1989.
[18] P. Long and R. Servedio. Martingale boosting. In Proc. 18th Annual COLT, pages 79–94, 2005.
[19] P. Long and R. Servedio. Adaptive martingale boosting. In Proc. 22nd NIPS, pages 977–984, 2008.
[20] Y. Mansour and D. McAllester. Boosting using branching programs. Journal of Computer & System Sciences, 64(1):103–112, 2002.
[21] Y. Nesterov and A. Nemirovskii. Interior Point Polynomial Methods in Convex Programming: Theory and Applications. Society for Industrial and Applied Mathematics, Philadelphia, 1994.
[22] John H. Reif. O(log² n) time efficient parallel factorization of dense, sparse separable, and banded matrices. In SPAA, 1994.
[23] J. Renegar. A polynomial-time algorithm, based on Newton's method, for linear programming. Mathematical Programming, 40:59–93, 1988.
[24] James Renegar. A Mathematical View of Interior-Point Methods in Convex Optimization. Society for Industrial and Applied Mathematics, 2001.
[25] F. Rosenblatt.
The Perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386–407, 1958.
[26] R. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990.
[27] R. Servedio. Smooth boosting and learning with malicious noise. JMLR, 4:633–648, 2003.
[28] S. Shalev-Shwartz and Y. Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. Machine Learning, 80(2):141–163, 2010.
[29] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
[30] V. Vapnik. Statistical Learning Theory. Wiley-Interscience, New York, 1998.
[31] J. S. Vitter and J. Lin. Learning in parallel. Inf. Comput., 96(2):179–202, 1992.
[32] DIMACS 2011 Workshop. Parallelism: A 2020 Vision. 2011.
[33] NIPS 2009 Workshop. Large-Scale Machine Learning: Parallelism and Massive Datasets. 2009.