{"title": "Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models", "book": "Advances in Neural Information Processing Systems", "page_first": 1231, "page_last": 1239, "abstract": "Training conditional maximum entropy models on massive data requires significant time and computational resources. In this paper, we investigate three common distributed training strategies: distributed gradient, majority voting ensembles, and parameter mixtures. We analyze the worst-case runtime and resource costs of each and present a theoretical foundation for the convergence of parameters under parameter mixtures, the most efficient strategy. We present large-scale experiments comparing the different strategies and demonstrate that parameter mixtures over independent models use fewer resources and achieve comparable loss as compared to standard approaches.", "full_text": "Ef\ufb01cient Large-Scale Distributed Training of\n\nConditional Maximum Entropy Models\n\nGideon Mann\n\nGoogle\n\ngmann@google.com\n\nRyan McDonald\n\nGoogle\n\nryanmcd@google.com\n\nMehryar Mohri\n\nCourant Institute and Google\nmohri@cims.nyu.edu\n\nNathan Silberman\n\nGoogle\n\nnsilberman@google.com\n\nNLP Lab, Brigham Young University\n\nDaniel D. Walker\u2217\ndanl4@cs.byu.edu\n\nAbstract\n\nTraining conditional maximum entropy models on massive data sets requires sig-\nni\ufb01cant computational resources. We examine three common distributed training\nmethods for conditional maxent: a distributed gradient computation method, a\nmajority vote method, and a mixture weight method. We analyze and compare the\nCPU and network time complexity of each of these methods and present a theoret-\nical analysis of conditional maxent models, including a study of the convergence\nof the mixture weight method, the most resource-ef\ufb01cient technique. 
We also report\nthe results of large-scale experiments comparing these three methods which\ndemonstrate the benefits of the mixture weight method: this method consumes\nfewer resources, while achieving a performance comparable to that of standard\napproaches.\n\n1 Introduction\nConditional maximum entropy models [1, 3], conditional maxent models for short, also known as\nmultinomial logistic regression models, are widely used in applications, most prominently for\nmulti-class classification problems with a large number of classes in natural language processing [1, 3] and\ncomputer vision [12] over the last decade or more.\nThese models are based on the maximum entropy principle of Jaynes [11], which consists of\nselecting, among the models approximately consistent with the constraints, the one with the greatest\nentropy. They benefit from a theoretical foundation similar to that of standard maxent probabilistic\nmodels used for density estimation [8]. In particular, a duality theorem for conditional maxent models\nshows that these models belong to the exponential family. As shown by Lebanon and Lafferty [13],\nin the case of two classes, these models are also closely related to AdaBoost, which can be viewed as\nsolving precisely the same optimization problem with the same constraints, modulo a normalization\nconstraint needed in the conditional maxent case to derive probability distributions.\nWhile the theoretical foundation of conditional maxent models makes them attractive, the\ncomputational cost of their optimization problem is often prohibitive for data sets of several million points.\nA number of algorithms have been described for batch training of conditional maxent models using\na single processor. 
These include generalized iterative scaling [7], improved iterative scaling [8],\ngradient descent, conjugate gradient methods, and second-order methods [15, 18].\nThis paper examines distributed methods for training conditional maxent models that can scale to\nvery large samples of up to 1B instances. Both batch algorithms and on-line training algorithms such\nas that of [5] or stochastic gradient descent [21] can benefit from parallelization, but we concentrate\nhere on batch distributed methods.\nWe examine three common distributed training methods: a distributed gradient computation method\n[4], a majority vote method, and a mixture weight method. We analyze and compare the CPU and\nnetwork time complexity of each of these methods (Section 2) and present a theoretical analysis of\nconditional maxent models (Section 3), including a study of the convergence of the mixture weight\nmethod, the most resource-efficient technique. We also report the results of large-scale experiments\ncomparing these three methods which demonstrate the benefits of the mixture weight method\n(Section 4): this method consumes fewer resources, while achieving a performance comparable to that of\nstandard approaches such as the distributed gradient computation method.\u00b9\n\n\u2217This work was conducted while at Google Research, New York.\n\n2 Distributed Training of Conditional Maxent Models\nIn this section, we first briefly describe the optimization problem for conditional maximum entropy\nmodels, then discuss three common methods for distributed training of these models and compare\ntheir CPU and network time complexity.\n2.1 Conditional Maxent Optimization problem\nLet X be the input space, Y the output space, and \u03a6: X \u00d7 Y \u2192 H a (feature) mapping to a Hilbert\nspace H, which in many practical settings coincides with R^N, N = dim(H) < \u221e. We denote by\n\u2016\u00b7\u2016 the norm induced by the inner product associated to H.\nLet S = ((x1, y1), . . . 
, (xm, ym)) be a training sample of m pairs in X\u00d7Y. A conditional maximum\nentropy model is a conditional probability of the form pw[y|x] = (1/Z(x)) exp(w \u00b7 \u03a6(x, y)) with\nZ(x) = \u2211_{y\u2208Y} exp(w \u00b7 \u03a6(x, y)), where the weight or parameter vector w\u2208H is the solution of the following\noptimization problem:\n\n$$w = \\operatorname*{argmin}_{w \\in H} F_S(w) = \\operatorname*{argmin}_{w \\in H} \\lambda \\|w\\|^2 - \\frac{1}{m} \\sum_{i=1}^{m} \\log p_w[y_i | x_i]. \\qquad (1)$$\n\nHere, \u03bb \u2265 0 is a regularization parameter typically selected via cross-validation. The optimization\nproblem just described corresponds to an L2 regularization. Many other types of regularization have\nbeen considered for the same problem in the literature, in particular L1 regularization or\nregularizations based on other norms. This paper will focus on conditional maximum entropy models with L2\nregularization.\nThese models have been extensively used and studied in natural language processing [1, 3] and\nother areas where they are typically used for classification. Given the weight vector w, the output y\npredicted by the model for an input x is:\n\n$$y = \\operatorname*{argmax}_{y \\in Y} p_w[y | x] = \\operatorname*{argmax}_{y \\in Y} w \\cdot \\Phi(x, y). \\qquad (2)$$\n\nSince the function FS is convex and differentiable, gradient-based methods can be used to find a\nglobal minimizer w of FS. Standard training methods such as iterative scaling, gradient descent,\nconjugate gradient, and limited-memory quasi-Newton all have the general form of Figure 1, where\nthe update function \u0393: H \u2192 H for the gradient \u2207FS(w) depends on the optimization method\nselected. T is the number of iterations needed for the algorithm to converge to a global minimum.\nIn practice, convergence occurs when FS(w) differs by less than a constant \u03b5 in successive iterations\nof the loop.\n\n1 w \u2190 0\n2 for t \u2190 1 to T do\n3     \u2207FS(w) \u2190 GRADIENT(FS(w))\n4     w \u2190 w + \u0393(\u2207FS(w))\n5 return w\n\nFigure 1: Standard Training\n\n2.2 Distributed Gradient Computation Method\nSince the points are sampled i.i.d., the gradient computation in step 3 of Figure 1 can be distributed\nacross p machines. Consider a sample S = (S1, . . . , Sp) of pm points formed by p subsamples of\nm points drawn i.i.d., S1, . . . , Sp. At each iteration, the gradients \u2207FSk(w) are computed by these\np machines in parallel. These separate gradients are then summed up to compute the exact global\ngradient on a single machine, which also performs the optimization step and updates the weight\nvector received by all other machines (Figure 2). Chu et al. [4] describe a map-reduce formulation\nfor this computation, where each training epoch consists of one map (compute each \u2207FSk(w))\nand one reduce (update w). However, the update method they present is that of Newton-Raphson,\nwhich requires the computation of the Hessian. We do not consider such strategies, since Hessian\ncomputations are often infeasible for large data sets.\n\n1 w \u2190 0\n2 for t \u2190 1 to T do\n3     \u2207FS(w) \u2190 DISTGRADIENT(FSk(w) \u2225 p machines)\n4     w \u2190 w + \u0393(\u2207FS(w))\n5     UPDATE(w \u2225 p machines)\n6 return w\n\nFigure 2: Distributed Gradient Training\n\n\u00b9 A batch parallel estimation technique for maxent models based on their connection with AdaBoost is also\ndescribed by [5]. This algorithm is quite different from the distributed gradient computation method, but, as for\nthat method, it requires a substantial amount of network resources, since updates need to be transferred to the\nmaster at every iteration.\n\n2.3 Majority Vote Method\nThe ensemble methods described in the next two paragraphs are based on mixture weights \u00b5 \u2208 R^p.\nLet \u2206p = {\u00b5 \u2208 R^p : \u00b5 \u2265 0 \u2227 \u2211_{k=1}^{p} \u00b5k = 1} denote the simplex of R^p and let \u00b5 \u2208 \u2206p. In the absence\nof any prior knowledge, \u00b5 is chosen to be the uniform mixture \u00b50 = (1/p, . . . 
, 1/p) as in all of our\nexperiments.\nInstead of computing the gradient of the global function in parallel, a (weighted) majority vote\nmethod can be used. Each machine receives one subsample Sk, k \u2208 [1, p], and computes wk =\nargmin_{w\u2208H} FSk(w) by applying the standard training of Figure 1 to Sk. The output y predicted by\nthe majority vote method for an input x is\n\n$$y = \\operatorname*{argmax}_{y \\in Y} \\sum_{k=1}^{p} \\mu_k \\, I\\bigl(\\operatorname*{argmax}_{y' \\in Y} p_{w_k}[y' | x] = y\\bigr), \\qquad (3)$$\n\nwhere I is an indicator function of the predicate it takes as argument. Alternatively, the\nconditional class probabilities could be used to take into account the uncertainty of each classifier:\ny = argmax_y \u2211_{k=1}^{p} \u00b5k pwk[y|x].\n2.4 Mixture Weight Method\nThe cost of storing p weight vectors can make the majority vote method unappealing. Instead, a\nsingle mixture weight w\u00b5 can be defined from the weight vectors wk, k\u2208[1, p]:\n\n$$w_{\\mu} = \\sum_{k=1}^{p} \\mu_k w_k. \\qquad (4)$$\n\nThe mixture weight w\u00b5 can be used directly for classification.\n2.5 Comparison of CPU and Network Times\nThis section compares the CPU and network time complexity of the three training methods just\ndescribed. Table 1 summarizes these results. Here, we denote by N the dimension of H. User CPU\nrepresents the CPU time experienced by the user, cumulative CPU the total amount of CPU time for\nthe machines participating in the computation, and latency the experienced runtime effects due to\nnetwork activity. The cumulative network usage is the amount of data transferred across the network\nduring a distributed computation.\nFor a training sample of pm points, both the user and cumulative CPU times are in Ocpu(T pmN)\nwhen training on a single machine (Figure 1) since at each of the T iterations, the gradient\ncomputation must iterate over all pm training points and update all the components of w.\n\nMethod                Training: User CPU + Latency  Training: Cum. CPU        Training: Cum. Network  Prediction: User CPU\nSingle Machine        Ocpu(pmN T)                   Ocpu(pmN T)               N/A                     Ocpu(N)\nDistributed Gradient  Ocpu(mN T) + Olat(N T)        Ocpu(pmN T)               Onet(pN T)              Ocpu(N)\nMajority Vote         Ocpu(mN Tmax) + Olat(N)       \u2211_{k=1}^{p} Ocpu(mN Tk)   Onet(pN)                Ocpu(pN)\nMixture Weight        Ocpu(mN Tmax) + Olat(N)       \u2211_{k=1}^{p} Ocpu(mN Tk)   Onet(pN)                Ocpu(N)\n\nTable 1: Comparison of CPU and network times.\n\nFor the distributed gradient method (Section 2.2), the worst-case user CPU of the gradient and\nparameter update computations (lines 3-4 of Figure 2) is Ocpu(mN + pN + N) since each parallel\ngradient calculation takes mN to compute the gradient for m instances, p gradients of size N need\nto be summed, and the parameters updated. We assume here that the time to compute \u0393 is negligible.\nIf we assume that p \u226a m, then the user CPU is in Ocpu(mN T). Note that the number of iterations\nit takes to converge, T, is the same as when training on a single machine since the computations are\nidentical.\nIn terms of network usage, a distributed gradient strategy will incur a cost of Onet(pN T) and a\nlatency proportional to Olat(N T), since at each iteration w must be transmitted to each of the\np machines (in parallel) and each \u2207FSk(w) returned back to the master. Network time can be\nimproved through better data partitioning of S when \u03a6(x, y) is sparse. The exact runtime cost of\nlatency is complicated as it depends on factors such as the physical distance between the master and\neach machine, connectivity, the switch fabric in the network, and CPU costs required to manage\nmessages. For parallelization on massively multi-core machines [4], communication latency might\nbe negligible. However, in large data centers running commodity machines, a more common case,\nnetwork latency cost can be significant.\nThe training times are identical for the majority vote and mixture weight techniques. 
Let Tk be the\nnumber of iterations for training the kth mixture component wk and let Tmax = max{T1, . . . , Tp}.\nThen, the user CPU usage of training is in Ocpu(mN Tmax), similar to that of the distributed gradient\nmethod. However, in practice, Tmax is typically less than T since convergence is often faster with\nsmaller data sets. A crucial advantage of these methods over the distributed gradient method is that\ntheir network usage is signi\ufb01cantly less than that of the distributed gradient computation. While\nparameters and gradients are exchanged at each iteration for this method, majority vote and mixture\nweight techniques only require the \ufb01nal weight vectors to be transferred at the conclusion of training.\nThus, the overall network usage is Onet(pN ) with a latency in Olat(N T ). The main difference\nbetween the majority vote and mixture weight methods is the user CPU (and memory usage) for\nprediction which is in Ocpu(pN ) versus Ocpu(N ) for the mixture weight method. Prediction could\nbe distributed over p machines for the majority vote method, but that would incur additional machine\nand network bandwidth costs.\n3 Theoretical Analysis\nThis section presents a theoretical analysis of conditional maxent models, including a study of the\nconvergence of the mixture weight method, the most resource-ef\ufb01cient technique, as suggested in\nthe previous section.\nThe results we obtain are quite general and include the proof of several fundamental properties of\nthe weight vector w obtained when training a conditional maxent model. We \ufb01rst prove the stability\nof w in response to a change in one of the training points. We then give a convergence bound for\nw as a function of the sample size in terms of the norm of the feature space and also show a similar\nresult for the mixture weight w\n\u00b5. 
These results are used to compare the weight vector wpm obtained\nby training on a sample of size pm with the mixture weight vector w\nConsider two training samples of size m, S = (z1, . . . , zm\u22121, zm) and S# = (z1, . . . , zm\u22121, z#m),\nwith elements in X \u00d7Y, that differ by a single training point, which we arbitrarily set as the last one\nof each sample: zm = (xm, ym) and z#m = (x#m, y#m). Let w denote the parameter vector returned\nby conditional maximum entropy when trained on sample S, w# the vector returned when trained\non S#, and let \u2206w denote w#\u2212 w. We shall assume that the feature vectors are bounded, that is\nthere exists R > 0 such that for all (x, y) in X \u00d7Y, $\u03a6(x, y)$ \u2264 R. Our bounds are derived using\n\n\u00b5.\n\n4\n\n\ftechniques similar to those used by Bousquet and Elisseeff [2], or other authors, e.g., [6], in the\nanalysis of stability. In what follows, for any w \u2208 H and z = (x, y)\u2208X \u00d7Y, we denote by Lz(w)\nthe negative log-likelihood - log pw[y|x].\nTheorem 1. Let S# and S be two arbitrary samples of size m differing only by one point. Then, the\nfollowing stability bound holds for the weight vector returned by a conditional maxent model:\n\n$\u2206w$ \u2264\n\n2R\n\u03bbm\n\n.\n\n(5)\n\n1\n\n=\n\n1\n\n1\n\nm(w) \u2212 Lz!\n\nm!m\n\nBW (w#$w) + BW (w$w#) \u2264 BFS (w#$w) + BFS! (w$w#).\n\nProof. We denote by BF the Bregman divergenceassociated to a convex and differentiable function\nF de\ufb01ned for all u, u# by: BF (u#$u) = F (u#)\u2212F (u)\u2212\u2207F (u)\u00b7(u#\u2212u). Let GS denote the function\ni=1 Lzi(u) and W the function u ,\u2192 \u03bb$u$2. GS and W are convex and differentiable\nu ,\u2192 1\nfunctions. Since the Bregman divergence is non-negative, BGS \u2265 0 and BFS = BW + BGS \u2265 BW.\nSimilarly, BFS! \u2265 BW. 
Thus, the following inequality holds:\n(6)\nBy the de\ufb01nition of w and w# as the minimizers of FS and FS !, \u2207FS(w) = \u2207FS !(w#) = 0 and\nm(w#)%&\nm(w) \u00b7 (w# \u2212 w)&\n\nBFS (w#$w) + BFS! (w$w#) = FS(w#) \u2212 FS(w) + FS !(w) \u2212 FS !(w#)\nm#$Lzm(w#) \u2212 Lzm(w)% +$Lz!\nm#\u2207Lzm(w#) \u00b7 (w \u2212 w#) + \u2207Lz!\nm$\u2207Lz!\n\n\u2264 \u2212\n= \u2212\nm and Lzm. It is not hard to see that BW (w#$w)+BW (w$w#) =\nwhere we used the convexity of Lz!\n2\u03bb$\u2206w$2. Thus, the application of the Cauchy-Schwarzinequality to the inequality just established\nyields\n(7)\n\nm(w) \u2212 \u2207Lzm(w#)% \u00b7 (w# \u2212 w),\n\nm#$\u2207Lzm(w#)$ + $\u2207Lz!\n1\nm$\u2207Lzm(w#) \u2212 \u2207Lz!\nThe gradient of w ,\u2192 Lzm(w) = log!y\u2208Y ew\u00b7\u03a6(xm,y)\u2212w \u00b7 \u03a6(xm, ym) is given by\n\u2207Lzm(w) = !y\u2208Y ew\u00b7\u03a6(xm,y)\u03a6(xm, y)\n!y!\u2208Y ew\u00b7\u03a6(xm,y!)\nThus, we obtain $\u2207Lzm(w#)$ \u2264 Ey\u223cpw\n$\u2207Lz!\nLet D denote the distribution according to which training and test points are drawn and let F ! be\nthe objective function associated to the optimization de\ufb01ned with respect to the true log loss:\n\ny\u223cpw[\u00b7|xm]$\u03a6(xm, y) \u2212 \u03a6(xm, ym)%.\n! [\u00b7|xm]$$\u03a6(xm, y)\u2212 \u03a6(xm, ym)$% \u2264 2R and similarly\n\nm(w)$\u22642 R, which leads to the statement of the theorem.\n\nm(w)$&.\n\n\u2212 \u03a6(xm, ym) =\n\n2\u03bb$\u2206w$ \u2264\n\nw\u2208H\n\nF !(w) = argmin\n\n\u03bb$w$2 + E\n\nz\u223cD$Lz(w)%.\n\n(8)\nF ! is a convex function since ED[Lz] is convex. Let the solution of this optimization be denoted by\nw! = argminw\u2208H F !(w).\nTheorem 2. Let w \u2208 H be the weight vector returned by conditional maximum entropy when\ntrained on a sample S of size m. 
Then, for any \u03b4> 0, with probability at least 1\u2212\u03b4, the following\ninequality holds:\n(9)\n\nm(w)$ \u2264\n\nE\n\n1\n\n$w \u2212 w!$ \u2264\n\nR\n\n\u03bb\u2019m/2(1 +\u2019log 1/\u03b4).\n\nProof. Let S and S# be as before samples of size m differing by a single point. To derive this\nbound, we apply McDiarmid\u2019s inequality [17] to \u03a8(S)=$w \u2212 w!$. By the triangle inequality and\nTheorem 1, the following Lipschitz property holds:\n(10)\n\n|\u03a8(S#) \u2212 \u03a8(S)| =**$w# \u2212 w!$ \u2212 $w \u2212 w!$** \u2264 $w# \u2212 w$ \u2264\n\n2R\n\u03bbm\n\n.\n\n5\n\n\f\u03b4\n\n2R\n\n2R\n\n(11)\n\n2m \u2264\n\n\u03a8 \u2264 E[\u03a8] +\n\n. Using this bound\n\n\u03bb + log 1\n\n4R2/\u03bb2). The following bound can be\n\nThus, by McDiarmid\u2019s inequality, Pr[\u03a8\u2212E[\u03a8] \u2265 \u0001] \u2264 exp( \u22122\u00012m\nshown for the expectation of \u03a8 (see longer version of this paper): E[\u03a8] \u2264 2R\n\u03bb\u221a2m\nand setting the right-hand side of McDiarmid\u2019s inequality to \u03b4 show that the following holds\n\u03bb\u221a2m(1 +\u2019log 1/\u03b4),\n\nwith probability at least 1\u2212\u03b4.\nNote that, remarkably, the bound of Theorem 2 does not depend on the dimension of the feature\nspace but only on the radius R of the sphere containing the feature vectors.\nConsider now a sample S = (S1, . . . , Sp) of pm points formed by p subsamples of m points drawn\ni.i.d. and let w\n\u00b5 denote the \u00b5-mixture weight as de\ufb01ned in Section 2.4. The following theorem gives\na learning bound for w\n\u00b5.\nTheorem 3. For any \u00b5 \u2208 \u2206p, let w\n\u00b5 \u2208 H denote the mixture weight vector obtained from a sample\nof size pm by combining the p weight vectors wk, k\u2208[1, p], each returned by conditional maximum\nentropy when trained on the sample Sk of size m. 
Then, for any \u03b4> 0, with probability at least 1\u2212\u03b4,\nthe following inequality holds:\n(12)\n\nR$\u00b5$\nFor the uniform mixture \u00b50 = (1/p, . . . , 1/p), the bound becomes\n\n$w\n\n\u00b5 \u2212 w!$ \u2264 E$$w\n\u00b5 \u2212 w!$ \u2264 E$$w\n\n\u00b5 \u2212 w!$% +\n\u00b5 \u2212 w!$% +\n\n$w\n\n\u03bb\u2019m/2\u2019log 1/\u03b4.\n\u03bb\u2019pm/2\u2019log 1/\u03b4.\n\nR\n\n(13)\n\n.\n\nProof. The result follows by application of McDiarmid\u2019s inequality to \u03a5(S) = $w\n\u00b5 \u2212 w!$. Let\nS# = (S#1, . . . , S#p) denote a sample differing from S by one point, say in subsample Sk. Let w#k\ndenote the weight vector obtained by training on subsample S#k and w#\u00b5 the mixture weight vector\nassociated to S#. Then, by the triangle inequality and the stability bound of Theorem 1, the following\nholds:\n\n2\u00b5kR\n\u03bbm\n\nThus, by McDiarmid\u2019s inequality,\n\n\u00b5 \u2212 w!$** \u2264 $w#\u00b5 \u2212 w\n\u22122\u00012\nk=1 m( 2\u00b5kR\n!p\n\n\u00b5$ = \u00b5k$w#k \u2212 wk$ \u2264\n\u03bbm )2- = exp,\u22122\u03bb2m\u00012\n4R2$\u00b5$2-,\n\n|\u03a5(S#) \u2212 \u03a5(S)| =**$w#\u00b5 \u2212 w!$ \u2212 $w\nPr[\u03a5(S) \u2212 E[\u03a5(S)] \u2265 \u0001] \u2264 exp,\nwhich proves the \ufb01rst statement and the uniform mixture case since $\u00b50$ = 1/\u221ap.\nTheorems 2 and 3 help us compare the mixture weight wpm obtained by training on a sample of\nsize pm versus the mixture weight vector w\n\u00b50. The regularization parameter \u03bb is a function of\nthe sample size. To simplify the analysis, we shall assume that \u03bb = O(1/m1/4) for a sample of\nsize m. A similar discussion holds for other comparable asymptotic behaviors. By Theorem 2,\n$wpm \u2212 w!$ converges to zero in O(1/(\u03bb\u221apm)) = O(1/(pm)1/4), since \u03bb = O(1/(pm)1/4) in\nthat case. But, by Theorem 3, the slack term bounding $w\n\u00b50 \u2212 w!$ converges to zero at the faster\nrate O(1/(\u03bb\u221apm)) = O(1/p1/2m1/4), since here \u03bb = O(1/m1/4). 
The expectation term appearing\nin the bound on \u2016w\u00b50 \u2212 w*\u2016, E[\u2016w\u00b50 \u2212 w*\u2016], does not benefit from the same convergence rate,\nhowever. E[\u2016w\u00b50 \u2212 w*\u2016] always converges as fast as the expectation E[\u2016wm \u2212 w*\u2016] for a weight\nvector wm obtained by training on a sample of size m since, by the triangle inequality, the following\nholds:\n\n$$\\mathrm{E}[\\|w_{\\mu} - w^*\\|] = \\mathrm{E}[\\| \\frac{1}{p} \\sum_{k=1}^{p} (w_k - w^*) \\|] \\le \\frac{1}{p} \\sum_{k=1}^{p} \\mathrm{E}[\\|w_k - w^*\\|] = \\mathrm{E}[\\|w_1 - w^*\\|]. \\qquad (14)$$\n\nBy the proof of Theorem 2, E[\u2016w1 \u2212 w*\u2016] \u2264 R/(\u03bb\u221a(m/2)) = O(1/(\u03bb\u221am)), thus\n\n$$\\mathrm{E}[\\|w_{\\mu} - w^*\\|] \\le O(1/m^{1/4}). \\qquad (15)$$\n\nIn summary, w\u00b50 always converges significantly faster than wm. The convergence\nbound for w\u00b50 contains two terms, one somewhat more favorable, one somewhat less than its\ncounterpart term in the bound for wpm.\n\nData set                pm       |Y|  |X|    sparsity  p\nEnglish POS [16]        1 M      24   500 K  0.001     10\nSentiment               9 M      3    500 K  0.001     10\nRCV1-v2 [14]            26 M     103  10 K   0.08      10\nSpeech                  50 M     129  39     1.0       499\nDeja News Archive       306 M    8    50 K   0.002     200\nDeja News Archive 250K  306 M    8    250 K  0.0004    200\nGigaword [10]           1,000 M  96   10 K   0.001     1000\n\nTable 2: Description of data sets. The column named sparsity reports the frequency of non-zero\nfeature values for each data set.\n\n4 Experiments\nWe ran a number of experiments on data sets ranging in size from 1M to 1B labeled instances (see\nTable 2) to compare the three distributed training methods described in Section 2. Our experiments\nwere carried out using a large cluster of commodity machines with a local shared disk space and a\nhigh rate of connectivity between each machine and between machines and disk. 
Thus, while the\nprocesses did not run on one multi-core supercomputer, the network latency between machines was\nminimized.\nWe report accuracy, wall clock, cumulative CPU usage, and cumulative network usage for all of our\nexperiments. Wall clock measures the combined effects of the user CPU and latency costs (column\n1 of Table 1), and includes the total time for training, including all summations. Network usage\nmeasures the amount of data transferred across the network. Due to the set-up of our cluster, this\nincludes both machine-to-machine traf\ufb01c and machine-to-disk traf\ufb01c. The resource estimates were\ncalculated by point-sampling and integrating over the sampling time. For all three methods, we used\nthe same base implementation of conditional maximum entropy, modi\ufb01ed only in whether or not the\ngradient was computed in a distributed fashion.\nOur \ufb01rst set of experiments were carried out with \u201cmedium\u201d scale data sets containing 1M-300M in-\nstances. These included: English part-of-speech tagging, generated from the Penn Treebank\n[16] using the \ufb01rst character of each part-of-speech tag as output, sections 2-21 for training, section\n23 for testing and a feature representation based on the identity, af\ufb01xes, and orthography of the in-\nput word and the words in a window of size two; Sentiment analysis, generated from a set of\nonline product, service, and merchant reviews with a three-label output (positive, negative, neutral),\nwith a bag of words feature representation; RCV1-v2 as described by [14], where documents having\nmultiple labels were included multiple times, once for each label; Acoustic Speech Data, a 39-\ndimensional input consisting of 13 PLP coef\ufb01cients, plus their \ufb01rst and second derivatives, and 129\noutputs (43 phones \u00d7 3 acoustic states); and the Deja News Archive, a text topic classi\ufb01cation\nproblem generated from a collection of Usenet discussion forums from the years 1995-2000. 
For all\ntext experiments, we used random feature mixing [9, 20] to control the size of the feature space.\nThe results reported in Table 3 show that the accuracy of the mixture weight method consistently\nmatches or exceeds that of the majority vote method. As expected, the resource costs here are\nsimilar, with slight differences due to the point-sampling methods and the overhead associated with\nstoring p models in memory and writing them to disk. For some data sets, we could not report\nmajority vote results as all models could not \ufb01t into memory on a single machine.\nThe comparison shows that in some cases the mixture weight method takes longer and achieves\nsomewhat better performance than the distributed gradient method while for other data sets it ter-\nminates faster, at a slight loss in accuracy. These differences may be due to the performance of the\noptimization with respect to the regularization parameter \u03bb. However, the results clearly demon-\nstrate that the mixture weight method achieves comparable accuracies at a much decreased cost in\nnetwork bandwidth \u2013 upwards of 1000x. Depending on the cost model assessed for the underlying\nnetwork and CPU resources, this may make mixture weight a signi\ufb01cantly more appealing strategy.\nIn particular, if network usage leads to signi\ufb01cant increases in latency, unlike our current experi-\nmental set-up of high rates of connectivity, then the mixture weight method could be substantially\nfaster to train. The outlier appears to be the acoustic speech data, where both mixture weight and\ndistributed gradient have comparable network usage, 158GB and 200GB, respectively. 
However, the\nbulk of this comes from the fact that the data set itself is 157GB in size, which makes the network\nusage closer to 1GB for the mixture weight and 40GB for distributed gradient method when we\ndiscard machine-to-disk traffic.\n\nData set                   Training Method       Accuracy  Wall Clock  Cumulative CPU  Network Usage\nEnglish POS (m=100k,p=10)  Distributed Gradient  97.60%    17.5 m      11.0 h          652 GB\n                           Majority Vote         96.80%    12.5 m      18.5 h          0.686 GB\n                           Mixture Weight        96.80%    5 m         11.5 h          0.015 GB\nSentiment (m=900k,p=10)    Distributed Gradient  81.18%    104 m       123 h           367 GB\n                           Majority Vote         81.25%    131 m       168 h           3 GB\n                           Mixture Weight        81.30%    110 m       163 h           9 GB\nRCV1-v2 (m=2.6M,p=10)      Distributed Gradient  27.03%    48 m        407 h           479 GB\n                           Majority Vote         26.89%    54 m        474 h           3 GB\n                           Mixture Weight        27.15%    56 m        473 h           0.108 GB\nSpeech (m=100k,p=499)      Distributed Gradient  34.95%    160 m       511 h           200 GB\n                           Mixture Weight        34.99%    130 m       534 h           158 GB\nDeja (m=1.5M,p=200)        Distributed Gradient  64.74%    327 m       733 h           5,283 GB\n                           Mixture Weight        65.46%    316 m       707 h           48 GB\nDeja 250K (m=1.5M,p=200)   Distributed Gradient  67.03%    340 m       698 h           17,428 GB\n                           Mixture Weight        66.86%    300 m       710 h           65 GB\nGigaword (m=1M,p=1k)       Distributed Gradient  51.16%    240 m       18,598 h        13,000 GB\n                           Mixture Weight        50.12%    215 m       17,998 h        21 GB\n\nTable 3: Accuracy and resource costs for distributed training strategies.\n\nFor the largest experiment, we examined the task of predicting the next character in a sequence\nof text [19], which has implications for many natural language processing tasks. As a training\nand evaluation corpus we used the English Gigaword corpus [10] and used the full ASCII output\nspace of that corpus of around 100 output classes (uppercase and lowercase alphabet character\nvariants, digits, punctuation, and whitespace). 
For each character s, we designed a set of observed\nfeatures based on substrings from s\u22121, the previous character, to s\u221210, 9 previous characters, and\nhashed each into a 10k-dimensional space in an effort to improve speed. Since there were around\n100 output classes, this led to roughly 1M parameters. We then sub-sampled 1B characters from\nthe corpus as well as 10k testing characters and established a training set of 1000 subsets, of 1M\ninstances each. For the experiments described above, the regularization parameter \u03bb was kept fixed\nacross the different methods. Here, we decreased the parameter \u03bb for the distributed gradient method\nsince less regularization was needed when more data was available, and since there were three orders\nof magnitude difference between the training size for each independent model and the distributed\ngradient. We compared only the distributed gradient and mixture weight methods since the majority\nvote method exceeded memory capacity. On this data set, the network usage is on a different scale\nthan most of the previous experiments, though comparable to Deja 250K, with the distributed gradient\nmethod transferring 13TB across the network. Overall, the mixture weight method consumes fewer\nresources: less bandwidth and less time (both wall clock and CPU). With respect to accuracy, the\nmixture weight method does only slightly worse than the distributed gradient method. The individual\nmodels in the mixture weight method ranged from 49.73% to 50.26%, with a mean accuracy\nof 50.07%, so a mixture weight model improves slightly over a random subsample model and\ndecreases the overall variance.\n5 Conclusion\nOur analysis and experiments give significant support for the mixture weight method for training\nvery large-scale conditional maximum entropy models with L2 regularization. 
Empirical results\nsuggest that this method achieves similar or better accuracies while reducing network usage by\nabout three orders of magnitude and modestly reducing the wall clock time, typically by about 15%\nor more. In distributed environments without a high rate of connectivity, the decreased network\nusage of the mixture weight method should lead to substantial gains in wall clock as well.\nAcknowledgments\nWe thank Yishay Mansour for his comments on an earlier version of this paper.\n\n8\n\n\fReferences\n[1] A. Berger, V. Della Pietra, and S. Della Pietra. A maximum entropy approach to natural\n\nlanguage processing. Computational Linguistics, 22(1):39\u201371, 1996.\n\n[2] O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning\n\nResearch, 2:499\u2013526, 2002.\n\n[3] S. F. Chen and R. Rosenfeld. A survey of smoothing techniques for ME models. IEEE Trans-\n\nactions on Speech and Audio Processing, 8(1):37\u201350, 2000.\n\n[4] C. Chu, S. Kim, Y. Lin, Y. Yu, G. Bradski, A. Ng, and K. Olukotun. Map-Reduce for machine\n\nlearning on multicore. In Advances in Neural Information Processing Systems, 2007.\n\n[5] M. Collins, R. Schapire, and Y. Singer. Logistic regression, AdaBoost and Bregman distances.\n\nMachine Learning, 48, 2002.\n\n[6] C. Cortes, M. Mohri, M. Riley, and A. Rostamizadeh. Sample selection bias correction theory.\n\nIn Proceedings of ALT 2008, volume 5254 of LNCS, pages 38\u201353. Springer, 2008.\n\n[7] J. Darroch and D. Ratcliff. Generalized iterative scaling for log-linear models. The Annals of\n\nMathematical Statistics, pages 1470\u20131480, 1972.\n\n[8] S. Della Pietra, V. Della Pietra, J. Lafferty, R. Technol, and S. Brook. Inducing features of\nrandom \ufb01elds. IEEE transactions on pattern analysis and machine intelligence, 19(4):380\u2013\n393, 1997.\n\n[9] K. Ganchev and M. Dredze. Small statistical models by random feature mixing. 
In Workshop on Mobile Language Processing, ACL, 2008.\n\n[10] D. Graff, J. Kong, K. Chen, and K. Maeda. English Gigaword Third Edition. Linguistic Data Consortium, Philadelphia, 2007.\n\n[11] E. T. Jaynes. Information theory and statistical mechanics. Physical Review, 106(4):620\u2013630, 1957.\n\n[12] J. Jeon and R. Manmatha. Using maximum entropy for automatic image annotation. In International Conference on Image and Video Retrieval, 2004.\n\n[13] G. Lebanon and J. Lafferty. Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, pages 447\u2013454, 2001.\n\n[14] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361\u2013397, 2004.\n\n[15] R. Malouf. A comparison of algorithms for maximum entropy parameter estimation. In International Conference on Computational Linguistics (COLING), 2002.\n\n[16] M. Marcus, M. Marcinkiewicz, and B. Santorini. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313\u2013330, 1993.\n\n[17] C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148\u2013188. Cambridge University Press, Cambridge, 1989.\n\n[18] J. Nocedal and S. Wright. Numerical Optimization. Springer, 1999.\n\n[19] C. E. Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30:50\u201364, 1951.\n\n[20] K. Weinberger, A. Dasgupta, J. Langford, A. Smola, and J. Attenberg. Feature hashing for large scale multitask learning. In International Conference on Machine Learning, 2009.\n\n[21] T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. 
In International Conference on Machine Learning, 2004.", "award": [], "sourceid": 345, "authors": [{"given_name": "Ryan", "family_name": "Mcdonald", "institution": null}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": null}, {"given_name": "Nathan", "family_name": "Silberman", "institution": null}, {"given_name": "Dan", "family_name": "Walker", "institution": null}, {"given_name": "Gideon", "family_name": "Mann", "institution": null}]}