{"title": "Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 9971, "page_last": 9980, "abstract": "Training deep neural networks requires an exorbitant amount of computation resources, including a heterogeneous mix of GPU and CPU devices. It is critical to place operations in a neural network on these devices in an optimal way, so that the training process can complete within the shortest amount of time. The state-of-the-art uses reinforcement learning to learn placement skills by repeatedly performing Monte-Carlo experiments. However, due to its equal treatment of placement samples, we argue that there remains ample room for significant improvements. In this paper, we propose a new joint learning algorithm, called Post, that integrates cross-entropy minimization and proximal policy optimization to achieve theoretically guaranteed optimal efficiency. In order to incorporate the cross-entropy method as a sampling technique, we propose to represent placements using discrete probability distributions, which allows us to estimate an optimal probability mass by maximal likelihood estimation, a powerful tool with the best possible efficiency. 
We have implemented Post on the Google Cloud platform, and our extensive experiments with several popular neural network training benchmarks have demonstrated clear evidence of superior performance: with the same amount of learning time, it leads to placements whose training times are up to 63.7% shorter than the state-of-the-art.", "full_text": "Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization\n\nYuanxiang Gao 1,2, Li Chen 3, Baochun Li 1\n\n1 Department of Electrical and Computer Engineering, University of Toronto\n2 School of Information and Communication Engineering, University of Electronic Science and Technology of China\n3 School of Computing and Informatics, University of Louisiana at Lafayette\nyuanxiang@ece.utoronto.ca, li.chen@louisiana.edu, bli@ece.toronto.edu\n\nAbstract\n\nTraining deep neural networks requires an exorbitant amount of computational resources, including a heterogeneous mix of GPU and CPU devices. It is critical to place the operations of a neural network on these devices in an optimal way, so that the training process can complete within the shortest amount of time. The state-of-the-art uses reinforcement learning to learn placement skills by repeatedly performing Monte-Carlo experiments. However, because it treats all placement samples equally, we argue that there remains ample room for significant improvement. In this paper, we propose a new joint learning algorithm, called Post, that integrates cross-entropy minimization and proximal policy optimization to achieve theoretically guaranteed optimal efficiency. To incorporate the cross-entropy method as a sampling technique, we propose to represent placements using discrete probability distributions, which allows us to estimate an optimal probability mass by maximum likelihood estimation, a powerful tool with the best possible efficiency. 
We have implemented Post on the Google Cloud platform, and our extensive experiments with several popular neural network training benchmarks have demonstrated clear evidence of superior performance: with the same amount of learning time, it leads to placements whose training times are up to 63.7% shorter than the state-of-the-art.\n\n1 Introduction\n\nWith the increasing demand for computing resources to train today's deep neural networks (DNNs), it has become typical to leverage a heterogeneous mix of both CPU and GPU devices [1, 2]. In such a distributed training environment, it is important to specify how each operation in a neural network should be placed on these CPU and GPU devices, referred to as the device placement problem. The objective is to find a placement of operations on devices so that the time required to train the neural network is minimized.\n\nIn the recent literature, Mirhoseini et al. [3, 4] proposed to solve the device placement problem with a reinforcement learning approach, based on the policy gradient method [5]. Unfortunately, this method is inefficient because it relies on the Monte Carlo method to generate samples, which treats each data sample equally without emphasizing important ones. To improve the learning efficiency, an importance sampling technique called the cross-entropy method [6, 7, 8] becomes promising, due to its theoretical guarantee of optimal efficiency when generating samples [9].\n\nIn this paper, our first contribution is to apply this importance sampling technique to device placement and model the problem as a cross-entropy minimization problem. An example of the placement probability distribution for a DNN with two operations is shown in Fig. 1. 
Each possible placement is associated with a probability, according to a two-dimensional probability mass function.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nFigure 1: Adjusting the probability distribution of the placement for two operations with cross-entropy minimization. The x-axis represents the device to place the first operation, the y-axis represents the device to place the second operation, and the z-axis represents the probability of selecting a placement.\n\nInitially, all placements have equal probability (left). With cross-entropy minimization, a better probability distribution can be learned, under which placements with shorter training times have higher probabilities (middle). Eventually, the probability distribution converges to a degenerate distribution, under which the optimal placement has a probability of 1.\n\nAs cross-entropy minimization adjusts the distribution only after a batch of placement trials, it lacks the finer granularity of a reinforcement learning algorithm, which can fine-tune the distribution after every placement trial. Our original contribution in this paper focuses on the design and implementation of a new learning algorithm, called Post, that integrates cross-entropy minimization and proximal policy optimization [10, 11], a state-of-the-art reinforcement learning algorithm. With Post, cross-entropy minimization is applied to each batch of trials to reduce the sample space of placements for a higher sample efficiency. 
Within a batch, proximal policy optimization is applied to each placement trial to incrementally improve the placement distribution, which achieves a further performance boost in the learning process.\n\nWe have evaluated Post with a wide variety of deep neural network models in TensorFlow, including Inception-V3 [12], ResNet [13], Language Modeling [14] and Neural Machine Translation [15]. For an extensive array of experiments, we have deployed Post on up to 96 CPU and GPU nodes in the Google Cloud platform for distributed training. Our experimental results have demonstrated a substantial improvement in the learning efficiency achieved by Post, compared with using the policy gradient method [3], proximal policy optimization only [16], or cross-entropy minimization only. Thanks to its efficiency, Post is able to find better placements for the ResNet, Inception-V3 and NMT benchmarks, leading to 24.0% to 63.7% shorter training times than the policy gradient method [3]. Further, detailed runtime profiles clearly show that Post is able to discover a non-trivial placement "trick" to speed up the training process: it can parallelize the model without incurring undesirable communication delays across GPUs.\n\n2 Related Work\n\nWe classify the existing research efforts on device placement for deep neural networks into three categories as follows.\n\nDevice placement with reinforcement learning. Mirhoseini et al. [3] proposed a sequence-to-sequence RNN model as the parameterized policy for generating placements. As RNNs suffer from the vanishing (and exploding) gradient problem when predicting long sequences, this work manually assigns operations into hundreds of groups and then places groups on devices, thus reducing the length of the input sequence. More recently, they replaced the manual grouping with an automated grouping mechanism [4], which is a feedforward neural network. 
Both works used the policy gradient method to train their RNNs; in contrast, Spotlight [16] applied proximal policy optimization to achieve a faster training speed. Our work adopts a different approach, using softmax distributions rather than an RNN as the policy representation. Combined with proximal policy optimization and cross-entropy minimization, training softmax distributions solves the device placement problem with a lower training overhead, as demonstrated in our extensive experiments.\n\nCross-entropy method. Initially developed for variance reduction in rare-event simulations [9], the cross-entropy method has been shown to be an efficient method for solving combinatorial optimization problems by treating the optimal solution as a rare event to be discovered [17]. Due to its efficiency, the cross-entropy method achieves several orders of magnitude higher scores than the policy gradient method in Tetris games [8]. This work adapts the cross-entropy method to device placement by modeling device placement with multi-dimensional softmax distributions. We have derived a closed-form solution of cross-entropy minimization for softmax distributions, which has not been developed in the existing literature. Our proposed algorithm integrates the cross-entropy method with proximal policy optimization, which further improves its efficiency.\n\nProximal policy optimization. The theory of proximal policy optimization was proposed by Schulman et al. [18, 11], where an objective function was derived as a lower bound on the performance of the new policy. This method shows superior performance in high-dimensional continuous control problems. Heess et al. [10] proposed a distributed algorithm for proximal policy optimization, which can successfully teach a simulated agent to perform complex tasks. 
Our work demonstrates that by incorporating periodic cross-entropy minimization for aggressive policy improvement, the learning performance of proximal policy optimization can be significantly improved.\n\n3 Algorithm Design\n\n3.1 Device Placement: Preliminaries\n\nFor a deep neural network with M operations, a particular placement is a vector of devices assigned to its operations, represented as d = {d_1, d_2, ..., d_m, ..., d_M}, where d_m ∈ {CPU, GPU0, GPU1, ...} is the device selected from the set of available devices to run the m-th operation. Given a placement d, the time it takes to train the neural network is denoted by T(d). Our goal is to find an optimal placement, so that the training time under this placement is minimized. Formally, the device placement problem can be formulated as\n\nγ* = min_{d ∈ A} T(d),   (1)\n\nwhere A denotes the set of all possible placements of the deep neural network.\n\nFor such a combinatorial optimization problem, Mirhoseini et al. [3, 4] generated placement samples from a placement distribution and used the policy gradient method [5] to improve the parameters of the distribution, so that better placements are generated with higher probability. However, the policy gradient method is sample inefficient [18], as it performs only a one-step gradient update for each data sample [11].\n\nTo achieve a higher sample efficiency, a more advanced policy gradient method, proximal policy optimization, was proposed in the recent reinforcement learning literature [11, 10]. The main idea is to derive a performance objective and perform multiple gradient updates towards optimizing this objective. Still, both the policy gradient method and proximal policy optimization perform only one or several steps of stochastic gradient descent/ascent (SGD/SGA), which are slow as they do not reach the optimum. 
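To make the combinatorial nature of problem (1) concrete, the following sketch enumerates all D^M placements of a toy three-operation model and picks the minimizer. The cost function step_time below is an invented stand-in for the measured training time T(d), not the paper's profiler:

```python
from itertools import product

DEVICES = ["CPU", "GPU0", "GPU1"]   # D = 3 available devices
M = 3                               # number of operations

def step_time(d):
    """Toy stand-in for T(d): GPUs are fast, but placing an op on a
    different device than its predecessor costs communication time."""
    compute = sum(1.0 if dev == "CPU" else 0.2 for dev in d)
    comm = sum(0.5 for a, b in zip(d, d[1:]) if a != b)
    return compute + comm

# Exhaustive search over A, the set of all D^M placements (Eq. (1)).
best = min(product(DEVICES, repeat=M), key=step_time)
print(best, step_time(best))
```

Exhaustive search is only viable at toy scale; with, say, 5 devices and 300 operations the search space already has 5^300 placements, which is why sampling from a learned distribution is needed instead.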
In contrast, the unique advantage of cross-entropy minimization is that it formulates the performance improvement as an optimization problem, which is directly solved to obtain the optimal solution.\n\nAlthough optimization-based cross-entropy minimization improves performance faster than SGD/SGA-based methods, it updates parameters only after each batch of placement samples is generated, without the online per-sample parameter updates of SGD/SGA-based methods. Therefore, it is intuitive that an elegantly designed combination of the two methods brings the fastest and the most frequent parameter improvement.\n\n3.2 Cross-Entropy Minimization for Device Placement\n\nFor the placement of the m-th operation, we use a parameter u_mj to represent the preference of choosing the j-th device. With a softmax function, we normalize the parameters over all the devices to obtain a probability distribution of the placement. In particular, the probability of choosing the j-th device for the m-th operation is given by:\n\nf(d_m = j | u_m) = e^{u_mj} / ∑_{i=1}^{D} e^{u_mi},   (2)\n\nwhere D is the total number of devices, and u_m denotes the vector of parameters {u_m1, u_m2, ..., u_mD} for the m-th operation. Similarly, we use u to denote the vector of parameters {u_1, u_2, ..., u_M} for the entire neural network. Observing that the placements of all the operations in the network can be configured independently in practice, the device selection for each operation can also be made independently. Thus, the joint distribution f(d | u), i.e., the placement distribution for the neural network, can be expressed as a product of marginal distributions, one for each operation:\n\nf(d | u) = ∏_{m=1}^{M} f(d_m | u_m).   (3)\n\nTo obtain the optimal distribution, which generates the optimal placement d* = arg min_d T(d) with probability 1, we iteratively update the parameters until convergence, starting from randomly generated values. 
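Eqs. (2) and (3) amount to only a few lines of code. The sketch below (illustrative only, with made-up parameter values) builds the per-operation softmax, evaluates the factorized joint, and samples a placement:

```python
import math, random

def softmax(u_m):
    # Eq. (2): f(d_m = j | u_m) = exp(u_mj) / sum_i exp(u_mi)
    z = [math.exp(x) for x in u_m]
    s = sum(z)
    return [x / s for x in z]

def placement_prob(d, u):
    # Eq. (3): the joint f(d | u) factorizes over operations
    p = 1.0
    for m, j in enumerate(d):
        p *= softmax(u[m])[j]
    return p

def sample_placement(u, rng):
    # Draw each operation's device independently from its marginal
    return [rng.choices(range(len(u_m)), weights=softmax(u_m))[0] for u_m in u]

# Two operations, three devices, all parameters zero -> uniform (Fig. 1, left)
u = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(placement_prob([1, 2], u))           # 1/9 for every placement
print(sample_placement(u, random.Random(0)))
```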
To be specific, starting from the initial parameters u^(0), we gradually improve the joint distribution through a sequence of target distributions, with parameters u^(1), u^(2), .... The target distribution at the t-th iteration is defined as the conditional distribution\n\nf(d | u^(t), T(d) ≤ γ_t),   (4)\n\nwhere γ_t is a constant and γ_0 > γ_1 > ... > γ*. According to the definition of conditional probability, Eq. (4) can be expressed as\n\nf(d | u^(t), T(d) ≤ γ_t) = f(d, T(d) ≤ γ_t | u^(t)) / f(T(d) ≤ γ_t | u^(t)) = I{T(d) ≤ γ_t} f(d | u^(t)) / ∑_d I{T(d) ≤ γ_t} f(d | u^(t)),   (5)\n\nwhere I{·} is the indicator function, which equals 1 when the condition holds and 0 otherwise. Intuitively, Eq. (5) represents a normalized density of the good placements (those whose training times are no larger than γ_t) under f(d | u^(t)). Through this normalization, the old distribution f(d | u^(t)) is shifted towards the target distribution. As γ_t decreases over iterations, the probability of generating good placements from the target distribution keeps increasing.\n\nWith this intuition, we now describe in detail how the parameters are updated at each iteration t. The new parameters u^(t+1) are obtained by minimizing the stochastic distance, specifically the Kullback-Leibler (KL) divergence, between the target distribution and the parametrized distribution:\n\nmin_{u^(t+1)} ∑_d [ f(d | u^(t), T(d) ≤ γ_t) log f(d | u^(t), T(d) ≤ γ_t) − f(d | u^(t), T(d) ≤ γ_t) log f(d | u^(t+1)) ],   (6)\n\nwhere the target distribution is given in Eq. (5). 
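As a numeric illustration of Eq. (5) (with invented probabilities and training times for four placements of a two-operation model), conditioning simply zeroes out the slow placements and renormalizes:

```python
# Toy version of Eq. (5) for four placements of a 2-op model.
# f_old: current placement probabilities; T: made-up training times.
f_old = {"AA": 0.25, "AB": 0.25, "BA": 0.25, "BB": 0.25}
T     = {"AA": 9.0,  "AB": 4.0,  "BA": 6.0,  "BB": 2.0}
gamma = 5.0   # the threshold gamma_t

# Numerator: indicator-weighted density; denominator: its total mass.
weighted = {d: (p if T[d] <= gamma else 0.0) for d, p in f_old.items()}
Z = sum(weighted.values())
f_target = {d: w / Z for d, w in weighted.items()}
print(f_target)   # mass concentrates on "AB" and "BB"
```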
As the first term of the objective is constant with respect to u^(t+1), the KL-divergence minimization is equivalent to the following cross-entropy minimization (after dropping the normalizing constant of Eq. (5), which does not affect the minimizer):\n\nmin_{u^(t+1)} − ∑_d I{T(d) ≤ γ_t} f(d | u^(t)) log f(d | u^(t+1)),   (7)\n\nwhich can be further transformed into its equivalent expectation form:\n\nmin_{u^(t+1)} − E_{u^(t)}[ I{T(d) ≤ γ_t} log f(d | u^(t+1)) ].   (8)\n\nTo solve this problem, we replace the expectation with its average over N samples and minimize the sample average:\n\nmin_{u^(t+1)} − (1/N) ∑_{n=1}^{N} I{T(d^(n)) ≤ γ_t} log f(d^(n) | u^(t+1)),   (9)\n\nwhere d^(n) is the n-th sampled placement. Among the N sampled placements, the subset satisfying the condition is determined by the constant γ_t. Following the typical cross-entropy method [17], we set γ_t so that the fraction of placements satisfying the condition is ρ (e.g., ρ = 0.1). More specifically, γ_t is set to the ρN-th lowest training time among all the sampled placements. The cross-entropy minimization in Eq. (9) is a maximum likelihood estimator (MLE) of the target distribution, which has theoretically guaranteed optimal sample efficiency [19]. As a convex optimization problem [20], problem (9) can be efficiently solved by a gradient-based method to achieve the global minimum.\n\nFurther, we have derived the following closed-form solution to problem (9) for softmax distributions.\n\nTheorem 1. The solution of the cross-entropy minimization problem in (9) for softmax distributions has the following closed form:\n\nf(d_m = j | u_m^(t+1)) = (1/P) ∑_{p=1}^{P} I{d_m^(p) = j},   (10)\n\nwhere p indexes the promising placements, i.e., those satisfying T(d^(p)) ≤ γ_t, and P = ρN is the total number of promising placements.\n\nProof: Please refer to the supplementary material for a proof of this theorem.\n\nEq. 
(10) gives an efficient and gradient-free approach to solving the cross-entropy minimization problem. Among the promising placements, we count the frequency with which the j-th device is allocated to the m-th operation, and use it as the placement probability under the new parameters.\n\n3.3 Joint Learning with Proximal Policy Optimization\n\nCross-entropy minimization is a batch learning algorithm, as it updates parameters only after a large number (N, which can be hundreds) of placement samples. Such a batch learning algorithm lacks the incremental parameter improvement of online reinforcement learning after every small number (K) of placement samples. Therefore, we integrate cross-entropy minimization with a recent reinforcement learning algorithm, proximal policy optimization [10, 18, 11], to achieve a further learning speedup. The objective of proximal policy optimization at the t-th iteration is as follows:\n\nmax_{u^(t+1)} (1/K) ∑_{n=tK+1}^{tK+K} ∑_{m=1}^{M} [ f(d_m^(n) | u_m^(t+1)) / f(d_m^(n) | u_m^(t)) ] (b − T(d^(n))) − β D_KL( f(d_m^(n) | u_m^(t)) || f(d_m^(n) | u_m^(t+1)) ),   (11)\n\nwhere b is the moving average of the training times across all the sampled placements, and D_KL(·||·) is the KL divergence between the old and new distributions. The coefficient β is an adaptive hyperparameter, whose value is adjusted at each iteration based on the calculated KL divergence and a target KL divergence [11], so as to avoid policy updates that are too large or too small. The term b − T(d^(n)) represents the reward signal for a placement, which is positive if its training time is shorter than the average and negative otherwise. This objective is a lower bound on the performance of the distribution under the new parameters [18]. Optimizing this lower bound leads to performance improvement over the old parameters.\n\nOur joint policy optimization algorithm, Post, is summarized in Algorithm 1. 
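The gradient-free update of Theorem 1 is just elite selection plus frequency counting. A minimal sketch (the sampled placements, times, and ρ below are synthetic):

```python
# Closed-form CE update (Eq. (10)): among the rho*N promising (elite)
# placements, the new probability of device j for operation m is the
# frequency with which j was chosen for m.
def ce_update(placements, times, rho, n_devices):
    P = max(1, int(rho * len(placements)))
    # gamma_t = P-th lowest training time -> keep the P fastest placements
    elite = sorted(range(len(placements)), key=lambda n: times[n])[:P]
    M = len(placements[0])
    probs = [[0.0] * n_devices for _ in range(M)]
    for n in elite:
        for m, j in enumerate(placements[n]):
            probs[m][j] += 1.0 / P
    return probs

placements = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 1)]
times      = [ 5.0,    1.0,    2.0,    9.0,    1.5 ]
print(ce_update(placements, times, rho=0.4, n_devices=2))
# -> [[1.0, 0.0], [0.0, 1.0]]: both elites placed op 0 on device 0, op 1 on device 1
```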
Within a loop, Post continuously samples placements from the distribution and evaluates their training times. For every K sampled placements, we perform several (e.g., 10) stochastic gradient ascent steps with respect to the objective of proximal policy optimization, which makes incremental policy improvements. For every N sampled placements (N is several to tens of times larger than K), we solve the cross-entropy minimization problem with Eq. (10) to achieve a global and aggressive policy improvement.\n\nAfter each cross-entropy minimization, the probability of choosing a particular device for an operation may become zero, which discourages the exploration of further potential placements. For better exploration, we mix the distribution with a uniform distribution, resulting in a probability of ε (e.g., 0.1) that a device is selected uniformly from all the available devices, and a probability of 1 − ε that a device is chosen according to the solution of cross-entropy minimization. For softmax distributions, the update rule for stochastic gradient ascent with respect to the objective of proximal policy optimization can be expressed in closed form, as derived in our supplementary material.\n\nAlgorithm 1 Post: Joint Policy Optimization\n1: Initialize parameters u^(0) as all zeros; initialize t = 0;\n2: for n = 1, 2, ..., L do\n3:    Sample a placement d^(n) ~ f(d | u^(t));\n4:    Train the DNN under d^(n) and record T(d^(n));\n5:    if n mod K == 0 and n mod N ≠ 0 then\n6:        Perform several (e.g., 10) stochastic gradient ascent steps w.r.t. the objective of proximal policy optimization in Eq. (11);\n7:        t = t + 1;\n8:    end if\n9:    if n mod N == 0 then\n10:       Solve the cross-entropy minimization using Eq. (10) to achieve a global minimum;\n11:       Mix the new distribution: f(d_m | u_m^(t+1)) = (1 − ε) f(d_m | u_m^(t+1)) + ε/D, ∀m;\n12:       t = t + 1;\n13:   end if\n14: end for\n\n4 Experimental Evaluation\n\nIn this section, we present our implementation and evaluation of Post on the Google Cloud platform with an extensive set of deep learning models. Experimental results demonstrate the superior performance of Post in comparison with existing mechanisms in the literature. A more in-depth analysis of runtime profiles is also presented to better understand the advantages of our algorithm.\n\nSetup. We have conducted our experiments with 12 machines on the Google Cloud platform. Each machine is equipped with 26 GB of main memory, an Intel Broadwell 8-core CPU, and 2, 4 or 8 NVIDIA Tesla K80 GPUs, each with 11 GB of memory.\n\nThe training of Post progresses in a distributed fashion, with a parameter server maintaining the parameters of the softmax placement distributions and 12 workers. In particular, at each iteration of learning, the parameter server samples the distributions to obtain 12 placements, each to be evaluated at a worker. After receiving the training times from all the workers, the parameter server applies proximal policy optimization to update the parameters. After five iterations, the parameter server further updates its parameters by solving cross-entropy minimization.\n\nIn our experiments, we train the placement parameters for a neural network with a total of 2400 sampled placements. For each sampled placement, a worker configures the placement accordingly and trains the network for 11 steps. As the first step involves initial configuration, we use the average training time of the following 10 steps to measure the performance of the placement.\n\nTo handle placements that result in an Out-Of-Memory (OOM) error, each such placement is assigned a large per-step training time (e.g., 100 seconds). 
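Putting the pieces together, the control flow of Algorithm 1 can be sketched end-to-end. Everything below is a simplified stand-in: toy_T replaces the measured per-step training time, and the incremental update is a plain REINFORCE-style ascent rather than the KL-penalized PPO surrogate of Eq. (11):

```python
import math, random

D, M = 2, 3                  # devices, operations
K, N, L = 4, 12, 36          # PPO batch, CE batch, total samples
EPS, RHO, LR = 0.1, 0.25, 0.5
rng = random.Random(0)

def softmax(row):
    z = [math.exp(v - max(row)) for v in row]
    s = sum(z)
    return [v / s for v in z]

def toy_T(d):
    # Stand-in for training-step measurement: placing every op on device 1 is best.
    return 1.0 + sum(0.5 for j in d if j != 1) + 0.1 * rng.random()

u = [[0.0] * D for _ in range(M)]        # logits, initialized as zeros
batch = []                               # (placement, time) since last CE update
for n in range(1, L + 1):
    d = tuple(rng.choices(range(D), weights=softmax(u[m]))[0] for m in range(M))
    batch.append((d, toy_T(d)))
    if n % K == 0 and n % N != 0:
        # Incremental step: ascend (b - T) * grad log f(d | u) over the last K samples.
        recent = batch[-K:]
        b = sum(t for _, t in recent) / K
        for d, t in recent:
            for m in range(M):
                p = softmax(u[m])
                for j in range(D):
                    g = ((1.0 if d[m] == j else 0.0) - p[j]) * (b - t)
                    u[m][j] += LR * g / K
    if n % N == 0:
        # Aggressive step: closed-form CE update (Eq. (10)) plus eps-mixing.
        P = max(1, int(RHO * len(batch)))
        elite = sorted(batch, key=lambda x: x[1])[:P]
        for m in range(M):
            freq = [sum(1 for d, _ in elite if d[m] == j) / P for j in range(D)]
            mixed = [(1 - EPS) * f + EPS / D for f in freq]
            u[m] = [math.log(p) for p in mixed]
        batch = []

print([softmax(u[m]).index(max(softmax(u[m]))) for m in range(M)])
```

The hyperparameters (K, N, ρ, ε, learning rate) loosely follow the ranges mentioned in the text but are otherwise arbitrary toy values.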
Our detailed parameter settings in Algorithm 1 are as follows: the learning rate of SGA is 1; the ratio ρ for choosing promising placements is 0.1 (the 6 best placements out of 60 samples over 5 iterations); the exploration factor ε is 0.1, linearly reduced to zero during learning; and the KL penalty coefficient β is initialized to 1 and adapted based on a target KL divergence of 0.03.\n\nWe have evaluated Post using four open-source benchmarks in TensorFlow: Inception-V3 [12, 21], ResNet [13, 21], RNN Language Modeling (RNNLM) [14], and Neural Machine Translation (NMT) [15, 22]. Inception-V3 and ResNet are two popular deep convolutional neural networks trained on the ImageNet dataset with a batch size of 32. RNNLM is a 4-layer RNN trained with a batch size of 64. NMT is a sequence-to-sequence encoder-decoder architecture trained on the WMT16 dataset with a batch size of 64.\n\nWe compare the training time performance, including both the average and the minimum training time, of the placements found by Post with the following baselines: Single GPU, which assigns all the operations to a single GPU, except for those without a GPU implementation; Policy Gradient, the device placement algorithm proposed by Google [3, 4], with the policy gradient update rule; Proximal Policy Optimization, the reinforcement learning algorithm [10, 11] used by Spotlight [16], performing SGA updates; Cross-Entropy Minimization, our proposed global optimization method in Theorem 1, without the updates from proximal policy optimization; Metis, an algorithm that partitions the network into 2 parts (2-GPU case), 4 parts (4-GPU case) or 8 parts (8-GPU case), each assigned to one device, according to a cost model of the operations in the network [23]; and Expert, a method that places the entire model on a single GPU for Inception-V3 and ResNet [12, 13], and assigns each LSTM layer to a separate GPU for NMT and RNNLM, collocating the embedding layer with the first LSTM layer, and the softmax layer and the attention layer with the last LSTM layer [15].\n\nDue to different hardware (CPU and GPU) and TensorFlow benchmark versions, a complete reproduction of the experiments in the previous works [4, 3] is not feasible. As a compromise, we have implemented and run the policy gradient method proposed in these papers in our hardware and software environments, so that a fair comparison with our proposed algorithm can be made.\n\nFigure 2: Average performance of Post, compared with the baselines of CE (Cross-Entropy Minimization), PG (Policy Gradient), and PPO (Proximal Policy Optimization). (a) ResNet-50 on 2 GPUs; (b) Inception-V3 on 4 GPUs; (c) ResNet-101 on 8 GPUs.\n\nTable 1: Per-step training time (in seconds) of the placements given by the baselines. The training time reduction is calculated relative to the policy gradient method. OOM stands for Out-Of-Memory. Errors of per-step times are within 2% of the average.\n\nModels          GPUs   Single GPU   Expert   Metis   Policy Gradient   Post   Reduction\nRNNLM             2       0.44       0.89     1.81        0.66         0.44     33.3%\nResNet-50         2       0.86       0.86     4.47        1.49         0.86     42.3%\nResNet-200        2       2.75       2.75    11.63        3.62         2.75     24.0%\nResNet-50         4       0.86       0.86     4.21        2.37         0.86     63.7%\nResNet-101        8       1.59       1.59     7.31        2.90         1.59     45.2%\nInception-V3      2       1.79       1.79     6.76        2.15         1.30     39.5%\nInception-V3      4       1.75       1.75     6.07        1.82         1.19     34.6%\nInception-V3      8       1.78       1.78     7.35        2.67         1.29     51.7%\nNMT (2-layer)     2       OOM        2.05     7.47        2.30         1.22     47.0%\nNMT (4-layer)     4       OOM        3.43    11.04        4.08         2.13     47.8%\nNMT (8-layer)     8       OOM        5.11    12.74        5.35         2.20     58.9%\n\nAverage Performance. Figure 2 presents the average per-step training time of the sampled placements for each neural network, along with the progress of learning the softmax distributions. 
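The Reduction column of Table 1 is (PG − Post)/PG; as a quick arithmetic spot-check on a few rows (values copied from the table):

```python
# Reduction = (PG - Post) / PG, spot-checked against Table 1.
rows = {                    # model: (policy_gradient, post, reported_pct)
    "RNNLM (2 GPUs)":       (0.66, 0.44, 33.3),
    "ResNet-50 (2 GPUs)":   (1.49, 0.86, 42.3),
    "ResNet-50 (4 GPUs)":   (2.37, 0.86, 63.7),
    "NMT 8-layer (8 GPUs)": (5.35, 2.20, 58.9),
}
for name, (pg, post, reported) in rows.items():
    pct = round(100 * (pg - post) / pg, 1)
    print(f"{name}: {pct}% (reported {reported}%)")
```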
Three neural networks are placed on different numbers of GPUs, corresponding to Figure 2(a)-(c), respectively.\n\nAs shown in Figure 2(a), where ResNet-50 is placed on 2 GPUs, the policy gradient method makes the slowest progress, with training times always longer than 2 seconds. Proximal policy optimization improves the training time at a faster rate than the policy gradient method, while cross-entropy minimization is faster still. Compared to these baselines, Post achieves the fastest improvement and obtains the shortest training time, as it integrates cross-entropy minimization into proximal policy optimization to simultaneously improve the placement distribution. In a similar vein, Post significantly outperforms the baselines when placing deeper networks on more GPUs, as demonstrated by the learning curves in Figures 2(b) and (c).\n\nFigure 3: Performance profile of the placement found by Post on Inception-V3 and NMT.\n\nBest Performance. As shown in Table 1, for RNNLM and ResNet, Post is identical to the Single GPU baseline. This is because the architecture of these networks is not suitable for parallel processing. For example, the consecutive convolutional layers of ResNet are densely connected with each other. As a result, partitioning them across multiple GPUs introduces a long data communication delay between GPUs. Due to its slower learning speed, the best placement found by the policy gradient method within the given amount of learning time is far from the optimum, and is worse than the Single GPU case.\n\nFor Inception-V3, Post is able to find optimal placements that outperform the Single GPU case. In all three cases, the policy gradient method fails to discover a placement that is better than Post or Single GPU within a pre-specified amount of learning time. 
Specifically, Post outperforms the policy gradient method with training time reductions of 39.5%, 34.6% and 51.7% on 2 GPUs, 4 GPUs and 8 GPUs, respectively. For NMT, the Single GPU baseline is no longer feasible due to the Out-Of-Memory error, and the Expert method performs the best among all the baselines. Post is able to find better placements that outperform the Expert method, while the policy gradient method cannot. Compared with the policy gradient method, Post finds better placements that reduce the training times by 47.0%, 47.8% and 58.9% for 2 GPUs, 4 GPUs and 8 GPUs, respectively.\n\nWe further present the end-to-end training time reduction resulting from the optimized placements discovered by Post. With Inception-V3 trained on 4 GPUs for 40,000 steps, it takes 13.5 hours to complete the training with the placement found by Post. In comparison, it requires 20 hours for the single GPU case and 20.5 hours with the placement found by the policy gradient method, which clearly demonstrates that Post saves 32.5% and 34.1% of the end-to-end training time, respectively.\n\nAnalysis of Discovered Placement. We present the placements our algorithm has discovered for Inception-V3 and NMT in Figures 3(a) and 3(d), in comparison with the placement given by the Expert method, shown in Figure 3(c).\n\nAs observed in Figure 3(a), a training step of Inception-V3 involves consecutive inception blocks, each with multiple parallel branches of convolutional layers. For the first four inception blocks, despite the four parallel branches, Post allocates only one of them to a different GPU. Since the first four branches are relatively lightly loaded, we believe the reason for such an allocation is that the communication overhead would outweigh the benefit of balanced load if the four branches were allocated to separate GPUs. 
It is interesting to observe that for the four blocks with heavier loads in the middle, Post increases the degree of parallelism by allocating more operations to more GPUs. With this allocation, the computation load on GPU0 and the communication overhead among GPUs are well balanced, so that the training time is minimized.\n\nFigure 3(c) illustrates the placement of a sequence-to-sequence RNN, which involves recurrent encoding-decoding steps on consecutive LSTM cells. With the placement from the Expert method, each layer is placed on a separate GPU. Due to massive connections between consecutive layers, outputs of the previous layer need to be transmitted across GPUs at every step of state transition, which incurs expensive communication overhead. In comparison, Post discovers a novel strategy to parallelize such a sequence-to-sequence model. As shown in Figure 3(d), rather than partitioning LSTM layers across GPUs, it divides functions (encoding or decoding) across GPUs. In this way, one GPU is dedicated to encoding tasks and another to decoding tasks.\n\n(Figure 3 panels: (a) placement found by Post on Inception-V3; (b) legend: Convolution, AvgPool, MaxPool, Concat, Softmax; (c) Expert placement of NMT; (d) Post placement on NMT; each panel shows per-GPU timelines in ms.) 
Such a strategy learned by Post makes great sense, as cutting along the encoder-decoder boundary incurs negligible communication overhead, which greatly reduces the communication delay and accelerates the training.

5 Conclusion

In this paper, we represented device placement as a high-dimensional softmax probability distribution, which translates the problem of finding the best placement into one of estimating the optimal density. With this new model, we developed a customized cross-entropy method to estimate the optimal placement, which theoretically guarantees the best possible efficiency. A highlight of our contribution is the integration of proximal policy optimization, an online reinforcement learning algorithm, with the cross-entropy method, a batch learning algorithm, to achieve faster policy improvement. We evaluated our proposed algorithm with an array of deep neural network models in TensorFlow. Our experiments demonstrated that Post achieves significantly better learning performance than the policy gradient method, proximal policy optimization, and the cross-entropy method alone.
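The cross-entropy loop summarized above can be sketched in a few lines: sample placements from per-operation categorical distributions, keep an elite fraction by cost, and re-estimate the distributions by maximum likelihood on the elite samples. This is a minimal illustration only; the cost function below is a hypothetical stand-in for measured per-step training time, and the integration with proximal policy optimization is omitted:

```python
import random

random.seed(0)
n_ops, n_devices = 6, 2

# One categorical distribution per operation (the softmax representation).
probs = [[1.0 / n_devices] * n_devices for _ in range(n_ops)]

def cost(placement):
    # Hypothetical stand-in for measured training time: load imbalance
    # across devices plus cross-device edges along a chain of operations.
    load = [placement.count(d) for d in range(n_devices)]
    comm = sum(a != b for a, b in zip(placement, placement[1:]))
    return (max(load) - min(load)) + comm

def sample(probs):
    return [random.choices(range(n_devices), weights=p)[0] for p in probs]

for _ in range(30):                       # cross-entropy iterations
    batch = [sample(probs) for _ in range(100)]
    batch.sort(key=cost)
    elite = batch[:10]                    # lowest-cost 10% of samples
    # MLE update: empirical device frequencies among the elite samples,
    # lightly smoothed so that no device becomes unreachable.
    for i in range(n_ops):
        counts = [sum(p[i] == d for p in elite) for d in range(n_devices)]
        probs[i] = [(c + 0.1) / (len(elite) + 0.1 * n_devices)
                    for c in counts]

best = [max(range(n_devices), key=lambda d: probs[i][d])
        for i in range(n_ops)]
print(best, cost(best))
```

In this toy setting the distributions concentrate on a balanced placement with a single cross-device cut, mirroring how the full method concentrates probability mass on low-training-time placements.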