Placeto: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning

Advances in Neural Information Processing Systems, pages 3981-3991

Ravichandra Addanki*, Shaileshh Bojja Venkatakrishnan, Shreyan Gupta, Hongzi Mao, Mohammad Alizadeh
MIT Computer Science and Artificial Intelligence Laboratory
{addanki, bjjvnkt, shreyang, hongzi, alizadeh}@mit.edu

Abstract

We present Placeto, a reinforcement learning (RL) approach to efficiently find device placements for distributed neural network training. Unlike prior approaches that only find a device placement for a specific computation graph, Placeto can learn generalizable device placement policies that can be applied to any graph. We propose two key ideas in our approach: (1) we represent the policy as performing iterative placement improvements, rather than outputting a placement in one shot; (2) we use graph embeddings to capture relevant information about the structure of the computation graph, without relying on node labels for indexing. These ideas allow Placeto to train efficiently and generalize to unseen graphs. Our experiments show that Placeto requires up to 6.1× fewer training steps to find placements that are on par with or better than the best placements found by prior approaches. Moreover, Placeto is able to learn a generalizable placement policy for any given family of graphs, which can then be used without any retraining to predict optimized placements for unseen graphs from the same family.
This eliminates the large overhead incurred by prior RL approaches, whose lack of generalizability necessitates re-training from scratch every time a new graph is to be placed.

1 Introduction & Related Work

The computational requirements for training neural networks have steadily increased in recent years. As a result, a growing number of applications [14, 21] use distributed training environments in which a neural network is split across multiple GPU and CPU devices. A key challenge for distributed training is how to split a large model across multiple heterogeneous devices to achieve the fastest possible training speed. Today device placement is typically left to human experts, but determining an optimal device placement can be very challenging, particularly as neural networks grow in complexity (e.g., networks with many interconnected branches) or approach device memory limits. In shared clusters, the task is made even more challenging by the interference and variability caused by other applications.

Motivated by these challenges, a recent line of work [13, 12, 6] has proposed an automated approach to device placement based on reinforcement learning (RL). In this approach, a neural network policy is trained to optimize the device placement through repeated trials. For example, Mirhoseini et al. [13] use a recurrent neural network (RNN) to process a computation graph and predict a placement for each operation. They show that the RNN, trained to minimize computation time, produces device placements that outperform both human experts and graph partitioning heuristics such as Scotch [18].
Subsequent work [12] improved the scalability of this approach with a hierarchical model and explored more sophisticated policy optimization techniques [6].

*Corresponding author

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Although RL-based device placement is promising, existing approaches have a key drawback: they require a significant amount of re-training to find a good placement for each computation graph. For example, Mirhoseini et al. [13] report 12 to 27 hours of training time to find the best device placement for several vision and natural language models; more recently, the same authors report 12.5 GPU-hours of training to find a placement for a neural machine translation (NMT) model [12]. While this overhead may be acceptable in some scenarios (e.g., training a stable model on large amounts of data), it is undesirable in many cases. For example, high device placement overhead is problematic during model development, which can require many ad-hoc model explorations. Also, in a shared, non-stationary environment, it is important to make a placement decision quickly, before the underlying environment changes.

Existing methods have high overhead because they do not learn generalizable device placement policies. Instead, they optimize the device placement for a single computation graph. Indeed, the training process in these methods can be thought of as a search for a good placement for one computation graph, rather than a search for a good placement policy for a class of computation graphs. Therefore, for a new computation graph, these methods must train the policy network from scratch.
Nothing learned from previous graphs carries over to new graphs, neither to improve placement decisions nor to speed up the search for a good placement.

In this paper, we present Placeto, a reinforcement learning (RL) approach to learn an efficient device placement algorithm for a given family of computation graphs. Unlike prior work, Placeto is able to transfer a learned placement policy to unseen computation graphs from the same family without requiring any retraining.

Placeto incorporates two key ideas to improve training efficiency and generalizability. First, it models the device placement task as finding a sequence of iterative placement improvements. Specifically, Placeto's policy network takes as input a current placement for a computation graph, along with one of its nodes, and outputs a device for that node. By applying this policy sequentially to all nodes, Placeto is able to iteratively optimize the placement. This placement improvement policy, operating on an explicitly-provided input placement, is simpler to learn than a policy representation that must output a final placement for the entire graph in one step.

Placeto's second idea is a neural network architecture that uses graph embeddings [3, 4, 8] to encode the computation graph structure in the placement policy. Unlike prior RNN-based approaches, Placeto's neural network policy does not depend on the sequential order of nodes or on an arbitrary labeling of the graph (e.g., to encode adjacency information). Instead, it naturally captures graph structure (e.g., parent-child relationships) via iterative message passing computations performed on the graph.

Our experiments show that Placeto learns placement policies that outperform the RNN-based approach on three neural network models: Inception-V3 [23], NASNet [28] and NMT [27].
For example, on the NMT model Placeto finds a placement that runs 16.5% faster than the RNN-based approach. Moreover, it also learns these placement policies substantially faster, with up to 6.1× fewer placement evaluations than the RNN approach. Given any family of graphs, Placeto learns a generalizable placement policy that can then be used to predict optimized placements for unseen graphs from the same family without any re-training. This avoids the large overheads incurred by RNN-based approaches, which must repeat the training from scratch every time a new graph is to be placed.

Concurrently with this work, Paliwal et al. [17] proposed using graph embeddings to learn a generalizable policy for device placement and schedule optimization. However, their approach does not involve optimizing placements directly; instead, a genetic search algorithm needs to be run for several thousands of iterations every time the placement for a new graph is to be optimized [17].

2 Learning Method

The computation graph of a neural network can be modeled as a graph G(V, E), where V denotes the atomic computational operations (also referred to as "ops") in the neural network, and E is the set of data communication edges. Each op v ∈ V performs a specific computational function (e.g., convolution) on input tensors that it receives from its parent ops. For a set of devices D = {d1, ..., dm}, a placement for G is a mapping π : V → D that assigns a device to each op. The goal of device placement is to find a placement π that minimizes ρ(G, π), the duration of G's execution

Figure 1: MDP structure of Placeto's device placement task.
At each step, Placeto updates the placement for one node (shaded) in the computation graph. These incremental improvements amount to the final placement at the end of an MDP episode.

when its ops are placed according to π. To reduce the number of placement actions, we partition ops into predetermined groups and place ops from the same group on the same device, similar to Mirhoseini et al. [12]. For ease of notation, we henceforth use G(V, E) to denote the graph of op groups. Here V is the set of op groups and E is the set of data communication edges between op groups. An edge is drawn between two op groups if there exists a pair of ops, from the respective op groups, that have an edge between them in the neural network.

Placeto finds an efficient placement for a given input computation graph by executing an iterative placement improvement policy on the graph. The policy is learned using RL over computation graphs that are structurally similar (i.e., coming from the same underlying probability distribution) to the input graph. In the following, we present the key ideas of this learning procedure: the Markov decision process (MDP) formalism in §2.1, graph embedding and the neural network architecture for encoding the placement policy in §2.2, and the training/testing methodology in §2.3.
We refer the reader to [22] for a primer on RL.

2.1 MDP Formulation

Let G be a family of computation graphs for which we seek to learn an effective placement policy. We consider an MDP where a state observation s comprises a graph G(V, E) ∈ G with the following features on each node v ∈ V: (1) the estimated run time of v, (2) the total size of the tensors output by v, (3) the current device placement of v, (4) a flag indicating whether v has been "visited" before, and (5) a flag indicating whether v is the "current" node for which the placement has to be updated. At the initial state s0 for a graph G(V, E), the nodes are assigned to devices arbitrarily, the visit flags are all 0, and an arbitrary node is selected as the current node.

At a step t in the MDP, the agent selects an action to update the placement for the current node v in state s_t. The MDP then transitions to a new state s_{t+1} in which v is marked as visited, and an unvisited node is selected as the new current node. The episode ends after n|V| steps, once the placement of each node has been updated n times, where n is a tunable hyper-parameter. Figure 1 illustrates this procedure for an example graph to be placed over two devices.

We consider two approaches for assigning rewards in the MDP: (1) assigning a zero reward at each intermediate step, and a reward equal to the negative run time of the final placement at the terminal step; (2) assigning an intermediate reward of r_t = ρ(s_t) − ρ(s_{t+1}) at the t-th step for each t = 0, 1, ..., n|V| − 1, where ρ(s) is the execution time of the placement in state s. Intermediate rewards can help improve credit assignment in long training episodes and reduce the variance of the policy gradient estimates [2, 15, 22].
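The episode structure and the two reward schemes above can be sketched concretely. The following is a minimal illustration; the toy `runtime` cost model and the trivial policy used below are assumptions for the example, not Placeto's learned components. Note that the intermediate rewards telescope, so their sum equals ρ(s_0) − ρ(s_{n|V|}).

```python
def rollout(graph, num_devices, runtime, policy, init_placement, n=1):
    """One MDP episode of n*|V| placement-update steps (Sec. 2.1).

    graph:    dict mapping each node to its list of children.
    runtime:  callable giving rho(s), the execution time of a placement.
    policy:   callable mapping (state, num_devices) -> device for the current node.
    Returns the intermediate rewards r_t = rho(s_t) - rho(s_{t+1}), the sparse
    terminal reward -rho(s_final), and the final placement.
    """
    state = {"placement": dict(init_placement),
             "visited": {v: False for v in graph},
             "current": None}
    rewards = []
    for _ in range(n):                       # each node's placement is updated n times
        for v in graph:
            state["current"] = v
            before = runtime(state["placement"])
            state["placement"][v] = policy(state, num_devices)   # action a_t
            state["visited"][v] = True
            rewards.append(before - runtime(state["placement"]))  # r_t
    terminal = -runtime(state["placement"])  # sparse alternative: reward only at the end
    return rewards, terminal, state["placement"]
```

A toy cost model that charges one unit per cross-device edge suffices to check that the two reward schemes agree up to the constant ρ(s_0).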
However, training with intermediate rewards is more expensive, as it requires determining the computation time of a placement at each step, as opposed to once per episode. We contrast the benefits of the two reward designs through evaluations in Appendix A.4. To find a valid placement that fits without exceeding the memory limit on the devices, we include a penalty in the reward proportional to the peak memory utilization whenever it crosses a certain threshold M (details in Appendix A.7).

2.2 Policy Network Architecture

Placeto learns effective placement policies by directly parametrizing the MDP policy using a neural network, which is then trained using a standard policy-gradient algorithm [26]. At each step t of the

Figure 2: Placeto's RL framework for device placement. The state input to the agent is represented as a DAG with features (such as computation types and current placement) attached to each node. The agent uses a graph neural network to parse the input and a policy network to output a probability distribution over devices for the current node. The incremental reward is the difference between the runtimes of consecutive placement plans.

Figure 3: Placeto's graph embedding approach. It maps the raw features associated with each op group to the device placement action. (a) Example computation graph of op groups.
The shaded node is taking the current placement action. (b) Two-way message passing scheme applied to all nodes in the graph; the figure shows a single round of message passing. (c) Partitioning the nodes with respect to the current op group (messages shown in bold). (d) Taking a placement action over two candidate devices for the current op group.

MDP, the policy network takes the graph configuration in state s_t as input and outputs an updated placement for the t-th node. However, to compute this placement action using a neural network, we first need to encode the graph-structured information of the state as a real-valued vector. Placeto achieves this vectorization via a graph embedding procedure that is implemented using a specialized graph neural network and learned jointly with the policy. Figure 2 summarizes how node placements are updated during each round of an RL episode. Next, we describe Placeto's graph neural network.

Graph embedding. Recent works [3, 4, 8, 11] have proposed graph embedding techniques that achieve state-of-the-art performance on a variety of graph tasks, such as node classification, link prediction and job scheduling. Moreover, the embeddings produced by these methods can generalize (and scale) to unseen graphs. Inspired by this line of work, we present a graph embedding architecture in Placeto for processing the raw features associated with each node in the computation graph. Our embedding approach is customized for the placement problem and has the following three steps (Figure 3):

1. Computing per-group attributes (Figure 3a).
As raw features for each op group, we use the total execution time of the ops in the group, the total size of their output tensors, a one-hot encoding of the device (e.g., device 1 or device 2) that the group is currently placed on, a binary flag indicating whether the current placement action is for this group, and a binary flag encoding whether a placement action has already been made for the group. We collect the runtime of each op on each device from on-device measurements (we refer to Appendix A.5 for details).

2. Local neighborhood summarization (Figure 3b). Using the raw features on each node, we perform a sequence of message passing steps [4, 8] to aggregate neighborhood information for each node. Letting x_v^{(i)} denote the node embedding of op group v after i rounds of message passing, the updates take the form

    x_v^{(i+1)} ← g(∑_{u ∈ ξ(v)} f(x_u^{(i)})),

where ξ(v) is the set of neighbors of v, and f, g are multilayer perceptrons with trainable parameters. We construct two directions of message passing (top-down from the root groups and bottom-up from the leaf groups) with separate parameters. The top-down messages summarize information about the subgraph of nodes that can reach v, while the bottom-up messages do so for the subgraph reachable from v. The parameters of the transformation functions f, g are shared across the message passing steps in each direction, among all nodes. As we show in our experiments (§3), reusing the same message passing function everywhere provides a natural way to transfer the learned policy to unseen computation graphs. We repeat the message passing updates k times, where k is a tunable hyperparameter that can be set to sweep the entire graph, or to integrate information from a local neighborhood of each node that is sufficient for making good placement decisions.
In our experiments, we found that sweeping the entire graph is computationally expensive and provides little benefit compared to performing a fixed number (e.g., k = 8) of iterations of message passing.

3. Pooling summaries (Figures 3c and 3d). After message passing, we aggregate the embeddings computed at each node to create a global summary of the entire graph. Specifically, for the node v for which a placement decision has to be made, we perform three separate aggregations: on the set S_parents(v) of nodes that can reach v, the set S_children(v) of nodes that are reachable from v, and the set S_parallel(v) of nodes that can neither reach nor be reached by v. On each set S_i(v), we perform the aggregation h_i(∑_{u ∈ S_i(v)} l_i(x_u)), where the x_u are node embeddings and h_i, l_i are multilayer perceptrons with trainable parameters, as above. Finally, node v's embedding and the results of the three aggregations are concatenated as input to the subsequent policy network.

The above three steps define an end-to-end policy mapping from the raw features associated with each op group to the device placement action.

2.3 Training

Placeto is trained using a standard policy-gradient algorithm [26] with a timestep-based baseline [7] (see Appendix A.1 for details). During each training episode, a graph from a set G_T of training graphs is sampled and used for performing the rollout. The neural network design of Placeto's graph embedding procedure and policy network allows the training parameters to be shared across episodes, regardless of the input graph's type or size. This allows Placeto to learn placement policies that generalize well to unseen graphs at test time. We present further details on training in §3.

3 Experimental Setup

3.1 Dataset

We use TensorFlow to generate a computation graph for any given neural network model, which can then be run to perform one step of stochastic gradient descent on a mini-batch of data.
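Stepping back to the policy architecture of §2.2 for a moment, the two-way message passing and pooling steps can be sketched in a few lines. For readability, the sketch uses scalar node features and fixed affine maps in place of the trainable multilayer perceptrons f, g, h_i, l_i; the example graph, feature values and update constants are our illustrative assumptions, not the authors' implementation.

```python
def reachable(adj, src):
    """All nodes reachable from src via directed edges (src excluded)."""
    seen, stack = set(), [src]
    while stack:
        for u in adj.get(stack.pop(), []):
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

def embed(adj, feats, k=2):
    """k rounds of two-way message passing (step 2): x_v <- g(sum_u f(x_u)),
    with f(x) = x and g(s) = raw_feature + 0.5 * s as toy stand-ins for the MLPs."""
    parents = {v: [u for u in adj if v in adj[u]] for v in adj}
    td, bu = dict(feats), dict(feats)
    for _ in range(k):
        td = {v: feats[v] + 0.5 * sum(td[u] for u in parents[v]) for v in adj}  # top-down
        bu = {v: feats[v] + 0.5 * sum(bu[u] for u in adj[v]) for v in adj}      # bottom-up
    return td, bu

def pool(adj, emb, v):
    """Pooling summaries (step 3): aggregate ancestors, descendants and
    parallel nodes separately, then concatenate with v's own embedding."""
    desc = reachable(adj, v)
    anc = {u for u in adj if v in reachable(adj, u)}
    par = set(adj) - desc - anc - {v}
    agg = lambda nodes: sum(emb[u] for u in nodes)   # stand-in for h_i(sum l_i(x_u))
    return [emb[v], agg(anc), agg(desc), agg(par)]   # input to the policy network
```

Because the same update functions are applied at every node, nothing in this computation depends on node indices or graph size, which is exactly the property that lets the learned policy transfer to unseen graphs.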
We evaluate our approach on computation graphs corresponding to the following three popular deep learning models: (1) Inception-V3 [23], a widely used convolutional neural network that has been successfully applied to a large variety of computer vision tasks; (2) NMT [27], a language translation model that uses an LSTM-based encoder-decoder and attention architecture for natural language translation; (3) NASNet [28], a computer vision model designed for image classification. For a more detailed description of these models, we refer to Appendix A.2.

We also evaluate on three synthetic datasets, each comprising 32 graphs and spanning a wide range of graph sizes and structures. We refer to these datasets as cifar10, ptb and nmt. Graphs from the cifar10 and ptb datasets are synthesized using an automatic model design approach called ENAS [19]. The nmt dataset is constructed by varying the RNN length and batch size hyperparameters of the NMT model [27]. We randomly split these datasets for training and test purposes. Graphs in the cifar10 and ptb datasets are grouped to have about 128 nodes each, whereas graphs from nmt have 160 nodes. Further details on how these datasets are constructed can be found in Appendix A.3.

3.2 Baselines

We compare Placeto against the following heuristics and baselines from prior work [13, 12, 6]:

(1) Single GPU, where all the ops in a model are placed on the same GPU. For graphs that can fit
For graphs that can \ufb01t\non a single device and don\u2019t have a signi\ufb01cant inherent parallelism in their structure, this baseline\ncan often lead to the fastest placement as it eliminates any cost of communication between devices.\n(2) Scotch [18], a graph-partitioning-based static mapper that takes as input the computation graph,\ncost associated with each node, amount of data associated with connecting edges, and then outputs\na placement which minimizes communication costs while keeping the load balanced across devices\nwithin a speci\ufb01ed tolerance.\n(3) Human expert. For NMT models, we place each LSTM layer on a separate device as recom-\nmended by Wu et al. [27]. We also colocate the attention and softmax layers with the \ufb01nal LSTM\n\n5\n\n\flayer. Similarly for vision models, we place each parallel branch on a different device.\n(4) RNN-based approach [12], in which the placement problem is posed as \ufb01nding a mapping from\nan input sequence of op-groups to its corresponding sequence of optimized device placements. An\nRNN model is used to learn this mapping. The RNN model has an encoder-decoder architecture\nwith content-based attention mechanism. We use an open source implementation from Mirhoseini\n,et.al. [12] available as part of the of\ufb01cial Tensor\ufb02ow repository [24]. We use the included hyperpa-\nrameter settings and tune them extensively as required.\n\n3.3 Training Details\n\nCo-location groups. To decide which set of ops have to be co-located in an op-group, we follow\nthe same strategy as described by Mirhoseini et al. [13] and use the \ufb01nal grouped graph as input to\nboth Placeto and the RNN-based approach. We found that even after this grouping, there could still\nbe a few operation groups with very small memory and compute costs left over. We eliminate such\ngroups by iteratively merging them with their neighbors as detailed in Appendix A.6.\nSimulator. 
Since it can take a long time to execute placements on real hardware and measure the elapsed time [12, 13], we built a reliable simulator that can quickly predict the runtime of any given placement for a given device configuration. We discuss how the simulator works and its accuracy in Appendix A.5. The simulator is used only for training purposes. All reported runtime improvements have been obtained by evaluating the learned placements on real hardware, unless explicitly specified otherwise.

Further details on the training of Placeto and of the RNN-based approach, including our choices of hyperparameter values, are given in Appendix A.7. We open-source our implementation, datasets and the simulator.^2

4 Results

In this section, we first evaluate the performance of Placeto and compare it with the aforementioned baselines (§4.1). Then we evaluate Placeto's generalizability compared to the RNN-based approach (§4.2). Finally, we provide empirical validation for Placeto's design choices (§4.3).

4.1 Performance

Table 1 summarizes the performance of Placeto and the baseline schemes for the Inception-V3, NMT and NASNet models. We quantify performance along two axes: (i) the runtime of the best placement found, and (ii) the time taken to find the best placement, measured for the RL-based schemes as the number of placement evaluations required during training.

For all considered graphs, Placeto rivals or outperforms the best competing scheme. Placeto also finds optimized placements much faster than the RNN-based approach. For Inception-V3 on 2 GPUs, Placeto finds a placement that is 7.8% faster than the expert placement. Additionally, it requires about 4.8× fewer samples than the RNN-based approach.
Similarly, for the NASNet model, Placeto outperforms the RNN-based approach using up to 4.7× fewer episodes. For the NMT model with 2 GPUs, Placeto optimizes placements to the same extent as the RNN-based scheme while using 3.5× fewer samples. For NMT distributed across 4 GPUs, Placeto finds a non-trivial placement that is 16.5% faster than the existing baselines. We visualize this placement in Figure 4. The expert placement heuristic for NMT fails to meet the memory constraints of the GPU devices. This is because, in an attempt to maximize parallelism, it places each layer on a different GPU, requiring the outputs of the i-th layer to be copied over to the GPU hosting the (i+1)-th layer. These copies have to be retained until they can be fed in as inputs to the co-located gradient operations during the back-propagation phase. This results in a large memory footprint, which ultimately leads to an out-of-memory (OOM) error. Placeto, on the other hand, learns to exploit parallelism and minimize the inter-device communication overheads while remaining within the memory constraints of all the devices.
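The memory argument above can be made concrete with a toy accounting model. The sketch below is our own illustration of the mechanism, not Placeto's actual memory model: each device retains the outputs of its own ops, plus a copy of every tensor it receives over a cross-device edge (held until back-propagation), and a penalty such as the threshold-M term of §2.1 can then be applied to the resulting peak.

```python
def peak_memory(graph, out_size, placement):
    """Toy per-device memory accounting: a device holds the outputs of its own
    ops plus copies of tensors received over cross-device edges."""
    mem = {}
    for v, children in graph.items():
        mem[placement[v]] = mem.get(placement[v], 0.0) + out_size[v]
        for u in children:
            if placement[u] != placement[v]:          # cross-device edge: retained copy
                mem[placement[u]] = mem.get(placement[u], 0.0) + out_size[v]
    return max(mem.values())

def memory_penalty(peak, threshold, weight=1.0):
    """Penalty subtracted from the reward when peak memory exceeds threshold M."""
    return weight * max(0.0, peak - threshold)
```

Under this accounting, splitting a chain of layers across devices does not reduce the footprint of a tensor that must be copied downstream, which is why maximizing parallelism alone can exhaust device memory.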
The above results show the advantage of Placeto's simpler policy representation: it is easier to learn a policy that incrementally improves placements than to learn a policy that decides the placements for all nodes in one shot.

^2 https://github.com/aravic/generalizable-device-placement

Model        | #GPUs | CPU only | Single GPU | Expert | Scotch | Placeto | RNN-based | Placeto (# sampled) | RNN-based (# sampled) | Runtime reduction | Speedup factor
Inception-V3 |   2   |  12.54   |   1.56     |  1.28  |  1.54  |  1.18   |   1.17    |   1.6 K             |   7.8 K               |  -0.85%           |  4.8×
Inception-V3 |   4   |          |            |  1.15  |  1.74  |  1.13   |   1.19    |   5.8 K             |  35.8 K               |   5%              |  6.1×
NMT          |   2   |  33.5    |   OOM      |  OOM   |  OOM   |  2.32   |   2.35    |  20.4 K             |  73 K                 |   1.3%            |  3.5×
NMT          |   4   |          |            |  OOM   |  OOM   |  2.63   |   3.15    |  94 K               |  51.7 K               |  16.5%            |  0.55×
NASNet       |   2   |  37.5    |   1.28     |  0.86  |  1.28  |  0.86   |   0.89    |   3.5 K             |  16.3 K               |   3.4%            |  4.7×
NASNet       |   4   |          |            |  0.84  |  1.22  |  0.74   |   0.76    |  29 K               |  37 K                 |   2.6%            |  1.3×

Table 1: Running times (in seconds) of placements found by Placeto compared with the RNN-based approach [13], Scotch and the human-expert baseline. The number of placement measurements needed to find the best placements for Placeto and the RNN-based approach are also shown (K stands for thousands). Reported runtimes are measured on real hardware. Runtime reductions and speedup factors are calculated with respect to the RNN-based approach. Lower runtimes and lower training times are better. OOM: Out of Memory. For the NMT model, the number of LSTM layers is chosen based on the number of GPUs.

Figure 4: Optimized placement across 4 GPUs for a 4-layer NMT model with attention, found by Placeto. The top LSTM layers correspond to the encoder and the bottom layers to the decoder. All layers are unrolled to a maximum sequence length of 32. Each color represents a different GPU.
This non-trivial placement meets the memory constraints of the GPUs, unlike the expert-based placement and the Scotch heuristic, which result in an Out of Memory (OOM) error. It also runs 16.5% faster than the placement found by the RNN-based approach.

4.2 Generalizability

We evaluate the generalizability of the learning-based schemes by training them over samples of graphs drawn from a specific distribution, and then using the learned policies to predict effective placements for unseen graphs from the same distribution at test time. If the placements predicted by a policy are as good as the placements found by separate optimizations over the individual test graphs, we conclude that the placement scheme generalizes well. Such a policy can then be applied to a wide variety of structurally-similar graphs without requiring re-training. We consider the three datasets of graphs called nmt, ptb and cifar10 for this experiment.

For each test graph in a dataset, we compare the placements generated by the following schemes: (1) Placeto Zero-Shot, a Placeto policy trained over graphs from the dataset and used to predict placements for the test graph without any further re-training; (2) Placeto Optimized, a Placeto policy trained specifically on the test graph to find an effective placement; (3) Random, a simple strawman policy that generates a placement for each node by sampling from a uniform random distribution. We define RNN Zero-Shot and RNN Optimized in a similar manner for the RNN-based approach.

Figure 5 shows CDFs of the runtimes of the placements generated by the above-defined schemes for test graphs from the nmt, ptb and cifar10 datasets. We see that the runtimes of the placements generated by Placeto Zero-Shot are very close to those generated by Placeto Optimized.
Due to Placeto's generalizability-first design, Placeto Zero-Shot avoids the significant overhead incurred by the Placeto Optimized and RNN Optimized approaches, which search through several thousands of placements before finding a good one.

Figure 5 also shows that RNN Zero-Shot performs significantly worse than RNN Optimized. In fact, its performance is very similar to that of Random. When trained on a graph, the RNN-based

Figure 5: CDFs of the runtimes of placements found by the different schemes for test graphs from the (a), (d) nmt, (b), (e) ptb and (c), (f) cifar10 datasets. The top row of figures ((a), (b), (c)) corresponds to Placeto and the bottom row ((d), (e), (f)) to the RNN-based approach. Placeto Zero-Shot performs almost on par with fully optimized schemes like Placeto Optimized and RNN Optimized, even without any re-training. In contrast, RNN Zero-Shot performs much worse and only slightly better than the randomly initialized policy used in the Random scheme.

approach learns a policy to search for an effective placement for that graph. However, this learned search strategy is closely tied to the assignment of node indices and the traversal order of the nodes in the graph, which are arbitrary and have meaning only within the context of that specific graph. As a result, the learned policy cannot be applied to graphs with a different structure, or even to the same graph under a different assignment of node indices or traversal order.

4.3 Placeto Deep Dive

In this section, we evaluate how the node traversal order of a graph during training affects the policies learned by the different learning schemes.
We also present an ablation study of Placeto's policy network architecture. In Appendix A.4 we conduct a similar study on the benefits of providing intermediate rewards in Placeto during training.

Node traversal order. Unlike the RNN-based approach, Placeto's use of a graph neural network eliminates the need to assign arbitrary indices to nodes while embedding the graph features. This aids Placeto's generalizability, and allows it to learn effective policies that are not tied to the specific node traversal orders seen during training. To verify this claim, we train Placeto and the RNN-based approach on the Inception-V3 model, following one of 64 fixed node traversal orders at each episode. We then use the learned policies to predict placements under 64 unseen random node traversal orders for the same model. With Placeto, we observe that the predicted placements have runtimes within 5% of that of the optimized placement on average, with a difference of about 10% between the fastest and slowest placements. However, the RNN-based approach predicts placements that are about 30% worse on average.

Alternative policy architectures. To highlight the role of Placeto's graph neural network architecture (§2.2), we consider the following two alternative policy architectures and compare their generalizability against Placeto's on the nmt dataset.

(1) Simple aggregator, in which a feed-forward network is used to aggregate all the node features of the input graph; the aggregate is then fed to another feed-forward network with softmax output units for predicting a placement.
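The aggregator baseline above can be sketched in a few lines of numpy. The layer sizes, weight shapes and function names below are illustrative assumptions, not details from the paper; the point is only that all node embeddings are pooled into a single vector before any placement decision is made:

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    # Two-layer feed-forward network with a ReLU hidden layer.
    h = np.maximum(0.0, x @ w1 + b1)
    return h @ w2 + b2

def simple_aggregator(node_feats, agg, out):
    # Baseline (1): one MLP embeds every node's raw features, the
    # embeddings are pooled by summation (discarding all graph
    # structure), and a second MLP with softmax outputs maps the
    # pooled vector to a distribution over devices.
    pooled = mlp(node_feats, *agg).sum(axis=0)   # shape: (hidden,)
    logits = mlp(pooled, *out)                   # shape: (num_devices,)
    e = np.exp(logits - logits.max())            # numerically stable softmax
    return e / e.sum()
```

Because the summation step discards adjacency information, a policy of this form sees only a structure-blind summary of the graph and cannot distinguish structurally different graphs with similar feature sums.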
This simple aggregator performs very poorly, with its predicted placements on the test dataset about 20% worse on average compared to Placeto's.

(2) Simple partitioner, in which the node features corresponding to the parent, child and parallel nodes (of the node for which a decision is to be made) are aggregated independently by three different feed-forward networks. Their outputs are then fed to a separate feed-forward network with softmax output units, as in the simple aggregator. Note that this is similar to Placeto's policy architecture (§2.2), except for the local neighborhood summarization step (i.e., step 2 in §2.2). The simple partitioner predicts placements that run 13% slower on average compared to Placeto's. Thus, local neighborhood aggregation and pooling summaries from parent, child and parallel nodes are both essential steps for transforming raw node features into generalizable embeddings in Placeto.

5 Conclusion

We presented Placeto, an RL-based approach for finding device placements that minimize the training time of deep-learning models. By structuring the policy decisions as incremental placement-improvement steps, and by using graph embeddings to encode graph structure, Placeto is able to train efficiently and learns policies that generalize to unseen graphs.

Placeto currently relies on a manual grouping procedure to consolidate ops into groups. An interesting future direction would be to extend our approach to work directly on large-scale graphs without manual grouping [17, 12]. Another interesting direction would be to learn placement decisions that exploit both model and data parallelism, which could significantly increase the achievable runtime improvements [9, 20].

Acknowledgements. We thank the anonymous NeurIPS reviewers for their feedback.
This work was funded in part by NSF grants CNS-1751009 and CNS-1617702, a Google Faculty Research Award, an AWS Machine Learning Research Award, a Cisco Research Center Award, an Alfred P. Sloan Research Fellowship, and the sponsors of the MIT Data Systems and AI Lab. We also gratefully acknowledge the CloudLab [5] and Chameleon [10] testbeds for providing compute environments for some of the experiments.

References

[1] EC2 instance types. https://aws.amazon.com/ec2/instance-types/, 2018. Accessed: 2018-10-19.

[2] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

[3] P. W. Battaglia et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.

[5] D. Duplyakin, R. Ricci, A. Maricq, G. Wong, J. Duerig, E. Eide, L. Stoller, M. Hibler, D. Johnson, K. Webb, A. Akella, K. Wang, G. Ricart, L. Landweber, C. Elliott, M. Zink, E. Cecchet, S. Kar, and P. Mishra. The design and operation of CloudLab. In Proceedings of the USENIX Annual Technical Conference (ATC), pages 1–14, July 2019.

[6] Y. Gao, L. Chen, and B. Li. Spotlight: Optimizing device placement for training deep neural networks. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1676–1684, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR.

[7] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.

[8] W. Hamilton, Z. Ying, and J. Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.

[9] Z. Jia, M. Zaharia, and A. Aiken. Beyond data and model parallelism for deep neural networks. arXiv preprint arXiv:1807.05358, 2018.

[10] K. Keahey, P. Riteau, D. Stanzione, T. Cockerill, J. Manbretti, P. Rad, and R. Paul. Chameleon: a scalable production testbed for computer science research. Contemporary High Performance Computing, 3, 2017.

[11] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh. Learning scheduling algorithms for data processing clusters. arXiv preprint arXiv:1810.01963, 2018.

[12] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean. A hierarchical model for device placement. In International Conference on Learning Representations, 2018.

[13] A. Mirhoseini, H. Pham, Q. Le, M. Norouzi, S. Bengio, B. Steiner, Y. Zhou, N. Kumar, R. Larsen, and J. Dean. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, 2017.

[14] A. Nair, P. Srinivasan, S. Blackwell, C. Alcicek, R. Fearon, A. De Maria, V. Panneershelvam, M. Suleyman, C. Beattie, S. Petersen, et al. Massively parallel methods for deep reinforcement learning. arXiv preprint arXiv:1507.04296, 2015.

[15] A. Y. Ng, D. Harada, and S. J. Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, ICML '99, pages 278–287, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.

[16] ONNX Developers. ONNX model zoo, 2018.

[17] A. Paliwal, F. Gimeno, V. Nair, Y. Li, M. Lubin, P. Kohli, and O. Vinyals. Reinforced genetic algorithm learning for optimizing computation graphs. arXiv preprint arXiv:1905.02494, 2019.

[18] F. Pellegrini.
A parallelisable multi-level banded diffusion scheme for computing balanced partitions with smooth boundaries. In A.-M. Kermarrec, L. Bougé, and T. Priol, editors, Euro-Par, volume 4641 of Lecture Notes in Computer Science, pages 195–204, Rennes, France, Aug. 2007. Springer.

[19] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv preprint arXiv:1802.03268, 2018.

[20] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. Hechtman. Mesh-TensorFlow: Deep learning for supercomputers. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pages 10435–10444, USA, 2018. Curran Associates Inc.

[21] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[22] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998.

[23] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[24] TensorFlow contributors. TensorFlow official repository, 2017.

[25] Wikipedia contributors. Perplexity — Wikipedia, the free encyclopedia, 2019. [Online; accessed 26-April-2019].

[26] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256, 1992.

[27] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Ł. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. Dean. Google's Neural Machine Translation System: Bridging the gap between human and machine translation. ArXiv e-prints, 2016.

[28] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2(6), 2017.