Reviews: Learning Generalizable Device Placement Algorithms for Distributed Machine Learning

Originality: The use of graph neural networks appears novel (concurrent with Paliwal), as does the sweep order (for which I don't know other papers, at least for this application of graph neural networks). The trick of using architecture search as a dataset also seems novel, and I'm quite happy with this idea. Quality: The submission is sound, but I have a few minor concerns: 1. Why use REINFORCE, rather than a better RL algorithm such as PPO (as in Spotlight)? It's possible REINFORCE is good enough, but I'm skeptical given that (1) REINFORCE is much worse in normal RL environments and (2) the paper explicitly presents evidence that using an incremental baseline helps learning. The learned value function in PPO, Q-learning, etc. could potentially play the same variance reduction role or even do quite a lot better (presumably not all of the variance due to upstream moves is explained by reward so far). 2. The node traversal order ablation is a bit silly: it doesn't feel very interesting whether we can sabotage training order and not lose accuracy for other orders. I was specifically disappointed because I was hoping to see an ablation for the bidirectional sweep approach vs. updating every node at once as with standard GNNs. Clarity: I found several aspects of the paper difficult to parse, or even slightly misleading. I think these are all fixable during the review process. 1. As mentioned above, the authors claim as a contribution iterative improvement of policies. The architecture they train supports being used in this way, by sweeping over the graph multiple times with each vertex flagged for change multiple times. However, the paper says "The episode ends in |V| steps when the placements for all the nodes have been updated." This is the same number of steps as the RNN approach (though the RNN approach does much less work per step). It's still possible that at test time they do more than |V| steps at test time, but I don't see any mention of this. To be explicit: I don't think it is reasonable to say the RNN approach outputs a whole placement in one step, since the placement of future nodes can condition on the sampled placement of previous nodes in exactly the same way as this paper. 2. Given that the architecture *does* support multiple epochs of placement and replacement, some discussion seems in order for why this isn't done. 3. What's k (the number of graph sweeps performed per step)? What are the details of the Placeto network, such as the embedding size? 4. I'm somewhat surprised that sweeps through a deep graph would work using (as far as I can tell from the paper) pure feed forward networks at message passing nodes. At least early GNN papers used LSTMs / GRUs / similar at nodes, but both this paper and Paliwal just use MLPs. I'd be curious for discussion of this, or a pointer to a reference that explains the switch if a good one exists. 5. I'm confused about the "Training time" column in Table 1. Is this for Placeto optimized, such that the number of steps is RL training on the specific graph we're trying to optimize (ignoring prior training on the ensemble of different graphs)? If so, what's the stopping condition? 6. Minor cleanup: in the supplementary material, the reward can be written more concisely as R(p) = r + c relu(m − M), which also highlights the continuity. Significance: I think it's quite likely these results will be built on, in particular the use of graph neural networks for both device placement and other computation graphs applications, and the use of neural architecture search outputs as a dataset.

The paper presents a new approach to device placement of a compute graph by using a graph embedding neural network. The graph embedding network computes features that are used in the a policy to compute which device the next node in the graph should be assigned to (It appears to me, that the order of the nodes is possibly fixed and not addressed by the policy; the order seems to be based on how the graph is fed into the model, and the authors make the point of showing that the method is more resilient to changes in order of nodes at test time, compared to RNNs in section 4). The learning algorithm used is a policy gradient algorithm with time step based baselines. The paper presents the ideas very clearly, and has nice results. The contribution is more from the combination of ideas than from a new idea. Graph embedding networks have been applied for various tasks, and device placement has been addressed previously by Mirhoseni et al. However, this paper shows how to combined these thoughts, allowing policies that are resilient to changes in node order and also generalizes to other graphs.

Paper ID:	2191
Title:	Learning Generalizable Device Placement Algorithms for Distributed Machine Learning

Reviewer 1

Reviewer 2

Reviewer 3