NeurIPS 2020

Coded Sequential Matrix Multiplication For Straggler Mitigation


Review 1

Summary and Contributions: This paper studies the problem of performing large-scale matrix multiplications over a distributed computing setup in the presence of stragglers. In the naive scheme, one divides the matrix multiplication task into many mini-tasks (multiplications of sub-blocks of the underlying matrices), which are assigned to different worker nodes. In this case, the master node needs to wait for the slowest workers, i.e., the stragglers, to obtain the final result. Recently, coding for computation has been explored to counter stragglers: the mini-tasks are assigned to the workers redundantly, and the master can obtain the final result as long as a sufficient number of workers complete their mini-tasks in a timely manner. For distributed matrix multiplication, optimal coding schemes are known in the literature (e.g., polynomial codes). However, these schemes consider one large-scale matrix multiplication at a time, so the coding can only be performed across workers. In this paper, the authors consider the problem of sequential matrix multiplication, where the objective is to perform multiple large-scale matrix multiplications such that the task arriving at round t must be completed by round t + T. The additional freedom associated with T > 0 allows the authors to encode both across workers and across time. For this novel problem of coded sequential matrix multiplication, the paper proposes two coding schemes that generalize polynomial codes. The paper then analyzes the completion time of these schemes for a simple i.i.d. straggler model and shows their advantages over polynomial coding and the naive uncoded approach. The paper also evaluates the proposed schemes on a real distributed computing setup, where the sequential matrix multiplications correspond to computations performed during the concurrent iterative training of independent neural networks on MNIST. The paper shows that the average processing time of the proposed schemes is better than that of polynomial coding and the uncoded approach under two stochastic straggler models: 1) an i.i.d. model and 2) the Fritchman model.
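To make the one-shot baseline concrete, here is a minimal sketch of polynomial coding for distributed matrix multiplication (the scheme the paper generalizes across time). The block counts, evaluation points, straggler pattern, and recovery step below are my own illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of one-shot polynomial coding for C = A @ B.
import numpy as np

m, n = 2, 2                                   # A split into m row-blocks, B into n column-blocks
A, B = np.random.randn(4, 6), np.random.randn(6, 4)
A_blocks, B_blocks = np.split(A, m, axis=0), np.split(B, n, axis=1)

num_workers = 6                               # any m*n = 4 completed workers suffice
xs = np.arange(1.0, num_workers + 1)          # one distinct evaluation point per worker

def worker_task(x):
    # Each worker multiplies one encoded block pair; the product is a matrix
    # polynomial of degree < m*n evaluated at the worker's point x.
    A_tilde = sum(A_blocks[i] * x**i for i in range(m))
    B_tilde = sum(B_blocks[j] * x**(j * m) for j in range(n))
    return A_tilde @ B_tilde

finished = [0, 2, 3, 5]                       # pretend workers 1 and 4 straggled
R = np.stack([worker_task(xs[k]) for k in finished])

# Interpolate the m*n coefficient blocks A_i @ B_j from any m*n evaluations.
V = np.vander(xs[finished], m * n, increasing=True)
coeffs = np.tensordot(np.linalg.inv(V), R, axes=(1, 0))

C = np.block([[coeffs[i + j * m] for j in range(n)] for i in range(m)])
assert np.allclose(C, A @ B)
```

In the one-shot setting the redundancy can only be spread across workers within a single round; the paper's contribution is to exploit the deadline T to also spread it across rounds.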

Strengths: The paper introduces the problem of coded sequential matrix multiplication in the presence of stragglers and proposes two coding schemes in which the underlying matrices are encoded both across workers and across time. This enables the system to tolerate stragglers that appear in a burst, as long as the total number of stragglers in a given time window is bounded. For stochastic straggler models, the paper shows, both analytically and experimentally, that the proposed coding schemes lead to lower processing time (the time required to complete the matrix multiplications) compared to the existing baselines.

Weaknesses: Although it addresses a novel problem, the paper does not introduce any novel coding techniques in its proposed schemes. This makes it even more important that the paper carry out a thorough empirical evaluation of the coding schemes, yet, in its current form, the paper is light on experimental evaluation. How realistic are the stochastic straggler models considered in the paper in the context of large-scale distributed computation?

Correctness: As far as the reviewer can tell, the results in the paper are correct.

Clarity: The main body of the paper glosses over many important details that are presented in the supplementary material. The supplementary reads as a complete paper in itself, and the reader cannot be expected to read the main paper while constantly referring to it. Please include important details, such as the motivation for coding across time and an explanation of the proposed coding schemes, in the main body. When referring the reader to the supplementary, please point to its specific sections.

Relation to Prior Work: The paper adequately discusses the relevant prior work on coded computation. The sequential matrix multiplication problem, where each arriving matrix multiplication task must be completed within a pre-specified time window, has some similarity to the problem of coding for streaming. Could the authors comment on this?

Reproducibility: Yes

Additional Feedback: In line 215, please add a citation regarding the condition number of Vandermonde matrices. The paper encodes each job separately. Could the authors comment on the utility of coding schemes that perform encoding across multiple jobs?

###################### Post author response phase ##################

Thank you for taking the time to respond to my comments. As acknowledged in my review, the paper does propose a novel problem setting. However, the experimental results are somewhat limited, and there is scope for improving the structure of the paper with more streamlined referencing to the supplementary. After going through the authors' responses and the other reviewers' comments, I have decided to leave my score unchanged. Please include a discussion of the connections to and differences from coding for streaming in the revised version.


Review 2

Summary and Contributions: This paper considers the problem of coded computation for matrix multiplication. Different from earlier works that have addressed this problem, it proposes a novel way of looking at a sequence of matrix multiplications through a "temporal" dimension: this dimension indicates the number of rounds of coordination between the master and workers before which the result of a particular matrix multiplication in the sequence is expected. This problem is termed coded sequential matrix multiplication in this paper. For this novel view, the work proposes extensions of the polynomial codes of [3] that take the temporal aspect into account. To analyze the efficiency of the proposed methods, the paper also extends the arbitrary straggler model to multiple rounds, calling it the (N,W) straggler model. The asymptotic load of the proposed methods is analyzed under this model, as is that of polynomial codes, and it is shown to be better. Simulations training multiple neural networks simultaneously are also run, as a particular application of coded sequential matrix multiplication.
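A small simulation may help make the relation between the two straggler models concrete (see also my question below about lines 235). The sketch reads (N,W) as "at most N straggler events in any W consecutive rounds"; that reading and all parameters below are my own illustrative assumptions, not definitions from the paper.

```python
# How often does an i.i.d. straggler process violate an assumed (N, W) window budget?
import numpy as np

rng = np.random.default_rng(0)
P, delta = 10, 0.1          # workers, per-round straggling probability (assumed)
N, W = 3, 4                 # assumed window budget: <= N stragglers per W consecutive rounds
rounds, trials = 50, 2000

violations = 0
for _ in range(trials):
    stragglers = rng.random((rounds, P)) < delta       # i.i.d. straggler events
    per_round = stragglers.sum(axis=1)                 # number of stragglers in each round
    # Check every sliding window of W rounds against the budget N.
    windows = np.lib.stride_tricks.sliding_window_view(per_round, W).sum(axis=1)
    violations += np.any(windows > N)

print(f"fraction of runs violating the (N={N}, W={W}) budget: {violations / trials:.3f}")
```

Under the i.i.d. model the (N,W) constraint only holds with some probability, which is why I ask below how the two models are meant to coexist in the analysis.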

Strengths: The proposed view on coded matrix multiplication is very novel and will be of interest to the NeurIPS community. The paper is also fairly well written: although it is notationally heavy, the authors have done a good job of repeating notational details at the right spots, making it easy to read. The authors also do a good job of motivating why incorporating the temporal dimension can provide gains over one-shot coded matrix multiplication.

Weaknesses: The suggested practical application, training multiple neural networks simultaneously, seems somewhat limited.

Correctness: Yes, the results seem correct, though I have not verified the proofs in the supplementary material in detail.

Clarity: Yes

Relation to Prior Work: Yes

Reproducibility: Yes

Additional Feedback:
- In lines 203-204, can you please clarify what 'u' is?
- In line 235, "We consider in this section, a probabilistic, i.i.d. straggler model, i.e., any worker in any given round will be a straggler with probability \delta": can you explain the relation between this model and the (N,W) straggler model? It seems that they coexist.
- In line 259, "Through numerical evaluation in Fig. 1, it can be observed that DIP scheme outperforms polynomial and IDIP schemes": please explain why. It was my impression that IDIP should outperform DIP under the (N,W) straggler model.

#### After author response ####

Thank you for responding to my comments. My score remains unchanged, though I also agree with the other reviewers regarding the limited experimental evaluation. Please at least include a discussion of other applications of this problem in the revision.


Review 3

Summary and Contributions: This paper studies a generalized version of coded matrix multiplication over multiple rounds. The main observation is that when the computation is extended to multiple rounds, the worst-case straggler tolerance improves due to the ability to launch delayed recomputation of failed subtasks (thus the stragglers become more evenly distributed across time). The authors show, in both theory and experiments, that coded computing schemes based on this idea achieve better cumulative processing time than the static scheme of [3].

Strengths:
* The problem formulation does generalize the conventional setting in coding for straggler mitigation.
* The authors provide optimality proofs of the proposed schemes under a well-defined problem formulation.
* The authors show some improvement over [3].

Weaknesses:
* The improvement seems limited.
* The experimental setting is semi-simulated, because this is not how people train neural networks (i.e., partitioning the computation into sequential matrix multiplications and distributing each partial job to several workers).
* The studied problem seems to be a classical extension under the framework of the delay-vs-throughput tradeoff, which limits the novelty.

Correctness: The claims sound correct to me. The authors provide a thorough analysis of the proposed schemes.

Clarity: The writing is okay, though I sometimes had to go to the supplementary material to understand the details.

Relation to Prior Work: There is not much discussion of the connections to and differences from prior work.

Reproducibility: Yes

Additional Feedback: (After rebuttal) I want to thank the authors for addressing my comments. I have read the authors' rebuttal and the other reviewers' comments and have carefully rethought the paper, including its connections to other related works. Since the authors have thoroughly addressed two of my three major concerns within the limited rebuttal space, I have decided to increase my overall score from 5 to 6. My final score is still borderline mainly because of the unaddressed first comment and my third comment regarding novelty. Although I agree with the authors that the proposed scheme, especially the idea of actively selecting uncoded matrix multiplications, has some novelty, the idea of amortizing tasks over time is a commonly explored theme in communication networks. I do agree, however, that this idea has not been fully explored in coded computing. Regarding the experiments, I hope the authors can find and demonstrate applications motivated by practical implementations, such as those proposed in the rebuttal. Conducting well-developed experiments and improving the empirical evaluation should perhaps be the primary focus for improving the paper.