{"title": "Polynomial Codes: an Optimal Design for High-Dimensional Coded Matrix Multiplication", "book": "Advances in Neural Information Processing Systems", "page_first": 4403, "page_last": 4413, "abstract": "We consider a large-scale matrix multiplication problem where the computation is carried out using a distributed system with a master node and multiple worker nodes, where each worker can store parts of the input matrices. We propose a computation strategy that leverages ideas from coding theory to design intermediate computations at the worker nodes, in order to optimally deal with straggling workers. The proposed strategy, named as \\emph{polynomial codes}, achieves the optimum recovery threshold, defined as the minimum number of workers that the master needs to wait for in order to compute the output. This is the first code that achieves the optimal utilization of redundancy for tolerating stragglers or failures in distributed matrix multiplication. Furthermore, by leveraging the algebraic structure of polynomial codes, we can map the reconstruction problem of the final output to a polynomial interpolation problem, which can be solved efficiently. Polynomial codes provide order-wise improvement over the state of the art in terms of recovery threshold, and are also optimal in terms of several other metrics including computation latency and communication load. Moreover, we extend this code to distributed convolution and show its order-wise optimality.", "full_text": "Polynomial Codes: an Optimal Design for\n\nHigh-Dimensional Coded Matrix Multiplication\n\nQian Yu\u2217, Mohammad Ali Maddah-Ali\u2020, and A. 
Salman Avestimehr\u2217\n\n\u2217 Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA\n\n\u2020 Nokia Bell Labs, Holmdel, NJ, USA\n\nAbstract\n\nWe consider a large-scale matrix multiplication problem where the computation\nis carried out using a distributed system with a master node and multiple worker\nnodes, where each worker can store parts of the input matrices. We propose a\ncomputation strategy that leverages ideas from coding theory to design intermediate\ncomputations at the worker nodes, in order to optimally deal with straggling\nworkers. The proposed strategy, named as polynomial codes, achieves the optimum\nrecovery threshold, de\ufb01ned as the minimum number of workers that the master\nneeds to wait for in order to compute the output. This is the \ufb01rst code that\nachieves the optimal utilization of redundancy for tolerating stragglers or failures\nin distributed matrix multiplication. Furthermore, by leveraging the algebraic\nstructure of polynomial codes, we can map the reconstruction problem of the \ufb01nal\noutput to a polynomial interpolation problem, which can be solved ef\ufb01ciently.\nPolynomial codes provide order-wise improvement over the state of the art in\nterms of recovery threshold, and are also optimal in terms of several other metrics\nincluding computation latency and communication load. Moreover, we extend this\ncode to distributed convolution and show its order-wise optimality.\n\n1\n\nIntroduction\n\nMatrix multiplication is one of the key building blocks underlying many data analytics and machine\nlearning algorithms. Many such applications require massive computation and storage power to\nprocess large-scale datasets. 
As a result, distributed computing frameworks such as Hadoop MapReduce [1] and Spark [2] have gained significant traction, as they enable processing of data sizes on the order of tens of terabytes and more.

As we scale out computations across many distributed nodes, a major performance bottleneck is the latency incurred in waiting for the slowest nodes, or "stragglers", to finish their tasks [3]. The current approaches to mitigate the impact of stragglers involve the creation of some form of "computation redundancy". For example, replicating the straggling task on another available node is a common approach to deal with stragglers (e.g., [4]). However, there have been recent results demonstrating that coding can play a transformational role in creating and exploiting computation redundancy to effectively alleviate the impact of stragglers [5, 6, 7, 8, 9]. Our main result in this paper is the development of optimal codes, named polynomial codes, to deal with stragglers in distributed high-dimensional matrix multiplication, which also provides order-wise improvement over the state of the art.

More specifically, we consider a distributed matrix multiplication problem where we aim to compute C = AᵀB from input matrices A and B. As shown in Fig. 1, the computation is carried out using a distributed system with a master node and N worker nodes that can each store a 1/m fraction of A and a 1/n fraction of B, for some parameters m, n ∈ N+. We denote the stored submatrices at each worker i ∈ {0, . . . , N − 1} by Ãi and B̃i, which can be designed as arbitrary functions of A and B respectively. Each worker i then computes the product ÃiᵀB̃i and returns the result to the master.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Figure 1: Overview of the distributed matrix multiplication framework. 
Coded data are initially stored distributedly at N workers according to the data assignment. Each worker computes the product of the two stored matrices and returns it to the master. By carefully designing the computation strategy, the master can decode given the computing results from a subset of workers, without having to wait for the stragglers (worker 1 in this example).

By carefully designing the computation strategy at each worker (i.e., designing Ãi and B̃i), the master only needs to wait for the fastest subset of workers before recovering the output C, hence mitigating the impact of stragglers. Given a computation strategy, we define its recovery threshold as the minimum number of workers that the master needs to wait for in order to compute C. In other words, if any subset of the workers with size no smaller than the recovery threshold finish their jobs, the master is able to compute C. Given this formulation, we are interested in the following main problem.

What is the minimum possible recovery threshold for distributed matrix multiplication? Can we find an optimal computation strategy that achieves the minimum recovery threshold, while allowing efficient decoding of the final output at the master node?

There have been two computing schemes proposed earlier for this problem that leverage ideas from coding theory. The first one, introduced in [5] and extended in [10], injects redundancy in only one of the input matrices using maximum distance separable (MDS) codes [11]¹. We illustrate this approach, referred to as the one dimensional MDS code (1D MDS code), using the example shown in Fig. 2a, where we aim to compute C = AᵀB using 3 workers that can each store half of A and the entire B. The 1D MDS code evenly divides A along the columns into two submatrices, denoted by A0 and A1, encodes them into 3 coded matrices A0, A1, and A0 + A1, and then assigns them to the 3 workers. 
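For concreteness, this encoding and the recovery of C = AᵀB from 2 of the 3 workers can be sketched in a few lines of pure Python (toy 2×2 matrices; all values here are hypothetical):

```python
# Toy sketch of the 1D MDS code of Fig. 2a (all matrix values are
# hypothetical). A is split column-wise into A0 and A1; worker 2 stores
# the parity block A0 + A1, and every worker stores all of B.
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def transpose(X):
    return [list(r) for r in zip(*X)]

A0 = [[1], [2]]              # the two s x (r/2) column blocks of A (s = r = 2)
A1 = [[3], [4]]
B = [[5, 6], [7, 8]]         # every worker stores the entire B

# Encoding: workers 0, 1, 2 store A0, A1, and A0 + A1 respectively.
coded = [A0, A1, [[x + y for x, y in zip(r0, r1)] for r0, r1 in zip(A0, A1)]]
outs = [matmul(transpose(M), B) for M in coded]    # each worker's result

# Decoding from workers 0 and 2 only (worker 1 is the straggler):
row0 = outs[0][0]                                  # A0^T B
row1 = [c - a for c, a in zip(outs[2][0], row0)]   # (A0+A1)^T B - A0^T B = A1^T B
C = [row0, row1]                                   # stack rows to get A^T B
assert C == matmul(transpose([[1, 3], [2, 4]]), B)
print(C)                                           # [[19, 22], [43, 50]]
```

Any single straggler can be tolerated the same way, since any 2 of the 3 stored blocks determine both A0ᵀB and A1ᵀB.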
This design allows the master to recover the final output given the results from any 2 of the 3 workers, hence achieving a recovery threshold of 2. More generally, one can show that the 1D MDS code achieves a recovery threshold of

K1D-MDS ≜ N − N/n + m = Θ(N). (1)

An alternative computing scheme was recently proposed in [10] for the case of m = n, referred to as the product code, which instead injects redundancy in both input matrices. This coding technique has also been proposed earlier in the context of Fault Tolerant Computing in [12, 13]. As demonstrated in Fig. 2b, the product code aligns workers in a √N-by-√N layout. A is divided along the columns into m submatrices, encoded using a (√N, m) MDS code into √N coded matrices, and then assigned to the √N columns of workers. Similarly, √N coded matrices of B are created and assigned to the √N rows. Given the property of MDS codes, the master can decode an entire row after obtaining any m results in that row; likewise for the columns. Consequently, the master can recover the final output using a peeling algorithm, iteratively decoding the MDS codes on rows and columns until the output C is completely available. For example, if the 5 computing results A1ᵀB0, A1ᵀB1, (A0 + A1)ᵀB1, A0ᵀ(B0 + B1), and A1ᵀ(B0 + B1) are received as demonstrated in Fig. 2b, the master can recover the needed results by computing A0ᵀB1 = (A0 + A1)ᵀB1 − A1ᵀB1 and then A0ᵀB0 = A0ᵀ(B0 + B1) − A0ᵀB1. In general, one can show that the product code achieves a recovery threshold of

Kproduct ≜ 2(m − 1)√N − (m − 1)² + 1 = Θ(√N), (2)

which significantly improves over K1D-MDS.

¹An (n, k) MDS code is a linear code which transforms k raw inputs into n coded outputs, such that the original k inputs can be recovered from any subset of size k of the outputs.

(a) 1D MDS code [5] in an example with 3 workers that can each store half of A and the entire B. (b) Product code [10] in an example with 9 workers that can each store half of A and half of B.

Figure 2: Illustration of (a) the 1D MDS code, and (b) the product code.

In this paper, we show that, quite interestingly, the optimum recovery threshold can be far less than what the above two schemes achieve. In fact, we show that the minimum recovery threshold does not scale with the number of workers (i.e., it is Θ(1)). We prove this fact by designing a novel coded computing strategy, referred to as the polynomial code, which achieves the optimum recovery threshold of mn and significantly improves the state of the art. Hence, our main result is as follows.

For a general matrix multiplication task C = AᵀB using N workers, where each worker can store a 1/m fraction of A and a 1/n fraction of B, we propose polynomial codes that achieve the optimum recovery threshold of

Kpoly ≜ mn = Θ(1). (3)

Furthermore, the polynomial code only requires a decoding complexity that is almost linear in the input size.

The main novelty and advantage of the proposed polynomial code is that, by carefully designing the algebraic structure of the encoded submatrices, we ensure that any mn intermediate computations at the workers are sufficient for recovering the final matrix multiplication product at the master. 
This in a sense creates an MDS structure on the intermediate computations, instead of only on the encoded matrices as in prior works. Furthermore, by leveraging the algebraic structure of polynomial codes, we can then map the reconstruction problem of the final output at the master to a polynomial interpolation problem (or equivalently, Reed-Solomon decoding [14]), which can be solved efficiently [15]. This mapping also bridges the rich theory of algebraic coding and distributed matrix multiplication.

We prove the optimality of polynomial code by showing that it achieves the information theoretic lower bound on the recovery threshold, obtained by a cut-set argument (i.e., the workers must return at least mn matrix blocks to recover the final output C, which itself consists of exactly mn blocks). Hence, the proposed polynomial code essentially enables a specific computing strategy such that, from any subset of workers that provides the minimum amount of information needed to recover C, the master can successfully decode the final output. As a by-product, we also prove the optimality of polynomial code under several other performance metrics considered in previous literature: computation latency [5, 10], probability of failure given a deadline [9], and communication load [16, 17, 18].

We extend the polynomial code to the problem of distributed convolution [9]. We show that by simply reducing the convolution problem to matrix multiplication and applying the polynomial code, we strictly and unboundedly improve the state of the art. Furthermore, by exploiting the computing structure of convolution, we propose a variation of the polynomial code, which strictly reduces the recovery threshold even further, and achieves the optimum recovery threshold within a factor of 2.

Finally, we implement and benchmark the polynomial code on an Amazon EC2 cluster. 
We measure the computation latency and empirically demonstrate its performance gain under straggler effects.

2 System Model, Problem Formulation, and Main Result

We consider a problem of matrix multiplication with two input matrices A ∈ F_q^{s×r} and B ∈ F_q^{s×t}, for some integers r, s, t and a sufficiently large finite field F_q. We are interested in computing the product C ≜ AᵀB in a distributed computing environment with a master node and N worker nodes, where each worker can store a 1/m fraction of A and a 1/n fraction of B, for some parameters m, n ∈ N+ (see Fig. 1). We assume at least one of the two input matrices A and B is tall (i.e., s ≥ r or s ≥ t), because otherwise the output matrix C would be rank deficient and the problem degenerates.

Specifically, each worker i can store two matrices Ãi ∈ F_q^{s×r/m} and B̃i ∈ F_q^{s×t/n}, computed based on arbitrary functions of A and B respectively. Each worker can compute the product C̃i ≜ ÃiᵀB̃i, and return it to the master. The master waits only for the results from a subset of workers, before proceeding to recover the final output C given these products using certain decoding functions.²

2.1 Problem Formulation

Given the above system model, we formulate the distributed matrix multiplication problem based on the following terminology: We define the computation strategy as the 2N functions, denoted by

f = (f0, f1, ..., fN−1), g = (g0, g1, ..., gN−1), (4)

that are used to compute each Ãi and B̃i. Specifically,

Ãi = fi(A), B̃i = gi(B), ∀ i ∈ {0, 1, ..., N − 1}. (5)

For any integer k, we say a computation strategy is k-recoverable if the master can recover C given the computing results from any k workers. 
We define the recovery threshold of a computation strategy, denoted by k(f, g), as the minimum integer k such that the computation strategy (f, g) is k-recoverable. Using the above terminology, we define the following concept:

Definition 1. For a distributed matrix multiplication problem of computing AᵀB using N workers that can each store a 1/m fraction of A and a 1/n fraction of B, we define the optimum recovery threshold, denoted by K∗, as the minimum achievable recovery threshold among all computation strategies, i.e.,

K∗ ≜ min_{f,g} k(f, g). (6)

The goal of this problem is to find the optimum recovery threshold K∗, as well as a computation strategy that achieves such an optimum threshold.

2.2 Main Result

Our main result is stated in the following theorem:

Theorem 1. For a distributed matrix multiplication problem of computing AᵀB using N workers that can each store a 1/m fraction of A and a 1/n fraction of B, the minimum recovery threshold K∗ is

K∗ = mn. (7)

Furthermore, there is a computation strategy, referred to as the polynomial code, that achieves the above K∗ while allowing efficient decoding at the master node, i.e., with complexity equal to that of polynomial interpolation given mn points.

Remark 1. Compared to the state of the art [5, 10], the polynomial code provides order-wise improvement in terms of the recovery threshold. Specifically, the recovery thresholds achieved by the 1D MDS code [5, 10] and the product code [10] scale linearly with N and √N respectively, while the proposed polynomial code actually achieves a recovery threshold that does not scale with N. Furthermore, the polynomial code achieves the optimal recovery threshold. 
To the best of our knowledge, this is the first optimal design proposed for the distributed matrix multiplication problem.

²Note that we consider the most general model and do not impose any constraints on the decoding functions. However, any good decoding function should have relatively low computation complexity.

Remark 2. We prove the optimality of polynomial code using a matching information theoretic lower bound, which is obtained by applying a cut-set type argument around the master node. As a by-product, we can also prove that the polynomial code simultaneously achieves optimality in terms of several other performance metrics, including the computation latency [5, 10], the probability of failure given a deadline [9], and the communication load [16, 17, 18], as discussed in Section 3.4.

Remark 3. The polynomial code not only improves the state of the art asymptotically, but also gives strict and significant improvement for any parameter values of N, m, and n (see Fig. 3 for example).

Figure 3: Comparison of the recovery thresholds achieved by the proposed polynomial code and the state of the art (1D MDS code [5] and product code [10]), where each worker can store a 1/10 fraction of each input matrix. The polynomial code attains the optimum recovery threshold K∗, and significantly improves the state of the art.

Remark 4. As we will discuss in Section 3.2, decoding the polynomial code can be mapped to a polynomial interpolation problem, which can be solved in time almost linear in the input size [15]. This is enabled by carefully designing the computing strategies at the workers, such that the computed products form a Reed-Solomon code [19], which can be decoded efficiently using any polynomial interpolation algorithm or Reed-Solomon decoding algorithm that provides the best performance depending on the problem scenario (e.g., [20]).

Remark 5. 
Polynomial code can be extended to other distributed computation applications involving linear algebraic operations. In Section 4, we focus on the problem of distributed convolution, and show that we can obtain order-wise improvement over the state of the art (see [9]) by directly applying the polynomial code. Furthermore, by exploiting the computing structure of convolution, we propose a variation of the polynomial code that achieves the optimum recovery threshold within a factor of 2.

Remark 6. In this work we focused on designing optimal coding techniques to handle straggler issues. The same technique can also be applied to the fault-tolerant computing setting (e.g., within the algorithmic fault tolerance framework of [12, 13], where a module can produce arbitrarily erroneous results under failure), to improve robustness to failures in computing. Given that the polynomial code produces computing results that are coded by a Reed-Solomon code, which has the optimum Hamming distance, it allows detecting or correcting the maximum possible number of module errors. Specifically, the polynomial code can robustly detect up to N − mn errors, and correct up to ⌊(N − mn)/2⌋ errors. This provides the first optimum code for matrix multiplication under fault-tolerant computing.

3 Polynomial Code and Its Optimality

In this section, we formally describe the polynomial code and its decoding procedure. We then prove its optimality with an information theoretic converse, which completes the proof of Theorem 1. Finally, we conclude this section with the optimality of polynomial code under other settings.

3.1 Motivating Example

We first demonstrate the main idea through a motivating example. Consider a distributed matrix multiplication task of computing C = AᵀB using N = 5 workers that can each store half of the matrices (see Fig. 4). 
We evenly divide each input matrix along the column side into 2 submatrices:

A = [A0 A1], B = [B0 B1]. (8)

Given this notation, we essentially want to compute the following 4 uncoded components:

C = AᵀB = [A0ᵀB0  A0ᵀB1]
          [A1ᵀB0  A1ᵀB1].   (9)

Figure 4: Example using polynomial code, with 5 workers that can each store half of each input matrix. (a) Computation strategy: each worker i stores A0 + iA1 and B0 + i²B1, and computes their product. (b) Decoding: master waits for results from any 4 workers, and decodes the output using a fast polynomial interpolation algorithm.

Now we design a computation strategy to achieve the optimum recovery threshold of 4. Supposing elements of A, B are in F7, let each worker i ∈ {0, 1, ..., 4} store the following two coded submatrices:

Ãi = A0 + iA1, B̃i = B0 + i²B1. (10)

To prove that this design gives a recovery threshold of 4, we need to design a valid decoding function for any subset of 4 workers. We demonstrate this decodability through a representative scenario, where the master receives the computation results from workers 1, 2, 3, and 4, as shown in Figure 4. The decodability for the other 4 possible scenarios can be proved similarly.

According to the designed computation strategy, we have

[C̃1]   [1⁰ 1¹ 1² 1³] [A0ᵀB0]
[C̃2] = [2⁰ 2¹ 2² 2³] [A1ᵀB0]
[C̃3]   [3⁰ 3¹ 3² 3³] [A0ᵀB1]
[C̃4]   [4⁰ 4¹ 4² 4³] [A1ᵀB1].   (11)

The coefficient matrix in the above equation is a Vandermonde matrix, which is invertible because its parameters 1, 2, 3, 4 are distinct in F7. 
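This decodability can also be checked numerically. Below is a minimal pure-Python sketch of the example, treating the blocks A0, A1, B0, B1 as scalars in F7 for brevity (the particular values are hypothetical); it recovers the four components of C by solving the Vandermonde system mod 7:

```python
# Motivating example of Section 3.1 over F7, with matrix blocks
# replaced by (hypothetical) scalars for brevity.
p = 7
A0, A1, B0, B1 = 3, 5, 2, 6                # arbitrary elements of F7

# Encoding (equation (10)): worker i stores A0 + i*A1 and B0 + i^2*B1.
def encode(i):
    return (A0 + i * A1) % p, (B0 + i ** 2 * B1) % p

# Each worker returns the product of its stored values, i.e. h(i) in (13).
results = {i: (encode(i)[0] * encode(i)[1]) % p for i in [1, 2, 3, 4]}

# Decoding: solve the 4x4 Vandermonde system (11) over F7 by
# Gauss-Jordan elimination mod p.
def solve_mod(M, y, p):
    n = len(M)
    M = [row[:] + [yi] for row, yi in zip(M, y)]
    for c in range(n):
        piv = next(r for r in range(c, n) if M[r][c])
        M[c], M[piv] = M[piv], M[c]
        inv = pow(M[c][c], p - 2, p)       # Fermat modular inverse
        M[c] = [(v * inv) % p for v in M[c]]
        for r in range(n):
            if r != c and M[r][c]:
                M[r] = [(u - M[r][c] * w) % p for u, w in zip(M[r], M[c])]
    return [M[r][n] for r in range(n)]

xs = sorted(results)
V = [[pow(x, k, p) for k in range(4)] for x in xs]
coeffs = solve_mod(V, [results[x] for x in xs], p)

# Coefficients of h(x) are exactly the four uncoded components of C.
expected = [(A0 * B0) % p, (A1 * B0) % p, (A0 * B1) % p, (A1 * B1) % p]
assert coeffs == expected
print(coeffs)
```

With actual matrix blocks, the same solve is applied entry-wise, which is exactly the interpolation step described next.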
So one way to recover C is to directly invert equation (11), which proves the decodability. However, directly computing this inverse using the classical inversion algorithm might be expensive in more general cases. Quite interestingly, because of the algebraic structure we designed for the computation strategy (i.e., equation (10)), the decoding process can be viewed as a polynomial interpolation problem (or equivalently, decoding a Reed-Solomon code).

Specifically, in this example each worker i returns

C̃i = ÃiᵀB̃i = A0ᵀB0 + iA1ᵀB0 + i²A0ᵀB1 + i³A1ᵀB1, (12)

which is essentially the value of the following polynomial at the point x = i:

h(x) ≜ A0ᵀB0 + xA1ᵀB0 + x²A0ᵀB1 + x³A1ᵀB1. (13)

Hence, recovering C using the computation results from 4 workers is equivalent to interpolating a 3rd-degree polynomial given its values at 4 points. Later in this section, we will show that by mapping the decoding process to polynomial interpolation, we can achieve almost-linear decoding complexity.

3.2 General Polynomial Code

Now we present the polynomial code in a general setting that achieves the optimum recovery threshold stated in Theorem 1 for any parameter values of N, m, and n. First of all, we evenly divide each input matrix along the column side into m and n submatrices respectively, i.e.,

A = [A0 A1 ... A_{m−1}], B = [B0 B1 ... B_{n−1}]. (14)

We then assign each worker i ∈ {0, 1, ..., N − 1} a number in F_q, denoted by x_i, and make sure that all x_i's are distinct. Under this setting, we define the following class of computation strategies.

Definition 2. 
Given parameters α, β ∈ N, we define the (α, β)-polynomial code as

Ãi = Σ_{j=0}^{m−1} Aj x_i^{jα},  B̃i = Σ_{j=0}^{n−1} Bj x_i^{jβ},  ∀ i ∈ {0, 1, ..., N − 1}. (15)

In an (α, β)-polynomial code, each worker i essentially computes

C̃i = ÃiᵀB̃i = Σ_{j=0}^{m−1} Σ_{k=0}^{n−1} AjᵀBk x_i^{jα+kβ}. (16)

In order for the master to recover the output given any mn results (i.e., achieve the optimum recovery threshold), we carefully select the design parameters α and β, while making sure that no two terms in the above formula have the same exponent of x. One such choice is (α, β) = (1, m), i.e., let

Ãi = Σ_{j=0}^{m−1} Aj x_i^j,  B̃i = Σ_{j=0}^{n−1} Bj x_i^{jm}. (17)

Hence, each worker computes the value of the following degree mn − 1 polynomial at the point x = x_i:

h(x) ≜ Σ_{j=0}^{m−1} Σ_{k=0}^{n−1} AjᵀBk x^{j+km}, (18)

where the coefficients are exactly the mn uncoded components of C. Since all x_i's are selected to be distinct, recovering C given results from any mn workers is essentially interpolating h(x) using mn distinct points. Since h(x) has degree mn − 1, the output C can always be uniquely decoded.

In terms of complexity, this decoding process can be viewed as interpolating degree mn − 1 polynomials over F_q for rt/mn times. It is well known that polynomial interpolation of degree k has a complexity of O(k log² k log log k) [15]. Therefore, decoding the polynomial code only requires a complexity of O(rt log²(mn) log log(mn)). Furthermore, this complexity can be reduced by simply swapping in any faster polynomial interpolation algorithm or Reed-Solomon decoding algorithm.

Remark 7. 
We can naturally extend the polynomial code to the scenario where the input matrix elements are real or complex numbers. In practical implementation, to avoid handling large elements in the coefficient matrix, we can first quantize the input values into numbers of finite digits, embed them into a finite field that covers the range of possible values of the output matrix elements, and then directly apply the polynomial code. By embedding into finite fields, we avoid large intermediate computing results, which effectively saves storage and computation time, and reduces numerical errors.

3.3 Optimality of Polynomial Code for Recovery Threshold

So far we have constructed a computing scheme that achieves a recovery threshold of mn, which upper bounds K∗. To complete the proof of Theorem 1, here we establish a matching lower bound through an information theoretic converse.

We need to prove that for any computation strategy, the master needs to wait for at least mn workers in order to recover the output. Recall that at least one of A and B is a tall matrix. Without loss of generality, assume A is tall (i.e., s ≥ r). Let A be an arbitrary fixed full rank matrix, and let B be sampled from F_q^{s×t} uniformly at random. It is easy to show that C = AᵀB is then uniformly distributed on F_q^{r×t}. This means that the master essentially needs to recover a random variable with entropy H(C) = rt log₂ q bits. Note that each worker returns rt/mn elements of F_q, providing at most (rt/mn) log₂ q bits of information. Consequently, using a cut-set bound around the master, we can show that results from at least mn workers need to be collected, and thus we have K∗ ≥ mn.

Remark 8 (Random Linear Code). We conclude this subsection by noting that another computation design is to let each worker store two random linear combinations of the input submatrices. 
Although this design can achieve the optimal recovery threshold with high probability, it creates a large coding overhead and requires high decoding complexity (e.g., O(m³n³ + mnrt) using the classical inversion decoding algorithm). Compared to the random linear code, the proposed polynomial code achieves the optimum recovery threshold deterministically, with a significantly lower decoding complexity.

3.4 Optimality of Polynomial Code for Other Performance Metrics

In the previous subsection, we proved that the polynomial code is optimal in terms of the recovery threshold. As a by-product, we can prove that it is also optimal in terms of some other performance metrics. In particular, we consider the following 3 metrics considered in prior works, and formally establish the optimality of polynomial code for each of them. Proofs can be found in Appendix A.

Computation latency is considered in models where the computation time Ti of each worker i is a random variable with a certain probability distribution (e.g., [5, 10]). The computation latency is defined as the amount of time required for the master to collect enough information to decode C.

Theorem 2. For any computation strategy, the computation latency T is always no less than the latency achieved by the polynomial code, denoted by Tpoly. Namely,

T ≥ Tpoly. (19)

Probability of failure given a deadline is defined as the probability that the master does not receive enough information to decode C by time t [9].

Corollary 1. For any computation strategy, let T denote its computation latency, and let Tpoly denote the computation latency of the polynomial code. We have

P(T > t) ≥ P(Tpoly > t), ∀ t ≥ 0. (20)

Corollary 1 directly follows from Theorem 2, since (19) implies (20).

Communication load is another important metric in distributed computing (e.g. 
[16, 17, 18]), defined as the minimum number of bits needed to be communicated in order to complete the computation.

Theorem 3. The polynomial code achieves the minimum communication load for distributed matrix multiplication, which is given by

L∗ = rt log₂ q. (21)

4 Extension to Distributed Convolution

We can extend our proposed polynomial code to distributed convolution. Specifically, we consider a convolution task with two input vectors

a = [a0 a1 ... a_{m−1}], b = [b0 b1 ... b_{n−1}], (22)

where all ai's and bi's are vectors of length s over a sufficiently large field F_q. We want to compute c ≜ a ∗ b using a master and N workers. Each worker can store two vectors of length s, which are functions of a and b respectively. We refer to these functions as the computation strategy.

Each worker computes the convolution of its stored vectors, and returns it to the master. The master only waits for the fastest subset of workers, before proceeding to decode c. Similar to distributed matrix multiplication, we define the recovery threshold for each computation strategy. We aim to characterize the optimum recovery threshold, denoted by K∗_conv, and find computation strategies that closely achieve this optimum threshold, while allowing efficient decoding at the master.

Distributed convolution has also been studied in [9], where the coded convolution scheme was proposed. The main idea of the coded convolution scheme is to inject redundancy in only one of the input vectors using MDS codes. The master waits for enough results such that all intermediate values ai ∗ bj can be recovered, which allows the final output to be computed. 
One can show that this coded convolution scheme is in fact equivalent to the 1D MDS-coded scheme proposed in [10]. Consequently, it achieves a recovery threshold of K1D-MDS = N − N/n + m.

Note that by simply adapting our proposed polynomial code, designed for distributed matrix multiplication, to distributed convolution, the master can recover all intermediate values ai ∗ bj after receiving results from any mn workers, and thus decode the final output. Consequently, this achieves a recovery threshold of Kpoly = mn, which already strictly and significantly improves the state of the art.

In this paper, we take one step further and propose an improved computation strategy, strictly reducing the recovery threshold on top of the naive polynomial code. The result is summarized as follows:

Theorem 4. For a distributed convolution problem of computing a ∗ b using N workers that can each store a 1/m fraction of a and a 1/n fraction of b, we can find a computation strategy that achieves a recovery threshold of

Kconv-poly ≜ m + n − 1. (23)

Furthermore, this computation strategy allows efficient decoding, i.e., with complexity equal to that of polynomial interpolation given m + n − 1 points.

We prove Theorem 4 by proposing a variation of the polynomial code, which exploits the computation structure of convolution. This new computing scheme is formally demonstrated in Appendix B.

Remark 9. Similar to distributed matrix multiplication, our proposed computation strategy provides order-wise improvement compared to the state of the art [9] in various settings. Furthermore, it achieves almost-linear decoding complexity using the fastest polynomial interpolation algorithm or the Reed-Solomon decoding algorithm. 
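To make the naive reduction concrete, here is a small pure-Python sketch (toy parameters m = n = 2, block length s = 3, field F101; all vector values are hypothetical, and this illustrates the naive mn-threshold scheme rather than the improved scheme of Theorem 4): each worker convolves its two coded vectors, and the master interpolates coordinate-wise from any mn = 4 results and reassembles c = a ∗ b by overlap-add:

```python
# Naive reduction of distributed convolution to the (1, m)-polynomial
# code, with toy, hypothetical parameters and values.
p = 101                  # a prime large enough for this toy instance
s, m, n = 3, 2, 2        # block length and storage parameters

def conv(u, v):          # full linear convolution over F_p
    out = [0] * (len(u) + len(v) - 1)
    for i, ui in enumerate(u):
        for j, vj in enumerate(v):
            out[i + j] = (out[i + j] + ui * vj) % p
    return out

a = [[1, 2, 3], [4, 5, 6]]   # a = [a0 a1], blocks of length s
b = [[7, 8, 9], [1, 3, 5]]   # b = [b0 b1]

# Encoding, elementwise with x_i = i:  a~_i = a0 + i*a1,  b~_i = b0 + i^2*b1.
def encode(i):
    return ([(a[0][k] + i * a[1][k]) % p for k in range(s)],
            [(b[0][k] + i ** 2 * b[1][k]) % p for k in range(s)])

results = {i: conv(*encode(i)) for i in [1, 2, 3, 4]}   # any mn = 4 workers

# Coordinate t of worker i's result is a degree-3 polynomial in i whose
# coefficient of x^(j+2k) is coordinate t of conv(a_j, b_k); interpolate
# each coordinate by Gauss-Jordan elimination mod p.
def solve_mod(M, y):
    nn = len(M)
    M = [row[:] + [yi] for row, yi in zip(M, y)]
    for c in range(nn):
        piv = next(r for r in range(c, nn) if M[r][c])
        M[c], M[piv] = M[piv], M[c]
        inv = pow(M[c][c], p - 2, p)
        M[c] = [(v * inv) % p for v in M[c]]
        for r in range(nn):
            if r != c and M[r][c]:
                M[r] = [(u - M[r][c] * w) % p for u, w in zip(M[r], M[c])]
    return [M[r][nn] for r in range(nn)]

xs = sorted(results)
V = [[pow(x, k, p) for k in range(4)] for x in xs]
cols = [solve_mod(V, [results[x][t] for x in xs]) for t in range(2 * s - 1)]

# Overlap-add: conv(a_j, b_k) contributes to c at offset (j + k) * s.
c = [0] * (m * s + n * s - 1)
for j in range(m):
    for k in range(n):
        blk = [cols[t][j + m * k] for t in range(2 * s - 1)]  # conv(a_j, b_k)
        for t, v in enumerate(blk):
            c[(j + k) * s + t] = (c[(j + k) * s + t] + v) % p

assert c == conv(a[0] + a[1], b[0] + b[1])   # matches the direct a * b
print(c)
```

The improved scheme of Theorem 4 sharpens this by exploiting that the convolution output itself, not just the intermediate blocks, needs to be recovered.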
More recently, we have shown that this proposed scheme achieves the optimum recovery threshold among all computation strategies that are linear functions [21].

Moreover, we characterize K*conv within a factor of 2, as stated in the following theorem and proved in Appendix C.

Theorem 5. For a distributed convolution problem, the minimum recovery threshold K*conv can be characterized within a factor of 2, i.e.:

(1/2) Kconv-poly < K*conv ≤ Kconv-poly.  (24)

5 Experiment Results

To examine the efficiency of our proposed polynomial code, we implement the algorithm in Python using the mpi4py library and deploy it on an AWS EC2 cluster of 18 nodes, with the master running on a c1.medium instance and 17 workers running on m1.small instances.

The input matrices are randomly generated as two numpy matrices of size 4000 by 4000, and then encoded and assigned to the workers in the preprocessing stage. Each worker stores a 1/4 fraction of each input matrix. In the computation stage, each worker computes the product of its assigned matrices and returns the result using MPI.Comm.Isend(). The master actively listens for responses from the 17 worker nodes through MPI.Comm.Irecv(), and uses MPI.Request.Waitany() to keep polling for the earliest fulfilled request. Upon receiving 16 responses, the master stops listening and starts decoding the result. To achieve the best performance, we implement an FFT-based algorithm for the Reed-Solomon decoding.

Figure 5: Comparison of polynomial code and the uncoded scheme. We implement polynomial code and the uncoded scheme using Python and the mpi4py library and deploy them on an Amazon EC2 cluster of 18 instances. We measure the computation latency of both algorithms and plot their CCDF.
Polynomial code can reduce the tail latency by 37%, even after taking the decoding overhead into account.

We compare our results with distributed matrix multiplication without coding.3 The uncoded implementation is similar, except that only 16 out of the 17 workers participate in the computation, each of them storing and processing a 1/4 fraction of uncoded rows from each input matrix. The master waits for all 16 workers to return, and does not need to perform any decoding to recover the result.

To simulate straggler effects in large-scale systems, we compare the computation latency of the two schemes in a setting where a randomly picked worker runs a background thread that approximately doubles its computation time. As shown in Fig. 5, polynomial code reduces the tail latency by 37% in this setting, even after taking the decoding overhead into account.

3 Due to the EC2 instance request quota limit of 20, the 1D MDS code and the product code, which require at least 21 and 26 nodes respectively, could not be implemented in this setting.

6 Acknowledgement

This work is in part supported by NSF grants CCF-1408639 and NETS-1419632, ONR award N000141612189, an NSA grant, and a research gift from Intel. This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0053. The views, opinions, and/or findings expressed are those of the author(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

References

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Sixth USENIX Symposium on Operating System Design and Implementation, Dec. 2004.

[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. 
Stoica, "Spark: cluster computing with working sets," in Proceedings of the 2nd USENIX HotCloud, vol. 10, p. 10, June 2010.

[3] J. Dean and L. A. Barroso, "The tail at scale," Communications of the ACM, vol. 56, no. 2, pp. 74–80, 2013.

[4] M. Zaharia, A. Konwinski, A. D. Joseph, R. H. Katz, and I. Stoica, "Improving MapReduce performance in heterogeneous environments," OSDI, vol. 8, p. 7, Dec. 2008.

[5] K. Lee, M. Lam, R. Pedarsani, D. Papailiopoulos, and K. Ramchandran, "Speeding up distributed machine learning using codes," arXiv preprint arXiv:1512.02673, 2015.

[6] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "A unified coding framework for distributed computing with straggling servers," arXiv preprint arXiv:1609.01690, 2016.

[7] A. Reisizadehmobarakeh, S. Prakash, R. Pedarsani, and S. Avestimehr, "Coded computation over heterogeneous clusters," arXiv preprint arXiv:1701.05973, 2017.

[8] R. Tandon, Q. Lei, A. G. Dimakis, and N. Karampatziakis, "Gradient coding," arXiv preprint arXiv:1612.03301, 2016.

[9] S. Dutta, V. Cadambe, and P. Grover, "Coded convolution for parallel and distributed computing within a deadline," arXiv preprint arXiv:1705.03875, 2017.

[10] K. Lee, C. Suh, and K. Ramchandran, "High-dimensional coded matrix multiplication," in 2017 IEEE International Symposium on Information Theory (ISIT), pp. 2418–2422, June 2017.

[11] R. Singleton, "Maximum distance q-nary codes," IEEE Transactions on Information Theory, vol. 10, no. 2, pp. 116–118, 1964.

[12] K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. C-33, pp. 518–528, June 1984.

[13] J.-Y. Jou and J. A. 
Abraham, "Fault-tolerant matrix arithmetic and signal processing on highly concurrent computing structures," Proceedings of the IEEE, vol. 74, pp. 732–741, May 1986.

[14] F. Didier, "Efficient erasure decoding of Reed-Solomon codes," arXiv preprint arXiv:0901.1886, 2009.

[15] K. S. Kedlaya and C. Umans, "Fast polynomial factorization and modular composition," SIAM Journal on Computing, vol. 40, no. 6, pp. 1767–1802, 2011.

[16] S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "Coded MapReduce," 53rd Annual Allerton Conference on Communication, Control, and Computing, Sept. 2015.

[17] S. Li, M. A. Maddah-Ali, Q. Yu, and A. S. Avestimehr, "A fundamental tradeoff between computation and communication in distributed computing," IEEE Transactions on Information Theory, vol. 64, pp. 109–128, Jan. 2018.

[18] Q. Yu, S. Li, M. A. Maddah-Ali, and A. S. Avestimehr, "How to optimally allocate resources for coded distributed computing?," in 2017 IEEE International Conference on Communications (ICC), pp. 1–7, May 2017.

[19] R. Roth, Introduction to Coding Theory. Cambridge University Press, 2006.

[20] S. Baktir and B. Sunar, "Achieving efficient polynomial multiplication in Fermat fields using the fast Fourier transform," in Proceedings of the 44th Annual Southeast Regional Conference, pp. 549–554, ACM, 2006.

[21] Q. Yu, M. A. Maddah-Ali, and A. S. Avestimehr, "Straggler mitigation in distributed matrix multiplication: Fundamental limits and optimal coding," arXiv preprint arXiv:1801.07487, 2018.