{"title": "Short-Dot: Computing Large Linear Transforms Distributedly Using Coded Short Dot Products", "book": "Advances in Neural Information Processing Systems", "page_first": 2100, "page_last": 2108, "abstract": "Faced with saturation of Moore's law and increasing size and dimension of data, system designers have increasingly resorted to parallel and distributed computing to reduce computation time of machine-learning algorithms. However, distributed computing is often bottle necked by a small fraction of slow processors called \"stragglers\" that reduce the speed of computation because the fusion node has to wait for all processors to complete their processing. To combat the effect of stragglers, recent literature proposes introducing redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique - that we call \"Short-Dot\" - to introduce redundant computations in a coding theory inspired fashion, for computing linear transforms of long vectors. Instead of computing long dot products as required in the original linear transform, we construct a larger number of redundant and short dot products that can be computed more efficiently at individual processors. Further, only a subset of these short dot products are required at the fusion node to finish the computation successfully. We demonstrate through probabilistic analysis as well as experiments on computing clusters that Short-Dot offers significant speed-up compared to existing techniques. 
We also derive trade-offs between the length of the dot-products and the resilience to stragglers (number of processors required to finish), for any such strategy and compare it to that achieved by our strategy.", "full_text": "\"Short-Dot\": Computing Large Linear Transforms Distributedly Using Coded Short Dot Products

Sanghamitra Dutta
Carnegie Mellon University
sanghamd@andrew.cmu.edu

Viveck Cadambe
Pennsylvania State University
viveck@engr.psu.edu

Pulkit Grover
Carnegie Mellon University
pgrover@andrew.cmu.edu

Abstract

Faced with saturation of Moore's law and increasing size and dimension of data, system designers have increasingly resorted to parallel and distributed computing to reduce computation time of machine-learning algorithms. However, distributed computing is often bottlenecked by a small fraction of slow processors called "stragglers" that reduce the speed of computation because the fusion node has to wait for all processors to complete their processing. To combat the effect of stragglers, recent literature proposes introducing redundancy in computations across processors, e.g., using repetition-based strategies or erasure codes. The fusion node can exploit this redundancy by completing the computation using outputs from only a subset of the processors, ignoring the stragglers. In this paper, we propose a novel technique, which we call "Short-Dot", to introduce redundant computations in a coding theory inspired fashion, for computing linear transforms of long vectors. Instead of computing long dot products as required in the original linear transform, we construct a larger number of redundant and short dot products that can be computed more efficiently at individual processors. Further, only a subset of these short dot products are required at the fusion node to finish the computation successfully.
We demonstrate through probabilistic analysis as well as experiments on computing clusters that Short-Dot offers significant speed-up compared to existing techniques. We also derive trade-offs between the length of the dot-products and the resilience to stragglers (number of processors required to finish), for any such strategy and compare it to that achieved by our strategy.

1 Introduction

This work proposes a coding-theory inspired computation technique for speeding up the computation of linear transforms of high-dimensional data by distributing it across multiple processing units that compute shorter dot products. Our main focus is on addressing the "straggler effect," i.e., the problem of delays caused by a few slow processors that bottleneck the entire computation. To address this problem, we provide techniques (building on [1] [2] [3] [4] [5]) that introduce redundancy in the computation by designing a novel error-correction mechanism that allows the size of individual dot products computed at each processor to be shorter than the length of the input.

The problem of computing linear transforms of high-dimensional vectors is "the" critical step [6] in several machine learning and signal processing applications. Dimensionality reduction techniques such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and taking random projections require the computation of short and fat linear transforms on high-dimensional data. Linear transforms are the building blocks of solutions to various machine learning problems, e.g., regression and classification, and are also used in acquiring and pre-processing the data through Fourier transforms, wavelet transforms, filtering, etc. Fast and reliable computation of linear transforms is thus a necessity for low-latency inference [6]. Due to the saturation of Moore's law, increasing the speed of computing in a single processor is becoming difficult, forcing practitioners to adopt parallel processing to speed up computing for ever increasing data dimensions and sizes.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Classical approaches of computing linear transforms across parallel processors, e.g., Block-Striped Decomposition [7], Fox's method [8, 7], and Cannon's method [7], rely on dividing the computational task equally among all available processors1 without any redundant computation. The fusion node collects the outputs from each processor to complete the computation and thus has to wait for all the processors to finish. In almost all distributed systems, a few slow or faulty processors, called "stragglers" [11], are observed to delay the entire computation. This unpredictable latency in distributed systems is attributed to factors such as network latency, shared resources, maintenance activities, and power limitations. In order to combat stragglers, cloud computing frameworks like Hadoop [12] employ various straggler detection techniques and usually reset the task allotted to stragglers. Forward error-correction techniques offer an alternative approach to deal with this "straggler effect" by introducing redundancy in the computational tasks across different processors. The fusion node now requires outputs from only a subset of all the processors to successfully finish. In this context, the use of preliminary erasure codes dates back to the ideas of algorithmic fault tolerance [13] [14].
Recently, optimized repetition and Maximum Distance Separable (MDS) [19] codes have been explored [2] [3] [1] [16] to speed up computations.

We consider the problem of computing Ax, where A is a given M × N matrix and x is an N × 1 vector that is input to the computation (M ≪ N). In contrast with [1], which also uses codes to compute linear transforms in parallel, we allow the size of individual dot products computed at each processor to be smaller than N, the length of the input. Why might one be interested in computing short dot products while performing an overall large linear transform? This is because for distributed digital processors, the computation time is reduced with the number of operations (length of the dot-products). In Sections 4 and 5, we show that the computation speed-up can be increased beyond that obtained in [1]. Another interesting example comes from recent work on designing processing units that exclusively compute dot-products using analog components [17, 18]. These devices are prone to errors and increased delays in convergence when designed for larger dot products.

To summarize, our main contributions are:

1. To compute Ax for a given M × N matrix A, we instead compute Fx, where we construct a P × N matrix F (total number of processors P > required number of dot-products M) such that each N-length row of F has at most N(P − K + M)/P non-zero elements. Because the locations of zeros in a row of F are known by design, this reduces the complexity of computing dot-products of rows of F with x. Here K parameterizes the resilience to stragglers: any K of the P dot products of rows of F with x are sufficient to recover Ax, i.e., any K rows of F can be linearly combined to generate the rows of A.

2. We provide fundamental limits on the trade-off between the length of the dot-products and the straggler resilience (number of processors to wait for) for any such strategy in Section 3.
This suggests a lower bound on the length of the task allotted per processor. However, we believe that these limits are loose and point to an interesting direction for future work.

3. Assuming exponential tails of service-times at each server (used in [1]), we derive the expected computation time required by our strategy and compare it to uncoded parallel processing, the repetition strategy, and MDS codes [19] (see Fig. 2). Short-Dot offers a speed-up by a factor of Ω(log(P)) over uncoded parallel processing and repetition, and nearly by a factor of Ω(P/M) compared to MDS codes when M is linear in P. The strategy outperforms repetition or MDS codes by a factor of Ω(P/(M log(P/M))) when M is sub-linear in P.

4. We provide experimental results showing that Short-Dot is faster than existing strategies.

For the rest of the paper, we define the sparsity of a vector u ∈ R^N as the number of nonzero elements in the vector, i.e., ‖u‖₀ = Σ_{j=1}^{N} I(u_j ≠ 0). We also assume that P divides N (P ≪ N).

Comparison with existing strategies: Consider the problem of computing a single dot product of an input vector x ∈ R^N with a pre-specified vector a ∈ R^N. By an "uncoded" parallel processing strategy (which includes Block-Striped Decomposition [7]), we mean a strategy that does not use redundancy to overcome delays caused by stragglers. One uncoded strategy is to partition the dot product into P smaller dot products, where P is the number of available processors. E.g., a can be divided into P parts, constructing P short vectors of sparsity N/P, with each vector stored in a different processor (as shown in Fig. 1, left). Only the nonzero values of each vector need to be stored, since the locations of the nonzero values are known a priori at every node. One might expect the computation time for each processor to reduce by a factor of P. However, now the fusion node has to wait for all the P processors to finish their computation, and the stragglers can delay the entire computation. Can we construct P vectors such that dot products of a subset of them with x are sufficient to compute ⟨a, x⟩? A simple coded strategy is repetition with block partitioning, i.e., constructing L vectors of sparsity N/L by partitioning the vector of length N into L parts (L < P), and repeating the L vectors P/L times so as to obtain P vectors of sparsity N/L, as shown in Fig. 1 (right). For each of the L parts of the vector, the fusion node only needs the output of one processor among all its repetitions. Instead of a single dot-product, if one requires the dot-product of x with M vectors {a_1, ..., a_M}, one can simply repeat the aforementioned strategy M times. For multiple dot-products, an alternative repetition-based strategy is to compute the M dot products P/M times in parallel at different processors.

Figure 1: A dot-product of length N = 12 is being computed in parallel using P = 6 processors. (Left) Uncoded parallel processing: divide into P parts. (Right) Repetition with block partitioning.

1 Strassen's algorithm [9] and its generalizations offer a recursive approach to faster matrix multiplications over multiple processors, but they are often not preferred because of their high communication cost [10].
Now we only have to wait for at least one processor corresponding to each of the M vectors to finish. Improving upon repetition, it is shown in [1] that a (P, M)-MDS code allows constructing P coded vectors such that any M of the P dot-products can be used to reconstruct all the M original dot-products (see Fig. 2b). This strategy is shown, both experimentally and theoretically, to perform better than repetition and uncoded strategies.

Figure 2: Different strategies of parallel processing: Here M = 3 dot-products of length N = 12 are being computed using P = 6 processors. (a) Uncoded parallel processing. (b) Using MDS codes. (c) Using Short-Dot.

Can we go beyond MDS codes? MDS codes-based strategies require N-length dot-products to be computed on each processor. Short-Dot instead constructs P vectors of sparsity s (less than N), such that the dot product of x with any K (≥ M) out of these P short vectors is sufficient to compute the dot-products of x with all the M given vectors (see Fig. 2c). Compared to MDS codes, Short-Dot waits for a few more processors (since K ≥ M), but each processor computes a shorter dot product. We also propose Short-MDS, an extension of the MDS codes-based strategy in [1] that creates short dot-products of length s through block partitioning, and compare it with Short-Dot. In regimes where N/s is an integer, Short-MDS may be viewed as a special case of Short-Dot. But when N/s is not an integer, Short-MDS has to wait for more processors in the worst case than Short-Dot for the same sparsity s, as discussed in Remark 1 in Section 2.

2 Our coded parallelization strategy: Short-Dot

In this section, we provide our strategy for computing the linear transform Ax, where x ∈ R^N is the input vector and A = [a_1, a_2, ..., a_M]^T is a given M × N matrix.
Short-Dot constructs a P × N matrix F = [f_1, f_2, ..., f_P]^T such that M predetermined linear combinations of any K rows of F are sufficient to generate each of {a_1^T, a_2^T, ..., a_M^T}, and any row of F has sparsity at most s = (N/P)(P − K + M). Each sparse row of F (say f_i^T) is sent to the i-th processor (i = 1, ..., P), and the dot-products of x with all sparse rows are computed in parallel. Let S_i denote the support (set of non-zero indices) of f_i. Thus, for any unknown vector x, short dot products of length |S_i| ≤ s = (N/P)(P − K + M) are computed on each processor. Since a linear combination of any K rows of F can generate the rows of A, i.e., {a_1^T, a_2^T, ..., a_M^T}, the dot-products from the earliest K out of P processors can be linearly combined to obtain the linear transform Ax. Before formally stating our algorithm, we first provide an insight into why such a matrix F exists in the following theorem, and develop an intuition on the construction strategy.

Figure 3: Short-Dot: Distributes short dot-products over P parallel processors, such that outputs from any K out of P processors are sufficient to compute successfully.

Theorem 1 Given row vectors {a_1^T, a_2^T, ..., a_M^T}, there exists a P × N matrix F such that a linear combination of any K (> M) rows of the matrix is sufficient to generate the row vectors, and each row of F has sparsity at most s = (N/P)(P − K + M), provided P divides N.

Proof: We may append (K − M) rows to A = [a_1, a_2, ..., a_M]^T to form a K × N matrix Ã = [a_1, a_2, ..., a_M, z_1, ..., z_{K−M}]^T. The precise choice of these additional vectors will be made explicit later. Next, we choose B, a P × K matrix such that any square sub-matrix of B is invertible. E.g., a Vandermonde or Cauchy matrix, or a matrix with i.i.d. Gaussian entries can be shown to satisfy this property with probability 1.
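Before proceeding to the formal lemmas, the construction can be sanity-checked numerically. The sketch below is illustrative only, not the paper's implementation: the sizes, the i.i.d. Gaussian choice of B, and the cyclic zero pattern (borrowed from the Fig. 4 pattern described later) are assumptions made for the example. It builds F column by column and verifies that an arbitrary set of K processor outputs recovers Ax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (P must divide N): recover M dot-products from any K of P outputs.
M, N, P, K = 3, 12, 6, 5

A = rng.standard_normal((M, N))
B = rng.standard_normal((P, K))  # i.i.d. Gaussian: every square sub-matrix invertible w.p. 1

# Encoding (offline): build F = B @ A_tilde column by column, choosing the appended
# entries z so that (K - M) cyclically chosen positions of each column of F vanish.
F = np.zeros((P, N))
for j in range(N):
    U = [(j + t) % P for t in range(K - M)]          # rows of column j forced to zero
    B_U = B[U, :]
    z = np.linalg.solve(B_U[:, M:], -B_U[:, :M] @ A[:, j])
    F[:, j] = B @ np.concatenate([A[:, j], z])

# Each row of F then has sparsity at most s = N (P - K + M) / P.
s = N * (P - K + M) // P
assert all(np.sum(np.abs(F[i]) > 1e-9) <= s for i in range(P))

# Online: P short dot products, one per processor; any K of them decode Ax.
x = rng.standard_normal(N)
outputs = F @ x
V = [5, 2, 0, 3, 1]                                   # the K earliest processors (arbitrary)
Ax_hat = (np.linalg.inv(B[V, :]) @ outputs[V])[:M]    # rows 1..M of (B_V)^{-1} applied to outputs
assert np.allclose(Ax_hat, A @ x)
```

The decode step mirrors the fusion-node step of the algorithm stated below: the K × K sub-matrix of B indexed by the finished processors is inverted, and its first M rows are applied to the received outputs.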
The following lemma shows that any K rows of the matrix BÃ are sufficient to generate any row of Ã, including {a_1^T, a_2^T, ..., a_M^T}:

Lemma 1 Let F = BÃ, where Ã is a K × N matrix and B is any P × K matrix such that every square sub-matrix is invertible. Then any K rows of F can be linearly combined to generate any row of Ã.

Proof: Choose an arbitrary index set χ ⊂ {1, 2, ..., P} such that |χ| = K. Let F^χ be the sub-matrix formed by the chosen K rows of F indexed by χ. Then F^χ = B^χ Ã. Now, B^χ is a K × K sub-matrix of B, and is thus invertible. Thus, Ã = (B^χ)^{−1} F^χ. The i-th row of Ã is [i-th row of (B^χ)^{−1}] F^χ for i = 1, 2, ..., K. Thus, each row of Ã is generated by the chosen K rows of F. ∎

In the next lemma, we show how the row sparsity of F can be constrained to be at most (N/P)(P − K + M) by appropriately choosing the appended vectors z_1, ..., z_{K−M}.

Lemma 2 Given an M × N matrix A = [a_1, ..., a_M]^T, let Ã = [a_1, ..., a_M, z_1, ..., z_{K−M}]^T be a K × N matrix formed by appending K − M row vectors to A. Also let B be a P × K matrix such that every square sub-matrix is invertible. Then there exists a choice of the appended vectors z_1, ..., z_{K−M} such that each row of F = BÃ has sparsity at most s = (N/P)(P − K + M).

Proof: We select a sparsity pattern that we want to enforce on F and then show that there exists a choice of the appended vectors z_1, ..., z_{K−M} such that the pattern can be enforced.

Sparsity pattern enforced on F: This is illustrated in Fig. 4. First, we construct a P × P "unit block" with a cyclic structure of nonzero entries, where the (K − M) zeros in each row and column are arranged as shown in Fig. 4.
Each row and column of the unit block have at most s_c = P − K + M non-zero entries. This unit block is replicated horizontally N/P times to form a P × N matrix with at most s_c non-zero entries in each column, and at most s = N s_c / P non-zero entries in each row. We now show how the choice of z_1, ..., z_{K−M} can enforce this pattern on F.

Figure 4: Sparsity pattern of F: (Left) Unit block (P × P); (Right) Unit block concatenated N/P times to form the P × N matrix F with row sparsity at most s.

From F = BÃ, the j-th column of F can be written as F_j = BÃ_j. Each column of F has at least K − M zeros, at locations indexed by U ⊂ {1, 2, ..., P}. Let B^U denote the (K − M) × K sub-matrix of B consisting of the rows of B indexed by U. Thus, B^U Ã_j = [0]_{(K−M)×1}. Divide Ã_j into two portions of lengths M and K − M as follows:

Ã_j = [A_j^T | z^T]^T = [a_1(j) a_2(j) ... a_M(j) z_1(j) ... z_{K−M}(j)]^T.

Here A_j = [a_1(j) a_2(j) ... a_M(j)]^T is the j-th column of the given matrix A, and z = [z_1(j), ..., z_{K−M}(j)]^T depends on the choice of the appended vectors. Thus,

B^U_{cols 1:M} A_j + B^U_{cols M+1:K} z = [0]_{(K−M)×1}
⇒ B^U_{cols M+1:K} z = −B^U_{cols 1:M} A_j
⇒ z = −(B^U_{cols M+1:K})^{−1} B^U_{cols 1:M} A_j,   (1)

where the last step uses the fact that B^U_{cols M+1:K} is invertible because it is a (K − M) × (K − M) square sub-matrix of B.
This explicitly provides the vector z, which completes the j-th column of Ã. The other columns of Ã can be completed similarly, proving the lemma. ∎

From Lemmas 1 and 2, for a given M × N matrix A, there always exists a P × N matrix F such that a linear combination of any K rows of F is sufficient to generate our given vectors and each row of F has sparsity at most s = (N/P)(P − K + M). This proves the theorem. ∎

With this insight in mind, we now formally state our computation strategy:

Algorithm 1 Short-Dot

[A] Pre-processing step: Encode F (performed offline)
Given: A_{M×N} = [a_1, ..., a_M]^T = [A_1, A_2, ..., A_N], parameter K, matrix B_{P×K}
1: For j = 1 to N do
2:   Set U ← ({(j − 1), ..., (j + K − M − 1)} mod P) + 1   ▹ The set of (K − M) indices that are 0 in the j-th column of F
3:   Set B^U ← rows of B indexed by U
4:   Set z = −(B^U_{cols M+1:K})^{−1} B^U_{cols 1:M} A_j   ▹ z is a (K − M) × 1 column vector
5:   Set F_j = B [A_j^T | z^T]^T   ▹ F_j is a column vector (j-th column of F)
6: Encoded output: F_{P×N} = [f_1, f_2, ..., f_P]^T   ▹ Row representation of matrix F
7: For i = 1 to P do
8:   Store S_i ← Support(f_i)   ▹ Indices of non-zero entries in the i-th row of F
9:   Send f_i^{S_i} to the i-th processor   ▹ i-th row of F sent to the i-th processor

[B] Online computations
External input: x
Resources: P parallel processors (P > M)
[B1] Parallelization strategy: Divide the task among the parallel processors:
1: For i = 1 to P do
2:   Send x^{S_i} to the i-th processor
3:   Compute at the i-th processor: ⟨f_i^{S_i}, x^{S_i}⟩   ▹ u^S denotes only the entries of vector u indexed by S
Output: ⟨f_i^{S_i}, x^{S_i}⟩ from the K earliest processors

[B2] Fusion node: Decode the dot-products from the processor outputs:
1: Set V ← indices of the K processors that finished first
2: Set B^V ← rows of B indexed by V
3: Set v_{K×1} ← [⟨f_i^{S_i}, x^{S_i}⟩, ∀ i ∈ V]   ▹ Column vector of outputs from the first K processors
4: Set Ax = [⟨a_1, x⟩, ..., ⟨a_M, x⟩]^T ← [(B^V)^{−1}]_{rows 1:M} v
5: Output: ⟨x, a_1⟩, ..., ⟨x, a_M⟩

Table 1: Trade-off between the length of the dot-products and parameter K for different strategies

Strategy                        | Length | Parameter K
Repetition                      | N      | P − ⌊P/M⌋ + 1
MDS                             | N      | M
Short-Dot                       | s      | P − ⌊Ps/N⌋ + M
Repetition with block partition | s      | P − ⌊P/(M⌈N/s⌉)⌋ + 1
Short-MDS                       | s      | P − ⌊P/⌈N/s⌉⌋ + M

Remark 1 (Short-MDS: a special case of Short-Dot): An extension of the MDS codes-based strategy proposed in [1], which we call Short-MDS, can be designed to achieve row-sparsity s. First block-partition the matrix of N columns into ⌈N/s⌉ sub-matrices of size M × s, and also divide the total of P processors equally into ⌈N/s⌉ parts.
Now, each sub-matrix can be encoded using a (P/⌈N/s⌉, M) MDS code. In the worst case, including all integer effects, this strategy requires K = P − ⌊P/⌈N/s⌉⌋ + M processors to finish. In comparison, Short-Dot requires K = P − ⌊Ps/N⌋ + M processors to finish. In the regime where s exactly divides N, Short-MDS can be viewed as a special case of Short-Dot, as both expressions match. However, in the regime where s does not exactly divide N, Short-MDS requires more processors to finish in the worst case than Short-Dot. Short-Dot is a generalized framework that can achieve a wider variety of pre-specified sparsity patterns as required by the application. In Table 1, we compare the lengths of the dot-products and the straggler resilience K, i.e., the number of processors to wait for in the worst case, for different strategies.

3 Limits on trade-off between the length of dot-products and parameter K

Theorem 2 Let A_{M×N} be any matrix such that each column has at least one non-zero element. If a linear combination of any K rows of F_{P×N} can generate the M rows of A_{M×N}, then the average sparsity s of each row of F_{P×N} must satisfy s ≥ N(1 − K/P) + N/P.

Proof: We claim that K is strictly greater than the maximum number of zeros that can occur in any column of the matrix F. If not, suppose the j-th column of F has at least K zeros. Then there exists a choice of K rows of F whose every linear combination has a 0 at the j-th column index, so it is not possible to generate every row of the given matrix A. Thus, K is no less than 1 + max. no. of 0s in any column of F. Since the maximum is always at least the average,

K ≥ 1 + avg. no. of 0s in any column of F ≥ 1 + (N − s)P/N.   (2)

A slight re-arrangement establishes the aforementioned lower bound. ∎

Short-Dot achieves a row-sparsity of at most s = N(1 − K/P) + NM/P, while the lower bound for any such strategy is s ≥ N(1 − K/P) + N/P. Notice that the bounds only differ in the second term. We believe that the difference in the bounds arises from the looseness of the fundamental limit: our technique is based on a derivation for M = 1 (where the bound is tight), and could be tightened for M > 1.

4 Analysis of expected computation time for exponential tail models

We now provide a probabilistic analysis of the computation time required by Short-Dot and compare it with uncoded parallel processing, repetition, and MDS codes, as shown in Fig. 5. Table 2 shows the order-sense expected computation time in the regimes where M is linear and sub-linear in P. A detailed analysis is provided in the supplement.

Figure 5: Expected computation time: Short-Dot is faster than MDS when M ≪ P and than Uncoded when M ≈ P, and is universally faster over the entire range of M. For the chosen straggling parameter, Repetition is slowest. When M does not exactly divide P, the distribution of computation time for the repetition and uncoded strategies is the maximum of non-identical but independent random variables, which produces the ripples in these curves (see supplement for details).

Assume that the time required by a processor to compute a single dot-product follows an exponential distribution and is independent of the other processors, as described in [1]. Let the time required to compute a single dot-product of length N be distributed as Pr(T_N ≤ t) = 1 − exp(−µ(t/N − 1)) for all t ≥ N. Here, µ is the "straggling
Here, \u00b5 is the \u201cstraggling\n\nparameter\u201d that determines the unpredictable latency in computation time. For an s length dot product,\nwe simply replace N by s .The expected computation time for Short-Dot is the expected value of the\nK-th order statistic of these P iid exponential random variables, which is given by:\n\nE(T ) = s\n\n1 +\n\nlog( P\nP\u2212K )\n\u00b5\n\n(P \u2212 K + M )N\n\nP\n\n1 +\n\nlog( P\nP\u2212K )\n\u00b5\n\n.\n\n(3)\n\nvariables with parameter 1 is(cid:80)P\n\nHere, (3) uses the fact that the expected value of the K-th statistic of P iid exponential random\ni \u2248 log(P ) \u2212 log(P \u2212 K) [1]. The expected\ncomputation time in the RHS of (3) is minimized when P \u2212 K = \u0398(M ). This minimal expected\ntime is O( M N\n\nP ) for M linear in P and is O(cid:16) M N log(P/M )\n\nfor M sub-linear in P .\n\n(cid:17)\n\ni=1\n\nP\n\n1\n\n1\n\n(cid:33)\ni \u2212(cid:80)P\u2212K\n\ni=1\n\n=\n\n(cid:32)\n\n(cid:33)\n\n(cid:32)\n\nStrategy\n\nOnly one Processor\nUncoded (M divides P)2\nP\nRepetition (M divides P) 2 N\n\nM N\n\nM N\n\nMDS\n\nShort-Dot\n\nTable 2: Probabilistic Computation Times\nE(T )\n\nM linear in P\n\n(cid:16)\n(cid:16)\n\n(cid:17)\n(cid:17)\n(cid:19)\n\n(cid:16)\n(cid:18)\n\n\u00b5\n\n1 + 1\n\u00b5\n1 + log(P )\n1 + M log(M )\nP \u00b5\nP \u2212M )\nlog( P\n\u00b5\n\n1 +\n\nN\n\n(cid:17)\n\n(cid:18)\n\nN (P\u2212K+M )\n\nP\n\n1 +\n\nP \u2212K )\nlog( P\n\u00b5\n\nM sub-linear in P\n\n\u0398 (M N )\n\n\u0398 (M N )\n\nP log(P )(cid:1) \u0398(cid:0) M N\nP log(P )(cid:1)\n\u0398(cid:0) M N\n\u0398(cid:0) M N\nP log(P )(cid:1) \u0398 (N )\n(cid:1)(cid:1)\nO(cid:0) M N\nP log(cid:0) P\n\n\u0398(N )\nO( M N\nP )\n\n\u0398(N )\n\nM\n\n(cid:19)\n\n2 Refer to Supplement for more accurate analysis taking integer effects into account\n\nEncoding and Decoding Complexity: Even though encoding is a pre-processing step (since A is\nassumed to be given in advance), we include a complexity analysis for the sake of 
completeness. The\nP matrix inversions of size (K \u2212 M ), and a P \u00d7 K matrix multiplication with\nencoding requires N\na K \u00d7 N matrix. The naive encoding complexity is therefore O( N\nP (K \u2212 M )3 + N KP ). This is\nhigher than MDS codes that has an encoding complexity of O(N M P )), but it is only a one-time cost\nthat provides savings in online steps (as discussed earlier in this section). The decoding complexity\nof Short-Dot is O(K 3 + KM ) which does not depend on N when M, K (cid:28) N. This is nearly the\nsame as O(M 3 + M 2) complexity of MDS codes. We believe that the complexities might be reduced\nfurther, based on special choices of encoding matrix B.\n\n7\n\n\fTable 3: Experimental computation time of 10000 dot products (N = 785, M = 10, P = 20)\n\nStrategy\nUncoded\nShort-Dot\nMDS\n\nParameter K Mean\n20\n18\n10\n\nSTDEV\n11.8653 2.8427\n10.4306 0.9253\n15.3411 0.8987\n\nMinimum Time Maximum Time\n9.5192\n8.2145\n13.8232\n\n27.0818\n11.8340\n17.5416\n\n5 Experimental Results\n\nWe perform experiments on computing clusters at CMU to test the computational time. We use\nHTCondor [20] to schedule jobs simultaneously among the P processors. We compare the time\nrequired to classify 10000 handwritten digits of the MNIST [21] database, assuming we are given a\ntrained 1-layer Neural Network. We separately trained the Neural network using training samples, to\nform a matrix of weights, denoted by A10\u00d7785. For testing, the multiplication of this given 10 \u00d7 785\nmatrix, with the test data matrix X785\u00d710000 is considered. The total number of processors was 20.\nAssuming that A10\u00d7785 is encoded into F20\u00d7785 in a pre-processing step, we store the rows of F\nin each processor apriori. Now portions of the data matrix X of size s \u00d7 10000 are sent to each of\nthe P parallel processors as input. 
We also send a C program to compute dot-products of length s = (N/P)(P − K + M) with the appropriate rows of F using the command condor-submit. Each processor outputs the value of one dot-product. The computation time reported in Fig. 6 includes the total time required to communicate inputs to each processor, compute the dot-products in parallel, fetch the required outputs, decode, and classify all the 10000 test images, based on 35 experimental runs.

Figure 6: Experimental results: (Left) Mean computation time for the Uncoded strategy, Short-Dot (K = 18) and MDS codes: Short-Dot is faster than MDS by 32% and than Uncoded by 12%. (Right) Scatter plot of computation time for different experimental runs: Short-Dot is faster most of the time.

Key observations (see Table 3 for detailed results): Computation time varies based on the nature of the straggling at the particular instant of the experimental run. Short-Dot outperforms both MDS and Uncoded in mean computation time. Uncoded is faster than MDS since the per-processor computation time for MDS is larger, which increases the straggling, even though MDS waits for only 10 out of 20 processors. However, note that Uncoded has more variability than both MDS and Short-Dot, and its maximum time observed during the experiment is much greater than both MDS and Short-Dot. The classification accuracy was 85.98% on the test data.

6 Discussion

While we have presented the case of M < P here, Short-Dot easily generalizes to the case where M ≥ P. The matrix can be divided horizontally into several chunks along the row dimension (shorter matrices), and Short-Dot can be applied on each of those chunks one after another.
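This chunking can be sketched in a few lines. The sketch is illustrative only: the chunk size and matrix dimensions are assumptions, and for brevity each chunk is encoded with a dense Gaussian code (a simple MDS-style stand-in for the full Short-Dot encoder; the sparse construction of Section 2 would be applied per chunk in practice).

```python
import numpy as np

rng = np.random.default_rng(1)

def coded_chunk_transform(A_chunk, x, P, rng):
    """Code one chunk (M_c <= P rows) so that any M_c of the P processor
    outputs recover A_chunk @ x. A dense Gaussian code is used here as a
    stand-in for the full Short-Dot encoder."""
    M_c = A_chunk.shape[0]
    B = rng.standard_normal((P, M_c))      # any M_c of its rows invertible w.p. 1
    outputs = (B @ A_chunk) @ x            # P coded dot-products, one per processor
    survivors = rng.permutation(P)[:M_c]   # the M_c earliest processors (simulated)
    return np.linalg.solve(B[survivors, :], outputs[survivors])

def tall_transform(A, x, P, rng):
    """M >= P: split A along the row dimension and code chunk after chunk."""
    chunks = [A[i:i + P] for i in range(0, A.shape[0], P)]
    return np.concatenate([coded_chunk_transform(c, x, P, rng) for c in chunks])

M, N, P = 10, 16, 4                        # M >= P: three chunks of at most P rows
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)
assert np.allclose(tall_transform(A, x, P, rng), A @ x)
```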
Moreover, if rows with the same sparsity pattern are grouped together and stored in the same processor initially, then the communication cost during the online computations is also significantly reduced, since only some elements of the unknown vector x need to be sent to a particular processor.

Acknowledgments: This work was supported by Systems on Nanoscale Information fabriCs (SONIC), one of the six SRC STARnet Centers, sponsored by MARCO and DARPA. We also acknowledge NSF Awards 1350314, 1464336, and 1553248. S. Dutta also received the Prabhu and Poonam Goel Graduate Fellowship.

References

[1] Kangwook Lee, Maximilian Lam, Ramtin Pedarsani, Dimitris Papailiopoulos, and Kannan Ramchandran. Speeding up distributed machine learning using codes. NIPS Workshop on Learning Systems, 2015.

[2] Da Wang, Gauri Joshi, and Gregory Wornell. Using straggler replication to reduce latency in large-scale parallel computing. In ACM SIGMETRICS Performance Evaluation Review, volume 43, pages 7–11, 2015.

[3] Da Wang, Gauri Joshi, and Gregory Wornell. Efficient task replication for fast response times in parallel computation. In ACM SIGMETRICS Performance Evaluation Review, volume 42, pages 599–600, 2014.

[4] Gauri Joshi, Yanpei Liu, and Emina Soljanin. On the delay-storage trade-off in content download from coded distributed storage systems. IEEE Journal on Selected Areas in Communications, 32(5):989–997, 2014.

[5] Longbo Huang, Sameer Pawar, Hao Zhang, and Kannan Ramchandran. Codes can reduce queueing delay in data centers. In Proceedings IEEE International Symposium on Information Theory (ISIT), pages 2766–2770, 2012.

[6] William Dally. High-performance hardware for machine learning. NIPS Tutorial, 2015.

[7] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms.
The Benjamin/Cummings Publishing Company, Inc., Redwood City, 1994.

[8] Geoffrey C. Fox, Steve W. Otto, and Anthony J. G. Hey. Matrix algorithms on a hypercube I: Matrix multiplication. Parallel Computing, 4(1):17–31, 1987.

[9] Volker Strassen. Gaussian elimination is not optimal. Numerische Mathematik, 13(4):354–356, 1969.

[10] Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Communication costs of Strassen's matrix multiplication. Communications of the ACM, 57(2):107–114, 2014.

[11] Jeffrey Dean and Luiz André Barroso. The tail at scale. Communications of the ACM, 56(2):74–80, 2013.

[12] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop Distributed File System. In Proceedings IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, 2010.

[13] Kuang-Hua Huang and Jacob A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 100(6):518–528, 1984.

[14] Thomas Herault and Yves Robert. Fault-Tolerance Techniques for High Performance Computing. Springer, 2015.

[15] William Ryan and Shu Lin. Channel Codes: Classical and Modern. Cambridge University Press, 2009.

[16] Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr. A unified coding framework for distributed computing with straggling servers. arXiv:1609.01690v1 [cs.IT], 2016.

[17] Ihab Nahlus, Eric P. Kim, Naresh R. Shanbhag, and David Blaauw. Energy-efficient dot-product computation using a switched analog circuit architecture. In International Symposium on Low Power Electronics and Design (ISLPED), pages 315–318, 2014.

[18] Ning C. Wang, Sujan K. Gonugondla, Ihab Nahlus, Naresh Shanbhag, and Eric Pop. GDOT: a graphene-based nanofunction for dot-product computation. In IEEE Symposium on VLSI Technology, 2016.

[19] HTCondor.
https://research.cs.wisc.edu/htcondor/.

[20] Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist, 1998.