{"title": "Efficient Use of Limited-Memory Accelerators for Linear Learning on Heterogeneous Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 4258, "page_last": 4267, "abstract": "We propose a generic algorithmic building block to accelerate training of  machine learning models on heterogeneous compute systems. Our scheme allows to efficiently employ compute accelerators such as GPUs and FPGAs for the training of large-scale machine learning models, when the training data exceeds their memory capacity. Also, it provides adaptivity to any system's memory hierarchy in terms of size and processing speed. Our technique is built upon novel theoretical insights regarding primal-dual coordinate methods, and uses duality gap information to dynamically decide which part of the data should be made available for fast processing. To illustrate the power of our approach we demonstrate its performance for training of generalized linear models on a large-scale dataset exceeding the memory size of a modern GPU, showing an order-of-magnitude speedup over existing approaches.", "full_text": "Ef\ufb01cient Use of Limited-Memory Accelerators\nfor Linear Learning on Heterogeneous Systems\n\nCelestine D\u00a8unner\n\nIBM Research - Zurich\n\nSwitzerland\n\nThomas Parnell\n\nIBM Research - Zurich\n\nSwitzerland\n\nMartin Jaggi\n\nEPFL\n\nSwitzerland\n\ncdu@zurich.ibm.com\n\ntpa@zurich.ibm.com\n\nmartin.jaggi@epfl.ch\n\nAbstract\n\nWe propose a generic algorithmic building block to accelerate training of machine\nlearning models on heterogeneous compute systems. Our scheme allows to ef\ufb01-\nciently employ compute accelerators such as GPUs and FPGAs for the training\nof large-scale machine learning models, when the training data exceeds their me-\nmory capacity. Also, it provides adaptivity to any system\u2019s memory hierarchy in\nterms of size and processing speed. 
Our technique is built upon novel theoretical\ninsights regarding primal-dual coordinate methods, and uses duality gap informa-\ntion to dynamically decide which part of the data should be made available for\nfast processing. To illustrate the power of our approach we demonstrate its perfor-\nmance for training of generalized linear models on a large-scale dataset exceeding\nthe memory size of a modern GPU, showing an order-of-magnitude speedup over\nexisting approaches.\n\nIntroduction\n\n1\nAs modern compute systems rapidly increase in size, complexity and computational power, they\nbecome less homogeneous. Today\u2019s systems exhibit strong heterogeneity at many levels: in terms\nof compute parallelism, memory size and access bandwidth, as well as communication bandwidth\nbetween compute nodes (e.g., computers, mobile phones, server racks, GPUs, FPGAs, storage nodes\netc.). This increasing heterogeneity of compute environments is posing new challenges for the\ndevelopment of ef\ufb01cient distributed algorithms. That is to optimally exploit individual compute\nresources with very diverse characteristics without suffering from the I/O cost of exchanging data\nbetween them.\nIn this paper, we focus on the task of training large scale\nmachine learning models in such heterogeneous compute en-\nvironments and propose a new generic algorithmic building\nblock to ef\ufb01ciently distribute the workload between heteroge-\nneous compute units. Assume two compute units, denoted A\nand B, which differ in compute power as well as memory ca-\npacity as illustrated in Figure 1. The computational power of\nunit A is smaller and its memory capacity is larger relative to\nits peer unit B (i.e., we assume that the training data \ufb01ts into\nthe memory of A, but not into B\u2019s). Hence, on the compu-\ntationally more powerful unit B, only part of the data can be\nprocessed at any given time. 
The two units, A and B, are able\nto communicate with each other over some interface; however,\nthere is a cost associated with doing so.\n\nFigure 1: Compute units A, B with different memory size, bandwidth and compute power.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThis generic setup covers many essential elements of modern machine learning systems. A typical\nexample is that of accelerator units, such as GPUs or FPGAs, augmenting traditional computers\nor servers. While such devices can offer a signi\ufb01cant increase in computational power due to their\nmassively parallel architectures, their memory capacity is typically very limited. Another example\ncan be found in hierarchical memory systems where data in the higher level memory can be accessed\nand hence processed faster than data in the \u2013 typically larger \u2013 lower level memory. Such memory\nsystems span from fast on-chip caches on one extreme to slower hard drives on the other\nextreme.\nThe core question we address in this paper is the following: How can we ef\ufb01ciently distribute the\nworkload between heterogeneous units A and B in order to accelerate large scale learning?\nThe generic algorithmic building block we propose systematically splits the overall problem into two\nworkloads, a more data-intensive but less compute-intensive part for unit A and a more\ncompute-intensive but less data-intensive part for B. These workloads are then executed in parallel, enabling\nfull utilization of both resources while keeping the amount of necessary communication between\nthe two units minimal. Such a generic algorithmic building block is useful much more widely than\njust for training on two heterogeneous compute units \u2013 it can serve as a component of larger training\nalgorithms or pipelines thereof. 
In a distributed training setting, our scheme allows each individual\nnode to locally bene\ufb01t from its own accelerator, therefore speeding up the overall task on a cluster,\ne.g., as part of [14] or another distributed algorithm. Orthogonal to such a horizontal application, our\nscheme can also be used as a building block vertically integrated in a system, serving the ef\ufb01ciency\nof several levels of the memory hierarchy of a given compute node.\n\nRelated Work. The most popular existing approach to deal with memory limitations is to process\ndata in batches. For example, for the special case of SVMs, [16] splits data samples into blocks\nwhich are then loaded and processed sequentially (on B), in the setting of limited RAM and the\nfull data residing on disk. This approach enables contiguous chunks of data to be loaded which is\nbene\ufb01cial in terms of I/O overhead; it however treats samples uniformly. The same holds for [15]\nwhere blocks to be loaded are selected randomly. Later, in [2, 7] it is proposed to selectively load\nand keep informative samples in memory in order to reduce disk access, but this approach is speci\ufb01c\nto support vectors and is unable to theoretically quantify the possible speedup.\nIn this work, we propose a novel, theoretically-justi\ufb01ed scheme to ef\ufb01ciently deal with memory\nlimitations in the heterogeneous two-unit setting illustrated in Figure 1. Our scheme can be applied\nto a broad class of machine learning problems, including generalized linear models, empirical risk\nminimization problems with a strongly convex regularizer, such as SVM, as well as sparse models,\nsuch as Lasso. In contrast to the related line of research [16, 2, 7], our scheme is designed to take full\nadvantage of both compute resources A and B for training, by systematically splitting the workload\namong A and B in order to adapt to their speci\ufb01c properties and to the available bandwidth between\nthem. 
At the heart of our approach lies a smart data selection scheme using coordinate-wise duality\ngaps as selection criteria. Our theory will show that our selection scheme provably improves the\nconvergence rate of training overall, by explicitly quantifying the bene\ufb01t over uniform sampling. In\ncontrast, existing work [2, 7] only showed that the linear convergence rate on SVMs is preserved\nasymptotically, but not necessarily improved.\nA different line of related research is steepest coordinate selection. It is known that steepest coor-\ndinate descent can converge much faster than uniform [8] for single coordinate updates on smooth\nobjectives, however it typically does not perform well for general convex problems, such as those\nwith L1 regularization. In our work, we overcome this issue by using the generalized primal-dual\ngaps [4] which do extend to L1 problems. Related to this notion, [3, 9, 11] have explored the use\nof similar information as an adaptive measure of importance, in order to adapt the sampling prob-\nabilities of coordinate descent. Both this line of research as well as steepest coordinate descent [8]\nare still limited to single coordinate updates, and cannot be readily extended to arbitrary accuracy\nupdates on a larger subset of coordinates (performed per communication round) as required in our\nheterogeneous setting.\n\nContributions. The main contributions of this work are summarized as follows:\n\u2022 We analyze the per-iteration-improvement of primal-dual block coordinate descent and how it\ndepends on the selection of the active coordinate block at that iteration. Further, we extend the\nconvergence theory to arbitrary approximate updates on the coordinate subsets. 
We propose\na novel dynamic selection scheme for blocks of coordinates, which relies on coordinate-wise\nduality gaps, and precisely quantify the speedup of the convergence rate over uniform sampling.\n\n\u2022 Our theoretical \ufb01ndings result in a scheme for learning in heterogeneous compute environments\nwhich is easy to use, theoretically justi\ufb01ed and versatile in that it can be adapted to given\nresource constraints, such as memory, computation and communication. Furthermore our scheme\nenables parallel execution between, and also within, two heterogeneous compute units.\n\n\u2022 For the example of joint training in a CPU plus GPU environment \u2013 which is very challenging\nfor data-intensive workloads \u2013 we demonstrate a more than 10\u00d7 speed-up over existing methods\nfor limited-memory training.\n\n2 Learning Problem\nFor the scope of this work we focus on the training of convex generalized linear models of the form\n\nmin_{\u03b1 \u2208 R^n} O(\u03b1) := f(A\u03b1) + g(\u03b1)    (1)\n\nwhere f is a smooth function and g(\u03b1) = \u2211_i g_i(\u03b1_i) is separable, \u03b1 \u2208 R^n describes the parameter\nvector and A = [a_1, a_2, . . . , a_n] \u2208 R^(d\u00d7n) the data matrix with column vectors a_i \u2208 R^d. This setting\ncovers many prominent machine learning problems, including generalized linear models as used for\nregression, classi\ufb01cation and feature selection. To avoid confusion, it is important to distinguish the\ntwo main application classes: On one hand, we cover empirical risk minimization (ERM) problems\nwith a strongly convex regularizer such as L2-regularized SVM \u2013 where \u03b1 then is the dual variable\nvector and f is the smooth regularizer conjugate, as in SDCA [13]. 
On the other hand, we also cover\nthe class of sparse models such as Lasso or ERM with a sparse regularizer \u2013 where f is the data-\ufb01t\nterm and g takes the role of the non-smooth regularizer, so \u03b1 are the original primal parameters.\nDuality Gap. Through the perspective of Fenchel-Rockafellar duality, one can, for any primal-dual\nsolution pair (\u03b1, w), de\ufb01ne the non-negative duality gap for (1) as\n\ngap(\u03b1; w) := f(A\u03b1) + g(\u03b1) + f\u2217(w) + g\u2217(\u2212A^T w)    (2)\n\nwhere the functions f\u2217, g\u2217 in (2) are de\ufb01ned as the convex conjugates\u00b9 of their corresponding\ncounterparts f, g [1]. Let us consider parameters w that are optimal relative to a given \u03b1, i.e.,\n\nw := w(\u03b1) = \u2207f(A\u03b1),    (3)\n\nwhich implies f(A\u03b1) + f\u2217(w) = \u27e8A\u03b1, w\u27e9. In this special case, the duality gap (2) simpli\ufb01es and\nbecomes separable over the columns a_i of A and the corresponding parameter weights \u03b1_i given w.\nWe will later exploit this property to quantify the suboptimality of individual coordinates.\n\ngap(\u03b1) = \u2211_{i\u2208[n]} gap_i(\u03b1_i), where gap_i(\u03b1_i) := w^T a_i \u03b1_i + g_i(\u03b1_i) + g_i\u2217(\u2212a_i^T w).    (4)\n\nNotation. For the remainder of the paper we use v[P] to denote a vector v with non-zero entries\nonly for the coordinates i \u2208 P \u2286 [n] = {1, . . . , n}. Similarly we write A[P] to denote the matrix\nconsisting only of the columns of A indexed by i \u2208 P.\n\n3 Approximate Block Coordinate Descent\nThe theory we present in this section serves to derive a theoretical framework for our heterogeneous\nlearning scheme presented in Section 4. 
Therefore, let us consider the generic block minimization\nscheme described in Algorithm 1 to train generalized linear models of the form (1).\n\n3.1 Algorithm Description\nIn every round t of Algorithm 1, a block P of m coordinates of \u03b1 is selected according to an\narbitrary selection rule. Then, an update is computed on this block of coordinates by optimizing\n\narg min_{\u2206\u03b1[P] \u2208 R^n} O(\u03b1 + \u2206\u03b1[P])    (5)\n\nwhere an arbitrary solver can be used to \ufb01nd this update. This update is not necessarily perfectly\noptimal but of a relative accuracy \u03b8, in the following sense of approximation quality:\n\nDe\ufb01nition 1 (\u03b8-Approximate Update). The block update \u2206\u03b1[P] is \u03b8-approximate iff\n\n\u2203\u03b8 \u2208 [0, 1] : O(\u03b1 + \u2206\u03b1[P]) \u2264 \u03b8 O(\u03b1 + \u2206\u03b1\u22c6[P]) + (1 \u2212 \u03b8) O(\u03b1)    (6)\n\nwhere \u2206\u03b1\u22c6[P] \u2208 arg min_{\u2206\u03b1[P] \u2208 R^n} O(\u03b1 + \u2206\u03b1[P]).\n\n\u00b9 For h : R^d \u2192 R the convex conjugate is de\ufb01ned as h\u2217(v) := sup_{u \u2208 R^d} v^T u \u2212 h(u).\n\nAlgorithm 1 Approximate Block CD\n1: Initialize \u03b1(0) := 0\n2: for t = 0, 1, 2, ... do\n3:   select a subset P with |P| = m\n4:   \u2206\u03b1[P] \u2190 \u03b8-approx. solution to (5)\n5:   \u03b1(t+1) := \u03b1(t) + \u2206\u03b1[P]\n6: end for\n\nAlgorithm 2 DUHL\n1: Initialize \u03b1(0) := 0, z := 0\n2: for t = 0, 1, 2, ...\n3:   determine P according to (13)\n4:   refresh memory B to contain A[P].\n5:   on B do:\n6:     \u2206\u03b1[P] \u2190 \u03b8-approx. solution to (12)\n7:   in parallel on A do:\n8:     while B not \ufb01nished\n9:       sample j \u2208 [n]\n10:      update z_j := gap_j(\u03b1_j^(t))\n11:  \u03b1(t+1) := \u03b1(t) + \u2206\u03b1[P]\n\n3.2 Convergence Analysis\n\nIn order to derive a precise convergence rate for Algorithm 1 we build on the convergence analysis\nof [4, 13]. 
We extend their analysis of stochastic coordinate descent in two ways: 1) to a block\ncoordinate scheme with approximate coordinate updates, and 2) to explicitly cover the importance\nof each selected coordinate, as opposed to uniform sampling.\nWe de\ufb01ne\n\n\u03c1t,P := [ (1/m) \u2211_{j\u2208P} gap_j(\u03b1_j^(t)) ] / [ (1/n) \u2211_{j\u2208[n]} gap_j(\u03b1_j^(t)) ]    (7)\n\nwhich quanti\ufb01es how much the coordinates i \u2208 P of \u03b1(t) contribute to the global duality gap\n(2), thus giving a measure of suboptimality for these coordinates. In Algorithm 1 an arbitrary\nselection scheme (deterministic or randomized) can be applied and our theory will explain how\nthe convergence of Algorithm 1 depends on the selection through the distribution of \u03c1t,P. That\nis, for strongly convex functions g_i, we found that the per-step improvement in suboptimality is\nproportional to \u03c1t,P of the speci\ufb01c coordinate block P being selected at that iteration t:\n\n\u03b5(t+1) \u2264 (1 \u2212 \u03c1t,P \u03b8 c) \u03b5(t)    (8)\n\nwhere \u03b5(t) := O(\u03b1(t)) \u2212 O(\u03b1\u22c6) measures the suboptimality of \u03b1(t) and c > 0 is a constant which\nwill be speci\ufb01ed in the following theorem. A similar dependency on \u03c1t,P can also be shown for\nnon-strongly convex functions g_i, leading to our two main convergence results for Algorithm 1:\nTheorem 1. For Algorithm 1 running on (1) where f is L-smooth and g_i is \u03bc-strongly convex with\n\u03bc > 0 for all i \u2208 [n], it holds that\n\nE_P[\u03b5(t) | \u03b1(0)] \u2264 (1 \u2212 \u03b7P (m/n) \u03bc/(\u03c3L + \u03bc))^t \u03b5(0)    (9)\n\nwhere \u03c3 := \u2016A[P]\u2016\u00b2_op and \u03b7P := min_t \u03b8 E_P[\u03c1t,P | \u03b1(t)]. Expectations are over the choice of P.\n\nThat is, for strongly convex g_i, Algorithm 1 has a linear convergence rate. This was shown before\nin [13, 4] for the special case of exact coordinate updates. 
In strong contrast to earlier coordinate\ndescent analyses which build on random uniform sampling, our theory explicitly quanti\ufb01es the\nimpact of the sampling scheme on the convergence through \u03c1t,P. This allows one to bene\ufb01t from smart\nselection and provably improve the convergence rate by taking advantage of the inhomogeneity of\nthe duality gaps. The same holds for non-strongly convex functions g_i:\n\nTheorem 2. For Algorithm 1 running on (1) where f is L-smooth and g_i has B-bounded support\nfor all i \u2208 [n], it holds that\n\nE_P[\u03b5(t) | \u03b1(0)] \u2264 (1/(\u03b7P m)) \u00b7 2\u03b3n\u00b2 / (2n + t \u2212 t0)    (10)\n\nwith \u03b3 := 2LB\u00b2\u03c3 where \u03c3 := \u2016A[P]\u2016\u00b2_op and t \u2265 t0 = max{0, (n/m) log(2\u03b7m\u03b5(0) / (n\u03b3))} where \u03b7P :=\nmin_t \u03b8 E_P[\u03c1t,P | \u03b1(t)]. Expectations are over the choice of P.\nRemark 1. Note that for uniform selection, our proven convergence rates for Algorithm 1 recover\nclassical primal-dual coordinate descent [4, 13] as a special case, where in every iteration a single\ncoordinate is selected and each update is solved exactly, i.e., \u03b8 = 1. In this case \u03c1t,P measures the\ncontribution of a single coordinate to the duality gap. For uniform sampling, E_P[\u03c1t,P | \u03b1(t)] = 1\nand hence \u03b7P = 1 which recovers [4, Theorems 8 and 9].\n\n3.3 Gap-Selection Scheme\nThe convergence results of Theorems 1 and 2 suggest that the optimal rule for selecting the block\nof coordinates P in step 3 of Algorithm 1, leading to the largest improvement in that step, is the\nfollowing:\n\nP := arg max_{P \u2282 [n] : |P| = m} \u2211_{j\u2208P} gap_j(\u03b1_j^(t)).    (11)\n\nThis scheme maximizes \u03c1t,P at every iterate. Furthermore, the selection scheme (11) guarantees\n\u03c1t,P \u2265 1 which quanti\ufb01es the relative gain over random uniform sampling. 
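
The selection rule (11) and the ratio (7) can be illustrated with a minimal sketch. It assumes the coordinate-wise duality gaps are already available as a plain list; the function names and the toy gap values are hypothetical and not taken from the paper.

```python
# Illustrative sketch only: names and toy values are assumptions.

def select_block(gaps, m):
    # Indices of the m largest coordinate-wise duality gaps,
    # i.e. a maximizer of the sum in (11).
    order = sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)
    return sorted(order[:m])


def rho(gaps, block):
    # Ratio (7): mean gap inside the block over mean gap of all coordinates.
    total = sum(gaps)
    if total == 0:
        return 1.0  # all gaps are zero; returning 1 is a convention assumed here
    mean_block = sum(gaps[j] for j in block) / len(block)
    mean_all = total / len(gaps)
    return mean_block / mean_all


gaps = [0.5, 0.0, 2.0, 0.25, 1.0, 0.25]  # hypothetical gap values, n = 6
block = select_block(gaps, 2)            # gap-based choice with m = 2
print(block, rho(gaps, block))
print(rho(gaps, [1, 3]))                 # a poorly chosen block, for comparison
```

Because the m largest gaps have a mean at least as large as the overall mean, the block returned by select_block always yields a ratio of at least 1, whereas an arbitrary block can fall below 1.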
In contrast to existing\nimportance sampling schemes [17, 12, 5] which assign static probabilities to individual coordinates,\nour selection scheme (11) is dynamic and adapts to the current state \u03b1(t) of the algorithm, similar\nto that used in [9, 11] in the standard non-heterogeneous setting.\n\n4 Heterogeneous Training\nIn this section we build on the theoretical insight of the previous section to tackle the main objective\nof this work: How can we ef\ufb01ciently distribute the workload between two heterogeneous compute\nunits A and B to train a large-scale machine learning problem where A and B ful\ufb01ll the following\ntwo assumptions:\nAssumption 1 (Difference in Memory Capacity). Compute unit A can \ufb01t the whole dataset in its\nmemory and compute unit B can only \ufb01t a subset of the data. Hence, B only has access to A[P], a\nsubset P of m columns of A, where m is determined by the memory size of B.\nAssumption 2 (Difference in Computational Power). Compute unit B can access and process data\nfaster than compute unit A.\n4.1 DUHL: A Duality Gap-Based Heterogeneous Learning Scheme\nWe propose a duality gap-based heterogeneous learning scheme, henceforth referred to as DUHL\nfor short. DUHL is designed for ef\ufb01cient training on heterogeneous compute resources as described\nabove. The core idea of DUHL is to identify a block P of coordinates which are most relevant to\nimproving the model at the current stage of the algorithm, and have the corresponding data columns,\nA[P], residing locally in the memory of B. Compute unit B can then exploit its superior compute\npower by using an appropriate solver to locally \ufb01nd a block coordinate update \u2206\u03b1[P]. At the same\ntime, compute unit A is assigned the task of updating the block P of important coordinates as\nthe algorithm proceeds and the iterates change. Through this split of workloads DUHL enables full\nutilization of both compute units A and B. 
Our scheme, summarized in Algorithm 2, \ufb01ts the theoretical\nframework established in the previous section and can be viewed as an instance of Algorithm 1,\nimplementing a time-delayed version of the duality gap-based selection scheme (11).\nLocal Subproblem. In the heterogeneous setting compute unit B only has access to its local data\nA[P] and some current state v := A\u03b1 \u2208 R^d in order to compute a block update \u2206\u03b1[P] in Step 4\nof Algorithm 1. While for quadratic functions f this information is suf\ufb01cient to optimize (5), for\nnon-quadratic functions f we consider the following modi\ufb01ed local optimization problem instead:\n\narg min_{\u2206\u03b1[P] \u2208 R^n} f(v) + \u27e8\u2207f(v), A\u2206\u03b1[P]\u27e9 + (L/2) \u2016A\u2206\u03b1[P]\u2016\u00b2_2 + \u2211_{i\u2208P} g_i((\u03b1 + \u2206\u03b1[P])_i).    (12)\n\nFigure 2: Illustration of one round of DUHL as described in Algorithm 2.\n\nIt can be shown that the convergence guarantees of Theorems 1 and 2 similarly hold if the block\ncoordinate update in Step 4 of Algorithm 1 is computed on (12) instead of (5) (see Appendix C for\nmore details).\nA Time-Delayed Gap Measure. Motivated by our theoretical \ufb01ndings, we use the duality gap as a\nmeasure of importance for selecting which coordinates unit B is working on. However, a scheme as\nsuggested in (11) is not suitable for our purpose since it requires knowledge of the duality gaps (4)\nfor every coordinate i at a given iterate \u03b1(t). For our scheme this would imply a computationally\nexpensive selection step at the beginning of every round which has to be performed in sequence to\nthe update step. To overcome this and enable parallel execution of the two workloads on A and B,\nwe propose to introduce a gap memory. This is an n-dimensional vector z where z_i measures the\nimportance of coordinate \u03b1_i. 
We have z_i := gap_i(\u03b1_i^(t\u2032)) where t\u2032 \u2208 [0, t] and the different elements\nof z are allowed to be based on different, possibly stale iterates \u03b1(t\u2032). Thus, the entries of z can be\ncontinuously updated during the course of the algorithm. Then, at the beginning of every round the\nnew block P is chosen based on the current state of z as follows:\n\nP := arg max_{P \u2282 [n] : |P| = m} \u2211_{j\u2208P} z_j.    (13)\n\nIn DUHL, keeping z up to date is the job of compute unit A. Hence, while B is computing a block\ncoordinate update \u2206\u03b1[P], A updates z by randomly sampling from the entire training data. Then,\nas soon as B is done, the current state of z is used to determine P for the next round and data\ncolumns on B are replaced if necessary. The parallel execution of the two workloads during a single\nround of DUHL is illustrated in Figure 2. Note that the freshness of the gap memory z depends\non the relative compute power of A versus B, as well as \u03b8 which controls the amount of time spent\ncomputing on unit B in every round.\nIn Section 5.1 we will experimentally investigate the effect of staleness of the values z_i on the\nconvergence behavior of our scheme.\n\n5 Experimental Results\nFor our experiments we have implemented DUHL for the particular use-case where A corresponds\nto a CPU with attached RAM and B corresponds to a GPU \u2013 A and B communicate over the PCIe\nbus. We use an 8-core Intel Xeon E5 x86 CPU with 64GB of RAM which is connected over PCIe\nGen3 to an NVIDIA Quadro M4000 GPU which has 8GB of RAM. GPUs have recently experienced\nwidespread adoption in machine learning systems and thus this hardware scenario is timely and\nhighly relevant. 
In such a setting we wish to apply DUHL to ef\ufb01ciently populate the GPU memory\nand thereby make this part of the data available for fast processing.\nGPU solver. In order to bene\ufb01t from the enormous parallelism offered by GPUs and ful\ufb01ll\nAssumption 2, we need a local solver capable of exploiting the power of the GPU. Therefore, we\nhave chosen to implement the twice parallel, asynchronous version of stochastic coordinate descent\n(TPA-SCD) that has been proposed in [10] for solving ridge regression. In this work we have\ngeneralized the implementation further so that it can be applied in a similar manner to solve the Lasso,\nas well as the SVM problem. For more details about the algorithm and how to generalize it we refer\nthe reader to Appendix D.\n\nFigure 3: Validation of faster convergence: (a) theoretical quantity \u03c1t,P (orange), versus the\npractically observed speedup (green) \u2013 both relative to the random scheme baseline, (b)\nconvergence of gap selection compared to random selection.\n\nFigure 4: Effect of stale entries in the gap memory of DUHL: (a) number of rounds needed\nto reach suboptimality 10^\u22124 for different update frequencies compared to o-DUHL, (b) the\nnumber of data columns that are replaced per round for update frequency of 5%.\n\n5.1 Algorithm Behavior\n\nFirstly, we will use the publicly available epsilon dataset from the LIBSVM website (a fully dense\ndataset with 400\u2019000 samples and 2\u2019000 features) to study the convergence behavior of our scheme.\nFor the experiments in this section we assume that the GPU \ufb01ts 25% of the training data, i.e., m = n/4,\nand show results for training the sparse Lasso as well as the ridge regression model. 
For the Lasso\ncase we have chosen the regularizer to obtain a support size of \u223c 12% and we apply the coordinate-\nwise Lipschitzing trick [4] to the L1-regularizer in order to allow the computation of the duality\ngaps. For computational details we refer the reader to Appendix E.\n\nValidation of Faster Convergence. From our theory in Section 3.2 we expect that during any\ngiven round t of Algorithm 1, the relative gain in convergence rate of one sampling scheme over\nthe other should be quanti\ufb01ed by the ratio of the corresponding values of \u03b7t,P := \u03b8\u03c1t,P (for the\nrespective block of coordinates processed in this round). To verify this, we trained a ridge regression\nmodel on the epsilon dataset implementing a) the gap-based selection scheme, (11), and b) random\nselection, \ufb01xing \u03b8 for both schemes. Then, in every round t of our experiment, we record the value\nof \u03c1t,P as de\ufb01ned in (7) and measure the relative gain in convergence rate of the gap-based scheme\nover the random scheme. In Figure 3(a) we plot the effective speedup of our scheme, and observe\nthat this speedup almost perfectly matches the improvement predicted by our theory as measured\nby \u03c1t,P - we observe an average deviation of 0.42. Both speedup numbers are calculated relative to\nplain random selection. In Figure 3(b) we see that the gap-based selection can achieve a remarkable\n10\u00d7 improvement in convergence over the random reference scheme. When running on sparse\nproblems instead of ridge regression, we have observed \u03c1t,P of the oracle scheme converging to n\nm\nwithin only a few iterations if the support of the problem is smaller than m and \ufb01ts on the GPU.\n\nEffect of Gap-Approximation.\nIn this section we study the effect of using stale, inconsistent gap-\nmemory entries for selection on the convergence of DUHL. 
While the freshness of the memory\nentries is, in reality, determined by the relative compute power of unit B over unit A and the relative\naccuracy \u03b8, in this experiment we arti\ufb01cially vary the number of gap updates performed during each\nround while keeping \u03b8 \ufb01xed. We train the Lasso model and show, in Figure 4(a), the number of\nrounds needed to reach a suboptimality of 10^\u22124, as a function of the number of gap entries updated\nper round. As a reference we show o-DUHL which has access to an oracle providing the true duality\ngaps. We observe that our scheme is quite robust to stale gap values and can achieve performance\nwithin a factor of two of the oracle scheme up to an average delay of 20 iterations. As the update\nfrequency decreases we observed that the convergence slows down in the initial rounds because the\nalgorithm needs more rounds until the active set of the sparse problem is correctly detected.\n\nFigure 5: Performance results of DUHL on the 30GB ImageNet dataset. I/O cost (top) and\nconvergence behavior (bottom) for (d) Lasso, (e) SVM and (f) ridge regression.\n\nReduced I/O operations. The ef\ufb01ciency of our scheme regarding I/O operations is demonstrated\nin Figure 4(b), where we plot the number of data columns that are replaced on B in every round\nof Algorithm 2. Here the Lasso model is trained assuming a gap update frequency of 5%. We\nobserve that the number of required I/O operations of our scheme is decreasing over the course of\nthe algorithm. When increasing the freshness of the gap memory entries we could see the number\nof swaps go to zero faster.\n\n5.2 Reference Schemes\nIn the following we compare the performance of our scheme against four reference schemes. 
We\ncompare against the most widely-used scheme for using a GPU to accelerate training when the data\ndoes not \ufb01t into the memory of the GPU, that is the sequential block selection scheme presented\nin [16]. Here the data columns are split into blocks of size m which are sequentially put on the GPU\nand operated on (the data is ef\ufb01ciently copied to the GPU as a contiguous memory block).\nWe also compare against importance sampling as presented in [17], which we refer to as IS. Since\nprobabilities assigned to individual data columns are static we cannot use them as importance mea-\nsures in a deterministic selection scheme. Therefore, in order to apply importance sampling in the\nheterogeneous setting, we non-uniformly sample m data-columns to reside inside the GPU memory\nin every round of Algorithm 2 and have the CPU determine the new set in parallel. As we will see,\ndata column norms often come with only small variance, in particular for dense datasets. Therefore,\nimportance sampling often fails to give a signi\ufb01cant gain over uniformly random selection.\nAdditionally, we compare against a single-threaded CPU implementation of a stochastic coordinate\ndescent solver to demonstrate that with our scheme, the use of a GPU in such a setting indeed yields a\nsigni\ufb01cant speedup over a basic CPU implementation despite the high I/O cost of repeatedly copying\ndata on and off the GPU memory. To the best of our knowledge, we are the \ufb01rst to demonstrate this.\nFor all competing schemes, we use TPA-SCD as the solver to ef\ufb01ciently compute the block update\n\u2206\u03b1[P] on the GPU. The accuracy \u03b8 of the block update computed in every round is controlled by\nthe number of randomized passes of TPA-SCD through the coordinates of the selected block P. 
For\na fair comparison we optimize this parameter for the individual schemes.\n\n5.3 Performance Analysis of DUHL\nFor our large-scale experiments we use an extended version of the Kaggle Dogs vs. Cats ImageNet\ndataset as presented in [6], where we additionally double the number of samples, while using single\nprecision \ufb02oating point numbers. The resulting dataset is fully dense and consists of 40\u2019000 samples\nand 200\u2019704 features, resulting in over 8 billion non-zero elements and a data size of 30GB. Since\nthe memory capacity of our GPU is 8GB, we can put \u223c 25% of the data on the GPU. We will show\nresults for training a sparse Lasso model, ridge regression as well as linear L2-regularized SVM.\nFor Lasso we choose the regularization to achieve a support size of 12%, whereas for SVM the\nregularizer was chosen through cross-validation. For all three tasks, we compare the performance\nof DUHL to sequential block selection, random selection, selection through importance sampling\n(IS) all on GPU, as well as a single-threaded CPU implementation. In Figure 5(d) and 5(e) we\ndemonstrate that for Lasso as well as SVM, DUHL converges 10\u00d7 faster than any reference scheme.\nThis gain is achieved by improved convergence \u2013 quanti\ufb01ed through \u03c1t,P \u2013 as well as through\nreduced I/O cost, as illustrated in the top plots of Figure 5, which show the number of data columns\nreplaced per round. The results in Figure 5(f) show that the application of DUHL is not limited\nto sparse problems and SVMs. Even for ridge regression DUHL signi\ufb01cantly outperforms all the\nreference schemes considered in this study.\n\n6 Conclusion\n\nWe have presented a novel theoretical analysis of block coordinate descent, highlighting how the\nperformance depends on the coordinate selection. 
These results prove that the contribution of\nindividual coordinates to the overall duality gap is indicative of their relevance to the overall model\noptimization. Using this measure, we develop a generic scheme for efficient training in the presence\nof high-performance resources of limited memory capacity. We propose DUHL, an efficient gap\nmemory-based strategy to select which part of the data to make available for fast processing. On a\nlarge dataset which exceeds the capacity of a modern GPU, we demonstrate that our scheme\noutperforms existing sequential approaches by over 10\u00d7 for Lasso and SVM models. Our results show\nthat the practical gain matches the improved convergence predicted by our theory for gap-based\nsampling under the given memory and communication constraints, highlighting the versatility of the\napproach.\n\nReferences\n[1] Heinz H Bauschke and Patrick L Combettes. Convex Analysis and Monotone Operator Theory in Hilbert\nSpaces. CMS Books in Mathematics. Springer New York, New York, NY, 2011.\n\n[2] Kai-Wei Chang and Dan Roth. Selective block minimization for faster convergence of limited memory\nlarge-scale linear models. In Proceedings of the 17th ACM SIGKDD international conference on\nKnowledge Discovery and Data Mining, pages 699\u2013707, New York, USA, August 2011. ACM.\n\n[3] Dominik Csiba, Zheng Qu, and Peter Richt\u00e1rik. Stochastic Dual Coordinate Ascent with Adaptive\nProbabilities. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning, February\n2015.\n\n[4] Celestine D\u00fcnner, Simone Forte, Martin Tak\u00e1\u010d, and Martin Jaggi. Primal-Dual Rates and Certificates.\nIn Proceedings of the 33rd International Conference on Machine Learning (ICML) - Volume 48, pages\n783\u2013792, 2016.\n\n[5] Olivier Fercoq and Peter Richt\u00e1rik. Optimization in High Dimensions via Accelerated, Parallel, and\nProximal Coordinate Descent. 
SIAM Review, 58(4):739\u2013771, January 2016.\n\n[6] Christina Heinze, Brian McWilliams, and Nicolai Meinshausen. DUAL-LOCO: Distributing Statistical\nEstimation Using Random Projections. In AISTATS - Proceedings of the th International Conference on\nArtificial Intelligence and Statistics, pages 875\u2013883, 2016.\n\n[7] Shin Matsushima, SVN Vishwanathan, and Alex J Smola. Linear support vector machines via dual cached\nloops. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and\ndata mining, pages 177\u2013185, New York, USA, 2012. ACM Press.\n\n[8] Julie Nutini, Mark Schmidt, Issam Laradji, Michael Friedlander, and Hoyt Koepke. Coordinate Descent\nConverges Faster with the Gauss-Southwell Rule Than Random Selection. In ICML 2015 - Proceedings\nof the 32nd International Conference on Machine Learning, pages 1632\u20131641, 2015.\n\n[9] Anton Osokin, Jean-Baptiste Alayrac, Isabella Lukasewitz, Puneet K. Dokania, and Simon\nLacoste-Julien. Minding the gaps for block Frank-Wolfe optimization of structured SVMs. In Proceedings of\nthe 33rd International Conference on Machine Learning (ICML) - Volume 48, pages 593\u2013602. JMLR.org,\n2016.\n\n[10] Thomas Parnell, Celestine D\u00fcnner, Kubilay Atasu, Manolis Sifalakis, and Haris Pozidis. Large-Scale\nStochastic Learning using GPUs. In Proceedings of the 6th International Workshop on Parallel and\nDistributed Computing for Large Scale Machine Learning and Big Data Analytics (IPDPSW), IEEE,\n2017.\n\n[11] Dmytro Perekrestenko, Volkan Cevher, and Martin Jaggi. Faster Coordinate Descent via Adaptive\nImportance Sampling. In AISTATS - Artificial Intelligence and Statistics, pages 869\u2013877. April 2017.\n\n[12] Zheng Qu and Peter Richt\u00e1rik. Coordinate descent with arbitrary sampling I: algorithms and complexity.\nOptimization Methods and Software, 31(5):829\u2013857, April 2016.\n\n[13] Shai Shalev-Shwartz and Tong Zhang. 
Stochastic dual coordinate ascent methods for regularized loss.\nJ. Mach. Learn. Res., 14(1):567\u2013599, February 2013.\n\n[14] Virginia Smith, Simone Forte, Chenxin Ma, Martin Tak\u00e1\u010d, Michael I Jordan, and Martin Jaggi. CoCoA:\nA General Framework for Communication-Efficient Distributed Optimization. arXiv, November 2016.\n\n[15] Ian En-Hsu Yen, Shan-Wei Lin, and Shou-De Lin. A dual augmented block minimization framework for\nlearning with limited memory. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett,\neditors, Advances in Neural Information Processing Systems 28, pages 3582\u20133590. Curran Associates,\nInc., 2015.\n\n[16] Hsiang-Fu Yu, Cho-Jui Hsieh, Kai-Wei Chang, and Chih-Jen Lin. Large Linear Classification When Data\nCannot Fit in Memory. ACM Transactions on Knowledge Discovery from Data, 5(4):1\u201323, February\n2012.\n\n[17] Peilin Zhao and Tong Zhang. Stochastic Optimization with Importance Sampling for Regularized Loss\nMinimization. In ICML 2015 - Proceedings of the 32nd International Conference on Machine Learning,\npages 1\u20139, 2015.\n", "award": [], "sourceid": 2233, "authors": [{"given_name": "Celestine", "family_name": "D\u00fcnner", "institution": "IBM Research"}, {"given_name": "Thomas", "family_name": "Parnell", "institution": "IBM Research"}, {"given_name": "Martin", "family_name": "Jaggi", "institution": "EPFL"}]}