{"title": "PyTorch: An Imperative Style, High-Performance Deep Learning Library", "book": "Advances in Neural Information Processing Systems", "page_first": 8026, "page_last": 8037, "abstract": "Deep learning frameworks have often focused on either usability or speed, but not both. PyTorch is a machine learning library that shows that these two goals are in fact compatible: it was designed from first principles to support an imperative and Pythonic programming style that supports code as a model, makes debugging easy and is consistent with other popular scientific computing libraries, while remaining efficient and supporting hardware accelerators such as GPUs.\nIn this paper, we detail the principles that drove the implementation of PyTorch and how they are reflected in its architecture. We emphasize that every aspect of PyTorch is a regular Python program under the full control of its user. We also explain how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.\nWe demonstrate the efficiency of individual subsystems, as well as the overall speed of PyTorch on several commonly used benchmarks.", "full_text": "PyTorch: An Imperative Style, High-Performance\n\nDeep Learning Library\n\nAdam Paszke\n\nUniversity of Warsaw\n\nSam Gross\n\nFacebook AI Research\n\nadam.paszke@gmail.com\n\nsgross@fb.com\n\nFrancisco Massa\n\nFacebook AI Research\n\nfmassa@fb.com\n\nAdam Lerer\n\nFacebook AI Research\n\nalerer@fb.com\n\nJames Bradbury\n\nGoogle\n\njekbradbury@gmail.com\n\nGregory Chanan\n\nFacebook AI Research\n\ngchanan@fb.com\n\nTrevor Killeen\nSelf Employed\n\nZeming Lin\n\nFacebook AI Research\n\nNatalia Gimelshein\n\nNVIDIA\n\nkilleent@cs.washington.edu\n\nzlin@fb.com\n\nngimelshein@nvidia.com\n\nLuca Antiga\n\nOrobix\n\nAlban Desmaison\nOxford University\n\nAndreas K\u00f6pf\n\nXamla\n\nluca.antiga@orobix.com\n\nalban@robots.ox.ac.uk\n\nandreas.koepf@xamla.com\n\nEdward 
Yang\n\nFacebook AI Research\n\nezyang@fb.com\n\nZach DeVito\n\nFacebook AI Research\n\nMartin Raison\n\nNabla\n\nzdevito@cs.stanford.edu\n\nmartinraison@gmail.com\n\nAlykhan Tejani\n\nTwitter\n\nSasank Chilamkurthy\n\nQure.ai\n\natejani@twitter.com\n\nsasankchilamkurthy@gmail.com\n\nBenoit Steiner\n\nFacebook AI Research\nbenoitsteiner@fb.com\n\nLu Fang\nFacebook\n\nlufang@fb.com\n\nJunjie Bai\nFacebook\n\njbai@fb.com\n\nSoumith Chintala\n\nFacebook AI Research\nsoumith@gmail.com\n\nAbstract\n\nDeep learning frameworks have often focused on either usability or speed, but\nnot both. PyTorch is a machine learning library that shows that these two goals\nare in fact compatible: it provides an imperative and Pythonic programming style\nthat supports code as a model, makes debugging easy and is consistent with other\npopular scienti\ufb01c computing libraries, while remaining ef\ufb01cient and supporting\nhardware accelerators such as GPUs.\nIn this paper, we detail the principles that drove the implementation of PyTorch\nand how they are re\ufb02ected in its architecture. We emphasize that every aspect of\nPyTorch is a regular Python program under the full control of its user. We also\nexplain how the careful and pragmatic implementation of the key components of\nits runtime enables them to work together to achieve compelling performance.\nWe demonstrate the ef\ufb01ciency of individual subsystems, as well as the overall\nspeed of PyTorch on several common benchmarks.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\f1\n\nIntroduction\n\nWith the increased interest in deep learning in recent years, there has been an explosion of machine\nlearning tools. Many popular frameworks such as Caffe [1], CNTK [2], TensorFlow [3], and\nTheano [4], construct a static data\ufb02ow graph that represents the computation and which can then be\napplied repeatedly to batches of data. 
This approach provides visibility into the whole computation\nahead of time, and can theoretically be leveraged to improve performance and scalability. However, it\ncomes at the cost of ease of use, ease of debugging, and \ufb02exibility of the types of computation that\ncan be represented.\nPrior work has recognized the value of dynamic eager execution for deep learning, and some recent\nframeworks implement this de\ufb01ne-by-run approach, but do so either at the cost of performance\n(Chainer [5]) or using a less expressive, faster language (Torch [6], DyNet [7]), which limits their\napplicability.\nHowever, with careful implementation and design choices, dynamic eager execution can be achieved\nlargely without sacri\ufb01cing performance. This paper introduces PyTorch, a Python library that\nperforms immediate execution of dynamic tensor computations with automatic differentiation and\nGPU acceleration, and does so while maintaining performance comparable to the fastest current\nlibraries for deep learning. This combination has turned out to be very popular in the research\ncommunity with, for instance, 296 ICLR 2019 submissions mentioning PyTorch.\n\n2 Background\n\nFour major trends in scienti\ufb01c computing have become increasingly important for deep learning.\nFirst, starting in the 1960s, the development of domain speci\ufb01c languages such as APL [8], MATLAB\n[9], R [10] and Julia [11], turned multidimensional arrays (often referred to as tensors) into \ufb01rst-class\nobjects supported by a comprehensive set of mathematical primitives (or operators) to manipulate\nthem. Separately, libraries such as NumPy[12], Torch[6], Eigen[13] and Lush[14] made array-based\nprogramming productive in general purpose languages such as Python, Lisp, C++ and Lua.\nSecond, the development of automatic differentiation [15] made it possible to fully automate\nthe daunting labor of computing derivatives. 
This made it signi\ufb01cantly easier to experiment with\ndifferent machine learning approaches while still allowing for ef\ufb01cient gradient based optimization.\nThe autograd [16] package popularized the use of this technique for NumPy arrays, and similar\napproaches are used in frameworks such as Chainer [5], DyNet [7], Lush [14], Torch [6], Jax [17]\nand Flux.jl [18].\nThird, with the advent of the free software movement, the scienti\ufb01c community moved away from\nclosed proprietary software such as Matlab[9], and towards the open-source Python ecosystem\nwith packages like NumPy [12], SciPy [19], and Pandas [20]. This ful\ufb01lled most of the numerical\nanalysis needs of researchers while allowing them to take advantage of a vast repository of libraries\nto handle dataset preprocessing, statistical analysis, plotting, and more. Moreover, the openness,\ninteroperability, and \ufb02exibility of free software fostered the development of vibrant communities that\ncould quickly address new or changing needs by extending the existing functionality of a library or if\nneeded by developing and releasing brand new ones. While there is a rich offering of open-source\nsoftware for neural networks in languages other than Python, starting with Lush [14] in Lisp, Torch [6]\nin C++, Objective-C and Lua, EBLearn [21] in C++, Caffe [1] in C++, the network effects of a large\necosystem such as Python made it an essential skill to jumpstart one\u2019s research. Hence, since 2014,\nmost deep learning frameworks converged on a Python interface as an essential feature.\nFinally, the availability and commoditization of general-purpose massively parallel hardware such\nas GPUs provided the computing power required by deep learning methods. 
Specialized libraries\nsuch as cuDNN [22], along with a body of academic work (such as [23] and [24]), produced a\nset of high-performance reusable deep learning kernels that enabled frameworks such as Caffe [1],\nTorch7 [25], or TensorFlow [3] to take advantage of these hardware accelerators.\nPyTorch builds on these trends by providing an array-based programming model accelerated by GPUs\nand differentiable via automatic differentiation integrated in the Python ecosystem.\n\n2\n\n\f3 Design principles\n\nPyTorch\u2019s success stems from weaving previous ideas into a design that balances speed and ease of\nuse. There are four main principles behind our choices:\nBe Pythonic Data scientists are familiar with the Python language, its programming model, and its\ntools. PyTorch should be a \ufb01rst-class member of that ecosystem. It follows the commonly established\ndesign goals of keeping interfaces simple and consistent, ideally with one idiomatic way of doing\nthings. It also integrates naturally with standard plotting, debugging, and data processing tools.\nPut researchers \ufb01rst\nPyTorch strives to make writing models, data loaders, and optimizers as\neasy and productive as possible. The complexity inherent to machine learning should be handled\ninternally by the PyTorch library and hidden behind intuitive APIs free of side-effects and unexpected\nperformance cliffs.\nProvide pragmatic performance\nTo be useful, PyTorch needs to deliver compelling performance,\nalthough not at the expense of simplicity and ease of use. Trading 10% of speed for a signi\ufb01cantly\nsimpler to use model is acceptable; 100% is not. Therefore, its implementation accepts added\ncomplexity in order to deliver that performance. 
Additionally, providing tools that allow researchers\nto manually control the execution of their code will empower them to find their own performance\nimprovements independent of those that the library provides automatically.\nWorse is better [26] Given a fixed amount of engineering resources, and all else being equal, the\ntime saved by keeping the internal implementation of PyTorch simple can be used to implement\nadditional features, adapt to new situations, and keep up with the fast pace of progress in the field of\nAI. Therefore, it is better to have a simple but slightly incomplete solution than a comprehensive but\ncomplex and hard-to-maintain design.\n\n4 Usability centric design\n\n4.1 Deep learning models are just Python programs\n\nIn a surprisingly short amount of time, machine learning grew from recognizing individual digits [27]\ninto autonomously playing StarCraft [28]. Consequently, the neural networks themselves evolved\nrapidly from simple sequences of feed-forward layers into incredibly varied numerical programs\noften composed of many loops and recursive functions. To support this growing complexity, PyTorch\nforgoes the potential benefits of a graph-metaprogramming based approach to preserve the imperative\nprogramming model of Python. This design was pioneered for model authoring by Chainer [5] and\nDyNet [7]. PyTorch extends this to all aspects of deep learning workflows. Defining layers, composing\nmodels, loading data, running optimizers, and parallelizing the training process are all expressed\nusing the familiar concepts developed for general purpose programming.\nThis solution ensures that any new potential neural network architecture can be easily implemented\nwith PyTorch. 
For instance, layers (which in modern machine learning should really be understood\nas stateful functions with implicit parameters) are typically expressed as Python classes whose\nconstructors create and initialize their parameters, and whose forward methods process an input\nactivation. Similarly, models are usually represented as classes that compose individual layers, but let\nus state again that nothing forces the user to structure their code in that way. Listing 1 demonstrates\nhow an entire model can be created by composing functionality provided by PyTorch such as 2d\nconvolution, matrix multiplication, dropout, and softmax to classify gray-scale images. Note that\nlinear layers are of course part of the library, but we show an example implementation to highlight\nhow simple it is.\n\nimport torch\nfrom torch import nn\n\nclass LinearLayer(nn.Module):\n    def __init__(self, in_sz, out_sz):\n        super().__init__()\n        t1 = torch.randn(in_sz, out_sz)\n        self.w = nn.Parameter(t1)\n        t2 = torch.randn(out_sz)\n        self.b = nn.Parameter(t2)\n\n    def forward(self, activations):\n        t = torch.mm(activations, self.w)\n        return t + self.b\n\nclass FullBasicModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.conv = nn.Conv2d(1, 128, 3)\n        self.fc = LinearLayer(128, 10)\n\n    def forward(self, x):\n        t1 = self.conv(x)\n        t2 = nn.functional.relu(t1)\n        t3 = self.fc(t2)\n        return nn.functional.softmax(t3)\n\nListing 1: A custom layer used as a building block for a simple but complete neural network.\n\nThis \u201ceverything is just a program\u201d philosophy is not limited to the models, and applies to\noptimizers and data loaders as well. This facilitates experimentation with new training techniques.\nFor example, to implement the very popular generative adversarial networks, one needs to specify\ntwo separate models (the generator and the discriminator), and two loss functions that depend on both\nmodels at the same time. 
Rigid APIs would struggle with this setup, but the simple design employed\nin PyTorch easily adapts to this setting as shown in Listing 2.\n\ndiscriminator = create_discriminator()\ngenerator = create_generator()\noptimD = optim.Adam(discriminator.parameters())\noptimG = optim.Adam(generator.parameters())\n\ndef step(real_sample):\n    # (1) Update Discriminator\n    errD_real = loss(discriminator(real_sample), real_label)\n    errD_real.backward()\n    fake = generator(get_noise())\n    errD_fake = loss(discriminator(fake.detach()), fake_label)\n    errD_fake.backward()\n    optimD.step()\n    # (2) Update Generator\n    errG = loss(discriminator(fake), real_label)\n    errG.backward()\n    optimG.step()\n\nListing 2: Simplified training of a generative adversarial network.\n\nSince PyTorch programs execute eagerly, all the features of Python are available throughout the\nwhole design process. Print statements, standard debuggers, and common visualization tools like\nmatplotlib all work as expected. Users do not have to wait for lengthy compilation before they can\nstart running their programs, and more importantly intermediate computations can be observed to\nunderstand how a model works and whether its results are correct.\n\n4.2 Interoperability and extensibility\n\nEasy and efficient interoperability is one of the top priorities for PyTorch because it opens the\npossibility to leverage the rich ecosystem of Python libraries as part of user programs. Hence,\nPyTorch allows for bidirectional exchange of data with external libraries. For example, it provides\na mechanism to convert between NumPy arrays and PyTorch tensors using the torch.from_numpy()\nfunction and .numpy() tensor method. Similar functionality is also available to exchange data stored\nusing the DLPack [29] format. Note that this exchange happens in both cases without any data\ncopying \u2013 objects on both sides only describe how to interpret a memory region which is shared\namong them. 
Hence, those operations are actually extremely cheap, and take constant time no matter\nhow large the converted arrays are.\nMoreover, many of the critical systems are designed specifically to be extensible. For instance, the\nautomatic differentiation system allows users to add support for custom differentiable functions.\nTo do that, users can define a new subclass of torch.autograd.Function that implements forward()\nand backward() methods, which specify the function and its derivative (or more formally the\nvector-Jacobian product). Similarly, new datasets can be added by subclassing torch.utils.data.Dataset\nand implementing two methods: __getitem__ (the indexing operator) and __len__ (the length\noperator), making datasets behave like (possibly lazy) lists. How these work is completely up to the\nimplementer, and many users leverage other Python packages for data loading. The DataLoader class\nconsumes objects conforming to this interface and provides an iterator over the data which takes\ncare of shuffling, batching, parallelization, and management of pinned CUDA memory to improve\nthroughput.\nMost importantly, users are free to replace any component of PyTorch that does not meet the needs or\nperformance requirements of their project. They are all designed to be completely interchangeable,\nand PyTorch takes great care not to impose any particular solution.\n\n4.3 Automatic differentiation\n\nSince gradient-based optimization is vital to deep learning, PyTorch must be able to automatically\ncompute gradients of models specified by our users, and those can be arbitrary Python programs.\nHowever, Python is a dynamic programming language that allows changing most behaviors at\nruntime, making ahead-of-time source-to-source differentiation cumbersome. Instead, PyTorch uses\nthe operator overloading approach, which builds up a representation of the computed function every\ntime it is executed. 
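The operator-overloading approach can be illustrated with a toy example. The sketch below is not PyTorch's implementation: the Var class, its fields, and its backward() method are hypothetical names chosen for illustration, and it supports only scalar addition and multiplication in pure Python. Each overloaded operator records its inputs and local derivatives as it runs, so the representation of the computation is rebuilt on every execution and ordinary Python control flow needs no special handling.

```python
class Var:
    '''Hypothetical scalar that records how it was computed.'''

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuples of (input Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def backward(self):
        # Order the recorded graph so each node comes after its inputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Walk from the output back to the inputs, accumulating
        # vector-Jacobian products along every recorded edge.
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += node.grad * local


def f(x, n):
    # Ordinary Python loop: the recorded graph depends on n at run time.
    y = x
    for _ in range(n):
        y = y * x + x
    return y


x = Var(2.0)
out = f(x, 2)   # computes x**3 + x**2 + x
out.backward()  # x.grad now holds 3*x**2 + 2*x + 1 evaluated at x = 2
```

Calling f with a different n records a graph of a different length on that run, without any change to Var; this is the property that lets the approach follow arbitrary Python programs.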
In its current implementation [30], PyTorch performs reverse-mode automatic\ndifferentiation, which computes the gradient of a scalar output with respect to a multivariate input.\nDifferentiating functions with more outputs than inputs is more ef\ufb01ciently executed using forward-\nmode automatic differentiation, but this use case is less common for machine learning applications.\nPyTorch can be easily extended to perform forward-mode differentiation using array-level dual\nnumbers [31, 32].\nAnother interesting and uncommon feature of our system is that it can differentiate through code\nemploying mutation on tensors, which is one of the basic building blocks of imperative programs.\nTo ensure safety, we have implemented a versioning system for tensors, which lets us track their\nmodi\ufb01cations and ensure that we always use the data we expect. One interesting tradeoff is that\nwhile we could utilize techniques like copy-on-write to support arbitrary programs, we chose to not\ngo down this path, as performance-wise it is usually bene\ufb01cial for the users to rewrite their code\nto ensure that no copies have to be performed. Hence, while most mutations are benign and can\nbe handled automatically, the really complicated cases result in a user error, which lets them know\nthat they likely want to restructure the program. This allows us to avoid introducing subtle and\nhard-to-\ufb01nd performance cliffs.\n\n5 Performance focused implementation\n\nRunning deep learning algorithms ef\ufb01ciently from a Python interpreter is notoriously challenging: for\ninstance, the global interpreter lock [33] effectively ensures that only one of any number of concurrent\nthreads is running at any given time. 
Deep learning frameworks based on the construction of a static\ndata-flow graph sidestep this problem by deferring the evaluation of the computation to a custom\ninterpreter.\nPyTorch solved the problem differently, by carefully optimizing every aspect of its execution while\nsimultaneously empowering its users to easily leverage additional optimization strategies.\n\n5.1 An efficient C++ core\n\nDespite being closely integrated in the Python ecosystem, most of PyTorch is written in C++ to\nachieve high performance. This core libtorch library implements the tensor data structure, the GPU\nand CPU operators, and basic parallel primitives. It also provides the automatic differentiation system,\nincluding the gradient formulas for most built-in functions. This ensures that the computation of the\nderivatives of functions composed of core PyTorch operators is executed entirely in a multithreaded\nevaluator which does not require holding the Python global interpreter lock [33]. Python bindings\nare generated using YAML meta-data files. An interesting side-effect of this approach is that it\nallowed our community to quickly create bindings to multiple other languages resulting in projects\nlike NimTorch [34], hasktorch [35] and others.\nThis design also allowed us to create first-class C++ bindings and modeling libraries that can be\nused in places where Python is inconvenient, such as the game engine for Starcraft [36] or on mobile\nplatforms. It is even possible to take the Python code describing a PyTorch model and run it without\nPython using the TorchScript engine [37].\n\n5.2 Separate control and data flow\n\nPyTorch maintains a strict separation between its control (i.e. program branches, loops) and data flow\n(i.e. tensors and the operations performed on them). 
The resolution of the control flow is handled\nby Python and optimized C++ code executed on the host CPU, and results in a linear sequence of\noperator invocations on the device. Operators can be run either on CPU or on GPU.\nPyTorch is designed to execute operators asynchronously on GPU by leveraging the CUDA stream\nmechanism [38] to queue CUDA kernel invocations to the GPU's hardware FIFO. This allows the\nsystem to overlap the execution of Python code on CPU with tensor operators on GPU. Because\nthe tensor operations usually take a significant amount of time, this lets us saturate the GPU and\nreach peak performance even in an interpreted language with fairly high overhead like Python. Note\nthat this mechanism is nearly invisible to the user. Unless they implement their own multi-stream\nprimitives, all of the CPU-GPU synchronization is handled by the library.\nPyTorch could leverage a similar mechanism to also execute operators asynchronously on the CPU.\nHowever, the costs of cross-thread communication and synchronization would negate the performance\nbenefit of such an optimization.\n\n5.3 Custom caching tensor allocator\n\nAlmost every operator must dynamically allocate an output tensor to hold the result of its execution.\nIt is therefore critical to optimize the speed of the dynamic memory allocators. PyTorch can rely on\noptimized libraries [39, 40, 41] to handle this task on CPU. However, on GPU the cudaFree routine\nmay block its caller until all previously queued work on all GPUs completes. To avoid this bottleneck,\nPyTorch implements a custom allocator which incrementally builds up a cache of CUDA memory\nand reassigns it to later allocations without further use of CUDA APIs. 
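The caching strategy can be sketched in a few lines of pure Python. This is an illustrative model only, not PyTorch's allocator: CachingAllocator, raw_alloc, and the integer block handles are hypothetical stand-ins, and raw_alloc merely counts how often an expensive driver call such as cudaMalloc would be made. The sketch also models two details given in the rest of this section, rounding requests up to multiples of 512 bytes and keeping a separate pool per stream, and it simplifies reuse to exact size matches.

```python
from collections import defaultdict

ROUND = 512  # rounding granularity in bytes, as stated in the text


def round_up(nbytes):
    '''Round a request up to the next multiple of ROUND.'''
    return ((nbytes + ROUND - 1) // ROUND) * ROUND


class CachingAllocator:
    '''Hypothetical model of a per-stream caching allocator.'''

    def __init__(self):
        # pools[stream][size] -> list of cached block handles
        self.pools = defaultdict(lambda: defaultdict(list))
        self.live = {}          # handle -> (stream, size)
        self.driver_calls = 0   # stand-in for expensive cudaMalloc calls
        self._next_handle = 0

    def raw_alloc(self, size):
        # Stand-in for the driver: hands out a fresh handle and counts the call.
        self.driver_calls += 1
        self._next_handle += 1
        return self._next_handle

    def malloc(self, nbytes, stream=0):
        size = round_up(nbytes)
        cached = self.pools[stream][size]
        # Reuse a cached block of the same rounded size on the same stream,
        # and fall back to the driver only when the pool is empty.
        handle = cached.pop() if cached else self.raw_alloc(size)
        self.live[handle] = (stream, size)
        return handle

    def free(self, handle):
        # Return the block to the pool of the stream it was used on
        # instead of releasing it to the driver.
        stream, size = self.live.pop(handle)
        self.pools[stream][size].append(handle)


alloc = CachingAllocator()
a = alloc.malloc(1000)            # rounded to 1024 bytes, needs the driver
alloc.free(a)
b = alloc.malloc(700)             # also rounds to 1024, reuses the cached block
c = alloc.malloc(700, stream=1)   # different stream, so a separate pool is used
```

In this model the second request is served from the cache, so the driver is called twice for three allocations; the request on stream 1 cannot reuse the block cached for stream 0, which mirrors the per-stream fragmentation the section discusses.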
The incremental allocation\nis also crucial for better interoperability, because taking up all GPU memory ahead of time would\nprevent the user from utilizing other GPU-enabled Python packages.\nTo further improve its effectiveness, this allocator was tuned for the specific memory usage patterns of\ndeep learning. For example, it rounds up allocations to multiples of 512 bytes to avoid fragmentation\nissues. Moreover, it maintains a distinct pool of memory for every CUDA stream (work queue).\nThe one-pool-per-stream design assumption simplifies the implementation and improves the\nperformance of the allocator: because the CPU runs ahead of the GPU, memory is freed on the CPU before\nits last use on the GPU finishes. Since streams serialize execution, if the free precedes the reallocation\non the CPU, the same order will occur on the GPU. So the allocator can reallocate memory freed on\nthe CPU immediately as long as the new allocation is used on the same stream as the freed region.\nHowever, if an allocation was last used on one stream and then allocated on another, additional\nsynchronization is needed.\nThe one-pool-per-stream design seems limiting since the allocations end up fragmented per stream, but\nin practice PyTorch almost never uses multiple streams. It is notoriously hard to write CUDA kernels\nin a way that would let them cooperatively share the GPU because exact scheduling is hardware\ncontrolled. In practice, kernel writers usually resort to monolithic kernels that combine multiple tasks.\nData loading and distributed computing utilities are exceptions to the one-stream design, and they\ncarefully insert additional synchronization to avoid bad interactions with the allocator.\nWhile this design is susceptible to certain corner cases, it almost never exhibits unwanted behaviors\nin practical code. 
Most of our users are not aware of its existence.\n\n6\n\n\f5.4 Multiprocessing\n\nDue to the global interpreter lock (GIL) Python\u2019s default implementation does not allow concurrent\nthreads to execute in parallel. To alleviate this problem, the Python community has established a\nstandard multiprocessing module, containing a number of utilities that allow users to easily spawn\nchild processes and implement basic inter-process communication primitives.\nHowever, the implementation of the primitives uses the same form of serialization used for on-disk\npersistence, which is inef\ufb01cient when dealing with large arrays. Hence, PyTorch extends the Python\nmultiprocessing module into torch.multiprocessing, which is a drop-in replacement for the\nbuilt in package and automatically moves the data of tensors sent to other processes to shared memory\ninstead of sending it over the communication channel.\nThis design greatly improves performance and makes the process isolation weaker, resulting in a\nprogramming model which more closely resembles regular threaded programs. Users can easily\nimplement heavily parallel programs that operate on independent GPUs but later synchronize gradients\nusing all-reduce style primitives.\nAnother unique feature of this system is that it transparently handles sharing of CUDA tensors,\nmaking it easy to implement techniques like Hogwild [42].\n\n5.5 Reference counting\n\nUsers often design their models to utilize all memory available during training, and increasing batch\nsizes is a common technique of speeding up the process. Therefore, to deliver great performance,\nPyTorch has to treat memory as a scarce resource that it needs to manage carefully.\nLibraries with eager semantics have to manage tensor memory without knowing how it will be used\nin the future. Garbage collection is the typical way to handle this automatically because it has good\namortized performance. 
In this approach, the runtime periodically investigates the state of the system,\nenumerates used objects and frees everything else. However, by deferring the deallocation, it causes\nthe program to use more memory overall [43]. Given the scarcity of GPU memory, these overheads\nare unacceptable. In fact, Torch7 utilized the garbage collector built into Lua, and a common anti-\npattern among the users was to sprinkle the program with explicit triggers to the garbage collector,\nhoping that the memory errors go away.\nPyTorch takes a different approach: it relies on a reference counting scheme to track the number of\nuses of each tensor, and frees the underlying memory immediately once this count reaches zero. Note\nthat PyTorch tracks both references internal to the libtorch library and external references made by\nusers in their Python code by integrating with Python\u2019s own reference counting mechanism. This\nensures that memory is released exactly when tensors become unneeded.\nOne notable caveat is that we can only guarantee the desired performance characteristics in implemen-\ntations of languages that either already utilize reference counting (CPython, Swift, but not PyPy or\nmany scripting languages such as Lua), and those that allow for user-de\ufb01ned behavior for assignment,\ncopies, and moves (e.g. C++, Rust). Bindings to implementations that do not satisfy those criteria\nwill have to implement their own specialized memory management on top of PyTorch.\n\n6 Evaluation\n\nIn this section we compare the performance of PyTorch with several other commonly-used deep\nlearning libraries, and \ufb01nd that it achieves competitive performance across a range of tasks. All\nexperiments were performed on a workstation with two Intel Xeon E5-2698 v4 CPUs and one\nNVIDIA Quadro GP100 GPU.\n\n6.1 Asynchronous data\ufb02ow\n\nWe start by quantifying the ability of PyTorch to asynchronously execute data\ufb02ow on GPU. 
We use\nthe built-in pro\ufb01ler [44] to instrument various benchmarks and record a timeline of the execution of a\nsingle training step.\n\n7\n\n\fFigure 1 shows a representative timeline of execution for the \ufb01rst few operations of a ResNet-50\nmodel. The host CPU which queues the work quickly outpaces the execution of the operators on\nthe GPU. This allows PyTorch to achieve almost perfect device utilization. In this example, GPU\nexecution takes around three times longer than CPU scheduling. The exact ratio depends on the\nrelative performance of the host CPU and the GPU, as well as the number of elements in each tensor\nand the average arithmetic complexity of the \ufb02oating point computations to be performed on the\nGPU.\n\nFigure 1: A trace of the \ufb01rst few operators of Resnet-50. The top row depicts the execution of the control\n\ufb02ow running on the host CPU. The gray areas are Python code executed by its interpreter. The colored areas\ncorrespond to the work done on the host CPU to queue various operators (convolution, batch normalization, and\nso on). The bottom row shows the corresponding execution of those operators on the GPU. The arrows pair the\ntwo events in time.\n\n6.2 Memory management\n\nWe used the NVIDIA pro\ufb01ler to trace the execution of the CUDA runtime as well as the execution\nof the CUDA kernels launched during one training iteration of the ResNet-50 model. As shown in\nFigure 2, the behavior of the \ufb01rst iteration differs signi\ufb01cantly from that of subsequent ones. At\n\ufb01rst, calls to the CUDA memory management functions (cudaMalloc and cudaFree) slow down the\nexecution quite dramatically by blocking the CPU thread for long periods of time, hence lowering\nthe utilization of the GPU. 
This effect disappears in subsequent iterations as the PyTorch caching\nmemory allocator starts reusing previously allocated regions.\n\nFigure 2: Annotated traces of the execution of ResNet-50 on GPU.\n\n6.3 Benchmarks\n\nFinally, we can get an overall sense of single-machine eager mode performance of PyTorch by\ncomparing it to three popular graph-based deep learning frameworks (CNTK, MXNet and TensorFlow), a\ndefine-by-run framework (Chainer), and a production-oriented platform (PaddlePaddle). The Appendix\ndetails all the steps needed to reproduce our setup.\nOur results are summarized in Table 1. On all the benchmarks, the performance of PyTorch is within\n17% of that of the fastest framework. We attribute this result to the fact that these tools offload\nmost of the computation to the same version of the cuDNN and cuBLAS libraries.\n\nThroughput (higher is better)\n\nFramework     AlexNet     VGG-19   ResNet-50  MobileNet  GNMTv2        NCF\nChainer       778 \u00b1 15    N/A      219 \u00b1 1    N/A        N/A           N/A\nCNTK          845 \u00b1 8     84 \u00b1 3   210 \u00b1 1    N/A        N/A           N/A\nMXNet         1554 \u00b1 22   113 \u00b1 1  218 \u00b1 2    444 \u00b1 2    N/A           N/A\nPaddlePaddle  933 \u00b1 123   112 \u00b1 2  192 \u00b1 4    557 \u00b1 24   N/A           N/A\nTensorFlow    1422 \u00b1 27   66 \u00b1 2   200 \u00b1 1    216 \u00b1 15   9631 \u00b1 1.3%   4.8e6 \u00b1 2.9%\nPyTorch       1547 \u00b1 316  119 \u00b1 1  212 \u00b1 2    463 \u00b1 17   15512 \u00b1 4.8%  5.4e6 \u00b1 3.4%\n\nTable 1: Training speed for 6 models using 32-bit floats. Throughput is measured in images per second for the\nAlexNet, VGG-19, ResNet-50, and MobileNet models, in tokens per second for the GNMTv2 model, and in\nsamples per second for the NCF model. The fastest speed for each model is shown in bold.\n\n6.4 Adoption\n\nThe validity of design decisions and their impact on ease-of-use is hard to measure. 
As a proxy,\nwe tried to quantify how well the machine learning community received PyTorch by counting how\noften various machine learning tools (including Caffe, Chainer, CNTK, Keras, MXNet, PyTorch,\nTensorFlow, and Theano) are mentioned on arXiv e-Prints since the initial release of PyTorch in\nJanuary 2017. In Figure 3 we report the monthly number of mentions of the word \"PyTorch\" as a\npercentage of all mentions among these deep learning frameworks. We counted tools mentioned\nmultiple times in a given paper only once, and made the search case insensitive to account for various\nspellings.\n\nFigure 3: Among arXiv papers each month that mention common deep learning frameworks, percentage of\nthem that mention PyTorch.\n\n7 Conclusion and future work\n\nPyTorch has become a popular tool in the deep learning research community by combining a focus\non usability with careful performance considerations. In addition to continuing to support the latest\ntrends and advances in deep learning, in the future we plan to continue to improve the speed and\nscalability of PyTorch. Most notably, we are working on the PyTorch JIT: a suite of tools that\nallow PyTorch programs to be executed outside of the Python interpreter where they can be further\noptimized. We also intend to improve support for distributed computation by providing ef\ufb01cient\nprimitives for data parallelism as well as a Pythonic library for model parallelism based around\nremote procedure calls.\n\n8 Acknowledgements\n\nWe are grateful to the PyTorch community for their feedback and contributions that greatly in\ufb02uenced\nthe design and implementation of PyTorch. 
We thank all the PyTorch core team members, contributors\nand package maintainers including Ailing Zhang, Alex Suhan, Alfredo Mendoza, Alican Bozkurt,\nAndrew Tulloch, Ansha Yu, Anthony Shoumikhin, Bram Wasti, Brian Vaughan, Christian Puhrsch,\nDavid Reiss, David Riazati, Davide Libenzi, Dmytro Dzhulgakov, Dwaraj Rajagopal, Edward Yang,\nElias Ellison, Fritz Obermeyer, George Zhang, Hao Lu, Hong Xu, Hung Duong, Igor Fedan, Ilia\nCherniavskii, Iurii Zdebskyi, Ivan Kobzarev, James Reed, Jeff Smith, Jerry Chen, Jerry Zhang, Jiakai\nLiu, Johannes M. Dieterich, Karl Ostmo, Lin Qiao, Martin Yuan, Michael Suo, Mike Ruberry, Mikhail\nZolothukhin, Mingzhe Li, Neeraj Pradhan, Nick Korovaiko, Owen Anderson, Pavel Belevich, Peter\nJohnson, Pritam Damania, Raghuraman Krishnamoorthi, Richard Zou, Roy Li, Rui Zhu, Sebastian\nMessmer, Shen Li, Simon Wang, Supriya Rao, Tao Xu, Thomas Viehmann, Vincent\nQuenneville-Belair, Vishwak Srinivasan, Vitaly Fedyunin, Wanchao Liang, Wei Yang, Will Feng, Xiaomeng Yang,\nXiaoqiang Zheng, Xintao Chen, Yangqing Jia, Yanli Zhao, Yinghai Lu and Zafar Takhirov.\n\nReferences\n[1] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,\nSergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature\nembedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[2] Frank Seide and Amit Agarwal. Cntk: Microsoft\u2019s open-source deep-learning toolkit. In\nProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, KDD \u201916, pages 2135\u20132135, New York, NY, USA, 2016. ACM.\n\n[3] Mart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro,\nGreg S. 
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,\nAndrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser,\nManjunath Kudlur, Josh Levenberg, Dandelion Man\u00e9, Rajat Monga, Sherry Moore, Derek\nMurray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal\nTalwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vi\u00e9gas, Oriol Vinyals, Pete\nWarden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow:\nLarge-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.\n\n[4] Theano Development Team. Theano: A Python framework for fast computation of mathematical\nexpressions. arXiv e-prints, abs/1605.02688, May 2016.\n\n[5] Seiya Tokui, Kenta Oono, Shohei Hido, and Justin Clayton. Chainer: a next-generation open\nsource framework for deep learning. In Proceedings of Workshop on Machine Learning Systems\n(LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing\nSystems (NIPS), 2015.\n\n[6] Ronan Collobert, Samy Bengio, and Johnny Mari\u00e9thoz. Torch: a modular machine learning\nsoftware library. Technical report, Idiap, 2002.\n\n[7] G. Neubig, C. Dyer, Y. Goldberg, A. Matthews, W. Ammar, A. Anastasopoulos,\nM. Ballesteros, D. Chiang, D. Clothiaux, T. Cohn, K. Duh, M. Faruqui, C. Gan, D. Garrette, Y. Ji,\nL. Kong, A. Kuncoro, G. Kumar, C. Malaviya, P. Michel, Y. Oda, M. Richardson, N. Saphra,\nS. Swayamdipta, and P. Yin. DyNet: The Dynamic Neural Network Toolkit. ArXiv e-prints,\nJanuary 2017.\n\n[8] Philip S. Abrams. An APL Machine. PhD thesis, Stanford University, 1970.\n\n[9] The MathWorks, Inc., Natick, Massachusetts, United States. MATLAB and Statistics Toolbox.\n\n[10] R Core Team. R: A Language and Environment for Statistical Computing. 
R Foundation for\nStatistical Computing, Vienna, Austria.\n\n[11] Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to\nnumerical computing. SIAM review, 59(1):65\u201398, 2017.\n\n[12] Travis Oliphant. NumPy: A guide to NumPy. USA: Trelgol Publishing, 2006.\nhttp://www.numpy.org/.\n\n[13] Ga\u00ebl Guennebaud, Beno\u00eet Jacob, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.\n\n[14] Y LeCun and L Bottou. Lush reference manual. Technical report, code available at\nhttp://lush.sourceforge.net, 2002.\n\n[15] Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark\nSiskind. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res.,\n18(1):5595\u20135637, January 2017.\n\n[16] Dougal Maclaurin. Modeling, Inference and Optimization with Composable Differentiable\nProcedures. PhD thesis, Harvard University, April 2016.\n\n[17] Matthew Johnson et al. Jax. https://github.com/google/jax, 2018.\n\n[18] Mike Innes et al. Flux.jl. https://github.com/FluxML/Flux.jl, 2018.\n\n[19] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for\nPython, 2001\u2013. http://www.scipy.org/.\n\n[20] Wes McKinney. Data structures for statistical computing in python. In Proceedings of the 9th\nPython in Science Conference, pages 51\u201356, 2010.\n\n[21] Pierre Sermanet, Koray Kavukcuoglu, and Yann LeCun. Eblearn: Open-source energy-based\nlearning in c++. In 2009 21st IEEE International Conference on Tools with Artificial Intelligence,\npages 693\u2013697. IEEE, 2009.\n\n[22] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan D. Cohen, John Tran, Bryan\nCatanzaro, and Evan Shelhamer. cudnn: Efficient primitives for deep learning. CoRR,\nabs/1410.0759, 2014.\n\n[23] Andrew Lavin. maxdnn: An efficient convolution kernel for deep learning with maxwell gpus,\nJanuary 2015.\n\n[24] Andrew Lavin and Scott Gray. 
Fast algorithms for convolutional neural networks. 2016 IEEE\nConference on Computer Vision and Pattern Recognition (CVPR), pages 4013\u20134021, 2016.\n\n[25] Ronan Collobert, Koray Kavukcuoglu, and Cl\u00e9ment Farabet. Torch7: A matlab-like environment\nfor machine learning. In NIPS 2011, 2011.\n\n[26] Richard Gabriel. The rise of worse is better. http://dreamsongs.com/RiseOfWorseIsBetter.html.\n\n[27] Yann LeCun and Corinna Cortes. MNIST handwritten digit database.\nhttp://yann.lecun.com/exdb/mnist/.\n\n[28] Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets,\nMichelle Yeo, Alireza Makhzani, Heinrich K\u00fcttler, John Agapiou, Julian Schrittwieser, John\nQuan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David\nSilver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence,\nAnders Ekermo, Jacob Repp, and Rodney Tsing. Starcraft II: A new challenge for reinforcement\nlearning. CoRR, abs/1708.04782, 2017.\n\n[29] DMLC. Dlpack: Open in memory tensor structure. https://github.com/dmlc/dlpack.\n\n[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,\nZeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in\npytorch. In NIPS Workshop, 2017.\n\n[31] Dan Piponi. Automatic differentiation, C++ templates, and photogrammetry. J. Graphics, GPU,\n& Game Tools, 9(4):41\u201355, 2004.\n\n[32] Holger Leuck and Hans-Hellmut Nagel. Automatic differentiation facilitates of-integration\ninto steering-angle-based road vehicle tracking. In 1999 Conference on Computer Vision and\nPattern Recognition (CVPR \u201999), 23-25 June 1999, Ft. Collins, CO, USA, pages 2360\u20132365,\n1999.\n\n[33] The Python team. The cpython global interpreter lock.\nhttps://wiki.python.org/moin/GlobalInterpreterLock.\n\n[34] Giovanni Petrantoni and J\u00f6rg Wollenschl\u00e4ger. 
Nimtorch.\nhttps://github.com/fragcolor-xyz/nimtorch.\n\n[35] Austin Huang, Junji Hashimoto, and Sam Stites. Hasktorch.\nhttps://github.com/hasktorch/hasktorch.\n\n[36] G. Synnaeve, Z. Lin, J. Gehring, D. Gant, V. Mella, V. Khalidov, N. Carion, and N. Usunier.\nForward modeling for partial observation strategy games - a starcraft defogger. In Advances in\nNeural Information Processing Systems, pages 10761\u201310771, 2018.\n\n[37] The PyTorch team. Torch Script. https://pytorch.org/docs/stable/jit.html.\n\n[38] Justin Luitjens. Cuda streams. GPU technology conference, 2014.\n\n[39] Emery D. Berger, Kathryn S. McKinley, Robert D. Blumofe, and Paul R. Wilson. Hoard:\nA scalable memory allocator for multithreaded applications. In Proceedings of the Ninth\nInternational Conference on Architectural Support for Programming Languages and Operating\nSystems, ASPLOS IX, pages 117\u2013128, New York, NY, USA, 2000. ACM.\n\n[40] J. Evans. A scalable concurrent malloc(3) implementation for freebsd. In BSDCan \u2014 The\nTechnical BSD Conference, May 2006.\n\n[41] S. Ghemawat and P. Menage. Tcmalloc: Thread-caching malloc.\n\n[42] Benjamin Recht, Christopher R\u00e9, Stephen J. Wright, and Feng Niu. Hogwild: A lock-free\napproach to parallelizing stochastic gradient descent. In Advances in Neural Information\nProcessing Systems 24: 25th Annual Conference on Neural Information Processing Systems\n2011. Proceedings of a meeting held 12-14 December 2011, Granada, Spain, pages 693\u2013701,\n2011.\n\n[43] Matthew Hertz and Emery D. Berger. Quantifying the performance of garbage collection vs.\nexplicit memory management. In Proceedings of the 20th Annual ACM SIGPLAN Conference\non Object-oriented Programming, Systems, Languages, and Applications, OOPSLA \u201905, pages\n313\u2013326, New York, NY, USA, 2005. ACM.\n\n[44] The PyTorch team. Pytorch Autograd Profiler. 
https://pytorch.org/docs/1.0.1/autograd.html#profiler.", "award": [], "sourceid": 4399, "authors": [{"given_name": "Adam", "family_name": "Paszke", "institution": "University of Warsaw"}, {"given_name": "Sam", "family_name": "Gross", "institution": "Facebook"}, {"given_name": "Francisco", "family_name": "Massa", "institution": "Facebook AI Research"}, {"given_name": "Adam", "family_name": "Lerer", "institution": "Facebook AI Research"}, {"given_name": "James", "family_name": "Bradbury", "institution": "Google Research"}, {"given_name": "Gregory", "family_name": "Chanan", "institution": "Facebook"}, {"given_name": "Trevor", "family_name": "Killeen", "institution": "Self Employed"}, {"given_name": "Zeming", "family_name": "Lin", "institution": "Facebook AI Research"}, {"given_name": "Natalia", "family_name": "Gimelshein", "institution": "NVIDIA"}, {"given_name": "Luca", "family_name": "Antiga", "institution": "Orobix"}, {"given_name": "Alban", "family_name": "Desmaison", "institution": "Oxford University"}, {"given_name": "Andreas", "family_name": "Kopf", "institution": "Xamla"}, {"given_name": "Edward", "family_name": "Yang", "institution": "Facebook"}, {"given_name": "Zachary", "family_name": "DeVito", "institution": "Facebook AI Research"}, {"given_name": "Martin", "family_name": "Raison", "institution": "Nabla"}, {"given_name": "Alykhan", "family_name": "Tejani", "institution": "Twitter, Inc."}, {"given_name": "Sasank", "family_name": "Chilamkurthy", "institution": "Qure.ai"}, {"given_name": "Benoit", "family_name": "Steiner", "institution": "Facebook AI Research"}, {"given_name": "Lu", "family_name": "Fang", "institution": "Facebook"}, {"given_name": "Junjie", "family_name": "Bai", "institution": "Facebook"}, {"given_name": "Soumith", "family_name": "Chintala", "institution": "Facebook AI Research"}]}