{"config":{"lang":["en"],"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"index.html","text":"Distiller Documentation What is Distiller Distiller is an open-source Python package for neural network compression research. Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a PyTorch environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic. Distiller contains: A framework for integrating pruning, regularization and quantization algorithms. A set of tools for analyzing and evaluating compression performance. Example implementations of state-of-the-art compression algorithms. Motivation A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps). Why do we care about sparsity? Present day neural networks tend to be deep, with millions of weights and activations. Refer to GoogLeNet or ResNet50, for a couple of examples. These large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. So inference latency is often something we want to minimize. Large models are also memory-intensive with millions of parameters. 
Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their TCO (total cost of ownership) is correlated to their power consumption and thermal characteristics. In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery. Inference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). The storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. For these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method). Sparse neural networks hold the promise of speed, small size, and energy efficiency. Smaller Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compression formats, for representation compression to be meaningful. Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. 
In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. The best we can do is bring the compressed representation as close as possible to the compute engine. Sparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization). Faster Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. Reducing the bandwidth required by these layers, will immediately speed them up. Some pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power. More energy efficient Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and of course reduce power consumption. 
And of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.","title":"Home"},{"location":"index.html#distiller-documentation","text":"","title":"Distiller Documentation"},{"location":"index.html#what-is-distiller","text":"Distiller is an open-source Python package for neural network compression research. Network compression can reduce the footprint of a neural network, increase its inference speed and save energy. Distiller provides a PyTorch environment for prototyping and analyzing compression algorithms, such as sparsity-inducing methods and low precision arithmetic. Distiller contains: A framework for integrating pruning, regularization and quantization algorithms. A set of tools for analyzing and evaluating compression performance. Example implementations of state-of-the-art compression algorithms.","title":"What is Distiller"},{"location":"index.html#motivation","text":"A sparse tensor is any tensor that contains some zeros, but sparse tensors are usually only interesting if they contain a significant number of zeros. A sparse neural network performs computations using some sparse tensors (preferably many). These tensors can be parameters (weights and biases) or activations (feature maps). Why do we care about sparsity? Present day neural networks tend to be deep, with millions of weights and activations. Refer to GoogLeNet or ResNet50, for a couple of examples. These large models are compute-intensive which means that even with dedicated acceleration hardware, the inference pass (network evaluation) will take time. You might think that latency is an issue only in certain cases, such as autonomous driving systems, but in fact, whenever we humans interact with our phones and computers, we are sensitive to the latency of the interaction. 
We don't like to wait for search results or for an application or web-page to load, and we are especially sensitive in realtime interactions such as speech recognition. So inference latency is often something we want to minimize. Large models are also memory-intensive with millions of parameters. Moving around all of the data required to compute inference results consumes energy, which is a problem on a mobile device as well as in a server environment. Data center server-racks are limited by their power-envelope and their TCO (total cost of ownership) is correlated to their power consumption and thermal characteristics. In the mobile device environment, we are obviously always aware of the implications of power consumption on the device battery. Inference performance in the data center is often measured using a KPI (key performance indicator) which folds latency and power considerations: inferences per second, per Watt (inferences/sec/watt). The storage and transfer of large neural networks is also a challenge in mobile device environments, because of limitations on application sizes and long application download times. For these reasons, we wish to compress the network as much as possible, to reduce the amount of bandwidth and compute required. Inducing sparseness, through regularization or pruning, in neural-network models, is one way to compress the network (quantization is another method). Sparse neural networks hold the promise of speed, small size, and energy efficiency.","title":"Motivation"},{"location":"index.html#smaller","text":"Sparse NN model representations can be compressed by taking advantage of the fact that the tensor elements are dominated by zeros. The compression format, if any, is very HW and SW specific, and the optimal format may be different per tensor (an obvious example: largely dense tensors should not be compressed). The compute hardware needs to support the compression formats, for representation compression to be meaningful. 
Compression representation decisions might interact with algorithms such as the use of tiles for memory accesses. Data such as a parameter tensor is read/written from/to main system memory compressed, but the computation can be dense or sparse. In dense compute we use dense operators, so the compressed data eventually needs to be decompressed into its full, dense size. The best we can do is bring the compressed representation as close as possible to the compute engine. Sparse compute, on the other hand, operates on the sparse representation which never requires decompression (we therefore distinguish between sparse representation and compressed representation). This is not a simple matter to implement in HW, and often means lower utilization of the vectorized compute engines. Therefore, there is a third class of representations, which take advantage of specific hardware characteristics. For example, for a vectorized compute engine we can remove an entire zero-weights vector and skip its computation (this uses structured pruning or regularization).","title":"Smaller"},{"location":"index.html#faster","text":"Many of the layers in modern neural-networks are bandwidth-bound, which means that the execution latency is dominated by the available bandwidth. In essence, the hardware spends more time bringing data close to the compute engines, than actually performing the computations. Fully-connected layers, RNNs and LSTMs are some examples of bandwidth-dominated operations. Reducing the bandwidth required by these layers, will immediately speed them up. Some pruning algorithms prune entire kernels, filters and even layers from the network without adversely impacting the final accuracy. Depending on the hardware implementation, these methods can be leveraged to skip computations, thus reducing latency and power.","title":"Faster"},{"location":"index.html#more-energy-efficient","text":"Because we pay two orders-of-magnitude more energy to access off-chip memory (e.g. 
DDR) compared to on-chip memory (e.g. SRAM or cache), many hardware designs employ a multi-layered cache hierarchy. Fitting the parameters and activations of a network in these on-chip caches can make a big difference on the required bandwidth, the total inference latency, and of course reduce power consumption. And of course, if we used a sparse or compressed representation, then we are reducing the data throughput and therefore the energy consumption.","title":"More energy efficient"},{"location":"algo_earlyexit.html","text":"Early Exit Inference While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al. in Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition point out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. Teerapittayanon et al. in BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks look at a selective approach to exit placement and criteria for exiting early. Why Does Early Exit Work? Early Exit is a strategy with a straightforward and easy-to-understand concept. Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries. Data points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. 
In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and requires the full expressiveness of the neural network to accurately classify it. Example code for Early Exit Both CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work. Note: the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits. Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively. Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. Actually implementing exits can be tricky and architecture dependent and we plan to address these issues. Example command lines We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is: python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \\ --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \\ --out-dir /home/ -n earlyexit /home/pcifar10 And an example command line for ImageNet is: python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \\ --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \\ -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/ Heuristics The insertion of the exits is ad hoc, but there are some heuristic principles guiding their placement and parameters. 
The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy. There are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient. Early Exit Hyper-Parameters There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit: --earlyexit_thresholds defines the thresholds for each of the early exits. The cross entropy measure must be less than the specified threshold to take a specific exit, otherwise the data continues along the regular path. For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits. --earlyexit_lossweights provides the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) is equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more aggressive early exits, but perhaps with a slight negative effect on accuracy. Output Stats The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. 
validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit. CIFAR10 In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself include a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers. ImageNet This supports training and inference of the ImageNet dataset via several well known deep architectures. ResNet-50 is the architecture of interest in this study; however, the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly. References Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy . Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1509.08971v6, 2017. Surat Teerapittayanon, Bradley McDanel, H. T. Kung . BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks , arXiv:1709.01686, 2017.","title":"Early Exit"},{"location":"algo_earlyexit.html#early-exit-inference","text":"While Deep Neural Networks benefit from a large number of layers, it's often the case that many data points in classification tasks can be classified accurately with much less work. There have been several studies recently regarding the idea of exiting before the normal endpoint of the neural network. Panda et al. in Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition point out that a lot of data points can be classified easily and require less processing than some more difficult points and they view this in terms of power savings. 
Teerapittayanon et al. in BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks look at a selective approach to exit placement and criteria for exiting early.","title":"Early Exit Inference"},{"location":"algo_earlyexit.html#why-does-early-exit-work","text":"Early Exit is a strategy with a straightforward and easy-to-understand concept. Figure #fig(boundaries) shows a simple example in a 2-D feature space. While deep networks can represent more complex and expressive boundaries between classes (assuming we\u2019re confident of avoiding over-fitting the data), it\u2019s also clear that much of the data can be properly classified with even the simplest of classification boundaries. Data points far from the boundary can be considered \"easy to classify\" and achieve a high degree of confidence quicker than do data points close to the boundary. In fact, we can think of the area between the outer straight lines as being the region that is \"difficult to classify\" and requires the full expressiveness of the neural network to accurately classify it.","title":"Why Does Early Exit Work?"},{"location":"algo_earlyexit.html#example-code-for-early-exit","text":"Both CIFAR10 and ImageNet code comes directly from publicly available examples from PyTorch. The only edits are the exits that are inserted in a methodology similar to BranchyNet work. Note: the sample code provided for ResNet models with Early Exits has exactly one early exit for the CIFAR10 example and exactly two early exits for the ImageNet example. If you want to modify the number of early exits, you will need to make sure that the model code is updated to have a corresponding number of exits. Deeper networks can benefit from multiple exits. Our examples illustrate both a single and a pair of early exits for CIFAR10 and ImageNet, respectively. Note that this code does not actually take exits. What it does is to compute statistics of loss and accuracy assuming exits were taken when criteria are met. 
Actually implementing exits can be tricky and architecture dependent and we plan to address these issues.","title":"Example code for Early Exit"},{"location":"algo_earlyexit.html#example-command-lines","text":"We have provided examples for ResNets of varying sizes for both CIFAR10 and ImageNet datasets. An example command line for training for CIFAR10 is: python compress_classifier.py --arch=resnet32_cifar_earlyexit --epochs=20 -b 128 \\ --lr=0.003 --earlyexit_thresholds 0.4 --earlyexit_lossweights 0.4 -j 30 \\ --out-dir /home/ -n earlyexit /home/pcifar10 And an example command line for ImageNet is: python compress_classifier.py --arch=resnet50_earlyexit --epochs=120 -b 128 \\ --lr=0.003 --earlyexit_thresholds 1.2 0.9 --earlyexit_lossweights 0.1 0.3 \\ -j 30 --out-dir /home/ -n earlyexit /home/I1K/i1k-extracted/","title":"Example command lines"},{"location":"algo_earlyexit.html#heuristics","text":"The insertion of the exits is ad hoc, but there are some heuristic principles guiding their placement and parameters. The earlier exits are placed, the more aggressive the exit as it essentially prunes the rest of the network at a very early stage, thus saving a lot of work. However, a diminishing percentage of data will be directed through the exit if we are to preserve accuracy. There are other benefits to adding exits in that training the modified network now has back-propagation losses coming from the exits that affect the earlier layers more substantially than the last exit. This effect mitigates problems such as vanishing gradient.","title":"Heuristics"},{"location":"algo_earlyexit.html#early-exit-hyper-parameters","text":"There are two parameters that are required to enable early exit. Leave them undefined if you are not enabling Early Exit: --earlyexit_thresholds defines the thresholds for each of the early exits. The cross entropy measure must be less than the specified threshold to take a specific exit, otherwise the data continues along the regular path. 
For example, you could specify \"--earlyexit_thresholds 0.9 1.2\" and this implies two early exits with corresponding thresholds of 0.9 and 1.2, respectively to take those exits. --earlyexit_lossweights provides the weights for the linear combination of losses during training to compute a single, overall loss. We only specify weights for the early exits and assume that the sum of the weights (including final exit) is equal to 1.0. So an example of \"--earlyexit_lossweights 0.2 0.3\" implies two early exits weighted with values of 0.2 and 0.3, respectively and that the final exit has a value of 1.0-(0.2+0.3) = 0.5. Studies have shown that weighting the early exits more heavily will create more aggressive early exits, but perhaps with a slight negative effect on accuracy.","title":"Early Exit Hyper-Parameters"},{"location":"algo_earlyexit.html#output-stats","text":"The example code outputs various statistics regarding the loss and accuracy at each of the exits. During training, the Top1 and Top5 stats represent the accuracy should all of the data be forced out that exit (in order to compute the loss at that exit). During inference (i.e. validation and test stages), the Top1 and Top5 stats represent the accuracy for those data points that could exit because the calculated entropy at that exit was lower than the specified threshold for that exit.","title":"Output Stats"},{"location":"algo_earlyexit.html#cifar10","text":"In the case of CIFAR10, we have inserted a single exit after the first full layer grouping. The layers on the exit path itself include a convolutional layer and a fully connected layer. If you move the exit, be sure to match the proper sizes for inputs and outputs to the exit layers.","title":"CIFAR10"},{"location":"algo_earlyexit.html#imagenet","text":"This supports training and inference of the ImageNet dataset via several well known deep architectures. 
ResNet-50 is the architecture of interest in this study; however, the exit is defined in the generic ResNet code and could be used with other size ResNets. There are two exits inserted in this example. Again, exit layers must have their sizes match properly.","title":"ImageNet"},{"location":"algo_earlyexit.html#references","text":"Priyadarshini Panda, Abhronil Sengupta, Kaushik Roy . Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1509.08971v6, 2017. Surat Teerapittayanon, Bradley McDanel, H. T. Kung . BranchyNet: Fast Inference via Early Exiting from Deep Neural Networks , arXiv:1709.01686, 2017.","title":"References"},{"location":"algo_pruning.html","text":"Weights Pruning Algorithms Magnitude Pruner This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor. Because the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family. \\[ thresh(w_i)=\\left\\lbrace \\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} } \\right\\rbrace \\] Sensitivity Pruner Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. The diagram below shows the distributions of the weights tensors of the first convolutional layer and the first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximate Gaussian distribution. 
The distributions of Alexnet conv1 and fc1 layers We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then, as a rough approximation for \\(s\\) close to 1, we are thresholding about \\(s * 68\\%\\) of the tensor elements. \\[ thresh(w_i)=\\left\\lbrace \\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} } \\right\\rbrace \\] \\[ \\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model \\] How do we choose this \\(s\\) multiplier? In Learning both Weights and Connections for Efficient Neural Networks the authors write: \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\" So the results of executing pruning sensitivity analysis on the tensor give us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to home in on the exact multiplier value. Method of Operation Start by running a pruning sensitivity analysis on the model. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution. Schedule In their paper Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. 
Distiller's SensitivityPruner works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements get pruned. This actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate. We use a simple iterative-pruning schedule such as: Prune every second epoch starting at epoch 0, and ending at epoch 38. This excerpt from alexnet.schedule_sensitivity.yaml shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML: pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 Level Pruner Class SparsityLevelParameterPruner uses a similar method to get around specifying specific thresholding magnitudes. Instead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criterion based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. This pruner is much more stable compared to SensitivityPruner because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's SensitivityPruner is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. 
Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires many more hyper-parameters (this is the reason we have not implemented it thus far). To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each layer. Method of Operation Sort the weights in the specified layer by their absolute values. Mask to zero the smallest magnitude weights until the desired sparsity level is reached. Splicing Pruner In Dynamic Network Surgery for Efficient DNNs Guo et al. propose that network pruning and splicing work in tandem. A SplicingPruner is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the PruningPolicy to mask weights only during the forward pass. Automated Gradual Pruner (AGP) In To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in AutomatedGradualPruner . \"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a final sparsity value \\(s_f\\) over a span of n pruning steps. The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\" You can play with the scheduling parameters in the agp_schedule.ipynb notebook . The authors describe AGP: Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity. 
Doesn't require much hyper-parameter tuning Shown to perform well across different models Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable. RNN Pruner The authors of Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers. Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm. Structure Pruners Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners remove entire \"structures\", such as kernels, filters, and even entire feature-maps. Structure Ranking Pruners Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below). In Pruning Filters for Efficient ConvNets the authors use filter ranking, with one-shot pruning followed by fine-tuning. The authors of Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation: First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. 
Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities) L1RankedStructureParameterPruner The L1RankedStructureParameterPruner pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the m lowest ranking structures are pruned away. This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude. The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm. Basically, you can think of mean(abs(t)) as a form of normalization of the structure L1-norm by the length of the structure. L1RankedStructureParameterPruner currently prunes weight filters, channels, and rows (for linear layers). ActivationAPoZRankedFilterPruner The ActivationAPoZRankedFilterPruner pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters. This method is called Network Trimming from the research paper: \"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\", Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016 https://arxiv.org/abs/1607.03250 GradientRankedFilterPruner The GradientRankedFilterPruner tries to assess the importance of weight filters using the product of their gradients and the filter value. RandomRankedFilterPruner For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking. 
The RandomRankedFilterPruner pruner can be used for this purpose. Automated Gradual Pruner (AGP) for Structures The idea of a mathematical formula controlling the sparsity level growth is very useful and StructuredAGP extends the implementation to structured pruning. Pruner Compositions Pruners can be combined to create new pruning schemes. Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions. For each of these, we use AGP to decide how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods: L1RankedStructureParameterPruner_AGP ActivationAPoZRankedFilterPruner_AGP GradientRankedFilterPruner_AGP RandomRankedFilterPruner_AGP Hybrid Pruning In a single schedule we can mix different pruning techniques. For example, we might mix pruning and regularization. Or structured pruning and element-wise pruning. We can even apply different methods on the same tensor. For example, we might want to perform filter pruning for a few epochs, then perform thinning and continue with element-wise pruning of the smaller network tensors. This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.","title":"Pruning"},{"location":"algo_pruning.html#weights-pruning-algorithms","text":"","title":"Weights Pruning Algorithms"},{"location":"algo_pruning.html#magnitude-pruner","text":"This is the most basic pruner: it applies a thresholding function, \\(thresh(.)\\), on each element, \\(w_i\\), of a weights tensor. A different threshold can be used for each layer's weights tensor. Because the threshold is applied on individual elements, this pruner belongs to the element-wise pruning algorithm family. 
\\[ thresh(w_i)=\\left\\lbrace \\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} } \\right\\rbrace \\]","title":"Magnitude Pruner"},{"location":"algo_pruning.html#sensitivity-pruner","text":"Finding a threshold magnitude per layer is daunting, especially since each layer's elements have different average absolute values. We can take advantage of the fact that the weights of convolutional and fully connected layers exhibit a Gaussian distribution with a mean value roughly zero, to avoid using a direct threshold based on the values of each specific tensor. The diagram below shows the distribution of the weights tensor of the first convolutional layer, and first fully-connected layer in TorchVision's pre-trained Alexnet model. You can see that they have an approximate Gaussian distribution. The distributions of Alexnet conv1 and fc1 layers We use the standard deviation of the weights tensor as a sort of normalizing factor between the different weights tensors. For example, if a tensor is Normally distributed, then about 68% of the elements have an absolute value less than the standard deviation (\\(\\sigma\\)) of the tensor. Thus, if we set the threshold to \\(s*\\sigma\\), then basically we are thresholding \\(s * 68\\%\\) of the tensor elements. \\[ thresh(w_i)=\\left\\lbrace \\matrix{{{w_i: \\; if \\;|w_i| \\; \\gt}\\;\\lambda}\\cr {0: \\; if \\; |w_i| \\leq \\lambda} } \\right\\rbrace \\] \\[ \\lambda = s * \\sigma_l \\;\\;\\; where\\; \\sigma_l\\; is \\;the \\;std \\;of \\;layer \\;l \\;as \\;measured \\;on \\;the \\;dense \\;model \\] How do we choose this \\(s\\) multiplier? In Learning both Weights and Connections for Efficient Neural Networks the authors write: \"We used the sensitivity results to find each layer\u2019s threshold: for example, the smallest threshold was applied to the most sensitive layer, which is the first convolutional layer... 
The pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer\u2019s weights\" So the results of executing pruning sensitivity analysis on the tensor give us a good starting guess at \\(s\\). Sensitivity analysis is an empirical method, and we still have to spend time to home in on the exact multiplier value.","title":"Sensitivity Pruner"},{"location":"algo_pruning.html#method-of-operation","text":"Start by running a pruning sensitivity analysis on the model. Then use the results to set and tune the threshold of each layer, but instead of using a direct threshold use a sensitivity parameter which is multiplied by the standard-deviation of the initial weight-tensor's distribution.","title":"Method of Operation"},{"location":"algo_pruning.html#schedule","text":"In their paper Song Han et al. use iterative pruning and change the value of the \\(s\\) multiplier at each pruning step. Distiller's SensitivityPruner works differently: the value \\(s\\) is set once based on a one-time calculation of the standard-deviation of the tensor (the first time we prune), and relies on the fact that as the tensor is pruned, more elements are \"pulled\" toward the center of the distribution and thus more elements get pruned. This actually works quite well as we can see in the diagram below. This is a TensorBoard screen-capture from Alexnet training, which shows how this method starts off pruning very aggressively, but then slowly reduces the pruning rate. We use a simple iterative-pruning schedule such as: Prune every second epoch starting at epoch 0, and ending at epoch 38. 
This excerpt from alexnet.schedule_sensitivity.yaml shows how this iterative schedule is conveyed in Distiller scheduling configuration YAML: pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2","title":"Schedule"},{"location":"algo_pruning.html#level-pruner","text":"Class SparsityLevelParameterPruner uses a similar method to go around specifying specific thresholding magnitudes. Instead of specifying a threshold magnitude, you specify a target sparsity level (expressed as a fraction, so 0.5 means 50% sparsity). Essentially this pruner also uses a pruning criterion based on the magnitude of each tensor element, but it has the advantage that you can aim for an exact and specific sparsity level. This pruner is much more stable compared to SensitivityPruner because the target sparsity level is not coupled to the actual magnitudes of the elements. Distiller's SensitivityPruner is unstable because the final sparsity level depends on the convergence pattern of the tensor distribution. Song Han's methodology of using several different values for the multiplier \\(s\\), and the recalculation of the standard-deviation at each pruning phase, probably gives it stability, but requires many more hyper-parameters (this is the reason we have not implemented it thus far). To set the target sparsity levels, you can once again use pruning sensitivity analysis to make better guesses at the correct sparsity level of each","title":"Level Pruner"},{"location":"algo_pruning.html#method-of-operation_1","text":"Sort the weights in the specified layer by their absolute values. 
Mask to zero the smallest magnitude weights until the desired sparsity level is reached.","title":"Method of Operation"},{"location":"algo_pruning.html#splicing-pruner","text":"In Dynamic Network Surgery for Efficient DNNs Guo et al. propose that network pruning and splicing work in tandem. A SplicingPruner is a pruner that both prunes and splices connections and works best with a Dynamic Network Surgery schedule, which, for example, configures the PruningPolicy to mask weights only during the forward pass.","title":"Splicing Pruner"},{"location":"algo_pruning.html#automated-gradual-pruner-agp","text":"In To prune, or not to prune: exploring the efficacy of pruning for model compression , authors Michael Zhu and Suyog Gupta provide an algorithm to schedule a Level Pruner which Distiller implements in AutomatedGradualPruner . \"We introduce a new automated gradual pruning algorithm in which the sparsity is increased from an initial sparsity value \\(s_i\\) (usually 0) to a final sparsity value \\(s_f\\) over a span of n pruning steps. The intuition behind this sparsity function in equation (1) is to prune the network rapidly in the initial phase when the redundant connections are abundant and gradually reduce the number of weights being pruned each time as there are fewer and fewer weights remaining in the network.\" You can play with the scheduling parameters in the agp_schedule.ipynb notebook . The authors describe AGP: Our automated gradual pruning algorithm prunes the smallest magnitude weights to achieve a preset level of network sparsity. 
Doesn't require much hyper-parameter tuning Shown to perform well across different models Does not make any assumptions about the structure of the network or its constituent layers, and is therefore more generally applicable.","title":"Automated Gradual Pruner (AGP)"},{"location":"algo_pruning.html#rnn-pruner","text":"The authors of Exploring Sparsity in Recurrent Neural Networks , Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta, \"propose a technique to reduce the parameters of a network by pruning weights during the initial training of the network.\" They use a gradual pruning schedule which is reminiscent of the schedule used in AGP, for element-wise pruning of RNNs, which they also employ during training. They show pruning of RNN, GRU, LSTM and embedding layers. Distiller's distiller.pruning.BaiduRNNPruner class implements this pruning algorithm.","title":"RNN Pruner"},{"location":"algo_pruning.html#structure-pruners","text":"Element-wise pruning can create very sparse models which can be compressed to consume less memory footprint and bandwidth, but without specialized hardware that can compute using the sparse representation of the tensors, we don't gain any speedup of the computation. Structure pruners remove entire \"structures\", such as kernels, filters, and even entire feature-maps.","title":"Structure Pruners"},{"location":"algo_pruning.html#structure-ranking-pruners","text":"Ranking pruners use some criterion to rank the structures in a tensor, and then prune the tensor to a specified level. In principle, these pruners perform one-shot pruning, but can be combined with automatic pruning-level scheduling, such as AGP (see below). In Pruning Filters for Efficient ConvNets the authors use filter ranking, with one-shot pruning followed by fine-tuning. 
The authors of Exploiting Sparseness in Deep Neural Networks for Large Vocabulary Speech Recognition also use a one-shot pruning schedule, for fully-connected layers, and they provide an explanation: First, after sweeping through the full training set several times the weights become relatively stable \u2014 they tend to remain either large or small magnitudes. Second, in a stabilized model, the importance of the connection is approximated well by the magnitudes of the weights (times the magnitudes of the corresponding input values, but these are relatively uniform within each layer since on the input layer, features are normalized to zero-mean and unit-variance, and hidden-layer values are probabilities)","title":"Structure Ranking Pruners"},{"location":"algo_pruning.html#l1rankedstructureparameterpruner","text":"The L1RankedStructureParameterPruner pruner calculates the magnitude of some \"structure\", orders all of the structures based on some magnitude function and the m lowest ranking structures are pruned away. This pruner performs ranking of structures using the mean of the absolute value of the structure as the representative of the structure magnitude. The absolute mean does not depend on the size of the structure, so it is easier to use compared to just using the \\(L_1\\)-norm of the structure, and at the same time it is a good proxy of the \\(L_1\\)-norm. Basically, you can think of mean(abs(t)) as a form of normalization of the structure L1-norm by the length of the structure. L1RankedStructureParameterPruner currently prunes weight filters, channels, and rows (for linear layers).","title":"L1RankedStructureParameterPruner"},{"location":"algo_pruning.html#activationapozrankedfilterpruner","text":"The ActivationAPoZRankedFilterPruner pruner uses the activation channels mean APoZ (average percentage of zeros) to rank weight filters and prune a specified percentage of filters. 
This method is called Network Trimming from the research paper: \"Network Trimming: A Data-Driven Neuron Pruning Approach towards Efficient Deep Architectures\", Hengyuan Hu, Rui Peng, Yu-Wing Tai, Chi-Keung Tang, ICLR 2016 https://arxiv.org/abs/1607.03250","title":"ActivationAPoZRankedFilterPruner"},{"location":"algo_pruning.html#gradientrankedfilterpruner","text":"The GradientRankedFilterPruner tries to assess the importance of weight filters using the product of their gradients and the filter value.","title":"GradientRankedFilterPruner"},{"location":"algo_pruning.html#randomrankedfilterpruner","text":"For research purposes we may want to compare the results of some structure-ranking pruner to a random structure-ranking. The RandomRankedFilterPruner pruner can be used for this purpose.","title":"RandomRankedFilterPruner"},{"location":"algo_pruning.html#automated-gradual-pruner-agp-for-structures","text":"The idea of a mathematical formula controlling the sparsity level growth is very useful and StructuredAGP extends the implementation to structured pruning.","title":"Automated Gradual Pruner (AGP) for Structures"},{"location":"algo_pruning.html#pruner-compositions","text":"Pruners can be combined to create new pruning schemes. Specifically, with a few lines of code we currently marry the AGP sparsity level scheduler with our filter-ranking classes to create pruner compositions. For each of these, we use AGP to decide how many filters to prune at each step, and we choose the filters to remove using one of the filter-ranking methods: L1RankedStructureParameterPruner_AGP ActivationAPoZRankedFilterPruner_AGP GradientRankedFilterPruner_AGP RandomRankedFilterPruner_AGP","title":"Pruner Compositions"},{"location":"algo_pruning.html#hybrid-pruning","text":"In a single schedule we can mix different pruning techniques. For example, we might mix pruning and regularization. Or structured pruning and element-wise pruning. 
We can even apply different methods on the same tensor. For example, we might want to perform filter pruning for a few epochs, then perform thinning and continue with element-wise pruning of the smaller network tensors. This technique of mixing different methods we call Hybrid Pruning, and Distiller has a few example schedules.","title":"Hybrid Pruning"},{"location":"algo_quantization.html","text":"Quantization Algorithms Note: For any of the methods below that require quantization-aware training, please see here for details on how to invoke it using Distiller's scheduling mechanism. Range-Based Linear Quantization Let's break down the terminology we use here: Linear: Means a float value is quantized by multiplying with a numeric constant (the scale factor ). Range-Based: Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call clipping-based , as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value). Asymmetric vs. Symmetric In this method we can use two modes - asymmetric and symmetric . Asymmetric Mode In asymmetric mode, we map the min/max in the float range to the min/max of the integer range. This is done by using a zero-point (also called quantization bias , or offset ) in addition to the scale factor. Let us denote the original floating-point tensor by x_f , the quantized tensor by x_q , the scale factor by q_x , the zero-point by zp_x and the number of bits used for quantization by n . 
Then, we get: x_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x) In practice, we actually use zp_x = round(min_{x_f}q_x) . This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero. Note that in the derivation above we use unsigned integer to represent the quantized range. That is, x_q \\in [0, 2^n-1] . One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting 2^{n-1} . Let's see how a convolution or fully-connected (FC) layer is quantized in asymmetric mode: (we denote input, output, weights and bias with x, y, w and b respectively) y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} = = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right) Therefore: y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right) Notes: We can see that the bias has to be re-scaled to match the scale of the summation. In a proper integer-only HW pipeline, we would like our main accumulation term to simply be \\sum{x_q w_q} . In order to achieve this, one needs to further develop the expression we derived above. For further details please refer to the gemmlowp documentation Symmetric Mode In symmetric mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. 
So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range. There's a nuance in the symmetric case with regards to the quantized range. Assuming N_{bins}=2^n-1 , we can use either a \"full\" or \"restricted\" quantized range: Full Range Restricted Range Quantized Range \\left[-\\frac{N_{bins}}{2}, \\frac{N_{bins}}{2} - 1\\right] \\left[-\\left(\\frac{N_{bins}}{2} - 1\\right), \\frac{N_{bins}}{2} - 1\\right] 8-bit example [-128, 127] (As shown in image above) [-127,127] Scale Factor q_x = \\frac{(2^n-1)/2}{\\max(abs(x_f))} q_x = \\frac{2^{n-1}-1}{\\max(abs(x_f))} The restricted range is less accurate on-paper, and is usually used when specific HW considerations require it. Implementations of quantization \"in the wild\" that use a full range include PyTorch's native quantization (from v1.3 onwards) and ONNX. Implementations that use a restricted range include TensorFlow, NVIDIA TensorRT and Intel DNNL (aka MKL-DNN). Distiller can emulate both modes. Using the same notations as above, we get (regardless of full/restricted range): x_q = round(q_x x_f) Again, let's see how a convolution or fully-connected (FC) layer is quantized, this time in symmetric mode: y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) Therefore: y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right) Comparing the Two Modes The main trade-off between these two modes is simplicity vs. utilization of the quantized range. When using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. 
Using symmetric mode, if the float range is biased towards one side, could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is positive. Quantizing it in symmetric mode means we're effectively losing 1 bit. On the other hand, if we look at the derivations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation. Other Features Scale factor scope: For weight tensors, Distiller supports per-channel quantization (per output channel). Removing outliers (post-training only): As discussed here , in some cases the float range of activations contains outliers. Spending dynamic range on these outliers hurts our ability to represent the values we actually care about accurately. Currently, Distiller supports clipping of activations during post-training quantization using the following methods: Averaging: Global min/max values are replaced with an average of the min/max values of each sample in the batch. Mean +/- N*Std: Take N standard deviations for the tensor's mean, and in any case don't exceed the tensor's actual min/max. N is user configurable. ACIQ - Analytical calculation of clipping values assuming either a Gaussian or Laplace distribution. As proposed in Post training 4-bit quantization of convolutional networks for rapid-deployment . Scale factor approximation (post-training only): This can be enabled optionally, to simulate an execution pipeline with no floating-point operations. 
Instead of multiplying with a floating-point scale factor, we multiply with an integer and then do a bit-wise shift: Q \\approx {A}/{2^n} , where Q denotes the FP32 scale factor, A denotes the integer multiplier and n denotes the number of bits by which we shift after multiplication. The number of bits assigned to A is usually a parameter of the HW, and in Distiller it is configured by the user. Let us denote that with m . Given Q and m , we determine A and n as follows: Q \\approx \\frac{A}{2^n} \\Rightarrow A \\approx 2^nQ \\Rightarrow \\Rightarrow 2^nQ \\le 2^m - 1 \\Rightarrow \\Rightarrow n = \\left\\lfloor\\log_2\\frac{2^m - 1}{Q}\\right\\rfloor\\ \\ \\ ;\\ \\ \\ A = \\lfloor 2^nQ \\rfloor Implementation in Distiller Post-Training For post-training quantization, this method is implemented by wrapping existing modules with quantization and de-quantization operations. The wrapper implementations are in range_linear.py . The following operations have dedicated implementations which consider quantization: torch.nn.Conv2d/Conv3d torch.nn.Linear torch.nn.Embedding distiller.modules.Concat distiller.modules.EltwiseAdd distiller.modules.EltwiseMult distiller.modules.Matmul distiller.modules.BatchMatmul Any existing module will likely need to be modified to use the distiller.modules.* modules. See here for details on how to prepare a model for quantization. To automatically transform an existing model to a quantized model using this method, use the PostTrainLinearQuantizer class. For details on ways to invoke the quantizer see here . When using PostTrainLinearQuantizer , by default, any operation not in the list above is \"fake\"-quantized, meaning it is executed in FP32 and its output is quantized. Quantization for specific layers (or groups of layers) can be disabled using Distiller's override mechanism (see example here ). For weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\" / \"static\"). 
For activations, both \"static\" and \"dynamic\" quantization is supported. Static quantization of activations requires that statistics be collected beforehand. See details on how to do that here . The calculated quantization parameters are stored as buffers within the module, so they are automatically serialized when the model checkpoint is saved. Quantization-Aware Training To apply range-based linear quantization in training, use the QuantAwareTrainRangeLinearQuantizer class. As it is now, it will apply weights quantization to convolution, FC and embedding modules. For activations quantization, it will insert instances of the FakeLinearQuantization module after ReLUs. This module follows the methodology described in Jacob et al., 2018 and uses exponential moving averages to track activation ranges. Note that the current implementation of QuantAwareTrainRangeLinearQuantizer supports training with single GPU only . Similarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created. Note that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required. DoReFa (As proposed in DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients ) In this method, we first define the quantization function quantize_k , which takes a real value a_f \\in [0, 1] and outputs a discrete-valued a_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\} , where k is the number of bits used for quantization. 
a_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right) Activations are clipped to the [0, 1] range and then quantized as follows: x_q = quantize_k(x_f) For weights, we define the following function f , which takes an unbounded real valued input and outputs a real value in [0, 1] : f(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} Now we can use quantize_k to get quantized weight values, as follows: w_q = 2 quantize_k \\left( f(w_f) \\right) - 1 This method requires training the model with quantization-aware training, as discussed here . Use the DorefaQuantizer class to transform an existing model to a model suitable for training with quantization using DoReFa. Notes Gradients quantization as proposed in the paper is not supported yet. The paper defines special handling for binary weights which isn't supported in Distiller yet. PACT (As proposed in PACT: Parameterized Clipping Activation for Quantized Neural Networks ) This method is similar to DoReFa, but the upper clipping values, \\alpha , of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation, \\alpha is shared per layer. This method requires training the model with quantization-aware training, as discussed here . Use the PACTQuantizer class to transform an existing model to a model suitable for training with quantization using PACT. WRPN (As proposed in WRPN: Wide Reduced-Precision Networks ) In this method, activations are clipped to [0, 1] and quantized as follows ( k is the number of bits used for quantization): x_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right) Weights are clipped to [-1, 1] and quantized as follows: w_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right) Note that k-1 bits are used to quantize weights, leaving one bit for sign. This method requires training the model with quantization-aware training, as discussed here . 
Use the WRPNQuantizer class to transform an existing model to a model suitable for training with quantization using WRPN. Notes The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of WRPNQuantizer at the moment. To experiment with this, modify your model implementation to have wider layers. The paper defines special handling for binary weights which isn't supported in Distiller yet.","title":"Quantization"},{"location":"algo_quantization.html#quantization-algorithms","text":"Note: For any of the methods below that require quantization-aware training, please see here for details on how to invoke it using Distiller's scheduling mechanism.","title":"Quantization Algorithms"},{"location":"algo_quantization.html#range-based-linear-quantization","text":"Let's break down the terminology we use here: Linear: Means a float value is quantized by multiplying with a numeric constant (the scale factor ). Range-Based: Means that in order to calculate the scale factor, we look at the actual range of the tensor's values. In the most naive implementation, we use the actual min/max values of the tensor. Alternatively, we use some derivation based on the tensor's range / distribution to come up with a narrower min/max range, in order to remove possible outliers. This is in contrast to the other methods described here, which we could call clipping-based , as they impose an explicit clipping function on the tensors (using either a hard-coded value or a learned value).","title":"Range-Based Linear Quantization"},{"location":"algo_quantization.html#asymmetric-vs-symmetric","text":"In this method we can use two modes - asymmetric and symmetric .","title":"Asymmetric vs. Symmetric"},{"location":"algo_quantization.html#asymmetric-mode","text":"In asymmetric mode, we map the min/max in the float range to the min/max of the integer range. 
This is done by using a zero-point (also called quantization bias , or offset ) in addition to the scale factor. Let us denote the original floating-point tensor by x_f , the quantized tensor by x_q , the scale factor by q_x , the zero-point by zp_x and the number of bits used for quantization by n . Then, we get: x_q = round\\left ((x_f - min_{x_f})\\underbrace{\\frac{2^n - 1}{max_{x_f} - min_{x_f}}}_{q_x} \\right) = round(q_x x_f - \\underbrace{min_{x_f}q_x)}_{zp_x} = round(q_x x_f - zp_x) In practice, we actually use zp_x = round(min_{x_f}q_x) . This means that zero is exactly representable by an integer in the quantized range. This is important, for example, for layers that have zero-padding. By rounding the zero-point, we effectively \"nudge\" the min/max values in the float range a little bit, in order to gain this exact quantization of zero. Note that in the derivation above we use unsigned integer to represent the quantized range. That is, x_q \\in [0, 2^n-1] . One could use signed integer if necessary (perhaps due to HW considerations). This can be achieved by subtracting 2^{n-1} . Let's see how a convolution or fully-connected (FC) layer is quantized in asymmetric mode: (we denote input, output, weights and bias with x, y, w and b respectively) y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q + zp_x}{q_x} \\frac{w_q + zp_w}{q_w}} + \\frac{b_q + zp_b}{q_b} = = \\frac{1}{q_x q_w} \\left( \\sum { (x_q + zp_x) (w_q + zp_w) + \\frac{q_x q_w}{q_b}(b_q + zp_b) } \\right) Therefore: y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { (x_q+zp_x) (w_q+zp_w) + \\frac{q_x q_w}{q_b}(b_q+zp_b) } \\right) \\right) Notes: We can see that the bias has to be re-scaled to match the scale of the summation. In a proper integer-only HW pipeline, we would like our main accumulation term to simply be \\sum{x_q w_q} . In order to achieve this, one needs to further develop the expression we derived above. 
For further details please refer to the gemmlowp documentation","title":"Asymmetric Mode"},{"location":"algo_quantization.html#symmetric-mode","text":"In symmetric mode, instead of mapping the exact min/max of the float range to the quantized range, we choose the maximum absolute value between min/max. In addition, we don't use a zero-point. So, the floating-point range we're effectively quantizing is symmetric with respect to zero, and so is the quantized range. There's a nuance in the symmetric case with regards to the quantized range. Assuming N_{bins}=2^n-1 , we can use either a \"full\" or \"restricted\" quantized range: Full Range Restricted Range Quantized Range \\left[-\\frac{N_{bins}}{2}, \\frac{N_{bins}}{2} - 1\\right] \\left[-\\left(\\frac{N_{bins}}{2} - 1\\right), \\frac{N_{bins}}{2} - 1\\right] 8-bit example [-128, 127] (As shown in image above) [-127,127] Scale Factor q_x = \\frac{(2^n-1)/2}{\\max(abs(x_f))} q_x = \\frac{2^{n-1}-1}{\\max(abs(x_f))} The restricted range is less accurate on-paper, and is usually used when specific HW considerations require it. Implementations of quantization \"in the wild\" that use a full range include PyTorch's native quantization (from v1.3 onwards) and ONNX. Implementations that use a restricted range include TensorFlow, NVIDIA TensorRT and Intel DNNL (aka MKL-DNN). Distiller can emulate both modes. 
Using the same notations as above, we get (regardless of full/restricted range): x_q = round(q_x x_f) Again, let's see how a convolution or fully-connected (FC) layer is quantized, this time in symmetric mode: y_f = \\sum{x_f w_f} + b_f = \\sum{\\frac{x_q}{q_x} \\frac{w_q}{q_w}} + \\frac{b_q}{q_b} = \\frac{1}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) Therefore: y_q = round(q_y y_f) = round\\left(\\frac{q_y}{q_x q_w} \\left( \\sum { x_q w_q + \\frac{q_x q_w}{q_b}b_q } \\right) \\right)","title":"Symmetric Mode"},{"location":"algo_quantization.html#comparing-the-two-modes","text":"The main trade-off between these two modes is simplicity vs. utilization of the quantized range. When using asymmetric quantization, the quantized range is fully utilized. That is because we exactly map the min/max values from the float range to the min/max of the quantized range. Using symmetric mode when the float range is biased towards one side could result in a quantized range where significant dynamic range is dedicated to values that we'll never see. The most extreme example of this is after ReLU, where the entire tensor is non-negative. Quantizing it in symmetric mode means we're effectively losing 1 bit. On the other hand, if we look at the derivations for convolution / FC layers above, we can see that the actual implementation of symmetric mode is much simpler. In asymmetric mode, the zero-points require additional logic in HW. The cost of this extra logic in terms of latency and/or power and/or area will of course depend on the exact implementation.","title":"Comparing the Two Modes"},{"location":"algo_quantization.html#other-features","text":"Scale factor scope: For weight tensors, Distiller supports per-channel quantization (per output channel). Removing outliers (post-training only): As discussed here , in some cases the float range of activations contains outliers.
Spending dynamic range on these outliers hurts our ability to represent the values we actually care about accurately. Currently, Distiller supports clipping of activations during post-training quantization using the following methods: Averaging: Global min/max values are replaced with an average of the min/max values of each sample in the batch. Mean +/- N*Std: Take N standard deviations from the tensor's mean, and in any case don't exceed the tensor's actual min/max. N is user configurable. ACIQ: Analytical calculation of clipping values assuming either a Gaussian or Laplace distribution. As proposed in Post training 4-bit quantization of convolutional networks for rapid-deployment . Scale factor approximation (post-training only): This can be enabled optionally, to simulate an execution pipeline with no floating-point operations. Instead of multiplying with a floating-point scale factor, we multiply with an integer and then do a bit-wise shift: Q \\approx {A}/{2^n} , where Q denotes the FP32 scale factor, A denotes the integer multiplier and n denotes the number of bits by which we shift after multiplication. The number of bits assigned to A is usually a parameter of the HW, and in Distiller it is configured by the user. Let us denote that with m . Given Q and m , we determine A and n as follows: Q \\approx \\frac{A}{2^n} \\Rightarrow A \\approx 2^nQ \\Rightarrow 2^nQ \\le 2^m - 1 \\Rightarrow n = \\left\\lfloor\\log_2\\frac{2^m - 1}{Q}\\right\\rfloor\\ \\ \\ ;\\ \\ \\ A = \\lfloor 2^nQ \\rfloor","title":"Other Features"},{"location":"algo_quantization.html#implementation-in-distiller","text":"","title":"Implementation in Distiller"},{"location":"algo_quantization.html#post-training","text":"For post-training quantization, this method is implemented by wrapping existing modules with quantization and de-quantization operations. The wrapper implementations are in range_linear.py .
The following operations have dedicated implementations which consider quantization: torch.nn.Conv2d/Conv3d torch.nn.Linear torch.nn.Embedding distiller.modules.Concat distiller.modules.EltwiseAdd distiller.modules.EltwiseMult distiller.modules.Matmul distiller.modules.BatchMatmul Any existing module will likely need to be modified to use the distiller.modules.* modules. See here for details on how to prepare a model for quantization. To automatically transform an existing model to a quantized model using this method, use the PostTrainLinearQuantizer class. For details on ways to invoke the quantizer see here . When using PostTrainLinearQuantizer , by default, any operation not in the list above is \"fake\"-quantized, meaning it is executed in FP32 and its output is quantized. Quantization for specific layers (or groups of layers) can be disabled using Distiller's override mechanism (see example here ). For weights and bias the scale factor and zero-point are determined once at quantization setup (\"offline\" / \"static\"). For activations, both \"static\" and \"dynamic\" quantization are supported. Static quantization of activations requires that statistics be collected beforehand. See details on how to do that here . The calculated quantization parameters are stored as buffers within the module, so they are automatically serialized when the model checkpoint is saved.","title":"Post-Training"},{"location":"algo_quantization.html#quantization-aware-training","text":"To apply range-based linear quantization in training, use the QuantAwareTrainRangeLinearQuantizer class. Currently, it applies weights quantization to convolution, FC and embedding modules. For activations quantization, it inserts instances of the FakeLinearQuantization module after ReLUs. This module follows the methodology described in Benoit et al., 2018 and uses exponential moving averages to track activation ranges.
Note that the current implementation of QuantAwareTrainRangeLinearQuantizer supports training with single GPU only . Similarly to post-training, the calculated quantization parameters (scale factors, zero-points, tracked activation ranges) are stored as buffers within their respective modules, so they're saved when a checkpoint is created. Note that converting from a quantization-aware training model to a post-training quantization model is not yet supported. Such a conversion will use the activation ranges tracked during training, so additional offline or online calculation of quantization parameters will not be required.","title":"Quantization-Aware Training"},{"location":"algo_quantization.html#dorefa","text":"(As proposed in DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients ) In this method, we first define the quantization function quantize_k , which takes a real value a_f \\in [0, 1] and outputs a discrete-valued a_q \\in \\left\\{ \\frac{0}{2^k-1}, \\frac{1}{2^k-1}, ... , \\frac{2^k-1}{2^k-1} \\right\\} , where k is the number of bits used for quantization. a_q = quantize_k(a_f) = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) a_f \\right) Activations are clipped to the [0, 1] range and then quantized as follows: x_q = quantize_k(x_f) For weights, we define the following function f , which takes an unbounded real valued input and outputs a real value in [0, 1] : f(w) = \\frac{tanh(w)}{2 max(|tanh(w)|)} + \\frac{1}{2} Now we can use quantize_k to get quantized weight values, as follows: w_q = 2 quantize_k \\left( f(w_f) \\right) - 1 This method requires training the model with quantization-aware training, as discussed here . Use the DorefaQuantizer class to transform an existing model to a model suitable for training with quantization using DoReFa.","title":"DoReFa"},{"location":"algo_quantization.html#notes","text":"Gradients quantization as proposed in the paper is not supported yet. 
The paper defines special handling for binary weights which isn't supported in Distiller yet.","title":"Notes"},{"location":"algo_quantization.html#pact","text":"(As proposed in PACT: Parameterized Clipping Activation for Quantized Neural Networks ) This method is similar to DoReFa, but the upper clipping values, \\alpha , of the activation functions are learned parameters instead of hard coded to 1. Note that per the paper's recommendation, \\alpha is shared per layer. This method requires training the model with quantization-aware training, as discussed here . Use the PACTQuantizer class to transform an existing model to a model suitable for training with quantization using PACT.","title":"PACT"},{"location":"algo_quantization.html#wrpn","text":"(As proposed in WRPN: Wide Reduced-Precision Networks ) In this method, activations are clipped to [0, 1] and quantized as follows ( k is the number of bits used for quantization): x_q = \\frac{1}{2^k-1} round \\left( \\left(2^k - 1 \\right) x_f \\right) Weights are clipped to [-1, 1] and quantized as follows: w_q = \\frac{1}{2^{k-1}-1} round \\left( \\left(2^{k-1} - 1 \\right)w_f \\right) Note that k-1 bits are used to quantize weights, leaving one bit for sign. This method requires training the model with quantization-aware training, as discussed here . Use the WRPNQuantizer class to transform an existing model to a model suitable for training with quantization using WRPN.","title":"WRPN"},{"location":"algo_quantization.html#notes_1","text":"The paper proposed widening of layers as a means to reduce accuracy loss. This isn't implemented as part of WRPNQuantizer at the moment. To experiment with this, modify your model implementation to have wider layers. 
The paper defines special handling for binary weights which isn't supported in Distiller yet.","title":"Notes"},{"location":"conditional_computation.html","text":"Conditional Computation Conditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced. To quote Bengio et al. \"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\" As usual, there are several approaches to implement Conditional Computation: Sun et al. use several expert CNNs, each trained on a different task, and combine them into one large network. Zheng et al. use cascading, an idea which may be familiar to you from Viola-Jones face detection. Theodorakopoulos et al. add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module). Ioannou et al. introduce Conditional Networks, which \"can be thought of as: i) decision trees augmented with data transformation operators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\" Bolukbasi et al. \"learn a system to adaptively choose the components of a deep network to be evaluated for each example.
By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\" Conditional Computation is especially useful for real-time, latency-sensitive applications. In Distiller we have currently implemented a variant of Early Exit. References Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup. Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1511.06297v2, 2016. Y. Sun, X. Wang, and X. Tang. Deep Convolutional Network Cascade for Facial Point Detection . In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014. X. Zheng, W. Ouyang, and X. Wang. Multi-Stage Contextual Deep Learning for Pedestrian Detection. In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014. I. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis. Parsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules. Irida Labs S.A, January 2017. Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama. Adaptive Neural Networks for Efficient Inference . Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017. Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi . Decision Forests, Convolutional Networks and the Models in-Between , arXiv:1511.06297v2, 2016.","title":"Conditional Computation"},{"location":"conditional_computation.html#conditional-computation","text":"Conditional Computation refers to a class of algorithms in which each input sample uses a different part of the model, such that on average the compute, latency or power (depending on our objective) is reduced. To quote Bengio et al.
\"Conditional computation refers to activating only some of the units in a network, in an input-dependent fashion. For example, if we think we\u2019re looking at a car, we only need to compute the activations of the vehicle detecting units, not of all features that a network could possible compute. The immediate effect of activating fewer units is that propagating information through the network will be faster, both at training as well as at test time. However, one needs to be able to decide in an intelligent fashion which units to turn on and off, depending on the input data. This is typically achieved with some form of gating structure, learned in parallel with the original network.\" As usual, there are several approaches to implement Conditional Computation: Sun et al. use several expert CNNs, each trained on a different task, and combine them into one large network. Zheng et al. use cascading, an idea which may be familiar to you from Viola-Jones face detection. Theodorakopoulos et al. add small layers that learn which filters to use per input sample, and then enforce that during inference (LKAM module). Ioannou et al. introduce Conditional Networks, which \"can be thought of as: i) decision trees augmented with data transformation operators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions\" Bolukbasi et al. \"learn a system to adaptively choose the components of a deep network to be evaluated for each example. By allowing examples correctly classified using early layers of the system to exit, we avoid the computational time associated with full evaluation of the network. We extend this to learn a network selection system that adaptively selects the network to be evaluated for each example.\" Conditional Computation is especially useful for real-time, latency-sensitive applications.
In Distiller we have currently implemented a variant of Early Exit.","title":"Conditional Computation"},{"location":"conditional_computation.html#references","text":"Emmanuel Bengio, Pierre-Luc Bacon, Joelle Pineau, Doina Precup. Conditional Deep Learning for Energy-Efficient and Enhanced Pattern Recognition , arXiv:1511.06297v2, 2016. Y. Sun, X. Wang, and X. Tang. Deep Convolutional Network Cascade for Facial Point Detection . In Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2014. X. Zheng, W. Ouyang, and X. Wang. Multi-Stage Contextual Deep Learning for Pedestrian Detection. In Proc. IEEE Intl Conf. on Computer Vision (ICCV), 2014. I. Theodorakopoulos, V. Pothos, D. Kastaniotis and N. Fragoulis. Parsimonious Inference on Convolutional Neural Networks: Learning and applying on-line kernel activation rules. Irida Labs S.A, January 2017. Tolga Bolukbasi, Joseph Wang, Ofer Dekel, Venkatesh Saligrama. Adaptive Neural Networks for Efficient Inference . Proceedings of the 34th International Conference on Machine Learning, PMLR 70:527-536, 2017. Yani Ioannou, Duncan Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, Antonio Criminisi . Decision Forests, Convolutional Networks and the Models in-Between , arXiv:1511.06297v2, 2016.","title":"References"},{"location":"design.html","text":"Distiller design Distiller is designed to be easily integrated into your own PyTorch research applications. It is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ). The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand.
Integrating compression is simple: add invocations of the appropriate compression_scheduler callbacks for each stage in the training. The training skeleton looks like the pseudocode below. The boilerplate PyTorch classification training is speckled with invocations of CompressionScheduler. For each epoch: compression_scheduler.on_epoch_begin(epoch) train() validate() save_checkpoint() compression_scheduler.on_epoch_end(epoch) train(): For each training step: compression_scheduler.on_minibatch_begin(epoch) output = model(input_var) loss = criterion(output, target_var) compression_scheduler.before_backward_pass(epoch) loss.backward() optimizer.step() compression_scheduler.on_minibatch_end(epoch) These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's Scheduler , which invokes the correct algorithm. The application also uses Distiller services to collect statistics in Summaries and log files, which can be queried at a later time, from Jupyter notebooks or TensorBoard. Sparsification and fine-tuning The application sets up a model as normally done in PyTorch. It then instantiates a Scheduler and configures it: Scheduler configuration is defined in a YAML file. The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training. Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\". Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ). The parameters of each algorithm are also specified in the configuration. In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency. The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass.
Each scheduler callback activates the policies that were defined, according to their schedule. These callbacks are placed in the training loop. Quantization A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary. In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided. We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the Quantizer class. Quantizer should be sub-classed for each quantization method. Model Transformation The high-level flow is as follows: Define a mapping from the module types to be replaced (e.g. Conv2d, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the replacement_factory attribute of the Quantizer class. Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it. Replace the existing module with the module returned by the function. It is important to note that the name of the module does not change, as that could break the forward function of the parent module. Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it. Hence, for each quantization method, a different mapping will likely be defined.
Each sub-class of Quantizer should populate the replacement_factory dictionary attribute with the appropriate mapping. To execute the model transformation, call the prepare_model function of the Quantizer instance. Flexible Bit-Widths Each instance of Quantizer is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the bits_activations , bits_weights and bits_bias parameters in Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed. We also want to be able to override the default number of bits mentioned in the bullet above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern. So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using their exact name, or a group of layers via a regular expression. This mapping is passed via the overrides parameter in the constructor. The overrides mapping is required to be an instance of collections.OrderedDict (as opposed to just a simple Python dict ). This is done in order to enable handling of overlapping name patterns. So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'. The patterns are evaluated eagerly - the first match wins. Therefore, the more specific patterns must come before the broad patterns. Weights Quantization The Quantizer class also provides an API to quantize the weights of all layers at once.
To use it, the param_quantization_fn attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the Quantizer class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the quantize_params function can be called, which will iterate over all parameters and quantize them using param_quantization_fn . Quantization-Aware Training The Quantizer class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios: Maintaining a full precision copy of the weights, as described here . This is enabled by setting train_with_fp_copy=True in the Quantizer constructor. At model transformation, in each module that has parameters that should be quantized, a new torch.nn.Parameter is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module is not created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to do this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\": The existing torch.nn.Parameter , e.g. weight , is replaced by a torch.nn.Parameter named float_weight . To maintain the existing functionality of the module, we then register a buffer in the module with the original name - weight . During training, float_weight will be passed to param_quantization_fn and the result will be stored in weight . In addition, some quantization methods may introduce additional learned parameters to the model. For example, in the PACT method, activations are clipped to a value \\alpha , which is a learned parameter per-layer. To support these two cases, the Quantizer class also accepts an instance of a torch.optim.Optimizer (normally this would be an instance of one of its sub-classes).
The quantizer will take care of modifying the optimizer according to the changes made to the parameters. Optimizing New Parameters In cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the specific method should override Quantizer._get_updated_optimizer_params_groups() , and return the proper groups plus any desired hyper-parameter overrides. Examples The base Quantizer class is implemented in distiller/quantization/quantizer.py . For a simple sub-class implementing symmetric linear quantization, see SymmetricLinearQuantizer in distiller/quantization/range_linear.py . In distiller/quantization/clipped_linear.py there are examples of lower-precision methods which use training with quantization. Specifically, see PACTQuantizer for an example of overriding Quantizer._get_updated_optimizer_params_groups() .","title":"Design"},{"location":"design.html#distiller-design","text":"Distiller is designed to be easily integrated into your own PyTorch research applications. It is easiest to understand this integration by examining the code of the sample application for compressing image classification models ( compress_classifier.py ). The application borrows its main flow code from torchvision's ImageNet classification training sample application (https://github.com/pytorch/examples/tree/master/imagenet). We tried to keep it similar, in order to make it familiar and easy to understand. Integrating compression is simple: add invocations of the appropriate compression_scheduler callbacks for each stage in the training. The training skeleton looks like the pseudocode below. The boilerplate PyTorch classification training is speckled with invocations of CompressionScheduler.
For each epoch: compression_scheduler.on_epoch_begin(epoch) train() validate() save_checkpoint() compression_scheduler.on_epoch_end(epoch) train(): For each training step: compression_scheduler.on_minibatch_begin(epoch) output = model(input_var) loss = criterion(output, target_var) compression_scheduler.before_backward_pass(epoch) loss.backward() optimizer.step() compression_scheduler.on_minibatch_end(epoch) These callbacks can be seen in the diagram below, as the arrow pointing from the Training Loop and into Distiller's Scheduler , which invokes the correct algorithm. The application also uses Distiller services to collect statistics in Summaries and log files, which can be queried at a later time, from Jupyter notebooks or TensorBoard.","title":"Distiller design"},{"location":"design.html#sparsification-and-fine-tuning","text":"The application sets up a model as normally done in PyTorch. It then instantiates a Scheduler and configures it: Scheduler configuration is defined in a YAML file. The configuration specifies Policies. Each Policy is tied to a specific algorithm which controls some aspect of the training. Some types of algorithms control the actual sparsification of the model. Such types are \"pruner\" and \"regularizer\". Some algorithms control some parameter of the training process, such as the learning-rate decay scheduler ( lr_scheduler ). The parameters of each algorithm are also specified in the configuration. In addition to specifying the algorithm, each Policy specifies scheduling parameters which control when the algorithm is executed: start epoch, end epoch and frequency. The Scheduler exposes callbacks for relevant training stages: epoch start/end, mini-batch start/end and pre-backward pass. Each scheduler callback activates the policies that were defined, according to their schedule.
These callbacks are placed in the training loop.","title":"Sparsification and fine-tuning"},{"location":"design.html#quantization","text":"A quantized model is obtained by replacing existing operations with quantized versions. The quantized versions can be either complete replacements, or wrappers. A wrapper will use the existing modules internally and add quantization and de-quantization operations before/after as necessary. In Distiller we will provide a set of quantized versions of common operations which will enable implementation of different quantization methods. The user can write a quantized model from scratch, using the quantized operations provided. We also provide a mechanism which takes an existing model and automatically replaces required operations with quantized versions. This mechanism is exposed by the Quantizer class. Quantizer should be sub-classed for each quantization method.","title":"Quantization"},{"location":"design.html#model-transformation","text":"The high-level flow is as follows: Define a mapping from the module types to be replaced (e.g. Conv2d, Linear, etc.) to a function which generates the replacement module. The mapping is defined in the replacement_factory attribute of the Quantizer class. Iterate over the modules defined in the model. For each module, if its type is in the mapping, call the replacement generation function. We pass the existing module to this function to allow wrapping of it. Replace the existing module with the module returned by the function. It is important to note that the name of the module does not change, as that could break the forward function of the parent module. Different quantization methods may, obviously, use different quantized operations. In addition, different methods may employ different \"strategies\" of replacing / wrapping existing modules. For instance, some methods replace ReLU with another activation function, while others keep it.
Hence, for each quantization method, a different mapping will likely be defined. Each sub-class of Quantizer should populate the replacement_factory dictionary attribute with the appropriate mapping. To execute the model transformation, call the prepare_model function of the Quantizer instance.","title":"Model Transformation"},{"location":"design.html#flexible-bit-widths","text":"Each instance of Quantizer is parameterized by the number of bits to be used for quantization of different tensor types. The default ones are activations and weights. These are the bits_activations , bits_weights and bits_bias parameters in Quantizer 's constructor. Sub-classes may define bit-widths for other tensor types as needed. We also want to be able to override the default number of bits mentioned above for certain layers. These could be very specific layers. However, many models are comprised of building blocks (\"container\" modules, such as Sequential) which contain several modules, and it is likely we'll want to override settings for entire blocks, or for a certain module across different blocks. When such building blocks are used, the names of the internal modules usually follow some pattern. So, for this purpose, Quantizer also accepts a mapping of regular expressions to number of bits. This allows the user to override specific layers using their exact name, or a group of layers via a regular expression. This mapping is passed via the overrides parameter in the constructor. The overrides mapping is required to be an instance of collections.OrderedDict (as opposed to just a simple Python dict ). This is done in order to enable handling of overlapping name patterns. So, for example, one could define certain override parameters for a group of layers, e.g. 'conv*', but also define different parameters for specific layers in that group, e.g. 'conv1'. The patterns are evaluated eagerly - the first match wins.
Therefore, the more specific patterns must come before the broad patterns.","title":"Flexible Bit-Widths"},{"location":"design.html#weights-quantization","text":"The Quantizer class also provides an API to quantize the weights of all layers at once. To use it, the param_quantization_fn attribute needs to point to a function that accepts a tensor and the number of bits. During model transformation, the Quantizer class will build a list of all model parameters that need to be quantized along with their bit-width. Then, the quantize_params function can be called, which will iterate over all parameters and quantize them using param_quantization_fn .","title":"Weights Quantization"},{"location":"design.html#quantization-aware-training","text":"The Quantizer class supports quantization-aware training, that is - training with quantization in the loop. This requires handling of a couple of flows / scenarios: Maintaining a full precision copy of the weights, as described here . This is enabled by setting train_with_fp_copy=True in the Quantizer constructor. At model transformation, in each module that has parameters that should be quantized, a new torch.nn.Parameter is added, which will maintain the required full precision copy of the parameters. Note that this is done in-place - a new module is not created. We preferred not to sub-class the existing PyTorch modules for this purpose. In order to do this in-place, and also guarantee proper back-propagation through the weights quantization function, we employ the following \"hack\": The existing torch.nn.Parameter , e.g. weight , is replaced by a torch.nn.Parameter named float_weight . To maintain the existing functionality of the module, we then register a buffer in the module with the original name - weight . During training, float_weight will be passed to param_quantization_fn and the result will be stored in weight . In addition, some quantization methods may introduce additional learned parameters to the model.
For example, in the PACT method, activations are clipped to a value \\alpha , which is a learned parameter per-layer. To support these two cases, the Quantizer class also accepts an instance of a torch.optim.Optimizer (normally this would be an instance of one of its sub-classes). The quantizer will take care of modifying the optimizer according to the changes made to the parameters. Optimizing New Parameters In cases where new parameters are required by the scheme, it is likely that they'll need to be optimized separately from the main model parameters. In that case, the sub-class for the specific method should override Quantizer._get_updated_optimizer_params_groups() , and return the proper groups plus any desired hyper-parameter overrides.","title":"Quantization-Aware Training"},{"location":"design.html#examples","text":"The base Quantizer class is implemented in distiller/quantization/quantizer.py . For a simple sub-class implementing symmetric linear quantization, see SymmetricLinearQuantizer in distiller/quantization/range_linear.py . In distiller/quantization/clipped_linear.py there are examples of lower-precision methods which use training with quantization. Specifically, see PACTQuantizer for an example of overriding Quantizer._get_updated_optimizer_params_groups() .","title":"Examples"},{"location":"install.html","text":"Distiller Installation These instructions will help get Distiller up and running on your local machine. You may also want to refer to these resources: Dataset installation instructions. Jupyter installation instructions. Notes: - Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5. - If you are not using a GPU, you might need to make small adjustments to the code. Clone Distiller Clone the Distiller code repository from GitHub: $ git clone https://github.com/NervanaSystems/distiller.git The rest of the documentation that follows assumes that you have cloned your repository to a directory called distiller .
Create a Python virtual environment We recommend using a Python virtual environment , but that of course, is up to you. There's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. Before creating the virtual environment, make sure you are located in directory distiller . After creating the environment, you should see a directory called distiller/env . Using virtualenv If you don't have virtualenv installed, you can find the installation instructions here . To create the environment, execute: $ python3 -m virtualenv env This creates a subdirectory named env where the python virtual environment is stored, and configures the current shell to use it as the default python environment. Using venv If you prefer to use venv , then begin by installing it: $ sudo apt-get install python3-venv Then create the environment: $ python3 -m venv env As with virtualenv, this creates a directory called distiller/env . Activate the environment The environment activation and deactivation commands for venv and virtualenv are the same. !NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages: $ source env/bin/activate Install the package Finally, install the Distiller package and its dependencies using pip3 : $ cd distiller $ pip3 install -e . This installs Distiller in \"development mode\", meaning any changes made in the code are reflected in the environment without re-running the install command (so no need to re-install after pulling changes from the Git repository). PyTorch is included in the requirements.txt file, and will currently download PyTorch version 1.0.1 for CUDA 9.0. This is the setup we've used for testing Distiller.","title":"Installation"},{"location":"install.html#distiller-installation","text":"These instructions will help get Distiller up and running on your local machine. 
You may also want to refer to these resources: Dataset installation instructions. Jupyter installation instructions. Notes: - Distiller has only been tested on Ubuntu 16.04 LTS, and with Python 3.5. - If you are not using a GPU, you might need to make small adjustments to the code.","title":"Distiller Installation"},{"location":"install.html#clone-distiller","text":"Clone the Distiller code repository from github: $ git clone https://github.com/NervanaSystems/distiller.git The rest of the documentation that follows, assumes that you have cloned your repository to a directory called distiller .","title":"Clone Distiller"},{"location":"install.html#create-a-python-virtual-environment","text":"We recommend using a Python virtual environment , but that of course, is up to you. There's nothing special about using Distiller in a virtual environment, but we provide some instructions, for completeness. Before creating the virtual environment, make sure you are located in directory distiller . After creating the environment, you should see a directory called distiller/env .","title":"Create a Python virtual environment"},{"location":"install.html#using-virtualenv","text":"If you don't have virtualenv installed, you can find the installation instructions here . To create the environment, execute: $ python3 -m virtualenv env This creates a subdirectory named env where the python virtual environment is stored, and configures the current shell to use it as the default python environment.","title":"Using virtualenv"},{"location":"install.html#using-venv","text":"If you prefer to use venv , then begin by installing it: $ sudo apt-get install python3-venv Then create the environment: $ python3 -m venv env As with virtualenv, this creates a directory called distiller/env .","title":"Using venv"},{"location":"install.html#activate-the-environment","text":"The environment activation and deactivation commands for venv and virtualenv are the same. 
!NOTE: Make sure to activate the environment, before proceeding with the installation of the dependency packages: $ source env/bin/activate","title":"Activate the environment"},{"location":"install.html#install-the-package","text":"Finally, install the Distiller package and its dependencies using pip3 : $ cd distiller $ pip3 install -e . This installs Distiller in \"development mode\", meaning any changes made in the code are reflected in the environment without re-running the install command (so no need to re-install after pulling changes from the Git repository). PyTorch is included in the requirements.txt file, and will currently download PyTorch version 1.0.1 for CUDA 9.0. This is the setup we've used for testing Distiller.","title":"Install the package"},{"location":"jupyter.html","text":"Jupyter environment The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results. Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook. Installation Jupyter and its dependencies are included as part of the main requirements.txt file, so there is no need for a dedicated installation step. However, to use the ipywidgets extension, you will need to enable it: $ jupyter nbextension enable --py widgetsnbextension --sys-prefix You may want to refer to the ipywidgets extension installation documentation . Another extension which requires special installation handling is Qgrid . Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Pandas DataFrames rendering. To enable Qgrid: $ jupyter nbextension enable --py --sys-prefix qgrid Launching the Jupyter server There are many options you can use when launching Jupyter.
The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. Consult the user's guide for more details. $ jupyter-notebook --ip=0.0.0.0 --no-browser Using the Distiller notebooks The Distiller Jupyter notebooks are located in the distiller/jupyter directory. They are provided as tools that you can use to prepare your compression experiments and study their results. We welcome new ideas and implementations of Jupyter. Roughly, the notebooks can be divided into three categories. Theory jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity. jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet. We compare distributions, performance, statistics and show some visualizations of the weights tensors. Preparation for compression jupyter/model_summary.ipynb : Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model. jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave. jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results. Graph a few LR-decay policies to see how they behave. jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule. 
Reviewing experiment results jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments. jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to help us understand the results of a compression session. Distiller provides summaries , which are Pandas dataframes, which contain statistical information about your model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.","title":"Jupyter Notebooks"},{"location":"jupyter.html#jupyter-environment","text":"The Jupyter notebooks environment allows us to plan our compression session and load Distiller data summaries to study and analyze compression results. Each notebook has embedded instructions and explanations, so here we provide only a brief description of each notebook.","title":"Jupyter environment"},{"location":"jupyter.html#installation","text":"Jupyter and its dependencies are included as part of the main requirements.txt file, so there is no need for a dedicated installation step. However, to use the ipywidgets extension, you will need to enable it: $ jupyter nbextension enable --py widgetsnbextension --sys-prefix You may want to refer to the ipywidgets extension installation documentation . Another extension which requires special installation handling is Qgrid . Qgrid is a Jupyter notebook widget that adds interactive features, such as sorting, to Pandas DataFrames rendering. To enable Qgrid: $ jupyter nbextension enable --py --sys-prefix qgrid","title":"Installation"},{"location":"jupyter.html#launching-the-jupyter-server","text":"There are many options you can use when launching Jupyter. The example below tells the server to listen to connections from any IP address, and not to launch the browser window, but of course, you are free to launch Jupyter any way you want. Consult the user's guide for more details.
$ jupyter-notebook --ip=0.0.0.0 --no-browser","title":"Launching the Jupyter server"},{"location":"jupyter.html#using-the-distiller-notebooks","text":"The Distiller Jupyter notebooks are located in the distiller/jupyter directory. They are provided as tools that you can use to prepare your compression experiments and study their results. We welcome new ideas and implementations of Jupyter. Roughly, the notebooks can be divided into three categories.","title":"Using the Distiller notebooks"},{"location":"jupyter.html#theory","text":"jupyter/L1-regularization.ipynb : Experience hands-on how L1 and L2 regularization affect the solution of a toy loss-minimization problem, to get a better grasp on the interaction between regularization and sparsity. jupyter/alexnet_insights.ipynb : This notebook reviews and compares a couple of pruning sessions on Alexnet. We compare distributions, performance, statistics and show some visualizations of the weights tensors.","title":"Theory"},{"location":"jupyter.html#preparation-for-compression","text":"jupyter/model_summary.ipynb : Begin by getting familiar with your model. Examine the sizes and properties of layers and connections. Study which layers are compute-bound, and which are bandwidth-bound, and decide how to prune or regularize the model. jupyter/sensitivity_analysis.ipynb : If you performed pruning sensitivity analysis on your model, this notebook can help you load the results and graphically study how the layers behave. jupyter/interactive_lr_scheduler.ipynb : The learning rate decay policy affects pruning results, perhaps as much as it affects training results. Graph a few LR-decay policies to see how they behave. 
jupyter/jupyter/agp_schedule.ipynb : If you are using the Automated Gradual Pruner, this notebook can help you tune the schedule.","title":"Preparation for compression"},{"location":"jupyter.html#reviewing-experiment-results","text":"jupyter/compare_executions.ipynb : This is a simple notebook to help you graphically compare the results of executions of several experiments. jupyter/compression_insights.ipynb : This notebook is packed with code, tables and graphs to help us understand the results of a compression session. Distiller provides summaries , which are Pandas dataframes, which contain statistical information about your model. We chose to use Pandas dataframes because they can be sliced, queried, summarized and graphed with a few lines of code.","title":"Reviewing experiment results"},{"location":"knowledge_distillation.html","text":"Knowledge Distillation (For details on how to train a model with knowledge distillation in Distiller, see here ) Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models). This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably). The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015 . The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a video lecture with slides is also available). In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits.
However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue, Hinton et al., 2015 introduced the concept of \"softmax temperature\". The probability p_i of class i is calculated from the logits z as: p_i = \\frac{\\exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)} where T is the temperature parameter. When T=1 we get the standard softmax function. As T grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of T to compute the softmax on the student's logits. We call this loss the \"distillation loss\". Hinton et al., 2015 found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use T = 1 .
The overall loss function, incorporating both distillation and student losses, is calculated as: \\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s; T=\\tau)) where x is the input, W are the student model parameters, y is the ground truth label, \\mathcal{H} is the cross-entropy loss function, \\sigma is the softmax function parameterized by the temperature T , and \\alpha and \\beta are coefficients. z_s and z_t are the logits of the student and teacher respectively. New Hyper-Parameters In general \\tau , \\alpha and \\beta are hyper-parameters. In their experiments, Hinton et al., 2015 use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have. With regard to \\alpha and \\beta , Hinton et al., 2015 use a weighted average between the distillation loss and the student loss. That is, \\beta = 1 - \\alpha . They note that in general, they obtained the best results when setting \\alpha to be much smaller than \\beta (although in one of their experiments they use \\alpha = \\beta = 0.5 ). Other works which utilize knowledge distillation don't use a weighted average. Some set \\alpha = 1 while leaving \\beta tunable, while others don't set any constraints. Combining with Other Model Compression Techniques In the \"basic\" scenario, the smaller (student) model is a pre-defined architecture which just has a smaller number of parameters compared to the teacher model. For example, we could train ResNet-18 by distilling knowledge from ResNet-34.
But, a model with smaller capacity can also be obtained by other model compression techniques - sparsification and/or quantization. So, for example, we could train a 4-bit ResNet-18 model with some method using quantization-aware training, and use a distillation loss function as described above. In that case, the teacher model can even be an FP32 ResNet-18 model. Same goes for pruning and regularization. Tann et al., 2017 , Mishra and Marr, 2018 and Polino et al., 2018 are some works that combine knowledge distillation with quantization . Theis et al., 2018 and Ashok et al., 2018 combine distillation with pruning . References Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil . Model Compression. KDD, 2006 Geoffrey Hinton, Oriol Vinyals and Jeff Dean . Distilling the Knowledge in a Neural Network. arxiv:1503.02531 Hokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda . Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks. DAC, 2017 Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. ICLR, 2018 Antonio Polino, Razvan Pascanu and Dan Alistarh . Model compression via distillation and quantization. ICLR, 2018 Anubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani . N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning. ICLR, 2018 Lucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r . Faster gaze prediction with dense networks and Fisher pruning. arxiv:1801.05787","title":"Knowledge Distillation"},{"location":"knowledge_distillation.html#knowledge-distillation","text":"(For details on how to train a model with knowledge distillation in Distiller, see here ) Knowledge distillation is a model compression method in which a small model is trained to mimic a pre-trained, larger model (or ensemble of models).
This training setting is sometimes referred to as \"teacher-student\", where the large model is the teacher and the small model is the student (we'll be using these terms interchangeably). The method was first proposed by Bucila et al., 2006 and generalized by Hinton et al., 2015 . The implementation in Distiller is based on the latter publication. Here we'll provide a summary of the method. For more information the reader may refer to the paper (a video lecture with slides is also available). In distillation, knowledge is transferred from the teacher model to the student by minimizing a loss function in which the target is the distribution of class probabilities predicted by the teacher model. That is - the output of a softmax function on the teacher model's logits. However, in many cases, this probability distribution has the correct class at a very high probability, with all other class probabilities very close to 0. As such, it doesn't provide much information beyond the ground truth labels already provided in the dataset. To tackle this issue, Hinton et al., 2015 introduced the concept of \"softmax temperature\". The probability p_i of class i is calculated from the logits z as: p_i = \\frac{\\exp\\left(\\frac{z_i}{T}\\right)}{\\sum_{j} \\exp\\left(\\frac{z_j}{T}\\right)} where T is the temperature parameter. When T=1 we get the standard softmax function. As T grows, the probability distribution generated by the softmax function becomes softer, providing more information as to which classes the teacher found more similar to the predicted class. Hinton calls this the \"dark knowledge\" embedded in the teacher model, and it is this dark knowledge that we are transferring to the student model in the distillation process. When computing the loss function vs. the teacher's soft targets, we use the same value of T to compute the softmax on the student's logits. We call this loss the \"distillation loss\".
Hinton et al., 2015 found that it is also beneficial to train the distilled model to produce the correct labels (based on the ground truth) in addition to the teacher's soft-labels. Hence, we also calculate the \"standard\" loss between the student's predicted class probabilities and the ground-truth labels (also called \"hard labels/targets\"). We dub this loss the \"student loss\". When calculating the class probabilities for the student loss we use T = 1 . The overall loss function, incorporating both distillation and student losses, is calculated as: \\mathcal{L}(x;W) = \\alpha * \\mathcal{H}(y, \\sigma(z_s; T=1)) + \\beta * \\mathcal{H}(\\sigma(z_t; T=\\tau), \\sigma(z_s; T=\\tau)) where x is the input, W are the student model parameters, y is the ground truth label, \\mathcal{H} is the cross-entropy loss function, \\sigma is the softmax function parameterized by the temperature T , and \\alpha and \\beta are coefficients. z_s and z_t are the logits of the student and teacher respectively.","title":"Knowledge Distillation"},{"location":"knowledge_distillation.html#new-hyper-parameters","text":"In general \\tau , \\alpha and \\beta are hyper-parameters. In their experiments, Hinton et al., 2015 use temperature values ranging from 1 to 20. They note that empirically, when the student model is very small compared to the teacher model, lower temperatures work better. This makes sense if we consider that as we raise the temperature, the resulting soft-labels distribution becomes richer in information, and a very small model might not be able to capture all of this information. However, there's no clear way to predict up front what kind of capacity for information the student model will have. With regard to \\alpha and \\beta , Hinton et al., 2015 use a weighted average between the distillation loss and the student loss. That is, \\beta = 1 - \\alpha .
They note that in general, they obtained the best results when setting \\alpha to be much smaller than \\beta (although in one of their experiments they use \\alpha = \\beta = 0.5 ). Other works which utilize knowledge distillation don't use a weighted average. Some set \\alpha = 1 while leaving \\beta tunable, while others don't set any constraints.","title":"New Hyper-Parameters"},{"location":"knowledge_distillation.html#references","text":"Cristian Bucila, Rich Caruana, and Alexandru Niculescu-Mizil . Model Compression. KDD, 2006 Geoffrey Hinton, Oriol Vinyals and Jeff Dean . Distilling the Knowledge in a Neural Network. arxiv:1503.02531 Hokchhay Tann, Soheil Hashemi, Iris Bahar and Sherief Reda . Hardware-Software Codesign of Accurate, Multiplier-free Deep Neural Networks. DAC, 2017 Asit Mishra and Debbie Marr . Apprentice: Using Knowledge Distillation Techniques To Improve Low-Precision Network Accuracy. ICLR, 2018 Antonio Polino, Razvan Pascanu and Dan Alistarh . Model compression via distillation and quantization. ICLR, 2018 Anubhav Ashok, Nicholas Rhinehart, Fares Beainy and Kris M. Kitani . N2N learning: Network to Network Compression via Policy Gradient Reinforcement Learning. ICLR, 2018 Lucas Theis, Iryna Korshunova, Alykhan Tejani and Ferenc Husz\u00e1r . Faster gaze prediction with dense networks and Fisher pruning. arxiv:1801.05787","title":"References"},{"location":"model_zoo.html","text":"Distiller Model Zoo How to contribute models to the Model Zoo We encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. To assure that models and algorithms shared with others are high-quality, please commit your models with the following: Command-line arguments Log files PyTorch model Contents The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. 
Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. These are meant to serve as examples of how Distiller can be used. Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs. Paper Dataset Network Method & Granularity Schedule Features Learning both Weights and Connections for Efficient Neural Networks ImageNet Alexnet Element-wise pruning Iterative; Manual Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis To prune, or not to prune: exploring the efficacy of pruning for model compression ImageNet MobileNet Element-wise pruning Automated gradual; Iterative Magnitude thresholding based on target level Learning Structured Sparsity in Deep Neural Networks CIFAR10 ResNet20 Group regularization 1.Train with group-lasso 2.Remove zero groups and fine-tune Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) Pruning Filters for Efficient ConvNets CIFAR10 ResNet56 Filter ranking; guided by sensitivity analysis 1.Rank filters 2. Remove filters and channels 3.Fine-tune One-shot ranking and pruning of filters; with network thinning Learning both Weights and Connections for Efficient Neural Networks This schedule is an example of \"Iterative Pruning\" for Alexnet/ImageNet, as described in chapter 3 of Song Han's PhD dissertation: Efficient Methods and Hardware for Deep Learning and in his paper Learning both Weights and Connections for Efficient Neural Networks . The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\".
Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer's weights,\" but does not explain this much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and is based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weight tensors are distributed normally, the standard deviation acts as a threshold normalizer. Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a threshold that grows at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use fewer hyper-parameters and achieve the same results. Distiller schedule: distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml Checkpoint file: alexnet.checkpoint.89.pth.tar Results Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45. Song Han prunes 89% of the parameters, which is slightly better than our results. 
Parameters: +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 | | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 | | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 | | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 | | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 | | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 | | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 | | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 | | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 | 
+----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ 2018-04-04 21:30:52,499 - Total sparsity: 88.44 2018-04-04 21:30:52,499 - --- validate (epoch=89)----------- 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch) 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150 2018-04-04 21:31:39,251 - --- test --------------------- 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch) 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893 To prune, or not to prune: exploring the efficacy of pruning for model compression In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. 
ImageNet files: Distiller schedule: distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar ResNet18 files: Distiller schedule: distiller/examples/agp-pruning/resnet18.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar Results As our baseline we used a pretrained PyTorch MobileNet model (width=1) which has Top1=68.848 and Top5=88.740. In their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 | | 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 | | 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 | | 3 | module.model.2.0.weight | (64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 | | 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 
0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 | | 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 | | 6 | module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 | | 7 | module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 | | 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 | | 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 | | 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 | | 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 | | 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 | | 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 | | 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 | | 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 0.00370 | 0.04905 | | 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 | | 17 | module.model.9.0.weight | (512, 1, 3, 3) | 
4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 | | 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 | | 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 | | 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 | | 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 | | 22 | module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 | | 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 | | 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 | | 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 | | 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 | | 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 | | 28 | Total sparsity: | - | 4209088 | 1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 | 
+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ Total sparsity: 58.97 --- validate (epoch=199)----------- 128116 samples (256 per mini-batch) ==> Top1: 65.337 Top5: 84.984 Loss: 1.494 --- test --------------------- 50000 samples (256 per mini-batch) ==> Top1: 68.810 Top5: 88.626 Loss: 1.282 Learning Structured Sparsity in Deep Neural Networks This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\" Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value). Baseline training We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model. 
Distiller schedule: distiller/examples/ssl/resnet20_cifar_baseline_training.yaml Checkpoint files: distiller/examples/ssl/checkpoints/ $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic Regularization Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: Distiller schedule: distiller/examples/ssl/ssl_4D-removal_4L_training.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). You may notice several interesting things: 1. The LR-decay policy is the same, but the two sessions start with different initial LR values. 2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge. 3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better. 4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping. This regularization yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. 
We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better. Fine-tuning During the fine-tuning process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropagated through: therefore they are completely disconnected from the network. We copy the checkpoint file of the regularized model to checkpoint_trained_4D_regularized_5Lremoved.pth.tar. Distiller schedule: distiller/examples/ssl/ssl_4D-removal_finetuning.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic Results Our baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750. We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with only a small degradation of the accuracies. 
The regularized model exhibits very poor classification abilities before fine-tuning: $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate => loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar best top@1: 90.620 Loaded compression schedule from checkpoint (epoch 179) Removing layer: module.layer1.0.conv1 [layer=0 block=0 conv=0] Removing layer: module.layer1.0.conv2 [layer=0 block=0 conv=1] Removing layer: module.layer1.1.conv1 [layer=0 block=1 conv=0] Removing layer: module.layer1.1.conv2 [layer=0 block=1 conv=1] Removing layer: module.layer2.2.conv2 [layer=1 block=2 conv=1] Files already downloaded and verified Files already downloaded and verified Dataset sizes: training=45000 validation=5000 test=10000 --- test --------------------- 10000 samples (256 per mini-batch) ==> Top1: 22.290 Top5: 68.940 Loss: 5.172 However, after fine-tuning, we recovered most of the accuracy loss, but not quite all of it: Top1=91.020 and Top5=99.670. We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies). Pruning Filters for Efficient ConvNets Quoting the authors directly: \"We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly. In contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications.\" The implementation of the research by Li et al. 
required us to add filter-pruning sensitivity analysis, and support for \"network thinning\". After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml Checkpoint files: checkpoint_finetuned.pth.tar The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning). pruners: filter_pruner: class: 'L1RankedStructureParameterPruner' reg_regims: 'module.layer1.0.conv1.weight': [0.6, '3D'] 'module.layer1.1.conv1.weight': [0.6, '3D'] 'module.layer1.2.conv1.weight': [0.6, '3D'] 'module.layer1.3.conv1.weight': [0.6, '3D'] In the policy, we specify that we want to invoke this pruner once, at epoch 180. Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule. policies: - pruner: instance_name: filter_pruner epochs: [180] Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer n we need to perform several changes to the network: 1. Shrink layer n 's weights tensor, leaving only the \"important\" filters. 2. Configure layer n 's .out_channels member to its new, smaller, value. 3. 
If a BN layer follows layer n, then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk. 4. If a Convolution layer follows the BN layer, then it will have fewer input channels, which requires reconfiguration and shrinking of its weights. All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. We call this process \"network thinning\". extensions: net_thinner: class: 'FilterRemover' thinning_func_str: remove_filters arch: 'resnet56_cifar' dataset: 'cifar10' Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there are extra details to consider. Our current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this. Baseline training We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml Checkpoint files: checkpoint.resnet56_cifar_baseline.pth.tar Results We trained a ResNet56-Cifar10 network and achieved accuracy results which are on par with published results: Top1: 92.970 and Top5: 99.740. We used Li et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline: Top1: 92.830 and Top5: 99.760","title":"Model Zoo"},{"location":"model_zoo.html#distiller-model-zoo","text":"","title":"Distiller Model Zoo"},{"location":"model_zoo.html#how-to-contribute-models-to-the-model-zoo","text":"We encourage you to contribute new models to the Model Zoo. We welcome implementations of published papers or of your own work. 
To assure that models and algorithms shared with others are high-quality, please commit your models with the following: Command-line arguments Log files PyTorch model","title":"How to contribute models to the Model Zoo"},{"location":"model_zoo.html#contents","text":"The Distiller model zoo is not a \"traditional\" model-zoo, because it does not necessarily contain best-in-class compressed models. Instead, the model-zoo contains a number of deep learning models that have been compressed using Distiller following some well-known research papers. These are meant to serve as examples of how Distiller can be used. Each model contains a Distiller schedule detailing how the model was compressed, a PyTorch checkpoint, text logs and TensorBoard logs. Paper Dataset Network Method & Granularity Schedule Features Learning both Weights and Connections for Efficient Neural Networks ImageNet Alexnet Element-wise pruning Iterative; Manual Magnitude thresholding based on a sensitivity quantifier. Element-wise sparsity sensitivity analysis To prune, or not to prune: exploring the efficacy of pruning for model compression ImageNet MobileNet Element-wise pruning Automated gradual; Iterative Magnitude thresholding based on target level Learning Structured Sparsity in Deep Neural Networks CIFAR10 ResNet20 Group regularization 1. Train with group-lasso 2. Remove zero groups and fine-tune Group Lasso regularization. Groups: kernels (2D), channels, filters (3D), layers (4D), vectors (rows, cols) Pruning Filters for Efficient ConvNets CIFAR10 ResNet56 Filter ranking; guided by sensitivity analysis 1. Rank filters 2. 
Remove filters and channels 3. Fine-tune One-shot ranking and pruning of filters; with network thinning","title":"Contents"},{"location":"model_zoo.html#learning-both-weights-and-connections-for-efficient-neural-networks","text":"This schedule is an example of \"Iterative Pruning\" for Alexnet/ImageNet, as described in chapter 3 of Song Han's PhD dissertation: Efficient Methods and Hardware for Deep Learning and in his paper Learning both Weights and Connections for Efficient Neural Networks. The Distiller schedule uses SensitivityPruner which is similar to MagnitudeParameterPruner, but instead of specifying \"raw\" thresholds, it uses a \"sensitivity parameter\". Song Han's paper says that \"the pruning threshold is chosen as a quality parameter multiplied by the standard deviation of a layer's weights,\" but does not explain this much further. In Distiller, the \"quality parameter\" is referred to as \"sensitivity\" and is based on the values learned from performing sensitivity analysis. Using a parameter that is related to the standard deviation is very helpful: under the assumption that the weight tensors are distributed normally, the standard deviation acts as a threshold normalizer. Note that Distiller's implementation deviates slightly from the algorithm Song Han describes in his PhD dissertation, in that the threshold value is set only once. In his PhD dissertation, Song Han describes a threshold that grows at each iteration. This requires n+1 hyper-parameters (n being the number of pruning iterations we use): the threshold and the threshold increase (delta) at each pruning iteration. Distiller's implementation takes advantage of the fact that as pruning progresses, more weights are pulled toward zero, and therefore the threshold \"traps\" more weights. Thus, we can use fewer hyper-parameters and achieve the same results. 
Distiller schedule: distiller/examples/sensitivity-pruning/alexnet.schedule_sensitivity.yaml Checkpoint file: alexnet.checkpoint.89.pth.tar","title":"Learning both Weights and Connections for Efficient Neural Networks"},{"location":"model_zoo.html#results","text":"Our reference is TorchVision's pretrained Alexnet model which has a Top1 accuracy of 56.55 and Top5=79.09. We prune away 88.44% of the parameters and achieve Top1=56.61 and Top5=79.45. Song Han prunes 89% of the parameters, which is slightly better than our results. Parameters: +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 | | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 | | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 | | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 | | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 | | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 
0.00589 | -0.00020 | 0.00168 | | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 | | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 | | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 | +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ 2018-04-04 21:30:52,499 - Total sparsity: 88.44 2018-04-04 21:30:52,499 - --- validate (epoch=89)----------- 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch) 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150 2018-04-04 21:31:39,251 - --- test --------------------- 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch) 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893","title":"Results"},{"location":"model_zoo.html#to-prune-or-not-to-prune-exploring-the-efficacy-of-pruning-for-model-compression","text":"In their paper Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner, which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. 
The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. ImageNet files: Distiller schedule: distiller/examples/agp-pruning/mobilenet.imagenet.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar ResNet18 files: Distiller schedule: distiller/examples/agp-pruning/resnet18.schedule_agp.yaml Checkpoint file: checkpoint.pth.tar","title":"To prune, or not to prune: exploring the efficacy of pruning for model compression"},{"location":"model_zoo.html#results_1","text":"As our baseline we used a pretrained PyTorch MobileNet model (width=1) which has Top1=68.848 and Top5=88.740. In their paper, Zhu and Gupta prune 50% of the elements of MobileNet (width=1) with a 1.1% drop in accuracy. We pruned about 51.6% of the elements, with virtually no change in the accuracies (Top1: 68.808 and Top5: 88.656). We didn't try to prune more than this, but we do note that the baseline accuracy that we used is almost 2% lower than the accuracy published in the paper. 
+----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | module.model.0.0.weight | (32, 3, 3, 3) | 864 | 864 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.14466 | 0.00103 | 0.06508 | | 1 | module.model.1.0.weight | (32, 1, 3, 3) | 288 | 288 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.32146 | 0.01020 | 0.12932 | | 2 | module.model.1.3.weight | (64, 32, 1, 1) | 2048 | 2048 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11942 | 0.00024 | 0.03627 | | 3 | module.model.2.0.weight | (64, 1, 3, 3) | 576 | 576 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.15809 | 0.00543 | 0.11513 | | 4 | module.model.2.3.weight | (128, 64, 1, 1) | 8192 | 8192 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08442 | -0.00031 | 0.04182 | | 5 | module.model.3.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.16780 | 0.00125 | 0.10545 | | 6 | module.model.3.3.weight | (128, 128, 1, 1) | 16384 | 16384 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07126 | -0.00197 | 0.04123 | | 7 | module.model.4.0.weight | (128, 1, 3, 3) | 1152 | 1152 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.10182 | 0.00171 | 0.08719 | | 8 | module.model.4.3.weight | (256, 128, 1, 1) | 32768 | 13108 | 0.00000 | 0.00000 | 10.15625 | 59.99756 | 12.50000 | 59.99756 | 0.05543 | -0.00002 | 0.02760 | | 9 | module.model.5.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 
0.00000 | 0.00000 | 0.00000 | 0.12516 | -0.00288 | 0.08058 | | 10 | module.model.5.3.weight | (256, 256, 1, 1) | 65536 | 26215 | 0.00000 | 0.00000 | 12.50000 | 59.99908 | 23.82812 | 59.99908 | 0.04453 | 0.00002 | 0.02271 | | 11 | module.model.6.0.weight | (256, 1, 3, 3) | 2304 | 2304 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08024 | 0.00252 | 0.06377 | | 12 | module.model.6.3.weight | (512, 256, 1, 1) | 131072 | 52429 | 0.00000 | 0.00000 | 23.82812 | 59.99985 | 14.25781 | 59.99985 | 0.03561 | -0.00057 | 0.01779 | | 13 | module.model.7.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.11008 | -0.00018 | 0.06829 | | 14 | module.model.7.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 14.25781 | 59.99985 | 21.28906 | 59.99985 | 0.02944 | -0.00060 | 0.01515 | | 15 | module.model.8.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.08258 | 0.00370 | 0.04905 | | 16 | module.model.8.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 21.28906 | 59.99985 | 28.51562 | 59.99985 | 0.02865 | -0.00046 | 0.01465 | | 17 | module.model.9.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07578 | 0.00468 | 0.04201 | | 18 | module.model.9.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 28.51562 | 59.99985 | 23.43750 | 59.99985 | 0.02939 | -0.00044 | 0.01511 | | 19 | module.model.10.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.07091 | 0.00014 | 0.04306 | | 20 | module.model.10.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 24.60938 | 59.99985 | 20.89844 | 59.99985 | 0.03095 | -0.00059 | 0.01672 | | 21 | module.model.11.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.05729 | -0.00518 | 0.04267 | | 22 | 
module.model.11.3.weight | (512, 512, 1, 1) | 262144 | 104858 | 0.00000 | 0.00000 | 20.89844 | 59.99985 | 17.57812 | 59.99985 | 0.03229 | -0.00044 | 0.01797 | | 23 | module.model.12.0.weight | (512, 1, 3, 3) | 4608 | 4608 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.04981 | -0.00136 | 0.03967 | | 24 | module.model.12.3.weight | (1024, 512, 1, 1) | 524288 | 209716 | 0.00000 | 0.00000 | 16.01562 | 59.99985 | 44.23828 | 59.99985 | 0.02514 | -0.00106 | 0.01278 | | 25 | module.model.13.0.weight | (1024, 1, 3, 3) | 9216 | 9216 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.02396 | -0.00949 | 0.01549 | | 26 | module.model.13.3.weight | (1024, 1024, 1, 1) | 1048576 | 419431 | 0.00000 | 0.00000 | 44.72656 | 59.99994 | 1.46484 | 59.99994 | 0.01801 | -0.00017 | 0.00931 | | 27 | module.fc.weight | (1000, 1024) | 1024000 | 409600 | 1.46484 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 60.00000 | 0.05078 | 0.00271 | 0.02734 | | 28 | Total sparsity: | - | 4209088 | 1726917 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 58.97171 | 0.00000 | 0.00000 | 0.00000 | +----+--------------------------+--------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ Total sparsity: 58.97 --- validate (epoch=199)----------- 128116 samples (256 per mini-batch) ==> Top1: 65.337 Top5: 84.984 Loss: 1.494 --- test --------------------- 50000 samples (256 per mini-batch) ==> Top1: 68.810 Top5: 88.626 Loss: 1.282","title":"Results"},{"location":"model_zoo.html#learning-structured-sparsity-in-deep-neural-networks","text":"This research paper from the University of Pittsburgh, \"proposes a Structured Sparsity Learning (SSL) method to regularize the structures (i.e., filters, channels, filter shapes, and layer depth) of DNNs. 
SSL can: (1) learn a compact structure from a bigger DNN to reduce computation cost; (2) obtain a hardware-friendly structured sparsity of DNN to efficiently accelerate the DNN\u2019s evaluation.\" Note that this paper does not use pruning, but instead uses group regularization during the training to force weights towards zero, as a group. We used a schedule which thresholds the regularized elements at a magnitude equal to the regularization strength. At the end of the regularization phase, we save the final sparsity masks generated by the regularization, and exit. Then we load this regularized model, remove the layers corresponding to the zeroed weight tensors (all of a layer's elements have a zero value).","title":"Learning Structured Sparsity in Deep Neural Networks"},{"location":"model_zoo.html#baseline-training","text":"We started by training the baseline ResNet20-Cifar dense network since we didn't have a pre-trained model. Distiller schedule: distiller/examples/ssl/resnet20_cifar_baseline_training.yaml Checkpoint files: distiller/examples/ssl/checkpoints/ $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.3 --epochs=180 --compress=../cifar10/resnet20/baseline_training.yaml -j=1 --deterministic","title":"Baseline training"},{"location":"model_zoo.html#regularization","text":"Then we started training from scratch again, but this time we used Group Lasso regularization on entire layers: Distiller schedule: distiller/examples/ssl/ssl_4D-removal_4L_training.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.4 --epochs=180 --compress=../ssl/ssl_4D-removal_training.yaml -j=1 --deterministic The diagram below shows the training of Resnet20/CIFAR10 using Group Lasso regularization on entire layers (in blue) vs. training Resnet20/CIFAR10 baseline (in red). You may notice several interesting things: 1. 
The LR-decay policy is the same, but the two sessions start with different initial LR values. 2. The data-loss of the regularized training follows the same shape as the un-regularized training (baseline), and eventually the two seem to merge. 3. We see similar behavior in the validation Top1 and Top5 accuracy results, but the regularized training eventually performs better. 4. In the top right corner we see the behavior of the regularization loss ( Reg Loss ), which actually increases for some time, until the data-loss has a sharp drop (after ~16K mini-batches), at which point the regularization loss also starts dropping. This regularization yields 5 layers with zeroed weight tensors. We load this model, remove the 5 layers, and start the fine tuning of the weights. This process of layer removal is specific to ResNet for CIFAR, which we altered by adding code to skip over layers during the forward path. When you export to ONNX, the removed layers do not participate in the forward path, so they don't get incarnated. We managed to remove 5 of the 16 3x3 convolution layers which dominate the computation time. It's not bad, but we probably could have done better.","title":"Regularization"},{"location":"model_zoo.html#fine-tuning","text":"During the fine-tuning process, because the removed layers do not participate in the forward path, they do not appear in the backward path and are not backpropagated: therefore they are completely disconnected from the network. We copy the checkpoint file of the regularized model to checkpoint_trained_4D_regularized_5Lremoved.pth.tar . 
Distiller schedule: distiller/examples/ssl/ssl_4D-removal_finetuning.yaml $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --lr=0.1 --epochs=250 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --compress=../ssl/ssl_4D-removal_finetuning.yaml -j=1 --deterministic","title":"Fine-tuning"},{"location":"model_zoo.html#results_2","text":"Our baseline results for ResNet20 Cifar are: Top1=91.450 and Top5=99.750 We used Distiller's GroupLassoRegularizer to remove 5 layers from Resnet20 (CIFAR10) with almost no degradation of the accuracies. Before fine-tuning, the regularized model (with the layers removed) exhibits really poor classification abilities: $ time python3 compress_classifier.py --arch resnet20_cifar ../data.cifar10 -p=50 --resume=../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar --evaluate => loading checkpoint ../cifar10/resnet20/checkpoint_trained_4D_regularized_5Lremoved.pth.tar best top@1: 90.620 Loaded compression schedule from checkpoint (epoch 179) Removing layer: module.layer1.0.conv1 [layer=0 block=0 conv=0] Removing layer: module.layer1.0.conv2 [layer=0 block=0 conv=1] Removing layer: module.layer1.1.conv1 [layer=0 block=1 conv=0] Removing layer: module.layer1.1.conv2 [layer=0 block=1 conv=1] Removing layer: module.layer2.2.conv2 [layer=1 block=2 conv=1] Files already downloaded and verified Files already downloaded and verified Dataset sizes: training=45000 validation=5000 test=10000 --- test --------------------- 10000 samples (256 per mini-batch) ==> Top1: 22.290 Top5: 68.940 Loss: 5.172 However, after fine-tuning, we recovered most of the accuracy loss, but not quite all of it: Top1=91.020 and Top5=99.670 We didn't spend time trying to wrestle with this network, and therefore didn't achieve SSL's published results (which showed that they managed to remove 6 layers and at the same time increase accuracies).","title":"Results"},{"location":"model_zoo.html#pruning-filters-for-efficient-convnets","text":"Quoting 
the authors directly: We present an acceleration method for CNNs, where we prune filters from CNNs that are identified as having a small effect on the output accuracy. By removing whole filters in the network together with their connecting feature maps, the computation costs are reduced significantly. In contrast to pruning weights, this approach does not result in sparse connectivity patterns. Hence, it does not need the support of sparse convolution libraries and can work with existing efficient BLAS libraries for dense matrix multiplications. The implementation of the research by Hao et al. required us to add filter-pruning sensitivity analysis, and support for \"network thinning\". After performing filter-pruning sensitivity analysis to assess which layers are more sensitive to the pruning of filters, we execute distiller.L1RankedStructureParameterPruner once in order to rank the filters of each layer by their L1-norm values, and then we prune the schedule-prescribed sparsity level. Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank.yaml Checkpoint files: checkpoint_finetuned.pth.tar The excerpt from the schedule, displayed below, shows how we declare the L1RankedStructureParameterPruner. This class currently ranks filters only, but because in the future this class may support ranking of various structures, you need to specify for each parameter both the target sparsity level, and the structure type ('3D' is filter-wise pruning). pruners: filter_pruner: class: 'L1RankedStructureParameterPruner' reg_regims: 'module.layer1.0.conv1.weight': [0.6, '3D'] 'module.layer1.1.conv1.weight': [0.6, '3D'] 'module.layer1.2.conv1.weight': [0.6, '3D'] 'module.layer1.3.conv1.weight': [0.6, '3D'] In the policy, we specify that we want to invoke this pruner once, at epoch 180. 
Because we are starting from a network which was trained for 180 epochs (see Baseline training below), the filter ranking is performed right at the outset of this schedule. policies: - pruner: instance_name: filter_pruner epochs: [180] Following the pruning, we want to \"physically\" remove the pruned filters from the network, which involves reconfiguring the Convolutional layers and the parameter tensors. When we remove filters from Convolution layer n we need to perform several changes to the network: 1. Shrink layer n 's weights tensor, leaving only the \"important\" filters. 2. Configure layer n 's .out_channels member to its new, smaller, value. 3. If a BN layer follows layer n , then it also needs to be reconfigured and its scale and shift parameter vectors need to be shrunk. 4. If a Convolution layer follows the BN layer, then it will have fewer input channels, which requires reconfiguration and shrinking of its weights. All of this is performed by distiller.ResnetCifarFilterRemover which is also scheduled at epoch 180. We call this process \"network thinning\". extensions: net_thinner: class: 'FilterRemover' thinning_func_str: remove_filters arch: 'resnet56_cifar' dataset: 'cifar10' Network thinning requires us to understand the layer connectivity and data-dependency of the DNN, and we are working on a robust method to perform this. On networks with topologies similar to ResNet (residuals) and GoogLeNet (inception), which have several inputs and outputs to/from Convolution layers, there are extra details to consider. Our current implementation is specific to certain layers in ResNet and is a bit fragile. We will continue to improve and generalize this.","title":"Pruning Filters for Efficient ConvNets"},{"location":"model_zoo.html#baseline-training_1","text":"We started by training the baseline ResNet56-Cifar dense network (180 epochs) since we didn't have a pre-trained model. 
Distiller schedule: distiller/examples/pruning_filters_for_efficient_convnets/resnet56_cifar_baseline_training.yaml Checkpoint files: checkpoint.resnet56_cifar_baseline.pth.tar","title":"Baseline training"},{"location":"model_zoo.html#results_3","text":"We trained a ResNet56-Cifar10 network and achieve accuracy results which are on-par with published results: Top1: 92.970 and Top5: 99.740. We used Hao et al.'s algorithm to remove 37.3% of the original convolution MACs, while maintaining virtually the same accuracy as the baseline: Top1: 92.830 and Top5: 99.760","title":"Results"},{"location":"prepare_model_quant.html","text":"Preparing a Model for Quantization Background Note: If you just want a run-down of the required modifications to make sure a model is properly quantized in Distiller, you can skip this part and head right to the next section. Distiller provides an automatic mechanism to convert a \"vanilla\" FP32 PyTorch model to a quantized counterpart (for quantization-aware training and post-training quantization ). This mechanism works at the PyTorch \"Module\" level. By \"Module\" we refer to any sub-class of the torch.nn.Module class . The Distiller Quantizer can detect modules, and replace them with other modules. However, it is not a requirement in PyTorch that all operations be defined as modules. Operations are often executed via direct overloaded tensor operator ( + , - , etc.) and functions under the torch namespace (e.g. torch.cat() ). There is also the torch.nn.functional namespace, which provides functional equivalents to modules provided in torch.nn . When an operation does not maintain any state, even if it has a dedicated nn.Module , it'll often be invoked via its functional counterpart. For example - calling nn.functional.relu() instead of creating an instance of nn.ReLU and invoking that. Such non-module operations are called directly from the module's forward function. 
There are ways to discover these operations up-front, which are used in Distiller for different purposes. Even so, we cannot replace these operations without resorting to rather \"dirty\" Python tricks, which we would rather not do for numerous reasons. In addition, there might be cases where the same module instance is re-used multiple times in the forward function. This is also a problem for Distiller. There are several flows that will not work as expected if each call to an operation is not \"tied\" to a dedicated module instance. For example: When collecting statistics, each invocation of a re-used module will overwrite the statistics collected for the previous invocation. We end up with statistics missing for all invocations except the last one. \"Net-aware\" quantization relies on a 1:1 mapping from each operation executed in the model to a module which invoked it. With re-used modules, this mapping is not 1:1 anymore. Hence, to make sure all supported operations in a model are properly quantized by Distiller, it might be necessary to modify the model code before passing it to the quantizer. Note that the exact set of supported operations might vary between the different available quantizers . Model Preparation To-Do List The steps required to prepare a model for quantization can be summarized as follows: Replace direct tensor operations with modules Replace re-used modules with dedicated instances Replace torch.nn.functional calls with equivalent modules Special cases - replace modules that aren't quantize-able with quantize-able variants In the next section we'll see an example of items 1-3 in this list. As for \"special cases\", at the moment the only such case is LSTM. See the section after the example for details. Model Preparation Example We'll use the following simple module as an example. 
This module is loosely based on the ResNet implementation in torchvision , with some changes that don't make much sense and are meant to demonstrate the different modifications that might be required. import torch import torch.nn as nn import torch.nn.functional as F class BasicModule(nn.Module): def __init__(self, in_ch, out_ch, kernel_size): super(BasicModule, self).__init__() self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size) self.bn1 = nn.BatchNorm2d(out_ch) self.relu = nn.ReLU() self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size) self.bn2 = nn.BatchNorm2d(out_ch) def forward(self, x): identity = x out = self.conv1(x) out = self.bn1(out) out = self.relu(out) out = self.conv2(out) out = self.bn2(out) # (1) Overloaded tensor addition operation # Alternatively, could be called via a tensor function: out.add_(identity) out += identity # (2) Relu module re-used out = self.relu(out) # (3) Using operation from 'torch' namespace out = torch.cat([identity, out], dim=1) # (4) Using function from torch.nn.functional out = F.sigmoid(out) return out Replace direct tensor operations with modules The addition (1) and concatenation (3) operations in the forward function are examples of direct tensor operations. These operations do not have equivalent modules defined in torch.nn . Hence, if we want to quantize these operations, we must implement modules that will call them. In Distiller we've implemented a few simple wrapper modules for common operations. These are defined in the distiller.modules namespace. Specifically, the addition operation should be replaced with the EltWiseAdd module, and the concatenation operation with the Concat module. Check out the code here to see the available modules. Replace re-used modules with dedicated instances The relu operation above is called via a module, but the same instance is used for both calls (2). We need to create a second instance of nn.ReLU in __init__ and use that for the second call during forward . 
Replace torch.nn.functional calls with equivalent modules The sigmoid (4) operation is invoked using the functional interface. Luckily, operations in torch.nn.functional have equivalent modules, so we can just use those. In this case we need to create an instance of torch.nn.Sigmoid . Putting it all together After making all of the changes detailed above, we end up with: import torch.nn as nn import torch.nn.functional as F import distiller.modules class BasicModule(nn.Module): def __init__(self, in_ch, out_ch, kernel_size): super(BasicModule, self).__init__() self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size) self.bn1 = nn.BatchNorm2d(out_ch) self.relu1 = nn.ReLU() self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size) self.bn2 = nn.BatchNorm2d(out_ch) # Fixes start here # (1) Replace '+=' with an inplace module self.add = distiller.modules.EltWiseAdd(inplace=True) # (2) Separate instance for each relu call self.relu2 = nn.ReLU() # (3) Dedicated module instead of tensor op self.concat = distiller.modules.Concat(dim=1) # (4) Dedicated module instead of functional call self.sigmoid = nn.Sigmoid() def forward(self, x): identity = x out = self.conv1(x) out = self.bn1(out) out = self.relu1(out) out = self.conv2(out) out = self.bn2(out) out = self.add(out, identity) out = self.relu2(out) out = self.concat(identity, out) out = self.sigmoid(out) return out Special Case: LSTM (a \"compound\" module) Background LSTMs present a special case. An LSTM block is comprised of building blocks, such as fully-connected layers and sigmoid/tanh non-linearities, all of which have dedicated modules in torch.nn . However, the LSTM implementation provided in PyTorch does not use these building blocks. For optimization purposes, all of the internal operations are implemented at the C++ level. The only parts of the model exposed at the Python level are the parameters of the fully-connected layers. 
Hence, all we can do with the PyTorch LSTM module is to quantize the inputs/outputs of the entire block, and to quantize the FC layers' parameters. We cannot quantize the internal stages of the block at all. In addition to just quantizing the internal stages, we'd also like the option to control the quantization parameters of each internal stage separately. What to do Distiller provides a \"modular\" implementation of LSTM, comprised entirely of operations defined at the Python level. We provide an implementation of DistillerLSTM and DistillerLSTMCell , paralleling LSTM and LSTMCell provided by PyTorch. See the implementation here . A function to convert all LSTM instances in the model to the Distiller variant is also provided: model = distiller.modules.convert_model_to_distiller_lstm(model) To see an example of this conversion, and of mixed-precision quantization within an LSTM block, check out our tutorial on word-language model quantization here .","title":"Preparing a Model for Quantization"},{"location":"prepare_model_quant.html#preparing-a-model-for-quantization","text":"","title":"Preparing a Model for Quantization"},{"location":"prepare_model_quant.html#background","text":"Note: If you just want a run-down of the required modifications to make sure a model is properly quantized in Distiller, you can skip this part and head right to the next section. Distiller provides an automatic mechanism to convert a \"vanilla\" FP32 PyTorch model to a quantized counterpart (for quantization-aware training and post-training quantization ). This mechanism works at the PyTorch \"Module\" level. By \"Module\" we refer to any sub-class of the torch.nn.Module class . The Distiller Quantizer can detect modules, and replace them with other modules. However, it is not a requirement in PyTorch that all operations be defined as modules. Operations are often executed via direct overloaded tensor operator ( + , - , etc.) and functions under the torch namespace (e.g. 
torch.cat() ). There is also the torch.nn.functional namespace, which provides functional equivalents to modules provided in torch.nn . When an operation does not maintain any state, even if it has a dedicated nn.Module , it'll often be invoked via its functional counterpart. For example - calling nn.functional.relu() instead of creating an instance of nn.ReLU and invoking that. Such non-module operations are called directly from the module's forward function. There are ways to discover these operations up-front, which are used in Distiller for different purposes. Even so, we cannot replace these operations without resorting to rather \"dirty\" Python tricks, which we would rather not do for numerous reasons. In addition, there might be cases where the same module instance is re-used multiple times in the forward function. This is also a problem for Distiller. There are several flows that will not work as expected if each call to an operation is not \"tied\" to a dedicated module instance. For example: When collecting statistics, each invocation of a re-used module will overwrite the statistics collected for the previous invocation. We end up with statistics missing for all invocations except the last one. \"Net-aware\" quantization relies on a 1:1 mapping from each operation executed in the model to a module which invoked it. With re-used modules, this mapping is not 1:1 anymore. Hence, to make sure all supported operations in a model are properly quantized by Distiller, it might be necessary to modify the model code before passing it to the quantizer. 
Note that the exact set of supported operations might vary between the different available quantizers .","title":"Background"},{"location":"prepare_model_quant.html#model-preparation-to-do-list","text":"The steps required to prepare a model for quantization can be summarized as follows: Replace direct tensor operations with modules Replace re-used modules with dedicated instances Replace torch.nn.functional calls with equivalent modules Special cases - replace modules that aren't quantize-able with quantize-able variants In the next section we'll see an example of items 1-3 in this list. As for \"special cases\", at the moment the only such case is LSTM. See the section after the example for details.","title":"Model Preparation To-Do List"},{"location":"prepare_model_quant.html#model-preparation-example","text":"We'll use the following simple module as an example. This module is loosely based on the ResNet implementation in torchvision , with some changes that don't make much sense and are meant to demonstrate the different modifications that might be required. 
import torch import torch.nn as nn import torch.nn.functional as F class BasicModule(nn.Module): def __init__(self, in_ch, out_ch, kernel_size): super(BasicModule, self).__init__() self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size) self.bn1 = nn.BatchNorm2d(out_ch) self.relu = nn.ReLU() self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size) self.bn2 = nn.BatchNorm2d(out_ch) def forward(self, x): identity = x out = self.conv1(x) out = self.bn1(out) out = self.relu(out) out = self.conv2(out) out = self.bn2(out) # (1) Overloaded tensor addition operation # Alternatively, could be called via a tensor function: out.add_(identity) out += identity # (2) Relu module re-used out = self.relu(out) # (3) Using operation from 'torch' namespace out = torch.cat([identity, out], dim=1) # (4) Using function from torch.nn.functional out = F.sigmoid(out) return out","title":"Model Preparation Example"},{"location":"prepare_model_quant.html#replace-direct-tensor-operations-with-modules","text":"The addition (1) and concatenation (3) operations in the forward function are examples of direct tensor operations. These operations do not have equivalent modules defined in torch.nn . Hence, if we want to quantize these operations, we must implement modules that will call them. In Distiller we've implemented a few simple wrapper modules for common operations. These are defined in the distiller.modules namespace. Specifically, the addition operation should be replaced with the EltWiseAdd module, and the concatenation operation with the Concat module. Check out the code here to see the available modules.","title":"Replace direct tensor operations with modules"},{"location":"prepare_model_quant.html#replace-re-used-modules-with-dedicated-instances","text":"The relu operation above is called via a module, but the same instance is used for both calls (2). 
We need to create a second instance of nn.ReLU in __init__ and use that for the second call during forward .","title":"Replace re-used modules with dedicated instances"},{"location":"prepare_model_quant.html#replace-torchnnfunctional-calls-with-equivalent-modules","text":"The sigmoid (4) operation is invoked using the functional interface. Luckily, operations in torch.nn.functional have equivalent modules, so we can just use those. In this case we need to create an instance of torch.nn.Sigmoid .","title":"Replace torch.nn.functional calls with equivalent modules"},{"location":"prepare_model_quant.html#putting-it-all-together","text":"After making all of the changes detailed above, we end up with: import torch.nn as nn import torch.nn.functional as F import distiller.modules class BasicModule(nn.Module): def __init__(self, in_ch, out_ch, kernel_size): super(BasicModule, self).__init__() self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size) self.bn1 = nn.BatchNorm2d(out_ch) self.relu1 = nn.ReLU() self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size) self.bn2 = nn.BatchNorm2d(out_ch) # Fixes start here # (1) Replace '+=' with an inplace module self.add = distiller.modules.EltWiseAdd(inplace=True) # (2) Separate instance for each relu call self.relu2 = nn.ReLU() # (3) Dedicated module instead of tensor op self.concat = distiller.modules.Concat(dim=1) # (4) Dedicated module instead of functional call self.sigmoid = nn.Sigmoid() def forward(self, x): identity = x out = self.conv1(x) out = self.bn1(out) out = self.relu1(out) out = self.conv2(out) out = self.bn2(out) out = self.add(out, identity) out = self.relu2(out) out = self.concat(identity, out) out = self.sigmoid(out) return out","title":"Putting it all together"},{"location":"prepare_model_quant.html#special-case-lstm-a-compound-module","text":"","title":"Special Case: LSTM (a \"compound\" module)"},{"location":"prepare_model_quant.html#background_1","text":"LSTMs present a special case. 
An LSTM block is comprised of building blocks, such as fully-connected layers and sigmoid/tanh non-linearities, all of which have dedicated modules in torch.nn . However, the LSTM implementation provided in PyTorch does not use these building blocks. For optimization purposes, all of the internal operations are implemented at the C++ level. The only parts of the model exposed at the Python level are the parameters of the fully-connected layers. Hence, all we can do with the PyTorch LSTM module is to quantize the inputs/outputs of the entire block, and to quantize the FC layers' parameters. We cannot quantize the internal stages of the block at all. In addition to just quantizing the internal stages, we'd also like the option to control the quantization parameters of each internal stage separately.","title":"Background"},{"location":"prepare_model_quant.html#what-to-do","text":"Distiller provides a \"modular\" implementation of LSTM, comprised entirely of operations defined at the Python level. We provide an implementation of DistillerLSTM and DistillerLSTMCell , paralleling LSTM and LSTMCell provided by PyTorch. See the implementation here . A function to convert all LSTM instances in the model to the Distiller variant is also provided: model = distiller.modules.convert_model_to_distiller_lstm(model) To see an example of this conversion, and of mixed-precision quantization within an LSTM block, check out our tutorial on word-language model quantization here .","title":"What to do"},{"location":"pruning.html","text":"Pruning A common methodology for inducing sparsity in weights and activations is called pruning . Pruning is the application of a binary criterion to decide which weights to prune: weights which match the pruning criterion are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process. We can prune weights, biases, and activations. 
Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros. Let's define sparsity Sparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) The \\(l_0\\)-\"norm\" function measures how many non-zero elements are in a tensor x : \\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\] In other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool. Sometimes it helps to think about density, which is based on the number of non-zero elements (NNZ) and is sparsity's complement: \\[ density = 1 - sparsity \\] You can use distiller.sparsity and distiller.density to query a PyTorch tensor's sparsity and density. What is weights pruning? Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. Pruning requires a criterion for choosing which elements to prune - this is called the pruning criterion . The most common pruning criterion is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the distiller.MagnitudeParameterPruner class. 
The idea behind this method is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. A related idea motivating pruning is that models are over-parametrized and contain redundant logic and features. Therefore, some of these redundancies can be removed by setting their weights to zero. And yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). Another way to look at it is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to find these sparse solutions. Pruning schedule The most straightforward way to prune is to take a trained model and prune it once; also called one-shot pruning . In Learning both Weights and Connections for Efficient Neural Networks Song Han et al. show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \"reducing 2x the connections without losing accuracy even without retraining.\" However, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called iterative pruning , and the retraining that follows pruning is often referred to as fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the pruning schedule . 
We can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights. The decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. Distiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler). Pruning granularity Pruning individual weight elements is called element-wise pruning , and it is also sometimes referred to as fine-grained pruning. Coarse-grained pruning - also referred to as structured pruning , group pruning , or block pruning - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed. Sensitivity analysis The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. The idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning. 
The evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. Much as we can prune structures, we can also perform sensitivity analysis on structures. Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters. The authors of Pruning Filters for Efficient ConvNets describe how they do sensitivity analysis: \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\" The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distiller's perform_sensitivity_analysis utility function. As reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are. 
References Song Han, Jeff Pool, John Tran, William J. Dally . Learning both Weights and Connections for Efficient Neural Networks , arXiv:1607.04381v2, 2015. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf . Pruning Filters for Efficient ConvNets , arXiv:1608.08710v3, 2017.","title":"Pruning"},{"location":"pruning.html#pruning","text":"A common methodology for inducing sparsity in weights and activations is called pruning . Pruning is the application of a binary criteria to decide which weights to prune: weights which match the pruning criteria are assigned a value of zero. Pruned elements are \"trimmed\" from the model: we zero their values and also make sure they don't take part in the back-propagation process. We can prune weights, biases, and activations. Biases are few and their contribution to a layer's output is relatively large, so there is little incentive to prune them. We usually see sparse activations following a ReLU layer, because ReLU quenches negative activations to exact zero (\\(ReLU(x): max(0,x)\\)). Sparsity in weights is less common, as weights tend to be very small, but are often not exact zeros.","title":"Pruning"},{"location":"pruning.html#lets-define-sparsity","text":"Sparsity is a measure of how many elements in a tensor are exact zeros, relative to the tensor size. A tensor is considered sparse if \"most\" of its elements are zero. How much is \"most\", is not strictly defined, but when you see a sparse tensor you know it ;-) The \\(l_0\\)-\"norm\" function measures how many non-zero elements are in a tensor x : \\[\\lVert x \\rVert_0\\;=\\;|x_1|^0 + |x_2|^0 + ... + |x_n|^0 \\] In other words, an element contributes either a value of 1 or 0 to \\(l_0\\). Anything but an exact zero contributes a value of 1 - that's pretty cool. 
Sometimes it helps to think about density, the number of non-zero elements (NNZ) and sparsity's complement: \\[ density = 1 - sparsity \\] You can use distiller.sparsity and distiller.density to query a PyTorch tensor's sparsity and density.","title":"Let's define sparsity"},{"location":"pruning.html#what-is-weights-pruning","text":"Weights pruning, or model pruning, is a set of methods to increase the sparsity (amount of zero-valued elements in a tensor) of a network's weights. In general, the term 'parameters' refers to both weights and bias tensors of a model. Biases are rarely, if ever, pruned because there are very few bias elements compared to weights elements, and it is just not worth the trouble. Pruning requires a criteria for choosing which elements to prune - this is called the pruning criteria . The most common pruning criteria is the absolute value of each element: the element's absolute value is compared to some threshold value, and if it is below the threshold the element is set to zero (i.e. pruned) . This is implemented by the distiller.MagnitudeParameterPruner class. The idea behind this method, is that weights with small \\(l_1\\)-norms (absolute value) contribute little to the final result (low saliency), so they are less important and can be removed. A related idea motivating pruning, is that models are over-parametrized and contain redundant logic and features. Therefore, some of these redundancies can be removed by setting their weights to zero. And yet another way to think of pruning is to phrase it as a search for a set of weights with as many zeros as possible, which still produces acceptable inference accuracies compared to the dense-model (non-pruned model). 
Another way to look at it is to imagine that because of the very high-dimensionality of the parameter space, the immediate space around the dense-model's solution likely contains some sparse solutions, and we want to find these sparse solutions.","title":"What is weights pruning?"},{"location":"pruning.html#pruning-schedule","text":"The most straightforward way to prune is to take a trained model and prune it once; also called one-shot pruning . In Learning both Weights and Connections for Efficient Neural Networks Song Han et al. show that this is surprisingly effective, but also leaves a lot of potential sparsity untapped. The surprise is what they call the \"free lunch\" effect: \"reducing 2x the connections without losing accuracy even without retraining.\" However, they also note that when employing a pruning-followed-by-retraining regimen, they can achieve much better results (higher sparsity at no accuracy loss). This is called iterative pruning , and the retraining that follows pruning is often referred to as fine-tuning . How the pruning criteria changes between iterations, how many iterations we perform and how often, and which tensors are pruned - this is collectively called the pruning schedule . We can think of iterative pruning as repeatedly learning which weights are important, removing the least important ones based on some importance criteria, and then retraining the model to let it \"recover\" from the pruning by adjusting the remaining weights. At each iteration, we prune more weights. The decision of when to stop pruning is also expressed in the schedule, and it depends on the pruning algorithm. For example, if we are trying to achieve a specific sparsity level, then we stop when the pruning achieves that level. And if we are pruning weights structures in order to reduce the required compute budget, then we stop the pruning when this compute reduction is achieved. 
Distiller supports expressing the pruning schedule as a YAML file (which is then executed by an instance of a PruningScheduler).","title":"Pruning schedule"},{"location":"pruning.html#pruning-granularity","text":"Pruning individual weight elements is called element-wise pruning , and it is also sometimes referred to as fine-grained pruning. Coarse-grained pruning - also referred to as structured pruning , group pruning , or block pruning - is pruning entire groups of elements which have some significance. Groups come in various shapes and sizes, but an easy to visualize group-pruning is filter-pruning, in which entire filters are removed.","title":"Pruning granularity"},{"location":"pruning.html#sensitivity-analysis","text":"The hard part about inducing sparsity via pruning is determining what threshold, or sparsity level, to use for each layer's tensors. Sensitivity analysis is a method that tries to help us rank the tensors by their sensitivity to pruning. The idea is to set the pruning level (percentage) of a specific layer, and then to prune once, run an evaluation on the test dataset and record the accuracy score. We do this for all of the parameterized layers, and for each layer we examine several sparsity levels. This should teach us about the \"sensitivity\" of each of the layers to pruning. The evaluated model should be trained to maximum accuracy before running the analysis, because we aim to understand the behavior of the trained model's performance in relation to pruning of a specific weights tensor. Much as we can prune structures, we can also perform sensitivity analysis on structures. Distiller implements element-wise pruning sensitivity analysis using the \\(l_1\\)-norm of individual elements; and filter-wise pruning sensitivity analysis using the mean \\(l_1\\)-norm of filters. 
The authors of Pruning Filters for Efficient ConvNets describe how they do sensitivity analysis: \"To understand the sensitivity of each layer, we prune each layer independently and evaluate the resulting pruned network\u2019s accuracy on the validation set. Figure 2(b) shows that layers that maintain their accuracy as filters are pruned away correspond to layers with larger slopes in Figure 2(a). On the contrary, layers with relatively flat slopes are more sensitive to pruning. We empirically determine the number of filters to prune for each layer based on their sensitivity to pruning. For deep networks such as VGG-16 or ResNets, we observe that layers in the same stage (with the same feature map size) have a similar sensitivity to pruning. To avoid introducing layer-wise meta-parameters, we use the same pruning ratio for all layers in the same stage. For layers that are sensitive to pruning, we prune a smaller percentage of these layers or completely skip pruning them.\" The diagram below shows the results of running an element-wise sensitivity analysis on Alexnet, using Distiller's perform_sensitivity_analysis utility function. As reported by Song Han, and exhibited in the diagram, in Alexnet the feature detecting layers (convolution layers) are more sensitive to pruning, and their sensitivity drops, the deeper they are. The fully-connected layers are much less sensitive, which is great, because that's where most of the parameters are.","title":"Sensitivity analysis"},{"location":"pruning.html#references","text":"Song Han, Jeff Pool, John Tran, William J. Dally . Learning both Weights and Connections for Efficient Neural Networks , arXiv:1607.04381v2, 2015. Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, Hans Peter Graf . Pruning Filters for Efficient ConvNets , arXiv:1608.08710v3, 2017.","title":"References"},{"location":"quantization.html","text":"Quantization Quantization refers to the process of reducing the number of bits that represent a number. 
In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress. Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope. Motivation: Overall Efficiency The more obvious benefit from quantization is significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. Additionally integer compute is faster than floating point compute. It is also much more area and energy efficient : INT8 Operation Energy Saving vs FP32 Area Saving vs FP32 Add 30x 116x Multiply 18.5x 27x ( Dally, 2015 ) Note that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ). Integer vs. FP32 There are two main attributes when discussing a numerical format. The first is dynamic range , which refers to the range of representable numbers. The second one is how many values can be represented within the dynamic range, which in turn determines the precision / resolution of the format (the distance between two numbers). 
For all integer formats, the dynamic range is [-2^{n-1} .. 2^{n-1}-1] , where n is the number of bits. So for INT8 the range is [-128 .. 127] , and for INT4 it is [-8 .. 7] (we're limiting ourselves to signed integers for now). The number of representable values is 2^n . Contrast that with FP32, where the dynamic range is \\pm 3.4\\ x\\ 10^{38} , and approximately 4.2\\ x\\ 10^9 values can be represented. We can immediately see that FP32 is much more versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition the dynamic range can differ between layers in the model. In order to be able to represent these different distributions with an integer format, a scale factor is used to map the dynamic range of the tensor to the integer format range. But still we remain with the issue of having a significantly lower number of representable values, that is - much lower resolution. Note that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. Courbariaux et al., 2014 scale using only shifts, eliminating the floating point operation. In GEMMLWOP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible. Avoiding Overflows Convolution and fully connected layers involve the storing of intermediate results in accumulators. Due to the limited dynamic range of integer formats, if we would use the same bit-width for the weights and activation, and for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. The result of multiplying two n -bit integers is, at most, a 2n -bit number. 
In convolution layers, such multiplications are accumulated c\\cdot k^2 times, where c is the number of input channels and k is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be 2n + M -bits wide, where M is at least log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use less than 32 -bits, depending on the expected use cases and layer widths. \"Conservative\" Quantization: INT8 In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ( Gysel et al., 2018 ). As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\". Offline means gathering activations statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scale factors are calculated and are fixed once the model is deployed. This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation. Online means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive. 
It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to simply use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ). Going further, Banner et al., 2018 have proposed a method for analytically computing the clipping value under certain conditions. Another possible optimization point is scale-factor scope . The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels. When used to directly quantize a model without re-training, as described so far, this method is commonly referred to as post-training quantization . However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy ( Benoit et al., 2018 , Krishnamoorthi, 2018 ). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity. In such cases, quantization-aware training is used. \"Aggressive\" Quantization: INT4 and Lower Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. 
They usually employ one or more of the following concepts in order to improve model accuracy: Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the next section . Zhou S et al., 2016 have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods require a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a knowledge distillation training setup (see here ). Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 , Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above). Modifying network structure : Mishra et al., 2018 try to compensate for the loss of information due to quantization by using wider layers (more channels). Lin et al., 2017 proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall. First and last layer : Many methods do not quantize the first and last layer of the model. 
It has been observed by Han et al., 2015 that the first convolutional layer is more sensitive to weights pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 , Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However, Choi et al., 2018 showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy. Iterative quantization : Most methods quantize the entire model at once. Zhou A et al., 2017 employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at a time, followed by several epochs of re-training to recover the accuracy loss from quantization. Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 , Zhu et al., 2016 ). Quantization-Aware Training As mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this: A full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). 
Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. In the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\". Straight-Through Estimator An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) ( Hinton et al., 2012 , Bengio, 2013 ), which simply passes the gradient through these functions as-is. References William Dally . High-Performance Hardware for Machine Learning. Tutorial, NIPS, 2015 Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016 Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications. arxiv:1412.7024 Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2018 Szymon Migacz . 8-bit Inference with TensorRT. GTC San Jose, 2017 Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . 
DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arxiv:1606.06160 Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. ICLR, 2017 Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks. ICLR, 2018 Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks. arxiv:1805.06085 Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network. NIPS, 2017 Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network. NIPS, 2015 Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks. arxiv:1605.04711 Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization. arxiv:1612.01064 Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arxiv:1308.3432 Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed . Neural Networks for Machine Learning. Coursera, video lectures, 2012 Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko . Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. ECCV, 2018 Raghuraman Krishnamoorthi . Quantizing deep convolutional networks for efficient inference: A whitepaper arxiv:1806.08342 Ron Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry . ACIQ: Analytical Clipping for Integer Quantization of neural networks arxiv:1810.05723","title":"Quantization"},{"location":"quantization.html#quantization","text":"Quantization refers to the process of reducing the number of bits that represent a number. 
In the context of deep learning, the predominant numerical format used for research and for deployment has so far been 32-bit floating point, or FP32. However, the desire for reduced bandwidth and compute requirements of deep learning models has driven research into using lower-precision numerical formats. It has been extensively demonstrated that weights and activations can be represented using 8-bit integers (or INT8) without incurring significant loss in accuracy. The use of even lower bit-widths, such as 4/2/1-bits, is an active field of research that has also shown great progress. Note that this discussion is on quantization only in the context of more efficient inference. Using lower-precision numerics for more efficient training is currently out of scope.","title":"Quantization"},{"location":"quantization.html#motivation-overall-efficiency","text":"The more obvious benefit from quantization is significantly reduced bandwidth and storage . For instance, using INT8 for weights and activations consumes 4x less overall bandwidth compared to FP32. Additionally integer compute is faster than floating point compute. It is also much more area and energy efficient : INT8 Operation Energy Saving vs FP32 Area Saving vs FP32 Add 30x 116x Multiply 18.5x 27x ( Dally, 2015 ) Note that very aggressive quantization can yield even more efficiency. If weights are binary (-1, 1) or ternary (-1, 0, 1 using 2-bits), then convolution and fully-connected layers can be computed with additions and subtractions only, removing multiplications completely. If activations are binary as well, then additions can also be removed, in favor of bitwise operations ( Rastegari et al., 2016 ).","title":"Motivation: Overall Efficiency"},{"location":"quantization.html#integer-vs-fp32","text":"There are two main attributes when discussing a numerical format. The first is dynamic range , which refers to the range of representable numbers. 
The second one is how many values can be represented within the dynamic range, which in turn determines the precision / resolution of the format (the distance between two numbers). For all integer formats, the dynamic range is [-2^{n-1} .. 2^{n-1}-1] , where n is the number of bits. So for INT8 the range is [-128 .. 127] , and for INT4 it is [-8 .. 7] (we're limiting ourselves to signed integers for now). The number of representable values is 2^n . Contrast that with FP32, where the dynamic range is \\pm 3.4\\ x\\ 10^{38} , and approximately 4.2\\ x\\ 10^9 values can be represented. We can immediately see that FP32 is much more versatile , in that it is able to represent a wide range of distributions accurately. This is a nice property for deep learning models, where the distributions of weights and activations are usually very different (at least in dynamic range). In addition, the dynamic range can differ between layers in the model. In order to be able to represent these different distributions with an integer format, a scale factor is used to map the dynamic range of the tensor to the integer format range. Even so, we are left with significantly fewer representable values, that is, much lower resolution. Note that this scale factor is, in most cases, a floating-point number. Hence, even when using integer numerics, some floating-point computations remain. Courbariaux et al., 2014 scale using only shifts, eliminating the floating point operation. In GEMMLOWP , the FP32 scale factor is approximated using an integer or fixed-point multiplication followed by a shift operation. In many cases the effect of this approximation on accuracy is negligible.","title":"Integer vs. FP32"},{"location":"quantization.html#avoiding-overflows","text":"Convolution and fully connected layers store intermediate results in accumulators. 
Due to the limited dynamic range of integer formats, if we used the same bit-width for the weights and activations, and for the accumulators, we would likely overflow very quickly. Therefore, accumulators are usually implemented with higher bit-widths. The result of multiplying two n -bit integers is, at most, a 2n -bit number. In convolution layers, such multiplications are accumulated c\\cdot k^2 times, where c is the number of input channels and k is the kernel width (assuming a square kernel). Hence, to avoid overflowing, the accumulator should be 2n + M -bits wide, where M is at least log_2(c\\cdot k^2) . In many cases 32-bit accumulators are used, however for INT4 and lower it might be possible to use fewer than 32 bits, depending on the expected use cases and layer widths.","title":"Avoiding Overflows"},{"location":"quantization.html#conservative-quantization-int8","text":"In many cases, taking a model trained for FP32 and directly quantizing it to INT8, without any re-training, can result in a relatively low loss of accuracy (which may or may not be acceptable, depending on the use case). Some fine-tuning can further improve the accuracy ( Gysel et al., 2018 ). As mentioned above, a scale factor is used to adapt the dynamic range of the tensor at hand to that of the integer format. This scale factor needs to be calculated per-layer per-tensor. The simplest way is to map the min/max values of the float tensor to the min/max of the integer format. For weights and biases this is easy, as they are set once training is complete. For activations, the min/max float values can be obtained \"online\" during inference, or \"offline\". Offline means gathering activation statistics before deploying the model, either during training or by running a few \"calibration\" batches on the trained FP32 model. Based on these gathered statistics, the scale factors are calculated and are fixed once the model is deployed. 
This method has the risk of encountering values outside the previously observed ranges at runtime. These values will be clipped, which might lead to accuracy degradation. Online means calculating the min/max values for each tensor dynamically during runtime. In this method clipping cannot occur, however the added computation resources required to calculate the min/max values at runtime might be prohibitive. It is important to note, however, that the full float range of an activations tensor usually includes elements which are statistically outliers. These values can be discarded by using a narrower min/max range, effectively allowing some clipping to occur in favor of increasing the resolution provided to the part of the distribution containing most of the information. A simple method which can yield nice results is to use an average of the observed min/max values instead of the actual values. Alternatively, statistical measures can be used to intelligently select where to clip the original range in order to preserve as much information as possible ( Migacz, 2017 ). Going further, Banner et al., 2018 have proposed a method for analytically computing the clipping value under certain conditions. Another possible optimization point is scale-factor scope . The most common way is to use a single scale-factor per-layer, but it is also possible to calculate a scale-factor per-channel. This can be beneficial if the weight distributions vary greatly between channels. When used to directly quantize a model without re-training, as described so far, this method is commonly referred to as post-training quantization . However, recent publications have shown that there are cases where post-training quantization to INT8 doesn't preserve accuracy ( Jacob et al., 2018 , Krishnamoorthi, 2018 ). Namely, smaller models such as MobileNet seem to not respond as well to post-training quantization, presumably due to their smaller representational capacity. 
In such cases, quantization-aware training is used.","title":"\"Conservative\" Quantization: INT8"},{"location":"quantization.html#aggressive-quantization-int4-and-lower","text":"Naively quantizing a FP32 model to INT4 and lower usually incurs significant accuracy degradation. Many works have tried to mitigate this effect. They usually employ one or more of the following concepts in order to improve model accuracy: Training / Re-Training : For INT4 and lower, training is required in order to obtain reasonable accuracy. The training loop is modified to take quantization into account. See details in the next section . Zhou S et al., 2016 have shown that bootstrapping the quantized model with trained FP32 weights leads to higher accuracy, as opposed to training from scratch. Other methods require a trained FP32 model, either as a starting point ( Zhou A et al., 2017 ), or as a teacher network in a knowledge distillation training setup (see here ). Replacing the activation function : The most common activation function in vision models is ReLU, which is unbounded. That is - its dynamic range is not limited for positive inputs. This is very problematic for INT4 and below due to the very limited range and resolution. Therefore, most methods replace ReLU with another function which is bounded. In some cases a clipping function with hard coded values is used ( Zhou S et al., 2016 , Mishra et al., 2018 ). Another method learns the clipping value per layer, with better results ( Choi et al., 2018 ). Once the clipping value is set, the scale factor used for quantization is also set, and no further calibration steps are required (as opposed to INT8 methods described above). Modifying network structure : Mishra et al., 2018 try to compensate for the loss of information due to quantization by using wider layers (more channels). 
Lin et al., 2017 proposed a binary quantization method in which a single FP32 convolution is replaced with multiple binary convolutions, each scaled to represent a different \"base\", covering a larger dynamic range overall. First and last layer : Many methods do not quantize the first and last layer of the model. It has been observed by Han et al., 2015 that the first convolutional layer is more sensitive to weight pruning, and some quantization works cite the same reason and show it empirically ( Zhou S et al., 2016 , Choi et al., 2018 ). Some works also note that these layers usually constitute a very small portion of the overall computation within the model, further reducing the motivation to quantize them ( Rastegari et al., 2016 ). Most methods keep the first and last layers at FP32. However, Choi et al., 2018 showed that \"conservative\" quantization of these layers, e.g. to INT8, does not reduce accuracy. Iterative quantization : Most methods quantize the entire model at once. Zhou A et al., 2017 employ an iterative method, which starts with a trained FP32 baseline, and quantizes only a portion of the model at a time, followed by several epochs of re-training to recover the accuracy loss from quantization. Mixed Weights and Activations Precision : It has been observed that activations are more sensitive to quantization than weights ( Zhou S et al., 2016 ). Hence it is not uncommon to see experiments with activations quantized to a higher precision compared to weights. Some works have focused solely on quantizing weights, keeping the activations at FP32 ( Li et al., 2016 , Zhu et al., 2016 ).","title":"\"Aggressive\" Quantization: INT4 and Lower"},{"location":"quantization.html#quantization-aware-training","text":"As mentioned above, in order to minimize the loss of accuracy from \"aggressive\" quantization, many methods that target INT4 and lower (and in some cases for INT8 as well) involve training the model in a way that considers the quantization. 
This means training with quantization of weights and activations \"baked\" into the training procedure. The training graph usually looks like this: A full precision copy of the weights is maintained throughout the training process (\"weights_fp\" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference. In the diagram we show \"layer N\" as the conv + batch-norm + activation combination, but the same applies to fully-connected layers, element-wise operations, etc. During training, the operations within \"layer N\" can still run in full precision, with the \"quantize\" operations in the boundaries ensuring discrete-valued weights and activations. This is sometimes called \"simulated quantization\".","title":"Quantization-Aware Training"},{"location":"quantization.html#straight-through-estimator","text":"An important question in this context is how to back-propagate through the quantization functions. These functions are discrete-valued, hence their derivative is 0 almost everywhere. So, using their gradients as-is would severely hinder the learning process. An approximation commonly used to overcome this issue is the \"straight-through estimator\" (STE) ( Hinton et al., 2012 , Bengio, 2013 ), which simply passes the gradient through these functions as-is.","title":"Straight-Through Estimator"},{"location":"quantization.html#references","text":"William Dally . High-Performance Hardware for Machine Learning. Tutorial, NIPS, 2015 Mohammad Rastegari, Vicente Ordone, Joseph Redmon and Ali Farhadi . XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. ECCV, 2016 Matthieu Courbariaux, Yoshua Bengio and Jean-Pierre David . Training deep neural networks with low precision multiplications. 
arxiv:1412.7024 Philipp Gysel, Jon Pimentel, Mohammad Motamedi and Soheil Ghiasi . Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 2018 Szymon Migacz . 8-bit Inference with TensorRT. GTC San Jose, 2017 Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu and Yuheng Zou . DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arxiv:1606.06160 Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu and Yurong Chen . Incremental Network Quantization: Towards Lossless CNNs with Low-precision Weights. ICLR, 2017 Asit Mishra, Eriko Nurvitadhi, Jeffrey J Cook and Debbie Marr . WRPN: Wide Reduced-Precision Networks. ICLR, 2018 Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang, Vijayalakshmi Srinivasan and Kailash Gopalakrishnan . PACT: Parameterized Clipping Activation for Quantized Neural Networks. arxiv:1805.06085 Xiaofan Lin, Cong Zhao and Wei Pan . Towards Accurate Binary Convolutional Neural Network. NIPS, 2017 Song Han, Jeff Pool, John Tran and William Dally . Learning both Weights and Connections for Efficient Neural Network. NIPS, 2015 Fengfu Li, Bo Zhang and Bin Liu . Ternary Weight Networks. arxiv:1605.04711 Chenzhuo Zhu, Song Han, Huizi Mao and William J. Dally . Trained Ternary Quantization. arxiv:1612.01064 Yoshua Bengio, Nicholas Leonard and Aaron Courville . Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation. arxiv:1308.3432 Geoffrey Hinton, Nitish Srivastava, Kevin Swersky, Tijmen Tieleman and Abdelrahman Mohamed . Neural Networks for Machine Learning. Coursera, video lectures, 2012 Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam and Dmitry Kalenichenko . Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. ECCV, 2018 Raghuraman Krishnamoorthi . 
Quantizing deep convolutional networks for efficient inference: A whitepaper arxiv:1806.08342 Ron Banner, Yury Nahshan, Elad Hoffer and Daniel Soudry . ACIQ: Analytical Clipping for Integer Quantization of neural networks arxiv:1810.05723","title":"References"},{"location":"regularization.html","text":"Regularization In their book Deep Learning Ian Goodfellow et al. define regularization as \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\" PyTorch's optimizers use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance). In general, we can write this as: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) \\] And specifically, \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2 \\] Where W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W;x;y)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or criterion in the Distiller sample image classifier compression application). optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001) criterion = nn.CrossEntropyLoss() ... for input, target in dataset: optimizer.zero_grad() output = model(input) loss = criterion(output, target) loss.backward() optimizer.step() \\(\\lambda_R\\) is a scalar called the regularization strength , and it balances the data error and the regularization error. In PyTorch, this is the weight_decay argument. \\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a magnitude , or sizing, of the weights tensor. \\[ \\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l) \\] \\(L\\) is the number of layers in the network, and the notation above uses 1-based numbering for simplicity. 
The qualitative differences between the \\(l_2\\)-norm and the squared \\(l_2\\)-norm are explained in Deep Learning . Sparsity and Regularization We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\). \\[ \\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i| \\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1 \\] If we use both regularizers, we have: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1 \\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters()) ... 
loss = criterion(output, target) + lambda * l1_regularizer() Group Regularization In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add a group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[ R_g(W_l^{(G)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_2 = \\sum_{g=1}^{G} \\sqrt{\\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2} \\] where \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and can therefore help improve inference speed. Mao et al., 2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity can also be used. distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups. 
References Ian Goodfellow, Yoshua Bengio and Aaron Courville . Deep Learning , MIT Press, 2016. Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally . DSD: Dense-Sparse-Dense Training for Deep Neural Networks , arXiv:1607.04381v2, 2017. Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally . Exploring the Regularity of Sparse Structure in Convolutional Neural Networks , arXiv:1705.08922v3, 2017. Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung . Structured pruning of deep convolutional neural networks , arXiv:1512.08571, 2015","title":"Regularization"},{"location":"regularization.html#regularization","text":"In their book Deep Learning Ian Goodfellow et al. define regularization as \"any modification we make to a learning algorithm that is intended to reduce its generalization error, but not its training error.\" PyTorch's optimizers use \\(l_2\\) parameter regularization to limit the capacity of models (i.e. reduce the variance). In general, we can write this as: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) \\] And specifically, \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_2^2 \\] Where W is the collection of all weight elements in the network (i.e. this is model.parameters()), \\(loss(W;x;y)\\) is the total training loss, and \\(loss_D(W;x;y)\\) is the data loss (i.e. the error of the objective function, also called the loss function, or criterion in the Distiller sample image classifier compression application). optimizer = optim.SGD(model.parameters(), lr = 0.01, momentum=0.9, weight_decay=0.0001) criterion = nn.CrossEntropyLoss() ... for input, target in dataset: optimizer.zero_grad() output = model(input) loss = criterion(output, target) loss.backward() optimizer.step() \\(\\lambda_R\\) is a scalar called the regularization strength , and it balances the data error and the regularization error. 
In PyTorch, this is the weight_decay argument. \\(\\lVert W \\rVert_2^2\\) is the square of the \\(l_2\\)-norm of W, and as such it is a magnitude , or sizing, of the weights tensor. \\[ \\lVert W \\rVert_2^2 = \\sum_{l=1}^{L} \\sum_{i=1}^{n} |w_{l,i}|^2 \\;\\;where \\;n = torch.numel(w_l) \\] \\(L\\) is the number of layers in the network, and the notation above uses 1-based numbering for simplicity. The qualitative differences between the \\(l_2\\)-norm and the squared \\(l_2\\)-norm are explained in Deep Learning .","title":"Regularization"},{"location":"regularization.html#sparsity-and-regularization","text":"We mention regularization because there is an interesting interaction between regularization and some DNN sparsity-inducing methods. In Dense-Sparse-Dense (DSD) , Song Han et al. use pruning as a regularizer to improve a model's accuracy: \"Sparsity is a powerful form of regularization. Our intuition is that, once the network arrives at a local minimum given the sparsity constraint, relaxing the constraint gives the network more freedom to escape the saddle point and arrive at a higher-accuracy local minimum.\" Regularization can also be used to induce sparsity. To induce element-wise sparsity we can use the \\(l_1\\)-norm, \\(\\lVert W \\rVert_1\\). \\[ \\lVert W \\rVert_1 = l_1(W) = \\sum_{i=1}^{|W|} |w_i| \\] \\(l_2\\)-norm regularization reduces overfitting and improves a model's accuracy by shrinking large parameters, but it does not force these parameters to absolute zero. \\(l_1\\)-norm regularization sets some of the parameter elements to zero, therefore limiting the model's capacity while making the model simpler. This is sometimes referred to as feature selection and gives us another interpretation of pruning. One of Distiller's Jupyter notebooks explains how the \\(l_1\\)-norm regularizer induces sparsity, and how it interacts with \\(l_2\\)-norm regularization. 
If we configure weight_decay to zero and use \\(l_1\\)-norm regularization, then we have: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R \\lVert W \\rVert_1 \\] If we use both regularizers, we have: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_{R_2} \\lVert W \\rVert_2^2 + \\lambda_{R_1} \\lVert W \\rVert_1 \\] Class distiller.L1Regularizer implements \\(l_1\\)-norm regularization, and of course, you can also schedule regularization. l1_regularizer = distiller.s(model.parameters()) ... loss = criterion(output, target) + lambda * l1_regularizer()","title":"Sparsity and Regularization"},{"location":"regularization.html#group-regularization","text":"In Group Regularization, we penalize entire groups of parameter elements, instead of individual elements. Therefore, entire groups are either sparsified (i.e. all of the group elements have a value of zero) or not. The group structures have to be pre-defined. To the data loss, and the element-wise regularization (if any), we can add a group-wise regularization penalty. We represent all of the parameter groups in layer \\(l\\) as \\( W_l^{(G)} \\), and we add the penalty of all groups for all layers. It gets a bit messy, but not overly complicated: \\[ loss(W;x;y) = loss_D(W;x;y) + \\lambda_R R(W) + \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\] Let's denote all of the weight elements in group \\(g\\) as \\(w^{(g)}\\). \\[ R_g(W_l^{(G)}) = \\sum_{g=1}^{G} \\lVert w^{(g)} \\rVert_2 = \\sum_{g=1}^{G} \\sqrt{\\sum_{i=1}^{|w^{(g)}|} {(w_i^{(g)})}^2} \\] where \\(w^{(g)} \\in w^{(l)} \\) and \\( |w^{(g)}| \\) is the number of elements in \\( w^{(g)} \\). \\( \\lambda_g \\sum_{l=1}^{L} R_g(W_l^{(G)}) \\) is called the Group Lasso regularizer. Much as in \\(l_1\\)-norm regularization we sum the magnitudes of all tensor elements, in Group Lasso we sum the magnitudes of element structures (i.e. groups). 
Group Regularization is also called Block Regularization, Structured Regularization, or coarse-grained sparsity (remember that element-wise sparsity is sometimes referred to as fine-grained sparsity). Group sparsity exhibits regularity (i.e. its shape is regular), and can therefore help improve inference speed. Mao et al., 2017 provides an overview of some of the different groups: kernel, channel, filter, layers. Fiber structures such as matrix columns and rows, as well as various shaped structures (block sparsity), and even intra kernel strided sparsity can also be used. distiller.GroupLassoRegularizer currently implements most of these groups, and you can easily add new groups.","title":"Group Regularization"},{"location":"regularization.html#references","text":"Ian Goodfellow, Yoshua Bengio and Aaron Courville . Deep Learning , MIT Press, 2016. Song Han, Jeff Pool, Sharan Narang, Huizi Mao, Enhao Gong, Shijian Tang, Erich Elsen, Peter Vajda, Manohar Paluri, John Tran, Bryan Catanzaro, William J. Dally . DSD: Dense-Sparse-Dense Training for Deep Neural Networks , arXiv:1607.04381v2, 2017. Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, Xingyu Liu, Yu Wang, William J. Dally . Exploring the Regularity of Sparse Structure in Convolutional Neural Networks , arXiv:1705.08922v3, 2017. Sajid Anwar, Kyuyeon Hwang, and Wonyong Sung . Structured pruning of deep convolutional neural networks , arXiv:1512.08571, 2015","title":"References"},{"location":"schedule.html","text":"Compression scheduler In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. 
We wanted to be able to change the particulars of the compression schedule without touching the code, and settled on using YAML as a container for this specification. We found that when we run many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code. High level overview Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies. Pruners, Regularizers and Quantizers are very similar: They implement a pruning, regularization or quantization algorithm, respectively. An LR-scheduler specifies the LR-decay algorithm. These define the what part of the schedule. The Policies define the when part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code. Syntax through example We'll use alexnet.schedule_agp.yaml to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet. 
version: 1 pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 lr_schedulers: pruning_lr: class: ExponentialLR gamma: 0.9 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1 There is only one version of the YAML syntax, and the version number is not verified at the moment. However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2. version: 1 In the pruners section, we define the instances of pruners we want the scheduler to instantiate and use. We define a single pruner instance, named my_pruner , of algorithm SensitivityPruner . We will refer to this instance in the Policies section. Then we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. You may list as many Pruners as you want in this section, as long as each has a unique name. You can use several types of pruners in one schedule. pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.6 Next, we want to specify the learning-rate decay scheduling in the lr_schedulers section. We assign a name to this instance: pruning_lr . As in the pruners section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's _LRScheduler . 
You can use any of the schedulers defined in torch.optim.lr_scheduler (see here ). In addition, we've implemented some additional schedulers in Distiller (see here ). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to torch.optim.lr_scheduler , they can be used without changing the application code. lr_schedulers: pruning_lr: class: ExponentialLR gamma: 0.9 Finally, we define the policies section which defines the actual scheduling. A Policy manages an instance of a Pruner , Regularizer , Quantizer , or LRScheduler , by naming the instance. In the example below, a PruningPolicy uses the pruner instance named my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1 This is iterative pruning : Train Connectivity Prune Connections Retrain Weights Goto 2 It is described in Learning both Weights and Connections for Efficient Neural Networks : \"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. 
The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\" Regularization You can also define and schedule regularization. L1 regularization Format (this is an informal specification, not a valid ABNF specification): regularizers: <REGULARIZER_NAME_STR>: class: L1Regularizer reg_regims: <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT> ... <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT> threshold_criteria: [Mean_Abs | Max] For example: version: 1 regularizers: my_L1_reg: class: L1Regularizer reg_regims: 'module.layer3.1.conv1.weight': 0.000002 'module.layer3.1.conv2.weight': 0.000002 'module.layer3.1.conv3.weight': 0.000002 'module.layer3.2.conv1.weight': 0.000002 threshold_criteria: Mean_Abs policies: - regularizer: instance_name: my_L1_reg starting_epoch: 0 ending_epoch: 60 frequency: 1 Group regularization Format (informal specification): regularizers: <REGULARIZER_NAME_STR>: class: GroupLassoRegularizer reg_regims: <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>] <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>] threshold_criteria: [Mean_Abs | Max] For example: version: 1 regularizers: my_filter_regularizer: class: GroupLassoRegularizer reg_regims: 'module.layer3.1.conv1.weight': [0.00005, '3D'] 'module.layer3.1.conv2.weight': [0.00005, '3D'] 'module.layer3.1.conv3.weight': [0.00005, '3D'] 'module.layer3.2.conv1.weight': [0.00005, '3D'] threshold_criteria: Mean_Abs policies: - regularizer: instance_name: my_filter_regularizer starting_epoch: 0 ending_epoch: 60 frequency: 1 Mixing it up You can mix pruning and regularization. 
version: 1 pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 regularizers: 2d_groups_regularizer: class: GroupLassoRegularizer reg_regims: 'features.module.0.weight': [0.000012, '2D'] 'features.module.3.weight': [0.000012, '2D'] 'features.module.6.weight': [0.000012, '2D'] 'features.module.8.weight': [0.000012, '2D'] 'features.module.10.weight': [0.000012, '2D'] lr_schedulers: # Learning rate decay scheduler pruning_lr: class: ExponentialLR gamma: 0.9 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - regularizer: instance_name: '2d_groups_regularizer' starting_epoch: 0 ending_epoch: 38 frequency: 1 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1 Quantization-Aware Training Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the Quantizer class (see details here ). Note that only a single quantizer instance may be defined per YAML. Let's see an example: quantizers: dorefa_quantizer: class: DorefaQuantizer bits_activations: 8 bits_weights: 4 overrides: conv1: bits_weights: null bits_activations: null relu1: bits_weights: null bits_activations: null final_relu: bits_weights: null bits_activations: null fc: bits_weights: null bits_activations: null The specific quantization method we're instantiating here is DorefaQuantizer . Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. Then, we define the overrides mapping. In the example above, we choose not to quantize the first and last layer of the model. 
In the case of DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the name of the modules isn't changed). So, in all, we need to reference the first layer with parameters conv1 , the first activation layer relu1 , the last activation layer final_relu and the last layer with parameters fc . Specifying null means \"do not quantize\". Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers. Defining overrides for groups of layers using regular expressions Suppose we have a sub-module in our model named block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named conv1 , conv2 and so on. In that case we would define the following: overrides: 'block1\\.conv*': bits_weights: 2 bits_activations: null RegEx Note : Remember that the dot ( . ) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it: \\. Overlapping patterns are also possible, which allows defining an override for a group of layers while also \"singling out\" specific layers for different overrides. For example, let's take the last example and configure a different override for block1.conv1 : overrides: 'block1\\.conv1': bits_weights: 4 bits_activations: null 'block1\\.conv*': bits_weights: 2 bits_activations: null Important Note : The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed before the broad one. 
The QuantizationPolicy , which controls the quantization procedure during training, is actually quite simplistic. All it does is call the prepare_model() function of the Quantizer when it's initialized, followed by the first call to quantize_params() . Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the quantize_params() function again. policies: - quantizer: instance_name: dorefa_quantizer starting_epoch: 0 ending_epoch: 200 frequency: 1 Important Note : As mentioned here , since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to prepare_model() must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and then execute a second run which resumes from the checkpoint with the boot-strapped weights. Post-Training Quantization Post-training quantization differs from the other techniques described here. Since it is not executed during training, it does not require any Policies nor a Scheduler. Currently, the only method implemented for post-training quantization is range-based linear quantization . Quantizing a model using this method requires adding 2 lines of code: quantizer = distiller.quantization.PostTrainLinearQuantizer(model, <quantizer arguments>) quantizer.prepare_model() # Execute evaluation on model as usual See the documentation for PostTrainLinearQuantizer in range_linear.py for details on the available arguments. In addition to directly instantiating the quantizer with arguments, it can also be configured from a YAML file. 
The syntax for the YAML file is exactly the same as seen in the quantization-aware training section above. Not surprisingly, the class defined must be PostTrainLinearQuantizer , and any other components or policies defined in the YAML file are ignored. We'll see how to create the quantizer in this manner below. If more configurability is needed, a helper function can be used that will add a set of command-line arguments to configure the quantizer: parser = argparse.ArgumentParser() distiller.quantization.add_post_train_quant_args(parser) args = parser.parse_args() These are the available command line arguments: Arguments controlling quantization at evaluation time (\"post-training quantization\"): --quantize-eval, --qe Apply linear quantization to model before evaluation. Applicable only if --evaluate is also set --qe-calibration PORTION_OF_TEST_SET Run the model in evaluation mode on the specified portion of the test dataset and collect statistics. Ignores all other 'qe--*' arguments --qe-mode QE_MODE, --qem QE_MODE Linear quantization mode. Choices: sym | asym_s | asym_u --qe-bits-acts NUM_BITS, --qeba NUM_BITS Number of bits for quantization of activations --qe-bits-wts NUM_BITS, --qebw NUM_BITS Number of bits for quantization of weights --qe-bits-accum NUM_BITS Number of bits for quantization of the accumulator --qe-clip-acts QE_CLIP_ACTS, --qeca QE_CLIP_ACTS Activations clipping mode. Choices: none | avg | n_std --qe-clip-n-stds QE_CLIP_N_STDS When qe-clip-acts is set to 'n_std', this is the number of standard deviations to use --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...] List of layer names for which not to clip activations. 
Applicable only if --qe-clip-acts is not 'none' --qe-per-channel, --qepc Enable per-channel quantization of weights (per output channel) --qe-scale-approx-bits NUM_BITS, --qesab NUM_BITS Enable scale factor approximation using integer multiply + bit shift, and use this number of bits for the integer multiplier --qe-stats-file PATH Path to YAML file with calibration stats. If not given, dynamic quantization will be run (Note that not all layer types are supported for dynamic quantization) --qe-config-file PATH Path to YAML file containing configuration for PostTrainLinearQuantizer (if present, all other --qe* arguments are ignored) (Note that --quantize-eval and --qe-calibration are mutually exclusive.) When using these command line arguments, the quantizer can be invoked as follows: if args.quantize_eval: quantizer = distiller.quantization.PostTrainLinearQuantizer.from_args(model, args) quantizer.prepare_model() # Execute evaluation on model as usual Note that the command-line arguments don't expose the overrides parameter of the quantizer, which allows fine-grained control over how each layer is quantized. To utilize this functionality, configure with a YAML file. To see integration of these command line arguments in use, see the image classification example . For example invocations of post-training quantization see here . 
Collecting Statistics for Quantization To generate statistics that can be used for static quantization of activations, do the following (shown here assuming the command line argument --qe-calibration shown above is used, which specifies the portion of the test dataset to use for statistics generation): if args.qe_calibration: distiller.utils.assign_layer_fq_names(model) msglogger.info(\"Generating quantization calibration stats based on {0} users\".format(args.qe_calibration)) collector = distiller.data_loggers.QuantCalibrationStatsCollector(model) with collector_context(collector): # Here call your model evaluation function, making sure to execute only # the portion of the dataset specified by the qe_calibration argument yaml_path = 'some/dir/quantization_stats.yaml' collector.save(yaml_path) The generated YAML stats file can then be provided using the --qe-stats-file argument. An example of a generated stats file can be found here . Pruning Fine-Control Sometimes the default pruning process doesn't satisfy our needs and we require finer control over the pruning process (e.g. over masking, gradient handling, and weight updates). Below we will explain the math and nuances of fine-control configuration. Setting up the problem We represent the weights of a DNN as the set \\theta=\\left\\{\\theta_{l} : 0 \\leq l \\leq L\\right\\} where \\theta_{l} represents the parameters tensor (weights and biases) of layer l in a network having L layers. Usually we do not prune biases because of their small size and relative importance. Therefore, we will consider only the network weights (also known as network connections): W=\\left\\{W_{l} : 0 \\leq l \\leq L\\right\\} We wish to optimize some objective (e.g. minimize the energy required to execute a network in inference mode) under some performance constraint (e.g. accuracy), and we do this by maximizing the sparsity of the network weights (sometimes under some chosen sparsity-pattern constraint). 
We formalize pruning as a 3-step action: Generating a mask - in which we define a sparsity-inducing function per layer, P_l , such that M_{l}=P_{l}\\left(W_{l}\\right) M_{l} is a binary matrix which is used to mask W_{l} . P_l is implemented by subclasses of distiller.pruner . Masking the weights using the Hadamard product: \\widehat{W}_{l}=M_{l} \\circ W_{l} Updating the weights (performed by the optimizer). By default, we compute the data-loss using the masked weights, and calculate the gradient of this loss with respect to the masked-weights. We update the weights by making a small adjustment to the masked weights : W_{l} \\leftarrow \\widehat{W}_{l}-\\alpha \\frac{\\partial Loss(\\widehat{W}_{l})}{\\partial \\widehat{W}_{l}} We show below how to change this default behavior. We also provide a more exact description of the weights update when using PyTorch's SGD optimizer. The pruning regimen follows a pruning-rate schedule which, analogously to learning-rate annealing, changes the pruning rate according to a configurable strategy over time. The schedule allows us to configure new masks either once at the beginning of epochs (most common), or at the beginning of mini-batches (for finer control). In the former, the masks are calculated and assigned to \\{M_{l}\\} once, at the beginning of epochs (the specific epochs are determined by the schedule). The pseudo-code below shows the typical training-loop with CompressionScheduler callbacks in bold font, and the three pruning actions described above in burgundy. Figure 1: Pruning algorithm pseudo-code We can perform masking by adding the masking operation to the network graph. We call this in-graph masking , as depicted in the bottom of Figure 2. In the forward-pass we apply element-wise multiplication of the weights W_{l} and the mask M_{l} to obtain the masked weights \\widehat{W}_{l} , which we apply to the Convolution operation. 
In the backward-pass we mask \\frac{\\partial L}{\\partial \\widehat{W}} to obtain \\frac{\\partial L}{\\partial W} with which we update W_{l} . Figure 2: Forward and backward weight masking In Distiller we perform out-of-graph masking in which we directly set the value of \\widehat{W}_{l} by applying a mask on W_{l} . In the backward-pass we make sure that the weights are updated by the proper gradients. In the common pruning use-case we want the optimizer to update only the unmasked weights, but we can configure this behavior using the fine-control arguments, as explained below. Fine-Control For finer control over the behavior of the pruning process, Distiller provides a set of PruningPolicy arguments in the args field, as in the sample below. pruners: random_filter_pruner: class: BernoulliFilterPruner desired_sparsity: 0.1 group_type: Filters weights: [module.conv1.weight] policies: - pruner: instance_name: random_filter_pruner args: mini_batch_pruning_frequency: 16 discard_masks_at_minibatch_end: True use_double_copies: True mask_on_forward_only: True mask_gradients: True starting_epoch: 15 ending_epoch: 180 frequency: 1 Controls mini_batch_pruning_frequency (default: 0): controls pruning scheduling at the mini-batch granularity. Every mini_batch_pruning_frequency training steps (i.e. mini_batches) we configure a new mask. In between mask updates, we mask mini-batches with the current mask. discard_masks_at_minibatch_end (default: False): discards the pruning mask at the end of the mini-batch. In the example YAML above, a new mask is computed once every 16 mini-batches, applied in one forward-pass, and then discarded. In the next 15 mini-batches the mask is Null so we do not mask. mask_gradients (default: False): mask the weights gradients after performing the backward-pass, and before invoking the optimizer. One way to mask the gradients in PyTorch is to register to the backward callback of the weight tensors we want to mask, and alter the gradients there. 
We do this by setting mask_gradients: True , as in the sample YAML above. This is sufficient if our weights optimization uses plain-vanilla SGD, because the update maintains the sparsity of the weights: \\widehat{W}_{l} is sparse by definition, and the gradients are sparse because we mask them. W_{l} \\leftarrow \\widehat{W}_{l}-\\alpha \\frac{\\partial Loss(\\widehat{W}_{l})}{\\partial \\widehat{W}_{l}} But this is not always the case. For example, PyTorch\u2019s SGD optimizer with weight-decay ( \\lambda ) and momentum ( \\rho ) has the optimization logic listed below: 1. \\Delta p=\\frac{\\partial Loss\\left(\\widehat{W}_{l}^{i}\\right)}{\\partial \\widehat{W}_{l}^{i}}+\\lambda \\widehat{W}_{l}^{i} 2. v_{i}=\\left\\lbrace \\matrix{ {\\Delta p: \\; if \\;i==0 }\\; \\cr {v_{i-1} \\rho+ (1-dampening)\\Delta p: \\; if \\; i>0} } \\right\\rbrace 3. W_{l}^{i+1} = \\widehat{W}_{l}^{i}-\\alpha v_{i} Let\u2019s look at the weight optimization update at some arbitrary step (i.e. mini-batch) k . We want to show that masking the weights and gradients ( W_{l}^{i=k} and \\frac{\\partial Loss\\left(\\widehat{W}_{l}^{i=k}\\right)}{\\partial \\widehat{W}_{l}^{i=k}} ) is not sufficient to guarantee that W_{l}^{i=k+1} is sparse. This is easy to do: if we allow for the general case where v_i is not necessarily sparse, then W_{l}^{i+1} is not necessarily sparse. Masking the weights in the forward-pass, and gradients in the backward-pass, is not sufficient to maintain the sparsity of the weights! This is an important insight, and it means that na\u00efve in-graph masking is also not sufficient to guarantee sparsity of the updated weights. use_double_copies (default: False): If you want to compute the gradients using the masked weights and also to update the unmasked weights (instead of updating the masked weights, per usual), set use_double_copies = True . This changes step (3) to: 3. 
W_{l}^{i+1} = W_{l}^{i}-\\alpha \\Delta p mask_on_forward_only (default: False): when set to False the weights will also be masked after the Optimizer is done updating the weights, to remove any updates of the masked gradients. If we want to guarantee the sparsity of the updated weights, we must explicitly mask the weights after step (3) above: 4. {W}_{l}^{i+1} \\leftarrow M_{l}^{i} \\circ {W}_{l}^{i+1} This argument defaults to False , but you can skip step (4), by setting mask_on_forward_only = True . Finally, note that mask_gradients=True and mask_on_forward_only=False are mutually exclusive. Simply put: if you are masking in the backward-pass, do it either via mask_gradients or via mask_on_forward_only=False , but not both. Knowledge Distillation Knowledge distillation (see here ) is also implemented as a Policy , which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above. To make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation: import argparse import distiller parser = argparse.ArgumentParser() distiller.knowledge_distillation.add_distillation_args(parser) (The add_distillation_args function accepts some optional arguments, see its implementation at distiller/knowledge_distillation.py for details) These are the command line arguments exposed by this function: Knowledge Distillation Training Arguments: --kd-teacher ARCH Model architecture for teacher model --kd-pretrained Use pre-trained model for teacher --kd-resume PATH Path to checkpoint from which to load teacher weights --kd-temperature TEMP, --kd-temp TEMP Knowledge distillation softmax temperature --kd-distill-wt WEIGHT, --kd-dw WEIGHT Weight for distillation loss (student vs. teacher soft targets) --kd-student-wt WEIGHT, --kd-sw WEIGHT Weight for student vs. 
labels loss --kd-teacher-wt WEIGHT, --kd-tw WEIGHT Weight for teacher vs. labels loss --kd-start-epoch EPOCH_NUM Epoch from which to enable distillation Once arguments have been parsed, some initialization code is required, similar to the following: # Assuming: # \"args\" variable holds command line arguments # \"model\" variable holds the model we're going to train, that is - the student model # \"compression_scheduler\" variable holds a CompressionScheduler instance args.kd_policy = None if args.kd_teacher: # Create teacher model - replace this with your model creation code teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus) if args.kd_resume: teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume) # Create policy and add to scheduler dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt) args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw) compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs, frequency=1) Finally, during the training loop, we need to perform forward propagation through the teacher model as well. The KnowledgeDistillationPolicy class keeps a reference to both the student and teacher models, and exposes a forward function that performs forward propagation on both of them. 
Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows: if args.kd_policy is None: # Revert to a \"normal\" forward-prop call if no knowledge distillation policy is present output = model(input_var) else: output = args.kd_policy.forward(input_var) To see this integration in action, take a look at the image classification sample at examples/classifier_compression/compress_classifier.py .","title":"Compression Scheduling"},{"location":"schedule.html#compression-scheduler","text":"In iterative pruning, we create some kind of pruning regimen that specifies how to prune, and what to prune at every stage of the pruning and training stages. This motivated the design of CompressionScheduler : it needed to be part of the training loop, and to be able to make and implement pruning, regularization and quantization decisions. We wanted to be able to change the particulars of the compression schedule, w/o touching the code, and settled on using YAML as a container for this specification. We found that when we make many experiments on the same code base, it is easier to maintain all of these experiments if we decouple the differences from the code-base. Therefore, we added to the scheduler support for learning-rate decay scheduling because, again, we wanted the freedom to change the LR-decay policy without changing code.","title":"Compression scheduler"},{"location":"schedule.html#high-level-overview","text":"Let's briefly discuss the main mechanisms and abstractions: A schedule specification is composed of a list of sections defining instances of Pruners, Regularizers, Quantizers, LR-scheduler and Policies. Pruners, Regularizers and Quantizers are very similar: They implement either a Pruning/Regularization/Quantization algorithm, respectively. An LR-scheduler specifies the LR-decay algorithm. These define the what part of the schedule. 
The Policies define the when part of the schedule: at which epoch to start applying the Pruner/Regularizer/Quantizer/LR-decay, the epoch to end, and how often to invoke the policy (frequency of application). A policy also defines the instance of Pruner/Regularizer/Quantizer/LR-decay it is managing. The CompressionScheduler is configured from a YAML file or from a dictionary, but you can also manually create Policies, Pruners, Regularizers and Quantizers from code.","title":"High level overview"},{"location":"schedule.html#syntax-through-example","text":"We'll use alexnet.schedule_agp.yaml to explain some of the YAML syntax for configuring Sensitivity Pruning of Alexnet. version: 1 pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 lr_schedulers: pruning_lr: class: ExponentialLR gamma: 0.9 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1 There is only one version of the YAML syntax, and the version number is not verified at the moment. However, to be future-proof it is probably better to let the YAML parser know that you are using version-1 syntax, in case there is ever a version 2. version: 1 In the pruners section, we define the instances of pruners we want the scheduler to instantiate and use. We define a single pruner instance, named my_pruner , of algorithm SensitivityPruner . We will refer to this instance in the Policies section. Then we list the sensitivity multipliers, \\(s\\), of each of the weight tensors. You may list as many Pruners as you want in this section, as long as each has a unique name. You can use several types of pruners in one schedule. 
pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.6 Next, we want to specify the learning-rate decay scheduling in the lr_schedulers section. We assign a name to this instance: pruning_lr . As in the pruners section, you may use any name, as long as all LR-schedulers have a unique name. At the moment, only one instance of LR-scheduler is allowed. The LR-scheduler must be a subclass of PyTorch's _LRScheduler . You can use any of the schedulers defined in torch.optim.lr_scheduler (see here ). In addition, we've implemented some additional schedulers in Distiller (see here ). The keyword arguments (kwargs) are passed directly to the LR-scheduler's constructor, so that as new LR-schedulers are added to torch.optim.lr_scheduler , they can be used without changing the application code. lr_schedulers: pruning_lr: class: ExponentialLR gamma: 0.9 Finally, we define the policies section which defines the actual scheduling. A Policy manages an instance of a Pruner , Regularizer , Quantizer , or LRScheduler , by naming the instance. In the example below, a PruningPolicy uses the pruner instance named my_pruner : it activates it at a frequency of 2 epochs (i.e. every other epoch), starting at epoch 0, and ending at epoch 38. policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1 This is iterative pruning : Train Connectivity Prune Connections Retrain Weights Goto 2 It is described in Learning both Weights and Connections for Efficient Neural Networks : \"Our method prunes redundant connections using a three-step method. First, we train the network to learn which connections are important. 
Next, we prune the unimportant connections. Finally, we retrain the network to fine tune the weights of the remaining connections...After an initial training phase, we remove all connections whose weight is lower than a threshold. This pruning converts a dense, fully-connected layer to a sparse layer. This first phase learns the topology of the networks \u2014 learning which connections are important and removing the unimportant connections. We then retrain the sparse network so the remaining connections can compensate for the connections that have been removed. The phases of pruning and retraining may be repeated iteratively to further reduce network complexity.\"","title":"Syntax through example"},{"location":"schedule.html#regularization","text":"You can also define and schedule regularization.","title":"Regularization"},{"location":"schedule.html#l1-regularization","text":"Format (this is an informal specification, not a valid ABNF specification): regularizers: <REGULARIZER_NAME_STR>: class: L1Regularizer reg_regims: <PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT> ... 
<PYTORCH_PARAM_NAME_STR>: <STRENGTH_FLOAT> threshold_criteria: [Mean_Abs | Max] For example: version: 1 regularizers: my_L1_reg: class: L1Regularizer reg_regims: 'module.layer3.1.conv1.weight': 0.000002 'module.layer3.1.conv2.weight': 0.000002 'module.layer3.1.conv3.weight': 0.000002 'module.layer3.2.conv1.weight': 0.000002 threshold_criteria: Mean_Abs policies: - regularizer: instance_name: my_L1_reg starting_epoch: 0 ending_epoch: 60 frequency: 1","title":"L1 regularization"},{"location":"schedule.html#group-regularization","text":"Format (informal specification): Format: regularizers: <REGULARIZER_NAME_STR>: class: L1Regularizer reg_regims: <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>] <PYTORCH_PARAM_NAME_STR>: [<STRENGTH_FLOAT>, <'2D' | '3D' | '4D' | 'Channels' | 'Cols' | 'Rows'>] threshold_criteria: [Mean_Abs | Max] For example: version: 1 regularizers: my_filter_regularizer: class: GroupLassoRegularizer reg_regims: 'module.layer3.1.conv1.weight': [0.00005, '3D'] 'module.layer3.1.conv2.weight': [0.00005, '3D'] 'module.layer3.1.conv3.weight': [0.00005, '3D'] 'module.layer3.2.conv1.weight': [0.00005, '3D'] threshold_criteria: Mean_Abs policies: - regularizer: instance_name: my_filter_regularizer starting_epoch: 0 ending_epoch: 60 frequency: 1","title":"Group regularization"},{"location":"schedule.html#mixing-it-up","text":"You can mix pruning and regularization. 
version: 1 pruners: my_pruner: class: 'SensitivityPruner' sensitivities: 'features.module.0.weight': 0.25 'features.module.3.weight': 0.35 'features.module.6.weight': 0.40 'features.module.8.weight': 0.45 'features.module.10.weight': 0.55 'classifier.1.weight': 0.875 'classifier.4.weight': 0.875 'classifier.6.weight': 0.625 regularizers: 2d_groups_regularizer: class: GroupLassoRegularizer reg_regims: 'features.module.0.weight': [0.000012, '2D'] 'features.module.3.weight': [0.000012, '2D'] 'features.module.6.weight': [0.000012, '2D'] 'features.module.8.weight': [0.000012, '2D'] 'features.module.10.weight': [0.000012, '2D'] lr_schedulers: # Learning rate decay scheduler pruning_lr: class: ExponentialLR gamma: 0.9 policies: - pruner: instance_name : 'my_pruner' starting_epoch: 0 ending_epoch: 38 frequency: 2 - regularizer: instance_name: '2d_groups_regularizer' starting_epoch: 0 ending_epoch: 38 frequency: 1 - lr_scheduler: instance_name: pruning_lr starting_epoch: 24 ending_epoch: 200 frequency: 1","title":"Mixing it up"},{"location":"schedule.html#quantization-aware-training","text":"Similarly to pruners and regularizers, specifying a quantizer in the scheduler YAML follows the constructor arguments of the Quantizer class (see details here ). Note that only a single quantizer instance may be defined per YAML. Let's see an example: quantizers: dorefa_quantizer: class: DorefaQuantizer bits_activations: 8 bits_weights: 4 overrides: conv1: bits_weights: null bits_activations: null relu1: bits_weights: null bits_activations: null final_relu: bits_weights: null bits_activations: null fc: bits_weights: null bits_activations: null The specific quantization method we're instantiating here is DorefaQuantizer . Then we define the default bit-widths for activations and weights, in this case 8 and 4-bits, respectively. Then, we define the overrides mapping. In the example above, we choose not to quantize the first and last layer of the model. 
In the case of DorefaQuantizer , the weights are quantized as part of the convolution / FC layers, but the activations are quantized in separate layers, which replace the ReLU layers in the original model (remember - even though we replaced the ReLU modules with our own quantization modules, the names of the modules aren't changed). So, in all, we need to reference the first layer with parameters conv1 , the first activation layer relu1 , the last activation layer final_relu and the last layer with parameters fc . Specifying null means \"do not quantize\". Note that for quantizers, we reference names of modules, not names of parameters as we do for pruners and regularizers.","title":"Quantization-Aware Training"},{"location":"schedule.html#defining-overrides-for-groups-of-layers-using-regular-expressions","text":"Suppose we have a sub-module in our model named block1 , which contains multiple convolution layers which we would like to quantize to, say, 2-bits. The convolution layers are named conv1 , conv2 and so on. In that case we would define the following: overrides: 'block1\\.conv*': bits_weights: 2 bits_activations: null RegEx Note : Remember that the dot ( . ) is a meta-character (i.e. a reserved character) in regular expressions. So, to match the actual dot characters which separate sub-modules in PyTorch module names, we need to escape it: \\. Overlapping patterns are also possible, which allows us to define an override for a group of layers and also \"single out\" specific layers for different overrides. For example, let's take the last example and configure a different override for block1.conv1 : overrides: 'block1\\.conv1': bits_weights: 4 bits_activations: null 'block1\\.conv*': bits_weights: 2 bits_activations: null Important Note : The patterns are evaluated eagerly - first match wins. So, to properly quantize a model using \"broad\" patterns and more \"specific\" patterns as just shown, make sure the specific pattern is listed before the broad one.
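The first-match-wins rule can be illustrated with a minimal sketch. This is not Distiller's implementation: resolve_override is a hypothetical helper, the patterns are matched against the whole module name with re.fullmatch, and the literal dot is written as [.] (equivalent to an escaped dot) with an explicit .* wildcard for the broad pattern.

```python
import re
from collections import OrderedDict


def resolve_override(module_name, overrides):
    # Hypothetical helper: walk the patterns in their listed order and
    # return the settings of the first one that matches the whole name.
    for pattern, settings in overrides.items():
        if re.fullmatch(pattern, module_name):
            return settings
    return None


# Specific pattern listed before the broad one, as the note recommends.
specific_first = OrderedDict([
    ('block1[.]conv1', {'bits_weights': 4}),
    ('block1[.]conv.*', {'bits_weights': 2}),
])

# Same patterns in the opposite order: the broad pattern shadows the specific one.
broad_first = OrderedDict([
    ('block1[.]conv.*', {'bits_weights': 2}),
    ('block1[.]conv1', {'bits_weights': 4}),
])

conv1_ok = resolve_override('block1.conv1', specific_first)
conv2_ok = resolve_override('block1.conv2', specific_first)
conv1_shadowed = resolve_override('block1.conv1', broad_first)
no_match = resolve_override('block2.conv1', specific_first)
```

With the specific pattern first, block1.conv1 resolves to the 4-bit override and the other block1 convolutions to the 2-bit one. With the order reversed, block1.conv1 silently receives the 2-bit override.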
The QuantizationPolicy , which controls the quantization procedure during training, is actually quite simplistic. All it does is call the prepare_model() function of the Quantizer when it's initialized, followed by the first call to quantize_params() . Then, at the end of each epoch, after the float copy of the weights has been updated, it calls the quantize_params() function again. policies: - quantizer: instance_name: dorefa_quantizer starting_epoch: 0 ending_epoch: 200 frequency: 1 Important Note : As mentioned here , since the quantizer modifies the model's parameters (assuming training with quantization in the loop is used), the call to prepare_model() must be performed before an optimizer is called. Therefore, currently, the starting epoch for a quantization policy must be 0, otherwise the quantization process will not work as expected. If one wishes to do a \"warm-startup\" (or \"boot-strapping\"), training for a few epochs with full precision and only then starting to quantize, the only way to do this right now is to execute a separate run to generate the boot-strapped weights, and execute a second which will resume the checkpoint with the boot-strapped weights.","title":"Defining overrides for groups of layers using regular expressions"},{"location":"schedule.html#post-training-quantization","text":"Post-training quantization differs from the other techniques described here. Since it is not executed during training, it does not require any Policies nor a Scheduler. Currently, the only method implemented for post-training quantization is range-based linear quantization . Quantizing a model using this method, requires adding 2 lines of code: quantizer = distiller.quantization.PostTrainLinearQuantizer(model, <quantizer arguments>) quantizer.prepare_model() # Execute evaluation on model as usual See the documentation for PostTrainLinearQuantizer in range_linear.py for details on the available arguments. 
In addition to directly instantiating the quantizer with arguments, it can also be configured from a YAML file. The syntax for the YAML file is exactly the same as seen in the quantization-aware training section above. Not surprisingly, the class defined must be PostTrainLinearQuantizer , and any other components or policies defined in the YAML file are ignored. We'll see how to create the quantizer in this manner below. If more configurability is needed, a helper function can be used that will add a set of command-line arguments to configure the quantizer: parser = argparse.ArgumentParser() distiller.quantization.add_post_train_quant_args(parser) args = parser.parse_args() These are the available command line arguments: Arguments controlling quantization at evaluation time (\"post-training quantization\"): --quantize-eval, --qe Apply linear quantization to model before evaluation. Applicable only if --evaluate is also set --qe-calibration PORTION_OF_TEST_SET Run the model in evaluation mode on the specified portion of the test dataset and collect statistics. Ignores all other 'qe--*' arguments --qe-mode QE_MODE, --qem QE_MODE Linear quantization mode. Choices: sym | asym_s | asym_u --qe-bits-acts NUM_BITS, --qeba NUM_BITS Number of bits for quantization of activations --qe-bits-wts NUM_BITS, --qebw NUM_BITS Number of bits for quantization of weights --qe-bits-accum NUM_BITS Number of bits for quantization of the accumulator --qe-clip-acts QE_CLIP_ACTS, --qeca QE_CLIP_ACTS Activations clipping mode. Choices: none | avg | n_std --qe-clip-n-stds QE_CLIP_N_STDS When qe-clip-acts is set to 'n_std', this is the number of standard deviations to use --qe-no-clip-layers LAYER_NAME [LAYER_NAME ...], --qencl LAYER_NAME [LAYER_NAME ...] List of layer names for which not to clip activations. 
Applicable only if --qe-clip-acts is not 'none' --qe-per-channel, --qepc Enable per-channel quantization of weights (per output channel) --qe-scale-approx-bits NUM_BITS, --qesab NUM_BITS Enable scale factor approximation using integer multiply + bit shift, and use this number of bits for the integer multiplier --qe-stats-file PATH Path to YAML file with calibration stats. If not given, dynamic quantization will be run (Note that not all layer types are supported for dynamic quantization) --qe-config-file PATH Path to YAML file containing configuration for PostTrainLinearQuantizer (if present, all other --qe* arguments are ignored) (Note that --quantize-eval and --qe-calibration are mutually exclusive.) When using these command line arguments, the quantizer can be invoked as follows: if args.quantize_eval: quantizer = distiller.quantization.PostTrainLinearQuantizer.from_args(model, args) quantizer.prepare_model() # Execute evaluation on model as usual Note that the command-line arguments don't expose the overrides parameter of the quantizer, which allows fine-grained control over how each layer is quantized. To utilize this functionality, configure with a YAML file. To see integration of these command line arguments in use, see the image classification example .
For example invocations of post-training quantization see here .","title":"Post-Training Quantization"},{"location":"schedule.html#collecting-statistics-for-quantization","text":"To generate statistics that can be used for static quantization of activations, do the following (shown here assuming the command line argument --qe-calibration shown above is used, which specifies the number of batches to use for statistics generation): if args.qe_calibration: distiller.utils.assign_layer_fq_names(model) msglogger.info(\"Generating quantization calibration stats based on {0} users\".format(args.qe_calibration)) collector = distiller.data_loggers.QuantCalibrationStatsCollector(model) with collector_context(collector): # Here call your model evaluation function, making sure to execute only # the portion of the dataset specified by the qe_calibration argument yaml_path = 'some/dir/quantization_stats.yaml' collector.save(yaml_path) The generated YAML stats file can then be provided using the --qe-stats-file argument. An example of a generated stats file can be found here .","title":"Collecting Statistics for Quantization"},{"location":"schedule.html#pruning-fine-control","text":"Sometimes the default pruning process doesn't satisfy our needs and we require finer control over the pruning process (e.g. over masking, gradient handling, and weight updates). Below we will explain the math and nuances of fine-control configuration.","title":"Pruning Fine-Control"},{"location":"schedule.html#setting-up-the-problem","text":"We represent the weights of a DNN as the set \\theta=\\left\\{\\theta_{l} : 0 \\leq l \\leq L\\right\\} where \\theta_{l} represents the parameters tensor (weights and biases) of layer l in a network having L layers. Usually we do not prune biases because of their small size and relative importance.
Therefore, we will consider only the network weights (also known as network connections): W=\\left\\{W_{l} : 0 \\leq l \\leq L\\right\\} We wish to optimize some objective (e.g. minimize the energy required to execute a network in inference mode) under some performance constraint (e.g. accuracy), and we do this by maximizing the sparsity of the network weights (sometimes under some chosen sparsity-pattern constraint). We formalize pruning as a 3-step action: Generating a mask - in which we define a sparsity-inducing function per layer, P_l , such that M_{l}=P_{l}\\left(W_{l}\\right) , where M_{l} is a binary matrix which is used to mask W_{l} . P_l is implemented by subclasses of distiller.pruner . Masking the weights using the Hadamard product: \\widehat{W}_{l}=M_{l} \\circ W_{l} Updating the weights (performed by the optimizer). By default, we compute the data-loss using the masked weights, and calculate the gradient of this loss with respect to the masked-weights. We update the weights by making a small adjustment to the masked weights : W_{l} \\leftarrow \\widehat{W}_{l}-\\alpha \\frac{\\partial Loss(\\widehat{W}_{l})}{\\partial \\widehat{W}_{l}} We show below how to change this default behavior. We also provide a more exact description of the weights update when using PyTorch's SGD optimizer. The pruning regimen follows a pruning-rate schedule which, analogously to learning-rate annealing, changes the pruning rate according to a configurable strategy over time. The schedule allows us to configure new masks either once at the beginning of epochs (most common), or at the beginning of mini-batches (for finer control). In the former, the masks are calculated and assigned to \\{M_{l}\\} once, at the beginning of epochs (the specific epochs are determined by the schedule). The pseudo-code below shows the typical training-loop with CompressionScheduler callbacks in bold font, and the three pruning actions described above in burgundy.
Figure 1: Pruning algorithm pseudo-code We can perform masking by adding the masking operation to the network graph. We call this in-graph masking , as depicted in the bottom of Figure 2. In the forward-pass we apply element-wise multiplication of the weights W_{l} and the mask M_{l} to obtain the masked weights \\widehat{W}_{l} , which we apply to the Convolution operation. In the backward-pass we mask \\frac{\\partial L}{\\partial \\widehat{W}} to obtain \\frac{\\partial L}{\\partial W} with which we update W_{l} . Figure 2: Forward and backward weight masking In Distiller we perform out-of-graph masking in which we directly set the value of \\widehat{W}_{l} by applying a mask on W_{l} . In the backward-pass we make sure that the weights are updated by the proper gradients. In the common pruning use-case we want the optimizer to update only the unmasked weights, but we can configure this behavior using the fine-control arguments, as explained below.","title":"Setting up the problem"},{"location":"schedule.html#fine-control","text":"For finer control over the behavior of the pruning process, Distiller provides a set of PruningPolicy arguments in the args field, as in the sample below. pruners: random_filter_pruner: class: BernoulliFilterPruner desired_sparsity: 0.1 group_type: Filters weights: [module.conv1.weight] policies: - pruner: instance_name: random_filter_pruner args: mini_batch_pruning_frequency: 16 discard_masks_at_minibatch_end: True use_double_copies: True mask_on_forward_only: True mask_gradients: True starting_epoch: 15 ending_epoch: 180 frequency: 1","title":"Fine-Control"},{"location":"schedule.html#controls","text":"mini_batch_pruning_frequency (default: 0): controls pruning scheduling at the mini-batch granularity. Every mini_batch_pruning_frequency training steps (i.e. mini_batches) we configure a new mask. In between mask updates, we mask mini-batches with the current mask.
discard_masks_at_minibatch_end (default: False): discards the pruning mask at the end of the mini-batch. In the example YAML above, a new mask is computed once every 16 mini-batches, applied in one forward-pass, and then discarded. In the next 15 mini-batches the mask is Null so we do not mask. mask_gradients (default: False): mask the weights gradients after performing the backward-pass, and before invoking the optimizer. One way to mask the gradients in PyTorch is to register to the backward callback of the weight tensors we want to mask, and alter the gradients there. We do this by setting mask_gradients: True , as in the sample YAML above. This is sufficient if our weights optimization uses plain-vanilla SGD, because the update maintains the sparsity of the weights: \\widehat{W}_{l} is sparse by definition, and the gradients are sparse because we mask them. W_{l} \\leftarrow \\widehat{W}_{l}-\\alpha \\frac{\\partial Loss(\\widehat{W}_{l})}{\\partial \\widehat{W}_{l}} But this is not always the case. For example, PyTorch\u2019s SGD optimizer with weight-decay ( \\lambda ), momentum ( \\rho ) and learning-rate ( \\alpha ) has the optimization logic listed below: 1. \\Delta p=\\frac{\\partial Loss\\left(\\widehat{W}_{l}^{i}\\right)}{\\partial \\widehat{W}_{l}^{i}}+\\lambda \\widehat{W}_{l}^{i} 2. v_{i}=\\left\\lbrace \\matrix{ {\\Delta p: \\; if \\;i==0 }\\; \\cr {v_{i-1} \\rho+ (1-dampening)\\Delta p: \\; if \\; i>0} } \\right\\rbrace 3. W_{l}^{i+1} = \\widehat{W}_{l}^{i}-\\alpha v_{i} Let\u2019s look at the weight optimization update at some arbitrary step (i.e. mini-batch) k . We want to show that masking the weights and gradients ( W_{l}^{i=k} and \\frac{\\partial Loss\\left(\\widehat{W}_{l}^{i=k}\\right)}{\\partial \\widehat{W}_{l}^{i=k}} ) is not sufficient to guarantee that W_{l}^{i=k+1} is sparse. This is easy to do: if we allow for the general case where v_i is not necessarily sparse, then W_{l}^{i+1} is not necessarily sparse.
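A small numeric sketch of this argument follows. It is not PyTorch's or Distiller's code: sgd_momentum_step is a hypothetical helper that applies steps 1 to 3 to plain Python lists, and it assumes dampening = 0 and weight-decay = 0 for brevity. The velocity is assumed to be dense because it was accumulated before the mask was set.

```python
def sgd_momentum_step(weights, grads, velocity, mask, lr=0.1, rho=0.9):
    # Hypothetical helper following steps 1-3 with dampening = 0 and
    # weight-decay = 0, using masked weights and masked gradients.
    masked_w = [m * w for m, w in zip(mask, weights)]
    masked_g = [m * g for m, g in zip(mask, grads)]
    # Step 2: the velocity mixes the previous (possibly dense) velocity
    # with the masked gradient.
    new_v = [rho * v + g for v, g in zip(velocity, masked_g)]
    # Step 3: update starting from the masked weights.
    new_w = [w - lr * v for w, v in zip(masked_w, new_v)]
    return new_w, new_v


mask = [1.0, 0.0]        # the second weight is pruned
weights = [0.5, 0.3]
grads = [0.2, 0.4]
velocity = [0.1, 0.1]    # dense velocity left over from earlier steps

new_w, new_v = sgd_momentum_step(weights, grads, velocity, mask)
```

Although both the weight and the gradient at the pruned position are masked to zero, the leftover velocity moves that weight away from zero (here to roughly -0.009), so the updated tensor is no longer sparse.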
Masking the weights in the forward-pass, and gradients in the backward-pass, is not sufficient to maintain the sparsity of the weights! This is an important insight, and it means that na\u00efve in-graph masking is also not sufficient to guarantee sparsity of the updated weights. use_double_copies (default: False): If you want to compute the gradients using the masked weights and also to update the unmasked weights (instead of updating the masked weights, per usual), set use_double_copies = True . This changes step (3) to: 3. W_{l}^{i+1} = W_{l}^{i}-\\alpha v_{i} mask_on_forward_only (default: False): when set to False the weights will also be masked after the Optimizer is done updating the weights, to remove any updates of the masked gradients. If we want to guarantee the sparsity of the updated weights, we must explicitly mask the weights after step (3) above: 4. {W}_{l}^{i+1} \\leftarrow M_{l}^{i} \\circ {W}_{l}^{i+1} This argument defaults to False , but you can skip step (4), by setting mask_on_forward_only = True . Finally, note that mask_gradients = True and mask_on_forward_only = False are mutually exclusive, or simply put: if you are masking in the backward-pass, you should choose to either do it via mask_gradients or mask_on_forward_only=False , but not both.","title":"Controls"},{"location":"schedule.html#knowledge-distillation","text":"Knowledge distillation (see here ) is also implemented as a Policy , which should be added to the scheduler. However, with the current implementation, it cannot be defined within the YAML file like the rest of the policies described above.
To make the integration of this method into applications a bit easier, a helper function can be used that will add a set of command-line arguments related to knowledge distillation: import argparse import distiller parser = argparse.ArgumentParser() distiller.knowledge_distillation.add_distillation_args(parser) (The add_distillation_args function accepts some optional arguments, see its implementation at distiller/knowledge_distillation.py for details) These are the command line arguments exposed by this function: Knowledge Distillation Training Arguments: --kd-teacher ARCH Model architecture for teacher model --kd-pretrained Use pre-trained model for teacher --kd-resume PATH Path to checkpoint from which to load teacher weights --kd-temperature TEMP, --kd-temp TEMP Knowledge distillation softmax temperature --kd-distill-wt WEIGHT, --kd-dw WEIGHT Weight for distillation loss (student vs. teacher soft targets) --kd-student-wt WEIGHT, --kd-sw WEIGHT Weight for student vs. labels loss --kd-teacher-wt WEIGHT, --kd-tw WEIGHT Weight for teacher vs. 
labels loss --kd-start-epoch EPOCH_NUM Epoch from which to enable distillation Once arguments have been parsed, some initialization code is required, similar to the following: # Assuming: # \"args\" variable holds command line arguments # \"model\" variable holds the model we're going to train, that is - the student model # \"compression_scheduler\" variable holds a CompressionScheduler instance args.kd_policy = None if args.kd_teacher: # Create teacher model - replace this with your model creation code teacher = create_model(args.kd_pretrained, args.dataset, args.kd_teacher, device_ids=args.gpus) if args.kd_resume: teacher, _, _ = apputils.load_checkpoint(teacher, chkpt_file=args.kd_resume) # Create policy and add to scheduler dlw = distiller.DistillationLossWeights(args.kd_distill_wt, args.kd_student_wt, args.kd_teacher_wt) args.kd_policy = distiller.KnowledgeDistillationPolicy(model, teacher, args.kd_temp, dlw) compression_scheduler.add_policy(args.kd_policy, starting_epoch=args.kd_start_epoch, ending_epoch=args.epochs, frequency=1) Finally, during the training loop, we need to perform forward propagation through the teacher model as well. The KnowledgeDistillationPolicy class keeps a reference to both the student and teacher models, and exposes a forward function that performs forward propagation on both of them. 
Since this is not one of the standard policy callbacks, we need to call this function manually from our training loop, as follows: if args.kd_policy is None: # Revert to a \"normal\" forward-prop call if no knowledge distillation policy is present output = model(input_var) else: output = args.kd_policy.forward(input_var) To see this integration in action, take a look at the image classification sample at examples/classifier_compression/compress_classifier.py .","title":"Knowledge Distillation"},{"location":"tutorial-gnmt_quant.html","text":"Post-Training Quantization of GNMT using Distiller A detailed, Jupyter Notebook based tutorial on this topic is located at <distiller_repo_root>/examples/GNMT . Check out the README file in that directory for more details.","title":"Quantizing GNMT"},{"location":"tutorial-gnmt_quant.html#post-training-quantization-of-gnmt-using-distiller","text":"A detailed, Jupyter Notebook based tutorial on this topic is located at <distiller_repo_root>/examples/GNMT . Check out the README file in that directory for more details.","title":"Post-Training Quantization of GNMT using Distiller"},{"location":"tutorial-lang_model.html","text":"Using Distiller to prune a PyTorch language model Contents Introduction Setup Preparing the code Training-loop Creating compression baselines Compressing the language model What are we compressing? How are we compressing? When are we compressing? Until next time Introduction In this tutorial I'll show you how to compress a word-level language model using Distiller . Specifically, we use PyTorch\u2019s word-level language model sample code as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms. To make things manageable, I've divided the tutorial to two parts: in the first we will setup the sample application and prune using AGP . 
In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model. The results are displayed below and the completed code is available here . Note that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40. However, for demonstration purposes we don\u2019t need to do this. Type Sparsity NNZ Validation Test Command line Small 0% 7,135,600 101.13 96.29 time python3 main.py --cuda --epochs 40 --tied --wd=1e-6 Medium 0% 28,390,700 88.17 84.21 time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6 Large 0% 85,917,000 87.49 83.85 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6 Large 70% 25,487,550 90.67 85.96 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml Large 70% 25,487,550 90.59 85.84 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6 Large 70% 25,487,550 87.40 82.93 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6 Large 80.4% 16,847,550 89.31 83.64 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6 Large 90% 8,591,700 90.70 85.67 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6 Large 95% 4,295,850 98.42 92.79 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied
--compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6 Table 1: AGP language model pruning results. NNZ stands for number of non-zero coefficients (embeddings are counted once, because they are tied). Figure 1: Perplexity vs model size (lower perplexity is better). The model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding. The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings. We used the WikiText2 dataset (twice as large as PTB). We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large: (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross). The results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness. We limited our tests to 40 epochs, even though validation perplexity was still trending down. Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions: \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium model (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters. It also outperforms the dense large model, which exemplifies how pruning can act as a regularizer. \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix.
This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d Setup We start by cloning Pytorch\u2019s example repository . I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial. Next, let\u2019s create and activate a virtual environment, as explained in Distiller's README file. Now we can turn our attention to main.py , which contains the training application. Preparing the code We begin by adding code to invoke Distiller in file main.py . This involves a bit of mechanics, because we did not pip install Distiller in our environment (we don\u2019t have a setup.py script for Distiller as of yet). To make Distiller library functions accessible from main.py , we modify sys.path to include the distiller root directory by taking the current directory and pointing two directories up. This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware. import os import sys script_dir = os.path.dirname(__file__) module_path = os.path.abspath(os.path.join(script_dir, '..', '..')) if module_path not in sys.path: sys.path.append(module_path) import distiller import apputils from distiller.data_loggers import TensorBoardLogger, PythonLogger Next, we augment the application arguments with two Distiller-specific arguments. The first, --summary , gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics). The second argument, --compress , is how we tell the application where the compression scheduling file is located. We also add two arguments - momentum and weight-decay - for the SGD optimizer. As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments. 
# Distiller-related arguments SUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile'] parser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES, help='print a summary of the model, and exit - options: ' + ' | '.join(SUMMARY_CHOICES)) parser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store', help='configuration file for pruning the model (default is to use hard-coded schedule)') parser.add_argument('--momentum', default=0., type=float, metavar='M', help='momentum') parser.add_argument('--weight-decay', '--wd', default=0., type=float, metavar='W', help='weight decay (default: 0)') We add code to handle the --summary application argument. It can be as simple as forwarding to distiller.model_summary or more complex, as in the Distiller sample. if args.summary: distiller.model_summary(model, None, args.summary, 'wikitext2') exit(0) Similarly, we add code to handle the --compress argument, which creates a CompressionScheduler and configures it from a YAML schedule file: if args.compress: source = args.compress compression_scheduler = distiller.CompressionScheduler(model) distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger) We also create the optimizer, and the learning-rate decay policy scheduler. The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process. Using an SGD optimizer configured with momentum=0 and weight_decay=0 , and a ReduceLROnPlateau LR-decay policy with patience=0 and factor=0.5 will give the same behavior as in the original PyTorch example. From there, we can experiment with the optimizer and LR-decay configuration.
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay) lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=0, verbose=True, factor=0.5) Next, we add code to set up the logging backends: a Python logger backend which reads its configuration from file and logs messages to the console and log file ( pylogger ); and a TensorBoard backend logger which logs statistics to a TensorBoard data file ( tflogger ). I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure. This code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress. # Distiller loggers msglogger = apputils.config_pylogger('logging.conf', None) tflogger = TensorBoardLogger(msglogger.logdir) tflogger.log_gradients = True pylogger = PythonLogger(msglogger) Training loop Now we scroll down all the way to the train() function. We'll change its signature to include the epoch , optimizer , and compression_scheduler . We'll soon see why we need these. def train(epoch, optimizer, compression_scheduler=None) Function train() is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression. The CompressionScheduler invokes ScheduledTrainingPolicy instances per the scheduling specification that was programmed in the CompressionScheduler instance. There are four main SchedulingPolicy types: PruningPolicy , RegularizationPolicy , LRPolicy , and QuantizationPolicy . We'll be using PruningPolicy , which is triggered on_epoch_begin (to invoke the Pruners ), and on_minibatch_begin (to mask the weights). Later we will create a YAML scheduling file, and specify the schedule of AutomatedGradualPruner instances.
Because we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the CompressionScheduler 's callbacks, not just the mandatory on_epoch_begin callback. We invoke on_minibatch_begin before running the forward-pass, before_backward_pass after computing the loss, and on_minibatch_end after completing the backward-pass. def train(epoch, optimizer, compression_scheduler=None): ... # The line below was fixed as per: https://github.com/pytorch/examples/issues/214 for batch, i in enumerate(range(0, train_data.size(0), args.bptt)): data, targets = get_batch(train_data, i) # Starting each batch, we detach the hidden state from how it was previously produced. # If we didn't, the model would try backpropagating all the way to start of the dataset. hidden = repackage_hidden(hidden) if compression_scheduler: compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) output, hidden = model(data, hidden) loss = criterion(output.view(-1, ntokens), targets) if compression_scheduler: compression_scheduler.before_backward_pass(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch, loss=loss) optimizer.zero_grad() loss.backward() # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs. torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) optimizer.step() total_loss += loss.item() if compression_scheduler: compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) The rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced: for p in model.parameters(): p.data.add_(-lr, p.grad.data) with: optimizer.step() The rest of the code in function train() logs to a text file and a TensorBoard backend. 
Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard. That's a lot for a few lines of code ;-) if batch % args.log_interval == 0 and batch > 0: cur_loss = total_loss / args.log_interval elapsed = time.time() - start_time lr = optimizer.param_groups[0]['lr'] msglogger.info( '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} ' '| loss {:5.2f} | ppl {:8.2f}'.format( epoch, batch, len(train_data) // args.bptt, lr, elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))) total_loss = 0 start_time = time.time() stats = ('Peformance/Training/', OrderedDict([ ('Loss', cur_loss), ('Perplexity', math.exp(cur_loss)), ('LR', lr), ('Batch Time', elapsed * 1000)]) ) steps_completed = batch + 1 distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed, steps_per_epoch, args.log_interval, [tflogger]) Finally we get to the outer training-loop which loops on args.epochs . We add the two final CompressionScheduler callbacks: on_epoch_begin , at the start of the loop, and on_epoch_end after running evaluate on the model and updating the learning-rate. try: for epoch in range(0, args.epochs): epoch_start_time = time.time() if compression_scheduler: compression_scheduler.on_epoch_begin(epoch) train(epoch, optimizer, compression_scheduler) val_loss = evaluate(val_data) lr_scheduler.step(val_loss) if compression_scheduler: compression_scheduler.on_epoch_end(epoch) And that's it! The language model sample is ready for compression. 
Creating compression baselines In To prune, or not to prune: exploring the efficacy of pruning for model compression Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. Before we start compressing stuff ;-), we need to create baselines so we have something to benchmark against. Let's prepare small, medium, and large baseline models, like Table 3 of To prune, or Not to Prune . These will provide baseline perplexity results that we'll compare the compressed models against. I chose to use tied input/output embeddings, and constrained the training to 40 epochs. The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them). Size Number of Weights (untied) Number of Weights (tied) Small 13,951,200 7,295,600 Medium 50,021,400 28,390,700 Large 135,834,000 85,917,000 I started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting. 
The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6. In all three model sizes using the smaller L2 regularization (1e-6) gave the best results. BTW, I'm not showing here experiments with even lower regularization because that did not help. Type Command line Validation Test Small time python3 main.py --cuda --epochs 40 --tied 105.23 99.53 Small time python3 main.py --cuda --epochs 40 --tied --wd=1e-6 101.13 96.29 Small time python3 main.py --cuda --epochs 40 --tied --wd=1e-5 109.49 103.53 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied 90.93 86.20 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6 88.17 84.21 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5 97.75 93.06 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied 88.23 84.21 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6 87.49 83.85 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5 99.22 94.28 Compressing the language model OK, so now let's recreate the results of the language model experiment from section 4.2 of paper. We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results. What are we compressing? 
To gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table: $ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity Parameters: +---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0.00000 | encoder.weight | (33278, 1500) | 49917000 | 49916999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.05773 | -0.00000 | 0.05000 | | 1.00000 | rnn.weight_ih_l0 | (6000, 1500) | 9000000 | 9000000 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.01491 | 0.00001 | 0.01291 | | 2.00000 | rnn.weight_hh_l0 | (6000, 1500) | 9000000 | 8999999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00001 | 0.01491 | 0.00000 | 0.01291 | | 3.00000 | rnn.weight_ih_l1 | (6000, 1500) | 9000000 | 8999999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00001 | 0.01490 | -0.00000 | 0.01291 | | 4.00000 | rnn.weight_hh_l1 | (6000, 1500) | 9000000 | 9000000 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.01491 | -0.00000 | 0.01291 | | 5.00000 | decoder.weight | (33278, 1500) | 49917000 | 49916999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.05773 | -0.00000 | 0.05000 | | 6.00000 | Total sparsity: | - | 135834000 | 135833996 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | +---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ Total sparsity: 0.00 So what's going on here? 
encoder.weight and decoder.weight are the input and output embeddings, respectively. Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of the parameters, which is shared between the encoder and decoder. We also have two pairs of RNN (LSTM really) parameters. There are two pairs because the model uses the command-line argument args.nlayers to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2. The recurrent cells are LSTM cells, because this is the default of args.model , which is used in the initialization of RNNModel . Let's look at the parameters of the first RNN: rnn.weight_ih_l0 and rnn.weight_hh_l0 : what are these? Recall the LSTM equations that PyTorch implements. In the equations, there are 8 instances of vector-matrix multiplication (when batch=1). These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs ( rnn.weight_ih_l0 ), and the other multiplies the hidden-state ( rnn.weight_hh_l0 ). How are we compressing? Let's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model. I'll use the 70% schedule to show a concrete example. The YAML file has two sections: pruners and policies . Section pruners defines instances of ParameterPruner - in our case we define three instances of AutomatedGradualPruner : for the weights of the first RNN ( l0_rnn_pruner ), the second RNN ( l1_rnn_pruner ) and the embedding layer ( embedding_pruner ). These names are arbitrary, and serve as name-handles which bind Policies to Pruners - so you can use whatever names you want. Each AutomatedGradualPruner is configured with an initial_sparsity and final_sparsity .
For example, the l0_rnn_pruner below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned. The weights parameter tells the Pruner which weight tensors to prune. pruners: l0_rnn_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [rnn.weight_ih_l0, rnn.weight_hh_l0] l1_rnn_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [rnn.weight_ih_l1, rnn.weight_hh_l1] embedding_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [encoder.weight] When are we compressing? If the pruners section defines \"what-to-do\", the policies section defines \"when-to-do\". This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best. Below we define three PruningPolicy instances. The first two instances start operating at epoch 2 ( starting_epoch ), end at epoch 20 ( ending_epoch ), and operate once every epoch ( frequency ; as I explained above, Distiller's Pruning scheduling operates only at on_epoch_begin ). In between pruning operations, the pruned model is fine-tuned. policies: - pruner: instance_name : l0_rnn_pruner starting_epoch: 2 ending_epoch: 20 frequency: 1 - pruner: instance_name : l1_rnn_pruner starting_epoch: 2 ending_epoch: 20 frequency: 1 - pruner: instance_name : embedding_pruner starting_epoch: 3 ending_epoch: 21 frequency: 1 We invoke the compression as follows: $ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml Table 1 above shows that we can make a negligible improvement when adding L2 regularization.
I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency, and noticed that the embedding layers are much less sensitive to pruning than the RNN cells. I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration. A new 70% sparsity schedule prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves an improvement of almost 3 points in the test perplexity results. We provide similar pruning schedules for the other compression rates. Until next time This concludes the first part of the tutorial on pruning a PyTorch language model. In the next installment, I'll explain how we added an implementation of Baidu Research's Exploring Sparsity in Recurrent Neural Networks paper, and applied it to this language model. Geek On.","title":"Pruning a Language Model"},{"location":"tutorial-lang_model.html#using-distiller-to-prune-a-pytorch-language-model","text":"","title":"Using Distiller to prune a PyTorch language model"},{"location":"tutorial-lang_model.html#contents","text":"Introduction Setup Preparing the code Training-loop Creating compression baselines Compressing the language model What are we compressing? How are we compressing? When are we compressing? Until next time","title":"Contents"},{"location":"tutorial-lang_model.html#introduction","text":"In this tutorial I'll show you how to compress a word-level language model using Distiller . Specifically, we use PyTorch\u2019s word-level language model sample code as the code-base of our example, weave in some Distiller code, and show how we compress the model using two different element-wise pruning algorithms. To make things manageable, I've divided the tutorial into two parts: in the first we will set up the sample application and prune using AGP . In the second part I'll show how I've added Baidu's RNN pruning algorithm and then use it to prune the same word-level language model.
The completed code is available here . The results are displayed below and the code is available here . Note that we can improve the results by training longer, since the loss curves are usually still decreasing at the end of epoch 40. However, for demonstration purposes we don\u2019t need to do this. Type Sparsity NNZ Validation Test Command line Small 0% 7,135,600 101.13 96.29 time python3 main.py --cuda --epochs 40 --tied --wd=1e-6 Medium 0% 28,390,700 88.17 84.21 time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6 Large 0% 85,917,000 87.49 83.85 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6 Large 70% 25,487,550 90.67 85.96 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml Large 70% 25,487,550 90.59 85.84 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml --wd=1e-6 Large 70% 25,487,550 87.40 82.93 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70B.schedule_agp.yaml --wd=1e-6 Large 80.4% 16,847,550 89.31 83.64 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_80.schedule_agp.yaml --wd=1e-6 Large 90% 8,591,700 90.70 85.67 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_90.schedule_agp.yaml --wd=1e-6 Large 95% 4,295,850 98.42 92.79 time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_95.schedule_agp.yaml --wd=1e-6 Table 1: AGP language model pruning results.
NNZ stands for the number of non-zero coefficients (embeddings are counted once, because they are tied). Figure 1: Perplexity vs model size (lower perplexity is better). The model is composed of an Encoder embedding, two LSTMs, and a Decoder embedding. The Encoder and decoder embeddings (projections) are tied to improve perplexity results (per https://arxiv.org/pdf/1611.01462.pdf), so in the sparsity statistics we account for only one of the encoder/decoder embeddings. We used the WikiText2 dataset (twice as large as PTB). We compared three model sizes: small (7.1M; 14M), medium (28M; 50M), large (86M; 136M) \u2013 reported as (#parameters net/tied; #parameters gross). The results reported below use a preset seed (for reproducibility), and we expect results can be improved if we allow \u201ctrue\u201d pseudo-randomness. We limited our tests to 40 epochs, even though validation perplexity was still trending down. Essentially, this recreates the language model experiment in the AGP paper, and validates its conclusions: \u201cWe see that sparse models are able to outperform dense models which have significantly more parameters.\u201d The 80% sparse large model (which has 16.9M parameters and a perplexity of 83.64) is able to outperform the dense medium model (which has 28.4M parameters and a perplexity of 84.21), a model which has 1.7 times more parameters. It also outperforms the dense large model, which exemplifies how pruning can act as a regularizer. \u201cOur results show that pruning works very well not only on the dense LSTM weights and dense softmax layer but also the dense embedding matrix. This suggests that during the optimization procedure the neural network can find a good sparse embedding for the words in the vocabulary that works well together with the sparse connectivity structure of the LSTM weights and softmax layer.\u201d","title":"Introduction"},{"location":"tutorial-lang_model.html#setup","text":"We start by cloning PyTorch\u2019s example repository .
I\u2019ve copied the language model code to distiller\u2019s examples/word_language_model directory, so I\u2019ll use that for the rest of the tutorial. Next, let\u2019s create and activate a virtual environment, as explained in Distiller's README file. Now we can turn our attention to main.py , which contains the training application.","title":"Setup"},{"location":"tutorial-lang_model.html#preparing-the-code","text":"We begin by adding code to invoke Distiller in file main.py . This involves a bit of mechanics, because we did not pip install Distiller in our environment (we don\u2019t have a setup.py script for Distiller as of yet). To make Distiller library functions accessible from main.py , we modify sys.path to include the distiller root directory by taking the current directory and pointing two directories up. This is very specific to the location of this example code, and it will break if you\u2019ve placed the code elsewhere \u2013 so be aware. import os import sys script_dir = os.path.dirname(__file__) module_path = os.path.abspath(os.path.join(script_dir, '..', '..')) if module_path not in sys.path: sys.path.append(module_path) import distiller import apputils from distiller.data_loggers import TensorBoardLogger, PythonLogger Next, we augment the application arguments with two Distiller-specific arguments. The first, --summary , gives us the ability to do simple compression instrumentation (e.g. log sparsity statistics). The second argument, --compress , is how we tell the application where the compression scheduling file is located. We also add two arguments - momentum and weight-decay - for the SGD optimizer. As I explain later, I replaced the original code's optimizer with SGD, so we need these extra arguments. 
# Distiller-related arguments SUMMARY_CHOICES = ['sparsity', 'model', 'modules', 'png', 'percentile'] parser.add_argument('--summary', type=str, choices=SUMMARY_CHOICES, help='print a summary of the model, and exit - options: ' + ' | '.join(SUMMARY_CHOICES)) parser.add_argument('--compress', dest='compress', type=str, nargs='?', action='store', help='configuration file for pruning the model (default is to use hard-coded schedule)') parser.add_argument('--momentum', default=0., type=float, metavar='M', help='momentum') parser.add_argument('--weight-decay', '--wd', default=0., type=float, metavar='W', help='weight decay (default: 1e-4)') We add code to handle the --summary application argument. It can be as simple as forwarding to distiller.model_summary or more complex, as in the Distiller sample. if args.summary: distiller.model_summary(model, None, args.summary, 'wikitext2') exit(0) Similarly, we add code to handle the --compress argument, which creates a CompressionScheduler and configures it from a YAML schedule file: if args.compress: source = args.compress compression_scheduler = distiller.CompressionScheduler(model) distiller.config.fileConfig(model, None, compression_scheduler, args.compress, msglogger) We also create the optimizer, and the learning-rate decay policy scheduler. The original PyTorch example manually manages the optimization and LR decay process, but I think that having a standard optimizer and LR-decay schedule gives us the flexibility to experiment with these during the training process. Using an SGD optimizer configured with momentum=0 and weight_decay=0 , and a ReduceLROnPlateau LR-decay policy with patience=0 and factor=0.5 will give the same behavior as in the original PyTorch example. From there, we can experiment with the optimizer and LR-decay configuration. 
optimizer = torch.optim.SGD(model.parameters(), args.lr, momentum=args.momentum, weight_decay=args.weight_decay) lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', patience=0, verbose=True, factor=0.5) Next, we add code to set up the logging backends: a Python logger backend which reads its configuration from a file and logs messages to the console and log file ( pylogger ); and a TensorBoard backend logger which logs statistics to a TensorBoard data file ( tflogger ). I configured the TensorBoard backend to log gradients because RNNs suffer from vanishing and exploding gradients, so we might want to take a look in case the training experiences a sudden failure. This code is not strictly required, but it is quite useful to be able to log the session progress, and to export logs to TensorBoard for realtime visualization of the training progress. # Distiller loggers msglogger = apputils.config_pylogger('logging.conf', None) tflogger = TensorBoardLogger(msglogger.logdir) tflogger.log_gradients = True pylogger = PythonLogger(msglogger)","title":"Preparing the code"},{"location":"tutorial-lang_model.html#training-loop","text":"Now we scroll down all the way to the train() function. We'll change its signature to include the epoch , optimizer , and compression_scheduler . We'll soon see why we need these. def train(epoch, optimizer, compression_scheduler=None) Function train() is responsible for training the network in batches for one epoch, and in its epoch loop we want to perform compression. The CompressionScheduler invokes ScheduledTrainingPolicy instances per the scheduling specification that was programmed in the CompressionScheduler instance. There are four main SchedulingPolicy types: PruningPolicy , RegularizationPolicy , LRPolicy , and QuantizationPolicy . We'll be using PruningPolicy , which is triggered on_epoch_begin (to invoke the Pruners ), and on_minibatch_begin (to mask the weights).
Later we will create a YAML scheduling file, and specify the schedule of AutomatedGradualPruner instances. Because we are writing a single application, which can be used with various Policies in the future (e.g. group-lasso regularization), we should add code to invoke all of the CompressionScheduler 's callbacks, not just the mandatory on_epoch_begin callback. We invoke on_minibatch_begin before running the forward-pass, before_backward_pass after computing the loss, and on_minibatch_end after completing the backward-pass. def train(epoch, optimizer, compression_scheduler=None): ... # The line below was fixed as per: https://github.com/pytorch/examples/issues/214 for batch, i in enumerate(range(0, train_data.size(0), args.bptt)): data, targets = get_batch(train_data, i) # Starting each batch, we detach the hidden state from how it was previously produced. # If we didn't, the model would try backpropagating all the way to start of the dataset. hidden = repackage_hidden(hidden) if compression_scheduler: compression_scheduler.on_minibatch_begin(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) output, hidden = model(data, hidden) loss = criterion(output.view(-1, ntokens), targets) if compression_scheduler: compression_scheduler.before_backward_pass(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch, loss=loss) optimizer.zero_grad() loss.backward() # `clip_grad_norm` helps prevent the exploding gradient problem in RNNs / LSTMs. 
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) optimizer.step() total_loss += loss.item() if compression_scheduler: compression_scheduler.on_minibatch_end(epoch, minibatch_id=batch, minibatches_per_epoch=steps_per_epoch) The rest of the code could stay as in the original PyTorch sample, but I wanted to use an SGD optimizer, so I replaced: for p in model.parameters(): p.data.add_(-lr, p.grad.data) with: optimizer.step() The rest of the code in function train() logs to a text file and a TensorBoard backend. Again, such code is not mandatory, but a few lines give us a lot of visibility: we have training progress information saved to log, and we can monitor the training progress in realtime on TensorBoard. That's a lot for a few lines of code ;-) if batch % args.log_interval == 0 and batch > 0: cur_loss = total_loss / args.log_interval elapsed = time.time() - start_time lr = optimizer.param_groups[0]['lr'] msglogger.info( '| epoch {:3d} | {:5d}/{:5d} batches | lr {:02.4f} | ms/batch {:5.2f} ' '| loss {:5.2f} | ppl {:8.2f}'.format( epoch, batch, len(train_data) // args.bptt, lr, elapsed * 1000 / args.log_interval, cur_loss, math.exp(cur_loss))) total_loss = 0 start_time = time.time() stats = ('Peformance/Training/', OrderedDict([ ('Loss', cur_loss), ('Perplexity', math.exp(cur_loss)), ('LR', lr), ('Batch Time', elapsed * 1000)]) ) steps_completed = batch + 1 distiller.log_training_progress(stats, model.named_parameters(), epoch, steps_completed, steps_per_epoch, args.log_interval, [tflogger]) Finally we get to the outer training-loop which loops on args.epochs . We add the two final CompressionScheduler callbacks: on_epoch_begin , at the start of the loop, and on_epoch_end after running evaluate on the model and updating the learning-rate. 
try: for epoch in range(0, args.epochs): epoch_start_time = time.time() if compression_scheduler: compression_scheduler.on_epoch_begin(epoch) train(epoch, optimizer, compression_scheduler) val_loss = evaluate(val_data) lr_scheduler.step(val_loss) if compression_scheduler: compression_scheduler.on_epoch_end(epoch) And that's it! The language model sample is ready for compression.","title":"Training loop"},{"location":"tutorial-lang_model.html#creating-compression-baselines","text":"In To prune, or not to prune: exploring the efficacy of pruning for model compression Zhu and Gupta, \"compare the accuracy of large, but pruned models (large-sparse) and their smaller, but dense (small-dense) counterparts with identical memory footprint.\" They also \"propose a new gradual pruning technique that is simple and straightforward to apply across a variety of models/datasets with minimal tuning.\" This pruning schedule is implemented by distiller.AutomatedGradualPruner (AGP), which increases the sparsity level (expressed as a percentage of zero-valued elements) gradually over several pruning steps. Distiller's implementation only prunes elements once in an epoch (the model is fine-tuned in between pruning events), which is a small deviation from Zhu and Gupta's paper. The research paper specifies the schedule in terms of mini-batches, while our implementation specifies the schedule in terms of epochs. We feel that using epochs performs well, and is more \"stable\", since the number of mini-batches will change, if you change the batch size. Before we start compressing stuff ;-), we need to create baselines so we have something to benchmark against. Let's prepare small, medium, and large baseline models, like Table 3 of To prune, or Not to Prune . These will provide baseline perplexity results that we'll compare the compressed models against. I chose to use tied input/output embeddings, and constrained the training to 40 epochs. 
The table below shows the model sizes, where we are interested in the tied version (biases are ignored due to their small size and because we don't prune them). Size Number of Weights (untied) Number of Weights (tied) Small 13,951,200 7,295,600 Medium 50,021,400 28,390,700 Large 135,834,000 85,917,000 I started experimenting with the optimizer setup like in the PyTorch example, but I added some L2 regularization when I noticed that the training was overfitting. The two right columns show the perplexity results (lower is better) of each of the models with no L2 regularization and with 1e-5 and 1e-6. In all three model sizes using the smaller L2 regularization (1e-6) gave the best results. BTW, I'm not showing here experiments with even lower regularization because that did not help. Type Command line Validation Test Small time python3 main.py --cuda --epochs 40 --tied 105.23 99.53 Small time python3 main.py --cuda --epochs 40 --tied --wd=1e-6 101.13 96.29 Small time python3 main.py --cuda --epochs 40 --tied --wd=1e-5 109.49 103.53 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied 90.93 86.20 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-6 88.17 84.21 Medium time python3 main.py --cuda --emsize 650 --nhid 650 --dropout 0.5 --epochs 40 --tied --wd=1e-5 97.75 93.06 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied 88.23 84.21 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-6 87.49 83.85 Large time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --wd=1e-5 99.22 94.28","title":"Creating compression baselines"},{"location":"tutorial-lang_model.html#compressing-the-language-model","text":"OK, so now let's recreate the results of the language model experiment from section 4.2 of paper. 
We're using PyTorch's sample, so the language model we implement is not exactly like the one in the AGP paper (and uses a different dataset), but it's close enough, so if everything goes well, we should see similar compression results.","title":"Compressing the language model"},{"location":"tutorial-lang_model.html#what-are-we-compressing","text":"To gain insight about the model parameters, we can use the command-line to produce a weights-sparsity table: $ python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --summary=sparsity Parameters: +---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0.00000 | encoder.weight | (33278, 1500) | 49917000 | 49916999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.05773 | -0.00000 | 0.05000 | | 1.00000 | rnn.weight_ih_l0 | (6000, 1500) | 9000000 | 9000000 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.01491 | 0.00001 | 0.01291 | | 2.00000 | rnn.weight_hh_l0 | (6000, 1500) | 9000000 | 8999999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00001 | 0.01491 | 0.00000 | 0.01291 | | 3.00000 | rnn.weight_ih_l1 | (6000, 1500) | 9000000 | 8999999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00001 | 0.01490 | -0.00000 | 0.01291 | | 4.00000 | rnn.weight_hh_l1 | (6000, 1500) | 9000000 | 9000000 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.01491 | -0.00000 | 0.01291 | | 5.00000 | decoder.weight | (33278, 1500) | 49917000 | 49916999 | 0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.05773 | -0.00000 | 0.05000 | | 6.00000 | Total sparsity: | - | 135834000 | 135833996 | 
0.00000 | 0.00000 | 0 | 0.00000 | 0 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | +---------+------------------+---------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ Total sparsity: 0.00 So what's going on here? encoder.weight and decoder.weight are the input and output embeddings, respectively. Remember that in the configuration I chose for the three model sizes these embeddings are tied, which means that we only have one copy of the parameters, which is shared between the encoder and decoder. We also have two pairs of RNN (LSTM really) parameters. There are two pairs because the model uses the command-line argument args.nlayers to decide how many instances of RNN (or LSTM or GRU) cells to use, and it defaults to 2. The recurrent cells are LSTM cells, because this is the default of args.model , which is used in the initialization of RNNModel . Let's look at the parameters of the first RNN: rnn.weight_ih_l0 and rnn.weight_hh_l0 : what are these? Recall the LSTM equations that PyTorch implements. In the equations, there are 8 instances of vector-matrix multiplication (when batch=1). These can be combined into a single matrix-matrix multiplication (GEMM), but PyTorch groups these into two GEMM operations: one GEMM multiplies the inputs ( rnn.weight_ih_l0 ), and the other multiplies the hidden-state ( rnn.weight_hh_l0 ).","title":"What are we compressing?"},{"location":"tutorial-lang_model.html#how-are-we-compressing","text":"Let's turn to the configurations of the Large language model compression schedule to 70%, 80%, 90% and 95% sparsity. Using AGP it is easy to configure the pruning schedule to produce an exact sparsity of the compressed model. I'll use the 70% schedule to show a concrete example. The YAML file has two sections: pruners and policies .
Section pruners defines instances of ParameterPruner - in our case we define three instances of AutomatedGradualPruner : for the weights of the first RNN ( l0_rnn_pruner ), the second RNN ( l1_rnn_pruner ) and the embedding layer ( embedding_pruner ). These names are arbitrary, and serve as name-handles which bind Policies to Pruners - so you can use whatever names you want. Each AutomatedGradualPruner is configured with an initial_sparsity and final_sparsity . For example, the l0_rnn_pruner below is configured to prune 5% of the weights as soon as it starts working, and finish when 70% of the weights have been pruned. The weights parameter tells the Pruner which weight tensors to prune. pruners: l0_rnn_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [rnn.weight_ih_l0, rnn.weight_hh_l0] l1_rnn_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [rnn.weight_ih_l1, rnn.weight_hh_l1] embedding_pruner: class: AutomatedGradualPruner initial_sparsity : 0.05 final_sparsity: 0.70 weights: [encoder.weight]","title":"How are we compressing?"},{"location":"tutorial-lang_model.html#when-are-we-compressing","text":"If the pruners section defines \"what-to-do\", the policies section defines \"when-to-do\". This part is harder, because we define the pruning schedule, which requires us to try a few different schedules until we understand which schedule works best. Below we define three PruningPolicy instances. The first two instances start operating at epoch 2 ( starting_epoch ), end at epoch 20 ( ending_epoch ), and operate once every epoch ( frequency ; as I explained above, Distiller's Pruning scheduling operates only at on_epoch_begin ). In between pruning operations, the pruned model is fine-tuned. 
policies: - pruner: instance_name : l0_rnn_pruner starting_epoch: 2 ending_epoch: 20 frequency: 1 - pruner: instance_name : l1_rnn_pruner starting_epoch: 2 ending_epoch: 20 frequency: 1 - pruner: instance_name : embedding_pruner starting_epoch: 3 ending_epoch: 21 frequency: 1 We invoke the compression as follows: $ time python3 main.py --cuda --emsize 1500 --nhid 1500 --dropout 0.65 --tied --compress=../../examples/agp-pruning/word_lang_model.LARGE_70.schedule_agp.yaml Table 1 above shows that we can make a negligible improvement when adding L2 regularization. I did some experimenting with the sparsity distribution between the layers, and the scheduling frequency and noticed that the embedding layers are much less sensitive to pruning than the RNN cells. I didn't notice any difference between the RNN cells, but I also didn't invest in this exploration. A new 70% sparsity schedule , prunes the RNNs only to 50% sparsity, but prunes the embedding to 85% sparsity, and achieves almost a 3 points improvement in the test perplexity results. We provide similar pruning schedules for the other compression rates.","title":"When are we compressing?"},{"location":"tutorial-lang_model.html#until-next-time","text":"This concludes the first part of the tutorial on pruning a PyTorch language model. In the next installment, I'll explain how we added an implementation of Baidu Research's Exploring Sparsity in Recurrent Neural Networks paper, and applied to this language model. Geek On.","title":"Until next time"},{"location":"tutorial-lang_model_quant.html","text":"Post-Training Quantization of a Language Model using Distiller A detailed, Jupyter Notebook based tutorial on this topic is located at <distiller_repo_root>/examples/word_language_model/quantize_lstm.ipynb . You can view a \"read-only\" version of it in the Distiller GitHub repository here . 
The tutorial covers the following: Converting the model to use Distiller's modular LSTM implementation, which allows flexible quantization of internal LSTM operations. Collecting activation statistics prior to quantization Creating a PostTrainLinearQuantizer and preparing the model for quantization \"Net-aware quantization\" capability of PostTrainLinearQuantizer Progressively tweaking the quantization settings in order to improve accuracy","title":"Quantizing a Language Model"},{"location":"tutorial-lang_model_quant.html#post-training-quantization-of-a-language-model-using-distiller","text":"A detailed, Jupyter Notebook based tutorial on this topic is located at <distiller_repo_root>/examples/word_language_model/quantize_lstm.ipynb . You can view a \"read-only\" version of it in the Distiller GitHub repository here . The tutorial covers the following: Converting the model to use Distiller's modular LSTM implementation, which allows flexible quantization of internal LSTM operations. Collecting activation statistics prior to quantization Creating a PostTrainLinearQuantizer and preparing the model for quantization \"Net-aware quantization\" capability of PostTrainLinearQuantizer Progressively tweaking the quantization settings in order to improve accuracy","title":"Post-Training Quantization of a Language Model using Distiller"},{"location":"tutorial-struct_pruning.html","text":"Pruning Filters & Channels Introduction Channel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute. This latter fact makes this form of structured pruning particularly interesting and popular. In networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters. 
However, in more complex models, with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule. This post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures. The details of the implementation are left for a separate post. Before we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature. This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller. I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows. PyTorch describes torch.nn.Conv2d as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d We call each of these input planes a feature-map (or FM, for short). Another name is input channel , as in the R/G/B channels of an image. Some people refer to feature-maps as activations (i.e. the activation of neurons), although I think strictly speaking activations are the output of an activation layer that was fed a group of feature-maps. Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use activations to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps). In the PyTorch documentation Convolution outputs have shape (N, C out , H out , W out ) where N is a batch size, C out denotes a number of output channels, H out is a height of output planes in pixels, and W out is width in pixels. 
We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1. I\u2019m also assuming the most common Convolutions having groups==1 . Convolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity). A kernel is a 2D matrix (K, K) that is part of a 3D feature detector. This feature detector is called a filter and it is basically a stack of 2D kernels . Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C in channels in the input, then there are C in kernels in a filter (C == C in ). Each filter is convolved with the entire input to create a single output channel (i.e. feature-map). If there are C out output channels, then there are C out filters (F == C out ). Filter Pruning Filter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning. In filter pruning we use some criterion to determine which filters are important and which are not. Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples. Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation). Since we have two less filters operating on the input, we must have two less output feature-maps. So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its out_channels ) and the following Convolution layer (change its in_channels ). 
And finally, because the next layer\u2019s input is now smaller (has fewer channels), we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned. We say that there is a data-dependency between the two Convolution layers. I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input. There are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details. The scheduler YAML syntax for this example is pasted below. We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning). The Convolution layers are conveniently named conv1 and conv2 in this example. pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Filters weights: [module.conv1.weight] Now let\u2019s add a Batch Normalization layer between the two convolutions: The Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift). Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer. And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation. Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors. This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples. 
The presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change. Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically. Let\u2019s look at another example, with non-serial data-dependencies. Here, the output of conv1 is the input for conv2 and conv3 . This is an example of parallel data-dependency, since both conv2 and conv3 depend on conv1 . Note that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of conv1 . The weight channels of conv2 and conv3 are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post). Next, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution. In this example conv3 is dependent on both conv1 and conv2 , and there are two implications to this dependency. The first, and more obvious implication, is that we need to prune the same number of filters from both conv1 and conv2 . Since we apply element-wise addition on the outputs of conv1 and conv2 , they must have the same shape - and they can only have the same shape if conv1 and conv2 prune the same number of filters. The second implication of this triangular data-dependency is that both conv1 and conv2 must prune the same filters! Let\u2019s imagine for a moment, that we ignore this second constraint. The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of conv3 ? Obviously, we can\u2019t. We must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to prune conv1 and conv2 according to the filter-pruning choices of conv1 or of conv2 . 
The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of conv1 . The YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above. First we need to tell the Filter Pruner that there is a dependency of type Leader . This means that all of the tensors listed in the weights field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed. In the example below module.conv1.weight and module.conv2.weight are pruned together according to the pruning choices for module.conv1.weight . pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Filters group_dependency: Leader weights: [module.conv1.weight, module.conv2.weight] When we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections. If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception. The exception does not explain the specification error and this needs to be improved. Channel Pruning Channel pruning is very similar to Filter pruning with all the details of dependencies reversed. Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the channels of module.conv2.weight . pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Channels weights: [module.conv2.weight] As the diagram shows, conv1 is now dependent on conv2 and its weights filters will be implicitly pruned according to the channels removed from the weights of conv2 . 
Geek On.","title":"Pruning Filters and Channels"},{"location":"tutorial-struct_pruning.html#pruning-filters-channels","text":"","title":"Pruning Filters &amp; Channels"},{"location":"tutorial-struct_pruning.html#introduction","text":"Channel and filter pruning are examples of structured-pruning which create compressed models that do not require special hardware to execute. This latter fact makes this form of structured pruning particularly interesting and popular. In networks that have serial data dependencies, it is pretty straight-forward to understand and define how to prune channels and filters. However, in more complex models, with parallel-data dependencies (paths) - such as ResNets (skip connections) and GoogLeNet (Inception layers) \u2013 things become increasingly more complex and require a deeper understanding of the data flow in the model, in order to define the pruning schedule. This post explains channel and filter pruning, the challenges, and how to define a Distiller pruning schedule for these structures. The details of the implementation are left for a separate post. Before we dive into pruning, let\u2019s level-set on the terminology, because different people (and even research papers) do not always agree on the nomenclature. This reflects my understanding of the nomenclature, and therefore these are the names used in Distiller. I\u2019ll restrict this discussion to Convolution layers in CNNs, to contain the scope of the topic I\u2019ll be covering, although Distiller supports pruning of other structures such as matrix columns and rows. PyTorch describes torch.nn.Conv2d as applying \u201ca 2D convolution over an input signal composed of several input planes.\u201d We call each of these input planes a feature-map (or FM, for short). Another name is input channel , as in the R/G/B channels of an image. Some people refer to feature-maps as activations (i.e. 
the activation of neurons), although I think strictly speaking activations are the output of an activation layer that was fed a group of feature-maps. Because it is very common, and because the use of an activation is orthogonal to our discussion, I will use activations to refer to the output of a Convolution layer (i.e. 3D stack of feature-maps). In the PyTorch documentation Convolution outputs have shape (N, C out , H out , W out ) where N is a batch size, C out denotes a number of output channels, H out is a height of output planes in pixels, and W out is width in pixels. We won\u2019t be paying much attention to the batch-size since it\u2019s not important to our discussion, so without loss of generality we can set N=1. I\u2019m also assuming the most common Convolutions having groups==1 . Convolution weights are 4D: (F, C, K, K) where F is the number of filters, C is the number of channels, and K is the kernel size (we can assume the kernel height and width are equal for simplicity). A kernel is a 2D matrix (K, K) that is part of a 3D feature detector. This feature detector is called a filter and it is basically a stack of 2D kernels . Each kernel is convolved with a 2D input channel (i.e. feature-map) so if there are C in channels in the input, then there are C in kernels in a filter (C == C in ). Each filter is convolved with the entire input to create a single output channel (i.e. feature-map). If there are C out output channels, then there are C out filters (F == C out ).","title":"Introduction"},{"location":"tutorial-struct_pruning.html#filter-pruning","text":"Filter pruning and channel pruning are very similar, and I\u2019ll expand on that similarity later on \u2013 but for now let\u2019s focus on filter pruning. In filter pruning we use some criterion to determine which filters are important and which are not. 
Researchers came up with all sorts of pruning criteria: the L1-magnitude of the filters (citation), the entropy of the activations (citation), and the classification accuracy reduction (citation) are just some examples. Disregarding how we chose the filters to prune, let\u2019s imagine that in the diagram below, we chose to prune (remove) the green and orange filters (the circle with the \u201c*\u201d designates a Convolution operation). Since we have two less filters operating on the input, we must have two less output feature-maps. So when we prune filters, besides changing the physical size of the weight tensors, we also need to reconfigure the immediate Convolution layer (change its out_channels ) and the following Convolution layer (change its in_channels ). And finally, because the next layer\u2019s input is now smaller (has fewer channels), we should also shrink the next layer\u2019s weights tensors, by removing the channels corresponding to the filters we pruned. We say that there is a data-dependency between the two Convolution layers. I didn\u2019t make any mention of the activation function that usually follows Convolution, because these functions are parameter-less and are not sensitive to the shape of their input. There are some other dependencies that Distiller resolves (such as Optimizer parameters tightly-coupled to the weights) that I won\u2019t discuss here, because they are implementation details. The scheduler YAML syntax for this example is pasted below. We use L1-norm ranking of weight filters, and the pruning-rate is set by the AGP algorithm (Automatic Gradual Pruning). The Convolution layers are conveniently named conv1 and conv2 in this example. 
pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Filters weights: [module.conv1.weight] Now let\u2019s add a Batch Normalization layer between the two convolutions: The Batch Normalization layer is parameterized by a couple of tensors that contain information per input-channel (i.e. scale and shift). Because our Convolution produces less output FMs, and these are the input to the Batch Normalization layer, we also need to reconfigure the Batch Normalization layer. And we also need to physically shrink the Batch Normalization layer\u2019s scale and shift tensors, which are coefficients in the BN input transformation. Moreover, the scale and shift coefficients that we remove from the tensors, must correspond to the filters (or output feature-maps channels) that we removed from the Convolution weight tensors. This small nuance will prove to be a large pain, but we\u2019ll get to that in later examples. The presence of a Batch Normalization layer in the example above is transparent to us, and in fact, the YAML schedule does not change. Distiller detects the presence of Batch Normalization layers and adjusts their parameters automatically. Let\u2019s look at another example, with non-serial data-dependencies. Here, the output of conv1 is the input for conv2 and conv3 . This is an example of parallel data-dependency, since both conv2 and conv3 depend on conv1 . Note that the Distiller YAML schedule is unchanged from the previous two examples, since we are still only explicitly pruning the weight filters of conv1 . The weight channels of conv2 and conv3 are pruned implicitly by Distiller in a process called \u201cThinning\u201d (on which I will expand in a different post). Next, let\u2019s look at another example also involving three Convolutions, but this time we want to prune the filters of two convolutional layers, whose outputs are element-wise-summed and fed into a third Convolution. 
In this example conv3 is dependent on both conv1 and conv2 , and there are two implications to this dependency. The first, and more obvious implication, is that we need to prune the same number of filters from both conv1 and conv2 . Since we apply element-wise addition on the outputs of conv1 and conv2 , they must have the same shape - and they can only have the same shape if conv1 and conv2 prune the same number of filters. The second implication of this triangular data-dependency is that both conv1 and conv2 must prune the same filters! Let\u2019s imagine for a moment, that we ignore this second constraint. The diagram below illustrates the dilemma that arises: how should we prune the channels of the weights of conv3 ? Obviously, we can\u2019t. We must apply the second constraint \u2013 and that means that we now need to be proactive: we need to decide whether to prune conv1 and conv2 according to the filter-pruning choices of conv1 or of conv2 . The diagram below illustrates the pruning scheme after deciding to follow the pruning choices of conv1 . The YAML compression schedule syntax needs to be able to express the two dependencies (or constraints) discussed above. First we need to tell the Filter Pruner that there is a dependency of type Leader . This means that all of the tensors listed in the weights field are pruned together, to the same extent at each iteration, and that to prune the filters we will use the pruning decisions of the first tensor listed. In the example below module.conv1.weight and module.conv2.weight are pruned together according to the pruning choices for module.conv1.weight . pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Filters group_dependency: Leader weights: [module.conv1.weight, module.conv2.weight] When we turn to filter-pruning ResNets we see some pretty long dependency chains because of the skip-connections. 
If you don\u2019t pay attention, you can easily under-specify (or mis-specify) dependency chains and Distiller will exit with an exception. The exception does not explain the specification error and this needs to be improved.","title":"Filter Pruning"},{"location":"tutorial-struct_pruning.html#channel-pruning","text":"Channel pruning is very similar to Filter pruning with all the details of dependencies reversed. Look again at example #1, but this time imagine that we\u2019ve changed our schedule to prune the channels of module.conv2.weight . pruners: example_pruner: class: L1RankedStructureParameterPruner_AGP initial_sparsity : 0.10 final_sparsity: 0.50 group_type: Channels weights: [module.conv2.weight] As the diagram shows, conv1 is now dependent on conv2 and its weights filters will be implicitly pruned according to the channels removed from the weights of conv2 . Geek On.","title":"Channel Pruning"},{"location":"usage.html","text":"Using the sample application The Distiller repository contains a sample application, distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features. Following is a brief discussion of how to use this application and the accompanying schedules. You might also want to refer to the following resources: An explanation of the scheduler file format. An in-depth discussion of how we used these schedule files to implement several state-of-the-art DNN compression research papers. The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. The code is documented and should be considered the best source of documentation, but we provide some elaboration here. This diagram shows where compress_classifier.py fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work. 
Command line arguments To get help on the command line arguments, invoke: $ python3 compress_classifier.py --help For example: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml Parameters: +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 | | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 | | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 | | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 | | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 | | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 | 90.99604 | 0.00589 | -0.00020 | 0.00168 | | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 | | 7 | classifier.6.weight | (1000, 
4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 | | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 | +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ 2018-04-04 21:30:52,499 - Total sparsity: 88.44 2018-04-04 21:30:52,499 - --- validate (epoch=89)----------- 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch) 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250 2018-04-04 21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 74.767857 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150 2018-04-04 21:31:35,364 - Saving checkpoint 2018-04-04 21:31:39,251 - --- test --------------------- 2018-04-04 21:31:39,252 - 50000 samples (256 per mini-batch) 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062 2018-04-04 21:31:58,732 - Test: [ 150/ 195] 
Loss 1.833214 Top1 57.619792 Top5 80.447917 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893 Let's look at the command line again: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration: Learning-rate of 0.005 Print progress every 50 mini-batches. Use 44 worker threads to load data (make sure to use something suitable for your machine). Run for 90 epochs. Torchvision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as a metadata with the model. Therefore, when you load such models, the first epoch is not 0, but it is the last training epoch. The pruning schedule is provided in alexnet.schedule_sensitivity.yaml Log files are written to directory logs . Examples Distiller comes with several example schedules which can be used together with compress_classifier.py . These example schedules (YAML) files, contain the command line that is used in order to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores. For more details on the example schedules, you can refer to the coverage of the Model Zoo . 
examples/agp-pruning : Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset) examples/hybrid : AlexNet AGP with 2D (kernel) regularization (ImageNet dataset) AlexNet sensitivity pruning with 2D regularization examples/network_slimming : ResNet20 Network Slimming (this is work-in-progress) examples/pruning_filters_for_efficient_convnets : ResNet56 baseline training (CIFAR10 dataset) ResNet56 filter removal using filter ranking examples/sensitivity_analysis : Element-wise pruning sensitivity-analysis: AlexNet (ImageNet) MobileNet (ImageNet) ResNet18 (ImageNet) ResNet20 (CIFAR10) ResNet34 (ImageNet) Filter-wise pruning sensitivity-analysis: ResNet20 (CIFAR10) ResNet56 (CIFAR10) examples/sensitivity-pruning : AlexNet sensitivity pruning with Iterative Pruning AlexNet sensitivity pruning with One-Shot Pruning examples/ssl : ResNet20 baseline training (CIFAR10 dataset) Structured Sparsity Learning (SSL) with layer removal on ResNet20 SSL with channels removal on ResNet20 examples/quantization : AlexNet w. Batch-Norm (base FP32 + DoReFa) Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa) Pre-activation ResNet18 on ImageNet (base FP32 + DoReFa) Experiment reproducibility Experiment reproducibility is sometimes important. Pete Warden recently expounded on this in his blog . PyTorch's support for deterministic execution requires us to use only one thread for loading data (otherwise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the --deterministic command-line flag and setting -j=1 will produce reproducible results (for the same PyTorch version). Performing pruning sensitivity analysis Distiller supports element-wise and filter-wise pruning sensitivity analysis. In both cases, L1-norm is used to rank which elements or filters to prune.
For example, when running filter-pruning sensitivity analysis, the L1-norms of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. The analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a smaller dataset would save a lot of time, and we plan to assess whether it provides sufficient results. Results are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ). The implementation is in distiller/sensitivity.py and it contains further details about the process and the format of the CSV file. The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10: $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element The --sense command-line argument can be set to either element or filter , depending on the type of analysis you want done. There is also a Jupyter notebook with example invocations, outputs and explanations. Post-Training Quantization The following example quantizes ResNet18 for ImageNet: $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize-eval --evaluate See here for more details on how to invoke post-training quantization from the command line. A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are also stored in each quantized layer. For more examples of post-training quantization see here . Summaries You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below).
You can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN. Creating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 0.3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-). $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute Generates: +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+ | | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs | |----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------| | 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 | | 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 6 | module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 | | 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) |
8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 | | 10 | module.layer2.1.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 | | 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 | | 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 19 | module.layer3.2.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 | +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+ Total MACs: 40,813,184 Using TensorBoard Google's TensorBoard is an excellent tool for visualizing the progress of DNN training. 
Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). To view the graphs, invoke the TensorBoard server. For example: $ tensorboard --logdir=logs Distiller's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the TensorFlow installation instructions . Collecting activations statistics In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). You can collect activation statistics using the --act-stats command-line flag. For example: $ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e The test parameter indicates that, in this example, we want to collect activation statistics during the test phase. Note that we also used the -e command-line argument to indicate that we want to run a test phase. The other two legal parameter values are train and valid which collect activation statistics during the training and validation phases, respectively. Collectors and their collaterals An instance of a subclass of ActivationStatsCollector can be used to collect activation statistics. Currently, ActivationStatsCollector has two types of subclasses: SummaryActivationStatsCollector and RecordsActivationStatsCollector . Instances of SummaryActivationStatsCollector compute the mean of some statistic of the activation. This type of collector is rather lightweight and quicker than collecting a record per activation. The statistic function is configured in the constructor. In the sample compression application, compress_classifier.py , we create a dictionary of collectors.
For example: SummaryActivationStatsCollector(model, \"sparsity\", lambda t: 100 * distiller.utils.sparsity(t)) The lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in module.sparsity ( \"sparsity\" is this collector's name). To access the statistics, you can invoke collector.value() , or you can access each module's data directly. Another type of collector is RecordsActivationStatsCollector which computes a hard-coded set of activation statistics and collects a record per activation . Because it stores a record for every activation, it is slower than instances of SummaryActivationStatsCollector . ActivationStatsCollector defaults to collecting activation statistics only on the output activations of ReLU layers, but we can choose any layer type we want. In the example below we collect statistics from outputs of torch.nn.Conv2d layers. RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d]) Collectors can write their data to Excel workbooks (which are named using the collector's name), by invoking collector.to_xlsx(path_to_workbook) . In compress_classifier.py we currently create four different collectors which you can selectively disable. You can also add other statistics collectors and use a different function to compute your new statistic. collectors = missingdict({ \"sparsity\": SummaryActivationStatsCollector(model, \"sparsity\", lambda t: 100 * distiller.utils.sparsity(t)), \"l1_channels\": SummaryActivationStatsCollector(model, \"l1_channels\", distiller.utils.activation_channels_l1), \"apoz_channels\": SummaryActivationStatsCollector(model, \"apoz_channels\", distiller.utils.activation_channels_apoz), \"records\": RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])}) By default, these Collectors write their data to files in the active log directory.
You can use a utility function, distiller.log_activation_statsitics , to log the data of an ActivationStatsCollector instance to one of the backend-loggers. For example, the code below logs the \"sparsity\" collector to a TensorBoard log file. distiller.log_activation_statsitics(epoch, \"train\", loggers=[tflogger], collector=collectors[\"sparsity\"]) Caveats Distiller collects activation statistics using PyTorch's forward-hooks mechanism. Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data. Registering for forward callbacks is performed like this: module.register_forward_hook This makes apparent two limitations of this mechanism: We can only register on PyTorch modules. This means that we can't register on the forward hook of functionals such as torch.nn.functional.relu and torch.nn.functional.max_pool2d . Therefore, you may need to replace functionals with their module alternatives. For example: class MadeUpNet(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(3, 6, 5) def forward(self, x): x = F.relu(self.conv1(x)) return x Can be changed to: class MadeUpNet(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(3, 6, 5) self.relu = nn.ReLU(inplace=True) def forward(self, x): x = self.relu(self.conv1(x)) return x We can only use a module instance once in our models. If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature def hook(module, input, output) doesn't provide enough contextual information.
TorchVision's ResNet is an example of a model that uses the same instance of nn.ReLU multiple times: class BasicBlock(nn.Module): expansion = 1 def __init__(self, inplanes, planes, stride=1, downsample=None): super(BasicBlock, self).__init__() self.conv1 = conv3x3(inplanes, planes, stride) self.bn1 = nn.BatchNorm2d(planes) self.relu = nn.ReLU(inplace=True) self.conv2 = conv3x3(planes, planes) self.bn2 = nn.BatchNorm2d(planes) self.downsample = downsample self.stride = stride def forward(self, x): residual = x out = self.conv1(x) out = self.bn1(out) out = self.relu(out) # <================ out = self.conv2(out) out = self.bn2(out) if self.downsample is not None: residual = self.downsample(x) out += residual out = self.relu(out) # <================ return out In Distiller we changed ResNet to use multiple instances of nn.ReLU, and each instance is used only once: class BasicBlock(nn.Module): expansion = 1 def __init__(self, inplanes, planes, stride=1, downsample=None): super(BasicBlock, self).__init__() self.conv1 = conv3x3(inplanes, planes, stride) self.bn1 = nn.BatchNorm2d(planes) self.relu1 = nn.ReLU(inplace=True) self.conv2 = conv3x3(planes, planes) self.bn2 = nn.BatchNorm2d(planes) self.relu2 = nn.ReLU(inplace=True) self.downsample = downsample self.stride = stride def forward(self, x): residual = x out = self.conv1(x) out = self.bn1(out) out = self.relu1(out) # <================ out = self.conv2(out) out = self.bn2(out) if self.downsample is not None: residual = self.downsample(x) out += residual out = self.relu2(out) # <================ return out Using the Jupyter notebooks The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. They are explained in a separate page. 
Generating this documentation Install mkdocs and the required packages by executing: $ pip3 install -r doc-requirements.txt To build the project documentation run: $ cd distiller/docs-src $ mkdocs build --clean This will create a folder named 'site' which contains the documentation website. Open distiller/docs/site/index.html to view the documentation home page.","title":"Usage"},{"location":"usage.html#using-the-sample-application","text":"The Distiller repository contains a sample application, distiller/examples/classifier_compression/compress_classifier.py , and a set of scheduling files which demonstrate Distiller's features. Following is a brief discussion of how to use this application and the accompanying schedules. You might also want to refer to the following resources: An explanation of the scheduler file format. An in-depth discussion of how we used these schedule files to implement several state-of-the-art DNN compression research papers. The sample application supports various features for compression of image classification DNNs, and gives an example of how to integrate distiller in your own application. The code is documented and should be considered the best source of documentation, but we provide some elaboration here. 
This diagram shows where compress_classifier.py fits in the compression workflow, and how we integrate the Jupyter notebooks as part of our research work.","title":"Using the sample application"},{"location":"usage.html#command-line-arguments","text":"To get help on the command line arguments, invoke: $ python3 compress_classifier.py --help For example: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml Parameters: +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ | | Name | Shape | NNZ (dense) | NNZ (sparse) | Cols (%) | Rows (%) | Ch (%) | 2D (%) | 3D (%) | Fine (%) | Std | Mean | Abs-Mean | |----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------| | 0 | features.module.0.weight | (64, 3, 11, 11) | 23232 | 13411 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 42.27359 | 0.14391 | -0.00002 | 0.08805 | | 1 | features.module.3.weight | (192, 64, 5, 5) | 307200 | 115560 | 0.00000 | 0.00000 | 0.00000 | 1.91243 | 0.00000 | 62.38281 | 0.04703 | -0.00250 | 0.02289 | | 2 | features.module.6.weight | (384, 192, 3, 3) | 663552 | 256565 | 0.00000 | 0.00000 | 0.00000 | 6.18490 | 0.00000 | 61.33445 | 0.03354 | -0.00184 | 0.01803 | | 3 | features.module.8.weight | (256, 384, 3, 3) | 884736 | 315065 | 0.00000 | 0.00000 | 0.00000 | 6.96411 | 0.00000 | 64.38881 | 0.02646 | -0.00168 | 0.01422 | | 4 | features.module.10.weight | (256, 256, 3, 3) | 589824 | 186938 | 0.00000 | 0.00000 | 0.00000 | 15.49225 | 0.00000 | 68.30614 | 0.02714 | -0.00246 | 0.01409 | | 5 | classifier.1.weight | (4096, 9216) | 37748736 | 3398881 | 0.00000 | 0.21973 | 0.00000 | 0.21973 | 0.00000 |
90.99604 | 0.00589 | -0.00020 | 0.00168 | | 6 | classifier.4.weight | (4096, 4096) | 16777216 | 1782769 | 0.21973 | 3.46680 | 0.00000 | 3.46680 | 0.00000 | 89.37387 | 0.00849 | -0.00066 | 0.00263 | | 7 | classifier.6.weight | (1000, 4096) | 4096000 | 994738 | 3.36914 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 75.71440 | 0.01718 | 0.00030 | 0.00778 | | 8 | Total sparsity: | - | 61090496 | 7063928 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 88.43694 | 0.00000 | 0.00000 | 0.00000 | +----+---------------------------+------------------+---------------+----------------+------------+------------+----------+----------+----------+------------+---------+----------+------------+ 2018-04-04 21:30:52,499 - Total sparsity: 88.44 2018-04-04 21:30:52,499 - --- validate (epoch=89)----------- 2018-04-04 21:30:52,499 - 128116 samples (256 per mini-batch) 2018-04-04 21:31:04,646 - Epoch: [89][ 50/ 500] Loss 2.175988 Top1 51.289063 Top5 74.023438 2018-04-04 21:31:06,427 - Epoch: [89][ 100/ 500] Loss 2.171564 Top1 51.175781 Top5 74.308594 2018-04-04 21:31:11,432 - Epoch: [89][ 150/ 500] Loss 2.159347 Top1 51.546875 Top5 74.473958 2018-04-04 21:31:14,364 - Epoch: [89][ 200/ 500] Loss 2.156857 Top1 51.585938 Top5 74.568359 2018-04-04 21:31:18,381 - Epoch: [89][ 250/ 500] Loss 2.152790 Top1 51.707813 Top5 74.681250 2018-04-04 21:31:22,195 - Epoch: [89][ 300/ 500] Loss 2.149962 Top1 51.791667 Top5 74.755208 2018-04-04 21:31:25,508 - Epoch: [89][ 350/ 500] Loss 2.150936 Top1 51.827009 Top5 74.767857 2018-04-04 21:31:29,538 - Epoch: [89][ 400/ 500] Loss 2.150853 Top1 51.781250 Top5 74.763672 2018-04-04 21:31:32,842 - Epoch: [89][ 450/ 500] Loss 2.150156 Top1 51.828125 Top5 74.821181 2018-04-04 21:31:35,338 - Epoch: [89][ 500/ 500] Loss 2.150417 Top1 51.833594 Top5 74.817187 2018-04-04 21:31:35,357 - ==> Top1: 51.838 Top5: 74.817 Loss: 2.150 2018-04-04 21:31:35,364 - Saving checkpoint 2018-04-04 21:31:39,251 - --- test --------------------- 2018-04-04 21:31:39,252 - 50000 samples (256 
per mini-batch) 2018-04-04 21:31:51,512 - Test: [ 50/ 195] Loss 1.487607 Top1 63.273438 Top5 85.695312 2018-04-04 21:31:55,015 - Test: [ 100/ 195] Loss 1.638043 Top1 60.636719 Top5 83.664062 2018-04-04 21:31:58,732 - Test: [ 150/ 195] Loss 1.833214 Top1 57.619792 Top5 80.447917 2018-04-04 21:32:01,274 - ==> Top1: 56.606 Top5: 79.446 Loss: 1.893 Let's look at the command line again: $ time python3 compress_classifier.py -a alexnet --lr 0.005 -p 50 ../../../data.imagenet -j 44 --epochs 90 --pretrained --compress=../sensitivity-pruning/alexnet.schedule_sensitivity.yaml In this example, we prune a TorchVision pre-trained AlexNet network, using the following configuration: Learning-rate of 0.005 Print progress every 50 mini-batches. Use 44 worker threads to load data (make sure to use something suitable for your machine). Run for 90 epochs. TorchVision's pre-trained models did not store the epoch metadata, so pruning starts at epoch 0. When you train and prune your own networks, the last training epoch is saved as metadata with the model. Therefore, when you load such models, the first epoch is not 0 but the last training epoch. The pruning schedule is provided in alexnet.schedule_sensitivity.yaml Log files are written to directory logs .","title":"Command line arguments"},{"location":"usage.html#examples","text":"Distiller comes with several example schedules which can be used together with compress_classifier.py . These example schedule (YAML) files contain the command line used to invoke the schedule (so that you can easily recreate the results in your environment), together with the results of the pruning or regularization. The results usually contain a table showing the sparsity of each of the model parameters, together with the validation and test top1, top5 and loss scores. For more details on the example schedules, you can refer to the coverage of the Model Zoo .
examples/agp-pruning : Automated Gradual Pruning (AGP) on MobileNet and ResNet18 (ImageNet dataset) examples/hybrid : AlexNet AGP with 2D (kernel) regularization (ImageNet dataset) AlexNet sensitivity pruning with 2D regularization examples/network_slimming : ResNet20 Network Slimming (this is work-in-progress) examples/pruning_filters_for_efficient_convnets : ResNet56 baseline training (CIFAR10 dataset) ResNet56 filter removal using filter ranking examples/sensitivity_analysis : Element-wise pruning sensitivity-analysis: AlexNet (ImageNet) MobileNet (ImageNet) ResNet18 (ImageNet) ResNet20 (CIFAR10) ResNet34 (ImageNet) Filter-wise pruning sensitivity-analysis: ResNet20 (CIFAR10) ResNet56 (CIFAR10) examples/sensitivity-pruning : AlexNet sensitivity pruning with Iterative Pruning AlexNet sensitivity pruning with One-Shot Pruning examples/ssl : ResNet20 baseline training (CIFAR10 dataset) Structured Sparsity Learning (SSL) with layer removal on ResNet20 SSL with channels removal on ResNet20 examples/quantization : AlexNet w. Batch-Norm (base FP32 + DoReFa) Pre-activation ResNet20 on CIFAR10 (base FP32 + DoReFa) Pre-activation ResNet18 on ImageNet (base FP32 + DoReFa)","title":"Examples"},{"location":"usage.html#experiment-reproducibility","text":"Experiment reproducibility is sometimes important. Pete Warden recently expounded on this in his blog . PyTorch's support for deterministic execution requires us to use only one thread for loading data (otherwise the multi-threaded execution of the data loaders can create random order and change the results), and to set the seed of the CPU and GPU PRNGs. Using the --deterministic command-line flag and setting -j=1 will produce reproducible results (for the same PyTorch version).","title":"Experiment reproducibility"},{"location":"usage.html#performing-pruning-sensitivity-analysis","text":"Distiller supports element-wise and filter-wise pruning sensitivity analysis.
In both cases, L1-norm is used to rank which elements or filters to prune. For example, when running filter-pruning sensitivity analysis, the L1-norms of the filters of each layer's weights tensor are calculated, and the bottom x% are set to zero. The analysis process is quite long, because currently we use the entire test dataset to assess the accuracy performance at each pruning level of each weights tensor. Using a smaller dataset would save a lot of time, and we plan to assess whether it provides sufficient results. Results are output as a CSV file ( sensitivity.csv ) and PNG file ( sensitivity.png ). The implementation is in distiller/sensitivity.py and it contains further details about the process and the format of the CSV file. The example below performs element-wise pruning sensitivity analysis on ResNet20 for CIFAR10: $ python3 compress_classifier.py -a resnet20_cifar ../../../data.cifar10/ -j=1 --resume=../cifar10/resnet20/checkpoint_trained_dense.pth.tar --sense=element The --sense command-line argument can be set to either element or filter , depending on the type of analysis you want done. There is also a Jupyter notebook with example invocations, outputs and explanations.","title":"Performing pruning sensitivity analysis"},{"location":"usage.html#post-training-quantization","text":"The following example quantizes ResNet18 for ImageNet: $ python3 compress_classifier.py -a resnet18 ../../../data.imagenet --pretrained --quantize-eval --evaluate See here for more details on how to invoke post-training quantization from the command line. A checkpoint with the quantized model will be dumped in the run directory. It will contain the quantized model parameters (the data type will still be FP32, but the values will be integers). The calculated quantization parameters (scale and zero-point) are also stored in each quantized layer.
For more examples of post-training quantization see here .","title":"Post-Training Quantization"},{"location":"usage.html#summaries","text":"You can use the sample compression application to generate model summary reports, such as the attributes and compute summary report (see screen capture below). You can log sparsity statistics (written to console and CSV file), performance, optimizer and model information, and also create a PNG image of the DNN. Creating a PNG image is an experimental feature (it relies on features which are not available on PyTorch 0.3.1 and that we hope will be available in PyTorch's next release), so to use it you will need to compile the PyTorch master branch, and hope for the best ;-). $ python3 compress_classifier.py --resume=../ssl/checkpoints/checkpoint_trained_ch_regularized_dense.pth.tar -a=resnet20_cifar ../../../data.cifar10 --summary=compute Generates: +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+ | | Name | Type | Attrs | IFM | IFM volume | OFM | OFM volume | Weights volume | MACs | |----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------| | 0 | module.conv1 | Conv2d | k=(3, 3) | (1, 3, 32, 32) | 3072 | (1, 16, 32, 32) | 16384 | 432 | 442368 | | 1 | module.layer1.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 2 | module.layer1.0.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 3 | module.layer1.1.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 4 | module.layer1.1.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 5 | module.layer1.2.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 6 |
module.layer1.2.conv2 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 16, 32, 32) | 16384 | 2304 | 2359296 | | 7 | module.layer2.0.conv1 | Conv2d | k=(3, 3) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 4608 | 1179648 | | 8 | module.layer2.0.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 9 | module.layer2.0.downsample.0 | Conv2d | k=(1, 1) | (1, 16, 32, 32) | 16384 | (1, 32, 16, 16) | 8192 | 512 | 131072 | | 10 | module.layer2.1.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 11 | module.layer2.1.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 12 | module.layer2.2.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 13 | module.layer2.2.conv2 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 32, 16, 16) | 8192 | 9216 | 2359296 | | 14 | module.layer3.0.conv1 | Conv2d | k=(3, 3) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 18432 | 1179648 | | 15 | module.layer3.0.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 16 | module.layer3.0.downsample.0 | Conv2d | k=(1, 1) | (1, 32, 16, 16) | 8192 | (1, 64, 8, 8) | 4096 | 2048 | 131072 | | 17 | module.layer3.1.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 18 | module.layer3.1.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 19 | module.layer3.2.conv1 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 20 | module.layer3.2.conv2 | Conv2d | k=(3, 3) | (1, 64, 8, 8) | 4096 | (1, 64, 8, 8) | 4096 | 36864 | 2359296 | | 21 | module.fc | Linear | | (1, 64) | 64 | (1, 10) | 10 | 640 | 640 | +----+------------------------------+--------+----------+-----------------+--------------+-----------------+--------------+------------------+---------+ Total MACs: 
40,813,184","title":"Summaries"},{"location":"usage.html#using-tensorboard","text":"Google's TensorBoard is an excellent tool for visualizing the progress of DNN training. Distiller's logger supports writing performance indicators and parameter statistics in a file format that can be read by TensorBoard (Distiller uses TensorFlow's APIs in order to do this, which is why Distiller requires the installation of TensorFlow). To view the graphs, invoke the TensorBoard server. For example: $ tensorboard --logdir=logs Distiller's setup (requirements.txt) installs TensorFlow for CPU. If you want a different installation, please follow the TensorFlow installation instructions .","title":"Using TensorBoard"},{"location":"usage.html#collecting-activations-statistics","text":"In CNNs with ReLU layers, ReLU activations (feature-maps) also exhibit a nice level of sparsity (50-60% sparsity is typical). You can collect activation statistics using the --act-stats command-line flag. For example: $ python3 compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --resume=checkpoint.resnet56_cifar_baseline.pth.tar --act-stats=test -e The test parameter indicates that, in this example, we want to collect activation statistics during the test phase. Note that we also used the -e command-line argument to indicate that we want to run a test phase. The other two legal parameter values are train and valid which collect activation statistics during the training and validation phases, respectively.","title":"Collecting activations statistics"},{"location":"usage.html#collectors-and-their-collaterals","text":"An instance of a subclass of ActivationStatsCollector can be used to collect activation statistics. Currently, ActivationStatsCollector has two types of subclasses: SummaryActivationStatsCollector and RecordsActivationStatsCollector . Instances of SummaryActivationStatsCollector compute the mean of some statistic of the activation.
This type of collector is rather lightweight and quicker than collecting a record per activation. The statistic function is configured in the constructor. In the sample compression application, compress_classifier.py , we create a dictionary of collectors. For example: SummaryActivationStatsCollector(model, \"sparsity\", lambda t: 100 * distiller.utils.sparsity(t)) The lambda expression is invoked per activation encountered during forward passes, and the value it returns (in this case, the sparsity of the activation tensors, multiplied by 100) is stored in module.sparsity ( \"sparsity\" is this collector's name). To access the statistics, you can invoke collector.value() , or you can access each module's data directly. Another type of collector is RecordsActivationStatsCollector which computes a hard-coded set of activation statistics and collects a record per activation . Because it stores a record for every activation, it is slower than instances of SummaryActivationStatsCollector . ActivationStatsCollector defaults to collecting activation statistics only on the output activations of ReLU layers, but we can choose any layer type we want. In the example below we collect statistics from outputs of torch.nn.Conv2d layers. RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d]) Collectors can write their data to Excel workbooks (which are named using the collector's name), by invoking collector.to_xlsx(path_to_workbook) . In compress_classifier.py we currently create four different collectors which you can selectively disable. You can also add other statistics collectors and use a different function to compute your new statistic.
collectors = missingdict({ \"sparsity\": SummaryActivationStatsCollector(model, \"sparsity\", lambda t: 100 * distiller.utils.sparsity(t)), \"l1_channels\": SummaryActivationStatsCollector(model, \"l1_channels\", distiller.utils.activation_channels_l1), \"apoz_channels\": SummaryActivationStatsCollector(model, \"apoz_channels\", distiller.utils.activation_channels_apoz), \"records\": RecordsActivationStatsCollector(model, classes=[torch.nn.Conv2d])}) By default, these Collectors write their data to files in the active log directory. You can use a utility function, distiller.log_activation_statsitics, to log the data of an ActivationStatsCollector instance to one of the backend-loggers. For example, the code below logs the \"sparsity\" collector to a TensorBoard log file. distiller.log_activation_statsitics(epoch, \"train\", loggers=[tflogger], collector=collectors[\"sparsity\"])","title":"Collectors and their collaterals"},{"location":"usage.html#caveats","text":"Distiller collects activation statistics using PyTorch's forward-hooks mechanism. Collectors iteratively register the modules' forward-hooks, and collectors are called during the forward traversal and get exposed to activation data. Registering for forward callbacks is performed like this: module.register_forward_hook This makes apparent two limitations of this mechanism: We can only register on PyTorch modules. This means that we can't register on the forward hooks of functionals such as torch.nn.functional.relu and torch.nn.functional.max_pool2d. Therefore, you may need to replace functionals with their module alternatives. 
For example: class MadeUpNet(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(3, 6, 5) def forward(self, x): x = F.relu(self.conv1(x)) return x Can be changed to: class MadeUpNet(nn.Module): def __init__(self): super().__init__() self.conv1 = nn.Conv2d(3, 6, 5) self.relu = nn.ReLU(inplace=True) def forward(self, x): x = self.relu(self.conv1(x)) return x We can only use a module instance once in our models. If we use the same module several times, then we can't determine which node in the graph has invoked the callback, because the PyTorch callback signature def hook(module, input, output) doesn't provide enough contextual information. TorchVision's ResNet is an example of a model that uses the same instance of nn.ReLU multiple times: class BasicBlock(nn.Module): expansion = 1 def __init__(self, inplanes, planes, stride=1, downsample=None): super(BasicBlock, self).__init__() self.conv1 = conv3x3(inplanes, planes, stride) self.bn1 = nn.BatchNorm2d(planes) self.relu = nn.ReLU(inplace=True) self.conv2 = conv3x3(planes, planes) self.bn2 = nn.BatchNorm2d(planes) self.downsample = downsample self.stride = stride def forward(self, x): residual = x out = self.conv1(x) out = self.bn1(out) out = self.relu(out) # <================ out = self.conv2(out) out = self.bn2(out) if self.downsample is not None: residual = self.downsample(x) out += residual out = self.relu(out) # <================ return out In Distiller we changed ResNet to use multiple instances of nn.ReLU, and each instance is used only once: class BasicBlock(nn.Module): expansion = 1 def __init__(self, inplanes, planes, stride=1, downsample=None): super(BasicBlock, self).__init__() self.conv1 = conv3x3(inplanes, planes, stride) self.bn1 = nn.BatchNorm2d(planes) self.relu1 = nn.ReLU(inplace=True) self.conv2 = conv3x3(planes, planes) self.bn2 = nn.BatchNorm2d(planes) self.relu2 = nn.ReLU(inplace=True) self.downsample = downsample self.stride = stride def forward(self, x): residual = x out = 
self.conv1(x) out = self.bn1(out) out = self.relu1(out) # <================ out = self.conv2(out) out = self.bn2(out) if self.downsample is not None: residual = self.downsample(x) out += residual out = self.relu2(out) # <================ return out","title":"Caveats"},{"location":"usage.html#using-the-jupyter-notebooks","text":"The Jupyter notebooks contain many examples of how to use the statistics summaries generated by Distiller. They are explained in a separate page.","title":"Using the Jupyter notebooks"},{"location":"usage.html#generating-this-documentation","text":"Install mkdocs and the required packages by executing: $ pip3 install -r doc-requirements.txt To build the project documentation run: $ cd distiller/docs-src $ mkdocs build --clean This will create a folder named 'site' which contains the documentation website. Open distiller/docs/site/index.html to view the documentation home page.","title":"Generating this documentation"}]}