{"title": "Compacting, Picking and Growing for Unforgetting Continual Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 13669, "page_last": 13679, "abstract": "Continual lifelong learning is essential to many applications. In this paper, we propose a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression, critical weights selection, and progressive networks expansion. By enforcing their integration in an iterative manner, we introduce an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and owns several favorable characteristics. First, it can avoid forgetting (i.e., learn new tasks while remembering all previous tasks). Second, it allows model expansion but can maintain the model compactness when handling sequential tasks. Besides, through our compaction and selection/expansion mechanism, we show that the knowledge accumulated through learning previous tasks is helpful to build a better model for the new tasks compared to training the models independently with tasks. Experimental results show that our approach can incrementally learn a deep model tackling multiple tasks without forgetting, while the model compactness is maintained with the performance more satisfiable than individual task training.", "full_text": "Compacting, Picking and Growing for Unforgetting\n\nContinual Learning\n\nSteven C. Y. Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen,\n\nYi-Ming Chan, and Chu-Song Chen\n\nInstitute of Information Science, Academia Sinica, Taipei, Taiwan\n\nMOST Joint Research Center for AI Technology and All Vista Healthcare\n\n{brent12052003, andytu455176}@gmail.com,\n\n{chengen, redsword26, yiming, song}@iis.sinica.edu.tw\n\nAbstract\n\nContinual lifelong learning is essential to many applications. 
In this paper, we propose a simple but effective approach to continual deep learning. Our approach leverages the principles of deep model compression, critical weights selection, and progressive networks expansion. By enforcing their integration in an iterative manner, we introduce an incremental learning method that is scalable to the number of sequential tasks in a continual learning process. Our approach is easy to implement and has several favorable characteristics. First, it can avoid forgetting (i.e., learn new tasks while remembering all previous tasks). Second, it allows model expansion but can maintain the model compactness when handling sequential tasks. Besides, through our compaction and selection/expansion mechanism, we show that the knowledge accumulated through learning previous tasks is helpful in building a better model for new tasks, compared to training the models independently per task. Experimental results show that our approach can incrementally learn a deep model tackling multiple tasks without forgetting, while the model compactness is maintained and the performance is more satisfactory than individual task training.

1 Introduction

Continual lifelong learning [42, 28] has received much attention in recent deep learning studies. In this research track, we hope to learn a model capable of handling unknown sequential tasks while keeping the performance of the model on previously learned tasks. In continual lifelong learning, the training data of previous tasks are assumed unavailable for the newly coming tasks. Although the model learned can be used as a pre-trained model, fine-tuning a model for the new task will force the model parameters to fit the new data, which causes catastrophic forgetting [24, 31] on previous tasks. To lessen the effect of catastrophic forgetting, techniques leveraging regularization of gradients or weights during training have been studied [14, 49, 19, 35]. 
In Kirkpatrick et al. [14] and Zenke et al. [49], the proposed algorithms regularize the network weights and hope to find a convergence common to the current and previous tasks. Schwarz et al. [40] introduce a network-distillation method for regularization, which imposes constraints on the neural weights adapted from the teacher to the student network and applies elastic weight consolidation (EWC) [14] for incremental training. The regularization-based approaches reduce the effect of catastrophic forgetting. However, as the training data of previous tasks are missing during learning and the network capacity is fixed (and limited), the regularization approaches often forget the learned skills gradually. Earlier tasks tend to be forgotten more catastrophically in general. Hence, they would not be a favorable choice when the number of sequential tasks is unlimited.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Compacting, Picking, and Growing (CPG) continual learning. Given a well-trained model, gradual pruning is applied to compact the model and release redundant weights. The compact model weights are kept to avoid forgetting. Then a learnable binary weight-picking mask is trained, along with the previously released space, for new tasks, so as to effectively reuse the knowledge of previous tasks. The model can be expanded for new tasks if it does not meet the performance goal. Best viewed in color.

To address the data-missing issue (i.e., the lack of training data of old tasks), data-preserving and memory-replay techniques have been introduced. Data-preserving approaches (such as [32, 3, 11, 34]) are designed to directly save important data or latent codes in an efficient form, while memory-replay approaches [41, 46, 13, 45, 27] introduce additional memory models such as GANs for keeping data information or distribution in an indirect way. 
The memory models have the ability to replay previous data. Based on past data information, we can then train a model such that its performance on the old tasks is recovered to a considerable extent. However, a general issue of memory-replay approaches is that they require explicit re-training using the old information accumulated, which leads to either a large working memory or a compromise between the information memorized and forgetting.

This paper introduces an approach for learning sustainable but compact deep models, which can handle an unlimited number of sequential tasks while avoiding forgetting. As a limited architecture cannot be guaranteed to remember the skills incrementally learned from unlimited tasks, our method allows growing the architecture to some extent. However, we also remove the model redundancy during continual learning, and thus can increasingly compact multiple tasks with very limited model expansion. Besides, pre-training or gradually fine-tuning the models from a starting task only incorporates prior knowledge at initialization; hence, the knowledge base diminishes as tasks pass. As humans have the ability to continually acquire, fine-tune and transfer knowledge and skills throughout their lifespan [28], in lifelong learning we would hope that the experience accumulated from previous tasks is helpful for learning a new task. As the model incrementally learned by our method serves as a compact, unforgetting base, it generally yields a better model for subsequent tasks than training the tasks independently. Experimental results reveal that our lifelong learning method can leverage the knowledge accumulated from the past to enhance the performance of new tasks.

Motivation of Our Method Design: Our method is designed by combining the ideas of deep model compression via weights pruning (Compacting), critical weights selection (Picking), and ProgressiveNet extension (Growing). 
We refer to it as CPG; the rationale is given below.

As stated above, although the regularization or memory-replay approaches lessen the effect of forgetting, they often do not guarantee to preserve the performance of previous tasks. To exactly avoid forgetting, a promising way is to keep the already-learned old-task weights [38, 47] and enlarge the network by adding nodes or weights for training new tasks. In ProgressiveNet [38], to ease the training of new tasks, the old-task weights are shared with the new ones but remain fixed, and only the new weights are adapted for the new task. As the old-task weights are kept, the performance of learned tasks is ensured. However, as the complexity of the model architecture is proportional to the number of tasks, it yields a highly redundant structure for keeping multiple models.

Motivated by ProgressiveNet, we design a method that also allows the architecture to be sustainable. To avoid constructing a complex and huge structure like ProgressiveNet, we perform model compression for the current task every time, so that a condensed model is established for the old tasks. According to deep-net compression [10], much redundancy is contained in a neural network, and removing the redundant weights does not affect the network performance. Our approach exploits this property, compressing the current task by deleting negligible weights. This yields a compressing-and-growing loop for a sequence of tasks. Following the idea of ProgressiveNet, the weights preserved for the old tasks are set as invariant to avoid forgetting in our approach. 
However, unlike ProgressiveNet, where the architecture is always grown for a new task, the weights deleted from the current task can be released for use by new tasks; we thus do not have to grow the architecture every time, but can employ the previously released weights for learning the next task. Therefore, in the growing step of our CPG approach, two choices are provided. The first is to use the previously released weights for the new task. If the performance goal is not yet fulfilled when all the released weights are used, we proceed to the second choice, where the architecture is expanded and both the released and expanded weights are used for the new-task training.

Another distinction of our approach is the "picking" step, motivated as follows. In ProgressiveNet, the preserved old-task weights are all co-used (yet remain fixed) for learning the new tasks. However, as the number of tasks increases, the amount of old-task weights grows as well. When all of them are co-used with the weights newly added in the growing step, the old weights (which are fixed) act like inertia, since only the fewer new weights are allowed to adapt; in our experience this tends to slow down the learning process and leave the solution found immature. To address this issue, we do not employ all the old-task weights, but pick only some critical ones from them via a differentiable mask. In the picking step of our CPG approach, the old weights' picking-mask and the new weights added in the growing step are both adapted to learn an initial model for the new task. Then, likewise, the initial model obtained is compressed and preserved for the new task as well.

To compress the weights for a task, a main difficulty is the lack of prior knowledge for determining the pruning ratio. 
To solve this problem, in the compacting step of our CPG approach we employ the gradual pruning procedure [51], which iteratively prunes a small portion of the weights and retrains the remaining weights to restore the performance. The procedure stops when a pre-defined accuracy goal is met. Note that only the newly added weights (from the released and/or expanded ones in the growing step) are allowed to be pruned, whereas the old-task weights remain unchanged.

Method Overview: Our method alternates between compressing the deep model and (selectively) expanding the architecture. First, a compressed model is built by pruning. Given a new task, the weights of the old-task models are fixed. Next, we pick and re-use, via a differentiable mask, some of the old-task weights critical to the new task, and learn with them the previously released weights together. If the accuracy goal is not yet attained, the architecture can be expanded by adding filters or nodes to the model, and the procedure is resumed. Then we repeat the gradual pruning [51] (i.e., iteratively removing a portion of the weights and retraining) to compact the model of the new task. An overview of our CPG approach is given in Figure 1. The new-task weights are formed by a combination of two parts: the first part is picked, via a learnable mask, from the old-task weights, and the second part is learned by gradual pruning/retraining of the extra weights. As the old-task weights are only picked but kept fixed, we can integrate the required function mappings into a compact model without affecting their inference accuracy. The main characteristics of our approach are summarized as follows.

Avoid forgetting: Our approach ensures unforgetting. 
The function mappings previously built are maintained exactly as they are when new tasks are incrementally added.

Expand with shrinking: Our method allows expansion but keeps the compactness of the architecture, and thus can potentially handle unlimited sequential tasks. Experimental results reveal that multiple tasks can be condensed in a model with slight or no architecture growing.

Compact knowledge base: Experimental results show that the condensed model recorded for previous tasks serves as a knowledge base with accumulated experience for weights picking in our approach, which yields performance enhancement for learning new tasks.

2 Related Work

Continual lifelong learning [28] can be divided into three main categories: network regularization, memory or data replay, and dynamic architecture. Besides, task-free settings [2] and formulations as program synthesis [43] have also been studied recently. In the following, we give a brief review of works in the main categories; readers are referred to a recent survey paper [28] for more studies.

Network regularization: The key idea of network-regularization approaches [14, 49, 19, 35, 40, 5, 6] is to restrictively update the learned model weights. To keep the learned task information, penalties are added to the change of weights. EWC [14] uses Fisher information to evaluate the importance of weights for old tasks, and updates weights according to their degree of importance. Based on similar ideas, the method in [49] calculates the importance from the learning trajectory. Online EWC [40] and EWC++ [5] improve the efficiency issues of EWC. Learning without Memorizing (LwM) [6] presents an information-preserving penalty. The approach builds an attention map, and hopes that the attention regions of the previous and concurrent models are consistent. 
These works alleviate catastrophic forgetting but cannot exactly guarantee the previous-task accuracy.

Memory replay: Memory or data replay methods [32, 41, 13, 3, 46, 45, 11, 34, 33, 27] use additional models to remember data information. Generative Replay [41] introduces GANs to lifelong learning. It uses a generator to sample fake data whose distribution is similar to that of the previous data. New tasks can then be trained with these generated data. Memory Replay GANs (MeRGANs) [45] shows that the forgetting phenomenon still exists in a generator, and that the quality of the generated data becomes worse with incoming tasks; it uses replay data to enhance the generator quality. Dynamic Generative Memory (DGM) [27] uses neural masking to learn connection plasticity in conditional generative models, and sets up a dynamic expansion mechanism in the generator for sequential tasks. Although these methods can exploit data information, they still cannot guarantee the exact performance of past tasks.

Dynamic architecture: Dynamic-architecture approaches [38, 20, 36, 29, 48] adapt the architecture to a sequence of tasks. ProgressiveNet [38] expands the architecture for new tasks and keeps the function mappings by preserving the previous weights. LwF [20] divides the model layers into two parts, shared and task-specific, where the former are co-used by tasks and the latter are grown with further branches for new tasks. DAN [36] extends the architecture per new task, while each layer in the new-task model is a sparse linear combination of the original filters in the corresponding layer of a base model. Architecture expansion has also been adopted in a recent memory-replay approach [27] on GANs. 
These methods can considerably lessen or avoid catastrophic forgetting via architecture expansion, but the model grows monotonically and a redundant structure is yielded.

As continually growing the architecture retains the model redundancy, some approaches perform model compression before expansion [48] so that a compact model can be built. The method most related to ours is the Dynamic-expansion Net (DEN) [48]. DEN reduces the weights of the previous tasks via sparse regularization. Newly added weights and old weights are both adapted for the new task with sparse constraints. However, DEN does not ensure non-forgetting: as the old-task weights are jointly trained with the new weights, part of the old-task weights are selected and modified. Hence, a "Split & Duplication" step is introduced to further 'restore' some of the modified old weights, so as to lessen the forgetting effect. Pack and Expand (PAE) [12] is our previous approach that takes advantage of PackNet [23] and ProgressiveNet [38]. It can avoid forgetting, maintain model compactness, and allow dynamic model expansion. However, as it shares all weights of previous tasks, the performance becomes less favorable when learning a new task.

Our approach (CPG) is accomplished by a compacting→picking(→growing) loop, which selects critical weights from old tasks without modifying them, and thus avoids forgetting. Besides, our approach does not have to restore the old-task performance like DEN, as the performance is already kept; it thus avoids the tedious "Split & Duplication" process, which takes extra time for model adjustment and affects the new-task performance. Our approach is hence simple and easy to implement. 
In the experimental results, we show that our approach also outperforms DEN and PAE.

3 The CPG approach for Continual Lifelong Learning

Without loss of generality, our work follows a task-based sequential learning setup, a common setting in continual learning. In the following, we present our method in the sequential-task manner.

Task 1: Given the first task (task-1) and an initial model trained on its dataset, we perform gradual pruning [51] on the model to remove its redundancy while keeping the performance. Instead of pruning weights in one pass to the target pruning ratio, gradual pruning removes a portion of the weights and retrains the model to restore the performance, iterating until the pruning criteria are met. Thus, we compact the current model so that the redundancy among the model weights is removed (or released). The weights in the compact model are then set unalterable and remain fixed to avoid forgetting. After gradual pruning, the model weights can be divided into two parts: one is preserved for task 1; the other is released and can be employed by subsequent tasks.

Task k to k+1: Assume that in task-k, a compact model that can handle tasks 1 to k has been built and is available. The model weights preserved for tasks 1 to k are denoted as WP_1:k. The released (redundant) weights associated with task-k are denoted as WE_k; they are extra weights that can be used for subsequent tasks. Given the dataset of task-(k+1), we apply a learnable mask M to pick the old weights WP_1:k, where M is in {0,1}^D with D the dimension of WP_1:k. The weights picked are then represented as M ⊙ WP_1:k, the element-wise product of the 0-1 mask M and WP_1:k. Without loss of generality, we use the piggyback approach [22], which learns a real-valued mask M̂ and applies a threshold for binarization to construct M. Hence, given a new task, we pick a set of weights (known as the critical weights) from the compact model via a learnable mask. Besides, we also use the released weights WE_k for the new task. The mask M and the additional weights WE_k are learned together on the training data of task-(k+1), with the loss function of task-(k+1), via back-propagation. Since the binarized mask M is not differentiable, when training the binary mask M we update the real-valued mask M̂ in the backward pass; M is then quantized with a threshold on M̂ and applied in the forward pass. If the performance is not yet satisfactory, the model architecture can be grown to include more weights for training; that is, WE_k can be augmented with additional weights (such as new filters in convolutional layers and nodes in fully-connected layers), and the training of both M and WE_k is resumed. Note that during training, the mask M and the new weights WE_k are adapted, but the old weights WP_1:k are "picked" only and remain fixed. Thus, old tasks can be exactly recalled.

Compaction of task k+1: After M and WE_k are learned, an initial model of task-(k+1) is obtained. Then, we fix the mask M and apply gradual pruning to compress WE_k, so as to get the compact model WP_(k+1) for task-(k+1) and the redundant (released) weights WE_(k+1). The compact model of the old tasks then becomes WP_1:(k+1) = WP_1:k ∪ WP_(k+1). The compacting and picking/growing loop is repeated from task to task. Details of CPG continual learning are listed in Algorithm 1.
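To make the forward/backward asymmetry of the mask concrete, the picking step can be sketched in NumPy as below. This is a minimal illustration only; the function names, the threshold value, and the plain-SGD update are our assumptions, not the paper's implementation.

```python
import numpy as np

def pick_weights(w_old, m_real, threshold=0.005):
    """Forward pass: binarize the real-valued mask and pick old weights.

    M = 1[m_real > threshold] in {0,1}^D, and picked = M * w_old
    (the element-wise product of the 0-1 mask and the fixed old weights).
    """
    m_bin = (m_real > threshold).astype(w_old.dtype)
    return m_bin * w_old, m_bin

def update_real_mask(m_real, grad_picked, w_old, lr=0.1):
    """Backward pass: the binary mask M is not differentiable, so the
    gradient arriving at the picked weights is routed to the real-valued
    mask (piggyback-style): d(picked)/dM = w_old, and dM/dm_real is
    treated as identity. w_old itself is never updated, so the old
    tasks stay intact."""
    return m_real - lr * grad_picked * w_old
```

A training step would call pick_weights in the forward pass and update_real_mask with the incoming gradient in the backward pass, leaving w_old untouched throughout.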
Algorithm 1: Compacting, Picking and Growing Continual Learning
Input: task 1 and an original model trained on task 1.
Set an accuracy goal for task 1;
Alternately remove small weights and re-train the remaining weights for task 1 via gradual pruning [51], as long as the accuracy goal still holds;
Let the model weights preserved for task 1 be WP_1 (referred to as the task-1 weights), and those removed by the iterative pruning be WE_1 (referred to as the released weights);
for task k = 2 ... K (let the released weights of task k be WE_k) do
  Set an accuracy goal for task k;
  Apply a mask M to the weights WP_1:k-1; train both M and WE_k-1 for task k, with WP_1:k-1 fixed;
  If the accuracy goal is not achieved, expand the number of filters (weights) in the model, reset WE_k-1, and go to the previous step;
  Gradually prune WE_k-1 to obtain WE_k and WP_k = WE_k-1 \ WE_k (with WP_1:k-1 fixed) for task k, until meeting the accuracy goal;
  WP_1:k = WP_1:k-1 ∪ WP_k;
end

4 Experiments and Results

We perform three experiments to verify the effectiveness of our approach. The first experiment contains 20 tasks organized from the CIFAR-100 dataset [16]. In the second experiment, we follow the same settings as the PackNet [23] and Piggyback [22] approaches, where several fine-grained datasets are chosen for classification in an incremental manner. 
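Viewed as bookkeeping over weight slots, the loop of Algorithm 1 can be sketched as follows. This is a toy simulation under stated assumptions: integer indices stand in for weights, and gradual pruning with an accuracy goal is replaced by a fixed keep-count; all names and numbers are illustrative, not part of the method.

```python
def cpg_loop(num_tasks, capacity=100, keep_per_task=30, grow_by=50):
    """Toy CPG bookkeeping: each task trains on the released weights W^E,
    'pruning' keeps keep_per_task of them as W^P_k (fixed forever), and
    the architecture grows only when the released pool is too small."""
    free = set(range(capacity))   # released weights W^E available for training
    preserved = {}                # W^P_k per task; never modified afterwards
    size = capacity               # current number of weight slots in the model
    for k in range(1, num_tasks + 1):
        if len(free) < keep_per_task:             # growing step: expand model
            free |= set(range(size, size + grow_by))
            size += grow_by
        kept = set(sorted(free)[:keep_per_task])  # survives 'gradual pruning'
        preserved[k] = kept                       # compacting step: fix W^P_k
        free -= kept                              # the rest stays released
    return preserved, free, size
```

With capacity=100, keep_per_task=30 and grow_by=50, five tasks trigger exactly one expansion: tasks 1-3 consume 90 of the 100 initial slots, task 4 forces growth to 150 slots, and afterwards every preserved slot belongs to exactly one task, mirroring the unforgetting guarantee.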
In the third experiment, we start from face verification and compact three further facial-informatic tasks (expression, gender, and age) incrementally, to examine the performance of our continual learning approach in a realistic scenario. We implement our CPG approach1 and independent task learning (from scratch or fine-tuning) via PyTorch [30] in all experiments, but implement DEN [48] via TensorFlow [1] with its official code.

1Our codes are available at https://github.com/ivclab/CPG.

Figure 2: The accuracy of DEN, Finetune and CPG for the sequential tasks 1, 5, 10, 15 on CIFAR-100 (panels (a)-(d): Task-1, Task-5, Task-10, Task-15).

4.1 Twenty Tasks of CIFAR-100

We divide the CIFAR-100 dataset into 20 tasks. Each task has 5 classes, 2500 training images, and 500 testing images. In this experiment, the VGG16-BN model (VGG16 with batch-normalization layers) is employed to train the 20 tasks sequentially. First, we compare our approach with DEN [48] (as it also uses an alternating mechanism of compression and expansion) and fine-tuning. To implement fine-tuning, we train task-1 from scratch by using VGG16-BN; then, assuming the models of tasks 1 to k are available, we train the model of task-(k+1) by fine-tuning one of the models randomly selected from tasks 1 to k. We repeat this process 5 times and take the average accuracy (referred to as Finetune Avg). To implement our CPG approach, task-1 is also trained by using VGG16-BN, and this initial model is adapted for the sequential tasks following Algorithm 1. DEN is implemented via the official code provided by the authors and modified for VGG16-BN.

Figure 2 shows the classification accuracy of DEN, fine-tuning, and our CPG. Figure 2(a) is the accuracy of task-1 when all of the 20 tasks have been trained. Initially, the accuracy of DEN is higher than that of CPG and fine-tuning, although the same model is trained from scratch. 
We conjecture that this is because they are implemented on different platforms (TensorFlow vs. PyTorch). Nevertheless, the performance of task-1 gradually drops when the other tasks (2 to 20) are incrementally learned in DEN, as shown in Figure 2(a), and the drops are particularly significant for tasks 15 to 20. In Figure 2(b), the initial accuracy of DEN on task-5 becomes a little worse than that of CPG and fine-tuning. This reveals that DEN could not employ the previously learned model (tasks 1-4) to enhance the performance of the current task (task 5). Besides, the accuracy of task-5 still drops when the new tasks (6-20) are learned. Similarly, for tasks 10 and 15, respectively shown in Figures 2(c) and (d), DEN has a large performance gap on the initial model, with the accuracy likewise dropping as tasks increase. We attribute this phenomenon to the following. As DEN does not guarantee unforgetting, a "Split & Duplication" step is enforced to recover the old-task performance. Though DEN tries to preserve the learned tasks as much as it can via optimizing weight sparsity, the tuning of the hyperparameters in its loss function makes it non-intuitive for DEN to balance learning the current task against remembering the previous tasks. The performance thus drops although we have tried our best in tuning it. On the other hand, fine-tuning and our CPG have roughly the same initial accuracy on task-1 (both are trained from scratch), whereas CPG gradually outperforms fine-tuning on tasks 5, 10, and 15 in Figure 2. The results suggest that our approach can exploit the accumulated knowledge base to enhance the new-task performance. After model growing for 20 tasks, the final amount of weights is increased by 1.09 times (compared to VGG16-BN) for both DEN and CPG. 
Hence, our approach not only ensures maintaining the old-task performance (as the horizontal lines in Figure 2 show), but also effectively accumulates the weights for knowledge picking.

Table 1: The performance of PackNet, PAE and CPG on the CIFAR-100 twenty tasks. We use Avg., Exp. and Red. as abbreviations for average accuracy, expansion weights and redundant weights; columns 1-20 give the per-task accuracy (%).

PackNet: 66.4 80.0 76.2 78.4 80.0 79.8 67.8 61.4 68.8 77.2 79.0 59.4 66.4 57.2 36.0 54.2 51.6 58.8 67.8 83.2 | Avg. 67.5 | Exp. 1× | Red. 0×
PAE:     67.2 77.0 78.6 76.0 84.4 81.2 77.6 80.0 80.4 87.8 85.4 77.8 79.4 79.6 51.2 68.4 68.6 68.6 83.2 88.8 | Avg. 77.1 | Exp. 2× | Red. 0×
CPG:     65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 | Avg. 80.9 | Exp. 1.5× | Red. 0.41×

Table 2: The performance of CPGs and individual models on the CIFAR-100 twenty tasks. We use fine-Avg and fine-Max as abbreviations for the average and maximum accuracy of the 5 fine-tuning models.

Scratch:  65.8 78.4 76.6 82.4 82.2 84.6 78.6 84.8 83.4 89.4 87.8 80.2 84.4 80.2 52.0 69.4 66.4 70.0 87.2 91.2 | Avg. 78.8 | Exp. 20× | Red. 0×
fine-Avg: 65.2 76.1 76.1 77.8 85.4 82.5 79.4 82.4 82.0 87.4 87.4 81.5 84.6 80.8 52.0 72.1 68.1 71.9 88.1 91.5 | Avg. 78.6 | Exp. 20× | Red. 0×
fine-Max: 65.8 76.8 78.6 80.0 86.2 84.8 80.4 84.0 83.8 88.4 89.4 83.8 87.2 82.8 53.6 74.6 68.8 74.4 89.2 92.2 | Avg. 80.2 | Exp. 20× | Red. 0×
CPG avg:  65.2 76.6 79.8 81.4 86.6 84.8 83.4 85.0 87.2 89.2 90.8 82.4 85.6 85.2 53.2 74.4 70.0 73.4 88.8 94.8 | Avg. 80.9 | Exp. 1.5× | Red. 0.41×
CPG max:  67.0 79.2 77.2 82.0 86.8 87.2 82.0 85.6 86.4 89.6 90.0 84.0 87.2 84.8 55.4 73.8 72.0 71.6 89.6 92.8 | Avg. 81.2 | Exp. 1.5× | Red. 0×
CPG top:  66.6 77.2 78.6 83.2 88.2 85.8 82.4 85.4 87.6 90.8 91.0 84.6 89.2 83.0 56.2 75.4 71.0 73.8 90.6 93.6 | Avg. 81.7 | Exp. 1.5× | Red. 0×

Unlike ProgressiveNet, which uses all of the weights kept for the old tasks when training the new task, our method picks only the old-task weights critical to the new tasks. To evaluate the effectiveness of the weights-picking mechanism, we compare CPG with PAE [12] and PackNet [23]. In our method, if all of the old weights are always picked, it is referred to as the pack-and-expand (PAE) approach. If we further restrict PAE such that architecture expansion is forbidden, it degenerates to an existing approach, PackNet [23]. Note that both PAE and PackNet ensure unforgetting. As shown in Table 1, apart from the first two tasks, CPG consistently performs more favorably than PAE and PackNet. The results reveal that the critical-weights picking mechanism in CPG not only reduces the unnecessary weights but also boosts the performance on new tasks. As PackNet does not allow model expansion, its amount of weights remains the same (1×). However, when proceeding with more tasks, the available space in PackNet gradually reduces, which limits the effectiveness of PackNet in learning new tasks. PAE uses all the previous weights during learning; with more tasks, the weights from previous tasks come to dominate the whole network and become a burden in learning new tasks. Finally, as shown in the Expand (Exp.) field of Table 1, PAE grows the model and uses 2 times the weights for the 20 tasks. 
Our CPG expands the weights to 1.5× (with 0.41× redundant weights that can be released to future tasks). Hence, CPG finds a more compact and sustainable model with better accuracy when the picking mechanism is enforced.

Table 2 shows the performance of different settings of our CPG method, together with their comparison to independent task learning (including learning from scratch and fine-tuning from a pre-trained model). In this table, 'scratch' means learning each task independently from scratch via the VGG16-BN model. As described before, 'fine-Avg' means the average accuracy of fine-tuning from a randomly selected previous model, with the process repeated 5 times; 'fine-Max' means the maximum accuracy of these 5 random trials. In the implementation of our CPG algorithm, an accuracy goal has to be set for the gradual-pruning and model-expansion steps. In this table, 'avg', 'max', and 'top' correspond to setting the accuracy goal to fine-Avg, to fine-Max, and to a slight increment over the maximum of both, respectively. The upper bound of model-weights expansion is set to 1.5 in this experiment. As can be seen in Table 2, CPG generally achieves better accuracy than both the average and the maximum of fine-tuning. CPG also performs more favorably, on average, than learning from scratch. This reveals again that the knowledge previously learned with our CPG can help learn new tasks. Besides, the results show that a higher accuracy goal in general yields better performance and more consumption of weights. In Table 2, the accuracy achieved by 'CPG avg', 'CPG max', and 'CPG top' increases in that order. The first still has 0.41× redundant weights saved for future use, whereas the latter two consume all weights. 
The model size includes not only the backbone model weights, but also the overhead of the final layers that grow with new classes, the batch-normalization parameters, and the binary masks. Including all overheads, the model sizes of CPG for the three settings are 2.16×, 2.40× and 2.41× of the original VGG16-BN, as shown in Table 3. Compared to independent models (learning-from-scratch or fine-tuning) that require 20× of the weights to maintain the old-task accuracy, our approach yields a far smaller model while achieving exact unforgetting.

4.2 Fine-grained Image Classification Tasks

In this experiment, following the same settings as the works of PackNet [23] and Piggyback [22], six image classification datasets are used. The statistics are summarized in Table 4, where ImageNet [17] is the first task, followed by the fine-grained classification tasks CUBS [44], Stanford Cars [15] and Flowers [26], and finally WikiArt [39] and Sketch [8], which contain artificial images drawn in various styles and of various objects. Unlike the previous experiments, where the first task consists of five classes from CIFAR-100, the first-task classifier in this experiment is trained on ImageNet, which provides a strong base for fine-tuning. Hence, in the fine-tuning setting of this experiment, tasks 2 to 6 are all fine-tuned from task 1, instead of from a randomly selected previous task.
For all tasks, the image size is 224 × 224, and the architecture used in this experiment is ResNet50.

Table 3: Model sizes on CIFAR-100 twenty tasks.

Methods           | Model Size (MB)
VGG16-BN          |     128.25
Individual Models |    2565
CPG avg           |     278
CPG max           |     308
CPG top           |     310

Table 4: Statistics of the fine-grained datasets.

Dataset       |    #Train |  #Eval | #Classes
ImageNet      | 1,281,167 | 50,000 |  1,000
CUBS          |     5,994 |  5,794 |    200
Stanford Cars |     8,144 |  8,041 |    196
Flowers       |     2,040 |  6,149 |    102
WikiArt       |    42,129 | 10,628 |    195
Sketch        |    16,000 |  4,000 |    250

Table 5: Statistics of the facial-informatic datasets.

Dataset   |    #Train |  #Eval | #Classes
VGGFace2  | 3,137,807 |      0 |  8,631
LFW       |         0 | 13,233 |  5,749
FotW      |     6,171 |  3,086 |      3
IMDB-Wiki |   216,161 |      0 |      3
AffectNet |   283,901 |  3,500 |      7
Adience   |    12,287 |  3,868 |      8

Table 6: Accuracy on the fine-grained datasets.

Dataset         | Train from Scratch | Finetune | Prog.Net | PackNet | Piggyback |  CPG
ImageNet        |       76.16        |    -     |  76.16   |  75.71  |   76.16   | 75.81
CUBS            |       40.96        |  82.83   |  78.94   |  80.41  |   81.59   | 83.59
Stanford Cars   |       61.56        |  91.83   |  89.21   |  86.11  |   89.62   | 92.80
Flowers         |       59.73        |  96.56   |  93.41   |  93.04  |   94.77   | 96.62
WikiArt         |       56.50        |  75.60   |  74.94   |  69.40  |   71.33   | 77.15
Sketch          |       75.40        |  80.78   |  76.35   |  76.17  |   79.91   | 80.33
Model Size (MB) |        554         |   554    |   563    |   115   |    121    |  121

Table 7: Accuracy on facial-informatic tasks.

Task       | Train from Scratch | Finetune |      CPG
Face       |   99.417 ± 0.367   |    -     | 99.300 ± 0.384
Gender     |       83.70        |  90.80   |     89.66
Expression |       57.64        |  62.54   |     63.57
Age        |       46.14        |  57.27   |     57.66
Exp. (×)   |         4          |    4     |      1
Red. (×)   |         0          |    0     |     0.003

The performance is shown in Table 6. Five methods are compared with CPG: training from scratch, fine-tuning, ProgressiveNet, PackNet, and Piggyback. For the first task (ImageNet), CPG and PackNet perform slightly worse than the others, since both methods have to compress the model (ResNet50) via pruning.
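The per-task binary-mask mechanism shared by PackNet, Piggyback, and CPG's picking step, which keeps these methods near a single backbone in size, can be illustrated with a minimal sketch (hypothetical shapes and task names; not any of the papers' actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)).astype(np.float32)  # shared, frozen backbone weights
# One binary mask per task; here the masks are just random placeholders
# standing in for the masks learned or derived by pruning.
masks = {task: (rng.random((4, 4)) > 0.5).astype(np.float32)
         for task in ("cubs", "cars")}

def forward(x, task):
    # Inference for a task multiplies in that task's mask only. The shared
    # weights are never overwritten, so earlier tasks cannot be forgotten.
    return x @ (W * masks[task])

x = np.ones((1, 4), dtype=np.float32)
y_cubs = forward(x, "cubs")
y_cars = forward(x, "cars")
```

A 1-bit mask per weight per task adds only about 1/32 of a float32 backbone's size for each task, which is why mask-based methods can stay far below one full model copy per task.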
Then, for tasks 2 to 6, CPG outperforms the others in almost all cases, which shows the superiority of our method in building a compact and unforgetting base for continual learning. As for the model size, ProgressiveNet enlarges the model with every task, and learning-from-scratch and fine-tuning need 6 separate models to achieve unforgetting; their model sizes are thus large. CPG yields a small model size comparable to Piggyback, which is favorable when considering accuracy and model size together.

4.3 Facial-informatic Tasks

In a realistic scenario, four facial-informatic tasks (face verification and gender, expression, and age classification) are used, with the datasets summarized in Table 5. For face verification, we use VGGFace2 [4] for training and LFW [18] for testing. For gender classification, we combine the FotW [9] and IMDB-Wiki [37] datasets and classify faces into three categories: male, female, and other. The AffectNet dataset [25] is used for expression classification, which classifies faces into seven primary emotions. Finally, the Adience dataset [7] contains faces labeled with eight different age groups. For these datasets, faces are aligned using MTCNN [50] with an output size of 112 × 112. We use the 20-layer CNN of SphereFace [21] and train a model for the face-verification task accordingly. We compare CPG with the models fine-tuned from the face-verification task. The results are reported in Table 7. Compared with the individual models (training-from-scratch and fine-tuning), CPG achieves comparable or more favorable results without any expansion. After learning the four tasks, CPG still has 0.003× of released weights available for new tasks.

5 Conclusion and Future Work

We introduce a simple but effective method, CPG, for continual learning without forgetting. Compacting a model keeps the model complexity affordable as the number of tasks increases.
Picking learned weights with binary masks and training them together with the newly added weights is an effective way to reuse previous knowledge. The weights of old tasks are preserved, which prevents them from being forgotten. Growing the model for new tasks allows the model to learn an unlimited number of unknown or unrelated tasks. CPG is easy to realize and applicable to real situations. Experiments show that CPG achieves similar or better accuracy with limited additional space. Currently, our method compacts a model by weight pruning, and we plan to include channel pruning in the future. Besides, we assume that clear boundaries exist between the tasks, and will extend our approach to handle continual-learning problems without task boundaries. We also plan to provide a mechanism of “selectively forgetting” some previous tasks via the recorded masks.

Acknowledgments

We thank the anonymous reviewers and area chair for their constructive comments. This work is supported in part under contract MOST 108-2634-F-001-004.

References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016.

[2] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of CVPR, 2019.

[3] Pratik Prabhanjan Brahma and Adrienne Othon. Subset replay based continual learning for scalable improvement of autonomous systems. In Proceedings of the IEEE CVPRW, 2018.

[4] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognising faces across pose and age. In Proceedings of IEEE FG, 2018.

[5] Arslan Chaudhry, Puneet K Dokania, Thalaiyasingam Ajanthan, and Philip HS Torr. Riemannian walk for incremental learning: Understanding forgetting and intransigence.
In Proceedings of ECCV, 2018.\n\n[6] Prithviraj Dhar, Rajat Vikram Singh, Kuan-Chuan Peng, Ziyan Wu, and Rama Chellappa. Learning without\n\nmemorizing. Proceedings of CVPR, 2019.\n\n[7] Eran Eidinger, Roee Enbar, and Tal Hassner. Age and gender estimation of un\ufb01ltered faces. IEEE TIFS,\n\n9(12):2170\u20132179, 2014.\n\n[8] Mathias Eitz, James Hays, and Marc Alexa. How do humans sketch objects? ACM Trans. Graph.,\n\n31(4):44\u20131, 2012.\n\n[9] Sergio Escalera, Mercedes Torres Torres, Brais Martinez, Xavier Bar\u00f3, Hugo Jair Escalante, Isabelle\nGuyon, Georgios Tzimiropoulos, Ciprian Corneou, Marc Oliu, Mohammad Ali Bagheri, et al. Chalearn\nlooking at people and faces of the world: Face analysis workshop and challenge 2016. In Proceedings of\nIEEE CVPRW, 2016.\n\n[10] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with\n\npruning, trained quantization and huffman coding. In Proceedings of ICLR, 2016.\n\n[11] Wenpeng Hu, Zhou Lin, Bing Liu, Chongyang Tao, Zhengwei Tao, Jinwen Ma, Dongyan Zhao, and Rui\n\nYan. Overcoming catastrophic forgetting via model adaptation. In Proceedings of ICLR, 2019.\n\n[12] Steven CY Hung, Jia-Hong Lee, Timmy ST Wan, Chein-Hung Chen, Yi-Ming Chan, and Chu-Song Chen.\nIncreasingly packing multiple facial-informatics modules in a uni\ufb01ed deep-learning model via lifelong\nlearning. In Proceedings of International Conference on Multimedia Retrieval (ICMR), 2019.\n\n[13] Ronald Kemker and Christopher Kanan. Fearnet: Brain-inspired model for incremental learning. In\n\nProceedings of ICLR, 2018.\n\n[14] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu,\nKieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic\nforgetting in neural networks. Proceedings of the national academy of sciences, 2017.\n\n[15] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 
3d object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.

[16] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.

[17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[18] Erik Learned-Miller, Gary B Huang, Aruni RoyChowdhury, Haoxiang Li, and Gang Hua. Labeled faces in the wild: A survey. In Advances in face detection and facial image analysis. Springer, 2016.

[19] Sang-Woo Lee, Jin-Hwa Kim, JungWoo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. In NIPS, 2017.

[20] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40:2935–2947, 2018.

[21] Weiyang Liu, Yandong Wen, Zhiding Yu, Ming Li, Bhiksha Raj, and Le Song. Sphereface: Deep hypersphere embedding for face recognition. In Proceedings of IEEE CVPR, 2017.

[22] Arun Mallya, Dillon Davis, and Svetlana Lazebnik. Piggyback: Adapting a single network to multiple tasks by learning to mask weights. In Proceedings of ECCV, 2018.

[23] Arun Mallya and Svetlana Lazebnik. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the IEEE CVPR, 2018.

[24] James L McClelland, Bruce L McNaughton, and Randall C O'reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 1995.

[25] Ali Mollahosseini, Behzad Hasani, and Mohammad H Mahoor. Affectnet: A database for facial expression, valence, and arousal computing in the wild. IEEE Trans.
Affective Comput., 2017.\n\n[26] M-E. Nilsback and A. Zisserman. Automated \ufb02ower classi\ufb01cation over a large number of classes. In\n\nProceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec 2008.\n\n[27] Oleksiy Ostapenko, Mihai Puscas, Tassilo Klein, Patrick J\u00e4hnichen, and Moin Nabi. Learning to remember:\n\nA synaptic plasticity driven framework for continual learning. In Proceedings of CVPR, 2019.\n\n[28] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong\n\nlearning with neural networks: A review. Neural Networks, 2019.\n\n[29] German Ignacio Parisi, Xu Ji, and Stefan Wermter. On the role of neurogenesis in overcoming catastrophic\n\nforgetting. In Proceedings of NeurIPS Workshop on Continual Learning, 2018.\n\n[30] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming\nLin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In Proceedings\nof NeurIPS, 2017.\n\n[31] B. Pfulb and A. Gepperth. A comprehensive, application-oriented study of catastrophic forgetting in dnns.\n\nIn ICLR 2019, 2019.\n\n[32] Sylvestre-Alvise Rebuf\ufb01, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. icarl: Incremental\n\nclassi\ufb01er and representation learning. In Proceedings of IEEE CVPR, 2017.\n\n[33] Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro.\nLearning to learn without forgetting by maximizing transfer and minimizing interference. In Proceedings\nof ICLR, 2019.\n\n[34] Matthew Riemer, Tim Klinger, Djallel Bouneffouf, and Michele Franceschini. Scalable recollections for\n\ncontinual lifelong learning. In AAAI 2019, 2019.\n\n[35] Hippolyt Ritter, Aleksandar Botev, and David Barber. Online structured laplace approximations for\n\novercoming catastrophic forgetting. In NeurIPS, 2018.\n\n[36] Amir Rosenfeld and John K. Tsotsos. 
Incremental learning through deep adaptation. IEEE transactions on pattern analysis and machine intelligence, Early Access, 2018.

[37] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 10–15, 2015.

[38] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv, 2016.

[39] Babak Saleh and Ahmed Elgammal. Large-scale classification of fine-art paintings: Learning the right metric on the right feature. In ICDMW, 2015.

[40] Jonathan Schwarz, Wojciech Czarnecki, Jelena Luketina, Agnieszka Grabska-Barwinska, Yee Whye Teh, Razvan Pascanu, and Raia Hadsell. Progress & compress: A scalable framework for continual learning. In Proceedings of ICML, 2018.

[41] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Proceedings of NeurIPS, 2017.

[42] Sebastian Thrun. A lifelong learning perspective for mobile robot control. In Intelligent Robots and Systems. Elsevier, 1995.

[43] Lazar Valkov, Dipak Chaudhari, Akash Srivastava, Charles A. Sutton, and Swarat Chaudhuri. Houdini: Lifelong learning as program synthesis. In NeurIPS, 2018.

[44] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.

[45] Chenshen Wu, Luis Herranz, Xialei Liu, Yaxing Wang, Joost van de Weijer, and Bogdan Raducanu. Memory replay gans: Learning to generate new categories without forgetting. In Proceedings of NeurIPS, 2018.

[46] Yue Wu, Yinpeng Chen, Lijuan Wang, Yuancheng Ye, Zicheng Liu, Yandong Guo, Zhengyou Zhang, and Yun Fu.
Incremental classifier learning with generative adversarial networks. CoRR, abs/1802.00853, 2018.

[47] Tianjun Xiao, Jiaxing Zhang, Kuiyuan Yang, Yuxin Peng, and Zheng Zhang. Error-driven incremental learning in deep convolutional neural network for large-scale image classification. In Proceedings of ACM-MM, 2014.

[48] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In Proceedings of ICLR, 2018.

[49] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In Proceedings of ICML, 2017.

[50] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23:1499–1503, 2016.

[51] Michael Zhu and Suyog Gupta. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In Proceedings of ICLR Workshop, 2018.