{"title": "Model Compression with Adversarial Robustness: A Unified Optimization Framework", "book": "Advances in Neural Information Processing Systems", "page_first": 1285, "page_last": 1296, "abstract": "Deep model compression has been extensively studied, and state-of-the-art methods can now achieve high compression ratios with minimal accuracy loss. This paper studies model compression through a different lens: could we compress models without hurting their robustness to adversarial attacks, in addition to maintaining accuracy? Previous literature suggested that the goals of robustness and compactness might sometimes contradict. We propose a novel Adversarially Trained Model Compression (ATMC) framework. ATMC constructs a unified constrained optimization formulation, where existing compression means (pruning, factorization, quantization) are all integrated into the constraints. An efficient algorithm is then developed. An extensive group of experiments are presented, demonstrating that ATMC obtains remarkably more favorable trade-off among model size, accuracy and robustness, over currently available alternatives in various settings. The codes are publicly available at: https://github.com/shupenggui/ATMC.", "full_text": "Model Compression with Adversarial Robustness:\n\nA Uni\ufb01ed Optimization Framework\n\nShupeng Gui(cid:5),\u2217, Haotao Wang\u2020,\u2217, Haichuan Yang(cid:5), Chen Yu(cid:5),\n\nZhangyang Wang\u2020 and Ji Liu\u2021\n\n\u2020Department of Computer Science and Engineering, Texas A&M University\n\n(cid:5)Department of Computer Science, University of Rochester\n\u2021Ytech Seattle AI lab, FeDA lab, AI platform, Kwai Inc\n\n\u2020{htwang, atlaswang}@tamu.edu\n\n(cid:5){sgui2, hyang36, cyu28}@ur.rochester.edu\n\n\u2021ji.liu.uwisc@gmail.com\n\nAbstract\n\nDeep model compression has been extensively studied, and state-of-the-art methods\ncan now achieve high compression ratios with minimal accuracy loss. 
This paper studies model compression through a different lens: could we compress models without hurting their robustness to adversarial attacks, in addition to maintaining accuracy? Previous literature suggested that the goals of robustness and compactness might sometimes contradict. We propose a novel Adversarially Trained Model Compression (ATMC) framework. ATMC constructs a unified constrained optimization formulation, where existing compression means (pruning, factorization, quantization) are all integrated into the constraints. An efficient algorithm is then developed. An extensive group of experiments is presented, demonstrating that ATMC obtains a remarkably more favorable trade-off among model size, accuracy and robustness than currently available alternatives in various settings. The codes are publicly available at: https://github.com/shupenggui/ATMC.

1 Introduction

Background: CNN Model Compression As more Internet-of-Things (IoT) devices come online, they are equipped with the ability to ingest and analyze information from their ambient environments via sensor inputs. Over the past few years, convolutional neural networks (CNNs) have led to rapid advances in predictive performance on a large variety of tasks [1]. It is appealing to deploy CNNs onto IoT devices to interpret big data and intelligently react to both user and environmental events. However, the model size, together with inference latency and energy cost, has become a critical hurdle [2–4]. The enormous complexity of CNNs remains a major inhibitor to their more extensive application in resource-constrained IoT systems. Therefore, model compression [5] is increasingly in demand and actively studied [6–9]. 
We next briefly review three mainstream compression methods: pruning, factorization, and quantization.
Pruning refers to sparsifying the CNN by zeroing out non-significant weights, e.g., by thresholding the weight magnitudes [10]. Various forms of sparsity regularization were explicitly incorporated into the training process [6, 7], including structured sparsity, e.g., through channel pruning [8, 11, 12].
Most CNN layers consist of large tensors storing their parameters [13–15], in which large redundancy exists due to the highly-structured filters or columns [13–15]. Matrix factorization was thus adopted to (approximately) decompose large weight matrices into several much smaller matrix factors [16, 13, 17]. Combining low-rank factorization and sparse pruning showed further effectiveness [18].
Quantization saves model size and computation by reducing float-number elements to lower numerical precision, e.g., from 32 bits to 8 bits or less [19, 20]. The model could even consist of only binary weights in the extreme case [21, 22]. Beyond scalar quantization, vector quantization was also widely adopted in model compression for parameter sharing [23, 24]. [25, 26] also integrated pruning and quantization in one ADMM optimization framework.

∗The first two authors Gui and Wang contributed equally and are listed alphabetically.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Adversarial Robustness: Connecting to Model Compression?
On a separate note, the prevailing deployment of CNNs also calls for attention to their robustness. Despite their impressive predictive power, state-of-the-art CNNs commonly suffer from fragility to adversarial attacks, i.e., a well-trained CNN-based image classifier can be easily fooled into making unreasonably wrong predictions by perturbing the input image with a small, often unnoticeable variation [27–34]. 
Other tasks, such as image segmentation [35] and graph classification [36], were likewise shown to be vulnerable to adversarial attacks. Apparently, such findings put CNN models in jeopardy for security- and trust-sensitive IoT applications, such as mobile biometric verification.
A multitude of adversarial defense methods have been proposed, ranging from hiding gradients [37], to adding stochasticity [38], to label smoothing/defensive distillation [39, 40], to feature squeezing [41], among many more [42–44]. A handful of recent works pointed out that those empirical defenses could still be easily compromised [27], and a few certified defenses were introduced [32, 45].
To the best of our knowledge, there have been few existing studies examining the robustness of compressed models: most CNN compression methods are evaluated only in terms of accuracy on the (clean) test set. Despite their satisfactory accuracies, it becomes curious to us: did they sacrifice robustness as a "hidden price" paid? We ask the question: could we possibly have a compression algorithm that leads to compressed models that are not only accurate, but also robust?
The answer, however, seems highly non-straightforward and context-dependent, at least w.r.t. different means of compression. For example, [46] showed that sparse algorithms are not stable: if an algorithm promotes sparsity, then its sensitivity to small perturbations of the input data remains bounded away from zero (i.e., no uniform stability properties). But other forms of compression, e.g., quantization, seem to reduce the Minimum Description Length [47] and might potentially make the algorithm more robust. In the deep learning literature, [48] argued that the trade-off between robustness and accuracy may be inevitable for the classification task. 
This was questioned by [49], whose theoretical examples implied that a classifier that is both accurate and robust might exist, given that the classifier has sufficiently large model capacity (perhaps much larger than standard classifiers). Consequently, different compression algorithms might lead to different trade-offs between robustness and accuracy. [50] empirically discovered that an appropriately higher CNN model sparsity led to better robustness, whereas over-sparsification (e.g., less than 5% non-zero parameters) could in turn cause more fragility. Although sparsification (i.e., pruning) is only one specific case of compression, this observation supports a non-monotonic relationship between model size and robustness.
A few parallel efforts [38, 51] discussed activation pruning or quantization as defense mechanisms. While potentially leading to speedup of model inference, they have no direct effect on reducing model size and are therefore not directly "apples-to-apples" comparable to us. We also notice one concurrent work [52] combining adversarial training and weight pruning. Sharing a similar purpose, our method solves a more general problem, by jointly optimizing three means of compression, namely pruning, factorization and quantization, w.r.t. adversarial robustness. Another recent work [53] studied the transferability of adversarial examples between compressed models and their non-compressed baseline counterparts.

1.2 Our Contribution
As far as we know, this paper describes one of the first algorithmic frameworks that connects model compression with the robustness goal. We propose a unified constrained optimization form for compressing large-scale CNNs into both compact and adversarially robust models. 
The framework, dubbed adversarially trained model compression (ATMC), features a seamless integration of adversarial training (formulated as the optimization objective), as well as a novel structured compression constraint that jointly integrates three compression mainstreams: pruning, factorization and quantization. An efficient algorithm is derived to solve this challenging constrained problem.

While we focus our discussion on reducing model size only in this paper, we note that ATMC could easily be extended to inference speedup or energy efficiency, with drop-in replacements of the constraint (e.g., based on FLOPs) in the optimization framework.
We then conduct an extensive set of experiments, comparing ATMC with various baselines and off-the-shelf solutions. ATMC consistently shows significant advantages in achieving competitive robustness-model size trade-offs. As an interesting observation, the models compressed by ATMC can achieve very high compression ratios while still maintaining appealing robustness, manifesting the value of optimizing model compactness jointly with the robustness goal.

2 Adversarially Trained Model Compression
In this section, we define and solve the ATMC problem. ATMC is formulated as a constrained min-max optimization problem: adversarial training forms the min-max objective (Section 2.1), while model compression, enforced through certain weight structures, constitutes the constraint (Section 2.2). We then derive the optimization algorithm to solve the ATMC formulation (Section 2.3).

2.1 Formulating the ATMC Objective: Adversarial Robustness

We consider a common white-box attack setting [30]. The white-box attack allows an adversary to eavesdrop on the optimization and gradients of the learning model. 
Each time a "clean" image x comes to the target model, the attacker is allowed to "perturb" the image into x′ with an adversarial perturbation of bounded magnitude. Specifically, let ∆ ≥ 0 denote the predefined bound for the attack magnitude; x′ must come from the following set:

B∆∞(x) := {x′ : ‖x′ − x‖∞ ≤ ∆}.

The attacker's objective is to choose x′ within B∆∞(x) such that the target model's performance is maximally deteriorated. Formally, let f(θ; x, y) be the loss function that the target model aims to minimize, where θ denotes the model parameters and (x, y) the training pairs. The adversarial loss, i.e., the training objective for the attacker, is defined by

f^adv(θ; x, y) = max_{x′ ∈ B∆∞(x)} f(θ; x′, y).   (1)

It can be understood as the maximum (worst) target model loss attainable at any point within B∆∞(x). Next, since the target model needs to defend against the attacker, it has to suppress this worst-case risk. Therefore, the overall objective for the target model to gain adversarial robustness can be expressed as (where Z denotes the training set):

min_θ Σ_{(x,y)∈Z} f^adv(θ; x, y).   (2)

2.2 Integrating Pruning, Factorization and Quantization for the ATMC Constraint

As we reviewed previously, typical CNN model compression strategies include pruning (element-level [10] or channel-level [8]), low-rank factorization [16, 13, 17], and quantization [23, 19, 20]. 
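The inner maximization in equation 1 above is commonly approximated by iterative projected gradient ascent (as in PGD [32]). A minimal sketch of the ℓ∞-ball projection and one ascent step, using a hypothetical stand-in gradient array since the loss f is model-specific:

```python
import numpy as np

def project_linf(x_adv: np.ndarray, x: np.ndarray, delta: float) -> np.ndarray:
    """Project x_adv onto B_delta(x) = {x' : ||x' - x||_inf <= delta}."""
    return np.clip(x_adv, x - delta, x + delta)

def pgd_step(x_adv: np.ndarray, x: np.ndarray, grad: np.ndarray,
             alpha: float, delta: float) -> np.ndarray:
    """One projected gradient *ascent* step on the attacker's objective."""
    return project_linf(x_adv + alpha * grad, x, delta)

x = np.zeros(4)
grad = np.array([0.5, -2.0, 1.0, 0.0])   # stand-in for grad_x f(theta; x, y)
x_adv = pgd_step(x.copy(), x, grad, alpha=0.3, delta=0.25)
print(np.max(np.abs(x_adv - x)))          # stays within the ball: <= 0.25
```

A common variant ascends along the gradient sign rather than the raw gradient; either way, the projection keeps the perturbation inside B∆∞(x).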
In this work, we aim to integrate all three into a unified, flexible structural constraint.
Without loss of generality, we denote the major operation of a CNN layer (either convolutional or fully-connected) as xout = W xin, W ∈ R^{m×n}, m ≥ n; computing the non-linearity (neuron) parts takes minor resources compared to the large-scale matrix-vector multiplication. Basic pruning [10] encourages the elements of W to be zero. On the other hand, factorization-based methods decompose W = W1 W2. Looking at the two options, we propose to enforce the following structure on W (k is a hyperparameter):

W = U V + C,   ‖U‖0 + ‖V‖0 + ‖C‖0 ≤ k,   (3)

where ‖·‖0 denotes the number of nonzeros of the argument matrix. The above enforces a novel, compound (both multiplicative and additive) sparsity structure on W, compared to existing sparsity structures imposed directly on the elements of W. Decomposing a matrix into sparse factors was studied before [54], but not in a model compression context. We further allow for a sparse error C for more flexibility, inspired by robust optimization [55]. By default, we choose U ∈ R^{m×m}, V ∈ R^{m×n} in (3).
Many extensions of equation 3 are clearly available. For example, channel pruning [8] enforces rows of W to be zero, which could be considered a specially structured case of basic element-level pruning; it could be achieved by a drop-in replacement with group-sparsity norms. We choose the ℓ0 norm here both for simplicity and because our goal here is focused on reducing model size only. We recognize that using group-sparsity norms in (3) might be a preferred option if ATMC were adapted for model acceleration.
Quantization is another powerful strategy for model compression [20, 21, 56]. 
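To make the compound structure of equation 3 concrete, here is a small sketch (our own illustration, not the paper's training code) of reparameterizing a layer as W = U V + C and jointly projecting the three factors onto the global ℓ0 constraint by keeping the k largest-magnitude entries:

```python
import numpy as np

def project_l0(mats, k):
    """Keep the k largest-magnitude entries across all matrices, zero the rest."""
    flat = np.concatenate([m.ravel() for m in mats])
    if k < flat.size:
        threshold = np.partition(np.abs(flat), -k)[-k]  # k-th largest magnitude
        mats = [np.where(np.abs(m) >= threshold, m, 0.0) for m in mats]
    return mats

rng = np.random.default_rng(1)
m, n, k = 8, 6, 20
U = rng.normal(size=(m, m))   # U in R^{m x m}, as in equation 3
V = rng.normal(size=(m, n))   # V in R^{m x n}
C = rng.normal(size=(m, n))   # sparse error term
U, V, C = project_l0([U, V, C], k)
W = U @ V + C                 # effective layer weight
nnz = sum(int(np.count_nonzero(M)) for M in (U, V, C))
print(nnz)                    # at most k nonzeros across the three factors
```

Note the budget k is shared across U, V and C, mirroring the summed ℓ0 constraint in equation 3.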
To maximize the representation capability after quantization, we choose a nonuniform quantization strategy to represent the nonzero elements of the DNN parameters: each nonzero element can only be chosen from a set of a few values, and these values are not necessarily evenly distributed and need to be optimized. We use the notation |·|0 to denote the number of distinct nonzero values in the argument matrix, that is,

|M|0 := |{M_{i,j} : M_{i,j} ≠ 0}|.

For example, for M = [0, 1; 4, 1], ‖M‖0 = 3 and |M|0 = 2. To quantize all nonzero elements of {U(l), V(l), C(l)}, we adopt this non-uniform quantization strategy (i.e., the quantization intervals or thresholds are not evenly distributed). We also do not pre-choose those thresholds, but instead learn them directly with ATMC, by only constraining the number of unique nonzero values through predefining the number of representation bits b for each matrix, such that

|U(l)|0 ≤ 2^b, |V(l)|0 ≤ 2^b, |C(l)|0 ≤ 2^b   ∀l ∈ [L].

2.3 ATMC: Formulation

Let us use θ to denote the (re-parameterized) weights of all L layers:

θ := {U(l), V(l), C(l)}_{l=1}^{L}.

We are now ready to present the overall constrained optimization formulation of the proposed ATMC framework, combining all compression strategies as constraints:

min_θ Σ_{(x,y)∈Z} f^adv(θ; x, y)   (4)
s.t.  ‖θ‖0 := Σ_{l=1}^{L} (‖U(l)‖0 + ‖V(l)‖0 + ‖C(l)‖0) ≤ k,   (sparsity constraint)
      θ ∈ Qb := {θ : |U(l)|0 ≤ 2^b, |V(l)|0 ≤ 2^b, |C(l)|0 ≤ 2^b ∀l ∈ [L]}.   (quantization constraint)

Both k and b are hyper-parameters in ATMC: k controls the overall sparsity of θ, and b controls the quantization bit precision per nonzero element. They are both "global" for the entire model rather than layer-wise, i.e., setting only these two hyper-parameters determines the final compression. 
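The two counting measures used in the constraints of equation 4 can be sanity-checked directly; a tiny sketch verifying the 2×2 example from Section 2.2:

```python
import numpy as np

def l0_norm(M: np.ndarray) -> int:
    """||M||_0: number of nonzero entries."""
    return int(np.count_nonzero(M))

def distinct_nonzero(M: np.ndarray) -> int:
    """|M|_0: number of distinct nonzero values."""
    return len(set(M.ravel().tolist()) - {0.0})

M = np.array([[0.0, 1.0],
              [4.0, 1.0]])
print(l0_norm(M), distinct_nonzero(M))  # 3 2
```

The sparsity constraint bounds the first quantity summed over all factors, while the quantization constraint bounds the second quantity per matrix by 2^b.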
We note that it is possible to achieve similar compression ratios using different combinations of k and b (likely leading to different accuracy/robustness), and the two can indeed collaborate or trade off with each other to achieve more effective compression.

2.4 Optimization

The optimization in equation 4 is a constrained problem with two constraints. The typical way to solve a constrained optimization is projected (stochastic) gradient descent, provided the projection onto the feasible set defined by the constraints is simple enough. Unfortunately, in our case this projection is quite complicated, since the feasible set is the intersection of the sparsity and quantization constraints. However, we notice that the projection onto the set defined by each individual constraint is doable: the projection onto the sparsity constraint is quite standard, and how to efficiently project onto the set defined by the quantization constraint will become clear soon. Therefore, we apply the ADMM [57] optimization framework to split these two constraints by duplicating the optimization variable θ. First, by introducing one more constraint, the original formulation equation 4 can be rewritten as

min_{‖θ‖0 ≤ k, θ′ ∈ Qb}  Σ_{(x,y)∈Z} f^adv(θ; x, y)   s.t. θ = θ′.   (5)

It can be further cast into a minimax problem by removing the equality constraint θ = θ′:

min_{‖θ‖0 ≤ k, θ′ ∈ Qb}  max_u  Σ_{(x,y)∈Z} f^adv(θ; x, y) + ρ⟨u, θ − θ′⟩ + (ρ/2)‖θ − θ′‖²_F,   (6)

where ρ > 0 is a predefined positive number in ADMM. Plugging in the form of f^adv, we obtain the complete minimax optimization

min_{‖θ‖0 ≤ k, θ′ ∈ Qb}  max_{u, {x^adv ∈ B∆∞(x)}_{(x,y)∈Z}}  Σ_{(x,y)∈Z} f(θ; x^adv, y) + ρ⟨u, θ − θ′⟩ + (ρ/2)‖θ − θ′‖²_F.   (7)

ADMM essentially iteratively minimizes over the variables θ and θ′, and maximizes over u and all x^adv.

Update u  We update the dual variable as u ← u + (θ − θ′), which can be considered a gradient ascent step with learning rate 1/ρ.

Update x^adv  We update x^adv for each sampled pair (x, y) by

x^adv ← Proj_{{x′ : ‖x′ − x‖∞ ≤ ∆}}{x + α∇_x f(θ; x, y)}.

Update θ  The first step is to optimize θ in equation 7 (fixing the other variables), which only involves the sparsity constraint. Therefore, we are essentially solving

min_θ  Σ_{(x,y)∈Z} f(θ; x^adv, y) + (ρ/2)‖θ − θ′ + u‖²_F   s.t. ‖θ‖0 ≤ k.

Since the projection onto the sparsity constraint is simple enough, we can use projected stochastic gradient descent, iteratively updating θ as

θ ← Proj_{{θ′′ : ‖θ′′‖0 ≤ k}}(θ − γ_t ∇_θ[f(θ; x^adv, y) + (ρ/2)‖θ − θ′ + u‖²_F]),

where {θ′′ : ‖θ′′‖0 ≤ k} denotes the feasible domain of the sparsity constraint and γ_t is the learning rate.

Update θ′  The second step is to optimize equation 7 with respect to θ′ (fixing the other variables), which is essentially the projection problem

min_{θ′}  ‖θ′ − (θ + u)‖²_F,   s.t. θ′ ∈ Qb.   (8)

Taking a closer look at this formulation, we are essentially solving the following one-dimensional clustering problem with 2^b + 1 clusters on θ + u (for each of U(l), V(l), and C(l)):

min_{U, {a_k}_{k=1}^{2^b}}  ‖U − Ū‖²_F   s.t. U_{i,j} ∈ {0, a_1, a_2, ..., a_{2^b}}.

The major difference from the standard clustering problem is that there is a constant cluster at 0. Taking U′(l) as an example, the update rule of θ′ is U′(l) = ZeroKmeans_{2^b}(U(l) + u_{U(l)}), where u_{U(l)} is the dual variable with respect to U(l) in θ. Here we use a modified Lloyd's algorithm [58] to solve equation 8. 
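A minimal NumPy sketch of such a zero-fixed Lloyd iteration (our own simplified version; the paper's Algorithm 1 gives the precise procedure): cluster 0 is held fixed, and only the B free centroids are fit on the nonzero entries.

```python
import numpy as np

def zero_kmeans(values: np.ndarray, n_clusters: int, n_iter: int = 50) -> np.ndarray:
    """Quantize `values` onto {0, a_1, ..., a_B}: zeros stay in the fixed 0
    cluster, the B free centroids are fit by Lloyd-style updates."""
    flat = values.ravel()
    nonzero = flat[flat != 0]
    if nonzero.size == 0:
        return values.copy()
    # initialize free centroids from randomly picked nonzero elements
    rng = np.random.default_rng(0)
    centroids = rng.choice(nonzero, size=min(n_clusters, nonzero.size),
                           replace=False)
    for _ in range(n_iter):
        # assignment step: nearest free centroid for each nonzero entry
        assign = np.argmin(np.abs(nonzero[:, None] - centroids[None, :]), axis=1)
        # update step: each free centroid becomes the mean of its members
        for j in range(centroids.size):
            members = nonzero[assign == j]
            if members.size:
                centroids[j] = members.mean()
    quantized = flat.copy()
    quantized[flat != 0] = centroids[assign]   # zeros are left untouched
    return quantized.reshape(values.shape)

W = np.array([0.0, 0.11, 0.09, 0.5, 0.52, 0.0, -0.3])
Wq = zero_kmeans(W, n_clusters=3)
print(len(set(Wq.tolist()) - {0.0}))  # at most 3 distinct nonzero values
```

This sketch runs a fixed number of iterations rather than a convergence test, and omits the per-matrix bookkeeping ATMC applies to each U(l), V(l), C(l).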
The details of this algorithm are shown in Algorithm 1. We finally summarize the full ATMC algorithm in Algorithm 2.

Algorithm 1 ZeroKmeans_B(Ū)
1: Input: a set of real numbers Ū, number of clusters B.
2: Output: quantized tensor U.
3: Initialize a_1, a_2, ..., a_B by randomly picked nonzero elements from Ū.
4: Q := {0, a_1, a_2, ..., a_B}
5: repeat
6:   for i = 0 to |Ū| − 1 do
7:     δ_i ← argmin_j (Ū_i − Q_j)²
8:   end for
9:   Fix Q_0 = 0
10:  for j = 1 to B do
11:    a_j ← (Σ_i I(δ_i = j) Ū_i) / (Σ_i I(δ_i = j))
12:  end for
13: until convergence
14: for i = 0 to |Ū| − 1 do
15:   U_i ← Q_{δ_i}
16: end for

Algorithm 2 ATMC
1: Input: dataset Z, stepsize sequence {γ_t > 0}_{t=0}^{T−1}, update steps n and T, hyper-parameters ρ, k, b, and ∆.
2: Output: model θ
3: α ← 1.25 × ∆/n
4: Initialize θ; let θ′ = θ and u = 0
5: for t = 0 to T − 1 do
6:   Sample (x, y) from Z
7:   for i = 0 to n − 1 do
8:     x^adv ← Proj_{{x′ : ‖x′ − x‖∞ ≤ ∆}}{x + α∇_x f(θ; x, y)}
9:   end for
10:  θ ← Proj_{{θ′′ : ‖θ′′‖0 ≤ k}}(θ − γ_t ∇_θ[f(θ; x^adv, y) + (ρ/2)‖θ − θ′ + u‖²_F])
11:  θ′ ← ZeroKmeans_{2^b}(θ + u)
12:  u ← u + (θ − θ′)
13: end for

3 Experiments
To demonstrate that ATMC achieves remarkably favorable trade-offs between robustness and model compactness, we carefully design experiments on a variety of popular datasets and models, as summarized in Section 3.1. Specifically, since no off-the-shelf algorithm shares exactly our goal (adversarially robust compression), we craft various ablation baselines by sequentially composing different compression strategies with state-of-the-art adversarial training [32]. Besides, we show that the robustness of ATMC-compressed models generalizes to different attackers.

Table 1: The datasets and CNN models used in the experiments.
Model      | #Parameters | Bit width | Model Size (bits) | Dataset & Accuracy
LeNet      | 430K        | 32        | 13,776,000        | MNIST: 99.32%
ResNet34   | 21M         | 32        | 680,482,816       | CIFAR-10: 93.67%
ResNet34   | 21M         | 32        | 681,957,376       | CIFAR-100: 73.16%
WideResNet | 11M         | 32        | 350,533,120       | SVHN: 95.25%

3.1 Experimental Setup

Datasets and Benchmark Models As shown in Table 1, we select four popular image classification datasets and pick one top-performing CNN model for each: LeNet on MNIST [59]; ResNet34 [60] on CIFAR-10 [61] and CIFAR-100 [61]; and WideResNet [62] on SVHN [63].

Evaluation Metrics The classification accuracies on both benign and attacked testing sets are reported, the latter being widely used to quantify adversarial robustness, e.g., in [32]. The model size is computed by multiplying the quantization bits per element with the total number of non-zero elements, plus the storage size of the quantization thresholds (equation 8). The compression ratio is defined as the ratio between the compressed and original model sizes. A desired model compression is then expected to achieve strong adversarial robustness (accuracy on the attacked testing set), in addition to high benign testing accuracy, at compression ratios from low to high.

ATMC Hyper-parameters For ATMC, there are two hyper-parameters in equation 4 that control compression ratios: k in the sparsity constraint, and b in the quantization constraint. In our experiments, we try 32-bit (b = 32) full precision and 8-bit (b = 8) quantization, and then vary k under either bit precision to navigate different compression ratios. 
We recognize that a better compression-robustness trade-off is possible by fine-tuning, or perhaps jointly searching for, k and b.

Training Settings For adversarial training, we apply the PGD [32] attack to find adversarial samples. Unless otherwise specified, we set the perturbation magnitude ∆ to 76 for MNIST and 4 for the other three datasets (the color scale of each channel is between 0 and 255). Following the settings in [32], we set the number of PGD attack iterations n to 16 for MNIST and 7 for the other three datasets. We follow [30] to set the PGD attack step size α to min(∆ + 4, 1.25∆)/n. We train ATMC for 50, 150, 150, and 80 epochs on MNIST, CIFAR-10, CIFAR-100 and SVHN, respectively.

Adversarial Attack Settings Unless otherwise noted, we use the PGD attack with the same settings as in adversarial training to evaluate model robustness on the testing sets. In Section 3.3, we also evaluate model robustness under the PGD attack, the FGSM attack [29] and the WRM attack [45] with varying attack parameter settings, to show the robustness of our method across different attack settings.

3.2 Comparison to Pure Compression, Pure Defense, and Their Mixtures

Since no existing work directly pursues our goal, we start from two straightforward baselines to compare with ATMC: standard compression (without defense), and standard defense (without compression). Furthermore, we craft "mixture" baselines: first applying a defense method to a dense model, then compressing it, and eventually fine-tuning the compressed model (with the parameter count unchanged, e.g., by fixing zero elements) using the defense method again. 
We design the following seven comparison methods (the default bit precision is 32 unless otherwise specified):

• Non-Adversarial Pruning (NAP): we train a dense state-of-the-art CNN and then compress it with the pruning method proposed in [10]: keeping only the largest-magnitude weight elements while setting the others to zero, and then fine-tuning the nonzero weights (with zero weights fixed) on the training set until convergence. NAP can thus explicitly control the compressed model size in the same way as ATMC. There is no defense in NAP.

• Dense Adversarial Training (DA): we apply adversarial training [32] to defend a dense CNN, with no compression performed.

• Adversarial Pruning (AP): we first apply the defense method [32] to pre-train a defended CNN. We then prune the dense model into a sparse one [10] and fine-tune the non-zero weights of the pruned model until convergence, similarly to NAP.

• Adversarial ℓ0 Pruning (Aℓ0): we start from the same pre-trained dense defended CNN used by AP and then apply ℓ0 projected gradient descent to solve the constrained optimization problem with an adversarial training objective and a constraint on the number of non-zero parameters in the CNN. Note that this is in essence a combination of one state-of-the-art compression method [64] and PGD adversarial training.

• Adversarial Low-Rank Decomposition (ALR): similar to the AP routine, except that we use low-rank factorization [17] in place of pruning for the compression step.

• ATMC (8 bits, 32 bits): two ATMC models with different quantization bit precisions. For either one, we vary k to obtain different compression ratios.

(a) MNIST  (b) CIFAR-10  (c) CIFAR-100  (d) SVHN

Figure 1: Comparison among NAP, AP, Aℓ0, ALR and ATMC (32 bits & 8 bits) on four models/datasets. 
Top row: accuracy on benign testing images versus compression ratio. Bottom row: robustness (accuracy on PGD-attacked testing images) versus compression ratio. The black dashed lines mark the uncompressed model results.

Fig 1 compares the accuracy on benign (top row) and PGD-attacked (bottom row) testing images, respectively, w.r.t. the compression ratios, from which a number of observations can be drawn.

First, our results empirically support the existence of an inherent trade-off between robustness and accuracy at different compression ratios, although the practically achievable trade-off differs by method. For example, while NAP (a standard CNN compression) obtains decent accuracy on benign testing sets (e.g., the best on CIFAR-10 and CIFAR-100), it deteriorates heavily in terms of robustness under adversarial attacks. This verifies our motivating intuition: naive compression, while still maintaining high standard accuracy, can significantly compromise robustness – the "hidden price" has indeed been charged. 
The observation also raises a red flag for current evaluation practices of CNN compression, where the robustness of compressed models is (almost completely) overlooked.
Second, while both AP and ALR consider compression and defense in ad-hoc, sequential ways, Aℓ0 and ATMC-32bits gain notable further advantages over them via "joint optimization" type methods, achieving superior trade-offs between benign test accuracy, robustness, and compression ratio. Furthermore, ATMC-32bits outperforms Aℓ0 especially at the low end of compression ratios, owing to the new decomposition structure introduced in ATMC.
Third, ATMC achieves comparable test accuracy and robustness to DA, with only a minimal number of parameters after compression. Meanwhile, ATMC also achieves very close, sometimes better, accuracy-compression trade-offs on benign testing sets than NAP, with much enhanced robustness. It has therefore indeed combined the best of both worlds. It also comes to our attention that for ATMC-compressed models, the gaps between their accuracies on benign and attacked testing sets are smaller than those of the uncompressed original models. That seems to suggest that compression (when done right) in turn has a positive regularization effect.
Lastly, we compare ATMC-32bits and ATMC-8bits. While ATMC-32bits already outperforms the other baselines in terms of the robustness-accuracy trade-off, more aggressive compression can be achieved by ATMC-8bits (roughly a further four-fold compression at the same sparsity level), with still competitive performance. 
The incorporation of quantization and weight pruning/decomposition in one framework allows us to flexibly explore and optimize their different combinations.

(a) PGD, perturbation=2 (b) PGD, perturbation=8 (c) FGSM, perturbation=4 (d) WRM, penalty=1.3, iteration=7

Figure 2: Robustness-model size trade-off under different attacks and perturbation levels. Note that the accuracies here are all measured on attacked images, i.e., they indicate robustness.

3.3 Generalized Robustness Against Other Attackers
In all previous experiments, we tested ATMC and other baselines against the PGD attacker at certain fixed perturbation levels. We now show that the superiority of ATMC persists under different attackers and perturbation levels. On CIFAR-10 (whose default perturbation level is 4), we show the results against the PGD attack with perturbation levels 2 and 8 in Fig 2a and Fig 2b, respectively. We also try the FGSM attack [29] with perturbation 4, and the WRM attack [45] with penalty parameter 1.3 and 7 iterations, with results displayed in Fig 2c and Fig 2d, respectively. As the figures show, ATMC-32bits outperforms its strongest competitor AP across the full compression spectrum. ATMC-8bits achieves more aggressively compressed model sizes while maintaining robustness similar to or better than ATMC-32bits at low compression ratios. Overall, the robustness gained by ATMC compression is observed to be sustainable and generalizable.

4 Conclusion
This paper addresses the new problem of simultaneously achieving high robustness and compactness in CNN models. We propose the ATMC framework, integrating the two goals into one unified constrained optimization formulation.
Our extensive experiments endorse the effectiveness of ATMC with two observations: i) naive model compression may hurt robustness if the latter is not explicitly taken into account; ii) a proper joint optimization can achieve both goals: a properly compressed model can maintain almost the same accuracy and robustness as the original one.

References

[1] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Fei-Fei Li. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[2] Yue Wang, Tan Nguyen, Yang Zhao, Zhangyang Wang, Yingyan Lin, and Richard Baraniuk. EnergyNet: Energy-efficient dynamic inference. In Advances in Neural Information Processing Systems Workshop, 2018.

[3] Yue Wang, Jianghao Shen, Ting-Kuei Hu, Pengfei Xu, Tan Nguyen, Richard Baraniuk, Zhangyang Wang, and Yingyan Lin. Dual dynamic inference: Enabling more efficient, adaptive and controllable deep inference. arXiv preprint arXiv:1907.04523, 2019.

[4] Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8924–8933, 2019.

[5] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks.
arXiv preprint arXiv:1710.09282, 2017.

[6] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and Marianna Pensky. Sparse convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 806–814, 2015.

[7] Hao Zhou, Jose M Alvarez, and Fatih Porikli. Less is more: Towards compact CNNs. In European Conference on Computer Vision, pages 662–677, 2016.

[8] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710, 2016.

[9] Yunhe Wang, Chang Xu, Chao Xu, and Dacheng Tao. Adversarial learning of portable student networks. In AAAI Conference on Artificial Intelligence, 2018.

[10] Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[11] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pages 2074–2082, 2016.

[12] Haichuan Yang, Yuhao Zhu, and Ji Liu. ECC: Platform-independent energy-constrained deep neural network compression via a bilinear regression model. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11206–11215, 2019.

[13] Emily L Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

[14] Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474, 2014.

[15] Preetum Nakkiran, Raziel Alvarez, Rohit Prabhavalkar, and Carolina Parada. Compressing deep neural networks using a rank-constrained topology.
In Annual Conference of the International Speech Communication Association, 2015.

[16] Vadim Lebedev, Yaroslav Ganin, Maksim Rakhuba, Ivan Oseledets, and Victor Lempitsky. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014.

[17] Cheng Tai, Tong Xiao, Yi Zhang, Xiaogang Wang, et al. Convolutional neural networks with low-rank regularization. arXiv preprint arXiv:1511.06067, 2015.

[18] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 7370–7379, 2017.

[19] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4820–4828, 2016.

[20] Song Han, Huizi Mao, and William J Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations, 2015.

[21] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems, pages 3123–3131, 2015.

[22] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. In European Conference on Computer Vision, pages 525–542, 2016.

[23] Yunchao Gong, Liu Liu, Ming Yang, and Lubomir Bourdev. Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.

[24] Junru Wu, Yue Wang, Zhenyu Wu, Zhangyang Wang, Ashok Veeraraghavan, and Yingyan Lin. Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions.
In International Conference on Machine Learning, pages 5359–5368, 2018.

[25] Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Jiaming Xie, Yun Liang, Sijia Liu, Xue Lin, and Yanzhi Wang. A unified framework of DNN weight pruning and weight clustering/quantization using ADMM. arXiv preprint arXiv:1811.01907, 2018.

[26] Haichuan Yang, Shupeng Gui, Yuhao Zhu, and Ji Liu. Learning sparsity and quantization jointly and automatically for neural network compression via constrained optimization. In International Conference on Learning Representations, 2019.

[27] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.

[28] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57, 2017.

[29] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[30] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

[31] Bo Luo, Yannan Liu, Lingxiao Wei, and Qiang Xu. Towards imperceptible and robust adversarial example attacks against neural networks. arXiv preprint arXiv:1801.04693, 2018.

[32] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[33] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.

[34] Cihang Xie, Jianyu Wang, Zhishuai Zhang, Yuyin Zhou, Lingxi Xie, and Alan Yuille.
Adversarial examples for semantic segmentation and object detection. In IEEE International Conference on Computer Vision, pages 1369–1378, 2017.

[35] Jan Hendrik Metzen, Mummadi Chaithanya Kumar, Thomas Brox, and Volker Fischer. Universal adversarial perturbations against semantic image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2755–2764, 2017.

[36] Daniel Zügner, Amir Akbarnejad, and Stephan Günnemann. Adversarial attacks on neural networks for graph data. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2847–2856, 2018.

[37] Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Dan Boneh, and Patrick D. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.

[38] Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442, 2018.

[39] Nicolas Papernot and Patrick McDaniel. Extending defensive distillation. arXiv preprint arXiv:1705.05264, 2017.

[40] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy, pages 582–597, 2016.

[41] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.

[42] Hossein Hosseini, Yize Chen, Sreeram Kannan, Baosen Zhang, and Radha Poovendran. Blocking transferability of adversarial examples in black-box learning systems. arXiv preprint arXiv:1703.04318, 2017.

[43] Dongyu Meng and Hao Chen. MagNet: a two-pronged defense against adversarial examples.
In ACM SIGSAC Conference on Computer and Communications Security, pages 135–147, 2017.

[44] Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1787, 2018.

[45] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifiable distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571, 2017.

[46] Huan Xu, Constantine Caramanis, and Shie Mannor. Sparse algorithms are not stable: A no-free-lunch theorem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(1):187–193, 2012.

[47] Richard S Zemel. A minimum description length framework for unsupervised learning. Citeseer, 1994.

[48] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.

[49] Preetum Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.

[50] Yiwen Guo, Chao Zhang, Changshui Zhang, and Yurong Chen. Sparse DNNs with improved adversarial robustness. arXiv preprint arXiv:1810.09619, 2018.

[51] Ji Lin, Chuang Gan, and Song Han. Defensive quantization: When efficiency meets robustness. arXiv preprint arXiv:1904.08444, 2019.

[52] Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. Adversarial robustness vs model compression, or both? arXiv preprint arXiv:1903.12561, 2019.

[53] Yiren Zhao, Ilia Shumailov, Robert Mullins, and Ross Anderson. To compress or not to compress: Understanding the interactions between adversarial attacks and neural network compression.
arXiv preprint arXiv:1810.00208, 2018.

[54] Behnam Neyshabur and Rina Panigrahy. Sparse matrix factorization. arXiv preprint arXiv:1311.3315, 2013.

[55] Emmanuel J Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):11, 2011.

[56] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.

[57] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, Jonathan Eckstein, et al. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.

[58] Stuart Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[59] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[60] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[61] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[62] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

[63] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In Advances in Neural Information Processing Systems Workshop, 2011.

[64] Haichuan Yang, Yuhao Zhu, and Ji Liu.
Energy-constrained compression for deep neural networks via weighted sparse projection and layer input masking. arXiv preprint arXiv:1806.04321, 2018.