{"title": "Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee", "book": "Advances in Neural Information Processing Systems", "page_first": 3177, "page_last": 3186, "abstract": "We introduce and analyze a new technique for model reduction for deep neural networks. While large networks are theoretically capable of learning arbitrarily complex models, overfitting and model redundancy negatively affects the prediction accuracy and model variance.  Our Net-Trim algorithm prunes (sparsifies) a trained network layer-wise, removing connections at each layer by solving a convex optimization program.  This program seeks a sparse set of weights at each layer that keeps the layer inputs and outputs consistent with the originally trained model.  The algorithms and associated analysis are applicable to neural networks operating with the rectified linear unit (ReLU) as the nonlinear activation. We present both parallel and cascade versions of the algorithm.  While the latter can achieve slightly simpler models with the same generalization performance, the former can be computed in a distributed manner.  In both cases, Net-Trim significantly reduces the number of connections in the network, while also providing enough regularization to slightly reduce the generalization error. We also provide a mathematical analysis of the consistency between the initial network and the retrained model.  To analyze the model sample complexity, we derive the general sufficient conditions for the recovery of a sparse transform matrix. 
For a single layer taking independent Gaussian random vectors of length $N$ as inputs, we show that if the network response can be described using a maximum number of $s$ non-zero weights per node, these weights can be learned from $\\mathcal{O}(s\\log N)$ samples.", "full_text": "Net-Trim: Convex Pruning of Deep Neural Networks with Performance Guarantee

Alireza Aghasi* (Institute for Insight, Georgia State University; IBM TJ Watson) aaghasi@gsu.edu
Afshin Abdi (Department of ECE, Georgia Tech) abdi@gatech.edu
Nam Nguyen (IBM TJ Watson) nnguyen@us.ibm.com
Justin Romberg (Department of ECE, Georgia Tech) jrom@ece.gatech.edu

Abstract

We introduce and analyze a new technique for model reduction for deep neural networks. While large networks are theoretically capable of learning arbitrarily complex models, overfitting and model redundancy negatively affect the prediction accuracy and model variance. Our Net-Trim algorithm prunes (sparsifies) a trained network layer-wise, removing connections at each layer by solving a convex optimization program. This program seeks a sparse set of weights at each layer that keeps the layer inputs and outputs consistent with the originally trained model. The algorithms and associated analysis are applicable to neural networks operating with the rectified linear unit (ReLU) as the nonlinear activation. We present both parallel and cascade versions of the algorithm. While the latter can achieve slightly simpler models with the same generalization performance, the former can be computed in a distributed manner. In both cases, Net-Trim significantly reduces the number of connections in the network, while also providing enough regularization to slightly reduce the generalization error. We also provide a mathematical analysis of the consistency between the initial network and the retrained model. 
To analyze the model sample complexity, we derive the general sufficient conditions for the recovery of a sparse transform matrix. For a single layer taking independent Gaussian random vectors of length $N$ as inputs, we show that if the network response can be described using a maximum number of $s$ non-zero weights per node, these weights can be learned from $\mathcal{O}(s \log N)$ samples.

1 Introduction

With enough layers, neurons in each layer, and a sufficiently large set of training data, neural networks can learn structure of arbitrary complexity [1]. This model flexibility has made the deep neural network a pioneering machine learning tool over the past decade (see [2] for a comprehensive overview). In practice, multi-layer networks often have more parameters than can be reliably estimated from the amount of data available. This gives the training procedure a certain ambiguity: many different sets of parameter values can model the data equally well, and we risk instabilities due to overfitting. In this paper, we introduce a framework for sparsifying networks that have already been trained using standard techniques. This reduction in the number of parameters needed to specify the network makes it more robust and more computationally efficient to implement, without sacrificing performance.

*Corresponding Author

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

In recent years there has been increasing interest in the mathematical understanding of deep networks. These efforts are mainly in the context of characterizing the minimizers of the underlying cost function [3, 4] and the geometry of the loss function [5]. Recently, the analysis of deep neural networks using compressed sensing tools has been considered in [6], where the distance preservability of feedforward networks at each layer is studied. 
There are also works on formulating the training of feedforward networks as an optimization problem [7, 8, 9], where the majority of these works approach their understanding of neural networks by sequentially studying individual layers.

Various methods have been proposed to reduce overfitting via regularizing techniques and pruning strategies. These include explicit regularization using $\ell_1$ and $\ell_2$ penalties during training [10, 11], and techniques that randomly remove active connections in the training phase (e.g., Dropout [12] and DropConnect [13]), making them more likely to produce sparse networks. There have also been recent works on explicit network compression (e.g., [14, 15, 16]) to remove the inherent redundancies. In what is perhaps the most closely related work to what is presented below, [14] proposes a pruning scheme that simply truncates small weights of an already trained network, and then re-adjusts the remaining active weights using another round of training. These aforementioned techniques are based on heuristics, and lack general performance guarantees that help understand when and how well they work.

We present a framework, called Net-Trim, for pruning the network layer-by-layer that is based on convex optimization. Each layer of the net consists of a linear map followed by a nonlinearity; the algorithms and theory presented below use a rectified linear unit (ReLU) applied point-wise to each output of the linear map. Net-Trim works by taking a trained network, and then finding the sparsest set of weights for each layer that keeps the output responses consistent with the initial training. More concisely, if $Y^{(\ell-1)}$ is the input (across the training examples) to layer $\ell$, and $Y^{(\ell)}$ is the output following the ReLU operator, Net-Trim searches for a sparse $W$ such that $Y^{(\ell)} \approx \mathrm{ReLU}(W^\top Y^{(\ell-1)})$. 
Using the standard $\ell_1$ relaxation for sparsity and the fact that the ReLU function is piecewise linear allows us to perform this search by solving a convex program. In contrast to techniques based on thresholding (such as [14]), Net-Trim does not require multiple other time-consuming training steps after the initial pruning.

Along with making the computations tractable, Net-Trim's convex formulation also allows us to derive theoretical guarantees on how far the retrained model is from the initial model, and to establish sample complexity arguments about the number of random samples required to retrain a presumably sparse layer. To the best of our knowledge, Net-Trim is the first pruning scheme with such performance guarantees. In addition, it is easy to modify and adapt to other structural constraints on the weights by adding additional penalty terms or introducing additional convex constraints.

An illustrative example is shown in Figure 1. Here, 200 points in the 2D plane are used to train a binary classifier. The regions corresponding to each class are nested spirals. We fit a classifier using a simple neural network with two fully connected hidden layers, each consisting of 200 neurons. Figure 1(b) shows the weighted adjacency matrix between the layers after training, and then again after Net-Trim is applied. With only a negligible change to the overall network response (panel (a) vs. panel (d)), Net-Trim is able to prune more than 93% of the links among the neurons, representing a significant model reduction. Even when the neural network is trained using sparsifying weight regularizers (here, Dropout [12] and an $\ell_1$ penalty), Net-Trim produces a model which is over 7 times sparser than the initial one, as presented in panel (c). 
The numerical experiments in Section 6 show that these kinds of results are not limited to toy examples; Net-Trim achieves significant compression ratios on large networks trained on real data sets.

The remainder of the paper is structured as follows. In Section 2, we formally present the network model used in the paper. The proposed pruning schemes, both the parallel and the cascade Net-Trim, are presented and discussed in Section 3. Section 4 is devoted to the convex analysis of the proposed framework and its sample complexity. The implementation details of the proposed convex scheme are presented in Section 5. Finally, in Section 6, we report some retraining experiments using Net-Trim and conclude the paper by presenting some general remarks. Along with some extended discussions, the proofs of all of the theoretical statements in the paper are presented in a supplementary note (specifically, §4 of the note is devoted to the technical proofs).

We very briefly summarize the notation used below. For a matrix $A$, we use $A_{\Gamma_1,:}$ to denote the submatrix formed by restricting the rows of $A$ to the index set $\Gamma_1$. Similarly, $A_{:,\Gamma_2}$ is the submatrix of columns indexed by $\Gamma_2$, and $A_{\Gamma_1,\Gamma_2}$ is formed by extracting both rows and columns.

[Figure 1: Net-Trim pruning performance; (a) initial trained model; (b) the weighted adjacency matrix relating the two hidden layers before (left) and after (right) the application of Net-Trim; (c) left: the adjacency matrix after training the network with Dropout and $\ell_1$ regularization; right: after retraining via Net-Trim; (d) the retrained classifier]

For an $M \times N$ matrix $X$ with entries $x_{m,n}$, we use² $\|X\|_1 \triangleq \sum_{m=1}^{M} \sum_{n=1}^{N} |x_{m,n}|$ and $\|X\|_F$ as the Frobenius norm. For a vector $x$, $\|x\|_0$ is the cardinality of $x$, $\operatorname{supp} x$ is the set of indices with non-zero entries, and $\operatorname{supp}^c x$ is the complement set. 
We will use the notation $x^+$ as shorthand for $\max(x, 0)$, where $\max(\cdot, 0)$ is applied to vectors and matrices component-wise. Finally, the vertical concatenation of two vectors $a$ and $b$ is denoted by $[a; b]$.

2 Feedforward Network Model

In this section, we introduce some notational conventions related to a feedforward network model. We assume that we have $P$ training samples $x_p$, $p = 1, \ldots, P$, where $x_p \in \mathbb{R}^N$ is an input to the network. We stack up these samples into a matrix $X \in \mathbb{R}^{N \times P}$, structured as $X = [x_1, \ldots, x_P]$. Considering $L$ layers for the network, the output of the network at the final layer is denoted by $Y^{(L)} \in \mathbb{R}^{N_L \times P}$, where each column in $Y^{(L)}$ is a response to the corresponding training column in $X$.

The network activations are taken to be rectified linear units. The output of the $\ell$-th layer is $Y^{(\ell)} \in \mathbb{R}^{N_\ell \times P}$, generated by applying the adjoint of the weight matrix $W_\ell \in \mathbb{R}^{N_{\ell-1} \times N_\ell}$ to the output of the previous layer $Y^{(\ell-1)}$ and then applying a component-wise $\max(\cdot, 0)$ operation:

  $Y^{(\ell)} = \max\left(W_\ell^\top Y^{(\ell-1)}, 0\right), \qquad \ell = 1, \ldots, L,$   (1)

where $Y^{(0)} = X$ and $N_0 = N$. A trained neural network as outlined in (1) is represented by $\mathcal{TN}(\{W_\ell\}_{\ell=1}^L, X)$.

For the sake of theoretical analysis, all the results presented in this paper are stated for link-normalized networks, where $\|W_\ell\|_1 = 1$ for every layer $\ell = 1, \ldots, L$. Such a presentation is with no loss of generality, as any network in the form of (1) can be converted to its link-normalized version by replacing $W_\ell$ with $W_\ell / \|W_\ell\|_1$ and $Y^{(\ell+1)}$ with $Y^{(\ell+1)} / \prod_{j=1}^{\ell+1} \|W_j\|_1$. Since $\max(\alpha x, 0) = \alpha \max(x, 0)$ for $\alpha > 0$, any weight processing on a network of the form (1) can be applied to the link-normalized version and later transferred to the original domain via a suitable scaling.

3 Convex Pruning of the Network

Our pruning strategy relies on redesigning the network so that for the same training data the layer outcomes stay more or less close to the initial trained model, while the weights associated with each layer are replaced with sparser versions to reduce the model complexity. Figure 2 presents the main idea, where the complex paths between the layer outcomes are replaced with simple paths.
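The link-normalization argument above is easy to verify numerically: scaling each $W_\ell$ by its entry-wise $\ell_1$ norm only rescales the ReLU outputs by the product of those norms. A minimal numpy sketch (layer sizes and seed are ours, purely illustrative):

```python
import numpy as np

def forward(Ws, X):
    """Evaluate the ReLU network of Eq. (1): Y^(l) = max(W_l^T Y^(l-1), 0)."""
    Y = X
    for W in Ws:
        Y = np.maximum(W.T @ Y, 0.0)
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 20))        # N = 8 inputs, P = 20 training samples
Ws = [rng.standard_normal((8, 6)),      # two layers, sizes chosen arbitrarily
      rng.standard_normal((6, 4))]

# link-normalized version: W_l / ||W_l||_1, with ||.||_1 the sum of absolute entries
norms = [np.abs(W).sum() for W in Ws]
Ws_norm = [W / n for W, n in zip(Ws, norms)]

# since max(a*x, 0) = a*max(x, 0) for a > 0, the normalized network's output is
# the original output divided by the product of the norms
Y = forward(Ws, X)
Y_norm = forward(Ws_norm, X)
assert np.allclose(Y, Y_norm * np.prod(norms))
```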
In a sense, if we consider each layer response to the transmitted data as a checkpoint, Net-Trim assures the checkpoints remain roughly the same, while a simpler path between the checkpoints is discovered.

²The notation $\|X\|_1$ should not be confused with the matrix induced $\ell_1$ norm.

[Figure 2: The main retraining idea: keeping the layer outcomes close to the initial trained model while finding a simpler path relating each layer input to the output; the chain $X \to W_1 \to Y^{(1)} \to \cdots \to W_L \to Y^{(L)}$ is replaced by $X \to \hat{W}_1 \to \hat{Y}^{(1)} \to \cdots \to \hat{W}_L \to \hat{Y}^{(L)}$.]

Consider the first layer, where $X = [x_1, \ldots, x_P]$ is the layer input, $W = [w_1, \ldots, w_M]$ the layer coefficient matrix, and $Y = [y_{m,p}]$ the layer outcome. We require the new coefficient matrix $\hat{W}$ to be sparse and the new response to be close to $Y$. Using the sum of absolute entries as a proxy to promote sparsity, a natural strategy to retrain the layer is addressing the nonlinear program

  $\hat{W} = \arg\min_U \|U\|_1 \quad \text{s.t.} \quad \left\|\max\left(U^\top X, 0\right) - Y\right\|_F \le \epsilon.$   (2)

Despite the convex objective, the constraint set in (2) is non-convex. However, we may approximate it with a convex set by imposing $Y$ and $\hat{Y} = \max(\hat{W}^\top X, 0)$ to have similar activation patterns. More specifically, knowing that $y_{m,p}$ is either zero or positive, we enforce the $\max(\cdot, 0)$ argument to be negative when $y_{m,p} = 0$, and close to $y_{m,p}$ elsewhere. To present the convex formulation, for $V = [v_{m,p}]$, throughout the paper we use the notation $U \in \mathcal{C}_\epsilon(X, Y, V)$ to present the constraint set

  $\sum_{m,p:\, y_{m,p} > 0} \left(u_m^\top x_p - y_{m,p}\right)^2 \le \epsilon^2, \qquad u_m^\top x_p \le v_{m,p} \;\; \text{for} \;\; m, p:\, y_{m,p} = 0.$   (3)

Based on this definition, a convex proxy to (2) is

  $\hat{W} = \arg\min_U \|U\|_1 \quad \text{s.t.} \quad U \in \mathcal{C}_\epsilon(X, Y, 0).$   (4)

Basically, depending on the value of $y_{m,p}$, a different constraint is imposed on $u_m^\top x_p$ to emulate the ReLU operation. As a first observation towards establishing a retraining framework, we show that the solution of (4) is consistent with the desired constraint in (2), as follows.

Proposition 1. Let $\hat{W}$ be the solution to (4). For $\hat{Y} = \max(\hat{W}^\top X, 0)$ being the retrained layer response, $\|\hat{Y} - Y\|_F \le \epsilon$.

3.1 Parallel and Cascade Net-Trim

Based on the above exploratory discussion, we propose two schemes to retrain a neural network; one explores a computationally distributable nature and the other proposes a cascading scheme to retrain the layers sequentially. The general idea, which originates from the relaxation in (4), is referred to as Net-Trim, specified by its parallel or cascade nature.

The parallel Net-Trim is a straightforward application of the convex program (4) to each layer in the network. Basically, each layer is processed independently based on the initial model input and output, without taking into account the retraining result from the previous layer.
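As a quick sanity check on the constraint set (3) and Proposition 1, membership in $\mathcal{C}_\epsilon(X, Y, V)$ is simple to test numerically. A small numpy sketch (the helper name and toy dimensions are our own, not from the paper):

```python
import numpy as np

def in_constraint_set(U, X, Y, V, eps, tol=1e-9):
    """Test membership U in C_eps(X, Y, V) as defined in Eq. (3)."""
    Z = U.T @ X                                  # pre-activations of the candidate layer
    pos = Y > 0                                  # entries where the trained response is active
    quad_ok = np.sum((Z[pos] - Y[pos]) ** 2) <= eps ** 2 + tol
    sign_ok = np.all(Z[~pos] <= V[~pos] + tol)   # keep inactive entries inactive (up to V)
    return quad_ok and sign_ok

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 40))                 # 5 inputs, 40 training samples
W = rng.standard_normal((5, 3))                  # 3 neurons
Y = np.maximum(W.T @ X, 0.0)

# the original weights are always feasible for eps = 0 and V = 0
assert in_constraint_set(W, X, Y, np.zeros_like(Y), 0.0)

# a uniformly shrunk U = 0.9 W is feasible for eps = 0.1 ||Y||_F, and Proposition 1's
# bound ||max(U^T X, 0) - Y||_F <= eps then holds for the retrained response
U = 0.9 * W
eps = 0.1 * np.linalg.norm(Y)
assert in_constraint_set(U, X, Y, np.zeros_like(Y), eps)
Y_hat = np.maximum(U.T @ X, 0.0)
assert np.linalg.norm(Y_hat - Y) <= eps + 1e-9
```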
Specifically, denoting $Y^{(\ell-1)}$ and $Y^{(\ell)}$ as the input and output of the $\ell$-th layer of the initially trained neural network (see equation (1)), we propose to relearn the coefficient matrix $W_\ell$ via the convex program

  $\hat{W}_\ell = \arg\min_U \|U\|_1 \quad \text{s.t.} \quad U \in \mathcal{C}_\epsilon\left(Y^{(\ell-1)}, Y^{(\ell)}, 0\right).$   (5)

The optimization in (5) can be independently applied to every layer in the network and is hence computationally distributable. Algorithm 1 presents the pseudocode for the parallel Net-Trim. In this pseudocode, we use $\mathrm{TRIM}(X, Y, V, \epsilon)$ as a function which returns the solution to a program like (4) with the constraint $U \in \mathcal{C}_\epsilon(X, Y, V)$.

With reference to the constraint in (5), if we only retrain the $\ell$-th layer, the output of the retrained layer is in the $\epsilon$-neighborhood of that before retraining. However, when all the layers are retrained through (5), an immediate question would be whether the retrained network produces an output which is controllably close to the initially trained model. In the following theorem, we show that the retraining error does not blow up across the layers and remains a multiple of $\epsilon$.

Theorem 1. Let $\mathcal{TN}(\{W_\ell\}_{\ell=1}^L, X)$ be a link-normalized trained network with layer outcomes $Y^{(\ell)}$ described by (1). Form the retrained network $\mathcal{TN}(\{\hat{W}_\ell\}_{\ell=1}^L, X)$ by solving the convex programs (5), with $\epsilon = \epsilon_\ell$ at each layer. Then the retrained layer outcomes $\hat{Y}^{(\ell)} = \max(\hat{W}_\ell^\top \hat{Y}^{(\ell-1)}, 0)$ obey $\|\hat{Y}^{(\ell)} - Y^{(\ell)}\|_F \le \sum_{j=1}^{\ell} \epsilon_j$.

When all the layers are retrained with a fixed parameter $\epsilon$ (as in Algorithm 1), a corollary of the theorem above would bound the overall discrepancy as $\|\hat{Y}^{(L)} - Y^{(L)}\|_F \le L\epsilon$.

In the cascade Net-Trim, unlike the parallel scheme where each layer is retrained independently, the outcome of a retrained layer is probed into the program retraining the next layer. More specifically, having the first layer processed via (4), one would ideally seek to address (5) with the modified constraint $U \in \mathcal{C}_\epsilon(\hat{Y}^{(\ell-1)}, Y^{(\ell)}, 0)$ to retrain the subsequent layers. However, as detailed in §1 of the supplementary note, such a program is not necessarily feasible and needs to be sufficiently slacked to warrant feasibility. In this regard, for every subsequent layer, $\ell = 2, \ldots, L$, the retrained weighting matrix, $\hat{W}_\ell$, is obtained via

  $\min_U \|U\|_1 \quad \text{s.t.} \quad U \in \mathcal{C}_{\epsilon_\ell}\left(\hat{Y}^{(\ell-1)}, Y^{(\ell)}, W_\ell^\top \hat{Y}^{(\ell-1)}\right),$   (6)

where for $W_\ell = [w_{\ell,1}, \ldots, w_{\ell,N_\ell}]$ and $\gamma_\ell \ge 1$,

  $\epsilon_\ell^2 = \gamma_\ell \sum_{m,p:\, y^{(\ell)}_{m,p} > 0} \left(w_{\ell,m}^\top \hat{y}^{(\ell-1)}_p - y^{(\ell)}_{m,p}\right)^2.$   (7)

The constants $\gamma_\ell \ge 1$ (referred to as the inflation rates) are free parameters, which control the sparsity of the resulting matrices. In the following theorem, we prove that the outcome of the retrained network produced by Algorithm 2 is close to that of the network before retraining.

Theorem 2. Let $\mathcal{TN}(\{W_\ell\}_{\ell=1}^L, X)$ be a link-normalized trained network with layer outcomes $Y^{(\ell)}$. Form the retrained network $\mathcal{TN}(\{\hat{W}_\ell\}_{\ell=1}^L, X)$ by solving (5) for the first layer and (6) for the subsequent layers with $\epsilon_\ell$ as in (7), $\hat{Y}^{(1)} = \max(\hat{W}_1^\top X, 0)$, $\hat{Y}^{(\ell)} = \max(\hat{W}_\ell^\top \hat{Y}^{(\ell-1)}, 0)$ and $\gamma_\ell \ge 1$. Then the outputs $\hat{Y}^{(\ell)}$ of the retrained network will obey $\|\hat{Y}^{(\ell)} - Y^{(\ell)}\|_F \le \epsilon_1 \left(\prod_{j=2}^{\ell} \gamma_j\right)^{1/2}$.

Algorithm 2 presents the pseudocode to implement the cascade Net-Trim for a link-normalized network with $\epsilon_1 = \epsilon$ and a constant inflation rate, $\gamma$, across all the layers. In such a case, a corollary of Theorem 2 bounds the network overall discrepancy as $\|\hat{Y}^{(L)} - Y^{(L)}\|_F \le \gamma^{(L-1)/2} \epsilon$.

We would like to note that focusing on a link-normalized network is only for the sake of presenting the theoretical results in a more compact form. In practice, such a conversion is not necessary, and to retrain layer $\ell$ in the parallel Net-Trim we can take $\epsilon = \epsilon_r \|Y^{(\ell)}\|_F$, and use $\epsilon = \epsilon_r \|Y^{(1)}\|_F$ for the cascade case, where $\epsilon_r$ plays a similar role as $\epsilon$ for a link-normalized network. Moreover, as detailed in §2 of the supplementary note, Theorems 1 and 2 identically apply to the practical networks that follow (1) for the first $L-1$ layers and skip an activation at the last layer.

Algorithm 1 Parallel Net-Trim
1: Input: $X$, $\epsilon > 0$, and normalized $W_1, \ldots, W_L$
2: $Y^{(0)} \leftarrow X$
3: for $\ell = 1, \ldots, L$ do   % generating initial layer outcomes
4:   $Y^{(\ell)} \leftarrow \max(W_\ell^\top Y^{(\ell-1)}, 0)$
5: end for
6: for all $\ell = 1, \ldots, L$ do   % retraining
7:   $\hat{W}_\ell \leftarrow \mathrm{TRIM}(Y^{(\ell-1)}, Y^{(\ell)}, 0, \epsilon)$
8: end for
9: Output: $\hat{W}_1, \ldots, \hat{W}_L$

Algorithm 2 Cascade Net-Trim
1: Input: $X$, $\epsilon > 0$, $\gamma > 1$ and normalized $W_1, \ldots, W_L$
2: $Y \leftarrow \max(W_1^\top X, 0)$
3: $\hat{W}_1 \leftarrow \mathrm{TRIM}(X, Y, 0, \epsilon)$
4: $\hat{Y} \leftarrow \max(\hat{W}_1^\top X, 0)$
5: for $\ell = 2, \ldots, L$ do
6:   $Y \leftarrow \max(W_\ell^\top Y, 0)$
7:   $\epsilon \leftarrow \left(\gamma \sum_{m,p:\, y_{m,p} > 0} (w_{\ell,m}^\top \hat{y}_p - y_{m,p})^2\right)^{1/2}$   % $w_{\ell,m}$ is the $m$-th column of $W_\ell$
8:   $\hat{W}_\ell \leftarrow \mathrm{TRIM}(\hat{Y}, Y, W_\ell^\top \hat{Y}, \epsilon)$
9:   $\hat{Y} \leftarrow \max(\hat{W}_\ell^\top \hat{Y}, 0)$
10: end for
11: Output: $\hat{W}_1, \ldots, \hat{W}_L$

4 Convex Analysis and Sample Complexity

In this section, we derive a sampling theorem for a single-layer, redundant network. Here, there are many sets of weights that can induce the observed outputs given the input vectors. This scenario might arise when the number of training samples used to train a (large) network is small (smaller than the network degrees of freedom). We will show that when the inputs into the layers are independent Gaussian random vectors, if there is a sparse set of weights that can generate the output, then with high probability, the Net-Trim program in (4) will find them.

As noted above, in the case of a redundant layer, for a given input $X$ and output $Y$, the relation $Y = \max(W^\top X, 0)$ can be established via more than one $W$. In this case, we hope to find a sparse $W$ by setting $\epsilon = 0$ in (4).
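With $\epsilon = 0$, the per-column program (8) is an $\ell_1$ minimization over linear equality and inequality constraints, so it can be recast as a linear program and handed to an off-the-shelf solver. A sketch using scipy (assuming scipy is available; the dimensions are illustrative and chosen well inside the recovery regime of Theorem 3):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
N, P, s = 40, 80, 2
w_star = np.zeros(N)
w_star[rng.choice(N, s, replace=False)] = rng.standard_normal(s)
X = rng.standard_normal((N, P))                  # Gaussian feature matrix
y = np.maximum(X.T @ w_star, 0.0)                # rectified observations
pos = y > 0                                      # the set Omega = {p : y_p > 0}

# variables z = [w; t]; minimizing 1^T t with |w_i| <= t_i gives sum(t) = ||w||_1
c = np.concatenate([np.zeros(N), np.ones(N)])
I = np.eye(N)
A_ub = np.block([
    [I, -I],                                     #  w - t <= 0
    [-I, -I],                                    # -w - t <= 0
    [X[:, ~pos].T, np.zeros(((~pos).sum(), N))], #  w^T x_p <= 0 where y_p = 0
])
b_ub = np.zeros(A_ub.shape[0])
A_eq = np.hstack([X[:, pos].T, np.zeros((pos.sum(), N))])  # w^T x_p = y_p where y_p > 0
b_eq = y[pos]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(None, None)] * N + [(0, None)] * N)
w_hat = res.x[:N]
assert res.status == 0
assert np.allclose(w_hat, w_star, atol=1e-5)     # the sparse weights are recovered
```

With many more positive observations than unknowns this recovery is driven by the equality constraints themselves; the interesting regime of Theorem 3 is when $P$ is much smaller than the layer degrees of freedom and the $\ell_1$ objective does the work.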
For this value of $\epsilon$, our central convex program decouples into $M$ convex programs, each searching for the $m$-th column in $\hat{W}$:

  $\hat{w}_m = \arg\min_w \|w\|_1 \quad \text{s.t.} \quad \begin{cases} w^\top x_p = y_{m,p} & p:\, y_{m,p} > 0 \\ w^\top x_p \le 0 & p:\, y_{m,p} = 0 \end{cases}$   (8)

By dropping the $m$ index and introducing the slack variable $s$, program (8) can be cast as

  $\min_{w,s} \|w\|_1 \quad \text{s.t.} \quad \tilde{X}[w; s] = y, \;\; s \preceq 0,$   (9)

where

  $\tilde{X} = \begin{bmatrix} X_{:,\Omega}^\top & 0 \\ X_{:,\Omega^c}^\top & -I \end{bmatrix}, \qquad y = \begin{bmatrix} y_\Omega \\ 0 \end{bmatrix},$

and $\Omega = \{p:\, y_p > 0\}$. For a general $\tilde{X}$, not necessarily structured as above, the following result states the sufficient conditions under which a sparse pair $(w^*, s^*)$ is the unique minimizer to (9).

Proposition 2. Consider a pair $(w^*, s^*) \in (\mathbb{R}^{n_1}, \mathbb{R}^{n_2})$, which is feasible for the convex program (9). If there exists a vector $\Lambda = [\Lambda_\ell] \in \mathbb{R}^{n_1+n_2}$ in the range of $\tilde{X}^\top$ with entries satisfying

  $\Lambda_\ell = \operatorname{sign}(w^*_\ell), \;\; \ell \in \operatorname{supp} w^*; \qquad -1 < \Lambda_\ell < 1, \;\; \ell \in \operatorname{supp}^c w^*;$
  $\Lambda_{n_1+\ell} = 0, \;\; \ell \in \operatorname{supp} s^*; \qquad 0 < \Lambda_{n_1+\ell}, \;\; \ell \in \operatorname{supp}^c s^*,$   (10)

and for $\tilde{\Gamma} = \operatorname{supp} w^* \cup \{n_1 + \operatorname{supp} s^*\}$ the restricted matrix $\tilde{X}_{:,\tilde{\Gamma}}$ is full column rank, then the pair $(w^*, s^*)$ is the unique solution to (9).

The proposed optimality result can be related to the unique identification of a sparse $w^*$ from rectified observations of the form $y = \max(X^\top w^*, 0)$. Clearly, the structure of the feature matrix $X$ plays the key role here, and the construction of the dual certificate stated in Proposition 2 entirely relies on this. As an insightful case, we show that when $X$ is a Gaussian matrix (that is, the elements of $X$ are i.i.d. values drawn from a standard normal distribution), for a sufficiently large number of samples, the dual certificate can be constructed. As a result, we can warrant that learning $w^*$ can be performed with much fewer samples than the layer degrees of freedom.

Theorem 3. Let $w^* \in \mathbb{R}^N$ be an arbitrary $s$-sparse vector, $X \in \mathbb{R}^{N \times P}$ a Gaussian matrix representing the samples, and $\mu > 1$ a fixed value. Given $P = (11s + 7)\mu \log N$ observations of the type $y = \max(X^\top w^*, 0)$, with probability exceeding $1 - N^{1-\mu}$ the vector $w^*$ can be learned exactly through (8).

The standard Gaussian assumption for the feature matrix $X$ allows us to relate the number of training samples to the number of active links in a layer. Such a feature structure could be a realistic assumption for the first layer of the neural network. As reflected in the proof of Theorem 3, because of the dependence of the set $\Omega$ on the entries in $X$, we need to take a totally nontrivial analysis path beyond the standard concentration of measure arguments for the sum of independent random matrices. In fact, the proof requires establishing concentration bounds for the sum of dependent random matrices.

While we focused on each column of $W^*$ individually, for the observations $Y = \max(W^{*\top} X, 0)$, using the union bound, an exact identification of $W^*$ can be warranted as a corollary of Theorem 3.

Corollary 1. 
Consider an arbitrary matrix $W^* = [w^*_1, \ldots, w^*_M] \in \mathbb{R}^{N \times M}$, where $s_m = \|w^*_m\|_0$, and $0 < s_m \le s_{\max}$ for $m = 1, \ldots, M$. For $X \in \mathbb{R}^{N \times P}$ being a Gaussian matrix, set $Y = \max(W^{*\top} X, 0)$. If $\mu > (1 + \log_N M)$ and $P = (11 s_{\max} + 7)\mu \log N$, for $\epsilon = 0$, $W^*$ can be accurately learned through (4) with probability exceeding $1 - \sum_{m=1}^M N^{1 - \mu \frac{11 s_{\max} + 7}{11 s_m + 7}}$.

It can be shown that for the network model in (1), probing the network with an i.i.d. sample matrix $X$ would generate subgaussian random matrices with independent columns as the subsequent layer outcomes. Under certain well-conditioning of the input covariance matrix of each layer, results similar to Theorem 3 are extendable to the subsequent layers. While such results are left for a more extended presentation of the work, Theorem 3 is brought here as a good reference for the general performance of the proposed retraining scheme and the associated analysis theme.

5 Implementing the Convex Program

If the quadratic constraint in (3) is brought to the objective via a regularization parameter $\lambda$, the resulting convex program decouples into $M$ smaller programs of the form

  $\hat{w}_m = \arg\min_u \|u\|_1 + \lambda \sum_{p:\, y_{m,p} > 0} \left(u^\top x_p - y_{m,p}\right)^2 \quad \text{s.t.} \quad u^\top x_p \le v_{m,p}, \;\; \text{for } p:\, y_{m,p} = 0,$   (11)

each recovering a column of $\hat{W}$. Such decoupling of the regularized form is computationally attractive, since it makes the trimming task extremely distributable among parallel processing units by recovering each column of $\hat{W}$ on a separate unit. Addressing the original constrained form (4) in a fast and scalable way requires using more complicated techniques, which are left to a more extended presentation of the work.

We can formulate the program in a standard form by introducing the index sets

  $\Omega_m = \{p:\, y_{m,p} > 0\}, \qquad \Omega_m^c = \{p:\, y_{m,p} = 0\}.$   (12)

Denoting the $m$-th row of $Y$ by $y_m^\top$ and the $m$-th row of $V$ by $v_m^\top$, one can equivalently rewrite (11) in terms of $u$ as

  $\min_u \|u\|_1 + u^\top Q_m u + 2 q_m^\top u \quad \text{s.t.} \quad P_m u \preceq c_m,$   (13)

where

  $Q_m = \lambda X_{:,\Omega_m} X_{:,\Omega_m}^\top, \qquad q_m = -\lambda X_{:,\Omega_m} y_{m,\Omega_m}, \qquad P_m = X_{:,\Omega_m^c}^\top, \qquad c_m = v_{m,\Omega_m^c}.$

The $\ell_1$ term in the objective of (13) can be converted into a linear term by defining a new vector $\tilde{u} = [u^+; -u^-]$, where $u^- = \min(u, 0)$. This variable change naturally yields $u = [I, -I]\tilde{u}$ and $\|u\|_1 = \mathbf{1}^\top \tilde{u}$. The convex program (13) is now cast as the standard quadratic program

  $\min_{\tilde{u}} \; \tilde{u}^\top \tilde{Q}_m \tilde{u} + (\mathbf{1} + 2\tilde{q}_m)^\top \tilde{u} \quad \text{s.t.} \quad \begin{bmatrix} \tilde{P}_m \\ -I \end{bmatrix} \tilde{u} \preceq \begin{bmatrix} c_m \\ 0 \end{bmatrix},$   (14)

where

  $\tilde{Q}_m = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix} \otimes Q_m, \qquad \tilde{q}_m = \begin{bmatrix} q_m \\ -q_m \end{bmatrix}, \qquad \tilde{P}_m = [P_m, -P_m].$

Once $\tilde{u}^*_m$, the solution to (14), is found, the solution to (11) can be recovered via $\hat{w}_m = [I, -I]\tilde{u}^*_m$.

Aside from the variety of convex solvers that can be used to address (14), we are specifically interested in using the alternating direction method of multipliers (ADMM). In fact, the main motivation to translate (11) into (14) is the availability of ADMM implementations for problems in the form of (14) that are reasonably fast and scalable (e.g., see [17]). The authors have made the implementation publicly available online³.
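The change of variables $\tilde{u} = [u^+; -u^-]$ behind the standard QP form (14) can be checked numerically: the lifted quadratic objective and constraints agree with the originals at every point. A numpy sketch (toy dimensions, our own variable names):

```python
import numpy as np

rng = np.random.default_rng(1)
N, P, lam = 6, 12, 0.5
X = rng.standard_normal((N, P))
w = rng.standard_normal(N)
y = np.maximum(X.T @ w, 0.0)                    # one row of Y, per Eq. (1)
v = np.zeros(P)                                 # one row of V

pos = y > 0                                     # Omega_m and its complement
Q = lam * X[:, pos] @ X[:, pos].T               # Q_m
q = -lam * X[:, pos] @ y[pos]                   # q_m
Pm = X[:, ~pos].T                               # P_m
c_vec = v[~pos]                                 # c_m

# lifted quantities: Qt = [[1,-1],[-1,1]] (x) Q, qt = [q; -q], Pt = [P, -P]
Qt = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), Q)
qt = np.concatenate([q, -q])
Pt = np.hstack([Pm, -Pm])

u = rng.standard_normal(N)                      # an arbitrary test point
ut = np.concatenate([np.maximum(u, 0.0), -np.minimum(u, 0.0)])  # [u+; -u-] >= 0

obj_orig = np.abs(u).sum() + u @ Q @ u + 2 * q @ u   # objective of the u-form
obj_lift = ut @ Qt @ ut + (1 + 2 * qt) @ ut          # objective of the QP form
assert np.isclose(obj_orig, obj_lift)
assert np.allclose(Pt @ ut, Pm @ u, atol=1e-9)       # lifted constraint matches
```

Because $u^+$ and $-u^-$ have disjoint supports, the linear term $\mathbf{1}^\top \tilde{u}$ reproduces $\|u\|_1$ exactly, which is what makes the QP form amenable to standard solvers.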
6 Experiments and Discussions

Aside from the major technical contribution of the paper in providing a theoretical understanding of the Net-Trim pruning process, in this section we present some experiments to highlight its performance against state-of-the-art techniques.

³The code for the regularized Net-Trim implementation using the ADMM scheme can be accessed online at: https://github.com/DNNToolBox/Net-Trim-v1

The first set of experiments, associated with the example presented in the introduction (classification of 2D points on nested spirals), compares the Net-Trim pruning power against the standard pruning strategies of $\ell_1$ regularization and Dropout. The experiments demonstrate how Net-Trim can significantly improve the pruning level of a given network and produce simpler and more understandable networks. We also compare the cascade Net-Trim against the parallel scheme. As could be expected, for a fixed level of discrepancy between the initial and retrained models, the cascade scheme is capable of producing sparser networks. However, the computational distributability of the parallel scheme makes it a more favorable approach for large-scale and big-data problems. Due to space limitations, these experiments are moved to §3 of the supplementary note.

We next apply Net-Trim to the problem of classifying hand-written digits of the mixed national institute of standards and technology (MNIST) dataset. The set contains 60,000 training samples and 10,000 test instances. 
To examine different settings, we consider six networks: NN2-10K, a 784⋅300⋅300⋅10 network (two hidden layers of 300 nodes) trained with 10,000 samples; NN3-30K, a 784⋅300⋅500⋅300⋅10 network trained with 30,000 samples; and NN3-60K, a 784⋅300⋅1000⋅300⋅10 network trained with 60,000 samples. We also consider CNN-10K, CNN-30K and CNN-60K, which are topologically identical convolutional networks trained with 10,000, 30,000 and 60,000 samples, respectively. The convolutional networks contain two convolutional layers composed of 32 filters of size 5×5×1 for the first layer and 5×5×32 for the second layer, both followed by max pooling, and a fully connected layer of 512 neurons. While the linearity of the convolution allows using Net-Trim for the associated layers, here we merely consider retraining the fully connected layers.

To address the Net-Trim convex program, we use the regularized form outlined in Section 5, which is fully amenable to parallel processing. For our largest problem (associated with the fully connected layer in CNN-60K), retraining each column takes less than 20 seconds, and distributing the independent jobs among a cluster of processing units (in our case 64) or using a GPU reduces the overall retraining of a layer to a few minutes.

Table 1 summarizes the retraining experiments. Panel (a) corresponds to Net-Trim operating in a low-discrepancy mode (smaller ε), while in panel (b) we explore more sparsity by allowing larger discrepancies. Each neural network is trained three times with different initialization seeds and average quantities are reported. In these tables, the first row corresponds to the test accuracy of the initial models. The second row reports the overall pruning rate, and the third row reports the overall discrepancy between the initial and Net-Trim retrained models.
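The per-column jobs described above share no variables, so they map directly onto a worker pool. The sketch below illustrates that structure only; `retrain_column` is a hypothetical stand-in (a least-squares fit followed by soft-thresholding), not the actual convex solve of program (14):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def retrain_column(X, y_col, lam):
    # Hypothetical stand-in for the per-column Net-Trim solve: fit X' w = y_col
    # in the least-squares sense, then soft-threshold to mimic a sparse
    # retraining step. The real per-column problem is the QP (14).
    w = np.linalg.lstsq(X.T, y_col, rcond=None)[0]
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 100))   # layer inputs: 20 neurons x 100 samples
Y = rng.standard_normal((100, 30))   # layer outputs: one column per output neuron

# The 30 column problems are independent, so they distribute trivially across
# a pool of workers (or a cluster, as in the CNN-60K experiment).
with ThreadPoolExecutor() as pool:
    cols = pool.map(lambda m: retrain_column(X, Y[:, m], 0.1), range(Y.shape[1]))
W_hat = np.column_stack(list(cols))
print(W_hat.shape)  # (20, 30)
```

Because no communication is needed between columns, the wall-clock time for a layer is roughly the single-column time divided by the number of workers.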
We also compare the results with the work by Han, Pool, Tran and Dally (HPTD) [14]. The basic idea in [14] is to truncate the small weights across the network and perform another round of training on the active weights. The fourth row reports the test accuracy after applying Net-Trim. To make a fair comparison, we impose the same number of weights to be truncated in the HPTD technique. The accuracy of the model after this truncation is presented in the fifth row. Rows six and seven present the test accuracy of Net-Trim and HPTD after a fine training process (optional for Net-Trim).

An immediate observation is the close test accuracy of Net-Trim compared to the initial trained models (row four vs. row one). We can observe from the second and third rows of the two tables that allowing more discrepancy (larger ε) increases the pruning level. We can also observe that the basic Net-Trim process (row four) in many scenarios beats the HPTD (row seven), and if we allow a fine training step after Net-Trim (row six), a better test accuracy is achieved in all the scenarios.

A serious problem with the HPTD is early minima trapping (EMT). When we simply truncate the layer transfer matrices, ignoring their actual contribution to the network, the error introduced can be very large (row five), and using this biased pattern as an initialization for the fine training can produce poor local minima with large errors. The EMT blocks in the table correspond to the scenarios where all three random seeds failed to generate acceptable results for this approach. In the experiments where Net-Trim was followed by an additional fine training step, this was never an issue, since the Net-Trim outcome is already a good model solution.

In Figure 3(a), we visualize Ŵ₁ after the Net-Trim process.
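The truncation step of HPTD described above can be sketched as follows; this is a minimal illustration of magnitude pruning on a random matrix, not the authors' implementation, and the subsequent fine-training round is omitted:

```python
import numpy as np

def truncate_small_weights(W, prune_frac):
    """Zero out the prune_frac fraction of smallest-magnitude entries of W
    (the truncation step of HPTD [14]); fine training on the surviving
    weights is a separate, later step."""
    k = int(prune_frac * W.size)
    if k == 0:
        return W.copy()
    # k-th smallest magnitude serves as the truncation threshold
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    W_out = W.copy()
    W_out[np.abs(W_out) <= thresh] = 0.0
    return W_out

rng = np.random.default_rng(2)
W = rng.standard_normal((300, 784))          # a layer the size of W1 in NN2-10K
W_pruned = truncate_small_weights(W, 0.4)
print(1.0 - np.count_nonzero(W_pruned) / W.size)  # pruned fraction, ~0.4
```

Note that this rule looks only at weight magnitudes; it ignores each weight's actual contribution to the layer response, which is exactly why the truncated model can land far from the original one (row five of Table 1) before fine training.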
We observe 28 bands (MNIST images are 28×28), where the zero columns represent the boundary pixels carrying the least image information. It is noteworthy that such an interpretable result is achieved by Net-Trim with no pre- or post-processing. A similar outcome of HPTD is depicted in panel (b). In fact, the authors of [14] present a visualization similar to panel (a), which is the result of applying the HPTD process iteratively and going through the retraining step many times. Such a path incurs a heavy processing load and offers no guarantee of being a convergent procedure.

Table 1: The test accuracy of different models before and after Net-Trim (NT) and HPTD [14]. Without a fine training (FT) step, Net-Trim produces pruned networks that are in the majority of cases more accurate than HPTD, with no risk of poor local minima. Adding an additional FT step makes Net-Trim consistently superior.

(a)
                       NN2-10K  NN3-30K  NN3-60K  CNN-10K  CNN-30K  CNN-60K
Init. Mod. Acc. (%)      95.59    97.58    98.18    98.37    99.11    99.25
Total Pruning (%)        40.86    30.69    29.38    43.91    39.11    45.74
NT Overall Disc. (%)      1.98     0.55     1.77     1.31     1.22     0.75
NT No FT Acc. (%)        95.47    97.55    98.10    98.31    99.15    99.25
HPTD No FT Acc. (%)       9.30    10.34     8.92    19.17    55.92    30.17
NT + FT Acc. (%)         95.85    97.67    98.12    98.35    99.21    99.33
HPTD + FT Acc. (%)       93.56    97.32      EMT    98.16      EMT      EMT

(b)
                       NN2-10K  NN3-30K  NN3-60K  CNN-10K  CNN-30K  CNN-60K
Init. Mod. Acc. (%)      95.59    97.58    98.18    98.37    99.11    99.25
Total Pruning (%)        75.87    75.82    77.40    76.18    77.63    81.62
NT Overall Disc. (%)      4.95    11.01    11.47     3.65     8.93     5.32
NT No FT Acc. (%)        94.92    95.97    97.35    97.91    99.08    98.96
HPTD No FT Acc. (%)       8.97     8.92    31.18    73.36    46.84    10.10
NT + FT Acc. (%)         95.89    97.69    98.19    98.40    99.17    99.26
HPTD + FT Acc. (%)       95.61      EMT    97.96      EMT    99.01    99.06

Figure 3: Visualization of Ŵ₁ in NN3-60K; (a) Net-Trim output; (b) standard HPTD

Figure 4: Noise robustness of initial and retrained networks (test accuracy vs. noise level); (a) NN2-10K; (b) NN3-30K

Also, for a deeper understanding of the robustness Net-Trim adds to the models, in Figure 4 we plot the classification accuracy of the initial and retrained models against the level of noise added to the test data (ranging from 0 to 160%). The Net-Trim improvement in accuracy becomes more noticeable as the noise level in the data increases. Basically, as expected, reducing the model complexity makes the network more robust to outliers and noisy samples. It is also interesting to note that the NN3-30K initial model in panel (b), which is trained with more data, is robust to a larger level of noise than NN2-10K in panel (a). However, the retrained models behave rather similarly (blue curves), indicating the saving that can be achieved in the number of training samples via Net-Trim.

In fact, Net-Trim can be particularly useful when the number of training samples is limited. While overfitting is likely to occur in such scenarios, Net-Trim reduces the complexity of the model by setting a significant portion of the weights at each layer to zero, while maintaining the model consistency. This capability can also be viewed from a different perspective: Net-Trim simplifies the process of determining the network size. In other words, if the network used at the training phase is oversized, Net-Trim can reduce its size to an order matching the data.
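The noise-robustness protocol above (perturb the test inputs, then measure accuracy) can be sketched as follows. This is a schematic on a synthetic two-class problem with a nearest-class-mean classifier; the noise scaling relative to the per-sample signal norm is our assumption for illustration, not necessarily the scaling used in the paper's figures:

```python
import numpy as np

def accuracy_under_noise(predict, X_test, y_test, noise_pct, rng):
    """Add Gaussian noise scaled to a percentage of each sample's norm
    (spread over the dimensions), then measure classification accuracy."""
    scale = (noise_pct / 100.0) * np.linalg.norm(X_test, axis=1, keepdims=True) \
            / np.sqrt(X_test.shape[1])
    X_noisy = X_test + scale * rng.standard_normal(X_test.shape)
    return np.mean(predict(X_noisy) == y_test)

# Toy stand-in model: nearest class-mean classifier on synthetic 2-class data
rng = np.random.default_rng(3)
means = np.array([[0.0] * 10, [3.0] * 10])
X = np.vstack([m + rng.standard_normal((200, 10)) for m in means])
y = np.repeat([0, 1], 200)
predict = lambda Z: np.argmin(
    ((Z[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)

for pct in (0, 80, 160):
    print(pct, accuracy_under_noise(predict, X, y, pct, rng))
```

Sweeping the noise percentage and plotting the resulting accuracies for the initial and retrained networks reproduces the shape of the comparison in Figure 4.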
Finally, aside from the theoretical and practical contributions that Net-Trim brings to the understanding of deep neural networks, the idea can easily be generalized to retraining schemes with other regularizers (e.g., ridge or elastic-net type regularizers) or other structural constraints on the network.

References

[1] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

[2] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.

[3] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. In Proceedings of the 31st International Conference on Machine Learning, 2014.

[4] K. Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, 2016.

[5] A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of multilayer networks. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015.

[6] R. Giryes, G. Sapiro, and A. M. Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.

[7] Y. Bengio, N. Le Roux, P. Vincent, O. Delalleau, and P. Marcotte. Convex neural networks. In Proceedings of the 18th International Conference on Neural Information Processing Systems, pages 123–130, 2005.

[8] F. Bach. Breaking the curse of dimensionality with convex neural networks. Technical report, 2014.

[9] O. Aslan, X. Zhang, and D. Schuurmans. Convex deep learning via normalized kernels.
In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 3275–3283, 2014.

[10] S. Nowlan and G. Hinton. Simplifying neural networks by soft weight-sharing. Neural Computation, 4(4):473–493, 1992.

[11] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7(2):219–269, 1995.

[12] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[13] L. Wan, M. Zeiler, S. Zhang, Y. LeCun, and R. Fergus. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning, 2013.

[14] S. Han, J. Pool, J. Tran, and W. Dally. Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pages 1135–1143, 2015.

[15] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.

[16] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.

[17] E. Ghadimi, A. Teixeira, I. Shames, and M. Johansson. Optimal parameter selection for the alternating direction method of multipliers (ADMM): Quadratic problems. IEEE Transactions on Automatic Control, 60(3):644–658, 2015.