{"title": "Knowledge Extraction with No Observable Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2705, "page_last": 2714, "abstract": "Knowledge distillation is to transfer the knowledge of a large neural network into a smaller one and has been shown to be effective especially when the amount of training data is limited or the size of the student model is very small. To transfer the knowledge, it is essential to observe the data that have been used to train the network since its knowledge is concentrated on a narrow manifold rather than the whole input space. However, the data are not accessible in many cases due to the privacy or confidentiality issues in medical, industrial, and military domains. To the best of our knowledge, there has been no approach that distills the knowledge of a neural network when no data are observable. In this work, we propose KegNet (Knowledge Extraction with Generative Networks), a novel approach to extract the knowledge of a trained deep neural network and to generate artificial data points that replace the missing training data in knowledge distillation. Experiments show that KegNet outperforms all baselines for data-free knowledge distillation. We provide the source code of our paper in https://github.com/snudatalab/KegNet.", "full_text": "Knowledge Extraction with No Observable Data\n\nJaemin Yoo\n\nSeoul National University\njaeminyoo@snu.ac.kr\n\nTaebum Kim\n\nSeoul National University\nk.taebum@snu.ac.kr\n\nMinyong Cho\n\nSeoul National University\nchominyong@gmail.com\n\nU Kang\u2217\n\nSeoul National University\n\nukang@snu.ac.kr\n\nAbstract\n\nKnowledge distillation is to transfer the knowledge of a large neural network into\na smaller one and has been shown to be effective especially when the amount of\ntraining data is limited or the size of the student model is very small. 
To transfer\nthe knowledge, it is essential to observe the data that have been used to train the\nnetwork since its knowledge is concentrated on a narrow manifold rather than the\nwhole input space. However, the data are not accessible in many cases due to the\nprivacy or con\ufb01dentiality issues in medical, industrial, and military domains. To the\nbest of our knowledge, there has been no approach that distills the knowledge of a\nneural network when no data are observable. In this work, we propose KEGNET\n(Knowledge Extraction with Generative Networks), a novel approach to extract the\nknowledge of a trained deep neural network and to generate arti\ufb01cial data points\nthat replace the missing training data in knowledge distillation. Experiments show\nthat KEGNET outperforms all baselines for data-free knowledge distillation. We\nprovide the source code of our paper in https://github.com/snudatalab/KegNet.\n\n1\n\nIntroduction\n\nHow can we distill the knowledge of a deep neural network without any observable data?\nKnowledge distillation [9] is to transfer the knowledge of a large neural network or an ensemble of\nneural networks into a smaller network. Given a set of trained teacher models, one feeds training\ndata to them and uses their predictions instead of the true labels to train the small student model. It\nhas been effective especially when the amount of training data is limited or the size of the student\nmodel is very small [14, 28], because the teacher\u2019s knowledge helps the student to learn ef\ufb01ciently\nthe hidden relationships between the target labels even with a small dataset.\nHowever, it is essential for knowledge distillation that at least a few training examples are observable,\nsince the knowledge of a deep neural network does not cover the whole input space; it is focused\non a manifold px of data that the network has actually observed. 
The network is likely to produce\nunpredictable outputs if given random inputs that are not described by px, misguiding the student\nnetwork. There are recent works for distilling a network\u2019s knowledge by a small dataset [22] or only\nmetadata at each layer [23], but no approach has successfully distilled the knowledge without any\nobservable data. It is desirable in this case to generate arti\ufb01cial data by generative networks [7, 31],\nbut they also require a large amount of training data to estimate the true manifold px.\nWe propose KEGNET (Knowledge Extraction with Generative Networks), a novel architecture that\nextracts the knowledge of a trained neural network for knowledge distillation without observable data.\n\n\u2217Corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: An overall structure of KEGNET. The generator creates arti\ufb01cial data and feed them into\nthe classi\ufb01er and decoder. The \ufb01xed classi\ufb01er produces the label distribution of each data point, and\nthe decoder \ufb01nds its hidden representation as a low-dimensional vector.\n\nKEGNET estimates the data manifold px by generating arti\ufb01cial data points, based on the conditional\ndistribution p(y|x) of label y which has been learned by the given neural network. The generated\nexamples replace the missing training data when distilling the knowledge of the given network. As a\nresult, the knowledge is transferred well to other networks even with no observable data.\nThe overall structure of KEGNET is depicted in Figure 1, which consists of two learnable networks G\nand D. A trained network M is given as a teacher, and we aim to distill its knowledge to a student\nnetwork which is not included in this \ufb01gure. The \ufb01rst learnable network is a generator G that takes a\npair of sampled variables \u02c6y and \u02c6z to create a fake data point \u02c6x. 
The second network is a decoder D\nwhich aims to extract the variable \u02c6z from \u02c6x, given as an input for \u02c6x. The variable \u02c6z is interpreted as a\nlow-dimensional representation of \u02c6x, which contains the implicit meaning of \u02c6x independently from\nthe label \u02c6y. The networks G and D are updated to minimize the reconstruction errors between the\ninput and output variables: \u02c6y and \u00afy, and \u02c6z and \u00afz. After the training, G is used to generate fake data\nwhich replace the missing training data in the distillation; D is not used in this step.\nOur extensive experiments in Section 5 show that KEGNET accurately extracts the knowledge of a\ntrained deep neural network for various types of datasets. KEGNET outperforms baseline methods\nfor distilling the knowledge without observable data, showing a large improvement of accuracy up to\n39.6 percent points compared with the best competitors. Especially, KEGNET generates arti\ufb01cial\ndata whose patterns are clearly recognizable by extracting the knowledge of well-known classi\ufb01ers\nsuch as a residual network [8] trained with image datasets such as MNIST [21] and SVHN [25].\n\n2 Related Work\n\n2.1 Knowledge Distillation\n\nKnowledge distillation [9] is a technique to transfer the knowledge of a large neural network or an\nensemble of neural networks into a small one. Given a teacher network M and a student network S,\nwe feed training data to M and use its predictions to train S instead of the true labels. As a result, S\nis trained by soft distributions rather than one-hot vectors, learning latent relationships between the\nlabels that M has already learned. Knowledge distillation has been used for reducing the size of a\nmodel or training a model with insuf\ufb01cient data [1, 2, 12, 14, 29].\nRecent works focused on the distillation with insuf\ufb01cient data. Papernot et al. 
[28] and Kimura et al.\n[15] used knowledge distillation to effectively use unlabeled data for semi-supervised learning when\na trained teacher network is given. Li et al. [22] added a 1 \u00d7 1 convolutional layer at the end of each\nblock of a student network and aligned the teacher and student networks by updating those layers\nwhen only a few samples are given. Lopes et al. [23] considered the case where the data were not\nobservable, but metadata were given for each activation layer of the teacher network. From the given\nmetadata, they reconstructed the missing data and used them to train a student network.\n\n2\n\nNoise \ud835\udc67\u0302Label \ud835\udc66$Generated data\ud835\udc65$Generator \ud835\udc3a\ud835\udc5d\ud835\udc65\ud835\udc66,\ud835\udc67Classifier \ud835\udc40(fixed)Decoder \ud835\udc37\ud835\udc5d\ud835\udc67\ud835\udc65Label \ud835\udc66+Noise \ud835\udc67\u0305Classifier lossDecoder loss\ud835\udc5d\u0302-\ud835\udc66\ud835\udc5d.\ud835\udc67concatsamplingsampling\fHowever, these approaches assume that at least a few labeled examples or metadata are given so that\nit is able to estimate the distribution of missing data. They are not applicable to situations where no\ndata are accessible due to strict privacy or con\ufb01dentiality. To the best of our knowledge, there has\nbeen no approach that works well without training data in knowledge distillation, despite its large\nimportance in various domains that impose strict limitations for distributing the data.\n\n2.2 Tucker Decomposition\n\nA tensor decomposition is to represent an n-dimensional tensor as a sequence of small tensors. 
Tucker\ndecomposition [13, 30] is one of the most successful algorithms for a tensor decomposition, which\ndecomposes an n-dimensional tensor X \u2208 RI1\u00d7I2\u00d7\u00b7\u00b7\u00b7\u00d7In into the following form:\n\n\u02c6X = G \u00d71 A(1) \u00d72 A(2) \u00d73 \u00b7\u00b7\u00b7 \u00d7N A(N ),\n\n(1)\nwhere \u00d7i is the i-mode product [16] between a tensor and a matrix, G \u2208 RR1\u00d7R2\u00d7\u00b7\u00b7\u00b7\u00d7Rn is a core\ntensor, and A(i) \u2208 RIi\u00d7Ri is the i-th factor matrix.\nTucker decomposition has been used to compress various types of deep neural networks. Kim et al.\n[13] and Kholiavchenko [11] compressed convolutional neural networks using Tucker-2 decomposi-\ntion which decomposes convolution kernels along the \ufb01rst two axes (the numbers of \ufb01lters and input\nchannels). They used the global analytic variational Bayesian matrix factorization (VBMF) [24] for\nselecting the rank R, which is important to the performance of compression. Kossai\ufb01 et al. [17] used\nTucker decomposition to compress fully connected layers as well as convolutional layers.\nUnlike most compression algorithms [3, 4], Tucker decomposition itself is a data-free algorithm that\nrequires no training data in the execution. However, a \ufb01ne-tuning of the compressed networks has\nbeen essential [11, 13] since the compression is done layerwise and the compressed layers are not\naligned with respect to the target problem. In this work, we use Tucker decomposition to initialize a\nstudent network that requires the teacher\u2019s knowledge to improve its performance. 
Our work can be\nseen as using Tucker decomposition as a general compression algorithm when the target network is\ngiven but no data are observable, and can be extended to other compression algorithms.\n\n3 Knowledge Extraction\n\nWe are given a trained network M that predicts the label of a feature vector x as a probability vector.\nHowever, we have no information about the data distribution px(x) that was used to train M, which\nis essential to understand its functionality and to use its learned knowledge in further tasks. It is thus\ndesirable to estimate px(x) from M, which is the opposite of a traditional learning problem that aims\nto train M based on observable px. We call this knowledge extraction.\nHowever, it is impracticable to estimate px(x) directly since the data space R|x| is exponential with\nthe dimensionality of data, while we have no single observation except the trained classi\ufb01er M. We\nthus revert to sampling data points and modeling an empirical distribution: for a set D of sampled\npoints, the probability of each sampled point is 1/|D|, and the probability at any other point is zero.\nWe generate the set D of sampled data points by modeling a conditional probability of x given two\nrandom vectors y and z, where y is a probability vector that represents a label, and z is our proposed\nvariable that represents the implicit meaning of a data point as a low-dimensional vector:\n\nD =\n\np(\u02c6x|\u02c6y, \u02c6z) | \u02c6y \u223c \u02c6py(y) and \u02c6z \u223c pz(z)\n\narg max\n\n\u02c6x\n\n,\n\n(2)\n\nwhere \u02c6py(y) is an estimation of the true label distribution py(y) which we cannot observe, and pz(z)\nis our proposed distribution that is assumed to describe the property of z.\nIn this way, we reformulate the problem as to estimate the conditional distribution p(x|y, z) instead\nof the data distribution px(x). 
Recall that z is a low-dimensional representation of a data point x.\nThe variables y and z are conditionally independent of each other given x, since they both depend on\nx but have no direct interactions. Thus, the argmax in Equation (2) is rewritten as follows:\n\np(\u02c6x|\u02c6y, \u02c6z) = arg max\n\narg max\n\n\u02c6x\n\n\u02c6x\n\n= arg max\n\n\u02c6x\n\n(log p(\u02c6y|\u02c6x, \u02c6z) + log p(\u02c6x|\u02c6z) \u2212 log p(\u02c6y|\u02c6z))\n(log p(\u02c6y|\u02c6x) + log p(\u02c6x|\u02c6z)),\n\n(3)\n\n(4)\n\n(cid:26)\n\n(cid:27)\n\n3\n\n\fwhere the \ufb01rst probability p(\u02c6y|\u02c6x) is the direct output of M when \u02c6x is given as an input, which we do\nnot need to estimate since M is already trained. The second probability p(\u02c6x|\u02c6z) represents how well \u02c6z\nrepresents the property of \u02c6x as its low-dimensional representation.\nHowever, estimating the distribution p(x|z) requires knowing px(x) in advance, which we cannot\nobserve due to the absence of accessible data. Thus, we rewrite Equation (4) as Equation (5) and then\napproximate it as Equation (6) ignoring the data probability px(x):\n\np(\u02c6x|\u02c6y, \u02c6z) = arg max\n\u2248 arg max\n\n\u02c6x\n\n(log p(\u02c6y|\u02c6x) + log p(\u02c6z|\u02c6x) + log p(\u02c6x) \u2212 log p(\u02c6z))\n(log p(\u02c6y|\u02c6x) + log p(\u02c6z|\u02c6x)).\n\n(6)\nThe difference is that now we estimate the likelihood p(\u02c6z|\u02c6x) of the variable \u02c6z given \u02c6x instead of the\nposterior p(\u02c6x|\u02c6z). Equation (6) is our \ufb01nal target of estimation for extracting the knowledge of the\ngiven model M. 
We introduce in the next section how to model these conditional distributions by\ndeep neural networks and how to design an objective function which we aim to minimize.\n\n\u02c6x\n\narg max\n\n\u02c6x\n\n(5)\n\n4 Proposed Method\n\nKEGNET (Knowledge Extraction with Generative Networks) is our novel architecture to distill the\nknowledge of a neural network without using training data, by extracting its knowledge as a set of\narti\ufb01cial data points of Equation (2). KEGNET uses two kinds of deep neural networks to model the\nconditional distributions in Equation (6). The \ufb01rst is a generator G which takes \u02c6y and \u02c6z as inputs and\nreturns a data point with the maximum conditional likelihood p(\u02c6x|\u02c6y, \u02c6z). The second is a decoder D\nwhich takes a data point \u02c6x as an input and returns its low-dimensional representation \u00afz.\nThe overall structure of KEGNET is depicted in Figure 1. The generator G is our main component\nthat estimates the empirical distribution by sampling a data point \u02c6x several times. Given a sampled\nclass vector \u02c6y as an input, G is trained to produce data points that M is likely to classify as \u02c6y. This\nmakes G learn different properties of different classes based on M, but leads it to generate similar\ndata points for each class. To address this problem, we train G also to minimize the reconstruction\nerror between \u02c6z and \u00afz, forcing G to embed the information of \u02c6z in the generated data \u02c6x so that D can\nsuccessfully recover it. Thus, data points of the same class should be different from each other when\ngiven different input variables \u02c6z. The reconstruction errors are computed for \u02c6y and \u02c6z, respectively,\nand then added to the \ufb01nal objective function. 
We also introduce a diversity loss to further increase\nthe data diversity in each batch so that the generated data cover a larger region in the data space.\n\n4.1 Objective Function\n\nWe formulate the conditional probabilities of Equation (6) as loss terms to train both the generator G\nand decoder D, and combine them as a single objective function:\n\nl(B) =\n\nlcls(\u02c6y, \u02c6z) + \u03b1ldec(\u02c6y, \u02c6z)\n\n+ \u03b2ldiv(B),\n\n(7)\n\n(cid:17)\n\n(cid:88)\n\n(cid:16)\n\n(\u02c6y,\u02c6z)\u2208B\n\nwhich consists of three different loss functions lcls, ldec, and ldiv. B is a batch of sampled variables\n{(\u02c6y, \u02c6z) | \u02c6y \u223c \u02c6py(y), \u02c6z \u223c pz(z)}, and \u03b1 and \u03b2 are two nonnegative hyperparameters that adjust the\nbalance between the loss terms. Each batch is created by sampling \u02c6y and \u02c6z randomly several times\nfrom the distributions \u02c6py and pz which are determined also as hyperparameters. In our experiments,\nwe set \u02c6py to the categorical distribution that produces one-hot vectors as \u02c6y, and pz to the multivariate\nGaussian distribution that produces standard normal vectors.\nThe classi\ufb01er loss lcls in Equation (8) represents the distance between the input label \u02c6y given to G and\nthe output M (G(\u02c6y, \u02c6z)) returned from M as the cross-entropy between two probability distributions.\nNote that \u02c6y is not a scalar label but a probability vector of length |S| where S is the set of classes.\nMinimizing lcls forces the generated data to follow a manifold that M is able to classify well. The\nlearned manifold may be different from px, but is suited for extracting the knowledge of M.\n\nlcls(\u02c6y, \u02c6z) = \u2212(cid:80)\n\ni\u2208S \u02c6yi log M (G(\u02c6y, \u02c6z))i\n\n(8)\n\nThe decoder loss ldec in Equation (9) represents the distance between the input variable \u02c6z given to\nG and the output D(G(\u02c6y, \u02c6z)) returned from D. 
We use the Euclidean distance instead of the cross\n\n4\n\n\fentropy since z is not a probability distribution. If we optimize G only for lcls, it is likely to produce\nsimilar data points for each class with little diversity. ldec prevents such a problem by forcing G to\ninclude the information of \u02c6z along with \u02c6y in the generated data.\nldec(\u02c6y, \u02c6z) = (cid:107)\u02c6z \u2212 D(G(\u02c6y, \u02c6z))(cid:107)2\n2.\n\n(9)\n\nHowever, the diversity of generated data points may still be insuf\ufb01cient even though D forces G to\ninclude the information of \u02c6z in \u02c6x. In such a case, the empirical distribution estimated by G covers\nonly a small manifold in the large data space, extracting only partial knowledge of M. The diversity\nloss ldiv is introduced to address the problem and further increase the diversity of generated data.\nGiven a distance function d between two data points, the diversity loss ldiv is de\ufb01ned as follows:\n\nldiv(B) = exp\n\n(\u02c6y2,\u02c6z2)\u2208B (cid:107)\u02c6z1 \u2212 \u02c6z2(cid:107)2\n\n2 \u00b7 d(G(\u02c6y1, \u02c6z1), G(\u02c6y2, \u02c6z2))\n\n(10)\nIt increases the pairwise distance between sampled data points in each batch B, multiplied with the\ndistance between \u02c6z1 and \u02c6z2. This gives more weights to the pairs of data points whose input variables\nare more distant by multiplying the distance of noise variables as a scalar weight. The exponential\nfunction makes ldiv produce a positive value as a loss to be minimized. We set d to the Manhattan\ndistance function d(x1, x2) = (cid:107)x1 \u2212 x2(cid:107)1 in our experiments.\n\n.\n\n(cid:16)\u2212(cid:80)\n\n(\u02c6y1,\u02c6z1)\u2208B(cid:80)\n\n(cid:17)\n\n4.2 Relations to Other Structures\n\nAutoencoders The overall structure of KEGNET can be understood as an autoencoder that tries to\nreconstruct two variables y and z at the same time. 
It is speci\ufb01cally an overcomplete autoencoder\nwhich learns a larger embedding vector than the target variables, since x is a learned representation\nand y and z are target variables by this interpretation. It is typically dif\ufb01cult to train an overcomplete\nautoencoder because it can recover the target variable in the representation and make a zero recon-\nstruction error. However in our case, the trained classi\ufb01er M prevents such a problem because it acts\nas a strong regularizer over the generated representations by classifying their labels based on its \ufb01xed\nknowledge. Thus, G needs to be trained carefully so that the generated representations \ufb01t as correct\ninputs to M, while containing the information of both y and z.\n\nGenerative adversarial networks KEGNET is similar to generated adversarial networks (GAN)\n[7] in that a generator creates fake data to estimate the true distribution, and the generated data are\nfed into another network to be evaluated. The structure of G is also motivated by DCGAN [31] and\nACGAN [26] for generating image datasets. However, the main difference from GAN-based models\nis that we have no observable data and thus we cannot train a discriminator which separates fake data\nfrom the real ones. We instead rely on the trained classi\ufb01er M and guide G indirectly toward the true\ndistribution. The decoder D in KEGNET can be understood as an adversarial model that hinders G\nfrom converging to a naive solution, but it is not a direct counterpart of G. Thus, KEGNET can be\nunderstood as a novel architecture designed for the case where no observable data are available.\n\n4.3 Knowledge Distillation\n\nTo distill the knowledge of M, we use the trained generator G to create arti\ufb01cial data and feed them\ninto both the teacher M and student S. We apply the following two ideas to make S explore a large\nspace and to maximize its generalization performance. 
First, we use a set G of multiple generators\ninstead of a single network. Since each generator is initialized randomly, each of the generators learns\na data manifold that is different from those of the others. The number of generators is not limited\nbecause they do not require observable training data. Second, we set \u02c6py to the elementwise uniform\ndistribution which generates unnormalized probability vectors: \u02c6yi \u223c U(0, 1) for each i. This gives an\nuncertain evidence to G and forces it to generate data points which are not classi\ufb01ed easily by M,\nmaking M produce soft distributions in which its knowledge is embedded well.\nAs a result, our loss function ldis to train S by knowledge distillation is given as follows:\n\nldis(\u02c6y, \u02c6z) =\n\nCE(M (G(\u02c6y, \u02c6z)), S(G(\u02c6y, \u02c6z))),\n\n(11)\n\n(cid:88)\n\nG\u2208G\n\nwhere CE denotes the cross entropy. Previous works for knowledge distillation use a temperature [9]\nto increase the entropy of predictions from M so that S can learn hidden relationships between the\nclasses more easily. 
We do not use the temperature since the predictions of M are soft already due to\nour second idea of using the elementwise uniform distribution as \u02c6py.\n\n5\n\n\fTable 1: Detailed information of datasets.\n\nDataset\n8\nShuttle\n16\nPenDigits\n16\nLetter\n1 \u00d7 28 \u00d7 28\nMNIST\nFashion MNIST 1 \u00d7 28 \u00d7 28\n3 \u00d7 32 \u00d7 32\nSVHN\n\nFeatures Labels Training Valid.\n5,438\n937\n2,000\n5,000\n5,000\n5,000\n\n7\n10\n26\n10\n10\n10\n\n38,062\n6,557\n14,000\n55,000\n55,000\n68,257\n\nTest Properties\n\n14,500 Unstructured\n3,498 Unstructured\n4,000 Unstructured\n10,000 Grayscale images\n10,000 Grayscale images\n26,032 RGB images\n\nTable 2: Classi\ufb01cation accuracy of KEGNET and the baseline methods on the unstructured datasets.\nWe report the compression ratios of student models along with the accuracy of Tucker.\n\nModel Approach\nMLP\nMLP\nMLP\nMLP\nMLP\n\nOriginal\nTucker (T)\nT+Uniform\nT+Gaussian\nT+KEGNET\n\nPendigits\n96.56%\n26.44% (8.07\u00d7)\n\nLetter\nShuttle\n95.63%\n99.83%\n75.49% (8.17\u00d7)\n31.40% (4.13\u00d7)\n93.83 \u00b1 0.13% 80.21 \u00b1 0.98% 62.50 \u00b1 0.90%\n94.00 \u00b1 0.06% 78.22 \u00b1 1.74% 76.80 \u00b1 1.84%\n94.21 \u00b1 0.03% 82.62 \u00b1 1.05% 77.73 \u00b1 0.33%\n\n5 Experiments\n\nWe evaluate KEGNET on two kinds of networks and datasets: multilayer perceptrons on unstructured\ndatasets from the UCI Machine Learning Repository2, and convolutional neural networks on MNIST\n[21], Fashion MNIST [33], and SVHN [25]. The datasets are summarized as Table 1.\nWe compare KEGNET with baseline approaches for distilling the knowledge of a neural network\nwithout using observable data. The simplest approach is to use Tucker decomposition alone, but the\nresulting student is not optimized for the target problem because its objective is only to minimize\nthe reconstruction error. 
The second approach is to \ufb01ne-tune the student after Tucker decomposition\nusing arti\ufb01cial data derived from a sampling distribution. If the distribution is largely different from\nthe true distribution, this approach may even decrease the performance of the student from the \ufb01rst\napproach. We use the Gaussian distribution N (0, 1) and uniform distribution U(\u22121, 1).\nIn each setting, we train \ufb01ve generators with different random seeds as G and combine the generated\ndata from all generators. We also train \ufb01ve student networks and report the average and standard\ndeviation of classi\ufb01cation accuracy for quantitative evaluation. We initialize the compressed weights\nof student networks by running the singular value decomposition on the original weights and update\nthem by Tucker decomposition to minimize the reconstruction errors [18]. We also use the hidden\nvariable \u02c6z of length 10 in all settings, which is much smaller than the data vectors. We use a decoder\nnetwork of the same structure in all settings: a multilayer perceptron of n hidden layers with the ELU\nactivation [5] and batch normalization. n is chosen by the data complexity: n = 1 in MNIST, n = 2\nin the unstructured datasets, and n = 3 in Fashion MNIST and SVHN.\n\n5.1 Unstructured Datasets\n\nWe use unstructured datasets in the UCI Machine Learning Repository, for which previous works\n[6, 27] have established reliable standards of performances. We select three datasets which have at\nleast three classes and ten thousand instances. We divide each dataset into training, validation, and\ntest sets with the 7:1:2 ratios if the explicit training and test sets are not given. Otherwise, we divide\nthe given training data into new training and validation sets.\nWe use a multilayer perceptron (MLP) as a classi\ufb01er M, which has been used in [27] and contains\nten hidden layers with the ELU activation function and dropout [32] of probability 0.15. 
We create\nstudent networks by applying Tucker decomposition to all dense layers: the target rank is 5 in Shuttle\nand 10 in the others. We use an MLP as a generator G of two hidden layers with the ELU activation\n\n2http://persoal.citius.usc.es/manuel.fernandez.delgado/papers/jmlr/\n\n6\n\n\fTable 3: Classi\ufb01cation accuracy of KEGNET and the baselines on the image datasets. We report the\ncompression ratios of student models along with the accuracy of Tucker. We use three variants of\nstudents for each dataset with different compression ratios.\n\nDataset Model\nLeNet5\nMNIST\nLeNet5\nMNIST\nLeNet5\nMNIST\nMNIST\nLeNet5\nLeNet5\nMNIST\n\nApproach\nOriginal\nTucker (T)\nT+Uniform\nT+Gaussian\nT+KEGNET\n\nSVHN\nSVHN\nSVHN\nSVHN\nSVHN\n\nResNet14 Original\nResNet14 Tucker (T)\nResNet14 T+Uniform\nResNet14 T+Gaussian\nResNet14 T+KEGNET\n\nFashion ResNet14 Original\nFashion ResNet14 Tucker (T)\nFashion ResNet14 T+Uniform\nFashion ResNet14 T+Gaussian\nFashion ResNet14 T+KEGNET\n\nStudent 2\n98.90%\n67.35% (4.10\u00d7)\n\n93.23%\n11.02% (1.65\u00d7)\n\nStudent 3\nStudent 1\n98.90%\n98.90%\n85.18% (3.62\u00d7)\n50.01% (4.49\u00d7)\n95.48 \u00b1 0.11% 88.27 \u00b1 0.07% 69.89 \u00b1 0.28%\n95.45 \u00b1 0.15% 87.70 \u00b1 0.12% 71.76 \u00b1 0.18%\n96.32 \u00b1 0.05% 90.89 \u00b1 0.11% 89.94 \u00b1 0.08%\n93.23%\n93.23%\n19.31% (1.44\u00d7)\n11.07% (3.36\u00d7)\n33.08 \u00b1 1.47% 63.08 \u00b1 1.77% 23.83 \u00b1 1.86%\n26.58 \u00b1 1.61% 60.22 \u00b1 4.17% 21.49 \u00b1 2.96%\n69.89 \u00b1 1.24% 87.26 \u00b1 0.46% 63.40 \u00b1 1.80%\n92.50%\n65.09% (1.40\u00d7)\n< 65.09%\n< 65.09%\n85.23 \u00b1 1.36% 87.80 \u00b1 0.31% 79.95 \u00b1 1.36%\n\n92.50%\n75.80% (1.58\u00d7)\n< 75.80%\n< 75.80%\n\n92.50%\n46.55% (2.90\u00d7)\n< 46.55%\n< 46.55%\n\nand batch normalization. 
We also apply the non-learnable batch normalization after the output layer\nto restrict the output space to the standard normal distribution: the parameters \u03b3 and \u03b2 [10] are \ufb01xed\nas 0 and 1, respectively. This is natural since most neural networks take standardized inputs.\nTable 2 compares the classi\ufb01cation accuracy of student networks trained by KEGNET and the baseline\napproaches on the unstructured datasets. All three approaches show large improvements of accuracy\nover Tucker, which applies Tucker decomposition without \ufb01ne-tuning. This implies that even simple\ndistributions are helpful to improve the performance of student networks when no training data are\nobservable. Nevertheless, KEGNET shows the highest accuracy in all datasets.\n\n5.2\n\nImage Datasets\n\nWe use two well-known classi\ufb01ers on the image datasets: LeNet5 [20] for MNIST and ResNet with\n14 layers (referred to as ResNet14) [8] for Fashion MNIST and SVHN. We initialize the student\nnetworks by compressing the weight tensors using Tucker-2 decomposition [13] with VBMF [24];\nwe compress only the convolutional layers except the dense layers as [13]. Since the classi\ufb01ers are\nconvolutional neural networks that are optimized for image datasets, we use a generator that is similar\nto that of ACGAN [26], which consists of two fully connected layers followed by three transposed\nconvolutional layers with the batch normalization after each layer.\n\n5.2.1 Quantitative Analysis\n\nWe evaluate KEGNET by training three different student networks for each classi\ufb01er. For LeNet5, we\ncompress the last convolutional layer in Student 1 and the last two convolutional layers in Student 2.\nWe then increase the compression ratio of Student 2 by decreasing the projection rank in Student 3.\nFor ResNet14, we compress the last residual block which consists of two convolutional layers. 
We\ncompress each of the convolutional layers in Students 1 and 2 and the both layers in Student 3.\nTable 3 shows the classi\ufb01cation accuracy of student networks trained by KEGNET and the baseline\napproaches. In MNIST where the dataset and classi\ufb01er are both simple, the Uniform and Gaussian\nbaselines also achieve high accuracy which is up to 21.8%p higher than that of Tucker. However,\ntheir accuracy gain becomes much lower in SVHN, and the accuracy becomes even lower than that\nof Tucker in Fashion MNIST, meaning that \ufb01ne-tuning after Tucker decomposition is not helpful at\nall. This shows that simple random distributions fail with complex datasets whose manifolds are\nfar from trivial distributions. We do not report their exact accuracy in Fashion because they keep\n\n7\n\n\f(a) MNIST (z = 0).\n\n(b) SVHN (z = 0).\n\n(c) SVHN (averaged by z).\n\n(d) Latent space walking from 0 to 5 in SVHN.\n\nFigure 2: Arti\ufb01cial images generated by \ufb01ve generators of KEGNET for MNIST and SVHN. We \ufb01x\nthe noise variable z to a zero vector in (a) and (b), while we average multiple images with random z\nin (c) and (d). The digits are blurry but recognizable especially when averaged by z.\n\ndecreasing as we continue the training. On the other hand, KEGNET outperforms all baselines by\nlearning successfully the data distributions from the given classi\ufb01ers in all datasets.\n\n5.2.2 Qualitative Analysis\n\nWe also analyze qualitatively the extracted data for the image datasets. Figure 2 visualizes arti\ufb01cial\nimages for MNIST and SVHN, which were generated by the \ufb01ve generators in G. The images seem\nnoisy but contain digits which are clearly recognizable, even though the generators do not have any\ninformation about the true datasets. KEGNET generates more clear images in SVHN than in MNIST,\nbecause the digits in SVHN have more distinct patterns than in the hand-written digits. 
The digits are clearer when averaged over multiple hidden variables, implying that images with different hidden variables are diverse but share a common feature that the classifier is able to capture. We also visualize images with soft evidence in Figure 2d by smoothly changing the input label from 0 to 5. The generators create digits that follow the strength of evidence for each class.

6 Conclusion

We propose KEGNET (Knowledge Extraction with Generative Networks), a novel architecture that extracts the knowledge of a trained neural network without any observable data. KEGNET learns the conditional distribution of data points by training the generator and decoder networks, and estimates the manifold of missing data as a set of artificial data points. Our experiments show that KEGNET is able to reconstruct unobservable data that were used to train a deep neural network, especially for image datasets that have distinct and complex manifolds, and improves the performance of data-free knowledge distillation. Future work includes extending KEGNET to knowledge distillation between neural networks of different structures, such as LeNet5 and ResNet14, and to more complex datasets such as CIFAR-10/100 [19], which may require a careful design of new generator networks.

Acknowledgments

This work was supported by the ICT R&D program of MSIT/IITP (No. 2017-0-01772, Development of QA systems for Video Story Understanding to pass the Video Turing Test).

References

[1] Anoop Korattikara Balan, Vivek Rathod, Kevin P. Murphy, and Max Welling. Bayesian dark knowledge. In NIPS, 2015.

[2] Tianqi Chen, Ian J.
Goodfellow, and Jonathon Shlens. Net2net: Accelerating learning via knowledge transfer. In ICLR, 2016.

[3] Jian Cheng, Peisong Wang, Gang Li, Qinghao Hu, and Hanqing Lu. Recent advances in efficient computation of deep convolutional neural networks. Frontiers of IT & EE, 19(1):64–77, 2018.

[4] Yu Cheng, Duo Wang, Pan Zhou, and Tao Zhang. A survey of model compression and acceleration for deep neural networks. arXiv, 2017.

[5] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In ICLR, 2016.

[6] Manuel Fernández Delgado, Eva Cernadas, Senén Barro, and Dinani Gomes Amorim. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181, 2014.

[7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[9] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. Distilling the knowledge in a neural network. arXiv, 2015.

[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, pages 448–456, 2015.

[11] Maksym Kholiavchenko. Iterative low-rank approximation for CNN compression. arXiv, 2018. URL http://arxiv.org/abs/1803.08995.

[12] Jangho Kim, Seonguk Park, and Nojun Kwak. Paraphrasing complex network: Network compression via factor transfer.
In NeurIPS, 2018.

[13] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.

[14] Yoon Kim and Alexander M. Rush. Sequence-level knowledge distillation. In EMNLP, pages 1317–1327, 2016.

[15] Akisato Kimura, Zoubin Ghahramani, Koh Takeuchi, Tomoharu Iwata, and Naonori Ueda. Few-shot learning of neural networks from scratch by pseudo example optimization. In BMVC, page 105, 2018.

[16] Tamara G. Kolda and Brett W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

[17] Jean Kossaifi, Zachary C. Lipton, Aran Khanna, Tommaso Furlanello, and Anima Anandkumar. Tensor regression networks. arXiv, 2017.

[18] Jean Kossaifi, Yannis Panagakis, Anima Anandkumar, and Maja Pantic. TensorLy: Tensor learning in Python. Journal of Machine Learning Research, 20:26:1–26:6, 2019.

[19] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[20] Yann LeCun, LD Jackel, Léon Bottou, Corinna Cortes, John S Denker, Harris Drucker, Isabelle Guyon, Urs A Muller, Eduard Sackinger, Patrice Simard, et al. Learning algorithms for classification: A comparison on handwritten digit recognition. Neural networks: the statistical mechanics perspective, 261:276, 1995.

[21] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[22] Tianhong Li, Jianguo Li, Zhuang Liu, and Changshui Zhang. Knowledge distillation from few samples. arXiv, 2018.

[23] Raphael Gontijo Lopes, Stefano Fenu, and Thad Starner. Data-free knowledge distillation for deep neural networks. In NIPS Workshop, 2017.

[24] Shinichi Nakajima, Masashi Sugiyama, S.
Derin Babacan, and Ryota Tomioka. Global analytic solution of fully-observed variational Bayesian matrix factorization. Journal of Machine Learning Research, 14(1):1–37, 2013.

[25] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[26] Augustus Odena, Christopher Olah, and Jonathon Shlens. Conditional image synthesis with auxiliary classifier GANs. In ICML, 2017.

[27] Matthew Olson, Abraham J. Wyner, and Richard Berk. Modern neural networks generalize on small data sets. In NeurIPS, 2018.

[28] Nicolas Papernot, Martín Abadi, Úlfar Erlingsson, Ian J. Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In ICLR, 2017.

[29] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. In ICLR, 2018.

[30] Stephan Rabanser, Oleksandr Shchur, and Stephan Günnemann. Introduction to tensor decompositions and their applications in machine learning. arXiv, 2017.

[31] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

[32] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[33] Han Xiao, Kashif Rasul, and Roland Vollgraf.
Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv, 2017.