{"title": "Positive-Unlabeled Compression on the Cloud", "book": "Advances in Neural Information Processing Systems", "page_first": 2565, "page_last": 2574, "abstract": "Many attempts have been done to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smart phones. Providing compression and acceleration service of deep learning models on the cloud is therefore of significance and is attractive for end users. However, existing network compression and acceleration approaches usually fine-tuning the svelte model by requesting the entire original training data (e.g. ImageNet), which could be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples. The superiority of the proposed method is verified through experiments conducted on the benchmark models and datasets. 
We can use only 8% of uniformly selected data from ImageNet to obtain an efficient model with performance comparable to the baseline ResNet-34.", "full_text": "Positive-Unlabeled Compression on the Cloud

Yixing Xu†, Yunhe Wang†, Hanting Chen§, Kai Han†, Chunjing Xu†, Dacheng Tao‡, Chang Xu‡

†Huawei Noah's Ark Lab
§Key Laboratory of Machine Perception (MOE), CMIC, School of EECS, Peking University, China
‡The University of Sydney, Darlington, NSW 2008, Australia
{yixing.xu, yunhe.wang, kai.han, xuchunjing}@huawei.com
htchen@pku.edu.cn, {dacheng.tao, c.xu}@sydney.edu.au

Abstract

Many attempts have been made to extend the great success of convolutional neural networks (CNNs) achieved on high-end GPU servers to portable devices such as smart phones. Providing compression and acceleration services for deep learning models on the cloud is therefore of significance and attractive to end users. However, existing network compression and acceleration approaches usually fine-tune the svelte model by requesting the entire original training data (e.g. ImageNet), which could be more cumbersome than the network itself and cannot be easily uploaded to the cloud. In this paper, we present a novel positive-unlabeled (PU) setting for addressing this problem. In practice, only a small portion of the original training set is required as positive examples, and more useful training examples can be obtained from the massive unlabeled data on the cloud through a PU classifier with an attention-based multi-scale feature extractor. We further introduce a robust knowledge distillation (RKD) scheme to deal with the class imbalance problem of these newly augmented training examples.
The superiority of the proposed method is verified through experiments conducted on benchmark models and datasets. We can use only 8% of uniformly selected data from ImageNet to obtain an efficient model with performance comparable to the baseline ResNet-34.

1 Introduction

Convolutional neural networks (CNNs) have been widely used in a variety of computer vision applications such as image classification [14, 18, 21], object detection [5], semantic segmentation [17], clustering [31], and multi-label learning [23]. CNNs are often over-parameterized to achieve good recognition performance. However, many empirical studies suggest that those redundant parameters or filters can be eliminated without affecting the performance of the network. To be compatible with various running environments (e.g. cell phones and autonomous driving) in real-world applications, well-trained neural networks need to be further compressed and accelerated accordingly. Considering the scalable computation resources (e.g. GPU and RAM) offered by the cloud, it is therefore promising to provide a network compression service for end users.

Compared with the model compression service offered by the cloud, it would be much harder for end users to compress the cumbersome network by themselves. On one hand, GPUs are essential to doing effective deep learning. Compared with setting up their own servers, many users tend to spin up cloud instances with GPUs to balance flexibility against investment, especially when the GPUs are only needed for several hours.
On the other hand, not every user is a deep learning expert, and a cloud service would be expected to produce efficient deep neural networks according to users' needs.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: The diagram of the proposed method for compressing deep networks on the cloud.

Existing methods such as the quantization approach [6], the pruning approach [4] and the knowledge distillation approach [9] cannot be easily deployed on the cloud to compress the cumbersome networks submitted by end customers. The major reason is that most of these methods require users to provide the original training data for fine-tuning the compressed network to avoid a significant drop in accuracy. However, compared with the model size of modern CNNs, the size of the entire training set is much larger. For example, ResNet-50 [8] occupies only about 95MB for storing its parameters, while its training dataset (i.e. ImageNet [14]) contains more than one million images with a file size of over 120GB. Therefore, given the limitation of transmission speed (e.g. 10MB/s), users have to wait for a long period of time before launching the compression methods, which harms the user experience of the service.

In this paper, we suggest a two-stage pipeline that leverages easily accessible unlabeled data for training compact neural networks, as shown in Fig. 1. Users are required to upload the pre-trained deep network and a small portion (e.g. 10%) of the original training data. Taking the scarce labeled data as 'positive', in the unlabeled pool (e.g. Flickr [12]) there could be 'positive' data that follows a similar distribution (e.g. of the same concept), while the remaining data are treated as 'negative'.
A binary PU classifier learned from these positive-unlabeled data can then be employed to identify the most related unlabeled data to augment the training set for our compression task. To correct the biased labels contained in the augmented dataset, we further develop a robust knowledge distiller (RKD) to address the problem of noisy and imbalanced labels. Experimental results on several benchmark datasets and deep models demonstrate that, with the help of massive unlabeled data, the proposed method is effective for learning efficient networks with only a small proportion of the original training data.

2 Positive-Unlabeled Classifier for More Data

Here we first present some preliminaries for learning efficient neural networks, and then develop a novel framework to effectively utilize massive unlabeled data on the cloud for training.

2.1 Knowledge for Compressing Neural Networks

Conventional deep model compression algorithms aim to eliminate redundant weights or filters in pre-trained deep neural networks. The resulting networks often have specific structures such as sparse matrices and mixed-bit multiplications, which need additional technical support.
In contrast, the knowledge distillation (KD) method [9] is proposed to directly learn student networks with fewer parameters and lower computational complexity by inheriting feature information from the given teacher network.

Denoting the pre-trained teacher network and the desired efficient student network as Nte and Nst, respectively, the student network is trained using the following objective function:

L_KD = (1/n) Σ_i L_c(y_i^te, y_i^st),  (1)

where L_c(·,·) is the cross-entropy loss, n is the number of samples in the training set, and y_i^te and y_i^st are the output responses of the teacher network Nte and the student network Nst for the same input data x_i, respectively. By using the KD method described in Eq. 1, the student network is able to generalize in the same way as the teacher network, and can empirically obtain a much better result than training it from scratch.

However, the number of samples in the training dataset of Nte is often extremely large, e.g. there are over 1.2 million images in the ILSVRC 2012 dataset with a file size of 120GB. In contrast, modern CNN architectures are increasingly lightweight, e.g. the model size of MobileNet-v2 [22] is only about 15MB. Thus, the time consumed uploading such huge datasets affects the user's experience of the model compression service on the cloud.

2.2 Positive-Unlabeled Classifier for Selecting Data

In order to reduce the required number of samples n in the training dataset, we propose to look for alternative data. Actually, there are massive datasets on cloud servers for conducting different tasks (e.g. CIFAR, ImageNet and Flickr). We can regard them as unlabeled data, and a small proportion of the original samples as positive data.
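As a concrete illustration of the KD objective in Eq. 1, the averaged cross-entropy between softened teacher and student responses can be sketched in plain Python (a minimal sketch, not the authors' implementation; `softmax` and `kd_loss` are hypothetical helper names):

```python
import math

def softmax(z):
    # numerically stable softmax over a list of logits
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kd_loss(teacher_logits, student_logits):
    """L_KD = (1/n) * sum_i L_c(y_i^te, y_i^st) of Eq. 1, with L_c the
    cross-entropy of the student response against the teacher response."""
    n = len(teacher_logits)
    total = 0.0
    for zt, zs in zip(teacher_logits, student_logits):
        y_te, y_st = softmax(zt), softmax(zs)
        total += -sum(t * math.log(s) for t, s in zip(y_te, y_st))
    return total / n
```

Minimizing this quantity over the student's parameters drives the student's output distribution toward the teacher's.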
Thus, the data selection task is exactly a positive-unlabeled (PU) learning problem [13, 30].

PU learning focuses on learning a classifier from positive and unlabeled data. Let x_i ∈ X ⊆ R^d and y_i ∈ Y = {±1} be the input samples and output labels, and let n_l and n_u be the numbers of labeled and unlabeled samples, respectively. The training set T of the PU classifier can be formulated as:

T = L ∪ U = {(x_l, +1)}_{l=1}^{n_l} ∪ {(x_u, y_u)}_{u=1}^{n_u},  (2)

where L is the labeled set and U is the unlabeled set, respectively.

Denote the desired decision function as f : X → Y, let F : X → R be a discriminant function that maps the input data to a real number such that f(x) = sign(F(x)), and let l : R × Y → R be an arbitrary loss function. The decision function f can be optimized with the following non-negative risk estimator:

R̃_pu(f) = π_p R̂_p^+(f) + max{0, R̂_x(f) − π_p R̂_p^−(f)},  (3)

where R̂_x(f) = E_x[l(F(x), −1)], R̂_p^+(f) = E_p[l(F(x), +1)] and R̂_p^−(f) = E_p[l(F(x), −1)] are the corresponding risk functions, and π_p = p(y = +1) is the class prior. l(t, y) measures the loss between the target t and the ground-truth label y.

For the given pre-trained teacher network Nte, we can ask the user to provide a tiny dataset Lt consisting of a small proportion (e.g. 10%) of the original training set. We can then collect an unlabeled dataset U on the cloud, and Eq. 3 can be utilized to select more positive data from U to construct another training dataset for the subsequent model compression task.

Since the pre-trained teacher network Nte is designed to solve the original task, such as an ordinary classification, it is infeasible to directly use the same architecture for PU classification [29].
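The non-negative risk estimator of Eq. 3 can be computed directly from classifier scores; a minimal sketch, assuming the sigmoid surrogate loss for l(t, y) (the function names are ours, not from the paper):

```python
import math

def sigmoid_loss(t, y):
    # sigmoid surrogate loss l(t, y) = 1 / (1 + exp(y * t)); small when y*t >> 0
    return 1.0 / (1.0 + math.exp(y * t))

def nn_pu_risk(pos_scores, unl_scores, pi_p, loss=sigmoid_loss):
    """Non-negative PU risk of Eq. 3:
    R_pu(f) = pi_p * R+_p(f) + max(0, R_x(f) - pi_p * R-_p(f))."""
    r_p_pos = sum(loss(s, +1) for s in pos_scores) / len(pos_scores)  # R+_p(f)
    r_p_neg = sum(loss(s, -1) for s in pos_scores) / len(pos_scores)  # R-_p(f)
    r_x = sum(loss(s, -1) for s in unl_scores) / len(unl_scores)      # R_x(f)
    # the max(0, .) clamp keeps the estimated negative-class risk non-negative
    return pi_p * r_p_pos + max(0.0, r_x - pi_p * r_p_neg)
```

A classifier that scores labeled positives high and unlabeled data low yields a risk close to zero under this estimator.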
Therefore, we introduce an attention-based multi-scale feature extractor NA for extracting features of the input data, i.e. F(x). Note that deep features transition from general to specific along the network, and the transferability of the features drops rapidly in higher layers. Simply using the feature produced by the last layer will produce a large transferability gap, while using combined features from layers at different locations of the network will reduce the gap.

Specifically, let h^j ∈ R^{H^j × W^j × C^j} be the features extracted in the j-th layer. Note that these outputs cannot be directly concatenated because their heights and widths are different. A common way to mitigate this problem is global average pooling. Given h^j, the global spatial information is compressed into a channel-wise descriptor o^j ∈ R^{C^j} [10], where the c-th element of o^j is calculated by:

o^j_c = (1 / (H^j × W^j)) Σ_{m=1}^{H^j} Σ_{n=1}^{W^j} h^j(m, n, c).  (4)

Given the compressed channel-wise descriptors, the simplest way is to directly concatenate them into a single vector. However, such a vector is not flexible enough to reflect the importance of the input signals, which represent features from general to specific. Generally, inputs containing more information should have larger weights. Thus, we add attention on top of these descriptors for adaptation between modalities. Attention can be viewed as a way to allocate the input signal so that more informative components receive more attention from the next layer, and it has been widely used in CNNs across a range of tasks [1, 11].
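The channel-wise descriptor of Eq. 4 is just global average pooling over the spatial dimensions, after which descriptors from layers of different spatial sizes can be concatenated. A minimal sketch, using nested lists in place of an H^j × W^j × C^j tensor (helper names are illustrative):

```python
def global_avg_pool(h):
    """Eq. 4: o_c = (1/(H*W)) * sum_{m,n} h[m][n][c] for each channel c."""
    H, W, C = len(h), len(h[0]), len(h[0][0])
    o = [0.0] * C
    for m in range(H):
        for n in range(W):
            for c in range(C):
                o[c] += h[m][n][c]
    return [v / (H * W) for v in o]

def concat_descriptors(feature_maps):
    # descriptors from layers at different depths become one vector,
    # regardless of each layer's spatial size
    out = []
    for h in feature_maps:
        out.extend(global_avg_pool(h))
    return out
```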
Specifically, given the concatenated channel-wise descriptor o = {o^1, ···, o^j}, we opt to employ a gating mechanism as suggested in [10]:

w = Attention(o, W) = σ(W2 δ(W1 o)),  (5)

in which δ is the ReLU transformation, and W1 ∈ R^{C/r × C} and W2 ∈ R^{C × C/r} are the parameters of two FC layers that reduce the dimensionality of the input by a ratio r, apply a non-linearity, and then increase the dimensionality back to the original. A sigmoid activation σ is used to produce the attention weights w. The final output õ^j is obtained by simply re-scaling the channel-wise descriptor:

õ^j = w^j o^j.  (6)

Algorithm 1 PU classifier for more data.
Require: An initialized network NA, a tiny labeled dataset Lt and an unlabeled dataset U.
1: Module 1: Train PU Classifier
2: repeat
3:   Randomly select a batch {x_i^U}_{i=1}^N from U and {x_i^Lt}_{i=1}^N from Lt;
4:   Optimize NA following Eq. 3;
5: until convergence
6: Module 2: Extend the labeled dataset
7: Obtain the positive data U^p from U utilizing the PU classifier NA;
8: Unify the positive dataset U^p and the tiny dataset Lt to achieve the extended dataset Ll = Lt ∪ U^p;
Ensure: Extended dataset Ll.

Based on the proposed feature extractor NA, we train on the data from the unlabeled dataset U and the tiny labeled dataset Lt, which is randomly sampled from the original dataset L. Dataset Lt is then expanded with the data classified as positive in dataset U, finally deriving a larger positive dataset Ll with the non-negative PU loss of Eq. 3. Specifically, we minimize the non-negative PU loss with stochastic gradient descent (SGD) and stochastic gradient ascent (SGA). Denote t_i = R̂_x(f(x_i^U)) − π_p R̂_p^−(f(x_i^Lt)). When t_i > 0, we minimize Eq. 3 with SGD. Otherwise, the gradient of t_i is computed and we update the parameters of the network with SGA.
That is, we go along with −∇_θ t_i, in order to alleviate over-fitting to the current mini-batch i. A more specific procedure is presented in Algorithm 1.

3 Robust Knowledge Distillation

The number of training examples in each class is usually balanced for better training of deep neural networks. However, the dataset Ll generated by PU learning may suffer from a data imbalance problem. For example, in the ImageNet dataset the number of samples in category 'dog' is 30 times larger than that in category 'plane', and there are no samples from category 'deer'. When Lt is randomly sampled from CIFAR-10 and ImageNet is treated as the unlabeled dataset U, 'dog' samples will dominate the expanded dataset Ll. Therefore, it is unsuitable to directly adopt the KD method given the imbalanced dataset Ll.

There are many works focusing on the data imbalance problem. However, they cannot be directly used in our problem, since the number of samples in each category of Ll is unknown. The PU learning method only distinguishes whether the images in U belong to the given dataset L; it does not deal with the specific classes of the input images.

In practice, we utilize the output of the teacher network. Note that instead of treating the output class label as the ground truth of the input sample x_i, we treat the output response y_i^te = softmax(Z_i / T) as the pseudo ground-truth vector, in which Z_i is the final score output and T is the temperature parameter, which helps soften the output when the probability for one class is close to 1 and the others are close to 0 (T = 1 in the following experiments).
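The temperature-softened pseudo ground truth y_i^te = softmax(Z_i / T) described above can be sketched as follows (`soft_response` is an illustrative helper name, not from the paper):

```python
import math

def soft_response(logits, T=1.0):
    """Pseudo ground-truth vector softmax(Z / T); a larger temperature T
    softens near one-hot outputs (the paper uses T = 1)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]
```

For instance, raising T spreads probability mass away from the dominant class while preserving the ranking of classes.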
To this end, we propose a robust knowledge distillation (RKD) method to solve the data imbalance problem.

Specifically, we assign a weight to each category of samples, where categories with fewer samples have larger weights. Based on this principle, defining y = {y_1, y_2, ···, y_K} = Σ_i y_i^te, we have the weight vector w_kd = {w_kd^k}_{k=1}^K, in which:

w_kd^k = (K / y_k) / (Σ_{k=1}^K 1/y_k),  k = 1, 2, ···, K,  (7)

and K is the number of categories in the original dataset. When training the student network, the weight of an input sample is defined as w_i = w_kd^{category_i}, in which category_i is the index of the largest element in the pseudo ground-truth vector y_i^te. Therefore, the surrogate KD loss can be derived based on Eq. 1:

L̃_KD = (1/n) Σ_i w_i F_ce(y_i^te, y_i^st).  (8)

Note that the derivation of w_kd is not optimal, since the predicted output response y_i^te is not optimal and is contaminated with noise. However, we assume that the teacher network is well trained, so there is only a slight difference between the elements of w_kd and the optimal weight vector w*_kd:

p(|w_kd^k − w*_kd^k| < ε) > 1 − δ.  (9)

Thus, we give a random perturbation ε to each element of the original weight vector w_kd and get a finite set of possible weight vectors W = {w_kd_1, ···, w_kd_n}, in which |w_kd_i^k − w_kd^k| < ε. Note that this is similar to cost-sensitive learning with multiple cost matrices. Based on these weight vectors, we are able to train the student network with the following equation:

N_st^W = argmin_{Nst ∈ N} max_{w ∈ W} L̃_KD(Nst, w),  (10)

in which N is the hypothesis space. This is similar to the method proposed in [27]. However, different from the cost matrix, the weight vector in Eq. 7 is only related to the proportion of samples in each category and has nothing to do with the classification result, which is suitable for our learning problem. Besides, we solve a multi-class problem rather than a binary one.

Algorithm 2 Robust knowledge distillation.
Require: A given teacher network Nte, the extended dataset Ll and a hyper-parameter ε.
1: Initialize the student network Nst;
2: Calculate the weight vector w_kd using Eq. 7 and generate a set W using a random perturbation ε;
3: repeat
4:   Randomly select a batch {x_i^Ll}_{i=1}^m;
5:   Employ the teacher and student networks: y_i^te ← Nte(x_i^Ll); y_i^st ← Nst(x_i^Ll);
6:   Calculate the surrogate KD loss L̃_KD following Eq. 8;
7:   Update N_st^W with Eq. 10;
8: until convergence
Ensure: The student network Nst.

4 Experiments

4.1 CIFAR-10

The widely used CIFAR-10 benchmark is first selected as the original dataset, which is composed of 32 × 32 images from 10 categories. We randomly select n_l samples from each class to form the tiny labeled dataset Lt with 10 n_l positive samples. The benchmark dataset ImageNet contains over 1.2M images from 1000 classes, but it is treated as the unlabeled dataset U with n_u = 1.2M unlabeled samples in our experiment. In this setting, 'positive' indicates that the category of the input sample belongs to one of the categories of the original dataset CIFAR-10. Recall that the class prior π_p = p(y = +1) in Eq. 3 indicates the proportion of positive samples in U, which is assumed to be known in the following experiments. In practice, it can be estimated with the method in [19].
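The category weighting of Eq. 7 and the inner maximisation of Eq. 10 can be sketched as follows (a simplified illustration under our own naming; per-sample losses stand in for the surrogate KD loss of Eq. 8):

```python
def rkd_class_weights(y):
    """Eq. 7: w_k = (K / y_k) / sum_j(1 / y_j), where y_k accumulates the
    teacher's soft responses for class k; rarer classes get larger weights."""
    K = len(y)
    inv = [1.0 / yk for yk in y]
    denom = sum(inv)
    return [K * ik / denom for ik in inv]

def worst_case_weighted_loss(per_sample_losses, sample_classes, weight_vectors):
    """Inner maximisation of Eq. 10: evaluate the weighted surrogate KD loss
    under every perturbed weight vector in W and keep the largest."""
    def weighted(w):
        n = len(per_sample_losses)
        return sum(w[c] * l for l, c in zip(per_sample_losses, sample_classes)) / n
    return max(weighted(w) for w in weight_vectors)
```

The student is then updated to minimize this worst-case loss, so the perturbed weight vectors act like multiple cost matrices in cost-sensitive learning.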
In this experiment, we manually select positive data from U based on the category names provided by the ImageNet 2012 classification dataset [14], and train the student network with the manually selected data using the proposed RKD method as the baseline. The total number of positive data we selected is around 270k, so we set the class prior π_p = 0.21 ≈ 270k/1280k in the following experiment.

The model used in the first step is an attention-based multi-scale feature extractor built on ResNet-34. Specifically, the channel-wise descriptors o^j in Eq. 4 are derived from the outputs of the 4 groups in ResNet-34. The network is trained for 200 epochs using SGD. We use a weight decay of 0.005 and momentum of 0.9. We start with a learning rate of 0.001 and divide it by 10 every 50 epochs. Data in ImageNet is resized to 32 × 32 rather than 224 × 224 in our experiment. Random flipping, random cropping and zero-padding are used for data augmentation. In the second step, the teacher network is a pre-trained ResNet-34, and ResNet-18 is used as the student network. A weight decay of 0.0005 and momentum of 0.9 are used. We optimize the student network using SGD, starting with a learning rate of 0.1 and dividing it by 10 every 50 epochs.

Table 1: Classification results on the CIFAR-10 dataset. The best results are bold in the table.

Method     | nl  | nt      | Data source            | FLOPs | #params | Acc(%)
Teacher    | -   | 50,000  | Original Data          | 1.16G | 21M     | 95.61
KD [9]     | -   | 50,000  | Original Data          | 557M  | 11M     | 94.40
Baseline-1 | -   | 269,427 | Manually selected data | 557M  | 11M     | 93.44
Baseline-2 | -   | 50,000  | Randomly selected data | 557M  | 11M     | 87.02
PU-s1      | 100 | 110,608 | PU data                | 557M  | 11M     | 93.75
PU-s1      | 50  | 94,803  | PU data                | 557M  | 11M     | 93.02
PU-s1      | 20  | 74,663  | PU data                | 557M  | 11M     | 92.23
PU-s2      | 100 | 50,000  | PU data                | 557M  | 11M     | 91.56
PU-s2      | 50  | 50,000  | PU data                | 557M  | 11M     | 91.33
PU-s2      | 20  | 50,000  | PU data                | 557M  | 11M     | 91.27
π_p = 0.21 is used in the following experiments.

Note that in the first step of our algorithm, the positive samples are automatically selected by the PU method. Thus, the number of training samples for the second step is not fixed, and can be influenced by the architecture of the network, the hyper-parameters used in the experiment, etc. Under these circumstances, it is difficult to judge whether a good result benefits from the quality or from the number of the training data. Therefore, there are two settings in our experiment. The first setting feeds all the positive data selected by the PU method to the second step to train the student network. The other setting randomly selects a subset with the same number of samples as the original training dataset (50k for CIFAR-10).

The experimental results are shown in Tab. 1. Therein, the 'Baseline-1' method directly feeds the manually selected positive data to the second step. The 'Baseline-2' method randomly selects 50,000 samples which are then fed to the second step; this set inevitably contains many negative data and should result in a bad performance. 'PU-s1' is the setting of feeding all the positive data selected by the PU method to the second step, and 'PU-s2' is the setting of randomly feeding 50,000 positive samples to the second step. In addition, n_l is the number of samples selected from each class in CIFAR-10, and n_t is the number of training samples used to train the student network. Supposing that n_p^u positive samples are selected from U by the PU method, we have n_t = n_p^u + 10 n_l.

The results show that the performance of the proposed method is even better than that of the baseline method. With 1000 samples from CIFAR-10 and about 110k training samples selected from ImageNet, it achieves a higher accuracy than the baseline method with 270k manually selected training data.
This shows the superiority of the proposed method in selecting high-quality positive samples from an unlabeled dataset. In fact, manually selecting positive samples from ImageNet requires a huge effort, and our manual selection was not careful enough to exclude all the negative data from the manually selected dataset.

In the previous experiments the class prior π_p is assumed to be known. In practice we may suffer from errors in estimating π_p. Thus, a number of different values π′_p are given to the proposed algorithm in order to test the robustness of the proposed method with respect to the class prior. All the experimental settings are exactly the same except for the change from π_p to π′_p. Fig. 2 shows the classification accuracies obtained with different π′_p. 50k training samples are randomly selected in the second step to alleviate the influence of the number of training samples. The same experiments are conducted on both ResNet-34 and the attention-based multi-scale feature extractor, with the traditional KD and the RKD method, to show the superiority of the proposed architecture and the RKD method. The results show that the proposed architecture with the RKD method performs best, and is more robust to under- and over-estimates of the true class prior π_p.

The experimental results show that although there are many negative data in the ImageNet dataset, the PU classifier can successfully pick a large amount of positive data whose categories are the same as those of the given data. Therefore, the extended dataset consisting of the given data and the selected data can be used to train a portable student network.

Figure 2: Classification accuracies on the CIFAR-10 dataset with different π′_p.

Figure 3: Relationship between the number of samples selected from each category in ImageNet and the resulting accuracy.

Table 2: Classification results on the ImageNet dataset.
"KD-all" utilizes the entire ImageNet training dataset to train the student network. "KD-500k" randomly selects 500k training data from ImageNet for learning the student network.

Algorithm | nt        | Data source   | FLOPs | #params | top-1 acc(%) | top-5 acc(%)
Teacher   | 1,281,167 | Original Data | 3.67G | 22M     | 73.27        | 91.26
KD-all    | 1,281,167 | Original Data | 1.82G | 12M     | 68.67        | 88.76
KD-500k   | 500,000   | Original Data | 1.82G | 12M     | 63.90        | 85.88
PU-s1     | 690,978   | PU data       | 1.82G | 12M     | 61.92        | 86.00
PU-s2     | 500,000   | PU data       | 1.82G | 12M     | 61.21        | 85.33

4.2 ImageNet

Then, we conduct an experiment on the ImageNet dataset, which is treated as the original dataset. The Flickr1M dataset is used as the unlabeled dataset¹. The experimental settings are the same as those in the CIFAR-10 experiments, except that we train for 110 epochs in both steps and divide the learning rate by 10 every 30 epochs. The class prior π_p is set to 0.7 in the following experiments. The experimental results are shown in Table 2.

In order to make a fair comparison, we randomly select 500k samples from ImageNet and treat KD-500k as the baseline method. In the proposed method, we randomly select 100 samples from each category in ImageNet to form a tiny labeled dataset Lt, and then the PU method is used to select positive data from the Flickr1M dataset. The results show that when feeding all the positive samples to the second step, the top-5 accuracy is even better than that of the baseline method. The reason that the top-1 accuracy is worse than the baseline while the top-5 accuracy is better is that we do not distinguish the specific category when using the PU method. Thus, the proposed method is better at learning meta knowledge than specific labels. When using the same number of training samples, the proposed method has only a 0.5% top-5 accuracy drop compared to the baseline method while using only 8% of the samples in the original dataset.

Fig.
3 shows the relationship between the number of samples selected from each category in ImageNet and the accuracy of the proposed method. It is obvious that our method still achieves a promising result when using only about 0.8% of the samples of the original dataset.

¹http://press.liacs.nl/mirflickr/mirdownload.html

4.3 MNIST

Since most experiments in existing methods are conducted on the MNIST dataset, we further conduct experiments on this dataset in order to compare our method to state-of-the-art methods including FitNet [20], FSKD [15] and the data-free KD method [16]. The EMNIST dataset² is used as the unlabeled dataset, which contains 814K hand-written letters and digits. We randomly select 1, 2, 5, 10 and 20 samples from each category in MNIST to form the tiny set Lt. We use a standard LeNet-5 as the teacher network, and the student network is 'half-size' compared to the teacher network in terms of the number of feature map channels per conv-layer. The class prior π_p is set to 0.47 in the following experiments.

Table 3: Comparison with the state-of-the-art methods on the MNIST dataset.

Method            | 1    | 2    | 5    | 10   | 20   | all-meta-data
data-free KD [16] | -    | -    | -    | -    | -    | 92.5
FitNet [20]       | 90.3 | 94.2 | 96.1 | 96.7 | 97.3 | -
FSKD [15]         | 95.5 | 97.2 | 97.6 | 98.0 | 98.1 | -
PU-s1             | 98.5 | 98.7 | 98.7 | 98.8 | 98.9 | -
PU-s2             | 98.3 | 98.5 | 98.5 | 98.6 | 98.6 | -

Detailed classification results are shown in Tab. 3. It is clear that the proposed method outperforms FitNet and FSKD by a notable margin and is more robust when the number of labeled samples in each category is extremely rare (< 5).

5 Related Works

In this section, we give a brief introduction to related works on model compression. There is a group of algorithms designed for learning efficient neural networks with less memory usage and lower computational complexity [7, 28]. For example, Gong et al.
[6] investigated the vector quantization approach for representing similar weights in smaller CNNs. Denton et al. [4] exploited the redundancy within convolutional filters to derive approximations and significantly reduced the required computational costs. Chen et al. [3] compressed the weights in neural networks using the hashing trick [24, 25]. Hinton et al. [9] presented the knowledge distillation approach for transferring information from the pre-trained teacher network to a compressed student network.

Nowadays, there are only a few attempts to learn efficient neural networks with some meta-data of the training set or without using the original training data. For instance, Srinivas and Babu [26] directly removed the redundant similar neurons in a systematic way. Based on knowledge distillation, Lopes et al. [16] used some extra meta-data to learn smaller deep neural networks. However, the performance of the networks learned through these methods is often much worse than that of the baseline network, because the amount of available data and information is extremely small. More recently, Chen et al. [2] designed a generator for generating data with similar properties to those of the original dataset, which obtained promising performance but lacked efficiency in generating images.

6 Conclusion

Most existing network compression methods require the original dataset to achieve acceptable performance. However, the huge size of the training dataset leads to unacceptable transmission costs from the end user to the cloud. Therefore, we propose a two-step framework to compress a given neural network using only a small portion of the training data. Firstly, a PU classifier with an attention-based multi-scale feature extractor is trained with the given labeled data and massive unlabeled data on the cloud.
Then, a new dataset is constructed by combining the given data with the 'positive' data selected by the PU classifier. Secondly, we develop a robust knowledge distillation (RKD) method to address the class imbalance problem and label noise in the augmented dataset. Experiments on the MNIST, CIFAR-10 and ImageNet datasets demonstrate that the proposed method can successfully mine more useful training samples using only a small amount of original data, and achieves state-of-the-art performance compared with other few-shot model-compression methods.

2 https://www.westernsydney.edu.au/bens/home/reproducible_research/emnist

Acknowledgments

We thank the anonymous area chair and reviewers for their helpful comments. Chang Xu was supported by the Australian Research Council under Project DE180101438.

References

[1] C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
[2] H. Chen, Y. Wang, C. Xu, Z. Yang, C. Liu, B. Shi, C. Xu, C. Xu, and Q. Tian. Data-free learning of student networks. arXiv preprint arXiv:1904.01186, 2019.
[3] W. Chen, J. Wilson, S. Tyree, K. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In International Conference on Machine Learning, pages 2285–2294, 2015.
[4] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.
[5] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[6] Y. Gong, L. Liu, M. Yang, and L. Bourdev.
Compressing deep convolutional networks using vector quantization. arXiv preprint arXiv:1412.6115, 2014.
[7] K. Han, Y. Wang, Y. Xu, C. Xu, D. Tao, and C. Xu. Full-stack filters to build minimum viable CNNs. arXiv preprint arXiv:1908.02023, 2019.
[8] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[9] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
[10] J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
[11] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[12] H. Jegou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In European Conference on Computer Vision, pages 304–317, 2008.
[13] R. Kiryo, G. Niu, M. C. du Plessis, and M. Sugiyama. Positive-unlabeled learning with non-negative risk estimator. In Advances in Neural Information Processing Systems, pages 1675–1685, 2017.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[15] T. Li, J. Li, Z. Liu, and C. Zhang. Knowledge distillation from few samples. arXiv preprint arXiv:1812.01839, 2018.
[16] R. G. Lopes, S. Fenu, and T. Starner. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.
[17] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation.
In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[18] I. Oguntola, S. Olubeko, and C. Sweeney. SlimNets: An exploration of deep model compression and acceleration. In 2018 IEEE High Performance Extreme Computing Conference (HPEC), pages 1–6. IEEE, 2018.
[19] H. Ramaswamy, C. Scott, and A. Tewari. Mixture proportion estimation via kernel embeddings of distributions. In International Conference on Machine Learning, pages 2052–2060, 2016.
[20] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
[21] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.
[22] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[23] X. Shen, W. Liu, I. W. Tsang, Q.-S. Sun, and Y.-S. Ong. Multilabel prediction via cross-view search. IEEE Transactions on Neural Networks and Learning Systems, 29(9):4324–4338, 2017.
[24] X. Shen, F. Shen, L. Liu, Y.-H. Yuan, W. Liu, and Q.-S. Sun. Multiview discrete hashing for scalable multimedia search. ACM Transactions on Intelligent Systems and Technology (TIST), 9(5):53, 2018.
[25] X. Shen, F. Shen, Q.-S. Sun, Y. Yang, Y.-H. Yuan, and H. T. Shen. Semi-paired discrete hashing: Learning latent hash codes for semi-paired cross-view retrieval. IEEE Transactions on Cybernetics, 47(12):4275–4288, 2016.
[26] S. Srinivas and R. V. Babu. Data-free parameter pruning for deep neural networks. arXiv preprint arXiv:1507.06149, 2015.
[27] R. Wang and K. Tang. Minimax classifier for uncertain costs.
arXiv preprint arXiv:1205.0406, 2012.
[28] Y. Wang, C. Xu, J. Qiu, C. Xu, and D. Tao. Towards evolutionary compression. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2476–2485. ACM, 2018.
[29] M. Xu, B. Li, G. Niu, B. Han, and M. Sugiyama. Revisiting sample selection approach to positive-unlabeled learning: Turning unlabeled data into positive rather than negative. arXiv preprint arXiv:1901.10155, 2019.
[30] Y. Xu, C. Xu, C. Xu, and D. Tao. Multi-positive and unlabeled learning. In IJCAI, pages 3182–3188, 2017.
[31] P. Zhou, Y. Hou, and J. Feng. Deep adversarial subspace clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1596–1604, 2018.