{"title": "Pelee: A Real-Time Object Detection System on Mobile Devices", "book": "Advances in Neural Information Processing Systems", "page_first": 1963, "page_last": 1972, "abstract": "An increasing need of running Convolutional Neural Network (CNN) models on mobile devices with limited computing power and memory resource encourages studies on efficient model design. A number of efficient architectures have been proposed in recent years, for example, MobileNet, ShuffleNet, and MobileNetV2. However, all these models are heavily dependent on depthwise separable convolution which lacks efficient implementation in most deep learning frameworks. In this study, we propose an efficient architecture named PeleeNet, which is built with conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our proposed PeleeNet achieves a higher accuracy and 1.8 times faster speed than MobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only 66% of the model size of MobileNet. We then propose a real-time object detection system by combining PeleeNet with Single Shot MultiBox Detector (SSD) method and optimizing the architecture for fast speed. Our proposed detection system, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in consideration of a higher precision, 13.6 times lower computational cost and 11.3 times smaller model size. The code and models are open sourced.", "full_text": "Pelee: A Real-Time Object Detection System on\n\nMobile Devices\n\nRobert J. Wang, Xiang Li, Charles X. 
Ling \u2217\n\nDepartment of Computer Science\n\nUniversity of Western Ontario\n\nLondon, Ontario, Canada, N6A 3K7\n\n{jwan563,lxiang2,charles.ling}@uwo.ca\n\nAbstract\n\nAn increasing need of running Convolutional Neural Network (CNN) models on\nmobile devices with limited computing power and memory resource encourages\nstudies on ef\ufb01cient model design. A number of ef\ufb01cient architectures have been\nproposed in recent years, for example, MobileNet, Shuf\ufb02eNet, and MobileNetV2.\nHowever, all these models are heavily dependent on depthwise separable convolu-\ntion which lacks ef\ufb01cient implementation in most deep learning frameworks. In\nthis study, we propose an ef\ufb01cient architecture named PeleeNet, which is built\nwith conventional convolution instead. On ImageNet ILSVRC 2012 dataset, our\nproposed PeleeNet achieves a higher accuracy and over 1.8 times faster speed than\nMobileNet and MobileNetV2 on NVIDIA TX2. Meanwhile, PeleeNet is only\n66% of the model size of MobileNet. We then propose a real-time object detec-\ntion system by combining PeleeNet with Single Shot MultiBox Detector (SSD)\nmethod and optimizing the architecture for fast speed. Our proposed detection\nsystem2, named Pelee, achieves 76.4% mAP (mean average precision) on PASCAL\nVOC2007 and 22.4 mAP on MS COCO dataset at the speed of 23.6 FPS on iPhone\n8 and 125 FPS on NVIDIA TX2. The result on COCO outperforms YOLOv2 in\nconsideration of a higher precision, 13.6 times lower computational cost and 11.3\ntimes smaller model size.\n\n1\n\nIntroduction\n\nThere has been a rising interest in running high-quality CNN models under strict constraints on\nmemory and computational budget. Many innovative architectures, such as MobileNets [1], Shuf-\n\ufb02eNet [2], NASNet-A [3], MobileNetV2 [4], have been proposed in recent years. However, all these\narchitectures are heavily dependent on depthwise separable convolution [5] which lacks ef\ufb01cient\nimplementation. 
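To make concrete why depthwise separable convolution looks attractive on paper, the following back-of-the-envelope sketch (illustrative, not taken from the paper) compares parameter and multiply-accumulate counts of a standard convolution against its depthwise separable factorization:

```python
# Rough cost comparison: standard k x k convolution vs. depthwise separable
# (depthwise k x k followed by pointwise 1x1). Channel/map sizes are made up.
def conv_params(c_in, c_out, k):
    # Standard convolution: one k x k filter per (input, output) channel pair.
    return c_in * c_out * k * k

def dw_separable_params(c_in, c_out, k):
    # Depthwise: one k x k filter per input channel; pointwise: c_in x c_out 1x1.
    return c_in * k * k + c_in * c_out

def conv_macs(c_in, c_out, k, h, w):
    # Multiply-accumulates for a stride-1 convolution over an h x w feature map.
    return conv_params(c_in, c_out, k) * h * w

def dw_separable_macs(c_in, c_out, k, h, w):
    return dw_separable_params(c_in, c_out, k) * h * w

# Example: 3x3 conv, 128 -> 128 channels, on a 28 x 28 feature map.
std = conv_macs(128, 128, 3, 28, 28)
sep = dw_separable_macs(128, 128, 3, 28, 28)
print(std, sep, round(std / sep, 1))  # the factorization is ~8.4x cheaper here
```

The FLOPs saving is real, but as the paper argues, it only translates into wall-clock speed when the runtime implements depthwise convolution efficiently.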
Meanwhile, there are few studies that combine efficient models with fast object detection algorithms [6]. This research explores the design of an efficient CNN architecture for both image classification and object detection tasks. Our major contributions are as follows:\nWe propose a variant of the DenseNet [7] architecture called PeleeNet for mobile devices. PeleeNet follows the connectivity pattern and some of the key design principles of DenseNet, and is designed to meet strict constraints on memory and computational budget. Experimental results on the Stanford Dogs [8] dataset show that our proposed PeleeNet is 5.05% higher in accuracy than a network built with the original DenseNet architecture and 6.53% higher than MobileNet [1]. PeleeNet achieves a compelling result on ImageNet ILSVRC 2012 [9] as well. The top-1 accuracy of PeleeNet is 72.1%, which is higher than that of MobileNet by 1.6%. It is also important to point out that PeleeNet is only 66% of the model size of MobileNet.\n\n*Contact author\n2The code and models are available at: https://github.com/Robert-JunWang/Pelee\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nSome of the key features of PeleeNet are:\n\n• Two-Way Dense Layer Motivated by GoogLeNet [5], we use a 2-way dense layer to capture different scales of receptive fields. One way of the layer uses a 3x3 kernel; the other uses two stacked 3x3 convolutions to learn visual patterns of large objects. The structure is shown in Fig. 1.\n\n(a) original dense layer\n\n(b) 2-way dense layer\n\nFigure 1: Structure of 2-way dense layer\n\n• Stem Block Motivated by Inception-v4 [10] and DSOD [11], we design a cost-efficient stem block before the first dense layer. The structure of the stem block is shown in Fig. 2. 
This stem block can effectively improve feature expression ability without adding much computational cost - better than other, more expensive methods such as increasing the number of channels of the first convolution layer or increasing the growth rate.\n\nFigure 2: Structure of stem block\n\n• Dynamic Number of Channels in Bottleneck Layer Another highlight is that the number of channels in the bottleneck layer varies according to the input shape, instead of being fixed at 4 times the growth rate as in the original DenseNet. In DenseNet, we observe that for the first several dense layers, the number of bottleneck channels is much larger than the number of input channels, which means that for these layers the bottleneck layer increases the computational cost instead of reducing it. To maintain the consistency of the architecture, we still add the bottleneck layer to all dense layers, but its width is dynamically adjusted according to the input shape, to ensure that the number of bottleneck channels does not exceed the number of input channels. Compared to the original DenseNet structure, our experiments show that this method can save up to 28.5% of the computational cost with a small impact on accuracy (Fig. 3).\n\n• Transition Layer without Compression Our experiments show that the compression factor proposed by DenseNet hurts the feature expression. 
We always keep the number of output channels the same as the number of input channels in transition layers.\n\n(a) Dense layer with bottleneck\n\n(b) Computational cost of the first 4 dense layers\n\nFigure 3: Dynamic number of channels in bottleneck layer\n\n• Composite Function To improve actual speed, we use the conventional wisdom of “post-activation” (Convolution - Batch Normalization [12] - ReLU) as our composite function instead of the pre-activation used in DenseNet. With post-activation, all batch normalization layers can be merged with the convolution layers at the inference stage, which greatly accelerates inference. To compensate for the negative impact on accuracy caused by this change, we use a shallow and wide network structure. We also add a 1x1 convolution layer after the last dense block to obtain stronger representational ability.\n\nWe optimize the network architecture of Single Shot MultiBox Detector (SSD) [13] for speed and then combine it with PeleeNet. Our proposed system, named Pelee, achieves 76.4% mAP on PASCAL VOC [14] 2007 and 22.4 mAP on COCO. It outperforms YOLOv2 [15] in terms of accuracy, speed and model size. 
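Returning to the Composite Function point above: with post-activation ordering, each batch normalization layer can be folded into the preceding convolution at inference time, so no BN arithmetic remains at run time. A minimal per-channel sketch of this folding (plain Python with made-up values, one scalar weight standing in for a filter):

```python
import math

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    # BN(conv(x)) = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
    #             = (gamma / sqrt(var + eps)) * w * x
    #               + gamma * (b - mean) / sqrt(var + eps) + beta
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

# Check: conv followed by BN equals the single folded conv, for any input x.
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.8
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
x = 2.0
bn_out = gamma * ((w * x + b) - mean) / math.sqrt(var + 1e-5) + beta
assert abs((w_f * x + b_f) - bn_out) < 1e-9
```

With pre-activation (BN before convolution), the normalization statistics cannot be absorbed this way, which is why the post-activation ordering pays off at inference.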
The major enhancements proposed to balance speed and accuracy are:\n\n• Feature Map Selection We build the object detection network in a way different from the original SSD, with a carefully selected set of 5 scale feature maps (19 x 19, 10 x 10, 5 x 5, 3 x 3, and 1 x 1). To reduce computational cost, we do not use the 38 x 38 feature map.\n\n• Residual Prediction Block We follow the design ideas proposed by [16] that encourage features to be passed along the feature extraction network. For each feature map used for detection, we build a residual [17] block (ResBlock) before conducting prediction. The structure of the ResBlock is shown in Fig. 4.\n\n• Small Convolutional Kernel for Prediction The residual prediction block makes it possible to apply 1x1 convolutional kernels to predict category scores and box offsets. Our experiments show that the accuracy of the model using 1x1 kernels is almost the same as that of the model using 3x3 kernels. However, 1x1 kernels reduce the computational cost by 21.5%.\n\n(a) ResBlock\n\n(b) Network of Pelee\n\nFigure 4: Residual prediction block\n\nWe provide a benchmark test of different efficient classification models and different one-stage object detection methods on the NVIDIA TX2 embedded platform and the iPhone 8.\n\n2 PeleeNet: An Efficient Feature Extraction Network\n\n2.1 Architecture\n\nThe architecture of our proposed PeleeNet is shown in Table 1. The entire network consists of a stem block and four stages of feature extractor. Except for the last stage, the last layer in each stage is an average pooling layer with stride 2. A four-stage structure is commonly used in large model design. ShuffleNet [2] uses a three-stage structure and shrinks the feature map size at the beginning of each stage. 
Although this can effectively reduce computational cost, we argue that early-stage features are very important for vision tasks, and that prematurely reducing the feature map size can impair representational ability. Therefore, we still maintain a four-stage structure. The number of layers in the first two stages is specifically kept within an acceptable range.\n\nTable 1: Overview of PeleeNet architecture\n\nStage | Layer | Output Shape\nInput | - | 224 x 224 x 3\nStage 0 | Stem Block | 56 x 56 x 32\nStage 1 | Dense Block (DenseLayer x 3); Transition Layer (1 x 1 conv, stride 1; 2 x 2 average pool, stride 2) | 28 x 28 x 128\nStage 2 | Dense Block (DenseLayer x 4); Transition Layer (1 x 1 conv, stride 1; 2 x 2 average pool, stride 2) | 14 x 14 x 256\nStage 3 | Dense Block (DenseLayer x 8); Transition Layer (1 x 1 conv, stride 1; 2 x 2 average pool, stride 2) | 7 x 7 x 512\nStage 4 | Dense Block (DenseLayer x 6); Transition Layer (1 x 1 conv, stride 1) | 7 x 7 x 704\nClassification Layer | 7 x 7 global average pool; 1000D fully-connected, softmax | 1 x 1 x 704\n\n2.2 Ablation Study\n\n2.2.1 Dataset\n\nWe build a customized Stanford Dogs dataset for the ablation study. The Stanford Dogs [8] dataset contains images of 120 breeds of dogs from around the world. It was built using images and annotations from ImageNet for the task of fine-grained image classification. We believe a dataset for this kind of task is complicated enough to evaluate the performance of a network architecture. However, there are only 14,580 training images, about 120 per class, in the original Stanford Dogs dataset, which is not large enough to train a model from scratch. Instead of using the original Stanford Dogs, we build a subset of ILSVRC 2012 according to the ImageNet wnids used in Stanford Dogs. Both training data and validation data are exactly copied from the ILSVRC 2012 dataset. 
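The subset construction described above amounts to filtering ILSVRC 2012 down to the 120 WordNet IDs (wnids) that Stanford Dogs uses. A toy sketch of that filter (the sample list and wnids here are illustrative, not the actual dataset files):

```python
def build_subset(samples, dog_wnids):
    # samples: (wnid, image_path) pairs from ILSVRC 2012.
    # Keep only images whose class wnid appears in the Stanford Dogs wnid list.
    keep = set(dog_wnids)
    return [(wnid, path) for wnid, path in samples if wnid in keep]

# Toy example with placeholder entries.
samples = [('n02085620', 'img1.jpg'),   # a dog class: kept
           ('n01440764', 'img2.jpg'),   # a non-dog class: dropped
           ('n02085936', 'img3.jpg')]   # a dog class: kept
subset = build_subset(samples, ['n02085620', 'n02085936'])
print(len(subset))  # 2
```

Applied to the real ILSVRC 2012 splits, this selection yields the 150,466 training and 6,000 validation images reported above.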
In the following sections, the term Stanford Dogs refers to this subset of ILSVRC 2012 rather than the original dataset. Contents of this dataset:\n\n• Number of categories: 120\n• Number of training images: 150,466\n• Number of validation images: 6,000\n\n2.2.2 Effects of Various Design Choices on the Performance\n\nWe build a DenseNet-like network called DenseNet-41 as our baseline model. There are two differences between this model and the original DenseNet. The first is the parameters of the first conv layer: there are 24 channels in the first conv layer instead of 64, and the kernel size is changed from 7 x 7 to 3 x 3. The second is that the number of layers in each dense block is adjusted to meet the computational budget.\nAll models in this section are trained with PyTorch with mini-batch size 256 for 120 epochs. We follow most of the training settings and hyper-parameters used by ResNet on ILSVRC 2012. Table 2 shows the effects of various design choices on performance. After combining all these design choices, PeleeNet achieves 79.25% accuracy on Stanford Dogs, 4.23% higher than DenseNet-41 at less computational cost.\n\nTable 2: Effects of various design choices and components on performance, from DenseNet-41 to PeleeNet (each row adds one design choice on top of the previous configuration)\n\nConfiguration | Top-1 accuracy (%)\nDenseNet-41 (baseline) | 75.02\n+ Transition layer without compression | 76.1\n+ Post-activation | 75.2\n+ Dynamic bottleneck channels | 75.8\n+ Stem block | 76.8\n+ Two-way dense layer | 78.8\n+ Go deeper (add 3 extra dense layers) | 79.25\n\n2.3 Results on ImageNet ILSVRC 2012\n\nOur PeleeNet is trained with PyTorch with mini-batch size 512 on two GPUs. The model is trained with a cosine learning rate annealing schedule, similar to what is used by [18] and [19]. 
The initial learning rate is set to 0.18 and the total number of epochs is 120. We then fine-tune the model with an initial learning rate of 5e-3 for 20 epochs. Other hyper-parameters are the same as those used on the Stanford Dogs dataset.\nCosine learning rate annealing means that the learning rate decays with a cosine shape: the learning rate of epoch t (t <= 120) is set to 0.5 * lr * (cos(π * t / 120) + 1).\nAs can be seen from Table 3, PeleeNet achieves a higher accuracy than MobileNet and ShuffleNet at no more than 66% of the model size and a lower computational cost. The model size of PeleeNet is only 1/49 that of VGG16.\n\nTable 3: Results on ImageNet ILSVRC 2012\n\nModel | Computational Cost (FLOPs) | Model Size (Parameters) | Top-1 Accuracy (%) | Top-5 Accuracy (%)\nVGG16 | 15,346 M | 138 M | 71.5 | 89.8\n1.0 MobileNet | 569 M | 4.24 M | 70.6 | 89.5\nShuffleNet 2x (g = 3) | 524 M | 5.2 M | 70.9 | -\nNASNet-A | 564 M | 5.3 M | 74.0 | 91.6\nPeleeNet (ours) | 508 M | 2.8 M | 72.1 | 90.6\n\n2.4 Speed on Real Devices\n\nCounting FLOPs (the number of multiply-accumulates) is widely used to measure computational cost. However, it cannot replace a speed test on real devices, considering that many other factors may influence the actual time cost, e.g. caching, I/O, and hardware optimization. This section evaluates the performance of efficient models on the iPhone 8 and the NVIDIA TX2 embedded platform. 
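The cosine annealing schedule described in Section 2.3 can be written as a small helper (a sketch; 0.18 and 120 epochs are the paper's settings):

```python
import math

def cosine_lr(epoch, base_lr=0.18, total_epochs=120):
    # lr(t) = 0.5 * base_lr * (cos(pi * t / total_epochs) + 1), for t <= total_epochs
    return 0.5 * base_lr * (math.cos(math.pi * epoch / total_epochs) + 1)

print(cosine_lr(0))    # 0.18 - starts at the base rate
print(cosine_lr(60))   # 0.09 - half the base rate at the half-way point
print(cosine_lr(120))  # 0.0  - decays to zero at the final epoch
```

Unlike step decay, the rate falls smoothly over the whole run, which is the behaviour [18] and [19] report as helpful for training.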
The speed is calculated as the average time of processing 100 pictures with batch size 1. We run the 100-picture benchmark 10 times separately and average the results.\n\nAs can be seen in Table 4, PeleeNet is much faster than MobileNet and MobileNetV2 on TX2. Although MobileNetV2 achieves a high accuracy with 300 M FLOPs, the actual speed of the model is slower than that of MobileNet with 569 M FLOPs.\nUsing half-precision floating point (FP16) instead of single-precision floating point (FP32) is a widely used method to accelerate deep learning inference. As can be seen in Figure 5, PeleeNet runs 1.8 times faster in FP16 mode than in FP32 mode. In contrast, networks built with depthwise separable convolution can hardly benefit from the TX2 half-precision (FP16) inference engine: the speed of MobileNet and MobileNetV2 in FP16 mode is almost the same as in FP32 mode.\nOn iPhone 8, PeleeNet is slower than MobileNet for the small input dimension but faster for the large input dimension. There are two possible reasons for the unfavorable result on iPhone. The first is related to CoreML, which is built on Apple's Metal API. Metal is a 3D graphics API and was not originally designed for CNNs. It can only hold 4 channels of data (originally used to hold RGBA data), so the high-level API has to slice the channels by 4 and cache the result of each slice. Separable convolution benefits more from this mechanism than conventional convolution. The second reason is the architecture of PeleeNet, which is built in a multi-branch, narrow-channel style with 113 convolution layers. 
Our original design was misled by the FLOPs count and involves unnecessary complexity.\n\nTable 4: Speed on NVIDIA TX2 (the larger the better). The benchmark tool is built with the NVIDIA TensorRT 4.0 library.\n\nModel | Top-1 Accuracy on ILSVRC2012 @224x224 | FLOPs @224x224 | Speed @224x224 (images/s) | Speed @320x320 (images/s) | Speed @640x640 (images/s)\n1.0 MobileNet | 70.6 | 569 M | 136.2 | 75.7 | 22.4\n1.0 MobileNetV2 | 72.0 | 300 M | 123.1 | 68.8 | 21.6\nShuffleNet 2x (g = 3) | 70.9 | 524 M | 110 | 65.3 | 19.8\nPeleeNet (ours) | 72.1 | 508 M | 240.3 | 129.1 | 37.2\n\n(a) Speed and accuracy on FP16 mode\n\n(b) FP32 vs FP16 by 224x224 dimension\n\nFigure 5: Speed on NVIDIA TX2\n\nTable 5: Speed on iPhone 8 (the larger the better). The benchmark tool is built with the CoreML library.\n\nModel | Top-1 Accuracy on ILSVRC2012 @224x224 | FLOPs @224x224 | Speed @224x224 (images/s) | Speed @320x320 (images/s)\n1.0 MobileNet | 70.6 | 569 M | 27.7 | 20.3\nPeleeNet (ours) | 72.1 | 508 M | 26.1 | 22.8\n\n3 Pelee: A Real-Time Object Detection System\n\n3.1 Overview\n\nThis section introduces our object detection system and the optimization of SSD. The main purpose of our optimization is to improve speed with acceptable accuracy. Besides using the efficient feature extraction network proposed in the last section, we also build the object detection network in a way different from the original SSD, with a carefully selected set of 5 scale feature maps. In the meantime, for each feature map used for detection, we build a residual block before conducting prediction (Fig. 4). We also use small convolutional kernels to predict object categories and bounding box locations to reduce computational cost. In addition, we use quite different training hyperparameters. Although these contributions may seem small independently, we note that the final system achieves 70.9% mAP on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset. 
The result on COCO outperforms YOLOv2 with higher precision, 13.6 times lower computational cost and 11.3 times smaller model size.\nThere are 5 scales of feature maps used in our system for prediction: 19 x 19, 10 x 10, 5 x 5, 3 x 3, and 1 x 1. We do not use the 38 x 38 feature map layer, in order to strike a balance between speed and accuracy. The 19 x 19 feature map is combined with two different scales of default boxes, and each of the other 4 feature maps is combined with one scale of default box. Huang et al. [6] also do not use the 38 x 38 scale feature map when combining SSD with MobileNet. However, they add another 2 x 2 feature map to keep 6 scales of feature maps for prediction, which differs from our solution.\n\nTable 6: Scale of feature map and default box (feature map : default box scale)\n\nOriginal SSD | 38x38:30 | 19x19:60 | 10x10:110 | 5x5:162 | 3x3:213 | 1x1:264\nSSD + MobileNet [6] | 19x19:60 | 10x10:105 | 5x5:150 | 3x3:195 | 2x2:240 | 1x1:285\nPelee (ours) | 19x19:30.4 & 60.8 | 10x10:112.5 | 5x5:164.2 | 3x3:215.8 | 1x1:267.4\n\n3.2 Results on VOC 2007\n\nOur object detection system is based on the source code of SSD3 and is trained with Caffe [20]. The batch size is set to 32. The learning rate is set to 0.005 initially and is then decreased by a factor of 10 at 80k and 100k iterations, respectively. The total number of iterations is 120k.\n\n3.2.1 Effects of Various Design Choices\n\nTable 7 shows the effects of our design choices on performance. We can see that the residual prediction block effectively improves accuracy: the model with the residual prediction block achieves 2.2% higher accuracy than the model without it. The accuracy of the model using 1x1 kernels for prediction is almost the same as that of the model using 3x3 kernels. 
However, 1x1 kernels reduce the computational cost by 21.5% and the model size by 33.9%.\n\n3https://github.com/weiliu89/caffe/tree/ssd\n\nTable 7: Effects of various design choices on performance\n\n38x38 Feature Map | ResBlock | Kernel Size for Prediction | FLOPs | Model Size (Parameters) | mAP (%)\nyes | no | 3x3 | 1,670 M | 5.69 M | 69.3\nno | no | 3x3 | 1,340 M | 5.63 M | 68.6\nno | yes | 3x3 | 1,470 M | 7.27 M | 70.8\nno | yes | 1x1 | 1,210 M | 5.43 M | 70.9\n\n3.2.2 Comparison with Other Frameworks\n\nAs can be seen from Table 8, the accuracy of Pelee is higher than that of Tiny-YOLOv2 by 13.8% and higher than that of SSD+MobileNet [6] by 2.9%. It is even higher than that of YOLOv2-288 at only 14.5% of the computational cost of YOLOv2-288. Pelee achieves 76.4% mAP when we take the model trained on COCO trainval35k as described in Section 3.3 and fine-tune it on the 07+12 dataset.\n\nTable 8: Results on PASCAL VOC 2007. Data: ”07+12”: union of VOC2007 and VOC2012 trainval. ”07+12+COCO”: first train on COCO trainval35k, then fine-tune on 07+12\n\nModel | Input Dimension | FLOPs | Model Size (Parameters) | Data | mAP (%)\nYOLOv2 | 288x288 | 8,360 M | 67.13 M | 07+12 | 69.0\nTiny-YOLOv2 | 416x416 | 3,490 M | 15.86 M | 07+12 | 57.1\nSSD+MobileNet | 300x300 | 1,150 M | 5.77 M | 07+12 | 68.0\nPelee (ours) | 304x304 | 1,210 M | 5.43 M | 07+12 | 70.9\nSSD+MobileNet | 300x300 | 1,150 M | 5.77 M | 07+12+COCO | 72.7\nPelee (ours) | 304x304 | 1,210 M | 5.43 M | 07+12+COCO | 76.4\n\n3.2.3 Speed on Real Devices\n\nWe then evaluate the actual inference speed of Pelee on real devices. The speed is calculated as the average time over 100 images processed by the benchmark tool. This time includes image pre-processing, but it does not include the post-processing part (decoding the bounding boxes and performing non-maximum suppression). 
Usually, post-processing is done on the CPU, which can be executed asynchronously with the other parts that run on the mobile GPU. Hence, the actual speed should be very close to our test result.\nAlthough the residual prediction block used in Pelee increases the computational cost, Pelee still runs faster than SSD+MobileNet on iPhone and on TX2 in FP32 mode. As can be seen from Table 9, Pelee has an even greater speed advantage over SSD+MobileNet and SSDLite+MobileNetV2 in FP16 mode.\n\nTable 9: Speed on real devices\n\nModel | Input Dimension | FLOPs | iPhone 8 (FPS) | TX2 FP16 (FPS) | TX2 FP32 (FPS)\nSSD+MobileNet | 300x300 | 1,200 M | 22.8 | 82 | 73\nSSDLite+MobileNetV2 | 320x320 | 805 M | - | 62 | 60\nPelee (ours) | 304x304 | 1,290 M | 23.6 | 125 | 77\n\n3.3 Results on COCO\n\nWe further validate Pelee on the COCO dataset. The models are trained on the COCO train+val dataset excluding 5,000 minival images and evaluated on the test-dev2015 set. The batch size is set to 128. We first train the model with a learning rate of 10^-2 for 70k iterations, and then continue training for 10k iterations with 10^-3 and 20k iterations with 10^-4.\nTable 10 shows the results on test-dev2015. Pelee is not only more accurate than SSD+MobileNet [6], but also more accurate than YOLOv2 [15] in both mAP@[0.5:0.95] and mAP@0.75. Meanwhile, Pelee is 3.7 times faster and 11.3 times smaller in model size than YOLOv2.\n\nTable 10: Results on COCO test-dev2015\n\nModel | Input Dimension | Speed on TX2 (FPS) | Model Size (Parameters) | Avg. Precision (%), IoU 0.5:0.95 | Avg. Precision (%), IoU 0.5 | Avg. Precision (%), IoU 0.75\nOriginal SSD | 300x300 | - | 34.30 M | 25.1 | 43.1 | 25.8\nYOLOv2 | 416x416 | 32.2 | 67.43 M | 21.6 | 44.0 | 19.2\nYOLOv3 | 320x320 | 21.5 | 62.3 M | - | 51.5 | -\nYOLOv3-Tiny | 416x416 | 105 | 12.3 M | - | 33.1 | -\nSSD+MobileNet | 300x300 | 80 | 6.80 M | 18.8 | - | -\nSSDlite+MobileNetV2 | 320x320 | 61 | 4.3 M | 22 | - | -\nPelee (ours) | 304x304 | 120 | 5.98 M | 22.4 | 38.3 | 22.9\n\n4 Conclusion\n\nDepthwise separable convolution is not the only way to build an efficient model. Instead of using depthwise separable convolution, our proposed PeleeNet and Pelee are built with conventional convolution and achieve compelling results on ILSVRC 2012, VOC 2007 and COCO.\nBy combining efficient architecture design with mobile GPUs and hardware-specific optimized runtime libraries, we can perform real-time prediction for image classification and object detection tasks on mobile devices. For example, Pelee, our proposed object detection system, runs at 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2 with high accuracy.\n\nReferences\n[1] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.\n\n[2] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. arXiv preprint arXiv:1707.01083, 2017.\n\n[3] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012, 2017.\n\n[4] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. 
In Proceedings of the IEEE Conference\non Computer Vision and Pattern Recognition, pages 4510\u20134520, 2018.\n\n[5] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov,\nDumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions.\nIn The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.\n\n[6] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi,\nIan Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs\nfor modern convolutional object detectors. arXiv preprint arXiv:1611.10012, 2016.\n\n[7] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected\n\nconvolutional networks. arXiv preprint arXiv:1608.06993, 2016.\n\n9\n\n\f[8] Aditya Khosla, Nityananda Jayadevaprakash, Bangpeng Yao, and Fei-Fei Li. Novel dataset for\n\ufb01ne-grained image categorization: Stanford dogs. In Proc. CVPR Workshop on Fine-Grained\nVisual Categorization (FGVC), volume 2, page 1, 2011.\n\n[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale\nhierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009.\nIEEE Conference on, pages 248\u2013255. IEEE, 2009.\n\n[10] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4,\ninception-resnet and the impact of residual connections on learning. In AAAI, pages 4278\u20134284,\n2017.\n\n[11] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue.\nDsod: Learning deeply supervised object detectors from scratch. In The IEEE International\nConference on Computer Vision (ICCV), volume 3, page 7, 2017.\n\n[12] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training\nby reducing internal covariate shift. 
In International Conference on Machine Learning, pages\n448\u2013456, 2015.\n\n[13] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang\nFu, and Alexander C Berg. Ssd: Single shot multibox detector. In European conference on\ncomputer vision, pages 21\u201337. Springer, 2016.\n\n[14] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman.\nThe pascal visual object classes (voc) challenge. International journal of computer vision,\n88(2):303\u2013338, 2010.\n\n[15] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger.\n\narXiv:1612.08242, 2016.\n\narXiv preprint\n\n[16] Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong, and Nojun Kwak. Residual features and uni\ufb01ed\n\nprediction network for single stage detection. arXiv preprint arXiv:1707.05031, 2017.\n\n[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image\nrecognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,\npages 770\u2013778, 2016.\n\n[18] Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens van der Maaten, and Kilian Q\nWeinberger. Memory-ef\ufb01cient implementation of densenets. arXiv preprint arXiv:1707.06990,\n2017.\n\n[19] Ilya Loshchilov and Frank Hutter. Sgdr: stochastic gradient descent with restarts. arXiv preprint\n\narXiv:1608.03983, 2016.\n\n[20] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick,\nSergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature\nembedding. In Proceedings of the 22nd ACM international conference on Multimedia, pages\n675\u2013678. 
ACM, 2014.\n\n10\n\n\f", "award": [], "sourceid": 988, "authors": [{"given_name": "Robert", "family_name": "Wang", "institution": "Aeryon Labs"}, {"given_name": "Xiang", "family_name": "Li", "institution": "Western University"}, {"given_name": "Charles", "family_name": "Ling", "institution": "University of Western Ontario"}]}