{"title": "Dual Path Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4467, "page_last": 4475, "abstract": "In this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new features exploration which are both important for learning good representations. To enjoy the benefits from both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImagNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over state-of-the-arts. In particular, on the ImagNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64x4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with about 2 times faster training speed. 
Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.", "full_text": "Dual Path Networks\n\nYunpeng Chen1, Jianan Li1,2, Huaxin Xiao1,3, Xiaojie Jin1, Shuicheng Yan4,1, Jiashi Feng1\n\n1National University of Singapore\n2Beijing Institute of Technology\n3National University of Defense Technology\n4Qihoo 360 AI Institute\n\nAbstract\n\nIn this work, we present a simple, highly efficient and modularized Dual Path Network (DPN) for image classification which presents a new topology of connection paths internally. By revealing the equivalence of the state-of-the-art Residual Network (ResNet) and Densely Convolutional Network (DenseNet) within the HORNN framework, we find that ResNet enables feature re-usage while DenseNet enables new feature exploration, both of which are important for learning good representations. To enjoy the benefits of both path topologies, our proposed Dual Path Network shares common features while maintaining the flexibility to explore new features through dual path architectures. Extensive experiments on three benchmark datasets, ImageNet-1k, Places365 and PASCAL VOC, clearly demonstrate superior performance of the proposed DPN over the state-of-the-art. In particular, on the ImageNet-1k dataset, a shallow DPN surpasses the best ResNeXt-101(64 × 4d) with 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN (DPN-131) further pushes the state-of-the-art single model performance with about 2 times faster training speed. 
Experiments on the Places365 large-scale scene dataset, PASCAL VOC detection dataset, and PASCAL VOC segmentation dataset also demonstrate its consistently better performance than DenseNet, ResNet and the latest ResNeXt model over various applications.\n\n1 Introduction\n\n"Network engineering" is increasingly important for visual recognition research. In this paper, we aim to develop a new path topology for deep architectures to further push the frontier of representation learning. In particular, we focus on analyzing and reforming the skip connection, which has been widely used in designing modern deep neural networks and offers remarkable success in many applications [16, 7, 20, 14, 5]. A skip connection creates a path propagating information from a lower layer directly to a higher layer. During the forward propagation, a skip connection enables a very top layer to access information from a distant bottom layer; for the backward propagation, it facilitates gradient back-propagation to the bottom layer without diminishing magnitude, which effectively alleviates the gradient vanishing problem and eases the optimization.\n\nDeep Residual Network (ResNet) [5] is one of the first works that successfully adopt skip connections, where each micro-block, a.k.a. residual function, is associated with a skip connection, called the residual path. The residual path adds the input features element-wise to the output of the same micro-block, making it a residual unit. Depending on the inner structure design of the micro-block, the residual network has developed into a family of various architectures, including WRN [22], Inception-resnet [20], and ResNeXt [21].\n\nMore recently, Huang et al. [8] proposed a different network architecture that achieves comparable accuracy with deep ResNet [5], named Dense Convolutional Network (DenseNet). 
Different from residual networks, which add the input features to the output features through the residual path, the DenseNet uses a densely connected path to concatenate the input features with the output features, enabling each micro-block to receive raw information from all previous micro-blocks. Similar to the residual network family, DenseNet can be categorized into the densely connected network family. Although the width of the densely connected path increases linearly as it goes deeper, causing the number of parameters to grow quadratically, DenseNet provides higher parameter efficiency compared with the ResNet [5].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nIn this work, we aim to study the advantages and limitations of both topologies and further enrich the path design by proposing a dual path architecture. In particular, we first provide a new understanding of the densely connected networks from the lens of a higher order recurrent neural network (HORNN) [19], and explore the relations between densely connected networks and residual networks. More specifically, we bridge the densely connected networks with the HORNNs, showing that the densely connected networks are HORNNs when the weights are shared across steps. Inspired by [12], which demonstrates the relations between the residual networks and RNNs, we prove that the residual networks are densely connected networks when connections are shared across layers. With this unified view on the state-of-the-art deep architectures, we find that the deep residual networks implicitly reuse the features through the residual path, while densely connected networks keep exploring new features through the densely connected path.\n\nBased on this new view, we propose a novel dual path architecture, called the Dual Path Network (DPN). 
This new architecture inherits the advantages of both residual and densely connected paths, enabling effective feature re-usage and re-exploitation. The proposed DPN also enjoys higher parameter efficiency, lower computational cost and lower memory consumption, and is friendlier to optimization, compared with the state-of-the-art classification networks. Experimental results validate the superior accuracy of DPN over other well-established baselines for image classification on both the ImageNet-1k dataset and the Places365-Standard dataset. Additional experiments on the object detection task and the semantic segmentation task also demonstrate that the proposed dual path architecture can be broadly applied to various tasks and consistently achieves the best performance.\n\n2 Related work\n\nDesigning an advanced neural network architecture is one of the most challenging but effective ways for improving image classification performance, which can also directly benefit a variety of other tasks. AlexNet [10] and VGG [18] are two of the most important works that show the power of deep convolutional neural networks. They demonstrate that building deeper networks with tiny convolutional kernels is a promising way to increase the learning capacity of a neural network. Residual Network was first proposed by He et al. [5]; it greatly alleviates the optimization difficulty and further pushes the depth of deep neural networks to hundreds of layers by using skip connections. Since then, different kinds of residual networks have arisen, concentrating on either building a more efficient micro-block inner structure [3, 21] or exploring how to use residual connections [9]. Recently, Huang et al. 
[8] proposed a different network, called Dense Convolutional Network, where skip connections are used to concatenate the input to the output instead of adding them. However, the width of the densely connected path increases linearly as the depth rises, causing the number of parameters to grow quadratically and costing a large amount of GPU memory compared with the residual networks if the implementation is not specifically optimized. This limits the building of a deeper and wider DenseNet that might further improve the accuracy.\n\nBesides designing new architectures, researchers have also tried to re-explore the existing state-of-the-art architectures. In [6], the authors showed the importance of the residual path in alleviating the optimization difficulty. In [12], the residual networks are bridged with recurrent neural networks (RNNs), which helps people better understand the deep residual network from the perspective of RNNs. In [3], several different residual functions are unified, trying to provide a better understanding of how to design a micro structure with higher learning capacity. Still, for the densely connected networks, apart from several intuitive explanations of better feature reusage and efficient gradient flow, there have been few works able to provide a genuinely deeper understanding.\n\nIn this work, we provide a deeper understanding of the densely connected network from the lens of Higher Order RNNs, and explain how the residual networks are indeed a special case of densely connected networks. Based on this analysis, we then propose a novel Dual Path Network architecture that not only achieves higher accuracy, but also enjoys high parameter and computational efficiency.\n\nFigure 1: The topological relations of different types of neural networks. 
(a) and (b) show relations between residual networks and RNNs, as stated in [12]; (c) and (d) show relations between densely connected networks and higher order recurrent neural networks (HORNNs), which is explained in this paper. The symbol "$z^{-1}$" denotes a time-delay unit; "⊕" denotes element-wise summation; "$I(\cdot)$" denotes an identity mapping function.\n\n3 Revisiting ResNet, DenseNet and Higher Order RNN\n\nIn this section, we first bridge the densely connected network [8] with higher order recurrent neural networks [19] to provide a new understanding of the densely connected network. We prove that residual networks [5, 6, 22, 21, 3] essentially belong to the family of densely connected networks, except that their connections are shared across steps. Then, we present an analysis of the strengths and weaknesses of each topology, which motivates us to develop the dual path network architecture.\n\nFor exploring the above relation, we provide a new view on the densely connected networks from the lens of Higher Order RNNs, explain their relations and then specialize the analysis to residual networks. Throughout the paper, we formulate the HORNN in a more generalized form. We use $h^t$ to denote the hidden state of the recurrent neural network at the $t$-th step and use $k$ as the index of the current step. Let $x^t$ denote the input at the $t$-th step, with $h^0 = x^0$. For each step, $f_t^k(\cdot)$ refers to the feature extracting function, which takes the hidden state as input and outputs the extracted information. $g^k(\cdot)$ denotes a transformation function that transforms the gathered information to the current hidden state:\n\n$$h^k = g^k\left[ \sum_{t=0}^{k-1} f_t^k(h^t) \right]. \quad (1)$$\n\nEqn. (1) encapsulates the update rule of various network architectures in a generalized way. For HORNNs, weights are shared across steps, i.e. $\forall t,k,\ f_{k-t}^k(\cdot) \equiv f_t(\cdot)$ and $\forall k,\ g^k(\cdot) \equiv g(\cdot)$. For the densely connected networks, each step (micro-block) has its own parameters, which means $f_t^k(\cdot)$ and $g^k(\cdot)$ are not shared. This observation shows that the densely connected path of DenseNet is essentially a higher order path which is able to extract new information from previous states. Figure 1(c)(d) graphically shows the relations between densely connected networks and higher order recurrent networks.\n\nWe then explain that the residual networks are special cases of densely connected networks if taking $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$. Here, for succinctness, we introduce $r^k$ to denote the intermediate result and let $r^1 = 0$. Then Eqn. (1) can be rewritten as\n\n$$r^k \triangleq \sum_{t=1}^{k-1} f_t(h^t) = r^{k-1} + f_{k-1}(h^{k-1}), \quad (2)$$\n\n$$h^k = g^k(r^k). \quad (3)$$\n\nThus, by substituting Eqn. (3) into Eqn. (2), Eqn. (2) can be simplified as\n\n$$r^k = r^{k-1} + f_{k-1}(h^{k-1}) = r^{k-1} + f_{k-1}\big(g^{k-1}(r^{k-1})\big) = r^{k-1} + \phi_{k-1}(r^{k-1}), \quad (4)$$\n\nwhere $\phi_k(\cdot) = f_k(g^k(\cdot))$. Obviously, Eqn. (4) has the same form as the residual network and the recurrent neural network. Specifically, when $\forall k,\ \phi_k(\cdot) \equiv \phi(\cdot)$, Eqn. (4) degenerates to an RNN; when none of the $\phi_k(\cdot)$ is shared and $x^k = 0$ for $k > 1$, Eqn. (4) produces a residual network. Figure 1(a)(b) graphically shows the relation. Besides, recall that Eqn. (4) is derived from Eqn. (1) under the condition $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$, and the densely connected networks are in the form of Eqn. (1), meaning that the residual network family essentially belongs to the densely connected network family. Figure 2(a–c) give an example and demonstrate this equivalence, where $f_t(\cdot)$ corresponds to the first 1×1 convolutional layer and $g^k(\cdot)$ corresponds to the other layers within a micro-block in Figure 2(b).\n\nFrom the above analysis, we observe: 1) both residual networks and densely connected networks can be seen as a HORNN when $f_t^k(\cdot)$ and $g^k(\cdot)$ are shared for all $k$; 2) a residual network is a densely connected network if $\forall t,k,\ f_t^k(\cdot) \equiv f_t(\cdot)$. By sharing $f_t^k(\cdot)$ across all steps, $g^k(\cdot)$ receives the same feature from a given output state, which encourages feature reusage and thus reduces feature redundancy. However, such an information sharing strategy makes it difficult for residual networks to explore new features. Comparatively, the densely connected networks are able to explore new information from previous outputs since $f_t^k(\cdot)$ is not shared across steps. However, different $f_t^k(\cdot)$ may extract the same type of features multiple times, leading to high redundancy.\n\nIn the following section, we present the dual path networks, which overcome both inherent limitations of these two state-of-the-art network architectures. Their relations with HORNNs also imply that our proposed architecture can be used for improving HORNNs, which we leave for future work.\n\n4 Dual Path Networks\n\nAbove we explained the relations between residual networks and densely connected networks, showing that the residual path implicitly reuses features but is not good at exploring new features, whereas the densely connected network keeps exploring new features but suffers from higher redundancy. In this section, we describe the details of our proposed novel dual path architecture, i.e. 
the Dual Path Network (DPN). In the following, we first introduce and formulate the dual path architecture, and then present the network structure in detail with a complexity analysis.\n\n4.1 Dual Path Architecture\n\nSec. 3 discusses the advantages and limitations of both residual networks and densely connected networks. Based on the analysis, we propose a simple dual path architecture which shares $f_t^k(\cdot)$ across all blocks to enjoy the benefits of reusing common features with low redundancy, while still retaining a densely connected path that gives the network more flexibility in learning new features. We formulate such a dual path architecture as follows:\n\n$$x^k \triangleq \sum_{t=1}^{k-1} f_t^k(h^t), \quad (5)$$\n\n$$y^k \triangleq \sum_{t=1}^{k-1} v_t(h^t) = y^{k-1} + \phi_{k-1}(y^{k-1}), \quad (6)$$\n\n$$r^k \triangleq x^k + y^k, \quad (7)$$\n\n$$h^k = g^k(r^k), \quad (8)$$\n\nwhere $x^k$ and $y^k$ denote the information extracted at the $k$-th step from the individual paths, and $v_t(\cdot)$ is a feature learning function, playing the same role as $f_t^k(\cdot)$. Eqn. (5) refers to the densely connected path that enables exploring new features, and Eqn. (6) refers to the residual path that enables common feature re-usage. Eqn. (7) defines the dual path that integrates them and feeds them to the last transformation function in Eqn. (8). The final transformation function $g^k(\cdot)$ generates the current state, which is used for making the next mapping or prediction. Figure 2(d)(e) shows an example of the dual path architecture that is used in our experiments.\n\nMore generally, the proposed DPN is a family of convolutional neural networks which contain a residual-like path and a densely connected-like path, as explained later. As with these networks, one can customize the micro-block function of DPN for task-specific usage or for further overall performance boosting.\n\nFigure 2: Architecture comparison of different networks. (a) The residual network. 
(b) The densely connected network, where each layer can access the outputs of all previous micro-blocks. Here, a 1×1 convolutional layer (underlined) is added for consistency with the micro-block design in (a). (c) By sharing the first 1×1 connection of the same output across micro-blocks in (b), the densely connected network degenerates to a residual network. The dotted rectangle in (c) highlights the residual unit. (d) The proposed dual path architecture, DPN. (e) An equivalent form of (d) from the perspective of implementation, where the symbol "~" denotes a split operation and "+" denotes element-wise addition.\n\n4.2 Dual Path Networks\n\nThe proposed network is built by stacking multiple modularized micro-blocks as shown in Figure 2. In this work, the structure of each micro-block is designed in a bottleneck style [5], which starts with a 1×1 convolutional layer, followed by a 3×3 convolutional layer, and ends with a 1×1 convolutional layer. The output of the last 1×1 convolutional layer is split into two parts: the first part is element-wise added to the residual path, and the second part is concatenated with the densely connected path. To enhance the learning capacity of each micro-block, we use a grouped convolution layer as the second layer, as in ResNeXt [21].\n\nConsidering that residual networks are more widely used than densely connected networks in practice, we choose the residual network as the backbone and add a thin densely connected path to build the dual path network. This design also helps slow the width increment of the densely connected path and the cost of GPU memory. Table 1 shows the detailed architecture settings. In the table, G refers to the number of groups, and k refers to the channel increment for the densely connected path. For the newly proposed DPNs, we use (+k) to indicate the width increment of the densely connected path. 
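The split/add/concat bookkeeping of the micro-block described above can be sketched in a few lines. The following is a toy numpy illustration, not the paper's released implementation: a random linear map stands in for the 1×1 / 3×3 (grouped) / 1×1 convolutional bottleneck, and the channel numbers merely echo the conv2 stage of DPN-92 (256 residual channels, +16 increment) from Table 1.

```python
import numpy as np

def dpn_stage(x, num_blocks, res_ch, inc, rng):
    """Run `num_blocks` dual path micro-blocks over a channel vector `x`.

    Each block sees concat(residual, dense), produces res_ch + inc output
    channels, element-wise adds the first res_ch of them onto the residual
    path (feature re-usage) and concatenates the remaining `inc` onto the
    densely connected path (new feature exploration).
    """
    residual = x[:res_ch].copy()   # residual path state (fixed width)
    dense = x[res_ch:].copy()      # densely connected path state (grows by inc)
    for _ in range(num_blocks):
        inp = np.concatenate([residual, dense])
        # Toy stand-in for the 1x1 -> 3x3 (grouped) -> 1x1 bottleneck.
        w = rng.standard_normal((res_ch + inc, inp.shape[0]))
        out = np.tanh(w @ inp)
        residual = residual + out[:res_ch]             # "+" branch
        dense = np.concatenate([dense, out[res_ch:]])  # concat branch
    return residual, dense

rng = np.random.default_rng(0)
res, dense = dpn_stage(rng.standard_normal(272), num_blocks=3,
                       res_ch=256, inc=16, rng=rng)
print(res.shape, dense.shape)  # -> (256,) (64,)
```

Note how the residual path width stays fixed while the dense path widens by `inc` per block, which is exactly why a thin dense path keeps the quadratic parameter growth of a pure DenseNet in check.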
The overall design of DPN inherits the backbone architecture of the vanilla ResNet / ResNeXt, making it very easy to implement and apply to other tasks. One can implement a DPN simply by adding one "slice layer" and one "concat layer" on top of an existing residual network. On a well-optimized deep learning platform, neither of these newly added operations requires extra computational cost or extra memory consumption, making the DPNs highly efficient.\n\nIn order to demonstrate the effectiveness of the dual path architecture, we intentionally design a set of DPNs with a considerably smaller model size and fewer FLOPs compared with the state-of-the-art ResNeXts [21], as shown in Table 1. Due to limited computational resources, we set these hyper-parameters based on our previous experience instead of grid search experiments.\n\nModel complexity. We measure the model complexity by counting the total number of learnable parameters within each neural network. Table 1 shows the results for different models. DPN-92 costs about 15% fewer parameters than ResNeXt-101 (32 × 4d), while DPN-98 costs about 26% fewer parameters than ResNeXt-101 (64 × 4d).\n\nComputational complexity. We measure the computational cost of each deep neural network in floating-point operations (FLOPs), counted as multiply-adds with an input size of 224 × 224, following [21]. Table 1 shows the theoretical computational cost. Though the actual time cost may be influenced by other factors, e.g. GPU bandwidth and code quality, the computational cost gives an upper bound on the achievable speed. 
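To make the multiply-add counting convention concrete, here is a small illustrative counter for one hypothetical bottleneck. It counts convolution weights only (batch normalization, biases and stage transitions are ignored), and the channel numbers are borrowed from the conv2 stage of DPN-92 in Table 1; it is a sketch of the counting style, not the paper's exact accounting.

```python
def conv_params(c_in, c_out, k, groups=1):
    # Learnable weights of a k x k convolution; grouping divides the
    # per-filter input channels, which is where grouped convolutions save.
    return k * k * (c_in // groups) * c_out

def conv_flops(c_in, c_out, k, h, w, groups=1):
    # Multiply-adds at an h x w output resolution, following the counting
    # convention of [21]: one multiply-add per weight per output position.
    return conv_params(c_in, c_out, k, groups) * h * w

# One illustrative DPN bottleneck at 56 x 56, fed with 256 + 16 channels:
# 1x1 -> 96, then 3x3 (G=32) -> 96, then 1x1 -> 256 + 16.
params = (conv_params(272, 96, 1)
          + conv_params(96, 96, 3, groups=32)
          + conv_params(96, 272, 1))
flops = (conv_flops(272, 96, 1, 56, 56)
         + conv_flops(96, 96, 3, 56, 56, groups=32)
         + conv_flops(96, 272, 1, 56, 56))
print(params, flops)
```

In this toy count the grouped 3×3 layer contributes only a few percent of the block's parameters, which illustrates why grouped bottlenecks can be made wide at modest cost.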
As can be seen from the results, DPN-92 consumes about 19% fewer FLOPs than ResNeXt-101 (32 × 4d), and DPN-98 consumes about 25% fewer FLOPs than ResNeXt-101 (64 × 4d).\n\nTable 1: Architecture and complexity comparison of our proposed Dual Path Networks (DPNs) and other state-of-the-art networks. We compare DPNs with two baseline methods: DenseNet [8] and ResNeXt [21]. The symbol (+k) denotes the width increment on the densely connected path.\n\n| stage | output | DenseNet-161 (k=48) | ResNeXt-101 (32×4d) | ResNeXt-101 (64×4d) | DPN-92 (32×3d) | DPN-98 (40×4d) |\n| --- | --- | --- | --- | --- | --- | --- |\n| conv1 | 112×112 | 7×7, 96, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 64, stride 2 | 7×7, 96, stride 2 |\n| conv2 | 56×56 | 3×3 max pool, stride 2; [1×1, 192; 3×3, 48] × 6 | 3×3 max pool, stride 2; [1×1, 128; 3×3, 128, G=32; 1×1, 256] × 3 | 3×3 max pool, stride 2; [1×1, 256; 3×3, 256, G=64; 1×1, 256] × 3 | 3×3 max pool, stride 2; [1×1, 96; 3×3, 96, G=32; 1×1, 256 (+16)] × 3 | 3×3 max pool, stride 2; [1×1, 160; 3×3, 160, G=40; 1×1, 256 (+16)] × 3 |\n| conv3 | 28×28 | [1×1, 192; 3×3, 48] × 12 | [1×1, 256; 3×3, 256, G=32; 1×1, 512] × 4 | [1×1, 512; 3×3, 512, G=64; 1×1, 512] × 4 | [1×1, 192; 3×3, 192, G=32; 1×1, 512 (+32)] × 4 | [1×1, 320; 3×3, 320, G=40; 1×1, 512 (+32)] × 6 |\n| conv4 | 14×14 | [1×1, 192; 3×3, 48] × 36 | [1×1, 512; 3×3, 512, G=32; 1×1, 1024] × 23 | [1×1, 1024; 3×3, 1024, G=64; 1×1, 1024] × 23 | [1×1, 384; 3×3, 384, G=32; 1×1, 1024 (+24)] × 20 | [1×1, 640; 3×3, 640, G=40; 1×1, 1024 (+32)] × 20 |\n| conv5 | 7×7 | [1×1, 192; 3×3, 48] × 24 | [1×1, 1024; 3×3, 1024, G=32; 1×1, 2048] × 3 | [1×1, 2048; 3×3, 2048, G=64; 1×1, 2048] × 3 | [1×1, 768; 3×3, 768, G=32; 1×1, 2048 (+128)] × 3 | [1×1, 1280; 3×3, 1280, G=40; 1×1, 2048 (+128)] × 3 |\n| | 1×1 | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax | global average pool; 1000-d fc, softmax |\n| # params | | 28.9 × 10^6 | 44.3 × 10^6 | 83.7 × 10^6 | 37.8 × 10^6 | 61.7 × 10^6 |\n| FLOPs | | 7.7 × 10^9 | 8.0 × 10^9 | 15.5 × 10^9 | 6.5 × 10^9 | 11.7 × 10^9 |\n\n5 Experiments\n\nExtensive experiments are conducted for evaluating the proposed Dual Path Networks. 
Specifically, we evaluate the proposed architecture on three tasks: image classification, object detection and semantic segmentation, using three standard benchmark datasets: the ImageNet-1k dataset, the Places365-Standard dataset and the PASCAL VOC datasets.\n\nKey properties of the proposed DPNs are studied on the ImageNet-1k object classification dataset [17] and further verified on the Places365-Standard scene understanding dataset [24]. To verify whether the proposed DPNs can benefit tasks other than image classification, we further conduct experiments on the PASCAL VOC dataset [4] to evaluate their performance in object detection and semantic segmentation.\n\n5.1 Experiments on the image classification task\n\nWe implement the DPNs using MXNet [2] on a cluster with 40 K80 graphics cards. Following [3], we adopt standard data augmentation methods and train the networks using SGD with a mini-batch size of 32 for each GPU. For the deepest network, i.e. DPN-131 (see footnote 1), the mini-batch size is limited to 24 because of the 12 GB GPU memory constraint. The learning rate starts from √0.1 for DPN-92 and DPN-131, and from 0.4 for DPN-98. It drops in a "steps" manner by a factor of 0.1. Following [5], batch normalization layers are refined after training.\n\n5.1.1 ImageNet-1k dataset\n\nFirstly, we compare the image classification performance of DPNs with current state-of-the-art models. As can be seen from the first block in Table 2, a shallow DPN with a depth of only 92 reduces the top-1 error rate by an absolute value of 0.5% compared with ResNeXt-101 (32 × 4d) and by an absolute value of 1.5% compared with DenseNet-161, while requiring considerably fewer FLOPs. In the second block of Table 2, a deeper DPN (DPN-98) surpasses the best residual network, ResNeXt-101 (64 × 4d), and still enjoys 25% fewer FLOPs and a much smaller model size (236 MB v.s. 320 MB). 
In order to further push the state-of-the-art accuracy, we slightly increase the depth of the DPN to 131 (DPN-131). The results are shown in the last block of Table 2. Again, the DPN shows superior accuracy over the best single model, Very Deep PolyNet [23], with a much smaller model size (304 MB v.s. 365 MB). Note that the Very Deep PolyNet adopts numerous tricks, e.g. initialization by insertion, residual scaling and stochastic paths, to assist the training process. In contrast, our proposed DPN-131 is simple and does not involve these tricks; DPN-131 can be trained using the same standard training strategy as the shallower DPNs. More importantly, the actual training speed of DPN-131 is about 2 times faster than that of the Very Deep PolyNet, as discussed in the following paragraph.\n\n1 The DPN-131 has 128 channels at conv1, 4 blocks at conv2, 8 blocks at conv3, 28 blocks at conv4 and 3 blocks at conv5, with #params = 79.5 × 10^6 and FLOPs = 16.0 × 10^9.\n\nTable 2: Comparison with state-of-the-art CNNs on the ImageNet-1k dataset. Single crop validation error rate (%) on the validation set. 
*: Performance reported by [21]. †: With Mean-Max Pooling (see supplementary material).\n\n| Method | Model Size | GFLOPs | top-1 (x224) | top-5 (x224) | top-1 (x320 / x299) | top-5 (x320 / x299) |\n| --- | --- | --- | --- | --- | --- | --- |\n| DenseNet-161 (k=48) [8] | 111 MB | 7.7 | 22.2 | – | – | – |\n| ResNet-101* [5] | 170 MB | 7.8 | 22.0 | 6.0 | – | – |\n| ResNeXt-101 (32 × 4d) [21] | 170 MB | 8.0 | 21.2 | 5.6 | – | – |\n| DPN-92 (32 × 3d) | 145 MB | 6.5 | 20.7 | 5.4 | 19.3 | 4.7 |\n| ResNet-200 [6] | 247 MB | 15.0 | 21.7 | 5.8 | 20.1 | 4.8 |\n| Inception-resnet-v2 [20] | 227 MB | – | – | – | 19.9 | 4.9 |\n| ResNeXt-101 (64 × 4d) [21] | 320 MB | 15.5 | 20.4 | 5.3 | 19.1 | 4.4 |\n| DPN-98 (40 × 4d) | 236 MB | 11.7 | 20.2 | 5.2 | 18.9 | 4.4 |\n| Very deep Inception-resnet-v2 [23] | 531 MB | – | – | – | 19.10 | 4.48 |\n| Very Deep PolyNet [23] | 365 MB | – | – | – | 18.71 | 4.25 |\n| DPN-131 (40 × 4d) | 304 MB | 16.0 | 19.93 | 5.12 | 18.62 | 4.23 |\n| DPN-131 (40 × 4d) † | 304 MB | 16.0 | 19.93 | 5.12 | 18.55 | 4.16 |\n\nTable 3: Comparison with state-of-the-art CNNs on the Places365-Standard dataset. 10-crop validation accuracy (%) on the validation set.\n\n| Method | Model Size | top-1 acc. | top-5 acc. |\n| --- | --- | --- | --- |\n| AlexNet [24] | 223 MB | 53.17 | 82.89 |\n| GoogLeNet [24] | 44 MB | 53.63 | 83.88 |\n| VGG-16 [24] | 518 MB | 55.24 | 84.91 |\n| ResNet-152 [24] | 226 MB | 54.74 | 85.08 |\n| ResNeXt-101 [3] | 165 MB | 56.21 | 86.25 |\n| CRU-Net-116 [3] | 163 MB | 56.60 | 86.55 |\n| DPN-92 (32 × 3d) | 138 MB | 56.84 | 86.69 |\n\nFigure 3: Comparison of total actual cost between different models during training. Evaluations are conducted on a single node with 4 K80 graphics cards, with all training samples cached into memory. (For the comparison of training speed, we push the mini-batch size to its maximum value given 12 GB of GPU memory to test the fastest possible training speed of each model.)\n\nSecondly, we compare the training cost between the best performing models. 
Here, we focus on evaluating two key properties: the actual GPU memory cost and the actual training speed. Figure 3 shows the results. As can be seen from Figure 3(a)(b), DPN-98 is 15% faster and uses 9% less memory than the best performing ResNeXt, with a considerably lower testing error rate. Note that, theoretically, the computational cost of DPN-98 shown in Table 2 is 25% less than that of the best performing ResNeXt, indicating there is still room for code optimization. Figure 3(c) presents the same result more clearly. The deeper DPN-131 only costs about 19% more training time than the best performing ResNeXt, but achieves state-of-the-art single model performance. The training speed of the previous state-of-the-art single model, i.e. the Very Deep PolyNet (537 layers) [23], is about 31 samples per second based on our implementation using MXNet, showing that DPN-131 runs about 2 times faster than the Very Deep PolyNet during training.\n\n5.1.2 Places365-Standard dataset\n\nIn this experiment, we further evaluate the accuracy of the proposed DPN on the scene classification task using the Places365-Standard dataset. The Places365-Standard dataset is a high-resolution scene understanding dataset with more than 1.8 million images of 365 scene categories. Different from object images, scene images do not have very clear discriminative patterns and require a higher-level context reasoning ability.\n\nTable 3 shows the results of different models on this dataset. To make a fair comparison, we evaluate DPN-92 on this dataset instead of using deeper DPNs. As can be seen from the results, DPN achieves the best validation accuracy compared with the other methods. The DPN-92 also requires far fewer parameters (138 MB v.s. 
163 MB), which again demonstrates its high parameter efficiency and high generalization ability.

5.2 Experiments on the object detection task

We further evaluate the proposed Dual Path Network on the object detection task. Experiments are performed on the PASCAL VOC 2007 dataset [4]. We train the models on the union set of VOC 2007 trainval and VOC 2012 trainval following [16], and evaluate them on the VOC 2007 test set. We use the standard evaluation metrics, Average Precision (AP) and mean of AP (mAP), following the PASCAL challenge protocols.

[Figure 3 plots: (a) training speed (samples/sec) vs. single-crop top-1 error; (b) memory cost (GB, batch size 24) vs. single-crop top-1 error; (c) training speed vs. memory cost; comparing ResNet-200, ResNeXt-101 (64 × 4d), DPN-98 (40 × 4d) and DPN-131 (40 × 4d).]

Table 4: Object detection results on the PASCAL VOC 2007 test set. The performance is measured by mean of Average Precision (mAP, in %).

Method                | mAP  | aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbk  prsn plant sheep sofa train tv
DenseNet-161 (k=48)   | 79.9 | 80.4 85.9 81.2 72.8 68.0  87.1 88.0 88.8 64.0  83.3 75.4  87.5 87.6  81.3 84.2 54.6  83.2 80.2 87.4  77.2
ResNet-101 [16]       | 76.4 | 79.8 80.7 76.2 68.3 55.9  85.1 85.3 89.8 56.7  87.8 69.4  88.3 88.9  80.9 78.4 41.7  78.6 79.8 85.3  72.0
ResNeXt-101 (32 × 4d) | 80.1 | 80.2 86.5 79.4 72.5 67.3  86.9 88.6 88.9 64.9  85.0 76.2  87.3 87.8  81.8 84.1 55.5  84.0 79.7 87.9  77.0
DPN-92 (32 × 3d)      | 82.5 | 84.4 88.5 84.6 76.5 70.7  87.9 88.8 89.4 69.7  87.0 76.7  89.5 88.7  86.0 86.1 58.4  85.0 80.4 88.2  83.1

Table 5: Semantic segmentation results on the PASCAL VOC 2012 test set.
The performance is measured by mean Intersection over Union (mIoU, in %).

Method                | mIoU | bkg  aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbk  prsn plant sheep sofa train tv
DenseNet-161 (k=48)   | 68.7 | 92.1 77.3 37.1 83.6 54.9 70.0  85.8 82.5 85.9 26.1  73.0 55.1  80.2 74.0  79.1 78.2 51.5  80.0 42.2 75.1  58.6
ResNet-101            | 73.1 | 93.1 86.9 39.9 87.6 59.6 74.4  90.1 84.7 87.7 30.0  81.8 56.2  82.7 82.7  80.1 81.1 52.4  86.2 52.5 81.3  63.6
ResNeXt-101 (32 × 4d) | 73.6 | 93.1 84.9 36.2 80.3 65.0 74.7  90.6 83.9 88.7 31.1  86.3 62.4  84.7 86.1  81.2 80.1 54.0  87.4 54.0 76.3  64.2
DPN-92 (32 × 3d)      | 74.8 | 93.7 88.3 40.3 82.7 64.5 72.0  90.9 85.0 88.8 31.1  87.7 59.8  83.9 86.8  85.1 82.8 60.8  85.3 54.1 82.6  64.6

We perform all experiments within the ResNet-based Faster R-CNN framework, following [5], and make comparisons by replacing the ResNet while keeping the other parts unchanged. Since our goal is to evaluate DPN rather than to further push the state-of-the-art accuracy on this dataset, we adopt the shallowest DPN-92 and baseline networks at roughly the same complexity level. Table 4 provides the detection performance comparisons of the proposed DPN with several current state-of-the-art models. It can be observed that the DPN obtains an mAP of 82.5%, a large improvement: 6.1% over ResNet-101 [16] and 2.4% over ResNeXt-101 (32 × 4d). These better results demonstrate that the Dual Path Network is also capable of learning better feature representations for detecting objects, benefiting the object detection task.

5.3 Experiments on the semantic segmentation task
In this experiment, we evaluate the Dual Path Network for dense prediction, i.e. semantic segmentation, where the training target is to predict the semantic label for each pixel in the input image.
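As a reference for how the mIoU numbers in Table 5 are computed, the following is a minimal numpy sketch of the metric; the function name and toy arrays are illustrative, not part of the paper's evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Per-class IoU = TP / (TP + FP + FN) from a pixel-level confusion
    matrix; mIoU is the average of the per-class IoUs."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        conf[t, p] += 1
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp  # predicted as class c, labelled otherwise
    fn = conf.sum(axis=1) - tp  # labelled class c, predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)  # guard against empty union
    return iou.mean()

# toy 2x2 label map: one pixel of class 0 is misclassified as class 1
target = np.array([[0, 0], [1, 1]])
pred   = np.array([[0, 1], [1, 1]])
print(mean_iou(pred, target, num_classes=2))  # mean of 1/2 and 2/3
```

Because each class contributes equally to the mean, a class that is missed entirely pulls mIoU down even when overall pixel accuracy is high, which is why per-category IoUs are reported alongside the mean in Table 5.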
We conduct experiments on the PASCAL VOC 2012 segmentation benchmark dataset [4] and use DeepLab-ASPP-L [1] as the segmentation framework. For each compared method in Table 5, we replace the 3 × 3 convolutional layers in conv4 and conv5 of Table 1 with atrous convolution [1] and plug a head of Atrous Spatial Pyramid Pooling (ASPP) [1] into the final feature maps of conv5. We adopt the same training strategy for all networks, following [1], for a fair comparison.
Table 5 shows the results of the different convolutional neural networks. It can be observed that the proposed DPN-92 has the highest overall mIoU accuracy. Compared with ResNet-101, which has a larger model size and higher computational cost, the proposed DPN-92 further improves the IoU for most categories and improves the overall mIoU by an absolute 1.7%. Considering that ResNeXt-101 (32 × 4d) improves the overall mIoU by only an absolute 0.5% over ResNet-101, the proposed DPN-92 gains more than 3 times the improvement of ResNeXt-101 (32 × 4d). These better results once again demonstrate that the proposed Dual Path Network is capable of learning better feature representations for dense prediction.

6 Conclusion
In this paper, we revisited densely connected networks, bridged them with Higher Order RNNs, and proved that residual networks are essentially densely connected networks with shared connections. Based on this new explanation, we proposed a dual path architecture that enjoys the benefits of both. The novel network, DPN, was then developed based on this dual path architecture. Experiments on the image classification task demonstrate that the DPN enjoys high accuracy, small model size, low computational cost and low GPU memory consumption, and is thus extremely useful not only for research but also for real-world applications.
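The dual path computation summarized here — ResNet-style feature re-usage plus DenseNet-style new-feature exploration — can be illustrated with a toy numpy sketch. This is a conceptual illustration only: it uses feature vectors instead of feature maps and a plain linear map in place of the bottleneck convolutions, and all names are our own.

```python
import numpy as np

# Channel widths: d residual channels (shared, re-used features) and
# k new channels appended to the densely connected path per block.
d, k = 8, 4

def f(x, w):
    """Stand-in for the block's transform: a linear map producing d + k
    outputs (a real DPN uses 1x1-3x3-1x1 grouped convolutions)."""
    return x @ w

def dual_path_block(res, dense, w):
    """One dual-path step: the first d outputs are added onto the residual
    path; the remaining k outputs are concatenated onto the dense path."""
    out = f(np.concatenate([res, dense], axis=1), w)
    res = res + out[:, :d]                               # feature re-usage
    dense = np.concatenate([dense, out[:, d:]], axis=1)  # new features
    return res, dense

# two stacked blocks on a batch of 2 feature vectors
rng = np.random.default_rng(0)
res, dense = rng.standard_normal((2, d)), rng.standard_normal((2, k))
for _ in range(2):
    w = rng.standard_normal((res.shape[1] + dense.shape[1], d + k))
    res, dense = dual_path_block(res, dense, w)
print(res.shape, dense.shape)  # → (2, 8) (2, 12)
```

Stacking blocks keeps the residual path at a fixed width while the dense path grows by k channels per block, which mirrors the combination of shared features and flexible new-feature exploration described in the conclusion.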
Experiments on the object detection and semantic segmentation tasks show that the proposed DPN can also benefit other tasks by simply replacing the base network.

Acknowledgments
The work of Jiashi Feng was partially supported by National University of Singapore startup grant R-263-000-C08-133, Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112 and NUS IDS grant R-263-000-C67-646.

References
[1] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[2] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
[3] Yunpeng Chen, Xiaojie Jin, Bingyi Kang, Jiashi Feng, and Shuicheng Yan. Sharing residual units through collective tensor factorization in deep neural networks. arXiv preprint arXiv:1703.02180, 2017.
[4] Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2014.
[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.
[8] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
[9] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1646–1654, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] Chen-Yu Lee, Patrick W. Gallagher, and Zhuowen Tu. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In Artificial Intelligence and Statistics, pages 464–472, 2016.
[12] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[14] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
[15] Geoff Pleiss, Danlu Chen, Gao Huang, Tongcheng Li, Laurens van der Maaten, and Kilian Q. Weinberger. Memory-efficient implementation of DenseNets. arXiv preprint arXiv:1707.06990, 2017.
[16] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[17] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[19] Rohollah Soltani and Hui Jiang. Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064, 2016.
[20] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alex Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
[21] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
[22] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
[23] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. PolyNet: A pursuit of structural diversity in very deep networks. arXiv preprint arXiv:1611.05725, 2016.
[24] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Antonio Torralba, and Aude Oliva. Places: An image database for deep scene understanding. arXiv preprint arXiv:1610.02055, 2016.