{"title": "Searching for Efficient Multi-Scale Architectures for Dense Image Prediction", "book": "Advances in Neural Information Processing Systems", "page_first": 8699, "page_last": 8710, "abstract": "The design of neural network architectures is an important component for achieving state-of-the-art performance with machine learning systems across a broad array of tasks. Much work has endeavored to design and build architectures automatically through clever construction of a search space paired with simple learning algorithms. Recent progress has demonstrated that such meta-learning methods may exceed scalable human-invented architectures on image classification tasks. An open question is the degree to which such methods may generalize to new domains. In this work we explore the construction of meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation. Constructing viable search spaces in this domain is challenging because of the multi-scale representation of visual information and the necessity to operate on high resolution imagery. Based on a survey of techniques in dense image prediction, we construct a recursive search space and demonstrate that even with efficient random search, we can identify architectures that outperform human-invented architectures and achieve state-of-the-art performance on three dense prediction tasks including 82.7% on Cityscapes (street scene parsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9% on PASCAL VOC 2012 (semantic image segmentation). Additionally, the resulting architecture is more computationally efficient, requiring half the parameters and half the computational cost as previous state of the art systems.", "full_text": "Searching for Ef\ufb01cient Multi-Scale\n\nArchitectures for Dense Image Prediction\n\nLiang-Chieh Chen Maxwell D. 
Collins\n\nBarret Zoph\n\nFlorian Schroff\n\nYukun Zhu\n\nHartwig Adam\n\nGeorge Papandreou\n\nJonathon Shlens\n\nGoogle Inc.\n\nAbstract\n\nThe design of neural network architectures is an important component for achieving\nstate-of-the-art performance with machine learning systems across a broad array of\ntasks. Much work has endeavored to design and build architectures automatically\nthrough clever construction of a search space paired with simple learning algo-\nrithms. Recent progress has demonstrated that such meta-learning methods may\nexceed scalable human-invented architectures on image classi\ufb01cation tasks. An\nopen question is the degree to which such methods may generalize to new domains.\nIn this work we explore the construction of meta-learning techniques for dense\nimage prediction focused on the tasks of scene parsing, person-part segmentation,\nand semantic image segmentation. Constructing viable search spaces in this do-\nmain is challenging because of the multi-scale representation of visual information\nand the necessity to operate on high resolution imagery. Based on a survey of\ntechniques in dense image prediction, we construct a recursive search space and\ndemonstrate that even with ef\ufb01cient random search, we can identify architectures\nthat outperform human-invented architectures and achieve state-of-the-art perfor-\nmance on three dense prediction tasks including 82.7% on Cityscapes (street scene\nparsing), 71.3% on PASCAL-Person-Part (person-part segmentation), and 87.9%\non PASCAL VOC 2012 (semantic image segmentation). 
Additionally, the resulting\narchitecture is more computationally ef\ufb01cient, requiring half the parameters and\nhalf the computational cost as previous state of the art systems.\n\n1\n\nIntroduction\n\nThe resurgence of neural networks in machine learning has shifted the emphasis for building state-of-\nthe-art systems in such tasks as image recognition [44, 84, 83, 34], speech recognition [36, 8], and\nmachine translation [88, 82] towards the design of neural network architectures. Recent work has\ndemonstrated successes in automatically designing network architectures, largely focused on single-\nlabel image classi\ufb01cation tasks [100, 101, 52] (but see [100, 65] for language tasks). Importantly,\nin just the last year such meta-learning techniques have identi\ufb01ed architectures that exceed the\nperformance of human-invented architectures for large-scale image classi\ufb01cation problems [101, 52,\n68].\nImage classi\ufb01cation has provided a great starting point because much research effort has identi\ufb01ed\nsuccessful network motifs and operators that may be employed to construct search spaces for\narchitectures [52, 68, 101]. Additionally, image classi\ufb01cation is inherently multi-resolution whereby\nfully convolutional architectures [77, 58] may be trained on low resolution images (with minimal\ncomputational demand) and be transferred to high resolution images [101].\nAlthough these results suggest opportunity, the real promise depends on the degree to which meta-\nlearning may extend into domains beyond image classi\ufb01cation. In particular, in the image domain,\nmany important tasks such as semantic image segmentation [58, 11, 97], object detection [71, 21],\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fand instance segmentation [20, 33, 9] rely on high resolution image inputs and multi-scale image\nrepresentations. 
Na\u00efvely porting ideas from image classification would not suffice because (1) the space of network motifs and operators differs notably from systems that perform classification and (2) architecture search must inherently operate on high resolution imagery. This final point makes previous approaches computationally intractable, since they relied on transfer learning from low to high image resolutions [101].\nIn this work, we present the first effort towards applying meta-learning to dense image prediction (Fig. 1) \u2013 largely focused on the heavily-studied problem of scene labeling. Scene labeling refers to the problem of assigning semantic labels such as \"person\" or \"bicycle\" to every pixel in an image. State-of-the-art systems in scene labeling are elaborations of convolutional neural networks (CNNs) largely structured as an encoder-decoder in which various forms of pooling, spatial pyramid structures [97] and atrous convolutions [11] have been explored. The goal of these operations is to build a multi-scale representation of a high resolution image to densely predict pixel values (e.g., stuff label, object label, etc.). We leverage this literature in order to construct a search space over network motifs for dense prediction. Additionally, we perform an array of experiments to demonstrate how to construct a computationally tractable and simple proxy task that may provide predictive information on multi-scale architectures for high resolution imagery.\nWe find that an effective random search policy provides a strong baseline [5, 30] and identify several candidate network architectures for scene labeling. In experiments on the Cityscapes dataset [18], we find architectures that achieve 82.7% mIOU accuracy, exceeding the performance of human-invented architectures by 0.7% [6]. 
For reference, note that achieving gains on the Cityscapes dataset is\nchallenging as the previous academic competition elicited gains of 0.8% in mIOU from [97] to [6]\nover more than one year. Additionally, this same network applied to other dense prediction tasks\nsuch as person-part segmentation [16] and semantic image segmentation [24] surpasses state-of-\nthe-art results [25, 93] by 3.7% and 1.7% in absolute percentage, respectively (and comparable to\nconcurrent works [14, 96, 48] on VOC 2012). This is the \ufb01rst time to our knowledge that a meta-\nlearning algorithm has matched state-of-the-art performance using architecture search techniques\non dense image prediction problems. Notably, the identi\ufb01ed architecture operates with half the\nnumber of trainable parameters and roughly half the computational demand (in Multiply-Adds) as\nprevious state-of-the-art systems [14], when employing the powerful Xception [17, 67, 14] as network\nbackbone1.\n\n2 Related Work\n\n2.1 Architecture search\n\nOur work is motivated by the neural architecture search (NAS) method [100, 101], which trains a\ncontroller network to generate neural architectures. In particular, [101] transfers architectures learned\non a proxy dataset [43] to more challenging datasets [73] and demonstrates superior performance over\nmany human-invented architectures. Many parallel efforts have employed reinforcement learning\n[3, 99], evolutionary algorithms [81, 69, 59, 90, 53, 68] and sequential model-based optimization\n[61, 52] to learn network structures. Additionally, other works focus on successively increasing\nmodel size [7, 15], sharing model weights to accelerate model search [65], or a continuous relaxation\nof the architecture representation [54]. 
Note that our work is complementary and may leverage all of these advances in search techniques to accelerate the search and decrease computational demand. Critically, all approaches are predicated on constructing powerful but tractable architecture search spaces. Indeed, [52, 101, 68] find that sophisticated learning algorithms achieve superior results; however, even random search may achieve strong results if the search space is not overly expansive. Motivated by this last point, we focus our efforts on developing a tractable and powerful search space for dense image prediction paired with efficient random search [5, 30].\nRecently, [75, 27] proposed methods for embedding an exponentially large number of architectures in a grid arrangement for semantic segmentation tasks. In this work, we instead propose a novel recursive search space and simple yet predictive proxy tasks aimed at finding effective architectures for dense image prediction.\n\n1An implementation of the proposed model will be made available at https://github.com/tensorflow/models/tree/master/research/deeplab.\n\n2.2 Multi-scale representation for dense image prediction\n\nState-of-the-art solutions for dense image prediction derive largely from convolutional neural networks [46]. A critical element of building such systems is supplying global features and context information to perform pixel-level classification [35, 78, 41, 45, 31, 92, 60, 19, 63]. 
Several approaches\nexist for how to ef\ufb01ciently encode the multi-scale context information in a network architecture: (1)\ndesigning models that take as input an image pyramid so that large scale objects are captured by\nthe downsampled image [26, 66, 23, 50, 13, 11], (2) designing models that contain encoder-decoder\nstructures [2, 72, 49, 28, 64, 93, 96], or (3) designing models that employ a multi-scale context mod-\nule, e.g., DenseCRF module [42, 4, 10, 98, 50, 76], global context [56, 95], or atrous convolutions\ndeployed in cascade [57, 94, 12] or in parallel [11, 12]. In particular, PSPNet [97] and DeepLab\n[12, 14] perform spatial pyramid pooling at several hand-designed grid scales.\nA common theme in the dense prediction literature is how to best tune an architecture to extract\ncontext information. Several works have focused on sampling rates in atrous convolution to encode\nmulti-scale context [37, 29, 77, 62, 10, 94, 11]. DeepLab-v1 [10] is the \ufb01rst model that enlarges\nthe sampling rate to capture long range information for segmentation. The authors of [94] build a\ncontext module by gradually increasing the rate on top of belief maps, the \ufb01nal CNN feature maps\nthat contain output channels equal to the number of predicted classes. The work in [87] employs a\nhybrid of rates within the last two blocks of ResNet [34], while Deformable ConvNets [22] proposes\nthe deformable convolution which generalizes atrous convolution by learning the rates. DeepLab-v2\n[11] and DeepLab-v3 [12] employ a module, called ASPP (atrous spatial pyramid pooling module),\nwhich consists of several parallel atrous convolutions with different rates, aiming to capture different\nscale information. Dense-ASPP [91] proposes to build the ASPP module in a densely connected\nmanner. 
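To make the role of the atrous sampling rate concrete, here is a minimal 1-D, pure-Python sketch of dilated (atrous) convolution. It is an illustrative simplification, not the separable 2-D operator used in the models above: the kernel taps are applied to inputs spaced `rate` samples apart, enlarging the receptive field without adding parameters.

```python
def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.

    Kernel taps are spaced `rate` samples apart, so the effective
    receptive field is (len(w) - 1) * rate + 1. rate=1 reduces to
    ordinary convolution.
    """
    k = len(w)
    span = (k - 1) * rate + 1  # effective receptive field
    return [
        sum(w[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ]

signal = [0, 0, 1, 0, 0, 0, 1, 0, 0]
kernel = [1, 1, 1]
print(atrous_conv1d(signal, kernel, rate=1))  # receptive field 3
print(atrous_conv1d(signal, kernel, rate=2))  # receptive field 5
```

With `rate=2` the same 3-tap kernel covers a span of 5 samples, which is exactly how the ASPP-style modules above capture longer-range context without extra parameters.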
We discuss below how to construct a search space that captures all of these features.\n\n3 Methods\n\nTwo key components for building a successful architecture search method are the design of the search space and the design of the proxy task [100, 101]. Most of the human expertise shifts from architecture design to the construction of a search space that is both expressive and tractable. Likewise, identifying a proxy task that is predictive of the large-scale task yet extremely quick to run is critical for searching this space efficiently.\n\n3.1 Architecture search space\n\nThe goal of the architecture search space is to design a space that may express a wide range of architectures, but also be tractable enough for identifying good models. We start with the premise of building a search space that may express all of the state-of-the-art dense prediction and segmentation models previously discussed (e.g. [12, 97] and see Sec. 2 for more details).\n\nFigure 1: Schematic diagram of architecture search for dense image prediction. Example tasks explored in this paper include scene parsing [18], semantic image segmentation [24] and person-part segmentation [16].\n\nWe build a recursive search space to encode multi-scale context information for dense prediction tasks that we term a Dense Prediction Cell (DPC). The cell is represented by a directed acyclic graph (DAG) which consists of B branches, and each branch maps one input tensor to another output tensor. In preliminary experiments we found that B = 5 provides a good trade-off between flexibility and computational tractability (see Sec. 5 for more discussion).\nWe specify a branch bi in a DPC as a 3-tuple, (Xi, OPi, Yi), where Xi \u2208 Xi specifies the input tensor, OPi \u2208 OP specifies the operation to apply to input Xi, and Yi denotes the output tensor. The final output, Y , of the DPC is the concatenation of all branch outputs, i.e., Y = concat(Y1, Y2, . . .
, YB), allowing us to exploit all the learned information from each branch. For branch bi, the set of possible inputs, Xi, is equal to the last network backbone feature maps, F, plus all outputs obtained by previous branches, Y1, . . . , Yi\u22121, i.e., Xi = {F, Y1, . . . , Yi\u22121}. Note that X1 = {F}, i.e., the first branch can only take F as input.\n\nFigure 2: Diagram of the search space for atrous convolutions. 3 \u00d7 3 atrous convolutions with sampling rates rh \u00d7 rw capture contexts with different aspect ratios. From left to right: standard convolution (1 \u00d7 1), equal expansion (6 \u00d7 6), short and fat (6 \u00d7 24) and tall and skinny (24 \u00d7 6).\n\nThe operator space, OP, is defined as the following set of functions:\n\n\u2022 Convolution with a 1 \u00d7 1 kernel.\n\u2022 3 \u00d7 3 atrous separable convolution with rate rh \u00d7 rw, where rh and rw \u2208 {1, 3, 6, 9, . . . , 21}.\n\u2022 Average spatial pyramid pooling with grid size gh \u00d7 gw, where gh and gw \u2208 {1, 2, 4, 8}.\n\nFor the spatial pyramid pooling operation, we perform average pooling in each grid. After the average pooling, we apply another 1 \u00d7 1 convolution followed by bilinear upsampling to resize back to the same spatial resolution as the input tensor. For example, when the pooling grid size gh \u00d7 gw is equal to 1 \u00d7 1, we perform image-level average pooling followed by another 1 \u00d7 1 convolution, and then resize back (i.e., tile) the features to have the same spatial resolution as the input tensor.\nWe employ separable convolution [79, 85, 86, 17, 38] with 256 filters for all the convolutions, and decouple sampling rates in the 3 \u00d7 3 atrous separable convolution to be rh \u00d7 rw, which allows us to capture object scales with different aspect ratios. See Fig.
2 for an example.\nThe resulting search space may encode all leading architectures but is more diverse as each branch\nof the cell may build contextual information through parallel or cascaded representations. The\npotential diversity of the search space may be expressed in terms of the total number of potential\narchitectures. For i-th branch, there are i possible inputs, including the last feature maps produced\nby the network backbone (i.e., F) as well as all the previous branch outputs (i.e., Y1, . . . , Yi\u22121), and\n1 + 8\u00d7 8 + 4\u00d7 4 = 81 functions in the operator space, resulting in i\u00d7 81 possible options. Therefore,\nfor B = 5 branches, the search space contains B! \u00d7 81B \u2248 4.2 \u00d7 1011 con\ufb01gurations.\n\n3.2 Architecture search\n\nThe model search framework builds on top of an ef\ufb01cient optimization service [30]. It may be thought\nof as a black-box optimization tool whose task is to optimize an objective function f : b \u2192 R with a\nlimited evaluation budget, where in our case b = {b1, b2, . . . , bB} is the architecture of DPC and f (b)\nis the pixel-wise mean intersection-over-union (mIOU) [24] evaluated on the dense prediction dataset.\nThe black-box optimization refers to the process of generating a sequence of b that approaches the\nglobal optimum (if any) as fast as possible. Our search space size is on the order of 1011 and we\nadopt the random search algorithm implemented by Vizier [30], which basically employs the strategy\nof sampling points b uniformly at random as well as sampling some points b near the currently\nbest observed architectures. We refer the interested readers to [30] for more details. Note that the\nrandom search algorithm is a simple yet powerful method. 
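The search space size quoted above can be checked directly: the i-th branch chooses among i possible inputs and 1 + 8\u00d78 + 4\u00d74 = 81 operators, so the total number of configurations for B branches is the product of i \u00d7 81 over all branches, i.e. B! \u00d7 81^B. A small sketch of this counting argument:

```python
import math

def num_operators():
    # 1x1 conv + 8*8 atrous rate pairs (rh, rw in {1, 3, ..., 21})
    # + 4*4 pooling grid sizes (gh, gw in {1, 2, 4, 8})
    return 1 + 8 * 8 + 4 * 4

def search_space_size(num_branches):
    # branch i picks one of i inputs and one of 81 operators,
    # so the total is prod_i (i * 81) = B! * 81**B
    ops = num_operators()
    size = 1
    for i in range(1, num_branches + 1):
        size *= i * ops
    return size

print(num_operators())        # 81
print(search_space_size(5))   # 418,414,128,120, i.e. ~4.2e11
assert search_space_size(5) == math.factorial(5) * 81 ** 5
```

For B = 5 this reproduces the ~4.2 \u00d7 10^11 configurations stated in the text.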
As highlighted in [101], random search is competitive with reinforcement learning and other learning techniques [52].\n\n3.3 Design of the proxy task\n\nNa\u00efvely applying architecture search to a dense prediction task requires an inordinate amount of computation and time, as the search space is large and training a candidate architecture is time-consuming. For example, if one fine-tunes the entire model with a single dense prediction cell (DPC) on the Cityscapes dataset, then training a candidate architecture with 90K iterations requires over one week with a single P100 GPU. Therefore, we focus on designing a proxy task that is (1) fast to compute and (2) may predict the performance in a large-scale training setting.\n\nFigure 3: Measuring the fidelity of proxy tasks for a dense prediction cell (DPC) in a reduced search space. In preliminary search spaces, a comparison of (a) small to large network backbones (\u03c1 = 0.36), and (b) proxy versus large-scale training with MobileNet-v2 backbone (\u03c1 = 0.47). \u03c1 is Spearman\u2019s rank correlation coefficient.\n\nImage classification employs low resolution images [43] as a fast proxy task for high-resolution [73]. This proxy task does not work for dense image prediction, where high resolution imagery is critical for conveying multi-scale context information. Therefore, we propose to design the proxy dataset by (1) employing a smaller network backbone and (2) caching the feature maps produced by the network backbone on the training set and directly building a single DPC on top of it. Note that the latter point is equivalent to not back-propagating gradients to the network backbone in the real setting. In addition, we elect for early stopping by not training candidate architectures to convergence. In our experiments, we only train each candidate architecture with 30K iterations. 
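The feature-caching idea can be sketched abstractly: the frozen backbone is evaluated once per training image, and every candidate DPC head is then scored on the cached activations instead of re-running the backbone. The sketch below is purely illustrative (the class and function names are hypothetical, not from the paper's implementation):

```python
def run_backbone(image):
    # stand-in for an expensive, frozen network backbone
    return [p * 2 for p in image]

class CachedBackbone:
    """Memoize backbone activations so the expensive path runs once per image."""
    def __init__(self, backbone):
        self.backbone = backbone
        self.cache = {}
        self.calls = 0  # count of expensive backbone evaluations

    def features(self, image_id, image):
        if image_id not in self.cache:
            self.calls += 1  # expensive path taken only on a cache miss
            self.cache[image_id] = self.backbone(image)
        return self.cache[image_id]

cached = CachedBackbone(run_backbone)
dataset = {0: [1, 2], 1: [3, 4]}
# score many candidate heads against the same cached features
for candidate in range(100):
    for image_id, image in dataset.items():
        feats = cached.features(image_id, image)
print(cached.calls)  # the backbone ran once per image, not once per candidate
```

One hundred candidates are scored, yet the backbone runs only twice (once per image), which is the source of the speed-up; gradients simply never reach the cached backbone, matching the "no back-propagation to the backbone" equivalence noted above.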
In summary, these two design choices result in a proxy task that runs in 90 minutes on a GPU, cutting down the computation time by more than 100-fold while remaining predictive of larger tasks (\u03c1 \u2265 0.4).\nAfter performing architecture search, we run a reranking experiment to more precisely measure the efficacy of each architecture in the large-scale setting [100, 101, 68]. In the reranking experiments, the network backbone is fine-tuned and trained to full convergence. The new top architectures returned by this experiment are presented in this work as the best DPC architectures.\n\n4 Results\n\nWe demonstrate the effectiveness of our proposed method on three dense prediction tasks that are well studied in the literature: scene parsing (Cityscapes [18]), person part segmentation (PASCAL-Person-Part [16]), and semantic image segmentation (PASCAL VOC 2012 [24]). Training and evaluation protocols follow [12, 14]. In brief, the network backbone is pre-trained on the COCO dataset [51]. The training protocol employs a polynomial learning rate [56] with an initial learning rate of 0.01, large crop sizes (e.g., 769 \u00d7 769 on Cityscapes and 513 \u00d7 513 on PASCAL images), fine-tuned batch normalization parameters [40] and small batch training (batch size = 8, 16 for proxy and real tasks, respectively). For evaluation and architecture search, we employ a single image scale. For the final results in which we compare against other state-of-the-art systems (Tab. 2, Tab. 3 and Tab. 4), we perform evaluation by averaging over multiple scalings of a given image.\n\n4.1 Designing a proxy task for dense prediction\n\nThe goal of a proxy task is to identify a problem that is quick to evaluate but provides a predictive signal about the large-scale task. In the image classification work, the proxy task was classification on low resolution (e.g. 32 \u00d7 32) images [100, 101]. 
Dense prediction tasks innately require high resolution images as training data. Because the computational demand of convolutional operations scales with the number of pixels, another meaningful proxy task must be identified.\nWe approach the problem of proxy task design by focusing on speed and predictive ability. As discussed in Sec. 3, we employ several strategies for devising a fast and predictive proxy task to speed up the evaluation of a model from over one week to 90 minutes on a single GPU. In these preliminary experiments, we demonstrate that these strategies provide an instructive signal for predicting the efficacy of a given architecture.\n\nFigure 4: Measuring the fidelity of the proxy tasks for a dense prediction cell (DPC) in the full search space. (a) Score distribution on the proxy task. The search algorithm is able to explore a diversity of architectures. (b) Correlation of the found top-50 architectures between the proxy dataset and large-scale training with MobileNet-v2 backbone (\u03c1 = 0.46). \u03c1 is Spearman\u2019s rank correlation coefficient.\n\nTo minimize stochastic variation due to sampling architectures, we first construct an extremely small search space containing only 31 architectures2 in which we may exhaustively explore performance. We perform the experiments and subsequent architecture search on Cityscapes [18], which features large variations in object scale across 19 semantic labels.\nFollowing previous state-of-the-art segmentation models, we employ the Xception architecture [17, 67, 14] for the large-scale setting. We first asked whether a smaller network backbone, MobileNet-v2 [74], provides a strong signal of the performance of the large network backbone (Fig. 3a). 
MobileNet-v2 incurs roughly 1/20 the computational cost and cuts down the backbone feature channels from 2048 to 320 dimensions. We indeed find a rank correlation (\u03c1 = 0.36) comparable to learned predictors [52], suggesting that this may provide a reasonable substitute for the proxy task. We next asked whether employing a fixed and cached set of activations correlates well with training end-to-end. Fig. 3b shows an even higher rank correlation between cached activations and training end-to-end for the COCO-pretrained MobileNet-v2 backbone (\u03c1 = 0.47). The fact that these rank correlations are significantly above chance rate (\u03c1 = 0) indicates that these design choices provide a useful signal for large-scale experiments (i.e., more expensive network backbone) comparable to learned predictors [52, 101] (for reference, \u03c1 \u2208 [0.41, 0.47] in the last stage of [52]) as well as a fast proxy task.\n\n4.2 Architecture search for dense prediction cells\n\nWe deploy the resulting proxy task, with our proposed architecture search space, on Cityscapes to explore 28K DPC architectures across 370 GPUs over one week. We employ a simple and efficient random search [5, 30] and select the top 50 architectures (w.r.t. validation set performance) for re-ranking based on fine-tuning the entire model using the MobileNet-v2 network backbone. Fig. 4a highlights the distribution of performance scores on the proxy dataset, showing that the architecture search algorithm is able to explore a diversity of architectures. Fig. 4b demonstrates the correlation of the found top-50 DPCs between the original proxy task and the re-ranked scores. Notably, the top model identified with re-ranking was the 12th best model as measured by the proxy score.\nFig. 5a provides a schematic diagram of the top DPC architecture identified (see Fig. 6 for the next best performing ones). 
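The fidelity numbers quoted above (\u03c1 = 0.36, 0.47, 0.46) are Spearman rank correlations between proxy and large-scale scores. For reference, a minimal pure-Python implementation is sketched below; it assumes no tied scores, and the toy proxy/real values are illustrative only, not measurements from the paper.

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation coefficient (assumes no tied values)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # classic closed form: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# toy proxy-vs-real mIOU scores (illustrative, not from the paper)
proxy = [0.52, 0.55, 0.53, 0.58, 0.56]
real = [0.70, 0.72, 0.73, 0.74, 0.71]
print(spearman_rho(proxy, real))
```

Identical orderings give \u03c1 = 1 and fully reversed orderings give \u03c1 = -1; values well above 0, as in the experiments above, indicate that ranking by the proxy score is informative about the large-scale ranking.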
Following [39] we examine the L1 norm of the weights connecting each branch (via a 1 \u00d7 1 convolution) to the output of the top performing DPC in Fig. 5b. We observe that the branch with the 3 \u00d7 3 convolution (rate = 1 \u00d7 6) contributes most, whereas the branches with large rates (i.e., longer context) contribute less. In other words, information from image features in closer proximity (i.e. final spatial scale) contributes more to the final outputs of the network. In contrast, the worst-performing DPC (Fig. 6c) does not preserve fine spatial information as it cascades four branches after the global image pooling operation.\n\n2The small search space consists of all possible combinations of the 5 parallel branches of the ASPP architecture \u2013 a top ranked architecture for dense prediction [12]. There exist 2^5 \u2212 1 = 31 potential arrangements of these parallel pathways.\n\nFigure 5: Schematic diagram of top ranked DPC (left) and average absolute filter weights (L1 norm) for each operation (right).\n\nFigure 6: Diversity of DPCs explored in architecture search. (b-d) Top-2, Top-3 and worst DPCs.\n\nNetwork Backbone    Module     Params  MAdds   mIOU (%)\nMobileNet-v2        ASPP [12]  0.25M   2.82B   73.97\nMobileNet-v2        DPC        0.36M   3.00B   75.38\nModified Xception   ASPP [12]  1.59M   18.12B  80.25\nModified Xception   DPC        0.81M   6.84B   80.85\n\nTable 1: Cityscapes validation set performance (labeling IOU) across different network backbones (output stride = 16). 
ASPP is the previous state-of-the-art system [12] and DPC indicates this work. Params and MAdds indicate the number of parameters and number of multiply-add operations in each multi-scale context module.\n\n4.3 Performance on scene parsing\n\nWe train the best learned DPC with MobileNet-v2 [74] and modified Xception [17, 67, 14] as network backbones on the Cityscapes training set [18] and evaluate on the validation set. The network backbone is pretrained on the COCO dataset [51] for this and all subsequent experiments. Fig. 1 in the supplementary material shows qualitative results of the predictions from the resulting architecture. Quantitative results in Tab. 1 highlight that the learned DPC provides a 1.4% improvement on the validation set when using the MobileNet-v2 network backbone and a 0.6% improvement when using the larger modified Xception network backbone. Furthermore, the best DPC only requires half of the parameters and 38% of the FLOPS of the previous state-of-the-art dense prediction network [14] when using Xception as the network backbone. We note that the computational savings result from the cascaded structure in our top-1 DPC: the Xception backbone produces 2048 feature channels, so it is expensive to directly build parallel operations on top of it (as ASPP does).\nWe next evaluate the performance on the test set (Tab. 2). DPC sets a new state-of-the-art performance of 82.7% mIOU \u2013 a 0.7% improvement over the state-of-the-art model [6]. This model outperforms other state-of-the-art models across 11 of the 19 categories. We emphasize that achieving gains on the Cityscapes dataset is challenging because this is a heavily researched benchmark. 
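The efficiency claims can be read off Table 1 directly: with the modified Xception backbone, the DPC module uses 0.81M parameters and 6.84B Multiply-Adds versus 1.59M and 18.12B for ASPP. A quick check of the stated ratios:

```python
# Module costs from Table 1 (modified Xception backbone)
aspp_params, aspp_madds = 1.59e6, 18.12e9
dpc_params, dpc_madds = 0.81e6, 6.84e9

param_ratio = dpc_params / aspp_params  # ~0.51: about half the parameters
madds_ratio = dpc_madds / aspp_madds    # ~0.38: 38% of the Multiply-Adds
print(f"params: {param_ratio:.0%}, MAdds: {madds_ratio:.0%}")
```

This reproduces the "half the parameters and 38% of the FLOPS" figures quoted in the text.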
The previous academic competition elicited gains of 0.8% in mIOU from [97] to [6] over the span of one year.\n\nTable 2: Cityscapes test set performance across leading competitive models.\n\nMethod                  road sidewalk building wall fence pole light sign vege. terrain sky  person rider car  truck bus  train mbike bicycle mIOU\nPSPNet [97]             98.7 86.9     93.5     58.4 63.7  67.7 76.1  80.5 93.6  72.2    95.3 86.8   71.9  96.2 77.7  91.5 83.6  70.8  77.5    81.2\nMapillary Research [6]  98.4 85.0     93.7     61.8 63.9  67.7 77.4  80.8 93.7  71.9    95.6 86.7   72.8  95.7 79.9  93.1 89.7  72.6  78.2    82.0\nDeepLabv3+ [14]         98.7 87.0     93.9     59.5 63.7  71.4 78.2  82.2 94.0  73.0    95.9 88.0   73.3  96.4 78.0  90.9 83.9  73.8  78.9    82.1\nDPC                     98.7 87.1     93.8     57.7 63.5  71.0 78.0  82.1 94.0  73.3    95.4 88.2   74.5  96.5 81.2  93.3 89.0  74.1  79.0    82.7\n\nTable 3: PASCAL-Person-Part validation set performance.\n\nMethod            head  torso u-arms l-arms u-legs l-legs bkg   mIOU\nLiang et al. [47] 82.89 67.15 51.42  48.72  51.72  45.91  97.18 63.57\nXia et al. [89]   85.50 67.87 54.72  54.30  48.25  44.76  95.32 64.39\nFang et al. [25]  87.15 72.28 57.07  56.21  52.43  50.36  97.72 67.60\nDPC               88.81 74.54 63.85  63.73  57.24  54.55  96.66 71.34\n\nTable 4: PASCAL VOC 2012 test set performance.\n\nMethod           aero bike bird boat bottle bus  car  cat  chair cow  table dog  horse mbike person plant sheep sofa train tv   mIOU\nEncNet [95]      95.3 76.9 94.2 80.2 85.3   96.5 90.8 96.3 47.9  93.9 80.0  92.4 96.6  90.5  91.5   70.9  93.6  66.5 87.7  80.8 85.9\nDFN [93]         96.4 78.6 95.5 79.1 86.4   97.1 91.4 95.0 47.7  92.9 77.2  91.0 96.7  92.2  91.7   76.5  93.1  64.4 88.3  81.2 86.2\nDeepLabv3+ [14]  97.0 77.1 97.1 79.3 89.3   97.4 93.2 96.6 56.9  95.0 79.2  93.1 97.0  94.0  92.8   71.3  92.9  72.4 91.0  84.9 87.8\nExFuse [96]      96.8 80.3 97.0 82.5 87.8   96.3 92.6 96.4 53.3  94.3 78.4  94.1 94.9  91.6  92.3   81.7  94.8  70.3 90.1  83.8 87.9\nMSCI [48]        96.8 76.8 97.0 80.6 89.3   97.4 93.8 97.1 56.7  94.3 78.3  93.5 97.1  94.0  92.8   72.3  92.6  73.6 90.8  85.4 88.0\nDPC              97.4 77.5 96.6 79.4 87.2   97.6 90.1 96.6 56.8  97.0 77.0  94.3 97.5  93.2  92.5   78.9  94.3  70.1 91.4  84.0 87.9\n\n4.4 Performance on person part segmentation\n\nThe PASCAL-Person-Part dataset [16] contains large variation in object scale and human pose, annotating six person part classes as well as the background class. We train a model on this dataset employing the same DPC identified during architecture search using the modified Xception network backbone. Fig. 2 in the supplementary material shows a qualitative visualization of these results and Tab. 3 quantifies the model performance. The DPC architecture achieves state-of-the-art performance of 71.34%, representing a 3.74% improvement over the best state-of-the-art model [25], consistently outperforming other models w.r.t. all categories except the background class. 
Additionally, note that the DPC model does not require the extra MPII training data [1] used in [89, 25].

4.5 Performance on semantic image segmentation

The PASCAL VOC 2012 benchmark [24] (augmented by [32]) involves segmenting 20 foreground object classes and one background class. We train a model on this dataset employing the same DPC identified during architecture search, using the modified Xception network backbone. Fig. 3 in the supplementary material provides a qualitative visualization of the results, and Tab. 4 quantifies the model performance on the test set. The DPC architecture outperforms previous state-of-the-art models [95, 93] by more than 1.7%, and is comparable to concurrent works [14, 96, 48]. Across semantic categories, DPC achieves state-of-the-art performance in 6 of the 20 categories.

5 Conclusion

This work demonstrates how architecture search techniques may be employed for problems beyond image classification – in particular, problems of dense image prediction where multi-scale processing is critical for achieving state-of-the-art performance. The application of architecture search to dense image prediction was achieved through (1) the construction of a recursive search space leveraging innovations in the dense prediction literature and (2) the construction of a fast proxy predictive of performance on the large-scale task. The resulting learned architecture surpasses human-invented architectures across three dense image prediction tasks: scene parsing [18], person-part segmentation [16] and semantic segmentation [24]. On the first task, the resulting architecture achieved performance gains comparable to the gains witnessed in last year's academic competition [18].
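The random search referenced above operates over cell specifications rather than weights. A simplified sketch of sampling one dense prediction cell with B branches follows; the operator vocabulary (1x1 convolution, 3x3 atrous convolution with rates up to 21, small-grid spatial pyramid pooling) mirrors the cell diagrams, but the exact encoding and names here are illustrative assumptions, not our implementation:

```python
import random

# Illustrative sketch: each branch reads either the cell input (source 0)
# or the output of an earlier branch, applies one operator, and all branch
# outputs are concatenated to form the cell output Y.
RATES = [1, 3, 6, 9, 12, 15, 18, 21]   # atrous rates per spatial dimension
GRIDS = [1, 2, 4, 8]                   # pyramid pooling grid sizes

def sample_branch(i, rng):
    source = rng.randrange(i + 1)  # 0 = cell input, k > 0 = branch k
    op = rng.choice(["conv1x1", "conv3x3_atrous", "pyramid_pool"])
    if op == "conv3x3_atrous":
        return (source, op, (rng.choice(RATES), rng.choice(RATES)))
    if op == "pyramid_pool":
        return (source, op, (rng.choice(GRIDS), rng.choice(GRIDS)))
    return (source, op, None)

def sample_cell(num_branches=5, seed=0):
    rng = random.Random(seed)
    return [sample_branch(i, rng) for i in range(num_branches)]
```

Random search then scores each sampled specification with the fast proxy and re-ranks the most promising candidates on the full task.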
In addition, the resulting architecture is more efficient than state-of-the-art systems, requiring half the parameters and 38% of the computational demand when using the deeper Xception [17, 67, 14] network backbone.

Several opportunities exist for improving the quality of these results. Previous work identified the design of a large and flexible search space as a critical element for achieving strong results [101, 52, 100, 65]. Expanding the search space further by increasing the number of branches B in the dense prediction cell may yield further gains. Preliminary experiments with B > 5 on the scene parsing data suggest some opportunity, although random search in an exponentially growing space becomes more challenging. The use of intelligent search algorithms such as reinforcement learning [3, 99], sequential model-based optimization [61, 52] and evolutionary methods [81, 69, 59, 90, 53, 68] may be leveraged to further improve search efficiency, particularly as the space grows in size. We hope that these ideas may be ported into other domains, such as depth prediction [80] and object detection [70, 55], to achieve similar gains over human-invented designs.

Acknowledgments We thank Kevin Murphy for many ideas and inspiration; Quoc Le, Bo Chen, Maxim Neumann and Andrew Howard for support and discussion; Hui Hui for helping release the models; and members of the Google Brain, Mobile Vision and Vizier teams for infrastructure, support and discussion.

References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv:1511.00561, 2015.
[3] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In ICLR, 2017.
[4] S. Bell, P.
Upchurch, N. Snavely, and K. Bala. Material recognition in the wild with the materials in context database. In CVPR, 2015.
[5] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. JMLR, 2012.
[6] S. R. Bulò, L. Porzi, and P. Kontschieder. In-place activated batchnorm for memory-optimized training of dnns. In CVPR, 2018.
[7] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. Efficient architecture search by network transformation. In AAAI, 2018.
[8] W. Chan, N. Jaitly, Q. Le, and O. Vinyals. Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In ICASSP, 2016.
[9] L.-C. Chen, A. Hermans, G. Papandreou, F. Schroff, P. Wang, and H. Adam. Masklab: Instance segmentation by refining object detection with semantic and direction features. In CVPR, 2018.
[10] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. In ICLR, 2015.
[11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI, 2017.
[12] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv:1706.05587, 2017.
[13] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
[14] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[15] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In ICLR, 2016.
[16] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts.
In CVPR, 2014.
[17] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In CVPR, 2017.
[18] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[19] J. Dai, K. He, and J. Sun. Convolutional feature masking for joint object and stuff segmentation. In CVPR, 2015.
[20] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[21] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016.
[22] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
[23] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In ICCV, 2015.
[24] M. Everingham, S. M. A. Eslami, L. V. Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge – a retrospective. IJCV, 2014.
[25] H.-S. Fang, G. Lu, X. Fang, J. Xie, Y.-W. Tai, and C. Lu. Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. In CVPR, 2018.
[26] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. PAMI, 2013.
[27] D. Fourure, R. Emonet, E. Fromont, D. Muselet, A. Tremeau, and C. Wolf. Residual conv-deconv grid network for semantic segmentation. In BMVC, 2017.
[28] J. Fu, J. Liu, Y. Wang, and H. Lu. Stacked deconvolutional network for semantic segmentation. arXiv:1708.04943, 2017.
[29] A. Giusti, D. Ciresan, J. Masci, L. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. In ICIP, 2013.
[30] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley.
Google vizier: A service for black-box optimization. In SIGKDD, 2017.
[31] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, 2009.
[32] B. Hariharan, P. Arbeláez, L. Bourdev, S. Maji, and J. Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[33] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
[34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[35] X. He, R. S. Zemel, and M. Carreira-Perpiñán. Multiscale conditional random fields for image labeling. In CVPR, 2004.
[36] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97, 2012.
[37] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian. A real-time algorithm for signal analysis with the help of the wavelet transform. In Wavelets: Time-Frequency Methods and Phase Space, pages 289–297. 1989.
[38] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[39] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[40] S. Ioffe and C. Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[41] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. IJCV, 82(3):302–324, 2009.
[42] P. Krähenbühl and V. Koltun. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011.
[43] A.
Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[44] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[45] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr. Associative hierarchical crfs for object class image segmentation. In ICCV, 2009.
[46] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
[47] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing. Interpretable structure-evolving lstm. In CVPR, 2017.
[48] D. Lin, Y. Ji, D. Lischinski, D. Cohen-Or, and H. Huang. Multi-scale context intertwining for semantic segmentation. In ECCV, 2018.
[49] G. Lin, A. Milan, C. Shen, and I. Reid. Refinenet: Multi-path refinement networks with identity mappings for high-resolution semantic segmentation. In CVPR, 2017.
[50] G. Lin, C. Shen, A. van den Hengel, and I. Reid. Efficient piecewise training of deep structured models for semantic segmentation. In CVPR, 2016.
[51] T.-Y. Lin et al. Microsoft coco: Common objects in context. In ECCV, 2014.
[52] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In ECCV, 2018.
[53] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. Hierarchical representations for efficient architecture search. In ICLR, 2018.
[54] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv:1806.09055, 2018.
[55] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[56] W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better.
arXiv:1506.04579, 2015.
[57] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[58] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[59] R. Miikkulainen, J. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan, N. Duffy, and B. Hodjat. Evolving deep neural networks. arXiv:1703.00548, 2017.
[60] M. Mostajabi, P. Yadollahpour, and G. Shakhnarovich. Feedforward semantic segmentation with zoom-out features. In CVPR, 2015.
[61] R. Negrinho and G. Gordon. Deeparchitect: Automatically designing and training deep architectures. arXiv:1704.08792, 2017.
[62] G. Papandreou, I. Kokkinos, and P.-A. Savalle. Modeling local and global deformations in deep learning: Epitomic convolution, multiple instance learning, and sliding window detection. In CVPR, 2015.
[63] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy. Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In ECCV, 2018.
[64] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun. Large kernel matters – improve semantic segmentation by global convolutional network. In CVPR, 2017.
[65] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In ICML, 2018.
[66] P. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene labeling. In ICML, 2014.
[67] H. Qi, Z. Zhang, B. Xiao, H. Hu, B. Cheng, Y. Wei, and J. Dai. Deformable convolutional networks – coco detection and segmentation challenge 2017 entry. ICCV COCO Challenge Workshop, 2017.
[68] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. arXiv:1802.01548, 2018.
[69] E. Real, S. Moore, A. Selle, S. Saxena, Y. L.
Suematsu, J. Tan, Q. Le, and A. Kurakin. Large-scale evolution of image classifiers. In ICML, 2017.
[70] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[71] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
[72] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.
[73] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015.
[74] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In CVPR, 2018.
[75] S. Saxena and J. Verbeek. Convolutional neural fabrics. In NIPS, 2016.
[76] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. arXiv:1503.02351, 2015.
[77] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
[78] J. Shotton, J. Winn, C. Rother, and A. Criminisi. Textonboost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 2009.
[79] L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014.
[80] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012.
[81] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary computation, 2002.
[82] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
[83] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D.
Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
[84] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[85] V. Vanhoucke. Learning visual representations at scale. ICLR invited talk, 2014.
[86] M. Wang, B. Liu, and H. Foroosh. Design of efficient convolutional layers using single intra-channel convolution, topological subdivisioning and spatial "bottleneck" structure. arXiv:1608.04337, 2016.
[87] P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, and G. Cottrell. Understanding convolution for semantic segmentation. arXiv:1702.08502, 2017.
[88] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv:1609.08144, 2016.
[89] F. Xia, P. Wang, X. Chen, and A. L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.
[90] L. Xie and A. Yuille. Genetic cnn. In ICCV, 2017.
[91] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In CVPR, 2018.
[92] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In CVPR, 2012.
[93] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Learning a discriminative feature network for semantic segmentation. In CVPR, 2018.
[94] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
[95] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[96] Z. Zhang, X. Zhang, C. Peng, D. Cheng, and J. Sun. Exfuse: Enhancing feature fusion for semantic segmentation. In ECCV, 2018.
[97] H.
Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
[98] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.
[99] Z. Zhong, J. Yan, and C.-L. Liu. Practical network blocks design with q-learning. In AAAI, 2018.
[100] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In ICLR, 2017.
[101] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.