{"title": "NAT: Neural Architecture Transformer for Accurate and Compact Architectures", "book": "Advances in Neural Information Processing Systems", "page_first": 737, "page_last": 748, "abstract": "Designing effective architectures is one of the key factors behind the success of deep neural networks. Existing deep architectures are either manually designed or automatically searched by some Neural Architecture Search (NAS) methods. However, even a well-searched architecture may still contain many non-significant or redundant modules or operations (e.g., convolution or pooling), which may not only incur substantial memory consumption and computation cost but also deteriorate the performance. Thus, it is necessary to optimize the operations inside an architecture to improve the performance without introducing extra computation cost. Unfortunately, such a constrained optimization problem is NP-hard. To make the problem feasible, we cast the optimization problem into a Markov decision process (MDP) and seek to learn a Neural Architecture Transformer (NAT) to replace the redundant operations with the more computationally efficient ones (e.g., skip connection or directly removing the connection). Based on MDP, we learn NAT by exploiting reinforcement learning to obtain the optimization policies w.r.t. different architectures. To verify the effectiveness of the proposed strategies, we apply NAT on both hand-crafted architectures and NAS based architectures. 
Extensive experiments on two benchmark datasets, i.e., CIFAR-10 and ImageNet, demonstrate that the transformed architecture by NAT significantly outperforms both its original form and those architectures optimized by existing methods.", "full_text": "NAT: Neural Architecture Transformer for Accurate\n\nand Compact Architectures\n\nYong Guo\u2217, Yin Zheng\u2217, Mingkui Tan\u2217\u2020, Qi Chen,\n\nJian Chen\u2020, Peilin Zhao, Junzhou Huang\n\nSouth China University of Technology, Weixin Group, Tencent,\n\nTencent AI Lab, University of Texas at Arlington\n\n{guo.yong, sechenqi}@mail.scut.edu.cn, {mingkuitan, ellachen}@scut.edu.cn,\n\n{yinzheng, masonzhao}@tencent.com, jzhuang@uta.edu\n\nAbstract\n\nDesigning effective architectures is one of the key factors behind the success of\ndeep neural networks. Existing deep architectures are either manually designed\nor automatically searched by some Neural Architecture Search (NAS) methods.\nHowever, even a well-searched architecture may still contain many non-signi\ufb01cant\nor redundant modules or operations (e.g., convolution or pooling), which may\nnot only incur substantial memory consumption and computation cost but also\ndeteriorate the performance. Thus, it is necessary to optimize the operations inside\nan architecture to improve the performance without introducing extra computation\ncost. Unfortunately, such a constrained optimization problem is NP-hard. To make\nthe problem feasible, we cast the optimization problem into a Markov decision\nprocess (MDP) and seek to learn a Neural Architecture Transformer (NAT) to\nreplace the redundant operations with the more computationally ef\ufb01cient ones\n(e.g., skip connection or directly removing the connection). Based on MDP, we\nlearn NAT by exploiting reinforcement learning to obtain the optimization policies\nw.r.t. different architectures. 
To verify the effectiveness of the proposed strategies, we apply NAT on both hand-crafted architectures and NAS based architectures. Extensive experiments on two benchmark datasets, i.e., CIFAR-10 and ImageNet, demonstrate that the transformed architecture by NAT significantly outperforms both its original form and those architectures optimized by existing methods.

1 Introduction

Deep neural networks (DNNs) [25] have been producing state-of-the-art results in many challenging tasks including image classification [12, 23, 42, 57, 58, 18, 11, 53], face recognition [38, 43, 56], brain signal processing [33, 34], video analysis [50, 49] and many other areas [55, 54, 24, 3, 10, 9, 4]. One of the key factors behind the success lies in the innovation of neural architectures, such as VGG [40] and ResNet [13]. However, designing effective neural architectures is often labor-intensive and relies heavily on substantial human expertise. Moreover, the human-designed process cannot fully explore the whole architecture space and thus the designed architectures may not be optimal. Hence, there is a growing interest in replacing the manual process of architecture design with Neural Architecture Search (NAS).
Recently, substantial studies [29, 35, 61] have shown that the automatically discovered architectures are able to achieve highly competitive performance compared to existing hand-crafted architectures.

\u2217 Authors contributed equally.
\u2020 Corresponding author.
This work was done while Yong Guo was an intern at Tencent AI Lab.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Comparison between Neural Architecture Optimization (NAO) [31] and our Neural Architecture Transformer (NAT). Green blocks denote the two input nodes of the cell and blue blocks denote the intermediate nodes. Red blocks denote the connections that are changed by NAT. 
The\naccuracy and the number of parameters are evaluated on CIFAR-10 models.\n\nHowever, there are some limitations in NAS based architecture design methods. In fact, since there\nis an extremely large search space [35, 61] (e.g., billions of candidate architectures), these methods\noften produce sub-optimal architectures, leading to limited representation performance or substantial\ncomputation cost. Thus, even for a well-designed model, it is necessary yet important to optimize its\narchitecture (e.g., removing the redundant operations) to achieve better performance and/or reduce\nthe computation cost.\nTo optimize the architectures, Luo et al. recently proposed a neural architecture optimization (NAO)\nmethod [31]. Speci\ufb01cally, NAO \ufb01rst encodes an architecture into an embedding in continuous space\nand then conducts gradient descent to obtain a better embedding. After that, it uses a decoder to\nmap the embedding back to obtain an optimized architecture. However, NAO comes with its own\nset of limitations. First, NAO often produces a totally different architecture from the input one\nand may introduce extra parameters or additional computation cost. Second, similar to the NAS\nbased methods, NAO has a huge search space, which, however, may not be necessary for the task\nof architecture optimization and may make the optimization problem very expensive to solve. An\nillustrative comparison between our method and NAO can be found in Figure 1.\nUnlike existing methods that design neural architectures, we seek to design an architecture optimiza-\ntion method, called Neural Architecture Transformer (NAT), to optimize neural architectures. Since\nthe optimization problem is non-trivial to solve, we cast it into a Markov decision process (MDP).\nThus, the architecture optimization process is reduced to a series of decision making problems. 
Based on MDP, we seek to replace the expensive operations or redundant modules in the architecture with more computationally efficient ones. Specifically, NAT shall remove the redundant modules or replace these modules with skip connections. In this way, the search space can be significantly reduced. Thus, the training complexity to learn an architecture optimizer is smaller than that of NAS based methods, e.g., NAO. Last, it is worth mentioning that our NAT model can be used as a general architecture optimizer which takes any architecture as the input and outputs an optimized one. In experiments, we apply NAT to both hand-crafted and NAS based architectures and demonstrate the performance on two benchmark datasets, namely CIFAR-10 [22] and ImageNet [8].
The main contributions of this paper are summarized as follows.
\u2022 We propose a novel architecture optimization method, called Neural Architecture Transformer (NAT), to optimize arbitrary architectures in order to achieve better performance and/or reduce computation cost. To this end, NAT either removes the redundant paths or replaces the original operation with skip connection to improve the architecture design.
\u2022 We cast the architecture optimization problem into a Markov decision process (MDP), in which we seek to solve a series of decision making problems to optimize the operations. We then solve the MDP problem with policy gradient. To better exploit the adjacency information of operations in an architecture, we propose to exploit a graph convolutional network (GCN) to build the architecture optimization model.
\u2022 Extensive experiments demonstrate the effectiveness of our NAT on both hand-crafted and NAS based architectures. Specifically, for hand-crafted models (e.g., VGG), our NAT automatically introduces additional skip connections into the plain network and results in a 2.75% improvement in terms of Top-1 accuracy on ImageNet. 
For NAS based models (e.g., DARTS [29]), NAT reduces the number of parameters by 30% and achieves a 1.31% improvement in terms of Top-1 accuracy on ImageNet.

Figure 2: An example of the graph representation of a residual block and the diagram of operation transformations. (a) a residual block [13]; (b) a graph view of the residual block; (c) transformations among three kinds of operations. N denotes a null operation without any computation, S denotes a skip connection, and O denotes some computational modules other than null and skip connections.

2 Related Work

Hand-crafted architecture design. Many studies focus on architecture design and propose a series of deep neural architectures, such as Network-in-network [27], VGG [40] and so on. Unlike these plain networks that only contain a stack of convolutions, He et al. propose the residual network (ResNet) [13] by introducing residual shortcuts between different layers. However, the human-designed process often requires substantial human effort and cannot fully explore the whole architecture space, making the hand-crafted architectures often not optimal.
Neural architecture search. Recently, neural architecture search (NAS) methods have been proposed to automate the process of architecture design [61, 62, 35, 1, 59, 29, 2, 45, 41]. Some researchers conduct architecture search by modeling the architecture as a graph [51, 20]. Unlike these methods, DSO-NAS [52] finds the optimal architectures by starting from a fully connected block and then imposing sparse regularization [17, 44] to prune useless connections. Besides, Jin et al. propose a Bayesian optimization approach [19] to morph the deep architectures by inserting additional layers, adding more filters or introducing additional skip connections. More recently, Luo et al. 
propose the\nneural architecture optimization (NAO) [31] method to perform the architecture search on continuous\nspace by exploiting the encoding-decoding technique. However, NAO is essentially designed for\narchitecture search and often obtains very different architectures from the input architectures and\nmay introduce extra parameters. Unlike these methods, our method is able to optimize architectures\nwithout introducing extra computation cost (See the detailed comparison in Figure 1).\nArchitecture adaptation and model compression. Several methods [48, 7, 6] have been proposed\nto obtain compact architectures by learning the optimal settings of each convolution, including kernel\nsize, stride and the number of \ufb01lters. To obtain compact models, model compression methods [26,\n15, 30, 60] detect and remove the redundant channels from the original models. However, these\nmethods only change the settings of convolution but ignore the fact that adjusting the connections\nin the architecture could be more critical. Recently, Cao et al. propose an automatic architecture\ncompression method [5]. However, this method has to learn a compressed model for each given\npre-trained model and thus has limited generalization ability to different architectures. Unlike these\nmethods, we seek to learn a general optimizer for any arbitrary architecture.\n\n3 Neural Architecture Transformer\n\n3.1 Problem De\ufb01nition\n\nGiven an architecture space \u2126, we can represent an architecture \u03b1 as a directed acyclic graph (DAG),\ni.e., \u03b1 = (V,E), where V is a set of nodes that denote the feature maps in DNNs and E is an edge\nset [61, 35, 29], as shown in Figure 2. Here, the directed edge eij \u2208 E denotes some operation (e.g.,\nconvolution or max pooling) that transforms the feature map from node vi to vj. For convenience,\nwe divide the edges in E into three categories, namely, S, N, O, as shown in Figure 2(c). 
Here, S denotes the skip connection, N denotes the null connection (i.e., no edge between two nodes), and O denotes the operations other than skip connection or null connection (e.g., convolution or max pooling). Note that different operations have different costs. Specifically, let c(\u00b7) be a function to evaluate the computation cost. Obviously, we have c(O) > c(S) > c(N).
In this paper, we seek to design an architecture optimization method, called Neural Architecture Transformer (NAT), to optimize any given architecture into a better one with improved performance and/or less computation cost. To achieve this, an intuitive way is to replace the original operation with one of less computation cost, e.g., using the skip connection to replace convolution or using the null connection to replace skip connection. Although the skip connection has slightly higher cost than the null connection, it can often significantly improve the performance [13, 14]. Thus, we enable the transition from null connection to skip connection to increase the representation ability of deep networks. In summary, we constrain the possible transitions among O, S, and N in Figure 2(c) in order to reduce the computation cost.
Note that the architecture optimization on an entire network is still very computationally expensive. Moreover, we hope to learn a general architecture optimizer. Given these two concerns, we consider learning a computational cell as the building block of the final architecture. To build a cell, we follow the same settings as that in ENAS [35]. Specifically, each cell has two input nodes, i.e., v\u22122 and v\u22121, which denote the outputs of the second nearest and the nearest cell in front of the current one, respectively. Each intermediate node (marked as the blue box in Figure 1) also takes two previous nodes in this cell as inputs. 
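As a concrete illustration, the edge categories O/S/N and the transitions NAT permits (Figure 2(c)) can be sketched in plain Python. The names `COST_RANK`, `ALLOWED`, and `is_valid_transform` are ours, not the paper's code:

```python
# Sketch of the edge categories O/S/N and the transitions NAT permits.
# Relative cost ranking: c(O) > c(S) > c(N).
COST_RANK = {"O": 2, "S": 1, "N": 0}

# Allowed transitions: any edge may keep its type; O may be weakened to
# S or N; S and N may be exchanged (N -> S is allowed because a skip
# connection often improves accuracy at negligible extra cost). No edge
# may be upgraded to O, so the transformed cell never gains expensive
# computational operations.
ALLOWED = {
    "O": {"O", "S", "N"},
    "S": {"S", "N"},
    "N": {"N", "S"},
}

def is_valid_transform(before, after):
    """Check edge-wise that a transformed cell only uses legal transitions."""
    return all(after[e] in ALLOWED[before[e]] for e in before)

# A toy cell: edges keyed by (source, target) node; nodes -2 and -1 are
# the two input nodes of the cell.
cell = {(-2, 0): "O", (-1, 0): "O", (0, 1): "O", (-1, 1): "N"}
optimized = {(-2, 0): "O", (-1, 0): "S", (0, 1): "O", (-1, 1): "S"}

print(is_valid_transform(cell, optimized))               # True
print(is_valid_transform(cell, {**cell, (-1, 1): "O"}))  # False: N -> O forbidden
```

The transition table encodes exactly the constraint of Figure 2(c): cost can only stay equal or decrease, except for the explicitly permitted null-to-skip upgrade.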
Last, based on the learned cell, we are able to form any final network.

3.2 Markov Decision Process for Architecture Optimization

In this paper, we seek to learn a general architecture optimizer \u03b1 = NAT(\u03b2; \u03b8), which transforms any \u03b2 into an optimized \u03b1 and is parameterized by \u03b8. Here, we assume \u03b2 follows some distribution p(\u00b7), e.g., multivariate uniformly discrete distribution. Let w\u03b1 and w\u03b2 be the well-learned model parameters of architectures \u03b1 and \u03b2, respectively. We measure the performance of \u03b1 and \u03b2 by some metric R(\u03b1, w\u03b1) and R(\u03b2, w\u03b2), e.g., the accuracy on validation data. For convenience, we define the performance improvement between \u03b1 and \u03b2 by R(\u03b1|\u03b2) = R(\u03b1, w\u03b1) \u2212 R(\u03b2, w\u03b2).
To learn a good transformer \u03b1 = NAT(\u03b2; \u03b8) to optimize arbitrary \u03b2, we can maximize the expectation of performance improvement R(\u03b1|\u03b2) over the distribution of \u03b2 under a constraint of computation cost c(\u03b1) \u2264 \u03ba, where c(\u03b1) measures the cost of \u03b1 and \u03ba is an upper bound of the cost. Then, the optimization problem can be written as

max_\u03b8 E_{\u03b2\u223cp(\u00b7)} [R(\u03b1|\u03b2)], s.t. c(\u03b1) \u2264 \u03ba.  (1)

Unfortunately, it is non-trivial to directly obtain the optimal \u03b1 given different \u03b2. Nevertheless, following [61, 35], given any architecture \u03b2, we instead sample \u03b1 from some well learned policy, denoted by \u03c0(\u00b7|\u03b2; \u03b8), namely \u03b1 \u223c \u03c0(\u00b7|\u03b2; \u03b8). In other words, NAT first learns the policy and then conducts sampling from it to obtain the optimized architecture. In this sense, the parameters to be learned only exist in \u03c0(\u00b7|\u03b2; \u03b8). To learn the policy, we solve the following optimization problem:

max_\u03b8 E_{\u03b2\u223cp(\u00b7)} [E_{\u03b1\u223c\u03c0(\u00b7|\u03b2;\u03b8)} R(\u03b1|\u03b2)], s.t. c(\u03b1) \u2264 \u03ba, \u03b1 \u223c \u03c0(\u00b7|\u03b2; \u03b8),  (2)

where E_{\u03b2\u223cp(\u00b7)} [\u00b7] and E_{\u03b1\u223c\u03c0(\u00b7|\u03b2;\u03b8)} [\u00b7] denote the expectation operation over \u03b2 and \u03b1, respectively. This problem, however, is still very challenging to solve. First, the computation cost of deep networks can be evaluated by many metrics, such as the number of multiply-adds (MAdds), latency, and energy consumption, making it hard to find a comprehensive measure to accurately evaluate the cost. Second, the upper bound of computation cost \u03ba in Eqn. (1) may vary for different cases and thereby is hard to determine. Even if there already exists a specific upper bound, dealing with the constrained optimization problem is still a typical NP-hard problem. Third, how to compute E_{\u03b2\u223cp(\u00b7)} [E_{\u03b1\u223c\u03c0(\u00b7|\u03b2;\u03b8)} R(\u03b1|\u03b2)] remains a question.
To address the above challenges, we cast the optimization problem into an architecture transformation problem and reformulate it as a Markov decision process (MDP). Specifically, we optimize architectures by making a series of decisions to alternate the types of different operations. Following the transition graph in Figure 2(c), as c(O) > c(S) > c(N), we can naturally obtain more compact architectures than the given ones. In this sense, we can achieve the goal to optimize arbitrary architecture without introducing extra cost into the architecture. Thus, for the first two challenges, we do not have to evaluate the cost c(\u03b1) or determine the upper bound \u03ba. 
For the third challenge, we estimate the expectation value by sampling architectures from p(\u00b7) and \u03c0(\u00b7|\u03b2; \u03b8) (See details in Section 3.4).

MDP formulation details. A typical MDP [39] is defined by a tuple (S, A, P, R, q, \u03b3), where S is a finite set of states, A is a finite set of actions, P : S \u00d7 A \u00d7 S \u2192 R is the state transition distribution, R : S \u00d7 A \u2192 R is the reward function, q : S \u2192 [0, 1] is the distribution of the initial state, and \u03b3 \u2208 [0, 1] is a discount factor. Here, we define an architecture as a state and a transformation mapping \u03b2 \u2192 \u03b1 as an action. We use the accuracy improvement on the validation set as the reward. Since the problem is a one-step MDP, we can omit the discount factor \u03b3. Based on the problem definition, we transform any \u03b2 into an optimized architecture \u03b1 with the policy \u03c0(\u00b7|\u03b2; \u03b8). Then, the main challenge becomes how to learn an optimal policy \u03c0(\u00b7|\u03b2; \u03b8). Here, we exploit reinforcement learning [46] to solve the problem and propose an efficient policy learning algorithm.
Search space of NAT over a cell structure. For a cell structure with B nodes and 3 states for each edge, there are 2(B\u22123) edges and the size of the search space w.r.t. a specific \u03b2 is |\u2126\u03b2| = 3^{2(B\u22123)}. However, NAS methods [35, 61] have a large search space with the size of k^{2(B\u22123)}((B \u2212 2)!)^2, where k is the number of candidate operations (e.g., k=5 in ENAS [35] and k=8 in DARTS [29]).

3.3 Policy Learning by Graph Convolutional Neural Networks

To learn the optimal policy \u03c0(\u00b7|\u03b2; \u03b8) w.r.t. an arbitrary architecture \u03b2, we propose an effective learning method to optimize the operations inside the architecture. 
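The search-space sizes quoted above can be checked with a few lines of arithmetic. The helper names and the choice B = 7, k = 8 are ours, for illustration only:

```python
# Size of NAT's search space vs. a NAS search space for one cell with B
# nodes: NAT only chooses among 3 states per edge, over 2(B-3) edges,
# while NAS additionally picks one of k operations per edge and the
# wiring of the intermediate nodes, giving k^(2(B-3)) * ((B-2)!)^2.
from math import factorial

def nat_space(B):
    return 3 ** (2 * (B - 3))

def nas_space(B, k):
    return k ** (2 * (B - 3)) * factorial(B - 2) ** 2

B = 7  # a hypothetical cell size
print(nat_space(B))       # 3^8 = 6561
print(nas_space(B, k=8))  # 8^8 * (5!)^2, many orders of magnitude larger
```

The gap between the two quantities is why learning an architecture *optimizer* is much cheaper than learning an architecture *searcher*.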
Specifically, we take an arbitrary architecture graph \u03b2 as the input and output the optimization policy w.r.t. \u03b2. Such a policy is used to optimize the operations of the given architecture. Since the choice of operation on an edge depends on the adjacent nodes and edges, we have to consider the attributes of both the current edge and its neighbors. For this reason, we employ a graph convolutional network (GCN) [21] to exploit the adjacency information of the operations in the architecture. Here, an architecture graph can be represented by a data pair (A, X), where A denotes the adjacency matrix of the graph and X denotes the attributes of the nodes together with their two input edges.3 We consider a two-layer GCN and formulate the model as:

Z = f(X, A) = Softmax(A \u03c3(A X W(0)) W(1) WFC),  (3)

where W(0) and W(1) denote the weights of two graph convolution layers, WFC denotes the weight of the fully-connected layer, \u03c3 is a non-linear activation function (e.g., the Rectified Linear Unit (ReLU) [32]), and Z refers to the probability distribution of different candidate operations on the edges, i.e., the learned policy \u03c0(\u00b7|\u03b2; \u03b8). For convenience, we denote \u03b8 = {W(0), W(1), WFC} as the parameters of the architecture transformer. To cover all possible architectures, we randomly sample architectures from the whole architecture space and use them to train our model.
Differences with LSTM. The architecture graph can also be processed by the long short-term memory (LSTM) [16], which is a common practice in NAS methods [31, 61, 35]. In these methods, LSTM first treats the graph as a sequence of tokens and then learns the information from the sequence. However, turning a graph into a sequence of tokens may lose some connectivity information of the graph, leading to limited performance. 
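A minimal pure-Python sketch of the two-layer GCN forward pass in Eqn. (3); the paper's actual implementation is in PyTorch, and the tiny dimensions and random weights here are ours:

```python
# Pure-Python sketch of Eqn. (3): Z = Softmax(A * sigma(A X W0) W1 Wfc),
# where A is the adjacency matrix, X the node/edge attributes, and each
# row of Z is a probability distribution over the 3 edge states {O, S, N}.
import math
import random

def matmul(P, Q):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Q)] for row in P]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def softmax_rows(M):
    out = []
    for row in M:
        m = max(row)
        e = [math.exp(v - m) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def gcn_policy(A, X, W0, W1, Wfc):
    H = relu(matmul(matmul(A, X), W0))         # first graph convolution
    Z = matmul(matmul(matmul(A, H), W1), Wfc)  # second conv + fully-connected
    return softmax_rows(Z)                     # the policy pi(.|beta; theta)

random.seed(0)
n, d, h = 4, 5, 6  # 4 nodes, 5-dim attributes, hidden size 6 (made up)
A = [[1.0 if abs(i - j) <= 1 else 0.0 for j in range(n)] for i in range(n)]
X = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
W0 = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(d)]
W1 = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(h)]
Wfc = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(h)]
Z = gcn_policy(A, X, W0, W1, Wfc)
print([round(sum(row), 6) for row in Z])  # each row sums to 1: a valid policy
```

Because every output row is produced by a softmax, sampling one of the three edge states per row is well defined, which is exactly how the policy is used later.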
On the contrary, our GCN model can better exploit the\ninformation from the graph and yield superior performance (See results in Section 4.4).\n\n3.4 Training and Inference of NAT\n\nWe apply the policy gradient [46] to train our model. The overall scheme is shown in Algorithm 1,\nwhich employs an alternating manner. Speci\ufb01cally, in each training epoch, we \ufb01rst train the model\nparameters w with \ufb01xed transformer parameters \u03b8. Then, we train the transformer parameters \u03b8 by\n\ufb01xing the model parameters w.\nTraining the model parameters w. Given any \u03b8, we need to update the model parameters w\nbased on the training data. Here, to accelerate the training process, we adopt the parameter sharing\ntechnique [35], i.e., we construct a large computational graph, where each subgraph represents a\nneural network architecture, hence forcing all architectures to share the parameters. Thus, we can use\nthe shared parameters w to represent the parameters for different architectures. For any architecture\n\u03b2 \u223c p(\u00b7), let L(\u03b2, w) be the loss function on the training data, e.g., the cross-entropy loss. 
Then, given any m sampled architectures, the updating rule for w with parameter sharing can be given by w \u2190 w \u2212 \u03b7 (1/m) \u2211_{i=1}^{m} \u2207_w L(\u03b2_i, w), where \u03b7 is the learning rate.

3 Due to the page limit, we put the detailed representation methods in the supplementary.

Algorithm 1 Training method for Neural Architecture Transformer (NAT).
Require: The number of sampled input architectures in an iteration m, the number of sampled optimized architectures for each input architecture n, learning rate \u03b7, regularizer parameter \u03bb in Eqn. (4), input architecture distribution p(\u00b7), shared model parameters w, transformer parameters \u03b8.
1: Initiate w and \u03b8.
2: while not convergent do
3:   // Fix \u03b8 and update w.
4:   for each iteration on training data do
5:     Sample \u03b2_i \u223c p(\u00b7) to construct a batch {\u03b2_i}_{i=1}^{m}.
6:     Update the model parameters w by descending the gradient:
7:       w \u2190 w \u2212 \u03b7 (1/m) \u2211_{i=1}^{m} \u2207_w L(\u03b2_i, w).
8:   end for
9:   // Fix w and update \u03b8.
10:   for each iteration on validation data do
11:     Sample \u03b2_i \u223c p(\u00b7) to construct a batch {\u03b2_i}_{i=1}^{m}.
12:     Obtain {\u03b1_j}_{j=1}^{n} according to the policy learned by GCN.
13:     Update the transformer parameters \u03b8 by ascending the gradient:
14:       \u03b8 \u2190 \u03b8 + \u03b7 (1/mn) \u2211_{i=1}^{m} \u2211_{j=1}^{n} [\u2207_\u03b8 log \u03c0(\u03b1_j|\u03b2_i; \u03b8)(R(\u03b1_j, w) \u2212 R(\u03b2_i, w)) + \u03bb \u2207_\u03b8 H(\u03c0(\u00b7|\u03b2_i; \u03b8))].
15:   end for
16: end while

Training the transformer parameters \u03b8. 
We train the transformer model with policy gradient [46]. To encourage exploration, we introduce an entropy regularization term into the objective to prevent the transformer from converging to a local optimum too quickly [62], e.g., selecting the \u201coriginal\u201d option for all the operations. Given the shared parameters w, the objective can be formulated as

J(\u03b8) = E_{\u03b2\u223cp(\u00b7)} [E_{\u03b1\u223c\u03c0(\u00b7|\u03b2;\u03b8)} [R(\u03b1, w) \u2212 R(\u03b2, w)] + \u03bb H(\u03c0(\u00b7|\u03b2; \u03b8))]
     = \u2211_\u03b2 p(\u03b2) [\u2211_\u03b1 \u03c0(\u03b1|\u03b2; \u03b8)(R(\u03b1, w) \u2212 R(\u03b2, w)) + \u03bb H(\u03c0(\u00b7|\u03b2; \u03b8))],  (4)

where p(\u03b2) is the probability to sample some architecture \u03b2 from the distribution p(\u00b7), \u03c0(\u03b1|\u03b2; \u03b8) is the probability to sample some architecture \u03b1 from the distribution \u03c0(\u00b7|\u03b2; \u03b8), H(\u00b7) evaluates the entropy of the policy, and \u03bb controls the strength of the entropy regularization term. For each input architecture, we sample n optimized architectures {\u03b1_j}_{j=1}^{n} from the distribution \u03c0(\u00b7|\u03b2; \u03b8) in each iteration. Thus, the gradient of Eqn. (4) w.r.t. \u03b8 becomes4

\u2207_\u03b8 J(\u03b8) \u2248 (1/mn) \u2211_{i=1}^{m} \u2211_{j=1}^{n} [\u2207_\u03b8 log \u03c0(\u03b1_j|\u03b2_i; \u03b8)(R(\u03b1_j, w) \u2212 R(\u03b2_i, w)) + \u03bb \u2207_\u03b8 H(\u03c0(\u00b7|\u03b2_i; \u03b8))].  (5)

The regularization term H(\u03c0(\u00b7|\u03b2_i; \u03b8)) encourages the distribution \u03c0(\u00b7|\u03b2; \u03b8) to have high entropy, i.e., high diversity in the decisions on the edges. Thus, the decisions for some operations would be encouraged to choose the \u201cidentity\u201d or \u201cnull\u201d operations during training. 
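The entropy-regularized policy-gradient update can be illustrated on a toy problem with one edge and three candidate states. This is our own sketch, not the paper's trainer: the rewards are invented, and for determinism we use the exact expectation over actions rather than the Monte-Carlo samples of Eqn. (5):

```python
# Toy entropy-regularized policy-gradient update for a single edge with
# three candidate states. Rewards are made up; the exact expectation over
# actions replaces sampling so the run is deterministic.
import math

REWARD = [0.0, 0.5, 0.2]  # hypothetical R(alpha_j, w) for each state
BASELINE = 0.1            # hypothetical R(beta_i, w) of the input architecture

def softmax(logits):
    m = max(logits)
    e = [math.exp(v - m) for v in logits]
    s = sum(e)
    return [v / s for v in e]

def train(steps=500, lr=0.5, lam=0.01):
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        p = softmax(logits)
        adv = [r - BASELINE for r in REWARD]            # R(alpha) - R(beta)
        mean_adv = sum(pi * a for pi, a in zip(p, adv))
        H = -sum(pi * math.log(pi) for pi in p)         # policy entropy
        for i in range(3):
            grad_reward = p[i] * (adv[i] - mean_adv)    # expected score-function term
            grad_entropy = -p[i] * (math.log(p[i]) + H) # d H / d logit_i
            logits[i] += lr * (grad_reward + lam * grad_entropy)
    return softmax(logits)

probs = train()
print(max(range(3), key=lambda i: probs[i]))  # 1: the highest-reward state wins
```

The entropy bonus keeps the policy from collapsing in the first few steps, while the advantage term eventually concentrates nearly all probability on the best state.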
As a result, NAT is able to sufficiently explore the whole search space to find the optimal architecture.
Inferring the optimized architecture. We do not explicitly obtain the optimized architecture via \u03b1 = NAT(\u03b2; \u03b8). Instead, we conduct sampling according to the learned probability distribution. Specifically, we first sample several candidate optimized architectures from the learned policy \u03c0(\u00b7|\u03b2; \u03b8) and then select the architecture with the highest validation accuracy. Note that we can also obtain the optimized architecture by selecting the operation with the maximum probability, which, however, tends to reach a local optimum and yields worse results than the sampling based method (See comparisons in Section 4.4).

4 Experiments

In this section, we apply NAT on both hand-crafted and NAS based architectures, and conduct experiments on two image classification benchmark datasets, i.e., CIFAR-10 [22] and ImageNet [8]. All implementations are based on PyTorch.5

4 We put the derivations of Eqn. (5) in the supplementary.
5 The source code of NAT is available at https://github.com/guoyongcs/NAT.

Table 1: Performance comparisons of the optimized architectures obtained by different methods based on hand-crafted architectures. \u201c/\u201d denotes the original models that are not changed by architecture optimization methods.

CIFAR-10:
Model | Method | #Params (M) | #MAdds (M) | Acc. (%)
VGG16 | / | 15.2 | 313 | 93.56
VGG16 | NAO [31] | 19.5 | 548 | 95.72
VGG16 | NAT | 15.2 | 315 | 96.04
ResNet20 | / | 0.3 | 41 | 91.37
ResNet20 | NAO [31] | 0.4 | 61 | 92.44
ResNet20 | NAT | 0.3 | 42 | 92.95
ResNet56 | / | 0.9 | 127 | 93.21
ResNet56 | NAO [31] | 1.3 | 199 | 95.27
ResNet56 | NAT | 0.9 | 129 | 95.40
MobileNetV2 | / | 2.3 | 91 | 94.47
MobileNetV2 | NAO [31] | 2.9 | 131 | 94.75
MobileNetV2 | NAT | 2.3 | 92 | 95.17

ImageNet:
Model | Method | #Params (M) | #MAdds (M) | Top-1 (%) | Top-5 (%)
VGG16 | / | 138.4 | 15620 | 71.6 | 90.4
VGG16 | NAO [31] | 147.7 | 18896 | 72.9 | 91.3
VGG16 | NAT | 138.4 | 15693 | 74.3 | 92.0
ResNet18 | / | 11.7 | 1580 | 69.8 | 89.1
ResNet18 | NAO [31] | 17.9 | 2246 | 70.8 | 89.7
ResNet18 | NAT | 11.7 | 1588 | 71.1 | 90.0
ResNet50 | / | 25.6 | 3530 | 76.2 | 92.9
ResNet50 | NAO [31] | 34.8 | 4505 | 77.4 | 93.2
ResNet50 | NAT | 25.6 | 3547 | 77.7 | 93.5
MobileNetV2 | / | 3.4 | 300 | 72.0 | 90.3
MobileNetV2 | NAO [31] | 4.5 | 513 | 72.2 | 90.6
MobileNetV2 | NAT | 3.4 | 302 | 72.5 | 91.0

Figure 3: Architecture optimization results of hand-crafted architectures. We provide both the views of graph (left) and network (right) to show the differences in architecture.

4.1 Implementation Details

We consider two kinds of cells in a deep network, including the normal cell and the reduction cell. The normal cell preserves the same spatial size as inputs while the reduction cell reduces the spatial size by 2\u00d7. Both the normal and reduction cells contain 2 input nodes and a number of intermediate nodes. During training, we build the deep network by stacking 8 basic cells and train the transformer for 100 epochs. We set m = 1, n = 1, and \u03bb = 0.003 in the training. We split the CIFAR-10 training set into 40% and 60% slices to train the model parameters w and the transformer parameters \u03b8, respectively. As for the evaluation of the networks with different architectures, we replace the original cell with the optimized one and train the model from scratch. Please see more details in the supplementary. 
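Recall the sampling-based inference from Section 3.4: draw several candidates from \u03c0(\u00b7|\u03b2; \u03b8) and keep the one with the best validation accuracy. A toy sketch, in which the policy values and the `validate` scorer are stand-ins we invented rather than the paper's evaluation code:

```python
# Sketch of NAT's inference step: sample candidate architectures from the
# learned per-edge policy and keep the one scoring best on validation.
import random

OPS = ["O", "S", "N"]

def sample_architecture(policy, rng):
    """policy: {edge: probability over OPS}; returns one candidate."""
    return {edge: rng.choices(OPS, weights=p)[0] for edge, p in policy.items()}

def validate(arch):
    # Toy scorer standing in for real validation accuracy; it simply
    # pretends skips help most and null connections are neutral.
    score = {"O": 0.5, "S": 0.7, "N": 0.6}
    return sum(score[op] for op in arch.values()) / len(arch)

def infer(policy, n_candidates=10, seed=0):
    rng = random.Random(seed)
    candidates = [sample_architecture(policy, rng) for _ in range(n_candidates)]
    return max(candidates, key=validate)

policy = {
    (-2, 0): [0.8, 0.1, 0.1],  # mostly keep the operation
    (-1, 0): [0.1, 0.6, 0.3],  # likely replace with a skip connection
    (0, 1):  [0.2, 0.2, 0.6],  # likely remove the connection
}
best = infer(policy)
print(len(best))  # 3 edges, one decision per edge
```

Taking the per-edge argmax instead of sampling would correspond to the Maximum-GCN variant compared in Section 4.4.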
For all the considered architectures, we follow the same settings of the original papers. In the experiments, we only apply cutout to the NAS based architectures on CIFAR-10.

4.2 Results on Hand-crafted Architectures

In this experiment, we apply NAT on three popular hand-crafted models, i.e., VGG [40], ResNet [13], and MobileNet [37]. To make all architectures share the same graph representation method defined in Section 3.2, we add null connections into the hand-crafted architectures to ensure that each node has two input nodes (See examples in Figure 3). For a fair comparison, we build deep networks using the original and optimized architectures while keeping the same depth and number of channels as the original models. We compare NAT with a strong baseline method, Neural Architecture Optimization (NAO) [31]. We show the results in Table 1 and the corresponding architectures in Figure 3. From Table 1, although the models with NAO yield better performance than the original ones, they often have more parameters and higher computation cost. By contrast, our NAT based models consistently outperform the original models by a large margin with approximately the same computation cost.

Table 2: Comparisons of the optimized architectures obtained by different methods based on NAS based architectures. \u201c-\u201d denotes that the results are not reported. \u201c/\u201d denotes the original models that are not changed by architecture optimization methods. \u2020 denotes the models trained with cutout.

CIFAR-10:
Model | Method | #Params (M) | #MAdds (M) | Acc. (%)
AmoebaNet\u2020 [36] | / | 3.2 | - | 96.73
PNAS\u2020 [28] | / | 3.2 | - | 96.67
SNAS\u2020 [47] | / | 2.9 | - | 97.08
GHN\u2020 [51] | / | 5.7 | - | 97.22
ENAS\u2020 [35] | / | 4.6 | 804 | 97.11
ENAS\u2020 | NAO [31] | 4.5 | 763 | 97.05
ENAS\u2020 | NAT | 4.6 | 804 | 97.24
DARTS\u2020 [29] | / | 3.3 | 528 | 97.06
DARTS\u2020 | NAO [31] | 3.5 | 577 | 97.09
DARTS\u2020 | NAT | 2.7 | 424 | 97.28
NAONet\u2020 [31] | / | 128 | 66016 | 97.89
NAONet\u2020 | NAO [31] | 143 | 73705 | 97.91
NAONet\u2020 | NAT | 113 | 58326 | 98.01

ImageNet:
Model | Method | #Params (M) | #MAdds (M) | Top-1 (%) | Top-5 (%)
AmoebaNet [36] | / | 5.1 | 555 | 74.5 | 92.0
PNAS [28] | / | 5.1 | 588 | 74.2 | 91.9
SNAS [47] | / | 4.3 | 522 | 72.7 | 90.8
GHN [51] | / | 6.1 | 569 | 73.0 | 91.3
ENAS [35] | / | 5.6 | 607 | 73.8 | 91.7
ENAS | NAO [31] | 5.5 | 589 | 73.7 | 91.7
ENAS | NAT | 5.6 | 607 | 73.9 | 91.8
DARTS [29] | / | 4.7 | 574 | 73.1 | 91.0
DARTS | NAO [31] | 5.1 | 627 | 73.3 | 91.1
DARTS | NAT | 4.0 | 441 | 73.7 | 91.4
NAONet [31] | / | 11.35 | 1360 | 74.3 | 91.8
NAONet | NAO [31] | 11.83 | 1417 | 74.5 | 92.0
NAONet | NAT | 8.36 | 1025 | 74.8 | 92.3

Figure 4: Architecture optimization results on the architectures of NAS based architectures.

Table 3: Performance comparisons of the architectures obtained by different methods on CIFAR-10. The reported accuracy (%) is the average performance of five runs with different random seeds. \u201c/\u201d denotes the original models that are not changed by architecture optimization methods. \u2020 denotes the models trained with cutout.

Method | VGG16 | ResNet20 | MobileNetV2 | ENAS\u2020 | DARTS\u2020 | NAONet\u2020
/ | 93.56 | 91.37 | 94.47 | 97.11 | 97.06 | 97.89
Random Search | 93.17 | 91.56 | 94.38 | 96.58 | 95.17 | 96.31
LSTM | 94.45 | 92.19 | 95.01 | 97.05 | 97.05 | 97.93
Maximum-GCN | 94.37 | 92.57 | 94.87 | 96.92 | 97.00 | 97.90
Sampling-GCN (Ours) | 95.93 | 92.97 | 95.13 | 97.21 | 97.26 | 97.99

4.3 Results on NAS Based Architectures

For the automatically searched architectures, we evaluate the proposed NAT on three state-of-the-art NAS based architectures, i.e., DARTS [29], NAONet [31], and ENAS [35]. 
Moreover, we also compare our optimized architectures with other NAS based architectures, including AmoebaNet [36], PNAS [28], SNAS [47], and GHN [51]. From Table 2, all the NAT based architectures yield higher accuracy than their baseline models and the models optimized by NAO on both CIFAR-10 and ImageNet. Compared with other NAS based architectures, our NAT-DARTS performs the best on CIFAR-10 and achieves competitive performance compared to the best architecture (i.e., AmoebaNet) on ImageNet, with lower computation cost and fewer parameters. We also visualize the original and optimized cells in Figure 4. For DARTS and NAONet, NAT replaces several redundant operations with skip connections or directly removes the connections, leading to fewer parameters. When optimizing ENAS, NAT removes the average pooling operation and improves the performance without introducing extra computation.

4.4 Comparisons of Different Policy Learners

In this experiment, we compare the performance of different policy learners, including Random Search, LSTM, and the GCN method. For the Random Search method, we perform random transitions among O, S, and N on the input architectures. For the GCN method, we consider two variants, which infer the optimized architecture either by sampling from the learned policy (denoted by Sampling-GCN) or by selecting the operation with the maximum probability (denoted by Maximum-GCN). From Table 3, our Sampling-GCN method outperforms all the other considered policies on different architectures. These results demonstrate the superiority of the proposed GCN method as the policy learner.

4.5 Effect of Different Graph Representations on Hand-crafted Architectures

In this experiment, we investigate the effect of different graph representations on hand-crafted architectures.
Note that an architecture may correspond to many different topological graphs, especially for hand-crafted architectures, e.g., VGG and ResNet, where the number of nodes is smaller than that of our basic cell. For convenience, we study three different graphs for VGG and ResNet20, respectively. The average accuracy of NAT-VGG is 95.83%, which outperforms the baseline VGG with an accuracy of 93.56%. Similarly, our NAT-ResNet20 yields an average accuracy of 92.48%, which is also better than that of the original model. We report the architecture and the performance of each possible representation in the supplementary material. In practice, the graph representation may influence the result of NAT, and how to alleviate this effect remains an open question.

5 Conclusion

In this paper, we have proposed a novel Neural Architecture Transformer (NAT) for the task of architecture optimization. To solve this problem, we cast it into a Markov decision process (MDP) by making a series of decisions to replace existing operations with more computationally efficient ones, including the skip connection and the null operation. To show the effectiveness of NAT, we apply it to both hand-crafted architectures and Neural Architecture Search (NAS) based architectures. Extensive experiments on the CIFAR-10 and ImageNet datasets demonstrate the effectiveness of the proposed method in improving the accuracy and compactness of neural architectures.

Acknowledgments

This work was partially supported by Guangdong Provincial Scientific and Technological Funds under Grants 2018B010107001, National Natural Science Foundation of China (NSFC) (No. 61602185), key project of NSFC (No. 61836003), Fundamental Research Funds for the Central Universities (No. D2191240), Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183, Tencent AI Lab Rhino-Bird Focused Research Program (No.
JR201902), Guangdong Special Branch Plans Young Talent with Scientific and Technological Innovation (No. 2016TQ03X445), Guangzhou Science and Technology Planning Project (No. 201904010197), and Microsoft Research Asia (MSRA Collaborative Research Program). We also thank Tencent AI Lab.

References

[1] B. Baker, O. Gupta, N. Naik, and R. Raskar. Designing neural network architectures using reinforcement learning. In International Conference on Learning Representations, 2017.

[2] H. Cai, L. Zhu, and S. Han. ProxylessNAS: Direct neural architecture search on target task and hardware. In International Conference on Learning Representations, 2019.

[3] J. Cao, Y. Guo, Q. Wu, C. Shen, J. Huang, and M. Tan. Adversarial learning with local coordinate coding. In International Conference on Machine Learning, pages 706-714, 2018.

[4] J. Cao, L. Mo, Y. Zhang, K. Jia, C. Shen, and M. Tan. Multi-marginal Wasserstein GAN. In Advances in Neural Information Processing Systems, 2019.

[5] S. Cao, X. Wang, and K. M. Kitani. Learnable embedding space for efficient neural architecture compression. In International Conference on Learning Representations, 2019.

[6] T. Chen, I. Goodfellow, and J. Shlens. Net2Net: Accelerating learning via knowledge transfer. In International Conference on Learning Representations, 2016.

[7] X. Dai, P. Zhang, B. Wu, H. Yin, F. Sun, Y. Wang, M. Dukhan, Y. Hu, Y. Wu, Y. Jia, et al. ChamNet: Towards efficient network design through platform-aware model adaptation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 11398-11407, 2019.

[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 248-255, 2009.

[9] Y. Guo, Q. Chen, J. Chen, J. Huang, Y. Xu, J. Cao, P. Zhao, and M. Tan. Dual reconstruction nets for image super-resolution with gradient sensitive loss. arXiv preprint arXiv:1809.07099, 2018.

[10] Y. Guo, Q. Chen, J. Chen, Q. Wu, Q. Shi, and M. Tan. Auto-embedding generative adversarial networks for high resolution image synthesis. IEEE Transactions on Multimedia, 2019.

[11] Y. Guo, M. Tan, Q. Wu, J. Chen, A. V. D. Hengel, and Q. Shi. The shallow end: Empowering shallower deep-convolutional networks through auxiliary outputs. arXiv preprint arXiv:1611.01773, 2016.

[12] Y. Guo, Q. Wu, C. Deng, J. Chen, and M. Tan. Double forward propagation for memorized batch normalization. In AAAI Conference on Artificial Intelligence, pages 3134-3141, 2018.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[14] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In The European Conference on Computer Vision, pages 630-645, 2016.

[15] Y. He, X. Zhang, and J. Sun. Channel pruning for accelerating very deep neural networks. In The IEEE International Conference on Computer Vision, pages 1398-1406, 2017.

[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

[17] Z. Huang and N. Wang. Data-driven sparse structure selection for deep neural networks. In The European Conference on Computer Vision, pages 304-320, 2018.

[18] Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. In International Joint Conference on Artificial Intelligence, pages 1965-1972, 2017.

[19] H. Jin, Q. Song, and X. Hu. Auto-Keras: Efficient neural architecture search with network morphism. arXiv preprint arXiv:1806.10282, 2018.

[20] W. Jin, K. Yang, R. Barzilay, and T. Jaakkola. Learning multimodal graph-to-graph translation for molecular optimization. In International Conference on Learning Representations, 2019.

[21] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2016.

[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[24] S. Lauly, Y. Zheng, A. Allauzen, and H. Larochelle. Document neural autoregressive distribution estimation. The Journal of Machine Learning Research, 18(1):4046-4069, 2017.

[25] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541-551, 1989.

[26] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning filters for efficient convnets. In International Conference on Learning Representations, 2017.

[27] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations, 2014.

[28] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. Progressive neural architecture search. In The European Conference on Computer Vision, pages 19-34, 2018.

[29] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture search. In International Conference on Learning Representations, 2019.

[30] J.-H. Luo, J. Wu, and W. Lin. ThiNet: A filter level pruning method for deep neural network compression. In The IEEE International Conference on Computer Vision, pages 5058-5066, 2017.

[31] R. Luo, F. Tian, T. Qin, E. Chen, and T.-Y. Liu. Neural architecture optimization. In Advances in Neural Information Processing Systems, pages 7816-7827, 2018.

[32] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807-814, 2010.

[33] C. S. Nam, A. Nijholt, and F. Lotte. Brain-Computer Interfaces Handbook: Technological and Theoretical Advances. CRC Press, 2018.

[34] J. Pan, Y. Li, and J. Wang. An EEG-based brain-computer interface for emotion recognition. In International Joint Conference on Neural Networks, pages 2063-2067. IEEE, 2016.

[35] H. Pham, M. Guan, B. Zoph, Q. Le, and J. Dean. Efficient neural architecture search via parameter sharing. In International Conference on Machine Learning, pages 4095-4104, 2018.

[36] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Regularized evolution for image classifier architecture search. In AAAI Conference on Artificial Intelligence, volume 33, pages 4780-4789, 2019.

[37] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 4510-4520, 2018.

[38] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 815-823, 2015.

[39] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889-1897, 2015.

[40] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.

[41] D. R. So, C. Liang, and Q. V. Le. The evolved transformer. In International Conference on Machine Learning, 2019.

[42] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Advances in Neural Information Processing Systems, pages 2377-2385, 2015.

[43] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2892-2900, 2015.

[44] M. Tan, I. W. Tsang, and L. Wang. Towards ultrahigh dimensional feature selection for big data. The Journal of Machine Learning Research, 15(1):1371-1429, 2014.

[45] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008, 2017.

[46] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[47] S. Xie, H. Zheng, C. Liu, and L. Lin. SNAS: Stochastic neural architecture search. In International Conference on Learning Representations, 2019.

[48] T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, M. Sandler, V. Sze, and H. Adam. NetAdapt: Platform-aware neural network adaptation for mobile applications. In The European Conference on Computer Vision, pages 285-300, 2018.

[49] R. Zeng, C. Gan, P. Chen, W. Huang, Q. Wu, and M. Tan. Breaking winner-takes-all: Iterative-winners-out networks for weakly supervised temporal action localization. IEEE Transactions on Image Processing, 28(12):5797-5808, 2019.

[50] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. Graph convolutional networks for temporal action localization. In The IEEE International Conference on Computer Vision, 2019.

[51] C. Zhang, M. Ren, and R. Urtasun. Graph hypernetworks for neural architecture search. In International Conference on Learning Representations, 2019.

[52] X. Zhang, Z. Huang, and N. Wang. You only search once: Single shot neural architecture search via direct sparse optimization. arXiv preprint arXiv:1811.01567, 2018.

[53] Y. Zhang, H. Chen, Y. Wei, P. Zhao, J. Cao, X. Fan, X. Lou, H. Liu, J. Hou, X. Han, et al. From whole slide imaging to microscopy: Deep microscopy adaptation network for histopathology cancer image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 360-368. Springer, 2019.

[54] Y. Zheng, C. Liu, B. Tang, and H. Zhou. Neural autoregressive collaborative filtering for implicit feedback. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 2-6. ACM, 2016.

[55] Y. Zheng, B. Tang, W. Ding, and H. Zhou. A neural autoregressive approach to collaborative filtering. In International Conference on Machine Learning, pages 764-773, 2016.

[56] Y. Zheng, R. S. Zemel, Y.-J. Zhang, and H. Larochelle. A neural autoregressive approach to attention-based recognition. International Journal of Computer Vision, 113(1):67-79, 2015.

[57] Y. Zheng, Y.-J. Zhang, and H. Larochelle. Topic modeling of multimodal data: An autoregressive approach. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 1370-1377, 2014.

[58] Y. Zheng, Y.-J. Zhang, and H. Larochelle. A deep and autoregressive approach for topic modeling of multimodal data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(6):1056-1069, 2015.

[59] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-L. Liu. Practical block-wise neural network architecture generation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2423-2432, 2018.

[60] Z. Zhuang, M. Tan, B. Zhuang, J. Liu, Y. Guo, Q. Wu, J. Huang, and J. Zhu. Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pages 875-886, 2018.

[61] B. Zoph and Q. V. Le. Neural architecture search with reinforcement learning. In International Conference on Learning Representations, 2017.

[62] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. Learning transferable architectures for scalable image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 8697-8710, 2018.