{"title": "Modular Networks: Learning to Decompose Neural Computation", "book": "Advances in Neural Information Processing Systems", "page_first": 2408, "page_last": 2418, "abstract": "Scaling model capacity has been vital in the success of deep learning. For a typical network, necessary compute resources and training time grow dramatically with model size. Conditional computation is a promising way to increase the number of parameters with a relatively small increase in resources. We propose a training algorithm that flexibly chooses neural modules based on the data to be processed. Both the decomposition and modules are learned end-to-end. In contrast to existing approaches, training does not rely on regularization to enforce diversity in module use. We apply modular networks both to image recognition and language modeling tasks, where we achieve superior performance compared to several baselines. Introspection reveals that modules specialize in interpretable contexts.", "full_text": "Modular Networks:\n\nLearning to Decompose Neural Computation\n\nLouis Kirsch\u2217\n\nDepartment of Computer Science\n\nUniversity College London\nmail@louiskirsch.com\n\nJulius Kunze\n\nDepartment of Computer Science\n\nUniversity College London\njuliuskunze@gmail.com\n\nDavid Barber\n\nDepartment of Computer Science\n\nUniversity College London\ndavid.barber@ucl.ac.uk\n\nAbstract\n\nScaling model capacity has been vital in the success of deep learning. For a\ntypical network, necessary compute resources and training time grow dramatically\nwith model size. Conditional computation is a promising way to increase the\nnumber of parameters with a relatively small increase in resources. We propose\na training algorithm that \ufb02exibly chooses neural modules based on the data to\nbe processed. Both the decomposition and modules are learned end-to-end. In\ncontrast to existing approaches, training does not rely on regularization to enforce\ndiversity in module use. 
We apply modular networks both to image recognition\nand language modeling tasks, where we achieve superior performance compared\nto several baselines. Introspection reveals that modules specialize in interpretable\ncontexts.\n\n1 Introduction\n\nWhen enough data and training time are available, increasing the number of network parameters\ntypically improves prediction accuracy [16, 6, 14, 1]. While the largest artificial neural networks\ncurrently only have a few billion parameters [9], the usefulness of much larger scales is suggested\nby the fact that the human brain has evolved to have an estimated 150 trillion synapses [19] under tight\nenergy constraints. In deep learning, typically all parts of a network need to be executed for every data\ninput. Unfortunately, scaling such architectures results in a roughly quadratic explosion in training\ntime as both more iterations are needed and the cost per sample grows. In contrast, usually only a few\nregions of the brain are highly active simultaneously [20]. Furthermore, the modular structure of\nbiological neural connections [28] is hypothesized to optimize energy cost [8, 15], improve adaptation\nto changing environments and mitigate catastrophic forgetting [26].\nInspired by these observations, we propose a novel way of training neural networks by automatically\ndecomposing the functionality needed for solving a given task (or set of tasks) into reusable modules.\nWe treat the choice of module as a latent variable in a probabilistic model and learn both the\ndecomposition and module parameters end-to-end by maximizing a variational lower bound of the\nlikelihood. 
Existing approaches for conditional computation [25, 2, 21] rely on regularization to\navoid a module collapse (the network only uses a few modules repeatedly) that would result in poor\nperformance. In contrast, our algorithm learns to use a variety of modules without such a modification\nand we show that training is less noisy compared to baselines.\nA small fixed number out of a larger set of modules is selected to process a given input, and only the\ngradients for these modules need to be calculated during backpropagation. Different from approaches\nbased on mixture-of-experts, our method results in fully deterministic module choices enabling low\ncomputational costs. Because the pool of available modules is shared between processing stages (or\ntime steps), modules can be used at multiple locations in the network. Therefore, the algorithm can\nlearn to share parameters dynamically depending on the context.\n\n\u2217now affiliated with IDSIA, The Swiss AI Lab (USI & SUPSI)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: Architecture of the modular layer. (a) Based on the input, the controller selects K\nmodules from a set of M available modules. In this example, K = 3 and M = 6. (b) The selected\nmodules then each process the input, with the results being summed up or concatenated to form the\nfinal output of the modular layer. Continuous arrows represent data flow, while dotted\narrows represent flow of modules.\n\n2 Modular Networks\n\nThe network is composed of functions (modules) that can be combined to solve a given task. Each\nmodule f_i for i \u2208 {1, . . . , M} is a function f_i(x; \u03b8_i) that takes a vector input x and returns a vector\noutput, where \u03b8_i denotes the parameters of module i. A modular layer, as illustrated in Figure 1,\ndetermines which modules (based on the input to the layer) are executed. 
The output of the layer\nconcatenates (or sums) the values of the selected modules. The output of this layer can then be fed\ninto a subsequent modular layer. The layer can be placed anywhere in a neural network.\nMore fully, each modular layer l \u2208 {1, . . . , L} is defined by the set of M available modules and\na controller which determines which K from the M modules will be used. The random variable\na^{(l)} denotes the chosen module indices a^{(l)} \u2208 {1, . . . , M}^K. The controller distribution of layer l,\np(a^{(l)}|x^{(l)}, \u03c6^{(l)}), is parameterized by \u03c6^{(l)} and depends on the input to the layer x^{(l)} (which might be\nthe output of a preceding layer).\nWhile a variety of approaches could be used to calculate the output y^{(l)}, we used concatenation and\nsummation in our experiments. In the latter case, we obtain\n\ny^{(l)} = \\sum_{i=1}^{K} f_{a_i^{(l)}}\\left(x^{(l)}; \\theta_{a_i^{(l)}}\\right)    (1)\n\nDepending on the architecture, this can then form the input to a subsequent modular layer l + 1. The\nmodule selections at all layers can be combined to a single joint distribution given by\n\np(a|x, \\phi) = \\prod_{l=1}^{L} p(a^{(l)}|x^{(l)}, \\phi^{(l)})    (2)\n\nThe entire neural network, conditioned on the composition of modules a, can be used for the\nparameterization of a distribution over the final network output y \u223c p(y|x, a, \u03b8). For example, the\nfinal module might define a Gaussian distribution N(y | \u03bc, \u03c3^2) as the output of the network whose\nmean and variance are determined by the final layer module. 
This defines a joint distribution over\noutput y and module selection a\n\np(y, a|x, \\theta, \\phi) = p(y|x, a, \\theta) p(a|x, \\phi)    (3)\n\nSince the selection of modules is stochastic we treat a as a latent variable, giving the marginal output\n\np(y|x, \\theta, \\phi) = \\sum_{a} p(y|x, a, \\theta) p(a|x, \\phi)    (4)\n\nSelecting K modules at each of the L layers means that the number of states of a is M^{KL}. For all but\na small number of modules and layers, this summation is intractable and approximations are required.\n\nAlgorithm 1 Training modular networks with generalized EM\n  Given dataset D = {(x_n, y_n) | n = 1, . . . , N}\n  Initialize a\u2217_n for all n = 1, . . . , N by sampling uniformly from all possible module compositions\n  repeat\n    Sample mini-batch of datapoint indices I \u2286 {1, . . . , N}    \u25b7 Partial E-step\n    for each n \u2208 I do\n      Sample module compositions \u00c3 = {\u00e3_s \u223c p(a_n|x_n, \u03c6) | s = 1, . . . , S}\n      Update a\u2217_n to best value out of \u00c3 \u222a {a\u2217_n} according to Equation 11\n    end for\n    repeat k times    \u25b7 Partial M-step\n      Sample mini-batch from dataset B \u2286 D\n      Update \u03b8 and \u03c6 with gradient step according to Equation 8 on mini-batch B\n  until convergence\n\n2.1 Learning Modular Networks\n\nFrom a probabilistic modeling perspective the natural training objective is maximum likelihood.\nGiven a collection of input-output training data (x_n, y_n), n = 1, . . . , N, we seek to adjust the module\nparameters \u03b8 and controller parameters \u03c6 to maximize the log likelihood:\n\n\\mathcal{L}(\\theta, \\phi) = \\sum_{n=1}^{N} \\log p(y_n|x_n, \\theta, \\phi)    (5)\n\nTo address the difficulties in forming the exact summation over the states of a we use generalized\nExpectation-Maximisation (EM) [17], here written for a single datapoint\n\n\\log p(y|x, \\theta, \\phi) \\geq -\\sum_{a} q(a) \\log q(a) + \\sum_{a} q(a) \\log\\left(p(y|x, a, \\theta) p(a|x, \\phi)\\right) \\equiv \\mathcal{L}(q, \\theta, \\phi)    (6)\n\nwhere q(a) is a variational distribution used to tighten the lower bound \\mathcal{L} on the likelihood. 
We can\nmore compactly write\n\n\\mathcal{L}(q, \\theta, \\phi) = \\mathbb{E}_{q(a)}[\\log p(y, a|x, \\theta, \\phi)] + H[q]    (7)\n\nwhere H[q] is the entropy of the distribution q. We then seek to adjust q, \u03b8, \u03c6 to maximize \\mathcal{L}. The\npartial M-step on (\u03b8, \u03c6) is defined by taking multiple gradient ascent steps, where the gradient is\n\n\\nabla_{\\theta,\\phi} \\mathcal{L}(q, \\theta, \\phi) = \\nabla_{\\theta,\\phi} \\mathbb{E}_{q(a)}[\\log p(y, a|x, \\theta, \\phi)]    (8)\n\nIn practice we randomly select a mini-batch of datapoints at each iteration. Evaluating this gradient\nexactly requires a summation over all possible values of a. We experimented with different strategies\nto avoid this and found that the Viterbi EM [17] approach is particularly effective in which q(a) is\nconstrained to the form\n\nq(a) = \\delta(a, a^{*})    (9)\n\nwhere \u03b4(x, y) is the Kronecker delta function which is 1 if x = y and 0 otherwise. A full E-step\nwould now update a\u2217 to\n\na^{*}_{\\text{new}} = \\operatorname*{argmax}_{a} \\; p(y|x, a, \\theta) p(a|x, \\phi)    (10)\n\nfor all datapoints. For tractability we instead make the E-step partial in two ways: Firstly, we choose\nthe best from only S samples \u00e3_s \u223c p(a|x, \u03c6) for s \u2208 {1, ..., S} or keep the previous a\u2217 if none of\nthese are better (thereby making sure that \\mathcal{L} does not decrease):\n\na^{*}_{\\text{new}} = \\operatorname*{argmax}_{a \\in \\{\\tilde{a}_s | s \\in \\{1, \\ldots, S\\}\\} \\cup \\{a^{*}\\}} \\; p(y|x, a, \\theta) p(a|x, \\phi)    (11)\n\nSecondly, we apply this update only for a mini-batch, while keeping the a\u2217 associated with all other\ndatapoints constant.\nThe overall stochastic generalized EM approach is summarized in Algorithm 1. Intuitively, the\nalgorithm clusters similar inputs, assigning them to the same module. We begin with an arbitrary\nassignment of modules to each datapoint. In each partial E-step we use the controller p(a|x, \u03c6)\nas a guide to reassign modules to each datapoint. 
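As a rough illustration of the partial E-step in Equation 11, the sketch below draws S candidate compositions from the controller, adds the previous assignment to the pool, and keeps the candidate with the highest log joint. It is only a minimal sketch in plain Python under stated assumptions: the function names (partial_e_step, log_joint, sample_composition), the single-layer setting with K = 1 and M = 3, and all probability values are hypothetical and are not taken from the paper or its library.

```python
import math
import random


def partial_e_step(log_joint, sample_composition, a_star, num_samples, rng):
    # Draw S candidate compositions from the controller p(a | x, phi).
    candidates = [sample_composition(rng) for _ in range(num_samples)]
    # Keep the previous assignment in the pool so the bound cannot decrease.
    candidates.append(a_star)
    # Pick the candidate maximizing log p(y | x, a, theta) + log p(a | x, phi).
    return max(candidates, key=log_joint)


# Hypothetical toy setting: one layer (L = 1), K = 1, M = 3 modules.
controller_probs = [0.2, 0.5, 0.3]   # p(a | x, phi), made-up values
likelihoods = [0.05, 0.10, 0.60]     # p(y | x, a, theta), made-up values


def log_joint(a):
    # Log of the joint p(y, a | x, theta, phi) for a single datapoint.
    return math.log(likelihoods[a]) + math.log(controller_probs[a])


def sample_composition(rng):
    # Sample one module index from the controller distribution.
    return rng.choices(range(3), weights=controller_probs, k=1)[0]


rng = random.Random(0)
old_a_star = 0
new_a_star = partial_e_step(log_joint, sample_composition, old_a_star, 4, rng)
```

Because the previous assignment always stays among the candidates, the selected composition can never score lower than the old one, which mirrors the guarantee that the bound does not decrease.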
Because this controller is a smooth function\napproximator, similar inputs are assigned to similar modules. In each partial M-step the module\nparameters \u03b8 are adjusted to learn the functionality required by the respective datapoints assigned\nto them. Furthermore, by optimizing the parameters \u03c6 we train the controller to predict the current\noptimal module selection a\u2217_n for each datapoint.\nFigure 2 visualizes the above clustering process for a simple feed-forward neural network composed\nof 6 modular layers with K = 1 modules being selected at each layer out of a possible M = 3\nmodules. The task is image classification, see Section 3.3. Each node in the graph represents a\nmodule and each datapoint uses a path of modules starting from layer 1 and ending in layer 6. The\nwidth of the edge between two nodes n1 and n2 represents the number of datapoints that use the\nfirst module n1 followed by n2; the size of a node represents how many times that module was used.\nFigure 2 shows how a subset of datapoints starting with a fairly uniform distribution over all paths\nends up being clustered to a single common path. The upper and lower graphs correspond to two\ndifferent subsets of the datapoints. We visualized only two clusters but in general many such clusters\n(paths) form, each for a different subset of datapoints.\n\n2.2 Alternative Training\n\nRelated work [25, 3, 21] uses two different training approaches that can also be applied to our modular\narchitecture. REINFORCE [30] maximizes the lower bound\n\nB(\\theta, \\phi) \\equiv \\sum_{a} p(a|x, \\phi) \\log p(y|x, a, \\theta) \\leq \\mathcal{L}(\\theta, \\phi)    (12)\n\non the log likelihood \\mathcal{L}. 
Using the log-trick we obtain the gradients\n\n\\nabla_{\\phi} B(\\theta, \\phi) = \\mathbb{E}_{p(a|x,\\phi)}[\\log p(y|x, a, \\theta) \\nabla_{\\phi} \\log p(a|x, \\phi)]    (13)\n\\nabla_{\\theta} B(\\theta, \\phi) = \\mathbb{E}_{p(a|x,\\phi)}[\\nabla_{\\theta} \\log p(y|x, a, \\theta)]    (14)\n\nThese expectations are then approximated by sampling from p(a|x, \u03c6). An alternative training\nalgorithm is the noisy top-k mixture of experts [25]. A mixture of experts is the weighted sum of\nseveral parameterized functions and therefore also separates functionality into multiple components.\nA gating network is used to predict the weight for each expert. Noise is added to the output of this\ngating network before setting all but the maximum k units to \u2212\u221e, effectively disabling these experts.\nOnly these k modules are then evaluated and gradients backpropagated. We discuss issues with these\ntraining techniques in the next section.\n\n2.3 Avoiding Module Collapse\n\nRelated work [25, 3, 21] suffered from the problem of missing module diversity (\"module collapse\"),\nwith only a few modules actually realized. This premature selection of only a few modules has\noften been attributed to a self-reinforcing effect where favored modules are trained more rapidly,\nfurther increasing the gap [25]. To counter this effect, previous studies introduced regularizers to\nencourage different modules being used for different datapoints within a mini-batch. In contrast to\nthese approaches, no regularization is needed in our method. However, to avoid module collapse, we\nmust take sufficient gradient steps within the partial M-step to optimize both the module parameters \u03b8,\nas well as the controller parameters \u03c6. That is, between each E-step, there are many gradient updates\nfor both \u03b8 and \u03c6. 
Note that this form of training is critical, not just to prevent module collapse but to\nobtain a high likelihood. When module collapse occurs, the resulting log-likelihood is lower than\nthe log-likelihood of the non-module-collapsed trained model. In other words, our approach is not a\nregularizer that biases the model towards a desired form of a sub-optimal minimum \u2013 it is a critical\ncomponent of the algorithm to ensure finding a high-valued optimum.\n\nFigure 2: Two different subsets of datapoints (top and bottom) that use the same modules at the end\nof training (right) start with entirely different modules (left) and slowly cluster together over the\ncourse of training (left to right). Nodes in the graph represent modules with their size proportional\nto the number of datapoints that use this module. Edges between nodes n1 and n2 and their stroke\nwidth represent how many datapoints first use module n1 followed by n2. [Plot axes: Layer, Module; panels: Beginning of training, Mid-training, End of training.]\n\n3 Experiments\n\nTo investigate how modules specialize during training, we first consider a simple toy regression\nproblem. We then apply our modular networks to language modeling and image classification.\nAlternative training methods for our modular networks are noisy top-k gating [25], as well as\nREINFORCE [3, 21] to which we will compare our approach. Except if noted otherwise, we use\na controller consisting of a linear transformation followed by a softmax function for each of the K\nmodules to select. Our modules are either linear transformations or convolutions, followed by a\nReLU activation. Additional experimental details are given in the supplementary material.\nIn order to analyze what kind of modules are being used we define two entropy measures. The\nmodule selection entropy is defined as\n\nH_a = \\frac{1}{BL} \\sum_{l=1}^{L} \\sum_{n=1}^{B} H\\left[p(a_n^{(l)}|x_n, \\phi)\\right]    (15)\n\nwhere B is the size of the batch. H_a has larger values for more uncertainty for each sample n. 
We\nwould like to minimize H_a (so we have high certainty in the module being selected for a datapoint\nx_n). Secondly, we define the entropy over the entire batch\n\nH_b = \\frac{1}{L} \\sum_{l=1}^{L} H\\left[\\frac{1}{B} \\sum_{n=1}^{B} p(a_n^{(l)}|x_n, \\phi)\\right]    (16)\n\nModule collapse would correspond to a low H_b. Ideally, we would like to have a large H_b so that\ndifferent modules will be used, depending on the input x_n.\n\n3.1 Toy Regression Problem\n\nWe demonstrate the ability of our algorithm to learn conditional execution using the following\nregression problem: For each data point (x_n, y_n), the input vectors x_n are generated from a mixture\nof Gaussians with two components with uniform latent mixture probabilities p(s_n = 1) = p(s_n = 2) = 1/2\naccording to x_n|s_n \u223c N(x_n | \u03bc_{s_n}, \u03a3_{s_n}). Depending on the component s_n, the target y_n is\ngenerated by linearly transforming the input x_n according to\n\ny_n = \\begin{cases} R x_n & \\text{if } s_n = 1 \\\\ S x_n & \\text{otherwise} \\end{cases}    (17)\n\nwhere R is a randomly generated rotation matrix and S is a diagonal matrix with random scaling\nfactors.\n\nFigure 3: Performance of one modular layer on toy regression. (a) Module composition learned on the toy dataset. (b) Minimization of loss on the toy dataset.\n\nIn the case of our toy example, we use a single modular layer, L = 1, with a pool of two modules,\nM = 2, where one module is selected per data point, K = 1. Loss and module selection entropy\nquickly converge to zero, while batch module selection entropy stabilizes near log 2 as shown in\nFigure 3. 
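To make the two entropy measures concrete, the following sketch evaluates the module selection entropy and the batch module selection entropy for the idealized outcome described here: one layer, two modules, and a controller that picks each module deterministically for half of the batch. It is a minimal plain-Python sketch; the function names and the hand-written probability table are assumptions for illustration, not part of the paper's code.

```python
import math


def entropy(probs):
    # Shannon entropy in nats; zero-probability terms contribute nothing.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def module_selection_entropy(batch_probs):
    # H_a: mean over layers and samples of the per-sample entropy.
    # batch_probs[l][n] is the controller distribution over the M modules.
    num_layers = len(batch_probs)
    batch_size = len(batch_probs[0])
    total = sum(entropy(p) for layer in batch_probs for p in layer)
    return total / (batch_size * num_layers)


def batch_module_selection_entropy(batch_probs):
    # H_b: mean over layers of the entropy of the batch-averaged distribution.
    num_layers = len(batch_probs)
    total = 0.0
    for layer in batch_probs:
        batch_size = len(layer)
        num_modules = len(layer[0])
        mean = [sum(p[m] for p in layer) / batch_size for m in range(num_modules)]
        total += entropy(mean)
    return total / num_layers


# Idealized toy case: L = 1, M = 2, K = 1, deterministic and balanced choices.
probs = [[[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]]
h_a = module_selection_entropy(probs)
h_b = batch_module_selection_entropy(probs)
```

In this idealized case h_a is zero and h_b equals log 2. If every datapoint chose the same module instead, both values would be zero, which is the signature of module collapse.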
This implies that the problem is perfectly solved by the architecture in the following way:\nEach of the two modules specializes on regression of data points from one particular component by\nlearning the corresponding linear transformations R and S respectively and the controller learns to\npick the corresponding module for each data point deterministically, thereby effectively predicting\nthe identity of the corresponding generating component. Thus, our modular networks successfully\ndecompose the problem into modules yielding perfect training and generalization performance.\n\n3.2 Language Modeling\n\nModular layers can readily be used to update the state within an RNN. This allows us to model\nsequence-to-sequence tasks with a single RNN which learns to select modules based on the context.\nFor our experiments, we use a modi\ufb01ed Gated Recurrent Unit [5] in which the state update operation\nis a modular layer. Therefore, K modules are selected and applied at each time step. Full details can\nbe found in the supplement.\nWe use the Penn Treebank2 dataset, consisting of 0.9 million words with a vocabulary size of 10,000.\nThe input of the recurrent network for each timestep is a jointly-trained embedding vector of size 32\nthat is associated with each word.\nWe compare the EM-based modular networks approach to unregularized REINFORCE (with an\nexponential moving average control variate) and noisy top-k, as well as a baseline without modularity,\nthat uses the same K modules for all datapoints. This baseline uses the same number of module\nparameters per datapoint as the modular version. For this experiment, we test four con\ufb01gurations of\nthe network being able to choose K out of M modules at each timestep: 1 out of 5 modules, 3 out\nof 5, 1 out of 15, and 3 out of 15. 
We report the test perplexity after 50,000 iterations for the Penn\nTreebank dataset in Table 1.\nWhen only selecting a single module out of 5 or 15, our modular networks outperform both baselines\nwith 1 or 3 fixed modules. Selecting 3 out of 5 or 15 seems to be harder to learn, currently not\noutperforming a single chosen module (K = 1). Remarkably, apart from the controller network,\nthe baseline with three static modules performs three times the computation and achieves worse test\nperplexity compared to a single intelligently selected module using our method. Compared to the\nREINFORCE and noisy-top-k training methods, our approach has lower test perplexities for each\nmodule configuration.\n\n2 http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz\n\n[Figure 3 plots: (a) entropy vs. step, curves: Batch module selection entropy, Module selection entropy; (b) loss vs. step, curves: Training loss, Test loss.]\n\nTable 1: Test perplexity after 50,000 steps on Penn Treebank\n\nType | #modules (M) | #parallel modules (K) | test perplexity\nEM Modular Networks | 15 | 1 | 229.651\nEM Modular Networks | 5 | 1 | 236.809\nEM Modular Networks | 15 | 3 | 246.493\nEM Modular Networks | 5 | 3 | 236.314\nREINFORCE | 15 | 1 | 240.760\nREINFORCE | 5 | 1 | 240.450\nREINFORCE | 15 | 3 | 274.060\nREINFORCE | 5 | 3 | 267.585\nNoisy Top-k (k = 4) | 15 | 1 | 422.636\nNoisy Top-k (k = 4) | 5 | 1 | 338.275\nBaseline | 1 | 1 | 247.408\nBaseline | 3 | 3 | 241.294\n\nFigure 4: Modular networks are less noisy during optimization compared to REINFORCE and more\ndeterministic than noisy top-k. Our method uses all modules at the end of training, shown by a large\nbatch module selection entropy. The task is language modeling on the Penn Treebank dataset. (a) Module selection entropy H_a. (b) Batch module selection entropy H_b.\n\nWe further inspect training behavior in Figure 4. 
Using our method, all modules are effectively\nbeing used at the end of training, as shown by a large batch module selection entropy in Figure 4b.\nAdditionally, the optimization is generally less noisy compared to the alternative approaches and the\nmethod quickly reaches a deterministic module selection. Figure 5 shows how the module selection\nchanges over the course of training for a single batch. At the beginning of training, the controller\nessentially has no preference over modules for any instance in the batch. Later in training, the\nselection is deterministic for some datapoints and finally becomes fully deterministic.\nFor language modeling tasks, modules specialize in certain grammatical and semantic contexts.\nThis is illustrated in Table 2, where we observe specialization on numerals, the beginning of a new\nsentence and the occurrence of the definite article the, indicating that the word to be predicted is a\nnoun or adjective.\n\n[Figure 4 plots: entropy vs. step in both panels, curves: REINFORCE, EM Modular Networks, Noisy Top-k.]\n\nFigure 5: A visualization of the controller distribution for a particular mini-batch, choosing K = 1\nout of M = 5 modules. Training progresses from the top image to the bottom image. A black pixel\nrepresents zero probability and a white pixel represents probability 1. [Plot axes: batch instance, module.]\n\n
Table 2: For a few out of M = 15 modules (with K = 1), we show examples of the corresponding\ninput word which they are invoked on (highlighted) together with surrounding words in the sentence.\n\nModule 1\n\nModule 3\n\nModule 14\n\n... than ...\n\n... be substantially less ...\n\n... up ...\n... million was ...\n... $ billion ...\n... million of ...\n... $ billion ...\n\n... by to ...\n\n... yield ...\n\n... debt from the ...\n\n...\n\n... Australia A ...\n... opposition I ...\n\n... said But ...\n\n... teachers for the ...\n\n... result That ...\n\n... but the ...\n\n...\n\n... based on the ...\n\n... business He ...\n... rates This ...\n... offer Federal ...\n\n... said the acquired ...\n\n... on the first ...\n\n... that the carrier ...\n... to the recent ...\n... and the sheets ...\n\n... and the naczelnik ...\n\n... if the actual ...\n\n... say the earnings ...\n\n... in the third ...\n... brain the skin ...\n\n...\n\nFigure 6: Modular networks test (left) and training accuracy (right) for a linear controller and a\nconvolutional controller compared to the non-modularized baseline.\n\n3.3 Image Classification\n\nWe applied our method to image classification on CIFAR10 [13] by using a modularized feed-forward\nnetwork. Compared to [21], we not only modularized the two fully connected layers but also the\nremaining three convolutional layers. Details can be found in the supplement.\nFigure 6 shows how using modules achieves higher training accuracy compared to the\nnon-modularized baseline. However, in comparison to the language modeling tasks, this does not\nlead to improved generalization. We found that the controller overfits to specific features. In Figure 6\nwe therefore compared to a more constrained convolutional controller that reduces overfitting\nconsiderably. Shazeer et al. [25] make a similar claim in their study and therefore only train on very large\nlanguage modeling datasets. More investigation is needed to understand how to take advantage of\nmodularization in tasks with limited data.\n\n4 Related Work\n\nLearning modules and their composition is closely related to mixtures of experts, dating back to [11,\n12]. A mixture of experts is the weighted sum of several parameterized functions and therefore also\nseparates functionality into multiple components. Our work is different in two major aspects. 
Firstly,\nour training algorithm is designed such that the selection of modules becomes fully deterministic\ninstead of a mixture. This enables efficient prediction such that only the single most likely module\nhas to be evaluated for each of the K distributions. Secondly, instead of having a single selection\nof modules, we compose modules hierarchically in an arbitrary manner, both sequentially and in\nparallel. The latter idea has been, in part, pursued by [10], relying on stacked mixtures of experts\ninstead of a single selection mechanism. Due to their training by backpropagation of entire mixtures,\nsumming over all paths, no clear computational benefits have yet been achieved through such a form\nof modularization.\n\n[Figure 6 plots: test accuracy vs. step (left) and training accuracy vs. step (right), curves: EM Modular Networks, EM Modular Networks conv controller, Baseline.]\n\nDifferent approaches for limiting the number of evaluations of experts are stochastic estimation\nof gradients through REINFORCE [30] or noisy top-k gating [4]. Nevertheless, both the mixture\nof experts in [3] based on REINFORCE as well as the approach by [25] based on noisy top-k\ngating require regularization to ensure diversity of experts for different inputs. If regularization\nis not applied, only very few experts are actually used. In contrast, our modular networks use a\ndifferent training algorithm, generalized Viterbi EM, enabling the training of modules without any\nartificial regularization. 
This has the advantage of not forcing the optimization to reach a potentially\nsub-optimal log-likelihood based on regularizing the training objective.\nOur architecture differs from [25] in that we don\u2019t assign a probability to every of the M modules and\npick the K most likely but instead we assign a probability to each composition of modules. In terms\nof recurrent networks, in [25] a mixture-of-experts layer is sandwiched between multiple recurrent\nneural networks. However, to the best of our knowledge, we are the \ufb01rst to introduce a method where\neach modular layer is updating the state itself.\nThe concept of learning modules has been further extended to multi-task learning with the introduction\nof routing networks [21]. Multiple tasks are learned jointly by conditioning the module selection on\nthe current task and/or datapoint. While conditioning on the task through the use of the multi-agent\nWeighted Policy Learner shows promising results, they reported that a single agent conditioned on\nthe task and the datapoint fails to use more than one or two modules. This is consistent with previous\nobservations [3, 25] that a RL-based training without regularization tends to use only few modules.\nWe built on this work by introducing a training method that no longer requires this regularization. As\nfuture work we will apply our approach in the context of multi-task learning.\nThere is also a wide range of literature in robotics that uses modularity to learn robot skills more\nef\ufb01ciently by reusing functionality shared between tasks [22, 7]. However, the decomposition into\nmodules and their reuse has to be speci\ufb01ed manually, whereas our approach offers the ability to learn\nboth the decomposition and modules automatically. 
In future work we intend to apply our approach\nto parameterizing policies in terms of the composition of simpler policy modules.\nConditional computation can also be achieved through activation sparsity or winner-take-all mecha-\nnisms [27, 23, 24] but is hard to parallelize on modern accelerators such as GPUs. A solution that\nworks with these accelerators is learning structured sparsity [18, 29] but often requires non-sparse\ncomputation during training or is not conditional.\n\n5 Conclusion\n\nWe introduced a novel method to decompose neural computation into modules, learning both the\ndecomposition as well as the modules jointly. Compared to previous work, our method produces fully\ndeterministic module choices instead of mixtures, does not require any regularization to make use of\nall modules, and results in less noisy training. Modular layers can be readily incorporated into any\nneural architecture. We introduced the modular gated recurrent unit, a modi\ufb01ed GRU that enables\nminimalistic sequence-to-sequence models based on modular computation. We applied our method in\nlanguage modeling and image classi\ufb01cation, showing how to learn modules for these different tasks.\nTraining modular networks has long been a sought-after goal in neural computation since this opens\nup the possibility to signi\ufb01cantly increase the power of neural networks without an increase in\nparameter explosion. We have introduced a simple and effective way to learn such networks, opening\nup a range of possibilities for their future application in areas such as transfer learning, reinforcement\nlearning and lifelong learning. Future work may also explore how modular networks scale to larger\nproblems, architectures, and different domains. 
A library to use modular layers in TensorFlow can be\nfound at http://louiskirsch.com/libmodular.\n\nAcknowledgments\n\nWe thank Ilya Feige, Hippolyt Ritter, Tianlin Xu, Raza Habib, Alex Mansbridge, Roberto Fierimonte,\nand our anonymous reviewers for their feedback. This work was supported by the Alan Turing\nInstitute under the EPSRC grant EP/N510129/1. Furthermore, we thank IDSIA (The Swiss AI Lab)\nfor the opportunity to finalize the camera-ready version on their premises, partially funded by the\nERC Advanced Grant (no: 742870).\n\nReferences\n\n[1] D. Amodei et al. \u201cDeep Speech 2: End-to-End Speech Recognition in English and Mandarin\u201d. In: ICML (2015).\n\n[2] E. Bengio. \u201cOn Reinforcement Learning for Deep Neural Architectures: Conditional Computation with Stochastic Computation Policies\u201d. PhD thesis. McGill University Libraries, 2017.\n\n[3] E. Bengio et al. \u201cConditional Computation in Neural Networks for Faster Models\u201d. In: ICLR Workshop (2016).\n\n[4] Y. Bengio, N. L\u00e9onard, and A. Courville. \u201cEstimating or propagating gradients through stochastic neurons for conditional computation\u201d. In: arXiv preprint arXiv:1308.3432 (2013).\n\n[5] K. Cho et al. \u201cLearning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation\u201d. In: Conference on Empirical Methods in Natural Language Processing (2014).\n\n[6] D. C. Ciresan, U. Meier, and J. Schmidhuber. \u201cMulti-Column Deep Neural Networks for Image Classification\u201d. IEEE Conference on Computer Vision and Pattern Recognition CVPR 2012. 2012.\n\n[7] I. Clavera, D. Held, and P. Abbeel. \u201cPolicy transfer via modularity and reward guiding\u201d. IEEE International Conference on Intelligent Robots and Systems. Vol. 2017-Septe. 2017, pp. 1537\u20131544.\n\n[8] J. Clune, J.-B. Mouret, and H. Lipson. \u201cThe evolutionary origins of modularity\u201d. In: Proceedings of the Royal Society of London B: Biological Sciences 280.1755 (2013).\n\n[9] A. Coates et al. \u201cDeep learning with COTS HPC systems\u201d. In: ICML (2013), pp. 1337\u20131345.\n\n[10] D. Eigen, M. Ranzato, and I. Sutskever. \u201cLearning Factored Representations in a Deep Mixture of Experts\u201d. In: ICLR Workshop (2013).\n\n[11] R. A. Jacobs et al. \u201cAdaptive Mixtures of Local Experts\u201d. In: Neural Computation 3.1 (1991), pp. 79\u201387.\n\n[12] M. I. Jordan and R. A. Jacobs. \u201cHierarchical Mixtures of Experts and the EM Algorithm\u201d. In: Neural Computation 6.2 (1994), pp. 181\u2013214.\n\n[13] A. Krizhevsky and G. Hinton. \u201cLearning Multiple Layers of Features from Tiny Images\u201d. MSc thesis. University of Toronto, 2009.\n\n[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. \u201cImageNet classification with deep convolutional neural networks\u201d. In: NIPS (2012), pp. 1097\u20131105.\n\n[15] R. A. Legenstein and W. Maass. \u201cNeural circuits for pattern recognition with small total wire length\u201d. In: Theoretical Computer Science (2002), pp. 239\u2013249.\n\n[16] L. Li, Z. Ding, and D. Huang. \u201cRecognizing location names from Chinese texts based on Max-Margin Markov network\u201d. In: International Conference on Natural Language Processing and Knowledge Engineering (2008).\n\n[17] R. M. Neal and G. E. Hinton. \u201cLearning in Graphical Models\u201d. In: ed. by M. I. Jordan. Cambridge, MA, USA: MIT Press, 1999. Chap. A View of, pp. 355\u2013368.\n\n[18] K. Neklyudov et al. \u201cStructured Bayesian Pruning via Log-Normal Multiplicative Noise\u201d. NIPS. 2017, pp. 6775\u20136784.\n\n[19] B. Pakkenberg et al. \u201cAging and the human neocortex\u201d. In: Experimental gerontology 38.1-2 (2003), pp. 95\u201399.\n\n[20] M. Ramezani et al. \u201cJoint sparse representation of brain activity patterns in multi-task fMRI data\u201d. 
In: IEEE Transactions on Medical Imaging 34.1 (2015), pp. 2\u201312.\n\n[21] C. Rosenbaum, T. Klinger, and M. Riemer. \u201cRouting Networks: Adaptive Selection of Non-linear Functions for Multi-Task Learning\u201d. ICLR. 2018.\n\n[22] H. Sahni et al. \u201cLearning to Compose Skills\u201d. In: NIPS Workshop (2017).\n\n[23] J. Schmidhuber. \u201cSelf-delimiting neural networks\u201d. In: arXiv preprint arXiv:1210.0118 (2012).\n\n[24] J. Schmidhuber. \u201cThe neural bucket brigade\u201d. Connectionism in perspective. 1989, pp. 439\u2013446.\n\n[25] N. Shazeer et al. \u201cOutrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer\u201d. In: ICLR (2017).\n\n[26] O. Sporns and R. F. Betzel. \u201cModular Brain Networks\u201d. In: Annual Review of Psychology 67.1 (2016), pp. 613\u2013640.\n\n[27] R. K. Srivastava et al. \u201cCompete to compute\u201d. NIPS. 2013, pp. 2310\u20132318.\n\n[28] S. Sternberg. \u201cModular processes in mind and brain\u201d. In: Cognitive Neuropsychology 28.3-4 (2011), pp. 156\u2013208.\n\n[29] W. Wen et al. \u201cLearning structured sparsity in deep neural networks\u201d. NIPS. 2016, pp. 2074\u20132082.\n\n[30] R. J. Williams. \u201cSimple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning\u201d. In: Machine Learning 8 (1992), pp. 229\u2013256.\n", "award": [], "sourceid": 1229, "authors": [{"given_name": "Louis", "family_name": "Kirsch", "institution": "University College London & IDSIA"}, {"given_name": "Julius", "family_name": "Kunze", "institution": "University College London"}, {"given_name": "David", "family_name": "Barber", "institution": "University College London"}]}