{"title": "Self-Routing Capsule Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 7658, "page_last": 7667, "abstract": "Capsule networks have recently gained a great deal of interest as a new architecture of neural networks that can be more robust to input perturbations than similar-sized CNNs. Capsule networks have two major distinctions from the conventional CNNs: (i) each layer consists of a set of capsules that specialize in disjoint regions of the feature space and (ii) the routing-by-agreement coordinates connections between adjacent capsule layers. Although the routing-by-agreement is capable of filtering out noisy predictions of capsules by dynamically adjusting their influences, its unsupervised clustering nature causes two weaknesses: (i) high computational complexity and (ii) cluster assumption that may not hold in presence of heavy input noise. In this work, we propose a novel and surprisingly simple routing strategy called self-routing where each capsule is routed independently by its subordinate routing network. Therefore, the agreement between capsules is not required anymore but both poses and activations of upper-level capsules are obtained in a way similar to Mixture-of-Experts. Our experiments on CIFAR-10, SVHN and SmallNORB show that the self-routing performs more robustly against white-box adversarial attacks and affine transformations, requiring less computation.", "full_text": "Self-Routing Capsule Networks\n\nTaeyoung Hahn Myeongjang Pyeon Gunhee Kim\n\nSeoul National University, Seoul, Korea\n\n{taeyounghahn,mjpyeon,gunhee}@snu.ac.kr\nhttp://vision.snu.ac.kr/projects/self-routing\n\nAbstract\n\nCapsule networks have recently gained a great deal of interest as a new architecture\nof neural networks that can be more robust to input perturbations than similar-sized\nCNNs. 
Capsule networks have two major distinctions from the conventional CNNs:\n(i) each layer consists of a set of capsules that specialize in disjoint regions of the\nfeature space and (ii) the routing-by-agreement coordinates connections between\nadjacent capsule layers. Although the routing-by-agreement is capable of \ufb01ltering\nout noisy predictions of capsules by dynamically adjusting their in\ufb02uences, its\nunsupervised clustering nature causes two weaknesses: (i) high computational\ncomplexity and (ii) cluster assumption that may not hold in the presence of heavy\ninput noise. In this work, we propose a novel and surprisingly simple routing\nstrategy called self-routing, where each capsule is routed independently by its\nsubordinate routing network. Therefore, the agreement between capsules is not\nrequired anymore, but both poses and activations of upper-level capsules are\nobtained in a way similar to Mixture-of-Experts. Our experiments on CIFAR-\n10, SVHN, and SmallNORB show that the self-routing performs more robustly\nagainst white-box adversarial attacks and af\ufb01ne transformations, requiring less\ncomputation.\n\n1\n\nIntroduction\n\nIn the past years, deep convolutional neural networks (CNNs) have become the de-facto standard\narchitecture in image classi\ufb01cation tasks, thanks to their high representational power. However, an\nimportant yet unanswered question is whether deep networks can truly generalize. Well-trained\nnetworks can be catastrophically fooled by the images with carefully designed perturbations that\nare even unrecognizable by human eyes [23, 29, 37, 38]. Furthermore, natural, non-adversarial pose\nchanges of familiar objects are enough to trick deep networks [1, 8]. 
The latter is more concerning, since natural pose changes are universal in the real world.\nSome research [5, 15] has argued that neural networks should aim for equivariance, not invariance. The reasoning is that, by preserving variations of an entity in a group of neurons (equivariance) rather than only detecting its existence (invariance), it would be easier to learn the underlying spatial relations and thus yield better generalization. Following this argument, a new network architecture called capsule networks (CapsNets) and a mechanism called routing-by-agreement have been introduced [16, 34]. In this design of networks, each capsule contains a pose (or instantiation parameters) for encoding patterns of its responsible entity. Active capsules in one layer make pose predictions for capsules in the next layer via transformation matrices. Then, the routing algorithm finds a center-of-mass among the predictions via iterative clustering and ensures that only the majority opinion is passed down to the next layer.\nWhile the routing-by-agreement [16, 34] has been shown to be effective, its unsupervised clustering nature causes two inherent weaknesses. First, it requires repeatedly computing means and membership scores of prediction vectors. This makes CapsNets computationally much heavier than one-pass feed-forward CNNs. Second, it makes assumptions on cluster distributions of predictions. This might not be a problem if the training of CapsNets successfully learns to fit the weights to the assumptions. However, the routing-by-agreement is likely to fail when many prediction vectors become noisy and are clustered in an unexpected form.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fIn this work, we aim to overcome the above limitations by proposing a new and surprisingly simple routing strategy that does not involve agreement anymore. 
In our algorithm, the contribution of a lower-level capsule to a higher-level capsule is determined by its activation and the routing decision of its subordinate routing network. We refer to this design of routing as self-routing. To the best of our knowledge, there is no previous literature on removing the routing-by-agreement from CapsNets.\nOur method is motivated by the structural resemblance between CapsNets and Mixture-of-Experts (MoE) [18, 7, 36, 20]. They are similar in that their composing units (i.e. capsules and experts) specialize in different regions of input space and that their contributions are adjusted differently per example. One key difference is that, in CapsNets, gating values are dynamically adjusted to suppress potentially unreliable submodules via the routing-by-agreement. However, if the robustness of CapsNets can be retained without the routing-by-agreement, then we might be able to safely remove the unsupervised clustering part that causes the two aforementioned weaknesses.\nFor evaluation, we compare our self-routing to the two most prominent routing-by-agreement algorithms, dynamic [34] and EM [16] routing, on the CIFAR-10 [22], SVHN [42], and SmallNORB [24] datasets. We compare not only classification accuracies but also robustness under adversarial attacks and viewpoint changes. For fairness, we use the same CNN base architectures (e.g. 7-layer CNN and ResNet-20) and replace only the last layers of the original networks with the respective capsule layers. Compared to the previous routing-by-agreement methods [16, 34], the self-routing achieves better classification performance in most cases while using significantly less computation in FLOPs. Moreover, it shows stronger robustness under both adversarial perturbations and viewpoint changes. We also show that our self-routing benefits the CapsNets more from the increase in model sizes (i.e. 
wider capsule layers), while the previous methods often degrade.\n\n2 Related Work\n\nCapsule networks. Recently, capsule networks have been actively applied to many domains, such\nas generative models [19], object localization [27], and graph networks [40], to name a few. Hinton\net al. [15] \ufb01rst introduced the idea of capsules and equivariance in neural networks. In their work,\nautoencoders are trained to generate images with the desired transformation; yet the model requires\ntransformation parameters to be supplied externally. Later, Sabour et al. [34] proposed a more\ncomplete model in which transformations are directly learnable from the data. To control the\ninformation \ufb02ow between adjacent capsule layers, they employed a mechanism named dynamic\nrouting. Since then, alternative routing methods have been suggested. Hinton et al. [16] proposed to\nuse Gaussian-mixture clustering. Bahadori et al. [3] made convergence faster via eigendecomposition\nof prediction vectors. Wang et al. [41] formalized the routing process to suggest a theoretically\nre\ufb01ned version. Li et al. [25] approximated the routing process with the interaction between master\nand aide branches. Compared to all of the previous work, our routing approach is free from agreement\nbut focuses on its ability of mixture-of-experts.\nMixture-of-experts. There have been many attempts to incorporate mixture-of-experts (MoE) into\ndeep network models. Eigen et al. [7] stacked multiple layers of MoE to create an exponentially\nincreasing number of inference paths. Shazeer et al. [36] used sparsely-gated MoE between stacked\nLSTM layers to expand model capacity with only a minor loss in computational ef\ufb01ciency. In [30],\narchitectural diversity is added by allowing experts to be an identity function or a pooling operation.\nKirsch et al. [21] modularized a network so that neural modules can be selected on a per-example\nbasis. 
In [33], a routing network is trained to choose appropriate function blocks for the input and task. This work interprets the origin of CapsNet's strengths as the behavior of MoE, and this perspective leads to our self-routing design.\nCNN fragility. Despite the great success of CNNs, many recent studies have raised concerns about their robustness [8, 9, 14]. Unlike humans, CNNs easily yield incorrect answers when presented with rotated images [8, 9]. Surprisingly, little improvement is observed in terms of noise robustness even for recent deep CNNs that are highly successful on image classification tasks [14]. However, their fragility may be hard to overcome by data augmentation techniques [9]. There are also a number of methods, called adversarial attacks, that fool CNNs by creating images whose fabrication is hardly perceptible even to humans [23, 29, 38, 37]. In this work, we evaluate the robustness of CapsNets, which are proposed as an alternative or a supplement for deep CNNs. In section 5, we demonstrate that augmenting only one routing layer structured by capsules can significantly improve the robustness against adversarial attacks and affine transformations.\n\n3 Preliminaries\n\nWe first review the basics of capsule networks and the two most popular routing algorithms: dynamic [34] and EM routing [16].\n\n3.1 Capsule Formulation\n\nA capsule network [16, 34] is composed of layers of capsules. Let $\Omega_l$ denote the set of capsules in layer $l$. Each capsule $i \in \Omega_l$ has a pose vector $u_i$ and an activation scalar $a_i$. In addition, a weight matrix $W^{pose}_{ij}$ for every capsule $j \in \Omega_{l+1}$ predicts pose changes: $\hat{u}_{j|i} = W^{pose}_{ij} u_i$. The pose vector of capsule $j$ is a linear combination (or together with an activation function) of the prediction vectors: $u_j = \sum_i c_{ij} \hat{u}_{j|i}$, where $c_{ij}$ is a routing coefficient determined by a routing algorithm. 
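The prediction-and-combination step above can be sketched numerically as follows (a minimal NumPy sketch; the layer sizes, pose dimension, and random weights are illustrative assumptions, not values from the paper):

```python
import numpy as np

# Illustrative sketch of one capsule layer's prediction step; all shapes and
# random values are assumptions for the example, not from the paper.
rng = np.random.default_rng(0)
n_lower, n_upper, d = 8, 4, 16                      # |Omega_l|, |Omega_{l+1}|, pose dim
u = rng.normal(size=(n_lower, d))                   # lower-level pose vectors u_i
W_pose = rng.normal(size=(n_lower, n_upper, d, d))  # transformation matrices W^pose_ij

# Each lower-level capsule i predicts the pose of each upper-level capsule j.
u_hat = np.einsum('ijab,ib->ija', W_pose, u)        # u_hat_{j|i} = W^pose_ij u_i

# Given routing coefficients c_ij (uniform here, i.e. before any routing),
# the upper-level pose is a linear combination of the predictions.
c = np.full((n_lower, n_upper), 1.0 / n_upper)      # each row sums to 1
u_next = np.einsum('ij,ija->ja', c, u_hat)          # u_j = sum_i c_ij u_hat_{j|i}
```

Routing algorithms differ only in how the coefficients $c_{ij}$ are computed; here they are fixed to a uniform distribution purely for illustration.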
In the convolutional case, capsules within a $K \times K$ neighborhood in $\Omega_l$ define a capsule in $\Omega_{l+1}$, where $K$ is the kernel size. The formulation of capsules varies according to the design of the routing algorithm. In dynamic routing [34], the pose is defined as a vector, and its length is used as its activation. In EM routing [16], the pose is defined as a matrix, and the activation scalar is separately defined. In our method, we use a vector for the pose with a separate activation scalar.\n\n3.2 Dynamic and EM Routing\n\nA capsule is activated when multiple predictions by lower-level capsules agree. In other words, the activation depends on how tightly the prediction vectors are clustered. The routing coefficients from a lower-level capsule to all upper-level capsules sum to 1 (e.g. $\sum_j c_{ij} = 1$), and are iteratively adjusted so that the lower-level capsule $i$ has more influence on the upper-level capsule $j$ when the $i$-th prediction is close to the mean of the predictions that $j$ receives. We review below how the two most popular routing methods compute the routing coefficient $c_{ij}$ from capsule $i$ to $j$.\nIn dynamic routing [34], the cosine similarity is used to measure the agreement. The routing logits $b_{ij}$ are initialized to 0 and adjusted iteratively by the following equation:\n\n$b^{(t+1)}_{ij} \leftarrow b^{(t)}_{ij} + \hat{u}_{j|i} \cdot u^{(t)}_j$ for $t = 1, \cdots, k$,   (1)\n\nwhere $t$ is the iteration number. The routing coefficients $c_{ij}$ are obtained by applying softmax to $b_{ij}$ along the $j$-th dimension. $u^{(t)}_j$ is obtained by applying a non-linear squash function $f_{squash}(s) = \frac{\|s\|^2}{1 + \|s\|^2} \frac{s}{\|s\|}$ to $\sum_i c_{ij} \hat{u}_{j|i}$.\nIn EM routing [16], it is assumed that the probability density of $\hat{u}_{j|i}$ follows capsule $j$'s Gaussian model with mean $\mu_j$ and variance $\mathrm{diag}[\sigma^2_{j,1}, \sigma^2_{j,2}, \cdots, \sigma^2_{j,h}]$, where $h$ is the dimension of $u_j$ and $\hat{u}_{j|i}$: $\hat{u}_{j|i} \sim \mathcal{N}(\mu_j, \mathrm{diag}[\sigma^2_{j,1}, \cdots, \sigma^2_{j,h}])$ for all $i \in \Omega_l$. The following iterative routing process alternately updates $c$ and $u_j$ for $k$ iterations, with the routing coefficients $c^{(1)}_{ij}$ initialized uniformly; for each iteration $t = 1, \cdots, k$,\n\n$(a^{(t)}_j, \mu^{(t)}_j, \sigma^{(t)}_j) \leftarrow \text{M-step}(a_i, c^{(t)}_{ij}, \hat{u}_{j|i}), \quad c^{(t+1)}_{ij} \leftarrow \text{E-step}(a^{(t)}_j, \mu^{(t)}_j, \sigma^{(t)}_j)$,   (2)\n\nwhere $\mathrm{cost}_j = \sum_{i \in \Omega_l} c^{(t)}_{ij} (\beta_u - \log p(\hat{u}_{j|i}))$ and $a^{(t)}_j = \mathrm{sigmoid}(\lambda(\beta_a - \mathrm{cost}^{(t)}_j))$. The outputs are $a_j = a^{(k)}_j$ and $u_j = \mu^{(k)}_j$. $\beta_a$ and $\beta_u$ are learned discriminatively, and $\lambda$ is a hyperparameter that increases during training with a fixed schedule.\n\n4 Approach\n\nIn section 4.1, we discuss the two distinctive strengths of CapsNets. In section 4.2, we propose our approach of self-routing, whose motivation is to maximize the merits of CapsNets while minimizing undesired side effects.\n\n\f(a) Capsule networks  (b) Mixture-of-Experts\nFigure 1: Comparison between (a) capsule networks and (b) mixture-of-experts. Given routing coefficients, the only computational difference is that capsule networks use disjoint input representations for each capsule unit.\n\n(a) Routing-by-Agreement  (b) Self-Routing\nFigure 2: Conceptual comparison between (a) routing-by-agreement and (b) our proposed self-routing. In self-routing, subordinate routing networks $W^{route}_i$, fed the pose vectors $u_i$, are used to obtain routing coefficients $c_{ij}$ rather than unsupervised clustering on prediction vectors $\hat{u}_{j|i}$.\n\n4.1 Motivation\n\nWe view that CapsNets have two distinctive characteristics compared to the conventional CNNs: (i) the behavior of mixture-of-experts (MoE) and (ii) the noise filtering mechanism.\nBehavior of MoE. 
Capsules specialize in disjoint regions of feature space and make multiple\npredictions based on information available to them for the regions. In each capsule layer, this\nstructure naturally forms an ensemble of submodules that are activated differently per example, in a\nway similar to Mixture-of-Experts (MoE) [18, 7, 36, 20]. Compared to MoE, the division of labor is\nmore explicit as different capsules do not share the same feature space (see Figure 1). Nonetheless, the\ndiscriminative power of each capsule may not be as strong as in CNNs, where the entire feature map\nin each layer is utilized to produce a single output of the layer. However, by aggregating predictions\nfrom weaker modules that have different parameters, it effectively prevents over\ufb01tting and thus can\nreduce the output variance with respect to small input variations.\nNoise \ufb01ltering mechanism. In CapsNets, initial gating values (i.e. activation scalars) of capsules\nare dynamically adjusted. The routing-by-agreement ensures that the predictions far from general\nconsensus have a lesser in\ufb02uence on the output of each layer. In other words, the process can\neffectively \ufb01lter out the contribution of submodules having possibly noisy information. Additionally,\noutput capsules of which predictions have high variance are further suppressed. On the other hand,\nCNNs have no mechanism of segmenting potentially noisy channels from unnoisy ones.\nIn this work, we aim to design a routing method that mainly focuses on the \ufb01rst characteristic of\nCapsNets. The second property is bene\ufb01cial but brought by the agreement-based routing, which\nunfortunately causes two critical side effects: (i) high computational complexity and (ii) assumptions\non cluster distribution of prediction vectors. 
Specifically, the previous routing methods assume a spherical or normal distribution of prediction vectors, which is unlikely to hold due to the high variability and noisiness of real-world data. In fact, clustering noisy data is still a challenging task. Hence, our key intuition is to remove the notion of agreement from the routing process and instead introduce a learnable routing network for each capsule (section 4.2). Although the clustering in the agreement-based routing can help to reduce the variance of the output, we empirically observe that, even without the routing-by-agreement, the simple weighted average of the new routing method can have a similar effect. That is, the unreliability of a single prediction can be mitigated by ensemble averaging, since the errors of the submodules (capsules) average out to provide a stable combined output.\n\n\f4.2 Self-Routing\n\nWe name the CapsNet model with our proposed self-routing as SR-CapsNet. Figure 2(a)–(b) illustrate the high-level difference between the self-routing and the routing-by-agreement. In the self-routing, each capsule determines its routing coefficients by itself, without coordinating the agreement with peer capsules. Instead, each capsule is endowed with higher modeling power by a subordinate routing network. Following the MoE literature [36], we design the routing network as single-layer perceptrons, although it is straightforward to extend it to an MLP. The routing coefficients also work as the predicted activations of output capsules. 
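As a concrete illustration, this self-routing step can be sketched as follows (a minimal NumPy sketch; the shapes, the use of dense arrays instead of the paper's convolutional capsule layers, and the random weights are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Minimal self-routing sketch; all shapes/values are illustrative assumptions.
rng = np.random.default_rng(0)
n_lower, n_upper, d = 8, 4, 16
u = rng.normal(size=(n_lower, d))                   # lower-level poses u_i
a = rng.uniform(size=n_lower)                       # lower-level activations a_i
W_route = rng.normal(size=(n_lower, n_upper, d))    # one routing perceptron per capsule
W_pose = rng.normal(size=(n_lower, n_upper, d, d))  # pose transformation matrices

# Each capsule routes itself: c_i* = softmax(W_route_i u_i); no agreement loop.
c = softmax(np.einsum('ijd,id->ij', W_route, u), axis=1)

# Weighted votes: the upper-level activation is the normalized sum of c_ij * a_i.
a_next = (c * a[:, None]).sum(axis=0) / a.sum()

# Upper-level pose: weighted average of the pose predictions u_hat_{j|i}.
u_hat = np.einsum('ijab,ib->ija', W_pose, u)
w = (c * a[:, None])[:, :, None]
u_next = (w * u_hat).sum(axis=0) / w.sum(axis=0)
```

Because each row of `c` sums to 1 and the votes are normalized, the upper-level activations in this sketch sum to 1, and the whole layer is a single feed-forward pass.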
That is, an upper-level capsule is more likely to be activated if more capsules have high routing coefficients to it.\nThe self-routing involves two learnable weight matrices, $W^{route}$ and $W^{pose}$, which are used to compute the routing coefficients $c_{ij}$ and the predictions $\hat{u}_{j|i}$, respectively. Each layer of the routing network multiplies the pose vector $u_i$ by a trainable weight matrix $W^{route}_i$ and outputs the routing coefficients $c_{i*}$ via a softmax layer. The routing coefficients $c_{ij}$ are then multiplied by the capsule's activation scalar $a_i$ to generate weighted votes. The activation $a_j$ of an upper-layer capsule is simply the summation of the weighted votes of lower-level capsules over the spatial dimensions $H \times W$ (or $K \times K$ when using convolution). In summary,\n\n$c_{ij} = \mathrm{softmax}(W^{route}_i u_i)_j, \quad a_j = \frac{\sum_{i \in \Omega_l} c_{ij} a_i}{\sum_{i \in \Omega_l} a_i}.$   (3)\n\nThe pose of the upper-layer capsule is determined by the weighted average of the prediction vectors:\n\n$\hat{u}_{j|i} = W^{pose}_{ij} u_i, \quad u_j = \frac{\sum_{i \in \Omega_l} c_{ij} a_i \hat{u}_{j|i}}{\sum_{i \in \Omega_l} c_{ij} a_i}.$   (4)\n\n5 Experiments\n\nIn experiments, we focus on comparing our self-routing scheme with the two most important agreement-based routing algorithms from multiple perspectives. We first evaluate the image classification performance (section 5.2). We then compare the robustness against unseen input perturbations, since such generalization abilities have been the key motivation of CapsNets. In particular, we examine the robustness against adversarial examples (section 5.3) and viewpoint changes by affine transformation (section 5.4). Our full source code is available at http://vision.snu.ac.kr/projects/self-routing.\n\n5.1 Experimental Settings\n\nDatasets. 
Following the CapsNet literature, we mostly use the two classification benchmarks CIFAR-10 [22] and SVHN [42], and additionally SmallNORB [24] for the affine transformation tests. During training on CIFAR-10 and SVHN, we augment with random crops, random horizontal flips, and normalization. For SmallNORB, we follow the setting of [16]; we downsample training images to 48 × 48, randomly crop 32 × 32 patches, and add random brightness and contrast. Test images are center-cropped after downsampling.\nArchitectures. For a fair comparison, we let all routing algorithms share the same base CNN architecture. We choose ResNet-20 [13], designed for CIFAR-10 classification, for the following two reasons. First, the ResNet is one of the best performing CNN architectures in various computer vision applications [4, 11, 26, 31, 32]. It would be interesting to verify whether employing capsule structures can benefit mainstream CNNs. Second, the ResNet is mostly composed of Conv layers, which makes it easy to implement CapsNets due to the similar structure between them. Note that CapsNets consist of only Conv layers and capsule layers. Given that ResNet-20 [13] consists of 19 Conv layers followed by the last average pooling and FC layers, we can build a CapsNet on top of it by replacing the last two layers by a primary capsule (PrimaryCaps) and fully-connected capsule (FCCaps) layer, respectively. However, since the SmallNORB dataset is much less complex than CIFAR-10/SVHN and its training set is relatively small (16,200 samples), we use a 7-layer CNN, which is a shallower network with no shortcut connection. It consists of 6 Conv layers, followed by AvgPool and FC layers.\nWe also measure the performance variations according to the depth and width of capsule layers. The depth means how many routing layers are inserted after the PrimaryCaps layer. 
\fTable 1: Comparison of parameter counts (M), FLOPs (M), and error rates (%) between routing algorithms of CapsNets and CNN models. We use ResNet-20 as the base network. DR, EM, and SR stand for dynamic [34], EM [16], and self-routing, respectively. The number following (method-) is the number of stacked capsule layers on top of Conv layers. All CapsNets have 32 capsules in each layer. We test each model 5 times with different random seeds. Error rates reported below are their averages.\n\nMethods  # Param. (M)  # FLOPs (M)  CIFAR-10    SVHN\nAvgPool  0.3           41.3         7.94±0.21   3.55±0.11\nConv     0.9           61.0         10.01±0.99  3.98±0.15\nDR-1     5.8           73.5         8.46±0.27   3.49±0.69\nDR-2     4.2           232.1        7.86±0.21   3.17±0.09\nEM-1     0.9           76.6         10.25±0.45  3.85±0.13\nEM-2     0.8           173.8        12.52±0.32  3.70±0.35\nSR-1     0.9           62.2         8.17±0.18   3.34±0.08\nSR-2     3.2           140.3        7.86±0.12   3.12±0.13\n\nThus, the depth of 1 means the final layers of the network are PrimaryCaps+FCCaps, between which the routing is performed once. The depth of 2 indicates the final layers are PrimaryCaps+ConvCaps+FCCaps, so that the routing is performed twice between consecutive capsule layers. Therefore, the depth of $d$ involves $d - 1$ ConvCaps layers inserted between PrimaryCaps and FCCaps. All ConvCaps layers have a kernel size of 3 and a stride of 1, except for the first ConvCaps layer, which has a stride of 2. The width indicates the number of capsules in each capsule layer; for example, the width of 8 means there are 8 capsules in all Primary/ConvCaps layers of the architecture.\nCNN baselines. We also test two variants of CNNs for comparison between CapsNets and conventional CNNs. 
Since the former CONV layers of the base networks are shared by all the models, we\nvary the last two layers as (1) AvgPool+FC as the original architecture and (2) Conv+FC for verifying\nwhether the performance obtained by CapsNets is simply caused by using more parameters, not by\ntheir structures or routing algorithms.\nWe describe more details of experiments in the Appendix, including architectures and optimization.\n\n5.2 Results of Image Classi\ufb01cation\n\nWe compare the image classi\ufb01cation performance of self-routing with two agreement-based routing\nalgorithms on SVHN [42] and CIFAR-10 [22]. Table 1 summarizes the error rates as well as\nmemory and computation overheads of each method. Self-routing and dynamic routing [34] show\nsimilar classi\ufb01cation accuracies to CNN baselines, while EM routing [16] degrades the performances.\nImportantly, the computation overheads of self-routing in FLOPs are less than those of other routing\nbaselines, since it requires no iterative routing computations. In terms of the parameter size, EM\nrouting is the most ef\ufb01cient due to its matrix representations. Yet, we \ufb01nd that it is hard to train the\nnetworks with stacked EM routing layers on datasets more complex than initially tested ones [16]\n(e.g. SmallNORB). It seems that the constraint of 4 \u00d7 4 weight matrix multiplications is too strong to\nlearn good representations from complex data with multi-level routing layers. Dynamic routing is\ncomparable to our self-routing in performance, but it requires much more FLOPs for computation. We\n\ufb01nd that CapsNet variants equipped with self-routing are better than CNN baselines, but the margins\nare not large. It seems that the bene\ufb01t of using the ensemble of weak submodules is relatively weak\nsince the degree of required specialization is small for the task and the datasets. 
In fact, MoE-style deep models are commonly strong in tasks that require obvious specialization, such as multi-task learning [2, 33, 30].\n\n5.3 Robustness to Adversarial Examples\n\nAdversarial examples are inputs that are intentionally crafted to trick recognition models into misclassification. Numerous defensive methods against such attacks have been suggested [6, 28]. In [16], CapsNets have shown considerable robustness to such attacks without any reactive modification. Therefore, we evaluate our model's robustness to adversarial examples. We use the targeted and untargeted white-box Fast Gradient Sign Method (FGSM) [10] to generate adversarial examples. FGSM first computes the gradient of the loss with respect to the input pixels and then adds the signs of the obtained tensor, scaled by a fixed amount ε, to the input pixels. For a fair comparison, we attack only images for which the predictions of each model are correct.\n\n\f(a) PrimaryCaps+FCCaps (depth of 1)  (b) PrimaryCaps+ConvCaps+FCCaps (depth of 2)\nFigure 3: Success rates (%) of untargeted and targeted FGSM attacks against different routing methods of CapsNets and CNN models. All CapsNets have 32 capsules in each layer. We set ε = 0.1 for all FGSM attacks. All results are obtained with 5 random seeds.\n\nFigure 4: The variation of success rates of untargeted and targeted FGSM attacks according to the number of capsules per layer. The self-routing improves robustness as the number of capsules increases. We set ε = 0.1 for all FGSM attacks. All results are obtained with 5 random seeds.\n\nFigure 3 summarizes the results. CNN baselines are significantly more vulnerable to both untargeted and targeted FGSM attacks than the CapsNets, among which our SR-CapsNets are the most robust. Figure 4 shows that adding more capsules per layer further improves the robustness of our SR-CapsNets, while stacking more capsule layers helps against the untargeted attack only. 
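The FGSM perturbation described above can be sketched as follows (a minimal NumPy sketch on a toy linear classifier; the model, shapes, and label are illustrative assumptions, since the attack only needs the loss gradient with respect to the input):

```python
import numpy as np

# Minimal FGSM sketch on a toy linear softmax classifier; the model and
# shapes are illustrative assumptions, not the paper's CapsNet.
rng = np.random.default_rng(0)
W = rng.normal(size=(10, 32 * 32))       # toy weights: 10 classes, flattened image
x = rng.uniform(size=32 * 32)            # input "image" in [0, 1]
y = 3                                    # true label

def grad_ce_wrt_input(W, x, y):
    # Gradient of cross-entropy loss w.r.t. x for logits z = W @ x.
    z = W @ x
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                          # dL/dz = softmax(z) - onehot(y)
    return W.T @ p                       # chain rule: dL/dx = W^T dL/dz

eps = 0.1
# Untargeted FGSM: step in the sign of the gradient, then clip to valid pixels.
x_adv = np.clip(x + eps * np.sign(grad_ce_wrt_input(W, x, y)), 0.0, 1.0)
```

A targeted variant would instead step in the negative gradient direction of the loss toward a chosen target class.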
We cannot find a similar pattern in other routing algorithms. Often, their performance degrades when more capsules are used. This suggests that the routing-by-agreement struggles to cluster predictions when large portions of them involve noisy information.\nWe also attach the results obtained with another adversarial attack, BIM [23], in the Appendix.\n\n5.4 Robustness to Affine Transformation\n\nOne of CapsNets' known strengths is their generalization ability to novel viewpoints [16]. To demonstrate this, we measure the classification performance on the SmallNORB [24] test sets with novel viewpoints. Following the experimental protocol of [16], we train all models on 1/3 of the training data with azimuths of 0, 20, 40, 300, 320, and 340 degrees, and test them on 2/3 of the test data with the other azimuths.\n\n\fTable 2: Comparison of error rates (%) on the SmallNORB test set with the 7-layer CNN as the base architecture. Familiar and Novel denote the results on the test samples with seen and unseen viewpoints during training, respectively. All CapsNets have 32 capsules in each layer. All results are obtained with 10 random seeds.\n\n         Azimuth                    Elevation\nMethods  Familiar    Novel         Familiar    Novel\nAvgPool  8.49±0.45   21.76±1.18    5.68±0.72   17.72±0.30\nConv     8.39±0.56   22.07±1.02    7.51±1.09   18.78±0.67\nDR-1     6.86±0.50   20.33±1.32    5.78±0.48   16.37±0.90\nEM-1     7.36±0.89   20.16±0.96    5.97±0.98   17.51±1.52\nSR-1     7.62±0.95   19.86±1.03    5.96±0.46   15.91±1.09\n\nFigure 5: Comparison of error rates (%) on the SmallNORB test set without the CNN base. All results are obtained with 5 random seeds.\n
In another experiment, models are trained on 1/3 of the training data with elevations of 30, 35, and 40 degrees from the horizontal, and tested on 2/3 of the test data with the other elevations.\nTable 2 shows that capsule-based models generalize better than the CNN baselines. Specifically, self-routing has the best performance for the images with novel azimuths and elevations. The results suggest that viewpoint generalization is not a unique strength of the routing-by-agreement. We also find a weak correlation between the increase in model size (i.e. depth and width of capsule layers) and generalization performance, but the improvement is small.\nAlthough the experiments with the 7-layer CNN base show that CapsNets generalize better than CNNs, the margins between different routing methods are not significant. Thus, we conduct additional experiments on SmallNORB with a smaller network that consists of only one convolution layer followed by three consecutive capsule layers of PrimaryCaps+ConvCaps+FCCaps. The capsule layers are composed of 16 capsules, each with 16 neurons. Figure 5 shows that our self-routing (SR) outperforms dynamic and EM routing by significant margins in both tasks. That is, with shallow feature extractors, the previous routing techniques can struggle to learn good representations.\n\n6 Conclusion\n\nWe proposed a supervised, non-iterative routing method for capsule-based models with better computational efficiency. We conducted systematic experiments for the comparison between the existing routing methods and our self-routing. The experiments verified that our method achieves competitive performance on adversarial defense and viewpoint generalization, which are the two proposed strengths of CapsNets. Moreover, our method generally performs better when more capsules are used per layer, while the previous methods often behave unstably. 
The results suggest that the routing-by-agreement may not be a requirement for CapsNets' robustness. As future work, it would be interesting to look for a method that brings residual connections to our models, since residual networks have been shown to behave like ensembles of networks with different depths [39]. This could be synergetic with our models, where capsules take different paths of the same depth.

Acknowledgements. This work was supported by the Samsung Research Funding Center of Samsung Electronics (No. SRFC-IT1502-51) and the ICT R&D program of MSIT/IITP (No. 2019-0-01309, Development of AI technology for guidance of a mobile robot to its goal with uncertain maps in indoor/outdoor environments). Gunhee Kim is the corresponding author.

References

[1] M. A. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W.-S. Ku, and A. Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553, 2018.

[2] R. Aljundi, P. Chakravarty, and T. Tuytelaars. Expert gate: Lifelong learning with a network of experts. In CVPR, 2017.

[3] M. T. Bahadori. Spectral capsule networks. In ICLR workshop, 2018.

[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.

[5] T. Cohen and M. Welling. Group equivariant convolutional networks. In ICML, 2016.

[6] F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of ReLU networks via maximization of linear regions. arXiv preprint arXiv:1810.07481, 2018.

[7] D. Eigen, M. Ranzato, and I. Sutskever. Learning factored representations in a deep mixture of experts. In ICLR Workshops, 2014.

[8] L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry.
A rotation and a translation suffice: Fooling CNNs with simple transformations. arXiv preprint arXiv:1712.02779, 2017.

[9] R. Geirhos, C. R. Temme, J. Rauber, H. H. Schütt, M. Bethge, and F. A. Wichmann. Generalisation in humans and deep neural networks. In NeurIPS, 2018.

[10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.

[11] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, 2017.

[12] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In ICCV, 2015.

[13] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.

[14] D. Hendrycks and T. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR, 2019.

[15] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.

[16] G. E. Hinton, S. Sabour, and N. Frosst. Matrix capsules with EM routing. In ICLR, 2018.

[17] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[18] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.

[19] A. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan. CapsuleGAN: Generative adversarial capsule network. In ECCV, 2018.

[20] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214, 1994.

[21] L. Kirsch, J. Kunze, and D. Barber. Modular Networks: Learning to decompose neural computation. In NeurIPS, 2018.

[22] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[23] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[24] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.

[25] H. Li, X. Guo, B. Dai, W. Ouyang, and X. Wang. Neural network encapsulation. In ECCV, pages 252–267, 2018.

[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

[27] W. Liu, E. Barsoum, and J. D. Owens. Object localization and motion transfer learning with capsules. arXiv preprint arXiv:1805.07706, 2018.

[28] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[29] A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015.

[30] P. Ramachandran and Q. V. Le. Diversity and depth in per-example routing models. In ICLR, 2019.

[31] J. Redmon and A. Farhadi. YOLO9000: Better, faster, stronger. In CVPR, 2017.

[32] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.

[33] C. Rosenbaum, T. Klinger, and M. Riemer. Routing Networks: Adaptive selection of non-linear functions for multi-task learning. arXiv preprint arXiv:1711.01239, 2017.

[34] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In NeurIPS, pages 3856–3866, 2017.

[35] S. Saito and S. Roy. Effects of loss functions and target representations on adversarial robustness. arXiv preprint arXiv:1812.00181, 2018.

[36] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

[37] J. Su, D. V. Vargas, and K. Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.

[38] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[39] A. Veit, M. J. Wilber, and S. Belongie. Residual networks behave like ensembles of relatively shallow networks. In NeurIPS, 2016.

[40] S. Verma and Z.-L. Zhang. Graph capsule convolutional neural networks. arXiv preprint arXiv:1805.08090, 2018.

[41] D. Wang and Q. Liu. An optimization view on dynamic routing between capsules. In ICLR workshop, 2018.

[42] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.