{"title": "Group Equivariant Capsule Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 8844, "page_last": 8853, "abstract": "We present group equivariant capsule networks, a framework to introduce guaranteed equivariance and invariance properties to the capsule network idea. Our work can be divided into two contributions. First, we present a generic routing by agreement algorithm defined on elements of a group and prove that equivariance of output pose vectors, as well as invariance of output activations, hold under certain conditions. Second, we connect the resulting equivariant capsule networks with work from the field of group convolutional networks. Through this connection, we provide intuitions of how both methods relate and are able to combine the strengths of both approaches in one deep neural network architecture. The resulting framework allows sparse evaluation of the group convolution operator, provides control over specific equivariance and invariance properties, and can use routing by agreement instead of pooling operations. In addition, it is able to provide interpretable and equivariant representation vectors as output capsules, which disentangle evidence of object existence from its pose.", "full_text": "Group Equivariant Capsule Networks\n\nJan Eric Lenssen\n\nMatthias Fey\n\nPascal Libuschewski\n\nTU Dortmund University - Computer Graphics Group\n\n44227 Dortmund, Germany\n\n{janeric.lenssen, matthias.fey, pascal.libuschewski}@udo.edu\n\nAbstract\n\nWe present group equivariant capsule networks, a framework to introduce guar-\nanteed equivariance and invariance properties to the capsule network idea. Our\nwork can be divided into two contributions. First, we present a generic routing by\nagreement algorithm de\ufb01ned on elements of a group and prove that equivariance\nof output pose vectors, as well as invariance of output activations, hold under\ncertain conditions. 
Second, we connect the resulting equivariant capsule networks\nwith work from the \ufb01eld of group convolutional networks. Through this connec-\ntion, we provide intuitions of how both methods relate and are able to combine\nthe strengths of both approaches in one deep neural network architecture. The\nresulting framework allows sparse evaluation of the group convolution operator,\nprovides control over speci\ufb01c equivariance and invariance properties, and can\nuse routing by agreement instead of pooling operations. In addition, it is able to\nprovide interpretable and equivariant representation vectors as output capsules,\nwhich disentangle evidence of object existence from its pose.\n\n1\n\nIntroduction\n\nConvolutional neural networks heavily rely on equivariance of the convolution operator under\ntranslation. Weights are shared between different spatial positions, which reduces the number\nof parameters and pairs well with the often occurring underlying translational transformations in\nimage data. It naturally follows that a large amount of research is done to exploit other underlying\ntransformations and symmetries and provide deep neural network models with equivariance or\ninvariance under those transformations (cf. Figure 1). Further, equivariance and invariance are useful\nproperties when aiming to produce data representations that disentangle factors of variation: when\ntransforming a given input example by varying one factor, we usually aim for equivariance in one\nrepresentation entry and invariance in the others. One recent line of methods that aim to provide a\nrelaxed version of such a setting are capsule networks.\nOur work focuses on obtaining a formalized version of capsule networks that guarantees those\nproperties and bringing them together with group equivariant convolutions by Cohen and Welling\n[2016], which also provide provable equivariance properties under transformations within a group. 
In\nthe following, we will shortly introduce capsule networks, as proposed by Hinton et al. and Sabour\net al., before we outline our contribution in detail.\n\n1.1 Capsule networks\n\nCapsule networks [Hinton et al., 2011] and the recently proposed routing by agreement algo-\nrithm [Sabour et al., 2017] represent a different paradigm for deep neural networks for vision\ntasks. They aim to hard-wire the ability to disentangle the pose of an object from the evidence of\nits existence, also called viewpoint equi- and invariance in the context of vision tasks. This is done\nby encoding the output of one layer as a tuple of a pose vector and an activation. Further, they are\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: The task of dynamic routing for capsules with concepts of equivariant pose vectors\nand invariant agreements. Layers with those properties can be used to build viewpoint invariant\narchitectures, which disentangle factors of variation.\n\ninspired by human vision and detect linear, hierarchical relationships occurring in the data. Recent\nadvances describe the dynamic routing by agreement method that iteratively computes how to route\ndata from one layer to the next. One capsule layer receives n pose matrices Mi, which are then\ntransformed by a trainable linear transformation Wi,j to cast n votes for the pose of the jth output\ncapsule:\n\nVi,j = Mi \u00b7 Wi,j.\n\nThe votes are used to compute a proposal for an output pose by a variant of weighted averaging.\nThe weights are then iteratively re\ufb01ned using distances between votes and the proposal. Last, an\nagreement value is computed as output activation, which encodes how strong the votes agree on the\noutput pose. 
The capsule layer outputs a set of tuples (M, a), each containing the pose matrix and\nthe agreement (as activation) of one output capsule.\n\n1.2 Motivation and contribution\n\nGeneral capsule networks do not come with guaranteed equivariances or invariances which are\nessential to guarantee disentangled representations and viewpoint invariance. We identi\ufb01ed two issues\nthat prevent exact equivariance in current capsule architectures: First, the averaging of votes takes\nplace in a vector space, while the underlying space of poses is a manifold. The vote averaging of\nvector space representations does not produce equivariant mean estimates on the manifold. Second,\ncapsule layers use trainable transformation kernels de\ufb01ned over a local receptive \ufb01eld in the spatial\nvector \ufb01eld domain, where the receptive \ufb01eld coordinates are agnostic to the pose. They lead to\nnon-equivariant votes and consequently, non-equivariant output poses. In this work, we propose\npossible solutions for these issues.\nOur contribution can be divided into the following parts. First, we present group equivariant capsule\nlayers, a specialized kind of capsule layer whose pose vectors are elements of a group (G,\u25e6)\n(cf. Section 2). Given this restriction, we provide a general scheme for dynamic routing by agreement\nalgorithms and show that, under certain conditions, equivariance and invariance properties under\ntransformations from G are mathematically guaranteed. Second, we tackle the issue of aggregating\nover local receptive \ufb01elds in group capsule networks (cf. Section 3). Third, we bring together capsule\nnetworks with group convolutions and show how the group capsule layers can be leveraged to\nbuild convolutional neural networks that inherit the guaranteed equi- and invariances and produce\ndisentangled representations (cf. Section 4). 
Last, we apply this combined architecture as a proof of\nconcept application of our framework to MNIST datasets and verify the properties experimentally.\n\n2 Group equivariant capsules\n\nWe begin with essential definitions for group capsule layers and the properties we aim to guarantee.\nGiven a Lie group (G, ◦), we formally describe a group capsule layer with m output capsules by a\nset of function tuples\n\n{(Lp^j(P, a), La^j(P, a)) | j ∈ {1, . . . , m}}.\n\n(1)\n\nHere, the functions Lp compute the output pose vectors while the functions La compute output activations,\ngiven input pose vectors P = (p1, ..., pn) ∈ Gn and input activations a ∈ Rn. Since our goal\nis to achieve global invariance and local equivariance under the group law ◦, we define those two\nproperties for one single group capsule layer (cf. Figure 1). First, the function computing the output\npose vectors of one layer is left-equivariant regarding applications of the group law if\n\nLp(g ◦ P, a) = g ◦ Lp(P, a),\n\n∀g ∈ G.\n\n(2)\n\nSecond, the function computing activations of one layer is invariant under applications of the group\nlaw ◦ if\n\nLa(g ◦ P, a) = La(P, a),\n\n∀g ∈ G.\n\n(3)\n\nSince equivariance is transitive, it can be deduced that stacking layers that fulfill these properties\npreserves both properties for the combined operation. Therefore, if we apply a transformation from G\non the input of a sequence of those layers (e.g.
a whole deep network), we do not change the resulting\noutput activations but produce output pose vectors which are transformed by the same transformation.\nThis sums up to fulfilling the vision of locally equivariant and globally invariant capsule networks.\n\n2.1 Group capsule layer\n\nWe define the group capsule layer functions as the output of an iterative routing by agreement, similar\nto the approach proposed by Sabour et al. [2017]. The whole algorithm, given a generic weighted\naverage operation M and a distance measure δ, is shown in Algorithm 1.\n\nAlgorithm 1 Group capsule layer\nInput: poses P = (p1, . . . , pn) ∈ Gn, activations a = (a1, . . . , an) ∈ Rn\nTrainable parameters: transformations ti,j\nOutput: poses ˆP = (ˆp1, . . . , ˆpm) ∈ Gm, activations ˆa = (ˆa1, . . . , ˆam) ∈ Rm\n------------------------------------------------------------------\nvi,j ← pi ◦ ti,j    for all input capsules i and output capsules j\nˆpj ← M((v1,j, . . . , vn,j), a)    ∀j\nfor r iterations do\n    wi,j ← σ(−δ(ˆpj, vi,j)) · ai    ∀i, j\n    ˆpj ← M((v1,j, . . . , vn,j), w:,j)    ∀j\nend for\nˆaj ← σ(−(1/n) Σi=1..n δ(ˆpj, vi,j))    ∀j\nReturn ˆp1, . . . , ˆpm, ˆa\n\nGenerally, votes are cast by applying trainable group elements ti,j to the input pose vectors pi (using\nthe group law ◦), where i and j are the indices for input and output capsules, respectively. Then, the\nagreement is iteratively computed: First, new pose candidates are obtained by using the weighted\naverage operator M. Second, the negative, shifted δ-distances between votes and pose candidates are used\nfor the weight update.
Last, the agreement is computed by averaging negative distances between\nvotes and the new pose. The functions σ can be chosen to be some scaling and shifting non-linearity,\nfor example σ(x) = sigmoid(α · x + β) with trainable α and β, or as a softmax over the output\ncapsule dimension.\n\nProperties of M and δ For the following theorems we need to define specific properties of M and\nδ. The mean operation M : Gn × Rn → G should map n elements of the group (G, ◦), weighted\nby values x = (x1, ..., xn) ∈ Rn, to some kind of weighted mean of those values in G. Besides the\nclosure, M should be left-equivariant under the group law, formally:\n\nM(g ◦ P, x) = g ◦ M(P, x),\n\n∀g ∈ G,\n\n(4)\n\nand invariant under permutations of the inputs. Further, the distance measure δ needs to be chosen so\nthat transformations g ∈ G are δ-distance preserving:\n\nδ(g ◦ g1, g ◦ g2) = δ(g1, g2),\n\n∀g ∈ G.\n\n(5)\n\nGiven these preliminaries, we can formulate the following two theorems.\n\nTheorem 1. Let M be a weighted averaging operation that is equivariant under left-applications\nof g ∈ G and let G be closed under applications of M. Further, let δ be chosen so that all g ∈ G\nare δ-distance preserving. Then, the function Lp(P, a) = (ˆp1, . . . , ˆpm), defined by Algorithm 1, is\nequivariant under left-applications of g ∈ G on input pose vectors P ∈ Gn:\n\nLp(g ◦ P, a) = g ◦ Lp(P, a),\n\n∀g ∈ G.\n\n(6)\n\nProof. The theorem follows by induction over the inner loop of the algorithm, using the equivariance\nof M, δ-preservation, and group properties. The full proof is provided in the appendix.\n\nTheorem 2. Given the same conditions as in Theorem 1. Then, the function La(P, a) = (ˆa1, . . .
, ˆam)\ndefined by Algorithm 1 is invariant under joint left-applications of g ∈ G on input pose vectors\nP ∈ Gn:\n\nLa(g ◦ P, a) = La(P, a),\n\n∀g ∈ G.\n\n(7)\n\nProof. The result follows by applying Theorem 1 and the δ-distance preservation. The full proof is\nprovided in the appendix.\n\nGiven these two theorems (and the method proposed in Section 3), we are able to build a deep\ngroup capsule network, by a composition of those layers, that guarantees global invariance in output\nactivations and equivariance in pose vectors.\n\n2.2 Examples of useful groups\n\nGiven the proposed algorithm, M and δ have to be chosen based on the chosen group and element\nrepresentations. A canonical application of the proposed framework on images is achieved by\nusing the two-dimensional rotation group SO(2). We chose to represent the elements of G as\ntwo-dimensional unit vectors, M as the renormalized, Euclidean, weighted mean, and δ as the\nnegative scalar product. Further higher-dimensional groups include the three-dimensional rotation\ngroup SO(3) and GL(n, R), the group of general invertible matrices. Other potentially interesting\napplications of group capsules are translation groups. Further discussion about them, as well as the\nother groups, can be found in the appendix.\n\nGroup products It should be noted that using the direct product of groups allows us to apply our\nframework to group combinations. Given two groups (G, ◦G) and (H, ◦H), we can construct the\ndirect product group (G, ◦G) × (H, ◦H) = (G × H, ◦), with (g1, h1) ◦ (g2, h2) = (g1 ◦G g2, h1 ◦H h2).\nThus, for example, the product SO(2) × (R2, +) is again a group. Therefore, Theorems 1 and 2 also\napply for those combinations.
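As a concrete illustration of the SO(2) instantiation above, the following is a minimal NumPy sketch (our own, not the authors' reference implementation) of Algorithm 1 with unit-vector poses, the renormalized weighted Euclidean mean as M, and the negative scalar product as δ; the function names `route`, `compose`, and `delta` are ours:

```python
import numpy as np

def compose(g, p):
    # SO(2) group law on unit-vector representatives (complex multiplication)
    return np.array([g[0] * p[0] - g[1] * p[1], g[0] * p[1] + g[1] * p[0]])

def M(votes, w):
    # renormalized weighted Euclidean mean; left-equivariant under rotation
    m = (w[:, None] * votes).sum(axis=0)
    return m / np.linalg.norm(m)

def delta(p, q):
    # negative scalar product; rotations are delta-distance preserving
    return -float(np.dot(p, q))

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

def route(poses, acts, t, r=2):
    # group capsule layer (Algorithm 1) for SO(2) unit-vector poses
    n, m = t.shape[0], t.shape[1]
    votes = np.array([[compose(poses[i], t[i, j]) for j in range(m)]
                      for i in range(n)])                     # shape (n, m, 2)
    p_hat = np.stack([M(votes[:, j], acts) for j in range(m)])
    for _ in range(r):
        w = np.array([[sigma(-delta(p_hat[j], votes[i, j])) * acts[i]
                       for j in range(m)] for i in range(n)])
        p_hat = np.stack([M(votes[:, j], w[:, j]) for j in range(m)])
    a_hat = np.array([sigma(-np.mean([delta(p_hat[j], votes[i, j])
                                      for i in range(n)])) for j in range(m)])
    return p_hat, a_hat
```

Rotating all input poses by some g rotates the output poses by g and leaves the activations unchanged up to floating-point error, matching Theorems 1 and 2.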
As a result, the pose vectors contain independent poses for each group,\nkeeping information disentangled between the individual ones.\n\n3 Spatial aggregation with group capsules\n\nThis section describes our proposed spatial aggregation method for group capsule networks. As\npreviously mentioned, current capsule networks perform spatial aggregation of capsules, which does\nnot result in equivariant poses. When the input of a capsule network is transformed, not only do the\ndeeper pose vectors change accordingly. Since vector fields of poses are computed, the positions\nof those pose vectors in Rn might also change based on the transformation, formally modeled\nusing the concept of induced representations [Cohen et al., 2018]. The trainable transformations\nt, however, are defined for fixed positions of the local receptive field, which are agnostic to those\ntranslations. Therefore, the composition of pose vectors and trainable transformations to compute the\nvotes depends on the input transformation, which prevents equivariance and invariance.\nFormally, the votes vi computed in a capsule layer over a local receptive field can be described by\n\nvi = g ◦ p(g⁻¹(xi)) ◦ t(xi),\n\n(8)\n\nwhere xi is a receptive field position, p(xi) the input pose at position xi, t(xi) the trainable transformation at position xi, and g the input transformation. It can be seen that we do not receive a set of\nequivariant votes vi since the matching of p(·) and t(·) varies depending on g. A visual example of\nthe described issue (and a counterexample for equivariance) for an aggregation over a 2 × 2 block\nand G = SO(2) can be found in Figures 2a and 2b.\n\n(a) Non-rotated input and poses\n\n(b) Rotated input, false matching\n\nFigure 2: Example for the spatial aggregation of a 2×2 block of SO(2) capsules. Figure (a) shows the\nbehavior for non-rotated inputs. The resulting votes have full agreement, pointing to the top.
Figure\n(b) shows the behavior when rotating the input by π/2, where we obtain a different element-wise\nmatching of pose vectors p(·) and transformations t(·), depending on the input rotation. Figure (c)\nshows the behavior with the proposed kernel alignment. It can be seen that p and t match again and\nthe result is the same full pose agreement as in (a), with an equivariant mean pose pointing to the left.\n\n(c) Pose-aligned t-kernels\n\nPose-aligning transformation kernels As a solution, we propose to align the constant positions\nxi based on the pose before using them as input for a trainable transformation generator t(·). We\ncan compute ¯p = M((p1, . . . , pn), 1), a mean pose vector for the current receptive field, given local\npose vectors p1, . . . , pn. The mean poses of transformed and non-transformed inputs differ by\nthe transformation g: ¯p = g ◦ ¯q. This follows from the equivariance of M, the invariance of M under\npermutation, and the equivariance property of previous layers, meaning that the rotation applied\nto the input directly translates to the pose vectors in deeper layers. Therefore, we can apply the\ninverse mean pose ¯p⁻¹ = ¯q⁻¹ ◦ g⁻¹ to the constant input positions x of t and calculate the votes as\n\nvi = g ◦ p(g⁻¹(xi)) ◦ t((¯q⁻¹ ◦ g⁻¹)(xi)) = g ◦ p(ˆxi) ◦ t(¯q⁻¹(ˆxi)),\n\n(9)\n\nas shown as an example in Figure 2c. Using this construction, we use the induced representation as\ninput for p(·) and t(·) equally, leading to a combination of p(·) and t(·) that is independent of g.\nNote that ¯q⁻¹ ∈ G is constant for all input transformations and therefore does not lead to further\nissues. In practice, we use a two-layer MLP to calculate t(·), which maps the normalized position to\nn·m transformations (for n input capsules per position and m output capsules). The proposed method\ncan also be understood as pose-aligning a trainable, continuous kernel window, which generates\ntransformations from G.
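The pose-aligned vote computation can be checked numerically on a toy 2 × 2 SO(2) receptive field. The sketch below is our own construction; `t_gen` is a fixed deterministic stand-in for the learned two-layer MLP generator, and all function names are ours:

```python
import numpy as np

def compose(g, p):
    # SO(2) group law on unit-vector representatives
    return np.array([g[0] * p[0] - g[1] * p[1], g[0] * p[1] + g[1] * p[0]])

def inverse(p):
    # inverse rotation: conjugate of the unit vector
    return np.array([p[0], -p[1]])

def mean_pose(ps):
    # renormalized Euclidean mean with uniform weights
    m = ps.sum(axis=0)
    return m / np.linalg.norm(m)

def rot_matrix(u):
    # rotation matrix acting on positions, for unit-vector group element u
    return np.array([[u[0], -u[1]], [u[1], u[0]]])

def t_gen(x):
    # stand-in for the MLP kernel generator: position -> group element
    ang = 0.7 * x[0] + 1.3 * x[1]
    return np.array([np.cos(ang), np.sin(ang)])

def aligned_votes(xs, poses):
    # votes with kernel positions aligned by the inverse mean pose
    p_inv = inverse(mean_pose(poses))
    R = rot_matrix(p_inv)
    return np.stack([compose(p, t_gen(R @ x)) for x, p in zip(xs, poses)])
```

Rotating the input (rotating each pose by g and permuting the grid positions accordingly, i.e. the induced representation) rotates the aggregated mean vote by exactly g, which is the equivariance the alignment is meant to restore.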
It is similar to techniques applied for sparse data aggregation in irregular\ndomains [Gilmer et al., 2017]. Since commutativity is not required, it also works for non-abelian\ngroups (e.g. SO(3)). As an additional benefit, we observed significantly faster convergence during\ntraining when using the MLP generator instead of directly optimizing the transformations t.\n\n4 Group capsules and group convolutions\n\nThe newly gained properties of pose vectors and activations allow us to combine our group equivariant\ncapsule networks with methods from the field of group equivariant convolutional networks. We show\nthat we can build sparse group convolutional networks that inherit invariance of activations under the\ngroup law from the capsule part of the network. Instead of using a regular discretization of the group,\nthose networks evaluate the convolution for a fixed set of arbitrary group elements. The proposed\nmethod leads to improved theoretical efficiency for group convolutions, improves the qualitative\nperformance of our capsule networks, and is still able to provide disentangled information. In the\nfollowing, we shortly introduce group convolutions before presenting the combined architecture.\n\nGroup convolution Group convolutions (G-convs) are a generalized convolution/correlation operator defined for elements of a group (G, ◦) (here for Lie groups with an underlying manifold):\n\n[f ⋆ ψ](g) = Σk=1..K ∫h∈G fk(h) ψ(g⁻¹h) dh,\n\n(10)\n\nfor K input feature signals, which behaves equivariantly under applications of the group law ◦ [Cohen\nand Welling, 2016, Cohen et al., 2018].
The authors showed that they can be used to build group\nequivariant convolutional neural networks that apply a stack of those layers to obtain an equivariant\narchitecture. However, compared to capsule networks, they do not directly compute disentangled\nrepresentations, which we aim to achieve through the combination with capsule networks.\n\n(a) Sparse group convolution.\n\n(b) Handling of local receptive fields with different poses.\n\nFigure 3: (a) Scheme for the combination of capsules and group convolutions. Poses computed by\ndynamic routing are used to evaluate group convolutions. The output is weighted by the computed\nagreement. The invariance property of capsule activations is inherited by the output feature maps of\nthe group convolutions. (b) Realization of the sparse group convolution. The local receptive fields are\ntransformed using the calculated poses Lp before being aggregated using a continuous kernel function ψ.\n\n4.1 Sparse group convolution\n\nAn intuition for the proposed method is to interpret our group capsule network as a sparse tree\nrepresentation of a group equivariant network. The output feature map of a group convolution layer\n[f ⋆ ψ](g) over a group G is defined for each element g ∈ G. In contrast, the output of our group\ncapsule layer is a set of tuples (g, a) with group element g (pose vector) and activation a, which can\nbe interpreted as a sparse index/value representation of the output of a G-conv layer. In this context,\nthe pose g, computed using routing by agreement from poses of layer l, serves as the hypothesis for\nthe relevance of the feature map content of layer l + 1 at position g.
We can now sparsely evaluate\nthe feature map output of the group convolution and can use the agreement values from capsules\nto dampen or amplify the resulting feature map contents, bringing captured pose covariances into\nconsideration. Figure 3a shows a scheme of this idea.\nWe show that when using the pose vector outputs to evaluate a G-conv layer for group element g, we\ninherit the invariance property from the capsule activations, by proving the following theorem:\n\nTheorem 3. Given pose vector outputs Lp(P, a) of a group capsule layer for group G, input signal\nf : G → R, and filter ψ : G → R. Then, the group convolution [f ⋆ ψ] is invariant under joint\nleft-applications of g ∈ G on capsule input pose vectors P ∈ Gn and signal f:\n\n[(g ◦ f) ⋆ ψ](Lp(g ◦ P, a)) = [f ⋆ ψ](Lp(P, a)).\n\n(11)\n\nProof. The invariance follows from Theorem 1, the definition of group law application on the feature\nmap, and the group properties. The full proof is provided in the appendix.\n\nThe result tells us that when we pair each capsule in the network with an operator that performs pose-normalized convolution on a feature map, we get activations that are invariant under transformations\nfrom G. We can go one step further: given a group convolution layer for a product group, we can use\nthe capsule output poses as an index for one group and densely evaluate the convolution for the other,\nleading to equivariance in the dense dimension (following from the equivariance of the group convolution)\nand invariance in the capsule-indexed dimension. This leads to our proof of concept application\nwith two-dimensional rotation and translation. We provide further formal details and a proof in the\nappendix.\nCalculation of the convolutions can be performed by applying the inverse transformation to the local\ninput using the capsule's pose vector, as shown in Figure 3b.
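The invariance in Theorem 3 can be checked numerically for G = SO(2) by discretizing the integral of Eq. (10) on a uniform angular grid. The signal `f`, filter `psi`, grid size, and the sampled elements below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

N = 360
hs = 2 * np.pi * np.arange(N) / N    # uniform discretization of SO(2)

def gconv_at(f, psi, g):
    # sparse evaluation of the group correlation [f * psi] at one element g
    return np.sum(f(hs) * psi((hs - g) % (2 * np.pi))) * (2 * np.pi / N)

f = lambda th: np.cos(th) + 0.5 * np.cos(3 * th)   # input signal on G
psi = lambda th: np.exp(np.cos(th))                # filter on G

g = 2 * np.pi * 25 / N    # transformation applied to the input
p = 2 * np.pi * 40 / N    # capsule output pose indexing the feature map
f_g = lambda th: f((th - g) % (2 * np.pi))         # transformed signal (g o f)
```

Evaluating the transformed signal at the transformed pose p + g gives the same value as evaluating the original signal at p, which is exactly the inherited invariance of the sparsely evaluated feature map.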
In practice, it can be achieved, e.g.,\nby using the grid warping approach proposed by Henriques and Vedaldi [2017] or by using spatial\ngraph-based convolution operators, e.g. from Fey et al. [2018]. Further, we can use the iteratively\ncomputed weights from the routing algorithm to perform pooling by agreement on the feature maps:\ninstead of using max or average operators for spatial aggregation, the feature map content can be\ndynamically aggregated by weighting it with the routing weights before combining it.\n\n5 Related work\n\nDifferent ways to provide deep neural networks with specific equivariance properties have been\nintroduced. One way is to share weights over differently rotated filters or to augment the input heavily\nby transformations [Yanzhao et al., 2017, Weiler et al., 2018]. A related but more general set of\nmethods are the group convolutional networks [Cohen and Welling, 2016, Dieleman et al., 2016] and\ntheir applications like Spherical CNNs in SO(3) [Cohen et al., 2018] and Steerable CNNs in SO(2)\n[Cohen and Welling, 2017], which both result in special convolution realizations.\nCapsule networks were introduced by Hinton et al. [2011]. Lately, dynamic routing algorithms for\ncapsule networks have been proposed [Sabour et al., 2017, Hinton et al., 2018]. Our work builds upon\ntheir methods and vision for capsule networks and connects those to the group equivariant networks.\nFurther methods include harmonic networks [Worrall et al., 2017], which use circular harmonics\nas a basis for filter sets, and vector field networks [Marcos et al., 2017]. These methods focus on\ntwo-dimensional rotational equivariance.
While we chose an experiment which is similar to their\napproaches, our work aims to build a more general framework for different groups and disentangled\nrepresentations.\n\n6 Experiments\n\nWe provide proof of concept experiments to verify and visualize the theoretic properties shown in\nthe previous sections. As an instance of our framework, we chose an architecture for rotation-equivariant classification on different MNIST datasets [LeCun et al., 1998].\n\n6.1 Implementation and training details\n\nInitial pose extraction An important subject which we did not tackle yet is the first pose extraction\nof a group capsule network. We need to extract pose vectors p ∈ G with activations a out of the raw\ninput of the network without eliminating the equi- and invariance properties of Equations 2 and 3.\nOur solution for images is to simply compute local gradients using a Sobel operator and to take the\nlength of the gradient as activation. For the case of a zero gradient, we need to ensure that capsules\nwith only zero inputs also produce a zero agreement and an undefined pose vector.\n\nConvolution operator As convolution implementation we chose the spline-based convolution\noperator proposed by Fey et al. [2018]. Although the discrete two- or three-dimensional convolution\noperator is also applicable, this variant allows us to omit the resampling of grids after applying group\ntransformations on the signal f. The reason for this is the continuous definition range of the B-spline\nkernel functions. Due to the representation of images as grid graphs, these kernels allow us to easily\ntransform local neighborhoods by transforming the relative positions given on the edges.\n\nDynamic routing In contrast to the method of Sabour et al. [2017], we do not use softmax over\nthe output capsule dimension but the sigmoid function for each weight individually.
The sigmoid\nfunction makes it possible for the network to route information to more than one output capsule and\nto no output capsule at all. Further, we use two iterations of computing pose proposals.\n\nArchitecture and parameters Our canonical architecture consists of five capsule layers where\neach layer aggregates capsules from 2 × 2 spatial blocks with stride 2. The learned transformations\nare shared over the spatial positions. We use the routing procedure described in Section 2 and the\nspatial aggregation method described in Section 3. We also pair each capsule with a pose-indexed\nconvolution as described in Section 4 with ReLU non-linearities after each layer, leading to a CNN\narchitecture that is guided by pose vectors to become a sparse group CNN. The numbers of output\ncapsules are 16, 32, 32, 64, and 10 per spatial position for each of the five capsule layers, respectively.\nIn total, the architecture contains 235k trainable parameters (145k for the capsules and 90k for the\nCNN). The architecture results in two sets of classification outputs: the agreement values of the last\ncapsule layer and the softmax outputs from the convolutional part. We use the spread loss as proposed\nby Hinton et al. [2018] for the capsule part and standard cross entropy loss for the convolutional\npart and add them up. We trained our models for 45 epochs. For further details, we refer to our\nimplementation, which is available on Github1.\n\n1Implementation at: https://github.com/mrjel/group_equivariant_capsules_pytorch\n\n          MNIST rot. (50k)   AffNIST   MNIST rot. (10k)\nCNN(*)    92.30%             81.64%    90.19%\nCapsules  94.68%             71.86%    91.87%\nWhole     98.42%             89.10%    97.40%\n\n(a) Ablation experiment results\n\n                           Average pose error [degree]\nNaive average poses        70.92\nCapsules w/o recon. loss   28.32\nCapsules with recon. loss  16.21\n
pose errors for different con\ufb01gurations\n\nTable 1: (a) Ablation experiments for the individual parts of our architecture including the CNN\nwithout induced pose vectors, the equivariant capsule network and the combined architecture. All\nMNIST experiments are conducted using randomly rotated training and testing data. (b) Average\npose extraction error for three scenarios: simple averaging of initial pose vectors as baseline, our\ncapsule architecture without reconstruction loss, and the same model with reconstruction loss.\n\n6.2 Results\nEquivariance properties and accuracy We con\ufb01rm equivariance and invariance properties of our\narchitecture by training our network on non-rotated MNIST images and test it on images, which are\nrandomly rotated by multiples of \u03c0/2. We can con\ufb01rm that we achieve exactly the same accuracies,\nas if we evaluate on the non-rotated test set, which is 99.02%. We also obtain the same output\nactivations and equivariant pose vectors with occasional small numerical errors < 0.0001, which\ncon\ufb01rms equi- and invariance. This is true for capsule and convolutional outputs. When we consider\narbitrary rotations for testing, the accuracy of a network trained on non-rotated images is 89.12%,\nwhich is a decent generalization result, compared to standard CNNs.\nFor fully randomly rotated training and test sets we performed an ablation study using three datasets.\nThose include standard MNIST dataset with 50k training examples and the dedicated MNIST-rot\ndataset with the 10k/50k train/test split [Larochelle et al., 2007]. In addition, we replicated the\nexperiment of Sabour et al. [2017] on the affNIST dataset2, a modi\ufb01cation of MNIST where small,\nrandom af\ufb01ne transformations are applied to the images. We trained on padded and translated (not\nrotated) MNIST and tested on affNIST. All results are shown in Table 1a. We chose our CNN\narchitecture without information from the capsule part as our baseline (*). 
Without the induced poses, the network is equivalent to a traditional CNN, similar to the grid experiment presented by Fey et al. [2018]. When trained on non-rotated MNIST, it achieves 99.13% test accuracy but generalizes weakly to a rotated test set, with only 58.79% test accuracy. Results for training on rotated data are summarized in the table.

The results show that combining capsules with convolutions significantly outperforms either part alone. The pose vectors provided by the capsule network guide the CNN, which significantly boosts the CNN's rotation-invariant classification. We do not reach the state of the art of 99.29% in rotated MNIST classification obtained by Weiler et al. [2018]. In the affNIST experiment, we surpass the 79% result of Sabour et al. [2017] by a large margin while using far fewer parameters (235k vs. 6.8M).

Representations  We provide a quantitative and a qualitative analysis of the representations generated by our MNIST-trained model in Table 1b and Figure 4, respectively. We measured the average pose error by rotating each MNIST test example by a random angle and calculating the distance between the predicted and expected poses. The results of our capsule networks with and without a reconstruction loss (cf. next paragraph) are compared to the naive approach of hierarchically averaging local pose vectors. The capsule poses are far more accurate, since they do not depend equally on all local poses but mostly on those that can be explained by the existence of the detected object. It should be noted that the pose extraction was not directly supervised: the networks were trained using discriminative class annotations (and reconstruction loss) only. Similar to Sabour et al. [2017], we observe that an additional reconstruction loss improves the extracted representations. In Figure 4a we show output poses for eleven random test samples, each rotated in π/4 steps.
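The pose-error measurement described above can be sketched as follows. This is a hedged reconstruction, not the released implementation: we assume poses are unit vectors in R² representing SO(2) elements, and all names (`rotate`, `pose_error_deg`) are ours.

```python
import numpy as np

def rotate(theta):
    """2x2 rotation matrix for angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def pose_error_deg(pose_pred, pose_true):
    """Geodesic (angular) distance on SO(2) between two unit pose
    vectors, in degrees -- the metric assumed for Table 1b."""
    # Clip guards against numerical drift outside [-1, 1].
    cos_err = np.clip(np.dot(pose_pred, pose_true), -1.0, 1.0)
    return np.degrees(np.arccos(cos_err))

# Hypothetical evaluation step: rotate a test sample by a random
# angle and compare the predicted pose against the expected one.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi)
expected_pose = rotate(theta) @ np.array([1.0, 0.0])
predicted_pose = rotate(theta + np.radians(5.0)) @ np.array([1.0, 0.0])
print(round(pose_error_deg(predicted_pose, expected_pose), 1))  # 5.0
```

Averaging this error over randomly rotated test examples yields a single scalar per configuration, comparable to the values reported in Table 1b.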
It can be seen that equivariant output poses are produced in most cases. The bottom row shows an error case, where an ambiguous pattern creates false poses. We provide a more detailed analysis for different MNIST classes in the appendix. Figure 4b shows poses after the first (top) and the second (bottom) capsule layer.

²affNIST: http://www.cs.toronto.edu/~tijmen/affNIST/

Figure 4: Visualization of output poses (a), internal poses (b), and reconstructions (c). (a) It can be seen that the network produces equivariant output pose vectors. The bottom row shows a rare error case, where symmetries lead to false poses. (b) Internal poses behave nearly equivariantly; differences arise from changing discretization and image resampling. (c) The original test sample is shown on the left, followed by reconstructions after rotating the representation pose vector. For the reconstruction, we selected visually correct reconstructed samples, which was not always the case.

Reconstruction  For further verification of disentanglement, we also replicated the autoencoder experiment of Sabour et al. [2017] by appending a three-layer MLP to the convolution outputs, agreement outputs, and poses and training it to reconstruct the input image. Example reconstructions are shown in Figure 4c. To verify the disentanglement of rotation, we provide reconstructions of images after applying π/4 rotations to the output pose vectors. It can be seen that we have fine-grained control over the orientation of the resulting image. However, not all representations were reconstructed correctly; we chose visually correct ones for display.

7 Limitations

Limitations of our method arise from the restriction of capsule poses to be elements of a group for which we have proper M and δ.
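To make this requirement concrete, one valid pair for G = SO(2) with poses stored as unit vectors is the angular distance δ together with the normalized weighted vector sum as the mean operator M. This is a minimal sketch under our own naming, not the paper's implementation; the general conditions on M and δ are those of Section 2.

```python
import numpy as np

def delta(p, q):
    """Distance on SO(2): angle between two unit pose vectors."""
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0))

def M(poses, weights):
    """Weighted mean on SO(2): normalized weighted sum of unit pose
    vectors. Left-equivariant: M(g poses, w) = g M(poses, w)."""
    mean = np.sum(weights[:, None] * poses, axis=0)
    norm = np.linalg.norm(mean)
    # Undefined for antipodal, equally weighted poses (norm == 0).
    assert norm > 0, "mean undefined for cancelling pose vectors"
    return mean / norm

def rot(t):
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s], [s, c]])

# Equivariance check: rotating every input pose by g rotates the mean by g.
poses = np.stack([rot(a) @ np.array([1.0, 0.0]) for a in (0.1, 0.4, 0.3)])
w = np.array([0.5, 0.3, 0.2])
g = rot(1.0)
lhs = M(poses @ g.T, w)       # apply g to each pose, then average
rhs = g @ M(poses, w)         # average first, then apply g
print(np.allclose(lhs, rhs))  # True
```

Because the sum is linear in the poses and rotations preserve norms, applying g before or after M gives the same result, which is exactly the equivariance property required of M.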
Therefore, in contrast to the original capsule networks, arbitrary pose vectors can no longer be extracted. Through product groups, though, it is possible to combine several groups and obtain more general pose vectors with internally disentangled information, provided we can find M and δ for the combined group. For Lie groups, an implementation of an equivariant Karcher mean would be a sufficient operator for M. It is defined as the point on the manifold that minimizes the sum of all weighted geodesic distances [Nielsen and Bhatia, 2012]. However, each group admits a different number of possible realizations, of which only a few are applicable in a deep neural network architecture. Finding appropriate candidates and evaluating them is part of our future work.

8 Conclusion

We proposed group equivariant capsule networks that provide provable equivariance and invariance properties. They include a scheme for routing by agreement algorithms, a spatial aggregation method, and the ability to integrate group convolutions. We proved the relevant properties and confirmed them through proof-of-concept experiments, showing that our architecture provides disentangled pose vectors. In addition, we provided an example of how sparse group equivariant CNNs can be constructed using guiding poses. Future work will include applying the proposed framework to other, higher-dimensional groups to come closer to the expressiveness of the original capsule networks while preserving the guarantees.

Acknowledgments

Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", projects B2 and A6.

References

T. S. Cohen and M. Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 2990-2999, 2016.

T.
S. Cohen and M. Welling. Steerable CNNs. In International Conference on Learning Representations (ICLR), 2017.

T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. In International Conference on Learning Representations (ICLR), 2018.

T. S. Cohen, M. Geiger, and M. Weiler. Intertwiners between induced representations (with applications to the theory of equivariant neural networks). ArXiv e-prints, 2018.

S. Dieleman, J. De Fauw, and K. Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1889-1898, 2016.

M. Fey, J. E. Lenssen, F. Weichert, and H. Müller. SplineCNN: Fast geometric deep learning with continuous B-spline kernels. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1263-1272, 2017.

J. F. Henriques and A. Vedaldi. Warped convolutions: Efficient invariance to spatial transformations. In Proceedings of the 34th International Conference on Machine Learning (ICML), pages 1461-1469, 2017.

G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning - 21st International Conference on Artificial Neural Networks (ICANN), pages 44-51, 2011.

G. E. Hinton, S. Sabour, and N. Frosst. Matrix capsules with EM routing. In International Conference on Learning Representations (ICLR), 2018.

H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation.
In Proceedings of the 24th International Conference on Machine Learning (ICML), pages 473-480, 2007.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, pages 2278-2324, 1998.

D. Marcos, M. Volpi, N. Komodakis, and D. Tuia. Rotation equivariant vector field networks. In IEEE International Conference on Computer Vision (ICCV), pages 5058-5067, 2017.

F. Nielsen and R. Bhatia. Matrix Information Geometry. Springer Publishing Company, 2012.

S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in Neural Information Processing Systems (NIPS), pages 3859-3869, 2017.

M. Weiler, F. A. Hamprecht, and M. Storath. Learning steerable filters for rotation equivariant CNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Oriented response networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4961-4970, 2017.