{"title": "Clebsch\u2013Gordan Nets: a Fully Fourier Space Spherical Convolutional Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 10117, "page_last": 10126, "abstract": "Recent work by Cohen et al. has achieved state-of-the-art results for learning spherical images in a rotation invariant way by using ideas from group representation theory and noncommutative harmonic analysis. In this paper we propose a generalization of this work that generally exhibits improved performace, but from an implementation point of view is actually simpler. An unusual feature of the proposed architecture is that it uses the Clebsch--Gordan transform as its only source of nonlinearity, thus avoiding repeated forward and backward Fourier transforms. The underlying ideas of the paper generalize to constructing neural networks that are invariant to the action of other compact groups.", "full_text": "Clebsch\u2013Gordan Nets: a Fully Fourier Space\n\nSpherical Convolutional Neural Network\n\nRisi Kondor1\u2217 Zhen Lin1\u2217 Shubhendu Trivedi2\u2217\n\n1The University of Chicago\n2Toyota Technological Institute\n{risi, zlin7}@uchicago.edu, shubhendu@ttic.edu\n\nAbstract\n\nRecent work by Cohen et al. [1] has achieved state-of-the-art results for learning\nspherical images in a rotation invariant way by using ideas from group represen-\ntation theory and noncommutative harmonic analysis. In this paper we propose\na generalization of this work that generally exhibits improved performace, but\nfrom an implementation point of view is actually simpler. An unusual feature\nof the proposed architecture is that it uses the Clebsch\u2013Gordan transform as its\nonly source of nonlinearity, thus avoiding repeated forward and backward Fourier\ntransforms. 
The underlying ideas of the paper generalize to constructing neural\nnetworks that are invariant to the action of other compact groups.\n\n1\n\nIntroduction\n\nDespite the many recent breakthroughs in deep learning, we still do not have a satisfactory understand-\ning of how deep neural networks are able to achieve such spectacular performance on a wide range of\nlearning problems. One thing that is clear, however, is that certain architectures pick up on natural\ninvariances in data, and this is a key component to their success. The classic example is of course\nConvolutional Neural Networks (CNNs) for image classi\ufb01cation [2]. Recall that, fundamentally, each\nlayer of a CNN realizes two simple operations: a linear one consisting of convolving the previous\nlayer\u2019s activations with a (typically small) learnable \ufb01lter, and a nonlinear but pointwise one, such\nas a ReLU operator2. This architecture is suf\ufb01cient to guarantee translation equivariance, meaning\nthat if the input image is translated by some vector t, then the activation pattern in each higher layer\nof the network will translate by the same amount. Equivariance is crucial to image recognition for\ntwo closely related reasons: (a) It guarantees that exactly the same \ufb01lters are applied to each part of\nthe input image regardless of position. (b) Assuming that \ufb01nally, at the very top of the network, we\nadd some layer that is translation invariant, the entire network will be invariant, ensuring that it can\ndetect any given object equally well regardless of its location.\nRecently, a number of papers have appeared that examine equivariance from the theoretical point\nof view, motivated by the understanding that the natural way to generalize convolutional networks\nto other types of data will likely lead through generalizing the notion of equivariance itself to other\ntransformation groups [3, 4, 5, 6, 7]. 
Letting f^s denote the activations of the neurons in layer s\nof a hypothetical generalized convolution-like neural network, mathematically, equivariance to a\ngroup G means that if the inputs to the network are transformed by some transformation g \u2208 G,\nthen f^s transforms to T^s_g(f^s) for some \ufb01xed set of linear transformations {T^s_g}_{g\u2208G}. (Note that in\nsome contexts this is called \u201ccovariance\u201d, the difference between the two words being only one of\nemphasis.)\n\n\u2217Authors are arranged alphabetically\n2Real CNNs typically of course have multiple channels, and correspondingly multiple \ufb01lters per layer, but\n\nthis does not fundamentally change the network\u2019s invariance properties.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fA recent major success of this approach is Spherical CNNs [1, 8], an SO(3)\u2013equivariant\nneural network architecture for learning images painted on the sphere3. Learning images on the\nsphere in a rotation invariant way has applications in a wide range of domains from 360 degree video\nthrough drone navigation to molecular chemistry [9, 10, 11, 12, 13, 14, 15]. The key idea in Spherical\nCNNs is to generalize convolutions using the machinery of noncommutative harmonic analysis:\nemploying a type of generalized SO(3) Fourier transform [16, 17], Spherical CNNs transform the\nimage to a sequence of matrices, and compute the spherical analog of convolution in Fourier space.\nThis beautiful construction guarantees equivariance, and the resulting network attains state of the art\nresults on several benchmark datasets.\nOne potential drawback of Spherical CNNs of the form proposed in [1], however, is that the nonlinear\ntransform in each layer still needs to be computed in \u201creal space\u201d. 
Consequently, each layer of the\nnetwork involves a forward and a backward SO(3) Fourier transform, which is relatively costly, and\nis a source of numerical errors, especially since the sphere and the rotation group do not admit any\nregular discretization similar to the square grid for Euclidean space.\nSpherical CNNs are not the only context in which the idea of Fourier space neural networks has\nrecently appeared [18, 19, 5, 7]. From a mathematical point of view, the relevance of Fourier theoretic\nideas in all these cases is a direct consequence of equivariance, speci\ufb01cally, of the fact that the\n{T^s_g}_{g\u2208G} operators form a representation of the underlying group, in the algebraic sense of the\nword [20]. In particular, it has been shown that whenever there is a compact group G acting on the\ninputs of a neural network, there is a natural notion of Fourier transformation with respect to G,\nyielding a sequence of Fourier matrices {F^s_\u2113}_\u2113 at each layer, and the linear operation at layer s will\nbe equivariant to G if and only if it is equivalent to multiplying each of these matrices from the right\nby some (learnable) \ufb01lter matrix H^s_\u2113 [7]. Any other sort of operation will break equivariance. The\nspherical convolutions employed in [1] are a special case of this general setup for SO(3), and the\nordinary convolutions employed in classical CNNs are a special case for the integer translation group\nZ^2. In all of these cases, however, the issue remains that the nonlinearities need to be computed in\n\u201creal space\u201d, necessitating repeated forward and backward Fourier transforms.\nIn the present paper we propose a spherical CNN that differs from [1] in two fundamental ways:\n1. 
While retaining the connection to noncommutative Fourier analysis, we relax the requirement\nthat the activation of each layer of the network needs to be a (vector valued) function on SO(3),\nrequiring only that it be expressible as a collection of some number of SO(3)\u2013covariant vectors\n(which we call fragments) corresponding to different irreducible representations of the group. In\nthis sense, our architecture is strictly more general than [1].\n\n2. Rather than a pointwise nonlinearity in real space, our network takes the tensor (Kronecker)\nproduct of the activations in each layer followed by decomposing the result into irreducible\nfragments using the so-called Clebsch\u2013Gordan decomposition. This way, we get a \u201cfully Fourier\nspace\u201d neural network that avoids repeated forward and backward Fourier transforms.\n\nThe resulting architecture is not only more \ufb02exible and easier to implement than [1], but our experi-\nments show that it can also perform better on some standard datasets.\nThe Clebsch\u2013Gordan transform has recently appeared in two separate preprints discussing neural\nnetworks for learning physical systems [21, 22]. However, to the best of our knowledge, it has never\nbeen proposed as a general purpose nonlinearity for covariant neural networks. In fact, any compact\ngroup has a Clebsch\u2013Gordan decomposition (although, due to its connection to angular momentum\nin physics, the SO(3) case is by far the best known), so, in principle, the methods of the present paper\ncould be applied much more broadly, in any situation where one desires to build a neural network that is\nequivariant to some class of transformations captured by a compact group.\n\n2 Convolutions on the sphere\n\nThe simplest example of a covariant neural network is a classical S + 1 layer CNN for image\nrecognition. 
In each layer of a CNN the neurons are arranged in a rectangular grid, so (assuming\nfor simplicity that the network has just one channel) the activation of layer s can be regarded as\na function f^s : Z^2 \u2192 R, with f^0 being the input image. The neurons compute f^s by taking the\n\n3SO(3) denotes the group of three dimensional rotations, i.e., the group of 3\u00d7 3 orthogonal matrices with unit determinant.\n\n\fcross-correlation4 of the previous layer\u2019s output with a small (learnable) \ufb01lter h^s,\n\n(h^s \u22c6 f^{s\u22121})(x) = \u2211_y h^s(y \u2212 x) f^{s\u22121}(y),    (1)\n\nand then applying a nonlinearity \u03c3, such as the ReLU operator:\n\nf^s(x) = \u03c3((h^s \u22c6 f^{s\u22121})(x)).    (2)\n\nDe\ufb01ning T_x(h^s)(y) = h^s(y \u2212 x), which is nothing but h^s translated by x, allows us to equivalently\nwrite (1) as\n\n(h^s \u22c6 f^{s\u22121})(x) = \u27e8f^{s\u22121}, T_x(h^s)\u27e9,    (3)\n\nwhere the inner product is \u27e8f^{s\u22121}, T_x(h^s)\u27e9 = \u2211_y f^{s\u22121}(y) T_x(h^s)(y). What this formula tells us is\nthat fundamentally each layer of the CNN just does pattern matching: f^s(x) is an indication of how\nwell the part of f^{s\u22121} around x matches the \ufb01lter h^s.\nEquation (3) is the natural starting point for generalizing convolution to the unit sphere, S^2. An\nimmediate complication that we face, however, is that unlike the plane, S^2 cannot be discretized\nby any regular (by which we mean rotation invariant) arrangement of points. A number of authors\nhave addressed this problem in different ways [14, 15]. Instead of following one of these approaches,\nsimilarly to recent work on manifold CNNs [23, 24], in the following we simply treat each f^s and\nthe corresponding \ufb01lter h^s as continuous functions on the sphere, f^s(\u03b8, \u03c6) and h^s(\u03b8, \u03c6), where \u03b8 and\n\u03c6 are the polar and azimuthal angles. 
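To make equations (1)\u2013(3) concrete, here is a minimal numpy sketch (ours, not code from the paper): planar cross-correlation computed as a sliding inner product, together with a check of the translation equivariance property discussed in the introduction. Periodic boundary conditions are assumed so that translations act cyclically.

```python
import numpy as np

def cross_correlate(h, f):
    """Cyclic planar cross-correlation (h * f)(x) = sum_y h(y - x) f(y)."""
    H, W = h.shape
    out = np.zeros_like(f, dtype=float)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            # Inner product of f with h translated to position (i, j),
            # with periodic boundary conditions for simplicity.
            patch = np.roll(np.roll(f, -i, axis=0), -j, axis=1)[:H, :W]
            out[i, j] = np.sum(h * patch)
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((6, 6))
h = rng.standard_normal((3, 3))

# Equivariance: translating the input translates the output by the same amount.
t = (2, 3)
lhs = cross_correlate(h, np.roll(f, t, axis=(0, 1)))
rhs = np.roll(cross_correlate(h, f), t, axis=(0, 1))
assert np.allclose(lhs, rhs)
```

The assertion at the end is exactly the statement that each layer "does pattern matching" in a position-independent way: shifting the image shifts the activation map.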
We allow both these functions to be complex valued, the reason\nfor which will become clear later.\nThe inner product of two complex valued functions on the surface of the sphere is given by the\nformula\n\n\u27e8g, h\u27e9_{S^2} = (1/4\u03c0) \u222b_0^{2\u03c0} \u222b_{\u2212\u03c0}^{\u03c0} [g(\u03b8, \u03c6)]\u2217 h(\u03b8, \u03c6) cos \u03b8 d\u03b8 d\u03c6,    (4)\n\nwhere \u2217 denotes complex conjugation. Further, h (dropping the layer index for clarity) can be\nmoved to any point (\u03b8_0, \u03c6_0) on S^2 by taking h\u2032(\u03b8, \u03c6) = h(\u03b8 \u2212 \u03b8_0, \u03c6 \u2212 \u03c6_0). This suggests that the\ngeneralization of (3) to the sphere should be\n\n(h \u22c6 f)(\u03b8_0, \u03c6_0) = (1/4\u03c0) \u222b_0^{2\u03c0} \u222b_{\u2212\u03c0}^{\u03c0} [h(\u03b8 \u2212 \u03b8_0, \u03c6 \u2212 \u03c6_0)]\u2217 f(\u03b8, \u03c6) cos \u03b8 d\u03b8 d\u03c6.    (5)\n\nUnfortunately, this generalization would be wrong, because it does not take into account that h can\nalso be rotated around a third axis. The correct way to generalize cross-correlations to the sphere is\nto de\ufb01ne h \u22c6 f as a function on the rotation group itself, i.e., to set\n\n(h \u22c6 f)(R) = (1/4\u03c0) \u222b_0^{2\u03c0} \u222b_{\u2212\u03c0}^{\u03c0} [h_R(\u03b8, \u03c6)]\u2217 f(\u03b8, \u03c6) cos \u03b8 d\u03b8 d\u03c6,    R \u2208 SO(3),    (6)\n\nwhere h_R is h rotated by R, expressible as\n\nh_R(x) = h(R^{\u22121}x),    (7)\n\nwith x being the point on the sphere at position (\u03b8, \u03c6) (cf. [25, 1]).\n\n2.1 Fourier space \ufb01lters and activations\n\nCohen et al. [1] observe that the double integral in (6) would be extremely inconvenient to compute\nin a neural network. As mentioned, in the case of the sphere, just \ufb01nding the right discretizations\nto represent f and h is already problematic. 
As an alternative, it is natural to represent both these\nfunctions in terms of their spherical harmonic expansions\n\nf(\u03b8, \u03c6) = \u2211_{\u2113=0}^{\u221e} \u2211_{m=\u2212\u2113}^{\u2113} f\u0302^m_\u2113 Y^m_\u2113(\u03b8, \u03c6),    h(\u03b8, \u03c6) = \u2211_{\u2113=0}^{\u221e} \u2211_{m=\u2212\u2113}^{\u2113} h\u0302^m_\u2113 Y^m_\u2113(\u03b8, \u03c6).    (8)\n\n4Convolution and cross-correlation are closely related mathematical concepts that are somewhat confounded\nin the deep learning literature. In this paper we are going to be a little more precise and focus on cross-correlation,\nbecause, despite their name, that is what CNNs actually compute.\n\n\fHere, Y^m_\u2113(\u03b8, \u03c6) are the well known spherical harmonic functions indexed by \u2113 = 0, 1, 2, . . . and\nm \u2208 {\u2212\u2113, \u2212\u2113 + 1, . . . , \u2113}. The spherical harmonics form an orthonormal basis for L^2(S^2), so (8)\ncan be seen as a kind of Fourier series on the sphere; in particular, the elements of the f\u0302_0, f\u0302_1, f\u0302_2, . . .\ncoef\ufb01cient vectors can be computed relatively easily by\n\nf\u0302^m_\u2113 = (1/4\u03c0) \u222b_0^{2\u03c0} \u222b_{\u2212\u03c0}^{\u03c0} f(\u03b8, \u03c6) Y^m_\u2113(\u03b8, \u03c6) cos \u03b8 d\u03b8 d\u03c6,\n\nand similarly for h. Similarly to usual Fourier series, in practical scenarios spherical harmonic\nexpansions are computed up to some limiting \u201cfrequency\u201d L, which depends on the desired resolution.\nNoncommutative harmonic analysis [26, 27] tells us that functions on the rotation group also admit a\ntype of generalized Fourier transform. 
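As an illustration of the expansion (8), here is a self-contained numpy sketch (ours, not the paper's code) that recovers the \u2113 = 1 coefficients of a band-limited signal by direct quadrature. Note that it uses the standard physics normalization of the harmonics (orthonormal with respect to the measure sin \u03b8 d\u03b8 d\u03c6, with \u03b8 the polar angle from the pole), which differs from the 1/4\u03c0 convention above; in practice one would use a fast spherical harmonic transform rather than brute-force quadrature.

```python
import numpy as np

def Y1(m, theta, phi):
    # l = 1 spherical harmonics, physics normalization; theta is the polar
    # angle measured from the pole (a different convention from the paper's).
    if m == -1:
        return 0.5 * np.sqrt(3 / (2 * np.pi)) * np.sin(theta) * np.exp(-1j * phi)
    if m == 0:
        return 0.5 * np.sqrt(3 / np.pi) * np.cos(theta) + 0j
    return -0.5 * np.sqrt(3 / (2 * np.pi)) * np.sin(theta) * np.exp(1j * phi)

# Midpoint-rule quadrature grid on the sphere.
n = 200
theta = (np.arange(n) + 0.5) * np.pi / n
phi = (np.arange(n) + 0.5) * 2 * np.pi / n
T, P = np.meshgrid(theta, phi, indexing="ij")
dA = (np.pi / n) * (2 * np.pi / n)

f = Y1(0, T, P) + 2.0 * Y1(1, T, P)   # synthetic signal with known coefficients
# f_hat_1^m = integral of f(theta, phi) conj(Y_1^m) sin(theta) dtheta dphi
coeffs = {m: np.sum(f * np.conj(Y1(m, T, P)) * np.sin(T)) * dA for m in (-1, 0, 1)}
print(coeffs)
```

The recovered coefficients match the ones the signal was built from (0, 1 and 2) up to quadrature error.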
Given a function g : SO(3) \u2192 C, the Fourier transform of g is\nde\ufb01ned as the collection of matrices\n\nG_\u2113 = \u222b_{SO(3)} g(R) \u03c1_\u2113(R) d\u00b5(R),    \u2113 = 0, 1, 2, . . . ,    (9)\n\nwhere \u03c1_\u2113 : SO(3) \u2192 C^{(2\u2113+1)\u00d7(2\u2113+1)} are \ufb01xed matrix valued functions called the irreducible repre-\nsentations of SO(3), sometimes also called Wigner D-matrices. Here \u00b5 is a \ufb01xed measure called the\nHaar measure that just hides factors similar to the cos \u03b8 appearing in (4). For future reference we also\nnote that the one dimensional irreducible representation \u03c1_0 is the constant representation \u03c1_0(R) = (1).\nThe inverse Fourier transform is given by\n\ng(R) = (1/4\u03c0) \u2211_{\u2113=0}^{\u221e} tr[G_\u2113 \u03c1_\u2113(R^{\u22121})],    R \u2208 SO(3).\n\nWhile the spherical harmonics can be chosen to be real, the \u03c1_\u2113(R) representation matrices are\ninherently complex valued. This is the reason that we allow all other quantities, including the f^s\nactivations and h^s \ufb01lters to be complex, too.\nRemarkably, the above notions of harmonic analysis on the sphere and the rotation group are closely\nrelated. In particular, it is possible to show that each Fourier component of the spherical cross\ncorrelation (6) that we are interested in computing is given simply by the outer product\n\n[h \u22c6 f]\u0302_\u2113 = f\u0302_\u2113 \u00b7 h\u0302_\u2113^\u2020,    \u2113 = 0, 1, 2, . . . , L,    (10)\n\nwhere \u2020 denotes the conjugate transpose (Hermitian conjugate) operation. Cohen et al.\u2019s Spherical\nCNNs [1] are essentially based on this formula. In particular, they argue that instead of the continuous\nfunction f, it is more expedient to regard the components of the f\u0302_0, f\u0302_1, . . . , f\u0302_L vectors as the\n\u201cactivations\u201d of their neural network, while the learnable weights or \ufb01lters are the h\u0302_0, h\u0302_1, . . . , h\u0302_L\nvectors. Computing spherical convolutions in Fourier space then reduces to just computing a few\nouter products. Layers s = 2, 3, . . . , S of the Spherical CNN operate similarly, except that f^{s\u22121} is a\nfunction on SO(3), so (6) must be replaced by cross-correlation on SO(3) itself, and h must also be\na function on SO(3) rather than just the sphere. Fortuitously, the resulting cross-correlation formula\nis almost exactly the same:\n\n[h \u22c6 f]\u0302_\u2113 = F_\u2113 \u00b7 H_\u2113^\u2020,    \u2113 = 0, 1, 2, . . . , L,    (11)\n\napart from the fact that now F_\u2113 and H_\u2113 are matrices (see [1] for details).\n\n3 Generalized spherical CNNs\n\nThe starting point for our Generalized Spherical CNNs is the Fourier space correlation formula (10).\nIn contrast to [1], however, rather than the geometry, we concentrate on its algebraic properties, in\nparticular, its behavior under rotations. It is well known that if we rotate a spherical function by some\nR \u2208 SO(3) as in (7), then each vector of its spherical harmonic expansion just gets multiplied with\nthe corresponding Wigner D-matrix:\n\nf\u0302_\u2113 \u21a6 \u03c1_\u2113(R) \u00b7 f\u0302_\u2113.    (12)\n\nFor functions on SO(3), the situation is similar. If g : SO(3) \u2192 C, and g\u2032 is the rotated function\ng\u2032(R\u2032) = g(R^{\u22121}R\u2032), then the Fourier matrices of g\u2032 are G\u2032_\u2113 = \u03c1_\u2113(R) G_\u2113. The following proposition\nshows that the matrices output by the (10) and (11) cross-correlation formulae behave analogously.\n\nProposition 1 Let f : S^2 \u2192 C be an activation function that under the action of a rotation R\ntransforms as (7), and let h : S^2 \u2192 C be a \ufb01lter. 
Then, each Fourier component of the cross\ncorrelation (6) transforms as\n\n[h \u22c6 f]\u0302_\u2113 \u21a6 \u03c1_\u2113(R) \u00b7 [h \u22c6 f]\u0302_\u2113.    (13)\n\nSimilarly, if f\u2032, h\u2032 : SO(3) \u2192 C, then [h\u2032 \u22c6 f\u2032]\u0302 (as de\ufb01ned in (11)) transforms the same way.\nEquation (12) describes the behavior of spherical harmonic vectors under rotations, while (13)\ndescribes the behavior of Fourier matrices. However, the latter is equivalent to saying that each\ncolumn of the matrices separately transforms according to (12). One of the key ideas of the present\npaper is to take this property as the basis for the de\ufb01nition of covariance to rotations in neural nets.\nThus we have the following de\ufb01nition.\n\nDe\ufb01nition 1 Let N be an S + 1 layer feed-forward neural network whose input is a spherical\nfunction f^0 : S^2 \u2192 C^d. We say that N is a generalized SO(3)\u2013covariant spherical CNN if the\noutput of each layer s can be expressed as a collection of vectors\n\nf\u0302^s = (f\u0302^s_{0,1}, f\u0302^s_{0,2}, . . . , f\u0302^s_{0,\u03c4^s_0}, f\u0302^s_{1,1}, f\u0302^s_{1,2}, . . . , f\u0302^s_{1,\u03c4^s_1}, . . . . . . . . . , f\u0302^s_{L,1}, . . . , f\u0302^s_{L,\u03c4^s_L}),    (14)\n\nwhere each f\u0302^s_{\u2113,j} \u2208 C^{2\u2113+1} is a \u03c1_\u2113\u2013covariant vector in the sense that if the input image is rotated by\nsome rotation R, then f\u0302^s_{\u2113,j} transforms as\n\nf\u0302^s_{\u2113,j} \u21a6 \u03c1_\u2113(R) \u00b7 f\u0302^s_{\u2113,j}.    (15)\n\nWe call the individual f\u0302^s_{\u2113,j} vectors the irreducible fragments of f\u0302^s, and the integer vector\n\u03c4^s = (\u03c4^s_0, \u03c4^s_1, . . . , \u03c4^s_L) counting the number of fragments for each \u2113 the type of f\u0302^s.\n\nThere are a few things worth noting about De\ufb01nition 1. First, since the (15) maps are linear, clearly any\nSO(3)\u2013covariant spherical CNN is equivariant to rotations, as de\ufb01ned in the introduction. Second,\nsince in [1] the inputs are functions on the sphere, whereas in higher layers the activations are\nfunctions on SO(3), their architecture is a special case of De\ufb01nition 1 with \u03c4^0 = (1, 1, . . . , 1) and\n\u03c4^s = (1, 3, 5, . . . , 2L + 1) for s \u2265 1.\nFinally, by the theorem of complete reducibility of representations of compact groups, any f^s that\ntransforms under rotations linearly is reducible into a sequence of irreducible fragments as in (14).\nThis means that (14) is really the most general possible form for an SO(3) equivariant neural network.\nAs we remarked in the introduction, technically, the terms \u201cequivariant\u201d and \u201ccovariant\u201d map to the\nsame concept. The difference between them is one of emphasis. We use the term \u201cequivariant\u201d when\nwe have the same group acting on two objects in a way that is qualitatively similar, as in the case of the\nrotation group acting on functions on the sphere and on cross-correlation functions on SO(3). We use\nthe term \u201ccovariant\u201d if the actions are qualitatively different, as in the case of rotations of functions on\nthe sphere and the corresponding transformations (15) of the irreducible fragments in a neural network.\nTo fully de\ufb01ne our neural network, we need to describe three things: 1. The form of the linear\ntransformations in each layer involving learnable weights, 2. The form of the nonlinearity in each\nlayer, 3. The way that the \ufb01nal output of the network can be reduced to a vector that is rotation\ninvariant, since that is our ultimate goal. 
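In code, an activation of the form (14) can be represented simply as a list of complex matrices, the \u2113-th holding its \u03c4_\u2113 fragments as columns, with a rotation acting block-wise as in (15). A minimal numpy sketch (ours; random unitary matrices stand in for the Wigner matrices \u03c1_\u2113(R), since any unitary action illustrates the same bookkeeping):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_activation(tau):
    # Entry l is a (2l+1) x tau[l] complex matrix: tau[l] fragments as columns.
    return [rng.standard_normal((2 * l + 1, t)) + 1j * rng.standard_normal((2 * l + 1, t))
            for l, t in enumerate(tau)]

def random_unitary(n):
    # Stand-in for a Wigner D-matrix rho_l(R); any unitary acts the same way here.
    q, _ = np.linalg.qr(rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))
    return q

def rotate(activation, rhos):
    # Covariance rule of Definition 1: every fragment of order l is
    # multiplied by the same rho_l(R), i.e. f_hat_{l,j} -> rho_l(R) f_hat_{l,j}.
    return [rho @ F for rho, F in zip(rhos, activation)]

tau = (2, 3, 1)            # type: 2 fragments at l = 0, 3 at l = 1, 1 at l = 2
f = random_activation(tau)
rhos = [random_unitary(2 * l + 1) for l in range(len(tau))]
g = rotate(f, rhos)
print([F.shape for F in g])   # shapes are preserved: [(1, 2), (3, 3), (5, 1)]
```

Because the action is unitary and block-wise, rotating an activation never mixes fragments of different \u2113, which is exactly what makes the per-\u2113 linear maps of the next subsection legal.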
The following subsections describe each of these components\nin turn.\n\n3.1 Covariant linear transformations\n\nIn a covariant neural network architecture, the linear operation of each layer must be covariant. As\ndescribed in the Introduction, in classical CNNs, convolution automatically satis\ufb01es this criterion. In\nthe more general setting of covariance to the action of compact groups, the problem was studied in\n[7]. The specialization of their result to our case is the following.\n\nProposition 2 Let f\u0302^s be an SO(3)\u2013covariant activation function of the form (14), and g\u0302^s = L(f\u0302^s)\nbe a linear function of f\u0302^s written in a similar form. Then g\u0302^s is SO(3)\u2013covariant if and only if each g\u0302^s_{\u2113,j}\nfragment is a linear combination of fragments from f\u0302^s with the same \u2113.\n\nProposition 2 can be made more clear by stacking all fragments of f\u0302 corresponding to \u2113 into a\n(2\u2113 + 1) \u00d7 \u03c4^s_\u2113 dimensional matrix F^s_\u2113, and doing the same for g\u0302. Then the proposition tells us simply\nthat\n\nG^s_\u2113 = F^s_\u2113 W^s_\u2113,    \u2113 = 0, 1, 2, . . . , L,    (16)\n\nfor some sequence of complex valued matrices W^s_0, . . . , W^s_L. Note that W^s_\u2113 does not necessarily\nneed to be square, i.e., the number of fragments in f\u0302 and g\u0302 corresponding to \u2113 might be different. In\nthe context of a neural network, the entries of the W^s_\u2113 matrices are learnable parameters.\nNote that the Fourier space cross-correlation formulae (10) and (11) are special cases of (16) corresponding to taking W_\u2113 = h\u0302_\u2113^\u2020 or W_\u2113 = H_\u2113^\u2020. The case of general W_\u2113 does not have such an\nintuitive interpretation in terms of cross-correlation. What (16) lacks in interpretability it makes\nup for in terms of generality, since it provides an extremely simple and \ufb02exible way of inducing\nSO(3)\u2013covariant linear transformations in neural networks.\n\n3.2 Covariant nonlinearities: the Clebsch\u2013Gordan transform\n\nDifferentiable nonlinearities are essential for the operation of multi-layer neural networks. Formu-\nlating covariant nonlinearities in Fourier space, however, is more challenging than formulating the\nlinear operation. This is the reason that most existing group equivariant neural networks perform this\noperation in \u201creal space\u201d. However, as discussed above, moving back and forth between real space\nand the Fourier domain comes at a signi\ufb01cant cost and leads to a range of complications involving\nquadrature on the transformation group and numerical errors.\nOne of the key contributions of the present paper is to propose a fully Fourier space nonlinearity based\non the Clebsch\u2013Gordan transform. In representation theory, the Clebsch\u2013Gordan decomposition\narises in the context of decomposing the tensor (i.e., Kronecker) product of irreducible representations\ninto a direct sum of irreducibles. In the speci\ufb01c case of SO(3), it takes the form\n\n\u03c1_{\u2113\u2081}(R) \u2297 \u03c1_{\u2113\u2082}(R) = C_{\u2113\u2081,\u2113\u2082} [\u2a01_{\u2113=|\u2113\u2081\u2212\u2113\u2082|}^{\u2113\u2081+\u2113\u2082} \u03c1_\u2113(R)] C_{\u2113\u2081,\u2113\u2082}^\u22a4,    R \u2208 SO(3),\n\nwhere C_{\u2113\u2081,\u2113\u2082} are \ufb01xed matrices. 
Equivalently, letting C_{\u2113\u2081,\u2113\u2082,\u2113} denote the appropriate block of columns\nof C_{\u2113\u2081,\u2113\u2082},\n\n\u03c1_\u2113(R) = C_{\u2113\u2081,\u2113\u2082,\u2113}^\u22a4 [\u03c1_{\u2113\u2081}(R) \u2297 \u03c1_{\u2113\u2082}(R)] C_{\u2113\u2081,\u2113\u2082,\u2113}.\n\nThe CG-transform is well known in physics, because it is intimately related to the algebra of angular\nmomentum in quantum mechanics, and the entries of the C_{\u2113\u2081,\u2113\u2082,\u2113} matrices can be computed relatively\neasily. The following Lemma explains why this construction is relevant to creating Fourier space\nnonlinearities.\n\nLemma 3 Let f\u0302_{\u2113\u2081} and f\u0302_{\u2113\u2082} be two \u03c1_{\u2113\u2081} resp. \u03c1_{\u2113\u2082} covariant vectors, and \u2113 be any integer between\n|\u2113\u2081 \u2212 \u2113\u2082| and \u2113\u2081 + \u2113\u2082. Then\n\ng\u0302_\u2113 = C_{\u2113\u2081,\u2113\u2082,\u2113}^\u22a4 [f\u0302_{\u2113\u2081} \u2297 f\u0302_{\u2113\u2082}]    (17)\n\nis a \u03c1_\u2113\u2013covariant vector.\n\nExploiting Lemma 3, the nonlinearity used in our generalized Spherical CNNs consists of computing\n(17) between all pairs of fragments. In matrix notation,\n\nG^s_\u2113 = \u2a06_{|\u2113\u2081\u2212\u2113\u2082|\u2264\u2113\u2264\u2113\u2081+\u2113\u2082} C_{\u2113\u2081,\u2113\u2082,\u2113}^\u22a4 [F^s_{\u2113\u2081} \u2297 F^s_{\u2113\u2082}],    (18)\n\nwhere \u2294 denotes merging matrices horizontally. Note that this operation increases the size of\nthe activation substantially: the total number of fragments is squared, which can potentially be\nproblematic, and is addressed in the following subsection.\nThe Clebsch\u2013Gordan decomposition has recently appeared in two preprints discussing neural net-\nworks for learning physical systems [21, 22]. However, to the best of our knowledge, in the present\ncontext of a general purpose nonlinearity, it has never been proposed before. 
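For readers who want to inspect the C_{\u2113\u2081,\u2113\u2082,\u2113} blocks, they can be generated from tabulated Clebsch\u2013Gordan coefficients. Here is a hedged sketch using sympy (any CG coefficient table would do; the function name is ours), which also exhibits the selection rule that the (m\u2081, m\u2082), m entry vanishes unless m\u2081 + m\u2082 = m:

```python
from sympy import S
from sympy.physics.quantum.cg import CG

def cg_block(l1, l2, l):
    # (2*l1+1)(2*l2+1) x (2*l+1) block of Clebsch-Gordan coefficients;
    # rows are indexed by (m1, m2), columns by m.
    rows = [(m1, m2) for m1 in range(-l1, l1 + 1) for m2 in range(-l2, l2 + 1)]
    block = [[float(CG(S(l1), S(m1), S(l2), S(m2), S(l), S(m)).doit())
              for m in range(-l, l + 1)] for (m1, m2) in rows]
    return rows, block

rows, C = cg_block(1, 1, 1)
for (m1, m2), row in zip(rows, C):
    for m, c in zip(range(-1, 2), row):
        if abs(c) > 1e-12:
            assert m == m1 + m2   # the selection rule that makes C sparse
```

This sparsity is what makes the transform cheap in practice, as discussed next.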
At \ufb01rst sight, it would appear that the computational cost of (17) (assuming that C_{\u2113\u2081,\u2113\u2082,\u2113} has\nbeen precomputed) is (2\u2113\u2081 + 1)(2\u2113\u2082 + 1)(2\u2113 + 1). However, C_{\u2113\u2081,\u2113\u2082,\u2113} is actually sparse, in particular\n[C_{\u2113\u2081,\u2113\u2082,\u2113}]_{(m\u2081,m\u2082),m} = 0 unless m\u2081 + m\u2082 = m. Denoting the total number of scalar entries in the F^s_\u2113\nmatrices by N, this reduces the complexity of computing (18) to O(N^2 L). While the CG transform\nis not currently available as a differentiable operator in any of the major deep learning software\nframeworks, we have developed and will publicly release a C++ PyTorch extension for it.\n\n[Figure 1: Schematic of a single layer of the Clebsch\u2013Gordan network. Inputs F^{s\u22121}_0, F^{s\u22121}_1, F^{s\u22121}_2 pass through the Clebsch\u2013Gordan product and then the linear transforms \u00d7W_0, \u00d7W_1, \u00d7W_2 to give the outputs F^s_0, F^s_1, F^s_2.]\n\nA more unusual feature of the CG nonlinearity is its essentially quadratic nature. Quadratic\nnonlinearities are not commonly used in deep neural networks. Nonetheless, our experiments indicate\nthat the CG nonlinearity is effective in the context of learning spherical images. It is also possible to\nuse higher CG powers, although the computational cost will obviously increase.\n\n3.3 Limiting the number of channels\n\nIn a covariant network, each individual f\u0302^s_\u2113 fragment is effectively a separate channel. In this sense,\nthe quadratic increase in the number of channels after the CG-transform can be seen as a natural\nbroadening of the network to capture more complicated features. 
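The fragment bookkeeping here is easy to make precise: given the type \u03c4 of the input activation, each pair of fragments of orders (\u2113\u2081, \u2113\u2082) contributes one output fragment for every \u2113 between |\u2113\u2081 \u2212 \u2113\u2082| and min(\u2113\u2081 + \u2113\u2082, L). A small sketch of the resulting type (the function name is ours, not from the paper's released code):

```python
def cg_output_type(tau, L):
    # Type of the activation after the pairwise CG step of eq. (18),
    # truncated at band limit L.
    out = [0] * (L + 1)
    for l1, t1 in enumerate(tau):
        for l2, t2 in enumerate(tau):
            for l in range(abs(l1 - l2), min(l1 + l2, L) + 1):
                out[l] += t1 * t2
    return out

# One fragment per l goes in (3 fragments total), 15 come out -- hence the
# learnable projection back down to a fixed type after every layer.
print(cg_output_type([1, 1, 1], L=2))  # [3, 6, 6]
```

The quadratic growth (the total fragment count scales with the square of the input count) is exactly what the linear step described next is used to keep in check.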
Naturally, allowing the number of\nchannels to increase quadratically in each layer would be untenable, though.\nFollowing the results of Section 3.1, the natural way to counteract the exponential increase in the\nnumber of channels is to follow the CG-transform with another learnable linear transformation that\nreduces the number of fragments for each \u2113 to some \ufb01xed maximum number \u03c4_\u2113. In fact, this\nlinear transformation can replace the transformation of Section 3.1. Whereas in conventional neural\nnetworks the linear transformation always precedes the nonlinear operation, in Clebsch\u2013Gordan\nnetworks it is natural to design each layer so as to perform the CG-transform \ufb01rst, and then the\nconvolution-like step (16), which will limit the number of fragments.\n\n3.4 Final invariant layer\n\nAfter the (S \u2212 1)\u2019th layer, the activations of our network will be a series of matrices F^{S\u22121}_0, . . . , F^{S\u22121}_L,\neach transforming under rotations according to F^{S\u22121}_\u2113 \u21a6 \u03c1_\u2113(R) F^{S\u22121}_\u2113. Ultimately, however, the\nobjective of the network is to output a vector that is invariant with respect to rotations, i.e., a collection\nof scalars. In our Fourier theoretic language, this simply corresponds to the f\u0302^S_{0,j} fragments, since the\n\u2113 = 0 representation is constant, and therefore the elements of F^S_0 are invariant. Thus, the \ufb01nal layer\ncan be similar to the earlier ones, except that it only needs to output this single (single row) matrix.\nNote that in contrast to other architectures such as [1] that involve repeated forward and backward\ntransforms, thanks to their fully Fourier nature, for Clebsch\u2013Gordan nets, in both training and\ntesting, the elements of F^S_0 are guaranteed to be invariant to rotations of arbitrary magnitude not just\napproximately, but in the exact sense, up to limitations of \ufb01nite precision arithmetic. 
This is a major\nadvantage of Clebsch\u2013Gordan networks compared to other covariant architectures.\n\n3.5 Summary of algorithm\n\nIn summary, our Spherical Clebsch\u2013Gordan network is an S + 1 layer feed-forward neural network in\nwhich apart from the initial spherical harmonic transform, every other operation is a simple matrix\noperation. The algorithm is presented in explicit form in the Supplement.\n\nMethod (QM7)                  RMSE\nMLP/Random CM [28]            5.96\nLGIKA (RF) [29]               10.82\nRBF Kernels/Rand CM [28]      11.42\nRBF Kernels/Sorted CM [28]    12.59\nMLP/Sorted CM [28]            16.06\nSpherical CNN [1]             8.47\nOurs (FFS2CNN)                7.97\n\nMethod (SHREC17)       P@N    R@N    F1@N   mAP    NDCG\nTatsuma_ReVGG          0.705  0.769  0.719  0.696  0.783\nFuruya_DLAN            0.814  0.683  0.706  0.656  0.754\nSHREC16-Bai_GIFT       0.678  0.667  0.661  0.607  0.735\nDeng_CM-VGG5-6DB       0.412  0.706  0.472  0.524  0.624\nSpherical CNNs [1]     0.701  0.711  0.699  0.676  0.756\nFFS2CNNs (ours)        0.707  0.722  0.701  0.683  0.756\n\nTable 1: Results on the QM7 and 3D shape recognition datasets.\n\n4 Experiments\n\nIn this section we describe experiments that give a direct comparison with those reported by Cohen\net al. [1]. We choose these experiments as the Spherical CNN proposed in [1] is the only direct\ncompetition to our method.\n\nRotated MNIST on the Sphere We use a version of MNIST in which the images are painted onto\na sphere and use two instances as in [1]; more details about the data, baseline models, as well as the\ndetailed architecture of our model and hyperparameters are provided in the appendix. 
We report three sets of experiments: in the first, neither the training set nor the test set is rotated (denoted NR/NR); in the second, the training set is not rotated while the test set is randomly rotated (NR/R); and in the third, both the training and test sets are rotated (denoted R/R).

Method           NR/NR  NR/R   R/R
Baseline CNN     97.67  22.18  12
Cohen et al.     95.59  94.62  93.4
Ours (FFS2CNN)   96.4   96     96.6

We observe that the baseline model's performance deteriorates across the three cases, more or less reducing to random chance in the R/R case. While our results are better than those reported in [1], they also have another characteristic: they remain roughly the same in the three regimes, while those of [1] slightly worsen. We think this might be a result of the loss of equivariance in their method.

Atomization Energy Prediction  Next, we apply our framework to the QM7 dataset [30, 31], where the goal is to regress the atomization energies of molecules given atomic positions (p_i) and charges (z_i). Each molecule contains up to 23 atoms of 5 types (C, N, O, S, H). More details about the representations used and the baseline models, as well as the architectural parameters, are provided in the appendix. The final results are presented in Table 1, which shows that our method outperforms the Spherical CNN of Cohen et al. The only method that delivers better performance is an MLP trained on randomly permuted Coulomb matrices [28], and as [1] point out, this method is unlikely to scale to large molecules, as it needs a large sample of random permutations, which grows rapidly with N.

3D Shape Recognition  Finally, we report results for shape classification using the SHREC17 dataset [32], which is a subset of the larger ShapeNet dataset [33] containing roughly 51300 3D models spread over 55 categories. Architectural details are provided in the appendix.
We compare our results to some of the top performing models on SHREC17 (which use architectures specialized to the task), as well as to the model of Cohen et al. Our method, like that of Cohen et al., is task agnostic and uses the same representation. Despite this, it consistently comes second or third in the competition, showing that it affords an efficient method for learning from spherical signals.

5 Conclusion

We have presented an SO(3)-equivariant neural network architecture for spherical data that operates completely in Fourier space, circumventing a major drawback of earlier models that need to switch back and forth between Fourier space and "real" space. We achieve this by, rather unconventionally, using the Clebsch–Gordan decomposition as the only source of nonlinearity. While the specific focus here is on spheres and SO(3)-equivariance, the approach is more widely applicable, suggesting a general formalism for designing fully Fourier neural networks that are equivariant to the action of any compact continuous group.

References

[1] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling. Spherical CNNs. International Conference on Learning Representations, 2018.

[2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.

[3] R. Gens. Deep Symmetry Networks. NIPS 2014, pages 1–9, 2014.

[4] T. S. Cohen and M. Welling. Group equivariant convolutional networks. Proceedings of The 33rd International Conference on Machine Learning, 48:2990–2999, 2016.

[5] T. S. Cohen and M. Welling. Steerable CNNs. In ICLR, 2017.

[6] S. Ravanbakhsh, J. Schneider, and B. Poczos. Equivariance through parameter-sharing. In Proceedings of International Conference on Machine Learning, 2017.

[7] R. Kondor and S. Trivedi.
On the generalization of equivariance and convolution in neural networks to the action of compact groups. 2018.

[8] C. Esteves, C. Allen-Blanchette, A. Makadia, and K. Daniilidis. Learning SO(3) Equivariant Representations with Spherical CNNs. 2017.

[9] L. Zelnik-Manor, G. Peters, and P. Perona. Squaring the Circles in Panoramas. IEEE ICCV, pages 1292–1299, 2005.

[10] J. Cruz-Mota, I. Bogdanova, B. Paquier, M. Bierlaire, and J.-P. Thiran. Scale invariant feature transform on the sphere: Theory and applications. International Journal of Computer Vision, 98(2):217–241, 2012.

[11] Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2Vid: Automatic cinematography for watching 360° videos. Lecture Notes in Computer Science, 10114 LNCS(1):154–171, 2017.

[12] W.-S. Lai, Y. Huang, N. Joshi, C. Buehler, M.-H. Yang, and S.-B. Kang. Semantic-driven Generation of Hyperlapse from 360° Video. pages 1–12.

[13] R. Khasanova and P. Frossard. Graph-Based Classification of Omnidirectional Images. 2017.

[14] W. Boomsma and J. Frellsen. Spherical convolutions and their application in molecular modelling. NIPS, pages 3436–3446, 2017.

[15] Y.-C. Su and K. Grauman. Making 360° Video Watchable in 2D: Learning Videography for Click Free Viewing. 2017.

[16] D. M. Healey, D. N. Rockmore, and S. B. Moore. An FFT for the 2-sphere and applications. In 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 3, pages 1323–1326, May 1996.

[17] P. J. Kostelec and D. N. Rockmore. FFTs on the rotation group. Journal of Fourier Analysis and Applications, 14(2):145–179, Apr 2008.

[18] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow. Harmonic Networks: Deep Translation and Rotation Equivariance. 2016.

[19] C. Esteves, C.
Allen-Blanchette, X. Zhou, and K. Daniilidis. Polar Transformer Networks. 2017.

[20] J.-P. Serre. Linear Representations of Finite Groups, volume 42 of Graduate Texts in Mathematics. Springer-Verlag, 1977.

[21] N. Thomas, T. Smidt, S. Kearnes, L. Yang, L. Li, K. Kohlhoff, and P. Riley. Tensor field networks: Rotation- and translation-equivariant neural networks for 3D point clouds. arXiv:1802.08219 [cs], Feb 2018.

[22] R. Kondor. N-body networks: a covariant hierarchical neural network architecture for learning atomic potentials. CoRR, abs/1803.01588, 2018.

[23] J. Masci, D. Boscaini, M. M. Bronstein, and P. Vandergheynst. Geodesic convolutional neural networks on Riemannian manifolds. 2015.

[24] F. Monti, D. Boscaini, J. Masci, E. Rodola, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. 2016.

[25] B. Gutman, Y. Wang, T. Chan, P. M. Thompson, and A. W. Toga. Shape registration with spherical cross correlation. 2nd MICCAI Workshop on Mathematical Foundations of Computational Anatomy, pages 56–67, 2008.

[26] A. Terras. Fourier analysis on finite groups and applications, volume 43 of London Mathematical Society Student Texts. Cambridge Univ. Press, 1999.

[27] P. Diaconis. Group Representation in Probability and Statistics, volume 11 of IMS Lecture Series. Institute of Mathematical Statistics, 1988.

[28] G. Montavon, K. Hansen, S. Fazli, M. Rupp, F. Biegler, A. Ziehe, A. Tkatchenko, O. A. von Lilienfeld, and K. Müller. Learning invariant representations of molecules for atomization energy prediction. In NIPS, 2012.

[29] A. Raj, A. Kumar, Y. Mroueh, and P. T. Fletcher et al. Local group invariant representations via orbit embeddings. 2016.

[30] L. C. Blum and J.-L. Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database gdb-13.
Journal of the American Chemical Society, 2009.

[31] M. Rupp, A. Tkatchenko, K.-R. Müller, and O. A. von Lilienfeld. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters, 2012.

[32] M. Savva, F. Yu, H. Su, A. Kanezaki, T. Furuya, R. Ohbuchi, Z. Zhou, R. Yu, S. Bai, X. Bai, M. Aono, A. Tatsuma, S. Thermos, A. Axenopoulos, G. Th. Papadopoulos, P. Daras, X. Deng, Z. Lian, B. Li, H. Johan, Y. Lu, and S. Mk. Large-scale 3D shape retrieval from ShapeNet Core55. Eurographics Workshop on 3D Object Retrieval, 2017.

[33] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi, and F. Yu. ShapeNet: An information-rich 3D model repository. 2015.