{"title": "Deep Symmetry Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2537, "page_last": 2545, "abstract": "The chief difficulty in object recognition is that objects' classes are obscured by a large number of extraneous sources of variability, such as pose and part deformation. These sources of variation can be represented by symmetry groups, sets of composable transformations that preserve object identity. Convolutional neural networks (convnets) achieve a degree of translational invariance by computing feature maps over the translation group, but cannot handle other groups. As a result, these groups' effects have to be approximated by small translations, which often requires augmenting datasets and leads to high sample complexity. In this paper, we introduce deep symmetry networks (symnets), a generalization of convnets that forms feature maps over arbitrary symmetry groups. Symnets use kernel-based interpolation to tractably tie parameters and pool over symmetry spaces of any dimension. Like convnets, they are trained with backpropagation. The composition of feature transformations through the layers of a symnet provides a new approach to deep learning. Experiments on NORB and MNIST-rot show that symnets over the affine group greatly reduce sample complexity relative to convnets by better capturing the symmetries in the data.", "full_text": "Deep Symmetry Networks\n\nRobert Gens\n\nPedro Domingos\n\nDepartment of Computer Science and Engineering\n\nUniversity of Washington\n\nSeattle, WA 98195-2350, U.S.A.\n\n{rcg,pedrod}@cs.washington.edu\n\nAbstract\n\nThe chief dif\ufb01culty in object recognition is that objects\u2019 classes are obscured by\na large number of extraneous sources of variability, such as pose and part de-\nformation. These sources of variation can be represented by symmetry groups,\nsets of composable transformations that preserve object identity. 
Convolutional\nneural networks (convnets) achieve a degree of translational invariance by com-\nputing feature maps over the translation group, but cannot handle other groups.\nAs a result, these groups\u2019 effects have to be approximated by small translations,\nwhich often requires augmenting datasets and leads to high sample complexity.\nIn this paper, we introduce deep symmetry networks (symnets), a generalization\nof convnets that forms feature maps over arbitrary symmetry groups. Symnets\nuse kernel-based interpolation to tractably tie parameters and pool over symmetry\nspaces of any dimension. Like convnets, they are trained with backpropagation.\nThe composition of feature transformations through the layers of a symnet pro-\nvides a new approach to deep learning. Experiments on NORB and MNIST-rot\nshow that symnets over the af\ufb01ne group greatly reduce sample complexity relative\nto convnets by better capturing the symmetries in the data.\n\nIntroduction\n\n1\nObject recognition is a central problem in vision. What makes it challenging are all the nuisance\nfactors such as pose, lighting, part deformation, and occlusion. It has been shown that if we could\nremove these factors, recognition would be much easier [2, 18]. Convolutional neural networks\n(convnets) [16], the current state-of-the-art method for object recognition, capture only one type of\ninvariance (translation); the rest have to be approximated via it and standard features. In practice,\nthe best networks require enormous datasets which are further expanded by af\ufb01ne transformations\n[7, 13] yet are sensitive to imperceptible image perturbations [24]. We propose deep symmetry\nnetworks, a generalization of convnets based on symmetry group theory [21] that makes it possible\nto capture a broad variety of invariances, and correspondingly improves generalization.\nA symmetry group is a set of transformations that preserve the identity of an object and obey the\ngroup axioms. 
Most of the visual nuisance factors are symmetry groups themselves, and by incorporating them into our model we are able to reduce the sample complexity of learning from data transformed by these groups. Deep symmetry networks (symnets) form feature maps over any symmetry group, rather than just the translation group. A feature map in a deep symmetry network is defined analogously to convnets as a filter that is applied at all points in the symmetry space. Each layer in our general architecture is constructed by applying every symmetry in the group to the input, computing features on the transformed input, and pooling over neighborhoods. The entire architecture is then trained by backpropagation. In this paper, we instantiate the architecture with the affine group, resulting in deep affine networks. In addition to translation, the affine group includes rotation, scaling and shear. The affine group of the two-dimensional plane is six-dimensional (i.e., an affine transformation can be represented by a point in 6D affine space). The key challenge with extending convnets to affine spaces is that it is intractable to explicitly represent and compute with a high-dimensional feature map. We address this by approximating the map using kernel functions, which not only interpolate but also control pooling in the feature maps. Compared to convnets, this architecture substantially reduces sample complexity on image datasets involving 2D and 3D transformations.

We share with other researchers the hypothesis that explanatory factors cannot be disentangled unless they are represented in an appropriate symmetry space [4, 11]. Our adaptation of a representation to work in symmetry space is similar in some respects to the use of tangent distance in nearest-neighbor classifiers [23]. Symnets, however, are deep networks that compute features in symmetry space at every level.
Whereas the tangent distance approximation is only locally accurate,\nsymnet feature maps can represent large displacements in symmetry space. There are other deep\nnetworks that reinterpret the invariance of convolutional networks. Scattering networks [6] are cas-\ncades of wavelet decompositions designed to be invariant to particular Lie groups, where translation\nand rotation invariance have been demonstrated so far. The M-theory of Anselmi et al. [2] con-\nstructs features invariant to a symmetry group by using statistics of dot products with group orbits.\nWe differ from these networks in that we model multiple symmetries jointly in each layer, we do not\ncompletely pool out a symmetry, and we discriminatively train our entire architecture. The \ufb01rst two\ndifferences are important because objects and their subparts may have relative \ufb02exibility but not total\ninvariance along certain dimensions of symmetry space. For example, a leg of a person can be seen\nin some but not all combinations of rotation and scale relative to the torso. Without discriminative\ntraining, scattering networks and M-theory are limited to representing features whose invariances\nmay be inappropriate for a target concept because they are \ufb01xed ahead of time, either by the wavelet\nhierarchy of the former or unsupervised training of the latter. The discriminative training of symnets\nyields features with task-oriented invariance to their sub-features. In the context of digit recognition\nthis might mean learning the concept of a \u20180\u2019 with more rotation invariance than a \u20186\u2019, which would\nincur loss if it had positive weights in the region of symmetry space where a \u20189\u2019 would also \ufb01re.\nMuch of the vision literature is devoted to features that reduce or remove the effects of certain sym-\nmetry groups, e.g., [19, 18]. 
Each feature by itself is not discriminative for object recognition, so\nstructure is modeled separately, usually with a representation that does not generalize to novel view-\npoints (e.g., bags-of-features) or with a rigid alignment algorithm that cannot represent uncertainty\nover geometry (e.g. [9, 20]). Compared to symnets, these features are not learned, have invariance\nlimited to a small set of symmetries, and destroy information that could be used to model object\nsub-structure. Like deformable part models [10], symnets can model and penalize relative transfor-\nmations that compose up the hierarchy, but can also capture additional symmetries.\nSymmetry group theory has made a limited number of appearances in machine learning [8]. A few\napplications are discussed by Kondor [12], and they are also used in determinantal point processes\n[14]. Methods for learning transformations from examples [25, 11] could potentially bene\ufb01t from\nbeing embedded in a deep symmetry network. Symmetries in graphical models [22] lead to effective\nlifted probabilistic inference algorithms. Deep symmetry networks may be applicable to these and\nother areas.\nIn this paper, we \ufb01rst review symmetry group theory and its relation to sample complexity. We then\ndescribe symnets and their af\ufb01ne instance, and develop new methods to scale to high-dimensional\nsymmetry spaces. Experiments on NORB and MNIST-rot show that af\ufb01ne symnets can reduce by a\nlarge factor the amount of data required to achieve a given accuracy level.\n\n2 Symmetry Group Theory\n\nA symmetry of an object is a transformation that leaves certain properties of that object intact [21].\nA group is a set S with an operator \u2217 on it with the four properties of closure, associativity, an\nidentity element, and an inverse element. A symmetry group is a type of group where the group\nelements are functions and the operator is function composition. 
A simple geometric example is the symmetry group of a square, which consists of four reflections and {0, 1, 2, 3} multiples of 90-degree rotations. These transformations can be composed together to yield one of the original eight symmetries. The identity element is the 0-degree rotation. Each symmetry has a corresponding inverse element. Composition of these symmetries is associative.

Lie groups are continuous symmetry groups whose elements form a smooth differentiable manifold. For example, the symmetries of a circle include reflections and rotations about the center. The affine group is a set of transformations that preserves collinearity and parallel lines. The Euclidean group is a subgroup of the affine group that preserves distances, and includes the set of rigid body motions (translations and rotations) in three-dimensional space.

The elements of a symmetry group can be represented as matrices. In this form, function composition can be performed via matrix multiplication. The transformation P followed by Q (also denoted Q ∘ P) is computed as R = QP. In this paper we treat the transformation matrix P as a point in D-dimensional space, where D depends on the particular representation of the symmetry group (e.g., D = 6 for affine transformations in the plane).

A generating set of a group is a subset of the group such that any group element can be expressed through combinations of generating set elements and their inverses. For example, a generating set of the translation symmetry group is {x → x + ε, y → y + ε} for infinitesimal ε. We define the k-neighborhood of element f in group S under generating set G as the subset of S that can be expressed as f composed with elements of G or their inverses at most k times.
With the previous example, the k-neighborhood of a translation vector f would take the shape of a diamond centered at f in the xy-plane.

The orbit of an object x is the set of objects obtained by applying each element of a symmetry group to x. Formally, a symmetry group S acting on a set of objects X defines an orbit for each x ∈ X: O_x = {s ∗ x : s ∈ S}. For example, the orbit of an image I(u) whose points are transformed by the rotation symmetry group s ∗ I(u) = I(s⁻¹ ∗ u) is the set of images resulting from all rotations of that image. If two orbits share an element, they are the same orbit. In this way, a symmetry group S partitions the set of objects into unique orbits X = ∪_a O_a. If a data distribution D(x, y) has the property that all the elements of an orbit share the same label y, S imposes a constraint on the hypothesis class of a learner, effectively lowering its VC-dimension and sample complexity [1].

3 Deep Symmetry Networks

Deep symmetry networks represent rich compositional structure that incorporates invariance to high-dimensional symmetries. The ideas behind these networks are applicable to any symmetry group, be it rigid-body transformations in 3D or permutation groups over strings. The architecture of a symnet consists of several layers of feature maps. Like convnets, these feature maps benefit from weight tying and pooling, and the whole network is trained with backpropagation. The maps and the filters they apply are in the dimension D of the chosen symmetry group S.

A deep symmetry network has L layers l ∈ {1, ..., L} each with I_l features and corresponding feature maps. A feature is the dot-product of a set of weights with a corresponding set of values from a local region of a lower layer followed by a nonlinearity. A feature map represents the application of a filter at all points in symmetry space.
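As a concrete illustration of the group operations from Section 2 (composition of transformations as a matrix product, and k-neighborhoods built by composing elements of a generating set), here is a minimal NumPy sketch; the function names and the choice of a unit step for ε are ours, not the paper's:

```python
import numpy as np

def affine(a, b, c, d, e, f):
    """A 2D affine transformation as a 3x3 homogeneous matrix (six coordinates)."""
    return np.array([[a, b, e], [c, d, f], [0.0, 0.0, 1.0]])

def compose(Q, P):
    """The transformation P followed by Q (written Q o P) is the product R = QP."""
    return Q @ P

# Generating set of the translation group: a step of size eps in x and in y.
eps = 1.0
G = [affine(1, 0, 0, 1, eps, 0), affine(1, 0, 0, 1, 0, eps)]

def k_neighborhood(f, G, k):
    """Elements reachable from f by composing at most k generators or their inverses."""
    def key(M):
        return tuple((np.round(M, 6) + 0.0).ravel())  # +0.0 normalizes -0.0 for dedup
    seen = {key(f): f}
    for _ in range(k):
        for M in list(seen.values()):  # snapshot: one more composition per round
            for g in G:
                for step in (g, np.linalg.inv(g)):
                    N = compose(M, step)
                    seen.setdefault(key(N), N)
    return list(seen.values())

# For translations this k-neighborhood is the diamond (L1 ball) described above:
# 5 elements for k = 1, 13 for k = 2.
diamond = k_neighborhood(affine(1, 0, 0, 1, 0, 0), G, 2)
```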
A feature at point P is computed from the feature maps of the lower layer at points in the k-neighborhood of P. As P moves in the symmetry space of a feature map, so does its neighborhood of inputs in the lower layer. Feature map i of layer l is denoted M[l, i] : R^D → R, a scalar function of the D-dimensional symmetry space. Given a generating set G ⊂ S, the points in the k-neighborhood of the identity element are stored in an array T[ ]. Each filter i of layer l defines a weight vector w[l, i, j] for each point T[j] in the k-neighborhood. The vector w[l, i, j] is the size of I_{l−1}, the number of features in the underlying layer. For example, a feature in an affine symnet that detects a person would have positive weight for an arm sub-feature in the region of the k-neighborhood that would transform the arm relative to the person (e.g., smaller, rotated, and translated relative to the torso). The value of feature map i in layer l at point P is the dot-product of weights and underlying feature values in the neighborhood of P followed by a nonlinearity:

M[l, i](P) = σ(v(P, l, i))    (1)
v(P, l, i) = Σ_{j}^{|T|} w[l, i, j] · x(P ∘ T[j])    (2)
x(P′) = ⟨ S(M[l−1, 0])(P′), . . . , S(M[l−1, I_{l−1}])(P′) ⟩    (3)

where σ is the nonlinearity (e.g., tanh(x) or max(x, 0)), v(P, l, i) is the dot product, P ∘ T[j] represents element j in the k-neighborhood of P, and x(P′) is the vector of values from the underlying pooled maps at point P′. This definition is a generalization of feature maps in convnets¹. Similarly, the same filter weights w[l, i, j] are tied across all points P in feature map M[l, i]. The evaluation of a point in a feature map is visualized in Figure 1.

Figure 1: The evaluation of point P in map M[l, i]. The elements of the k-neighborhood of P are computed P ∘ T[j]. Each point in the neighborhood is evaluated in the pooled feature maps of the lower layer l − 1. The pooled maps are computed with kernels on the underlying feature maps. The dashed line intersects the points in the pooled map whose values form x(P ∘ T[j]) in Equation 3; it also intersects the contours of kernels used to compute those pooled values. The value of the feature is the sum of the dot-products w[l, i, j] · x(P ∘ T[j]) over all j, followed by a nonlinearity.

Feature maps M[l, i] are pooled via kernel convolution to become S(M[l, i]). In the case of sum-pooling, S(M[l, i])(P) = ∫ M[l, i](P − Q) K(Q) dQ; for max-pooling, S(M[l, i])(P) = max_Q M[l, i](P − Q) K(Q). The kernel K(Q) is also a scalar function of the D-dimensional symmetry space. In the previous example of a person feature, the arm feature map could be pooled over a wide range of rotations but narrow range of translations and scales so that the person feature allows for moveable but not unrealistic arms. Each filter can specify the kernels it uses to pool lower layers, but for the sake of brevity and analogy to convnets we assume that the feature maps of a layer are pooled by the same kernel. Note that convnets discretize these operations, subsample the pooled map, and use a uniform kernel K(Q) = 1{||Q||_∞ < r}.

As with convnets, the values of points in a symnet feature map are used by higher symnet layers, layers of fully connected hidden units, and ultimately softmax classification. Hidden units take the familiar form o = σ(Wx + b), with input x, output o, weight matrix W, and bias b. The log-loss of the softmax L on an instance is −w_i · x − b_i + log(Σ_c exp(w_c · x + b_c)), where Y = i is the true label, w_c and b_c are the weight vector and bias for class c, and the summation is over the classes.
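The per-point feature computation of Equations 1-3, with Gaussian-kernel pooling of a lower map stored at a handful of points, can be sketched as follows. This is a toy reading of the definitions, assuming NumPy; all function names (and the choice of storing a pooled map as explicit point-value pairs) are ours:

```python
import numpy as np

def vec(M):
    """6-D point (a, b, e, c, d, f) for a 2D affine transform in homogeneous form."""
    return M[:2, :].ravel()

def gaussian_kernel(q, Sigma_inv):
    """K(Q) = exp(-q^T Sigma^{-1} q) on the D-dimensional symmetry space."""
    return np.exp(-q @ Sigma_inv @ q)

def pooled_map(points, values, Sigma_inv):
    """Max-pooled map S(M[l, i]) interpolated from stored (point, value) pairs."""
    def S(p):
        return max(v * gaussian_kernel(p - q, Sigma_inv)
                   for q, v in zip(points, values))
    return S

def feature_value(P, T, w, pooled_lower, sigma=np.tanh):
    """Equations 1-3: M[l, i](P) = sigma(sum_j w[l, i, j] . x(P o T[j]))."""
    v = 0.0
    for j, Tj in enumerate(T):
        Pj = P @ Tj                                       # composition P o T[j]
        x = np.array([S(vec(Pj)) for S in pooled_lower])  # Eq. 3: pooled lower maps
        v += w[j] @ x                                     # Eq. 2: dot product
    return sigma(v)                                       # Eq. 1: nonlinearity
```

For instance, with a single lower map stored at the identity with value 1, a single identity filter point, and unit weight, the feature value at the identity is tanh(1).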
The input image is treated as a feature map (or maps, if color or stereo) with values in the translation symmetry space.

Deep symmetry networks are trained with backpropagation and are amenable to the same best practices as convnets. Though feature maps are defined as continuous, in practice the maps and their gradients are evaluated on a finite set of points P ∈ M[l, i]. We provide the partial derivative of the loss L with respect to a weight vector:

∂L/∂w[l, i, j] = Σ_{P ∈ M[l,i]} (∂L/∂M[l, i](P)) (∂M[l, i](P)/∂w[l, i, j])    (4)
∂M[l, i](P)/∂w[l, i, j] = σ′(v(P, l, i)) x(P ∘ T[j])    (5)

¹The neighborhood that defines a square filter in convnets is the reference point translated by up to k times in x and k times in y.

Figure 2: The feature hierarchy of a three-layer deep affine net is visualized with and without pooling. From top to bottom, the layers (A, B, C) contain one, five, and four feature maps, each corresponding to a labeled part of the cartoon figure. Each horizontal line represents a six-dimensional affine feature map, and bold circles denote six-dimensional points in the map. The dashed lines represent the affine transformation from a feature to the location of one of its filter points. For clarity, only a subset of filter points are shown. Left: Without pooling, the hierarchy represents a rigid affine transformation among all maps. Another point on feature map A is visualized in grey.
Right: Feature maps B1 and C1 are pooled with a kernel that gives those features flexibility in rotation.

The partial derivative of the loss L with respect to the value of a point in a lower layer is

∂L/∂M[l−1, i](P) = Σ_{i′}^{I_l} Σ_{P′ ∈ M[l,i′]} (∂L/∂M[l, i′](P′)) (∂M[l, i′](P′)/∂M[l−1, i](P))    (6)
∂M[l, i′](P′)/∂M[l−1, i](P) = σ′(v(P′, l, i′)) Σ_{j}^{|T|} w[l, i′, j][i] (∂S(M[l−1, i])(P′ ∘ T[j])/∂M[l−1, i](P))    (7)

where the gradient of the pooled feature map ∂S(M[l, i])(P)/∂M[l, i](Q) equals K(P − Q) for sum-pooling.

None of this treatment depends explicitly on the dimensionality of the space except for the kernel and transformation composition which have polynomial dependence on D. In the next section we apply this architecture to the affine group in 2D, but it could also be applied to the affine group in 3D or any other symmetry group.

4 Deep Affine Networks

We instantiate a deep symmetry network with the affine symmetry group in the plane. The affine symmetry group contains transformations capable of rotating, scaling, shearing, and translating two-dimensional points. The transformation is described by six coordinates:

[x′]   [a b] [x]   [e]
[y′] = [c d] [y] + [f]

This means that each of the feature maps M[l, i] and elements T[j] of the k-neighborhood is represented in six dimensions. The identity transformation is a = d = 1, b = c = e = f = 0.
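This six-coordinate parameterization can be sketched in NumPy as follows; perturbing each coordinate of the identity by a small ε reproduces the transformed squares of Figure 3 (ε = 0.2 is the exaggerated value from that figure; the function names are ours):

```python
import numpy as np

def apply_affine(params, pts):
    """[x'; y'] = [a b; c d][x; y] + [e; f], applied to an (n, 2) array of points."""
    a, b, c, d, e, f = params
    A = np.array([[a, b], [c, d]])
    return pts @ A.T + np.array([e, f])

identity = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])  # a = d = 1, b = c = e = f = 0
square = np.array([[-1.0, -1.0], [1.0, -1.0], [1.0, 1.0], [-1.0, 1.0]])

# Add eps to each of the six identity coordinates in turn and transform the
# square's corners, as in Figure 3.
eps = 0.2
perturbed = []
for k in range(6):
    p = identity.copy()
    p[k] += eps
    perturbed.append(apply_affine(p, square))
```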
The generating set of the affine symmetry group contains six elements, each of which is obtained by adding ε to one of the six coordinates in the identity transform. This generating set is visualized in Figure 3.

A deep affine network can represent a rich part hierarchy where each weight of a feature modulates the response to a subpart at a point in the affine neighborhood. The geometry of a deep affine network is best understood by tracing a point on a feature map through its filter point transforms into lower layers. Figure 2 visualizes this structure without and with pooling on the left and right sides of the diagram, respectively. Without pooling, the feature hierarchy defines a rigid affine relationship between the point of evaluation on a map and the location of its sub-features. In contrast, a pooled value on a sub-feature map is computed from a neighborhood defined by the kernel of points in affine space; this can represent model flexibility along certain dimensions of affine space.

Figure 3: The six transformations in the generating set of the affine group applied to a square (exaggerated ε = 0.2, identity is black square).

5 Scaling to High-Dimensional Symmetry Spaces

It would be intractable to explicitly represent the high-dimensional feature maps of symnets. Even a subsampled grid becomes unwieldy at modest dimensions (e.g., a grid in affine space with ten steps per axis has 10^6 points). Instead, each feature map is evaluated at N control points. The control points are local maxima of the feature in symmetry space, found by Gauss-Newton optimization, each initialized from a prior. This can be seen as a form of non-maximum suppression.
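The Gauss-Newton refinement that moves a control point toward a local optimum can be sketched generically. This is a standard damped Gauss-Newton step on a residual with a finite-difference Jacobian, not the paper's exact forward-compositional update; all names are ours:

```python
import numpy as np

def gauss_newton_step(residual, p, h=1e-6, damping=1e-9):
    """One Gauss-Newton step minimizing ||residual(p)||^2: solve (J^T J) d = -J^T r."""
    r = residual(p)
    J = np.empty((r.size, p.size))
    for k in range(p.size):            # finite-difference Jacobian, column by column
        dp = np.zeros_like(p)
        dp[k] = h
        J[:, k] = (residual(p + dp) - r) / h
    d = np.linalg.solve(J.T @ J + damping * np.eye(p.size), -J.T @ r)
    return p + d

# Each control point would be refined by a few such steps, with the solved update
# composed back onto the point (P <- P o dP in the paper's notation).
```

On a linear residual a single step reaches the optimum, which is why a handful of iterations suffices near a local maximum.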
Since the goal is recognition, there is no need to approximate the many points in symmetry space where the feature is not present. The map is then interpolated with kernel functions; the shape of the function also controls pooling.

5.1 Transformation Optimization

Convnets max-pool a neighborhood of translation space by exhaustive evaluation of feature locations. There are a number of algorithms that solve for a maximal feature location in symmetry space but they are not efficient when the feature weights are frequently adjusted [9, 20]. We adopt an iterative approach that dovetails with the definition of our features.

If a symnet is based on a Lie group, gradient based optimization can be used to find a point P∗ that locally maximizes the feature value (Equation 1) initialized at point P. In our experiments with deep affine nets, we follow the forward compositional (FC) warp [3] to align filters with the image. An extension of Lucas-Kanade, FC solves for an image alignment. We adapt this procedure to our filters and weight vectors: min_{ΔP} Σ_{j}^{|T|} ||w[l, i, j] − x(P ∘ ΔP ∘ T[j])||². We run an FC alignment for each of the N control points in feature map M[l, i], each initialized from a prior. Assuming Σ_{j}^{|T|} ||x(P ∘ ΔP ∘ T[j])||² is constant, this procedure locally maximizes the dot product between the filter and the map in Equation 2. Each iteration of FC takes a Gauss-Newton step to solve for a transformation of the neighborhood of the feature in the underlying map ΔP, which is then composed with the control point: P ← P ∘ ΔP.

5.2 Kernels

Given a set of N local optima O∗ = {(P1, v1), . . .
, (PN, vN)} in D-dimensional feature map M[l, i], we use kernel-based interpolation to compute a pooled map S(M[l, i]). The kernel performs three functions: penalizing relative locations of sub-features in symmetry space (cf. [10]), interpolating the map, and pooling a region of the map. These roles could be split into separate filter-specific kernels that are then convolved appropriately. The choice of these kernels will vary with the application. In our experiments, we lump these functions into a single kernel for a layer. We use a Gaussian kernel K(Q) = e^(−q^T Σ^(−1) q) where q is the D-dimensional vector representation of Q and the D×D covariance matrix Σ controls the shape and extent of the kernel. Several instances of this kernel are shown in Figure 4. Max-pooling produced the best results on our tests.

Figure 4: Contours of three 6D Gaussian kernels visualized on a surface in affine space. Points are visualized by an oriented square transformed by the affine transformation at that point. Each kernel has a different covariance matrix Σ.

6 Experiments

In our experiments we test the hypothesis that a deep network with access to a larger symmetry group will generalize better from fewer examples, provided those symmetries are present in the data. In particular, theory suggests that a symnet will have better sample complexity than another classifier on a dataset if it is based on a symmetry group that generates variations present in that dataset [1]. We compare deep affine symnets to convnets on the MNIST-rot and NORB image classification datasets, which finely sample their respective symmetry spaces such that learning curves measure the amount of augmentation that would be required to achieve similar performance. On both datasets affine symnets achieve a substantial reduction in sample complexity. This is particularly remarkable on NORB because its images are generated by a symmetry space in 3D. Symnet running time was within an order of magnitude of convnets, and could be greatly optimized.

6.1 MNIST-rot

MNIST-rot [15] consists of 28x28 pixel greyscale images: 10^4 for training, 2 × 10^3 for validation, and 5 × 10^4 for testing. The images are sampled from the MNIST digit recognition dataset and each is rotated by a random angle in the uniform distribution [0, 2π]. With transformations that apply to the whole image, MNIST-rot is a good testbed for comparing the performance of a single affine layer to a single convnet layer.

Figure 5: Impact of training set size on MNIST-rot test performance for architectures that use either one convolutional layer or one affine symnet layer.

We modified the Theano [5] implementation of convolutional networks so that the network consisted of a single layer of convolution and maxpooling followed by a hidden layer of 500 units and then softmax classification. The affine net layer was directly substituted for the convolutional layer. The control points of the affine net were initialized at uniformly random positions with rotations oriented around the image center, and each control point was locally optimized with four iterations of Gauss-Newton updates. The filter points of the affine net were arranged in a square grid. Both the affine net and the convnet compute a dot-product and use the sigmoid nonlinearity. Both networks were trained with 50 epochs of mini-batch gradient descent with momentum, and test results are reported on the network with lowest error on the validation set². The convnet did best with small 5 × 5 filters and the symnet with large 20 × 20 filters. This is not surprising because the convnet must approximate the large rotations of the dataset with translations of small patches.
The affine net can pool directly in this space of rotations with large filters.

Learning curves for the two networks are presented in Figure 5. We observe that the affine symnet roughly halves the error of the convnet. With small sample sizes, the symnet achieves an accuracy for which the convnet requires about eight times as many samples.

6.2 NORB

MNIST-rot is a synthetic dataset with symmetries that are not necessarily representative of real images. The NORB dataset [17] contains 2 × 108 × 108 pixel stereoscopic images of 50 toys in five categories: quadrupeds, human figures, airplanes, trucks, and cars. Five of the ten instances of each category are reserved for the test set. Each toy is photographed on a turntable from an exhaustive set of angles and lighting conditions. Each image is then perturbed by a random translation shift, planar rotation, luminance change, contrast change, scaling, distractor object, and natural image background. A sixth blank category containing just the distractor and background is also used. As in other papers, we downsample the images to 2 × 48 × 48. To compensate for the effect of distractors in smaller training sets, we also train and test on a version of the dataset that is centrally-cropped to 2 × 24 × 24. We report results for whichever version had lower validation error. In our experiments we train on a variable subset of the first training fold, using the first 2 × 10^3 images of the second fold for validation. Our results use both of the testing folds.

We compare architectures that use two convolutional layers or two affine ones, which performed better than single-layer ones. As with the MNIST-rot experiments, the symnet and convnet layers are followed by a layer of 500 hidden units and softmax classification.
The symnet control points in the first layer were arranged in three concentric rings in translation space, with 8 points spaced across rotation (200 total points). Control points in the second layer were fixed at the center of translation space arranged over 8 rotations and up to 2 vertical scalings (16 total points) to approximate the effects of elevation change. Control points were not iteratively optimized due to the small size of object parts in downsampled images. The filter points of the first layer of the affine net were arranged in a square grid. The second layer filter points were arranged in a circle in translation space at a 3 or 4 pixel radius, with 8 filter points evenly spaced across rotation at each translation. We report the test results of the networks with lowest validation error on a range of hyperparameters³.

²Grid search over learning rate {.1, .2}, mini-batch size {10, 50, 100}, filter size {5, 10, 15, 20, 25}, number of filters {20, 50, 80}, pooling size (convnet) {2, 3, 4}, and number of control points (symnet) {5, 10, 20}.

Figure 6: Impact of training set size on NORB test performance for architectures with two convolutional or affine symnet layers followed by a fully connected layer and then softmax classification.

The learning curves for convnets and affine symnets are shown in Figure 6. Even though the primary variability in NORB is due to rigid 3D transformations, we find that our affine networks still have an advantage over convnets. A 3D rotation can be locally approximated with 2D scales, shears, and rotations. The affine net can represent these transformations and so it benefited from larger filter patches. The translation approximation of the convnet is unable to properly align larger features to the true symmetries, and so it performed better with smaller filters.
The convnet requires about four times as much data to reach the accuracy of the symnet with the smallest training set. Larger filters capture more structure than smaller ones, allowing symnets to generalize better than convnets, and effectively giving each symnet layer the power of more than one convnet layer.

The left side of the graph may be more indicative of the gains symnets can offer over convnets on more realistic datasets that do not have thousands of images of identical 3D shapes. With the ability to apply more realistic transformations to sub-parts, symnets may also be better able to reuse substructure on datasets with many interrelated or fine-grained categories. Since symnets are a clean generalization of convnets, they should benefit from the learning, regularization, and efficiency techniques used by state-of-the-art networks [13].

7 Conclusion
Symmetry groups underlie the hardest challenges in computer vision. In this paper we introduced deep symmetry networks, the first deep architecture that can compute features over any symmetry group. Symnets are a natural generalization of convolutional neural networks that uses kernel interpolation and transformation optimization to address the difficulties of representing high-dimensional feature maps. In experiments on two image datasets with 2D and 3D variability, affine symnets achieved higher accuracy than convnets while using significantly less data.

Directions for future work include extending to other symmetry groups (e.g., lighting, 3D space), modeling richer distortions, incorporating probabilistic inference, and scaling to larger datasets.

Acknowledgments
This research was partly funded by ARO grant W911NF-08-1-0242, ONR grants N00014-13-1-0720 and N00014-12-1-0312, and AFRL contract FA8750-13-2-0019.
The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ARO, ONR, AFRL, or the United States Government.

³Grid search over filter size in each layer {6, 9}, pooling size in each layer (convnet) {2, 3, 4}, first layer control point translation spacing (symnet) {2, 3}, momentum {0, 0.5, 0.9}, others as in MNIST-rot.

References
[1] Y. S. Abu-Mostafa. Hints and the VC dimension. Neural Computation, 5(2):278–288, 1993.
[2] F. Anselmi, J. Z. Leibo, L. Rosasco, J. Mutch, A. Tacchetti, and T. Poggio. Unsupervised learning of invariant representations in hierarchical architectures. ArXiv preprint 1311.4158, 2013.
[3] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. International Journal of Computer Vision, 56(3):221–255, 2004.
[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[5] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference, 2010.
[6] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1872–1886, 2013.
[7] D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[8] P. Diaconis. Group representations in probability and statistics. Institute of Mathematical Statistics, 1988.
[9] B. Drost, M. Ulrich, N. Navab, and S. Ilic.
Model globally, match locally: Efficient and robust 3D object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[10] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[11] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In Proceedings of the Twenty-First International Conference on Artificial Neural Networks, 2011.
[12] I. R. Kondor. Group theoretical methods in machine learning. Columbia University, 2008.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, 2012.
[14] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. ArXiv preprint 1207.6083, 2012.
[15] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007.
[16] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[17] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[18] T. Lee and S. Soatto. Video-based descriptors for object recognition. Image and Vision Computing, 29(10):639–652, 2011.
[19] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1999.
[20] F. Lu and E. Milios.
Robot pose estimation in unknown environments by matching 2D range scans. Journal of Intelligent and Robotic Systems, 18(3):249–275, 1997.
[21] W. Miller. Symmetry groups and their applications. Academic Press, 1972.
[22] M. Niepert. Markov chains on orbits of permutation groups. In Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence, 2012.
[23] P. Simard, Y. LeCun, and J. S. Denker. Efficient pattern recognition using a new transformation distance. In Advances in Neural Information Processing Systems 5, 1992.
[24] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
[25] L. Wiskott and T. J. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.