{"title": "Invariance and Stability of Deep Convolutional Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 6210, "page_last": 6220, "abstract": "In this paper, we study deep signal representations that are near-invariant to groups of transformations and stable to the action of diffeomorphisms without losing signal information. This is achieved by generalizing the multilayer kernel introduced in the context of convolutional kernel networks and by studying the geometry of the corresponding reproducing kernel Hilbert space. We show that the signal representation is stable, and that models from this functional space, such as a large class of convolutional neural networks, may enjoy the same stability.", "full_text": "Invariance and Stability of Deep Convolutional Representations

Alberto Bietti (Inria*, alberto.bietti@inria.fr)    Julien Mairal (Inria*, julien.mairal@inria.fr)

Abstract

In this paper, we study deep signal representations that are near-invariant to groups of transformations and stable to the action of diffeomorphisms without losing signal information. This is achieved by generalizing the multilayer kernel introduced in the context of convolutional kernel networks and by studying the geometry of the corresponding reproducing kernel Hilbert space. We show that the signal representation is stable, and that models from this functional space, such as a large class of convolutional neural networks, may enjoy the same stability.

1 Introduction

The results achieved by deep neural networks for prediction tasks have been impressive in domains where data is structured and available in large amounts. In particular, convolutional neural networks (CNNs) [14] have been shown to model well the local appearance of natural images at multiple scales, while also representing images with some invariance through pooling operations.
Yet, the exact nature of this invariance and the characteristics of the functional spaces where convolutional neural networks live are poorly understood; overall, these models are sometimes only seen as clever engineering black boxes that have been designed with a lot of insight collected since they were introduced.

Understanding the geometry of these functional spaces is nevertheless a fundamental question. In addition to potentially bringing new intuition about the success of deep networks, it may for instance help solve the issue of regularization, by providing ways to control the variations of prediction functions in a principled manner. Small deformations of natural signals often preserve their main characteristics, such as the class label in a classification task (e.g., the same digit written in different handwriting styles corresponds to the same image up to small deformations), and provide a much richer class of transformations than translations. Representations that are stable to small deformations allow more robust models that may exploit these invariances, which may lead to improved sample complexity. The scattering transform [5, 17] is a recent attempt to characterize convolutional multilayer architectures based on wavelets. The theory provides an elegant characterization of the invariance and stability properties of signals represented via the scattering operator, through a notion of Lipschitz stability to the action of diffeomorphisms. Nevertheless, these networks do not involve "learning" in the classical sense, since the filters of the networks are pre-defined, and the resulting architecture differs significantly from the most widely used ones.

In this work, we study these theoretical properties for more standard convolutional architectures from the point of view of positive definite kernels [27].
Specifically, we consider a functional space derived from a kernel for multi-dimensional signals, which admits a multilayer and convolutional structure that generalizes the construction of convolutional kernel networks (CKNs) [15, 16]. We show that this functional space contains a large class of CNNs with smooth homogeneous activation functions in addition to CKNs [15], allowing us to obtain theoretical results for both classes of models.

*Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The main motivation for introducing a kernel framework is to study separately data representation and predictive models. On the one hand, we study the translation-invariance properties of the kernel representation and its stability to the action of diffeomorphisms, obtaining guarantees similar to those of the scattering transform [17], while preserving signal information. When the kernel is appropriately designed, we also show how to obtain signal representations that are near-invariant to the action of any group of transformations. On the other hand, we show that these stability results can be translated to predictive models by controlling their norm in the functional space. In particular, the RKHS norm controls both stability and generalization, so that stability may lead to improved sample complexity.

Related work. Our work relies on image representations introduced in the context of convolutional kernel networks [15, 16], which yield a sequence of spatial maps similar to traditional CNNs, but each point on the maps is possibly infinite-dimensional and lives in a reproducing kernel Hilbert space (RKHS). The extension to signals with d spatial dimensions is straightforward.
Since computing the corresponding Gram matrix as in classical kernel machines is computationally impractical, CKNs provide an approximation scheme consisting of learning finite-dimensional subspaces of each layer's RKHS, where the data is projected; see [15]. The resulting architecture of CKNs resembles traditional CNNs with a subspace learning interpretation and different unsupervised learning principles.

Another major source of inspiration is the study of group invariance and stability to the action of diffeomorphisms of scattering networks [17], which introduced the main formalism and several proof techniques from harmonic analysis that were key to our results. Our main effort was to extend them to more general CNN architectures and to the kernel framework. Invariance to groups of transformations was also studied for more classical convolutional neural networks from methodological and empirical points of view [6, 9], and for shallow learned representations [1] or kernel methods [13, 19, 22].

Note also that other techniques combining deep neural networks and kernels have been introduced. Early multilayer kernel machines appear for instance in [7, 26]. Shallow kernels for images modelling local regions were also proposed in [25], and a multilayer construction was proposed in [4]. More recently, different models based on kernels have been introduced in [2, 10, 18] to gain theoretical insight about classical multilayer neural networks, while kernels are used to define convex models for two-layer neural networks in [36]. Finally, we note that Lipschitz stability of deep models to additive perturbations was found to be important for robustness to adversarial examples [8]. Our results show that convolutional kernel networks already enjoy such a property.

Notation and basic mathematical tools.
A positive definite kernel $K$ that operates on a set $\mathcal{X}$ implicitly defines a reproducing kernel Hilbert space $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathbb{R}$, along with a mapping $\varphi : \mathcal{X} \to \mathcal{H}$. A predictive model associates to every point $z$ in $\mathcal{X}$ a label in $\mathbb{R}$; it consists of a linear function $f$ in $\mathcal{H}$ such that $f(z) = \langle f, \varphi(z) \rangle_{\mathcal{H}}$, where $\varphi(z)$ is the data representation. Given now two points $z, z'$ in $\mathcal{X}$, the Cauchy-Schwarz inequality allows us to control the variation of the predictive model $f$ according to the geometry induced by the Hilbert norm $\|\cdot\|_{\mathcal{H}}$:

$$|f(z) - f(z')| \le \|f\|_{\mathcal{H}} \, \|\varphi(z) - \varphi(z')\|_{\mathcal{H}}. \qquad (1)$$

This property implies that two points $z$ and $z'$ that are close to each other according to the RKHS norm should lead to similar predictions, when the model $f$ has reasonably small norm in $\mathcal{H}$.

Then, we consider notation from signal processing similar to [17]. We call a signal $x$ a function in $L^2(\Omega, \mathcal{H})$, where $\Omega$ is a subset of $\mathbb{R}^d$ representing spatial coordinates and $\mathcal{H}$ is a Hilbert space, when $\|x\|_{L^2}^2 := \int_\Omega \|x(u)\|_{\mathcal{H}}^2 \, du < \infty$, where $du$ is the Lebesgue measure on $\mathbb{R}^d$. Given a linear operator $T : L^2(\Omega, \mathcal{H}) \to L^2(\Omega, \mathcal{H}')$, the operator norm is defined as $\|T\|_{L^2(\Omega,\mathcal{H}) \to L^2(\Omega,\mathcal{H}')} := \sup_{\|x\|_{L^2(\Omega,\mathcal{H})} \le 1} \|Tx\|_{L^2(\Omega,\mathcal{H}')}$. For the sake of clarity, we drop norm subscripts from now on, using the notation $\|\cdot\|$ for Hilbert space norms, $L^2$ norms, and $L^2 \to L^2$ operator norms, while $|\cdot|$ denotes the Euclidean norm on $\mathbb{R}^d$. Some useful mathematical tools are also presented in Appendix A.

2 Construction of the Multilayer Convolutional Kernel

We now present the multilayer convolutional kernel, which operates on signals with $d$ spatial dimensions.
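Before detailing the construction, the Lipschitz control (1) above can be illustrated with a short numerical sketch (an illustrative aside, not part of the construction): with a Gaussian kernel, both sides of (1) are computable from kernel evaluations alone, since $\|f\|_{\mathcal H}^2 = \alpha^\top K \alpha$ for $f = \sum_i \alpha_i K(z_i, \cdot)$ and $\|\varphi(z) - \varphi(z')\|_{\mathcal H}^2 = K(z,z) - 2K(z,z') + K(z',z')$. The anchor points, coefficients, and test points below are made-up data.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=1.0):
    """K(a, b) = exp(-|a-b|^2 / (2 sigma^2)) for rows of a and b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 3))        # points z_i defining f
alpha = rng.normal(size=5)               # coefficients of f
K = gaussian_kernel(anchors, anchors)
f_norm = np.sqrt(alpha @ K @ alpha)      # RKHS norm ||f||_H

def f(z):
    """Evaluate f(z) = sum_i alpha_i K(z_i, z)."""
    return alpha @ gaussian_kernel(anchors, z[None, :])[:, 0]

def feature_dist(z, zp):
    """||phi(z) - phi(z')||_H via the kernel trick."""
    kzz = gaussian_kernel(z[None, :], z[None, :])[0, 0]
    kpp = gaussian_kernel(zp[None, :], zp[None, :])[0, 0]
    kzp = gaussian_kernel(z[None, :], zp[None, :])[0, 0]
    return np.sqrt(kzz + kpp - 2 * kzp)

# check (1) on random pairs of points
for _ in range(100):
    z, zp = rng.normal(size=3), rng.normal(size=3)
    assert abs(f(z) - f(zp)) <= f_norm * feature_dist(z, zp) + 1e-9
```

Note that the feature map $\varphi$ is never formed explicitly; everything is expressed through kernel evaluations.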
The construction follows closely that of convolutional kernel networks [15] but generalizes it to input signals defined on the continuous domain $\Omega = \mathbb{R}^d$ (which does not prevent signals from having compact support), as done by Mallat [17] for analyzing the properties of the scattering transform; the issue of discretization, where $\Omega$ is a discrete grid, is addressed in Section 2.1.

[Figure 1: Construction of the $k$-th signal representation from the $(k{-}1)$-th one, via patch extraction $P_k x_{k-1}(v) \in \mathcal{P}_k$, kernel mapping $M_k P_k x_{k-1}(v) = \varphi_k(P_k x_{k-1}(v)) \in \mathcal{H}_k$, and linear pooling $x_k(w) = A_k M_k P_k x_{k-1}(w) \in \mathcal{H}_k$. Note that while $\Omega$ is depicted as a box in $\mathbb{R}^2$ here, our construction is supported on $\Omega = \mathbb{R}^d$. Similarly, a patch is represented as a square box for simplicity, but it may potentially have any shape.]

In what follows, an input signal is denoted by $x_0$ and lives in $L^2(\Omega, \mathcal{H}_0)$, where $\mathcal{H}_0$ is typically $\mathbb{R}^{p_0}$ (e.g., with $p_0 = 3$, $x_0(u)$ may represent the RGB pixel value at location $u$). Then, we build a sequence of RKHSs $\mathcal{H}_1, \mathcal{H}_2, \dots$, and transform $x_0$ into a sequence of "feature maps" supported on $\Omega$, respectively denoted by $x_1$ in $L^2(\Omega, \mathcal{H}_1)$, $x_2$ in $L^2(\Omega, \mathcal{H}_2)$, etc. As depicted in Figure 1, a new map $x_k$ is built from the previous one $x_{k-1}$ by successively applying three operators that perform patch extraction ($P_k$), kernel mapping ($M_k$) in a new RKHS $\mathcal{H}_k$, and linear pooling ($A_k$), respectively. When going up in the hierarchy, the points $x_k(u)$ carry information from larger signal neighborhoods centered at $u$ in $\Omega$, with more invariance, as we will formally show.

Patch extraction operator.
Given the layer $x_{k-1}$, we consider a patch shape $S_k$, defined as a compact centered subset of $\mathbb{R}^d$ (e.g., a box $[-1,1] \times [-1,1]$ for images), and we define the Hilbert space $\mathcal{P}_k := L^2(S_k, \mathcal{H}_{k-1})$ equipped with the norm $\|z\|^2 = \int_{S_k} \|z(u)\|^2 \, d\nu_k(u)$ for every $z$ in $\mathcal{P}_k$, where $d\nu_k$ is the normalized uniform measure on $S_k$. More precisely, we now define the linear patch extraction operator $P_k : L^2(\Omega, \mathcal{H}_{k-1}) \to L^2(\Omega, \mathcal{P}_k)$ such that for all $u$ in $\Omega$,

$$P_k x_{k-1}(u) = (v \mapsto x_{k-1}(u+v))_{v \in S_k} \in \mathcal{P}_k.$$

Note that by equipping $\mathcal{P}_k$ with a normalized measure, the operator $P_k$ preserves the norm. By Fubini's theorem, we have indeed $\|P_k x_{k-1}\| = \|x_{k-1}\|$, and hence $P_k x_{k-1}$ is in $L^2(\Omega, \mathcal{P}_k)$.

Kernel mapping operator. In a second stage, we map each patch of $x_{k-1}$ to a RKHS $\mathcal{H}_k$ with a kernel mapping $\varphi_k : \mathcal{P}_k \to \mathcal{H}_k$ associated to a positive definite kernel $K_k$. It is then possible to define the non-linear pointwise operator $M_k$ such that

$$M_k P_k x_{k-1}(u) := \varphi_k(P_k x_{k-1}(u)) \in \mathcal{H}_k.$$

As in [15], we use homogeneous dot-product kernels of the form

$$K_k(z, z') = \|z\| \, \|z'\| \, \kappa_k\!\left(\frac{\langle z, z' \rangle}{\|z\| \, \|z'\|}\right) \quad \text{with } \kappa_k(1) = 1, \qquad (2)$$

which ensures that $\|M_k P_k x_{k-1}(u)\| = \|P_k x_{k-1}(u)\|$ and that $M_k P_k x_{k-1}$ is in $L^2(\Omega, \mathcal{H}_k)$. Concrete examples of kernels satisfying (2), with some other properties, are presented in Appendix B.

Pooling operator. The last step to build the layer $x_k$ is to pool neighboring values to achieve some local shift-invariance. As in [15], we apply a linear convolution operator $A_k$ with a Gaussian kernel at scale $\sigma_k$, $h_{\sigma_k}(u) := \sigma_k^{-d} h(u/\sigma_k)$, where $h(u) = (2\pi)^{-d/2} \exp(-|u|^2/2)$.
Then,

$$x_k(u) = A_k M_k P_k x_{k-1}(u) = \int_{\mathbb{R}^d} h_{\sigma_k}(u - v) \, M_k P_k x_{k-1}(v) \, dv \in \mathcal{H}_k.$$

Applying Schur's test to the integral operator $A_k$ (see Appendix A), we obtain that $\|A_k\| \le 1$. Thus, $\|x_k\| \le \|M_k P_k x_{k-1}\|$ and $x_k \in L^2(\Omega, \mathcal{H}_k)$. Note that a similar pooling operator is used in the scattering representation [5, 17], though in a different way which does not affect subsequent layers.

Multilayer construction. Finally, we obtain a multilayer representation by composing the previous operators multiple times. In order to increase invariance with each layer, the size of the patch $S_k$ and the pooling scale $\sigma_k$ typically grow exponentially with $k$, with $\sigma_k$ and $\sup_{c \in S_k} |c|$ of the same order. With $n$ layers, the final representation is given by the feature map

$$\Phi_n(x_0) := x_n = A_n M_n P_n A_{n-1} M_{n-1} P_{n-1} \cdots A_1 M_1 P_1 x_0 \in L^2(\Omega, \mathcal{H}_n). \qquad (3)$$

Then, we can define a kernel $\mathcal{K}_n$ on two signals $x_0$ and $x_0'$ by $\mathcal{K}_n(x_0, x_0') := \langle \Phi_n(x_0), \Phi_n(x_0') \rangle$, whose RKHS $\mathcal{H}_{\mathcal{K}_n}$ contains all functions of the form $f(x_0) = \langle w, \Phi_n(x_0) \rangle$ with $w \in L^2(\Omega, \mathcal{H}_n)$. The following lemma shows that this representation preserves all information about the signal at each layer, and that each feature map $x_k$ can be sampled on a discrete set with no loss of information. This suggests a natural approach for discretization, which we discuss next. For space limitation reasons, all proofs in this paper are relegated to Appendix C.

Lemma 1 (Signal preservation). Assume that $\mathcal{H}_k$ contains the linear functions $\langle w, \cdot \rangle$ with $w$ in $\mathcal{P}_k$ (this is true for all kernels $K_k$ described in Appendix B); then the signal $x_{k-1}$ can be recovered from a sampling of $x_k = A_k M_k P_k x_{k-1}$ at discrete locations as soon as the union of patches centered at these points covers all of $\Omega$.
It follows that $x_k$ can be reconstructed from such a sampling.

2.1 From Theory to Practice: Discretization and Signal Preservation

The previous construction defines a kernel representation for general signals in $L^2(\Omega, \mathcal{H}_0)$, which is an abstract object defined for theoretical purposes, as often done in signal processing [17]. In practice, signals are discrete, and it is thus important to discuss the problem of discretization, as done in [15]. For clarity, we limit the presentation to one-dimensional signals ($\Omega = \mathbb{R}^d$ with $d = 1$), but the arguments can easily be extended to higher dimensions $d$ when using box-shaped patches. Notation from the previous section is preserved, but we add a bar on top of all discrete analogues of their continuous counterparts; e.g., $\bar{x}_k$ is a discrete feature map in $\ell^2(\mathbb{Z}, \bar{\mathcal{H}}_k)$ for some RKHS $\bar{\mathcal{H}}_k$.

Input signals $x_0$ and $\bar{x}_0$. Discrete signals acquired by a physical device are often seen as local integrators of signals defined on a continuous domain (e.g., sensors from digital cameras integrate the pointwise distribution of photons that hit a sensor in a spatial window). Let us then consider a signal $x_0$ in $L^2(\Omega, \mathcal{H}_0)$ and a sampling interval $s_0$. By defining $\bar{x}_0$ in $\ell^2(\mathbb{Z}, \mathcal{H}_0)$ such that $\bar{x}_0[n] = x_0(n s_0)$ for all $n$ in $\mathbb{Z}$, it is thus natural to assume that $x_0 = A_0 x$, where $A_0$ is a pooling operator (local integrator) applied to an original signal $x$. The role of $A_0$ is to prevent aliasing and reduce high frequencies; typically, the scale $\sigma_0$ of $A_0$ should be of the same magnitude as $s_0$, which we choose to be $s_0 = 1$ in the following, without loss of generality. This natural assumption will be kept later in the analysis.

Multilayer construction. We now want to build discrete feature maps $\bar{x}_k$ in $\ell^2(\mathbb{Z}, \bar{\mathcal{H}}_k)$ at each layer $k$, involving subsampling with a factor $s_k$ with respect to $\bar{x}_{k-1}$.
We now define the discrete analogues of the operators $P_k$ (patch extraction), $M_k$ (kernel mapping), and $A_k$ (pooling) as follows: for $n \in \mathbb{Z}$,

$$\bar{P}_k \bar{x}_{k-1}[n] := e_k^{-1/2} \, (\bar{x}_{k-1}[n], \bar{x}_{k-1}[n+1], \dots, \bar{x}_{k-1}[n + e_k - 1]) \in \bar{\mathcal{P}}_k := \bar{\mathcal{H}}_{k-1}^{e_k}$$

$$\bar{M}_k \bar{P}_k \bar{x}_{k-1}[n] := \bar{\varphi}_k(\bar{P}_k \bar{x}_{k-1}[n]) \in \bar{\mathcal{H}}_k$$

$$\bar{x}_k[n] = \bar{A}_k \bar{M}_k \bar{P}_k \bar{x}_{k-1}[n] := s_k^{1/2} \sum_{m \in \mathbb{Z}} \bar{h}_k[n s_k - m] \, \bar{M}_k \bar{P}_k \bar{x}_{k-1}[m] = (\bar{h}_k * \bar{M}_k \bar{P}_k \bar{x}_{k-1})[n s_k] \in \bar{\mathcal{H}}_k,$$

where (i) $\bar{P}_k$ extracts a patch of size $e_k$ starting at position $n$ in $\bar{x}_{k-1}$ (defining a patch centered at $n$ is also possible), which lives in the Hilbert space $\bar{\mathcal{P}}_k$ defined as the direct sum of $e_k$ copies of $\bar{\mathcal{H}}_{k-1}$; (ii) $\bar{M}_k$ is a kernel mapping identical to the continuous case, which preserves the norm, like $M_k$; (iii) $\bar{A}_k$ performs a convolution with a Gaussian filter and a subsampling operation with factor $s_k$. The next lemma shows that under mild assumptions, this construction preserves signal information.

Lemma 2 (Signal recovery with subsampling). Assume that $\bar{\mathcal{H}}_k$ contains the linear functions $\langle w, \cdot \rangle$ for all $w \in \bar{\mathcal{P}}_k$ and that $e_k \ge s_k$. Then, $\bar{x}_{k-1}$ can be recovered from $\bar{x}_k$.

We note that this result relies on recovery by deconvolution of a pooling convolution with filter $\bar{h}_k$, which is stable when its scale parameter, typically of order $s_k$ to prevent aliasing, is small enough. This suggests using small values for $e_k$ and $s_k$, as in typical recent convolutional architectures [30].

Links between the parameters of the discrete and continuous models. Due to subsampling, the patch sizes in the continuous and discrete models are related by a multiplicative factor.
Specifically, a patch of size $e_k$ in the discrete model corresponds to a patch $S_k$ of diameter $e_k s_{k-1} s_{k-2} \cdots s_1$ in the continuous case. The same holds true for the scale parameter of the Gaussian pooling.

2.2 From Theory to Practice: Kernel Approximation and Convolutional Kernel Networks

Besides discretization, two modifications are required to use the image representation we have described in practice. The first one consists of using feature maps with finite spatial support, which introduces border effects that we did not study, but which are negligible when dealing with large realistic images. The second one requires finite-dimensional approximations of the kernel maps, leading to the convolutional kernel network model of [15]. Typically, each RKHS's mapping is approximated by performing a projection onto a subspace of finite dimension, a classical approach to make kernel methods work at large scale [12, 31, 34]. One advantage is its compatibility with the RKHSs (meaning that the approximations live in the respective RKHSs), and the stability results we will present next are preserved thanks to the non-expansiveness of the projection.

It is then possible to derive theoretical results for the CKN model, which appears as a natural implementation of the kernel constructed previously; yet, we will also show in Section 5 that the results apply more broadly to CNNs that are contained in the functional space associated to the kernel.

3 Stability to Deformations and Translation Invariance

In this section, we study the translation-invariance and the stability of the kernel representation described in Section 2 for continuous signals under the action of diffeomorphisms.
We use a characterization of stability similar to the one introduced by Mallat [17]: for a $C^1$-diffeomorphism $\tau : \Omega \to \Omega$, let $L_\tau$ denote the linear operator defined by $L_\tau x(u) = x(u - \tau(u))$; the representation $\Phi(\cdot)$ is stable under the action of diffeomorphisms if there exist two constants $C_1$ and $C_2$ such that

$$\|\Phi(L_\tau x) - \Phi(x)\| \le (C_1 \|\nabla \tau\|_\infty + C_2 \|\tau\|_\infty) \, \|x\|, \qquad (4)$$

where $\nabla \tau$ is the Jacobian of $\tau$, $\|\nabla \tau\|_\infty := \sup_{u \in \Omega} \|\nabla \tau(u)\|$, and $\|\tau\|_\infty := \sup_{u \in \Omega} |\tau(u)|$. As in [17], our results will assume the regularity condition $\|\nabla \tau\|_\infty < 1/2$. In order to have a translation-invariant representation, we want $C_2$ to be small (a translation is a diffeomorphism with $\nabla \tau = 0$), and indeed we will show that $C_2$ is proportional to $1/\sigma_n$, where $\sigma_n$ is the scale of the last pooling layer, which typically increases exponentially with the number of layers $n$.

Note that unlike the scattering transform [17], we do not have a representation that preserves the norm, i.e., such that $\|\Phi(x)\| = \|x\|$. While the patch extraction $P_k$ and kernel mapping $M_k$ operators do preserve the norm, the pooling operators $A_k$ may remove (or significantly reduce) frequencies of the signal that are larger than $1/\sigma_k$. Yet, natural signals such as natural images often have high energy in the low-frequency domain (the power spectrum of natural images is often considered to have a polynomial decay in $1/f^2$, where $f$ is the signal frequency [33]). For such classes of signals, a large fraction of the signal energy will be preserved by the pooling operator.
In particular, with some additional assumptions on the kernels $K_k$, it is possible to show [3]:

$$\|\Phi(x)\| \ge \|A_n \cdots A_0 x\|.$$

Additionally, when using a Gaussian kernel mapping $\varphi_{n+1}$ on top of the last feature map as a prediction layer instead of a linear layer, the final representation $\Phi_f(x) := \varphi_{n+1}(\Phi_n(A_0 x))$ preserves stability and always has unit norm (see the extended version of the paper [3] for details). This suggests that norm preservation may be a less relevant concern in our kernel setting.

3.1 Stability Results

In order to study the stability of the representation (3), we assume that the input signal $x_0$ may be written as $x_0 = A_0 x$, where $A_0$ is an initial pooling operator at scale $\sigma_0$, which allows us to control the high frequencies of the signal in the first layer. As discussed previously in Section 2.1, this assumption is natural and compatible with any physical acquisition device. Note that $\sigma_0$ can be taken arbitrarily small, making the operator $A_0$ arbitrarily close to the identity, so that this assumption does not limit the generality of our results. Moreover, we make the following assumptions for each layer $k$:

(A1) Norm preservation: $\|\varphi_k(x)\| = \|x\|$ for all $x$ in $\mathcal{P}_k$;

(A2) Non-expansiveness: $\|\varphi_k(x) - \varphi_k(x')\| \le \|x - x'\|$ for all $x, x'$ in $\mathcal{P}_k$;

(A3) Patch sizes: there exists $\kappa > 0$ such that at any layer $k$ we have $\sup_{c \in S_k} |c| \le \kappa \sigma_{k-1}$.

Note that assumptions (A1-2) imply that the operators $M_k$ preserve the norm and are non-expansive. Appendix B exposes a large class of homogeneous kernels that satisfy assumptions (A1-2).

General bound for stability. The following result gives an upper bound on the quantity of interest, $\|\Phi(L_\tau x) - \Phi(x)\|$, in terms of the norms of various linear operators which control how $\tau$ affects each layer.
The commutator of two linear operators $A$ and $B$ is denoted $[A, B] = AB - BA$.

Proposition 3. Let $\Phi(x) = \Phi_n(A_0 x)$, where $\Phi_n$ is defined in (3), for $x$ in $L^2(\Omega, \mathcal{H}_0)$. Then,

$$\|\Phi(L_\tau x) - \Phi(x)\| \le \left( \sum_{k=1}^{n} \|[P_k A_{k-1}, L_\tau]\| + \|[A_n, L_\tau]\| + \|L_\tau A_n - A_n\| \right) \|x\|. \qquad (5)$$

In the case of a translation $L_\tau x(u) = L_c x(u) = x(u - c)$, it is easy to see that pooling and patch extraction operators commute with $L_c$ (this is also known as covariance or equivariance to translations), so that we are left with the term $\|L_c A_n - A_n\|$, which should control translation invariance. For general diffeomorphisms $\tau$, we no longer have exact covariance, but we show below that commutators are stable to $\tau$, in the sense that $\|[P_k A_{k-1}, L_\tau]\|$ is controlled by $\|\nabla \tau\|_\infty$, while $\|L_\tau A_n - A_n\|$ is controlled by $\|\tau\|_\infty$ and decays with the pooling size $\sigma_n$.

Bound on $\|[P_k A_{k-1}, L_\tau]\|$. We begin by noting that $P_k z$ can be identified with $(L_c z)_{c \in S_k}$ isometrically for all $z$ in $L^2(\Omega, \mathcal{H}_{k-1})$, since $\|P_k z\|^2 = \int_{S_k} \|L_c z\|^2 \, d\nu_k(c)$ by Fubini's theorem. Then,

$$\|P_k A_{k-1} L_\tau z - L_\tau P_k A_{k-1} z\|^2 = \int_{S_k} \|L_c A_{k-1} L_\tau z - L_\tau L_c A_{k-1} z\|^2 \, d\nu_k(c) \le \sup_{c \in S_k} \|L_c A_{k-1} L_\tau z - L_\tau L_c A_{k-1} z\|^2,$$

so that $\|[P_k A_{k-1}, L_\tau]\| \le \sup_{c \in S_k} \|[L_c A_{k-1}, L_\tau]\|$. The following result lets us bound $\|[L_c A_{k-1}, L_\tau]\|$ when $|c| \le \kappa \sigma_{k-1}$, which is satisfied under assumption (A3).

Lemma 4. Let $A_\sigma$ be the pooling operator with kernel $h_\sigma(u) = \sigma^{-d} h(u/\sigma)$.
If $\|\nabla \tau\|_\infty \le 1/2$, there exists a constant $C_1$ such that for any $\sigma$ and any $|c| \le \kappa \sigma$, we have

$$\|[L_c A_\sigma, L_\tau]\| \le C_1 \|\nabla \tau\|_\infty,$$

where $C_1$ depends only on $h$ and $\kappa$.

A similar result is obtained in Mallat [17, Lemma E.1] for commutators of the form $[A_\sigma, L_\tau]$, but we extend it to handle integral operators $L_c A_\sigma$ with a shifted kernel. The proof (given in Appendix C.4) relies on the fact that $[L_c A_\sigma, L_\tau]$ is an integral operator in order to bound its norm via Schur's test. Note that $\kappa$ can be made larger, at the cost of an increase of the constant $C_1$ of the order $\kappa^{d+1}$.

Bound on $\|L_\tau A_n - A_n\|$. We bound the operator norm $\|L_\tau A_n - A_n\|$ in terms of $\|\tau\|_\infty$ using the following result due to Mallat [17, Lemma 2.11], with $\sigma = \sigma_n$:

Lemma 5. If $\|\nabla \tau\|_\infty \le 1/2$, we have

$$\|L_\tau A_\sigma - A_\sigma\| \le \frac{C_2}{\sigma} \|\tau\|_\infty, \qquad (6)$$

with $C_2 = 2^d \cdot \|\nabla h\|_1$.

Combining Proposition 3 with Lemmas 4 and 5, we immediately obtain the following result.

Theorem 6. Let $\Phi(x)$ be a representation given by $\Phi(x) = \Phi_n(A_0 x)$ and assume (A1-3). If $\|\nabla \tau\|_\infty \le 1/2$, we have

$$\|\Phi(L_\tau x) - \Phi(x)\| \le \left( C_1 (1 + n) \|\nabla \tau\|_\infty + \frac{C_2}{\sigma_n} \|\tau\|_\infty \right) \|x\|. \qquad (7)$$

This result matches the desired notion of stability in Eq. (4), with a translation-invariance factor that decays with $\sigma_n$. The dependence on a notion of depth (the number of layers $n$ here) also appears in [17], with a factor equal to the maximal length of scattering paths, and with the same condition $\|\nabla \tau\|_\infty \le 1/2$.
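As an illustrative aside (not part of the paper's analysis), the $C_2/\sigma_n$ translation term of Theorem 6 can be observed numerically in one dimension: pooling with a Gaussian of growing scale $\sigma$ makes the pooled representation of a shifted signal increasingly close to that of the original. The signal, shift, and scales below are made up.

```python
import numpy as np

def gaussian_pool(x, sigma):
    """Circular convolution of x with a normalized Gaussian of scale sigma."""
    n = len(x)
    u = np.arange(n) - n // 2
    h = np.exp(-u ** 2 / (2 * sigma ** 2))
    h /= h.sum()
    # center the filter at index 0 before the FFT-based circular convolution
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(np.roll(h, -n // 2))))

rng = np.random.default_rng(1)
x = rng.normal(size=512)                 # a made-up 1-D signal
c = 3                                    # translation by 3 samples
shifted = np.roll(x, c)

errs = []
for sigma in [2.0, 8.0, 32.0]:
    err = np.linalg.norm(gaussian_pool(shifted, sigma) - gaussian_pool(x, sigma))
    errs.append(err / np.linalg.norm(x))

# the relative error shrinks as the pooling scale grows, roughly like |c|/sigma
assert errs[0] > errs[1] > errs[2]
```

Since pooling commutes with translation, this directly probes $\|L_c A_\sigma - A_\sigma\|$ on one signal; the decay with $\sigma$ is consistent with Lemma 5.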
However, while the norm of the scattering representation is preserved as the length of these paths goes to infinity, the norm of $\Phi(x)$ can decrease with depth due to pooling layers, though this concern may be alleviated by using an additional non-linear prediction layer, as discussed previously (see also [3]).

3.2 Stability with Kernel Approximations

As in the analysis of the scattering transform of [17], we have characterized the stability and shift-invariance of the data representation for continuous signals, in order to give some intuition about the properties of the corresponding discrete representation, which we have described in Section 2.1.

Another approximation performed in the CKN model of [15] consists of adding projection steps onto finite-dimensional subspaces of the RKHS layers, as discussed in Section 2.2. Interestingly, the stability properties we have obtained previously are compatible with these steps. We may indeed redefine the operator $M_k$ as the pointwise operation such that $M_k z(u) = \Pi_k \varphi_k(z(u))$ for any map $z$ in $L^2(\Omega, \mathcal{P}_k)$, instead of $M_k z(u) = \varphi_k(z(u))$; here, $\Pi_k : \mathcal{H}_k \to \mathcal{F}_k$ is a projection operator onto a linear subspace. Then, $M_k$ does not necessarily preserve the norm anymore, but $\|M_k z\| \le \|z\|$, with a loss of information corresponding to the quality of approximation of the kernel $K_k$ on the points $z(u)$. On the other hand, the non-expansiveness of $M_k$ is satisfied thanks to the non-expansiveness of the projection. Additionally, the CKN construction provides a finite-dimensional representation at each layer, which preserves the norm structure of the original Hilbert spaces isometrically. In summary, it is possible to show that the conclusions of Theorem 6 remain valid for this tractable CKN representation, but we lose signal information in the process.
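The non-expansiveness of the projection invoked above is elementary to check numerically: an orthogonal projection is $1$-Lipschitz, so $\|\Pi x - \Pi y\| \le \|x - y\|$. A minimal sketch with a made-up random subspace (this is not the CKN subspace-learning procedure itself, only the property it relies on):

```python
import numpy as np

rng = np.random.default_rng(2)
# orthonormal basis of a random 10-dimensional subspace of R^50
basis, _ = np.linalg.qr(rng.normal(size=(50, 10)))
proj = basis @ basis.T                   # orthogonal projection onto the span

for _ in range(100):
    x, y = rng.normal(size=50), rng.normal(size=50)
    # projection never increases distances
    assert np.linalg.norm(proj @ (x - y)) <= np.linalg.norm(x - y) + 1e-12
```

The same argument applies in the infinite-dimensional RKHS setting, since orthogonal projections onto closed subspaces of a Hilbert space have operator norm at most one.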
The stability of the predictions can then be controlled through the norm of the last (linear) layer, which is typically used as a regularizer [15].

4 Global Invariance to Group Actions

In Section 3, we have seen how the kernel representation of Section 2 creates invariance to translations by commuting with the action of translations at intermediate layers, and how the last pooling layer on the translation group governs the final level of invariance. It is often useful to encode invariances to different groups of transformations, such as rotations or reflections (see, e.g., [9, 17, 22, 29]). Here, we show how this can be achieved by defining adapted patch extraction and pooling operators that commute with the action of a transformation group $G$ (this is known as group covariance or equivariance). We assume that $G$ is locally compact, so that we can define a left-invariant Haar measure $\mu$, that is, a measure on $G$ that satisfies $\mu(gS) = \mu(S)$ for any Borel set $S \subset G$ and any $g$ in $G$. We assume the initial signal $x(u)$ is defined on $G$, and we define subsequent feature maps on the same domain. The action of an element $g \in G$ is denoted by $L_g$, where $L_g x(u) = x(g^{-1} u)$. Then, we are interested in defining a layer, that is, a succession of patch extraction, kernel mapping, and pooling operators, that commutes with $L_g$, in order to achieve equivariance to the group $G$.

Patch extraction. We define patch extraction as follows:

$$P x(u) = (x(uv))_{v \in S} \quad \text{for all } u \in G,$$

where $S \subset G$ is a patch centered at the identity. $P$ commutes with $L_g$ since

$$P L_g x(u) = (L_g x(uv))_{v \in S} = (x(g^{-1} uv))_{v \in S} = P x(g^{-1} u) = L_g P x(u).$$

Kernel mapping. The pointwise operator $M$ is defined as in Section 2, and thus commutes with $L_g$.

Pooling.
The pooling operator on the group $G$ is defined in a similar fashion as [22] by

$$A x(u) = \int_G x(uv) \, h(v) \, d\mu(v) = \int_G x(v) \, h(u^{-1} v) \, d\mu(v),$$

where $h$ is a pooling filter typically localized around the identity element. It is easy to see from the first expression of $A x(u)$ that $A L_g x(u) = L_g A x(u)$, making the pooling operator $G$-equivariant.

In our analysis of stability in Section 3, we saw that inner pooling layers are useful to guarantee stability to local deformations, while global invariance is achieved mainly through the last pooling layer. In some cases, one only needs stability to a subgroup of $G$, while achieving global invariance to the whole group; e.g., in the roto-translation group [21], one might want invariance to a global rotation but stability to local translations. Then, one can perform pooling just on the subgroup to stabilize (e.g., translations) in intermediate layers, while pooling on the entire group at the last layer to achieve the global group invariance.

5 Link with Convolutional Neural Networks

In this section, we study the connection between the kernel representation defined in Section 2 and CNNs. Specifically, we show that the RKHS $\mathcal{H}_{\mathcal{K}_n}$ obtained from our kernel construction contains a set of CNNs on continuous domains with certain types of smooth homogeneous activations. An important consequence is that the stability results of the previous sections apply to this class of CNNs.

CNN maps construction. We now define a CNN function $f_\sigma$ that takes as input an image $x_0$ in $L^2(\Omega, \mathbb{R}^{p_0})$ with $p_0$ channels and builds a sequence of feature maps, represented at layer $k$ as a function $z_k$ in $L^2(\Omega, \mathbb{R}^{p_k})$ with $p_k$ channels; it performs linear convolutions with a set of filters $(w_k^i)_{i=1,\dots,p_k}$, followed by a pointwise activation function $\sigma$ to obtain intermediate feature maps $\tilde{z}_k$, then applies a linear pooling filter and repeats the same operations at each layer.
Note that here, each w_k^i is in L2(S_k, R^{p_{k-1}}), with channels denoted by w_k^{ij} ∈ L2(S_k, R). Formally, the intermediate map z̃_k in L2(Ω, R^{p_k}) is obtained for k ≥ 1 by

    z̃_k^i(u) = n_k(u) σ(⟨w_k^i, P_k z_{k-1}(u)⟩ / n_k(u)),    (8)

where z̃_k(u) = (z̃_k^1(u), . . . , z̃_k^{p_k}(u)) in R^{p_k}, and P_k is the patch extraction operator, which operates here on finite-dimensional maps. The activation involves a pointwise non-linearity σ along with a quantity n_k(u) that is independent of the filters and that will be made explicit in the sequel. Finally, the map z_k is obtained by using a pooling operator as in Section 2, with z_k = A_k z̃_k, and z_0 = x_0.

Homogeneous activations. The choice of non-linearity σ relies on Lemma B.2 of the appendix, which shows that for many choices of smooth functions σ, the RKHSs H_k defined in Section 2 contain the functions z ↦ ‖z‖ σ(⟨g, z⟩/‖z‖) for all g in P_k. While this homogenization involving the quantities ‖z‖ is not standard in classical CNNs, we note that (i) the most successful activation function, namely rectified linear units, is homogeneous, that is, relu(⟨g, z⟩) = ‖z‖ relu(⟨g, z⟩/‖z‖); and (ii) while relu is nonsmooth and thus not in our RKHSs, there exists a smoothed variant that satisfies the conditions of Lemma B.2 for useful kernels. As noticed in [35, 36], this is for instance the case for the inverse polynomial kernel described in Appendix B. In Figure 2, we plot and compare these different variants of relu. We may now define the quantities n_k(u) := ‖P_k x_{k-1}(u)‖ in (8), which are due to the homogenization, and which are independent of the filters w_k^i.

Classification layer. The final CNN prediction function f_σ is given by inner products with the feature maps of the last layer:

    f_σ(x_0) = ⟨w_{n+1}, z_n⟩,

with parameters w_{n+1} in L2(Ω, R^{p_n}).
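The homogeneity of relu invoked above is easy to check numerically; a minimal sketch, where the vectors stand in for a patch and a filter (their dimension is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(10)       # stands in for a patch P_k z_{k-1}(u)
g = rng.standard_normal(10)       # stands in for a filter w_k^i
relu = lambda t: np.maximum(t, 0.0)

n = np.linalg.norm(z)             # the homogenization factor ||z||
lhs = n * relu(np.dot(g, z) / n)  # homogenized form, as in Eq. (8)
rhs = relu(np.dot(g, z))          # standard relu activation
assert np.isclose(lhs, rhs)       # equal, since relu(t a) = t relu(a) for t > 0
```

For relu, the homogenization therefore changes nothing; it matters only for the smooth activations that actually belong to the RKHS.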
The next result shows that for appropriate σ, the function f_σ is in H_{K_n}. The construction of this function in the RKHS and the proof are given in Appendix D. We note that a similar construction for fully connected networks with constraints on weights and inputs was given in [35].

Proposition 7 (CNNs and RKHSs). Assume the activation σ satisfies C_σ(a) < ∞ for all a ≥ 0, where C_σ is defined for a given kernel in Lemma B.2. Then the CNN function f_σ defined above is in the RKHS H_{K_n}, with norm

    ‖f_σ‖² ≤ p_n Σ_{i=1}^{p_n} ‖w_{n+1}^i‖₂² B_{n,i},

where B_{n,i} is defined recursively by B_{1,i} = C_σ²(‖w_1^i‖₂²) and B_{k,i} = C_σ²(p_{k-1} Σ_{j=1}^{p_{k-1}} ‖w_k^{ij}‖₂² B_{k-1,j}).

Figure 2: Comparison of one-dimensional functions obtained with relu and smoothed relu (sReLU) activations. (Left) non-homogeneous setting of [35, 36]. (Right) our homogeneous setting, for different values of the parameter w. Note that for w ≥ 0.5, sReLU and ReLU are indistinguishable.

The results of this section imply that our study of the geometry of the kernel representations, and in particular the stability and invariance properties of Section 3, apply to the generic CNNs defined above, thanks to the Lipschitz smoothness relation (1). The smoothness is then controlled by the RKHS norm of these functions, which sheds light on the links between generalization and stability. In particular, functions with low RKHS norm (a.k.a. "large margin") are known to generalize better to unseen data (see, e.g., the notion of margin bounds for SVMs [27, 28]). This implies, for instance, that generalization is harder if the task requires classifying two slightly deformed images with different labels, since this requires a function with large RKHS norm according to our stability analysis.
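The recursion of Proposition 7 is straightforward to evaluate numerically. In the sketch below, the quadratic C_σ² is only an illustrative stand-in for the kernel-dependent function of Lemma B.2, and all filter norms are random placeholders:

```python
import numpy as np

C_sq = lambda a: (1.0 + a) ** 2                  # stand-in for C_sigma(a)^2

def rkhs_norm_bound(W_sq, w_last_sq):
    """Evaluate the bound of Proposition 7.
    W_sq[k] is a (p_{k+1} x p_k) array of squared channel norms ||w^{ij}||_2^2
    for layer k+1; w_last_sq[i] = ||w^i_{n+1}||_2^2 for the last linear layer."""
    B = C_sq(W_sq[0].sum(axis=1))                # B_{1,i} = C^2(||w^i_1||_2^2)
    for Wk in W_sq[1:]:
        p_prev = Wk.shape[1]
        B = C_sq(p_prev * Wk @ B)                # B_{k,i} = C^2(p_{k-1} sum_j ...)
    p_n = len(w_last_sq)
    return p_n * float(w_last_sq @ B)            # p_n sum_i ||w^i_{n+1}||^2 B_{n,i}

rng = np.random.default_rng(0)
W_sq = [rng.uniform(size=(4, 3)), rng.uniform(size=(5, 4))]   # two layers
w_last_sq = rng.uniform(size=5)
bound = rkhs_norm_bound(W_sq, w_last_sq)         # finite bound on ||f_sigma||^2
```

Since the bound is linear in the squared last-layer norms, penalizing the last layer directly controls it, consistent with the regularization discussed above.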
In\ncontrast, if a stable function (i.e., with small RKHS norm) is suf\ufb01cient to do well on a training set,\nlearning becomes \u201ceasier\u201d and few samples may be enough for good generalization.\n\nAcknowledgements\n\nThis work was supported by a grant from ANR (MACARON project under grant number ANR-\n14-CE23-0003-01), by the ERC grant number 714381 (SOLARIS project), and by the MSR-Inria\njoint center.\n\nReferences\n\n[1] F. Anselmi, L. Rosasco, and T. Poggio. On invariance and selectivity in representation learning.\n\nInformation and Inference, 5(2):134\u2013158, 2016.\n\n[2] F. Anselmi, L. Rosasco, C. Tan, and T. Poggio. Deep convolutional networks are hierarchical\n\nkernel machines. preprint arXiv:1508.01084, 2015.\n\n[3] A. Bietti and J. Mairal. Group invariance and stability to deformations of deep convolutional\n\nrepresentations. preprint arXiv:1706.03078, 2017.\n\n[4] L. Bo, K. Lai, X. Ren, and D. Fox. Object recognition with hierarchical kernel descriptors. In\nProceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\n2011.\n\n[5] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE Transactions on pattern\n\nanalysis and machine intelligence (PAMI), 35(8):1872\u20131886, 2013.\n\n[6] J. Bruna, A. Szlam, and Y. LeCun. Learning stable group invariant representations with\n\nconvolutional networks. preprint arXiv:1301.3537, 2013.\n\n[7] Y. Cho and L. K. Saul. Kernel methods for deep learning. In Advances in Neural Information\n\nProcessing Systems (NIPS), 2009.\n\n[8] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving\nrobustness to adversarial examples. In International Conference on Machine Learning (ICML),\n2017.\n\n[9] T. Cohen and M. Welling. Group equivariant convolutional networks. In International Confer-\n\nence on Machine Learning (ICML), 2016.\n\n[10] A. Daniely, R. Frostig, and Y. Singer. 
Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems (NIPS), 2016.

[11] J. Diestel and J. J. Uhl. Vector Measures. American Mathematical Society, 1977.

[12] S. Fine and K. Scheinberg. Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research (JMLR), 2:243–264, 2001.

[13] B. Haasdonk and H. Burkhardt. Invariant kernel functions for pattern analysis and machine learning. Machine Learning, 68(1):35–61, 2007.

[14] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

[15] J. Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2016.

[16] J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. In Advances in Neural Information Processing Systems (NIPS), 2014.

[17] S. Mallat. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331–1398, 2012.

[18] G. Montavon, M. L. Braun, and K.-R. Müller. Kernel analysis of deep networks. Journal of Machine Learning Research (JMLR), 12:2563–2581, 2011.

[19] Y. Mroueh, S. Voinea, and T. A. Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems (NIPS), 2015.

[20] K. Muandet, K. Fukumizu, B. Sriperumbudur, B. Schölkopf, et al. Kernel mean embedding of distributions: A review and beyond.
Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.

[21] E. Oyallon and S. Mallat. Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[22] A. Raj, A. Kumar, Y. Mroueh, T. Fletcher, and B. Schoelkopf. Local group invariant representations via orbit embeddings. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.

[23] S. Saitoh. Integral Transforms, Reproducing Kernels and their Applications, volume 369. CRC Press, 1997.

[24] I. J. Schoenberg. Positive definite functions on spheres. Duke Mathematical Journal, 9(1):96–108, 1942.

[25] B. Schölkopf. Support Vector Learning. PhD thesis, Technischen Universität Berlin, 1997.

[26] B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.

[27] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

[28] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[29] L. Sifre and S. Mallat. Rotation, scaling and deformation invariant scattering for texture discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.

[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2014.

[31] A. J. Smola and B. Schölkopf. Sparse greedy matrix approximation for machine learning. In Proceedings of the International Conference on Machine Learning (ICML), 2000.

[32] E. M. Stein. Harmonic Analysis: Real-variable Methods, Orthogonality, and Oscillatory Integrals.
Princeton University Press, 1993.

[33] A. Torralba and A. Oliva. Statistics of natural image categories. Network: Computation in Neural Systems, 14(3):391–412, 2003.

[34] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2001.

[35] Y. Zhang, J. D. Lee, and M. I. Jordan. ℓ1-regularized neural networks are improperly learnable in polynomial time. In International Conference on Machine Learning (ICML), 2016.

[36] Y. Zhang, P. Liang, and M. J. Wainwright. Convexified convolutional neural networks. In International Conference on Machine Learning (ICML), 2017.