{"title": "Spectral Representations for Convolutional Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2449, "page_last": 2457, "abstract": "Discrete Fourier transforms provide a significant speedup in the computation of convolutions in deep learning. In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs).We employ spectral representations to introduce a number of innovations to CNN design. First, we propose spectral pooling, which performs dimensionality reduction by truncating the representation in the frequency domain. This approach preserves considerably more information per parameter than other pooling strategies and enables flexibility in the choice of pooling output dimensionality. This representation also enables a new form of stochastic regularization by randomized modification of resolution. We show that these methods achieve competitive results on classification and approximation tasks, without using any dropout or max-pooling. Finally, we demonstrate the effectiveness of complex-coefficient spectral parameterization of convolutional filters. While this leaves the underlying model unchanged, it results in a representation that greatly facilitates optimization. We observe on a variety of popular CNN configurations that this leads to significantly faster convergence during training.", "full_text": "Spectral Representations for\n\nConvolutional Neural Networks\n\nDepartment of Mathematics\n\nMassachusetts Institute of Technology\n\nOren Rippel\n\nrippel@math.mit.edu\n\nJasper Snoek\n\nTwitter and Harvard SEAS\n\njsnoek@seas.harvard.edu\n\nRyan P. Adams\n\nTwitter and Harvard SEAS\nrpa@seas.harvard.edu\n\nAbstract\n\nDiscrete Fourier transforms provide a signi\ufb01cant speedup in the computation of\nconvolutions in deep learning. 
In this work, we demonstrate that, beyond its advantages for efficient computation, the spectral domain also provides a powerful representation in which to model and train convolutional neural networks (CNNs). We employ spectral representations to introduce a number of innovations to CNN design. First, we propose spectral pooling, which performs dimensionality reduction by truncating the representation in the frequency domain. This approach preserves considerably more information per parameter than other pooling strategies and enables flexibility in the choice of pooling output dimensionality. This representation also enables a new form of stochastic regularization by randomized modification of resolution. We show that these methods achieve competitive results on classification and approximation tasks, without using any dropout or max-pooling.

Finally, we demonstrate the effectiveness of complex-coefficient spectral parameterization of convolutional filters. While this leaves the underlying model unchanged, it results in a representation that greatly facilitates optimization. We observe on a variety of popular CNN configurations that this leads to significantly faster convergence during training.

1 Introduction

Convolutional neural networks (CNNs) (LeCun et al., 1989) have been used to achieve unparalleled results across a variety of benchmark machine learning problems, and have been applied successfully throughout science and industry for tasks such as large scale image and video classification (Krizhevsky et al., 2012; Karpathy et al., 2014). One of the primary challenges of CNNs, however, is the computational expense necessary to train them.
In particular, the efficient implementation of convolutional kernels has been a key ingredient of any successful use of CNNs at scale. Due to its efficiency and the potential for amortization of cost, the discrete Fourier transform has long been considered by the deep learning community to be a natural approach to fast convolution (Bengio & LeCun, 2007). More recently, Mathieu et al. (2013); Vasilache et al. (2014) have demonstrated that convolution can be computed significantly faster using discrete Fourier transforms than directly in the spatial domain, even for tiny filters. This computational gain arises from the convenient property of operator duality between convolution in the spatial domain and element-wise multiplication in the frequency domain.

In this work, we argue that the frequency domain offers more than a computational trick for convolution: it also provides a powerful representation for modeling and training CNNs. Frequency decomposition allows studying an input across its various length-scales of variation, and as such provides a natural framework for the analysis of data with spatial coherence. We introduce two applications of spectral representations. These contributions can be applied independently of each other.

Spectral parametrization  We propose the idea of learning the filters of CNNs directly in the frequency domain. Namely, we parametrize them as maps of complex numbers, whose discrete Fourier transforms correspond to the usual filter representations in the spatial domain.

Because this mapping corresponds to unitary transformations of the filters, this reparametrization does not alter the underlying model. However, we argue that the spectral representation provides an appropriate domain for parameter optimization, as the frequency basis captures typical filter structure well.
More specifically, we show that filters tend to be considerably sparser in their spectral representations, thereby reducing the redundancy that appears in spatial domain representations. This provides the optimizer with more meaningful axis-aligned directions that can be taken advantage of with standard element-wise preconditioning.

We demonstrate the effectiveness of this reparametrization on a number of CNN optimization tasks, converging 2-5 times faster than the standard spatial representation.

Spectral pooling  Pooling refers to dimensionality reduction used in CNNs to impose a capacity bottleneck and facilitate computation. We introduce a new approach to pooling we refer to as spectral pooling. It performs dimensionality reduction by projecting onto the frequency basis set and then truncating the representation.

This approach alleviates a number of issues present in existing pooling strategies. For example, while max pooling is featured in almost every CNN and has had great empirical success, one major criticism has been its poor preservation of information (Hinton, 2014b,a). This weakness is exhibited in two ways. First, along with other stride-based pooling approaches, it implies a very sharp dimensionality reduction by at least a factor of 4 every time it is applied on two-dimensional inputs. Moreover, while it encourages translational invariance, it does not utilize its capacity well to reduce approximation loss: the maximum value in each window only reflects very local information, and often does not represent well the contents of the window.

In contrast, we show that spectral pooling preserves considerably more information for the same number of parameters. It achieves this by exploiting the non-uniformity of typical inputs in their signal-to-noise ratio as a function of frequency.
For example, natural images are known to have an expected power spectrum that follows an inverse power law: power is heavily concentrated in the lower frequencies, while higher frequencies tend to encode noise (Torralba & Oliva, 2003). As such, the elimination of higher frequencies in spectral pooling not only does minimal damage to the information in the input, but can even be viewed as a type of denoising.

In addition, spectral pooling allows us to specify any arbitrary output map dimensionality. This permits reduction of the map dimensionality in a slow and controlled manner as a function of network depth. Also, since truncation of the frequency representation exactly corresponds to reduction in resolution, we can supplement spectral pooling with stochastic regularization in the form of randomized resolution.

Spectral pooling can be implemented at a negligible additional computational cost in convolutional neural networks that employ FFT for convolution kernels, as it only requires matrix truncation. We also note that these two ideas are both compatible with the recently-introduced method of batch normalization (Ioffe & Szegedy, 2015), permitting even better training efficiency.

2 The Discrete Fourier Transform

The discrete Fourier transform (DFT) is a powerful way to decompose a spatiotemporal signal. In this section, we provide an introduction to a number of components of the DFT drawn upon in this work. We confine ourselves to the two-dimensional DFT, although all properties and results presented can be easily extended to other input dimensions.

Given an input x ∈ C^{M×N} (we address the constraint of real inputs in Subsection 2.1), its 2D DFT F(x) ∈ C^{M×N} is given by

    F(x)_{hw} = (1/√(MN)) Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} x_{mn} e^{−2πi(mh/M + nw/N)},   ∀h ∈ {0, …, M−1}, ∀w ∈ {0, …, N−1}.

The DFT is linear and unitary, and so its inverse transform is given by F^{−1}(·) = F(·)^*, namely the conjugate of the transform itself.

Figure 1: Properties of discrete Fourier transforms. (a) All discrete Fourier basis functions of map size 8 × 8. Note the equivalence of some of these due to conjugate symmetry. (b) Examples of input images and their frequency representations, presented as log-amplitudes. The frequency maps have been shifted to center the DC component. Rays in the frequency domain correspond to spatial domain edges aligned perpendicular to these. (c) Conjugate symmetry patterns for inputs with odd (top) and even (bottom) dimensionalities. Orange: real-valuedness constraint. Blue: no constraint. Gray: value fixed by conjugate symmetry.

Intuitively, the DFT coefficients resulting from projections onto the different frequencies can be thought of as measures of correlation of the input with basis functions of various length-scales. See Figure 1(a) for a visualization of the DFT basis functions, and Figure 1(b) for examples of input-frequency map pairs.

The widespread deployment of the DFT can be partially attributed to the development of the Fast Fourier Transform (FFT), a mainstay of signal processing and a standard component of most math libraries.
The FFT is an efficient implementation of the DFT with time complexity O(MN log(MN)).

Convolution using DFT  One powerful property of frequency analysis is the operator duality between convolution in the spatial domain and element-wise multiplication in the spectral domain. Namely, given two inputs x, f ∈ R^{M×N}, we may write

    F(x ∗ f) = F(x) ⊙ F(f),   (1)

where by ∗ we denote a convolution and by ⊙ an element-wise product.

Approximation error  The unitarity of the Fourier basis makes it convenient for the analysis of approximation loss. More specifically, Parseval's Theorem links the ℓ2 loss between any input x and its approximation x̂ to the corresponding loss in the frequency domain:

    ‖x − x̂‖_2^2 = ‖F(x) − F(x̂)‖_2^2.   (2)

An equivalent statement also holds for the inverse DFT operator. This allows us to quickly assess how an input is affected by any distortion we might make to its frequency representation.

2.1 Conjugate symmetry constraints

In the following sections of the paper, we will propagate signals and their gradients through DFT and inverse DFT layers. In these layers, we will represent the frequency domain in the complex field. However, for all layers apart from these, we would like to ensure that both the signal and its gradient are constrained to the reals. A necessary and sufficient condition to achieve this is conjugate symmetry in the frequency domain. Namely, for any transform y = F(x) of some input x, it must hold that

    y_{mn} = y*_{(M−m) mod M, (N−n) mod N}   ∀m ∈ {0, …, M−1}, ∀n ∈ {0, …, N−1}.   (3)

Thus, intuitively, given the left half of our frequency map, the diminished number of degrees of freedom allows us to reconstruct the right.
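The symmetry in Equation (3) is easy to confirm numerically. A quick NumPy check, as a sketch (not from the paper; `norm="ortho"` matches the unitary convention used here, though the symmetry holds under any normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))      # real input, odd-by-even dimensions
y = np.fft.fft2(x, norm="ortho")     # unitary 2D DFT

M, N = y.shape
for m in range(M):
    for n in range(N):
        # y[m, n] must equal the conjugate of its mirrored partner.
        assert np.isclose(y[m, n], np.conj(y[(M - m) % M, (N - n) % N]))
```

Each coefficient is paired with its mirror image, which is why roughly half the map determines the rest.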
In effect, this allows us to store approximately half the parameters that would otherwise be necessary. Note, however, that this does not reduce the effective dimensionality, since each element consists of real and imaginary components. The conjugate symmetry constraints are visualized in Figure 1(c). Given a real input, its DFT will necessarily meet these. This symmetry can be observed in the frequency representations of the examples in Figure 1(b). However, since we seek to optimize over parameters embedded directly in the frequency domain, we need to pay close attention to ensure the conjugate symmetry constraints are enforced upon inversion back to the spatial domain (see Subsection 2.2).

2.2 Differentiation

Here we discuss how to propagate the gradient through a Fourier transform layer. This analysis can be similarly applied to the inverse DFT layer. Define x ∈ R^{M×N} and y = F(x) to be the input and output of a DFT layer respectively, and R a real-valued loss function applied to y which can be considered as the remainder of the forward pass. Since the DFT is a linear operator, its gradient is simply the transformation matrix itself. During back-propagation, then, this gradient is conjugated, and this, by DFT unitarity, corresponds to the application of the inverse transform:

    ∂R/∂x = F^{−1}(∂R/∂y).   (4)

There is an intricacy that makes matters a bit more complicated. Namely, the conjugate symmetry condition discussed in Subsection 2.1 introduces redundancy. Inspecting the conjugate symmetry constraints in Equation (3), we note their enforcement of the special case y_{00} ∈ R for N odd, and y_{00}, y_{0,N/2}, y_{N/2,0}, y_{N/2,N/2} ∈ R for N even. For all other indices they enforce conjugate equality of pairs of distinct elements.
These conditions imply that the number of unconstrained parameters is about half the map in its entirety.

3 Spectral Pooling

The choice of a pooling technique boils down to the selection of an appropriate set of basis functions to project onto, and some truncation of this representation to establish a lower-dimensionality approximation to the original input. The idea behind spectral pooling stems from the observation that the frequency domain provides an ideal basis for inputs with spatial structure. We first discuss the technical details of this approach, and then its advantages.

Spectral pooling is straightforward to understand and to implement. We assume we are given an input x ∈ R^{M×N}, and some desired output map dimensionality H × W. First, we compute the discrete Fourier transform of the input into the frequency domain as y = F(x) ∈ C^{M×N}, and assume that the DC component has been shifted to the center of the domain as is standard practice. We then crop the frequency representation by maintaining only the central H × W submatrix of frequencies, which we denote as ŷ ∈ C^{H×W}. Finally, we map this approximation back into the spatial domain by taking its inverse DFT as x̂ = F^{−1}(ŷ) ∈ R^{H×W}. These steps are listed in Algorithm 1. Note that some of the conjugate symmetry special cases described in Subsection 2.2 might be broken by this truncation. As such, to ensure that x̂ is real-valued, we must treat these individually with TREATCORNERCASES, which can be found in the supplementary material.

Figure 2 demonstrates the effect of this pooling for various choices of H × W. The back-propagation procedure is quite intuitive, and can be found in Algorithm 2 (REMOVEREDUNDANCY and RECOVERMAP can be found in the supplementary material). In Subsection 2.2, we addressed the nuances of differentiating through DFT and inverse DFT layers.
Apart from these, the last component left undiscussed is differentiation through the truncation of the frequency matrix, but this corresponds to a simple zero-padding of the gradient maps to the appropriate dimensions.

In practice, the DFTs are the computational bottlenecks of spectral pooling. However, we note that in convolutional neural networks that employ FFTs for convolution computation, spectral pooling can be implemented at a negligible additional computational cost, since the DFT is performed regardless.

We proceed to discuss a number of properties of spectral pooling, which we then test comprehensively in Section 5.

Algorithm 1: Spectral pooling
Input: map x ∈ R^{M×N}, output size H × W
Output: pooled map x̂ ∈ R^{H×W}
1: y ← F(x)
2: ŷ ← CROPSPECTRUM(y, H × W)
3: ŷ ← TREATCORNERCASES(ŷ)
4: x̂ ← F^{−1}(ŷ)

Algorithm 2: Spectral pooling back-propagation
Input: gradient w.r.t. output ∂R/∂x̂
Output: gradient w.r.t. input ∂R/∂x
1: ẑ ← F(∂R/∂x̂)
2: ẑ ← REMOVEREDUNDANCY(ẑ)
3: z ← PADSPECTRUM(ẑ, M × N)
4: z ← RECOVERMAP(z)
5: ∂R/∂x ← F^{−1}(z)

Figure 2: Approximations for different pooling schemes, for different factors of dimensionality reduction. Spectral pooling projects onto the Fourier basis and truncates it as desired. This retains significantly more information and permits the selection of any arbitrary output map dimensionality.

3.1 Information preservation

Spectral pooling can significantly increase the amount of retained information relative to max-pooling in two distinct ways. First, its representation maintains more information for the same number of degrees of freedom.
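For concreteness, the forward pass of Algorithm 1 can be sketched in a few lines of NumPy. This is a simplified illustration, not the authors' implementation: `np.fft.fftshift` centers the DC component before cropping, and taking the real part after inversion stands in for the exact TREATCORNERCASES handling:

```python
import numpy as np

def spectral_pool(x, H, W):
    """Pool an M x N real map down to H x W by truncating its frequency map."""
    M, N = x.shape
    y = np.fft.fftshift(np.fft.fft2(x, norm="ortho"))   # center the DC component
    top, left = (M - H) // 2, (N - W) // 2
    y_crop = y[top:top + H, left:left + W]              # keep the central H x W block
    x_pool = np.fft.ifft2(np.fft.ifftshift(y_crop), norm="ortho")
    # Discard the small imaginary residue from broken corner symmetries;
    # the paper instead treats these cases exactly via TREATCORNERCASES.
    return np.real(x_pool)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
assert spectral_pool(x, 16, 16).shape == (16, 16)
# With no truncation, the transform pair is exact up to float error:
assert np.allclose(spectral_pool(x, 32, 32), x)
```

The only tunable quantity is the output size, which is exactly the flexibility argued for above.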
Spectral pooling reduces the information capacity by tuning the resolution of the input precisely to match the desired output dimensionality. This operation can also be viewed as linear low-pass filtering, and it exploits the non-uniformity of the spectral density of the data with respect to frequency. That is, the power spectra of inputs with spatial structure, such as natural images, carry most of their mass on lower frequencies. As such, since the amplitudes of the higher frequencies tend to be small, Parseval's theorem from Section 2 informs us that their elimination will result in a representation that minimizes the ℓ2 distortion after reconstruction.

Second, spectral pooling does not suffer from the sharp reduction in output dimensionality exhibited by other pooling techniques. More specifically, for stride-based pooling strategies such as max pooling, the number of degrees of freedom of two-dimensional inputs is reduced by at least 75% as a function of stride. In contrast, spectral pooling allows us to specify any arbitrary output dimensionality, and thus allows us to reduce the map size gradually as a function of layer.

3.2 Regularization via resolution corruption

We note that the low-pass filtering radii, say R_H and R_W, can be chosen to be smaller than the output map dimensionalities H, W. Namely, while we truncate our input frequency map to size H × W, we can further zero-out all frequencies outside the central R_H × R_W square. While this maintains the output dimensionality H × W of the input domain after applying the inverse DFT, it effectively reduces the resolution of the output. This can be seen in Figure 2.

This allows us to introduce regularization in the form of random resolution reduction.
We apply this stochastically by assigning a distribution p_R(·) on the frequency truncation radius (for simplicity we apply the same truncation on both axes), sampling from this a random radius at each iteration, and wiping out all frequencies outside the square of that size. Note that this can be regarded as an application of nested dropout (Rippel et al., 2014) on both dimensions of the frequency decomposition of our input. In practice, we have had success choosing p_R(·) = U_{[H_min, H]}(·), i.e., a uniform distribution stretching from some minimum value all the way up to the highest possible resolution.

4 Spectral Parametrization of CNNs

Here we demonstrate how to learn the filters of CNNs directly in their frequency domain representations. This offers significant advantages over the traditional spatial representation, which we show empirically in Section 5.

Let us assume that for some layer of our convolutional neural network we seek to learn filters of size H × W. To do this, we parametrize each filter f ∈ C^{H×W} in our network directly in the frequency domain. To attain its spatial representation, we simply compute its inverse DFT as F^{−1}(f) ∈ R^{H×W}. From this point on, we proceed as we would for any standard CNN by computing the convolution of the filter with inputs in our mini-batch, and so on.

The back-propagation through the inverse DFT is virtually identical to the one of spectral pooling described in Section 3. We compute the gradient as outlined in Subsection 2.2, being careful to obey the conjugate symmetry constraints discussed in Subsection 2.1.

We emphasize that this approach does not change the underlying CNN model in any way; it changes only the way in which the model is parametrized. Hence, this only affects the way the solution space is explored by the optimization procedure.

Figure 3: Learning dynamics of CNNs with spectral parametrization. The histograms have been produced after 10 epochs of training on CIFAR-10 by each method, but are similar throughout. (a) Progression over several epochs of filters parametrized in the frequency domain. Each pair of columns corresponds to the spectral parametrization of a filter and its inverse transform to the spatial domain. Filter representations tend to be more local in the Fourier basis. (b) Sparsity patterns for the different parametrizations. Spectral representations tend to be considerably sparser. (c) Distributions of momenta across parameters for CNNs trained with and without spectral parametrization. In the spectral parametrization considerably fewer parameters are updated.

4.1 Leveraging filter structure

This idea exploits the observation that CNN filters have a very characteristic structure that reappears across data sets and problem domains. That is, CNN weights can typically be captured with a small number of degrees of freedom. Represented in the spatial domain, however, this results in significant redundancy.

The frequency domain, on the other hand, provides an appealing basis for filter representation: characteristic filters (e.g., Gabor filters) are often very localized in their spectral representations. This follows from the observation that filters tend to feature very specific length-scales and orientations. Hence, they tend to have nonzero support in a narrow set of frequency components. This hypothesis can be observed qualitatively in Figure 3(a) and quantitatively in Figure 3(b).

Empirically, in Section 5 we observe that spectral representations of filters lead to a convergence speedup of 2-5 times.
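The reparametrization itself is small enough to sketch directly. This is an illustration rather than the paper's code: storing the learnable parameters as `np.fft.rfft2` coefficients enforces the conjugate symmetry constraints of Subsection 2.1 by construction, whereas the paper stores full complex maps and enforces the constraints explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
H = W = 5                                 # illustrative filter size

# Initialize in the spatial domain (as in the experiments), then store
# the learnable parameters as complex frequency coefficients.
f_spatial = rng.standard_normal((H, W))
f_freq = np.fft.rfft2(f_spatial)          # optimization happens on these

# Forward pass: recover the spatial filter and convolve as usual.
f_back = np.fft.irfft2(f_freq, s=(H, W))

# The mapping is invertible, so the underlying model is unchanged.
assert np.allclose(f_back, f_spatial)
```

Only the coordinates the optimizer sees change; the set of representable filters does not.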
We remark that, had we trained our network with standard stochastic gradient descent, the linearity of differentiation and parameter update would have resulted in exactly the same filters regardless of whether they were represented in the spatial or frequency domain during training (this is true for any invertible linear transformation of the parameter space).

However, as discussed, this parametrization corresponds to a rotation to a more meaningful axis alignment, where the number of relevant elements has been significantly reduced. Since modern optimizers implement update rules that consist of adaptive element-wise rescaling, they are able to leverage this axis alignment by making large updates to a small number of elements. This can be seen quantitatively in Figure 3(c), where the optimizer, Adam (Kingma & Ba, 2015) in this case, only touches a small number of elements in its updates.

There exist a number of extensions of the above approach we believe would be quite promising in future work; we elaborate on these in the discussion.

5 Experiments

We demonstrate the effectiveness of spectral representations in a number of different experiments. We ran all experiments on code optimized for the Xeon Phi coprocessor. We used Spearmint (Snoek et al., 2015) for Bayesian optimization of hyperparameters with 5-20 concurrent evaluations.

(b) Classification rates:

Method               CIFAR-10   CIFAR-100
Stochastic pooling   15.13%     41.51%
Maxout               11.68%     38.57%
Network-in-network   10.41%     35.68%
Deeply supervised     9.78%     34.57%
Spectral pooling      8.6%      31.6%

Figure 4: (a) Average information dissipation for the ImageNet validation set as a function of fraction of parameters kept. This is measured in ℓ2 error normalized by the input norm.
The red horizontal line indicates the best error rate achievable by max pooling. (b) Test errors on CIFAR-10/100 without data augmentation of the optimal spectral pooling architecture, as compared to current state-of-the-art approaches: stochastic pooling (Zeiler & Fergus, 2013), Maxout (Goodfellow et al., 2013), network-in-network (Lin et al., 2013), and deeply-supervised nets (Lee et al., 2014).

5.1 Spectral pooling

Information preservation  We test the information retainment properties of spectral pooling on the validation set of ImageNet (Russakovsky et al., 2015). For the different pooling strategies we plot the average approximation loss resulting from pooling to different dimensionalities. This can be seen in Figure 4. We observe the two aspects discussed in Subsection 3.1: first, spectral pooling permits significantly better reconstruction for the same number of parameters. Second, for max pooling, the only knob controlling the coarseness of approximation is the stride, which results in severe quantization and a constraining lower bound on preserved information (marked in the figure as a horizontal red line). In contrast, spectral pooling permits the selection of any output dimensionality, thereby producing a smooth curve over all frequency truncation choices.

Classification with convolutional neural networks  We test spectral pooling on different classification tasks. We hyperparametrize and optimize the following CNN architecture:

    (C^{96+32m}_{3×3} → SP_{↓⌊γH_m⌋×⌊γH_m⌋})_{m=1}^{M} → C^{96+32M}_{1×1} → C^{10/100}_{1×1} → GA → Softmax   (5)

Here, by C^F_S we denote a convolutional layer with F filters each of size S, by SP_{↓S} a spectral pooling layer with output dimensionality S, and GA the global averaging layer described in Lin et al. (2013). We upper-bound the number of filters per layer as 288.
Every convolution and pooling layer is followed by a ReLU nonlinearity. We let H_m be the height of the map of layer m. Hence, each spectral pooling layer reduces each output map dimension by factor γ ∈ (0, 1). We assign frequency dropout distribution p_R(·; m, α, β) = U_{[⌊c_m H_m⌋, H_m]}(·) for layer m and total layers M, with c_m(α, β) = α + (m/M)(β − α) for some constants α, β ∈ R. This parametrization can be thought of as some linear parametrization of the dropout rate as a function of the layer.

We perform hyperparameter optimization on the dimensionality decay rate γ ∈ [0.25, 0.85], number of layers M ∈ {1, …, 15}, resolution randomization hyperparameters α, β ∈ [0, 0.8], weight decay rate in [10^{−5}, 10^{−2}], momentum in [1 − 0.1^{0.5}, 1 − 0.1^{2}] and initial learning rate in [0.1^4, 0.1]. We train each model for 150 epochs and anneal the learning rate by a factor of 10 at epochs 100 and 140. We intentionally use no dropout nor data augmentation, as these introduce a number of additional hyperparameters which we want to disambiguate as alternative factors for success.

Perhaps unsurprisingly, the optimal hyperparameter configuration assigns the slowest possible layer map decay rate γ = 0.85. It selects randomized resolution reduction constants of about α ≈ 0.30, β ≈ 0.15, momentum of about 0.95 and initial learning rate 0.0088. These settings allow us to attain classification rates of 8.6% on CIFAR-10 and 31.6% on CIFAR-100.
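As a worked reading of this schedule, the following short script traces the per-layer map sizes and randomized radii. It is a sketch: γ, α, β are set to the optimal values reported above, while the depth M = 8 and the 32-pixel input size are illustrative assumptions, not values from the paper:

```python
gamma, alpha, beta = 0.85, 0.30, 0.15   # optimal values reported above
M, H = 8, 32                            # assumed depth and input size (illustrative)

sizes = []
for m in range(1, M + 1):
    H = int(gamma * H)                  # spectral pooling output dimension (floor)
    c = alpha + (m / M) * (beta - alpha)  # c_m: linear in the layer index
    low = int(c * H)                    # randomized radius is drawn from U[low, H]
    sizes.append((H, low))
    print(f"layer {m}: map {H}x{H}, randomized radius in [{low}, {H}]")
```

Under these assumptions the map shrinks gently (32 → 27 → 22 → ... → 6) rather than quartering at each pooling step, which is the point of the slow decay rate.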
These are competitive results among approaches that do not employ data augmentation: a comparison to state-of-the-art approaches from the literature can be found in Table 4(b).

5.2 Spectral parametrization of CNNs

We demonstrate the effectiveness of spectral parametrization on a number of CNN optimization tasks, for different architectures and for different filter sizes. We use the notation MP^T_S to denote a max pooling layer with size S and stride T, and FC^F is a fully-connected layer with F filters.

(b) Speedup factors:

Architecture      Filter size   Speedup factor
Deep (7)          3 × 3         2.2
Deep (7)          5 × 5         4.8
Generic (6)       3 × 3         2.2
Generic (6)       5 × 5         5.1
Sp. Pooling (5)   3 × 3         2.4
Sp. Pooling (5)   5 × 5         4.8

Figure 5: Optimization of CNNs via spectral parametrization. All experiments include data augmentation. (a) Training curves for the various experiments. The remainder of the optimization past the matching point is marked in light blue. The red diamonds indicate the relative epochs in which the asymptotic error rate of the spatial approach is achieved. (b) Speedup factors for different architectures and filter sizes. A non-negligible speedup is observed even for tiny 3 × 3 filters.

The first architecture is the generic one used in a variety of deep learning papers, such as Krizhevsky et al. (2012); Snoek et al. (2012); Krizhevsky (2009); Kingma & Ba (2015):

    C^{96}_{3×3} → MP^{2}_{3×3} → C^{192}_{3×3} → MP^{2}_{3×3} → FC^{1024} → FC^{512} → Softmax   (6)

The second architecture we consider is the one employed in Snoek et al. (2015), which was shown to attain competitive classification rates.
It is deeper and more complex:

    C^{96}_{3×3} → C^{96}_{3×3} → MP^{2}_{3×3} → C^{192}_{3×3} → C^{192}_{3×3} → C^{192}_{3×3} → MP^{2}_{3×3} → C^{192}_{1×1} → C^{10/100}_{1×1} → GA → Softmax   (7)

The third architecture considered is the spectral pooling network from Equation 5. To increase the difficulty of optimization and reflect real training conditions, we supplemented all networks with data augmentation in the form of translations, horizontal reflections, HSV perturbations and dropout. We initialized both spatial and spectral filters in the spatial domain as the same values; for the spectral parametrization experiments we then computed the Fourier transform of these to attain their frequency representations. We optimized all networks using the Adam (Kingma & Ba, 2015) update rule, a variant of RMSprop that we find to be a fast and robust optimizer.

The training curves can be found in Figure 5(a) and the respective factors of convergence speedup in Table 5. Surprisingly, we observe non-negligible speedup even for tiny filters of size 3 × 3, where we did not expect the frequency representation to have much room to exploit spatial structure.

6 Discussion and remaining open problems

In this work, we demonstrated that spectral representations provide a rich spectrum of applications. We introduced spectral pooling, which allows pooling to any desired output dimensionality while retaining significantly more information than other pooling approaches.
In addition, we showed that the Fourier functions provide a suitable basis for filter parametrization, as demonstrated by the faster convergence of the optimization procedure.

One possible future line of work is to embed the network in its entirety in the frequency domain. In models that employ Fourier transforms to compute convolutions, at every convolutional layer the input is FFT-ed and the elementwise multiplication output is then inverse-FFT-ed. These back-and-forth transformations are computationally intensive, and it would therefore be desirable to remain in the frequency domain throughout. However, the reason for these repeated transformations is the application of nonlinearities in the spatial domain: a sensible nonlinearity formulated directly in the frequency domain would spare us this incessant domain switching.

Acknowledgements

We would like to thank Prabhat, Michael Gelbart and Matthew Johnson for useful discussions and assistance throughout this project. Jasper Snoek was a fellow in the Harvard Center for Research on Computation and Society. This work is supported by the Applied Mathematics Program within the Office of Science Advanced Scientific Computing Research of the U.S. Department of Energy under contract No. DE-AC02-05CH11231. This work used resources of the National Energy Research Scientific Computing Center (NERSC). We thank Helen He and Doug Jacobsen for providing us with access to the Babbage Xeon-Phi testbed at NERSC.

References

Bengio, Yoshua and LeCun, Yann. Scaling learning algorithms towards AI. In Bottou, Léon, Chapelle, Olivier, DeCoste, D., and Weston, J. (eds.), Large Scale Kernel Machines.
MIT Press, 2007.

Goodfellow, Ian J., Warde-Farley, David, Mirza, Mehdi, Courville, Aaron C., and Bengio, Yoshua. Maxout networks. CoRR, abs/1302.4389, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1302.html#abs-1302-4389.

Hinton, Geoffrey. What's wrong with convolutional nets? MIT Brain and Cognitive Sciences - Fall Colloquium Series, Dec 2014a. URL http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets.

Hinton, Geoffrey. Ask me anything: Geoffrey Hinton. Reddit Machine Learning, 2014b. URL https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.

Karpathy, Andrej, Toderici, George, Shetty, Sanketh, Leung, Thomas, Sukthankar, Rahul, and Fei-Fei, Li. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition, 2014.

Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2015. URL http://arxiv.org/abs/1412.6980.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. Technical report, 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.

LeCun, Yann, Boser, Bernhard, Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems, 1989.

Lee, Chen-Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. CoRR, abs/1409.5185, 2014. URL http://arxiv.org/abs/1409.5185.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network.
CoRR, abs/1312.4400, 2013. URL http://dblp.uni-trier.de/db/journals/corr/corr1312.html#LinCY13.

Mathieu, Michaël, Henaff, Mikael, and LeCun, Yann. Fast training of convolutional networks through FFTs. CoRR, abs/1312.5851, 2013. URL http://arxiv.org/abs/1312.5851.

Rippel, Oren, Gelbart, Michael A., and Adams, Ryan P. Learning ordered representations with nested dropout. In International Conference on Machine Learning, 2014.

Russakovsky, Olga, Deng, Jia, Su, Hao, Krause, Jonathan, Satheesh, Sanjeev, Ma, Sean, Huang, Zhiheng, Karpathy, Andrej, Khosla, Aditya, Bernstein, Michael, Berg, Alexander C., and Li, Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 2015. doi: 10.1007/s11263-015-0816-y.

Snoek, Jasper, Larochelle, Hugo, and Adams, Ryan Prescott. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

Snoek, Jasper, Rippel, Oren, Swersky, Kevin, Kiros, Ryan, Satish, Nadathur, Sundaram, Narayanan, Patwary, Md. Mostofa Ali, Prabhat, and Adams, Ryan P. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, 2015.

Torralba, Antonio and Oliva, Aude. Statistics of natural image categories. Network, 14(3):391-412, August 2003. ISSN 0954-898X.

Vasilache, Nicolas, Johnson, Jeff, Mathieu, Michaël, Chintala, Soumith, Piantino, Serkan, and LeCun, Yann. Fast convolutional nets with fbfft: A GPU performance evaluation. CoRR, abs/1412.7580, 2014. URL http://arxiv.org/abs/1412.7580.

Zeiler, Matthew D. and Fergus, Rob. Stochastic pooling for regularization of deep convolutional neural networks. CoRR, abs/1301.3557, 2013.
URL http://dblp.uni-trier.de/db/journals/corr/corr1301.html#abs-1301-3557.