{"title": "SplineNets: Continuous Neural Decision Graphs", "book": "Advances in Neural Information Processing Systems", "page_first": 1994, "page_last": 2004, "abstract": "We present SplineNets, a practical and novel approach for using conditioning in convolutional neural networks (CNNs). SplineNets are continuous generalizations of neural decision graphs, and they can dramatically reduce runtime complexity and computation costs of CNNs, while maintaining or even increasing accuracy. Functions of SplineNets are both dynamic (i.e., conditioned on the input) and hierarchical (i.e.,conditioned on the computational path). SplineNets employ a unified loss function with a desired level of smoothness over both the network and decision parameters, while allowing for sparse activation of a subset of nodes for individual samples. In particular, we embed infinitely many function weights (e.g. filters) on smooth, low dimensional manifolds parameterized by compact B-splines, which are indexed by a position parameter. Instead of sampling from a categorical distribution to pick a branch, samples choose a continuous position to pick a function weight. We further show that by maximizing the mutual information between spline positions and class labels, the network can be optimally utilized and specialized for classification tasks. Experiments show that our approach can significantly increase the accuracy of ResNets with negligible cost in speed, matching the precision of a 110 level ResNet with a 32 level SplineNet.", "full_text": "SplineNets:\n\nContinuous Neural Decision Graphs\n\nCem Keskin\n\nShahram Izadi\n\ncemkeskin@google.com\n\nshahrami@google.com\n\nAbstract\n\nWe present SplineNets, a practical and novel approach for using conditioning in\nconvolutional neural networks (CNNs). 
SplineNets are continuous generalizations\nof neural decision graphs, and they can dramatically reduce runtime complexity\nand computation costs of CNNs, while maintaining or even increasing accuracy.\nFunctions of SplineNets are both dynamic (i.e., conditioned on the input) and\nhierarchical (i.e., conditioned on the computational path). SplineNets employ a\nuni\ufb01ed loss function with a desired level of smoothness over both the network\nand decision parameters, while allowing for sparse activation of a subset of nodes\nfor individual samples. In particular, we embed in\ufb01nitely many function weights\n(e.g. \ufb01lters) on smooth, low dimensional manifolds parameterized by compact\nB-splines, which are indexed by a position parameter. Instead of sampling from\na categorical distribution to pick a branch, samples choose a continuous position\nto pick a function weight. We further show that by maximizing the mutual infor-\nmation between spline positions and class labels, the network can be optimally\nutilized and specialized for classi\ufb01cation tasks. Experiments show that our ap-\nproach can signi\ufb01cantly increase the accuracy of ResNets with negligible cost in\nspeed, matching the precision of a 110 level ResNet with a 32 level SplineNet.\n\n1\n\nIntroduction and Related Work\n\nThere is a growing body of literature applying conditioning to CNNs, where only parts of the net-\nwork are selectively active to save runtime complexity and compute. Approaches use conditioning\nto scale model complexity without an explosion of computational cost e.g. [1] or increase runtime\nef\ufb01ciency by reducing model size and compute without degrading accuracy e.g. [2]. 
In this paper we\npresent a new and practical approach for supporting conditional neural networks called SplineNets.\nWe demonstrate how these novel networks dramatically reduce runtime complexity and computation\ncost beyond regular CNNs while maintaining or even increasing accuracy.\nConditional neural networks can be categorized into three main categories: (1) proxy objective\nmethods that train decision parameters via a surrogate loss function, (2) probabilistic (mixture of\nexperts) methods that assign scores to each branch and treat the loss as a weighted sum over all\nbranches, and (3) feature augmentation methods that augment the representations with the scores.\nThe \ufb01rst group relies on non-differentiable decision functions and trains decision parameters using\na proxy loss. Xiong et al. use a loss that maximizes distances between subclusters in their condi-\ntional convolutional network [3]. Baek et al. use purity of data activation according to the class\nlabel [4], and Bic\u00b8ici et al. use information gain as the proxy objective [5]. While the former ap-\nproach only allows soft decisions, the latter method uses argmax of a scoring function to sparsely\nactivate branches, which results in discontinuities in the loss function. Bulo et al. use multi-layered\nperceptrons as decisions, and the network otherwise acts like a decision tree [6]. Denoyer et al. train\na tree structured network using the REINFORCE algorithm [7].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00b4eal, Canada.\n\n\fIoannou et al. take a probabilistic approach, assigning weights to each branch and treating the loss\nas a weighted sum over losses of each branch [2]. They sparsify the decisions at test time, leading\nto gains in speed at the cost of accuracy. Shazeer et al. follow a similar probabilistic approach\nthat assigns weights to a very large number of branches in a single layer [1]. 
They use the top-k\nbranches both at training and test time, leading to a discontinuous loss function. Another approach\ntreats decision probabilities of a tree as the output of a separate neural network and then trains both\nmodels jointly, while slowly binarizing decisions [8]. However, inference still requires a regular\nnetwork to be densely evaluated to generate the decision probabilities.\nFinally, Wang et al.\n[9] take the approach of creating features from scores, enabling decisions to\nbe trained via regular back-propagation. However, they have only decisions and do not build higher\nlevel representations.\nOur method is novel and does not \ufb01t any of these categories. It is inspired from the recently proposed\nnon-linear dimensionality reduction technique called Additive Component Analysis (ACA) [10].\nThis technique \ufb01ts a smooth, low dimensional manifold to data and learns a mapping from it to the\ninput space. We extend this work by learning both the projections onto these manifolds, as well\nas the mapping to the input space in a neural network setting. We then apply this technique to\nthe function weights in our network. Hence, SplineNets employ subnetworks that explicitly form\nthe main network parameters from some latent parameters, as in the case of Hypernetworks [11].\nUnlike that work, we also project to these latent manifolds, conditioned on the feature maps. This\nmakes SplineNet operations dynamic, which is similar to the approaches in Dynamic Filter Net-\nworks [12], Phase-functioned Neural Networks [13] and Spatial Transformer Networks [14]. We\nfurther condition these projections on previous projections in the network to make the model hi-\nerarchical. As these projections replace the common decision making process of selecting a child\nfrom a discrete set, this becomes a new paradigm for decision networks, which operates in contin-\nuous decision spaces. 
There is some similarity with Reinforcement Learning (RL) on Continuous\nAction Spaces [15], but the aim and problem setting for RL is much different from hierarchical de-\ncision graphs. For instance, rewards come from the environment in an RL setting, whereas decisions\nspecialize the network and de\ufb01ne the loss in our case.\nThe heart of SplineNets is the embedding of function weights on low dimensional manifolds de-\nscribed by B-splines. This effectively makes the branching factor of the underlying decision graph\nuncountable, and indexes the in\ufb01nitely many choices with a continuous position parameter. Instead\nof making discrete choices, the network then chooses a position, which makes the decision process\ndifferentiable and inherently sparse. We call this the Embedding Trick, somewhat analogous to the\nReparameterization Trick that makes the loss function differentiable with respect to the distribution\nparameters [16].\nLoad balancing is a common problem in hierarchical networks, which translates to under-utilization\nof splines in our continuous paradigm. The common solution of maximizing information gain based\non label distributions on discrete nodes, as used in [5], translates to specializing spline sections to\nclass labels or clusters in our case. We show that maximizing the mutual information between spline\npositions and labels solves both problems, and in the absence of labels, one can instead maximize\nthe mutual information between spline positions and the input, in the style of InfoGANs [17].\nAnother common issue with conditional neural networks is that the mini-batch size gets smaller\nwith each decision. This leads to serious dif\ufb01culties with training due to noisier gradients towards\nthe leaves and learning rate needing adaptation to each node, while also limiting the maximum\npossible depth of the network. Baek et al. use a decision jungle based architecture to tackle this\nproblem [4]. 
Decision jungles allow nodes in a layer to co-parent children, essentially turning the\ntree into a directed acyclic graph [18]. SplineNets also do not suffer from this issue, since their\narchitecture is essentially a decision jungle with an (uncountably) in\ufb01nite branching factor. We\nfurther add a constraint on the range of valid children to simulate interesting architectures.\nOur contributions in this paper are: (i) the embedding trick to enable smooth conditional branching,\n(ii) a novel neural decision graph that is uncountable and operates in continuous decision space, (iii)\nan architectural constraint to enforce a semantic relationship between latent parameters of functions\nin consecutive layers, (iv) a regularizer that maximizes mutual information to utilize and specialize\nsplines, and (v) a new differentiable quantization method for estimating soft entropies.\n\n2\n\n\f2 Methodology\n\nCNNs take an input x1=x and apply a series of parametric transformations,\nsuch that\nxi+1=T i(xi; \u03c9i), where superscript i is the level index. On the other hand, decision graphs typ-\nically navigate the input x with decision functions parameterized by \u03b8i, selecting an eligible node\nand the next parameters \u03b8i+1 contained within. Skipping non-parametric functions for CNNs and\nassuming sparse activations for decision graphs, we can describe their behavior with the graphical\nmodels shown in Figure 1. Assuming a symmetrical architecture, such that T i\nj for each node j in\nlayer i has the same form but different weights, a neural decision graph (NDG) can then be de-\nscribed by the graphical model shown on the right. The decision process governed by parameters \u03b8i\nis used here to determine the next \u03c9i (red arrows) as well as \u03b8i+1 (green arrows). 
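The discrete decision process just described, in which θ^i determines both the next weights ω^i and the next decision parameters θ^{i+1}, can be sketched as follows. This is a minimal NumPy illustration only: the layer sizes, random weights, ReLU transform and argmax decision rule are toy placeholders, not the paper's exact operators.

```python
import numpy as np

rng = np.random.default_rng(0)
levels, nodes, dim = 3, 4, 8          # toy sizes for the sketch

# per-level, per-node transformation weights w[i][j] and decision weights theta[i][j]
w     = rng.normal(size=(levels, nodes, dim, dim))
theta = rng.normal(size=(levels, nodes, dim, nodes))

def ndg_forward(x):
    j = 0                              # start at the root node
    for i in range(levels):
        # decision conditioned on features x and current node j picks the next node,
        # which selects both the next transform weights and decision parameters
        j_next = int(np.argmax(x @ theta[i, j]))
        # only the selected node's transformation is evaluated (sparse activation)
        x = np.maximum(0.0, x @ w[i, j_next])
        j = j_next
    return x, j

x_out, leaf = ndg_forward(rng.normal(size=dim))
```

The argmax here is exactly the non-differentiable step that the embedding trick of Section 2.1 replaces with a continuous position.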
Discrete NDGs then have a finite set of weights {ω^i_j}_{j=1}^{N^i} ∈ R^{g^i} (and the corresponding {θ^i_j}_{j=1}^{N^i} ∈ R^{h^i}) to choose from at each level with N^i nodes, depending on the features x^i and choices from the previous level. In this work, we investigate the case where the set of choices is uncountably infinite and indexed by a real-valued position parameter, forming a continuous NDG.

Figure 1: Graphical models: (left) Decision graph, (middle) CNN, (right) Neural decision graph.

2.1 Embedding trick

The embedding trick can be viewed as making the branching factor of the underlying NDG uncountably infinite, while imposing a topology on neighboring nodes. Each ω^i_j in the set of discrete choices is essentially a point in R^{g^i}. In the limit of N^i → ∞, these infinitely many points lie on a low dimensional manifold, as shown in Figure 2 (assuming g^i = 3). In this work, we assume this manifold is a 1D curve in R^{g^i}, parameterized by a position parameter φ^i ∈ [0, 1], which is used to index the individual ω^i's. A simple example would be the set of all 5 × 5 edge filters, which form a closed 1D curve in a 25D space. Instead of making a discrete choice from a finite set, the network can then choose a φ^i using a smooth, differentiable function. The same process can be applied to the decision parameters θ^i_j to form another 1D curve in R^{h^i}, indexed by the same position parameter φ^i, since selecting a node j should implicitly pick both ω^i_j and θ^i_j in an NDG.

We parameterize these latent curves with B-splines, since they have bounded support over their control points, or knots. This is a desirable property that allows efficient inference by allowing only a small set of knots to be activated per sample, and it also makes training more well-behaved, since updates to control points can change the shape of the manifold only locally.
Notably, this also efficiently solves the exploration problem, since the derivative of the loss with respect to the spline position exists, and the network knows which direction on the spline should further reduce the energy. The knots are trained offline, while the position parameters are dynamically determined during runtime.

While we only focus on 1D manifolds in this paper, our formulation allows easy extension to higher dimensional hypersurfaces in the same style as ACA, simply by describing surfaces as a sum of splines. This restricts the family of representable surfaces, but leads to a linear increase in the number of control points that need to be learned, as opposed to the general case where one needs exponentially many control points.

B-splines are piecewise polynomial parametric curves with bounded support and a desired level of smoothness up to C^{d-1}, where d is the degree of the polynomials. The curve approximately interpolates a number of knots, such that S(φ) = Σ_{k=1}^{K} C_k B_k(φ). Here, C_k, k = 1..K are the knots and B_k(φ) are the basis functions, which are piecewise polynomials of the form Σ_{t=0}^{d} a_t φ^t. Coefficients a_t are fixed and can be determined from continuity and smoothness constraints. For any position φ on the spline, only d+1 basis functions are non-zero. For simplicity, we restrict φ to the range [0, 1] regardless of K or d, and use cardinal B-splines to make knots equidistant in the position space spanned by φ. Note that they can still be anywhere in the actual space they live in. The degree of the splines also controls how far away the samples can see on the spline, helping with the exploration problem common with conditional architectures.

Figure 2: The Embedding Trick: (left) Discrete NDGs have a finite set of points to choose from. (middle) Increasing the number of nodes reveals that the weights lie on a lower dimensional manifold. (right) The embedding trick represents this manifold with a spline, and replaces the discrete decision making process with a continuous projection.

We apply the embedding trick to both transformation weights ω and decision parameters θ separately, so that a transformer spline S^i_ω generates the ω^i and a decision spline S^i_θ generates the θ^i. The spline S^i_ω and its knots C^i_{ω,k} are in R^{g^i}, and S^i_θ and its knots C^i_{θ,k} are in R^{h^i}. Figure 3 shows the corresponding graphical models. The model on the left shows a dynamic, non-hierarchical SplineNet, and the model on the right is a hierarchical SplineNet. Here, the orange arrows correspond to the projections, blue arrows indicate transformations, red arrows are generation of transformation parameters and green arrows are generation of decision weights. The dashed orange arrow indicates a topological constraint, which is used to restrict access to sections of a spline depending on the previous position of the sample, for instance to simulate tree-like behaviour. This can enforce a semantic relationship between sections of consecutive splines, roughly corresponding to mutually exclusive subtrees for instance. Since φ^i is analogous to the node index in the discrete case, it can be used to determine the valid children nodes, which corresponds to a sub-range of the positions available on the spline of the next layer.

Figure 3: Graphical models with embedding trick: (left) Dynamic, non-hierarchical SplineNet (right) Hierarchical SplineNet. 
The additional dependency between positions φ^i and φ^{i+1} allows us to define topological constraints to simulate tree-like architectures.

Without any topological constraints, the projection becomes simply φ^i = D(x^i; θ^i), where D is a mapping R^{M^i} → [0, 1], with M^i = dim(x^i) (orange arrows). In this work, we assume a linear projection (dense or convolutional) followed by a sigmoid function, but any differentiable mapping can be used. With topological constraints, we define the new position φ^{i+1} as a linear combination of the old position and the new projection:

φ^{i+1} = (1 - δ^i)φ^i + δ^i D(x^i; θ^i).   (1)

Here, δ^i is a layer dependent diffusion parameter that controls the conditional range of projections. Setting δ^i = 1 roughly simulates a decision jungle [18] with an infinite branching factor, which has no restrictions on how far samples can jump between layers. While regular decision jungles maintain only a small number of valid children for each node, SplineNet jungles allow transition to any one of the infinitely many nodes in the next layer. By setting δ^i = b^{1-i} we can also simulate a b-ary decision tree, but it has the mini-batch size issues discussed earlier, and typically performs worse. A fixed δ^i = α spans a novel continuous family of architectures for 0 ≤ α ≤ 1.

Note that while the topological constraint adds a flavor of hierarchical dependency to the network, a proper decision graph should pick different decision parameters for each node. This constraint determines the graph edges between consecutive layers, but it does not have an effect on the decision parameters.

2.2 SplineNet operators

Projections Projection D is a mapping from feature map x^i to spline position φ^i. 
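The sigmoid projection and the position update of Equation 1 can be sketched in a few lines of NumPy. The dense decision weights, the sigmoid slope and the δ^i = b^{1-i} schedule below are illustrative placeholder values.

```python
import numpy as np

def project(x, theta, slope=0.4):
    # dense decision: dot product followed by a sigmoid with a tunable slope
    return 1.0 / (1.0 + np.exp(-slope * (theta @ x)))

def next_position(phi, x, theta, delta):
    # Equation 1: diffuse the inherited position toward the new projection
    return (1.0 - delta) * phi + delta * project(x, theta)

rng = np.random.default_rng(0)
x, theta = rng.normal(size=16), rng.normal(size=16)

phi = 0.5                                 # root position
for i, b in [(1, 2), (2, 2), (3, 2)]:     # delta = b**(1 - i) simulates a b-ary tree
    delta = b ** (1 - i)
    phi_new = next_position(phi, x, theta, delta)
    # with shrinking delta, a sample can only move within a shrinking sub-range
    # of the next spline, mimicking mutually exclusive subtrees
    assert abs(phi_new - phi) <= delta
    phi = phi_new
```

Since the update is a convex combination of two values in [0, 1], the position always stays in [0, 1]; δ = 1 recovers the unconstrained jungle-like behaviour.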
While any differentiable mapping is valid, we consider two simple cases: (i) a dot product with the flattened feature vector, and (ii) a 1×1 convolution followed by global averaging, which has fewer parameters to learn. In the first case, θ^i is a vector in R^{M^i}, and the time and memory complexities are both O(M^i). While this is negligible for networks with large dense layers, it becomes significant for all-convolutional networks. For the second case, θ^i is a filter in R^{1,1,c,1} where c is the number of feature map channels. In both cases, a sigmoid with a tunable slope parameter is used to map the response to the right range.

Decision parameter generation The splines S^i_θ of the hierarchical SplineNets all have knots C^i_{θ,k} that live in the same space as θ^i, which is R^{M^i}. The cost of generating θ^i is then a weighted sum over d+1 such vectors, where d is the degree of the spline. The time complexity is O(dM^i), and the memory complexity is O(K^i_θ M^i), where K^i_θ is the number of knots.

Fully connected layers Fully connected layers have weights ω in the form of a matrix, such that ω^i ∈ R^{M^i × M^{i+1}}. Thus, the corresponding spline must have knots that also lie in the same space. The additional cost compared to a regular fully connected layer, besides the projection, is the generation of ω via the spline function S^i_ω, which has a time complexity of O(dM^i M^{i+1}) and a memory complexity of O(K^i_ω M^i M^{i+1}), where K^i_ω is the number of knots.

Convolutional layers Conventional 2D convolutional layers have rank-4 filter banks in R^{h,w,c,f} as weights ω, where h is the height and w is the width of the kernels, c is the number of input channels, and f is the number of filters (we drop superscript i for simplicity). 
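The two projection operators described above can be sketched as follows. Shapes and weights here are illustrative toy values; the 1×1-convolution variant is written as a per-pixel dot product over channels followed by global averaging, which is equivalent for this purpose.

```python
import numpy as np

def sigmoid(z, slope=1.0):
    return 1.0 / (1.0 + np.exp(-slope * z))

def dot_decision(x, theta, slope=1.0):
    # (i) dot product with the flattened feature map: theta has M = h*w*c entries
    return sigmoid(theta @ x.reshape(-1), slope)

def conv_decision(x, theta, slope=1.0):
    # (ii) 1x1 convolution (per-pixel dot product over channels, theta in R^c)
    # followed by global averaging -- only c parameters instead of h*w*c
    responses = np.tensordot(x, theta, axes=([2], [0]))  # shape (h, w)
    return sigmoid(responses.mean(), slope)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))                  # feature map (h, w, c)
phi_dot  = dot_decision(x, rng.normal(size=8 * 8 * 16))
phi_conv = conv_decision(x, rng.normal(size=16))
```

Both variants land in (0, 1) because of the sigmoid, so either can index a position on the spline; the convolutional variant trades expressiveness for an O(c) parameter count.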
We propose two different ways of representing filter banks with splines: (i) a single spline with rank-4 knots, or (ii) f splines with rank-3 knots. While the first approach is a single curve between entire filter banks in R^{h,w,c,f}, the second approach uses a spline per filter, each with knots living in R^{h,w,c}. The rank-4 case is straightforward, with a single spline S^i_ω generating the filter bank as before. The additional time complexity is O(dhwcf), and the memory complexity is O(K^i_ω hwcf). To handle multiple splines with rank-3 filters, we need to project to each spline separately. In the case of dot product decisions, this effectively turns θ^i into a matrix of form R^{f × M^i}, such that the projection D(x^i; θ^i) = sigmoid(θ^i x^i) results in a φ^i ∈ R^f. Then, the final filter bank can be generated by:

ω^i = ⊕_{j=1}^{f} S^i_{ω,j}(φ^i_j),   (2)

where φ^i_j is the jth element of φ^i that corresponds to the spline S^i_{ω,j}, and ⊕ is the stacking operator that forms the filter bank. The additional time complexity is O(M^i f + dhwcf) and the memory needed is O(M^i f + K^i_ω hwcf). Adapting the same approach to convolutional decisions is straightforward by changing the decision filter dimensionality to R^{1,1,c,f}. When plugging many such rank-3 layers together, it may not be possible to diffuse positions coming from the previous layer with the new ones directly using Equation 1, since they may have different sizes. In such cases, we multiply the inherited positions with a learned matrix to match the shapes of position vectors.

2.3 Regularizing SplineNets

The continuous position distribution P(φ^i) plays an important role in utilizing SplineNets properly, e.g. if all φ^i's that are dynamically generated by samples are the same, the model reduces to a regular CNN. Figure 4 shows some real examples of such under-utilization. 
Meanwhile, we would also like to specialize splines to data, such that each control point handles a subset of classes or clusters more effectively. Both of these problems are common in decision trees, and the typical solution is maximizing the mutual information (MI) I(Y; Λ) = H(Y) - H(Y|Λ), where Y and Λ correspond to class labels and (discrete) nodes. In the absence of Y, X can be used to specialize nodes to clusters.

Figure 4: The distribution of φ^i can be suboptimal in various ways. Some examples directly taken from Tensorboard, where depth indicates training time: (1) binary positions, (2) constant positions, (3) positions slowly shifting, and (4) under-utilization. Ideally, the distribution P(φ^i) should be close to uniform.

In the case of SplineNets, the MI between the class labels and the continuous spline positions is:

I(φ^i; Y) = H(φ^i) - H(φ^i|Y) = H(Y) - H(Y|φ^i).   (3)

We choose to use the first form, which explicitly maximizes the position entropy H(φ), which we call the utilization term, and minimizes the conditional position entropy H(φ|Y), which is the specialization term. Then, our regularizer loss becomes:

L^i_reg = -w_u H(φ^i) + w_s H(φ^i|Y),   (4)

which should be minimized. The w_u and w_s are utilization and specialization weights, respectively. To calculate these entropies, the underlying continuous distributions P(φ^i) and P(φ^i|Y) need to be estimated from the N sampled position-label pairs {φ^i_n, y_n}_{n=1}^{N}. The two most common techniques for this are Kernel Density Estimation (KDE) and quantization. 
Here, we describe a differentiable quantization method that can be used to estimate these entropies.

Quantization method To approximate P(φ^i) and P(φ^i|Y), we can quantize the splines into B bins and count samples that fall inside each bin. Normalizing the bin histograms can then give us probability estimates that can be used to calculate entropies. However, a loss based on entropy calculated with hard counts cannot be used to regularize the network, since the indicator quantization function is non-differentiable. To solve this problem, we construct a soft quantization function:

U(Φ; c_b, w_b, υ) = 1 - (1 + υ^{1 - (2(Φ - c_b)/w_b)^2})^{-1}.   (5)

This function returns almost 1 when the position is inside the bin described by the center c_b and width w_b, and almost 0 otherwise. The parameter υ controls the slope. Figure 5 shows three different slopes used to construct bins.

Figure 5: (left) Quantization method. (right) Five shifted copies of the soft quantization function U with width=2 and slopes υ={10000, 100}.

We use this function to discretize the continuous variable φ^i with B bins, which turns φ^i into the discrete variable Λ^i. 
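A sketch of such a soft bin indicator in NumPy. We use a form with the stated behavior (approximately 1 inside the bin, approximately 0 outside, with υ controlling the slope and the value 0.5 exactly on the bin boundary); the bin center, width and slope below are placeholder values.

```python
import numpy as np

def soft_bin(phi, c_b, w_b, nu):
    # smooth indicator of the bin [c_b - w_b/2, c_b + w_b/2]:
    # ~1 inside, ~0 outside, exactly 0.5 on the boundary; larger nu -> steeper
    z = 2.0 * (phi - c_b) / w_b
    return 1.0 - 1.0 / (1.0 + nu ** (1.0 - z * z))

# a bin centered at 0.5 with width 0.2, slope nu = 100
inside, boundary, outside = (soft_bin(p, 0.5, 0.2, 100.0) for p in (0.5, 0.6, 0.9))
```

Unlike a hard indicator, this function has a non-zero gradient with respect to the position near the bin edges, which is what lets the entropy-based regularizer push positions around.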
Thus, the bin probabilities for Λ^i = b and the entropy H(Λ^i) become:

Pr(Λ^i = b) ≈ Σ_{n=1}^{N} U(φ^i_n; c_b, w_b, υ) / Σ_{n=1}^{N} Σ_{b'=1}^{B} U(φ^i_n; c_{b'}, w_{b'}, υ),   (6)

H(Λ^i) = - Σ_{b=1}^{B} Pr(Λ^i = b) log Pr(Λ^i = b).   (7)

The specialization term is then:

Pr(Λ^i = b | Y = c) ≈ Σ_{n=1}^{N} U(φ^i_n; c_b, w_b, υ) 1(y_n = c) / Σ_{n=1}^{N} Σ_{b'=1}^{B} U(φ^i_n; c_{b'}, w_{b'}, υ) 1(y_n = c),   (8)

H(Λ^i | Y) = - Σ_{c=1}^{C} Pr(Y = c) Σ_{b=1}^{B} Pr(Λ^i = b | Y = c) log Pr(Λ^i = b | Y = c).   (9)

The effect of using different values of B on P(φ^i) can be seen in Figure 6. Using small B creates artifacts in the distribution, which is why we use B = 50 in our experiments.

Figure 6: The effect of quantization based uniformization. From left to right: (i) B = 2, (ii) B = 3, (iii) B = 5, (iv) B = 50. We use B = 50 in this work.

The effect of specialization can be seen for a 10-class problem (Fashion MNIST) in Figure 7. Here, the top and bottom rows correspond to labeled sample distributions over splines for two consecutive layers, for four different values of w_s (columns). Each image has 10 rows corresponding to classes, and 50 columns representing bins, where each bin's label distribution is normalized separately.

Figure 7: The effect of quantization based specialization. Increasing values of w_s leads to more specialization.

3 Experiments

Architecture SplineNets are generic and can replace any convolutional or dense layer in any CNN. To demonstrate this, we converted ResNets [19] and LeNets [20] into SplineNets. 
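Before moving to the experiments, the soft-count entropy estimates of Equations 6-9 can be sketched end to end. The bin indicator form, the two-bin layout and the synthetic positions below are illustrative placeholders; the example constructs a well-utilized, well-specialized case (two classes parked on opposite halves of the spline), for which H(Λ) should approach log 2 and H(Λ|Y) should approach 0.

```python
import numpy as np

def soft_bin(phi, centers, widths, nu):
    # (N, B) matrix of soft counts: sample n's membership in bin b
    z = 2.0 * (phi[:, None] - centers[None, :]) / widths[None, :]
    return 1.0 - 1.0 / (1.0 + nu ** (1.0 - z * z))

def entropy(p, eps=1e-12):
    return float(-(p * np.log(p + eps)).sum())

def soft_entropies(phi, y, centers, widths, nu=100.0):
    U = soft_bin(phi, centers, widths, nu)
    p_bin = U.sum(axis=0) / U.sum()                 # Eq. 6: Pr(Lambda = b)
    h = entropy(p_bin)                              # Eq. 7: H(Lambda)
    h_cond = 0.0
    for c in np.unique(y):                          # Eqs. 8-9: H(Lambda | Y)
        Uc = U[y == c]
        h_cond += (y == c).mean() * entropy(Uc.sum(axis=0) / Uc.sum())
    return h, h_cond

# utilized and specialized: class 0 near 0.15, class 1 near 0.85, B = 2 bins
phi = np.array([0.15] * 50 + [0.85] * 50)
y   = np.array([0] * 50 + [1] * 50)
centers, widths = np.array([0.25, 0.75]), np.array([0.5, 0.5])
h, h_cond = soft_entropies(phi, y, centers, widths)
```

Plugging these into Equation 4, this configuration yields a low regularizer loss: the utilization term -w_u H(Λ) is near its minimum and the specialization penalty w_s H(Λ|Y) is near zero.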
We converted all\nconvolutional and dense layers into dynamic and hierarchical SplineNet layers, using different de-\ncision operators (\u2019dot\u2019, \u2019conv\u2019) and knot types (\u2019rank-3\u2019, \u2019rank-4\u2019), and experimented with different\nnumber of knots. For ResNet, we relied on the public implementation inside the Tensor\ufb02ow package\nwith \ufb01ne-tuned baseline parameters. We modernized LeNet with ReLu and dropout [21]. LeNet has\ntwo convolutional layers with C1 and C2 \ufb01lters respectively, followed by two dense layers with H\nand 10 hidden nodes. We experimented only with models where C1=s, C2=2s, H=4s for various\nvalues of s (depicted as LeNet-s). While the more compact mostly convolutional ResNet is better\nfor demonstrating SplineNet\u2019s ability to increase speed or accuracy at the cost of model complexity,\nLeNet with large dense layers is better suited for showcasing the gains in model size and speed while\nmaintaining (or even increasing) accuracy.\n\nImplementation details We implemented SplineNets in Tensor\ufb02ow [22] and used stochastic gra-\ndient descent with constant momentum (0.9). We initialized all knot weights of a spline together\nwith a normal distribution, with a variance of c/n, where c is a small constant and n is the fan in pa-\nrameter that is calculated from the constructed operator weight shape. We found that training with\nconditional convolutions per sample using grouped or depthwise convolutions was prohibitively\nslow. Therefore we took the linear convolution operator inside the weighted sum over the knots,\nsuch that it applies to individual knots rather than their sum. This method effectively combines all\n\n7\n\n\fknots into a single \ufb01lter bank and applies convolution in one step, and then takes the weighted sum\nover its partitions. 
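The batching trick described above relies on the linearity of convolution: convolving with the blended filter Σ_k B_k(φ) C_k equals blending the outputs of convolving with each knot filter. A minimal 1-D stand-in for the 2-D case (toy shapes, random weights):

```python
import numpy as np

rng = np.random.default_rng(0)
K, width = 4, 5
x = rng.normal(size=32)               # a toy 1-D "feature map"
C = rng.normal(size=(K, width))       # K knot filters on the spline
b = rng.random(K); b /= b.sum()       # this sample's basis weights B_k(phi)

# naive per-sample path: form the sample's filter on the spline, convolve once
naive = np.convolve(x, b @ C, mode="valid")

# batched trick: convolve with all K knots as one combined bank, then take the
# weighted sum over the K output partitions (valid because convolution is linear)
parts = np.stack([np.convolve(x, C[k], mode="valid") for k in range(K)])
trick = b @ parts
```

The two paths agree to floating-point precision, which is why moving the convolution inside the weighted sum changes the cost profile (one big batched convolution) but not the result.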
Note that this is only needed when training with mini-batches; at test time a single\nsample fully bene\ufb01ts from the theoretical gains in speed.\n\nHyperparameters A batch size of 250 and learning rate 0.3 were found to be optimal for\nCIFAR-10. Initializer variance constant c was set to 0.05, number of bins B to 50 and quantiza-\ntion slope \u03c5 to 100. We found it useful to add a multiplicative slope parameter inside the decision\nsigmoid functions, which was set to 0.4. The diffusion parameter had almost no effect on the shal-\nlow LeNet, so \u03b1 was set to 1. With ResNet, optimal diffusion rate was found to be 0.875, but the\nincrease in accuracy was not signi\ufb01cant. Regularizer had a more signi\ufb01cant effect on the shallow\nLeNet, with ws and wu set to 0.2 giving the best results. On ResNet, the effect was less signi\ufb01cant\nafter \ufb01ne tuning the initialization and sigmoid slope parameters. Setting ws and wu to 0.05 gave\nslightly better results.\n\nSpline-ResNet We compared against 32 and 110 level ResNets on CIFAR-10, which was aug-\nmented with random shifts, crops and horizontal \ufb02ips, followed by per-image whitening. We ex-\nperimented with model type M \u2208 {D, H} (D is dynamic-only, H is hierarchical), decision type\nT \u2208 {C, D} (C is 1\u00d71 convolution, D is dense), and the knot rank R \u2208 {3, 4}. The model nam-\ning uses the pattern M (K)-T -R; e.g. D(3)-C-R4 is a dynamic model with three rank-4 knots and\nconvolutional decisions. All SplineNets have depth 32. Baseline ResNet-32 and ResNet-110 reach\n92.5% and 93.6% on CIFAR-10, respectively.\nFirst experiment shows the effect of increasing the number of knots. For these tests we opted for the\nmore compact convolutional decisions and the more powerful rank-3 knots. Figure 8 shows how the\naccuracy, model size and runtime complexity for a single sample (measured in FLOPS) are affected\nwhen going from two to \ufb01ve knots. 
Evidently, SplineNets deliver on the promise that they can\nincrease model complexity without affecting speed, resulting in higher accuracy. For this setting,\nboth the dynamic and hierarchical models reach nearly 93.4%, while being three times faster than\nResNet-110.\n\n(a) Effect of K on number of FLOPS.\n\n(b) Effect of K on model size.\n\nFigure 8: Spline-Resnet-32 using convolutional decisions and rank-3 knots, with K=2\u22125. (Left)\nIncreasing the number of knots increases the accuracy signi\ufb01cantly, and has a negligible effect on\nthe number of FLOPS for single sample inference. (Right) Model size grows linearly with number\nof knots.\n\nNext, we compare all decision and convolution mechanisms for both architectures by \ufb01xing K=5.\nThe results are given in Figure 9. Here, the most powerful SplineNet tested is H(5)-D-R3, with\nhierarchical, dense projections for each \ufb01lter, and it matches the accuracy of ResNet-110 with almost\nhalf the number of FLOPS. However, a dot product for each \ufb01lter in every layer creates a quite\nlarge model with 42M parameters. Its dynamic-only counterpart has 10M parameters and is three\ntimes faster than ResNet-110. With close to 93.5% accuracy, H(5)-D-R4 provides a better trade off\nbetween model size and speed with 3.67M parameters. Note that we have not tested more than \ufb01ve\nknots, and all models should further bene\ufb01t from increased number of knots.\n\nSpline-LeNet We trained LeNet-32, LeNet-64 and LeNet-128 as baseline models on CIFAR-10.\nWe used SplineNets with the same parameters of LeNet-32, combined with the more powerful rank-\n3 knots and dot product decisions. The comparisons in accuracy, runtime complexity and model\nsize are given in Figure 10. Notably, LeNet model size increases rapidly with number of \ufb01lters,\nwith a large impact on speed as well. 
Increasing the number of knots in SplineNets is much more efficient in terms of both speed and model size, leading to models that are as accurate as the larger baseline, while also being 15 times faster and almost five times smaller. These results show that SplineNets can indeed reduce model complexity while maintaining the accuracy of the baseline model, especially in the presence of large dense layers.

Figure 9: Dynamic and hierarchical Spline-ResNets with 32 levels and 5 knots, with different options for decision types and knot ranks. (a) Effect of T and R on the number of FLOPS. (b) Effect of T and R on model size.

Figure 10: Dynamic and hierarchical Spline-LeNets with K = 2–5, using rank-3 knots and dot product decisions. (a) Effect of K on the number of FLOPS. (b) Effect of K on model size.

Finally, we experimented with Spline-LeNets on MNIST. We augmented MNIST with affine and elastic deformations and trained LeNet-32 as a baseline, which achieved 99.52% on the original, and 99.60% on the augmented dataset. In comparison, H(2)-D-R3 reached 99.61% and 99.65% on the respective datasets. The best score of 99.71% was achieved by H-SN(4), a hierarchical SplineNet with four knots, on the augmented dataset with 2.25M parameters. In comparison, CapsuleNets [23] report a score of 99.75% with 8.5M parameters. Higher scores can typically only be reached with ensemble models.

4 Conclusions and Discussions

In this work, we presented the concept of SplineNets, a novel and practical method for realizing conditional neural networks using embedded continuous manifolds.
Our results dramatically reduce runtime complexity and computation costs of CNNs while maintaining or even increasing accuracy. Like other conditional computation techniques, SplineNets have the added benefit of allowing further efficiency gains through compression, for instance by pruning layers [24, 25] in an energy-aware manner [26], by substitution of filters with smaller ones [27], or by quantizing [28, 29, 30] or binarizing [31] weights. While these methods are mostly orthogonal to our work, one should take care when sparsifying knots independently.

We consider several avenues for future work. The theoretical limit of gains in accuracy from increasing knots is not clear from the results so far, for which we will conduct more experiments. Also, there is a large jump in accuracy and model size when switching from rank-4 to rank-3 knots. To investigate the cases between the two, we will introduce the concept of spline groups, where groups contain rank-3 knots and projections need to be per group rather than per filter. Another interesting direction is using SplineNets to form novel hierarchical generative models. Finally, we will explore methods to jointly train SplineNet ensembles by borrowing from the decision tree literature.

References

[1] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc V. Le, Geoffrey E. Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. CoRR, abs/1701.06538, 2017.

[2] Yani Ioannou, Duncan P. Robertson, Darko Zikic, Peter Kontschieder, Jamie Shotton, Matthew Brown, and Antonio Criminisi. Decision forests, convolutional networks and the models in-between. CoRR, abs/1603.01250, 2016.

[3] Chao Xiong, Xiaowei Zhao, Danhang Tang, Jayashree Karlekar, Shuicheng Yan, and Tae-Kyun Kim. Conditional convolutional neural network for modality-aware face recognition. In ICCV, pages 3667–3675.
IEEE Computer Society, 2015.

[4] Seungryul Baek, Kwang In Kim, and Tae-Kyun Kim. Deep convolutional decision jungle for image classification. CoRR, abs/1706.02003, 2017.

[5] Ufuk Can Biçici, Cem Keskin, and Lale Akarun. Conditional information gain networks. In 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018.

[6] Samuel Rota Bulò and Peter Kontschieder. Neural decision forests for semantic image labelling. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 81–88, June 2014.

[7] Ludovic Denoyer and Patrick Gallinari. Deep sequential neural network. CoRR, abs/1410.0510, 2014.

[8] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulò. Deep neural decision forests. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pages 4190–4194, 2016.

[9] Suhang Wang, Charu Aggarwal, and Huan Liu. Using a random forest to inspire a neural network and improving on it. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 1–9, 2017.

[10] C. Murdock and F. De la Torre. Additive component analysis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 673–681, July 2017.

[11] David Ha, Andrew M. Dai, and Quoc V. Le. HyperNetworks. CoRR, abs/1609.09106, 2016.

[12] Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. CoRR, abs/1605.09673, 2016.

[13] Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Trans. Graph., 36(4):42:1–42:13, July 2017.

[14] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. CoRR, abs/1506.02025, 2015.

[15] Timothy P. Lillicrap, Jonathan J.
Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. CoRR, abs/1509.02971, 2015.

[16] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.

[17] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. CoRR, abs/1606.03657, 2016.

[18] Jamie Shotton, Sebastian Nowozin, Toby Sharp, John Winn, Pushmeet Kohli, and Antonio Criminisi. Decision jungles: Compact and rich models for classification. In Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, NIPS'13, pages 234–242, USA, 2013. Curran Associates Inc.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[21] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[22] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S.
Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[23] Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. Dynamic routing between capsules. CoRR, abs/1710.09829, 2017.

[24] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.

[25] Pavlo Molchanov, Stephen Tyree, Tero Karras, Timo Aila, and Jan Kautz. Pruning convolutional neural networks for resource efficient transfer learning. CoRR, abs/1611.06440, 2016.

[26] Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. Designing energy-efficient convolutional neural networks using energy-aware pruning. CoRR, abs/1611.05128, 2016.

[27] Forrest N. Iandola, Matthew W. Moskewicz, Khalid Ashraf, Song Han, William J. Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <1MB model size. CoRR, abs/1602.07360, 2016.

[28] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. CoRR, abs/1512.06473, 2015.

[29] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016.

[30] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen.
Incremental network quantization: Towards lossless CNNs with low-precision weights. CoRR, abs/1702.03044, 2017.

[31] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.