{"title": "On the Computational Efficiency of Training Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 855, "page_last": 863, "abstract": "It is well-known that neural networks are computationally hard to train. On the other hand, in practice, modern day neural networks are trained efficiently using SGD and a variety of tricks that include different activation functions (e.g. ReLU), over-specification (i.e., train networks which are larger than needed), and regularization. In this paper we revisit the computational complexity of training neural networks from a modern perspective. We provide both positive and negative results, some of them yield new provably efficient and practical algorithms for training neural networks.", "full_text": "On the Computational Ef\ufb01ciency of Training Neural\n\nNetworks\n\nRoi Livni\n\nThe Hebrew University\nroi.livni@mail.huji.ac.il\n\nShai Shalev-Shwartz\nThe Hebrew University\n\nshais@cs.huji.ac.il\n\nOhad Shamir\n\nWeizmann Institute of Science\nohad.shamir@weizmann.ac.il\n\nAbstract\n\nIt is well-known that neural networks are computationally hard to train. On the\nother hand, in practice, modern day neural networks are trained ef\ufb01ciently us-\ning SGD and a variety of tricks that include different activation functions (e.g.\nReLU), over-speci\ufb01cation (i.e., train networks which are larger than needed), and\nregularization. In this paper we revisit the computational complexity of training\nneural networks from a modern perspective. We provide both positive and neg-\native results, some of them yield new provably ef\ufb01cient and practical algorithms\nfor training certain types of neural networks.\n\n1\n\nIntroduction\n\nOne of the most signi\ufb01cant recent developments in machine learning has been the resurgence of\n\u201cdeep learning\u201d, usually in the form of arti\ufb01cial neural networks. 
A combination of algorithmic\nadvancements, as well as increasing computational power and data size, has led to a breakthrough\nin the effectiveness of neural networks, and they have been used to obtain very impressive practical\nperformance on a variety of domains (a few recent examples include [17, 16, 24, 10, 7]).\nA neural network can be described by a (directed acyclic) graph, where each vertex in the graph cor-\nresponds to a neuron and each edge is associated with a weight. Each neuron calculates a weighted\nsum of the outputs of neurons which are connected to it (and possibly adds a bias term). It then\npasses the resulting number through an activation function \u03c3 : R \u2192 R and outputs the resulting\nnumber. We focus on feed-forward neural networks, where the neurons are arranged in layers, in\nwhich the output of each layer forms the input of the next layer. Intuitively, the input goes through\nseveral transformations, with higher-level concepts derived from lower-level ones. The depth of the\nnetwork is the number of layers and the size of the network is the total number of neurons.\nFrom the perspective of statistical learning theory, by specifying a neural network architecture (i.e.\nthe underlying graph and the activation function) we obtain a hypothesis class, namely, the set of all\nprediction rules obtained by using the same network architecture while changing the weights of the\nnetwork. Learning the class involves \ufb01nding a speci\ufb01c set of weights, based on training examples,\nwhich yields a predictor that has good performance on future examples. When studying a hypothesis\nclass we are usually concerned with three questions:\n\n1. Sample complexity: how many examples are required to learn the class.\n2. Expressiveness: what type of functions can be expressed by predictors in the class.\n3. 
Training time: how much computation time is required to learn the class.

For simplicity, let us first consider neural networks with a threshold activation function (i.e. σ(z) = 1 if z > 0 and 0 otherwise), over the boolean input space, {0, 1}^d, and with a single output in {0, 1}. The sample complexity of such neural networks is well understood [3]. It is known that the VC dimension grows linearly with the number of edges (up to log factors). It is also easy to see that no matter what the activation function is, as long as we represent each weight of the network using a constant number of bits, the VC dimension is bounded by a constant times the number of edges. This implies that empirical risk minimization - or finding weights with small average loss over the training data - can be an effective learning strategy from a statistical point of view.
As to the expressiveness of such networks, it is easy to see that neural networks of depth 2 and sufficient size can express all functions from {0, 1}^d to {0, 1}. However, it is also possible to show that for this to happen, the size of the network must be exponential in d (e.g. [19, Chapter 20]). Which functions can we express using a network of polynomial size? The theorem below shows that all boolean functions that can be calculated in time O(T(d)) can also be expressed by a network of depth O(T(d)) and size O(T(d)^2).
Theorem 1. Let T : N → N and for every d, let F_d be the set of functions that can be implemented by a Turing machine using at most T(d) operations.
Then there exist constants b, c ∈ R+ such that for every d, there is a network architecture of depth c·T(d) + b, size (c·T(d) + b)^2, and threshold activation function, such that the resulting hypothesis class contains F_d.

The proof of the theorem follows directly from the relation between the time complexity of programs and their circuit complexity (see, e.g., [22]), and the fact that we can simulate the standard boolean gates using a fixed number of neurons.
We see that from the statistical perspective, neural networks form an excellent hypothesis class; on the one hand, for every runtime T(d), by using depth of O(T(d)) we contain all predictors that can be run in time at most T(d). On the other hand, the sample complexity of the resulting class depends polynomially on T(d).
The main caveat of neural networks is the training time. Existing theoretical results are mostly negative, showing that successfully learning with these networks is computationally hard in the worst case. For example, neural networks of depth 2 contain the class of intersections of halfspaces (where the number of halfspaces is the number of neurons in the hidden layer). By a reduction from k-coloring, it has been shown that finding the weights that best fit the training set is NP-hard ([9]). [6] showed that even finding weights that result in close-to-minimal empirical error is computationally infeasible. These hardness results focus on proper learning, where the goal is to find a nearly-optimal predictor with a fixed network architecture A. However, if our goal is to find a good predictor, there is no reason to limit ourselves to predictors with one particular architecture. Instead, we can try, for example, to find a network with a different architecture A′, which is almost as good as the best network with architecture A.
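The gate-simulation step in the proof of Theorem 1 is easy to make concrete. Below is a minimal illustrative sketch (ours, not the paper's): AND, OR and NOT each as a single threshold neuron, composed into a two-layer network computing XOR, so a boolean circuit of size s translates into a threshold network of size O(s).

```python
# Threshold activation: sigma(z) = 1 if z > 0, else 0.
def neuron(w, b, x):
    """A single neuron: threshold applied to a weighted sum plus a bias."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Standard boolean gates, each as one threshold neuron over inputs in {0, 1}.
AND = lambda x: neuron([1, 1], -1.5, x)   # fires iff x1 + x2 > 1.5
OR  = lambda x: neuron([1, 1], -0.5, x)   # fires iff x1 + x2 > 0.5
NOT = lambda x: neuron([-1], 0.5, x)      # fires iff -x1 > -0.5

# Composing gates gives deeper networks, e.g. XOR = AND(OR(x), NOT(AND(x))).
def XOR(x):
    return AND([OR(x), NOT([AND(x)])])

print([XOR([a, b]) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Any circuit over these gates can be wired the same way, which is all the simulation argument needs.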
This is an example of the powerful concept of improper learning, which has often proved useful in circumventing computational hardness results. Unfortunately, there are hardness results showing that even with improper learning, and even if the data is generated exactly from a small, depth-2 neural network, there are no efficient algorithms which can find a predictor that performs well on test data. In particular, [15] and [12] have shown this in the case of learning intersections of halfspaces, using cryptographic and average-case complexity assumptions. On a related note, [4] recently showed positive results on learning from data generated by a neural network of a certain architecture and randomly connected weights. However, the assumptions used are strong and unlikely to hold in practice.
Despite this theoretical pessimism, in practice, modern-day neural networks are trained successfully in many learning problems. There are several tricks that enable successful training:
• Changing the activation function: The threshold activation function, σ(a) = 1_{a>0}, has zero derivative almost everywhere. Therefore, we cannot apply gradient-based methods with this activation function. To circumvent this problem, we can consider other activation functions. Most widely known is the sigmoidal activation, e.g. σ(a) = 1/(1 + e^{-a}), which forms a smooth approximation of the threshold function. Another recently popular activation function is the rectified linear unit (ReLU) function, σ(a) = max{0, a}. Note that subtracting a shifted ReLU from a ReLU yields an approximation of the threshold function, so by doubling the number of neurons we can approximate a network with threshold activation by a network with ReLU activation.
• Over-specification: It was empirically observed that it is easier to train networks which are larger than needed. Indeed, we empirically demonstrate this phenomenon in Sec.
5.
• Regularization: It was empirically observed that regularizing the weights of the network speeds up the convergence (e.g. [16]).

The goal of this paper is to revisit and re-raise the question of neural networks' computational efficiency, from a modern perspective. This is a challenging topic, and we do not pretend to give any definite answers. However, we provide several results, both positive and negative. Most of them are new, although a few appeared in the literature in other contexts. Our contributions are as follows:
• We make a simple observation that for sufficiently over-specified networks, global optima are ubiquitous and in general computationally easy to find. Although this holds only for extremely large networks which will overfit, it can be seen as an indication that the computational hardness of learning does decrease with the amount of over-specification. This is also demonstrated empirically in Sec. 5.
• Motivated by the idea of changing the activation function, we consider the quadratic activation function, σ(a) = a^2. Networks with the quadratic activation compute polynomial functions of the input in R^d, hence we call them polynomial networks. Our main findings for such networks are as follows:
– Networks with quadratic activation are as expressive as networks with threshold activation.
– Constant-depth networks with quadratic activation can be learned in polynomial time.
– Sigmoidal networks of depth 2, and with ℓ_1 regularization, can be approximated by polynomial networks of depth O(log log(1/ε)). It follows that sigmoidal networks with ℓ_1 regularization can be learned in polynomial time as well.
– The aforementioned positive results are interesting theoretically, but lead to impractical algorithms. We provide a practical, provably correct, algorithm for training depth-2 polynomial networks.
While such networks can also be learned using a linearization trick, our algorithm is more efficient and returns networks whose size does not depend on the data dimension. Our algorithm follows a forward greedy selection procedure, where each greedy step builds a new neuron by solving an eigenvalue problem.
– We generalize the above algorithm to depth 3, in which each forward greedy step involves an efficient approximate solution to a tensor approximation problem. The algorithm can learn a rich sub-class of depth-3 polynomial networks.
– We describe some experimental evidence, showing that our practical algorithm is competitive with state-of-the-art neural network training methods for depth-2 networks.

2 Sufficiently Over-Specified Networks Are Easy to Train

We begin by considering the idea of over-specification, and make an observation that for sufficiently over-specified networks, the optimization problem associated with training them is generally quite easy to solve, and that global optima are in a sense ubiquitous. As an interesting contrast, note that for very small networks (such as a single neuron with a non-convex activation function), the associated optimization problem is generally hard, and can exhibit exponentially many local (non-global) minima [5]. We emphasize that our observation only holds for extremely large networks, which will overfit in any reasonable scenario, but it does point to a possible spectrum where computational cost decreases with the amount of over-specification.
To present the result, let X ∈ R^{d,m} be a matrix of m training examples in R^d. We can think of the network as composed of two mappings. The first maps X into a matrix Z ∈ R^{n,m}, where n is the number of neurons whose outputs are connected to the output layer.
The second mapping is a linear mapping Z ↦ WZ, where W ∈ R^{o,n}, that maps Z to the o neurons in the output layer. Finally, there is a loss function ℓ : R^{o,m} → R, which we'll assume to be convex, that assesses the quality of the prediction on the entire data (and will of course depend on the m labels). Let V denote all the weights that affect the mapping from X to Z, and denote by f(V) the function that maps V to Z. The optimization problem associated with learning the network is therefore min_{W,V} ℓ(W f(V)).
The function ℓ(W f(V)) is generally non-convex, and may have local minima. However, if n ≥ m, then it is reasonable to assume that Rank(f(V)) = m with large probability (under some random choice of V), due to the non-linear nature of the function computed by neural networks.¹ In that case, we can simply fix V and solve min_W ℓ(W f(V)), which is computationally tractable as ℓ is assumed to be convex. Since f(V) has full rank, the solution of this problem corresponds to a global optimum of ℓ, and hence to a global optimum of the original optimization problem. Thus, for sufficiently large networks, finding global optima is generally easy, and they are in a sense ubiquitous.

¹For example, consider the function computed by the first layer, X ↦ σ(V_d X), where σ is a sigmoid function. Since σ is non-linear, the columns of σ(V_d X) will not be linearly dependent in general.

3 The Hardness of Learning Neural Networks

We now review several known hardness results and apply them to our learning setting. For simplicity, throughout most of this section we focus on the PAC model in the binary classification case, over the Boolean cube, in the realizable case, and with a fixed target accuracy.²
Fix some ε, δ ∈ (0, 1).
For every dimension d, let the input space be X_d = {0, 1}^d and let H be a hypothesis class of functions from X_d to {±1}. We often omit the subscript d when it is clear from context. A learning algorithm A has access to an oracle that samples x according to an unknown distribution D over X and returns (x, f*(x)), where f* is some unknown target hypothesis in H. The objective of the algorithm is to return a classifier f : X → {±1}, such that with probability of at least 1 - δ,

P_{x∼D}[f(x) ≠ f*(x)] ≤ ε.

We say that A is efficient if it runs in time poly(d) and the function it returns can also be evaluated on a new instance in time poly(d). If there is such an A, we say that H is efficiently learnable. In the context of neural networks, every network architecture defines a hypothesis class, N_{t,n,σ}, that contains all target functions f that can be implemented using a neural network with t layers, n neurons (excluding input neurons), and an activation function σ. The immediate question is which N_{t,n,σ} are efficiently learnable. We will first address this question for the threshold activation function, σ_{0,1}(z) = 1 if z > 0 and 0 otherwise.
Observing that depth-2 networks with the threshold activation function can implement intersections of halfspaces, we will rely on the following hardness results, due to [15].
Theorem 2 (Theorem 1.2 in [15]). Let X = {±1}^d, let

H^a = {x ↦ σ_{0,1}(w^⊤x - b - 1/2) : b ∈ N, w ∈ N^d, |b| + ‖w‖_1 ≤ poly(d)},

and let H^a_k = {x ↦ h_1(x) ∧ h_2(x) ∧ ... ∧ h_k(x) : ∀i, h_i ∈ H^a}, where k = d^ρ for some constant ρ > 0.
Then under a certain cryptographic assumption, H^a_k is not efficiently learnable.

Under a different complexity assumption, [12] showed a similar result even for k = ω(1).
As mentioned before, neural networks of depth ≥ 2 and with the σ_{0,1} activation function can express intersections of halfspaces: for example, the first layer consists of k neurons computing the k halfspaces, and the second layer computes their conjunction by the mapping x ↦ σ_{0,1}(Σ_i x_i - k + 1/2). Trivially, if some class H is not efficiently learnable, then any class containing it is also not efficiently learnable. We thus obtain the following corollary:
Corollary 1. For every t ≥ 2, n = ω(1), the class N_{t,n,σ_{0,1}} is not efficiently learnable (under the complexity assumption given in [12]).

What happens when we change the activation function? In particular, two widely used activation functions for neural networks are the sigmoidal activation function, σ_sig(z) = 1/(1 + exp(-z)), and the rectified linear unit (ReLU) activation function, σ_relu(z) = max{z, 0}.
As a first observation, note that for |z| ≫ 1 we have that σ_sig(z) ≈ σ_{0,1}(z). Our data domain is the discrete Boolean cube, hence if we allow the weights of the network to be arbitrarily large, then N_{t,n,σ_{0,1}} ⊆ N_{t,n,σ_sig}. Similarly, the function σ_relu(z) - σ_relu(z - 1) equals σ_{0,1}(z) for every |z| ≥ 1. As a result, without restricting the weights, we can simulate each threshold-activated neuron by two ReLU-activated neurons, which implies that N_{t,n,σ_{0,1}} ⊆ N_{t,2n,σ_relu}.
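Both reductions just described are easy to check numerically. A small sanity check (our illustration, not from the paper) of the identity σ_relu(z) - σ_relu(z - 1) = σ_{0,1}(z) for |z| ≥ 1, and of the conjunction layer x ↦ σ_{0,1}(Σ_i x_i - k + 1/2):

```python
def relu(z):
    return max(z, 0.0)

def threshold(z):
    # sigma_{0,1}: 1 if z > 0, else 0
    return 1.0 if z > 0 else 0.0

def relu_pair(z):
    """Two ReLU neurons simulating one threshold neuron; exact whenever |z| >= 1."""
    return relu(z) - relu(z - 1)

assert all(relu_pair(z) == threshold(z) for z in [-5.0, -1.0, 1.0, 2.0, 7.5])

def conjunction(bits):
    """Second-layer threshold neuron computing h_1 AND ... AND h_k on {0,1} inputs."""
    k = len(bits)
    return threshold(sum(bits) - k + 0.5)

assert conjunction([1, 1, 1]) == 1.0
assert conjunction([1, 0, 1]) == 0.0
```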
Hence, Corollary 1 applies to both sigmoidal networks and ReLU networks as well, as long as we do not regularize the weights of the network.

²While we focus on the realizable case (i.e., there exists f* ∈ H that provides perfect predictions), with a fixed accuracy (ε) and confidence (δ), since we are dealing with hardness results, the results trivially apply to the agnostic case and to learning with arbitrarily small accuracy and confidence parameters.

What happens when we do regularize the weights? Let N_{t,n,σ,L} be all target functions that can be implemented using a neural network of depth t, size n, activation function σ, and where we restrict the input weights of each neuron to satisfy ‖w‖_1 + |b| ≤ L.
One may argue that in many real-world distributions, the difference between the two classes, N_{t,n,σ,L} and N_{t,n,σ_{0,1}}, is small. Roughly speaking, when the distribution density is low around the decision boundary of neurons (similarly to separation-with-margin assumptions), sigmoidal neurons will be able to effectively simulate threshold-activated neurons.
In practice, the sigmoid and ReLU activation functions are advantageous over the threshold activation function, since they can be trained using gradient-based methods. Can these empirical successes be turned into formal guarantees? Unfortunately, a closer examination of Thm. 2 demonstrates that if L = Ω(d) then learning N_{2,n,σ_sig,L} and N_{2,n,σ_relu,L} is still hard. Formally, to apply these networks to binary classification, we follow a standard definition of learning with a margin assumption: we assume that the learner receives examples of the form (x, sign(f*(x))) where f* is a real-valued function that comes from the hypothesis class, and we further assume that |f*(x)| ≥ 1. Even under this margin assumption, we have the following:
Corollary 2.
For every t ≥ 2, n = ω(1), L = Ω(d), the classes N_{t,n,σ_sig,L} and N_{t,n,σ_relu,L} are not efficiently learnable (under the complexity assumption given in [12]).

A proof is provided in the appendix. What happens when L is much smaller? Later on in the paper we will show positive results for L being a constant and the depth being fixed. These results will be obtained using polynomial networks, which we study in the next section.

4 Polynomial Networks

In the previous section we have shown several strong negative results for learning neural networks with the threshold, sigmoidal, and ReLU activation functions. One way to circumvent these hardness results is by considering another activation function. Perhaps the simplest non-linear function is the squared function, σ_2(x) = x^2. We call networks that use this activation function polynomial networks, since they compute polynomial functions of their inputs. As in the previous section, we denote by N_{t,n,σ_2,L} the class of functions that can be implemented using a neural network of depth t, size n, squared activation function, and a bound L on the ℓ_1 norm of the input weights of each neuron. Whenever we do not specify L we refer to polynomial networks with unbounded weights. Below we study the expressiveness and computational complexity of polynomial networks. We note that algorithms for efficiently learning (real-valued) sparse or low-degree polynomials have been studied in several previous works (e.g. [13, 14, 8, 2, 1]).
However, these rely on strong distributional assumptions, such as the data instances having a uniform or log-concave distribution, while we are interested in a distribution-free setting.

4.1 Expressiveness

We first show that, similarly to networks with threshold activation, polynomial networks of polynomial size can express all functions that can be implemented efficiently using a Turing machine.
Theorem 3 (Polynomial networks can express Turing machines). Let F_d and T be as in Thm. 1. Then there exist constants b, c ∈ R+ such that for every d, the class N_{t,n,σ_2,L}, with t = c·T(d)·log(T(d)) + b, n = t^2, and L = b, contains F_d.
The proof of the theorem relies on the result of [18] and is given in the appendix.
Another relevant expressiveness result, which we will use later, shows that polynomial networks can approximate networks with sigmoidal activation functions:
Theorem 4. Fix 0 < ε < 1, L ≥ 3 and t ∈ N. There are B_t ∈ Õ(log(tL + L log(1/ε))) and B_n ∈ Õ(tL + L log(1/ε)) such that for every f ∈ N_{t,n,σ_sig,L} there is a function g ∈ N_{tB_t,nB_n,σ_2} such that sup_{‖x‖_∞<1} ‖f(x) - g(x)‖_∞ ≤ ε.
The proof relies on an approximation of the sigmoid function based on Chebyshev polynomials, as was done in [21], and is given in the appendix.

4.2 Training Time

We now turn to the computational complexity of learning polynomial networks. We first show that it is hard to learn polynomial networks of depth Ω(log(d)). Indeed, by combining Thm. 4 and Corollary 2 we obtain the following:
Corollary 3. The class N_{t,n,σ_2}, where t = Ω(log(d)) and n = Ω(d), is not efficiently learnable.
On the flip side, constant-depth polynomial networks can be learned in polynomial time, using a simple linearization trick.
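The linearization trick is short enough to demonstrate end to end. In the sketch below (ours; the dimensions, random seed, and use of least squares are illustrative choices), labels come from a small network with squared activations, and a convex fit over all monomials of degree at most 2 recovers it exactly on the training data:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 200

# Labels generated by a small polynomial network: f(x) = sum_i a_i (w_i^T x)^2.
W = rng.standard_normal((3, d))
a = np.array([1.0, -0.5, 2.0])
X = rng.standard_normal((m, d))
y = ((X @ W.T) ** 2) @ a

def monomials(x):
    """All monomials of x of degree <= 2 (the linearization feature map)."""
    feats = [1.0] + list(x)
    feats += [x[j] * x[k]
              for j, k in itertools.combinations_with_replacement(range(len(x)), 2)]
    return feats

Phi = np.array([monomials(x) for x in X])  # m x (1 + d + d(d+1)/2) design matrix

# The network is linear in these features, so a convex least-squares problem
# over the monomial coefficients fits the training data exactly.
coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(float(np.max(np.abs(Phi @ coef - y))) < 1e-6)  # True
```

For depth t the same map needs all monomials of degree up to 2^t, which is where the poly(d^{2^t}) runtime in the text comes from.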
Specifically, the class of polynomial networks of constant depth t is contained in the class of multivariate polynomials of total degree at most s = 2^t. This class can be represented as a d^s-dimensional linear space, where each vector is the coefficient vector of some such polynomial. Therefore, the class of polynomial networks of depth t can be learned in time poly(d^{2^t}), by mapping each instance vector x ∈ R^d to all of its monomials, and learning a linear predictor on top of this representation (which can be done efficiently in the realizable case, or when a convex loss function is used). In particular, if t is a constant then so is 2^t, and therefore polynomial networks of constant depth are efficiently learnable. Another way to learn this class is using support vector machines with polynomial kernels.
An interesting application of this observation is that depth-2 sigmoidal networks are efficiently learnable with sufficient regularization, as formalized in the result below. This contrasts with Corollary 2, which provides a hardness result without regularization.
Theorem 5. The class N_{2,n,σ_sig,L} can be learned, to accuracy ε, in time poly(T) where T = (1/ε) · O(d^{4L ln(11L^2+1)}).
The idea of the proof is as follows. Suppose that we obtain data from some f ∈ N_{2,n,σ_sig,L}. Based on Thm. 4, there is g ∈ N_{2B_t,nB_n,σ_2} that approximates f to some fixed accuracy ε_0 = 0.5, where B_t and B_n are as defined in Thm. 4 for t = 2. Now we can learn N_{2B_t,nB_n,σ_2} by considering the class of all polynomials of total degree 2^{2B_t}, and applying the linearization technique discussed above. Since f is assumed to separate the data with margin 1 (i.e.
y = sign(f*(x)) and |f*(x)| ≥ 1), then g separates the data with margin 0.5, which is enough for establishing accuracy ε in sample size and time that depend polynomially on 1/ε.

4.3 Learning 2-layer and 3-layer Polynomial Networks

While interesting theoretically, the above results are not very practical, since the time and sample complexity grow very fast with the depth of the network.³ In this section we describe practical, provably correct, algorithms for the special case of depth-2 and depth-3 polynomial networks, with some additional constraints. Although such networks can be learned in polynomial time via explicit linearization (as described in Section 4.2), the runtime and resulting network size scale quadratically (for depth 2) or cubically (for depth 3) with the data dimension d. In contrast, our algorithms and guarantees have a much milder dependence on d.
We first consider 2-layer polynomial networks of the following form:

P_{2,k} = { x ↦ b + w_0^⊤x + Σ_{i=1}^k α_i (w_i^⊤x)^2 : ∀i ≥ 1, |α_i| ≤ 1, ‖w_i‖_2 = 1 }.

This network corresponds to one hidden layer containing k neurons with the squared activation function, where we restrict the input weights of all neurons in the network to have bounded ℓ_2 norm, and where we also allow a direct linear dependency between the input layer and the output layer. We'll describe an efficient algorithm for learning this class, which is based on the GECO algorithm for convex optimization with low-rank constraints [20].

³If one uses SVM with polynomial kernels, the time and sample complexity may be small under margin assumptions in a feature space corresponding to a given kernel. Note, however, that large margin in that space
Note, however, that large margin in that space\nis very different than the assumption we make here, namely, that there is a network with a small number of\nhidden neurons that works well on the data.\n\n6\n\n\fThe goal of the algorithm is to \ufb01nd f that minimizes the objective\n\nm(cid:88)\n\ni=1\n\nR(f ) =\n\n1\nm\n\n(cid:96)(f (xi), yi),\n\n(1)\n\nwhere (cid:96) : R \u00d7 R \u2192 R is a loss function. We\u2019ll assume that (cid:96) is \u03b2-smooth and convex.\nThe basic idea of the algorithm is to gradually add hidden neurons to the hidden layer, in a greedy\nmanner, so as to decrease the loss function over the data. To do so, de\ufb01ne V = {x (cid:55)\u2192 (w(cid:62)x)2 :\n(cid:107)w(cid:107)2 = 1} the set of functions that can be implemented by hidden neurons. Then every f \u2208 P2,r\nis an af\ufb01ne function plus a weighted sum of functions from V. The algorithm starts with f being\nthe minimizer of R over all af\ufb01ne functions. Then at each greedy step, we search for g \u2208 V that\nminimizes a \ufb01rst order approximation of R(f + \u03b7g):\n\nm(cid:88)\n\ni=1\n\nR(f + \u03b7g) \u2248 R(f ) + \u03b7\n\n1\nm\n\n(cid:96)(cid:48)(f (xi), yi)g(xi) ,\n\n(2)\n\ni\n\nm\n\nm\n\n(cid:80)m\ni=1 (cid:96)(cid:48)(f (xi), yi)xix(cid:62)\n\ni\n\n(cid:1) w . The vector w that minimizes this\n(cid:1).\n\nbe rewritten as R(f ) + \u03b7 w(cid:62)(cid:0) 1\nexpression (for positive \u03b7) is the leading eigenvector of the matrix(cid:0) 1\n\nwhere (cid:96)(cid:48) is the derivative of (cid:96) w.r.t. its \ufb01rst argument. Observe that for every g \u2208 V there is some w\nwith (cid:107)w(cid:107)2 = 1 for which g(x) = (w(cid:62)x)2 = w(cid:62)xx(cid:62)w. Hence, the right-hand side of Eq. (2) can\n(cid:80)m\ni=1 (cid:96)(cid:48)(f (xi), yi)xix(cid:62)\nWe add this vector as a hidden neuron to the network.4 Finally, we minimize R w.r.t. the weights\nfrom the hidden layer to the output layer (namely, w.r.t. 
the weights α_i).
The following theorem, which follows directly from Theorem 1 of [20], provides a convergence guarantee for GECO. Observe that the theorem gives a guarantee for learning P_{2,k} if we allow to output an over-specified network.
Theorem 6. Fix some ε > 0. Assume that the loss function is convex and β-smooth. Then if the GECO Algorithm is run for r > 2βk²/ε iterations, it outputs a network f ∈ N_{2,r,σ_2} for which R(f) ≤ min_{f*∈P_{2,k}} R(f*) + ε.
We next consider a hypothesis class consisting of third-degree polynomials, which is a subset of 3-layer polynomial networks (see Lemma 1 in the appendix). The hidden neurons will be functions from the class V = ∪_{i=1}^3 V_i, where V_i = {x ↦ Π_{j=1}^i (w_j^⊤x) : ∀j, ‖w_j‖_2 = 1}. The hypothesis class we consider is P_{3,k} = {x ↦ Σ_{i=1}^k α_i g_i(x) : ∀i, |α_i| ≤ 1, g_i ∈ V}.
The basic idea of the algorithm is the same as for 2-layer networks. However, while in the 2-layer case we could implement each greedy step efficiently by solving an eigenvalue problem, we now face the following tensor approximation problem at each greedy step:

max_{g∈V_3} (1/m) Σ_{i=1}^m ℓ′(f(x_i), y_i) g(x_i) = max_{‖w‖=‖u‖=‖v‖=1} (1/m) Σ_{i=1}^m ℓ′(f(x_i), y_i)(w^⊤x_i)(u^⊤x_i)(v^⊤x_i).

While this is in general a hard optimization problem, we can approximate it - and luckily, an approximate greedy step suffices for success of the greedy procedure. This procedure is given in Figure 1, and is again based on an approximate eigenvector computation. A guarantee for the quality of approximation is given in the appendix, and this leads to the following theorem, whose proof is given in the appendix.
Theorem 7.
Fix some δ, ε > 0. Assume that the loss function is convex and β-smooth. Then if the GECO Algorithm is run for r > 4dβk²/(ε(1-τ)²) iterations, where each iteration relies on the approximation procedure given in Fig. 1, then with probability (1-δ)^r, it outputs a network f ∈ N_{3,5r,σ_2} for which R(f) ≤ min_{f*∈P_{3,k}} R(f*) + ε.

⁴It is also possible to find an approximate solution to the eigenvalue problem and still retain the performance guarantees (see [20]). Since an approximate eigenvector can be found in time O(d) using the power method, the runtime of GECO depends linearly on d.

Figure 1 (Approximate tensor maximization):
Input: {x_i}_{i=1}^m ⊂ R^d, α ∈ R^m, τ, δ.
Output: a ((1-τ)/√d)-approximate solution to max_{‖w‖=‖u‖=‖v‖=1} F(w, u, v), where F(w, u, v) = Σ_i α_i (w^⊤x_i)(u^⊤x_i)(v^⊤x_i).
Pick w_1, ..., w_s i.i.d. according to N(0, I_d), for s = 2d log(1/δ).
For t = 1, ..., s: set w_t ← w_t/‖w_t‖; let A = Σ_i α_i (w_t^⊤x_i) x_i x_i^⊤, and set u_t, v_t such that Tr(u_t^⊤ A v_t) ≥ (1-τ) max_{‖u‖=‖v‖=1} Tr(u^⊤Av).
Return w, u, v, the maximizers of max_{t≤s} F(w_t, u_t, v_t).

5 Experiments

[Figure: test error of GECO versus SGD with ReLU and squared activations, as a function of SGD iterations; errors in the range 0.05-0.09.]

To demonstrate the practicality of GECO to train neural networks for real-world problems, we considered a pedestrian detection problem as follows.
We collected 200k training examples of image patches of size 88x40 pixels, containing either pedestrians (positive examples) or hard negative examples (images that were classified as pedestrians by applying a simple linear classifier in a sliding-window manner). See a few examples of images above. We used half of the examples as a training set and the other half as a test set. We calculated HoG features ([11]) from the images (see footnote 5). We then trained, using GECO, a depth-2 polynomial network on the resulting features, with 40 neurons in the hidden layer. For comparison, we trained the same network architecture (i.e., 40 hidden neurons with a squared activation function) by SGD. We also trained a similar network (again with 40 hidden neurons) with the ReLU activation function. For the SGD implementation we tried the following tricks to speed up convergence: heuristics for initialization of the weights, learning-rate rules, mini-batches, Nesterov's momentum (as explained in [23]), and dropout. The test errors of SGD as a function of the number of iterations are depicted in the top plot of the figure on the side. We also mark the performance of GECO as a horizontal line (since it does not involve SGD iterations). As can be seen, the error of GECO is slightly better than that of SGD. It should also be noted that we had to perform a very large number of SGD iterations to obtain a good solution, while the runtime of GECO was much faster. This indicates that GECO may be a valid alternative to SGD for training depth-2 networks. It is also apparent that the squared activation function is slightly better than the ReLU function for this task.
The second plot of the side figure demonstrates the benefit of over-specification for SGD. We generated random examples in $\mathbb{R}^{150}$ and passed them through a random depth-2 network that contains 60 hidden neurons with the ReLU activation function.
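This synthetic setup can be sketched as follows. The input dimension (150) and the 60 hidden ReLU units of the teacher network are taken from the text; the sample count and the Gaussian weight distributions are our own illustrative assumptions.

```python
import random

random.seed(0)

D_IN, HIDDEN, N_EXAMPLES = 150, 60, 500  # 150 and 60 from the text; 500 is our choice

def relu(z):
    return max(0.0, z)

# Random depth-2 teacher network: 60 hidden ReLU units, linear output layer.
W = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(HIDDEN)]
a = [random.gauss(0, 1) for _ in range(HIDDEN)]

def teacher(x):
    return sum(ai * relu(sum(wj * xj for wj, xj in zip(wi, x)))
               for ai, wi in zip(a, W))

# Random examples in R^150, labeled by passing them through the teacher.
xs = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(N_EXAMPLES)]
data = [(x, teacher(x)) for x in xs]
```

A student network with over-specification factor $c$ would then use $c \cdot 60$ hidden ReLU units and be fit to this data by SGD on the squared loss.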
We then tried to fit a new network to this data with over-specification factors of 1, 2, 4, and 8 (e.g., an over-specification factor of 4 means that we used $60 \cdot 4 = 240$ hidden neurons). As can clearly be seen, SGD converges much faster when we over-specify the network.

[Figure: MSE as a function of the number of SGD iterations (up to $10^5$), for over-specification factors 1, 2, 4, and 8.]

Acknowledgements: This research is supported by Intel (ICRI-CI). OS was also supported by an ISF grant (No. 425/13) and a Marie-Curie Career Integration Grant. SSS and RL were also supported by the MOS center of Knowledge for AI and ML (No. 3-9243). RL is a recipient of the Google Europe Fellowship in Learning Theory, and this research is supported in part by this Google Fellowship. We thank Itay Safran for spotting a mistake in a previous version of Sec. 2, and James Martens for helpful discussions.

5 Using the Matlab implementation provided in http://www.mathworks.com/matlabcentral/fileexchange/33863-histograms-of-oriented-gradients.

References

[1] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning polynomials with neural networks. In ICML, 2014.
[2] A. Andoni, R. Panigrahy, G. Valiant, and L. Zhang. Learning sparse polynomial functions. In SODA, 2014.
[3] M. Anthony and P. Bartlett. Neural Network Learning - Theoretical Foundations. Cambridge University Press, 2002.
[4] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. arXiv preprint arXiv:1310.6343, 2013.
[5] P. Auer, M. Herbster, and M. Warmuth. Exponentially many local minima for single neurons. In NIPS, 1996.
[6] P. L. Bartlett and S. Ben-David. Hardness results for neural network approximation problems. Theoretical Computer Science, 284(1):53-66, 2002.
[7] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798-1828, 2013.
[8] E. Blais, R. O'Donnell, and K. Wimmer. Polynomial regression under arbitrary product distributions. Machine Learning, 80(2-3):273-294, 2010.
[9] A. Blum and R. Rivest. Training a 3-node neural network is NP-complete. Neural Networks, 5(1):117-127, 1992.
[10] G. Dahl, T. Sainath, and G. Hinton. Improving deep neural networks for LVCSR using rectified linear units and dropout. In ICASSP, 2013.
[11] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
[12] A. Daniely, N. Linial, and S. Shalev-Shwartz. From average case complexity to improper learning complexity. In FOCS, 2014.
[13] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777-1805, 2008.
[14] A. Kalai, A. Samorodnitsky, and S.-H. Teng. Learning and smoothed analysis. In FOCS, 2009.
[15] A. Klivans and A. Sherstov. Cryptographic hardness for learning intersections of halfspaces. In FOCS, 2006.
[16] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] Q. V. Le, M.-A. Ranzato, R. Monga, M. Devin, G. Corrado, K. Chen, J. Dean, and A. Y. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
[18] N. Pippenger and M. Fischer. Relations among complexity measures. Journal of the ACM, 26(2):361-381, 1979.
[19] S. Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[20] S. Shalev-Shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In ICML, 2011.
[21] S. Shalev-Shwartz, O. Shamir, and K. Sridharan. Learning kernel-based halfspaces with the 0-1 loss.
SIAM Journal on Computing, 40(6):1623-1646, 2011.
[22] M. Sipser. Introduction to the Theory of Computation. Thomson Course Technology, 2006.
[23] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In ICML, 2013.
[24] M. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901, 2013.