{"title": "Porcupine Neural Networks: Approximating Neural Network Landscapes", "book": "Advances in Neural Information Processing Systems", "page_first": 4831, "page_last": 4841, "abstract": "Neural networks have been used prominently in several machine learning and statistics applications. In general, the underlying optimization of neural networks is non-convex which makes analyzing their performance challenging. In this paper, we take another approach to this problem by constraining the network such that the corresponding optimization landscape has good theoretical properties without significantly compromising performance. In particular, for two-layer neural networks we introduce Porcupine Neural Networks (PNNs) whose weight vectors are constrained to lie over a finite set of lines. We show that most local optima of PNN optimizations are global while we have a characterization of regions where bad local optimizers may exist. Moreover, our theoretical and empirical results suggest that an unconstrained neural network can be approximated using a polynomially-large PNN.", "full_text": "Porcupine Neural Networks:\n\nApproximating Neural Network Landscapes\n\nSoheil Feizi\n\nDepartment of Computer Science\n\nUniversity of Maryland, College Park\n\nsfeizi@cs.umd.edu\n\nDepartment of Electrical and Computer Engineering\n\nHamid Javadi\n\nRice University\n\nhrhakim@rice.edu\n\nDepartment of Electrical Engineering\n\nDepartment of Electrical Engineering\n\nDavid Tse\n\nStanford University\n\ndntse@stanford.edu\n\nJesse Zhang\n\nStanford University\n\njessez@stanford.edu\n\nAbstract\n\nNeural networks have been used prominently in several machine learning and\nstatistics applications. 
In general, the underlying optimization of neural networks\nis non-convex which makes analyzing their performance challenging.\nIn this\npaper, we take another approach to this problem by constraining the network\nsuch that the corresponding optimization landscape has good theoretical properties\nwithout signi\ufb01cantly compromising performance. In particular, for two-layer neural\nnetworks we introduce Porcupine Neural Networks (PNNs) whose weight vectors\nare constrained to lie over a \ufb01nite set of lines. We show that most local optima\nof PNN optimizations are global while we have a characterization of regions\nwhere bad local optimizers may exist. Moreover, our theoretical and empirical\nresults suggest that an unconstrained neural network can be approximated using a\npolynomially-large PNN.\n\n1\n\nIntroduction\n\nNeural networks have been used in several machine learning and statistical inference problems\nincluding regression and classi\ufb01cation tasks. Some successful applications of neural networks and\ndeep learning include speech recognition [1], natural language processing [2], and image classi\ufb01cation\n[3]. The underlying neural network optimization is non-convex in general which makes its training\nNP-complete even for small networks [4]. In practice, however, different variants of local search\nmethods such as the gradient descent algorithm show excellent performance. Understanding the\nreason behind the success of such local search methods is still an open problem in the general case.\nThere have been several recent works in the theoretical literature aiming to study risk landscapes of\nneural networks and deep learning under various modeling assumptions. For example, references [5, 6,\n7, 8] have studied the convergence of the local search algorithms for the neural network optimization\nwith zero hidden neurons and a single output. 
Other works have studied the risk landscapes of neural\nnetwork optimizations for more complex structures under various model assumptions [9, 10, 11,\n12, 13, 14, 15, 16, 17, 18, 19]. In particular, references [13, 14, 15] consider a two-layer neural\nnetwork with Gaussian inputs under a matched (realizable) model where the output is generated from\na network with planted weights. Over-parameterized networks where the number of parameters are\nlarger than the number of training samples have been studied in [20, 21]. In addition, reference [22]\nmakes the argument (through a visualization scheme) that the loss landscape of reduced-capacity\nnetworks such as resnets has fewer bad locals than that of unconstrained ones. Further, [23] presents\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fconditions for a local minimum to be global in the case of having regularization on the network\nweights.\nWe review these prior work in SM 1 Section 2. In this paper, we study a key question: can an\nunconstrained neural network be approximated with a constrained one whose optimization landscape\nhas good theoretical properties? For two-layer neural networks, we provide an af\ufb01rmative answer to\nthis question by introducing a family of constrained neural networks which we refer to as Porcupine\nNeural Networks (PNNs) (Figure 1-a). In PNNs, incoming weight vectors to hidden neurons are\nconstrained to lie on a \ufb01xed set of lines. For example, a neural network with multiple inputs and\nmultiple neurons where each neuron is connected to one input is a PNN since input weight vectors to\nneurons lie on lines parallel to the standard axes (SM Section 3).\nDesigning new objective functions with good landscape properties for non-convex problems has\nbeen explored previously in [19, 15, 24]. Our work provides an alternative procedure to obtain a\nnew objective function for the underlying non-convex optimization. 
Our framework can be extended\nnaturally to deeper networks as well even though in our analysis in this paper, we focus on two-layer\nnetworks. Moreover, a PNN can be viewed as a neural network whose feature vectors (i.e., input\nweight vectors to neurons) are \ufb01xed up to scalings due to the PNN optimization. This view can relate\na random PNN (i.e., a PNN whose lines are random) to the application of random features in kernel\nmachines [25, 26, 27, 28, 29, 30, 6]. Although our results in Sections 4 and 5 are for general PNNs,\nwe study them for random PNNs in Section 6 as well.\nWe analyze population risk landscapes of two-layer PNNs with jointly Gaussian inputs and recti\ufb01ed\nlinear unit (ReLU) activation functions at hidden neurons. We show that under some modeling\nassumptions, most local optima of PNN optimizations are also global optimizers. Moreover, we\ncharacterize the parameter regions where bad local optima (i.e., local optimizers that are not global)\nmay exist.\nNext, we study whether one can approximate an unconstrained (fully-connected) two-layer neural\nnetwork function with a PNN whose number of neurons is polynomially-large in dimension. Our\nempirical results offer an af\ufb01rmative answer to this question. For example, suppose the output data is\ngenerated using an unconstrained two-layer neural network with d = 15 inputs and k\u2217 = 20 hidden\nneurons. Using this data, we train a random two-layer PNN with k hidden neurons. We evaluate the\nPNN approximation error as the mean-squared error (MSE) normalized by the L2 norm of the output\nsamples in a two-fold cross validation setup. As depicted in Figure 1-b, by increasing the number of\nPNN hidden neurons k, the PNN approximation error decreases. Notably, to obtain a relatively small\napproximation error, PNN\u2019s number of hidden neurons does not need to be exponentially large in\ndimension. 
We explain the details of this experiment in Section 7.\nIn Section 6, we study a characterization of the PNN approximation error with respect to the input dimension and the complexity of the unconstrained neural network function. We show that under some modeling assumptions, the PNN approximation error can be bounded by the spectral norm of the generalized Schur complement of a kernel matrix. We analyze this bound for random PNNs in the high-dimensional regime when the ground-truth data is generated using an unconstrained neural network with random weights. For the case where the dimension of the input and the number of hidden neurons increase at the same rate, we compute the asymptotic limit. Finally, in Section 8, we discuss how the proposed PNN framework can potentially be used to explain the success of local search methods such as gradient descent in solving the unconstrained neural network optimization.\nIn summary, PNNs provide three main advantages. First, the PNN loss landscape has nice properties with provably few bad local optima, so we can give some guarantees on the performance of gradient descent. Second, the approximation power of PNNs is good, i.e., our experimental and theoretical results suggest that one can approximate unconstrained networks arbitrarily closely with polynomially-large PNNs. Third, a PNN uses fewer parameters than an unconstrained neural network, thus leading to better generalization and also benefits in terms of storage. Finally, we discuss using the PNN framework in general deep neural networks in Section 8.\n\n1In this document, we refer to pointers in the supplementary materials using the prefix SM.\n\nh(x; W) := \u2211_{i=1}^{k} \u03c6(w_i^t x),\n\n(1)\n\nFigure 1: (a) A two-layer Porcupine Neural Network (PNN). In a PNN, incoming weight vectors to neurons are constrained to lie over a fixed set of lines in a d-dimensional space. 
(b) Approximations of an unconstrained two-layer neural network with d = 15 inputs and k\u2217 = 20 hidden neurons using random two-layer PNNs.\n\n2 Unconstrained Neural Networks\nConsider a two-layer neural network with k neurons where the input is in Rd (Figure 1-a). The weight vector from the input to the i-th neuron is denoted by wi \u2208 Rd. For simplicity, we assume that the second-layer weights are equal to one, and we let h(x; W) be as defined in (1), where x = (x1, ..., xd)t and W := (w1, w2, ..., wk) \u2208 W \u2286 Rd\u00d7k. The activation function at each neuron is assumed to be \u03c6(z) := ReLU(z) = max(z, 0).\nConsider F, the set of all functions f : Rd \u2192 R that can be realized with a neural network described in (1). In other words,\n\nF := {f : Rd \u2192 R; \u2203W \u2208 W, f(x) = h(x; W), \u2200x \u2208 Rd}.\n\n(2)\n\nIn a fully connected neural network structure, W = Rd\u00d7k. We refer to this case as the unconstrained neural network. Note that particular network architectures can impose constraints on W.\nLet x \u223c N(0, I). We consider the population risk defined as the mean squared error (MSE):\n\nL(W) := E[(h(x; W) \u2212 y)2],\n\n(3)\n\nwhere y is the output variable. If y is generated by a neural network with the same architecture described by (1), we have y = h(x; Wtrue).\nUnderstanding the population risk function is an important step towards characterizing the empirical risk landscape [5]. In this paper, for simplicity, we focus only on the population risk. The neural network optimization can be written as follows:\n\nmin_W L(W) subject to W \u2208 W.\n\n(4)\n\nLet W\u2217 be a global optimum of this optimization. L(W\u2217) = 0 means that y can be generated by a neural network with the same architecture (i.e., Wtrue is a global optimum). We refer to this case as matched. Moreover, we refer to the case of L(W\u2217) > 0 as mismatched. 
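As a concrete illustration of the model (1) and the population risk (3), the following sketch (our own, not the authors' code; all names are ours) builds the two-layer ReLU network and a Monte Carlo estimate of the MSE under x ~ N(0, I):

```python
import numpy as np

def h(x, W):
    # Two-layer network of (1): sum over neurons of ReLU(w_i^t x),
    # with second-layer weights fixed to one.
    return np.maximum(W.T @ x, 0.0).sum(axis=0)

def population_risk_mc(W, W_true, d, n=100_000, seed=0):
    # Monte Carlo estimate of the population MSE (3) with x ~ N(0, I_d);
    # in the matched case W = W_true this is exactly zero.
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((d, n))                  # columns are samples x
    y_hat = np.maximum(W.T @ X, 0.0).sum(axis=0)     # h(x; W) per sample
    y = np.maximum(W_true.T @ X, 0.0).sum(axis=0)    # h(x; W_true) per sample
    return float(np.mean((y_hat - y) ** 2))
```

In the matched case the estimated risk is identically zero, while any W differing from W_true on a set of positive measure yields a strictly positive risk.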
Optimization (4) is non-convex in general owing to the nonlinear activation functions in neurons.\n\n3 Porcupine Neural Networks\n\nCharacterizing the landscape of the objective function of optimization (4) is challenging in general. In this paper, we consider a constrained version of this optimization where weight vectors belong to a finite set of lines in a d-dimensional space (Figure 1-a). This constraint may arise either from the neural network architecture or can be imposed by design.\nMathematically, let L = {L1, ..., Lr} be a set of lines in a d-dimensional space. Let Gi be the set of neurons whose incoming weight vectors lie over the line Li. Therefore, we have G1 \u222a ... \u222a Gr = {1, ..., k}. Moreover, we assume Gi \u2260 \u2205 for 1 \u2264 i \u2264 r; otherwise that line can be removed from the set L. For every j \u2208 Gi, we define the function g(.) such that g(j) = i.\nFor a given set L and a neuron-to-line mapping G, we define FL,G \u2286 F as the set of all functions that can be realized with a neural network (1) where wi lies over the line Lg(i). Namely,\n\nFL,G := {f : Rd \u2192 R; \u2203W = (w1, ..., wk), wi \u2208 Lg(i), f(x) = h(x; W), \u2200x \u2208 Rd}.\n\n(5)\n\nWe refer to this family of neural networks as Porcupine Neural Networks (PNNs). In general, functions described by PNNs (i.e., FL,G) can be viewed as angular discretizations of functions described by unconstrained neural networks (i.e., F). By increasing the size of |L| (i.e., the number of lines), we can approximate every f \u2208 F by some \u02c6f \u2208 FL,G arbitrarily closely. 
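Concretely, the PNN constraint in (5) says each weight vector is a signed scalar multiple of a fixed unit direction. A minimal sketch of this parameterization (our own names and construction, assuming the lines are given by unit vectors u_1, ..., u_r):

```python
import numpy as np

def pnn_weights(c, U, g):
    # W = (w_1, ..., w_k) with w_i = c_i * u_{g(i)}, so w_i lies on line L_{g(i)};
    # c: length-k signed scalars, U: d x r unit line directions (columns),
    # g: length-k neuron-to-line index map.
    return U[:, g] * c

# Example: d = 3, r = 2 lines along the first two axes, k = 4 neurons.
U = np.eye(3)[:, :2]
g = np.array([0, 0, 1, 1])
c = np.array([2.0, -1.0, 3.0, 0.5])
W = pnn_weights(c, U, g)   # each column of W lies on its assigned line
```

Training then acts on the scalars c (equivalently, gradient updates of W are projected onto the lines, as the experiments in Section 7 do); increasing r refines the angular discretization of F.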
Thus, characterizing the landscape of the loss function over PNNs can help us understand the landscape of the unconstrained loss function.\nThe PNN optimization can be written as\n\nmin_W L(W) subject to wi \u2208 Lg(i), 1 \u2264 i \u2264 k.\n\n(6)\n\nMatched and mismatched PNN optimizations are defined in a manner similar to the unconstrained ones.\n\n4 Population Risk Landscapes of Matched PNNs\n\nWe say a vector has a positive orientation if its component at the largest non-zero index is positive. Otherwise, it has a negative orientation. For example, w1 = (\u22121, 2, 0, 3, 0) has a positive orientation because w1(4) > 0, while the vector w2 = (\u22121, 2, 0, 0, \u22123) has a negative orientation because w2(5) < 0. Mathematically, let \u00b5(wi) be the largest index of the vector wi with a non-zero entry, i.e., \u00b5(wi) = arg max_j (wi(j) \u2260 0). We define s(wi) = 1 if wi(\u00b5(wi)) > 0; otherwise s(wi) = \u22121.\nLet ui be a unit norm vector on the line Li such that s(ui) = 1. Let UL = (u1, ..., ur). Let AL \u2208 Rr\u00d7r be a matrix whose (i, j)-component is the angle between lines Li and Lj, i.e., AL(i, j) = \u03b8_{ui,uj}. Moreover, let KL = UtLUL = cos[AL].\nWe define\n\nq_r := \u2211_{i\u2208Gr} \u2016wi\u2016, q\u2217_r := \u2211_{i\u2208Gr} \u2016w\u2217_i\u2016,\n\n(7)\n\nwhere \u2016.\u2016 is the Euclidean (L2) norm. Also, we define q := (q1, ..., qr)t and q\u2217 := (q\u2217_1, ..., q\u2217_r)t.\nDefine the kernel function \u03c8 : [\u22121, 1] \u2192 R as\n\n\u03c8(x) = x + (2/\u03c0)(\u221a(1 \u2212 x2) \u2212 x cos\u22121(x)).\n\n(8)\n\nSome properties of this kernel function are explained in SM Section 4. 
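The kernel (8) is simple to implement and sanity-check; a sketch (ours, not the authors' code), with the closed-form values \u03c8(1) = 1 and \u03c8(0) = 2/\u03c0 that reappear in Section 6:

```python
import numpy as np

def psi(x):
    # Kernel of (8): psi(x) = x + (2/pi) * (sqrt(1 - x^2) - x * arccos(x)), x in [-1, 1];
    # clipping guards against floating-point inner products slightly outside [-1, 1].
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    return x + (2.0 / np.pi) * (np.sqrt(1.0 - x ** 2) - x * np.arccos(x))
```

Applied entrywise to a Gram matrix of unit vectors, this gives the matrix \u03c8[KL] appearing in the loss decompositions below; numerically, the entrywise map preserves positive semidefiniteness.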
In the following theorem, we show that this kernel function plays an important role in characterizing optimizers of optimization (6). In particular, we show that the objective function of the neural network optimization has a term where this kernel function is applied (component-wise) to the inner product matrix among the vectors u1, ..., ur.\n\nTheorem 1 The loss function (3) for a matched PNN can be written as\n\nL(W) = (1/4) \u2016\u2211_{i=1}^{k} (wi \u2212 w\u2217_i)\u2016^2 + (1/4) (q \u2212 q\u2217)t \u03c8[KL] (q \u2212 q\u2217),\n\n(9)\n\nwhere \u03c8(.) is defined as in (8) and q and q\u2217 are defined as in (7).\n\nThe kernel function \u03c8(.) has a linear term and a nonlinear term. Note that the inner product matrix KL is positive semidefinite. Below, we show that applying the kernel function \u03c8(.) (component-wise) to KL preserves this property.\nLemma 1 For every L, \u03c8[KL] is positive semidefinite.\nNext, we characterize local optimizers of optimization (6) for a general PNN. Define R(s1, ..., sr) as the space of W where si is the sign vector of the weights wj over the line Li (i.e., j \u2208 Gi).\nTheorem 2 For a general PNN, in regions R(s1, ..., sr) where at least d of the si\u2019s are not equal to \u00b1(1, 1, ..., 1), every local optimizer of optimization (6) is a global optimizer.\nIn practice, the number of lines r is much larger than the number of inputs d (i.e., r \u226b d). 
Thus, the condition of Theorem 2, which requires d out of the r variables si not to be equal to \u00b11, is likely to be satisfied if we initialize the local search algorithm randomly (SM Section 5).\n\n5 Population Risk Landscapes of Mismatched PNNs\n\nIn this section, we characterize the population risk landscape of a mismatched PNN optimization, where the model that generates the data and the model used in the PNN optimization are different. We assume that the output variable y is generated using a two-layer PNN with k\u2217 neurons whose weights lie on the set of lines L\u2217 with neuron-to-line mapping G\u2217. That is,\n\ny = \u2211_{i=1}^{k\u2217} ReLU((w\u2217_i)t x),\n\n(10)\n\nwhere w\u2217_i lies on a line in the set L\u2217 for 1 \u2264 i \u2264 k\u2217. The neural network optimization (6) is over PNNs with k neurons over the set of lines L with the neuron-to-line mapping G. Note that L and G can be different from L\u2217 and G\u2217, respectively.\nLet r = |L| and r\u2217 = |L\u2217| be the number of lines in L and L\u2217, respectively. Let u\u2217_i be the unit norm vector on the line L\u2217_i \u2208 L\u2217 such that s(u\u2217_i) = 1. Similarly, we define ui as the unit norm vector on the line Li \u2208 L such that s(ui) = 1. Let UL = (u1, ..., ur) and UL\u2217 = (u\u2217_1, ..., u\u2217_{r\u2217}). Suppose the rank of UL is at least d. Define\n\nKL = UtL UL \u2208 Rr\u00d7r, KL\u2217 = UtL\u2217 UL\u2217 \u2208 Rr\u2217\u00d7r\u2217, KL,L\u2217 = UtL UL\u2217 \u2208 Rr\u00d7r\u2217.\n\n(11)\n\nTheorem 3 The loss function (3) for a mismatched PNN can be written as\n\nL(W) = (1/4) \u2016\u2211_{i=1}^{k} wi \u2212 \u2211_{i=1}^{k\u2217} w\u2217_i\u2016^2 + (1/4) qt \u03c8[KL] q + (1/4) (q\u2217)t \u03c8[KL\u2217] q\u2217 \u2212 (1/2) qt \u03c8[KL,L\u2217] q\u2217,\n\n(12)\n\nwhere \u03c8(.) is defined as in (8) and q and q\u2217 are defined as in (7) using G and G\u2217, respectively.\nCorollary 1 Let\n\nK = ( KL KL,L\u2217 ; KtL,L\u2217 KL\u2217 ) \u2208 R(r+r\u2217)\u00d7(r+r\u2217).\n\n(13)\n\nThen, the loss function of a mismatched PNN can be lower bounded as\n\nL(W) \u2265 (1/4) \u2016q\u2217\u2016^2 \u03bbmin(\u03c8[K]/\u03c8[KL]),\n\n(14)\n\nwhere \u03c8[K]/\u03c8[KL] := \u03c8[KL\u2217] \u2212 \u03c8[KL,L\u2217]t \u03c8[KL]\u2020 \u03c8[KL,L\u2217] is the generalized Schur complement of the block \u03c8[KL] in the matrix \u03c8[K].\n\nNext, we characterize local optimizers of optimization (6) for a mismatched PNN. Similar to the matched PNN case, we define R(s1, ..., sr) as the space of W where si is the vector of sign variables of the weight vectors over the line Li.\n\nTheorem 4 For a mismatched PNN, in regions R(s1, ..., sr) where at least d of the si\u2019s are not equal to \u00b1(1, 1, ..., 1), every local optimizer of optimization (6) is a global optimizer. Moreover, at those points we have\n\nL(W\u2217) = (1/4) (q\u2217)t (\u03c8[K]/\u03c8[KL]) q\u2217 \u2264 (1/4) \u2016q\u2217\u2016^2 \u2016\u03c8[K]/\u03c8[KL]\u2016.\n\n(15)\n\nWhen the condition of Theorem 4 holds, the spectral norm \u2016\u03c8[K]/\u03c8[KL]\u2016 provides an upper bound on the loss value at global optimizers of the mismatched PNN. In Section 6, we study this bound in more detail. Moreover, in SM Section 7, we study the case where the condition of Theorem 4 does not hold (i.e., the local search method converges to a point in parameter regions where more than r \u2212 d of the variables si are equal to \u00b11). 
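The upper-bound factor of Theorem 4 is easy to evaluate numerically. The sketch below (our construction, not the paper's code) draws random unit line directions and computes the spectral norm of the generalized Schur complement \u03c8[K]/\u03c8[KL]:

```python
import numpy as np

def psi(x):
    # kernel of (8), applied entrywise; clipping guards floating-point roundoff
    x = np.clip(np.asarray(x, dtype=float), -1.0, 1.0)
    return x + (2.0 / np.pi) * (np.sqrt(1.0 - x ** 2) - x * np.arccos(x))

def random_lines(d, r, seed=0):
    # r random unit directions in R^d, as columns
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((d, r))
    return U / np.linalg.norm(U, axis=0)

def schur_bound(U, U_star):
    # Spectral norm of psi[K]/psi[K_L] = psi[K_L*] - psi[K_{L,L*}]^t psi[K_L]^+ psi[K_{L,L*}],
    # the upper-bound factor on the loss at global optimizers in Theorem 4.
    cross = psi(U.T @ U_star)
    S = psi(U_star.T @ U_star) - cross.T @ np.linalg.pinv(psi(U.T @ U)) @ cross
    return float(np.linalg.norm(S, 2))
```

Consistent with the monotonicity result proved next (enlarging L can only shrink the bound), adding columns to U does not increase `schur_bound`.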
Finally, we study a minimax analysis of the naive nearest-line approximation approach in SM Section 8.\n\n6 PNN Approximations of Unconstrained Neural Networks\n\nIn this section, we study whether an unconstrained two-layer neural network function can be approximated by a PNN. We assume that the unconstrained neural network has d inputs and k\u2217 hidden neurons. This neural network function can also be viewed as a PNN whose lines are determined by the input weight vectors to neurons. Thus, in this case r\u2217 \u2264 k\u2217, where r\u2217 is the number of lines of the original network. If weights are generated randomly, with probability one, r\u2217 = k\u2217 since the probability that two random vectors lie on the same line is zero. Note that the lines of the ground-truth PNN (i.e., the unconstrained neural network) are unknown at the training step. For training, we use a two-layer PNN with r lines, drawn uniformly at random, and k neurons. Since we have ReLU activation functions at neurons, without loss of generality, we can assume k = 2r, i.e., for every line we assign two neurons (one for potential weight vectors with positive orientations on that line and the other one for potential weight vectors with negative orientations). Since there is a mismatch between the model generating the data and the model used for training, we will have an approximation error. In this section, we study this approximation error as a function of the parameters d, r and r\u2217.\nSince these lines will be different from L\u2217, the neural network optimization can be formulated as a mismatched PNN optimization as studied in Section 5. Moreover, in this section, we assume the condition of Theorem 4 holds, i.e., the local search algorithm converges to a point in parameter regions where at least d of the variables si are not equal to \u00b11. 
The case that violates this condition is more complicated and is investigated in SM Section 7.\nUnder the condition of Theorem 4, the PNN approximation error depends on both \u2016q\u2217\u2016 and \u2016\u03c8[K]/\u03c8[KL]\u2016. The former term provides a scaling normalization for the loss function. Thus, we focus on analyzing the latter term.\nSince Theorem 4 provides an upper bound on the mismatched PNN optimization loss using \u2016\u03c8[K]/\u03c8[KL]\u2016, intuitively, increasing the number of lines in L should decrease \u2016\u03c8[K]/\u03c8[KL]\u2016. We prove this in the following theorem.\nTheorem 5 Let K be defined as in (13). We add a distinct line to the set L, i.e., Lnew = L \u222a {Lr+1}. Define\n\nKnew = ( KLnew KLnew,L\u2217 ; KtLnew,L\u2217 KL\u2217 ) = ( 1 zt1 zt2 ; z1 KL KL,L\u2217 ; z2 KtL,L\u2217 KL\u2217 ) \u2208 R(r+r\u2217+1)\u00d7(r+r\u2217+1).\n\n(16)\n\nThen, we have\n\n\u2016\u03c8[Knew]/\u03c8[KLnew]\u2016 \u2264 \u2016\u03c8[K]/\u03c8[KL]\u2016.\n\n(17)\n\nMore specifically,\n\n\u03c8[Knew]/\u03c8[KLnew] = \u03c8[K]/\u03c8[KL] \u2212 \u03b1vvt,\n\n(18)\n\nwhere \u03b1 = (1 \u2212 \u27e8\u03c8[z1], \u03c8[KL]\u22121\u03c8[z1]\u27e9)\u22121 \u2265 0 and v = \u03c8[z2] \u2212 \u03c8[KL,L\u2217]t\u03c8[KL]\u22121\u03c8[z1].\n\nTheorem 5 indicates that adding lines to L decreases \u2016\u03c8[K]/\u03c8[KL]\u2016. However, it does not characterize the rate of this decrease as a function of r, r\u2217 and d.\n\nFigure 2: The spectral norm of \u03c8[K]/\u03c8[KL] when d = r. Theoretical limits are described in Theorem 6. Experiments have been repeated 100 times. Average results are shown.\n\nNext, we characterize the asymptotic behaviour of \u2016\u03c8[K]/\u03c8[KL]\u2016 when d, r \u2192 \u221e. 
There has been some recent interest in characterizing the spectrum of inner product kernel random matrices [31, 32, 33, 34]. If the kernel is linear, the distribution of eigenvalues of the covariance matrix follows the well-known Marcenko-Pastur law. If the kernel is nonlinear, reference [31] shows that in the high-dimensional regime where d, r \u2192 \u221e and \u03b3 = r/d \u2208 (0, \u221e) is fixed, only the linear part of the kernel function affects the spectrum. Note that the matrix of interest in our problem is the Schur complement matrix \u03c8[K]/\u03c8[KL], not \u03c8[K]. However, we can use results characterizing the spectrum of \u03c8[K] to characterize the spectrum of \u03c8[K]/\u03c8[KL].\nWe consider the regime where r, d \u2192 \u221e while \u03b3 = r/d \u2208 (0, \u221e) is a fixed number. Theorem 2.1 of reference [31] shows that in this regime and under some mild assumptions on the kernel function (which our kernel function \u03c8(.) satisfies), \u03c8[KL] converges (in probability) to the following matrix:\n\nRL = (\u03c8(0) + \u03c8''(0)/(2d)) 11t + \u03c8'(0) UtL UL + (\u03c8(1) \u2212 \u03c8(0) \u2212 \u03c8'(0)) Ir.\n\n(19)\n\nTo obtain this formula, one can write the Taylor expansion of the kernel function \u03c8(.) near 0. It turns out that in the regime where r, d \u2192 \u221e while d/r is fixed, it is sufficient to replace \u03c8(.) with its linear part for the off-diagonal elements of \u03c8[KL]. However, the diagonal elements of \u03c8[KL] should be adjusted accordingly (the last term in (19)). For the kernel function of our interest, defined as in (8), we have \u03c8'(0) = 0, \u03c8''(0) = 2/\u03c0, \u03c8(0) = 2/\u03c0 and \u03c8(1) = 1. 
This simplifies (19) further to:\n\nRL = (2/\u03c0 + 1/(\u03c0d)) 11t + (1 \u2212 2/\u03c0) Ir.\n\n(20)\n\nThis matrix has (r \u2212 1) eigenvalues of 1 \u2212 2/\u03c0 and one eigenvalue of (2/\u03c0)r + 1 \u2212 2/\u03c0 + \u03b3/\u03c0. Using this result, we characterize \u2016\u03c8[K]/\u03c8[KL]\u2016 in the following theorem:\nTheorem 6 Let L and L\u2217 have r and r\u2217 lines in Rd generated uniformly at random, respectively. Let d, r \u2192 \u221e while \u03b3 = r/d \u2208 (0, \u221e) is fixed. Moreover, r\u2217/r = O(1). Then,\n\n\u2016\u03c8[K]/\u03c8[KL]\u2016 \u2192 (1 + r\u2217/r)(1 \u2212 2/\u03c0),\n\n(21)\n\nwhere the convergence is in probability.\n\nFigure 2 shows the spectral norm of \u03c8[K]/\u03c8[KL] when d = r. As illustrated in this figure, the empirical results closely match the analytical limits of Theorem 6. Note that by increasing the ratio r/r\u2217, \u2016\u03c8[K]/\u03c8[KL]\u2016, and therefore the PNN approximation error, decreases. If r\u2217 is constant, the limit is 1 \u2212 2/\u03c0 \u2248 0.36.\nTheorem 6 provides a bound on \u2016\u03c8[K]/\u03c8[KL]\u2016 in the asymptotic regime. In the following corollary, we use this result to bound the PNN approximation error measured as the MSE normalized by the L2 norm of the output variables (i.e., L(W = 0)).\n\nFigure 3: (a) Gamma curves fit to the histogram of PNN approximation errors for different values of k. (b) Normalized losses obtained by training unconstrained networks and PNNs on the MNIST dataset.\n\nProposition 1 Let W\u2217 be the global optimizer of the mismatched PNN optimization under the setup of Theorem 6. 
Then, with high probability, we have\n\nL(W\u2217)/L(W = 0) \u2264 (1 + r\u2217/r)(1 \u2212 2/\u03c0).\n\n(22)\n\nThis proposition indicates that in the asymptotic regime with d and r growing at the same rate (i.e., r is a constant factor of d), the PNN is able to explain a fraction of the variance of the output variable. In practice, however, r should grow faster than d in order to obtain a small PNN approximation error.\n\n7 Experiments on Synthetic and MNIST Datasets\n\nIn this section, we numerically evaluate the performance of PNNs on synthetic and MNIST datasets. We use random PNNs as described in Section 5. To enforce the PNN architecture, we project gradients along the directions of the PNN lines before updating the weights. For more details on these experiments, see SM Section 9.\nIn the synthetic data experiments, we generated data using a fully-connected two-layer network with d = 15 inputs and k\u2217 = 20 hidden neurons. We generated 10,000 ground-truth training samples and 10,000 test samples using a set of randomly chosen weights for the network. For PNNs, we use 10 \u2264 k \u2264 100 hidden neurons. For each value of k, we perform 25 trials. We train the PNN via stochastic gradient descent using batches of size 100, 100 training epochs, no momentum, and a learning rate of 10\u22123 that decays at a rate of 0.95 every 390 steps. For evaluation, we compute the normalized MSE (i.e., the MSE normalized by the L2 norm of y) on the test set over different initializations. The results are shown in Figures 1-b and 3. Figure 3-a shows that as k increases, the PNN approximation gets better, which is consistent with our theoretical results in Section 6.\nNext, we evaluate PNNs on MNIST. We first trained a dense network on a subset of the MNIST handwritten digits dataset. 
Of the 10 classes of 28x28 MNIST images, we only looked at images of 1\u2019s and 2\u2019s, assigning them the labels y = 1 and y = 2, respectively. This resulted in n = 11,649 training samples and 2,167 test samples. The network has the structure shown in Figure 1-a, except that the weight vectors in the first layer are not constrained. Only first-layer weights were updated during training. The unconstrained network has k\u2217 = 512 hidden neurons.\nWe then approximated this trained network using PNNs of similar structures, as previously described. We tested k = 512, 1024, 2048 (recall that each k involves r = k/2 unique lines). Figure 3-b shows the normalized losses obtained using unconstrained networks and PNNs. As illustrated in this figure, the PNN approximation error is small even with numbers of neurons similar to that of the unconstrained network.\n\n8 Conclusion\n\nIn this paper, we introduced a family of constrained neural networks, called Porcupine Neural Networks (PNNs), whose population risk landscapes have good theoretical properties, i.e., most local optima of PNN optimizations are global, while we have a characterization of the parameter regions where bad local optimizers may exist. We also showed that an unconstrained (fully-connected) neural network function can be approximated by a polynomially-large PNN. In particular, we provided approximation bounds at global optima and also at bad local optima (under some conditions) of the PNN optimization. These results may provide a means of explaining the success of local search methods in solving the unconstrained neural network optimization, because every bad local optimum of the unconstrained problem can be viewed as a local optimum (either good or bad) for a PNN-constrained problem, which has a bounded loss according to our results. 
We leave further explorations of this\nidea for future work.\nIn Section 6, we used a set of projection lines for PNNs that are generated uniformly randomly. The\nchoice of the projection lines will affect the kernel matrix and thus can affect the capacity of a PNN.\nA more sophisticated design can have a higher density of projection lines along directions with high\nvariances. Exploring the design of projection lines for PNNs can be an interesting direction for future\nwork. In addition, it is possible to extend our results to feed-forward neural networks with more than\ntwo layers by constraining weight vectors in every layer, except the last one, to lie on \ufb01xed projection\nlines. In that case and using the method proposed in [26], we can generalize our analysis by replacing\nthe kernel function of (8) with a kernel that also depends on the architecture. Finally, other extensions\nof PNNs to network architectures with different activation functions than ReLU, and other types of\noperations (such as convolutions) are also among interesting directions for future work.\n\n9 Code\n\nWe provide code for PNN experiments in the following link:\nhttps://github.com/jessemzhang/porcupine_neural_networks\n\n10 Acknowledgment\n\nThis work supported by the Center for Science of Information (CSoI), an NSF Science and Technology\nCenter, under grant agreement CCF-0939370.\n\nReferences\n[1] Abdel-rahman Mohamed, George E Dahl, and Geoffrey Hinton. Acoustic modeling using deep\nbelief networks. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):14\u201322,\n2012.\n\n[2] Ronan Collobert and Jason Weston. A uni\ufb01ed architecture for natural language processing: Deep\nneural networks with multitask learning. In Proceedings of the 25th international conference\non Machine learning, pages 160\u2013167. ACM, 2008.\n\n[3] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep\nconvolutional neural networks. 
In Advances in Neural Information Processing Systems, pages 1097-1105, 2012.

[4] Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances in Neural Information Processing Systems, pages 494-501, 1989.

[5] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses. arXiv preprint arXiv:1607.06534, 2016.

[6] Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1594-1602, 2015.

[7] Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, pages 927-935, 2011.

[8] Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591, 2017.

[9] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586-594, 2016.

[10] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.

[11] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[12] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192-204, 2015.

[13] Yuandong Tian. Symmetry-breaking convergence analysis of certain two-layered neural networks with ReLU nonlinearity. 2016.

[14] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its applications in convergence and critical point analysis.
arXiv preprint arXiv:1703.00560, 2017.

[15] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.

[16] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a ConvNet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

[17] Qiuyi Zhang, Rina Panigrahy, Sushant Sachdeva, and Ali Rahimi. Electron-proton dynamics in deep learning. arXiv preprint arXiv:1702.00458, 2017.

[18] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activation. arXiv preprint arXiv:1705.09886, 2017.

[19] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

[20] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

[21] Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

[22] Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. arXiv preprint arXiv:1712.09913, 2017.

[23] Rene Vidal, Joan Bruna, Raja Giryes, and Stefano Soatto. Mathematics of deep learning. arXiv preprint arXiv:1712.04741, 2017.

[24] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[25] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2008.

[26] Amit Daniely, Roy Frostig, Vineet Gupta, and Yoram Singer. Random features for compositional kernels.
arXiv preprint arXiv:1703.07872, 2017.

[27] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253-2261, 2016.

[28] Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pages 342-350, 2009.

[29] Francis Bach. Breaking the curse of dimensionality with convex neural networks. Journal of Machine Learning Research, 18(19):1-53, 2017.

[30] Uri Heinemann, Roi Livni, Elad Eban, Gal Elidan, and Amir Globerson. Improper deep kernels. In Artificial Intelligence and Statistics, pages 1159-1167, 2016.

[31] Noureddine El Karoui et al. The spectrum of kernel random matrices. The Annals of Statistics, 38(1):1-50, 2010.

[32] Y Do and V Vu. The spectrum of random kernel matrices. arXiv preprint arXiv:1206.3763, 2012.

[33] Xiuyuan Cheng and Amit Singer. The spectrum of random inner-product kernel matrices. Random Matrices: Theory and Applications, 2(04):1350010, 2013.

[34] Zhou Fan and Andrea Montanari. The spectral norm of random inner-product kernel matrices. arXiv preprint arXiv:1507.05343, 2015.