{"title": "Adding One Neuron Can Eliminate All Bad Local Minima", "book": "Advances in Neural Information Processing Systems", "page_first": 4350, "page_last": 4360, "abstract": "One of the main difficulties in analyzing neural networks is the non-convexity of the loss function which may have many bad local minima. In this paper, we study the landscape of neural networks for binary classification tasks. Under mild assumptions, we prove that after adding one special neuron with a skip connection to the output, or one special neuron per layer, every local minimum is a global minimum.", "full_text": "Adding One Neuron Can Eliminate All Bad\n\nLocal Minima\n\nShiyu Liang\n\nCoordinated Science Laboratory\n\nDept. of Electrical and Computer Engineering\nUniversity of Illinois at Urbana-Champaign\n\nsliang26@illinois.edu\n\nRuoyu Sun\n\nCoordinated Science Laboratory\n\nDepartment of ISE\n\nUniversity of Illinois at Urbana-Champaign\n\nruoyus@illinois.edu\n\nJason D. Lee\n\nMarshall School of Business\n\nUniversity of Southern California\njasonlee@marshall.usc.edu\n\nR. Srikant\n\nCoordinated Science Laboratory\n\nDept. of Electrical and Computer Engineering\nUniversity of Illinois at Urbana-Champaign\n\nrsrikant@illinois.edu\n\n\u2217\n\nAbstract\n\nOne of the main dif\ufb01culties in analyzing neural networks is the non-convexity\nof the loss function which may have many bad local minima. In this paper, we\nstudy the landscape of neural networks for binary classi\ufb01cation tasks. Under mild\nassumptions, we prove that after adding one special neuron with a skip connection\nto the output, or one special neuron per layer, every local minimum is a global\nminimum.\n\nIntroduction\n\n1\nDeep neural networks have recently achieved huge success in various machine learning tasks (see,\nKrizhevsky et al. 2012; Goodfellow et al. 2013; Wan et al. 2013, for example). However, a theoretical\nunderstanding of neural networks is largely lacking. 
One of the difficulties in analyzing neural networks is the non-convexity of the loss function, which allows the existence of many local minima with large losses. This was long considered a bottleneck of neural networks, and one of the reasons why convex formulations such as support vector machines (Cortes & Vapnik, 1995) were previously preferred. Given the recent empirical success of deep neural networks, an interesting question is whether the non-convexity of the neural network loss is really an issue.

It has been widely conjectured that all local minima of the empirical loss lead to similar training performance (LeCun et al., 2015; Choromanska et al., 2015). For example, prior works empirically showed that neural networks with identical architectures but different initialization points can converge to local minima with similar classification performance (Krizhevsky et al., 2012; He et al., 2016; Huang & Liu, 2017). On the theoretical side, there have been many recent attempts to analyze the landscape of neural network loss functions. A few works have studied deep networks, but they either require linear activation functions (Baldi & Hornik, 1989; Kawaguchi, 2016; Freeman & Bruna, 2016; Hardt & Ma, 2017; Yun et al., 2017), or require assumptions such as independence of ReLU activations (Choromanska et al., 2015) and significant overparametrization (Nguyen & Hein, 2017a,b; Livni et al., 2014). There is a large body of work that studies single-hidden-layer neural networks and provides various conditions under which a local search algorithm can find a global minimum (Du & Lee, 2018; Ge et al., 2018; Andoni et al., 2014; Sedghi & Anandkumar, 2014; Janzamin et al., 2015; Haeffele & Vidal, 2015; Gautier et al., 2016; Brutzkus & Globerson,

∗Correspondence to R. 
Srikant, rsrikant@illinois.edu and Ruoyu Sun, ruoyus@illinois.edu

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

2017; Soltanolkotabi, 2017; Soudry & Hoffer, 2017; Goel & Klivans, 2017; Du et al., 2017; Zhong et al., 2017; Li & Yuan, 2017; Liang et al., 2018; Mei et al., 2018). This body of work can be roughly divided into two categories: non-global landscape analysis and global landscape analysis. In the first category, the results do not apply to all local minima. One typical conclusion concerns the local geometry, i.e., in a small neighborhood of the global minima no bad local minima exist (Zhong et al., 2017; Du et al., 2017; Li & Yuan, 2017). Another typical conclusion is that a subset of local minima are global minima (Haeffele et al., 2014; Haeffele & Vidal, 2015; Soudry & Carmon, 2016; Nguyen & Hein, 2017a,b). Shamir (2018) has shown that a subset of second-order local minima can perform nearly as well as linear predictors. The presence of various conclusions reflects the difficulty of the problem: while analyzing the global landscape seems hard, we may step back and analyze the local landscape or a "majority" of the landscape. In the second category, global landscape analysis, the typical result is that every local minimum is a global minimum. However, even for single-layer networks, strong assumptions such as over-parameterization, very special neuron activation functions, fixed second-layer parameters and/or a Gaussian data distribution are often needed in the existing works. The presence of various strong assumptions also reflects the difficulty of the problem: even for a single-hidden-layer nonlinear neural network, it seems hard to analyze the landscape, so it is reasonable to make various assumptions.

One exception is the recent work of Liang et al. 
(2018), which adopts a different path: instead of simply making several assumptions to obtain positive results, it carefully studies the effect of various conditions on the landscape of neural networks for binary classification. It gives both positive and negative results on the existence of bad local minima under different conditions. In particular, it studies many common types of neuron activation functions and shows that for one class of neurons there is no bad local minimum, while for others bad local minima exist. This clearly shows that the choice of neurons can affect the landscape. A natural question is then: while Liang et al. (2018) considers some special types of data and a broad class of neurons, can we obtain results for more general data when limiting ourselves to a smaller class of neurons?

1.1 Our Contributions

Given this context, our main result is quite surprising: for a neural network with a special type of neuron, every local minimum is a global minimum of the loss function. The result requires no assumption on the network size, the specific type of the original neural network, etc., yet it applies to every local minimum. Besides the requirement on the neuron activation type, the major trick is an associated regularizer. Our major results and their implications are as follows:

• We focus on the binary classification problem with a smooth hinge loss function. We prove the following result: for any neural network, by adding a special neuron (e.g., an exponential neuron) to the network and adding a quadratic regularizer on this neuron, the new loss function has no bad local minimum. In addition, every local minimum achieves the minimum misclassification error.

• In the main result, the augmented neuron can be viewed as a skip connection from the input to the output layer. 
However, this skip connection is not critical, as the same result also holds if we add one special neuron to each layer of a fully-connected feedforward neural network.

• To our knowledge, this is the first result showing that no spurious local minimum exists for a wide class of deep nonlinear networks. Our result indicates that the class of "good neural networks" (neural networks such that there is an associated loss function with no spurious local minima) contains any network with one special neuron; thus, this class is rather "dense" in the class of all neural networks: any neural network is just one neuron away from a good neural network.

The outline of the paper is as follows. In Section 2, we present notation. In Section 3, we present the main result; several extensions are presented in Section 4. We present the proof idea of the main result in Section 5 and conclude the paper in Section 6. All proofs are presented in the Appendix.

2 Preliminaries

Feed-forward networks. Given an input vector of dimension d, we consider a neural network with L layers of neurons for binary classification. We denote by Ml the number of neurons in the l-th layer (note that M0 = d). We denote the neural activation function by σ. Let Wl ∈ R^{Ml−1×Ml} denote the weight matrix connecting the (l − 1)-th and the l-th layer, and let bl denote the bias vector for the neurons in the l-th layer. Let WL+1 ∈ R^{ML} and bL+1 ∈ R denote the weight vector and the bias scalar in the output layer, respectively. Therefore, the output of the network f : R^d → R can be expressed as

f(x; θ) = W⊤_{L+1} σ(W⊤_L σ(· · · σ(W⊤_1 x + b1) · · · + b_{L−1}) + b_L) + b_{L+1}.  (1)

Loss and error. We use D = {(xi, yi)}_{i=1}^n to denote a dataset containing n samples, where xi ∈ R^d and yi ∈ {−1, 1} denote the feature vector and the label of the i-th sample, respectively. Given a neural network f(x; θ) parameterized by θ and a loss function ℓ : R → R, in binary classification tasks, we define the empirical loss Ln(θ) as the total loss of the network f over the dataset, and we define the training error (also called the misclassification error) Rn(θ; f) as the misclassification rate of the network f on the dataset D, i.e.,

Ln(θ) = Σ_{i=1}^n ℓ(−yi f(xi; θ))  and  Rn(θ; f) = (1/n) Σ_{i=1}^n I{yi ≠ sgn(f(xi; θ))},  (2)

where I is the indicator function.

Tensor products. We use a ⊗ b to denote the tensor product of vectors a and b, and a^{⊗k} to denote the tensor product a ⊗ · · · ⊗ a in which a appears k times. For an N-th order tensor T ∈ R^{d1×d2×···×dN} and N vectors u1 ∈ R^{d1}, u2 ∈ R^{d2}, ..., uN ∈ R^{dN}, we define

T ⊗ u1 · · · ⊗ uN = Σ_{i1∈[d1],...,iN∈[dN]} T(i1, ..., iN) u1(i1) · · · uN(iN),

where T(i1, ..., iN) denotes the (i1, ..., iN)-th component of the tensor T, uk(ik) denotes the ik-th component of the vector uk, k = 1, ..., N, and [dk] denotes the set {1, ..., dk}.

3 Main Result

In this section, we first present several important conditions on the loss function and the dataset needed to derive the main results. After that, we present the main results.

3.1 Assumptions

In this subsection, we introduce two assumptions on the loss function and the dataset.

Assumption 1 (Loss function) Assume that the loss function ℓ : R → R is monotonically non-decreasing and twice differentiable, i.e., ℓ ∈ C². 
Assume that every critical point of the loss function ℓ(z) is also a global minimum, and that every global minimizer z satisfies z < 0.

A simple example of a loss function satisfying Assumption 1 is the polynomial hinge loss, i.e., ℓ(z) = [max{z + 1, 0}]^p with p ≥ 3. It is identically zero for z ≤ −1 and behaves like a polynomial in the region z > −1. Note that the condition that every global minimizer of the loss function ℓ(z) is negative is not needed to prove that every local minimum of the empirical loss is globally minimal, but it is necessary to prove that the global minimizer of the empirical loss is also a minimizer of the misclassification rate.

Assumption 2 (Realizability) Assume that there exists a set of parameters θ such that the neural network f(·; θ) correctly classifies all samples in the dataset D.

By Assumption 2, we assume that the dataset is realizable by the neural architecture f. We note that this assumption is consistent with previous empirical observations (Zhang et al., 2016; Krizhevsky et al., 2012; He et al., 2016) showing that, at the end of the training process, neural networks usually achieve zero misclassification rate on the training set. 
However, as we will show later, if the loss function ℓ is convex, then we can prove the main result even without Assumption 2.

3.2 Main Result

In this subsection, we first introduce several notations and then present the main result of the paper. Given a neural architecture f(·; θ) defined on a d-dimensional Euclidean space and parameterized by a set of parameters θ, we define a new architecture ˜f by adding the output of an exponential neuron to the output of the network f, i.e.,

˜f(x; ˜θ) = f(x; θ) + a exp(w⊤x + b),  (3)

where the vector ˜θ = (θ, a, w, b) denotes the parametrization of the network ˜f. For this model, we define the empirical loss function as

˜Ln(˜θ) = Σ_{i=1}^n ℓ(−yi ˜f(xi; ˜θ)) + λa²/2,  (4)

where the scalar λ is a positive real number, i.e., λ > 0. Different from the empirical loss function Ln, the loss ˜Ln has an additional regularizer on the parameter a, since we aim to eliminate the impact of the exponential neuron on the output of the network ˜f at every local minimum of ˜Ln. As we will show later, the exponential neuron is inactive at every local minimum of the empirical loss ˜Ln. 
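To make the construction concrete, here is a minimal numpy sketch of Equations (3) and (4). The choices below are ours for illustration only: we instantiate the base network f as a small one-hidden-layer tanh network (the theorem places no restriction on the form of f), and we use the polynomial hinge loss with p = 3 as the Assumption 1 example from Section 3.1.

```python
import numpy as np

def ell(z, p=3):
    # Polynomial hinge loss: zero for z <= -1, degree-p polynomial for z > -1.
    return np.maximum(z + 1.0, 0.0) ** p

def f(x, W1, b1, w2, b2):
    # A stand-in base network; here a one-hidden-layer tanh network.
    return np.tanh(x @ W1 + b1) @ w2 + b2

def f_tilde(x, theta, a, w, b):
    # Eq. (3): base network plus one exponential neuron on a skip connection.
    return f(x, *theta) + a * np.exp(x @ w + b)

def L_tilde(X, y, theta, a, w, b, lam=0.1):
    # Eq. (4): empirical loss plus the quadratic regularizer on a.
    margins = -y * f_tilde(X, theta, a, w, b)
    return ell(margins).sum() + 0.5 * lam * a ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))                 # toy dataset
y = np.sign(X[:, 0])                        # labels in {-1, +1}
theta = (rng.normal(size=(2, 3)), np.zeros(3), rng.normal(size=3), 0.0)
a, w, b = 0.5, rng.normal(size=2), 0.0

print(L_tilde(X, y, theta, a, w, b))
# With a = 0 the exponential neuron is inactive and f_tilde reduces to f.
assert np.allclose(f_tilde(X, theta, 0.0, w, b), f(X, *theta))
```

The regularizer only penalizes a, the output weight of the new neuron; as the analysis below shows, at every local minimum this weight is driven to zero.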
Now we present the following theorem, which shows that every local minimum of the loss function ˜Ln is also a global minimum.

Remark: Instead of viewing the exponential term in Equation (3) as a neuron, one can equivalently think of modifying the loss function to be

˜Ln(˜θ) = Σ_{i=1}^n ℓ(−yi(f(xi; θ) + a exp(w⊤xi + b))) + λa²/2.

Then, one can interpret Equations (3) and (4) as maintaining the original neural architecture and slightly modifying the loss function.

Theorem 1 Suppose that Assumptions 1 and 2 hold. Then both of the following statements are true:

(i) The empirical loss function ˜Ln(˜θ) has at least one local minimum.

(ii) If ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of the empirical loss function ˜Ln(˜θ), then ˜θ∗ is a global minimum of ˜Ln(˜θ). Furthermore, θ∗ achieves the minimum loss value and the minimum misclassification rate on the dataset D, i.e., θ∗ ∈ arg minθ Ln(θ) and θ∗ ∈ arg minθ Rn(θ; f).

Remarks: (i) Theorem 1 shows that every local minimum ˜θ∗ of the empirical loss ˜Ln is also a global minimum, and that θ∗ simultaneously achieves the minimum training error and the minimum value of the original loss function Ln. (ii) Since we do not require the explicit form of the neural architecture f, Theorem 1 applies to the neural architectures widely used in practice, such as convolutional neural networks (Krizhevsky et al., 2012), deep residual networks (He et al., 2016), etc. This further indicates that the result holds for any real neural activation function, such as the rectified linear unit (ReLU), the leaky rectified linear unit (Leaky ReLU), etc. 
(iii) As we show in the following corollary, at every local minimum ˜θ∗ the exponential neuron is inactive. Therefore, at every local minimum ˜θ∗ = (θ∗, a∗, w∗, b∗), the neural network ˜f with an augmented exponential neuron is equivalent to the original neural network f.

Corollary 1 Under the conditions of Theorem 1, if ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of the empirical loss function ˜Ln(˜θ), then the two neural networks f(·; θ∗) and ˜f(·; ˜θ∗) are equivalent, i.e., f(x; θ∗) = ˜f(x; ˜θ∗), ∀x ∈ R^d.

Corollary 1 shows that at every local minimum, the exponential neuron does not contribute to the output of the neural network ˜f. However, this does not imply that the exponential neuron is unnecessary, since several previous results (Safran & Shamir, 2018; Liang et al., 2018) have shown that the loss surface of pure ReLU neural networks is guaranteed to have bad local minima. Furthermore, to prove the main result for an arbitrary dataset, the regularizer is also necessary, since Liang et al. (2018) has shown that, even with an augmented exponential neuron, the empirical loss without the regularizer still has bad local minima on some datasets.

4 Extensions

4.1 Eliminating the Skip Connection

As noted in the previous section, the exponential term in Equation (3) can be viewed as a skip connection or as a modification of the loss function. Our analysis also works under other architectures. When the exponential term is viewed as a skip connection, the network architecture is as shown in Fig. 1(a). This architecture differs from canonical feedforward neural architectures, as there is a direct path from the input layer to the output layer. 
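Whichever way the exponential term is viewed, skip connection or loss modification, its contribution to the output is a·exp(w⊤x + b), which vanishes identically once a = 0, for any w and b. This is the equivalence asserted in Corollary 1, and it is easy to sanity-check numerically; in the sketch below, the ReLU base network and all sizes are our own illustrative choices.

```python
import numpy as np

def relu_net(x, W1, b1, w2, b2):
    # A stand-in base network f; Corollary 1 does not depend on its form.
    return np.maximum(x @ W1 + b1, 0.0) @ w2 + b2

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4))
params = (rng.normal(size=(4, 8)), rng.normal(size=8),
          rng.normal(size=8), 0.3)
w, b = rng.normal(size=4), -1.0

out_f = relu_net(X, *params)
# At a local minimum a* = 0, so the skip-connection term vanishes
# identically, regardless of the values of w* and b*.
out_f_tilde = out_f + 0.0 * np.exp(X @ w + b)
assert np.allclose(out_f, out_f_tilde)
```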
In this subsection, we will show that the main result still holds if the model ˜f is defined as the feedforward neural network shown in Fig. 1(b), in which each layer of the network f is augmented by an additional exponential neuron. This is a standard fully connected neural network except for one special neuron in each layer.

Figure 1: (a) The neural architecture considered in Theorem 1. (b) The neural architecture considered in Theorem 2. The blue and red circles denote the neurons σ in the original network and the augmented exponential neurons, respectively.

Notations. Given a fully-connected feedforward neural network f(·; θ) defined by Equation (1), we define a new fully-connected feedforward neural network ˜f by adding an additional exponential neuron to each layer of the network f. We use the vector ˜θ = (θ, θexp) to denote the parameterization of the network ˜f, where θexp denotes the vector consisting of all augmented weights and biases. Let ˜Wl ∈ R^{(Ml−1+1)×(Ml+1)} and ˜bl ∈ R^{Ml+1} denote the weight matrix and the bias vector in the l-th layer of the network ˜f, respectively. Let ˜WL+1 ∈ R^{ML+1} and ˜bL+1 ∈ R denote the weight vector and the bias scalar in the output layer of the network ˜f, respectively. Without loss of generality, we assume that the (Ml + 1)-th neuron in the l-th layer is the augmented exponential neuron. Thus, the output of the network ˜f is expressed as

˜f(x; ˜θ) = ˜W⊤_{L+1} ˜σ_{L+1}(˜W⊤_L ˜σ_L(· · · ˜σ_1(˜W⊤_1 x + ˜b1) · · · + ˜b_{L−1}) + ˜b_L) + ˜b_{L+1},  (5)

where ˜σl : R^{Ml−1+1} → R^{Ml+1} is a vector-valued activation function whose first Ml components are the activation function σ of the network f and whose last component is the exponential function, i.e., ˜σl(z) = (σ(z), ..., σ(z), exp(z)). Furthermore, we use ˜wl to denote the vector in the (Ml−1 + 1)-th row of the matrix ˜Wl. In other words, the components of the vector ˜wl are the weights on the edges connecting the exponential neuron in the (l − 1)-th layer to the neurons in the l-th layer. For this feedforward network, we define the empirical loss function as

˜Ln(˜θ) = Σ_{i=1}^n ℓ(−yi ˜f(xi; ˜θ)) + (λ/2) Σ_{l=2}^{L+1} ‖˜wl‖_{2L}^{2L},  (6)

where ‖a‖_p denotes the p-norm of a vector a and λ is a positive real number, i.e., λ > 0. Similar to the empirical loss discussed in the previous section, we add a regularizer to eliminate the impact of all exponential neurons on the output of the network. Similarly, we can prove that at every local minimum of ˜Ln, all exponential neurons are inactive. We now present the following theorem, which shows that if the set of parameters ˜θ∗ = (θ∗, θ∗exp) is a local minimum of the empirical loss function ˜Ln(˜θ), then ˜θ∗ is a global minimum and θ∗ is a global minimizer of both minimization problems minθ Ln(θ) and minθ Rn(θ; f). This means that the neural network f(·; θ∗) simultaneously achieves the globally minimal loss value and misclassification rate on the dataset D.

Theorem 2 Suppose that Assumptions 1 and 2 hold, and suppose that the activation function σ is differentiable. If ˜θ∗ = (θ∗, θ∗exp) is a local minimum of the empirical loss function ˜Ln(˜θ), then ˜θ∗ is a global minimum of ˜Ln(˜θ). 
Furthermore, θ∗ achieves the minimum loss value and the minimum misclassification rate on the dataset D, i.e., θ∗ ∈ arg minθ Ln(θ) and θ∗ ∈ arg minθ Rn(θ; f).

Remarks: (i) This theorem is not a direct corollary of the result in the previous section, but the proof ideas are similar. (ii) Due to the assumption on the differentiability of the activation function σ, Theorem 2 does not apply to neural networks consisting of non-smooth neurons such as ReLUs, Leaky ReLUs, etc. (iii) Similar to Corollary 1, we present the following corollary, which shows that at every local minimum ˜θ∗ = (θ∗, θ∗exp), the neural network ˜f with augmented exponential neurons is equivalent to the original neural network f.

Corollary 2 Under the conditions of Theorem 2, if ˜θ∗ = (θ∗, θ∗exp) is a local minimum of the empirical loss function ˜Ln(˜θ), then the two neural networks f(·; θ∗) and ˜f(·; ˜θ∗) are equivalent, i.e., f(x; θ∗) = ˜f(x; ˜θ∗), ∀x ∈ R^d.

Corollary 2 further shows that even if we add an exponential neuron to each layer of the original network f, at every local minimum of the empirical loss all exponential neurons are inactive.

4.2 Neurons

In this subsection, we show that even if the exponential neuron is replaced by a monomial neuron, the main result still holds under additional assumptions. 
Similar to the case where exponential neurons are used, given a neural network f(x; θ), we define a new neural network ˜f by adding the output of a monomial neuron of degree p to the output of the original model f, i.e.,

˜f(x; ˜θ) = f(x; θ) + a(w⊤x + b)^p.  (7)

In addition, the empirical loss function ˜Ln is exactly the same as the loss function defined by Equation (4). Next, we present the following proposition, which shows that if all samples in the dataset D can be correctly classified by a polynomial of degree t and the degree of the augmented monomial is not smaller than t (i.e., p ≥ t), then every local minimum of the empirical loss function ˜Ln(˜θ) is also a global minimum. We note that the degree of a monomial is the sum of the powers of all variables in the monomial, and the degree of a polynomial is the maximum degree of its monomials.

Proposition 1 Suppose that Assumptions 1 and 2 hold. Assume that all samples in the dataset D can be correctly classified by a polynomial of degree t and that p ≥ t. If ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of the empirical loss function ˜Ln(˜θ), then ˜θ∗ is a global minimum of ˜Ln(˜θ). Furthermore, θ∗ is a global minimizer of both problems minθ Ln(θ) and minθ Rn(θ; f).

Remarks: (i) We note that, similar to Theorem 1, Proposition 1 applies to all neural architectures and all neural activation functions defined on R, since we do not require the explicit form of the neural network f. (ii) It follows from the Lagrange interpolating polynomial and Assumption 2 that, for a dataset consisting of n distinct samples, there always exists a polynomial P of degree smaller than n that correctly classifies all points in the dataset. 
This indicates that Proposition 1 always holds if p ≥ n. (iii) Similar to Corollaries 1 and 2, we can show that at every local minimum ˜θ∗ = (θ∗, a∗, w∗, b∗), the neural network ˜f with an augmented monomial neuron is equivalent to the original neural network f.

4.3 Allowing Random Labels

In the previous subsections, we assumed that the dataset is realizable by the neural network, which implies that the label of a given feature vector is unique. This does not cover the case where the dataset contains two samples with the same feature vector but different labels (for example, the same image may be labeled differently by two different people). Clearly, in this case, no model can correctly classify all samples in the dataset. Another simple example of this case is a mixture of two Gaussians, where each data sample is drawn from one of the two Gaussian distributions with a certain probability.

In this subsection, we show that under this broader setting, where one feature vector may correspond to two different labels, the same result still holds with a slightly stronger assumption on the loss ℓ, namely convexity. The formal statement is presented in the following proposition.

Proposition 2 Suppose that Assumption 1 holds and that the loss function ℓ is convex. If ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of the empirical loss function ˜Ln(˜θ), then ˜θ∗ is a global minimum of ˜Ln(˜θ). 
Furthermore, θ∗ achieves the minimum loss value and the minimum misclassification rate on the dataset D, i.e., θ∗ ∈ arg minθ Ln(θ) and θ∗ ∈ arg minθ Rn(θ; f).

Remark: The differences between Proposition 2 and Theorem 1 can be understood as follows. First, as stated previously, Proposition 2 allows a feature vector to have two different labels, while Theorem 1 does not. Second, the minimum misclassification rate under the conditions of Theorem 1 must be zero, while in Proposition 2 the minimum misclassification rate can be nonzero.

4.4 High-order Stationary Points

In this subsection, we characterize the high-order stationary points of the empirical loss ˜Ln defined in Section 3.2. We first introduce the definition of a high-order stationary point and then show that every stationary point of the loss ˜Ln of sufficiently high order is also a global minimum.

Definition 1 (k-th order stationary point) A critical point θ0 of a function L(θ) is a k-th order stationary point if there exist positive constants C, ε > 0 such that for every θ with ‖θ − θ0‖2 ≤ ε, we have L(θ) ≥ L(θ0) − C‖θ − θ0‖2^{k+1}.

Next, we show that if a polynomial of degree p can correctly classify all points in the dataset, then every stationary point of order at least 2p is a global minimum, and the set of parameters corresponding to this stationary point achieves the minimum training error.

Proposition 3 Suppose that Assumptions 1 and 2 hold. Assume that all samples in the dataset can be correctly classified by a polynomial of degree p. 
If ˜θ∗ = (θ∗, a∗, w∗, b∗) is a k-th order stationary point of the empirical loss function ˜Ln(˜θ) with k ≥ 2p, then ˜θ∗ is a global minimum of ˜Ln(˜θ). Furthermore, the neural network f(·; θ∗) achieves the minimum misclassification rate on the dataset D, i.e., θ∗ ∈ arg minθ Rn(θ; f).

One implication of Proposition 3 is that if a dataset is linearly separable, then every second-order stationary point of the empirical loss function is a global minimum and, at this stationary point, the neural network achieves zero training error. When the dataset is not linearly separable, our result only covers fourth- or higher-order stationary points of the empirical loss.

5 Proof Idea

In this section, we provide an overview of the proof of Theorem 1.

5.1 Important Lemmas

In this subsection, we present two important lemmas on which the proof of Theorem 1 is based.

Lemma 1 Under Assumption 1 and λ > 0, if ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of ˜Ln, then (i) a∗ = 0, and (ii) for any integer p ≥ 0, the following equation holds for every unit vector u : ‖u‖2 = 1:

Σ_{i=1}^n ℓ′(−yi f(xi; θ∗)) yi e^{w∗⊤xi+b∗} (u⊤xi)^p = 0.  (8)

Lemma 2 For any integer k ≥ 0 and any sequence {ci}_{i=1}^n, if Σ_{i=1}^n ci(u⊤xi)^k = 0 holds for every unit vector u : ‖u‖2 = 1, then the k-th order tensor Tk = Σ_{i=1}^n ci xi^{⊗k} is a k-th order zero tensor.

5.2 Proof Sketch of Lemma 1

Proof sketch of Lemma 1(i): To prove a∗ = 0, we only need to check the first-order conditions of local minima. 
By assumption, ˜θ∗ = (θ∗, a∗, w∗, b∗) is a local minimum of ˜Ln, so the derivatives of ˜Ln with respect to a and b at the point ˜θ∗ are both zero, i.e.,

∇a ˜Ln(˜θ)|_{˜θ=˜θ∗} = −Σ_{i=1}^n ℓ′(−yi f(xi; θ∗) − yi a∗ e^{w∗⊤xi+b∗}) yi exp(w∗⊤xi + b∗) + λa∗ = 0,

∇b ˜Ln(˜θ)|_{˜θ=˜θ∗} = −a∗ Σ_{i=1}^n ℓ′(−yi f(xi; θ∗) − yi a∗ e^{w∗⊤xi+b∗}) yi exp(w∗⊤xi + b∗) = 0.

From the above equations it is not difficult to see that a∗ satisfies λa∗² = 0 or, equivalently, a∗ = 0: multiplying the first equation by a∗ and substituting the second equation yields λa∗² = 0. We note that the main observation used here is that the derivative of the exponential neuron is the neuron itself. Therefore, the same proof holds for every neuron activation function σ satisfying σ′(z) = cσ(z), ∀z ∈ R, for some constant c. In fact, with a small modification, the same proof works for every neuron activation function satisfying σ(z) = (c1 z + c0)σ′(z), ∀z ∈ R, for some constants c0 and c1. This further indicates that the same proof holds for monomial neurons, and thus the proof of Proposition 1 follows directly from that of Theorem 1.

Proof sketch of Lemma 1(ii): The main idea of the proof is to use higher-order information at the local minimum to derive Equation (8). 
Since $\tilde\theta^* = (\theta^*, a^*, w^*, b^*)$ is a local minimum of the empirical loss function $\tilde L_n$, there exists a bounded local region in which the parameters $\tilde\theta^*$ achieve the minimum loss value, i.e., there exists $\delta \in (0, 1)$ such that $\tilde L_n(\tilde\theta^* + \Delta) \ge \tilde L_n(\tilde\theta^*)$ for all $\Delta$ with $\|\Delta\|_2 \le \delta$.

Now, we use $\delta_a, \delta_w$ to denote the perturbations on the parameters $a$ and $w$, respectively. Next, we consider the loss value at the point $\tilde\theta^* + \Delta = (\theta^*, a^* + \delta_a, w^* + \delta_w, b^*)$, where we set $|\delta_a| = e^{-1/\varepsilon}$ and $\delta_w = \varepsilon u$ for an arbitrary unit vector $u$ with $\|u\|_2 = 1$. As $\varepsilon$ goes to zero, the perturbation magnitude $\|\Delta\|_2$ also goes to zero, which indicates that there exists an $\varepsilon_0 \in (0, 1)$ such that $\tilde L_n(\tilde\theta^* + \Delta) \ge \tilde L_n(\tilde\theta^*)$ for all $\varepsilon \in [0, \varepsilon_0)$. By the result $a^* = 0$, shown in Lemma 1(i), the output of the model $\tilde f$ under the parameters $\tilde\theta^* + \Delta$ can be expressed as

$$\tilde f(x; \tilde\theta^* + \Delta) = f(x; \theta^*) + \delta_a \exp(\delta_w^\top x) \exp(w^{*\top} x + b^*).$$

For simplicity of notation, let $g(x; \tilde\theta^*, \delta_w) = \exp(\delta_w^\top x) \exp(w^{*\top} x + b^*)$.
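The particular scales $|\delta_a| = e^{-1/\varepsilon}$ and $\delta_w = \varepsilon u$ matter because $e^{-1/\varepsilon}$ vanishes faster than any power of $\varepsilon$; this is what lets every polynomial-in-$\varepsilon$ term survive the later limit while the remainder dies. A quick numerical illustration (ours; the values of $p$ and $\varepsilon$ are arbitrary):

```python
import math

# exp(-1/eps) = o(eps^p) for every fixed p: the ratio exp(-1/eps) / eps^p
# tends to 0 as eps -> 0, however large p is (here p up to 20).
for p in (1, 5, 20):
    ratios = [math.exp(-1.0 / eps) / eps ** p for eps in (0.01, 0.005, 0.002)]
    assert ratios[0] > ratios[1] > ratios[2] > 0  # decaying toward zero
    print(p, ratios[-1])
```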
From the second-order Taylor expansion with Lagrange remainder and the assumption that $\ell$ is twice differentiable, it follows that there exists a constant $C(\tilde\theta^*, \mathcal{D})$, depending only on the local minimizer $\tilde\theta^*$ and the dataset $\mathcal{D}$, such that the following inequality holds for every sample in the dataset and every $\varepsilon \in [0, \varepsilon_0)$:

$$\ell(-y_i \tilde f(x_i; \tilde\theta^* + \Delta)) \le \ell(-y_i f(x_i; \theta^*)) + \ell'(-y_i f(x_i; \theta^*))(-y_i)\,\delta_a\, g(x_i; \tilde\theta^*, \delta_w) + C(\tilde\theta^*, \mathcal{D})\,\delta_a^2.$$

Summing the above inequality over all samples in the dataset and recalling that $\tilde L_n(\tilde\theta^* + \Delta) \ge \tilde L_n(\tilde\theta^*)$ holds for all $\varepsilon \in [0, \varepsilon_0)$, we obtain

$$-\mathrm{sgn}(\delta_a) \sum_{i=1}^n \ell'(-y_i f(x_i; \theta^*))\, y_i \exp(\varepsilon u^\top x_i) \exp(w^{*\top} x_i + b^*) + \left[n C(\tilde\theta^*, \mathcal{D}) + \lambda/2\right] \exp(-1/\varepsilon) \ge 0.$$

Finally, we complete the proof by induction. Specifically, for the base case $p = 0$, we take the limit on both sides of the above inequality as $\varepsilon \to 0$; using the property that $\delta_a$ can be either positive or negative, we establish the base case. For the higher-order case, we first assume that Equation (8) holds for $p = 0, \dots, k$ and then subtract these equations from the above inequality. After taking the limit on both sides of the inequality as $\varepsilon \to 0$, we can prove that Equation (8) holds for $p = k + 1$. Therefore, by induction, Equation (8) holds for every non-negative integer $p$.

5.3 Proof Sketch of Lemma 2

The proof of Lemma 2 follows directly from the results in (Zhang et al., 2012).
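The tensor $T_k$ in Lemma 2 is easy to form explicitly. The NumPy sketch below (our illustration) builds $T_3 = \sum_i c_i x_i^{\otimes 3}$, confirms it is symmetric, and checks that its diagonal evaluation $T_3(u, u, u)$ equals the polynomial $\sum_i c_i (u^\top x_i)^3$ appearing in the lemma:

```python
import numpy as np

# Build the third-order tensor T_3 = sum_i c_i x_i (x) x_i (x) x_i and check
# that evaluating it on the diagonal (u, u, u) reproduces sum_i c_i (u.x_i)^3.
rng = np.random.default_rng(1)
n, d = 5, 3
X = rng.normal(size=(n, d))
c = rng.normal(size=n)

# T3[a, b, e] = sum_i c_i * X[i, a] * X[i, b] * X[i, e]
T3 = np.einsum('i,ia,ib,ie->abe', c, X, X, X)

# symmetry: permuting the first two modes leaves the tensor unchanged
assert np.allclose(T3, np.transpose(T3, (1, 0, 2)))

u = rng.normal(size=d)
u /= np.linalg.norm(u)
diag = np.einsum('abe,a,b,e->', T3, u, u, u)   # T_3(u, u, u)
poly = np.sum(c * (X @ u) ** 3)                # sum_i c_i (u.x_i)^3
print(np.isclose(diag, poly))
```

Lemma 2 runs this equivalence in the other direction: if the diagonal polynomial vanishes for every unit $u$, the symmetric tensor itself must be zero, by Theorem 1 of (Zhang et al., 2012).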
It is easy to check that, for every sequence $\{c_i\}_{i=1}^n$ and every non-negative integer $k$, the $k$-th order tensor $T_k = \sum_{i=1}^n c_i x_i^{\otimes k}$ is a symmetric tensor. From Theorem 1 in (Zhang et al., 2012), it directly follows that

$$\max_{u_1, \dots, u_k : \|u_1\|_2 = \dots = \|u_k\|_2 = 1} |T_k(u_1, \dots, u_k)| = \max_{u : \|u\|_2 = 1} |T_k(u, \dots, u)|.$$

Furthermore, by the assumption that $T_k(u, \dots, u) = \sum_{i=1}^n c_i (u^\top x_i)^k = 0$ holds for all $\|u\|_2 = 1$, we have

$$\max_{u_1, \dots, u_k : \|u_1\|_2 = \dots = \|u_k\|_2 = 1} |T_k(u_1, \dots, u_k)| = 0,$$

which is equivalent to $T_k = 0_d^{\otimes k}$, where $0_d$ is the zero vector in the $d$-dimensional space.

5.4 Proof Sketch of Theorem 1

For every dataset $\mathcal{D}$ satisfying Assumption 2, by the Lagrange interpolating polynomial, there always exists a polynomial $P(x) = \sum_j c_j \pi_j(x)$ defined on $\mathbb{R}^d$ that correctly classifies all samples in the dataset with margin at least one, i.e., $y_i P(x_i) \ge 1$ for all $i \in [n]$, where $\pi_j$ denotes the $j$-th monomial in the polynomial $P(x)$. Therefore, from Lemmas 1 and 2, it follows that

$$\sum_{i=1}^n \ell'(-y_i f(x_i; \theta^*))\, e^{w^{*\top} x_i + b^*}\, y_i P(x_i) = \sum_j c_j \sum_{i=1}^n \ell'(-y_i f(x_i; \theta^*))\, y_i e^{w^{*\top} x_i + b^*} \pi_j(x_i) = 0.$$

Since $y_i P(x_i) \ge 1$ and $e^{w^{*\top} x_i + b^*} > 0$ hold for all $i \in [n]$, and the loss function $\ell$ is non-decreasing, i.e., $\ell'(z) \ge 0$ for all $z \in \mathbb{R}$, it follows that $\ell'(-y_i f(x_i; \theta^*)) = 0$ holds for all $i \in [n]$.
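The interpolation step at the start of this sketch is concrete: for distinct inputs one can always fit a polynomial with $P(x_i) = y_i$, giving margin exactly one. A one-dimensional sketch (the data points and the use of NumPy's polyfit are our choices for illustration):

```python
import numpy as np

# Five 1-D points with labels that are not linearly separable; a degree-4
# Lagrange interpolant P with P(x_i) = y_i classifies all of them with
# margin y_i * P(x_i) = y_i^2 = 1.
x = np.array([-2.0, -0.5, 0.3, 1.1, 2.4])
y = np.array([1.0, -1.0, -1.0, 1.0, -1.0])
coeffs = np.polyfit(x, y, deg=len(x) - 1)   # exact interpolation
margins = y * np.polyval(coeffs, x)
print(margins)  # all (numerically) equal to 1
```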
In addition, from the assumption that every critical point of the loss function $\ell$ is a global minimum, it follows that $z_i = -y_i f(x_i; \theta^*)$ achieves the global minimum of the loss function $\ell$, and this further indicates that $\theta^*$ is a global minimum of the empirical loss $L_n(\theta)$. Furthermore, since at every local minimum the exponential neuron is inactive ($a^* = 0$), the set of parameters $\tilde\theta^*$ is a global minimum of the loss function $\tilde L_n(\tilde\theta)$. Finally, since every critical point of the loss function $\ell(z)$ satisfies $z < 0$, then for every sample, $\ell'(-y_i f(x_i; \theta^*)) = 0$ indicates that $y_i f(x_i; \theta^*) > 0$ or, equivalently, $y_i = \mathrm{sgn}(f(x_i; \theta^*))$. Therefore, the set of parameters $\theta^*$ also minimizes the training error. In summary, the set of parameters $\tilde\theta^* = (\theta^*, a^*, w^*, b^*)$ minimizes the loss function $\tilde L_n(\tilde\theta)$, and the set of parameters $\theta^*$ simultaneously minimizes the empirical loss function $L_n(\theta)$ and the training error $R_n(\theta; f)$.

6 Conclusions and Discussions

One of the difficulties in analyzing neural networks is the non-convexity of the loss function, which allows the existence of many spurious minima with large loss values. In this paper, we prove that for any neural network, by adding a special neuron and an associated regularizer, the new loss function has no spurious local minimum. In addition, we prove that, at every local minimum of this new loss function, the exponential neuron is inactive; this means that the augmented neuron and regularizer improve the landscape of the loss surface without affecting the representation power of the original neural network. We also extend the main result in a few ways.
First, while adding a special neuron makes the network different from a classical neural network architecture, the same result also holds for a standard fully connected network with one special neuron added to each layer. Second, the same result holds if we change the exponential neuron to a polynomial neuron with a degree dependent on the data. Third, the same result holds even if one feature vector corresponds to both labels.

This paper is an effort in designing neural networks that are "good". Here "good" can mean various things, such as a nice landscape, stronger representation power, or better generalization; in this paper we focus on the landscape, in particular, the very specific property that "every local minimum is a global minimum". While our results enhance the understanding of the landscape, the practical implications are not straightforward to see, since we did not consider other aspects such as algorithms and generalization. It is an interesting direction to improve the landscape results by considering other aspects, such as studying when a specific algorithm will converge to local minima and thus global minima.

7 Acknowledgment

Research is supported by the following grants: USDA/NSF CPS grant AG 2018-67007-2837, NSF NeTS 1718203, NSF CPS ECCS 1739189, DTRA grant HDTRA1-15-1-0003, NSF CCF 1755847, and a start-up grant from the Dept. of ISE, University of Illinois at Urbana-Champaign.

References

Andoni, A., Panigrahy, R., Valiant, G., and Zhang, L. Learning polynomials with neural networks. In ICML, 2014.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Brutzkus, A. and Globerson, A. Globally optimal gradient descent for a convnet with Gaussian inputs. arXiv preprint arXiv:1702.07966, 2017.

Choromanska, A., Henaff, M., Mathieu, M., Arous, G., and LeCun, Y.
The loss surfaces of multilayer networks. In AISTATS, 2015.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 1995.

Du, S. S. and Lee, J. D. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

Du, S. S., Lee, J. D., and Tian, Y. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.

Freeman, C. D. and Bruna, J. Topology and geometry of half-rectified network optimization. arXiv preprint arXiv:1611.01540, 2016.

Gautier, A., Nguyen, Q. N., and Hein, M. Globally optimal training of generalized polynomial neural networks with nonlinear spectral methods. In NIPS, pp. 1687–1695, 2016.

Ge, R., Lee, J. D., and Ma, T. Learning one-hidden-layer neural networks with landscape design. ICLR, 2018.

Goel, S. and Klivans, A. Learning depth-three neural networks in polynomial time. arXiv preprint arXiv:1709.06010, 2017.

Goodfellow, I. J., Warde-Farley, D., Mirza, M., Courville, A., and Bengio, Y. Maxout networks. arXiv preprint arXiv:1302.4389, 2013.

Haeffele, B., Young, E., and Vidal, R. Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing. In ICML, 2014.

Haeffele, B. D. and Vidal, R. Global optimality in tensor factorization, deep learning, and beyond. arXiv preprint arXiv:1506.07540, 2015.

Hardt, M. and Ma, T. Identity matters in deep learning. ICLR, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, pp. 770–778, 2016.

Huang, G. and Liu, Z. Densely connected convolutional networks. In CVPR, 2017.

Janzamin, M., Sedghi, H., and Anandkumar, A. Beating the perils of non-convexity: Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473, 2015.

Kawaguchi, K. Deep learning without poor local minima. In NIPS, pp. 586–594, 2016.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Bengio, Y., and Hinton, G. E. Deep learning. Nature, 521(7553):436, 2015.

Li, Y. and Yuan, Y. Convergence analysis of two-layer neural networks with ReLU activation. In NIPS, pp. 597–607, 2017.

Liang, S., Sun, R., Li, Y., and Srikant, R. Understanding the loss surface of neural networks for binary classification. 2018.

Livni, R., Shalev-Shwartz, S., and Shamir, O. On the computational efficiency of training neural networks. In NIPS, 2014.

Mei, S., Montanari, A., and Nguyen, P.-M. A mean field view of the landscape of two-layers neural networks. arXiv preprint arXiv:1804.06561, 2018.

Nguyen, Q. and Hein, M. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017a.

Nguyen, Q. and Hein, M. The loss surface and expressivity of deep convolutional neural networks. arXiv preprint arXiv:1710.10928, 2017b.

Safran, I. and Shamir, O. Spurious local minima are common in two-layer ReLU neural networks. ICML, 2018.

Sedghi, H. and Anandkumar, A. Provable methods for training neural networks with sparse connectivity. arXiv preprint arXiv:1412.2693, 2014.

Shamir, O. Are resnets provably better than linear predictors? arXiv preprint arXiv:1804.06739, 2018.

Soltanolkotabi, M. Learning ReLUs via gradient descent. In NIPS, pp. 2004–2014, 2017.

Soudry, D. and Carmon, Y. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.

Wan, L., Zeiler, M., Zhang, S., Le Cun, Y., and Fergus, R. Regularization of neural networks using DropConnect. In ICML, pp. 1058–1066, 2013.

Yun, C., Sra, S., and Jadbabaie, A. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. ICLR, 2016.

Zhang, X., Ling, C., and Qi, L. The best rank-1 approximation of a symmetric tensor and related spherical optimization problems. SIAM Journal on Matrix Analysis and Applications, 2012.

Zhong, K., Song, Z., Jain, P., Bartlett, P. L., and Dhillon, I. S. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017.