{"title": "A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 2441, "page_last": 2449, "abstract": "We study the set of continuous functions that admit no spurious local optima (i.e. local minima that are not global minima) which we term global functions. They satisfy various powerful properties for analyzing nonconvex and nonsmooth optimization problems. For instance, they satisfy a theorem akin to the fundamental uniform limit theorem in the analysis regarding continuous functions. Global functions are also endowed with useful properties regarding the composition of functions and change of variables. Using these new results, we show that a class of non-differentiable nonconvex optimization problems arising in tensor decomposition applications are global functions. This is the first result concerning nonconvex methods for nonsmooth objective functions. Our result provides a theoretical guarantee for the widely-used $\\ell_1$ norm to avoid outliers in nonconvex optimization.", "full_text": "A theory on the absence of spurious solutions for\n\nnonconvex and nonsmooth optimization\n\nC. Josz\n\nEECS, UC Berkeley\n\ncedric.josz@gmail.com\n\nY. Ouyang\n\nIEOR, UC Berkeley\n\nouyangyii@gmail.com\n\nR. Y. Zhang\n\nIEOR, UC Berkeley\nryz@berkeley.edu\n\nJ. Lavaei\n\nIEOR, UC Berkeley\n\nlavaei@berkeley.edu\n\nS. Sojoudi\n\nEECS, UC Berkeley\n\nsojoudi@berkeley.edu\n\nAbstract\n\nWe study the set of continuous functions that admit no spurious local optima\n(i.e. local minima that are not global minima) which we term global functions.\nThey satisfy various powerful properties for analyzing nonconvex and nonsmooth\noptimization problems. For instance, they satisfy a theorem akin to the fundamental\nuniform limit theorem in the analysis regarding continuous functions. 
Global\nfunctions are also endowed with useful properties regarding the composition of\nfunctions and change of variables. Using these new results, we show that a class of\nnonconvex and nonsmooth optimization problems arising in tensor decomposition\napplications are global functions. This is the \ufb01rst result concerning nonconvex\nmethods for nonsmooth objective functions. Our result provides a theoretical\nguarantee for the widely-used (cid:96)1 norm to avoid outliers in nonconvex optimization.\n\n1\n\nIntroduction\n\nA recent branch of research in optimization and machine learning consists in proving that simple\nand practical algorithms can solve nonconvex optimization problems. Applications include, but are\nnot limited to, neural networks [40, 44], dictionary learning [1, 2], deep learning [39, 50], mixed\nlinear regression [49, 43], and phase retrieval [46, 21]. In this paper, we focus our attention on\nmatrix completion/sensing [30, 24, 38] and tensor recovery/decomposition [5, 4, 31, 35]. Matrix\ncompletion/sensing aims to recover an unknown positive semide\ufb01nite matrix M of known size n\nand rank r from a \ufb01nite number of linear measurements modeled by the expression (cid:104)Ai, M(cid:105) :=\ntrace(AiM ), i = 1, . . . , m, where the symmetric matrices A1, . . . , Am of size n are known. It is\nassumed that the measurements contain noise which can modeled as bi := (cid:104)Ai, M(cid:105) + \u0001i where \u0001i is\na realization of a random variable. When the noise is Gaussian, the maximum likelihood estimate of\nM can be recast as the nonconvex optimization problem\n\n((cid:104)Ai, M(cid:105) \u2212 bi)2\n\ni=1\n\ninf\nM(cid:60)0\n\n(1)\nwhere M (cid:60) 0 stands for positive semide\ufb01nite. One can remove the rank constraint and obtain a\nconvex relaxation. It can then be solved via semide\ufb01nite programming after the reformulation of the\nobjective function in a linear way. 
However, the computational complexity of the resulting problem\nis high, which makes it impractical for large-scale problems. A popular alternative is due to Burer\nand Monteiro [18, 12]:\n\nrank(M ) = r\n\nsubject to\n\nm(cid:88)\n\n(cid:0)(cid:104)Ai, XX T(cid:105) \u2212 bi\n\n(cid:1)2\n\nm(cid:88)\n\ni=1\n\ninf\n\nX\u2208Rn\u00d7r\n\n(2)\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fThis nonlinear Least-Squares (LS) problem can be solved ef\ufb01ciently and on a large-scale with the\nGauss-Newton method for instance. It has received a lot of attention recently due to the discovery\nin [30, 10] stating that the problem admits no spurious local minima (i.e. local minima that are\nnot global minima) under certain conditions. These require adding a regularizer and satisfying the\nrestricted isometry property (RIP) [20]. We raise the question of whether this also holds in the case of\nLaplacian noise, which is a better model to account for outliers in the data. The maximum likelihood\nestimate of M can be converted to the Least-Absolute Value (LAV) optimization problem\n\n(cid:12)(cid:12)(cid:104)Ai, XX T(cid:105) \u2212 bi\n\n(cid:12)(cid:12) .\n\nm(cid:88)\n\ni=1\n\ninf\n\nX\u2208Rn\u00d7r\n\n(3)\n\nThe nonlinear problem can be solved ef\ufb01ciently using nonconvex methods (for some recent work,\nsee [36]). For example, one may adopt the famous reformulation technique for converting (cid:96)1 norms\nto linear functions subject to linear inequalities to cast the above problem as a smooth nonconvex\nquadratically-constrained quadratic program [13]. However, the analysis of this result has not been\naddressed in the literature - all ensuing papers (e.g. [29, 52, 8]) on matrix completion since the\naforementioned discovery deal with smooth objective functions.\nConsider y \u2208 Rn and assume r = 1. 
On the one hand, in the fully observable case1 with M = yyT ,\nthe above nonconvex LS problem (2) consists in solving\n\n(xixj \u2212 yiyj \u2212 \u0001i,j)2\n\n(4)\n\nfor which there are no spurious local minima with high probability when \u0001i,j are i.i.d. Gaussian\nvariables [30]. On the other hand, in the full observable case, the LAV problem (3) aims to solve\n\nn(cid:88)\n\ni,j=1\n\ninf\nx\u2208Rn\n\nn(cid:88)\n\ni,j=1\n\ninf\nx\u2208Rn\n\n|xixj \u2212 yiyj \u2212 \u0001i,j|.\n\n(5)\n\nAlthough the LS problem has nice properties with Gaussian noise, we observe that stochastic gradient\ndescent (SGD) fails to recover the matrix M = yyT in the presence of large but sparse noise. In\ncontrast, SGD can perfectly recover the matrix by solving the LAV problem even when the sparse\nnoise \u0001i,j has a large amplitude. Figures 1a and 1b show our experiments for n = 20 and n = 50 with\nthe number of noisy elements ranging from 0 to n2. See Appendix 5.1 for our experiment settings.\n\n(a) n = 20\n\n(b) n = 50\n\nFigure 1: Experiments with sparse noise\n\nUpon this LAV formulation hinges the potential of nonconvex methods to cope with sparse noise\nand with Laplacian noise. There is no result on the analysis of the local solutions of this nons-\nmooth problem in the literature even for the noiseless case. This could be due to the fact that the\noptimality conditions for the smooth reformulated version of this problem in the form of quadratically-\nconstrained quadratic program are highly nonlinear and lead to an exponential number of scenarios.\n\n1This corresponds to the case where the sensing matrices A1, . . . , An2 have all zeros terms apart from one\n\nelement which is equal to 1.\n\n2\n\n\fAs such, the goal of this paper is to prove the following proposition, which as the reader will see, is a\nsigni\ufb01cant hurdle. It addresses the matrix noiseless case and more generally the case of a tensor of\norder d \u2208 N.\nProposition 1.1. 
The function f1 : Rn \u2212\u2192 R de\ufb01ned as\n\nn(cid:88)\n\nf1(x) :=\n\n|xi1 . . . xid \u2212 yi1 . . . yid|\n\n(6)\n\nhas no spurious local minima.\n\ni1,...,id=1\n\nA direct consequence of Proposition 1.1 is that one can perform the rank-one tensor decomposition\nby minimizing the function in Proposition 1.1 using a local search algorithm (e.g. [19]). Whenever\nthe algorithm reaches a local minimum, it is a globally optimal solution leading to the desired\ndecomposition. Existing proof techniques, e.g. [29, 30, 24, 38, 5, 4, 31, 35], are not directly useful\nfor the analysis of the nonconvex and nonsmooth optimization problem stated above. In particular,\nresults on the absence of spurious local minima neural networks with a Recti\ufb01ed Linear Unit (ReLU)\nactivation function pertain to smooth objective functions (e.g. [48, 14]). The Clarke derivative [22, 23]\nprovides valuable insight (see Lemma 3.1) but it is not conclusive. In order to pursue the proof\n(see Lemma 3.2), we propose the new notion of global function. Unlike the previous approaches, it\ndoes not require one to exhibit a direction of descent. After some successive transformations, we\nreduce the problem to a linear program. It is then obvious that there are no spurious local minima.\nIncidentally, global functions provide a far simpler and shorter proof to a slightly weaker result, that\nis to say, the absence of spurious strict local minima. It eschews the Clarke derivative all together\nand instead considers a sequence of converging differentiable functions that have no spurious local\nminima (see Proposition 3.1). In fact, this technique also applies if we substitute the (cid:96)1 norm with the\n(cid:96)\u221e norm (see Proposition 3.2).\nThe paper is organized as follows. Global functions are examined in Section 2 and their application\nto tensor decomposition is discussed in Section 3. Section 4 concludes our work. 
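As an illustrative sanity check on Proposition 1.1 (our own sketch, not the authors' experiment; the dimension n = 5, the random instance y, and the step sizes below are arbitrary choices), one can evaluate f1 in the matrix case d = 2 and run a plain subgradient method on it. Consistent with the proposition, both y and -y attain the global value 0, and the method steadily decreases the objective from a random start:

```python
import numpy as np

def f1(x, y, d=2):
    # f1(x) = sum over all index tuples (i1, ..., id) of
    # |x_{i1} ... x_{id} - y_{i1} ... y_{id}|, i.e. the entrywise
    # l1 distance between the d-fold outer powers of x and y.
    Tx, Ty = x, y
    for _ in range(d - 1):
        Tx = np.multiply.outer(Tx, x)
        Ty = np.multiply.outer(Ty, y)
    return np.abs(Tx - Ty).sum()

def subgradient_step(x, y):
    # For d = 2, one subgradient of f1 at x is 2 * sign(x x^T - y y^T) x,
    # since the sign matrix is symmetric.
    S = np.sign(np.outer(x, x) - np.outer(y, y))
    return 2.0 * S @ x

rng = np.random.default_rng(0)
y = rng.standard_normal(5)
x = rng.standard_normal(5)
start = f1(x, y)
best = start
for t in range(1, 5001):
    # diminishing step sizes, as is standard for subgradient methods
    x = x - 0.01 / np.sqrt(t) * subgradient_step(x, y)
    best = min(best, f1(x, y))
```

This is only an empirical illustration for one instance; the guarantee itself is the content of Proposition 1.1.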
The proofs may be\nfound in the supplementary material (Section 5 of the supplementary material).\n\n(cid:115) n(cid:80)\n\ni=1\n\n2 Notion of global function\n\nGiven an integer n, consider the Euclidian space Rn with norm (cid:107)x(cid:107)2 :=\nS \u2282 Rn. The next two de\ufb01nitions are classical.\nDe\ufb01nition 2.1. We say that x \u2208 S is a global minimum of f : S \u2212\u2192 R if for all y \u2208 S \\ {x}, it\nholds that f (x) (cid:54) f (y).\nDe\ufb01nition 2.2. We say that x \u2208 S is a local minimum (respectively, strict local minimum) of\nf : S \u2212\u2192 R if there exists \u0001 > 0 such that for all y \u2208 S \\ {x} satisfying (cid:107)x \u2212 y(cid:107)2 (cid:54) \u0001, it holds that\nf (x) (cid:54) f (y) (respectively, f (x) < f (y)).\n\ni along with a subset\nx2\n\nWe introduce the notion of global functions below.\nDe\ufb01nition 2.3. We say that f : S \u2212\u2192 R is a global function if it is continuous and its local minima\nare all global minima. De\ufb01ne G(S) as the set of all global functions on S.\nIn the following, we compare global functions with other classes of functions in the literature,\nparticularly those that seek to generalize convex functions.\nWhen the domain S is convex, two important proper subsets of G(S) are the sets of convex and strict\nquasiconvex functions. Convex functions (respectively, strict quasiconvex [27, 26]) are such that\nf (\u03bbx + (1\u2212 \u03bb)y) (cid:54) \u03bbf (x) + (1\u2212 \u03bb)f (y) (respectively, f (\u03bbx + (1\u2212 \u03bb)y) < max{f (x), f (y)}) for\nall x, y \u2208 S (with x (cid:54)= y) and 0 < \u03bb < 1. To see why these are proper subsets, notice that the cosine\nfunction on [0, 4\u03c0] is a global function that is neither convex nor strict quasiconvex. In dimension\none, global and strict quasiconvex functions are very closely related. Indeed, when the domain is\nconvex and compact (i.e. 
an interval [a, b] where a, b \u2208 R), it can be shown that a function is strict\nquasiconvex if and only if it is global and has a unique global minimum. However, this is not true in\nhigher dimensions, as can be seen in Figure 4b in Appendix 5.2, or in the existing literature, i.e. in\n\n3\n\n\f[25] or in [9, Figure 1.1.10]. It is also not true in dimension one if we remove the assumption that the\ndomain is compact (consider f (x) := (x2 + x4)/(1 + x4) de\ufb01ned on R and illustrated in Figure 4a\nin Appendix 5.2).\nWhen the domain S is not necessarily convex, a proper subset of G(S) is the set of star-convex\nfunctions. For a star-convex function f, there exists x \u2208 S such that f (\u03bbx+(1\u2212\u03bb)y) (cid:54) \u03bbf (x)+(1\u2212\n\u03bb)f (y) for all y \u2208 S \\{x} and 0 < \u03bb < 1. Again, the cosinus function on [0, 4\u03c0] is a global function\nthat is not star-convex. Another interesting proper subset of G(S) is the set of functions for which,\ninformally, given any point, there exists a strictly decreasing path from that point to a global minimum.\nThis property is discussed in [47, P.1] (see also [28]) to study the landscape of loss functions of\nneural networks. Formally, the property is that for all x \u2208 S such that f (x) > inf y\u2208S f (y), there\nexists a continuous function g : [0, 1] \u2212\u2192 S such that g(0) = x, g(1) \u2208 argmin{f (y) | y \u2208 S}, and\nt \u2208 [0, 1] (cid:55)\u2212\u2192 f (g(t)) is strictly decreasing (i.e. f (g(t1)) > f (g(t2)) if 0 (cid:54) t1 < t2 (cid:54) 1). Not all\nglobal functions satisfy this property, as illustrated by the function in Figure 4a. For instance, there\nexists no strictly decreasing path from x = \u22123 to the global minimizer 0. However, in the funtion in\nFigure 4b in Appendix 5.2, there exists a strictly decreasing path from any point to the unique global\nminimizer. 
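The one-dimensional example f(x) := (x^2 + x^4)/(1 + x^4) mentioned above is easy to probe numerically; in the minimal sketch below (the grid resolution is our own choice), the sampled function has a single local minimizer, at 0, yet moving from x = -3 toward 0 initially increases the value, so no strictly decreasing path to the global minimizer exists from that point:

```python
import numpy as np

def f(x):
    # global function with unique local (hence global) minimum at 0,
    # but no strictly decreasing path from x = -3 to 0
    return (x ** 2 + x ** 4) / (1.0 + x ** 4)

xs = np.linspace(-6.0, 6.0, 2001)
vals = f(xs)
# strict discrete local minimizers among the interior sample points
interior = np.arange(1, len(xs) - 1)
is_min = (vals[interior] < vals[interior - 1]) & (vals[interior] < vals[interior + 1])
minima = xs[interior[is_min]]
```

The critical points of f are x = 0 and the two local maxima at x^2 = 1 + sqrt(2), which is why the grid search finds only the origin.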
One could thus think that if S is compact, or if f is coercive, then one should always\nbe able to \ufb01nd a strictly decreasing path. However, there need not exist a strictly decreasing path in\ngeneral. Consider for example the function de\ufb01ned on ([\u22121, 1] \\ {0}) \u00d7 [\u22121, 1] as follows\n\n\uf8f1\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f2\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f4\uf8f3\n\n4|x1|3(cid:16)\n\nf (x1, x2) :=\n\n+ 1\n\n(cid:17)\n\n(cid:17)\n(cid:16)\u2212 1|x1|\n(cid:16)\n(cid:17) \u2212 2\n(cid:17)\n(cid:111)\n(cid:16)\u2212 1|x1|\n(cid:17) \u2212 3\n(cid:17)\n(cid:111)\n(cid:16)\u2212 1|x1|\n(cid:17)\n(cid:16)\u2212 1|x1|\nx2 \u2212 4|x1|3(cid:16)\n\nx2\n2 +\n\nx3\n2 +\n\n+ 1\n\n+ 1\n\nsin\n\n\u22124|x1|3(1 \u2212 x2)\n\nsin\n\n(cid:110)\n12|x1|3(cid:16)\n(cid:110)\n20|x1|3(cid:16)\n(cid:17)\n(cid:16)\u2212 1|x1|\n\nsin\n\nsin\n\nsin\n\n+ 1\n\nif\n\n0 (cid:54) x2 (cid:54) 1,\n\nif \u2212 1 (cid:54) x2 < 0.\n\n(cid:17)\n\n(cid:17)\n\n+ 1\n\n10\n\n1)(x2 \u2212 4x2\n\nThe function and its differential can readily be extended continuously to [\u22121, 1] \u00d7 [\u22121, 1]. This\nis illustrated in Figure 6a in Appendix 5.2. This yields a smooth2 global function for which there\nexists no strictly decreasing path from the point x = (0, 1/2) to a global minimizer (i.e. any point\nin [\u22121, 1] \u00d7 {\u22121}). We \ufb01nd this to be rather counter-intuitive. To the best of our knowledge, no\nsuch function has been presented in past literature. Hestenes [32] considered the function de\ufb01ned on\n[\u22121, 1] \u00d7 [\u22121, 1] by f (x1, x2) := (x2 \u2212 x2\n1) (see also [9, Figure 1.1.18]). It is a global\nfunction for which the point x = (0, 0) (which is not a global minimizer) admits no direction of\ndescent, i.e. d \u2208 R2 such that t \u2208 [0, 1] (cid:55)\u2212\u2192 f (x + td) is strictly decreasing. 
However, it does\n\u221a\nadmit a strictly decreasing path to a global minimizer, i.e. t \u2208 [0, 1] (cid:55)\u2212\u2192 (\n4 t, t2), along which\nthe objective equals \u2212 9\n16 t4. This is unlike the function exhibited in Figure 6a. As a byproduct, our\nfunction shows that the generalization of quasiconvexity to non-convex domains described in [6,\nChapter 9] is a proper subset of global functions. This generalization was proposed in [41] and further\ninvestigated in [7, 33, 34, 15, 16, 17]. It consists in replacing the segment used to de\ufb01ne convexity\nand quasiconvexity by a continuous path.\nFinally, we note that there exists a characterization of functions whose local minima are global,\nwithout requiring continuity as in global functions. It is based on a certain notion of continuity\nof sublevel sets, namely lower-semicontinuity of point-to-set mappings [51, Theorem 3.3]. We\nwill see below that continuity is a key ingredient for obtaining our results. We do not require\nmore regularity precisely because one of our goals is to study nonsmooth functions. Speaking of\nwhich, observe that global functions can be nowhere differentiable, contrary to convex functions [11,\nTheorems 2.1.2 and 2.5.1]. Consider for example the global function de\ufb01ned on ]0, 1[ \u00d7 ]0, 1[ by\nn=0 s(2nx1)/2n where s(x) := minn\u2208N |x \u2212 n| is the distance to nearest\ninteger. For any \ufb01xed x2 (cid:54)= 0, the function x1 \u2208 [0, 1] (cid:55)\u2212\u2192 f (x1, x2)/|x2| is the Takagi curve\n[45, 3, 37] which is nowhere differentiable. It can easily be deduced that the bivariate function is\nnowhere differentiable. This is illustrated in Figure 6b.\nIn the following, we review some of the properties of global functions. Their proofs can be found in\nthe appendix. 
We begin by investigating the composition operation.\n\nf (x1, x2) := |2x2 \u2212 1|(cid:80)+\u221e\n\n2In fact, one could make it in\ufb01nitely differentiable by using the exponential function in the construction, but\n\nit is more cumbersome.\n\n4\n\n\fProposition 2.1 (Composition of functions). Consider f : S \u2212\u2192 R. Let \u03c6 : f (S) \u2212\u2192 R denote\na strictly increasing function where f (S) is the range of f. It holds that f \u2208 G(S) if and only if\n\u03c6 \u25e6 f \u2208 G(S).\nHowever, the set of global functions is not closed under composition of functions in general. For\ninstance, f (x) := |x| and g(x) := max(\u22121,|x|\u2212 2) are global functions on R, but f \u25e6 g is not global\nfunction on R.\nProposition 2.2 (Change of variables). Consider f : S \u2212\u2192 R, a subset S(cid:48) \u2282 Rn, and a homeomor-\nphism \u03d5 : S \u2212\u2192 S(cid:48) (i.e. continuous bijection with continuous inverse). It holds that f \u2208 G(S) if and\nonly if f \u25e6 \u03d5\u22121 \u2208 G(S(cid:48)).\nNext, we consider what happens if we have a sequence of global functions. Figure 2a shows that the\nsequence of global functions (red dotted curves) pointwise converges to a function with a spurious\nlocal minimum (blue curve). Figure 2b shows that uniform convergence also does not preserve the\nproperty of being a global function: all points on the middle part of the limit function (blue curve) are\nspurious local minima. However, it suggests that uniform convergence preserves a slightly weaker\nproperty than being a global function. Intuitively, the limit should behave like a global function\nexcept that it may have \u201c\ufb02at\u201d parts. We next formalize this intuition. To do so, we consider the\nnotions of global minimum, local minimum, and strict local minimum (De\ufb01nition 2.1 and De\ufb01nition\n2.2), which apply to points in Rn, and generalize them to subsets of Rn. 
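The composition counterexample above can be verified numerically; a minimal sketch (the sample grid is our own choice) shows that h := f o g is constant at 1 on the plateau [-1, 1], a set of spurious local minima, while the global value 0 is attained at |x| = 2:

```python
import numpy as np

def f(x):
    # global on R: unique local minimum at 0
    return np.abs(x)

def g(x):
    # global on R: its minimizers [-1, 1] all attain the global value -1
    return np.maximum(-1.0, np.abs(x) - 2.0)

def h(x):
    # the composition f o g, which fails to be a global function
    return f(g(x))

plateau = h(np.linspace(-1.0, 1.0, 21))
```

Both f and g are global functions, yet every point of the plateau is a local minimum of h with value 1 > 0, so G(R) is indeed not closed under composition.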
We will borrow the notion\nof neighborhood of a set (uniform neighborhood to be precise, see De\ufb01nition 2.5).\n\n(a) Pointwise convergence\n\n(b) Uniform convergence\n\nFigure 2: Convergence of a sequence of global functions\n\nDe\ufb01nition 2.4. We say that a subset X \u2282 S is a global minimum of f : S \u2212\u2192 R if inf X f (cid:54)\ninf S\\X f.\n\nWe note in passing the following two propositions. We will use them repeatedly in the next section.\nThe proofs are omitted as they follow directly from the de\ufb01nitions.\nProposition 2.3. Assume that the following statements are true:\n\n1. X \u2282 S is a global minimum of f;\n2. f \u2208 G(X);\n3. f does not have any local minima on S \\ X.\n\nThen, f \u2208 G(S).\nNote that the \ufb01rst assumption is needed; otherwise the function may not be global because it could\ntake a smaller value at a non local min outside X (possible when S is unbounded).\nProposition 2.4. If f : S \u2212\u2192 R is a global function on global minima (X\u03b1)\u03b1\u2208A for some index set\n\nA, then it is a global function on(cid:83)\n\n\u03b1\u2208A X\u03b1.\n\nWe proceed to generalize the de\ufb01nition of local minimum.\n\n5\n\n\fDe\ufb01nition 2.5. We say that a compact subset X \u2282 S is local minimum (respectively, strict local\nminimum) of f : S \u2212\u2192 R if there exists \u0001 > 0 such that for all x \u2208 X and for all y \u2208 S \\ X\nsatisfying (cid:107)x \u2212 y(cid:107)2 (cid:54) \u0001, it holds that f (x) (cid:54) f (y) (respectively, f (x) < f (y)).3\nThe above de\ufb01nitions are distinct from the notion of valley proposed in [47, De\ufb01nition 1]. The latter\nis de\ufb01ned as a connected component4 of a sublevel set (i.e. {x \u2208 S | f (x) (cid:54) \u03b1} for some \u03b1 \u2208 R).\nLocal minima and strict local minima need not be valleys, and vice-versa. One may easily check\nthat when the set is a point, i.e. 
X = {x} with x \u2208 S, the two de\ufb01nitions above are the same as the\nprevious de\ufb01nitions of minimum (De\ufb01nition 2.1 and De\ufb01nition 2.2). They are therefore consistent. It\nturns out that the notion of global function (De\ufb01nition 2.3) does not change when we interpret it in\nthe sense of sets. We next verify this claim.\nProposition 2.5 (Consistency of De\ufb01nition 2.3). Let f : S \u2212\u2192 R denote a continuous function. All\nlocal minima are global minima in the sense of points if only if all local minima are global minima in\nthe sense of sets.\n\nWe are ready to de\ufb01ne a slightly weaker notion than being a global function.\nDe\ufb01nition 2.6. We say that f : S \u2212\u2192 R is a weakly global function if it is continuous and if all strict\nlocal minima are global minima in the sense of sets.\n\nThe generalization from points to sets in the de\ufb01nition of a minimum is justi\ufb01ed here, as can be seen\nin Figure 3. All strict local minima are global minima in the sense of points. However, X = [a, b]\nwith a \u2248 \u22122.6 and b = \u22121 is a strict local minimum that is not a global minimum. Indeed,\ninf X f = 6 > 1 = infR\\X f. Hence, the function is not weakly global.\n\nFigure 3: All strict local minima are global minima in the sense of points but not in the sense of sets.\n\nWe next make the link with the intuition regarding the \ufb02at part in Figure 2b.\nProposition 2.6. If f : S \u2212\u2192 R is a weakly global function, then it is constant on all local minima\nthat are not global minima.\nWe are interested in functions that are potentially de\ufb01ned on all of Rn (i.e. unconstrained optimization)\nor on subsets S \u2282 Rn that are not necessarily compact (i.e. general constrained optimization). We\ntherefore need to borrow a slightly more general notion than uniform convergence [42, page 95,\nSection 3].\nDe\ufb01nition 2.7. We say that a sequence of continuous functions fk : S \u2212\u2192 R, k = 1, 2, . . . 
,\nconverges compactly towards f : S \u2212\u2192 R if for all compact subsets K \u2282 S, the restrictions of fk to\nK converge uniformly towards the restriction of f to K.\n\nWe are now ready to state a result regarding the convergence of a sequence of global functions and an\nimportant property that is preserved in the process.\n\n3Note that the neighborhood of a compact set is always uniform.\n4A subset C \u2282 S is connected if it is not equal to the union of two disjoint nonempty closed subsets of S. A\n\nmaximal connected subset (ordered by inclusion) of S is called a connected component.\n\n6\n\n\fProposition 2.7 (Compact convergence). Consider a sequence of functions (fk)k\u2208N and a function\nf, all from S \u2282 Rn to R. If\n\nfk \u2212\u2192 f compactly\n\nand if fk are global functions on S, then f is a weakly global function on S.\n\nNote that the proofs in this section are not valid if we replace the Euclidian space by an in\ufb01nite-\ndimensional metric space. Indeed, we have implicitely used the fact that the unit ball is compact in\norder for the uniform neighborhood of a minimum to be compact.\n\n3 Application to tensor decomposition\n\nGlobal functions can be used to prove the following two signi\ufb01cant results on nonlinear functions\ninvolving (cid:96)1 norm and (cid:96)\u221e norm, as explained below.\nProposition 3.1. The function f1 : Rn \u2212\u2192 R de\ufb01ned as\n\n(7)\n\n(8)\n\n(9)\n\nn(cid:88)\n\nf1(x) :=\n\n|xi1 . . . xid \u2212 yi1 . . . yid|\n\ni1,...,id=1\n\nis a weakly global function; in particular, it has no spurious strict local minima.\n\nProof. The functions\n\nn(cid:88)\n\nfp(x) :=\n\n|xi1 . . . xid \u2212 yi1 . . . yid|p\n\ni1,...,id=1\n\nfor p \u2212\u2192 1 with p > 1 form a set of global functions that converge compactly towards the function\nf1. This is illustrated in Figure 5 in Appendix 5.2 for n = d = 2 and y = (1,\u22123/4). The desired\nresult then follows from Proposition 2.7. 
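The compact convergence invoked here can be observed numerically on the instance of Figure 5, namely y = (1, -3/4) with n = d = 2. In the sketch below (the grid and the values of p are our own choices), the supremum of |f_p - f_1| over the compact set K = [-2, 2]^2 shrinks as p decreases toward 1:

```python
import numpy as np

y = np.array([1.0, -0.75])  # the instance of Figure 5 (n = d = 2)

def fp(x1, x2, p):
    # f_p(x) = sum_{i,j} |x_i x_j - y_i y_j|^p for n = 2, d = 2;
    # works elementwise on arrays, so it can be evaluated on a grid
    total = 0.0
    xs = (x1, x2)
    for i in range(2):
        for j in range(2):
            total = total + np.abs(xs[i] * xs[j] - y[i] * y[j]) ** p
    return total

g = np.linspace(-2.0, 2.0, 81)
X1, X2 = np.meshgrid(g, g)

def sup_gap(p):
    # sup over the compact set K = [-2, 2]^2 of |f_p - f_1|,
    # approximated on the sample grid
    return np.max(np.abs(fp(X1, X2, p) - fp(X1, X2, 1.0)))

gaps = [sup_gap(p) for p in (1.5, 1.1, 1.01)]
```

This is uniform convergence on one compact set only; compact convergence requires it on every compact subset, which holds here because the exponent acts continuously on the bounded residuals.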
To see why each fp is a global function, observe that fp is\ndifferentiable with the \ufb01rst-order optimality condition as follows:\n\nxi1 . . . xid\u22121(xi1 . . . xid\u22121 xi \u2212 yi1 . . . yid\u22121 yi)|xi1 . . . xid\u22121 xi \u2212 yi1 . . . yid\u22121 yi|p\u22122 = 0\n\nn(cid:88)\n\ni1,...,id\u22121=1\n\nfor all i \u2208 {1, . . . , n}. Note that each term in the sum converges towards zero if the expression inside\nthe absolute value converges towards zero, so that the equation is well-de\ufb01ned. Consider a local\nminimum x \u2208 Rn; then, x must satisfy the above \ufb01rst-order optimality condition. If yi = 0, then the\nabove equation readily yields xi = 0. This reduces the problem dimension from n variables to n \u2212 1\nvariables, so without loss of generality we may assume that yi (cid:54)= 0, i = 1, . . . , m. After a division,\nobserve that the following equation is satis\ufb01ed\n\n(cid:18) xi1 . . . xid\u22121\n\nyi1 . . . yid\u22121\n\n(cid:19)(cid:12)(cid:12)(cid:12)(cid:12) xi1 . . . xid\u22121\n\nyi1 . . . yid\u22121\n\nt \u2212 1\n\n(cid:12)(cid:12)(cid:12)(cid:12)p\u22122\n\nt \u2212 1\n\n= 0\n\nn(cid:88)\n\n|yi1 . . . yid\u22121|p xi1 . . . xid\u22121\nyi1 . . . yid\u22121\n\ni1,...,id\u22121=1\n\nfor all t \u2208 {x1/y1, . . . , xn/yn}. Each term with xi1 . . . xid\u22121 (cid:54)= 0 in the above sum is a strictly\nincreasing function of t \u2208 R since it is the derivative of the strictly convex function\n\ng(t) = |xi1 . . . xid\u22121 t \u2212 yi1 . . . yid\u22121|p.\n\n(10)\nThe point x = 0 is not a local minimum (y is a direction of descent of fp at 0), and thus x (cid:54)= 0. As a\nresult, the above sum is a strictly increasing function of t \u2208 R. Hence, it has at most one root, that is\nto say t = x1/y1 = \u00b7\u00b7\u00b7 = xn/yn. Plugging in, we \ufb01nd that td = 1. If d is odd, then x = y and if d\nis even, then x = \u00b1y. To conclude, any local minimum x is a global minimum of fp.\nProposition 3.2. 
f\u221e : Rn \u2212\u2192 R de\ufb01ned as\nmax\n\n|xi1 . . . xid \u2212 yi1 . . . yid|\n\nf\u221e(x) :=\n\n(11)\n\n1(cid:54)i1,...,id(cid:54)n\n\nis a weakly global function; in particular, it has no spurious strict local minima.\n\n7\n\n\f(cid:32) n(cid:80)\n\n(cid:33) 1\n\np\n\nProof. The functions hp(x) :=\n\nfor p \u2212\u2192 +\u221e form a set\nof global functions that converge compactly towards the function f\u221e. We know that each hp is a\nglobal function by applying Proposition 2.1 to (9) with the fact that (\u00b7)\np is increasing for nonnegative\narguments.\n\n|xi1 . . . xid \u2212 yi1 . . . yid|p\n\ni1,...,id=1\n\n1\n\nNote that the functions in Proposition 3.1 and Proposition 3.2 are a priori utterly different, yet both\nproofs are essentially the same. This highlights the usefulness of the new notion of global functions.\nRemark 3.1. The notion of weakly global functions explains that one can perform tensor decomposi-\ntion by minimizing the nonconvex and nonsmooth functions in Proposition 3.1 and Proposition 3.2\nwith a local search algorithm. Whenever the algorithm reports a strict local minimum, it is a globally\noptimal solution.\n\nIn order to strengthen the conclusion in Proposition 3.1 and to establish the absence of spurious local\nminima, we propose the following two lemmas. Using Proposition 2.3 and these two lemmas, we\narrive at the stronger result stated in Proposition 1.1.\nLemma 3.1. If x \u2208 Rn is a \ufb01rst-order stationary point of f1 in the sense of the Clarke derivative,\nthen the following statements hold:\n\n1. If yi = 0 for some i \u2208 {1, . . . , n}, then xi = 0;\n2. For all i1, . . . , id \u2208 {1, . . . , n}, it holds that xi1 ...xid\nyi1 ...yid\n\n(cid:54) 1.\n\nProof. Similar in spirit to the proof of Proposition 3.1, the ratios t \u2208 {x1/y1, . . . , xn/yn} for a\n\ufb01rst-order stationary point must all be the roots of an increasing (set-valued) \u201cstaircase function\". 
We\nthen obtain the results by analyzing the relation between the roots and the jump points of the staircase\nfunction. See Appendix 5.8 for the complete proof.\n\nNote that the above lemma only uses the \ufb01rst-order optimality condition (in the sense of Clarke\nderivative) without any direction of decent.\nRemark 3.2. One cannot show that there are no spurious local minima with only the \ufb01rst-order\noptimality condition (in the Clarke derivative sense). In fact, any x \u2208 Rn satisfying\n= 0\n(cid:54) 1 for all i1, . . . , id \u2208 {1, . . . , n}, is a \ufb01rst-order stationary point, but is not a local\nand xi1 ...xid\nyi1 ...yid\nminimum.\nLemma 3.2. If y1 . . . yn (cid:54)= 0, de\ufb01ne the set\n\n|yi| xi\n\nn(cid:80)\n\ni=1\n\nyi\n\n(cid:27)\n\n(cid:54) 1 ,\n\n\u2200 i1, . . . , id \u2208 {1, . . . , n}\n\n.\n\n(12)\n\n(cid:26)\n\nS :=\n\nx \u2208 Rn\n\n(cid:12)(cid:12)(cid:12)(cid:12) xi1 . . . xid\n\nyi1 . . . yid\n\nThen, f1 \u2208 G(S).\n\n|yi|\n\n(cid:19)d\n\n(cid:18) n(cid:80)\n\nProof. We provide a sketch here, and the complete proof is deferred to Appendix 5.9. The\n\n(cid:19)d \u2212\n(cid:18) n(cid:80)\nthat f1 is a global function on S if and only if fodd(x) = \u2212(cid:80)n\nis an even number, f is a global function if and only if feven(x) = \u2212 ((cid:80)n\nwe divide the set S(cid:48) into two subsets: S(cid:48) \u2229 {x|(cid:80)n\n\nobjective function on S is equal to f1(x) =\n. De\ufb01ne the set\n(cid:54) 1 , \u2200 i1, . . . , id \u2208 {1, . . . , n} }. When d is an odd number, the\nS(cid:48) := { x \u2208 Rn | xi1 . . . xid\ncomposition and change of variables properties of global functions (Propositions 2.1 and 2.2) imply\ni=1 |yi|xi \u2208 G(S(cid:48)). Similarly, when d\ni=1 |yi|xi)2 \u2208 G(S(cid:48)). For the\ncase when d is odd, we apply the Karush-Kuhn-Tucker conditions to restrict attention to the positive\northant and conclude by showing its association with a linear program. 
For the case when d is even,\ni=1 |yi|xi \u2264 0}.\nObserve that feven(x) is a global function on each of the subset by associating each subset with a\nlinear program. Then, Proposition 2.3 establishes the result.\n\ni=1 |yi|xi \u2265 0} and S(cid:48) \u2229 {x|(cid:80)n\n\n|yi| xi\n\ni=1\n\ni=1\n\nyi\n\nThe two previous lemmas prove Proposition 1.1; the notion of global function is used to the prove the\nlatter.\n\n8\n\n\f4 Conclusion\n\nNonconvex optimization appears in many applications, such as matrix completion/sensing, tensor\nrecovery/decomposition, and training of neural networks. For a general nonconvex function, a\nlocal search algorithm may become stuck at a local minimum that is arbitrarily worse than a global\nminimum. We develop a new notion of global functions for which all local minima are global\nminima. Using certain properties of global functions, we show that the set of these functions include\na class of nonconvex and nonsmooth functions that arise in matrix completion/sensing and tensor\nrecovery/decomposition with Laplacian noise. This paper offers a new mathematical technique for\nthe analysis of nonconvex and nonsmooth functions such as those involving (cid:96)1 norm and (cid:96)\u221e norm.\n\nAcknowledgments\n\nThis work was supported by the ONR Awards N00014-17-1-2933 ONR and N00014-18-1-2526,\nNSF Award 1808859, DARPA Award D16AP00002, and AFOSR Award FA9550- 17-1-0163. 
We\nwish to thank the anonymous reviewers for their valuable feedback, as well as Chris Dock for fruitful\ndiscussions.", "award": [], "sourceid": 1239, "authors": [{"given_name": "Cedric", "family_name": "Josz", "institution": "UC Berkeley"}, {"given_name": "Yi", "family_name": "Ouyang", "institution": "Preferred Networks"}, {"given_name": "Richard", "family_name": "Zhang", "institution": "University of California, Berkeley"}, {"given_name": "Javad", "family_name": "Lavaei", "institution": "University of California, Berkeley"}, {"given_name": "Somayeh", "family_name": "Sojoudi", "institution": "University of California, Berkeley"}]}