{"title": "Who is Afraid of Big Bad Minima? Analysis of gradient-flow in spiked matrix-tensor models", "book": "Advances in Neural Information Processing Systems", "page_first": 8679, "page_last": 8689, "abstract": "Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice in optimising high-dimensional non-convex functions and why they find good minima instead of being trapped in spurious ones.Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model.Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima.\nWe show that this is achieved by surfing on saddles that have strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.", "full_text": "Who is Afraid of Big Bad Minima? Analysis of\nGradient-Flow in a Spiked Matrix-Tensor Model\n\nStefano Sarao Mannelli\u2020, Giulio Biroli\u2021, Chiara Cammarota\u2217,\n\nFlorent Krzakala\u2021, and Lenka Zdeborov\u00e1\u2020\n\nAbstract\n\nGradient-based algorithms are effective for many machine learning tasks, but\ndespite ample recent effort and some progress, it often remains unclear why they\nwork in practice in optimising high-dimensional non-convex functions and why\nthey \ufb01nd good minima instead of being trapped in spurious ones. Here we present a\nquantitative theory explaining this behaviour in a spiked matrix-tensor model. 
Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have a strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscapes.

1 Introduction

A common theme in machine learning and optimisation is to understand the behaviour of gradient descent methods for non-convex problems with many minima. Despite the non-convexity, such methods often successfully optimise models such as neural networks, matrix completion and tensor factorisation. This has motivated a recent surge of research attempting to characterise the properties of the loss landscape that may shed some light on the reasons for such success. Without the aim of being exhaustive, these include [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11].

Over the last few years, a popular line of research has shown, for a variety of systems, that spurious local minima are not present in certain regimes of parameters. When the signal-to-noise ratio is large enough, the success of gradient descent can thus be understood by a trivialisation transition in the loss landscape: either there is only a single minimum, or all minima become "good", and no spurious minima can trap the dynamics. This is what happens, for instance, in the limit of small noise and abundance of data for matrix completion and tensor factorization [3, 8], or for some very large neural networks [1, 2].
However, it is often observed in practice that these guarantees fall short of explaining the success of gradient descent, which is empirically observed to find good minima very far from the regime under mathematical control. In fact, gradient-descent-based algorithms may be able to perform well even when spurious local minima are present, because the basins of attraction of the spurious minima may be small and the dynamics might be able to avoid them. Understanding this behaviour requires, however, a very detailed characterisation of the dynamics and of the landscape, a feat which is not yet possible in full generality.

† Institut de Physique Théorique, CNRS & CEA & Université Paris-Saclay, Saclay, France.
‡ Laboratoire de Physique de l'Ecole normale supérieure ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Diderot, Sorbonne Paris Cité, Paris, France.
∗ Department of Mathematics, King's College London, Strand, London WC2R 2LS, UK.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

A fruitful direction is the study of Gaussian functions on the N-dimensional sphere, known as p-spin spherical spin glass models in the physics literature, and as isotropic models in the Gaussian process literature [12, 13, 14, 15, 16]. In statistics and machine learning, these models have appeared following the studies of spiked matrix and tensor models [17, 18, 19, 20, 21]. In particular, a very recent work [22] showed explicitly that for a spiked matrix-tensor model the gradient-flow algorithm indeed reaches the global minimum even when spurious local minima are present, and the authors estimated numerically the corresponding regions of parameters.
In this work we consider this very same model and explain the mechanism by which the spurious local minima are avoided, and we develop a quantitative theoretical framework that we believe has a strong potential to be generic and extendable to a much broader range of models in high-dimensional inference and neural networks.

The Spiked Matrix-Tensor Model. The spiked matrix-tensor model has recently been proposed as a prototypical model for non-convex high-dimensional optimisation in which several non-trivial regimes of cost landscapes can be displayed quantitatively by tuning the parameters [23, 22]. Related mixed matrix-tensor models have also been studied in the context of text-analysis applications in [20, 19]. In this model, one aims at reconstructing a hidden vector (i.e. the spike) σ* from the observation of a noisy version of both the rank-one matrix and the rank-one tensor created from the spike. We use the following notation: bold lowercase symbols represent vectors, bold uppercase symbols represent matrices or tensors, and ⟨·,·⟩ represents the scalar product. The model is defined as follows: given a signal (or spike) σ*, uniformly sampled on the N-dimensional hyper-sphere of radius 1, one is given a tensor T and a matrix Y such that

T_{i_1…i_p} = η_{i_1…i_p} + √(N(p−1)!) σ*_{i_1}…σ*_{i_p} ,   (1)

Y_{ij} = η_{ij} + √N σ*_i σ*_j ,   (2)

where η_{i_1…i_p} and η_{ij} are Gaussian random variables of variance ∆p and ∆2, respectively. Neglecting constant terms, the maximum-likelihood estimation of the ground truth σ* corresponds to the minimization of the loss function

ℓ(σ) = −(√(N(p−1)!)/∆p) Σ_{i_1<…<i_p} T_{i_1…i_p} σ_{i_1}…σ_{i_p} − (√N/∆2) Σ_{i<j} Y_{ij} σ_i σ_j ,   (3)

where the estimator σ is constrained to the hyper-sphere ⟨σ, σ⟩ = 1. We denote by εp and ε2 the tensor and matrix contributions to the energy density ℓ/N, with ε = εp + ε2. The gradient-flow (GF) algorithm, starting from a random initial condition σ(0) uniformly drawn on the hyper-sphere, is defined as

σ̇_i(t) = −∂ℓ(σ(t))/∂σ_i − μ(t) σ_i(t) ,   (4)

where μ(t) is a Lagrange multiplier enforcing the spherical constraint ⟨σ(t), σ(t)⟩ = 1 at all times.

The first main result of this paper is a simple closed-form expression for the algorithmic threshold of gradient flow: for ∆2 < ∆2^GF the gradient flow reaches in finite time the global minimum, well correlated with the signal, while for ∆2 > ∆2^GF the algorithm remains uncorrelated with the signal for all times that do not grow as N grows. The threshold reads

1/∆2^GF = (1 + √(1 + 4(p−1)/∆p)) / 2 .   (5)

Figure 1: Cartoon illustrating the mechanism by which gradient-flow avoids spurious local minima in the spiked matrix-tensor model. As the signal-to-noise ratio (snr) 1/∆2 is increased, the spurious local minima that attract the randomly initialised GF algorithm develop a single negative direction towards the global minimum before the others, in particular the lower-cost spurious local minima, do. This has drastic consequences for the GF algorithm. In region (i), for snr < 1/∆2^GF, the algorithm goes down the landscape, eventually reaches the high-energy threshold minima and remains stuck. In region (ii), however, these threshold minima are turned into saddles with a strong negative direction towards the signal. The algorithm initially reaches these minima-turned-saddles and, surfing on the negative slope, it then turns towards the "good" minima correlated with the signal, avoiding the exponentially many spurious minima at lower energies. The main technical contribution of this paper is a quantitative description of this scenario, including a simple formula for the corresponding threshold ∆2^GF, eq. (5). As the snr is further increased, the negative direction appears in lower and lower minima until the trivialization transition in region (iii): for snr > 1/∆2^triv, all the spurious minima have been turned into saddles.

We contrast the threshold ∆2^GF with the threshold ∆2^triv established in [22], below which the energy landscape does not present any spurious local minima. Note that ∆2^GF is less than ∆2^AMP = 1 [23], below which the best known algorithm, specifically approximate message passing, works.

The second main result of this paper is the insight we obtain on the behaviour of the gradient-flow in the loss landscape, summarised in Fig. 1. The key point is to consider the fate of the spurious local minima that attract the GF algorithm when the signal-to-noise ratio snr = 1/∆2 is increased. As the snr increases, these minima turn into saddles with a single negative direction towards the signal (a phenomenon that we analyse in the next section, and that turns out to be linked to the BBP transition [25] in random matrix theory), all that well before all the other spurious local minima disappear. We present two ways to quantify this insight:

(a) We use the Kac-Rice formula for the number of stationary points, as derived for the present model in [22]. In [22] this formula is used to quantify the region with no spurious local minima. Here we focus on a BBP-type phase transition that is crucial in the derivation of this formula and deduce the GF threshold (5) from it.

(b) We use the CHSCK equations [26, 27] for a closed-form description of the behaviour of the gradient-flow, as derived and numerically solved in [23, 22]. Building on the dynamical theory of mean-field spin glasses we determine precisely when and how the algorithm escapes the manifold of zero overlap with the signal, leading again to the threshold (5).

Both these arguments are derived using reasoning common in theoretical physics.
From a mathematically rigorous point of view the threshold (5) remains a conjecture, and its rigorous proof is an interesting challenge for future work. We note that both the Kac-Rice approach [16] and the CHSCK equations [28] have been made rigorous in closely related problems.

Moreover, our reasoning leading to the formula for ∆2^GF is in retrospect not restricted to the present model, and ends up being far simpler than the full CHSCK analysis or the full Kac-Rice complexity calculation. The mechanism of converging to the threshold states and then escaping from them can also be tested numerically even in models that are not amenable to analytic description. This makes the results of the present work widely testable and applicable to settings other than the present model. Therefore, we will investigate analytically and numerically other models and real-data-based learning in order to validate this theory and to understand its limitations.

Figure 2: The phase diagram shows the different regions of gradient-flow behaviour for the spiked matrix-tensor model with p = 3. In the region shaded in red (light and dark), GF does not correlate with the signal, while it does in the grey and green regions. In the dark-red region obtaining correlation with the signal is information-theoretically impossible [23]. The possible region is divided by a red dashed line; below that line even the best known algorithms are unable to obtain correlation with the signal [23]. The green region is characterised by a trivial landscape, i.e. all the spurious minima disappear [22]. The grey region is where gradient-flow succeeds in converging despite the presence of spurious minima. The black crosses mark the gradient-flow thresholds obtained numerically in [22]; they agree perfectly with our theoretical prediction of the threshold (5), marked by the grey dashed line.
The circles in colours are points that we will use to illustrate the different features of these regions.

2 Probing the Landscape by the Kac-Rice Method

The statistical properties of the landscape associated with the loss function (3) can be studied by the Kac-Rice method. The technique was developed in the 40s in [29, 30], is nicely summarized in [31], and was applied to high-dimensional problems in [13, 14]. Earlier applications in high dimensions appear in the statistical physics literature, see [32] for an overview, and have recently been extended in [24]. The quantities of interest are the number of critical points at a given energy, N(εp, ε2), and the Hessian matrix evaluated at those critical points. We analyse the logarithm of N(εp, ε2), called the complexity. Since the complexity is a random quantity, we compute its annealed upper bound Σ_a(εp, ε2) = ln E[N(εp, ε2)], along the lines of [16, 22]. We have also computed its typical (quenched) value Σ_q(εp, ε2) = E[ln N(εp, ε2)] along the lines of [24], i.e. non-rigorously, using the replica symmetry assumption (see SM Sec. A). In what follows we focus on the complexity of stationary points with no correlation with the signal, in which case analytical and numerical arguments (see SM Sec. A.2.1) indicate that Σ_a(εp, ε2) and Σ_q(εp, ε2) are either very close numerically or possibly equal; by Jensen's inequality Σ_q ≤ Σ_a, which implies a bound on the asymptotic distribution. Thus, in the following, we will simply refer to the complexity Σ(εp, ε2) without further specification.

In the Kac-Rice analysis the statistics of the Hessian, H, of critical points plays a key role. It was shown in [22], and the argument is reproduced in the SM Sec. A.2, that H has a simple form for the loss (3).
It is an (N−1) × (N−1) matrix formed by the sum of three contributions: a random matrix W_{N−1} belonging to the Gaussian orthogonal ensemble (GOE), a matrix proportional to the identity, and a rank-one projector in the direction of the signal. Under the choice of a convenient reference frame [22], where the first element of the basis e_1 is aligned with the component of the estimator tangent to the signal, the expression of H for critical points with null overlap m with the signal and with energies εp and ε2 reads

H = √(Q''(1)) [ W_{N−1} + t I_{N−1} − θ e_1 e_1^T ] ,   (6)

with Q(x) = x^p/(p∆p) + x²/(2∆2), t = −(p εp + 2 ε2)/√(Q''(1)), and θ = Q''(0)/√(Q''(1)). The normalisation of W_{N−1} is chosen such that Tr E[W_{N−1}²]/(N−1) = 1.

Figure 3: Complexity curves for the number of critical points at overlap m = 0 and fixed ∆p = 1.0, for (from left to right) 1/∆2 = 2.7, 2.3, 1.9. The lines are dotted when the complexity is dominated by critical points with an extensive number of negative eigenvalues, dashed when only one eigenvalue is negative, and full when the points have only positive (or null) eigenvalues, i.e. they are minima. The complexity of the minima is drawn in full lines with the same colours, and it merges with the complexity of stationary points when the minima become dominant.

The Fate of the Spurious: The initial condition for the gradient-flow algorithm is a random configuration σ_0 uniformly drawn on the hyper-sphere. Such an initial condition clearly belongs to the large manifold of configurations uncorrelated with the ground-truth signal. We aim to investigate how the gradient flow manages to escape from this initial manifold.
For this purpose we focus on the properties of the landscape in the subspace where the overlap with the signal is zero, m = 0. In Fig. 3, we plot the complexity at m = 0 as a function of the energy ε,

Σ(ε) = sup_{εp, ε2 : εp + ε2 = ε} Σ(εp, ε2) |_{m=0} ,

for the points 1/∆2 = 1.9, 2.3, 2.7 and ∆p = 1.0 (p = 3), which are marked with circles of the corresponding colour in Fig. 2. We use discontinuous lines for the complexity of critical points that have at least one negative direction, and full lines for the complexity of local minima. A finding of [22], which holds for any value of ∆p, is that for small 1/∆2 the majority of critical points with zero overlap with the signal at low enough energies are spurious minima; they disappear upon increasing 1/∆2 above a ∆p-dependent value 1/∆2^triv, corresponding to the green region of Fig. 2. In this part of the phase diagram there are no spurious minima and the global minimum is correlated with the signal; this is an "easy" landscape for gradient flow, which is therefore expected to succeed there. The main open question concerns the behavior for smaller values of 1/∆2: when does the existence of spurious minima, appearing in panels (b) and (c) of Fig. 3, start to be harmful to gradient flow?

In order to answer this question, we investigate more closely the nature of the spurious minima at different energies. We focus in particular on their Hessian, which plays a crucial role in understanding which spurious minima have the largest basin of attraction and, hence, can trap the algorithm. Specifically, the Hessian (6) at a critical point is the sum of a GOE matrix, a multiple t of the identity, and a rank-one perturbation of strength θ(∆2, ∆p).
For low signal-to-noise ratio, i.e. large ∆p and large ∆2, the spectrum of (6) is a shifted Wigner semicircle with support [√(Q''(1))(−2 + t), √(Q''(1))(2 + t)]. The most numerous critical points at fixed energy ε are characterized by a t(ε) that is a monotonically decreasing function of ε (Fig. A.2 in the SM). Moving towards higher energies, the spectrum of the Hessian shifts to the left, which indicates smaller curvature and wider minima. The transition between minima and saddles takes place at the threshold energy at which the left edge of the Wigner semicircle touches zero, i.e. when t(εth) = 2; the numerical value is obtained in the Appendix Sec. A.2.3. Putting the above findings together, minima at ε = εth are the most numerous and the marginally stable ones. Therefore, they are the natural candidates for having the largest basin of attraction and the highest influence on the randomly initialised algorithm. This reasonable guess is at the basis of the theory of glassy dynamics in physics [27]. We take it as a working hypothesis for now, and we confirm it analytically and numerically in what follows. We also remark that this phenomenology can be tested in models that, contrary to the present one, are not amenable to an analytic description.

Finally, the effect of the perturbation (the third contribution to the RHS of (6)) on the support of the spectrum is negligible as long as θ ≤ 1, as follows from the work on low-rank perturbations of random GOE matrices [25].
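This low-rank perturbation effect can be checked numerically. The sketch below is illustrative and not from the paper's code (the matrix size and the values of θ are arbitrary choices): it samples a GOE matrix normalised to have semicircle support [−2, 2] and adds the rank-one term −θ e_1 e_1^T appearing in the bracket of eq. (6). For θ ≤ 1 the smallest eigenvalue sticks to the bulk edge −2, while for θ > 1 an isolated eigenvalue detaches near −(θ + 1/θ).

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
A = rng.standard_normal((N, N))
W = (A + A.T) / np.sqrt(2 * N)      # GOE matrix, semicircle spectrum on [-2, 2]

def smallest_eigenvalue(theta):
    """Smallest eigenvalue of W - theta * e1 e1^T."""
    H = W.copy()
    H[0, 0] -= theta                # rank-one perturbation along e1
    return np.linalg.eigvalsh(H)[0]

print(smallest_eigenvalue(0.5))     # stays at the bulk edge, about -2
print(smallest_eigenvalue(2.0))     # detached eigenvalue, about -(2 + 1/2) = -2.5
```

In the Hessian (6) this spectrum is further multiplied by √(Q''(1)) and shifted by t; for the threshold minima, where t = 2, the isolated eigenvalue −θ − 1/θ + t becomes negative exactly when θ > 1.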
Thus, when the signal-to-noise ratio is small we expect that the configuration σ(t) slowly approaches at long times the ubiquitous "threshold minima" characterised by energy εth and zero overlap with the signal.

The last missing piece is unveiling what makes those minima unstable at large snr. We show below that it is a transition, called BBP (Baik-Ben Arous-Péché) [25], which takes place in the spectrum of the Hessian when ∆2 and ∆p become small enough so that θ becomes larger than one, as in Fig. 3. After the transition an eigenvalue, equal to √(Q''(1)) (−θ − θ^{−1} + t), pops out on the left of the Wigner semicircle, and its corresponding eigenvector develops a finite overlap with the signal [25]. This implies that, as soon as the isolated eigenvalue pops out, an unstable downward direction towards the signal emerges in the landscape around the threshold minima, at which point the algorithmic threshold for gradient flow takes place. Interestingly, many other spurious minima at lower energy also undergo the BBP transition, but they remain stable for longer, as the isolated eigenvalue is positive when it pops out from the semicircle. In conclusion, our analysis of the landscape suggests a dynamical transition for signal estimation by gradient flow given by

θ = Q''(0)/√(Q''(1)) = 1 ,   (7)

which leads to a very simple expression for the transition line ∆2^GF, Eq. (5). This theoretical prediction is shown in Fig. 2 as a dashed grey line: the agreement with the numerical estimation from [22] (black crosses) is perfect.

Our analysis unveils that the key property of the loss landscape determining the performance of the gradient-flow algorithm is the (in)stability, in the direction of the signal, of the minima with the largest basin of attraction.
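With Q(x) = x^p/(p∆p) + x²/(2∆2), as defined below eq. (6), the condition (7) is a quadratic equation for the snr 1/∆2 and can be solved in closed form. A minimal numerical check (the parameter values are those of Figs. 2-3 at ∆p = 1):

```python
import numpy as np

def gf_threshold_snr(p, delta_p):
    """Solve theta = Q''(0)/sqrt(Q''(1)) = 1 for the snr u = 1/Delta_2, with
    Q''(0) = 1/Delta_2 and Q''(1) = (p-1)/Delta_p + 1/Delta_2, i.e.
    u**2 - u - (p-1)/delta_p = 0 (positive root)."""
    return 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * (p - 1) / delta_p))

print(gf_threshold_snr(3, 1.0))   # -> 2.0, the grey dashed line of Fig. 2 at Delta_p = 1
```

The returned value is 1/∆2^GF; for p = 3 and ∆p = 1 the gradient-flow threshold sits at snr = 2, consistent with the four dynamical runs discussed in Sec. 3.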
These threshold minima are the most numerous and the highest in energy, a condition that likely holds for many high-dimensional estimation problems. The other spurious minima, which are potentially more trapping than the threshold ones and still stable at the algorithmic transition just derived, are actually completely innocuous, since in the large-N limit a random initial condition does not lie in their basins of attraction with probability one. This benign role of very bad spurious minima might appear surprising; it is due to the high dimensionality of the non-convex loss function. Indeed, it does not happen in finite-dimensional cases, in which a random initial condition has instead a finite probability to fall into bad minima if those are present.

3 Probing the Gradient-Flow Dynamics

3.1 Closed-Form Dynamical Equations

In the large-N limit the gradient-flow dynamics of the spiked matrix-tensor model can be analysed using techniques originally developed in statistical-physics studies of spin glasses [33, 34, 35] and later put on a rigorous basis in [28]. Three observables play a key role in this theory:

(i) The overlap (or correlation) of the estimator at two different times: C(t, t') = ⟨σ(t), σ(t')⟩.

(ii) The change (or response) of the estimator at time t due to an infinitesimal perturbation in the loss at time t', i.e. ℓ → ℓ + ⟨σ(t'), h(t')⟩ in Eq. (4): R(t, t') = Σ_{i=1}^{N} δσ_i(t)/δh_i(t') |_{h=0}.

(iii) The average overlap of the estimator with the ground truth: m(t) = ⟨σ*, σ(t)⟩.

For N → ∞ the above quantities converge to a non-fluctuating limit, i.e. they concentrate with respect to the randomness in the initial condition and in the generative process, and satisfy closed equations.
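At finite N these observables can also be estimated by simulating the flow directly. The sketch below is our own illustration, not the paper's code: the relative weights of the two loss terms, the normalised step size, and the parameter values are ad-hoc choices (the weights are picked so that the loss per site of an overlap-m configuration is close to −Q(m)). It runs a crude projected-gradient discretisation of the flow on the sphere at a snr well above the threshold (5) and monitors the final overlap m = ⟨σ*, σ⟩.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 60, 3
delta_p, delta_2 = 0.5, 0.1                    # snr 1/Delta_2 = 10, well above threshold

# Spike on the unit sphere and observations as in eqs. (1)-(2)
x = rng.standard_normal(N)
x /= np.linalg.norm(x)
T = rng.normal(0.0, np.sqrt(delta_p), (N, N, N)) \
    + np.sqrt(N * 2.0) * np.einsum('i,j,k->ijk', x, x, x)    # sqrt(N (p-1)!) spike
Y = rng.normal(0.0, np.sqrt(delta_2), (N, N)) + np.sqrt(N) * np.outer(x, x)

# Ad-hoc weights: loss(s) = -a <T, s x s x s> - b <Y, s x s>
a = np.sqrt(N / 2.0) / (3.0 * delta_p)
b = np.sqrt(N) / (2.0 * delta_2)

def gradient(s):
    g = -a * (np.einsum('ijk,j,k->i', T, s, s)
              + np.einsum('ijk,i,k->j', T, s, s)
              + np.einsum('ijk,i,j->k', T, s, s))
    g -= b * (Y + Y.T) @ s
    return g

s = rng.standard_normal(N)
s /= np.linalg.norm(s)                         # random start, overlap O(1/sqrt(N))
for _ in range(800):                           # normalised-step descent on the sphere
    g = gradient(s)
    s -= 0.05 * g / np.linalg.norm(g)
    s /= np.linalg.norm(s)                     # spherical constraint

print(abs(s @ x))                              # final |m|; large at this snr
```

At this strong signal-to-noise ratio the landscape is essentially trivial and the descent correlates with the signal; near the thresholds of Fig. 2 one would instead need large N and careful time discretisation, which is what the closed equations below avoid.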
Figure 4: Right panel: energy as a function of time for the set of parameters indicated by small circles in Fig. 2. The horizontal dotted lines correspond to the value of the threshold energy εth, as derived both from the Kac-Rice approach in Appendix Sec. A.2.3 and from the large-time behaviour of the dynamics in Appendix Sec. B.2.6. Left panel: eigenvalue distribution of the Hessian of the threshold states for the same set of parameters. When 1/∆2 becomes larger than 2 an isolated eigenvalue appears; it has been highlighted using vertical arrows. Concomitantly, the energy as a function of time first approaches the plateau and eventually departs from it, reaching the energy of the global minimum.

Following the works of Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) [34, 35] and their recent extension to the spiked matrix-tensor model [23, 22], the above quantities satisfy

∂/∂t C(t, t') = −μ(t) C(t, t') + Q'(m(t)) m(t') + ∫_0^t R(t, t'') Q''(C(t, t'')) C(t', t'') dt'' + ∫_0^{t'} R(t', t'') Q'(C(t, t'')) dt'' ,   (8)

∂/∂t R(t, t') = −μ(t) R(t, t') + ∫_{t'}^{t} R(t, t'') Q''(C(t, t'')) R(t'', t') dt'' ,   (9)

d/dt m(t) = −μ(t) m(t) + Q'(m(t)) + ∫_0^t R(t, t'') m(t'') Q''(C(t, t'')) dt'' ,   (10)

μ(t) = Q'(m(t)) m(t) + ∫_0^t R(t, t'') [ Q'(C(t, t'')) + Q''(C(t, t'')) C(t, t'') ] dt'' ,   (11)

with initial conditions
C(t, t) = 1 ∀t, and R(t, t') = 0 for t < t' with lim_{t'→t^−} R(t, t') = 1 ∀t. The additional function μ(t), and its associated equation, are due to the spherical constraint; μ(t) plays the role of a Lagrange multiplier and guarantees that the solution of the previous equations is such that C(t, t) = 1. The derivation of these equations can be found in [23] and in the SM Sec. B. It is obtained using a heuristic theoretical-physics approach and can very plausibly be made fully rigorous by generalising the work of [28, 36].

This set of equations can be solved numerically as described in [23]. The numerical estimation of the algorithmic threshold of gradient-flow, reproduced in Fig. 2, was obtained in [22]. We have also directly simulated the gradient flow Eq. (4) and compared the result to the one obtained from solving Eqs. (8-11). As shown in the SM Sec. C, for N = 65535 we find a very good agreement even for this large yet finite size.

Surfing on saddles: Armed with the dynamical equations, we now confirm the prediction of the threshold (5) based on the Kac-Rice-type landscape analysis. In the SM we check that the minima trapping the dynamics are indeed the marginally stable ones (t = 2), see Figs. B.1 and B.2 in the SM, and we show that the energy can be expressed in terms of C, R and m. In the right panel of Fig. 4 we plot the energy as a function of time obtained from the numerical solution of Eqs. (8-11) for 1/∆2 = 1.5, 1.9, 2.3, 2.7 and ∆p = 1 (same points and colour code as Figs. 2 and 3). For the two smaller values of 1/∆2 the energy converges to a plateau value at εth (dotted line), whereas for 1/∆2 = 2.3, 2.7 the energy plateaus close to εth but then eventually drifts away and reaches a lower value, corresponding to the global minimum correlated with the signal. This behaviour can be understood in terms of the spectral properties of the Hessian (6) of the minima trapping the dynamics.

In the left panel of Fig. 4 we plot the corresponding density of eigenvalues of H for the same values of 1/∆2 and ∆p used in the right panel. This is an illustration of the dynamical phenomenon explained in the previous section: when the signal-to-noise ratio is large enough, the threshold minima become unstable because a negative eigenvalue, associated with a downward direction toward the signal, emerges. In this case σ(t) first seems to converge to the threshold minima and then, at long times, drifts away along the unstable direction. The larger the signal-to-noise ratio, the more unstable the downward direction and, hence, the shorter the intermediate trapping time.

3.2 Gradient-flow Threshold from Dynamical Theory

We now show that the very same prediction (5) for the algorithmic threshold of gradient-flow can be directly obtained by analysing the dynamical equations (8-11), without directly using results from the Kac-Rice analysis, thus establishing a firm and novel connection between the behaviour of the gradient-flow algorithm and Kac-Rice landscape approaches.

For small signal-to-noise ratios, when m remains zero at all times, the dynamical equations (8-11) are identical to the well-known ones of spin-glass theory; for reviews see [37, 38]. These equations have been studied extensively for decades in statistical physics and a range of results about their behaviour has been established. Here we describe the results which are important for our analysis and devote the SM Sec. B.2 to a more extended presentation.
It was shown analytically in [34] that the behaviour of the dynamics at large times is captured by an asymptotic solution of Eqs. (8-11) that verifies several remarkable properties. The ones of interest to us are that, for t and t' large:

(i) C(t, t') = 1 when t − t' is finite; C(t, t') becomes less than one when t − t' diverges with t and t'.

(ii) R(t, t') = R_TTI(t − t') + R_ag(t, t'), where TTI stands for time-translational invariance and ag stands for aging. Here R_TTI(t − t') goes to zero on a finite time-scale, whereas R_ag(t, t') varies on timescales diverging with t and t'. Moreover, R_ag(t, t') verifies the so-called "weak long-term memory" property: for any finite t_0, ∫_{t−t_0}^{t} R_ag(t, t'') dt'' is arbitrarily small. We refer to this functional form of R(t, t') as the aging ansatz, adopting the physics terminology.

These properties are confirmed to hold by our numerical solution, see for instance Fig. B.3 in the SM. The interpretation of these dynamical properties is that at long times σ(t) decreases in the energy landscape and approaches the marginally stable minima. Concomitantly, the dynamics slows down and takes place along the almost flat directions associated with the vanishing eigenvalues of the Hessian.

We recall that in the previous paragraphs we assumed null correlation with the signal, m = 0. In order to find the algorithmic threshold beyond which the gradient-flow develops a positive correlation, we study the instability of the aging solution as a function of the signal-to-noise ratio. Our strategy is to start with an arbitrarily small overlap, m(0) = δ, and determine whether it grows at long times, thus indicating an instability towards the signal.
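Anticipating the outcome of this stability analysis (eqs. (12)-(14) below, together with the asymptotic values μ_∞ = 2√(Q''(1)) and ∫R_TTI = 1/√(Q''(1)) quoted at the end of this section), the growth rate of the small overlap reduces to Λ = Q''(0) − √(Q''(1)). A quick check of its sign at the four snr values of Fig. 4 (p = 3, ∆p = 1):

```python
import numpy as np

p, delta_p = 3, 1.0

def growth_rate(snr):                  # snr = 1/Delta_2
    q2_at_0 = snr                      # Q''(0) = 1/Delta_2
    q2_at_1 = (p - 1) / delta_p + snr  # Q''(1)
    return q2_at_0 - np.sqrt(q2_at_1)  # Lambda = -mu_inf + Q''(0) + Q''(1) * R

for snr in (1.5, 1.9, 2.3, 2.7):
    print(snr, growth_rate(snr))       # negative below snr = 2, positive above
```

The sign change at snr = 2 matches the two trapped and two escaping trajectories in the right panel of Fig. 4.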
Since the initial condition for the overlap is uncorrelated with the signal, for sufficiently small δ the functions C and R reach their asymptotic form before m becomes of order one. We can thus plug the asymptotic aging ansatz for R into the dynamical equation for m:

d/dt m(t) = −μ(t) m(t) + Q'(m(t)) + ∫_0^t R_TTI(t − t'') Q''(1) m(t'') dt'' + ∫_0^t R_ag(t, t'') Q''(C_ag(t, t'')) m(t'') dt'' .   (12)

In the linear approximation the solution has the form m(t) = δ exp(Λt), and we assume Λ arbitrarily small since we want to find the algorithmic threshold where Λ = 0. The term Q'(m(t)) becomes Q''(0) m(t). Since m(t) has an arbitrarily slow evolution, whereas R_TTI(t − t'') relaxes to zero on a finite timescale, the first integral on the RHS of eq. (12) simplifies to

δ exp(Λt) Q''(1) ∫_0^t R_TTI(t − t'') exp(−Λ(t − t'')) dt'' ≃ m(t) Q''(1) R ,

where R = ∫_0^t R_TTI(t − t'') dt'' does not depend on t (since t can be taken arbitrarily large and R_TTI(t − t'') relaxes to zero on finite time-scales).
The contribution of the last term of eq. (12) reads:

δ exp(Λt) ∫_0^t R_ag(t, t″) Q″(C_ag(t, t″)) e^{−Λ(t−t″)} dt″ = m(t) ∫_0^t R_ag(t, t″) Q″(C_ag(t, t″)) e^{−Λ(t−t″)} dt″ .

Using that Q″(C_ag(t, t″)) is bounded by Q″(1) and that Λ cuts off the integral on a time t_0 ∼ 1/Λ that does not diverge with t, we can use the "weak long-term memory" property to conclude that this last term is arbitrarily small compared to m(t) and hence can be neglected with respect to the previous ones. Collecting all the pieces together we find:

d/dt m(t) = [−µ_∞ + Q″(0) + Q″(1) R] m(t) + O(δ²) .   (13)

This is solved by m(t) = δ exp(Λt) with Λ = −µ_∞ + Q″(0) + Q″(1) R, which therefore justifies a posteriori our assumption of exponential growth. The condition for the instability of the aging solution towards the signal solution is therefore given by

0 = −µ_∞ + Q″(0) + Q″(1) R .   (14)

From the analysis of the asymptotic aging solution presented in SM Sec. B.2 one finds that µ_∞ = 2√Q″(1) and R = 1/√Q″(1), therefore obtaining Q″(0) = √Q″(1). This condition is the same one found from the study of the landscape, and thus leads to the transition line eq. (5).

Acknowledgments

We thank Pierfrancesco Urbani for proof-checking the draft and for many related discussions.
We would also like to thank the Kavli Institute for Theoretical Physics (KITP) for welcoming us during part of this research, with the support of the National Science Foundation under Grant No. NSF PHY-1748958. We acknowledge funding from the ERC under the European Union's Horizon 2020 Research and Innovation Programme Grant Agreement 714608-SMiLe; from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement CoSP No. 823748; from the French National Research Agency (ANR) grant PAIL; and from the Simons Foundation (#454935, Giulio Biroli).

References

[1] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016.

[2] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016.

[3] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

[4] C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization. ICLR 2017, 2017. Preprint arXiv:1611.01540.

[5] Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.

[6] Dohyung Park, Anastasios Kyrillidis, Constantine Caramanis, and Sujay Sanghavi. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial Intelligence and Statistics, pages 65–74, 2017.

[7] Simon S Du, Jason D Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer CNN: Don't be afraid of spurious local minima.
In International Conference on Machine Learning, pages 1338–1347, 2018.

[8] Rong Ge and Tengyu Ma. On the optimization landscape of tensor decompositions. In Advances in Neural Information Processing Systems, pages 3653–3663, 2017.

[9] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, pages 1233–1242, 2017.

[10] Haihao Lu and Kenji Kawaguchi. Depth creates no bad local minima. arXiv preprint arXiv:1702.08580, 2017.

[11] Shuyang Ling, Ruitu Xu, and Afonso S Bandeira. On the landscape of synchronization networks: A perspective from nonconvex optimization. arXiv preprint arXiv:1809.11083, 2018.

[12] David J Gross and Marc Mézard. The simplest spin glass. Nuclear Physics B, 240(4):431–452, 1984.

[13] Yan V Fyodorov. Complexity of random energy landscapes, glass transition, and absolute value of the spectral determinant of random matrices. Physical Review Letters, 92(24):240601, 2004.

[14] Antonio Auffinger, Gérard Ben Arous, and Jiří Černý. Random matrices and complexity of spin glasses. Communications on Pure and Applied Mathematics, 66(2):165–201, 2013.

[15] Levent Sagun, V Ugur Guney, Gerard Ben Arous, and Yann LeCun. Explorations on high dimensional landscapes. arXiv preprint arXiv:1412.6615, 2014.

[16] Gerard Ben Arous, Song Mei, Andrea Montanari, and Mihai Nica. The landscape of the spiked tensor model. arXiv preprint arXiv:1711.05424, 2017.

[17] Iain M Johnstone and Arthur Yu Lu. On consistency and sparsity for principal components analysis in high dimensions. Journal of the American Statistical Association, 104(486):682–693, 2009.

[18] Yash Deshpande and Andrea Montanari. Information-theoretically optimal sparse PCA.
In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 2197–2201. IEEE, 2014.

[19] Emile Richard and Andrea Montanari. A statistical model for tensor PCA. In Advances in Neural Information Processing Systems, pages 2897–2905, 2014.

[20] Animashree Anandkumar, Rong Ge, Daniel Hsu, Sham M Kakade, and Matus Telgarsky. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.

[21] Anima Anandkumar, Yuan Deng, Rong Ge, and Hossein Mobahi. Homotopy analysis for tensor PCA. COLT 2017, arXiv:1610.09322, 2016.

[22] Stefano Sarao Mannelli, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Passed & spurious: Descent algorithms and local minima in spiked matrix-tensor models. In International Conference on Machine Learning, pages 4333–4342, 2019.

[23] Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, and Lenka Zdeborová. Marvels and pitfalls of the Langevin algorithm in noisy high-dimensional inference. arXiv preprint arXiv:1812.09066, 2018.

[24] Valentina Ros, Gerard Ben Arous, Giulio Biroli, and Chiara Cammarota. Complex energy landscapes in spiked-tensor and simple glassy models: Ruggedness, arrangements of local minima, and phase transitions. Physical Review X, 9(1):011003, 2019.

[25] Jinho Baik, Gérard Ben Arous, and Sandrine Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5):1643–1697, 2005.

[26] A Crisanti, H Horner, and H-J Sommers. The spherical p-spin interaction spin-glass model. Zeitschrift für Physik B Condensed Matter, 92(2):257–271, 1993.

[27] Leticia F Cugliandolo and Jorge Kurchan. Analytical solution of the off-equilibrium dynamics of a long-range spin-glass model.
Physical Review Letters, 71(1):173, 1993.

[28] Gerard Ben Arous, Amir Dembo, and Alice Guionnet. Cugliandolo-Kurchan equations for dynamics of spin-glasses. Probability Theory and Related Fields, 136(4):619–660, 2006.

[29] M. Kac. On the average number of real roots of a random algebraic equation. Bull. Amer. Math. Soc., 49(4):314–320, 1943.

[30] Stephen O Rice. Mathematical analysis of random noise. Bell System Technical Journal, 23(3):282–332, 1944.

[31] Robert J Adler and Jonathan E Taylor. Random fields and geometry. Springer Science & Business Media, 2009.

[32] Tommaso Castellani and Andrea Cavagna. Spin glass theory for pedestrians. Journal of Statistical Mechanics: Theory and Experiment, 2005:P05012, 2005.

[33] Andrea Crisanti and H-J Sommers. The spherical p-spin interaction spin glass model: the statics. Zeitschrift für Physik B Condensed Matter, 87(3):341–354, 1992.

[34] Leticia F Cugliandolo and Jorge Kurchan. On the out-of-equilibrium relaxation of the Sherrington-Kirkpatrick model. Journal of Physics A: Mathematical and General, 27(17):5749, 1994.

[35] Leticia F Cugliandolo and Jorge Kurchan. Weak ergodicity breaking in mean-field spin-glass models. Philosophical Magazine B, 71(4):501–514, 1995.

[36] Pompiliu Manuel Zamfir. Limiting dynamics for spherical models of spin glasses with magnetic field. arXiv preprint arXiv:0806.3519, 2008.

[37] Jean-Philippe Bouchaud, Leticia F Cugliandolo, Jorge Kurchan, and Marc Mézard. Out of equilibrium dynamics in spin-glasses and other glassy systems. Spin Glasses and Random Fields, pages 161–223, 1998.

[38] Leticia F Cugliandolo. Course 7: Dynamics of glassy systems. In Slow Relaxations and Nonequilibrium Dynamics in Condensed Matter, pages 367–521.
Springer, 2003.