{"title": "Solving Random Quadratic Systems of Equations Is Nearly as Easy as Solving Linear Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 739, "page_last": 747, "abstract": "This paper is concerned with finding a solution x to a quadratic system of equations y_i = |< a_i, x >|^2, i = 1, 2, ..., m. We prove that it is possible to solve unstructured quadratic systems in n variables exactly from O(n) equations in linear time, that is, in time proportional to reading and evaluating the data. This is accomplished by a novel procedure, which starting from an initial guess given by a spectral initialization procedure, attempts to minimize a non-convex objective. The proposed algorithm distinguishes from prior approaches by regularizing the initialization and descent procedures in an adaptive fashion, which discard terms bearing too much influence on the initial estimate or search directions. These careful selection rules---which effectively serve as a variance reduction scheme---provide a tighter initial guess, more robust descent directions, and thus enhanced practical performance. Further, this procedure also achieves a near-optimal statistical accuracy in the presence of noise. Finally, we demonstrate empirically that the computational cost of our algorithm is about four times that of solving a least-squares problem of the same size.", "full_text": "Solving Random Quadratic Systems of Equations\n\nIs Nearly as Easy as Solving Linear Systems\n\nYuxin Chen\n\nDepartment of Statistics\n\nStanford University\nStanford, CA 94305\n\nyxchen@stanfor.edu\n\nDepartment of Mathematics and Department of Statistics\n\nEmmanuel J. Cand\u00e8s\n\nStanford University\nStanford, CA 94305\n\ncandes@stanford.edu\n\nAbstract\n\nThis paper is concerned with \ufb01nding a solution x to a quadratic system of equa-\ntions yi = |(cid:104)ai, x(cid:105)|2, i = 1, . . . , m. 
We demonstrate that it is possible to solve unstructured random quadratic systems in n variables exactly from O(n) equations in linear time, that is, in time proportional to reading the data {a_i} and {y_i}. This is accomplished by a novel procedure which, starting from an initial guess given by a spectral initialization procedure, attempts to minimize a nonconvex objective. The proposed algorithm distinguishes itself from prior approaches by regularizing the initialization and descent procedures in an adaptive fashion, discarding terms that bear too much influence on the initial estimate or search directions. These careful selection rules---which effectively serve as a variance reduction scheme---provide a tighter initial guess, more robust descent directions, and thus enhanced practical performance. Further, this procedure also achieves near-optimal statistical accuracy in the presence of noise. Empirically, we demonstrate that the computational cost of our algorithm is about four times that of solving a least-squares problem of the same size.

1 Introduction

Suppose we are given a response vector y = [y_i]_{1≤i≤m} generated from a quadratic transformation of an unknown object x ∈ R^n/C^n, i.e.

    y_i = |⟨a_i, x⟩|^2,    i = 1, · · · , m,    (1)

where the feature/design vectors a_i ∈ R^n/C^n are known. In other words, we acquire measurements of the inner product ⟨a_i, x⟩ with all signs/phases missing. Can we hope to recover x from this nonlinear system of equations?

This problem can be recast as a quadratically constrained quadratic program (QCQP), which subsumes as special cases various classical combinatorial problems with Boolean variables (e.g. the NP-complete stone problem [1, Section 3.4.1]). In the physical sciences, this problem is commonly referred to as phase retrieval [2]; the reason is that in many imaging applications (e.g. 
X-ray crystallography, diffraction imaging, microscopy) it is infeasible to record the phases of the diffraction patterns, so that we can only record |Ax|^2, where x is the electric field of interest. Moreover, this problem finds applications in estimating mixtures of linear regressions, since one can transform the latent membership variables into missing phases [3]. Despite its importance across various fields, solving the quadratic system (1) is combinatorial in nature and, in general, NP-complete.

To be more realistic albeit more challenging, the acquired samples are almost always corrupted by some amount of noise, namely,

    y_i ≈ |⟨a_i, x⟩|^2,    i = 1, · · · , m.    (2)

For instance, in imaging applications the data are best modeled by Poisson random variables

    y_i ~ind. Poisson(|⟨a_i, x⟩|^2),    i = 1, · · · , m,    (3)

which captures the variation in the number of photons detected by a sensor. While we shall pay special attention to the Poisson noise model due to its practical relevance, the current work aims to accommodate general---or even deterministic---noise structures.

1.1 Nonconvex optimization

Assuming independent samples, the first attempt is to seek the maximum likelihood estimate (MLE):

    minimize_z  −Σ_{i=1}^m ℓ(z; y_i),    (4)

where ℓ(z; y_i) represents the log-likelihood of a candidate z given the outcome y_i. As an example, under the Poisson data model (3), one has (up to some constant offset)

    ℓ(z; y_i) = y_i log(|a_i^* z|^2) − |a_i^* z|^2.    (5)

Computing the MLE, however, is in general intractable, since ℓ(z; y_i) is not concave in z.

Fortunately, under unstructured random systems, the problem is not as ill-posed as it might seem, and is solvable via convenient convex programs with optimal statistical guarantees [4–12]. 
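Before turning to algorithms, the measurement models above are simple to instantiate numerically. The following is a minimal sketch (our own illustration, not the authors' code) that generates a random quadratic system under a Gaussian design, together with its Poisson-noise variant matching (3):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 800                    # number of unknowns and equations (m = 8n)
x = rng.standard_normal(n)         # ground-truth signal
A = rng.standard_normal((m, n))    # rows are the design vectors a_i
y_clean = (A @ x) ** 2             # noiseless measurements y_i = |<a_i, x>|^2
y_poisson = rng.poisson(y_clean)   # Poisson samples, as in the imaging model
```

Solving for x given only (A, y_clean) or (A, y_poisson) is exactly the problem studied in this paper.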
The basic paradigm is to lift the quadratically constrained problem into a linearly constrained problem by introducing a matrix variable X = xx^* and relaxing the rank-one constraint. Nevertheless, working with the auxiliary matrix variable significantly increases the computational complexity, which exceeds the order of n^3 and is prohibitively expensive for large-scale data.

This paper follows a different route, which attempts recovery by minimizing the nonconvex objective (4) or (5) directly (e.g. [2, 13–19]). The main incentive is the potential computational benefit, since this strategy operates directly upon vectors instead of lifting decision variables to higher dimensions. Among this class of procedures, one natural candidate is the family of gradient-descent-type algorithms developed with respect to the objective (4). This paradigm can be regarded as performing a variant of stochastic gradient descent over the random samples {(y_i, a_i)}_{1≤i≤m} as an approximation to maximizing the population likelihood L(z) := E_{(y,a)}[ℓ(z; y)]. While in general nonconvex optimization falls short of performance guarantees, a recently proposed approach called Wirtinger Flow (WF) [13] promises efficiency under random features. In a nutshell, WF initializes the iterate via a spectral method, and then successively refines the estimate via the following update rule:

    z^(t+1) = z^(t) + (µ_t/m) Σ_{i=1}^m ∇ℓ(z^(t); y_i),

where z^(t) denotes the tth iterate of the algorithm, and µ_t is the learning rate. Here, ∇ℓ(z; y_i) represents the Wirtinger derivative with respect to z, which reduces to the ordinary gradient in the real setting. Under Gaussian designs, WF (i) allows exact recovery from O(n log n) noise-free quadratic equations [13];^1 (ii) recovers x up to ε-accuracy within O(mn^2 log(1/ε)) time (or flops) [13]; and (iii) is stable and converges to the MLE under Gaussian noise [20]. Despite these intriguing guarantees, the computational complexity of WF still far exceeds the best that one can hope for. Moreover, its sample complexity is a logarithmic factor away from the information-theoretic limit.

1.2 This paper: Truncated Wirtinger Flow

This paper develops a novel linear-time algorithm, called Truncated Wirtinger Flow (TWF), that achieves near-optimal statistical accuracy. The distinguishing features include a careful initialization procedure and a more adaptive gradient flow. Informally, TWF entails two stages:

1. Initialization: compute an initial guess z^(0) by means of a spectral method applied to a subset T_0 of the data {y_i} that do not bear too much influence on the spectral estimates;
2. Loop: for 0 ≤ t < T,

    z^(t+1) = z^(t) + (µ_t/m) Σ_{i∈T_{t+1}} ∇ℓ(z^(t); y_i)    (6)

for some index set T_{t+1} ⊆ {1, · · · , m} over which the ∇ℓ(z^(t); y_i) are well-controlled.

^1 f(n) = O(g(n)) or f(n) ≲ g(n) (resp. f(n) ≳ g(n)) means there exists a constant c > 0 such that |f(n)| ≤ c|g(n)| (resp. |f(n)| ≥ c|g(n)|). f(n) ≍ g(n) means f(n) and g(n) are orderwise equivalent.

Figure 1: (a) Relative errors of CG and TWF vs. iteration count, where n = 1000 and m = 8n. (b) Relative MSE vs. SNR in dB, where n = 100. The curves are shown for two settings: TWF for solving quadratic equations (blue), and MLE had we observed additional phase information (green).

We highlight three aspects of the proposed algorithm, with details deferred to Section 2.

(a) In contrast to WF and other gradient descent variants, we regularize both the initialization and the gradient flow in a more cautious manner by operating only upon some iteration-varying index sets T_t. 
The main point is that enforcing such careful selection rules leads to a tighter initialization and more robust descent directions.

(b) TWF sets the learning rate µ_t in a far more liberal fashion (e.g. µ_t ≡ 0.2 under suitable conditions), as opposed to the situation in WF, which recommends µ_t = O(1/n).

(c) Computationally, each iterative step mainly consists in calculating {∇ℓ(z; y_i)}, which is inexpensive and can often be performed in linear time, that is, in time proportional to evaluating the data and the constraints. Take the real-valued Poisson likelihood (5) for example:

    ∇ℓ(z; y_i) = 2 { (y_i / |a_i^T z|^2) a_i a_i^T z − a_i a_i^T z } = 2 ( (y_i − |a_i^T z|^2) / (a_i^T z) ) a_i,    1 ≤ i ≤ m,

which essentially amounts to two matrix-vector products. To see this, rewrite

    Σ_{i∈T_{t+1}} ∇ℓ(z^(t); y_i) = A^T v,    v_i = { 2 (y_i − |a_i^T z^(t)|^2) / (a_i^T z^(t)),  i ∈ T_{t+1};  0, otherwise },

where A := [a_1, · · · , a_m]^T. Hence, A z^(t) gives v and A^T v the desired truncated gradient.

1.3 Numerical surprises

The power of TWF is best illustrated by numerical examples. Since x and e^{−jφ} x are indistinguishable given y, we evaluate the solution based on a metric that disregards the global phase [13]:

    dist(z, x) := min_{φ∈[0,2π)} ‖e^{−jφ} z − x‖.    (7)

In the sequel, TWF operates according to the Poisson log-likelihood (5), and takes µ_t ≡ 0.2.

We first compare the computational efficiency of TWF for solving quadratic systems with that of conjugate gradient (CG) for solving least-squares problems. As is well known, CG is among the most popular methods for solving large-scale least-squares problems, and hence offers a natural benchmark. 
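The two-matrix-vector-product structure described in (c) can be sketched as follows; this is a minimal numpy illustration (not the authors' code), with the truncation set taken as given:

```python
import numpy as np

def truncated_gradient(A, y, z, keep):
    """Sum of grad l(z; y_i) over kept indices, real-valued Poisson model.

    `keep` is a boolean mask playing the role of the index set T_{t+1}.
    The whole computation reduces to the two products A z and A^T v.
    """
    Az = A @ z                                      # first matrix-vector product
    v = np.where(keep, 2.0 * (y - Az**2) / Az, 0.0)
    return A.T @ v                                  # second matrix-vector product

# tiny usage: at z = x with noiseless data the gradient vanishes
rng = np.random.default_rng(0)
n, m = 20, 160
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
g = truncated_gradient(A, y, x, np.ones(m, dtype=bool))
```

This is why the per-iteration cost of TWF is comparable to one CG iteration on a least-squares problem of the same size.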
We run TWF and CG respectively over the following two problems:

    (a) find x ∈ R^n s.t. b_i = a_i^T x,   1 ≤ i ≤ m,
    (b) find x ∈ R^n s.t. b_i = |a_i^T x|, 1 ≤ i ≤ m,

where m = 8n, x ~ N(0, I), and a_i ~ind. N(0, I). This yields a well-conditioned design matrix A, for which CG converges extremely fast [21]. The relative estimation errors of both methods are reported in Fig. 1(a), where TWF is seeded by 10 power iterations. The iteration counts are plotted in different scales so that 4 TWF iterations are tantamount to 1 CG iteration. Since each iteration of CG and TWF involves two matrix-vector products Az and A^T v, the numerical plots lead to a surprisingly positive observation for such an unstructured design:

    Even when all phase information is missing, TWF is capable of solving a quadratic system of equations only about 4 times^2 slower than solving a least-squares problem of the same size!

Figure 2: Recovery after (top) truncated spectral initialization, and (bottom) 50 TWF iterations.

The numerical surprise extends to noisy quadratic systems. Under the Poisson data model, Fig. 1(b) displays the relative mean-square error (MSE) of TWF when the signal-to-noise ratio (SNR) varies; here, the relative MSE and the SNR are defined as^3

    MSE := dist^2(x̂, x) / ‖x‖^2    and    SNR := 3‖x‖^2,    (8)

where x̂ is an estimate. Both SNR and MSE are displayed on a dB scale (i.e. the values of 10 log_10(SNR) and 10 log_10(MSE) are plotted). To evaluate the quality of the TWF solution, we compare it with the MLE applied to an ideal problem where the phases (i.e. {φ_i = sign(a_i^T x)}) are revealed a priori. 
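The phase-invariant metric (7) used throughout these comparisons is straightforward to compute; a sketch (our own, not the authors' code), noting that in the real case the ambiguity reduces to a sign flip:

```python
import numpy as np

def dist(z, x):
    """Distance up to a global phase, in the spirit of (7)."""
    if np.iscomplexobj(z) or np.iscomplexobj(x):
        phi = np.angle(np.vdot(x, z))          # optimal global phase shift
        return np.linalg.norm(np.exp(-1j * phi) * z - x)
    return min(np.linalg.norm(z - x), np.linalg.norm(z + x))

x = np.array([1.0 + 1.0j, 2.0 - 0.5j])
xr = np.array([3.0, -1.0])
```

The relative MSE in (8) is then simply dist(z, x)**2 / np.linalg.norm(x)**2.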
The presence of this precious side information gives away the phase retrieval problem and allows us to compute the MLE via convex programming. As illustrated in Fig. 1(b), TWF solves the quadratic system with nearly the best possible accuracy, since it only incurs an extra 1.5 dB loss compared to the ideal MLE with all true phases revealed.

To demonstrate the scalability of TWF on real data, we apply TWF to a 320×1280 image. Consider a type of physically realizable measurements called coded diffraction patterns (CDP) [22], where

    y^(l) = |F D^(l) x|^2,    1 ≤ l ≤ L,    (9)

where m = nL, |z|^2 denotes the vector of entrywise squared magnitudes, and F is the DFT matrix. Here, D^(l) is a diagonal matrix whose diagonal entries are randomly drawn from {1, −1, j, −j}, which models signal modulation before diffraction. We generate L = 12 masks for measurements, and run TWF on a MacBook Pro with a 3 GHz Intel Core i7. We run 50 truncated power iterations and 50 TWF iterations, which in total cost 43.9 seconds for each color channel. The relative errors after initialization and TWF iterations are 0.4773 and 2.2 × 10^{−5}, respectively; see Fig. 2.

1.4 Main results

We corroborate the preceding numerical findings with theoretical support. For concreteness, we assume TWF proceeds according to the Poisson log-likelihood (5). We suppose the samples (y_i, a_i) are independently and randomly drawn from the population, and model the random features a_i as

    a_i ~ N(0, I_n).    (10)

To start with, the following theorem confirms the performance of TWF under noiseless data.

^2 Similar phenomena arise in many other experiments we have conducted (e.g. when the sample size m ranges from 6n to 20n). In fact, this factor seems to improve slightly as m/n increases.
^3 To justify the definition of SNR, note that the signals and noise are captured by µ_i = (a_i^T x)^2 and y_i − µ_i, respectively. The SNR is thus given by Σ_{i=1}^m µ_i^2 / Σ_{i=1}^m Var[y_i] = Σ_{i=1}^m |a_i^T x|^4 / Σ_{i=1}^m |a_i^T x|^2 ≈ 3m‖x‖^4 / (m‖x‖^2) = 3‖x‖^2.

Theorem 1 (Exact recovery). Consider the noiseless case (1) with an arbitrary x ∈ R^n. Suppose that the learning rate µ_t is either taken to be a constant µ_t ≡ µ > 0 or chosen via a backtracking line search. Then there exist some constants 0 < ρ, ν < 1 and µ_0, c_0, c_1, c_2 > 0 such that with probability exceeding 1 − c_1 exp(−c_2 m), the TWF estimates (Algorithm 1) obey

    dist(z^(t), x) ≤ ν(1 − ρ)^t ‖x‖,    ∀t ∈ N,    (11)

provided that m ≥ c_0 n and µ ≤ µ_0. As discussed below, we can take µ_0 ≈ 0.3.

Theorem 1 justifies two intriguing properties of TWF. To begin with, TWF recovers the ground truth exactly as soon as the number of equations is on the same order as the number of unknowns, which is information-theoretically optimal. More surprisingly, TWF converges at a geometric rate, i.e. it achieves ε-accuracy (i.e. dist(z^(t), x) ≤ ε‖x‖) within at most O(log(1/ε)) iterations. As a result, the time taken for TWF to solve the quadratic system is proportional to the time taken to read the data, which confirms the linear-time complexity of TWF. These results outperform the theoretical guarantees of WF [13], which requires O(mn^2 log(1/ε)) runtime and O(n log n) sample complexity. Notably, the performance gain of TWF is the result of the key algorithmic changes. 
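The O(log(1/ε)) iteration count follows directly from the geometric contraction (11); a one-line derivation:

```latex
% Iteration count implied by the geometric contraction (11):
\nu (1-\rho)^{t}\,\|x\| \;\le\; \epsilon\,\|x\|
\quad\Longleftrightarrow\quad
t \;\ge\; \frac{\log(\nu/\epsilon)}{\log\frac{1}{1-\rho}}
\;=\; O\!\left(\log\frac{1}{\epsilon}\right),
```

since ρ and ν are constants independent of n and m.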
Rather than maximizing the data usage at each step, TWF exploits the samples at hand in a more selective manner, which effectively trims away those components that are too influential on either the initial guess or the search directions, thus reducing the volatility of each movement. With a tighter initial guess and better-controlled search directions in place, TWF is able to proceed with a more aggressive learning rate. Taken collectively, these efforts enable the appealing convergence property of TWF.

Next, we turn to more realistic noisy data by accounting for a general additive noise model:

    y_i = |⟨a_i, x⟩|^2 + η_i,    1 ≤ i ≤ m,    (12)

where η_i represents a noise term. The stability of TWF is demonstrated in the theorem below.

Theorem 2 (Stability). Consider the noisy case (12). Suppose that the learning rate µ_t is either taken to be a positive constant µ_t ≡ µ or chosen via a backtracking line search. If

    ‖η‖_∞ ≤ c_1 ‖x‖^2,    m ≥ c_0 n,    µ ≤ µ_0,    (13)

then with probability at least 1 − c_2 exp(−c_3 m), the TWF estimates (Algorithm 1) satisfy

    dist(z^(t), x) ≲ ‖η‖ / (√m ‖x‖) + (1 − ρ)^t ‖x‖,    ∀t ∈ N    (14)

for all x ∈ R^n. Here, 0 < ρ < 1 and µ_0, c_0, c_1, c_2, c_3 > 0 are some universal constants.

Alternatively, if one regards the SNR for the model (12) as follows

    SNR := (Σ_{i=1}^m |⟨a_i, x⟩|^4) / ‖η‖^2 ≈ 3m‖x‖^4 / ‖η‖^2,    (15)

then we immediately arrive at another form of performance guarantee stated in terms of SNR:

    dist(z^(t), x) ≲ (1/√SNR) ‖x‖ + (1 − ρ)^t ‖x‖,    ∀t ∈ N.    (16)

As a consequence, the relative error of TWF reaches O(SNR^{−1/2}) within a logarithmic number of iterations. It is worth emphasizing that the above stability guarantee is deterministic in the noise, holding for any noise structure obeying (13). Encouragingly, this statistical accuracy is nearly un-improvable, as revealed by a minimax lower bound that we provide in the supplemental materials.

We pause to remark that several other nonconvex methods have been proposed for solving quadratic equations, which exhibit intriguing empirical performance. A partial list includes the error reduction schemes by Fienup [2], alternating minimization [14], the Kaczmarz method [17], and generalized approximate message passing [15]. However, most of them fall short of theoretical support. The analytical difficulty arises because these methods employ the same samples in each iteration, which introduces complicated dependencies across all iterates. To circumvent this issue, [14] proposes a sample-splitting version of the alternating minimization method that employs fresh samples in each iteration. Despite the mathematical convenience, the sample complexity of this approach is O(n log^3 n + n log^2 n log(1/ε)), which is a factor of O(log^3 n) from optimal and is empirically largely outperformed by the variant that reuses all samples. In contrast, our algorithm uses the same pool of samples all the time and is therefore practically appealing. Besides, the approach in [14] does not come with provable stability guarantees. 
Numerically, each iteration of Fienup's algorithm (or alternating minimization) involves solving a least-squares problem, and the algorithm converges in tens or hundreds of iterations. This is computationally more expensive than TWF, whose computational complexity is merely about 4 times that of solving a least-squares problem.

2 Algorithm: Truncated Wirtinger Flow

This section explains the basic principles of Truncated Wirtinger Flow. For notational convenience, we denote A := [a_1, · · · , a_m]^T and A(M) := {a_i^T M a_i}_{1≤i≤m} for any M ∈ R^{n×n}.

2.1 Truncated gradient stage

In the case of independent real-valued data, the descent direction of the WF updates---which is the gradient of the Poisson log-likelihood---can be expressed as follows:

    Σ_{i=1}^m ∇ℓ(z; y_i) = Σ_{i=1}^m ν_i a_i,    ν_i := 2 (y_i − |a_i^T z|^2) / (a_i^T z),    (17)

where ν_i represents the weight assigned to each feature a_i.

Figure 3: The locus of −(1/2)∇ℓ_i(z) for all unit vectors a_i. The red arrows depict those directions with large weights.

Unfortunately, the gradient of this form is non-integrable and hence uncontrollable. To see this, consider any fixed z ∈ R^n. The typical value of min_{1≤i≤m} |a_i^T z| is on the order of (1/m)‖z‖, leading to some excessively large weights ν_i. Notably, an underlying premise for a nonconvex procedure to succeed is to ensure that all iterates reside within a basin of attraction, that is, a neighborhood surrounding x within which x is the unique stationary point of the objective. When a gradient is unreasonably large, the iterative step might overshoot and end up leaving this basin of attraction. 
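The (1/m)‖z‖ scaling of the smallest |a_i^T z|, which is what makes the weights in (17) blow up, is easy to observe numerically; a quick sanity check of our own (not from the paper):

```python
import numpy as np

# For Gaussian a_i, the m values a_i^T z are i.i.d. N(0, ||z||^2), so the
# smallest |a_i^T z| concentrates around ||z||/m; the corresponding weight
# nu_i in (17) can therefore be enormous.
rng = np.random.default_rng(1)
n, m = 100, 1000
z = rng.standard_normal(n)
z /= np.linalg.norm(z)                 # unit norm, so ||z||/m = 1e-3
A = rng.standard_normal((m, n))
smallest = np.abs(A @ z).min()         # typically on the order of 1/m
```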
Consequently, WF moving along the preceding direction might not come close to the truth unless z is already very close to x. This is observed in numerical simulations.^4

TWF addresses this challenge by discarding terms having too high a leverage on the search direction; this is achieved by regularizing the weights ν_i via appropriate truncation. Specifically,

    z^(t+1) = z^(t) + (µ_t/m) ∇ℓ_tr(z^(t)),    ∀t ∈ N,    (18)

where ∇ℓ_tr(·) denotes the truncated gradient given by

    ∇ℓ_tr(z) := Σ_{i=1}^m 2 ( (y_i − |a_i^T z|^2) / (a_i^T z) ) a_i 1_{E_1^i(z) ∩ E_2^i(z)}    (19)

for some appropriate truncation criteria specified by E_1^i(·) and E_2^i(·). In our algorithm, we take E_1^i(z) and E_2^i(z) to be two collections of events given by

    E_1^i(z) := { α_z^lb ‖z‖ ≤ |a_i^T z| ≤ α_z^ub ‖z‖ },    (20)
    E_2^i(z) := { |y_i − |a_i^T z|^2| ≤ (α_h/m) ‖y − A(zz^T)‖_1 · |a_i^T z| / ‖z‖ },    (21)

where α_z^lb, α_z^ub, α_h are predetermined truncation thresholds. In words, we drop components whose sizes fall outside some confidence range---a range where the magnitudes of both the numerator and denominator of ν_i are comparable to their respective mean values.

This paradigm could be counter-intuitive at first glance, since one might expect the larger terms to be better aligned with the desired search direction. The issue, however, is that the large terms are extremely volatile and could dominate all other components in an undesired way. 
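A compact sketch of the selection rules (20)–(21) together with one update (18), in the real-valued case with the default thresholds quoted in the text (our own illustration, not the authors' code):

```python
import numpy as np

def twf_step(A, y, z, mu=0.2, a_lb=0.3, a_ub=5.0, a_h=5.0):
    """One truncated gradient update (18) using the events (20)-(21)."""
    m = len(y)
    Az = A @ z
    nz = np.linalg.norm(z)
    resid = np.abs(y - Az**2)
    E1 = (np.abs(Az) >= a_lb * nz) & (np.abs(Az) <= a_ub * nz)   # event (20)
    E2 = resid <= (a_h / m) * resid.sum() * np.abs(Az) / nz      # event (21)
    v = np.where(E1 & E2, 2.0 * (y - Az**2) / Az, 0.0)
    return z + (mu / m) * (A.T @ v)                              # update (18)

# usage: with noiseless data, the ground truth is a fixed point
rng = np.random.default_rng(2)
n, m = 30, 240
x = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x) ** 2
z_next = twf_step(A, y, x)
```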
In contrast, TWF makes use of only gradient components of typical sizes, which slightly increases the bias but remarkably reduces the variance of the descent direction. We expect such gradient regularization and variance reduction schemes to be beneficial for solving a broad family of nonconvex problems.

^4 For complex-valued data, WF converges empirically, as min_i |a_i^* z| is much larger than in the real case.

2.2 Truncated spectral initialization

A key step to ensure meaningful convergence is to seed TWF with some point inside the basin of attraction, which proves crucial for other nonconvex procedures as well. An appealing initialization procedure is the spectral method [13, 14], which initializes z^(0) as the leading eigenvector of Ỹ := (1/m) Σ_{i=1}^m y_i a_i a_i^T. This is based on the observation that for any fixed unit vector x,

    E[Ỹ] = I + 2xx^T,

whose principal component is exactly x with an eigenvalue of 3.

Unfortunately, the success of this method requires a sample complexity exceeding n log n. To see this, recall that max_i y_i ≈ 2 log m. Letting k = arg max_i y_i and ã_k := a_k/‖a_k‖, one can derive

    ã_k^T Ỹ ã_k ≥ ã_k^T (m^{−1} a_k a_k^T y_k) ã_k ≈ (2n log m)/m,

which dominates x^T Ỹ x ≈ 3 unless m ≳ n log m. As a result, ã_k is closer to the principal component of Ỹ than x when m ≍ n. This drawback turns out to be a substantial practical issue.

This issue can be remedied if we preclude those data y_i with large magnitudes when running the spectral method. Specifically, we propose to initialize z^(0) as the leading eigenvector of

    Y := (1/m) Σ_{i=1}^m y_i a_i a_i^T 1_{{|y_i| ≤ α_y^2 ((1/m) Σ_{l=1}^m y_l)}}    (26)

followed by proper scaling so as to ensure ‖z^(0)‖ ≈ ‖x‖. As illustrated in Fig. 4, the empirical advantage of the truncated spectral method is increasingly more remarkable as n grows.

Figure 4: Relative initialization error when a_i ~ N(0, I).

Algorithm 1 Truncated Wirtinger Flow.
Input: Measurements {y_i | 1 ≤ i ≤ m} and feature vectors {a_i | 1 ≤ i ≤ m}; truncation thresholds α_z^lb, α_z^ub, α_h, and α_y satisfying

    0 < α_z^lb ≤ 0.5,    α_z^ub ≥ 5,    α_h ≥ 5,    α_y ≥ 3    (25)

(by default, α_z^lb = 0.3, α_z^ub = α_h = 5, and α_y = 3).
Initialize z^(0) to be √(mn / Σ_{i=1}^m ‖a_i‖^2) · λ_0 z̃, where λ_0 = √((1/m) Σ_{i=1}^m y_i) and z̃ is the leading eigenvector of

    Y = (1/m) Σ_{i=1}^m y_i a_i a_i^* 1_{{|y_i| ≤ α_y^2 λ_0^2}}.

Loop: for t = 0 : T do

    z^(t+1) = z^(t) + (2µ_t/m) Σ_{i=1}^m ( (y_i − |a_i^* z^(t)|^2) / (z^(t)* a_i) ) a_i 1_{E_1^i ∩ E_2^i},    (22)

where

    E_1^i := { α_z^lb ≤ (√n/‖a_i‖) · |a_i^* z^(t)| / ‖z^(t)‖ ≤ α_z^ub },    (23)
    E_2^i := { |y_i − |a_i^* z^(t)|^2| ≤ α_h K_t (√n/‖a_i‖) · |a_i^* z^(t)| / ‖z^(t)‖ },    K_t := (1/m) Σ_{l=1}^m |y_l − |a_l^* z^(t)|^2|.    (24)

Output z^(T).

2.3 Choice of algorithmic parameters

One important implementation detail is the learning rate µ_t. There are two alternatives that work well in both theory and practice:

1. Fixed size. Take µ_t ≡ µ for some constant µ > 0. As long as µ is not too large, this strategy always works. 
Under the condition (25), our theorems hold for any positive constant µ < 0.28.

2. Backtracking line search with truncated objective. This strategy performs a line search along the descent direction and determines an appropriate learning rate that guarantees sufficient improvement with respect to the truncated objective. Details are deferred to the supplement.

Other algorithmic details to specify are the truncation thresholds α_h, α_z^lb, α_z^ub, and α_y. The present paper isolates a concrete set of combinations as given in (25). In all theory and numerical experiments presented in this work, we assume that the parameters fall within this range.

Figure 5: (a) Empirical success rates for real Gaussian design; (b) empirical success rates for complex Gaussian design; (c) relative MSE (averaged over 100 runs) vs. SNR for Poisson data.

3 More numerical experiments and discussion

We conduct more extensive numerical experiments to corroborate our main results and verify the applicability of TWF to practical problems. For all experiments conducted herein, we take a fixed step size µ_t ≡ 0.2, employ 50 power iterations for initialization and T = 1000 gradient iterations. The truncation levels are taken to be the default values α_z^lb = 0.3, α_z^ub = α_h = 5, and α_y = 3.

We first apply TWF to a sequence of noiseless problems with n = 1000 and varying m. Generate the object x at random, and produce the feature vectors a_i in two different ways: (1) a_i ~ind. N(0, I); (2) a_i ~ind. N(0, I) + jN(0, I). A Monte Carlo trial is declared a success if the estimate x̂ obeys dist(x̂, x)/‖x‖ ≤ 10^{−5}. Fig. 5(a) and 5(b) illustrate the empirical success rates of TWF (averaged over 100 runs for each m) for noiseless data, indicating that m ≥ 5n and m ≥ 4.5n are often sufficient under real and complex Gaussian designs, respectively. For the sake of comparison, we simulate the empirical success rates of WF, with the step size µ_t = min{1 − e^{−t/330}, 0.2} as recommended by [13]. As shown in Fig. 5, TWF outperforms WF under random Gaussian features, implying that TWF exhibits either a better convergence rate or enhanced phase transition behavior.

Next, we empirically evaluate the stability of TWF under noisy data. Set n = 1000, produce a_i ~ind. N(0, I), and generate y_i according to the Poisson model (3). Fig. 5(c) shows the relative mean-square error---on the dB scale---with varying SNR (cf. (8)). As can be seen, the empirical relative MSE scales inversely proportionally to the SNR, which matches our stability guarantees in Theorem 2 (since on the dB scale, the slope is about −1 as predicted by the theory (16)).

While this work focuses on the Poisson-type objective for concreteness, the proposed paradigm carries over to a variety of nonconvex objectives, and might have implications for solving other problems that involve latent variables, e.g. matrix completion [23–25], sparse coding [26], dictionary learning [27], and mixture problems (e.g. [28, 29]). We conclude this paper with an example on estimating mixtures of linear regressions. Imagine

    y_i ≈ { a_i^T β_1, with probability p;  a_i^T β_2, else },    1 ≤ i ≤ m,    (27)

where β_1, β_2 are unknown.

Figure 6: Empirical success rate for mixed regression (p = 0.5).

It has been shown in [3] that in the 
It has been shown in [3] that in the noiseless case, the ground truth satisfies

    f_i(β_1, β_2) := y_i² + 0.5 a_i^T (β_1 β_2^T + β_2 β_1^T) a_i − a_i^T (β_1 + β_2) y_i = 0,    1 ≤ i ≤ m,

which forms a set of quadratic constraints (in particular, if one further knows that β_1 = −β_2, then this reduces to the form (1)). Running TWF with the nonconvex objective Σ_{i=1}^m f_i²(z_1, z_2) (with the assistance of the 1-D grid search proposed in [29], applied right after truncated initialization) yields accurate estimates of β_1 and β_2 under minimal sample complexity, as illustrated in Fig. 6.

Acknowledgments

E. C. is partially supported by NSF under grant CCF-0963835 and by the Math + X Award from the Simons Foundation. Y. C. is supported by the same NSF grant.

References

[1] A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization, volume 2. 2001.
[2] J. R. Fienup. Phase retrieval algorithms: a comparison. Applied Optics, 21:2758–2769, 1982.
[3] Y. Chen, X. Yi, and C. Caramanis. A convex formulation for mixed regression with two components: Minimax optimal rates. In Conference on Learning Theory (COLT), 2014.
[4] E. J. Candès, T. Strohmer, and V. Voroninski. PhaseLift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1017–1026, 2013.
[5] I. Waldspurger, A. d'Aspremont, and S. Mallat.
Phase recovery, MaxCut and complex semidefinite programming. Mathematical Programming, 149(1-2):47–81, 2015.
[6] Y. Shechtman, Y. C. Eldar, A. Szameit, and M. Segev. Sparsity based sub-wavelength imaging with partially incoherent light via quadratic compressed sensing. Optics Express, 19(16), 2011.
[7] E. J. Candès and X. Li. Solving quadratic equations via PhaseLift when there are about as many equations as unknowns. Foundations of Computational Mathematics, 14(5):1017–1026, 2014.
[8] H. Ohlsson, A. Yang, R. Dong, and S. Sastry. CPRL: an extension of compressive sensing to the phase retrieval problem. In Advances in Neural Information Processing Systems (NIPS), 2012.
[9] Y. Chen, Y. Chi, and A. J. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Transactions on Information Theory, 61(7):4034–4059, 2015.
[10] T. Cai and A. Zhang. ROP: Matrix recovery via rank-one projections. Annals of Statistics.
[11] K. Jaganathan, S. Oymak, and B. Hassibi. Recovery of sparse 1-D signals from the magnitudes of their Fourier transform. In IEEE ISIT, pages 1473–1477, 2012.
[12] D. Gross, F. Krahmer, and R. Kueng. A partial derandomization of PhaseLift using spherical designs. Journal of Fourier Analysis and Applications, 21(2):229–266, 2015.
[13] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval via Wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory, 61(4):1985–2007, April 2015.
[14] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. NIPS, 2013.
[15] P. Schniter and S. Rangan. Compressive phase retrieval via generalized approximate message passing. IEEE Transactions on Signal Processing, 63(4):1043–1055, Feb 2015.
[16] A. Repetti, E. Chouzenoux, and J.-C. Pesquet. A nonconvex regularized approach for phase retrieval.
International Conference on Image Processing, pages 1753–1757, 2014.
[17] K. Wei. Phase retrieval via Kaczmarz methods. arXiv:1502.01822, 2015.
[18] C. White, R. Ward, and S. Sanghavi. The local convexity of solving quadratic equations. arXiv:1506.07868, 2015.
[19] Y. Shechtman, A. Beck, and Y. C. Eldar. GESPAR: Efficient phase retrieval of sparse signals. IEEE Transactions on Signal Processing, 62(4):928–938, 2014.
[20] M. Soltanolkotabi. Algorithms and Theory for Clustering and Nonconvex Quadratic Programming. PhD thesis, Stanford University, 2014.
[21] L. N. Trefethen and D. Bau III. Numerical Linear Algebra, volume 50. SIAM, 1997.
[22] E. J. Candès, X. Li, and M. Soltanolkotabi. Phase retrieval from coded diffraction patterns. To appear in Applied and Computational Harmonic Analysis, 2014.
[23] R. Keshavan, A. Montanari, and S. Oh. Matrix completion from noisy entries. Journal of Machine Learning Research, 11:2057–2078, 2010.
[24] P. Jain, P. Netrapalli, and S. Sanghavi. Low-rank matrix completion using alternating minimization. In ACM Symposium on Theory of Computing, pages 665–674, 2013.
[25] R. Sun and Z. Luo. Guaranteed matrix completion via nonconvex factorization. FOCS, 2015.
[26] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. Conference on Learning Theory (COLT), 2015.
[27] J. Sun, Q. Qu, and J. Wright. Complete dictionary recovery over the sphere. ICML, 2015.
[28] S. Balakrishnan, M. J. Wainwright, and B. Yu. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv:1408.2156, 2014.
[29] X. Yi, C. Caramanis, and S. Sanghavi.
Alternating minimization for mixed linear regression. International Conference on Machine Learning, June 2014.
", "award": [], "sourceid": 497, "authors": [{"given_name": "Yuxin", "family_name": "Chen", "institution": "Stanford University"}, {"given_name": "Emmanuel", "family_name": "Candes", "institution": "Stanford University"}]}