{"title": "Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 2375, "page_last": 2383, "abstract": "We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form $y = Ax + \\eta$ where $A$ is an unknown $n \\times n$ matrix and $x$ is chosen uniformly at random from $\\{+1, -1\\}^n$, $\\eta$ is an $n$-dimensional Gaussian random variable with unknown covariance $\\Sigma$: We give an algorithm that provable recovers $A$ and $\\Sigma$ up to an additive $\\epsilon$ whose running time and sample complexity are polynomial in $n$ and $1 / \\epsilon$. To accomplish this, we introduce a novel ``quasi-whitening'' step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one) and this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and allows us to control the accumulation of error when we find the columns of $A$ one by one via local search.", "full_text": "Provable ICA with Unknown Gaussian Noise, with\n\nImplications for Gaussian Mixtures and Autoencoders\n\nSanjeev Arora\u2217\n\nRong Ge\u2217\n\nAnkur Moitra \u2020\n\nSushant Sachdeva\u2217\n\nAbstract\n\nWe present a new algorithm for Independent Component Analysis (ICA) which\nhas provable performance guarantees. 
In particular, suppose we are given samples of the form y = Ax + η where A is an unknown n × n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable, and η is an n-dimensional Gaussian random variable with unknown covariance Σ. We give an algorithm that provably recovers A and Σ up to an additive ε, with running time and sample complexity polynomial in n and 1/ε. To accomplish this, we introduce a novel "quasi-whitening" step that may be useful in other contexts in which the covariance of Gaussian noise is not known in advance. We also give a general framework for finding all local optima of a function (given an oracle for approximately finding just one); this is a crucial step in our algorithm, one that has been overlooked in previous attempts, and it allows us to control the accumulation of error when we find the columns of A one by one via local search.

1 Introduction

We present an algorithm (with rigorous performance guarantees) for a basic statistical problem. Suppose η is an independent n-dimensional Gaussian random variable with an unknown covariance matrix Σ and A is an unknown n × n matrix. We are given samples of the form y = Ax + η where x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable.
The most natural case is when x is chosen uniformly at random from {+1, −1}^n, although our algorithms apply even in the more general case above. Our goal is to reconstruct an additive approximation to the matrix A and the covariance matrix Σ, running in time and using a number of samples that are polynomial in n and 1/ε, where ε is the target precision (see Theorem 1.1). This problem arises in several research directions within machine learning: Independent Component Analysis (ICA), Deep Learning, Gaussian Mixture Models (GMM), etc. We describe these connections next, along with known results (focusing on algorithms with provable performance guarantees, since that is our goal).
Most obviously, the above problem can be seen as an instance of Independent Component Analysis (ICA) with unknown Gaussian noise. ICA has an illustrious history with applications ranging from econometrics, to signal processing, to image segmentation. The goal generally involves finding a linear transformation of the data so that the coordinates are as independent as possible [1, 2, 3]. This is often accomplished by finding directions in which the projection is "non-Gaussian" [4]. Clearly, if the datapoint y is generated as Ax (i.e., with no noise η added) then applying the linear transformation A^{-1} to the data results in samples A^{-1}y whose coordinates are independent. This restricted case was considered by Comon [1] and Frieze, Jerrum and Kannan [5], and their goal was to recover an
*{arora, rongge, sachdeva}@cs.princeton.edu. Department of Computer Science, Princeton University, Princeton NJ 08540. Research supported by the NSF grants CCF-0832797, CCF-1117309 and a Simons Investigator Grant.
†moitra@ias.edu. School of Mathematics, Institute for Advanced Study, Princeton NJ 08540. Research supported in part by NSF grant No.
DMS-0835373 and by an NSF Computing and Innovation Fellowship.

additive approximation to A efficiently and using a polynomial number of samples. (We will later note a gap in their reasoning, albeit fixable by our methods. See also recent papers by Anandkumar et al. and Hsu and Kakade [6, 7], which do not use local search and avoid this issue.) To the best of our knowledge, there are currently no known algorithms with provable guarantees for the more general case of ICA with Gaussian noise (this is especially true if the covariance matrix is unknown, as in our problem), although many empirical approaches are known. (E.g., [8]; the issue of "empirical" vs. "rigorous" is elaborated upon after Theorem 1.1.)
The second view of our problem is as a concisely described Gaussian Mixture Model. Our data is generated as a mixture of 2^n identical Gaussian components (with an unknown covariance matrix) whose centers are the points {Ax : x ∈ {−1, 1}^n}, and all mixing weights are equal. Notice that this mixture of 2^n Gaussians admits a concise description using O(n^2) parameters. The problem of learning Gaussian mixtures has a long history, and the popular approach in practice is to use the EM algorithm [9], though it has no worst-case guarantees (the method may take a very long time to converge, and worse, may not always converge to the correct solution). An influential paper of Dasgupta [10] initiated the program of designing algorithms with provable guarantees, which was improved in a sequence of papers [11, 12, 13, 14]. But in the current setting, it is unclear how to apply any of the above algorithms (including EM), since a trivial application would keep track of exponentially many parameters – one for each component. Thus, new ideas seem necessary to achieve polynomial running time.
The third view of our problem is as a simple form of autoencoding [15].
This is a central notion in Deep Learning, where the goal is to obtain a compact representation of a target distribution using a multilayered architecture, where a complicated function (the target) can be built up by composing layers of a simple function (called the autoencoder [16]). The main tenet is that there are interesting functions which can be represented concisely using many layers, but would need a very large representation if a "shallow" architecture were used instead. This is most useful for functions that are "highly varying" (i.e., cannot be compactly described by piecewise linear functions or other "simple" local representations). Formally, it is possible to represent, using just (say) n^2 parameters, some distributions with 2^n "varying parts" or "interesting regions." The Restricted Boltzmann Machine (RBM) is an especially popular autoencoder in Deep Learning, though many others have been proposed. However, to the best of our knowledge, there has been no successful attempt to give a rigorous analysis of Deep Learning. Concretely, if the data is indeed generated using the distribution represented by an RBM, then do the popular algorithms for Deep Learning [17] learn the model parameters correctly and in polynomial time? Clearly, if the running time were actually found to be exponential in the number of parameters, then this would erode some of the advantages of the compact representation.
How is Deep Learning related to our problem? As noted by Freund and Haussler [18] many years ago, an RBM with real-valued visible units (the version that seems more amenable to theoretical analysis) is precisely a mixture of exponentially many standard Gaussians. It is parametrized by an n × m matrix A and a vector θ ∈ R^n.
It encodes a mixture of n-dimensional standard Gaussians centered at the points {Ax : x ∈ {−1, 1}^m}, where the mixing weight of the Gaussian centered at Ax is proportional to exp(‖Ax‖_2^2/2 + θ · x). This is of course reminiscent of our problem. Formally, our algorithm can be seen as a nonlinear autoencoding scheme analogous to an RBM but with uniform mixing weights. Interestingly, the algorithm that we present here looks nothing like the approaches favored traditionally in Deep Learning, and may provide an interesting new perspective.

1.1 Our results and techniques
We give a provable algorithm for ICA with unknown Gaussian noise. We have not made an attempt to optimize the quoted running time of this model, but we emphasize that this is in fact the first algorithm with provable guarantees for this problem, and moreover we believe that in practice our algorithm will run almost as fast as the usual ICA algorithms, which are its close relatives.
Theorem 1.1 (Main, Informally). There is an algorithm that recovers the unknown A and Σ up to additive error ε in each entry, in time that is polynomial in n, ‖A‖_2, ‖Σ‖_2, 1/ε, and 1/λ_min(A), where ‖·‖_2 denotes the operator norm and λ_min(·) denotes the smallest eigenvalue.

The classical approach for ICA (initiated in Comon [1] and Frieze, Jerrum and Kannan [5]) is for the noiseless case in which y = Ax. The first step is whitening, which applies a suitable linear transformation that makes the variance the same in all directions, thus reducing to the case where A is a rotation matrix. Given samples y = Rx where R is a rotation matrix, the rows of R can be found in principle by computing the vectors u that are local minima of E[(u · y)^4]. Subsequently, a number of works (see e.g. [19, 20]) have focused on giving algorithms that are robust to noise.
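The classical whitening step is easy to check numerically in the noiseless case: with x uniform on {±1}^n as in our setting, E[yy^T] = AA^T, so multiplying samples by the inverse square root of the empirical covariance turns A into (approximately) a rotation. A minimal NumPy sketch; the dimensions, sample size, and the particular matrix A are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 200_000

A = np.array([[2.0, 0.5, 0.0, 0.3],
              [0.1, 1.5, 0.4, 0.0],
              [0.0, 0.2, 1.8, 0.5],
              [0.4, 0.0, 0.1, 1.2]])      # a generic invertible mixing matrix
x = rng.choice([-1.0, 1.0], size=(N, n))  # independent +/-1 sources
y = x @ A.T                               # noiseless samples y = A x

# Empirical covariance E[y y^T] = A A^T, and its inverse square root
cov = y.T @ y / N
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # whitening transformation

# After whitening, W A should be close to a rotation matrix
R = W @ A
err = np.linalg.norm(R.T @ R - np.eye(n))
```

Since (AA^T)^{-1/2} A times its transpose is exactly the identity, `err` shrinks as the empirical covariance concentrates; the point of the paper is that this simple step breaks down once unknown Gaussian noise is added, motivating quasi-whitening below.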
A popular approach is to use the fourth order cumulant (as an alternative to the fourth order moment) as a method for "denoising," or any one of a number of other functionals whose local optima reveal interesting directions. However, the theoretical guarantees of these algorithms are not well understood.
The above procedures in the noise-free model can almost be made rigorous (i.e., provably polynomial running time and number of samples), except for one subtlety: it is unclear how to use local search to find all optima in polynomial time. In practice, one finds a single local optimum, projects to the subspace orthogonal to it, and continues recursively on a lower-dimensional problem. However, a naive implementation of this idea is unstable, since approximation errors can accumulate badly, and to the best of our knowledge no rigorous analysis had been given prior to our work. (This is not a technicality: in some similar settings the errors are known to blow up exponentially [21].) One of our contributions is a modified local search that avoids this potential instability and finds all local optima in this setting (Section 4.2).
Our major new contribution, however, is dealing with noise that is an unknown Gaussian. This is an important generalization, since many methods used in ICA are quite unstable to noise (and a wrong estimate for the covariance could lead to bad results). Here, we no longer need to assume we know even rough estimates for the covariance. Moreover, in the context of Gaussian Mixture Models this generalization corresponds to learning a mixture of many Gaussians where the covariance of the components is not known in advance.
We design new tools for denoising and especially whitening in this setting. Denoising uses the fourth order cumulant instead of the fourth moment used in [5], and whitening involves a novel use of the Hessian of the cumulant.
Even then, we cannot reduce to the simple case y = Rx as above, and are left with a more complicated functional form (see "quasi-whitening" in Section 2). Nevertheless, we can reduce to an optimization problem that can be solved via local search, and which remains amenable to a rigorous analysis. The results of the local optimization step can then be used to simplify the complicated functional form and recover A as well as the noise Σ. We defer many of our proofs to the supplementary material, due to space constraints.
In order to avoid cluttered notation, we have focused on the case in which x is chosen uniformly at random from {−1, +1}^n, although our algorithm and analysis work under the more general conditions that the coordinates of x are (i) independent and (ii) have a fourth moment that is less than three (the fourth moment of a standard Gaussian random variable). In this case, the functional P(u) (see Lemma 2.2) will take the same form but with weights depending on the exact value of the fourth moment of each coordinate. Since we already carry an unknown diagonal matrix D throughout our analysis, this generalization only changes the entries on the diagonal, and the same algorithm and proof apply.

2 Denoising and quasi-whitening
As mentioned, our approach is based on the fourth order cumulant. The cumulants of a random variable are the coefficients of the Taylor expansion of the logarithm of its characteristic function [22]. Let κ_r(X) be the r-th cumulant of a random variable X. We make use of:
Fact 2.1. (i) If X has mean zero, then κ_4(X) = E[X^4] − 3 E[X^2]^2. (ii) If X is Gaussian with mean µ and variance σ^2, then κ_1(X) = µ, κ_2(X) = σ^2 and κ_r(X) = 0 for all r > 2.
(iii) If X and Y are independent, then κ_r(X + Y) = κ_r(X) + κ_r(Y).

The crux of our technique is to look at the following functional, where y is the random variable Ax + η whose samples are given to us. Let u ∈ R^n be any vector. Then P(u) = −κ_4(u^T y). Note that for any u we can compute P(u) reasonably accurately by drawing a sufficient number of samples of y and taking an empirical average. Furthermore, since x and η are independent, and η is Gaussian, the next lemma is immediate. We call it "denoising" since it allows us empirical access to some information about A that is uncorrupted by the noise η.

Lemma 2.2 (Denoising Lemma). P(u) = 2 Σ_{i=1}^n (u^T A)_i^4.

The intuition is that P(u) = −κ_4(u^T Ax), since the fourth cumulant does not depend on the additive Gaussian noise; the lemma then follows because, by Fact 2.1, κ_4(x_i) = E[x_i^4] − 3 E[x_i^2]^2 = −2 for each ±1 coordinate x_i.

2.1 Quasi-whitening via the Hessian of P(u)
In prior works on ICA, whitening refers to reducing to the case where y = Rx for some rotation matrix R. Here we give a technique to reduce to the case where y = RDx + η′, where η′ is some other Gaussian noise (still unknown), R is a rotation matrix and D is a diagonal matrix that depends upon A. We call this quasi-whitening. Quasi-whitening suffices for us since local search using the objective function κ_4(u^T y) will give us (approximations to) the rows of RD, from which we will be able to recover A.
Quasi-whitening involves computing the Hessian of P(u), which recall is the matrix of all second-order partial derivatives of P(u). Throughout this section, we will denote the Hessian operator by H.
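Before computing that Hessian, the Denoising Lemma itself can be checked numerically: the empirical fourth cumulant of u^T y is unaffected by the Gaussian noise. A NumPy sketch, where the dimensions, the test direction u, and the matrices A and Σ = LL^T are all illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 400_000

A = np.array([[2.0, 0.5, 0.0, 0.3],
              [0.1, 1.5, 0.4, 0.0],
              [0.0, 0.2, 1.8, 0.5],
              [0.4, 0.0, 0.1, 1.2]])        # unknown mixing matrix (illustrative)
L = 0.3 * rng.normal(size=(n, n))           # noise covariance Sigma = L L^T
x = rng.choice([-1.0, 1.0], size=(N, n))    # independent +/-1 sources
eta = rng.normal(size=(N, n)) @ L.T         # Gaussian noise, covariance unknown to the algorithm
y = x @ A.T + eta                           # samples y = A x + eta

u = np.ones(n) / 2.0                        # an arbitrary unit test direction

z = y @ u
kappa4 = np.mean(z**4) - 3 * np.mean(z**2)**2   # Fact 2.1(i): kappa_4 of the mean-zero z
P_hat = -kappa4                                 # empirical P(u) = -kappa_4(u^T y)
P_true = 2 * np.sum((u @ A)**4)                 # Lemma 2.2: the noise does not appear
```

Repeating the experiment with a different L leaves `P_true` unchanged and `P_hat` close to it, which is exactly the "denoising" property.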
In matrix form, the Hessian of P(u) is

∂²P(u)/∂u_i ∂u_j = 24 Σ_{k=1}^n A_{i,k} A_{j,k} (A_k · u)²;   H(P(u)) = 24 Σ_{k=1}^n (A_k · u)² A_k A_k^T = A D_A(u) A^T,

where A_k is the k-th column of the matrix A (we use subscripts to denote the columns of matrices throughout the paper) and D_A(u) is the following diagonal matrix:
Definition 2.3. Let D_A(u) be the diagonal matrix whose k-th entry is 24(A_k · u)².
Of course, the exact Hessian of P(u) is unavailable, and we will instead compute an empirical approximation P̂(u) to P(u) (given many samples from the distribution); we will show that the Hessian of P̂(u) is a good approximation to the Hessian of P(u).
Definition 2.4. Given 2N samples y_1, y′_1, y_2, y′_2, ..., y_N, y′_N of the random variable y, let

P̂(u) = −(1/N) Σ_{i=1}^N (u^T y_i)^4 + (3/N) Σ_{i=1}^N (u^T y_i)² (u^T y′_i)².

Our first step is to show that the expectation of the Hessian of P̂(u) is exactly the Hessian of P(u). In fact, since the expectation of P̂(u) is exactly P(u) (and since P̂(u) is an analytic function of the samples and of the vector u), we can interchange the Hessian operator and the expectation operator. Roughly, one can imagine the expectation operator as an integral over the possible values of the random samples, and as is well known in analysis, one can differentiate under the integral provided that all functions are suitably smooth over the domain of integration.
Claim 2.5. E_{y,y′}[−(u^T y)^4 + 3(u^T y)²(u^T y′)²] = P(u).
This claim follows immediately from the definition of P(u), and since y and y′ are independent.
Lemma 2.6. H(P(u)) = E_{y,y′}[H(−(u^T y)^4 + 3(u^T y)²(u^T y′)²)].
Next, we compute the two terms inside the expectation:
Claim 2.7. H((u^T y)^4) = 12(u^T y)² yy^T.
Claim 2.8.
H((u^T y)²(u^T y′)²) = 2(u^T y′)² yy^T + 2(u^T y)² y′(y′)^T + 4(u^T y)(u^T y′)(y(y′)^T + y′y^T).
Let λ_min(A) denote the smallest eigenvalue of A. Our analysis also requires bounds on the entries of D_A(u_0):
Claim 2.9. If u_0 is chosen uniformly at random, then with high probability, for all i,

min_{i=1}^n ‖A_i‖_2² · 2n^{−4} ≤ (D_A(u_0))_{i,i} ≤ max_{i=1}^n ‖A_i‖_2² · (log n)/n.

Lemma 2.10. If u_0 is chosen uniformly at random and furthermore we are given 2N = poly(n, 1/ε, 1/λ_min(A), ‖A‖_2, ‖Σ‖_2) samples of y, then with high probability we will have that

(1 − ε) A D_A(u_0) A^T ⪯ H(P̂(u_0)) ⪯ (1 + ε) A D_A(u_0) A^T.

Lemma 2.11. Suppose that (1 − ε) A D_A(u_0) A^T ⪯ M̂ ⪯ (1 + ε) A D_A(u_0) A^T, and let M̂ = BB^T. Then there is a rotation matrix R* such that ‖B^{−1} A D_A(u_0)^{1/2} − R*‖_F ≤ √n ε.
The intuition is: if any of the singular values of B^{−1} A D_A(u_0)^{1/2} are outside the range [1 − ε, 1 + ε], we can find a unit vector x where the quadratic forms x^T A D_A(u_0) A^T x and x^T M̂ x are too far apart (which contradicts the condition of the lemma). Hence the singular values of B^{−1} A D_A(u_0)^{1/2} can all be set to one without changing the Frobenius norm of B^{−1} A D_A(u_0)^{1/2} too much, and this yields a rotation matrix.

3 Our algorithm (and notation)
In this section we describe our overall algorithm. It uses as a blackbox the denoising and quasi-whitening already described above, as well as a routine for computing all local maxima of certain "well-behaved" functions, which is described later in Section 4.
Notation: Placing a hat over a function denotes an empirical approximation that we obtain from random samples.
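As a concrete illustration of the quasi-whitening step of Section 2, one can estimate H(P̂)(u_0) from samples via the termwise formulas of Claims 2.7 and 2.8, factor it as BB^T, and check that B^{-1} A D_A(u_0)^{1/2} is nearly a rotation (Lemma 2.11). A NumPy sketch; the dimensions, sample size, matrices, and tolerances are illustrative assumptions, and the true A and D_A(u_0) are used only to verify the claim:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 400_000

A = np.array([[1.6, 0.4, 0.1],
              [0.2, 1.3, 0.5],
              [0.0, 0.3, 1.5]])              # unknown mixing matrix (illustrative)
L = 0.3 * rng.normal(size=(n, n))            # noise covariance Sigma = L L^T

def sample(N):
    return (rng.choice([-1.0, 1.0], size=(N, n)) @ A.T
            + rng.normal(size=(N, n)) @ L.T)

y, y2 = sample(N), sample(N)                 # 2N samples: y_i and independent y'_i
u0 = np.ones(n) / np.sqrt(n)                 # a fixed generic direction, for reproducibility

# Empirical Hessian of P-hat at u0, combining Claims 2.7 and 2.8 termwise:
# H(P-hat) = mean over i of  -12 z^2 y y^T
#            + 3 [ 2 z'^2 y y^T + 2 z^2 y' y'^T + 4 z z' (y y'^T + y' y^T) ]
z, z2 = y @ u0, y2 @ u0
H = (-12 * np.einsum('i,ij,ik->jk', z**2, y, y)
     + 6 * np.einsum('i,ij,ik->jk', z2**2, y, y)
     + 6 * np.einsum('i,ij,ik->jk', z**2, y2, y2)
     + 12 * (np.einsum('i,ij,ik->jk', z * z2, y, y2)
             + np.einsum('i,ij,ik->jk', z * z2, y2, y))) / N

B = np.linalg.cholesky((H + H.T) / 2)        # H(P-hat)(u0) = B B^T
D = np.diag(24 * (u0 @ A)**2)                # true D_A(u0), known here only for checking
R = np.linalg.solve(B, A @ np.sqrt(D))       # B^{-1} A D_A(u0)^{1/2}, Lemma 2.11
rot_err = np.linalg.norm(R.T @ R - np.eye(n))
```

With enough samples H is close to A D_A(u_0) A^T, so `rot_err` is small even though Σ was never estimated; this is the quasi-whitening guarantee.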
This approximation introduces error, which we will keep track of.

Step 1: Pick a random u_0 ∈ R^n and estimate the Hessian H(P̂(u_0)). Compute B such that H(P̂(u_0)) = BB^T. Let D = D_A(u_0) be the diagonal matrix defined in Definition 2.3.

Step 2: Take 2N samples y_1, y_2, ..., y_N, y′_1, y′_2, ..., y′_N, and let

P̂′(u) = −(1/N) Σ_{i=1}^N (u^T B^{−1} y_i)^4 + (3/N) Σ_{i=1}^N (u^T B^{−1} y_i)² (u^T B^{−1} y′_i)²,

which is an empirical estimate of P′(u).

Step 3: Use the procedure ALLOPT(P̂′(u), β, δ, β′, δ′) of Section 4 to compute all n local maxima of the function P̂′(u).

Step 4: Let R̂ be the matrix whose rows are the n local optima recovered in the previous step. Use the procedure RECOVER of Section 5 to find A and Σ.

Explanation: Step 1 uses the transformation B^{−1} computed in the previous section to quasi-whiten the data. Namely, we consider the sequence of samples z = B^{−1}y, which are therefore of the form R′D^{−1/2}x + η′, where η′ = B^{−1}η, D = D_A(u_0) and R′ is close to a rotation matrix R* (by Lemma 2.11). In Step 2 we look at κ_4(u^T z), which effectively denoises the new samples (see Lemma 2.2): P′(u) = κ_4(u^T z) = κ_4(u^T B^{−1}y) is easily seen to be the same as κ_4(u^T R′D^{−1/2}x), and Step 2 estimates this function, obtaining P̂′(u). Then Step 3 finds its local optima via local search. Ideally we would have liked access to the functional P*(u) defined by the true rotation R* (see Section 5), since the procedure for local optima works only for true rotations. But since R′ and R* are close, we can make it work approximately with P̂′(u), and then in Step 4 use these local optima to finally recover A.
Theorem 3.1. Suppose we are given samples of the form y = Ax + η where x is uniform on {+1, −1}^n, A is an n × n matrix, and η is an n-dimensional Gaussian random variable independent of x with unknown covariance matrix Σ. There is an algorithm that with high probability recovers Â with ‖Â − AΠ diag(k_i)‖_F ≤ ε, where Π is some permutation matrix and each k_i ∈ {+1, −1}, and also recovers Σ̂ with ‖Σ̂ − Σ‖_F ≤ ε. Furthermore, the running time and number of samples needed are poly(n, 1/ε, ‖A‖_2, ‖Σ‖_2, 1/λ_min(A)).
Note that here we recover A only up to a permutation of the columns and sign-flips. In general, this is all we can hope for, since the distribution of x is invariant under these same operations. Also, the dependence of our algorithm on the various norms (of A and Σ) seems inherent, since our goal is to recover an additive approximation, and as we scale up A and/or Σ, this goal becomes a stronger relative guarantee on the error.

4 Framework for iteratively finding all local maxima
In this section, we first describe a fairly standard procedure (based upon Newton's method) for finding a single local maximum of a function f*: R^n → R among all unit vectors, and an analysis of its rate of convergence.
Such a procedure is a common tool in statistical algorithms, but here we state it rather carefully, since we later give a general method to convert any local search algorithm (that meets certain criteria) into one that finds all local maxima (see Section 4.2).
Given that we can only ever hope for an additive approximation to a local maximum, one should be concerned about how the error accumulates when our goal is to find all local maxima. In fact, a naive strategy is to project onto the subspace orthogonal to the directions found so far, and continue in this subspace. However, such an approach seems to accumulate errors badly (the additive error of the last local maximum found is exponentially larger than the error of the first). Rather, the crux of our analysis is a novel method for bounding how much the error can accumulate (by refining old estimates).

Algorithm 1. LOCALOPT, Input: f(u), u_s, β, δ. Output: vector v.
1. Set u ← u_s.
2. Maximize (via Lagrangian methods)
   Proj_⊥u(∇f(u))^T ξ + (1/2) ξ^T Proj_⊥u(H(f(u))) ξ − (1/2) (∂/∂u) f(u) · ‖ξ‖_2²,
   subject to ‖ξ‖_2 ≤ β and u^T ξ = 0.
3. Let ξ be the solution, and set ũ = (u + ξ)/‖u + ξ‖.
4. If f(ũ) ≥ f(u) + δ/2, set u ← ũ and repeat Step 2.
5. Else return u.

Our strategy is to first find a local maximum in the orthogonal subspace, then run the local optimization algorithm again (in the original n-dimensional space) to "refine" the local maximum we have found.
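This find–project–refine loop can be caricatured with a gradient-only version of LOCALOPT (taking ξ along the projected gradient, a simplification noted below as acceptable). The toy objective, step size, and iteration counts are illustrative; the "rotation" is the identity, so the local maxima on the sphere are the coordinate directions:

```python
import numpy as np

def local_opt(grad, u0, step=0.05, iters=3000, proj=None):
    """Gradient-only caricature of LOCALOPT: repeatedly move along the
    component of grad f orthogonal to u, then renormalize. If `proj` is
    given, the search is restricted to that subspace (the deflation step)."""
    u = u0 if proj is None else proj(u0)
    u = u / np.linalg.norm(u)
    for _ in range(iters):
        g = grad(u)
        if proj is not None:
            g = proj(g)
        xi = g - (u @ g) * u          # Proj_{perp u}(grad f(u))
        u = u + step * xi
        if proj is not None:
            u = proj(u)
        u = u / np.linalg.norm(u)
    return u

# Toy objective f(u) = sum_i d_i u_i^4, whose local maxima on the unit
# sphere are the coordinate directions +/- e_i.
d = np.array([1.0, 1.6, 2.2, 1.3])
grad = lambda u: 4 * d * u**3

rng = np.random.default_rng(0)
v1 = local_opt(grad, rng.normal(size=4))       # find one local maximum

# Deflate: optimize the restriction of f to the subspace orthogonal to v1,
# then refine the answer with another full-space run of local_opt.
perp = lambda t: t - (v1 @ t) * v1
v2 = local_opt(grad, local_opt(grad, rng.normal(size=4), proj=perp))
```

Here `v1` and `v2` land on two different (nearly orthogonal) coordinate directions; the final full-space run is the "refinement" that keeps the subspace projection from compounding error.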
The intuition is that since we are already close to a particular local maximum, the local search algorithm cannot jump to some other local maximum (since this would entail going through a valley).

4.1 Finding one local maximum
Throughout this section, we will assume that we are given oracle access to a function f(u) and its gradient and Hessian. The procedure is also given a starting point u_s, a search range β, and a step size δ. For simplicity of notation we define the following projection operators.
Definition 4.1. Proj_⊥u(v) = v − (u^T v)u, and Proj_⊥u(M) = M − (u^T M u)uu^T.
The basic step of the algorithm is a modification of Newton's method to find a local improvement that makes progress so long as the current point u is far from a local maximum. Notice that if we add a small vector to u, we do not necessarily preserve the norm of u. In order to have control over how the norm of u changes, during the local optimization step the algorithm projects the gradient ∇f and Hessian H(f) onto the space perpendicular to u. There is also an additional correction term −(∂/∂u)f(u) · ‖ξ‖_2²/2. This correction term is necessary because the new vector we obtain is (u + ξ)/‖u + ξ‖_2, which is close to u − (‖ξ‖_2²/2)u + ξ + O(β³). Step 2 of the algorithm is just maximizing a quadratic function and can be solved exactly using the Lagrange multiplier method. To increase efficiency it is also acceptable to perform an approximate maximization step by taking ξ to be either aligned with the gradient Proj_⊥u(∇f(u)) or aligned with the largest eigenvector of Proj_⊥u(H(f(u))).
The algorithm is guaranteed to succeed in polynomial time when the function is Locally Improvable and Locally Approximable:
Definition 4.2 ((γ, β, δ)-Locally Improvable).
A function f(u): R^n → R is (γ, β, δ)-Locally Improvable if, for any u that is at least γ far from every local maximum, there is a u′ such that ‖u′ − u‖_2 ≤ β and f(u′) ≥ f(u) + δ.
Definition 4.3 ((β, δ)-Locally Approximable). A function f(u) is (β, δ)-Locally Approximable if its third order derivatives exist, and for any u and any direction v, the third order derivative of f at the point u in the direction v is bounded by 0.01δ/β³.

The analysis of the running time of the procedure comes from a local Taylor expansion. When a function is Locally Approximable, it is well approximated by its gradient and Hessian within a β neighborhood. The following theorem from [5] shows that the two properties above are enough to guarantee the success of a local search algorithm even when the function is only approximated.
Theorem 4.4 ([5]). If |f(u) − f*(u)| ≤ δ/8, the function f*(u) is (γ, β, δ)-Locally Improvable, and f(u) is (β, δ)-Locally Approximable, then Algorithm 1 will find a vector v that is γ-close to some local maximum. The running time is at most O((n² + T) max f*/δ), where T is the time to evaluate the function f and its gradient and Hessian.

4.2 Finding all local maxima
Now we consider how to find all local maxima of a given function f*(u). The crucial condition that we need is that all local maxima are orthogonal (which is indeed true in our problem, and is morally true when using local search more generally in ICA). Note that this condition implies that there are at most n local maxima.¹ In fact we will assume that there are exactly n local maxima. If we are given an exact oracle for f* and can compute exact local maxima, then we can find all local maxima

Algorithm 2.
ALLOPT, Input: f(u), β, δ, β′, δ′. Output: v_1, v_2, ..., v_n with ‖v_i − v*_i‖ ≤ γ for all i.
1. Let v_1 = LOCALOPT(f, e_1, β, δ).
2. FOR i = 2 TO n DO
3.   Let g_i be the projection of f to the orthogonal subspace of v_1, v_2, ..., v_{i−1}.
4.   Let u′ = LOCALOPT(g_i, e_1, β′, δ′).
5.   Let v_i = LOCALOPT(f, u′, β, δ).
6. END FOR
7. Return v_1, v_2, ..., v_n.

easily: find one local maximum, project the function into the orthogonal subspace, and continue to find more local maxima.
Definition 4.5. The projection of a function f to a linear subspace S is the function on that subspace whose values agree with f. More explicitly, if {v_1, v_2, ..., v_d} is an orthonormal basis of S, the projection of f to S is the function g: R^d → R such that g(w) = f(Σ_{i=1}^d w_i v_i).

The following theorem gives sufficient conditions under which the above algorithm finds all local maxima, making precise the intuition given at the beginning of this section.
Theorem 4.6. Suppose the function f*(u): R^n → R satisfies the following properties:

1. Orthogonal Local Maxima: The function has n local maxima v*_i, and they are orthogonal to each other.
2. Locally Improvable: f* is (γ, β, δ)-Locally Improvable.
3. Improvable Projection: The projection of the function to any subspace spanned by a subset of local maxima is (γ′, β′, δ′)-Locally Improvable. The step size δ′ ≥ 10δ.
4. Lipschitz: If ‖u − u′‖_2 ≤ 3√n γ, then |f*(u) − f*(u′)| ≤ δ′/20.
5. Attraction Radius: Let Rad ≥ 3√n γ + γ′. For any local maximum v*_i, let T be the minimum of f*(u) over ‖u − v*_i‖_2 ≤ 3√n γ + γ′; then there exists a set U that contains {u : ‖u − v*_i‖_2 ≤ 3√n γ + γ′} and does not contain any other local maximum, such that for every u that is not in U but is β-close to U, f*(u) < T.

If we are given a function f such that |f(u) − f*(u)| ≤ δ/8 and f is both (β, δ)- and (β′, δ′)-Locally Approximable, then Algorithm 2 can find all local maxima of f* within distance γ.

To prove this theorem, we first notice that the projection of the function f in Step 3 of the algorithm should be close to the projection of f* onto the span of the remaining local maxima. This is implied by the Lipschitz condition and is formally shown in the following two lemmas. First we prove a "coupling" between the orthogonal complements of two close subspaces:
Lemma 4.7. Given v_1, v_2, ..., v_k, each γ-close respectively to the local maxima v*_1, v*_2, ..., v*_k (this is without loss of generality, because we can permute the indices of the local maxima), there is an orthonormal basis v_{k+1}, v_{k+2}, ..., v_n for the orthogonal complement of span{v_1, v_2, ..., v_k} such that for any unit vector w ∈ R^{n−k}, Σ_{i=1}^{n−k} w_i v_{k+i} is 3√n γ-close to Σ_{i=1}^{n−k} w_i v*_{k+i}.

We prove this lemma using a modification of the Gram–Schmidt orthonormalization procedure. Using this lemma, we see that the projected function is close to the projection of f* onto the span of the rest of the local maxima:
Lemma 4.8.
Let $g^*$ be the projection of $f^*$ onto the space spanned by the rest of the local maxima; then $|g^*(w) - g(w)| \le \delta/8 + \delta'/20 \le \delta'/8$.

5 Local search on the fourth order cumulant

Next, we prove that the fourth order cumulant $P^*(u)$ satisfies the properties above. Then the algorithm given in the previous section will find all of the local maxima¹, which is the missing step in our main goal: learning a noisy linear transformation $Ax + \eta$ with unknown Gaussian noise. We first use a theorem from [5] to show that the properties for finding one local maximum are satisfied.

¹Technically, there are $2n$ local maxima since for each direction $u$ that is a local maximum, so too is $-u$, but this is an unimportant detail for our purposes.

Algorithm 3. RECOVER, Input: $B$, $\widehat{P}'(u)$, $\widehat{R}$, $\epsilon$ Output: $\widehat{A}$, $\widehat{\Sigma}$

1. Let $\widehat{D}_A(u)$ be a diagonal matrix whose $i$th entry is $\big(\tfrac{1}{2}\widehat{P}'(\widehat{R}_i)\big)^{-1/2}$.
2. Let $\widehat{A} = B\widehat{R}\widehat{D}_A(u)^{-1/2}$.
3. Estimate $C = E[yy^T]$ by taking $O((\|A\|_2 + \|\Sigma\|_2)^4 n^2 \epsilon^{-2})$ samples and let $\widehat{C} = \frac{1}{N}\sum_{i=1}^{N} y_i y_i^T$.
4. Let $\widehat{\Sigma} = \widehat{C} - \widehat{A}\widehat{A}^T$.
5. Return $\widehat{A}$, $\widehat{\Sigma}$.

Also, for notational convenience we set $d_i = 2 D_A(u_0)_{i,i}^{-2}$ and let $d_{\min}$ and $d_{\max}$ denote the minimum and maximum values (bounds on these and their ratio follow from Claim 2.9). Using this notation, $P^*(u) = \sum_{i=1}^{n} d_i (u^T R^*_i)^4$.

Theorem 5.1 ([5]). When $\beta < d_{\min}/10 d_{\max} n^2$, the function $P^*(u)$ is $(3\sqrt{n}\beta, \beta, P^*(u)\beta^2/100)$ Locally Improvable and $(\beta, d_{\min}\beta^2/100n)$ Locally Approximable.
Moreover, the local maxima of the function are exactly $\{\pm R^*_i\}$.

We then observe that, given enough samples, the empirical mean $\widehat{P}'(u)$ is close to $P^*(u)$. For concentration we require that every degree-four term $z_i z_j z_k z_l$ have variance at most $Z$.

Claim 5.2. $Z = O(d_{\min}^2 \lambda_{\min}(A)^{8} \|\Sigma\|_2^4 + d_{\min}^2)$.

Lemma 5.3. Given $2N$ samples $y_1, y_2, \ldots, y_N, y'_1, y'_2, \ldots, y'_N$, suppose the columns of $R' = B^{-1} A D_A(u_0)^{1/2}$ are $\epsilon$-close to the corresponding columns of $R^*$. Then with high probability the function $\widehat{P}'(u)$ is $O(d_{\max} n^{1/2}\epsilon + n^2 (N/Z \log n)^{-1/2})$-close to the true function $P^*(u)$.

The other properties required by Theorem 4.6 are also satisfied:

Lemma 5.4. For any $\|u - u'\|_2 \le r$, $|P^*(u) - P^*(u')| \le 5 d_{\max} n^{1/2} r$. All local maxima of $P^*$ have attraction radius $Rad \ge d_{\min}/100 d_{\max}$.

Applying Theorem 4.6 we obtain the following lemma (the parameters are chosen so that all required properties are satisfied):

Lemma 5.5. Let $\beta' = \Theta((d_{\min}/d_{\max})^2)$ and $\beta = \min\{\gamma n^{-1/2}, \Omega((d_{\min}/d_{\max})^4 n^{-3.5})\}$. Then the procedure ALLOPT$(f, \beta, d_{\min}\beta^2/100n, \beta', d_{\min}\beta'^2/100n)$ (Algorithm 2) finds vectors $v_1, v_2, \ldots, v_n$ such that there is a permutation matrix $\Pi$ and signs $k_i \in \{\pm 1\}$ with $\|v_i - (R^*\Pi\,\mathrm{Diag}(k_i))_i\|_2 \le \gamma$ for all $i$.

After obtaining $\widehat{R} = [v_1, v_2, \ldots, v_n]$, we can use Algorithm 3 to find $A$ and $\Sigma$:

Theorem 5.6.
Given a matrix $\widehat{R}$ such that there is a permutation matrix $\Pi$ and $k_i \in \{\pm 1\}$ with $\|\widehat{R}_i - k_i (R^*\Pi)_i\|_2 \le \gamma$ for all $i$, Algorithm 3 returns a matrix $\widehat{A}$ such that $\|\widehat{A} - A\Pi\,\mathrm{Diag}(k_i)\|_F \le O(\gamma \|A\|_2^2 n^{3/2}/\lambda_{\min}(A))$. If $\gamma \le O(\epsilon/\|A\|_2^2 n^{3/2}\lambda_{\min}(A)) \times \min\{1/\|A\|_2, 1\}$, we also have $\|\widehat{\Sigma} - \Sigma\|_F \le \epsilon$.

Recall that the diagonal matrix $D_A(u)$ is unknown (since it depends on $A$), but if we are given $R^*$ (or an approximation), then since $P^*(u) = \sum_{i=1}^{n} d_i (u^T R^*_i)^4$, we can recover the matrix $D_A(u)$ approximately by computing $P^*(R^*_i)$. Given $D_A(u)$, we can then recover $A$ and $\Sigma$, and this completes the analysis of our algorithm.

Conclusions

ICA is a vast field with many successful techniques. Most rely on heuristic nonlinear optimization. An exciting question is: can we give a rigorous analysis of those techniques as well, just as we did for local search on cumulants? A rigorous analysis of deep learning (say, an algorithm that provably learns the parameters of an RBM) is another problem that is wide open, and a plausible special case involves subtle variations on the problem we considered here.

References

[1] P. Comon. Independent component analysis: a new concept? Signal Processing, pp. 287–314, 1994.

[2] A. Hyvarinen, J. Karhunen, E. Oja. Independent Component Analysis. Wiley: New York, 2001.

[3] A. Hyvarinen, E. Oja. Independent component analysis: algorithms and applications. Neural Networks, pp. 411–430, 2000.

[4] P. J. Huber. Projection pursuit. Annals of Statistics, pp. 435–475, 1985.

[5] A. Frieze, M. Jerrum, R. Kannan. Learning linear transformations. FOCS, pp.
359–368, 1996.

[6] A. Anandkumar, D. Foster, D. Hsu, S. Kakade, Y. Liu. Two SVDs suffice: spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. arXiv:abs/1203.0697, 2012.

[7] D. Hsu, S. Kakade. Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. arXiv:abs/1206.5766, 2012.

[8] L. De Lathauwer, J. Castaing, J.-F. Cardoso. Fourth-order cumulant-based blind identification of underdetermined mixtures. IEEE Transactions on Signal Processing, vol. 55, no. 6, pp. 2965–2973, June 2007.

[9] A. P. Dempster, N. M. Laird, D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society Series B, pp. 1–38, 1977.

[10] S. Dasgupta. Learning mixtures of Gaussians. FOCS, pp. 634–644, 1999.

[11] S. Arora, R. Kannan. Learning mixtures of separated nonspherical Gaussians. Annals of Applied Probability, pp. 69–92, 2005.

[12] M. Belkin, K. Sinha. Polynomial learning of distribution families. FOCS, pp. 103–112, 2010.

[13] A. T. Kalai, A. Moitra, G. Valiant. Efficiently learning mixtures of two Gaussians. STOC, pp. 553–562, 2010.

[14] A. Moitra, G. Valiant. Settling the polynomial learnability of mixtures of Gaussians. FOCS, pp. 93–102, 2010.

[15] G. Hinton, R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, pp. 504–507, 2006.

[16] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, pp. 1–127, 2009.

[17] G. E. Hinton. A practical guide to training restricted Boltzmann machines, version 1. UTML TR 2010-003, Department of Computer Science, University of Toronto, August 2010.

[18] Y. Freund, D. Haussler.
Unsupervised learning of distributions on binary vectors using two layer networks. University of California at Santa Cruz, Santa Cruz, CA, 1994.

[19] S. Cruces, L. Castedo, A. Cichocki. Robust blind source separation algorithms using cumulants. Neurocomputing, vol. 49, issues 1–4, pp. 87–118, 2002.

[20] L. De Lathauwer, B. De Moor, J. Vandewalle. Independent component analysis based on higher-order statistics only. Proceedings of the 8th IEEE Signal Processing Workshop on Statistical Signal and Array Processing, 1996.

[21] S. Vempala, Y. Xiao. Structure from local optima: learning subspace juntas via higher order PCA. arXiv:abs/1108.3329, 2011.

[22] M. Kendall, A. Stuart. The Advanced Theory of Statistics. Charles Griffin and Company, 1958.
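As a concluding illustration, the recovery step (Algorithm 3, RECOVER) amounts to a few lines of linear algebra once $\widehat{R}$ and the quartic $\widehat{P}'$ are available. The numpy sketch below is ours, not the paper's code: `recover`, `P_hat`, and the other names are hypothetical, and it assumes idealized inputs (exact columns $R^*_i$ and an exact quartic oracle, with simulated samples for the covariance step).

```python
import numpy as np

def recover(B, R_hat, P_hat, ys):
    """Sketch of Algorithm 3 (RECOVER), under idealized inputs.

    B      : quasi-whitening matrix
    R_hat  : matrix whose columns approximate the local maxima R*_i
    P_hat  : callable approximating the quartic P*(u)
    ys     : samples y = Ax + eta, one per row
    """
    n = R_hat.shape[1]
    # Step 1: the i-th diagonal entry is (P_hat(R_hat_i)/2)^(-1/2),
    # since P*(R*_i) = d_i = 2 D_A(u0)_{ii}^{-2}.
    D = np.array([(P_hat(R_hat[:, i]) / 2.0) ** -0.5 for i in range(n)])
    # Step 2: A_hat = B R_hat D_A^{-1/2}.
    A_hat = B @ R_hat @ np.diag(D ** -0.5)
    # Step 3: empirical second moment; E[y y^T] = A A^T + Sigma for x in {±1}^n.
    C_hat = ys.T @ ys / ys.shape[0]
    # Step 4: Sigma_hat = C_hat - A_hat A_hat^T.
    Sigma_hat = C_hat - A_hat @ A_hat.T
    return A_hat, Sigma_hat
```

In a synthetic check (orthonormal $R^*$, known $D_A$, exact oracle), step 2 reproduces $A$ up to floating-point error, and the estimate of $\Sigma$ improves at the usual $N^{-1/2}$ sampling rate.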