{"title": "Recovery Guarantee of Non-negative Matrix Factorization via Alternating Updates", "book": "Advances in Neural Information Processing Systems", "page_first": 4987, "page_last": 4995, "abstract": "Non-negative matrix factorization is a popular tool for decomposing data into feature and weight matrices under non-negativity constraints. It enjoys practical success but is poorly understood theoretically. This paper proposes an algorithm that alternates between decoding the weights and updating the features, and shows that assuming a generative model of the data, it provably recovers the ground-truth under fairly mild conditions. In particular, its only essential requirement on features is linear independence. Furthermore, the algorithm uses ReLU to exploit the non-negativity for decoding the weights, and thus can tolerate adversarial noise that can potentially be as large as the signal, and can tolerate unbiased noise much larger than the signal. The analysis relies on a carefully designed coupling between two potential functions, which we believe is of independent interest.", "full_text": "Recovery Guarantee of Non-negative Matrix\n\nFactorization via Alternating Updates\n\nYuanzhi Li, Yingyu Liang, Andrej Risteski\n\nComputer Science Department at Princeton University\n\n35 Olden St, Princeton, NJ 08540\n\n{yuanzhil, yingyul, risteski}@cs.princeton.edu\n\nAbstract\n\nNon-negative matrix factorization is a popular tool for decomposing data into\nfeature and weight matrices under non-negativity constraints. It enjoys practical\nsuccess but is poorly understood theoretically. This paper proposes an algorithm\nthat alternates between decoding the weights and updating the features, and shows\nthat assuming a generative model of the data, it provably recovers the ground-\ntruth under fairly mild conditions. In particular, its only essential requirement on\nfeatures is linear independence. 
Furthermore, the algorithm uses ReLU to exploit\nthe non-negativity for decoding the weights, and thus can tolerate adversarial noise\nthat can potentially be as large as the signal, and can tolerate unbiased noise much\nlarger than the signal. The analysis relies on a carefully designed coupling between\ntwo potential functions, which we believe is of independent interest.\n\n1 Introduction\n\nIn this paper, we study the problem of non-negative matrix factorization (NMF), where given a matrix\nY \u2208 R^{m\u00d7N}, the goal is to find a matrix A \u2208 R^{m\u00d7n} and a non-negative matrix X \u2208 R^{n\u00d7N} such\nthat Y \u2248 AX.^1 A is often referred to as the feature matrix and X as the weights. NMF has been\nextensively used in extracting a parts representation of the data (e.g., [LS97, LS99, LS01]). It has\nbeen shown that the non-negativity constraint on the coefficients, which forces features to combine but not\ncancel out, can lead to much more interpretable features and improved downstream performance of\nthe learned features.\nDespite all the practical success, however, this problem is poorly understood theoretically, with only\na few provable guarantees known. Moreover, many of the theoretical algorithms are based on heavy\ntools from algebraic geometry (e.g., [AGKM12]) or tensors (e.g., [AKF+12]), which are still not\nas widely used in practice, primarily because of computational feasibility issues or sensitivity to\nassumptions on A and X. Some others depend on specific structure of the feature matrix, such as\nseparability [AGKM12] or similar properties [BGKP16].\nA natural family of algorithms for NMF alternates between decoding the weights and updating the\nfeatures. More precisely, in the decoding step, the algorithm represents the data as a non-negative\ncombination of the current set of features; in the updating step, it updates the features using the\ndecoded representations. 
This meta-algorithm is popular in practice due to ease of implementation,\ncomputational efficiency, and empirical quality of the recovered features. However, even less\ntheoretical analysis exists for such algorithms.\n\n^1 In the usual formulation of the problem, A is also assumed to be non-negative, which we will not require in\nthis paper.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nThis paper proposes an algorithm in the above framework with provable recovery guarantees. To\nbe specific, the data is assumed to come from a generative model y = A\u2217x\u2217 + \u03bd. Here, A\u2217 is the\nground-truth feature matrix, x\u2217 are the non-negative ground-truth weights generated from an unknown\ndistribution, and \u03bd is the noise. Our algorithm can provably recover A\u2217 under mild conditions, even\nin the presence of large adversarial noise.\nOverview of main results. The existing theoretical results on NMF can be roughly split into two\ncategories. In the first category, they make heavy structural assumptions on the feature matrix A\u2217\nsuch as separability ([AGM12]) or allow running time exponential in n ([AGKM12]). In the\nsecond one, they impose strict distributional assumptions on x\u2217 ([AKF+12]), where the methods are\nusually based on the method of moments and tensor decompositions and have poor tolerance to noise,\neven though noise tolerance is very important in practice.\nIn this paper, we present a very simple and natural alternating update algorithm that achieves the best\nof both worlds. First, we have minimal assumptions on the feature matrix A\u2217: the only essential\ncondition is linear independence of the features. Second, it is robust to adversarial noise \u03bd which\nin some parameter regimes can potentially be on the same order as the signal A\u2217x\u2217, and is robust to\nunbiased noise potentially even higher than the signal by a factor of O(\u221an). 
The algorithm does not\nrequire knowing the distribution of x\u2217, and allows a fairly wide family of interesting distributions.\nWe get this at a rather small cost of a mild \u201cwarm start\u201d. Namely, we initialize each of the features to\nbe \u201ccorrelated\u201d with the ground-truth features. This type of initialization is often used in practice as\nwell, for example in LDA-c, the most popular software for topic modeling ([lda16]).\nA major feature of our algorithm is the significant robustness to noise. In the presence of adversarial\nnoise on each entry of y up to level C_\u03bd, the noise level \u2016\u03bd\u2016_1 can be in the same order as the signal\nA\u2217x\u2217. Still, our algorithm is able to output a matrix A such that the final \u2016A\u2217 \u2212 A\u2016_1 \u2264 O(\u2016\u03bd\u2016_1), in\nthe order of the noise in one data point. If the noise is unbiased (i.e., E[\u03bd|x\u2217] = 0), the noise level\n\u2016\u03bd\u2016_1 can be \u2126(\u221an) times larger than the signal A\u2217x\u2217, while we can still guarantee \u2016A\u2217 \u2212 A\u2016_1 \u2264\nO(\u2016\u03bd\u2016_1/\u221an), so our algorithm is not only tolerant to noise, but also has a very strong denoising effect.\nNote that even for the unbiased case the noise can potentially be correlated with the ground-truth in\na very complicated manner, and also, all our results are obtained requiring only that the columns of A\u2217 are\nlinearly independent.\nTechnical contribution. The success of our algorithm crucially relies on exploiting the non-negativity\nof x\u2217 by a ReLU thresholding step during the decoding procedure. Similar techniques have been\nconsidered in prior works on matrix factorization; however, to the best of our knowledge, the analysis\n(e.g., [AGMM15]) requires that the decodings are correct in all the intermediate iterations, in the\nsense that the supports of x\u2217 are recovered with no error. 
Indeed, we cannot hope for a similar\nguarantee in our setting, since we consider adversarial noise that could potentially be the same order\nas the signal. Our major technical contribution is a way to deal with the erroneous decoding throughout\nall the intermediate iterations. We achieve this by a coupling between two potential functions\nthat capture different aspects of the working matrix A. While analyzing iterative algorithms like\nalternating minimization or gradient descent in non-convex settings has been a popular topic in recent\nyears, the proof usually proceeds by showing that the updates are approximately performing gradient\ndescent on an objective with some local or hidden convex structure. Our technique diverges from the\ncommon proof strategy, and we believe it is interesting in its own right.\nOrganization. After reviewing related work, we define the problem in Section 3 and describe our\nmain algorithm in Section 4. To emphasize the key ideas, we first present the results and the proof\nsketch for a simplified yet still interesting case in Section 5, and then present the results under much\nmore general assumptions in Section 6. The complete proof is provided in the appendix.\n\n2 Related work\n\nNon-negative matrix factorization relates to several different topics in machine learning.\nNon-negative matrix factorization. The area of non-negative matrix factorization (NMF) has a rich\nempirical history, starting with the practical algorithm of [LS97]. On the theoretical side, [AGKM12]\nprovides a fixed-parameter tractable algorithm for NMF, which solves algebraic equations and thus has\npoor noise tolerance. [AGKM12] also studies NMF under separability assumptions about the features.\n[BGKP16] studies NMF under heavy noise, but also needs assumptions related to separability, such\nas the existence of dominant features. Also, their noise model is different from ours.\nTopic modeling. 
A closely related problem to NMF is topic modeling, a common generative model\nfor textual data [BNJ03, Ble12]. Usually, \u2016x\u2217\u2016_1 = 1, while there also exists work that assumes the\nx\u2217_i \u2208 [0, 1] and are independent [ZX12]. A popular heuristic in practice for learning A\u2217 is variational\ninference, which can be interpreted as alternating minimization in KL divergence norm. On the\ntheory front, there is a sequence of works based on either spectral or combinatorial approaches,\nwhich need certain \u201cnon-overlapping\u201d assumptions on the topics. For example, [AGH+13] assume\nthe topic-word matrix contains \u201canchor words\u201d: words which appear in a single topic. Most related\nis the work of [AR15], who analyze a version of the variational inference updates when documents\nare long. However, they require strong assumptions on both the warm start and the amount of\n\u201cnon-overlapping\u201d of the topics in the topic-word matrix.\nICA. Our generative model for x\u2217 will assume the coordinates are independent; therefore our problem\ncan be viewed as a non-negative variant of ICA with high levels of noise. Results here typically are\nnot robust to noise, with the exception of [AGMS12] that tolerates Gaussian noise. However, to the best\nof our knowledge, no result in this setting is provably robust to adversarial noise.\nNon-convex optimization. The framework of having a \u201cdecoding\u201d for the samples, along with\nperforming an update for the model parameters, has proven successful for dictionary learning as\nwell. The original empirical work proposing such an algorithm (in fact, it suggested that the V1\nlayer processes visual signals in the same manner) was due to [OF97]. Even more, similar families\nof algorithms based on \u201cdecoding\u201d and gradient-descent are believed to be neurally plausible as\nmechanisms for a variety of tasks like clustering, dimension-reduction, NMF, etc. ([PC15, PC14]). 
A\ntheoretical analysis came later for dictionary learning due to [AGMM15] under the assumption that\nthe columns of A\u2217 are incoherent. The technique is not directly applicable to our case, as we don\u2019t\nwish to have such assumptions on the matrix A\u2217. For instance, if A\u2217 is non-negative and its columns\nhave \u2113_1 norm 1, incoherence effectively means that the columns of A\u2217 have very small overlap.\n\n3 Problem definition and assumptions\nGiven a matrix Y \u2208 R^{m\u00d7N}, the goal of non-negative matrix factorization (NMF) is to find a matrix\nA \u2208 R^{m\u00d7n} and a non-negative matrix X \u2208 R^{n\u00d7N}, so that Y \u2248 AX. The columns of Y are\ncalled data points, those of A are features, and those of X are weights. We note that in the original\nNMF, A is also assumed to be non-negative, which is not required here. We also note that typically\nm \u226b n, i.e., the features are a few representative components in the data space. This is different\nfrom dictionary learning where overcompleteness is often assumed.\nThe problem in the worst case is NP-hard [AGKM12], so some assumptions are needed to design\nprovably efficient algorithms. In this paper, we consider a generative model for the data point\n\ny = A\u2217x\u2217 + \u03bd    (1)\n\nwhere A\u2217 is the ground-truth feature matrix, x\u2217 is the ground-truth non-negative weight from some\nunknown distribution, and \u03bd is the noise. Our focus is to recover A\u2217 given access to the data\ndistribution, assuming some properties of A\u2217, x\u2217, and \u03bd. To describe our assumptions, we let [M]^i\ndenote the i-th row of a matrix M, [M]_j its j-th column, M_{i,j} its (i, j)-th entry. 
Denote its column\nnorm, row norm, and symmetrized norm as \u2016M\u2016_1 = max_j \u2211_i |M_{i,j}|, \u2016M\u2016_\u221e = max_i \u2211_j |M_{i,j}|,\nand \u2016M\u2016_s = max{\u2016M\u2016_1, \u2016M\u2016_\u221e}, respectively.\nWe assume the following hold for parameters C_1, c_2, C_2, \u2113, C_\u03bd to be determined in our theorems.\n\n(A1) The columns of A\u2217 are linearly independent.\n(A2) For all i \u2208 [n], x\u2217_i \u2208 [0, 1], E[x\u2217_i] \u2264 C_1/n and c_2/n \u2264 E[(x\u2217_i)^2] \u2264 C_2/n, and the x\u2217_i\u2019s are independent.\n(A3) The initialization A^(0) = A\u2217(\u03a3^(0) + E^(0)) + N^(0), where \u03a3^(0) is diagonal, E^(0) is off-diagonal, and\n\n\u03a3^(0) \u2ab0 (1 \u2212 \u2113)I,    \u2016E^(0)\u2016_s \u2264 \u2113.\n\nWe consider two noise models.\n\n(N1) Adversarial noise: only assume that max_i |\u03bd_i| \u2264 C_\u03bd almost surely.\n(N2) Unbiased noise: max_i |\u03bd_i| \u2264 C_\u03bd almost surely, and E[\u03bd|x\u2217] = 0.\n\nRemarks. We make several remarks about each of the assumptions.\n(A1) is the assumption about A\u2217. It only requires the columns of A\u2217 to be linearly independent, which\nis very mild and needed to ensure identifiability. Otherwise, for instance, if (A\u2217)_3 = \u03bb_1(A\u2217)_1 +\n\u03bb_2(A\u2217)_2, it is impossible to distinguish between the case when x\u2217_3 = 1 and the case when x\u2217_1 = \u03bb_1\nand x\u2217_2 = \u03bb_2. In particular, we do not restrict the feature matrix to be non-negative, which is more\ngeneral than the traditional NMF and is potentially useful for many applications. We also do not\nmake incoherence or anchor word assumptions that are typical in related work.\n(A2) is the assumption on x\u2217. First, the coordinates are non-negative and bounded by 1; this is simply\na matter of scaling. 
Second, the assumption on the moments requires that, roughly speaking, each\nfeature should appear with reasonable probability. This is expected: if the occurrences of the features\nare extremely unbalanced, then it will be difficult to recover the rare ones. The third requirement\non independence is motivated by the fact that the features should be different, so that their occurrences are\nnot correlated. Here we do not stick to a specific distribution, since the moment conditions are more\ngeneral and highlight the essential properties our algorithm needs. Example distributions satisfying\nour assumptions will be discussed later.\nThe warm start required by (A3) means that each feature A^(0)_i has a large fraction of the ground-truth\nfeature A\u2217_i and a small fraction of the other features, plus some noise outside the span of the\nground-truth features. We emphasize that N^(0) is the component of A^(0) outside the column space\nof A\u2217, and is not the difference between A^(0) and A\u2217. This requirement is typically achieved in\npractice by setting the columns of A^(0) to reasonable \u201cpure\u201d data points that contain one major\nfeature and a small fraction of some other features (e.g., [lda16, AR15]); in this initialization, it is\ngenerally believed that N^(0) = 0. But we state our theorems to allow some noise N^(0) for robustness\nin the initialization.\nThe adversarial noise model (N1) is very general, only imposing an upper bound on the entry-wise\nnoise level. Thus, \u03bd can be correlated with x\u2217 in some complicated unknown way. (N2) additionally\nrequires it to be zero mean, which is commonly assumed and will be exploited by our algorithm to\ntolerate larger noise.\n\n4 Main algorithm\n\nAlgorithm 1 Purification\nInput: initialization A^(0), threshold \u03b1, step size \u03b7, scaling factor r, sample size N, iterations T\n1: for t = 0, 1, 2, ..., T \u2212 1 do\n2:   Draw examples y_1, ..., y_N.\n3:   (Decode) Compute A\u2020, the pseudo-inverse of A^(t) with minimum \u2016A\u2020\u2016_\u221e.\n     Set x = \u03c6_\u03b1(A\u2020y) for each example y.   // \u03c6_\u03b1 is ReLU activation; see (2) for the definition\n4:   (Update) Update the feature matrix\n       A^(t+1) = (1 \u2212 \u03b7) A^(t) + r\u03b7 E\u0302[(y \u2212 y\u2032)(x \u2212 x\u2032)^\u22a4]\n     where E\u0302 is over independent uniform y, y\u2032 from {y_1, ..., y_N}, and x, x\u2032 are their decodings.\nOutput: A = A^(T)\n\nOur main algorithm is presented in Algorithm 1. It keeps a working feature matrix and operates in\niterations. In each iteration, it first computes the weights for a batch of N examples (decoding), and\nthen uses the computed weights to update the feature matrix (updating).\nThe decoding is simply multiplying the example by the pseudo-inverse of the current feature matrix\nand then passing it through the rectified linear unit (ReLU) \u03c6_\u03b1 with offset \u03b1. The pseudo-inverse\nwith minimum infinity norm is used so as to maximize the robustness to noise (see the theorems).\nThe ReLU function \u03c6_\u03b1 operates element-wise on the input vector v, and for an element v_i, it is\ndefined as\n\n\u03c6_\u03b1(v_i) = max{v_i \u2212 \u03b1, 0}.    (2)\n\nTo get an intuition why the decoding makes sense, suppose the current feature matrix is the ground-\ntruth. Then A\u2020y = A\u2020A\u2217x\u2217 + A\u2020\u03bd = x\u2217 + A\u2020\u03bd. So we would like to use a small A\u2020 and use\nthe threshold to remove the noise term.\nIn the updating step, the algorithm moves the feature matrix along the direction E[(y \u2212 y\u2032)(x \u2212 x\u2032)^\u22a4].\nTo see intuitively why this is a good direction, note that when the decoding is perfect and there is no\nnoise, E[(y \u2212 y\u2032)(x \u2212 x\u2032)^\u22a4] is A\u2217 up to a positive scaling of its columns, and thus it is moving towards the ground-truth. Without those\nideal conditions, we need to choose a proper step size, which is tuned by the parameters \u03b7 and r.\n\n5 Results for a simplified case\n\nOur intuitions can be demonstrated in a simplified setting with (A1), (A2\u2019), (A3), and (N1), where\n\n(A2\u2019) The x\u2217_i\u2019s are independent, and x\u2217_i = 1 with probability s/n and 0 otherwise, for a constant s > 0.\n\nFurthermore, let N^(0) = 0. This is a special case of our general assumptions, with C_1 = c_2 = C_2 = s\nwhere s is the parameter in (A2\u2019). It is still an interesting setting; as far as we know, there is no\nexisting guarantee of alternating type algorithms for it.\nTo present our results, we let (A\u2217)\u2020 denote the matrix satisfying (A\u2217)\u2020A\u2217 = I; if there are multiple\nsuch matrices we let it denote the one with minimum \u2016(A\u2217)\u2020\u2016_\u221e.\nTheorem 1 (Simplified case, adversarial noise). There exists an absolute constant G such that when\nAssumptions (A1), (A2\u2019), (A3) and (N1) are satisfied with \u2113 = 1/10, C_\u03bd \u2264 Gc / max{m, n\u2016(A\u2217)\u2020\u2016_\u221e} for\nsome 0 \u2264 c \u2264 1, and N^(0) = 0, then there exist \u03b1, \u03b7, r such that for every 0 < \u03b5, \u03b4 < 1 and\nN = poly(n, m, 1/\u03b5, 1/\u03b4) the following holds with probability at least 1 \u2212 \u03b4.\nAfter T = O(ln(1/\u03b5)) iterations, Algorithm 1 outputs a solution A = A\u2217(\u03a3 + E) + N where\n\u03a3 \u2ab0 (1 \u2212 \u2113)I is diagonal, E is off-diagonal with \u2016E\u2016_1 \u2264 \u03b5 + c, and \u2016N\u2016_1 \u2264 c.\nRemarks. Consequently, when \u2016A\u2217\u2016_1 = 1, we can do normalization \u00c2_i = A_i/\u2016A_i\u2016_1, and the\nnormalized output \u00c2 satisfies\n\n\u2016\u00c2 \u2212 A\u2217\u2016_1 \u2264 \u03b5 + 2c.\n\nSo under mild conditions and with proper parameters, our algorithm recovers the ground-truth at a\ngeometric rate. It can achieve arbitrarily small recovery error in the noiseless setting, and achieve error\nup to the noise limit even with adversarial noise whose level is comparable to the signal.\nThe condition on \u2113 means that a constant warm start is sufficient for our algorithm to converge, which\nis much better than previous work such as [AR15]. Indeed, in that work, \u2113 even needs to depend\non the dynamic range of the entries of A\u2217, which is problematic in practice.\nIt is shown that with large adversarial noise, the algorithm can still recover the features up to the\nnoise limit. When m \u2265 n\u2016(A\u2217)\u2020\u2016_\u221e, each data point has adversarial noise with \u2113_1 norm as large\nas \u2016\u03bd\u2016_1 = C_\u03bd m = \u2126(c), which is in the same order as the signal \u2016A\u2217x\u2217\u2016_1 = O(1). Our algorithm\nstill works in this regime. Furthermore, the final error \u2016A \u2212 A\u2217\u2016_1 is O(c), in the same order as the\nadversarial noise in one data point.\nNote that the appearance of \u2016(A\u2217)\u2020\u2016_\u221e is not surprising. The case when the columns are the canonical\nunit vectors for instance, which corresponds to \u2016(A\u2217)\u2020\u2016_\u221e = 1, is expected to be easier than the\ncase when the columns are nearly the same, which corresponds to large \u2016(A\u2217)\u2020\u2016_\u221e.\nA similar theorem holds for the unbiased noise model.\nTheorem 2 (Simplified case, unbiased noise). If Assumptions (A1), (A2\u2019), (A3) and (N2) are satisfied\nwith C_\u03bd = Gc\u221an / max{m, n\u2016(A\u2217)\u2020\u2016_\u221e} and the other parameters set as in Theorem 1, then the same\nguarantee holds.\nRemarks. With unbiased noise, which is commonly assumed in many applications, the algorithm can\ntolerate a noise level \u221an larger than in the adversarial case. When m \u2265 n\u2016(A\u2217)\u2020\u2016_\u221e, each data point\nhas noise with \u2113_1 norm as large as \u2016\u03bd\u2016_1 = C_\u03bd m = \u2126(c\u221an), which can be \u2126(\u221an) times\nlarger than the signal \u2016A\u2217x\u2217\u2016_1 = O(1). The algorithm can recover the ground-truth in this heavy\nnoise regime. Furthermore, the final error \u2016A \u2212 A\u2217\u2016_1 is O(\u2016\u03bd\u2016_1/\u221an), which is only an O(1/\u221an)\nfraction of the noise in one data point. This is a very strong denoising effect and a bit counter-intuitive.\nIt is possible since we exploit the average of the noise for cancellation, and also use thresholding to\nremove noise spread out in the coordinates.\n\n5.1 Analysis: intuition\n\nA natural approach typically employed to analyze algorithms for non-convex problems is to define a\nfunction on the intermediate solution A and the ground-truth A\u2217 measuring their distance and then\nshow that the function decreases at each step. 
However, a single potential function will not be enough\nin our case, as we argue below, so we introduce a novel framework of maintaining two potential\nfunctions which capture different aspects of the intermediate solutions.\nLet us denote the intermediate solution and the update as (omitting the superscript (t))\n\nA = A\u2217(\u03a3 + E) + N,    E\u0302[(y \u2212 y\u2032)(x \u2212 x\u2032)^\u22a4] = A\u2217(\u03a3\u0303 + E\u0303) + N\u0303,\n\nwhere \u03a3 and \u03a3\u0303 are diagonal, E and E\u0303 are off-diagonal, and N and N\u0303 are the terms outside the span\nof A\u2217, which are caused by the noise. To cleanly illustrate the intuition behind ReLU and the coupled\npotential functions, we focus on the noiseless case and assume that we have infinite samples.\nSince A_i = \u03a3_{i,i}A\u2217_i + \u2211_{j\u2260i} E_{j,i}A\u2217_j, if the ratio between \u2016E_i\u2016_1 = \u2211_{j\u2260i} |E_{j,i}| and \u03a3_{i,i} gets smaller,\nthen the algorithm is making progress; if the ratio is small at the end, a normalization of A_i gives a\ngood approximation of A\u2217_i. So it suffices to show that \u03a3_{i,i} is always about a constant while \u2016E_i\u2016_1\ndecreases at each iteration. We will focus on E and consider the update rule in more detail to argue\nthis. After some calculation, we have\n\nE \u2190 (1 \u2212 \u03b7)E + r\u03b7E\u0303,    E\u0303 = E[(x\u2217 \u2212 (x\u2032)\u2217)(x \u2212 x\u2032)^\u22a4],    (3)\n\nwhere x, x\u2032 are the decodings for x\u2217, (x\u2032)\u2217 respectively:\n\nx = \u03c6_\u03b1((\u03a3 + E)^{\u22121}x\u2217),    x\u2032 = \u03c6_\u03b1((\u03a3 + E)^{\u22121}(x\u2032)\u2217).    (4)\n\nTo see why the ReLU function matters, consider the case when we do not use it. Then\n\nE\u0303 = E[(x\u2217 \u2212 (x\u2032)\u2217)(A\u2020A\u2217(x\u2217 \u2212 (x\u2032)\u2217))^\u22a4]\n  = E[(x\u2217 \u2212 (x\u2032)\u2217)(x\u2217 \u2212 (x\u2032)\u2217)^\u22a4] [(\u03a3 + E)^{\u22121}]^\u22a4\n  \u221d [(\u03a3 + E)^{\u22121}]^\u22a4 \u2248 \u03a3^{\u22121} \u2212 \u03a3^{\u22121}E^\u22a4\u03a3^{\u22121},\n\nwhere we used Taylor expansion and the fact that E[(x\u2217 \u2212 (x\u2032)\u2217)(x\u2217 \u2212 (x\u2032)\u2217)^\u22a4] is a scaling of\nidentity. Hence, if we think of \u03a3 as approximately I and take an appropriate r, the update to the\nmatrix E is approximately E \u2190 E \u2212 \u03b7E^\u22a4. Since we do not have control over the signs of E\nthroughout the iterations, the problematic case is when the entries of E^\u22a4 and E roughly match in\nsigns, which would lead to the entries of E increasing.\nNow we consider the decoding to see why ReLU is important. Ignoring the higher order terms and\nregarding \u03a3 = I, we have\n\nx = \u03c6_\u03b1((\u03a3 + E)^{\u22121}x\u2217) \u2248 \u03c6_\u03b1(\u03a3^{\u22121}x\u2217 \u2212 \u03a3^{\u22121}E\u03a3^{\u22121}x\u2217) \u2248 \u03c6_\u03b1(x\u2217 \u2212 Ex\u2217).    (5)\n\nThe problematic term is Ex\u2217. These errors when summed up will be comparable or even larger\nthan the signals, and the algorithm will fail. However, since the signals are non-negative and most\ncoordinates with errors only have small values, thresholding with ReLU properly can remove those\nerrors while keeping a large fraction of the signals. This leads to large \u03a3\u0303_{i,i} and small E\u0303_{j,i}\u2019s, and then\nwe can choose an r such that the E_{j,i}\u2019s keep decreasing while the \u03a3_{i,i}\u2019s stay in a certain range.\nTo get a quantitative bound, we divide E into its positive part E_+ and its negative part E_\u2212:\n\n[E_+]_{i,j} = max{E_{i,j}, 0},    [E_\u2212]_{i,j} = max{\u2212E_{i,j}, 0}.    (6)\n\nThe reason to do so is the following: when E_{i,j} is negative, by the Taylor expansion approximation,\n[(\u03a3 + E)^{\u22121}x\u2217]_i will tend to be more positive and will not be thresholded most of the time.\nTherefore, E_{j,i} will turn more positive at the next iteration. On the other hand, when E_{i,j} is positive,\n[(\u03a3 + E)^{\u22121}x\u2217]_i will tend to be more negative and zeroed out by the threshold function. Therefore,\nE_{j,i} will not be more negative at the next iteration. We will show for positive and negative parts of E:\npositive^(t+1) \u2190 (1 \u2212 \u03b7) positive^(t) + (\u03b7) negative^(t),    negative^(t+1) \u2190 (1 \u2212 \u03b7) negative^(t) + (\u03b5\u03b7) positive^(t)\nfor a small \u03b5 \u226a 1. Due to \u03b5, we can couple the two parts so that a weighted average of them will\ndecrease, which implies that \u2016E\u2016_s is small at the end. This leads to our coupled potential function.^2\n\n5.2 Analysis: proof sketch\n\nHere we describe a proof sketch for the simplified case while the complete proof for the general case\nis presented in the appendix. The lemmas here are direct corollaries of those in the appendix.\nOne iteration. We focus on one update and omit the superscript (t). Recall the definitions of E, \u03a3\nand N, and of E\u0303, \u03a3\u0303 and N\u0303, in Section 5.1. Our goal is to derive lower and upper bounds for E\u0303, \u03a3\u0303\nand N\u0303, assuming that \u03a3_{i,i} falls into some range around 1, while E and N are small. 
This will allow\ndoing induction on them.\nFirst, begin with the decoding. Some calculation shows that the decoding for y = A\u2217x\u2217 + \u03bd is\n\nx = \u03c6_\u03b1(Zx\u2217 + \u03be), where Z = (\u03a3 + E)^{\u22121}, \u03be = \u2212A\u2020NZx\u2217 + A\u2020\u03bd.    (7)\n\nNow, we can present our key lemmas bounding E\u0303, \u03a3\u0303, and N\u0303.\nLemma 3 (Simplified bound on E\u0303, informal). (1) If Z_{i,j} < 0, then |E\u0303_{j,i}| \u2264 O((1/n^2)(|Z_{i,j}| + c)),\n(2) if Z_{i,j} \u2265 0, then \u2212O(c/n^2 + (c/n)|Z_{i,j}| + (1/n^2)|Z_{i,j}|) \u2264 E\u0303_{j,i} \u2264 O((1/n)|Z_{i,j}|).\nNote that Z \u2248 \u03a3^{\u22121} \u2212 \u03a3^{\u22121}E\u03a3^{\u22121}, so Z_{i,j} < 0 corresponds roughly to E_{i,j} > 0. In this case,\nthe upper bound on |E\u0303_{j,i}| is very small and thus |E_{j,i}| decreases, as described in the intuition.\nWhat is most interesting is the case when Z_{i,j} \u2265 0 (roughly E_{i,j} < 0). The upper bound is much\nlarger, corresponding to the intuition that negative E_{i,j} can contribute a large positive value to E_{j,i}.\nFortunately, the lower bounds are of much smaller absolute value, which allows us to show that a\npotential function that couples Case (1) and Case (2) in Lemma 3 actually decreases; see the induction\nbelow.\nLemma 4 (Simplified bound on \u03a3\u0303, informal). \u03a3\u0303_{i,i} \u2265 \u2126(\u03a3_{i,i}^{\u22121} \u2212 \u03b1)/n.\nLemma 5 (Simplified bound on N\u0303, adversarial noise, informal). |N\u0303_{i,j}| \u2264 O(C_\u03bd/n).\nInduction by iterations. We now show how to use the three lemmas to prove the theorem for the\nadversarial noise, and that for the unbiased noise is similar.\nLet a_t := \u2016E_+^(t)\u2016_s and b_t := \u2016E_\u2212^(t)\u2016_s, and choose \u03b7 = \u2113/6. We begin with proving the following\nthree claims by induction on t: at the beginning of iteration t,\n\n(1) (1 \u2212 \u2113)I \u2aaf \u03a3^(t),\n(2) \u2016E^(t)\u2016_s \u2264 1/8, and if t > 0, then a_t + \u03b2b_t \u2264 (1 \u2212 (1/25)\u03b7)(a_{t\u22121} + \u03b2b_{t\u22121}) + \u03b7h, for some\n\u03b2 \u2208 (1, 8), and some small value h,\n(3) \u2016N^(t)\u2016_s \u2264 c/10.\n\nThe most interesting part is the second claim. At a high level, by Lemma 3, we can show that\n\na_{t+1} \u2264 (1 \u2212 (3/25)\u03b7) a_t + 7\u03b7b_t + \u03b7h,    b_{t+1} \u2264 (1 \u2212 (24/25)\u03b7) b_t + (1/100)\u03b7a_t + \u03b7h.\n\nNotice that the contribution of b_t to a_{t+1} is quite large (due to the larger upper bound in Case (2)\nin Lemma 3), but the other terms are much nicer, such as the small contribution of a_t to b_{t+1}. This\nallows us to choose a \u03b2 \u2208 (1, 8) so that a_{t+1} + \u03b2b_{t+1} leads to the desired recurrence in the second\nclaim. In other words, a_{t+1} + \u03b2b_{t+1} is our potential function which decreases at each iteration up to\nthe level h. The other claims can also be proved by the corresponding lemmas. Then the theorem\nfollows from the induction claims.\n\n^2 Note that since intuitively, E_{i,j} gets affected by E_{j,i} after an update, if we have a row which contains\nnegative entries, it is possible that \u2016A_i \u2212 A\u2217_i\u2016_1 increases. So we cannot simply use max_i \u2016A_i \u2212 A\u2217_i\u2016_1 as a\npotential function.\n\n6 More general results\nMore general weight distributions. Our argument holds under more general assumptions on x\u2217.\nTheorem 6 (Adversarial noise). 
There exists an absolute constant $G$ such that when Assumptions (A0)-(A3) and (N1) are satisfied with $\\ell = 1/10$, $C_2 \\le 2c_2$, $C_1^3 \\le G c_2^2 n$,\n\n$$C_\\nu \\le \\min\\left\\{ \\frac{c_2^2 G c}{C_1^2 m}, \\; \\frac{c_2^4 G c}{C_1^5 n \\left\\|(A^*)^\\dagger\\right\\|_\\infty} \\right\\}$$\n\nfor $0 \\le c \\le 1$, and\n\n$$\\left\\|N^{(0)}\\right\\|_\\infty \\le \\frac{c_2^2 G c}{C_1^3 \\left\\|(A^*)^\\dagger\\right\\|_\\infty},$$\n\nthen there exist $\\alpha, \\eta, r$ such that for every $0 < \\epsilon, \\delta < 1$ and $N = \\mathrm{poly}(n, m, 1/\\epsilon, 1/\\delta)$, with probability at least $1 - \\delta$ the following holds.\nAfter $T = O\\left(\\ln \\frac{1}{\\epsilon}\\right)$ iterations, Algorithm 1 outputs a solution $A = A^*(\\Sigma + E) + N$ where $\\Sigma \\succeq (1 - \\ell) I$ is diagonal, $\\|E\\|_1 \\le \\epsilon + c/2$ is off-diagonal, and $\\|N\\|_1 \\le c/2$.\nTheorem 7 (Unbiased noise). If Assumptions (A0)-(A3) and (N2) are satisfied with\n\n$$C_\\nu = \\frac{c_2 G \\sqrt{cn}}{C_1 \\max\\left\\{m, \\, n \\left\\|(A^*)^\\dagger\\right\\|_\\infty\\right\\}}$$\n\nand the other parameters set as in Theorem 6, then the same guarantee holds.\n\nThe conditions on $C_1, c_2, C_2$ intuitively mean that each feature needs to appear with reasonable probability. $C_2 \\le 2c_2$ means that their proportions are reasonably balanced. This may be a mild restriction for some applications, and additionally we propose a pre-processing step that can relax this in the next subsection. The conditions allow a rather general family of distributions, so we point out an important special case to provide a more concrete sense of the parameters. For example, for the uniform independent distribution considered in the simplified case, we can actually allow $s$ to be much larger than a constant; our algorithm just requires $s \\le Gn$ for a fixed constant $G$. So it works for uniform sparse distributions even when the sparsity is linear, which is an order of magnitude larger than in the dictionary learning regime. 
Furthermore, the distributions of $x_i^*$ can be very different, since we only require $C_1^3 = O(c_2^2 n)$. Moreover, all these can be handled without specific structural assumptions on $A^*$.\nMore general proportions. A mild restriction in Theorems 6 and 7 is that $C_2 \\le 2c_2$, that is, $\\max_{i \\in [n]} \\mathbb{E}[(x_i^*)^2] \\le 2 \\min_{i \\in [n]} \\mathbb{E}[(x_i^*)^2]$. To satisfy this, we propose a preprocessing algorithm for balancing $\\mathbb{E}[(x_i^*)^2]$. The idea is quite simple: instead of solving $Y \\approx A^* X$, we could also solve $Y \\approx [A^* D][D^{-1} X]$ for a positive diagonal matrix $D$, where the values $\\mathbb{E}[(x_i^*)^2]/D_{i,i}^2$ are within a factor of 2 of each other. We show in the appendix that this can be done under the same assumptions as the above theorems, and additionally $\\Sigma \\preceq (1 + \\ell) I$ and $E^{(0)} \\ge 0$ entry-wise. After balancing, one can use Algorithm 1 on the new ground-truth matrix $[A^* D]$ to get the final result.\n\n7 Conclusion\n\nA simple and natural algorithm that alternates between decoding and updating is proposed for non-negative matrix factorization and theoretical guarantees are provided. The algorithm provably recovers a feature matrix close to the ground-truth and is robust to noise. Our analysis provides insights on the effect of the ReLU units in the presence of the non-negativity constraints, and the resulting interesting dynamics of the convergence.\n\nAcknowledgements\n\nThis work was supported in part by NSF grants CCF-1527371, DMS-1317308, Simons Investigator Award, Simons Collaboration Grant, and ONR-N00014-16-1-2329.\n\nReferences\n[AGH+13] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu. A practical algorithm for topic modeling with provable guarantees. In ICML, 2013.\n\n[AGKM12] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra. Computing a nonnegative matrix factorization\u2013provably. In STOC, pages 145\u2013162. ACM, 2012.\n\n[AGM12] S. 
Arora, R. Ge, and A. Moitra. Learning topic models \u2013 going beyond SVD. In FOCS, 2012.\n\n[AGMM15] S. Arora, R. Ge, T. Ma, and A. Moitra. Simple, efficient, and neural algorithms for sparse coding. In COLT, 2015.\n\n[AGMS12] Sanjeev Arora, Rong Ge, Ankur Moitra, and Sushant Sachdeva. Provable ICA with unknown Gaussian noise, with implications for Gaussian mixtures and autoencoders. In NIPS, pages 2375\u20132383, 2012.\n\n[AKF+12] A. Anandkumar, S. Kakade, D. Foster, Y. Liu, and D. Hsu. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. Technical report, 2012.\n\n[AR15] Pranjal Awasthi and Andrej Risteski. On some provably correct cases of variational inference for topic models. In NIPS, pages 2089\u20132097, 2015.\n\n[BGKP16] Chiranjib Bhattacharyya, Navin Goyal, Ravindran Kannan, and Jagdeep Pani. Non-negative matrix factorization under heavy noise. In Proceedings of the 33rd International Conference on Machine Learning, 2016.\n\n[Ble12] David M Blei. Probabilistic topic models. Communications of the ACM, 2012.\n\n[BNJ03] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. JMLR, 3:993\u20131022, 2003.\n\n[lda16] Lda-c software. https://github.com/blei-lab/lda-c/blob/master/readme.txt, 2016. Accessed: 2016-05-19.\n\n[LS97] Daniel D Lee and H Sebastian Seung. Unsupervised learning by convex and conic coding. NIPS, pages 515\u2013521, 1997.\n\n[LS99] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788\u2013791, 1999.\n\n[LS01] Daniel D Lee and H Sebastian Seung. Algorithms for non-negative matrix factorization. In NIPS, pages 556\u2013562, 2001.\n\n[OF97] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23):3311\u20133325, 1997.\n\n[PC14] Cengiz Pehlevan and Dmitri B Chklovskii. 
A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In Asilomar Conference on Signals, Systems and Computers, pages 769\u2013775. IEEE, 2014.\n\n[PC15] Cengiz Pehlevan and Dmitri Chklovskii. A normative theory of adaptive dimensionality reduction in neural networks. In NIPS, pages 2260\u20132268, 2015.\n\n[ZX12] Jun Zhu and Eric P Xing. Sparse topical coding. arXiv preprint arXiv:1202.3778, 2012.", "award": [], "sourceid": 2540, "authors": [{"given_name": "Yuanzhi", "family_name": "Li", "institution": "Princeton University"}, {"given_name": "Yingyu", "family_name": "Liang", "institution": "Princeton University"}, {"given_name": "Andrej", "family_name": "Risteski", "institution": "Princeton University"}]}