{"title": "Matrix factorization with binary components", "book": "Advances in Neural Information Processing Systems", "page_first": 3210, "page_last": 3218, "abstract": "Motivated by an application in computational biology, we consider constrained low-rank matrix factorization problems with $\\{0,1\\}$-constraints on one of the factors. In addition to the the non-convexity shared with more general matrix factorization schemes, our problem is further complicated by a combinatorial constraint set of size $2^{m \\cdot r}$, where $m$ is the dimension of the data points and $r$ the rank of the factorization. Despite apparent intractability, we provide $-$in the line of recent work on non-negative matrix factorization by Arora et al.~(2012)$-$ an algorithm that provably recovers the underlying factorization in the exact case with operations of the order $O(m r 2^r + mnr)$ in the worst case. To obtain that result, we invoke theory centered around a fundamental result in combinatorics, the Littlewood-Offord lemma.", "full_text": "Matrix factorization with Binary Components\n\nMartin Slawski, Matthias Hein and Pavlo Lutsik\n\nSaarland University\n\n{ms,hein}@cs.uni-saarland.de, p.lutsik@mx.uni-saarland.de\n\nAbstract\n\nMotivated by an application in computational biology, we consider low-rank ma-\ntrix factorization with {0, 1}-constraints on one of the factors and optionally con-\nvex constraints on the second one. In addition to the non-convexity shared with\nother matrix factorization schemes, our problem is further complicated by a com-\nbinatorial constraint set of size 2m\u00b7r, where m is the dimension of the data points\nand r the rank of the factorization. Despite apparent intractability, we provide\n\u2212 in the line of recent work on non-negative matrix factorization by Arora et\nal. (2012)\u2212 an algorithm that provably recovers the underlying factorization in the\nexact case with O(mr2r + mnr + r2n) operations for n datapoints. 
To obtain this\nresult, we use theory around the Littlewood-Offord lemma from combinatorics.\n\n1 Introduction\n\nLow-rank matrix factorization techniques like the singular value decomposition (SVD) constitute\nan important tool in data analysis yielding a compact representation of data points as linear com-\nbinations of a comparatively small number of \u2019basis elements\u2019 commonly referred to as factors,\ncomponents or latent variables. Depending on the speci\ufb01c application, the basis elements may be\nrequired to ful\ufb01ll additional properties, e.g. non-negativity [1, 2], smoothness [3] or sparsity [4, 5].\nIn the present paper, we consider the case in which the basis elements are constrained to be binary,\ni.e. we aim at factorizing a real-valued data matrix D into a product T A with T \u2208 {0, 1}m\u00d7r and\nA \u2208 Rr\u00d7n, r \u226a min{m, n}. Such decomposition arises e.g. in blind source separation in wire-\nless communication with binary source signals [6]; in network inference from gene expression data\n[7, 8], where T encodes connectivity of transcription factors and genes; in unmixing of cell mixtures\nfrom DNA methylation signatures [9] in which case T represents presence/absence of methylation;\nor in clustering with overlapping clusters with T as a matrix of cluster assignments [10, 11].\nSeveral other matrix factorizations involving binary matrices have been proposed in the literature. In\n[12] and [13] matrix factorization for binary input data, but non-binary factors T and A is discussed,\nwhereas a factorization T W A with both T and A binary and real-valued W is proposed in [14],\nwhich is more restrictive than the model of the present paper. The model in [14] in turn encom-\npasses binary matrix factorization as proposed in [15], where all of D, T and A are constrained to\nbe binary. 
It is important to note that this line of research is fundamentally different from Boolean matrix factorization [16], which is sometimes also referred to as binary matrix factorization.

A major drawback of matrix factorization schemes is non-convexity. As a result, there is in general no algorithm that is guaranteed to compute the desired factorization. Algorithms such as block coordinate descent, EM, MCMC, etc. commonly employed in practice lack theoretical guarantees beyond convergence to a local minimum. Substantial progress in this regard has been achieved recently for non-negative matrix factorization (NMF) by Arora et al. [17] and follow-up work in [18], where it is shown that under certain additional conditions, the NMF problem can be solved to global optimality by means of linear programming. Apart from being a non-convex problem, the matrix factorization studied in the present paper is further complicated by the {0, 1}-constraints imposed on the left factor T, which yields a combinatorial optimization problem that appears to be computationally intractable except for tiny dimensions m and r, even if the right factor A were already known. Despite the obvious hardness of the problem, we present as our main contribution an algorithm that provably provides an exact factorization D = TA whenever such a factorization exists. Our algorithm has exponential complexity only in the rank r of the factorization, but scales linearly in m and n. In particular, the problem remains tractable even for large values of m as long as r remains small. We extend the algorithm to the approximate case D ≈ TA and empirically show superior performance relative to heuristic approaches to the problem. Moreover, we establish uniqueness of the exact factorization under the separability condition from the NMF literature [17, 19], or alternatively with high probability for T drawn uniformly at random. 
As a corollary, we obtain that at least for these two models, the suggested algorithm continues to be fully applicable if additional constraints, e.g. non-negativity, are imposed on the right factor A. We demonstrate the practical usefulness of our approach in unmixing DNA methylation signatures of blood samples [9].

Notation. For a matrix M and index sets I, J, MI,J denotes the submatrix corresponding to I and J; MI,: and M:,J denote the submatrices formed by the rows in I respectively the columns in J. We write [M; M′] and [M, M′] for the row- respectively column-wise concatenation of M and M′. The affine hull generated by the columns of M is denoted by aff(M). The symbols 1/0 denote vectors or matrices of ones/zeroes and I denotes the identity matrix. We use |·| for the cardinality of a set.

Supplement. The supplement contains all proofs, additional comments and experimental results.

2 Exact case

We start by considering the exact case, i.e. we suppose that a factorization having the desired properties exists. We first discuss the geometric ideas underlying our basic approach for recovering such a factorization from the data matrix before presenting conditions under which the factorization is unique. It is shown that the question of uniqueness as well as the computational performance of our approach is intimately connected to the Littlewood-Offord problem in combinatorics [20].

2.1 Problem formulation. Given D ∈ R^{m×n}, we consider the following problem.

find T ∈ {0, 1}^{m×r} and A ∈ R^{r×n}, A⊤1r = 1n such that D = TA.   (1)

The columns {T:,k}, k = 1, . . . , r, of T, which are vertices of the hypercube [0, 1]^m, are referred to as components. The requirement A⊤1r = 1n entails that the columns of D are affine instead of linear combinations of the columns of T. 
This additional constraint is not essential to our approach; it is imposed\nfor reasons of presentation, in order to avoid that the origin is treated differently from the other ver-\ntices of [0, 1]m, because otherwise the zero vector could be dropped from T , leaving the factorization\nunchanged. We further assume w.l.o.g. that r is minimal, i.e. there is no factorization of the form (1)\nwith r\u2032 < r, and in turn that the columns of T are af\ufb01nely independent, i.e. \u2200\u03bb \u2208 Rr, \u03bb\u22a41r = 0,\nT \u03bb = 0 implies that \u03bb = 0. Moreover, it is assumed that rank(A) = r. This ensures the existence\nof a submatrix A:,C of r linearly independent columns and of a corresponding submatrix of D:,C of\naf\ufb01nely independent columns, when combined with the af\ufb01ne independence of the columns of T :\n\n\u2200\u03bb \u2208 Rr, \u03bb\u22a41r = 0 : D:,C\u03bb = 0 \u21d0\u21d2 T (A:,C\u03bb) = 0 =\u21d2 A:,C\u03bb = 0 =\u21d2 \u03bb = 0,\n\n(2)\n\nusing at the second step that 1\u22a4r A:,C\u03bb = 1\u22a4r \u03bb = 0 and the af\ufb01ne independence of the {T:,k}r\nk=1.\nNote that the assumption rank(A) = r is natural; otherwise, the data would reside in an af\ufb01ne\nsubspace of lower dimension so that D would not contain enough information to reconstruct T .\n\n2.2 Approach. Property (2) already provides the entry point of our approach. From D = T A, it is\nobvious that aff(T ) \u2287 aff(D). Since D contains the same number of af\ufb01nely independent columns\nas T , it must also hold that aff(D) \u2287 aff(T ), in particular aff(D) \u2287 {T:,k}r\nk=1. Consequently, (1)\ncan in principle be solved by enumerating all vertices of [0, 1]m contained in aff(D) and selecting a\nmaximal af\ufb01nely independent subset thereof (see Figure 1). This procedure, however, is exponential\nin the dimension m, with 2m vertices to be checked for containment in aff(D) by solving a linear\nsystem. 
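The enumeration just described rests on a simple primitive: testing whether a given vertex v ∈ {0, 1}^m is contained in aff(D) by solving a linear system with a sum-to-one constraint on the coefficients. A minimal NumPy sketch of this check (the helper name and the tolerance are ours, not the paper's):

```python
import numpy as np

def in_affine_hull(v, D, tol=1e-9):
    """Test whether the vector v lies in the affine hull of the columns of D.

    Solves the stacked system [1^T; D] lam = [1; v] in the least-squares
    sense; v is in aff(D) exactly when the residual vanishes.
    """
    n = D.shape[1]
    M = np.vstack([np.ones((1, n)), D])      # first row enforces sum(lam) = 1
    b = np.concatenate([[1.0], v])
    lam, *_ = np.linalg.lstsq(M, b, rcond=None)
    return np.linalg.norm(M @ lam - b) < tol
```

With D = TA and A column-stochastic, every column of T passes the check, while a generic vertex outside aff(T) fails it.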
Remarkably, the following observation along with its proof, which prompts Algorithm 1 below, shows that the number of elements to be checked can be reduced to 2^{r−1} irrespective of m.

Proposition 1. The affine subspace aff(D) contains no more than 2^{r−1} vertices of [0, 1]^m. Moreover, Algorithm 1 provides all vertices contained in aff(D).

Algorithm 1 FINDVERTICES EXACT

1. Fix p ∈ aff(D) and compute P = [D:,1 − p, . . . , D:,n − p].
2. Determine r − 1 linearly independent columns C of P, obtaining P:,C, and subsequently r − 1 linearly independent rows R, obtaining PR,C ∈ R^{(r−1)×(r−1)}.
3. Form Z = P:,C (PR,C)^{−1} ∈ R^{m×(r−1)} and T̂ = Z(B^{(r−1)} − pR 1⊤_{2^{r−1}}) + p 1⊤_{2^{r−1}} ∈ R^{m×2^{r−1}}, where the columns of B^{(r−1)} correspond to the elements of {0, 1}^{r−1}.
4. Set T = ∅. For u = 1, . . . , 2^{r−1}, if T̂:,u ∈ {0, 1}^m set T = T ∪ {T̂:,u}.
5. Return T = {0, 1}^m ∩ aff(D).

Algorithm 2 BINARYFACTORIZATION EXACT

1. Obtain T as output from FINDVERTICES EXACT(D).
2. Select r affinely independent elements of T to be used as columns of T.
3. Obtain A as solution of the linear system [1⊤r; T]A = [1⊤n; D].
4. Return (T, A) solving problem (1).

Figure 1: Illustration of the geometry underlying our approach in dimension m = 3. Dots represent data points and the shaded areas their affine hulls aff(D) ∩ [0, 1]^m. Left: aff(D) intersects with r + 1 vertices of [0, 1]^m. Right: aff(D) intersects with precisely r vertices.

Comments. In step 2 of Algorithm 1, determining the rank of P and an associated set of linearly independent columns/rows can be done by means of a rank-revealing QR factorization [21, 22]. The crucial step is the third one, which is a compact description of first solving the linear systems PR,C λ = b − pR for all b ∈ {0, 1}^{r−1} and back-substituting the result to compute candidate vertices P:,C λ + p, stacked into the columns of T̂; the addition/subtraction of p is merely because we have to deal with an affine instead of a linear subspace, in which p serves as origin. In step 4, the pool of 2^{r−1} 'candidates' is filtered, yielding T = aff(D) ∩ {0, 1}^m.

Determining T is the hardest part in solving the matrix factorization problem (1). Given T, the solution can be obtained after a few inexpensive standard operations. Note that step 2 in Algorithm 2 is not necessary if one does not aim at finding a minimal factorization, i.e. if it suffices to have D = TA with T ∈ {0, 1}^{m×r′} but r′ possibly larger than r. As detailed in the supplement, the case without sum-to-one constraints on A can be handled similarly, as can the model in [14] with binary left and right factors and a real-valued middle factor.

Computational complexity. The dominating cost in Algorithm 1 is the computation of the candidate matrix T̂ and checking whether its columns are vertices of [0, 1]^m. Note that

T̂R,: = ZR,:(B^{(r−1)} − pR 1⊤_{2^{r−1}}) + pR 1⊤_{2^{r−1}} = I_{r−1}(B^{(r−1)} − pR 1⊤_{2^{r−1}}) + pR 1⊤_{2^{r−1}} = B^{(r−1)},   (3)

i.e. the r − 1 rows of T̂ corresponding to R do not need to be taken into account. Forming the matrix T̂ would hence require O((m − r + 1)(r − 1)2^{r−1}) and the subsequent check for vertices in the fourth step O((m − r + 1)2^{r−1}) operations. All other operations are of lower order provided e.g. 
(m − r + 1)2^{r−1} > n. The second most expensive operation is forming the matrix PR,C in step 2 with the help of a QR decomposition, requiring O(mn(r − 1)) operations in typical cases [21]. Computing the matrix factorization (1) after the vertices have been identified (steps 2 to 4 in Algorithm 2) has complexity O(mnr + r^3 + r^2 n). Here, the dominating part is the solution of a linear system in r variables and n right-hand sides. Altogether, our approach for solving (1) has exponential complexity in r, but only linear complexity in m and n. Later on, we will argue that under additional assumptions on T, the O((m − r + 1)2^{r−1}) terms can be reduced to O((r − 1)2^{r−1}).

2.3 Uniqueness. In this section, we study uniqueness of the matrix factorization problem (1) (modulo permutation of columns/rows). First note that in view of the affine independence of the columns of T, the factorization is unique iff T is, which holds iff

aff(D) ∩ {0, 1}^m = aff(T) ∩ {0, 1}^m = {T:,1, . . . , T:,r},   (4)

i.e. if the affine subspace generated by {T:,1, . . . , T:,r} contains no other vertices of [0, 1]^m than the r given ones (cf. Figure 1). Uniqueness is of great importance in applications, where one aims at an interpretation in which the columns of T play the role of underlying data-generating elements. Such an interpretation is not valid if (4) fails to hold, since it is then possible to replace one of the columns of a specific choice of T by another vertex contained in the same affine subspace.

Solution of a non-negative variant of our factorization. In the sequel, we argue that property (4) plays an important role from a computational point of view when solving extensions of problem (1) in which further constraints are imposed on A. 
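For concreteness, the pipeline of Algorithms 1 and 2 from Section 2.2 can be sketched in NumPy as follows. This is a simplified rendering of the exact case: greedy rank tests stand in for the rank-revealing QR factorization mentioned in the comments, and all function names are ours:

```python
import numpy as np
from itertools import product

def _independent(M, k):
    # Greedily pick (the indices of) k linearly independent columns of M.
    idx = []
    for j in range(M.shape[1]):
        if np.linalg.matrix_rank(M[:, idx + [j]]) > len(idx):
            idx.append(j)
        if len(idx) == k:
            break
    return idx

def find_vertices_exact(D, r, tol=1e-8):
    """Algorithm 1 (sketch): all vertices of [0,1]^m contained in aff(D)."""
    p = D[:, 0]                                   # any point of aff(D)
    P = D - p[:, None]
    C = _independent(P, r - 1)                    # r-1 independent columns
    R = _independent(P[:, C].T, r - 1)            # r-1 independent rows
    Z = P[:, C] @ np.linalg.inv(P[np.ix_(R, C)])
    B = np.array(list(product([0, 1], repeat=r - 1)), float).T
    cand = Z @ (B - p[R][:, None]) + p[:, None]   # 2^{r-1} candidate columns
    is01 = np.all((np.abs(cand) < tol) | (np.abs(cand - 1) < tol), axis=0)
    return np.round(cand[:, is01])

def binary_factorization_exact(D, r):
    """Algorithm 2 (sketch): T in {0,1}^{m x r} and A with columns summing to 1."""
    V = find_vertices_exact(D, r)
    aug = np.vstack([np.ones((1, V.shape[1])), V])
    T = V[:, _independent(aug, r)]                # r affinely independent vertices
    A = np.linalg.lstsq(np.vstack([np.ones((1, r)), T]),
                        np.vstack([np.ones((1, D.shape[1])), D]), rcond=None)[0]
    return T, A
```

On exact data D = TA the stacked system in the last step is consistent, so the least-squares solve returns an A whose columns sum to one, as required by (1).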
One particularly important extension is the following.

find T ∈ {0, 1}^{m×r} and A ∈ R^{r×n}_+, A⊤1r = 1n such that D = TA.   (5)

Problem (5) is a special instance of non-negative matrix factorization. Problem (5) is of particular interest in the present paper, leading to a novel real-world application of matrix factorization techniques as presented in Section 4.2 below. It is natural to ask whether Algorithm 2 can be adapted to solve problem (5). A change is obviously required for the second step when selecting r vertices from T, since in (5) the columns of D now have to be expressed as convex instead of only affine combinations of columns of T: picking an affinely independent collection from T does not take into account the non-negativity constraint imposed on A. If, however, (4) holds, we have |T| = r and Algorithm 2 must return a solution of (5) provided that there exists one.

Corollary 1. If problem (1) has a unique solution, i.e. if condition (4) holds, and if there exists a solution of (5), then it is returned by Algorithm 2.

To appreciate that result, consider the converse case |T| > r. Since the aim is a minimal factorization, one has to find a subset of T of cardinality r such that (5) can be solved. In principle, this can be achieved by solving a linear program for each of the \binom{|T|}{r} subsets of T, but this is in general not computationally feasible: the upper bound of Proposition 1 indicates that |T| = 2^{r−1} in the worst case. For the example below, T consists of all 2^{r−1} vertices contained in an (r−1)-dimensional face of [0, 1]^m:

T = [I_{r−1}, 0_{r−1}; 0⊤_r; 0_{(m−r)×r}]  with  T = { Tλ : λ1 ∈ {0, 1}, . . . , λ_{r−1} ∈ {0, 1}, λr = 1 − Σ_{k=1}^{r−1} λk }.   (6)

Uniqueness under separability. In view of the negative example (6), one might ask whether uniqueness according to (4) can be achieved under additional conditions on T. We prove uniqueness under separability, a condition introduced in [19] and imposed recently in [17] to show solvability of the NMF problem by linear programming. We say that T is separable if there exists a permutation Π such that ΠT = [M; I_r], where M ∈ {0, 1}^{(m−r)×r}.

Proposition 2. If T is separable, condition (4) holds and thus problem (1) has a unique solution.

Uniqueness under generic random sampling. Both the negative example (6) as well as the positive result of Proposition 2 are associated with special matrices T. This raises the question whether uniqueness holds respectively fails for broader classes of binary matrices. In order to gain insight into this question, we consider random T with i.i.d. entries from a Bernoulli distribution with parameter 1/2 and study the probability of the event {aff(T) ∩ {0, 1}^m = {T:,1, . . . , T:,r}}. This question has essentially been studied in combinatorics [23], with further improvements in [24]. The results therein rely crucially on Littlewood-Offord theory (see Section 2.4 below).

Theorem 1. Let T be a random m × r matrix whose entries are drawn i.i.d. from {0, 1} with probability 1/2. Then there is a constant C so that if r ≤ m − C,

P( aff(T) ∩ {0, 1}^m = {T:,1, . . . , T:,r} ) ≥ 1 − (1 + o(1)) 4 \binom{r}{3} (3/4)^m − (3/4 + o(1))^m  as m → ∞.

Theorem 1 suggests a positive answer to the question of uniqueness posed above. 
For m large enough and r small compared to m (in fact, following [24] one may conjecture that Theorem 1 holds with C = 1), the probability that the affine hull of r vertices of [0, 1]^m selected uniformly at random contains some other vertex is exponentially small in the dimension m. We have empirical evidence that the result of Theorem 1 continues to hold if the entries of T are drawn from a Bernoulli distribution with parameter in (0, 1) sufficiently far away from the boundary points (cf. supplement). As a byproduct, these results imply that also the NMF variant of our matrix factorization problem (5) can in most cases be reduced to identifying a set of r vertices of [0, 1]^m (cf. Corollary 1).

2.4 Speeding up Algorithm 1. In Algorithm 1, an m × 2^{r−1} matrix T̂ of potential vertices is formed (Step 3). We have discussed the case (6) where all candidates must indeed be vertices, in which case it seems to be impossible to reduce the computational cost of O((m − r)r2^{r−1}), which becomes significant once m is in the thousands and r ≥ 25. On the positive side, Theorem 1 indicates that for many instances of T, only r out of 2^{r−1} candidates are in fact vertices. In that case, noting that columns of T̂ cannot be vertices if a single coordinate is not in {0, 1} (and that the vast majority of columns of T̂ must have one such coordinate), it is computationally more favourable to incrementally compute subsets of rows of T̂ and to already discard those columns with coordinates not in {0, 1}. We have observed empirically that this scheme rapidly reduces the candidate set: already checking a single row of T̂ eliminates a substantial portion (see Figure 2).

Littlewood-Offord theory. Theoretical underpinning for the last observation can be obtained from a result in combinatorics, the Littlewood-Offord (L-O) lemma. 
Various extensions of that result have been developed until recently; see the survey [25]. We here cite the L-O lemma in its basic form.

Theorem 2. [20] Let a1, . . . , aℓ ∈ R \ {0} and y ∈ R.

(i) |{ b ∈ {0, 1}^ℓ : Σ_{i=1}^ℓ a_i b_i = y }| ≤ \binom{ℓ}{⌊ℓ/2⌋}.

(ii) If |a_i| ≥ 1, i = 1, . . . , ℓ, then |{ b ∈ {0, 1}^ℓ : Σ_{i=1}^ℓ a_i b_i ∈ (y, y + 1) }| ≤ \binom{ℓ}{⌊ℓ/2⌋}.

The two parts of Theorem 2 are referred to as the discrete respectively continuous L-O lemma. The discrete L-O lemma provides an upper bound on the number of {0, 1}-vectors whose weighted sum with given weights {a_i}, i = 1, . . . , ℓ, is equal to some given number y, whereas the stronger continuous version, under a more stringent condition on the weights, upper bounds the number of {0, 1}-vectors whose weighted sum is contained in some interval (y, y + 1). In order to see the relation of Theorem 2 to Algorithm 1, let us re-inspect the third step of that algorithm. To obtain a reduction of candidates by checking a single row of T̂ = Z(B^{(r−1)} − pR 1⊤_{2^{r−1}}) + p 1⊤_{2^{r−1}}, pick i ∉ R (recall that coordinates in R do not need to be checked, cf. (3)) and u ∈ {1, . . . , 2^{r−1}} arbitrary. The u-th candidate can be a vertex only if T̂i,u ∈ {0, 1}. The condition T̂i,u = 0 can be written as

Σ_{k=1}^{r−1} Z_{i,k} B^{(r−1)}_{k,u} = Z_{i,:} pR − p_i,   (7)

i.e. a weighted sum of the {0, 1}-vector b = B^{(r−1)}_{:,u} with weights {Z_{i,k}}, k = 1, . . . , r − 1, equals the fixed number y = Z_{i,:} pR − p_i. A similar reasoning applies when setting T̂i,u = 1. Provided none of the entries of Z_{i,:} is zero, the discrete L-O lemma implies that there are at most 2\binom{r−1}{⌊(r−1)/2⌋} out of the 2^{r−1} candidates for which the i-th coordinate is in {0, 1}. This yields a reduction of the candidate set by 2\binom{r−1}{⌊(r−1)/2⌋}/2^{r−1} = O(1/√(r−1)). Admittedly, this reduction may appear insignificant given the total number of candidates to be checked. The reduction achieved empirically (cf. Figure 2) is typically larger. Stronger reductions have been proven under additional assumptions on the weights {a_i}: e.g. for distinct weights, one obtains a reduction of O((r − 1)^{−3/2}) [25]. Furthermore, when picking successively d rows of T̂, and if one assumes that each row yields a reduction according to the discrete L-O lemma, one would obtain the reduction (r − 1)^{−d/2}, so that d = r − 1 would suffice to identify all vertices provided r ≥ 4. Evidence for the rate (r − 1)^{−d/2} can be found in [26]. This indicates a reduction in complexity of Algorithm 1 from O((m − r)r2^{r−1}) to O(r^2 2^{r−1}).

Achieving further speed-up with integer linear programming. The continuous L-O lemma (part (ii) of Theorem 2) combined with the derivation leading to (7) allows us to tackle even the case r = 80 (2^{80} ≈ 10^{24}). In view of the continuous L-O lemma, a reduction in the number of candidates can still be achieved if the requirement is weakened to T̂i,u ∈ [0, 1]. According to (7), the candidates satisfying the relaxed constraint for the i-th coordinate can be obtained from the feasibility problem

find b ∈ {0, 1}^{r−1} subject to 0 ≤ Z_{i,:}(b − pR) + p_i ≤ 1,   (8)

which is an integer linear program that can be solved e.g. by CPLEX. The L-O theory suggests that the branch-and-bound strategy employed therein is likely to be successful. With the help of CPLEX, it is affordable to solve problem (8) with all m − r + 1 constraints (one for each of the rows of T̂ to be checked) imposed simultaneously. 
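The row-wise filtering idea of this section is straightforward to sketch: keep the 2^{r−1} candidates implicitly as columns of B^{(r−1)} and evaluate one coordinate of the candidate matrix at a time, discarding candidates whose value falls outside {0, 1}. A NumPy illustration (the function and its argument names are ours):

```python
import numpy as np
from itertools import product

def prune_candidates(Z, p, R, rows, tol=1e-8):
    """Filter the 2^{r-1} candidate patterns row by row.

    A candidate b in {0,1}^{r-1} can correspond to a vertex only if, for
    every checked coordinate i, Z[i] @ (b - p[R]) + p[i] lies in {0, 1}.
    """
    r1 = Z.shape[1]
    B = np.array(list(product([0, 1], repeat=r1)), float).T  # candidates as columns
    keep = np.ones(B.shape[1], dtype=bool)
    for i in rows:
        vals = Z[i] @ (B[:, keep] - p[R][:, None]) + p[i]
        ok = (np.abs(vals) < tol) | (np.abs(vals - 1) < tol)
        keep[np.where(keep)[0]] = ok          # drop candidates failing row i
    return B[:, keep]
```

In line with the empirical observation above, on well-behaved instances even a single checked coordinate already removes non-vertex candidates.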
We always recovered directly the underlying vertices in our experiments, and only these, without the need to prune the solution pool (which could be achieved by Algorithm 1, replacing the 2^{r−1} candidates by a potentially much smaller solution pool).

[Figure 2 shows two panels: left, "Maximum number of remaining vertices (out of 2^{r−1}) over 100 trials", with curves for r ∈ {8, 16, 24} and p ∈ {0.02, 0.1, 0.5}; right, "Speed-up achieved by CPLEX (m = 1000)", comparing Algorithm 1 and CPLEX for p ∈ {0.1, 0.5, 0.9}.]

Figure 2: Left: Speeding up the algorithm by checking single coordinates; remaining number of candidates vs. number of coordinates checked (m = 1000). Right: Speed-up by CPLEX compared to Algorithm 1. For both plots, T is drawn entry-wise from a Bernoulli distribution with parameter p.

3 Approximate case

In the sequel, we discuss an extension of our approach to handle the approximate case D ≈ TA with T and A as in (1). In particular, we have in mind the case of additive noise, i.e. D = TA + E with ‖E‖_F small. While the basic concept of Algorithm 1 can be adopted, changes are necessary because, first, D may have full rank min{m, n} and, second, aff(D) ∩ {0, 1}^m = ∅, i.e. the distances of aff(D) and the {T:,k}, k = 1, . . . , r, may be strictly positive (but are at least assumed to be small).

Algorithm 3 FINDVERTICES APPROXIMATE

1. Let p = D1n/n and compute P = [D:,1 − p, . . . , D:,n − p].
2. Compute U^{(r−1)} ∈ R^{m×(r−1)}, the left singular vectors corresponding to the r − 1 largest singular values of P. Select r − 1 linearly independent rows R of U^{(r−1)}, obtaining U^{(r−1)}_{R,:} ∈ R^{(r−1)×(r−1)}.
3. Form Z = U^{(r−1)}(U^{(r−1)}_{R,:})^{−1} and T̂ = Z(B^{(r−1)} − pR 1⊤_{2^{r−1}}) + p 1⊤_{2^{r−1}}.
4. Compute T̂^{01} ∈ R^{m×2^{r−1}}: for u = 1, . . . , 2^{r−1} and i = 1, . . . , m, set T̂^{01}_{i,u} = I(T̂_{i,u} > 1/2).
5. For u = 1, . . . , 2^{r−1}, set δu = ‖T̂:,u − T̂^{01}_{:,u}‖2. Order increasingly s.t. δ_{u_1} ≤ . . . ≤ δ_{u_{2^{r−1}}}.
6. Return T = [T̂^{01}_{:,u_1}, . . . , T̂^{01}_{:,u_r}].

As distinguished from the exact case, Algorithm 3 requires the number of components r to be specified in advance, as is typically the case in noisy matrix factorization problems. Moreover, the vector p subtracted from all columns of D in step 1 is chosen as the mean of the data points, which is in particular a reasonable choice if D is contaminated with additive noise distributed symmetrically around zero. The truncated SVD of step 2 achieves the desired dimension reduction and potentially reduces noise corresponding to small singular values that are discarded. The last change arises in step 5. While in the exact case one identifies all columns of T̂ that are in {0, 1}^m, one instead only identifies columns close to {0, 1}^m. Given the output of Algorithm 3, we solve the approximate matrix factorization problem via least squares, obtaining the right factor from min_A ‖D − TA‖²_F.

Refinements. Improved performance for higher noise levels can be achieved by running Algorithm 3 multiple times with different sets of rows selected in step 2, which yields candidate matrices {T^{(l)}}, l = 1, . . . , s, and subsequently using T = argmin_{T^{(l)}} min_A ‖D − T^{(l)}A‖²_F, i.e. one picks the candidate yielding the best fit. 
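Algorithm 3 admits a compact sketch along the same lines; NumPy's SVD and a greedy row selection stand in for the pivoted QR one would use in practice, and the function name is ours. In the noise-free case the procedure reduces to the exact enumeration of Algorithm 1, which is what the sanity check below verifies:

```python
import numpy as np
from itertools import product

def find_vertices_approximate(D, r):
    """Algorithm 3 (sketch): noisy variant via truncated SVD; candidates
    are rounded to {0,1}^m and ranked by their distance delta_u to it."""
    m, n = D.shape
    p = D.mean(axis=1)                        # step 1: center at the mean
    P = D - p[:, None]
    U = np.linalg.svd(P, full_matrices=False)[0][:, :r - 1]  # step 2
    R = []                                    # greedy independent rows of U
    for i in range(m):
        if np.linalg.matrix_rank(U[R + [i], :]) > len(R):
            R.append(i)
        if len(R) == r - 1:
            break
    Z = U @ np.linalg.inv(U[R, :])            # step 3
    B = np.array(list(product([0, 1], repeat=r - 1)), float).T
    cand = Z @ (B - p[R][:, None]) + p[:, None]
    cand01 = (cand > 0.5).astype(float)       # step 4: round coordinates
    delta = np.linalg.norm(cand - cand01, axis=0)  # step 5: distances
    return cand01[:, np.argsort(delta)[:r]]   # step 6: r closest candidates
```

The right factor would then be obtained by least squares from the returned T, as described above.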
Alternatively, we may form a candidate pool by merging the {T^{(l)}}, l = 1, . . . , s, and then use a backward elimination scheme, in which candidates are successively dropped that yield the smallest improvement in fitting D, until r candidates are left. Apart from that, T returned by Algorithm 3 can be used for initializing the block optimization scheme of Algorithm 4 below. Algorithm 4 is akin to standard block coordinate descent schemes proposed in the matrix factorization literature, e.g. [27]. An important observation (step 3) is that the optimization of T is separable along the rows of T, so that for small r, it is feasible to perform exhaustive search over all 2^r possibilities (or to use CPLEX). However, Algorithm 4 is impractical as a stand-alone scheme, because without proper initialization it may take many iterations to converge, with each single iteration being more expensive than Algorithm 3. When initialized with the output of the latter, however, we have observed convergence of the block scheme after only a few steps.

Algorithm 4 Block optimization scheme for solving min_{T ∈ {0,1}^{m×r}, A} ‖D − TA‖²_F

1. Set k = 0 and set T^{(0)} equal to a starting value.
2. A^{(k)} ← argmin_A ‖D − T^{(k)}A‖²_F and set k = k + 1.
3. T^{(k)} ← argmin_{T ∈ {0,1}^{m×r}} ‖D − TA^{(k)}‖²_F = argmin_{{T_{i,:} ∈ {0,1}^r}} Σ_{i=1}^m ‖D_{i,:} − T_{i,:}A^{(k)}‖²_2   (9)
4. Alternate between steps 2 and 3.

4 Experiments

In Section 4.1 we demonstrate with the help of synthetic data that the approach of Section 3 performs well on noisy datasets. In the second part, we present an application to a real dataset.

4.1 Synthetic data.
Setup. We generate D = T*A* + αE, where the entries of T* are drawn i.i.d. from {0, 1} with probability 0.5, the columns of A* are drawn i.i.d. uniformly from the probability simplex and the entries of E are i.i.d. standard Gaussian. 
We let m = 1000, r = 10 and n = 2r, and let the noise level α vary along a grid starting from 0. Small sample sizes n as considered here yield more challenging problems and are motivated by the real-world application of the next subsection.

Evaluation. Each setup is run 20 times and we report averages over the following performance measures: the normalized Hamming distance ‖T* − T‖²_F/(mr) and the two RMSEs ‖T*A* − TA‖_F/(mn)^{1/2} and ‖TA − D‖_F/(mn)^{1/2}, where (T, A) denotes the output of one of the following approaches that are compared. FindVertices: our approach in Section 3. oracle: we solve problem (9) with A^{(k)} = A*. box: we run the block scheme of Algorithm 4, relaxing the integer constraint into a box constraint. Five random initializations are used and we take the result yielding the best fit, subsequently rounding the entries of T to fulfill the {0, 1}-constraints and refitting A. quad pen: as box, but a (concave) quadratic penalty λ Σ_{i,k} T_{i,k}(1 − T_{i,k}) is added to push the entries of T towards {0, 1}. D.C. programming [28] is used for the block updates of T.

Figure 3: Top: comparison against block schemes. Bottom: comparison against HOTTOPIXX. Left/Middle/Right: ‖T* − T‖²_F/(mr), ‖T*A* − TA‖_F/(mn)^{1/2} and ‖TA − D‖_F/(mn)^{1/2}, plotted against the noise level α.

Comparison to HOTTOPIXX [18]. HOTTOPIXX (HT) is a linear programming approach to NMF equipped with guarantees such as correctness in the exact case and robustness in the non-exact case as long as T is (nearly) separable (cf. Section 2.3). HT does not require T to be binary, but applies to the generic NMF problem D ≈ TA, T ∈ R^{m×r}_+ and A ∈ R^{r×n}_+. Since separability is crucial to the performance of HT, we restrict our comparison to separable T = [M; I_r], generating the entries of M i.i.d. 
from a Bernoulli distribution with parameter 0.5. For runtime reasons, we lower the dimension to m = 100. Apart from that, the experimental setup is as above. We use an implementation of HT from [29]. We first pre-normalize D to have unit row sums as required by HT, and obtain A as first output. Given A, the non-negative least squares problem $\min_{T \in \mathbb{R}^{m \times r}_+} \|D - T A\|_F^2$ is solved. The entries of T are then re-scaled to match the original scale of D, and thresholding at 0.5 is applied to obtain a binary matrix. Finally, A is re-optimized by solving the above fitting problem with respect to A in place of T. In the noisy case, HT needs a tuning parameter to be specified that depends on the noise level, and we consider a grid of 12 values for that parameter. The range of the grid is chosen based on knowledge of the noise matrix E. For each run, we pick the parameter that yields best performance in favour of HT.
Results. From Figure 3, we find that unlike the other approaches, box does not always recover $T^*$ even if the noise level $\alpha = 0$. FindVertices outperforms box and quad pen throughout. For $\alpha \leq 0.06$, its performance closely matches that of the oracle. In the separable case, our approach performs favourably as compared to HT, a natural benchmark in this setting.

4.2 Analysis of DNA methylation data.
Background. Unmixing of DNA methylation profiles is a problem of high interest in cancer research. DNA methylation is a chemical modification of the DNA occurring at specific sites, so-called CpGs. DNA methylation affects gene expression and in turn various processes such as cellular differentiation. A site is either unmethylated ('0') or methylated ('1').
DNA methylation microarrays allow one to measure the methylation level for thousands of sites. In the dataset considered here, the measurements D (the rows corresponding to sites, the columns to samples) result from a mixture of cell types. The methylation profiles of the latter are in $\{0, 1\}^m$, whereas, depending on the mixture proportions associated with each sample, the columns of D take values in $[0, 1]^m$. In other words, we have the model $D \approx T A$, with T representing the methylation of the cell types and the columns of A being elements of the probability simplex. It is often of interest to recover the mixture proportions of the samples, because e.g. specific diseases, in particular cancer, can be associated with shifts in these proportions. The matrix T is frequently unknown, and determining it experimentally is costly. Without T, however, recovering the mixing matrix A is challenging, in particular since the number of samples in typical studies is small.
Dataset. We consider the dataset studied in [9], with m = 500 CpG sites and n = 12 samples of blood cells composed of four major types (B-/T-cells, granulocytes, monocytes), i.e. r = 4. Ground truth is partially available: the proportions of the samples, denoted by $A^*$, are known.

Figure 4: Left: mixture proportions of the ground truth. Middle: mixture proportions as estimated by our method. Right: RMSEs $\|D - T A\|_F/(m n)^{1/2}$ in dependency of r.

Analysis.
We apply our approach to obtain an approximate factorization $D \approx T A$, $T \in \{0, 1\}^{m \times r}$, $A \in \mathbb{R}^{r \times n}_+$ and $A^\top 1_r = 1_n$. We first obtained T as outlined in Section 3, replacing {0, 1} by {0.1, 0.9} in order to account for measurement noise in D that slightly pushes values towards 0.5. This can be accommodated by re-scaling $\hat{T}^{01}$ in step 4 of Algorithm 3 by 0.8 and then adding 0.1. Given T, we solve the quadratic program $A = \operatorname{argmin}_{A \in \mathbb{R}^{r \times n}_+,\, A^\top 1_r = 1_n} \|D - T A\|_F^2$ and compare A to the ground truth $A^*$. In order to judge the fit as well as the matrix T returned by our method, we compute $T^* = \operatorname{argmin}_{T \in \{0,1\}^{m \times r}} \|D - T A^*\|_F^2$ as in (9). We obtain 0.025 as average mean squared difference of T and $T^*$, which corresponds to an agreement of 96 percent. Figure 4 indicates at least a qualitative agreement of $A^*$ and A. In the rightmost plot, we compare the RMSEs of our approach for different choices of r relative to the RMSE of $(T^*, A^*)$. The error curve flattens after r = 4, which suggests that with our approach, we can recover the correct number of cell types.

References

[1] P. Paatero and U. Tapper. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5:111-126, 1994.
[2] D. Lee and H. Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401:788-791, 1999.
[3] J. Ramsay and B. Silverman. Functional Data Analysis. Springer, New York, 2006.
[4] F. Bach, J. Mairal, and J. Ponce. Convex Sparse Matrix Factorization. Technical report, ENS, Paris, 2008.
[5] D. Witten, R. Tibshirani, and T. Hastie. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10:515-534, 2009.
[6] A-J. van der Veen.
Analytical Method for Blind Binary Signal Separation. IEEE Transactions on Signal Processing, 45:1078-1082, 1997.
[7] J. Liao, R. Boscolo, Y. Yang, L. Tran, C. Sabatti, and V. Roychowdhury. Network component analysis: reconstruction of regulatory signals in biological systems. PNAS, 100(26):15522-15527, 2003.
[8] S. Tu, R. Chen, and L. Xu. Transcription Network Analysis by a Sparse Binary Factor Analysis Algorithm. Journal of Integrative Bioinformatics, 9:198, 2012.
[9] E. Houseman et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13:86, 2012.
[10] A. Banerjee, C. Krumpelman, J. Ghosh, S. Basu, and R. Mooney. Model-based overlapping clustering. In KDD, 2005.
[11] E. Segal, A. Battle, and D. Koller. Decomposing gene expression into cellular processes. In Proceedings of the 8th Pacific Symposium on Biocomputing, 2003.
[12] A. Schein, L. Saul, and L. Ungar. A generalized linear model for principal component analysis of binary data. In AISTATS, 2003.
[13] A. Kaban and E. Bingham. Factorisation and denoising of 0-1 data: a variational approach. Neurocomputing, 71:2291-2308, 2008.
[14] E. Meeds, Z. Ghahramani, R. Neal, and S. Roweis. Modeling dyadic data with binary latent factors. In NIPS, 2007.
[15] Z. Zhang, C. Ding, T. Li, and X. Zhang. Binary matrix factorization with applications. In IEEE ICDM, 2007.
[16] P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. In PKDD, 2006.
[17] S. Arora, R. Ge, R. Kannan, and A. Moitra. Computing a nonnegative matrix factorization - provably. In STOC, 2012.
[18] V. Bittorf, B. Recht, C. Re, and J. Tropp. Factoring nonnegative matrices with linear programs. In NIPS, 2012.
[19] D. Donoho and V. Stodden. When does non-negative matrix factorization give a correct decomposition into parts? In NIPS, 2003.
[20] P. Erdős.
On a lemma of Littlewood and Offord. Bull. Amer. Math. Soc., 51:898-902, 1951.
[21] M. Gu and S. Eisenstat. Efficient algorithms for computing a strong rank-revealing QR factorization. SIAM Journal on Scientific Computing, 17:848-869, 1996.
[22] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.
[23] A. Odlyzko. On subspaces spanned by random selections of ±1 vectors. Journal of Combinatorial Theory A, 47:124-133, 1988.
[24] J. Kahn, J. Komlós, and E. Szemerédi. On the probability that a ±1 matrix is singular. Journal of the American Mathematical Society, 8:223-240, 1995.
[25] H. Nguyen and V. Vu. Small ball probability, inverse theorems, and applications. arXiv:1301.0019.
[26] T. Tao and V. Vu. The Littlewood-Offord problem in high dimensions and a conjecture of Frankl and Füredi. Combinatorica, 32:363-372, 2012.
[27] C.-J. Lin. Projected gradient methods for non-negative matrix factorization. Neural Computation, 19:2756-2779, 2007.
[28] P. Tao and L. An. Convex analysis approach to D.C. programming: theory, algorithms and applications. Acta Mathematica Vietnamica, pages 289-355, 1997.
[29] https://sites.google.com/site/nicolasgillis/publications.