{"title": "Matrix Completion from Fewer Entries: Spectral Detectability and Rank Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 1261, "page_last": 1269, "abstract": "The completion of low rank matrices from few entries is a task with many practical applications. We consider here two aspects of this problem: detectability, i.e. the ability to estimate the rank $r$ reliably from the fewest possible random entries, and performance in achieving small reconstruction error. We propose a spectral algorithm for these two tasks called MaCBetH (for Matrix Completion with the Bethe Hessian). The rank is estimated as the number of negative eigenvalues of the Bethe Hessian matrix, and the corresponding eigenvectors are used as initial condition for the minimization of the discrepancy between the estimated matrix and the revealed entries. We analyze the performance in a random matrix setting using results from the statistical mechanics of the Hopfield neural network, and show in particular that MaCBetH efficiently detects the rank $r$ of a large $n\\times m$ matrix from $C(r)r\\sqrt{nm}$ entries, where $C(r)$ is a constant close to $1$. We also evaluate the corresponding root-mean-square error empirically and show that MaCBetH compares favorably to other existing approaches.", "full_text": "Matrix Completion from Fewer Entries:\n\nSpectral Detectability and Rank Estimation\n\nAlaa Saade1 and Florent Krzakala1,2\n\n1 Laboratoire de Physique Statistique, CNRS & \u00c9cole Normale Sup\u00e9rieure, Paris, France.\n2Sorbonne Universit\u00e9s, Universit\u00e9 Pierre et Marie Curie Paris 06, F-75005, Paris, France\n\nInstitut de Physique Th\u00e9orique, CEA Saclay and CNRS UMR 3681, 91191 Gif-sur-Yvette, France\n\nLenka Zdeborov\u00e1\n\nAbstract\n\nThe completion of low rank matrices from few entries is a task with many practical\napplications. We consider here two aspects of this problem: detectability, i.e. 
the\nability to estimate the rank r reliably from the fewest possible random entries,\nand performance in achieving small reconstruction error. We propose a spectral\nalgorithm for these two tasks called MaCBetH (for Matrix Completion with the\nBethe Hessian). The rank is estimated as the number of negative eigenvalues of\nthe Bethe Hessian matrix, and the corresponding eigenvectors are used as initial\ncondition for the minimization of the discrepancy between the estimated matrix\nand the revealed entries. We analyze the performance in a random matrix setting\nusing results from the statistical mechanics of the Hopfield neural network, and\nshow in particular that MaCBetH efficiently detects the rank r of a large n \u00d7\nm matrix from $C(r)r\\sqrt{nm}$ entries, where C(r) is a constant close to 1. We\nalso evaluate the corresponding root-mean-square error empirically and show that\nMaCBetH compares favorably to other existing approaches.\n\nMatrix completion is the task of inferring the missing entries of a matrix given a subset of known\nentries. Typically, this is possible because the matrix to be completed has (at least approximately)\nlow rank r. This problem has witnessed a burst of activity, see e.g. [1, 2, 3], motivated by many\napplications such as collaborative filtering [1], quantum tomography [4] in physics, or the analysis\nof a covariance matrix [1]. A commonly studied model for matrix completion assumes the matrix\nto be exactly low rank, with the known entries chosen uniformly at random and observed without\nnoise. The most widely considered question in this setting is how many entries need to be revealed\nsuch that the matrix can be completed exactly in a computationally efficient way [1, 3]. 
While our\npresent paper assumes the same model, the main questions we investigate are different.\nThe first question we address is detectability: how many random entries do we need to reveal in order\nto be able to estimate the rank r reliably. This is motivated by the more generic problem of detecting\nstructure (in our case, low rank) hidden in partially observed data. It is reasonable to expect the\nexistence of a region where exact completion is hard or even impossible yet the rank estimation is\ntractable. A second question we address is what is the minimum achievable root-mean-square error\n(RMSE) in estimating the unknown elements of the matrix. In practice, even if exact reconstruction\nis not possible, having a procedure that provides a very small RMSE might be quite sufficient.\nIn this paper we propose an algorithm called MaCBetH that gives the best known empirical\nperformance for the two tasks above when the rank r is small. The rank in our algorithm is estimated as the\nnumber of negative eigenvalues of an associated Bethe Hessian matrix [5, 6], and the corresponding\neigenvectors are used as an initial condition for the local optimization of a cost function commonly\nconsidered in matrix completion (see e.g. [3]). In particular, in the random matrix setting, we show\nthat MaCBetH detects the rank of a large n \u00d7 m matrix from $C(r)r\\sqrt{nm}$ entries, where C(r) is a\nsmall constant, see Fig. 2, and C(r) \u2192 1 as r \u2192 \u221e. The RMSE is evaluated empirically and, in the\nregime close to $C(r)r\\sqrt{nm}$, compares very favorably to existing approaches such as OptSpace [3].\nThis paper is organized as follows. We define the problem and present our general approach in the\ncontext of existing work in Sec. 1. In Sec. 2 we describe our algorithm and motivate its construction\nvia a spectral relaxation of the Hopfield model of neural network. Next, in Sec. 
3 we show how the\nperformance of the proposed spectral method can be analyzed using, in part, results from spin glass\ntheory and phase transitions, and rigorous results on the spectral density of large random matrices.\nFinally, in Sec. 4 we present numerical simulations that demonstrate the efficiency of MaCBetH.\nImplementations of our algorithms in the Julia and Matlab programming languages are available at\nthe SPHINX webpage http://www.lps.ens.fr/~krzakala/WASP.html.\n\n1 Problem definition and relation to other work\nLet Mtrue be a rank-r matrix such that\n\n$$M_{\\mathrm{true}} = XY^T , \\qquad (1)$$\n\nwhere X \u2208 R^{n\u00d7r} and Y \u2208 R^{m\u00d7r} are two (unknown) tall matrices. We observe only a small\nfraction of the elements of Mtrue, chosen uniformly at random. We call E the subset of observed\nentries, and M the (sparse) matrix supported on E whose nonzero elements are the revealed entries\nof Mtrue. The aim is to reconstruct the rank r matrix Mtrue = XY^T given M. An important\nparameter which controls the difficulty of the problem is \u03f5 = $|E|/\\sqrt{nm}$. In the case of a square\nmatrix M, this is the average number of revealed entries per line or column.\nIn our numerical examples and theoretical justifications we shall generate the low rank matrix\nMtrue = XY^T using tall matrices X and Y with i.i.d. Gaussian elements; we call this the random\nmatrix setting. The MaCBetH algorithm is, however, non-parametric and does not use any\nprior knowledge about X and Y. The analysis we perform applies to the limit n \u2192 \u221e while\nm/n = \u03b1 = O(1) and r = O(1).\nThe matrix completion problem was popularized in [1], which proposed nuclear norm minimization\nas a convex relaxation of the problem. The algorithmic complexity of the associated semidefinite\nprogramming is, however, O(n^2 m^2). A low complexity procedure to solve the problem was later\nproposed by [7] and is based on singular value decomposition (SVD). 
A considerable step towards\ntheoretical understanding of matrix completion from few entries was made in [3], which proved that\nwith the use of trimming the performance of SVD-based matrix completion can be improved and an\nRMSE proportional to $\\sqrt{nr/|E|}$ can be achieved. The algorithm of [3] is referred to as OptSpace,\nand empirically it achieves state-of-the-art RMSE in the regime of very few revealed entries.\nOptSpace proceeds in three steps [3]. First, one trims the observed matrix M by setting to zero all\nrows (resp. columns) with more revealed entries than twice the average number of revealed entries\nper row (resp. per column). Second, a singular value decomposition is performed on the matrix\nand only the first r components are kept. When the rank r is unknown it is estimated as the index for\nwhich the ratio between two consecutive singular values has a minimum. Third, a local minimization\nof the discrepancy between the observed entries and the estimate is performed. The initial condition\nfor this minimization is given by the first r left and right singular vectors from the second step.\nIn this work we improve upon OptSpace by replacing the first two steps by a different spectral\nprocedure that detects the rank and provides a better initial condition for the discrepancy\nminimization. Our method leverages recent progress made in the task of detecting communities in the\nstochastic block model [8, 5] with spectral methods. Both in community detection and matrix\ncompletion, traditional spectral methods fail in the very sparse regime due to the existence of spurious\nlarge eigenvalues (or singular values) corresponding to localized eigenvectors [8, 3]. The authors\nof [8, 5, 9] showed that using the non-backtracking matrix or the closely related Bethe Hessian as\na basis for the spectral method in community detection provides reliable rank estimation and better\ninference performance. 
The present paper provides an analogous improvement for the matrix\ncompletion problem. In particular, we shall analyze the algorithm using tools from spin glass theory in\nstatistical mechanics, and show that there exists a phase transition between a phase where it is able\nto detect the rank, and a phase where it is unable to do so.\n\n2 Algorithm and motivation\n\n2.1 The MaCBetH algorithm\n\nA standard approach to the completion problem (see e.g. [3]) is to minimize the cost function\n\n$$\\min_{X,Y} \\sum_{(ij)\\in E} \\left[ M_{ij} - (XY^T)_{ij} \\right]^2 \\qquad (2)$$\n\nover X \u2208 R^{n\u00d7r} and Y \u2208 R^{m\u00d7r}. This function is non-convex, and global optimization is hard.\nOne therefore resorts to a local optimization technique with a careful choice of the initial conditions\nX0, Y0. In our method, given the matrix M, we consider a weighted bipartite undirected graph with\nadjacency matrix A \u2208 R^{(n+m)\u00d7(n+m)}\n\n$$A = \\begin{pmatrix} 0 & M \\\\ M^T & 0 \\end{pmatrix} . \\qquad (3)$$\n\nWe will refer to the graph thus defined as G. We now define the Bethe Hessian matrix H(\u03b2) \u2208\nR^{(n+m)\u00d7(n+m)} to be the matrix with elements\n\n$$H_{ij}(\\beta) = \\left( 1 + \\sum_{k\\in\\partial i} \\sinh^2 \\beta A_{ik} \\right) \\delta_{ij} - \\frac{1}{2} \\sinh(2\\beta A_{ij}) , \\qquad (4)$$\n\nwhere \u03b2 is a parameter that we will fix to a well-defined value \u03b2SG depending on the data, and \u2202i\nstands for the neighbors of i in the graph G. Expression (4) corresponds to the matrix introduced in\n[5], applied to the case of graphical model (6). The MaCBetH algorithm that is the main subject of\nthis paper is then, given the matrix A, which we assume to be centered:\nAlgorithm (MaCBetH)\n\n1. Numerically solve for the value of \u02c6\u03b2SG such that F ( \u02c6\u03b2SG) = 1, where\n\n$$F(\\beta) := \\frac{1}{\\sqrt{nm}} \\sum_{(i,j)\\in E} \\tanh^2(\\beta M_{ij}) . \\qquad (5)$$\n\n2. Build the Bethe Hessian H( \u02c6\u03b2SG) following eq. (4).\n3. 
Compute all its negative eigenvalues \u03bb1,\u00b7\u00b7\u00b7 , \u03bb\u02c6r and corresponding eigenvectors\nv1,\u00b7\u00b7\u00b7 , v\u02c6r. \u02c6r is our estimate for the rank r. Set X0 (resp. Y0) to be the first n lines\n(resp. the last m lines) of the matrix [v1 v2 \u00b7\u00b7\u00b7 v\u02c6r].\n\n4. Perform local optimization of the cost function (2) with rank \u02c6r and initial condition X0, Y0.\n\nIn step 1, \u02c6\u03b2SG is an approximation of the optimal value of \u03b2, for which H(\u03b2) has a maximum number\nof negative eigenvalues (see section 3). Instead of this approximation, \u03b2 can be chosen in such a\nway as to maximize the number of negative eigenvalues. We however observed numerically that\nthe algorithm is robust to some imprecision on the value of \u02c6\u03b2SG. In step 2 we could also use the\nnon-backtracking matrix weighted by tanh \u03b2Mij; it was shown in [5] that the spectra of the Bethe\nHessian and the non-backtracking matrix are closely related. In the next section, we will motivate\nand analyze this algorithm (in the setting where Mtrue was generated from element-wise random\nX and Y) and show that in this case MaCBetH is able to infer the rank whenever \u03f5 > \u03f5c. Fig. 1\nillustrates the spectral properties of the Bethe Hessian that justify this algorithm: the spectrum is\ncomposed of a few informative negative eigenvalues, well separated from the bulk (which remains\npositive).\nIn particular, as observed in [8, 5], it avoids the spurious eigenvalues with localized\neigenvectors that make trimming necessary in the case of [3]. This algorithm is computationally\nefficient as it is based on the eigenvalue decomposition of a sparse, symmetric matrix.\n\n2.2 Motivation from a Hopfield model\n\nWe shall now motivate the construction of the MaCBetH algorithm from a graphical model\nperspective and a spectral relaxation. 
Given the observed matrix M from the previous section, we consider\nthe following graphical model\n\n$$P(\\{s\\},\\{t\\}) = \\frac{1}{Z} \\exp\\left( \\beta \\sum_{(i,j)\\in E} M_{ij} s_i t_j \\right) , \\qquad (6)$$\n\nwhere the {s_i}_{1\u2264i\u2264n} and {t_j}_{1\u2264j\u2264m} are binary variables, and \u03b2 is a parameter controlling the\nstrength of the interactions. This model is a (generalized) Hebbian Hopfield model on a bipartite\nsparse graph, and is therefore known to have r modes (up to symmetries) correlated with the lines of\nX and Y [10]. To study it, we can use the standard Bethe approximation which is widely believed\nto be exact for such problems on large random graphs [11, 12]. In this approximation the means\nE(si), E(tj) and moments E(sitj) of each variable are approximated by the parameters bi, cj and\n\u03beij that minimize the so-called Bethe free energy FBethe({bi},{cj},{\u03beij}) that reads\n\n$$F_{\\mathrm{Bethe}}(\\{b_i\\},\\{c_j\\},\\{\\xi_{ij}\\}) = - \\sum_{(i,j)\\in E} M_{ij}\\xi_{ij} + \\sum_{(i,j)\\in E} \\sum_{s_i,t_j} \\eta\\left( \\frac{1 + b_i s_i + c_j t_j + \\xi_{ij} s_i t_j}{4} \\right) + \\sum_{i=1}^{n} (1 - d_i) \\sum_{s_i} \\eta\\left( \\frac{1 + b_i s_i}{2} \\right) + \\sum_{j=1}^{m} (1 - d_j) \\sum_{t_j} \\eta\\left( \\frac{1 + c_j t_j}{2} \\right) , \\qquad (7)$$\n\nwhere \u03b7(x) := x ln x, and di, dj are the degrees of nodes i and j in the graph G. Neural network\nmodels such as eq. (6) have been extensively studied over the last decades (see e.g. [12, 13, 14, 15,\n16] and references therein) and the phenomenology, which we shall review briefly here, is well known.\nIn particular, for \u03b2 small enough, the global minimum of the Bethe free energy corresponds to the\nso-called paramagnetic state\n\n$$\\forall i, j, \\quad b_i = c_j = 0, \\quad \\xi_{ij} = \\tanh(\\beta M_{ij}) . \\qquad (8)$$\n\nAs we increase \u03b2, above a certain value \u03b2R, the model enters a retrieval phase, where the free energy\nhas local minima correlated with the factors X and Y. There are r local minima, called retrieval\nstates ({b_i^l},{c_j^l},{\u03be_ij^l}) indexed by l = 1,\u00b7\u00b7\u00b7 , r such that, in the large n, m limit,\n\n$$\\forall l = 1 \\cdots r, \\quad \\frac{1}{n} \\sum_{i=1}^{n} X_{i,l}\\, b_i^l > 0, \\quad \\frac{1}{m} \\sum_{j=1}^{m} Y_{j,l}\\, c_j^l > 0 . \\qquad (9)$$\n\nThese retrieval states are therefore convenient initial conditions for the local optimization of eq. (2),\nand we expect their number to tell us the correct rank. Increasing \u03b2 above a critical value \u03b2SG the\nsystem eventually enters a spin glass phase, marked by the appearance of many spurious minima.\nIt would be tempting to continue the Bethe approach leading to belief propagation, but we shall\ninstead consider a simpler spectral relaxation of the problem, following the same strategy as used\nin [5, 6] for graph clustering. First, we use the fact that the paramagnetic state (8) is always a\nstationary point of the Bethe free energy, for any value of \u03b2 [17, 18]. In order to detect the retrieval\nstates, we thus study its stability by looking for negative eigenvalues of the Hessian of the Bethe\nfree energy evaluated at the paramagnetic state (8). At this point, the elements of the Hessian\ninvolving one derivative with respect to \u03beij vanish, while the block involving two such derivatives\nis a diagonal positive definite matrix [5, 17]. The remaining part is the matrix called Bethe Hessian\nin [5] (which however considers a different graphical model than (6)). Eigenvectors corresponding\nto its negative eigenvalues are thus expected to give an approximation of the retrieval states (9). 
The\npicture presented in this section is summarized in Figure 1 and motivates the MaCBetH algorithm.\nNote that a similar approach was used in [16] to detect the retrieval states of a Hopfield model using\nthe weighted non-backtracking matrix [8], which linearizes the belief propagation equations rather\nthan the Bethe free energy, resulting in a larger, non-symmetric matrix. The Bethe Hessian, while\nmathematically closely related, is also simpler to handle in practice.\n\n3 Analysis of performance in detection\n\nWe now show how the performance of MaCBetH can be analyzed, and the spectral properties of the\nmatrix characterized using both tools from statistical mechanics and rigorous arguments.\n\nFigure 1: Spectral density of the Bethe Hessian for various values of the parameter \u03b2. Red dots\nare the result of the direct diagonalisation of the Bethe Hessian for a rank r = 5 and n = m = 10^4\nmatrix, with \u03f5 = 15 revealed entries per row on average. The black curves are the solutions of (18)\ncomputed with belief propagation on a graph of size 10^5. We isolated the 5 smallest eigenvalues,\nrepresented as small bars for convenience, and the inset is a zoom around these smallest eigenvalues.\nFor \u03b2 small enough (top plots), the Bethe Hessian is positive definite, signaling that the paramagnetic\nstate (8) is a local minimum of the Bethe free energy. As \u03b2 increases, the spectrum is shifted towards\nthe negative region and has 5 negative eigenvalues at the approximate value of \u02c6\u03b2SG = 0.12824 (to\nbe compared to \u03b2R = 0.0832 for this case) evaluated by our algorithm (lower left plot). 
These\neigenvalues, corresponding to the retrieval states (9), become positive and eventually merge in the\nbulk as \u03b2 is further increased (lower right plot), while the bulk of uninformative eigenvalues remains\nat all values of \u03b2 in the positive region.\n\n3.1 Analysis of the phase transition\n\nWe start by investigating the phase transition above which our spectral method will detect the correct\nrank. Let x_p = (x_p^l)_{1\u2264l\u2264r}, y_p = (y_p^l)_{1\u2264l\u2264r} be random vectors with the same empirical distribution\nas the lines of X and Y respectively. Using the statistical mechanics correspondence between the\nnegative eigenvalues of the Bethe Hessian and the appearance of phase transitions in model (6), we\ncan compute the values \u03b2R and \u03b2SG where instabilities towards, respectively, the retrieval states and\nthe spurious glassy states, arise. We have repeated the computations of [13, 14, 15, 16] in the case\nof model (6), using the cavity method [12]. We refer the reader interested in the technical details of\nthe statistical mechanics approach to neural networks to [14, 15, 16].\nFollowing a standard computation for locating phase transitions in the Bethe approximation (see e.g.\n[12, 19]), the stability of the paramagnetic state (8) towards these two phases can be monitored in\nterms of the two following parameters:\n\n$$\\lambda(\\beta) = \\lim_{s\\to\\infty} \\mathbb{E}\\left[ \\prod_{p=1}^{s} \\tanh^2\\left( \\beta \\sum_{l=1}^{r} x_p^l y_p^l \\right) \\tanh^2\\left( \\beta \\sum_{l=1}^{r} x_{p+1}^l y_p^l \\right) \\right]^{\\frac{1}{2s}} , \\qquad (10)$$\n\n$$\\mu(\\beta) = \\lim_{s\\to\\infty} \\mathbb{E}\\left[ \\prod_{p=1}^{s} \\tanh\\left( \\beta |x_p^1 y_p^1| + \\beta \\sum_{l=2}^{r} x_p^l y_p^l \\right) \\tanh\\left( \\beta |x_{p+1}^1 y_p^1| + \\beta \\sum_{l=2}^{r} x_{p+1}^l y_p^l \\right) \\right]^{\\frac{1}{2s}} , \\qquad (11)$$\n\nwhere the expectation is over the distribution of the vectors x_p, y_p. 
The parameter \u03bb(\u03b2) controls the\nsensitivity of the paramagnetic solution to random noise, while \u00b5(\u03b2) measures its sensitivity to a\nperturbation in the direction of a retrieval state. \u03b2SG and \u03b2R are defined implicitly as \u03f5\u03bb(\u03b2SG) = 1\nand \u03f5\u00b5(\u03b2R) = 1, i.e. the value beyond which the perturbation diverges. The existence of a retrieval\nphase is equivalent to the condition \u03b2SG > \u03b2R, so that there exists a range of values of \u03b2 where the\nretrieval states exist, but not the spurious ones. If this condition is met, by setting \u03b2 = \u03b2SG in our\nalgorithm, we ensure the presence of meaningful negative eigenvalues of the Bethe Hessian.\n\n[Figure 1 plot: panels for \u03b2 = 0.01, 0.05, 0.12824 and 0.25; horizontal axis \u03bb, vertical axis \u03c1(\u03bb); legend: direct diagonalization, BP.]\n\nWe define the critical value of \u03f5 = \u03f5c such that \u03b2SG > \u03b2R if and only if \u03f5 > \u03f5c. In general, there is\nno closed-form formula for this critical value, which is defined implicitly in terms of the functions \u03bb\nand \u00b5. We thus computed \u03f5c numerically using a population dynamics algorithm [12] and the results\nfor C(r) = \u03f5c/r are presented on Figure 2. Quite remarkably, with the definition \u03f5 = $|E|/\\sqrt{nm}$,\nthe critical value \u03f5c does not depend on the ratio m/n, only on the rank r.\nIn the limit of large \u03f5 and r it is possible to obtain a simple closed-form formula. In this\ncase the observed entries of the matrix become jointly Gaussian distributed, and uncorrelated,\nand therefore independent. Expression (10) then simplifies to\n\n$$\\lambda(\\beta) =_{r\\to\\infty} \\mathbb{E}\\left[ \\tanh^2\\left( \\beta \\sum_{l=1}^{r} x^l y^l \\right) \\right] . \\qquad (12)$$\n\nNote that the MaCBetH algorithm uses an empirical estimator F (\u03b2) \u2243 \u03f5\u03bb(\u03b2) (5) of this\nquantity to compute an approximation \u02c6\u03b2SG of \u03b2SG purely from the revealed entries. In the\nlarge r, \u03f5 regime, both \u03b2SG, \u03b2R decay to 0, so that we can further approximate\n\n$$1 = \\epsilon \\lambda(\\beta_{\\mathrm{SG}}) \\sim_{r\\to\\infty} \\epsilon r \\beta_{\\mathrm{SG}}^2\\, \\mathbb{E}[x^2]\\mathbb{E}[y^2] , \\qquad (13)$$\n\n$$1 = \\epsilon \\mu(\\beta_R) \\sim_{r\\to\\infty} \\epsilon \\beta_R \\sqrt{\\mathbb{E}[x^2]\\mathbb{E}[y^2]} , \\qquad (14)$$\n\nso that we reach the simple asymptotic expression, in the large \u03f5, r limit, that \u03f5c = r, or equivalently C(r) = 1. Interestingly, this result was\nobtained as the detectability threshold in completion of rank r = O(n) matrices from O(n^2) entries\nin the Bayes optimal setting in [20]. Notice, however, that exact completion in the setting of [20] is\nonly possible for \u03f5 > $r(m+n)/\\sqrt{nm}$: clearly detection and exact completion are different phenomena.\nThe previous analysis can be extended beyond the random setting assumption, as long as the\nempirical distribution of the entries is well defined, and the lines of X (resp. Y) are approximately\northogonal and centered. This condition is related to the standard incoherence property [1, 3].\n\nFigure 2: Location of the critical value as a function of the rank r. MaCBetH is able to estimate\nthe correct rank from $|E| > C(r)r\\sqrt{nm}$ known entries. We used a population dynamics algorithm\nwith a population of size 10^6 to compute the functions \u03bb and \u00b5 from (10,11). The dotted line is a fit\nsuggesting that C(r) \u2212 1 = O(r^{\u22123/4}).\n\n[Figure 2 plot: horizontal axis r, vertical axis C(r); legend: C(r), C(r \u2192 \u221e), 1 + 0.812 r^{\u22123/4}.]\n\n3.2 Computation of the spectral density\n\nIn this section, we show how the spectral density of the Bethe Hessian can be computed analytically\non tree-like graphs such as those generated by picking uniformly at random the observed entries of\nthe matrix XY^T. This further motivates our algorithm and in particular our choice of \u03b2 = \u02c6\u03b2SG,\nindependently of section 3. The spectral density is defined as\n\n$$\\nu(\\lambda) = \\lim_{n,m\\to\\infty} \\frac{1}{n+m} \\sum_{i=1}^{n+m} \\delta(\\lambda - \\lambda_i) , \\qquad (15)$$\n\nwhere the \u03bbi\u2019s are the eigenvalues of the Bethe Hessian. Using again the cavity method, it can be\nshown [21] that the spectral density (in which potential delta peaks have been removed) is given by\n\n$$\\nu(\\lambda) = \\lim_{n,m\\to\\infty} \\frac{1}{\\pi(n+m)} \\sum_{i=1}^{n+m} \\mathrm{Im}\\,\\Delta_i(\\lambda) , \\qquad (16)$$\n\nwhere the \u2206i are complex variables living on the vertices of the graph G, which are given by:\n\n$$\\Delta_i = \\left( -\\lambda + 1 + \\sum_{k\\in\\partial i} \\sinh^2 \\beta A_{ik} - \\sum_{l\\in\\partial i} \\frac{1}{4} \\sinh^2(2\\beta A_{il}) \\Delta_{l\\to i} \\right)^{-1} , \\qquad (17)$$\n\nwhere \u2202i is the set of neighbors of i. The \u2206i\u2192j are the (linearly stable) solution of the following\nbelief propagation recursion:\n\n$$\\Delta_{i\\to j} = \\left( -\\lambda + 1 + \\sum_{k\\in\\partial i} \\sinh^2 \\beta A_{ik} - \\sum_{l\\in\\partial i\\setminus j} \\frac{1}{4} \\sinh^2(2\\beta A_{il}) \\Delta_{l\\to i} \\right)^{-1} . \\qquad (18)$$\n\nFigure 3: Mean inferred rank as a function of \u03f5, for different sizes, averaged over 100 samples of\nn \u00d7 m XY^T matrices. The entries of X, Y are drawn from a Gaussian distribution of mean 0 and\nvariance 1. 
The theoretical transition is computed with a population dynamics algorithm (see section\n3.1). The finite size effects are considerable but consistent with the asymptotic prediction.\n\nThis formula can be derived by turning the computation of the spectral density into a marginalization\nproblem for a graphical model on the graph G and then solving it using loopy belief propagation.\nQuite remarkably, this approach leads to an asymptotically exact (and rigorous [22]) description of\nthe spectral density on Erd\u0151s-R\u00e9nyi random graphs. Solving equation (18) numerically we obtain\nthe results shown on Fig. 1: the bulk of the spectrum, in particular, is always positive.\nWe now demonstrate that for any value of \u03b2 < \u03b2SG, there exists an open set around \u03bb = 0 where\nthe spectral density vanishes. This justifies independently our choice for the parameter \u03b2. The proof\nfollows [5] and begins by noticing that $\\Delta_{i\\to j} = \\cosh^{-2}(\\beta A_{ij})$ is a fixed point of the recursion (18)\nfor \u03bb = 0. Since this fixed point is real, the corresponding spectral density is 0. Now consider\na small perturbation $\\delta_{i\\to j}$ of this solution such that $\\Delta_{i\\to j} = \\cosh^{-2}(\\beta A_{ij})\\left(1 + \\cosh^{-2}(\\beta A_{ij})\\,\\delta_{i\\to j}\\right)$.\nThe linearized version of (18) writes $\\delta_{i\\to j} = \\sum_{l\\in\\partial i\\setminus j} \\tanh^2(\\beta A_{il})\\,\\delta_{l\\to i}$. The linear operator thus\ndefined is a weighted version of the non-backtracking matrix of [8]. Its spectral radius is given by\n\u03c1 = \u03f5\u03bb(\u03b2), where \u03bb is defined in (10). In particular, for \u03b2 < \u03b2SG, \u03c1 < 1, so that a straightforward\napplication [5] of the implicit function theorem shows that there exists a neighborhood U of\n0 such that for any \u03bb \u2208 U, there exists a real, linearly stable fixed point of (18), yielding a spectral\ndensity equal to 0. At \u03b2 = \u02c6\u03b2SG, the informative eigenvalues (those outside of the bulk) are therefore\nexactly the negative ones, which motivates independently our algorithm.\n\n4 Numerical tests\n\nFigure 3 illustrates the ability of the Bethe Hessian to infer the rank above the critical value \u03f5c in\nthe limit of large size n, m (see section 3.1). In Figure 4, we demonstrate the suitability of the\neigenvectors of the Bethe Hessian as starting point for the minimization of the cost function (2). We\ncompare the final RMSE achieved on the reconstructed matrix XY^T with 4 other initializations of\nthe optimization, including the largest singular vectors of the trimmed matrix M [3]. MaCBetH\nsystematically outperforms all the other choices of initial conditions, providing a better initial condition\nfor the optimization of (2). Remarkably, the performance achieved by MaCBetH with the inferred\nrank is essentially the same as the one achieved with an oracle rank. By contrast, estimating the\ncorrect rank from the (trimmed) SVD is more challenging. We note that for the choice of parameters\nwe consider, trimming had a negligible effect. Along the same lines, OptSpace [3] uses a different\nminimization procedure, but from our tests we could not see any difference in performance due to\nthat. When using Alternating Least Squares [23, 24] as optimization method, we also obtained a\nsimilar improvement in reconstruction by using the eigenvectors of the Bethe Hessian, instead of the\nsingular vectors of M, as initial condition.\n\n[Figure 3 plot: panels for rank 3 and rank 10; horizontal axis \u03f5, vertical axis mean inferred rank; curves for n = m = 500, 2000, 8000 and 16000, and the transition \u03f5c.]\n\nFigure 4: RMSE as a function of the number of revealed entries per row \u03f5: comparison between\ndifferent initializations for the optimization of the cost function (2). 
The top row shows the\nprobability that the achieved RMSE is smaller than 10^{\u22121}, while the bottom row shows the probability that\nthe final RMSE is smaller than 10^{\u22128}. The probabilities were estimated as the frequency of success\nover 100 samples of matrices XY^T of size 10000 \u00d7 10000, with the entries of X, Y drawn from a\nGaussian distribution of mean 0 and variance 1. All methods optimize the cost function (2) using a\nlow storage BFGS algorithm [25], part of NLopt [26], starting from different initial conditions. The\nmaximum number of iterations was set to 1000. The initial conditions compared are MaCBetH with\noracle rank (MaCBetH OR) or inferred rank (MaCBetH IR), SVD of the observed matrix M after\ntrimming, with oracle rank (Tr-SVD OR), or inferred rank (Tr-SVD IR, note that this is equivalent\nto OptSpace [3] in this regime), and random initial conditions with oracle rank (Random OR). For\nthe Tr-SVD IR method, we inferred the rank from the SVD by looking for an index for which the\nratio between two consecutive eigenvalues is minimized, as suggested in [27].\n\n5 Conclusion\n\nIn this paper, we have presented MaCBetH, an algorithm for matrix completion that is efficient for\ntwo distinct, complementary, tasks: (i) it has the ability to estimate a finite rank r reliably from\nfewer random entries than other existing approaches, and (ii) it gives lower root-mean-square\nreconstruction errors than its competitors. 
The algorithm is built around the Bethe Hessian matrix and\nleverages both recent progress in the construction of efficient spectral methods for clustering\nof sparse networks [8, 5, 9], and the OptSpace approach [3] for matrix completion.\nThe method presented here offers a number of possible future directions, including replacing the\nminimization of the cost function by a message-passing type algorithm, the use of different neural\nnetwork models, or a more theoretical direction involving the computation of information-theoretically\noptimal transitions for detectability.\n\nAcknowledgment\n\nOur research has received funding from the European Research Council under the European Union\u2019s\n7th Framework Programme (FP/2007-2013/ERC Grant Agreement 307087-SPARCS).\n\n[Figure 4 plot: panels for rank 3 and rank 10; horizontal axis \u03f5; vertical axes P(RMSE < 10^{\u22121}) (top row) and P(RMSE < 10^{\u22128}) (bottom row); legend: Macbeth OR, Tr-SVD OR, Random OR, Macbeth IR, Tr-SVD IR.]\n\nReferences\n[1] E. J. Cand\u00e8s and B. Recht, \u201cExact matrix completion via convex optimization,\u201d Foundations of Computational mathematics, vol. 9, no. 6, pp. 717\u2013772, 2009.\n\n[2] E. J. Cand\u00e8s and T. Tao, \u201cThe power of convex relaxation: Near-optimal matrix completion,\u201d Information Theory, IEEE Transactions on, vol. 56, no. 5, pp. 2053\u20132080, 2010.\n\n[3] R. H. Keshavan, A. Montanari, and S. Oh, \u201cMatrix completion from a few entries,\u201d Information Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980\u20132998, 2010.\n\n[4] D. Gross, Y.-K. Liu, S. T. Flammia, S. Becker, and J. Eisert, \u201cQuantum state tomography via compressed sensing,\u201d Physical review letters, vol. 105, no. 15, p. 150401, 2010.\n\n[5] A. Saade, F. Krzakala, and L. 
Zdeborov\u00e1, \u201cSpectral clustering of graphs with the Bethe Hessian,\u201d in Advances in Neural Information Processing Systems, 2014, pp. 406\u2013414.\n[6] A. Saade, F. Krzakala, M. Lelarge, and L. Zdeborov\u00e1, \u201cSpectral detection in the censored block model,\u201d IEEE International Symposium on Information Theory (ISIT2015), to appear, 2015.\n[7] J.-F. Cai, E. J. Cand\u00e8s, and Z. Shen, \u201cA singular value thresholding algorithm for matrix completion,\u201d SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956\u20131982, 2010.\n[8] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborov\u00e1, and P. Zhang, \u201cSpectral redemption in clustering sparse networks,\u201d Proc. Natl. Acad. Sci., vol. 110, no. 52, pp. 20 935\u201320 940, 2013.\n[9] C. Bordenave, M. Lelarge, and L. Massouli\u00e9, \u201cNon-backtracking spectrum of random graphs: community detection and non-regular Ramanujan graphs,\u201d 2015, arXiv:1501.06087.\n[10] J. J. Hopfield, \u201cNeural networks and physical systems with emergent collective computational abilities,\u201d Proc. Natl. Acad. Sci., vol. 79, no. 8, pp. 2554\u20132558, 1982.\n[11] J. S. Yedidia, W. T. Freeman, and Y. Weiss, \u201cBethe free energy, Kikuchi approximations, and belief propagation algorithms,\u201d Advances in Neural Information Processing Systems, vol. 13, 2001.\n[12] M. Mezard and A. Montanari, Information, Physics, and Computation. Oxford University Press, 2009.\n[13] D. J. Amit, H. Gutfreund, and H. Sompolinsky, \u201cSpin-glass models of neural networks,\u201d Physical Review A, vol. 32, no. 2, p. 1007, 1985.\n[14] B. Wemmenhove and A. Coolen, \u201cFinite connectivity attractor neural networks,\u201d Journal of Physics A: Mathematical and General, vol. 36, no. 37, p. 9617, 2003.\n[15] I. P. Castillo and N. 
Skantzos, \u201cThe Little\u2013Hopfield model on a sparse random graph,\u201d Journal of Physics A: Mathematical and General, vol. 37, no. 39, p. 9087, 2004.\n[16] P. Zhang, \u201cNonbacktracking operator for the Ising model and its applications in systems with multiple states,\u201d Physical Review E, vol. 91, no. 4, p. 042120, 2015.\n[17] J. M. Mooij and H. J. Kappen, \u201cValidity estimates for loopy belief propagation on binary real-world networks,\u201d in Advances in Neural Information Processing Systems, 2004, pp. 945\u2013952.\n[18] F. Ricci-Tersenghi, \u201cThe Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods,\u201d J. Stat. Mech.: Th. and Exp., p. P08015, 2012.\n[19] L. Zdeborov\u00e1, \u201cStatistical physics of hard optimization problems,\u201d Acta Physica Slovaca, vol. 59, no. 3, pp. 169\u2013303, 2009.\n[20] Y. Kabashima, F. Krzakala, M. M\u00e9zard, A. Sakata, and L. Zdeborov\u00e1, \u201cPhase transitions and sample complexity in Bayes-optimal matrix factorization,\u201d 2014, arXiv:1402.1298.\n[21] T. Rogers, I. P. Castillo, R. K\u00fchn, and K. Takeda, \u201cCavity approach to the spectral density of sparse symmetric random matrices,\u201d Phys. Rev. E, vol. 78, no. 3, p. 031116, 2008.\n[22] C. Bordenave and M. Lelarge, \u201cResolvent of large random graphs,\u201d Random Structures and Algorithms, vol. 37, no. 3, pp. 332\u2013352, 2010.\n[23] P. Jain, P. Netrapalli, and S. Sanghavi, \u201cLow-rank matrix completion using alternating minimization,\u201d in Proceedings of the forty-fifth annual ACM symposium on Theory of computing. ACM, 2013, pp. 665\u2013674.\n[24] M. Hardt, \u201cUnderstanding alternating minimization for matrix completion,\u201d in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on. IEEE, 2014, pp. 651\u2013660.\n[25] D. C. Liu and J. 
Nocedal, \u201cOn the limited memory BFGS method for large scale optimization,\u201d Mathematical Programming, vol. 45, no. 1-3, pp. 503\u2013528, 1989.\n[26] S. G. Johnson, \u201cThe NLopt nonlinear-optimization package,\u201d 2014.\n[27] R. H. Keshavan, A. Montanari, and S. Oh, \u201cLow-rank matrix completion with noisy observations: a quantitative comparison,\u201d in 47th Annual Allerton Conference on Communication, Control, and Computing, 2009, pp. 1216\u20131222.", "award": [], "sourceid": 780, "authors": [{"given_name": "Alaa", "family_name": "Saade", "institution": "ENS"}, {"given_name": "Florent", "family_name": "Krzakala", "institution": "Ecole Normale Superieure Paris"}, {"given_name": "Lenka", "family_name": "Zdeborov\u00e1", "institution": "CEA"}]}